Re: Data Format and Model Design help needed

I presume this is some data for your PhD.  If so, I cannot say much but
can attempt to point you in the right direction.
With careful understanding and consideration of the data you may get some
of what you want. It is clear that you need to do some research about
this type of analysis.
There have been a couple of papers on heritability estimates from horse
races in the past couple of years.  These should provide some outline of
what you should be doing.  You should also look at some methods to deal
with longitudinal data / repeated measures.

> I wish to use ASREML to analyse a large dataset of horse race results.  I
> have previously tried to use DFREML for this dataset with unsatisfactory
> results as it seems unable to handle the volume of data and often fails to
> find results or just crashes.

I think that your model is inadequate for the pedigree and data structure.
There is nothing in either ASREML or DFREML that limits the volume of
data except physical limits of the computers being used.
> I should add that I have virtually no comprehension of advanced statistics
> and I am far from confident in my ability to use ASREML appropriately.  I
> thus apologise for the possible naivety of the questions below but I don't
> wish to misuse the product.   Accordingly if anyone is prepared to offer me
> advice privately rather than through this list I would be most grateful.

You can always ask...
> Below are approximate volumes of the data I possess.
> Total Horses:	100,000
> Horses with race results: 60,000  (the others are Sires, Dams or Damsires)

This is a major concern, as you probably have poor genetic connectedness.
How many sires?
How many dams per sire?
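
As a rough illustration of the counts asked for above, the sketch below
tallies sires, dams per sire and progeny per sire from the race records.
The tuple layout (horse, sire, dam) and the function name are assumptions
made for this example only; they are not part of ASREML or DFREML.

```python
from collections import defaultdict

def pedigree_summary(records):
    """Summarise sire/dam structure.

    records: iterable of (horse, sire, dam) tuples. A horse may appear
    many times (once per race result); duplicates are collapsed.
    """
    dams_by_sire = defaultdict(set)
    progeny_by_sire = defaultdict(set)
    for horse, sire, dam in records:
        dams_by_sire[sire].add(dam)
        progeny_by_sire[sire].add(horse)

    n_sires = len(dams_by_sire)
    if n_sires == 0:
        return {"n_sires": 0}
    return {
        "n_sires": n_sires,
        "mean_dams_per_sire":
            sum(len(d) for d in dams_by_sire.values()) / n_sires,
        "mean_progeny_per_sire":
            sum(len(p) for p in progeny_by_sire.values()) / n_sires,
        # Sires with a single progeny add little to a sire variance estimate.
        "sires_with_one_progeny":
            sum(1 for p in progeny_by_sire.values() if len(p) == 1),
    }

# Made-up records; h1 appears twice because it has two race results.
example = [
    ("h1", "s1", "d1"),
    ("h2", "s1", "d2"),
    ("h3", "s2", "d3"),
    ("h1", "s1", "d1"),
]
summary = pedigree_summary(example)
print(summary)
```

Many sires with only one or two progeny, or few dams shared across sires,
would point to the weak connectedness mentioned above.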

> Individual Race results:	500,000 (there is a "result" for each horse
> in each race)
> For each race result I have the following (conceptual) data structure:
> Horse, Sire, Dam, Damsire, Age, Sex, Race Distance, Track Condition, Year,
> Score1, Score2, Score3, Score4, Score5
> where Race Distance is in metres, Track Condition consists of five
> categories of how wet/dry the track was and Year is whether the result was
> in first or second year for which I have data.
> Scores 1 through 5 are assessments of the horse's performance using
> different techniques such as time, earnings etc.

Do all horses race in the same race or the same distance?
Are all race distances approximately the same?
Are your scores standardized to avoid bias due to factors such as:
    low versus high stakes
    time depends on distance; speed is an inappropriate measure.
Are the scores ordinal or nominal? 
What is meant by age?
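
One possible form of the standardization asked about above is to express
each score relative to the other results at the same race distance.  The
sketch below does this with a simple z-score.  The field names
("distance", "time") and the choice of z-scoring within distance are
assumptions for illustration only; whether this adjustment is appropriate
depends on the scores and the model.

```python
from collections import defaultdict
from statistics import mean, pstdev

def standardize_within(results, key="distance", value="time"):
    """Return copies of the result dicts with a 'z' field: the value
    standardized within its own group (here, race distance)."""
    groups = defaultdict(list)
    for r in results:
        groups[r[key]].append(r[value])

    # Mean and population standard deviation per group.
    stats = {g: (mean(v), pstdev(v)) for g, v in groups.items()}

    out = []
    for r in results:
        m, s = stats[r[key]]
        # A group with no spread carries no information; use 0.0.
        z = (r[value] - m) / s if s > 0 else 0.0
        out.append({**r, "z": z})
    return out

# Made-up race times in seconds at two distances in metres.
races = [
    {"horse": "h1", "distance": 1200, "time": 70.0},
    {"horse": "h2", "distance": 1200, "time": 72.0},
    {"horse": "h3", "distance": 1200, "time": 74.0},
    {"horse": "h1", "distance": 2400, "time": 150.0},
    {"horse": "h4", "distance": 2400, "time": 154.0},
]
scored = standardize_within(races)
for row in scored:
    print(row["horse"], row["distance"], round(row["z"], 3))
```

After this step a fast 1200 m run and a fast 2400 m run are on the same
scale, which raw times are not.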

> It should be noted that my race results are all within one generation - no
> sires or dams have results.  Some sires do however also exist as damsires.

How does this relate to your pedigree structure, to having more than one
year of results, and to the term age?
You have only 40,000 sires, dams and damsires.  So are most dams missing?

> Desired analyses:
> 1) Heritability estimates for each technique of determining Scores

> 2) Correlation / regression analyses for the techniques of determining
> scores

> 3) Estimation of any maternal effect
Very unlikely.  Given what you have described, I doubt that the data
contain sufficient information to extract this.

> Questions:
> 1) How do I construct a datastructure (and appropriate .as file) for the
> input which supports a highly variable number of race results per horse (1 to
> about 70) plus allows the multiple regression analyses?  Do they need to be
> performed separately?
You need to determine the most appropriate model that is consistent with
the pedigree and data structures.  This will suggest the appropriate
files that you will need.
> 2) Will ASREML be able to handle this volume of data for these calculations?
> I am currently using a midrange pc with windows.  
Depends on how much real memory you have.  If you have insufficient memory,
ASREML will probably use virtual memory and be extremely slow.  Given that
DFREML ran, you are probably okay, depending on the model.

> 3) Any suggestions on how to model Race Distance?  In previous simple
> analysis of variance calculations I used distance ranges but I have been
> instructed to not do this if possible.  I don't consider that Distance is a
> fixed effect as different horses excel at different distances, although for
> assessments such as race times/ speed it clearly has a close to fixed
> effect.  

I don't understand what you want to achieve here.  I would be concerned
about the degree of balance between distance and horses.

> 4) What is the correct Pin file for calculating heritability using Sire, Dam
> and Damsire simultaneously?
> Any assistance most appreciated.
> regards
> Stuart Williamson
> PhD student
> University of Melbourne
> --
> Asreml mailinglist archive:

Regards Bruce