.po 1.05i
.ll 6.55i
7 October 1992
.pp
Here is a current version of the maximum likelihood program for estimating
quantitative genetic parameters for two populations and obtaining tests of
the null hypothesis that their G
matrices are the same. This version
provides the capability of estimating additive, dominance, and environmental
components of variance and covariance from, for example, a nested sib
design. However, it can also be used to analyse data in simpler structures
(e.g. full-sib or clonal design).
I've tried to make the program
understandable by augmenting the comments (within braces {})
throughout, but I can't claim that it is particularly "user-friendly".
I hope you at least find it usable.
.pp
The source code for the program, written in standard Pascal, is on the
diskette as pcr2.p. The program estimates
genetic and environmental variances
and covariances. It also estimates
the grand means for the traits in the two populations,
as well as additional fixed effects (e.g., block effects in the example).
It should compile without trouble under Turbo Pascal,
Berkeley Unix, or VMS. As distributed, the program compiles under Unix
or VMS. To compile under Turbo Pascal, remove the comment brackets around
the two "assign" statements at the beginning of the main program and the
"close" statement toward the end of the main program (3rd line from the end).
.pp
I haven't sent compiled versions, because you
will undoubtedly find it necessary to modify the constants at the beginning
of the programs, in order to accommodate your data. Complaints like "array
index out of bounds" indicate that the dimensions are too small.
The constants are defined as follows:
.np
r = the number of traits to be analysed. This may be fewer than the number
of traits in the dataset (see below).
.np
maxnparm = 6 * nvc. This is the maximum number of variance and
covariance components to be estimated.
.np
f = the number of "extended families". This is the number of distinct
groups of interrelated individuals.
.np
sibs = the maximum number of individuals in an extended family, including
unmeasured individuals (often the parents).
.np
hfsibs = the maximum number of individuals related through at least one parent.
.np
fsibs = the maximum number of individuals related as full sibs.
(hfsibs and fsibs will not differ in the case of full-sib or clonal data).
.np
nvc = r (r + 1)/2.
.np
endval = the integer value used to indicate the last line of data. It must
appear at the beginning of that line.
.np
missing = the value used to test for missing data. The value specified here
must be larger than the missing value used in the data, but less than any
actual data values. Thus, in the code and example file you receive,
-98 is specified in the program, and -99 is used as the
missing value in the data.
.np
pos = r * sibs.
.np
nmax = the maximum total number of individuals in the dataset.
.np
null = the missing value for parental IDs (i.e., for the parents of the
founders). This must be left set to zero.
.np
delta = the stopping criterion for the maximization. When the difference
between successive likelihoods is < delta, the program stops and prints results.
.np
maxround = the maximum number of iterations permitted. If maxround is
reached before the difference between successive likelihoods falls below
delta, the program prints a warning and the current estimates, then stops.
.np
maxfix = the maximum number of fixed factors (including population).
.np
p = the maximum number of fixed effects to be estimated, i.e., the sum of the
number of levels for each fixed factor, multiplied by the number of traits.
.pp
The data should be in a file called 'pcdata'. The input file should contain
the following information in order:
.pp
Line 1: a) A designation of the type of analysis you want, '1'
for maximum likelihood or '2' for restricted maximum likelihood.
(The former gives biased estimates with smaller variances, the latter the
converse. The latter gives estimates identical to ANOVA in the balanced
case.) b) An integer flag to indicate that the 'sibs' within a family are
clonally related: 1, if this is the case; 0 otherwise.
c) The number of characters to be analysed (may be less than the number in
the dataset). d) The number of categorical
fixed factors in the data (apart from the grand mean). If the only fixed
factor is the population effect, the number here is 1. e) A list of integers
designating which of the traits are to be entered in the analysis, referring
to the traits in the order in which they appear in the data.
.pp
Line 2: Supply hypothetical values for all the variance and covariance
components to be estimated.
They should appear in this order: V\dA\u of trait 1, Cov\dA\u of traits 1 and 2,
Cov\dA\u of traits 1 and 3, etc. V\dA\u of trait 2, Cov\dA\u
of traits 2 and 3, and so
on through the additive variances and covariances for population 1,
followed by all these components for population 2. After these come the
dominance components in the same order, and these are followed by
environmental components in the same order. If
there is one trait, then the program expects six values on this line.
In general, there are 3r(r+1) parameters, where r is the number of
characters. The values given must be such that the phenotypic variances
(V\dA\u + V\dD\u + V\dE\u) are > or = 0 for each population,
and the absolute value of each phenotypic covariance must be less than or
equal to the square root of the product of the two phenotypic
variances. The values you choose serve as starting values for the iterative
estimation.
N.B. These values
need not all appear on the same line. In the example dataset,
each value occupies a different line. For the numbering of lines that
follows, assume a single line is used for these values.
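.pp
The ordering of starting values can be spelled out programmatically. The
sketch below is purely illustrative (the function name and labels are mine,
not part of the Pascal program); it reproduces the ordering described above
and confirms the total count of 3r(r+1):

```python
# Generate the expected order of starting values for line 2 of 'pcdata'.
# Components run additive, dominance, environmental; within each component,
# population 1 precedes population 2; within each population, the upper
# triangle of the variance/covariance matrix is listed row by row.
def line2_order(r):
    labels = []
    for comp in ("A", "D", "E"):
        for pop in (1, 2):
            for i in range(1, r + 1):
                for j in range(i, r + 1):
                    if i == j:
                        labels.append(f"V{comp} trait {i}, pop {pop}")
                    else:
                        labels.append(f"Cov{comp} traits {i},{j}, pop {pop}")
    return labels

order = line2_order(2)
print(len(order))   # 3r(r+1) = 18 values for r = 2
print(order[0])     # -> VA trait 1, pop 1
```

For a single trait, line2_order(1) lists the six values the program expects.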
.pp
Line 3: On the third line should appear the number of components you
wish to eliminate from the full model, followed by their indices, according
to the ordering given for values in line 2 (e.g., in a one-character case, if
you want to constrain V\dD\u for population 1 to 0, specify 1 3). This
permits you to explicitly constrain to zero any components of variance (and
their associated covariance components) known (from previous analyses, for
example) to take on estimates < 0. This program does not automatically
constrain the estimates to lie within the parameter space, but such a
version is in the works.
.pp
Line 4: On the fourth line should appear the number of
additive components you wish to constrain to be equal in the two populations,
followed by their indices, within the range 1 to r*(r+1)/2.
.pp
Line 5ff: The data from the pedigree begin on this line, one line for
each individual in the pedigree, including parents of the progeny
generation, whether or not they are measured. Each individual has:
.np
an ID number
(these must be consecutive through the whole dataset, and the sequence of ID
numbers must include the IDs of both parents)
.np
the ID# of the paternal parent
.np
the ID# of the maternal parent. In the case of a clonal analysis, it is
necessary to imagine that each observed individual has maternal and paternal
clonal 'parents', who are the same individual. Thus an individual
indexed 1 might have father 107 and mother 107.
.np
an integer identifier for the population (1 or 2)
.np
an integer
identifier for the level of each categorical fixed factor. The number of
columns giving fixed effect levels must match the number of fixed factors
given in line 1. (This is used only if there are fixed factors
other than 'population'). Consecutive values must be used for the levels
within a fixed factor.
.np
the phenotype(s)
in the order in which you referred to them in line 2. Individuals in two
generations must be included, whether or not the parents were measured.
(It is straightforward to extend the analysis to more generations, but this
version of the program will not properly accommodate such data.)
Every offspring must have ID#s for two parents specified, in order to
establish the pedigree. The parents' own parental ID#s, in turn, should be 0.
Where any phenotype is unmeasured on a given individual, the phenotypic
value should be -99.0. (You can change this to another value by changing
the value of the constant, 'missing', at the beginning of the program).
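.lp
A hypothetical check of one data line, following the layout above, may make
the field order concrete. This sketch is not part of the distributed program,
and the field names are mine; the Pascal program simply reads bare numbers:

```python
# Hypothetical validator for one row of 'pcdata', following the layout above:
# ID, paternal ID, maternal ID, population, fixed-factor levels, phenotypes.
MISSING = -99.0   # conventional missing-phenotype value

def check_row(row, n_fixed, n_traits):
    ident, sire, dam, pop = row[0], row[1], row[2], row[3]
    levels = row[4:4 + n_fixed]
    phenotypes = row[4 + n_fixed:]
    assert pop in (1, 2), "population identifier must be 1 or 2"
    assert len(levels) == n_fixed, "one level per fixed factor"
    assert len(phenotypes) == n_traits, "one value per trait"
    # Founders carry parental IDs of 0; offspring must name both parents.
    assert (sire == 0) == (dam == 0), "specify both parents or neither"
    return [p == MISSING for p in phenotypes]   # which traits are unmeasured

# ID 1, father 107, mother 108, population 1, block level 2, two traits,
# the second of them unmeasured:
print(check_row([1, 107, 108, 1, 2, 3.41, MISSING], n_fixed=1, n_traits=2))
# -> [False, True]
```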
.lp
After all the data, there must be a line that contains only a zero.
You can check to make sure that the sorting into families is being done
properly by examining the output from the procedure called "PrintFams",
which gives a listing of the last 20 lines of data in the order received.
In each line there will also be values for next maternal sib (NMS), and
family ID number (FamID) immediately before the value of the first character.
(Here, as above, "family" refers to the extended family, the collection of
all individuals that are related).
.pp
As noted above, this version of the program gives unconstrained estimates,
that is, estimates that are not confined to the parameter space, as true ML
(REML) estimates should be. This means you may obtain negative estimates
for variance components and estimates of covariance components that yield
genetic correlations outside the range (-1,1).
The estimates are reported (to a file named 'pcout')
in the order corresponding to line 2 above.
.pp
A word of caution about the number of characters that can or should be
analysed for comparison of G: the number of parameters to be estimated
(maxnparm)
goes up steeply with the number of traits. The number of parameters can
readily surpass the total number of observations available. When that
happens, it is not possible to obtain estimates for all parameters.
Similarly, as that point is approached, there is progressively less
information available per parameter. Thus, comparison of G for
many traits in a single analysis may be infeasible for many datasets.
(Obviously, how many is too many depends on the number of individuals and
their genetic design).
Analyses of too many traits for the data at hand will tend to have low
statistical power and may have
convergence difficulties.
This is not a drawback of likelihood, but simply a consequence of the amount
of information available in the data. If information is available on many
traits, I recommend that analyses be carried out on subsets of the traits.
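.pp
The steep growth in parameter count can be tabulated directly (an
illustrative one-liner, not part of the program):

```python
# Number of variance/covariance parameters, 3r(r+1), as traits are added.
counts = [(r, 3 * r * (r + 1)) for r in range(1, 7)]
for r, n in counts:
    print(f"r = {r}: {n} parameters")
# r = 1 gives 6 parameters; by r = 6 there are already 126.
```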
.pp
Included on the diskette are four files, including this one (progdoc.pcr).
The source code of the
ML analysis program is pcr2.p. There is a sample set of
data (pcdata), and an output file for the analysis of those data
(pcout). Additional analyses in which particular components are constrained are
given in the file pc2tests. The example dataset is that for tiller dry
weight used in Shaw and Billington (1991, Evolution 45:1287).
.nf
Ruth Shaw, Department of Botany and Plant Sciences
University of California, Riverside, CA 92521
After January 1:
Department of Ecology, Evolution, and Behavior
University of Minnesota, St. Paul, MN 55108