DNArates_1_0
Gary J. Olsen
August 23, 1993
The DNArates program takes a set of sequences and a corresponding phylogenetic
tree and produces a maximum likelihood estimate of the rate of nucleotide
substitution at each sequence position.
Input is read from standard input. The format is very much like that of the
fastDNAml program. The first line of the input file gives the number of
sequences and the number of bases per sequence. Also on this line are the
requested program option letters. Any auxiliary data required by the options
follow on subsequent lines. Either the user must specify the empirical base
frequencies (F) option, or immediately preceding the data matrix there must be
a line of data with the frequencies of A, C, G and T. Next, the program
expects a data matrix. The first 10 characters of the first line of data for a
given sequence are interpreted as the name (blanks are counted). Elsewhere in
the data matrix, blanks and numbers are ignored. The default data matrix
format is interleaved. If all the data for a sequence are on one input line,
then interleaved and noninterleaved are equivalent. Following the data matrix
there must be a line with the number of user-specified trees for which rates
are to be estimated (as with the U option in fastDNAml). The rest of the input
file is one or more user-specified trees with branch lengths (as with the U and
L options in fastDNAml).
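The input layout described above can be assembled mechanically. The following
Python sketch is only an illustration and is not part of DNArates: the function
name build_dnarates_input and its parameters are hypothetical, each sequence is
written on a single line (so interleaved and noninterleaved are equivalent),
and the base frequency line is assumed to be four whitespace-separated numbers.

```python
def build_dnarates_input(sequences, trees, options="F", aux_lines=(),
                         base_freqs=None):
    """Assemble DNArates-style input text (hypothetical helper).

    sequences  -- list of (name, bases) pairs, all of the same length
    trees      -- list of user trees with branch lengths, one string each
    options    -- option letters for the first line, e.g. "FC"
    aux_lines  -- auxiliary data lines required by the options, e.g. ["C 9"]
    base_freqs -- frequencies of A, C, G and T, needed when F is not given
    """
    n_sites = len(sequences[0][1])
    if any(len(bases) != n_sites for _, bases in sequences):
        raise ValueError("all sequences must have the same number of bases")

    # First line: number of sequences, number of bases, option letters.
    lines = [" ".join([str(len(sequences)), str(n_sites), *options])]
    # Auxiliary data for the options follow on subsequent lines.
    lines.extend(aux_lines)

    # Without F, a frequency line must immediately precede the data matrix.
    # The number format used here is an assumption.
    if "F" not in options:
        if base_freqs is None:
            raise ValueError("give the F option or explicit base frequencies")
        lines.append(" ".join(f"{f:.5f}" for f in base_freqs))

    # The name occupies exactly the first 10 characters (blanks are counted).
    for name, bases in sequences:
        lines.append(f"{name[:10]:<10}{bases}")

    # Number of user trees, then the trees themselves.
    lines.append(str(len(trees)))
    lines.extend(trees)
    return "\n".join(lines) + "\n"
```

The returned text would then be supplied to the program on standard input.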
The program writes to standard output. The output lists the estimated rate of
change at every site in the sequence, or "Undefined" if there are not
sufficient unambiguous data at the site.
If the C option is specified, the program also categorizes the rates into the
requested number of categories. The current categorization algorithm is rather
crude, but is probably adequate if the number of categories is large enough. A
weighting mask is also created in which sites with Undefined rates are assigned
a weight of zero.
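For readers who want to post-process the output, the following Python sketch
reads the site and rate table and derives a weighting mask in which Undefined
sites get a weight of zero. It is an illustration only: the function names are
hypothetical, and it assumes an Undefined site appears as a row whose second
field is the word "Undefined" and that the table ends at the "Weights" line.

```python
def parse_site_rates(output_text):
    """Return per-site rates in site order; None marks an Undefined site."""
    rates = []
    in_table = False
    for line in output_text.splitlines():
        fields = line.split()
        if fields == ["Site", "Rate"]:
            # Header of the rate table.
            in_table = True
        elif in_table and fields and fields[0] == "Weights":
            # The weighting mask follows the table; stop reading rates.
            break
        elif in_table and len(fields) == 2 and fields[0].isdigit():
            # Data row: site number, then a rate or "Undefined".
            rates.append(None if fields[1] == "Undefined" else float(fields[1]))
    return rates


def weight_mask(rates):
    """Weight 0 for sites with an Undefined rate, 1 otherwise."""
    return "".join("0" if rate is None else "1" for rate in rates)
```

For example, a table with rows "1 1.1327", "2 Undefined" and "3 3.5969" would
give the mask "101".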
If the Y option is specified, the program writes the weights and categories
data to a file in a format appropriate for use by fastDNAml.
Options summary:
1 - Print data. Toggles print data option (default = noprint).
C - Categorize rates. Requires auxiliary line with a C and the desired
number of categories.
F - Empirical base frequencies. Calculates base frequencies from data matrix,
rather than expecting a base frequency input line.
I - Interleave. Toggles the data interleave option (default = interleave).
L - Userlengths. This is implicit in the program, so the option is ignored.
M - Minimum informative sequences. Requires an auxiliary data line with an
M and the minimum number of sequences in which a sequence position
(alignment column) must have unambiguous information in order for the rate
at the site to be defined (default = 4).
T - Transition/transversion ratio. Requires auxiliary line with a T and
the ratio of observed transitions to transversions (default = 2.0).
U - User trees. This is implicit in the program, so the option is ignored.
W - User weights. Requires weights auxiliary data.
Y - Categories file. Writes the weights and categories to a file.
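The effect of the M option can be pictured with a small Python sketch. This is
not the program's code: the function names are hypothetical, and it assumes
that "unambiguous information" means one of the bases A, C, G, T or U, so that
ambiguity codes such as N, R and Y and gap characters do not count.

```python
UNAMBIGUOUS = set("ACGTU")  # assumption: only these count as unambiguous


def informative_counts(aligned):
    """Count unambiguous bases in each alignment column.

    aligned -- list of equal-length sequence strings, one per sequence
    """
    return [sum(base.upper() in UNAMBIGUOUS for base in column)
            for column in zip(*aligned)]


def defined_sites(aligned, minimum=4):
    """True where a column has at least `minimum` unambiguous bases."""
    return [count >= minimum for count in informative_counts(aligned)]
```

With five sequences and the default minimum of 4, a column may contain one
ambiguous or missing base and still have a defined rate; a column with two
would be reported as Undefined under these assumptions.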
The option scripts usertree, weights, n_categories and categories_file are
useful for adding the appropriate options to the input data matrix.
The option script weights_categories is useful for adding the resulting outfile
to a fastDNAml input file.
Example Input:
--------------
5 114 F C
C 9
Sequence1 ACACGGTGTCGTATCATGCTGCAGGATGCTAGACTGCGTCANATGTTCGTACTAACTGTG
Sequence2 ACGCGGTGTCGTGTCATGCTACATTATGCTAGACTGCGTCGGATGCTCGTATTGACTGCG
Sequence3 ACGCGGTGCCGTGTNATGCTGCATTATGCTCGACTGCGRCGGATGCTAGTATTGACTGCG
Sequence4 ACGCGCTGCCGTGTCATCCTACACGATGCYAGACAGCGTCAGCTGCTAGTACTGGCTGAG
Sequence5 ACGCGCTGTCGTGTCATACTGCAGGATGCTAGACTGCGTCAGCTGCTAGTACTGGCTGAG
AGCTCGATGATCGGTGACGTAGACTCAGGGGCCATGCCGCGAGTTTGCGATGCG
AGCACGGTGATCAATGACGTAGNCTCAGGRTCCACGCCGTGACTTTGTGATNCG
AGCACGATGACCGATGACGTAGACTGAGGGTCCGTGCCGCGACTTTGTGATGCG
ACCTCGGTGATTGATGACGTAGACTGCGGGTCCATGCCGCGATTTTGCGRTGCG
ACCTCGATGCTCGATGACGTAGACTGCGGGTCCATGCCGTGATTTTGCGATGCG
1
( Sequence3: 0.061772,
Sequence2: 0.053462,
( Sequence1: 0.082889,
( Sequence4: 0.067423,
Sequence5: 0.018731
): 0.087748
): 0.069398
):0.0;
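As a quick check on a user tree before running the program, the branch lengths
can be summed and compared with the "Total length of tree branches" line in
the output. The Python sketch below is an illustration, not part of DNArates;
the function name is hypothetical, and it assumes that every number following
a colon in the tree is a nonnegative branch length.

```python
import re


def total_branch_length(tree):
    """Sum every branch length (the number after each colon) in a tree."""
    lengths = re.findall(r":\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)", tree)
    return sum(float(value) for value in lengths)


example_tree = """
( Sequence3: 0.061772,
Sequence2: 0.053462,
( Sequence1: 0.082889,
( Sequence4: 0.067423,
Sequence5: 0.018731
): 0.087748
): 0.069398
):0.0;
"""

print(f"{total_branch_length(example_tree):.6f}")  # prints 0.441423
```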
Example Output:
---------------
DNArates, version 1.0.0, April 11, 1992
Portions based on Joseph Felsenstein's Nucleic acid sequence
Maximum Likelihood method, version 3.3
5 Species, 114 Sites
There must be at least 4 informative residues per column
Analyzing 41 distinct data patterns (columns)
Empirical Base Frequencies:
A 0.18570
C 0.24823
G 0.31783
T(U) 0.24823
Transition/transversion ratio = 2.000000
(Transition/transversion parameter = 1.571835)
User-defined tree:
5 taxon user-supplied tree read
Total length of tree branches = 0.441423
Site Rate
---- ---------
1 1.1327
2 1.1327
3 3.5969
4 1.1327
5 1.1327
6 4.2142
7 1.1327
8 1.1327
9 8.0384
.
.
.
111 1.1327
112 1.1327
113 1.1327
114 1.1327
Weights 1111111111 1111111111 1111111111 1111111111 1111111111 1111111111
1111111111 1111111111 1111111111 1111111111 1111111111 1111
C 9 1.13270 2.27308 3.57506 5.62280 8.84345 13.90884 21.87560
34.40559 256.00000
Categories 1131131151 1131111511 5119411111 2111311131 3121131711 1313211151
1313114112 3232111111 1111182111 3112311115 1151111311 1111
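The relation between the listed rates and the Categories line can be explored
with a short Python sketch. It rests on two assumptions that the text above
does not confirm: that the numbers following "C 9" are one representative rate
per category, and that a site belongs to the category whose rate is nearest on
a logarithmic scale. The program's own categorization algorithm may differ.

```python
import math

# Values copied from the "C 9" line of the example output.
CATEGORY_RATES = [1.13270, 2.27308, 3.57506, 5.62280, 8.84345,
                  13.90884, 21.87560, 34.40559, 256.00000]


def nearest_category(rate, category_rates):
    """1-based index of the category rate closest to `rate` on a log scale."""
    distances = [abs(math.log(rate) - math.log(c)) for c in category_rates]
    return distances.index(min(distances)) + 1


# Rates of sites 1 through 9 from the example output.
first_sites = [1.1327, 1.1327, 3.5969, 1.1327, 1.1327,
               4.2142, 1.1327, 1.1327, 8.0384]
print("".join(str(nearest_category(r, CATEGORY_RATES)) for r in first_sites))
# prints 113113115
```

Under these assumptions the result agrees with the first nine characters of
the Categories line in the example output.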