Program:
K-Estimator v5.2
Author:
Josep M. Comeron
Dep. Ecology & Evolution
Univ. of Chicago
1101 East 57th St.
Chicago, IL 60637
jcomeron@midway.uchicago.edu
GENERAL DESCRIPTION
K-Estimator is a computer program for estimating the number of nucleotide
substitutions per site (synonymous [Ks] and nonsynonymous [Ka] for coding
regions, and overall [K] for noncoding regions) and the confidence intervals
of these estimates obtained by Monte-Carlo simulations.
References:
Comeron, J.M. (1999) K-Estimator: Calculation of the number of nucleotide
substitutions per site and the confidence intervals.
Bioinformatics (in press)
Comeron, J.M. (1995) A method for estimating the numbers of synonymous
and nonsynonymous substitutions per site.
J. Mol. Evol. 41: 1152-1159.
The program is written in Microsoft Visual Basic 5.0 and runs on any IBM
PC-compatible computer under Windows 95/98 or Windows NT.
Kest52.exe is a self-extracting file that will auto launch the SetUp.
DETAILED DESCRIPTION
K-Estimator is a Windows program written in Visual Basic 5.0 (Microsoft (c))
and can run on any IBM compatible computer under Windows 95/98 or Windows NT.
The program accepts several multiple-sequence formats of already aligned
nucleotide sequences (ASCII files): Clustal W (Thompson et al., 1994), PHYLIP
(Felsenstein, 1993), MSF(PileUp)/GCG (Devereux et al., 1984), GDE (S. Smith,
Harvard University Genome Center), MEGA (Kumar et al., 1994), NBRF/PIR (Sidman
et al., 1988), LWL(91) (with or without spaces between codons). The program
imposes no limit on the length or number of sequences to be compared. For
both noncoding and coding sequences, it is possible to analyze particular
regions or to obtain results from a sliding-window analysis.
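As an illustration of the sliding-window idea only, here is a minimal Python sketch. The function names, the 0-based window coordinates, and the use of the uncorrected proportion of differences are assumptions made for this example; they do not describe how K-Estimator itself implements the analysis.

```python
def sliding_windows(length, window, step):
    """Yield (start, end) pairs, 0-based with end exclusive."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    last_start = max(length - window, 0)
    for start in range(0, last_start + 1, step):
        yield start, min(start + window, length)


def p_distance(seq_a, seq_b):
    """Proportion of differing sites; columns with a gap ('-') are skipped."""
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a != "-" and b != "-"]
    if not pairs:
        return None  # nothing comparable in this window
    return sum(a != b for a, b in pairs) / len(pairs)


def windowed_p_distance(seq_a, seq_b, window, step):
    """Uncorrected distance in each window along two aligned sequences."""
    return [(start, end, p_distance(seq_a[start:end], seq_b[start:end]))
            for start, end in sliding_windows(len(seq_a), window, step)]
```

Any of the corrected estimators could be applied to each window in place of the uncorrected distance used here.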
Divergence Estimates:
For noncoding regions, the program can estimate the overall number of
nucleotide substitutions per site (K) using several methods that correct for
multiple hits at a site: Jukes and Cantor's 1-parameter (Jukes and Cantor, 1969),
Kimura's 2-parameter (Kimura, 1980), Tajima and Nei (Tajima and Nei, 1984), and
Tajima's 1-, 2-, and 4-parameter methods (Tajima, 1993).
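For readers who want to check a value by hand, the two simplest corrections can be written in a few lines of Python. This is a sketch of the standard published formulas, not code taken from K-Estimator; the function names and the choice to return None when a correction is not applicable are assumptions of the example. Here p is the proportion of differing sites, and P and Q are the proportions of transitional and transversional differences.

```python
import math

PURINES = {"A", "G"}


def ts_tv_proportions(seq_a, seq_b):
    """Return (P, Q) for two aligned sequences, skipping non-ACGT columns."""
    n = ts = tv = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        n += 1
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            ts += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            tv += 1  # purine<->pyrimidine
    if n == 0:
        raise ValueError("no comparable sites")
    return ts / n, tv / n


def jukes_cantor(p):
    """K = -(3/4) ln(1 - 4p/3); None when the correction is not applicable."""
    if not 0 <= p < 0.75:
        return None
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(P, Q):
    """K = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q); None if not applicable."""
    a = 1 - 2 * P - Q
    b = 1 - 2 * Q
    if a <= 0 or b <= 0:
        return None
    return -0.5 * math.log(a) - 0.25 * math.log(b)
```

When transversional differences are exactly twice the transitional ones (Q = 2P), the two estimates coincide, which is a quick sanity check on any implementation.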
When coding regions are under analysis, K-Estimator 5.0 applies the method
described in Comeron (1995) to estimate Ks and Ka. This method, a modification
of the method of Li (1993) and Pamilo and Bianchi (1993) (LPB), better
quantifies the actual number of transitions and transversions and reduces
stochastic errors (see Comeron, 1995, for details and comparison to previous
methods). Three genetic codes can be applied: Universal, Vertebrate
mitochondrial, or Drosophila mitochondrial. Furthermore, three different
options can be applied to restrict the codons that are under analysis: 1)
Maximum one substitution per codon (analyzes only those codons with no or only
one substitution), 2) No three differences per codon (removes from the analysis
those homologous codons that differ at all three positions), and 3) Only AAs
Substitution (estimates Ks by analyzing only those homologous codons that code
for different amino acids but do not differ at all three positions).
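The three codon restrictions can be pictured as simple filters over pairs of homologous codons. The Python sketch below is one reading of the descriptions above, using the Universal code only; the function names and the decision to drop any trailing incomplete codon are assumptions of the example, not the program's own code.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Universal genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
UNIVERSAL = {a + b + c: AMINO[16 * i + 4 * j + k]
             for i, a in enumerate(BASES)
             for j, b in enumerate(BASES)
             for k, c in enumerate(BASES)}


def codon_pairs(seq_a, seq_b):
    """Split two aligned coding sequences into homologous codon pairs."""
    n = min(len(seq_a), len(seq_b)) // 3 * 3  # ignore an incomplete last codon
    a, b = seq_a.upper(), seq_b.upper()
    return [(a[i:i + 3], b[i:i + 3]) for i in range(0, n, 3)]


def n_diffs(codon_a, codon_b):
    """Number of positions (0 to 3) at which two codons differ."""
    return sum(x != y for x, y in zip(codon_a, codon_b))


def max_one_substitution(pairs):
    """Option 1: keep codons with no or only one difference."""
    return [p for p in pairs if n_diffs(*p) <= 1]


def no_three_differences(pairs):
    """Option 2: drop codons that differ at all three positions."""
    return [p for p in pairs if n_diffs(*p) < 3]


def only_aa_substitutions(pairs):
    """Option 3: keep codons coding for different amino acids,
    excluding those that differ at all three positions."""
    return [(a, b) for a, b in pairs
            if a in UNIVERSAL and b in UNIVERSAL
            and UNIVERSAL[a] != UNIVERSAL[b] and n_diffs(a, b) < 3]
```

For example, the pair TTT/TCT (Phe/Ser, one difference) passes all three filters, whereas GGG/CCC is removed by each of them.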
A file with a MEGA-compatible Distance matrix format (Lower-Left matrix) can be
obtained for any estimated Divergence value.
Confidence Intervals:
K-Estimator obtains the Confidence Intervals (C.I.) of divergence estimates
(K for noncoding regions, and Ks and Ka for coding regions) by Monte Carlo
simulations (Comeron, 1995). Computer simulations take into account the
following parameters: 1) the divergence value (K, or Ks and Ka), 2) the number
of nucleotides or codons, 3) the transition:transversion (alpha:beta)
substitution ratio, and 4) the G+C content for noncoding regions, or the amino
acid composition and the G+C content at the third position of codons for coding
regions. When alpha:beta differs from that expected under random nucleotide
substitution, the substitution pattern is biased accordingly to maintain the
original G+C percentage.
For all simulations, the number of substitutions applied in each replicate is a
Poisson-distributed random number with a mean equal to the estimated number of
substitutions (Divergence value x Number of analyzed sites). Substitutions are
randomly distributed along the sequence.
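A single replicate of this kind can be sketched in Python as follows. This is a deliberately simplified illustration: it assumes equal substitution rates among the four nucleotides, so it ignores the alpha:beta ratio and the G+C constraint described above, and the function names and the Knuth-style Poisson sampler are choices made for the example.

```python
import math
import random


def poisson(mean, rng):
    """Draw a Poisson-distributed integer (Knuth's multiplication method).

    Adequate for the moderate means used here; exp(-mean) underflows
    for very large means.
    """
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_replicate(sequence, divergence, rng):
    """Apply a Poisson number of substitutions at random sites.

    Returns (number of substitutions applied, observed proportion of
    differing sites). Repeated hits at the same site are allowed, so the
    observed proportion can be lower than substitutions / length.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    sites = list(sequence)
    n_subs = poisson(divergence * len(sites), rng)
    for _ in range(n_subs):
        pos = rng.randrange(len(sites))
        # Assumption: every other nucleotide is equally likely.
        sites[pos] = rng.choice([b for b in "ACGT" if b != sites[pos]])
    diffs = sum(a != b for a, b in zip(sequence, sites))
    return n_subs, diffs / len(sites)
```

Running many replicates and applying a multiple-hits correction to each observed proportion gives a distribution of divergence estimates for a chosen divergence value and sequence length.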
Since most methods that correct for multiple hits can give slightly biased
divergence estimates under some conditions, Monte Carlo simulations using a
number of substitutions based on these estimates could give an inaccurate C.I.
because the average simulated divergence would itself be biased. To avoid this
potential problem, K-Estimator first scans for the number of substitutions that
gives the divergence average closest to the analyzed divergence value under the
queried conditions, and then runs the final set of replicates.
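The scanning step can be thought of as a one-dimensional search. The Python sketch below shows only the selection logic; how K-Estimator actually chooses and evaluates candidate values is not described here, so the function name, the candidate grid, and the average_estimate callback (which would run a batch of trial replicates and return their mean divergence estimate) are all hypothetical.

```python
def scan_for_mean(target, candidates, average_estimate):
    """Return the candidate mean number of substitutions whose simulated
    average divergence is closest to the target divergence value.

    average_estimate(mean) is a caller-supplied function (hypothetical)
    that returns the average divergence estimate for that mean.
    """
    best = None
    for mean in candidates:
        gap = abs(average_estimate(mean) - target)
        if best is None or gap < best[0]:
            best = (gap, mean)
    if best is None:
        raise ValueError("no candidate means supplied")
    return best[1]


# Toy stand-in for a batch of trial replicates: an estimator that
# systematically recovers only 90% of the applied divergence on 100 sites.
def toy_average(mean_substitutions):
    return 0.9 * mean_substitutions / 100
```

With the toy estimator, a target divergence of 0.09 is best matched by a mean of 10 substitutions rather than the naive 9, which is the kind of adjustment the scan is meant to find.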
Confidence intervals for Ks and Ka estimates are analyzed together and can only
be obtained after estimating Ks and Ka with K-Estimator 5.1; the number of
codons, the amino acid composition (average of the two compared sequences), the
G+C content at the third position of codons, as well as the number of
synonymous and nonsynonymous substitutions, are fixed from the analyzed
sequences.
Confidence intervals are obtained directly from the null distribution of the
divergence estimates from each replicate. The program can also calculate the
exact probability of obtaining any particular divergence value (K for noncoding
regions, and Ks, Ka, and Ka/Ks for coding regions). The program also reports
the number, if any, of failed replicates, that is, replicates in which the
multiple-hits correction method was not applicable or the estimated number of
nucleotide substitutions per site was greater than 5.0.
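Reading an interval and a probability off a set of replicate estimates can be sketched in Python as below. The percentile rule, the one-sided (upper-tail) probability, and the use of None to mark replicates where the correction was not applicable are assumptions of this example, not a description of the program's internals.

```python
def summarize_replicates(replicates, observed, level=0.95, max_value=5.0):
    """Summarize simulated divergence estimates.

    replicates: list of estimates; None marks a replicate where the
    correction could not be applied. Values above max_value also count
    as failed. Returns the failed count, a percentile interval, and the
    fraction of valid replicates at least as large as `observed`.
    """
    valid = sorted(v for v in replicates if v is not None and v <= max_value)
    failed = len(replicates) - len(valid)
    if not valid:
        raise ValueError("no valid replicates")
    n = len(valid)
    # Number of replicates trimmed from each tail, never crossing the middle.
    cut = min(round((1 - level) / 2 * n), (n - 1) // 2)
    return {
        "failed": failed,
        "lower": valid[cut],
        "upper": valid[n - 1 - cut],
        "p_upper": sum(v >= observed for v in valid) / n,
    }
```

For instance, with 100 valid replicates and level=0.9, five replicates are trimmed from each tail and the interval runs from the 6th to the 95th sorted value.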
K-Estimator 5 also obtains the expected distribution of the ratio Ka/Ks that has
been classically applied to detect the action of positive selection (Ka/Ks>1).
In particular, the program simulates a condition where the number of
nonsynonymous substitutions per nonsynonymous site (Ka) is on average equal to
the value estimated for Ks. This yields the null distribution of Ks, Ka and,
more importantly, the ratio Ka/Ks.
Thus, the program can be used to test if Ka/Ks is significantly higher than 1
under the null hypothesis of Ka=Ks. As in all simulations, the number of both
synonymous and nonsynonymous substitutions is Poisson-distributed.
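The logic of the test can be illustrated with a heavily simplified Python sketch. It draws Poisson numbers of synonymous and nonsynonymous substitutions under Ka = Ks and compares the resulting ratios with an observed one. Unlike the program, it works with uncorrected substitutions per site and does not simulate codons, amino acid composition, or multiple hits; the function names, the default of 2000 replicates, and the handling of replicates with no synonymous substitutions are assumptions of the example.

```python
import math
import random


def poisson(mean, rng):
    """Poisson-distributed integer (Knuth's method; moderate means only)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def ka_ks_null_probability(observed_ratio, ks, syn_sites, nonsyn_sites,
                           replicates=2000, seed=0):
    """Fraction of null replicates (Ka = Ks) with a ratio >= observed_ratio.

    Replicates with zero synonymous substitutions are skipped because the
    ratio is undefined. Returns None if no replicate is usable.
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(replicates):
        n_syn = poisson(ks * syn_sites, rng)
        n_non = poisson(ks * nonsyn_sites, rng)  # null hypothesis: Ka = Ks
        if n_syn == 0:
            continue
        ratios.append((n_non / nonsyn_sites) / (n_syn / syn_sites))
    if not ratios:
        return None
    return sum(r >= observed_ratio for r in ratios) / len(ratios)
```

A small returned value for an observed ratio above 1 would suggest that the ratio is higher than expected when Ka and Ks are equal.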
Results of both the divergence analyses and the confidence-interval analyses
can be printed and/or saved as independent text files.
Good luck and have fun,
Josep M
September 23, 1999