Ne: Estimation of Ancestral Population sizes

© Copyright 1997 by Ziheng Yang


Introduction

This is the documentation for the program Ne, written by Ziheng Yang and distributed free of charge for academic use. The program implements the maximum likelihood methods of Takahata, Satta, and Klein (1995) and Yang (1997) for analysing sequence divergence data from multiple loci to estimate THETA = 4N*MU, where MU is the mutation (substitution) rate per nucleotide site and N is the population size of either an extant species or the common ancestor of two extant species. (The notation of Yang [1997] is used, with BLOCK CAPITALS standing for Greek letters.) Takahata et al. also considered the case of three species, but this is not yet implemented in the program. Takahata et al.'s method assumes the same mutation rate for all loci, while Yang's method allows mutation rates to vary among loci, either by using a gamma distribution or by using independently estimated relative mutation rates for different loci.

For one species, two individuals are sampled at random at each locus, and the data consist of the numbers of sites (ni) and mutations (ki) at locus i, with i = 1, 2, ..., p. The program estimates THETA = 4N*MU, where N is the population size of the analysed species.

For two species, one individual is sampled at random from each species at each locus, and the data consist of the number of sites at locus i (ni) and the number of mutations (ki) separating the two species at locus i. The two parameters that can be estimated are THETA0 = 4N0*MU, where N0 is the population size of the ancestral population, and GAMMA = TAU*MU, where TAU is the separation time of the two species. If an independent estimate of GAMMA is available, say, from phylogenetic analysis, THETA0 alone can be estimated by the program (see Yang 1997). While Yang (1997) suggested this as a possible way of reducing the sampling error in the estimated THETA0, Takahata and Satta (1997) pointed out that phylogenetic estimates of GAMMA may be biased for the very reason that ancestral polymorphism is ignored in the estimation.

In Takahata et al.'s data (which were used by Yang 1997), the number of mutations at each locus was estimated by the number of sites that differ between the two sequences, under the infinite-sites model. The data sets are included in this distribution. It is a good idea to run the example data sets and reproduce the estimates of Yang (1997) before attempting to analyse your own data. Besides this documentation, the readme file gives a brief description of the program.

Platforms and program availability

The source code (a single file, Ne.c) is written in ANSI C and should work with any ANSI C-compatible compiler. In addition, executables for PowerMac and Windows 95 are supplied.

UNIX systems. The command for compiling the program will depend on your compiler. Try cc, gcc or acc with and without the flag -lm. One of them should work.

cc -o Ne Ne.c -lm
gcc -o Ne Ne.c

Windows 95. The executable file Ne.exe is a Win32 console application.

PowerMac. The PowerMac executable Ne.PPC uses the default data file name (Ne.dat).

The program can be obtained from

  1. Berkeley at ftp://mw511.biol.berkeley.edu/pub/Ne/
  2. London at ftp://abacus.gene.ucl.ac.uk/pub/Ne
  3. Indiana molecular biology ftp site at ftp://ftp.bio.indiana.edu/molbio/evolve. Check the Incoming directory too.

Data file format

Check the example data file Ne.dat. The following data are from Takahata et al. (1995 Table 2) for a two-species analysis.

15
11855 181
1362 38
2250 31
711 9
238 3
370 1
1147 14
1071 17
216 3
89 2
62 0
619 2
362 1
383 6
100 4

The first line specifies the number of loci. Each subsequent row lists ni and ki for one locus. If externally obtained estimates of relative rates for loci are to be used, they should be listed as a third column in the data file.
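
For readers who want to check a data file before running Ne, the following C sketch reads the layout described above. It is a hypothetical reader, not the parser in Ne.c. The Locus type, the function name read_loci, and the choice to default a missing third column to a relative rate of 1.0 are assumptions made for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    double n;      /* number of sites at the locus (ni) */
    double k;      /* number of mutations at the locus (ki) */
    double rate;   /* relative rate; set to 1.0 if no third column */
} Locus;

/* Reads the number of loci, then one "ni ki [rate]" row per locus.
   Blank lines are skipped. Returns the number of loci, or -1 on a
   malformed or truncated file. On success the caller frees *loci. */
int read_loci(FILE *fp, Locus **loci)
{
    char line[256];
    int p = 0, i = 0;
    Locus *a;

    *loci = NULL;
    while (fgets(line, sizeof line, fp) != NULL) {
        int got = sscanf(line, "%d", &p);
        if (got == 1) break;          /* found the number of loci */
        if (got != EOF) return -1;    /* non-blank, non-numeric line */
    }
    if (p <= 0) return -1;

    a = malloc((size_t)p * sizeof *a);
    if (a == NULL) return -1;

    while (i < p && fgets(line, sizeof line, fp) != NULL) {
        int got = sscanf(line, "%lf %lf %lf", &a[i].n, &a[i].k, &a[i].rate);
        if (got == EOF) continue;     /* blank line */
        if (got < 2) { free(a); return -1; }
        if (got == 2) a[i].rate = 1.0;
        i++;
    }
    if (i != p) { free(a); return -1; }   /* fewer rows than declared */

    *loci = a;
    return p;
}
```

A caller would open the file with fopen("Ne.dat", "r"), pass the stream to read_loci, and compare the returned count with the number on the first line of the file.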

Executing the program

Run the program with the command Ne. It reads the default data file Ne.dat and prints the results on the screen. On UNIX or MS Windows systems, you can specify the data file name as a command line argument:

Ne <DataFileName>

The program uses a simple interactive interface that asks how you want the analysis to be done. The questions are self-explanatory. For a one-species analysis, it asks whether rates are constant among loci, approximated by a gamma distribution, or provided in the data file. For a two-species analysis, the program asks whether you want to estimate GAMMA (= TAU*MU) or specify its value so that THETA0 alone is estimated. The program then goes through a loop to perform several runs using different initial values for the maximum likelihood iteration. This is done to guard against the possible existence of local optima. The program makes use of an intuitive scaling strategy, which may not work well in some cases. Depending on your data, you may encounter floating exception errors with poor initial values. If you are unable to obtain estimates even with good initial values, contact the author.

All results are printed out on the screen, and no output file is generated by the program.

Comments and bug reports

Comments and bug reports should be directed to the author, Ziheng Yang.

References
  1. Takahata, N. 1986. An attempt to estimate the effective size of the ancestral species common to two extant species from which homologous genes are sequenced. Genetical Research 48, 187-190.
  2. Takahata, N., and Y. Satta. 1997. Evolution of the primate lineage leading to modern humans: phylogenetic and demographic inferences from DNA sequences. Proceedings of the National Academy of Sciences of the USA 94, 4811-4815.
  3. Takahata, N., Y. Satta, and J. Klein. 1995. Divergence time and population size in the lineage leading to modern humans. Theoretical Population Biology 48, 198-221.
  4. Yang, Z. 1997. On the estimation of ancestral population sizes of modern humans. Genetical Research 69, 111-116.

Last modified by Ziheng Yang on Mon Dec 8 12:36:43 GMT 1997.