From gonnet@inf.ethz.ch Mon Jun 17 08:24 EST 1991
Received: from iubio.bio.indiana.edu by cricket.bio.indiana.edu (5.64/A/UX-2.00)
id AA05262; Mon, 17 Jun 91 08:24:12 EST
From: Gaston Gonnet
Date: Mon, 17 Jun 91 15:24:22 +0200
Message-Id: <9106171324.AA13087@rutishauser.inf.ethz.ch>
To: gilbert@bio.indiana.edu
Subject: All-Against-All peptide database matching
Status: R
All Against All matching completed at ETH Zurich
The Computational Biochemistry Research Group (CBRG) is happy to
announce that a complete All-Against-All matching of an entire peptide
database has recently been completed.
An all-against-all matching begins with a peptide database, in this
case the MIPS database version 64, for each position in each sequence,
attempt a Needleman & Wunsch (dynamic programming) matching with every
other position in every other sequence. Version 64 of the MIPS database
has a total of 8,344,353 amino acids in 27,233 sequences Thus a total of
8,344,353 x 8,344,352 / 2 = 34,814,109,322,128 sequence matchings were
attempted. If each matching were to use 1 second (fast for most alignment
programs), the above computation would have taken about 1.1 million years.
Of course alternative algorithms were used.
Each of the matchings was run with the original Dayhoff matrix,
(recomputed locally to fix some minor deficiencies and to obtain it with
full accuracy, rather than truncated to integers). The penalty for dele-
tions was set to 8 per amino acid. All matches achieving a score of 80 or
higher were considered significant and were recorded. Partial matchings
were also found, that is, we do not force alignments to be on a complete
sequence, we just report all matches which reach the similarity goal within
some subsequence of the sequences. This is done to maximize the ability to
trace the genealogy of sequences, as if only a portion of a sequence is
homologous to another sequence, this may not be detectable if entire
sequences are forced to match. All these computations were done with the
DSF system (DNA Searching Facility) version 1.2, developed here at ETH
Zurich.
The all-x-all matching found of 6,502,120 matches. This required a
compounded total CPU time of 404.5 days, using five workstations locally
and two computers at the University of Waterloo in Canada. (It was quite a
challenge to keep all these computers running, mostly the ones across the
Atlantic; a time consuming babysitting job!). The actual computation
extended from Nov 15, 1990 until Mar 23, 1991. All these computers were
used for other activities in the foreground, while the all-x-all matching
was running in the background. The effective parallelism throughout all
this time was 3.16, or the same as having roughly 3 computers totally dedi-
cated to the task.
Once these matches were obtained, they had to be refined. We use the
term ``Refine'' and ``RefineShake'' from the DSF commands used to do these
operations. A Refine operation is defined as running the dynamic program-
ming algorithm (Needleman & Wunsch) from a pair of starting positions to
the point where the similarity is maximized (or the sequences are
exhausted). RefineShake consists of running the Refine from left to right
up to a maximum, then from this maximum right to left until another max-
imum, and so on until no more improvements are achieved. This process has
the effect of focusing a matching precisely over the area where it matches
best. The first phase of refining consisted of running this RefineShake
over every match previously found. As expected, some previously found
matches that were close together, when refined, converged to some match,
typically better. By doing this we reduced the number of different matches
to 2,657,849. The first phase of refining, i.e. running RefineShake over
6.5 million matches, took a total of 334.4 cpu days, or 4.5 seconds per
refined match. It lasted between the end of the matching until June 9.
1991, giving an effective parallelism factor of 4.46.
The last step of refinement is to compute the PAM distance and vari-
ance of each of the matches (see the DSF tutorial for a precise definition
of PAM distance). Once the PAM number is estimated, we can ``RefineShake''
with a Dayhoff matrix computed to this PAM distance (recall that the stan-
dard Dayhoff matrix is computed for a PAM of 250, but it is easy to compute
Dayhoff matrices for any distance; again, see DSF tutorial). Iterating
this procedure until no more improvements are made focuses on the best pos-
sible match at the most likely PAM distance (this is called ``Best-
PamShake'' in DSF). This last refinement is a necessary step for any
further work dealing with phylogenetic trees or homology evaluation. This
step is presently being done, and we estimate that the final results will
be ready by the first week in July 1991. For various other purposes, the
output of the first refinement phase is more than sufficient at this time.
We would like to give special thanks to Digital Equipment Corporation for
providing the bulk of the equipment to do these computations (our five
workstations in Zurich).
What do we expect to extract from this database of matches? There are
some obvious answers and some not-so-obvious ones.
(i) It is direct to compute a complete tree for any particular sequence.
Given a sequence, all the matches are found involving this sequence.
From all these matches, a phylogenetic tree is constructed, ancestral
sequences reconstructed and multiple alignments computed. All this
can be done with little computation, as all the matching is already
done and recorded. Furthermore matches can be selected according to
arbitrary criteria, for example bound the sequence length or bound the
PAM distance, or bound the cost, etc. Finally, the sets obtained this
way will be complete, in the sense that absolutely any other sequence
which would match any of the ones involved is guaranteed to be
included.
(ii) We are very interested in estimating better Dayhoff matrices. From
this database we can automatically extract samples which require
minimal human inspection/ approval. We expect to compute much more
accurate matrices, as well as being able to compute matrices for sub-
families of sequences (e.g. a matrix for mammals, or a matrix for glo-
bins, etc.).
(iii)We expect to collect enough information about deletions to be able to
propose a more accurate model of deletions and to evaluate it
appropriately. While matching with Dayhoff matrices is well substan-
tiated from the theoretical point of view, deletion models are virtu-
ally non-existent. Again, we can automatically find hundreds of exam-
ples of deletions with certain characteristics (e.g. extract all the
matches with deletions, from sequences of at least 300 amino acids
with a PAM distance not greater than 50).
(iv) We are considering publishing all the connected components of this
massive matching together with their phylogenetic trees, ancestral
sequences and multiple alignments. We have received some feedback
which indicated that this would be a worthwhile form of distribution
for this wealth of information.
(v) We are planning to study consistency and correlation between PAM dis-
tances and geological time since divergence. By having all the match-
ings done, it is easy to estimate PAM distances for pairs of species
and correlate this information to known data.
(vi) We would like to make this resource available to other researchers
working on this field, although it is not clear as to whether this is
of general interest nor how we could effectively transmit this infor-
mation.
We would like to hear some feedback, in particular about items (iv)
and (vi) above.
The all-x-all matching of an entire database is done by an algorithm
due to G.H. Gonnet and analyzed in ``All-Against-All Sequence Matching, by
R.A. Baeza-Yates and G.H. Gonnet, submitted''. This algorithm is based on
Pat trees (Patricia trees built over semi-infinite strings) and runs in
much less time than the standard dynamic programming algorithm. This tech-
nique is available in standard DSF as the command ``SelfMatch''. This
algorithm can also be applied to match an arbitrary sequence against an
entire database (using the dynamic programming algorithm and a Dayhoff
matrix) in sublinear time (usually less than 15 seconds for the MIPS data-
base). This command is also available in DSF as ``OneAllMatch''.
Prof. Gaston H. Gonnet
Wissenschaftliches Rechnen, Informatik
ETH Zurich.
gonnet@inf.ethz.ch
For further information, reprints, etc. please contact
Ms. Ricarda Oettinger
Wissenschaftliches Rechnen
ETH Zurich, 8092
Switzerland
oetting@inf.ethz.ch