From gonnet@inf.ethz.ch Mon Jun 17 08:24 EST 1991 Received: from iubio.bio.indiana.edu by cricket.bio.indiana.edu (5.64/A/UX-2.00) id AA05262; Mon, 17 Jun 91 08:24:12 EST From: Gaston Gonnet Date: Mon, 17 Jun 91 15:24:22 +0200 Message-Id: <9106171324.AA13087@rutishauser.inf.ethz.ch> To: gilbert@bio.indiana.edu Subject: All-Against-All peptide database matching Status: R All Against All matching completed at ETH Zurich The Computational Biochemistry Research Group (CBRG) is happy to announce that a complete All-Against-All matching of an entire peptide database has recently been completed. An all-against-all matching begins with a peptide database, in this case the MIPS database version 64, for each position in each sequence, attempt a Needleman & Wunsch (dynamic programming) matching with every other position in every other sequence. Version 64 of the MIPS database has a total of 8,344,353 amino acids in 27,233 sequences Thus a total of 8,344,353 x 8,344,352 / 2 = 34,814,109,322,128 sequence matchings were attempted. If each matching were to use 1 second (fast for most alignment programs), the above computation would have taken about 1.1 million years. Of course alternative algorithms were used. Each of the matchings was run with the original Dayhoff matrix, (recomputed locally to fix some minor deficiencies and to obtain it with full accuracy, rather than truncated to integers). The penalty for dele- tions was set to 8 per amino acid. All matches achieving a score of 80 or higher were considered significant and were recorded. Partial matchings were also found, that is, we do not force alignments to be on a complete sequence, we just report all matches which reach the similarity goal within some subsequence of the sequences. This is done to maximize the ability to trace the genealogy of sequences, as if only a portion of a sequence is homologous to another sequence, this may not be detectable if entire sequences are forced to match. All these computations were done with the DSF system (DNA Searching Facility) version 1.2, developed here at ETH Zurich. The all-x-all matching found of 6,502,120 matches. This required a compounded total CPU time of 404.5 days, using five workstations locally and two computers at the University of Waterloo in Canada. (It was quite a challenge to keep all these computers running, mostly the ones across the Atlantic; a time consuming babysitting job!). The actual computation extended from Nov 15, 1990 until Mar 23, 1991. All these computers were used for other activities in the foreground, while the all-x-all matching was running in the background. The effective parallelism throughout all this time was 3.16, or the same as having roughly 3 computers totally dedi- cated to the task. Once these matches were obtained, they had to be refined. We use the term ``Refine'' and ``RefineShake'' from the DSF commands used to do these operations. A Refine operation is defined as running the dynamic program- ming algorithm (Needleman & Wunsch) from a pair of starting positions to the point where the similarity is maximized (or the sequences are exhausted). RefineShake consists of running the Refine from left to right up to a maximum, then from this maximum right to left until another max- imum, and so on until no more improvements are achieved. This process has the effect of focusing a matching precisely over the area where it matches best. The first phase of refining consisted of running this RefineShake over every match previously found. As expected, some previously found matches that were close together, when refined, converged to some match, typically better. By doing this we reduced the number of different matches to 2,657,849. The first phase of refining, i.e. running RefineShake over 6.5 million matches, took a total of 334.4 cpu days, or 4.5 seconds per refined match. It lasted between the end of the matching until June 9. 1991, giving an effective parallelism factor of 4.46. The last step of refinement is to compute the PAM distance and vari- ance of each of the matches (see the DSF tutorial for a precise definition of PAM distance). Once the PAM number is estimated, we can ``RefineShake'' with a Dayhoff matrix computed to this PAM distance (recall that the stan- dard Dayhoff matrix is computed for a PAM of 250, but it is easy to compute Dayhoff matrices for any distance; again, see DSF tutorial). Iterating this procedure until no more improvements are made focuses on the best pos- sible match at the most likely PAM distance (this is called ``Best- PamShake'' in DSF). This last refinement is a necessary step for any further work dealing with phylogenetic trees or homology evaluation. This step is presently being done, and we estimate that the final results will be ready by the first week in July 1991. For various other purposes, the output of the first refinement phase is more than sufficient at this time. We would like to give special thanks to Digital Equipment Corporation for providing the bulk of the equipment to do these computations (our five workstations in Zurich). What do we expect to extract from this database of matches? There are some obvious answers and some not-so-obvious ones. (i) It is direct to compute a complete tree for any particular sequence. Given a sequence, all the matches are found involving this sequence. From all these matches, a phylogenetic tree is constructed, ancestral sequences reconstructed and multiple alignments computed. All this can be done with little computation, as all the matching is already done and recorded. Furthermore matches can be selected according to arbitrary criteria, for example bound the sequence length or bound the PAM distance, or bound the cost, etc. Finally, the sets obtained this way will be complete, in the sense that absolutely any other sequence which would match any of the ones involved is guaranteed to be included. (ii) We are very interested in estimating better Dayhoff matrices. From this database we can automatically extract samples which require minimal human inspection/ approval. We expect to compute much more accurate matrices, as well as being able to compute matrices for sub- families of sequences (e.g. a matrix for mammals, or a matrix for glo- bins, etc.). (iii)We expect to collect enough information about deletions to be able to propose a more accurate model of deletions and to evaluate it appropriately. While matching with Dayhoff matrices is well substan- tiated from the theoretical point of view, deletion models are virtu- ally non-existent. Again, we can automatically find hundreds of exam- ples of deletions with certain characteristics (e.g. extract all the matches with deletions, from sequences of at least 300 amino acids with a PAM distance not greater than 50). (iv) We are considering publishing all the connected components of this massive matching together with their phylogenetic trees, ancestral sequences and multiple alignments. We have received some feedback which indicated that this would be a worthwhile form of distribution for this wealth of information. (v) We are planning to study consistency and correlation between PAM dis- tances and geological time since divergence. By having all the match- ings done, it is easy to estimate PAM distances for pairs of species and correlate this information to known data. (vi) We would like to make this resource available to other researchers working on this field, although it is not clear as to whether this is of general interest nor how we could effectively transmit this infor- mation. We would like to hear some feedback, in particular about items (iv) and (vi) above. The all-x-all matching of an entire database is done by an algorithm due to G.H. Gonnet and analyzed in ``All-Against-All Sequence Matching, by R.A. Baeza-Yates and G.H. Gonnet, submitted''. This algorithm is based on Pat trees (Patricia trees built over semi-infinite strings) and runs in much less time than the standard dynamic programming algorithm. This tech- nique is available in standard DSF as the command ``SelfMatch''. This algorithm can also be applied to match an arbitrary sequence against an entire database (using the dynamic programming algorithm and a Dayhoff matrix) in sublinear time (usually less than 15 seconds for the MIPS data- base). This command is also available in DSF as ``OneAllMatch''. Prof. Gaston H. Gonnet Wissenschaftliches Rechnen, Informatik ETH Zurich. gonnet@inf.ethz.ch For further information, reprints, etc. please contact Ms. Ricarda Oettinger Wissenschaftliches Rechnen ETH Zurich, 8092 Switzerland oetting@inf.ethz.ch