From usenet.ucs.indiana.edu!sol.ctr.columbia.edu!zaphod.mps.ohio-state.edu!uwm.edu!biosci!daresbury!daresbury!not-for-mail Thu May 6 06:55:21 EST 1993
Article: 2292 of bionet.software
Path: usenet.ucs.indiana.edu!sol.ctr.columbia.edu!zaphod.mps.ohio-state.edu!uwm.edu!biosci!daresbury!daresbury!not-for-mail
From: mbpcr@s-crim1.dl.ac.uk (A. Parsons)
Newsgroups: bionet.software
Subject: Parallel Processors/Algorithms (Summary)
Message-ID: <1sar9sINNlml@s-crim1.dl.ac.uk>
Date: 6 May 93 11:03:24 GMT
Organization: Daresbury Lab., Warrington, U.K.
Lines: 1507
NNTP-Posting-Host: s-crim1.dl.ac.uk

NetFolk,

Well, the threads from my original post on parallel architectures in database similarity searching have all but dried up. Here is the summary (LONG = 1500 lines) including some related threads and earlier comms I had with other folk. It also features the controversial Intelligenetics survey on BLAZE.

My overall impression is that people generally feel that:

(a) using an implementation of the S&W algorithm is valuable;
(b) it delivers better quality results;
(c) the only practicable way of implementing S&W in a realistic timeframe is to use massively parallel hardware (mpsrch and BLAZE).

This is only my impression; others may come away with different ones.

PVM is an interesting idea but given a greenfield site (we don't have lots of Unix workstations networked - much less any with too many spare cycles) I think that massively parallel is the way to go. The speed aspect may be important but I think it is the quality argument that wins the day (see Chris Upton's posts).

Hope you find this valuable/useful/helpful and not too much of a waste of network bandwidth.

Many thanks to all who replied!

Tony Parsons

+----------------------------------------------------------------------------+
| Dr.Tony Parsons,             VOX : + 44 304 616871                         |
| Information Technology,      FAX : + 44 304 616670                         |
| Pfizer Central Research,                                                   |
| Ramsgate Road,  +------------------------- e-mail -------------------------+
| Sandwich,       |JANET      : parsons_a@uk.co.pcr.snd01                    |
| KENT            |Internet   : parsons_a@snd01.pcr.co.uk                    |
| CT13 9NJ UK     |Compuserve : 100064,765 PSImail:234216700127.parsons_a    |
+-----------------+----------------------------------------------------------+

begin
;--------------------------------------------------------------------------------
From: S S Sturrock

Early Correspondence from Shane Sturrock (author of Mpsrch)

Well, MPsrch will arrive on a MasPar at Harrow (HGMP-RC) probably in the next week or so; they will also be evaluating BLAZE. The EMBL installation is waiting on good network connections and an upgrade of the OS on their machine, since that MasPar was supplied by DEC.

Performance-wise, MPsrch is about two and a bit times quicker: BLAZE manages about 55 million cell updates a second, compared with 130 million on MPsrch. More to the point, although BLAZE runs a Smith-Waterman on the MasPar, the reconstruction is by FASTDB, a word-based method which is supposedly fast but is not guaranteed to produce the same alignment as the MasPar. Certainly the scores would not agree: the score is the one reported by the S/W algorithm, while the alignment is FASTDB's. On MPsrch, the reconstruction is also by full S/W and still manages to be about 10x quicker than the FASTDB method used on BLAZE, and the score will agree because the algorithm is the same.

Further, I have introduced the concept of gaps, where a gap may be one or many indels, thus giving the clumping effect of affine gaps by a simple modification to the S/W algorithm for reconstruction. This is because many paths may give the same score, but generally it is best to pick the one with the fewest gaps (although the same number of indels).

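[Editorial aside, not part of the correspondence.] The distinction between gaps and indels can be checked mechanically. The sketch below is only an illustration: the function name is made up here, and the two strings are the gapped rows of the DNA example given later in this same message. It counts an indel as a single '-' column and a gap as a maximal run of '-' columns.

```python
import re


def count_gaps_and_indels(aligned_row):
    """Return (gaps, indels) for one row of a pairwise alignment.

    An indel is one '-' column; a gap is a maximal run of
    consecutive '-' columns, whatever its length.
    """
    runs = re.findall(r"-+", aligned_row)
    return len(runs), sum(len(run) for run in runs)


# Gapped rows copied from the DNA example in this message.
standard = "acttct--------a--t-g-a---tgctggtctgat"
clumped = "acttctat---------------gatgctggtctgat"

for label, row in (("standard", standard), ("clumped", clumped)):
    gaps, indels = count_gaps_and_indels(row)
    print(label, "gaps:", gaps, "indels:", indels)
```

Both rows contain the same number of indels, so a scheme with a single per-indel penalty cannot tell them apart; only the number of gaps differs.
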
We still have a single penalty, though, since affine gaps increase the complexity; we will introduce them if anyone feels they will be an advantage over the route I have chosen.

eg: With DNA in particular -

Standard algorithm:

******        *  * * *   ************
acttatattcctctatattggatgatgctggtctgat
acttct--------a--t-g-a---tgctggtctgat

MPsrch:

********               **************
acttatattcctctatattggatgatgctggtctgat
acttctat---------------gatgctggtctgat

These two score identically, but the second is obviously the better alignment because it has only one gap rather than the five of the standard method.

--------------------------------------------------------------------------------
From: S S Sturrock

What you may be interested in is that the smallest MasPar machine is now only 55K pounds including the workstation host. This gives you a 1024 processor MasPar box with MP1 processors, upgradable to 4096 processors, either MP1 or MP2 (which would give over 26000 mips + 6 gigaflops), without the need to change anything else: just a simple board swap. MPsrch will run on this setup, and we have discussed the possibility of a package deal too, so that may be something to interest you. At these prices it seems crazy to go for anything else. On the minimum configuration you will still get 40 million cell updates a second at the moment, and it should be much more after I come back from California. Basically you will get better performance than BLAZE on a far cheaper machine. Too good to be true?

--------------------------------------------------------------------------------
Bill Pearson (Fasta fame) writes:

>Having read some of the literature that compares various algorithms,
>the articles tend to sort into two classes:
>
>    1) Theoretically based papers discussing the various mathematical
>    merits of one algorithm versus another. The authors are likely to
>    have an investment in their favorite tool, and it's not always clear
>    (to me) how the math translates to the real world
>
>    2) Very general papers which look at two or three sequences and
>    draw conclusions about algorithm strengths from a limited database.
>
>It seems to me that all the algorithms have their unique strengths and
>weaknesses- and this is going to be strongly sequence dependent. That
>is why I made the original post, in an attempt to discern the relative
>strengths of the programs for different applications.

I (with bias, no doubt) recommend the paper:

    W. R. Pearson (1991) Genomics "Searching Protein Sequence Libraries: Comparison of the Sensitivity and Selectivity of the Smith-Waterman and FASTA Algorithms" 11:635-650

It examines the performance of FASTA, BLAST, and the Smith-Waterman algorithm (which is used by BLAZE) on about 34 different superfamilies of proteins (all superfamilies with 20 or more members). The bottom line is that Smith-Waterman performs better than BLAST or FASTA, but that FASTA will perform as well as Smith-Waterman if one optimizes all the library scores (the -o option) and searches about 10X faster than Smith-Waterman (though not, of course, 10X faster than Smith-Waterman on a MasPar). A description of the -o option and the reference to the Genomics paper is provided with the FASTA documentation.

In the past 6 months I have extended these studies to 75 superfamilies and used more rigorous statistical tests, but the result is the same: if you use the -o option with FASTA, you will do as well as Smith-Waterman. If you don't, you won't.

These comments apply only to protein sequence searching. I do not believe that there is any advantage to slower, more rigorous, algorithms for DNA sequence searches. For DNA, if a significant similarity exists in a non-protein-coding sequence, BLAST or FASTA will find it as well as, or better than, Smith-Waterman.
With DNA, selectivity is much more of a problem than sensitivity and both FASTA and BLAST are more selective.

Since the protein sequence libraries are considerably smaller than the DNA sequence libraries (and will continue to be, even after entire genomes are sequenced), I believe that it is reasonable for most researchers to rely on FASTA with the -o option for protein library searches and on FASTA or BLAST for DNA searches.

There is one place where the Smith-Waterman algorithm should be used, however - in the preparation of alignments for publication. For some proteins with large variable-length loops, FASTA will not allow gaps of the appropriate size (and BLAST does not construct alignments with gaps, although it often shows several aligned regions). I am considering modifying FASTA to display rigorous Smith-Waterman alignments rather than the approximate alignments (gaps < 16 residues) it calculates currently. A Smith-Waterman implementation (SSEARCH) is available with the FASTA distribution.

Bill Pearson

--------------------------------------------------------------------------------
Subject: Genbank BLAZE user survey summary
Cc: stamm@COM.MasPar
Sender: stamm@COM.MasPar

Thank you all for your patience. Here is the survey summary as promised.

Regards,
Rich Stamm
stamm@maspar.com

============================================================
GENBANK BLAZE USER SURVEY
SUMMARY OF RESPONSES

The BLAZE email server was operated by IntelliGenetics from May to September of 1992, funded in part by IntelliGenetics, MasPar and the National Institutes of Health. A summary of the survey questions and responses follows. A total of 63 people sent responses by email.

1) When you first used BLAZE did you have any previous experience searching sequence databases with the Smith-Waterman algorithm?

   Previous experience reported:
      SMITH-WATERMAN = 7
      FASTA          = 27
      BLAST          = 20
      WORDSEARCH     = 2
      GAP            = 1
      FASTDB         = 1
      EMBL/ARGOS     = 1

   If so, what were your impressions of the algorithm?

   Impressions reported:
      FAVORABLE   = 29 (mentioned are speed, accuracy, sensitivity, ease of use, annotations, exploratory nature)
      OK          = 3
      UNFAVORABLE = 3 (unfavorable experiences were due to problems with email, software bugs, missing features)

2) Did you achieve noticeably better sensitivity using BLAZE over other search algorithms? Which ones (FASTA, BLAST, FASTDB...)?

   Reports of increased sensitivity:
      OVER FASTA                        = 15
      OVER BLAST                        = 13
      OVER FASTDB                       = 1
      OVER WORDSEARCH                   = 1
      NON-SPECIFIC INCREASE OBSERVED    = 7
      NO ASSESSMENT DONE                = 11
      SENSITIVITY INCREASE NOT OBSERVED = 13*

   *7 OF THESE 13 RESPONDENTS REPORTED THAT THEIR QUERY SEQUENCES EITHER HAD MANY STRONG MATCHES OR (NEARLY) NO MATCHES AT ALL IN THE DATABASE.

3) Did you explore the parameter space to attempt to improve the sensitivity of your BLAZE searches (i.e. vary the PAM matrix, gap penalty and gap extension penalty)? If so was the improvement from your parameter search noticeable?

      EXPLORED PARAMETER SPACE         = 9
      FOUND NOTICEABLE IMPROVEMENTS    = 5
      FOUND NO SIGNIFICANT DIFFERENCES = 3*

   *2 OF THE 3 RESPONDENTS WHO REPORTED NO SIGNIFICANT DIFFERENCES FROM VARYING THE SEARCH PARAMETERS ALSO REPORTED THAT THEY HAD USED QUERIES HAVING MANY STRONG MATCHES OR (NEARLY) NO MATCHES AT ALL IN THE DATABASE.

4) Did you notice any increased biological significance in your search results when using BLAZE?

      TOTAL REPORTED = 38

   -returning grouped superfamilies ranked by their evolutionary distance from the query
      YES = 7
   -previously insignificant hits having their scores elevated to where they became significant, when these hits actually were biologically significant
      YES = 12
   -fewer false positive and false negative hits
      YES = 10
   -finding matches with large gaps
      YES = 8

5) Do you have any suggestions on how to improve BLAZE?

   RESPONSE CATEGORIES:
      ENHANCED SEARCH FEATURES     = 7
      MORE COMPREHENSIVE DATABASES = 6
      DOCUMENTATION IMPROVEMENTS   = 6
      LOW COST ACCESS              = 6
      MISSING FEATURES/BUG FIXES   = 5
      EMAIL RELIABILITY            = 5
      OUTPUT FORMATTING            = 4
      USER INTERFACE & CONTROL     = 4

6) Are you interested in receiving product information on BLAZE and on MasPar computers to run BLAZE for your laboratory?

      YES = 36

7) Do you wish to be kept informed about significant BLAZE events?

      YES = 48

8) Would you like to receive information on how to objectively compare database searching algorithms for sensitivity?

      YES = 49

9) What other programs or algorithms do you think should be implemented and offered (either commercially or openly) on massively parallel computers?

      NEW OR ADVANCED DATABASE SEARCHING METHODS = 7
      EXISTING DATABASE SEARCHING METHODS        = 7
      MULTIPLE SEQUENCE ALIGNMENT                = 4
      DOT-MATRIX PLOTS (INTERACTIVE)             = 2
      MOLECULAR MODELING (INTERACTIVE)           = 2
      SECONDARY STRUCTURE PREDICTION (PROTEIN)   = 1
      RNA FOLDING                                = 1
      NMR SPECTROSCOPY                           = 1
      IMAGE PROCESSING                           = 1
      "MAMOT"                                    = 1

Other comments:
      Want to continue using BLAZE             = 5
      NCBI should offer BLAZE                  = 3
      Wanted access for further evaluation     = 1
      Output format was much better than FASTA = 1
      Easy to do sensitive searches            = 1
      Studying effects of varying PAM matrices = 1
      Access period was too limited            = 1
      Costly to dedicate a machine to BLAZE    = 1

GENBANK 1992 BLAZE USER SURVEY -- LISTING OF ALL QUALITATIVE COMMENTS
=====================================================================

No better sensitivity -- I still have no match for my novel gene. The Blaze algorithm seemed as sensitive. Major advantage was speed. Blaze was routinely the most sensitive. It picked up matches that were significant but very difficult to see other than by such an algorithm. Maybe fastdb could be tweaked to get better sensitivity but Blaze is easier. YES. THE SENSITIVITY WAS BETTER WHEN COMPARED TO PROGRAMS LIKE TFASTA AND WORDSEARCH FROM THE 'GCG' PACKAGE.
Yes, over FASTA and BLAST the default values gave me the results I needed. I would very much enjoy being able to continue to use it either on the Genbank or NCBI servers as an on-line service. Will this be possible in the near future? I liked the algorithm. It did the job. I did get different results sometimes and the BLAZE results were more in line with what I expected. I don't know about sensitivity. Most of the time, I did not run the same search on both FASTA and BLAZE. Generally (better) yes. I liked BLAZE better because it was fast and very sensitive for the alignments of very distantly related Drosophila vs yeast protein comparisons but that is not my usual goal here at the FDA...but when an unknown sequence comes up in a sequence it is great. very sensitive for the alignments of very distantly related Drosophila vs yeast protein comparisons... Defaults are usually the best for me unless I have a lot of time to play with the program and sequence...not really often does a bench scientist have the time/knowledge/inclination to massage alignments to that degree. A person interested in 2ndary, and tertiary structures is a different story and is more familiar with computer work than test tubes. Fewer false positives seemed to be the case. -finding matches with large gaps Yes this seemed to be better than Fasta...of course much better than blast. did not find improved sensitivity. I only ran a few searches... for unique proteins. No, but this is hard to judge. the protein I was looking for homologies to appears to be unique, under all search algorithms. I think the overall sensitivity is similar-I really don't know enough specifics to judge this very well. We were quite impressed with BLAZE. BLAZE appeared to be more sensitive than FASTA and BLAST. We noticed a small margin in improvement in biological significance, we had greater similarities amongst the bacterial sequences, with fewer false hits. revealed matches that were missed by the other programs.
greater sensitivity than FASTA & BLAST yes; fewer false positive and negative hits Yes. (better than) Fasta, Blast Previously insignificant hits elevated to a level where they became significant. YES, MUCH BETTER THAN WITH FASTA AND BLAST. -previously insignificant hits having their scores elevated to where they became significant, when these hits actually were biologically significant YES, REALLY I found blaze to be more complete and allowed me to find more homologies. I particularly liked the annotation that appeared with the search. Very useful. I did find better sensitivity than other searches... from time to time the blaze search did not come up with significant homologies found in a Blast search. There did appear to be fewer negative hits. Since I never got an alignment I'm not sure if I was impressed or not. I like the results and was impressed. we found some additional sequences with BLAZE, and we felt that some of the ones we found were significant. Other than speed, the results were pretty similar for comparable substitution matrices using otherwise default gap penalties. Yes. I've carried out some of my own tests with SSEARCH versus BLAST and FASTA. Given similar substitution matrices, the Smith-Waterman algorithm gave better performance overall... Since August, BLAST has been reporting P-values for multiple hits, and combined with the use of word-size=3 sensitivity seems improved. My earlier tests compared it to the older version of BLAST, in which BLAZE did better overall. I'm less sure now. In any case, the BLAST e-mail server uses PAM120 as a default, and this greatly limits its potential. Since FASTA (or FASTDB) tries to model the Smith-Waterman algorithm with smart heuristics, it's hard to see how these programs can do as well; certainly not at ktup=2. Mostly I used BLOSUM 60 with the default gap penalties for anything important... the matrix (PAM vs.
BLOSUM for example) had more of an effect in my tests, such as for the G-protein coupled receptors, and other families... I might be prejudiced, but I think that the substitution matrix makes more of a difference than FASTA ktup=1 vs S-W, and maybe even FASTA ktup=2 vs S-W. I would find the results of this survey interesting if people have comments about relative performance of substitution matrices using BLAZE. I already have a copy of the Brutlag et al manuscript describing your tests. your implementation of s-w seems to give comparable output to other versions i have used Yes (better than) FASTA, BLAST -returning grouped superfamilies ranked by their evolutionary distance from the query Yes -fewer false positive and false negative hits Yes My impression was that BLAZE worked very well. Blaze is great, a vast improvement. The format of the output was particularly impressive. I could not really judge if the program would generate better matches because I did not do side by side searches. Yes, I did vary the pam scores and achieved different results. My use of BLAZE was relatively limited so I am not able to provide any useful input for your survey. However, I would like to be able to continue to use BLAZE in the future and hope it will continue to be available to us. I personally like having the option of doing BLAZE searches. I find them easier and I also like the fact that a different algorithm can be used to find any sequence similarity to what I'm looking for. I was impressed with speed even via email server. Yes - but the sequences used were not designed to best show this - an Actin sequence which of course dredged all the actins - the setting of different PAM's definitely had an effect on the ordering. The other sequence was a Varicella zoster virus sequence which we knew only had two anywhere near homologous sequences. If the advertising speak is to be believed with the distantly related haemoglobins and plant leghaemoglobins then it is impressive!
Not in general, although in a single case BLAZE found a similarity which had been missed by FASTA and BLAST. I varied the PAM matrix over a broad range and found considerable differences. The example mentioned under 2) was only found with PAM250 (if I remember correctly). I did not experiment with varying penalties. I had the impression of a less clear significance threshold in BLAZE compared to BLAST. Positive and negative hits scored closer together. In general, high BLAZE scores, although not clearly separated from the main field, in most cases had a biological meaning. --Clearly fewer false positives -returning grouped superfamilies ranked by their evolutionary distance from the query sure. this is what we really want sometimes. -- finding matches with large gaps sure. No gaps introduced in alignment might be one of the disadvantages of BLAST (so it is faster?). Yes, (better than) both FASTA and BLAST formats. I varied the gap and gap extension penalties as well as the PAM matrix but observed no significant difference. My searches yielded gapped matches that otherwise might not have appeared. I would like to see BLAZE offered thru NCBI as it is the only up to date database to which I have access. our overall impression : BLAZE is significantly more sensitive than FASTA or BLAST. We would be very interested in access to a MasPar computer running BLAZE, either interactive access or via a mail server. I cannot really compare sensitivities, because I (still) do not know which alignments were significant in the first place. I was using BLAZE to try to find alignments to my (still) unknown new clone. noticed increased speed, did not assess sensitivity. The algorithm identifies useful sequences on this system very quickly. I think Blaze may be a little more sensitive than FASTA. I used a complement of FastA, Blast and Blaze. However, the search that worked the best is the Sensitive Search of Pat Argos at the EMBL.
Yes: varying the Pam matrix and providing a user programmable input is essential. -returning grouped superfamilies ranked by their evolutionary distance from the query -- that was great. -previously insignificant hits having their scores elevated to where they became significant, when these hits actually were biologically significant -- Yes, we did see improvement. I have only performed a database search once using the BLAZE program. In comparison to BLAST available through NCBI/NLM/NIH, the search took much longer (based on the amount of time required to receive the results) no more sensitive than BLAST. I did not try to optimize my search parameters for either program, however, the results in both instances were satisfactory. My favorite would have to be Blaze... I was in fact becoming quite dependent on Blaze, and am quite interested in gaining access to it again, though cost is certainly a factor. I like it more for its exploratory nature. when one is lost with no related proteins that one knows of, the ability to spin the knobs and look through long lists of mostly crap is essential. When there are close homologies in the databases it doesn't really matter, and Blast is probably best at giving an uncluttered list of family members. But I have been dealing with proteins that only have distantly related homologs in the data base and the "touchie-feelie" nature of Blaze has been very useful in pulling them out. Yes, (I varied the search parameters) quite a lot, and the ability to do that quickly was quite important. First because the default values are very poor for my particular family of proteins. I got much better results with increased Gap/length penalties, and with Blosum rather than the PAM matrix. Of course I didn't know that at first since I didn't know what my protein was related to! It was the ability to fire off a bunch of searches and see what hits were common to all of them, even though far down the list, that saved me. And some luck...
WORD might have done OK, but is too slow on the VAX to do many variations on the search (more later) and so wasn't helpful. The usefulness of this tech. can be seen in some of Steve Henikoff's work with BLOCKs. False hits change dramatically with (search parameter) conditions, while the real hits, even if weak, tend to hang in. IT APPEARED ADEQUATE. I haven't noticed better sensitivity using BLAZE over FASTA or BLAST. I may not have come up against a good test case to judge the differences. I.e., my cases were either substantially homologous or not homologous at all. This algorithm is a significant improvement over FASTA in speed, ability to account for gaps, usefulness and flexibility in the way the search data is formatted and expressed. This is hard to say. there were no matches in any of the databases for the sequences in which I was interested. far better results than the Blast program. Blaze found more significant areas of homology than blast, some of which showed other genes in the same family with similar functional areas. I can't tell - The sequences I used for the BLAZE searches did (so far) not show any significant similarity to any sequence in the databases with any of the search algorithms. Other search algorithms used were FASTA, TFASTA, BLAST, WORDSEARCH (UWGCG) in a complementary manner to convince myself that there is indeed no similar sequence in the database. Yes, I used different parameters - yet nothing significant came up. It (BLAZE) is very useful and quick to check homologies of sequences as they are sequenced in the laboratory, but only in the one direction. I didn't necessarily pick up any matches in the database that I didn't see with FASTA but the alignments achieved were extended over greater distances. It did seem to eliminate the usual false positives like gamma globulin. very pleased with the algorithm and decided to use blaze instead of blast or fasta.
The sensitivity was probably somewhat better with blaze, but we did not make a lot of comparisons between the different programs. No, we did not see any new protein matches that were not picked up by fasta. We liked the output format for alignments much better than that produced by fasta. It was quite useful and very fast. Yes. (better than) FASTDB. Yes. A grouping of related species not previously detected as such. I was hoping that BLAZE would be offered by NCBI. No noticeable increase in sensitivity compared to BLAST. My search sequence was a novel sequence, though. The algorithm seemed efficient to me. I think that many of the local homology detection programs, like blaze and blast find many good short matches that are not biologically relevant. Yes. (better than) FASTA and BLAST. I did not try changing search parameters as the default was excellent for my purposes. Blaze is far better than anything I have tried. The protein I was searching the banks for is of parasite origin, but I could trace evolutionary relationships well with this program. The length of the comparison that is displayed combined with the annotations make the program especially valuable compared to others. The sensitivity of the search of course is the main attraction. Matches were found despite large gaps. YES. Better than both FASTA and BLAST. Found matches with large gaps and improved matches with partial sequences.

--------------------------------------------------------------------------------
From: Luc Simon

In article <1p6q49INNe18@s-crim1.dl.ac.uk>, you wrote:
>
> Greetings Netfolk!,
>
> We have been trying to do a limited evaluation of the utility of Smith and
> Waterman searches on massively parallel hardware (MasPar MP-1 and MP-2) using
> both BLAZE (from Intelligenetics) and Mpsrch (from Shane Sturrock and John
> Collins at the Biocomputing Unit at Edinburgh University).
>
> I was wondering if anyone has done/intending to do something similar? More to
> the point has any independent assessment been carried out that anyone is
> intending to publish? I have seen some data from Intelligenetics which is
> very impressive but then again hardly impartial.
>
> Has anyone any experiences they feel are relevant - for example how necessary
> is the rigour of the SW algorithm when most people have been happy to get by
> with BLAST and FASTA to date? Is it only a matter of time (with the masses of
> data being generated by automatic DNA sequencers) when data parallel solutions
> will not only be desirable but essential? Is it not the case that the
> intellectual assessments of the results (from whatever method) is the rate
> limiting step in sequence analysis?
>
> Any and all responses welcome either to the group or by email - I will
> summarise and post responses if there is sufficient interest.
>
> Cheers,
>
> Tony Parsons
>
> (mbpcr@seqnet.dl.ac.uk) or preferably for mail (parsons_a@snd01.pcr.co.uk)

I was very impressed by a demonstration of the ABI Inherit system. It's a kind of massively parallel hardware in a small box (Fast Data Search). The unit demonstrated to us had 2 slots of 720 processors, with space for 3 more boards. Their proprietary software seems to implement a S&W search and produces results in 48 sec for Genbank release 74.

Since we have access to a MasPar on campus and not to the ABI box yet, I would be interested in getting more info on software that might run on that beast and be useful to us!!

Cheers,

Luc Simon
Faculte de foresterie et de geomatique
Universite Laval
Sainte-Foy, QUE G1K 7P4
E-mail: Lsimon@alnus.for.ulaval.ca
Phone: (418) 656-5496
Fax  : (418) 656-3551

--------------------------------------------------------------------------------
ko61fr@genius.embnet.dkfz-heidelberg.de
Newsgroups: bionet.software
Organization: DKFZ Heidelberg
Cc:
Status: R

I have been using BLAZE as long as it was freely available, and I felt the big disadvantage with this service was that you could not get an alignment of your query sequence to the hits. BLITZ (which is Mpsrch) includes alignments (in fact you can choose what you want, hits and/or alignments) and is therefore my preferred choice. The only problem I have with the BLITZ alignments is that sometimes single residues are aligned (i.e. a single residue with gaps on both sides); structurally this seems nonsense to me, but this is a minor problem.

I did not attempt to compare the BLAZE and BLITZ results systematically, because of the missing alignments.

Regards,
- Friedrich -

-------------------------------------------------------------------------------
Dr. Friedrich Rippmann
work: c/o E.MERCK                  home:
Pha Fo Chem, Drug Design           Schroederstrasse 72
6100 Darmstadt                     6900 Heidelberg 1
POBox 4119                         +49-6221-413366
+49-6151-726290
Email: ko61fr@genius.embnet.dkfz-heidelberg.de
Fax ... 710757
--------------------------------------------------------------------------------
From: Laurent.Mouchard@fr.crihan

I'll be interested to hear about the answers to the question you have posted in the bionet.software newsgroup.

We have a MasPar and we already checked BLAZE a couple of months ago. We are currently using blast and fasta (a lot of research teams have already written shell scripts in order to provide their users a good user interface for submitting sequences).
They consider blast and fasta good tools and they don't think they can obtain better results (I mean more reliable or more accurate results) using BLAZE; they assume it's only a matter of time (or so they think).

I think that a PVM based version of fasta and blast will speed up the results. I hope I will be able to check it in a very few days.

Regards,
Laurent.

---------------------------------------------------------------------
Laurent Mouchard (lm)             | C R I H A N
                                  | Centre de Ressources Informatiques
Phone: +33 35 59 61 49            | de Haute Normandie
Fax  : +33 35 59 61 40            | Parc Technologique de la Vatine
                                  | 32, rue Raymond Aron
Mail : Laurent.Mouchard@crihan.fr | 76130 Mont-Saint-Aignan, France
---------------------------------------------------------------------

--------------------------------------------------------------------------------
Bill Pearson (Fasta fame) writes:

Regarding FASTA/BLAST vs Smith-Waterman. I am in the process of writing up an extensive comparison of FASTA, BLAST, and Smith-Waterman and various scoring matrices. This paper will be an extension of my earlier one: Pearson (1991) Genomics "Searching Protein Sequence Libraries: Comparison of the Sensitivity and Selectivity of the Smith-Waterman and FASTA Algorithms" 11:635-650.

I feel uncomfortable giving away the punch line, since the paper has neither been written nor reviewed, but one of the conclusions is that the results of the Genomics paper - that FASTA with optimization performs as well as Smith-Waterman - will be supported with considerably more data and better statistical analyses.

I should note also, since some readers of this group may be interested, that I now have a version of our parallel "platform" for sequence comparison (Deshpande, Richards, and Pearson (1991) CABIOS "A platform for biological sequence comparison on parallel computers" 7:237-247) running on networks of workstations using PVM (parallel virtual machine), a freely available package for almost any machine.
If you are doing lots of sequence comparisons, I can provide you with PVM versions for FASTA and Smith-Waterman, with BLAST to be available in about a month.

Here are some typical timings on a network of 12 Sparc IPC's using PVM2.4 (PVM3.0 is a little slower):

        pvm2.4, 20 protein sequences vs
        annotated PIR34 (approx 10K sequences)

        nodes     11      7      3
        --------------------------
        k2        78    105    207    (times in seconds)
                  76    128    206
                                      (73.9)
        k1       310    466   1070
                 312    471   1083
                                      (94.6)
        ok1      559    836   1995
                 532    811   1898
                                      (97.3)

Smith-Waterman times are about 5X the ok1 times. The values in parentheses are the relative efficiency, in percent, of 11 nodes compared to 3 nodes. Thus k2 on 11 nodes is 11/3*.739 times faster than on 3 nodes; ok1 is 11/3*.973 times faster.

Bill Pearson
--------------------------------------------------------------------------------
doelz@urz.unibas.ch writes :

In article , wrp@cyclops.micr.Virginia.EDU (Bill Pearson) writes:
...
|>
|>      pvm2.4, 20 protein sequences vs
|>      annotated PIR34 (approx 10K sequences)
|>
|>      nodes     11      7      3
|>      --------------------
|>      k2        78    105    207    (times in seconds)
|>                76    128    206
|>                                    (73.9)
|>

The figures look great indeed. The only problem I see with pvm is that there is very little implemented with respect to the availability, accounting, security and vulnerability of the distribution. In contrast to hierarchical access systems, pvm does not intend to utilize code which is available as standalone but requires further code modification and maintenance; also, it needs to be configured with the targeted nodes and might not be as dynamic as possible if there's a biologist workstation which wants to benefit from the physics department. Last, remote login with pvm requires 'trusted', at best yellow pages or similar networked environments. This is beyond applicability in WANs, but increasingly also affects LANs with interdepartmental connections.

Regards
Reinhard
--
+----------------------------------+-------------------------------------+
| Dr.
Reinhard Doelz | RFC doelz@urz.unibas.ch |
| Biocomputing | DECNET 20579::48130::doelz |
|Biozentrum der Universitaet | X25 022846211142036::doelz |
| Klingelbergstrasse 70 | FAX x41 61 261- 6760 or 267- 2078
| CH 4056 Basel | TEL x41 61 267- 2076 or 2247 |
+------------- bioftp.unibas.ch is the SWISS EMBnet node ----------------+
-----------------------------------------
---------------------------------------------------------------------------------
Bill Pearson (Fasta fame) in response to Doug Brutlag writes:

Bill,

Isn't FASTA with optimization identical to the Smith-Waterman? The optimization step in FASTDB is precisely a Smith-Waterman scoring of the top 5,000 sequences, and hence FASTDB with optimization is a Smith-Waterman analysis on those sequences. The only difference in this case is for very remotely related sequences which do not show up in the top 5,000 sequences in the initial pass. BLAZE and MPSearch rectify this by basically doing the Smith Waterman on the entire database.

I would be interested to know how you handle the "gold-standard" problem. Whenever I try to test the sensitivity of a database search program using a particular query, I attempt to find all the members of the super family to which the query belongs. I try to use super families in which 1) there is a broad range of degrees of relationship to get a maximal test of the search sensitivity and 2) there is some independent measure of relationship other than just sequence similarity. False negatives (family members missing from the search results at some threshold) are relatively easy to handle, but false positives (sequences not known to be in the superfamily) give me problems. Although not necessarily a member of a super family, the high scoring "false positives" are often proteins that share at least a structural or functional motif with part of the query sequence and hence I cannot truly classify them as false positives.
The only way I have found around this problem is to 1) not report ROC curves or 2) use short queries which themselves are a single motif or protein domain. This seems to me to be a common problem of using a "long" query with a local alignment algorithm (SW, FASTA, FASTDB, BLAST etc.). Suggestions? Doug Brutlag ------- In article brutlag@CMGM.STANFORD.EDU (Douglas Brutlag) writes: >Bill, > > Isn't FASTA with optimization identical to the Smith-Waterman? The >optimization step in FASTDB is precisely a Smith-Waterman scoring of the top >5,000 sequences, and hence FASTDB with optimization is a Smith-Waterman >analysis on those sequences. ... No, FASTA uses a band of 32 residues for optimization. Smith-Waterman uses both sequences in their entirety for the optimization. FASTA with ktup=1 and optimization is about 5 - 10 X faster than Smith-Waterman, reflecting the fact that the average query sequence size is about 150 - 300 residues. With FASTA, you can either optimize every sequence or optimize those with a score greater than a threshold - either method works as well as Smith-Waterman. Regarding the gold-standard - I work with as many superfamilies as I can find, with several members of the superfamily (some randomly chosen), and I do comparisons with Smith-Waterman. Since I am trying to find sequences that share a common ancestor (and thus have a common structure), I think false-negatives are exactly that. There is little evidence for common structural motifs that can be recognized by sequence comparison in the absence of a common ancestor. Most recently, I have moved from a "criterion" that is a fixed function of the scores of the top-scoring unrelated sequences (the Genomics paper) to one that balances the number of high-scoring unrelated and low-scoring related sequences. This gives the same results, but seems esthetically more pleasing. I feel pretty uncomfortable with "motifs" that result from convergence. I prefer to focus on common ancestry. 
For me, that solves many of the problems you mention.

Bill
---------------------------------------------------------------------------------
Anne Juelich writes..

In article , wrp@cyclops.micr.Virginia.EDU (Bill Pearson) writes:
|> [ ... ]
|>
|> I should note also, since some readers of this group may be
|> interested, that I now have a version of our parallel "platform" for
|> sequence comparison ( Despande, Richards, and Pearson (1991) CABIOS "A
|> platform for biological sequence comparison on parallel computers"
|> 7:237-247) running on networks of workstations using PVM (parallel
|> virtual machine), a freely available package for almost any machine.
|> If you are doing lots of sequence comparisons, I can provide you with
|> PVM versions for FASTA and Smith-Waterman, with BLAST to be available
|> in about a month.
|>
|> [ ...]

At The University of Texas System Center for High Performance Computing, we have developed a PVM based version of BLAST running as part of GenTools(TM). GenTools is a distributed genomic tool suite designed to provide the experimental molecular biologist and/or geneticist transparent access to computational tools on high performance platforms. (To obtain more information about GenTools, please send E-mail to gentools@chpc.utexas.edu.)

For example, blastp, comparing a 799-residue query sequence versus PIR, release 32 (40,298 sequences), ran on a net of up to 8 SPARC 2 workstations (times are elapsed times in seconds):

        # processors      time
        --------------------------------
             1           194.6
             2           100.1
             4            53.2
             6            38.1
             8            29.6

These timings were obtained with blastp, version 1.2.5 and PVM, version 2.4.2. We are now using PVM, version 3.0, but I haven't redone the timings yet.

Anne Juelich
-------------------------------------------------------------------------------
>From guccione@com.ibm.austin Tue Mar 30 23:47:41 1993
> Article 947 (23 more) in bionet.software:
> From: mbpcr@s-crim1.dl.ac.uk (A.
Parsons) > Subject: Massively Parallel Applications in Sequence Analysis > Date: 29 Mar 93 12:30:01 GMT > Distribution: bionet > Organization: Daresbury Lab., Warrington, U.K. > > Greetings Netfolk!, > > We have been trying to do a limited evaluation of the utility of Smith and > Waterman searches on massively parallel hardware (MasPar MP-1 and MP-2) using > both BLAZE (from Intelligenetics) and Mpsrch (from Shane Sturrock and John > Collins at the Biocomputing Unit at Edinburgh University). > > I was wondering if anyone has done/intending to do something similar? More to > the point has any independent assessment been carried out that anyone is > intending to publish? I have seen some data from Intelligenetics which is very > impressive but then again hardly impartial. > Mr. Parsons, I'm also interested in this area. You may have more luck posting to the comp.parallel newsgroup. I do know that several people are working in this area. I recall something on the Thinking Machines CM-5 (and possible an older version on the CM-2). There are also people taking a systolic approach. D. T. Hoang at Brown University has done some interesting work in this area. D. Lopresti (formerly of Brown) built a custom VLSI chip for this application a few years ago. I'd be interested in whatever other information you can find. -- Steve -- 3/30/93 ------------------------------------------------------------------------------ From: Michael McKenna writes ... Personally, I have had far greater success using MPsrch at EMBL than with any of the other algorithms, including Blaze. Unfortunately, Swiss-Prot is the only database available now for protein sequences, and probably represents a subset of what is available at the NCBI. My proteins were distantly related to a poorly conserved superfamily, on the order of 25% identity, but MPsrch identified related molecules where FASTA, BLAST, Blaze, and BLOCKS failed. 
I believe every algorithm has its strengths and weaknesses, and routinely use all of them. The amount of computer time a search uses is basically irrelevant, and will eventually fade away as computers become more powerful. Just wanted to give you my opinion. Mike McKenna MCKMICP@YALEVM.YCC.YALE.EDU ------------------------------------------------------------------------------ Geir Egil Hauge writes... > I should note also, since some readers of this group may be >interested, that I now have a version of our parallel "platform" for >sequence comparison ( Despande, Richards, and Pearson (1991) CABIOS "A >platform for biological sequence comparison on parallel computers" >7:237-247) running on networks of workstations using PVM (parallel >virtual machine), a freely available package for almost any machine. >If you are doing lots of sequence comparisons, I can provide you with >PVM versions for FASTA and Smith-Waterman, with BLAST to be available >in about a month. My shareware package dtask v1.1s (no previous versions available) for running UNIX workstations in parallel when comparing biological sequences, is to be released in about 1 month (as soon as it is cleared by my supervisors). A sequence comparison program using the Smith-Waterman-1981 algorithm is included in the package. I have tested the package on as much as 96 UNIX workstations in parallel. The speed was then measured to be 42 million matrix cells updates per second, using a 801 residue long protein query sequence against Swiss-Prot #21. (It took 151 seconds). The speedup was measured to be 32 against a Sun Sparc-10 station. This is 82% of "perfect speedup". The speedup will be better on heavier jobs (longer query sequences) and smaller on lighter jobs (short query sequences). The speedup will also be better when a smaller number than 96 machines are run in parallel. 
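The figures quoted above can be cross-checked with a little arithmetic. The sketch below is an illustration only: it assumes the usual definition of a matrix cell update (one query residue compared against one library residue), and the function names are invented for this note. It back-calculates the library size implied by 42 million cell updates per second over 151 seconds with an 801-residue query, and the "perfect" speedup implied by a measured speedup of 32 being 82% of it.

```python
def implied_library_residues(cells_per_second, elapsed_seconds, query_length):
    """Library residues implied by a throughput figure.

    Assumes total cell updates = query_length * library_residues.
    """
    total_cells = cells_per_second * elapsed_seconds
    return total_cells / query_length


def implied_perfect_speedup(measured_speedup, fraction_of_perfect):
    """Perfect speedup implied by a measured speedup and its reported fraction."""
    return measured_speedup / fraction_of_perfect


# Figures reported for dtask on 96 workstations.
residues = implied_library_residues(42e6, 151, 801)
perfect = implied_perfect_speedup(32, 0.82)

print(f"implied library size: {residues / 1e6:.1f} million residues")
print(f"implied perfect speedup: {perfect:.0f} Sparc-10 equivalents")
```

On these assumptions the 96 mixed machines would together be worth roughly 39 Sparc-10s, which would explain how a speedup of 32 can be 82% of perfect even though 32/96 is only about a third.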
Among the 96 machines were machines like: SUN 3/50, SUN 3/60, Sparc-2, Sparc-10, DEC3100, DEC5000/200, and some SGI and HP machines running System V derived UNIX systems. (In version 1.1s of dtask, BSD signals are needed on System V systems). The machines have to share a common file system such as NFS, and must be able to do UNIX socket(2)/AF_INET communication.

The programs are built in such a way that they can detect if a workstation is heavily used by other users, and then stop using that machine for a specified time before it is tried again.

I use indexfiles in such a way that the programs are quite independent of library format. Only the program that creates indexfiles has to be altered. A program for making indexfiles from Pearson/FASTA-format libraries is included.

The package, containing complete C-source, documentation and tests, will be available from anonymous ftp "ftp.ifi.uio.no" in about a month.

Geir Egil Hauge
--------------------------------------------------------------------------------
Tony Parsons writes in response....

This sounds like very impressive performance, and I don't want to sound facetious but HOW much would a network of 96 Un*x workstations cost? And more to the point - how much floor space would they take up!

Also - my understanding is that when it comes to symmetric (or even asymmetric) multiprocessing, performance starts to degrade after a certain number of processors are added, due to interprocessor communication and synchronisation.

I suppose what I am saying is - if more is better - and parallelism is the preferred paradigm - then massively parallel is surely the ultimate solution??

I heard Donald Lindberg give a lecture at the Royal Society in London on Monday and this was very much the thinking behind his talk so presumably this is the preferred route that the NCBI/NLM are going to take?
The question I originally asked also had the caveat (which no one to date has commented on) "How much longer can we do WITHOUT data parallel solutions for searching the masses of data being generated by the HGMP?"

As this thread is starting to die down I will summarise all responses soon.

Tony Parsons (mbpcr@seqnet.dl.ac.uk AKA parsons_a@snd01.pcr.co.uk)
Pfizer Central Research - Sandwich, UK
------------------------------------------------------------------------------
From: cup@ca.ualberta.biochem.bones (Chris Upton)

Hi,

I find it's essential to use something like BLAZE or BLITZ to find low homologies that stretch over the full length of a protein. BLAST just doesn't see them, and while I've heard that with fiddling FASTA/FASTDB can be as sensitive, for most users I would recommend a BLITZ search. Actually, I always do BOTH, since BLAST is great for those small regions often associated with functional domains.

I have a couple of test cases if you're interested. I work with poxviruses and often it's hard to make the connection between the viral proteins and the host counterpart.

Cheers,
Chris Upton
cup@bones.biochem.ualberta.ca
Biochemistry
University of Alberta
Edmonton AB Canada

================ and follow up with two examples of value of S&W method ====

Tony,

The sequences I have are published so I have no problem with summaries to the net. Anything to awaken users to the importance of correct searches. I'll just append the sequences and maybe you could run the searches with your parameters.

Cheers, Chris

T7PEP
MIKMKERLFFIWFLTVTSTDTVRLTSYDLNIFVNWRDDGYAYNVSIKPYTTGTWINVCEWASSSCNVSAA
LQNDLDIMTWVRLTRLGESIEYSLEPTCNVARFSPPEVRLSRLGPSVEVVIQHSVVNLRGDNVPVYGYPF
CDDYFGYKMFFLFSNDKHAEYDVDDRYCDYVQCRFTIESQERVCVTAVLVFGNSYRSEAGEDVCVSELVK
YVVDPYIVKKPSDLEDVKRIISNEYRFDKTEERSRLEDLYLMIASMFQRLVEDIF

This is Shope fibroma virus (SFV) T7 protein; it's related to human and mouse IFN-gamma receptors. It binds and blocks IFN-gamma activity.
Upton et al Science 258:1369-1372 (1992)

sfvd6pep
MRRVFLSHEPYVIEYHEDWENIITRLVDMYNEVAEWILKDDTSPTPDKFFKQLSVSLKDKRVCVCGIDPY
PRDATGVPFEESHNFTKKTIKYIAETVSNITGVRYYKGYNLNNVEGVFPWNYYLSCKIGETKSHALHWKR
ISKLLLQHITKYVNVLYCLGKTDFANIRSILETPVTTVIGYHPAAREKQFEKDKKEINDKPSIRWEQGFS
Y

This is the SFV uracil DNA glycosylase (ORF Bam-D6). Upton et al PNAS (in press)

In each case other related poxviral proteins were known but the connection to mammalian counterparts had not been made. I actually used a program called NW_align which is part of a package by Boyko and Wishart here at the University of Alberta. They've just about finished writing it up, but it essentially does full alignments and jumbling of sequences to increase sensitivity. Blaze and Blitz also pulled the correct matches out of the database.

Cheers, Chris

Chris Upton
Biochemistry
University of Alberta
Edmonton
Alberta T6G 2H7
Canada
cup@bones.biochem.ualberta.ca
------------------------------------------------------------------------------
Shane Sturrock (mpsrch Fame)..

In article <1peuciINNicd@s-crim1.dl.ac.uk> mbpcr@s-crim1.dl.ac.uk (A. Parsons) writes:
>
>This sounds like very impressive performance, and I don't want to sound facetious
>but HOW much would a network of 96 Un*x workstations cost? And more to the
>point - how much floor space would they take up!

It depends; no doubt there are many workstations hiding in people's offices which have lots of spare cycles available, so distributed processing of this kind sounds like a sensible use. On the other hand......

>
>Also - my understanding is that when it comes to symmetric (or even asymmetric)
>multiprocessing that performance starts to degrade after a certain number of
>processors are added due to interprocessor communication and synchronisation.

Yes.
I wrote a full Smith Waterman for the transputer surface at University of Kent while I was a student and performance did tail off but it depends very much on how the communications are dealt with and how much data you shunt around. When I did it I was rather silly creating full alignments on the individual processors and shipping those back to the farming processor. Obviously now it is possible to just do the comparison and pass the score and coordinates back to the farmer for ranking then only reconstruct however many are required, but I was young and naive way back in those days :-) Even so, given that scheme I ran on 86 T800s and still achieved over 90% processor utilisation, so for a pipeline that long the over head was about 10%, now I could run a pipeline well into the hundreds with negligible tailing off of efficiency. But, transputers are designed for that sort of parallelism, writing code on workstations to handle all the communications etc is much harder than writing a true occam parallel program as I did so hats off to those who do such things. Me, I stick to true parallel machines since they are easier to program. > >I suppose what I am saying is - if more is better - and parallelism is the >preferred paradigm - then massively parallel is surely the ultimate solution?? Hmmmm, parallelism is the only foreseable way ahead with present technology since the limitations of performance on single processors will force the move if we are to stick to silicon based processors. As for massively parallel (ie SIMD) versus MIMD, well its a matter of personal preference, SIMD machines are a doddle to program (mind you so are MIMD machines so don't let that put anyone off) but the style of programming differs greatly between the two. I would avoid saying ultimate about anything since something can only be ultimate if the world ends immediately after the statement. Anything else is just plain hype. 
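The farming scheme described above (workers return only a score and coordinates, the farmer ranks them, and full alignments are rebuilt only for the hits that are wanted) might be sketched as follows. This is a toy in Python rather than occam, with a thread pool standing in for the transputer workers; the match/mismatch/gap scores, the function names and the three-sequence library are all made up for illustration, and the reconstruction step itself is left out.

```python
from concurrent.futures import ThreadPoolExecutor


def sw_score(query, target, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local score with a linear gap penalty.

    Returns (best_score, query_end, target_end); no alignment is built.
    """
    prev = [0] * (len(target) + 1)
    best = (0, 0, 0)
    for i, q in enumerate(query, start=1):
        curr = [0] * (len(target) + 1)
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            if curr[j] > best[0]:
                best = (curr[j], i, j)
        prev = curr
    return best


def farm_search(query, library, top_n=2, workers=4):
    """Farmer: hand out comparisons, collect the small results, rank them."""
    names = list(library)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda name: sw_score(query, library[name]), names)
        hits = [(score, name, qi, tj)
                for name, (score, qi, tj) in zip(names, results)]
    hits.sort(key=lambda h: (-h[0], h[1]))
    return hits[:top_n]  # only these would go on to full reconstruction


library = {  # hypothetical sequences, for illustration only
    "seqA": "MKTAYIAKQR",
    "seqB": "GGGGGGGGGG",
    "seqC": "AYIAK",
}
for score, name, q_end, t_end in farm_search("TAYIAKQ", library):
    print(name, score, q_end, t_end)
```

The point of the design is the size of what travels back to the farmer: three integers per library sequence instead of a full alignment, which keeps communication small relative to the comparison work.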
> >I heard Donald Lindberg give a lecture at the Royal Society in London on Monday >and this was very much the thinking behind his talk so presumably this is the >preferred route that the NCBI/NLM are going to take? > >The question I originally asked also had the caveat (which noone to date has >commented on) "How much longer can we do WITHOUT data parallel solutions for >searching the masses of data being generated by the HGMP?" One thing that looks like being rather fun is the ability given very fast machines to go beyond the Smith Waterman algorithm and do more work to improve the sensitivity of the program. Of course, you need to get the Smith Waterman fast enough before you can claim to have time to spare to deal with such things. I don't see that there is much point in mucking about with heuristic algorithms which may be sensitive but given the advances in hardware will pretty much be pointless when it takes a matter of seconds to do the full SW (or more) algorithm on parallel machines. > >As this thread is starting to die down i will summarise all responses soon. I look forward to reading them. -- Shane Sturrock, Biocomputing Research Unit, Darwin Building, Mayfield Road, University of Edinburgh, Scotland, Commonwealth of Independent Kingdoms. :-) Civilisation is a Haggis Supper with salt and sauce and a bottle of Irn Bru. ---------------------------------------------------------------------------------- Rheinhard Doelz writes... >From daresbury!bioftp.unibas.ch!comp.bioz.unibas.ch!doelz Mon Apr 5 13:50:26 GMT 1993 In article <1peuciINNicd@s-crim1.dl.ac.uk>, mbpcr@s-crim1.dl.ac.uk (A. Parsons) writes: |> The question I originally asked also had the caveat (which noone to date has |> commented on) "How much longer can we do WITHOUT data parallel solutions for |> searching the masses of data being generated by the HGMP?" |> Sorry for playing devil's advocate but I would say that the individual site can live long without a parallel computer. 
The database providers might like to have one, which they can use effectively. The biocomputer sites might need one (if they can afford to abandon the safe tracks and develop, maintain, and optimize their own hardware/software solutions).

You might have seen the discussion on the suspected contaminations reported in human cDNA libraries (Science, issue of previous week). In order to run thousands of sequences vs. EMBL, SWISSPROT, GENBANK, PIR we used a Silicon Graphics Cluster, containing 2 Indigos, a Crimson, and a 2-Processor Power series. It took about two to three weeks elapsed time on this setup, where the machines were also used for development work, molecular modelling, and general GCG usage.

Now if you have such a problem of running 10000 fasta and blast jobs, how often do you need the results `yesterday`? How often do you have these gigantic questions at all? If you run these analyses, the 'granularity' given by the fact that you run one sequence after another is sufficient to distribute the work amongst machines. You might want to have a parallel version if you have a multi-processor machine, but the real issue is that the effort to manage and maintain a pvm installation is tedious in the long run. A Hierarchical Access System is perfect enough - you distribute single jobs across the net and keep using the 'established' software. This is also beneficial because experience with the parameters and the search results is generally greater with fasta or blast-type searches than with special algorithms.

The last point that comes to mind is that the question of 'similarity' is asked differently by database providers than by end users. A reasonable check for a database annotator is to look for identities, maybe figure out alleles, but not the weak homologies that most end users are interested in. This explicitly leaves aside the 'biocomputers' who search for the needle in the hay (no pun intended - just to scale the effort right).
Those guys might not be able to live without monsters in the future. -- +----------------------------------+-------------------------------------+ | Dr. Reinhard Doelz | RFC doelz@urz.unibas.ch | | Biocomputing | DECNET 20579::48130::doelz | |Biozentrum der Universitaet | X25 022846211142036::doelz | | Klingelbergstrasse 70 | FAX x41 61 261- 6760 or 267- 2078 | CH 4056 Basel | TEL x41 61 267- 2076 or 2247 | +------------- bioftp.unibas.ch is the SWISS EMBnet node ----------------+ ----------------------------------------- ------------------------------------------------------------------------------ Bill Pearson (Fasta fame).. In article <1993Apr13.110138.5033@gserv1.dl.ac.uk> tekaia@pasteur.fr (Fredj Tekaia) writes: >I am interrested in comparing the performances of the 2 family programs : >BLAST and FASTA. Performances in terms of speed of execution and accuracy >of the results. There is a brief discussion of the relative performance of blast and FASTA in Pearson, WR (1991) Genomics "Searching Protein Sequence Libraries: Comparison of the Sensitivity and Selectivity of the Smith-Waterman and FASTA Algorithms" 11:635-650. Since then, I have done more extensive comparisons and am writing up the results. A brief summary: BLASTP performs about as well as FASTA with ktup 1 - 1.5 (which does not exist). It is better than FASTA with ktup=2. It is not as good as Smith-Waterman. FASTA with -o and ktup=2 or ktup=1 performs better than BLASTP, and as well as Smith-Waterman. (-o tells FASTA to calculate an "optimized" score - a score with gaps - for every sequence in the database.) >In fact I would like to know if for comparing a test sequence to a database : >Is it best to do it by : >a) fasta programs, b) the blast programs or c) both programs? I would run BLAST first. If BLAST fails to find any significant matches, I would run FASTA with -o and ktup=1. This is substantially slower than BLAST, but will not miss anything significant. 
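The strategy suggested above (a fast search first, with the slower and more sensitive search only when nothing significant turns up) can be written down as a small driver. The sketch below is hypothetical throughout: `fast_search`, `sensitive_search` and the significance cutoff stand in for whatever wrappers a site has around BLAST and around FASTA with -o and ktup=1; nothing here calls the real programs, and the scores are invented.

```python
def tiered_search(query, fast_search, sensitive_search, is_significant):
    """Run the fast search; fall back to the sensitive one if it finds nothing.

    Each search callable takes a query and returns a list of (name, score) hits.
    Returns (method_used, significant_hits).
    """
    hits = [h for h in fast_search(query) if is_significant(h)]
    if hits:
        return "fast", hits
    hits = [h for h in sensitive_search(query) if is_significant(h)]
    return "sensitive", hits


# Stand-in searches with invented scores, for illustration only.
def fake_blast(query):
    return [("libseq1", 40)]


def fake_fasta_opt(query):
    return [("libseq1", 40), ("libseq2", 95)]


method, hits = tiered_search(
    "MIKMKERLFF", fake_blast, fake_fasta_opt, lambda hit: hit[1] >= 80
)
print(method, hits)
```

Passing the searches in as callables keeps the decision logic separate from how each program is actually invoked at a given site.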
At the moment, the main advantage of BLAST is that you can run it on the NCBI non-redundant sequence database. When this database becomes generally available, it should be used for FASTA searching with the -o option. Bill Pearson --------------------------------------------------------------------------------- tbrannon@lion.eecs.lehigh.edu writes (on a related theme) Much of the problem with parallel algorithms is that they are machine specific. There have been attempts at standardization of the parallel languages (F90, Fortran D, CMFortran), but communication calls are far from standard. If you are porting code between very different architectures, often the datasets will need to be restructured to insure reasonable performance. When you ask about code already written, you should also inquire as to which platform it was developed on. PCN , developed between Caltech and Argonne Labs allows for the same program to run with NO modification on numerous uniprocessor platforms as well as a number of popular hypercubes. It is available for anonymous ftp from: info.mcs.anl.edu in pub/pcn > 2. Is parallel computing on campus something we would use? In general, if you have a problem that requires a significant amount of interprocessor communication, a serial machine might give you better performance. Hmm, I beg to disagree. I spent a summer 2 years ago parallelizing an fcc-lattice algorithm to predict tertiary protein structures. My parallel version (developed with PCN) ran worlds faster. -- Terry Brannon tbrannon@lion.eecs.lehigh.edu medical biology via acupunctural bioelectrics primitive reigns supreme over nearly everyone ------------------------------------------------------------------------------- samodena@csemail.cropsci.ncsu.edu writes (on a related theme)... In article <9304132221.AA23039@pi18.arc.umn.edu> eichler@pi18.arc.umn.edu (Rogene Eichler) writes: >> 1. 
What presently is available (or seen coming in the near >> future) on parallel machines that would aid molecular biology >> research? > >There is a program called BigNet written for the Cray, CM200, and CM5 Of course, there are other machines that might be considered....the Maspar machine has an excellent $-per-CPU-minute ratio when kept busy on very large problems....I don't happen to know it for the Convex products but I'll bet it's good also. > >Much of the problem with parallel algorithms is that they are machine >specific. There have been attempts at standardization of the parallel >languages (F90, Fortran D, CMFortran), but communication calls are >far from standard...... >....... >In general, if you have a problem that requires a significant amount >of interprocessor communication, a serial machine might give you better >performance. > -Rogene Let's put this the other way around: if one has an algorithm which is purely synchronous (each processor in the machine is executing the same line of FORTRAN at the same moment), then you may be home free! The implementation of the algorithm on your machine-of-choice may be greatly simplified (and I know of at least one major sequence classification algorithm that falls into this category). So to expand Rogene's point, when you decide to implement *new* algorithms (can't be a copy-cat forever), what new knowledge-skills-software tools will be required...what is the learning curve....what will be the likely structural nature of algorithms of interest to you and your colleagues....is massively parallel supercomputing the most fruitful path for your interests in computational molbio? For some categories of algorithms, massively parallel is the way to go! And for other categories, it may be the pits. 
:^) [ I'm thinking of the contrast in two crop process modeling presentations a few years ago....the guy that grew FORTRAN "supercomputed" corn had a gut-wrenching experience, whereas the guy that threw away COMAX (FORTRAN Cotton Model) and grew cotton in LISP from first principles thought it was a breeze to do. ] Steve --- +------------------------------------------------------------------+ | In person: Steve Modena AB4EL | | On phone: (919) 515-5328 | | At e-mail: nmodena@unity.ncsu.edu | | samodena@csemail.cropsci.ncsu.edu | | [ either email address is read each day ] | | By snail: Crop Sci Dept, Box 7620, NCSU, Raleigh, NC 27695 | +------------------------------------------------------------------+ Lighten UP! It's just a computer doing that to you. (c) OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO --------------------------------------------------------------------------------- Don Gilbert writes..gilbert@bio.indiana.edu.. ona related theme In article <9APR199307174120@aardvark.ucs.uoknor.edu> broe@aardvark.ucs.uoknor.edu (Bruce Roe) writes: > 1. What presently is available (or seen coming in the near > future) on parallel machines that would aid molecular biology > research? The phylogeny program FastDNAml has some parallelized code in it: "The program includes "P4" code for distributing the problem over multiple processors (either within one machine, or across multiple machines)." I'm not working with this myself, but Chuck Delwiche here (delwiche@bio. indiana.edu) is looking into using it on IU's Intel Paragon parallel box. (is that what Intel does with all those obsolete 8088's? :) He may have something to report in a month or so. Pick it up by ftp to info.mcs.anl.gov@/pub/RDP/?software/... 
or gopher/ftp to ftp.bio.indiana.edu@/molbio/evolve/fast-dnaml.tar.Z -- don -- Don Gilbert gilbert@bio.indiana.edu biocomputing office, biology dept., indiana univ., bloomington, in 47405 end --------------------------------------------------------------------------