From: GOLD::EDU%"cjv@MACE.CC.PURDUE.EDU" "westerman" 13-MAR-1990 12:52
To: Don Gilbert
Subj: modifications to gap/bestfit
Received: From UICVM(MAILER) by IUGOLD with Jnet id 7259
for GILBERTD@IUBACS; Tue, 13 Mar 90 12:49 EST
Received: by UICVM (Mailer R2.03B) id 0600; Tue, 13 Mar 90 10:52:41 CST
Date: Tue, 13 Mar 90 10:27:42 -0500
Reply-To: westerman
Sender: "INFO-GCG: GCG Genetics Software Discussion"
From: westerman
Subject: modifications to gap/bestfit
To: Don Gilbert
-- I had enough requests for my modifications that I decided to post them
-- to all.
--
-- This file includes the changes to GAP so that it will generate a
-- standard deviation via many comparisons of shuffled sequences. My comments
-- are preceeded by "--". Changed lines in the source code are preceeded by
-- a ">" and surrounded by unchanged lines (from version 6.1 source). All
-- lines are preceeded by an informative line number; note that the numbering
-- may not be the same as in your version of the source. The source contains
-- some comments that should be read.
--
-- To implement the changes, do the normal upgrade procedure for GCG software:
--
-- 1) Type "gcgsupport" from a priviledged account
-- 2) Type "take gap mod" to retrieve your source code
-- 3) Using an editor, insert the changes below into 'gap.for'
-- 4) As a check, type "fortran gap" and "sharelink gap" to make sure
-- the changes compile correctly.
-- 5) Type "save gap" to save your source
-- 6) Type "make gap" to compile & install the program
--
--
-- Please email or call me if you have problems or comments.
--
-- Rick Westerman, Biochemistry, Purdue University, USA
-- email: cjv@mace.cc.purdue.edu phone: 317-494-0505
--
--
--
-- There are 6 sections of changes to GAP.FOR --
--
-- section 1 --
--
12 !* Gap /Out mode if you are setting the requirements of this
13 !* program at or near the limits of the address space available on
14 !* your machine!
15 !*
16 > !* The following notes were made by Rick Westerman at Purdue Univeristy,
USA
17 > !* email: cjv@mace.cc.purdue.edu phone: 317-494-0505
18 > !*
19 > !* Modified, Jan 15, 1990 by RPW to add the /STAND option. This option
will
20 > !* calculate the standard deviation of a number of randomly shuffled
21 > !* combinations.
22 > !* Modified, Feb 22, 1990 by RPW to add high & low values to standard
dev. and
23 > !* to keep local structure within strands by not shuffling them much.
24 > !* Modified, Mar 5, 1990 by RPW to add the number of high values above
the
25 > !* original score between the two sequences
26 > !*
27 > !*
28 > !* Gap was changed to allow generation of a mean, standard deviation,
high,
29 > !* and low values of a user-specified number of randomly shuffled
comparisons.
30 > !* This option can be either choosen from the command line using the
31 > !* "/STANDard" switch or by answering the interactive question "Do
standard
32 > !* deviation?". The default is to *not* do the standard deviation
trials. The
33 > !* number of trials to run defaults to 100 and has a maximum of 300;
this
34 > !* number is not selectable from the command line.
35 > !*
36 > !* The sequences are shuffled in a manner that will preserve small
stretches
37 > !* of the sequence. To quote from Pearson and Lipman's paper on testing
38 > !* FASTA searches (PNAS, 85 (1988), pp. 2444-2448):
39 > !*
40 > !* "Locally biased amino acid or nucleotide composition is perhaps the
most
41 > !* common reason for high similarity scores of dubious biological
significance.
42 > !* High scoring alignments between query and library sequences may be
due to
43 > !* patches of hydrophobic or charged amino acid residues or to A+T or
G+C
44 > !* regions of DNA. A simple Monte Carlo shuffle analysis that
constructs
45 > !* random sequences by taking each residue in one sequence and placing
it
46 > !* randomly along the length of the new sequence will break up these
patches
47 > !* of biased composition. As a result, the scores of the shuffled
sequences
48 > !* may be much lower than those of the unshuffled sequence, and the
sequences
49 > !* will appear to be related. Alternatively, shuffled sequences can be
50 > !* constructed by permutating small blocks of 10 or 20 residures so
that, while
51 > !* order of the sequence is destroyed, the local composition is not."
52 > !*
53 > !* The 'alternate' method mentioned above is the one employed in this
program.
54 > !*
55 > !* Evaluating the results:
56 > !* While the mean plus the standard deviation can be used to
determine
57 > !* significance, because of the divergance of comparison scores from
58 > !* a normal curve, this does not work well. That is, the normal
method
59 > !* of significance -- the mean plus 3 SDs -- doesn't work on the type
60 > !* of distributions generated by GAP or FASTA. A better method of
61 > !* significance is to look at the number of shuffled sequences that
have
62 > !* a higher score than the non-shuffled sequence.
63 > !*
64
!******************************************************************************
65
66 Program GAP
67
--
-- Section 2
--
96 Integer OutLen, Str_Len, GetIntNum
97 Real Sim, Iden
98
99 > Logical Std/.false./
100 > Integer StdTimes/100/
101 > Character s1dorg(MaxSeg), s2dorg(MaxSeg), splace(25)
102 > Character s1d(MaxSeg), s2d(MaxSeg), Text3(512)
103 > Character AboveStr(150)
104 > Real QualityD, RatioD, QuD(300), RatD(300)
105 > Real QuDAvg, QuDSum, QuDStd, RatDAvg, RatDSum, RatDStd
106 > Real MinQu, MaxQu, MinRat, MaxRat
107 > Integer NGapsD, Off1D, Off2D, J, Above
108 > Integer StrCompare
109 >
110 Real Similarity, Identity
111
112 !* print the documentary banner
--
-- Section 3
--
188 If ( Seq1Rev ) Call RevSeq(Seq,Seq1Begin,Seq1End)
189 Call StrCopy(s1,seq(Seq1begin))
190 > Call StrCopy(s1dorg,seq(Seq1begin))
191
192 !* Get sequence 2
--
-- Section 4
--
224 If (Seq2Rev) Call RevSeq(Seq,Seq2Begin,Seq2End)
225 Call StrCopy(s2, seq(Seq2begin))
226 > Call StrCopy(s2dorg, seq(Seq2begin))
227
228 !* Get the gap weight, and gap length weight
--
-- Section 5
--
273 End If
274
275 Shift2 = -Shift2
276
277 > !* see if to do standard deviation
278 >
279 > Std = CLRetBool('STAndard')
280 > If (.not.Std .and. .not.CLNoInteract()) then
281 > Call WriteF('\n Do standard deviation (* No *) ? ')
282 > Call GetYesNo(Std)
283 > End If
284 > If (Std .and. .not.CLNoInteract()) then
285 > Call WriteF('\n Number of trials to base standard
286 > & deviation on (* %d *) ? ', StdTimes)
287 > StdTimes = GetIntNum('Number of trials',1,300,StdTimes)
288 > End If
289 >
290 !* trap runs with no output files
291
292 Call StrCopy(Out1FName, Seq1Name)
293 Call StrCopy(Out2FName, Seq1Name)
--
-- Section 6
--
353 Sim = Similarity(s1, s2, CmpTable)
354 Iden = Identity(s1, s2)
355
356 > !* RPW: If desired, collect statistics and calculate standard deviation.
357 >
358 > If (Std) then
359 > QuDSum = 0.0
360 > RatDSum = 0.0
361 > QuDStd = 0.0
362 > RatDStd = 0.0
363 >
364 > Do 101 I = 1, StdTimes
365 > If (.not.CLNoInteract()) then
366 > Call WriteF('\nStandard deviation trial #%d:',I)
367 > End If
368 >
369 > Do 103 J = 1, Str_Len(s1dorg), 20
370 > Call SubString(splace,s1dorg,J,J+19)
371 > Call ShuffleSeq(splace,Str_Len(splace))
372 > If (J .eq. 1) then
373 > Call StrCopy(s1d,splace)
374 > Else
375 > Call StrConcat(s1d,splace)
376 > EndIf
377 > 103 Continue
378 >
379 > 107 Do 104 J = 1, Str_Len(s2dorg), 20
380 > Call SubString(splace,s2dorg,J,J+19)
381 > Call ShuffleSeq(splace,Str_Len(splace))
382 > If (J .eq. 1) then
383 > Call StrCopy(s2d,splace)
384 > Else
385 > Call StrConcat(s2d,splace)
386 > EndIf
387 > 104 Continue
388 >
389 > ! ... because shuffleseq is time based, sometimes the same
390 > ! sequence can be obtained (if comparing the same
391 > ! sequence to itself), check for this and
392 > ! reshuffle if this is so
393 > If (StrCompare(s1d,s2d) .eq. 0) goto 107
394 >
395 > If ( .not.ShiftAlign(s1d, s2d, WgtGap, WgtGapLen, NGapsD,
396 > & Off1D, Off2D, Shift1, Shift2, QualityD,
397 > & Path, MaxSurface, CmpTable, Best, WgtEnds) )
398 > & Call WriteF('\n *** ERROR in GAP/BESTFIT, '//
399 > & 'No alignment done ***\n\b\s')
400 > RatioD = QualityD/
401 > & Float( Min(StrCount(s1,'~.',-1),StrCount
(s2,'~.',-1)))
402 > QuD(I) = QualityD
403 > RatD(I) = RatioD
404 > QuDSum = QuDSum + QualityD
405 > RatDSum = RatDSum + RatioD
406 > 101 Continue
407 > QuDAvg = QuDSum / StdTimes
408 > RatDAvg = RatDSum / StdTimes
409 >
410 > MinQu = QuD(1)
411 > MaxQu = QuD(1)
412 > MinRat = RatD(1)
413 > MaxRat = RatD(1)
414 > Above = 0
415 >
416 > Do 111 I = 1, StdTimes
417 > If (QuD(I) .gt. MaxQu) MaxQu = QuD(I)
418 > If (QuD(I) .lt. MinQu) MinQu = QuD(I)
419 > If (QuD(I) .gt. Quality) Above = Above + 1
420 >
421 > If (RatD(I) .gt. MaxRat) MaxRat = RatD(I)
422 > If (RatD(I) .lt. MinRat) MinRat = RatD(I)
423 >
424 > QuDStd = QuDStd + (QuD(I) - QuDAvg) ** 2
425 > RatDStd = RatDStd + (RatD(I) - RatDAvg) ** 2
426 > 111 Continue
427 > QuDStd = (QuDStd / (StdTimes - 1)) ** 0.5
428 > RatDStd = (RatDStd / (StdTimes - 1)) ** 0.5
429 > End If
430 >
431 >
432 > !* show parameters used to both screen and output file.
433 >
434 > If (Std) then
435 > Call SWriteF(AboveStr,'\n')
436 > If (Above .gt. 0)
437 > & Call SWriteF(AboveStr,'\n\t\t\tNote: %d trials with a'//
438 > & ' quality greater than %5.1f \n\t\t\t '//
439 > & ' and with a ratio greater than %5.3f.\n\n',
440 > & Above, Quality, Ratio)
441 >
442 > End If
443 >
444 > If ( .not.CLNoInteract() .or. CLRetBool('SUMmary') ) then
445 > Call WriteF('\n\n
446 & Gaps: %5d\n
447 & Quality: %5.1f\n
448 & Quality Ratio: %5.3f\n
449 & %% Similarity: %5.3f\n
450 & Length: %5d\n\n', NGaps, Quality, Ratio, Sim, OutLen)
451
452 > If (Std) then
453 >
454 > Call WriteF('
455 > & -- Statistics with random sequences containing the\n
456 > & same number of bases and the same base content --\n\n
457 > & Trials: %5d\n
458 > & Avg. Quality: %5.1f\t Std. Deviation: %5.1f\t
459 > & Min: %5.1f\t Max: %5.1f \n
460 > & Avg. Ratio: %5.3f\t Std. Deviation: %5.3f\t
461 > & Min: %5.3f\t Max: %5.3f \n',
462 > & StdTimes, QuDAvg, QuDStd, MinQu, MaxQu,
463 > & RatDAvg, RatDStd, MinRat, MaxRat)
464 > Call WriteF(AboveStr)
465 >
466 > End If
467 > End If
468 >
469 !* display the alignment or write out the aligned sequences
470
471 Call SWriteF(Text,
472 & ' Gap Weight: %6.3f Average Match: %6.3f\n'//
473 & ' Length Weight: %6.3f Average Mismatch: %6.3f\n\n'//
474 & ' Quality: %6.1f Length: %6d\n'//
475 & ' Ratio: %6.3f Gaps: %6d\n'//
476 & ' Percent Similarity: %6.3f Percent Identity: %6.3f',
477 & WgtGap, AvMatch, WgtGapLen, AvMismatch,
478 & Quality, OutLen, Ratio, NGaps, Sim, Iden )
479 Call SWriteF(Text2,
480 & ' Gap limit one: %6d Gap limit two: %6d',
481 & -Shift2, Shift1 )
482 > If (Std) then
483 > Call SWriteF(Text3,'\n
484 > & -- Statistics with random sequences containing the\n
485 > & same number of bases and the same base content --\n\n
486 > & Trials: %5d\n
487 > & Avg. Quality: %5.1f\t Std. Deviation: %5.1f\t
488 > & Min: %5.1f\t Max: %5.1f\n
489 > & Avg. Ratio: %5.3f\t Std. Deviation: %5.3f\t
490 > & Min: %5.3f\t Max: %5.3f \n\n',
491 > & StdTimes, QuDAvg, QuDStd, MinQu, MaxQu,
492 > & RatDAvg, RatDStd, MinRat, MaxRat)
493 > End If
494
495 !* now write the output
496
497 If ( Out1File.ne.0 ) then
498 Call FWriteF(Out1File, '\n%s\n', Text)
499 If ( Limit ) Call FWriteF(Out1File,'\n%s\n',Text2)
500 > If ( Std) Call FWriteF(Out1File,'\n%s',Text3)
501 > If ( Std) Call FWriteF(Out1File,AboveStr)
502 Call BaseName(Seq1Name)
503 Call BaseName(Seq2Name)
--
-- End of changes to GAP
--