
6.5. Interpreting PROSRCH output


6.5.1. Data check

From the top of the PROSRCH output:

Additional Alignments displayed to complete coverage of Query Sequence, if necessary.
Query query23620
Loaded at Fri Mar 20 15:19:16 1992
No description supplied.
Using Residues 1 to 100; Sequence Length 100
KVCGGGSTYVLIKLGIILVSLMYLGGLEGLLLKGEIGCGNCGCVITGLWR
RIGHGVSGCNACLKVCGGGLTYVAIGLSTYVIELKRKGSKVCGGGSTYVL
Drawing Individual Alignment Maps.

"query23620" is the DAP's job number. Any technical problems encountered can be traced using this reference. The program parameters (such as whether cover and map were selected) are shown, along with your query sequence. Check that the number of residues in your query sequence has been interpreted correctly; if it has not, your data may have been sent in the wrong format.
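One way to make this check is to count the residues yourself and compare the result with the "Sequence Length" reported in the output. The following is a minimal Python sketch using the query sequence from the example above; the function name count_residues is illustrative and is not part of PROSRCH.

```python
def count_residues(sequence_text):
    """Count alphabetic residue symbols, ignoring spaces and line breaks."""
    return sum(1 for ch in sequence_text if ch.isalpha())


# Query sequence as echoed in the example output above.
query = """
KVCGGGSTYVLIKLGIILVSLMYLGGLEGLLLKGEIGCGNCGCVITGLWR
RIGHGVSGCNACLKVCGGGLTYVAIGLSTYVIELKRKGSKVCGGGSTYVL
"""

reported_length = 100  # "Sequence Length 100" in the example output

# Prints the local count and whether it agrees with the reported length.
print(count_residues(query), count_residues(query) == reported_length)
```

If the two numbers disagree, the sequence was probably not read in the format you intended.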

6.5.2. Symbol comparison table

Pick-a-Pam Noise Reduction (if applied)
PAM Value supplied : 100
Check List of All Table Values: Indels -14
23 X 23 Similarity Score Table
A B C D E F G H I K L M....
A 6 -1 -5 -1 0 -7 1 -5 -3 -4 -5 -3....
B -1 8 -8 8 5 -6 -1 2 -4 1 -6 -5....
.
.
X 0 0 0 0 0 0 0 0 0 0 0 0....
Y -6 -3 -2 -9 -7 6 -11 -1 -4 -10 -5 -8....
Z 0 5 -11 5 8 -10 -2 4 -5 -1 -3 -2....

The PAM value, indel penalty (=gap penalty) and noise reduction factor (=empirically derived adjustment) define the symbol comparison table. Low PAM values are for identifying closely related sequences. High PAM values are for identifying distant relationships. All the alignments shown later have been made using this table, just as BESTFIT uses a symbol comparison table.

6.5.3. Additional information

Using protein sequence Database pir31 Packed Sun Mar 8 20:39:29 1992
Proteins 36150 Residues 10324010

These lines give the database name (PIR31), the time it was last packed, and its contents (the number of proteins and the total number of residues).

6.5.4. How PROSRCH works

PROSRCH does a BESTFIT-like alignment (i.e. "best local homology") of the query against each entry in the database. PROSRCH can record up to 4096 local alignments for each pair of sequences. The score of each alignment is recorded, and the top 16000 or so scores are retained (reported as "Total Number of Results Collected"). The scores can be plotted as a graph (see figure).
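To illustrate what a "best local homology" score is, here is a minimal Python sketch of the standard Smith-Waterman dynamic programming recurrence. It is not PROSRCH's own code, and the source does not say which algorithm PROSRCH implements internally. The assumptions are: a linear indel penalty of -14 (the "Indels" value in the example output), and a toy scoring function (+6 for identical residues, -3 otherwise) standing in for a real PAM table.

```python
def local_alignment_score(a, b, score, indel=-14):
    """Best local alignment score of a and b (Smith-Waterman, linear indel cost)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            curr[j] = max(
                0,                                        # start a new local alignment
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align two residues
                prev[j] + indel,                          # indel in b
                curr[j - 1] + indel,                      # indel in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


def toy_score(x, y):
    """Hypothetical stand-in for a similarity table; not a PAM matrix."""
    return 6 if x == y else -3


# Two short fragments sharing the run "LWRR".
print(local_alignment_score("GLWRRIGHG", "TLWRRNANG", toy_score))
```

Because any cell may restart at zero, the score reflects only the best-matching region and ignores poorly matching flanks, which is why a local alignment may be extendable.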

6.5.5. Figure: score vs log (number of entries).

Taken from Collins J.F., Coulson A.F.W., & Lyall A., CABIOS, 4 67-71, (1988)


Fig1: Plot of the number of reported alignments achieving a given score (logarithmic scale) against the score value, for a search of the protein database with bovine proglucagon as the query sequence, using the 100 PAM similarity table. The straight line is fitted to the lowest 97% of the reported part of the distribution. The top 49 results represent unequivocally related sequences.

6.5.6. The score distribution and statistics

The major assumption of this method is that most of the alignments will be random. The top 3% are assumed to include all the significant alignments. The other 97% are treated as random alignments and are used to make a least-squares fit of log (number of entries) against score (see figure 1).

Two columns in the output show the observed number of sequences in the database which achieved that score, and the number expected with that score. Alignments with an expected frequency of <1 are of main interest. Note that the statistics are based not on the whole database, but only on the "best" alignments observed.

Total Number of Results Collected 14373
Final Threshold Value is 43
Maximum Score 117
Score Observed Predicted
43 2717 2797.
44 2297 2257.
45 1726 1820.
46 1308 1469.
47 1098 1185.
48 1027 955.7
49 814 771.0
50 569 622.0
.
.
67 15 16.14
68 17 13.02
69 12 10.50
70 7 8.474
71 6 6.836
72 9 5.515
73 4 4.449
74 7 3.589
75 2 2.895
76 4 2.336
77 3 1.884
78 4 1.520
79 2 1.226
80 2 0.9891
83 1 0.5193
84 1 0.4189
89 1 0.1431
92 3 0.7514E-01
94 1 0.4890E-01
95 1 0.3945E-01
99 1 0.1671E-01
105 3 0.4604E-02
107 2 0.2996E-02
108 2 0.2417E-02
115 1 0.5374E-03
117 2 0.3498E-03

Statistical Details of Fitted Distribution:
Intercept 17.1725; Sigma 0.2883
Slope -0.2148; Sigma 0.0056
Chi-square 0.2406
Distribution fitted from Score= 43 to 60
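The "Predicted" column can be checked against the fitted line. The printed values are consistent with predicted = exp(Intercept + Slope x Score), using the natural logarithm; this relationship is inferred from the numbers in the listing and is not stated by the program. A minimal Python sketch:

```python
import math

# Values from "Statistical Details of Fitted Distribution" in the listing above.
INTERCEPT = 17.1725
SLOPE = -0.2148


def predicted_count(score):
    """Expected number of random alignments at this score, from the fitted line."""
    return math.exp(INTERCEPT + SLOPE * score)


# Compare with the Predicted column entries for scores 43, 80 and 117.
for score in (43, 80, 117):
    print(score, f"{predicted_count(score):.4g}")
```

To within the rounding of the printed intercept and slope, this reproduces the listed values of 2797., 0.9891 and 0.3498E-03. The predicted value first drops below 1 at a score of 80, which is where the alignments of main interest begin in this example.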

6.5.7. The alignments

No. 1. >S15027 443 Amino-acids
*GATA-3 protein - Human
Score= 117 Quality 41.053
13 IDs; 7 CONs; 10 MisMatches; 0 Gaps. Exp. No. 0.3498E-03 SD 27.6

Ratio(Found/Expected) Identities 1.006; Positives 0.962; Negatives 1.088

        ..  .*.**    * **** ..* . ****
    312 RAGTSCANCQTTTTTLWRRNANGDPVCNAC 341
     33 KGEIGCGNCGCVITGLWRRIGHGVSGCNAC 62*

The query sequence is shown on the bottom line, with the database sequence it is compared with on the top line. Note that the alignment shown is a local (BESTFIT) homology, not a global (GAP) homology, so it may be possible to extend the alignment.

The symbols are simply to show the clustering of good and bad alignments.

'*' for identities

'.' for positive-scoring substitutions,

' ' for negative-scoring and zero-scoring substitutions.

'-' for a deletion or insertion in the sequence.
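The symbol line can be rebuilt from the two aligned rows and the similarity table. The Python sketch below does this for alignment No. 1. Two things are assumptions: the set POSITIVE_PAIRS lists only the substitutions that the example marks with '.', as a stand-in for the full PAM table, and a gap position is given a blank in the symbol line, since the source does not show a gapped example.

```python
# Substitutions marked '.' in alignment No. 1; a hypothetical stand-in for
# looking up positive scores in the full similarity table.
POSITIVE_PAIRS = {frozenset(p) for p in ("RK", "AG", "SG", "NH", "PS")}


def toy_score(x, y):
    """Return +1 for a pair listed as positive-scoring, otherwise -1."""
    return 1 if frozenset((x, y)) in POSITIVE_PAIRS else -1


def marker_line(top, bottom, score):
    """Build the symbol line for two aligned rows of equal length."""
    symbols = []
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            symbols.append(" ")   # assumption: gaps get no symbol
        elif x == y:
            symbols.append("*")   # identity
        elif score(x, y) > 0:
            symbols.append(".")   # positive-scoring substitution
        else:
            symbols.append(" ")   # negative- or zero-scoring substitution
    return "".join(symbols)


top = "RAGTSCANCQTTTTTLWRRNANGDPVCNAC"     # database residues 312 to 341
bottom = "KGEIGCGNCGCVITGLWRRIGHGVSGCNAC"  # query residues 33 to 62
line = marker_line(top, bottom, toy_score)
print(line)
print(line.count("*"), line.count("."), line.count(" "))
```

The three counts correspond to the "13 IDs; 7 CONs; 10 MisMatches" reported for this alignment.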

6.5.8. The individual alignment scores

The alignments are ranked by score, and a quality value is also calculated for each. Quality is the score expressed as a percentage of the maximum possible score for the alignment shown (that is, the score obtained by comparing that part of the query sequence with itself).
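As a worked example of the quality calculation, the Python sketch below divides the alignment score by the self-comparison score. The self-comparison scores used here (285 and 158) are not printed by PROSRCH; they are back-calculated from the Score and Quality values in the two example alignments in this section and should be read as an inference.

```python
def quality(score, self_score):
    """Alignment score as a percentage of the query segment's self-comparison score."""
    return 100.0 * score / self_score


# (score, inferred self-comparison score) for alignments No. 1 and No. 51.
for score, self_score in ((117, 285), (59, 158)):
    print(score, round(quality(score, self_score), 3))
```

These give 41.053 and 37.342, matching the Quality values shown for the two alignments.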

Each alignment shows the expected number of alignments and a standard deviation. The program does not show alignments with an expected number >14. All alignments with an SD of >6 may be considered statistically significant.

The number of identities, conservative substitutions (in fact: all positive scoring comparisons), and substitutions (in fact: negative and zero-scoring comparisons) are recorded. The ratio of observed to expected is also shown.

6.5.9. Score ratios and PAM tables

Local homology alignment scores are very sensitive to the symbol comparison table, i.e. to the PAM value used. The ratio values can indicate whether a different PAM value would give a better result.

If the number of identical matches is less than expected (i.e. the identities ratio is <1.0), then repeating the search with a higher PAM value is likely to bring this alignment higher up the output list. If the ratio is >1.0, then lowering the PAM value may increase the discrimination.
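The rule above can be written as a small helper. This is only a sketch of the guidance in this section: the function name pam_hint is illustrative, and the behaviour for a ratio of exactly 1.0 is an assumption, since the text does not cover that case.

```python
def pam_hint(identities_ratio):
    """Suggest a direction for the next PAM value from the identities ratio."""
    if identities_ratio < 1.0:
        return "try a higher PAM value"
    if identities_ratio > 1.0:
        return "try a lower PAM value"
    return "keep the current PAM value"  # assumption: not covered by the text


# Identities ratio reported for alignment No. 1 in the example output.
print(pam_hint(1.006))
```

A ratio as close to 1.0 as in this example suggests that any gain from changing the PAM value is likely to be small.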

6.5.10. Mapping

When the MAP option in PROSRCH is used, a simple diagram of the location of each local alignment is shown.

Result Start Final
No. Score Residue Residue
( 1)-------------------------------------------- ( 100)
1 117 312 ---------------- 341
7 107 263 ------------- 287
No further Alignments recorded from this Sequence.

The map above shows that the highest-scoring match and the seventh-highest-scoring match are both with the same database sequence: residues 312 to 341 and residues 263 to 287 of that sequence each match part of the 100-residue query sequence. In this example the two matches also lie in the same region of the query sequence.

An accumulated map is shown at the end, e.g.:

Accumulated Map Display: Max Value = 23

- -
-- - -
---------- --------
--------------------
------------------------
---------------------------
---------------------------
-----------------------------
--------------------------------------------------

The bottom line shows the full length of the query sequence. The matched regions are above. Clearly, in this example, many residues have not been aligned in the top scoring alignments.

The Max value is the greatest number of times that a single residue has been matched in the best alignments.
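The accumulated map is in effect a count, for each query residue, of how many of the reported alignments cover it. The Python sketch below shows the idea. The interval (33, 62) is the query range of alignment No. 1; the other two intervals are hypothetical, and the function names are illustrative rather than part of PROSRCH.

```python
def accumulate_coverage(query_length, intervals):
    """Count how many alignments cover each query residue (1-based, inclusive)."""
    counts = [0] * query_length
    for start, end in intervals:
        for pos in range(start, end + 1):
            counts[pos - 1] += 1
    return counts


def uncovered_residues(counts):
    """Return the 1-based positions that no alignment covers."""
    return [i + 1 for i, c in enumerate(counts) if c == 0]


# Query ranges of three alignments; only the first is from the example output.
intervals = [(33, 62), (30, 55), (1, 18)]
counts = accumulate_coverage(100, intervals)

print("Max Value =", max(counts))
print("Uncovered residues:", len(uncovered_residues(counts)))
```

Residues with a count of zero are the ones that have not been aligned in any of the top-scoring alignments.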

6.5.11. Additional alignments

If the COVER option in PROSRCH was selected, then the best scoring alignment FOR EACH INDIVIDUAL RESIDUE not matched in the overall best scores is given. This may be useful when one part of the query sequence generates so many alignments that it dominates the output list. This feature is a major advantage over FASTA.

Additional Best alignment to cover residue 1

No. 51. >A36136 442 Amino-acids
Citrate transport protein - Lactococcus lactis
Score= 59 Quality 37.342 Loc. Exp. No. 0.1403E+01
8 IDs; 4 CONs; 6 Mismatches; 0 Gaps. Exp. No. 0.9539E+02
Ratio(Found/Expected) Identities 1.032; Positives 0.962; Negatives 1.088

        **  * ..*  * *.*.*
    289 KVFPGINAYAFIILSIVL 306
      1 KVCGGGSTYVLIKLGIIL 18

Note - these alignments can be well within the random-alignments part of the calculation (as the above example clearly is). It may be possible to alter this by using a different PAM table.

6.5.12. A PROSRCH strategy

Common motifs (e.g. signal peptides) may be found in many proteins. To stop these alignments dominating the output, remove those segments from your sequence before submitting a search. This allows the other domains to be checked.

Submit segments of the protein, of 100 residues or fewer, to detect weak alignments. They may not be significant by their score alone (which is where FASTA can lose them), but the nature of the proteins in which matching segments are found may help the interpretation.

Vary the PAM value. An alignment can appear highly significant over a range of low PAM values and then decrease in significance at higher PAM values. In itself this suggests that the higher PAM values are producing new alignments to unrelated proteins, but it does not show how important the homology found at the lower PAM values is. A starting value of 100 PAM is suggested.

Once a sequence is identified, use other sequence comparison programs (GAP, BESTFIT, COMPARE, SEQHB) to investigate further.

