AFRC TRAINING MANUAL - CHAPTER 5 5. COMPARING SEQUENCES 5.1. Identifying sequence homology 5.1.1. Word comparison COMPARE identifies regions of homology between two sequences. This example compares RNA1.SEQ, a 5s rRNA from a red alga to RNA4.seq, a 5s rRNA from a diatom. $ Compare/word COMPARE/Word what horizontal sequence ? rna1.seq Begin (* 1 *) ? End (* 121 *) ? Reverse (* No *) ? to what vertical sequence (* rna1.seq *) ? rna4.seq Begin (* 1 *) ? End (* 118 *) ? Reverse (* No *) ? What comparison word size (* 6 *) ? 2 What should I call the output file (* Rna1.Pnt *) ? word2.pnt Submit Compare job to which batch queue (* LONG *) ? Short 5.1.2. Dotplotting The output from COMPARE can be shown as a graph using the program DOTPLOT. Dots that lie on the same diagonal are connected by lines to show stretches of identity against the background of random dots. $Dotplot/font=1 word2.pnt { accept defaults } 5.1.2.1. Interpreting the plot - Broadly speaking COMPARE arranges one sequence vertically, and the second sequence horizontally, as the Y and X axes on a graph. At the X,Y coordinates where identical residues coincide, a score of 1.0 is recorded. X,Y coordinates of mismatches score 0.0. Using DOTPLOT, dots are plotted at the X,Y coordinates which scored 1.0. 5.1.2.2. The effect of word size - The example above would fit the 'broadly speaking' description exactly if a 'WORD' size of 1 was used. With wordsize=2 the program scores 1.0 when any two consecutive residues match on both axes. By increasing the word size we can reduce the background. Some examples of increasing the word size will be demonstrated. They are also available in the files WORD4.PNT and WORD6.PNT. (The disadvantage is that regions of partial homology disappear.) 5.1.2.3. Types of patterns - A sequence COMPAREd against itself gives a perfect diagonal line. Offsets from the diagonal indicate where deletions/insertions have taken place. Overlapping lines would indicate duplications. The 'off-diagonal' lines may indicate other features (here: internal repeats). Inversions are best seen by REVERSING one of the sequences, as COMPARE and DOTPLOT only join diagonals from bottom-left to upper-right. The diagrams in appendix C show some simple examples of DOTPLOT relationships between sequences. 5.1.3. Window comparison A more sensitive method of identifying partial homology is the 'window-stringency' method. We decide the number of consecutive residues (window) and the program scans ALL registers of this length across both sequences. In each window the program adds up the score of matches and mismatches. If the score reaches a threshold (stringency) number of matches, a dot is placed at the X,Y coordinate in the centre of the window. The higher the 'stringency', and the smaller the 'window', the more precise the match has to be before being displayed. This method reduces the obscuring background whilst retaining regions with a defined amount of mismatching. (When window and stringency both equal 1 the result is the same as the /word=1 comparison.) $ Compare COMPARE what horizontal sequence ? rna1.seq Begin (* 1 *) ? End (* 121 *) ? Reverse (* No *) ? to what vertical sequence (* rna1.seq *) ? rna4.seq Begin (* 1 *) ? End (* 118 *) ? Reverse (* No *) ? What comparison window size (* 21 *) ? What stringency (* 14.0 *) ? 12 What should I call the output file (* Rna1.Pnt *) ? rna12.pnt Submit Compare job to which batch queue (* LONG *) ? Short The output can be plotted using $ DOTPLOT/font=1 File RNA14.PNT was obtained using the default stringency value. 5.1.4. Comparison of proteins For peptide sequences the COMPARE/WORD method is the same as for DNA and RNA sequences, displaying only identical regions. The window/stringency approach is different, as a different scoring system is used. The stringency is no longer equivalent to 'the number of matches in the window', but is 'the similarity score within the window'. $ Compare COMPARE what horizontal sequence ? pir1:avms67 Begin (* 1 *) ? End (* 144 *) ? Reverse (* No *) ? to what vertical sequence (* pir1:avms67 *) ? pir1:hvcqg4 Begin (* 1 *) ? End (* 117 *) ? Reverse (* No *) ? What comparison window size (* 30 *) ? 5 What stringency (* 9.0 *) ? 4 What should I call the output file (* Avms67.Pnt *) ? Submit Compare job to which batch queue (* LONG *) ? Short The default window and stringency values are set in GCG for nucleotide comparisons. The ratio of window size to stringency is usually too high for protein comparisons, giving too much suggested homology. 5.1.5. Symbol comparison tables Nucleotide sequence comparison uses a simple scoring system of 1.0 for a match and 0.0 for a mismatch. For amino acids a range of scores from 1.5 for a match to -1.2 for a mismatch is scored. The pair-scores are based on the PAM250 table (Point Accepted Mutations) which was derived by observing the most frequent substitutions in related proteins. In effect this combines a measurement of functional and evolutionary similarity between different amino acids. See page 2-9 for an example of a symbol comparison table. 5.2. Sequence alignments 5.2.1. BESTFIT BESTFIT finds the best single region of similarity between two sequences. Note the high default gap penalties. $ Bestfit BESTFIT of what sequence 1 ? rna1.seq Begin (* 1 *) ? End (* 121 *) ? Reverse (* No *) ? to what sequence 2 (* rna1.seq *) ? rna4.seq Begin (* 1 *) ? End (* 118 *) ? Reverse (* No *) ? What is the gap weight (* 5.00 *) ? What is the gap length weight (* 0.30 *) ? What should I call the paired output display file (* Rna1.Pair *) ? Type the file RNA1.PAIR to see the alignment. 5.2.2. GAP GAP aligns two sequences, finding a 'maximum similarity'. There will be no 'better' alignments, although there may be others just as good. GAP is more tolerant of mismatches than BESTFIT, but it can be confounded by duplications. If you run GAP in the same way as described for BESTFIT you will notice a marked difference in the alignment. $ Gap GAP of what sequence 1 ? rna1.seq . . to what sequence 2 (* rna1.seq *) ? rna4.seq {accept all the defaults } Type the file RNA1.PAIR to see the alignment. Using GAP with the /out command line modifier tells GAP to write out the aligned sequences as new files. The new files have the file ending .GAP (eg: RNA1.GAP and RNA4.GAP) GAP cannot cope with extremely long sequence alignments - you can try using FASTA instead! The alignment is different because BESTFIT uses a higher penalty for mismatches (-0.9) than GAP (0.0). The GCG data files that define these penalties are SWGAPDNA.CMP (BESTFIT), and NWSGAPDNA.CMP (GAP). You can also vary the alignments by changing the gap-weight and gap-length weight (penalty scores). 5.2.3. Protein sequence alignment Alignment of protein sequences is slightly different. Using the PAM250 table (SWGAPPEP.CMP) of similarities between different amino acids, the paired display shows a range of relationships. $ Gap GAP of what sequence 1 ? pir1:avms67 . . to what sequence 2 (* pir1:avms67 *) ? pir1:hvcqg4 {accept all the defaults } Type the file AVMS67.PAIR. '|' is for identical residues, ':' is for comparison values >= 0.5, '.' is for >= 0.1. If the modifier /PAIR is used, eg: $ GAP/PAIR=1.0,0.5,0.1 then the '|' symbol is used for comparison values >= 1.0, ':' for >=0.5 and '.' for >=0.1 5.2.4. Alignment measurements The alignments and comparison measurements are very sensitive to changes in the symbol comparison values and the gap weights. 5.2.4.1. Quality Quality is the optimised (ie:maximum) value calculated by summing the matches, mismatches, gaps and gap weightings for that pair of sequences. You cannot compare the quality of one pair of sequences' alignment to another. 5.2.4.2. Ratio Ratio is the Quality divided by the number of residues in the shorter sequence. ie: mean quality per aligned symbol. 5.2.4.3. Identity Percent Identity is the percent of the residue symbols that match. (NB:Match = equal or exceed the match 'threshold'). This figure is NOT optimised. 5.2.4.4. Similarity Percent Similarity is the percent of the residue symbols that are similar. This figure is NOT optimised. The threshold for determining the similarity is the second parameter in the command line option /PAIR=1.0,0.5,0.1 The /limit modifier should be used when aligning long sequences with GAP or BESTFIT. This defines the maximum permitted length of gaps (the default is the length of the smallest sequence!) Page 5-6 intentionally blank