AFRC TRAINING MANUAL - CHAPTER 10 10. SEQUENCE PROFILING 10.1. The profiling method A "profile" is created from a group of aligned sequences. The profile is a position-specific table of symbol comparison values and gap weights. Instead of each symbol having the same match- value at every position, the match-value is higher at the positions where the symbol is conserved in the multiple alignment. Example three-sequence alignment: Position 1 5 144 Seq 1 M k M W L n W ..... T V S S Seq 2 M d L R L t Y ..... T V S S Seq 3 M t h W L c F ..... . . . . Consensus M - - W L - - ..... T V S S The profile table looks like this: Cons A B C D ... L M N ... Y Z Gap Len M 0 -30 -60 -40 ... 130 150 -30 ... -10 -10 100 100 D 15 35 -19 42 ... -19 -4 27 ... -29 29 100 100 L -4 -8 -31 -10 ... 54 52 -4 ... 10 4 100 100 W -46 -33 -66 -55 ... 17 -11 -13 ... 43 -36 100 100 L -10 -50 -80 -50 ... 150 130 -40 ... 30 -20 100 100 . . S 20 15 35 10 ... -20 -15 15 ... -20 0 10 10 * 25 0 9 16 ... 37 10 13 ... 23 0 10.2. Description of the profile table The top row shows all the possible symbols. Each subsequent row contains a score for the alignment of the symbol in that sequence position. eg: For position 1 symbol A scores 0, B scores -30, C scores -60, and symbol M scores 150; for position 3 symbol A scores -4, L scores 54 etc. The highest scoring symbol in each row is taken as a representative symbol for that position in the sequence, and is copied into the first column. This 'consensus' shows the symbol with the smallest evolutionary distance from all the residues known to exist at that position. The values under 'Gap' and 'Len' are percentages, and show how the application of gap penalties should vary at each position depending on whether the region has been observed to have gaps. For our example of three sequences, one of the sequences has a gap at the end, hence the 10% gap penalties in the last position (consensus symbol S). The last row (starting *) shows the composition of the multiple alignment. In the first column the total number of A's in all the sequences of the alignment is shown. 10.2.1. A profile as a symbol comparison table The profile is constructed from a symbol comparison table (chapter 2-8, 2-9) using the multiple alignment to determine a weighting favouring some symbols. The result is a position- specific symbol comparison table. There are two different programs available to create a profile: PROFILEMAKE uses the PAM250 table, and PROFILEWEIGHT (a non-GCG program) uses the BLOSUM62 table. Using PROFILEMAKE the L symbol scores 150 in position 5, but only 54 in position 2. The original alignment shows all three sequences with an L at position 5, but only one has L in position 3. Note that the L-M match in the PAM250 table (2-8) is a relatively good one (1.3, where a perfect match is 1.5). The "consensus" sequence is defined very differently to the consensus produced by PRETTY. (So different, in fact, that a different term might be more suitable.) 10.3. Finding a new member of the alignment Any unrepresented symbols at a given position may actually be found in new sequences, so a small weighting (0.025) towards these symbols is included in the table, unless the /STRINGENT option is used. If we have a large enough family creating an alignment the effect of including an unrepresented symbol is much smaller. Instead of the usual symbol comparison tables used by GAP, BESTFIT etc, we use the new profile table. We can compare single sequences to the profile (PROFILEGAP), or scan a database (PROFILESEARCH). The comparison allows a new sequence to be aligned optimally to the family of sequences. The comparison of a sequence to a profile can emphasise similarity to conserved regions while tolerating diversity in other regions. For a full description of the profiling method you should read the essay on "Profile analysis" by John Devereux in chapter 7 of the GCG program manual. 10.4. Profiling tutorial 10.4.1. PROFILEMAKE PROFILEMAKE creates a symbol comparison file based on a group of aligned sequences, ie: the "profile" of the aligned sequences. $ Profilemake/stringent PROFILEMAKE of what aligned sequences ? chn.msf{*} chn.msf{Avms67}, begin: 1 end: 145 len: 145 weight: 1.00 chn.msf{Evrtr2}, begin: 1 end: 145 len: 145 weight: 1.00 chn.msf{Hvcqg4}, begin: 1 end: 145 len: 145 weight: 1.00 What should I Call the output profile (* Chn.Prf *) ? 10.4.2. PROFILEGAP PROFILEGAP shows a BESTFIT-like alignment of a sequence to a profile. Using /GLObal performs a GAP alignment, ie: across the whole sequence: $ Profilegap/global (Global) PROFILEGAP of what sequence(s) ? pir1:avms67 Begin (* 1 *) ? End (* 144 *) ? and what profile (* Avms67.Prf *) ? chn.prf What is the gap weight (* 4.50 *) ? What is the gap length weight (* 0.05 *) ? What should I Call the paired output display file (* Chn.Pair *) ? Type the CHN.PAIR file to see the alignment between the sequence and the profile which was calculated using PIR1:AVMS67 and similar sequences. As with GAP you may get a different alignment for different gap weightings. Also examine the comparison to PIR1:AVMST5. As with GAP you may get a different alignment for different gap weightings. 10.4.3. PROFILESEARCH PROFILESEARCH is the closest we can get to a PROSRCH on the VAX. It performs a "best local homology" on all entries in a database, using the profile, and sorts out the best alignments. $ Profilesearch PROFILESEARCH with what query profile ? chn.prf Search for query in what sequences(s) (* SwissProt:* *) ? What is the gap weight (* 4.50 *) ? What is the gap length weight (* 0.05 *) ? What should I Call the output file (* Chn.Pfs *) ? Submit to which batch queue (* OFFPEAK *) ? The Z statistic is obtained after normalising all the scores against sequence length. An equation is calculated, all very similar sequences eliminated, then the equation calculated again. As with PROSRCH this gives a comparison of the alignments values for dissimilar sequences against similar sequences. The Z score is the Standard Deviation for the derived fit. 10.4.4. PROFILESEGMENTS PROFILESEGMENTS displays the best alignment of the profile to each of the sequences identified by PROFILESEARCH. $ Profilesegments (Local) PROFILESEGMENTS from what PROFILESEARCH output file? chn.pfs Stop after how many alignments (* 15 *) ? What should I call the paired output display file ( * chn.pairs *) ? Type the CHN.PAIRS file. The sequences are scored for an optimal alignment. Note the QUALITY scores. Unlike PROSRCH only the best fit of the profile to any sequence will be found. Alternative alignments may exist. 10.4.5. PROFILESCAN PROFILESCAN compares a sequence to known profiles for their similarity. $ Profilescan PROFILESCAN of what sequence(s) ? pir1:avms67 . . What profile library (* ProfileScan.fil *) ? What should I call the paired output file (* avms67.scan *)? What should I call the summary output file (* avms67.sum *)? Avms67.sum shows which profiles the sequence has been checked against. Avms67.scan summarises the scores, showing any BESTFIT- like alignments to the profile. 10.5. Nucleotide profiling Protein profiles can be used to scan the nucleotide databases using the EGCG program TPROFILESEARCH. The program TPROFILEGAP allows the comparison of a nucleotide sequence to a profile. (T)Profilesearch is slow and can only search databases of 30,000 (Profilesearch) or 80,000 (Tprofilesearch) entries. The Genbank and EMBL databases already exceed 100,000 entries, so you should restrict searches to divisions or a list of combined divisions which do not exceed the limits. You can check the division totals by typing the database .NAM files: $Type EMBLDIR:EMBL.NAM $Type GENBANKDIR:GENBANK.NAM A file of sequence names (FOSN) which defines the prokaryote, bacteriophage and virus divisions in EMBL would contain: EM_Ba:* EM_Ph:* EM_Vi:* If the file is called EMBUGS.FIL, the three divisions would be searched on using the following: Search for query in what sequence(s) (* GenEmbl:* ) ? @Embugs.fil PROFILESEARCH only examines the first 100,000 symbols of any sequence. (The limit for all other GCG programs is 350,000.) Page 10-6 intentionally blank