That method is slow and can introduce spectacular errors: with some sets of sequences, each successive pairwise alignment introduces new gaps.
To avoid these problems, sequences are best aligned in pairs, and the resulting groups aligned by a "cluster method". The programs PILEUP and GCLUSTALV both use the cluster method to align all the sequences.
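The idea behind a cluster method can be sketched in a few lines of Python. This is only an illustration of the general principle (join the most similar pair first, then join groups by their average similarity); it is not the code used by PILEUP or ClustalV, and the sequence names and similarity values below are made up for the example.

```python
def cluster_order(names, similarity):
    """Return the merges in the order they are made, most similar first.

    names      -- list of sequence names
    similarity -- dict mapping frozenset({name1, name2}) to a pairwise score
    """
    clusters = [(name,) for name in names]
    merges = []

    def average_similarity(c1, c2):
        # Mean of all pairwise scores between members of the two clusters.
        scores = [similarity[frozenset((a, b))] for a in c1 for b in c2]
        return sum(scores) / len(scores)

    while len(clusters) > 1:
        # Pick the pair of clusters with the highest average similarity.
        _, i, j = max(
            (average_similarity(c1, c2), i, j)
            for i, c1 in enumerate(clusters)
            for j, c2 in enumerate(clusters)
            if i < j
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Made-up pairwise similarities for three hypothetical sequences.
toy_scores = {
    frozenset(("seqA", "seqB")): 0.9,
    frozenset(("seqA", "seqC")): 0.4,
    frozenset(("seqB", "seqC")): 0.5,
}
toy_merges = cluster_order(["seqA", "seqB", "seqC"], toy_scores)
```

Here seqA and seqB are paired first, and seqC is then aligned against that pair as a group.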
We first define the set of sequences we wish to align. One way is to make a file of sequence names (FOSN) using the VAX editor. For this exercise we shall align three Ig heavy chain proteins that are in the PIR1 database:
$Edit prot.fil
*Change
pir1:avms67
pir1:evrtr2
pir1:hvcqg4
<ctrl-Z>
*exit
If you prefer to do the exercise with rRNA sequences, try it with the following five entries in the EMBL database: BA5SRR, BE5SRR, CC5SRR, GC5SRR, HF5SRR.
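A FOSN is just a plain text file with one sequence specification per line, so it can also be generated rather than typed. The following is a minimal sketch in Python; the helper name `fosn_text` is hypothetical, and the `embl:` prefix for the rRNA entries is an assumption modelled on the `pir1:` prefix used above (check the database name used at your site).

```python
def fosn_text(database, entries):
    """Build the contents of a file of sequence names, one 'database:entry' per line."""
    lines = [f"{database.lower()}:{entry.lower()}" for entry in entries]
    return "\n".join(lines) + "\n"


# The three Ig heavy chain proteins used in this exercise.
protein_fosn = fosn_text("pir1", ["avms67", "evrtr2", "hvcqg4"])

# The five rRNA entries; the 'embl' prefix is an assumption.
rrna_fosn = fosn_text("embl", ["BA5SRR", "BE5SRR", "CC5SRR", "GC5SRR", "HF5SRR"])
```

Writing `protein_fosn` to a file named prot.fil gives the same three lines entered with the editor above.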
Check that the FOSN file is correct before starting the alignment, then:
$ Pileup
PileUp of what sequences ? @prot.fil
1 Avms67 144 aa
2 Evrtr2 142 aa
3 Hvcqg4 117 aa
What is the gap weight (* 3.00 *) ?
What is the gap length weight (* 0.10 *) ?
{accept defaults}
What should I call the output file name (* PileUp.MSF *) ?
Submit Pileup job to which batch queue (* LONG *) ? short
You will receive a message when the job is completed, or you can log off and come back later. Always check the LOG file to see whether the job ran correctly.
The files created are: PILEUPMSF.LOG, PILEUP.MSF and PILEUP.FIGURE
PILEUP.MSF shows the multiple alignment, with gaps added. Run the program FIGURE to see how the clusters were aligned.
$ Figure Pileup.figure
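The two gap prompts in the PILEUP run control how heavily gaps are charged. As a rough sketch, assuming the usual GCG convention that a gap costs the gap weight once plus the gap length weight for every position it spans, the charge can be worked out as below. The function name `gap_cost` is hypothetical and not part of GCG; the default values are the ones offered at the prompts.

```python
def gap_cost(length, gap_weight=3.0, gap_length_weight=0.1):
    """Cost of one gap of the given length (assumed formula: weight + length weight * length)."""
    if length <= 0:
        return 0.0  # no gap, no charge
    return gap_weight + gap_length_weight * length


short_gap = gap_cost(1)   # one-residue gap with the default weights
long_gap = gap_cost(10)   # ten-residue gap with the default weights
```

Under this assumption, opening a gap is expensive but lengthening it is cheap, so with the defaults one long gap costs much less than several short ones. Raising the gap weight discourages gaps altogether; raising the gap length weight discourages long gaps.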
GCLUSTALV converts GCG sequence files to PIR format, then runs a non-GCG program, ClustalV. ClustalV aligns the PIR format sequences and writes the alignment in the requested format, such as a GCG MSF file.
As in the previous exercise, you must first make a FOSN. We will do it a different way this time. First, copy the PIR1 entries using FETCH and pretend they are your own sequence files:
$ FETCH @PROT.FIL
The sequences copied from the PIR1 database are now in files called avms67.pir1, evrtr2.pir1 and hvcqg4.pir1. Now, using the VAX editor, create a file of sequence names called CLUST.FIL (enter the names of the files just fetched, each ending in .PIR1).
$ Edit CLUST.FIL
*Change
avms67.pir1
evrtr2.pir1
hvcqg4.pir1
<ctrl-Z>
*ex
To run GCLUSTALV:
$ GCLUSTALV
GCLUSTALV with what LINEUP file ? @CLUST.FIL
Choose output format:
[GCG] GCG MSF format
[PHY] PHYLIP format
[PIR] PIR format
[CLU] Clustal format
Which format (* GCG *) ?
Are the sequences (P)rotein or (D)NA (* P *) ?
Choose weight matrix:
[250] PAM 250
[100] PAM 100
[ID] Identity matrix
Which matrix, or give filename (* 250 *) ?
What fixed gap penalty [1-500] (* 10 *) ?
What floating gap penalty [1-500] (* 10 *) ?
Submit to which batch queue (* LONG *) ? Short
Job GCLUSTALV (queue ARCB$SHORT, entry 164) started on ARCB$SHORT
Your output files should be CLUST.MSF and CLUST.DND
Logfile created as CLUST.LOG;
When the batch job has completed, the new files are:
CLUST.MSF - The alignment results.
CLUST.DND - The dendrogram for use with the interactive program ICLUSTALV.
CLUST.LOG - The log file.
CLUSTALV provides more than one symbol comparison table. The choice of table can influence the alignment!
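A toy example shows why the table matters. The sketch below scores two candidate alignments of the short sequences KR and RK under an identity table and under a similarity table. The table values and the flat per-position gap penalty are illustrative only; they are not taken from ClustalV, and all names are hypothetical.

```python
# Illustrative tables: identity scores only exact matches,
# similarity also rewards the conservative K/R substitution.
IDENTITY = {("K", "K"): 1, ("R", "R"): 1, ("K", "R"): 0}
SIMILARITY = {("K", "K"): 5, ("R", "R"): 6, ("K", "R"): 3}


def pair_score(table, a, b):
    """Look up a residue pair in either order."""
    key = (a, b) if (a, b) in table else (b, a)
    return table[key]


def alignment_score(table, row1, row2, gap_penalty):
    """Sum the column scores of an alignment; '-' marks a gap position."""
    total = 0.0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total -= gap_penalty
        else:
            total += pair_score(table, a, b)
    return total


ungapped = ("KR", "RK")    # pairs K with R twice
gapped = ("KR-", "-RK")    # lines up the two R residues, at the cost of two gaps


def preferred(table, gap_penalty=0.25):
    """Return whichever candidate alignment scores higher under the table."""
    return max((ungapped, gapped),
               key=lambda rows: alignment_score(table, rows[0], rows[1], gap_penalty))


best_with_identity = preferred(IDENTITY)
best_with_similarity = preferred(SIMILARITY)
```

With these made-up numbers the identity table prefers the gapped alignment, because it sees no value in pairing K with R, while the similarity table prefers the ungapped one. The same effect, on a larger scale, is why it is worth repeating an alignment with a different table and comparing the results.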