
7.1. Cluster alignments


Aligning several related sequences so that their homology is displayed as fully as possible highlights conserved and divergent regions. One way to do this is to run GAP/out repeatedly on pairs of sequences, until no pairwise comparison introduces new gaps.

That method is slow and can introduce some spectacular errors. For some sets of sequences it never settles: each round of pairwise gapping introduces new gaps.

To avoid these problems, sequences are best aligned in pairs, and the resulting groups aligned by a "cluster method". The programs PILEUP and GCLUSTALV both use the cluster method to align all the sequences.
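The idea behind a cluster method can be sketched in a few lines of Python. This is only an illustration of the general principle (join the closest pair first, then join groups), not the exact procedure used by PILEUP or ClustalV. The function name, the sequence names and the distance values are all hypothetical.

```python
from itertools import combinations


def cluster_order(names, dist):
    """Return the order in which groups are joined by a simple
    average-linkage clustering.

    names: list of sequence names
    dist:  dict mapping frozenset({a, b}) -> pairwise distance
           (hypothetical values, e.g. 1 - fraction of identical residues)
    """
    clusters = [(n,) for n in names]
    merges = []

    def average_distance(c1, c2):
        # Mean of all pairwise distances between members of the two groups.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the two closest groups and join them.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges


# Made-up distances for three sequences A, B and C.
names = ["A", "B", "C"]
dist = {
    frozenset(("A", "B")): 0.2,
    frozenset(("A", "C")): 0.6,
    frozenset(("B", "C")): 0.5,
}
merges = cluster_order(names, dist)
for left, right in merges:
    print("align", left, "with", right)
```

Here A and B are joined first because they are the closest pair, and C is then aligned to the A/B group. Each sequence enters the alignment once, so the endless re-gapping described above cannot occur.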

7.1.1. PILEUP

We first define the set of sequences we wish to align. One way is to make a file of sequence names (FOSN) using the VAX editor. For this exercise we shall align three Ig heavy chain proteins that are in the PIR1 database:

$Edit prot.fil

*Change

pir1:avms67
pir1:evrtr2
pir1:hvcqg4

<ctrl-Z>

*exit

If you prefer to do the exercise with rRNA sequences, try it with the following five entries in the EMBL database: BA5SRR, BE5SRR, CC5SRR, GC5SRR, HF5SRR.

Check that the FOSN file is correct before starting the alignment, then:
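The check is normally done by eye (for example by typing the file to the screen). If you keep a copy of the file on a machine with Python, a rough sketch of an automated check might look like the following. It assumes the simple one-name-per-line layout used in this exercise and does not cover other list-file syntax that GCG may accept. The function names are hypothetical.

```python
def parse_fosn(text):
    """Split a file of sequence names into (database, entry) pairs.

    Assumes one name per line, either 'database:entry' or a plain
    file name (database is then None). Blank lines are skipped.
    """
    entries = []
    for line in text.splitlines():
        name = line.strip()
        if not name:
            continue
        database, sep, entry = name.partition(":")
        if sep:
            entries.append((database, entry))
        else:
            entries.append((None, name))
    return entries


def find_problems(entries):
    """Report empty names, embedded spaces and duplicate entries."""
    problems = []
    seen = set()
    for database, entry in entries:
        if not entry or " " in entry:
            problems.append(f"malformed entry: {entry!r}")
        key = ((database or "").lower(), entry.lower())
        if key in seen:
            problems.append(f"duplicate entry: {entry}")
        seen.add(key)
    return problems


fosn_text = """pir1:avms67
pir1:evrtr2
pir1:hvcqg4
"""
entries = parse_fosn(fosn_text)
problems = find_problems(entries)
print(entries)
print(problems or "no problems found")
```

A check like this catches only layout mistakes. It cannot tell whether an entry name actually exists in the database.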

$ Pileup

PileUp of what sequences ? @prot.fil

1 Avms67 144 aa
2 Evrtr2 142 aa
3 Hvcqg4 117 aa

What is the gap weight (* 3.00 *) ?

What is the gap length weight (* 0.10 *) ?

{accept defaults}

What should I call the output file name (* PileUp.MSF *) ?

Submit Pileup job to which batch queue (* LONG *) ? short

You will receive a message when the job completes; you can also log off and come back later. Always check the LOG file to see whether the job ran correctly.

The files created are: PILEUPMSF.LOG, PILEUP.MSF and PILEUP.FIGURE

PILEUP.MSF shows the multiple alignment, with gaps added. Run the program FIGURE to see how the clusters were aligned.

$ Figure Pileup.figure
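To get a feel for the two gap prompts in the session above, the sketch below assumes the usual form of the penalty, in which a gap costs the gap weight once plus the gap length weight for every position in the gap. That formula is an assumption here (the session does not state it), and the function name is hypothetical. The default values are the ones offered by the prompts.

```python
def gap_penalty(length, gap_weight=3.0, gap_length_weight=0.1):
    """Penalty for one gap of the given length.

    Assumes penalty = gap_weight + gap_length_weight * length;
    a gap of length 0 (no gap) costs nothing.
    """
    if length <= 0:
        return 0.0
    return gap_weight + gap_length_weight * length


for length in (1, 5, 20):
    print(length, round(gap_penalty(length), 2))
```

With these defaults, opening a gap is expensive compared with extending one, so the program tends to prefer a few long gaps over many short ones. Raising the gap weight makes gaps rarer; raising the gap length weight makes them shorter.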

7.1.2. GCLUSTALV

GCLUSTALV converts GCG sequence files to PIR format, then runs ClustalV, which is not a GCG program. ClustalV aligns the PIR format sequences and writes the alignment in the requested format, such as a GCG MSF file.
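For readers who have not met PIR format, the sketch below writes one protein record in the general PIR (NBRF) layout: a ">P1;" line carrying the entry name, one description line, then the sequence terminated by an asterisk. What GCLUSTALV itself writes is not shown in this course, so treat this as an illustration of the format only. The function name, entry name and sequence are made up.

```python
def to_pir(name, description, sequence, width=60):
    """Return one protein record in PIR (NBRF) layout.

    Line 1: '>P1;' followed by the entry name (P1 marks a protein).
    Line 2: free-text description.
    Then the sequence, wrapped to `width` columns and ended with '*'.
    """
    body = sequence.upper() + "*"
    lines = [f">P1;{name}", description]
    lines += [body[i:i + width] for i in range(0, len(body), width)]
    return "\n".join(lines) + "\n"


# Made-up example record.
record = to_pir("demo1", "hypothetical example protein", "mkvlaagi")
print(record, end="")
```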

As in the previous exercise, you must first make a FOSN, but this time we will do it a different way. First, copy the PIR1 entries using FETCH and treat them as if they were your own sequence files:

$ FETCH @PROT.FIL

The sequences copied from the PIR1 database are now in files called avms67.pir1, evrtr2.pir1 and hvcqg4.pir1. Now, using the VAX editor, create a file of sequence names called CLUST.FIL, entering the names of the files just fetched (those ending in .PIR1):

$ Edit CLUST.FIL

*Change

avms67.pir1
evrtr2.pir1
hvcqg4.pir1

<ctrl-Z>

*ex

To run GCLUSTALV:

$ GCLUSTALV

GCLUSTALV with what LINEUP file ? @CLUST.FIL

Choose output format:
[GCG] GCG MSF format
[PHY] PHYLIP format
[PIR] PIR format
[CLU] Clustal format

Which format (* GCG *) ?

Are the sequences (P)rotein or (D)NA (* P *) ?

Choose weight matrix:
[250] PAM 250
[100] PAM 100
[ID] Identity matrix

Which matrix, or give filename (* 250 *) ?

What fixed gap penalty [1-500] (* 10 *) ?

What floating gap penalty [1-500] (* 10 *) ?

Submit to which batch queue (* LONG *) ? Short

Job GCLUSTALV (queue ARCB$SHORT, entry 164) started on ARCB$SHORT
Your output files should be CLUST.MSF and CLUST.DND
Logfile created as CLUST.LOG;

When the batch job has completed, the new files are:

CLUST.MSF - The alignment results.

CLUST.DND - The dendrogram for use with the interactive program ICLUSTALV.

CLUST.LOG - The log file.

CLUSTALV provides more than one symbol comparison table. The choice of table can influence the alignment!
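A toy example shows why the table matters. The values below are invented for illustration and are not taken from PAM 250 or PAM 100. The point is only that an identity table rewards exact matches alone, while a similarity table also rewards substitutions between similar residues, so the two can prefer different alignments.

```python
def identity(a, b):
    """Identity table: 1 for an exact match, 0 otherwise."""
    return 1 if a == b else 0


# Invented similarity values (NOT real PAM scores).
TOY = {frozenset("IL"): 2, frozenset("KR"): 3, frozenset("A"): 2}


def toy_similarity(a, b):
    """Toy table: similar residues score above zero."""
    return TOY.get(frozenset((a, b)), 0)


def column_score(columns, scorer):
    """Sum the score of each aligned pair of residues."""
    return sum(scorer(a, b) for a, b in columns)


# Two candidate ways of pairing up residues.
candidate_1 = [("I", "L"), ("K", "R")]   # two conservative substitutions
candidate_2 = [("A", "A"), ("G", "T")]   # one exact match, one mismatch

for label, scorer in (("identity", identity), ("toy", toy_similarity)):
    print(label,
          column_score(candidate_1, scorer),
          column_score(candidate_2, scorer))
```

Under the identity table the second candidate scores higher; under the toy similarity table the first one does. It is worth repeating an alignment with a different table if the result looks doubtful.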

