CLUSTALW

CLUSTALW is one of the most widely used sequence alignment programs. There are many reasons to align sequences. For example, differences between aligned sequences can be used to infer phylogenetic history and evolution. DNA sequences are often aligned to find identical residues when choosing sites to mutate, and alignments of protein sequences can identify conserved domains or motifs.

This section of the course will briefly describe the program CLUSTALW. We will begin with a brief description of the input and output formats, and then move on to the alignments per se. The session will conclude with a couple of examples of alignments performed by CLUSTALW.

What is CLUSTALW

CLUSTALW is a multiple sequence alignment program that can efficiently handle the alignment of many DNA or protein sequences. CLUSTALW takes a sequence file and aligns all the sequences in that file. It can be run either from the command line or through one of many web-based interfaces.


DNA Sequence input

All the sequences must be in one file and in the same format. CLUSTALW accepts many formats, but the simplest and most widely used is the FASTA format. We will use the FASTA format for all our analyses.
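As a rough illustration of the FASTA layout, each record is a header line starting with ">" followed by one or more lines of sequence. The sketch below reads such records from text. The function name read_fasta and the sample sequences are made up for illustration and are not part of CLUSTALW.

```python
def read_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            words = line[1:].split()
            # Assumption: the name is the first word after '>'.
            name, chunks = (words[0] if words else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


# Hypothetical input: two short DNA sequences in one file.
example = """>seq1 hypothetical example
ACGTACGT
ACGT
>seq2
ACGTTCGT
"""
records = read_fasta(example)
```

Sequence lines are simply joined together, so a sequence may be wrapped over any number of lines in the file.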

CLUSTALW looks at each sequence to decide what type it is. If 85% or more of the characters in the sequence are A, C, G, T, U or N, CLUSTALW will assume that it is a nucleotide sequence. Otherwise it will assume that it is a protein sequence.
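The 85% rule can be sketched in a few lines. This is only an illustration of the rule as stated above, not CLUSTALW's own code. The function name is hypothetical, and ignoring gap characters and other non-letters is an assumption made here.

```python
NUCLEOTIDE_LETTERS = set("ACGTUN")


def guess_sequence_type(sequence, threshold=0.85):
    """Guess 'nucleotide' or 'protein' from the letters in a sequence."""
    # Assumption: only letters are counted; gaps and digits are ignored.
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        raise ValueError("sequence contains no letters")
    fraction = sum(c in NUCLEOTIDE_LETTERS for c in letters) / len(letters)
    return "nucleotide" if fraction >= threshold else "protein"


dna_guess = guess_sequence_type("ACGTNNACGU")
protein_guess = guess_sequence_type("MKVLAAGIVG")
```

A short protein that happens to be rich in A, C, G and T could be misclassified by a rule like this, which is one reason to check the sequence type reported by the program.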


DNA Sequence output

Sequence alignments.

CLUSTALW can write the output in a variety of formats, but we will focus only on the standard output. This consists of blocks of aligned sequence, with - characters marking where gaps were added.
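To make the block layout concrete, here is a minimal sketch that joins the blocks of a standard alignment file back into one gapped string per sequence. The sample alignment is made up, the function name is hypothetical, and the parser assumes that the header line starts with "CLUSTAL" and that the line of match symbols under each block starts with white space.

```python
def read_clustal_alignment(text):
    """Return {name: gapped sequence} from CLUSTAL-style alignment text."""
    rows = {}
    for line in text.splitlines():
        if not line.strip():
            continue                      # blank line between blocks
        if line[0].isspace():
            continue                      # line of match symbols under a block
        if line.startswith("CLUSTAL"):
            continue                      # header line (assumed format)
        parts = line.split()
        if len(parts) < 2:
            continue
        # Each block line is: name, then a fragment of the aligned sequence.
        rows.setdefault(parts[0], []).append(parts[1])
    return {name: "".join(fragments) for name, fragments in rows.items()}


# Hypothetical two-block alignment of two sequences.
example = """CLUSTAL W (1.83) multiple sequence alignment


seq1            ACGT-ACGT
seq2            ACGTTACGT
                **** ****

seq1            AC
seq2            AC
                **
"""
alignment = read_clustal_alignment(example)
gaps_in_seq1 = alignment["seq1"].count("-")
```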

CLUSTALW is also flexible about the order of the output: you can choose between keeping the sequences in the same order as in the input file or writing them out in an order that more closely matches the order used to carry out the multiple alignment.

Trees.

CLUSTALW can calculate phylogenetic relationships, but does not draw the trees itself. Other software must be used for that, and we will discuss some options in class.


Sequence Alignments

Multiple alignments are carried out in 3 stages:

  1. All sequences are compared to each other (pairwise alignments).
  2. A dendrogram (like a phylogenetic tree) is constructed, describing the approximate groupings of the sequences by similarity.
  3. The final multiple alignment is carried out, using the dendrogram as a guide.
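The first two stages can be sketched on toy data. The code below is a simplified illustration, not CLUSTALW's implementation: it scores every pair with a basic Needleman-Wunsch global alignment using made-up match, mismatch and gap scores, and then groups the sequences by simple average-linkage clustering, used here as a stand-in for the program's own tree-building method. The nested tuples it returns show the order in which stage 3 would combine the sequences.

```python
from itertools import combinations


def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b (Needleman-Wunsch)."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(previous[j - 1] + pair,   # align two residues
                               previous[j] + gap,        # gap in b
                               current[j - 1] + gap))    # gap in a
        previous = current
    return previous[-1]


def guide_tree(sequences):
    """Stages 1 and 2 for a dict {name: sequence} with unique names."""
    names = list(sequences)
    # Stage 1: compare all sequences to each other.
    score = {frozenset(p): global_score(sequences[p[0]], sequences[p[1]])
             for p in combinations(names, 2)}

    def average(x, y):
        pairs = [(m, n) for m in x[1] for n in y[1]]
        return sum(score[frozenset(p)] for p in pairs) / len(pairs)

    # Stage 2: repeatedly join the two most similar clusters.
    clusters = [(name, [name]) for name in names]   # (subtree, member names)
    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: average(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical sequences: a and b are nearly identical, c is more distant.
tree = guide_tree({"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTGTCCGA"})
```

In this example a and b are joined first, so a progressive alignment guided by this tree would align a with b and then add c to that alignment.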

There are several parameters that you can set in CLUSTALW, and we will examine most of them by trying them on examples and seeing what happens.

These parameters control the final multiple alignment.

Each step in the final multiple alignment consists of aligning two alignments or sequences. This is done progressively, following the branching order in the guide tree. The basic parameters that control this are two gap penalties and the scores for identical and non-identical residues.

The gap penalties control the cost of opening each new gap and the cost of every position within a gap. Increasing the gap opening penalty will make gaps less frequent. Increasing the gap extension penalty will make gaps shorter. Gaps near the ends of sequences are not penalised, so that the last few residues or bases don't get forced to the end of the sequence.
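A small calculation shows why the two penalties behave this way. The sketch assumes the simplest reading of the description above, in which a gap costs the opening penalty once plus the extension penalty for each position. The penalty values are illustrative only and are not the program's defaults.

```python
def gap_cost(length, gap_open=10.0, gap_extend=0.5, terminal=False):
    """Cost of one gap of the given length under the assumed scheme."""
    if length <= 0 or terminal:
        return 0.0          # no gap, or an end gap that is not penalised
    return gap_open + gap_extend * length


# Six gap positions as one long gap, or split into two gaps of three.
one_long_gap = gap_cost(6)
two_short_gaps = gap_cost(3) + gap_cost(3)
```

With these numbers one gap of six positions costs 13.0 while two gaps of three positions cost 23.0, so a high opening penalty favours fewer gaps. Raising gap_extend instead makes every extra position more expensive, which favours shorter gaps.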

The "delay divergent sequences" switch delays the alignment of the most distantly related sequences until after the most closely related sequences have been aligned. The setting gives the percent identity level below which the addition of a sequence is delayed; sequences that are less identical than this level to any other sequence will be aligned later.
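One way to picture this switch is as a filter on the pairwise identities. The sketch below reads the rule as follows: a sequence is delayed when its best percent identity to every other sequence is below the cutoff. That reading, the function name, the cutoff of 40 and the identity values are all assumptions made for illustration.

```python
def split_by_divergence(identity, cutoff=40.0):
    """Split names into (aligned first, delayed).

    identity maps each name to a dict of percent identities to the
    other sequences; at least two sequences are expected.
    """
    first, delayed = [], []
    for name, others in identity.items():
        best = max(value for other, value in others.items() if other != name)
        # Delay a sequence only if nothing else is similar enough to it.
        (delayed if best < cutoff else first).append(name)
    return first, delayed


# Hypothetical percent identities between three sequences.
identity = {
    "a": {"b": 90, "c": 25},
    "b": {"a": 90, "c": 30},
    "c": {"a": 25, "b": 30},
}
first, delayed = split_by_divergence(identity)
```

Here c is at most 30% identical to anything else, so it would be added after a and b have been aligned.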

The transition weight gives transitions (A <--> G or C <--> T, i.e. purine-purine or pyrimidine-pyrimidine substitutions) a weight between 0 and 1; a weight of zero means that transitions are scored as mismatches, while a weight of 1 gives transitions the match score. For distantly related DNA sequences, the weight should be near zero; for closely related sequences it can be useful to assign a higher weight.
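The effect of the transition weight on a single pair of bases can be sketched as below. The two end points come from the description above. Scaling linearly between the mismatch and match scores for weights in between is an assumption, as are the function names and the match and mismatch scores of 1 and 0.

```python
PURINES = set("AG")
PYRIMIDINES = set("CT")


def is_transition(x, y):
    """True for A<->G or C<->T, false for identical bases and transversions."""
    return x != y and ({x, y} <= PURINES or {x, y} <= PYRIMIDINES)


def base_score(x, y, transition_weight=0.5, match=1.0, mismatch=0.0):
    """Score one pair of DNA bases under the assumed weighting scheme."""
    if x == y:
        return match
    if is_transition(x, y):
        # Assumption: interpolate between the mismatch and match scores.
        return mismatch + transition_weight * (match - mismatch)
    return mismatch


as_mismatch = base_score("A", "G", transition_weight=0.0)
as_match = base_score("A", "G", transition_weight=1.0)
transversion = base_score("A", "C", transition_weight=1.0)
```

Transversions such as A <--> C are always scored as mismatches in this sketch, whatever the weight.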

Matrices

The protein weight matrix option offers a choice of weight matrices. The default for proteins is the series of PAM-type matrices derived by Gonnet and colleagues. Each matrix in a series is suited to a particular evolutionary distance, so the best choice depends on how divergent the sequences are.

The DNA weight matrix option offers a choice of two matrices, IUB and CLUSTAL. IUB is the default scoring matrix used by BESTFIT for the comparison of nucleic acid sequences; CLUSTAL is the scoring system used by earlier versions of the program.
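When CLUSTALW is run from the command line, the parameters described in this section are passed as options. The sketch below only builds the command as a list of strings and does not run anything. The option names follow the ClustalW command-line help as far as we know it, and should be checked against the help output of your own installation; the program name (clustalw here, clustalw2 on some systems), the helper function and the file names are hypothetical.

```python
def clustalw_command(infile, outfile, gap_open=None, gap_ext=None,
                     max_div=None, trans_weight=None, dna_matrix=None,
                     out_order=None, program="clustalw"):
    """Build a command line for a multiple alignment (nothing is run)."""
    command = [program, "-INFILE=" + infile, "-ALIGN", "-OUTFILE=" + outfile]
    options = [
        ("-GAPOPEN", gap_open),          # gap opening penalty
        ("-GAPEXT", gap_ext),            # gap extension penalty
        ("-MAXDIV", max_div),            # percent identity for delay
        ("-TRANSWEIGHT", trans_weight),  # transition weight, 0 to 1
        ("-DNAMATRIX", dna_matrix),      # e.g. IUB
        ("-OUTORDER", out_order),        # INPUT or ALIGNED
    ]
    for flag, value in options:
        if value is not None:            # options left unset use the defaults
            command.append(f"{flag}={value}")
    return command


# Hypothetical file names.
command = clustalw_command("seqs.fasta", "seqs.aln",
                           gap_open=10, gap_ext=0.2, out_order="INPUT")
```

The resulting list can be handed to subprocess.run once the program is installed and the options have been confirmed.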


References about CLUSTAL

The following three references describe using CLUSTAL in more detail.

  1. Higgins, D.G., Bleasby, A.J. and Fuchs, R. (1992) CLUSTAL V: improved software for multiple sequence alignment. Computer Applications in the Biosciences (CABIOS), 8(2):189-191.
  2. Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Research, 22(22):4673-4680.
  3. Higgins, D.G., Thompson, J.D. and Gibson, T.J. (1996) Using CLUSTAL for multiple sequence alignments. Methods in Enzymology, 266:383-402.