This section of the course will briefly describe the program CLUSTALW. We will begin with a brief description of the input and output formats, and then move on to the alignments per se. The session will conclude with a couple of examples of alignments performed by CLUSTALW.
CLUSTALW is a multiple sequence alignment program that can efficiently align many DNA or protein sequences. It takes a sequence file and aligns all the sequences in that file. CLUSTALW can be run either from the command line or through one of many web-based interfaces.
CLUSTALW inspects each sequence to decide what type it is. If 85% or more of the characters in the sequence are A, C, G, T, U or N, CLUSTALW assumes that it is a nucleotide sequence. Otherwise it assumes that it is a protein sequence.
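The 85% rule can be illustrated with a short Python sketch. This is not CLUSTALW's own code: the function name is made up for this example, and ignoring non-alphabetic characters such as gap symbols is an assumption.

```python
NUCLEOTIDE_LETTERS = set("ACGTUN")


def guess_sequence_type(sequence, threshold=0.85):
    """Return 'DNA' or 'PROTEIN' using the 85% rule described above."""
    # Assumption: only alphabetic characters count, so gaps and spaces are ignored.
    residues = [ch for ch in sequence.upper() if ch.isalpha()]
    if not residues:
        raise ValueError("sequence contains no residues")
    nucleotide_count = sum(ch in NUCLEOTIDE_LETTERS for ch in residues)
    fraction = nucleotide_count / len(residues)
    return "DNA" if fraction >= threshold else "PROTEIN"


print(guess_sequence_type("ACGTACGTNN"))   # every letter is a nucleotide code
print(guess_sequence_type("MKVLAAGIVG"))   # only 4 of 10 letters are nucleotide codes
```

Note that a short peptide made up mostly of A, C, G and T residues would be misclassified by a rule like this.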
CLUSTALW can write its output in a variety of formats, but we will focus only on the standard output format. It consists of blocks of aligned sequence, with "-" characters marking the positions where gaps were inserted.
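To make the block layout concrete, here is a minimal Python sketch that joins the blocks of a standard output file back into one aligned string per sequence. The example alignment is invented for illustration, and the parser assumes the usual layout: a header line starting with "CLUSTAL", sequence lines that start with the sequence name, and a conservation line that starts with whitespace.

```python
def parse_clustal(text):
    """Collect the aligned sequence for each name across all blocks."""
    sequences = {}
    for line in text.splitlines():
        # Skip blank lines, the header, and the conservation line
        # (assumed to start with whitespace).
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        name, chunk = fields[0], fields[1]
        sequences[name] = sequences.get(name, "") + chunk
    return sequences


# Invented two-sequence alignment, split over two blocks.
EXAMPLE = """CLUSTAL W (1.83) multiple sequence alignment

seq1      ACGT-ACGT
seq2      ACGTTACGT
          **** ****

seq1      GG-A
seq2      GGTA
          ** *
"""

for name, aligned in parse_clustal(EXAMPLE).items():
    print(name, aligned, "gaps:", aligned.count("-"))
```

For real work an established parser, such as the one in Biopython, is a safer choice than a hand-written one.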
You can also choose the order of the sequences in the output: either the same order as in the input file, or an order that more closely matches the one used to carry out the multiple alignment.
Trees
CLUSTALW can calculate phylogenetic relationships, but it does not draw the trees itself. Other software must be used for that, and we will discuss some options in class.
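Multiple alignments are carried out in 3 stages: first, all pairs of sequences are aligned with each other; second, a guide tree is built from the pairwise comparisons; third, the final multiple alignment is built up progressively, following the guide tree.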
There are several parameters that you can set in CLUSTALW. We will examine most of them through examples, changing one setting at a time and seeing how the alignment changes.
These parameters control the final multiple alignment.
Each step in the final multiple alignment consists of aligning two alignments or sequences. This is done progressively, following the branching order in the guide tree. The basic parameters that control this are two gap penalties and the scores assigned to identical and non-identical residue pairs.
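The idea of following the branching order can be sketched in Python. The guide tree below is a hypothetical example written as nested pairs, and the function only lists which groups would be aligned at each step; it does not perform any alignment. It uses a simple post-order walk, which guarantees that both branches are finished before they are joined, but the exact order CLUSTALW uses may differ.

```python
def merge_order(tree):
    """List the pairwise alignment steps implied by a guide tree.

    A leaf is a sequence name (str); an internal node is a pair (left, right).
    """
    steps = []

    def visit(node):
        if isinstance(node, str):
            return (node,)
        left = visit(node[0])
        right = visit(node[1])
        # Both groups are complete at this point, so they can be aligned.
        steps.append((left, right))
        return left + right

    visit(tree)
    return steps


# Hypothetical guide tree for five sequences.
guide_tree = (("A", "B"), ("C", ("D", "E")))
for left, right in merge_order(guide_tree):
    print("align", "+".join(left), "with", "+".join(right))
```

A tree with n sequences always gives n - 1 steps, and the last step joins the two groups on either side of the root.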
The gap penalties control the cost of opening each new gap and the cost of each position within a gap. Increasing the gap opening penalty will make gaps less frequent. Increasing the gap extension penalty will make gaps shorter. Gaps at the ends of sequences are not penalised, so that the last few residues or bases don't get forced to the end of the sequence.
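The effect of the two penalties can be seen with a simplified Python sketch. It assumes a plain "opening plus extension times length" cost and zero cost for terminal gaps; the numbers are illustrative, and CLUSTALW itself adjusts the penalties further depending on position and sequence content.

```python
def gap_cost(length, gap_open, gap_extend, terminal=False):
    """Cost of one gap under a simple opening + extension scheme."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    # No gap, or a gap at the end of a sequence, costs nothing in this sketch.
    if length == 0 or terminal:
        return 0.0
    return float(gap_open + gap_extend * length)


# Illustrative penalties: opening a gap is much dearer than extending one.
one_long_gap = gap_cost(4, gap_open=10, gap_extend=1)
two_short_gaps = 2 * gap_cost(2, gap_open=10, gap_extend=1)
print(one_long_gap, two_short_gaps)
```

With these values one gap of four positions costs 14, while two gaps of two positions cost 24, which is why a high opening penalty leads to fewer, longer gaps.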
The delay divergent sequences switch delays the alignment of the most distantly related sequences until after the most closely related sequences have been aligned. The setting is the percent identity level required to delay the addition of a sequence: a sequence whose identity to every other sequence falls below this level will be aligned later.
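One way to read this rule is sketched below in Python: a sequence is delayed when even its best match to another sequence is below the cutoff. The function name, the input layout and the identity values are all made up for illustration, and this is a reading of the description rather than CLUSTALW's own code.

```python
def delayed_sequences(identity, cutoff):
    """Return the names whose best percent identity to any other sequence is below cutoff.

    identity maps (name_a, name_b) to percent identity, one entry per unordered pair.
    """
    best = {}
    for (a, b), pct in identity.items():
        best[a] = max(best.get(a, 0.0), pct)
        best[b] = max(best.get(b, 0.0), pct)
    return sorted(name for name, pct in best.items() if pct < cutoff)


# Invented pairwise identities for three sequences.
pairwise = {("A", "B"): 90.0, ("A", "C"): 25.0, ("B", "C"): 20.0}
print(delayed_sequences(pairwise, cutoff=30.0))
```

Here A and B are 90% identical and are aligned first, while C never reaches 30% identity to either of them and is added last.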
The transition weight gives transitions (A <--> G or C <--> T, i.e. purine-purine or pyrimidine-pyrimidine substitutions) a weight between 0 and 1. A weight of zero means that transitions are scored as mismatches, while a weight of 1 gives transitions the match score. For distantly related DNA sequences the weight should be near zero; for closely related sequences it can be useful to assign a higher weight.
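The following Python sketch shows how such a weight could enter the score for a pair of bases. The text above only fixes the two end points (0 gives the mismatch score, 1 gives the match score); scaling linearly in between, and the match and mismatch values of 1 and 0, are assumptions made for this example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(a, b):
    """True for A <--> G or C <--> T substitutions."""
    return a != b and ({a, b} <= PURINES or {a, b} <= PYRIMIDINES)


def pair_score(a, b, transition_weight, match=1.0, mismatch=0.0):
    """Score two bases, giving transitions a share of the match score."""
    if not 0.0 <= transition_weight <= 1.0:
        raise ValueError("transition weight must be between 0 and 1")
    a, b = a.upper(), b.upper()
    if a == b:
        return match
    if is_transition(a, b):
        # Assumption: interpolate linearly between mismatch and match scores.
        return mismatch + transition_weight * (match - mismatch)
    return mismatch


for weight in (0.0, 0.5, 1.0):
    print(weight, pair_score("A", "G", weight), pair_score("A", "C", weight))
```

The A/G pair moves from the mismatch score to the match score as the weight rises, while the A/C pair (a transversion) is always scored as a mismatch.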
Matrices
The protein weight matrix option offers a choice of weight matrix series. The default for proteins is the series derived by Gonnet and colleagues. Each matrix in a series is tuned to a particular evolutionary distance, so the best choice depends on how divergent the sequences are.
The DNA weight matrix option offers a choice of two matrices, IUB and CLUSTAL. IUB is the default scoring matrix used by BESTFIT for the comparison of nucleic acid sequences; CLUSTAL is the scoring scheme used by earlier versions of CLUSTALW.