Clustal W is a general purpose multiple alignment program for DNA or proteins.

Sequence Input

All sequences must be in 1 file, one after another. 7 formats are automatically recognised: NBRF-PIR, EMBL-SWISSPROT, Pearson (Fasta), Clustal (*.aln), GCG-MSF (Pileup), GCG9-RSF and GDE flat file. All non-alphabetic characters (spaces, digits, punctuation marks) are ignored except "-" which is used to indicate a GAP ("." in MSF-RSF).
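
The character filtering described above can be sketched in a few lines of Python. This is only an illustration of the rule (keep letters and the gap character, drop everything else); the function name `clean_sequence` is hypothetical and is not part of Clustal W.

```python
def clean_sequence(raw: str, gap_char: str = "-") -> str:
    """Return only the letters and gap characters found in raw."""
    return "".join(ch for ch in raw if ch.isalpha() or ch == gap_char)


# Spaces, digits and punctuation are dropped; the gap marker survives.
print(clean_sequence("ACG T-12 TA,C"))  # prints ACGT-TAC
```

For MSF-RSF style input the same sketch could be called with `gap_char="."`.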

Clustal needs to be told which sequence type you are using:

Please choose either nucleotide or protein with the checkbox. The script will not let you continue until you do so.


Alignments

Multiple alignments are carried out in 3 stages:

  1. all sequences are compared to each other (pairwise alignments)
  2. a dendrogram (like a phylogenetic tree) is constructed, describing the approximate groupings of the sequences by similarity.
  3. the final multiple alignment is carried out, using the dendrogram as a guide.

Pairwise alignment parameters control the speed and sensitivity of the initial alignments.
Multiple alignment parameters control the gaps in the final multiple alignments.
Output format allows you to choose from 5 different alignment formats (CLUSTAL, GCG, PIR, PHYLIP, and GDE). I suggest you try all of them, and use the one you like the most!


Pairwise alignments

A distance is calculated between every pair of sequences and these are used to construct the dendrogram which guides the final multiple alignment. The scores are calculated from separate pairwise alignments. These can be calculated using 2 methods: dynamic programming (slow but accurate) or by the method of Wilbur and Lipman (extremely fast but approximate).

The slow-accurate method is fine for short sequences but will be VERY SLOW for many (e.g. >100) long (e.g. >1000 residue) sequences.
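
To see why the number of sequences matters so much, note that N sequences need N(N-1)/2 pairwise alignments. The Python sketch below counts those pairs and turns one finished pairwise alignment into a distance. The distance used here (1 minus the fraction of identical residues over ungapped columns) is an assumption made for illustration, and both function names are hypothetical; the exact formula used by Clustal W may differ.

```python
def count_pairs(n_sequences: int) -> int:
    """Number of pairwise alignments needed for n_sequences sequences."""
    return n_sequences * (n_sequences - 1) // 2


def pairwise_distance(aligned_a: str, aligned_b: str) -> float:
    """Distance between two already-aligned sequences.

    Assumption: distance = 1 - (identical columns / ungapped columns).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Ignore every column in which either sequence has a gap.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 1.0
    identical = sum(1 for a, b in columns if a == b)
    return 1.0 - identical / len(columns)


print(count_pairs(100))  # prints 4950
print(round(pairwise_distance("ACGT-A", "ACCTTA"), 3))  # prints 0.2
```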


Multiple alignment parameters

These parameters control the final multiple alignment. This is the core of the program and the details are complicated but we'll try and deal with them here.

Each step in the final multiple alignment consists of aligning two alignments or sequences. This is done progressively, following the branching order in the guide tree. The basic parameters to control this are two gap penalties and the scores for various identical and non-identical residues.

The gap penalties control the cost of opening up every new gap and the cost of every item in a gap. Increasing the gap opening penalty will make gaps less frequent. Increasing the gap extension penalty will make gaps shorter. Gaps near the ends of sequences are not penalised so that the last few residues or bases don't get forced to the end of the sequence.
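
The effect of the two penalties can be illustrated with a small Python sketch. It assumes the simplest reading of the paragraph above: a gap of length L costs the opening penalty plus L times the extension penalty, and gap runs that touch either end of the sequence cost nothing. The penalty values and function names are illustrative only; Clustal W itself also adjusts penalties by position and residue, which is not modelled here.

```python
import re


def gap_cost(length: int, gap_open: float, gap_ext: float) -> float:
    """Cost of one gap: one opening charge plus a charge per gap position."""
    return gap_open + gap_ext * length


def total_gap_cost(aligned_row: str, gap_open: float, gap_ext: float) -> float:
    """Sum the cost of all internal gap runs in one aligned sequence."""
    # Simplification: gap runs touching either end are not penalised.
    internal = aligned_row.strip("-")
    return sum(gap_cost(len(run), gap_open, gap_ext)
               for run in re.findall(r"-+", internal))


# Four gap positions in one run are cheaper than the same four in two runs.
print(total_gap_cost("AC----GTA", gap_open=10.0, gap_ext=0.5))  # prints 12.0
print(total_gap_cost("AC--GT--A", gap_open=10.0, gap_ext=0.5))  # prints 22.0
```

With these numbers, raising `gap_open` mostly punishes the second alignment (two gaps), while raising `gap_ext` punishes both in proportion to the total gap length.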

The delay divergent sequences switch delays the alignment of the most distantly related sequences until after the most closely related sequences have been aligned. The setting shows the percent identity level required to delay the addition of a sequence; sequences that are less identical than this level to any other sequences will be aligned later.
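
As a rough Python sketch of this switch, the function below splits sequences into those aligned in the normal order and those delayed. It assumes one reading of the rule: a sequence is delayed when its best percent identity to any other sequence is below the cutoff. The function name, the sequence names and the identity values are all made up for illustration.

```python
def split_delayed(identity: dict, cutoff: float) -> tuple:
    """Split sequence names into (aligned on time, delayed).

    identity maps each name to {other name: percent identity}.
    """
    on_time, delayed = [], []
    for name, row in identity.items():
        best = max(row.values())  # closest relative of this sequence
        if best < cutoff:
            delayed.append(name)
        else:
            on_time.append(name)
    return on_time, delayed


# Hypothetical percent identities between three sequences.
identity = {
    "seqA": {"seqB": 85.0, "seqC": 30.0},
    "seqB": {"seqA": 85.0, "seqC": 25.0},
    "seqC": {"seqA": 30.0, "seqB": 25.0},
}
print(split_delayed(identity, cutoff=40.0))  # prints (['seqA', 'seqB'], ['seqC'])
```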

The transition weight gives transitions (A <--> G or C <--> T i.e. purine-purine or pyrimidine-pyrimidine substitutions) a weight between 0 and 1; a weight of zero means that the transitions are scored as mismatches, while a weight of 1 gives the transitions the match score. For distantly related DNA sequences, the weight should be near to zero; for closely related sequences it can be useful to assign a higher score.
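
The two end points described above (weight 0 scores a transition as a mismatch, weight 1 scores it as a match) can be sketched in Python as follows. The linear interpolation between those end points, the match and mismatch values, and the function names are assumptions made for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(a: str, b: str) -> bool:
    """True for A<->G or C<->T substitutions."""
    pair = {a, b}
    return a != b and (pair <= PURINES or pair <= PYRIMIDINES)


def pair_score(a: str, b: str, transweight: float,
               match: float = 1.0, mismatch: float = 0.0) -> float:
    """Score one pair of bases under a transition weight in [0, 1]."""
    if a == b:
        return match
    if is_transition(a, b):
        # Assumed: interpolate linearly between mismatch and match.
        return mismatch + transweight * (match - mismatch)
    return mismatch


print(pair_score("A", "G", transweight=0.5))  # transition, prints 0.5
print(pair_score("A", "C", transweight=0.5))  # transversion, prints 0.0
print(pair_score("A", "A", transweight=0.5))  # identity, prints 1.0
```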

Matrices

Please note that the matrices will automatically change depending on whether you select nucleotides or proteins.

More help on the matrices is given below.

The protein weight matrix offers a choice of weight matrices. The default for proteins is the PAM series derived by Gonnet and colleagues. Each matrix in a series works best at a particular evolutionary distance.

The DNA weight matrix has a choice of two matrices, IUB and CLUSTAL. IUB is the default scoring matrix used by BESTFIT for comparison of nucleic acid sequences.

The Hydrophilic protein gap parameters allow you to set some gap penalty options which are only used in protein alignments. Hydrophilic gap penalties are used to increase the chances of a gap within a run (5 or more residues) of hydrophilic amino acids; these are likely to be loop or random coil regions where gaps are more common.

Output format options

Five output formats are offered:

  1. CLUSTAL format output is a self explanatory alignment format. It shows the sequences aligned in blocks. It can be read in again at a later date to (for example) calculate a phylogenetic tree or add a new sequence with a profile alignment.
  2. GCG output can be used by any of the GCG programs that can work on multiple alignments (e.g. PRETTY, PROFILEMAKE, PLOTALIGN). It is the same as the GCG .msf format files (multiple sequence file).
  3. PHYLIP format output can be used for input to the PHYLIP package of Joe Felsenstein. This is an extremely widely used package for doing every imaginable form of phylogenetic analysis.
  4. NBRF-PIR format is the same as the standard PIR format, with the addition that gap characters "-" are used to indicate the positions of gaps in the multiple alignment.
  5. GDE format is the flat file format used by the GDE package of Steven Smith.

CLUSTALW sequence numbers: residue numbers may be added to the end of the alignment lines in CLUSTALW format.

Output order is used to control the order of the sequences in the output alignments.


Weight matrix

For protein alignments, you use a weight matrix to determine the similarity of non-identical amino acids. For example, Tyr aligned with Phe is usually judged to be 'better' than Tyr aligned with Pro.

There are three 'in-built' series of weight matrices offered. Each consists of several matrices which work differently at different evolutionary distances. Crudely, several matrices are stored in memory, spanning the full range of amino acid distance (from almost identical sequences to highly divergent ones). For very similar sequences, it is best to use a strict weight matrix which only gives a high score to identities and the most favoured conservative substitutions. For more divergent sequences, it is appropriate to use "softer" matrices which give a high score to many other frequent substitutions.

The matrices are:

  1. BLOSUM (Henikoff). These matrices appear to be the best available for carrying out database similarity (homology) searches. The matrices used are: Blosum 80, 62, 45 and 30.

  2. PAM (Dayhoff). These have been extremely widely used since the late '70s. clustalw uses the PAM 20, 60, 120 and 350 matrices.

  3. Gonnet. These matrices were derived using almost the same procedure as the Dayhoff one (above) but are much more up to date and are based on a far larger data set. They appear to be more sensitive than the Dayhoff series. clustalw uses the GONNET 80, 120, 160, 250 and 350 matrices.

  4. Identity matrix which gives a score of 1.0 to two identical amino acids and a score of zero otherwise. This matrix is not very useful.

For DNA two matrices are available:

  1. IUB. This is the default scoring matrix used by BESTFIT for the comparison of nucleic acid sequences. X's and N's are treated as matches to any IUB ambiguity symbol. All matches score 1.9; all mismatches for IUB symbols score 0.
  2. CLUSTALW. Matches score 1.0 and mismatches score 0. All matches for IUB symbols also score 0.


Additional Options

clustalw option list:-
                -help 
                -check 
                -options 
                -align 
                -newtree=filename 
                -usetree=filename 
                -newtree1=filename 
                -usetree1=filename 
                -newtree2=filename 
                -usetree2=filename 
                -bootstrap 
                -tree 
                -quicktree 
                -convert 
                -interactive 
                -batch 
                -infile=filename 
                -profile1=filename 
                -profile2=filename 
                -type=protein OR dna
                -profile 
                -sequences 
                -matrix=filename 
                -dnamatrix=filename 
                -negative 
                -gapopen=f 
                -gapext=f 
                -endgaps 
                -nopgap 
                -nohgap 
                -hgapresidues=string 
                -maxdiv=n 
                -gapdist=n 
                -pwmatrix=filename 
                -pwdnamatrix=filename 
                -pwgapopen=f 
                -pwgapext=f 
                -ktuple=n 
                -window=n 
                -pairgap=n 
                -topdiags=n 
                -score=percent OR absolute
                -transweight=f 
                -seed=n 
                -kimura 
                -tossgaps 
                -bootlabels=node OR branch
                -debug=n 
                -output=gcg OR gde OR pir OR phylip
                -outputtree=nj OR phylip OR dist
                -outfile=filename 
                -outorder=input OR aligned
                -case=lower OR upper
                -seqnos=off OR on
                -nosecstr1 
                -nosecstr2 
                -secstrout=structure OR mask OR both OR none
                -helixgap=n 
                -strandgap=n 
                -loopgap=n 
                -terminalgap=n 
                -helixendin=n 
                -helixendout=n 
                -strandendin=n 
                -strandendout=n
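
As an illustration of how the options above combine, the Python sketch below assembles a command line from a handful of them. It only builds and prints the command; it does not run it. The helper name `build_clustalw_command`, the file names, the penalty values and the assumption that the executable is called `clustalw` are all hypothetical.

```python
import shlex


def build_clustalw_command(infile: str, outfile: str,
                           seq_type: str = "protein",
                           output: str = "phylip",
                           **extra) -> list:
    """Assemble a clustalw argument list from options in the list above."""
    cmd = [
        "clustalw",            # assumed executable name
        "-align",
        f"-infile={infile}",
        f"-type={seq_type}",
        f"-output={output}",
        f"-outfile={outfile}",
    ]
    # Any further -name=value options, e.g. gapopen=10.
    for name, value in extra.items():
        cmd.append(f"-{name}={value}")
    return cmd


cmd = build_clustalw_command("seqs.fasta", "seqs.phy",
                             gapopen=10, gapext=0.2, outorder="aligned")
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine where the program is installed.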