All sequences must be in 1 file, one after another. 7 formats are automatically recognised: NBRF-PIR, EMBL-SWISSPROT, Pearson (Fasta), Clustal (*.aln), GCG-MSF (Pileup), GCG9-RSF and GDE flat file. All non-alphabetic characters (spaces, digits, punctuation marks) are ignored except "-" which is used to indicate a GAP ("." in MSF-RSF).
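The character rule above can be illustrated with a minimal Python sketch. It is not Clustal's own parser; the function name `clean_sequence` and the `gap_char` parameter are hypothetical, and the sketch only assumes that letters and the gap symbol are kept while everything else is dropped.

```python
def clean_sequence(raw: str, gap_char: str = "-") -> str:
    """Keep ASCII letters and the gap character; drop spaces, digits, punctuation."""
    return "".join(
        ch for ch in raw
        if (ch.isascii() and ch.isalpha()) or ch == gap_char
    )


# Spaces, residue numbers and punctuation are ignored; "-" survives as a gap.
print(clean_sequence("MKV 10 LA-- TG, 20 HH."))  # MKVLA--TGHH

# For MSF/RSF-style input the gap symbol is "." instead.
print(clean_sequence("AC..GT 6", gap_char="."))  # AC..GT
```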
This is how clustal knows what sequence type you are using:
Multiple alignments are carried out in 3 stages:
Pairwise alignment parameters control the speed and sensitivity of the initial alignments.
Multiple alignment parameters control the gaps in the final multiple alignments.
Output format allows you to choose from 5 different alignment formats (CLUSTAL, GCG, PIR, PHYLIP, and GDE). I suggest you try all of them, and use the one you like the most!
A distance is calculated between every pair of sequences and these are used to construct the dendrogram which guides the final multiple alignment. The scores are calculated from separate pairwise alignments. These can be calculated using 2 methods: dynamic programming (slow but accurate) or by the method of Wilbur and Lipman (extremely fast but approximate).
The slow-accurate method is fine for short sequences but will be VERY SLOW for many (e.g. >100) long (e.g. >1000 residue) sequences.
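To see why the slow-accurate method scales badly, here is a rough back-of-the-envelope sketch. It assumes that every pair of sequences is aligned once and that a full dynamic programming alignment of two sequences of equal length fills roughly length × length cells; the function name `pairwise_workload` is hypothetical and the real cost depends on the implementation.

```python
from typing import Tuple


def pairwise_workload(n_sequences: int, length: int) -> Tuple[int, int]:
    """Return (number of sequence pairs, approximate DP cells to fill)."""
    pairs = n_sequences * (n_sequences - 1) // 2  # every unordered pair once
    cells = pairs * length * length               # one length x length table per pair
    return pairs, cells


pairs, cells = pairwise_workload(100, 1000)
print(pairs, cells)  # 4950 4950000000
```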
These parameters do not have any effect on the speed of the alignments. They are used to give initial alignments which are then rescored to give percent identity scores. These % scores are the ones which are displayed in the results. The scores are converted to distances for the trees.
| Gap Open Penalty | the penalty for opening a gap in the alignment. |
| Gap extension penalty | the penalty for extending a gap by 1 residue. |
| Protein weight matrix | the scoring table which describes the similarity of each amino acid to each other. |
| DNA weight matrix | the scores assigned to matches and mismatches (including IUB ambiguity codes). |
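The rescoring step (alignment to percent identity to distance) might look like the sketch below. Two points are assumptions rather than statements about Clustal's internals: positions where either sequence has a gap are left out of the comparison, and the distance is taken as 1 minus the fractional identity. The function names are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over the aligned positions where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # assumption: gapped positions are not scored
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def identity_to_distance(pid: float) -> float:
    """Assumed conversion: distance = 1 - identity/100."""
    return 1.0 - pid / 100.0


pid = percent_identity("ACGT-ACGT", "ACGTTACCT")
print(pid, identity_to_distance(pid))  # 87.5 0.125
```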
Fast and Approximate alignment parameters:
These similarity scores are calculated from fast, approximate, global alignments, which are controlled by 4 parameters. 2 techniques are used to make these alignments very fast: 1) only exactly matching fragments (k-tuples) are considered; 2) only the 'best' diagonals (the ones with most k-tuple matches) are used.
| K-Tuple size | This is the size of exactly matching fragment that is used. For longer sequences (e.g. >1000 residues) you may need to increase the default. Increase for speed (max = 2 for proteins; 4 for DNA); decrease for sensitivity. |
| Gap penalty | This is a penalty for each gap in the fast alignments. It has little effect on the speed or sensitivity except for extreme values. |
| Top diagonals | The number of k-tuple matches on each diagonal (in an imaginary dot-matrix plot) is calculated. Only the best ones (with most matches) are used in the alignment. This parameter specifies how many. Decrease for speed; increase for sensitivity. |
| Window size | This is the number of diagonals around each of the 'best' diagonals that will be used. Decrease for speed and increase for sensitivity. |
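The two speed-ups described above (exact k-tuple matches, then only the best diagonals) can be sketched as follows. This is an illustration of the idea, not the Wilbur and Lipman implementation used by the program; the function names are hypothetical, and a diagonal is identified here by the offset `i - j` between the match positions in the two sequences.

```python
from collections import Counter, defaultdict


def ktuple_diagonal_counts(seq1: str, seq2: str, k: int) -> Counter:
    """Count exact k-tuple matches on each diagonal (offset i - j)."""
    positions = defaultdict(list)
    for j in range(len(seq2) - k + 1):
        positions[seq2[j:j + k]].append(j)  # index every k-tuple of seq2
    counts = Counter()
    for i in range(len(seq1) - k + 1):
        for j in positions.get(seq1[i:i + k], ()):
            counts[i - j] += 1              # one more match on this diagonal
    return counts


def best_diagonals(counts: Counter, top_diags: int) -> list:
    """Keep only the diagonals with the most k-tuple matches."""
    return [diag for diag, _ in counts.most_common(top_diags)]


counts = ktuple_diagonal_counts("ACGTACGT", "TTACGTAC", k=2)
print(best_diagonals(counts, top_diags=2))  # [-2, 2]
```

A larger `k` yields fewer candidate matches (faster, less sensitive), and a smaller `top_diags` discards more of the dot-matrix plot, which matches the tuning advice in the table.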
These parameters control the final multiple alignment. This is the core of the program and the details are complicated but we'll try and deal with them here.
Each step in the final multiple alignment consists of aligning two alignments or sequences. This is done progressively, following the branching order in the guide tree. The basic parameters to control this are two gap penalties and the scores for various identical and non-identical residues.
The gap penalties control the cost of opening up every new gap and the cost of every item in a gap. Increasing the gap opening penalty will make gaps less frequent. Increasing the gap extension penalty will make gaps shorter. Gaps near the ends of sequences are not penalised so that the last few residues or bases don't get forced to the end of the sequence.
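The effect of the two penalties can be shown with a small sketch. It assumes the common affine form (one opening cost per gap plus one extension cost per gapped position) and treats terminal gaps as free; the program's actual penalties are adjusted in further ways that are not modelled here. The function name and the penalty values are hypothetical.

```python
def gap_cost(length: int, gap_open: float, gap_ext: float, terminal: bool = False) -> float:
    """Assumed affine cost: opening penalty plus extension penalty per gapped position."""
    if length <= 0 or terminal:
        return 0.0  # simplification: gaps at the sequence ends are not penalised
    return gap_open + gap_ext * length


# Hypothetical penalty values, chosen only to keep the arithmetic exact.
one_long_gap = gap_cost(6, gap_open=10.0, gap_ext=0.5)
two_short_gaps = 2 * gap_cost(3, gap_open=10.0, gap_ext=0.5)
print(one_long_gap, two_short_gaps)  # 13.0 23.0
```

With a high opening penalty, one gap of 6 positions is much cheaper than two gaps of 3, which is why raising it makes gaps less frequent; raising the extension penalty instead makes every extra gapped position more expensive, so gaps become shorter.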
The delay divergent sequences switch delays the alignment of the most distantly related sequences until after the most closely related sequences have been aligned. The setting shows the percent identity level required to delay the addition of a sequence; sequences that are less identical than this level to any other sequences will be aligned later.
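One way to read the rule above is sketched below: a sequence is delayed when its percent identity to every other sequence falls below the chosen level. That reading, the function name `delayed_sequences`, and the example identity values are assumptions for illustration only.

```python
def delayed_sequences(identity: dict, threshold: float) -> list:
    """Names of sequences whose best identity to any other sequence is below threshold."""
    delayed = []
    for name, row in identity.items():
        others = [pid for other, pid in row.items() if other != name]
        if others and max(others) < threshold:
            delayed.append(name)
    return delayed


# Hypothetical pairwise percent identities for three sequences.
identity = {
    "A": {"B": 80.0, "C": 25.0},
    "B": {"A": 80.0, "C": 30.0},
    "C": {"A": 25.0, "B": 30.0},
}
print(delayed_sequences(identity, threshold=40.0))  # ['C']
```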
The transition weight gives transitions (A <--> G or C <--> T i.e. purine-purine or pyrimidine-pyrimidine substitutions) a weight between 0 and 1; a weight of zero means that the transitions are scored as mismatches, while a weight of 1 gives the transitions the match score. For distantly related DNA sequences, the weight should be near to zero; for closely related sequences it can be useful to assign a higher score.
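A minimal sketch of how such a weight could be applied is given below. The text only fixes the two end points (0 gives the mismatch score, 1 gives the match score); the linear interpolation in between, the default `match` and `mismatch` values, and the function name are assumptions.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def dna_pair_score(x: str, y: str, trans_weight: float,
                   match: float = 1.0, mismatch: float = 0.0) -> float:
    """Score one aligned base pair; transitions are scaled by trans_weight."""
    if not 0.0 <= trans_weight <= 1.0:
        raise ValueError("trans_weight must be between 0 and 1")
    x, y = x.upper(), y.upper()
    if x == y:
        return match
    pair = {x, y}
    if pair <= PURINES or pair <= PYRIMIDINES:
        # assumption: interpolate linearly between mismatch and match
        return mismatch + trans_weight * (match - mismatch)
    return mismatch  # transversion


print(dna_pair_score("A", "G", 0.5))  # 0.5 (transition)
print(dna_pair_score("A", "C", 0.5))  # 0.0 (transversion)
```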
Matrices
Please note that the matrices will automatically change depending on whether you select nucleotides or proteins.
More help on the matrices is given below.
The protein weight matrix offers a choice of weight matrices. The default for proteins is the PAM series derived by Gonnet and colleagues. Different matrices work differently at each evolutionary distance.
The DNA weight matrix has a choice of two matrices, IUB and CLUSTAL. The IUB matrix is the one used by BESTFIT for comparison of nucleic acid sequences.
The Hydrophilic protein gap parameters allow you to set some Gap Penalty options which are only used in protein alignments. Hydrophilic gap penalties are used to increase the chances of a gap within a run (5 or more residues) of hydrophilic amino acids; these are likely to be loop or random coil regions where gaps are more common.
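Finding the runs that this option acts on could be sketched as follows. The residue set `GPSNDQEKR` is an assumption (it is the set usually quoted as the default for the `-hgapresidues` option; check your own installation), and the function name is hypothetical. The sketch only locates the runs; it does not show how the penalties are then adjusted.

```python
def hydrophilic_runs(seq: str, residues: str = "GPSNDQEKR", min_run: int = 5) -> list:
    """Return (start, end) index pairs, end exclusive, of runs of hydrophilic residues."""
    runs = []
    start = None
    # A trailing sentinel character closes a run that reaches the end of the sequence.
    for i, ch in enumerate(seq.upper() + "#"):
        if ch in residues:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i))
            start = None
    return runs


# "GSNDQK" (6 residues) qualifies; the later "GPS" (3 residues) is too short.
print(hydrophilic_runs("MLVGSNDQKAILGPSW"))  # [(3, 9)]
```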
Five output formats are offered:
Output order is used to control the order of the sequences in the output alignments.
Output format options.
CLUSTALW sequence numbers: residue numbers may be added to the end of the alignment lines in CLUSTALW format.
For protein alignments, you use a weight matrix to determine the similarity of non-identical amino acids. For example, Tyr aligned with Phe is usually judged to be 'better' than Tyr aligned with Pro.
There are three 'in-built' series of weight matrices offered. Each consists of several matrices which work differently at different evolutionary distances. Crudely, several matrices are stored in memory, spanning the full range of amino acid distance (from almost identical sequences to highly divergent ones). For very similar sequences, it is best to use a strict weight matrix which only gives a high score to identities and the most favoured conservative substitutions. For more divergent sequences, it is appropriate to use "softer" matrices which give a high score to many other frequent substitutions.
The matrices are:
For DNA two matrices are available:
clustalw option list:-
-help
-check
-options
-align
-newtree=filename
-usetree=filename
-newtree1=filename
-usetree1=filename
-newtree2=filename
-usetree2=filename
-bootstrap
-tree
-quicktree
-convert
-interactive
-batch
-infile=filename
-profile1=filename
-profile2=filename
-type=protein OR dna
-profile
-sequences
-matrix=filename
-dnamatrix=filename
-negative
-gapopen=f
-gapext=f
-endgaps
-nopgap
-nohgap
-hgapresidues=string
-maxdiv=n
-gapdist=n
-pwmatrix=filename
-pwdnamatrix=filename
-pwgapopen=f
-pwgapext=f
-ktuple=n
-window=n
-pairgap=n
-topdiags=n
-score=percent OR absolute
-transweight=f
-seed=n
-kimura
-tossgaps
-bootlabels=node OR branch
-debug=n
-output=gcg OR gde OR pir OR phylip
-outputtree=nj OR phylip OR dist
-outfile=filename
-outorder=input OR aligned
-case=lower OR upper
-seqnos=off OR on
-nosecstr1
-nosecstr2
-secstrout=structure OR mask OR both OR none
-helixgap=n
-strandgap=n
-loopgap=n
-terminalgap=n
-helixendin=n
-helixendout=n
-strandendin=n
-strandendout=n
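As an illustration of how the options above combine, the following sketch assembles a command line from a handful of them. It only builds the argument list and does not run anything. The executable name `clustalw`, the helper name `build_clustalw_command`, and the file names are assumptions; the option names and their allowed values are taken from the list above.

```python
import shlex

ALLOWED = {
    "type": {"protein", "dna"},
    "output": {"gcg", "gde", "pir", "phylip"},
    "outorder": {"input", "aligned"},
}


def build_clustalw_command(infile, outfile, seq_type="protein", output="phylip",
                           outorder="aligned", gapopen=None, gapext=None,
                           executable="clustalw"):
    """Assemble an argument list using only options from the list above."""
    for key, value in (("type", seq_type), ("output", output), ("outorder", outorder)):
        if value not in ALLOWED[key]:
            raise ValueError(f"-{key}={value} is not one of {sorted(ALLOWED[key])}")
    args = [executable, "-align", f"-infile={infile}", f"-type={seq_type}",
            f"-output={output}", f"-outfile={outfile}", f"-outorder={outorder}"]
    if gapopen is not None:
        args.append(f"-gapopen={gapopen}")
    if gapext is not None:
        args.append(f"-gapext={gapext}")
    return args


# Hypothetical file names and penalty values.
cmd = build_clustalw_command("seqs.fasta", "seqs.phy", gapopen=10.0, gapext=0.5)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` if the program is installed; keeping the arguments as a list avoids shell quoting problems with file names.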