Assembling Sequences

As well as viewing trace data, Sequencher is also excellent at assembling data. In the last three sections we looked at the basics of Sequencher and calling bases. Now we will try to assemble this data into contigs.

The basis of assembly is similar to the multiple sequence alignments that we have looked at. In essence, we are looking for regions of homology at the ends of the DNA sequences. We then try to align these sequences together to create a single sequence like this:

how contigs are made
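The idea of joining two reads at a shared end can be sketched in a few lines of Python. This is only an illustration that assumes an exact, gap-free match on the same strand; the function names and the example reads are made up for this sketch, and it is not how Sequencher itself is implemented.

```python
from typing import Optional


def longest_overlap(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    # Try the longest possible overlap first and shrink until one matches.
    for length in range(max_len, min_overlap - 1, -1):
        if left[-length:] == right[:length]:
            return length
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two reads at their overlap, or return None if they do not overlap."""
    length = longest_overlap(left, right, min_overlap)
    if length == 0:
        return None
    # Keep all of `left`, then add the part of `right` past the overlap.
    return left + right[length:]


# Hypothetical reads: the last 6 bases of read_a match the first 6 of read_b.
read_a = "ATGCGTACGTTAG"
read_b = "CGTTAGGCTTAACC"
print(merge_reads(read_a, read_b))  # ATGCGTACGTTAGGCTTAACC
```

Real reads contain base-calling errors, so an exact match like this is too strict in practice, which is why the assembly parameters described below allow a percentage match instead.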

Of course, each sequence could be homologous to another sequence at either end and on either strand, so the possibilities get daunting if you were to do this by hand. Luckily, Sequencher has built-in functions to do this. Go to the main Sequencher window:

Image of first Sequencher view with traces

Now take a look at the top line below the buttons. Notice that it says:

Parameters(Dirty Data): Min Overlap = 20, Min Match = 85%

These are the current assembly parameters, but we can change these. Click on the Assembly Parameters button. That will pull up this window:

Assembly parameters dialog

There is a text box that decides the overall parameters for clean data and dirty data, and sliders where you can change the minimum match percentage and the minimum overlap. For most alignments the defaults will be satisfactory, but you may want to alter these values, particularly if your data is poor at the ends of the sequences and may not match up. Click OK to accept these values and we will continue with the alignment.
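To see how the two parameters interact, here is a rough Python sketch of an overlap test that uses a minimum overlap length and a minimum match fraction. It assumes a simple gap-free comparison, and the function names and test sequences are invented for this example; Sequencher's real alignment is more sophisticated and is not described here.

```python
def overlap_identity(left, right, length):
    """Fraction of matching bases when the last `length` bases of `left`
    are laid over the first `length` bases of `right` (no gaps allowed)."""
    tail, head = left[-length:], right[:length]
    matches = sum(1 for a, b in zip(tail, head) if a == b)
    return matches / length


def find_overlap(left, right, min_overlap=20, min_match=0.85):
    """Return (length, identity) of the longest acceptable overlap, or None."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    for length in range(min(len(left), len(right)), min_overlap - 1, -1):
        identity = overlap_identity(left, right, length)
        if identity >= min_match:
            return length, identity
    return None


# A made-up 30-base sequence and a copy with 3 substituted bases (90% match).
reference = "ACGTTGCAAGGCTTAGCCATGGATCCTAGA"
noisy = list(reference)
for position in (2, 15, 28):
    noisy[position] = "T" if noisy[position] != "T" else "A"
noisy = "".join(noisy)

print(find_overlap(reference, noisy))            # (30, 0.9)
print(find_overlap("ACGTACGT", "ACGTACGT"))      # None: shorter than min_overlap
```

Lowering the minimum match lets noisier ends join, at the risk of joining sequences that do not really belong together; raising the minimum overlap has the opposite effect.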

When you click OK you are returned to the main window. At the moment we will only consider automatic assembly. It is sufficient for most assemblies, but you can also assemble sequences manually if you wish, and manual assembly will be covered later.

Click the Assemble Automatically button. If you do not have any sequences highlighted, it will warn you that there is an error, and that you need to highlight some sequences. Highlight them all and try again.

The assembly will be rapid and take about 3 or 4 passes through the data. At each pass it makes an alignment, and on the next pass it compares that alignment to the other alignments and the other sequences. The more sequences you have, the more alignments it will make and the more passes it will need. Notice that the sequence icon has changed to a contig icon to indicate that this is a contig. Also notice that the size is now larger than that of any single sequence; this is the size of the contig. Finally, notice that the kind says "Contig of x sequences". Click on the little triangle next to the contig icon and the contig will expand to list all the sequences inside it, like this:

View of expanded contig
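The repeated passes described above can be imitated with a simple greedy loop: on each pass, find the pair with the longest overlap, merge it, and compare the result against everything else on the next pass. This is a toy sketch under strong assumptions (exact matches, one strand only, invented reads and function names), not a description of Sequencher's actual algorithm.

```python
def overlap_len(left, right, min_overlap):
    """Longest exact suffix-of-left / prefix-of-right overlap, or 0."""
    for length in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left[-length:] == right[:length]:
            return length
    return 0


def greedy_assemble(reads, min_overlap=4):
    """Merge the best-overlapping pair on each pass until nothing overlaps.

    Returns the remaining contigs and the number of merging passes made.
    """
    contigs = list(reads)
    passes = 0
    while True:
        best_length, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                length = overlap_len(left, right, min_overlap)
                if length > best_length:
                    best_length, best_i, best_j = length, i, j
        if best_length == 0:
            return contigs, passes
        merged = contigs[best_i] + contigs[best_j][best_length:]
        # Replace the two merged entries with the new, longer contig.
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
        passes += 1


# Three made-up reads cut from one 24-base sequence, overlapping by 5 bases.
reads = ["ATGGCGTACCTT", "ACCTTGAGCATT", "GCATTCGGAT"]
contigs, passes = greedy_assemble(reads)
print(contigs, passes)  # ['ATGGCGTACCTTGAGCATTCGGAT'] 2
```

Note how the resulting contig is longer than any of the individual reads, which is the size increase you see in the main window.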

Now double click on the contig name. You will see a new window open with a graphical view of the contig:

Graphical view of the contig

There are many things to note about this view:

  1. The sequences are represented by arrows, with the name of the sequence above the arrow. Green arrows represent sequence in the forward direction and red arrows represent reverse complemented sequence. The position of the arrow represents the position of the sequence in the contig. The start and stop positions of the sequence in the contig are also noted on the arrow.
  2. The large blue/green bar represents the coverage of the contig by the sequences.
  3. Below the coverage bar are the base-pair positions marking the boundaries between the different levels of coverage.
  4. There is a three frame translation showing potential start and stop codons for this region. Start codons are shown as green tags (Start codon icon) and stop codons are shown as red lines (Stop codon icon).
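The coverage bar and its boundary positions (items 2 and 3 above) can be reproduced from the start and stop positions shown on the arrows. The sketch below is a minimal illustration; the function name and the example positions are invented, and positions are assumed to be 1-based and inclusive.

```python
def coverage_segments(placements, contig_length):
    """Split a contig into runs of constant coverage.

    placements: list of (start, end) read positions in the contig,
    1-based and inclusive. Returns a list of (start, end, depth) tuples.
    """
    if contig_length < 1:
        return []
    # Record where coverage goes up (read starts) and down (just past read end).
    change = [0] * (contig_length + 2)
    for start, end in placements:
        change[start] += 1
        change[end + 1] -= 1

    segments = []
    depth = 0
    previous_depth = None
    segment_start = 1
    for position in range(1, contig_length + 1):
        depth += change[position]
        if depth != previous_depth:
            if previous_depth is not None:
                segments.append((segment_start, position - 1, previous_depth))
            segment_start = position
            previous_depth = depth
    segments.append((segment_start, contig_length, previous_depth))
    return segments


# Three hypothetical reads placed in a 1400 bp contig.
placements = [(1, 600), (450, 1000), (900, 1400)]
for start, end, depth in coverage_segments(placements, 1400):
    print(f"{start}-{end}: covered by {depth} sequence(s)")
```

For these made-up placements the boundaries fall at 450, 601, 900 and 1001, which is the kind of information printed below the coverage bar.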


There are a large number of options from this view, and we will explore some of them below. Start by clicking on the bases window. You will see something like this:

bases view with all sequences

(Note that I scrolled along the window a little.) Notice that the sequence is given, and that secondary base calls are highlighted in pink as we have seen before. Mismatches are denoted by a dot and conservative mismatches are denoted by a + sign.
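As a rough picture of how a consensus line and mismatch marks relate to the aligned sequences, the following Python sketch takes a simple majority vote in each column and puts a dot under any column where the reads disagree. The function name and the aligned reads are invented, spaces stand for positions a read does not cover, and the rule Sequencher uses for the + (conservative mismatch) mark is not given in this tutorial, so it is left out.

```python
from collections import Counter


def consensus_and_marks(aligned_reads):
    """Majority-vote consensus plus a marker line with '.' at mismatches.

    aligned_reads: equal-length strings; a space means the read does not
    cover that column.
    """
    consensus = []
    marks = []
    for column in zip(*aligned_reads):
        bases = [base for base in column if base != " "]
        if not bases:
            consensus.append(" ")
            marks.append(" ")
            continue
        counts = Counter(bases)
        top_base, _ = counts.most_common(1)[0]
        consensus.append(top_base)
        # More than one distinct base in the column means a disagreement.
        marks.append("." if len(counts) > 1 else " ")
    return "".join(consensus), "".join(marks)


# Made-up alignment: the second read has a C where the others have a G.
aligned = [
    "ACGTACGTAC  ",
    "  GTACCTACGT",
    "ACGTACGTACGT",
]
consensus, marks = consensus_and_marks(aligned)
for read in aligned:
    print(read)
print(marks)
print(consensus)
```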

Highlight a selection of bases by clicking on the consensus sequence. All bases should highlight as shown in the above image. Now click the Show Chromatograms button. You will see a view like this:

multiple views of chromatograms

There are several things to notice in this view:

  1. The bases that were highlighted in the sequence view are still highlighted.
  2. I have adjusted the slider to optimize the sequences. There is one slider for each chromatogram.
  3. Some of the sequences are reversed. To make the alignment these sequences had to be reverse complemented. There are several ways of noticing this:
    1. The sequence is written backwards, so a G becomes a backwards G and so on.
    2. The icon above the slider has changed from a forwards icon to a reverse icon.
  4. Any sequence that has a discrepant base has the name highlighted in blue (e.g. sequence02.ab1 in the above picture).

You can edit sequences in this chromatogram window as we have done before.
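For reference, the reverse complementing that puts a read on the other strand (item 3 above) is a small operation in its own right. Here is a minimal Python sketch; the function name is made up for this example and only the bases A, C, G, T and N are handled.

```python
# Translation table mapping each base to its complement (N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Complement every base, then reverse the order of the bases."""
    return sequence.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGCCG"))  # CGGCAT
```

Applying the function twice returns the original sequence, which is why a read can be flipped for the alignment without losing any information.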