Sequence Data Analysis

Genomics and Sequencing

In this class we will talk about genomic sequencing and sequencing in general. Many of the same techniques apply whether you are sequencing a single clone or an entire genome.

DNA sequencing involves labeling the DNA, usually with four different fluorescent dyes, separating the DNA by electrophoresis, and measuring the intensity of the fluorescence to determine the base order. When sequencing is performed by electrophoretic techniques a trace file can be obtained. The machines that separate the DNA and generate the trace file (at UT these are generally either ABI 3100 or 3700 capillary sequencers) can also guess at the DNA sequence based on the trace. The techniques used by the automated sequencers are not very sophisticated, and the sequence returned may contain errors.

For genome sequencing, the most widely used sequence analysis tools are Phred, Phrap, and Consed, a suite of programs written by researchers at the University of Washington, Seattle.

In almost all cases, running phred, phrap, and the other programs is as simple as typing phredPhrap, which runs a script. This script makes sure that all the files are named correctly and that the programs are always run in the same order. The order is critical and is generally:

Phred

Phred is the basecalling program. It looks at the trace file and considers each peak in the sequence. It then gives that peak a score based on the probability that the base call is correct. The score is generally a number between 1 and 40, and it is logarithmic: a score of Q means that the base is miscalled about one time in 10^(Q/10). So a Phred score of 20 means that the peak is incorrectly called one time in 100 (10^2), and a Phred score of 45 means that the base is wrong one time in about 31,623 (10^4.5). In general a Phred score of 20 or higher means that the base call is reliable.
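The relationship between a Phred score and the error rate can be checked with a few lines of Python. This is only a sketch of the arithmetic described above; the function names are made up for this example and are not part of Phred itself.

```python
import math


def phred_to_error_probability(q):
    """Probability that a base with Phred score q is called incorrectly."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Print the "one error in N calls" figure for a few scores.
for q in (10, 20, 30, 45):
    p = phred_to_error_probability(q)
    print(f"Phred {q}: about 1 error in {1 / p:.1f} base calls")
```

Because the scale is logarithmic, every 10 points on the Phred scale means a tenfold drop in the error rate.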

Phred reads in the chromatogram file and outputs a "phd" file that has one base and its corresponding quality score per line.
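As a rough illustration, a phd file can be read back with a short Python function. The sketch below assumes that the base calls sit between BEGIN_DNA and END_DNA marker lines and that each line holds the base, its quality score, and the peak position; the example read is invented for illustration, so check a real phd file from your own run before relying on this layout.

```python
def parse_phd(text):
    """Return (sequence, qualities) from the DNA section of a phd file.

    Assumes one "base quality [position]" entry per line between the
    BEGIN_DNA and END_DNA markers (an assumption about the layout).
    """
    bases, quals = [], []
    in_dna = False
    for line in text.splitlines():
        line = line.strip()
        if line == "BEGIN_DNA":
            in_dna = True
        elif line == "END_DNA":
            in_dna = False
        elif in_dna and line:
            fields = line.split()
            bases.append(fields[0])       # the called base
            quals.append(int(fields[1]))  # its Phred quality score
    return "".join(bases), quals


# A made-up four-base read, for illustration only.
EXAMPLE_PHD = """\
BEGIN_SEQUENCE example_read
BEGIN_DNA
a 9 12
c 20 25
g 45 38
t 31 50
END_DNA
END_SEQUENCE
"""

sequence, qualities = parse_phd(EXAMPLE_PHD)
print(sequence, qualities)
```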

Determine Read Types

Determine Read Types is a Perl script that parses the file name to look for information about the sequence. Specifically, it looks for the read direction, primer type, and template name. These are important for the subsequent analysis.

Most high-throughput sequencing centers have a standard naming convention that is always followed. For example, one of the reads from the Salmonella Enteritidis genome is SE220168000_SEG-188.0_B10_077_74.ab1. This is sequencing plate number 168, colony plate number 188, well B10. Another read is SE220200000_SEG-225.0_E12_SEG-120.0_C06.T7.ab1. This is sequencing plate number 200 from colony plate number 225. Plate 225 was a re-arrayed plate, and this particular well came from colony plate 120, well C06. It is a reverse read using the T7 primer.
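To show the idea of pulling information out of a file name, here is a small Python sketch (the real Determine Read Types script is written in Perl and is not reproduced here). The pattern is inferred only from the two example names above: it assumes the sequencing plate number is the three digits after the first five characters, that wells are a letter A-H followed by two digits, and that T7 is the only primer tag. A real sequencing center's convention will have more cases than this.

```python
import re

# Pattern guessed from the two example file names; not an official definition.
READ_NAME = re.compile(
    r"^SE\d{3}(?P<seq_plate>\d{3})\d{3}"   # SE220168000 -> sequencing plate 168
    r"_SEG-(?P<colony_plate>\d+)\.\d+"     # SEG-188.0   -> colony plate 188
    r"_(?P<well>[A-H]\d{2})"               # B10
    r"(?:_SEG-(?P<source_plate>\d+)\.\d+_(?P<source_well>[A-H]\d{2}))?"  # re-arrayed origin
    r".*?"                                 # other fields we do not interpret
    r"(?:\.(?P<primer>T7))?"               # optional primer tag
    r"\.ab1$"
)


def parse_read_name(filename):
    """Return a dict of the fields recognised in a trace file name."""
    match = READ_NAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognised read name: {filename}")
    info = match.groupdict()
    for key in ("seq_plate", "colony_plate", "source_plate"):
        if info[key] is not None:
            info[key] = int(info[key])
    return info


for name in (
    "SE220168000_SEG-188.0_B10_077_74.ab1",
    "SE220200000_SEG-225.0_E12_SEG-120.0_C06.T7.ab1",
):
    print(name, parse_read_name(name))
```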

Crossmatch

Crossmatch screens each of the sequences for vector contamination. If it finds any potential vector sequence at either end of a read, it replaces the bases in that region with X. These bases are then ignored by subsequent programs and are not included in the assembly.
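The effect of the masking can be seen with a small Python sketch. This does not reproduce how Crossmatch finds vector sequence; it only shows how a later step might skip the X bases and keep the usable part of a read. The function name and the example read are invented for this illustration.

```python
import re


def longest_unmasked_region(sequence):
    """Return (start, end) of the longest stretch with no X, end exclusive."""
    best = (0, 0)
    for match in re.finditer(r"[^Xx]+", sequence):
        if match.end() - match.start() > best[1] - best[0]:
            best = (match.start(), match.end())
    return best


# A made-up screened read: vector masked at both ends.
screened = "XXXXACGTACGTXX"
start, end = longest_unmasked_region(screened)
print(screened[start:end])
```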

phd2fasta

phd2fasta converts the phd files from phred into two fasta files that can be read by phrap. The first contains the sequence data, and the second contains the quality scores that phred calculated.
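The two files share the same header line for each read, so the sequence and its scores can be matched up by name. A minimal Python sketch of that pairing follows; the function name, the read name, and the line width are assumptions for illustration, and the real phd2fasta program may format its output differently.

```python
def to_fasta_and_qual(name, bases, quals, width=50):
    """Return (fasta_text, qual_text) for one read.

    bases is a string of base calls and quals a list of Phred scores,
    one score per base.
    """
    if len(bases) != len(quals):
        raise ValueError("need exactly one quality score per base")
    seq_lines = [bases[i:i + width] for i in range(0, len(bases), width)]
    qual_lines = [
        " ".join(str(q) for q in quals[i:i + width])
        for i in range(0, len(quals), width)
    ]
    fasta_text = f">{name}\n" + "\n".join(seq_lines) + "\n"
    qual_text = f">{name}\n" + "\n".join(qual_lines) + "\n"
    return fasta_text, qual_text


fasta_text, qual_text = to_fasta_and_qual("read1", "acgt", [9, 20, 45, 31])
print(fasta_text)
print(qual_text)
```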

Phrap

Phrap is the assembly program. It reads the two fasta files (sequence and quality) generated above, and compares each sequence with every other sequence. Phrap has a very long output, some of which is shown here

Phred, phrap, and consed are available from the author's site. We have the programs available on the UT campus, and if you would like further information, contact Rob Edwards.


For most sequencing applications, graphical programs may also be used. The vectorNTI suite that we used previously can assemble contigs, and you may use that software on campus.

For this class, we are going to consider a different application, Sequencher.

Sequencher

Sequencher is a program that can read one or hundreds of raw trace files. It is a versatile stand-alone application that can be run on a PC or Macintosh, and it can provide a great deal of information about sequence data. Sequencher is particularly appropriate for single reads or small assemblies with hundreds of reads, although it has also been used to assemble small genomes. We will use Sequencher to analyze a single read and optimize the base calling for this read. We will then combine more reads and try to align them.

Click here to continue with the Sequencher tutorial

Please note:


We thank all authors for the use of their software in this course.