This is an ongoing attempt to build a glossary of terms that are frequently used in bioinformatics. If you would like a term explained that is not on this list, please email Rob Edwards.
Index
- BLAST
- redundant DNA codes
- Standard genetic code
- FASTA format
- NCBI Databases
BLAST
The Basic Local Alignment Search Tool (BLAST) is an efficient computer program for comparing DNA and protein sequences. There are several flavors of BLAST available:
- BLASTP compares an amino acid query sequence against a protein sequence database
- BLASTN compares a nucleotide query sequence against a nucleotide sequence database
- BLASTX compares a nucleotide query sequence translated in all reading frames against a protein sequence database
- TBLASTN compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
- TBLASTX compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
Not all flavors are available at all times. The six-frame translations require a lot of processor time and are often limited in usage.
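To illustrate what "all reading frames" means for BLASTX, TBLASTN and TBLASTX, here is a minimal Python sketch that splits a nucleotide sequence into its six frames: three offsets on the forward strand and three on the reverse complement. The function names and the '+1' to '-3' frame labels are assumptions made for this example and are not part of BLAST itself.

```python
# Complement of each unambiguous DNA base (input is upper-cased first).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def six_frames(seq: str) -> dict[str, str]:
    """Return the six reading frames as nucleotide strings.

    Keys are '+1', '+2', '+3' for the forward strand and '-1', '-2', '-3'
    for the reverse complement. Each frame is trimmed to whole codons.
    """
    frames = {}
    for sign, strand in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frame = strand[offset:]
            frame = frame[: len(frame) - len(frame) % 3]  # drop a trailing partial codon
            frames[f"{sign}{offset + 1}"] = frame
    return frames


frames = six_frames("ATGAAAC")
```

Each of the six strings would then be translated to protein before comparison, which is why the translated searches cost roughly six times the work per sequence.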
DNA codes
| symbol | base | | symbol | base |
| A | adenine | | M | A C (amino) |
| C | cytosine | | S | G C (strong) |
| G | guanine | | W | A T (weak) |
| T | thymine | | B | G T C (not A) |
| U | uracil | | D | G A T (not C) |
| R | G A (purine) | | H | A C T (not G) |
| Y | T C (pyrimidine) | | V | G C A (not T) |
| K | G T (keto) | | N | A G C T (any) |
| | | | - | gap of indeterminate length |
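One common use of these codes is searching a sequence for a motif that contains ambiguous positions. The sketch below, in Python, expands each symbol from the table into a regular-expression character class. The names `IUPAC`, `iupac_to_regex` and `find_sites` are made up for this example. Treating U as T and leaving out the gap symbol are simplifying assumptions.

```python
import re

# Ambiguity codes from the table above, mapped to the bases they stand for.
# Assumption: U is treated as T so RNA-style patterns can match DNA.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_regex(pattern: str) -> str:
    """Turn a pattern with ambiguity codes into a regular expression."""
    parts = []
    for symbol in pattern.upper():
        bases = IUPAC[symbol]  # raises KeyError for symbols not in the table
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)


def find_sites(pattern: str, seq: str) -> list[int]:
    """Return 0-based start positions where the pattern matches seq."""
    # A lookahead is used so that overlapping matches are also reported.
    regex = re.compile(f"(?={iupac_to_regex(pattern)})")
    return [m.start() for m in regex.finditer(seq.upper())]


sites = find_sites("GANTC", "ttGAATCggGACTCaa")
```

Here `GANTC` expands to `GA[ACGT]TC`, so both `GAATC` and `GACTC` in the example sequence are reported.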
Standard Genetic Code
This is the standard genetic code. Alternative genetic codes and modifications to this code are available from the NCBI.
The potential initiation codons listed by the NCBI for the standard code are TTG, CTG and ATG.
| First base | Second base T | Second base C | Second base A | Second base G | Third base |
| T | TTT F Phe | TCT S Ser | TAT Y Tyr | TGT C Cys | T |
| T | TTC F Phe | TCC S Ser | TAC Y Tyr | TGC C Cys | C |
| T | TTA L Leu | TCA S Ser | TAA * Ter | TGA * Ter | A |
| T | TTG L Leu | TCG S Ser | TAG * Ter | TGG W Trp | G |
| C | CTT L Leu | CCT P Pro | CAT H His | CGT R Arg | T |
| C | CTC L Leu | CCC P Pro | CAC H His | CGC R Arg | C |
| C | CTA L Leu | CCA P Pro | CAA Q Gln | CGA R Arg | A |
| C | CTG L Leu | CCG P Pro | CAG Q Gln | CGG R Arg | G |
| A | ATT I Ile | ACT T Thr | AAT N Asn | AGT S Ser | T |
| A | ATC I Ile | ACC T Thr | AAC N Asn | AGC S Ser | C |
| A | ATA I Ile | ACA T Thr | AAA K Lys | AGA R Arg | A |
| A | ATG M Met | ACG T Thr | AAG K Lys | AGG R Arg | G |
| G | GTT V Val | GCT A Ala | GAT D Asp | GGT G Gly | T |
| G | GTC V Val | GCC A Ala | GAC D Asp | GGC G Gly | C |
| G | GTA V Val | GCA A Ala | GAA E Glu | GGA G Gly | A |
| G | GTG V Val | GCG A Ala | GAG E Glu | GGG G Gly | G |
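The table can be turned into a lookup for translating a DNA sequence. The following Python sketch builds the 64-codon dictionary from a compact string written in the same T, C, A, G order as the table. The names `CODON_TABLE` and `translate`, and the choice to report unrecognised codons as `X`, are assumptions for this example.

```python
from itertools import product

BASES = "TCAG"  # same base order as the table above
# One-letter amino acids for TTT, TTC, TTA, TTG, TCT, ... GGG; '*' is a stop.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # first base T
    "LLLLPPPPHHQQRRRR"  # first base C
    "IIIMTTTTNNKKSSRR"  # first base A
    "VVVVAAAADDEEGGGG"  # first base G
)

# product() varies the third base fastest, matching the order of AMINO_ACIDS.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str) -> str:
    """Translate a DNA sequence in frame 1; unknown codons become 'X'."""
    seq = seq.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(seq) - 2, 3):  # trailing partial codon is ignored
        protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(protein)


protein = translate("ATGGCCTAA")
```

The example sequence reads ATG, GCC, TAA, which the table gives as Met, Ala and a terminator.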
FASTA format
FASTA format was developed by Bill Pearson and is one of the simplest formats for sequences. The basic format is as follows:
>sequence_id_1
gataggctgagcgatgcgatgctagctagctagc
>sequence_id_2
gatagctcgatcgatcggagcgatcgatcgagctagc
The identifier line always begins with a greater-than sign (>) and occupies exactly one line. The sequence begins on the next line and may continue over several lines. Multiple sequences can be kept in the same file. Some programs allow spaces in the identifier line, and some do not.
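Reading this format takes only a few lines of code. Below is a minimal Python sketch of a parser that follows the rules just described: a line starting with `>` opens a new record and every following line is appended to its sequence. The function name `parse_fasta` is an assumption for this example. Skipping blank lines is a convenience choice and not part of the format description above.

```python
from typing import Iterable, Iterator


def parse_fasta(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    identifier = None
    chunks: list[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)  # finish the previous record
            identifier = line[1:]
            chunks = []
        elif identifier is not None:
            chunks.append(line)  # a sequence may span several lines
    if identifier is not None:
        yield identifier, "".join(chunks)  # finish the last record


example = """>sequence_id_1
gataggctgagcgatgcgatgctagctagctagc
>sequence_id_2
gatagctcgatcgatcggagcgatcgatcgagctagc
"""
records = list(parse_fasta(example.splitlines()))
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly without loading the whole file into memory.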
NCBI Databases
These are the databases available from NCBI:
- nr: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences). The name stands for "non-redundant", although the database is no longer non-redundant. It used to be the single unified database for everything.
- SWISS-PROT: SWISS-PROT protein sequence database
- month: All new or revised GenBank, EMBL (European), DDBJ (Japanese), and PDB (Protein Data Bank) sequences released in the last 30 days.
- Drosophila genome: Drosophila genome provided by Celera and Berkeley Drosophila Genome Project (BDGP).
- dbest: Database of GenBank, EMBL, DDBJ, and PDB sequences from EST Divisions. Expressed Sequence Tags.
- dbsts: Database of GenBank, EMBL, DDBJ, and PDB sequences from STS Divisions. Sequence Tagged Sites.
- htgs: Unfinished High Throughput Genomic Sequences: phases 0, 1 and 2 (finished, phase 3 HTG sequences are in nr)
- gss: Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.
- yeast: Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences
- E. coli: Escherichia coli genomic nucleotide sequences
- pdb: Sequences derived from the 3-dimensional structure from Brookhaven Protein Data Bank
- Patent: Nucleotide sequences derived from the Patent division of GenBank
- vector: Vector subset of GenBank
- mito: Database of mitochondrial sequences
- alu: Select Alu repeats from REPBASE (the repeat database), suitable for masking Alu repeats from query sequences.
- kabat: Kabat's database of immunologically interesting protein sequences
- ESTs: ESTs from mouse, human, and other projects maintained as separate databases
- EPD: Eukaryotic promoter database.