BLAST Searches

Why?

Suppose that we are walking along a street and see something that we have never seen before. As good scientists, we decide that we would like to know what this unkown object is and what it does. We could do several things. We could dismantle it and try and see what happens, or we could begin by looking at it and asking does it look like anything that we already know what the function is. The second option is obviously easier, and quicker in the long run.

  1. The first thing we notice is that it has four wheels. So it's probably a vehicle of some sort.
  2. Does it have a place where an engine could be? If not, maybe it's a horse drawn buggy. If it does, it is probably an auto.
  3. How big is it?
  4. Does it have a place where people could sit? If that is all the object has, it is possibly an automobile or a bus or a coach depending on how big it is.
  5. Does it have space for cargo? If it does it is possibly a truck or a van. Again, this will depend on the size.
So by asking a few questions, and comparing it to what we already know (trucks, vans, cars, busses, etc) we can say that this thing we have never seen before is a type of vehicle and is probably used for hauling goods, carrying people, and so on.

Note this was inspired in part by The Salvation of Doug. I encourage everyone to read it.

That is one of the most important aspects of the BLAST search. If we come across a DNA or protein sequence and we don't know what it is, we can compare it to all the sequences that we do know and try and identify something similar where the function has already been described. This would be a good starting point to track down the actual function of the DNA or protein that we are interested in.

How?

BLAST searches take DNA sequences or protein sequences and use some very fast computer tricks to find optimal alignments of these sequences to a database. The alignment process involves some very complicated statistics for identifying High-scoring Segment Pairs, regions of sequence (segments) that align together (pairs) with a probability that is greater than a random chance (high-scoring)

The results are then displayed in several forms. There is a graphical overview, a list of the most significant HSPs with their score and probability, and a view of the alignments of those high-scoring segment pairs

For this course, we are not going to consider the details of the BLAST system. We are going to focus on understanding how to use the system, understanding the results, and what the results actually mean.

Statistics

There is a lot of information about the BLAST search scoring system from NCBI, the Nucleic Acids Research paper describing BLAST, linked from the BLAST home page, and in the references.

For this course, we are not going to place a lot of emphasis on theory behind how the statistics are generated, however it is important that you know what they mean and understand the mechanisms used to generate them as this will affect how you view the results and the results that you get.

Scores

The score is the basis of the BLAST search. Each nucleotide or amino acid in an alignment is given a score depending on whether it matches or not. If there is no match the score is negative, and if there is a match the score is positive. For each region of alignment all the scores are added up, though the score can never go below zero. So, for example, consider the following alignment:

In this simple alignment the first positive running score value is at position 3. This is therefore defined as the start of an high scoring region. The maximum running score is at position 10 (maximum score = 6) and so this is defined as the potential end of the high scoring region. The value would turn negative at position 17, and so this confirms the end of the high scoring region. The next region that is positive is at position 19 and runs through position 21. Again, the result would turn negative at position 25, which is also the end of the sequence, and so 19-21 is also a high scoring region.

Although I used a simple +1/-1 system to demonstrate the scoring principle, in reality the BLAST program running a nucleotide search by default will use a +1 score for a positive match and a -3 score for a mismatch. This score is then normalized based on the scoring system used and whether gaps were included.

DNA alignments are simple because there are only limited changes that can take place, and so we can score a match with a positive value and a mismatch with a negative value. However with protein sequences the changes that can occur are more limited. For example, consider the following amino acids: (note the full genetic code is available in the Glossary)

Serine: AGT or AGC
Arginine: AGA or AGG
Phenylalanine: TTT or TTC

To change from a serine to an arginine will only take one base change (the last base) for either of the serine residues. To change from a serine to a phenylalanine will take two base changes (the first two bases). Therefore a Ser->Phe change is much less likely to happen than a Ser->Arg change. Of course, there are several other factors like function and protein structure that have to be taken into account. If you are aligning two protein sequences you should give a Ser->Phe change more penalty than a Ser->Arg change because there will have been more changes at the DNA level (therefore less chance of an amino acid substitution happening).

Dayhoff and others studied how frequently amino acids change to other observed aino acids in closely related proteins. From these results they constructed the PAM (for "point accepted mutation") model for amino acid substitutions to give each amino acid a score for the likelihood that it will change to another amino acid.

For example, a PAM matrix may look like the table below. Positive scores are highlighted for easy identification, and the scores for Ser->Arg and Ser-Phe are shown in red

PAM100 matrix

Using the Phe/Ser/Arg example above, what are the relative scores for Ser->Arg and Ser->Phe substitutions? Does this make sense?

There are a lot of different matrices that have been optimized for different alignment parameters. An alternative to PAM is the BLOSUM matrix which was derived using blocks of similar protein sequences.
In essence, the critical parameter for protein searches is the length of the query sequence. As a rough guide the following should be used as each has been optimized for these conditions

Gaps

The latest BLAST searches can introduce gaps into a sequence, and these are included as negative scores in the match. The default values for a gap is -5 to start a gap and -2 to continue a gap once it has been started. Gap scores are calculated based on the length of the gap, so a gap score would be:

    -(opening cost + (length of gap x continuing cost)),

so in this case a gap of 8 nucleotides would have a score of:
    -(5 + (8 x 2)) = -21

Therefore there is a heavy penalty for opening and continuing a gap in a run of bases.

Probability

When you get a BLAST result you also get a probability value [reported as P (probability) or E (expected occurence of random sequences)]. This value is the probability that the high-scoring segment pair occurs by random. For example if you take a completely random sequence the same length as your query sequence there is a chance that you could get exactly the same sequence by random. This probability is based on the length of the query sequence and the total length of the database. If you are comparing a short sequence it is more likely that a random sequence could give you the same result. If you are comparing a sequence to a large database of sequences there is more chance that you have a random sequence in there that matches your query sequence than if you are comparing your sequence to a small database.

A probability of 0 means that there is essential no chance your match was random. A probability of 10-50 (reported as 1 E -50) means that 1 result in 1050 results reported will be random and not significant. This is not very likely, and so most hits returned with a score of 10-50 are probably real. In contrast a result of 0.1 means that 1 in 10 of the results that you have got back occurred by random and are not significant. It is up to you to figure out which is the random result!

Databases

The National Centers used to maintain a single database that contained all the sequence data. A lot of effort went into maintaining that database (imaginatively called the non-redundant dtabase) and removing redundant sequences from the database. A submission to any of the centers would result in the permeation of the data into all the databases. Therefore people in Europe could submit to EMBL; people in Japan to the DNA databank of Japan; and people in the Americas to the NCBI. This is still true to some extent, and data is shared among all these systems, however because of the overwhelming amount of DNA sequence being made available (for example see this page on NCBI's growth statistics it is now not feasible to maintain separate databases. Instead there are a large, and ever growing, number of databases that you can search against. These include:

  1. nr: All GenBank, EMBL, DDBJ, and PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences). No longer "non-redundant" but that is what the name means. It used to be the single unified database for everything.
  2. SWISS-PROTSWISS PROT protein sequence database
  3. month: All new or revised GenBank, EMBL (European), DDBJ (Japanese), and PDB (protein database) sequences released in the last 30 days.
  4. Drosophila genome: Drosophila genome provided by Celera and Berkeley Drosophila Genome Project (BDGP).
  5. dbest: Database of GenBank, EMBL, DDBJ, and PDB sequences from EST Divisions. Expressed Sequence Tags.
  6. dbsts: Database of GenBank, EMBL, DDBJ, and PDB sequences from STS Divisions. Sequence Tagged Sites.
  7. htgs: Unfinished High Throughput Genomic Sequences: phases 0, 1 and 2 (finished, phase 3 HTG sequences are in nr)
  8. gss: Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.
  9. yeast: Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences
  10. E. coli: Escherichia coli genomic nucleotide sequences
  11. pdb: Sequences derived from the 3-dimensional structure from Brookhaven Protein Data Bank
  12. Patent: Nucleotide sequences derived from the Patent division of GenBank
  13. vector: Vector subset of GenBank
  14. mito: Database of mitochondrial sequences
  15. alu: Select Alu repeats from REPBASE (the repeat database), suitable for masking Alu repeats from query sequences.
  16. Kabat's database of Immunologically interesting protein sequences
  17. ESTs: ESTs from mouse, human, and other projects maintained as separate databases
  18. EPD: Eukaryotic promoter database.

Query Sequence Entry

Query sequences can be entered in two formats. Most often you will use the fasta format. In this case it should have an ID line that begins with a greater than sign (>) and then the sequence should start on the next line like this:

Note that the sequence can be in upper or lower case (or both mixed together!). If the sequence contains numbers or spaces these are ignored.

Alternatively you can enter an accession number of GI (genbank identifier) number. For example, for the S. Enteritidis SefR sequence the accession number would be AF239978. This is the sequence that has been used in all of the following examples.


These are the basics behind the BLAST system. We are not going to discuss the actual algorithims, a lot more information is available on this from the appropriate papers (see references below) than I can go into here. Instead we are going to focus on how to use BLAST and what the results mean.


References

General

A lot of information was taken from Gary Olsen's bioinformatics course taught at the University of Illinois, Urbana-Champaign, (UIUC) and I am indebted to him for his help, advice, and suggestions.

BLAST searches

  1. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.
  2. Gish, W. & States, D.J. (1993) "Identification of protein coding regions by database similarity search." Nature Genet. 3:266-272.
  3. Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25:3389-3402.

Matrices

  1. Dayhoff, M.O., Schwartz, R.M. & Orcutt, B.C. (1978) "A model of evolutionary change in proteins." In "Atlas of Protein Sequence and Structure, vol. 5, suppl. 3." M.O. Dayhoff (ed.), pp. 345-352, Natl. Biomed. Res. Found., Washington, DC.
  2. Henikoff, J.G. (1992) "Amino acid substitution matrices from protein blocks." Proc. Natl. Acad. Sci. USA 89:10915-10919.
  3. Altschul, S.F. (1993) "A protein alignment scoring system sensitive at all evolutionary distances." J. Mol. Evol. 36:290-300.

More references are available from the NCBI site.