Blast Searching II

David R. Nelson Jan. 14, 2002


You now have a sense of what is in Genbank. Now how do you search this data? The method is the BLAST search. Before continuing you should have read Rob Edwards' Blast Searching I under module 2 Links. Some of this section will repeat what was in Rob's section, but we will be getting into specific examples here so be patient. Please go to the BLAST page from the NCBI homepage. You will notice that there are several types of blast search methods depending on what you have and what you want to find. Notice the first three sections on this page. Nucleotide blast searches, Protein blast searches and Translated blast searches. If you have nucleotide sequence that is from a non-coding region, or you do not know what is in it, the nucleotide blast Blastn is what you should try first to identify your sequence. It may match a known gene. If you have protein sequence and you want to search the known protein translations in Genbank, BlastP is one of your choices. The translated blast searches are more powerful and they take more computer time. Tblastn is one of the most useful search tools. This is the one I use almost all of the time with fragments of genes that I have been able to translate into probable open reading frames based on other known protein sequences. With Tblastn you can search your protein sequence against all the nucleotides in Genbank sections that are translated in all six reading frames (three per DNA strand). You can hardly miss anything that matches if it is out there. Nucleotide searches are not as sensitive because there are only four bases but there are 20 amino acids, so there are better search statistics to find a match. The Blastx program lets you find ORFs (open reading frames) in your nucleotide sequence by hunting for matches to translated proteins. This is very helpful in analyzing fresh genomic sequence that is unannotated. The Tblastx program is the most computer intensive since it does the translation of the query sequence in six frames as well as the translation of the nulceotide database in 6 frames, so it is like doing 6 tblastn searches. Not too many servers will even offer this search, since it eats computer time so badly.

The Blast server page has a window to paste your sequence. This may be in FASTA format or just raw sequence. FASTA format always starts with an >IDENTIFIER LINE This may be anything you want to identify your sequence. This will appear in the search output. If you will be doing many searches and saving them or printing them, it is a good idea to include this identifier line so you will not mix up your output. The > symbol is required for an identifier, otherwise the program will try to interpret your identifier as amino acid or nucleotide sequence. The sequence can have numbers in it such as are found in the Genbank sequence format, or other formats like GCG (Genetics Computer Group) or PIR (Protein Information Resource). These numbers will be ignored. Non-standard letters like O, and J will also be ignored. For this example use this sequence.

>Dictyostelium CYP51 P450 C-terminal
ETQKDINDIVQKENQGEINFDGLKRMNRLETVIREVLRLHPPLIFLMRK VMTPMEYKGKTIPAGHILAVSPQVGMRLPTVYKNPDSFEPKRFDVED KTPFSFIAFGGGKHGCPGENFGILQIKTIWTVLSTKYNLEVGPVPPTD

FTSLVAGPKGPCMVKYSKKQK*

Paste this sequence in the window. Once your sequence is in the window, you need to select a program, like tblastn from the pull down menu. Choose Tblastn. You also need to select a database from another pull down menu. Select nr (the default). We will talk about the other options later. Below the sequence window is the blast button. Do not click this yet. Futher down the screen is the filter option. I usually recommend turning the filter off, unless there is a weird repetive sequence like a run of QQQQQ in your sequence. If you leave the filter on, you may find some unpleasant runs of XXXXXXX in the blast output where you do not want it.

One of the best features of this blast page is the limitation of sequences to be searched. Under options for advanced blasting, you can use a pull down menu to select a common species like mouse (Mus musculus) or a whole taxonomic range of organisms like green plants (Viridiplantae) or Fungi. I really wish they had some more of these ranges incorporated here, but you can also choose to type in a more specific set yourself like Diplomonadida or Rhodophyta. This goes in the window called limit by entrez query. I use this feature all the time to restrict the search output to just what I want. You may select Dictyostelium discoideum from the pull down menu if you want to find the exact sequence you are searching with, or you may expand the search to closely related organisms by typing in Mycetozoa. I think it may be more interesting to use Fungi on this first blast, since there are many CYP51s known from fungi in the nr database. Go ahead and click the blast button.

You will get a screen that says Format on a button. Click this button to begin formatting your results. This will give you another window that says how long the process will take before the next refresh. This could be short or long depending on usage at NCBI. Monday mornings are bad. Sometimes they say it will be 60 minutes or more. Evenings it may turn around in less than a minute. If the number of seconds is greater than 150, you may want to close the window and rehit the format button. This can give you the result sooner, though it may not.

The output will have a graphical box at the top that has colored bars representing the hits. Red is for the highest scoring hit and the colors get cooler as the scores drop. Red hits are generally pretty good matches. Since we searched a slime mold vs. Fungi here, the best hits are magenta not red. Below the graphics box is a text list of your hits with the best scores at the top. The e-19 numbers refer to the chance that this was an accidental match. The larger the negative exponent the better the match and the lower the probability that it was a chance occurrence.

Below the text list are the actual alignments. This is where you see what you have found. The output here shows a gap in the Dicty sequence compared to most of the fungal sequences. Since fungi have relatively few introns, this is probably not an intron in the fungi, but it may be an insertion that is found in fungi and not in the slime mold sequence. Notice that all the top hits are identified as CYP51 sequences or 14 alpha demethylase enzymes. This is because the cytochrome P450 family is extensively annotated and over 1000 of the genes are named. Even the new sequences that are not officially named are usually identified by this type of blast search and the authors often tag them as a P450 (in this case CYP51).
Your first search was intended to get you started in blast searching with a simple example. There are many aspects of this tool that we need to discuss in more detail. The first is the selection of the different databases to search. Most first time users of a software package tend to use the default settings. In this case that could be disasterous, because you could miss a large number of sequences that are not to be found in the nr section of Genbank (the default). There are large EST projects on some organisms that will make only EST sequences. These will not be found in nr. One example is soybean (Glycine max) with 148,000 ESTs. If you are after a plant gene, it may be present in these ESTs but not in nr. The 160,000 Bos taurus (cattle) ESTs can also be used to help assemble human or mouse genes, by indicating the location of introns. Another example is Giardia (hiker’s diarrhea). Giardia is a common human pathogen and it is thought to be a very ancient eukaryote. Therefore, it is valuable to sequence for medical and evolutionary studies. The Giardia Genome project is being conducted at Woods Hole Oceanographic Institute and the data is searchable at NCBI in the HTGS section of Genbank. (HTGS = High Throughput Genomic Sequences).

If you have not added the blast page at NCBI to your bioinformatics bookmark file, please do it now. NCBI Blast page

I recommend that you add the TBLASTN page also. TBLASTN

You will be needing these over and over as we progress. As a general rule, it is good policy to search nr, HTGS, EST and GSS sections of Genbank when hunting for matches to a new sequence. You can also search STS if you are desperate, but there are not very many entries in STS.

The following is a real example taken from an alignment of mitochondrial carrier sequences. Mitochondrial carriers are membrane transport proteins of the inner mitochondrial membrane (and peroxisomal and hydrogenosomal membranes). These carriers enable communication betweeen the mitochondrial matrix and the cytosol. They are responsible for exporting ATP to the cytosol and importing ADP into the matrix to be remade in to ATP. There are 46 of these genes in humans and similar numbers in other species (35 in yeast). I am posting a link to a sequence alignment of 265 of these carriers that has some gaps in some of the sequences. We will try to fill in some of these gaps. Once you are at the page use the find command to search for GMPA-. This should take you to a small gap in sequence 92. This gap was not possible to fill when the alignment was last revised, because this sequence did not exist in Genbank. The sequence looks like this with eight missing amino acids.

ANEDGQVSPGSLLLAGAIAGMPA--------VIKTRLQVAARAGQ-----TTYNGVT

The second gap is an alignment gap and there are no missing amino acids. Copy the sequence above and paste it in the blast window at NCBI. The dashes will be automatically removed as illegal characters. Select the mouse EST database and turn the filter off. Notice that the EST database has several options including EST, mouse EST, human EST and others EST. This is helpful to reduce searching unwanted sequences. If you want to find only mouse, use the mouse EST option. For all mammals use the pull down menu under options and select mammalia. For this search just use mouse ESTs. Do the blast.




>gi|1387895|gb|W76821.1|W76821 me73f10.r1 Soares mouse embryo NbME13.5 14.5 Mus musculus cDNA
           clone IMAGE:401227 5' similar to WP:K02F3.2 CE01348 ;.
          Length = 455

 Score = 48.1 bits (113), Expect = 2e-05
 Identities = 23/23 (100%), Positives = 23/23 (100%)
 Frame = +3

Query: 1   ANEDGQVSPGSLLLAGAIAGMPA 23
           ANEDGQVSPGSLLLAGAIAGMPA
Sbjct: 384 ANEDGQVSPGSLLLAGAIAGMPA 452

>gi|1387483|gb|W77458.1|W77458 me66c04.r1 Soares mouse embryo NbME13.5 14.5 Mus musculus cDNA
          clone IMAGE:400518 5' similar to WP:K02F3.2 CE01348 ;.
          Length = 471

 Score = 43.9 bits (102), Expect = 3e-04
 Identities = 21/21 (100%), Positives = 21/21 (100%)
 Frame = +2

Query: 24 VIKTRLQVAARAGQTTYNGVT 44
          VIKTRLQVAARAGQTTYNGVT
Sbjct: 2  VIKTRLQVAARAGQTTYNGVT 64

Shown above are the two ESTs that were at the existing boundaries of the small gap. Look at the first one W76821.1 (this is an accession number for the sequence. It has a version number. Some version numbers may be as high as 29). Notice on the 4th line that the Length is 455 nucleotides. Now look at the alignment. The upper sequence in the alignment is your query sequence. The lower sequence is the mouse EST hit. The line in between shows the matches between the two sequences. Look at the numbering on the lower sequence from 384 to 452. This is only three nucleotides from the end of the sequence. So this sequence stops where our gap was. The second EST W77458.1 starts where the gap ends. It has only one nucleotide protruding into the gap. If you click on the accession number of the first sequence it takes you to the sequence entry which was created on June 20, 1996. The second est was created on the same date. Notice that both ESTs have another identifier that looks like this: me73f10.r1. The r1 on the end of this name indicates it was sequenced from a reverse primer. This information may be valuable, since the insert of this clone may have been sequenced from the opposite end of the clone. That sequence would have a different accession number, but it would have the same clone name with .s1 at the end instead of .r1. The s1 indicates a standard read from a forward primer. The two sequences would be from the same clone, but from opposite ends. This can unite sequence fragmets that do not overlap in the middle. You can search for the clone name (without the .r1 part) in the NCBI homepage search window. If you find the .s1 match then you have more sequence information to use.

We have seen that the gap in the original sequence was caused by these sequences ending just a little short of overlapping. The two best hits in this search do span the gap, allowing the missing sequence to be filled in.





>gi|9904727|gb|BE624311.1|BE624311 uu45a06.y1 Soares_thymus_2NbMT Mus musculus cDNA clone
           IMAGE:3374866 5' similar to TR:O14575 O14575 SIMILAR TO
           ADP/ATP CARRIER PROTEINS ;.
          Length = 480

 Score = 80.1 bits (196), Expect = 4e-15
 Identities = 44/52 (84%), Positives = 44/52 (84%), Gaps = 8/52 (15%)
 Frame = +1

Query: 1   ANEDGQVSPGSLLLAGAIAGMPA--------VIKTRLQVAARAGQTTYNGVT 44
           ANEDGQVSPGSLLLAGAIAGMPA        VIKTRLQVAARAGQTTYNGVT
Sbjct: 283 ANEDGQVSPGSLLLAGAIAGMPAASLVTPADVIKTRLQVAARAGQTTYNGVT 438

>gi|3519846|gb|AI119522.1|AI119522 uf04h04.y1 Sugano mouse liver mlia Mus musculus cDNA clone
           IMAGE:1499671 5' similar to WP:K02F3.2 CE01348 ;.
          Length = 670

 Score = 80.1 bits (196), Expect = 4e-15
 Identities = 44/52 (84%), Positives = 44/52 (84%), Gaps = 8/52 (15%)
 Frame = +2

Query: 1   ANEDGQVSPGSLLLAGAIAGMPA--------VIKTRLQVAARAGQTTYNGVT 44
           ANEDGQVSPGSLLLAGAIAGMPA        VIKTRLQVAARAGQTTYNGVT
Sbjct: 482 ANEDGQVSPGSLLLAGAIAGMPAASLVTPADVIKTRLQVAARAGQTTYNGVT 637


BE624311 uu45a06.y1 was created August 24, 2000 so it is fairly new. The second one was created Sept. 2, 1998. Note that the clone name has a .y1 after it. That pairs up with a .z1 of the same name. The hit shown below does not come from the same gene. It is nearly identical on the right side, but not on the left. You need to be careful, especially with short sequences, that you are not making chimeric or hybrid sequences when you assemble protein sequences from fragments.




>gi|10753840|gb|BF022507.1|BF022507 uy43g06.y1 NCI_CGAP_Lu30 Mus musculus cDNA clone IMAGE:3662362 5'
           similar to TR:O75746 O75746 ARALAR1 PROTEIN. ;.
          Length = 450

 Score = 56.6 bits (135), Expect = 5e-08
 Identities = 32/51 (62%), Positives = 39/51 (75%), Gaps = 8/51 (15%)
 Frame = +3

Query: 1   ANEDGQVSPGSLLLAGAIAGMPA--------VIKTRLQVAARAGQTTYNGV 43
           A+E+G+V   +LL AGA+AG+PA        VIKTRLQVAARAGQTTY+GV
Sbjct: 294 ADENGRVGGINLLTAGALAGVPAASLVTPADVIKTRLQVAARAGQTTYSGV 446


That was an easy example of using the EST database to find a missing sequence fragment from a partial gene. Now we will try a harder problem with a larger gap, also in a mouse carrier. Go back to the sequence alignment link at the top of this page and search for -SFLAG in the alignment. You should find sequence 96 as shown below.



segment 3
 94A ----------PRKSDGSGEAVFYWSLIAGLLSGMTSAFMVTPFDVVKTRLQADGEKK--------FKGIM    0
 95  ------------FNELAGKASFAHSFVSGCVAGSIAAVAVTPLDVLKTRIQTLKKGLGE----DMYSGIT    0
 96  ------------------------SFLAGCVAGSAAAVAVNPCDVVKTRLQSLERGVNE----DTYSGFL    0
 96A ----------PRRNDGSGEAVFWCSFLAGLAAGSTAALAVNPFDVVKTRLQAIKKADGE----KEFKGIS    0


This gap extends upstream over several pages of the alignment. For your convenience they are shown here with some surrounding sequence. This will be important later. Notice that sequence 95 and 96 in the first segment below are nearly 100% identical. Sequence 95 is a human sequence and it is the ortholog of the mouse sequence with the gap. That will be useful to us.



segment 1
 94  GMYRGAAVNLTLVTPEKAIKLAANDFFRHQLS-KDG-----QKLTLLKEMLAGCGAGTCQVIVTTPMEML  112
 94A GMYRGSAVNIVLITPEKAIKLTANDFFRYHLASDDG------VIPLSRATLAGGLAGLFQIVVTTPMELL    0
 95  GMYRGAAVNLTLVTPEKAIKLAANDFFRRLLM-EDG-----MQRNLKMEMLAGCGAGMCQVVVTCPMEML    0
 96  GMYRGAAVNLTLVTPEKAIKLAANDFLRH-----------------------------------------    0
 96A GMYRGSGVNILLITPEKAIKLTANDYFRHKLTTKDG------KLPLTSQMVAGGLAGAFQIIVTTPMELL    0

segment 2
 94  KIQLQDAGRIAAQRKILAPRSTATQLTRDLLRSRG-IAGLYKGLGATLLRDVPFSVVYFPLFANLNQLG-    0
 94A KIQMQDAGRVDRAAGREVKTITALGLTKTLLRERG-IFGLYKGVGATGVRDITFSMVYFPLMAWINDQG-    0
 95  KIQLQDAGRLAVHHQGSARRPSATLIAWELLRTQG-LAGLYRGLGATLLRDIPFSIIYFPLFANLNNLG-    0
 96  ----------------------------------------------------------------------    0
 96A KIQMQDAGRVAKLAGKTVEKVSATQLASQLIKDKG-IFGLYKGIGATGLRDVTFSIIYFPLFATLNDLG-    0

Try to do the same technique as before. Copy the beginning and end of sequence 96 and place them in the tblastn window. Select mouse ESTs and turn off the filter. Do the blast. The results will show hits that extend into the gap region.



>gi|11620519|gb|BF533156.1|BF533156 602073695F1 NCI_CGAP_Li9 Mus musculus cDNA clone IMAGE:4210775 5'.
          Length = 890

 Score = 86.3 bits (212), Expect = 4e-17
 Identities = 42/42 (100%), Positives = 42/42 (100%)
 Frame = +1

 Query: 30  SFLAGCVAGSAAAVAVNPCDVVKTRLQSLERGVNEDTYSGFL 71
            SFLAGCVAGSAAAVAVNPCDVVKTRLQSLERGVNEDTYSGFL
 Sbjct: 124 SFLAGCVAGSAAAVAVNPCDVVKTRLQSLERGVNEDTYSGFL 249

>gi|10379135|gb|BE861312.1|BE861312 UI-M-AK0-adi-c-11-0-UI.r1 NIH_BMAP_MHY Mus musculus cDNA clone
           UI-M-AK0-adi-c-11-0-UI 5'.
          Length = 538

 Score = 65.5 bits (158), Expect = 8e-11
 Identities = 39/73 (53%), Positives = 43/73 (58%), Gaps = 13/73 (17%)
 Frame = +3

 Query: 1   GMYRGAAVNLTLVTPEKAIKLAANDFLRH-------------SFLAGCVAGSAAAVAVNP 47
            GMYRGAAVNLTLVTPEKAIKLAANDFLR                LAGC AG        P
 Sbjct: 309 GMYRGAAVNLTLVTPEKAIKLAANDFLRQLLMQDGTQRNLKMEMLAGCGAGICQVGITCP 488

Query: 48  CDVVKTRLQSLER 60
            +++K +LQ   R
Sbjct: 489 KEMLKIQLQDAGR 527

The first hit above (BF533156) extends 123 bp into the gap upstream. You can see this from the numbering of the lower sequence starting at 124. To get this upstream sequence you must click on the accession number from the blast search and go to the sequence entry. Copy the sequence and take it to the Quick Protein Translator from MBS (Molecular Biology Shortcuts). Paste it in the window and select three frames since you know it is in the plus strand and then translate it. Copy all three frames back to a word processor and compare them to the mouse and the human sequence above it to find the correct frame. The DNA sequence of BF533156


GCCAACCTGAATCAGCTGGGCAAACAGGGGGAAGTAGACAATGGAGAAGGGAACATCCCC
CTGTTTGCCAACCTGAATCAGCTGGGCCGCCCATCCTCTGAGGAGAAGTCGCCTTTCTAT
GTGTCCTTCCTAGCAGGCTGCGTGGCTGGGAGTGCAGCCGCTGTGGCCGTCAACCCTTGT
GATGTGGTGAAGACTCGGCTCCAGTCCCTTGAGAGAGGTGTTAATGAGGACACTTACTCT
GGGTTTCTGGACTGTGCAAGGAAGATCTGGAGACATGAAGGTCCCTCAGCCTTCCTGAAA
GGCGCATACTGCCGTGCGCTGGTCATTGCCCCGCTGTTTGGCATCGCCCAGGTGGTCTAC
TTTCTGGGCATTGCCGAGTCCCTGCTGGGGCTGCTGCAAGAACCCCAGGCCTGAGCCCAT
GGCTGCTTCTCTCCAGCCTATGGGCAGGGGCCAGAACAGGGTGACCAGCACAAGCCTGAG
GAGGAGTGGTCTCTCCCCGGTCCTCCTCATTAAGATGGGAAGGCAAGGGGAGGGTGCAGG
GTCCACATGGGTGATGCACACATAAGCCCCTGTGTGGTCCTGAAGGGACAACAAATGGGA
TCGAGGTCTTATCTATGTAGAAAATGCAGAAATCTGTACATCCCTCAAGCCAGTTCTGTC
CCATCCTTGTTACTCAAACCCAGTCCACTGGCTGAACACCCATGGGACAGAGCTGGTCTC
TGGGTGGGGGCCCCAGGCCTGGTTTGGGAGGGGGACCTACCTGGGGTTCACTGGGCCTGG
CCCTGGGGGCCCTGGCTTCCATAGGGGCCCACCCCCGATTTTTTGGGTTCCCCGCCAGGG
GTCTCGGCCGGCGAGCTTGCCGGTGGTCCCCTCGACCTGTCCCCGCTTGG


Only the first three lines are needed to include the first 124 bp. Translation of the first 180bp phase 1 has the known sequence SFLAG... And no stop codons, so this is probably correct if there are no frameshifts. Compare it to the human sequence.




Phase 1 ANLNQLGKQGEVDNGEGNIPLFANLNQLGRPSSEEKSPFYVSFLAGCVAGSAAAVAVNPC Phase 2 PT*ISWANRGK*TMEKGTSPCLPT*ISWAAHPLRRSRLSMCPS*QAAWLGVQPLWPSTL Phase 3 QPESAGQTGGSRQWRREHPPVCQPESAGPPIL*GEVAFLCVLPSRLRGWECSRCGRQPL Phase 1 mouse ANLNQLGKQGEVDNGEGNIPLFANLNQLGRPSSEEKSPFYVSFLAGCVAGSAAAVAVNPC ||||||| || | | || |||||| ||||| | YRGLGATLLRDIPFSIIYFPLFANLNNLGFNELAGKASFAHSFVSGCVAGSIAAVAVTPL Human 95
At least from PLFAN the sequences match, but upstream they do not and this region must be considered suspect. You have now filled in part of the gap. The region between FANLNNLG and GCVAG is not very strong and may contain a frameshift. Watch for better matches in this region. Look at the other hit BE861312. This has sequence going into the gap from the other side. Compare the sequence with the human sequence.



mouse
QLLMQDGTQRNLKMEMLAGCGAGICQVGITCPKEMLKIQLQDAGR
 ||| || ||||||||||||||| |||  ||||||||||||||||
RLLMEDGMQRNLKMEMLAGCGAGMCQVVVTCPMEMLKIQLQDAGR
Human 95

They are almost identical and so the frame is correct. This fills in much of the gap from the other side. There are only about 48 amino acids missing in the middle now. To try to fill this gap you can repeat the search of the EST database or you can go to another section to find the missing part, the HTGS section. I recommend this. You will find genomic DNA with introns here, so you need to have the human DNA to help identify the missing exon sequence. Construct a hybrid sequence with the new mouse sequence you just found and human from seq 95 to fill in the gap. Then blast the mouse HTGS section of Genbank. In the sequence below human is in lower case.



RLLMEDGMQRNLKMEMLAGCGAGMCQVVVTCPMEMLKIQLQDAGR
lavhhqgsarrpsatliawellrtqglaglyrglgatllrdipfsiiyf
PLFANLNNLGFNELAGKASFAHSFVSGCVAGSIAAVAVTPL


The result from this search shows a good match which is probably to the same gene. There are some differences, however this is a genomic sequence and the gene is on a 90,000 bp contig that is probably more accurate than the EST sequences. There are also three different exons here (see the nucleotide numbering). One of the regions with the greatest number of sequence differences is the region that looked like it might contain frameshifts earlier. The differences on the 2nd exon are mouse human differnces.



>gb|AC084273.11|AC084273 Mus musculus clone rp23-313e8, WORKING DRAFT SEQUENCE, 5 unordered pieces
          Length = 182050

Score = 81.6 bits (200), Expect = 2e-15
 Identities = 38/45 (84%), Positives = 40/45 (88%)
 Frame = -2

 Query: 9      QRNLKMEMLAGCGAGMCQVVVTCPMEMLKIQLQDAGRLAVHHQGS 53
               QRNLKMEMLAGCGAG+CQVV+TCPMEMLKIQLQDAGRL   H  S
 Sbjct: 142635 QRNLKMEMLAGCGAGICQVVITCPMEMLKIQLQDAGRLGEAHPNS 142501

 Score = 58.5 bits (140), Expect = 2e-08
 Identities = 30/34 (88%), Positives = 31/34 (90%)
 Frame = -3

 Query: 55     RRPSATLIAWELLRTQGLAGLYRGLGATLLRDIP 88
               RRPSATLIA ELLRTQGL+GLYRGLGATLLR  P
 Sbjct: 139970 RRPSATLIARELLRTQGLSGLYRGLGATLLR*AP 139869

Score = 92.4 bits (228), Expect = 1e-18
 Identities = 44/53 (83%), Positives = 47/53 (88%)
 Frame = -3

 Query: 83     LLRDIPFSIIYFPLFANLNNLGFNELAGKASFAHSFVSGCVAGSIAAVAVTPL 135
               L RDIPFSIIYFPLFANLN LG +EL GKASF HSFV+GC AGS+AAVAVTPL
 Sbjct: 139628 LYRDIPFSIIYFPLFANLNQLGVSELTGKASFTHSFVAGCTAGSVAAVAVTPL 139470

Probable sequence after assembly
QRNLKMEMLAGCGAGICQVVITCPMEMLKIQLQDAGRLXXXXXXX
RRPSATLIARELLRTQGLSGLYRGLGATLLRDIPFSIIYFPLFANLNQLGVSELTGKASFTHSFVAGCTAGSVAAVAVTPL

The sequence is almost done now, but there is the small matter of an intron exon joint to be settled. The human sequence has a phase 1 intron after DAGRL and the mouse is expected to have the same intron location. However, the sequence between the human and mouse does not match here for 7 amino acids. It would be desirable to find an EST for mouse that bridged this gap. The following search shows that there is an EST that does this and this sequence indicates that the mouse sequence is longer than the human sequence from the alignment.




>gb|AI848503.1|AI848503 UI-M-AP1-agf-c-11-0-UI.s2 NIH_BMAP_MST_N Mus musculus cDNA clone
           UI-M-AP1-agf-c-11-0-UI 3'.
          Length = 438

 Score =  200 bits (508), Expect = 5e-51
 Identities = 106/131 (80%), Positives = 106/131 (80%), Gaps = 18/131 (13%)
 Frame = -3

 Query: 1   QRNLKMEMLAGCGAGICQVVITCPMEMLKIQLQDAGRLXX------------------XX 42
            QRNLKMEMLAGCGAGICQVVITCPMEMLKIQLQDAGRL                      
 Sbjct: 415 QRNLKMEMLAGCGAGICQVVITCPMEMLKIQLQDAGRLAVCHQASASATPTSRPYSTGST 236

 Query: 43  XXXRRPSATLIARELLRTQGLSGLYRGLGATLLRDIPFSIIYFPLFANLNQLGVSELTGK 102
               RRPSATLIARELLRTQGLSGLYRGLGATLLRDIPFSIIYFPLFANLNQLGVSELTGK
 Sbjct: 235 STHRRPSATLIARELLRTQGLSGLYRGLGATLLRDIPFSIIYFPLFANLNQLGVSELTGK 56

 Query: 103 ASFTHSFVAGC 113
            ASFTHSFVAGC
 Sbjct: 55  ASFTHSFVAGC 23

This actually identifies a flaw in the alignment where a part of the human gene was left out because it was too long to fit into the existing alignment. The match below is human on the bottom and mouse on the top, showing the extra sequence also exists in a human EST.




>gb|BF329616.1|BF329616 CM2-BN0273-080600-227-b08 BN0273 Homo sapiens cDNA.
          Length = 298

 Score =  167 bits (423), Expect = 9e-41
 Identities = 84/95 (88%), Positives = 88/95 (92%)
 Frame = -3

 Query: 4   LKMEMLAGCGAGICQVVITCPMEMLKIQLQDAGRLAVCHQASASATPTSRPYSTGSTSTH 63
            LKMEMLAGCGAG+CQVV+TCPMEMLKIQLQDAGRLAV HQ SASA  TSR Y+TGS STH
 Sbjct: 287 LKMEMLAGCGAGMCQVVVTCPMEMLKIQLQDAGRLAVHHQGSASAPSTSRSYTTGSASTH 108

 Query: 64  RRPSATLIARELLRTQGLSGLYRGLGATLLRDIPF 98
            RRPSATLIA ELLRTQGL+GLYRGLGATLLRDIPF
 Sbjct: 107 RRPSATLIAWELLRTQGLAGLYRGLGATLLRDIPF 3

This effort now successfully fills in the gap in mouse sequence 96.