Extracting Information on Genomes

David R. Nelson Jan. 15, 2002


		
If you are interested in following the progress of genome sequencing, or of various EST 
projects in different species, you need a method to find out how many sequences are present 
in a database.  You might also want to know where sequences have been placed.  For 
example, the Giardia genome is mostly sequenced, but a BLAST search on the nr section of 
Genbank will not find it, because it is in HTGS (the high-throughput genomic sequences 
division).

This section will show you how to find out how many sequences are in Genbank for a given 
species or taxon.  It will also show you how to find out where sequences are located (which 
main division of Genbank) and when they were deposited.

Go to the NCBI home page and select taxonomy from the menu bar at the top of the screen.
This will take you to the homepage for taxonomy.  On the left side click on taxonomy 
statistics.  This shows the number of higher level taxa, like families and orders, and the 
number of genera, species and subspecies in the main divisions of life represented in Genbank.

Now hit the back button.  On the homepage type tetraodon in the search window and hit go.
This shows a list of Tetraodon species in Genbank.  At the top of the page is the taxonomic 
lineage.  To back up one level in the hierarchy, click on the last name in the lineage, 
Tetraodontidae.  Now you see that Tetraodon is a sister genus to Takifugu (the Fugu fish).  
They are both pufferfish.

Among the check boxes above, check nucleotides and then click on Takifugu rubripes 
(torafugu).  This gives the number of Genbank entries present for that species or 
taxonomic rank: 47,450.  The Tetraodon genus has more entries than Fugu: 189,167 (try it).  
The Fugu genome has been sequenced but it is not yet in Genbank, so Tetraodon is a good 
alternative when searching for fish genes.
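
A per-organism count like this can also be requested without the browser.  The sketch 
below only builds a query URL and does not fetch it.  It assumes NCBI's E-utilities 
esearch endpoint, the "nucleotide" database name and the rettype=count parameter, none of 
which are part of this walkthrough, and the helper name count_url is hypothetical.  
Counts returned today will almost certainly differ from the figures quoted here.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities esearch (not part of the original walkthrough).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def count_url(organism, db="nucleotide"):
    """Build (but do not fetch) a URL asking how many entries exist for an organism."""
    params = {
        "db": db,                     # assumed database name
        "term": f"{organism}[ORGN]",  # restrict the search to the organism field
        "rettype": "count",           # ask for the hit count only
    }
    return f"{ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    for name in ("Takifugu rubripes", "Tetraodon"):
        print(count_url(name))
```

Building the URL separately from fetching it keeps the query easy to inspect before 
anything is sent to NCBI.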

Where are these Tetraodon sequences?  Click on the submit query button.  All the sequence 
entries are listed in 9,459 pages of 20 entries each.  You will notice that many of these 
start with AJxxxxxx.  Click on one of the entries and look at it.  On the first line is the 
three-letter code VRT (vertebrate).  That is one of the divisions of the nr section of 
Genbank.  Hit the back button and click on limits.  Under the ALL FIELDS menu check the 
box that says exclude all of the above.  This limits the search to nr only by excluding 
EST, GSS, STS and HTGS (Working Draft).  In the search window type #3 (or the number of 
the tetraodon search) and go.  The results show 180 sequences, so only 180 sequences are 
in nr.  If we exclude everything except ESTs and nr we get 204 hits, so there are only 24 
ESTs for Tetraodon.  If we go down the line and remove the check from the STS box, we get 
the number in nr + EST + STS = 204, the same as before, so there are no STSs.  Now remove 
the check from the GSS box and go.  This gives 189,167, which is the full set.  This means 
that all except 204 of the Tetraodon entries (188,963) are in the GSS section.  
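
The bookkeeping in this paragraph is a chain of subtractions on running totals.  A 
minimal sketch, using only the counts quoted above (the function and variable names are 
illustrative):

```python
import math

TOTAL = 189_167   # all Tetraodon entries
PER_PAGE = 20     # entries listed per results page

# Running totals as each division is allowed back into the search,
# in the order the check boxes are cleared above.
cumulative = [
    ("nr", 180),
    ("EST", 204),
    ("STS", 204),
    ("GSS", 189_167),
]


def per_division(steps):
    """Turn running totals into per-division counts by successive subtraction."""
    counts, previous = {}, 0
    for name, running_total in steps:
        counts[name] = running_total - previous
        previous = running_total
    return counts


pages = math.ceil(TOTAL / PER_PAGE)   # the last page is only partly filled
counts = per_division(cumulative)

if __name__ == "__main__":
    print("pages:", pages)
    for name, n in counts.items():
        print(f"{name:4} {n:>7,}")
```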

Now hit the history button and type in the window #3 NOT #24 (most recent search).  This 
removes the 204 entries outside GSS from the full set, leaving 189,167 - 204 = 188,963 
entries, which is the set from GSS.  Click on the number 188,963 on the history page and 
this will list the results.  Click on one hit.
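
The NOT operator on history numbers behaves like a set difference.  The toy model below 
illustrates this with Python sets.  The accession strings are made up for the example, 
and history_not is a hypothetical helper, not an Entrez function.

```python
# Toy stand-ins for two history entries; the accession strings are made up.
history = {
    3: {"AJ000001", "AJ000002", "AJ000003", "AJ000004", "AJ000005"},  # every entry
    24: {"AJ000001", "AJ000002"},                                     # entries outside GSS
}


def history_not(history, a, b):
    """Entries found by search #a that are not found by search #b."""
    return history[a] - history[b]


gss_only = history_not(history, 3, 24)

# When #b is a subset of #a, the size of the result is a plain subtraction,
# which is why 189,167 - 204 = 188,963 in the real search.
assert history[24] <= history[3]
assert len(gss_only) == len(history[3]) - len(history[24])

if __name__ == "__main__":
    print(sorted(gss_only))
```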

The following comment is found in the entry.
COMMENT     This sequence is a single read and was generated as part of a large
            scale clone-end sequencing project of the Tetraodon nigroviridis
            genome. For more information, please take a look at
            http://www.genoscope.cns.fr/Tetraodon.

Go to the limits page and clear the check boxes, but add the modification dates 
2001-01-01 to 2001-12-31 and search #26 (most recent).  This gives a result of 0: none of 
the entries were deposited in 2001.  Move the dates back to 2000 and try again.  This time 
all the entries are found in the year 2000.  Look at one of the entries and see the date 
July 27, 2000.  Try the limit 2000-04-01 to 2000-08-31.  You will find that all the 
sequences were deposited in this five-month window.  You can continue to narrow the dates 
to see when the sequences were deposited (they were not all put in on one day).   
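
One way to narrow the dates systematically is to split the window into calendar months 
and repeat the limited search once per month.  The sketch below only generates the date 
pairs to type into the limits page.  It does not query Genbank, and monthly_windows is a 
hypothetical helper name.

```python
import calendar
from datetime import date


def monthly_windows(start, end):
    """Split the range start..end into calendar-month windows, clipped to the range."""
    windows = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        first = max(start, date(year, month, 1))
        last_day = calendar.monthrange(year, month)[1]   # number of days in this month
        last = min(end, date(year, month, last_day))
        windows.append((first, last))
        # Advance to the next month, rolling over the year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return windows


windows = monthly_windows(date(2000, 4, 1), date(2000, 8, 31))

if __name__ == "__main__":
    for first, last in windows:
        # ISO format matches the YYYY-MM-DD dates used on the limits page.
        print(first.isoformat(), "to", last.isoformat())
```

Any month that returns hits can then be split further, for example into weeks, in the 
same way.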

This same approach can be used to follow the progress in any sequencing project that has 
data at Genbank. 

Another example is Arabidopsis.  If you search Arabidopsis[ORGN] to limit the search to 
that organism, you will get 219,285 hits.  How many are ESTs?  Use the limits feature to 
exclude ESTs and you get 105,955.  Subtracting the second search from the first with 
#38 NOT #39 gives a difference of 113,330.  That is the number of ESTs.  If you want to 
know how many of the non-EST sequences are mRNAs, limit the search to mRNA as the molecule 
type and search #39 (105,955 entries).  This shows that 12,426 entries are mRNA.  Again 
take the difference, #39 NOT #43, to get the rest: 105,955 - 12,426 = 93,529.  This is the 
number of entries that are neither ESTs nor mRNA.  (You have to remember to take off the 
limits checkbox before doing this or you will get 0.)  If you now limit the results to 
genomic DNA/RNA and search this last result, #45, you get 93,520, so all but 9 are genomic 
DNA/RNA.  

How many are in HTGS (working draft)?  Under limits exclude working draft and do the 
search on #46.  The result is 93,486, so very few (34) are in HTGS.  Try exclude all, which 
leaves nr only: 5,335 are in nr.  Where are the rest?  Excluding GSS leaves only 6,621, so 
86,899 are in the GSS section.  We know that 5,335 of the 6,621 are in nr, so 1,286 are 
left.  If we exclude patents from these then only 171 remain, so 1,115 are patents.  If we 
now also exclude STS only 34 remain, so 137 are STSs.  These last 34 are the entries from 
HTGS that we detected earlier but did not remove.

This kind of strategy can be used to find where sequences are stored in Genbank and what 
type of sequences they are.

Arabidopsis
total      219,285
ESTs       113,330
mRNA        12,426
genomic     93,520
     HTGS       34
     GSS    86,899
     nr      5,335
     Patent  1,115
     STS       137
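
The arithmetic behind the Arabidopsis breakdown can be replayed as a short script.  This 
is only a consistency check on the counts quoted in this section.  The variable names are 
illustrative, and nothing is fetched from Genbank.

```python
total = 219_285      # Arabidopsis[ORGN]
non_est = 105_955    # after excluding ESTs
mrna = 12_426        # non-EST entries with molecule type mRNA
genomic = 93_520     # non-EST, non-mRNA entries that are genomic DNA/RNA

# Counts left among the genomic entries after each exclusion described in the text.
without_htgs = 93_486
nr = 5_335
without_gss = 6_621
after_patent_exclusion = 171
after_sts_exclusion = 34

ests = total - non_est
not_est_not_mrna = non_est - mrna
other = not_est_not_mrna - genomic   # neither EST, mRNA nor genomic

divisions = {
    "HTGS": genomic - without_htgs,
    "GSS": genomic - without_gss,
    "nr": nr,
    "Patent": (without_gss - nr) - after_patent_exclusion,
    "STS": after_patent_exclusion - after_sts_exclusion,
}

if __name__ == "__main__":
    print(f"ESTs {ests:,}  mRNA {mrna:,}  genomic {genomic:,}  other {other}")
    for name, n in divisions.items():
        print(f"  {name:7}{n:>7,}")
```

The five division counts add up to the genomic total, and ESTs, mRNA, genomic and the 9 
remaining entries add up to the overall total.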