Extracting Information on Genomes
David R. Nelson Jan. 15, 2002
If you are interested in following the progress of genome sequencing, or various EST
projects in different species, you need a method to find out how many sequences are present
in a database. You might also want to know where sequences have been placed. For
example, the Giardia genome is mostly sequenced, but a BLAST search on the nr section of
Genbank will not find it, because it is in HTGS (high-throughput genomic sequences).
This section will show you how to find out how many sequences are in Genbank for a given
species or taxon. It will also show you where those sequences are located (which main
division of Genbank) and when they were deposited.
Go to the NCBI home page and select taxonomy from the menubar on the top of the screen.
This will take you to the homepage for taxonomy. On the left side click on taxonomy
statistics.
This shows the number of higher-level taxa (such as families and orders), genera, species
and subspecies represented in Genbank for each of the main divisions of life.
Now hit the back button. On the homepage type in tetraodon in the window and hit go.
This shows a list of Tetraodon species in Genbank. At the top of the page is the taxonomic
lineage. To back up one level in the hierarchy, click on the last name in the lineage,
Tetraodontidae. Now you see that Tetraodon is a sister genus to Takifugu (the Fugu fish).
They are both pufferfish.
On the check boxes above check nucleotides and then click on Takifugu rubripes
(torafugu). This will give you the count of how many Genbank entries are present for that
species or taxonomic rank: 47,450. The Tetraodon genus has more entries than Fugu, with
189,167 (try it). The Fugu genome has been sequenced but it is not yet in Genbank, so
Tetraodon is a good alternative when searching for fish genes.
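A count like this could in principle also be requested without the web pages. The
following is only a minimal sketch, assuming the NCBI E-utilities ESearch service and its
db, term and rettype parameters, none of which are part of the procedure described here.
The function name count_query_url and the database name "nuccore" are assumptions made
for illustration. The code only builds the request URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ESearch (not mentioned in the original text).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def count_query_url(organism, db="nuccore"):
    """Build an ESearch URL that asks only for the number of matching records."""
    params = {
        "db": db,                      # assumed database name for nucleotide records
        "term": f"{organism}[ORGN]",   # organism field tag, as used later in the text
        "rettype": "count",            # request the hit count rather than a list of IDs
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


fugu_url = count_query_url("Takifugu rubripes")
tetraodon_url = count_query_url("Tetraodon")
print(fugu_url)
print(tetraodon_url)
```

Any counts returned today would be expected to differ from the 2002 figures quoted above.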
Where are these Tetraodon sequences? Click on the submit query button. All the sequence
entries are listed in 9,459 pages of 20 entries each. You will notice that a lot of these
start with AJxxxxxx. Click on one of the entries and look at it. On the first line there is
the three-letter code VRT (vertebrate). That is one of the divisions of the nr section of
Genbank. Hit the back button. Click on limits. Under the ALL FIELDS menu check the box that
says exclude all of the above. This limits the search to nr only by excluding EST, GSS, STS
and HTGS (Working Draft). In the search window type #3 (or the number of the tetraodon
search) and go. The results show 180 sequences, which means only 180 sequences are in nr.
If we remove the check from the EST box, so that the search covers ESTs and nr, we get 204
hits, so there are only 24 ESTs for Tetraodon. If we go down the line and also remove the
check from the STS box, we get the number in nr + EST + STS = 204, the same as before, so
there are no STSs. Now remove the check from the GSS box and go. This gives 189,167, the
full set. This means that all except 204 Tetraodon entries are in the GSS section.
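The bookkeeping in this paragraph can be written out as a short sketch. Each search
re-includes one more division, so the size of a division is the difference between two
consecutive results. The helper name peel is hypothetical; the numbers are the ones
quoted above.

```python
def peel(cumulative):
    """Convert running totals (one more division included at each step)
    into the number of entries contributed by each division."""
    counts = {}
    previous = 0
    for name, running_total in cumulative:
        counts[name] = running_total - previous
        previous = running_total
    return counts


# Results of the successive Tetraodon searches described in the text.
tetraodon_steps = [
    ("nr", 180),         # everything else excluded
    ("EST", 204),        # nr + EST
    ("STS", 204),        # nr + EST + STS
    ("GSS", 189_167),    # nr + EST + STS + GSS
]

tetraodon_counts = peel(tetraodon_steps)
print(tetraodon_counts)
```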
Now hit the history button and type #3 NOT #24 (the most recent search) in the window.
This gives you the set from GSS. Click on the number 188,963 on the history page to list
the results. Click on one hit.
The following is found in the entry.
COMMENT This sequence is a single read and was generated as part of a large
scale clone-end sequencing project of the Tetraodon nigroviridis
genome. For more information, please take a look at
http://www.genoscope.cns.fr/Tetraodon.
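The #3 NOT #24 step used to isolate the GSS set behaves like a set difference. The
sketch below models that with Python sets. The history dictionary and the integer record
IDs are toy placeholders, not real Genbank data; only the search numbers #3 and #24 come
from the text.

```python
# Toy stand-ins for two history entries; the integers are placeholder record IDs.
history = {
    3: set(range(1, 11)),   # plays the role of "all Tetraodon entries"
    24: {1, 2, 3},          # plays the role of "nr + EST + STS"
}


def history_not(a, b):
    """Records in history set a that are absent from history set b."""
    return history[a] - history[b]


rest = history_not(3, 24)

# The size of the difference equals a simple subtraction only because
# set #24 is entirely contained in set #3.
assert history[24] <= history[3]
assert len(rest) == len(history[3]) - len(history[24])
print(len(rest))
```

With the real counts the same subtraction gives 189,167 - 204 = 188,963, the number
shown on the history page.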
Go to the limits page and take off the check boxes, but add in the modification dates
2001-01-01 to 2001-12-31 and search #26 (most recent). This will give you a result of 0.
None of the entries were deposited in 2001. Move the date back to 2000 and try again.
This time all the entries are found in the year 2000. Look at one of the entries and see the
date July 27, 2000. Try the limit 2000-04-01 to 2000-08-31. You will find that all the
sequences were deposited in this five-month window. You can continue to narrow the dates
to see when the sequences were deposited (they were not all put in on one day).
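Narrowing the dates by hand amounts to a search over date windows, and it can be done by
halving the window each time. The sketch below is one way to express that, under some
assumptions: count_in_window is a hypothetical stand-in for rerunning the limits search
with a given date range, and toy_dates is placeholder data in which only July 27, 2000
comes from the text.

```python
from datetime import date, timedelta


def first_hit_date(count_in_window, start, end):
    """Return the earliest day d in [start, end] for which the window
    start..d contains at least one entry, or None if the window is empty."""
    if count_in_window(start, end) == 0:
        return None
    lo, hi = start, end
    while lo < hi:
        mid = lo + (hi - lo) // 2
        if count_in_window(start, mid) > 0:
            hi = mid                        # an entry exists on or before mid
        else:
            lo = mid + timedelta(days=1)    # nothing yet, look later
    return lo


# Placeholder dates; only 2000-07-27 is taken from the text.
toy_dates = [date(2000, 4, 18), date(2000, 7, 27), date(2000, 8, 30)]


def toy_count(start, end):
    """Hypothetical replacement for a date-limited search."""
    return sum(start <= d <= end for d in toy_dates)


earliest_2000 = first_hit_date(toy_count, date(2000, 1, 1), date(2000, 12, 31))
earliest_2001 = first_hit_date(toy_count, date(2001, 1, 1), date(2001, 12, 31))
print(earliest_2000, earliest_2001)
```

Each halving step corresponds to one search on the limits page, so a full year can be
narrowed to a single day in about nine searches.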
This same approach can be used to follow the progress in any sequencing project that has
data at Genbank.
Another example is Arabidopsis. If you search Arabidopsis[ORGN] to limit the search to the
organism field, you will get 219,285 hits. How many are ESTs? Use the limits feature to
exclude ESTs. You get 105,955. Taking the difference of these two searches with #38 NOT
#39 gives 113,330. That is the number of ESTs. If you want to know how many of the non-EST
sequences are mRNAs, limit the search to mRNA as the molecule type and search #39 (105,955
entries). This shows that 12,426 entries are mRNA. Again take the difference, #39 NOT #43,
to get the rest: 105,955 - 12,426 = 93,529. This is the number of entries that are not
ESTs and not mRNA. (You have to remember to take off the limits checkbox before doing this
or you will get 0.) If you now limit the results to genomic DNA/RNA and search this last
result, #45, you get 93,520, so all but 9 are genomic DNA/RNA.
How many are in HTGS (working draft)? Under limits exclude working draft and do the
search on #46. The result is 93,486, so very few (34) are in HTGS. Try exclude all, which
leaves nr only: 5,335 are in nr. Where are the rest? Excluding GSS leaves only 6,621, so
86,899 are in the GSS section. We know that 5,335 of the 6,621 are in nr, so 1,286 are
left. If we exclude patents from these, only 171 remain, so 1,115 are patents. If we now
exclude STS, only 34 remain, so 137 are STSs. These last 34 are the entries from HTGS that
we detected earlier but did not remove.
This kind of strategy can be used to find where sequences are stored in Genbank and what
type of sequences they are.
Arabidopsis
total      219,285
ESTs       113,330
mRNA        12,426
genomic     93,520
HTGS            34
GSS         86,899
nr           5,335
Patent       1,115
STS            137
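The Arabidopsis figures can be cross-checked with a few subtractions. The sketch below
uses only the counts quoted in the text; the variable names are illustrative. It confirms
that the five divisions add up to the genomic total, and that ESTs, mRNA, genomic and the
9 remaining entries add up to the overall total.

```python
# Counts quoted in the text for Arabidopsis[ORGN].
total = 219_285
non_est = 105_955          # after excluding ESTs
mrna = 12_426              # non-EST entries with molecule type mRNA
genomic = 93_520           # non-EST, non-mRNA entries that are genomic DNA/RNA

ests = total - non_est
other = (non_est - mrna) - genomic   # neither EST, mRNA nor genomic

# Results of the successive exclusions applied to the genomic set.
without_htgs = 93_486      # working draft excluded
without_gss = 6_621        # GSS excluded
nr = 5_335                 # "exclude all", which leaves nr only
without_patents = 171      # the non-nr remainder after removing patents
without_sts = 34           # the same remainder after also removing STS

divisions = {
    "HTGS": genomic - without_htgs,
    "GSS": genomic - without_gss,
    "nr": nr,
    "Patent": (without_gss - nr) - without_patents,
    "STS": without_patents - without_sts,
}

# Both breakdowns should account for every entry.
assert sum(divisions.values()) == genomic
assert ests + mrna + genomic + other == total
print(ests, other, divisions)
```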