A First Introduction to NCBI

David R. Nelson rev. Jan. 11, 2002



   The first place to go for access to sequence data is NCBI, the National Center for Biotechnology Information. There is no more comprehensive site. If you will be using bioinformatics at all in the future, you should create a bookmark file for your bioinformatics links so they will be easily accessible. I have set my browser home button to the NCBI site, and one of my first bookmarks is to the BLAST server at NCBI. The link to NCBI is http://www.ncbi.nlm.nih.gov. For convenience, open these notes in a separate browser window so you can move back and forth between screens: in Netscape select FILE, NEW NAVIGATOR; in Internet Explorer select FILE, NEW WINDOW. Then paste in the URL for this page so you will have a duplicate.

There are many layers and functions at NCBI and we will not go over them all now. To start, I will give you a brief summary of the sequence content in some of the most useful sections of the database. The sequence data at Genbank have been growing exponentially. To see this graphically and in table format go to the Genbank growth statistics. You can see the growth rate is very steep, with 7.3 billion bases being added in 2000.

There is a daily base counter at NCBI. It showed 19.8 billion bases on Jan. 10, 2002 (follow the counter link for a current view). It showed 15.7 billion bases on May 10, 2001, so about 4 billion bases were added in the intervening 8 months. The Dec. 3, 2001 number of 14.4 billion bases (taken from the table) is quite a bit smaller than the Jan. 10, 2002 value, and I have requested an explanation.
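The growth estimate above is simple arithmetic on the two counter readings. A minimal sketch in Python, using only the figures quoted in the text; the variable and function names are illustrative assumptions, not anything provided by NCBI:

```python
# Counter readings quoted above, in billions of bases.
BASES_MAY_2001 = 15.7   # reading on May 10, 2001
BASES_JAN_2002 = 19.8   # reading on Jan. 10, 2002
MONTHS_ELAPSED = 8      # May 10 to Jan. 10


def growth(start, end, months):
    """Return (bases added, average bases added per month)."""
    added = end - start
    return added, added / months


added, per_month = growth(BASES_MAY_2001, BASES_JAN_2002, MONTHS_ELAPSED)
print(f"added:   {added:.1f} billion bases")
print(f"average: {per_month:.2f} billion bases per month")
```

This works out to roughly 4.1 billion bases, or about half a billion bases per month on average over that period.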

These sequences are in several sections, and it is important to know about them so you can find what you are looking for at Genbank. This table shows the breakdown of the four main sequence divisions in Genbank: nr (non-redundant), EST (expressed sequence tags), HTGS (high throughput genomic sequences) and GSS (genome survey sequences). STS (sequence tagged sites) is a small additional section used in mapping. We will talk about each of them in more detail.

nr is the default section of the database that you search when you do a BLAST search, unless you select another section. It is a good place to start. It contains all the genes that have been sequenced by individuals over the years, both cDNA and gene sequences. The human genome sequence moves into nr as it is finished. Unless a submitter is sending in a batch submission of large numbers of sequences for one of the other sections of the database, the sequence goes here. Note that as of Jan. 4, 2002 nr has 4.8 billion bases, or 29% of the total Genbank bases. If you only search here for a match to your sequence, you are neglecting more than 70% of Genbank data.

Below is an extract from the release notes of Genbank nr release 121 of Dec. 15, 2000 and similar data from release 126 of Oct. 15, 2001, so you can compare changes over the past year. A link is provided to the full release notes for release 126. These are found on the ftp download page.


2.2.7 Selected Per-Organism Statistics 

The following table provides the number of entries and bases of DNA/RNA for
the twenty most sequenced organisms in Genbank nr Release 121.0  Dec. 15, 2000  
(chloroplast and mitochondrial sequences not included):

Entries      Bases   Species

3918724  6702881570   Homo sapiens
2456194  1291602139   Mus musculus
 166554   487561384   Drosophila melanogaster
 181388   242674129   Arabidopsis thaliana
 114553   203544197   Caenorhabditis elegans
 188993   165539271   Tetraodon nigroviridis (freshwater pufferfish)
 151411   125948974   Oryza sativa (rice)
 218598   106344366   Rattus norvegicus
 159473    71215626   Bos taurus (cow)
 141802    62817102   Glycine max (soybean)
 104535    50991920   Medicago truncatula
  91334    49855996   Trypanosoma brucei
  97112    49415566   Lycopersicon esculentum (tomato)
  54328    47639714   Giardia intestinalis
  77532    47590936   Strongylocentrotus purpuratus (sea urchin)
  49938    44522016   Entamoeba histolytica
  57779    44489692   Hordeum vulgare (barley)
  83726    40906902   Danio rerio (zebrafish)
  77506    36885212   Zea mays (maize or corn)
  18361    32779082   Saccharomyces cerevisiae 
		
2.2.7 Selected Per-Organism Statistics 

The following table provides the number of entries and bases of DNA/RNA for
the twenty most sequenced organisms in Genbank nr Release 126.0  Oct. 15, 2001
(chloroplast and mitochondrial sequences not included):

Entries      Bases   Species

5074650  7915783043   Homo sapiens
3282738  1982497435   Mus musculus
 309512   615314337   Drosophila melanogaster
 277024   342250586   Rattus norvegicus
 196531   292339256   Oryza sativa (rice)
 194296   258809578   Arabidopsis thaliana
 140700   187274610   Caenorhabditis elegans
 189005   165547824   Tetraodon nigroviridis (freshwater pufferfish)
 198152    95024632   Bos taurus (cow)
 204698    92361300   Glycine max (soybean)
 156413    89308950   Danio rerio (zebrafish)
 155185    80380251   Lycopersicon esculentum (tomato)
 140798    72431327   Medicago truncatula
  80582    72089785   Entamoeba histolytica (liver parasite)
 121918    60487285   Xenopus laevis
 102233    58906089   Chlamydomonas reinhardtii
 124150    57745385   Zea mays (maize or corn)
  86956    54526352   Strongylocentrotus purpuratus (sea urchin)
 104222    54130240   Sus scrofa (pig)
  91420    53130188   Trypanosoma brucei

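To compare the two releases, it can help to compute how much each species grew between them. The following is a minimal sketch in Python for a handful of species that appear in both extracts; the base counts are copied from the release 121 and release 126 tables above, while the dictionary and function names are illustrative assumptions:

```python
# Base counts per species, copied from the two release-note extracts above.
RELEASE_121 = {
    "Homo sapiens": 6702881570,
    "Mus musculus": 1291602139,
    "Drosophila melanogaster": 487561384,
    "Oryza sativa": 125948974,
    "Rattus norvegicus": 106344366,
    "Tetraodon nigroviridis": 165539271,
}
RELEASE_126 = {
    "Homo sapiens": 7915783043,
    "Mus musculus": 1982497435,
    "Drosophila melanogaster": 615314337,
    "Oryza sativa": 292339256,
    "Rattus norvegicus": 342250586,
    "Tetraodon nigroviridis": 165547824,
}


def fold_change(old, new):
    """Map each species present in both releases to new_bases / old_bases."""
    return {species: new[species] / old[species] for species in old if species in new}


changes = fold_change(RELEASE_121, RELEASE_126)

# Print the fastest-growing species first.
for species, ratio in sorted(changes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{species:<26}{ratio:5.2f}x")
```

Among these six species, rat and rice grew the most in relative terms over the ten months between releases, while the Tetraodon total was essentially unchanged.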
The Genbank release 126 is broken into 240 files. If the EST, GSS, STS and HTGS files are ignored, nr is what's left. The sequences are categorized into the following 17 groups. The following codes are used to designate the data file divisions in Genbank:
		
                                                            files sequences   bases (in millions)	
1.  PRI - primate sequences                                   14    202,842        2200						
2.  ROD - rodent sequences                                     2     71,463         162					
3.  MAM - other mammalian sequences (not PRI or ROD)           1     34,963          34						
4.  VRT - other vertebrate sequences (not PRI, ROD or MAM)     1     62,274          59						
5.  INV - invertebrate sequences (fly, worm etc.)              4    100,881         499						
6.  PLN - plant, fungal, algal and protist sequences           4    177,479         429						
7.  BCT - bacterial sequences                                  4    118,777         320						
8.  VRL - viral sequences                                      2    134,470         118						
9.  PHG - bacteriophage sequences                              1      1,794           6						
10. SYN - synthetic sequences                                  1      6,579          12						
11. UNA - unannotated sequences                                1        668           0.3						
12. EST*- EST sequences (expressed sequence tags)            133  9,268,640        4247						
13. PAT - patent sequences                                     3    466,531         196						
14. STS*- STS sequences (sequence tagged sites)                2    112,466          44						
15. GSS*- GSS sequences (genome survey sequences, BAC ends)   41  2,732,066        1493						
16. HTG*- HTGS sequences (high throughput genomic sequences)  25     88,367        4538						
17. HTC - HTC sequences (high throughput cDNA sequences)       1     22,002          28
These designations appear in the top line of any sequence you pull out of Genbank. In nr, Primate is the largest section, with the human sequence in it. Rodent is about 13 times smaller, with other mammal and other vertebrate (zebrafish, Xenopus etc.) tiny by comparison. Invertebrate is pretty large since it has both fly and C. elegans sequences in it. Plant is a catch-all section that has every eukaryotic species that is not an animal, including fungi and protists, even though they are not plants. It really should be called EUK for other eukaryotes. It is pretty large because Arabidopsis is here. Bacteria is fairly big since 62 complete public bacterial genomes are here. Viral rounds out the nr section with 665 species. Many complete viral genomes have been sequenced. The HTC section is new and contains cDNAs from projects designed to obtain full length cDNAs from mouse (RIKEN in Japan) and from human and mouse (MGC, the Mammalian Gene Collection at NIH). The goal of the Mammalian Gene Collection (MGC) is to provide a complete set of full-length (open reading frame) sequences and cDNA clones of expressed genes for human and mouse. (NIH-MGC Project URL: http://mgc.nci.nih.gov).

You can specify most of these categories, except invertebrate, when doing a BLAST search. This is done using pull down menus to limit a search. There are also 27 common species in this pull down menu, but you can type in any taxon name in another window to limit a search to your own specific subset, such as Tetraodon.
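The statement that nr is what is left after ignoring the EST, GSS, STS and HTGS files can be checked against the division table. Here is a minimal sketch in Python, with base counts (in millions) copied from the release 126 table above; the names used are illustrative assumptions, and the grouping treats every unstarred division (including PAT and HTC) as part of nr, as the text describes:

```python
# Bases in millions per division, copied from the release 126 table above.
DIVISION_MBASES = {
    "PRI": 2200, "ROD": 162, "MAM": 34, "VRT": 59, "INV": 499, "PLN": 429,
    "BCT": 320, "VRL": 118, "PHG": 6, "SYN": 12, "UNA": 0.3, "EST": 4247,
    "PAT": 196, "STS": 44, "GSS": 1493, "HTG": 4538, "HTC": 28,
}

# Divisions marked with * in the table; everything else is counted as nr.
BULK_DIVISIONS = {"EST", "STS", "GSS", "HTG"}


def nr_share(divisions, bulk):
    """Return (nr bases, total bases, nr fraction of total), in millions."""
    total = sum(divisions.values())
    nr = sum(bases for code, bases in divisions.items() if code not in bulk)
    return nr, total, nr / total


nr, total, share = nr_share(DIVISION_MBASES, BULK_DIVISIONS)
print(f"nr: {nr:.0f} of {total:.0f} million bases ({share:.1%})")
print(f"PRI/ROD ratio: {DIVISION_MBASES['PRI'] / DIVISION_MBASES['ROD']:.1f}")
```

For release 126 this gives nr a share of a little over 28% of all bases, close to the 29% quoted for Jan. 4, 2002, and a primate-to-rodent ratio of between 13 and 14.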

Now follow the link to molecular databases from the left frame of the NCBI homepage. This new page has links in the left frame to the EST, GSS and STS databases. Click on the EST link. EST is the division for sequencing projects focusing on cDNAs, so its entries represent genes that are expressed as mRNA. Usually large numbers of ESTs are deposited from these projects. Going back to the table of Genbank statistics, I break down the human, mouse and other ESTs to show that 40% are human, over 20% are mouse and about one third are other. Note that there are about the same number of bases in the EST section as in the nr section. The EST and GSS databases have per-organism statistics. The top twenty organisms are shown below.

dbEST release 010402 (this is a date 1/4/02)
Summary by Organism - January 4, 2002
Number of public entries: 10,032,616

		
 
Homo sapiens (human)                                3,927,122
Mus musculus + domesticus (mouse)                   2,514,136
Rattus sp. (rat)                                      317,151
Drosophila melanogaster (fruit fly)                   255,456
Glycine max (soybean)                                 223,351
Bos taurus (cattle)                                   213,787
Xenopus laevis (African clawed frog)                  198,118
Danio rerio (zebrafish)                               185,251
Lycopersicon esculentum (tomato)                      141,735
Medicago truncatula (barrel medic)                    137,588
Caenorhabditis elegans (nematode)                     135,203
Zea mays (maize)                                      119,145
Arabidopsis thaliana (thale cress)                    113,330
Chlamydomonas reinhardtii                             111,958
Oryza sativa (rice)                                   100,480
Sus scrofa (pig)                                      100,006
Hordeum vulgare (barley)                               90,314
Sorghum bicolor (sorghum)                              84,712
Ciona intestinalis (sea squirt, a primitive chordate)  82,071
Triticum aestivum (wheat)                              73,395
dbGSS release 010402
Summary by Organism - January 4, 2002
Number of public entries: 3,328,802

					
Mus musculus                          939,684
Homo sapiens (human)                  871,247
Brassica oleracea                     198,932
Tetraodon nigroviridis                188,963
Pan troglodytes (chimp)               157,533
Rattus norvegicus                     110,521
Oryza sativa (rice)                    93,119
Trypanosoma brucei                     90,540
Arabidopsis thaliana                   86,899
Entamoeba histolytica                  79,716
Strongylocentrotus purpuratus          76,019
Anopheles gambiae                      60,351
Takifugu rubripes                      47,111
Drosophila melanogaster                45,610
Zea mays                               44,990
Schistosoma mansoni                    42,015
Trypanosoma cruzi                      21,317
Leishmania major                       15,401
Magnaporthe grisea (rice blast fungus) 12,674
Lycopersicon esculentum (tomato)       11,892
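The human, mouse and other proportions can be recomputed from the dbEST and dbGSS summaries above. The sketch below, in Python, works from entry counts (not bases) copied from those two summaries; the names are illustrative assumptions:

```python
# Entry counts copied from the dbEST and dbGSS summaries of January 4, 2002.
DBEST = {"total": 10_032_616, "Homo sapiens": 3_927_122, "Mus musculus": 2_514_136}
DBGSS = {"total": 3_328_802, "Homo sapiens": 871_247, "Mus musculus": 939_684}


def shares(summary):
    """Return the human, mouse and other fractions of a database's entries."""
    total = summary["total"]
    human = summary["Homo sapiens"] / total
    mouse = summary["Mus musculus"] / total
    return {"human": human, "mouse": mouse, "other": 1 - human - mouse}


for name, summary in (("dbEST", DBEST), ("dbGSS", DBGSS)):
    s = shares(summary)
    print(f"{name}: human {s['human']:.1%}, mouse {s['mouse']:.1%}, other {s['other']:.1%}")
```

By entry count, dbEST comes out at roughly 39% human, 25% mouse and 36% other, in line with the breakdown given earlier. In dbGSS, human and mouse together account for a little over half of the entries.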
					

The HTGS section is for genomic sequencing. This is where the genome projects send their data. It is the largest section of Genbank, and it has a lot of human and mouse sequence.

GSS stands for genome survey sequence. Most genome projects now are based on sequencing clones of differing sizes. Some are cosmids and some are BACs (bacterial artificial chromosomes). These are too big to sequence in one run, so they need to be broken into smaller pieces. One way to keep track of BACs is to sequence their ends to get a marker for mapping and later assembly. The BAC end sequences go in the GSS section of Genbank. They are similar in size to the ESTs, but they are from genomic DNA. There are almost 2 billion bases of GSS sequence.

I recommend searching all four main sections for matches to any sequence you have, since you don't want to miss anything. The smaller section called STS (only 44 million bases) contains sequences used for mapping. I usually search there too, even though there is not as much data. I have found missing exons in the STS database that are not in any other section of Genbank.

To follow the progress on sequencing the human genome, look at this link. The percent finished (63%) and the percent in draft (34.8%) are given for Dec. 31, 2001.