NCBI maintains a daily base counter. It showed 19.8 billion bases on Jan. 10, 2002. For a current view click here. The counter showed 15.7 billion bases on May 10, 2001, so about 4 billion bases have been added in the last 8 months. The Dec. 3, 2001 number of 14.4 billion bases (taken from the table) is quite a bit smaller than the Jan. 10, 2002 value and I have requested an explanation.
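The growth figure is simple arithmetic on the two counter readings. A minimal Python sketch of it, using only the dates and values quoted here (the variable names are illustrative):

```python
from datetime import date

# Base counter readings quoted in the text, in billions of bases.
readings = {
    date(2001, 5, 10): 15.7,
    date(2002, 1, 10): 19.8,
}

start, end = min(readings), max(readings)
added = readings[end] - readings[start]   # billions of bases added
days = (end - start).days                 # length of the interval in days

print(f"{added:.1f} billion bases added in {days} days")
print(f"about {1000 * added / days:.0f} million bases per day")
```

The per-day rate is only an average over the interval; the counter itself may grow unevenly.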
These sequences are in several sections, and it is important to know about them so you can find what you are looking for at Genbank. This table shows the breakdown of the four main sequence divisions in Genbank. These are nr (non-redundant), est for expressed sequence tags, htgs for high throughput genomic sequences and gss for genome survey sequences. STS (sequence tagged sites) is a small section used in mapping. We will talk about each of them in more detail.
Nr is the default section of the database that you will search when you do a BLAST search unless you select another section. It is a good place to start. It contains all the genes that have been sequenced by individuals over the years, both cDNA and gene sequences. The human genome sequence moves into nr as it is finished. Unless a submitter is sending in a batch submission of large numbers of sequences for the other sections of the database, their sequence goes here. Note that as of Jan. 4, 2002 nr has 4.8 billion bases, or 29% of the total Genbank bases. If you only search here for a match to your sequence, you are neglecting more than 70% of Genbank data.
Below is an extract from the release notes of Genbank nr release 121 of Dec. 15, 2000 and similar data from release 126 of Oct. 15, 2001, so you can compare changes over the past year. A link is provided to the full release notes for release 126. These are found on the ftp download page.
2.2.7 Selected Per-Organism Statistics

The following table provides the number of entries and bases of DNA/RNA for the twenty most sequenced organisms in Genbank nr Release 121.0, Dec. 15, 2000 (chloroplast and mitochondrial sequences not included):

Entries   Bases        Species
3918724   6702881570   Homo sapiens
2456194   1291602139   Mus musculus
166554    487561384    Drosophila melanogaster
181388    242674129    Arabidopsis thaliana
114553    203544197    Caenorhabditis elegans
188993    165539271    Tetraodon nigroviridis (freshwater pufferfish)
151411    125948974    Oryza sativa (rice)
218598    106344366    Rattus norvegicus
159473    71215626     Bos taurus (cow)
141802    62817102     Glycine max (soybean)
104535    50991920     Medicago truncatula
91334     49855996     Trypanosoma brucei
97112     49415566     Lycopersicon esculentum (tomato)
54328     47639714     Giardia intestinalis
77532     47590936     Strongylocentrotus purpuratus (sea urchin)
49938     44522016     Entamoeba histolytica
57779     44489692     Hordeum vulgare (barley)
83726     40906902     Danio rerio (zebrafish)
77506     36885212     Zea mays (maize or corn)
18361     32779082     Saccharomyces cerevisiae

2.2.7 Selected Per-Organism Statistics

The following table provides the number of entries and bases of DNA/RNA for the twenty most sequenced organisms in Genbank nr Release 126.0, Oct. 15, 2001 (chloroplast and mitochondrial sequences not included):

Entries   Bases        Species
5074650   7915783043   Homo sapiens
3282738   1982497435   Mus musculus
309512    615314337    Drosophila melanogaster
277024    342250586    Rattus norvegicus
196531    292339256    Oryza sativa (rice)
194296    258809578    Arabidopsis thaliana
140700    187274610    Caenorhabditis elegans
189005    165547824    Tetraodon nigroviridis (freshwater pufferfish)
198152    95024632     Bos taurus (cow)
204698    92361300     Glycine max (soybean)
156413    89308950     Danio rerio (zebrafish)
155185    80380251     Lycopersicon esculentum (tomato)
140798    72431327     Medicago truncatula
80582     72089785     Entamoeba histolytica (liver parasite)
121918    60487285     Xenopus laevis
102233    58906089     Chlamydomonas reinhardtii
124150    57745385     Zea mays (maize or corn)
86956     54526352     Strongylocentrotus purpuratus (sea urchin)
104222    54130240     Sus scrofa (pig)
91420     53130188     Trypanosoma brucei

Genbank release 126 is broken into 240 files. If the EST, GSS, STS and HTGS files are ignored, nr is what's left. The sequences are categorized into the following 17 groups. The following codes are used to designate the data file divisions in Genbank. An asterisk marks the four divisions that are not part of nr.

Division   files   sequences   bases (in millions)
1. PRI - primate sequences 14 202,842 2200
2. ROD - rodent sequences 2 71,463 162
3. MAM - other mammalian sequences (not PRI or ROD) 1 34,963 34
4. VRT - other vertebrate sequences (not PRI, ROD or MAM) 1 62,274 59
5. INV - invertebrate sequences (fly, worm etc.) 4 100,881 499
6. PLN - plant, fungal, algal and protist sequences 4 177,479 429
7. BCT - bacterial sequences 4 118,777 320
8. VRL - viral sequences 2 134,470 118
9. PHG - bacteriophage sequences 1 1,794 6
10. SYN - synthetic sequences 1 6,579 12
11. UNA - unannotated sequences 1 668 0.3
12. EST*- EST sequences (expressed sequence tags) 133 9,268,640 4247
13. PAT - patent sequences 3 466,531 196
14. STS*- STS sequences (sequence tagged sites) 2 112,466 44
15. GSS*- GSS sequences (genome survey sequences, BAC ends) 41 2,732,066 1493
16. HTG*- HTGS sequences (high throughput genomic sequences) 25 88,367 4538
17. HTC - HTC sequences (high throughput cDNA sequences) 1 22,002 28
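The table can be cross-checked with a few sums. The Python sketch below is only an illustration: it copies the file and base counts from the table, flags the four starred divisions as outside nr (following the rule that nr is what is left when EST, GSS, STS and HTGS are ignored), and adds them up. The tuple layout and variable names are assumptions made for the example.

```python
# (division code, files, bases in millions, counted as part of nr?)
# Values copied from the release 126 division table; the starred
# divisions (EST, STS, GSS, HTG) are flagged as not part of nr.
DIVISIONS = [
    ("PRI", 14, 2200, True),
    ("ROD", 2, 162, True),
    ("MAM", 1, 34, True),
    ("VRT", 1, 59, True),
    ("INV", 4, 499, True),
    ("PLN", 4, 429, True),
    ("BCT", 4, 320, True),
    ("VRL", 2, 118, True),
    ("PHG", 1, 6, True),
    ("SYN", 1, 12, True),
    ("UNA", 1, 0.3, True),
    ("EST", 133, 4247, False),
    ("PAT", 3, 196, True),
    ("STS", 2, 44, False),
    ("GSS", 41, 1493, False),
    ("HTG", 25, 4538, False),
    ("HTC", 1, 28, True),
]

total_files = sum(files for _, files, _, _ in DIVISIONS)
total_bases = sum(bases for _, _, bases, _ in DIVISIONS)
nr_bases = sum(bases for _, _, bases, in_nr in DIVISIONS if in_nr)

print(total_files)                             # 240 files
print(round(total_bases / 1000, 1))            # 14.4 (billion bases)
print(round(100 * nr_bases / total_bases))     # 28 (percent of bases in nr)
```

The totals agree with the 240 files and the 14.4 billion bases mentioned earlier, and the nr share comes out close to the 29% quoted for Jan. 4, 2002.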
These designations appear in the top line of any sequence you pull out of Genbank. In nr, Primate is the largest section, with the human sequence in it. Rodent is about 13 times smaller, with other mammal and other vertebrate (zebrafish, Xenopus etc.) tiny by comparison. Invertebrate is pretty large since it has both fly and C. elegans sequences in it. Plant is a catch-all section that has every eukaryotic species that is not an animal. It really should be called EUK for other eukaryotes. This includes fungi and protists, even though they are not plants. It is pretty large because Arabidopsis is here. Bacteria is fairly big since 62 complete public bacterial genomes are here. Viral rounds out the nr section with 665 species. Many complete viral genomes have been sequenced.
The HTC section is new and contains cDNAs from projects designed to obtain full length cDNAs from mouse (RIKEN in Japan) and from human and mouse (MGC, the Mammalian Gene Collection at NIH). The goal of the Mammalian Gene Collection (MGC) is to provide a complete set of full-length (open reading frame) sequences and cDNA clones of expressed genes for human and mouse. (NIH-MGC Project URL: http://mgc.nci.nih.gov).
You can specify most of these categories, except invertebrate, when doing a BLAST search. This is done using pull down menus to limit a search. There are also 27 common species in this pull down menu, but you can type in any taxon name in another window to limit a search to your own specific subset, such as Tetraodon.
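Since the division code sits on the top (LOCUS) line of a flat-file record, it can be picked out with a few lines of Python. This is a minimal sketch, not an official parser: the function name is hypothetical, the sample line is only an illustration in the older LOCUS layout, and the code scans whitespace-separated fields rather than relying on fixed column positions.

```python
# The 17 division codes listed in the table of data file divisions.
DIVISION_CODES = {
    "PRI", "ROD", "MAM", "VRT", "INV", "PLN", "BCT", "VRL", "PHG",
    "SYN", "UNA", "EST", "PAT", "STS", "GSS", "HTG", "HTC",
}


def division_from_locus(locus_line):
    """Return the division code found on a LOCUS line, or None."""
    fields = locus_line.split()
    if not fields or fields[0] != "LOCUS":
        return None
    # fields[1] is the locus name; skip it so that a name which happens
    # to look like a division code is not mistaken for one.
    for field in fields[2:]:
        if field in DIVISION_CODES:
            return field
    return None


# Illustrative top line of a yeast record; yeast falls in the PLN catch-all.
sample = "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999"
print(division_from_locus(sample))  # PLN
```

Scanning by field keeps the sketch independent of the exact column layout of the LOCUS line.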
Now follow the link to molecular databases from the left frame of the NCBI homepage. This new page has links on the left frame to the databases EST, GSS and STS. Click on the EST link. EST is the division for sequencing projects focusing on cDNAs, so the sequences represent genes that are expressed as mRNA. Usually large numbers of ESTs are deposited from these projects. Going back to the table of Genbank statistics, I break down the human, mouse and other ESTs to show that about 40% are human, over 20% are mouse and about one third are other. Note that there are about the same number of bases in the EST section as in the nr section. The EST and GSS databases have per organism statistics. The top twenty organisms are shown below.
dbEST release 010402 (this is a date 1/4/02)
Summary by Organism - January 4, 2002
Number of public entries: 10,032,616
Homo sapiens (human) 3,927,122
Mus musculus + domesticus (mouse) 2,514,136
Rattus sp. (rat) 317,151
Drosophila melanogaster (fruit fly) 255,456
Glycine max (soybean) 223,351
Bos taurus (cattle) 213,787
Xenopus laevis (African clawed frog) 198,118
Danio rerio (zebrafish) 185,251
Lycopersicon esculentum (tomato) 141,735
Medicago truncatula (barrel medic) 137,588
Caenorhabditis elegans (nematode) 135,203
Zea mays (maize) 119,145
Arabidopsis thaliana (thale cress) 113,330
Chlamydomonas reinhardtii 111,958
Oryza sativa (rice) 100,480
Sus scrofa (pig) 100,006
Hordeum vulgare (barley) 90,314
Sorghum bicolor (sorghum) 84,712
Ciona intestinalis (sea squirt, a primitive chordate) 82,071
Triticum aestivum (wheat) 73,395

dbGSS release 010402

Mus musculus 939,684
Homo sapiens (human) 871,247
Brassica oleracea 198,932
Tetraodon nigroviridis 188,963
Pan troglodytes (chimp) 157,533
Rattus norvegicus 110,521
Oryza sativa (rice) 93,119
Trypanosoma brucei 90,540
Arabidopsis thaliana 86,899
Entamoeba histolytica 79,716
Strongylocentrotus purpuratus 76,019
Anopheles gambiae 60,351
Takifugu rubripes 47,111
Drosophila melanogaster 45,610
Zea mays 44,990
Schistosoma mansoni 42,015
Trypanosoma cruzi 21,317
Leishmania major 15,401
Magnaporthe grisea (rice blast fungus) 12,674
Lycopersicon esculentum (tomato) 11,892
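The human, mouse and other split of the ESTs quoted earlier can be checked against the dbEST counts. A minimal Python sketch, using the total number of public entries and the human and mouse counts from the dbEST summary of January 4, 2002 (the variable names are illustrative):

```python
# Entry counts from the dbEST summary of January 4, 2002.
total_ests = 10_032_616
human = 3_927_122
mouse = 2_514_136   # "Mus musculus + domesticus"
other = total_ests - human - mouse

for label, count in (("human", human), ("mouse", mouse), ("other", other)):
    print(f"{label}: {100 * count / total_ests:.1f}%")
# human: 39.1%
# mouse: 25.1%
# other: 35.8%
```

These shares are by number of entries, not by number of bases.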
The HTGS section is for genomic sequencing. This is where the genome projects send their data. It is the largest section of Genbank and has a lot of human and mouse sequence.
GSS stands for genome survey sequence. Most genome projects now are based on sequencing clones of differing sizes. Some are cosmids and some are BACs (bacterial artificial chromosomes). These are too big to sequence in one run, so they need to be broken into smaller pieces. One way to keep track of BACs is to sequence their ends to get a marker for mapping and later assembly. The BAC end sequences go in the GSS section of Genbank. They are similar in size to the ESTs but they are from genomic DNA. There are almost 2 billion bases of GSS sequence.
I recommend searching all four main sections for matches to any sequence you have, since you don't want to miss anything. The smaller section called STS (only 44 million bases) contains sequences used for mapping. I usually search there too, even though there is not as much data. I have found missing exons in the STS database that are not in any other section of Genbank.
To follow the progress on sequencing the human genome, look at this link. The percent finished (63%) and the percent in draft (34.8%) are given for Dec. 31, 2001.