Overview of sequence data at Genbank

The sequence data at Genbank have been growing exponentially. The base counter showed 19.8 billion bases on Jan 10, 2002. For a current view click here. The counter showed 15.7 billion bases on May 10, 2001, so about 4 billion bases were added in those 8 months. These sequences are divided into several sections, and it is important to know about them so you can find what you are looking for at Genbank. This table shows the breakdown of the four main sequence divisions in Genbank: nr (non-redundant), est (expressed sequence tags), htgs (high throughput genomic sequences) and gss (genome survey sequences). STS (sequence tagged sites) is a small section used in mapping. We will talk about each of them in more detail.

Nr is the default section of the database, the one you search when you do a BLAST search unless you select another section. It is a good place to start. It contains the genes that have been sequenced by individuals over the years, both cDNA and gene sequences. The human genome sequence moves into nr as it is finished. Unless a submitter is sending in a batch submission of large numbers of sequences for one of the other sections of the database, their sequence goes here. Note that nr holds 4.8 billion bases, or 29% of the total Genbank bases. If you only search here for a match to your sequence, you are neglecting about 70% of Genbank data.
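The growth figures quoted at the start of this overview can be checked with a few lines of arithmetic. This is only a sketch: the two base counts and the 8-month interval come from the text, while the doubling time rests on the added assumption that growth was smoothly exponential between the two dates.

```python
import math

# Figures quoted in the text, in billions of bases.
bases_may_2001 = 15.7   # counter reading on May 10, 2001
bases_jan_2002 = 19.8   # counter reading on Jan 10, 2002
months_elapsed = 8

# Absolute growth and the average amount added per month.
added = bases_jan_2002 - bases_may_2001
per_month = added / months_elapsed

# Assumption (not from the text): N(t) = N0 * exp(rate * t) over the interval.
rate = math.log(bases_jan_2002 / bases_may_2001) / months_elapsed
doubling_months = math.log(2) / rate

print(f"added: {added:.1f} billion bases")
print(f"average: {per_month:.2f} billion bases per month")
print(f"implied doubling time: {doubling_months:.0f} months")
```

Under that assumption the database would double in roughly two years, which is one way to put a number on "growing exponentially".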

Est is the division for sequencing projects focusing on cDNAs; these projects usually deposit large numbers of ESTs. I break down the human, mouse and other ESTs to show that 40% are human, over 20% are mouse and about one third are other. Note that there are about the same number of bases in the EST section as in the nr section.

The HTGS section is for genomic sequencing. This is where the genome projects send their data. It is the largest section of Genbank, and it has a lot of human and mouse sequence.

GSS stands for genome survey sequence. Most genome projects now are based on sequencing clones of differing sizes. Some are cosmids and some are BACs (bacterial artificial chromosomes). These are too big to sequence in one run, so they need to be broken into smaller pieces. One way to keep track of BACs is to sequence their ends to get a marker for mapping and later assembly. The BAC end sequences go in the GSS section of Genbank. They are similar in size to the ESTs, but they are from genomic DNA. There are almost 2 billion bases of GSS sequence.

I recommend searching all four main sections for matches to any sequence you have, since you don't want to miss anything. The smaller section called STS (only 44 million bases) contains sequences used for mapping. I usually search there too, even though there is not as much data. I have found missing exons in the STS section that are not in any other section of Genbank.
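The advice to search every section can be written as a small loop. The sketch below is not tied to any real search tool: `search_one` is a hypothetical callable that you would supply (for example a wrapper around whatever BLAST client you use), and `fake_search` is a stand-in so the example runs offline. The section names are the ones used in the text; the database identifiers a given tool accepts may differ.

```python
from typing import Callable, Dict, Iterable, List

# Section names as used in the text (assumed labels, not verified identifiers).
SECTIONS = ("nr", "est", "htgs", "gss", "sts")


def search_all_sections(
    query: str,
    search_one: Callable[[str, str], List[str]],
    sections: Iterable[str] = SECTIONS,
) -> Dict[str, List[str]]:
    """Run the same query against every section and keep the hits per section.

    `search_one(query, section)` is a hypothetical, caller-supplied function
    that returns a list of hit identifiers for one section.
    """
    return {section: search_one(query, section) for section in sections}


def fake_search(query: str, section: str) -> List[str]:
    # Offline stand-in for a real search: pretends only STS has a match.
    return ["hit_1"] if section == "sts" else []


results = search_all_sections("ACGTACGT", fake_search)
sections_with_hits = [name for name, hits in results.items() if hits]
print(sections_with_hits)
```

Keeping the hits grouped by section makes it easy to see when a match, such as a missing exon, turns up only in one of the smaller sections.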

Let's look more closely at the nr section. Release notes for the April 15, 2001 release show the following. The Genbank release is broken into several hundred files. If the EST, GSS and HTGS files are ignored, nr is what's left. The sequences are categorized into the following 8 groups. These designations appear in the top line of any sequence you pull out of Genbank. Primate is the largest section, since it holds the human sequence. Rodent is about 10 times smaller, and other mammal and other vertebrate (zebrafish, Xenopus, etc.) are tiny by comparison. Invertebrate is pretty large since it has both fly and C. elegans sequences in it. Plant is a catch-all section that has every species that is not animal, bacterial or viral. This includes fungi and protists, even though they are not plants. It is pretty large because Arabidopsis is here. Bacteria is fairly big since all 40 public bacterial genomes are here. Viral rounds out the nr section. Many complete viral genomes have been sequenced.
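The group designation can be read from the top (LOCUS) line of a flat-file record. The sketch below is illustrative only: the three-letter codes are the division codes used in Genbank flat files and are not spelled out in the text above, the sample line is just an example record header, and the parser assumes the code is the field immediately before the date, which should be checked against the records you actually have.

```python
# Division codes from Genbank flat-file LOCUS lines, mapped to the
# eight groups named in the text. Other codes exist for other sections.
DIVISION_NAMES = {
    "PRI": "primate",
    "ROD": "rodent",
    "MAM": "other mammal",
    "VRT": "other vertebrate",
    "INV": "invertebrate",
    "PLN": "plant (including fungi and protists)",
    "BCT": "bacteria",
    "VRL": "viral",
}


def division_from_locus(locus_line: str) -> str:
    """Return the group name for the division code on a LOCUS line."""
    fields = locus_line.split()
    if len(fields) < 3 or fields[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    # Assumption: the division code is the field just before the date.
    code = fields[-2]
    return DIVISION_NAMES.get(code, f"unrecognized division {code!r}")


# Example header line, used here purely for illustration.
sample = "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999"
print(division_from_locus(sample))
```

The sample record is a yeast sequence, which shows the catch-all nature of the plant group: fungi are filed there too.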

You can specify any of these categories except invertebrate to limit a BLAST search. There are also 27 common species in this pull-down menu, and you can type any taxon name in another window to limit a search.

The release notes for the current release, 126 (Oct. 15, 2001), of the nr section can be found on the FTP download page. A link is provided here. The per-organism statistics are shown in this table. The Genbank divisions are shown in this table.

To follow the progress on sequencing the human genome, look at this link. The percent finished and the percent in draft are given for Dec 31, 2001.

The EST database has release statistics showing the most sequenced organisms.

These types of statistics are also available for the GSS section.