MSCI815 Module 1.

Non-GenBank databases

This module begins MSCI815, the one-credit part of the bioinformatics course. Because we lost one week while the B107 computers were being overhauled, we will combine MSCI815 modules one and two.

The average molecular biologist assumes that everything in the way of DNA sequence data can be found at GenBank. Many people also assume that the default settings at GenBank will allow one to search all the sequence data. We talked at some length about the different parts of GenBank and how most of the sequences are partitioned into the nr, dbEST, HTGS and GSS sections. We need not go over that again here. If you need a refresher, please see the overview of GenBank from MSCI814 Module 1. For this session we will look at sequence data outside of GenBank. More and more private companies, institutions and university labs are doing sequencing on a large scale, and many labs are setting up their own servers to post data that is unannotated and preliminary. These labs expect to produce a finished product and submit it to GenBank someday, but for the most recent results you need to go to their sites and use their servers. This area is changing every week, so what is true in April 2002 may not be true in June. You have to check.

Genome sequence data is becoming a matter of national pride. Countries can claim a place on the world stage if they take on a significant genome project. Just because human is almost done does not mean there are no other attractive projects out there. The rice genome is an example. At 460 Mb, it is bigger than Arabidopsis but smaller than human. About one third of the world depends on rice for food. All other cereal crops are syntenic with rice, so sequencing rice is like mapping every cereal grain. Because of this, there has been intense interest in sequencing this genome, and not just one genome, but two or three subspecies and varieties. As with human, there has been a public and a private side. Rob will be talking about the ethics of private companies sequencing genomes and then charging for the data or requiring collaborations or patent rights to see it. I will be talking more about how you can see the data available to the public now.

Rice was proposed for sequencing by a public international research collaboration, the IRGSP (International Rice Genome Sequencing Project), based in Japan. Ten countries were going to do rice (the japonica subspecies, Nipponbare variety). This project started rather slowly in 1997 with a map and clone strategy similar to the human genome public project. The projected finish date was 2008. This timetable was upset by Monsanto, which paid Leroy Hood (a sequencing expert) to do a 5X coverage of rice in a few months. This was donated to the IRGSP to help them finish their clone by clone approach. The data was not released to the public, but you could see it if you had a signed legal agreement (a material transfer agreement). I did this, but it took about 6 months here at UT. The agreement is strict. You cannot even let your own students look at the data unless they are listed on the agreement.

Meanwhile, Syngenta/Myriad, a private company, also sequenced the japonica rice genome, and they assembled it. The Monsanto data was just loose reads. The Syngenta data was reported to be better and more complete than the Monsanto data. To complicate matters, China decided to get in the game and sequence another subspecies called indica that is grown in China. China formed the Beijing Genomics Institute and went from nothing to a complete rice genome in about 3-4 years. They posted the data online this year and let anyone download it or blast search it. This is in GenBank as AAAA01000001-AAAA01103044. A special blast server, the indica blast server, is set up at GenBank to search this data. It is not searchable from the nr or HTGS sections of GenBank, and the server has only been in place since April 5th, 2002.
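The accession range quoted above can be unpacked with a few lines of Python. This is only a sketch. It assumes the layout is a four-letter project prefix, a two-digit version and a six-digit contig serial, which is what AAAA01000001 looks like. The function names split_wgs_accession and count_range are made up for this example and are not part of any GenBank tool.

```python
def split_wgs_accession(acc):
    """Split an accession like AAAA01000001 into (prefix, version, serial).

    Assumed layout: 4 letters, 2 digits of version, then the contig serial.
    """
    prefix, version, serial = acc[:4], acc[4:6], acc[6:]
    if not (prefix.isalpha() and version.isdigit() and serial.isdigit()):
        raise ValueError(f"unexpected accession layout: {acc!r}")
    return prefix, int(version), int(serial)


def count_range(first_acc, last_acc):
    """Number of accessions in an inclusive first-last range."""
    p1, v1, s1 = split_wgs_accession(first_acc)
    p2, v2, s2 = split_wgs_accession(last_acc)
    if (p1, v1) != (p2, v2):
        raise ValueError("range must stay within one project and version")
    return s2 - s1 + 1


print(count_range("AAAA01000001", "AAAA01103044"))  # prints 103044
```

So the indica entry in GenBank is not one sequence but 103,044 separate contigs.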

The IRGSP has released 168 Mb of sequence to GenBank (about 40% of the genome), though on their website they say they have released 68% to public databases. I do not know why there is a 28 percentage point difference in their numbers. On April 3 they posted a press release titled "IRGSP Response to Publication of Draft Sequences of the Rice Genome". This was forced on them by the other efforts passing them by and publishing in Science on April 5, 2002. Currently, they are planning to finish a 10X coverage draft by the end of 2002. They are still doing the clone by clone approach to give 99.99% accuracy over the whole euchromatic region of the genome, but they fear for their funding now that two assembled genomes of rice are "published". The IRGSP genome will be a higher quality product, and it will all be linked to the genetic and physical maps of rice, so it is worth finishing. Still, it is hard to convince your funding sources of the value of coming in third.

A blast search of one CYP51A2 P450 from japonica against indica showed the two sequences were 99% identical, so the two genomes are not that different from each other at the level of gene sequences.
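To make "99% identical" concrete, here is a minimal sketch of how percent identity can be computed from two sequences that have already been aligned. It is a simplification: blast reports identities over the local alignment it finds, and the two short strings in the example are invented for illustration, not real CYP51 sequence. The function name percent_identity is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns where both sequences have the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Drop columns that are gaps in both sequences; they carry no information.
    columns = [
        (x, y)
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if not (x == gap and y == gap)
    ]
    if not columns:
        raise ValueError("no alignment columns to compare")
    # A column with a gap on one side only counts as a mismatch.
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


# Toy aligned pair (invented): one substitution and one gap in 8 columns.
toy_a = "MKTAY-LL"
toy_b = "MKSAYALL"
print(percent_identity(toy_a, toy_b))  # prints 75.0
```

Whether gap columns count in the denominator is a design choice. Here they do, so an insertion or deletion lowers the score just like a substitution.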

Around the world, there are many genome centers and databases. These often have their own servers for blast searching and ftp downloading of sequence data. Much of this data is not in GenBank.

The Joint Genome Institute (JGI) is one of the major sites for this type of data. They have a Genomic Diversity Group with the goal of performing comparative analyses of whole genome sequences being produced by JGI. This is where the Ciona intestinalis sequence is being done, in addition to the white rot fungus Phanerochaete chrysosporium, the Fugu genome, and Chlamydomonas.

Chlamydomonas (green algae) at JGI has 36,605 BAC end sequences; 34,726,766 total letters in the BAC end database. These are blast searchable. Only 706 genomic sequences are in GenBank and only 4 of these are BAC ends. So nearly all of this data is non-GenBank data.
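The "36,605 sequences; 34,726,766 total letters" figures are the kind of summary a blast server prints for its database. If you download a FASTA file from one of these sites, you can get the same two numbers yourself. The sketch below assumes plain FASTA text with header lines starting with ">". The function name fasta_stats and the toy records are hypothetical.

```python
def fasta_stats(lines):
    """Return (number of sequences, total letters) for FASTA-formatted lines."""
    n_seqs = 0
    letters = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            n_seqs += 1  # each header line starts a new record
        else:
            letters += len(line)  # sequence lines add to the letter count
    return n_seqs, letters


# Toy input (invented); for a real file, pass an open file handle instead.
toy = """>read1
ACGTACGT
ACG
>read2
TTGA
""".splitlines()

n, total = fasta_stats(toy)
print(f"{n:,} sequences; {total:,} total letters")

# Average read length implied by the JGI numbers quoted above.
print(round(34_726_766 / 36_605))  # prints 949
```

The JGI numbers work out to an average of about 949 letters per BAC end read.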

JGI is also working on a 3X coverage of the balsam poplar (Populus balsamifera). This will be the first tree genome. Sequence data should be collected by the end of 2002, but no blast server is linked yet.

In October of 2000 JGI sequenced 15 bacterial genomes in one month as a demonstration project. I am sure that centers like that can now do one or more bacterial genomes in one day once the clones are made.

Some of the larger sequencing centers are profiled at this link.

Eukaryotic genomes are larger and more complex. They are more interesting from the point of view of understanding evolution and biology, but they are harder and more expensive to sequence. This has not slowed down the effort to obtain these genomes, and currently there are 179 eukaryotic genome projects listed on a Eukaryal Genomes Sequence Table. Some of these are multiple sites working on the same genome, so there are 114 different species being worked on. The site is not comprehensive.

Some of the more interesting sites include the Baylor College of Medicine chimpanzee blast server (BLASTN only).

Dictyostelium discoideum (cellular slime mold) has multiple servers. There is a blast server at Jena, Germany and another blast server at San Diego. Baylor has its own server and the Sanger Center has its own. The Dictyostelium cDNA project in Japan has another. It used to be that each site had new data that was not on the other sites, but over time I think they have come to share data more uniformly, so each of the genome sites is nearly comprehensive. There are many hits that do not have GenBank accession numbers, so this is a good example of non-GenBank sequence data. Look at my P450 collection of accession numbers from Dicty to see the number of non-GenBank entries.

Some sites are good to use even if the data are deposited in GenBank, because they are dedicated curated databases. The TIGR site is an excellent example. The TIGR gene indices are annotated sets of genes from a specific organism, with separate search tools. Other databases like this include the Anopheles database (mosquito), FlyBase (for Drosophila), WormBase (for C. elegans), TAIR (The Arabidopsis Information Resource), PlasmoDB (for Plasmodium falciparum, malaria) and others.

Assignment 12.

Find a non-GenBank chicken EST database and blast search it with the following sequence. Return the URL for the server and the best hit to me. GenBank has only 70,325 entries under chicken (gallus[ORGN]) and 61,495 are ESTs. The site I found had 327,836 EST sequences; 243,440,884 total letters.



>56. CYP51 NM_000786
MAAAAGMLLLGLLQAGGSVLGQAMEKVTGGNLLSMLLIACAFTL
SLVYLIRLAAGHLVQLPAGVKSPPYIFSPIPFLGHAIAFGKSPIEFLENAYEKYGPVF
SFTMVGKTFTYLLGSDAAALLFNSKNEDLNAEDVYSRLTTPVFGKGVAYDVPNPVFLE
QKKMLKSGLNIAHFKQHVSIIEKETKEYFESWGESGEKNVFEALSELIILTASHCLHG
KEIRSQLNEKVAQLYADLDGGFSHAAWLLPGWLPLPSFRRRDRAHREIKDIFYKAIQK
RRQSQEKIDDILQTLLDATYKDGRPLTDDEVAGMLIGLLLAGQHTSSTTSAWMGFFLA
RDKTLQKKCYLEQKTVCGENLPPLTYDQLKDLNLLDRCIKETLRLRPPIMIMMRMART
PQTVAGYTIPPGHQVCVSPTVNQRLKDSWVERLDFNPDRYLQDNPASGEKFAYVPFGA
GRHRCIGENFAYVQIKTIWSTMLRLYEFDLIDGYFPTVNYTTMIHTPENPVIRYKRRS
K
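One thing to watch in this assignment: the query above is a protein sequence, while an EST database holds nucleotide sequences. A server that offers BLASTN only will not take it. The program that compares a protein query against a translated nucleotide database is TBLASTN. The sketch below guesses the query type from its letters and looks up the matching program. The helper names guess_molecule and pick_blast_program and the 90% threshold are assumptions made for this example, and only the first line of the query is used to keep it short.

```python
NUCLEOTIDE_LETTERS = set("ACGTUN")

# (query type, database type) -> standard BLAST program name.
# tblastx (translated query vs translated database) is left out for brevity.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def guess_molecule(seq, threshold=0.9):
    """Call a sequence nucleotide if at least `threshold` of it is A/C/G/T/U/N."""
    seq = "".join(seq.split()).upper()
    if not seq:
        raise ValueError("empty sequence")
    nt = sum(1 for ch in seq if ch in NUCLEOTIDE_LETTERS)
    return "nucleotide" if nt / len(seq) >= threshold else "protein"


def pick_blast_program(query_type, database_type):
    """Look up the BLAST program for a query/database combination."""
    return BLAST_PROGRAMS[(query_type, database_type)]


# First line of the CYP51 query from the assignment (truncated on purpose).
query_start = "MAAAAGMLLLGLLQAGGSVLGQAMEKVTGGNLLSMLLIACAFTL"

query_type = guess_molecule(query_start)
print(query_type)                                    # prints protein
print(pick_blast_program(query_type, "nucleotide"))  # prints tblastn
```

The letter-counting guess is crude and can be fooled by very short or unusual sequences, so treat it as a sanity check rather than a rule.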