Module 10.
Superfamily genomics
This module is designed to show how to find all members of a protein superfamily in a genome. The next module will demonstrate how to assemble all the genes found in this module. Taken together, these two modules cover much of genome annotation.
Genome sequences are being generated worldwide at the rate of 200-300 million bases of sequence a day. Celera Genomics has a capacity of 100 million bases per day. The Whitehead Institute at MIT is near 60 million a day. The Joint Genome Institute (DOE) is of similar capacity, and that does not count the Sanger Center and genome centers like Kazusa in Japan and other sites around the world. These centers are producing more sequence than they can possibly annotate in any detail. They are relying on automated gene finding programs and automated BLAST comparisons to annotate these genomes. This will be a moderate success, but it will not be 100% correct. The gene finding programs are only about 60% accurate. They fuse adjacent genes together. They skip exons. Some exons are very short, as short as seven nucleotides. See the GHE sequence below from several P450 genes from the white rot fungus. Automated programs miss short exons, and they fail to detect bad exon boundaries that are probably sequence errors. In short, they do not have expert knowledge of a single protein family. Celera realized this when they did the Drosophila genome, and they held two Gene Jamborees, where expert annotators were brought in to work on assembling genes from individual families. The Riken mouse cDNA project had a similar meeting with about 50 invited annotators.
Figure showing a very short exon GHE from several P450s
Phanerochaete chrysosporium (white rot fungus) Scaffold_388a very similar to
sequences 77 417 112 129
gene model complete all boundaries checked
3583 MISDTFALAISSGLSLFLCLKAFIDYRAGLRSI (2) 3684 ex1
3732 NHSYLPGFRALISSFGILGLFFKEPKRGLWGGRRRFWLRKHLDFEEAGVDIISH (0) 3893 ex2
3954 IAFLPSVSTYLLLADAAAIK (0) 4013 ex3
4069 EVTGHRARFPKPTYKTLRIFGGNVLASEGEEWKRHRKVVGPAFSE (0) 4203 c-helix ex4
4255 HNNRLVWNETVKIVNDLFANVWGSQSEVYVDNVVQSVTLP (0) 4374 ex5
4423 MALYVISIAGFGKRALWQADGNLPPGHKLSFQ (0) 4521 ex6
4576 DALHILGTDLWIKAATPTLLMNWAPTTRIANVKLAFDEVK (0) 4692 ex7
4747 QYMLELIQERRNSEKRDERYDLFSSLLDANDLNEDGNGNVTLTNDELL (1) 4890 ex9
GNIFIFMLA (1) 4973 ex9 split
GHE (0) ex9 split
5087 TTAHTLAFTFGLLALHPDYQETVYQQIKSIVPDNRPP (0) 5197 ex10
MYEEMNSLTECMA (2) ex11
5351 YETLRLFPP (0) 5380 ex12
5436 TATIPKIAAEDTYLVTIDRAGNRVVVPVPCGTALHLNVIALHHN (1) 5564 ex13
5614 PRYWDNPSAFKPERFRGDWPRDAFIPFSTGSRSCIGRR (2) 5730 ex14
5780 FFETESIAILTMILSRYKIELRNDPRFADETYEERWQRVLRVKDGLTPA* 5932 ex15
compare to scaffold 388b not a separate exon
GNIFIFLLAGHE(0) 14307 ex9
14247 TTAHTLAFTFGLLALYPEQQDKLYKHIKHVIPDGRIP (0) 14137 ex10
and scaffold 129
GNIFIFMLA (1) ex9 split
GHE (0) ex9 split
25150 TTAHTLAFTFGLLALHSDYQEKVHQQIKSIMPDNRLP (0) 25260 ex10
and scaffold 12a not a separate exon (exons have been fused)
54432 VKANMTEDAKSRLSEEEMYAEMR (2)
54552 TILFAGHETTSTTISWVLLE
Expert knowledge reveals GHE to be a real exon, by comparison with other P450s. It happens to lie in a conserved motif region (AGXETT) that is easily recognized, but automated gene finding programs are not going to see this. Please note that the intron-exon phases are indicated in parentheses. The number 0, 1 or 2 gives the phase of the junction. Exons that join between codons are phase 0, those that join one base into a codon are phase 1, and those that join two bases in are phase 2. You cannot join exons together in frame unless the phase is preserved at both ends of the intron. In the example above, the sequence GNIFIFMLA ends in a phase 1 boundary, while the sequence TTAHTLAFTFGLLA starts with a phase 0 boundary. The two cannot be joined unless there is an exon in between with a phase 1 start and a phase 0 end. That is the GHE exon. This short seven base pair exon is seen in six different genes in white rot.
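The phase rule described above can be sketched in a few lines of Python. This is a minimal illustration, not part of any gene finding program: the Exon record and the broken_junctions function are names invented here, and the phases are read off the figure (the exon before GNIFIFMLA ends in phase 1, so GNIFIFMLA is taken to start in phase 1).

```python
from typing import List, NamedTuple, Tuple


class Exon(NamedTuple):
    """Hypothetical record: one exon with the phase at each of its ends."""
    name: str
    start_phase: int  # phase of the intron in front of the exon (0, 1 or 2)
    end_phase: int    # phase of the intron behind the exon (0, 1 or 2)


def broken_junctions(exons: List[Exon]) -> List[Tuple[str, str]]:
    """Return the neighbouring exon pairs that cannot be joined in frame.

    Two exons join in frame only when the upstream exon ends in the same
    phase in which the downstream exon starts.
    """
    bad = []
    for upstream, downstream in zip(exons, exons[1:]):
        if upstream.end_phase != downstream.start_phase:
            bad.append((upstream.name, downstream.name))
    return bad


# Gene model as an automated program might leave it (GHE exon missed).
without_ghe = [
    Exon("GNIFIFMLA", 1, 1),
    Exon("TTAHTLAFTFGLLA", 0, 0),
]

# Gene model after the expert adds the short GHE exon.
with_ghe = [
    Exon("GNIFIFMLA", 1, 1),
    Exon("GHE", 1, 0),
    Exon("TTAHTLAFTFGLLA", 0, 0),
]

print(broken_junctions(without_ghe))
print(broken_junctions(with_ghe))
```

A check like this only flags that something is missing between two exons. Finding the missing exon itself still takes comparison with other P450s.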
As an expert in a gene family, you may be interested in finding all the genes in that family in a given species. That might be human or rat or mouse, but it could just as easily be Fugu (Japanese pufferfish) or the sea squirt (Ciona). As a teaching tool, we are going to look at and annotate the Ciona genome cytochrome P450s. This is like a purification process. For Ciona savignyi, there are about 2.5 billion bases in 4.3 million reads. We want to find and assemble roughly 50 genes of about 1500 bases of coding sequence each (75000 bases). That is about a 33000 fold purification. A concrete example would be the Memphis phone book. It contains pages with about 200 characters per line by 100 lines per page = 20000 characters per page. The book is about 1300 pages long. This is 26 million characters. The savignyi data would equal about 100 Memphis phone books. We want 3.75 pages out of that many phone books. In fact, we want about 7-8 lines of text on 50 different pages, equal to 50 different genes.
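The purification arithmetic above can be checked with a short Python calculation. It only restates the numbers given in the text; the variable names are mine.

```python
# Numbers taken from the text above.
total_bases = 2_500_000_000      # Ciona savignyi sequence data
genes = 50                       # rough number of P450 genes expected
coding_bases_per_gene = 1500

target_bases = genes * coding_bases_per_gene   # bases we actually want
fold = total_bases / target_bases              # purification factor

# Phone book analogy.
chars_per_line = 200
lines_per_page = 100
pages_per_book = 1300
chars_per_page = chars_per_line * lines_per_page
chars_per_book = chars_per_page * pages_per_book

books = total_bases / chars_per_book                      # books of raw data
target_pages = target_bases / chars_per_page              # pages we want
lines_per_gene = coding_bases_per_gene / chars_per_line   # lines per gene

print(f"target bases:       {target_bases}")
print(f"fold purification:  {fold:.0f}")
print(f"phone books:        {books:.0f}")
print(f"pages wanted:       {target_pages}")
print(f"lines per gene:     {lines_per_gene}")
```

The exact values (about 33333 fold and about 96 phone books) round to the 33000 fold and 100 phone books quoted in the text.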
There has to be a systematic way to do this so sequences are not missed. I have outlined the strategy I used for the Fugu P450s in a poster that I have placed on the web (Fugu P450s). We will be going over the strategy and using it here, possibly with improvements based on automated assemblers like Contig Express or Sequencher.
The first part of the strategy involves finding all the accession numbers in a genome that contain P450 sequence. This is a cataloging job. It is not glamorous, but it is necessary. We will make a simplifying assumption about the gene content of Ciona. Ciona has a smaller genome than vertebrates (it is a urochordate), but it is still in the same major evolutionary group (sister group to chordates), so we will assume that it has the same P450 families seen in humans and Fugu. There are 18 mammalian P450 families, so we will search with at least one member from each family to cover the "P450 space" of Ciona. This seems justifiable based on experience with Fugu and mammals. If the BLAST server that Rob sets up can take the heat, we may search with more human sequences (57 total, but some are redundant, like 2C8, 2C9, 2C18, 2C19). We should only need one from each subfamily.
Listed below is a set of P450s to search with. The whole set contains 41 P450 sequences, one from each subfamily in humans. The shorter set, shown in red, would be 18 sequences as follows: 1A1, 2C8, 3A4, 4A11, 5A1, 7A1, 8A1, 11A1, 17, 19, 20, 21A2, 24, 26A1, 27A1, 39, 46, 51. Three interesting subfamilies that are not seen in humans but are seen in Fugu are represented by 1C1, 3B1 and 7C1. These might be included in a more comprehensive search.
>1. CYP1A1 NM_000499
MLFPISMSATEFLLASVIFCLVFWVIRASRPQVPKGLKNPPGPW
GWPLIGHMLTLGKNPHLALSRMSQQYGDVLQIRIGSTPVVVLSGLDTIRQALVRQGDD
FKGRPDLYTFTLISNGQSMSFSPDSGPVWAARRRLAQNGLKSFSIASDPASSTSCYLE
EHVSKEAEVLISTLQELMAGPGHFNPYRYVVVSVTNVICAICFGRRYDHNHQELLSLV
NLNNNFGEVVGSGNPADFIPILRYLPNPSLNAFKDLNEKFYSFMQKMVKEHYKTFEKG
HIRDITDSLIEHCQEKQLDENANVQLSDEKIINIVLDLFGAGFDTVTTAISWSLMYLV
MNPRVQRKIQEELDTVIGRSRRPRLSDRSHLPYMEAFILETFRHSSFVPFTIPHSTTR
DTSLKGFYIPKGRCVFVNQWQINHDQKLWVNPSEFLPERFLTPDGAIDKVLSEKVIIF
GMGKRKCIGETIARWEVFLFLAILLQRVEFSVPLGVKVDMTPIYGLTMKHACCEHFQM
QLRS
>3. CYP1B1 NM_000104
MGTSLSPNDPWPLNPLSIQQTTLLLLLSVLATVHVGQRLLRQRR
RQLRSAPPGPFAWPLIGNAAAVGQAAHLSFARLARRYGDVFQIRLGSCPIVVLNGERA
IHQALVQQGSAFADRPAFASFRVVSGGRSMAFGHYSEHWKVQRRAAHSMMRNFFTRQP
RSRQVLEGHVLSEARELVALLVRGSADGAFLDPRPLTVVAVANVMSAVCFGCRYSHDD
PEFRELLSHNEEFGRTVGAGSLVDVMPWLQYFPNPVRTVFREFEQLNRNFSNFILDKF
LRHCESLRPGAAPRDMMDAFILSAEKKAAGDSHGGGARLDLENVPATITDIFGASQDT
LSTALQWLLLLFTRYPDVQTRVQAELDQVVGRDRLPCMGDQPNLPYVLAFLYEAMRFS
SFVPVTIPHATTANTSVLGYHIPKDTVVFVNQWSVNHDPVKWPNPENFDPARFLDKDG
LINKDLTSRVMIFSVGKRRCIGEELSKMQLFLFISILAHQCDFRANPNEPAKMNFSYG
LTIKPKSFKVNVTLRESMELLDSAVQNLQAKETCQ
>4. CYP2A6 NM_000762
MLASGMLLVALLVCLTVMVLMSVWQQRKSKGKLPPGPTPLPFIG
NYLQLNTEQMYNSLMKISERYGPVFTIHLGPRRVVVLCGHDAVREALVDQAEEFSGRG
EQATFDWVFKGYGVVFSNGERAKQLRRFSIATLRDFGVGKRGIEERIQEEAGFLIDAL
RGTGGANIDPTFFLSRTVSNVISSIVFGDRFDYKDKEFLSLLRMMLGIFQFTSTSTGQ
LYEMFSSVMKHLPGPQQQAFQLLQGLEDFIAKKVEHNQRTLDPNSPRDFIDSFLIRMQ
EEEKNPNTEFYLKNLVMTTLNLFIGGTETVSTTLRYGFLLLMKHPEVEAKVHEEIDRV
IGKNRQPKFEDRAKMPYMEAVIHEIQRFGDVIPMSLARRVKKDTKFRDFFLPKGTEVY
PMLGSVLRDPSFFSNPQDFNPQHFLNEKGQFKKSDAFVPFSIGKRNCFGEGLARMELF
LFFTTVMQNFRLKSSQSPKDIDVSPKHVGFATIPRNYTMSFLPR
>7. CYP2B6 AC023172.1 CDS (hIIB1) cryptic exon 3A = 18813-18856 (hIIB2)
MELSVLLFLALLTGLLLLLVQRHPNTHDRLPPGPRPLPLLGNLLQMDRRGLLKSFL
RFREKYGDVFTVHLGPRPVVMLCGVEAIREALVDKAEAFSGRGKIA
MVDPFFRGYGVIFANGNRWKVLRRFSVTTMRDFGMGKRSVEERIQEEAQCLIEELRKS
KGALMDPTFLFQSITANIICSIVFGKRFHYQDQEFLKMLNLFYQTFSLISSVFGQLFE
LFSGFLKYFPGAHRQVYKNLQEINAYIGHSVEKHRETLDPSAPKDLIDTYLLHMEKEK
SNAHSEFSHQNLNLNTLSLFFAGTETTSTTLRYGFLLMLKYPHVAERVYREIEQVIGP
HRPPELHDRAKMPYTEAVIYEIQRFSDLLPMGVPHIVTQHTSFRGYIIPKDTEVFLIL
STALHDPHYFEKPDAFNPDHFLDANGALKKTEAFIPFSLGKRICLGEGIARAELFLFF
TTILQNFSMASPVAPEDIDLTPQECGVGKIPPTYQIRFLPR
>8. CYP2C8 M17397
MEPFVVLVLCLSFMLLFSLWRQSCRRRKLPPGPTPLPIIGNMLQ
IDVKDICKSFTNFSKVYGPVFTVYFGMNPIVVFHGYEAVKEALIDNGEEFSGRGNSPI
SQRITKGLGIISSNGKRWKEIRRFSLTNLRNFGMGKRSIEDRVQEEAHCLVEELRKTK
ASPCDPTFILGCAPCNVICSVVFQKRFDYKDQNFLTLMKRFNENFRILNSPWIQVCNN
FPLLIDCFPGTHNKVLKNVALTRSYIREKVKEHQASLDVNNPRDFMDCFLIKMEQEKD
NQKSEFNIENLVGTVADLFVAGTETTSTTLRYGLLLLLKHPEVTAKVQEEIDHVIGRH
RSPCMQDRSHMPYTDAVVHEIQRYSDLVPTGVPHAVTTDTKFRNYLIPKGTTIMALLT
SVLHDDKEFPNPNIFDPGHFLDKNGNFKKSDYFMPFSAGKRICAGEGLARMELFLFLT
TILQNFNLKSVDDLKNLNTTAVTKGIVSLPPSYQICFIPV
>12. CYP2D6 NM_000106
MGLEALVPLAVIVAIFLLLVDLMHRRQRWAARYPPGPLPLPGLG
NLLHVDFQNTPYCFDQLRRRFGDVFSLQLAWTPVVVLNGLAAVREALVTHGEDTADRP
PVPITQILGFGPRSQGVFLARYGPAWREQRRFSVSTLRNLGLGKKSLEQWVTEEAACL
CAAFANHSGRPFRPNGLLDKAVSNVIASLTCGRRFEYDDPRFLRLLDLAQEGLKEESG
FLREVLNAVPVLLHIPALAGKVLRFQKAFLTQLDELLTEHRMTWDPAQPPRDLTEAFL
AEMEKAKGNPESSFNDENLRIVVADLFSAGMVTTSTTLAWGLLLMILHPDVQRRVQQE
IDDVIGQVRRPEMGDQAHMPYTTAVIHEVQRFGDIVPLGMTHMTSRDIEVQGFRIPKG
TTLITNLSSVLKDEAVWEKPFRFHPEHFLDAQGHFVKPEAFLPFSAGRRACLGEPLAR
MELFLFFTSLLQHFSFSVPTGQPRPSHHGVFAFLVSPSPYELCAVPR
>13. CYP2E1 J02843
MSALGVTVALLVWAAFLLLVSMWRQVHSSWNLPPGPFPLPIIGN
LFQLELKNIPKSFTRLAQRFGPVFTLYVGSQRMVVMHGYKAVKEALLDYKDEFSGRGD
LPAFHAHRDRGIIFNNGPTWKDIRRFSLTTLRNYGMGKQGNESRIQREAHFLLEALRK
TQGQPFDPTFLIGCAPCNVIADILFRKHFDYNDEKFLRLMYLFNENFHLLSTPWLQLY
NNFPSFLHYLPGSHRKVIKNVAEVKEYVSERVKEHHQSLDPNCPRDLTDCLLVEMEKE
KHSAERLYTMDGITVTVADLFFAGTETTSTTLRYGLLILMKYPEIEEKLHEEIDRVIG
PSRIPAIKDRQEMPYMDAVVHEIQRFITLVPSNLPHEATRDTIFRGYLIPKGTVVVPT
LDSVLYDNQEFPDPEKFKPEHFLNENGKFKYSDYFKPFSTGKRVCAGEGLARMELFLL
LCAILQHFNLKPLVDPKDIDLSPIHIGFGCIPPRYKLCVIPRS
>14. CYP2F1 J02906
MDSISTAILLLLLALVCLLLTLSSRDKGKLPPGPRPLSILGNLL
LLCSQDMLTSLTKLSKEYGSMYTVHLGPRRVVVLSGYQAVKEALVDQGEEFSGRGDYP
AFFNFTKGNGIAFSSGDRWKVLRQFSIQILRNFGMGKRSIEERILEEGSFLLADVRKT
EGEPFDPTFVLSRSVSNIICSVLFGSRFDYDDERLLTIIRLINDNFQIMSSPWGELYD
ILDPRFPSLLDWVPGPHQRIFQNFKCLRDLIAHSVHDHQASSPRDFIQCFLTKMAEEK
EDPLSHFHMDTLLMTTHNLLFGGTKTVSTTLHHAFLALMKYPKVQARVQEEIDLVVGR
ARLPALKDRAAMPYTDAVIHEVQRFADIIPMNLPHRVTRDTAFRGFLIPKGTDVITLL
NTVHYDPSQFLTPQEFNPEHFLDANQSFKKSPAFMPFSAGRRLCLGELLARMELFLYL
TAILQSFSLQPLGAPEDIDLTPLSSGLGNLPRPFQLCLRPR
>15. CYP2J2 NM_000775
MLAAMGSLAAALWAVVHPRTLLLGTVAFLLAADFLKRRRPKNYP
PGPWRLPFLGNFFLVDFEQSHLEVQLFVKKYGNLFSLELGDISAVLITGLPLIKEALI
HMDQNFGNRPVTPMREHIFKKNGLIMSSGQAWKEQRRFTLTALRNFGLGKKSLEERIQ
EEAQHLTEAIKEENGQPFDPHFKINNAVSNIICSITFGERFEYQDSWFQQLLKLLDEV
TYLEASKTCQLYNVFPWIMKFLPGPHQTLFSNWKKLKLFVSHMIDKHRKDWNPAETRD
FIDAYLKEMSKHTGNPTSSFHEENLICSTLDLFFAGTETTSTTLRWALLYMALYPEIQ
EKVQVEIDRVIGQGQQPSTAARESMPYTNAVIHEVQRMGNIIPQNVPREVTVDTTLAG
YHLPKGTMILTNLTALHRDPTEWATPDTFNPDHFLENGQFKKREAFMPFSIGKRACLG
EQLARTELFIFFTSLMQKFTFRPPNNEKLSLKFRMGITISPVSHRLC
>16. CYP2R1 Mikael Oscarson AC018795.4 also AC025730 AC025748 EST AA663042
MWKLWRAEEGAAALGGALFLLLFALGVRQLLKQRRPMGFPPGPPGLPFIGNIY
SLAASSELPHVYMRKQSQVYGE
IFSLDLGGISTVVLNGYDVVKECLVHQSEIFADRPCLPLFMKMTKMGGLLNSR
YGRGWVDHRRLAVNSFRYFGYGQKSFESKILEETKFFNDAIETYKGRPFDFKQLITNAVS
NITNLIIFGERFTYEDTDFQHMIELFSENVELAASASVFLYNAFPWIGILPFGKHQQLFR
NAAVVYDFLSRLIEKASVNRKPQLPQHFVDAYLDEMDQGKNDPSSTFSKENLIFSVGELI
IAGTETTTNVLRWAILFMALYPNIQGQVQKEIDLIMGPNGKPSWDDKCKMPYTEAVLHEV
LRFCNIVPLGIFHATSEDAVVRGYSIPKGTTVITNLYSVHFDEKYWRDPEVFHPERFLDS
SGYFAKKEALVPFSLGRRHCLGEHLARMEMFLFFTALLQRFHLHFPHELVPDLKPRLGMT
LQPQPYLICAERR
>17. CYP2S1 AC011510 one exon per line 78% to mouse 2s1 49% to 2B6 47% to 2A13
MEATGTWALLLALALLLLLTLALSGTRARGHLPPGPTPLPLLGNLLQLRPGALYSGLMR
LSKKYGPVFTIYLGPWRPVVVLVGQEAVREALGGQAEEFSGRGTVAMLEGTFDGH
GVFFSNGERWRQLRKFTMLALRDLGMGKREGEELIQAEARCLVETFQGTE
GRPFDPSLLLAQATSNVVCSLLFGLRFSYEDKEFQAVVRAAGGTLLGVSSQGGQ
TYEMFSWFLRPLPGPHKQLLHHVSTLAAFTVRQVQQHQGNLDASGPARDLVDAFLLKMAQ
EEQNPGTEFTNKNMLMTVIYLLFAGTMTVSTTVGYTLLLLMKYPHVQ
KWVREELNRELGAGQAPSLGDRTRLPYTDAVLHEAQRLLALVPMGIPRTLMRTTRFRGYTLPQ
GTEVFPLLGSILHEPNIFKHPEEFNPDRFLDADGRFRKHEAFLPFSL
GKRVCLGEGLAKAEVFLFFTTILQAFSLESPCPPDTLSLKPTVSGLFNIPPAFQLQVRPTDLHSTTQTR
>18. CYP2U1 AC025090, (AC000016 has C-term) 41% to 2N1 intron joints not yet defined
MSSPGPSQPPAEDPPWPARLLRAPLGLLRLDPSGGALLLCGLVALLGWSWLRRRRARGI
PPGPTPWPLVGNFGHVLLPPFLRRRSWLSSRTRAAGIDPSVIGPQVLLAHLARVYGSI
FSFFIGHYLVVVLSDFHSVREALVQQAEVFSDRPRVPLISIVT
GPVWRQQRKFSHSTLRHFGLGKLSLEPKIIEEFKYVKAEMQKHGEDPFCPF
SIISNAVSNIICSLCFGQRFDYTNSEFKKMLGFMSRGLEICLNSQVLLVNICPWLYYLPF
GPFKELRQIEKDITSFLKKIIKDHQESLDRENPQDFIDMYLLHMEEERKNNSNSSFDEE
YLFYIIGDLFIAGTDTTTNSLLWCLLYMSLNPDVQ
KVHEEIERVIGANRAPSLTDKAQMPYTEATIMEVQRLTVVVPLAIPHMTSENT
LQGYTIPKGTLILPNLWSVHRDPAIWEKPEDFYPNRFLDDQGQLIKKETFIPFGIG
KRVCMGEQLAKMELFLMFVSLMQSFAFALPEDSKKPLLTGRFGLTLAPHPFNITISRR
>19. CYP2W1 AC073957.3 chromosome 7 clone RP11-449P15 40% to 2F1
MALLLLLFLGLLGLWGLLCACAQDPSPAARWAPGLRPLPLVGNLHLLRLSQQDRSLME
LSERYGPVFTVHLGRQKTVVLTGFEAVKEALAGPGQELADRP
PIAIFQLIQRGGGIFFSSGARWRAARQFTVRALHSLGVGREPVADKILQELKCLSGQL
DGYRGRPFPLALLGWAPSNITFALLFGRRFDYRDPVFVSLLGLIDEVMVLLGSPGLQL
FNVHPWLGALLQLHRPVLRKIEEVRAILRTLLEARRPHVCPGDPVCSYVDALIQQGQG
DDPEGLFAEANAVACTLDMVMAGTETTSATLQWAALLMGRHPDVQGRVQEELDRVLGP
GRTPRLEDQQALPYTSAVLHEVQRFITLLPHVPRCTAADTQLGGFLLPKGTPVIPLLT
SVLLDETQWQTPGQFNPGHFLDANGHFVKREAFLPFSA
GRRVCVGERLARTELFLLFAGLLQRYRLLPPPGVSPASLDTTPARAFTMRPRPRALCAVPRP*
>20. CYP3A4 J04449
MAVIPDLAMETWLLLAVSLVLLYLYGTHSHGLFKKLGIPGPTPL
PFLGNILSYHKGFCMFDMECHKKYGKVCGFYDGQQPVLAITDPDMIKTVLVKECYSVF
TNRRPFGPVGFMKSAISIAEDEEWKRLRSLLSPTFTSGKLKEMVPIIAQYGDVLVRNL
RREAETGKPVTLKDVFGAYSMDVITSTSFGVNIDSLNNPQDPFVENTKKLLRFDFLDP
FFLSIIFPFLIPILEVLNICVFPREVTNFLRKSVKRMKESRLEDTQKHRVDFLQLMID
SQNSKETESHKALSDLELVAQSIIFIFAGYETTSSVLSFIMYELATHPDVQQKLQEEI
DAVLPNKAPPTYDTVLQMEYLDMVVNETLRLFPIAMRLERVCKKDVEINGMFIPKGWV
VMIPSYALHRDPKYWTEPEKFLPERFSKKNKDNIDPYTYTPFGSGPRNCIGMRFALMN
MKLALIRVLQNFSFKPCKETQIPLKLSLGGLLQPEKPVVLKVESRDGTVSGA
>24. CYP4A11 NM_000778 12 exons BG533264 BF594611 W84867 W84868 T83194 T83178
MSVSVLSPSRLLGDVSGILQAASLLILLLLLIKAVQLYLHRQWLLKALQQFPCPPSHWLFGHIQE
LQQDQELQRIQKWVETFPSACPHWLWGGKVRVQLYDPDYMKVILGRS
DPKSHGSYRFLAPWI
GYGLLLLNGQTWFQHRRMLTPAFHYDILKPYVGLMADSVRVML
DKWEELLGQDSPLEVFQHVSLMTLDTIMKCAFSHQGSIQVDR
NSQSYIQAISDLNNLVFSRVRNAFHQNDTIYSLTSAGRWTHRACQLAHQHT
DQVIQLRKAQLQKEGELEKIKRKRHLDFLDILLLAK
MENGSILSDKDLRAEVDTFMFEGHDTTASGISWILYALATHPKHQERCREEIHSLLGDGASITW
NHLDQMPYTTMCIKEALRLYPPVPGIGRELSTPVTFPDGRSLPKG
IMVLLSIYGLHHNPKVWPNPEV
FDPSRFAPGSAQHSHAFLPFSGGSR
NCIGKQFAMNELKVATALTLLRFELLPDPTRIPIPIARLVLKSKNGIHLRLRRLPNPCEDKDQL*
>26. CYP4B1 NM_000779
MVPSFLSLSFSSLGLWASGLILVLGFLKLIHLLLRRRTLAKAMD
KFPGPPTHWLFGHALEIQETGSLDKVVSWAHQFPYAHPLWFGQFIGFLNIYEPDYAKA
VYSRGDPKAPDVYDFFLQWIGRGLLVLEGPKWLQHRKLLTPGFHYDVLKPYVAVFTES
TRIMLDKWEEKAREGKSFDIFCDVGHMALNTLMKCTFGRGDTGLGHRDSSYYLAVSDL
TLLMQQRLVSFQYHNDFIYWLTPHGRRFLRACQVAHDHTDQVIRERKAALQDEKVRKK
IQNRRHLDFLDILLGARDEDDIKLSDADLRAEVDTFMFEGHDTTTSGISWFLYCMALY
PEHQHRCREEVREILGDQDFFQWDDLGKMTYLTMCIKESFRLYPPVPQVYRQLSKPVT
FVDGRSLPAGSLISMHIYALHRNSAVWPDPEVFDSLRFSTENASKRHPFAFMPFSAGP
RNCIGQQFAMSEMKVVTAMCLLRFEFSLDPSRLPIKMPQLVLRSKNGFHLHLKPLGPG
SGK
>27. CYP4F2 NM_001082 alternative 2nd exon
MSQLSLSWLGLCDVAASPWLLLLLVGASWLLAHVLAWTYAFYDN
CRRLRCFPQPPRRNWFWGHQGMVNPTEEGMRVLTQLVATYPQGFKVWMGPISPLLSLC
HPDIIRSVINASAAIAPKDKFFYSFLEPWLGDGLLLSAGDKWSRHRRMLTPAFHFNIL
KPYMKIFNESVNIMHAKWQLLASEGSACLDMFEHISLMTLDSLQKCVFSFDSHCQEKP
SEYIAAILELSALVSKRHHEILLHIDFLYYLTPDGQRFRRACRLVHDFTDAVIQERRR
TLPSQGVDDFLQAKAKSKTLDFIDVLLLSKDEDGKKLSDEDIRAEADTFMFEGHDTTA
SVSPGSCTTLQSTQNTRSVCRQEVQELLKDREPKEIEWDDLAHLPFLTMCMKESLRCI
PPVPVISRHVTQDIVLPDGRVIPKGIICLISVFGTHHNPAVWPDPEVYDPFRFDPENI
KERSPLAFIPFSAGPRNCIGQTFAMAEMKVVLALTLLAFRVLPDHTEPRRSRSWSCAQ
RADFGCGWSP
>34. CYP4V2 formerly CYP4AH1 AC012525 Homo sapiens chromosome 4
MAGLWLGLVWQKLLLWGAASAVSLAGASLVLSLLQRVASYARKWQQMRPIPTVARAYPLVGHALLMKPDGR
EFFQQIIEYTEEYRHMPLLKLWVGPVPMVALYNAENVEG
ILTSSKQIDKSSMYKFLEPWLGLGLLT
STGNKWRSRRKMLTPTFHFTILEDFLDIMNEQANILVKKLEKHINQEAFNCFFYITLCALDIIC
ETAMGKNIGAQSNDDSEYVRAVYR
MSEMIFRRIKMPWLWLDLWYLMFKEGWEHKKSLQILHTFTNSV
IAERANEMNANEDCRGDGRGSAPSKNKRRAFLDLLLSVTDDEGNRLSHEDIREEVDTFMFE
GHDTTAAAINWSLYLLGSNPEVQKKVDHELDDV
KSDRPATVEDLKKLRYLECVIKETLRLFPSVPLFARSVSED
YFLTAGYRVLKGTEAVIIPYALHRDPRYFPNPEEFQPERFFPENAQG
RHPYAYVPFSAGPRNCIG
QKFAVMEEKTILSCILRHFWIESNQKREELGLEGQLILRPSNGIWIKLKRRNADER*
>33. CYP4X1 R56515, R53456, AA652746, AC026935
MEFSWLETRWARPFYLAFVFCLALGLLQAIKLYLRRQRLLRDLRPFPAPPTHWFLGHQK
FIQDDNMEKLEEIIEKYPRAFPFWIGPFQAFFCIYDPDYAKTLLSRTDPKSQYLQKFSPP
LLGKGLAALDGPKWFQHRRLLTPGFHFNILKAYIEVMAHSVKMMLDKWEKICSTQDTSVE
VYEHINSMSLDIIMKCAFSKETNCQTNSTHDPYAKAIFELSKIIFHRLYSLLYHSDIIFK
LSPQGYRFQKLSRVLNQYTDTIIQERKKSLQAGVKQDNTPKRKYQDFLDIVLSAKDES
GSSFSDIDVHSEVSTFLLAGHDTLAASISWILYCLALNPEHQERCREEVRGILGDGSSIT
WDQLGEMSYTTMCIKETCRLIPAVPSISRDLSKPLTFPDGCTLPAGITVVLSIWGLHHNP
AVWKNPKVFDPLRFSQENSDQRHPYAYLPFSAGSRNCIGQEFAMIELKVTIALILLHFRV
TPDPTRPLTFPNHFILKPKNGMYLHLKKL
>25. CYP4Z1 AJ131016 AC026935 161971-176942 52% to 4A11 52% to 4X1
MEPSWLQELMAHPFLLLILLCMSLLLFQVIRLYQRRRWMIRALHLFPAPPAHWFYGHKE
FYPVKEFEVYHKLMEKYPCAVPLWVGPFTMFFSVHDPDYAKILLKRQDP
KSAVSHKILESWVGRGLVTLDGSKWKKHRQIVKPGFNISILKIFITMMSE
SVRMML
NKWEEHIAQNSRLELFQHVSLMTLDSIMKCAFSHQGSIQLDRS
SYLKAVFNLSKISNQRMNNFLHHNDLVFKFSSQGQIFSKFNQELHQFT
HLEKVIQDRKESLKDKLKQDTTQKRRWDFLDILLSAKV
ENTKDFSEADLQAEVKTFMFAGHDTTSSAISWILYCLAKYPEHQQRCRDEIRELLGDGSSITW
EHLSQMPYTTMCIKECLRLYAPVVNISRLLDKPITFPDGRSLPA
GITVFINIWALHHNPYFWEDPQV
FNPLRFSRENSEKIHPYAFIPFSAG
PRNCIGQHFAIIECKVAVALTLLRFKLAPDHSRPPQPVRQVVLKSKNGIHVFAKKV
>35. CYP5A1 NM_001061 this gene is 197000 bases long
MMEALGFLKLEVNGPMVTVALSVALLALLKWYSTSAFSRLEKLG
LRHPKPSPFIGNLTFFRQGFWESQMELRKLYGPLCGYYLGRRMFIVISEPDMIKQVLV
ENFSNFTNRMASGLEFKSVADSVLFLRDKRWEEVRGALMSAFSPEKLNEMVPLISQAC
DLLLAHLKRYAESGDAFDIQRCYCNYTTDVVASVPFGTPVDSWQAPEDPFVKHCKRFF
EFCIPRPILVLLLSFPSIMVPLARILPNKNRDELNGFFNKLIRNVIALRDQQAAEERR
RDFLQMVLDARHSASPMGVQDFDIVRDVFSSTGCKPNPSRQHQPSPMARPLTVDEIVG
QAFIFLIAGYEIITNTLSFATYLLATNPDCQEKLLREVDVFKEKHMAPEFCSLEEGLP
YLDMVIAETLRMYPPAFRFTREAAQDCEVLGQRIPAGAVLEMAVGALHHDPEHWPSPE
TFNPERFTAEARQQHRPFTYLPFGAGPRSCLGVRLGLLEVKLTLLHVLHKFRFQACPE
TQVPLQLESKSALGPKNGVYIKIVSR
>36. CYP7A1 NM_000780
MMTTSLIWGIAIAACCCLWLILGIRRRQTGEPPLENGLIPYLGC
ALQFGANPLEFLRANQRKHGHVFTCKLMGKYVHFITNPLSYHKVLCHGKYFDWKKFHF
ATSAKAFGHRSIDPMDGNTTENINDTFIKTLQGHALNSLTESMMENLQRIMRPPVSSN
SKTAAWVTEGMYSFCYRVMFEAGYLTIFGRDLTRRDTQKAHILNNLDNFKQFDKVFPA
LVAGLPIHMFRTAHNAREKLAESLRHENLQKRESISELISLRMFLNDTLSTFDDLEKA
KTHLVVLWASQANTIPATFWSLFQMIRNPEAMKAATEEVKRTLENAGQKVSLEGNPIC
LSQAELNDLPVLNSIIKESLRLSSASLNIRTAKEDFTLHLEDGSYNIRKDSIIALYPQ
LMHLDPEIYPDPLTFKYDRYLDENGKTKTTFYCNGLKLKYYYMPFGSGATICPGRLFA
IHEIKQFLILMLSYFELELIEGQAKCPPLDQSRAGLGILPPLNDIEFKYKFKHL
>37. CYP7B1 NM_004820
MAGEVSAATGRFSLERLGLPGLALAAALLLLALCLLVRRTRRPG
EPPLIKGWLPYLGVVLNLRKDPLRFMKTLQKQHGDTFTVLLGGKYITFILDPFQYQLV
IKNHKQLSFRVFSNKLLEKAFSISQLQKNHDMNDELHLCYQFLQGKSLDILLESMMQN
LKQVFEPQLLKTTSWDTAELYPFCSSIIFEITFTTIYGKVIVCDNNKFISELRDDFLK
FDDKFAYLVSNIPIELLGNVKSIREKIIKCFSSEKLAKMQGWSEVFQSRQDVLEKYYV
HEDLEIGAHHLGFLWASVANTIPTMFWAMYYLLRHPEAMAAVRDEIDRLLQSTGQKKG
SGFPIHLTREQLDSLICLESSIFEALRLSSYSTTIRFVEEDLTLSSETGDYCVRKGDL
VAIFPPVLHGDPEIFEAPEEFRYDRFIEDGKKKTTFFKRGKKLKCYLMPFGTGTSKCP
GRFFALMEIKQLLVILLTYFDLEIIDDKPIGLNYSRLLFGIQYPDSDVLFRYKVKS
>38. CYP8A1 D83402
MAWAALLGLLVALLLLLLLSRRRTRRPGEPPLDLGSIPWLGYAL
DFGKDAASFLTRMKEKHGDIFTILVGGRYVTVLLDPHSYDAVVWEPRTRLDFHAYAIF
LMERIFDVQLPHYSPSDEKARMKLTLLHRELQALTEAMYTNLHAVLLGDATEAGSGWH
EMGLLDFSYSFLLRAGYLTLYGIEALPRTHESQAQDRVHSADVFHTFRQLDRLLPKLA
RGSLSVGDKDHMCSVKSRLWKLLSPARLARRAHRSKWLESYLLHLEEMGVSEEMQARA
LVLQLWATQGNMGPAAFWLLLFLLKNPEALAAVRGELESILWQAEQPVSQTTTLPQKV
LDSTPVLDSVLSESLRLTAAPFITREVVVDLAMPMADGREFNLRRGDRLLLFPFLSPQ
RDPEIYTDPEVFKYNRFLNPDGSEKKDFYKDGKRLKNYNMPWGAGHNHCLGRSYAVNS
IKQFVFLVLVHLDLELINADVEIPEFDLSRYGFGLMQPEHDVPVRYRIRP
>39. CYP8B1 AF090318 AC010192
MVLWGPVLGALLVVIAGYLCLPGMLRQRRPWEPPLDKGT
VPWLGHAMAFRKNMFEFLKRMRTKHGDVFTVQLGGQYFTFVMDP
LSFGPILKDTQRKLDFGQYAKKLVLKVFGYRSVQGDHEMIHSASTKHLRGDGLKDLNE
TMLDSLSFVMLTSKGWSLDASCWHEDSLFRFCYYILFTAGYLSLFGYTKDKEQDLLQA
GELFMEFRKFDLLFPRFVYSLLWPREWLEVGRLQHLFHKMLSVSHSQEKEGISNWLGN
MLQFLREQGVPSAMQDKFNFMMLWASQGNTGPTSFWALLYLLKHPEAIRAVREEATQV
LGEARLETKQSFAFKLGALQHTPVLDSVVEETLRLRAAPTLLRLVHEDYTLKMSSGQE
YLFRHGDILALFPYLSVHMDPDIHPEPTVFKYDRFLNPNGSRKVDFFKTGKKIHHYTM
PWGSGVSICPGRFFALSEVKLFILLMVTHFDLELVDPDTPLPHVDPQRWGFGTMQPSH
DVRFRYRLHPTE
>40. CYP11A1 NM_000781
MLAKGLPPRSVLVKGYQTFLSAPREGLGRLRVPTGEGAGISTRS
PRPFNEIPSPGDNGWLNLYHFWRETGTHKVHLHHVQNFQKYGPIYREKLGNVESVYVI
DPEDVALLFKSEGPNPERFLIPPWVAYHQYYQRPIGVLLKKSAAWKKDRVALNQEVMA
PEATKNFLPLLDAVSRDFVSVLHRRIKKAGSGNYSGDISDDLFRFAFESITNVIFGER
QGMLEEVVNPEAQRFIDAIYQMFHTSVPMLNLPPDLFRLFRTKTWKDHVAAWDVIFSK
ADIYTQNFYWELRQKGSVHHDYRGMLYRLLGDSKMSFEDIKANVTEMLAGGVDTTSMT
LQWHLYEMARNLKVQDMLRAEVLAARHQAQGDMATMLQLVPLLKASIKETLRLHPISV
TLQRYLVNDLVLRDYMIPAKTLVQVAIYALGREPTFFFDPENFDPTRWLSKDKNITYF
RNLGFGWGVRQCLGRRIAELEMTIFLINMLENFRVEIQHLSDVGTTFNLILMPEKPIS
FTFWPFNQEATQQ
>41. CYP11B1 NM_000497
MALRAKAEVCMAVPWLSLQRAQALGTRAARVPRTVLPFEAMPRR
PGNRWLRLLQIWREQGYEDLHLEVHQTFQELGPIFRYDLGGAGMVCVMLPEDVEKLQQ
VDSLHPHRMSLEPWVAYRQHRGHKCGVFLLNGPEWRFNRLRLNPEVLSPNAVQRFLPM
VDAVARDFSQALKKKVLQNARGSLTLDVQPSIFHYTIEASNLALFGERLGLVGHSPSS
ASLNFLHALEVMFKSTVQLMFMPRSLSRWTSPKVWKEHFEAWDCIFQYGDNCIQKIYQ
ELAFSRPQQYTSIVAELLLNAELSPDAIKANSMELTAGSVDTTVFPLLMTLFELARNP
NVQQALRQESLAAAASISEHPQKATTELPLLRAALKETLRLYPVGLFLERVASSDLVL
QNYHIPAGTLVRVFLYSLGRNPALFPRPERYNPQRWLDIRGSGRNFYHVPFGFGMRQC
LGRRLAEAEMLLLLHHVLKHLQVETLTQEDIKMVYSFILRPSMCPLLTFRAIN
>43. CYP17 NM_000102
MWELVALLLLTLAYLFWPKRRCPGAKYPKSLLSLPLVGSLPFLP
RHGHMHNNFFKLQKKYGPIYSVRMGTKTTVIVGHHQLAKEVLIKKGKDFSGRPQMATL
DIASNNRKGIAFADSGAHWQLHRRLAMATFALFKDGDQKLEKIICQEISTLCDMLATH
NGQSIDISFPVFVAVTNVISLICFNTSYKNGDPELNVIQNYNEGIIDNLSKDSLVDLV
PWLKIFPNKTLEKLKSHVKIRNDLLNKILENYKEKFRSDSITNMLDTLMQAKMNSDNG
NAGPDQDSELLSDNHILTTIGDIFGAGVETTTSVVKWTLAFLLHNPQVKKKLYEEIDQ
NVGFSRTPTISDRNRLLLLEATIREVLRLRPVAPMLIPHKANVDSSIGEFAVDKGTEV
IINLWALHHNEKEWHQPDQFMPERFLNPAGTQLISPSVSYLPFGAGPRSCIGEILARQ
ELFLIMAWLLQRFDLEVPDDGQLPSLEGIPKVVFLIDSFKVKIKVRQAWREAQAEGST
>44. CYP19 NM_000103
MVLEMLNPIHYNITSIVPEAMPAATMPVLLLTGLFLLVWNYEGT
SSIPGPGYCMGIGPLISHGRFLWMGIGSACNYYNRVYGEFMRVWISGEETLIISKSSS
MFHIMKHNHYSSRFGSKLGLQCIGMHEKGIIFNNNPELWKTTRPFFMKALSGPGLVRM
VTVCAESLKTHLDRLEEVTNESGYVDVLTLLRRVMLDTSNTLFLRIPLDESAIVVKIQ
GYFDAWQALLIKPDIFFKISWLYKKYEKSVKDLKDAIEVLIAEKRCRISTEEKLEECM
DFATELILAEKRGDLTRENVNQCILEMLIAAPDTMSVSLFFMLFLIAKHPNVEEAIIK
EIQTVIGERDIKIDDIQKLKVMENFIYESMRYQPVVDLVMRKALEDDVIDGYPVKKGT
NIILNIGRMHRLEFFPKPNEFTLENFAKNVPYRYFQPFGFGPRGCAGKYIAMVMMKAI
LVTLLRRFHVKTLQGQCVESIQKIHDLSLHPDETKNMLEMIFTPRNSDRCLEH
>46. CYP20 AC011737.8 chr 2 (missing exons 12,13) AC080075.2 (missing exons 1,7,8)
MLDFAIFAVTFLLALVGAVLYLYP
ASRQAAGIPGITPTEEK
DGNLPDIVNSGSLHEFLVNLHERYGPVVSFWFGRRLVVSLGTVDVLKQHINPNKTS
DPFETMLKSLLRYQSGGGSVSENHMRKKLYENGVTDSLKSNFALLLK
LSEELLDKWLSYPETQHVPLSQHMLGFAMKSVTQMVMGSTFEDDQEVIRFQKNHGT
VWSEIGKGFLDGSLDKNMTRKKQYED
ALMQLESVLRNIIKERKGRNFSQHIFIDSLVQGNLNDQQ
ILEDSMIFSLASCIITAK
LCTWAICFLTTSEEVQKKLYEEINQVFGNGPVTPEKIEQLR
YCQHVLCETVRTAKLTPVSAQLQDIEGKIDRFIIPRE
TLVLYALGVVLQDPNTWPSPHK
FDPDRFDDELVMKTFSSLGFSGTQECPELR
FAYMVTTVLLSVLVKRLHLLSVEGQVIETKYELVTSSREEAWITVSKRY
>45. CYP21A2 M26856
MLLLGLLLLLPLLAGARLLWNWWKLRSLHLPPLAPGFLHLLQPD
LPIYLLGLTQKFGPIYRLHLGLQDVVVLNSKRTIEEAMVKKWADFAGRPEPLTYKLVS
RNYPDLSLGDYSLLWKAHKKLTRSALLLGIRDSMEPVVEQLTQEFCERMRAQPGTPVA
IEEEFSLLTCSIICYLTFGDKIKDDNLMPAYYKCIQEVLKTWSHWSIQIVDVIPFLRF
FPNPGLRRLKQAIEKRDHIVEMQLRQHKESLVAGQWRDMMDYMLQGVAQPSMEEGSGQ
LLEGHVHMAAVDLLIGGTETTANTLSWAVVFLLHHPEIQQRLQEELDHELGPGASSSR
VPYKDRARLPLLNATIAEVLRLRPVVPLALPHRTTRPSSISGYDIPEGTVIIPNLQGA
HLDETVWERPHEFWPDRFLEPGKNSRALAFGCGARVCLGEPLARLELFVVLTRLLQAF
TLLPSGDALPSLQPLPHCSVILKMQPFQVRLQPRGMGAHSPGQNQ
>47. CYP24 NM_000782
MSSPISKSRSLAAFLQQLRSPRQPPRLVTSTAYTSPQPREVPVC
PLTAGGETQNAAALPGPTSWPLLASLLQILWKGGLKKQHDTLVEYHKKYGKIFRMKLG
SFESVHLGSPCLLEALYRTESVPQRLEIKPWKAYRDYRKEGYGLLILEGEDWQRVRSA
FQKKLMKPGEVMKLDNKINEVLADFMGRIDELCDERGHVEDLYSELNKWSFESICLVL
YEKRFGLLQKNAGDEAVNFIMAIKTMMSTFGRMMVTPVELHKSLNTKVWQGHTLAWDT
IFKSVKACIDNRLEKYSQQPSADFLCDIYHQNRLSKKELYAAVTELQLAAVETTANSL
MWILYNLSRNPQVQQKLLKEIQSVLPENQRPREEDLRNMPYLKACLKESMRLTPGVPF
TTRTLDKATVLGEYALPKGTVLMLNTQVLGSSEDNFEDSSQFRPERWLQEKEKINPFA
HLPFGVGKRMCIGRRLAELQLHLALCWIVRKYDIQATDNEPVEMLHSGTLVPSRELPI
AFCQR
>48. CYP26A1 NM_000783
MGLPALLASALCTFVLPLLLFLAAIKLWDLYCVSGRDRSCALPL
PPGTMGFPFFGETLQMVLQRRKFLQMKRRKYGFIYKTHLFGRPTVRVMGADNVRRILL
GDDRLVSVHWPASVRTILGSGCLSNLHDSSHKQRKKVIMRAFSREALECYVPVITEEV
GSSLEQWLSCGERGLLVYPEVKRLMFRIAMRILLGCEPQLAGDGDSEQQLVEAFEEMT
RNLFSLPIDVPFSGLYRGMKARNLIHARIEQNIRAKICGLRASEAGQGCKDALQLLIE
HSWERGERLDMQALKQSSTELLFGGHETTASAATSLITYLGLYPHVLQKVREELKSKG
LLCKSNQDNKLDMEILEQLKYIGCVIKETLRLNPPVPGGFRVALKTFELNGYQIPKGW
NVIYSICDTHDVAEIFTNKEEFNPDRFMLPHPEDASRFSFIPFGGGLRSCVGKEFAKI
LLKIFTVELARHCDWQLLNGPPTMKTSPTVYPVDNLPARFTHFHGEI
>49. CYP26B1 AC007002
MLFEGLDLVSALATLAACLVSVTLLLAVSQQLWQLRWAATRDKSCKLPIPKGSMGFPLIGETGHWLLQ
GSGFQSSRREKYGNVFKTHLLGRPLIRVTGAENVRKILMGEHHLVSTEWPRSTRMLLGPNTVSNS
IGDIHRNKRKVFSKIFSHEALESYLPKIQLVIQDTLRAWSSHPEAINVYQEAQ
KLTFRMAIRVLLGFSIPEEDLGHLFEVYQQFVDNVFSLPVDLPFSGYRR
GIQARQILQKGLEKAIREKLQCTQGKDYLDALDLLIESSKEHGKEMTMQELKDGTLELIF
AAYATTASASTSLIMQLLKHPTVLEKLRDELRAHGILHSGGCPCEGTLRLDTLSGLRYLD
CVIKEVMRLFTPISGGYRTVLQTFELDGFQIPKGWSVMYSIRDTHDTAPVFKDVNVFDP
DRFSQARSEDKDGRFHYLPFGGGVRTCLGKHLAKLFLKVLAVELASTSRFELATRTFPRI
TLVPVLHPVDGLSVKFFGLDSNQNEILPETEAMLSATV
>50. CYP26C1 AL358613.11 May 2, 2001 522 amino acids, 6 exons,
MFPWGLSCLSVLGAAGTALLCAGLLLSLAQHLWTLRWMLSRDRASTLPLPKGSMGWPFFGETLHWLVQ
GSRFHSSRRERYGTVFKTHLLGRPVIRVSGAENVRTILLGEHRLVRSQWPQSAHILLGSHTLLGAVGEPHRRRRK
VLARVFSRAALERYVPRLQGALRHEVRSWCAAGGPVSVYDASKALTFRMAARILLGLRL
DEAQCATLARTFEQLVENLFSLPLDVPFSGLRK
GIRARDQLHRHLEGAISEKLHEDKAAEPGDALDLIIHSARELGHEPSMQELK
ESAVELLFAAFFTTASASTSLVLLLLQHPAAIAKIREELVAQGLGRACGCAPGAAGGSEGPPPD
CGCEPDLSLAALGRLRYVDCVVKEVLRLLPPVSGGYRTALRTFELD
GYQIPKGWSVMYSIRDTHETAAVYRSPPEGFDPERFGAAREDSRGASSRLHYIPFGGGARSCLG
QELAQAVLQLLAVELVRTARWELATPAFPAMQTVPIVHPVDGLRLFFHPLTPSVAGNGLCL*
>51. CYP27A1 NM_000784
MAALGCARLRWALRGAGRGLCPHGARAKAAIPAALPSDKATGAP
GAGPGVRRRQRSLEEIPRLGQLRFFFQLFVQGYALQLHQLQVLYKAKYGPMWMSYLGP
QMHVNLASAPLLEQVMRQEGKYPVRNDMELWKEHRDQHDLTYGPFTTEGHHWYQLRQA
LNQRLLKPAEAALYTDAFNEVIDDFMTRLDQLRAESASGNQVSDMAQLFYYFALEAIC
YILFEKRIGCLQRSIPEDTVTFVRSIGLMFQNSLYATFLPKWTRPVLPFWKRYLDGWN
AIFSFGKKLIDEKLEDMEAQLQAAGPDGIQVSGYLHFLLASGQLSPREAMGSLPELLM
AGVDTTSNTLTWALYHLSKDPEIQEALHEEVVGVVPAGQVPQHKDFAHMPLLKAVLKE
TLRLYPVVPTNSRIIEKEIEVDGFLFPKNTQFVFCHYVVSRDPTAFSEPESFQPHRWL
RNSQPATPRIQHPFGSVPFGYGVRACLGRRIAELEMQLLLARLIQKYKVVLAPETGEL
KSVARIVLVPNKKVGLQFLQRQC
>52. CYP27B1 NM_000785
MTQTLKYASRVFHRVRWAPELGASLGYREYHSARRSLADIPGPS
TPSFLAELFCKGGLSRLHELQVQGAAHFGPVWLASFGTVRTVYVAAPALVEELLRQEG
PRPERCSFSPWTEHRRCRQRACGLLTAEGEEWQRLRSLLAPLLLRPQAAARYAGTLNN
VVCDLVRRLRRQRGRGTGPPALVRDVAGEFYKFGLEGIAAVLLGSRLGCLEAQVPPDT
ETFIRAVGSVFVSTLLTMAMPHWLRHLVPGPWGRLCRDWDQMFAFAQRHVERREAEAA
MRNGGQPEKDLESGAHLTHFLFREELPAQSILGNVTELLLAGVDTVSNTLSWALYELS
RHPEVQTALHSEITAALSPGSSAYPSATVLSQLPLLKAVVKEVLRLYPVVPGNSRVPD
KDIHVGDYIIPKNTLVTLCHYATSRDPAQFPEPNSFRPARWLGEGPTPHPFASLPFGF
GKRSCMGRRLAELELQMALAQ
ILTHFEVQPEPGAAPVRPKTRTVLVPERSINLQFLDR
>53. CYP27C1 AC027142 N-terminal deleted for searches
AEGPRSLAAMPGPRTLANLAEFFCRDGFSRIHEIQ
QKHTREYGKIFKSHFGPQFVVSIADRDMVAQVLRAEGAAPQRANMESWREYRDLRGRATGLISA
EGEQWLKMRSVLRQRILKPKDVAIYSGEVNQVIADLIKRIYLLRSQAEDGETVTNVNDLFFKYSME
GVATILYESRLGCLENSIPQLTVEYIEALELMFSMFKTSMYAGAIPRWLRPFIPKPWREFC
RSWDGLFKFS
QIHVDNKLRDIQYQMDRGRRVSGGLLTYLFLSQALTLQEIYANVTEMLLAGVDT
TSFTLSWTVYLLARHPEVQQTVYREIVKNLGERHVPTAADVPKVPLVRALLKETLR
LFPVLPGNGRVTQEDLVIGGYLIPKG
TQLALCHYATSYQDENFPRAKEFRPERWLRKGDLDRVDNFGSIPFGHGVRSCIGRRIAELEIHLVVIQ
LLQHFEIKTSSQTNAVHAKTHGLLTPGGPIHVRFVNRK*
>54. CYP39A1 AC008104 AL035670 note heme region exon corrected 1/18/02
MELISPTVIIILGCLALFLLLQRKNLRRPPCIKGWIPWIGVGFEFGKAPLEFIEKARIK
YGPIFTVFAMGNRMTFVTEEEGINVFLKSKKVDFELAVQNIVYRT
ASIPKNVFLALHEKLYIMLKGKMGTVNLHQFTGQLTEELHEQLENLGTHGTMDLNNLVR
HLLYPVTVNMLFNKSLFSTNKKKIKEFHQYFQVYDEDFEYGSQLPECLLR
NWSKSKKWFLELFEKNIPDIKACKSAKDNSM
TLLQATLDIVETETSKENSPNYGLLLLWASLSNAVP
VAFWTLAYVLSHPDIHKAIMEGISSVFGKAG
KDKIKVSEDDLENLLLIKWCVLETIRLKAPGVITRKVVKPVEIL
NYIIPSGDLLMLSPFWLHRNPKYFPEPELFKPERW
KKANLEKHSFLDCFMAFGSGKFQCPARW
FALLEVQMCIILILYKYDCSLLDPLPKQ
SYLHLVGVPQPEGQCRIEYKQRI
>55. CYP46 NM_006668
MSPGLLLLGSAVLLAFGLCCTFVHRARSRYEHIPGPPRPS
FLLGHLPCFWKKDEVGGRVLQDVFLDW
AKKYGPVVRVNVFHKTSVIVTSPESVK
KFLMSTKYNKDSKMYRALQTVFGER
LFGQGLVSECNYERWHKQRRVIDLAFSRSSLVSLMETFNEKAEQLVEILEAKADGQTPVSMQDMLTYTAMDILAK
AAFGMETSMLLGAQKPLSQAVKLMLEGITASRNTLAK
FLPGKRKQLREVRESIRFLRQVGRDWVQRRREALKRGEEVPADILTQILK
AEEGAQDDEGLLDNFVTFFIA
GHETSANHLAFTVMELSRQPEIVAR
LQAEVDEVIGSKRYLDFEDLGRLQYLSQ
VLKESLRLYPPAWGTFRLLEEETLIDGVRVPGNTPLL
FSTYVMGRMDTYFEDPLTFNPDRFGPGAPK
PRFTYFPFSLGHRSCIGQQFAQ
MEVKVVMAKLLQRLEFRLVPGQRFGLQEQATLKPLDPVLCTLRPRGWQPAPPPPPC
>56. CYP51 NM_000786
MAAAAGMLLLGLLQAGGSVLGQAMEKVTGGNLLSMLLIACAFTL
SLVYLIRLAAGHLVQLPAGVKSPPYIFSPIPFLGHAIAFGKSPIEFLENAYEKYGPVF
SFTMVGKTFTYLLGSDAAALLFNSKNEDLNAEDVYSRLTTPVFGKGVAYDVPNPVFLE
QKKMLKSGLNIAHFKQHVSIIEKETKEYFESWGESGEKNVFEALSELIILTASHCLHG
KEIRSQLNEKVAQLYADLDGGFSHAAWLLPGWLPLPSFRRRDRAHREIKDIFYKAIQK
RRQSQEKIDDILQTLLDATYKDGRPLTDDEVAGMLIGLLLAGQHTSSTTSAWMGFFLA
RDKTLQKKCYLEQKTVCGENLPPLTYDQLKDLNLLDRCIKETLRLRPPIMIMMRMART
PQTVAGYTIPPGHQVCVSPTVNQRLKDSWVERLDFNPDRYLQDNPASGEKFAYVPFGA
GRHRCIGENFAYVQIKTIWSTMLRLYEFDLIDGYFPTVNYTTMIHTPENPVIRYKRRS
K
I have already done 18 searches with these sequences to show you what the output is like. The 18 sequences in the red short set (1A1, 2C8, 3A5, 4A11, 5A1, 7A1, 8A1, 11A1, 17A1, 19A1, 20A1, 21A2, 24A1, 26A1, 27A1, 39A1, 46A1, and 51A1) have been tblastn searched against the Ciona genome at the JGI BLAST server. The results have been merged into a single list of non-redundant accession numbers. This list is 687 unique accession numbers long. The breakdown is as follows:
1A1 search, expect = 10, hits limited to 250, results = 250 hits
2C8 search, expect = 10, hits limited to 250, results = 70 new hits, 320 total
3A5 search, expect = 10, hits limited to 250, results = 85 new hits, 405 total
4A11 search, expect = 10, hits limited to 250, results = 112 new hits, 517 total
5A1 search, expect = 1, hits limited to 250, results = 5 new hits, 522 total
7A1 search, expect = 1, hits limited to 250, results = 36 hits ? new hits
12 more searches, one from each family, expect = 1, 1099 hits, 125 new hits
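The bookkeeping behind a breakdown like this (new hits per search and the running total of unique accessions) can be sketched in Python as follows. This is an illustration under my own assumptions, not the procedure actually used for the list above: the merge_hits function and the accession names in the example are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def merge_hits(
    searches: Iterable[Tuple[str, Iterable[str]]]
) -> Tuple[Set[str], List[Tuple[str, int, int]]]:
    """Merge the accession lists of successive searches.

    Returns the set of unique accessions and, for each search in order,
    a (query name, new hits, running total) tuple.
    """
    seen: Set[str] = set()
    report: List[Tuple[str, int, int]] = []
    for query, accessions in searches:
        new = set(accessions) - seen   # accessions not found by earlier searches
        seen |= new
        report.append((query, len(new), len(seen)))
    return seen, report


# Hypothetical accession names, for illustration only.
example = [
    ("1A1", ["read_a", "read_b", "read_c"]),
    ("2C8", ["read_b", "read_c", "read_d"]),
    ("3A5", ["read_d", "read_e", "read_e"]),
]

unique, report = merge_hits(example)
for query, new_hits, total in report:
    print(f"{query} search: {new_hits} new hits, {total} total")
```

Because each search is compared against everything found before it, the order of the searches changes the "new hits" column but not the final total.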
In the searches done above, 5A1 returned the fewest new hits because 5A1 is quite similar to 3A5. That search is not really covering new sequence space. I also decided to limit the expect value to 1 rather than 10, since this reduces false positives. Searches with the other 13 families in the short (red) set can be expected to give new hits each time, but the number of new hits should drop. By the last search there should be relatively few new hits, and these will probably come from family-specific regions of the CYP51 sequence.
We had planned to search with the short set of P450s against sequences from Ciona savignyi on BLAST servers set up at the bioinformatics suite server 1 and server 2. These searches would be much shorter, because each BLAST-searchable file contains only 1/44 of the sequence data. The full data set is 14X coverage of the genome, so each file will be about 0.3X genome. The total amount of sequence is close to 2.5 billion letters. Compare this to the C. intestinalis data, which is 453 million letters or 2.5X coverage.
Rob Edwards has used his skill in Unix to do BLAST searches of the 44 data files of Ciona savignyi. He set the searches up to run all 41 of the P450 query sequences in the set given above. The result is 41 X 44 = 1804 BLAST outputs with expect = 1. This should limit false positive hits. These searches are available here. Because these have been done, we will not have to do them in class. The output has been summarized from the individual sequence blasts. There are a total of 95200 hits or about 53 hits per search. Remember that each data file is searched 41 times, and some highly conserved regions like the heme signature may be found in most of those 41 searches. There were 9453 unique accessions (or unique sequence reads). This is about 0.2% of the total number of 4.3 million reads.
In our purification analogy, we have made a 500 fold purification and we have 66 fold to go. The 100 phone books have been reduced from 130000 pages to 260 pages. However, the last 66 fold is more difficult than the first 500 fold. If there are about 50 P450 genes in Ciona, that would be about 190 reads per gene, but some will be more abundant and others will be rare or absent. The number of hits for each P450 query is given as a table. Notice that the distribution is not random. CYP20 may not exist in Ciona, while CYP2s are very common.
The CYP27C1 results (9683 hits) indicate a problem with that search. There was a low complexity sequence in the 27C1 N-terminal that caused many false positive hits. The N-terminal region has been deleted from the sequence shown above. Searches with the shorter protein give about 31 hits per search or about 1364 total. That means the total number of hits is off by about 8300. It should be closer to 86,900. The unique accession count should be closer to 7598.
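The batch of 41 X 44 searches described above can be planned with a short script. The sketch below is in Python rather than Perl, and it is not Rob's script. It assumes the NCBI BLAST+ tblastn program with its -query, -db, -evalue and -out options, and the query and database file names are hypothetical placeholders.

```python
import subprocess
from itertools import product
from typing import List, Sequence


def plan_searches(
    queries: Sequence[str],
    databases: Sequence[str],
    evalue: float = 1.0,
) -> List[List[str]]:
    """Build one tblastn command line per (query, database) pair.

    Assumes each query is stored in <query>.fa and each database name is a
    formatted BLAST database (both naming schemes are hypothetical).
    """
    jobs = []
    for query, database in product(queries, databases):
        output = f"{query}_vs_{database}.out"
        jobs.append([
            "tblastn",
            "-query", f"{query}.fa",
            "-db", database,
            "-evalue", str(evalue),
            "-out", output,
        ])
    return jobs


def run_all(jobs: List[List[str]]) -> None:
    """Run the planned searches one after another (needs BLAST+ installed)."""
    for job in jobs:
        subprocess.run(job, check=True)


# Placeholder names for the 41 queries and 44 data files.
queries = [f"query{i:02d}" for i in range(1, 42)]
databases = [f"savignyi{i:02d}" for i in range(1, 45)]
jobs = plan_searches(queries, databases)
print(len(jobs), "searches planned")
```

Calling run_all(jobs) would start the searches. It is left out of the example so that the planning step can be checked without BLAST installed.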
A blast server with P450 protein sequence contigs from Ciona intestinalis is available. These were assembled from about 300 sequences from the JGI blasts. This may be useful as an aid in sorting the genes by family and individual sequence. It will also help in seeing how similar the two species are. New results from the searches can be compared to these files to see if there are close matches. For links to Ciona intestinalis accession numbers sorted by sequence, and for a link to the P450 contigs, see the Ciona page.
The power of perl has been demonstrated by the sheer mass of data Rob produced by setting up batch Blast searches and letting them run unattended overnight. Compare my own efforts at the JGI doing only a few searches. I was pleased to get 500 unique accession numbers at the JGI site, while Rob got 7600 unique accessions from the savignyi data [while he was sleeping]. I could not realistically do 1800 blast searches manually.
Because of this glut of data, I have had to rethink what to cover in this section. Obviously, we do not need to go and do the Blast searches, since they are done. What still remains is the analysis of this data. The next step in the process would be creation of a FASTA file of all the protein sequences from the unique accession numbers. Because a single accession number might have 20 or 30 hits to different P450s, there should be a way to select the best alignment. This could be done by percent identity, but that could be misleading for short alignments. Length of the alignment might be a better indicator. For the sake of argument, assume that an automated way exists to prepare the FASTA file from the single best alignment for each unique accession. That would give about 7600 protein fragments from about 50 different P450 genes.
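One way the "single best alignment" step might be automated is sketched below in Python. It assumes each hit has already been reduced to a small record (accession, query, alignment length, percent identity, aligned read sequence). That record layout and the function names are my own invention, and the example reads are made up. The longest alignment is kept for each accession, with percent identity used only to break ties.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    accession: str   # read identifier (invented names are used below)
    query: str       # P450 query that produced the alignment
    length: int      # alignment length in amino acids
    identity: float  # percent identity over the alignment
    sequence: str    # translated read sequence taken from the alignment


def best_hits(hits):
    """Keep one hit per accession: the longest alignment, ties broken by identity."""
    best = {}
    for hit in hits:
        current = best.get(hit.accession)
        if current is None or (hit.length, hit.identity) > (current.length, current.identity):
            best[hit.accession] = hit
    return best


def to_fasta(best):
    """Write the selected hits as FASTA text, one record per accession."""
    lines = []
    for accession in sorted(best):
        hit = best[accession]
        lines.append(f">{accession} best_query={hit.query} length={hit.length}")
        lines.append(hit.sequence)
    return "".join(line + "\n" for line in lines)


# Invented records for illustration only; these are not real reads.
example_hits = [
    Hit("read0001.x1", "CYP1A1", 16, 94.0, "SKHVIPFSIGPRHCLG"),
    Hit("read0001.x1", "CYP2C8", 26, 50.0, "FGGNVLASEGEEWKRHRKVVGPAFSE"),
    Hit("read0002.y1", "CYP2C8", 20, 55.0, "IAFLPSVSTYLLLADAAAIK"),
]

if __name__ == "__main__":
    print(to_fasta(best_hits(example_hits)), end="")
```

Note that read0001.x1 keeps the 26 amino acid alignment even though the 16 amino acid one has higher identity. A read that spans two exons would lose one of them under this rule, so the choice is a simplification.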
I should describe what I did with the CYP1A1 blast output from the JGI data. There were 250 hits that can be viewed above where the links to the blast output are given. I made estimates of where intron exon boundaries were and deleted probable intron sequences. I deleted all non-sequence data and added a > before the accession number to create a FASTA file format. I manually stripped the query sequence lines and the middle matching lines from each alignment, leaving only the bottom lines. For hits with more than one fragment, I rearranged the sequence fragments so they were in order. Once all 250 hits were processed this way, I had a file with 250 protein sequences in FASTA format. A perl script could do all of this (except removal of introns) automatically, and this would save several hours of labor.
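The manual clean-up described above could be scripted roughly as follows. This is a Python sketch, not the script used for this module. It assumes the traditional plain text blast report, in which each hit starts with a > line, each fragment starts with a "Score =" line, and the alignment rows begin with Query and Sbjct. The function name and the sample report are invented. Only the Sbjct rows are kept, gap characters are dropped, and fragments of one hit are put in query order. Introns are not removed.

```python
import re

# Alignment rows look like "Sbjct: 301 SKHVIPFSIGPRHCLG 348" (the colon is optional).
ALIGN_LINE = re.compile(r"^(Query|Sbjct):?\s+(\d+)\s+(\S+)\s+(\d+)")


def blast_text_to_fasta(text):
    """Reduce a plain text blast report to FASTA, keeping only the subject rows."""
    records = {}      # accession -> list of [query_start, subject_sequence]
    order = []        # accessions in order of first appearance
    accession = None
    fragment = None
    for line in text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split()
            accession = parts[0] if parts else None
            fragment = None
            if accession is not None and accession not in records:
                records[accession] = []
                order.append(accession)
        elif line.lstrip().startswith("Score =") and accession is not None:
            fragment = [None, ""]          # a new aligned fragment for this hit
            records[accession].append(fragment)
        else:
            match = ALIGN_LINE.match(line)
            if match and fragment is not None:
                kind, start, seq, _end = match.groups()
                if kind == "Query" and fragment[0] is None:
                    fragment[0] = int(start)   # remember where the fragment sits on the query
                elif kind == "Sbjct":
                    fragment[1] += seq.replace("-", "")
    out = []
    for acc in order:
        fragments = sorted((f for f in records[acc] if f[1]), key=lambda f: f[0] or 0)
        if fragments:
            out.append(">" + acc)
            out.append("".join(seq for _start, seq in fragments))
    return "".join(line + "\n" for line in out)


# Invented report fragment for illustration only.
SAMPLE = """\
>read0001.x1 hypothetical read
          Length = 700

 Score = 40.0 bits (92), Expect = 0.01
 Frame = +1

Query: 430 SKHVVAFSVGPRHCLG 445
           SKHV+ FS+GPRHCLG
Sbjct: 301 SKHVIPFSIGPRHCLG 348

 Score = 35.0 bits (80), Expect = 0.1
 Frame = +3

Query: 300 FGGNVLASEG 309
           FGG  LASEG
Sbjct: 12  FGG-TLASEG 38
"""

if __name__ == "__main__":
    print(blast_text_to_fasta(SAMPLE), end="")
```

Ordering fragments by their position on the query stands in for the manual rearrangement. It will be wrong whenever two fragments of one read match overlapping parts of the query, so the output still needs to be checked by eye.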
Once the FASTA file was made, I put it into the blast server and began to search each fragment against all other fragments. This permitted identification of identical fragments. The accession numbers for identical fragments were sorted into bins representing single genes. Overlapping fragments were assembled into longer contigs. The duplicate fragments were deleted. I tried to search with the most abundant fragments first to eliminate as many sequences as possible early on. This process was carried out until all duplicates were deleted and the set of 69 contigs remained. This meant doing about 100 blast searches and periodically updating the blast server file.
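The binning and joining described above can be imitated in a very simplified form. The Python sketch below is not the blast based procedure I used. It only handles exact matches: identical or fully contained fragments are dropped, and two fragments are joined when the end of one is identical to the start of the other over at least min_overlap residues. The function names and the min_overlap default are assumptions. Real reads contain sequencing errors, which is why blast comparisons and manual judgment were needed. The test fragments are cut from the exon 4 sequence in the figure at the top of this module.

```python
def drop_contained(fragments):
    """Remove duplicates and fragments that lie wholly inside a longer one."""
    kept = []
    for seq in sorted(set(fragments), key=lambda s: (-len(s), s)):
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept


def overlap(left, right, min_overlap):
    """Length of the longest suffix of left that equals a prefix of right."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def assemble(fragments, min_overlap=8):
    """Greedily join the pair with the longest exact overlap until none is left."""
    contigs = drop_contained(fragments)
    while True:
        best = (0, None, None)
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap(left, right, min_overlap)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            return contigs
        joined = contigs[i] + contigs[j][size:]
        rest = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs = drop_contained(rest + [joined])


# Fragments cut from the exon 4 sequence shown in the figure at the top of the module.
example_fragments = [
    "EVTGHRARFPKPTYKTLRIFGG",
    "PTYKTLRIFGGNVLASEGEEWKRH",
    "ASEGEEWKRHRKVVGPAFSE",
    "KTLRIFGG",                  # contained in the first two fragments
    "EVTGHRARFPKPTYKTLRIFGG",    # exact duplicate
]

if __name__ == "__main__":
    for contig in assemble(example_fragments):
        print(contig)
```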
The next step was to look at blast number two from the 2C8 sequence and find the unique accession numbers that were new to this blast. The new accession numbers were determined by taking the Blast list at the top of the output and editing it to get just the accession numbers. All other scores etc. were deleted. The list was then added to the list of the 250 accessions from the first blast in a Word file. The numbers from the second blast were colored red, then the whole set was sorted by the table sort command in Word. Any red number that duplicated a black number was deleted. New accession numbers remain as red numbers against the background of 250 black numbers. These 70 new hits were copied from the blast output, placed in a separate file and processed the same way as before. After joining overlapping fragments and removing duplicates, any new contigs were added to the FASTA file for use in the blast server.
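The red and black sorting trick is a set difference, and it can be sketched in a few lines of Python. The function names and accession names below are invented. The sketch assumes that the accession number is the first item on each line of the hit list at the top of a blast output.

```python
def accessions_from_hit_list(text):
    """Take the first whitespace separated item from each non-blank line."""
    return [line.split()[0] for line in text.splitlines() if line.strip()]


def new_accessions(known, candidates):
    """Return candidates not already known, without repeats, in first seen order."""
    seen = set(known)
    fresh = []
    for accession in candidates:
        if accession not in seen:
            seen.add(accession)
            fresh.append(accession)
    return fresh


# Invented lists for illustration only.
black = ["read0001.x1", "read0002.y1", "read0003.x1"]      # already processed
red_text = """\
read0002.y1   52  2e-07
read0004.x1   48  3e-06
read0004.x1   40  9e-04
read0005.y1   33  0.12
"""

if __name__ == "__main__":
    for accession in new_accessions(black, accessions_from_hit_list(red_text)):
        print(accession)
```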
I have finished the process for the 18 sequences in red. The question remains: how complete is the accession number list? To find out, I want each of you to take one sequence from the blue set of P450s above and blast it against the JGI blast server. Use TBLASTN, expect = 1, filter off, 250 descriptions and 250 alignments, graphical view off, and PAM70. Select all, copy the whole output, paste it into Word and save it. Edit the sequence accession number list at the top of the file to remove the three lines between the accession numbers. Some results may be 250 hits long, so this may take a few minutes; others may be shorter.
When you have the edited list of accession numbers, place them in a new file, color them red and open this link. This is the list of 687 accession numbers for the JGI data. Copy this list to the top of your own list. Delete the top seven lines starting with 111, 112 etc. Now select all and use the Table sort command to sort the combined list alphabetically. Go through the list and compare the red numbers to the black numbers. Where a red number duplicates a black number, delete the red number. Continue to the end of the file. Now delete all the black numbers (or move all the red numbers to one place). Label the top of the file with the query sequence and your name and bring it to me on a floppy. Please include the whole blast file too, so that I can get to the alignments. I brought 11 floppies, so you may have to share one. I will incorporate the new hits into the table.
Ciona intestinalis has 608,952 sequences at the JGI server. 687 sequences is about 0.1% of these, which represents about a 900 fold purification factor. The problem we now have is how to sort these sequences into individual genes. This is an iterative problem. The farther you go toward complete gene assembly, the easier it becomes. It is like working a jigsaw puzzle: as you near the end it goes faster and faster. This means that the first part is hard work and not very rewarding. I have already examined about 250 of these 687 sequence fragments and I have partially done the next 70. The result so far is the 69 P450 sequence contigs. In our purification analogy, that would be about a 3.6 fold purification (250/69). The best result is that the 69 contigs are the seed of many, if not most, of the P450s that are going to be found in Ciona.
Let's look at them: gene bins. You can see right away that they are not full length. P450s are about 500 amino acids long. These are much shorter. Many are only single exons. Others have two or more exons. Please notice that there are extensions on the accession numbers: .x1, .y1, .x2, .y2. These indicate the direction of sequencing. Each clone should have a read from both directions, and maybe more than one if the clone was sequenced twice. This can be a help in assembly of the genes.
If we use the find command and type in the last 4 digits (9712) of the first accession, we find that sequence. If we now do control G (on a Mac) for find again, we do not find any other match. The opposite end is not in this set. It may be from the non-coding region of the gene. If we try the next number, 9311, we find two matches, 9311.x2 and 9311.x1. These are from the same direction, just two different sequences. There is no .y1. If we now try the third number, 6161, we find two hits, 6161.x1 and 6161.y1. This is where it gets interesting. The two matches are in different contigs: sequence 1 and sequence 2,3. This suggests that they might be in the same gene.
The other possibility is that they are on a genomic sequence that has two or more P450s in a cluster. They may be from adjacent genes. For our purposes today, we will assume they are from the same gene unless they are clearly not the same sequence. Look at the two contig sequences. Are they alike? Sequence one is a single exon that is 94% identical to sequence 2,3. This might be sequence error. Are there any more accession number matches in these two sets? This would strengthen the case for them being in the same gene. 7602 is in both, 3827 is in sequence 9. Sequence 9 is from the N-terminal and sequence 1 and 2,3 are from the C-terminal region, so these might belong in the same gene. 9056 is in sequence 8, another N-terminal. Note that the first 65 aa of sequences 8 and 9 are identical, so they may be from the same gene. 6240 is in both 1 and 2,3. 9515 is in both. 8468 is in both. There are four sets of accession numbers in the two contigs sequence 1 and sequence 2,3. They differ in 3 amino acids at the heme signature region.
SKHVVAFSVGPRHCLG
||||  || |||||||
SKHVIPFSIGPRHCLG
If this same set of differences is seen in multiple accessions, then they are probably different genes, or maybe alternative exons of the same gene. In any case, sequences 8 and 9 seem to be from the opposite ends of these genes. This process gives clues on how to assemble the gene contigs, even if they do not overlap.
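The search for paired reads in different contigs can also be sketched in Python. The sketch assumes the gene bins have been typed in as a table of contig name and accession list, and that the clone name is everything before the final .x1, .y1 etc. The function name, contig names and accessions below are invented. A link is reported only when a clone has an x read in one contig and a y read in another.

```python
from collections import defaultdict


def linked_contigs(bins):
    """Map each pair of contigs to the clones with an x read in one and a y read in the other."""
    reads = defaultdict(lambda: defaultdict(set))   # clone -> direction -> contigs
    for contig, accessions in bins.items():
        for accession in accessions:
            clone, _dot, extension = accession.rpartition(".")
            if clone and extension[:1] in ("x", "y"):
                reads[clone][extension[0]].add(contig)
    links = defaultdict(set)
    for clone, by_direction in reads.items():
        for x_contig in by_direction["x"]:
            for y_contig in by_direction["y"]:
                if x_contig != y_contig:
                    links[tuple(sorted((x_contig, y_contig)))].add(clone)
    return {pair: sorted(clones) for pair, clones in links.items()}


# Invented bins for illustration only.
example_bins = {
    "sequence 1": ["cloneA.x1", "cloneB.y1", "cloneC.x1"],
    "sequence 2,3": ["cloneA.y1", "cloneB.x1", "cloneD.x1"],
    "sequence 9": ["cloneC.x2", "cloneE.y1"],
}

if __name__ == "__main__":
    for pair, clones in sorted(linked_contigs(example_bins).items()):
        print(pair, len(clones), "shared clones:", ", ".join(clones))
```

The more clones two contigs share, the stronger the case that they belong to the same gene or to neighboring genes. The sequences still have to be compared to decide between those two possibilities.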
Assignment 10
Go through the gene bins above and look for pairs of accessions with .x1 and .y1 extensions that are in different contigs. When you find a pair like this, look at the other accessions in those two contigs to see if any other pairs are present.
Send me your results. Please take contigs with large sets of accessions to improve your chances. Do not all start with sequence 4. Skip around.
Links