Module 11.

Superfamily genomics II. Eukaryotic gene assembly

This module is a continuation of Module 10. Last time we attempted to find all the accession numbers for P450 containing sequences from the Ciona intestinalis server at JGI. Unfortunately, we overloaded their server and their response was to block our access. After about 6 or 8 blasts got through, the server reported a forbidden warning to all utmem.edu domain names. I sent an apology to the administrators and requested that we be given access again, but only for one-at-a-time blasts. They relented and took us off the deny list. We need to be very careful not to overload them again.

At the start of last session I had done 18 blast searches, using one P450 from each human P450 family to cover the P450 "sequence space" in the Ciona intestinalis genome. Our goal was to try the remaining 23 sequences in the P450 set from module 10 (the blue sequences) so we would have one search from every subfamily of P450 in mammals. The blasts that got through before we were cut off, plus more that I did from home (not a utmem.edu domain), covered 26 of the 41 sequences. Those eight additional searches moved us from 687 accessions to 752 accessions. That is only 65 new hits in 8 searches, so the number of new hits was tapering off. I did find 28 more accessions while working on individual sequences, for a total of 780 accessions. There are still more accessions out there to be found, but the payoff is getting smaller with each additional search. It is time to call it good enough for this class and to move on to the next phase.

The next step is gene assembly. I took the blast output from each of the searches and recovered the Ciona protein sequences (bottom lines in the blast alignments) as FASTA format sequences. The file of these sequences was prepared as a database for our blast server, or it was used in the Do-it-yourself Blast server from our Bioinformatics links page. Then each sequence was blast searched against the others to find identical overlapping sequences. These were joined into 201 contigs. This reflects incorporation of more data than we had last session when there were only 69 contigs from about 300 accessions. Here is the same set of 201 contigs in FASTA format.
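
The joining step can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure actually used for the 201 contigs: it looks for exact suffix-to-prefix overlaps between fragments instead of running blast, and the three fragments are hypothetical pieces cut from the I-helix exon of sequence 2 (shown later in this module). The function names and the minimum overlap of 6 residues are assumptions made for the example.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that is identical to a prefix of b.

    Returns 0 if the shared stretch is shorter than min_len residues.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def assemble(fragments, min_len=6):
    """Greedily join fragments that share an identical overlap into contigs."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    merged = True
    while merged:
        merged = False
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                if b in a:
                    joined = a                # b lies entirely inside a
                else:
                    k = overlap(a, b, min_len)
                    if k == 0:
                        continue
                    joined = a + b[k:]        # extend a with the new part of b
                # Replace the pair with the joined contig and rescan from the top.
                contigs = [c for n, c in enumerate(contigs) if n not in (i, j)]
                contigs.append(joined)
                merged = True
                break
            if merged:
                break
    return contigs


# Hypothetical fragments cut from exon 7 of sequence 2, plus one unrelated piece.
frag_a = "DLQLLQYVRDLFVAGTETTTSTLRW"
frag_b = "GTETTTSTLRWSILCMIHNPEKQ"
frag_c = "IHNPEKQEKLRKEICDVI"
unrelated = "FSIGPRHCLGEQLARMEYF"

contigs = assemble([frag_c, frag_a, frag_b, unrelated])
for contig in contigs:
    print(len(contig), contig)
```

The three overlapping fragments collapse into one contig and the unrelated heme-region fragment stays on its own. Real data needs a tolerance for sequencing errors, which is one reason blast was used for the actual assembly.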

Next, I selected all those sequences from the 201 contig set that are from the C-terminal of the P450s. These are the most conserved regions of the P450s. I pasted them into the ClustalW server and created an alignment of these sequences. The alignment was horrible because the fragments were all different lengths and covered different regions of the sequence. To edit the alignment, I imported it into Se-Al (the alignment editor for Macs). This required downloading version 1, since version 2 had bugs in it and would not load the interleaved format of the sequence alignment. Version 1 worked and I was able to edit the 138 sequence alignment into pretty good shape. It is shown here. Numbering refers to the contig number.

As we begin to look more closely at the P450 alignment, it is time to discuss the motifs in P450 sequences that are the benchmarks for making alignments and assembling genes. There are six main benchmarks in the C-terminal half of the sequence. The first is the I-helix. Helices in proteins are named sequentially by letter. The P450 crystal structure has A-L conserved helices. The I-helix starts about 300 amino acids into the sequence and it begins the C-terminal conserved half of the protein. Because these regions are conserved, they are often found in the blast searches. The middle region of the P450s from about 130-300 is poorly conserved and it is not seen very often in the blast output. The middle portion of P450s is always the hardest part to assemble.

The I helix motif

In our alignment, the I-helix motif starts nine amino acids from the right on the first page of the alignment. A few selected sequences are shown below to point out the important features. There are 56 different I-helix regions in the alignment.
FVAGTETTT
FLGGTETTT
LFEGHDTTA
LYGGVDTTS
FLAGSETTS
FIAGTETST
YVAGTETTT
The consensus has two small side chains (A or G), followed by any amino acid, then a negative charge (D or E), and then two Ts. This is written as the pattern [A,G]-[A,G]-X-[D,E]-T-T. Some variation is tolerated. The FEGHDTT variant is characteristic of P450s in the CYP4 clan.

The K helix motif

The K-helix has two invariant residues in P450s, the E and the R of the EXXR motif. These never vary in a real P450 protein. In a pseudogene they might be different, but we are interested in real P450s today. On the second page, 5 amino acids from the right edge, you can see the EXXR motif.
ESMRL
EAMRL
ETLRI
EIFRF
EVFRY
EIYRY
This is short, but easy to look for and recognize. The I-helix and the K-helix are about 55 amino acids apart, though there is some room for variation.

The PKG motif

About 20 amino acids past the K-helix is the PKG motif (middle of the third page of our alignment).
YKIPKNT
YKIPAGE
YVIPKGT
YDIPSGP
YKIPKDT
YHIPKST
FHIPKST
YVIPQGT
YRIPKGW
This sequence is fairly easy to recognize when it is in a longer fragment with another motif. That is also true for other motifs. They are something to look for upstream or downstream of another identified motif. Many P450s have an intron in this location. Sometimes it is in the G coded for by GGT.

The WXXP and PERF motifs

The next two motifs are close together. They are the WXXP and PERF motifs. The W in WXXP is about 20 amino acids from the G in PKG and there is not much variation in that distance. PERF is 5 amino acids away from the P in WXXP. There is some variation in the motifs as shown below.
WDDPXXXXPERH
WEDPXXXXPERF
WKNPXXXXPSRF
WPEPXXXXPQRH
WVDPXXXXPDRH
YEQPXXXXPDRF
FPDAXXXXPDRW
FDQPXXXXPERW

The heme signature

All P450s have a cysteine ligand to the heme iron. This is part of a signature sequence for P450s. It is very easy to recognize. The pattern is FXXGXXXCX[G,A]. Note there are 66 different heme signatures found in the Ciona sequence alignment. That means there are at least 66 different P450 genes in Ciona.
FVAGSRRCAG
FSIGPRSCPG
FGVGARSCIA
FSVGPRHCLG
FSLGPRNCLG
FGGGPRVCIG
FSAGPRNCIG
FSVGARYCMG
FSTGARKCPG
The heme signature is about 16-24 amino acids past PERF and about 50 amino acids from the end of the protein. Recognition of these motifs allows you to place a P450 fragment in its approximate location in a sequence. To improve your skill, look at some of the blast output in the Ciona savignyi files. Here is an example:

Query:   288 SSFNDENLRIVVADLFSAGMVTTSTTLAWGLLLMILHPDVQRRVQQEID-DVIGQVRRPE 346
             S +N   L   V +LF  G  TT+T L W LLL+  +P++ ++++ EID +V G+V+   
Sbjct:   214 SWWNKYQLFHYVKELFLGGTDTTATALRWALLLIHTYPEIAKKLKAEIDTNVTGRVK--- 384

Query:   347 MGDQAHMPYTTAVIHEVQRFGDIVPLGMTHMTSRDIEVQGFRIPKGTTLITNLSSVLKDE 406
             + D  ++PYT AVI EV R+  +V  G    TS      G+RIPKGT ++ N+ SV  D 
Sbjct:   385 LDDMINLPYTQAVIQEVFRYRPVVNFGTMRKTSAVGTAAGYRIPKGTIVMPNIWSVHHDP 564

Query:   407 AVWEKPFRFHPEHFLDAQGHFVKPEAFLPFSAGRRACLGEPLARMELFLFFTSLLQHFSF 466
               W+ P  F PE  +D  G F K +A +PF+ G+R+CLG  LA+ E+F+F   ++Q F F
Sbjct:   565 VRWKNPEVFRPERHIDENGKFKKSDAVIPFNVGQRSCLGRQLAQTEIFIFLVRMMQKFDF 744

The top alignment contains the I-helix. The second alignment has the EXXR and PKG motifs. The third section has the WXXP, PERF and heme signatures. This is one continuous sequence. Note that the blast search top line has the amino acid position of the query sequence. This is also a strong aid in looking for the motifs because it locates your position in the 500 amino acid long P450 protein. Most blast output is much shorter than this and contains only one or two motifs.
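
As a rough check on the spacing rules, the six C-terminal motifs can be located in the Ciona (Sbjct) residues of the alignment above with regular expressions. This is a sketch under assumptions: the I-helix, EXXR, WXXP and heme patterns follow the text, but the residue classes used for PKG and PERF are simply collected from the example lists in this module and are not an official definition. Positions are 0-based offsets within this 177 residue fragment, not positions in the full protein.

```python
import re

# Sbjct residues from the three alignment blocks above, gap characters removed.
subject = (
    "SWWNKYQLFHYVKELFLGGTDTTATALRWALLLIHTYPEIAKKLKAEIDTNVTGRVK"
    "LDDMINLPYTQAVIQEVFRYRPVVNFGTMRKTSAVGTAAGYRIPKGTIVMPNIWSVHHDP"
    "VRWKNPEVFRPERHIDENGKFKKSDAVIPFNVGQRSCLGRQLAQTEIFIFLVRMMQKFDF"
)

# Motifs in N-terminal to C-terminal order.
C_TERMINAL_MOTIFS = [
    ("I-helix", r"[AG][AG].[DE]TT"),
    ("EXXR", r"E..R"),
    ("PKG", r"P[KAQS][GNDS]"),      # assumed classes, taken from the PKG examples
    ("WXXP", r"W..P"),
    ("PERF", r"P[EDSQ]R[FHW]"),     # assumed classes, taken from the PERF examples
    ("heme", r"F..G...C.[GA]"),
]


def scan_in_order(seq, motifs=C_TERMINAL_MOTIFS):
    """Find each motif in turn, starting the search after the previous hit."""
    found = {}
    start = 0
    for name, pattern in motifs:
        match = re.compile(pattern).search(seq, start)
        if match is None:
            break
        found[name] = match.start()
        start = match.end()
    return found


found = scan_in_order(subject)
for name, offset in found.items():
    print(f"{name:8s} offset {offset:3d}  {subject[offset:offset + 10]}")

print("I-helix to EXXR:", found["EXXR"] - found["I-helix"])
print("P of WXXP to PERF:", found["PERF"] - (found["WXXP"] + 3))
print("PERF to heme:", found["heme"] - found["PERF"])
```

Searching in order matters because a short pattern like E..R also matches inside the WXXP to PERF stretch (EVFR in WKNPEVFRPERH). Anchoring each search after the previous motif avoids picking up that second hit as the K-helix.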

Now let's look at the three N-terminal motifs.

The N-terminal Proline rich motif

About 35-50 amino acids into the P450 sequence from the N-terminal is a proline rich region with a pattern like this [PPGP]-XXX-[P]-[hydrophobic]-[hydrophobic]-[G]-X-[hydrophobic]. Hydrophobic means F, I, L, M, Y or V. Some examples are shown below. You have to allow for some variation; for example, the second and third P are missing in sequence 190.
PPGPTPLPLLGNL Cyp2s1 mouse
PPGPFAFPIVGNF CYP2N10 Fugu
PPGPRGVPFLGVI sequence 2
PPGPMGVPFLGCL sequence 7
PPGPRGIPFLGVL CIONA SAVIGNYI ortholog to seq 2
PPGPRGFPIVGVL sequence 1
PRGNMGFPLVGEM sequence 190 
PPGPRGIPFLGII sequence 220
PPGPAGLPLIGSL sequence 222

The KYG motif

About 12-20 amino acids after the PPGPXXXPLLGNL motif sequence is a KYG motif with a pattern like this [K,R,Q,E,I,T]-[Y]-[G]
KYG sequence 1, 2, 18, 49, 57, 94, 102, 103, 104, 105, 106, 107, 181, 199, 201
RYG sequence 83, 89, 193, 210
IYG sequence 15, 36
QYG sequence 65, 29
EYG sequence 71, 79
TYG sequence 54

The C-helix motif

About 125 amino acids from the N-terminal is a WXXXR motif with a pattern like this [W]-[XXX]-[R]-[R,K]. In Ciona the second site is usually K. Not every sequence has a recognizable C-helix motif.
WKTQRR sequence 2, 20
WKEHRK sequence 7
WKMQRR sequence 8
WKTHRK sequence 199
WKVQRR sequence 210
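
The same kind of search works for the three N-terminal motifs. The sketch below uses the first two exons of sequence 2 (listed later under the case studies) and reports 1-based positions so they can be compared with the distances quoted above. The regular expressions are a direct transcription of the patterns in the text, and the function name is just a label chosen for this example.

```python
import re

# Exons 1 and 2 of sequence 2, as listed in the case study later in this module.
n_terminal = (
    "MVLQLLSDINVSSLVIFFTAFLALYYWYTRPKNFPPGPRGVPFLGVIPFLGNYPERVMRKWSKKYGPVMSVRMG"
    "REDWVVLGDYETIQQ"
    "SLVKQGQCFSGRPDVPVLNQITNGHGLITVDYNEDWKTQRRFGITTLRG"
)

HYDROPHOBIC = "[FILMYV]"
N_TERMINAL_MOTIFS = [
    ("proline-rich", "PPGP...P" + HYDROPHOBIC + HYDROPHOBIC + "G." + HYDROPHOBIC),
    ("KYG", "[KRQEIT]YG"),
    ("C-helix", "W...R[RK]"),
]


def locate(seq, motifs):
    """Return the 1-based start of each motif, searching from N- to C-terminal."""
    positions = {}
    start = 0
    for name, pattern in motifs:
        match = re.compile(pattern).search(seq, start)
        if match is None:
            break
        positions[name] = match.start() + 1
        start = match.end()
    return positions


positions = locate(n_terminal, N_TERMINAL_MOTIFS)
print(positions)
```

For sequence 2 the proline rich motif starts at residue 35, the KYG motif follows 17 residues after the end of that 13 residue motif, and the W of the C-helix motif sits at residue 125, all in line with the rules of thumb above.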

Intron Exon Boundaries

The genomic DNA sequence we are seeing in these blasts is broken up into introns and exons. The exons are coding sequence and the introns are not. The spliceosome in eukaryotic nuclei can recognize the boundaries of these introns and remove them, joining the exons together. To assemble genes, you have to be the spliceosome and find the boundaries yourself. There are guidelines to help you.

1. The intron begins with a GT in almost all cases, so the exon ends just before a GT. In phase 0 introns, which are probably the most common type, the amino acid in the translation at the boundary will be V, since all four V codons start with GT. The GT is part of the intron and it is removed.

2. The intron ends with an AG in almost all cases. For phase 0 introns that means the codons CAG(Q), AAG(K), GAG(E) or TAG(*) are found just before the next exon begins. The combination V and Q is very common at the start and stop of phase 0 introns.

3. Phase must be preserved between exons. An exon that ends as phase 1 must join with the next exon at a phase 1 joint. This limits the number of possible joints.

4. Phase is determined by the reading frame of the exons. Phase 0 introns break between codons of the open reading frame of the exons. Phase 1 breaks one nucleotide into a codon and phase 2 breaks two nucleotides in.

5. Phase 1 introns often use GGT glycine codons for the beginning of an intron. Phase 2 introns often end with AGA arginine.

6. Blast alignments often show good similarity right up to an intron, where the similarity drops off dramatically. This is especially true if the sequences being compared are very similar. That is why it is best to use the closest match possible when doing blast searches to assemble a gene.

Query:     7 QLFHYVKELFLGGTDTTATALRWALLLIHTYPEIAKKLKAEIDTNVTGRVK 57
             QLFHYVKELFL GTDTTAT LRWA+LLIH + ++ +K+  EID   +  VK
Sbjct:     2 QLFHYVKELFLAGTDTTATTLRWAILLIHYHTDVHEKIHEEIDREASNPVK 52

In the blast shown above, it is possible that the Q on the left is part of a phase 0 intron boundary. However, note the numbering. The lower sequence is starting at amino acid 2, so it looks like this sequence was cut off in the middle of a strongly conserved region. If this had been a nucleotide sequence (TBLASTN search) and the number of the nucleotide on the bottom was higher, then the Q would probably be at the intron boundary. In a case like this it would be necessary to see more output that covered the sequence upstream. This is shown below for a related sequence. Here the sequence does continue upstream, but the match does not. This supports the idea that the Q is part of an intron boundary. That would require a phase 0 boundary on the next exon upstream.

Query:     7 QLFHYVKELFLGGTDTTATALRWALL-LIHTYPEIAKKLKAEI 48
             QL HYV++LF+ GT+TT + LRW+LL LIH  PEI  KL+ EI
Sbjct:   415 QLLHYVRDLFVAGTETTTSTLRWSLLCLIHD-PEIQDKLRKEI 290
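
The GT/AG and phase rules can be made concrete with a toy example. The DNA below is invented for illustration only (it is not Ciona sequence), and the helper names are assumptions. The sketch computes the phase of an intron from the number of coding nucleotides upstream, checks the GT...AG ends, and joins the exons.

```python
def intron_phase(coding_nt_upstream):
    """Phase 0 falls between codons; phase 1 or 2 falls that many nt into a codon."""
    return coding_nt_upstream % 3


def has_canonical_ends(intron):
    """True if the intron starts with GT and ends with AG."""
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def splice(exon1, intron, exon2):
    """Remove the intron and join the exons, refusing non GT...AG introns."""
    if not has_canonical_ends(intron):
        raise ValueError("intron does not follow the GT...AG rule")
    return exon1 + exon2


# Invented toy gene: two exons separated by one intron.
exon1 = "ATGGCTG"          # ATG GCT G   -> two codons plus one nucleotide
intron = "GTAAGTTTTCAG"    # starts with GT, ends with AG
exon2 = "GTACTTAA"         # GT ACT TAA  -> completes the split codon, then stop

phase = intron_phase(len(exon1))
spliced = splice(exon1, intron, exon2)

# An unspliced translation reads straight into the intron. Here the codon at the
# boundary is the last exon nucleotide plus the intron's GT, which reads as GGT (Gly).
read_through_codon = exon1[-1] + intron[:2]

# The next exon must supply the missing part of the split codon.
nt_needed_from_exon2 = (3 - phase) % 3

print("phase:", phase)
print("read-through codon at boundary:", read_through_codon)
print("nt of split codon in exon 2:", nt_needed_from_exon2)
print("spliced coding sequence:", spliced)
```

In this toy case the intron is phase 1 and the boundary codon reads GGT in the unspliced DNA, which is the glycine signal described in guideline 5.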

Putting it all together

Gene building requires two things: you need all the pieces, and you have to be able to find the intron-exon boundaries. When searching in long regions of contiguous DNA, you have the advantage of knowing that all or most of a gene will be there if you have found part of it. It is just a matter of locating the other conserved regions and then homing in on the boundaries of the exons. We are not so lucky with the Ciona data. Most of the fragments are quite small, covering one or two exons. The strategy has to be different in this case. We have to look for overlapping pieces to assemble the contiguous protein sequence. The process is simple enough. Blast searches are done with a starting region, probably a heme signature or I-helix fragment. This has already been done for the Ciona intestinalis data at JGI, but more fragments could be found by taking the sequence of interest and blasting that against JGI again. Remember, we found these fragments with human sequences and not Ciona sequences. It is worth trying again with Ciona sequence to extend a match. Once a hit is found, look at the nucleotide numbers to see if it might extend several hundred bases beyond your query. If it does, translate it and look for the appropriate motifs. The nice feature of the JGI blast results is that they give a link for the opposite end of the clone. For each hit, both ends of the clone can be retrieved. It is worth doing this to see if the other end has any recognizable motifs. If it does, they are probably part of the same gene. They might belong to a gene cluster with two or more related genes, but that will become apparent as you try to build the gene. A gene cluster will start to give multiple sequences for the same region (like two I-helix sequences). For the assignment last time, you looked for P450 fragments from the same clone that had been assembled into different contigs. Their accession numbers matched except for the .x1 and .y1 extensions. This was not too hard, since there were many of these.
I have used that information in joining contigs together, even if they do not overlap. We can do this because they are from the same clone. This strategy is probably going to join N-terminal contigs to C-terminal contigs with no overlap in the middle part of the gene. There are three ways to proceed.
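
Finding contigs linked by the two ends of one clone is easy to automate. The sketch below uses the accession lists given later in this module for sequences 69 and 37. It assumes that the clone name is everything before the first dot in the accession, which fits the .x1, .y1 and .y2 extensions seen here; the function names are made up for the example.

```python
from collections import defaultdict

# Accession lists for two contigs, taken from the sequence 69 case study.
contig_accessions = {
    "sequence 69": ["LQW231417.x1", "LQW265377.y2", "DEV27943.x1"],
    "sequence 37": ["LQW269127.y1", "LQW265377.x1", "LQW48810.x1"],
}


def clone_id(accession):
    """Strip the read extension (.x1, .y1, .y2 ...) to get the clone name."""
    return accession.split(".", 1)[0]


def clone_links(contigs):
    """Map each clone to the contigs it appears in, keeping only shared clones."""
    by_clone = defaultdict(set)
    for contig, accessions in contigs.items():
        for accession in accessions:
            by_clone[clone_id(accession)].add(contig)
    return {clone: names for clone, names in by_clone.items() if len(names) > 1}


links = clone_links(contig_accessions)
for clone, names in links.items():
    print(clone, "joins", " and ".join(sorted(names)))
```

Only clone LQW265377 appears in both contigs, so it is the evidence for placing sequence 37 and sequence 69 in the same gene bin.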

1. Go for the ends of the gene. The N-terminal and C-terminal should be easy to find if you have the fragments with the heme signature or the first 40 amino acids of the P450. Get your DNA sequence from the server by clicking on the link and then translate in the proper three frames and look for your match. Just follow it up or down to find the stop codon at the C-terminal or the start Met at the N-terminal.

2. Work toward the middle of the gene. Once again retrieve your DNA sequence and translate to follow the sequence forward. In the middle, there are no easy motifs. The best strategy is to blast candidate exons against the Ciona data itself to see if there is a match. There are three places to blast. A. The P450 server has the 201 contigs in it and this may hit your new sequence. B. The JGI server may hit a closely related sequence that lets you find the probable exon. C. The Ciona savignyi blast servers 1 and 2 could be used to hit the ortholog in the other Ciona species.

3. If 1 and 2 did not work, walk the chromosome. Find the far end of your DNA sequence and look for overlaps to a new accession. This would continue your sequence farther down the chromosome, then repeat step two on this new sequence. You should be able to find the next exon this way.

Gene models

If you have a gene that has already been assembled from Ciona, this can help you to assemble the next related gene if they are in the same family. They should have the same intron-exon structures and phases. Use the known gene model to predict where intron-exon boundaries should be. This is a great time saver. However, we do not have pre assembled genes to work with yet. We do have some intron-exon boundaries defined and these can be used. Look at the 201 contigs file. Also see below under case studies for sequence 2.

cDNA

Do not forget that Ciona has a cDNA project going on in Japan. There is a blast server at the bioinformatics links page. These ESTs are also in Genbank (at least some of them are), so you can search Genbank (other ESTs, limited to Ciona in the species window).

Case studies, sequence 2

Sequence 2 is complete. It has 81 accession numbers that match it, so it is the P450 sequence with the best representation in the database. Part of the reason for this is that it is complete, while most other sequences have only a small portion identified. Therefore, all the accessions from both ends and the middle are represented. Note that this sequence is composed of contigs 2, 3, 9, and 45 that have been joined together. There are several close relatives to sequence 2 like sequence 1, which shares some of the same accessions with opposite ends being found in both sequences. The C-terminal part of sequence 1 looks like the same sequence as sequence 2 but the N-terminal parts are different. There even seem to be two different N-terminals. This is indicative of a gene cluster.

>sequence 1,50 16 accessions 94% to sequence 2 PKG to heme 49% to 2R1 or 2j6
          Length = 293

 Score = 853 (300.3 bits), Expect = 3.4e-115, Sum P(2) = 3.4e-115
 Identities = 161/178 (90%), Positives = 165/178 (92%)
C-terminal
Query:   292 QLLQYVRDLFVAGTETTTSTLRWSILCMIHNPEKQEKLRKEICDVIGQD-----RVPAMN 346
             QLLQYVRDLFVAGTETTTSTLRWSILCMIHNPEKQEKLRKEICDVIG+      RVPAMN
Sbjct:   116 QLLQYVRDLFVAGTETTTSTLRWSILCMIHNPEKQEKLRKEICDVIGKKIQARHRVPAMN 175

Query:   347 DKAQMPYTCAFMQEVFRYRTLVPLSVVHMTNQDVVLNGYTIPKGTTISPNLWAVHNNPDV 406
             DKAQMPYTCAFMQEVFRYRTLVPLSVVHMTNQDVVLNGYTIPK      NLWAVHN+PDV
Sbjct:   176 DKAQMPYTCAFMQEVFRYRTLVPLSVVHMTNQDVVLNGYTIPKXXXXXXNLWAVHNDPDV 235

Query:   407 WDEPSKFKPERHLDDKGNFVQSKHVIPFSIGPRHCLGEQLARMEYFIYLVSMVQKFEF 464
             WDEPSKFKPERHLDDKGNFVQSKHV+ FS+GPRHCLGEQLARMEYFIYLVSMVQKFEF
Sbjct:   236 WDEPSKFKPERHLDDKGNFVQSKHVVAFSVGPRHCLGEQLARMEYFIYLVSMVQKFEF 293

 Score = 272 (95.7 bits), Expect = 3.4e-115, Sum P(2) = 3.4e-115
 Identities = 46/71 (64%), Positives = 52/71 (73%)
N-terminal
Query:    15 VIFFTAFLALYYWYTRPKNFPPGPRGVPFLGVIPFLGNYPERVMRKWSKKYGPVMSVRMG 74
             ++     L  Y WY RP  FPPGPRG P +GV+PFL  Y ER M KWSKKYGPVMSVRMG
Sbjct:     1 ILTLICLLLFYTWYRRPSRFPPGPRGFPIVGVLPFLEKYSERTMHKWSKKYGPVMSVRMG 60

Query:    75 REDWVVLGDYE 85
              EDWVV+G+YE
Sbjct:    61 NEDWVVMGNYE 71

 Score = 189 (66.5 bits), Expect = 2.0e-106, Sum P(2) = 2.0e-106
 Identities = 33/48 (68%), Positives = 39/48 (81%)
N-terminal
Query:    42 PFLGVIPFLGNYPERVMRKWSKKYGPVMSVRMGREDWVVLGDYETIQQ 89
             P +GV+PFL  Y ER M KWSKKYGPVMSVRMG +DWVV+G+YE + Q
Sbjct:    72 PSVGVLPFLEKYSERTMHKWSKKYGPVMSVRMGNDDWVVMGNYEQLLQ 119


The best human match to sequence 2 is CYP2R1. The CYP2 family sequences in mammals and in fish have 9 exons, preserved over 420 million years. Note that sequence 2 also has 9 exons. Are they in the same place? Look at the Cyp2s1 from mouse and CYP2N10 from Fugu. I think you can see that the conserved motifs (red) are in the same regions, and the exon boundaries are also in equivalent positions. The phases are 0,1,1,0,0,1,0,1 in both. Below those are sequence 2 and the Ciona savignyi ortholog that I assembled. Note right away that the phases are not the same: 0,2,0,2,1,0,1,0. By looking at the spacing between the motifs, it is clear that not even one intron location is the same as the CYP2 family. Because of this, the human CYP2 gene model cannot be used with Ciona. However, the Ciona intestinalis gene 2 model can be used with Ciona savignyi to build related genes. It can also be used with other intestinalis genes in the same family.


Cyp2s1   mouse 
MEAASTWALLLALLLLLLLLSLTLFRTPARGYLPPGPTPLPLLGNLLQLRPGALYSGLLR (0)
LSKKYGPVFTVYLGPWRRVVVLVGHDAVREALGGQAEEFSGRGTLATLDKTFDGHG (1)
VFFANGERWKQLRKFTLLALRDLGMGKREGEELIQAEVQSLVEAFQKTE (1)
GRPFNPSMLLAQATSNVVCSLVFGIRLPYDDKEFQAVIQAASGTLLGISSPWGQ (0)
AYEMFSWLLQPLPGPHTQLQHHLGTLAAFTIQQVQKHQGRFQTSGPARDVVDAFLLKMAQ (0)
EKQDPGTEFTEKNLLMTVTYLLFAGTMTIGATIRYALLLLLRYPQVQ (1)
QRVREELIQELGPGRAPSLSDRVRLPYTDAVLHEAQRLLALVPMGMPHTITRTTCFRGYTLPK (0)
GTEVFPLIGSILHDPAVFQNPGEFHPGRFLDEDGRLRKHEAFLPYSL (1)
GKRVCLGEGLARAELWLFFTSILQAFSLETPCPPGDLSLKPAISGLFNIPPDFQLRVWPTGDQSR*

CYP2N10 Fugu Scaffold_3261b complete gene 9 exons 
MWLYSVLSWDFTSLLLFFFVLILFANYLKNRDPPNFPPGPFAFPIVGNFFTMDSKNLHLYFNK 13695 (0)
LADVHGNVFSFRLGGDKMVCVSGHKMVKEAIVTQADNFVDRPYDPISARVYGGQT 12393 (1)
DGLFQSNGEVWKRQRRFALSTLRNFGLGKNILEQSICEEAQHLLEEMRSHG 12153 (1)
GKPFNPARLFNNTVSNIICQLVMGKRFEYSDHKFQMLLKYLSEVLVLEGSFWGQ 11913 (0)
LYEAFPSVMKHLPGPHNKVFSHFNHLKDFMNEEIQNHKKDLDHNNPRDYIDAFIIEMEK 11638 (0)
NKDTNLGFTETNLAMCSLDLFIAGTETTATTLLWDLVYLINNPDIQ 11413 (1)
GKVQAEIDQVIGQNRQPTMADRPNLPYTDAVIHEIQRMGNIVPLNGPRMAAKDTTLGGYFIPK (0)
GTSLMPILTSVLFDKNEWETPDKFNPGHFLDAEGKFKKREALLPFSA (1)
GKRVCLGEGLAKMELFLFFVSLFQNFTFFVPGGAELNTEGITGTTRVPHPFEILARPR* 10619

>sequence 2,3,9,45 COMPLETE 40% to 2U1
MVLQLLSDINVSSLVIFFTAFLALYYWYTRPKNFPPGPRGVPFLGVIPFLGNYPERVMRKWSKKYGPVMSVRMG
REDWVVLGDYETIQQ (0)
SLVKQGQCFSGRPDVPVLNQITNGHGLITVDYNEDWKTQRRFGITTLRG (2)
FGVGKRSMEDRIVEEVAYLNDAIRSHNEKPFDIL (0)
SILSNAVSNNICSVVMGRRFDYDDKRFMEIMARLSRS (2)
FNDPTANFALNVVMFMPILVKIPPFSRINNQLMTDVRVIL (1)
QMLREILSEHKSTFNKDDVRDFIDAFIAEQNSESKHSSYT (0)
DLQLLQYVRDLFVAGTETTTSTLRWSILCMIHNPEKQEKLRKEICDVI (1)
GQDRVPAMNDKAQMPYTCAFMQEVFRYRTLVPLSVVHMTNQDVVLNGYTIPKGTT (0)
ISPNLWAVHNNPDVWDEPSKFKPERHLDDKGNFVQSKHVIPFSIGPRHCLGEQLARMEYFIYLVSMV
QKFEFFPDPNEPDLPDVEDGSSGVVFVPLRFKQIAKIV*

>CIONA SAVIGNYI SEQUENCE assembled 69% TO CIONA INTESTINALIS 2,3,9
MLQRMLNEINVSTSFIFLTVFLGLYYWYRRPKNFPPGPRGIPFLGVLPFLGNYPERKMRK
WSNKYGPVMSVRMGRQDWVVLGDHETIQQ (0)
TLVKQGSIFSGRPSIPILEEMTKGHGILLLDYGEKWKSQRKFGLMTLRG (2)
FGVGKRSMEDRITEEVAYLNDAIRTHDGKAFNIQ (0) 
SILSNAISNNICSIVMGQRFDYDDERFKEIMTKLSYG (2)
FNDPEVSLVRQILIFMPALVNAPYFSRINAELMENVRVIS (1)
ELLREIVADHKALYDQDNHRDFIDAFLGEQKSENGSETSRYI (0?)
DKQLLHYVRDLFFAGSETSTSTLRWTLLCLIHHPKKQERLRKEIFEVL (1)
GQEKIPAVDNKSYMPYTCAFMQEVYRYRTLAPFGVAHMTNEDVNLNGYSIPNGTT (0)
ISSNLWAVHNDPDVWNEPSKFKPERHLDDKGNFVQSSHVIPFSVGPRHCLGEQLARME
VFIYLVSLVQKFEFLPDPDATELPDIKIGSNGPAYVPLPFNMVARVV*
How was sequence 2 assembled? Remember that we used one sequence from each human P450 family to do the original search. That means that CYP2C8 was the sequence used from the CYP2 family. Later some other CYP2s were used: 2F1, 2B6 and maybe another. What is found in doing the CYP2C8 search? 250 hits are returned and 48 of them are in the set of 82 accessions listed as part of the sequence 2 set. That is, 59% of the accessions were found in this single search. Since the JGI results also include the opposite ends of the clones, I searched the 2C8 output from this set against the 82 accession numbers and found 19 that were in the set from the opposite end of the clone. This means that both ends contain P450 sequence and they probably come from N and C-terminal regions. 11 of these 19 accessions were in both sets, so only 8 new accessions were found in the set of opposite ends. That brings the total to 56/82, so 68% of the sequence 2 accessions identified so far were in the CYP2C8 blast output.
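
The bookkeeping in this paragraph is simple enough to check by hand, but writing it out makes the steps explicit. The numbers are the ones quoted above; the variable names are only labels for this sketch.

```python
set_size = 82            # accessions listed for the sequence 2 set
direct_hits = 48         # of those, found directly in the CYP2C8 blast output
opposite_end_hits = 19   # in the set, reached through the opposite end of a clone
already_counted = 11     # opposite-end hits that were also direct hits

new_from_opposite_ends = opposite_end_hits - already_counted
total_found = direct_hits + new_from_opposite_ends

percent_direct = round(100 * direct_hits / set_size)
percent_total = round(100 * total_found / set_size)

print("new accessions from opposite ends:", new_from_opposite_ends)
print("total found:", total_found, "of", set_size)
print("direct search alone:", percent_direct, "%")
print("with opposite ends:", percent_total, "%")
```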

What is the sequence coverage of these 56 accessions? The red regions were found.

>sequence 2,3,9,45 COMPLETE 81 accessions 94% to sequence 1 40% to 2U1
MVLQLLSDINVSSLVIFFTAFLALYYWYTRPKNFPPGPRGVPFLGVIPFLGNYPERVMRKWSKKYGPVMSVRMG
REDWVVLGDYETIQQ (0)
SLVKQGQCFSGRPDVPVLNQITNGHGLITVDYNEDWKTQRRFGITTLRG (2)
FGVGKRSMEDRIVEEVAYLNDAIRSHNEKPFDIL (0)
SILSNAVSNNICSVVMGRRFDYDDKRFMEIMARLSRS (2)
FNDPTANFALNVVMFMPILVKIPPFSRINNQLMTDVRVIL (1)
QMLREILSEHKSTFNKDDVRDFIDAFIAEQNSESKHSSYT (0)
DLQLLQYVRDLFVAGTETTTSTLRWSILCMIHNPEKQEKLRKEICDVI (1)
GQDRVPAMNDKAQMPYTCAFMQEVFRYRTLVPLSVVHMTNQDVVLNGYTIPKGTT (0)
ISPNLWAVHNNPDVWDEPSKFKPERHLDDKGNFVQSKHVIPFSIGPRHCLGEQLARMEYFIYLVSMV
QKFEFFPDPNEPDLPDVEDGSSGVVFVPLRFKQIAKIV*
We can see that exons 5 and 6 in the middle of the protein were not found. Also, the N and C-terminals were not matched all the way to the ends. The beginning of exon 7 (the I-helix exon) was missed, and phase 2 boundaries did not match exactly, because the codon at such a joint is split between two exons and the amino acid at the joint may not match either amino acid in the DNA translation. The problem then is finding the missing N and C-terminals (easy) and finding the missing middle two exons (hard). Finding the exon boundaries is a third challenge.

Notice that the results so far will give two different contigs, one for the N-terminal and one for the C-terminal. The N-terminal might even be broken into more than one contig if two exons are not found on the same clone. The assignment last time to look for .x1 and .y1 versions of the same accession numbers could be used here to join contigs together. This is what happened, since sequence 2 was made up of four different contigs 2, 3, 9, and 45. Sequence 9 was an N-terminal contig and sequence 2 and 3 were C-terminals.

Once the two are joined into a single gene bin, the hunt can continue for the ends. These can be found by getting the sequences that are closest to the ends and translating them to look for start Met or stop codons. Once these are found at the expected distance, the sequence should be blast searched against the JGI data to see if other clones give the same sequence. This can detect frameshift errors. It would also be good to blast against the C. savignyi data for orthologous matches. That would go a long way toward confirmation that the sequence is correct.

A related strategy can be used to find the two missing middle exons. Get the closest DNA sequence from the server and translate it in the three proper frames. Look for a candidate region for an exon (a region with about 40 or more amino acids without a stop codon). This should be within 200 to 800 bases from the last exon, though it might be more than that. If you find a reasonable sequence, blast it against both Ciona genomes to see if it matches another related sequence. This will identify the exon and come close to identifying the intron-exon boundaries. If that does not work, get the opposite end clone from the JGI data and translate that, looking for motifs. You may find that the opposite end clone covers the missing region. Try to identify the nearest motif in the translation to guide placement of the sequence. A faster method would be to paste the whole translation in the blast window and let the blast program find any matches for you. The JGI server allows TBLASTX, so you could paste the nucleotide sequence there and look for matches that way. BEWARE OF REPEAT SEQUENCES.

The following blast result shows that exon 6 is 520 bp upstream of exon 7, so if this sequence were blast searched against either Ciona database this exon 6 would show up as a strong match. Remember, for this to work, you do not want to find an exact match, because that will not tell you where the exons are. You want a 75-90% match in the exon region and a much lower match in the intron region.


LQW183999.x1 
LQW183999.x1.phd.1 LQW183999.y1 

           11:52:52 2001 TEMPLATE: LQW183999 DIRECTION: fwd
          Length = 978

 Score = 97.9 bits (240), Expect = 3e-20
 Identities = 45/48 (93%), Positives = 47/48 (97%)
 Frame = +2

Query: 81  DLQLLQYVRDLFVAGTETTTSTLRWSILCMIHNPEKQEKLRKEICDVI 128
           DLQLLQYVRDLFVAGTETTTSTLRWSILCMIHNPEKQEK R+EICDV+
Sbjct: 596 DLQLLQYVRDLFVAGTETTTSTLRWSILCMIHNPEKQEKFRQEICDVL 739

 Score = 54.3 bits (128), Expect = 5e-07
 Identities = 25/25 (100%), Positives = 25/25 (100%)
 Frame = +2

Query: 56 KDDVRDFIDAFIAEQNSESKHSSYT 80
          KDDVRDFIDAFIAEQNSESKHSSYT
Sbjct: 2  KDDVRDFIDAFIAEQNSESKHSSYT 76
A good example is shown below where a related sequence has exon 6 only 69 bp upstream. The boundary is crying out Phase 0, Phase 0 (intron starts with V ends with *).


LQW269368.y1 
LQW269368.y1.phd.1 LQW269368.x1 

           12:23:14 2001 TEMPLATE: LQW269368 DIRECTION: rev
          Length = 968

 Score =  138 bits (344), Expect = 2e-32
 Identities = 73/113 (64%), Positives = 79/113 (69%), Gaps = 23/113 (20%)
 Frame = +2

Query: 39  ILQMLREILSEHKSTFNKDDVRDFIDAFIAEQNSESKHSSYT------------------ 80
           I   L EI+SEHKSTFNKDD RDFIDAFIAE+NS++KHSS+T                  
Sbjct: 176 IADHLDEIVSEHKSTFNKDDARDFIDAFIAEKNSQNKHSSFTVRISLILKKGLLTGCFMI 355

Query: 81  -----DLQLLQYVRDLFVAGTETTTSTLRWSILCMIHNPEKQEKLRKEICDVI 128
                D QLL YV DLF AGTETTTSTL WSILCMIHNPEKQEKLRKEIC V+
Sbjct: 356 IL*M*DSQLLHYVVDLFEAGTETTTSTLMWSILCMIHNPEKQEKLRKEICSVV 514
There were 10 accession numbers that had both exon 5 and 6 on the same piece of DNA. So once you found exon 6 it would be easy to find exon 5, using the same strategy. There were at least 18 accessions with exons 4 and 5 on the same piece of DNA. So you may have to go one exon at a time through a gap region. The coverage seems to be good enough to do this, at least for this sequence 2 gene. It may be worse for some of the other genes.
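
The intron sizes quoted for these two hits can be recovered from the numbers in the blast output. The sketch below assumes the Sbjct coordinates are nucleotide positions on the same strand and in the same frame, as in the two TBLASTN results above; the function names are invented for the example.

```python
def residues_covered(sbjct_start, sbjct_end):
    """Number of codons spanned by a TBLASTN hit given in nucleotide coordinates."""
    return (sbjct_end - sbjct_start + 1) // 3


def nt_between(upstream_end, downstream_start):
    """Nucleotides lying strictly between two hits on the same strand."""
    return downstream_start - upstream_end - 1


def nt_in_gap(gap_columns):
    """Nucleotides inserted in the subject when the query shows gap columns."""
    return 3 * gap_columns


# LQW183999.x1: exon 6 hit at Sbjct 2-76, exon 7 hit at Sbjct 596-739.
exon6_residues = residues_covered(2, 76)
exon7_residues = residues_covered(596, 739)
intron_first_case = nt_between(76, 596)

# LQW269368.y1: one hit reads through the intron, leaving 23 gap columns in the query.
intron_second_case = nt_in_gap(23)

print("exon 6 residues matched:", exon6_residues)
print("exon 7 residues matched:", exon7_residues)
print("nt between the two hits:", intron_first_case)
print("nt read through in the second case:", intron_second_case)
```

The first case gives 519 nucleotides between the hits, which is the roughly 520 bp quoted above, and the second gives the 69 bp intron. The residue counts (25 and 48) agree with the lengths of the two alignments, a quick check that the coordinates were read correctly.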

Case Studies, Sequence 69

Two students worked on sequence 69 after class last week. They assembled most of the gene in about two hours. I want to go through the process in detail here, since there are a few things I did not mention in the class that would be helpful.

Step 1. Identifying motifs

Sequence 69 has five motifs from the C-terminal half of the molecule.
sequence 69 3 accessions 49% to 2R1
DMSHHESMPYLRAFIQEVHRFQTIAPLNIPHCVTEDCVLFGYHIPKSTPVMSNIWRVHND 
PKYWENPEKFSPERHLDSEGRFVPSNRVLSFAVGHRSCLGVQLARVELFLF
STLKKYEFQIDPEYGL 
LQW231417.x1  
LQW265377.y2  
DEV27943.x1


Step 2. Look for the opposite end of the clone

Sequence 37 shares a clone with sequence 69: accession LQW265377.y2 is in sequence 69 and LQW265377.x1 is in sequence 37. These are opposite ends of the same clone and are assumed to be opposite ends of the same gene. Sequence 37 has the C-helix motif, so it is from the N-terminal.

sequence 37 3 accessions 46% to 2C8
TLIQALLKQGESFSGRPQSYLMNQLTEGCGIVFSTGPRWQAQRRFVLTALKTY
LQW269127.y1 
LQW265377.x1 
LQW48810.x1


Step 3. Find the C-terminal end of the gene.

Blast the Ciona server at JGI with the last part of the gene PKYWENPEKFSPERHLDSEGRFVPSNRVLSFAVGHRSCLGVQLARVELFLFSTLKKYEFQIDPEYGL. Use TblastN without a filter and probably use PAM30 as the matrix, though BLOSUM62 will work. PAM30 is made for more similar sequences and we want exact matches. This search will hit 3 exact matches at JGI. One is shown below.

LQW231417.x1 
LQW231417.x1.phd.1 LQW231417.y1 

           11:26:56 2001 TEMPLATE: LQW231417 DIRECTION: fwd
          Length = 931

Score =  217 bits (504), Expect = 3e-56
 Identities = 67/69 (97%), Positives = 67/69 (97%), Gaps = 2/69 (2%)
 Frame = +1

Query: 1   PKYWENPEKFSPERHLDSEGRFVPSNRVLSFAVGHRSCLGVQLARVELFLF--STLKKYE 58
           PKYWENPEKFSPERHLDSEGRFVPSNRVLSFAVGHRSCLGVQLARVELFLF  STLKKYE
Sbjct: 373 PKYWENPEKFSPERHLDSEGRFVPSNRVLSFAVGHRSCLGVQLARVELFLFYASTLKKYE 552

Query: 59  FQIDPEYGL 67
           FQIDPEYGL
Sbjct: 553 FQIDPEYGL 579
Note that this hit suggests that I have missed two amino acids YA in the original sequence. This appears in two of the three hits and the last hit does not cover the region. Assume that these are correct and revise sequence 69 to include the YA. Note that the bottom sequence ends at 579 bp but the length of this accession is 931 bp. This means the end of the gene is on this fragment. Click on the hyperlinked accession number LQW231417.x1 to retrieve the DNA sequence. Translate it in the three forward frames using the Quick Protein Translator. (The blast output was in frame +1, so you are on the forward or plus strand). The protein translator will give you the three frames of translation and one of them looks like this:


KTINNSL*RSRGCVFL*IFQFSLFTTKWEKKE*KGVPTSPSTLLCIYSEC*IFVIYFPIGCAEPDMSHHESMPYLRAFIQEV
HRFQTIAPLNIPHCVTEDCVLFGYHIPKSTPVMSNIWRVHNDPKYWENPEKFSPERHLDSEGRFVPSNRVLSFAVGHRSCLG
VQLARVELFLFYASTLKKYEFQIDPEYGLPDWSNDRSGTVKTPKKFSVLLKSR*
RTRKNVYKELYSAYQRF*SSLQTQNAFTIFTTLSSWCSFIQIAQIVE*LCCQGEVLP
VF*IVI*HYFNVSYLWE*SGLPYTTA*ELQHGVYK


The sequence you searched with is shown here in red. Note that the sequence continues on 24 more amino acids to a stop codon. This is the probable C-terminal of this gene. Below we show the translation in three frames of the 5 prime end of this exon (N-terminal end is the 5-prime end). Now we have to find the intron-exon boundary of this exon.
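
Counting from the end of the match to the stop codon, and converting back to nucleotide positions, can be done with a few string operations. This sketch uses the frame +1 translation shown above (its first three lines); the variable names are labels chosen for this example, and the nucleotide arithmetic assumes the translation starts at nucleotide 1.

```python
# Frame +1 translation of LQW231417.x1, first three lines as shown above.
frame1 = (
    "KTINNSL*RSRGCVFL*IFQFSLFTTKWEKKE*KGVPTSPSTLLCIYSEC*IFVIYFPIGCAEPDMSHHESMPYLRAFIQEV"
    "HRFQTIAPLNIPHCVTEDCVLFGYHIPKSTPVMSNIWRVHNDPKYWENPEKFSPERHLDSEGRFVPSNRVLSFAVGHRSCLG"
    "VQLARVELFLFYASTLKKYEFQIDPEYGLPDWSNDRSGTVKTPKKFSVLLKSR*"
)

match_start_peptide = "PKYWENPEKF"   # first residues of the blast query
match_end_peptide = "FQIDPEYGL"      # last residues of the blast query

start_index = frame1.index(match_start_peptide)
end_index = frame1.index(match_end_peptide) + len(match_end_peptide)
stop_index = frame1.index("*", end_index)

tail = frame1[end_index:stop_index]     # residues between the match and the stop

first_nt_of_match = 3 * start_index + 1
last_nt_of_match = 3 * end_index
last_nt_of_stop = 3 * (stop_index + 1)

print("match covers nt", first_nt_of_match, "to", last_nt_of_match)
print("extra residues before the stop:", len(tail), tail)
print("stop codon ends at nt", last_nt_of_stop)
```

The computed start and end (373 and 579) agree with the Sbjct numbers in the blast hit, and the stop codon ends at nucleotide 654, well inside the 931 bp read.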



  1 - AAAACAATTAACAACAGTCTATAACGGTCGCGTGGGTGCGTTTTTTTATAAATCTTTCAA - 60 
    - K  T  I  N  N  S  L  *  R  S  R  G  C  V  F  L  *  I  F  Q   
    -  K  Q  L  T  T  V  Y  N  G  R  V  G  A  F  F  Y  K  S  F  N   
    -   N  N  *  Q  Q  S  I  T  V  A  W  V  R  F  F  I  N  L  S  I   
 61 - TTTTCTTTGTTTACTACCAAATGGGAAAAAAAAGAATAAAAAGGTGTCCCAACTTCCCCC - 120 
    - F  S  L  F  T  T  K  W  E  K  K  E  *  K  G  V  P  T  S  P   
    -  F  L  C  L  L  P  N  G  K  K  K  N  K  K  V  S  Q  L  P  P   
    -   F  F  V  Y  Y  Q  M  G  K  K  R  I  K  R  C  P  N  F  P  Q   
121 - AGCACTCTACTATGTATTTATTCAGAATGTTGAATCTTTGTAATATATTTTCCTATAGGA - 180 
    - S  T  L  L  C  I  Y  S  E  C  *  I  F  V  I  Y  F  P  I  G   
    -  A  L  Y  Y  V  F  I  Q  N  V  E  S  L  *  Y  I  F  L  *  D   
    -   H  S  T  M  Y  L  F  R  M  L  N  L  C  N  I  F  S  Y  R  M   
181 - TGCGCTGAACCAGATATGTCCCACCATGAGAGCATGCCTTACCTGCGTGCTTTCATACAA - 240 
    - C  A  E  P  D  M  S  H  H  E  S  M  P  Y  L  R  A  F  I  Q   
    -  A  L  N  Q  I  C  P  T  M  R  A  C  L  T  C  V  L  S  Y  K   
    -   R  *  T  R  Y  V  P  P  *  E  H  A  L  P  A  C  F  H  T  R   
241 - GAAGTGCATAGATTCCAAACCATAGCTCCGTTGAATATTCCCCACTGCGTCACTGAAGAC - 300 
    - E  V  H  R  F  Q  T  I  A  P  L  N  I  P  H  C  V  T  E  D   
    -  K  C  I  D  S  K  P  *  L  R  *  I  F  P  T  A  S  L  K  T   
    -   S  A  *  I  P  N  H  S  S  V  E  Y  S  P  L  R  H  *  R  L

To read this translation, note that the DNA sequence is above the three translations. The first line of the three translations is the +1 frame, since the codons start with nucleotide 1 (AAA = K, ACA = T, etc.). The second line is the +2 frame. The codons start with nucleotide 2 (AAA = K, CAA = Q). The third line is the +3 frame. The codons start with nucleotide 3 (AAC = N, AAT = N). For a complete table of codon translations see the genetic code. To find an amino acid sequence in these translations it is necessary to place two spaces between each amino acid, as in E  Y  S  P. The program cannot find the sequence without the spaces included. Also, it is only necessary to search for blocks of three amino acids. Longer sequences may wrap at the ends of lines, and the program cannot find them if they wrap to the next line. If you cannot find a sequence that you know is present, try moving over by three amino acids. Your sequence may be at the end of a line.
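The three-frame layout described above can be reproduced with a few lines of Python. This is only a minimal sketch, not the Quick Protein Translator itself: the names `translate` and `three_frames` are made up for illustration, and the codon table is the standard genetic code written in the conventional TCAG order.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering
# (first base varies slowest, third base fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, offset=0):
    """Translate dna starting at a 0-based offset; unknown codons become X."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frames(dna):
    """Return the +1, +2 and +3 frame translations as a list of strings."""
    return [translate(dna, offset) for offset in range(3)]


if __name__ == "__main__":
    # First 27 nucleotides of the read shown in the translation above.
    start = "AAAACAATTAACAACAGTCTATAACGG"
    for number, protein in enumerate(three_frames(start), start=1):
        print(f"+{number}", protein)
```

The three strings printed correspond to the first amino acids of the three translation lines under nucleotides 1 to 60.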

Step 4. Finding the intron-exon boundary.

The EXXR motif is present in this exon, so the exon must extend upstream of this motif. The best way to find the boundary is to search for similar gene exons using the Ciona savignyi servers or the JGI server. Let us try the savignyi server. If your search stops at the line

Searching....10....20....30....40....50....60....70....80....90....100% done

Just hit the reload button.



IFVIYFPIGCAEPDMSHHESMPYLRAFIQEV
HRFQTIAPLNIPHCVTEDCVLFGYHIPKSTPVMSNIWRVHND
PKYWENPEKFSPERHLDSEGRFVPSNRVLSFAVGHRSCLGVQLARVELFLFYASTLKKYEFQIDPEYGL
PDWSNDRSGTVKTPKKFSVLLKSR 
Use the whole open reading frame that contains the EXXR motif, the PKG motif, the WXXP motif, the PERF motif and the heme signature. TBLASTN search the savignyi data. The default is file 00, the first of the 44 data files for the savignyi reads. Try this one first and see the results. The best match is only 45% identical. This is not very good, so repeat the search using file 1, and keep going through the files if needed. We can expect an ortholog of the gene in savignyi that will be about 70% identical. This will show the location of the exon boundary.

>scf/ciona01/G126/seq_dir/hrs/G126P64932F.T0/G126P64932FF8.T0.seq    741      0
            741  ABI
            Length = 741

  Plus Strand HSPs:


 Score = 472 (166.2 bits), Expect = 8.8e-52, Sum P(3) = 8.8e-52
 Identities = 83/116 (71%), Positives = 96/116 (82%), Frame = +2

Query:     5 YFPIGCAEPDMSHHESMPYLRAFIQEVHRFQTIAPLNIPHCVTEDCVLFGYHIPKSTPVM 64
             Y  IG A P+MSH E MPYLRAFIQEVHRFQTIA LNIPHCVTEDC+L+GY IPK TPVM
Sbjct:   323 YCYIGSAVPNMSHQEKMPYLRAFIQEVHRFQTIAVLNIPHCVTEDCILYGYRIPKGTPVM 502

Query:    65 SNIWRVHNDPKYWENPEKFSPERHLDSEGRFVPSNRVLSFAVGHRSCLGVQLARVE 120
             SNIW VHNDP +W+ P+KF PERHLD +G+F+PSNRV+ F+ GHRSC G  +A+ E
Sbjct:   503 SNIWFVHNDPAHWQEPQKFRPERHLDGDGKFIPSNRVIPFSGGHRSCPGRAVAKGE 670

 Score = 53 (18.7 bits), Expect = 8.8e-52, Sum P(3) = 8.8e-52
 Identities = 11/15 (73%), Positives = 12/15 (80%), Frame = +3

Query:   112 LGVQLARVELFLFYA 126
             LGVQ  R +LFLFYA
Sbjct:   645 LGVQWQRAKLFLFYA 689

 Score = 43 (15.1 bits), Expect = 8.8e-52, Sum P(3) = 8.8e-52
 Identities = 6/13 (46%), Positives = 9/13 (69%), Frame = +1

Query:   132 YEFQIDPEYGLPD 144
             +EF +DP +G  D
Sbjct:   703 HEFSVDPNFGFTD 741


The search with file 1 gave 48% identity, but the search with file 2 gave 71% identity. This is probably the ortholog sequence. Notice that the alignment extends back to amino acid 5 but is very strong from MSH on, at amino acid 15. The intron boundary will probably fall between amino acid 5 and amino acid 15. Remember, it cannot go beyond amino acid 1, since there is a stop codon before that. Look at the three-frame translation above. The line starting with 121 and going to 180 covers the beginning of this sequence. Look for AG pairs in the DNA. These are the candidates for the intron-exon boundary. The frame we are interested in is frame +1. There is one AG above the G in P I G. In the next line of DNA, 181-240, there is another AG above the D in C A E P D. These are both phase 1 boundaries, since they break the codons for G (GGA) and D (GAT) after the first nucleotide of the codon. The exon boundary will be at one of these two AGs.
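The AG scan can also be done mechanically. The sketch below is an illustration under stated assumptions, not part of any server used in this module: the function name `acceptor_candidates` is hypothetical, positions are 0-based, and the phase is taken to be the number of nucleotides of the split codon that must come from the upstream exon.

```python
def acceptor_candidates(dna, frame_offset=0):
    """List (index_of_A, phase) for every AG dinucleotide in dna.

    frame_offset is the 0-based index of the first nucleotide of any codon
    in the reading frame of interest. The exon is assumed to begin right
    after the AG, and phase counts the codon nucleotides that the upstream
    exon has to supply (0, 1 or 2).
    """
    dna = dna.upper()
    hits = []
    i = dna.find("AG")
    while i != -1:
        exon_start = i + 2  # first nucleotide after the AG
        phase = (exon_start - frame_offset) % 3
        hits.append((i, phase))
        i = dna.find("AG", i + 1)
    return hits


if __name__ == "__main__":
    # Nucleotides 121-180 of the read; frame +1 has a codon starting at 121,
    # so the offset within this segment is 0.
    segment = "AGCACTCTACTATGTATTTATTCAGAATGTTGAATCTTTGTAATATATTTTCCTATAGGA"
    for index, phase in acceptor_candidates(segment):
        print(f"AG at nucleotide {121 + index}: phase {phase}")
```

On this segment the scan reports three AGs. The last one, at nucleotide 177, is the phase 1 candidate above the G in P I G. The one at nucleotide 144 is also phase 1, but it lies upstream of the in-frame stop codon and so cannot be part of this exon.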

The other end of this alignment falls apart after 112. There are a couple of frameshifts, then the sequence ends at 741, before the end of the exon. It would be good to get another hit to the end of this gene to verify that the end of the exon is correct as we predicted it. Go back and do some additional blasts of the savignyi files until you find a better match at the end.

scf/ciona01/G126/seq_dir/hrs/G126P68275R.T0/G126P68275RC12.T0.seq    663
            0    663  ABI
            Length = 663

  Minus Strand HSPs:

 Score = 339 (119.3 bits), Expect = 1.7e-30, P = 1.7e-30
 Identities = 58/86 (67%), Positives = 72/86 (83%), Frame = -3

Query:    81 EKFSPERHLDSEGRFVPSNRVLSFAVGHRSCLGVQLARVELFLFYASTLKKYEFQIDPEY 140
             + F PERHLD +G+F+PSNRV+ F+VGHRSCLGVQLA+ ELFLFYA  LK +EFQ+DP +
Sbjct:   661 QXFRPERHLDGDGKFIPSNRVIPFSVGHRSCLGVQLAKAELFLFYAGVLKHHEFQVDPNF 482

Query:   141 GLPDWSNDRSGTVKTPKKFSVLLKSR 166
             GLPDWS D  GT+KTPK+F+V +K R
Sbjct:   481 GLPDWSRDDGGTLKTPKEFTVCIKER 404


This hit from datafile 6 has a 67% identity that goes all the way to the ends of the predicted exon. This verifies the translation is correct.
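If you want to check an identity figure by hand, it is just the number of matching positions divided by the alignment length. The sketch below assumes the two aligned lines have been copied out with equal length; the name `percent_identity` is made up for illustration. The whole-number percentages in the reports above look truncated rather than rounded (83/116 is reported as 71%), so floor division is used here; that is an inference from the output, not documented behavior.

```python
def percent_identity(query, subject):
    """Return (matches, length, percent) for two equal-length aligned strings.

    Gap characters (-) never count as matches. The percentage is truncated
    to a whole number, which appears to agree with the BLAST reports shown
    in this module.
    """
    if len(query) != len(subject):
        raise ValueError("aligned strings must have the same length")
    matches = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    return matches, len(query), 100 * matches // len(query)


if __name__ == "__main__":
    # Third HSP of the file 2 hit: Identities = 6/13 (46%)
    print(percent_identity("YEFQIDPEYGLPD", "HEFSVDPNFGFTD"))
```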

To identify the exact intron boundary, you need to find the next exon upstream. This should contain the I-helix motif. Blast search the JGI server with just the first part of the exon,

YFPIGCAEPDMSHHESMPYLRAFIQEVHRFQTIAPLNIPHCVTEDCVLFGYHIPKSTPVM

This will give you some hits that you can translate and check for the I-helix motif. (use TBLASTN, PAM30, no filter, expect = 1)



LQW265377.y2 
LQW265377.y2.phd.1 LQW265377.x1 

13:00:22 2001 TEMPLATE: LQW265377 DIRECTION: rev
          Length = 825

 Score =  406 bits (948), Expect(3) = e-137
 Identities = 121/124 (97%), Positives = 121/124 (97%)
 Frame = -3

Query: 5   YFPIGCAEPDMSHHESMPYLRAFIQEVHRFQTIAPLNIPHCVTEDCVLFGYHIPKSTPVM 64
           Y   GCAEPDMSHHESMPYLRAFIQEVHRFQTIAPLNIPHCVTEDCVLFGYHIPKSTPVM
Sbjct: 481 YISLGCAEPDMSHHESMPYLRAFIQEVHRFQTIAPLNIPHCVTEDCVLFGYHIPKSTPVM 302

Query: 65  SNIWRVHNDPKYWENPEKFSPERHLDSEGRFVPSNRVLSFAVGHRSCLGVQLARVELFLF 124
           SNIWRVHNDPKYWENPEKFSPERHLDSEGRFVPSNRVLSFAVGHRSCLGVQLARVELFLF
Sbjct: 301 SNIWRVHNDPKYWENPEKFSPERHLDSEGRFVPSNRVLSFAVGHRSCLGVQLARVELFLF 122

Query: 125 YAST 128
           YAST
Sbjct: 121 YAST 110

This sequence goes 344 bp upstream (825 - 481). Translate it on the minus strand (six-frame translation in the Quick Protein Translator, right three sequence boxes). Look for the I-helix motif.
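For a minus-strand hit, the gene reads along the reverse complement of the read, and the room upstream is the distance from the highest Sbjct coordinate to the end of the read. A minimal sketch of both steps follows; the function names are hypothetical, and characters other than A, C, G, T and N (such as the X masking in some reads) are simply left unchanged.

```python
# Translation table for complementing DNA; other characters pass through.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def upstream_room(read_length, sbjct_start):
    """Nucleotides of a minus-strand read that lie 5' of the aligned segment.

    On a minus-strand hit the alignment begins at the highest Sbjct
    coordinate, so everything from there to the end of the read is
    upstream of the gene segment.
    """
    return read_length - sbjct_start


if __name__ == "__main__":
    print(reverse_complement("AAAACAATT"))
    print(upstream_room(825, 481))  # the 825 - 481 calculation in the text
```

Translating the reverse complement in its three forward frames gives the three minus-strand frames.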

Frame -1
LLVMACMYVGEMGHVSNF*SSPLLVIRNKFI*SSFTALKPVLLLKLN*DI*CVKGITILYSQ*YKGNTRDFFSTSKHPNTLF
TTKWEKKKKRIKRCPNFPQHSTMYLFRMLNLCNIFP*DALNQICPTMRACLTCVLSYKKCIDSKP*LR*IFPTASLKTAFFL
VTTYRSRHR**AIYGESTTTRNIGKIRRNFPPNVT*IQKVDSFPLIVFCHSQSGIGAV*AYSWRESSYFYFTHLPLKKYEFQ
TDPEYGLPDWSNDRSGLLNTKKFSVLLKS
Frame -2
YW*WLVCMLGKWATFPIFDHRLC***ETNLYKVASRL*NQFCC*N*IEIYDVLKA*PYYTLSNIRGTRETSLVHLNIQTPCL
LPNGKRKKKE*KGVPTSPSTLLCIYSEC*IFVIYFPRMR*TRYVPP*EHALPACFHTRSA*IPNHSSVEYSPLRH*RLRSFW
LPHTEVDTGDEQYMESPQRPEILGKSGEIFPRTSLRFRR*IRSL*SCSVIRSRA*ELFRRTVGESRVISILRIYP*RNTSFK
LIQSMDCPTGAMIVRDC*TPRSFLCCSS
Frame -3
IGNGLYVCWGNGPRFQFLIIAFVSNKKQIYIK*LHGSKTSFVVKTKLRYMMC*RHNHIILSVI*GEHARLL*YI*TSKHLVY
YQMGKEKKKNKKVSQLPPALYYVFIQNVESL*YISLGCAEPDMSHHESMPYLRAFIQEVHRFQTIAPLNIPHCVTEDCVLFG
YHIPKSTPVMSNIWRVHNDPKYWENPEKFSPERHLDSEGRFVPSNRVLSFAVGHRSCLGVQLARVELFLFYASTPEEIRVSN
*SRVWIARLEQ*SFGTVKHQEVFCAAQV
Above are the 3 frames of translation on the minus strand. The exon we searched with is in the third frame, line 2. No I-helix motif is visible so walk the chromosome upstream by blasting JGI with the top line of the translation (use frame -3).

DEV55412.x1 

CHEM: term DYE: ET TIME: Fri May  4 12:20:52 2001
           TEMPLATE: DEV55412 DIRECTION: fwd
          Length = 447

 Score =  104 bits (238), Expect = 4e-22
 Identities = 51/83 (61%), Positives = 52/83 (62%), Gaps = 9/83 (10%)
 Frame = +3

Query: 3   NGLYVCWGNGPRFQFLIIAFV-------SNKKQIYI-K*LHGSKT-SFVVKTKLRYMMC* 53
           NGLYVCWG      F I  F         NK   +I  *LHGSKT     K KLRY MC*
Sbjct: 219 NGLYVCWGK--WSTFSI--FYHRLW**KRNK---FI*M*LHGSKTQCWLLKLKLRYIMC* 377

Query: 54  RHNHIILSVI*GEHARLL*YI*T 76
           RHNHIILSVI*GEHARLL YI*T
Sbjct: 378 RHNHIILSVI*GEHARLLXYI*T 446

There is no exact match to the whole sequence. Above is a good match with exact matches at both ends, but not in the middle. Remember that this is in the intron region, and both sequences may contain errors, so the middle may be frameshifted. The two poorly matching segments differ in length between the query and the subject, which suggests possible frameshifts. Assume this is the same gene and get the sequence from the translation. Look on the plus strand for the I-helix motif (upstream of nucleotide 219).

TFVFLQELQLLHLVRDLFVGAIDTTTATLGWGIICLLHYPECQVRIQEEIDDVIG*YVFILLSFNLTCKIFVNGLYVCW
GKWSTFSIFYHRLW**KRNKFI*M*LHGSKTQCWLLKLKLRYIMC*RHNHIILSVI*GEHARLLYTSKH
The I-helix motif is seen in the third frame. Get the three frame translation and look for the intron exon boundary.

 1  - ATACATTTGTTTTTTTGCAGGAGCTGCAGTTGTTGCATTTAGTACGAGACTTATTTGTCG - 60 
    - I  H  L  F  F  C  R  S  C  S  C  C  I  *  Y  E  T  Y  L  S   
    -  Y  I  C  F  F  A  G  A  A  V  V  A  F  S  T  R  L  I  C  R   
    -   T  F  V  F  L  Q  E  L  Q  L  L  H  L  V  R  D  L  F  V  G   
 61 - GAGCTATTGACACGACAACAGCTACGTTAGGATGGGGAATCATATGTTTACTACATTACC - 120 
    - E  L  L  T  R  Q  Q  L  R  *  D  G  E  S  Y  V  Y  Y  I  T   
    -  S  Y  *  H  D  N  S  Y  V  R  M  G  N  H  M  F  T  T  L  P   
    -   A  I  D  T  T  T  A  T  L  G  W  G  I  I  C  L  L  H  Y  P   
121 - CGGAATGCCAAGTTAGAATACAGGAAGAAATAGACGATGTTATCGGTTAGTATGTTTTTA - 180 
    - R  N  A  K  L  E  Y  R  K  K  *  T  M  L  S  V  S  M  F  L   
    -  G  M  P  S  *  N  T  G  R  N  R  R  C  Y  R  L  V  C  F  Y   
    -   E  C  Q  V  R  I  Q  E  E  I  D  D  V  I  G  *  Y  V  F  I   

Blast the open reading frame with the I-helix motif

TFVFLQELQLLHLVRDLFVGAIDTTTATLGWGIICLLHYPECQVRIQEEIDDVIG

against the savignyi data to help locate the boundary. Search until you get about 70% identity.


>scf/ciona01/G126/seq_dir/hrs/G126P66970F.T0/G126P66970FG1.T0.seq    685      0
            685  ABI
            Length = 685

  Minus Strand HSPs:

 Score = 211 (74.3 bits), Expect = 5.9e-17, P = 5.9e-17
 Identities = 37/51 (72%), Positives = 45/51 (88%), Frame = -3

Query:     5 LQELQLLHLVRDLFVGAIDTTTATLGWGIICLLHYPECQVRIQEEIDDVIG 55
             LQELQL HL+RDLFVG IDTTTA LGWGI+CLL++PECQ +I +EI+ +IG
Sbjct:   350 LQELQLCHLIRDLFVGGIDTTTAALGWGIVCLLNFPECQDKIHQEIEQIIG 198

Datafile 2 gives the match above at 72% identity all the way to the VIG. The exon boundary should be at the G in the VIG. Look at the three-frame translation above. In the last line find V I G. The glycine codon is GGT. The GT is the probable exon boundary. This is phase 1. Remember that the two candidate joints for the next exon were both phase 1. This rules out the GT that spans the stop codon (TAGT) and the GT in the valine of VIG (phase 0). The unanswered question about this exon joint is which of the two candidates downstream is the correct junction. The best way to answer this is to compare against other known P450s to see which gives the right length, because length is usually preserved. Try blast searching the two possibilities against the Ciona contigs, because some of them, like seq 2 and seq 36, will span this exon joint.
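The GT search at the 3' end of an exon mirrors the AG search at the 5' end. The sketch below is illustrative only: `donor_candidates` is a made-up name, positions are 0-based, and the phase is taken to be the number of nucleotides of the split codon that remain in the upstream exon.

```python
def donor_candidates(dna, frame_offset=0):
    """List (index_of_G, phase) for every GT dinucleotide in dna.

    frame_offset is the 0-based index of the first nucleotide of any codon
    in the reading frame of interest. The exon is assumed to end just
    before the GT, and phase counts the codon nucleotides left in the
    upstream exon (0, 1 or 2).
    """
    dna = dna.upper()
    hits = []
    i = dna.find("GT")
    while i != -1:
        phase = (i - frame_offset) % 3
        hits.append((i, phase))
        i = dna.find("GT", i + 1)
    return hits


if __name__ == "__main__":
    # Nucleotides 121-180 of the I-helix read; frame +3 has a codon
    # starting at nucleotide 123, which is offset 2 in this segment.
    segment = "CGGAATGCCAAGTTAGAATACAGGAAGAAATAGACGATGTTATCGGTTAGTATGTTTTTA"
    for index, phase in donor_candidates(segment, frame_offset=2):
        print(f"GT at nucleotide {121 + index}: phase {phase}")
```

Only the GT at nucleotide 166, inside the glycine codon GGT, is phase 1, so it is the only one compatible with the two phase 1 candidates found for the next exon.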

LQELQLLHLVRDLFVGAIDTTTATLGWGIICLLHYPECQVRIQEEIDDVI (1) 
GCAEPDMSHHESMPYLRAFIQEVHRFQTIAPLNIPHCVTEDCVLFGYHIPKSTPVMSNIWRVHNDPKYWENPEKFSPERHLD
SEGRFVPSNRVLSFAVGHRSCLGVQLARVELFLFYASTLKKYEFQIDPEYGLPDWSNDRSGTVKTPKKFSVLLKSR
Searches of the 201 contigs show that the I-helix exon matches sequence 14 in the set of 201 contigs. The joint we are interested in is one amino acid short if we start from GCAEPD. It would be 6 amino acids short if we started from DMSHH. The first choice seems best. To confirm, search the savignyi data.

scf/ciona01/G126/seq_dir/hrs/G126P64298F.T0/G126P64298FG4.T0.seq    749      0
            749  ABI
            Length = 749

  Minus Strand HSPs:

 Score = 185 (65.1 bits), Expect = 4.4e-32, Sum P(2) = 4.4e-32
 Identities = 31/44 (70%), Positives = 39/44 (88%), Frame = -2

Query:     8 HLVRDLFVGAIDTTTATLGWGIICLLHYPECQVRIQEEIDDVIG 51
             HL+RDLFVG IDTTTA LGWGI+CLL++PECQ +I +EI+ +IG
Sbjct:   664 HLIRDLFVGGIDTTTAALGWGIVCLLNFPECQDKIHQEIEQIIG 533

 Score = 184 (64.8 bits), Expect = 4.4e-32, Sum P(2) = 4.4e-32
 Identities = 35/41 (85%), Positives = 36/41 (87%), Frame = -2

Query:    50 IGCAEPDMSHHESMPYLRAFIQEVHRFQTIAPLNIPHCVTE 90
             IG A P+MSH E MPYLRAFIQEVHRFQTIA LNIPHCVTE
Sbjct:   340 IGSAVPNMSHQEKMPYLRAFIQEVHRFQTIAVLNIPHCVTE 218

This hit suggests that the joint at the glycine codon is the most likely.
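To see what each candidate joint does to the protein, the exon ends can be pasted together and translated across the junction. This is a hedged sketch: the DNA fragments are copied from the three-frame translations earlier in this section, the names `translate` and `splice` are invented for illustration, and the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate dna from its first nucleotide; unknown codons become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def splice(upstream_exon_end, downstream_exon_start):
    """Join two exon fragments with the intron removed and translate."""
    return translate(upstream_exon_end + downstream_exon_start)


# End of the I-helix exon: codons for D D V I plus the first G of GGT
# (phase 1, the exon stops just before the GT).
UPSTREAM = "GACGATGTTATCG"
# Candidate 1: exon begins after the AG above the G in P I G.
CANDIDATE_1 = "GATGCGCTGAACCAGATATG"
# Candidate 2: exon begins after the AG above the D in C A E P D.
CANDIDATE_2 = "ATATGTCCCACCATGAG"

if __name__ == "__main__":
    print(splice(UPSTREAM, CANDIDATE_1))
    print(splice(UPSTREAM, CANDIDATE_2))
```

Both joints stay in frame, which is why the phase alone cannot decide between them and a length comparison against other sequences is needed. The first joint restores a glycine codon (G + GA) and keeps GCAEP, while the second makes an aspartate (G + AT) and drops those five residues.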

>scf/ciona01/G126/seq_dir/hrs/G126P66970F.T0/G126P66970FG1.T0.seq    685      0
            685  ABI
            Length = 685

  Minus Strand HSPs:

 Score = 211 (74.3 bits), Expect = 5.9e-17, P = 5.9e-17
 Identities = 37/51 (72%), Positives = 45/51 (88%), Frame = -3

Query:     5 LQELQLLHLVRDLFVGAIDTTTATLGWGIICLLHYPECQVRIQEEIDDVIG 55
             LQELQL HL+RDLFVG IDTTTA LGWGI+CLL++PECQ +I +EI+ +IG
Sbjct:   350 LQELQLCHLIRDLFVGGIDTTTAALGWGIVCLLNFPECQDKIHQEIEQIIG 198

This search from above of the I-helix exon shows that the sequence alignment stops at LQEL. The three frame translation of this region has a CAG glutamine codon for Q of LQEL. This is probably a phase 0 intron boundary.

 1  - ATACATTTGTTTTTTTGCAGGAGCTGCAGTTGTTGCATTTAGTACGAGACTTATTTGTCG - 60 
    - I  H  L  F  F  C  R  S  C  S  C  C  I  *  Y  E  T  Y  L  S   
    -  Y  I  C  F  F  A  G  A  A  V  V  A  F  S  T  R  L  I  C  R   
    -   T  F  V  F  L  Q  E  L  Q  L  L  H  L  V  R  D  L  F  V  G   

The sequence we have built so far is


TLIQALLKQGESFSGRPQSYLMNQLTEGCGIVFSTGPRWQAQRRFVLTALKTY (C-helix)

(0) ELQLLHLVRDLFVGAIDTTTATLGWGIICLLHYPECQVRIQEEIDDVI (1) 
GCAEPDMSHHESMPYLRAFIQEVHRFQTIAPLNIPHCVTEDCVLFGYHIPKSTPV
MSNIWRVHNDPKYWENPEKFSPERHLDSEGRFVPSNRVLSFAVGHRSCLGVQLAR
VELFLFYASTLKKYEFQIDPEYGLPDWSNDRSGTVKTPKKFSVLLKSR*

To find the N-terminal blast the C-helix exon against the JGI server.

LQW269127.y1 
LQW269127.y1.phd.1 LQW269127.x1 

10:52:57 2001 TEMPLATE: LQW269127 DIRECTION: rev
          Length = 988

 Score =  173 bits (401), Expect = 3e-43
 Identities = 53/53 (100%), Positives = 53/53 (100%)
 Frame = +1

Query: 1   TLIQALLKQGESFSGRPQSYLMNQLTEGCGIVFSTGPRWQAQRRFVLTALKTY 53
           TLIQALLKQGESFSGRPQSYLMNQLTEGCGIVFSTGPRWQAQRRFVLTALKTY
Sbjct: 403 TLIQALLKQGESFSGRPQSYLMNQLTEGCGIVFSTGPRWQAQRRFVLTALKTY 561

>LQW269127.y1
CTAGAACTGAGCCTGGTCCCACAGGTTCTTTTGTGTGGAACAGTTGTAGT
ATTTACTCAGAACTTTATATTATTATATAACCGTAGTCCACGACTTTAAA
AGAACGTTTTTTTATCAATAAGGGTGTCTTAACTACTGTTGTTTTACCGT
TACAACCCGACTTTTCGCCTTTGTCAACCTCTTATTTTTCGTTATCGTGA
CCGAAGAGACAAAATAAATTTCATTCATTCAACGTTGTTGGTTGTTTAAA
ACCACGATCGGAAAATACGGGTTTCAGTGGTAACAGTATCCCATCTTTCC
CCCACTGTATTATATATAGGCAATTTCTGTTCGCTTGGCTGAATGCCGTG
CGGTTATTAACTGGCATTTAAATCCCAACGTTATCCCATCTTACCCCACA
GTACTTTAATACAGGCACTGTTAAAGCAAGGAGAGAGCTTCTCTGGGCGA
CCGCAGTCATACTTAATGAACCAATTGACTGAAGGATGCGGTATTGTGTT
TTCCACAGGACCACGATGGCAAGCACAGCGTCGATTTGTATTAACGGCAC
TGAAAACGTATGTATTCCAACTCGTATATAGTTTACCGTGCACTTTTTTT
TATATAGTAGGATGGGGGAAGGTGGGACACCTGTTCATTTTGCTCGTCTA
TTTATACCGTTCATGCCTATTATTCATTTCGGTAGTAAACAAGGAAACAA
TTCAAGGAACTATGAAACCTTGTCCTCACGACTTACATAGACCGTGGATA
TACTGATTGAAAACACGGACCAGGGATATCATATATTTTCTGCTAAACAT
GTCCGATCTACACGAACCGATATGTATTGCCATGTGTAGGCTAAGATGGA
GACAGAACATGACCGGATTGTTAACAAAAAACCAAAGTTGACACGAATAA
ACTCTGGAGAAGGAAAAATGGAAAACCATCGATTGGCAAAACCATAAAAA
GAGAGCAGCCACAAAAAGACTGAAACGCGCAAGTAAAC

Three frames of this sequence: does not have KYG or N-term proline rich motifs
LELSLVPQVLLCGTVVVFTQNFILLYNRSPRL*KNVFLSIRVS*LLLFYRYNPTFRLCQPLIFRYRDRRDKINFIHSTLLVV
*NHDRKIRVSVVTVSHLSPTVLYIGNFCSLG*MPCGY*LAFKSQRYPILPHSTLIQALLKQGESFSGRPQSYLMNQLTEGCG
IVFSTGPRWQAQRRFVLTALKTYVFQLVYSLPCTFFYIVGWGKVGHLFILLVYLYRSCLLFISVVNKETIQGTMKPCPHDLH
RPWIY*LKTRTRDIIYFLLNMSDLHEPICIAMCRLRWRQNMTGLLTKNQS*HE*TLEKEKWKTIDWQNHKKRAATKRLKRAS
K

*N*AWSHRFFCVEQL*YLLRTLYYYITVVHDFKRTFFYQ*GCLNYCCFTVTTRLFAFVNLLFFVIVTEETK*ISFIQRCWLF
KTTIGKYGFQW*QYPIFPPLYYI*AISVRLAECRAVINWHLNPNVIPSYPTVL*YRHC*SKERASLGDRSHT**TN*LKDAV
LCFPQDHDGKHSVDLY*RH*KRMYSNSYIVYRALFFI**DGGRWDTCSFCSSIYTVHAYYSFR**TRKQFKEL*NLVLTTYI
DRGYTD*KHGPGISYIFC*TCPIYTNRYVLPCVG*DGDRT*PDC*QKTKVDTNKLWRRKNGKPSIGKTIKREQPQKD*NAQV
N

RTEPGPTGSFVWNSCSIYSELYIII*P*STTLKERFFINKGVLTTVVLPLQPDFSPLSTSYFSLS*PKRQNKFHSFNVVGCL
KPRSENTGFSGNSIPSFPHCIIYRQFLFAWLNAVRLLTGI*IPTLSHLTPQYFNTGTVKARRELLWATAVILNEPID*RMRY
CVFHRTTMASTASICINGTENVCIPTRI*FTVHFFLYSRMGEGGTPVHFARLFIPFMPIIHFGSKQGNNSRNYETLSSRLT*
TVDILIENTDQGYHIFSAKHVRSTRTDMYCHV*AKMETEHDRIVNKKPKLTRINSGEGKMENHRLAKP*KESSHKKTETRK*


LQW265377.x1 
LQW265377.x1.phd.1 LQW265377.y1 

12:29:20 2001 TEMPLATE: LQW265377 DIRECTION: fwd
          Length = 908

 Score =  173 bits (401), Expect = 3e-43
 Identities = 53/53 (100%), Positives = 53/53 (100%)
 Frame = +1

Query: 1   TLIQALLKQGESFSGRPQSYLMNQLTEGCGIVFSTGPRWQAQRRFVLTALKTY 53
           TLIQALLKQGESFSGRPQSYLMNQLTEGCGIVFSTGPRWQAQRRFVLTALKTY
Sbjct: 118 TLIQALLKQGESFSGRPQSYLMNQLTEGCGIVFSTGPRWQAQRRFVLTALKTY 276

>LQW265377.x1
AACGGXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXC
CATCTTTCCCCCACTGTATTATATATAGGCAATTTCTGTTCGCTTGGCTG
AATGCCGTGCGGTTATTGACTGGCATGTAAATCCCAACGTTATCCCATCT
TACCCCACAGTACTTTAATACAGGCACTGTTAAAGCAAGGAGAGAGCTTC
TCTGGGCGACCGCAGTCATACTTAATGAACCAATTGACTGAAGGATGCGG
TATTGTGTTTTCCACAGGACCACGATGGCAAGCACAGCGTCGATTTGTAT
TAACGGCACTGAAAACGTATGTATTCCAACTCGTATATAGTTTACCGTGC
ATTTTTTTTTTATATAGTAGGATGGGGGAAGATGGGACACTTTTTCATTT
TTCTCGTCTATTTTTTTCTTTCATTCTATTATTCTCTTTGGTAGTAAACA
AAGAACATTCAAGGAACTATGAAACCTTGTCCTCGCGACTTTCATAGACC
GTTGTTACTTGTTTAAAACACGACCAGGATATTTATATATTTTCTGCTAA
AGATGTCCCATCTTCCCCCAGCCCACTATATATTTACCATTTTTTAGCCT
AGGAATGGGAACACGAAGCATGGTCGGAATTATAAACGAAGAAAACCAGA
ATTTCTCATCCGTAATACAATCTTCTGGAGGACAGGTTAACATTTTGGTA
AGCCATATTTCTCATTGGTACCGTCCTTTATTTTTACGCTATATGCATTT
TTACTACATCTGAATTCGGAACTCTGTCGATATGTAGCATCGTCTGGGAA
GCGTCACACAAGACGAAACGCGGCACGGGACAAGTATCACATTACATACC
GCCGTTCACGCAGGCCTATCTAAAGGGGGGCGCGGCCCACGCAACCGGCC
CCCCCCGCAACACGACCACCGAACCCAAAGTGCCCCCCGCAACAGCACGA
CC

More exact matches

LQW48810.x1   173   3e-43   LQW48810.y1
LQW48810.y1   152   1e-36   LQW48810.x1
DEV27943.y1   116   5e-26   DEV27943.x1   opposite end of seq 69


These two hits take the sequence 402 bp upstream and 908 - 276 = 632 bp downstream. Since the translation of the 402 bp upstream region did not show any motifs, I walked the chromosome by searching with the first line of the translations and got this sequence.


DEV1604.y1 

CHEM: term DYE: ET TIME: Wed Mar 21 15:22:56 2001
           TEMPLATE: DEV1604 DIRECTION: rev
          Length = 704

 Score =  225 bits (522), Expect = 2e-58
 Identities = 71/71 (100%), Positives = 71/71 (100%)
 Frame = -2

Query: 9   SFVWNSCSIYSELYIII*P*STTLKERFFINKGVLTTVVLPLQPDFSPLSTSYFSLS*PK 68
           SFVWNSCSIYSELYIII*P*STTLKERFFINKGVLTTVVLPLQPDFSPLSTSYFSLS*PK
Sbjct: 277 SFVWNSCSIYSELYIII*P*STTLKERFFINKGVLTTVVLPLQPDFSPLSTSYFSLS*PK 98

Query: 69  RQNKFHSFNVV 79
           RQNKFHSFNVV
Sbjct: 97  RQNKFHSFNVV 65

Three translations minus strand
ERGSIA*EKGLCVTQQ*ILLHCRVTELYSL*KSLV**KCI*SARDNINITAS*PYAIDYAAQWLHFRVCISLYLLSVAMAAS
S*EFPTRPTRHSIVWHSAVRWC*YA*IPCHLLCEIRWRDVISTCN*RLDCSERH*SNNTGFFCVEQL*YLLRTLYYYITVVH
DFKRTFFYQ*GCLNYCCFTVTTRLFAFVNLLFFVIVTEETK*ISFIQRCWLFKTTIGKYGFSVVTVSHLS

REVL*REKKGFVLRNNKFYYIAV*LSSTVFESL*FDRNVFSLLVIILILQLVSRML*TMLPSGYIFVFVFLCIYYLLQWRRR
PKNFPPGPLGIPLFGIAPFAGVDMHKYLATYYAKYGGVMSFRLATKDWIVLNDIEAITQVSFVWNSCSIYSELYIII*P*ST
TLKERFFINKGVLTTVVLPLQPDFSPLSTSYFSLS*PKRQNKFHSFNVVGCLKPRSENTGFQW*QYPIYP

ERFYSVRKRALCYATINSITLPCN*ALQSLKVFSLIEMYLVCS**Y*YYS*LAVCYRLCCPVVTFSCLYFFVFTICCNGGVV
LRISHPAHSAFHCLA*RRSLVLICINTLPPIMRNTVA*CHFDLQLKTGLF*TTLKQ*HRFLLCGTVVVFTQNFILLYNRSPR
L*KNVFLSIRVS*LLLFYRYNPTFRLCQPLIFRYRDRRDKINFIHSTLLVV*NHDRKIRVFSGNSIPSIP

This takes the sequence 427 bases farther upstream (704 - 277). This should be far enough to find the next motif, and it is. The second translation has an open reading frame (ORF) that contains both the PPGP and the KYG motifs. The PPGP starts at amino acid 29, counting from a possible start methionine, so this is probably the intact N-terminal of the protein.

MLPSGYIFVFVFLCIYYLLQWRRRPKNFPPGPLGIPLFGIAPFAGVDMHKYLATYYAKYG
GVMSFRLATKDWIVLNDIEAITQVSFVWNSCSIYSELYIII
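Scanning a translated frame for an ORF that carries the expected motifs can be sketched as below. This is an illustration under assumptions: the function name `orfs_with_motifs` and the `min_length` cutoff are invented here, each ORF is trimmed to its first methionine (which is only a possible start, as noted above), and motifs are given as regular expressions so that a pattern such as EXXR could be written as `E..R`.

```python
import re


def orfs_with_motifs(translation, patterns, min_length=30):
    """Find stop-free stretches that contain every pattern.

    The translation is split on stop codons (*), each piece is trimmed to
    begin at its first M, and pieces shorter than min_length are skipped.
    Returns a list of (orf, {pattern: 1-based position}) tuples.
    """
    results = []
    for piece in translation.split("*"):
        start = piece.find("M")
        if start == -1:
            continue
        orf = piece[start:]
        if len(orf) < min_length:
            continue
        found = {}
        for pattern in patterns:
            match = re.search(pattern, orf)
            if match:
                found[pattern] = match.start() + 1
        if len(found) == len(patterns):
            results.append((orf, found))
    return results


if __name__ == "__main__":
    # Part of the second minus-strand translation of DEV1604.y1.
    frame = ("LVSRML*TMLPSGYIFVFVFLCIYYLLQWRRRPKNFPPGPLGIPLFGIAPFAGVDMHKYLATYYAKYG"
             "GVMSFRLATKDWIVLNDIEAITQVSFVWNSCSIYSELYIII*P*ST")
    for orf, positions in orfs_with_motifs(frame, ["PPGP", "KYG"]):
        print(positions, orf[:20] + "...")
```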


>scf/ciona01/G126/seq_dir/hrs/G126P67101F.T0/G126P67101FD8.T0.seq    692      0
            692  ABI
            Length = 692

  Plus Strand HSPs:

 Score = 205 (72.2 bits), Expect = 2.5e-16, P = 2.5e-16
 Identities = 42/90 (46%), Positives = 57/90 (63%), Frame = +2

Query:     4 SGYIFVFVFLCIYYLLQWRRRPKNFPPGPLGIPLFGIAPFAGVDMHKYLATYYAKYGGVM 63
             + ++F+ VFL +Y+   W RRPKN+PPGP GIP  G+ PF G    + +  +  KYG VM
Sbjct:   320 TSFVFLAVFLGLYH---WYRRPKNYPPGPRGIPFLGVLPFLGNYPERTMHKWSKKYGPVM 490

Query:    64 SFRLATKDWIVLNDIEAITQV--SF--VWNSCS 92
             S R+  +DW+VL D E I QV  SF   +  CS
Sbjct:   491 SVRMGRQDWVVLGDYETIQQVG*SFG*YYRPCS 589

The match above to savignyi shows a probable end of the exon at QV. This suggests that the GT of the V codon may be the boundary (phase 0). Translation shows this is the only GT nearby. Let's now look for a phase 0 boundary in the C-helix exon, which should join with this one.


LAFKSQRYPILPHSTLIQALLKQGESFSGRPQSYLMNQLTEGCGIVFSTGPRWQAQRRFVLTALKTYVFQLVYSLPCTFFYI
VGWGKVGHLFILLVYLYRSCLLFISVVNKETIQGTMKPCPHDLHRPWIY
The C-helix exon taken from a translation above is contained in this sequence. Blasts against savignyi show


>scf/ciona01/G126/seq_dir/hrs/G126P64664R.T0/G126P64664RC5.T0.seq    769      0
            769  ABI
            Length = 769

  Plus Strand HSPs:

 Score = 186 (65.5 bits), Expect = 2.3e-14, P = 2.3e-14
 Identities = 36/50 (72%), Positives = 42/50 (84%), Frame = +2

Query:    16 LIQALLKQGESFSGRPQSYLMNQLTEGCGIVFSTGPRWQAQRRFVLTALK 65
             L+ A  KQGESFSGRP+S + +QLT+GCGIVF+ G RWQ QRRFVLTALK
Sbjct:   341 LL*AFSKQGESFSGRPKSVVFDQLTQGCGIVFANGDRWQHQRRFVLTALK 490

This is a strong indicator that the exon starts after LIQ, and the Q is probably the intron boundary. The opposite end seems to end with TALK. Translation of this sequence shows that the Q in LIQ is a CAG, giving a phase 0 boundary. TALK is followed by ACGT, which would be a phase 2 boundary in the T codon, or the boundary could be phase 0 after TALKTY, which is followed by valine.


121 - TTAATACAGGCACTGTTAAAGCAAGGAGAGAGCTTCTCTGGGCGACCGCAGTCATACTTA - 180 
    - L  I  Q  A  L  L  K  Q  G  E  S  F  S  G  R  P  Q  S  Y  L   
    -  *  Y  R  H  C  *  S  K  E  R  A  S  L  G  D  R  S  H  T  *   
    -   N  T  G  T  V  K  A  R  R  E  L  L  W  A  T  A  V  I  L  N   
181 - ATGAACCAATTGACTGAAGGATGCGGTATTGTGTTTTCCACAGGACCACGATGGCAAGCA - 240 
    - M  N  Q  L  T  E  G  C  G  I  V  F  S  T  G  P  R  W  Q  A   
    -  *  T  N  *  L  K  D  A  V  L  C  F  P  Q  D  H  D  G  K  H   
    -   E  P  I  D  *  R  M  R  Y  C  V  F  H  R  T  T  M  A  S  T   
241 - CAGCGTCGATTTGTATTAACGGCACTGAAAACGTATGTATTCCAACTCGTATATAGTTTA - 300 
    - Q  R  R  F  V  L  T  A  L  K  T  Y  V  F  Q  L  V  Y  S  L   
    -  S  V  D  L  Y  *  R  H  *  K  R  M  Y  S  N  S  Y  I  V  Y   
    -   A  S  I  C  I  N  G  T  E  N  V  C  I  P  T  R  I  *  F  T   
The sequence we have built so far is

MLPSGYIFVFVFLCIYYLLQWRRRPKNFPPGPLGIPLFGIAPFAGVDMHKYLATYYAKYG
GVMSFRLATKDWIVLNDIEAITQ (0)
ALLKQGESFSGRPQSYLMNQLTEGCGIVFSTGPRWQAQRRFVLTALKTY (0) (C-helix)

(0) ELQLLHLVRDLFVGAIDTTTATLGWGIICLLHYPECQVRIQEEIDDVI (1) 
GCAEPDMSHHESMPYLRAFIQEVHRFQTIAPLNIPHCVTEDCVLFGYHIPKSTPV
MSNIWRVHNDPKYWENPEKFSPERHLDSEGRFVPSNRVLSFAVGHRSCLGVQLAR
VELFLFYASTLKKYEFQIDPEYGLPDWSNDRSGTVKTPKKFSVLLKSR*

Length is now about 340 amino acids. When complete it should be about 500, so we are missing about 160 more amino acids.

Translation of the C-helix region, to extend into the gap. I tried walking downstream and did not find any more matches.


NGHLSPTVLYIGNFCSLG*MPCGY*LACKSQRYPILPHSTLIQALLKQGESFSGRPQSYLMNQLTEGCGIVFSTGPRWQAQR
RFVLTALKTYVFQLVYSLPCIFFLYSRMGEDGTLFHFSRLFFSFILLFSLVVNKEHSRNYETLSSRLS*TVVTCLKHDQDIY
IFSAKDVPSSPSPLYIYHFLA*EWEHEAWSEL*TKKTRISHP*YNLLEDRLTFW*AIFLIGTVLYFYAICIFTTSEFGTLSI
CSIVWEASHKTKRGTGQVSHYIPPFTQAYLKGGAAHATGPPRNTTTEPKVPPATAR

TAIFPPLYYI*AISVRLAECRAVIDWHVNPNVIPSYPTVL*YRHC*SKERASLGDRSHT**TN*LKDAVLCFPQDHDGKHSV
DLY*RH*KRMYSNSYIVYRAFFFYIVGWGKMGHFFIFLVYFFLSFYYSLW**TKNIQGTMKPCPRDFHRPLLLV*NTTRIFI
YFLLKMSHLPPAHYIFTIF*PRNGNTKHGRNYKRRKPEFLIRNTIFWRTG*HFGKPYFSLVPSFIFTLYAFLLHLNSELCRY
VASSGKRHTRRNAARDKYHITYRRSRRPI*RGARPTQPAPPATRPPNPKCPPQQHD

RPSFPHCIIYRQFLFAWLNAVRLLTGM*IPTLSHLTPQYFNTGTVKARRELLWATAVILNEPID*RMRYCVFHRTTMASTAS
ICINGTENVCIPTRI*FTVHFFFI**DGGRWDTFSFFSSIFFFHSIILFGSKQRTFKEL*NLVLATFIDRCYLFKTRPGYLY
IFC*RCPIFPQPTIYLPFFSLGMGTRSMVGIINEENQNFSSVIQSSGGQVNILVSHISHWYRPLFLRYMHFYYI*IRNSVDM
*HRLGSVTQDETRHGTSITLHTAVHAGLSKGGRGPRNRPPPQHDHRTQSAPRNSTT
Blast search of translation three showed the following:

scf/ciona01/G126/seq_dir/hrs/G126P65637R.T0/G126P65637RE9.T0.seq    730      0
            730  ABI
            Length = 730

  Plus Strand HSPs:

 Score = 120 (42.2 bits), Expect = 5.8e-07, P = 5.8e-07
 Identities = 26/52 (50%), Positives = 35/52 (67%), Frame = +3

Query:    14 FFSLGMGTRSMVGIINEENQNFSSVIQSSGGQVNILVSHIS-HWYRPLFLRY 64
             F+SLGMG R+M  IINEE   F + +Q +GG VNILV  +S  +Y  +F+ Y
Sbjct:    87 FYSLGMGKRTMDAIINEETNRFIASVQLAGGTVNILV*KVSLGYYDKIFMLY 242

The three frame translation of this region shows

541 - TACCATTTTTTAGCCTAGGAATGGGAACACGAAGCATGGTCGGAATTATAAACGAAGAAA - 600 
    - Y  H  F  L  A  *  E  W  E  H  E  A  W  S  E  L  *  T  K  K   
    -  T  I  F  *  P  R  N  G  N  T  K  H  G  R  N  Y  K  R  R  K   
    -   P  F  F  S  L  G  M  G  T  R  S  M  V  G  I  I  N  E  E  N   
601 - ACCAGAATTTCTCATCCGTAATACAATCTTCTGGAGGACAGGTTAACATTTTGGTAAGCC - 660 
    - T  R  I  S  H  P  *  Y  N  L  L  E  D  R  L  T  F  W  *  A   
    -  P  E  F  L  I  R  N  T  I  F  W  R  T  G  *  H  F  G  K  P   
    -   Q  N  F  S  S  V  I  Q  S  S  G  G  Q  V  N  I  L  V  S  H   

An AG above PFFS is phase 2 in the Ser codon, and this could match phase 2 in the T codon after TALK. The other end may be phase 0 after VNIL. The sequence we have built so far is

MLPSGYIFVFVFLCIYYLLQWRRRPKNFPPGPLGIPLFGIAPFAGVDMHKYLATYYAKYG
GVMSFRLATKDWIVLNDIEAITQ (0)
ALLKQGESFSGRPQSYLMNQLTEGCGIVFSTGPRWQAQRRFVLTALKT (2) (C-helix)
LGMGTRSMVGIINEENQNFSSVIQSSGGQVNIL (0)

(0) ELQLLHLVRDLFVGAIDTTTATLGWGIICLLHYPECQVRIQEEIDDVI (1) 
GCAEPDMSHHESMPYLRAFIQEVHRFQTIAPLNIPHCVTEDCVLFGYHIPKSTPV
MSNIWRVHNDPKYWENPEKFSPERHLDSEGRFVPSNRVLSFAVGHRSCLGVQLAR
VELFLFYASTLKKYEFQIDPEYGLPDWSNDRSGTVKTPKKFSVLLKSR*

The following sequence encodes the I helix on the minus strand

>LQW232493.x1
CCGGAAACGGCAGTGCAGCXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
TAATATGGTTATACCTTTACACATAATATATCTCAATTTTAGTTTTAACA
ACCAACACTGCGTTTTAGAACCGTGAAGCTACATTTATATAAATTTGTTT
CTTTTTTACTACCAAAGGCGATGATAAAAATTTGAAAACGTGGACCATTT
TCCCCAATATACATACAAACCATTAACGAATATTTTACAAGTGAGATTAA
ACGAAAGTAAAATAAAAACATACTAACCGATAACATCGTCTATTTCTTCC
TGTATTCTAACTTGGCATTCCGGGTAATGTAGTAAACATATGATTCCCCA
TCCTAACGTAGCTGTTGTCGTGTCAATAGCTCCGACAAATAAGTCTCGTA 
CTAAATGCAACAACTGCAGCTCCTGCAAAAAAACAAATGTATTCAGAGAT
ATGCTACAAAATGATTTTTTCAGTTAATAAATAAATATAAACAAACTAAA
TTTGTTTTTGTTTTTGTAGAAGATAAAGTTCATGGTAAATCAAGGTAAAC
GCCGGCGCGACTTCCGTAATATCTGGTCTGGTTTAACGTGTTTTGTTCTC
GTGGTCGTCGATTGAAAAATGTCGATCTCTGTAAAGTGTAGACTGACAAA
AGTTACCTCACGGCATAATCACGAGATCTTATCTATAATACAACACATAC
ACAATCAACATATACGCAACAGACTAGTTCAAGTATACATATTGCGGGCG
TTCAGCTATGTATACTGCAGGAGAACTTAGCTTTTTTCCCAAAACGTAAC
TTTAATATATAATTGCTAAGTTTCTGTAAAATGCAGTTAAGATAGTCTTA
TGCGTGCCGAATAGGACAAATAAAGATTCGGGTTACCCCTCGATAATTTG
GAATTAA

NSKLSRGNPNLYLSYSARIRLS*LHFTET*QLYIKVTFWEKS*VLLQYT*LNARNMYT*TSLLRIC*LCMCCIIDKIS*LCR
EVTFVSLHFTEIDIFQSTTTRTKHVKPDQILRKSRRRLP*FTMNFIFYKNKNKFSLFIFIY*LKKSFCSISLNTFVFLQELQ
LLHLVRDLFVGAIDTTTATLGWGIICLLHYPECQVRIQEEIDDVIG*YVFILLSFNLTCKIFVNGLYVYWGKWSTFSNFYHR
LW**KRNKFI*M*LHGSKTQCWLLKLKLRYIMCKGITILAALPFP

*FQIIEG*PESLFVLFGTHKTILTAFYRNLAIIY*SYVLGKKLSSPAVYIAERPQYVYLN*SVAYMLIVYVLYYR*DLVIMP
*GNFCQSTLYRDRHFSIDDHENKTR*TRPDITEVAPAFTLIYHELYLLQKQKQI*FVYIYLLTEKIIL*HISEYICFFAGAA
VVAFSTRLICRSY*HDNSYVRMGNHMFTTLPGMPS*NTGRNRRCYRLVCFYFTFV*SHL*NIR*WFVCILGKMVHVFKFLSS
PLVVKKKQIYINVASRF*NAVLVVKTKIEIYYV*RYNHISCTAVS

LIPNYRGVTRIFICPIRHA*DYLNCILQKLSNYILKLRFGKKAKFSCSIHS*TPAICILELVCCVYVDCVCVVL*IRSRDYA
VR*LLSVYTLQRSTFFNRRPREQNTLNQTRYYGSRAGVYLDLP*TLSSTKTKTNLVCLYLFIN*KNHFVAYL*IHLFFCRSC
SCCI*YETYLSELLTRQQLR*DGESYVYYITRNAKLEYRKK*TMLSVSMFLFYFRLISLVKYSLMVCMYIGENGPRFQIFII
AFGSKKETNLYKCSFTVLKRSVGC*N*N*DILCVKV*PY*LHCRFR
Walking upstream on an adjacent overlapping sequence we get:

The green sequence is the reference for overlapping the two sequences.

A blast of the top line of the third frame shows a 66% match in savignyi.


FLSIYFCWR*NDGTANISQ*VKLDFRVINEKFATFRGKLDLNCTL*FPKCSRGE*TPESVYFVPISALHKDYS*PAFYRKLS
NYILKLRFWEKS*VLLQYT*LNARNMYT*TSLLRIC*LCMCCIIDKIS*LCREVTFVSLHFTEIDIFQSTTTRTKHVKPDQI
LRKSRRRLP*FTMNFIFYKNKNKFSLFIFIY*LKKSFCSISLNTFVFLQELQL

FYRYISVGDETTEQRIFHSK*SLTFVLLTKNLPRLEENWI*IVRYSFQNVVEESKRRNLFILFLFRHCIKTILNLHFTENLA
IIY*SYDFGKKAKFSCSIHS*TPAICILELVCCVYVDCVCVVL*IRSRDYAVR*LLSVYTLQRSTFFNRRPREQNTLNQTRY
YGSRAGVYLDLP*TLSSTKTKTNLVCLYLFIN*KNHFVAYL*IHLFFCRSCSC

FIDIFLLEMKRRNSEYFTVSKA*LSCY*RKICHV*RKIGFKLYVIVSKM*SRRVNAGICLFCSYFGTA*RLFLTCILQKT*Q
LYIKVTILGKKLSSPAVYIAERPQYVYLN*SVAYMLIVYVLYYR*DLVIMP*GNFCQSTLYRDRHFSIDDHENKTR*TRPDI
TEVAPAFTLIYHELYLLQKQKQI*FVYIYLLTEKIIL*HISEYICFFAGAAV

>scf/ciona01/G126/seq_dir/hrs/G126P64336F.T0/G126P64336FB8.T0.seq    738      0
            738  ABI
            Length = 738

  Plus Strand HSPs:

 Score = 57 (20.1 bits), Expect = 5.7, P = 1.00
 Identities = 14/21 (66%), Positives = 15/21 (71%), Frame = +1

Query:     1 FIDIFLLEMKRRNSEY-FTVS 20
             FID F+LEMKR  S   FTVS
Sbjct:   424 FIDAFILEMKRDQSNSDFTVS 486

When this is searched in JGI a hit is found that can extend the sequence upstream.

LQW160072.x2 
LQW160072.x2.phd.1 LQW160072.y1 

13:35:30 2001 TEMPLATE: LQW160072 DIRECTION: fwd
          Length = 1068

 Score = 70.7 bits (159), Expect = 8e-13
 Identities = 20/20 (100%), Positives = 20/20 (100%)
 Frame = +2

Query: 1   FIDIFLLEMKRRNSEYFTVS 20
           FIDIFLLEMKRRNSEYFTVS
Sbjct: 455 FIDIFLLEMKRRNSEYFTVS 514

3 frames
K*KRQAALWES*EYGLLII*FFCDK*DEKIK*KGVPSSPTPLYIIPIYLHKVYTSA*FCL*CVNFTINCNAGRQAASLCLFR
FFDLYHHSNKAIRD*SPARNGF*VHGLILV*FIPLYL*HINFVPYAVQILSLK*LKSTKILSTKTTLEILSIYFCWR*NDGT
ANISQ*VKLDFRVINEKFATFRGKLHLNCTL*FPKCSRGE*RRNLFILFLFRHSIKTILNLHFTENLAIIY*SYDFGKIAKF
SCSIIAERRNIYLKLFVRNVNG*CD*DRTRYAGVICSILRVVL*SERTNVTTIGVRVR*DIQTTMCFRQGQN*RVTADS*IY
YIV*LDREKTNT*RETGKNKRKEEKRV

EIETAGSSMGVVRIRFINYMIFLRQMR*ENKMKRCPIFPYPTIYHTHLSTQSVYKCLILFIMR*FYNKL*CRPTSSVVMFIP
FLRFIPPFKQGYQRLIASTQRVLGTWLNTSVVYSLIFITY*LCAIRRTDIISEITEEHKNSFDENNLRDFIDIFLLEMKRRN
SEYFTVSKA*LSCY*RKICHV*RKIAFKLYVIVSKM*SRRMTPESVYFVPISAQHKDYS*PAFYRKLSNYILKLRFWENS*V
LLQYNS*TPQYLSETIRA*CEWVV*LR*NQICRSNLQYTKSCIIIRTNKRYHDWSAGQIRYSNNNVF*TRTKLEGDC*LLDI
LYCLVR*RED*YIKRNRKEQKKGRKES

GNRNGRQLYGSRENTVY*LYDFFATNEMRK*NEKVSHLPLPHYISYPFIYTKCIQVLNFVYNALILQ*TVMQADKQRRYVYS
VSSIYTTIQTRLSEINRQHATGFRYMA*Y*CSLFPYIYNILTLCHTPYRYYL*NN*RAQKFFRRKQP*RFYRYISVGDETTE
QRIFHSK*SLTFVLLTKNLPRLEENCI*IVRYSFQNVVEENDAGICLFCSYFGTA*RLFLTCILQKT*QLYIKVTILGK*LS
SPAV**LNAAIFI*NYSCVM*MGSVIKIEPDMPE*FAVY*ELYYNQNEQTLPRLECGSDKIFKQQCVLDKDKIRG*LLTLRY
IILFS*IERRLIHKEKQERTKERKKREC
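
For anyone who wants to reproduce a three-frame translation like the one above without a web tool, here is a minimal sketch. It assumes the standard genetic code and forward-strand frames only; the names translate and three_frames are hypothetical helpers written for this example, not the code behind the quick protein translator.

```python
from itertools import product

# Standard genetic code, built in TCAG order (an assumption: nuclear code).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame=0):
    """Translate one forward frame (0, 1 or 2); unknown codons become X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(frame, len(dna) - 2, 3)
    )


def three_frames(dna):
    """Return the three forward-strand translations as a list."""
    return [translate(dna, frame) for frame in range(3)]


# Nucleotides 421-480 of LQW160072.x2, as listed later in this module.
dna = "ACTGAAGAGCACAAAAATTCTTTCGACGAAAACAACCTTAGAGATTTTATCGATATATTT"
for number, protein in enumerate(three_frames(dna), start=1):
    print(number, protein)
```

Stop codons print as *, the same convention used in the translations above.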

There is a possible exon upstream of the I-helix. The exon probably ends at the GT of the valine codon in the sequence YFTV (phase 0), which matches the phase 0 beginning of the I-helix exon. Remember, valine codons always start with GT. The other end is based on alignment to Ciona savignyi sequences, but it is not as strong a prediction.

DIISEITEEHKNSFDENNLRDFIDIFLLEMKRRNSEYFT (0)
The translation around this exon shows:

361 - CTTATATTTATAACATATTAACTTTGTGCCATACGCCGTACAGATATTATCTCTGAAATA - 420 
    - L  I  F  I  T  Y  *  L  C  A  I  R  R  T  D  I  I  S  E  I   
    -  L  Y  L  *  H  I  N  F  V  P  Y  A  V  Q  I  L  S  L  K  *   
    -   Y  I  Y  N  I  L  T  L  C  H  T  P  Y  R  Y  Y  L  *  N  N   
421 - ACTGAAGAGCACAAAAATTCTTTCGACGAAAACAACCTTAGAGATTTTATCGATATATTT - 480 
    - T  E  E  H  K  N  S  F  D  E  N  N  L  R  D  F  I  D  I  F   
    -  L  K  S  T  K  I  L  S  T  K  T  T  L  E  I  L  S  I  Y  F   
    -   *  R  A  Q  K  F  F  R  R  K  Q  P  *  R  F  Y  R  Y  I  S

The AG above the D in the sequence DIIS is the only one nearby. It is phase 1, since it ends one nucleotide into the D codon GAT. This is the probable upstream end of this exon.
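
The search for candidate boundaries can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up, positions are 0-based within the pasted string, and phase is counted the way it is in this module (the number of nucleotides of the interrupted codon that lie upstream of the boundary). It lists every AG and GT, which is far more than the real splice sites, so the candidates still have to be judged against the alignment.

```python
def find_sites(dna, motif):
    """Return 0-based start positions of a dinucleotide such as AG or GT."""
    dna = dna.upper()
    return [i for i in range(len(dna) - 1) if dna[i:i + 2] == motif]


def acceptor_candidates(dna, frame=0):
    """AG ends an intron, so the exon would start at position i + 2."""
    return [(i, (i + 2 - frame) % 3) for i in find_sites(dna, "AG")]


def donor_candidates(dna, frame=0):
    """GT starts an intron, so the exon would end just before position i."""
    return [(i, (i - frame) % 3) for i in find_sites(dna, "GT")]


# Nucleotides 361-420 from the translation above; frame 0 is the DIISEI frame.
dna = "CTTATATTTATAACATATTAACTTTGTGCCATACGCCGTACAGATATTATCTCTGAAATA"
print(acceptor_candidates(dna))  # (position, phase) pairs for AG
print(donor_candidates(dna))     # (position, phase) pairs for GT
```

On this stretch the only AG is at offset 41 (nucleotides 402-403 in the numbering above) with phase 1, in agreement with the reasoning above.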

The sequence we have built so far is


MLPSGYIFVFVFLCIYYLLQWRRRPKNFPPGPLGIPLFGIAPFAGVDMHKYLATYYAKYG
GVMSFRLATKDWIVLNDIEAITQ (0)
ALLKQGESFSGRPQSYLMNQLTEGCGIVFSTGPRWQAQRRFVLTALKT (2) (C-helix)
LGMGTRSMVGIINEENQNFSSVIQSSGGQVNIL (0)

(1) DIISEITEEHKNSFDENNLRDFIDIFLLEMKRRNSEYFT (0) 
ELQLLHLVRDLFVGAIDTTTATLGWGIICLLHYPECQVRIQEEIDDVI (1) 
GCAEPDMSHHESMPYLRAFIQEVHRFQTIAPLNIPHCVTEDCVLFGYHIPKSTPV
MSNIWRVHNDPKYWENPEKFSPERHLDSEGRFVPSNRVLSFAVGHRSCLGVQLAR
VELFLFYASTLKKYEFQIDPEYGLPDWSNDRSGTVKTPKKFSVLLKSR*

A blast against the 201 Ciona contigs shows that the C-terminal half is 80% identical to sequence 58 and sequence 70, but they do not extend back into the gap region. The sequence is also 47% identical to sequence 2, which does go into the gap. The N-terminal region also matches sequence 2. The missing region is 79 amino acids long.

Query:     7 IFVFVFLCIYYLLQWRRRPKNFPPGPLGIPLFGIAPFAGVDMHKYLATYYAKYGGVMSFR 66
             IF   FL +YY   W  RPKNFPPGP G+P  G+ PF G    + +  +  KYG VMS R
Sbjct:    16 IFFTAFLALYY---WYTRPKNFPPGPRGVPFLGVIPFLGNYPERVMRKWSKKYGPVMSVR 72

Query:    67 LATKDWIVLNDIEAITQALLKQGESFSGRPQSYLMNQLTEGCGIV-FSTGPRWQAQRRFV 125
             +  +DW+VL D E I Q+L+KQG+ FSGRP   ++NQ+T G G++       W+ QRRF 
Sbjct:    73 MGREDWVVLGDYETIQQSLVKQGQCFSGRPDVPVLNQITNGHGLITVDYNEDWKTQRRFG 132

Query:   126 LTALKTLGMGTRSMVGIINEENQNFSSVIQS 156
             +T L+  G+G RSM   I EE    +  I+S
Sbjct:   133 ITTLRGFGVGKRSMEDRIVEEVAYLNDAIRS 163

>sequence 2,3,9,45 COMPLETE 81 accessions 94% to sequence 1 40% to 2U1
          Length = 498

 Score = 607 (213.7 bits), Expect = 1.9e-62, P = 1.9e-62
 Identities = 117/245 (47%), Positives = 165/245 (67%)

Query:     2 IISEITEEHKNSFDENNLRDFIDIFLLEM--KRRNSEYFTELQLLHLVRDLFVGAIDTTT 59
             ++ EI  EHK++F+++++RDFID F+ E   + ++S Y T+LQLL  VRDLFV   +TTT
Sbjct:   251 MLREILSEHKSTFNKDDVRDFIDAFIAEQNSESKHSSY-TDLQLLQYVRDLFVAGTETTT 309

Query:    60 ATLGWGIICLLHYPECQVRIQEEIDDVIGCAE-PDMSHHESMPYLRAFIQEVHRFQTIAP 118
             +TL W I+C++H PE Q ++++EI DVIG    P M+    MPY  AF+QEV R++T+ P
Sbjct:   310 STLRWSILCMIHNPEKQEKLRKEICDVIGQDRVPAMNDKAQMPYTCAFMQEVFRYRTLVP 369

Query:   119 LNIPHCVTEDCVLFGYHIPKSTPVMSNIWRVHNDPKYWENPEKFSPERHLDSEGRFVPSN 178
             L++ H   +D VL GY IPK T +  N+W VHN+P  W+ P KF PERHLD +G FV S 
Sbjct:   370 LSVVHMTNQDVVLNGYTIPKGTTISPNLWAVHNNPDVWDEPSKFKPERHLDDKGNFVQSK 429

Query:   179 RVLSFAVGHRSCLGVQLARVELFLFYASTLKKYEFQIDP-EYGLPDWSNDRSGTVKTPKK 237
              V+ F++G R CLG QLAR+E F++  S ++K+EF  DP E  LPD  +  SG V  P +
Sbjct:   430 HVIPFSIGPRHCLGEQLARMEYFIYLVSMVQKFEFFPDPNEPDLPDVEDGSSGVVFVPLR 489

Query:   238 FSVLLK 243
             F  + K
Sbjct:   490 FKQIAK 495

>sequence 2,3,9,45 COMPLETE 81 accessions 94% to sequence 1 40% to 2U1
MVLQLLSDINVSSLVIFFTAFLALYYWYTRPKNFPPGPRGVPFLGVIPFLGNYPERVMRKWSKKYGPVMSVRMG
REDWVVLGDYETIQQ (0)
SLVKQGQCFSGRPDVPVLNQITNGHGLITVDYNEDWKTQRRFGITTLRG (2)
FGVGKRSMEDRIVEEVAYLNDAIRSHNEKPFDIL (0)
SILSNAVSNNICSVVMGRRFDYDDKRFMEIMARLSRS (2)
FNDPTANFALNVVMFMPILVKIPPFSRINNQLMTDVRVIL (1)
QMLREILSEHKSTFNKDDVRDFIDAFIAEQNSESKHSSYT (0)
DLQLLQYVRDLFVAGTETTTSTLRWSILCMIHNPEKQEKLRKEICDVI (1)
GQDRVPAMNDKAQMPYTCAFMQEVFRYRTLVPLSVVHMTNQDVVLNGYTIPKGTT (0)
ISPNLWAVHNNPDVWDEPSKFKPERHLDDKGNFVQSKHVIPFSIGPRHCLGEQLARMEYFIYLVSMV
QKFEFFPDPNEPDLPDVEDGSSGVVFVPLRFKQIAKIV*

We are missing only two exons, corresponding to exons 4 and 5 of sequence 2 (shown in red above). The gene model is the same as for sequence 2, except that sequence 69 does not have the intron at PKG seen in sequence 2. To finish this gene, we have to look upstream of the DIISE sequence for an exon like exon 5 of sequence 2. Blast searches with the sequences on either side of the gap do not allow extension into the gap with overlapping sequences. There seems to be a hole in the sequence here. It might be possible to bridge this hole by using the JGI links to the opposite ends of the sequenced clones. To do this, I searched each exon separately and collected all the accession numbers for exact matches, along with their opposite ends. These are shown below. The gap still remains, and it is not possible to find the two missing exons with the data as it is now. A search of the Ciona ESTs did not find a match in the middle region, so that was not an option either.

LQW4880.y1    exon 1 opposite = LQW4880.x1 (exon 6)
DEV1604.y1    exon 1 opposite = DEV1604.x1 ?
LQW232493.y1  exon 1 opposite = LQW232493.x1 (exon 7)
LGJ1425.x1    exon 1 opposite = LGJ1425.y1 (exon 7) 
LQW269127.y1  exon 2 opposite = LQW269127.x1 (exon 7)
LQW265377.x1  exons 2 and 3 opposite = LQW265377.y1 (exons 8,9)
LQW48810.x1   exon 2 opposite = LQW48810.y1 (exon 2)
LQW48810.y1   exon 2 opposite = LQW48810.x1 (exon 2)
DEV27943.y1   exon 2 opposite = DEV27943.x1 (exons 8,9 fused)
              exons 4 and 5 are missing
LQW37658.y1   exon 6 opposite = LQW37658.x1 ?
LQW98270.x1   exon 6 opposite = LQW98270.y1 ?
LQW160072.x1  exon 6 opposite = LQW160072.y1 (C-terminal and 3 prime UTR)
LQW160072.x2  exon 6 opposite = LQW160072.y1 (C-terminal and 3 prime UTR)
DEV47237.x1   exon 6 no opposite
LQW4880.x1    exon 6 opposite = LQW4880.y1 (exon 1)
DEV55412.x1   exon 7 opposite = DEV55412.y1 ?
LQW232493.x1  exon 7 opposite = LQW232493.y1 (exon 2)
LQW103809.x01 exon 7 opposite = LQW103809.y1 ?
LQW103809.x1  exon 7 opposite = LQW103809.y1 ?
LQW269127.x1  exon 7 opposite = LQW269127.y1 (exon 2)
LQW231417.x1  exons 8,9 fused opposite = LQW231417.y1
LQW265377.y2  exons 8,9 fused opposite = LQW265377.x1 (exons 2, 3)
DEV27943.x1   exons 8,9 fused opposite = DEV27943.y1 (exon 2)
LQW160072.y2  exons 8,9 fused opposite = LQW160072.x1 (exon 6)
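
The bookkeeping in this table can also be done with a short script. The sketch below is an illustration only: a handful of rows are transcribed by hand from the table, the helper names are hypothetical, and it assumes that the part of a read name before the last dot identifies the clone. It asks two questions: which clones have their two ends on opposite sides of the gap (exons 4 and 5), and does any read end actually land inside the gap.

```python
# (read, exons hit by this read, exons hit by its opposite end)
PAIRS = [
    ("LQW4880.y1", [1], [6]),
    ("LQW232493.y1", [1], [7]),
    ("LGJ1425.x1", [1], [7]),
    ("LQW269127.y1", [2], [7]),
    ("LQW265377.x1", [2, 3], [8, 9]),
    ("DEV27943.y1", [2], [8, 9]),
    ("LQW160072.y2", [8, 9], [6]),
]


def clone_of(read):
    """Strip the read suffix (.x1, .y1, ...) to get the clone name."""
    return read.rsplit(".", 1)[0]


def spanning_clones(pairs, gap_exons):
    """Clones with one end entirely upstream and one entirely downstream of the gap."""
    low, high = min(gap_exons), max(gap_exons)
    found = []
    for read, read_exons, mate_exons in pairs:
        first, second = sorted([read_exons, mate_exons], key=min)
        if max(first) < low and min(second) > high:
            found.append(clone_of(read))
    return found


def clones_touching_gap(pairs, gap_exons):
    """Clones with at least one end that hits an exon inside the gap."""
    gap = set(gap_exons)
    return [clone_of(r) for r, a, b in pairs if gap & set(a + b)]


print(spanning_clones(PAIRS, [4, 5]))
print(clones_touching_gap(PAIRS, [4, 5]))
```

With these rows, six clones span the gap but none has an end inside it, which is why the opposite ends do not help here.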


Note: exon 3 has only three differences from LQW69473.x1 and LQW163272.x1. These extend 280 bp into the gap region, but they are probably from a different gene. The middle region is always the hardest to assemble. If we had DNA sequence from this region we could do it, but that is lacking.

Assignment 11.

This is the last assignment for MSCI814. Next week we start MSCI815. This may be the most difficult assignment yet, so please be patient with it. It is a real-world problem. I would like you to select a Ciona contig and assemble the gene as far as possible. You may not get the whole gene, but do what you can. As you can see from the case study of sequence 69, this can be a difficult task. Good luck.

DO NOT BLAST THE JGI SERVER IN CLASS. WE WILL GET BANNED AGAIN IF TOO MANY HITS GO THERE AT ONE TIME.

The Ciona savignyi servers are fair game, and you can choose to assemble one of their sequences instead of a Ciona intestinalis P450. The current problem with that data is that we cannot retrieve the DNA sequences from the server. I have asked Rob to work on fixing that. The only sequences that are off limits are sequence 2 and sequence 36. These are both done. You can use them to find related genes to assemble. I would warn you that sequence 2 seems to be in a gene cluster, and this could be a problem for you. Sequence 36 has no introns, but that does not mean its relatives will also have no introns. The quick protein translator is one of the best ways to find the GT-AG pairs for exon boundaries. Be aware that it deletes Ns in a sequence, and that causes frameshifts. The Vector NTI software is another option for viewing the translations above the DNA sequence to find the boundaries. Email me with problems and I will try to suggest some solutions.
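
To see why deleting Ns matters, here is a small sketch. The DNA string is invented for the demonstration, and translate is a hypothetical helper using the standard genetic code, not the translator's own code. Keeping the N gives one unknown residue (X) and leaves the rest of the frame intact; deleting it shifts every codon downstream.

```python
from itertools import product

# Standard genetic code in TCAG order (assumed; nuclear code).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate frame 1; any codon containing N (or other symbols) becomes X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


# Made-up example read with a single uncalled base (N).
read = "ATGAAANCCGATATTATC"

with_n = translate(read)                      # N kept: frame preserved
without_n = translate(read.replace("N", ""))  # N deleted: frameshift

print(with_n)
print(without_n)
```

In this example the DII at the end of the first translation disappears from the second, even though the underlying bases are unchanged. If a translation suddenly stops matching the blast alignment, check the DNA for Ns near that point.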

I recommend looking at the blast output from the 1804 blast searches done by Rob on the savignyi data. These may lead you to an interesting sequence to start with. The common ancestor of bilaterian animals, about 670 million years ago, seems to have had at least 5 P450s: a CYP2, a CYP3, a CYP4, a mitochondrial P450 and CYP51. I have already looked for CYP51 but cannot find it, so Ciona may have lost CYP51 (required for making cholesterol). Ciona may have a sterol requirement in its diet. The mitochondrial P450s are very interesting. They include CYP11A, CYP11B, CYP24, CYP27A and CYP27B. I would be very curious to see one of these in Ciona.

Links to useful files and servers:

Our P450 blast server, with 201 Ciona contigs

780 Ciona accessions

201 contigs with accession numbers

201 contigs in FASTA format without accession numbers

Ciona C-term alignment Numbering refers to the contig number.

index to blast output from the 1804 Ciona savignyi blast files

savignyi blast server 1

savignyi blast server 2

JGI Ciona blast server

Quick protein translator

Ciona EST blast server in Japan (also use NCBI)