Course Number MSCI 814

Course Title Bioinformatics I

Course Credit 2 hours

Prerequisites general knowledge of gene structure and protein structure

Catalog Description 814 MSCI, Bioinformatics I.

This course will consist of eleven 2.5-hour segments (27.5 hours). The material will be introduced in a brief lecture of 30-45 minutes as necessary. The majority of the time will be spent using bioinformatics tools on the computer. The course is designed to provide practical training in bioinformatics methods including accessing the major public sequence databases, use of the five BLAST tools to find sequences, analysis of protein and nucleic acid sequences with software packages such as Vector NTI or SeqWeb, detection of motifs or domains in proteins, assembly of protein sequences from genomic DNA, detection of exons and finding intron-exon boundaries, aligning sequences (Clustal W), making phylogenetic trees (Phylip), and comparative genomics. The student should leave the course with a working knowledge of how to carry out research using these tools. Recommended text: Developing Bioinformatics Computer Skills by C. Gibas and P. Jambeck. Offered spring of even-numbered years. Credit 2.

Course Content and Objectives Each module is described separately with the objectives given for that module.

Module 1. Introduction to the class, course requirements (textbook, access to internet, registration for SeqWeb), material to be covered and course format, timetable, assignments, exams. Effective use of PubMed (searches, related links, search terms, history, etc.). Accessing online journals and retrieving papers. OMIM, Online Mendelian Inheritance in Man. Tour of Bioinformatics computer lab 101 MSB.

Objectives: Introduce the course and outline expectations of the students. Go over the extra features of PubMed (often overlooked) that can be used to find articles and how to obtain those articles. Introduce OMIM, the major source of information on human genetic diseases.

Module 2. Entrez nucleotide and introduction to BLAST. Structure of GenBank, tour of the various sections: nr, ESTdb, GSS, HTGS. UniGene and COGs, two additional resources at GenBank. Use of the taxonomy browser: how to find out how many sequences are present for a given organism, and how many of them are EST, genomic, or mRNA sequences. Retrieving DNA and protein sequences via the web. Introduction to the basic BLAST algorithm: how it works, how to use it, how to interpret your results.

Objectives: Introduce the four main sections of the Entrez nucleotide interface to GenBank. Understand what is there and how to find it. Explain what a BLAST search is and demonstrate the power of this tool. Understand BLAST output and the significance of the scores and alignments.

Module 3. Advanced features of the BLAST technique. The five main types of BLAST: blastn, blastp, blastx, tblastn, tblastx. Why you would use one over another and how to set the parameters to get what you want out of a search (expect value, cutoff values, filters, repeat masker, etc.). Deciding where to search. PSI-BLAST and PHI-BLAST, which give some extra power.

Objectives: Understand the choices offered by the BLAST page at NCBI: choice of programs and choice of settings to optimize results.
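The choice among the five programs follows from the type of the query and the type of the database. The sketch below is a minimal illustration only; the function name `choose_blast` and its arguments are hypothetical and are not part of any NCBI tool.

```python
def choose_blast(query_type, db_type, compare_as_protein=False):
    """Suggest one of the five BLAST programs for a query/database pair.

    query_type and db_type are "nucleotide" or "protein".
    compare_as_protein only matters for nucleotide vs. nucleotide searches:
    when True, both sides are translated in six frames (tblastx).
    """
    valid = ("nucleotide", "protein")
    if query_type not in valid or db_type not in valid:
        raise ValueError("types must be 'nucleotide' or 'protein'")

    if query_type == "protein" and db_type == "protein":
        return "blastp"      # protein query vs. protein database
    if query_type == "nucleotide" and db_type == "protein":
        return "blastx"      # translated nucleotide query vs. protein database
    if query_type == "protein" and db_type == "nucleotide":
        return "tblastn"     # protein query vs. translated nucleotide database
    # nucleotide query vs. nucleotide database
    return "tblastx" if compare_as_protein else "blastn"


if __name__ == "__main__":
    print(choose_blast("protein", "nucleotide"))
```

For example, a protein query searched against a genomic (nucleotide) database points to tblastn, which is the typical choice when hunting for unannotated genes.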

Module 4. Vector NTI for DNA. Programs used to analyze DNA sequences in the Vector NTI package. Installing the software and registering, entering data, generating restriction maps, plasmid maps, in silico cloning, PCR primer design. Annotating sequences and saving in different formats. In silico gel electrophoresis. Generating Vector NTI reports.

Objectives: Hands-on practical instruction in the use of UT's new Vector NTI software. Emphasis will be on manipulating DNA sequences, primer design, etc.
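As a rough idea of what a restriction map involves, the sketch below locates recognition sites for three common enzymes in a linear DNA sequence. It is not how Vector NTI works internally; the function names and the test sequence are made up for illustration.

```python
# Recognition sequences for three widely used enzymes. All three are
# palindromic, so scanning one strand finds the sites on both strands.
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def find_sites(seq, site):
    """Return the 0-based start positions of every occurrence of site in seq."""
    seq = seq.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)  # step by 1 to allow overlapping hits
    return positions


def restriction_map(seq):
    """Map each enzyme name to the list of positions where its site occurs."""
    return {name: find_sites(seq, site) for name, site in SITES.items()}


if __name__ == "__main__":
    # Hypothetical test sequence with two EcoRI sites and one BamHI site.
    print(restriction_map("AAGAATTCGGGGATCCTTGAATTC"))
```

The positions reported are where each recognition sequence begins, not the exact cut position within the site.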

Module 5. Vector NTI for proteins. Programs used to analyze protein sequences in the Vector NTI package. Generating protein sequences from DNA sequences. Calculating protein properties. Hydrophobicity plots, antigenicity plots. Aligning proteins, the ALIGN X protocol. Setting parameters for gaps, choosing the right matrix, editing the multiple alignment.

Objectives: Hands-on practical instruction in the use of UT's new Vector NTI software. Emphasis will be on analyzing protein sequences and aligning protein sequences.
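A hydrophobicity plot is essentially a sliding-window average of per-residue hydropathy values. The sketch below assumes the Kyte-Doolittle scale and a window of 9 residues; the scale and window that Vector NTI actually uses may differ, and the function name is hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale for this illustration).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=9):
    """Return the mean hydropathy of every window along the protein.

    The list has len(protein) - window + 1 entries; entry i covers
    residues i .. i + window - 1. Non-standard residues raise KeyError.
    """
    protein = protein.upper()
    if window < 1 or window > len(protein):
        raise ValueError("window must be between 1 and the sequence length")
    scores = [KYTE_DOOLITTLE[aa] for aa in protein]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


if __name__ == "__main__":
    # Hypothetical peptide: a hydrophobic stretch followed by charged residues.
    for value in hydropathy_profile("LLIVALLGKKDEER", window=5):
        print(round(value, 2))
```

Plotting these values against window position gives the familiar profile, in which sustained high values suggest hydrophobic (possibly membrane-spanning) segments.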

Module 6. Introduction to other sequence analysis packages, with emphasis on the web-based version of GCG called SeqWeb (available on campus by registration). Creating files and using the sequence analysis tools. Items not present in Vector NTI will be pointed out (ability to BLAST your own database, motif/domain searches, etc.). Discussion of other web-based tools: ExPASy, Jellyfish, etc.

Objectives: Outline alternative sequence analysis software and what it can do. Provide information on the SeqWeb version of GCG on campus or the web and how to access it and use it.

Module 7. Multiple sequence alignment. Introduction to Clustal W. Sequence input formats. Inspection of output for large alignment errors and manual editing using a sequence editor. Introduce the concept of motifs and the databases that emphasize motifs and protein families (InterPro and its internal hyperlinks to Pfam, Prints, SMART, Prosite, ProDom, etc.).

Objectives: Explain the use of the most common multiple sequence alignment algorithm. Understand how to search a protein sequence for domains/motifs and how to obtain premade and edited protein family sequence alignments.

Module 8. Phylip phylogeny inference package, constructing phylogenetic trees and cladograms from a multiple sequence alignment. Step by step procedure to make a tree. Making the distance matrix, computing the tree (Neighbor Joining or UPGMA), drawing the tree. Introduction to the Tree of Life web site as a resource for phylogenetic information.

Objectives: Understand how to use the Clustal W output to make a phylogenetic tree by several common methods. Understand the artifacts caused by long branch attraction.
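The first step, making the distance matrix, can be illustrated with the simplest possible measure: the uncorrected fraction of differing positions (p-distance) between each pair of aligned sequences. This is only a sketch; the Phylip distance programs apply substitution models that correct for multiple changes at the same site, and the function names here are hypothetical.

```python
def p_distance(a, b):
    """Fraction of differing positions between two aligned sequences.

    Columns where either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(aligned):
    """Build a symmetric matrix (dict of dicts) from {name: aligned sequence}."""
    names = list(aligned)
    return {m: {n: p_distance(aligned[m], aligned[n]) for n in names}
            for m in names}


if __name__ == "__main__":
    # Hypothetical toy alignment of three sequences.
    toy = {"a": "ACGT", "b": "ACGA", "c": "A-GT"}
    for name, row in distance_matrix(toy).items():
        print(name, [round(v, 3) for v in row.values()])
```

A tree-building method such as Neighbor Joining or UPGMA then takes a matrix of this kind as its input.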

Module 9. Genomics, sequencing of bacterial and eukaryotic genomes, how is it done and how is the data assembled into a contiguous sequence? Viewing ABI sequence traces and assessing quality of each base. Phred and Phrap and Consed programs. Annotation in bacterial genomes (operons but no introns), finding ORFs and gene boundaries. Grail and GeneFinder programs. Searching for regulatory elements. Use of Artemis.

Objectives: Explain the current strategies for genome sequencing projects and quality assessment of the raw data. Understand how annotation is achieved by automated methods.
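Because bacterial genes lack introns, a first pass at finding ORFs can be as simple as scanning each reading frame for an ATG followed in frame by a stop codon. The sketch below covers the forward strand only and is far simpler than the gene-finding programs named above; the function name and the minimum-length parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Find ATG...stop open reading frames on the forward strand.

    Returns a list of (start, end, frame) tuples, where seq[start:end]
    runs from the ATG through the stop codon and frame is 0, 1 or 2.
    min_codons is the minimum number of codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF
    return orfs


if __name__ == "__main__":
    # Hypothetical short sequence containing one ORF in frame 2.
    dna = "CCATGAAATTTTAGGG"
    for start, end, frame in find_orfs(dna):
        print(frame, dna[start:end])
```

A fuller version would also scan the reverse complement and apply a realistic minimum length, since short ORFs occur frequently by chance.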

Module 10. Annotation in eukaryotic genomes, finding exons in a sea of non-coding DNA sequence. Comparison between species (human, mouse) as a tool to identify genes. Alternative splicing is present in about 50% of human genes. Detection of alternate exons. Practice at finding missing sequence data from mouse mitochondrial carriers. Searches with partial sequences to fill in holes in a sequence alignment. Searching ESTdb (cDNA) and HTGS (genomic) data.

Objectives: Explain gene hunting methods and how to assemble a gene from genomic DNA. Understand the uses of cDNA databases in genome annotation.

Module 11. Building genes without gene models as a guide. Using similar members of a gene cluster to identify intron-exon boundaries with TBLASTX, an example from the white rot fungus cytochrome P450s. Practice in this technique with real unannotated sequence data.

Objectives: Explain how to assemble a gene by using other non-annotated sequence as a helper. Understand the logic of finding exons without a cDNA to help.

Method of Evaluation Since this is a practical course, students will be given assignments on the techniques taught in class, and grades will be based on these assignments. Students will have one week to turn in each assignment. The instructors will review the assignments and return any that are not of high enough quality. Students will then have until the following class to revise and resubmit. After the second week the answers will be posted, so no further revision is possible.

Course Number MSCI 815

Course Title Bioinformatics II

Course Credit 1 hour

Prerequisites Bioinformatics I or permission of the instructor

Catalog Description 815 MSCI, Bioinformatics II.

This course will consist of six 2.5-hour segments (15 hours), partially as lecture and partially as computer tutorial sessions, to demonstrate advanced bioinformatics methods and the use of databases. The course will follow Bioinformatics I. The topics covered are more advanced and require some prior knowledge. Topics include finding and using public databases other than NCBI; private databases and the politics of genomics; genome browsers and NCBI's genomic biology section; gene arrays, their construction, use and data analysis; mapping quantitative trait loci (QTLs) and radiation hybrid mapping; 3D protein structure viewers and threading. Recommended text: none. Offered spring of even-numbered years. Credit 1.

Course Content and Objectives Each module is described separately with the objectives given for that module.

Module 1. Non-GenBank public databases, searching out public data that is not at GenBank. Examples: Dictyostelium discoideum BLAST servers at Baylor, Japan, Sanger Center and San Diego, all of which have different data; JGI (Joint Genome Institute), Walnut Creek, California (White Rot Fungus Genome and other genomes); PlasmoDB (must register to use all of the P. falciparum genome); Sanger BLAST servers for data not in GenBank yet; RIKEN database of full-length cDNAs from mouse. KEGG, Kyoto Encyclopedia of Genes and Genomes (a pathway database).

Objectives: To demonstrate that GenBank is not the only place to go for sequence data. Give the students practical experience in hunting down alternative genome-related sites.

Module 2. Private Databases and the politics and ethics of data usage restrictions. The Bermuda agreement on rapid release of publicly funded data. The Science agreement with Celera to publish the human genome. Dissent from some participants. Private databases include Celera Genomics, Incyte Genomics, ERGO. These require subscriptions or collaborations to access data, though some data are publicly available.

Objectives: The naive notion that scientific results should be free to everyone will be discussed in light of private acquisition of genome data and charging for access to that data. Examine the evolving concepts of public versus private science.

Module 3. Genome browsers: NCBI, UCSC, Ensembl, Celera, etc. Use of several different genome browsers and how to locate genes in genomes. How to view large segments of genomic data. Getting information such as the transcriptome map that influences gene expression in a local region of a genome. Tour of NCBI's Genomic Biology section.

Objectives: Understand how to find a gene in its genomic context and how that context may influence expression. Appreciate the construction of beautiful tools for the community and the free availability of these tools over the web.

Module 4. Gene arrays. Construction, use, data analysis. Spot finding, commercial products, Affymetrix DNA chips. GeneSpring software.

Objectives: Explain the newest genomic tool, gene arrays, and their use.

Module 5. Gene mapping in mouse and human, Quantitative Trait Loci (QTL), radiation hybrid mapping. Rob Williams from Anatomy and Neurobiology will be guest lecturer on gene mapping.

Objectives: Explain the process of linking a trait (phenotype) to a particular gene and identification of that gene.

Module 6. RasMol and Cn3D (NCBI's 3D protein structure viewer). The Molecular Modeling Database (MMDB) with over 10,000 modeled structures. How to submit a sequence for threading based on a known structure. Discussion of the reliability of this procedure.

Objectives: Understand how to fetch 3D X-ray crystal coordinates from PDB and display the structures on your computer screen. Understand the concept of threading and protein modeling, especially the limitations of the models.