Spring '00 Practical Courses on Gene Prediction and Sequence Analysis
by Toby Gibson, Chenna Ramu, Christine Gemünd and José Castresana
These courses are open to everyone at EMBL who might be interested. In the genome era, many researchers find that they need to predict genes in poorly characterised DNA sequences. Three main strategies are used, in combination where possible: (1) de novo gene prediction; (2) detecting homology with known proteins; (3) detecting homologous ESTs. The Eukaryotic Gene Prediction course provides an introduction to some Web servers available on the Internet for identifying genes in genomic DNA. In Gene Analysis with Artemis and Staden, we will introduce two UNIX packages that have some very useful sequence analysis features, especially for handling large stretches of DNA. In the Eukaryotic Gene Investigation course, we use ESTs and public gene annotation servers to investigate genes and genomes. The fourth course focusses on a different topic: Molecular Phylogeny with Maximum Likelihood provides an introduction to tree calculation and analysis using the maximum likelihood approach.
The courses can be taken individually or in combination, according to your needs. Each course consists of an introduction to the topic followed by a hands-on practical. Schedules for the practicals will be provided on Web pages accessible by clicking on the links below.
Students will work in pairs, one pair per X-terminal or iMac. Practicals
will take place in the computer teaching lab, room V125.
Tuesday 16th May
This practical introduces some web servers for gene prediction. These
can be accessed from any computer and are simple to use. Web servers are
often a convenient way to do sequence analysis although none of the prediction
servers we checked can be said to be outstanding. You should also be aware
that Web servers can be unreliable, need constant care from their providers
and are not suited to every task: some of the gene prediction servers
we tried were not working! You will therefore sometimes need to run programs
on local machines as well.
The Staden Package and Artemis provide somewhat complementary capabilities and include features that are not available in GCG or in the gene prediction servers used in Course 1. The Staden Package, developed by Rodger Staden, is a long-established package for manipulating sequences. For some years development has concentrated on GAP4, an advanced sequence assembly program used at the Sanger Centre and elsewhere (but probably not of much use at EMBL). Recently, the sequence analysis programs have been given a new and attractive graphical interface, which allows complex graphic displays to be assembled by dragging and dropping component graphs. However, some of the analysis functions (especially for eukaryotic gene prediction) need upgrading to modern standards. Currently, the package is most likely to be useful for flexible pairwise sequence comparison in Sip4 and prokaryotic gene prediction in Nip4.
Artemis is a newly released program from the Sanger Centre
which displays large regions of genomic sequence and their annotated features
(which can be edited) in graphical form. It is very useful for anyone who
has to work with large DNA segments encoding complex genetic information.
This course introduces three web servers that provide tools for investigating
genes in genomic DNA. Gene2EST is a server developed in our group that
accepts up to 100,000 bases of genomic DNA as input for BLAST searches
of EST databases. The results are parsed to provide an alignment
of the genomic DNA with matching EST segments and a summary of the results
in EMBL format for graphical display in Artemis. In favourable cases (i.e.
when many ESTs match), Gene2EST can give quite a precise description of
gene structure, including differential splicing. ENSEMBL is the "real time"
annotation tool for the human genome project jointly developed by the EBI
and Sanger Centre. It uses de novo gene prediction, ESTs and protein
homology to rapidly annotate genes in newly determined human genome sequence.
The Human Genome Project (HGP) web site at the Sanger Centre provides an
overview of the project and tools for focusing on individual chromosomes
and subchromosomal regions.
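As a rough illustration of the kind of EMBL-format summary mentioned above, the sketch below writes feature table (FT) lines for the genomic segments matched by one EST, so that they could be read as an annotation file by Artemis. It is not Gene2EST's actual code or output: the function name, the choice of the misc_feature key, the EST identifier and the coordinates are all hypothetical, and long locations are not wrapped across lines as a complete writer would do.

```python
# Hypothetical sketch: turn EST match segments into EMBL feature table lines.
# This is NOT Gene2EST's real output; names and coordinates are invented.

def ft_lines(est_id, segments, complement=False):
    """Return EMBL FT lines for one EST aligned to genomic DNA.

    segments: list of (start, end) genomic coordinates, 1-based and inclusive.
    complement: True if the EST matches the reverse strand.
    """
    spans = ",".join(f"{start}..{end}" for start, end in sorted(segments))
    # Several matching segments (putative exons) are combined with join().
    location = f"join({spans})" if len(segments) > 1 else spans
    if complement:
        location = f"complement({location})"
    # The feature key starts in column 6 and the location in column 22.
    key_line = f"FT   {'misc_feature':<16}{location}"
    # Qualifier lines are indented so that '/' also falls in column 22.
    note_line = f'FT{"":19}/note="EST match {est_id}"'
    return [key_line, note_line]


# Example: an invented EST matching two genomic segments.
for line in ft_lines("EST_A", [(1201, 1350), (2410, 2588)]):
    print(line)
```

Each EST would contribute one such feature, and the gaps between joined segments then show up in the display as candidate introns.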
Maximum likelihood is widely acknowledged as one of the best (if very
slow) methods for calculating trees from sequence data. If you think of
trees as a best fit to the data, you will realise that they can be incorrect,
or at least statistically insignificant, if the data are not well resolved,
as is often the case. Maximum likelihood strategies explicitly acknowledge
the probabilistic nature of the tree reconstruction problem, and the ML
framework provides methods for assessing the reliability of tree branching
orders. In the practical we will make and evaluate trees from smallish
datasets.
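To make the idea of a likelihood-based best fit concrete, here is a minimal sketch for the simplest possible case: estimating the evolutionary distance between just two aligned sequences under the Jukes-Cantor substitution model. The choice of model, the function names and the example numbers are assumptions made for illustration; this is not the software or the model used in the practical. Tree programs apply the same principle to every branch of a candidate tree and compare the resulting likelihoods.

```python
import math

def count_differences(seq_a, seq_b):
    """Count mismatching positions in two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)

def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of a pairwise alignment under Jukes-Cantor.

    d is the distance in expected substitutions per site (d > 0).
    """
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * e)  # one specific identical pair, e.g. A/A
    p_diff = 0.25 * (0.25 - 0.25 * e)  # one specific differing pair, e.g. A/C
    return (n_sites - n_diff) * math.log(p_same) + n_diff * math.log(p_diff)

def ml_distance(n_sites, n_diff, step=0.001, max_d=3.0):
    """Grid search for the distance that maximises the likelihood."""
    grid = [i * step for i in range(1, int(max_d / step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))


# Example with invented numbers: 100 aligned sites, 10 of which differ.
d_hat = ml_distance(100, 10)
print("ML distance:", d_hat)
print("log-likelihood at ML distance:", jc69_log_likelihood(d_hat, 100, 10))
print("log-likelihood at d = 0.5:", jc69_log_likelihood(0.5, 100, 10))
```

The difference in log-likelihood between the best distance and an alternative gives a feel for how strongly the data prefer one hypothesis over another, which is the same kind of comparison used to judge competing branching orders.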
You can find this page at http://www.embl-heidelberg.de/~seqanal/courses/spring00/Top.html