EMBL

Practical Course on Sequence Analysis

Eukaryotic Gene Prediction

 

by Toby Gibson, Aidan Budd, Chenna Ramu and Christine Gemünd, July 2001


The aim of this practical is to get hands-on experience running some Web servers for eukaryotic gene prediction on an "unknown" sequence. We will use coding prediction programs together with promoter, splice site and poly(A) site prediction programs. All the information gathered has to be combined to find the coding regions of the gene. There is no time to go into the theory behind the prediction methods.

As anyone who has looked at automatically annotated genome sequence data will know, automatic gene prediction in higher eukaryotes does not work very well. Perhaps this is not surprising, considering that differential and tissue-specific splicing, genes within genes, TATA-less promoters, the rare class of U12-dependent (AT-AC) splice sites and the need for species-specific parameters all complicate the problem.

In practice, the bench biologist is best advised to inform themselves of the species-specific signals (e.g. mammals tend to have a looser splice site consensus than the worm) that they should expect and then run several servers and synthesise the results, looking (in part) for consistency between different servers. As there is a lot of activity in gene prediction (even though much of it is "reinventing the wheel") we may hope for better predictions and better presentation in the future.

Several web sites maintain useful lists of gene prediction services e.g.

 Gene Prediction Links of the Bork Group
 CNRS Marseille
 Rockefeller University
 Atelier Bioinformatique, Universite Aix-Marseille

WWW Servers used in the Gene Prediction practical

We will use:


Step 1. Getting the DNA sequence

For the practical we have chosen a human genomic sequence that may perhaps encode an adenylate kinase...

You now have the sequence available in a form that can be cut and pasted into the query forms we are going to use.

Step 2. Looking for a Promoter candidate

Promoter prediction is heavily dependent on finding good matches to the TATAAAA motif. Further clues may be provided by CCAAT-Box, CpG islands and other transcription factor binding sites. But if you run MatInspector with the TRANSFAC DB on our query, you will be astonished at the profusion of candidate sites throughout the sequence - so we won't bother. We will run two promoter prediction programs and tentatively assume that the best intersecting promoter is the correct one.
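
To get a feel for how weak these signals are on their own, it can help to scan for them by hand. The following is a minimal Python sketch and not what either promoter prediction server does: it lists exact matches to a motif such as TATAAA or CCAAT, and computes the observed/expected CpG ratio that is commonly used as a CpG island indicator. The function names are made up for illustration, and exact string matching is a deliberate simplification.

```python
import re


def find_motif(seq, motif):
    """Return the 1-based start positions of every exact match of motif.

    A lookahead is used so that overlapping matches are also reported.
    """
    seq = seq.upper()
    pattern = "(?=" + re.escape(motif.upper()) + ")"
    return [m.start() + 1 for m in re.finditer(pattern, seq)]


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G).

    Returns 0.0 when the sequence contains no C or no G.
    """
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)
```

Calling `find_motif(sequence, "TATAAA")` on the query sequence will typically return several positions, which is the point: a motif match only becomes a promoter candidate when other evidence (a nearby CCAAT box, a CpG-rich window, agreement between servers) supports it.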

Note: These outputs are rather terse, to say the least. Doing this for real, you should read the help pages on each web site to get some idea of what is being done, and be willing to go to the literature if need be.


Step 3. Poly(A) site prediction

In mammalian genes, polyadenylation sites are usually preceded by AATAAA or ATTAAA ~20 bases before the cleavage site and followed by a more weakly conserved GT-based motif. While these motifs are trivial to find, they only function in the right context, which is harder to define and includes regulation by upstream splicing factors. An important rule to remember is that there should not be an in-frame stop codon in an internal exon, i.e. the true translation termination will be in the last exon. (Violations of this rule suppress mRNA production, to the cost of many experimentalists, and are occasionally used for differential mRNA regulation in vivo, e.g. for certain Ig splice variants.)
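
Finding the two hexamers really is trivial, as this short Python sketch shows. It only lists exact matches on the forward strand; it makes no attempt to judge context, which is the hard part. The function and constant names are illustrative assumptions, not part of any server used in this practical.

```python
import re

# The two poly(A) signal hexamers named in the text.
POLYA_SIGNALS = ("AATAAA", "ATTAAA")


def find_polya_signals(seq):
    """Return a sorted list of (1-based position, hexamer) for each match."""
    seq = seq.upper()
    hits = []
    for hexamer in POLYA_SIGNALS:
        # Lookahead so that overlapping occurrences are not skipped.
        for match in re.finditer("(?=" + hexamer + ")", seq):
            hits.append((match.start() + 1, hexamer))
    return sorted(hits)
```

Each hit is only a candidate: it becomes interesting if it lies downstream of the predicted last exon and its stop codon.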

We can't assess the context of these sites properly until we have the coding/splicing predictions to hand.



Step 4. Predicting Splice Sites and Coding Exons

There are a number of servers that separately predict splice sites and coding sequence bias but this information needs to be analysed together. We found that the CBS site in Denmark could provide all the information, though from two different servers. The NetGene2 server provides a graphical postscript output that we can print out and mark our predictions on. From the same group, the HMMgene server (using different algorithms) provides list output including potential Start and Stop codons. Both servers overpredict splice site candidates. In case you need reminding, classical splice sites look something like:
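
At their simplest, nearly all introns begin with GT and end with AG (the rare U12-dependent AT-AC class mentioned in the introduction is the exception). The Python sketch below assumes nothing beyond that dinucleotide rule and lists every candidate on the forward strand. It is not how NetGene2 or HMMgene work; it is only meant to show why the dinucleotides alone are hopeless as predictors and why the servers need the flanking consensus and the coding bias. The function name is made up for illustration.

```python
def splice_site_candidates(seq):
    """List naive donor and acceptor candidates on the forward strand.

    Donors are reported as the 1-based position of the G in GT
    (the first base of the putative intron). Acceptors are reported
    as the 1-based position of the G in AG (the last base of the
    putative intron).
    """
    seq = seq.upper()
    donors = []
    acceptors = []
    for i in range(len(seq) - 1):
        dinucleotide = seq[i:i + 2]
        if dinucleotide == "GT":
            donors.append(i + 1)
        elif dinucleotide == "AG":
            acceptors.append(i + 2)
    return donors, acceptors
```

On a random sequence each dinucleotide is expected about once every 16 bases, so a query of a few kilobases yields hundreds of candidates of each type.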



Step 5. Combining the server outputs into an overall prediction

We now have predictions for all the components needed to assemble the gene, rather inconveniently spread over many separate web outputs. We have to manually assemble all this into one prediction. This can be done on the NetGene2 and DNA sequence outputs using biro and fluorescent marker. The following guidelines may help.
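
One check follows directly from the stop codon rule in Step 3 and is easy to do in code once you have a candidate set of exon coordinates. The Python sketch below joins the coding parts of the exons and reports obvious problems: a length that is not a multiple of three, a missing ATG or stop codon, or an in-frame stop before the end. It assumes the gene is on the forward strand and that the coordinates cover only the coding portion of each exon; the function names and the coordinate format are illustrative assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def assemble_cds(seq, exons):
    """Join exon segments given as 1-based inclusive (start, end) pairs."""
    return "".join(seq[start - 1:end] for start, end in exons).upper()


def check_cds(cds):
    """Return a list of problems found in a candidate coding sequence."""
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of 3")
    if not cds.startswith("ATG"):
        problems.append("does not start with ATG")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    # Any stop codon before the final codon breaks the reading frame.
    for index, codon in enumerate(codons[:-1]):
        if codon in STOP_CODONS:
            problems.append(
                "in-frame stop codon at CDS position %d" % (index * 3 + 1)
            )
    return problems
```

An empty list from `check_cds(assemble_cds(sequence, exons))` does not prove the prediction is right, but a non-empty one means at least one splice site or the start codon has been chosen in the wrong place or frame.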




Step 6. Gene prediction by homology using GeneWise

Nowadays, related sequences are usually already present in the databases. When available, these may be the fastest way to get a good gene prediction. Often this prediction will be more reliable than the coding bias predictions, though one should be aware of the possibility of sequence error, differential splicing etc., and of course finding the coding exons is not a complete gene prediction. The GeneWise program has an exhaustive (slow) algorithm to align a protein to a DNA sequence, allowing for splice site recognition. (In a real situation, BLAST programs would be useful for first picking up the matches in a DB search.)


Take Home Lessons

There seems to be no single high quality tool for doing eukaryotic gene prediction work. The variations in the results from prediction servers indicate that there is scope for improving the algorithms. Graphical presentation of the results is patchy: for example, we need to know the start and stop codons and which frame has the coding potential, information that we did not get from the graphical plot. To do this for real, we would need to assemble results from many servers and work with a hard copy of the DNA sequence, and doing the job properly would take longer than the morning we set aside today. In fact the Staden package has for many years been able to produce a plot with all this information, although it was clumsy and old fashioned to use and some of the prediction methods may be more sensitive now. This package has been updated recently and we will have a look at how useful it is for gene prediction.


Here are the results if the Web servers don't work!

You can find this page at http://www.embl-heidelberg.de/~seqanal/courses/Jul01/GenePred.html