Practical
Course on Sequence Analysis
Eukaryotic Gene Prediction
by Toby Gibson, Aidan Budd, Chenna Ramu and Christine Gemünd, July 2001
The aim of this practical is to get hands-on experience running
some Web Servers for eukaryotic gene prediction on an "unknown" sequence.
We will use coding prediction programs together with promoter, splice and
poly-A site prediction programs. All the information gathered has to be
combined to find the coding regions of the gene. There is not time to go
into the theory behind the prediction methods.
As anyone who has looked at automatically annotated genome sequence
data will know, automatic gene prediction in higher eukaryotes does not
work very well. Perhaps this is not surprising, considering that differential
and tissue-specific splicing, genes within genes, TATA-less promoters,
the rare class of U12-dependent (AT-AC) splice sites and the need for
species-specific parameters all complicate the problem.
In practice, the bench biologist is best advised to inform themselves
of the species-specific signals (e.g. mammals tend to have a looser
splice site consensus than the worm) that they should expect and then run
several servers and synthesise the results, looking (in part) for consistency
between different servers. As there is a lot of activity in gene prediction
(even though much of it is "reinventing the wheel") we may hope for better
predictions and better presentation in the future.
Several web sites maintain useful lists of gene prediction services,
e.g.
-
Gene Prediction Links of the Bork Group
-
CNRS Marseille
-
Rockefeller University
-
Atelier Bioinformatique, Universite Aix-Marseille
WWW Servers used in the Gene Prediction practical
We will use:
-
NetGene2 and
HMMgene
to predict coding exons and splice sites in eukaryotes.
-
The LBL Promoter
prediction service and TSSG,
TSSW from the Sanger Centre.
-
The POLYAH server
for the prediction of the poly-A site.
-
GeneWise (from the Wise2
package) for aligning protein v. spliced DNA sequence.
Step 1 Getting the DNA sequence
For the practical we have chosen a human genomic sequence that may
perhaps encode an adenylate kinase...
-
Open a new navigator window and load this page into it.
-
Click here
to get the ~12 kb sequence (press on the middle mouse button to open a
new browser window).
You now have the sequence available in a form that can be cut and pasted
into the query forms we are going to use.
Step 2 Looking for a Promoter candidate
Promoter prediction is heavily dependent on finding good matches
to the TATAAAA motif. Further clues may be provided by CCAAT-Box, CpG islands
and other transcription factor binding sites. But if you run MatInspector
with the TRANSFAC DB on
our query, you will be astonished at the profusion of candidate sites throughout
the sequence - so we won't bother. We will run two promoter prediction
programs and tentatively assume that the best intersecting promoter is
the correct one.
-
Open a new navigator window and load this page into it.
-
Load the LBL Promoter
query submission page.
-
It is worth familiarising yourself with the layout and note the on-line
help.
-
Cut and paste the query sequence into the sequence box and submit the job.
-
When the result arrives, look at the set of predicted promoters.
-
Can you see matches to the TATA box consensus tATA(A/T)A(A/T)?
-
Which promoter has the silliest TATA-box?
-
Open a new navigator window and load this page into it.
-
Load the TSSG query
submission page.
-
Toggle on the TSSG button.
-
Cut and paste the query sequence into the sequence box and submit the job.
-
When the result arrives, look at the set of predicted promoters.
-
How many TATA boxes were found?
-
Are the listed transcription factor binding sites informative?
-
How well do the two searches agree on candidate promoters?
-
How many candidates do they both find?
-
Is there a single best candidate from combining the searches?
Note: These outputs are rather terse, to say the least. When doing this for
real, you should read the help pages on the web sites to get some idea of what is
being done, and be willing to go to the literature if need be.
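To get a feel for how weak the TATA signal is on its own, the motif search can be sketched in a few lines of Python. This is only a minimal sketch and not what the LBL or TSSG servers do: it assumes the consensus TATA(A/T)A(A/T), treats every position as equally important, and ignores CCAAT boxes, CpG islands and all other context. The function names are ours.

```python
import re

# Consensus TATA(A/T)A(A/T); the lookahead lets overlapping matches be reported.
TATA_BOX = re.compile(r"(?=(TATA[AT]A[AT]))")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_tata_boxes(seq):
    """Return (strand, 1-based leftmost position on the direct strand, motif)."""
    seq = seq.upper()
    hits = []
    for m in TATA_BOX.finditer(seq):
        hits.append(("+", m.start() + 1, m.group(1)))
    rc = reverse_complement(seq)
    for m in TATA_BOX.finditer(rc):
        # Map the match on the reverse strand back to the direct-strand
        # coordinate of the motif's leftmost base (the motif is 7 bases long).
        start = len(seq) - (m.start() + 7) + 1
        hits.append(("-", start, m.group(1)))
    return sorted(hits, key=lambda h: h[1])


print(find_tata_boxes("GGGTATAAAAGGG"))  # [('+', 4, 'TATAAAA')]
```

Run on the ~12 kb query, a search like this will typically report many more matches than there are real promoters, which is why the servers combine the TATA box with other evidence.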
Step 3. Poly(A) site prediction
In mammalian genes, polyadenylation sites are usually preceded by
AATAAA or ATTAAA ~20 bases before the cleavage site and followed by a more
weakly conserved GT-based motif. While these motifs are trivial to find,
they only function in the right context - which is harder to define and
includes regulation by upstream splicing factors. An important rule to
remember is that there should not be an in-frame stop codon in an internal
exon, i.e. the true translation termination will be in the last exon.
(Violations of this rule suppress mRNA production, to the cost of many
experimentalists, and this is occasionally used for differential mRNA regulation
in vivo, e.g. for certain Ig splice variants.)
-
(As needed, open a new navigator window and load this page into it.)
-
Load the POLYAH query
submission page.
-
Toggle on the POLYAH button.
-
Look at the POLYAH help and note the quoted prediction accuracy.
-
Cut and paste the query sequence into the sequence box and submit the job.
-
When the result arrives, look at the predicted poly(A) sites.
- How many candidate sites were found? (You need to open the numbered sequence to see this.)
-
If one or more of these sites are false, is the prediction accuracy as
good as claimed?
-
How might overprediction of poly(A) sites be avoided?
We can't assess the context of these sites properly until we have the coding/splicing
predictions to hand.
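The remark that the poly(A) motifs are trivial to find can be illustrated with a short Python sketch. It is not the POLYAH method: it simply lists every AATAAA or ATTAAA on the direct strand and, as a crude stand-in for the downstream GT-based motif, reports the fraction of G and T in a window after the signal. The window size of 40 bases and the function name are our own assumptions.

```python
import re

# The two common hexamer signals; the lookahead allows overlapping matches.
POLYA_SIGNAL = re.compile(r"(?=(AATAAA|ATTAAA))")


def find_polya_signals(seq, window=40):
    """Return (1-based start, hexamer, G+T fraction downstream) per signal.

    `window` is an assumed number of bases inspected after the hexamer.
    """
    seq = seq.upper()
    hits = []
    for m in POLYA_SIGNAL.finditer(seq):
        end = m.start() + 6                      # first base after the hexamer
        downstream = seq[end:end + window]
        gt = sum(base in "GT" for base in downstream)
        gt_fraction = gt / len(downstream) if downstream else 0.0
        hits.append((m.start() + 1, m.group(1), round(gt_fraction, 2)))
    return hits
```

A hexamer with a GT-poor downstream region, or one lying upstream of well-predicted coding exons, is a natural candidate for a false positive.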
Step 4. Predicting Splice Sites and Coding Exons
There are a number of servers that separately predict splice sites
and coding sequence bias but this information needs to be analysed together.
We found that the CBS site in Denmark could provide all the information,
though from two different servers. The NetGene2
server provides a graphical postscript output that we can print out and
mark our predictions on. From the same group, the HMMgene
server (using different algorithms) provides list output including potential
Start and Stop codons. Both servers overpredict splice site candidates.
In case you need reminding, classical splice sites look something like:
-
Donor Consensus:
(c/a)AG^GT(A/g)AGt
-
Acceptor Consensus:
(T>C)nN(C>T)AG^gt
-
(As needed, open a new navigator window and load this page into it.)
-
Load the NetGene2
query submission page.
-
Paste in the sequence and submit the job, which takes a few minutes to
run.
-
The output provides a list of candidate splice sites (on both strands)
and a graphical coding/splicing prediction.
-
However it is not clear which translation frame is supposed to be coding.
-
It is worth printing this figure out and using it to summarise our prediction
attempts!
- Click on the Direct strand link and save the compressed postscript output (has a .gz suffix).
-
Open a UNIX X-window (terminal from the desktop)
-
Uncompress the file by typing the UNIX command gunzip followed by the file name.
-
Print the file to the printer outside room V111 by typing
-
lpr -Plj-v111 filename.ps
-
Now load the HMMgene
query submission page.
- Get the FASTA format sequence and paste in the local file box.
- (These servers are rather fussy about sequence formats!)
- Select 5 best predictions and toggle on predict signals.
- Submit the job.
- The output appears confusing:
- It is a list of predicted exons, coding spans, coding starts and stops and splice donors and acceptors.
- Click on the Explanation link to understand the output format.
- We can now begin to assemble a complete gene prediction.
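To see why splice sites are overpredicted, it can help to score candidate donors against the classical consensus by hand. The Python sketch below is a simple match count, assuming the donor consensus (C/A)AG^GT(A/G)AGT with an obligatory GT. It is not the NetGene2 or HMMgene algorithm, it weights all nine positions equally, and the threshold of 7 matches is an arbitrary choice of ours.

```python
# Allowed bases at positions -3..+6 around the exon^intron boundary,
# following the donor consensus (C/A)AG^GT(A/G)AGT.
DONOR_CONSENSUS = ["CA", "A", "G", "G", "T", "AG", "A", "G", "T"]


def score_donors(seq, min_matches=7):
    """Return (last exon base, 1-based; site with ^ at the boundary; matches).

    Only sites with the invariant GT right after the boundary are scored;
    `min_matches` is an assumed cut-off out of 9 positions.
    """
    seq = seq.upper()
    hits = []
    for i in range(3, len(seq) - 5):
        if seq[i:i + 2] != "GT":
            continue
        window = seq[i - 3:i + 6]
        matches = sum(base in allowed
                      for base, allowed in zip(window, DONOR_CONSENSUS))
        if matches >= min_matches:
            hits.append((i, window[:3] + "^" + window[3:], matches))
    return hits


print(score_donors("TTTCAGGTAAGTTTT"))  # [(6, 'CAG^GTAAGT', 9)]
```

Lowering min_matches quickly produces a flood of candidates, which is the same overprediction problem the servers show.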
Step 5. Combining the server outputs into an overall prediction
We now have predictions for all the components needed to assemble the gene, rather inconveniently spread over many separate web outputs. We have to manually assemble all this into one prediction. This can be done on the NetGene2 and DNA sequence outputs using biro and fluorescent marker. The following guidelines may help.
-
Start from a strong point such as:
-
A well-predicted internal coding exon with good splice borders.
-
Work forwards and backwards toward the promoter and poly(A) boundary signals.
-
Reported splice site quality is not a completely robust guide to usage.
-
Context dependence is also important.
-
Splice sites tend to be overpredicted.
-
Some (true) splice sites might be better predicted by the HMMgene algorithm
than by NetGene2...
-
The terminal exon should be partially coding, including the stop codon
and the poly(A) signature.
-
The initiation codon should obey the Kozak rules:
-
It is normally the first methionine from the 5' end of the mRNA.
-
At least one of the two key residues, the purine (Pu) at position -3 and the G
at position +4, should be present in the consensus APuXXAUGG.
-
Once the prediction is completed, we can check it in the next exercise.
-
Good luck!
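Two of the guidelines above are easy to check mechanically once you have a candidate spliced sequence: the Kozak context of the initiation codon and the absence of in-frame stop codons before the end. The following Python sketch assumes you have already joined your predicted exons into one string; the function names are ours and the checks are deliberately simplistic.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_atg(mrna):
    """Return the 0-based index of the first ATG/AUG, or -1 if none."""
    return mrna.upper().replace("U", "T").find("ATG")


def kozak_check(mrna, atg_index):
    """Report whether the -3 purine and the +4 G flank the given ATG."""
    mrna = mrna.upper().replace("U", "T")
    if mrna[atg_index:atg_index + 3] != "ATG":
        raise ValueError("no ATG at the given index")
    minus3 = atg_index >= 3 and mrna[atg_index - 3] in "AG"
    plus4 = atg_index + 3 < len(mrna) and mrna[atg_index + 3] == "G"
    # The guideline asks for at least one of the two residues.
    return {"minus3_purine": minus3, "plus4_G": plus4, "ok": minus3 or plus4}


def premature_stops(cds):
    """Return 0-based codon indices of stop codons before the final codon."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    return [n for n, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]
```

For example, kozak_check("GCCACCATGGCG", 6) reports both residues present, and premature_stops("ATGTAAGGGTGA") returns [1], flagging the second codon. Any premature stop in your assembled coding sequence suggests a wrong splice site or a wrong frame somewhere upstream.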
Step 6. Gene prediction by homology using
GeneWise
Usually nowadays, related sequences are already present in the databases.
When available these may be the fastest way to get a good gene prediction.
Often this prediction will be more reliable than the coding bias predictions
though one should be aware of the possibility of sequence error, differential
splicing etc. and of course finding the coding exons is not a complete
gene prediction. The GeneWise
program has an exhaustive (slow) algorithm to align a protein to a DNA
sequence, allowing for splice site recognition. (In a real situation, BLAST
programs would be useful for first picking up the matches in a DB search.)
- Open a new X-window and type rlogin tau. Give your username and password.
-
Type prepare wise2 in the window. (prepare is a way to set up program environments on EMBL UNIX).
-
We've prepared files with the human DNA and a homologous chicken protein
to compare.
-
Now you can type or cut and paste the following command into the UNIX window:
-
genewise /home/seqanal/public_html/courses/Jul01/kad1_chick /home/seqanal/public_html/courses/Jul01/hsak1.dna
- GeneWise will run with default parameters and after a couple of minutes will print out the matched exons.
-
Now compare the results to the predictions so far
-
How many exons are found?
-
Are the splice sites between or within codons?
-
Did you find all these coding regions earlier?
-
Have we now found all the coding exons (the chicken homologue has 194
AA)?
-
Now let's look at the annotated genomic sequence entry for our test sequence,
HSAK1
-
Note that no cDNA has been sequenced for this gene: gene structure was
inferred by some transcript mapping and by protein homology.
-
Most of the elements of the gene are listed in the feature table.
-
Did you get the promoter?
-
Did you get the starting methionine? Does it obey Kozak's rules?
-
How many amino acids are in the first coding exon?
-
If you made any errors in the prediction, can you see where you went
wrong?
-
There is a problem with the annotation of the first intron's acceptor:
-
Do you think this is:
- a) an unusual splice site?
- b) an annotation error made by the authors?
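The questions about splice sites falling between or within codons, and about whether all the coding exons have been found, come down to simple arithmetic on the exon coordinates. Here is a minimal Python sketch, assuming you have typed in the coding part of each exon as 1-based inclusive (start, end) pairs in 5' to 3' order. The coordinates in the example are invented for illustration and are not the HSAK1 values.

```python
def coding_length(coding_segments):
    """Total number of coding bases over all (start, end) segments."""
    return sum(end - start + 1 for start, end in coding_segments)


def intron_phases(coding_segments):
    """Return the phase of each intron between consecutive coding segments.

    Phase 0: the intron lies between two codons.
    Phase 1 or 2: the intron splits a codon after its 1st or 2nd base.
    """
    phases = []
    total = 0
    for start, end in coding_segments[:-1]:
        total += end - start + 1
        phases.append(total % 3)
    return phases


# Hypothetical coordinates, for illustration only.
segments = [(101, 130), (201, 240), (301, 320)]
print(intron_phases(segments))   # [0, 1]
print(coding_length(segments))   # 90
```

For a 194 AA protein we would expect 194 x 3 = 582 coding bases, or 585 if the stop codon is counted. A total that falls short of this, or is not a multiple of three, suggests a missing exon or a misplaced splice site.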
Take Home Lessons
There seems to be no single high quality tool for doing eukaryotic
gene prediction work. The variations in the results from prediction servers
indicate that there is scope for improving the algorithms. Graphical presentation
of the results is patchy - for example we need to know start, stop codons
and which frame has the coding potential, information that we did not get
from the graphical plot. To do this for real, we would need to assemble
results from many servers and work with a hard copy of the DNA sequence,
and doing the job properly would take longer than the morning we set aside
today. In fact the Staden
package has for many years been able to produce a plot with all this
information, although it was clumsy and old fashioned to use and some of
the prediction methods may be more sensitive now. This package has been
updated recently and we will have a look at how useful it is for gene prediction.
Here
are the results if the Web servers don't work!
You can find this page at http://www.embl-heidelberg.de/~seqanal/courses/Jul01/GenePred.html