
Practical Course on Sequence Analysis

Eukaryotic Gene Investigation

Toby Gibson, Christine Gemünd and Chenna Ramu, May '00


The aim of this practical is to get hands-on experience with two web servers that can help to investigate eukaryotic genomic sequence, especially human. We are focussing particularly on the human genome servers since the genome is expected to be largely completed during this year. The Ensembl server is a collaboration between the EBI and the Sanger Centre to provide "real time" human genome annotation as the sequence is generated. The annotation is automatically generated from a combination of gene prediction, encoded protein homology and EST matches: obviously it is not going to be perfect. The second server that we want to introduce is one that we are developing and is not finished yet! The Gene2EST server allows you to submit gene-size queries (up to about 50,000 bp) to search the EST databases and provides both graphical and alignment output: in favourable cases - i.e. plentiful ESTs - Gene2EST can rapidly give a good description of a gene structure.

We did plan to show the human genome server at the Sanger Centre too, but the exercises caused too many problems, so we dropped it.


Exercise 1. Querying Ensembl with a protein sequence

Currently, Ensembl is not easy to query by keyword. It is best queried by sequence and provides quite a flexible combination of protein/DNA query/databases. For example, it is well suited to probing with ESTs of interest. Today we are going to use a protein sequence as query.
 

  • Getting the Query:
  • Querying Ensembl:
  • Notes:

    Good starting points for querying Ensembl are protein, EST or cDNA sequences of interest. Ensembl does automatic annotation, so results are not likely to be perfect. Bear in mind that sequence is being annotated as fast as it is being generated: genes may be only partially sequenced. While results are imperfect, the rapid deployment of gene predictions and the links to homologous sequences could be very useful. The underlying data is made available and the user can - and should - evaluate the predicted gene annotation quality.

    Exercise 2. Using ESTs to reveal gene structure with Gene2EST

    The BLAST2 program is well suited to EST detection. Because it tolerates small gaps in alignments, it deals well with sequencing errors in the ESTs. However, it is painfully time-consuming to work through the output, especially if there are many hundreds of matches. Furthermore, most BLAST2 servers will not allow the user to submit large query sequences. We have tried to address these deficiencies with Gene2EST. The Gene2EST server presents BLAST2-derived output in a way that allows the user to rapidly evaluate the results. A graphical display (viewed with Artemis) provides an overview of the results, while an alignment of the ESTs on the query sequence lets the user follow up the exons in detail.
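To make the idea of "ESTs revealing exons" concrete, here is a minimal Python sketch. It is not how Gene2EST itself works internally (the source does not describe that); it only assumes that each EST match has been reduced to a (start, end) interval on the query sequence. The function names `merge_hits` and `coverage_depth` and the example coordinates are hypothetical. Overlapping matches are merged into candidate exon regions, and the number of supporting ESTs is counted for each region.

```python
from typing import List, Tuple

# (start, end) on the query sequence, 1-based and inclusive (an assumption of this sketch)
Interval = Tuple[int, int]


def merge_hits(hits: List[Interval], max_gap: int = 0) -> List[Interval]:
    """Merge overlapping (or nearly adjacent) EST match intervals into candidate regions."""
    merged: List[Interval] = []
    for start, end in sorted(hits):
        # Extend the previous region if this hit overlaps it or lies within max_gap bases of it.
        if merged and start <= merged[-1][1] + max_gap + 1:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def coverage_depth(hits: List[Interval], region: Interval) -> int:
    """Count how many EST matches overlap the given region."""
    r_start, r_end = region
    return sum(1 for s, e in hits if s <= r_end and e >= r_start)


# Hypothetical match coordinates, for illustration only.
hits = [(120, 310), (150, 305), (2900, 3050), (2950, 3100), (8000, 8200)]
regions = merge_hits(hits)
for region in regions:
    print(region, "supported by", coverage_depth(hits, region), "EST match(es)")
```

Regions supported by many independent ESTs are more convincing exon candidates than regions supported by a single match, which is the same judgement one makes by eye in the graphical display.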
     

  • Getting the Query:
  • Using Gene2EST:
  • Examining the alignment output:
  • Examining the Graphical output with Artemis:
  • Repeat using HSAK1, the gene sequence from yesterday's prediction practical:
  • Notes:

    Our chosen example works spectacularly well: ESTs alone can quickly reveal the entire gene structure. However, if we took the EMBL entry AF106656 (the human adenylosuccinate lyase gene) there would be problems. About half of the 20 kb sequence consists of dispersed repeat sequences matching highly similar ESTs. Then there are many ESTs primed from intronic poly-A runs in the genomic sequence. The gene itself is not very highly expressed, so we struggle to find the true ESTs in the flood of noisy matches. Fortunately the program RepeatMasker can filter out repetitive elements. We hope to build this program into the Gene2EST server. If there is time, it is an instructive exercise to see the effects of using RepeatMasker-filtered and unfiltered sequence from AF106656 in Gene2EST. After that exercise, one is left to wonder what fraction of the ~2 million public ESTs belong in the "really useful" category...
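The two sources of noise described above can be illustrated with a short Python sketch. It does not call RepeatMasker and does not reproduce its algorithm; it only assumes that some tool has already reported repeat positions as 1-based (start, end) intervals, and shows the masking step (replacing those bases with N so they no longer attract matches). A simple regular expression also locates poly-A runs on the forward strand, the kind of region that can attract spurious ESTs. The function names, the `min_len` threshold of 12 and the toy sequence are all hypothetical.

```python
import re
from typing import List, Tuple


def mask_intervals(seq: str, intervals: List[Tuple[int, int]]) -> str:
    """Replace each 1-based inclusive interval of seq with N characters."""
    chars = list(seq)
    for start, end in intervals:
        # Clamp to the sequence bounds so out-of-range intervals are harmless.
        for i in range(max(start, 1) - 1, min(end, len(chars))):
            chars[i] = "N"
    return "".join(chars)


def find_poly_a_runs(seq: str, min_len: int = 12) -> List[Tuple[int, int]]:
    """Return 1-based inclusive coordinates of runs of at least min_len A's (forward strand only)."""
    pattern = re.compile(r"A{%d,}" % min_len, re.IGNORECASE)
    return [(m.start() + 1, m.end()) for m in pattern.finditer(seq)]


# Toy sequence with one internal poly-A run, for illustration only.
seq = "GATTACG" + "A" * 15 + "CGTACGT"
runs = find_poly_a_runs(seq)
masked = mask_intervals(seq, runs)
print(runs)
print(masked)
```

Comparing search results for the masked and unmasked versions of a sequence is the same experiment as the RepeatMasker exercise suggested above, on a much smaller scale.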

    Take Home Lessons

    We looked at two different servers that can help to investigate genes in human genomic sequence. Gene2EST will give a good overview of a gene structure, provided that sufficient ESTs are present, and can reveal alternative splicing. It will be useless if there are no ESTs derived from the query gene. Ensembl provides diagrammatic summaries of gene prediction, protein homology and cDNA/EST matches that can be helpful in defining a gene, but are currently rather terse. HGP at the Sanger Centre (which we did not use today) provides some more information, including some chromosomal overviews and links to other resources, such as OMIM. These human genomic servers are likely to develop in usefulness as more sequence is completed.


    You can find this page at http://www.embl-heidelberg.de/~seqanal/courses/spring00/GeneInv.00.html