Biocomputing Unit
Biocomputing
Sequence Analysis Service
Gibson Group
EMBL

Practical Course on Sequence Analysis

Eukaryotic Gene Investigation

Toby Gibson, Aidan Budd, Christine Gemünd and Chenna Ramu, May '01


The aim of this practical is to get hands-on experience with three web servers that can help to investigate eukaryotic genomic sequence, especially human. We are focussing particularly on the human genome servers since the genome is now largely complete. The Ensembl server is a collaboration between the EBI and the Sanger Centre to provide "real time" human genome annotation as the sequence is generated. The annotation is generated automatically from a combination of gene prediction, encoded protein homology and EST matches: obviously it is not going to be perfect. The Human Genome Browser developed in Santa Cruz is another interface useful for getting an overview of the genome. Ensembl and the Genome Browser have links to each other. The third server that we want to introduce is one that we have developed at EMBL. The Gene2EST server will allow you to submit gene-size queries (if need be >100,000 bp) to search the EST databases and provides both graphical and alignment output: in favourable cases, i.e. with plentiful ESTs, Gene2EST can rapidly give a good description of a gene structure, including alternative splicing.


Exercise 1. Querying Ensembl with a protein sequence

Ensembl can be queried by keyword or by sequence similarity, depending on your needs. It supports quite a flexible combination of protein or DNA queries against protein or DNA databases. For example, it is well suited to probing with ESTs of interest. Today we are going to use a protein sequence as the query.

Getting the Query:

Querying Ensembl

Notes:

Good starting points for querying Ensembl are protein, EST or cDNA sequences of interest. Ensembl does automatic annotation, so results are not likely to be perfect. Bear in mind that sequence is being annotated as fast as it is being generated: many genes may be only partially sequenced. While results are imperfect, the rapid deployment of gene predictions and the links to homologous sequences could be very useful. The underlying data is made available and the user can - and should - evaluate the predicted gene annotation quality.

Exercise 2. Using the Genome Browser to evaluate whether two chromosomes are related

The Human Genome Browser is another useful interface to the human genome. We can easily click between Ensembl and the Browser and back again. You may find that some things are easier in one of these browsers than in the other, though this may change over time as both are being rapidly developed. The exercise we are going to do now could be done in Ensembl too, but was easier in the Browser when we tried it. Next year, who knows?

There is currently some controversy over whether the vertebrate genome underwent tetraploidy or even octaploidy a long time ago in a common ancestor. We know that there are eight Src paralogues in the human genome. One of them, HCK, is close to Src on chromosome 20: this gene pair was formed by a tandem duplication. Another, LYN, is found on chromosome 8, so it might have arisen by genome duplication. We are going to see whether there are other genes close to Src and LYN that exhibit synteny. Syntenic chromosomal regions possess similar genes.

Going from Ensembl to the Genome Browser:

Finding a set of genes linked to Src:

Finding a set of genes linked to LYN:

Notes:

This exercise shows quite well the current state of the human genome sequence. There are some nearly finished regions and some less good ones. There also seem to be some gaps between clone contigs, so a contig could turn out to be inverted, which would change the gene order. In addition, the order of the sequence contigs within a clone is not always correct, although the project does try to order them, and this could disrupt local gene prediction. In other words, the fine mapping is not reliable in the rough sequence. It will be interesting to see how much the gene order changes in these two regions in the future and whether they come to look a bit more like each other.


Exercise 3. Using ESTs to reveal gene structure with Gene2EST

The BLAST2 program is well suited to EST detection. Because it tolerates small gaps in alignments, it deals well with sequencing errors in the ESTs. However, it is painfully time-consuming to work through the output, especially if there are many hundreds of matches. Furthermore, most BLAST2 servers will not allow the user to submit large query sequences. We have tried to address these deficiencies with Gene2EST. The Gene2EST server presents BLAST2-derived output in a way that allows the user to evaluate the results rapidly. A graphical display (viewed with Artemis) provides an overview of the results, while an alignment of the ESTs on the query sequence lets the user follow up the exons in detail.
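To illustrate why an overview of many EST matches is easier to read than a long list of individual hits, here is a minimal Python sketch that collapses overlapping hit intervals on a genomic query into candidate exon blocks. It is not how Gene2EST itself works; the function name `merge_hits`, the `max_gap` parameter and the coordinates are invented for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) on the genomic query, 1-based inclusive


def merge_hits(hits: List[Interval], max_gap: int = 0) -> List[Interval]:
    """Merge overlapping or directly adjacent hit intervals into blocks.

    Intervals separated by at most `max_gap` unmatched bases are also merged.
    """
    merged: List[Interval] = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1] + max_gap + 1:
            # Extend the current block if this hit overlaps or touches it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Invented EST hit coordinates on a hypothetical genomic query sequence.
example_hits = [(120, 310), (150, 305), (2890, 3050),
                (2900, 3048), (7400, 7655), (300, 312)]

for block_start, block_end in merge_hits(example_hits):
    print(f"candidate exon block: {block_start}..{block_end}")
```

Each block is only a candidate: hits from repeats or from poly-A priming would be merged in exactly the same way, so the alignments still need to be inspected.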

  • Getting the Query:
  • Using Gene2EST:
  • Examining the alignment output:
  • Examining the Graphical output with Artemis:
  • Repeat using HSAK1, the gene sequence from yesterday's prediction practical:

    Notes:

    Our chosen example works spectacularly well: ESTs alone can quickly reveal the entire gene structure. However, if we took the EMBL entry AF106656 (the human adenylosuccinate lyase gene) there might be problems. About half of the 20 kb sequence consists of dispersed repeat sequences that match highly similar ESTs. Fortunately, the program RepeatMasker can filter out repetitive elements and is on by default. Then there are many ESTs primed from intronic poly-A runs in the genomic sequence. The gene itself is not very highly expressed, so we could struggle to find the true ESTs in the flood of noisy matches. If there is time, it is an instructive exercise to compare the Gene2EST results for the AF106656 sequence with and without RepeatMasker filtering. After that exercise, one is left to wonder what fraction of the ~2 million public ESTs belong in the "really useful" category....
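As a rough aid for spotting the intronic poly-A runs mentioned above, the following Python sketch lists runs of consecutive A bases in a sequence. The minimum run length of 10 is an arbitrary assumption, not a value taken from Gene2EST or RepeatMasker, and the toy sequence is invented.

```python
import re
from typing import List, Tuple


def find_poly_a_runs(sequence: str, min_length: int = 10) -> List[Tuple[int, int]]:
    """Return (start, end) positions, 1-based inclusive, of runs of A
    that are at least `min_length` bases long. Case is ignored."""
    pattern = re.compile("A{%d,}" % min_length)
    return [(m.start() + 1, m.end()) for m in pattern.finditer(sequence.upper())]


# Invented toy sequence: a 12-base run, a 5-base run (too short), a 10-base run.
toy_sequence = "GATTC" + "A" * 12 + "GCGT" + "a" * 5 + "CC" + "A" * 10

for run_start, run_end in find_poly_a_runs(toy_sequence):
    print(f"poly-A run at {run_start}..{run_end}")
```

EST matches that end at or next to such a run inside an intron are worth treating with suspicion, since they may come from internal priming rather than from the true 3' end of a transcript.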

    Take Home Lessons

    We looked at three different servers that can help to investigate genes in human genomic sequence. Gene2EST will give a good overview of a gene structure, provided that sufficient ESTs are present, and can reveal alternative splicing. It will be useless if there are no ESTs derived from the query gene. Ensembl provides diagrammatic summaries of gene prediction, protein homology and cDNA/EST matches that can be helpful in defining a gene, but are currently rather terse. The human genome browser provides much of the same information as Ensembl but is currently better for viewing larger segments of chromosomes. These human genomic servers are likely to develop in usefulness as more sequence is completed.


    The gene orders for exercise 2 using the December 2000 release of the human genome are:

    SRC, HNF4A, PKIG, MATN4, EYA2, NCOA3, BIG2, STAU

    LYN, BIG1, NCOA2, EYA1, STAU2, HNF4G, PKIA, MATN2

    By the way, LYN is actually a closer relative of HCK than of SRC, but HCK is further from the rest of these genes than SRC, so we used SRC as the starting point. As the sequence is improved, it will be interesting to see whether these orders correspond to the true gene order, with the differences between the two regions due to a chromosomal inversion.
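The two gene orders listed above can be compared with a short Python sketch. The pairing of genes into paralogue families is our assumption, made purely from the similarity of the gene names (for example HNF4A with HNF4G), with SRC paired to LYN as the two Src-family anchors. The function names are invented for illustration.

```python
from typing import Dict, List

# Gene orders as listed above (December 2000 release).
chr20_order = ["SRC", "HNF4A", "PKIG", "MATN4", "EYA2", "NCOA3", "BIG2", "STAU"]
chr8_order = ["LYN", "BIG1", "NCOA2", "EYA1", "STAU2", "HNF4G", "PKIA", "MATN2"]

# Assumed paralogue pairs, inferred from gene names only.
paralogue_of: Dict[str, str] = {
    "SRC": "LYN", "HNF4A": "HNF4G", "PKIG": "PKIA", "MATN4": "MATN2",
    "EYA2": "EYA1", "NCOA3": "NCOA2", "BIG2": "BIG1", "STAU": "STAU2",
}


def mapped_positions(order_a: List[str], order_b: List[str],
                     pairs: Dict[str, str]) -> List[int]:
    """For each gene in order_a, give the index of its paralogue in order_b."""
    index_b = {gene: i for i, gene in enumerate(order_b)}
    return [index_b[pairs[gene]] for gene in order_a]


def conserved_adjacencies(positions: List[int]) -> int:
    """Count neighbouring genes in order_a whose paralogues are also neighbours."""
    return sum(1 for a, b in zip(positions, positions[1:]) if abs(a - b) == 1)


positions = mapped_positions(chr20_order, chr8_order, paralogue_of)
print(positions)
print(conserved_adjacencies(positions), "of", len(positions) - 1, "adjacencies conserved")
```

Read this way, the HNF4A, PKIG, MATN4 block keeps its orientation on chromosome 8, while the EYA2, NCOA3, BIG2 block appears in reverse order, which is the pattern a local inversion or a misoriented contig would produce.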


    You can find this page at http://www.embl-heidelberg.de/~seqanal/courses/Jul01/GeneInv.01.html