Biocomputing Unit
Sequence Analysis Service
Gibson Group

Autumn 01 Course

Eukaryotic Gene and Genome Investigation

Toby Gibson, Aidan Budd and Chenna Ramu, November 26th-29th 2001

The aim of this practical is to get hands-on experience with three web servers that can help to investigate eukaryotic genomic sequence - especially human. We are focussing particularly on the human genome servers since the genome is now largely completed. The Ensembl server is a collaboration between the EBI and the Sanger Centre to provide "real time" human genome annotation as the sequence is generated. The annotation is automatically generated from a combination of gene prediction, encoded protein homology and EST matches: obviously it is not going to be perfect. The Human Genome Browser developed in Santa Cruz is another interface useful for getting an overview of the genome. Ensembl and the Genome Browser have links to each other. The third server that we want to introduce is one that we have developed at EMBL. The Gene2EST server allows you to submit gene-size queries (if need be >100,000 bp) to search the EST databases and provides both graphical and alignment output: in favourable cases - i.e. plentiful ESTs - Gene2EST can rapidly give a good description of a gene structure, including alternative splicing.

Getting started

The teaching machines are INTEL PCs running the LINUX OS. It will take a few moments to get set up.

Exercise 1. Querying Ensembl with a protein sequence

Ensembl can be queried by keyword or by sequence similarity, depending on your needs. It provides quite a flexible combination of protein/DNA queries and databases. For example, it is well suited to probing with ESTs of interest. Today we are going to use a protein sequence as the query.
Getting the Query:

Querying Ensembl


Good starting points for querying Ensembl are protein, EST or cDNA sequences of interest. Ensembl does automatic annotation, so results are not likely to be perfect. Bear in mind that sequence is being annotated as fast as it is being generated: many genes may be only partially sequenced. While results are imperfect, the rapid deployment of gene predictions and the links to homologous sequences could be very useful. The underlying data is made available and the user can - and should - evaluate the predicted gene annotation quality.

Exercise 2. Using the Genome Browser to evaluate whether two chromosomes are related

The Human Genome Browser is another useful interface to the human genome. We can easily click between Ensembl and the Browser and back again: You may find that some things are easier in one of these browsers than the other - though this may change over time as they are both being rapidly developed. The exercise we are going to do now could be done in Ensembl too but was easier in the Browser as we tried it - but next year who knows?

There is currently some controversy over whether the vertebrate genome underwent tetraploidisation or even octaploidisation a long time ago in a common ancestor. We know that there are eight Src paralogues in the human genome. One of them, HCK, is close to Src on chromosome 20: this gene pair was formed by a tandem duplication. Another, LYN, is found on chromosome 8, so it might have arisen by genome duplication. We are going to see whether there are other genes close to Src and LYN that exhibit synteny. Syntenic chromosomal regions possess similar genes.
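The comparison we are about to do by eye can be sketched in a few lines of Python. This is only an illustration of the idea, not part of any of the servers: the function name shared_families is our own, and apart from SRC, HCK and LYN the gene names and family labels are hypothetical placeholders, not real annotation. Each region is a list of (gene, family) pairs, and we ask which families have members in both regions.

```python
def shared_families(region_a, region_b):
    """Return the gene families that have members in both regions.

    Each region is a list of (gene_name, family_label) pairs.
    The result maps family -> (genes in region_a, genes in region_b).
    """
    def by_family(region):
        families = {}
        for gene, family in region:
            families.setdefault(family, []).append(gene)
        return families

    fams_a = by_family(region_a)
    fams_b = by_family(region_b)
    return {fam: (fams_a[fam], fams_b[fam]) for fam in fams_a if fam in fams_b}


# Hypothetical example data: only SRC, HCK and LYN are real gene names here.
region_src = [("SRC", "src_kinase"), ("HCK", "src_kinase"),
              ("geneA1", "family_A"), ("geneB1", "family_B")]
region_lyn = [("LYN", "src_kinase"), ("geneA2", "family_A"),
              ("geneC1", "family_C")]

shared = shared_families(region_src, region_lyn)
for family, (genes_a, genes_b) in shared.items():
    print(family, genes_a, genes_b)
```

The more families two regions share beyond the Src paralogues themselves, the more plausible it is that the regions are duplicates. Note that this comparison looks only at gene content, not gene order, so it is not affected by uncertainty in contig orientation.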

One very important aspect of interpreting genome data from human (and other organisms with large genomes) is the quality of the assembly in a particular region. Where a region consists of many finished, overlapping clones, one has a very high confidence that factors such as gene orientation, intron size, gene order etc. are accurately modelled. However, where there are a large number of gaps both within unfinished clones, and also (even worse) between clones or even fingerprinted contigs, one's confidence in such predictions is much reduced. In this exercise we will use different tracks in the Genome Browser to assess sequence quality in our regions of interest.

Going from Ensembl to the Genome Browser:

Finding a set of genes linked to Src:
Finding a set of genes linked to LYN:
The aim of this part of the exercise is to see whether a region of the genome has several genes in it which are similar to genes that are near to (syntenic with) Src. If this is so, then the reason for this sharing of similar genes might be that the regions are in fact duplicates.
Notes:

This exercise shows quite well the current state of the human genome sequence. There are some nearly finished regions and some less good ones. Also there seem to be some gaps between clone contigs - so the contigs could invert, changing gene order. Also the sequence contig order within a clone is not always correct - although the project does try to order them - which could disrupt local gene prediction. In other words, the fine mapping is not reliable in the rough sequence. It will be interesting to see how much the gene order changes in these two regions in the future and whether they come to resemble each other a bit more.

Exercise 3. Using ESTs to reveal gene structure with Gene2EST

The BLAST2 program is well suited to EST detection. Because it tolerates small gaps in alignments, it deals well with sequencing errors in the ESTs. However, it is painfully time-consuming to work through the output, especially if there are many hundreds of matches. Furthermore, most BLAST2 servers will not allow the user to submit large query sequences. We have tried to address these deficiencies with Gene2EST. The Gene2EST server presents BLAST2-derived output in a way that allows the user to rapidly evaluate the results. A graphical display (viewed with Artemis) provides an overview of the results, while an alignment of the ESTs on the query sequence lets the user follow up the exons in detail.
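To see why a pile of EST matches can outline a gene structure, consider the following minimal Python sketch. It is not how Gene2EST is implemented; it only illustrates the principle that overlapping EST alignments on the genomic query pile up over exons and leave the introns uncovered. The function name merge_hits and the coordinates are invented for the example (1-based, inclusive positions on the query).

```python
def merge_hits(hits, max_gap=0):
    """Merge overlapping (start, end) hit intervals into candidate exon blocks.

    Two intervals are merged when the second starts no later than
    max_gap positions beyond the end of the current block.
    """
    merged = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1] + max_gap:
            # Overlaps the current block: extend it if this hit reaches further.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(block) for block in merged]


# Made-up EST hit coordinates on a genomic query sequence.
hits = [(100, 250), (120, 260), (900, 1050), (910, 1040), (3000, 3200)]

blocks = merge_hits(hits)

# The uncovered stretches between consecutive blocks are candidate introns.
introns = [(prev_end + 1, next_start - 1)
           for (_, prev_end), (next_start, _) in zip(blocks, blocks[1:])]

print(blocks)
print(introns)
```

In real data the block edges are only approximate: the alignment output still has to be checked for proper splice sites, and spurious matches (for example to repeats) will produce blocks that are not exons at all.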

  • Getting the Query:
  • Using Gene2EST:
  • Examining the alignment output:
  • Examining the Graphical output with Artemis:
  • Repeat using HSAK1, the gene sequence from yesterday's prediction practical:
  • Notes:

    Our chosen example works spectacularly well: ESTs alone can quickly reveal the entire gene structure. However, if we took the EMBL entry AF106656 (the human adenylosuccinate lyase gene) there might be problems. About half of the 20 kb sequence consists of dispersed repeat sequences matching highly similar ESTs. Fortunately the program RepeatMasker can filter out repetitive elements and is on by default. Then there are many ESTs primed from intronic poly-A runs in the genomic sequence. The gene itself is not very highly expressed, so we could struggle to find the true ESTs in the flood of noisy matches. If there is time, it is an instructive exercise to compare the Gene2EST results for the sequence from AF106656 with and without RepeatMasker filtering. After that exercise, one is left to wonder what fraction of the ~4 million public ESTs belong in the "really useful" category....

    Take Home Lessons

    We looked at three different servers that can help to investigate genes in human genomic sequence. Gene2EST will give a good overview of a gene structure, provided that sufficient ESTs are present, and can reveal alternative splicing. It will be useless if there are no ESTs derived from the query gene. Ensembl provides diagrammatic summaries of gene prediction, protein homology and cDNA/EST matches that can be helpful in defining a gene, but are currently rather terse. The Human Genome Browser provides much of the same information as Ensembl but is currently better for viewing larger segments of chromosomes. These human genomic servers are likely to become more useful as more sequence is completed.

    The gene orders for exercise 2 using the December 2000 release of the human genome are:



    By the way, LYN is actually a closer relative of HCK than of SRC, but HCK is further from the rest of these genes than SRC, so we used SRC as the starting point. As the sequence is improved, it will be interesting to see whether this corresponds to the true gene order, resulting from a chromosomal inversion.

    You can find this page at