Practical Course on Sequence Analysis
Eukaryotic Gene Investigation
Toby Gibson, Aidan Budd, Christine Gemünd and Chenna Ramu, May '01
The aim of this practical is to get hands-on experience with three web servers that can help to investigate eukaryotic genomic sequence, especially human. We are focussing particularly on the human genome servers since the genome is now largely complete. The Ensembl server is a collaboration between the EBI and the Sanger Centre to provide "real time" human genome annotation as the sequence is generated. The annotation is automatically generated from a combination of gene prediction, encoded protein homology and EST matches: obviously it is not going to be perfect. The Human Genome Browser developed in Santa Cruz is another interface useful for getting an overview of the genome. Ensembl and the Genome Browser have links to each other. The third server that we want to introduce is one that we have developed at EMBL. The Gene2EST server allows you to submit gene-size queries (if need be >100,000 bp) to search the EST databases and provides both graphical and alignment output: in favourable cases, i.e. with plentiful ESTs, Gene2EST can rapidly give a good description of a gene structure, including alternative splicing.
Exercise 1. Querying Ensembl with a protein sequence
Ensembl can be queried by keyword or by sequence similarity, depending on your needs. It provides quite a flexible combination of protein/DNA queries and databases. For example, it is well suited to probing with ESTs of interest. Today we are going to use a protein sequence as the query.
Getting the Query:
- In a Netscape window, go to SRS and click Start.
- Click the SWISSPROT Box and then click Continue.
- Type SRC_HUMAN in an ID box and click Do Query.
- Click on the SWISSPROT:SRC_HUMAN entry. This entry contains the sequence for human SRC oncoprotein.
- Click on the Save button.
- Set Use view to FastaSeqs then click SAVE.
- The sequence is now in a suitable format.
Querying Ensembl
- Click to Ensembl.
- Click to the BLAST page.
- Cut and paste the sequence into the box.
- Select the database of Ensembl Confirmed peptides.
- Check that the protein version of BLAST will run.
- Click the search button.
- The search should take a few seconds to run.
- Examine the BLAST output:
- Which chromosome has the top hit?
- What happens when you point the mouse at a coloured chromosomal band?
- Now click on one with the middle button.
- Is the top hit the Src sequence?
- Check the second best hit (use the middle button):
- Is it less than 70% identical to the query?
- Is it on chromosome 6?
- Click on the link from the alignment of the top hit. Ensembl will give a graphical representation.
- Which part of the overview provides the detailed view?
- Look at the sources of annotation data:
- Which of (a) gene prediction, (b) protein sequence, (c) EST match, are used to find genes?
- (As needed, use the middle button to click on the items).
- What information does a Transcript link provide?
- How many known genes are within 50,000 bases either side of Src?
- Is Src on the + or - DNA strand?
- How consistent are the different methods to find genes?
- Do they agree well or badly?
- Are any exons missing in some of the methods?
- Are there many repeat elements in this region?
- Click on an Ensembl gene.
- Review the information supplied.
- Click on Transcript > supporting evidence:
- Are there any database proteins that matched the gene?
- Are all the exons supported?
- Are the 5' exons better supported than the 3' exons?
Notes:
Good starting points for querying Ensembl are protein, EST or cDNA sequences of interest. Ensembl does automatic annotation, so results are not likely to be perfect. Bear in mind that sequence is being annotated as fast as it is being generated: many genes may be only partially sequenced. While results are imperfect, the rapid deployment of gene predictions and the links to homologous sequences could be very useful. The underlying data is made available and the user can - and should - evaluate the predicted gene annotation quality.
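When checking whether a hit is "less than 70% identical", it helps to be clear about how the figure is computed. The sketch below assumes the BLAST-style convention, in which identical columns are divided by the full alignment length, so gap columns lower the score. The function name and the toy alignment are invented for illustration and are not real SRC sequence.

```python
def percent_identity(aligned_query: str, aligned_subject: str) -> float:
    """Identical columns as a percentage of the alignment length.

    Gap columns ('-') count towards the length but never as matches.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    matches = sum(
        1
        for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != "-"
    )
    return 100.0 * matches / len(aligned_query)


# Toy pairwise alignment (hypothetical residues, one gap in the subject).
query = "MGSNKSKPKDASQRRR"
subject = "MGCVKSKPKD-SQRKR"

identity = percent_identity(query, subject)
print(f"{identity:.1f}% identity over {len(query)} columns")
print("below 70% threshold" if identity < 70 else "at or above 70% threshold")
```

Note that a short, highly identical HSP can score above a long, moderately identical one, so the percentage should always be read together with the alignment length.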
Exercise 2. Using the Genome Browser to evaluate whether two chromosomes are related
The Human Genome Browser is another useful interface to the human genome. We can easily click between Ensembl and the Browser and back again. You may find that some things are easier in one of these browsers than the other, though this may change over time as both are being rapidly developed. The exercise we are going to do now could be done in Ensembl too, but was easier in the Browser when we tried it; next year, who knows?
There is currently some controversy over whether the vertebrate genome underwent tetraploidy or even octaploidy a long time ago in a common ancestor. We know that there are eight Src paralogues in the human genome. One of them, HCK, is close to Src on chromosome 20: this gene pair was formed by a tandem duplication. Another, LYN, is found on chromosome 8, so might have arisen by genome duplication. We are going to see whether there are other genes close to Src and LYN that exhibit synteny. Syntenic chromosomal regions possess similar genes.
Going from Ensembl to the Genome Browser:
- From Ensembl, click on Jump to UCSC.
- We will get a new window with another graphic of the Src genomic region.
- Click on coverage:
- This expands to reveal the clones.
- If they are all black they are fully sequenced.
- You can set coverage back to dense with the controls below.
- Now adjust the figure size:
- Set pixel width to 1000 and click jump.
- If need be fine tune the size some more so it fills your screen nicely.
Finding a set of genes linked to Src:
- Use the zoom out buttons to get a view with about 20-30 genes.
- As needed, customise the plot to minimise unwanted info such as RNA genes.
- Work rightward of Src using the move buttons to establish the gene order for:
- BIG2, EYA2, HNF4A, MATN4, NCOA3, PKIG, SRC and STAU
- Draw out the gene order on paper.
- Are any of these genes fragmented?
- Also note the position of any major gaps that are present between the genes.
Finding a set of genes linked to LYN:
- Open the Genome Home link with the middle mouse button.
- Click on the Browser link.
- Type LYN into the genome position box and submit.
- Click on the top matching link to get the graphical display.
- Set up the figure as before to view about 20-30 genes.
- From the coverage and gap summaries, is the sequence as good as for the SRC region?
- Work rightward of LYN using the move buttons to establish the gene order for the genes paralogous to those in the list above.
- (The genes will have similar names with different numbers.)
- Also note the position of any major gaps that are present between the genes.
- We can now compare the two regions:
- Is the gene order conserved?
- What is the longest conserved gene order?
- Are there any inversions of gene order?
- Do you think the inversions are real or due to error in the assembly and mapping?
Notes:
This exercise shows quite well the current state of the human genome sequence. There are some nearly finished regions and some less good ones. There also seem to be some gaps between clone contigs, so the contigs could be inverted, changing the apparent gene order. In addition, the order of sequence contigs within a clone is not always correct, although the project does try to order them, and this could disrupt local gene prediction. In other words, the fine mapping is not reliable in the rough sequence. It will be interesting to see how much the gene order changes in these two regions in the future, and whether they come to resemble each other more.
Exercise 3. Using ESTs to reveal gene structure with Gene2EST
The BLAST2 program is well suited to EST detection. Because it tolerates small gaps in alignments, it deals well with sequencing errors in the ESTs. However, it is painfully time-consuming to work through the output, especially if there are many hundreds of matches. Furthermore, most BLAST2 servers will not allow the user to submit large query sequences. We have tried to address these deficiencies with Gene2EST. The Gene2EST server presents BLAST2-derived output in a way that allows the user to evaluate the results rapidly. A graphical display (viewed with Artemis) provides an overview of the results, while an alignment of the ESTs on the query sequence lets the user follow up the exons in detail.
Getting the Query:
- In a Netscape window, go to SRS and click Start.
- Click the EMBL Box and then click Continue.
- Type AF004877 in an ID box and click Do Query.
- Click on the EMBL:AF004877 entry. This entry contains the gene for human pro-alpha2(I) collagen (COL1A2).
- The COL1A2 gene is large (nearly 40,000 bases), highly spliced and there are plenty of corresponding ESTs, so it is a good example to demonstrate the capabilities of Gene2EST.
- Click on the Save button.
- Set Use view to FastaSeqs then click SAVE.
- The sequence is now in a suitable format. (If you want, you can save it in a file as COL1A2.SEQ).
Using Gene2EST:
- Open a new navigator window and load this page into it.
- Load the Gene2EST query submission page.
- Cut and paste the query sequence into the sequence box and submit the job.
- The search will take a few minutes. When the result arrives, bookmark it.
- Note that RepeatMasker runs first to mask out dispersed repeats like Alus.
- It will report these while BLAST is running: Did it find any?
- Examine the BLAST output.
- How many matching ESTs are there?
- Ought we to allow more top hits if we ran the search again?
- How does % identity vary in the hits?
- Are any of the matches spread over multiple HSPs?
Examining the alignment output:
- Click on the Alignment button. The alignment of high identity matches will be provided in a new window.
- Have most of the exons been found?
- Has the 5' exon been found? Is there an upstream TATAAAA-like sequence? A CCAAT box?
- Has the 3' exon been found? Is there a poly-A signal - AATAAA? Is just one poly-A site used?
- Why are there more 3' ESTs than 5' and internal ESTs?
- Are internal exons bordered by plausible splice sites?
- Are there any "funny" matches that don't make good sense?
- Is it possible that these are not true "Expressed Sequence Tags"?
- Could there be some ESTs primed from genomic DNA at runs of As?
- Are there any ALUs masked out in the introns?
- Two short exons of 10 and 15 bp are missing in the alignment. How is that marked?
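The signal checks in the list above (TATAAAA-like sequence, CCAAT box, AATAAA poly-A signal, runs of As that could prime spurious ESTs) can also be made offline on the saved query sequence. The following is a minimal sketch using exact string matching only; the function names, the 12-base run threshold and the toy sequence are assumptions for illustration, and real promoter elements are often degenerate, so exact matching will miss variants.

```python
import re


def find_motif(sequence: str, motif: str) -> list[int]:
    """Return 1-based start positions of every match, overlaps included."""
    # A zero-width lookahead lets overlapping occurrences be reported.
    pattern = re.compile(f"(?={re.escape(motif.upper())})")
    return [m.start() + 1 for m in pattern.finditer(sequence.upper())]


def find_a_runs(sequence: str, min_length: int = 12) -> list[tuple[int, int]]:
    """Return 1-based inclusive (start, end) of runs of A of at least min_length."""
    pattern = re.compile("A{%d,}" % min_length)
    return [(m.start() + 1, m.end()) for m in pattern.finditer(sequence.upper())]


# Toy sequence, invented for illustration (not the COL1A2 gene).
toy = "GGCCAATCTTATAAAAGGCGC" + "ACGT" * 3 + "AATAAA" + "CTG" + "A" * 15 + "GG"

for name, motif in [("CCAAT box", "CCAAT"),
                    ("TATA-like", "TATAAAA"),
                    ("poly-A signal", "AATAAA")]:
    print(name, find_motif(toy, motif))
print("A runs", find_a_runs(toy))
```

An A run inside an intron that coincides with the 3' end of a cluster of EST matches is a hint that those ESTs may have been primed from genomic DNA or unspliced transcript rather than from the true poly-A tail.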
Examining the Graphical output with Artemis:
- Click on the EMBL_Entry button. The sequence with ESTs annotated as an EMBL format feature table will be provided in a new window.
- Note how the ESTs are annotated as mRNA features:
- Why are there join statements in some ESTs?
- Save to file as e.g. /net/fileserver4/scrap/your_username/collagen.embl.
- In an X-window, type xhost +
- Then rlogin tau
- On the Unix command line, type prepare artemis. Artemis will be set up.
- Type art collagen.embl & to start Artemis with the COL1A2 sequence loaded.
- In a few seconds, the Artemis graphical display should appear.
- Examine the Artemis Display:
- Can you see the exon-intron structure of the gene?
- Click on an exon - How many neighbouring exons are bolded?
- Is there any evidence for alternative splicing in this gene?
- Does the promoter region have a high GC content, typical of many human promoters? (Hint: Use the Display menu).
- Artemis is a display-editor of EMBL features:
- You could edit the file to clean up the results by deleting the implausible matches.
- You could use it to assemble features for the entire mRNA and the inferred coding content.
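In an EMBL feature table, a feature that aligns to the sequence in several blocks is written with a join(...) location, which is presumably why an EST spanning one or more introns appears with a join statement. The sketch below parses such a location string into segments and derives the implied intron lengths. The function names and the coordinates are hypothetical and were not taken from Gene2EST output; only simple start..end segments are handled.

```python
import re


def parse_location(location: str) -> tuple[list[tuple[int, int]], bool]:
    """Split an EMBL-style location into (start, end) segments.

    Returns the segments and a flag saying whether the location is wrapped
    in complement(...). Partial markers '<' and '>' are ignored.
    """
    is_complement = location.strip().startswith("complement(")
    segments = [
        (int(start), int(end))
        for start, end in re.findall(r"<?(\d+)\.\.>?(\d+)", location)
    ]
    return segments, is_complement


def intron_lengths(segments: list[tuple[int, int]]) -> list[int]:
    """Lengths of the gaps between consecutive segments, in sequence order."""
    ordered = sorted(segments)
    return [nxt[0] - prev[1] - 1 for prev, nxt in zip(ordered, ordered[1:])]


# Hypothetical location of an EST that spans two introns.
location = "join(1201..1290,2350..2410,5100..5187)"

segments, on_reverse_strand = parse_location(location)
print("segments:", segments)
print("reverse strand:", on_reverse_strand)
print("implied intron lengths:", intron_lengths(segments))
```

Comparing the segment boundaries of different ESTs in this way is one route to spotting alternative splicing: two ESTs that share one boundary but differ at the next are using different splice sites.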
Repeat using HSAK1, the gene sequence from yesterday's prediction practical:
- Collect HSAK1 (collect just the sequence in FASTA format).
- Cut and Paste into Gene2EST and run the search.
- Collect the outputs and load the EMBL format into Artemis:
- How well is the gene predicted?
- Are the first and last exons found?
- Are there any EST matches that are not in the annotated EMBL entry?
- Does the promoter have a high GC content?
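The GC content question can also be answered numerically outside Artemis. What follows is a minimal sliding-window sketch; the window and step sizes are arbitrary assumptions, the function names are invented here, and the toy sequence is artificial. Ambiguous or masked bases such as N are left out of the denominator.

```python
def gc_content(sequence: str) -> float:
    """Percentage of G and C among the unambiguous bases A, C, G and T."""
    sequence = sequence.upper()
    counted = sum(sequence.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return 100.0 * (sequence.count("G") + sequence.count("C")) / counted


def windowed_gc(sequence: str, window: int = 100, step: int = 50) -> list[tuple[int, float]]:
    """Return (1-based window start, GC percentage) along the sequence."""
    last_start = max(len(sequence) - window, 0)
    return [
        (start + 1, gc_content(sequence[start:start + window]))
        for start in range(0, last_start + 1, step)
    ]


# Artificial sequence: a GC-rich half followed by an AT-rich half.
toy = "GC" * 50 + "AT" * 50

for start, gc in windowed_gc(toy):
    print(f"window starting at {start}: {gc:.0f}% GC")
```

Applied to the region upstream of the first exon, a profile like this shows whether the promoter stands out from the flanking sequence or simply reflects the local base composition.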
Notes:
Our chosen example works spectacularly well: ESTs alone can quickly reveal the entire gene structure. However, if we took the EMBL entry AF106656 (the human adenylosuccinate lyase gene) there might be problems. About half of the 20 kb sequence consists of dispersed repeat sequences that match highly similar ESTs; fortunately, the program RepeatMasker can filter out repetitive elements and is on by default. In addition, there are many ESTs primed from intronic poly-A runs in the genomic sequence. The gene itself is not very highly expressed, so we could struggle to find the true ESTs in the flood of noisy matches. If there is time, it is an instructive exercise to compare the Gene2EST results for the AF106656 sequence with and without RepeatMasker filtering. After that exercise, one is left to wonder what fraction of the ~2 million public ESTs belong in the "really useful" category...
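To put a number on how much of a query is repetitive, one can measure the masked fraction of a repeat-masked sequence. This sketch assumes hard masking, in which masked bases are replaced by N or X; it does not detect lowercase soft masking. The function name and toy sequence are invented for illustration.

```python
def masked_fraction(sequence: str, mask_chars: str = "NX") -> float:
    """Fraction of residues that are masking characters (case-insensitive)."""
    residues = [c.upper() for c in sequence if c.isalpha()]
    if not residues:
        raise ValueError("sequence contains no residues")
    masked = sum(1 for c in residues if c in mask_chars)
    return masked / len(residues)


# Toy hard-masked sequence: 20 unmasked bases followed by 20 masked ones.
toy = "ACGT" * 5 + "N" * 20

print(f"{masked_fraction(toy):.0%} of the sequence is masked")
```

Since N can also stand for a genuinely unknown base in the original entry, the figure is an upper bound on the repeat content unless the unmasked sequence is checked as well.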
Take Home Lessons
We looked at three different servers that can help to investigate genes in human genomic sequence. Gene2EST will give a good overview of a gene structure, provided that sufficient ESTs are present, and can reveal alternative splicing. It will be useless if there are no ESTs derived from the query gene. Ensembl provides diagrammatic summaries of gene prediction, protein homology and cDNA/EST matches that can be helpful in defining a gene, but are currently rather terse. The human genome browser provides much of the same information as Ensembl but is currently better for viewing larger segments of chromosomes. These human genomic servers are likely to develop in usefulness as more sequence is completed.
The gene orders for exercise 2 using the December 2000 release of the human genome are:
SRC, HNF4A, PKIG, MATN4, EYA2, NCOA3, BIG2, STAU
LYN, BIG1, NCOA2, EYA1, STAU2, HNF4G, PKIA, MATN2
By the way, LYN is actually a closer relative of HCK than of SRC, but HCK is further from the rest of these genes than SRC, so we used SRC as the starting point. It will be interesting to see, as the sequence is improved, whether this corresponds to the true gene order resulting from a chromosomal inversion.
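The comparison of the two gene orders above can be sketched in a few lines. The paralogue pairing below is an assumption inferred from the gene names (similar names with different numbers), not something reported by either server, and the function name is invented here. The code finds maximal runs of neighbouring genes in the SRC region whose paralogues are also neighbours in the LYN region, in the same or the reverse order.

```python
SRC_REGION = ["SRC", "HNF4A", "PKIG", "MATN4", "EYA2", "NCOA3", "BIG2", "STAU"]
LYN_REGION = ["LYN", "BIG1", "NCOA2", "EYA1", "STAU2", "HNF4G", "PKIA", "MATN2"]

# Assumed pairing, inferred from the gene names only.
PARALOGUE = {
    "SRC": "LYN", "HNF4A": "HNF4G", "PKIG": "PKIA", "MATN4": "MATN2",
    "EYA2": "EYA1", "NCOA3": "NCOA2", "BIG2": "BIG1", "STAU": "STAU2",
}


def conserved_runs(order_a, order_b, pairs):
    """Maximal runs (two or more genes) of order_a kept contiguous in order_b.

    Each run is returned with 'same' or 'inverted' relative orientation.
    """
    position_in_b = {gene: i for i, gene in enumerate(order_b)}
    ranks = [position_in_b[pairs[gene]] for gene in order_a]
    runs = []
    start, direction = 0, 0
    for i in range(1, len(ranks) + 1):
        # Step between neighbours; 0 acts as a terminator after the last gene.
        step = ranks[i] - ranks[i - 1] if i < len(ranks) else 0
        if step in (1, -1) and direction in (0, step):
            direction = step  # the current run continues
            continue
        if i - start >= 2:
            label = "same" if direction == 1 else "inverted"
            runs.append((order_a[start:i], label))
        if step in (1, -1):
            start, direction = i - 1, step  # new run begins at the previous gene
        else:
            start, direction = i, 0
    return runs


for genes, orientation in conserved_runs(SRC_REGION, LYN_REGION, PARALOGUE):
    print(orientation, genes)
```

Which block is labelled "inverted" depends on the direction in which each region is read, so the labels are only meaningful relative to each other; whether an inverted block is real or an assembly artefact cannot be decided from the gene order alone.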
You can find this page at http://www.embl-heidelberg.de/~seqanal/courses/Jul01/GeneInv.01.html