Practical
Course on Sequence Analysis
Eukaryotic Gene Investigation
Toby Gibson, Christine Gemünd and
Chenna Ramu, May '00
The aim of this practical is to get hands-on experience with two
web servers that can help to investigate eukaryotic genomic sequence - especially
human. We are focussing particularly on the human genome servers since
the genome is expected to be largely completed during this year. The Ensembl
server is a collaboration between the EBI and the Sanger Centre to provide
"real time" human genome annotation as the sequence is generated. The annotation
is automatically generated from a combination of gene prediction, encoded
protein homology and EST matches: obviously it is not going to be perfect.
The second server that we want to introduce is one that we are developing
and is not finished yet! The Gene2EST
server will allow you to submit gene-size queries (up to about 50,000 bp)
to search the EST databases and provides both graphical and alignment output:
in favourable cases - i.e. plentiful ESTs - Gene2EST can rapidly give a
good description of a gene structure.
We did plan to show the human genome server at the Sanger Centre too
- but the exercises gave too many problems, so we dropped it.
Exercise 1. Querying Ensembl with a protein sequence
Currently, Ensembl is not easy to query by keyword. It is
best queried by sequence and provides quite a flexible combination of protein/DNA
query/databases. For example, it is well suited to probing with ESTs of
interest. Today we are going to use a protein sequence as query.
Getting the Query:
-
In a Netscape window, go to SRS
and click Start.
-
Click the SWISSPROT Box and then click Continue.
-
Type SRC_HUMAN in an ID box and click Do Query.
-
Click on the SWISSPROT:SRC_HUMAN entry. This entry contains the
sequence for human SRC oncoprotein.
-
Click on the Save button.
-
Set Use view to FastaSeqs then click SAVE.
-
The sequence is now in a suitable format.
Querying Ensembl
-
Click to Ensembl.
-
Click to the BLAST page.
-
Cut and paste the sequence into the box.
-
Click the search button. (We will use the default protein search set up).
-
The search should take a few seconds to run.
-
Examine the output:
-
Is the top hit the Src sequence?
-
How many hits can you find that are identical in sequence to the query?
-
Click on the clone link for one of the identical sequences. It will
give a graphical representation.
-
Look at the sources of annotation data:
-
Which of (a) gene prediction, (b) protein sequence, (c) EST match, are
used to find genes?
-
How many genes are listed on the Src contig?
-
Are the genes on the + or - DNA strand?
-
What are these two genes?
-
What happens if you click on a coloured block on the plot?
-
How consistent are the different methods to find genes?
-
Do they agree well or badly?
-
Are any exons missing in some of the methods?
-
Are repeats found on both DNA strands?
-
Do the repeats occur in introns, exons or both?
-
Click on an Ensembl gene.
-
Review the information supplied.
-
Click on supporting evidence.
-
Are there any database proteins that matched the gene?
-
Are there any ESTs that matched the gene?
-
If so, do the ESTs collectively cover all the exons?
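The last question above (do the ESTs collectively cover all the exons?) can also be checked by hand once you have noted the coordinates. The following is a minimal Python sketch, not part of Ensembl: the function names and the example coordinates are hypothetical, and it assumes 1-based inclusive positions on the same strand of one contig.

```python
def merge_intervals(intervals):
    """Merge overlapping or directly adjacent (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def uncovered_exons(exons, est_matches):
    """Return the exons that are not fully contained in the merged EST matches."""
    covered = merge_intervals(est_matches)
    missing = []
    for exon_start, exon_end in exons:
        if not any(c_start <= exon_start and exon_end <= c_end
                   for c_start, c_end in covered):
            missing.append((exon_start, exon_end))
    return missing


# Hypothetical coordinates, for illustration only.
exons = [(100, 200), (300, 400), (500, 600)]
est_matches = [(90, 150), (140, 210), (310, 390)]
print(uncovered_exons(exons, est_matches))
```

Here the first exon is covered by two overlapping ESTs, the second only partially, and the third not at all, so the last two are reported.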
Notes:
Good starting points for querying Ensembl are protein, EST or
cDNA sequences of interest. Ensembl does automatic annotation, so
results are not likely to be perfect. Bear in mind that sequence is being
annotated as fast as it is being generated: genes may be only partially
sequenced. While results are imperfect, the rapid deployment of gene predictions
and the links to homologous sequences could be very useful. The underlying
data is made available and the user can - and should - evaluate the predicted
gene annotation quality.
Exercise 2. Using ESTs to reveal gene structure
with Gene2EST
The BLAST2 program is well suited to EST detection. Because it tolerates
small gaps in alignments, it deals well with sequencing errors in the ESTs.
However it is painfully time-consuming to work through the output, especially
if there are many 100s of matches. Furthermore, most BLAST2 servers will
not allow the user to submit large query sequences. We have tried to address
these deficiencies with Gene2EST. The Gene2EST
server presents BLAST2-derived output in a way that allows the user to
rapidly evaluate the results. A graphical display (viewed with Artemis)
provides an overview of the results, while an alignment of the ESTs on
the query sequence lets the user follow up the exons in detail.
Getting the Query:
-
In a Netscape window, go to SRS
and click Start.
-
Click the EMBL Box and then click Continue.
-
Type AF004877 in an ID box and click Do Query.
-
Click on the EMBL:AF004877 entry. This entry contains the gene for
human pro-alpha2(I) collagen (COL1A2).
-
The COL1A2 gene is large (nearly 40,000 bases), highly spliced and there
are plenty of corresponding ESTs, so it is a good example to demonstrate
the capabilities of Gene2EST.
-
Click on the Save button.
-
Set Use view to FastaSeqs then click SAVE.
-
The sequence is now in a suitable format. (If you want, you can save it
in a file as COL1A2.SEQ).
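If you saved the sequence to a file, it can be worth checking its length before submitting, since Gene2EST accepts queries only up to about 50,000 bp. A minimal Python sketch follows; the function names are our own, the limit is the approximate figure quoted in this practical, and the example record is made up.

```python
MAX_QUERY_BP = 50_000  # approximate limit quoted in this practical


def read_fasta(text):
    """Return (header, sequence) for the first record in FASTA-formatted text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # stop at the start of a second record
            header = line[1:]
        elif header is not None:
            chunks.append(line)
    return header, "".join(chunks).upper()


def fits_gene2est(sequence, limit=MAX_QUERY_BP):
    """True if the sequence is no longer than the assumed query limit."""
    return len(sequence) <= limit


# Made-up record, for illustration only.
example = ">demo hypothetical record\nacgtacgt\nACGT\n"
header, seq = read_fasta(example)
print(header, len(seq), fits_gene2est(seq))
```

To use it on a saved file, read the file contents into a string and pass that to read_fasta.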
Using Gene2EST:
-
Open a new navigator window and load this page into it.
-
Load the Gene2EST
query submission page.
-
(The server is still under development - later there will be on-line help
etc).
-
Cut and paste the query sequence into the sequence box and submit the job.
-
The search will take a few minutes. When the result arrives, bookmark it.
-
Examine the BLAST output.
-
How many matching ESTs are there?
-
Ought we to allow more top hits if we ran the search again?
-
How does % identity vary in the hits?
-
Are any of the matches spread over multiple HSPs?
-
Might any of the matches contain ALU repeats?
-
(Hints: These are likely to be 80-90% identical, not 100%; click on links
to the entries).
-
Are repetitive elements likely to be a problem for this server?
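To make the % identity questions concrete: BLAST reports identity as the number of identical columns divided by the alignment length, gap columns included. The Python sketch below computes this for one aligned pair and applies a rough label. The function names are hypothetical, and the cut-offs are illustrative assumptions based only on the hint above (repeat matches tend to be around 80-90% identical, true EST matches close to 100%); they are not settings used by Gene2EST.

```python
def percent_identity(aligned_query, aligned_subject):
    """Percent identity over all alignment columns; '-' marks a gap."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    columns = len(aligned_query)
    if columns == 0:
        return 0.0
    matches = sum(1 for q, s in zip(aligned_query, aligned_subject)
                  if q == s and q != "-")
    return 100.0 * matches / columns


def rough_label(identity, high=97.0, repeat_low=80.0):
    """Illustrative labels only; the thresholds are assumptions."""
    if identity >= high:
        return "likely EST from the query gene"
    if identity >= repeat_low:
        return "possible repeat or related gene"
    return "weak match"


# Two made-up aligned segments differing at one position out of ten.
identity = percent_identity("ACGTACGTAC", "ACGTTCGTAC")
print(identity, rough_label(identity))
```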
Examining the alignment output:
-
Click on the Compact_Alignment button. The alignment of high identity
matches will be provided in a new window.
-
Have most of the exons been found?
-
Has the 5' exon been found? Is there an upstream TATAAAA-like sequence?
A CCAAT box?
-
Has the 3' exon been found? Is there a poly-A signal - AATAAA? Is just
one poly-A site used?
-
Why are there more 3' ESTs than 5' and internal ESTs?
-
Are internal exons bordered by plausible splice sites?
-
Are there any "funny" matches that don't make good sense?
-
Is it possible that these are not true "Expressed Sequence Tags"?
-
Could there be some ESTs primed from genomic DNA at runs of As?
-
Are there any ALUs with high enough identity to be included?
-
Two short exons of 10 and 15 bp are missing in the alignment. How is
that marked?
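Several of the questions above come down to looking for short signals in the genomic sequence: the AATAAA poly-A signal, a CCAAT box, and the AG/GT dinucleotides that flank most internal exons. The following Python sketch shows one way to look for them. The function names and the example sequences are hypothetical, exact string matching is an assumption (real signals vary), and coordinates are 0-based with an exclusive end.

```python
def find_motif(sequence, motif):
    """Return 0-based start positions of every occurrence, overlaps included."""
    sequence = sequence.upper()
    motif = motif.upper()
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


def has_canonical_splice_sites(genomic, exon_start, exon_end):
    """Check for AG just upstream and GT just downstream of an internal exon.

    exon_start is the 0-based index of the first exon base and exon_end the
    index one past the last exon base.
    """
    if exon_start < 2:
        return False
    acceptor = genomic[exon_start - 2:exon_start].upper()
    donor = genomic[exon_end:exon_end + 2].upper()
    return acceptor == "AG" and donor == "GT"


# Made-up sequences, for illustration only.
print(find_motif("GGAATAAACCAATAAA", "AATAAA"))
print(find_motif("GGAATAAACCAATAAA", "CCAAT"))
print(has_canonical_splice_sites("ttttagATGGCCgtaagt", 6, 12))
```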
Examining the Graphical output with Artemis:
-
Click on the EMBL_Entry button. The sequence with ESTs annotated
as an EMBL format feature table will be provided in a new window.
-
Note how the ESTs are annotated as mRNA features:
-
why are there join statements in some ESTs?
-
Save to file as e.g. collagen.embl.
-
In an X-window, on the unix command line type prepare artemis. Artemis
will be set up.
-
Type art collagen.embl & to start Artemis with the COL1A2 sequence
loaded.
-
In a few seconds, the Artemis graphical display should appear.
-
Examine the Artemis Display:
-
Can you see the exon-intron structure of the gene?
-
Click on an exon - How many neighbouring exons are bolded?
-
Is there any evidence for alternative splicing in this gene?
-
Does the promoter region have a high GC content, typical of many human
promoters? (Hint: Use the display menu).
-
Artemis is a display-editor of EMBL features:
-
you could edit the file to clean up the results by deleting the implausible
matches.
-
you could use it to assemble features for the entire mRNA and the inferred
coding content.
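The join statements in the feature table list the separate blocks of one EST match; the gaps between blocks are the candidate introns. As an illustration, the Python sketch below parses the simplest EMBL location forms (N..M, join(...) and complement(...)) and lists the gaps. It is an assumption-laden sketch with hypothetical function names: it does not handle single-base positions or remote references, and partial markers such as < and > are simply ignored.

```python
import re


def parse_location(location):
    """Return (strand, [(start, end), ...]) with 1-based inclusive coordinates."""
    text = location.replace(" ", "")
    strand = "+"
    if text.startswith("complement(") and text.endswith(")"):
        strand = "-"
        text = text[len("complement("):-1]
    if text.startswith("join(") and text.endswith(")"):
        text = text[len("join("):-1]
    segments = [(int(a), int(b))
                for a, b in re.findall(r"(\d+)\.\.(\d+)", text)]
    return strand, segments


def intron_spans(segments):
    """Return the gaps between consecutive segments, as (start, end) pairs."""
    ordered = sorted(segments)
    return [(left_end + 1, right_start - 1)
            for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])]


# Made-up location string, for illustration only.
strand, segments = parse_location("join(100..200,300..400)")
print(strand, segments, intron_spans(segments))
```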
Repeat using HSAK1, the gene sequence from yesterday's prediction practical:
-
Collect HSAK1
(collect just the sequence in FASTA format).
-
Cut and Paste into Gene2EST and run the search.
-
Collect the outputs and load the EMBL format into Artemis:
-
How well is the gene predicted?
-
Are the first and last exons found?
-
Are there any matches that are not in the annotated EMBL entry?
-
Does the promoter have a high GC content?
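The GC content questions can also be answered outside Artemis. Below is a minimal Python sketch of a sliding-window GC calculation; the function names, window size and step are arbitrary assumptions for illustration, not the settings Artemis uses.

```python
def gc_fraction(sequence):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    sequence = sequence.upper()
    counted = sum(1 for base in sequence if base in "ACGT")
    if counted == 0:
        return 0.0
    gc = sum(1 for base in sequence if base in "GC")
    return gc / counted


def gc_windows(sequence, window=100, step=50):
    """Return (0-based start, GC fraction) for each window along the sequence."""
    results = []
    last_start = max(len(sequence) - window + 1, 1)
    for start in range(0, last_start, step):
        results.append((start, gc_fraction(sequence[start:start + window])))
    return results


# Made-up sequence: a GC-rich stretch followed by an AT-rich stretch.
print(gc_windows("GGGGAAAA", window=4, step=4))
```

Comparing the windows upstream of the first exon with the average over the whole sequence gives a quick impression of whether the promoter region is GC-rich.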
Notes:
Our chosen example works spectacularly well: ESTs alone can quickly
reveal the entire gene structure. However, if we took the EMBL entry AF106656
(the human adenylosuccinate lyase gene) there would be problems. About
1/2 of the 20 kb sequence consists of dispersed repeat sequences matching
highly similar ESTs. Then there are many ESTs primed from intronic poly-A
runs in the genomic sequence. The gene itself is not very highly expressed,
so we struggle to find the true ESTs in the flood of noisy matches. Fortunately
the program RepeatMasker
can filter out repetitive elements. We hope to build this program into
the Gene2EST server. If there is time, it is an instructive exercise
to compare the results of running Gene2EST on the RepeatMasker-filtered
and the unfiltered sequence from AF106656. After that exercise,
one is left to wonder what fraction of the ~2 million public ESTs belong
in the "really useful" category....
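If you try the filtered versus unfiltered comparison, a quick check of how much of the sequence was masked is useful. The Python sketch below assumes that masked positions appear as N (or X) in the filtered sequence, or as lower case if soft masking was chosen; check the actual output you get, since this depends on how the masking program was run. The function name and example strings are hypothetical.

```python
def masked_fraction(sequence, soft_masked=False):
    """Fraction of positions that are masked.

    With soft_masked=False, N and X (either case) count as masked.
    With soft_masked=True, lower-case letters count as masked.
    """
    if not sequence:
        return 0.0
    if soft_masked:
        masked = sum(1 for base in sequence if base.islower())
    else:
        masked = sum(1 for base in sequence if base in "NnXx")
    return masked / len(sequence)


# Made-up sequences, for illustration only.
print(masked_fraction("ACGTNNNN"))
print(masked_fraction("ACgt", soft_masked=True))
```

A value near 0.5 for AF106656 would agree with the estimate above that about half of the sequence is dispersed repeats.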
Take Home Lessons
We looked at two different servers that can help to investigate
genes in human genomic sequence. Gene2EST will give a good overview of
a gene structure, provided that sufficient ESTs are present, and can reveal
alternative splicing. It will be useless if there are no ESTs derived from
the query gene. Ensembl provides diagrammatic summaries of gene prediction,
protein homology and cDNA/EST matches that can be helpful in defining a
gene, but are currently rather terse. HGP at the Sanger Centre (which we
did not use today) provides some more information, including some chromosomal
overviews and links to other resources, such as OMIM. These human genomic
servers are likely to develop in usefulness as more sequence is completed.
You can find this page at http://www.embl-heidelberg.de/~seqanal/courses/spring00/GeneInv.00.html