Autumn 01 Course
Eukaryotic Gene and Genome Investigation
Toby Gibson, Aidan Budd and Chenna Ramu
November 26th-29th 2001
The aim of this practical is to get hands-on experience with three
web servers that can help to investigate eukaryotic genomic sequence -
especially human. We are focussing particularly on the human genome servers
since the genome is largely completed now. The Ensembl
server is a collaboration between the EBI and the Sanger Centre to provide
"real time" human genome annotation as the sequence is generated. The annotation
is automatically generated from a combination of gene prediction, encoded
protein homology and EST matches: obviously it is not going to be perfect.
The Human Genome Browser developed
in Santa Cruz is another interface useful for getting an overview of the
genome. Ensembl and the Genome Browser have links to each other. The third
server that we want to introduce is one that we have developed at EMBL.
The Gene2EST server
will allow you to submit gene-size queries (if need be >100,000 bp) to
search the EST databases and provides both graphical and alignment output:
in favourable cases - i.e. plentiful ESTs - Gene2EST can rapidly give a
good description of a gene structure, including alternative splicing.
Getting started
The teaching machines are INTEL PCs running the LINUX OS. It will take a few moments to get set up.
- Login with your EMBL name and password.
- Start the KDE Desktop by typing exec startx.
- Start netscape from the pullup starting icon at the lower left of the screen.
- Load this page into it. (It is a few clicks from EMBL's home page).
- Check that Java, JavaScript and style sheets are enabled in the Netscape preferences in the advanced options.
- Also in preferences, set the fonts to 14 point.
Exercise 1. Querying Ensembl with a protein sequence
Ensembl can be queried by keyword or by sequence similarity,
depending on what your needs are. It provides quite a flexible combination
of protein/DNA query/databases. For example, it is well suited to probing
with ESTs of interest. Today we are going to use a protein sequence as
query.
Getting the Query:
- Before accessing SRS, you need to enable JavaScript and Java within Netscape. To do this, go to Edit -> Preferences -> Advanced and then click in the appropriate boxes.
- In a Netscape window, go to SRS and click Start.
- Click the SWISSPROT box and then click Continue.
- Type SRC_HUMAN in an ID box and click Do Query.
- Click on the SWISSPROT:SRC_HUMAN entry. This entry contains the sequence for the human SRC oncoprotein.
- Click on the Save button.
- Set Use view to FastaSeqs then click SAVE.
- The sequence is now in a suitable format.
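If you save the FASTA output to a file, it can be useful to check what you are about to paste into a search form. The following is a minimal Python sketch of reading FASTA-formatted text; the function name `parse_fasta` and the toy record are our own illustrative assumptions, not part of SRS, and the toy sequence is not the real SRC_HUMAN entry.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy record for illustration only (hypothetical, not a real protein).
example = """>toy_protein hypothetical example
ACDEFGHIKL
MNPQRSTVWY
"""

records = parse_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```

The header line (starting with ">") and the sequence lines are kept apart, so the sequence length can be checked before the query is submitted.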
Querying Ensembl
- Click to Ensembl.
- Click to the BLAST page.
- Cut and paste the sequence into the box.
- Select the database of Ensembl Confirmed peptides.
- Check that the protein version of BLAST will run.
- Click the search button.
- The search should take a few seconds to run.
- Examine the BLAST output:
- Which chromosome has the top hit?
- What happens when you point the mouse at a coloured chromosomal band?
- What information does this give you?
- Is the top hit the Src sequence?
- Check the second best hit:
- Is it less than 70% identical to the query?
- Is it on chromosome 6?
- Click on the link (the Ensembl peptide identifier, which begins "ENSP...") from the alignment of the top hit. Ensembl will give a graphical representation of the genomic environment around this gene.
- This page has two main panels. Which of these provides the more detailed information?
- Look at the sources of annotation data:
- Which of (a) de novo gene predictions, (b) protein sequence, (c) EST match, (d) mRNA sequences are used to identify putative genes?
- Is it easy to assess the quality of support given to this predicted gene structure from these sources within Ensembl? (Later on in this course you will use Gene2EST, a tool designed specifically to compare EST data to genomic DNA sequences.)
- What information does a Transcript link provide?
- How many known genes are within 50,000 bases either side of Src?
- Is Src on the + or - DNA strand?
- How consistent are the different methods to find genes?
- Do they agree well or badly?
- Are any exons missing in some of the methods?
- Are there many repeat elements in this region?
- To answer this question you need to add additional information tracks to the Ensembl viewer. To do this, click on the label Features that lies above the lower panel. Then choose Repeats and Reload the page from the tool-bar at the top of your browser. You can add other tracks by clicking on the Advanced option in the Features list.
- Click on an Ensembl gene.
- Review the information supplied.
- Click on Transcript > supporting evidence:
- Are there any database proteins that matched the gene?
- Are all the exons supported?
- Are the 5' exons better supported than the 3' exons?
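The question about whether the second hit is less than 70% identical relies on the percent identity that BLAST reports for each alignment. As a rough illustration of what that number means, here is a minimal Python sketch that counts identical columns in a pairwise alignment and divides by the alignment length. The function name and the toy aligned strings are our own assumptions; this is not Ensembl or BLAST code.

```python
def percent_identity(aligned_query, aligned_subject):
    """Percent of alignment columns in which both sequences carry the same residue.

    Both arguments are rows of one pairwise alignment, with '-' marking gaps.
    Gap columns count towards the alignment length but never as matches.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    columns = 0
    for q, s in zip(aligned_query, aligned_subject):
        if q == "-" and s == "-":
            continue  # ignore columns that are gaps in both rows
        columns += 1
        if q == s:
            matches += 1
    if columns == 0:
        return 0.0
    return 100.0 * matches / columns


# Toy alignment for illustration only (hypothetical residues).
identity = percent_identity("MKT-AYI", "MKTLAHI")
print(f"{identity:.1f}% identical")
```

Note that the value depends on what is used as the denominator; this sketch uses the number of aligned columns, including gap columns.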
Notes:
Good starting points for querying Ensembl are protein, EST or
cDNA sequences of interest. Ensembl does automatic annotation, so
results are not likely to be perfect. Bear in mind that sequence is being
annotated as fast as it is being generated: many genes may be only partially
sequenced. While results are imperfect, the rapid deployment of gene predictions
and the links to homologous sequences could be very useful. The underlying
data is made available and the user can - and should - evaluate the predicted
gene annotation quality.
Exercise 2. Using the Genome Browser to evaluate whether two chromosomes are related
The Human Genome Browser is
another useful interface to the human genome. We can easily click between
Ensembl
and the Browser and back again: You may find that some things are
easier in one of these browsers than the other - though this may change
over time as they are both being rapidly developed. The exercise we are
going to do now could be done in Ensembl too but was easier in the Browser
as we tried it - but next year who knows?
There is currently some controversy over whether the vertebrate genome underwent
tetraploidy or even octaploidy a long time ago in a common ancestor. We
know that there are eight Src paralogues in the human genome. One of them,
HCK is close to Src on chromosome 20: this gene pair was formed by a tandem
duplication. Another, LYN, is found on chromosome 8 so might have arisen
by genome duplication. We are going to see whether there are other genes
close to Src and LYN that exhibit synteny. Syntenic chromosomal
regions possess similar genes.
One very important aspect of interpreting genome data from human (and other
organisms with large genomes) is the quality of the assembly in a particular region.
Where a region consists of many, finished overlapping clones, one has a
very high confidence that factors such as gene orientation, intron size,
gene order etc. are accurately modelled. However, where there are a large
number of gaps both within unfinished clones, and also (even worse) between
clones or even fingerprinted contigs, one's confidence in such predictions
is much reduced. In this exercise we will use different tracks in the genome
browser to assess sequence quality in our regions of interest.
Going from Ensembl to the Genome Browser:
- From Ensembl, click on the Jump to button above the bottom panel. Choose UCSC browser.
- We will get a new window with another graphic of the Src genomic region.
- Now adjust the figure size:
- Set pixel width to 1000 and click jump.
- If need be, fine-tune the size some more so it fills your screen nicely.
Finding a set of genes linked to Src:
- Use the zoom out buttons to get a view with about 20-30 genes.
- As needed, customize the plot to minimize unwanted info such as RNA genes.
- Work rightward of Src using the move buttons to establish the gene order for:
- BIG2, EYA2, HNF4A, MATN4, NCOA3, PKIG, SRC and STAU
- Draw out the gene order on paper.
- Are any of these genes fragmented?
- Also note the position of any major gaps that are present between the genes.
- Gaps are identified using the Gap track (doh).
- Mark these on the paper.
- There are two main ways of controlling the type and amount of annotation data presented by the browser. Try altering the descriptions (hide, dense or full) for the track-list given below the visual annotation panel to vary what is observed (after changing these descriptions, you must click the Refresh button to implement the changes). It is also possible to change the degree of coverage given by a track by clicking on the name of the track as given within the visual panel. Information about individual annotated features presented in the panel can be obtained by clicking on them with the left mouse button.
Finding a set of genes linked to LYN:
The aim of this part of the exercise is to see whether a region
of the genome has several genes in it which are similar to
genes that are near to (syntenic with) Src. If this is so, then the
reason for this sharing of similar genes might be that the regions are
in fact duplicates.
- Open the Genome Home link with the middle mouse button.
- Click on the Browser link.
- Type LYN into the genome position box and submit.
- Click on the top matching link to get the graphical display.
- Set up the figure as before to view about 20-30 genes.
- From the coverage and gap summaries, is the sequence as good as for the SRC region?
- The sequence quality in this region is much worse than that around SRC. Around SRC, and indeed throughout chromosome 20, the coverage track shows nothing (as the chromosome is almost completely finished sequence). However, in this region you can see that this track reveals the names of the clones used in the assembly, and shows how fragmented they are (if they are completely black, then they are fully sequenced/finished). This gives another way of assessing sequence quality in the region, in addition to the gap track.
- Work rightward of LYN using the move buttons to establish the gene order for genes which are similar to those given in the list in the section Finding a set of genes linked to Src: above.
- (The genes will have similar names with different numbers; some of them are a long way from LYN.)
- Also note the position of any major gaps that are present between the genes.
- We can now compare the two regions:
- Is the gene order conserved?
- What is the longest conserved gene order?
- Are there any inversions of gene order?
- Do you think the inversions are real or due to error in the assembly and mapping?
Notes:
This exercise shows quite well the current state of the human genome
sequence. There are some nearly finished regions and some less good ones.
There also seem to be some gaps between clone contigs, so the contigs
could be inverted, changing the apparent gene order. In addition, the order
of sequence contigs within a clone is not always correct, although the project
does try to order them, and this could disrupt local gene prediction. In other
words, the fine mapping is not reliable in the rough sequence. It will be
interesting to see how much the gene order changes in these two regions in
the future and whether they become more like each other.
Exercise 3. Using ESTs to reveal gene structure with Gene2EST
The BLAST2 program is well suited to EST detection. Because it tolerates
small gaps in alignments, it deals well with sequencing errors in the ESTs.
However it is painfully time-consuming to work through the output, especially
if there are many 100s of matches. Furthermore, most BLAST2 servers will
not allow the user to submit large query sequences. We have tried to address
these deficiencies with Gene2EST. The Gene2EST
server presents BLAST2-derived output in a way that allows the user to
rapidly evaluate the results. A graphical display (viewed with Artemis)
provides an overview of the results, while an alignment of the ESTs on
the query sequence lets the user follow up the exons in detail.
Getting the Query:
- In a Netscape window, go to SRS and click Start.
- Click the EMBL box and then click Continue.
- Type AF004877 in an ID box and click Do Query.
- Click on the EMBL:AF004877 entry. This entry contains the gene for human pro-alpha2(I) collagen (COL1A2).
- The COL1A2 gene is large (nearly 40,000 bases), highly spliced and there are plenty of corresponding ESTs, so it is a good example to demonstrate the capabilities of Gene2EST.
- Click on the Save button.
- Set Use view to FastaSeqs then click SAVE.
- The sequence is now in a suitable format. (If you want, you can save it in a file as COLIA2.SEQ).
Using Gene2EST:
- Open a new navigator window and load this page into it.
- Load the Gene2EST query submission page.
- Cut and paste the query sequence into the sequence box and submit the job.
- The search will take a few minutes. When the result arrives, bookmark it.
- Note that RepeatMasker runs first to mask out dispersed repeats like Alus.
- It will report these while BLAST is running: Did it find any?
- Examine the BLAST output.
- How many matching ESTs are there?
- Ought we to allow more top hits if we ran the search again?
- How does % identity vary in the hits?
- Are any of the matches spread over multiple HSPs?
Examining the alignment output:
- Click on the Alignment button. The alignment of high identity matches will be provided in a new window.
- Have most of the exons been found?
- Has the 5' exon been found? Is there an upstream TATAAAA-like sequence? A CCAAT box?
- Has the 3' exon been found? Is there a poly-A signal - AATAAA? Is just one poly-A site used?
- Why are there more 3' ESTs than 5' and internal ESTs?
- Are internal exons bordered by plausible splice sites?
- Are there any "funny" matches that don't make good sense?
- Is it possible that these are not true "Expressed Sequence Tags"?
- Could there be some ESTs primed from genomic DNA at runs of As?
- Are there any ALUs masked out in the introns?
- Two short exons of 10 and 15 bp are missing in the alignment. How is that marked?
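Looking for a TATAAAA-like sequence, a CCAAT box or the AATAAA poly-A signal by eye is tedious in a long sequence. As a minimal sketch, assuming you have the genomic sequence as a plain string, an exact-match motif scan could look like the following. The function name and the toy sequence are hypothetical, and real promoter and poly-A signals are often inexact, so treat exact matches only as a first pass.

```python
def find_motif(sequence, motif):
    """Return the 1-based start positions of every exact match, overlaps included."""
    sequence = sequence.upper()
    motif = motif.upper()
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start + 1)  # convert 0-based index to 1-based position
        start = sequence.find(motif, start + 1)
    return positions


# Toy DNA string for illustration only (hypothetical, not COL1A2).
toy = "GGCTATAAAAGGCCATGCCCTGAAATAAAGTTTCAATAAATT"

tata_sites = find_motif(toy, "TATAAAA")
polya_sites = find_motif(toy, "AATAAA")
ccaat_sites = find_motif(toy, "CCAAT")
print("TATAAAA:", tata_sites)
print("AATAAA:", polya_sites)
print("CCAAT:", ccaat_sites)
```

The scan covers one strand only; for a gene on the reverse strand you would scan the reverse complement instead.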
Examining the Graphical output with Artemis:
- Click on the EMBL_Entry button. The sequence with ESTs annotated as an EMBL format feature table will be provided in a new window.
- Note how the ESTs are annotated as mRNA features:
- Why are there join statements in some ESTs?
- Save to file as e.g. /net/fileserver4/scrap/your_username/collagen.embl.
- In an X-window, type xhost +
- Then rlogin tau
- On the Unix command line, type prepare artemis. Artemis will be set up.
- Type art collagen.embl & to start Artemis with the COL1A2 sequence loaded.
- In a few seconds, the Artemis graphical display should appear.
- Examine the Artemis Display:
- Can you see the exon-intron structure of the gene?
- Click on an exon - How many neighbouring exons are bolded?
- Is there any evidence for alternative splicing in this gene?
- Does the promoter region have a high GC content, typical of many human promoters? (Hint: Use the Display menu).
- Artemis is a display-editor of EMBL features:
- You could edit the file to clean up the results by deleting the implausible matches.
- You could use it to assemble features for the entire mRNA and the inferred coding content.
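Artemis can plot GC content for you, but it may help to see what such a plot is computing. The following is a minimal sketch, assuming a plain DNA string and a fixed-size sliding window; the function names, window size and toy sequence are our own illustrative choices and do not reflect how Artemis implements its plots.

```python
def gc_content(sequence):
    """Percent G+C among the unambiguous bases (A, C, G, T) of a sequence."""
    bases = [b for b in sequence.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)


def windowed_gc(sequence, window, step):
    """GC content of successive windows as (1-based start, percent GC) pairs."""
    return [
        (start + 1, gc_content(sequence[start:start + window]))
        for start in range(0, len(sequence) - window + 1, step)
    ]


# Toy sequence for illustration only: a GC-rich stretch followed by an AT-rich one.
toy = "GGCCGCGCATATTATA"
profile = windowed_gc(toy, window=8, step=8)
for start, percent in profile:
    print(start, f"{percent:.0f}%")
```

A promoter with a high GC content would show up as a run of windows with values well above the average for the rest of the sequence.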
Repeat using HSAK1, the gene sequence from yesterday's prediction practical:
- Collect HSAK1 (collect just the sequence in FASTA format).
- Cut and Paste into Gene2EST and run the search.
- Collect the outputs and load the EMBL format into Artemis:
- How well is the gene predicted?
- Are the first and last exons found?
- Are there any EST matches that are not in the annotated EMBL entry?
- Does the promoter have a high GC content?
Notes:
Our chosen example works spectacularly well: ESTs alone can quickly
reveal the entire gene structure. However, if we took the EMBL entry AF106656
(the human adenylosuccinate lyase gene) there might be problems. Fortunately
the program RepeatMasker
can filter out repetitive elements and is on by default: About 1/2 of the
20 kb sequence consists of dispersed repeat sequences matching highly similar
ESTs. Then there are many ESTs primed from intronic poly-A runs in the
genomic sequence. The gene itself is not very highly expressed, so we could
struggle to find the true ESTs in the flood of noisy matches. If there
is time, it is an instructive exercise to see the effects of running Gene2EST on the
sequence from AF106656 with and without RepeatMasker filtering. After that exercise, one is left to wonder what fraction of the ~4 million public ESTs belong in the "really useful" category...
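To put a number on how much of a query has been masked, you can count the masked positions in the filtered sequence. This is a minimal sketch that assumes masked bases appear as N or X characters; whether your copy of the masked sequence uses that convention is an assumption you should check, and the function name and toy sequence are hypothetical.

```python
def masked_fraction(sequence, mask_chars="NX"):
    """Fraction of positions that carry a masking character (case-insensitive)."""
    if not sequence:
        return 0.0
    masked = sum(1 for base in sequence.upper() if base in mask_chars)
    return masked / len(sequence)


# Toy sequence for illustration only: half of it is masked.
toy_masked = "ACGTNNNNACGTNNNN"
fraction = masked_fraction(toy_masked)
print(f"{100 * fraction:.0f}% of the sequence is masked")
```

Comparing this fraction with the number of EST hits obtained with and without filtering gives a feel for how many matches are due to repeats alone.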
Take Home Lessons
We looked at three different servers that can help to investigate
genes in human genomic sequence. Gene2EST will give a good overview of
a gene structure, provided that sufficient ESTs are present, and can reveal
alternative splicing. It will be useless if there are no ESTs derived from
the query gene. Ensembl provides diagrammatic summaries of gene prediction,
protein homology and cDNA/EST matches that can be helpful in defining a
gene, but are currently rather terse. The human genome browser provides
much of the same information as Ensembl but is currently better for viewing
larger segments of chromosomes. These human genomic servers are likely
to develop in usefulness as more sequence is completed.
The gene orders for exercise 2 using the December 2000 release
of the human genome are:
SRC, HNF4A, PKIG, MATN4, EYA2, NCOA3, BIG2, STAU
LYN, BIG1, NCOA2, EYA1, STAU2, HNF4G, PKIA, MATN2
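The comparison of the two gene orders can also be sketched in a few lines of Python. The pairing of paralogues below is done by hand from the gene names (for example EYA2 with EYA1) and is our own illustrative assumption, not the result of a homology search; the helper function is likewise hypothetical. The sketch reports the longest stretch of chromosome 8 genes whose partners are adjacent on chromosome 20, once in the same orientation and once inverted.

```python
# Gene orders as listed above (December 2000 release).
CHR20 = ["SRC", "HNF4A", "PKIG", "MATN4", "EYA2", "NCOA3", "BIG2", "STAU"]
CHR8 = ["LYN", "BIG1", "NCOA2", "EYA1", "STAU2", "HNF4G", "PKIA", "MATN2"]

# Assumption: paralogue families assigned by hand from the gene names.
FAMILY = {
    "SRC": "SRC", "LYN": "SRC",
    "HNF4A": "HNF4", "HNF4G": "HNF4",
    "PKIG": "PKI", "PKIA": "PKI",
    "MATN4": "MATN", "MATN2": "MATN",
    "EYA2": "EYA", "EYA1": "EYA",
    "NCOA3": "NCOA", "NCOA2": "NCOA",
    "BIG2": "BIG", "BIG1": "BIG",
    "STAU": "STAU", "STAU2": "STAU",
}


def longest_run(indices, step):
    """Return (start, length) of the longest stretch in which each value
    equals the previous value plus `step` (+1 same order, -1 inverted)."""
    if not indices:
        return 0, 0
    best_start, best_len = 0, 1
    start = 0
    for i in range(1, len(indices)):
        if indices[i] - indices[i - 1] != step:
            start = i  # the run is broken; begin a new one here
        if i - start + 1 > best_len:
            best_start, best_len = start, i - start + 1
    return best_start, best_len


# Position of each family along chromosome 20, then read off in chromosome 8 order.
position_on_chr20 = {FAMILY[gene]: i for i, gene in enumerate(CHR20)}
indices = [position_on_chr20[FAMILY[gene]] for gene in CHR8]

start, length = longest_run(indices, +1)
same_order = CHR8[start:start + length]
start, length = longest_run(indices, -1)
inverted_order = CHR8[start:start + length]

print("Longest run in the same order:", same_order)
print("Longest run in inverted order:", inverted_order)
```

Only strictly adjacent neighbours are counted here, so a single misplaced gene breaks a run; a more tolerant comparison would need a different approach.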
By the way, LYN is actually a closer relative of HCK than of
SRC, but HCK is further from the rest of these genes than SRC, so we used
SRC as the starting point. It will be interesting to see if this corresponds
to true gene order due to a chromosomal inversion, as the sequence is improved.
You can find this page at http://www.embl-heidelberg.de/~seqanal/courses/Nov01/genomes.html