Exploring gene structure with Artemis and Gene2EST
by Toby Gibson, Chenna Ramu and Christine Gemünd, October 11th-17th 2000
In this practical we will use two software tools that are useful for examining genes. Artemis is a package for graphical display of annotated genomic sequence in EMBL format. The annotations can be edited so anyone who needs to work with (or is sequencing) segments of genomic sequence can custom annotate their sequence. Gene2EST is a new server developed by our group. It allows the user to search the EST (Expressed Sequence Tag) databases with large gene queries and map the results onto the gene sequence. In favourable cases, Gene2EST may give a good overview of gene structure including differential splicing. Artemis is used to display the graphical output from Gene2EST.
Getting the gene sequence for the practical:
- Log on to TAU:
- if you are working on a Unix machine, type rlogin tau.
- if you are working on an X-Terminal, choose Tau from the menu for logging in.
- you will get a windows-type desktop and need to open a terminal (i.e. an X-window).
- if you are working on a Mac, use the MacX2.0 software to open a window on tau.
- On the Unix command line type prepare artemis. Artemis will be set up.
- Now open Netscape by typing Netscape4 -ncols 64 &.
- This stops Netscape using all the colours as the X-terms only have 256...
- In a Netscape window, go to SRS and click Start.
- Tick the EMBL Box and then click Continue.
- Toggle Sequence Format to embl and Use View to Complete entries.
- Type HSAK1 in an ID box and click Do Query.
- This is the Human gene encoding the cytosolic adenylate kinase.
- Click on the Save button.
- Set Use view to Complete entries and use mime type to file, and then click SAVE.
- Save file as hsak1.seq.
Exercise 1. Examining the hsak1 gene annotation with Artemis
The EMBL database is distributed in a "flat file" format, as are most other biological databases. This means it is simply a text file with a defined structure, allowing any programmer to easily write software that can retrieve specified information. Flat files are inefficient for cross-referencing data, but their portability and accessibility, set against the historically high cost of relational database architectures, have been the overriding factors. Nevertheless, wading through large texts can be time-consuming and dull for users, while biologists like to present gene structure with graphics. Artemis is a program developed to display gene features graphically and to edit the features while adhering to correct EMBL format. Artemis is written in Java, which ensures portability: most users at EMBL would use the Mac or PC version, rather than the UNIX one used here.
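To illustrate how little code a flat file needs, here is a minimal Python sketch that pulls feature keys and locations out of the FT lines of an EMBL entry. It assumes the standard feature table layout (key starting at column 6, location and qualifiers at column 22). The demo entry and its coordinates are invented for illustration and are not taken from hsak1.

```python
def parse_features(lines):
    """Collect (key, location) pairs from the 'FT' lines of an EMBL entry."""
    features = []
    in_location = False  # True while a location may continue on the next line
    for line in lines:
        if not line.startswith("FT"):
            continue
        key = line[5:20].strip()   # feature key column
        text = line[21:].strip()   # location / qualifier column
        if key:
            # A non-empty key column starts a new feature.
            features.append([key, text])
            in_location = True
        elif text.startswith("/"):
            # First qualifier: the location is complete.
            in_location = False
        elif in_location and features:
            # Continuation line of a long location such as join(...).
            features[-1][1] += text
    return [(key, location) for key, location in features]


def _ft(key, text):
    """Build one demo FT line with the key at column 6 and the text at column 22."""
    return f"FT   {key:<16}{text}"


# Invented demo entry (hypothetical coordinates).
DEMO = [
    "ID   DEMO       standard; DNA; HUM; 500 BP.",
    _ft("exon", "100..200"),
    _ft("", "/number=1"),
    _ft("CDS", "join(150..200,300..350,"),
    _ft("", "400..450)"),
    _ft("", '/product="hypothetical protein"'),
]

for key, location in parse_features(DEMO):
    print(key, location)
```

Once hsak1.seq has been saved, something like parse_features(open("hsak1.seq")) should list the same features that Artemis draws, assuming the file is a single EMBL entry.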
- Type art hsak1.seq & to start Artemis with the gene sequence loaded.
- In a few seconds the Artemis graphical display should appear.
- (Remember to type prepare artemis if it does not run.)
- Examine the Artemis display:
- There are 3 subwindows. Which window has:
- A display of the features?
- The sequence and 6 frame translation?
- A list of features from the hsak1 entry feature table?
- Double click on something - anything:
- What happens?
- Try double clicking on lots of things.
- What does a single click do?
- There are 2 horizontal scroll bars - What do they do?
- There are 3 vertical scroll bars - What do they do?
- Now let's look at the features themselves:
- Can you see the exon-intron structure of the gene?
- Is the coding sequence (CDS) encoded over one, two or three translation frames?
- Is the coding sequence derived from all the exons?
- The terminal exon should be partially coding, including the stop codon and the poly(A) signature.
- The initiation codon should obey the Kozak rules:
- It is normally the first methionine from the 5' end of the mRNA.
- At least one of the key residues (the purine at position -3 and the G at position +4, counting the A of AUG as +1) should be present in the consensus APuXXAUGG.
- Are both the TATA and polyadenylation signals annotated?
- Does the TATA signal span 6 residues, as often portrayed?
- Which 3' splice site is a poor match to the consensus: (C/T)nN(C/T)AG^gt?
- What do you think is the probable explanation for this?
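The Kozak check in the list above can also be done programmatically. The following Python sketch is a simplified reading of the rule and assumes that the two residues that matter are the purine at position -3 and the G at position +4, with the A of the start codon counted as +1. The function name and the test sequences are invented for illustration.

```python
def kozak_context(mrna, atg_index):
    """Report whether the -3 and +4 positions around a start codon fit the Kozak rule.

    mrna      -- mRNA or cDNA sequence (U or T accepted)
    atg_index -- 0-based index of the A of the candidate start codon
    """
    seq = mrna.upper().replace("U", "T")
    if seq[atg_index:atg_index + 3] != "ATG":
        raise ValueError("no ATG/AUG at the given index")
    # Position -3 is three bases upstream of the A (there is no position 0).
    minus3 = seq[atg_index - 3] if atg_index >= 3 else None
    # Position +4 is the base immediately after the G of ATG.
    plus4 = seq[atg_index + 3] if atg_index + 3 < len(seq) else None
    return {
        "minus3_purine": minus3 in ("A", "G"),
        "plus4_G": plus4 == "G",
    }


# Invented example sequences.
strong = "GCCACCATGGCG"
weak = "TTTTTTATGCCC"

for name, seq in (("strong", strong), ("weak", weak)):
    first_atg = seq.find("ATG")  # first methionine codon from the 5' end
    print(name, first_atg, kozak_context(seq, first_atg))
```

For hsak1 you would first need the spliced mRNA sequence, since the first ATG has to be located in the transcript rather than in the genomic sequence.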
So far we have only worked within the Artemis windows themselves. Now let's look at what we can do with the pull-down menus.
- Display the amino acid sequence of the CDS feature.
- (Hint: look under the View menu).
- Is the encoded amino acid sequence 195 residues long?
- Make a new feature for the gene's polyA_signal:
- Find the AATAAA motif upstream of the 3' end and note the base range.
- Open the Create > New Feature box.
- Select the correct key, type in the location base range and click OK.
- What colour is the new feature given?
- Now display the GC content plot from within the Display menu.
- What does the vertical scroll bar do in this menu?
- Which part of the gene has the highest GC content?
- Is this consistent with a so-called CpG island?
- Could this be useful for gene prediction?
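The GC content plot is a sliding-window calculation, which can be sketched in a few lines of Python. The window and step sizes below are arbitrary assumptions, not the values Artemis uses. The observed/expected CpG ratio is included because it is a commonly used companion measure when judging CpG islands; the function names are illustrative.

```python
def gc_windows(seq, window=100, step=50):
    """Return (start, end, gc_fraction) for each window, with 1-based coordinates."""
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        results.append((start + 1, start + window, gc / window))
    return results


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


# Invented toy sequence: a GC-rich half followed by an AT-rich half.
toy = "GGCC" * 5 + "ATAT" * 5
for start, end, fraction in gc_windows(toy, window=20, step=20):
    print(start, end, fraction)
print(cpg_obs_exp("CGCGCGCG"))
```

Windows with both a high GC fraction and a high observed/expected CpG ratio are the usual candidates for a CpG island; the exact thresholds vary between authors.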
Exercise 2. Using Gene2EST and Artemis to reveal the hsak1 gene structure as defined by ESTs
ESTs - Expressed Sequence Tags - are randomly cloned and sequenced cDNAs. Because no special analyses are undertaken, economies of scale have enabled huge quantities of ESTs to be sequenced: >2 million for human and >1 million for mouse. Originally, ESTs provided handles for cloning genes belonging to known gene families and mapping them to chromosomal locations. With the advent of complete genome sequences the ESTs have acquired new functions, both to reveal hitherto unknown genes and to accurately map gene structures. Currently, de novo gene prediction for complex eukaryotes is so poor that ESTs are easily the best tool for this task - provided only that genes are highly enough expressed to be well represented in the EST databases. Therefore our group has provided the Gene2EST server, which specialises in exactly this task and provides output mapping ESTs onto the query gene structure.
Getting the Query:
- In a Netscape window, go to SRS and click Start.
- Click the EMBL Box and then click Continue.
- Type hsak1 in an ID box.
- Set Sequence Format to fasta and Use View to FastaSeqs.
- Now click Do Query.
- The sequence is now in a suitable format. (If you want, you can save it in a file as hsak1.fasta).
- Open a new navigator window and load this page into it.
- Load the Gene2EST query submission page.
- Cut and paste the query sequence (including the >name line) into the DNA sequence box.
- Set Database to HumanEST.
- Set RepeatMasker to mask Human Repetitive elements.
- Click Do Search.
- The search will take a few minutes. (It could be queued longer if the server is busy.)
- The first job is to run RepeatMasker:
- It masks all (known) types of Alu, LINE-1 and other dispersed repeats it finds in the query.
- These are massively abundant in the human genome - and in the EST databases too!
- When the BLAST result arrives, bookmark it.
- Examine the BLAST output.
- How many listed ESTs are there?
- How many with interesting E-values (say below 1e-4)?
- What happens when you click on the list number link?
- What happens when you click on the ID link?
- How does % identity vary in the hits?
- Are any of the ESTs potentially from different but paralogous genes?
- Are any of the matches spread over multiple HSPs (high-scoring segment pairs)?
- If so, are they all in the same strand orientation (i.e. plus or minus)?
- Are the HSPs overlapping each other, exactly adjacent or dispersed in the EST?
- Therefore are they consistent with spliced exons from the gene?
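The last few questions amount to a simple test on the HSP coordinates, sketched here in Python. This is not how Gene2EST itself decides; the HSP record, the function name, the thresholds (est_slack, min_intron) and the demo coordinates are all assumptions made for illustration. The idea is that HSPs from a spliced EST share one strand, sit next to each other in the EST, and are separated by intron-sized gaps in the gene.

```python
from dataclasses import dataclass


@dataclass
class HSP:
    gene_start: int  # 1-based coordinates on the genomic query
    gene_end: int
    est_start: int   # 1-based coordinates on the EST, with start <= end
    est_end: int
    strand: str      # "+" or "-"


def consistent_with_splicing(hsps, est_slack=5, min_intron=30):
    """True if the HSPs look like adjacent exons of one spliced transcript."""
    if len(hsps) < 2:
        return False
    if len({h.strand for h in hsps}) != 1:
        return False  # mixed strand orientations
    ordered = sorted(hsps, key=lambda h: h.gene_start)
    for a, b in zip(ordered, ordered[1:]):
        gene_gap = b.gene_start - a.gene_end - 1
        if a.strand == "+":
            est_gap = b.est_start - a.est_end - 1
        else:
            # On the minus strand the EST runs backwards along the gene.
            est_gap = a.est_start - b.est_end - 1
        # Adjacent (or slightly overlapping) in the EST, far apart in the gene.
        if abs(est_gap) > est_slack or gene_gap < min_intron:
            return False
    return True


# Invented coordinates.
spliced = [HSP(100, 200, 1, 101, "+"), HSP(500, 600, 102, 202, "+")]
dispersed = [HSP(100, 200, 1, 101, "+"), HSP(500, 600, 300, 400, "+")]
print(consistent_with_splicing(spliced))
print(consistent_with_splicing(dispersed))
```

The small est_slack tolerance allows for the short overlaps that BLAST often reports at exon boundaries.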
Using the Gene2EST alignment output:
- Note the E-value and identity cut-offs:
- These are set to prevent collection of insignificant matches, or ESTs that are perhaps related yet are not identical to this gene:
- Are the settings appropriate to this case?
- Will we miss short exons if we don't alter them?
- Click on the Get_Alignment button. The alignment of high identity matches will be provided in a new Netscape window.
- Have most of the exons been found?
- Is it true there are more 5' ESTs than 3' and internal ESTs?
- Why is there a big imbalance in coverage?
- Has the 5' exon been found? Is there an upstream TATAAAA-like sequence?
- Are the ESTs all on the same strand?
- If not, is there any local consistency of strand orientation?
- Where is the main 3' termination?
- Hint: numbering will start at around 1 for poly-A primed ESTs!
- Is there a poly-A signal similar to AATAAA nearby?
- Is this the same poly-A site annotated in the EMBL entry?
- Are alternative poly-A sites used by the gene?
- Are internal exons bordered by plausible splice sites?
- Why are there some HSP "overhangs" at the splice sites?
- The splice acceptor site annotated for the 2nd exon in the EMBL database entry is questionable: do the ESTs support it?
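Looking for a poly-A signal near a termination site can also be done with a short script. The Python sketch below searches a window upstream of a given 3' end for AATAAA and for ATTAAA, a common variant of the signal. The 40-base window is an assumption, as are the function name and the toy sequence.

```python
def find_polya_signals(seq, site, window=40, motifs=("AATAAA", "ATTAAA")):
    """Return sorted (start, end, motif) hits within `window` bases upstream of `site`.

    seq  -- genomic or cDNA sequence
    site -- 1-based position of the 3' end (cleavage / poly-A addition site)
    """
    seq = seq.upper()
    lo = max(0, site - 1 - window)      # 0-based start of the search region
    region = seq[lo:site - 1]           # bases strictly upstream of the site
    hits = []
    for motif in motifs:
        pos = region.find(motif)
        while pos != -1:
            # Convert back to 1-based coordinates on the full sequence.
            hits.append((lo + pos + 1, lo + pos + len(motif), motif))
            pos = region.find(motif, pos + 1)
    return sorted(hits)


# Invented toy sequence: signal at bases 51-56, 3' end at base 77.
toy = "C" * 50 + "AATAAA" + "C" * 20 + "AAAAA"
print(find_polya_signals(toy, site=77))
```

The reported base range is in the form needed for the location field when creating a polyA_signal feature in Artemis.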
These exercises should have shown you that the Gene2EST alignment output is very useful. However, you probably still lack an overview of how things fit together. For that we need to use the Artemis graphical representation.
Using the Gene2EST graphical output with Artemis:
- Click on the Get_EMBL_Entry button. The EMBL flat file output will appear in a new Netscape window:
- What feature types are used to specify the ESTs?
- What feature types are used to specify the repeats?
- Why are there join statements in some EST features?
- Save to file as e.g. hsak1.embl.
- (If Artemis is not open, type art hsak1.embl & to load the entry).
- You can load the file into the same Artemis display used earlier:
- Use File > Read an Entry to load in hsak1.embl.
- Note the checked boxes at top left:
- Try toggling between the original database entry and the Gene2EST results.
- Now examine the new results:
- What colour are the dispersed repeats?
- Do they make up a significant fraction of this gene region?
- Are any repeats found in expressed sequence?
- What colour are the EST matches?
- The expressed region spanning bases ~11,000-12,000 has no joins to other exons:
- Is it likely to be part of the gene?
- Can we be sure either way?
- Comparing the 3' exon in both entries:
- What is unusual about the annotated exon in the EMBL database entry?
- Is there any strong candidate for an alternative splice?
- (Caution, apparent alternative splicing could be artefacts - e.g. a short exon with sequencing errors could be dropped by the parser: if so it will be indicated by XXXs in the alignment output.)
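The join statements in the EST features list the matched segments, so the implied introns can be read straight from the location string. The Python sketch below handles only simple locations of the form join(a..b,c..d), optionally wrapped in complement(...), and ignores other location syntax. The function names and coordinates are invented for illustration.

```python
import re


def parse_location(location):
    """Return (is_complement, [(start, end), ...]) for a simple EMBL location."""
    is_complement = location.startswith("complement(")
    # Optional < and > mark partial features; they are accepted and dropped.
    pairs = re.findall(r"<?(\d+)\.\.>?(\d+)", location)
    return is_complement, [(int(start), int(end)) for start, end in pairs]


def implied_introns(ranges):
    """Return the gaps between consecutive segments, in gene coordinates."""
    ordered = sorted(ranges)
    return [
        (a_end + 1, b_start - 1)
        for (_, a_end), (b_start, _) in zip(ordered, ordered[1:])
        if b_start - a_end > 1
    ]


# Invented example locations.
for location in ("join(100..200,500..600,900..950)",
                 "complement(join(10..20,40..50))"):
    is_complement, ranges = parse_location(location)
    print(is_complement, ranges, implied_introns(ranges))
```

Comparing the implied introns from several ESTs against the annotated exons is one way to spot candidate alternative splices, subject to the caution above about artefacts.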
The Gene2EST graphical display output complements the BLAST and alignment outputs. Each has its uses: BLAST for information on search results and EST entries; Alignment for examining the matching EST sequences in detail; graphical display for the gene structure overview.
A more dramatic example: The human COL1A2 Gene:
- If there is time, you could challenge Gene2EST with a bigger, much more highly spliced gene:
- Use SRS to retrieve the EMBL entry AF004877 in fasta format and submit it to Gene2EST.
- Are there more than 500 ESTs matching this gene?
- Are there more than 50 exons?
- Is the whole gene represented in the EST databases?
With our example gene, using Gene2EST we have been able to learn more about the gene structure than is present in the EMBL database entry (e.g. this suffers from an incorrect splice annotation and an "abnormal" 3' end). Gene2EST will give a good overview of a gene structure, provided that sufficient ESTs are present, and can reveal alternative splicing. It will be useless if there are no ESTs derived from the query gene. Artemis is an excellent tool for providing a graphical overview of genomic sequence. At the Sanger Centre it is used for annotating small genomes (bacteria etc.). Artemis might be useful for researchers at EMBL for keeping track of gene features during experimental projects.