Practical Course on Sequence Analysis
Gene Analysis with the Artemis and Staden Packages
Toby Gibson, Christine Gemuend and Chenna Ramu, May '99
We will look at two graphical sequence analysis packages. Artemis displays annotated genomic sequence in EMBL format. The annotations can be edited, so anyone who works with (or is sequencing) segments of genomic sequence can add custom annotation to their sequence.
The Staden Sequence Analysis Package provides functions for analysing sequences. It is an old package with a new interface and could be rather useful for certain sequence analysis tasks that are not well covered in GCG. Drag-and-drop graphs allow you to assemble customised plots from selected tasks. The program SIP4 has a very powerful set of routines for pairwise sequence comparison. The program NIP4 has options for investigating nucleic acid sequences, although a number of useful gene prediction routines from older Staden releases have not yet been incorporated in the new package.
These programs run under X-windows on Tau, the UNIX server. We will do the practical from UNIX X-terminals, but you should be able to run these programs from a Mac (say, a fast G3) if you install the X emulator MacX 2.0. Installation instructions are given in the accompanying GCG10 course page. For help with UNIX, Björn Kindler has a UNIX course web page and runs regular courses.
In this practical we will use the Staden package for sequence analysis: with Sip4, we will compare two homologous eukaryotic genomic sequences and focus on conserved regions; with Nip4, we will do a bacterial gene prediction. We will use Artemis to display the known sequence features, so that we can see how well the Staden analyses perform. We also need SRS, via Netscape, to extract the test sequences.
Getting started
- Log on to TAU:
- if you are working on a Unix machine, type rlogin tau.
- if you are working on an X-Terminal, choose Tau from the menu for logging in.
- you will get a windows-type desktop and need to open a terminal (i.e. an X-window).
- if you are working on a Mac, use the MacX 2.0 software to open a window on Tau.
- On the unix command line type prepare staden. The Staden package will be set up.
- On the unix command line type prepare artemis. Artemis will be set up.
Getting the sequences for the practical
- Now open Netscape by typing Netscape4 -ncols 64 &.
- This stops Netscape from using up all the colours, as the X-terminals only have 256.
- In a Netscape window, go to SRS and click Start.
- Click the EMBL Box and then click Continue.
- Type EBV in an ID box and click Do Query.
- Click on the EMBL:EBV entry. This is the complete genome of Epstein-Barr Virus.
- Toggle to selected entries. Click on Save.
- Set Use view to Complete Entries and Use mime type to file, then click SAVE.
- Save file as ebv.seq.
- Using the same protocol collect three more DNA sequences:
- RNCCKAR1 and save as ccka.rat.
- MMD605 and save as ccka.mouse.
- EC2MIN and save as ec2min.seq.
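Before starting the exercises it can be worth checking that the saved files really are complete EMBL entries. The following is a minimal Python sketch, not part of Artemis or Staden, that pulls the entry name and the sequence out of the lines of one EMBL flat-file entry and counts its CDS features. It relies only on the standard EMBL layout (ID line, FT feature table lines with the feature key in columns 6-20, SQ line followed by the sequence, and a closing // line); the function names are our own.

```python
def read_embl_sequence(lines):
    """Return (entry_name, sequence) from the lines of one EMBL entry."""
    entry_name = None
    in_sequence = False
    chunks = []
    for line in lines:
        if line.startswith("ID"):
            # The first token after the two-letter line code is the entry name.
            entry_name = line[2:].split()[0].rstrip(";")
        elif line.startswith("SQ"):
            in_sequence = True
        elif line.startswith("//"):
            break
        elif in_sequence:
            # Sequence lines hold blocks of bases plus a running base count;
            # keeping only the letters drops the spaces and the count.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
    return entry_name, "".join(chunks).upper()


def count_cds_features(lines):
    """Count feature-table lines whose feature key (columns 6-20) is CDS."""
    return sum(1 for line in lines
               if line.startswith("FT") and line[5:20].strip() == "CDS")
```

For example, reading ebv.seq into a list of lines and passing it to read_embl_sequence should return the entry name and a sequence whose length answers the question "How big is the sequence?" in Exercise 1.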
Exercise 1. Can Artemis handle >100 kb of sequence?
Using Artemis:
- Type art ebv.seq & to start Artemis with the EBV sequence loaded.
- In a few seconds, the Artemis graphical display should appear.
- (Remember to type prepare artemis if it does not run.)
- There are 3 subwindows. Which window has:
- A display of the features?
- The sequence and 6 frame translation?
- A list of features from the EBV entry feature table?
- Double click on something, anything:
- What happens?
- Try double clicking on lots of things.
- What does a single click do?
- There are 2 horizontal scroll bars - what do they do?
- How fast can you get from one end of the sequence to the other?
- How big is the sequence?
- There are 3 vertical scroll bars - what do they do?
- Click on a large open reading frame.
- Now invoke View > Show Feature Statistics.
- Is it true that A/T-rich codons are preferred?
- Now invoke Display > GC Content %.
- Are the regions of highest G/C content outside coding regions?
- What feature do they correspond to?
When you have finished looking at EBV, close its window.
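The GC content plot used above is essentially a sliding-window calculation. The sketch below shows the idea in Python; it is not Artemis code, and the window and step values are assumptions chosen for illustration, since the window Artemis uses is not stated here.

```python
def gc_percent_windows(sequence, window=120, step=1):
    """Return (centre_position, percent_GC) for each window along the sequence.

    Positions are 1-based. window and step are illustrative defaults only.
    """
    seq = sequence.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        results.append((start + window // 2 + 1, 100.0 * gc / window))
    return results
```

Sorting the returned list by the second value gives the positions of the most G/C-rich windows, which can then be compared with the feature list in Artemis.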
Exercise 2. Comparative nucleotide sequence analysis with SIP4
One way to look for functional motifs in DNA is to compare sequences from different species and note the most conserved elements. For example, conserved segments in aligned promoters might reveal transcription factor binding sites or, as in the example here, aligning complete genes may help to reveal coding exons. SIP4 does pairwise alignments in two ways: by DotPlot and by dynamic programming alignment. It can handle very large sequences, so it can be used for eukaryotic gene comparisons. We will use Artemis to display the known gene features to see whether we could have found them with SIP4.
Pairwise comparison with Sip4:
- Type sip4 & to start.
- In a few seconds, the Sip4 main window should appear.
- (Remember to type prepare staden if it does not run.)
- (The database extraction facility in Sip4 is not yet running. We may be able to get it working in the future via SRS links.)
- Invoke File > Load Sequences > Simple.
- Set the toggles to personal file for both sequences.
- Using the Browse buttons, load ccka.mouse and ccka.rat.
- Under Options, turn off Hide duplicate matches (otherwise you only get half the dotplot).
- Now do Comparison > Find similar spans.
- The dotplot will take a moment or two to calculate.
- Diagonals indicate regions of similar sequence.
- Staggered breaks indicate deletions or insertions - are there any?
- If the dotplot colour is not strong we should change it:
- With the right mouse button, click on the coloured handle at top right and select configure.
- Choose e.g. a line width of 2 and pure blue colour.
- How does that look? If needed repeat till you like it.
- It turns out that only part of the sequences overlap, so we can plot just this region.
- Again do Comparison > Find similar spans.
- Using the crosshairs, define appropriate new start and stop positions for the sequences.
- (Note there is an annoying bug in the window - your cursor must not pass over a Seq Identifier box or the positions are reset! Hopefully this will be fixed soon.)
- The new plot appears superposed on the original.
- Using the middle button, drag the coloured handle of the first plot away.
- You've split the graphs! And the second plot is automatically resized to the residue range.
- In fact you can now delete this first plot using the right button and the Remove option.
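To see what a span-based dotplot is doing, it can help to write the comparison out. The sketch below is a simplified Python illustration, not Sip4's implementation: every window of one sequence is compared with every window of the other, and a dot is recorded wherever the number of identical bases reaches a threshold. The span and min_score values are assumptions for illustration.

```python
def similar_spans(seq1, seq2, span=11, min_score=9):
    """Return (i, j, score) for window pairs with at least min_score identities.

    i and j are 0-based window start positions in seq1 and seq2.
    Each hit would be drawn as one dot; runs of hits with constant j - i
    form the diagonals seen in the plot.
    """
    a, b = seq1.upper(), seq2.upper()
    hits = []
    for i in range(len(a) - span + 1):
        for j in range(len(b) - span + 1):
            score = sum(1 for k in range(span) if a[i + k] == b[j + k])
            if score >= min_score:
                hits.append((i, j, score))
    return hits
```

This direct version does work proportional to the product of the two sequence lengths times the span, so it is only practical for short test sequences. A shift in j - i between successive runs of hits corresponds to the staggered breaks that mark an insertion or deletion.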
Investigating conserved regions of the plot:
- The alignment path is very clear in the plot with almost unbroken diagonals.
- In fact, if we expand the scale we can see that some regions are much less well conserved.
- Use the X and Y scale bars and test all the mouse buttons.
- Note the rough sequence coordinates of a strongly conserved region.
- With Results > Display Sequences, open the sequence display window.
- Using the sequence sliders, what happens in the dotplot window?
- Slide the sequences until the crosshairs are at the end of a conserved segment.
- Click on Nearest Match to align the sequences at the crosshairs.
- Can you see a match to the appropriate splice site consensus?
- Donor Consensus: (c/a)AG^GT(A/g)AGt
- Acceptor Consensus: (T>C)nN(C>T)AG^gt
- Now move the sequences so the crosshairs are at the other side of the conserved segment.
- Can you find another splice consensus? Have you defined a candidate coding exon?
- Try to find more exons by investigating other conserved regions.
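Looking for donor sites by eye can be backed up with a crude scan. The Python sketch below scores every position against the nine-base donor consensus given above, insisting only on the GT immediately after the exon boundary. It is a rough aid, not a splice site predictor: upper and lower case consensus letters are weighted equally, and the min_matches threshold is an assumption.

```python
# Donor consensus (c/a)AG^GT(A/g)AGt: three exon bases, then six intron bases.
DONOR = ["CA", "A", "G", "G", "T", "AG", "A", "G", "T"]


def donor_candidates(sequence, min_matches=7):
    """Return (boundary, matches) for candidate donor sites.

    boundary is the 1-based position of the last exon base (just before ^).
    Only sites with GT immediately after the boundary are considered.
    """
    seq = sequence.upper()
    found = []
    for i in range(len(seq) - len(DONOR) + 1):
        site = seq[i:i + len(DONOR)]
        if site[3:5] != "GT":
            continue
        matches = sum(1 for base, allowed in zip(site, DONOR) if base in allowed)
        if matches >= min_matches:
            found.append((i + 3, matches))
    return found
```

Candidates that fall at the edge of a conserved block in the dotplot are the ones worth inspecting in the sequence display.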
Cross-checking with Artemis:
- Load either ccka.mouse or ccka.rat into Artemis.
- How do the CDSs match to the dotplot conservation?
- Are they all conserved?
- Are there any other conserved regions?
- Why might such regions be conserved?
Sip4 pairwise sequence alignment:
- Select Comparison > Align sequences and click OK for the default alignment settings.
- The alignment will appear in the main window and in the dotplot.
- How did it do?
- Did it jump the big indel?
- Are there any conserved regions in the dotplot that it missed?
- (You could change parameters to try to improve the alignment, but there is probably not time.)
- If you want to save the alignment (or any other text output) from Sip4, you need to invoke Redirect output.
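For reference, the dynamic programming idea behind a global pairwise alignment can be sketched in a few lines of Python. This is a textbook Needleman-Wunsch alignment with a simple per-base gap penalty, not Sip4's code; the match, mismatch and gap values are assumptions, and Sip4's default settings are not reproduced here.

```python
def global_align(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_seq1, aligned_seq2) for a global alignment."""
    a, b = seq1.upper(), seq2.upper()
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the matrix: each cell is the best of a diagonal step
    # (match or mismatch) or a gap in either sequence.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one best path.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))
```

With a per-base gap penalty like this one, a long indel is very expensive, which is one reason an alignment may fail to jump it. Penalties that charge less for extending a gap than for opening one are the usual remedy.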
Notes:
- (1) If the comparison was too divergent to yield a single best alignment, the Local Alignment option could be used to recover the strongest partial alignments.
- (2) You cannot currently save the plots from Staden programs. To make figures, you would need to do a screen dump and work with that.
Exercise 3. Bacterial Gene Prediction with Nip4
Encoding proteins leads to sequence biases in the DNA. This property can be exploited for gene prediction, and we will use one of the earliest algorithms as implemented in Nip4. The algorithm simply looks for regions of sequence that score highly as encoding an "average globular protein". It works well for prokaryotes but needs large window spans, so it is not suitable for highly spliced genes with short exons.
Gene Prediction in Nip4:
- Type nip4 & to start.
- In a few seconds, the Nip4 main window should appear.
- (Remember to type prepare staden if it does not run.)
- (The database extraction facility in Nip4 is not yet running. We may be able to get it working in the future via SRS links.)
- Invoke File > Load Sequences > Simple.
- Set the toggle to personal file.
- Using the Browse button, load ec2min.seq.
- Now invoke Search > Protein genes > base prefs and click OK.
- A plot with the 3-frame prediction will appear.
- These plots can be customised and manipulated in the same way as for Sip4.
- Are there any high peaks (i.e. predictions) simultaneously in more than one frame?
- Now plot the stop codons and start codons from the Search menu.
- How well do the predictions and stop codons fit together?
- Count the number of predicted protein-coding genes in each reading frame.
- If you like, complement the sequence and check the predictions for the second strand.
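The stop codon plot and the strand complement step above are easy to reproduce outside Nip4. The Python sketch below lists stop codon positions in the three forward frames, finds stop-free stretches of a minimum length, and reverse-complements the sequence so the same functions can be run on the second strand. It does not implement the base preference scoring itself; the min_codons cut-off is an assumption.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse complement, for analysing the second strand."""
    return sequence.upper().translate(COMPLEMENT)[::-1]


def stop_positions(sequence):
    """Return {frame: [1-based start of each stop codon]} for frames 1 to 3."""
    seq = sequence.upper()
    frames = {1: [], 2: [], 3: []}
    for i in range(len(seq) - 2):
        if seq[i:i + 3] in STOPS:
            frames[i % 3 + 1].append(i + 1)
    return frames


def open_stretches(sequence, min_codons=100):
    """Return (frame, start, end) for stop-free stretches of >= min_codons.

    Coordinates are 1-based and inclusive; the stop codon is not included.
    """
    seq = sequence.upper()
    stretches = []
    for frame in range(3):
        start = frame   # 0-based start of the current stop-free stretch
        last = frame    # 0-based position just past the last complete codon
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                if (i - start) // 3 >= min_codons:
                    stretches.append((frame + 1, start + 1, i))
                start = i + 3
            last = i + 3
        if (last - start) // 3 >= min_codons:
            stretches.append((frame + 1, start + 1, last))
    return stretches
```

Long stop-free stretches that coincide with high peaks in the Nip4 plot for the same frame are the most convincing gene candidates.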
Evaluate predictions with Artemis:
- Load ec2min.seq into Artemis.
- How do the annotated CDSs match to the gene predictions?
Note:
Older versions of the Staden package have options to search for bacterial promoters and translation starts (Shine-Dalgarno sequence). Currently, these options have not been implemented in Nip4. You could do this crudely with the string search by typing in a consensus sequence.
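The crude string search suggested above can also be sketched in Python. The example looks for exact copies of a motif followed, a short distance downstream, by an ATG. The default motif AGGAGG is the commonly quoted Shine-Dalgarno consensus, and the spacing limits are assumptions for illustration; real sites often match the consensus only partially, so an exact search like this will miss many of them.

```python
def shine_dalgarno_candidates(sequence, motif="AGGAGG", min_gap=4, max_gap=12):
    """Return (motif_position, atg_position) pairs, both 1-based.

    A hit is an exact copy of motif followed by an ATG that starts
    min_gap to max_gap bases after the end of the motif. Only the
    nearest such ATG is reported for each motif copy.
    """
    seq = sequence.upper()
    hits = []
    pos = seq.find(motif)
    while pos != -1:
        motif_end = pos + len(motif)
        for gap in range(min_gap, max_gap + 1):
            atg = motif_end + gap
            if seq[atg:atg + 3] == "ATG":
                hits.append((pos + 1, atg + 1))
                break
        pos = seq.find(motif, pos + 1)
    return hits
```

The ATG positions returned can be checked against the start codon plot in Nip4 and against the annotated CDS starts in Artemis.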
Take Home Lessons
We looked at several different genomic sequences with Artemis and Staden. These programs are likely to be useful to anyone who has to analyse large DNA sequences. The graphical displays give an overview of sequence features that would be difficult to get in any other way. The pairwise gene comparison in Sip4 would complement the servers we investigated in the eukaryotic gene prediction course well. The overall utility of the Staden package depends on the number of analytical tools that are included. Currently these are a bit limited, but it may be worthwhile to look at future releases of the package, as it is continually maintained and developed.
You can find this page at http://www.embl-heidelberg.de/~seqanal/courses/spring99/GeneAnal.html