Practical Course on Sequence Analysis
Gene Analysis with the Artemis and Staden Packages
Toby Gibson, Aidan Budd, Christine Gemünd and Chenna Ramu, July 01
We will look at two graphical sequence analysis packages with useful functions for handling large sequences. Artemis is a package for displaying annotated genomic sequence in EMBL format. The annotations can be edited, so anyone who works with (or is sequencing) segments of genomic sequence can add custom annotation to their sequence. The Staden
Sequence Analysis Package provides functions for analysing sequences.
It is an old package with a new interface and could be rather useful for
certain sequence analysis tasks that are not well covered in GCG. Drag-and-drop
graphs allow you to assemble customised plots from selected tasks. The
program SIP4 has a very powerful set of routines for pairwise sequence
comparison. The program NIP4 has options for investigating nucleic
acid sequences, although a number of useful gene prediction routines from
older Staden releases have not yet been incorporated in the new package.
These programs are available under X-windows on Tau, the UNIX server. We will do the practical on Linux PCs, but Artemis is portable and also runs very nicely on G3 or G4 Macs. For help with UNIX, Björn Kindler has a UNIX
course web page and runs regular courses.
In this practical we will use the Staden package for sequence
analysis: with Sip4, we will compare two homologous eukaryotic genomic
sequences and focus on conserved regions; with Nip4 we will do a
bacterial gene prediction. We will use Artemis to display the known
sequence features in order to find out how well we are doing with the Staden
analyses. We also need to use SRS via Netscape for extracting the
test sequences.
Getting started
- Login with your EMBL name and password.
- Start the KDE Desktop by typing exec startx.
- Open an Xterm (available from the lower menus of the desktop).
- Type xhost +
- This allows tau to open X-windows on the local machine.
- Log on to tau:
- if you are working on a Unix machine, type rlogin tau.
- Type cd /scrap/your_username to change to the scrap disk for the practical.
- On the unix command line type prepare staden. The Staden package will be set up.
- On the unix command line type prepare artemis. Artemis will be set up.
Getting the sequences for the practical
- Open another Linux Xterm.
- Type cd /net/fileserver4/scrap/yourname/
- We want to work on the central disks and this command should mount /scrap.
- Now start Netscape from the command line.
- In a Netscape window, go to SRS and click Start.
- Click the EMBL Box and then click Continue.
- Type EBV in an ID box and click Do Query.
- Click on the EMBL:EBV entry. This is the complete genome of Epstein-Barr Virus.
- Toggle to selected entries. Click on Save.
- Set Use view to Complete Entries and use mime type to file, and then click SAVE.
- Save file as ebv.seq.
- Using the same protocol collect three more DNA sequences:
- RNCCKAR1 and save as ccka.rat.
- MMD605 and save as ccka.mouse.
- EC2MIN and save as ec2min.seq.
Exercise 1. Can Artemis handle >100 kb of sequence?
Using Artemis:
- Type art ebv.seq & to start Artemis with the EBV sequence loaded.
- In a few seconds, the Artemis graphical display should appear.
- (Remember to type prepare artemis if it does not run.)
- There are 3 subwindows. Which window has:
- A display of the features?
- The sequence and 6 frame translation?
- A list of features from the EBV entry feature table?
- Double click on something, anything:
- What happens?
- Try double clicking on lots of things.
- What does a single click do?
- There are 2 horizontal scroll bars - what do they do?
- How fast can you get from one end of the sequence to the other?
- How big is the sequence?
- There are 3 vertical scroll bars - what do they do?
- Click on a large open reading frame.
- Now invoke the View menu Show Feature Statistics.
- Is it true that A/T-rich codons are preferred?
- Now invoke the Display menu GC Content % option.
- Are the regions of highest G/C content outside coding regions?
- What feature do they correspond to?
When you have finished looking at EBV, close its window.
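The GC Content % display is a sliding-window calculation, and the idea can be tried on any sequence string. The following is a minimal Python sketch, not Artemis's own code: the function names gc_percent and gc_windows, the window and step sizes, and the short demo sequence are all assumptions made for illustration.

```python
def gc_percent(seq: str) -> float:
    """Return the G+C percentage of a DNA string (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


def gc_windows(seq: str, window: int = 120, step: int = 60):
    """Yield (start, gc%) for each complete window; start is 0-based."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, gc_percent(seq[start:start + window])


# Hypothetical toy sequence: an AT-rich stretch, a GC-rich stretch, then AT again.
demo = "ATATATATGCGCGCGCATAT"
profile = list(gc_windows(demo, window=8, step=4))
for start, gc in profile:
    print(start, gc)
```

A small window gives a noisy profile and a large one smooths out short features, so it is worth trying several sizes when comparing the plot with the annotated features.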
Exercise 2. Comparative nucleotide sequence analysis with SIP4
One way to look for functional motifs in DNA is to compare sequences from different species and note the most conserved elements. For example, conserved segments in aligned promoters might reveal transcription factor binding sites or, as in the example here, aligning complete genes may help to reveal coding exons. SIP4 does pairwise alignments in two ways,
by DotPlot and by dynamic programming alignment. It can handle very large
sequences, so it can be used for eukaryotic gene comparisons. We will use Artemis
to display the known gene features to see if we could find them with SIP4.
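The principle behind a span-based dotplot can be sketched in a few lines of Python. This is an illustration only and not how Sip4 is implemented: it scores each pair of windows by simply counting identical bases, and the function name similar_spans, the span length and the score cutoff are assumptions chosen for the example.

```python
def similar_spans(seq_a: str, seq_b: str, span: int = 11, min_score: int = 9):
    """Return (i, j, score) for window pairs with at least min_score identities.

    i and j are 0-based window start positions in seq_a and seq_b.
    Each reported triple corresponds to one dot in the plot.
    """
    a, b = seq_a.upper(), seq_b.upper()
    hits = []
    for i in range(len(a) - span + 1):
        for j in range(len(b) - span + 1):
            score = sum(1 for k in range(span) if a[i + k] == b[j + k])
            if score >= min_score:
                hits.append((i, j, score))
    return hits


# Hypothetical toy input: the first sequence occurs inside the second at offset 2.
hits = similar_spans("ACGTTGCA", "TTACGTTGCATT", span=8, min_score=8)
print(hits)
```

This naive version compares every window with every other window, so it is only practical for short sequences. Runs of hits with a constant difference between i and j form the diagonals seen in the plot.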
Pairwise comparison with Sip4:
- Type sip4 & to start.
- In a few seconds, the Sip4 main window should appear.
- (Remember to type prepare staden if it does not run.)
- (The database extraction facility in Sip4 is not yet running. We may be able to get it working in the future via SRS links.)
- Invoke File > Load Sequences > Simple.
- Set the toggles to personal file for both sequences.
- Using the Browse buttons, load ccka.mouse and ccka.rat.
- Under Options, turn off Hide duplicate matches (or you only get half the dotplot).
- Now do Comparison > Find similar spans.
- The dotplot will take a moment or two to calculate.
- Diagonals indicate regions of similar sequence.
- Staggered breaks indicate deletion or insertion - are there any?
- If the dotplot colour is not strong we should change it:
- With the Right button, click on the coloured handle at top right and select configure.
- Choose e.g. a line width of 2 and pure blue colour.
- How does that look? If needed repeat till you like it.
- It turns out that only part of the sequences overlap so we can plot just this region.
- Again do Comparison > Find similar spans.
- Using the cross hairs, define appropriate new start and stop positions for the sequences.
- The new plot appears superposed on the original.
- Using the middle button, drag the coloured handle of the first plot away.
- You've split the graphs! And the second plot is automatically resized to the residue range.
- In fact you can now delete this first plot using the right button and the Remove option.
Investigating conserved regions of the plot:
- The alignment path is very clear in the plot with almost unbroken diagonals.
- In fact, if we expand the scale we can see that some regions are much less well conserved.
- Use the X and Y scale bars and test all the mouse buttons.
- Note the rough sequence coordinates of a strongly conserved region.
- With Results > Display Sequences, open the sequence display window.
- Using the sequence sliders, what happens in the dotplot window?
- Slide the sequences until the crosshairs are at the end of a conserved segment.
- Click on Nearest Match to align the sequences at the crosshairs.
- Can you see the appropriate splice site consensus match:
- Donor Consensus: c/aAG^GTA/gAGt
- Acceptor Consensus: (T>C)nN(C>T)AG^gt
- Now move the sequences so the cross-hairs are at the other side of the conserved segment.
- Can you find another splice consensus? Have you defined a candidate coding exon?
- Try to find more exons by investigating other conserved regions.
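Looking for donor sites by eye can be backed up with a simple scan. The sketch below is a rough Python illustration and not a Staden routine: it encodes the donor consensus c/aAG^GTA/gAGt as one set of allowed bases per position, requires the GT immediately after the boundary, and counts how many positions agree. The names DONOR, CUT and donor_candidates, the GT requirement and the cutoff of 7 matches are assumptions made for this example.

```python
# Donor consensus c/aAG^GTA/gAGt written as allowed bases per position.
DONOR = ["AC", "A", "G", "G", "T", "AG", "A", "G", "T"]
CUT = 3  # the exon/intron boundary (^) falls after the third position


def donor_candidates(seq: str, min_matches: int = 7):
    """Return (boundary, site, matches) for candidate donor sites.

    boundary is the 0-based index of the first intron base (the G of GT).
    Sites without GT directly after the boundary are skipped.
    """
    seq = seq.upper()
    n = len(DONOR)
    found = []
    for i in range(len(seq) - n + 1):
        site = seq[i:i + n]
        if site[CUT:CUT + 2] != "GT":
            continue
        matches = sum(base in allowed for base, allowed in zip(site, DONOR))
        if matches >= min_matches:
            found.append((i + CUT, site, matches))
    return found


# Hypothetical toy sequence containing one perfect donor site.
candidates = donor_candidates("TTTCAGGTAAGTCCC")
print(candidates)
```

Real donor sites often match the consensus only partially, so a scan like this produces false positives and misses weak sites. Conservation between mouse and rat at the same position is the stronger evidence.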
Cross-checking with Artemis:
- Load either ccka.mouse or ccka.rat into Artemis.
- How do the CDSs match to the dotplot conservation?
- Are they all conserved?
- Are there any other conserved regions?
- Why might such regions be conserved?
Sip4 pairwise sequence alignment:
- Select Comparison > Align sequences and click OK for the default alignment settings.
- The alignment will appear in the main window and in the dotplot.
- How did it do?
- Did it jump the big indel?
- Are there any conserved regions in the dotplot that it missed?
- (You could change parameters to try to improve the alignment, but there is probably not time.)
- If you want to save the alignment (or any other text output) from Sip4, you need to invoke Redirect output.
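To see what a dynamic programming alignment does, it helps to run a small one by hand. The following Python sketch is a textbook global alignment with a linear gap penalty. It is not the Sip4 algorithm, whose scoring scheme and gap model are not described here, and the function name global_align and the scores (match 1, mismatch -1, gap -2) are assumptions for illustration.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, row_a, row_b = global_align("ACGT", "AGT")
print(best)
print(row_a)
print(row_b)
```

With a linear gap penalty every gap position costs the same, so one long indel is as expensive as many scattered short gaps. Whether an alignment jumps a big indel therefore depends strongly on the gap parameters.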
Cross-checking with the mouse ccka mRNA:
- Get the mouse mRNA sequence and save as ccka.mrna
- Load ccka.mrna in place of the rat sequence.
- Now do Comparison > Find similar spans.
- Resize the plot so that the diagonals are approx 45° - this will look more sensible.
- Can you see the exons in the plot?
- Is this a good way to get an overview of a gene's coding content?
Notes:
- (1) If the comparison was too divergent to yield a single best alignment, the Local Alignment option could be used to recover the strongest partial alignments.
- (2) You cannot currently save the plots from Staden programs. To make figures, you would need to do a screen dump and work with that.
Exercise 3. Bacterial Gene Prediction with Nip4
Encoding proteins leads to sequence biases in the DNA. This property
can be exploited for gene prediction and we will use one of the earliest
algorithms as implemented in Nip4. The algorithm simply looks for
regions of sequence that score highly as encoding an "average globular
protein". It works well for prokaryotes but needs large window spans so
is not suitable for highly spliced genes with short exons.
Gene Prediction in Nip4:
- Type nip4 & to start.
- In a few seconds, the Nip4 main window should appear.
- (Remember to type prepare staden if it does not run.)
- (The database extraction facility in Nip4 is not yet running. We may be able to get it working in the future via SRS links.)
- Invoke File > Load Sequences > Simple.
- Set the toggle to personal file.
- Using the Browse button, load ec2min.seq.
- Now invoke Search > Protein genes > base prefs and click OK.
- A plot with the 3-frame prediction will appear.
- These plots can be customised and manipulated in the same way as for Sip4.
- Are there any high peaks (i.e. predictions) simultaneously in more than one frame?
- Now Plot Stop codons and Start Codons from the Search menu.
- How well do the predictions and stop codons fit together?
- Count the number of predicted protein-coding genes in each reading frame.
- If you like, complement the sequence and check the prediction for the second strand.
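The stop codon plot and the second-strand check can be reproduced in a simple form outside Nip4. The Python sketch below is not the base preference method used by Nip4. It only maps stop codons in one reading frame and reports the stop-free stretches between them. The function names, the min_codons cutoff and the toy sequence are assumptions made for illustration.

```python
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement, used to analyse the second strand."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def stop_positions(seq: str, frame: int):
    """Return 0-based start positions of stop codons in frame 0, 1 or 2."""
    seq = seq.upper()
    return [i for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] in STOPS]


def open_stretches(seq: str, frame: int, min_codons: int = 100):
    """Return (start, end) of stop-free stretches with at least min_codons codons.

    end is exclusive and excludes the stop codon that closes the stretch.
    """
    stretches = []
    start = frame
    # End of the last complete codon in this frame, used as a final boundary.
    last = frame + 3 * ((len(seq) - frame) // 3)
    for stop in stop_positions(seq, frame) + [last]:
        if (stop - start) // 3 >= min_codons:
            stretches.append((start, stop))
        start = stop + 3
    return stretches


# Hypothetical toy sequence: ATG AAA TAA GGG CCC TGA
toy = "ATGAAATAAGGGCCCTGA"
print(stop_positions(toy, 0))
print(open_stretches(toy, 0, min_codons=2))
print(open_stretches(reverse_complement(toy), 0, min_codons=2))
```

A long stop-free stretch is only a candidate gene. Comparing these stretches with the peaks of the 3-frame prediction shows where the two kinds of evidence agree.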
Evaluate predictions with Artemis:
- Load ec2min.seq into Artemis.
- How do the annotated CDSs match to the gene predictions?
Note:
Older versions of the Staden package have options to search
for bacterial promoters and translation starts (Shine-Dalgarno sequence).
Currently, these options have not been implemented in Nip4. You
could do this crudely with the string search by typing in a consensus sequence.
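A crude consensus search of this kind is easy to sketch in Python. The example below is an illustration and not a Staden function: it slides the consensus along the sequence and reports every position within a given number of mismatches. The function name find_consensus, the mismatch limit and the toy sequence are assumptions, and AGGAGG is used as the commonly quoted Shine-Dalgarno consensus.

```python
def find_consensus(seq: str, consensus: str = "AGGAGG", max_mismatches: int = 1):
    """Return (position, site, mismatches) for sites close to the consensus.

    position is the 0-based start of the site in seq.
    """
    seq = seq.upper()
    n = len(consensus)
    hits = []
    for i in range(len(seq) - n + 1):
        site = seq[i:i + n]
        mismatches = sum(1 for x, y in zip(site, consensus) if x != y)
        if mismatches <= max_mismatches:
            hits.append((i, site, mismatches))
    return hits


# Hypothetical toy sequence with one exact site a few bases before an ATG.
sd_hits = find_consensus("TTTAGGAGGTTTTTATG", max_mismatches=0)
print(sd_hits)
```

Short consensus sequences match by chance very often, so hits are only interesting when they lie a short distance upstream of a candidate start codon.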
Take Home Lessons
We looked at several different genomic sequences with Artemis
and Staden. These programs are likely to be useful to anyone who has to
analyse large DNA sequences. The graphical displays give an overview of
sequence features that would be difficult to get in any other way.
The pairwise gene comparison in Sip4
would complement the servers we investigated in the eukaryotic gene
prediction course. The overall utility of the Staden package depends
on the number of analytical tools that are included. Currently these are
a bit limited, but it may be worthwhile to look at future releases of the
package, as it is continually maintained and developed.
You can find this page at http://www.embl-heidelberg.de/~seqanal/courses/Jul01/GeneAnal.01.html