Biocomputing Unit
Biocomputing
Sequence Analysis Service
Gibson Group
EMBL

Practical Course on Sequence Analysis

Gene Analysis with the Artemis and Staden Packages

Toby Gibson, Aidan Budd, Christine Gemünd and Chenna Ramu, July 01


We will look at two graphical sequence analysis packages with useful functions for handling large sequences. Artemis is a package for displaying annotated genomic sequence in EMBL format. The annotations can be edited, so anyone who works with (or is sequencing) segments of genomic sequence can add custom annotation to their sequence. The Staden Sequence Analysis Package provides functions for analysing sequences. It is an old package with a new interface and could be useful for certain sequence analysis tasks that are not well covered in GCG. Drag-and-drop graphs allow you to assemble customised plots from selected tasks. The program SIP4 has a very powerful set of routines for pairwise sequence comparison. The program NIP4 has options for investigating nucleic acid sequences, although a number of useful gene prediction routines from older Staden releases have not yet been incorporated in the new package.

These programs are available under X Windows on Tau, the UNIX server. We will do the practical on Linux PCs, but Artemis is portable and also runs well on G3 or G4 Macs. For help with UNIX, Björn Kindler has a UNIX course web page and runs regular courses.

In this practical we will use the Staden package for sequence analysis: with Sip4, we will compare two homologous eukaryotic genomic sequences and focus on conserved regions; with Nip4 we will do a bacterial gene prediction. We will use Artemis to display the known sequence features in order to find out how well we are doing with the Staden analyses. We also need to use SRS via Netscape for extracting the test sequences.


Getting started

Getting the sequences for the practical

Exercise 1. Can Artemis handle >100 kb of sequence?

Using Artemis:

When you have finished looking at EBV, close its window.

Exercise 2. Comparative nucleotide sequence analysis with SIP4

One way to look for functional motifs in DNA is to compare sequences from different species and note the most conserved elements. For example, conserved segments in aligned promoters might reveal transcription factor binding sites or, as in the example here, aligning complete genes may help to reveal coding exons. SIP4 does pairwise alignments in two ways: by dot plot and by dynamic programming alignment. It can handle very large sequences, so it can be used for eukaryotic gene comparisons. We will use Artemis to display the known gene features to see if we could find them with SIP4.
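To make the dot plot idea concrete, here is a minimal Python sketch of a windowed identity comparison. It is only an illustration of the general technique, not the SIP4 implementation; the function name `dot_plot` and the parameters `window` and `min_matches` are assumptions chosen for this example.

```python
def dot_plot(seq_a, seq_b, window=11, min_matches=9):
    """Return (i, j) window start positions where the two windows of
    length `window` share at least `min_matches` identical bases.

    Illustrative sketch only: it compares every window pair directly,
    so it is far too slow for the large sequences SIP4 can handle.
    """
    seq_a, seq_b = seq_a.upper(), seq_b.upper()
    hits = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Count identical positions between the two windows.
            matches = sum(
                1 for k in range(window) if seq_a[i + k] == seq_b[j + k]
            )
            if matches >= min_matches:
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    # Toy sequences sharing the segment ACGT.
    for i, j in dot_plot("AAACGTAAA", "CCACGTCC", window=4, min_matches=4):
        print(i, j)
```

Each hit would be drawn as a dot at (i, j); conserved segments such as exons then show up as diagonal runs of dots. Raising `window` and `min_matches` suppresses background noise at the cost of missing short conserved elements.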

Pairwise comparison with Sip4:

Investigating conserved regions of the plot:

Cross-checking with Artemis:

Sip4 pairwise sequence alignment:

Cross-checking with the mouse ccka mRNA:

Notes:

Exercise 3. Bacterial Gene Prediction with Nip4

Encoding proteins leads to sequence biases in the DNA. This property can be exploited for gene prediction and we will use one of the earliest algorithms as implemented in Nip4. The algorithm simply looks for regions of sequence that score highly as encoding an "average globular protein". It works well for prokaryotes but needs large window spans, so it is not suitable for highly spliced genes with short exons.
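The general idea of scoring windows for coding bias can be sketched in Python as follows. This is not the algorithm implemented in Nip4: as a stand-in it uses a simple codon log-odds table estimated from in-frame training sequences, and the names `codon_log_odds`, `frame_scores`, `pseudocount` and `window_codons` are hypothetical.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_log_odds(training_cds, pseudocount=1.0):
    """Log-odds of each codon versus a uniform background (1/64),
    estimated from in-frame coding sequences (assumed training data)."""
    counts = Counter()
    for cds in training_cds:
        cds = cds.upper()
        for i in range(0, len(cds) - 2, 3):
            codon = cds[i:i + 3]
            if codon in CODONS:  # skip codons with ambiguity codes
                counts[codon] += 1
    total = sum(counts.values()) + pseudocount * len(CODONS)
    return {
        c: math.log((counts[c] + pseudocount) / total * len(CODONS))
        for c in CODONS
    }


def frame_scores(seq, table, window_codons=30):
    """For each forward reading frame (0, 1, 2) return a list of
    (window start position, mean codon score) tuples."""
    seq = seq.upper()
    result = {}
    for frame in range(3):
        codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        scores = [table.get(c, 0.0) for c in codons]
        windows = []
        for w in range(len(scores) - window_codons + 1):
            mean = sum(scores[w:w + window_codons]) / window_codons
            windows.append((frame + 3 * w, mean))
        result[frame] = windows
    return result
```

Plotting the three frames against sequence position gives one curve per frame; a frame that stays high over a long stretch is a candidate coding region. The window has to span many codons for the averages to be stable, which is why an approach of this kind struggles with short exons.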

Gene Prediction in Nip4:

Evaluate predictions with Artemis:

Note:

Older versions of the Staden package have options to search for bacterial promoters and translation starts (Shine-Dalgarno sequence). Currently, these options have not been implemented in Nip4. You could do this crudely with the string search by typing in a consensus sequence.
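Outside the package, a crude consensus search of this kind can be sketched in a few lines of Python. The consensus `AGGAGG` is the commonly quoted Shine-Dalgarno core and is used here as an assumption; the function name `find_consensus` and the `max_mismatches` parameter are hypothetical and do not describe how the Nip4 string search behaves.

```python
def find_consensus(seq, consensus="AGGAGG", max_mismatches=1):
    """Return (position, matched text, mismatches) for every window of
    `seq` that differs from `consensus` at no more than
    `max_mismatches` positions. Forward strand only."""
    seq = seq.upper()
    consensus = consensus.upper()
    n = len(consensus)
    hits = []
    for i in range(len(seq) - n + 1):
        window = seq[i:i + n]
        mismatches = sum(1 for a, b in zip(window, consensus) if a != b)
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
    return hits


if __name__ == "__main__":
    # Toy sequence with a perfect match a few bases before an ATG.
    for hit in find_consensus("TTTAGGAGGTTTTATG", max_mismatches=0):
        print(hit)
```

A short consensus matches by chance quite often in a genome-sized sequence, so hits are only meaningful when they lie a few bases upstream of a candidate start codon.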


Take Home Lessons

We looked at several different genomic sequences with Artemis and Staden. These programs are likely to be useful to anyone who has to analyse large DNA sequences. The graphical displays give an overview of sequence features that would be difficult to get any other way. The pairwise gene comparison in Sip4 would complement the servers we investigated in the eukaryotic gene prediction course. The overall utility of the Staden package depends on the number of analytical tools that are included. Currently these are a bit limited, but it may be worthwhile to look at future releases of the package, as it is continually maintained and developed.


You can find this page at http://www.embl-heidelberg.de/~seqanal/courses/Jul01/GeneAnal.01.html