Predoc Practical Course on Sequence Analysis
In this practical we will use three GCG programs: Motifs, DotPlot and GrowTree.
GCG is an integrated package of over 130 programs that allows you to manipulate and analyze nucleotide and protein sequences. If you can't do something on the web, you probably can with the GCG package, so it is a good idea to know how to access the package on UNIX. There are two interfaces available for GCG: a command line interface (horrible for inexperienced users) and an X Windows interface. We will use the X interface to do some more analyses on EFTU sequences.
(Note that we are in transition at EMBL. GCG version 9 will soon be installed. The main new feature, SEQLAB, is both a powerful sequence alignment editor and a handy display tool that represents sequence features as symbols. In addition, the computer group will soon install Tau, a powerful DEC alpha computer that will provide central services to EMBL including GCG. The best way to access Tau from your lab will probably be through a Macintosh X emulator (Xoftware or eXodus): several groups have this already and we are now investigating an EMBL site licence.)
Exercise 1. Look for sequence motifs in the EFTU_ECOLI sequence
The Motifs program looks for protein sequence motifs by checking your protein sequence for every sequence pattern in the PROSITE Dictionary. Motifs can recognize the patterns with some of the symbols mismatched, but not with gaps. Although PROSITE is not an exhaustive collection of patterns, any new protein sequence should always be checked against it. You can learn more about PROSITE with SRS.
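To make the idea of "mismatches but no gaps" concrete, here is a minimal Python sketch of this kind of pattern search. It is not the Motifs program itself and makes several assumptions: it only understands fixed-length PROSITE-style elements (a residue, `x`, `[..]`, `{..}` and a single repeat count such as `x(4)`), the function names `parse_pattern` and `find_motif` are invented for this illustration, and the peptide is a short toy fragment rather than the full EFTU_ECOLI entry. The pattern used is the PROSITE ATP/GTP-binding site motif A (P-loop), `[AG]-x(4)-G-K-[ST]`.

```python
import re

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def parse_pattern(pattern):
    """Expand a fixed-length PROSITE-style pattern into one set of
    allowed residues per position (hypothetical helper)."""
    positions = []
    for element in pattern.rstrip(".").split("-"):
        m = re.fullmatch(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)\))?", element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, repeat = m.group(1), int(m.group(2) or 1)
        if core == "x":                      # any residue
            allowed = AMINO_ACIDS
        elif core.startswith("["):           # any of the listed residues
            allowed = set(core[1:-1])
        elif core.startswith("{"):           # any residue except those listed
            allowed = AMINO_ACIDS - set(core[1:-1])
        else:                                # one specific residue
            allowed = {core}
        positions.extend([allowed] * repeat)
    return positions

def find_motif(sequence, pattern, max_mismatches=0):
    """Slide the pattern along the sequence without gaps and report
    (1-based start, matched segment, mismatch count) for each hit."""
    positions = parse_pattern(pattern)
    hits = []
    for start in range(len(sequence) - len(positions) + 1):
        window = sequence[start:start + len(positions)]
        mismatches = sum(res not in allowed
                         for res, allowed in zip(window, positions))
        if mismatches <= max_mismatches:
            hits.append((start + 1, window, mismatches))
    return hits

P_LOOP = "[AG]-x(4)-G-K-[ST]"
peptide = "NVGTIGHVDHGKTTLTAA"   # toy fragment, for illustration only
print(find_motif(peptide, P_LOOP))
```

Raising `max_mismatches` lets a segment with a substituted residue still be reported, which is the behaviour described above; because the window always has the same length as the pattern, insertions or deletions are never tolerated.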
Do GCG Motifs:
Exercise 2. Do a dotplot to compare two sequences for sequence conservation
Dotplots are a useful visual way to see all of the structures that two sequences have in common, or to visualize repeats and inverted repeats within one sequence. Rather cumbersomely, GCG does dotplots in two stages: Compare first finds the points of similarity between the two sequences, and DotPlot then plots them.
Compare compares two protein or nucleic acid sequences and creates a file of the points of similarity between them for plotting with DotPlot. Compare finds the points using either a window/stringency or a word match criterion. The word comparison is 1,000 times faster than the window/stringency comparison, but somewhat less sensitive.
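The difference between the two criteria can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not what Compare actually does internally: matches are scored by simple identity (no scoring matrix), dots are reported as 0-based start positions, and the function names `window_dots` and `word_dots` are made up for this example.

```python
from collections import defaultdict

def window_dots(seq_a, seq_b, window=5, stringency=4):
    """Window/stringency criterion: compare every window of seq_a with
    every window of seq_b and keep pairs with enough identical positions."""
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(a == b for a, b in
                          zip(seq_a[i:i + window], seq_b[j:j + window]))
            if matches >= stringency:
                dots.append((i, j))
    return dots

def word_dots(seq_a, seq_b, word_size=3):
    """Word criterion: index every word of seq_b once, then look up each
    word of seq_a. Only exact word matches produce a dot."""
    index = defaultdict(list)
    for j in range(len(seq_b) - word_size + 1):
        index[seq_b[j:j + word_size]].append(j)
    dots = []
    for i in range(len(seq_a) - word_size + 1):
        for j in index.get(seq_a[i:i + word_size], []):
            dots.append((i, j))
    return dots

a, b = "GATTACA", "GATCACA"          # toy sequences, for illustration only
print(word_dots(a, b, word_size=3))
print(window_dots(a, b, window=3, stringency=2))
```

The word search visits each position once and uses a lookup table, which is why it is so much faster; the window search examines every pair of positions but can still place a dot where one residue in the window differs, which is where its extra sensitivity comes from.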
Do a dotplot in GCG:
Exercise 3. Construct a phylogenetic tree from a distance matrix
Currently, GCG is not the ideal package for drawing trees, but it can produce and display a Neighbour-Joining tree from an alignment. So that we can compare the results with the Clustal X and NJplot trees, we need to use the alignment made by Clustal X.
First run Distance to construct a distance matrix, then use this matrix as input for GrowTree. The distances are expressed as substitutions per 100 bases or amino acids. Several methods are available to correct the distances for multiple substitutions at the same site: for nucleic acid sequences, Kimura's two-parameter method, the Tajima-Nei method, or the Jin-Nei gamma distance method; for protein sequences, the Kimura method; and for either type of sequence, the Jukes-Cantor method. It is also possible to obtain an uncorrected distance.
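As a rough illustration of what "uncorrected" and "corrected" mean here, the following Python sketch computes three of these distances for one pair of aligned DNA sequences, using the standard textbook formulas for the Jukes-Cantor and Kimura two-parameter corrections. It is a sketch under stated assumptions, not the GCG implementation: columns containing a gap or an ambiguous base are simply skipped (how GCG treats them is not covered here), and the function names are invented for this example.

```python
import math

PURINES = set("AG")
PYRIMIDINES = set("CT")

def compare_columns(seq_a, seq_b):
    """Count compared sites, transitions and transversions between two
    aligned DNA sequences, skipping gaps and ambiguous bases."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1        # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1      # purine<->pyrimidine
    return sites, transitions, transversions

def distances_per_100(seq_a, seq_b):
    """Return uncorrected, Jukes-Cantor and Kimura two-parameter distances,
    each expressed as substitutions per 100 bases."""
    sites, ts, tv = compare_columns(seq_a, seq_b)
    if sites == 0:
        raise ValueError("no comparable columns")
    p = (ts + tv) / sites           # observed proportion of differing sites
    P, Q = ts / sites, tv / sites   # transition and transversion proportions
    jukes_cantor = -0.75 * math.log(1 - 4 * p / 3)
    kimura_2p = -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))
    return {
        "uncorrected": 100 * p,
        "jukes_cantor": 100 * jukes_cantor,
        "kimura_2p": 100 * kimura_2p,
    }

# Toy alignment with a single C/T transition in ten columns.
print(distances_per_100("ACGTACGTAC", "ACGTACGTAT"))
```

The corrected values are always at least as large as the uncorrected one, because they estimate substitutions that have been hidden by later changes at the same site. For very divergent sequences the logarithm is undefined (for Jukes-Cantor, when 75% or more of the sites differ), and this sketch will raise an error in that case.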
Load in the EFTU alignment:
Take Home Lessons
We undertook three different tasks with GCG, using different numbers of sequences. This introduction barely touches on what the package can do, but once you can do a few things in GCG, it is fairly straightforward to find your way around and do many other tasks. GCG is not always the best place for doing the more specialised tasks. For example, it is convenient to make a tree in GCG if you are already using the package, but trees are calculated and displayed rather better by many other specialist packages.
If you are not a regular computer user, you may not feel comfortable on UNIX and you will certainly need to know how to make directories, print, delete files etc. Teaching UNIX is the responsibility of the Computer Group: if you need a course on UNIX, go and see them!