GCG is an integrated package of over 130 programs that allows you to manipulate and analyze nucleotide and protein sequences. If you can't do something on the web, you probably can with the GCG package, so it is a good idea to know how to access the package on UNIX. There are two interfaces available for GCG: a command line interface (difficult for inexperienced users) and an X Windows interface. We will use the X interface to do some analyses, mainly on EFTU sequences.
GCG version 9.1 is now installed on TAU, a powerful DEC alpha computer that provides central services to EMBL including GCG. The main new feature of GCG 9.1, SeqLab, is both a versatile sequence alignment editor and a handy display tool that represents sequence features as symbols. The best way to access TAU from your lab will probably be through a Macintosh X Windows emulator (MacX v.2, Xoftware or eXodus). The computer group is arranging an institute licence for MacX 2 and will place it on the Mac server. You can see us for a quick lesson or for help with installation. If you need help with UNIX, register for one of the weekly basic UNIX courses run by Bjoern Kindler.
In this practical we will use several GCG programs, namely Motifs, SeqLab, CoilScan, Compare/DotPlot, Gap, BestFit, PileUp and Distances/GrowTree. This will provide experience in (1) examining a single sequence, (2) looking for sequence similarity between two sequences and (3) aligning a group of related sequences. On-line help is available for GCG.
Exercise 1a. Look for sequence motifs in the EFTU_ECOLI sequence
The Motifs program looks for protein sequence motifs by checking your protein sequence for every sequence pattern in the PROSITE Dictionary. Motifs can recognize the patterns with some of the symbols mismatched, but not with gaps. Although PROSITE is not an exhaustive collection of patterns, any new protein sequence should always be checked against it. You can learn more about PROSITE with SRS.
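To make the matching rule concrete, here is a minimal Python sketch (not GCG's own implementation) of scanning a sequence for a fixed-length PROSITE-style pattern while tolerating mismatched symbols but no gaps. The function names, the max_mismatches parameter and the short test peptide are assumptions made for illustration; the pattern shown is the PROSITE P-loop motif [AG]-x(4)-G-K-[ST]. Variable-length elements such as x(2,4) are deliberately not supported.

```python
import re

# One pattern element: x, a residue, [allowed], or {forbidden}, optionally (n).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)\))?")

def parse_pattern(pattern):
    """Expand a fixed-length PROSITE-style pattern into per-position tests."""
    positions = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, repeat = m.group(1), int(m.group(2) or 1)
        if core == "x":
            test = (None, True)                      # any residue
        elif core[0] == "[":
            test = (frozenset(core[1:-1]), True)     # one of these
        elif core[0] == "{":
            test = (frozenset(core[1:-1]), False)    # none of these
        else:
            test = (frozenset(core), True)           # exactly this residue
        positions.extend([test] * repeat)
    return positions

def position_matches(residue, test):
    residues, allowed = test
    if residues is None:
        return True
    return (residue in residues) == allowed

def find_motif(sequence, pattern, max_mismatches=0):
    """Return (1-based start, segment, mismatches) for every ungapped hit."""
    positions = parse_pattern(pattern)
    width = len(positions)
    hits = []
    for start in range(len(sequence) - width + 1):
        segment = sequence[start:start + width]
        mismatches = sum(
            not position_matches(r, t) for r, t in zip(segment, positions)
        )
        if mismatches <= max_mismatches:
            hits.append((start + 1, segment, mismatches))
    return hits

if __name__ == "__main__":
    peptide = "VGTIGHVDHGKTTLTA"  # hypothetical test peptide
    print(find_motif(peptide, "[AG]-x(4)-G-K-[ST]"))
```

Raising max_mismatches makes the scan more permissive, at the cost of more chance hits; every reported hit still needs checking against what is known about the protein.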
Do GCG Motifs:
Exercise 1b. Check for special regions in your sequence using SeqLab
GCG's SeqLab interface has two modes: the Main List mode and the sequence Editor mode. In the Editor mode you can edit your sequence (inserting, deleting, copying) and investigate it further by checking for features annotated in the database entry, such as domains, helices, strands, turns and ligand binding sites.
Check for features in the sequence:
Exercise 1c. Check for coiled-coils in a protein sequence
CoilScan uses the method of Lupas et al. (see the ACKNOWLEDGMENT topic) to find coiled-coil segments in protein sequences by comparing each residue in the sequence to a weight matrix tabulated from known coiled-coil protein segments. A coiled-coil probability is calculated for each residue in the protein, and those segments whose probabilities meet or exceed a threshold probability you set are reported in the output. This prediction method works only for solvent exposed (amphipathic) coiled coils, particularly for parallel and antiparallel two-stranded coiled coils and for parallel three-stranded coiled coils.
Do GCG Coilscan:
Exercise 2a. Do a dotplot to compare two sequences for sequence conservation
Dotplots are a useful visual way to see all of the segments in common between two sequences or to visualize repeated or inverted repeated structures in one sequence. Rather cumbersomely, GCG does dotplots in two stages. DotPlot is the second part of a pair of programs that generate dotplots for the points of similarity between two sequences.
Compare compares two protein or nucleic acid sequences and creates a file of the points of similarity between them for plotting with DotPlot. Compare finds the points using either a window/stringency or a word match criterion. The word comparison is 1,000 times faster than the window/stringency comparison, but somewhat less sensitive.
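The difference between the two criteria can be illustrated with a small Python sketch. This is a simplification under stated assumptions, not how Compare is implemented: the function names and parameters are hypothetical, and the window criterion here counts identities only, whereas a real program would typically score the window with a substitution matrix.

```python
def window_dots(seq_a, seq_b, window=5, stringency=4):
    """Window/stringency: a dot wherever at least `stringency` of the
    `window` aligned positions are identical. Compares every pair of windows."""
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(
                a == b
                for a, b in zip(seq_a[i:i + window], seq_b[j:j + window])
            )
            if matches >= stringency:
                dots.append((i, j))
    return dots

def word_dots(seq_a, seq_b, word=4):
    """Word match: a dot only where a whole word is identical.
    A lookup table of words in seq_b avoids comparing every pair."""
    index = {}
    for j in range(len(seq_b) - word + 1):
        index.setdefault(seq_b[j:j + word], []).append(j)
    dots = []
    for i in range(len(seq_a) - word + 1):
        for j in index.get(seq_a[i:i + word], []):
            dots.append((i, j))
    return dots

if __name__ == "__main__":
    a, b = "ACGTACGT", "ACGTTCGT"
    print("word:  ", word_dots(a, b, word=4))
    print("window:", window_dots(a, b, window=4, stringency=3))
```

The word method is fast because it looks words up instead of comparing all window pairs, but it misses segments that are similar without containing a perfectly matching word, which is why it is less sensitive.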
Make a dotplot in GCG:
Exercise 2b. Use Bestfit to compare two sequences for partial sequence similarity
BestFit makes an optimal alignment of the best segment of similarity between two sequences. Optimal alignments are found by inserting gaps to maximize the number of matches using the local alignment algorithm of Smith and Waterman. This is useful, for example, when comparing modular proteins that share some, but not all, of the same domains.
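As an illustration of local alignment, the following Python sketch implements a basic Smith-Waterman recurrence with traceback. It is a teaching sketch, not BestFit itself: the scoring values (match, mismatch and a single linear gap penalty) and the test sequences are assumptions chosen for simplicity, whereas a real program would normally use a substitution matrix with separate gap creation and extension penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned segment of a, aligned segment of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; scores never drop below zero, which is what
    # lets the alignment start and end anywhere (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

if __name__ == "__main__":
    # Hypothetical sequences sharing one segment flanked by unrelated residues.
    print(smith_waterman("AAAGHVDHGKTCCC", "TTGHVDHGKTLL"))
```

Only the shared segment is reported; the unrelated flanking residues are left out of the alignment, which is the behaviour you want for proteins that share a single domain.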
Do a Bestfit in GCG:
Exercise 2c. Use Gap to compare two sequences for overall sequence similarity
Gap uses the global alignment algorithm of Needleman and Wunsch to find the best alignment of two complete sequences that maximizes the number of matches and minimizes the number of gaps. (It is inappropriate to use Gap for comparing multidomain proteins if they are only partially related.)
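For comparison with local alignment, here is a minimal Python sketch of the Needleman-Wunsch recurrence, in which both sequences must be aligned from end to end. Again this is a simplified illustration rather than Gap's implementation: the scores and the single linear gap penalty are assumptions, and the example sequences are arbitrary.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned a, aligned b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]

    # Leading gaps are penalised: the alignment must cover both sequences.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap

    def score(x, y):
        return match if x == y else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(
                F[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                F[i - 1][j] + gap,
                F[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to the top-left corner.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return F[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))

if __name__ == "__main__":
    s, top, bottom = needleman_wunsch("GATTACA", "GATACA")
    print(s)
    print(top)
    print(bottom)
```

Because every residue of both sequences has to appear in the result, unrelated regions are forced into the alignment and paid for with mismatches and gaps, which is why a global method suits sequences related over their whole length.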
Do a Gap in GCG:
Exercise 3a: Make a Multiple Alignment
So far we have only aligned two sequences at a time. Now we will use PileUp, a program that builds a multiple alignment with a progressive, clustering alignment method. Afterwards we will draw a phylogenetic tree based on this alignment.
Do a Pileup in GCG:
Exercise 3b: Construct a phylogenetic tree from a distance matrix
Currently, GCG is not the ideal package for drawing trees. But it can produce and display a Neighbour-Joining tree from an alignment.
First run Distances to construct a distance matrix, then use this matrix as input for GrowTree. The distances are expressed as substitutions per 100 bases or amino acids. Distances offers several methods that correct for multiple substitutions at the same site: for nucleic acid sequences, Kimura's two-parameter method, the Tajima-Nei method or the Jin-Nei gamma distance method; for protein sequences, the Kimura method; and for either type of sequence, the Jukes-Cantor method. It is also possible to obtain an uncorrected distance (i.e. neglecting the effect of multiple hidden substitutions at the same site).
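To show what a correction for hidden substitutions does, the Python sketch below computes an uncorrected distance, the Jukes-Cantor distance and Kimura's two-parameter distance for a pair of aligned nucleotide sequences, each expressed per 100 sites. It uses the standard textbook formulas; the function names, the decision to skip gapped columns and the example sequences are assumptions, and the details of GCG's own calculation may differ.

```python
import math

PURINES = {"A", "G"}

def site_proportions(a, b):
    """Proportions of transitions (P) and transversions (Q) over
    the alignment columns that contain no gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    transitions = transversions = 0
    for x, y in pairs:
        if x == y:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine is a transition.
        if (x in PURINES) == (y in PURINES):
            transitions += 1
        else:
            transversions += 1
    n = len(pairs)
    return transitions / n, transversions / n

def uncorrected(a, b):
    P, Q = site_proportions(a, b)
    return 100 * (P + Q)

def jukes_cantor(a, b):
    P, Q = site_proportions(a, b)
    p = P + Q
    return 100 * (-0.75 * math.log(1 - 4 * p / 3))

def kimura_2p(a, b):
    P, Q = site_proportions(a, b)
    return 100 * (-0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q)))

if __name__ == "__main__":
    s1 = "ACGTACGTAC"  # hypothetical aligned sequences, one transition apart
    s2 = "ACGTACGTAT"
    print(uncorrected(s1, s2), jukes_cantor(s1, s2), kimura_2p(s1, s2))
```

The corrected distances are always at least as large as the uncorrected one, and the gap widens as the sequences diverge, because more of the observed differences then conceal repeated changes at the same site.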
Load in the EFTU/EF1A alignment:
You probably saw that you can also make trees with the PAUP package. This is a widely used package with several tree-making algorithms, parsimony in particular. However, the PAUP interface is under development and the display tool in GCG is quite limited: you cannot reroot a tree or display it as an unrooted radial tree (the correct way to show a tree when you have no idea where the root should go). In the future GCG may also distribute TreeTool, a nice display tool which we have running on Sun Unix at EMBL.
See the Sequence Analysis Service for available tree calculation and display software at EMBL.
Take Home Lessons
We undertook several different tasks with GCG, using different numbers of sequences. This introduction barely touches on what the package can do, but once you can do a few things in GCG, it is fairly straightforward to find your way around and do many other tasks. GCG is not always the best place for the more specialised tasks. For example, it is convenient to make a tree in GCG if you are already using the package, but trees are calculated and displayed rather better by many specialist packages. The same holds for multiple sequence alignment and many other functions.