Biocomputing Unit
Sequence Analysis Service
Gibson Group

Using GCG Tools on TAU

Sequence Analysis Service, May/October/November '98

GCG is an integrated package of over 130 programs that allows you to manipulate and analyze nucleotide and protein sequences. If you can't do something on the web, you probably can with the GCG package, so it is a good idea to know how to access the package on UNIX. Two interfaces are available for GCG: a command-line interface (difficult for inexperienced users) and an X Windows interface. We will use the X interface to do some analyses, mainly on EFTU sequences.

GCG version 9.1 is now installed on TAU, a powerful DEC Alpha computer that provides central services to EMBL, including GCG. The main new feature of GCG 9.1, SeqLab, is both a versatile sequence alignment editor and a handy display tool that represents sequence features as symbols. The best way to access TAU from your lab will probably be through a Macintosh X Windows emulator (MacX v.2, Xoftware or eXodus). The computer group is arranging an institute licence for MacX 2 and will place it on the Mac server. You can see us for a quick lesson or for help with installation. If you need help with UNIX, register for one of the weekly basic UNIX courses run by Bjoern Kindler.

In this practical we will use several GCG programs, namely motifs, seqlab, coilscan, compare/dotplot, gap, bestfit, pileup and distances/growtree. This will provide experience in (1) examining a single sequence, (2) looking for sequence similarity between two sequences and (3) aligning a group of related sequences. On-line help is available for GCG.

Getting started

Analysing a Protein Sequence

Exercise 1a. Look for sequence motifs in the EFTU_ECOLI sequence

The Motifs program looks for protein sequence motifs by checking your protein sequence for every sequence pattern in the PROSITE Dictionary. Motifs can recognize the patterns with some of the symbols mismatched, but not with gaps. Although PROSITE is not an exhaustive collection of patterns, any new protein sequence should always be checked against it. You can learn more about PROSITE with SRS.
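To see what pattern checking means in practice, here is a minimal Python sketch that translates a PROSITE-style pattern into a regular expression and scans a sequence with it. It is not how GCG Motifs is implemented, and unlike Motifs it allows no mismatches. The function names are our own, and the test fragment was typed in for illustration (it resembles the N-terminal GTP-binding region of EF-Tu but was not read from the database). The pattern shown is the PROSITE ATP/GTP-binding site motif A (P-loop).

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles the common syntax only: x (any residue), [..] (allowed residues),
    {..} (forbidden residues), (n) or (n,m) repeats, and < / > for the termini.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} -> [^P]
        element = element.replace("x", ".")                     # x -> any residue
        element = element.replace("(", "{").replace(")", "}")   # x(4) -> .{4}
        element = element.replace("<", "^").replace(">", "$")   # sequence ends
        parts.append(element)
    return "".join(parts)


def find_motif(sequence, pattern):
    """Return (1-based start, matched text) for each exact match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


# Illustrative fragment, not a database entry.
fragment = "NVGTIGHVDHGKTTLTAA"
p_loop = "[AG]-x(4)-G-K-[ST]."

print(prosite_to_regex(p_loop))
print(find_motif(fragment, p_loop))
```

The regular expression engine reports non-overlapping exact matches only, so a pattern that Motifs would find with one mismatched symbol is missed by this sketch.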

Do GCG Motifs:


Exercise 1b. Check for special regions in your sequence using SeqLab

GCG's SeqLab interface has two modes: the Main List mode and the sequence Editor mode. In the Editor mode you can edit your sequence (inserting, deleting, copying) and investigate it further by checking for features annotated in the database entry, such as domains, helices, strands, turns and ligand binding sites.

Check for features in the sequence:

Exercise 1c. Check for coiled-coils in a protein sequence

CoilScan uses the method of Lupas et al. (see the ACKNOWLEDGMENT topic) to find coiled-coil segments in protein sequences by comparing each residue in the sequence to a weight matrix tabulated from known coiled-coil protein segments. A coiled-coil probability is calculated for each residue in the protein, and those segments whose probabilities meet or exceed a threshold probability you set are reported in the output. This prediction method works only for solvent exposed (amphipathic) coiled coils, particularly for parallel and antiparallel two-stranded coiled coils and for parallel three-stranded coiled coils.
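The idea of scoring each residue against a heptad weight table can be sketched in a few lines of Python. This is a toy illustration only: the weights below are invented for the example and are NOT the matrix of Lupas et al., the 28-residue window is an assumption, and the sketch stops at a raw score instead of converting it to a probability as CoilScan does. All function names are our own.

```python
from math import exp, log

HYDROPHOBIC = set("LIVMA")


def weight(residue, heptad_position):
    """Hypothetical weight for a residue at heptad position 0..6 (a..g).

    Invented values for illustration, not taken from the published matrix.
    """
    if residue == "P":
        return 0.2                      # proline disfavoured everywhere
    if heptad_position in (0, 3):       # core positions a and d
        return 2.0 if residue in HYDROPHOBIC else 0.5
    return 0.7 if residue in HYDROPHOBIC else 1.2


def window_score(window, frame):
    """Geometric mean of the weights for one window in one heptad register."""
    logs = [log(weight(r, (i + frame) % 7)) for i, r in enumerate(window)]
    return exp(sum(logs) / len(logs))


def residue_scores(sequence, window=28):
    """Give each residue the best score of any window that contains it."""
    scores = [0.0] * len(sequence)
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        best = max(window_score(segment, frame) for frame in range(7))
        for i in range(start, start + window):
            scores[i] = max(scores[i], best)
    return scores


def segments_above(scores, threshold):
    """Return 1-based (start, end) runs whose score meets the threshold."""
    segments, start = [], None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            segments.append((start + 1, i))
            start = None
    if start is not None:
        segments.append((start + 1, len(scores)))
    return segments


# Synthetic test sequences, not real proteins.
coil_like = "LKELEDK" * 5
loop_like = "GSPGSPG" * 5
print(segments_above(residue_scores(coil_like), 1.3))
print(segments_above(residue_scores(loop_like), 1.3))
```

Trying all seven registers for each window matters because the heptad phase of a real sequence is not known in advance.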

Do GCG Coilscan:


Comparing two sequences with each other

Exercise 2a. Do a dotplot to compare two sequences for sequence conservation

Dotplots are a useful visual way to see all of the segments in common between two sequences or to visualize repeated or inverted repeated structures in one sequence. Rather cumbersomely, GCG does dotplots in two stages. DotPlot is the second part of a pair of programs that generate dotplots for the points of similarity between two sequences.

Compare compares two protein or nucleic acid sequences and creates a file of the points of similarity between them for plotting with DotPlot. Compare finds the points using either a window/stringency or a word match criterion. The word comparison is 1,000 times faster than the window/stringency comparison, but somewhat less sensitive.
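The difference between the two criteria can be shown with a small Python sketch. It is a simplification under stated assumptions: residues are scored by identity only (Compare uses a scoring matrix), the window, stringency and word sizes are arbitrary, and the function names and test sequences are our own.

```python
from collections import defaultdict


def window_points(seq_a, seq_b, window=5, stringency=4):
    """Window/stringency: record (i, j) when at least `stringency`
    of the `window` aligned residues starting at i and j are identical."""
    points = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= stringency:
                points.append((i, j))
    return points


def word_points(seq_a, seq_b, word=5):
    """Word match: record (i, j) only where a whole word is identical.

    The words of seq_b are indexed once, so each word of seq_a is a
    single dictionary lookup instead of a scan along seq_b.
    """
    index = defaultdict(list)
    for j in range(len(seq_b) - word + 1):
        index[seq_b[j:j + word]].append(j)
    return [(i, j)
            for i in range(len(seq_a) - word + 1)
            for j in index.get(seq_a[i:i + word], [])]


# Illustrative sequences that differ at two positions.
a = "GHVDHGKTTL"
b = "GHIDHGKSTL"
print(window_points(a, b, window=5, stringency=4))
print(word_points(a, b, word=5))
```

With these two sequences the window/stringency comparison still finds the diagonal because it tolerates one mismatch per window, whereas no identical five-residue word exists, which is the loss of sensitivity described above.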

Make a dotplot in GCG:


Check parameterisation:


Exercise 2b. Use Bestfit to compare two sequences for partial sequence similarity

BestFit makes an optimal alignment of the best segment of similarity between two sequences. Optimal alignments are found by inserting gaps to maximize the number of matches using the local alignment algorithm of Smith and Waterman. This is useful, for example, when comparing modular proteins that share some, but not all, of the same domains.
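For readers who want to see the local alignment idea spelled out, here is a minimal Smith-Waterman sketch in Python. It is a simplification and not the BestFit implementation: it assumes identity scoring (match/mismatch) and a single linear gap penalty, all with arbitrary values, and the test sequences are made up.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below zero, which is what
    # lets an alignment start anywhere in either sequence.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Made-up sequences sharing one internal segment.
score, top, bottom = smith_waterman("AAAGHVDHGKTCCC", "TTGHVDHGKTWW")
print(score)
print(top)
print(bottom)
```

Only the shared segment is reported; the unrelated flanking residues are left out, which is why a local method suits proteins that share a single domain.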

Do a Bestfit in GCG:

Exercise 2c. Use Gap to compare two sequences for overall sequence similarity

Gap uses the global alignment algorithm of Needleman and Wunsch to find the best alignment of two complete sequences that maximizes the number of matches and minimizes the number of gaps. (It is inappropriate to use Gap for comparing multidomain proteins if they are only partially related.)
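A minimal Needleman-Wunsch sketch in Python shows how a global alignment differs: every residue of both sequences must appear in the result. As with any short sketch, it makes simplifying assumptions that Gap does not: identity scoring, one linear gap penalty, arbitrary parameter values and made-up test sequences.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]

    # Leading gaps are penalised, so the first row and column are not zero.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            diag = F[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            F[i][j] = max(diag, F[i - 1][j] + gap, F[i][j - 1] + gap)

    # Trace back from the bottom-right corner to the top-left corner.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            diag = F[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        else:
            diag = None
        if diag is not None and F[i][j] == diag:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return F[len(a)][len(b)], "".join(reversed(top)), "".join(reversed(bottom))


# Made-up sequences: the second lacks one residue of the first.
score, top, bottom = needleman_wunsch("GHVDHGKT", "GHDHGKT")
print(score)
print(top)
print(bottom)
```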

Do a Gap in GCG:

Multiple Sequence Analysis

Exercise 3a. Make a Multiple Alignment

So far we have only aligned pairs of sequences. Now we will use Pileup, a program that builds a multiple alignment with a progressive, clustering alignment method. Afterwards we will draw a phylogenetic tree based on this alignment.

Do a Pileup in GCG:

Exercise 3b. Construct a phylogenetic tree from a distance matrix

Currently, GCG is not the ideal package for drawing trees. But it can produce and display a Neighbour-Joining tree from an alignment.

First run Distances to construct a distance matrix. Then use this matrix as input for GrowTree. The distances are expressed as substitutions per 100 bases or amino acids. Several methods are available to correct the distances for multiple substitutions at the same site. For nucleic acid sequences, these are Kimura's two-parameter method, the Tajima-Nei method and the Jin-Nei gamma distance method; for protein sequences, the Kimura method; and for either type of sequence, the Jukes-Cantor method. It is also possible to obtain an uncorrected distance (i.e. neglecting the effect of multiple hidden substitutions at the same site).
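The effect of a correction is easy to see with a small Python sketch that computes an uncorrected distance and two of the corrections from a pair of aligned sequences. These are the standard textbook formulas; how GCG treats gaps and ambiguous symbols is not covered here, so the sketch simply assumes that columns containing a gap are skipped. The function names and the test sequences are our own.

```python
from math import log


def p_distance(a, b):
    """Uncorrected distance: fraction of compared columns that differ.

    Assumption: columns with a gap ('-') in either sequence are ignored.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p, states=4):
    """Jukes-Cantor correction; states=4 for bases, 20 for amino acids."""
    k = (states - 1) / states
    return -k * log(1 - p / k)


def kimura_protein(p):
    """Kimura's empirical correction for protein distances."""
    return -log(1 - p - 0.2 * p * p)


# Made-up aligned nucleotide sequences differing at 2 of 10 sites.
seq1 = "ACGTACGTAC"
seq2 = "ACGTACGTTT"
p = p_distance(seq1, seq2)

# Scale to substitutions per 100 sites, as in the Distances output.
print(round(100 * p, 2))
print(round(100 * jukes_cantor(p), 2))
```

The corrected value is always larger than the uncorrected one, and the gap between them widens as the sequences diverge, because more of the substitutions are hidden by later changes at the same site.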

Load in the EFTU/EF1A alignment:


Take Home Lessons

We undertook several different tasks with GCG, using different numbers of sequences. This introduction barely touches on what the package can do, but once you can do a few things in GCG, it is fairly straightforward to find your way around and do many other tasks. GCG is not always the best place for doing the more specialised tasks. For example, it is convenient to make a tree in GCG if you are already using the package, but trees are calculated and displayed rather better by many specialist packages. The same is true of multiple sequence alignment and many other functions.