Using GCG Tools on TAU
Sequence Analysis Service, May '98
GCG is an integrated package of over 130 programs that allows you to manipulate
and analyze nucleotide and protein sequences. If you can't do something
on the web, you probably can with the GCG package, so it is a good idea
to know how to access the package on UNIX. There are two interfaces available
for GCG, a command line interface (difficult for inexperienced users) and
an X windows interface. We will use the X interface to do some analyses,
mainly on EFTU sequences.
GCG version 9.1 is now installed on TAU, a powerful DEC Alpha computer that provides central services to EMBL, including
GCG. The main new feature of GCG 9.1, SeqLab, is both a powerful sequence alignment editor and a handy display tool
that represents sequence features as symbols. The best way to access TAU from your lab will probably be through a Macintosh X Windows emulator (MacX v.2, Xoftware or eXodus). You need to buy the emulator, but some groups have one already. You can
see us for a quick lesson.
In this practical we will use several GCG programs, namely motifs, seqlab, coilscan, compare/dotplot, gap, bestfit, pileup and distances/growtree. This will provide experience in (1) examining a single sequence, (2) looking
for similarity between two sequences and (3) aligning a group of
related sequences.
Getting started
- Log on to TAU:
- if you are working on a Unix machine, type rlogin tau.
- if you are working on an XTerminal, choose Tau from the menu for logging
in.
- if you are working on a Mac, use the X Emulator software to open a window
on tau.
- On the Unix command line type prepare gcg. GCG will be set up.
- If you have problems with the setup procedure above, ask somebody experienced
in your lab, send us an email or call us (ext. 530).
- Type wpi (to use GCG's X interface). The SeqLab X window will appear.
- Familiarise yourself with the layout and pulldown menus.
- Note the help menu in the top right corner. See how the help topics are nested.
- Select job manager and output manager. These are needed to see what is running and to view the output.
- In Options choose Preferences and Output and toggle on Automatically display new output (this saves a lot of boredom later).
Analysing a Protein Sequence
Exercise 1a. Look for sequence motifs in the EFTU_ECOLI sequence
The Motifs program looks for protein sequence motifs by checking your protein sequence
for every sequence pattern in the PROSITE Dictionary. Motifs can recognize the patterns with some of the symbols mismatched, but not
with gaps. Although PROSITE is not an exhaustive collection of patterns, any new protein sequence should
always be checked against it. You can learn more about PROSITE with SRS.
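As a rough illustration of this kind of pattern search, the sketch below translates a PROSITE-style pattern into a Python regular expression and scans a sequence with it. It is not how Motifs itself is implemented: it only reports exact matches (no mismatched symbols), the function names are invented for this example, and the test sequence is a toy fragment rather than a database entry. The pattern used is the PROSITE ATP/GTP-binding site motif A (P-loop).

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles the common syntax only: '-' separators, x (any residue),
    [..] allowed residues, {..} forbidden residues, (n) or (n,m) repeats,
    and the '<' / '>' terminal anchors.
    """
    regex = pattern.rstrip(".").replace("-", "")
    regex = regex.replace("{", "[^").replace("}", "]")   # forbidden residues
    regex = regex.replace("(", "{").replace(")", "}")    # repeat counts
    regex = regex.replace("x", ".")                      # any residue
    return regex.replace("<", "^").replace(">", "$")     # terminal anchors


def find_motif(sequence, pattern):
    """Return (start, end, matched_text) for each exact, non-overlapping match.

    Positions are 1-based and inclusive, as in sequence database entries.
    """
    regex = prosite_to_regex(pattern)
    return [(m.start() + 1, m.end(), m.group())
            for m in re.finditer(regex, sequence)]


# Toy fragment (hypothetical, for illustration only).
toy_sequence = "MKTIGHVDHGKTTLA"
print(find_motif(toy_sequence, "[AG]-x(4)-G-K-[ST]"))
# prints [(5, 12, 'GHVDHGKT')]
```

Allowing mismatches, as Motifs can, would need a position-by-position comparison rather than a plain regular expression.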
Do GCG Motifs:
- In the File pull down menu, select Add Sequences From > Databases.
- In the Database Browser window, enter swissprot:eftu_ecoli in Database Specification: and click on the Add to Main Window button.
- From the main window, Functions menu, select Protein Analysis > Motifs.
- In the Motifs window, check that the sequence is selected (if needed, click on it in
the main window).
- Also check that Motifs will run as Background Job.
- Now click on the Run button.
- Check what is happening in the job manager window (you should see that your job is running) and in the output manager.
- When the job is finished (1-2 minutes), the result will pop up (if you have
set the Automatic Display).
Questions
- How many motifs were found?
- Would this result be informative if the protein were poorly understood?
Exercise 1b. Check for special regions in your sequence using SeqLab
GCG's SeqLab interface has two modes: the Main List mode and the sequence Editor mode. In the Editor mode you can edit your sequence (inserting, deleting, copying) as well
as investigate it further by checking for features annotated in the database entry,
such as domains, helices, strands, turns and ligand binding sites.
Check for features in the sequence:
- Select the sequence in the Main List by clicking on it.
- Change the Mode to Editor.
- Change the Display to Features Coloring.
- The sequence will now be coloured according to its SWISS-PROT feature table
entries.
- Pull down the Windows menu and choose Features. The Sequence Features window will appear.
- Change the Show selection to All features in current sequence.
- Click on a feature in the Sequence Features Window. This region is highlighted in the sequence.
- Change the Display: Mode in the Main Window to Graphic Features and change the scale gradually (e.g. to 4:1) to get a better overview of
the features in the sequence.
- Note that double clicking on the sequence name will pop up a Sequence Information window and (if it is not already open) double clicking on a sequence feature
will pop up the Sequence Features window.
Exercise 1c. Check for coiled-coils in the protein sequence
CoilScan uses the method of Lupas et al. (see the ACKNOWLEDGMENT topic)
to find coiled-coil segments in protein sequences by comparing each residue
in the sequence to a weight matrix tabulated from known coiled-coil protein
segments. A coiled-coil probability is calculated for each residue in the
protein, and those segments whose probabilities meet or exceed a threshold
probability you set are reported in the output. This prediction method works
only for solvent exposed (amphipathic) coiled coils, particularly for parallel
and antiparallel two-stranded coiled coils and for parallel three-stranded
coiled coils.
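The reporting step described above (keep the segments whose per-residue probability meets or exceeds your threshold) can be sketched as follows. This covers only the thresholding, not the Lupas weight-matrix scoring that produces the probabilities. The function name, the default threshold and the probability values are all invented for this illustration.

```python
def segments_at_or_above(probabilities, threshold=0.5):
    """Return (start, end) residue ranges whose probability >= threshold.

    Positions are 1-based and inclusive. `probabilities` holds one value
    per residue, in sequence order.
    """
    segments = []
    start = None
    for position, probability in enumerate(probabilities, start=1):
        if probability >= threshold:
            if start is None:
                start = position          # a new segment opens here
        elif start is not None:
            segments.append((start, position - 1))
            start = None
    if start is not None:                 # segment runs to the last residue
        segments.append((start, len(probabilities)))
    return segments


# Made-up per-residue probabilities for a seven-residue toy example.
toy_probabilities = [0.1, 0.6, 0.9, 0.5, 0.2, 0.7, 0.8]
print(segments_at_or_above(toy_probabilities, threshold=0.5))
# prints [(2, 4), (6, 7)]
```

Raising the threshold shortens or removes segments, which is why it is worth rerunning the job with different settings before trusting a predicted range.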
Do GCG Coilscan:
- In the File pull down menu, select Add Sequences From and then Databases.
- In the Database Browser window, enter swissprot:kata_arath in Database Specification: and click on the Add to Main Window button.
- From the main window, Functions menu, select Protein Analysis > Coilscan.
- In the Coilscan window, check that the sequence is selected.
- Also check that Coilscan will run as Background Job.
- Now click on the Run button.
- Check the output files.
- Run the same job with a different window size and compare the results.
Question
- Does the predicted coiled-coil range match the SWISS-PROT feature entry?
Comparing two sequences with each other
Exercise 2a. Do a dotplot to compare two sequences for sequence conservation
Dotplots are a useful visual way to see all of the structures in common
between two sequences or to visualize repeated or inverted repeated structures
in one sequence. Rather cumbersomely, GCG does dotplots in two stages. DotPlot is the second part of a pair of programs that generate dotplots of the
points of similarity between two sequences.
Compare compares two protein or nucleic acid sequences and creates a file of the
points of similarity between them for plotting with DotPlot. Compare finds the points using either a window/stringency or a word match
criterion. The word comparison is 1,000 times faster than the window/stringency
comparison, but somewhat less sensitive.
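To make the window/stringency idea concrete, here is a minimal sketch that slides a window along both sequences and records a point wherever enough positions agree. It is a simplification under stated assumptions: it counts identical residues only, whereas a real comparison would normally score residue pairs with a comparison table, and the function name and default values are invented for this example.

```python
def window_matches(seq_a, seq_b, window=5, stringency=4):
    """Return (i, j) pairs where windows starting at seq_a[i] and seq_b[j]
    share at least `stringency` identical positions.

    Indices are 0-based window start positions. Each pair is one dot on
    the plot, with seq_a along one axis and seq_b along the other.
    """
    points = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            identical = sum(a == b for a, b in
                            zip(seq_a[i:i + window], seq_b[j:j + window]))
            if identical >= stringency:
                points.append((i, j))
    return points


# A sequence compared with itself gives an unbroken main diagonal.
toy = "ACDEFGHIK"
print(window_matches(toy, toy, window=3, stringency=3))
```

Every window pair is tested, so the run time grows with the product of the two sequence lengths. Lowering the stringency adds more points (more noise), while raising it keeps only the strongest diagonals.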
Do a dotplot in GCG:
- In the Database Browser window, enter swissprot:ef1a_halha in Database Specification: and click on the Add to Main Window button.
- Select both the eftu_ecoli and ef1a_halha sequences in the main window (click on them).
- From the functions menu, select Pairwise Comparison > Compare.
- In the Compare window, the DotPlot option should be preselected for you.
- The two sequences should also be shown as selected.
- Click the Run button. When the job completes you will see the dotplot.
Question
- Can you trace the matching regions of the two proteins?
Check parameterisation:
- In the Compare window, click on Options.
- Alter the stringency value, e.g. to 16.0.
- Run Compare again and examine the plot.
Questions
- Can you trace the matching regions of the two proteins?
- Can you see any Indels (sites of insertion or deletion) in the plot?
- Do you think GCG has got the default values right?
- (You could find the ideal signal-to-noise settings by trial and error.)
Exercise 2b. Use Bestfit to compare two sequences for partial sequence similarity
BestFit makes an optimal alignment of the best segment of similarity between two sequences. Optimal alignments are found by inserting gaps to
maximize the number of matches using the local alignment algorithm of Smith and Waterman. This is useful, for example, when comparing
modular proteins that share some, but not all, of the same domains.
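The core of Smith-Waterman local alignment can be sketched in a few lines. This is a teaching sketch under simplifying assumptions, not BestFit's implementation: it uses a flat match/mismatch score instead of a substitution matrix, a single linear gap penalty instead of separate gap creation and extension penalties, and it returns only the best score, not the alignment. The function name and scoring values are invented for this example.

```python
def smith_waterman_score(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between two sequences.

    Each cell holds the best score of any alignment ending at that pair
    of residues. Scores are never allowed to drop below zero, which is
    what lets the alignment start and stop anywhere (local alignment).
    """
    previous = [0] * (len(seq_b) + 1)
    best = 0
    for i in range(1, len(seq_a) + 1):
        current = [0]
        for j in range(1, len(seq_b) + 1):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score = max(0,                      # start a fresh alignment
                        previous[j - 1] + pair,  # align the two residues
                        previous[j] + gap,       # gap in seq_b
                        current[j - 1] + gap)    # gap in seq_a
            current.append(score)
            best = max(best, score)
        previous = current
    return best


# Two toy sequences sharing one eight-residue block with unrelated flanks.
print(smith_waterman_score("AAAGHVDHGKTCCC", "TTTGHVDHGKTWWW"))
# prints 16
```

Only the shared block contributes to the score. The unrelated flanks are simply left out, which is the behaviour to expect when two modular proteins share a single domain.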
Do a Bestfit in GCG:
- Select both sequences eftu_ecoli and ef1a_halha in the main window.
- From the functions menu, select Pairwise Comparison > Bestfit.
- In the Bestfit window, check that the sequences are selected.
- Click the Run button. When the job completes you will see the alignment.
Questions
- Are the N- and C-termini of the proteins aligned?
- Has Bestfit found all the matching blocks shown in the dotplot (reopen it from the output
manager)?
- Do another Bestfit: click on Options, then set the gap extension penalty to 1.0.
- What is the effect of the lower extension penalty?
- Do the ends of the alignment match better to the dotplot now?
- In general, do you think it is safe to trust default parameter settings?
Exercise 2c. Use Gap to compare two sequences for overall sequence similarity
Gap uses the global alignment algorithm of Needleman and Wunsch to find the best alignment of two complete sequences that maximizes the number of matches and minimizes the number of gaps.
(It is inappropriate to use Gap for comparing multidomain proteins if they
are only partially related.)
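For comparison, here is a minimal sketch of Needleman-Wunsch global alignment scoring. The same caveats apply as for any toy version: a flat match/mismatch score stands in for a substitution matrix, a single linear gap penalty stands in for separate gap creation and extension penalties, and only the score is returned. The function name and scoring values are assumptions made for this illustration, not Gap's defaults.

```python
def needleman_wunsch_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Return the best global alignment score between two sequences.

    Every residue of both sequences must be aligned or paired with a gap,
    so the first row and column carry accumulating gap penalties and the
    answer is read from the final cell.
    """
    previous = [j * gap for j in range(len(seq_b) + 1)]
    for i in range(1, len(seq_a) + 1):
        current = [i * gap]
        for j in range(1, len(seq_b) + 1):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            current.append(max(previous[j - 1] + pair,  # align the residues
                               previous[j] + gap,        # gap in seq_b
                               current[j - 1] + gap))    # gap in seq_a
        previous = current
    return previous[-1]


print(needleman_wunsch_score("GATTACA", "GATTACA"))  # prints 7
print(needleman_wunsch_score("GATTACA", "GATACA"))   # prints 5
```

Because nothing is clamped at zero and both ends are forced into the alignment, unrelated flanking regions drag the score down instead of being ignored.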
Do a Gap in GCG:
- Select both sequences eftu_ecoli and ef1a_halha in the main window.
- From the functions menu, select Pairwise Comparison > Gap.
- In the Gap window, check that the sequences are selected.
- Click the Run button. When the job completes you will see the alignment.
- Note how the complete sequences are aligned in the Gap alignment, in contrast to Bestfit.
- Again, does the alignment look correct with the default parameters?
Multiple Sequence Analysis
Exercise 3a: Make a Multiple Alignment
So far we have only aligned pairs of sequences. Now we will use a program
that makes a multiple alignment using a progressive, clustering alignment
method. Afterwards we will draw a phylogenetic tree based on this alignment.
Do a Pileup in GCG:
- Load the sequences that are going to be aligned into the main window. This
time we will load them with the help of a GCG list file:
- Click on the middle mouse button to load the list file in a separate Netscape window.
- Save this file in your home directory. Use the Netscape File/Save As... option.
- In the GCG Main Window pull down the File menu and choose Add Sequences From > Sequence Files.
- Select the list file and Add it to the main list.
- In the Functions menu, select Multiple Comparison > Pileup.
- In the Pileup window check that the list file is chosen and click the Run button.
- When Pileup finishes, the alignment will be displayed.
- Now load the alignment (in .msf format) into the main list by using the
option in the output manager.
- Select the aligned sequences and change Mode to Editor.
- View the alignment and note how the colour-coding highlights conserved columns.
- Do you think the colouring could be improved?
- Try editing the alignment by adding and subtracting gaps in sequences. Click
on the buttons to find out what they do: you would need these for a real
editing job.
Exercise 3b: Construct a phylogenetic tree from a distance matrix
Currently, GCG is not the ideal package for drawing trees. But it can produce
and display a Neighbour-Joining tree from an alignment.
First run Distances to construct a distance matrix, then use this matrix as input for GrowTree. The distances are expressed as substitutions per 100 bases or amino acids.
Several methods are available to correct the distances for multiple substitutions at the same site. For nucleic acid sequences, these are Kimura's two-parameter method,
the Tajima-Nei method and the Jin-Nei gamma distance method; for protein
sequences, the Kimura method; and for either type of sequence, the Jukes-Cantor
method. It is also possible to obtain an uncorrected distance (i.e. neglecting
the effect of multiple hidden substitutions at the same site).
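To show what a correction for multiple substitutions does to the numbers, the sketch below computes an uncorrected distance from two aligned sequences and then applies the textbook Jukes-Cantor and Kimura protein formulas. These are the standard published forms. Whether the Distances program uses exactly these expressions, and which characters it treats as gaps, are assumptions here, and the function names are invented for this example.

```python
import math

# Assumption: these characters mark gaps in the aligned sequences.
GAP_CHARS = "-.~"


def p_distance(seq_a, seq_b):
    """Uncorrected distance: fraction of ungapped columns that differ."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b)
             if a not in GAP_CHARS and b not in GAP_CHARS]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p, states=4):
    """Jukes-Cantor correction; states=4 for nucleotides, 20 for proteins.

    math.log raises ValueError when p is at or beyond saturation.
    """
    b = (states - 1) / states
    return -b * math.log(1 - p / b)


def kimura_protein(p):
    """Kimura's empirical correction for protein distances."""
    return -math.log(1 - p - 0.2 * p * p)


# Toy aligned nucleotide sequences differing at 1 of 10 sites.
p = p_distance("ACGTACGTAC", "ACGTACGTAA")
print(round(100 * p, 2))                  # uncorrected, per 100 sites
print(round(100 * jukes_cantor(p), 2))    # corrected, per 100 sites
```

The corrected value is always at least as large as the uncorrected one, and the gap between them widens quickly as sequences diverge, which is why uncorrected distances underestimate long branches.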
Load in the EFTU/EF1A alignment:
- From the output manager select the .msf file that was produced by Pileup, add it to the main list and select it. (You may have done this already.)
- From the main window, functions menu, select Evolution and then Distances.
- In the Distances window, check that the sequences are selected as .msf{*} (the {*} means
take the whole set of sequences).
- Also check that GrowTree is toggled on.
- Click on Run. When the job is finished you will get the phylogenetic tree plot for the
sequences you selected.
Questions
- Is the tree plausibly rooted? Does the most divergent sequence make biological
sense?
- Does the program let you reroot the tree?
- Would you dare to publish a wrongly rooted tree?
- Compare this tree to the one you got from Pileup in the pileup.figure file.
- Why does yeast (and other eukaryotes too) have both an EF1A and an EFTU
entry?
Note
You probably saw that you can also make trees with the PAUP package. This
is a widely used package with several tree making algorithms, parsimony
in particular. However, the PAUP interface is under development and the
display tool in GCG is quite limited: you cannot reroot a tree or display
it as an unrooted radial tree (the correct way to show a tree when you have
no idea where the root should go). In the future GCG may also distribute
TreeTool, a nice display tool which we have running on Sun Unix at EMBL.
See the Sequence Analysis Service for available tree calculation and display software at EMBL.
Take Home Lessons
We undertook several different tasks with GCG, using different numbers of
sequences. This introduction barely touches on what the package can do.
But once you can do a few things in GCG, it is fairly straightforward to
find your way around and do many other tasks. GCG is not always the best
place for doing the more specialised tasks. For example it is convenient
to make a tree in GCG if you are already using the package. But trees are
calculated and displayed rather better by many other specialist packages.
The same is true of multiple sequence alignment and many other functions.