Autumn 01 Course
Making and Applying Multiple Sequence Alignments
by Toby Gibson, Chenna Ramu and Aidan Budd, November 26th-29th, 2001
In this practical we will use locally installed programs on our UNIX server for making multiple sequence alignments and trees and a WWW server for secondary strcuture prediction. All of this could perfectly well be done on a Mac (or PC) too.
Examining multiple sequence alignments is a vital activity in modern biology. Comparative sequence analysis is the continuation of the Darwinian/Linnaeal tradition of comparative biology (so fruitful in the history of biology) at the molecular level. The alignments also serve as input into other types of analyses and here we will perform two very different ones: deriving trees and 2D structure predictions.
We will make a multiple alignment of a protein sequence family and calculate a tree from the sequences. We will study the tree to see whether it fits with our current understanding of phylogeny. We will then use the same alignment to get a 2D prediction and compare it to the known structure for one of these proteins.
Part 1. Making multiple sequence alignments and calculating a tree.
We will use:
- Clustal X for calculating both the alignment and the tree.
- NJplot for displaying and manipulating the tree.
- Jpred for 2D prediction.
- The CATH structure classification database for comparing with an actual 2D structure.
Both these programs run on desktop Macs and PCs but today we will run them on TAU, a UNIX server where they are already installed.
On your LINUX PC:
- Open an Xterm (available from the lower menus of the desktop).
- Type xhost +
- This allows tau to open X-windows on the local machine.
- Log on to tau:
- if you are working on a Unix machine, type rlogin tau.
- Type cd /scrap/your_username to change to the scrap disk for the practical.
- On the unix command line type prepare clustalx. The Clustal X program will be set up.
- On the unix command line type prepare njplot. NJplot will be set up.
Step 1. Getting a set of EF-TU / EF-1A sequences
Elongation factors are found in all species so have often been used for phylogenetic investigations. EF-TU in eubacteria and EF-1A in eukaryotes are orthologous factors. There are >150 entries in SWISSPROT which would take too long to align today so we will use the SRS query manager to provide a representative selection.
- Open another Linux Xterm.
- Type cd /net/fileserver4/scrap/yourname/
- We want to work on the central disks and this command should mount /scrap.
- Now start Netscape from the command line.
- In a Netscape window, go to SRS and click Start a new SRS session.
- Check the box for the Swiss-Prot database and continue.
- Click on the Query Manager button:
- The query manager allows "advanced users" to do powerful query and list refinements.
- Using the mouse, select and cut the following query text:
EF11_HUMAN | EF12_HUMAN | EF12_XENLA | EF13_XENLA | EF1A_GIALA |
EF1A_PLAFK | EF1A_WHEAT | EF1A_YEAST | EF11_DROME | EF11_MOUSE |
EF11_SCHPO | EF11_XENLA | EF1A_ARATH | EF1A_CAEEL | EF1A_CHICK |
EF1A_DICDI | EF1A_ENTHI | EF1A_HALHA | EF1A_METJA | EF1A_PODAN |
EF1A_PYRWO | EF1A_SULAC | EF1A_THEAC |
EFTU_BACSU | EFTU_ECOLI | EFTU_HAEIN | EFTU_MYCPN | EFTU_THEAQ |
EFTU_AQUAE | EFTU_RICPR | EFTU_HUMAN | EFTU_BOVIN | EFTU_YEAST]
- Paste the text into the query box and click the expression button to execute the query.
- SRS should return a list of entries matching the query.
- Now click the save button.
- Set Use view to FastaSeqs and click the save button.
- Using the netscape menus, save the sequences to a text format file called eftu.fasta.
Step 2. Aligning the elongation factor sequences with Clustal X
Multiple Alignments have many uses. They are used for revealing important conserved residues, for making phylogenies, for secondary structure prediction etc.
- In the TAU terminal, type clustalx.
- clustalx will open a new window.
- Invoke File > Load Sequences and select eftu.fasta.
- The sequences will be displayed in the clustalx main window.
- Using the Alignment menu select Do complete alignment and click the align button.
- The alignment calculation will take a couple of minutes, then the alignment will be displayed.
- Examine the alignment and answer the questions below.
- Are there any completely conserved residues?
- What is the graph under the alignment showing?
- Is the order of the sequences changed after alignment?
- How might the sequences be regrouped?
- Some proteins have N-terminal extensions
- Are these bacterial or eukaryotic?
- What is the function of this extension?
- (Hint: You can check the entry in SRS)
- Are the residues coloured by
- Proximity in the alphabet?
- Physicochemical properties?
- Why are some residues not coloured?
- Are Gly and Pro always coloured?
- What is so special about these residues?
Step 3. Calculating a tree with Clustal X and displaying it with NJplot
Clustal X uses the neighbour-joining method to calculate trees. This is a distance method (based on distances between sequences) that gives reasonable results. NJ is not the best method (usually said to be the computationally intensive Maximum-Likelihood approach) but is fast and good for a quick examination of tree topology. In particular, NJ is less robust than ML to variation in mutation rates between the sequences. A common artefact of unequal rates is that fast evolving sequences (which have long branches) exhibit "long branch attraction" - moving toward each other and deeper into the tree than their true positions.
Calculate the tree:
- Using the tree menu, toggle on Correct for multiple substitutions.
- This approximates the actual number of differences (ie the number of mutations that occurred), not just the visible differences between 2 sequences.
- (You can see the distance matrix using Output format options: It would be instructive to compare the matrices with and without the correction, but we probably don't have time today.)
- Then choose Bootstrap N-J tree.
- Set number of bootstrap trials to 100 (to make it go quicker).
- Bootstrapping is a resampling method to look at the stability of branches when a sequence is slightly modified. The idea is that unstable branches should not be used for supporting a phylogeny.
- Note the name of the tree (by default eftu.phb) and click the OK button.
- The tree and bootstrapping calculations may take a minute or so.
Display the tree:
- On the command line in a TAU Xterminal, type njplot.
- The NJplot window will appear.
- Use File > Open to load in the tree file eftu.phb.
- NJ trees are unrooted and NJplot guesses how to root the tree.
- Is the tree plausibly displayed?
- If not, then use New Outgroup to display the tree so that the long internal branch roots the tree.
- This corresponds to the last common ancestor of archaea and eubacteria.
- Toggle on the display of Bootstrap values.
- Now examine the tree and answer the questions.
- (If you want to print the tree out, save plot as postscript and print using command lpr -Plw-v111 eftu.ps).
- Are eukaryotic sequences on the archaeal or eubacterial divide or both?
- Some trees support a clade of fungi and animals with plants outside
- Did toads split off before the plants, animals and fungi diverged?
- Does the annotaton in the Xenopus entries help to understand this?
- Which Xenopus entry has the longest branch?
- Might it's position be artefactual?
- Branch lengths provide an estimate of sequence divergence
- Which sequence is closest to the root?
- Is it closer in time to the last common ancestor?
- Is it closer in sequence to the last common ancestor?
- Is it a result of horizontal transfer of sequences from archaea?
- Are there any eukaryotic and archaeal sequences with notably short branches?
- Chemical reaction rates increase with temperature
- Therefore thermophilic prokaryotes like Aquifex and Pyrococcus will be subject to much higher mutation rates and evolve faster than mesophiles won't they?
- Mitochondria are derived from an ancestral alpha-purple bacterium and Rickettsia is one of their closest relatives
- Do the eukaryotic mitochondrial EF-TUs and Rickettsia share a unique branch?
- Are all the mitochondrial sequences on a unique branch?
- Which likely evolved faster, mitochondrial or bacterial sequences?
- Might it improve the topology to incorporate more sequences from mitochondria and their relatives?
- Is Giardia lambdia the earliest diverged eukaryote?
- Since it lacks a mitochondrion, is this good evidence for the emergence of amoeboid anaerobic eukaryotes with cytoskeleton, nucleus etc. that later engulfed a symbiotic purple bacterium to give rise to the aerobic, mitochondrial eukaryotes?
- Bootstraps report the stability of branches
- Which branch has the lowest bootstrap?
- Do the shortest branch lengths have the highest bootstrap values?
- Do the bootstraps help to warn us that the mitochondrial EF-TUs might be misplaced?
- Can they substitute for a proper rate test (i.e. whether the tree approximates to a constant molecular clock)?
Part 2. Obtaining a secondary structure prediction from a multiple alignment
2D prediction is much more accurate with multiple alignments than with single sequences. Most current methods use neural networks trained on a combination of multiple alignments and structures. The original multiple alignment prediction by neural network was the PhD predictProtein server developed by Burkhard Rost at EMBL (now maintained from Columbia NY). The success of PhD spawned an industry of related methods. We will use the JPred server at the EBI which has a simple and robust interface and makes predictions of comparable quality to PhD. 2D predictions have a number of uses. E.g. they have been used to suggest that two divergent sequence families are structurally related. They can help to define domains in multidomain proteins and distinguish between globular and non-globular segments.
Prepare the sequences:
- In Clustal X, cut and paste the sequences so that all EFTUs are above EF1As with EFTU_ECOLI as the top sequence.
- EFTU-ECOLI will be used as the reference sequence by Jpred.
- Invoke the option Save Sequences as.
- Click on the GCG/MSF button.
- MSF is the multiple sequence format of the GCG package:
- Many programs accept this format.
- Using the Save from residue boxes, just save the range of the alignment bounded by the EFTU_Ecoli termini.
- Give the filename eftu.msf and click the OK button.
- In Netscape, open a new window and load eftu.msf into it.
- We now have the sequences ready for input into Jpred.
Run a Jpred prediction:
- In a Netscape window go the Jpred server.
- Note the links to useful information: if you used Jpred for research purposes you would want to learn more.
- Click on Submit a sequence which takes you to the job set up page.
- If you submit a single sequence, Jpred will make an alignment for you.
- According to the authors, this may give the best results.
- However, our advice is never trust an automatic alignment process!
- If you do run Jpred on a single sequence, check the alignment quality as you would if you made it yourself.
- Cut and paste the MSF format alignment into the sequence data box.
- Toggle on Aligned sequences in MSF format and turn off the scan of the Brookhaven Database.
- [Note: there is a useful "advanced option"' to run PhD, Predator and other independent prediction methods. We could compare them all to see how well they work. However it would take longer and there probably is not time today. ]
- Click the run secondary structure predictions button to submit the job.
- It will take a few minutes to run and may be queued behind others.
- When Jpred finishes, it offers several viewing options. The Postscript output is nice and could form the base material for a figure. Today click on the HTML format - a new window will open with the results.
- Examine the results page: The alignment is returned and below it results of component prediction algorithms. (To learn more about these, you would consult the documentation.)
- The main lines to note are:
- Jpred - the consensus prediction
- "E" = extended (beta-strand); "H" = helix; "- "loop or coil.
- The Lupas and Coil lines are coiled-coil predictions
- Jnet Rel - The reliability of prediction (1 = lowest; 9 = highest reliability)
- What is the shortest predicted segment of secondary structure?
- Are there any helix/strand conflicts between the prediction algorithms?
- Are there more helices predicted in the C-terminus or N-terminus?
- What is the largest coiled-coil segment in EFTU?
- Where is the largest region of low reliability in the prediction.
- Is it exclusively loop/coil prediction?
Compare 2D prediction to the EFTU structure:
- In a new Netscape window, open the CATH DB link.
- Search General Text for EFTU.
- Click on the upper 1efc link - this takes you to the CATH 2D summary for the E.coli structure.
- How many domains are noted for EFTU?
- Is the C-terminal domain mainly helical?
- Does the prediction roughly agree?
- How many beta-hairpins are there?
- 2D structure predictions can be evaluated by per-residue accuracy and/or by approximately correct placement of 2D elements.
- Using the rule that the 2D predictions must show >50% overlap with the actual element:
- How many helices are: correctly predicted; not predicted; over predicted?
- How many strands are: correctly predicted; not predicted; over predicted?
- How many 2D elements are: correctly predicted; not predicted; over predicted?
- Do the failures have mainly high or low reliability assignments?
- Can you have confidence in the reliabilities?
Part 2 Take Home Lessons
It is said that biology cannot be understood without setting it in an evolutionary context. Comparative sequence analysis is a continuation of the Darwinian tradition. Phylogenetic trees are fascinating in themselves but, in conjunction with multiple sequence alignments, are also important tools for gaining insight into the function of sequence families. However, tree calculations are unreliable unless there are plenty of diagnostic mutations to correctly assign the branching order. Variation in rate of sequence evolution confounds the algorithms, and can give rise to highly misleading trees: as we saw here, parts of the tree were obviously wrong when we apply extrinsic knowledge. Various mechanisms can give rise to rate increases: obviously selection for a new function (which can also help fix neutral mutations by a piggy-back mechanism); conversely a loss of function mutation can release purifying selection, also increasing fixation of neutral mutations. Other factors such as effective population size are important too: the larger the population size, the lower the likelihood that a given polymorphism will become fixed. Why do thermophilic prokaryotes evolve very slowly even though the chemically induced mutation rate ought to be higher? Perhaps it is because they live in an environment that has existed for 4 billion years, in actual physical locations that change on a slow geological timescale so selection is primarily conservative (purifying), and the effective population size is very large? At any rate, this serves to remind us that when we look at sequence divergence we see the accepted mutation rate and this will depend on many factors.