Biocomputing Unit
Sequence Analysis Service

Autumn 01 Course

Making and Applying Multiple Sequence Alignments

by Toby Gibson, Chenna Ramu and Aidan Budd, November 26th-29th, 2001

In this practical we will use locally installed programs on our UNIX server for making multiple sequence alignments and trees, and a WWW server for secondary structure prediction. All of this could perfectly well be done on a Mac (or PC) too.

Examining multiple sequence alignments is a vital activity in modern biology. Comparative sequence analysis is the continuation of the Darwinian/Linnaean tradition of comparative biology (so fruitful in the history of biology) at the molecular level. The alignments also serve as input into other types of analyses and here we will perform two very different ones: deriving trees and secondary structure (2D) predictions.

We will make a multiple alignment of a protein sequence family and calculate a tree from the sequences. We will study the tree to see whether it fits with our current understanding of phylogeny. We will then use the same alignment to get a 2D prediction and compare it to the known structure for one of these proteins.

Part 1. Making multiple sequence alignments and calculating a tree.

We will use:

Getting Started

Both these programs run on desktop Macs and PCs but today we will run them on TAU, a UNIX server where they are already installed.

On your LINUX PC:

Step 1. Getting a set of EF-TU / EF-1A sequences

Elongation factors are found in all species, so they have often been used for phylogenetic investigations. EF-TU in eubacteria and EF-1A in eukaryotes are orthologous factors. There are >150 entries in SWISSPROT, which would take too long to align today, so we will use the SRS query manager to provide a representative selection.


Step 2. Aligning the elongation factor sequences with Clustal X

Multiple alignments have many uses: revealing important conserved residues, making phylogenies, secondary structure prediction, etc.
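
As a small illustration of how an alignment reveals conserved residues, the sketch below scores each column by the fraction of sequences that share the most common residue. The function name, the toy sequences and the choice to count gaps as mismatches are all assumptions made for this example; this is not how Clustal X computes its conservation display.

```python
from collections import Counter

def column_conservation(alignment):
    """Score each alignment column by the fraction of sequences that carry
    the most common residue. Gaps ('-') count as mismatches (an assumption)."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)  # column made only of gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores

# Toy alignment invented for illustration (not real EF-TU data).
toy_alignment = [
    "MGKEK-HV",
    "MAKEKTHI",
    "MSKEKFHV",
]
scores = column_conservation(toy_alignment)
# Columns where every sequence has the same residue.
conserved = [i for i, s in enumerate(scores) if s == 1.0]
print(conserved)
```

Fully conserved columns in a real family are candidates for functionally or structurally important positions, although the score means little when the alignment contains only a few closely related sequences.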


Step 3. Calculating a tree with Clustal X and displaying it with NJplot

Clustal X uses the neighbour-joining (NJ) method to calculate trees. This is a distance method: the tree is built from pairwise distances between the sequences, and it gives reasonable results. NJ is not considered the best method (that is usually said to be the computationally intensive Maximum-Likelihood (ML) approach), but it is fast and good for a quick examination of tree topology. In particular, NJ is less robust than ML to variation in mutation rates between the sequences. A common artefact of unequal rates is "long branch attraction": fast-evolving sequences, which have long branches, move toward each other and deeper into the tree than their true positions.
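
To make the idea of a distance method concrete, here is a minimal sketch of neighbour-joining in Python. It is not the Clustal X implementation. The uncorrected p-distance, the function names and the small five-taxon distance matrix are assumptions chosen for illustration, and the final Newick string simply joins the last two nodes with the whole remaining distance placed on one branch.

```python
from itertools import combinations

def p_distance(s1, s2):
    """Fraction of differing residues over columns where neither sequence has a gap."""
    pairs = [(a, b) for a, b in zip(s1, s2) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns shared by the two sequences")
    return sum(a != b for a, b in pairs) / len(pairs)

def neighbour_joining(dist):
    """dist maps name -> {other name -> distance}.
    Returns (newick_string, joins), where joins lists (i, j, length_i, length_j)."""
    d = {a: dict(row) for a, row in dist.items()}  # work on a copy
    joins = []
    while len(d) > 2:
        n = len(d)
        # r[a] is the summed distance from a to every other node.
        r = {a: sum(d[a][b] for b in d if b != a) for a in d}
        # Join the pair that minimises Q(i, j) = (n - 2) * d(i, j) - r(i) - r(j).
        i, j = min(combinations(d, 2),
                   key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        li = d[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
        lj = d[i][j] - li
        joins.append((i, j, li, lj))
        u = f"({i}:{li:.3f},{j}:{lj:.3f})"  # new internal node, named by its subtree
        # Distance from the new node to every remaining node k.
        new_row = {k: (d[i][k] + d[j][k] - d[i][j]) / 2 for k in d if k not in (i, j)}
        for k in (i, j):
            del d[k]
        for row in d.values():
            del row[i], row[j]
        for k, v in new_row.items():
            d[k][u] = v
        d[u] = new_row
    a, b = d  # the two remaining nodes
    return f"({a}:{d[a][b]:.3f},{b});", joins

# Hypothetical distance matrix for five taxa, invented for illustration.
names = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
dist = {x: {y: matrix[i][j] for j, y in enumerate(names) if i != j}
        for i, x in enumerate(names)}
newick, joins = neighbour_joining(dist)
print(newick)
```

Note that the first pair joined (a and b) is not the pair with the smallest raw distance (d and e): the Q criterion corrects each distance by how far the two nodes are from everything else. The branch lengths come straight from the input distances, so any rate differences between sequences feed directly into the tree.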

Calculate the tree:

Display the tree:


Part 2. Obtaining a secondary structure prediction from a multiple alignment

2D prediction is much more accurate with multiple alignments than with single sequences. Most current methods use neural networks trained on a combination of multiple alignments and structures. The first neural-network method to predict from multiple alignments was PHD, offered through the PredictProtein server developed by Burkhard Rost at EMBL (now maintained from Columbia, NY). The success of PHD spawned an industry of related methods. We will use the JPred server at the EBI, which has a simple and robust interface and makes predictions of comparable quality to PHD.

2D predictions have a number of uses. For example, they have been used to suggest that two divergent sequence families are structurally related. They can also help to define domains in multidomain proteins and to distinguish between globular and non-globular segments.

Prepare the sequences:

Run a JPred prediction:


Compare 2D prediction to the EFTU structure:
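
One simple way to put a number on this comparison is per-residue three-state agreement (often called Q3). The sketch below assumes that the prediction and the known structure have both been written as equal-length strings using H (helix), E (strand) and - (other). The two strings are invented for illustration and are not real EF-TU data.

```python
def q3_accuracy(predicted, observed):
    """Fraction of residues whose predicted state matches the observed state.
    Both arguments are equal-length strings over H, E and - (an assumed encoding)."""
    if len(predicted) != len(observed):
        raise ValueError("prediction and observed structure must have the same length")
    if not predicted:
        raise ValueError("empty input")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(predicted)

# Hypothetical three-state strings, invented for illustration.
predicted = "--HHHHHH---EEEE--"
observed = "-HHHHHHH---EEE---"
score = q3_accuracy(predicted, observed)
print(f"Q3 = {score:.2f}")
```

In this toy case the two mismatches fall at the ends of a helix and a strand. A single score hides where the errors are, so it is worth looking along the sequence as well.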


Part 2 Take Home Lessons

It is said that biology cannot be understood without setting it in an evolutionary context. Comparative sequence analysis is a continuation of the Darwinian tradition. Phylogenetic trees are fascinating in themselves but, in conjunction with multiple sequence alignments, are also important tools for gaining insight into the function of sequence families. However, tree calculations are unreliable unless there are plenty of diagnostic mutations to correctly assign the branching order. Variation in the rate of sequence evolution confounds the algorithms and can give rise to highly misleading trees: as we saw here, parts of the tree were obviously wrong when we applied extrinsic knowledge.

Various mechanisms can give rise to rate increases. The obvious one is selection for a new function (which can also help fix neutral mutations by a piggy-back mechanism); conversely, a loss-of-function mutation can release a sequence from purifying selection, also increasing fixation of neutral mutations. Other factors such as effective population size are important too: the larger the population size, the lower the likelihood that a given polymorphism will become fixed.

Why do thermophilic prokaryotes evolve very slowly even though the chemically induced mutation rate ought to be higher? Perhaps it is because they live in an environment that has existed for 4 billion years, in actual physical locations that change on a slow geological timescale, so selection is primarily conservative (purifying) and the effective population size is very large? At any rate, this serves to remind us that when we look at sequence divergence we see the accepted mutation rate, and this will depend on many factors.