Multiple Sequence Alignment (MSA) Exercises and Demonstrations

MPI for Developmental Biology, Tuebingen

Tuesday 10th February 2009

Aidan Budd

Link to the Presentation

Demonstrating Structural Equivalence

Comparing (and aligning) pairs of similar structures demonstrates what it means for a pair of residues to be "structrually equivalent".

At the same time, it demonstrates that there may be regions of two structures for which there is not any such equivalence.

This example is a pair of bacterial toxins 1ji6 and 1i5p

We have aligned the N-terminal regions of these structures using FATCAT Here are links to the full-length structures aligned using
We made the alignments with FATCAT and CE as they both provide a pairwise sequence alignment and an easy-to-view structural alignment - not because of the quality of the alignments they produce (although these seem to be often good)

Comparing Sequence and Structural Alignments


How good are different structural alignment tools compared with purely sequence-based alignments?

We'll make a comparison using the example of two serine protease sequences (shown below) - the two sequences are both taken from a "gold-standard" alignment reference dataset (from

By examining the UniProt record, I have highlighted in red the active-site residues for 5ptp.

Note that the sequence below describe the proteins found in the PDB (not the UniProt) records - in this case that means it is after cleavage of signal peptide and activation hexapeptide.

>5ptp Bovine Cationic Trypsin TRY1_BOVIN (beta-trypsin form following Lys-23 autocatalytic clevage)

>1ton Rat Tonin KLK2_RAT

(both taken from

Initial PDB files 5ptp 1ton (chain A in both cases)

We will align


Carry out a similar exercise yourself with the following amylase sequences.

The three active-site residues in both sequences as described by their UniProt records have been labeled in red in the FASTA format sequences provided below.

>1smd AMY1_HUMAN


Here are the PDB structure file 1smd 1bf2

As for the serine proteases, align the above (note that pre-calculated results are provided in case the servers are down - try and get these results without consulting these files, however, unless necessary)
When examining the structural alignments, it may help to label within pymol the active site residues for the different chains

MSA from A-Z

Decision-making is central to building a good/accurate/appropriate MSA. Here are some examples of the kind of issues that require these decisions:
Your reasons for building the alignment - i.e. the biological quesiton(s) you are aiming to address using it - play a very important role in helping you reach decisions in response to questions such as these. Thus, there is not just one way to build an alignment comparing mammalian COX genes - the alignment you build will be differednt, depending on what you want to use the alignment for.

Thus, going from the beginning to the end of the process of building an MSA only makes sense in the context of a particular biological questions.

The demonstration/exercise shown below aims to do two things
Given that the other focus of this course is phylogeny, we will take evolutionary questions as the basis for this demonstration and exercise.

We'll go through the analysis step by step "online" using the projector.

After each step, we'll give you a chance to carry out the same actions as I've carried out on your own machines.

However, if you haven't time to carry out these analyses yourself, we've tried to provide files at each stage that you can use to move on to the next step, even if you haven't been able to complete the previous one.

COX1/COX2 Phylogeny

We want to make inferences about the phylogeny of the prostaglandin G/H synthases (COX1/COX2) gene family in mammals. In particular, we would like to estimate when the duplication event occured that yielded these two members of the family in humans. Perhaps we are interested in this question due to reading this paper (here's a static version of the Entrez PubMed Abstract for the paper incase the NCBI pages are down) where we read about  there being two COX genes in vertebrates/mammals and reference to an "invertebrate precursor" - when did this duplication occur?

This is a typical evolutionary question - using what we see today - (e.g contemporary human COX1/COX2 protein sequences), we try to infer  when (and if possible by what mechanism) the differences we see today have occured.
Identifying Our Initial Sequence of Interest
To begin, we need to find the sequence of at least one of our proteins of interest

A good place to start is searching the UniProt/SwissProt database, which provides high-quality, partially manually-curated sequence and functional annotation data for proteins (as content has relatively high accuracy, and is relatively small, making it faster to search) Notice how we need to use our biological knowledge to aid our bioinformatic searching - generally, the more we know about the biology of our sequences, the better we can judge the results of such searches, and the more interesting questions we can pose using our bioinformatic tools

This task (identifying the sequence/record associated with a given gene/protein/biological entitiy) is one of the main applications of bioinformatic tools (the other is to predict function/structure using comparison to similar genes/proteins/entities with demonstrated function/structure)

This was a relatively easy example - however, if we wanted to find additional information about the cox1 sequence e.g. 3D structure, known protein-protein or protein-chemical interactions etc., we would have to carry out similar kinds of searches of different databases (or, if we're lucky, use cross-references from the UniProt record) All this can take quite a lot of time and effort.

A short-cut that works in many circumstances is to use the automatic entity-recognition and biological-database integration software REFLECT
This is what happens when we REFLECT our abstract of interest Anyway - here are the UniProt records of the two human sequences COX1 and COX2
Note that these are NOT related to the mitochondrial protein COX1_HUMAN - these mitchondrial proteins were one of the reasons why our initial search of the UniProt database hit more than one record.
Identifying an Initial Set of Proteins Related to Our Sequence of Interest
We now need to identify a set of proteins that have the same structure/are related to our sequence of interest

There are many web resources/databases that provide such sets of proteins, in many cases already built into an MSA

We've already mentioned that different biological questions may well be best addressed by different MSAs. Thus the alignments/sets of sequences we obtain from such resources are unlikely to be the best possible alignments to address our own specific biological question. However, they often provide a good starting place for such an analysis - by carrying out modifications such as those described below, they can be adjusted to suit our specific question
However, in some cases such an alignment is not available - in which case a typical starting point is a BLAST search.

To build the MSA as far as possible "from scratch", we'll begin by looking at how we'd use a BLAST search for this.

We begin by choosing
If this search gives us a good starting point for our analysis, then no need to change the way we did the search.

If we hit many sequences, then downloading them all and building an MSA will take longer than necessary.

To change the search so that we hit fewer sequences we could:
If we find very few/no sequences, then we may well not have enough sequences to build an MSA appropriate for addressing our question of interest.

To carry out a search to identify more sequences we could:
We examine the results of the BLAST search
We will have around 40 sequences if we collect both of these two groups of sequences - this is a reasonable number to manipulate, so we will take all of them.

We can easily get both GenPept and FASTA format records (and save them locally) for these sequences by re-formating the BLAST output to show only those with low-enough E-values
Building an Initial MSA
We build an initial MSA from these sequences using MUSCLE at the EBI
We examine the alignment in CLUSTALX
Collecting Additional Sequences of Interest
Perhaps the duplication occured within the mammals - but we'd be interested to supplement these sequences with some from other chordates, so we'll query a larger database, restricting our search to non-mammalian chordates.
Combing Sequence Sets in one Alignment
It's probably easiest just to bundle all the sequences we have collected so far into one file and to submit this to MUSCLE again
In this case, we did not spend any time/effort creating a high-quality initial alignment e.g. by correcting mis-alignments, excluding unneeded sequences etc.

Sometimes, however, we want to incorporate additional sequences into an alignment that we have previously edited and manipulated - in which case we will not want to simply combine all (old and new) sequences together and completely re-align them.

Instead, we want to align the new sequences to the initial alignment while preserving the relationships of the sequecnes in the initial alignment to each other.

We can do this using CLUSTALX (in "profile alignment" mode.)

We can also do this using t_coffee (but only from the command-line)
MUSCLE also allows you to align two alignments with each other (known as profile-profile alignment), although not a set of sequences individually to a pre-existing alignment - this functionality is also only available from the command line
Adjusting Sequence Identifiers
As indicated, to make decisions about which sequences to exclude, we need to always consider the biological questions we are trying to address with our alignment.

In this case, we are interested in the timing of the duplication event leading to the two mammalian COX genes - thus we need to change the sequence identifiers so that we can:
When choosing new identifiers for the sequences we need to take into account that:
Here is a copy of the MUSCLE alignment created above with the identifiers altered - although note that the policy described above has not been carefully followed closely.
Preparing a "Final" Alignment
Almost always you will want to exclude some of the sequences from such an alignment. This might be for several different reasons:
Here is a copy of the above MUSCLE alignment, with altered sequence identifiers, trimmed of many sequences due to reasons described above.
Estimating a Phylogeny from the Alignment
Before estimating the phylogeny, we need to take into account that phylogenetic applications of MSAs are known to be strongly influenced by the inclusion of columns in the analysis where residues in the column are not realated by simple substitution events.

Therefore, we use the GBLOCKS server to extract those regions of the alignment that are  considered most likely to contain only columns where all residues are realated via substitution events.

Here is the HTML page showing which regions GBLOCKS kept and discarded from the alignment
Here is the FASTA format GBLOCKed alignment

Finally, we use a very quick and innacurate method to estimate phylogeny from this alignment - neighbour-joining as implemented by CLUSTALX - click here for a tree estimated in this way

(We will look at a more accurate approach in tomorrow's session which focuses on phylogenies)

Below is an image of the tree estimated in this way

Interestingly, there seem to be two copies of the gene also in the urochodate Ciona intestalis and the cephalochordate Branchiostoma floridae - however, the duplications appear to have occured independently of each other, and of the duplication that occurred to yield the two vertebrate copies of the gene.

Mammalian CDKs 1/2/3

Now you have a chance to carry out an similar analysis, this time on your own.

This time, suppose you are interested in a similar question, only this time you want to investigate the timing of the duplications that yielded the three mammalian CDK genes CDK1, CDK2, and CDK3 (for example, see this abstract [Santamaria et al.2007] which mentions all three of the proteins).

  1. Identify the UniProt record(s) associated with at least one of these genes
  2. Use BLAST to identify an appropriate set of proteins to build an alignment from, and download these sequences to a local FASTA-format file
  3. Use MUSCLE to build a multiple sequence alignment of the sequences
  4. Edit the names of the sequences so that you can tell which organisms  are from (while keeping the sequence identifiers unique)
  5. Examine the alignment in CLUSTALX (or JalView)
  6. Remove poorly-aligned regions from the alignment using the GBLOCKS server
  7. Quickly estimate a phylogeny from the GBLOCKed alignment, and visualise it using Njplot

Based on this tree, when do you think the three lineages diverged from each other? (i.e. when do you think their ancestral genes underwent duplications?)

If you have time, repeat a similar analysis as described above, this time begining with the appropriate TreeFam family as the initial source of sequences
(If you can't find the appropriate TreeFam record, here's the link to it).

Conserved Hydrophobic Core

Just to remind you - here we can see clearly that the non-polar residues can be almost completely excluded from the core of a globular protein.

Visualising and Editing Alignments

The aim of this exercise is to give you some practise using alignment viewers/editors to (i) identify and (ii) correct mis-aligned regions.

Download the alignments in the left column of the table below, all of which contain regions that are misaligned (in comparison with a set of reference alignments from BaliBase).

The second column links to an image of the alignment, with a box surrounding a region that contains a misalignment.

Using JalView and CLUSTALX, try to identify and correct the misalignment.

To see whether you were sucessful, compare your edited alignments with the reference alignments (provided in the third column in FASTA format, in the fourth as HTML, which highlights the regions BaliBase is confident are correctly aligned). The fifth column links to an image highlighting more closely the region that is misaligned, along with an indication of how the region might be corrected. The final column links to the BaliBase page for the alignment.

Alignment with Misalignment
Misaligned Region Highlighted
Reference BaliBase Alignment (FASTA)
Reference BaliBase Alignment (HTML - Core Regions Highlighted)
Edit Highlighted
BaliBase Page
BB12003.ref.fasta BB12003.html BB12003.edit.jpg
BB50013.mis.fasta BB50013.mis.jpg BB50013.ref.fasta BB50013.html BB50013.edit.jpg BB50013-BaliBase
BB50002.mis.jpg BB50002.ref.fasta BB50002.html BB50002.edit.jpg BB50002-BaliBase

Comparing MSA Tools


As you'll do in the exercises below, here we'll look at what happens when we re-align a reference alignment from BaliBase (1aboA_ref1) using several different pieces of automatic alignment software.


You will be told to take one from the list of BaliBase2 list of reference alignments.
  1. 1idy_ref1
  2. 1ycc_ref4
  3. kinase1_ref4
  4. 1eft_ref5
  5. 1pfc_ref1
  6. 1wit_ref1
  7. 1csp_ref1
  8. 1ldg_ref1
  9. sh3_ref6
  10. 1mrj_ref1
  11. 1led_ref1
Download the RSF format of the alignments of interest.

Load alignment into CLUSTALX locally, remove all gaps from all sequences (Edit->Select all sequences, Edit->Remove all Gaps) and save in FASTA format.

Alignment->Output Format Options->Output Order change to "Input" (this will preserve the order of the sequences in any alignments done by CLUSTALX to be the same as the order in the input alignment - makes it easier to compare the results of a CLUSTALX alignment with thre reference alignment)

Re-align the sequences locally using CLUSTALX (Alignment->Do Complete Alignment) - REMEMBER TO CHOOSE A DIFFERENT NAME FOR THE ALIGNMENT OUTFILE!

Compare the automatic alignment created by CLUSTALX with the reference BaliBase alignment - it will help if you re-arrange the (vertical) order of the sequences in the CLUSTALX alignment.

How many columns are mis-aligned in the CLUSTALX alignment i.e. where residues are aligned in the same column in the CLUSTALX alignment which are not aligned in the same column in the BaliBase alignment?  NB - Only consider those columns in the BaliBase alignment which are underlined/in upper case - these are the ones which are assessed as being reliably aligned according to structural comparisons

Do the same for other alignment tools such as:

Finding Pre-Calculated Alignments


If you can find a pre-calculated alignment that already contains a set of sequences that is useful for addressing your problem, you may well be able to save lots of time. Using the databases/resources below, we'll look at how you can get access to alignments that you can download yourself and examine locally.
>SRC_HUMAN|P12931|Proto-oncogene tyrosine-protein kinase Src (EC ENSG00000197122


Using these (and perhaps other alignment resources you might find) find a pre-calculated MSA that is will be a good starting place for
building an alignment to investigate:

Back To Gibson Team Training Pages