Multiple Sequence Alignments


Exercises and Demonstrations

Structural and Sequence Alignments


Serine protease sequences



(both taken from

Initial PDB files 5ptp 1ton
Aligned PDB files 5ptp 1ton
(It's chain A in both cases)


Suppose you are working with the following amylase sequence:


The three residues forming the active site of the enzyme are the D, E, and D labeled in red in the sequence.

You can find out more about the sequence from its SwissProt entry P04745.

Assume further that you know the sequence below to be that of an amylase as well:


(Checking the SwissProt record for this sequence, P10342, will show that it is indeed an amylase; for now, however, we imagine that we do not know this.)

To identify possible candidate residues in the 1bf2 sequence that may be active site residues, attempt to align the sequences using
Compare these results to a structural alignment produced with CE. To the right are the links to the PDB structure files 1smd 1bf2, each of which contains just a single chain.
You can look at an MSA including these two sequences in BaliBase where the sequences make up part of reference set 1pamA_ref1.
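Once you have a pairwise alignment, the candidate active-site residues in the second sequence are simply the residues that share a column with the known active-site residues of the first. The following is a minimal sketch of that bookkeeping. The function name `map_positions` and the toy sequences are hypothetical and are not the real amylase sequences; positions are counted from 1 and `-` is assumed to be the gap character.

```python
def map_positions(aligned_a, aligned_b, positions_a):
    """Map 1-based residue positions in sequence A onto sequence B.

    aligned_a and aligned_b are the two gapped rows of a pairwise alignment.
    Returns {position_in_a: (residue_a, position_in_b, residue_b)}; the last
    two entries are None when the residue in A is aligned against a gap.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    wanted = set(positions_a)
    mapping = {}
    pos_a = pos_b = 0  # number of residues seen so far in each sequence
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a != "-":
            pos_a += 1
        if res_b != "-":
            pos_b += 1
        if res_a != "-" and pos_a in wanted:
            if res_b == "-":
                mapping[pos_a] = (res_a, None, None)
            else:
                mapping[pos_a] = (res_a, pos_b, res_b)
    return mapping


# Toy alignment, invented for illustration only.
toy_a = "MKD-LEAD"
toy_b = "MRDGLE-D"
result = map_positions(toy_a, toy_b, [3, 5, 6, 7])
for pos, (res_a, pos_b, res_b) in sorted(result.items()):
    print(pos, res_a, "->", pos_b, res_b)
```

Running the same function on the sequence-based and the structure-based alignment of a pair makes it easy to see whether both place the same residues opposite the known active site.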

If you finish quickly, then browse the BaliBase alignments, and choose a further family of interest.

You can download the alignments in several different formats. To get the ungapped sequences of the proteins represented in the family, load an alignment into CLUSTALX, remove all the gaps, and save it in FASTA format.

Use the PDB website to find the structures on which the alignments are based (the 4-character identifiers should be the same as those used in the PDB, e.g. 1smd).

SwissProt/UniProt accession numbers are provided on the BaliBase pages.

You can now align pairs of sequences using BLAST2Sequences and/or Smith-Waterman/Needleman-Wunsch, and compare these to the structural alignments of the same pairs of proteins. Check whether the trends in gap placement, and in the ability to identify equivalent residues, are similar to those seen for the amylase sequences above.
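To see why sequence-based tools place gaps where they do, it can help to look at the dynamic programming behind a global (Needleman-Wunsch) alignment. The sketch below is a deliberately simplified version: it assumes a flat match/mismatch score and a linear gap penalty, whereas real tools use substitution matrices such as BLOSUM and separate gap-open and gap-extend penalties. The function name and default scores are illustrative choices, not those of any particular program.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(seq_a), len(seq_b)
    # score[i][j] = best score for aligning seq_a[:i] with seq_b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align the two residues
                score[i - 1][j] + gap,      # gap in seq_b
                score[i][j - 1] + gap,      # gap in seq_a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(seq_a[i - 1])
                out_b.append(seq_b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, row_a, row_b = needleman_wunsch("ACGT", "AGT")
print(best)
print(row_a)
print(row_b)
```

When several alignments share the best score, this traceback prefers aligning two residues over opening a gap, so other equally good gap placements are not reported. That is one reason gap positions from sequence alignment can differ from those in a structural alignment.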

Finding Pre-Calculated Alignments


If you can find a pre-calculated alignment that already contains a set of sequences that is useful for addressing your problem, you may well be able to save lots of time. Using the databases/resources below, we'll look at how you can get access to alignments that you can download yourself and examine locally.
>SRC_HUMAN|P12931|Proto-oncogene tyrosine-protein kinase Src (EC ENSG00000197122


Using these (and perhaps other alignment resources you might find), find a pre-calculated MSA that will be a good starting point for
building an alignment to investigate:

Editing Alignments


BaliBase2 kinase2_ref5 realigned (with errors) by ClustalX.

We examine the alignment using CLUSTALX, and correct some of the mistakes using JalView.


Try correcting the mistakes made by CLUSTALX in this alignment of a set of hepatitis proteinases taken from BaliBase (here is the link to the reference alignment), using JalView to edit the alignment. It will help to examine the alignment in CLUSTALX with the Quality->Show Low-Scoring Segments option switched on, to identify potentially mis-aligned regions.

Another such example is this BaliBase alignment of thermitases re-aligned by CLUSTALX.

The following alignment of KH domains was created by CLUSTALX; it contains several mis-aligned regions.

Compare the results of your edits with this manually-curated alignment of the KH domains.

If you want more practice of this kind, take the following manually-curated alignments, realign them using an automatic alignment tool, and again try to identify the mis-aligned regions and correct them using JalView. Compare your edits with the initial alignment.

Automatic MSA Tools


Here, as in the exercises below, we'll look at what happens when we re-align a reference alignment from BaliBase (1aboA_ref1) using a range of different automatic alignment tools.


You will be assigned one alignment from the following list of BaliBase2 reference alignments.
  1. 1idy_ref1
  2. 1ycc_ref4
  3. kinase1_ref4
  4. 1eft_ref5
  5. 1pfc_ref1
  6. 1wit_ref1
  7. 1csp_ref1
  8. 1ldg_ref1
  9. sh3_ref6
  10. 1mrj_ref1
  11. 1led_ref1
  12. 5ptp_ref1
Download the alignments of interest in RSF format.

Load the alignment into CLUSTALX locally, remove all gaps from all sequences (Edit->Select all sequences, Edit->Remove all Gaps), and save it in FASTA format.
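If you prefer to script the gap-removal step, the same result can be produced with a few lines of Python. This is only a sketch: it assumes the alignment has already been read into a dictionary of name to gapped sequence (parsing RSF is not shown), and that `-` and `.` are the gap characters. The function name `ungapped_fasta` is hypothetical.

```python
def ungapped_fasta(alignment, width=60, gap_chars="-."):
    """Return FASTA text with all gap characters removed from each sequence.

    alignment maps sequence names to gapped sequence strings.
    Sequence lines are wrapped at `width` characters.
    """
    records = []
    for name, gapped in alignment.items():
        seq = "".join(ch for ch in gapped if ch not in gap_chars)
        lines = [seq[i:i + width] for i in range(0, len(seq), width)]
        records.append(">" + name + "\n" + "\n".join(lines))
    return "\n".join(records) + "\n"


# Toy alignment, invented for illustration only.
toy_alignment = {"seq1": "MK-D.LE", "seq2": "M--DGLE"}
fasta_text = ungapped_fasta(toy_alignment)
print(fasta_text, end="")
```

The returned text can be written to a file and used as input for any of the alignment tools.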

Re-align the sequences locally using CLUSTALX (Alignment->Do Complete Alignment) - REMEMBER TO CHOOSE A DIFFERENT NAME FOR THE ALIGNMENT OUTFILE!

Compare the automatic alignment created by CLUSTALX with the reference BaliBase alignment - it will help if you re-arrange the (vertical) order of the sequences in the CLUSTALX alignment.

How many columns are mis-aligned in the CLUSTALX alignment, i.e. how many columns group together residues that are not in the same column in the BaliBase alignment? (Only consider those columns in the BaliBase alignment which are underlined/in upper case; these are the ones assessed as being reliably aligned according to structural comparisons.)
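Counting mis-aligned columns by eye becomes tedious for larger alignments, so a small script can help. The sketch below takes one possible reading of the question: a reliably aligned reference column counts as reproduced only if the test alignment contains a column with exactly the same residues (identified by their position within each sequence). It assumes that reliable columns are those in which every sequence has an upper-case residue, that both alignments use the same sequence names, and that `-` and `.` are gap characters. The function names are hypothetical.

```python
def column_signatures(alignment, names):
    """Describe each column by the 1-based residue index of each sequence.

    A gap contributes None, so a column is identified independently of
    where it happens to sit in the alignment.
    """
    counters = {name: 0 for name in names}
    length = len(alignment[names[0]])
    signatures = []
    for col in range(length):
        sig = []
        for name in names:
            ch = alignment[name][col]
            if ch in "-.":
                sig.append(None)
            else:
                counters[name] += 1
                sig.append(counters[name])
        signatures.append(tuple(sig))
    return signatures


def misaligned_core_columns(reference, test):
    """Return (number of core columns not reproduced, number of core columns)."""
    names = sorted(reference)
    ref_sigs = column_signatures(reference, names)
    test_sigs = set(column_signatures(test, names))
    length = len(reference[names[0]])
    # Assumption: a core column has an upper-case residue in every sequence.
    core = [c for c in range(length)
            if all(reference[n][c].isupper() for n in names)]
    wrong = [c for c in core if ref_sigs[c] not in test_sigs]
    return len(wrong), len(core)


# Toy data, invented for illustration only.
reference = {"s1": "ACDe-F", "s2": "ACDgHF"}
realigned = {"s1": "AC-DEF", "s2": "ACDGHF"}
print(misaligned_core_columns(reference, realigned))
```

This all-or-nothing column score is strict: a single misplaced residue makes the whole column count as wrong, which is worth bearing in mind when comparing tools.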

Do the same for other alignment tools such as:
Which of the different tools creates the best alignments when compared to the BaliBase alignment?

Is there any obvious feature of these alignments that this method of assessing alignment quality ignores?

If you have time, take some additional BaliBase alignments and repeat these analyses, ideally taking them from different BaliBase categories. (Note: you will make life easier for yourselves if you focus on short alignments with only a few sequences.)

Is there a trend for certain tools to always be better/worse than others, or do the results vary a lot from alignment to alignment?

Are particular types of alignments easier/harder to align correctly? (e.g. are alignments where average identity is high more reliably aligned?)
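To relate alignment difficulty to sequence similarity, you need a number for the average identity of each alignment. The sketch below uses one common convention, which is an assumption here: identity for a pair is the fraction of identical residues among the columns where neither sequence has a gap, and the average is taken over all pairs. Other conventions (for example, dividing by the length of the shorter sequence) give different values. The function names are hypothetical.

```python
from itertools import combinations


def pairwise_identity(row_a, row_b, gap_chars="-."):
    """Fraction of identical residues over columns where both rows have a residue."""
    aligned = [(x.upper(), y.upper())
               for x, y in zip(row_a, row_b)
               if x not in gap_chars and y not in gap_chars]
    if not aligned:
        return 0.0
    return sum(x == y for x, y in aligned) / len(aligned)


def average_identity(alignment):
    """Mean pairwise identity over all pairs of sequences in the alignment."""
    pairs = list(combinations(alignment.values(), 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


# Toy alignment, invented for illustration only.
toy_msa = {"a": "ACDE", "b": "ACDF", "c": "A-DE"}
avg = average_identity(toy_msa)
print(round(avg, 3))
```

Computing this value for the reference alignment of each family gives a simple axis against which to compare how well each tool performed.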

Building an alignment from scratch


Imagine we have noticed that the protein P53_HUMAN contains a match to a cyclin-binding site pattern, and we would like to see whether this site is (a) conserved in other P53 sequences and (b) more strongly conserved than the sequence adjacent to it.
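Question (b) can be made concrete by scoring each alignment column and comparing the columns of the site with the columns flanking it. The sketch below uses a very simple conservation measure, chosen here for illustration: the fraction of sequences that carry the most common residue in a column. The function names, the flank width and the toy alignment are all hypothetical and are not real P53 data. Column coordinates are 0-based with an exclusive end.

```python
from collections import Counter


def column_conservation(alignment, gap_chars="-."):
    """Per-column fraction of sequences carrying the most common residue.

    Gaps never count as the conserved residue, but they do count in the
    denominator, so gappy columns score low.
    """
    rows = [seq.upper() for seq in alignment.values()]
    scores = []
    for column in zip(*rows):
        residues = [ch for ch in column if ch not in gap_chars]
        if not residues:
            scores.append(0.0)
        else:
            top_count = Counter(residues).most_common(1)[0][1]
            scores.append(top_count / len(column))
    return scores


def mean(values):
    return sum(values) / len(values) if values else float("nan")


def site_vs_flanks(scores, start, end, flank=5):
    """Mean conservation of columns [start, end) and of the flanking columns."""
    site = scores[start:end]
    flanks = scores[max(0, start - flank):start] + scores[end:end + flank]
    return mean(site), mean(flanks)


# Toy alignment, invented for illustration only.
toy_msa = {"a": "AKRLLG", "b": "GKRLLC", "c": "TKRLLD"}
scores = column_conservation(toy_msa)
site_mean, flank_mean = site_vs_flanks(scores, 1, 5, flank=1)
print(site_mean, flank_mean)
```

A site mean clearly above the flank mean would suggest that the motif is under stronger constraint than its surroundings, although with few or very similar sequences the difference may not be meaningful.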


Either use BLAST (or one of the databases of multiple sequence alignments) to address a question of your own using MSAs, or try out one of the following exercises:

Back to Gibson Team course pages at EMBL.