Multiple Sequence Alignments

Exercises and Demonstrations

Aidan Budd

3D Structure-Based Interpretation of Alignments

Teaching Objectives

After completing this section, you will hopefully appreciate the properties that we expect residues in the same column to share of a sequence alignment that is being interpreted in a "structural" context.


Comparing (and aligning) pairs of similar structures demonstrates what it means for a pair of residues to be "structurally equivalent" - the relationship that we want residues in the same column of an alignment to share if we are using the alignment in a "structural" context e.g. predicting secondary structure, or building a protein profile HMM to carry out a sensitive sequence similarity search.

At the same time, it demonstrates that there may be regions of two structures for which there is not any such equivalence.


This example is a pair of bacterial toxins 1ji6 and 1i5p with very similar structures

We have aligned the N-terminal regions of these structures using FATCAT It should be clear, when looking at the structural alignment, that the structures are very similar, with most residues sharing 1:1 structural equivalence with a residue in the other structure.

By contrast, we provide below an alignment of two very different structures. Indeed, considered in terms of the kinds of secondary structure elements they contain, they are completely different (one is mainly alpha, the other mainly beta).
Note that we are still able to align these two structures, despite our opinion that they are global extremely dissimilar. The same is true for multiple sequences alignments - most software will report an alignment, whether or not the global similarity most of the software assumes the sequences share is indeed present within the sequences.

Given the global dissimilarity of the alpha solenoid and beta barrel structures, you would almost certainly want to avoid using such an alignment to make any inferences about similarity of function/structure between residues aligned in the same column.

In general, this illustrates the fact that it is important to be very confident that the sequences you include in an MSA indeed share the relationship you are interested in - this will typically be "structural" and/or "evolutionary" equivalence. Note, however, that we are avoiding a discussion of how to judge whether or not two sequences do indeed have such a relationship with each other - this involves using tools such as BLAST, that we we are not focusing on in this set of exercises.

If you are interested, you might like to try comparing some other pairs of structures and examining their structural and sequence alignments as calculated by FATCAT and CE.

Instructions on how to calculate pairwise structural alignments using CE and FATCAT and display the results in PyMOL

Note that all the structures used in the exercise below contain only one polypeptide chain, chain A. You need to supply this information when submitting the structures to CE and FATCAT.
Note that we are suggesting that you use FATCAT and CE as they both provide a pairwise sequence alignment along with easy-to-view structural alignments - not because they are known to provide on average the best quality alignments (although these seem to be often good).

Non-Equivalence Between Structural-Based and "Evolutionary"-/"Homology"-Based Interpretation of Alignments

Teaching Objectives

After completing this section, you will hopefully appreciate that while we often assume that "evolutionarily equivalent"/"homologous" residues are also structurally equivalent, this is not necessarily the case. This should highlight the importance of being aware of which of these contexts you are assuming when working with any given sequence alignment


During the accompanying presentation, you will probably have carried out an exercise discussing an alternative way of interpreting sequence alignments.

Remember that the previous section describes interpretation of alignments in a 3D-structural context - residues in the same column of an alignment should ideally share structurally-equivalent locations within the 3D structures of the molecules they describe. In the alternative "evolutionary" context, residues in the same column of an alignment should be able to trace their evolutionary history back to the same residues in the most recent common ancestor of the two sequences - it is assumed that any differences observed in a column (e.g. a column in which two sequences have a different amino acid residue e.g. glycine in one sequence and alanine in another) are due only to point mutation/substitution events.

In many cases it is appropriate to assume that closely-related sequences have very similar structures - in the previous section we looked at alignments of 3D protein structures between proteins that were evolutionarily related.

However, structural equivalence does not necessarily imply evolutionary relatedness - nor does evolutionary relatedness necessarily imply structural equivalence, as we will see in the following demonstration.

Notes - Short Protein Linear Motifs

Short protein linear motifs, predominantly found in eukaryotic proteomes (hence sometimes called ELMs for Eukaryotic Linear Motifs - follow this link to a resource developed by our group for exploring the presence of ELMs in your sequence of interest) are typically between 3 - 9 amino acid residues in length, and have many important roles in a wide range of different cellular processes. They typically function by mediating interactions with specific globular domains e.g. protein kinase motifs interact with protein kinase domains, cyclin-binding boxes interact specifically with cyclins etc.

While there is typically one or more position in a motif that must be present for the motif to be functional, other positions can often be occupied by many different possible amino acid residues. For example, a motif that is able to bind the SH3 domain has the sequence pattern ...[PV]..P where "." indicates any one of the 20 possible amino acids, "[PV]" indicates either the amino acids proline or valine, and "P" indicates that the position must be occupied by a proline.

The relatively weak constraints placed on the sequence of such a motif make them relatively likely to appear de novo in a protein sequence, for example, as the result of point mutations - in stark contrast to globular domains. Thus, where the same linear motif is present in two unrelated proteins (i.e. we have no evidence for the relatedness of the proteins to each other, but both possess a short sequence mediating an interaction with the same kind of globular domain), we assume that the two motifs are not evolutionarily related to each other. Thus, it would not make sense to analyse an alignment of such motifs in an evolutionary context - despite the fact that the 3D structures of the motifs will usually be very similar.

Demonstration - Structural Equivalence Without "Evolutionary" Equivalence

Both the HIV1 nef protein (UniProt:NEF_HV1BR) and the human neutrophil cytosol factor 1 (UniProt:NCF1_HUMAN) contain SH3-domain binding motifs that match the pattern "...[PV]..P" (note that peptides matching the patterns "P..P.[KR]" and "[RKY]..P..P" are also known to bind to the SH3 domain). These SH3-binding peptides are highlighted in red in the sequences shown below.


>NCF1_HUMAN Neutrophil cytosol factor 1

The following PyMOL session files show the 3D protein structures of these SH3-motif:SH3-domain intereactions. SH3 domains are coloured in cyan, the SH3-binding motifs are coloured in pink/salmon, and are shown with their amino acid side chains drawn as sticks. The N-termini of the peptides are coloured in blue, the C-termini in red.

NCF1_HUMAN interacting with an SH3 domain also contained within the NCF1_HUMAN protein
NEF_HV1BR interacting with the SH3 domain in human FYN1 (UniProt:FYN1_HUMAN)
This PyMOL session file shows the two motifs aligned in 3D - there are clear 1:1 equivalencies between positions in the two structures, despite no evidence for the two sequences being evolutionarily related (i.e. that their sequences can be traced back to the same ancestral protein sequence via either point mutaitons or no mutations at all).

Demonstration - One Sequence, Two Different Structures - Evolutionary Equivalence without Stuctural Equivalence

Ideally, we would show here an example of naturally-occurring sequences that were clearly related, but whose structures had been determined and found to form globular domains with different folds - but, as yet, we have not found such an example.

However, we do have an example of a sequence - the human Lymphotactin protein (UniProt:XCL1_HUMAN,P47992) - which adopts two different folds depending on conditions. The two sequences are clearly related (they're identical!), but in different contexts, evolutionarily-related/equivalent/homologous residues clearly have different structrues.

This PyMOL session file, which includes data from PDB records 1j8i (local file, PDB, PDBsum) and 2jp1 (local file, PDB, PDBsum), shows the two different forms of the protein in one file.

In the two structures, some regions form equivalent secondary structure elements - however, in at least two cases, the same residues form rather different structures (two such examples are highlighted in cyan and yellow in the PyMOL session file - regions of the same colour are equivalent in the two different structures).

Introducing Pairwise/Multiple Sequence Alignments with JalView


To explore our own ideas about what constitutes a good alignment, here we will carry out 'manual' alignments of some pairs of sequences using JalView. Most of us, when presented with a set of sequences, will have some 'intiuitive'/subconcious feeling about what a good or a bad alignment of them would look like. Trying to build your own 'good' alignments manually should help you understand better what you think makes an alignment 'good' or 'bad'.

This exercise is also useful as it gives you an opportunity to gain familiarity with JalView, which is (in my opinion) a good choice of software for editing sequence alignments - certainly it's what I use.

For instructions on how to use JalView follow this link.

I'll demo for you now:
With JalView I will:


Try the same with the sets of sequences provided at the bottom of this exercise - don't try and do all possible pairwise alignments, just pick out a few from each file.

While you're carrying out this exercise, make notes/discuss with your neighbours on the following issues/questions:
The sequence sets linked to below are (1-3) three different, non-overlapping regions of the human, rat, and mouse collagen 18s and (4) full length
To check your answers, you could align the sequences automatically using the EBI MUSCLE webserver, and then try comparing these results with the alignments you have built manually.

"Manual" Pairwise Sequence Alignments using JalView - All Files Containing only 2 Sequences


We will demonstrate aligning these two tublin sequences to each other using JalView. The demonstration will involve
Note that choosing/building an alignment is "simply" about deciding where gaps belong and where they do not belong.

We will also automatically build pairwise alignments of the sequences using three different approaches:
and compare the results of these approaches to the alignment we bulit "manually".


Try building pairwise alignments yourself using JalView in a similar way with the pairs of sequences found in the following files.

Try also one (or more) of the automatic pairwise alignment tools listed above.

Compare your "manual" alignment with the automatic alignments and:
While carrying out the manual alignments, write down:
  1. Features that describe a relatively "good" alignment, thinking in terms of
  2. Instructions on how to change a "bad" alignment into a better one
  3. Characteristics of sequences that are more difficult/take more time to align than others
Below are the sequences - if possible, try all of them, as they have been chosen to illustrate a range of different issues/points, and it will hopefully be useful for you to have encountered all of these.

"Manual" Pairwise Sequence Alignments using JalView


To demonstrate some of the charactistics of sequences that can make them easier or more difficult to align, we will try aligning some sequences ourselves "manually" using the JalView sequence editor.

This exercise is also useful as it gives you an opportunity to gain familiarity with JalView, which is (in my opinion) a good choice of software for editing sequence alignments - certainly it's what I use.

For instructions on how to use JalView follow this link.

I'll demo for you now:
With JalView I will:
I'll also try and make an MSA of all the sequences - and compare that with that obtained using the EBI MUSCLE webserver


1. Protein Kinase domains
Load these protein kinase sequences into JalView

Use JalView to make pairwise alignments "manually" as for the above demonstration, and compare the results with those obtained using the smith-waterman, needleman-wunsch, and BLAST 2 Sequence algorithms

Also try to build an MSA using JalView from the same sequences, and compare the result to that obtained using the EBI MUSCLE webserver
2. Collagen 18s
The sequence sets linked to below are (1-3) three different, non-overlapping regions of the human, rat, and mouse collagen 18s and (4) full length
  1. collagen 18s region A
  2. collagen 18s region B
  3. collagen 18s region C
  4. Full-length collagen 18s
For each set of sequences, use JalView to build a multiple sequence alignment, and compare the results you obtain with those using the EBI MUSCLE webserver.

Are there any obvious features of some of the sequences that make them easier/more difficult to align?

Below are examples of manually-annotated alignments for the different datasets
  1. collagen 18s region A
  2. collagen 18s region B
  3. collagen 18s region C
  4. Full-length collagen 18s

"Manual" MSA using JalView


We'll look at making our own multiple sequences alignments "manually" using JalView rather than automatically aligning them with MUSCLE, MAFFT, PRANK etc.

We'll look at two different sets of sequences - one we find easier, the other more difficult, to align manually.

The reason why we find these difficult to align are the same as the reasons why automatic software finds some alignments easier, others more difficult:
Here are some links with information on using JalView



Making an MSA is Easy?!

The act of taking a set of sequences, inputting them to an automatic MSA tool, and obtaining a result is usually trivial.



To see how easy it is to do this, try the demonstration above, trying to carry it out as quickly as possible

Comparing Results from Different Automatic MSA Tools - Short Exercise

Using the collagen sequences from the previous exercise to compare the MSAs built automatically using the following list of different MSA software
You'll hopefully see that:

Comparing Sequence and Structural Alignments


How good are different structural alignment tools compared with purely sequence-based alignments?

We'll make a comparison using the example of two serine protease sequences (shown below) - the two sequences are both taken from a "gold-standard" alignment reference dataset (test set 5ptp_ref1 from BaliBase2)

By examining the UniProt record, I have highlighted in red the active-site residues for 5ptp.

Note that the sequence below describe the proteins found in the PDB (not the UniProt) records - in this case they differ in that the PDB sequence is the same as that in UniProt except for removal of the signal peptide and activation hexapeptide.

>5ptp Bovine Cationic Trypsin TRY1_BOVIN (beta-trypsin form following Lys-23 autocatalytic cleavage)

>1ton Rat Tonin KLK2_RAT

Initial PDB files 5ptp 1ton (chain A in both cases)

We will align


Carry out a similar exercise yourself with the following amylase sequences.

The three active-site residues in both sequences as described by their UniProt records have been labeled in red in the FASTA format sequences provided below.

>1smd AMY1_HUMAN


Here are the PDB structure file 1smd 1bf2

As for the serine proteases, align the above (note that pre-calculated results are provided in case the servers are down - try and get these results without consulting these files, however, unless necessary)
When examining the structural alignments, it may help to label within PyMOL the active site residues for the different chains

Compatibility of Multiple Pairwise Alignments


There are three different pairwise alignments that can be built between these three haemoglobin sequence.

Estimate these using one of these three tools:
Here are the pre-calculated results obtained using EMBOSS's needle program with default settings (which implements the Needleman-Wunsch algorithm):
Try and build an MSA of these sequences (in JalView) so that all the alignments between residues in different sequences described in the three pairwise alignments are present in this MSA.

This exercise should have illustrated that ometimes (often/almost-always if the sequences being compared are very different from each other!) the complete set of all pairwise alignments for a set of sequences will not all be 'compatible' with each other i.e. it is impossible to build a single MSA that contains all of the aligned-column relationships found in all pairwise alignments.

Finding Pre-Calculated Alignments


If you can find a pre-calculated alignment that already contains a set of sequences that is useful for addressing your problem, you may well be able to save lots of time. Using the databases/resources below, we'll look at how you can get access to alignments that you can download yourself and examine locally.

You could try this out using SRC_HUMAN

>SRC_HUMAN|P12931|Proto-oncogene tyrosine-protein kinase Src (EC ENSG00000197122

Alternatively, as some of the alignments will be rather large for this well-studied, well-conserved protein, one could try instead with human Sperm-associated antigen 7 (UniProt:SPAG7_HUMAN).

>sp|O75391|SPAG7_HUMAN Sperm-associated antigen 7 ENSG00000091640


Using these (and perhaps other alignment resources you might find) find a pre-calculated MSA that is will be a good starting place for building an alignment to investigate:

Building Automatic MSAs

Clearly, an important step in obtaining an MSA is the use of automatic MSA software to build an MSA from an inital set of sequences.

Although, as already pointed out, if possible, it is usually a good idea to begin instead with a pre-calculated alignment that already contains many (ideally all!) the sequences needed for your analysis. This can save you a lot of time - not just the time it takes (i) to calculate the alignment, but also the time taken to (ii) select an appropriate set of sequences for the analysis - although the time saved in this way clearly depends on how well-suited the pre-calculated alignment is for your analysis.

Obtaining an Initial Set of Sequences for Automatic Alignment via BLAST

A typical way to build an MSA "from scratch" (i.e. begining with a single sequence of interest, and from there collecting a set of intial sequences from which to build an automatic MSA) is to collect a set of sequences that are evolutionarily related to your initial sequence using BLAST.

Sometimes this is trivial - not always, however, as an initial search may identify:
Sequence features that can cause these kinds of problems include:
Here we will attempt to use BLAST to collect a set of sequences to examine rodent phylogeny.

We will use the mouse cytochrome b protein sequence as a query at the NCBI BLAST server.

As expected, given that this protein is a popular phylogenetic marker, BLAST using default settings do not give us what we need i.e. far too many sequences for us to get an overview of the result.

To find a more appropriate set of sequences, we will try varying several different parameters of the search:
Use BLAST to collect a set of protein sequences related to human p53 protein, with sequences taken from the zebrafish (Danio rerio), mouse (Mus musculus), chicken (Galllus gallus), and african clawed frog (Xenopus laevis). Try to obtain an as-complete-as-possible list of such sequecnes from each organism.

Comparing Automatic MSA Tools

We have included an exercise/demonstration/session on comparing  MSAs calculated by several different alignment tools for several reasons:

Building/Examining MSAs with All Sequences in the Same Order

It may seem trivial - however, anyone who has tried to compare alignments of several similar sequences where the sequences are NOT all in the same order will know how difficult this can make the comparison. Without the same sequence order in all alignments, it can be very difficult to identify regions where the alignments differ.
There are several different ways to obtain obtain alignments with all sequences in the same order. We will demonstrate these using this set of sequences related to human p53

We will align the sequences using four different automatic MSA software packages - in all cases by simply cutting and pasting the unaligned sequences into the webservers provided by the EBI:
These are the alignments obtained using default settings
Note also that CLUSTAL and T-COFFEE have shortened considerably the length of the sequence name/description information provided for each sequence - this could cause trouble if you were comparing alignments in a way which depended on the sequences in different alignments have the same name/description information

Below are the alignments prepared in both .aln and .fasta format, with the order of the sequences determined by running initially MUSCLE over the sequences, and then fixing the order of the sequences when aligning using MAFFT, T_COFFEE, and CLUSTAL
Note that by default, as provided above, CLUSTAL and T-COFFEE provide the alignment in .aln/CLUSTAL format, while MAFFT and MUSCLE have returned it in FASTA format.

We'll now look at three different ways  of changing the order of sequences in the alignments:
  1. Manually adjust sequence order using an MSA editor:
  2. Sort sequence order automatically using the MSA editor JalView
    1. By percentage of pairwise sequence identity (%ID)
    2. Alphabetically (by the first word in the sequence description lines)
  3. (a) Pre-calculate a quick alignment (using MUSCLE) to order the sequences by %ID (b) remove all gaps from this initial alignment (c) calculate MSAs from this ungapped alignment using the settings that deliver the resulting alignments with the sequences in the same order as provided in the input sequence file
To specify whether the order of sequences in the output alignment should be that (i) of the order of sequences in the input file or (ii) the order generated by the alignment software (typically or perhaps even always this clusters together sequences in the alignment that are more similar to each other), look around at the options provided by the submissions pages for the MSA software webservers - from which you should choose either (i) input or (ii) aligned
Align this set of GTP-binding proteins using several of the different automatic MSA tools listed above.

Sort the order of the sequences in the alignments using each of the different approaches described above.

Which of the ways of sorting the sequences makes it easiest to compare the results of the analyses from the different MSA packages? Why to you prefer that method compared to the others?

Comparing Automatic MSA Tool Results to a Reference Alignment

As you'll do in the exercises below, in this demonstration we'll look at what happens when we re-align a reference alignment from BaliBase (1aboA_ref1) using several different pieces of automatic alignment software.
You will be told to take one from the list of BaliBase2 list of reference alignments.
  1. 1idy_ref1
  2. 1ycc_ref4
  3. kinase1_ref4
  4. 1eft_ref5
  5. 1pfc_ref1
  6. 1wit_ref1
  7. 1csp_ref1
  8. 1ldg_ref1
  9. sh3_ref6
  10. 1mrj_ref1
  11. 1led_ref1
Download the RSF format of the alignments of interest.

Load alignment into CLUSTALX locally, remove all gaps from all sequences (Edit->Select all sequences, Edit->Remove all Gaps) and save in FASTA format.

Alignment->Output Format Options->Output Order change to "Input" (this will preserve the order of the sequences in any alignments done by CLUSTALX to be the same as the order in the input alignment - makes it easier to compare the results of a CLUSTALX alignment with the reference alignment)

Re-align the sequences locally using CLUSTALX (Alignment->Do Complete Alignment) - REMEMBER TO CHOOSE A DIFFERENT NAME FOR THE ALIGNMENT OUTFILE!

Compare the automatic alignment created by CLUSTALX with the reference BaliBase alignment - it will help if you re-arrange the (vertical) order of the sequences in the CLUSTALX alignment.

How many columns are mis-aligned in the CLUSTALX alignment i.e. where residues are aligned in the same column in the CLUSTALX alignment which are not aligned in the same column in the BaliBase alignment?  NB - Only consider those columns in the BaliBase alignment which are underlined/in upper case - these are the ones which are assessed as being reliably aligned according to structural comparisons

Do the same for other alignment tools such as:

MSA analyses from Start to Finish

Decision-making is central to building a good/accurate/appropriate MSA. Here are some examples of the kind of issues that require these decisions:
Your reasons for building the alignment - i.e. the biological question(s) you are aiming to address using it - play a very important role in helping you reach decisions in response to questions such as these. Thus, there is not just one way to build an alignment comparing mammalian COX genes - the alignment you build will be different, depending on what you want to use the alignment for.

Thus, going from the beginning to the end of the process of building an MSA only makes sense in the context of a particular biological questions.

The demonstration/exercise shown below aims to do two things
Given that the other focus of this course is phylogeny, we will take evolutionary questions as the basis for this demonstration and exercise.

We'll go through the analysis step by step "online" using the projector.

After each step, we'll give you a chance to carry out the same actions as I've carried out on your own machines.

However, if you haven't time to carry out these analyses yourself, we've tried to provide files at each stage that you can use to move on to the next step, even if you haven't been able to complete the previous one.

Demonstration - DPW Linear Motif Conservation in the Epsins - an MSA Analysis from Start to Finish

The epsins is a family of proteins involved in regulating receptor-mediated endocytosis, which are characterised by the the Epsin N-terminal homology domain PFAM:ENTH;PF01417, e.g. EPN1_HUMAN or EPN1_RAT.

Most epsins contain functional instances of DPW/AP2alpha linear motifs, which mediate interactions with the alpha-subunit of adaptor protein complex 2 (AP-2 - there are two genes coding for alpha subunits in humans, AP2A1 and AP2A2).

We will now investigate the conservation of these motifs within the epsins by building an MSA of the family.

1. BLAST search at the NCBI using default settings using EPN1_RAT (here as FASTA format file) as a query (which we use rather than the human sequence as the 3D structure of one of the DPW motifs in this sequence has been determined).

This gets quite a long list of sequences - to get a more manageable number (too many sequences is difficult to look at, and processing them [e.g. making an MSA] is slower with more sequences), we'll collect
2. Build an MSA of the swissprot sequences using the EBI MUSCLE webserver

3. Load the alignment into JalView and look for instances of the DPW motif (regular-expression pattern 'DP[WF]') using JalView's "Find" feature
4. To fill out the picture of conservation a bit better, include also the high-scorring non-mammalian sequences by either
Here is the MUSCLE alignment obtained from simply aligning all the combined sequences together at once

5. We'll examine the alignment again using JalView, and will also adjust the alignment manually to try and line up mis-aligned DPW motifs (if there are any...)

Exercise - PTS-1 Linear Motif Conservation in the Catalayses - an MSA Analysis from Start to Finish

Carry out a similar exercise yourself, this time looking at the conservation of PTS-1 peroxisiomal targeting linear motifs in catalases.

1. Search the NCBI nr database using BLAST with the human catalase (UniProt:P04040;CATA_HUMAN) as a query
Use databases of different sizes, and try filtering by E-value threshold and taxonomic group to obtain a set of sequences - at the end of this exercise, I've put together some sets of sequences you might have obtained in this way, but before using them, try adjusting the BLAST searches yourself to obtain a set of sequences which is
The "Taxonomy Report" from the BLAST output page might be useful when planning your search strategy

2. Automatically align the sequences using the EBI MUSCLE webserver

3. Load the sequences into JalView and search for matches to the PTS-1 motif
4. Identify some of the taxonomic groups whose catalase sequences do not contain this motif - can you propose some reasons why this might be the case?
BLAST searches
The "Taxonomy Report" from an initial BLAST search (with default settings, which collects only 100 alignemnts) shows hits with high scores only in the animals/metazoans.

To see if there are related sequences in non-animals, I repeated the search using an initial query that filtered out all metazoan sequences, and collected a larger number than the default 100 alignments

This shows the protein to be present in many eukaryotes and prokaryotes (unusual) - and a search restricted to the archaea shows significant hits also in that domain of life - thus we can look to survey the presence/absence of this motif from all three domains of life.

It help to label the sequences so that we can tell which taxonomic group they're from - so I retrieve several different sets of sequences into separate files (one taxonomic group per file), as this makes it easier to re-name them according taxonomy


Demonstration - COX1/COX2 Phylogeny - an MSA Analysis from Start to Finish

We want to make inferences about the phylogeny of the prostaglandin G/H synthases (COX1/COX2) gene family in mammals. In particular, we would like to estimate when the duplication event occurred that yielded these two members of the family in humans. Perhaps we are interested in this question due to reading this paper (here's a static version of the Entrez PubMed Abstract for the paper in case the NCBI pages are down) where we read about  there being two COX genes in vertebrates/mammals and reference to an "invertebrate precursor" - when did this duplication occur?

This is a typical evolutionary question - using what we see today - (e.g contemporary human COX1/COX2 protein sequences), we try to infer  when (and if possible by what mechanism) the differences we see today have occurred.
Identifying Our Initial Sequence of Interest
To begin, we need to find the sequence of at least one of our proteins of interest

A good place to start is searching the UniProt/SwissProt database, which provides high-quality, partially manually-curated sequence and functional annotation data for proteins (as content has relatively high accuracy, and is relatively small, making it faster to search) Notice how we need to use our biological knowledge to aid our bioinformatic searching - generally, the more we know about the biology of our sequences, the better we can judge the results of such searches, and the more interesting questions we can pose using our bioinformatic tools

This task (identifying the sequence/record associated with a given gene/protein/biological entity) is one of the main applications of bioinformatic tools (the other is to predict function/structure using comparison to similar genes/proteins/entities with demonstrated function/structure)

This was a relatively easy example - however, if we wanted to find additional information about the cox1 sequence e.g. 3D structure, known protein-protein or protein-chemical interactions etc., we would have to carry out similar kinds of searches of different databases (or, if we're lucky, use cross-references from the UniProt record) All this can take quite a lot of time and effort.

REFLECT Exercise
A short-cut that works in many circumstances is to use the automatic entity-recognition and biological-database integration software REFLECT
This is what happens when we REFLECT our abstract of interest
Anyway - here are the UniProt records of the two human sequences COX1 and COX2
Note that these are NOT related to the mitochondrial protein COX1_HUMAN - these mitochondrial proteins were one of the reasons why our initial search of the UniProt database hit more than one record.
Identifying an Initial Set of Proteins Related to Our Sequence of Interest
We now need to identify a set of proteins that have the same structure/are related to our sequence of interest

We usually do this by beginning either with
Note that we've already mentioned that different biological questions may well be best addressed by different MSAs. Thus the alignments/sets of sequences we obtain from such resources are unlikely to be the best possible alignments to address our own specific biological question. However, they often provide a good starting place for such an analysis - by carrying out modifications such as those described below, they can be adjusted to suit our specific question
Identifying pre-aligned sequences related to COX-1/PTGS1
For example, querying TreeFam with the UniProt accession number for COX-1/PTGS (P23219) takes us to the TreeFam record for this family (TF329675) - whose full alignment you can find here.
Identifying Related Sequences using BLAST
To build the MSA as far as possible "from scratch", we'll begin by looking at how we'd use a BLAST search for this.

We begin by choosing
If this search gives us a good starting point for our analysis, then no need to change the way we did the search.

If we hit many sequences, then downloading them all and building an MSA will take longer than necessary.

To change the search so that we hit fewer sequences we could:
If we find very few/no sequences, then we may well not have enough sequences to build an MSA appropriate for addressing our question of interest.

To carry out a search to identify more sequences we could:
We examine the results of the BLAST search
We will have around 40 sequences if we collect both of these two groups of sequences - this is a reasonable number to manipulate, so we will take all of them.

We can easily get both GenPept and FASTA format records (and save them locally) for these sequences by re-formating the BLAST output to show only those with low-enough E-values
Building an Initial MSA
We build an initial MSA from these sequences using MUSCLE at the EBI
We examine the alignment in CLUSTALX
Collecting Additional Sequences of Interest
Perhaps the duplication occurred within the mammals - but we'd be interested to supplement these sequences with some from other chordates, so we'll query a larger database, restricting our search to non-mammalian chordates.
Combing Sequence Sets in one Alignment
It's probably easiest just to bundle all the sequences we have collected so far into one file and to submit this to MUSCLE again
In this case, we did not spend any time/effort creating a high-quality initial alignment e.g. by correcting mis-alignments, excluding unneeded sequences etc.

Sometimes, however, we want to incorporate additional sequences into an alignment that we have previously edited and manipulated - in which case we will not want to simply combine all (old and new) sequences together and completely re-align them.

Instead, we want to align the new sequences to the initial alignment while preserving the relationships of the sequences in the initial alignment to each other.

We can do this using CLUSTALX (in "profile alignment" mode.)

We can also do this using t_coffee (but only from the command-line)
MUSCLE also allows you to align two alignments with each other (known as profile-profile alignment), although not a set of sequences individually to a pre-existing alignment - this functionality is also only available from the command line
Adjusting Sequence Identifiers
As indicated, to make decisions about which sequences to exclude, we need to always consider the biological questions we are trying to address with our alignment.

In this case, we are interested in the timing of the duplication event leading to the two mammalian COX genes - thus we need to change the sequence identifiers so that we can:
When choosing new identifiers for the sequences we need to take into account that:
Here is a copy of the MUSCLE alignment created above with the identifiers altered - although note that the policy described above has not been carefully followed closely.
Preparing a "Final" Alignment
Almost always you will want to exclude some of the sequences from such an alignment. This might be for several different reasons:
Here is a copy of the above MUSCLE alignment, with altered sequence identifiers, trimmed of many sequences due to reasons described above.
Estimating a Phylogeny from the Alignment
Before estimating the phylogeny, we need to take into account that phylogenetic applications of MSAs are known to be strongly influenced by the inclusion of columns in the analysis where residues in the column are not related by simple substitution events.

Therefore, we use the GBLOCKS server to extract those regions of the alignment that are  considered most likely to contain only columns where all residues are related via substitution events.

Here is the HTML page showing which regions GBLOCKS kept and discarded from the alignment
Here is the FASTA format GBLOCKed alignment

Finally, we use a very quick and innaccurate method to estimate phylogeny from this alignment - neighbour-joining as implemented by CLUSTALX - click here for a tree estimated in this way

(We will look at a more accurate approach in tomorrow's session which focuses on phylogenies)

Below is an image of the tree estimated in this way using NJplot

Interestingly, there seem to be two copies of the gene also in the urochodate Ciona intestalis and the cephalochordate Branchiostoma floridae - however, the duplications appear to have occurred independently of each other, and of the duplication that occurred to yield the two vertebrate copies of the gene.

Exercise - Mammalian CDKs 1/2/3 Phylogeny - an MSA Analysis from Start to Finish

Now you have a chance to carry out an similar analysis, this time on your own.

This time, suppose you are interested in a similar question, only this time you want to investigate the timing of the duplications that yielded the three mammalian CDK genes CDK1, CDK2, and CDK3 (for example, see this abstract [Santamaria et al.2007] which mentions all three of the proteins).

  1. Identify the UniProt record(s) associated with at least one of these genes
  2. Use BLAST to identify an appropriate set of proteins to build an alignment from, and download these sequences to a local FASTA-format file
  3. Use MUSCLE to build a multiple sequence alignment of the sequences
  4. Edit the names of the sequences so that you can tell which organisms each sequence comes from (while keeping the sequence identifiers unique)
  5. Examine the alignment in CLUSTALX (or Jalview)
  6. Remove poorly-aligned regions from the alignment using the GBLOCKS server
  7. Quickly estimate a phylogeny from the GBLOCKed alignment, and visualise it using NJplot

Based on this tree, when do you think the three lineages diverged from each other? (i.e. when do you think their ancestral genes underwent duplications?)

If you have time, repeat a similar analysis as described above, this time beginning with the appropriate TreeFam family as the initial source of sequences
(If you can't find the appropriate TreeFam record, here's the link to it).

Demonstration - Phylogeny From Start to Finish with

Another example of an application of MSAs in phylogeny estimation. With this demonstration, the focus is on:
The Question
We'll try and identify the lamprey sequence (lampreys in wikipedia, UCMP, Tree of Life web project) most closely related to human UniProt link (UniProt:O00622;CYR61_HUMAN, here in FASTA format).
Acquiring initial set of sequences
We will begin with either a set of pre-aligned sequences from a database or unaligned related sequences using BLAST
Pre-aligned sequences from a database
We'll follow the link from the UniProt record for human CYR61 to the HOGENOM database to get our hands on an initial alignment
Unaligned sequences using BLAST
We'll search with the FASTA sequence of CYR61 (here is a local version of the file in case there are problems with UniProt) at the NCBI BLAST webserver
This is a small enough set to begin building an MSA from.

Notice how knowledge of the biology of the system (in this case, the taxonomy/phylogeny of my organisms of interest) drives our  decision making during the analysis.
Building an automatic MSA
Using the MUSCLE webserver at the EBI
Refine the MSA
Working with the initial MSA, we usually want to remove some of the sequences (as their unnecessary or would even be problematic for the downstream analysis), and add sequences that are not currently in the alignment.

We might also want to correct possible errors in the alignment, and/or realign the sequences once some have been removed.
Remove unnecessary sequences
To generally reduce the redundancy of the dataset i.e. to avoid having too many similar sequences, we can use an automatic tool:
Even if we use such tools as a start to the analysis, we typically also remove additional sequences "by hand"
Add missing sequences
There are (well, there were on 5.5.2011) no lamprey sequences in our alignment at the moment!

Obviously we need some to address our biological question of interest.

This link describes how we could go about acquring some
- however, to save time, rather than describing how we get them, here is a direct link to an appropriate set of lamprey sequences

We can build an MSA that includes these additional sequences either by:
  1. Realigning all sequences (initial sequences plus additional sequences) from scratch
  2. Aligning the additional sequences to the initial alignment.
Reduce the MSA
When using an MSA to estimate phylogeny and/or analyse evolutionary substitution processes, we usually need to remove columns from some of the columns from the alignment before estimating these evolutoinary paramters from our data.
We can do this:
We have to rename the sequences so that they could be accepted by GBLOCKS to give this file here

Here is the result file from GBLOCKS
Estimating the phylogeny
We can do this using:
Examining the phylogeny
We can examine the resulting tree files using software such as


Carry out a similar analysis, based on the sequences given here but addressing the question which human sequences in the shark sequences AAL37727 most closely related to.

Patterns of Conservation in Protein Alignments

Examine the protein structures (which you might obtain from the PDBe or the PDB) using PyMOL

Identify the regions of protein sequence within the alignment that correspond to secondary structure elements (helices, strands, turns) within the protein structure.

The aim is to build/obtain an alignment which highlights patterns of conservation/variation in different kinds of structural element.

You might need to collect and incorporate additional sequences in your alignment and/or remove some sequences from the alignment to make these patterns as clear as possible.

If possible, change the colouring of the PyMOL structure, and save a session file and image of the structure that highlights these patterns.

Example starting points for your analysis:
While this application is perhaps of limited use in your own research, it should serve to highlight, again, the extent to which an awareness of the specific application for which you are building an MSA governs the decisions you make when preparing the alignment.

Conserved Hydrophobic Core

Just to remind you - here we can see clearly that the non-polar residues can be almost completely excluded from the core of a globular protein.

Visualising and Editing Alignments

The aim of this exercise is to give you some practice using alignment viewers/editors to (i) identify and (ii) correct mis-aligned regions.

Download the alignments in the left column of the table below, all of which contain regions that are misaligned (in comparison with a set of reference alignments from BaliBase).

The second column links to an image of the alignment, with a box surrounding a region that contains a misalignment.

Using Jalview and CLUSTALX, try to identify and correct the misalignment.

To see whether you were successful, compare your edited alignments with the reference alignments (provided in the third column in FASTA format, in the fourth as HTML, which highlights the regions BaliBase is confident are correctly aligned). The fifth column links to an image highlighting more closely the region that is misaligned, along with an indication of how the region might be corrected. The final column links to the BaliBase page for the alignment.

Alignment with Misalignment
Misaligned Region Highlighted
Reference BaliBase Alignment (FASTA)
Reference BaliBase Alignment (HTML - Core Regions Highlighted)
Edit Highlighted
BaliBase Page
BB12003.ref.fasta BB12003.html BB12003.edit.jpg
BB50013.mis.fasta BB50013.mis.jpg BB50013.ref.fasta BB50013.html BB50013.edit.jpg BB50013-BaliBase

Back To Common Course Content page
Copyright Aidan Budd, 2009 (c)