Multiple Sequence Alignment (MSA) - Applications
Introductory Notes
The main aim of this practical is to demonstrate to you, firstly, that
MSAs can do useful things and, secondly, that by taking the time to
make your MSA as good as possible you can significantly improve the
quality of your analysis.
In the presentations we looked at two different applications of MSAs:
- secondary structure prediction
- phylogenetic analysis
In this exercise we are going to explore the use of MSAs for these
different purposes, with an emphasis on demonstrating the influence of
MSA quality on the results of these analyses.
Secondary Structure Prediction
We will begin by submitting several different alignments for secondary
structure prediction via JPRED.
The alignments are all of the same set of sequences and cover a PH
domain.
Note: the first sequence in your alignment is the one that JPRED uses
to project its secondary structure prediction against. Therefore, you
may find it easier to make sure that the same sequence is always at the
top (i.e. is the first sequence) of the alignments. I would also
suggest making this sequence H_Sec7, the human ortholog of the mouse
sequence whose structure has been solved.
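If you prefer to reorder the sequences outside CLUSTALX, the idea can be sketched in a few lines of Python. This is only a minimal illustration: it assumes a plain CLUSTAL-format file, the helper names parse_clustal and move_to_top are made up for this example, and seqA is a hypothetical sequence name.

```python
def parse_clustal(text):
    """Parse CLUSTAL-format text into an ordered list of (name, sequence)."""
    order, seqs = [], {}
    for line in text.splitlines()[1:]:      # skip the "CLUSTAL ..." header line
        if not line.strip() or line[0].isspace():
            continue                        # blank line or conservation line
        fields = line.split()
        name, chunk = fields[0], fields[1]  # any trailing residue count is ignored
        if name not in seqs:
            order.append(name)
            seqs[name] = ""
        seqs[name] += chunk
    return [(name, seqs[name]) for name in order]


def move_to_top(records, name):
    """Return the records with the named sequence moved to the first position."""
    chosen = [rec for rec in records if rec[0] == name]
    if not chosen:
        raise ValueError("sequence %r not found in alignment" % name)
    return chosen + [rec for rec in records if rec[0] != name]


# Tiny made-up alignment in two blocks; seqA is a hypothetical name.
example = """CLUSTAL W (1.83) multiple sequence alignment

seqA        MKT-A
H_Sec7      MKTLA
            ***.*

seqA        GG
H_Sec7      GA
            *.
"""

records = move_to_top(parse_clustal(example), "H_Sec7")
```

After this, records[0] is the H_Sec7 entry, so it would be the sequence that the prediction is projected against.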
For each of the three alignments provided in CLUSTAL (*.aln) format
below, load the alignment into CLUSTALX, and save the alignment in MSF
format (the format read by JPRED).
Alignment 1
Hand-edited alignment of PH domains from a range of different organisms
Alignment 2
Same set of sequences as above, but aligned automatically by CLUSTAL
Alignment 3
Same set of sequences, but aligned automatically by MAFFT
Submit each of the alignments to JPRED and compare the results of the
alignments. To do this I suggest you follow the link to the simple
representation of the JPRED prediction, and cut and paste the sequence
plus prediction into a text editor, e.g. xemacs, switching off the
line-wrapping function.
You will also want to compare the results to that of the "true"
structure for a sequence in this alignment (as mentioned, this is of
the close relative of the H_Sec7 protein found in mouse). To do this,
download this PDB structure file (1U27) and examine the protein in 3D
using the PyMOL software.
Type "pymol" at the shell prompt and
load the PDB file via the "File" menu. Switch on the "Display->Show
Sequence" option to view your selection in the context of the primary
sequence. You may find it useful to issue the following commands in the
text-box at the top of the window: "hide everything" followed by "show
cartoon, chain A". On the right side of the window select the "c" box,
and colour the structure by secondary structure, to make the helices
and strands stand out better.
This exercise provides the opportunity to interpret the alignment
within a context closer to the function/structure/biology of the
proteins, and in my opinion it is good to spend some time browsing the
comparison of the structure and the alignment. However, you may find it
easier to make the comparison of prediction to structure using the
simple one-dimensional projection of secondary structure onto the
primary sequence. For that, view this file of the secondary structure
for 1U27.
Q Which of the alignments provides a secondary structure prediction
that is closest to that observed in the solved structure?
To answer this question you will need to consider how you judge the
quality of the prediction. Is it the number of residues in the
alignment whose prediction agrees with the "true" secondary structure?
A comparison of the number of secondary structure elements predicted
(e.g. 6 strands predicted, 6 observed)? Some other measure?
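Both of the measures suggested above are easy to compute once the prediction and the observed structure are written as strings of equal length. The sketch below assumes a three-state alphabet (H for helix, E for strand, - for everything else), which is an assumption about how you have prepared the two strings. The function names and the example strings are made up for illustration.

```python
from itertools import groupby


def residue_agreement(predicted, observed):
    """Fraction of positions where predicted and observed states are identical."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty strings of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(predicted)


def count_elements(states):
    """Count separate helices (runs of H) and strands (runs of E)."""
    counts = {"H": 0, "E": 0}
    for state, _run in groupby(states):
        if state in counts:
            counts[state] += 1
    return counts


# Made-up strings, not taken from JPRED or from 1U27.
predicted = "--EEEE--HHHH--"
observed = "--EEE---HHHHH-"

agreement = residue_agreement(predicted, observed)
elements_predicted = count_elements(predicted)
elements_observed = count_elements(observed)
```

Note that the two measures can disagree: a prediction can find every element but misplace its ends, or score well per residue while merging two strands into one.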
Q Does CLUSTAL or MAFFT give the
better alignment (ignoring the quality of the secondary structure
prediction)? Does this correlate with the respective quality of the
secondary structure prediction?
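One possible way to put a number on "better alignment" is to treat the hand-edited alignment as a reference and ask what fraction of its aligned residue pairs an automatic alignment reproduces (a sum-of-pairs style score). Treating the hand-edited alignment as the reference is an assumption made for this sketch, and the function names and toy data are hypothetical.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Set of residue pairs placed in the same column.

    alignment maps sequence name -> gapped sequence (all of equal length).
    Residues are identified by (name, index in the ungapped sequence).
    """
    names = sorted(alignment)
    length = len(alignment[names[0]])
    next_index = {name: 0 for name in names}
    pairs = set()
    for col in range(length):
        present = []
        for name in names:
            if alignment[name][col] not in "-.":   # "." is the MSF gap symbol
                present.append((name, next_index[name]))
                next_index[name] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs that are also aligned in test."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        raise ValueError("reference alignment has no aligned residue pairs")
    return len(aligned_pairs(test) & ref_pairs) / len(ref_pairs)


# Toy data: the same two sequences aligned in two different ways.
reference = {"a": "AC-GT", "b": "ACTGT"}
automatic = {"a": "ACG-T", "b": "ACTGT"}
score = sum_of_pairs_score(automatic, reference)
```

Both alignments must contain the same ungapped sequences under the same names for the comparison to be meaningful.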
Phylogenetic Analysis
In this section, we will run simple phylogenetic analyses using a range
of different alignments and datasets.
Note that we will use, exclusively, trees calculated using the
Neighbor-Joining (NJ) algorithm, as implemented in the CLUSTALX package.
We are using this method because it is embedded in the CLUSTAL
software, making it very easy to obtain a phylogenetic estimate using
this method. This is sufficient for our investigation into the
influence of alignment on phylogeny.
HOWEVER, if you are interested in conducting a phylogenetic analysis
as part of your own research, you should not use this approach (i.e. NJ
as implemented by CLUSTAL): more accurate approaches are available and
would be recommended. Some of these, e.g. maximum likelihood (ML)
approaches, are very slow in comparison to NJ, although recent ML
implementations have sped things up considerably. Bayesian analysis of
phylogeny, e.g. as implemented by the MrBayes software, provides, in
general, more accurate analysis than NJ and is considerably faster than
ML. Note that the problem with the NJ approach implemented in CLUSTAL
is not just that the algorithm used is NJ, but also the way in which
the distance matrix (that the NJ algorithm acts on) is calculated.
We will begin by subjecting three different alignments of the same set
of sequences to phylogenetic analysis.
The sequences were all taken from the TreeFam
database, and correspond to TreeFam family TF105782.
These three alignments were generated automatically from the set of
sequences downloaded from TreeFam, using three different programmes
(CLUSTAL, MAFFT, and PROBCONS).
TF105782 - CLUSTAL alignment
TF105782 - PROBCONS alignment
TF105782 - MAFFT alignment
Load each of these alignments into CLUSTALX and switch on the two
options "Exclude Positions with Gaps" and "Correct for Multiple
Substitutions" under the Tree menu.
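To get a feel for what these two options do, here is a minimal sketch of the underlying calculations. It assumes protein sequences and uses Kimura's empirical correction, K = -ln(1 - D - D*D/5), which is, to my knowledge, the formula CLUSTAL applies to protein distances. Treat that as an assumption and check the CLUSTAL documentation for the exact behaviour (e.g. for very divergent sequences). The function names are made up for this example.

```python
import math


def strip_gapped_columns(seqs):
    """Remove every alignment column in which at least one sequence has a gap."""
    kept = [col for col in zip(*seqs) if "-" not in col]
    if not kept:
        return ["" for _ in seqs]
    return ["".join(chars) for chars in zip(*kept)]


def p_distance(a, b):
    """Observed fraction of differing positions between two aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def kimura_protein_distance(d):
    """Kimura's correction for multiple substitutions: -ln(1 - d - d*d/5)."""
    inner = 1.0 - d - 0.2 * d * d
    if inner <= 0.0:
        raise ValueError("sequences too divergent for this correction")
    return -math.log(inner)


# Toy alignment: only columns 1, 4 and 5 are free of gaps.
gap_free = strip_gapped_columns(["AC-GT", "ACTGT", "A-TGA"])
observed = p_distance(gap_free[0], gap_free[2])
corrected = kimura_protein_distance(observed)
```

The corrected distance is always at least as large as the observed one, reflecting substitutions that have been hidden by later changes at the same site.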
Also from the Tree menu, choose "Bootstrap N-J Tree", which you
should run using the default parameters.
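The bootstrap works by building many pseudo-alignments, each made by sampling the columns of the original alignment with replacement, and estimating a tree from each one. A minimal sketch of the resampling step is shown below. The function name is hypothetical and this is not CLUSTAL's own code.

```python
import random


def bootstrap_columns(seqs, rng):
    """Build one bootstrap replicate by sampling columns with replacement."""
    length = len(seqs[0])
    chosen = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[col] for col in chosen) for seq in seqs]


# Toy alignment; a seeded generator makes the replicate reproducible.
alignment = ["ACDE", "ACDF", "GCDE"]
replicate = bootstrap_columns(alignment, random.Random(1))
```

The bootstrap value on a branch is then the number (or fraction) of replicate trees in which that branch appears.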
Open each of these three trees using NJPLOT and compare the trees.
- Root them each in a similar position to make comparison between
trees easier (choose the "New outgroup" button, and then click on the
"#" sign where you would like the tree re-rooted. Then click the
"Full tree" button again).
- Also rotate the trees around internal branches by pressing the
"Swap nodes" button and then selecting a branch by clicking on a "#"
sign, until they look as similar as possible.
Q Which of these trees do you feel is the best? Are there some
branches that are obviously misplaced?
To address this you need to:
- Have a rough idea of the commonly-accepted eukaryotic (and
particularly the vertebrate) phylogeny. This gives you something to
compare the observed results against, to allow you to judge the
accuracy of the results: are the estimated phylogenies very similar to
the commonly-accepted phylogenies? If so, then the method is considered
to be accurate.
- Know the scientific names of commonly occurring (in terms of
representation within the sequence databases) eukaryotes. Many
phylogenetic software packages have constraints on the length of the
names/descriptions that can be given to the sequences they are
analysing. Therefore, usually some form of short-hand is used to
describe the sequences used in the analyses. The approach taken by
SwissProt is to indicate the organism name using the first three
letters of the genus name, followed by the first two of the species
name e.g. Homo sapiens becomes HOMSA.
- Decide how to judge how similar two trees are to each other (a
non-trivial issue - perhaps you could use "the number of interior
branches that have the same organisms on either side"? Or some other
measure?)
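The "same organisms on either side of an interior branch" idea from the last point can be made concrete by listing, for each tree, the way every interior branch splits the set of sequences in two, and then counting the splits the two trees share (this is closely related to the Robinson-Foulds distance). The sketch below assumes trees written by hand as nested Python tuples. The function names and all organism codes other than HOMSA are hypothetical examples.

```python
def leaf_set(tree):
    """All leaf names below a node; a leaf is a string, a clade is a tuple."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def interior_splits(tree):
    """Splits induced by interior branches, ignoring where the tree is rooted.

    Each split is stored as the side that does NOT contain the alphabetically
    first leaf, so the same branch is described identically in every tree.
    """
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    found = set()

    def walk(node):
        if isinstance(node, str):
            return
        side = leaf_set(node)
        other = all_leaves - side
        if len(side) > 1 and len(other) > 1:    # skip branches leading to one leaf
            found.add(other if anchor in side else side)
        for child in node:
            walk(child)

    walk(tree)
    return found


def shared_splits(tree_a, tree_b):
    """Number of interior branches with the same leaves on either side."""
    return len(interior_splits(tree_a) & interior_splits(tree_b))


tree_1 = ((("HOMSA", "MUSMU"), "GALGA"), ("XENTR", "DANRE"))
tree_2 = ((("HOMSA", "GALGA"), "MUSMU"), ("XENTR", "DANRE"))
common = shared_splits(tree_1, tree_2)
```

Here the two trees agree on the branch separating XENTR and DANRE from the rest, but disagree on which sequence groups with HOMSA.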
To help you with these issues, look at the following webpages, which
provide you with some notes on these issues
Q Looking at the alignments, are there any obvious adjustments that
might be made to them that could improve them?
For example, you might want to remove some of the sequences that are
obviously mis-aligned or are fragments (i.e. are incomplete).
If you can spot such possible changes, try making these adjustments,
estimating trees again from the adjusted alignments, and looking to see
whether the estimated phylogenies have improved.
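Fragments can often be spotted automatically because they contain far fewer residues than the other sequences. The following sketch flags them by comparing each ungapped length with the median. The 50% cut-off and the function name are arbitrary choices made for this illustration, not recommended values.

```python
from statistics import median


def drop_fragments(records, min_fraction=0.5):
    """Keep (name, gapped_sequence) records whose ungapped length is at least
    min_fraction of the median ungapped length."""
    lengths = [len(seq.replace("-", "")) for _name, seq in records]
    cutoff = min_fraction * median(lengths)
    return [rec for rec, n in zip(records, lengths) if n >= cutoff]


# Toy alignment: sequence c has only two residues and is treated as a fragment.
records = [
    ("a", "ACDEFGHIKL"),
    ("b", "ACDEFGHI-L"),
    ("c", "----FG----"),
]
kept = drop_fragments(records)
```

After removing sequences it is usually worth realigning the remaining ones, since the alignment was built with the fragments present.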
Whether or not you feel that there is an obvious "winner" amongst the
trees, the exercise should have demonstrated that alignment surely has
a significant influence on the estimated phylogenies.
To demonstrate that this is not just a one-off example, try out a
similar exercise on the sets of sequences in the list below. For each
set of sequences, align them using several different alignment software
packages, and estimate trees from the resulting alignments using
CLUSTALX.
Each of these sets of sequences is taken from a different TreeFam
record.
Q Based on your analysis of these families, can you spot any
characteristics of alignments that make them yield good or bad
phylogenetic estimates?
Extra Exercises
Here are some additional exercises for you to try out if you get
through the ones above with time to spare (don't go through them in
order - just pick out those that sound more interesting to you).
1) Repeat the exercise evaluating the quality of secondary structure
prediction using one or more of the domains below. To obtain the
CLUSTAL and MAFFT alignments, simply load the hand-edited file into
CLUSTALX, select all sequences, remove all gaps, and then save the
sequences into a new file. Load this file into CLUSTAL and MAFFT, and
create new, automatic, alignments of the sequences:
2) Use alignments for the PH domain created by MUSCLE - is this better
than CLUSTAL or MAFFT? (We don't use PROBCONS as it would be too slow).
3) Intuitively, one would assume that a certain amount of substitution
information is required to make a good secondary structure prediction -
explore this by repeating the secondary structure prediction using
alignments that contain fewer and fewer sequences. Is the number of
sequences required to make a reasonable prediction relatively high or
low in your opinion? Is the number of sequences required similar or
different for different domains?
4) Repeat the exercise of evaluating the quality of secondary structure
prediction of the PH domain, this time using the PredictProtein server - this
software tends to give very accurate prediction of secondary structure,
but can be somewhat challenging to operate. In particular you need to
be careful about the format of the alignment you submit (I leave it to
you to discover the issues involved - as you are probably aware,
discovering finicky sequence format constraints is part and parcel of
the job of a bioinformatics user, and this may be a good example of
that task). Make notes on how you got it working - perhaps we will have
time to present some of these at the end of the course.
5) If that exercise has whetted your appetite (and, yes, it
is "whet" not "wet"... :-) ) for exploring secondary
structure prediction, try out some of the other servers
listed by ExPASy and repeat exercise (4) with these servers.
6) If you want to try using some different phylogenetic analyses, have
a go at submitting some of the alignments above to some different
phylogenetic methods - for example PHYML. Note that you will need
to process these alignments, at least to the extent that you can
prepare alignments with no gaps in (look into using GBLOCKS for an
automatic way of doing this, or use SEAVIEW to remove columns). Other
ideas for phylogenetic software you could try and use can be found on
Joe Felsenstein's list of phylogeny and evolution-related software.
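For exercise 1, the step of removing all gaps from a hand-edited alignment can also be done outside CLUSTALX. The sketch below turns aligned records into unaligned FASTA text, ready to be given to an alignment program. The function name and the example records are hypothetical (only H_Sec7 is a name used in this practical).

```python
def degap_to_fasta(records, width=60):
    """Strip gap symbols ("-" or ".") and return the sequences as FASTA text."""
    lines = []
    for name, seq in records:
        raw = seq.replace("-", "").replace(".", "")
        lines.append(">" + name)
        lines.extend(raw[i:i + width] for i in range(0, len(raw), width))
    return "\n".join(lines) + "\n"


# Made-up aligned records; seqB is a hypothetical name.
fasta_text = degap_to_fasta([("H_Sec7", "MK-TL."), ("seqB", "--MKT")], width=4)
```

The resulting text can be saved to a file and loaded into CLUSTALX or passed to MAFFT to create a fresh automatic alignment.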
This set of exercises should have:
- given you a very basic introduction into using MSAs to predict
secondary structure and to estimate phylogenies
- demonstrated that alignment quality can have a clear effect on
the quality of the analyses that operate on the MSAs
Note that throughout these exercises the following formatting is used
to specify different types of text:
Bold non-italic text like this gives you instructions about tasks you
should carry out, e.g. "View the following webpage".
Italic text specifies questions for you to answer.