Multiple Sequence Alignment (MSA) - Applications


Introductory Notes

The main aim of this practical is to demonstrate to you that - firstly - MSAs can do useful things and - secondly - that by taking the time to make your MSA as good as possible you can significantly improve the quality of your analysis.

In the presentations we looked at two different applications of MSAs: secondary structure prediction and phylogenetic analysis.
In this exercise we are going to explore the use of MSAs for these different purposes, with an emphasis on demonstrating the influence of MSA quality on the results of these analyses.

Secondary Structure Prediction

We will begin by submitting several different alignments for secondary structure prediction via JPRED. The alignments are all of the same set of sequences and cover a PH domain.

Note: the first sequence in your alignment is the one that JPRED uses to project its secondary structure prediction against. Therefore, you may find it easier to make sure that the same sequence is always at the top (i.e. is the first sequence) of the alignments. I would also suggest making this sequence H_Sec7, the human ortholog of the mouse sequence whose structure has been solved.
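If you would rather reorder the sequences programmatically than by hand, the idea can be sketched as below. This is only an illustration: it assumes the alignment has already been read into a list of (name, sequence) pairs, the function name move_to_top is my own invention, and the toy sequences are made up (only the name H_Sec7 comes from this exercise).

```python
def move_to_top(records, reference_name):
    """Return a new list with the record named `reference_name` first.

    `records` is a list of (name, aligned_sequence) tuples; the relative
    order of all other records is preserved.
    """
    reference = [r for r in records if r[0] == reference_name]
    if len(reference) != 1:
        raise ValueError(f"expected exactly one record named {reference_name!r}")
    others = [r for r in records if r[0] != reference_name]
    return reference + others


# Toy alignment; the sequences are invented for illustration only.
alignment = [
    ("seq_A", "MK-LV"),
    ("H_Sec7", "MKQLV"),
    ("seq_B", "MR-LI"),
]
reordered = move_to_top(alignment, "H_Sec7")
print([name for name, _ in reordered])
```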

For each of the three alignments provided in CLUSTAL (*.aln) format below, load the alignment into CLUSTALX, and save the alignment in MSF format (the format read by JPRED).

Alignment 1

Hand-edited alignment of PH domains from a range of different organisms

Alignment 2

Same set of sequences as above, but aligned automatically by CLUSTAL

Alignment 3

Same set of sequences, but aligned automatically by MAFFT

Submit each of the alignments to JPRED and compare the resulting predictions. To do this I suggest you follow the link to the simple representation of the JPRED prediction, and cut and paste the sequence plus prediction into a text editor (e.g. xemacs), switching off the line-wrapping function.

You will also want to compare the results to the "true" structure for a sequence in this alignment (as mentioned, this is the structure of the mouse protein that is a close relative of H_Sec7). To do this, download the PDB structure file 1U27 and examine the protein in 3D using the PyMOL software.

Type "pymol" at the shell prompt and load the PDB file via the "File" menu. Switch on the "Display->Show Sequence" option to view your selection in the context of the primary sequence. You may find it useful to issue the following commands in the text-box at the top of the window: "hide everything" followed by "show cartoon, chain A". On the right side of the window select the "c" box, and colour the structure by secondary structure, to make the helices and strands stand out better.

This exercise provides the opportunity to interpret the alignment within a context closer to the function/structure/biology of the proteins, and in my opinion it is good to spend some time browsing the comparison of the structure and the alignment. However you may find it easier to make the comparison of prediction to structure using the simple one-dimensional projection of secondary structure onto the primary sequence - for that view this file of the secondary structure for 1U27.

Q Which of the alignments provides a secondary structure prediction that is closest to that observed in the solved structure?
To answer this question you will need to consider how you judge the quality of the prediction - number of residues in alignment whose prediction agrees with "true" secondary structure? Comparison of number of secondary structure elements predicted (e.g. 6 sheets predicted - 6 observed)? Some other measure?
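As a starting point, the two measures suggested above could be computed along the following lines. This is a minimal sketch, not a standard benchmark: it assumes you have reduced both the prediction and the observed structure to strings of equal length with one character per residue ('H' for helix, 'E' for strand, '-' for everything else), and the two strings shown are invented, not the real JPRED output or the 1U27 assignment.

```python
from itertools import groupby


def q3_agreement(predicted, observed):
    """Fraction of positions where predicted and observed states agree.

    Positions are compared one-to-one, so the two strings must already
    refer to the same residues in the same order.
    """
    if len(predicted) != len(observed):
        raise ValueError("strings must have the same length")
    if not predicted:
        raise ValueError("strings must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(predicted)


def count_elements(states, kind):
    """Count contiguous runs of `kind` (e.g. 'E') in a state string."""
    return sum(1 for state, _ in groupby(states) if state == kind)


# Invented toy strings for illustration only.
observed = "--EEEE--HHHH--EEE-"
predicted = "--EEE---HHHHH-EE--"

print(f"per-residue agreement: {q3_agreement(predicted, observed):.2f}")
print("strands observed/predicted:",
      count_elements(observed, "E"), count_elements(predicted, "E"))
```

Note that the two measures can disagree: a prediction can find every element but misplace its ends, or score well per residue while missing a short strand entirely.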

Q Does CLUSTAL or MAFFT give the better alignment (ignoring the quality of the secondary structure prediction)? Does this correlate with the respective quality of the secondary structure prediction?

Phylogenetic Analysis

In this section, we will run simple phylogenetic analyses using a range of different alignments and datasets.

Note that we will use, exclusively, trees calculated using the Neighbor-Joining (NJ) algorithm, as implemented in the CLUSTALX package. We are using this method because it is embedded in the CLUSTAL software, making it very easy to obtain a phylogenetic estimate using this method. This is sufficient for our investigation into the influence of alignment on phylogeny.

HOWEVER - if you are interested in conducting a phylogenetic analysis as part of your own research, you should not use this approach (i.e. NJ as implemented by CLUSTAL) - there are more accurate approaches available that would be recommended. Some of these are very slow in comparison to NJ (e.g. maximum likelihood [ML] approaches, although recent ML implementations have sped things up considerably). Bayesian analysis of phylogeny, e.g. as implemented by the MrBayes software, provides in general a more accurate analysis than NJ, and is considerably faster than ML. Note that the problem with the NJ approach implemented in CLUSTAL is not just that the algorithm used is NJ, but also the way in which the distance matrix (on which the NJ algorithm acts) is calculated.

We will begin by subjecting three different alignments of the same set of sequences to phylogenetic analysis.

The sequences were all taken from the TreeFam database, and correspond to TreeFam family TF105782.

These three alignments were generated automatically from the set of sequences downloaded from TreeFam, using three different programmes (CLUSTAL, MAFFT, and PROBCONS).

TF105782 - CLUSTAL alignment
TF105782 - PROBCONS alignment
TF105782 - MAFFT alignment

Load each of these alignments into CLUSTALX and switch on the two options "Exclude Positions with Gaps" and "Correct for Multiple Substitutions" under the Tree menu.
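To give a feel for what these two options do to the distances that NJ works from, here is a rough sketch for a single pair of protein sequences. Please treat the details as assumptions: my understanding is that CLUSTAL's correction for proteins is based on Kimura's empirical formula, d = -ln(1 - p - 0.2p^2), but you should check the CLUSTAL documentation. Also, this sketch only skips columns where one of the two sequences has a gap, whereas the CLUSTALX option removes any column with a gap in any sequence. The function names and toy sequences are my own.

```python
import math


def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of differing residues, ignoring columns where either has a gap."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def kimura_corrected(p):
    """Kimura's empirical correction for multiple substitutions (proteins).

    d = -ln(1 - p - 0.2 * p**2); the formula breaks down once the
    argument of the logarithm reaches zero (p above roughly 0.85).
    """
    inner = 1.0 - p - 0.2 * p * p
    if inner <= 0:
        raise ValueError("sequences too divergent for this correction")
    return -math.log(inner)


# Invented toy sequences for illustration only.
p = p_distance("MKQLV-AG", "MRQ-VTAG")
d = kimura_corrected(p)
print(f"observed distance {p:.3f}, corrected distance {d:.3f}")
```

The corrected distance is always at least as large as the observed one, and the gap widens as the sequences become more divergent.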

Also from the Tree menu, choose "Bootstrap N-J Tree", which you should run using the default parameters.
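In case the bootstrap is new to you: the core of the procedure is to build many pseudo-alignments by resampling the columns of your alignment with replacement, estimate a tree from each, and count how often each branch reappears. The resampling step can be sketched as follows. This is an illustration of the general idea only, not CLUSTALX's own code; the function name and toy alignment are invented.

```python
import random


def bootstrap_columns(sequences, rng):
    """Resample alignment columns with replacement.

    `sequences` is a list of equal-length aligned strings. The result has
    the same number of sequences and columns, but each column is a random
    draw from the original columns, so some appear twice and some not at all.
    """
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    picks = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[i] for i in picks) for seq in sequences]


rng = random.Random(0)  # fixed seed so the example is repeatable
alignment = ["MKQLV", "MRQLI", "MK-LV"]  # invented toy alignment
replicate = bootstrap_columns(alignment, rng)
print(replicate)
```

Because whole columns are drawn together, the rows stay aligned with each other in every replicate.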

Open each of these three trees using NJPLOT and compare the trees.
Q Which of these trees do you feel is the best? Are there some branches that are obviously misplaced?
To address this you need to:
To help you with these issues, look at the following webpages, which provide you with some notes on these issues
Q Looking at the alignments, are there any obvious adjustments that might be made to them that could improve them?
For example, you might want to remove some of the sequences that are obviously mis-aligned or are fragments (i.e. are incomplete).

If you can spot such possible changes, try making these adjustments, estimating trees again from the adjusted alignments, and looking to see whether the estimated phylogenies have improved.
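One crude way to pick out candidate fragments is to look at how much of each aligned sequence consists of gaps. The sketch below does this; the 0.5 cut-off is an arbitrary value I have chosen for illustration, the names are invented, and you should always look at a flagged sequence in the alignment before deciding to remove it.

```python
def gap_fraction(aligned_seq, gap="-"):
    """Proportion of alignment columns in which this sequence has a gap."""
    if not aligned_seq:
        raise ValueError("sequence must not be empty")
    return aligned_seq.count(gap) / len(aligned_seq)


def flag_fragments(records, max_gap_fraction=0.5):
    """Return the names of sequences whose gap fraction exceeds the cut-off.

    `records` is a list of (name, aligned_sequence) tuples. The default
    cut-off is an arbitrary illustrative value, not a recommendation.
    """
    return [name for name, seq in records
            if gap_fraction(seq) > max_gap_fraction]


# Invented toy alignment for illustration only.
records = [
    ("full_1", "MKQLVTAGHE"),
    ("full_2", "MRQLITAG-E"),
    ("frag_1", "------AGHE"),
]
print(flag_fragments(records))
```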


Whether or not you feel that there is an obvious "winner" amongst the trees, the exercise should have demonstrated that alignment surely has a significant influence on the estimated phylogenies.


To demonstrate that this is not just a one-off example, try out a similar exercise on the sets of sequences in the list below. For each set of sequences, align them using several different alignment software packages, and estimate trees from the resulting alignments using CLUSTALX.

Each of these sets of sequences is taken from a different TreeFam record.
Q Based on your analysis of these families, can you spot any characteristics of alignments that make them yield good or bad phylogenetic estimates?


Extra Exercises

Here are some additional exercises for you to try out if you get through the ones above with time to spare (don't go through them in order - just pick out those that sound more interesting to you).

1) Repeat the exercise evaluating the quality of secondary structure prediction using one or more of the domains below. To obtain the CLUSTAL and MAFFT alignments, simply load the hand-edited file into CLUSTALX, select all sequences, remove all gaps, and then save the sequences into a new file. Load this file into CLUSTAL and MAFFT, and create new, automatic alignments of the sequences:

2) Use alignments for the PH domain created by MUSCLE - is this better than CLUSTAL or MAFFT? (We don't use PROBCONS as it would be too slow).

3) Intuitively, one would assume that a certain amount of substitution information is required to make a good secondary structure prediction - explore this by repeating the secondary structure prediction using alignments that contain fewer and fewer sequences. Is the number of sequences required to make a reasonable prediction relatively high or low in your opinion? Is the number of sequences required similar or different for different domains?

4) Repeat the exercise of evaluating the quality of secondary structure prediction of the PH domain, this time using the PredictProtein server - this software tends to give very accurate prediction of secondary structure, but can be somewhat challenging to operate. In particular you need to be careful about the format of the alignment you submit (I leave it to you to discover the issues involved - as you are probably aware, discovering finicky sequence format constraints is part and parcel of the job of a bioinformatics user, and this may be a good example of that task). Make notes on how you got it working - perhaps we will have time to present some of these at the end of the course.

5) If that exercise has whetted your appetite (and, yes, it is "whet" not "wet"... :-) ) for exploring secondary structure prediction, try out some of the other servers listed by ExPASy and repeat exercise (4) with these servers.

6) If you want to try using some different phylogenetic analyses, have a go at submitting some of the alignments above to some different phylogenetic methods - for example PHYML. Note that you will need to process these alignments, at least to the extent that you can prepare alignments with no gaps in (look into using GBLOCKS for an automatic way of doing this, or use SEAVIEW to remove columns). Other ideas for phylogenetic software you could try and use can be found on Joe Felsenstein's list of phylogeny and evolution-related software.
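Several of the exercises above involve small manipulations of an alignment that are easy to script if you prefer not to do them by hand: removing all gaps to recover the unaligned sequences (exercise 1), keeping fewer and fewer sequences (exercise 3), and removing every column that contains a gap (exercise 6). The helpers below are a minimal sketch of each. They assume '-' is the gap character and that the alignment is already in memory as Python strings; the function names are my own, and the column filter is a plain "no gaps anywhere" rule, not the GBLOCKS algorithm.

```python
import random


def strip_gaps(aligned_seq, gap="-"):
    """Exercise 1: recover the unaligned sequence from an aligned one."""
    return aligned_seq.replace(gap, "")


def subsample(records, n, rng):
    """Exercise 3: keep the first record plus n - 1 others chosen at random.

    The first record is always kept so that the reference sequence stays
    at the top of the reduced alignment.
    """
    if not 1 <= n <= len(records):
        raise ValueError("n must be between 1 and the number of records")
    first, rest = records[0], records[1:]
    chosen = set(rng.sample(range(len(rest)), n - 1))
    return [first] + [r for i, r in enumerate(rest) if i in chosen]


def drop_gapped_columns(sequences, gap="-"):
    """Exercise 6: keep only columns with no gap in any sequence."""
    kept = [col for col in zip(*sequences) if gap not in col]
    if not kept:
        return ["" for _ in sequences]
    return ["".join(chars) for chars in zip(*kept)]


# Invented toy alignment for illustration only.
sequences = ["MK-LV", "MKQLV", "MR-LI"]
records = list(zip(["ref", "seq_A", "seq_B"], sequences))

print(strip_gaps(sequences[0]))
print(subsample(records, 2, random.Random(0)))
print(drop_gapped_columns(sequences))
```

Bear in mind that after subsampling, some columns may contain only gaps, so it is usually worth realigning the reduced set of sequences.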

This set of exercises should have:

Note that throughout these exercises the following formatting is used to specify different types of text:

Bold non-italic text like this gives you instructions about tasks you should carry out e.g. "View the following webpage"

Italic text specifies questions for you to answer
