MSAs and Phylogenetic Analysis

Introductory Notes

The majority of phylogenetic analyses carried out these days are based on multiple sequence alignments (MSAs) of biological sequences. As we have discussed, and experienced, several times so far in this course, the quality of the alignment we use for an analysis can have a significant influence on the quality of the results of that analysis.

Substitution Rates in Alignments and Phylogenetic Analysis

It is important to note that how useful a particular protein MSA is for addressing a given phylogenetic issue depends on several characteristics of the alignment.

For example, as a general pointer, longer alignments are preferred over shorter ones - a longer alignment contains more columns, and so has more potential to contain columns that have captured shared substitution events (this should certainly be the case when comparing two alignments with the same rate of substitution occurring along the branches of their trees, where one alignment is longer than the other).

Another important aspect to consider is that different proteins accept substitutions at different rates - some proteins, e.g. viral coat proteins, accept a very large number of substitutions in their amino acid sequence, while others, such as actin and tubulin, accept very few such events. The relevance of this to phylogenetic analyses is that different proteins are appropriate for investigating different phylogenetic divergence events. For example, if you are attempting to resolve phylogenetic divergences that occurred within the last 10 million years, but you are working with a protein that evolves so slowly that it experienced no substitutions at all within the last 10 million years, then clearly the MSA of this protein will not be able to shed any light on such events. At the other end of the scale, if you are attempting to investigate events that occurred 500 million years ago, but are using a protein that accepts many mutations in the course of just a few years, then the signal from any of the substitutions relevant to your question will almost certainly be shouted down by all the noise caused by subsequent mutations (alignment columns that have experienced a large number of substitutions of this kind are described as "saturated").
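The idea of saturation can be made concrete with a small calculation. The sketch below assumes a deliberately simple model that is not part of the course material: all 20 amino acids are equally frequent and every substitution is equally likely. Under that assumption, the expected fraction of differing sites between two sequences, p, is related to the true number of substitutions per site, d, by p = (19/20)(1 - exp(-20d/19)). The function name is purely illustrative.

```python
import math

def expected_p_distance(d, n_states=20):
    """Expected fraction of differing sites after d substitutions per site.

    Assumes equal frequencies and equal rates for all n_states residues
    (a simplifying assumption, not a realistic protein model).
    """
    ceiling = (n_states - 1) / n_states   # 0.95 for amino acids
    return ceiling * (1.0 - math.exp(-d / ceiling))

for d in (0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0):
    print(f"d = {d:5.2f}  expected p = {expected_p_distance(d):.3f}")
```

For small d the observed difference tracks the true amount of change closely, but as d grows p flattens out towards 0.95, so very different amounts of change become almost indistinguishable. That flattening is what "saturated" means under this simple model.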

Given these facts, one might expect that for a given phylogenetic question there would exist an ideal rate of evolution for the sequence used in the analysis - indeed, this question is addressed in Ziheng Yang's 1998 paper on the subject:

Yang, Z. 1998. On the best evolutionary rate for phylogenetic analysis. Systematic Biology 47:125-133.

To give you some direct experience of the effect of these kinds of alignment characteristics on phylogenetic analysis, here are some exercises that explore these issues.

Examine the set of sequence alignments provided below.
Based on the characteristics of these alignments, predict which you think would be good, and which bad, at resolving the phylogenetic relationships of the sequences they contain.

Carry out bootstrapped phylogenetic analyses on these alignments, and examine the resulting trees.

Q To what extent do the results you observe agree with your expectations?
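For those curious about what the bootstrapping in these analyses involves, the resampling step can be sketched as follows. This is a minimal illustration, not the CLUSTALX implementation: the alignment is assumed to be held as a dictionary of sequence name to aligned sequence, and the function name is hypothetical. Each replicate is built by drawing columns with replacement, and a tree would then be estimated from every replicate.

```python
import random

def bootstrap_replicate(alignment, rng=random):
    """Return one bootstrap pseudo-replicate of an alignment.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    Columns are drawn with replacement, so the replicate has the same
    length as the input but some columns repeat and others are missing.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n_cols = lengths.pop()
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in picks)
            for name, seq in alignment.items()}

# Toy alignment, invented for illustration only.
toy = {"frog": "MKVLA-GT", "fish": "MKILAAGT", "mouse": "MRVLA-GS"}
replicate = bootstrap_replicate(toy, random.Random(1))
for name, seq in replicate.items():
    print(name, seq)
```

The bootstrap value on an internal branch is then the proportion of replicate trees in which that branch appears. A short alignment offers few distinct columns to resample, which is one reason branch support tends to be weaker for short alignments.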

For several of these alignments, investigate the influence of alignment length on estimated phylogeny by using CLUSTALX (and the "Range" options available when saving sequences) to chop them into alignments with a range of different lengths. When looking at these phylogenies, consider not just the topologies of the trees, but also the bootstrap values associated with the internal branches.

Q Do you see any obvious changes in the estimated phylogenies based on the alignments of different length?

Q If you do see such changes, does there seem to be a particular alignment length that could be considered a threshold below which the phylogenies become particularly poor? Or is there too much variation between the characteristics of these different alignments to be able to come up with such a rule?
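The chopping of alignments into different lengths described above can also be scripted rather than done through the CLUSTALX "Range" options. The sketch below assumes the alignment has already been read into a dictionary of name to aligned sequence, and uses 1-based, inclusive column numbers as its own convention; the function names are hypothetical.

```python
def slice_columns(alignment, first, last):
    """Keep alignment columns first..last (1-based, inclusive)."""
    if first < 1 or last < first:
        raise ValueError("need 1 <= first <= last")
    return {name: seq[first - 1:last] for name, seq in alignment.items()}

def nested_truncations(alignment, lengths):
    """Return {length: alignment truncated to its first `length` columns}."""
    return {n: slice_columns(alignment, 1, n) for n in lengths}

# Toy alignment, invented for illustration only.
toy = {"frog": "MKVLA-GTPLW", "fish": "MKILAAGTPIW", "mouse": "MRVLA-GSPLW"}
for n, sub in nested_truncations(toy, [4, 8, 11]).items():
    print(n, sub["frog"])
```

Truncating from the same starting column each time gives a nested series, so any change in the trees can be attributed to length rather than to a different region of the protein being sampled.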

Selecting Columns for Phylogenetic Analysis

As we have discussed several times in this course, analyses that are based on MSAs assume that amino acids in the same column of the alignment are related by substitution events. It follows that we may want to consider removing from the alignment those columns where we suspect this may not be the case.

When carrying out a phylogenetic analysis, the GBLOCKS software package can be used to retain only those columns of the alignment that appear to be reliably aligned, i.e. where the residues can reasonably be assumed to be related by substitution events.
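To give a feel for what automated column selection involves, here is a minimal sketch. It is emphatically not the GBLOCKS algorithm: it simply keeps a column if it contains no gaps and if its most common residue is present in at least a chosen fraction of the sequences. Both rules, the threshold and the function names are assumptions made for illustration.

```python
from collections import Counter

def keep_column(column, min_fraction=0.5, gap_chars="-."):
    """Decide whether one alignment column looks reliable.

    A column is kept only if it has no gaps and its most common residue
    occurs in at least min_fraction of the sequences. Both rules are
    illustrative choices, not the GBLOCKS criteria.
    """
    if any(ch in gap_chars for ch in column):
        return False
    top_count = Counter(column).most_common(1)[0][1]
    return top_count / len(column) >= min_fraction

def filter_columns(alignment, min_fraction=0.5):
    """Return (filtered alignment, list of kept 1-based column numbers)."""
    names = list(alignment)
    columns = list(zip(*(alignment[n] for n in names)))
    kept = [i for i, col in enumerate(columns) if keep_column(col, min_fraction)]
    filtered = {n: "".join(alignment[n][i] for i in kept) for n in names}
    return filtered, [i + 1 for i in kept]

# Toy alignment, invented for illustration only.
toy = {"frog": "MKVLA-GT", "fish": "MKILAAGT",
       "mouse": "MRVLA-GS", "human": "MRTLP-GA"}
filtered, kept = filter_columns(toy)
print(kept)
print(filtered)
```

Raising min_fraction makes the selection more conservative. Real tools also consider runs of neighbouring columns rather than judging each column in isolation.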

To give you experience of using this software, work with the alignment below.

Save the alignment in "pir" format using CLUSTALX (the alignment needs to be saved in "pir" format for input into GBLOCKS), and submit the "pir" file to the GBLOCKS server using default settings.

Follow the link at the bottom of the page and save the output ".pir" file locally under a name you will recognise (e.g. add "GB" before the ".pir" of your submitted alignment, so that you can find the file again later). Then use CLUSTALX to estimate an NJ tree from this processed alignment. Bookmark the HTML output page from GBLOCKS.

This process of removing unreliable columns from the alignment can also be carried out "manually":

Open the alignment in CLUSTALX, and save a new copy of the alignment with a different name (e.g. add "SV" for SEAVIEW before the ".aln").

Open this new copy in SEAVIEW (you do not want to edit the original file in case you make a mistake).

Use the information about conservation and the better colouring scheme provided by CLUSTALX to decide which regions of the alignment to remove using SEAVIEW.

Begin by selecting all the sequences in SEAVIEW (you want to remove the residues from all sequences in each column you select for removal).

To switch off the default safeguard against accidentally deleting residues, click the "Allow Seq. Edition" box in the "Props" menu, i.e. Props->Allow Seq. Edition. To delete columns, use the Backspace key.

Save your cropped alignment at regular intervals in new files (otherwise, if you make a mistake and delete a column you did not want to, you will have to begin all over again).

Begin removing residues from the right (C-term) of the alignment - this ensures that the numbering of the alignment positions in the not-yet-processed alignment in SEAVIEW remains the same as that in the alignment being viewed in CLUSTALX.

(To see the advantage of beginning from the right, try the process again, but beginning from the left!)

Reload the cropped alignment into CLUSTALX and estimate an NJ tree from it.
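The advice above to begin removing residues from the right can be illustrated with a toy example. In this sketch (the function name and the example string are invented), the column ranges to delete are always written in the numbering of the original alignment, as they would be when read off the CLUSTALX window.

```python
def delete_ranges(seq, ranges, right_to_left=True):
    """Delete 1-based inclusive column ranges from seq, one at a time.

    The ranges are always given in the numbering of the ORIGINAL alignment.
    """
    ordered = sorted(ranges, reverse=right_to_left)
    for first, last in ordered:
        # Each deletion shortens seq, shifting everything to its right.
        seq = seq[:first - 1] + seq[last:]
    return seq

original = "ABCDEFGHIJ"
to_remove = [(2, 3), (7, 8)]          # intended: remove B, C and G, H

print(delete_ranges(original, to_remove, right_to_left=True))
print(delete_ranges(original, to_remove, right_to_left=False))
```

Working from the right gives ADEFIJ, as intended. Working from the left gives ADEFGH: once B and C have gone, everything to their right has shifted by two positions, so "columns 7 to 8" now point at I and J rather than G and H.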

Compare the alignments and phylogenetic trees estimated using (i) the alignment obtained after processing with GBLOCKS (ii) the alignment you prepared by making your own decisions about which columns to retain and (iii) the initial alignment.

Q When making your own selection of columns, do you tend to be more or less conservative than GBLOCKS?

Q Are there any major differences in the phylogenies estimated using the different alignments? (consider both the topologies and the bootstrap support for the different branches).

When applied to shorter alignments, GBLOCKS often has an unwanted effect on the topology and bootstrap values of the estimated phylogenies - by excluding so many columns, the program simply removes too much information from the analysis. However, for longer alignments, it can be shown to have a positive effect.

To appreciate the difference between the effect of GBLOCKS on alignments of different lengths, apply GBLOCKS to the following two alignments, which are of very different length, and compare the results with trees estimated without using GBLOCKS. The short alignment is of a set of PH domains. To create a long alignment, you should concatenate the following set of alignments - each file contains the MSA of a different protein family. The same set of organisms is represented in each alignment; therefore, to obtain a long alignment, one can simply join together the sequences from the same organism. Carry out this manipulation using a text editor (this can be tricky!).

(This approach - concatenating the sequences from several different genes - is increasingly used as a way of obtaining more sensitive analyses of species phylogenies. As this approach is becoming more and more common, I have asked you to do the concatenating yourself, to give you a bit of experience of how one can go about creating such datasets. Note that one would typically carry out such a manipulation of sequences using a short script - the coders amongst you might like to try to put together such a script...?)
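For those who would like a starting point for such a script, here is one possible sketch. It rests on several assumptions that may not match the course files: the alignments are in FASTA format, every alignment uses exactly the same name for a given organism, and each alignment is already internally aligned. The function names are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into {name: sequence} (insertion-ordered)."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

def concatenate(alignments):
    """Join several alignments end to end, matching sequences by name."""
    names = list(alignments[0])
    for aln in alignments:
        if set(aln) != set(names):
            raise ValueError("every alignment must contain the same names")
        if len({len(s) for s in aln.values()}) != 1:
            raise ValueError("sequences within an alignment differ in length")
    return {n: "".join(aln[n] for aln in alignments) for n in names}

# Two toy "genes", invented for illustration; note the different record order.
gene_a = parse_fasta(">frog\nMKV-LA\n>fish\nMKIALA\n")
gene_b = parse_fasta(">fish\nGT--W\n>frog\nGSPLW\n")
for name, seq in concatenate([gene_a, gene_b]).items():
    print(f">{name}\n{seq}")
```

Matching by name rather than by position in the file is the important design choice: the same organism is rarely guaranteed to appear in the same order in every file. With real files, the text would be read with something like open(path).read(), and sequence names that differ between genes (e.g. accession numbers) would first need to be mapped to a common organism label.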

Note that GBLOCKS is not a perfect solution to the problem of preparing an alignment for phylogenetic analysis. Not only is there a sometimes significant loss of information, but if the alignment passed to GBLOCKS contains many fragments, or sequences that contain false sequence (e.g. translated non-coding sequence), GBLOCKS is likely to retain this "bad" sequence in the processed alignment. This is because GBLOCKS scans the alignment to find regions where there are no gaps and where there are several highly-conserved columns. Thus, if there is one "bad" sequence present, as long as there are a reasonable number of sequences in the alignment, these columns will remain highly conserved, and the "bad" sequence will be included. Therefore, one needs to be careful to submit to GBLOCKS only alignments that have been mostly purged of fragments (GBLOCKS will usually not retain any columns containing gaps) and that contain no "bad" sequence.

For example, try running GBLOCKS on the two alignments below and estimating the phylogenies of the resulting alignments. In both alignments, one of the frog sequences has been deliberately altered to contain an insertion of "bad" sequence.

Q Can you identify the "bad" sequence in the two alignments (prior to GBLOCKS processing)?

Q Look at the GBLOCKED alignments - do both of them contain "bad" sequence?

Q What effect does the "bad" sequence have on the phylogenetic estimation (both in terms of topology and bootstrap values)? (Note that the "bad" sequence is copied from another part of the same frog sequence; thus, in the region of the alignment containing the "bad" sequence, the frog sequence is equally distantly related to all the other organisms in the alignment.)

In cases where the "bad" sequence is not just "junk", but is instead more similar to one or some of the other sequences in the alignment than to the rest, the "bad" sequence can have a more drastic influence on the estimated phylogeny.

Q Attempt to create for yourself a sequence that you expect will cause more problems for the phylogenetic estimation - load this sequence into the alignment and estimate the phylogeny - a small prize (perhaps...) for the person who makes the sickest tree...!

If you have trouble doing this yourself, then try using this alignment, which contains a sequence that is a mixture of frog and fish.
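A mixed sequence of this kind can be built by copying a block of columns from one aligned sequence into another. The sketch below is only one way of doing so, and is not how the course alignment was made: it assumes the alignment is held as a dictionary of name to aligned sequence, uses 1-based inclusive column numbers, and all names and toy sequences are invented.

```python
def make_chimera(alignment, donor, recipient, first, last, new_name="chimera"):
    """Copy columns first..last (1-based, inclusive) from donor into recipient.

    Returns a new alignment containing the mixed sequence under new_name;
    the input alignment is left unchanged.
    """
    d, r = alignment[donor], alignment[recipient]
    if len(d) != len(r):
        raise ValueError("donor and recipient must have the same aligned length")
    if not 1 <= first <= last <= len(r):
        raise ValueError("column range lies outside the alignment")
    mixed = r[:first - 1] + d[first - 1:last] + r[last:]
    result = dict(alignment)
    result[new_name] = mixed
    return result

# Toy alignment, invented for illustration only.
toy = {"frog": "MKVLAGTPLW", "fish": "MRIIAGSPVW", "mouse": "MKVLSGTPLF"}
with_chimera = make_chimera(toy, donor="fish", recipient="frog", first=6, last=10)
print(with_chimera["chimera"])
```

Because the copied block stays in its original columns, the mixed sequence remains aligned with everything else, which is what makes it hard to spot by eye.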

Note that throughout these exercises the following formatting is used to specify different types of text:

Bold non-italic text like this gives you instructions about tasks you should carry out e.g. "View the following webpage"

Italic text specifies questions for you to answer
