MSAs and Phylogenetic Analysis
The majority of phylogenetic analyses carried out these days are based
on MSAs of biological sequences. As we have discussed, and experienced,
several times so far in this course, the quality of the alignment we
use for an analysis can have a significant influence on the quality of
the results of the analysis.
Substitution Rates in Alignments and Phylogenetic Analysis
It is important to note that how useful a particular protein MSA is for
addressing a given phylogenetic issue depends on several
characteristics of an alignment.
For example, as a general rule, longer alignments are preferred over
shorter ones - longer alignments contain more columns, and so have more
potential to contain columns that have captured shared substitution
events (certainly this should be the case when comparing two
alignments with the same rate of substitution occurring along the
branches of the trees, but with one alignment longer than the other).
Another important aspect to consider is that different proteins accept
substitutions at different rates - some proteins, e.g. viral coat
proteins, accept a very large number of substitutions in their amino
acid sequence, while others, such as actin and tubulin, accept very few
such events. The relevance of this to phylogenetic analyses is that
different proteins are appropriate for investigating different
phylogenetic divergence events - for example, if you are attempting to
resolve phylogenetic divergences that occurred within the last 10
million years, but you are working with a protein that evolves so
slowly that it experienced no substitutions at all within the last 10
million years, then clearly the MSA of this protein will not be able to
shed any light on such events. At the other end of the scale, if you
are attempting to investigate events that occurred 500 million years
ago, but are using a protein that accepts many mutations in the course
of just a few years, then the signal from any of the substitutions
relevant to your question will certainly be shouted down by all the
noise caused by subsequent mutations (alignment columns that have
experienced a large number of substitutions of this kind are described
as saturated).
Given these facts, one might expect that for a given phylogenetic
question, there would exist a rate of evolution that a sequence would
have experienced that would make it ideal for the analysis - indeed,
this question has been addressed in Ziheng Yang's 1998 paper on this
topic:
Yang, Z. 1998. On the best evolutionary rate for phylogenetic analysis.
Systematic Biology 47:125-133
To give you some direct experience of the effect of these kinds of
alignment characteristics on phylogenetic analysis, here are some
exercises that explore these issues.
Examine the set of sequence alignments provided below.
Based on the characteristics of these alignments, predict which you
think would be good, and which bad, at resolving the phylogenetic
relationships of the sequences they contain.
Carry out bootstrapped phylogenetic analyses on these alignments, and
examine the resulting trees.
Q To what extent do the results you observe agree with your
predictions?
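When making predictions from the characteristics of an alignment, it can help to put rough numbers on them. The sketch below is one possible way to do this, not part of the course materials: it assumes the alignment has already been read into a dictionary of name -> aligned sequence (the function name and the toy sequences are hypothetical), and it reports the alignment length, the fraction of columns that vary, and the mean pairwise proportion of differing residues (the p-distance). Values close to zero suggest a very slowly evolving protein; very high values hint at saturation.

```python
from itertools import combinations


def alignment_summary(seqs):
    """Summarise an alignment given as {name: aligned_sequence}."""
    names = list(seqs)
    if not names:
        raise ValueError("the alignment contains no sequences")
    length = len(seqs[names[0]])
    if length == 0 or any(len(s) != length for s in seqs.values()):
        raise ValueError("aligned sequences must be non-empty and of equal length")

    # A column counts as variable if it holds more than one residue type
    # (gap characters are ignored).
    variable = 0
    for i in range(length):
        residues = {s[i] for s in seqs.values() if s[i] != "-"}
        if len(residues) > 1:
            variable += 1

    # p-distance for each pair: proportion of differing residues, counted
    # only over columns where neither sequence has a gap.
    distances = []
    for a, b in combinations(names, 2):
        pairs = [(x, y) for x, y in zip(seqs[a], seqs[b])
                 if x != "-" and y != "-"]
        if pairs:
            distances.append(sum(x != y for x, y in pairs) / len(pairs))

    mean_p = sum(distances) / len(distances) if distances else 0.0
    return {
        "length": length,
        "variable_fraction": variable / length,
        "mean_p_distance": mean_p,
    }


if __name__ == "__main__":
    toy = {"a": "MKTA", "b": "MKSA", "c": "MRSA"}  # made-up sequences
    print(alignment_summary(toy))
```

These numbers are only a crude guide; they say nothing about how the substitutions are distributed over the tree.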
For several of these alignments, investigate the influence of alignment
length on estimated phylogeny by using CLUSTALX (and the "Range"
options available when saving sequences) to chop them into alignments
with a range of different lengths. When looking at these phylogenies,
consider not just the topologies of the trees, but also the bootstrap
values associated with the internal branches.
Q Do you see any obvious changes in the estimated phylogenies based
on the alignments of different length?
Q If you do see such changes, does there seem to be a particular
alignment length that could be considered a threshold below which the
phylogenies become particularly poor? Or is there too much variation
between the characteristics of these different alignments to be able to
come up with such a rule?
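If you would rather script the chopping than repeat it by hand, the idea can be sketched as below. This is only an illustration under stated assumptions: the alignment is held as a dictionary of name -> aligned sequence, column ranges are 1-based and inclusive, and the function names are made up for this example.

```python
def slice_alignment(seqs, start, end):
    """Return columns start..end (1-based, inclusive) of every sequence."""
    if not 1 <= start <= end:
        raise ValueError("need 1 <= start <= end")
    return {name: seq[start - 1:end] for name, seq in seqs.items()}


def truncations(seqs, lengths):
    """Build a series of shorter alignments, each starting at column 1."""
    return {n: slice_alignment(seqs, 1, n) for n in lengths}


if __name__ == "__main__":
    toy = {"frog": "MKTAYQLW", "fish": "MRSAYQLW"}  # made-up sequences
    for n, aln in truncations(toy, [2, 4, 8]).items():
        print(n, aln)
```

Each shortened alignment would still need to be written out in a format that your tree-building program accepts.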
Selecting Columns for Phylogenetic Analysis
As we have discussed several times in this course, analyses that are
based on MSAs assume that amino acids in the same column of the
alignment are related by
substitution events. It follows that we may want to consider removing
columns where we suspect this may not be the case from the alignment.
In the case that we are carrying out a phylogenetic analysis, the
GBLOCKS software package can be used to retain only those columns in
the alignment where all residues are related by substitution events.
To give you experience using this software, work with the alignment
Save the alignment in "pir" format using CLUSTALX (the alignment needs
to be saved in "pir" format for input into GBLOCKS), and submit the
"pir" file to the GBLOCKS
server using default settings.
Follow the link at the bottom of the page, and save the output ".pir"
file locally (e.g. add a "GB" before the ".pir" of your submitted
alignment, so that you can find the file again later) and use CLUSTALX
to estimate an NJ tree from the alignment. Bookmark the HTML output
page from GBLOCKS.
This process of removing unreliable columns from the alignment can also
be carried out "manually" -
Open the alignment in CLUSTALX, and save a new copy of the alignment
with a different name (e.g. add "SV" for SEAVIEW before the ".aln").
Open this new copy in SEAVIEW (you do not want to edit the original
file in case you make a mistake).
Use the information about conservation and the better colouring
scheme provided by CLUSTALX to decide which regions of the alignment to
remove using SEAVIEW.
Begin by selecting all the sequences in SEAVIEW (you want to remove the
residues from all sequences in each column you select for removal).
To switch off the default safeguard against accidentally deleting
residues, tick the "Allow Seq. Edition" box in the "Props" menu i.e.
Props->Allow Seq. Edition. To delete columns, use the Backspace key.
Save your cropped alignment at regular intervals in new files
(otherwise, if you make a mistake and delete a column you did not want
to, you will have to begin all over again).
Begin removing residues from the right (C-term) of the alignment - this
ensures that the numbering of the alignment positions in the
not-yet-processed alignment in SEAVIEW remains the same as that in the
alignment being viewed in CLUSTALX.
(to see the advantage of beginning from the right, try the process
again, but beginning from the left!)
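The reason for working from the right can be shown with a small sketch (the function and the toy sequence are invented for illustration). Columns are deleted one at a time, as you would do by hand, using the column numbers read from the untouched alignment:

```python
def delete_columns(seq, columns, right_to_left=True):
    """Delete 1-based columns one at a time, as when editing by hand."""
    residues = list(seq)
    for col in sorted(columns, reverse=right_to_left):
        # Once a column is deleted, everything to its right shifts one
        # place left, so numbers read from the original only remain valid
        # for columns to the LEFT of the deletion.
        del residues[col - 1]
    return "".join(residues)


if __name__ == "__main__":
    seq = "ABCDEFGH"   # we want to remove columns 2 (B) and 5 (E)
    print(delete_columns(seq, [2, 5], right_to_left=True))
    print(delete_columns(seq, [2, 5], right_to_left=False))
```

Working from the right removes B and E as intended; working from the left removes B and then F, because E has already moved into column 4.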
Reload the cropped alignment into CLUSTALX and estimate an NJ tree
Compare the alignments and phylogenetic trees estimated using (i) the
alignment obtained after processing with GBLOCKS (ii) the alignment you
prepared by making your own decisions about which columns to retain and
(iii) the initial alignment.
Q When making your own selection of columns, do you tend to be more
or less conservative than GBLOCKS?
Q Are there any major differences in the phylogenies estimated using
the different alignments? (consider both the topologies and the
bootstrap support for the different branches).
When applied to shorter alignments, GBLOCKS often has an
unwanted effect on the topology and bootstrap values of the estimated
phylogenies - the exclusion of so many columns from the final analysis
by the program simply removes too much information from the analysis.
However, for longer alignments, it can be shown that it has a positive
effect.
To appreciate the difference between the effect of GBLOCKS on
alignments of different lengths, apply GBLOCKS to the following two
alignments, which are of very different length, and compare the results
with trees estimated without using GBLOCKS. The short alignment is
of a set of PH domains. To create a
long alignment, you should concatenate together the following set of
sequences - each file contains the MSA of a different protein family.
In each alignment, the same set of sequences are represented -
therefore, to obtain a long alignment, one can simply join together the
sequences from the same organism. Carry this manipulation out using a
text editor (this can be tricky!)
(This approach - concatenating together the sequences from several
different genes - is increasingly used as a way of obtaining more
sensitive analyses of species phylogenies. As this approach is becoming
more and more common, I have asked you to do the concatenating
yourself, to give you a bit of experience of how one can go about
creating such datasets. Note that one would typically carry out such a
manipulation of sequences using a short script - those coders amongst
you might like to try and put together such a script...?)
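For those who want a starting point, here is a minimal sketch of such a script. It rests on assumptions that may not hold for the files used here: each alignment is available as FASTA-formatted text, and the sequence for a given organism carries exactly the same name in every file. The function names are invented for this example.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into {name: sequence}."""
    seqs = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the sequence name.
            name = line[1:].split()[0]
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line
    return seqs


def concatenate(alignments):
    """Join alignments end to end, matching sequences by name."""
    if not alignments:
        raise ValueError("no alignments given")
    names = list(alignments[0])
    combined = {name: "" for name in names}
    for aln in alignments:
        if set(aln) != set(names):
            raise ValueError("every alignment must contain the same names")
        if len({len(s) for s in aln.values()}) != 1:
            raise ValueError("sequences in an alignment must be equal length")
        for name in names:
            combined[name] += aln[name]
    return combined


if __name__ == "__main__":
    gene1 = read_fasta(">frog\nMK-\n>fish\nMKT\n")  # made-up data
    gene2 = read_fasta(">fish\nAA\n>frog\nGA\n")
    for name, seq in concatenate([gene1, gene2]).items():
        print(f">{name}\n{seq}")
```

Matching by name rather than by position in the file means the order of sequences may differ between files; if the names differ between files, they would need to be harmonised first.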
Note that GBLOCKS is not a perfect solution to the problem of preparing
an alignment for phylogenetic analysis - not only is there a sometimes
significant loss of information, but if the alignment passed to GBLOCKS
contains many fragments, or sequences that contain false sequence
(e.g. translated non-coding sequence), GBLOCKS is likely to retain this
"bad" sequence in the processed alignment. Therefore, one needs to be
careful to submit to GBLOCKS only alignments that have been mostly
purged of fragments (GBLOCKS will usually not retain any columns
containing gaps) and that contain no "bad" sequence. This is because
GBLOCKS scans the alignment to find regions where there are no gaps,
and where there are several highly-conserved columns. Thus, if there is
one 'bad' sequence present, as long as there are a reasonable number of
sequences in the alignment, these columns will remain highly-conserved,
and the "bad" sequence will be included.
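To see why a single "bad" sequence can slip through, consider the deliberately simplified column filter sketched below. It is NOT the GBLOCKS algorithm (which selects blocks of columns using several additional criteria); it only mimics the two ideas described above, namely rejecting columns with gaps and keeping columns where most sequences share the same residue. The function names, the threshold and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter


def keep_column(column, min_fraction=0.5):
    """Keep a gap-free column if its commonest residue is frequent enough."""
    if "-" in column:
        return False
    top_count = Counter(column).most_common(1)[0][1]
    return top_count / len(column) >= min_fraction


def filter_alignment(seqs, min_fraction=0.5):
    """Return (filtered alignment, 0-based indices of the kept columns)."""
    names = list(seqs)
    columns = list(zip(*(seqs[n] for n in names)))
    kept = [i for i, col in enumerate(columns)
            if keep_column(col, min_fraction)]
    filtered = {n: "".join(seqs[n][i] for i in kept) for n in names}
    return filtered, kept


if __name__ == "__main__":
    toy = {                      # made-up sequences
        "human": "MKTAY-L",
        "mouse": "MKTAYQL",
        "chick": "MKSAYQL",
        "fish":  "MRSAYQL",
        "frog":  "MKTWWQL",      # "WW" plays the role of the bad region
    }
    filtered, kept = filter_alignment(toy)
    print(kept)
    print(filtered["frog"])
```

In this toy case the column containing a gap is dropped, but the two columns where only the frog differs are still counted as conserved, so the frog's odd residues are carried into the filtered alignment.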
For example, try running GBLOCKS on the two alignments below and
estimating the phylogenies of the resulting alignments. In both
alignments, one of the frog sequences has been deliberately altered to
contain an inserted region of "bad" sequence.
Q Can you identify the "bad" sequence in the two alignments (prior
to GBLOCKS processing)?
Q Look at the GBLOCKED alignments - do both of them contain "bad"
sequence?
Q What effect does the "bad" sequence have on the phylogenetic
estimation (both in terms of topology and bootstrap values)? (Note
that the "bad" sequence is copied from another part of the same frog
sequence, thus in the region of the alignment containing the "bad"
sequence, the frog sequence is equally distantly related to all the
other organisms in the alignment.)
In cases where the "bad" sequence is not just 'junk', but is instead
more similar to one/some of the other sequences in the alignment than
others, the "bad" sequence can have a more drastic influence on the
estimated phylogeny.
Q Attempt to create for yourself a sequence that you expect will cause
more problems for the phylogenetic estimation - load this sequence into
the alignment and estimate the phylogeny - a small prize (perhaps...)
for the person who makes the sickest tree...!
If you have trouble doing this yourself, then try using this alignment, which
contains a sequence that is a mixture of frog and fish.
Note that throughout these exercises the following formatting is used
to specify different types of text:
Bold non-italic text like this gives
you instructions about tasks you should carry out e.g. "View the
Italic text specifies questions for
you to answer