As discussed
yesterday, one of the aims of this course is to demonstrate the potential
influence that the choice of substitution model, phylogenetic algorithms and
sequence alignment can have on the phylogeny estimated from a phylogenetic
analysis. Yesterday we saw how
different phylogenetic algorithms can yield different phylogenetic estimates
when applied to the same alignment and substitution model. Today we are going
to look at the influence of the substitution model on the estimated phylogeny
by using the same algorithm and alignment to estimate phylogeny using several
different substitution models.
The first
exercise uses a dataset
constructed by Jose Castresana (previously here at
EMBL, now with his own group in Barcelona). This consists of the concatenated
amino acid sequences of six different polypeptides encoded by the mitochondrial
genome of four different amniotes (vertebrates whose eggs contain an amniotic
membrane) along with equivalent sequences from an amphibian non-amniote
outgroup (a frog). An outgroup is a sequence (or set of sequences) that is
known, prior to the analysis, to be more distantly related to all other
sequences in the alignment than any of them are to each other, and can be used
to root the tree. The proteins included in the alignment are subunits 1, 2 and
3 of cytochrome oxidase, subunits 1 and 2 of NADH dehydrogenase and cytochrome
b.
The correct
phylogenetic relationship of these sequences is known (see
species tree file).
Use PHYML to estimate the
phylogenetic tree from these sequences using a range
of different models. Can you identify the substitution model that allows you to
obtain an accurate/correct estimate of the topology i.e. one that agrees with
the species phylogeny? You might want to
try varying: Notice that more
complicated models take longer to run. Once you have
identified a model of evolution that correctly estimates the phylogeny of these
sequences, apply the same model of evolution to an
alignment that contains
additional amniote sequences.
Compare the phylogeny estimated from this
alignment with that of the known
species tree for these organisms. Is the estimated
phylogeny correct for this larger analysis?
Perhaps return to this exercise at the end if you have time - it aims
to reinforce the lessons you learnt in exercise 1, but is not vital
You will now
carry out a similar exercise using the two distance-based algorithms
Neighbor-Joining (NJ) as implemented by the NEIGHBOR programme in the PHYLIP
package, and Weighbor, as implemented by the programme WEIGHBOR. This time you
will be working with one of the datasets you looked at yesterday, the
LDH vertebrate clade
that contains only one human sequence.
Use PUZZLE to obtain
distance matrices from this alignment using several different substitution
models.
Experiment with the models in the same way as in exercise 1.
Do any of
the models yield exactly the
species tree?
If not, which of the estimated
topologies do you consider the best?
Why? Hopefully these
two exercises have demonstrated that: Models of
evolution are not just interesting/useful in the context of attempts to obtain
as-accurate-as-possible phylogenetic estimates - it is also possible to use
models to test hypotheses about the nature of the evolutionary process acting
on a set of sequences. Often, such tests
are implemented within a likelihood framework, although one can also use
Bayesian approaches to the problem. In principle, these tests usually involve
estimation of ML branch lengths for a tree with a particular pre-specified
topology under two different substitution models. One then compares the
likelihoods of the ML trees (in this case, the trees differ only in their
branch lengths, not in their topologies) and asks whether the difference in
likelihood between these two values is greater than expected by chance. Note that, in the
case that two models are equally good descriptions of the evolution of the
family, we expect that if we could compare the ML of the two models over the
correct tree topology for an infinitely-long alignment, we expect that the
difference between the likelihood of the two models would be zero. However,
empirical datasets will of course always involve finite-length alignments.
Thus, even when the two models are equally good descriptions of the
evolutionary substitution processes acting on the sequences, we expect sampling
error to yield some difference in ML between the two models. Thus, the first
step in testing whether or not two models provide equally good explanations of
a dataset is to identifying the frequency distribution of the differences in ML
between two models that are equally good descriptions of the data. This allows
one to compare the difference in ML between the two models and to this
distribution, thus yielding an estimate of how likely the observed ML
difference would be under the null hypothesis of no difference between the
models - if this probability is ‘low’ (e.g. < 0.05) then this
can be considered evidence supporting the hypothesis that one of the models
(the one with the greater ML) is a better description of the data than the
other. Note additionally, however, that these tests are carried out in the
context of a given tree-topology that is assumed to be correct - errors in this
topology will effect the results of the test - in the same sense, the models
being tested typically differ in only some of their components. If these other
components describe the evolution of the sequences very poorly, this will also
affect the meaningfulness of such tests. However, despite
these caveats, if is often possible to demonstrate that the ML under one model
is extremely likely (i.e. with a very high P-value or probability) to be better
than that under another other model. In this case, if the model with the ML
differs from the other in a way that may be interpreted in terms of
characteristics of the substitution process that generated the sequences, then
this result provides evidence that the true substitution process possessed
these characteristics. It is easiest to
test between pairs of hypotheses when (a) one of the hypotheses is a nested
hypothesis of the first and (b) the distribution of the ‘likelihood
statistic’ (twice the difference in the logarithms of the likelihoods
obtained under the two models) is expected to follow the chi-squared
distribution in the situation where there is no significant difference between
the ability of the two models to account for the data (the nested model will
always have a higher likelihood than that of the more general model, as the
nested model has more parameters that can be optimised to maximise the
likelihood.) Models are nested
when one of the models is the same as the other, except that it that includes
additional parameters. To discuss
examples of nested and un-nested hypotheses, here we will briefly remind you of
some of the nomenclature used to describe different amino acid substitution
models. JTT, WAG, VT (and
several other such acronyms) refer to the instantaneous exchangability matrix
used in the model (the letters in the acronym typically refer to the last name
initials of the authors who published the model - JTT’s last ‘T’
refers to Janet Thornton, head of the EBI). Models that
estimate the equilibrium frequency of each of the amino acids used in the model
from the dataset under analysis (i.e. not making use of the values used by the
authors of the respective exchangability matrix) is denoted “+ F” Models that
incorporate a gamma distribution to model between-site rate heterogeneity are
denoted by “+ gamma”. Models that
incorporate the assumption that a proportion of sites in the alignment are
unable to vary are denoted by “+ I” (“I” stands for “invariant
sites”) Thus, the JTT +F
+ gamma + I model is based on the JTT exchangability matrix, estimating
equilibrium amino acid frequencies from the analysed dataset and modelling
between-site rate heterogeneity using a gamma distribution and by assuming a
fraction of invariant sites. Returning to the
subject of what constitutes a nested hypothesis - The JTT + I model is a nested
hypothesis of the JTT model - the JTT + I + F model is a nested hypothesis of
the JTT + I and the JTT + F models - but the JTT + I is NOT a nested hypothesis
(or vice versa) of the JTT + F model. It is perhaps
intuitively obvious that the larger the difference between the number of
parameters between the two models being compared, the larger the expected
difference in the ML values by chance - this is taken into account in these analyses
by using a chi-squared distribution with a number of degrees of freedom equal
to the difference in the number of parameters between the two models. However, just
because two hypotheses are nested, it does not follow that the distribution of
the likelihood statistic will follow the chi-squared distribution. For example,
the difference between JTT and JTT + gamma has been shown NOT to follow this
distribution!
However, note that just because you can’t use a simple chi-squared
table to do a test, it does not mean that the hypotheses cannot be tested -
one just needs to implement a different test. One case in which
the chi-squared test has been shown to be valid is in the case of testing for
the hypothesis of whether or not a set of sequences have all evolved at the
same rate (i.e. whether the sequences obey a molecular clock). This process can
be described as ‘testing the molecular clock hypothesis’ - if the
test fails, indicates that different lineages in the tree have experienced
different selective constraints. We will now test
the molecular clock hypothesis using two different programs. The first of these is a more flexible
(and hence inevitably more difficult to use), programme called CODEML, part of
the PAML package. We will begin by using CODEML because its use to test the
molecular clock hypothesis requires a better understanding of the procedure
involved in the test, and hence has greater didactic power. Additionally, this
introduces you to a general procedure that can is used by CODEML (and other
software in the PAML package) to test many other evolutionary hypotheses in a
similar fashion. We will then use a simpler procedure to carry out the test (as
implemented in PUZZLE) and compare the results of the test using the two
procedures. There are two
groups of files for you to download. The first of these relates to a set of terminal deoxynucleotidyl transferases, while the second
contains a set of concatenated mitochondrial protein sequences. Within each
group of files there is: [the
two alignments are the same, but the programmes require that the data be
formatted differently]
Concatenated mitochondrial proteins:
mtREV substitution
matrix for CODEML
CODEML
global clock control file Terminal deoxynucleotidyl transferases:
JTT substitution matrix for CODEML
CODEML global clock
control file If time is
short, only carry out the analysis of the concatenated mitochondrial sequences
using CODEML. Download the
files above into two directories, one for each dataset. Within each of
these directories, create two other directories, one for the PUZZLE analysis,
the other for the CODEML analysis. Create copies
of the files so that their names are appropriate for the execution of the two
programmes. Thus, the two files downloaded for each PUZZLE analysis should be
renamed “infile”
and “intree”
respectively. The CODEML control file should be renamed to “codeml.ctl”, the data matrices to
“jones.dat” and “mtREV24.dat” as appropriate, the alignment
and tree files to “infile”
and “intree”
respectively. To carry out
the analysis using CODEML without using the assumption of the molecular clock,
you should edit the “codeml.ctl”
(see the section on “Changing the contents of files” in the
Linux usage notes linked to from the front course page) so that the clock
parameter is altered appropriately. Run the analyses
for the two datasets using both CODEML and PUZZLE. Note that the
number of free branch-length parameters in a global clock model using a
bifrucating tree is (n - 1), while that of a tree where all branches are
allowed to vary in rate is (2n - 3) [where n is the number of terminal branches
in the tree]. Using CODEML you will need to calculate for yourself the
difference in the likelihood statistic between the two trees (this is twice the
difference in log likelihoods as estimated under the two models). Your can
assess the probability that the observed difference in the likelihood statistic
would be obtained from a datasets that does obey the global clock using a
chi-squared distribution where the degrees of
freedom is the difference in the number of parameters between the two models. In the case of
the terminal deoxynucleotidyl transferases, the
region of the tree that describes the relationships between the mammalian
species is not known with certainty (the tree supplied is that estimated using
an ML approach). By altering the
topology of this tree file and repeating the
clock analysis, investigate whether the conclusion you reached about whether
this data supports the clock hypothesis is independent of the topology of the
tree in this region (try just a few of the more extreme alternatives to this
branching, do not bother to cover all possible topologies) The
position of the root of the tree used for the concatenated mitochondrial
proteins is also not certain - it may lie either on the branch used in the tree
file that you downloaded, on the internal branch linking Ursus and Canis, or
the internal branch linking Bos, Balaenopte and Equus.
Again, explore whether
the conclusion you reached about whether this data supports the clock
hypothesis is altered when you use one of these alternative roots. Then,
for one of the two datasets, determine whether the conclusions reached about
the clock hypothesis (use the tree topology initially downloaded) depends on
the substitution model used to estimate the branch-lengths. Try changing the
substitution matrix and the model of site-based rate heterogeneity. Finally (and only
if you’re feeling brave - especially if you want to try it with CODEML!)
attempt a similar analysis using one of the LDH datasets from yesterday’s
session.
You will likely find that the alignment and/or tree files provided
yesterday do not work (CERTAINLY in the case of CODEML). To try and convince
the programme to work, investigate the LDH tree and alignment files, and
compare them to the files that DID work for the exercise described today. Keep
changing characteristics of the LDH files until the programme runs using the
appropriate root. If you alter the tree-file manually, always check with njplot
to see whether you have produced a readable tree-file. Also, be aware that
programs can be sensitive to things such as the names used to describe the
sequences (better using short names with only alphanumeric characters), the
order in which sequences are arranged in the alignment and tree files. This process you
have just gone through if you attempted to convince the LDH files to be read
correctly by the programme is a typical task that one encounters almost every
time one interacts with phylogenetic software - obviously for this course we
have attempted to remove such problems. However, this exercise should have
given you a good taste of what it’s really like! Don’t worry, these
things do get easier with practise. Honest!
Exercise 2 (OPTIONAL)
Using models to test
evolutionary hypotheses
Exercise 3