Day 2: 19th April 2005

Substitution models

Influence of substitution model on phylogeny estimates

As discussed yesterday, one of the aims of this course is to demonstrate the potential influence that the choice of substitution model, phylogenetic algorithm and sequence alignment can have on the estimated phylogeny. Yesterday we saw how different phylogenetic algorithms can yield different phylogenetic estimates when applied to the same alignment and substitution model. Today we are going to look at the influence of the substitution model on the estimated phylogeny, by using the same algorithm and alignment to estimate the phylogeny under several different substitution models.


Exercise 1

The first exercise uses a dataset constructed by Jose Castresana (previously here at EMBL, now with his own group in Barcelona). This consists of the concatenated amino acid sequences of six different polypeptides encoded by the mitochondrial genome of four different amniotes (vertebrates whose eggs contain an amniotic membrane) along with equivalent sequences from an amphibian non-amniote outgroup (a frog). An outgroup is a sequence (or set of sequences) that is known, prior to the analysis, to be more distantly related to all other sequences in the alignment than any of them are to each other, and can be used to root the tree. The proteins included in the alignment are subunits 1, 2 and 3 of cytochrome oxidase, subunits 1 and 2 of NADH dehydrogenase and cytochrome b.


The correct phylogenetic relationship of these sequences is known (see species tree file).


Use PHYML to estimate the phylogenetic tree from these sequences using a range of different models.


Can you identify the substitution model that allows you to obtain an accurate/correct estimate of the topology, i.e. one that agrees with the species phylogeny?


You might want to try varying:


Notice that more complicated models take longer to run.



Once you have identified a model of evolution that correctly estimates the phylogeny of these sequences, apply the same model of evolution to an alignment that contains additional amniote sequences.


Compare the phylogeny estimated from this alignment with that of the known species tree for these organisms.


Is the estimated phylogeny correct for this larger analysis?


Exercise 2 (OPTIONAL)

Perhaps return to this exercise at the end if you have time - it aims to reinforce the lessons you learnt in exercise 1, but is not vital.

You will now carry out a similar exercise using the two distance-based algorithms Neighbor-Joining (NJ) as implemented by the NEIGHBOR programme in the PHYLIP package, and Weighbor, as implemented by the programme WEIGHBOR. This time you will be working with one of the datasets you looked at yesterday, the LDH vertebrate clade that contains only one human sequence.


Use PUZZLE to obtain distance matrices from this alignment using several different substitution models.


Experiment with the models in the same way as in exercise 1.


Do any of the models yield exactly the species tree?


If not, which of the estimated topologies do you consider the best?




Hopefully these two exercises have demonstrated that:


Using models to test evolutionary hypotheses

Models of evolution are not only useful in attempts to obtain phylogenetic estimates that are as accurate as possible - it is also possible to use models to test hypotheses about the nature of the evolutionary process acting on a set of sequences.


Often, such tests are implemented within a likelihood framework, although one can also use Bayesian approaches. Typically, these tests involve estimating ML branch lengths for a tree with a particular pre-specified topology under two different substitution models. One then compares the likelihoods of the two ML trees (in this case, the trees differ only in their branch lengths, not in their topologies) and asks whether the difference between these two likelihoods is greater than expected by chance.


Note that if two models were equally good descriptions of the evolution of the family, and we could compare their MLs over the correct tree topology for an infinitely long alignment, we would expect the difference in likelihood between the two models to be zero. Empirical datasets, however, always involve finite-length alignments. Thus, even when the two models are equally good descriptions of the substitution processes acting on the sequences, we expect sampling error to yield some difference in ML between them. The first step in testing whether or not two models provide equally good explanations of a dataset is therefore to identify the frequency distribution of the ML differences expected between two models that are equally good descriptions of the data. Comparing the observed ML difference to this distribution yields an estimate of how likely that difference would be under the null hypothesis of no difference between the models - if this probability is 'low' (e.g. < 0.05) then this can be considered evidence supporting the hypothesis that one of the models (the one with the greater ML) is a better description of the data than the other.

Note additionally, however, that these tests are carried out in the context of a given tree topology that is assumed to be correct - errors in this topology will affect the results of the test. In the same sense, the models being tested typically differ in only some of their components; if the components they share describe the evolution of the sequences very poorly, this will also affect the meaningfulness of such tests.


However, despite these caveats, it is often possible to demonstrate that the ML under one model is very significantly better than that under the other (i.e. that an ML difference this large would arise by chance with only a very low probability). In this case, if the better-scoring model differs from the other in a way that can be interpreted in terms of characteristics of the substitution process that generated the sequences, then this result provides evidence that the true substitution process possessed those characteristics.


It is easiest to test between a pair of hypotheses when (a) one of the models is nested within the other and (b) the 'likelihood-ratio statistic' (twice the difference in the logarithms of the likelihoods obtained under the two models) is expected to follow the chi-squared distribution in the situation where there is no significant difference between the ability of the two models to account for the data. (The more general model will always achieve a likelihood at least as high as that of the model nested within it, as it has additional parameters that can be optimised to maximise the likelihood.)


Two models are nested when one of them is the same as the other except that it includes additional parameters; the simpler model is said to be nested within the more general one.


To discuss examples of nested and non-nested hypotheses, we will first briefly remind you of some of the nomenclature used to describe different amino acid substitution models.


JTT, WAG, VT (and several other such acronyms) refer to the instantaneous exchangeability matrix used in the model (the letters in the acronym typically refer to the initials of the surnames of the authors who published the model - the final 'T' of JTT, for example, refers to Janet Thornton, head of the EBI).


Models that estimate the equilibrium frequency of each of the amino acids from the dataset under analysis (i.e. rather than using the values published by the authors of the respective exchangeability matrix) are denoted by "+ F".


Models that incorporate a gamma distribution to model between-site rate heterogeneity are denoted by “+ gamma”.


Models that incorporate the assumption that a proportion of sites in the alignment are unable to vary are denoted by "+ I" ("I" stands for "invariant sites").


Thus, the JTT + F + gamma + I model is based on the JTT exchangeability matrix, estimates equilibrium amino acid frequencies from the analysed dataset, models between-site rate heterogeneity using a gamma distribution, and assumes that a fraction of sites is invariant.


Returning to the subject of what constitutes a nested hypothesis: the JTT model is nested within the JTT + I model; the JTT + I and JTT + F models are each nested within the JTT + I + F model; but the JTT + I model is NOT nested within the JTT + F model (or vice versa).
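These nesting relationships amount to simple set containment. The following toy sketch (not part of any phylogenetics package - the model descriptions are just labels for illustration) checks whether one model is a special case of another:

```python
# A model is represented here as a base exchangeability matrix plus a set
# of optional extra components (F, gamma, I). Model A is nested within
# model B when B uses the same matrix and includes every extra of A.

def is_nested_within(model_a, model_b):
    """True if model_a is a special case of (nested within) model_b."""
    return (model_a["matrix"] == model_b["matrix"]
            and set(model_a["extras"]) <= set(model_b["extras"]))

jtt     = {"matrix": "JTT", "extras": []}
jtt_i   = {"matrix": "JTT", "extras": ["I"]}
jtt_f   = {"matrix": "JTT", "extras": ["F"]}
jtt_i_f = {"matrix": "JTT", "extras": ["I", "F"]}

print(is_nested_within(jtt, jtt_i))      # True: JTT is a special case of JTT + I
print(is_nested_within(jtt_i, jtt_i_f))  # True: JTT + I is a special case of JTT + I + F
print(is_nested_within(jtt_i, jtt_f))    # False: neither nests the other
```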


It is perhaps intuitively obvious that the larger the difference in the number of parameters between the two models being compared, the larger the difference in ML values expected by chance - this is taken into account in these analyses by using a chi-squared distribution with a number of degrees of freedom equal to the difference in the number of parameters between the two models.
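As a sketch of the arithmetic involved, the snippet below computes the likelihood-ratio statistic from a pair of hypothetical log-likelihoods (the numbers are purely illustrative, not from any real analysis) and converts it to a P-value. To keep the example self-contained it uses the exact closed form of the chi-squared survival function that exists when the degrees of freedom happen to be even:

```python
import math

def chi2_sf_even_df(x, df):
    """P(X >= x) for a chi-squared variable with an EVEN number of
    degrees of freedom df (exact closed form: exp(-x/2) times a
    truncated Poisson-like series)."""
    assert df > 0 and df % 2 == 0
    m = df // 2
    half = x / 2.0
    term, total = 1.0, 0.0
    for i in range(m):
        if i > 0:
            term *= half / i   # term is (x/2)^i / i!
        total += term
    return math.exp(-half) * total

# Hypothetical log-likelihoods under two nested models differing by
# 8 free parameters (illustrative values only):
lnL_general = -3041.2
lnL_simple  = -3050.5
stat = 2 * (lnL_general - lnL_simple)   # likelihood-ratio statistic = 18.6
p = chi2_sf_even_df(stat, df=8)
print(round(stat, 1), round(p, 4))      # p is about 0.017, below the usual 0.05 cutoff
```

Here the general model would be judged a significantly better description of the data; with an odd number of degrees of freedom one would instead look the statistic up in a chi-squared table or use a statistics library.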


However, just because two hypotheses are nested, it does not follow that the distribution of the likelihood statistic will follow the chi-squared distribution. For example, the likelihood statistic for the comparison between JTT and JTT + gamma has been shown NOT to follow this distribution!


However, note that just because you can’t use a simple chi-squared table to do a test, it does not mean that the hypotheses cannot be tested - one just needs to implement a different test.


One case in which the chi-squared test has been shown to be valid is testing whether or not a set of sequences have all evolved at the same rate (i.e. whether the sequences obey a molecular clock). This can be described as 'testing the molecular clock hypothesis' - if the test fails, this indicates that different lineages in the tree have evolved at different rates, for example because they have experienced different selective constraints.


Exercise 3

We will now test the molecular clock hypothesis using two different programmes. The first of these is a more flexible (and hence inevitably more difficult to use) programme called CODEML, part of the PAML package. We will begin with CODEML because using it to test the molecular clock hypothesis requires a better understanding of the procedure involved in the test, and hence has greater didactic power. It also introduces you to a general procedure that is used by CODEML (and other software in the PAML package) to test many other evolutionary hypotheses in a similar fashion. We will then carry out the test using a simpler procedure (as implemented in PUZZLE) and compare the results of the two approaches.


There are two groups of files for you to download. The first of these relates to a set of terminal deoxynucleotidyl transferases, while the second contains a set of concatenated mitochondrial protein sequences. Each group of files contains:

[the two alignments are the same, but the programmes require that the data be formatted differently]


Concatenated mitochondrial proteins:

PUZZLE alignment


CODEML alignment


mtREV substitution matrix for CODEML

CODEML global clock control file



Terminal deoxynucleotidyl transferases:

PUZZLE alignment


CODEML alignment


JTT substitution matrix for CODEML

CODEML global clock control file


If time is short, only carry out the analysis of the concatenated mitochondrial sequences using CODEML.


Download the files above into two directories, one for each dataset.


Within each of these directories, create two other directories, one for the PUZZLE analysis, the other for the CODEML analysis.


Create copies of the files so that their names are appropriate for the execution of the two programmes. Thus, the two files downloaded for each PUZZLE analysis should be renamed "infile" and "intree" respectively. For CODEML, the control file should be renamed "codeml.ctl", the data matrices "jones.dat" and "mtREV24.dat" as appropriate, and the alignment and tree files "infile" and "intree" respectively.


To carry out the analysis using CODEML without assuming a molecular clock, you should edit "codeml.ctl" (see the section on "Changing the contents of files" in the Linux usage notes linked from the front course page) so that the clock parameter is set appropriately.
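For reference, the clock setting in the control file is a single line of roughly the following form (check the PAML documentation for your version to confirm the exact options; in standard PAML usage, 0 means no clock and 1 enforces a global clock):

```
clock = 1    * 0: no clock (each branch has its own rate); 1: global molecular clock
```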


Run the analyses for the two datasets using both CODEML and PUZZLE.


Note that the number of free branch-length parameters in a global clock model using a bifurcating tree is (n - 1), while that of a tree in which all branches are allowed to vary in rate is (2n - 3) [where n is the number of terminal branches in the tree]. Using CODEML you will need to calculate the likelihood statistic for yourself (twice the difference in the log-likelihoods estimated under the two models). You can then assess the probability that the observed value of the likelihood statistic would be obtained from a dataset that does obey the global clock using a chi-squared distribution with degrees of freedom equal to the difference in the number of parameters between the two models.
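The parameter bookkeeping above can be checked with a few lines of arithmetic (the tip count and log-likelihood values below are hypothetical, purely for illustration):

```python
# Hypothetical tree with n = 13 terminal branches (tips):
n = 13
params_clock = n - 1              # branch-length parameters under a global clock
params_free  = 2 * n - 3          # branch lengths when all rates are free to vary
df = params_free - params_clock   # = n - 2 = 11 degrees of freedom

# Illustrative log-likelihoods reported under the two models:
lnL_free, lnL_clock = -15887.4, -15902.9
stat = 2 * (lnL_free - lnL_clock)   # likelihood statistic = 2 x 15.5, about 31.0
print(df, stat)
```

The statistic would then be compared to a chi-squared distribution with df degrees of freedom, exactly as described above.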



You have now carried out the most important part of the exercise!



If you have time, continue below to explore these datasets further, examining the extent to which the conclusions you reached in the analysis you have just completed depend on specific aspects of the models used. When working with your own datasets and carrying out your own tests, this step of exploring the robustness of your results to plausible variations in the models used is important to help you (and your potential referees) gain confidence in your results and conclusions.



In the case of the terminal deoxynucleotidyl transferases, the region of the tree that describes the relationships between the mammalian species is not known with certainty (the tree supplied is that estimated using an ML approach).


By altering the topology of this tree file and repeating the clock analysis, investigate whether the conclusion you reached about whether these data support the clock hypothesis is independent of the topology of the tree in this region (try just a few of the more extreme alternatives to this branching; do not bother to cover all possible topologies).


The position of the root of the tree used for the concatenated mitochondrial proteins is also not certain - it may lie on the branch used in the tree file that you downloaded, on the internal branch linking Ursus and Canis, or on the internal branch linking Bos, Balaenopte and Equus.


Again, explore whether the conclusion you reached about whether this data supports the clock hypothesis is altered when you use one of these alternative roots.


Then, for one of the two datasets, determine whether the conclusions reached about the clock hypothesis (using the tree topology initially downloaded) depend on the substitution model used to estimate the branch lengths. Try changing the substitution matrix and the model of between-site rate heterogeneity.


Finally (and only if you’re feeling brave - especially if you want to try it with CODEML!) attempt a similar analysis using one of the LDH datasets from yesterday’s session.


You will likely find that the alignment and/or tree files provided yesterday do not work (CERTAINLY in the case of CODEML). To persuade the programme to run, investigate the LDH tree and alignment files, and compare them to the files that DID work for the exercise described today. Keep changing characteristics of the LDH files until the programme runs using the appropriate root. If you alter the tree file manually, always check with njplot that you have produced a readable tree file. Also be aware that programmes can be sensitive to things such as the names used to describe the sequences (it is better to use short names containing only alphanumeric characters) and the order in which sequences are arranged in the alignment and tree files.


The process you have just gone through (if you attempted to convince the LDH files to be read correctly by the programme) is typical of what one encounters almost every time one interacts with phylogenetic software - for this course, of course, we have attempted to remove such problems in advance. This exercise should have given you a good taste of what it's really like! Don't worry, these things do get easier with practice. Honest!