As discussed yesterday, one of the aims of this course is to demonstrate the potential influence that the choice of (i) substitution model, (ii) phylogenetic algorithms and (iii) sequence alignment can have on the phylogeny estimated from a phylogenetic analysis. Yesterday we saw how different phylogenetic algorithms can yield different phylogenetic estimates when applied to the same alignment and substitution model. Today we are going to look at the influence of the substitution model on the estimated phylogeny by using the same algorithm and alignment to estimate phylogeny using several different substitution models.
There are many different aspects and components of the models of amino acid substitution used by phylogeny software. Typically, however, the substitution models used in different analyses only differ in a small number of these components. To help you understand the meaning of the components that do sometimes differ between analyses, we will briefly discuss some of these often-varied components in the following section.
The first aspect of these models that one might choose to alter between analyses is the exchangeability matrix. This matrix describes the relative tendency of each of the different possible amino acid exchanges to occur - i.e. the tendency of alanine to be substituted by leucine, isoleucine, tryptophan etc. and so on for all pairwise amino acid substitutions.
Note that an exchangeability matrix is distinct from an instantaneous rate matrix and also from a substitution matrix (also described as the substitution probability matrix or the transition probability matrix). The instantaneous rate matrix is the combination of the exchangability matrix and the information concerning the relative frequencies of the different amino acids - it describes the instantaneous rates for each of the different amino acid substitutions given the equilibrium amino acid frequencies. Note that this is not a PROBABILITY matrix - rather a rate matrix. The substitution matrix describes, for a given period of evolutionary time, the probability of occurence of each of the different amino acid substitutions.
A considerable number of different exchangeability matrices have been built. Generally the process of building such a matrix involves the collection of large numbers of protein sequence alignments that are then used to estimate the frequency of each of the different possible amino acid exchanges. Published matrices differ both in the alignment dataset used to estimate them and in the method used to estimate the exchangabilities from these alignments.
Different exchangeability matrices are typically referred to by the acronyms of the authors who published the matrices. For example, one commonly used matrix was estimated by Jones, Thorne and Thornton (this last name being that of Janet Thornton, head of the EBI) and is usually referred to as the JTT matrix. This matrix was calculated from a large set of globular protein alignments. Another matrix based on globular protein alignments, but which is often found to be better than the JTT matrix (i.e. to yield higher ML values) is the WAG matrix. Another matrix that you should be aware of is mtREV. This was calculated from only mitochondrial proteins, and would be the typical matrix of choice if working with mitochondrial-only datasets. When describing the substitution model used in an analysis, the exchangeability matrix used is indicated by providing the author-name-based acronym of the matrix.
All substitution models that you will use during this course make the assumption that, within a given protein family, each of the twenty different amino acids is present at a some given equilibrium frequency and that this frequency is the same throughout the evolution of the family. Assuming that the amino acid frequency of the proteins in the family does indeed behave in this way, then on counting the number of alanine, tryptophan, leucine etc. residues in each of sequences, on average one would expect the same number of each amino acid in each of the sequences - e.g. on average, each sequence would contain two tryptophans, ten alanines etc.
The equilibrium amino acid frequencies used in the model can be those estimated from the dataset used to build the exchangeability matrix used in the analysis. However, usually the values used are estimated directly from the dataset being analysed. In analyses where the frequencies are estimated from the dataset under analysis, this is indicated by adding an "F" after the name of the exchangeability matrix used - thus, an analysis that uses the JTT matrix and estimates amino acid equilibrium frequencies from the dataset under analysis is designated "JTT + F"
Looking at almost any protein multiple sequence alignment, it is clear that some positions in the alignment experience large numbers of substitutions, while others experience very few or no substitutions. The image below shows a region of an alignment of eukaryotic EF1 (elongation factor 1) proteins (after gapped and ambiguously aligned residues were removed). Clearly, different regions of this alignment accept amino acid substitutions at different rates (contrast the right side of the alignment to the centre).
This characteristic of protein sequence evolution can be incorporated into substitution models by assuming different positions in the alignment change at different rates.
One approach to modelling this between-site rate variation would be to estimate the rate at which substitutions occur for each site independently. However, a more usual approach is to model rate variation between sites using a gamma distribution.
In this approach, conceptually, the likelihood of each site is calculated for all possible rates. The total likelihood for the site is taken as the sum of the likelihood calculated for all these different possible rates - the contribution of the likelihood for each rate to the total likelihood being weighted by the frequency with which that rate is described as occurring by the gamma distribution. By varying the value of alpha (the "shape" parameter of the gamma distribution) the relative frequency with which different rates are described to occur by the gamma distribution can be altered.
However, rather than carrying out the calculation as described above for all possible rate values (as you might imagine this involves an integration operation that is computationally intensive), the gamma distribution is usually approximated using a discrete distribution. By using a larger number of discrete categories, the discrete approximation becomes closer to the true gamma distribution, but the calculation time required increases. Typically, one uses 4 or 8 different categories to approximate the gamma distribution.
A further way in which rate variation is often modelled is to assume that a proportion of the sites in the alignment are unable to change i.e. are "invariant". Note that when using such a model, the number of invariant sites estimated from the alignment will be FEWER than the number of sites that have the same amino acid for all sequences - some of these residues are likely to be capable of changing, but the substitution rate is either too low to be observed, or hidden substitutions have occurred.
A substitution model that incorporates the JTT exchangeability matrix, estimates amino acid frequencies from the alignment dataset under analysis, and uses the gamma distribution to model between-site rate variation using 4 discrete rate categories to approximate the gamma distribution would be described as "JTT + F + gamma (4 discrete categories to approximate gamma)". If the model additionally assumes the presence of invariant sites in the analysis, it would be described as "JTT + F + gamma (4 discrete categories to approximate gamma) + I". If only two different rates are assumed between sites, an invariant and one other site, the model would be described as "JTT + F + I".
However, note that while it is possible to implement "gamma + I" models, there are problems associated with their use. There are conflicting opinions concerning whether such modesl should be used or not, see this link to the EvolDir forum for a discussion on the subject. However, the arguments associated with this issue are somewhat involved - for historical reasons the exercises in this course use such models. If you become involved in research that depends on the characteristics of different substitution models, then you should consider carefully whether you should be using such models.
The aims of this exercise are:
· to provide experience using ML to estimate phylogenies using the PHYML software
· to demonstrate that choice of substitution model can affect phylogeny estimates
· to investigate the effect of including additional sequences in an analysis
In the exercise you will use the PHYML programme to estimate ML trees from a mitochondrial protein dataset using a range of different substitution models. You will be asked to identify a substitution model that yields the correct phylogeny estimate. Finally, you add some additional sequences to the analysis and determine whether this affects the estimated phylogeny.
The alignment used in this exercise was constructed by Jose Castresana (previously here at EMBL, now with his own group in Barcelona). It consists of the concatenated amino acid sequences of six different polypeptides encoded by the mitochondrial genome of four different amniotes (vertebrates whose eggs contain an amniotic membrane) along with equivalent sequences from an amphibian non-amniote outgroup (a frog). An outgroup is a sequence (or set of sequences) that is known, prior to the analysis, to be more distantly related to all other sequences in the alignment than any of them are to each other, and can be used to root the tree. The proteins included in the alignment are subunits 1, 2 and 3 of cytochrome oxidase, subunits 1 and 2 of NADH dehydrogenase and cytochrome b.
· Species tree for organisms providing genes present in the alignment: castresana5seq.tree
· Concatenated mitochondrial protein MSA in phy format: castresana5seq.phy
In this exercise we will estimate phylogenies using ML with the PHYML programme. This is a faster implementation of ML than that of PROML that you used yesterday, and it also has available a wider range of different substitution models.
Create a folder called /home/<your_username>/day2/exercise1
Copy the alignment file in phy format into this new directory.
Execute PHYML using:
PHYML will prompt you to provide a name of an alignment file.
Enter the name of your alignment file and then type "ENTER"
Use the interactive menus (operated in a similar way to PHYLIP programmes) to specify the parameters used in the calculation.
For each of these parameters, continue pressing the letter indicated, followed by "ENTER", until the desired parameter setting is indicated. Note that certain options only become available after other appropriate parameters have been selected.
· To indicate that the file contains AA (amino acid) data use “D”.
· To change the exchangeability matrix to be used, use "M".
· To select a model that assumes a fraction of invariant sites in the alignment, use “V”.
· After selecting a model using invariant sites, you need to tell the programme to estimate the proportion of invariable sites (i.e. to “Optimise p-invar”). The option to not estimate p-invar but to specify a value for this yourself is provided to save time - if you have run the analysis once, and have thus estimated a value of p-invar, you could enter this value again now, preventing the software from spending time re-estimating the value. However, at least for the first time that an analysis is run, this value needs to be estimated.
· To select a model that assumes between-site rate variation within the alignment, use “R”. After choosing to model between-site rate variation, you can use “C” to vary the number of discrete rate categories used to approximate the gamma distribution. You will also need to indicate that the alpha “shape” parameter for the gamma distribution should be estimated using “A”.
Carry out an initial analysis using PHYML by estimating the phylogeny using a JTT model. Type “Y” followed by “ENTER” to start the analysis after having specified this model.
PHYML writes several files to the directory containing the data file.
· The file whose suffix is “_phyml_lk.txt” provides the likelihood of the ML tree.
· The file whose suffix is “_phyml_stat.txt” describes the parameters used to run the analysis.
· The file whose suffix is “_phyml_tree.txt” contains the ML tree in NEWICK format.
Visualise the tree estimated using PHYML using NJPLOT, and compare the tree to the species tree.
Does the tree estimated by PHYML agree with the species tree?
If not, repeat the analysis with PHYML using different substitution models to see if you can identify a substitution model that does estimate the correct tree.
You might want to try varying:
· The exchangeability matrix used (note that mtREV is a matrix built from mitochondrial genes).
· The model of rate variation between individual residues in the alignment including:
o the number of categories used to represent the gamma distribution.
o whether or not invariant sites are assumed to exist in the alignment.
Notice that more complicated models take longer to run.
Describe in the shorthand scheme described in the introductory section of this webpage the model that yielded the correct topology.
Finally, we will investigate the effect of including additional sequences into the analysis. The following files are of an alignment of the same set of proteins as used in the earlier part of the exercise, but with the inclusion of additional amniote sequences.
· Species tree for organisms providing genes present in the alignment: castresana10seq.tree
· Concatenated mitochondrial protein MSA in phy format: castresana10seq.phy
Having identified a substitution model that estimates the correct phylogeny from a set of five organisms, one might hope that the use of the same model for an alignment containing additional species for the same set of aligned proteins would also estimate the correct phylogeny.
Estimate the phylogeny of the 10 sequences provided above using the same substitution model that yielded a correct estimate of phylogeny from an alignment of 5 sequences, using PHYML.
Is the estimated tree correct i.e. the same as the species tree?
If not, are the phylogenetic relationships between the set of 5 sequences present in the initial alignment correctly estimated?
Models of evolution are not just interesting/useful in the context of attempts to obtain as-accurate-as-possible phylogenetic estimates - it is also possible to use models to test hypotheses about the nature of the evolutionary process acting on a set of sequences.
Often, such tests are implemented within a likelihood framework, although one can also use Bayesian approaches to the problem. In principle, these tests usually involve estimation of ML branch lengths for a tree with a particular pre-specified topology under two different substitution models. One then compares the likelihoods of the ML trees (in this case, the trees differ only in their branch lengths, not in their topologies) and asks whether the difference in likelihood between these two values is greater than expected by chance.
Note that, in the case that two models are equally good descriptions of the evolution of the family, we expect that if we could compare the ML of the two models over the correct tree topology for an infinitely-long alignment, we expect that the difference between the likelihood of the two models would be zero. However, empirical datasets will of course always involve finite-length alignments. Thus, even when the two models are equally good descriptions of the evolutionary substitution processes acting on the sequences, we expect sampling error to yield some difference in ML between the two models. Thus, the first step in testing whether or not two models provide equally good explanations of a dataset is to identifying the frequency distribution of the differences in ML between two models that are equally good descriptions of the data. By comparing the difference in ML under the two models to this distribution, one can estimate the probability of obtaining such a difference in ML between two models that are equally good explanations of the data. If this probability is ‘low’ (e.g. < 0.05) then this can be considered evidence supporting the hypothesis that one of the models (the one with the greater ML) is a better description of the data than the other. Note additionally, however, that these tests are carried out in the context of a given tree-topology that is assumed to be correct - errors in this topology will affect the results of the test. Note also that the models being tested typically differ in only some of their components. If other components common to the two models describe the evolution of the sequences very poorly, this will also affect the meaningfulness of such tests.
However, despite these caveats, if is often possible to demonstrate that the ML under one model is extremely likely (i.e. with a very high P-value or probability) to be better than that under another model. In this case, if the model with the ML differs from the other in a way that may be interpreted in terms of characteristics of the substitution process that generated the sequences, then this result provides evidence that the true substitution process possessed these characteristics.
It is easiest to test between pairs of hypotheses when (a) one of the hypotheses is nested within the other and (b) the distribution of the ‘likelihood statistic’ (twice the difference in the logarithms of the likelihoods obtained under the two models) is expected to follow the chi-squared distribution in the situation where there is no significant difference between the ability of the two models to account for the data (the more general model will always have a higher likelihood than that of the nested model, as the more general model has more free parameters that can be optimised to maximise the likelihood.)
A models is nested in another model when one of the models (the more general model) is the same as the other (nested) model except that it includes additional free (i.e. allowed to vary) parameters. Or, looking at things differently, the hypotheses are nested if one of the models is a special case of the other. Thus, the JTT model is a nested hypothesis of the JTT + I model, as the JTT model is a special case of the JTT + I model (where the proportion of invariant sites is 0).
It is perhaps intuitively obvious that the larger the difference between the number of parameters between the two models being compared, the larger the expected difference in the ML values by chance - this is taken into account in these analyses by using a chi-squared distribution with a number of degrees of freedom equal to the difference in the number of parameters between the two models.
However, just because two hypotheses are nested, it does not follow that the distribution of the likelihood statistic will follow the chi-squared distribution. For example, the difference between JTT and JTT + gamma has been shown NOT to follow this distribution!
However, note that just because you can’t use a simple chi-squared table to do a test, it does not mean that the hypotheses cannot be tested - one just needs to implement a different test.
One case in which the chi-squared test has been shown to be valid is in the case of testing for the hypothesis of whether or not a set of sequences have all evolved at the same rate (i.e. whether the sequences obey a molecular clock). This process can be described as ‘testing the molecular clock hypothesis’ - if the test fails, this indicates that different lineages in the tree have experienced different selective constraints.
A more pragmatic reason for testing for rate variation within a tree is that phylogenies containing sequences with a large degree of rate variation are relatively difficult to estimate (due to a phenomenon often described as “long branch attraction” [LBA] where branches with high rates tend to cluster together within the tree). Thus, if you suspect that your phylogeny is suffering from LBA, a first step to investigating whether or not LBA might be occurring is to test for rate variation within your sequences - if there is no such rate variation, then LBA is unlikely to be affecting your estimates. Be aware that very few datasets appear to follow a global (i.e. over all sequences in an analysis) molecular clock.
To appreciate the meaning of the molecular clock hypothesis, consider the following rooted phylogeny of a set of genes. Note that the distance from the root of the tree to each terminal node is the same. Thus, the lineages leading from the root of the tree to each of the terminal nodes have experienced the same number of substitution events. Assuming that all the sequences being analysed were sampled at the same point in time, this indicates that the rate of accumulation of substitutions along all lineages has remained constant throughout evolution. Evolutionary processes that follow this pattern i.e. occur at the same rate along all lineages, are said to behave according to a “molecular clock”.
The image below shows a gene family phylogenetic tree that DOES NOT follow the molecular clock - note that the horizontal distance from the root of the tree to each terminal node is different for different terminal nodes. However, note also that this tree is exactly the same as that of the image above this, except that it is rooted in a different position.
As is clear from the two images above, the test of the molecular clock hypothesis is made in the context of working with a particular phylogeny. Thus, one can not simply consider an alignment on its own and ask the question “Have the sequences in this alignment behaved according to a molecular clock” without having either estimated the phylogeny of the sequences, or having a prior hypothesis of the relationship of the sequences. Note, further, that the hypothesis requires not only that it be tested in the context of a phylogenetic tree, but also that this needs to be a ROOTED phylogenetic tree - if we were to test the clock hypothesis on the first of the two trees above, we would find that the sequences support the clock hypothesis, but carry out the test on the second of the two phylogenies, and the test would show that the sequences DO NOT support the clock hypothesis. Thus: same unrooted topology, different root leads to different conclusions about the clock hypothesis. Thus, when using programmes to test the molecular clock hypothesis, one must provide them with both an alignment AND a phylogenetic tree against which to test the hypothesis.
There will often be some uncertainty about the position of the root of the gene tree - see yesterdays discussion and exercises on the subject of rooting. In the case where there are several different plausible roots for a gene tree, a test of the clock hypothesis should include tests using each of the different plausible root positions. If all plausible roots are consistent with the molecular clock, then the clock hypothesis is clearly supported. If some of these positions DO NOT support the molecular clock, even if others do, then clearly one would be much less confident about stating that the sequences have followed a molecular clock.
The aims of this exercise are:
· to demonstrate the use of the likelihood framework to test an evolutionary hypothesis
o specifically to test the molecular clock hypothesis
· to provide experience using the PAML software, a tool frequently used to test such hypotheses
In the exercise you will use the CODEML programme from the PAML package to determine which of two different sets of proteins is behaving according to the molecular clock.
This exercise makes use of two datasets. The first is a set of terminal deoxynucleotidyl transferases, while the second is a set of concatenated mitochondrial protein sequences.
CODEML operates differently from the other programmes you have experienced so far. Rather than specifying the options with which the programme should be executed through an interactive menu, these options are specified in a so-called “control file”.
This control file specifies, amongst other things, parameters describing:
· the name and location of the alignment and tree files to be used in the calculations;
· the name and location of the file containing the exchangeability matrix to be used in the calculation;
· whether a model of between-site rate heterogeneity should be used;
· if between-site rate heterogeneity is modelled, the details of this model;
To set a parameter within the control file, one must know the name of the parameter, and the type of value that the parameter can accept. To set a parameter in the control file one provides the name of the parameter, followed by an ‘equals’ sign, followed by the value that you want to give to the parameter. Any text coming after a * sign on a line is ignored. Thus, to set the name of the alignment file and tree file needed by the analysis you would include the following two lines in the control file:
seqfile = infile * name of sequence data file
treefile = ../intree * name of tree structure file
The name of the parameter that contains the name of the sequence data file is “seqfile”, and that containing the tree structure file is “treefile”. The file that contains the alignment data is specified as being “infile”, and is located in the same directory as the control file. The tree structure file is located in the parent directory of the directory that contains the control file, as indicated by the name of the file being preceded by “../”.
To test the clock hypothesis using CODEML, you will need to execute the programme twice. The first time, you will run the analysis using a model where all branches in the tree are allowed to evolve at different rates (the model that assumes no clock - the more general of the two models). You will then need to examine the output file (whose name is specified in the control file by the “outfile” parameter) to determine the log likelihood score of the tree under this model.
You will then run the analysis again, having altered the control file to use a substitution model that assumes that all branches have evolved at the same rate (the nested model). The output file from this analysis is then examined, and the log likelihood score of the tree under this different, more constrained model is determined.
Double the difference between the likelihoods estimated under the two models should follow the chi-squared distribution if the two models provide equally good descriptions of the data. Thus, double the difference that you observed between the likelihoods between the two analyses can be compared to the chi-squared distribution to determine the probability of obtaining the observed difference between the likelihoods if the models are equally good descriptions of the data.
NOTE the analysis DOES NOT estimate the phylogenetic tree for the dataset. Instead, the tree is assumed to be already known (and indeed the topology of the tree must be provided to the programme). Instead, CODEML determines maximum likelihood branch lengths for the supplied tree topology and substitution model, and reports the likelihood of this tree-topology/ML-branch-lengths/substitution-model combination.
Create a directory called /home/<your_username>/day2/exercise2/
Copy the concatenated mitochondrial protein (CMP) alignment file and tree files, the file containing the exchangeability matrix, and CODEML control file provided above into this new directory.
Edit the control file using emacs so that the parameters for alignment file (seqfile), tree structure file (treefile), matrix file (aaRatefile) and clock model (clock) are set to correspond to the names you have given these files, and in the case of the clock model, to “0” (which specifies that the analysis should be run assuming all branches can have different rates).
Make sure the control file is called “codeml.ctl”.
CODEML expects to find the parameter data in a file with this name.
Examine the output file (whose name is specified using the “outfile” parameter) to determine the log likelihood score for the ML tree. Make a note of this value.
To find the likelihood of the ML tree, identify the line in the output file, fairly near the end, where the input tree is given in NEWICK format, and is then followed by a line that begins “lnL” (i.e. the natural logarithm, ln, of the likelihood, L).
Rename the outfile to indicate that it was obtained from an analysis that assumed no global clock.
Edit the control file so that CODEML will be run using a model that assumes the sequences have evolved under a global clock.
Run the analysis again using the edited CODEML file.
Determine the lnL of the ML tree from the analysis assuming a global clock.
Use this chi-squared table to determine the probability that the observed difference in the lnL values was obtained by two models that are equally good explanations of the data
When carrying out this test, note that the number of free branch-length parameters in a global clock model using a bifurcating tree is (n - 1), while that of a tree where all branches are allowed to vary in rate is (2n - 3) [where n is the number of terminal branches in the tree]. The number of degrees of freedom to be used in the chi-squared test is the difference in the number of parameters between the two models.
Repeat the above procedure to test whether the terminal deoxynucleotidyl transferase dataset has evolved under a molecular clock
Which of the two datasets has evolved under a molecular clock?
If you have time, then continue below, where you will further explore the two datasets, examining the extent to which the conclusions you reached in the initial analysis depend on specific aspects of the models that were used in this analysis. When working with your own datasets, and carrying out your own tests, this step of exploring the robustness of your results and conclusions to plausible variations in the models used is important to help you (and your potential referees) gain confidence in these results and conclusions.
In the case of the terminal deoxynucleotidyl transferases, the region of the tree that describes the relationships between the mammalian species is not known with certainty (the tree supplied is that estimated using an ML approach).
Test whether the results of the analysis are affected using some alternative topologies of the input tree - edit the tree file using emacs - see the Linux usage notes for instructions on how to use this editor.
Try just a few of the more extreme alternatives to the branching in the original tree, do not bother to cover all possible topologies.
The topology of the tree estimated for the concatenated mitochondrial proteins is also not certain - in particular, the horse sequence might rather be more closely related to the bear + dog split, than to the cow + whale split.
Explore whether the conclusion you reached about whether this data supports the clock hypothesis is altered when you use this alternative topology.
For one of the two datasets, determine whether the conclusions reached about the clock hypothesis (use the tree topology initially downloaded) depends on the substitution model used to estimate the branch-lengths.
Try changing the exchangeability matrix (specified by the “aaRatefile” parameter) and the model of site-based rate heterogeneity (the “ncatG” parameter specifies the number of discrete rate categories used to estimate the gamma distribution).
Finally (and only if you’re feeling brave) attempt a similar analysis using the LDH datasets from yesterday’s session.
You will find that the alignment and/or tree files provided yesterday do not work. To try and convince CODEML to work, investigate the LDH tree and alignment files, and compare them to the files that DID work for the exercise described today. Keep changing characteristics of the LDH files until CODEML runs using the appropriate root. If you alter the tree-file manually, always check with njplot to see whether you have produced a readable tree-file. Also, be aware that programs can be sensitive to things such as the names used to describe the sequences (better using short names with only alphanumeric characters), the order in which sequences are arranged in the alignment and tree files.
This process you have just gone through if you attempted to convince the LDH files to be read correctly by the programme is a typical task that one encounters almost every time one interacts with phylogenetic software - obviously for this course we have attempted to remove such problems. However, this exercise should have given you a good taste of what it’s really like! Don’t worry, these things do get easier with practise. Honest!
The exercises above will hopefully have demonstrated:
1. that one can consider aspects of the substitution model used in a phylogenetic analysis separately from the algorithm used in the phylogenetic analysis. Thus, for example, one can use NJ or ML algorithms with the same substitution model, or NJ using two different substitution models.
2. that the choice of substitution model can affect the phylogeny estimated
3. that one can use the likelihood framework to test hypotheses concerning the evolution of a set of sequences
4. that the models of evolution used in molecular phylogenetic analyses involve modelling of the SUBSTITUTION/POINT-MUTATION processes of molecular evolution
If you do not feel convinced or confident about any of these points, please talk to one of the demonstrators, and they discuss these points with you - if you are confident about them, it will help you over the other days of the course.