Day 3: 20th April 2005

Precision of phylogenetic estimates

 

As already discussed, in the context of a phylogenetic analysis, the concepts of accuracy and precision of the result can be interpreted in terms of, respectively, the extent to which the estimated tree agrees with the true tree (accuracy) and the extent to which the best estimate of the phylogeny is preferred to alternative topologies (precision). Thus, an analysis that estimates exactly the true phylogeny is perfectly accurate. In contrast, an analysis that considers all but one of the possible trees as being very poorly supported by the data, while the single tree that is proposed as the phylogeny estimate is extremely well supported by the data, is an extremely precise. To take the example of a simple maximum likelihood analysis of the data, a very precise analysis will identify the true phylogeny, while a very precise analysis will indicate that one tree (the maximum likelihood tree) has a very much higher likelihood than any other possible topology.

 

Clearly, it is impossible to obtain an estimate of the accuracy of a phylogenetic analysis when one has no knowledge of the topology of the true tree - although the use of independent datasets that are expected to have the same evolutionary history can provide one method of estimating the accuracy of the result.

 

In contrast, a number of different methods are available that allow us to estimate the precision of an analysis.  Today we will look at several different ways in which this precision can be assessed.

Splits/partitions and consensus trees

Methods that attempt to estimate the precision of a phylogenetic estimate typically involve the estimation of multiple trees from the dataset. Precision is then assessed by the extent to which these individual estimates agree with each other. One problem with such an analysis is the representation of the set of trees obtained by the analysis. This set of trees is typically represented as a ‘consensus tree’.

 

To understand the construction of consensus trees (and hence what it is that they represent) you need to be comfortable with the concept of a ‘split’ or ‘partition’ of a phylogenetic tree. A split is a division of the set of terminal-nodes/sequences/taxonomic-units of a tree into two mutually exclusive sets. Thus, any branch of a phylogenetic tree describes a split, with the terminal branches that are connected to one end of the branch making up one of the divisions, and the branches connected to the other end of the branch making up the other division. The illustration below indicates an internal branch that can be described as the split [ABC|DE].

 

A description of all the splits specified in this way by a tree provides a complete description of the topology of the tree. For example, the tree in this image

 

 

specifies the following set of splits

[A|BCDE]

[AB|CDE]

[B|ACDE]

[ABC|DE]

[C|ABDE]

[D|ABCE]

[E|ABCD]

although, as you have probably noticed, only splits [AB|CDE] and [ABC|DE] are informative phylogenetically i.e. provide information about which taxonomic units cluster together in the unrooted tree.

 


 Exercise 1

This is a short exercise to help you become comfortable working with splits.

 

Firstly, for the following tree, describe the set of informative splits.

 

Secondly, for each of the two sets of splits given below, reconstruct the unrooted phylogenetic tree that they describe.

 

(a)

[AB|CDE]

[CD|ABE]

 

(b)

[AB|CDEFGH]

[ABC|DEFGH]

[ED|ABCFGH]

[ABCDE|FGH]

[GH|ABCDEF]

 

The tree that you should have reconstructed from exercise 1a is here

The tree that you should have reconstructed from exercise 1b is here


 

 

The consensus tree of a group of trees (all of which must describe relationships between the same set of taxonomic units) is constructed from the set of taxonomically informative splits present in the group of trees. A consensus tree is constructed from this set of splits such that those splits appearing most frequently within this set (i.e. that are present in many of the trees in the initial group) are present in the consensus tree topology. Typically, the internal branches of the consensus tree are labelled with the percentage ratio of the number of times the split that describes the branch was present in the dataset of pooled splits to the number of trees in the split. Thus, if all trees in the group contain a particular split, then the consensus tree will contain that split, and it will be labelled ‘100%’.

 

However, different rules can be used to build these consensus trees. To take two examples, one can choose to include in the consensus tree only those splits that are present in ≥ 50% of the group of trees (described as the “50% consensus tree). An alternative is to include, initially, the set of splits present in the same ≥ 50% of the group of trees, and to then add, one after another, other splits to the tree that do not conflict with those already present. Thus, if adding a split to the tree present in 40% of trees does not conflict with any split already in the tree, then it is added. If there is then a split present in 37% of the trees that does conflict with splits already present in the tree, then it is not included. Instead one would identify the split present in the next-highest percentage of trees that does not conflict with splits already present. This process is continued until a bifrucating tree is obtained (this tree is described sometimes as the “extended 50% consensus tree”).

 


Exercise 2

In this exercise, consider this set of ten trees, all of which describe relationships between the same set of taxonomic units, and build both (i) the 50% consensus tree and (ii) the extended 50% consensus tree of the group of trees.

 

As described above, to do this, you must

 

(a) decompose all 10 trees into their informative splits

 

(b) count how often each split occurs in this set

 

(c) build the consensus tree based on this information, labelling the branches according to the frequency with which the split appears in the 10 trees.

 

You may find it useful to consult this file, which contains the set of all possible splits for the set of taxonomic units present in the trees.

 

You will perhaps be pleased to learn that one does not typically construct consensus trees by hand like this. Instead, we give them to CONSENSE (a programme in the PHYLIP package).

 

Use CONSENSE to create both the 50% consensus tree (‘majority rule’) and the extended 50% consensus tree (‘majority rule extended’).

By comparing the results of these calculations obtained from CONSENSE, check that the trees you constructed yourself 'by hand' are correct

 

Note that CONSENSE indicates the number of trees in the infile that possess a particular split that is present in the consensus tree by the length of that internal branch in the consensus tree - thus an internal branch in the consensus tree that has length 12 was present in 12 of the trees presented to CONSENSE - this number might represent 100% of the trees in this initial set, if there were only 12 trees presented to the programme, or 12% if 100 trees were presented etc.


 

Returning to the topic of the precision of a phylogenetic analysis.

 

Clearly, if one obtains just a single estimate of phylogeny from an analysis, one cannot obtain an estimate of the precision of the estimate. To measure/estimate the precision, one needs to obtain multiple estimates of phylogeny from the dataset such that these trees are sampled with a frequency that reflects the support given to that topology (or the splits present in the topology) by the dataset.

 

Two frequently-used approaches that obtain such distributions of trees are  non-parametric bootstrapping and Bayesian analysis of phylogeny. Consensus trees are then typically used to summarise the sets of trees obtained from this analysis - splits that occur in many of the estimated trees are then present in these consensus trees, indicating precisely-estimated regions of the topology, while splits that occur relatively rarely in the set of trees are either not present at all in the consensus tree, or are present, but are labelled with low ‘support values’ (either posterior probabilities [from a Bayesian analysis] or bootstrap percentages [from a non-parametric bootstrap analysis]).

 

Hopefully this section will have:

 

Bootstrap values using NJ

 


Exercise 3

You are going to use bootstrapping to estimate the precision of the phylogenetic estimate for a set of elongation factors. This dataset includes the set of all elongation factor 1As known from several animals (Xenopus, Drosophila, chicken, human, mouse), along with sequences from several archaeal genomes and other organisms (C. elegans, budding and fission yeast). The sequences in the alignment are labelled as follows:

 

 

Download the following alignment file.

 

You will now use PHYML to carry out a ML analysis using JTT + I + gamma (4 discrete categories) on 5 non-parametrically bootstrapped datasets.

 

To do this, you should first obtain 5 bootstrapped alignment datasets - you can do this using SEQBOOT  (use the “R” option to specify that you want to create 5 rather than the default 100 datasets - the datasets will be written to the file “outfile”).

 

Now run PHYML, choosing the appropriate substitution model as described above. You should also change the “S” option to tell PHYML that the alignment file provided to it contains multiple datasets, you will also need to tell it how many datasets it contains (5, as we created 5 bootstrapped alignments using SEQBOOT).

 

This analysis will produce an outfile that contains a tree estimated for each of the datasets it was passed in the infile.

 

Note that you can apply this protocol for generating non-parametric bootstrapped tree-sets using ANY phylogenetic analysis method simply by creating an appropriate number of bootstrap datasets using SEQBOOT and then running your analysis of choice over each of these datasets - although this might be easier or harder than things are here.

 

Based on the previous day’s exercises, can you outline a protocol (i.e. list the programmes you would use in order, indicating which file produced by one programme is passed to which file in the next programme) to create 100 non-parametric bootstrapped trees using Neighbor-Joining with a JTT + I + gamma substitution model?

 

While the PHYML analysis is running, download the following alignment file (the same alignment as use for PHYML, but in a different format). Load the alignment into CLUSTALX and use CLUSTALX to estimate a Neighbor-Joining tree with 1000 bootstrapped datasets (correcting for multiple substitutions). Load the resulting tree file (with suffix “.phb”) into NJPLOT and examine the estimated phylogeny, along with the corresponding bootstrap percentages for each branch.

 

Do you consider that the tree estimated by CLUSTALX is likely to be correct/accurate? While considering this question, ask yourself what prior assumptions/hypotheses about the evolution of the sequences you are using to answer it - remember what you learnt during the exercises you did on the first day of the course!

 

Are there some branches in the tree that you consider very likely to have been misplaced?

 

What is it about these branches that makes you suspect that they have been misplaced?

 

Are the positions of these branches within the tree estimated precisely (i.e. with high bootstrap support)?

 

After the PHYML analysis is finished, download this tree file that contains 95 trees estimated by nonparametric bootstrapping using PHYML with the same model as for the 5 you have obtained by your own PHYML analysis.

 

Concatenate the file you just downloaded to that containing the five trees you calculated from five bootstrapped datasets using PHYML. See the Linux command help-sheet for hints on how to accomplish this. This should result in a file that contains 100 trees estimated from this dataset using nonparametric bootstrapping.

 

(This somewhat contrived route to obtaining a set of 100 bootstrapped trees is necessary to save time - it would have taken too long for you to have calculated all 100 trees yourself during the session today. Normally one would create them all using the same run of PHYML)

 

Run CONSENSE to estimate the 50% majority rule (extended) consensus tree from these 100 trees.

 

Use PUZZLE to estimate ML branch lengths for this tree topology using the same substitution model that you used in PHYML. 

 

To do this you should:

 

 

 

and run PUZZLE to reconstruct this user tree given this alignment

 

Print out the resulting tree and use the consensus tree to label the internal branches of this topology with the corresponding bootstrap percentages.

 

Does the result of this ML analysis contain similar or the same probably-misplaced branches as that of the NJ analysis?

 

REMEMBER TO THINK OF THE  BOOTSTRAP VALUES AS PERCENTAGES! Thus, from the NJ tree you did 1000 bootstraps, so a branch that has 1000 bootstrap support has a 100% bootstrap percentage, while one that has 100 bootstrap support has only a 10% bootstrap support. However, if you had run only 100 bootstrap analyses, a branch with 100 bootstrap support corresponds to a 100% bootstrap percentage.

 

If so, is the bootstrap support for branches that you considered misplaced on the NJ tree very different on this ML-bootstrapped tree?

 

Posterior probabilities using MrBayes

Exercise 4

For this analysis, we are going to work with the same elongation factor dataset as used for exercise 3.

 

Download the following alignment file.

 

Load this file into CLUSTALX and save the file in NEXUS format.

 

Use EMACS to edit this file to:

and save the file with a new name.

 

Execute MRBAYES using “mb”.

 

Load the alignment file into MRBAYES using :

“execute <name_of_edited_filename>”

 

Specify the substitution model to be used in the analysis by typing:

“lset aamodel=jones rates=invgamma basefreq=estimate”

 

Tell the mcmc to stop running when it gets to the end of its pre-specified number of generations

“set autoclose=yes”

 

Specify the way in which the mcmc should be run using:

“mcmcp autoclose=yes ngen=25000 printfreq=200 samplefreq=20 nchains=4 savebrlens=yes”

 

Run the mcmc using:

“mcmc”

 

NOW STOP - and listen to the introductory talk - hopefully what you hear will make it somewhat easier to understand what you just did and why!

 

We now need to check to see if the chains have reached stationarity.

 

After the analysis has run, you should find that a file has been created in the directory in which you have run MRBAYES that has the suffix “.out.p”.

 

Copy this file to one with the same name, except that you add “.csv” at the end of the filename.

 

Start up OpenOffice by clicking on the desktop icon labelled “Office”

 

Use the file menu to open the file you just created with ending “.csv”

 

You will be presented with a page that asks you to specify what is being used to separate the columns of data in this file. Choose “Tabs” and then open the file.

 

Select the first 2000 rows of the first two columns of the spreadsheet. Then go to the INSERT menu and choose “Chart”.

 

Continue through the dialogue boxes presented to you until you are asked what kind of chart (pie chart, bar chart etc.) you would like to draw. Choose “XY plot”, and then indicate that you want to draw the chart.

 

This should draw a chart that plots the variation in probability with generation number for your dataset.

 

Determine whether the probability has reached a stable horizontal line (i.e. stationarity). If so, note the generation number after which you are certain that stationarity has been reached.

 

Calculate the number of trees that should be discarded from the set of sampled trees because they were sampled prior to stationarity. To do this you need to divide the number of generations visited prior to stationarity by the sample frequency used. This specifies that number of trees to be discarded as “burnin”.

 

Return to the MRBAYES terminal window.

 

You now need to create a consensus tree from the set of trees sampled by MRBAYES after stationarity has been reached. To do this type on one line:

sumt filename=<filename_created_by_mrbayes_with_.out.t_suffix> contype=allcompat burnin=<num_of_trees_to_discard_as_burnin>

 

Exit MRBAYES with Control-C

 

Use EMACS to view the file that has “.con” suffix. The order of columns in the “translation table” at the top of the file should be re-ordered (initially, the order is “<number> <corresponding_sequence_name>” - this should be altered so that the order is reversed for all rows.)

 

Save this file with a new name, and view this new file using TreeView. Print the tree.

 

Examine the file created with suffix “.parts”. Within here is the information about the posterior probabilities associated with the different internal branches estimated by MRBAYES. Label the tree that you printed out with the posterior probability values that you find in this file.

 

Compare this tree to that estimated using both NJ and ML in exercise 3.

 

Is it more accurate than the results of the NJ and ML analyses?

 

Do posterior probabilities for branches that are common between NJ, ML and Bayesian trees tend to be higher or lower than the corresponding bootstrap percentages?

 

For branches that are different between NJ/ML and Bayesian analysis, does there tend to be high posterior probability or bootstrap percentage?