As already
discussed, in the context of a phylogenetic analysis, the concepts of accuracy
and precision of the result can be interpreted in terms of, respectively, the
extent to which the estimated tree agrees with the true tree (accuracy) and the
extent to which the best estimate of the phylogeny is preferred to alternative
topologies (precision). Thus, an analysis that estimates exactly the true
phylogeny is perfectly accurate. In contrast, an analysis that considers all
but one of the possible trees as being very poorly supported by the data, while
the single tree that is proposed as the phylogeny estimate is extremely well
supported by the data, is an extremely precise. To take the example of a simple
maximum likelihood analysis of the data, a very precise analysis will identify
the true phylogeny, while a very precise analysis will indicate that one tree
(the maximum likelihood tree) has a very much higher likelihood than any other
possible topology.
Clearly, it is
impossible to obtain an estimate of the accuracy of a phylogenetic analysis
when one has no knowledge of the topology of the true tree - although the use
of independent datasets that are expected to have the same evolutionary history
can provide one method of estimating the accuracy of the result.
In contrast, a
number of different methods are available that allow us to estimate the
precision of an analysis. Today we
will look at several different ways in which this precision can be assessed.
Methods that
attempt to estimate the precision of a phylogenetic estimate typically involve
the estimation of multiple trees from the dataset. Precision is then assessed
by the extent to which these individual estimates agree with each other. One
problem with such an analysis is the representation of the set of trees
obtained by the analysis. This set of trees is typically represented as a
‘consensus tree’.
To understand the
construction of consensus trees (and hence what it is that they represent) you
need to be comfortable with the concept of a ‘split’ or
‘partition’ of a phylogenetic tree. A split is a division of the
set of terminal-nodes/sequences/taxonomic-units of a tree into two mutually
exclusive sets. Thus, any branch of a phylogenetic tree describes a split, with
the terminal branches that are connected to one end of the branch making up one
of the divisions, and the branches connected to the other end of the branch
making up the other division. The illustration below indicates an internal
branch that can be described as the split [ABC|DE].
A description of
all the splits specified in this way by a tree provides a complete description
of the topology of the tree. For example, the tree in this image
specifies the
following set of splits
[A|BCDE]
[AB|CDE]
[B|ACDE]
[ABC|DE]
[C|ABDE]
[D|ABCE]
[E|ABCD]
although, as you
have probably noticed, only splits [AB|CDE] and [ABC|DE] are informative
phylogenetically i.e. provide information about which taxonomic units cluster
together in the unrooted tree.
This is a short
exercise to help you become comfortable working with splits.
Firstly, for
the following tree, describe the set of informative splits.
Secondly, for
each of the two sets of splits given below, reconstruct the unrooted
phylogenetic tree that they describe.
(a)
[AB|CDE]
[CD|ABE]
(b)
[AB|CDEFGH]
[ABC|DEFGH]
[ED|ABCFGH]
[ABCDE|FGH]
[GH|ABCDEF]
The tree that you should have reconstructed from exercise 1a is
here
The tree that you should have reconstructed from exercise 1b is
here
The consensus
tree of a group of trees (all of which must describe relationships between the
same set of taxonomic units) is constructed from the set of taxonomically
informative splits present in the group of trees. A consensus tree is
constructed from this set of splits such that those splits appearing most
frequently within this set (i.e. that are present in many of the trees in the
initial group) are present in the consensus tree topology. Typically, the
internal branches of the consensus tree are labelled with the percentage ratio
of the number of times the split that describes the branch was present in the
dataset of pooled splits to the number of trees in the split. Thus, if all
trees in the group contain a particular split, then the consensus tree will
contain that split, and it will be labelled ‘100%’.
However,
different rules can be used to build these consensus trees. To take two
examples, one can choose to include in the consensus tree only those splits
that are present in ≥ 50% of the group of trees (described as the
“50% consensus tree). An alternative is to include, initially, the set of
splits present in the same ≥ 50% of the group of trees, and to then add,
one after another, other splits to the tree that do not conflict with those
already present. Thus, if adding a split to the tree present in 40% of trees
does not conflict with any split already in the tree, then it is added. If
there is then a split present in 37% of the trees that does conflict with
splits already present in the tree, then it is not included. Instead one would
identify the split present in the next-highest percentage of trees that does
not conflict with splits already present. This process is continued until a
bifrucating tree is obtained (this tree is described sometimes as the
“extended 50% consensus tree”).
In this
exercise, consider this set of ten
trees, all of which describe relationships
between the same set of taxonomic units, and build both (i) the 50% consensus
tree and (ii) the extended 50% consensus tree of the group of trees.
As described
above, to do this, you must
(a) decompose
all 10 trees into their informative splits
(b) count how
often each split occurs in this set
(c) build the
consensus tree based on this information, labelling the branches according to
the frequency with which the split appears in the 10 trees.
You may find it
useful to consult
this file, which contains the set of all possible splits for
the set of taxonomic units present in the trees.
You will perhaps
be pleased to learn that one does not typically construct consensus trees by
hand like this. Instead, we give them to CONSENSE (a programme in the PHYLIP
package).
Use CONSENSE
to create both the 50% consensus tree (‘majority rule’) and the extended
50% consensus tree (‘majority rule extended’).
By comparing the results
of these calculations obtained from CONSENSE, check that the trees you constructed
yourself 'by hand' are correct
Note that
CONSENSE indicates the number of trees in the infile that possess a particular
split that is present in the consensus tree by the length of that internal
branch in the consensus tree - thus an internal branch in the consensus tree
that has length 12 was present in 12 of the trees presented to CONSENSE - this
number might represent 100% of the trees in this initial set, if there were
only 12 trees presented to the programme, or 12% if 100 trees were presented
etc.
Returning to the
topic of the precision of a phylogenetic analysis.
Clearly, if one
obtains just a single estimate of phylogeny from an analysis, one cannot obtain
an estimate of the precision of the estimate. To measure/estimate the
precision, one needs to obtain multiple estimates of phylogeny from the dataset
such that these trees are sampled with a frequency that reflects the support
given to that topology (or the splits present in the topology) by the dataset.
Two frequently-used
approaches that obtain such distributions of trees are non-parametric bootstrapping and
Bayesian analysis of phylogeny. Consensus trees are then typically used to
summarise the sets of trees obtained from this analysis - splits that occur in
many of the estimated trees are then present in these consensus trees,
indicating precisely-estimated regions of the topology, while splits that occur
relatively rarely in the set of trees are either not present at all in the
consensus tree, or are present, but are labelled with low ‘support
values’ (either posterior probabilities [from a Bayesian analysis] or
bootstrap percentages [from a non-parametric bootstrap analysis]).
Hopefully this
section will have:
You are going to use
bootstrapping to estimate the precision of the phylogenetic estimate for a set
of elongation factors. This dataset includes the set of all elongation factor
1As known from several animals (Xenopus, Drosophila, chicken, human, mouse), along with
sequences from several archaeal genomes and other organisms (C. elegans, budding and fission yeast). The
sequences in the alignment are labelled as follows:
Download the
following
alignment file.
You will now use PHYML to
carry out a ML analysis using JTT + I + gamma (4 discrete categories) on 5
non-parametrically bootstrapped datasets.
To do this, you should
first obtain 5 bootstrapped alignment datasets - you can do this using
SEQBOOT (use the “R”
option to specify that you want to create 5 rather than the default 100
datasets - the datasets will be written to the file “outfile”).
Now run PHYML,
choosing the appropriate substitution model as described above. You should also
change the “S” option to tell PHYML that the alignment file
provided to it contains multiple datasets, you will also need to tell it how
many datasets it contains (5, as we created 5 bootstrapped alignments using
SEQBOOT).
This analysis will
produce an outfile that contains a tree estimated for each of the datasets it
was passed in the infile.
Note that you can apply
this protocol for generating non-parametric bootstrapped tree-sets using ANY
phylogenetic analysis method simply by creating an appropriate number of
bootstrap datasets using SEQBOOT and then running your analysis of choice over
each of these datasets - although this might be easier or harder than things
are here.
Based on the previous
day’s exercises, can you outline a protocol (i.e. list the programmes you
would use in order, indicating which file produced by one programme is passed
to which file in the next programme) to create 100 non-parametric bootstrapped
trees using Neighbor-Joining with a JTT + I + gamma substitution model?
While the PHYML
analysis is running, download the
following alignment file (the same alignment
as use for PHYML, but in a different format). Load the alignment into CLUSTALX
and use CLUSTALX to estimate a Neighbor-Joining tree with 1000 bootstrapped
datasets (correcting for multiple substitutions). Load the resulting tree file
(with suffix “.phb”) into NJPLOT and examine the estimated
phylogeny, along with the corresponding bootstrap percentages for each branch.
Do you consider that
the tree estimated by CLUSTALX is likely to be correct/accurate? While considering this question, ask
yourself what prior assumptions/hypotheses about the evolution of the sequences
you are using to answer it - remember what you learnt during the exercises you
did on the first day of the course!
Are there some
branches in the tree that you consider very likely to have been misplaced?
What is it about these
branches that makes you suspect that they have been misplaced?
Are the positions of
these branches within the tree estimated precisely (i.e. with high bootstrap
support)?
After the PHYML
analysis is finished, download
this tree file
that contains 95 trees estimated
by nonparametric bootstrapping using PHYML with the same model as for the 5 you
have obtained by your own PHYML analysis.
Concatenate the file
you just downloaded to that containing the five trees you calculated from five
bootstrapped datasets using PHYML. See the Linux command help-sheet for hints
on how to accomplish this. This should result in a file that contains 100 trees
estimated from this dataset using nonparametric bootstrapping.
(This somewhat contrived
route to obtaining a set of 100 bootstrapped trees is necessary to save time -
it would have taken too long for you to have calculated all 100 trees yourself
during the session today. Normally one would create them all using the same run
of PHYML)
Run CONSENSE to
estimate the 50% majority rule (extended) consensus tree from these 100 trees.
Use PUZZLE to estimate ML
branch lengths for this tree topology using the same substitution model that
you used in PHYML.
To do this you should:
and run PUZZLE to
reconstruct this user tree given this alignment
Print out the
resulting tree and use the consensus tree to label the internal branches of
this topology with the corresponding bootstrap percentages.
Does the result of
this ML analysis contain similar or the same probably-misplaced branches as
that of the NJ analysis?
REMEMBER TO THINK OF
THE BOOTSTRAP VALUES AS
PERCENTAGES! Thus, from the NJ tree you did 1000 bootstraps, so a branch that
has 1000 bootstrap support has a 100% bootstrap percentage, while one that has
100 bootstrap support has only a 10% bootstrap support. However, if you had run
only 100 bootstrap analyses, a branch with 100 bootstrap support corresponds to
a 100% bootstrap percentage.
If so, is the
bootstrap support for branches that you considered misplaced on the NJ tree
very different on this ML-bootstrapped tree?
For this analysis, we are
going to work with the same elongation factor dataset as used for exercise 3.
Download the
following
alignment file.
Load this file into
CLUSTALX and save the file in NEXUS format.
Use EMACS to edit this
file to:
and save the file with
a new name.
Execute MRBAYES using
“mb”.
Load the alignment
file into MRBAYES using :
“execute
<name_of_edited_filename>”
Specify the
substitution model to be used in the analysis by typing:
“lset
aamodel=jones rates=invgamma basefreq=estimate”
Tell the mcmc to stop
running when it gets to the end of its pre-specified number of generations
“set
autoclose=yes”
Specify the way in
which the mcmc should be run using:
“mcmcp
autoclose=yes ngen=25000 printfreq=200 samplefreq=20 nchains=4 savebrlens=yes”
Run the mcmc using:
“mcmc”
NOW STOP - and listen
to the introductory talk - hopefully what you hear will make it somewhat easier
to understand what you just did and why!
We now need to check to
see if the chains have reached stationarity.
After the analysis has
run, you should find that a file has been created in the directory in which you
have run MRBAYES that has the suffix “.out.p”.
Copy this file to one
with the same name, except that you add “.csv” at the end of the
filename.
Start up OpenOffice by
clicking on the desktop icon labelled “Office”
Use the file menu to
open the file you just created with ending “.csv”
You will be presented
with a page that asks you to specify what is being used to separate the columns
of data in this file. Choose “Tabs” and then open the file.
Select the first 2000
rows of the first two columns of the spreadsheet. Then go to the INSERT menu
and choose “Chart”.
Continue through the
dialogue boxes presented to you until you are asked what kind of chart (pie
chart, bar chart etc.) you would like to draw. Choose “XY plot”,
and then indicate that you want to draw the chart.
This should draw a chart
that plots the variation in probability with generation number for your
dataset.
Determine whether the
probability has reached a stable horizontal line (i.e. stationarity). If so,
note the generation number after which you are certain that stationarity has
been reached.
Calculate the number
of trees that should be discarded from the set of sampled trees because they
were sampled prior to stationarity. To do this you need to divide the number of
generations visited prior to stationarity by the sample frequency used. This
specifies that number of trees to be discarded as “burnin”.
Return to the MRBAYES
terminal window.
You now need to create
a consensus tree from the set of trees sampled by MRBAYES after stationarity
has been reached. To do this type on one line:
sumt
filename=<filename_created_by_mrbayes_with_.out.t_suffix>
contype=allcompat burnin=<num_of_trees_to_discard_as_burnin>
Exit MRBAYES with
Control-C
Use EMACS to view the
file that has “.con” suffix. The order of columns in the
“translation table” at the top of the file should be re-ordered
(initially, the order is “<number> <corresponding_sequence_name>”
- this should be altered so that the order is reversed for all rows.)
Save this file with a
new name, and view this new file using TreeView. Print the tree.
Examine the file
created with suffix “.parts”. Within here is the information about
the posterior probabilities associated with the different internal branches
estimated by MRBAYES. Label the tree that you printed out with the posterior
probability values that you find in this file.
Compare this tree to
that estimated using both NJ and ML in exercise 3.
Is it more accurate
than the results of the NJ and ML analyses?
Do posterior
probabilities for branches that are common between NJ, ML and Bayesian trees
tend to be higher or lower than the corresponding bootstrap percentages?
For branches that are
different between NJ/ML and Bayesian analysis, does there tend to be high
posterior probability or bootstrap percentage?