Link to course homepage

Day 3: 2nd June 2005

 

One of the analyses that you will run today takes a considerable time to run. Therefore, you should set this analysis running before the presentation is given. To do this, go to the beginning of exercise 4 below, and follow the instructions. This is a Bayesian analysis using the MRBAYES software.

 

 

Precision and accuracy

Precision and accuracy are two different qualities that can be used to describe the estimate of a parameter of from a set of data (i.e. they are qualities that can be used to describe characteristics of a measurement).

 

 

To remove ambiguity from the discussion, here are the defenitions (taken from Dictionary.com) of the two words that I am using in this discussion:

 

Accuracy

The ability of a measurement to match the actual value of the quantity being measured.

 

Precision

The ability of a measurement to be consistently reproduced.

 

 

Precision of phylogenetic estimates

As already discussed, in the context of a phylogenetic analysis, the concepts of accuracy and precision of the result can be interpreted in terms of, respectively, the extent to which the estimated tree agrees with the true tree (accuracy) and the extent to which the best estimate of the phylogeny is preferred to alternative topologies (precision). Thus, an analysis that estimates a phylogeny from the dataset that is identical to the true phylogeny is perfectly accurate. In contrast, an analysis that considers all but one of the possible trees as being very poorly supported by the data, while the single tree that is proposed as the phylogeny estimate is extremely well supported by the data, is an extremely precise estimtae of the phylogeny. To take the example of a simple maximum likelihood analysis of the data, a very accurate analysis will identify the true phylogeny, while a very precise analysis will indicate that one tree (the maximum likelihood tree) has a very much higher likelihood than any other possible topology.

 

Clearly, it is impossible to obtain an estimate of the accuracy of a phylogenetic analysis when one has no knowledge of the topology of the true tree - although the use of independent datasets that are expected to have the same evolutionary history can provide one method of estimating the accuracy of the result.

 

In contrast, a number of different methods are available that allow us to estimate the precision of an analysis.  Today we will look at several different ways in which this precision can be assessed.

 

 

Splits/partitions and consensus trees

Methods that attempt to estimate the precision of a phylogenetic estimate typically involve the estimation of multiple trees from the dataset. Precision is then assessed by the extent to which these individual estimates agree with each other. One problem with such an analysis is the representation of the set of trees obtained by the analysis. This set of trees is typically represented as a ‘consensus tree’.

 

To understand the construction of consensus trees (and hence what it is that they represent) you need to be comfortable with the concept of a ‘split’ or ‘partition’ of a phylogenetic tree. A split is a division of the set of terminal-nodes/sequences/taxonomic-units of a tree into two mutually exclusive sets. Thus, any branch of a phylogenetic tree describes a split, with the terminal branches that are connected to one end of the branch making up one of the divisions, and the branches connected to the other end of the branch making up the other division. The illustration below indicates an internal branch that can be described as the split [ABC|DE].

 

A description of all the splits specified in this way by a tree provides a complete description of the topology of the tree. For example, the tree in this image

 

 

specifies the following set of splits

 

[A|BCDE]

[AB|CDE]

[B|ACDE]

[ABC|DE]

[C|ABDE]

[D|ABCE]

[E|ABCD]

 

although, as you have probably noticed, only splits [AB|CDE] and [ABC|DE] are phylogenetically -informative i.e. provide information about which taxonomic units cluster together in the unrooted tree.

 

 


Exercise 1

 

The aim of this exercise is:

·      to help you become acquainted with the concept of splits of phylogenetic trees

 

In this exercise you will be asked (a) to list the set of phylogenetically-informative splits present in a phylogenetic tree and (b) reconstruct a phylogenetic tree from a set of splits.

 

Download the following file: day3_ex1_tree.ph

 

Visualise the tree in this file using UNROOTED.

> unrooted day3_ex1_tree.ph

 

Describe the set of informative splits in this tree.

 

For each of the two sets of splits below, reconstruct the unrooted phylogenetic tree that they describe.

 

(a)

[AB|CDE]

[CD|ABE]

 

(b)

[AB|CDEFGH]

[ABC|DEFGH]

[ED|ABCFGH]

[ABCDE|FGH]

[GH|ABCDEF]

 

The tree that you should have reconstructed from exercise 1a: day3_ex1a_answer_tree.ph

 

The tree that you should have reconstructed from exercise 1b: day3_ex1b_answer_tree.ph


 

 

The consensus tree of a group of trees (all of which must describe relationships between the same set of taxonomic units) is constructed from the set of taxonomically informative splits present in the group of trees. A consensus tree is constructed from this set of splits such that those splits appearing most frequently within this set (i.e. that are present in many of the trees in the initial group) are present in the consensus tree topology. Typically, the internal branches of the consensus tree are labelled with the percentage ratio of the number of times the split that describes the branch was present in the dataset of pooled splits to the number of trees in the split. Thus, if all trees in the group contain a particular split, then the consensus tree will contain that split, and it will be labelled ‘100%’.

 

However, different rules can be used to build these consensus trees. To take two examples, one can choose to include in the consensus tree only those splits that are present in ≥ 50% of the group of trees (described as the “50% consensus tree). An alternative is to include, initially, the set of splits present in the same ≥ 50% of the group of trees, and to then add, one after another, other splits to the tree that do not conflict with those already present. Thus, if adding a split to the tree present in 40% of trees does not conflict with any split already in the tree, then it is added. If there is then a split present in 37% of the trees that does conflict with splits already present in the tree, then it is not included. Instead one would identify the split present in the next-highest percentage of trees that does not conflict with splits already present. This process is continued until a bifurcating tree is obtained (this tree is described sometimes as the “extended 50% consensus tree”).

 

 


Exercise 2

The aim of this exercise is:

 

In this exercise, you will be asked to build both the 50% consensus tree and the extended 50% consensus tree from a set of 10 trees. You will then use CONSENSE to carry out the same operations.

 

Files

Ten trees from which to calculate consensus trees: day3_ex2_10trees.ph

Set of all possible phylogenetically-informative splits for these 10 trees: all_phylo_inf_splits.txt

 

It might help you to print both of these files out onto paper.

 

Calculate consensus trees manually

The file day3_ex2_10trees.ph contains ten UNROOTED phylogenetic trees in NEWICK format. Each of the trees describes phylogenetic relationships between the same set of taxonomic units.

 

Draw the topology of each of these trees on paper.

 

Decompose all 10 trees into their informative splits.

 

Count how often each split occurs in this set of all informative splits from all 10 trees.

 

Build: (i) 50% consensus tree, and (ii) the extended 50% consensus tree, based on the list of splits from all trees, labelling each branch according to the frequency with which the split appears in the 10 trees.

 

You may find it useful to consult the file containing the list of all possible phylogenetically-informative splits while carrying out this exercise to help you identifying and keeping track of the frequency of occurrence of each split.

 

Calculate consensus trees using CONSENSE

You will perhaps be pleased to learn that one does not typically construct consensus trees by hand like this. Instead, we use CONSENSE from the PHYLIP package to carry out the calculations.

 

To use CONSENSE to calculate the consensus trees:

 

Create a directory /home/<your_username>/day3/exercise2

 

Save the above file containing the ten trees in this new directory

 

CONSENSE expects to operate on a file named "intree" that is located in the directory from which CONSENSE is executed, therefore:

Copy the file containing the ten trees into a file named "intree" in this new directory

 

To execute CONSENSE use:

> consense

 

Use the interactive text menus to build the 50% consensus tree (Majority Rule tree):

Type "C" followed by "ENTER" (an repeat these two actions) to choose different rules.

 

Once you have chosen the appropriate rule, type "Y" followed by "ENTER" to begin the calculation.

 

Visualise the trees estimated from sense using NJPLOT - the consensus tree is written to a file called "outtree".

Switch on the "Branch lengths" button within NJPLOT to display the lengths of the branches of the consensus tree. Note that the set of 6 splits containing just one terminal branch (i.e. the phylogenetically UN-informative splits) each occur in each of the trees. Thus, the length of all terminal branches should be the same as the total number of trees passed to CONSENSE. However, because of the way in which CONSENSE describes the number of trees containing each split, one of the terminal nodes is selected to have a length twice as long as all the others. Ignore this - simply treat it as having a length half that of what it shows.

 

Repeat the above procedure, this time calculating the extended 50% consensus tree (‘majority rule extended’).

 

Do the trees estimated using CONSENSE agree with the trees you estimated yourself?

 

Note that CONSENSE indicates the number of trees in the infile that possess a particular split that is present in the consensus tree by the length of that internal branch in the consensus tree - thus an internal branch in the consensus tree that has length 8 was present in 8 of the trees presented to CONSENSE - this number might represent 100% of the trees in this initial set, if there were only 8 trees presented to the programme, or 8% if 100 trees were presented etc.


 

 

Returning to the topic of the precision of a phylogenetic analysis.

 

Clearly, if one obtains just a single estimate of phylogeny from an analysis, one cannot obtain an estimate of the precision of the estimate. To measure/estimate the precision, one needs to obtain multiple estimates of phylogeny from the dataset such that these estimated phylogenies are sampled with a frequency that reflects the support given to that phylogeny (or the splits present in the phylogeny) by the dataset.

 

Two frequently-used approaches that obtain such distributions of trees are non-parametric bootstrapping and Bayesian analysis of phylogeny. Consensus trees are then typically used to summarise the sets of trees obtained from this analysis - splits that occur in many of the estimated trees are then present in these consensus trees, indicating precisely-estimated regions of the topology, while splits that occur relatively rarely in the set of trees are either not present at all in the consensus tree, or are present, but are labelled with low ‘support values’ (either posterior probabilities [from a Bayesian analysis] or bootstrap percentages [from a non-parametric bootstrap analysis]).

 

Hopefully this section will have:

·      Familiarised you with the concepts of both splits and consensus trees

·      Demonstrated that consensus trees provide a method to assess the extent of agreement between a set of trees

·      Introduced the idea of using a consensus tree to summarise estimates of the precision with which internal branches in a phylogeny estimate are supported by the dataset

 

 

Exercise 3: Bootstrap values using ML

 

Aims of the exercise:

 

During the exercise you will use ML to estimate the phylogeny of a group of elongation factor sequences. You will also use non-parametric bootstrapping to determine the bootstrap support percentages for the different splits in the estimated tree. As a bare minimum, one should use 100 bootstrap datasets for an analysis of this kind - and a ML tree must be estimated from each of these 100 datasets, which can take a long time. To speed up the exercise, you will only be calculating trees from 5 bootstrapped datasets - you will then add the trees estimated from these 5 datasets to 95 trees that I calculated earlier in the same way. You will then be asked to estimate the bootstrap support based on the resulting set of 100 trees.

 

 

 

Files

 

 

Dataset

The sequences used in this analysis are the set of all elongation factor 1As known from several animals (Xenopus, Drosophila, chicken, human, mouse), along with sequences from several archaeal genomes and other organisms (C. elegans, budding and fission yeast). Sequences in the alignment are labelled as follows:

 

fly: Drosophila melanogaster

human: Homo sapiens

mouse: Mus musculus

frog: Xenopus laevis

chicken: Gallus gallus

worm: Caenorhabditis elegans

pombe: Schizosaccharomyces pombe, fission yeast

cerevisa: Saccharomyces cerevisae, bakers yeast

 

along with sequences from three different archaea, labelled as: archaeonA, archaeonB and archaeonC

 

Calculating ML trees for 5 bootstrap datasets

You will now use PHYML to carry out a ML analysis using JTT + I + gamma (4 discrete categories) on 5 non-parametrically bootstrapped datasets.

 

Obtain 5 bootstrapped alignment datasets

Bootstrapping depends on the analysis of so-called 'pseudoreplicate' datasets. These datasets are based on an original 'observed' dataset (i.e. the protein multiple sequence alignment of your sequences of interest) in that they are collections of randomly- sampled columns from the original dataset. Columns are sampled from the original dataset until the pseudoreplicate dataset contains the same number of columns as the original dataset. Sampling is done WITH REPLACEMENT i.e. a column can be sampled multiple times. A consequence of sampling with replacement is that almost certainly some columns from the original dataset will be present multiple times in the pseudoreplicate dataset, while other columns from the original dataset will be absent from the pseudoreplicate.

 

The first stage in your analysis is to create five bootstrapped alignments. These are created using the SEQBOOT programme of the PHYLIP package.

 

Create a directory /home/<your_username>/day3/ex3/seqboot

 

Save the EF1A alignment dataset (phy format) into this directory

 

SEQBOOT, as other PHYLIP programmes, expects a file called 'infile' to exist in the directory from which it is executed, and that this file will contain the sequence alignment file that you want to have bootstrapped. Therefore:

Copy the EF1A alignment file to the name "infile"

 

Execute SEQBOOT:

> seqboot

 

Use the "R" option from the interactive text menu to indicate that you want to create 5 rather than the default 100 datasets.

 

Run the calculation using "Y" followed by "ENTER"

 

This creates a file called "outfile" that contains the five bootstrapped datasets.

 

Create a copy of "outfile" that has a more informative name using the unix "cp" command.

 

Calculate ML trees for each of the 5 pseudoreplicate alignments

If one were to calculate the value of some parameter from the set of multiple pseudo-replicate datasets, this variability will reflect that of estimates of the parameter if it were calculated from "true repeated" samples. Thus, the distribution of the estimates of this parameter calculated from multiple pseudoreplicate datasets provides an estimate of the "sampling" or "random" error of the measurement.

 

The parameter that we are interested in estimating from the datasets is the topology of the phylogenetic tree. Thus, we need to estimate a phylogenetic tree from each of the pseudoreplicate datasets. To do this we will use PHYML with the JTT + gamma (4 discrete categories) + I model.

 

Create a directory /home/<your_username>/day3/ex3/phyml

 

Copy the file containing 5 bootstrapped alignments (created using SEQBOOT) into this new directory

 

Execute PHYML, passing the programme the name of the bootstrapped dataset file when prompted for a filename.

 

Use the interactive text menu to choose the JTT + gamma (4 discrete categories) + I substitution model

Note that you should tell PHYML to run the analysis using an optimised value for the alpha shape parameter of the gamma distribution.

 

Use the "S" option in the text menu to tell PHYML that the alignment file contains multiple datasets.

 

Execute PHYML ("Y" followed by "ENTER")

 

This analysis will produce a file called "outfile" that contains a tree estimated for each of the datasets that were present in the input datafile.

 

Note that you can apply a similar protocol for generating non-parametric bootstrapped tree-sets using ANY phylogenetic analysis method simply by creating an appropriate number of bootstrap datasets using SEQBOOT and then running your analysis of choice to estimate a tree from each of these datasets - although this might be easier or harder than things are here.

 

Based on the previous day’s exercises, can you outline a protocol (i.e. list the programmes you would use in order, indicating which file produced by one programme is passed to which file in the next programme) to create 100 non-parametric bootstrapped trees using Neighbor-Joining with a JTT + I + gamma substitution model?

Hint: this requires the use of SEQBOOT, PUZZLE and NEIGHBOR

 

Interlude: NJ bootstrapping using CLUSTALX

PHYML will probably take some time to calculate trees from each of these datasets. To keep us busy, we will calculate bootstrap percentages from the same dataset using a different estimation algorithm and a different substitution model - those implemented in the CLUSTALX software. The algorithm used is exactly the NJ algorithm implemented in the PHYLIP programme NEIGHBOR. However, CLUSTALX does not calculate distances between sequences using ML - rather it uses a much faster (but less accurate) substitution model.

 

Create a directory /home/<your_username>/day3/ex3/nj

 

Save the EF1A alignment dataset (pir format) into this directory

 

Load the alignment into CLUSTALX

> clustalx <ef1a_alignment_filename.pir> &

 

From the "Trees" menu choose the "Correct for multiple substitutions" option

This chooses a better, although still very fast and still less accurate than ML, substitution model.

 

Choose the "Bootstrap N-J tree" option from the tree menu and use the default of creating 1000 bootstrap datasets.

 

Execute the calculation by clicking the "OK" button.

 

The estimated tree will be written to a file called <ef1a_alignment_filename>.phb

 

Examine the tree file using NJPLOT.

 

Load the alignment into CLUSTALX and use CLUSTALX to estimate a Neighbor-Joining tree with 1000 bootstrapped datasets (correcting for multiple substitutions). Load the resulting tree file (with suffix “.phb”) into NJPLOT and examine the estimated phylogeny, along with the corresponding bootstrap percentages for each branch.

 

Do you consider that the tree estimated by CLUSTALX is likely to be correct/accurate?

While considering this question, ask yourself what prior assumptions/hypotheses about the evolution of the sequences you are using to answer it - remember what you learnt during the exercises you did on the first day of the course!

 

Are there some branches in the tree that you consider very likely to have been misplaced?

 

What is it about these branches that makes you suspect that they have been misplaced?

 

Are the positions of these probably-misplaced branches within the tree estimated precisely (i.e. with high bootstrap support)?

 

Make a note of the answers to these questions, for comparison to the results of the ML analysis

 

ML analysis revisited

Returning to the ML analysis...

 

By estimating bootstrap percentages from a large number of datasets, you obtain a better estimate of the precision of your analysis. As already mentioned, one typically uses at least 100 datasets. To obtain a set of 100 trees estimated from bootstrapped datasets, we will add the five trees that you have estimated using PHYML to 95 trees that we estimated for you.

 

Use the unix "cat" command to create a file containing your 5 trees and the 95 previously-estimated trees:

> cat <5TreeFilename> <95TreeFilename>   >   <newFilename>

In the command above, "new_filename" is the name of a file containing 100 trees.

 

(This somewhat contrived route to obtaining a set of 100 bootstrapped trees is necessary to save time - it would have taken too long for you to have calculated all 100 trees yourself during the session today. Normally one would create them all using the same run of PHYML.)

 

You now need to create a consensus tree from the set of 100 trees, to summarise the information present in the set of trees estimated from the bootstrapped alignment datasets.

 

Use CONSENSE to estimate the 50% majority rule (extended) consensus tree from these 100 trees.

 

As you are aware, CONSENSE calculates the TOPOLOGY of the consensus tree, with branch lengths on the tree indicating the bootstrap support of the splits that they represent. Thus, the tree produced by CONSENSE does not contain information about estimated number of substitutions occurring along each of the lineages of the consensus topology. Thus, you should next use PUZZLE to estimate ML branch lengths for this tree topology using the same substitution model that you used in PHYML.

 

To do this, you need to present PUZZLE with two files - one containing a tree file describing the topology of the consensus tree, the other an alignment file. We require these two files as we are not using PUZZLE to search between different topologies to identify an optimal (ML) topology. Instead we want, FOR A GIVEN TOPOLOGY (that is provided in the tree file), to determine the ML set of branch-lengths.

 

The alignment file should be called "infile", the tree file should be called "intree".

 

Create a new directory to contain the results of the PUZZLE analysis

 

Copy the CONSENSUS output consensus tree into this new directory, and give it the name "intree"

 

Copy the initial alignment file (phy-format) into the new directory, and call it "infile"

 

Execute PUZZLE

> puzzle

 

From the interactive text menu, use "B" to set the type of analysis as "Tree reconstruction"

 

Use "M" to specify the exchangeability matrix to be used to the one used by PHYML (i.e. JTT)

 

Specify that you want to model between-site rate heterogeneity using "W" to choose the "Mixed" model, and "C" to specify the number of discrete rate categories used to approximate the gamma distribution to 4.

 

Finally type "Y" followed by "ENTER" to run the analysis.

PUZZLE writes the tree to the file "outtree". This tree has the same topology as that of the input tree (the consensus tree), with the set of branches that maximised the likelihood of the data given the topology and the substitution model.

 

Print out the tree obtained from PUZZLE, and use the consensus tree to label the internal branches of the PUZZLE tree topology with the corresponding bootstrap percentages.

 

When comparing the results of the two bootstrap analyses (ML and NJ), REMEMBER TO THINK OF THE  BOOTSTRAP VALUES AS PERCENTAGES! Thus, from the NJ tree you did 1000 bootstraps, so a branch that has 1000 bootstrap support has a 100% bootstrap percentage, while one that has 100 bootstrap support has only a 10% bootstrap support. However, if you had run only 100 bootstrap analyses, a branch with 100 bootstrap support corresponds to a 100% bootstrap percentage.

 

Does the result of this ML analysis contain similar or the same probably-misplaced branches as that of the NJ analysis?

 

If so, is the bootstrap support for branches that you considered misplaced on the NJ tree very different on this ML-bootstrapped tree?


 

 

Exercise 4: Posterior probabilities using MrBayes

Aims of the exercise:

 

In the exercise, you will be provided with an alignment file containing data that you will analyse using MRBAYES. Firstly, you need to convert the format of this file such that it can be read by MRBAYES. You then load data into MRBAYES, specify the parameters that describe the way in which MRBAYES should carry out the exercise, and then set the analysis running. You will then interpret the results of this Bayesian analysis and compare these results to those from the ML and NJ analyses carried out in exercise 3.

 

Files

EF1A dataset protein MSA (pir format): ef1a.pir

Datasets

For this analysis, we are going to work with the same elongation factor dataset as used for exercise 3.

 

Convert dataset into format readable by MRBAYES

A frequent step in almost all phylogenetic analyses is the conversion of files (whether they contain trees or multiple sequence alignments) from one format to another. This is inevitable as different programmes accept data in different formats.

 

MRBAYES accepts alignments in the NEXUS format. CLUSTALX converts files from pir to NEXUS format, although a couple of aspects of NEXUS format as written by CLUSTALX are not compatible with the NEXUS format accepted by MRBAYES. Therefore, the first step in the Bayesian analysis is to convert the pir format file into NEXUS format using CLUSTALX. The resulting NEXUS file must then be edited by hand to make it acceptable to MRBAYES

 

Create a directory /home/<your_username>/day3/ex4

 

Download EF1A dataset into this new directory.

 

Load EF1A dataset into CLUSTALX

> clustalx <ef1a_dataset_in_pir_format> &

 

Use the File->"Save sequence as" option to save this alignment in NEXUS format

 

Load the CLUSTALX-created NEXUS format file into the EMACS file editor

> emacs <ef1a_dataset_in_nexus_format> &

 

Use EMACS (see linux usage notes for help) to both:

·      remove the line that begins ‘symbols’ from the top of the file

·      immediately after the word ‘interleave’ near the top of the file, add ‘=yes’ to give “interleave=yes”

 

Save the edited file under a new name (e.g. ef1a_ed.nxs)

 

Begin running MRBAYES on dataset

MRBAYES uses a different form of interface than those of the programmes you have encountered so far - it is very similar to that of the PAUP* software. The interface provides a command-line-like environment in which you can carry out a range of different actions on your data.

 

After starting the programme, the first step is to load your alignment data into MRBAYES. After that, you will specify the substitution model to be used in the analysis, followed by parameters describing the way in which the MCMC will be run.

 

Execute MRBAYES using “mb”.

> mb

 

Load the alignment file into MRBAYES by typing the following at the prompt:

“execute <name_of_edited_filename_in_nexus_format>”

 

Specify the substitution model to be used in the analysis by typing at the prompt:

“lset aamodel=jones rates=invgamma basefreq=estimate”

 

Tell the mcmc to stop running when it gets to the end of its pre-specified number of generations

“set autoclose=yes”

 

Specify the way in which the mcmc should be run using:

“mcmcp autoclose=yes ngen=25000 printfreq=200 samplefreq=20 nchains=4 savebrlens=yes”

 

Run the mcmc using:

“mcmc”

 

NOW STOP - and listen to the introductory talk! - hopefully what you hear will make it somewhat easier to understand what you just did and why!

 

Examine output of MRBAYES

We now need to check to see if the chains have reached "stationarity". By this we mean that we want to determine whether the Markov chains being run by MRBAYES appear to be sampling from the posterior probability distribution of the dataset, or whether the chains are still in the "burnin" phase. We can do this by examining the way in which the likelihoods of the trees sampled by the Markov chains vary with generation time. During the burnin phase, likelihoods increase with generation time - once trees are being sampled from the posterior probability distribution, there is no overall trend for likelihoods to increase with generation time.

 

After the analysis has run, you should find that a file has been created in the directory in which you have run MRBAYES that has the suffix “.out.p”. This file contains a list of the likelihoods of the trees sampled by the Markov chains.

 

We want to look at the variation of likelihood with generation time using a spreadsheet. This will allow us to plot a chart to gain an overview of this variation. However, the spreadsheet we will use to examine this data will only work if the name of the file being analysed ends with ".csv". Therefore:

Copy the file containing the likelihood/generation number data to a file with the same name, except that you add “.csv” at the end of the filename.

> cp <filename_prefix>.out.p <filename_prefix>.out.p.csv

 

We will use the spreadsheet software in the OpenOffice suite:

Start up OpenOffice by clicking on the desktop icon labelled “Office”

 

Use the file menu to open the file you just created (whose name ends with “.csv”)

 

You will be presented with a page that asks you to specify the character used to separate columns of data in this file.

Choose “Tabs” and then open the file.

 

Select the first 2000 rows of the first two columns of the spreadsheet.

 

Choose "Chart" from the INSERT menu.

 

Continue through the dialogue boxes presented to you until you are asked what kind of chart (pie chart, bar chart etc.) you would like to draw. Choose “XY plot”, and then indicate that you want to draw the chart.

This should draw a chart that plots the variation in probability with generation number for your dataset.

 

Determine whether the probability has reached a stable horizontal line (i.e. stationarity). If so, note the generation number after which you are certain that stationarity has been reached.

 

As discussed in the presentation this morning, the procedure of determining a Bayesian estimate of phylogeny involves building a consensus tree from the set of trees sampled by the Markov chains after the chains have started to sample from the posterior probability distribution of the trees. Thus we want to discard the set of trees that were sampled before the chains began to sample from this distribution, and need to pass this information to MRBAYES. However, MRBAYES does not expect that the set of trees to be discarded will be specified using the range of generation numbers whose trees should be discarded. Instead we need to specify simply the NUMBER of trees to be discarded. To do this you need to divide the number of generations visited prior to stationarity by the frequency with which trees were sampled from the chain - the chain does not write every single tree that it visits to a file, only every Nth tree, where N is the sample frequency used (and specified by you prior to running the chains.)

 

Calculate the number of trees that should be discarded from the set of sampled trees because they were sampled prior to stationarity.

 

You now need to create a consensus tree from the set of trees sampled by MRBAYES after stationarity has been reached.

Returning to the MRBAYES terminal window, you need to type the following instruction on one line:

sumt filename=<filename_created_by_mrbayes_with_.out.t_suffix> contype=allcompat burnin=<num_of_trees_to_discard_as_burnin>

 

Exit MRBAYES with Control-C

 

We now want to examine the tree estimated by MRBAYES. The tree written by MRBAYES is in NEXUS format. Unfortunately, this cannot be read by NJPLOT, so we need to use a different tree visualisation tool called TREEVIEW. However, even the NEXUS format tree written by MRBAYES cannot be viewed immediately by TREEVIEW. Thus, we need to open the tree file written by MRBAYES in the EMACS editor, edit the contents of the file, and save it to a file with a different name.

 

Use EMACS to view the file that has been written by MRBAYES that has a “.con” suffix (this file contains the consensus tree).

 

The order of columns in the “translation table” at the top of the file should be re-ordered (initially, the order is “<corresponding_sequence_name> <number>” - this should be altered so that the order is reversed for all rows.)

Use EMACS to edit the file in this way

 

Save this file with a new name.

 

Start TREEVIEW from the command line:

> tv

 

Use TREEVIEW's menus to load the edited NEXUS format tree file into TREEVIEW.

 

We want to be able to view both (i) tree topology and branch lenghts of the Bayesian consensus tree and (ii) the posterior probabilities calculated by MRBAYES for each of the splits present in the consensus tree. The information about the proportion of trees in the dataset from which the consensus tree was built that contained each split is contained in the file written by MRBAYES with a ".parts" suffix. The easiest way to view this set of information is to print out the tree (with topology and branch lengths) from TREEVIEW, and then label the internal branches of this tree using the posterior probability values in the ".parts" file using a pencil/pen. Note that the values in the ".parts" file are percentages - however they should be interpreted as probabilities. A value of 100 associated with a split in ".parts" should be labelled on the tree as "1.0"

 

Print the tree displayed by TREEVIEW

 

Examine the file created with suffix “.parts” using EMACS or LESS. Label the tree that you printed from TREEVIEW with the posterior probability values that you find in the ".parts" file.

 

Compare the tree (along with the posterior probability values) to the trees estimated estimated in exercise 3.

 

Is the tree estimated by MRBAYES more accurate than the results of the NJ and ML analyses?

 

Do posterior probabilities for branches that are common/shared/present-in-all-of NJ, ML and Bayesian trees tend to be higher or lower than the corresponding bootstrap percentages?

 

For branches that are different between NJ, ML and Bayesian analysis, does there tend to be high posterior probability or bootstrap percentage?