Interpreting and Estimating Phylogenies

Exercises and Demonstrations

Monday 30th - Tuesday 31st March 2009

Wellcome Trust Advanced Course on Molecular Evolution

Adrian Friday and Aidan Budd

Viewing and Manipulating Unscaled Trees with NJplot

Teaching Objectives
After completing this section, you will hopefully be able to use NJplot to change the way a phylogenetic tree is drawn, by changing the root and rotating subtrees around internal branches, to reach a specified/desired representation of the tree.
Many applications of phylogenies involve using them to check whether a given set of taxa/organisms/OTUs are related to each other in a particular way i.e. whether the topology of the estimated phylogeny supports a particular set of relationships between OTUs. As the same topology can be drawn in different ways, it is useful to be able to look at and manipulate a topology to see whether it supports a given relationship.
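The point that one topology can be drawn in several ways can be illustrated with a minimal Python sketch. This is not part of NJplot: the parser below handles only simple NEWICK strings (no quoted labels or comments), and the function names and example trees are hypothetical. Sorting the children of every internal node gives a canonical form, so two rooted trees that differ only by rotating subtrees compare as equal.

```python
def parse_newick(text):
    """Parse a simple NEWICK string into nested tuples of leaf labels.

    Branch lengths and internal node labels are read and discarded.
    """
    text = "".join(text.split()).rstrip(";")
    pos = 0

    def read_until(stops):
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in stops:
            pos += 1
        return text[start:pos]

    def node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1                      # consume "("
            children.append(node())
            while text[pos] == ",":
                pos += 1                  # consume ","
                children.append(node())
            pos += 1                      # consume ")"
        label = read_until(",():")
        if pos < len(text) and text[pos] == ":":
            pos += 1
            read_until(",()")             # discard the branch length
        return tuple(children) if children else label

    return node()


def canonical(tree):
    """Sort the children of every internal node so rotations do not matter."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canonical(child) for child in tree), key=repr))


# Same rooted topology, drawn with subtrees rotated (hypothetical example)
same = canonical(parse_newick("((A,B),(C,D));")) == canonical(parse_newick("((D,C),(B,A));"))
# A genuinely different rooted topology
different = canonical(parse_newick("((A,B),(C,D));")) != canonical(parse_newick("((A,C),(B,D));"))
print(same, different)
```

Note that this comparison treats the trees as rooted: two drawings that differ in the position of the root would need to be compared in another way.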

NJplot does not have many features and options. However, it very quickly carries out the simple tasks of re-rooting and rotating branches, which is often all you need when taking a first look at a phylogeny. These tasks are very useful when attempting to determine whether a given set of relationships exists in a tree, or when comparing two tree topologies.

This page describes how to carry out these kinds of manipulations using NJplot - we will also demo them for you.
Exercise 1
Load the following NEWICK/PHYLIP format file into NJplot and use the software to try and reproduce as closely as possible the following image:

If the previous exercise was easy, try the same thing with the following file and image:

Investigating Branch Lengths and Scaled Trees using MESQUITE

Teaching Objectives
After completing this section, you will hopefully have gained at least an intuitive understanding of the usual interpretation of branch lengths of a molecular phylogeny.
The trees used in the NJplot exercise above are "unscaled" i.e. the lengths of the branches on the tree are arbitrarily assigned to provide a convenient representation of the tree. For example, in the NJplot-screenshots above, branch lengths are chosen to align all OTU labels on the right side of the tree.

In many/most cases, however, you will be using and manipulating trees where branch length represents the amount of change along a lineage, typically measured as expected substitutions per alignment column.
This file uses MESQUITE to demonstrate the relationship between the amount of change associated with a branch (as represented by a DNA sequence alignment) and branch lengths.
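As a small worked illustration of this interpretation (a sketch only, with made-up numbers and hypothetical function names), a branch length given in expected substitutions per alignment column can be converted to an expected number of substitutions over the whole alignment, and the distance between two OTUs is the sum of the branch lengths on the path connecting them:

```python
def expected_substitutions(branch_length, n_columns):
    """Expected number of substitutions along a branch, over the whole alignment."""
    return branch_length * n_columns


def path_length(branch_lengths):
    """Distance between two OTUs: sum of branch lengths on the path joining them."""
    return sum(branch_lengths)


# Hypothetical values: a branch of length 0.25 on a 400-column alignment
print(expected_substitutions(0.25, 400))

# Hypothetical path between two OTUs made up of three branches
print(path_length([0.125, 0.25, 0.125]))
```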

Formatting Phylogenetic Tree Figures with Dendroscope

Teaching Objectives
After completing this section, you will hopefully be able to manipulate the representation (root, branch rotation, formatting) of a phylogenetic tree using Dendroscope, to provide a starting point for preparing trees for use in figures.
While NJplot is good for quickly examining a tree, it provides only fairly limited tools to manipulate the representation of a tree. To prepare representations of trees for use in figures, we therefore typically begin by visualising the tree in Dendroscope, using it to make many of the formatting changes we need to emphasize appropriate features of the tree. Note, however, that in almost all cases we carry out further changes and decoration of the tree using image software such as the GIMP or Adobe Illustrator.

This page shows how to carry out some common formatting tasks using Dendroscope - we will also demo them for you.
Exercise 2
Load this tree file into Dendroscope and manipulate the tree representation until it resembles the image below.

If you are having problems obtaining a representation like this, you can load this Dendroscope format file into Dendroscope - this includes the above tree saved in the state used to create the above figure.

Data Formats

Teaching Objectives
After completing this section, you will hopefully be able to:
Much of the software used to estimate, manipulate, and visualize phylogenetic data is produced by relatively small teams of developers, primarily for use in their own research. As a result, they typically have only limited time, resources, and motivation available to design and prepare the interface of the software to be compatible with a wide range of different data formats.

Thus, a common situation when working with phylogenetic data is that the output obtained from one tool must be adjusted before it can be successfully used by another tool.

The task of determining the changes that need to be made can be somewhat confounded by the error messages reported by software due to incompatible input data - these may be either somewhat cryptic or absent, making it difficult to diagnose the reasons for the incompatibility of the data.

Things to look out for when formatting data for use by phylogenetic software are to, if possible:

Tree (NEWICK/PHYLIP Format) Data

Most software that operates on phylogenetic trees uses some derivative of the NEWICK/PHYLIP format for input/output of trees, as described during the presentation.

To give you some practice overcoming typical problems associated with inputting tree data into phylogenetic software, we have put together some exercises to help familiarize you with this format.
We'll show you how to draw "by hand" the tree corresponding to a given NEWICK string using the following example:


Which should yield something like the image shown below:

We'll also demonstrate doing this the other way around i.e. we'll try and write the NEWICK format string corresponding to the tree shown above - and then we'll try and alter the string we come up with to correspond instead to the tree shown below.
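The correspondence between a NEWICK string and a drawn tree can also be sketched in Python. The example below is an illustration only: it ignores branch lengths, the tree and function names are hypothetical, and the nested tuples simply mirror the nesting of the NEWICK parentheses. One function writes the NEWICK string for a tree, the other prints a rough indented outline of the same tree.

```python
def to_newick(tree, top=True):
    """Write a nested-tuple tree as a NEWICK string (no branch lengths)."""
    if isinstance(tree, str):
        text = tree
    else:
        text = "(" + ",".join(to_newick(child, top=False) for child in tree) + ")"
    return text + ";" if top else text


def outline(tree, depth=0):
    """Return the lines of an indented outline: '+' marks an internal node."""
    indent = "  " * depth
    if isinstance(tree, str):
        return [indent + tree]
    lines = [indent + "+"]
    for child in tree:
        lines.extend(outline(child, depth + 1))
    return lines


# Hypothetical rooted tree: A and B are each other's closest relatives
tree = (("A", "B"), "C")
print(to_newick(tree))
print("\n".join(outline(tree)))
```

Each pair of parentheses in the string corresponds to one internal node in the outline, which is the same bookkeeping needed when drawing the tree by hand.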

Exercise 3
Draw on paper the phylogenetic trees corresponding to the following NEWICK/PHYLIP format trees - be careful to check whether the trees are specified as rooted or unrooted, and draw them accordingly. Check your results by comparing the trees you draw with the images linked below each of the two trees.


Link to tree image


Link to tree image

Now try this from the other side - write and save in a text editor NEWICK/PHYLIP format representations of the trees shown below. To check whether the file you write does indeed represent the appropriate tree, try loading the file into Dendroscope.

This first tree is unscaled - so do not attempt to include information about branch lengths in your NEWICK format tree. (Here is a file that contains a NEWICK/PHYLIP format tree that should yield the tree seen below.)

For the next tree, try to include branch length information in your NEWICK/PHYLIP format tree. (Here is a file that contains a NEWICK/PHYLIP format tree that should yield the tree seen below.)

Sources of Pre-Calculated Trees

Sometimes, rather than estimating a phylogenetic tree yourself, it will be enough to simply examine a tree obtained elsewhere.

However, such trees are often formatted in a way that is incompatible with tree visualization (or other phylogenetic) software. Therefore, obtaining and visualising such trees provides further practice at interpreting and manipulating NEWICK format trees.

This page describes how to obtain trees (and in some cases alignments) from several different websites - we will also demo them for you.
Exercise 4
Below is a list of websites which are sources of pre-calculated phylogenies -  use each of these sites to obtain trees that include the human cyclin F sequence (UniProt Entry Name: CCNF_HUMAN, UniProt Primary Accession Number: P41002).

After downloading the trees, try to load them into NJplot and Dendroscope
Note that you may need to edit the format of the downloaded trees for them to be accepted/correctly loaded into the software.
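One kind of edit that is sometimes needed can be sketched in Python. In NEWICK files, comments and extended annotations (for example NHX tags) are enclosed in square brackets, and some viewers may not accept them. The function below is a minimal sketch under that assumption: the function name and example tree are hypothetical, and which edits a given download actually needs for NJplot or Dendroscope has to be checked case by case.

```python
import re


def clean_newick(text):
    """Strip [...] comments and whitespace, and make sure the tree ends with ';'."""
    text = re.sub(r"\[[^\]]*\]", "", text)   # remove bracketed comments/annotations
    text = "".join(text.split())             # remove spaces and line breaks
    if not text.endswith(";"):
        text += ";"
    return text


# Hypothetical annotated tree spread over two lines
raw = "((A[&&NHX:S=human]:0.1,B:0.2)[&&NHX:D=Y]:0.05,\n C:0.3)"
print(clean_newick(raw))
```

Removing all whitespace is only safe when labels contain no spaces, so it is worth checking the result by eye before loading it.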




There are quite a few other sites that provide trees that can be downloaded, for example:

Exploring Large Phylogenetic Trees

Teaching Objectives
After completing this section, you will hopefully be able to interpret and explore very large phylogenetic trees using Dendroscope.
As computers continue to increase in speed, we are able to calculate ever larger phylogenetic trees. Thus, it is now not unusual to be faced with the problem of examining trees with hundreds or even thousands of taxa. The tree visualization tools we have used so far (or at least the way we have been using them) are not well designed for this task: large trees are too dense to visualize easily unless we can focus on the regions of particular interest, ideally while keeping some kind of overview of where the region of focus lies within the overall tree.

However, combined with search and format options, the "Magnifier" tool of Dendroscope makes this kind of task much easier.

This page describes how to use Dendroscope in this way - we will also demo it for you.
Exercise 5
  1. Load this tree file into Dendroscope
  2. Find all OTUs that are from humans - as these are all taken from ENSEMBL, the labels for these OTUs should all contain the substring "ENSP00"
  3. Use the Format box to colour all these OTUs red
  4. Examine the tree to identify the human sequence that is most similar to the fly "CG7922" sequence - this is the same human sequence that forms the smallest clan that contains CG7922 and a human sequence (however, note that if we want to avoid making assumptions about the location of the root of the tree, we can't describe this as the human sequence that is most closely related to the fly CG7922 sequence)
If you're having trouble identifying the appropriate fly sequence, this Dendroscope file has all human taxa labeled in red, with the CG7922 sequence labeled in blue
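The search in step 2 can also be done outside Dendroscope. The following Python sketch (function names and the example tree are hypothetical, and only simple unquoted labels are handled) pulls the OTU labels out of a NEWICK string and keeps those containing a given substring:

```python
import re


def leaf_labels(newick):
    """Return OTU labels: names that directly follow '(' or ',' in the string."""
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


def matching_labels(newick, substring):
    """Return the OTU labels that contain the given substring."""
    return [label for label in leaf_labels(newick) if substring in label]


# Hypothetical tree with two human (ENSEMBL) sequences and one fly sequence
newick = "((ENSP00000123:0.1,CG7922:0.2):0.05,ENSP00000456:0.3);"
print(matching_labels(newick, "ENSP00"))
```

Labels of internal nodes (such as support values) follow a closing parenthesis, so they are not picked up by this pattern.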

Editing Trees Using MESQUITE

Teaching Objectives
After completing this section, you will hopefully be able to create NEWICK format files of a desired topology and set of branch lengths using MESQUITE, beginning either "from scratch", or by modifying an existing tree.
We usually work with phylogenies that have been directly estimated from a dataset - typically a protein or DNA multiple sequence alignment. However, in certain situations we do not want or need to estimate a phylogeny - instead we can either create it completely "from scratch", or simply modify an existing phylogeny. These modifications might involve changing branch lengths, topology, and/or the rooting of the tree.

Typical uses of such "edited" trees are preparation of figures for publications/presentations e.g. a "cartoon" figure showing a consensus view of the relationships for a set of organisms, or when carrying out tests that compare several different specified phylogenetic hypotheses e.g. when applying the approximately unbiased (AU) test to an alignment and a set of phylogenies.

As already mentioned, MESQUITE is a very flexible tool for the analysis of phylogenetic data - and one part of its functionality enables us to edit trees in this way.

This page describes how to use MESQUITE in this way - we will also demo it for you.
Exercise 6
Load this tree file into MESQUITE and edit it to yield the topology and branch lengths shown below.

Export the file from MESQUITE and then use Dendroscope to produce an image from it similar to the one shown below.

Create from scratch a phylogeny using MESQUITE with the topology and branch lengths shown below.

Reconciling Species and Gene Trees Using Mesquite

Teaching Objectives
After completing this section, you will hopefully appreciate that any gene tree can be consistent with any species tree, given inference of appropriate gene duplications and losses.
We will show how to compare/reconcile a set of rooted trees (all with the same unrooted topology) with a species tree using MESQUITE.

This NEXUS format file can be loaded into MESQUITE - it includes three gene trees (shown in the image below - all have the same unrooted topology and branch lengths) and a species tree - the same as used in the quiz from the presentation.

Using MESQUITE's Analysis->Visual Tree Analysis->Contained Gene (or Other) Tree command, we can identify the minimum number of duplication and deletion events that must be inferred to reconcile the different gene trees with the species tree.

This page describes how to use MESQUITE in this way


Teaching Objectives
After completing this section, you will hopefully be able to:
When carrying out a phylogenetic analysis, we often need to summarize the similarities/differences between a set of phylogenies e.g. the set of phylogenies sampled after the burn-in phase in a Bayesian analysis of phylogeny. Many of the ways in which sets of trees are summarized make use of the concept of "phylogenetic splits".

A phylogenetic split is the two sets of OTUs associated with the two ends of a branch on a phylogenetic tree. For example, in the tree below, the split associated with the red branch is


The union of the two sets that make up the split is the complete set of OTUs from the tree, and the two sets should be disjoint (i.e. not sharing any OTUs in common).

Splits may be described as "trivial" or "non-trivial" - in a trivial split, one of the two sets contains just a single OTU, while both sets in a non-trivial split contain more than one OTU.

For example, there are 5 trivial splits for the tree shown below (one for each terminal branch)


And 2 non-trivial splits (one for each internal branch on the corresponding unrooted tree)


Note also that the two sets of a split are unordered - thus, "AB | CDE" describes the same split as "AB | ECD"
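The definitions above can be made concrete with a short Python sketch. It is an illustration only: the five-taxon tree is hypothetical (it is not necessarily the tree shown in the figures), trees are written as nested tuples, and the function names are ours. Every branch contributes one split, namely the OTUs on one side of the branch versus all the others.

```python
def leaves(tree):
    """Set of OTU labels below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """All splits of a tree, each stored as an unordered pair of OTU sets."""
    all_otus = leaves(tree)
    found = set()

    def visit(node):
        side = leaves(node)
        other = all_otus - side
        if other:                          # skip the root, which has no branch above it
            found.add(frozenset([side, other]))
        if not isinstance(node, str):
            for child in node:
                visit(child)

    visit(tree)
    return found


def is_trivial(split):
    """A split is trivial if one of its two sets holds a single OTU."""
    return min(len(side) for side in split) == 1


def show(split):
    """Format a split in the 'AB | CDE' style used in the text."""
    return " | ".join(sorted("".join(sorted(side)) for side in split))


tree = (("A", "B"), "C", ("D", "E"))       # hypothetical unrooted five-taxon tree
nontrivial = sorted(show(s) for s in splits(tree) if not is_trivial(s))
trivial = sorted(show(s) for s in splits(tree) if is_trivial(s))
print(len(trivial), len(nontrivial), nontrivial)
```

Because each split is stored as an unordered pair of sets, drawing the same tree with a root placed on one of its branches, for example ((("A", "B"), "C"), ("D", "E")), yields exactly the same set of splits.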

A further feature of (a set of) splits is compatibility

A pair of splits are incompatible if it is impossible to draw a tree that contains both of them - instead, to include both of them in a diagram, we would need to use a split network. Likewise, a set of splits is compatible if it is possible to include all of them in a tree. Clearly, the set of splits that are described by a given tree will always be compatible.

As an example, the following two splits are incompatible:

AB | CDE
AC | BDE

(if you want, try [and fail!] yourself to build a tree that contains both of these splits)

You can identify whether a pair of splits are compatible or not by considering the intersections of the split sub-sets. Where exactly one of the sub-set intersections is empty, then the splits are compatible - otherwise the splits are incompatible.

Taking the example above, the intersections are (using "n" to indicate intersection):

{A,B} n {A,C} => {A}
{A,B} n {B,D,E} => {B}
{C,D,E} n {A,C} => {C}
{C,D,E} n {B,D,E} => {D,E}

All four intersections are non-empty - the splits must be incompatible

In contrast, the following two splits are compatible:

AB | CDE
CD | ABE

{A,B} n {C,D} => {}
{A,B} n {A,B,E} => {A,B}
{C,D,E} n {C,D} => {C,D}
{C,D,E} n {A,B,E} => {E}

as one (and only one) of the intersections is the empty set.
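The intersection test can be written as a few lines of Python. This is a minimal sketch with a hypothetical function name; each split is given as a pair of sets. The code accepts a pair of splits if at least one of the four intersections is empty, which agrees with the "exactly one" rule above for any two different splits (two identical splits give two empty intersections and are, trivially, compatible).

```python
def compatible(split1, split2):
    """Return True if the two splits could both appear in the same tree."""
    a1, a2 = split1
    b1, b2 = split2
    intersections = [a1 & b1, a1 & b2, a2 & b1, a2 & b2]
    return any(len(common) == 0 for common in intersections)


ab_cde = (set("AB"), set("CDE"))
ac_bde = (set("AC"), set("BDE"))
cd_abe = (set("CD"), set("ABE"))

print(compatible(ab_cde, ac_bde))   # the incompatible pair from the text
print(compatible(ab_cde, cd_abe))   # the compatible pair from the text
```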
Exercise 7
(i) Identify the list of all non-trivial splits for the following tree - check here for the answers

(ii) Try the same exercise with this larger tree - again, you can find the answers here

(iii) There is only one bifurcating tree that is consistent with the set of splits listed below. Draw this tree - check here for the answer.




(iv) Here are the splits for a larger tree, this time with 12 taxa - if you've time, try and repeat the above exercise with this set of splits, building the unique bifurcating tree that is consistent with this set of splits - check here for the answer.










Building Consensus Trees by Hand

Teaching Objectives
After completing this section, you will hopefully be able to construct a consensus tree "manually" from a set of trees (all of which describe relationships between the same set of OTUs).
Consensus trees summarize the set of splits described by multiple phylogenetic trees. For example, a consensus tree might include all splits present in 80% of the trees.
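The counting behind a consensus tree can be sketched in Python. This is an illustration under stated assumptions: each tree is represented only by its set of non-trivial splits, every split is written in one agreed form (here as strings such as "AB|CDE"), and the three example trees and function names are hypothetical. The strict consensus keeps the splits found in every tree, while the majority consensus keeps those found in more than half of the trees.

```python
from collections import Counter


def split_counts(trees):
    """Count in how many trees each split occurs."""
    return Counter(split for tree in trees for split in tree)


def strict_consensus(trees):
    """Splits present in all of the trees."""
    counts = split_counts(trees)
    return {split for split, n in counts.items() if n == len(trees)}


def majority_consensus(trees):
    """Splits present in more than 50% of the trees."""
    counts = split_counts(trees)
    return {split for split, n in counts.items() if n > len(trees) / 2}


# Three hypothetical trees on the OTUs A-E, as sets of non-trivial splits
trees = [
    {"AB|CDE", "ABC|DE"},
    {"AB|CDE", "ABD|CE"},
    {"AB|CDE", "ABC|DE"},
]
print(sorted(strict_consensus(trees)))
print(sorted(majority_consensus(trees)))
```

Splits found in more than half of the trees are always compatible with each other, which is why they can be drawn together as a single tree.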
Exercise 8
From the set of six trees presented below, build both the unrooted (i) strict consensus tree and (ii) 50% majority tree.

If you're having trouble building these trees, click on the links supplied to view the strict consensus and the majority tree (where branches are labeled by the number of times a given split is observed amongst the total of 6 trees used to build the tree).

Using SplitsTree and CONSENSE to build Consensus Trees and Networks

Teaching Objectives
After completing this section, you will hopefully be able to process a file containing a set of trees, specified in NEWICK format, all describing relationships between the same set of OTUs and:
A range of different software is available to calculate consensus trees and networks.

We will begin by using SplitsTree - a Java-based tool with a graphical user interface which builds either strict or majority consensus trees, and also split networks.

Other software, such as CLANN or CONSENSE (one of the programs in the PHYLIP package), provides more flexibility in the types of consensus trees that can be built - we will follow the SplitsTree exercise by using CONSENSE to build some consensus trees. This also gives us an opportunity to become familiar with the PHYLIP package.
Exercise 9 - Using SplitsTree
This page describes how to use SplitsTree to build consensus trees and split networks - we will demo this for you

Load this set of 100 trees into SplitsTree and calculate
If you have trouble calculating these trees/networks, then follow the links below to download (i) NEXUS format files with the trees/networks pre-calculated [which can be loaded for viewing directly into SplitsTree] and (ii) images of the trees/networks.
By examining the strict consensus tree, identify those splits found in all 100 of the trees - the set of these splits can be found here.

By examining the majority consensus tree, determine how many of the trees contain the clan: (EF1A1_HUM, EF11_MOUS, EF1A_CHIC)

Examining the consensus network, identify the most frequent split found in the trees that is incompatible with the split EF1A1_HUM, EF11_MOUS, EF1A_CHIC | others. Determine also how many trees have this incompatible split

Within the set of trees, the taxon xILC49472 is most often found in two relatively small (mutually incompatible) clans. Identify these clans, and determine how many trees they are each found in.

If you have trouble answering the last few questions, check the answers here.
Exercise 9B - Using CONSENSE
This page describes how to use CONSENSE to build consensus trees - we will demo this for you

Use CONSENSE to build the:
of the 100 trees used above for the SplitsTree exercise.

Check your results by comparing them to the pre-calculated ones provided below:
Some questions that are just designed to give you a focus for examining these trees:

Do any of the consensus trees have the same number of polytomies?

Do any of the consensus trees have identical topologies?

Check here for the answers to these questions.

Demonstrating Structural Equivalence and Alignments

Teaching Objectives
After completing this section, you will hopefully appreciate the properties that we expect residues in the same column of a sequence alignment to share when the alignment is being interpreted in a "structural" context.
Comparing (and aligning) pairs of similar structures demonstrates what it means for a pair of residues to be "structurally equivalent" - the relationship that we want residues in the same column of an alignment to share if we are using the alignment in a "structural" context e.g. predicting secondary structure, or building a protein profile HMM to carry out a sensitive sequence similarity search.

At the same time, it demonstrates that there may be regions of two structures for which there is not any such equivalence.
This example is a pair of bacterial toxins, 1ji6 and 1i5p, with very similar structures.

We have aligned the N-terminal regions of these structures using FATCAT. It should be clear, when looking at the structural alignment, that the structures are very similar, with most residues sharing 1:1 structural equivalence with a residue in the other structure.

By contrast, we provide below an alignment of two very different structures. Indeed, considered in terms of the kinds of secondary structure elements they contain, they are completely different (one is mainly alpha, the other mainly beta).
Note that we are still able to align these two structures, despite our opinion that they are globally extremely dissimilar. The same is true for multiple sequence alignments - most software will report an alignment whether or not the sequences actually share the global similarity that the software assumes.

Given the global dissimilarity of the alpha solenoid and beta barrel structures, you would almost certainly want to avoid using such an alignment to make any inferences about similarity of function/structure between residues aligned in the same column.

In general, this illustrates the fact that it is important to be very confident that the sequences you include in an MSA indeed share the relationship you are interested in - this will typically be "structural" and/or "evolutionary" equivalence. Note, however, that we are avoiding a discussion of how to judge whether or not sequences do indeed have such a relationship with each other - this issue will be discussed in detail by Bill Pearson and Ewan Birney in the next session.

If you are interested, you might like to try comparing some other pairs of structures and examining their structural and sequence alignments as calculated by FATCAT and CE.

Instructions on how to calculate pairwise structural alignments using CE and FATCAT and display the results in PyMOL
Note that we are suggesting that you use FATCAT and CE as they both provide a pairwise sequence alignment along with easy-to-view structural alignments - not because they are known to provide on average the best quality alignments (although their alignments often seem to be good).

Phylogenetic Analysis: From Start to Finish

Teaching Objectives
After completing this section, you will hopefully:
Going from the formulation of a biological question (that can be investigated using a phylogeny) to obtaining an estimate of a phylogeny is a multi-stage process. It requires the investigator to make many decisions - many of which depend strongly on the specific overall aim/purpose of the analysis i.e. the biological questions the analysis is being used to investigate.

This strong dependence on the specific purpose of an analysis is part of the reason why it is difficult/impossible to provide a one-fits-all detailed protocol/recipe for phylogenetic analyses.

Thus, the demonstration below aims to highlight (i) typical steps/stages and (ii) examples of the kinds of decisions that need to be made when carrying out such an analysis - it is not intended as a blueprint for carrying out the ideal phylogenetic analysis!

Note, also, that while we have described these examples as taking the analysis "from start to finish", in reality the process is much more involved than shown here. One could argue that the process begins considerably earlier than shown here, with the decisions about the biological questions of interest - and that it would need to run on much longer than shown here, for example investigating and testing for a range of different potential sources of systematic error in the analysis, or using these results to identify additional data to be included in the analysis e.g. highlighting a particular set of taxa that might be useful in better resolving particular regions of the phylogeny.
As discussed above, an exercise that involves going from question to phylogeny must be done in the context of a specific biological question.

For this demonstration, we assume that we are interested in understanding the history of the human Histone acetyltransferase GCN5 gene, and its paralog Histone acetyltransferase PCAF. In particular, we want to determine approximately when the duplication that yielded these two genes occurred.

1. Identify the TreeFam family corresponding to these two genes, and download a protein sequence alignment for the family
2. Remove some of the sequences from the alignment - in just about every analysis you do, you'll find you want to remove some of your initial set of sequences from your analysis. You might want to do this for sequences that you feel are:
  1. unnecessary for the analysis as they are identical/nearly identical to other sequences in the alignment
  2. incomplete/fragmented in a way that would exclude large regions of the alignment from later analysis
  3. poorly aligned/likely to contain sequence errors
3. Remove columns from the alignment where you are not confident that all residues in the column are "evolutionarily" equivalent i.e. related via single-residue substitutions.
4. Edit the taxa names in the FASTA format file so that they are all 10 characters long, and contain only capital letters and/or underscores. Ideally, you should be able to identify from the name the organism the sequence comes from and, if there is more than one sequence from the same organism, the name should also make it possible to distinguish between these sequences.
5. Save the alignment in PHYLIP format using CLUSTALX

6. Identify the best substitution model (or at least, the best from a list of models you choose to examine) to use in your phylogenetic analysis, given your alignment
7. Use RAxML to estimate a set of non-parametric bootstrapped trees from this alignment - to keep the analysis as quick as possible, calculate only 10 bootstrapped trees
8. Obtain a single best estimate of the tree from the alignment using a string such as:
9. Combine the results of these two runs to determine the bootstrap support for the branches in the maximum likelihood tree using RAxML e.g.:
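Returning to step 4 of the list above (the renaming of taxa, not the RAxML runs), the naming rule can be sketched in Python. This is only a rough illustration: the function names and example headers are hypothetical, and names produced automatically in this way are usually less readable than names chosen by hand. Each FASTA header is converted to capitals, every other character is replaced by an underscore, and the result is cut or padded to exactly 10 characters; if two names collide, the last character is replaced by a letter.

```python
import re
import string


def phylip_name(raw, used):
    """Build a unique 10-character name of capital letters and underscores."""
    base = re.sub(r"[^A-Z]", "_", raw.upper())[:10].ljust(10, "_")
    candidates = [base] + [base[:9] + letter for letter in string.ascii_uppercase]
    for name in candidates:
        if name not in used:
            used.add(name)
            return name
    raise ValueError("could not build a unique name for " + raw)


def rename_fasta(lines):
    """Return the FASTA lines with every header replaced by a 10-character name."""
    used = set()
    renamed = []
    for line in lines:
        if line.startswith(">"):
            renamed.append(">" + phylip_name(line[1:].strip(), used))
        else:
            renamed.append(line.rstrip("\n"))
    return renamed


# Hypothetical FASTA records
fasta = [">PCAF_human some description", "MSEA", ">fly", "MTTA"]
print("\n".join(rename_fasta(fasta)))
```

Note that digits are also replaced by underscores here, because the rule in step 4 allows only capital letters and underscores.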
Exercise 10
Carry out a similar analysis to that described above.

This time the scenario is that you are interested in the evolution of the human Polyadenylate-binding protein 2 and its paralog Embryonic polyadenylate-binding protein 2. The duplication that yielded the two genes probably occurred after the divergence of the urochordate from the vertebrate lineage. It has been suggested that the embryonic copy of the gene has been evolving much faster than the other copy. Your aim is to investigate the evolution of the vertebrate sequences of this family, looking to see whether (by simply inspecting the resulting phylogeny) there seems to be a difference in rate of evolution of the two paralogs. If so, use the tree to decide when you think it's most likely that this difference in rate was established.

Begin by finding the TreeFam record that corresponds to this family.
Using CLUSTALX/Jalview (follow these links for instructions on using CLUSTALX and Jalview)
Using ProtTest (follow this link for instructions on using ProtTest)
Using RAxML (follow this link for instructions on using RAxML in this way)
Using NJplot or Dendroscope, examine these trees

In which lineage do you think the change in amino-acid substitution rate occurred within the family?

Can you think of some of the assumptions you are making when you draw this conclusion?

(You'll find possible answers to these questions by following this link.)
