Phylogeny Exercises and Demonstrations

Tuesday 10th February

MPI for Developmental Biology, Tuebingen, Germany

Aidan Budd

Presentation Link

Phylogenetic Analysis from A-Z

Going from the formulation of a biological question (one that can be investigated using a phylogeny) to an estimate of a phylogeny is a multi-stage process. Along the way, there are several stages at which important decisions must be taken on how to proceed - decisions that are strongly influenced by the aim of the analysis.

Thus, an exercise that involves going from question to phylogeny must be done in the context of a specific biological question.


For this demonstration, we assume that we are interested in understanding the history of the human Histone acetyltransferase GCN5 gene, and its paralog Histone acetyltransferase PCAF. In particular, we want to determine approximately when the duplication that yielded these two genes occurred.

  1. Identify the TreeFam family corresponding to these two genes, and download a protein sequence alignment for the family
  2. Load the sequences into CLUSTALX and remove sequences from the alignment that you feel are:
    1. unnecessary for the analysis as they are identical/nearly identical to other sequences in the alignment
    2. incomplete/fragmented in a way that would exclude large regions of the alignment from later analysis
    3. poorly aligned/likely to contain sequence errors
      1. Note that, in a "real" analysis, you might well want to attempt to resolve some of these issues i.e. correct alignment errors, check whether "unusual" sequence is likely to be due to errors or is, in fact, real. However, for the sake of speed, we'll for now just exclude these sequences from the analysis
      2. the alignment might look something like this
  3. With the CLUSTALX Quality->Show Low-Scoring Segments option switched on, load the alignment in JalView and (informed by the regions highlighted as low-scoring by CLUSTALX), remove those columns from the alignment where you are not confident that all residues in the column are "evolutionarily" equivalent i.e. related via single-residue substitutions.
  4. Edit the taxa names in the FASTA format file so that they are all 10 characters long, and contain only capital letters and/or underscores. Ideally, each name should identify the organism the sequence comes from and, if there is more than one sequence from the same organism, also make it possible to distinguish between those sequences
  5. Save the alignment in PHYLIP format using CLUSTALX
  6. Run ProtTest to identify the protein substitution model that best describes the sequences in your alignment
  7. Use RAxML to estimate a set of non-parametric bootstrapped trees from this alignment - to keep the analysis as quick as possible, calculate only 10 bootstrapped trees
  8. Obtain a single best estimate of the tree from the alignment using a string such as:
  9. Combine the results of these two runs to see the bootstrap support for the branches in the maximum likelihood tree using RAxML e.g.:
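
Step 4 in the list above (renaming the taxa) is tedious to do by hand for a large alignment. The following is a minimal Python sketch of one way to script it, not part of the original exercise. The function name `make_phylip_names`, the example headers, and the choice of a trailing letter to separate duplicate names are all assumptions made for illustration. Names produced this way will not always identify the organism clearly, so check them by eye.

```python
import re

def make_phylip_names(headers, width=10):
    """Map each FASTA header to a unique name of exactly `width` characters
    that contains only capital letters and underscores.

    Hypothetical helper: handles at most 26 clashes per name.
    """
    names = {}
    used = set()
    for header in headers:
        # Upper-case the header; replace everything that is not A-Z by "_".
        base = re.sub(r"[^A-Z]", "_", header.upper())
        counter = 0
        while True:
            if counter > 26:
                raise ValueError("too many clashing names for " + header)
            # First attempt has no suffix; later attempts add _A, _B, ...
            suffix = "" if counter == 0 else "_" + chr(ord("A") + counter - 1)
            candidate = (base[: width - len(suffix)] + suffix).ljust(width, "_")
            if candidate not in used:
                break
            counter += 1
        used.add(candidate)
        names[header] = candidate
    return names

# Hypothetical headers, for illustration only.
example = make_phylip_names(["Hsap_GCN5", "Hsap_PCAF", "Hsap_GCN5"])
for old, new in example.items():
    print(old, "->", new)
```

Note that digits are replaced by underscores, since the step asks for capital letters and underscores only.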


Carry out a similar analysis to that described above.

This time the scenario is that you are interested in the evolution of the human Polyadenylate-binding protein 2 and its paralog Embryonic polyadenylate-binding protein 2. The duplication that yielded the two genes probably occurred after the divergence of the urochordate from the vertebrate lineage. It has been suggested that the embryonic copy of the gene has been evolving much faster than the other copy. Your aim is to investigate the evolution of the vertebrate sequences of this family, looking to see whether (by simply inspecting the resulting phylogeny) there seems to be a difference in the rate of evolution of the two paralogs. If so, use the tree to decide when you think it's most likely that this difference in rate was established.

Begin by finding the TreeFam record that corresponds to this family
Using CLUSTALX/JalView
In which lineage within the family do you think the change in amino-acid substitution rate occurred?
Can you think of some of the assumptions you are making when you draw this conclusion?

(You'll find possible answers to these questions at the end of this page)

Viewing and Manipulating Unscaled Trees with NJplot

Load the following NEWICK/PHYLIP format file into NJplot and use the software to try and reproduce as closely as possible the following image:

If this last exercise was easy, try the same thing with the following file and image:

Demonstration - Branchlengths and Scaled Trees

The above trees are "Unscaled" - the lengths of the branches on the tree are arbitrarily assigned to provide a convenient representation of the tree e.g. as here with all the taxa labels aligned on the right side of the tree.

In many/most cases, however, you will be using and manipulating trees where branchlength represents the amount of change (typically measured as expected substitutions per residue) along a given lineage.

This file uses MESQUITE to demonstrate the relationship between the amount of change associated with a branch (as represented by a DNA sequence alignment) and branchlengths.
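
As a rough numerical illustration of the same idea, the sketch below converts the observed differences between two aligned DNA sequences into a distance measured in expected substitutions per site. It assumes the Jukes-Cantor (JC69) model, which is a choice made here for simplicity and is not necessarily the model used in the MESQUITE demonstration; the function names and the example sequences are hypothetical.

```python
import math

def p_distance(seq_a, seq_b):
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped sites to compare")
    return sum(a != b for a, b in pairs) / len(pairs)

def jc69_distance(p):
    """Expected substitutions per site under JC69 for an observed proportion p."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the JC69 correction")
    # d = -(3/4) * ln(1 - (4/3) * p)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Hypothetical aligned sequences differing at 1 site out of 10.
p = p_distance("ACGTACGTAC", "ACGTACGTAA")
print(p, jc69_distance(p))
```

The corrected distance is always at least as large as the observed proportion of differences, because some sites will have changed more than once along the lineage.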

Formatting Phylogenetic Trees

Dendroscope can apply a range of formatting options to phylogenetic trees to highlight specific features of a tree - useful when preparing trees for use in presentations and figures.

Load this tree file into Dendroscope and manipulate the tree representation to make it resemble the image below.

If you are having problems obtaining a representation like this, you can load this Dendroscope format file into Dendroscope - this includes the above tree saved in the state used to create the above figure.

Data Formats

Much of the software used to estimate, manipulate, and visualise phylogenetic data is produced by relatively small teams of developers as an adjunct to their own research. This limits considerably the time and resources available to design interfaces to the software, and to make the software compatible with different variants of input format.

Thus, a common situation when working with phylogenetic data is that the output obtained from one tool must be adjusted before it can be successfully accepted by another.

Typically, also, the error messages reported by software due to incompatible input data are either cryptic, unhelpful, or completely absent.

Common features to look out for when formatting data for use by phylogenetic software are to, if possible:

Tree (NEWICK/PHYLIP Format) Data

Most software that operates on phylogenetic trees uses some derivative of the NEWICK/PHYLIP format for input/output of trees.
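
To make the structure of the format concrete, here is a minimal Python sketch of a parser for simple NEWICK strings: nested parentheses group sister lineages, commas separate them, an optional `:number` after a node gives its branchlength, and a semicolon ends the tree. The function names and the example tree are hypothetical, and the sketch deliberately ignores whitespace, quoted labels, and comments, which real files may contain.

```python
def parse_newick(text):
    """Parse a simple NEWICK string into nested dicts.

    Each node is {"name": str, "length": float or None, "children": list}.
    """
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a NEWICK tree must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # step over "("
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # step over ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1  # step over ")"
        # Optional node label.
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        # Optional branchlength, introduced by ":".
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos] != ";":
        raise ValueError("unexpected text at position %d" % pos)
    return root

def leaf_names(node):
    """List the taxon labels at the tips, in left-to-right order."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in leaf_names(child)]

# Hypothetical tree with three lineages joined at the base.
tree = parse_newick("((A:0.1,B:0.2):0.05,C:0.3,D:0.4);")
print(leaf_names(tree), len(tree["children"]))
```

A parser like this is mainly useful for checking by script that a hand-written tree has balanced brackets and the taxa you expect before loading it into a viewer.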

To give you some practice overcoming typical problems associated with inputting tree data into phylogenetic software, we have put together some exercises to help familiarise you with this format.

Draw on paper the phylogenetic trees corresponding to the following NEWICK/PHYLIP format trees - be careful to check whether the trees are specified as rooted or unrooted, and draw them accordingly. Check whether you've been successful by comparing the trees you draw with the tree images provided below.


Tree image


Tree image

Now try this from the other side - write and save in a text editor NEWICK/PHYLIP format representations of the trees shown below. To check whether the file you write does indeed represent the appropriate tree, try loading the file into Dendroscope.

This first tree is unscaled - so do not attempt to include information about branchlengths in your NEWICK-FORMAT tree. (Here is a file that contains a NEWICK/PHYLIP format tree that should yield the tree seen below)

For the next tree, try to include branchlength information in your NEWICK/PHYLIP format tree (Here is a file that contains a NEWICK/PHYLIP format tree that should yield the tree seen below)

Sources of Pre-Calculated Trees

Sometimes, rather than estimating for yourself a phylogenetic tree, it will be enough to simply examine a tree obtained elsewhere.

Use the following websites to identify trees that include the human cyclin F sequence (UniProt Entry Name: CCNF_HUMAN, UniProt Primary Accession Number: P41002).

Download these trees and try to load them into the different TreeViewers we have looked at
Note that you may need to edit the format of the downloaded trees for them to be accepted/correctly loaded into the software.




There are quite a few other sites that provide trees that can be downloaded, for example:

Exploring Large Phylogenetic Trees

On occasion you will need to browse through the results of a large phylogenetic analysis e.g. of 1000 taxa. It can be very difficult to navigate around such a tree to identify the features relevant to your questions of interest.

The "Magnifier" tool provided by Dendroscope can be very helpful in examining trees of this kind, particularly in combination with the software's text-search facility.
  1. Load this tree file into Dendroscope
  2. Find all taxa that are from humans - as these are all taken from ENSEMBL, these all contain the substring "ENSP00"
  3. Use the Format box to colour all these taxa red
  4. Examine the tree to identify the human sequence that is most similar to the fly "CG7922" sequence
If you're having trouble identifying the appropriate fly sequence, this Dendroscope file has all human taxa labeled in red, with the CG7922 sequence labeled in blue

Editing trees using MESQUITE

Much of the time, we work with phylogenies that have been directly estimated from a dataset - usually a protein or DNA multiple sequence alignment. However, in certain situations we want to obtain a new phylogeny NOT by estimating it directly from data, but by either creating it completely "from scratch", or by simply modifying an existing phylogeny - this might involve changing branchlengths, topology, or the rooting of the tree.

Typical uses of such "edited" trees are preparation of figures for publications/presentations, or tests that involve comparing several different specified phylogenetic hypotheses.

MESQUITE is a very flexible tool for the analysis of phylogenetic data - part of its functionality enables us to edit trees in this way.

Load this tree file into MESQUITE and edit it to yield the topology and branch-lengths shown below.

Export the file from MESQUITE and then use Dendroscope to produce an image from it similar to the one shown below.

Create from scratch a phylogeny using MESQUITE with the topology and branchlengths shown below.


Identify the list of non-trivial splits for the following tree - check here for the answers

Try the same exercise with this larger tree - again, you can find the answers here
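
If you want to check your hand-derived lists by script, the following Python sketch enumerates the non-trivial splits of a tree. It assumes the tree is written as nested tuples of taxon names (a representation chosen here for brevity), and the five-taxon example is hypothetical rather than one of the exercise trees. A split is non-trivial when both of its sides contain at least two taxa.

```python
def leaves(tree):
    """Set of taxon names below a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def nontrivial_splits(tree):
    """Return the non-trivial splits of a tree given as nested tuples.

    Each split is a frozenset holding its two sides, so that the same
    split reached from either side of a branch is stored only once.
    """
    all_taxa = leaves(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = all_taxa - side
        # Keep only splits with at least two taxa on each side.
        if len(side) >= 2 and len(other) >= 2:
            splits.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return splits

# Hypothetical unrooted tree on five taxa.
example = (("A", "B"), "C", ("D", "E"))
for split in nontrivial_splits(example):
    side_1, side_2 = sorted(split, key=sorted)
    print(",".join(sorted(side_1)), "|", ",".join(sorted(side_2)))
```

A fully bifurcating unrooted tree with n taxa has n - 3 non-trivial splits, which gives a quick check on the length of your list.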

There is only one bifurcating tree that is consistent with the set of splits listed below. Draw this tree - check here for the answer.




Here are the splits for a larger tree, this time with 12 taxa - if you've time, try and repeat the above exercise with this set of splits, building the unique bifurcating tree that is consistent with this set of splits - check here for the answer.










Building Consensus Trees by Hand

From the set of six trees presented below, build both the unrooted (i) strict consensus tree and the (ii) 50% majority tree.

If you're having trouble building these trees, click on the links supplied to view the strict consensus and the majority tree (where the branch labels give the number of times a given split is observed among the 6 trees used to build the tree).
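
The counting behind both consensus methods can be sketched in a few lines of Python. The example below assumes each tree has already been reduced to its set of non-trivial splits, with each split written as the side that does not contain a reference taxon "A"; that convention, the function names, and the six small example trees are all hypothetical and are not the trees from this exercise.

```python
from collections import Counter

def split_counts(trees):
    """Count in how many trees each split is observed."""
    counts = Counter()
    for splits in trees:
        counts.update(set(splits))
    return counts

def strict_consensus(trees):
    """Splits present in every tree."""
    counts = split_counts(trees)
    return {split for split, n in counts.items() if n == len(trees)}

def majority_consensus(trees):
    """Splits present in more than half of the trees, with their counts."""
    counts = split_counts(trees)
    return {split: n for split, n in counts.items() if n > len(trees) / 2}

def clan(taxa):
    """Write a split as the set of taxa on the side without taxon A."""
    return frozenset(taxa)

# Six hypothetical trees on taxa A-E, two non-trivial splits each.
trees = [
    {clan("DE"), clan("CDE")},
    {clan("DE"), clan("CDE")},
    {clan("DE"), clan("CDE")},
    {clan("DE"), clan("BDE")},
    {clan("DE"), clan("CDE")},
    {clan("DE"), clan("BC")},
]
print(strict_consensus(trees))
print(majority_consensus(trees))
```

Any split found in more than half of the trees is compatible with every other such split, which is why the majority splits can always be assembled into a single tree.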

Using SplitsTree and CONSENSE to build Consensus Trees and Networks

You can use SplitsTree to build either a strict or majority consensus tree - although note that other software, for example CLANN or CONSENSE (CONSENSE is just one of the programs in the PHYLIP package), provide much more control and a much wider range of different consensus tree methods.

Using SplitsTree, load this set of 100 trees and calculate
If you have trouble calculating these trees/networks, then follow the links below to download (i) NEXUS format files with the trees/networks pre-calculated [which can be loaded for viewing directly into SplitsTree] (ii) images of the trees/networks.
By examining the strict consensus tree, identify those splits found in all 100 of the trees - the set of these splits can be found here.

By examining the majority consensus tree, determine how many of the trees contain the split: EF1A1_HUM, EF11_MOUS, EF1A_CHIC | others

Examine the consensus network -

Identify the most frequent split found in the trees that is incompatible with the split EF1A1_HUM, EF11_MOUS, EF1A_CHIC | others. Determine also how many trees have this incompatible split
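
Two splits are compatible (they can appear together in one tree) when at least one of the four pairwise intersections between their sides is empty. The Python sketch below tests this; the function name and the small example splits are hypothetical and do not use the taxa from the exercise.

```python
def compatible(split_1, split_2):
    """True if two splits of the same taxon set can occur in one tree.

    Each split is a pair of frozensets (its two sides). The splits are
    compatible when some side of one shares no taxa with some side of
    the other.
    """
    (a, b), (c, d) = split_1, split_2
    return any(not (x & y) for x in (a, b) for y in (c, d))

# Hypothetical splits on the taxa A-E.
ab_cde = (frozenset("AB"), frozenset("CDE"))
de_abc = (frozenset("DE"), frozenset("ABC"))
bc_ade = (frozenset("BC"), frozenset("ADE"))

print(compatible(ab_cde, de_abc))
print(compatible(ab_cde, bc_ade))
```

In a consensus network, each pair of incompatible splits is what produces a box-like region rather than a single branch.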

Within the set of trees, the taxon xILC49472 is most often found in two relatively small (mutually incompatible) clans. Identify these clans, and determine how many trees they are each found in.

If you have trouble answering the last few questions, check out the answers at the end of this page.


If you included several invertebrate lineages in your tree, you will have noticed that the substitution rate of these sequences is generally relatively low - suggesting that the embryonic lineage experienced an increase in the substitution rate (rather than the non-embryonic lineage experiencing a reduced rate).

To decide in which lineage of the Polyadenylate-binding proteins the amino-acid substitution rate changed, you might assume that such large changes in substitution rate occur relatively rarely, so that you would prefer a scenario in which the smallest number of such rate changes occurred. In this case, you would infer that the change occurred in the embryonic lineage after the duplication event and before the diversification of vertebrates into sarcopterygians (including humans, birds, amphibians, and a few "fish") and actinopterygians (including most "fish" e.g. zebrafish, fugu).
