Interpreting and Estimating Phylogenies

Exercises and Demonstrations

Aidan Budd

Phylogeny Terminology and Concepts


Anyone who has worked with phylogenies before has had to work with the specific vocabulary and concepts of the field. When we first encounter these words and concepts, we have to actively make an effort to understand what is meant by them. But the more we work in the field, the more internalised and obvious becomes our understanding of them.

Not everyone, however, understands this vocabulary and these concepts in the same way, which can lead to misunderstanding and confusion when communicating about the topic.

The exercise below focuses on our basic understanding of vocabulary and concepts associated with phylogenies and molecular evolution. By asking you to write down and discuss your understanding of several fundamental concepts and terms used in the field, the aim is to:

Indeed, within any group, different people have different questions/issues/topics/problems with understanding a given topic. A better awareness, for both you and the trainers, of your specific most important learning outcomes/needs of the course has been shown to enhance the ability of students to improve their understanding as a result of attending a course. Thus, in more general terms, the aim of this exercise is to highlight, both for students and trainers, specific, different prior understandings of key topics in the field.

Note, also, it has been shown that thinking about a topic in several different "contexts" (e.g. (i) silently on your own (ii) in direct discussion with others or (iii) in a larger group discussion) tends to reinforce our ability to benefit from a given learning experience. Hence the structure of the exercise provided below.


1. Write down definitions/descriptions of the terms given in the list below (1. to 3.).

Do this silently, on your own, and please actually write it down (on paper, in a document on your computer) - the act of committing your ideas into specific words often does a good job of making it clear where you are more and where less certain about how you want to express something.

It might help to begin by looking at the wikipedia page for phylogenetic trees. If you do begin by consulting this page, it would be best if you spend a bit of time looking through it, but then close the page when you write your own definitions/explanations of these terms, so that you don't find yourself just repeating what's already in the wikipedia page.

Here are the terms:

  1. phylogenetic tree
  2. phylogenetic tree topology
  3. branch length

2. Once you've finished, discuss your definitions with those of your neighbours.

Make notes on how your definitions differ, and write together a definition that agrees with both your ideas of the meaning of these terms.

3. We'll show definitions of these terms that we've come up with to the class - then we'll have a class discussion about any ways in which these definitions conflict with the ones you have come up with

4. Repeat the above steps, this time selecting two or more terms from the list below.
In case you're interested, you could consult the definitions/explanations we've put together for these terms ourselves.

Common Misconceptions about Phylogenies and Evolution

We've discussed several different common misconceptions about evolution in general, and interpretation of phylogenies in particular.

You can read more about such common misconceptions in the literature, for example in this paper by T. Ryan Gregory:

Understanding Evolutionary Trees
Evolution: Education and Outreach 1:2;121-137 (2008) DOI: 10.1007/s12052-008-0035-x
T. Ryan Gregory

This exercise looks at an example of a (satirical) use of common ideas about evolution in a piece of popular culture i.e. the music video for the Fatboy Slim track "Right Here, Right Now".

We will use a viewing of the video as an exercise in identifying examples of some of the misconceptions we've been discussing during the course.

While watching the video, make notes on:

You might like to compare the Fatboy Slim video to this similar one (which seems to be from a BBC Series, perhaps "Walking with Beasts" - I couldn't find a clear attribution for the video anywhere) - not clear to me whether one of these videos inspired the other, or whether the concept was arrived at independently/via convergence...

Viewing and Manipulating Unscaled Trees with NJplot

Teaching Objectives
After completing this section, you will hopefully be able to use NJplot to change the way a phylogenetic tree is visualised/drawn by the software by changing the root and rotating subtrees around internal branches to reach a specified/desired representation of a tree.
Many applications of phylogenies involve using them to check whether a given set of taxa/organisms/OTUs are related to each other in a particular way i.e. whether the topology of the estimated phylogeny supports a particular set of relationships between OTUs. As a tree with a given topology can be drawn/represented/visualised in many different ways, it can sometimes be tricky to tell whether a given representation of a phylogenetic tree indeed contains the relationship of interest. Thus, working with and changing between different representations of a tree is a useful skill when interpreting the results of phylogenetic analyses.

NJplot does not have many features and options. However, it carries out the simple tasks of re-rooting and rotating branches very quickly, which is often all you need when taking a first look at a phylogeny. These tasks are very useful when attempting to determine whether a given set of relationships exists in a tree, or when comparing two tree topologies.
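
As a minimal sketch of why rotating subtrees does not change the topology, the Python below represents a rooted tree as nested tuples (a representation chosen here purely for illustration, not NJplot's own) and sorts the children of every node into a fixed order. Two drawings that differ only by rotations then give the same result. The function name and the example trees are hypothetical.

```python
def canonical(tree):
    """Return a rotation-independent form of a rooted tree given as nested tuples."""
    if isinstance(tree, str):          # a leaf is just its OTU label
        return tree
    # Canonicalise each child, then sort the children so their order is irrelevant.
    children = [canonical(child) for child in tree]
    return tuple(sorted(children, key=repr))


# The same rooted topology drawn with different subtree rotations.
drawing_1 = (("A", "B"), ("C", ("D", "E")))
drawing_2 = ((("E", "D"), "C"), ("B", "A"))
# A genuinely different topology (B and C have swapped places).
other = (("A", "C"), ("B", ("D", "E")))

print(canonical(drawing_1) == canonical(drawing_2))
print(canonical(drawing_1) == canonical(other))
```

Re-rooting is a different operation: it changes this rooted form, but not the set of relationships described by the underlying unrooted tree.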

This page describes how to carry out these kinds of manipulations using NJplot - we will also demo them for you.
Exercise 1
Load the following NEWICK/PHYLIP format file into NJplot and use the software to try and reproduce as closely as possible the following image:

If the previous exercise was easy, try the same thing with the following file and image:

Investigating Branch Lengths and Scaled Trees using MESQUITE

Teaching Objectives
After completing this section, you will hopefully have gained at least an intuitive understanding of the usual interpretation of branch lengths of a molecular phylogeny.
The trees used in the NJplot exercise above are "unscaled" i.e. the lengths of the branches on the tree are arbitrarily assigned to provide a convenient representation of the tree. For example, in the NJplot-screenshots above, branch lengths are chosen to align all OTU labels on the right side of the tree.

In many/most cases, however, you will be using and manipulating trees where branch length represents the amount of change along a lineage, typically measured as expected substitutions per alignment column.
Demonstration - Investigating Branch Lengths and Scaled Trees using MESQUITE
This file uses MESQUITE to demonstrate the relationship between the amount of change associated with a branch (as represented by a DNA sequence alignment) and branch lengths.

We'll see what happens to a simulated alignment evolved over trees where we only change the lengths of the branches, keeping the topology the same.

We use the following four-taxon trees:

A. All branches the same length
  1. all branches length 1.0
  2. all branches length 0.1

B. Internal branch three times the length of the external branches
  3. external branches 0.1, internal branch 0.3
  4. external branches 0.2, internal branch 0.6
  5. external branches 0.8, internal branch 2.4

Comparing 1 and 2 - there is much more variability in alignments produced by tree 1, which has the longer branches
Comparing 2 and 3 - there are many more columns in tree 3 alignments where taxa taxonA and taxonB have the same base, taxonC and taxonD have the same base, and these two bases are different - making it easier to look at a tree 3 alignment and guess the correct tree topology

Comparing 3 to 4 and 5 - the "noise" caused by larger numbers of substitutions increasingly obscures this phylogenetic signal

We see that re-rooting the tree doesn't change these trends
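
To make the link between branch length and sequence difference concrete, here is a minimal sketch that assumes the Jukes-Cantor (JC69) substitution model, which may not be the model used in the MESQUITE file. Under JC69, two sequences separated by a total path length of d expected substitutions per site are expected to differ at a fraction 3/4 * (1 - exp(-4d/3)) of sites. The function name is illustrative, and the sketch also assumes that taxonA and taxonB are neighbours joined by two external branches.

```python
import math


def expected_difference(path_length):
    """Expected fraction of differing sites under JC69 for a path of the given length."""
    return 0.75 * (1.0 - math.exp(-4.0 * path_length / 3.0))


# Path between taxonA and taxonB (assumed: two external branches) in trees 1 and 2.
for name, external in [("tree 1", 1.0), ("tree 2", 0.1)]:
    path = 2 * external
    print(f"{name}: path {path:.1f}, expected difference {expected_difference(path):.2f}")
```

Under this model the expected difference can never exceed 0.75, so sequences separated by very long paths look almost as different as unrelated random sequences, which is one way to think about the "noise" on long branches.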
Exercise 2 - Investigating Branch Lengths and Scaled Trees using MESQUITE
1. Using the same NEXUS format MESQUITE file as above as a basis, and using the instructions below on how to use MESQUITE, construct a phylogenetic tree that yields an alignment similar to the one shown below i.e. with taxonD and taxonB much more similar to each other than either is to taxonA and taxonC


Here's a NEXUS/MESQUITE file with such a tree in it

2. If you're feeling ambitious, try and edit the tree in MESQUITE to add some extra taxa (taxonE and taxonF), so that when simulating DNA characters taxonE has a sequence very similar to taxonB and taxonD (as above), but with taxonF very similar to taxonA's sequence


Formatting Phylogenetic Tree Figures with Dendroscope

Teaching Objectives
After completing this section, you will hopefully be able to manipulate the representation (root, branch rotation, formatting) of a phylogenetic tree using Dendroscope to provide a starting point for preparing trees for use in figures
While NJplot is good for quickly examining a tree, it provides only fairly limited tools to manipulate the representation of a tree. To prepare representations of trees for use in figures, we therefore typically begin by visualising the tree in Dendroscope, using it to make many of the formatting changes we need to emphasize appropriate features of the tree. Note, however, that in almost all cases we carry out further changes and decoration of the tree using image software such as the GIMP or Adobe Illustrator.
We will quickly demo some of the features of Dendroscope for you using this tree file
This page shows how to carry out these and some common tasks using Dendroscope.

NOTE that some versions of Dendroscope have bugs in them associated with how they show support values on branches of the tree after re-rooting!
Exercise 3
Load this tree file into Dendroscope and manipulate the tree representation until it resembles the image below.

If you are having problems obtaining a representation like this, you can load this Dendroscope format file into Dendroscope - this includes the above tree saved in the state used to create the above figure.

If you've time, try something similar, using the same tree file as input as used for the above exercise (also found under this link), but this time using an unrooted representation, as shown below.

If you've (even more) time, try to recreate (both) the above images, as far as possible, using the following alternative tree viewers:

Data Formats

Teaching Objectives
After completing this section, you will hopefully be able to:
Much of the software used to estimate, manipulate, and visualize phylogenetic data is produced by relatively small teams of developers, primarily for use in their own research. As a result, they typically have only limited time, resources, and motivation available to design and prepare the interface of the software to be compatible with a wide range of different data formats.

Thus, a common situation when working with phylogenetic data is that the output obtained from one tool must be adjusted before it can be successfully used by another tool.

The task of determining the changes that need to be made can be somewhat confounded by the error messages reported by software due to incompatible input data - these may be either somewhat cryptic or absent, making it difficult to diagnose the reasons for the incompatibility of the data.

Things to look out for when formatting data for use by phylogenetic software are to, if possible:
When confronted by a tree file that cannot be read by a given piece of software, then either:

Tree (NEWICK/PHYLIP Format) Data

Most software that operates on phylogenetic trees uses some derivative of the NEWICK/PHYLIP format for input/output of trees, as described during the presentation

To give you some practice overcoming typical problems associated with inputting tree data into phylogenetic software, we have put together some exercises to help familiarize you with this format.

These two links both provide descriptions of the NEWICK/PHYLIP format
We'll show you how to draw "by hand" the tree corresponding to a given NEWICK string using the following example:


Which should yield something like the image shown below:

We'll demonstrate this process "the other way around" i.e. we'll try and write the NEWICK format string corresponding to tree A shown above - and then we'll try and alter that string to correspond instead to tree B shown below.
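
The by-hand reading of a NEWICK string can also be sketched in code. The Python below is a minimal illustration, not the parser used by any of the tools mentioned here: it handles only unquoted labels and optional ":length" suffixes, and turns the string into an indented outline with one node per line. The function names and the example string are made up for this sketch.

```python
def parse_newick(text):
    """Parse a simple NEWICK string into nested (label, length, children) tuples."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a NEWICK string must end with ';'")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1                      # consume "("
            children.append(node())
            while text[pos] == ",":
                pos += 1                  # consume ","
                children.append(node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1                      # consume ")"
        start = pos
        while text[pos] not in ",():;":   # read an optional label
            pos += 1
        label = text[start:pos]
        length = None
        if text[pos] == ":":              # read an optional branch length
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return (label, length, children)

    root = node()
    if text[pos] != ";" or pos != len(text) - 1:
        raise ValueError(f"unexpected character at position {pos}")
    return root


def outline(node, depth=0):
    """Return the tree as indented lines, one node per line."""
    label, length, children = node
    name = label or "(internal node)"
    suffix = "" if length is None else f"  [branch length {length}]"
    lines = ["  " * depth + name + suffix]
    for child in children:
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(parse_newick("((A:1,B:2):0.5,(C:1,D:1):0.5);"))))
```

Each level of indentation corresponds to one pair of parentheses, which is the same bookkeeping you do when drawing the tree on paper.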

We can test whether we've correctly written out the corresponding NEWICK string by loading the string into Dendroscope and checking:

Tree A

Tree B

Exercise 4
Draw on paper the phylogenetic trees corresponding to the following NEWICK/PHYLIP format trees - be careful to check whether the trees are specified as rooted or unrooted, and draw accordingly. Check your results by comparing the trees you draw with the images linked below each tree.


Link to tree image


Link to tree image

Now try this 'from the other side' - write and save in a text editor NEWICK/PHYLIP format representations of the trees shown below. To check whether the file you write does indeed represent the appropriate tree, try loading the file into Dendroscope.

This first tree is unscaled - so do not attempt to include information about branch lengths in your NEWICK-FORMAT tree. (Here is a file that contains a NEWICK/PHYLIP format tree that should yield the tree seen below)

For the next tree, try to include branch length information in your NEWICK/PHYLIP format tree (Here is a file that contains a NEWICK/PHYLIP format tree that should yield the tree seen below)

Identify Errors in a NEWICK String

If you try and load this NEWICK format tree "((A,B),C),D);" into NJplot (indeed into most phylogenetic software packages) it will complain/throw an error.

One reason this can happen, i.e. that a software package can't read a NEWICK string (and indeed the reason in this case) is that there is not a complete set of matching parentheses - this becomes obvious when the number of open parentheses "(" [2] are compared to the number of closing parentheses ")" [3] - there is thus an unmatched ")" parenthesis.

By inserting an additional "(" into the string, it can now be read by NJplot  "(((A,B),C),D);"
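
The counting described above is easy to automate. This is a minimal sketch with a hypothetical function name; it ignores the possibility of parentheses inside quoted labels or comments.

```python
def check_parentheses(newick):
    """Return (number of "(", number of ")", position of first unmatched ")" or None)."""
    depth = 0
    first_unmatched = None
    for position, char in enumerate(newick):
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            # A negative depth means this ")" closes a group that was never opened.
            if depth < 0 and first_unmatched is None:
                first_unmatched = position
    return newick.count("("), newick.count(")"), first_unmatched


print(check_parentheses("((A,B),C),D);"))   # (2, 3, 11)
print(check_parentheses("(((A,B),C),D);"))  # (3, 3, None)
```

Equal counts are necessary but not sufficient: a string such as ")A,B(;" has one of each and is still invalid, which is why the sketch also reports the first closing parenthesis that has no partner.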
Exercise 4a
The following NEWICK strings have similar problems i.e. should not load successfully into NJplot.

Modify the strings in a text editor so that they do load successfully into the software

1.   ((A,(E,D)),(C,(B,F));
2.   (A:1,D:6,((E:1.01,F:1.2):1,B:2):0.21,(C:4,G:2.2):2):1);

The "answers" can be found here.

Sequence Data Formatting Problems

One might also have problems loading sequence/alignment data into phylogenetic analysis software.

For alignment datasets, similar rules of thumb concerning taxon names apply as for NEWICK/PHYLIP format tree files i.e. avoiding non-alphabetic characters etc.
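
A quick way to spot potentially troublesome taxon names is sketched below. It assumes that only letters, digits and underscores are "safe"; that set is an assumption made for this illustration, and individual programs may be more or less tolerant. The function names and the example names (apart from the cyclin F identifiers used elsewhere on this page) are hypothetical.

```python
import re

# Assumption: only letters, digits and underscores are treated as "safe".
SAFE_NAME = re.compile(r"[A-Za-z0-9_]+")


def problem_names(names):
    """Return the names containing characters outside the assumed safe set."""
    return [name for name in names if not SAFE_NAME.fullmatch(name)]


def make_safe(name):
    """Replace each run of unsafe characters with a single underscore."""
    return re.sub(r"[^A-Za-z0-9_]+", "_", name).strip("_")


names = ["CCNF_HUMAN", "Homo sapiens (human)", "sp|P41002|CCNF_HUMAN"]
for name in problem_names(names):
    print(f"{name!r} -> {make_safe(name)!r}")
```

After renaming, it is worth checking that all names are still unique, since two different original names can collapse to the same safe name.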

Additional issues specific for reading alignment files into phylogenetic software are:
Demonstration and Exercise - Editing an alignment file until it is accepted as input to MrBayes
When confronted by an alignment file that is not accepted as input by a software package, the same advice applies as for tree files:
In this demonstration we will work with two files:
We'll follow the following procedure to edit the "failing" file until MrBayes accepts it:
  1. In the UNIX terminal, move to the directory we want to do our analysis in
  2. Start MrBayes running (on my machine this is done by typing "mb" at the UNIX prompt)
  3. Execute the file within MrBayes (by typing "execute FILENAME" at the MrBayes prompt)
  4. Make changes to the input alignment in a text editor, to make the failing file more like the one that loads successfully, and repeat
As an exercise, try and get this FASTA format alignment (taken from TreeFam family TF105907, which includes human poly-A binding proteins) loaded into MrBayes - it will help to examine this NEXUS format file that is accepted as input by MrBayes.
Demonstration - Downloading an MSA, running Gblocks, saving in NEXUS format
Different software reads alignments in different formats - for example, MrBayes accepts alignment data in NEXUS rather than PHYLIP or FASTA format.

CLUSTALX can be used, in many cases, to convert between formats - very useful, for example, if you begin by working with an alignment in FASTA format but want to analyse the alignment using MrBayes.
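
To show what such a conversion involves, here is a minimal Python sketch that writes a non-interleaved NEXUS data block from FASTA text. It is not what CLUSTALX does internally, the function names are made up, and the toy alignment is invented; the exact "format" line a given MrBayes version expects is also an assumption, so compare the result with a file that is known to load.

```python
def read_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header without a name")
            records.append([fields[0], ""])   # keep only the first word as the name
        elif records:
            records[-1][1] += line            # sequences may span several lines
    return [(name, seq) for name, seq in records]


def to_nexus(records, datatype="protein"):
    """Return a non-interleaved NEXUS data block for an aligned set of sequences."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    width = max(len(name) for name, _ in records)
    lines = [
        "#NEXUS",
        "begin data;",
        f"  dimensions ntax={len(records)} nchar={lengths.pop()};",
        f"  format datatype={datatype} missing=? gap=-;",
        "  matrix",
    ]
    lines += [f"  {name.ljust(width)}  {seq}" for name, seq in records]
    lines += ["  ;", "end;"]
    return "\n".join(lines) + "\n"


toy_fasta = ">seq_one some description\nMKV-LA\n>seq_two\nMKVTLA\n"
print(to_nexus(read_fasta(toy_fasta)))
```

The two places where hand-converted files most often go wrong are the ntax/nchar values in the "dimensions" line and sequences of unequal length, which is why the sketch computes the former and checks the latter.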

Here we'll demonstrate using CLUSTALX to convert an alignment downloaded from TreeFam into a gapless/reduced alignment in NEXUS format, using Gblocks to obtain an automatic reduced alignment

Here is a link describing how to obtain alignments from TreeFam - and here's a link to the TreeFam home page

We'll try this with an alignment containing the human cyclin F sequence (UniProt Entry Name: CCNF_HUMAN, UniProt Primary Accession Number: P41002).

1. We begin by querying TreeFam (using the UniProt Primary Accession Number) to find the record containing this protein

In case we have trouble finding this:
2. We'll load the alignment into CLUSTALX and save in FASTA format for uploading to the Gblocks server
3. We'll upload the file to the Gblocks server, and download the resulting reduced alignment in FASTA format
4. We load the reduced FASTA file into CLUSTALX and save in NEXUS format
Note that, in this form, the file will not be accepted as input to MrBayes - it needs to be altered slightly before that is possible.
Exercise 5 - Converting to NEXUS format for execution by MrBayes
Try and get this FASTA format alignment (taken from TreeFam family TF105907, which includes human poly-A binding proteins) loaded into MrBayes - it will help to examine this NEXUS format file that is accepted as input by MrBayes

Try and get the seed alignment from the TreeFam family that includes the human histone acetyltransferase KAT2A protein (UniProt:Q92830;KAT2A_HUMAN), and load it successfully into MrBayes.

Sources of Pre-Calculated Trees

Sometimes, rather than estimating a phylogenetic tree yourself, it will be enough to simply examine a tree obtained elsewhere.

However, such trees are often formatted in a way that is incompatible with tree visualization (or other phylogenetic) software. Therefore, obtaining and visualising such trees provides further practice at interpreting and manipulating NEWICK format trees.

This page describes how to obtain trees (and in some cases alignments) from several different websites - we will also demo them for you
Exercise 6
Below is a list of websites which are sources of pre-calculated phylogenies -  use each of these sites to obtain trees that include the human cyclin F sequence (UniProt Entry Name: CCNF_HUMAN, UniProt Primary Accession Number: P41002).

After downloading the trees, try to load them into NJplot and Dendroscope
Note that you may need to edit the format of the downloaded trees for them to be accepted/correctly loaded into the software.
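
One edit that is sometimes needed is removing bracketed annotations. This is offered as an assumption about what a downloaded file might contain, not as a description of any particular website. In NEWICK, text in square brackets is a comment, and extended formats such as NHX store extra information there, which not every viewer may accept. The sketch below strips such comments; the function name and the annotated example string are invented.

```python
import re


def strip_comments(newick):
    """Remove every [ ... ] comment from a NEWICK string."""
    return re.sub(r"\[[^\]]*\]", "", newick)


annotated = "((A:0.1[&&NHX:S=human],B:0.2[&&NHX:S=mouse]):0.05[&&NHX:D=N],C:0.3);"
print(strip_comments(annotated))  # ((A:0.1,B:0.2):0.05,C:0.3);
```

Keep a copy of the original file, since the stripped annotations may hold information (for example species names or duplication events) that you want to refer back to later.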




There are quite a few other sites that provide trees that can be downloaded, for example:

Exploring Large Phylogenetic Trees

Teaching Objectives
After completing this section, you will hopefully be able to interpret and explore very large phylogenetic trees using Dendroscope
As computers continue to increase in speed, memory, etc., we are able to calculate ever larger phylogenetic trees. Thus, it is now not unusual to be faced with the problem of examining trees with 100s or even 1000s of taxa. The tree visualization tools we have used so far (or at least the way we have been using them) are not well designed for this task, as large trees are too dense to visualize easily i.e. it is hard/impossible to identify within them regions of particular interest while  at the same time providing some kind of overview of where the region of focus lies within the overall tree.

However, combined with search and format options, the "Magnifier" tool of Dendroscope makes this kind of task much easier.

Here's a quick demo using Dendroscope for this purpose, using this large tree:
This page describes how to use Dendroscope in this way - we will also demo it for you
Exercise 7
  1. Load this tree file into Dendroscope
  2. Find all OTUs that are from humans - as these are all taken from ENSEMBL, the labels for these OTUs should all contain the substring "ENSP00"
  3. Use the Format box to colour all these OTUs red
  4. Examine the tree to identify the human sequence that is likely to be most closely related to the fly "CG7922" sequence i.e. the human sequence in the smallest clan that contains both CG7922 and a human sequence. Note, however, that if we want to avoid making assumptions about the location of the root of the tree, we can't be sure, even if the tree is accurate, that this is indeed the human sequence most closely related to the fly CG7922 sequence, as under some rootings of the tree (albeit very unlikely ones, given our experience of patterns of gene-family evolution) they are not very closely related to each other.
If you're having trouble identifying the appropriate fly sequence, this Dendroscope file has all human taxa labeled in red, with the CG7922 sequence labeled in blue
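
Outside Dendroscope, the same kind of label search can be sketched in a few lines of Python. The sketch assumes a simple NEWICK string with unquoted labels; the function name and the example tree (including its identifiers) are made up, with only the "ENSP00" substring and the CG7922 label taken from the exercise.

```python
import re


def leaf_labels(newick):
    """Return the leaf labels of a simple NEWICK string (unquoted labels only)."""
    # A leaf label directly follows "(" or "," and runs up to ":", ",", ")" or ";".
    return re.findall(r"(?<=[(,])\s*([^\s(),:;]+)", newick)


example = "((ENSP00000123:0.1,CG7922:0.2):0.05,(ENSMUSP0001:0.3,ENSP00000456:0.1):0.2);"
labels = leaf_labels(example)
human = [label for label in labels if "ENSP00" in label]
print(human)
```

Internal node labels and support values follow a ")" rather than a "(" or ",", so the pattern leaves them out.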

Editing Trees Using MESQUITE

Teaching Objectives
After completing this section, you will hopefully be able to create NEWICK format files of a desired topology and set of branch lengths using MESQUITE, beginning either "from scratch", or by modifying an existing tree.
We usually work with phylogenies that have been directly estimated from a dataset - typically a protein or DNA multiple sequence alignment. However, in certain situations we do not want or need to estimate a phylogeny - instead we can either create it completely "from scratch", or simply modify an existing phylogeny. These modifications might involve changing branch lengths, topology, and/or the rooting of the tree.

Typical uses of such "edited" trees are preparation of figures for publications/presentations e.g. a "cartoon" figure showing a consensus view of the relationships for a set of organisms, or when carrying out tests that compare several different specified phylogenetic hypotheses e.g. when applying the approximately unbiased (AU) test to an alignment and a set of phylogenies.

As already mentioned, MESQUITE is a very flexible tool for the analysis of phylogenetic data - and one part of its functionality enables us to edit trees in this way.

This page describes how to use MESQUITE in this way - we will also demo it for you:
After you've had a go at the first part of the exercise I'll also demo:
Exercise 8
Load this tree file into MESQUITE and edit it to yield the topology and branch lengths shown below.

Export the file from MESQUITE and then use Dendroscope to produce an image from it similar to the one shown below.

Create from scratch a phylogeny using MESQUITE with the topology and branch lengths shown below.

Reconciling Species and Gene Trees Using Mesquite

Teaching Objectives
After completing this section, you will hopefully appreciate that any gene tree can be consistent with any species tree, given inference of appropriate gene duplications and losses.
We will show how to compare/reconcile a set of rooted trees (all with the same unrooted topology) with a species tree using MESQUITE.

This NEXUS format file can be loaded into MESQUITE - it includes three gene trees (shown in the image below - all have the same unrooted topology and branch lengths) and a species tree - the same as used in the quiz from the presentation.

Using MESQUITE's Analysis->Visual Tree Analysis->Contained Gene (or Other) Tree command, we can identify the minimum number of duplication and deletion events that must be inferred to reconcile the different gene trees with the species tree.

This page describes how to use MESQUITE in this way


Teaching Objectives
After completing this section, you will hopefully be able to:
When carrying out a phylogenetic analysis, we often need to summarise the similarities/differences between a set of phylogenies e.g. the set of phylogenies sampled after the burn-in phase in a Bayesian analysis of phylogeny. Many of the ways in which sets of trees are summarized make use of the concept of "phylogenetic splits".

A phylogenetic split is the two sets of OTUs associated with the two ends of a branch on a phylogenetic tree. For example, in the tree below, the split associated with the red branch is


The union of the two sets that make up the split is the complete set of OTUs from the tree, and the two sets should be disjoint (i.e. not sharing any OTUs in common).

Splits may be described as "trivial" or "non-trivial" - in a trivial split one of the two sets contains just a single OTU, while both sets in a non-trivial split contain more than one OTU

For example, there are 5 trivial splits for the tree shown below (one for each terminal branch)


And 2 non-trivial splits (one for each internal branch on the corresponding unrooted tree)


Note also that neither the two sets of a split nor the OTUs within each set are ordered - thus, "AB | CDE" describes the same split as "ECD | AB"
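
The definitions above can be turned into a short Python sketch that lists the splits of a tree. The nested-tuple representation, the function names and the five-OTU example tree are all choices made for this illustration. Each split is stored as a frozenset of its two sides, so that order does not matter.

```python
def leaves(tree):
    """Return the set of OTU labels below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Return the splits of the unrooted tree underlying a nested-tuple tree."""
    all_otus = leaves(tree)
    found = set()

    def visit(node):
        side = leaves(node)
        other = all_otus - side
        if other:                       # skip the root, whose "other side" is empty
            found.add(frozenset([side, other]))
        if not isinstance(node, str):
            for child in node:
                visit(child)

    visit(tree)
    return found


def is_trivial(split):
    """A split is trivial if one of its sides holds a single OTU."""
    return any(len(side) == 1 for side in split)


def show(split):
    """Format a split as text, e.g. 'AB | CDE'."""
    return " | ".join(sorted("".join(sorted(side)) for side in split))


example = (("A", "B"), ("C", ("D", "E")))
for split in sorted(splits(example), key=show):
    kind = "trivial" if is_trivial(split) else "non-trivial"
    print(f"{show(split):12s} {kind}")
```

Because the two branches leaving the root describe the same split, the root itself contributes nothing extra, and re-rooting the example tree leaves the set of splits unchanged.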

A further feature of (a set of) splits is compatibility

A pair of splits are incompatible if it is impossible to draw a tree that contains both of them - instead, to include both of them in a diagram, we would need to use a split network. Likewise, a set of splits is compatible if it is possible to include all of them in a tree. Clearly, the set of splits that are described by a given tree will always be compatible.

As an example, the following two splits are incompatible:

AB | CDE
AC | BDE


(if you want, try [and fail!] yourself to build a tree that contains both of these splits)

You can identify whether a pair of splits are compatible or not by considering the intersections of the split sub-sets. Where exactly one of the sub-set intersections is empty, then the splits are compatible - otherwise the splits are incompatible.

Taking the example above, the intersections are (using "n" to indicate intersection):

{A,B} n {A,C} => {A}
{A,B} n {B,D,E} => {B}
{C,D,E} n {A,C} => {C}
{C,D,E} n {B,D,E} => {D,E}

All four intersections are non-empty - the splits must be incompatible

In contrast, the following two splits are compatible:

AB | CDE
CD | ABE


{A,B} n {C,D} => {}
{A,B} n {A,B,E} => {A,B}
{C,D,E} n {C,D} => {C,D}
{C,D,E} n {A,B,E} => {E}

as one (and only one) of the intersections is the empty set.
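
The intersection rule translates directly into code. In this minimal sketch each split is written as a pair of strings of single-letter OTU names, a shorthand chosen for brevity that would not work for real, longer names; the function and variable names are illustrative.

```python
from itertools import product


def compatible(split_1, split_2):
    """Return True if exactly one of the four side-by-side intersections is empty."""
    intersections = [set(a) & set(b) for a, b in product(split_1, split_2)]
    return sum(1 for common in intersections if not common) == 1


split_x = ("AB", "CDE")
split_y = ("AC", "BDE")
split_z = ("CD", "ABE")

print(compatible(split_x, split_y))  # False: all four intersections are non-empty
print(compatible(split_x, split_z))  # True: only {A,B} n {C,D} is empty
```

Comparing a split with itself gives two empty intersections, so the rule as stated is meant for pairs of different splits.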
Exercise 9
(i) Identify the list of all non-trivial splits for the following tree - check here for the answers

(ii) Try the same exercise with this larger tree - again, you can find the answers here

(iii) There is only one bifurcating tree that is consistent with the set of splits listed below. Draw this tree - check here for the answer.




(iv) Here are the splits for a larger tree, this time with 12 taxa - if you've time, try and repeat the above exercise with this set of splits, building the unique bifurcating tree that is consistent with this set of splits - check here for the answer.










Building Consensus Trees by Hand

Teaching Objectives
After completing this section, you will hopefully be able to construct a consensus tree "manually" from a set of trees (all of which describe relationships between the same set of OTUs).
Consensus trees summarize the set of splits described by multiple phylogenetic trees. For example, a consensus tree might include all splits present in 80% of the trees.
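
As a rough sketch of the counting behind a consensus tree, the Python below tallies how often each non-trivial split occurs in a set of trees and keeps those that reach a chosen fraction. The nested-tuple trees, the function names and the five example trees are invented for illustration, and the sketch returns only the retained splits rather than drawing the tree.

```python
from collections import Counter


def leaves(tree):
    """Return the set of OTU labels below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def nontrivial_splits(tree):
    """Return the non-trivial splits of a nested-tuple tree."""
    all_otus = leaves(tree)
    found = set()
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, str):
            continue
        stack.extend(node)
        side = leaves(node)
        other = all_otus - side
        if len(side) > 1 and len(other) > 1:
            found.add(frozenset([side, other]))
    return found


def consensus_splits(trees, min_fraction):
    """Return {split: count} for splits found in at least min_fraction of the trees."""
    counts = Counter(split for tree in trees for split in nontrivial_splits(tree))
    return {s: n for s, n in counts.items() if n >= min_fraction * len(trees)}


def show(split):
    return " | ".join(sorted("".join(sorted(side)) for side in split))


trees = [
    (("A", "B"), ("C", ("D", "E"))),
    (("A", "B"), ("D", ("C", "E"))),
    (("A", "B"), ("E", ("C", "D"))),
    (("A", "B"), ("C", ("D", "E"))),
    (("A", "C"), ("B", ("D", "E"))),
]
for split, count in consensus_splits(trees, 0.8).items():
    print(f"{show(split)} found in {count} of {len(trees)} trees")
```

Setting the fraction to 1.0 gives the splits of the strict consensus. Any two splits that each occur in more than half of the trees must occur together in at least one tree, so a threshold above one half always yields a set of splits that fits on a single tree.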
Exercise 10
From the set of six trees presented below, build both the unrooted (i) strict consensus tree and the (ii) 50% majority tree.

If you're having trouble building these trees, click on the links supplied to view the strict consensus and the majority tree (where branches are labeled with the number of times a given split is observed amongst the total of 6 trees used to build the tree).

Using SplitsTree and CONSENSE to build Consensus Trees and Networks

Teaching Objectives
After completing this section, you will hopefully be able to process a file containing a set of trees, specified in NEWICK format, all describing relationships between the same set of OTUs and:
A range of different software is available to calculate consensus trees and networks.

We will begin by using SplitsTree - a JAVA-based tool with a graphical user interface which builds either strict or majority consensus trees, and also split networks.

Other software, such as CLANN or CONSENSE (one of the programs in the PHYLIP package), provide more flexibility in the type of different consensus  trees they can build - we will follow the SplitsTree exercise by using CONSENSE to build some consensus trees. This also gives us an opportunity to become familiar with the PHYLIP package.
Exercise 11 - Using SplitsTree
This page describes how to use SplitsTree to build consensus trees and split networks - we will demo this for you

Load this set of 100 trees into SplitsTree and calculate
If you have trouble calculating these trees/networks, then follow the links below to download (i) NEXUS format files with the trees/networks pre-calculated [which can be loaded for viewing directly into SplitsTree] and (ii) images of the trees/networks.
By examining the strict consensus tree, identify those splits found in all 100 of the trees - the set of these splits can be found here.

By examining the majority consensus tree, determine how many of the trees contain the clan: (EF1A1_HUM, EF11_MOUS, EF1A_CHIC)

Examining the consensus network, identify the most frequent split found in the trees that is incompatible with the split EF1A1_HUM, EF11_MOUS, EF1A_CHIC | others. Determine also how many trees have this incompatible split

Within the set of trees, the taxon xILC49472 is most often found in two relatively small (mutually incompatible) clans. Identify these clans, and determine how many trees they are each found in.

If you have trouble answering the last few questions, check the answers here.
Exercise 11B - Using CONSENSE
This page describes how to use CONSENSE to build consensus trees - we will demo this for you

Use CONSENSE to build the:
of the 100 trees used above for the SplitsTree exercise.

Check your results by comparing them to the pre-calculated ones provided below:
Some questions that are just designed to give you a focus for examining these trees:

Do any of the consensus trees have the same number of polytomies?

Do any of the consensus trees have identical topologies?

Check here for the answers to these questions.

Phylogenetic Analysis: From Start to Finish

Teaching Objectives
After completing this section, you will hopefully:
Going from the formulation of a biological question (that can be investigated using a phylogeny) to obtaining an estimate of a phylogeny is a multi-stage process. It requires the investigator to make many decisions - many of which depend strongly on the specific overall aim/purpose of the analysis i.e. the biological questions the analysis is being used to investigate. In particular, we will focus in this session on those decisions associated with preparing an appropriate multiple sequence alignment (MSA) for analysis by phylogeny estimation tools, especially those concerning which regions of which sequences to include in the MSA.

This strong dependence of the "answers" to such decisions on the specific purpose of an analysis is part of the reason why it is difficult/impossible to provide a one-fits-all detailed protocol/recipe for phylogenetic analyses.

Thus, the demonstration below aims to highlight (i) typical steps/stages and (ii) examples of the kinds of decisions that need to be made when carrying out such an analysis - thus, it is not intended as a blueprint for carrying out the ideal phylogenetic analysis!

Note, also, that while we have described these examples as taking the analysis "from start to finish", in reality the process is much more involved than shown here. One could argue that the process begins considerably earlier than shown here, with the decisions about the biological questions of interest - and that it would need to run on much longer than shown here, for example investigating and testing for a range of different potential sources of systematic error in the analysis, or using these results to identify additional data to be included in the analysis e.g. highlighting a particular set of taxa that might be useful in better resolving particular regions of the phylogeny.
Demonstration - Components of Phylogeny Estimation Workflow
This set of demonstrations accompanies our discussion of a "generic" workflow for phylogeny estimation - the aim is simply to demonstrate how to carry out some important data manipulations that are involved in many phylogenetic analyses.

We'll begin by using a subset of the full TreeFam family TF106503 alignment - which includes nucleoporin-like proteins such as human nucleoporin-like protein 2 (UniProt:O15504;NUPL2_HUMAN)

1. Obtaining an initial set of sequences
2. Aligning sequences using several different automatic MSA webservers
3. Refining the alignment
4. Building a reduced alignment
5. Getting a quick phylogeny estimate using CLUSTALX
6. Aligning "new" sequences to a pre-existing alignment using CLUSTALX
Here are example alignments
7. Change from FASTA to PHYLIP format with CLUSTALX

8. Choose a substitution model using ProtTest

9. Estimate phylogeny by maximum likelihood using either:
Exercise - Phylogenetic Workflow Using Unrelated Sequences
Try a similar analysis to that described above - however, this time, the initial set of sequences we're providing you with are NOT ALL RELATED TO EACH OTHER! (They are a mixture of p53, nupl2, and catalase sequences)

This means that you should certainly not attempt to estimate a phylogenetic tree from this complete set of sequences. However, the principle of GIGO (Garbage In, Garbage Out) also applies to phylogenetic analysis: even if your input is "garbage", i.e. a set of unrelated sequences, there's nothing stopping you from processing it under the assumption that the sequences are related (despite the fact that any conclusions drawn from the resulting trees, which assume that all sequences in the input are related to each other, are meaningless).

Thus, this exercise aims both to:
This is a link to the initial set of sequences, unaligned, in FASTA format

1. Aligning sequences using MUSCLE
2. Identifying and removing sequences with likely errors using JalView
3. Building a reduced alignment
4. Getting a quick phylogeny estimate using CLUSTALX
5. Aligning "new" sequences to a pre-existing (unreduced) alignment using CLUSTALX

To keep in the spirit of GIGO, let's add a set of three SRC-related proteins, which are unrelated to any of the sequences currently in the alignment
6. Trim the alignment again, using GBLOCKS or JalView
7. Change from FASTA to PHYLIP format with CLUSTALX
8. Estimate phylogeny using RAxML from the command line
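The FASTA-to-PHYLIP conversion in the list above is done with CLUSTALX in this course, but it can help to see what the conversion actually involves. The following is a minimal Python sketch, assuming an already-aligned FASTA file and the "strict" sequential PHYLIP layout (a header with the number of sequences and alignment length, then names padded to 10 characters). The function names and the toy sequences are made up for illustration; they are not part of any tool used here.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Keep only the first word of the header as the sequence name
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_phylip(records):
    """Format aligned (name, sequence) pairs as sequential PHYLIP text."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("sequences must all be aligned to the same length")
    names = [name for name, _ in records]
    if any(len(name) > 10 for name in names):
        raise ValueError("strict PHYLIP names are at most 10 characters")
    if len(set(names)) != len(names):
        raise ValueError("sequence names must be unique")
    lines = [f" {len(records)} {lengths.pop()}"]
    for name, seq in records:
        # Pad each name to exactly 10 characters, then append the sequence
        lines.append(f"{name:<10}{seq}")
    return "\n".join(lines) + "\n"


# Toy aligned input: names and residues are invented for illustration
example_fasta = """>HUMAN_P53
MEEPQ-SDP
>MOUSE_P53
MTAMEESQS
"""
phylip_text = to_phylip(parse_fasta(example_fasta))
print(phylip_text)
```

The checks on name length and alignment length are the two things that most often go wrong in this conversion, which is why the sketch fails loudly on them rather than silently truncating.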
You might like to try something similar, this time actually using a set of sequences that ARE related to each other?!

So, if you complete the above analysis much quicker than other participants, try doing a similar exercise, this time using a set of proteins, all of which are related to the E. coli magnesium and cobalt efflux protein corC (UniProt:P0AE78;CORC_ECOLI)
Demonstration - Obtaining Pre-Calculated Phylogenies
The section above and this "how-to" link describe several different resources that provide pre-calculated trees.

This demonstration makes use of such resources, focusing on using them to identify a tree that can be used to address a specific question, and highlighting some common problems with this kind of analysis.

If your work is strongly dependent on obtaining an as-good-as-possible phylogeny to address a specific question, you will almost always want to estimate your own trees - this allows you to design the analysis optimally to address your own, specific question. However, often we want to look at a tree for a gene family just as "background" for other work we are doing - where a quick overview of the evolutionary history and diversity of a family is all we are looking for. In that case it is often possible to find a pre-calculated phylogeny that can give us that overview, saving considerable time and effort.

To carry out an exercise aimed at identifying an appropriate phylogeny to address a particular question, we obviously always need a particular question that we're trying to address!

In this case, we want to identify the Caenorhabditis elegans sequence most closely related to the human cyclin C (UniProt:CCNC_HUMAN;P24863)

1. Identify initial sequence of interest

The first step is to find the amino acid sequence of the human cyclin C gene - there are many different ways of doing this, which could be the focus of a lecture (indeed an entire course) of its own, so we won't spend much time on this step. We'll try it two ways:
2. Identify pre-calculated trees for this sequence in one (or more) different data resources - to do this either:
3. Examine the pre-calculated phylogeny
Here are appropriate trees containing human and worm cyclin C sequences
As an exercise - try to identify the worm (C. elegans) sequence most closely related to the human ATP-dependent RNA helicase DDX58 gene (UniProt:O95786;DDX58_HUMAN), using pre-calculated phylogenetic trees.

Here are the "answers" to check your results against...

There are several reasons why we might have problems using this approach to obtain an appropriate phylogeny (i.e. one that enables us to answer our specific question of interest):

A. The databases (of pre-calculated trees) have no tree which includes all the necessary/relevant/important sequences needed to resolve the question
B. Databases contain a tree with all relevant/important sequences but you are unable to find the appropriate records

A particular problem when using a database (e.g. OPTIC) which:
As an example, try querying OPTIC to identify trees/alignments associated with the fly dicer protein (UniProt:Q9VCU9;DCR1_DROME) using the following synonyms (here's the link to the record)
C. Once you've identified an appropriate record within a database, it can still be difficult to find the right buttons to click to get to the information you need

For example, finding the trees for the OPTIC record for fly dicer, or the alignments from the TreeFam record for the same family

D. Once you've obtained an appropriate tree, it can still be difficult to identify which sequences in the tree correspond to your sequences of interest

For example, try to find the record in this eggNOG record which corresponds to the human cyclin C
Exercise 12 - Obtaining Pre-Calculated Phylogenies
Using a similar approach as described above, this exercise focuses on identifying the worm (C. elegans) sequence(s) most closely related to a set of sequences, using pre-calculated trees.

In some cases this is (relatively) trivial - in others either difficult or impossible.

Thus, if you have problems with a particular sequence (i.e. if you find it difficult to identify the most-closely related worm sequence), try and work out the cause of the problems, and to identify possible approaches to overcoming them.

You might also like to try this out with your own sequences of interest

1. Eukaryotic translation initiation factor 4E type 2
UniProt link - O60573 - IF4E2_HUMAN

2. Insulin-like growth factor-binding protein 10
UniProt link - O00622 - CYR61_HUMAN

3. Forkhead box protein B2
UniProt link - Q5VYV0 - FOXB2_HUMAN
Demonstration - Phylogeny From Start to Finish with
We'll try and identify the lamprey sequence most closely related to human CYR61 (UniProt:O00622;CYR61_HUMAN, here in FASTA format)

1. Get an appropriate initial set of pre-aligned sequences

We'll follow the link from the UniProt record to the HOGENOM database to get our hands on an initial alignment
2. Change sequence names to make the question easier to answer

To identify the query sequence in the alignment, we load it into JalView, search for a region of sequence that is unique to CYR61_HUMAN, and rename that sequence within the alignment/JalView
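The "search for a unique region, then rename" step above is done interactively in JalView, but the logic is simple enough to sketch in Python. This is only an illustration under stated assumptions: the alignment is held as a list of (name, sequence) pairs, gaps are written as "-", and the identifiers, residues and motif below are invented (they are not real HOGENOM or CYR61 data). The function name `rename_by_motif` is hypothetical.

```python
def rename_by_motif(records, motif, new_name):
    """Rename the single sequence whose ungapped residues contain `motif`.

    Raises ValueError unless exactly one sequence matches, since a motif
    that matches several sequences cannot identify the query.
    """
    hits = [
        index
        for index, (_, seq) in enumerate(records)
        if motif in seq.replace("-", "")
    ]
    if len(hits) != 1:
        raise ValueError(f"motif matched {len(hits)} sequences, expected 1")
    renamed = list(records)
    renamed[hits[0]] = (new_name, records[hits[0]][1])
    return renamed


# Toy alignment: identifiers and residues are made up for illustration
alignment = [
    ("HOG000001", "MKT-AILLC"),
    ("HOG000002", "MRTSAVL-C"),
    ("HOG000003", "MKSSPILLC"),
]
renamed = rename_by_motif(alignment, "TAIL", "CYR61_HUMAN")
print(renamed[0])
```

Gaps are stripped before searching because a region copied from the unaligned sequence may be interrupted by gap characters in the alignment.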
3. Add additional sequences to the alignment to help address the question

As we're interested in lamprey sequences - let's go and get some using BLAST

A query against the nr protein database doesn't match any sequences when restricted to "Petromyzontidae [ORGN]" (which is the NCBI taxon name for the lampreys)

However there are some ESTs - here's a file with four of the matching ESTs in FASTA format

It makes sense to cluster the ESTs before working with them further - which we do using the CAP3 server
Then we use the EBI's genewise server with each of these sequences to get the protein sequence
We use CLUSTALX to align the lamprey sequences to the HOGENOM sequences
4. Modify alignment (remove sequences, reduce the alignment) in preparation for phylogenetic analysis

We have to remove some of the sequences to be able to estimate the phylogeny, as currently every column in the alignment contains a gap
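To see why removing sequences helps here, it can be useful to count how many alignment columns are free of gaps before and after dropping fragmentary sequences. The sketch below assumes that the downstream analysis only uses gap-free columns (which is what the statement above implies, but depends on the tool), that gaps are written as "-" or ".", and that a simple per-sequence gap-fraction threshold is an acceptable way to pick sequences to drop. The threshold, names and toy alignment are all invented for illustration.

```python
GAP_CHARS = "-."


def gap_free_columns(seqs):
    """Count alignment columns in which no sequence has a gap character."""
    return sum(
        all(residue not in GAP_CHARS for residue in column)
        for column in zip(*seqs)
    )


def gap_fraction(seq):
    """Fraction of positions in one aligned sequence that are gaps."""
    return sum(residue in GAP_CHARS for residue in seq) / len(seq)


def drop_gappy(records, max_gap_fraction):
    """Keep only (name, sequence) pairs at or below the gap threshold."""
    return [
        (name, seq)
        for name, seq in records
        if gap_fraction(seq) <= max_gap_fraction
    ]


# Toy alignment: two full-length sequences and two non-overlapping fragments
records = [
    ("SEQ_A", "MKTAILLC"),
    ("SEQ_B", "MKSAVLLC"),
    ("FRAG_1", "----ILLC"),
    ("FRAG_2", "MKTA----"),
]
before = gap_free_columns([seq for _, seq in records])
kept = drop_gappy(records, max_gap_fraction=0.4)
after = gap_free_columns([seq for _, seq in kept])
print(before, after, [name for name, _ in kept])
```

In a real analysis you would not drop sequences purely by threshold: a fragment from a taxon that matters for your question (such as the lamprey sequences here) may be worth keeping at the cost of fewer usable columns.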
We then either
5. Choose a model - we can do that using the ProtTest server  (or a local installation of the software)

For that we need to change the format of the alignment to PHYLIP using CLUSTALX

This suggested the JTT matrix + G as the model with best fit - although by some of the sort criteria, alternatives scored better. One could run the analysis using several models and check to see if these yield trees that provide the same answer to our initial question.
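The reason different criteria can rank models differently is that they penalise extra parameters differently. As a minimal sketch, the standard definitions of two commonly used criteria are AIC = 2k - 2lnL and BIC = k ln(n) - 2lnL, where lnL is the log-likelihood, k the number of free parameters, and n the sample size (often taken to be the alignment length, which is itself an assumption). The log-likelihoods, parameter counts and alignment length below are hypothetical numbers chosen for illustration, not ProtTest output.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion: lower is better."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n_sites):
    """Bayesian information criterion: lower is better."""
    return k * math.log(n_sites) - 2 * log_likelihood


# Hypothetical values for illustration only (not real ProtTest results):
# model name -> (log-likelihood, number of free parameters)
N_SITES = 200
models = {
    "JTT+G": (-5000.0, 40),
    "JTT+G+F": (-4975.0, 59),  # +F adds 19 free amino-acid frequencies
}

aic_scores = {name: aic(lnl, k) for name, (lnl, k) in models.items()}
bic_scores = {name: bic(lnl, k, N_SITES) for name, (lnl, k) in models.items()}

best_aic = min(aic_scores, key=aic_scores.get)
best_bic = min(bic_scores, key=bic_scores.get)
print("best by AIC:", best_aic)
print("best by BIC:", best_bic)
```

With these made-up numbers the richer model wins under AIC while the simpler one wins under BIC, because BIC's per-parameter penalty, ln(n), exceeds AIC's penalty of 2 for any alignment longer than about 7 sites.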

6. Estimate phylogeny

We'll use a quick option (ProtDist/FastDist + Neighbor, with just 10 bootstraps)
Exercise 13 - Phylogeny From Start to Finish with
For this exercise, assume that you want to understand the history of the human Histone acetyltransferase GCN5 gene, and its paralog Histone acetyltransferase PCAF. In particular, you want to determine approximately when the duplication that yielded these two genes occurred.

1. Identify the TreeFam family corresponding to these two genes, and download a protein sequence alignment for the family
2. Try to add some amphioxus sequences (Cephalochordata) to the alignment: find them using blastp at the NCBI BLAST server, then use CLUSTALX to align these sequences to the TreeFam alignment
If you fancy a challenge, have a go at adding sequence from the annelid worm Capitella teleta from the JGI

3. Follow the same procedure as for the previous demonstration
So... when do you think the duplication event occurred?
Demonstration - Phylogeny from Start to Finish with Command-Line RAxML and ProtTest
As discussed above, an exercise that involves going from question to phylogeny must be done in the context of a specific biological question.

For this demonstration, we assume that we are interested in understanding the history of the human Histone acetyltransferase GCN5 gene, and its paralog Histone acetyltransferase PCAF. In particular, we want to determine approximately when the duplication that yielded these two genes occurred.

1. Identify the TreeFam family corresponding to these two genes, and download a protein sequence alignment for the family
1a.  We could also try obtaining an initial set of unaligned sequences to begin the analysis, via BLAST at the NCBI
2. Use CLUSTALX and Jalview to identify sequences you want to remove from the alignment, and to remove them. You might want to do this for sequences that you feel are:
    1. unnecessary for the analysis as they are identical/nearly identical to other sequences in the alignment
    2. incomplete/fragmented in a way that would exclude large regions of the alignment from later analysis
    3. poorly aligned/likely to contain sequence errors
      1. Note that, in a "real" analysis, you might well want to attempt to resolve some of these issues i.e. correct alignment errors, check whether "unusual" sequence is likely to be due to errors or is, in fact, real. However, for the sake of speed, we'll for now just exclude these sequences from the analysis
      2. the alignment might look something like this
3. In CLUSTALX switch on the Quality->Show Low-Scoring Segments option
4. Load the alignment in Jalview and (informed by the regions highlighted as low-scoring by CLUSTALX), remove those columns from the alignment where you are not confident that all residues in the column are "evolutionarily" equivalent i.e. related via single-residue substitutions.
Alternatively, you could do this automatically using the GBLOCKS server

5. Edit the taxa names in the FASTA format file so that they are all 10 characters long, and contain only capital letters and/or underscores. Ideally, you should be able to identify the organism the sequence comes from and, if there is more than one sequence from the same organism, the name should also make it possible to distinguish between those sequences
6. Save the alignment in PHYLIP format using CLUSTALX

7. Run ProtTest to identify the protein substitution model that best describes the sequences in your alignment
8. Use RAxML to estimate a set of non-parametric bootstrapped trees from this alignment - to keep the analysis as quick as possible, calculate only 10 bootstrapped trees
9. Obtain a single best estimate of the tree from the alignment using a string such as:
10. Combine the results of these two runs to see the bootstrap support for the branches in the maximum likelihood tree using RAxML e.g.:
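As a rough sketch of the three RAxML runs described in the last steps above (bootstrap replicates, a single best tree, and mapping bootstrap support onto that tree), the Python snippet below just assembles and prints the command lines. It assumes the classic RAxML 7/8 command-line options (`-s`, `-n`, `-m`, `-p`, `-b`, `-#`, `-f b`, `-t`, `-z`), that the executable is called `raxmlHPC` on your system, and that `PROTGAMMAJTT` is merely a placeholder for whichever model ProtTest suggests. The run names, seed and alignment filename are arbitrary choices for illustration.

```python
import shlex


def raxml_commands(alignment, model="PROTGAMMAJTT", n_boot=10,
                   seed=12345, binary="raxmlHPC"):
    """Return three RAxML argument lists: bootstraps, best tree, support."""
    seed = str(seed)
    # 1. Non-parametric bootstrap replicates -> RAxML_bootstrap.boot
    boot = [binary, "-s", alignment, "-m", model, "-p", seed,
            "-b", seed, "-#", str(n_boot), "-n", "boot"]
    # 2. Single maximum likelihood search -> RAxML_bestTree.best
    best = [binary, "-s", alignment, "-m", model, "-p", seed, "-n", "best"]
    # 3. Draw bootstrap support on the best tree -> RAxML_bipartitions.support
    support = [binary, "-f", "b", "-m", model, "-p", seed,
               "-t", "RAxML_bestTree.best",
               "-z", "RAxML_bootstrap.boot", "-n", "support"]
    return [boot, best, support]


# "alignment.phy" is a placeholder filename for your PHYLIP alignment
commands = raxml_commands("alignment.phy")
for command in commands:
    print(shlex.join(command))
```

The printed lines can be pasted into a terminal, or each list passed to `subprocess.run(command, check=True)`. Ten bootstrap replicates keeps the demonstration fast but is far too few for a real analysis.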
Exercise 14 - Phylogeny from Start to Finish with Command-Line RAxML and ProtTest
Carry out a similar analysis to that described above.

This time the scenario is that you are interested in the evolution of the human Polyadenylate-binding protein 2 and its paralog Embryonic polyadenylate-binding protein 2. The duplication that yielded the two genes probably occurred after the divergence of the urochordate from the vertebrate lineage. It has been suggested that the embryonic copy of the gene has been evolving much faster than the other copy. Your aim is to investigate the evolution of the vertebrate sequences of this family, looking to see whether (by simply inspecting the resulting phylogeny) there seems to be a difference in the rate of evolution of the two paralogs. If so, use the tree to decide when you think it's most likely that this difference in rate was established.

Begin by finding the TreeFam record that corresponds to this family
Using CLUSTALX/Jalview (follow these links for instructions on using CLUSTALX and Jalview)

Refine the alignment
Reduce the alignment
You could also try doing this automatically using the GBLOCKS server

Change the sequence names
Using ProtTest (follow this link for instructions on using ProtTest)
Using RAxML (follow this link for instructions on using RAxML in this way)
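The "Change the sequence names" step above (names exactly 10 characters long, capital letters and/or underscores only, organism recognisable, duplicates distinguishable) can be tedious by hand for larger alignments. Here is one possible way to script it, as a sketch only: the naming scheme (organism label, underscore padding, then a letter A, B, C... to tell apart sequences from the same organism) is my own assumption, not a convention required by RAxML or ProtTest, and the function name is hypothetical.

```python
import re
from collections import defaultdict
from string import ascii_uppercase


def make_names(organisms, width=10):
    """Return one fixed-width, upper-case name per entry in `organisms`.

    Each name is the organism label (non-letters replaced by underscores,
    truncated if needed), padded with underscores, ending in a letter that
    distinguishes multiple sequences from the same organism.
    """
    counts = defaultdict(int)
    names = []
    for organism in organisms:
        base = re.sub(r"[^A-Z]", "_", organism.upper())[: width - 2]
        if counts[base] >= len(ascii_uppercase):
            raise ValueError(f"more than 26 sequences for {organism!r}")
        letter = ascii_uppercase[counts[base]]
        counts[base] += 1
        names.append(base + "_" * (width - len(base) - 1) + letter)
    return names


names = make_names(["Human", "Human", "Mouse", "C. elegans"])
print(names)
```

Keep a table mapping the new names back to the original identifiers, since the short names on their own will not let you find the sequences in the source database again.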
In which lineage do you think the change in amino-acid substitution rate occurred within the family?

Can you think of some of the assumptions you are making when you draw this conclusion?

(You'll find possible answers to these questions by following this link.)
Exercise 15 - Phylogeny from initial sequences to answering a question, the "Louisiana gastroenterologist" case
Metzker et al. PNAS 2002 describe a phylogenetic analysis of HIV env/gp120 sequences that was used as evidence in the trial of a Louisiana gastroenterologist accused of deliberately infecting someone (the "victim") with HIV-infected blood from one of the gastroenterologist's patients.

In this exercise, you are asked to try an analysis the same as (or at least similar to) that carried out by Metzker et al., with the aim of deciding how well the data supports the hypothesis that the victim (whose sequence identifiers all begin with a "V") was directly infected by blood taken from the patient (whose sequence identifiers all begin with a "P").

Carry out the analysis beginning with this file of 30 unaligned env/gp120 sequences. This contains:
  1. selected sequences (so that the analysis doesn't take so long to run) taken from the complete set analysed by Metzker
  2. two additional, more divergent "reference" sequences
To place the analysis in context, consider that the aim of the prosecution questioning of the expert witnesses in this case was to establish whether the virus samples in the victim and the patient were "closely related". This article, from the Croatian Medical Journal, (Budowle and Harmon, 2005; 46:514-21) gives some background on the legal context of this questioning. The jury was (at least as far as I understand it) charged with addressing the question whether it was beyond reasonable doubt that the victim was infected as a result of the actions of the gastroenterologist; the results of the phylogenetic analysis are only a small part of the evidence they considered when trying to answer this question.

You might also be interested to read the following documents, which are the decisions of the first and second appeals to the guilty verdict; they contain some comments/quotes from some of the expert phylogenetic witnesses:
Ideally, just use your previous experience to decide how to do this - however, if you need help/suggestions, you could look at the "Phylogeny analysis workflow" section of this presentation for ideas, and/or look at this link which makes some more explicit suggestions on what you could try in this particular case

While doing the analysis, you may find it useful to consider the following issues/questions:
For the sake of completeness, here are files containing
  1. the full set of env/gp120 nucleotide sequences used in the study
  2. the full set of RT-pol sequences described in the study
  3. the trimmed-down set of these sequences described above (this is a link to exactly the same file as that from within the text describing this exercise)
Demonstration 16 - Phylogeny from initial sequences to answering a question using North African rabies virus sequences

Sequences are taken from this paper (PubMed ID: 21060816), which analyses the possible influence of human activity on spread of rabies virus amongst endemic dog populations in North Africa.

Obtaining sequences

We might be given them by a collaborator

Or we might get them ourselves from a sequencing machine.

Or we might download a set of sequences that are first reported in a particular publication via PubMed.

Aligning sequences

There are many different tools available, and different tools are better suited for different "kinds" of sequence sets (length, whether DNA, RNA, or protein, number of sequences, how similar the sequences are to each other).

One option (fairly good for fairly divergent protein sequences) would be the MUSCLE webserver at the EMBL-EBI.

Here they are as aligned by MUSCLE

Here is an example of the same set of sequences aligned together with some more divergent ones

Model Testing

For this we use jModelTest (for nucleotide sequences) or ProtTest (for protein sequences) as appropriate.

Estimating a phylogeny

Tools typically used to estimate phylogenies for publication are mostly command-line based, and we don't have time to teach how to use these today. Instead we'll try using some webservers that have wrapped some of these programs up in a relatively easy-to-use interface that offer some, but not all, of the functionality of the tools.

Obtaining rabies sequences - exploring some of the many different possible sources of sequences

We'll take phosphoprotein complete CDS nucleic acid sequences as examples

Looking for sequences first described in a particular publication?

Query PubMed with the PubMed ID, and link out to Nucleotide and/or Protein sequences, and export in an appropriate format (e.g. FASTA)
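The same "PubMed ID, link out to sequences, export as FASTA" route can also be followed programmatically through the NCBI E-utilities. The sketch below only builds the two request URLs (an `elink` call from PubMed to the Nucleotide database, and an `efetch` call that returns FASTA text for a list of sequence UIDs); it does not contact the server. The helper function names are hypothetical, and the PubMed ID is the one for the rabies paper mentioned in this section.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def elink_url(pubmed_id, target_db="nuccore"):
    """URL listing records in `target_db` linked to one PubMed record."""
    params = {"dbfrom": "pubmed", "db": target_db, "id": str(pubmed_id)}
    return f"{EUTILS}/elink.fcgi?{urlencode(params)}"


def efetch_fasta_url(uids, db="nuccore"):
    """URL that returns the given sequence UIDs as FASTA text."""
    params = {
        "db": db,
        "id": ",".join(str(uid) for uid in uids),
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


# PubMed ID of the North African rabies virus paper used in this section
link_url = elink_url(21060816)
print(link_url)
```

Opening the printed URL in a browser returns XML listing the linked sequence UIDs; passing those UIDs to `efetch_fasta_url` gives a URL for the FASTA records. Use `target_db="protein"` and `db="protein"` for protein sequences, and check NCBI's usage guidelines before sending many requests.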

Using sequence similarity searching to identify similar sequences

Pre-calculated alignments

For rabies phosphoprotein CDSs, there's no resource I can find that provides pre-aligned sets of sequences. Thus, we'll try instead looking for alignments of vertebrate COX2 protein sequences.

We can try this by searching the Internet directly with a generic search engine - trying this in December 2013 I hit these pages:

I can try linking out from the UniProt record for the protein, in the "phylogenomic databases" and "Family and domain databases" sections.

Or I can go to sites I know have such alignments available such as:
Note that there are two, completely different and almost certainly unrelated proteins called COX2 in humans (PGH2_HUMAN and COX2_HUMAN)!

Phosphoprotein complete CDS nucleic acid sequences

Unaligned FASTA sequences
Aligned FASTA sequences - google "MUSCLE EBI" or "WEBPRANK" to get links to MUSCLE and PRANK webservers:
Modeltest results from PRANK alignment - use JModelTest on local machines to run modeltest: result (HKY + Gamma) - use a la carte mode to run analysis, and view NEWICK tree using a tree viewer locally on your machines

Glycoprotein partial CDS and intergenic spacer

Unaligned FASTA sequences

Try a similar analysis as used for the Phosphoprotein complete CDS nucleic acid sequences shown above

Phosphoprotein amino acid sequences

Begin with this small subset of the rabies phosphoprotein sequences analysed in the paper:
