Exercises and Demonstrations
Displaying phylogenetic trees
Here we will look at several different software tools that can be used
to display/draw phylogenetic trees. This should help you start
exploring both pre-calculated trees, and ones you have estimated
yourself, along with helping you become more familiar with looking at
and thinking about trees.
NJPLOT is a relatively simple program initially developed to show the
results of bootstrapped Neighbour-Joining phylogenies. It is often
useful as a relatively quick and easy way of viewing the results of an
analysis. While it can accept unrooted NEWICK-format trees as
input, it can only display such trees with a root. It won't
accept trees with any polytomies apart from a single trichotomy (used to
specify the position of the root).
DENDROSCOPE is a more comprehensive program that is able to produce
trees of nearly publication quality relatively quickly (although in
most cases the resulting images require additional editing in other
software such as Adobe Illustrator). It can display both rooted and
unrooted trees, with scaled and unscaled formats.
MESQUITE (which we will use again in a later exercise/practical) is an
extremely versatile and flexible software package for carrying out a
wide range of different kinds of evolutionary analyses. This
flexibility comes with a relatively steep learning curve. We will use it
to visualise AND EDIT evolutionary trees: both DENDROSCOPE and NJPLOT
can only display the data provided in a NEWICK-format tree file,
while MESQUITE lets you manually alter the topology and
branch lengths of a tree (which can be useful for preparing
illustrative figures, for preparing trees for input to other
software, or for other analyses within MESQUITE).
We will begin by downloading a pre-calculated alignment, using the
alignment to obtain a rough estimate of the phylogeny of the sequences,
and will then visualise the resulting tree using NJPLOT, DENDROSCOPE,
and MESQUITE.
Suppose we are interested in investigating the phylogeny of the
drosophilid Oskar genes. To begin our analysis we go to the set of
pre-calculated alignments of sets of "orthologous groups" of genes from
Chris Ponting's group's OPTIC
resource (these are available for other groups e.g. mammals, yeasts,
not just drosophilids).
To identify the relevant group of genes, we need to find the relevant
identifier for the Oskar genes - we do this by searching FlyBase.
We query the OPTIC database with CG10901 (the relevant ID - although one
actually needs to query with "CG10901-RA")
We click on the link that leads us to the alignments (the rather
cryptic "group 3297")
Choose Alignments->Transcripts-AminoAcids, and then export in FASTA format
We then use ClustalX to calculate a quick bootstrapped tree from this
alignment
- Trees->Exclude positions with gaps
- Trees->Correct for multiple substitutions
- Trees->Bootstrap NJ Tree
We look at the resulting tree in NJPLOT
Dendroscope can also examine this tree with no trouble
- re-rooting tree
- examining subtrees
- rotating branches
- accepts long names and hyphens in names
To work on the tree in Mesquite, we need to edit the file
- different tree representations
- rotate branches (after selecting a NODE)
- hide Edge Labels to lose BS values
- change formatting of branches/labels
- zoom in on the tree
- search for taxa names
Mesquite, as mentioned, allows you to edit the tree
- put all text on one line
- remove all the hyphens
- "interchange branches" tool
- adjust branch-lengths
As you've seen, often tree-viewing software has problems displaying the
trees obtained from some sources - we'll look at trees from the TreeFam
database, which some of the software has problems with.
To get the Oskar sequences from TreeFam,
we query either with the External Accession number from UniProt "P25158"
or with "CG10901" under "gene name" (we could find the relevant
accession number for our gene by BLASTing our sequence of interest
against UniProt - this should identify an ID that will be within
TreeFam) - here's the alignment.
Loads OK into NJPLOT (it includes the node labels)
To load it into Dendroscope, we need to delete all of the comments (i.e.
everything within "[....]") - here
it is without the comments.
Again, to get it into Mesquite we need to put everything onto one line
- here it is.
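The manual edits described above (deleting the "[....]" comments for
Dendroscope, then putting everything on one line and removing hyphens
for Mesquite) could also be scripted. The following is only a minimal
Python sketch, not part of the original exercise: the function names
are hypothetical, hyphens are replaced by underscores by default (pass
"" to delete them instead), and quoted labels are not handled.

```python
import re

def strip_comments(newick: str) -> str:
    """Remove every square-bracket comment, e.g. '[&&NHX:S=DROME]'."""
    return re.sub(r"\[[^\]]*\]", "", newick)

def to_single_line(newick: str) -> str:
    """Join a multi-line tree description into a single line."""
    return "".join(line.strip() for line in newick.splitlines())

def remove_hyphens_from_labels(newick: str, replacement: str = "_") -> str:
    """Replace hyphens in node labels only.

    A label is the text that directly follows '(', ',' or ')'. Branch
    lengths follow ':' and are left alone, so values such as '-0.2' or
    '1e-05' survive unchanged.
    """
    def fix(match):
        return match.group(0).replace("-", replacement)
    return re.sub(r"(?<=[(,)])[^(),:;]+", fix, newick)

# Toy input (not the TreeFam tree) showing the three steps in order.
raw = "(CG10901-RA:0.1[&&NHX:S=DROME],\n (a-b:1e-05,c:-0.2)90[x]:0.3);"
cleaned = remove_hyphens_from_labels(to_single_line(strip_comments(raw)))
print(cleaned)
```

Keeping the three steps as separate functions means the comment
stripping alone can be used for Dendroscope, with the other two added
only when preparing a file for Mesquite.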
Download this tree file
(TF101051, cdc6 relatives from TreeFam with several of the branches
Load the tree into NJPLOT. Ignoring formatting, rearrange the branches of
the tree in NJPLOT to make them the same as in the image below.
Load the tree into DENDROSCOPE. Carry out the same exercise, but this
time use the formatting options to make the tree as similar as possible
to the one in the image below.
Load the tree into MESQUITE, and edit the branches of the tree so that the
following (wrong) relationships are indicated:
- flies more closely related to yeast than to humans
- human more closely related to frog (XENTR) than to mice
Additionally, reduce the size of the two terminal "YEAST" branches by
Delete the two ray-finned fish sequences (GASAC and ORYLA)
Save the new tree, and load this into DENDROSCOPE.
Visit this link to
the gnathostome page on the Tree of Life.
Using a text editor, create a file to represent the tree at the top of
the page that can be successfully displayed in NJPLOT. If that's too
easy, try doing the same thing with their amniotes page.
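If you would rather generate the file than type the parentheses by
hand, a tree can be written down as nested groups and serialised. This
is a small Python sketch under stated assumptions: the taxon names
below are placeholders (not the groups on the Tree of Life page), and
the helper names are made up for illustration. Keeping every group to
exactly two members avoids the polytomies that NJPLOT rejects.

```python
def to_newick(node) -> str:
    """Serialise a nested tuple-of-names structure as a Newick subtree."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"

def write_tree(node, path: str) -> None:
    """Write the tree to a file, terminated by the required semicolon."""
    with open(path, "w") as handle:
        handle.write(to_newick(node) + ";\n")

# Placeholder topology: replace the names with the groups you read off the page.
example = (("TaxonA", "TaxonB"), ("TaxonC", ("TaxonD", "TaxonE")))
print(to_newick(example) + ";")
```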
Try the same exercise using the UCMP's "History
of Life" resource page for the angiosperms/Anthophyta.
This time, however, use MESQUITE to construct the tree file.
If you have extra time, look through either "Tree of Life" or "History
of Life" to find a group of organisms that interests you (I
find it easiest to just start from the root nodes and work up), and try
the same exercise.
Relationship between tree and alignment probabilities
The aim of this demo is to give a more intuitive grasp of the
link between branch lengths, substitution models, and how these
parameters affect the probability of observing a given alignment.
Swapping the logic around, this should also help with our
understanding of how different alignments make different trees/models
more likely than others.
We load the following NEXUS
format file into MESQUITE - this describes two trees with the same
topology but with different branch-lengths
In MESQUITE we simulate DNA alignments from these trees using the
following menu choices
- Characters -> Make New Matrix From -> Simulated Matrices on
Current Tree -> Evolve DNA Characters -> Jukes-Cantor
Note that the frequency with which different "patterns" of alignment
columns are found in these alignments differs, e.g. columns where all
residues are the same, columns where A and B have the same pattern etc.
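To see why branch length changes the pattern frequencies, it helps to
look at the Jukes-Cantor model directly. For a branch of length d
(expected substitutions per site), the probability that the two ends of
the branch carry different bases at a site is 3/4 * (1 - exp(-4d/3)).
The Python sketch below simply evaluates this standard formula; the
function name and the example branch lengths are illustrative choices,
not values taken from the NEXUS file.

```python
import math

def jc_p_different(branch_length: float) -> float:
    """Probability that the two ends of a branch differ at one site
    under Jukes-Cantor; branch_length is in expected substitutions per site."""
    return 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))

# Short branches give mostly identical columns; long branches approach
# the saturation value of 3/4 (no better than two random bases).
for d in (0.01, 0.1, 0.5, 2.0):
    print(f"d = {d:<5} P(different) = {jc_p_different(d):.3f}")
```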
We can also add additional taxa to the trees, and run further
simulations
- SHOW Taxa Block (from "Project" Window)
- List -> Add Taxa
- Tree->Alter/Transform Tree->Add Taxa (change position with the
Move Branch tool, add a branch length using the Adjust Branch
Length tool)
Based on where the additional branch(es) are placed, and their branch
lengths, what pattern do you expect to see most often in the alignment
compared to the expected-to-be-most-similar OTUs?
Using the Adjust Branch Length tool, create trees (and simulate
datasets from them) where
- almost all alignment positions have the same residue for all
sequences
- three of the sequences almost always have the same residue in a
column, while a fourth sequence often (in more than half of the
columns) has a different residue from the others
- the sequences have diverged so much that there does not seem to
be any phylogenetic correlation between the sequences
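The three scenarios can also be explored outside MESQUITE with a very
small simulation. The Python sketch below is a simplification and makes
several assumptions: it uses a star tree (every sequence hangs directly
off the root) rather than the trees in the exercise file, it uses the
Jukes-Cantor model, and all function names and branch lengths are
invented for illustration.

```python
import math
import random

BASES = "ACGT"

def evolve(base, branch_length, rng):
    """Evolve one site along one branch under Jukes-Cantor."""
    p_change = 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))
    if rng.random() < p_change:
        return rng.choice([b for b in BASES if b != base])
    return base

def simulate_star(branch_lengths, n_sites, seed=1):
    """Simulate alignment columns for sequences on a star tree.

    branch_lengths holds one root-to-tip length per sequence; each
    returned string is one alignment column.
    """
    rng = random.Random(seed)
    columns = []
    for _ in range(n_sites):
        root = rng.choice(BASES)
        columns.append("".join(evolve(root, t, rng) for t in branch_lengths))
    return columns

def fraction_identical(columns):
    """Fraction of columns in which every sequence has the same base."""
    return sum(len(set(col)) == 1 for col in columns) / len(columns)

scenarios = {
    "all short": [0.01, 0.01, 0.01, 0.01],
    "one long branch": [0.01, 0.01, 0.01, 5.0],
    "all long": [5.0, 5.0, 5.0, 5.0],
}
for label, lengths in scenarios.items():
    cols = simulate_star(lengths, n_sites=1000)
    print(label, round(fraction_identical(cols), 3))
```

Changing the lengths in `scenarios` and re-running gives a quick feel
for how far a branch must be stretched before a pattern stops being the
most common one.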
Estimating Phylogenies with RAxML
- Obtain the seed alignment for TreeFam
- Load the alignment into ClustalX and examine it - we remove some
of the sequences to make the ML analysis quicker
- Save in FASTA format
- Rename sequences to make them short and alphanumeric only
- Run FASTA format file through GBLOCKS to
remove columns with gaps, and which are likely to be misaligned, saving
the resulting outfile locally (could do this editing "manually" using
SEAVIEW or JALVIEW)
- Load GBLOCKed fasta file into CLUSTALX and save in PHYLIP format
- Load PHYLIP format file into RAxML webserver
provided at CIPRES
- No bootstrapping
- Choose Analysis: WAG + Gamma + F
- Choose Tool: RAxML
- Give email address
- Run Analysis
- View resulting tree in NJPLOT/Dendroscope
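The renaming and format-conversion steps above are done by hand or with
CLUSTALX in this exercise, but they can also be scripted. The following
Python sketch is one possible way to do it, with assumptions flagged:
the "seq001"-style names are an arbitrary choice, the function names
are hypothetical, and the output is simple sequential PHYLIP (name
padded to 10 characters, then the sequence).

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def short_names(records, prefix="seq"):
    """Rename sequences to short alphanumeric IDs.

    Returns the renamed records and a lookup table from new to old names,
    so the original names can be restored on the final tree.
    """
    renamed, lookup = [], {}
    for i, (name, seq) in enumerate(records, start=1):
        new = f"{prefix}{i:03d}"
        lookup[new] = name
        renamed.append((new, seq))
    return renamed, lookup

def to_phylip(records):
    """Format aligned records as sequential PHYLIP text."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    lines = [f"{len(records)} {lengths.pop()}"]
    lines += [f"{name:<10}{seq}" for name, seq in records]
    return "\n".join(lines) + "\n"

# Toy alignment with awkward names, standing in for the real FASTA file.
fasta = ">ENSP0001-human|x\nAC-\nGT\n>CG10901-RA\nACGGT\n"
renamed, lookup = short_names(read_fasta(fasta))
print(to_phylip(renamed))
```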
Carry out a similar analysis using either your own alignment of
interest, or using TF105084
Demonstration - CLUSTALX
Load the seed alignment for TF105048
into CLUSTALX - run bootstrap analysis with 100 replicates, correcting
for multiple substitutions and examine the result with NJPLOT
- Load this FASTA-format ef1a
dataset (the sequences have already been trimmed to remove
likely-misaligned columns) into CLUSTALX
- Estimate phylogenies for 1000 different bootstrapped datasets
- Examine resulting phylogeny in NJPLOT
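Each "bootstrapped dataset" in the steps above is a pseudo-alignment of
the same size as the original, built by drawing alignment columns at
random with replacement. The Python sketch below illustrates only that
resampling step (CLUSTALX does this internally and then builds a tree
from every replicate); the toy alignment and function name are
assumptions made for the example.

```python
import random

def bootstrap_alignment(sequences, rng):
    """Resample alignment columns with replacement.

    sequences maps each name to its aligned sequence; the result has the
    same names and the same length, but columns drawn at random.
    """
    length = len(next(iter(sequences.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in picks) for name, seq in sequences.items()}

rng = random.Random(42)  # fixed seed so the replicates are reproducible
alignment = {"A": "ACGTAC", "B": "ACGTTC", "C": "ACCTTC"}
replicates = [bootstrap_alignment(alignment, rng) for _ in range(1000)]
print(replicates[0])
```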
Load the same alignment into JalView, SEAVIEW, or just edit the file
manually to introduce a misalignment of one of the sequences such that
most of that sequence is not properly aligned against the others.
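If you choose to edit the file rather than use an alignment editor, one
simple way to misalign a sequence is to shift it a few columns to the
right. The Python sketch below does this for a single aligned sequence;
the function name is hypothetical, and note that it keeps the alignment
length constant by dropping the last few characters of the sequence,
which is acceptable here only because the goal is a deliberately broken
alignment.

```python
def misalign(sequence: str, shift: int, gap: str = "-") -> str:
    """Shift an aligned sequence right by `shift` columns, keeping its length.

    Gap characters are added at the start and the same number of
    characters is trimmed from the end.
    """
    if shift < 0 or shift > len(sequence):
        raise ValueError("shift must be between 0 and the sequence length")
    if shift == 0:
        return sequence
    return gap * shift + sequence[:-shift]

print(misalign("MKVLAAGIVG--", 2))
```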
Calculate Bootstrapped trees as above using ClustalX.
- Are there some branches in the tree that you consider very
likely to have been misplaced?
- What is it about these branches that makes you suspect that
they have been misplaced?
- Are the positions of these probably-misplaced branches within
the tree estimated precisely (i.e. with high bootstrap support)?
- What bootstrap supports are associated with branches "close"
to the misaligned sequence branch?
- Do the bootstrap values of most of the branches in the tree
stay the same or are they rather different compared to the initial tree?
Demonstration - RAxML
Using the same file you prepared for analysis by RAxML above, run the
following command:
raxmlHPC -s infile.phy -n infile.phy.raxml -c 4 -f d -m PROTGAMMAJTT -b 2345211436277 -N 100
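If you need to repeat this run for several alignments, the same command
can be assembled and launched from a script. This Python sketch only
wraps the options shown in the command above; it assumes that a
`raxmlHPC` executable is on your PATH, and the function names and the
default seed are made up for the example.

```python
import subprocess

def raxml_bootstrap_command(alignment, seed=12345, replicates=100):
    """Build the argument list for the bootstrap run shown above."""
    return [
        "raxmlHPC",
        "-s", alignment,              # input alignment (PHYLIP format)
        "-n", alignment + ".raxml",   # run name used for the output files
        "-c", "4",
        "-f", "d",
        "-m", "PROTGAMMAJTT",         # substitution model
        "-b", str(seed),              # bootstrap random seed
        "-N", str(replicates),        # number of bootstrap replicates
    ]

def run_raxml(alignment, **options):
    """Run RAxML and raise an error if it exits with a failure code."""
    return subprocess.run(raxml_bootstrap_command(alignment, **options), check=True)

print(" ".join(raxml_bootstrap_command("infile.phy", seed=2345211436277)))
```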
(Try using MrBayes or ProtTest to
choose a model if you're working with your own sequences. If you want
to build a consensus tree of the resulting trees, you could use
CONSENSE from the PHYLIP package, or SplitsTree.)
Methods that attempt to estimate the precision of a phylogenetic
estimate often involve the estimation of multiple trees from the
dataset (as above for the non-parametric bootstrap analysis). One
problem with such an analysis is how to represent the set of trees
obtained by the analysis in an informative way. Typically this is done
using a consensus tree, or perhaps using a phylogenetic network.
To understand these two different forms of representation you need to
understand the concept of a ‘split’ or ‘partition’ of a phylogenetic
tree. A split is a division of the set of
terminal-nodes/sequences/taxonomic-units of a tree into two mutually
exclusive sets. Thus, any branch of a phylogenetic tree describes a
split, with the terminal branches that are connected to one end of the
branch making up one of the divisions, and the branches connected to
the other end of the branch making up the other division. The
illustration below indicates an internal branch that can be described
as the split [ABC|DE].
A complete description/list of all the splits of this kind for a tree
provides a complete description of the topology of the tree. For
example, the tree in the image above specifies the following set of splits
although, as you have probably noticed, only splits [AB|CDE] and
[ABC|DE] are phylogenetically-informative i.e. provide information
about which taxonomic units cluster together in the unrooted tree.
In this exercise you will be asked (a) to list the set of
phylogenetically-informative splits present in a phylogenetic tree and
(b) reconstruct a phylogenetic tree from a set of splits.
Download the following tree
Visualise the tree in this file using DENDROSCOPE - it is probably
easiest to work with it using an unrooted representation
Describe the complete list of all informative splits in this tree.
For each of the two sets of splits below, reconstruct the unrooted
phylogenetic tree that they describe.
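Before trying to draw a tree from a set of splits, it can be useful to
check that the splits can all sit on one tree. Two splits are
compatible when at least one of the four pairwise intersections between
their sides is empty. The Python sketch below checks this; the helper
names and the single-letter taxon notation ("AB|CDE") are assumptions
made for the example, and the splits used are illustrative rather than
the ones set in the exercise.

```python
from itertools import combinations

def parse_split(text):
    """Turn 'AB|CDE' into a pair of sets of single-letter taxon names."""
    left, right = text.strip("[]").split("|")
    return frozenset(left), frozenset(right)

def compatible(split1, split2):
    """True if the two splits can both occur on the same tree."""
    return any(not (x & y) for x in split1 for y in split2)

def all_compatible(splits):
    """True if every pair of splits in the collection is compatible."""
    return all(compatible(a, b) for a, b in combinations(splits, 2))

good = [parse_split(s) for s in ("AB|CDE", "ABC|DE")]
bad = [parse_split(s) for s in ("AB|CDE", "AC|BDE")]
print(all_compatible(good), all_compatible(bad))
```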
The tree that you should have
reconstructed from question (a)
The tree that you should have
reconstructed from question (b)
Demonstration - SplitsTree
We begin by loading the following set of 100 trees into SplitsTree
- We choose to take the default names given to the trees ("Apply")
We first calculate a consensus network from the trees
- Networks->Consensus Network
- Control-Click a branch to get the label menu up
- Edit->Select Edges
- Show Weight
To calculate the consensus tree
- Trees->Consensus Tree
- Edge Weights->Count
SplitsTree doesn't give you much control over the kind of consensus
tree drawn - CLANN or CONSENSE (from the PHYLIP package) provide much
more control, allowing one to use a much wider range of different
consensus tree methods.
Get a set of trees, and see whether the poorly supported branches on the
consensus tree are due to a few different splits or just lots of
jumping around within the tree
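One way to look into this is to count how often each split occurs
across the set of trees: a few competing splits with moderate support
tell a different story from many splits that each appear only once or
twice. The Python sketch below assumes the splits of each tree have
already been extracted and written as strings such as "AB|CDE"; the
three toy trees and the function names are invented for illustration.

```python
from collections import Counter

def split_support(trees):
    """Fraction of trees in which each split occurs.

    trees is a list in which every tree is given as a set of splits.
    """
    counts = Counter(split for tree in trees for split in set(tree))
    return {split: n / len(trees) for split, n in counts.items()}

def majority_splits(trees, threshold=0.5):
    """Splits found in more than `threshold` of the trees."""
    return {s for s, f in split_support(trees).items() if f > threshold}

toy_trees = [
    {"AB|CDE", "ABC|DE"},
    {"AB|CDE", "ABC|DE"},
    {"AB|CDE", "ABD|CE"},
]
for split, support in sorted(split_support(toy_trees).items()):
    print(split, round(support, 2))
```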
Removing potentially misaligned regions
Molecular phylogenetic analyses assume that amino acids in the same
column of the alignment are related by substitution events. It follows
that we may want to consider removing columns where we suspect this may
not be the case from the alignment - indeed, you have already done this
using GBLOCKS in a previous exercise.
However, sometimes we would prefer to do this "manually" i.e. by
selecting ourselves those columns to be excluded from the analysis.
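As a point of comparison for your manual choices, the simplest possible
automatic rule is to drop every column that contains a gap. The Python
sketch below implements only that rule; it is not the GBLOCKS
algorithm, which also considers conservation and block length, and the
function names and toy alignment are assumptions made for the example.

```python
def gap_free_columns(sequences, gap_chars="-."):
    """Indices of the columns in which no sequence has a gap character."""
    length = len(next(iter(sequences.values())))
    return [
        i for i in range(length)
        if all(seq[i] not in gap_chars for seq in sequences.values())
    ]

def select_columns(sequences, columns):
    """Build a new alignment that keeps only the listed columns."""
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in sequences.items()
    }

alignment = {"A": "AC-GT", "B": "ACGGT", "C": "A-GGT"}
kept = gap_free_columns(alignment)
print(kept, select_columns(alignment, kept))
```

Recording the kept column indices, rather than only the trimmed
alignment, makes it easy to compare them with the columns you selected
by hand.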
To compare automatic with manual "cropping" or "trimming" of your
alignment, download this alignment locally to your machine TF105805.
Begin by processing the alignment using GBLOCKS
(remember, you'll need to convert it to either PIR or FASTA format) -
keep a link to the results page and resulting alignment.
Next, load the alignment into JalView and remove columns from the
alignment that you believe may include mis-aligned residues (Do this by
selecting columns along the top of the alignment, and hitting the
Compare the alignments and phylogenetic trees estimated using (i) the
alignment obtained after processing with GBLOCKS (ii) the alignment you
prepared by making your own decisions about which columns to retain and
(iii) the initial alignment.
Finally, estimate phylogenies (try using both CLUSTALX and RAxML)
from the manually and the GBLOCKS-processed alignments
Q When making your own selection of columns, do you tend to be more or
less conservative than GBLOCKS?
Q Are there any major differences in the phylogenies estimated using
the different alignments? (consider both the topologies and the
bootstrap support for the different branches).
When applied to shorter alignments, GBLOCKS often has an unwanted
effect on the topology and bootstrap values of the estimated
phylogenies: the program excludes so many columns from the final
analysis that it simply removes too much information. However, for
longer alignments, it can be shown that it has a positive effect.
Note that GBLOCKS is not a perfect solution to the problem of preparing
an alignment for phylogenetic analysis. Not only is there a sometimes
significant loss of information; if the alignment passed to GBLOCKS
contains many fragments, or sequences that contain false sequence
(e.g. translated non-coding sequence), GBLOCKS is likely to retain this
"bad" sequence in the processed alignment. Therefore, one needs to be
careful to submit to GBLOCKS only alignments that have been mostly
purged of fragments (GBLOCKS will usually not retain any columns
containing gaps) and that contain no "bad" sequence. This is because
GBLOCKS scans the alignment to find regions where there are no gaps,
and where there are several highly-conserved columns. Thus, if there is
one 'bad' sequence present, as long as there are a reasonable number of
sequences in the alignment, these columns will remain highly-conserved,
and the "bad" sequence will be included.
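The effect described above is easy to reproduce with a toy conservation
score. The Python sketch below measures conservation as the fraction of
sequences sharing the most common residue in a column; this is a
simplified stand-in rather than the exact GBLOCKS criterion, and the
function name and the example column are invented for illustration.

```python
from collections import Counter

def column_conservation(column: str) -> float:
    """Fraction of sequences that share the most common residue in a column."""
    return Counter(column).most_common(1)[0][1] / len(column)

# Nine good sequences agree on 'L'; one "bad" sequence has 'Q' instead.
column = "LLLLLLLLLQ"
print(column_conservation(column))
```

With ten sequences, a single deviant residue still leaves the column
90% conserved, so a filter based on per-column conservation keeps the
column, and the "bad" residue stays in the trimmed alignment.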
For example, try running GBLOCKS on the two alignments below and
estimating the phylogenies of the resulting alignments. In both
alignments, one of the frog sequences has been deliberately altered to
contain an insertion of "bad" sequence.
Q Can you identify the "bad" sequence in the two alignments (prior
to GBLOCKS processing)?
Q Look at the GBLOCKED alignments - do both of them contain "bad"
sequence?
Q What effect does the "bad" sequence have on the phylogenetic
estimation (both in terms of topology and bootstrap values)? (note
that the "bad" sequence is copied from another part of the same frog
sequence, thus in the region of the alignment containing the "bad"
sequence, the frog sequence is equally distantly related to all the
other organisms in the alignment)
Phylogenetic Estimation - from start to finish
In these exercises you are asked to go from having a question that can
be addressed using phylogenetic approaches, through the process of
obtaining an appropriate set of sequences for the analysis, aligning
them, estimating a phylogeny from them, and examining the final result
in the light of your initial question.
Imagine you are working on the FGF10 protein in humans.
You have planned some experiments to be carried out in chickens to test
some of your hypotheses, therefore you want to determine:
- whether there is an ortholog for this gene in chickens
- if so, how many orthologs
To do this, go through the complete analysis including
- sequence similarity search
- initial alignment of sequences
- examination of alignment, discarding of some sequences, perhaps
collection of additional sequences and inclusion in the alignment
- removal of potentially mis-aligned regions
- estimation of phylogeny
- rooting of phylogeny
In particular, think carefully about
- what would be an appropriate set of sequences to collect
- once you have an initial set of sequences, which ones to keep,
which ones to discard
- once you have a tree, where to root it