Interpreting and Estimating Phylogenies
Exercises and Demonstrations
Monday 30th - Tuesday 31st March 2009
Wellcome Trust Advanced Course on Molecular Evolution
Viewing and Manipulating Unscaled Trees with NJplot
Teaching Objectives
After completing this section, you will hopefully be able to use NJplot
to change the way a phylogenetic tree is drawn by changing the root and
rotating subtrees around internal branches to reach a specified/desired
representation of the tree.
Notes
Many applications of phylogenies involve using them to check whether a
given set of
taxa/organisms/OTUs are related to each other in a particular way i.e.
whether the topology of the estimated phylogeny supports a particular
set of relationships
between OTUs. As the same topology can be drawn in different ways, it
is useful to be able to look at and manipulate a topology to see
whether it supports a given relationship.
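To make the point about equivalent drawings concrete, here is a minimal Python sketch. It assumes a rooted tree is written as nested tuples of OTU names (an illustrative representation only, not something NJplot uses), and the function name canonical is our own. It shows that rotating subtrees changes the drawing but not the topology. It does not deal with re-rooting.

```python
def canonical(tree):
    """Return a rotation-independent form of a rooted tree.

    A leaf is a string; an internal node is a tuple of subtrees.
    Children are sorted, so any rotation gives the same result.
    """
    if isinstance(tree, str):
        return tree
    # Sort on the repr so that leaves and nested tuples can be compared
    return tuple(sorted((canonical(child) for child in tree), key=repr))


# Two drawings of the same rooted topology, differing only by rotations
drawing_1 = (("A", "B"), ("C", ("D", "E")))
drawing_2 = ((("E", "D"), "C"), ("B", "A"))
# A genuinely different topology
drawing_3 = (("A", "C"), ("B", ("D", "E")))

print(canonical(drawing_1) == canonical(drawing_2))  # True
print(canonical(drawing_1) == canonical(drawing_3))  # False
```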
NJplot does not have many features and options. However, it carries out
the simple tasks of re-rooting and rotating branches very quickly. These
tasks are very useful when attempting to determine whether a given set of
relationships exists in a tree, or when comparing two tree topologies, and
are often all you need when taking a first look at a phylogeny.
This
page describes how to carry out these kinds of manipulations using
NJplot - we will also demo them for you.
Exercise 1
Load the following
NEWICK/PHYLIP format file into NJplot and use the software to try
and reproduce as closely as possible the following image:

If the previous exercise was easy, try the same thing with the following file and
image:

Investigating Branch Lengths and Scaled Trees using MESQUITE
Teaching Objectives
After completing this section, you will hopefully have gained at least
an intuitive understanding of the usual interpretation of branch
lengths
of a molecular phylogeny.
Notes
The trees used in the NJplot exercise above are "unscaled" i.e. the
lengths of the branches on the
tree are arbitrarily assigned to provide a convenient representation of
the tree. For example, in the NJplot-screenshots above, branch lengths
are chosen to align all OTU labels on the right
side of the tree.
In many/most cases, however, you will be using and manipulating trees
where branch length represents the amount of change along a lineage,
typically measured
as expected substitutions per alignment column.
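As a rough illustration of "expected substitutions per alignment column", the Python sketch below compares two aligned DNA sequences. It assumes the Jukes-Cantor (JC69) model, which is only one very simple substitution model, and the two sequences are invented for the example. Under JC69 the observed proportion of differing columns p is corrected to d = -3/4 ln(1 - 4p/3), which is the estimated total length of the path between the two sequences on a tree.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of aligned columns at which two sequences differ.

    Columns with a gap ("-") in either sequence are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    compared = differences = 0
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differences / compared


def jc69_distance(p):
    """Expected substitutions per column under Jukes-Cantor, given p."""
    if p >= 0.75:
        raise ValueError("p must be below 0.75 for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Invented example: 20 columns, 2 of which differ
seq_a = "ACGTACGTACGTACGTACGT"
seq_b = "ACGTACGAACGTACCTACGT"
p = p_distance(seq_a, seq_b)
print(p, jc69_distance(p))
```

The corrected distance is slightly larger than p because some columns are expected to have changed more than once.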
Demonstration
This file uses
MESQUITE to demonstrate the relationship between the amount of change
associated with a branch (as represented by a DNA sequence alignment)
and branch lengths.
Formatting Phylogenetic Tree Figures with Dendroscope
Teaching Objectives
After completing this section, you will hopefully be able to manipulate
the representation (root, branch rotation, formatting) of a phylogenetic
tree using Dendroscope to provide a starting point for preparing trees
for use in figures.
Notes
While NJplot is good for quickly examining a tree, it provides
only fairly limited tools to manipulate the representation of a tree.
To prepare representations of trees for use in figures, we therefore
typically begin by visualising the
tree in Dendroscope, using it to make many of the formatting changes we
need to emphasize appropriate features of the tree. Note, however, that
in almost all cases we carry out further changes and decoration of the
tree using image software such as the GIMP or Adobe Illustrator.
This
page shows how to carry out some common formatting tasks using
Dendroscope - we will also demo them for you.
Exercise 2
Load this tree file
into Dendroscope and manipulate the tree representation until it
resembles the image below.

If you are having problems obtaining a representation like this, you
can load this
Dendroscope format file into Dendroscope - this includes the above
tree saved in the state used to create the above figure.
Data Formats
Teaching Objectives
After completing this section, you will hopefully be able to:
- draw by hand a phylogenetic tree as described in NEWICK format
- write the NEWICK format string that describes a given
representation of a tree
- download pre-calculated phylogenetic trees from several different
websites and display them in NJplot and Dendroscope
Notes
Much of the software used to estimate, manipulate, and visualize
phylogenetic data is produced by relatively small teams of developers,
primarily
for use in their own research. As a result, they typically have only
limited time, resources, and motivation available to design and prepare
the interface of the software to be compatible with a wide range of
different data formats.
Thus, a common situation when working with phylogenetic data is that
the output obtained from one tool must be adjusted before it can be
successfully used by another tool.
The task of determining the changes that need to be made can be
somewhat confounded by the error messages reported by software due to
incompatible input data - these may be either somewhat cryptic or
absent, making it difficult to diagnose the reasons for the
incompatibility of the data.
When formatting data for use by phylogenetic software, you should,
if possible:
- use 10-character taxon names (certain commonly-used software [for
example PHYLIP and PAML] is designed/easiest to use with exactly
10-character length taxon names)
- use only the upper-case letters "A-Z". If this is too
restrictive, then try also using lower-case letters "a-z", underscore
"_" and perhaps the numerals "0-9" (which shouldn't, but still might,
cause problems)
- completely avoid using any other characters e.g. #*!-%^()[];,
"white space" etc. as they may have special formatting purposes in the
input expected by different software packages, causing input errors
- check whether the input tree has appropriate characteristics
e.g. whether the tree described:
- is simply invalid (e.g. missing/mis-placed "(" or ")" signs)
- contains comments/labels on nodes/branches that may be
incompatible with other software
- is non-bifurcating, rooted etc. (which might be a problem for
software expecting bifurcating, unrooted trees, for example)
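Some of the checks listed above are easy to automate. The following Python sketch is one possible way of doing so; the function names are our own, the example names are made up, and the rules it applies (exactly 10 characters, only letters, numerals and underscore) are the conservative ones suggested above rather than the requirements of any particular program.

```python
import re

# Conservative character set suggested above: letters, numerals, underscore
SAFE_NAME = re.compile(r"[A-Za-z0-9_]+")


def check_taxon_name(name, length=10):
    """Return a list of problems found in one taxon name (empty if none)."""
    problems = []
    if len(name) != length:
        problems.append(f"length is {len(name)}, not {length}")
    if not SAFE_NAME.fullmatch(name):
        problems.append("contains characters other than A-Z, a-z, 0-9, _")
    return problems


def parentheses_balanced(newick):
    """Check that "(" and ")" are correctly paired in a NEWICK string."""
    depth = 0
    for char in newick:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:  # a ")" appeared before its "("
                return False
    return depth == 0


print(check_taxon_name("HUMAN_CCNF"))         # []
print(check_taxon_name("Homo sapiens"))       # two problems reported
print(parentheses_balanced("((A,B),(C,D);"))  # False
```

Passing these checks does not guarantee that a given program will accept the file, but failing them is a good hint about where to look first.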
Tree (NEWICK/PHYLIP Format) Data
Notes
Most software that operates on phylogenetic trees uses some derivative
of the NEWICK/PHYLIP format for input/output of trees, as described
during the presentation.
To give you some practice overcoming typical problems associated with
inputting tree data into phylogenetic software, we have put together
some exercises to help familiarize you with this format.
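If you would like to see how the nesting of brackets in a NEWICK string maps onto the nesting of subtrees, the Python sketch below may help. It is a deliberately minimal parser written for this page, not the code used by any of the tools mentioned: it reads taxon names and brackets, discards branch lengths, and does not handle quoted labels, comments, or labels on internal nodes. The function names parse_newick and show are our own.

```python
def parse_newick(text):
    """Parse a simple NEWICK string into nested tuples of OTU names.

    Branch lengths (":1.5") are accepted but discarded.
    """
    text = "".join(text.split())        # drop all white space
    if not text.endswith(";"):
        raise ValueError("a NEWICK string must end with ';'")
    pos = 0

    def subtree():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                    # consume "("
            children = [subtree()]
            while text[pos] == ",":
                pos += 1                # consume ","
                children.append(subtree())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1                    # consume ")"
            node = tuple(children)
        else:
            start = pos
            while text[pos] not in ",():;":
                pos += 1
            node = text[start:pos]      # a leaf name
            if not node:
                raise ValueError(f"missing taxon name at position {pos}")
        if text[pos] == ":":            # skip an optional branch length
            pos += 1
            while text[pos] not in ",();":
                pos += 1
        return node

    tree = subtree()
    if text[pos] != ";":
        raise ValueError(f"unexpected text at position {pos}")
    return tree


def show(tree, indent=0):
    """Return the tree as an indented outline, one node per line."""
    pad = "  " * indent
    if isinstance(tree, str):
        return [pad + tree]
    lines = [pad + "+"]                 # "+" marks an internal node
    for child in tree:
        lines.extend(show(child, indent + 1))
    return lines


tree = parse_newick("((G,E),((C,((A,K,B),F)),((D,H),M)));")
print("\n".join(show(tree)))
```

Each "+" in the outline is an internal node, and the lines indented beneath it are its subtrees, so deeper indentation corresponds to more deeply nested brackets.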
Demonstration
We'll show you how to draw "by hand" the tree corresponding to a given
NEWICK string using the following example:
((G,E),((C,((A,K,B),F)),((D,H),M)));
Which should yield something like the image shown below:


We'll also demonstrate doing this the other way around i.e. we'll try
and write the NEWICK format string corresponding to the tree shown
above - and then we'll try and alter the string we come up with to
correspond instead to the tree shown below.

Exercise 3
Draw on paper the phylogenetic tree corresponding to the following
NEWICK/PHYLIP format trees - be careful to check whether the trees are
specified as rooted or unrooted, and draw them accordingly. Check your
results by comparing the trees you draw with the
images linked below each of the two trees.
((A,(B,(F,C))),(D,E));
Link
to tree image
(A:1,D:6,(((E:1,F:1):1,B:2):1,(C:4,G:2):2):1);
Link
to tree
image
Now try this from the other side - write and save in a text editor
NEWICK/PHYLIP format representations of the trees shown below. To check
whether the file you write does indeed represent the appropriate tree,
try loading the file into Dendroscope.
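For comparison with what you write by hand, here is a small Python sketch that goes in the same direction, from a tree description to a NEWICK string. The nested-list representation, the function name to_newick, and the example taxa P to T are all invented for illustration and are unrelated to the trees in this exercise.

```python
def to_newick(tree):
    """Serialise a tree to a NEWICK string.

    A leaf is a name (str). An internal node is a list of children.
    Any node may be wrapped as a (node, length) tuple to attach a
    length to the branch leading to that node.
    """
    def fmt(node):
        length = None
        if isinstance(node, tuple):     # (node, branch_length)
            node, length = node
        if isinstance(node, str):
            text = node
        else:
            text = "(" + ",".join(fmt(child) for child in node) + ")"
        if length is not None:
            text += f":{length:g}"
        return text

    return fmt(tree) + ";"


unscaled = [["P", "Q"], ["R", ["S", "T"]]]
scaled = [("P", 0.1), ("Q", 0.25), ([("R", 0.05), ("S", 0.3)], 0.2)]

print(to_newick(unscaled))  # ((P,Q),(R,(S,T)));
print(to_newick(scaled))    # (P:0.1,Q:0.25,(R:0.05,S:0.3):0.2);
```

Note that each branch length is written after the node at the lower end of the branch, which is why the length of an internal branch follows a closing bracket.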
This first tree is unscaled - so do not attempt to include information
about branch lengths in your NEWICK-FORMAT tree. (Here
is a file that contains a NEWICK/PHYLIP format tree that should
yield the tree seen below)
For the next tree, try to include branch length information in your
NEWICK/PHYLIP format tree (Here
is a file that contains a NEWICK/PHYLIP format tree that should
yield the tree seen below)

Sources of Pre-Calculated Trees
Notes
Sometimes, rather than estimating a phylogenetic tree yourself, it
will be enough to simply examine a tree obtained elsewhere.
However, such trees are often formatted in a way that is incompatible
with tree visualization (or other phylogenetic) software. Therefore,
obtaining and visualising such trees provides further practice at
interpreting and manipulating NEWICK format trees.
This
page describes how to obtain trees (and in some cases alignments)
from several different websites - we will also demo them for you
Exercise 4
Below is a list of websites which are sources of pre-calculated
phylogenies - use each of these sites to obtain trees that
include the human
cyclin F sequence (UniProt Entry Name: CCNF_HUMAN, UniProt Primary
Accession
Number: P41002).
After downloading the trees, try to load them into NJplot and
Dendroscope
Note that you may need to edit the format of the downloaded trees for
them to be accepted/correctly loaded into the software.
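One kind of edit that is sometimes needed is removing annotations enclosed in square brackets, which some viewers do not accept. The Python sketch below assumes, purely for illustration, a tree carrying NHX-style comments; the annotated string is invented and we are not claiming that any particular website produces exactly this output.

```python
import re


def strip_newick_comments(newick):
    """Remove square-bracket comments (e.g. NHX tags) from a NEWICK string."""
    return re.sub(r"\[[^\]]*\]", "", newick)


# Invented example of an annotated tree
annotated = (
    "((SEQ_A:0.1[&&NHX:S=HUMAN],SEQ_B:0.2[&&NHX:S=MOUSE])"
    ":0.05[&&NHX:D=N],SEQ_C:0.3);"
)
print(strip_newick_comments(annotated))
# ((SEQ_A:0.1,SEQ_B:0.2):0.05,SEQ_C:0.3);
```

The same change can of course be made by hand in a text editor, which is often quicker for a single small tree.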
- if you have trouble finding the appropriate TreeFam page
(TF101006),
then
follow this
link straight to the page
- if you have trouble downloading the seed tree from this page, here it is already downloaded
- if you have trouble finding the appropriate Ensembl page, then follow
this link (note that this may change following a new release of
Ensembl - when this link was posted Ensembl was on release 53)
- if you have trouble extracting the tree from this page (using cut
and paste into a text editor), here it is already
downloaded
- if you have problems finding the appropriate HOVERGEN page, follow
this link
- if you have problems extracting the tree from this page, follow this link to the
file
There are quite a few other sites that provide trees that can be
downloaded, for example:
Exploring Large Phylogenetic Trees
Teaching Objectives
After completing this section, you will hopefully be able to interpret
and explore very large phylogenetic trees using Dendroscope
Notes
As computers continue to increase in speed, we are able to calculate
ever larger phylogenetic trees. Thus, it is now not unusual to be faced
with the problem of examining trees with 100s or even 1000s
of taxa. The tree visualization tools we have used so far (or at least
the way we have been
using them) are not well designed for this task, as large trees are too
dense to visualize easily without somehow being able to easily focus on
the regions of particular interest, while ideally at the same time
providing some kind of overview of where the region of focus lies
within the overall tree.
However, combined with search and format options, the "Magnifier" tool
of Dendroscope makes this kind of task much easier.
This
page describes how to use Dendroscope in this way - we will also
demo it for you
Exercise 5
- Load this
tree file into Dendroscope
- Find all OTUs that are from humans - as these are all taken from
ENSEMBL, the labels for these OTUs should all contain the substring
"ENSP00"
- Use the Format box to colour all these OTUs red
- Examine the tree to identify the human sequence that is most
similar to the fly "CG7922" sequence - this is the same human sequence
that forms the smallest clan that contains CG7922 and a human sequence
(however, note that if we want to avoid making assumptions about the
location of
the root of the tree, we can't describe this as the human sequence that
is most closely related to the fly CG7922 sequence)
If you're having trouble identifying the appropriate fly sequence, this
Dendroscope file has all human taxa labeled in red, with the CG7922
sequence labeled in blue
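The search step can also be done outside Dendroscope. The Python sketch below pulls the leaf labels out of a NEWICK string with a regular expression and keeps those containing a given substring. It assumes unquoted labels, and the small tree and its labels are invented for the example (only the "ENSP00" and "CG7922" substrings come from this exercise).

```python
import re


def leaf_labels(newick):
    """Return the leaf labels of a NEWICK string, in order of appearance.

    A leaf label is taken to be any run of label characters that
    directly follows "(" or ",".
    """
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


# Invented example tree
tree = (
    "((ENSP00000123456:0.1,CG7922:0.2):0.05,"
    "(ENSP00000654321:0.3,FBpp0099999:0.1):0.2);"
)
labels = leaf_labels(tree)
human = [label for label in labels if "ENSP00" in label]
print(human)
```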
Editing Trees Using MESQUITE
Teaching Objectives
After completing this section, you will hopefully be able to create
NEWICK format files of a desired topology and set of branch lengths
using MESQUITE, beginning either "from scratch", or by modifying an
existing tree.
Notes
We usually work with phylogenies that have been directly
estimated from a dataset - typically a protein or DNA multiple sequence
alignment. However, in certain situations we do not want or need to
estimate a phylogeny - instead we can either
create it completely "from scratch", or simply modify an
existing phylogeny. These modifications might involve changing
branch lengths,
topology, and/or the rooting of the tree.
Typical uses of such "edited" trees are preparation of figures for
publications/presentations e.g. a "cartoon" figure showing a consensus
view of the relationships for a set of organisms, or when carrying out
tests that compare
several different specified phylogenetic hypotheses e.g. when applying
the
approximately unbiased (AU) test to an alignment and a set of
phylogenies.
As already mentioned, MESQUITE is a very flexible tool for the analysis
of phylogenetic data
- and one part of its functionality enables us to edit trees in this
way.
This
page describes how to use MESQUITE in this way - we will also demo
it for you
Exercise 6
Load this
tree file
into MESQUITE and edit it to yield the topology and branch lengths
shown below.

Export the file from MESQUITE and then use Dendroscope to produce an
image from it similar to the one shown below.

Create from scratch a phylogeny using MESQUITE with the topology and
branch lengths shown below.

Reconciling Species and Gene Trees Using MESQUITE
Teaching Objectives
After completing this section, you will hopefully appreciate that any
gene tree can be consistent with any species tree, given inference of
appropriate gene duplications and losses.
Demonstration
We will show how to compare/reconcile a set of rooted trees (all
with the same unrooted topology) with a species tree using MESQUITE.
This NEXUS
format file can be loaded into MESQUITE - it includes three gene
trees (shown in the image below - all have the same unrooted topology
and branch lengths) and a species tree - the same as used in the quiz
from the presentation.

Using MESQUITE's Analysis->Visual Tree
Analysis->Contained Gene (or Other) Tree command, we can
identify the minimum number of duplication and deletion events that
must be inferred to reconcile the different gene trees with the species
tree.
This
page describes how to use MESQUITE in this way
Splits
Teaching Objectives
After completing this section, you will hopefully be able to:
- list the set of splits specified by a given (rooted or unrooted)
phylogenetic tree
- reconstruct the unique tree topology specified by a set of
compatible splits
Notes
When carrying out a phylogenetic analysis, we often need to summarise
the similarities/differences between a set of phylogenies e.g. the set
of phylogenies sampled after the burn-in phase in a Bayesian analysis
of phylogeny. Many of the ways in which sets of trees are summarized
make use of the concept of "phylogenetic splits".
A phylogenetic split is the two sets of OTUs associated with the two
ends of a branch on a phylogenetic tree. For example, in the tree
below, the split associated with the red branch is
CD | ABE
The union of the two sets that make up the split is the complete set of
OTUs from the tree, and the two sets should be disjoint (i.e. not
sharing
any OTUs in common).
Splits may be described as "trivial" or "non-trivial" - in a trivial
split one of the two sets contains just a single OTU, while both sets in
a non-trivial split contain more than one OTU.
For example, there are 5 trivial splits for the tree shown below (one
for each terminal branch)
A | BCDE
B | ACDE
C | ABDE
D | ABCE
E | ABCD
And 2 non-trivial splits (one for each internal branch on the
corresponding unrooted tree)
AB | CDE
CD | ABE
Note also that the two sets of a split, and the OTUs within each set, are
unordered - thus, "AB | CDE" describes the same split as "AB | ECD" and
as "ECD | AB".
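Listing splits is mechanical enough to write down as a short program. The Python sketch below assumes the example tree is the unrooted tree ((A,B),(C,D),E), which is consistent with the splits listed above (the figure itself is not needed to run it), and represents it as nested tuples. The function names are our own.

```python
def leaves(tree):
    """Return the set of OTU names below a node (nested-tuple tree)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Return all splits of a tree, each as a frozenset of its two sides."""
    all_otus = leaves(tree)
    found = set()

    def visit(node):
        below = leaves(node)
        other = all_otus - below
        if other:                       # the top node itself has no branch
            found.add(frozenset([below, other]))
        if not isinstance(node, str):
            for child in node:
                visit(child)

    visit(tree)
    return found


def show(split):
    """Format a split as text, smaller side first, e.g. "AB | CDE"."""
    sides = ("".join(sorted(side)) for side in split)
    return " | ".join(sorted(sides, key=lambda s: (len(s), s)))


tree = (("A", "B"), ("C", "D"), "E")
all_splits = splits(tree)
non_trivial = [s for s in all_splits if min(len(side) for side in s) > 1]

print(len(all_splits) - len(non_trivial))    # 5 trivial splits
print(sorted(show(s) for s in non_trivial))  # ['AB | CDE', 'CD | ABE']
```

Because the splits are stored in a set, a rooted bifurcating tree gives the same result as the corresponding unrooted tree: the two branches attached to the root describe the same split and are counted once.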

A further feature of (a set of) splits is compatibility.
A pair of splits is incompatible if it is impossible to draw a
tree that contains both of them - instead, to include both of them in a
diagram, we would need to use a split network. Likewise, a set of
splits is compatible if it is possible to include all of them
in a tree. Clearly, the set of splits that are described by a given
tree will always be compatible.
As an example, the following two splits are incompatible:
AB | CDE
AC | BDE
(if you want, try [and fail!] yourself to build a tree that contains
both of these splits)
You can identify whether a pair of splits are compatible or not by
considering the four intersections of the split sub-sets. Where at least
one of these intersections is empty (for two different splits, this means
exactly one), the splits are compatible - otherwise the splits are
incompatible.
Taking the example above, the intersections are (using "n" to indicate
intersection):
{A,B} n {A,C} => {A}
{A,B} n {B,D,E} => {B}
{C,D,E} n {A,C} => {C}
{C,D,E} n {B,D,E} => {D,E}
All four intersections are non-empty, so the splits must be incompatible.
In contrast, the following two splits are compatible:
AB | CDE
CD | ABE
{A,B} n {C,D} => {}
{A,B} n {A,B,E} => {A,B}
{C,D,E} n {C,D} => {C,D}
{C,D,E} n {A,B,E} => {E}
as one (and only one) of the intersections is the empty set.
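The intersection test described above translates directly into a few lines of Python. This is only a sketch: it assumes one-letter OTU names written in the "AB | CDE" style used on this page, and the function names are our own. It tests for at least one empty intersection, which for two different splits of the same OTUs amounts to exactly one.

```python
def parse_split(text):
    """Turn a string such as "AB | CDE" into two sets of one-letter OTUs."""
    left, right = text.split("|")
    return set(left.strip()), set(right.strip())


def compatible(split_1, split_2):
    """Return True if at least one of the four intersections is empty."""
    a1, b1 = parse_split(split_1)
    a2, b2 = parse_split(split_2)
    intersections = [a1 & a2, a1 & b2, b1 & a2, b1 & b2]
    return any(len(common) == 0 for common in intersections)


print(compatible("AB | CDE", "AC | BDE"))  # False
print(compatible("AB | CDE", "CD | ABE"))  # True
```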
Exercise 7
(i) Identify the list of all non-trivial splits for the following tree
- check
here for the answers

(ii) Try the same exercise with this larger tree - again, you can find
the answers
here

(iii) There is only one bifurcating tree that is consistent with the
set of
splits listed below. Draw this tree - check
here
for the answer.
EC | HNGA
ECH | GAN
GA | NHEC
(iv) Here are the splits for a larger tree, this time with 12 taxa - if
you've time, try and repeat the above exercise with this set of splits,
building the unique bifurcating tree that is consistent with this set
of splits - check
here for the answer.
CB | ADEFGHKMNP
EH | ABCDFGKMNP
BCEH | ADFGKMNP
FN | ABCDEGHKMP
FNM | ABCDEGHKP
FNMK | ABCDEGHP
FNMKA | BCDEGHP
GP | ABCDEFHKMN
GPD | ABCEFHKMN
Building Consensus Trees by Hand
Teaching Objectives
After completing this section, you will hopefully be able to construct
a consensus tree "manually" from a set of trees (all of which describe
relationships between the same set of OTUs).
Notes
Consensus trees summarize the set of splits described by multiple
phylogenetic trees. For example, a consensus tree might include all
splits present in at least 80% of the trees.
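The counting step behind a consensus tree can be sketched in Python as follows. Each tree is given here simply as a list of its non-trivial splits, using one-letter OTU names; the three trees are invented for the example, and the function names are our own. The sketch only selects the splits: turning the selected splits back into a drawn tree is a separate step.

```python
from collections import Counter


def normalise(split):
    """Write a split such as "ED | BAC" in one standard form ("ABC | DE")."""
    sides = ["".join(sorted(side.strip())) for side in split.split("|")]
    return " | ".join(sorted(sides))


def split_counts(trees):
    """Count in how many trees each split occurs."""
    counts = Counter()
    for tree in trees:
        counts.update({normalise(s) for s in tree})  # once per tree
    return counts


# Three invented trees on the OTUs A-E, each listed by its non-trivial splits
trees = [
    ["AB | CDE", "CD | ABE"],
    ["AB | CDE", "DE | ABC"],
    ["BA | CDE", "ABE | DC"],
]
counts = split_counts(trees)

# Strict consensus: splits found in every tree
strict = sorted(s for s, n in counts.items() if n == len(trees))
# Majority rule: splits found in more than half of the trees
majority = sorted(s for s, n in counts.items() if n > len(trees) / 2)

print(strict)    # ['AB | CDE']
print(majority)  # ['AB | CDE', 'ABE | CD']
```

Requiring more than half of the trees (rather than exactly half) matters: any two splits that each occur in more than half of the trees must occur together in at least one tree, so they are guaranteed to be compatible.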
Exercise 8
From the set of six trees presented below, build both the unrooted (i)
strict consensus tree and (ii) 50% majority-rule consensus tree.

If you're having trouble building these trees, click on the links
supplied to view the strict
consensus and the majority
tree (where branches are labeled with the number of times
a given split is observed amongst the total of 6 trees used to build the
tree).
Using SplitsTree and CONSENSE to build Consensus Trees and Networks
Teaching Objectives
After completing this section, you will hopefully be able to process a
file containing a set of trees, specified in NEWICK format, all
describing relationships between the same set of OTUs and:
- construct a range of different kinds of consensus trees using
SplitsTree and Consense from such a file
- construct a consensus/split network using SplitsTree from such a
file
- interpret a consensus/split network to identify the incompatible
splits within a set of such trees, and the frequency with which these
inconsistent splits occur within these trees
Notes
A range of different software is available to calculate consensus trees
and networks.
We will begin by using SplitsTree - a JAVA-based tool with a graphical
user interface which can build either strict or majority consensus
trees, and also split networks.
Other software, such as CLANN
or CONSENSE
(one of the programs in the PHYLIP
package), provide more flexibility in the type of different
consensus trees they can build - we will follow the SplitsTree
exercise by using CONSENSE to build some consensus trees. This also
gives us an opportunity to become familiar with the PHYLIP package.
Exercise 9 - Using SplitsTree
This
page describes how to use SplitsTree to build consensus trees and
split networks - we will demo this for you
Load this set of
100 trees into SplitsTree and calculate
- the majority consensus tree
- the strict consensus tree
- the 0.1 fraction threshold consensus network
If you have trouble calculating these trees/networks, then follow the
links below to download (i) NEXUS format files with the trees/networks
pre-calculated [which can be loaded for viewing directly into
SplitsTree] and (ii) images of
the trees/networks.
By examining the strict consensus tree, identify those splits found in
all 100 of the trees - the set of these splits can be found
here.
By examining the majority consensus tree, determine how many of the
trees contain the clan: (EF1A1_HUM, EF11_MOUS, EF1A_CHIC)
By examining the consensus network, identify the most frequent split found
in the trees that is
incompatible with the split EF1A1_HUM, EF11_MOUS, EF1A_CHIC | others.
Determine also how many trees have this incompatible split.
Within the set of trees, the taxon xILC49472 is most often found in two
relatively small (mutually incompatible) clans. Identify these clans,
and determine how many trees they are each found in.
If you have trouble answering the last few questions, check the
answers here.
Exercise 9B - Using CONSENSE
This
page describes how to use CONSENSE to build consensus trees - we
will demo this for you
Use CONSENSE to build the:
- strict consensus tree
- majority rule consensus tree
- M65 consensus tree
- majority rule (extended) consensus tree
of the 100 trees used above for the SplitsTree exercise.
Check your results by comparing them to the pre-calculated ones
provided below:
Some questions that are just designed to give you a focus for examining
these trees:
Do any of the consensus trees have the same number of polytomies?
Do any of the consensus trees have identical topologies?
Check
here for the answers to these questions.
Demonstrating Structural Equivalence and Alignments
Teaching Objectives
After completing this section, you will hopefully appreciate the
properties that we expect residues in the same column of a sequence
alignment to share when the alignment is being interpreted in a
"structural" context.
Notes
Comparing (and aligning) pairs of similar structures demonstrates what
it means for a pair of residues to be "structurally equivalent" - the
relationship that we want residues in the same column of an alignment
to share if we are using the alignment in a "structural" context e.g.
predicting secondary structure, or building a protein profile HMM to
carry
out a sensitive sequence similarity search.
At the same time, it demonstrates that there may be regions of two
structures for which there is not any such equivalence.
Demonstration
This example is a pair of bacterial toxins 1ji6
and 1i5p with very
similar structures
We have aligned the N-terminal regions of these structures using FATCAT
- this
link is to the aligned structures in a PyMOL session file
- two examples of structurally equivalent residues are coloured
in dark blue
- side-chains of adjacent residues that do not share 1:1
structural equivalence with these residues are shown in the default
colours of their protein chains
- regions of the chains which do not have any 1:1 structural
equivalence in the other structure are shown in lighter colours
- notice that these are all in surface loops
- this
link is the
sequence alignment implied by this structural alignment, with residues
coloured as in the above PyMOL session file
- this
link provides the same alignment, but in Jalview format (allows us
to easily change the format of regions of the alignment, and to examine
the effect of introducing gaps in different positions in the alignment)
It should be clear, when looking at the structural alignment, that the
structures are very similar, with most residues sharing 1:1 structural
equivalence with a residue in the other structure.
By contrast, we provide below an alignment of two very different
structures. Indeed, considered in terms of the kinds of secondary
structure elements they contain, they are completely different (one is mainly alpha, the other mainly beta).
Note that we are still able to align these two structures, despite our
opinion that they are globally extremely dissimilar. The same is true for
multiple sequence alignments - most software will report an alignment
whether or not the sequences actually share the global similarity that
the software assumes.
Given the global dissimilarity of the alpha solenoid and beta
barrel structures, you would almost certainly want to avoid using such
an alignment to make any inferences about similarity of
function/structure between residues aligned in the same column.
In general, this illustrates the fact that it is important to be very
confident that the sequences you include in an MSA indeed share the
relationship you are interested in - this will typically be
"structural" and/or "evolutionary" equivalence. Note, however, that we
are avoiding a discussion of how to judge whether or not sequences do
indeed
have such a relationship with each other - this issue will be discussed
in detail by Bill Pearson and Ewan Birney in the next session.
If you are interested, you might like to try comparing some other pairs
of structures and examining their structural and sequence alignments as
calculated by FATCAT
and CE.
Instructions
on how to calculate pairwise structural alignments using CE
and FATCAT
and display the results in PyMOL
- Serine Proteases
- TIM Barrel hydrolases
Note that we are suggesting that you use FATCAT and CE as they both
provide a
pairwise sequence alignment along with easy-to-view structural
alignments -
not because they are known to provide on average the best quality
alignments (although
their alignments often seem to be good).
Phylogenetic Analysis: From Start to Finish
Teaching Objectives
After completing this section, you will hopefully:
- understand how (and why) it is important to know, reasonably
specifically, the purpose of a phylogenetic analysis before carrying
one out
- be able to use an awareness of the aim of the analysis to
prepare an MSA that is appropriate for this purpose
- be able to implement all the steps needed to carry out one
possible
phylogenetic analysis
Notes
The process of going from the formulation of a biological question
(that can be investigated
using a phylogeny) to obtaining an estimate of a phylogeny is a
multi-stage process. It requires the investigator to make many
decisions - many of which depend strongly on the specific overall
aim/purpose of the analysis i.e. the biological questions the analysis
is being used to investigate.
This strong dependence on the specific purpose of an analysis is part
of the reason why it is difficult/impossible to provide a one-size-fits-all
detailed protocol/recipe for phylogenetic analyses.
The demonstration below therefore aims to highlight (i) typical
steps/stages and (ii) examples of the kinds of decisions that need to
be made when carrying out such an analysis - it is not
intended as a blueprint for carrying out the ideal phylogenetic
analysis!
Note, also, that while we have described these examples as taking the
analysis "from start to finish", in reality the process is much more
involved than shown here. One could argue that the
process begins considerably earlier than shown here, with the decisions
about the biological questions of interest - and that the process would
need to run on much longer than shown here, for example investigating
and testing for a range of different potential sources of systematic
error in the analysis, or using these results to identify additional
data to be included in the analysis e.g. highlighting a particular set
of taxa that might be useful in better resolving particular regions of
the phylogeny.
Demonstration
As discussed above, an exercise that involves going from question to
phylogeny must
be done in the context of a specific biological question.
For this demonstration, we assume that we are interested in
understanding the history of the human Histone acetyltransferase
GCN5 gene, and its paralog Histone acetyltransferase
PCAF. In particular, we want to determine approximately when the
duplication that yielded these two genes occurred.
1. Identify the TreeFam family corresponding to these two genes, and
download a protein sequence alignment for the family
2. Remove some of the sequences from the alignment - in just about
every analysis you do, you'll find you want to remove some of your
initial set of sequences from your analysis. You might want to do this
for
sequences that you feel are:
- unnecessary for the analysis as they are identical/nearly
identical
to other sequences in the alignment
- incomplete/fragmented in a way that would exclude large regions
of
the alignment from later analysis
- poorly aligned/likely to contain sequence errors
- Note that, in a "real" analysis, you may want to
attempt to
resolve some of these issues e.g. correct alignment errors, check
whether "unusual" sequence is likely to be due to errors or is, in
fact, real. However, for the sake of speed, we'll for now just exclude
these sequences from the analysis
- After removing sequences in this way, your alignment might look something like
this
- Use CLUSTALX (having switched on the Quality->Show
Low-Scoring
Segments
option) and Jalview to identify sequences you want to remove from the
alignment
- Remove the sequences from the Jalview alignment, storing the
removed sequences in a different Jalview window
- Follow
this link for instructions on how to use CLUSTALX in this way
- Follow
this link for instructions on how to use Jalview in this way
3. Remove columns from the alignment where you are not confident
that all residues in the
column are "evolutionarily" equivalent i.e. related via single-residue
substitutions.
- Use Jalview to remove these columns, informed by the
regions highlighted as low-scoring by CLUSTALX
- Follow
this link for instructions on how to do this with Jalview
- Save the file in FASTA format - it might
look like this
4. Edit the taxa names in the FASTA format file so that they are all
10
characters long, and contain only capital letters and/or underscores.
Ideally, you should be able to identify the organism the sequence comes
from, and (if there is more than one sequence from the same organism,
the name should also make it possible to distinguish between the two
sequences)
5. Save the alignment in PHYLIP format using CLUSTALX
6. Identify the best substitution model (or at least, the best from a
list of models you choose to examine) to use in your phylogenetic
analysis, given your alignment
- follow
this link for instructions on using ProtTest
- to speed up this analysis, only examine the matrices JTT and
WAG, with Add-ons +G and +F, and use the Fast Optimisation strategy
- If you don't have time to run this analysis, the log file of
such an
analysis looks like
this - in this one, the JTT + Gamma model is found to be the best
using both the AIC and BIC criteria
7. Use RAxML to estimate a set of non-parametric bootstrapped trees
from
this alignment - to keep the analysis as quick as possible, calculate
only 10 bootstrapped trees
- Follow
this link for instructions on using RAxML
- Do this using a command similar to:
- raxmlHPC -s TF105399_seed_trimmed_HandGblocked_NameEd.phy -n
TF105399_seed_trimmed_HandGblocked_NameEd.phy.raxml -c 4 -f d -m
PROTGAMMAJTT -b 234534251 -N 10
- Here
is an example of the result file that might come out of such an
analysis
- This file can be viewed in SplitsTree to get an overview of
the
frequency of the different splits observed in the bootstrap trees
8. Obtain a single best estimate of the tree from the alignment
using a command such as:
- raxmlHPC -s TF105399_seed_trimmed_HandGblocked_NameEd.phy -n
TF105399_seed_trimmed_HandGblocked_NameEd.phy.raxml_NoBootstrap -c 4 -f
d -m PROTGAMMAJTT
- here's
an example of such a result
9. Combine the results of these two runs to determine the bootstrap
support for
the branches in the maximum likelihood tree using RAxML e.g.:
- Try this using a command line string similar to
- raxmlHPC -f b -m PROTGAMMAJTT -c 4 -s
TF105399_seed_trimmed_HandGblocked_NameEd.phy -z
RAxML_bootstrap.TF105399_seed_trimmed_HandGblocked_NameEd.phy.raxml -t
RAxML_result.TF105399_seed_trimmed_HandGblocked_NameEd.phy.raxml_NoBootstrap
-n BS_TREE
- The resulting file might look like this
- This can be viewed using NJplot
or Dendroscope,
with the
branches of the
ML tree labeled with their bootstrap confidence values
Exercise 10
Carry out a similar analysis to that described above.
This time the scenario is that you are interested in the evolution of
the human Polyadenylate-binding
protein 2 protein and its paralog Embryonic
polyadenylate-binding protein 2. The duplication that yielded the
two genes probably occurred after the divergence of the urochordate from
the vertebrate lineage. It has been suggested that the embryonic copy
of the gene has been evolving much faster than the other copy. Your aim
is to investigate the evolution of the vertebrate sequences of this
family, looking to see whether (by simply inspecting the resulting
phylogeny) there seems to be a difference in rate of evolution of the
two paralogs. If so, use the tree to decide when you think it's most
likely that this difference in rate was established.
Begin by finding the TreeFam record that corresponds to this family.
Using CLUSTALX/Jalview (follow these links for instructions on using CLUSTALX
and Jalview)
- remove sequences from the alignment that are either unnecessary
or seem to contain errors
- remove columns from the alignment that contain gaps or where
you
are not confident that all residues in the column are related by
substitutions
- save as FASTA format and change the names to be 10 or fewer
alpha-numeric (or
underscore) characters
- save as PHYLIP format
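The gap-removal step in the list above is normally done interactively in Jalview, but the idea can be sketched in a few lines of Python. This is a crude stand-in for the by-eye judgement described above: it only drops columns that contain a gap, and it says nothing about whether the remaining columns are reliably aligned. The function names and the tiny alignment are invented for the example.

```python
def read_fasta(text):
    """Parse FASTA text into a dict of name -> sequence (order kept)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif line and name is not None:
            records[name] += line
    return records


def drop_gapped_columns(records, gap_chars="-."):
    """Remove every alignment column in which any sequence has a gap."""
    seqs = list(records.values())
    if not seqs:
        return {}
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("sequences are not all the same length")
    keep = [i for i in range(len(seqs[0]))
            if all(s[i] not in gap_chars for s in seqs)]
    return {name: "".join(seq[i] for i in keep)
            for name, seq in records.items()}


# Invented alignment with 10-character names
alignment = """>SEQ_ONE___
MKT-LLVA
>SEQ_TWO___
MKTALL-A
>SEQ_THREE_
MRTALLVA
"""
trimmed = drop_gapped_columns(read_fasta(alignment))
for name, seq in trimmed.items():
    print(name, seq)
```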
Using ProtTest (follow
this link for instructions on using ProtTest)
- determine an appropriate substitution model
Using RAxML (follow
this link for instructions on using RAxML in this way)
- estimate a set of 10 maximum-likelihood
non-parametric bootstrapped tree topologies for the alignment
- estimate the single best ML tree from the alignment
- example of an execution string
- raxmlHPC -s TF105907_full_trimmed_handGblocked_NameEd.phy
-n
TF105907_full_trimmed_handGblocked_NameEd_SingleTree_RAXML -c 4 -f d -m
PROTGAMMAJTT
- example
of a result file
- integrate the bootstrapped phylogenies with the ML
phylogeny to obtain bootstrap support values for the ML tree
- raxmlHPC -s TF105907_full_trimmed_handGblocked_NameEd.phy
-n
TF105907_full_trimmed_handGblocked_NameEd_MlPlusBootstraps_RAXML -c 4
-f b -m PROTGAMMAJTT -z
RAxML_bootstrap.TF105907_full_trimmed_handGblocked_NameEd_BOOTSTRAPS_RAXML
-t
RAxML_result.TF105907_full_trimmed_handGblocked_NameEd_SingleTree_RAXML
Using NJplot or Dendroscope, examine these trees
In which lineage do you think the change in amino-acid substitution
rate occurred within the family?
Can you think of some of the assumptions you are making when you
draw
this conclusion?
(You'll find possible answers to these questions by
following this link.)