Phylogenies

Presentation

Exercises and Demonstrations

Displaying phylogenetic trees

Demonstration

Here we will look at several different software tools that can be used to display/draw phylogenetic trees. This should help you start exploring both pre-calculated trees, and ones you have estimated yourself, along with helping you become more familiar with looking at and thinking about trees.

NJPLOT is a relatively simple program initially developed to show the results of bootstrapped Neighbour-Joining phylogenies. It is often useful as a relatively quick and easy way of viewing the results of an analysis. While it can accept NEWICK-format unrooted trees as input, it is only able to display such trees with a root. It won't accept trees with any polytomies apart from a single trichotomy (used to specify the position of the root).

DENDROSCOPE is a more comprehensive program that is able to produce trees of nearly publication quality relatively quickly (although in most cases the resulting images require additional editing in other software such as Adobe Illustrator). It can display both rooted and unrooted trees, in scaled and unscaled formats.

MESQUITE (which we will use again in a later exercise/practical) is an extremely versatile and flexible software package for carrying out a wide range of different kinds of evolutionary analyses. This flexibility comes with a relatively steep learning curve. We will use it to visualise AND EDIT evolutionary trees: both DENDROSCOPE and NJPLOT are only able to view the data provided in a NEWICK-format tree file, while MESQUITE provides the ability to manually alter the topology and branch lengths of a tree (which can be useful for preparing illustrative figures, for preparing trees for input to other software, or for other analyses within MESQUITE).

We will begin by downloading a pre-calculated alignment, using the alignment to obtain a rough estimate of the phylogeny of the sequences, and will then visualise the resulting tree using NJPLOT, DENDROSCOPE, and MESQUITE.

Suppose we are interested in investigating the phylogeny of the drosophilid Oskar genes. To begin our analysis we go to the set of pre-calculated alignments of sets of "orthologous groups" of genes from Chris Ponting's group's OPTIC resource (these are available for other groups e.g. mammals, yeasts, not just drosophilids).

To identify the relevant group of genes, we need to find the identifier for the Oskar genes - we do this by searching FlyBase.

We query the OPTIC database with CG10901 (the relevant ID - although one actually needs to query with "CG10901-RA")

We click on the link that leads us to the alignments (the rather cryptic "group 3297")

Choose Alignments->Transcripts-AminoAcids, and then export in Fasta format

We then use ClustalX to calculate a quick bootstrapped tree from this alignment
We look at the resulting tree in NJPLOT
Dendroscope can also examine this tree with no trouble
To work on the tree in Mesquite, we need to edit the file
Mesquite, as mentioned, allows you to edit the tree

As you've seen, often tree-viewing software has problems displaying the trees obtained from some sources - we'll look at trees from the TreeFam database, which some of the software has problems with.

To get the Oskar sequences from TreeFam, we query either with the External Accession number from UniProt "P25158" or with "CG10901" under "gene name" (we could find out the relevant accession number for our gene by BLASTing with our sequence of interest against UniProt - this should identify an ID that will be within TreeFam) - here's the alignment.

It loads OK into NJPLOT (it includes the node labels)

To load it into Dendroscope, we need to delete all of the comments (i.e. everything within "[....]") - here it is without the comments.

Again, to get it into Mesquite we need to put everything onto one line - here it is.
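
The two manual clean-up steps above (deleting the "[....]" comments and putting the tree on one line) can also be scripted. The following is only a minimal sketch: the function names and the small sample tree are made up for illustration and are not taken from the TreeFam file. It also assumes that no taxon label contains spaces or square brackets.

```python
import re

def strip_newick_comments(text):
    """Remove every [...] comment from a Newick string."""
    return re.sub(r"\[[^\]]*\]", "", text)

def to_single_line(text):
    """Join a multi-line tree onto one line by dropping all whitespace.

    Assumption: labels contain no spaces, otherwise they would be altered.
    """
    return "".join(text.split())

# Hypothetical multi-line tree with bracketed comments.
raw = """((A[&&NHX:S=fly]:0.1,
B:0.2)[&&NHX:B=90]:0.05,
C:0.3);"""

clean = to_single_line(strip_newick_comments(raw))
print(clean)  # ((A:0.1,B:0.2):0.05,C:0.3);
```

Running the comment-stripping step alone gives a file of the kind needed for Dendroscope; running both gives the single-line form needed for Mesquite.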

Exercises

Download this tree file (TF101051, cdc6 relatives from TreeFam with several of the branches removed)

Load the tree into NJPLOT. Ignoring formatting, rearrange the branches of the tree in NJPLOT to make them the same as in the image below.

Load the tree into DENDROSCOPE. Carry out the same exercise, but this time use the formatting options to make the tree as similar as possible to the one in the image below.

Load the tree into MESQUITE, edit the branches of the tree so that the following (wrong) relationships are indicated:
Additionally, reduce the size of the two terminal "YEAST" branches by approximately half.

Delete the two ray-finned fish sequences (GASAC and ORYLA)

Save the new tree, and load this into DENDROSCOPE.

Dendroscope Tree

Visit this link to the gnathostome page on the Tree of Life.

Using a text editor, create a file to represent the tree at the top of the page that can be successfully displayed in NJPLOT. If that's too easy, try doing the same thing with their amniotes page.
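
If you want to check the bracket structure of a hand-written tree, it can help to see how the nesting maps onto the NEWICK text. The sketch below is only an illustration: the nested-tuple representation and the function name are assumptions made here, and the taxon names are placeholders rather than the answer to the exercise.

```python
def to_newick(node):
    """Convert a nested tuple of taxon names into a Newick subtree string."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"

# Placeholder taxa only: each pair of parentheses encloses one clade.
tree = (("TaxonA", "TaxonB"), ("TaxonC", ("TaxonD", "TaxonE")))
newick = to_newick(tree) + ";"
print(newick)  # ((TaxonA,TaxonB),(TaxonC,(TaxonD,TaxonE)));

# A quick sanity check for a hand-typed file: brackets must balance.
assert newick.count("(") == newick.count(")")
```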

Try the same exercise using the UCMP's "History of Life" resource page for the angiosperms/Anthophyta. This time, however, use MESQUITE to construct the tree file.

If you have extra time, look through either "Tree of Life" or "History of Life" to find a group of organisms that are interesting for you (I find it easiest to just start from the root nodes and work up), and try the same exercise.

Relationship between tree and alignment probabilities

Demonstration

The aim of this demo is to provide a more intuitive grasp of the link between branch lengths and substitution models, and of how these parameters can affect the probability of observing a given alignment. To swap the logic around, this hopefully also helps with our understanding of how different alignments make different trees/models more likely than others.
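
As a rough numerical illustration of this link, the sketch below computes the log-probability of a two-sequence DNA alignment for a given branch length. It assumes the Jukes-Cantor (JC69) substitution model, which is chosen here only because it is simple. The demonstration itself does not specify a model, and the function name and toy sequences are made up.

```python
import math

def jc69_log_likelihood(seq_a, seq_b, distance):
    """Log-probability of a pairwise alignment under JC69.

    distance is the path length between the two sequences in expected
    substitutions per site and must be greater than zero.
    """
    decay = math.exp(-4.0 * distance / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * decay)  # one specific identical pair, e.g. A/A
    p_diff = 0.25 * (0.25 - 0.25 * decay)  # one specific differing pair, e.g. A/G
    return sum(math.log(p_same if a == b else p_diff)
               for a, b in zip(seq_a, seq_b))

similar = ("ACGTACGTAC", "ACGTACGTAT")    # 1 difference in 10 sites
divergent = ("ACGTACGTAC", "AGGAACTTTT")  # 5 differences in 10 sites

for label, (a, b) in (("similar", similar), ("divergent", divergent)):
    short = jc69_log_likelihood(a, b, 0.1)
    long_ = jc69_log_likelihood(a, b, 1.0)
    print(label, round(short, 2), round(long_, 2))
```

The nearly identical pair is more probable on the short branch, while the divergent pair is more probable on the long one, which is the same logic in both directions described above.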

We load the following NEXUS format file into MESQUITE - this describes two trees with the same topology but with different branch-lengths
Mesquite Tree1, Mesquite Tree2
In MESQUITE we simulate DNA alignments from these trees using the following Menu choices
Note that the frequency with which different "patterns" of alignment columns are found in these alignments differs e.g. columns where all residues are the same, columns where A and B have the same pattern etc.
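
One way to make these "patterns" concrete is to recode every alignment column by which sequences share a residue, and then count how often each pattern occurs. The following is a minimal sketch only: the function names and the four-sequence toy alignment are assumptions for illustration and are not output from MESQUITE.

```python
from collections import Counter

def column_pattern(column):
    """Recode a column by order of first appearance, e.g. 'GGTT' -> '0011'."""
    labels = {}
    return "".join(str(labels.setdefault(res, len(labels))) for res in column)

def pattern_counts(alignment):
    """Count the patterns over all columns of an alignment (name -> sequence)."""
    seqs = list(alignment.values())
    return Counter(column_pattern("".join(col)) for col in zip(*seqs))

# Toy alignment; rows are in the order A, B, C, D.
alignment = {
    "A": "ACGTAC",
    "B": "ACGTAT",
    "C": "ACTTGC",
    "D": "ACTTGC",
}
counts = pattern_counts(alignment)
print(counts.most_common())  # [('0000', 3), ('0011', 2), ('0100', 1)]
```

Here '0000' marks columns where all residues are the same, and '0011' marks columns where A and B share one residue while C and D share another.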

We can also add additional taxa to the trees, and run further simulations:
  1. SHOW Taxa Block (from "Project" Window)
  2. List -> Add Taxa
  3. Tree->Alter/Transform Tree->Add Taxa (change position with the Move Branch tool, add a branch length using the Adjust Branch Length tool)
Based on where the additional branch(es) are placed, and their branch lengths, what pattern do you expect to see most often in the alignment compared to the expected-to-be-most-similar OTUs?

Exercises

Using the Adjust Branch Length tool, create trees (and simulate datasets from them) where

Estimating Phylogenies with RAxML

Demonstration

  1. Obtain the seed alignment for TreeFam family TF105048.
  2. Load the alignment into ClustalX and examine it - we remove some of the sequences to make the ML analysis quicker
  3. Save in FASTA format
  4. Rename sequences to make them short and alphanumeric only
  5. Run FASTA format file through GBLOCKS to remove columns that contain gaps or that are likely to be misaligned, saving the resulting outfile locally (could do this editing "manually" using SEAVIEW or JALVIEW)
  6. Load GBLOCKed fasta file into CLUSTALX and save in PHYLIP format
  7. Load PHYLIP format file into RAxML webserver provided at CIPRES
    1. No bootstrapping
    2. Choose Analysis: WAG + Gamma + F
    3. Choose Tool: RAxML
    4. Give email address
    5. Run Analysis
  8. View resulting tree in NJPLOT/Dendroscope
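
Steps 4 and 6 above (renaming the sequences and saving in PHYLIP format) can also be scripted instead of being done by hand and in CLUSTALX. The sketch below is an illustration under stated assumptions: the helper names and the two sample records are made up, and the 10-character name limit is the traditional strict-PHYLIP convention, assumed here. It does not replace the GBLOCKS step.

```python
import re

def read_fasta(text):
    """Parse FASTA text into an ordered dict of name -> sequence."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif line and name is not None:
            records[name] += line
    return records

def short_name(name, index, max_len=10):
    """Keep letters and digits only, truncate, and append a unique index."""
    cleaned = re.sub(r"[^A-Za-z0-9]", "", name)
    suffix = str(index)
    return cleaned[: max_len - len(suffix)] + suffix

def to_phylip(records):
    """Write sequential PHYLIP text using the shortened names."""
    renamed = {short_name(n, i): s
               for i, (n, s) in enumerate(records.items(), 1)}
    length = len(next(iter(renamed.values())))
    lines = ["%d %d" % (len(renamed), length)]
    lines += ["%-10s %s" % (name, seq) for name, seq in renamed.items()]
    return "\n".join(lines)

# Hypothetical records, not taken from TF105048.
fasta = ">ENSP0001|Homo sapiens\nMKV-L\n>FBpp002_Dmel\nMRVAL\n"
phylip = to_phylip(read_fasta(fasta))
print(phylip)
```

Appending the index keeps the shortened names unique even when two original names share their first characters.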

Exercise

Carry out a similar analysis using either your own alignment of interest, or using TF105084

Non-Parametric Bootstrapping

Demonstration - CLUSTALX

Load the seed alignment for TF105048 into CLUSTALX - run bootstrap analysis with 100 replicates, correcting for multiple substitutions and examine the result with NJPLOT
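
To see what a bootstrap replicate actually is, the sketch below resamples the columns of an alignment with replacement, so that each replicate has the same number of columns as the original. This is only an illustration of the resampling step, not of how CLUSTALX implements it. The function name, the toy alignment and the seed are assumptions.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Return one replicate built by sampling columns with replacement."""
    n_cols = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in picks)
            for name, seq in alignment.items()}

# Toy alignment (name -> aligned sequence).
alignment = {
    "seq1": "MKVLAAGT",
    "seq2": "MKVLSAGT",
    "seq3": "MRVLSAGS",
}
rng = random.Random(42)  # fixed seed so the replicates are reproducible
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(replicates[0])
```

A tree is then estimated from each replicate, and the support for a branch is the proportion of those trees in which it appears.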

Exercises

  1. Load this FASTA-format ef1a dataset (the sequences have already been trimmed to remove likely-misaligned columns) into CLUSTALX
  2. Estimate phylogenies for 1000 different bootstrapped datasets
  3. Examine resulting phylogeny in NJPLOT
Load the same alignment into JalView, SEAVIEW, or just edit the file manually to introduce a misalignment of one of the sequences such that most of that sequence is not properly aligned against the others. Calculate Bootstrapped trees as above using ClustalX.

Demonstration - RAxML

Using the same file you prepared for analysis by RAxML above, run the following command:

raxmlHPC -s infile.phy  -n infile.phy.raxml -c 4 -f d -m PROTGAMMAJTT -b 2345211436277 -N 100

(Try using MrBayes or ProtTest to choose a model if you're working with your own sequences. If you want to build a consensus tree of the resulting trees, you could use CONSENSE from the PHYLIP package, or SplitsTree.)

Phylogenetic Splits/Partitions

Methods that attempt to estimate the precision of a phylogenetic estimate often involve the estimation of multiple trees from the dataset (as above for the non-parametric bootstrap analysis). One problem with such an analysis is how to represent the set of trees obtained by the analysis in an informative way. Typically this is done using a consensus tree, or perhaps using a phylogenetic network.
 
To understand these two different forms of representation you need to understand the concept of a ‘split’ or ‘partition’ of a phylogenetic tree. A split is a division of the set of terminal-nodes/sequences/taxonomic-units of a tree into two mutually exclusive sets. Thus, any branch of a phylogenetic tree describes a split, with the terminal branches that are connected to one end of the branch making up one of the divisions, and the branches connected to the other end of the branch making up the other division. The illustration below indicates an internal branch that can be described as the split [ABC|DE].

Tree illustrating a split
 
A complete description/list of all the splits of this kind for a tree provides a complete description of the topology of the tree. For example, the tree in the image above specifies the following set of splits
 
[A|BCDE]
[AB|CDE]
[B|ACDE]
[ABC|DE]
[C|ABDE]
[D|ABCE]
[E|ABCD]
 
although, as you have probably noticed, only splits [AB|CDE] and [ABC|DE] are phylogenetically-informative i.e. provide information about which taxonomic units cluster together in the unrooted tree.

Exercise

In this exercise you will be asked (a) to list the set of phylogenetically-informative splits present in a phylogenetic tree and (b) reconstruct a phylogenetic tree from a set of splits.
 
Download the following tree file
 
Visualise the tree in this file using DENDROSCOPE - probably easiest to work with it using an unrooted representation
 
Describe the complete list of all informative splits in this tree.
 

For each of the two sets of splits below, reconstruct the unrooted phylogenetic tree that they describe.
 
(a)
[AB|CDE]
[CD|ABE]
 
(b)
[AB|CDEFGH]
[ABC|DEFGH]
[ED|ABCFGH]
[ABCDE|FGH]
[GH|ABCDEF]
 
The tree that you should have reconstructed from question (a)
 
The tree that you should have reconstructed from question (b)
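
A set of splits can only be drawn as a single tree if every pair of splits is compatible, i.e. if at least one of the four intersections between their sides is empty. The sketch below checks this for splits written in the [AB|CDE] notation. It assumes single-letter taxon names, and the function names are made up for illustration. It does not build the tree, so it does not give away the answers.

```python
from itertools import combinations

def parse_split(text):
    """Turn '[AB|CDE]' into a pair of taxon sets (single-letter names)."""
    left, right = text.strip("[]").split("|")
    return frozenset(left), frozenset(right)

def compatible(split_1, split_2):
    """Two splits are compatible if some pair of their sides is disjoint."""
    return any(not (a & b) for a in split_1 for b in split_2)

def all_compatible(split_texts):
    """Check every pair of splits in a list."""
    parsed = [parse_split(t) for t in split_texts]
    return all(compatible(a, b) for a, b in combinations(parsed, 2))

set_a = ["[AB|CDE]", "[CD|ABE]"]
set_b = ["[AB|CDEFGH]", "[ABC|DEFGH]", "[ED|ABCFGH]",
         "[ABCDE|FGH]", "[GH|ABCDEF]"]
print(all_compatible(set_a))  # True
print(all_compatible(set_b))  # True

# A conflicting pair: A cannot group with both B and C in one tree.
print(all_compatible(["[AB|CDE]", "[AC|BDE]"]))  # False
```

Incompatible splits are exactly what a consensus network can display and a single tree cannot.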

Demonstration - SplitsTree

We begin by loading the following set of 100 trees into SplitsTree

We first calculate a consensus network from the trees
To calculate the consensus tree
SplitsTree doesn't give you much control over the kind of consensus tree drawn - CLANN or (from the PHYLIP package) CONSENSE provide much more control, allowing one to use a much wider range of different consensus tree methods.

Exercise
Get a set of trees and see whether the poorly supported branches on the consensus tree are due to a few different competing splits, or just to lots of jumping around within the tree
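
One way to approach this is to count how often each informative split occurs across the set of trees. The sketch below does this for trees written as nested tuples. The representation, the function names and the three toy trees are assumptions for illustration, and reading real tree files is not covered.

```python
from collections import Counter

def leaves(node):
    """Return the set of taxon names at or below a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))

def informative_splits(tree):
    """Return the splits with at least two taxa on each side."""
    all_taxa = leaves(tree)
    found = set()
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, str):
            continue
        side = leaves(node)
        other = all_taxa - side
        if len(side) >= 2 and len(other) >= 2:
            found.add(frozenset([side, other]))
        stack.extend(node)
    return found

def split_support(trees):
    """Count in how many trees each informative split appears."""
    counts = Counter()
    for tree in trees:
        counts.update(informative_splits(tree))
    return counts

def show(split):
    return "[" + "|".join(sorted("".join(sorted(s)) for s in split)) + "]"

# Toy set of trees: two agree, one moves taxon C next to A.
trees = [
    (("A", "B"), "C", ("D", "E")),
    (("A", "B"), "C", ("D", "E")),
    (("A", "C"), "B", ("D", "E")),
]
support = {show(s): n for s, n in split_support(trees).items()}
print(sorted(support.items()))
# [('[ABC|DE]', 3), ('[AB|CDE]', 2), ('[AC|BDE]', 1)]
```

A few competing splits with moderate counts suggest a specific conflict, whereas many different splits with very low counts suggest a taxon that is jumping around.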


Removing potentially misaligned regions

Molecular phylogenetic analyses assume that amino acids in the same column of the alignment are related by substitution events. It follows that we may want to consider removing columns where we suspect this may not be the case from the alignment - indeed, you have already done this using GBLOCKS in a previous exercise.

However, sometimes we would prefer to do this "manually" i.e. by selecting ourselves those columns to be excluded from the analysis.

To compare automatic with manual "cropping" or "trimming" of your alignment, download this alignment locally to your machine TF105805.

Begin by processing the alignment using GBLOCKS (remember, you'll need to convert it to either PIR or FASTA format) - keep a link to the results page and resulting alignment.

Next, load the alignment into JalView and remove columns from the alignment that you believe may include mis-aligned residues (Do this by selecting columns along the top of the alignment, and hitting the backspace/delete key)

Compare the alignments and phylogenetic trees estimated using (i) the alignment obtained after processing with GBLOCKS (ii) the alignment you prepared by making your own decisions about which columns to retain and (iii) the initial alignment.

Finally, estimate phylogenies (try using both CLUSTALX and RAxML) from the manually and the GBLOCK-processed alignments

Q When making your own selection of columns, do you tend to be more or less conservative than GBLOCKS?


Q Are there any major differences in the phylogenies estimated using the different alignments? (consider both the topologies and the bootstrap support for the different branches).

When applied to shorter alignments, GBLOCKS often has an unwanted effect on the topology and bootstrap values of the estimated phylogenies - the exclusion of so many columns by the program simply removes too much information from the analysis. However, for longer alignments, it can be shown that it has a positive effect.

Note that GBLOCKS is not a perfect solution to the problem of preparing an alignment for phylogenetic analysis. Not only is there a sometimes significant loss of information, but if the alignment passed to GBLOCKS contains many fragments, or sequences that contain false sequence (e.g. translated non-coding sequence), GBLOCKS is likely to retain this "bad" sequence in the processed alignment. Therefore, one needs to be careful to submit to GBLOCKS only alignments that have been mostly purged of fragments (GBLOCKS will usually not retain any columns containing gaps) and that contain no "bad" sequence. This is because GBLOCKS scans the alignment to find regions where there are no gaps, and where there are several highly-conserved columns. Thus, if there is one "bad" sequence present, as long as there are a reasonable number of sequences in the alignment, these columns will remain highly-conserved, and the "bad" sequence will be included.
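
The reasoning above can be illustrated with a deliberately simplified column filter. This is not the GBLOCKS algorithm, which works on blocks of columns and has several more parameters. It is only a sketch that keeps gap-free columns in which the most common residue reaches an assumed threshold. The function names, threshold and toy alignment are all made up.

```python
from collections import Counter

def keep_column(column, min_conservation=0.5):
    """Keep a column if it has no gaps and is sufficiently conserved."""
    if "-" in column:
        return False
    most_common_count = Counter(column).most_common(1)[0][1]
    return most_common_count / len(column) >= min_conservation

def trim(alignment, min_conservation=0.5):
    """Return the alignment restricted to the columns that are kept."""
    seqs = list(alignment.values())
    kept = [i for i, col in enumerate(zip(*seqs))
            if keep_column("".join(col), min_conservation)]
    return {name: "".join(seq[i] for i in kept)
            for name, seq in alignment.items()}

# Toy alignment: "frog" carries unrelated residues in columns 2 and 3,
# and "chick" has a gap in column 4.
alignment = {
    "human": "MKVLA",
    "mouse": "MKVLA",
    "chick": "MKV-A",
    "frog":  "MQWLA",
}
trimmed = trim(alignment)
print(trimmed["human"])  # MKVA
print(trimmed["frog"])   # MQWA
```

The gapped column is removed, but the columns holding the unrelated frog residues are kept because the other three sequences still agree with each other.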

For example, try running GBLOCKS on the two alignments below and estimating the phylogenies of the resulting alignments. In both alignments, one of the frog sequences has been deliberately altered to contain an insertion of "bad" sequence.
Q Can you identify the "bad" sequence in the two alignments (prior to GBLOCKS processing)?

Q Look at the GBLOCKED alignments - do both of them contain "bad" sequence?

Q What effect does the "bad" sequence have on the phylogenetic estimation (both in terms of topology and bootstrap values)? (note that the "bad" sequence is copied from another part of the same frog sequence, thus in the region of the alignment containing the "bad" sequence, the frog sequence is equally distantly related to all the other organisms in the alignment)

Phylogenetic Estimation - from start to finish

In these exercises you are asked to go from having a question that can be addressed using phylogenetic approaches, through the process of obtaining an appropriate set of sequences for the analysis, aligning them, estimating a phylogeny from them, and examining the final result in the light of your initial question.

Imagine you are working on the FGF10 protein in humans. You have planned some experiments to be carried out in chickens to test some of your hypotheses, therefore you want to determine:
To do this, go through the complete analysis including
In particular, think carefully about

Back to Gibson Team course pages at EMBL.