Day 1: 18th April 2005

Introduction to phylogenetic trees and phylogenetic estimation algorithms


Visualising and conceptualising phylogenetic trees


One of the initial problems people encounter when working with phylogenies is that they find it hard to think about information presented in the form of phylogenetic trees. We will therefore begin with a number of exercises aimed at helping you become accustomed to looking at and thinking about them.


Rooted and unrooted trees


As mentioned in the introductory talk, the phylogenetic methods that you will be using in this course all estimate UNROOTED phylogenetic trees. The image below shows an unrooted tree (on the left) and a rooted tree based on the same unrooted tree (on the right).



Often, we are interested in using phylogenetic trees to make statements about the degree of relatedness between different sequences, e.g. “mouse and human LDH1a genes are more closely related to each other than either is to the Xenopus LDH1a gene”. However, one cannot make such statements based on an unrooted phylogenetic tree! Rather, one can only make statements about the genetic distance between two sequences. Thus, before considering the relatedness of sequences, we need to specify the root of the estimated tree. This usually requires the use of additional assumptions/hypotheses about (a) the evolution of the organisms providing the sequences and (b) the evolution of the gene family. In exercise 1 you will be asked to consider the kinds of assumptions and hypotheses you might make to infer the position of a root.


Exercise 1

Download this tree file. It describes the phylogenetic relationship between genes sampled from five different animals.


One very important aspect of phylogenetic studies in general is that they are most useful and most interesting when they are carried out in an attempt to assess the support of the dataset for a particular hypothesis. Without a hypothesis of interest, one simply obtains a pretty picture and little else as a result of the analysis. Thus, during this course we will attempt to present exercises in the context of a hypothesis that we are interested in exploring with the tree. In this case, let us imagine that you have functional data about the chicken sequence and are interested in whether it is likely to be worth studying the gene in mouse under the assumption that it has a similar function in this organism. Given that orthologous sequences are more likely to share function than paralogous sequences, you are attempting to use the tree to determine whether the mouse and chicken genes are orthologous (in which case you will assume that the function is very similar) or paralogous (in which case you will be much more cautious about working with this assumption).


Use the NJPLOT software to visualise the tree. NJPLOT always displays a rooted tree, and it is possible to change the position of the root used by the programme.


Note that rotating two branches around the interior branch that links them together does not alter the topology of the tree. This is apparent when you consider that exactly the same set of statements about relatedness of sequences can be made about a rooted tree when any of its branches are rotated in this way (see the image below).


For both these trees we can say

·      D and E are more closely related to each other than either is to any other sequences.

·      The D and E grouping is more closely related to C than to either A or B.

·      The CDE grouping is more closely related to B than to A.

Groups of terminal branches that lie on one side of an internal branch are referred to as ‘splits’ or ‘partitions’. Within the context of a rooted tree, the set of such sequences directed towards the terminal branches of the tree (but not in the other direction) can also be referred to as a ‘clade’. Thus, on this rooted tree the DE grouping is a clade but the ABC grouping is not.

As already mentioned, a rooted tree topology is simply a representation of a set of statements concerning the relatedness of a set of taxonomic units. Thus, because the two trees describe the same set of relatedness statements, they do not represent different tree topologies.
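The idea that a rooted topology is just a set of relatedness statements can be sketched in a few lines of Python. This is only an illustration, assuming the rooted tree for A to E described by the statements above is written as nested tuples; the function name `clades` and the tuple encoding are choices made for this sketch, not part of any course software.

```python
def clades(tree):
    """Collect the clades (sets of leaf labels below each internal node)
    of a rooted tree written as nested tuples of strings."""
    found = set()

    def leaves(node):
        if isinstance(node, str):          # a terminal branch
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)                   # every internal node defines a clade
        return below

    leaves(tree)
    return found


# The same rooted tree drawn twice, with branches rotated around internal nodes.
drawing_1 = ("A", ("B", ("C", ("D", "E"))))
drawing_2 = (((("E", "D"), "C"), "B"), "A")

# A genuinely different topology: C and E are now each other's closest relatives.
other_tree = ("A", ("B", ("D", ("C", "E"))))

same_topology = clades(drawing_1) == clades(drawing_2)
different_topology = clades(drawing_1) != clades(other_tree)
print(same_topology, different_topology)
```

Because the clades are stored as unordered sets, the order in which branches are drawn has no effect on the comparison, which is exactly the point made about rotation.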


How many different possible rooted trees can be drawn from this unrooted tree? Use NJPLOT to place the root in all of these different possible positions. For which of these rooted trees can one make the statement that the mouse and chicken genes are orthologous?


To have answered the previous question, you needed to make an assumption about the evolution of the organisms providing the genes. What was this assumption?


Describe, in terms of numbers and positions of gene duplications and losses, two different scenarios for the evolution of this gene family based on a rooted tree, (a) one where it is possible to state that chicken and mouse sequences are orthologous (b) another where one cannot make this statement.


If we want to choose one scenario for the evolution of the family over the others, then we need a basis on which to discriminate between competing hypotheses/scenarios. One possibility could be to choose the scenario that requires the minimum number of gene duplications and gene deletion events. Assuming that equal weight is given to duplication events and deletion events, identify the rooted tree associated with this ‘most parsimonious’ scenario.


NOTE! There is no particular justification for preferring this method of discriminating between hypotheses over another, e.g. one where one counts 2 for each duplication and 1 for each gene loss. The procedure described here is not necessarily the most appropriate approach to take in every analysis. Rather, it is provided as an example to illustrate that the interpretation of the results of a phylogenetic analysis requires some kind of comparison between alternative hypotheses for the evolution of the gene family, and thus, implicitly or explicitly, some criterion or criteria for assessing the competing hypotheses.
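To see how much the choice of weights can matter, here is a minimal Python sketch of the scoring rule. The two scenarios and their event counts are invented purely for illustration (they are not the answers to this exercise), and the names `scenario_cost` and `cheapest` are hypothetical.

```python
def scenario_cost(duplications, losses, dup_weight=1, loss_weight=1):
    """Weighted count of the events a scenario requires."""
    return dup_weight * duplications + loss_weight * losses


# Invented example scenarios: (number of duplications, number of losses).
scenarios = {
    "X": (1, 3),
    "Y": (3, 0),
}


def cheapest(dup_weight, loss_weight):
    """Name of the scenario with the lowest weighted cost."""
    return min(
        scenarios,
        key=lambda name: scenario_cost(*scenarios[name], dup_weight, loss_weight),
    )


print(cheapest(1, 1))  # equal weights: X costs 4, Y costs 3
print(cheapest(2, 1))  # duplications count double: X costs 5, Y costs 6
```

With equal weights scenario Y is preferred, but counting 2 for each duplication reverses the ranking, so the criterion should be stated explicitly whenever scenarios are compared.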

Have you realised that it is possible to devise more than one scenario of duplication/gene-loss for a given rooted tree? For one of the rooted trees described above, devise a second such scenario. How would you choose between available scenarios?


Finally, do you conclude that the mouse and chicken sequences are more likely to be orthologous or paralogous sequences?


Hopefully, this exercise will have:


Identifying identical trees


As we have seen above, the results of a phylogenetic estimation should be considered in comparison to the phylogenies expected under different evolutionary scenarios. Through these comparisons one can address the question “Which of the hypotheses for the evolution of these sequences is best supported by my data?”. However, to be able to compare your estimated tree to that expected under different hypotheses, you need to be able to compare two trees (hypothesis and result tree) and decide whether they are the same or not (typically one is comparing only the topologies of the trees in this context - i.e. one considers that the trees are the same if the topologies are the same, ignoring differences in branch-lengths between the trees). This is not necessarily as easy as it sounds when comparing trees with different branch lengths, or when one compares unrooted trees represented in rooted form.

Thus, the next exercise is to take a set of trees and to identify those that have the same topology.

Exercise 2

In the first part of this exercise you will become acquainted with the NEWICK format for describing trees. This is the form in which trees are typically represented in data files. In NEWICK format, taxonomic labels that cluster together in the tree are grouped in parentheses.


NEWICK format can be used to represent either rooted or unrooted trees. The difference between the two forms of representation depends on the number of elements included in the outermost set of parentheses: if there are three elements in this pair of parentheses, the tree is unrooted; if there are two, it is rooted. Thus (A,B,C) represents the unrooted tree for A, B and C, while ((A,B),C) represents one of the three rooted trees for these sequences.
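The rule about the outermost parentheses can be checked mechanically. The following Python sketch assumes simple NEWICK strings containing only labels, commas and parentheses (no branch lengths or quoted labels); the function names are hypothetical.

```python
def top_level_elements(newick):
    """Count the elements inside the outermost pair of parentheses."""
    text = newick.strip().rstrip(";")
    depth = 0
    count = 1
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 1:
            count += 1        # a comma directly inside the outermost pair
    return count


def looks_rooted(newick):
    """Two top-level elements are read as a rooted tree under this convention."""
    return top_level_elements(newick) == 2


print(top_level_elements("(A,B,C);"), looks_rooted("(A,B,C);"))
print(top_level_elements("((A,B),C);"), looks_rooted("((A,B),C);"))
```

Only commas at nesting depth one are counted, so the commas inside inner groups such as (A,B) do not affect the result.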


To check whether you can interpret trees in this format, draw each of the following three trees on paper and determine whether each tree is rooted or unrooted.




(i)   (A,B,(C,D));

(ii)  ((A,B),(C,(D,E)));

(iii) ((A,B),((C,(D,E)),F),(G,H));



Now, in the second part of this exercise, you are presented with three sets of trees. Within each set of trees, all trees have the same terminal node labels. Within each set of three trees, two have the same unrooted topology. Identify this pair of identical trees in each set.


For the first two sets of trees, (a) and (b), working with one set of trees at a time, download the three tree files in the set and display all three at the same time using NJPLOT. Adjust the position of the root, and the rotation of the branches of the trees within NJPLOT to help you identify the two trees with identical unrooted topologies.


For (c), the three trees are provided in NEWICK format. Draw the three trees on paper, and then compare them to identify the two identical unrooted topologies.


(a) Firstly, download these three trees (tree1 tree2 tree3) with a relatively small number of terminal branches.


(b) Secondly, download these three trees (tree1 tree2 tree3) with more terminal branches.


(c) Finally, here are three trees in NEWICK format.

(i) (((A,B),C),D,((E,F),G));

(ii) (G,(E,F),(D,(C,(B,A))));

(iii) (F,E,(G,(C,(D,(A,B)))));


Check with one of the demonstrators to see if you have correctly identified the pairs of identical trees.
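For larger trees, comparison by eye becomes error-prone, and one common way to automate it is to compare the sets of splits that the internal branches define. The Python sketch below is a minimal illustration, assuming label-only NEWICK strings without branch lengths; the parser and function names are hypothetical, and the example trees (labelled P to T) are deliberately not the ones used in this exercise.

```python
def parse_newick(text):
    """Parse label-only NEWICK (no branch lengths) into nested lists."""
    text = text.replace(" ", "").rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                       # skip "("
            children = [node()]
            while text[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1                       # skip ")"
            return children
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    return node()


def splits(newick):
    """Non-trivial splits of the unrooted tree behind a NEWICK string."""
    found = set()

    def leaves(n):
        if isinstance(n, str):
            return frozenset([n])
        below = frozenset().union(*(leaves(c) for c in n))
        found.add(below)
        return below

    all_leaves = leaves(parse_newick(newick))
    result = set()
    for side in found:
        other = all_leaves - side
        if len(side) > 1 and len(other) > 1:     # ignore single-leaf splits
            result.add(frozenset([side, other]))  # unordered pair of sides
    return result


def same_unrooted_topology(tree_a, tree_b):
    return splits(tree_a) == splits(tree_b)


print(same_unrooted_topology("((P,Q),(R,(S,T)));", "(Q,P,((T,S),R));"))
print(same_unrooted_topology("((P,Q),(R,(S,T)));", "((P,R),Q,(S,T));"))
```

Each split is stored as an unordered pair of leaf sets, so neither the position of the root in the written form nor the rotation of branches changes the outcome.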

Hopefully this exercise will have:




However, the typical scenario that one is faced with after a phylogenetic analysis is that the estimated tree is not exactly the same as that expected under any of the simple hypotheses that are considered likely. You have probably encountered similar problems in the interpretation of your own data. Thus, one must consider reasons that could account for this lack of agreement between hypothesised and estimated trees. (In general, possible interpretations of such a result can be grouped into (1) those where the results contain errors, (2) those where the evolution of the genes is more complicated than can be explained by a simple model, or (3) a mixture of these two.)


In the case where the topology of the tree is not the same as that predicted by any simple hypotheses of the evolution of the family, one finds oneself attempting to judge which of the simple hypotheses the estimated tree best resembles (this of course assumes that we are expecting more complex hypotheses to be very unlikely or indeed do not even consider them as possible explanations.) In this case, one needs to compare the estimated tree to the trees expected under different hypotheses and decide which of the hypothetical trees the estimated tree is most similar to.


Unfortunately, there is no single correct way in which one can compare a set of trees to identify those which are most similar i.e. there are numerous different possible ‘tree distance’ measures, none of which can be considered the best. Additionally, this approach to answering the questions “which of the possible hypotheses does my data best support” or “can I reject any of the possible hypotheses as not well supported by the data” does not take into consideration how well the trees under the different hypotheses are supported by the alignment data - which would be desirable (indeed, on Wednesday we will look at a possible approach to asking the question incorporating this information.) However, the fact still remains that there are often only what we consider as ‘small’ differences between an estimated tree and that of the tree expected under a very simple hypothesis of evolution for the gene family. The next exercise aims to help you consider the assumptions that you are making when carrying out this dataset-independent comparison of tree-topologies.


Exercise 3

In this exercise, you are presented with two sets of three topologies, all of which are different from one another. Looking at each set of trees, identify the pair of trees that you consider more similar to one another.


Set A: tree1 tree2 tree3

Set B: tree1 tree2 tree3


On what basis did you make this judgement? Did you consider only the topology of the different trees? Did you also consider the branch lengths of the trees? Devise your own explicit rule for judging the distance between trees. If you are working with a partner, come up with these rules independently of one another and then compare your rules within the context of the two sets of topologies provided in this exercise. Discuss whether either of these two sets of rules is better than the other.
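As one example of an explicit rule, the Python sketch below counts the splits that are present in one tree but not in the other, a measure usually called the Robinson-Foulds (symmetric difference) distance. It is offered only as one of the many possible tree distances, not as the recommended one. The trees are written as nested tuples, and the example trees and function names are invented for illustration.

```python
def splits(tree):
    """Non-trivial splits of a tree written as nested tuples of leaf labels."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    all_leaves = leaves(tree)
    return {
        frozenset([side, all_leaves - side])
        for side in found
        if len(side) > 1 and len(all_leaves - side) > 1
    }


def split_distance(tree_a, tree_b):
    """Number of splits found in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


# Invented six-leaf examples.
t1 = (("a", "b"), ("c", ("d", ("e", "f"))))
t2 = (("a", "b"), ("d", ("c", ("e", "f"))))
t3 = (("a", "c"), ("e", ("b", ("d", "f"))))

print(split_distance(t1, t2))  # 2: only the placement of c and d differs
print(split_distance(t1, t3))  # 6: the trees share no split at all
```

This rule looks only at topology and ignores branch lengths entirely, which is one of the assumptions the questions above ask you to make explicit.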

Hopefully, this exercise should have:



In exercise 1 you interpreted the estimated phylogenies for a gene family tree assuming that the estimated tree was correct. However, as we hope to demonstrate during this course, there are many ways in which a phylogenetic analysis may yield the wrong topology. We will begin by considering the topologies estimated from a given dataset using the same alignment, same substitution model, but different estimation algorithms.


Estimating the phylogeny of vertebrate lactate dehydrogenase genes

As discussed earlier, molecular phylogenetic analysis can be considered as the application of a substitution model and a phylogenetic analysis algorithm to an aligned dataset to obtain an estimate of the phylogeny. To try and examine the influence that these three components of the analysis (model, algorithm and dataset) have on the estimated phylogeny, we will investigate what happens to the phylogeny estimates when one changes only one of these aspects of an analysis while keeping the other aspects the same. Today we will look at what can happen when you change the algorithm while keeping both the model and the datasets the same.

Exercise 4

To do this we will look at a set of vertebrate lactate dehydrogenase (LDH) genes. During the evolution of the vertebrate lineage this family has experienced several gene-duplication events. The datasets provided here do not include all known vertebrate LDH sequences; however, the sequences excluded from the datasets were clearly lineage-specific duplications in organisms that are included in the analysis. As we have the complete finished genome sequence only for human, the absence of a gene from the dataset does not indicate that it is absent from that organism’s genome, unless it is a human gene. I have also excluded genes from several different organisms from the dataset to speed up the analyses.


You will be working with a set of vertebrate LDH genes that duplicated in the early vertebrate lineage. This group of sequences contains three human genes.


Firstly, download the alignment dataset for this group of LDH genes.


In case you are interested in examining the sequence alignment using CLUSTALX, here is the alignment in pir format, which can be read by CLUSTALX.


You should estimate the phylogeny of this alignment using:

  1. PROML (maximum likelihood, “ML”).
  2. NEIGHBOR (Neighbor-Joining, “NJ”), also using PUZZLE to estimate a distance matrix.

Read the notes on Programme Usage (and see below) for help setting up and running these analyses. In all cases, choose a model of evolution that uses the JTT matrix, assuming equal rates for all sites in the alignment.


I suggest that you create two directories, one for carrying out the ML analysis, the other for the NJ analysis.


For the ML analysis, copy the alignment datafile into the appropriate directory, and rename it ‘infile’. Then execute PROML, choose appropriate options for the analysis, and run the analysis. The estimated tree will be written into the file ‘outtree’ and can be examined using NJPLOT.


For the NJ analysis, again copy the alignment file into the appropriate directory, giving the file the name ‘infile’. Then run PUZZLE to obtain a distance matrix based on the JTT substitution model (modified using the interactive menu); this matrix will be written into the file “outdist”. Create a copy of the file “outdist” named “infile” (replacing the previous infile), and then run NEIGHBOR. This will create a file “outtree” containing the estimated NJ tree, which can be examined using NJPLOT.
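If you prefer to script the file handling, the bookkeeping described above can be sketched in Python as follows. This covers only the copying and renaming steps; PROML, PUZZLE and NEIGHBOR are still run by hand through their interactive menus. The directory names `ml` and `nj`, the alignment file name and the function names are assumptions made for this sketch.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_run_dirs(alignment, workdir):
    """Create one directory per analysis and copy the alignment in as 'infile'."""
    workdir = Path(workdir)
    run_dirs = {}
    for name in ("ml", "nj"):              # assumed directory names
        run_dir = workdir / name
        run_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy(alignment, run_dir / "infile")
        run_dirs[name] = run_dir
    return run_dirs


def promote_distance_matrix(nj_dir):
    """Replace 'infile' with the distance matrix before running NEIGHBOR."""
    nj_dir = Path(nj_dir)
    shutil.copy(nj_dir / "outdist", nj_dir / "infile")


# Demonstration in a throwaway directory, using placeholder file contents.
with tempfile.TemporaryDirectory() as tmp:
    alignment = Path(tmp) / "ldh_alignment.phy"      # hypothetical file name
    alignment.write_text("placeholder alignment\n")
    dirs = prepare_run_dirs(alignment, tmp)
    ml_ready = (dirs["ml"] / "infile").read_text() == "placeholder alignment\n"

    # PUZZLE would normally write 'outdist'; a placeholder stands in for it here.
    (dirs["nj"] / "outdist").write_text("placeholder distances\n")
    promote_distance_matrix(dirs["nj"])
    nj_infile = (dirs["nj"] / "infile").read_text()

print(ml_ready, nj_infile.strip())
```

Keeping the two analyses in separate directories matters because both write a file called ‘outtree’, and a second run in the same directory would collide with the first.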


Compare the trees estimated from these different datasets using these two different methods to the species tree for the organisms included in the analysis (there are additional organisms included in this tree that are not present in your dataset - if you are interested and have the time, there are additional sets of LDH genes with which you could choose to repeat the above estimation procedure - just ask one of the demonstrators for access to these files.)




One can obtain a minimal estimate for the number of duplications that must have occurred in the evolution of a lineage by considering the number of genes present in each of the organisms being studied. It is clear that if an organism contains three different copies of a gene, then there must have been at least two duplication events in the tree being studied, since each duplication adds one copy to a single ancestral gene (assuming no lateral transfer events). Do any of the phylogenetic algorithms estimate a phylogeny that, if correct, would require the assumption of only this minimal number of duplication events? Which of the methods, given the species tree, do you consider to have obtained the best phylogeny estimate? Why?



This exercise should have demonstrated: