
Day 1: 31st May 2005

Introduction to phylogenetic trees and phylogenetic estimation algorithms

Key to webpage styles

Within these webpages, different styles are used to distinguish text that contains different kinds of information: background information; descriptions of exercises; instructions within exercises; questions; and Unix commands that should be typed.


Background information








Unix commands


Unix commands will always be written preceded by a ">" sign. This is to indicate the prompt provided by the terminal AND SHOULD NOT BE TYPED. Thus, when you read


> mkdir happy_go_lucky


you should type the words


mkdir happy_go_lucky


at the prompt provided to you by the terminal.


Additionally, if one of the words in a command must be chosen by you, the user, that word is shown between <> signs. Replace both signs, and the word between them, with your own value when typing the command. Thus, when you read


> less <name_of_tree_file>


then you should type the words


less outtree


assuming that the file containing your phylogenetic tree is called "outtree".


When a Unix command or feature is first introduced, its implementation will be carefully specified. Details of the use of the Unix commands and features can also be found on this page.



Interpretation of phylogenetic trees

In this section, we will explore and discuss the meaning and interpretation of phylogenetic trees estimated from molecular sequence data.


All methods for estimating phylogenetic trees that we will cover during this course estimate UNROOTED phylogenetic trees. Unrooted trees (unlike rooted trees) do not allow one to determine the order in which divergences between lineages have occurred. However, for the purposes of most analyses, we want to infer exactly this i.e. the order in which these events have occurred. Thus, typically, one is required to use some criterion to place a root on the inferred unrooted tree before one can consider whether the results of the analysis support a particular hypothesis of interest. Therefore, given that most analyses require consideration of rooted phylogenetic trees, we will begin by considering the meaning and interpretation of such trees.


As was discussed in the introductory presentation, the phylogenetic trees that you will be estimating from molecular data can be described in terms of a combination of (i) a tree topology and (ii) a length for each of the branches present in that topology. The purpose of a phylogenetic analysis is usually to determine the order in which lineages diverged from one another. This order of divergence is independent of the lengths of the branches concerned. Therefore, the following exercises concentrate on interpreting the TOPOLOGIES of rooted phylogenetic trees, rather than the lengths of the branches.



Rooted tree topologies

Rooted tree topologies describe a set of relationships concerning the relatedness of the sequences being analysed to each other. These relationships can be specified by statements of the form


"Sequence A and sequence B are more closely related to one another (i.e. share a more recent common ancestor/lineage) than either is to sequence C".


As an example, consider the following tree of relationships between selected vertebrate organisms.



The set of relatedness relationships that are described by this tree can be specified by the following three statements (note that the minimum number of such statements required to fully describe the tree is the same as the number of internal branches on the tree)


1.    Human and mouse sequences are more closely related to each other than either is to any of the other sequences.

2.    The group of human and mouse sequences is more closely related to the frog sequences than it is to either the salmon or the zebrafish sequences.

3.    The salmon and zebrafish sequences are more closely related to each other than either is to any of the other sequences.


A third way of representing this set of relatedness relationships is to use sets of nested parentheses. In this method of representation, sequences (or groups of sequences) that are "sister groups", i.e. that are directly linked to the same internal branch, are included in the same set of parentheses. Thus, the tree above can be described as shown below. We have colour-coded the parentheses to indicate which opening and closing parentheses are associated with each other.




This form of representation is the one typically used by computers, and is often referred to as the NEWICK format. Note that all sets of parentheses INCLUDING THE OUTERMOST RED PAIR contain two elements (each either a terminal node name or a further set of parentheses) separated by a comma. Thus, above, the red parentheses enclose the blue and green sets of parentheses separated by a comma, while the blue parentheses enclose the terminal node 'frog' and the black parentheses, again separated by a comma. This characteristic of every set of parentheses containing two and only two elements is shared by all ROOTED BIFURCATING trees represented in NEWICK format.
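The two-element rule described above can be checked mechanically. The following is a minimal Python sketch and not part of the course software: the function names are hypothetical, the parser assumes a well-formed topology-only string (no branch lengths, no quoted names), and the vertebrate tree string is reconstructed from the colour-coded description above rather than copied from the original figure.

```python
def parse_newick(text):
    """Parse a topology-only NEWICK string into nested tuples of names."""
    text = "".join(text.split()).rstrip(";")  # drop whitespace and the final ";"
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # step over "("
            children = [node()]
            while text[pos] == ",":
                pos += 1                  # step over ","
                children.append(node())
            pos += 1                      # step over ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]            # a terminal node name

    return node()


def group_sizes(tree):
    """Yield the number of elements inside every pair of parentheses."""
    if isinstance(tree, tuple):
        yield len(tree)
        for child in tree:
            yield from group_sizes(child)


def is_rooted_bifurcating(newick):
    """True if every pair of parentheses holds exactly two elements."""
    return all(size == 2 for size in group_sizes(parse_newick(newick)))


# Assumed reconstruction of the vertebrate example tree.
vertebrates = "((frog,(human,mouse)),(salmon,zebrafish));"
print(is_rooted_bifurcating(vertebrates))
print(is_rooted_bifurcating("(A,B,C);"))
```

The first call reports a rooted bifurcating tree; the second does not, because its outermost parentheses hold three elements.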


A common reason for the failure of a phylogenetics programme is a problem with the format of the phylogenetic tree passed to the programme. It is therefore important to be familiar with the NEWICK format, so that you can explore possible reasons for the failure of phylogenetic software.


Note that, in all of the methods of representing rooted phylogenetic trees described above, the order in which closely related sequences are listed can be changed without altering the meaning. Thus, the following two statements refer to the same relatedness relationship


1.    Human and mouse sequences are more closely related to each other than either is to any of the other sequences.

2.    Mouse and human sequences are more closely related to each other than either is to any of the other sequences.


As do the following two tree images



As do the following three NEWICK format trees







Exercise 1

The aims of this exercise are to:

·      reinforce the concepts described above concerning the interpretation of rooted phylogenetic tree topologies;

·      make you more familiar with different ways in which one can present the information represented in a phylogenetic tree.


In this exercise you will be presented with sets of relatedness relationships between groups of sequences in three different forms: (i) sets of statements; (ii) phylogenetic trees; and (iii) NEWICK format. From each set of relationships that you are presented with, you will then be asked to represent them in a different form.


(a) Provide, for the tree shown below, the set of statements that describe the relatedness relationships between the sequences represented in the tree.



(b) Based on the three statements given below, provide the corresponding NEWICK format representation of these relationships.


1.     Sequences A and B are more closely related to each other than either is to any other sequence in the tree.

2.     Sequence C is more closely related to sequences AB than it is to sequence D.

3.     Sequence D is equally distantly related to sequences A, B and C.



(c) Based on the information provided below in NEWICK format, draw the corresponding phylogenetic tree.




Rooted and unrooted trees 

As mentioned in the introductory talk, the phylogenetic methods that you will be using in this course all estimate UNROOTED phylogenetic trees. Unrooted trees can be considered as simply representing a set of rooted trees. The process of rooting an unrooted tree is that of placing an "ancestral" node onto the unrooted tree. This node is that of the earliest divergence present in the tree. It provides a temporal order to the divergences represented in the tree - all other divergences in the tree occurred after that of the ancestral node.


Each of the rooted trees included in the set represented by a given unrooted tree can be obtained by placing the ancestral node on each of the branches in the unrooted tree (including both internal and terminal branches). Thus, the unrooted tree shown below describes the following set of rooted trees.



Clearly, different rooted trees imply different sets of relatedness relationships. For example, rooted tree 1 is consistent with the statement


"Sequence B is more closely related to sequences C and D than it is to sequence A"


while rooted tree 2 IS NOT consistent with this statement.


Thus, before considering the relatedness of sequences, we need to specify the root of the estimated tree. This usually requires the use of additional assumptions/hypotheses about (a) the evolution of the organisms providing the sequences and (b) the evolution of the gene family.


Before looking at ways in which one can attempt to specify the root of a tree, however, we will briefly look at the way in which the NEWICK format represents unrooted (as opposed to rooted) trees. Remember that the NEWICK format represents the set of relatedness relationships using sets of nested parentheses.


The difference between the representations of rooted and unrooted bifurcating trees in NEWICK format lies in the contents of the outermost pair of parentheses. As already mentioned earlier, rooted bifurcating trees contain two elements within all their sets of parentheses - including the outermost pair. However, unrooted bifurcating trees contain THREE elements within the outermost pair of parentheses, but TWO within all other sets of parentheses.


Thus, trees 1, 2 and 3 of the NEWICK format trees given below are unrooted trees, while 4, 5 and 6 are rooted.


1. (A,B,C)

2. (A,(B,C),((D,E),F))

3. ((A,B),(C,D),(E,F))

4. (D,E)

5. ((A,(B,C)),D)

6. ((A,B),(C,D))
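The rule above (two elements in the outermost parentheses for a rooted tree, three for an unrooted tree) can be applied by counting the commas that sit directly inside the outermost pair. The following Python sketch uses hypothetical function names and only inspects the outermost parentheses; it assumes the rest of the tree is bifurcating and well formed.

```python
def top_level_elements(newick):
    """Count the elements directly inside the outermost parentheses."""
    text = "".join(newick.split()).rstrip(";")
    depth = 0
    count = 1
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 1:
            count += 1          # a comma that belongs to the outermost pair
    return count


def classify(newick):
    """Label a bifurcating NEWICK tree as rooted or unrooted."""
    n = top_level_elements(newick)
    if n == 2:
        return "rooted"
    if n == 3:
        return "unrooted"
    return "neither (not a bifurcating tree)"


# The six example trees listed above.
examples = [
    "(A,B,C)",
    "(A,(B,C),((D,E),F))",
    "((A,B),(C,D),(E,F))",
    "(D,E)",
    "((A,(B,C)),D)",
    "((A,B),(C,D))",
]
for number, tree in enumerate(examples, start=1):
    print(number, tree, classify(tree))
```

Run on the six trees above, the sketch labels trees 1 to 3 as unrooted and trees 4 to 6 as rooted, in agreement with the text.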



Exercise 2

The aims of this exercise are:

·      to further develop your ability to interpret trees represented in the NEWICK format

·      to develop your ability to distinguish between rooted and unrooted phylogenetic trees


In the exercise you will be presented with a number of different trees in NEWICK format and be asked to identify those which are rooted and those which are unrooted. You will then be asked to consider the basis for the way in which NEWICK format represents these two different types of trees.


Which of the following trees is rooted and which unrooted?


(i)   ((((A,B),C),D),E,(F,G))

(ii)  (((A,B),(C,D)),(E,F))

(iii) (A,(((B,D),C),(F,(E,G))))

(iv)  ((A,(B,C)),(D,(E,F)),(G,H))


Consider the various images of both rooted and unrooted trees above in the webpage.

The difference between the NEWICK format representations of these two different types of trees is based on characteristics of the sets of internal nodes associated with these different types of trees.


By considering the number of branches attached to each node, and by comparing these numbers between ancestral and non-ancestral nodes, can you identify characteristics of these two different types of trees that are consistent with their representations in NEWICK format?



Identifying identical trees

The result of a phylogenetic analysis is a phylogenetic tree. However, for the estimated tree to be of any interest, it must be compared to the trees expected under different hypotheses of the evolution of the gene family. For example, one may compare the estimated tree with that expected assuming no gene duplications in the gene family or assuming a large number of gene duplications and losses within the family. Where the estimated tree is the same as that expected under one of these hypotheses, then the data (i.e. the protein multiple sequence alignment) is considered to support that hypothesis for the evolution of the gene family better than any of the other hypotheses. Thus, to be able to interpret the results of a phylogenetic analysis, one needs to be able to compare pairs of trees and determine whether or not the two trees are identical. This is not necessarily as easy as it sounds when comparing trees with different branch lengths, or when comparing unrooted trees represented in rooted form. Thus, the next exercise involves examining sets of trees and identifying those trees within the sets that have the same topology.
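One way to make the comparison of unrooted topologies concrete is to list, for each tree, how every internal branch divides the terminal names into two groups; two trees then have the same topology when these lists agree. The Python sketch below illustrates the idea under stated assumptions: the function names and the three example trees are hypothetical, the trees must contain the same terminal names, and the parser accepts topology-only NEWICK strings (branch lengths would have to be stripped first).

```python
def parse_newick(text):
    """Parse a topology-only NEWICK string into nested tuples of names."""
    text = "".join(text.split()).rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1
            children = [node()]
            while text[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    return node()


def clades(node, found):
    """Return the terminal names below `node`, recording each internal clade."""
    if not isinstance(node, tuple):
        return frozenset([node])
    below = frozenset().union(*(clades(child, found) for child in node))
    found.append(below)
    return below


def splits(newick):
    """Return the two-group divisions made by the internal branches."""
    found = []
    everything = clades(parse_newick(newick), found)
    result = set()
    for clade in found:
        rest = everything - clade
        if len(clade) > 1 and len(rest) > 1:   # skip terminal branches and the root
            result.add(frozenset([clade, rest]))
    return result


def same_topology(newick_a, newick_b):
    """True if both trees imply the same set of internal-branch divisions."""
    return splits(newick_a) == splits(newick_b)


# Hypothetical trees: the first two differ only in the order of their elements.
tree1 = "((A,B),C,(D,E));"
tree2 = "(E,D,(C,(B,A)));"
tree3 = "((A,C),B,(D,E));"
print(same_topology(tree1, tree2))
print(same_topology(tree1, tree3))
```

Because each division is stored as an unordered pair of groups, the result does not depend on where the tree happens to be rooted in the file or on the order in which sister groups are written.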



Exercise 3

The aims of this exercise are to:

·      introduce the NJPLOT programme as a means of examining phylogenetic trees;

·      provide practice in the identification of identical pairs of tree topologies.


In the exercise you will be presented with several different sets of trees. From each set of trees you should identify the pair that have identical topologies.


Each of the three different sets of trees below (a, b and c) contains three unrooted trees. Trees within the same set describe relationships between the same set of terminal branches. Two of the trees in each set represent exactly the same tree topology, while the third tree has a different topology from that of the other two. You are asked to identify the pair of identical tree topologies from each of the three sets of trees. All trees are provided in NEWICK format. Trees in sets (a) and (b) are provided in files that you should download and visualise using NJPLOT, while the trees in set (c) are provided in text within this document only. To explore the trees in set (c), draw each of them on paper and then compare them to each other.



Creating a file structure in which to examine the files

It makes sense to organise the files that you download into a set of directories such that files belonging together can be easily identified. We provide here a suggested structure for organising your files along with explicit instructions on how to create it. We suggest that, for all other exercises that involve working with downloaded files, you create a similar structure. The instructions given below are somewhat more elaborate than one would typically use, in order to illustrate certain features of Unix systems.


Create a directory in your home directory called "day1" using the mkdir command:


> mkdir day1


Move into the newly created directory using:


> cd day1


Create another new directory in the directory "day1" called "exercise3":


> mkdir exercise3


Using "cd" and "mkdir" create a further directory in "exercise3" called "tree_set_a"


Move into the directory "tree_set_a" and download into it the files provided below that contain the three trees to be considered in this set.


Move up into the parent directory of "tree_set_a" - i.e. into the directory "exercise3" - using:


> cd ..


Check that you have moved into the correct directory using the "pwd" command:


> pwd


Note that you could also have moved into this directory by specifying the full path of the directory into which you wanted to move using:


> cd /home/<your_user_name>/day1/exercise3


Note further that typing the above long "cd" command is quicker if you press the "tab" key after typing the first few letters of each directory name - try this out to understand how this feature of the Unix shell works.


Create a directory called "tree_set_b", move into it, and download the tree files provided below in set b.
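For comparison, the same directory layout can be created in one step from a script. This is a minimal Python sketch and not part of the course instructions: the function name is hypothetical, and the demonstration builds the layout inside a temporary directory so that nothing in your home directory is touched.

```python
import tempfile
from pathlib import Path


def make_exercise_dirs(base):
    """Create day1/exercise3/tree_set_a and tree_set_b under `base`."""
    created = []
    for name in ("tree_set_a", "tree_set_b"):
        path = Path(base) / "day1" / "exercise3" / name
        # parents=True creates day1 and exercise3 as needed;
        # exist_ok=True makes it safe to run the function twice.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as scratch:
    for directory in make_exercise_dirs(scratch):
        print(directory.relative_to(scratch))
```

To build the layout in your home directory instead, you would pass `Path.home()` as `base`.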



Visualise tree files using NJPLOT

Move into the directory "tree_set_a" using "cd".


Visualise the contents of the first tree file using NJPLOT by typing:


> njplot <first_tree_filename> &


This should open a graphical window containing your file represented as a tree. To appreciate the purpose of the "&" added after the command typed above, first demonstrate to yourself that you are able to type further commands in your terminal window e.g.


> ls


Then close the graphical window displaying the tree in NJPLOT and type the NJPLOT command again without adding the "&"


> njplot <first_tree_filename>


Now attempt to type a command to the terminal window.


To be able to begin issuing commands again in the terminal window, press the "control" key and the "z" key simultaneously. This suspends the NJPLOT process.


Now, however, the NJPLOT window no longer responds. To re-enable the NJPLOT window type


> bg


Whenever you use Unix to start a window outside the terminal that you want to interact with, you should "put the process into the background" in one of these two ways - either by following the command with a "&" or by pressing "control-z" and then typing "bg".



Changing file names

You may want to alter the names of the files you have downloaded. To do this use either the "cp" or the "mv" command. The "cp" command creates a second copy of the file with a new name, while retaining the initial copy of the file. The "mv" command renames the file: afterwards the content is available only under the new name, and the file NO LONGER EXISTS under its initial name.
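The difference between copying and renaming can be sketched with Python's standard library, where `shutil.copy` behaves like "cp" and `shutil.move` behaves like "mv" for a simple file. The file names below are hypothetical stand-ins, and the demonstration runs in a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path


def copy_then_rename(workdir):
    """Copy a file, then rename the original; return the names that remain."""
    workdir = Path(workdir)
    original = workdir / "downloaded_tree_file"       # hypothetical file name
    original.write_text("(A,B,C);\n")

    # Like "cp": the original and the copy both exist afterwards.
    shutil.copy(original, workdir / "tree1_copy")

    # Like "mv": the content is kept, but only under the new name.
    shutil.move(str(original), str(workdir / "tree1"))

    return sorted(p.name for p in workdir.iterdir())


with tempfile.TemporaryDirectory() as scratch:
    print(copy_then_rename(scratch))
```

After the call, the directory contains "tree1" and "tree1_copy" but no longer contains "downloaded_tree_file".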


For example, the first tree to be downloaded is called "". To create a second file with the same content but with a simpler name, while retaining the initial file, type:


> cp <downloaded_filename> <new_filename>


or, if you also want to delete the initial copy of the file


> mv <downloaded_filename> <new_filename>



Comparing the trees

Use NJPLOT (as described above) to visualise the trees that are provided below in sets a and b.


Investigate only one set of trees at a time.


Once trees have been loaded into NJPLOT, click on the "New outgroup" button in the NJPLOT window, and experiment using different outgroups by clicking on different "#" signs located around the tree. To alter the orientation of two branches around an internal branch, click the "Swap nodes" button and again try clicking on the "#" signs to explore the behaviour of the programme.


Identify for each set of trees the pair of identical tree topologies.

Tree set a

tree1 tree2 tree3

Tree set b

tree1 tree2 tree3

Tree set c

(i) (((A,B),C),D,((E,F),G));

(ii) (G,(E,F),(D,(C,(B,A))));

(iii) (F,E,(G,(C,(D,(A,B)))));



Similarity between phylogenetic trees

In the previous exercise, you were asked to identify pairs of identical trees. As discussed, it is important to be able to identify identical pairs of trees to enable the comparison of estimated trees with those expected under different hypotheses of evolution.


However, the typical scenario that one is faced with after a phylogenetic analysis is that the estimated tree IS NOT identical to a tree expected under any simple hypothesis of evolution of the gene family. There are essentially three different possible explanations for results of this kind: (1) the results contain errors i.e. the wrong tree was estimated; (2) the evolution of the genes is more complicated than explained by a simple model; or (3) a mixture of (1) and (2).


Estimated phylogenies usually contain errors - as you will discover during this course. Often the estimated phylogeny is similar, but not identical, to that expected under some simple hypothesis of the evolution of a gene family. In this case, it is reasonable to assume that the data best supports the hypothesis whose expected tree is most similar to that of the estimated tree. (This of course assumes that we are expecting more complex hypotheses to be very unlikely or indeed do not even consider them as possible explanations of the data.) To be able to do this, one needs to be able to compare pairs of trees and identify those pairs that are most similar to one another.


The concept of "tree distances" provides a way of addressing questions of this kind. Identical trees are considered to be separated by a distance of 0. Non-identical trees are considered as separated from each other within "tree space" by some value that reflects the extent of the difference between the structures of the trees that are being compared. There are many different aspects of tree structure that could be used to define such distances. For example, one might count the minimum number of branches that would have to be broken and reattached to convert one tree topology into the other, and use this number as the distance between the trees. Alternatively, one might count the number of internal branches that 'differ' between two trees - in this context, two trees are considered to share an internal branch when both contain an internal branch that divides the set of sequences in the tree into the same two groups of sequences. The number of internal branches that are present in one tree but not shared by the other could then be used as the distance between the trees. There are many other ways in which tree distances can be defined. Within the context of tree distances, the most similar pair of trees within a set is the pair whose pairwise tree distance is smallest.
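The second distance described above (counting the internal branches that are not shared) is commonly known as the Robinson-Foulds or symmetric-difference distance. The Python sketch below is one hedged illustration of it: the function names and example trees are hypothetical, the trees are assumed to contain the same terminal names, and the parser accepts topology-only NEWICK strings.

```python
def parse_newick(text):
    """Parse a topology-only NEWICK string into nested tuples of names."""
    text = "".join(text.split()).rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1
            children = [node()]
            while text[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    return node()


def clades(node, found):
    """Return the terminal names below `node`, recording each internal clade."""
    if not isinstance(node, tuple):
        return frozenset([node])
    below = frozenset().union(*(clades(child, found) for child in node))
    found.append(below)
    return below


def splits(newick):
    """Return the two-group divisions made by the internal branches."""
    found = []
    everything = clades(parse_newick(newick), found)
    result = set()
    for clade in found:
        rest = everything - clade
        if len(clade) > 1 and len(rest) > 1:
            result.add(frozenset([clade, rest]))
    return result


def tree_distance(newick_a, newick_b):
    """Count internal branches found in exactly one of the two trees."""
    return len(splits(newick_a) ^ splits(newick_b))


# Hypothetical five-sequence trees.
print(tree_distance("((A,B),C,(D,E));", "(E,D,(C,(B,A)));"))
print(tree_distance("((A,B),C,(D,E));", "((A,C),B,(D,E));"))
```

The first pair are the same topology written differently, so their distance is 0. The second pair share the branch separating D and E from the rest but each has one branch the other lacks, giving a distance of 2.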


Given that there are many different ways in which one can define distances between trees, it follows that there is no single correct way in which one can compare a set of trees to identify those pairs which are most similar i.e. there are numerous different possible ‘tree distance’ measures, none of which can be considered the best. Additionally, using tree distances to approach questions such as “which of the possible hypotheses does my data best support” or “can I reject any of the possible hypotheses as not well supported by the data” does not take into consideration how well the trees under the different hypotheses are supported by the alignment data - which would be desirable. Methods exist that allow one to compare hypotheses in this way (i.e. in the context of the data), but we will not be covering them in this course, as they are rather complex and difficult to use appropriately.


The next exercise aims to give you experience identifying most-similar pairs of trees WITHOUT taking into account the data used to generate the trees, while also exploring some of the issues and assumptions associated with making such comparisons.



Exercise 4

The aims of this exercise are:

·      to give you experience with the concept of similarity between trees

·      to demonstrate that different ways of assessing the similarity between trees can yield different conclusions about which trees are most similar


In the exercise, you will be presented with several different sets of trees. Within each set of trees, you will be asked to identify the pair of trees that you consider most similar to each other.


Provided below are two sets of three tree topologies. Within each set of topologies, all trees describe relationships between the same set of sequences. Further, within each set, all trees have different topologies from each other.


For each of the two sets of trees below, A and B, identify the pair of trees that you consider more similar.


Set A: tree1 tree2 tree3

Set B: tree1 tree2 tree3


How did you decide which pairs of trees were most similar?


Can you state explicitly the rule that you used to determine which pairs of trees are most similar?


Devise an alternative rule for identifying pairs of most-similar trees.


Does this alternative rule identify different pairs of most-similar trees for tree sets A and B above?


Phylogenetic trees of gene families

Often, the aim of a phylogenetic analysis is to determine whether genes present in different organisms are related by speciation events (i.e. are "orthologous") or by gene duplication events (i.e. are "paralogous"). It is important to be able to distinguish between paralogous and orthologous genes because a general rule of gene family evolution appears to be that orthologous genes are more likely to have a similar function than are paralogous genes. Thus, if you can determine that a gene, A, that you have studied in one organism is orthologous to a gene, B, in another organism, you can conclude that gene B is likely to have the same (or at least a very similar) function to gene A, and then plan your experiments accordingly.


To be able to determine whether two genes are orthologous or paralogous, one needs to be able to identify those nodes/divergences in the phylogenetic tree of the gene family that represent speciation events and those that represent gene duplication events. Note that to discriminate between these types of nodes (speciation and duplication) requires that one make assumptions about the SPECIES TREE for the organisms providing the genes. Additionally, note that one needs to compare the species tree to a ROOTED gene tree - thus one is required to infer a root on the gene tree.


In cases where no genes have been deleted, it is trivial to identify gene duplication nodes on the gene tree. Any node that links a pair of lineages leading to genes sampled from the same set of organisms is a duplication node (see below). Note that the root node of the tree (indicated by a red square) links, on one side, to genes from four different organisms (mouseA, humanA, koalaA, lizardA) and, on the other side, to genes sampled from the same set of organisms (mouseB, humanB, koalaB, lizardB).
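The rule for the no-deletion case can be written as a short check: a node is a duplication node when the two lineages below it lead to genes from the same set of organisms. The Python sketch below rests on several assumptions that are not in the original: the function names are hypothetical, gene names are taken to be an organism name followed by a single copy letter (as in mouseA), and the gene tree string is an illustrative guess at a tree containing the eight genes named above, not a copy of the figure.

```python
def parse_newick(text):
    """Parse a topology-only NEWICK string into nested tuples of names."""
    text = "".join(text.split()).rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1
            children = [node()]
            while text[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    return node()


def species_of(gene):
    """Assumed naming scheme: organism name plus one copy letter."""
    return gene[:-1]


def find_duplications(node, found):
    """Return the organisms below `node`; record duplication nodes in `found`."""
    if not isinstance(node, tuple):
        return frozenset([species_of(node)])
    # A rooted bifurcating tree is assumed, so each node has two children.
    left, right = (find_duplications(child, found) for child in node)
    if left == right:               # same set of organisms on both sides
        found.append(node)
    return left | right


# Hypothetical rooted gene tree containing the eight genes named above.
gene_tree = ("((((mouseA,humanA),koalaA),lizardA),"
             "(((mouseB,humanB),koalaB),lizardB));")
duplications = []
find_duplications(parse_newick(gene_tree), duplications)
print(len(duplications), "duplication node(s) found")
```

In this assumed tree only the root node qualifies. The equality test is deliberately strict and applies only when no genes have been deleted.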



In the case where gene deletion events occur, gene duplication nodes are no longer "symmetrical" in this way. However, the duplication nodes can be identified as before by drawing in the complete deleted lineage and then identifying duplication nodes as above (see below). First the tree estimated from the gene sequences is displayed, followed by the reconciled gene and species trees. As before, the red box indicates a node associated with a duplication event. Greyed-out lineages and names are sequences that do not exist today i.e. that have been deleted (this assumes that our failure to identify a sequence indicates that it is not present in the genome of that organism).



Note that ANY gene tree can be reconciled with ANY species tree in this way - although in some cases (as seen below) one may need to infer many deletion and duplication events to carry out the reconciliation. The reconciled tree below was created using the GeneTree software written by Rod Page in Glasgow - this software will calculate the reconciled tree for a given species and gene tree pair. It creates images such as the one below here, that provide a very useful representation of the events required to reconcile the trees.



Note also that the number of duplications and deletions described above are the MINIMUM required to explain the observed gene and species trees. One can always imagine more intricate scenarios that would also reconcile the gene and species trees.


As already mentioned, one needs to be able to identify duplication and speciation nodes on gene trees in this way to be able to infer whether genes from different organisms are paralogs or orthologs. Thus, the following exercise is designed to give you practice in identifying whether nodes in a gene tree represent speciation or duplication events.



Exercise 5

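The aim of this exercise is to:

·      give you practice in identifying whether nodes in a gene tree represent speciation or duplication events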


In this exercise you will be asked to identify duplication and speciation nodes on gene family phylogenetic trees. To do this you will be provided with three different rooted gene trees and a rooted species tree.


Compare this species tree file to the gene tree files, using NJPLOT to visualise the trees.


What is the minimum number of duplication and deletion events required to reconcile each of these gene trees with the species tree?


For each of the two gene trees, would you describe the relationships between geneA and geneB as paralogous, orthologous or in some other way?



Rooting phylogenetic trees

As already discussed, a typical phylogenetic analysis involves estimation of an unrooted phylogenetic tree that is then interpreted as a rooted tree. For example, note that both the species and gene trees used in Exercise 5 were rooted. Thus, we need to understand the different ways in which we can choose to place the root on an unrooted phylogenetic tree.


Often, one infers the position of the root by the use of an "outgroup" i.e. a sequence that one is certain is equally distantly related to all other sequences in the analysis - for example see below. This approach requires that one has prior knowledge about the phylogenetic relationship of the outgroup sequence to the other sequences, but does not assume any prior knowledge of the phylogenetic relationships of the other sequences to one another.


A disadvantage of this approach is that it introduces an additional long branch into the tree. Long branches are typically difficult to place accurately in a phylogenetic tree, and their presence might also affect the accuracy with which other branches are placed within the tree.


An alternative is to consider the number of duplications and deletions required to explain the observed gene tree, and choose the position of the root that minimises this number - see below for an example of this approach.


This approach has several disadvantages - firstly one is required to make assumptions about the phylogenetic relationships of the species providing the genes whose phylogeny is estimated. Secondly, it assumes that the most likely scenario for the evolution of the gene family is the one in which the smallest number of duplication/loss events have occurred - an assumption that is not necessarily true. Thirdly, one is required to make arbitrary decisions about the relative cost of duplication and deletion events.


Note that there are alternative approaches that can be taken to inferring the root of a phylogenetic tree.


Given that you will usually need to infer the root of a tree when you carry out a phylogenetic analysis, we have designed the exercise below to give you practice rooting trees.



Exercise 6

The aim of this exercise is to:

·      give you practice specifying the root of a gene tree by identifying the root that minimises the complexity of the evolutionary scenario required to explain the gene tree


In the exercise you are asked to use a species tree and a gene tree to estimate the root of the gene tree.


For the following gene-tree/species-tree pair containing sequences from five different vertebrates, someone who you trust assures you that the root of the gene tree is located on one of the internal branches of the tree, when the root is assumed to be that which yields the evolutionary scenario that includes the smallest number of gene duplication events.


Species tree:

Gene tree:


Place the root of the tree at each of the internal branches in turn, and identify that internal branch that, when used as the root, minimises the number of gene duplication events.






Estimating the phylogeny of vertebrate lactate dehydrogenase genes

As discussed during the presentation, molecular phylogenetic analysis can be considered as the application of a substitution model and a phylogenetic analysis algorithm to an aligned dataset to obtain an estimate of the phylogeny. To examine the influence that these three components of the analysis (model, algorithm and dataset) have on the estimated phylogeny, we will investigate what happens to the phylogeny estimates when one changes only one of these aspects of an analysis while keeping the other aspects the same. In the following exercise we will look at what can happen when you change the algorithm while keeping both the model and the dataset the same.



Exercise 7

The aims of this exercise are to:

·      demonstrate that, using different algorithms but the same substitution model and dataset, you may estimate different phylogenetic trees;

·      provide experience using several different pieces of phylogenetic software.


In this exercise you are asked to estimate the phylogeny of a set of vertebrate lactate dehydrogenase (LDH) genes using both the Neighbor-Joining (NJ) algorithm and maximum likelihood (ML). To estimate the NJ tree, you will use PUZZLE to estimate a distance matrix and then apply NEIGHBOR to construct an NJ tree from this distance matrix. To estimate an ML tree, you will use PROML. 


By comparing the trees estimated from these two algorithms with the species tree for these organisms, you are then asked to determine which of the two methods estimated the more correct phylogeny.



Files for use in the exercise

·      Protein multiple sequence alignment (MSA) file for use by PUZZLE and PROML (phy format interleaved): day1_ex7_infile.phy

·      Protein MSA file for examination using CLUSTALX (pir format): day1_ex7_infile.pir

·      Species tree file:


Description of the data set provided

During vertebrate evolution, the LDH gene family has experienced several gene-duplication events. The subset of vertebrate LDH genes included in this analysis arose by duplication from other LDH genes early in the vertebrate lineage, and includes three different human genes.


The datasets provided here do not include all known vertebrate LDH sequences. However, the sequences excluded from the datasets were clearly lineage-specific duplications in organisms that are included in the analysis. As the human genome is the only one for which we have the complete, finished sequence, the absence of a gene from the dataset does not indicate that it is absent from that organism’s genome, unless it is a human gene. I have also excluded the genes of several organisms for which relevant LDH genes have been sequenced. These sequences were excluded to speed up the analyses.


Neighbor-Joining analysis

Organise files

Create a directory /home/<your_username>/day1/exercise7/nj


Copy the alignment file in phy format into this directory
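The two steps above might be carried out as sketched below. The commands are shown without the ">" prompt. The sketch assumes that your home directory ($HOME) is /home/<your_username>, and that you saved the alignment file in your home directory; the variable names NJ_DIR and ALIGNMENT are illustrative only, and the ALIGNMENT path should be changed to wherever you actually saved the file.

```shell
# Assumption: your home directory ($HOME) is /home/<your_username>.
NJ_DIR="$HOME/day1/exercise7/nj"

# Assumption: the alignment was saved in your home directory; edit this path if not.
ALIGNMENT="$HOME/day1_ex7_infile.phy"

# -p also creates the day1 and exercise7 directories if they do not exist yet.
mkdir -p "$NJ_DIR"

# Copy the alignment into the new directory; print a message if the copy fails.
cp "$ALIGNMENT" "$NJ_DIR/" || echo "Could not copy $ALIGNMENT" >&2

# List the contents of the directory to check the result.
ls -l "$NJ_DIR"
```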



Create distance matrix

The NJ algorithm operates on a distance matrix. For the amino acid sequence data used in this analysis, this matrix consists of estimates of the number of amino acid substitutions between every pair of sequences in the alignment. These estimated numbers of amino acid substitutions (the 'distances') are calculated by maximum likelihood using the PUZZLE programme. To calculate ML distances, you need to specify a substitution model, which tells the programme how likely it should consider each of the possible substitutions between amino acids to be.


Thus, to calculate a distance matrix using PUZZLE, you need to pass the protein MSA file into the programme. To do this, you need to execute the PUZZLE programme in a directory that contains a file called "infile" that contains the alignment data.


Using cp, copy the phy-format alignment file within the "nj" directory so that the directory contains a second copy of the file called "infile".
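A minimal sketch of this copy is given below, assuming that the "nj" directory sits under $HOME and that the alignment kept its original name, day1_ex7_infile.phy, when it was copied there (NJ_DIR is an illustrative variable name). The final step simply prints the first line of the new file, which in a phy-format alignment gives the number of sequences and the number of alignment columns.

```shell
# Assumption: the "nj" directory was created under $HOME.
NJ_DIR="$HOME/day1/exercise7/nj"

# cp leaves the original file in place and creates a second copy called "infile".
cp "$NJ_DIR/day1_ex7_infile.phy" "$NJ_DIR/infile" || echo "Alignment not found in $NJ_DIR" >&2

# Quick check: show the first line of the copy, if the copy was made.
if [ -f "$NJ_DIR/infile" ]; then
    head -n 1 "$NJ_DIR/infile"
fi
```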


Execute PUZZLE from within the "nj" directory using the following Unix command:


> puzzle


Use the interactive menu provided by PUZZLE to specify how the programme should be executed.


To specify the instantaneous rate matrix for use in the calculations:

Type "M" followed by "ENTER". Repeat until "JTT" is shown as “Model of substitution”.


To create a distance matrix, but not to calculate a ML tree:

Type "B" followed by "ENTER". Repeat until “Type of analysis” is “Likelihood mapping”.

All you need to know about this type of analysis is that it creates a distance matrix file.


Once all options have been set as required:

Type "Y" followed by "ENTER" to begin the calculations.


The distance matrix is written into a file called "outdist".


Use the Unix "cp" command to copy the contents of "outdist" to a file with a more meaningful name such as "puzzle_dist_matrix".



Apply NJ algorithm to distance matrix

Now that you have created a distance matrix, you need to apply the NJ algorithm to the matrix to calculate a phylogenetic tree.  You will use the NEIGHBOR programme (part of the PHYLIP package) to carry out this calculation.


As for other PHYLIP programmes (and indeed as for PUZZLE), NEIGHBOR expects to find the information that it will use to calculate the tree in a file called "infile" in the directory from which NEIGHBOR is executed. As we want to pass NEIGHBOR our distance matrix file calculated using PUZZLE as input:

Copy the file containing the pairwise ML distances into a file called "infile".
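As a sketch, and assuming that you named the distance file "puzzle_dist_matrix" as suggested and that the "nj" directory sits under $HOME (NJ_DIR is an illustrative variable name), this step could look as follows. Note that the copy replaces the alignment previously stored as "infile" in this directory; the alignment itself is still available under its original file name.

```shell
# Assumption: the "nj" directory was created under $HOME.
NJ_DIR="$HOME/day1/exercise7/nj"

# Overwrite "infile" (previously the alignment) with the PUZZLE distance matrix,
# so that NEIGHBOR reads the distances rather than the sequences.
cp "$NJ_DIR/puzzle_dist_matrix" "$NJ_DIR/infile" || echo "puzzle_dist_matrix not found in $NJ_DIR" >&2
```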


Execute the NEIGHBOR programme from within the "nj" directory by using the Unix command:

> neighbor


We do not need to change any of the options that can be used to alter the behaviour of NEIGHBOR, so, as for all PHYLIP programmes (and PUZZLE), start the calculation:

Type "Y" followed by "ENTER" to execute the calculation


The tree estimated by NEIGHBOR is written to a file called "outtree".

Copy the "outtree" file to a file with a more useful name e.g. "neighbor_tree".


Compare the tree estimated by NEIGHBOR to the rooted species tree using NJPLOT.


Based on this comparison of gene and species trees, what is the minimum number of gene duplication events required to account for the tree topology estimated by NEIGHBOR?


We will next estimate the phylogeny of these sequences using ML. We will then, as for the NJ analysis above, compare the tree estimated using ML to the species tree and determine the minimum number of duplication events required to account for the observed gene tree. By comparing the results of this analysis to that carried out for the NJ tree, we will attempt to decide which of the two methods has estimated the "better" tree.



Maximum likelihood analysis

Create a directory called /home/<your_username>/day1/exercise7/ml


Copy the phy format alignment for the LDH sequences into this new "ml" directory.


We will use the PROML programme from the PHYLIP package to estimate the phylogeny of these sequences using ML. As already mentioned, data are passed to PHYLIP programmes by executing them from within a directory containing files that have the names specified by PHYLIP and that contain the appropriate data. In this case, PROML requires a phy-format protein MSA file with the name "infile" to be present in the directory from which it is executed. Therefore:


Copy the phy format MSA file in the "ml" directory to a file named "infile".
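The file preparation for the ML analysis can be sketched in the same way as for the NJ analysis. The sketch assumes that $HOME is /home/<your_username> and that the alignment is taken from the copy already placed in the "nj" directory; ML_DIR and ALIGNMENT are illustrative variable names, and the ALIGNMENT path should be adjusted if your copy is elsewhere.

```shell
# Assumption: your home directory ($HOME) is /home/<your_username>.
ML_DIR="$HOME/day1/exercise7/ml"

# Assumption: reuse the alignment already copied into the "nj" directory.
ALIGNMENT="$HOME/day1/exercise7/nj/day1_ex7_infile.phy"

# Create the "ml" directory (and any missing parent directories).
mkdir -p "$ML_DIR"

# Copy the alignment in, then make the second copy called "infile" for PROML.
cp "$ALIGNMENT" "$ML_DIR/" || echo "Could not copy $ALIGNMENT" >&2
cp "$ML_DIR/day1_ex7_infile.phy" "$ML_DIR/infile" || echo "Alignment not found in $ML_DIR" >&2
```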


Execute PROML by typing the Unix command:

> proml


We now want to run the ML analysis using the same substitution model as for the calculation of ML distances for the NJ analysis. To do this:

Type "P" followed by “ENTER”. Repeat until the “JTT or PAM amino acid change model” is set to “Jones-Taylor-Thornton model”.


Begin calculations:

Type "Y" followed by "ENTER"


NOTE!!! The options chosen for this analysis are designed to make it execute quickly. When analysing your own data, you should choose options that yield a more accurate analysis.


PROML writes the estimated tree into a file called "outtree".


Copy the file "outtree" into a file with a more meaningful name.


Compare the tree estimated by PROML to the rooted species tree using NJPLOT.


Based on this comparison of gene and species trees, what is the minimum number of gene duplication events required to account for the tree topology estimated by PROML?



Comparing trees estimated using NJ and ML

Which of the two analyses (NJ or ML) do you consider was the more accurate?




Note that there is no rule about which methods identify the better trees. Different methods perform differently on different data sets - sometimes NJ can estimate the correct tree where ML estimates the wrong tree, and vice versa.


We will wait until you have learned more about phylogenetic methods before offering general advice about how to conduct your own analyses and how to choose which methods of analysis to use. However, in general, it is a good idea to analyse your data using several different methods and to compare the results. If all methods estimate the same tree, you can have more confidence in that tree than if you had estimated it using only one method.



The exercises above will hopefully have demonstrated that:

1.    A phylogenetic tree estimated from gene sequences cannot be interpreted in isolation. Interpretation requires the use of prior hypotheses concerning the phylogeny of the organisms providing the sequences being analysed.

2.    When rooting a gene tree without the use of an outgroup, one makes assumptions about how likely one considers gene duplication and deletion events to be in the evolution of that gene family.

3.    Phylogenetic analyses can be carried out with the sequence alignment and the substitution model held constant while the algorithm is varied.

4.    Different phylogenetic estimation algorithms can estimate different phylogenies from the same substitution model and sequence alignment.