Practical
Courses on Sequence Analysis
Molecular Phylogeny with Maximum Likelihood
by Jose
Castresana July 5th, 2001
Introduction
In this practical we will learn how to generate trees by maximum
likelihood from a set of aligned sequences, and we will also use
one of the most powerful features of this approach: the possibility
to make a statistical evaluation of a tree and to compare alternative
hypothesis about the branching order in the tree and about the
process of evolution that generated the sequences.
We will use aligned DNA and amino acid sequences of several
species in order to study the phylogenetic relationships among
these species. The same principles apply to alignments of protein
families that include paralogues where the aim is to study the
evolution of a protein family.
There are several good programs to do maximum-likelihood analysis
from a set of aligned sequences. We will only use here MOLPHY
and PUZZLE. Both can do the analysis of DNA and amino acid sequences
(the latter is not yet possible in current versions of other very
good packages like PAUP* or PHYLIP) and both are free. They have
complementary features and therefore it is good to get used to
both. If you are mainly doing analysis of DNA sequences you should
also have a look at PHYLIP and PAUP*, which also include other
methods of tree reconstruction such as distance methods and parsimony
(in these cases for both DNA and amino acids).
To better follow the practical it is an advantage to have some
knowledge on basic aspects of phylogenetic analysis, including
notions on sequence alignments and the hidden multiple substitutions,
tree topology, the monophyly and paraphyly concepts, and the meaning
of the molecular clock. You can take a quick look at some pages
explaining very briefly some of these concepts:
Exercise 1: Phylogenetic trees of amniotes
with mitochondrial protein sequences: construction of maximum-likelihood
trees and evaluation of alternative phylogenies
Amniotes include those vertebrates that possess many adaptations
for the terrestrial life: reptiles, birds and mammals.
Although it is clear that amniotes form a monophyletic group,
the phylogenetic relationships among them have been widely debated
from both morphological and molecular points of view. There are
three possibilities for the relationship among reptiles, birds
and mammals:
In fact, there may still be other alternatives if reptiles
do not form a monophyletic group, as explained below.
We will use a given alignment of mitochondrial proteins to
construct maximum-likelihood trees of representatives of the main
amniote groups. This alignment was generated after concatenating
amino acid alignments of the subunits 1, 2 and 3 of cytochrome
oxidase, the subunits 1 and 2 of NADH dehydrogenase
and the cytochrome b. From these alignments, all
positions containing a gap were eliminated since it is not easy
to study insertion/deletion events.
- Open a terminal window and connect to tau. As always, do:
- xhost +
- telnet tau
- setenv DISPLAY address_of_your_computer:0
- Copy to your system the files that you will need for this exercise:
- mkdir amniotes (to make a new directory)
- cd amniotes (to go down to the new directory)
- cp /home/seqanal/public_html/courses/ml_course/data/amniotes/* . (to copy the necessary files to your directory. Do not forget the last dot.)
- You have two alignments in three different formats (every program we will use, ClustalX, Puzzle and Molphy, need a different format!):
- amn1.pir: alignment of 9 amniotes and an outgroup in PIR/NBRF format
- amn1.phy: id. in phylip interleaved format
- amn1.phys: id. in phylip sequential format
- amn2.pir: alignment of 5 amniotes and an outgroup in PIR/NBRF format
- amn2.phy: id. in phylip interleaved format
- amn2.phys: id. in phylip sequential format
The sequences in amn1, that we will use first, are:
The amphibian used as outgroup. The outgroup is necessary to define later the deepest split in the tree; it must a species that we know it is not inter-related with the ones we are interested in:
The reptiles:
- alligator (Alligator mississippiensis)
- snake (Dinodon semicarinatus)
- turtle (Pelomedusa subrufa)
The birds:
- chicken (Gallus gallus)
- raven (Corvus frugilegus)
- ostrich (Struthio camelus)
The mammals:
- cat (Felis catus)
- mouse (Mus musculus)
- opossum (Didelphis virginiana)
- Let's first view the alignment and make a simple neighbor-joining
tree with ClustalX:
- prep clustalx (to prepare the necessary environment
of the program)
- type clustalx & (the symbol & allows you to re-use the terminal window for subsequent commands) and open the alignment amn1.pir. You can see that it is
long (2018 positions) and quite well conserved. Both features
are important to ensure that there is enough phylogenetic information
to test different hypotheses.
- Check Correct for multiple substitutions (this is
always necessary except when sequences are extremely well conserved)
and Draw N-J Tree.
- View the tree with njplot. First type prep njplot
- and njplot amn1.ph &
- Choose frog as a New outgroup (by clicking
on the corresponding # symbol; this is the step where knowing the outgroup is very important). You can see that the neighbor-joining tree is consistent with the hypothesis B and the monophyly of reptiles.
- Let's now make our first maximum-likelihood tree. Searches of maximum-likelihood trees can be exhaustive or heuristic. In exhaustive searches, the maximum-likelihood program evaluates all possible trees and gives you the one with the best score. Our alignment contains 10 species and the total number of possible trees for this number of species is 2027025, meaning that an exhaustive search would take several days. We will start with a very powerful heuristic (or approximate) search used by MOLPHY called local rearrangement.
It needs a starting tree (normally a neighbor-joining tree) and
then it will rearrange the branches of this tree trying to find
a better one, i.e. with a better likelihood.
- prep molphy
- The program that performs maximum likelihood of amino acid
sequences in the MOLPHY package is called protml. For
the first calculation that we will make, protml needs
the starting tree in a simple but very special format. In order
to simplify the format conversion we will produce another neighbor-joining
tree with the MOLPHY package and then we will calculate the maximum-likelihood
tree. All programs in MOLPHY run in the command line. For viewing
the list of options you can invoke the program protml
without any option.
Calculating the neighbor-joining tree with MOLPHY
- protml -mfD amn1.phys > amn1.dis This
will produce a file of pairwise distances called amn1.dis
from the alignment amn1.phys, that now has to be in phylip
sequential format. The option mf means that we are using
a model of mitochondrial proteins called mtREV24 for calculating
the pairwise distances. The option D means that we only
want distances and not further calculations.
- njdist -tamn1 amn1.dis This will
produce the file amn1.tpl -the tpl extension is automatically
added- which is a neighbor-joining tree from the amn1.dis distance
file. It has an identical topology to the one you did with ClustalX,
but in the right format for the next steps (this format was specified
with the option t).
Calculating a maximum-likelihood tree with MOLPHY (heuristic
search)
- Now, to make a maximum-likelihood tree of amn1.phys
starting from the topology amn1.tpl, run the following
command:
- protml -mfR amn1.phys amn1.tpl > amn1.protml_out
(it may take a few minutes). The option R means that now
we want to search by local rearrangement for a maximum-likelihood
tree starting from the provided topology. As before, the option
mf means that we follow a model of mitochondrial proteins.
The output of the program with information on the search procedure
is sent to the file amn1.protml_out. The tree is
always saved in a file called protml.tre. Rename it to
something else:
- mv protml.tre amn1.protmlRtree
- Look at it with njplot. Select the right outgroup and compare it side by side with the neighbor-joining tree. You see that the new tree also supports hypothesis B but now the position of the alligator has changed and is now the earliest branching species within reptiles. This means that in this case the local rearrangement search
found a new tree, with a higher likelihood than the neighbor-joining
tree, and therefore better according to the mitochondrial model
of protein evolution that we are using.
Calculating a maximum-likelihood tree with PUZZLE (heuristic
search; no rate differences)
- Puzzle can also calculate a maximum-likelihood tree,
but the kind of search done by this program is very different
from the one made by protml and much faster. Puzzle
will calculate a number of maximum-likelihood trees (1000 by
default) from quartets of species sampled from the original alignment.
Then, all these trees are joined into a single tree and branch
lengths are then calculated following the specified model of
evolution. Puzzle is menu driven and the different options
are well explained in the online manual:
- prep puzzle
- type puzzle and enter the filename of the same alignment
used above but now in the phylip interleaved format: amn1.phy
- You see a menu with lots of options. An important one is
the model of substitution and we will use the default one, mtREV24,
which is the same we used in protml with option mf. So
just say yes [y]. Note that the calculation is much faster
than with protml.
- The output with information about the calculations are in
outfile and the final tree made by Puzzle is saved as
outtree. Rename this one with the command:
- mv outtree amn1.puzzletree_nr
- If you open this tree with njplot you will see that the topology of this tree is identical to the one made by both neighbor-joining programs we used before but different from the protml
tree. Therefore we see that different heuristic search algorithms
can produced different trees, even if we use the same model in
the calculation of the maximum likelihood function.
Calculating a maximum-likelihood tree with PUZZLE (heuristic
search; rate differences)
- Puzzle can improve the model of evolution used in the calculation by allowing an additionally number of parameters that, basically, admit that different positions may evolve at different rates. For using this option, open puzzle again with the amn1.phy alignment and change option w. The program is now set to consider 8 different classes of rates. You can change this to 4 rates, which is normally enough to improve the model a lot. Proceed (the calculation will take now a bit longer due to the complexity of the model) and rename the new tree:
- mv outtree amn1.puzzletree_rates
- You see now that the tree is the same as the one produced
with protml, with alligator as an early branch. You should
learn from these results that different models of evolution may
also produce different trees, and that better models of evolution
are very important to make better phylogenies.
Comparing alternative phylogenies with Puzzle (Kishino-Hasegawa
test)
- We already produced two different phylogenies (both in fact supporting the tree B for the relationships among birds, mammals and reptiles). We may now test if the differences among the trees A, B and C that we defined a priori are statistically significant. We may also consider for the test the topology obtained by neighbor-joining and the first puzzle search. And, finally, we may consider another tree which is very popular among many zoologists and that considers that crocodiles are closer to birds than to other reptiles (thus making reptiles paraphyletic). In summary, we will test the following 5 trees:
1: Topoolgy A
2: Topoolgy B
3: Topoolgy C
4: Topoolgy B (with turtle as the earliest branching reptile)
5: Topoolgy B (with crocodiles closer to birds)
These are only 5 trees out of the 2027025 different topologies that are possible for the 10 species in this alignment, but these are the hypotheses in which we may be particularly interested. You can find a file with the five mentioned topologies in amniote_hypoth. This file has the format necessary for the tests in both MOLPHY and Puzzle. It is possible to construct such simple topology files with the program TreeView that runs on Macintosh and Windows computers.
- Open puzzle and enter the amn1.phy alignment
- Select the option k. This will tell the program not
to search for any topology. Instead it is only going to score
the topologies we give to the program. And select y to
continue. You will be asked for the topology file: enter amniote_hypoth.
- Now the important file is outfile. Rename it to:
- mv outfile amn1.hypoth_puzzle
- View the file for example with:
- more amn1.hypoth_puzzle
- This file is full of useful information concerning the input sequences. For what it is now our interest go towards the end, to the section called COMPARISON OF USER TREES (NO CLOCK).
- You see in three consecutive columns the log-likelihood of
every topology (log L), the difference of every log-likelihood
with respect to the best value (difference) and the standard error of that difference (S.E.). When the difference is greater than 1.96 times the S.E. then this difference is considered significant at a 5% level of confidence. You see that the best tree is the second one (topology B above) and that the two trees where we only changed the position of the alligator (trees 4 and 5) are not significantly different. Therefore, as a conclusion, these alignments support the topology B and statistically reject the topologies A and C, but they do not have enough information to know the phylogenetic position of crocodiles within the amniotes tree. There may be many other hypotheses to test, but these are the ones we were interested in here. It is up to the researcher to test the most interesting ones, but it is necessary to take into account that the hypotheses have to be defined a priori (comparing only the best trees after a search is not statistically valid).
Exhaustive searches of maximum-likelihood trees with MOLPHY
(This exercise is for the experts)
- When the number of sequences in an alignment is small it is possible to test all topologies so that we can be completely sure that we got the maximum-likelihood tree. We will use a subset of species from the alignment we used before, including one representative of every major amniote group we are considering: frog, alligator, snake, turtle, chicken and mouse. It is normally not very good idea to remove species since we loose information about the evolution of our sequences, but we do it here for time reasons. The new alignment is amn2.phys.
The number of all possible topologies for six species is 105.
The file listing these topologies is topo105. It is possible to construct these complete topology files with programs like PAUP. The use of given topologies is done with protml in the following way:
- protml -mfub amn2.phys topo105 > amn2.protml_105topo
The option u tells the program to use the given topology
file and the option b is used to avoid making a long calculation
that we do not need now.
- The results are now in amn2.protml_105topo. Go to the end of this file (it is very long!) to see the maximum-likelihood values, the difference with respect to the best tree and the S.E. of the difference. The best topology is number 5, which is like the tree B above but puts crocodiles with birds. However, there are 13 topologies that cannot be statistically rejected (in this case the calculation of difference - 1,96*S.E. has to be done by oneself with a spreadsheet program like Excel).
- There is another possibility with protml to do an
exhautive search in which an approximate likelihood is calculated
for all possible topologies:
- protml -mfeb amn2.phys > amn2.protml_105topo_aprox
- With this new option, e, it is not necessary to give
the topology file (the program will calculate it in this case).
You can also see that the calculation is much faster invoking
the approximate likelihood, so this procedure can be applied
to long alignments. However, examine the output file amn2.protml_105topo_aprox, where you can see that the tree with the best approximate likelihood does not coincide with the one calculated via exact likelihood (in fact, the second one in the new file is the best tree). What it is normally done now is to get the best trees calculated with the approximate likelihood, make a topology file and calculate the exact likelihood as above. This is another alternative to the local rearrangement or the puzzle searches that we have done above. As you see, there are many ways to design and perform these maximum-likelihood calculations, and it is probably best to do them in several ways to get confidence in the results.
Exercise 2: Phylogenetic trees of primates
with cytochrome b sequences: testing the molecular clock
(likelihood ratio test)
- Hypotheses testing in maximum likelihood can be about different topologies as we have done before, but it can also be about different models of evolution. For example, you can test whether a tree where different rates of evolution are allowed in different lineages and a tree that follows a perfect molecular clock are statistically the same or not. If both trees are the same then you can use the clock-like tree to infer divergence times. Normally, only sequences that have not diverged very much have some chances of following a good molecular clock. Therefore we provide a DNA alignment of cytochrome b sequences from several old-world monkeys.
Copy this set by doing:
- cd .. (to go up one directory)
- mkdir old_world
- cd old_world
- cp /home/seqanal/public_html/courses/ml_course/data/old_world/*
.
- The only alignment you need now is old_world.phy for
its use with Puzzle
- So open it in puzzle and change the option z to compute branch lengths forcing the molecular clock. Since we included a new-world monkey (squirrel monkey) we know that this is the outgroup of the tree. Select option l and select
species 1 for the location of the root. We can use the
DNA model set by default (HKY) and let the program estimate the
parameters that this model needs: transition/transversion ratio
and nucleotide frequencies. Go ahead with the search and rename
the outfile to something meaningful.
- Go to the end of the file, to the section MOLECULAR CLOCK LIKELIHOOD RATIO TEST. You can see that it is statistically confirmed the presence of a molecular clock in this particular alignment. Both trees (without clock and with clock) are in the file outtree. Split the file into two files -one per tree- by cutting them from the terminal and pasting them into a text editor like nedit, and visualize both trees with njplot. You will see that there are no sequences that are specially accelerated or slowed down in the no-clock tree, which explains that the difference with the clock-like tree was not significant. Therefore in the tree with clock it is possible to infer all the branching times if we know through the fossil record at least the time of one split event.
References
- Graur, G. and Li, W. H. (1999). Fundamentals of Molecular
Evolution, Second Edition. (Sinauer Associates, Sunderland, MA).
Book contents.
- Page, R.D.M and Holmes, E.C. (1998). Molecular evolution:
a phylogenetic approach (Blackwell Science). Book
contents.
- Li, W. H. (1997). Molecular evolution (Sinauer Associates,
Sunderland, MA). Book
contents.
- Nei, M. (1996). Phylogenetic analysis in molecular evolutionary
genetics. Annu. Rev. Genet. 30, 371-403. Abstract.
- Swofford, D. L., Olsen, G. J., Waddell, P. J., and Hillis,
D. M. (1996). Phylogenetic inference. In Molecular systematics,
D. M. Hillis, C. Moritz and B. K. Mable, eds. (Sinauer Associates,
Sunderland, MA), pp. 407-514. Book
contents.
Programs and internet resources
Here you can find a very thorough list containing most of the
available programs for phylogenetic reconstruction including the
ones mentioned in this course:
And here you can find phylogenetic information about many organisms: