Practical Course on Sequence Analysis
Generating trees from aligned sequences
by Jose Castresana and Toby Gibson, November 15th, 2000
In this practical we will mainly use the multiple alignment and tree reconstruction program Clustal X. We will run it on UNIX, but it is also available for Macintosh and PCs. Clustal X can make a multiple alignment from a set of unaligned sequences and colors the resulting alignment according to the conserved features in each column, which helps to highlight structurally or functionally important positions of the alignment. The program can also make a phylogenetic tree from the multiple alignment by the neighbor-joining method, which is very fast. We will use the following programs to make a tree:
- SRS5 (through the WWW interface) to extract from databases a group of sequences to align.
- clustalx to make the alignments and phylogenetic trees.
- njplot to display the trees.
You can also read during the practical about the following topics in the provided supplementary information:
Exercise 1: Phylogenetic tree of Eucarya, Bacteria and Archaea with EF-Tu sequences
The elongation factor (called EF-Tu in Bacteria, and EF-1alpha in eukaryotes and archaebacteria) is a very well-conserved protein that has been used to study deep phylogenetic relationships. We will make alignments and phylogenetic trees from several well-known species that are widely used in research and belong to very divergent groups, and we will examine some phylogenetic concepts as they become necessary to understand the successive steps of the practical. Here we will mainly use the term EF-Tu to refer to this protein family.
Extracting EF-Tu sequences
- Go to the SRS5 query page with Netscape.
- Select the SwissProt database.
- Search for:
- eftu | ef1a | ef11 in the ID (Identification) field (the symbol | stands for OR). These are the names that these sequences receive in SwissProt. They are named differently in different groups of organisms, but in principle they all refer to the same protein.
- 350: in the SeqLength field. We want sequences longer than 350 amino acids. (Don't forget the colon, which indicates a range; the upper limit is omitted here, so the range extends to infinity.)
- You should find 150 entries (or more if there are recent additions). Be sure that all are displayed on one page. Select the following 25 sequences, from species belonging to the three domains of life (Archaea, Bacteria and Eucarya):
- EF11_DROME Drosophila melanogaster (Eucarya, arthropod)
- EF11_HUMAN Homo sapiens (Eucarya, vertebrate)
- EF11_MOUSE Mus musculus (Eucarya, vertebrate)
- EF11_SCHPO Schizosaccharomyces pombe (Eucarya, fungi)
- EF11_XENLA Xenopus laevis (Eucarya, vertebrate)
- EF1A_ARATH Arabidopsis thaliana (Eucarya, plant)
- EF1A_CAEEL Caenorhabditis elegans (Eucarya, nematode)
- EF1A_CHICK Gallus gallus (Eucarya, vertebrate)
- EF1A_DICDI Dictyostelium discoideum (Eucarya, protist)
- EF1A_ENTHI Entamoeba histolytica (Eucarya, protist)
- EF1A_GIALA Giardia lamblia (Eucarya, protist)
- EF1A_HALHA Halobacterium halobium (Archaea, euryarchaeota)
- EF1A_MAIZE Zea mays (Eucarya, plant)
- EF1A_METJA Methanococcus jannaschii (Archaea, euryarchaeota)
- EF1A_PODAN Podospora anserina (Eucarya, fungi)
- EF1A_PYRWO Pyrococcus woesei (Archaea, euryarchaeota)
- EF1A_SULAC Sulfolobus acidocaldarius (Archaea, crenarchaeota)
- EF1A_THEAC Thermoplasma acidophilum (Archaea, euryarchaeota)
- EF1A_YEAST Saccharomyces cerevisiae (Eucarya, fungi)
- EFTU_BACSU Bacillus subtilis (Bacteria, low G+C Gram-positive bacteria)
- EFTU_ECOLI Escherichia coli (Bacteria, purple bacteria)
- EFTU_HAEIN Haemophilus influenzae (Bacteria, purple bacteria)
- EFTU_MYCPN Mycoplasma pneumoniae (Bacteria, low G+C Gram-positive bacteria)
- EFTU_SYNP7 Synechococcus sp. (Bacteria, cyanobacteria)
- EFTU_THEAQ Thermus aquaticus (Bacteria)
- Click the save button, and don't forget to choose to perform the operation on the selected entries before saving.
- Now in Use view select *Complete Entries* and click the SAVE button again.
- You have now got the set of unaligned sequences in EMBL format. Click Save As (from the Netscape menu), select Text in the Format for Saved Document, and give a name to the file, for example eftu1.embl, that will be saved in your directory.
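Before aligning, it can be worth checking that the saved file really contains the 25 selected entries. The following is a minimal Python sketch and not part of the original practical. It assumes that every entry in the flat file starts with an ID line whose second field is the entry name; the function name entry_names and the two sample entries are made up for illustration.

```python
def entry_names(flatfile_text):
    """Return the entry names found on the ID lines of an EMBL-style flat file."""
    names = []
    for line in flatfile_text.splitlines():
        # Assumption: each entry begins with a line such as "ID   NAME ..."
        if line.startswith("ID "):
            fields = line.split()
            if len(fields) > 1:
                names.append(fields[1].rstrip(";"))
    return names


# Made-up, heavily abbreviated sample; a real file holds complete entries.
SAMPLE = """\
ID   SEQA_DEMO      STANDARD;      PRT;   394 AA.
DE   Made-up entry for illustration.
//
ID   SEQB_DEMO      STANDARD;      PRT;   462 AA.
DE   Another made-up entry.
//
"""

names = entry_names(SAMPLE)
print(len(names), "entries:", ", ".join(names))
```

To check your own file, read its contents (for example eftu1.embl) into a string and pass that string to entry_names instead of SAMPLE.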
Aligning the sequences with Clustal X
- Open an X-window terminal and type prepare clustalx.
- Then type clustalx & to open the program. (Always use the & symbol after a command that opens a program window, so that the X-window terminal is not blocked and can accept further commands.)
- With Load Sequences read in the EF-Tu file (eftu1.embl).
- Now Do Complete Alignment, which could take a few minutes.
Evaluating the alignment
- Use the slider to view the entire alignment and check whether some positions are not properly aligned. To help you view the poorly aligned segments, invoke Low-Scoring Segment Parameters from the Quality menu.
- Set to the stringent Gonnet PAM 120 matrix and the Minimum Length of Segments to 6. Now Calculate Low-Scoring Segments.
- Check where these divergent or misaligned segments are. If one sequence contains many of those segments, its position in the phylogenetic tree will not be reliable.
Making and displaying trees
- Methods to make phylogenetic trees are classified into two main categories: DISTANCE METHODS and DISCRETE METHODS. Clustal X implements the neighbor-joining method, which is a distance method based on a matrix of pairwise distances derived from the alignment. Let's first construct this matrix for our alignment so that we can see its format and what the numbers mean. Select Output Format Options from the Trees menu and select only Phylip distance matrix (Phylip is a special format).
- Select now Correct for multiple substitutions under the Trees menu. This option is very important to calculate distances and trees, and should now become marked in the menu. You can read in the supplementary information why the correction for MULTIPLE SUBSTITUTIONS is important.
- Select Draw N-J Tree. (In fact, only the distance matrix will be made now.)
- If you did not change the default name you will get a file called eftu1.dst in your directory.
- Open eftu1.dst with a text editor (for example nedit). You will see that this is a 25x25 symmetric matrix in which every row is split over several lines to fit the screen. The values are the number of amino acid substitutions per site for every pair of sequences. You can now read, for example, the distance between human and Drosophila. If we assume an approximate split time of 600 million years between vertebrates and arthropods, you can calculate the number of amino acid substitutions per site per year, or evolutionary rate, by a simple division: divide the distance by twice the split time, because the two lineages have each been accumulating substitutions since the split. You can compare the evolutionary rate of EF-Tu with that of other proteins in the following table, which covers a good range of evolutionary rates:
(Table taken from Kimura, 1983)
- Is EF-Tu a slow or a fast-evolving protein?
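The rate calculation from the distance matrix can be sketched in Python as follows. This is only an illustration under stated assumptions: names and values in the matrix file are taken to be separated by whitespace, the rate uses the per-lineage convention (distance divided by twice the split time, since both lineages have accumulated substitutions since they diverged), and the three-sequence matrix and its distances are made up. The function names are hypothetical.

```python
def read_phylip_distances(text):
    """Parse a square Phylip distance matrix into {(name_a, name_b): distance}.

    Reading token by token means that rows wrapped over several lines,
    as in the .dst file, are handled without special treatment.
    """
    tokens = text.split()
    n = int(tokens[0])          # first token: number of sequences
    names, rows = [], []
    pos = 1
    for _ in range(n):
        names.append(tokens[pos])
        rows.append([float(t) for t in tokens[pos + 1:pos + 1 + n]])
        pos += 1 + n
    return {(a, b): rows[i][j]
            for i, a in enumerate(names)
            for j, b in enumerate(names)}


def rate_per_site_per_year(distance, split_time_years):
    """Substitutions per site per year along one lineage."""
    return distance / (2.0 * split_time_years)


# Made-up matrix; the second row is wrapped to mimic the real file layout.
SAMPLE = """
    3
SEQ_A      0.000  0.120  0.300
SEQ_B      0.120  0.000
           0.310
SEQ_C      0.300  0.310  0.000
"""

distances = read_phylip_distances(SAMPLE)
d = distances[("SEQ_A", "SEQ_B")]
print(d, rate_per_site_per_year(d, 600e6))
```

With your own eftu1.dst, look up the pair of names for human and Drosophila and keep the split time at 600 million years, as assumed in the step above.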
- Let's now make a tree. Positions containing any gap are difficult to deal with, so select Exclude positions with gaps. (Now both Correct for multiple substitutions and Exclude positions with gaps should be marked under the Trees menu.)
- Select again Output Format Options and choose Phylip format tree (unselect now the distance matrix output).
- Draw N-J Tree. If you don't change the name, you will obtain a file called eftu1.ph.
- Open this file with a text editor to see its format. Parentheses group sequence names according to the phylogeny estimated by the neighbor-joining algorithm. The numbers again represent the number of amino acid substitutions per site, but this time between nodes of the tree: they are the branch lengths.
- Let's now see the tree with the njplot program. First type prepare njplot and then njplot. Open the eftu1.ph file.
- We can now learn some concepts related to the TREE TOPOLOGY. Neighbor-joining trees are always unrooted. The display program njplot will choose an arbitrary outgroup that may or may not be right but, in all trees, we need to choose a meaningful outgroup according to our knowledge of the sequences and species in the tree. You can read in the references given below about several lines of evidence suggesting that eukaryotes and archaebacteria are more closely related to each other than either is to Bacteria, so that one can put the root between Bacteria and a Eucarya/Archaea group. Thus, if njplot didn't choose it correctly, we must select Bacteria as the outgroup. For this purpose, click on New outgroup. You will now see all nodes of the tree, internal and external, marked with a # symbol. Click on the node representing the most recent common ancestor of Bacteria or, alternatively, on the most recent common ancestor of the Eucarya/Archaea group, which will produce the same result.
- Does the tree follow a good MOLECULAR CLOCK after having chosen this outgroup?
- There are also a number of important phylogenetic questions that can be addressed with this tree, such as the MONOPHYLY of the three main domains of life or the relationships among eukaryotes:
- Do all eukaryotes form a monophyletic group?
- Do all Bacteria form a monophyletic group?
- Can you say the same about archaeal species?
- What is the relationship among the three main eukaryotic groups (plants, fungi and animals)?
- Do you find any sequence in a very strange position according to your knowledge of zoology? (We will come back to this question later.)
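The monophyly questions above can also be checked programmatically on the tree file. The sketch below is a minimal Python illustration, not a replacement for inspecting the tree in njplot. It assumes a plain parenthesized tree like the one in eftu1.ph, ignores branch lengths and internal labels, and treats a set of sequences as a group if the set or its complement forms a subtree. That test is appropriate for an unrooted tree only when the intended root lies outside the group being tested. The function names and the five-leaf tree are made up.

```python
def parse_newick(text):
    """Parse a parenthesized tree into nested lists; leaves are name strings."""
    text = "".join(text.split()).rstrip(";")   # drop whitespace and final ";"
    pos = 0

    def label():
        # Read a leaf name or an internal label; discard anything after ":".
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos].split(":")[0]

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                            # skip "("
            children = [node()]
            while text[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1                            # skip ")"
            label()                             # ignore label and branch length
            return children
        return label()

    return node()


def leaf_sets(tree, out):
    """Append the leaf set of every node to out; return this node's leaf set."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
    else:
        leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def forms_group(tree, names):
    """True if names, or all other leaves, make up one subtree of the tree."""
    subtrees = []
    all_leaves = leaf_sets(tree, subtrees)
    target = frozenset(names)
    return target in subtrees or (all_leaves - target) in subtrees


# Made-up tree for illustration only.
tree = parse_newick("((A:0.1,B:0.2):0.05,(C:0.1,D:0.1):0.02,E:0.3);")
print(forms_group(tree, {"A", "B"}), forms_group(tree, {"A", "C"}))
```

For the practical, the set passed to forms_group would hold the SwissProt names of, for example, all eukaryotic sequences in eftu1.ph.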
Evaluating the tree
- There are several ways to calculate the reliability of a group within a tree. One of the most popular methods is the bootstrap. Go back to Clustal X with the EF-Tu alignment. With both Correct for multiple substitutions and Exclude positions with gaps checked, select Bootstrap N-J Tree.
- Enter 100 as the Number of bootstrap trials (this is a sufficient number of trials).
- Open the new tree file eftu1.phb with njplot. Select the right outgroup with New outgroup if necessary, go back to the Full tree option and check Bootstrap values. You will now see what percentage of bootstrap replications supports every group in the tree. When we make a phylogenetic tree from, say, 100 bootstrap replications, we do the following. The original alignment is modified slightly by resampling its positions at random with replacement, so that a few positions are left out and others are repeated, and a phylogenetic tree is made from the modified alignment. This is repeated 100 times, and each time the modification is different (depending on a random number), so we end up with 100 "slightly different" trees. Now it is possible to make a consensus tree for these 100 trees and to calculate how many times every grouping in the consensus appeared in the 100 replications. These are the bootstrap values for every grouping.
- Now you can answer phylogenetic questions in a more quantitative way. For example:
- What is the support for the grouping of animals and fungi?
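The resampling step of the bootstrap described above can be sketched in a few lines of Python. This illustrates the general idea only and is not Clustal X's own implementation; the function name and the three short sequences are made up. Each replicate has the same number of columns as the original alignment, but the columns are drawn at random with replacement, so some are missing and others occur more than once.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate of an alignment (a list of equal-length rows)."""
    length = len(alignment[0])
    # Draw column indices with replacement, as many as there are columns.
    picks = [rng.randrange(length) for _ in range(length)]
    # Apply the same column choice to every sequence to keep columns intact.
    return ["".join(row[i] for i in picks) for row in alignment]


# Made-up toy alignment for illustration only.
alignment = ["MKVLA", "MKILA", "MRVLS"]
rng = random.Random(0)          # fixed seed so the run is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(replicates[0])
```

A tree would then be built from each of the 100 replicates, and the bootstrap value of a grouping is the percentage of those trees in which it appears.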
Detecting gene duplications in a tree
- Phylogenetic trees are also important to understand the evolution of genes. We can study, for example, whether GENE DUPLICATIONS have happened during the evolution of a gene. To detect them, we need to know the phylogenetic relationships of the species considered well. Xenopus laevis is an amphibian, and therefore related to other vertebrates. Its position within our tree does not correspond with this expectation, and this tells us that we may have included a paralogous sequence of Xenopus laevis in this tree.
- If you want to determine which is the true orthologue of this family of proteins for Xenopus laevis, make a new search in SRS5 with elongation & factor & alpha in the Description field, and Xenopus in the Organism field. Select all new sequences except EF11_XENLA, which we already have, and save the file with the name xenopus.embl.
- Now make a file containing the previous data set and the xenopus data set, with the unix command cat. Type cat eftu1.embl xenopus.embl > eftu2.embl. The new file eftu2.embl now contains both data sets.
- Make a new alignment with the eftu2.embl set of sequences in Clustal X, make a neighbor-joining tree, and display it with njplot.
- Which sequence of Xenopus is now in the expected phylogenetic position and is therefore the probable orthologue of the rest of vertebrate sequences?
Exercise 2: Phylogenetic tree of mitochondrial D-loop sequences from human populations
Phylogenetic trees can be used to study some aspects of human evolution. For this purpose we need genes that are highly variable and thus show many differences among individuals. The D loop (or control region) of the mitochondrial DNA is one of the most variable parts of our genome and has therefore been used in many studies of human evolution. One of the most popular topics of these studies is the African ORIGIN OF MODERN HUMANS, or Out-of-Africa hypothesis, which postulates that a founder group emigrated from Africa around 100,000 years ago and colonized the rest of the world. This hypothesis predicts that, in a phylogenetic tree of a sample of human populations, the non-African sequences split early from within the African sequences, and that sequence diversity is higher in African than in non-African populations. Let's try to estimate one of these trees.
Extracting D-loop sequences
- With SRS5, search the EMBL database for Vigilant & Wilson in the Authors field and 600:800 in the SeqLength field. This will find sequences published by these two authors (among others) in a specific study on human populations. We want sequences between 600 and 800 bp, which contain a concatenation of two pieces of the human D-loop sequence, the so-called hypervariable regions I and II. Select Description in the Include fields in output option so that you can see the geographical origin of the sequences in the entries found. You should find 114 entries:
HSMTDL001: Western Pygmy
HSMTDL002: Western Pygmy
HSMTDL004: Eastern Pygmie
HSMTDL005: Eastern Pygmie
HSMTDL006: Eastern Pygmie
HSMTDL030: Eastern Pygmie
HSMTDL031: Eastern Pygmie
HSMTDL032: Eastern Pygmie
HSMTDL037: Western Pygmy
HSMTDL038: Western Pygmy
HSMTDL039: Western Pygmy
HSMTDL040: Western Pygmy
HSMTDL041: Western Pygmy
HSMTDL042: Western Pygmy
HSMTDL043: Western Pygmy
HSMTDL044: Western Pygmy
HSMTDL045: Western Pygmy
HSMTDL046: Western Pygmy
HSMTDL047: Western Pygmy
HSMTDL048: Western Pygmy
HSMTDL050: Papua N.Guinean
HSMTDL059: African American
HSMTDL063: African American
HSMTDL066: Eastern Pygmie
HSMTDL067: Eastern Pygmie
HSMTDL068: Eastern Pygmie
HSMTDL069: Eastern Pygmie
HSMTDL070: Eastern Pygmie
HSMTDL071: Eastern Pygmie
HSMTDL072: Eastern Pygmie
HSMTDL073: Eastern Pygmie
HSMTDL079: Papua N.Guinean
HSMTDL080: Papua N.Guinean
HSMTDL081: Papua N.Guinean
HSMTDL082: Papua N.Guinean
HSMTDL097: Papua N.Guinean
HSMTDL108: Papua N.Guinean
HSMTDL109: Papua N.Guinean
HSMTDL110: Papua N.Guinean
HSMTDL125: Papua N.Guinean
HSMTDL129: Papua N.Guinean
HSMTDL130: Papua N.Guinean
HSMTDL131: Papua N.Guinean
HSMTDL132: Papua N.Guinean
HSMTDL133: Papua N.Guinean
HSMTDL134: Papua N.Guinean
HSMTDL135: Papua N.Guinean
- Select 15 sequences of African origin and 15 sequences of non-African origin (!Kung, Herero, Naron, Hadza, Yorubans and Pygmies are all Africans). Try to get a good representation of all populations. Save the file with the selected sequences in EMBL format.
- We need as always an outgroup to root the tree. We will use the chimpanzee D-loop sequence PTMITCT for this. Search for PTMITCT in the ID field with SRS5. Save this sequence in EMBL format and concatenate the file with the human sequences file using the cat UNIX command.
Making the alignment and phylogenetic tree
- Make the alignment with Clustal X. By comparison with the complete chimpanzee D-loop sequence, you will clearly see the two regions of the D loop that were sequenced in this human population study. Check Correct for multiple substitutions and Exclude positions with gaps, and make the neighbor-joining tree. Examine the geographical origin of the sequences in your tree (for a quick examination, note that, with a few exceptions, the African sequences are numbered from 001 to 073 and the rest are non-African sequences).
- Can you see in your tree the early split of non-African sequences within African sequences and the higher sequence diversity in African populations?
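The diversity comparison can also be made numerically. The following Python sketch computes the mean pairwise proportion of differing sites within a group of aligned sequences, skipping positions where either sequence has a gap. It is a simplified illustration: it uses uncorrected distances rather than the corrected distances of Clustal X, and the function names and short toy sequences are made up and do not come from the D-loop data.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, ignoring positions with a gap in either sequence."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped positions in common")
    return sum(a != b for a, b in pairs) / len(pairs)


def mean_pairwise_distance(sequences):
    """Average p-distance over all pairs of sequences in one group."""
    distances = [p_distance(a, b) for a, b in combinations(sequences, 2)]
    return sum(distances) / len(distances)


# Made-up aligned toy sequences standing in for two groups of populations.
group_1 = ["ACGTACGT", "ACGTTCGA", "ACCTACGA"]
group_2 = ["ACGTACGT", "ACGTACGA"]

print(mean_pairwise_distance(group_1), mean_pairwise_distance(group_2))
```

In the practical, group_1 and group_2 would hold the aligned African and non-African sequences, and a higher value for the African group would agree with the prediction of the Out-of-Africa hypothesis.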
References on phylogenetic tree reconstruction
- Kimura, M. (1983). The neutral theory of molecular evolution (Cambridge University Press, Cambridge)
- Nei, M. (1996). Phylogenetic analysis in molecular evolutionary genetics. Annu. Rev. Genet. 30, 371-403.
- Swofford, D. L., Olsen, G. J., Waddell, P. J., and Hillis, D. M. (1996). Phylogenetic inference. In Molecular Systematics, Second Edition, D. M. Hillis, C. Moritz and B. K. Mable, eds. (Sinauer Associates, Sunderland, MA), pp. 407-514.
- Li, W. H. (1997). Molecular Evolution (Sinauer Associates, Sunderland, MA).
- Page, R. D. M., and Holmes, E. C. (1998). Molecular evolution: a phylogenetic approach (Blackwell Science).
- Graur, D., and Li, W. H. (1999). Fundamentals of Molecular Evolution, Second Edition (Sinauer Associates, Sunderland, MA).
References for the EF-Tu exercise
These are some references where EF-Tu sequences have been used to study several phylogenetic issues:
- Iwabe, N., Kuma, K., Hasegawa, M., Osawa, S., and Miyata, T. (1989). Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes. Proc. Natl. Acad. Sci. USA 86, 9355-9359.
- Rivera, M. C., and Lake, J. A. (1992). Evidence that eukaryotes and eocyte prokaryotes are immediate relatives. Science 257, 74-76.
- Hashimoto, T., Nakamura, Y., Nakamura, F., Shirakura, T., Adachi, J., Goto, N., Okamoto, K., and Hasegawa, M. (1994). Protein phylogeny gives a robust estimation for early divergences of eukaryotes: phylogenetic place of a mitochondria-lacking protozoan, Giardia lamblia. Mol. Biol. Evol. 11, 65-71.
- Baldauf, S. L., Palmer, J. D., and Doolittle, W. F. (1996). The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny. Proc. Natl. Acad. Sci. USA 93, 7749-7754.
And these are some additional references dealing with the root of the universal tree:
- Forterre, P., Benachenhou-Lahfa, N., Confalonieri, F., Duguet, M., Elie, C., and Labedan, B. (1993). The nature of the last universal ancestor and the root of the tree of life, still open questions. BioSystems 28, 15-32.
- Gupta, R. S., and Golding, G. B. (1996). The origin of the eukaryotic cell. Trends Biochem. Sci. 21, 166-171.
- Brown, J. R., and Doolittle, W. F. (1997). Archaea and the prokaryote-to-eukaryote transition. Microbiol. Mol. Biol. Rev. 61, 456-502.
- Forterre, P., and Philippe, H. (1999). Where is the root of the universal tree of life? BioEssays 21, 871-879.
References on human genetic evolution
- Vigilant, L., Stoneking, M., Harpending, H., Hawkes, K., and Wilson, A. C. (1991). African populations and the evolution of human mitochondrial DNA. Science 253, 1503-1507.
- Cavalli-Sforza, L. L., Menozzi, P., and Piazza, A. (1994). The history and geography of human genes (Princeton University Press, Princeton).
- Templeton, A. R. (1997). Out of Africa? What do genes tell us? Curr. Opin. Genet. Dev. 7, 841-847.
- Krings, M., Stone, A., Schmitz, R. W., Krainitzki, H., Stoneking, M., and Pääbo, S. (1997). Neandertal DNA sequences and the origin of modern humans. Cell 90, 19-30.
- Foley, R. (1998). The context of human genetic evolution. Genome Res. 8, 339-347.
- Gagneux, P., Wills, C., Gerloff, U., Tautz, D., Morin, P. A., Boesch, C., Fruth, B., Hohmann, G., Ryder, O. A., and Woodruff, D. S. (1999). Mitochondrial sequences show diverse evolutionary histories of African hominoids. Proc. Natl. Acad. Sci. USA 96, 5077-5082.
Programs and internet resources
These are two good, multi-platform programs that can be used to construct phylogenetic trees by different methods:
Here you can find a very thorough list containing most of the available programs for phylogenetic reconstruction:
And here you can find phylogenetic information about many organisms:
Address of this document: http://www.embl-heidelberg.de/~seqanal/courses/trees_practical/trees_practical.html
Last updated: 16/10/2000