
Programme usage notes

 

In general, I advise that when you start to work on a new exercise, you create a new directory in your home directory in which to process, store and create the required files.

 

Visualising tree files

 

NJPLOT

Visualising rooted phylogenetic trees

Execution: > njplot tree_file_name &

 

Given a file that contains a tree in NEWICK format, NJPLOT displays a rooted representation of that tree. The tree in the file can be either rooted or unrooted; if it is unrooted, NJPLOT will guess the position of the root and display the tree accordingly. Buttons at the top of the screen can be used to alter the representation of the tree, including choosing a new root. It is not possible to change the tree in a way that alters the unrooted topology of the tree provided to the programme. Saving the tree using the “File” menu writes the tree in NEWICK format to a new file; the tree in the new file will be rooted in the same position as displayed by the programme.

 

TREEVIEW

Visualising rooted and unrooted phylogenetic trees

Execution: > tv &

Accepts trees in both NEWICK and NEXUS format.

 

Phylogeny estimation programmes and packages

 

PHYLIP (the package as a whole)

A package of programmes for carrying out a wide range of phylogenetic analyses.

Programmes in the PHYLIP package expect to be executed in a directory that contains a file called ‘infile’. This file should contain the alignment information stored in the appropriate format. Where a programme (e.g. consense) requires an input file that contains one or more trees, this file is expected to be called ‘intree’.

 

The results of running a PHYLIP programme will be written to files in the same directory from which the programme was executed. If a tree (or trees) are part of the result of this analysis, they will be written to a file called ‘outtree’. All of the programmes will write a file called ‘outfile’ that will contain various pieces of information about the analysis.

 

Normally, after running a PHYLIP programme, it makes sense to give the output file(s) new names. This (i) prevents the contents of the files being overwritten if another PHYLIP programme is run in that directory and (ii) provides them with more meaningful names. For example, after running the programme ‘proml’, you could rename the resulting ‘outfile’ and ‘outtree’ files to ‘proml_outfile’ and ‘proml_outtree’ respectively.
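
The renaming step can also be scripted. The following is a minimal sketch in Python, not part of PHYLIP itself; the function name rename_phylip_outputs and the prefix argument are illustrative assumptions.

```python
from pathlib import Path


def rename_phylip_outputs(directory, prefix):
    """Rename 'outfile' and 'outtree' in `directory` to '<prefix>_outfile'
    and '<prefix>_outtree'. Files that are absent are skipped.

    Returns the list of new paths. (Hypothetical helper, not a PHYLIP tool.)
    """
    renamed = []
    for name in ("outfile", "outtree"):
        source = Path(directory) / name
        if source.exists():
            target = source.with_name(f"{prefix}_{name}")
            source.rename(target)
            renamed.append(target)
    return renamed
```

Calling rename_phylip_outputs(".", "proml") after a PROML run would give the file names used in the example above.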

 

PROML

Protein sequence maximum likelihood tree estimation

Execution: > proml

 

During this course, we will only ever be using PROML to estimate the maximum likelihood tree. Thus, PROML should be executed from a directory containing an alignment in a file named ‘infile’.

 

Use ‘R’ to switch between different models of between-site rate heterogeneity

Use ‘S’ to run a more accurate analysis

Use ‘G’ to use the global rearrangement procedure while searching for the ML tree (this is a more rigorous search than that implemented without global rearrangements)

 

During the course, we will run PROML **without** switching on the slower, more accurate analysis, or the global rearrangements, to save time. However, if using PROML yourself, you should run it with these options switched on, assuming that this does not make the analysis too slow.

 

Use ‘Y’ to run the analysis

 

If your model assumes heterogeneity of rates between sites, on pressing ‘Y’ you will be asked to provide the coefficient of variation of the substitution rate among sites. You should calculate this beforehand by running PUZZLE with the same model that you are using in PROML: the ‘outfile’ produced by PUZZLE contains an estimate of the shape parameter (alpha) of the gamma distribution used to model the between-site rate variation. As instructed by PROML, you provide PROML with the inverse of the square root of this estimate of alpha, i.e. 1/sqrt(alpha).
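
As a worked illustration of this conversion, here is a small Python sketch. The function name is an assumption made for the example; the only input is the alpha value read from the PUZZLE ‘outfile’.

```python
import math


def coefficient_of_variation(alpha):
    """Convert a gamma shape parameter (alpha) into the coefficient of
    variation requested by PROML: 1 / sqrt(alpha)."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 1.0 / math.sqrt(alpha)


# Example with a made-up alpha of 0.25 (not a value from any real analysis).
print(coefficient_of_variation(0.25))
```

Smaller values of alpha (stronger rate heterogeneity) give a larger coefficient of variation.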

 

Next you are asked for the number of categories you want to use to model the rate variation. Typically one uses 5 here: one for the invariant sites, if one is using that model, and the remaining four for the discrete approximation to the gamma distribution. To do the same analysis without invariant sites, enter ‘4’ at this prompt.

 

Finally, you should enter the fraction of invariant sites, which again should be estimated using PUZZLE.

 

CONSENSE

Builds consensus trees from sets of trees, all of which describe relationships between the same set of taxonomic units/sequences

Execution: > consense

 

Should be executed from a directory containing a file called ‘intree’ that holds multiple trees, all of which have the same set of taxonomic units.

 

Use “C” to switch between different rules for building the consensus tree.

Type “Y” to run the analysis

 

NEIGHBOR

Applies the Neighbor-Joining algorithm to a distance matrix to estimate a phylogeny

Execution: > neighbor

 

Should be executed from a directory containing a file called ‘infile’ that holds a distance matrix, such as that written by PUZZLE.

 

Type “Y” to run the analysis.

 

SEQBOOT

Constructs nonparametric bootstrapped alignment datasets from an input alignment dataset

Execution: > seqboot

 

Should be executed from a directory with a file containing an alignment with the name “infile”. Use “R” to change the number of bootstrapped datasets to produce, and run the analysis with “Y”.

 

WEIGHBOR

Applies the Weighbor algorithm to a distance matrix to estimate a phylogeny

Execution: > weighbor -L <x> -b 14 -i infile -o outfile

Replace <x> with the alignment length minus the estimated number of invariant residues.

 

The command above will produce a file called “outfile” containing trees estimated from each of the datasets present in the file “infile”.

 

The input file should contain one or more distance matrices, such as those created by PUZZLE.

 

The number placed after the “-L” should be obtained by running PUZZLE on the initial alignment to obtain the estimated proportion of invariant residues, and then using this proportion to calculate the number of variable residues estimated to be present in the original alignment.

 

Thus, if PUZZLE estimates that the fraction of invariant sites is 0.41, and the initial alignment is 250 residues long, the number that goes after “-L” is

 

250 - (0.41 * 250) [or, of course, 250 * (1 - 0.41)].
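
The same arithmetic can be sketched in Python. The function names are assumptions made for this example, and the command string simply reuses the options given in these notes; whether to round the resulting number is left to the reader.

```python
def variable_sites(alignment_length, invariant_fraction):
    """Estimated number of variable residues:
    alignment length * (1 - fraction of invariant sites)."""
    if not 0.0 <= invariant_fraction <= 1.0:
        raise ValueError("invariant_fraction must lie between 0 and 1")
    return alignment_length * (1.0 - invariant_fraction)


def weighbor_command(alignment_length, invariant_fraction):
    """Build the command line shown in these notes with -L filled in.
    The other options (-b 14, -i infile, -o outfile) are copied from the
    notes unchanged."""
    length = variable_sites(alignment_length, invariant_fraction)
    return f"weighbor -L {length:g} -b 14 -i infile -o outfile"


# The example from the text: 250 residues, 0.41 estimated invariant.
print(weighbor_command(250, 0.41))
```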

 

PUZZLE

·      Creates distance matrices - can implement many different substitution models

·      Conducts likelihood-ratio tests of the molecular clock

·      Implements several different likelihood-based tests of topologies including the SH test

Execution: > puzzle

 

PUZZLE is a relatively versatile tool, and we will use it in several different ways during the course.

 

PUZZLE is, in usage, similar to the PHYLIP programmes. Thus, it should be executed in a directory containing (as appropriate) files called ‘infile’ and ‘intree’, and writes several files to the same directory called ‘outfile’, ‘outtree’ and ‘outdist’, although not all of these will be produced depending on the analysis carried out.

 

Using PUZZLE to produce distance matrices

This analysis requires an alignment in a file called ‘infile’ but does not require a treefile called ‘intree’.

 

Use option “b” to switch the type of analysis to ‘Likelihood mapping’ rather than ‘Tree reconstruction’ [We are only interested in obtaining the distance matrix from puzzle in this analysis. Both of these analysis types cause puzzle to write a distance matrix file, but “Likelihood mapping” will be quicker to run].

 

Use option ‘m’ to choose a substitution matrix (in this course we mostly use the JTT model).

 

Use ‘w’ to choose between different models of rate heterogeneity. In this course we will only use the “Uniform rate”, “Gamma distributed rates” and “Mixed” models. To change the number of discrete rate categories used to model the gamma distribution use “c” and then enter the appropriate number of categories.

 

Type “y” to run the analysis.

 

The distance matrix file is called ‘outdist’. To estimate a NJ tree from this matrix, copy this file to the name ‘infile’ and then execute the ‘neighbor’ programme from the PHYLIP package.
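
A minimal Python sketch of this copying step follows. Because the alignment is also stored in a file called ‘infile’, the sketch first moves any existing ‘infile’ aside; the backup name ‘infile.alignment’ and the function name are assumptions made for this example.

```python
import shutil
from pathlib import Path


def stage_distance_matrix(directory):
    """Copy 'outdist' to 'infile' in `directory` so that neighbor can read it.

    An existing 'infile' (normally the alignment) is first renamed to
    'infile.alignment' (a hypothetical backup name) so it is not lost.
    Returns the path of the new 'infile'.
    """
    directory = Path(directory)
    source = directory / "outdist"
    target = directory / "infile"
    if target.exists():
        target.rename(directory / "infile.alignment")
    shutil.copyfile(source, target)
    return target
```

After this step, ‘neighbor’ can be executed in the same directory as described in the NEIGHBOR section.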

 

Using PUZZLE to test the molecular clock hypothesis

This analysis requires a sequence alignment in a file called ‘infile’. It also requires a tree in NEWICK format in a file called ‘intree’.

 

Usually this analysis requires running puzzle twice. The first run is used to determine the labels placed by puzzle on the branches of the input tree. This allows one, on the second run, to force puzzle to use the correct branch as the root of the tree (the clock analysis requires a rooted tree). Thus, run the analysis once, using a simple (and hence quick) substitution model. Then repeat the analysis using the substitution model of interest.

 

Use ‘b’ to set the type of analysis to “Tree reconstruction” and “k” to specify that the tree-search procedure should simply be to read the user-defined trees.

 

Use ‘z’ to specify that clocklike branch-lengths should be calculated.

 

Finally, specify an appropriate substitution model as for the calculation of the distance matrix, and then type “y” to run the analysis.

 

After the first run, examine the file ‘outfile’ to find the section labelled

“MAXIMUM LIKELIHOOD BRANCH LENGTHS OF USER DEFINED TREE # 1(WITH CLOCK)”. Here PUZZLE states which branch has been used as the root. Examine the tree below this statement, identify the number assigned by puzzle to the branch that should be used as the root, and then repeat the analysis described above, only this time use “l” to enter this number as the branch that should be used as the root.

 

Using PUZZLE to apply the SH test to a set of topologies

This mode of usage requires an alignment “infile” and an “intree” file that contains several different tree topologies. “b” is used to run “Tree reconstruction analysis”, “k” is used to set the search procedure to “Evaluate user defined trees”, the appropriate substitution model is chosen as previously, and the analysis is run using “Y”.

 

The SH test considers the difference in likelihoods between a set of tree topologies. The likelihood of each of the set of topologies is calculated under a specified substitution model for a given sequence alignment. The difference in likelihood between each topology and the topology in the set that has the maximum likelihood is then calculated. The test assigns to each topology in the set the probability that the difference in likelihood between that tree and the maximum likelihood tree is due only to sampling error. Thus, a topology with an SH-test p-value of less than 0.05 would be rejected at the 5% level as an equally good explanation of the data as the maximum likelihood tree in the set.

 

To obtain the results of the SH test, examine “outfile” under the section “COMPARISON OF USER TREES (NO CLOCK)”. Here, for each of the trees in “intree”, the file lists its log likelihood, the difference in log likelihood between this tree and the tree in the set with the maximum likelihood, and then the p-values for several different tests; for the SH test, look in the column “p-SH”. A low p-value indicates that the log likelihood of that tree is significantly lower than that of the maximum likelihood tree.

 

PHYML

Maximum likelihood estimation of phylogeny (using either DNA or protein sequences)

Execution: > phyml

 

The first prompt from the programme asks for the name of the file containing the sequence alignment that should be analysed. You should enter the ‘absolute path’ to the sequence file, i.e. the path with reference to the root directory. Thus, if you are in your home directory, and want to use phyml to analyse a file called “aln.phy”, you should enter “/home/your_user_name/aln.phy” at this prompt.
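
If you are unsure of the absolute path of a file, it can be obtained programmatically. The following Python sketch is one way to do so; the function name is an assumption made for this example, and the result depends on the directory from which it is run.

```python
from pathlib import Path


def absolute_alignment_path(filename):
    """Return the absolute path (starting from the root directory) of
    `filename`, interpreted relative to the current working directory."""
    return str(Path(filename).expanduser().resolve())


# Prints the absolute path of 'aln.phy' in the current directory.
print(absolute_alignment_path("aln.phy"))
```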

 

Use “D” to indicate that the file contains AA (amino acid) sequences.

 

Choose a substitution matrix using “M”.

 

To use a substitution model that assumes some fraction of invariant sites, use “V”. As you have no prior information about the proportion of invariant sites, you should respond affirmatively with a “Y” to the question “Optimise p-invar?”, which will cause phyml to estimate and optimise the proportion of invariant sites.

 

To use a model that uses a discrete gamma distribution to model between-site rate heterogeneity, select “R”. Alter the number of discrete categories used to approximate the gamma distribution using “C”. Tell phyml to estimate and optimise the shape parameter of the gamma distribution using “A” followed by “Y” to the question “Optimise alpha ?”.

 

Run the analysis using “Y”

 

The result files are written to the same directory as that containing the input file.

 

It is possible to run a nonparametric bootstrap analysis using “B” followed by the number of bootstrapped datasets you want to use when asked “Number of replicates” (typically we enter “100”).

 

MRBAYES

Execution: > mb

 

CODEML (one programme from the PAML package)

Estimation of the maximum likelihood set of branch lengths for a tree with given topology. A very large range of different evolutionary models can be used in these calculations

Execution: > codeml

CODEML requires that the directory in which it is executed contains a file called “codeml.ctl”. This file specifies the path/name of the files that should be processed by CODEML.

 

In codeml.ctl, the file that contains the alignment data is specified using:

 

seqfile = name_of_aln_file

 

 

the file containing the tree using:

 

treefile = name_of_tree_file

 

 

the file containing an amino acid substitution matrix using:

 

aaRatefile = name_of_matrix_file

 

 

the file that will contain the results of the calculations using:

 

outfile = name_of_outfile

 

to use a global clock in the likelihood calculation use:

 

clock = 1

 

 

to run the analysis without a clock use:

 

clock = 0
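
Since the clock test requires two runs that differ only in the “clock” setting, it can be convenient to edit the control file with a script. The following Python sketch changes one “name = value” setting in the text of a control file and leaves everything else on the line untouched; the function name is an assumption made for this example, and the sketch does not create a complete control file.

```python
import re


def set_ctl_option(ctl_text, key, value):
    """Return `ctl_text` with the setting `key` changed to `value`.

    Only the value directly after 'key =' is replaced; any trailing text
    on that line is kept. If the key is absent, a new line is appended.
    (Hypothetical helper, not part of PAML.)
    """
    pattern = re.compile(
        rf"^([ \t]*{re.escape(key)}[ \t]*=[ \t]*)\S+", re.MULTILINE
    )
    new_text, count = pattern.subn(lambda m: m.group(1) + str(value), ctl_text)
    if count == 0:
        new_text = ctl_text.rstrip("\n") + f"\n{key} = {value}\n"
    return new_text
```

For example, reading codeml.ctl, applying set_ctl_option(text, "clock", 1) and set_ctl_option(text, "outfile", ...) with a new output name, and writing the result back would prepare the second run without overwriting the results of the first.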

 

 

When using CODEML to test the clock hypothesis, run codeml once with “clock = 1” and again with “clock = 0”. Examine the outfiles to obtain the likelihood of the trees calculated under these two models. You should also check that the tree used by CODEML has its root in the correct position (both these pieces of information come almost at the end of the outfile). To test the significance of the difference in likelihood between the two models, use the chi-squared test: the test statistic is twice the difference in log likelihood between the two models, and the number of degrees of freedom is the difference between the number of free branch-length parameters under the two models. There are (2n - 3) such parameters in the model without the clock, and (n - 1) in the model with the global clock [where n is the number of terminal branches in the tree], giving (n - 2) degrees of freedom.
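
The test can be sketched in Python as follows. This is an illustration under stated assumptions rather than part of PAML: the function names are invented for the example, the statistic is taken to be twice the difference in log likelihood, and the chi-squared tail probability is computed with the standard library only.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared distribution with integer
    degrees of freedom df >= 1, evaluated at x >= 0."""
    if df < 1 or x < 0:
        raise ValueError("need df >= 1 and x >= 0")
    half = x / 2.0
    # Closed forms for df = 1 and df = 2, then a recurrence in steps of 2 df:
    # Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1)
    if df % 2 == 1:
        q, a = math.erfc(math.sqrt(half)), 0.5
    else:
        q, a = math.exp(-half), 1.0
    while a < df / 2.0:
        if half > 0:
            q += math.exp(a * math.log(half) - half - math.lgamma(a + 1.0))
        a += 1.0
    return q


def clock_lrt(lnl_clock, lnl_no_clock, n_terminal):
    """Likelihood-ratio test of the global clock.

    Returns (statistic, degrees of freedom, p-value), with
    df = (2n - 3) - (n - 1) = n - 2 for n terminal branches.
    """
    df = (2 * n_terminal - 3) - (n_terminal - 1)
    statistic = 2.0 * (lnl_no_clock - lnl_clock)
    return statistic, df, chi2_sf(max(statistic, 0.0), df)
```

The two log likelihoods passed to clock_lrt are the values read from the two outfiles; a p-value below 0.05 would lead to rejecting the clock hypothesis at the 5% level.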

 

 

Processing, editing and creating multiple sequence alignments

GBLOCKS

Execution: > Gblocks align_filename

SEAVIEW

Execution: > seaview align_filename &

 

CLUSTALX

Execution: > clustalx align_filename &