In general, I
advise that when you start to work on a new exercise, you create a new
directory in your home directory in which to process, store and create the
required files.
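A minimal sketch of this habit is shown below. The directory name `exercise_01` is only a hypothetical example; any descriptive name will do.

```shell
# Hypothetical directory name; choose one that describes the exercise.
exercise_dir="$HOME/exercise_01"

# Create the directory (no error if it already exists) and move into it.
mkdir -p "$exercise_dir"
cd "$exercise_dir" || exit 1

# Confirm where you are before copying in the input files.
pwd
```

Keeping one directory per exercise matters here because several of the programmes described below read and write files with fixed names.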
Execution:
> njplot tree_file_name &
Provided with a file that contains a tree in NEWICK format, NJPLOT will display
a rooted representation of that tree. The tree in the file can be either rooted
or unrooted; if it is unrooted, NJPLOT will guess the position of the root and
display the tree accordingly.
Buttons at the top of the screen can be used to alter the representation
of the tree, including choosing a new root. It is not possible to change the
tree in a way that alters the unrooted topology of the tree provided to the
programme. Saving the tree using the “File” menu will save the tree
in NEWICK format to a new file - the tree in the new file will be rooted in the
same position as displayed by the programme.
Execution:
> tv &
tv accepts trees in both NEWICK and NEXUS format.
Programmes in
the PHYLIP package expect that they will be executed in a directory that
contains a file called ‘infile’. This file should contain the alignment
information stored in the appropriate format. Where a programme (e.g. consense)
requires an input file that contains one or more trees, this file is expected
to be called ‘intree’.
The results of
running a PHYLIP programme will be written to files in the same directory from
which the programme was executed. If a tree (or trees) are part of the result
of this analysis, they will be written to a file called ‘outtree’. All of the programmes will
write a file called ‘outfile’ that will contain various pieces of
information about the analysis.
Normally, after running a PHYLIP programme, it makes sense to give the output
file(s) new names. This (i) prevents the contents of the files from being
overwritten if another PHYLIP programme is run in that directory and (ii) gives
them more meaningful names. For example, after running the programme ‘proml’,
you could rename the resulting ‘outfile’ and ‘outtree’ files to ‘proml_outfile’ and ‘proml_outtree’ respectively.
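The renaming step can be sketched as follows. To keep the example harmless, it works in a scratch directory and uses empty stand-in files in place of real PROML output; both are assumptions made purely for illustration.

```shell
# Work in a scratch directory so that nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Empty stand-ins for the files that proml would have written.
touch outfile outtree

# Give the results names that record which programme produced them.
mv outfile proml_outfile
mv outtree proml_outtree

ls
```

After this, the names ‘outfile’ and ‘outtree’ are free again for the next PHYLIP programme run in the same directory.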
Execution:
> proml
During this
course, we will only ever be using PROML to estimate the maximum likelihood
tree. Thus, PROML should be executed from a directory containing an alignment
in a file named ‘infile’.
Use ‘R’ to switch between different models of between-site rate heterogeneity.
Use ‘S’ to run a more accurate analysis.
Use ‘G’ to use the global rearrangement procedure while searching for the ML
tree (this is a more rigorous search than that implemented without global
rearrangements).
During the
course, we will run PROML **without** switching on the slower, more accurate
analysis, or the global rearrangements, to save time. However, if using PROML
yourself, you should run it with these options switched on, assuming that this
does not make the analysis too slow.
Use ‘Y’ to run the analysis.
If your model assumes heterogeneity of rates between sites, on pressing ‘Y’ you
will be asked to provide the coefficient of variation of the substitution rate
among sites. You should calculate this beforehand by running PUZZLE with the
same model that you are using in PROML. The ‘outfile’ produced by PUZZLE
contains an estimate of the shape parameter (alpha) of the gamma distribution
used to model the between-site rate variation. As instructed by PROML, you
provide PROML with the inverse of the square root of the value of alpha
estimated by PUZZLE.
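As a small worked example of this conversion, the sketch below computes the coefficient of variation as 1 / sqrt(alpha) using awk. The alpha value of 0.5 is a made-up figure, not a result from any real PUZZLE run.

```shell
# Hypothetical shape parameter, as it might be read from PUZZLE's outfile.
alpha=0.5

# Coefficient of variation = 1 / sqrt(alpha), printed to four decimal places.
cv=$(awk -v a="$alpha" 'BEGIN { printf "%.4f", 1 / sqrt(a) }')

echo "Coefficient of variation to give to PROML: $cv"
```

With alpha = 0.5 this gives 1.4142; smaller alpha values (stronger rate heterogeneity) give larger coefficients of variation.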
Next you are asked for the number of categories you want to use to model the
rate variation. Typically one uses 5 here: one for the invariant sites (if one
is using that model) and the remaining four for the discrete approximation to
the gamma distribution. To do the same analysis without invariant sites, enter
‘4’ at this prompt.
Finally, you should enter the fraction of invariant sites, which again should
be estimated using PUZZLE.
Execution:
> consense
CONSENSE should be executed from a directory containing a file called ‘intree’
that holds multiple trees, all of which have the same set of taxonomic units.
Use “C” to switch between different rules for building the consensus tree.
Type “Y” to run the analysis.
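One way to build an ‘intree’ file holding several trees is simply to concatenate individual NEWICK files. The sketch below assumes this approach; the two four-taxon trees and their file names are invented for illustration.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two hypothetical trees over the same four taxa, one tree per file.
printf '%s\n' '((A,B),(C,D));' > tree1.nwk
printf '%s\n' '((A,C),(B,D));' > tree2.nwk

# Concatenate them into the file name that consense expects.
cat tree1.nwk tree2.nwk > intree

# Each NEWICK tree ends with a semicolon, so this counts the trees.
grep -c ';' intree
```

The count printed at the end is a quick check that every tree made it into ‘intree’ before the consensus is built.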
Execution:
> neighbor
NEIGHBOR should be executed from a directory containing a distance matrix, such
as that written by PUZZLE, in a file called ‘infile’.
Type “Y” to run the analysis.
Execution:
> seqboot
SEQBOOT should be executed from a directory containing an alignment in a file
called “infile”.
Use “R” to change the number of bootstrapped datasets to produce, and run the
analysis with “Y”.
Execution:
> weighbor -L <x> -b 14 -i infile -o outfile
Replace <x> by the alignment length minus the estimated number of invariant
residues.
The command
above will produce a file called “outfile” containing trees estimated from each of the
datasets present in the file “infile”.
The input file
should contain one or more distance matrices, such as those created by PUZZLE.
The number placed after “-L” should be obtained by running PUZZLE on the
initial alignment to obtain the estimated proportion of invariant residues, and
then using this proportion to calculate the number of variable residues
estimated to be present in the original alignment.
Thus, if PUZZLE estimates that the fraction of invariant sites is 0.41, and the
initial alignment is 250 residues long, the number that goes after “-L” is
250 - (0.41 * 250) = 147.5 [or, equivalently, 250 * (1 - 0.41)].
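The same arithmetic can be sketched in the shell, using the figures from the example above. The variable names are illustrative only, and the sketch only prints the resulting command line rather than running weighbor.

```shell
# Figures from the worked example above.
aln_length=250   # length of the initial alignment
p_invar=0.41     # fraction of invariant sites estimated by PUZZLE

# Estimated number of variable residues: length * (1 - p_invar).
variable_sites=$(awk -v n="$aln_length" -v p="$p_invar" \
    'BEGIN { printf "%.1f", n * (1 - p) }')

# Show the command line that would then be used.
echo "weighbor -L $variable_sites -b 14 -i infile -o outfile"
```

Computing the value this way avoids retyping errors when the same calculation is repeated for several alignments.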
Execution:
> puzzle
PUZZLE is a
relatively versatile tool, and we will use it in several different ways during
the course.
PUZZLE is, in usage, similar to the PHYLIP programmes. Thus, it should be
executed in a directory containing (as appropriate) files called ‘infile’ and
‘intree’, and it writes files called ‘outfile’, ‘outtree’ and ‘outdist’ to the
same directory, although which of these are produced depends on the analysis
carried out.
To calculate a distance matrix, PUZZLE requires an alignment in a file called
‘infile’ but does not require a tree file called ‘intree’.
Use option “b” to switch the type of analysis to ‘Likelihood mapping’ rather
than ‘Tree reconstruction’. [We are only interested in obtaining the distance
matrix from PUZZLE in this analysis. Both of these analysis types will cause
PUZZLE to write a distance matrix file, but “Likelihood mapping” will be
quicker to run.]
Use option ‘m’ to choose a substitution matrix (in this course we mostly use
the JTT model).
Use ‘w’ to choose between different models of rate heterogeneity. In this
course we will only use the “Uniform rate”, “Gamma distributed rates” and
“Mixed” models. To change the number of discrete rate categories used to model
the gamma distribution, use “c” and then enter the appropriate number of
categories.
Type “y” to run the analysis.
The distance
matrix file is called ‘outdist’. To estimate a NJ tree from this matrix,
copy this file to the name ‘infile’ and then execute the ‘neighbor’
programme from the PHYLIP package.
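A sketch of the copying step follows. Because the alignment used by PUZZLE is itself called ‘infile’, copying ‘outdist’ to ‘infile’ in the same directory would overwrite it; the sketch therefore uses a separate subdirectory called `nj`, which is an assumption of this example rather than a requirement of either programme. A stand-in file replaces the real distance matrix.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in for the distance matrix file written by PUZZLE.
printf '%s\n' 'placeholder for a distance matrix' > outdist

# Copy the matrix to the name neighbor expects, inside its own directory.
mkdir -p nj
cp outdist nj/infile
cd nj || exit 1

# On a machine with PHYLIP installed, neighbor would now be started here:
# neighbor
ls
```

The original ‘outdist’ is left untouched, so the same matrix can also be used for other analyses.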
The clock analysis requires a sequence alignment in a file called ‘infile’. It
also requires a tree in NEWICK format in a file called ‘intree’.
Usually this
analysis requires running puzzle twice. The first run is used to determine the
labels placed by puzzle on the branches of the input tree. This allows one, on
the second run, to force puzzle to use the correct branch as the root of the
tree (the clock analysis requires a rooted tree). Thus, run the analysis once,
using a simple (and hence quick) substitution model. Then repeat the analysis
using the substitution model of interest.
Use ‘b’ to set the type of analysis to “Tree reconstruction” and “k” to specify
that the tree-search procedure should simply be to read the user-defined trees.
Use ‘z’ to specify that clocklike branch-lengths should be calculated.
Finally, specify an appropriate substitution model as for the calculation of
the distance matrix, and then type “y” to run the analysis.
After the first run, examine the file ‘outfile’ to find the section labelled
“MAXIMUM LIKELIHOOD BRANCH LENGTHS OF USER DEFINED TREE # 1(WITH CLOCK)”. Here
PUZZLE states which branch has been used as the root. Examine the tree below
this statement, identify the number assigned by PUZZLE to the branch that
should be used as the root, and then repeat the analysis described above, this
time also using “l” to enter this number as the branch that should be used as
the root.
Comparing user-defined trees with PUZZLE requires an alignment in a file called
“infile” and a file called “intree” that contains several different tree
topologies. Use “b” to select “Tree reconstruction analysis”, use “k” to set
the search procedure to “Evaluate user defined trees”, choose the appropriate
substitution model as described previously, and run the analysis using “Y”.
The SH test
considers the difference in likelihoods between a set of tree topologies. The
likelihood of each of the set of topologies is calculated under a specified
substitution model for a given sequence alignment. The difference in likelihood
between each topology and that of the topology present in the set that has the
maximum likelihood is calculated. The test assigns to each topology in the set
the probability that the difference in likelihood between that tree and the maximum
likelihood tree is only due to sampling error. Thus, a topology with a p-value
from the SH test of less than 0.05 would be rejected at the 5% level as an
equally good explanation of the data as the maximum likelihood tree in the
set.
To obtain the results of the SH test, examine “outfile” under the section
“COMPARISON OF USER TREES (NO CLOCK)”. Here, for each of the trees in “intree”,
PUZZLE lists its log likelihood, the difference in log likelihood between this
tree and the tree in the set with the maximum likelihood, and the p-values for
several different tests; for the SH test, look in the column “p-SH”. A low
p-value indicates that the log likelihood of that tree is significantly lower
than that of the maximum likelihood tree.
Execution:
> phyml
The first thing the programme asks for is the name of the file containing the
sequence alignment that should be analysed. You should enter the
‘absolute path’ to the sequence file, i.e. the path starting from the
root directory. Thus, if you are in your home directory, and want to use phyml
to analyse a file called “aln.phy” you should enter
“/home/your_user_name/aln.phy” at this prompt.
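If you are unsure of the absolute path, one simple way to obtain it is to combine the current directory with the file name, as sketched below. The file name `aln.phy` is taken from the example above; the variable names are illustrative.

```shell
# Name of the alignment file in the current directory.
aln_file=aln.phy

# pwd prints the absolute path of the current directory,
# so joining the two gives the absolute path of the file.
abs_path="$(pwd)/$aln_file"

echo "$abs_path"
```

The printed path can then be typed (or pasted) at phyml's first prompt.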
Use “D” to indicate that the file contains AA (amino acid) sequences.
Choose a substitution matrix using “M”.
To use a substitution model that assumes some fraction of invariant sites, use
“V”. As you have no prior information about the proportion of invariant sites,
you should respond affirmatively with a “Y” to the question “Optimise
p-invar?”, which will cause phyml to estimate and optimise the proportion of
invariant sites.
To use a model that uses a discrete gamma distribution to model between-site
heterogeneity, select “R”. Alter the number of discrete categories used to
approximate the gamma distribution using “C”. Tell phyml to estimate and
optimise the shape parameter of the gamma distribution using “A” followed by
“Y” to the question “Optimise alpha ?”.
Run the analysis using “Y”.
The result
files are written to the same directory as that containing the input file.
It is possible to run a non-parametric bootstrap analysis using “B” followed by
the number of bootstrapped datasets you want to use when asked “Number of
replicates” (typically we enter “100”).
Execution:
> mb
Execution:
> codeml
CODEML requires
that the directory in which it is executed contains a file called “codeml.ctl”. This file specifies the
path/name of the files that should be processed by CODEML.
In codeml.ctl, the file that contains the alignment data is specified using:
seqfile = name_of_aln_file
the file containing the tree using:
treefile = name_of_tree_file
the file containing an amino acid substitution matrix using:
aaRatefile = name_of_matrix_file
the file that will contain the results of the calculations using:
outfile = name_of_outfile
to use a global clock in the likelihood calculation use:
clock = 1
to run the analysis without a clock use:
clock = 0
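The sketch below writes a control file containing only the settings listed above, and then derives a second copy with the clock switched off. All file names (`aln.phy`, `tree.nwk`, `jones.dat`, `results_clock.txt`) are placeholders chosen for illustration, and a working control file normally contains further settings that are not covered here.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Control file for the run with a global clock (placeholder file names).
cat > codeml.ctl <<'EOF'
seqfile = aln.phy
treefile = tree.nwk
aaRatefile = jones.dat
outfile = results_clock.txt
clock = 1
EOF

# Same settings without the clock, writing to a different results file.
# Copy this over codeml.ctl before starting the second run.
sed -e 's/^clock = 1$/clock = 0/' \
    -e 's/results_clock/results_noclock/' codeml.ctl > codeml_noclock.ctl

cat codeml_noclock.ctl
```

Giving the two runs different `outfile` names keeps the first set of results from being overwritten by the second.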
When using CODEML to test the clock hypothesis, run codeml once with “clock =
1” and again with “clock = 0”. Examine the outfiles to obtain the likelihood of
the trees calculated under these two models. You should also check that the
tree used by CODEML has its root in the correct position (both these pieces of
information come almost at the end of the outfile). To test the significance of
the difference in likelihood between the two models, compare twice the
difference in log likelihood with the chi-squared distribution, where the
number of degrees of freedom is the difference between the number of free
branch-length parameters under the two models. This is (2n - 3) in the model
without the clock, and (n - 1) for the model with the global clock [where n is
the number of terminal branches in the tree], giving (n - 2) degrees of
freedom.
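The arithmetic of this test can be sketched as follows. The number of terminal branches and the two log likelihoods are invented values for illustration, not CODEML output. The sketch computes the usual likelihood ratio statistic (twice the difference in log likelihood) and the degrees of freedom; the p-value still has to be looked up in a chi-squared table or a statistics package.

```shell
# Hypothetical inputs, as they might be read from the two outfiles.
n=10                   # number of terminal branches in the tree
lnl_clock=-5234.71     # log likelihood with clock = 1
lnl_noclock=-5228.32   # log likelihood with clock = 0

# Degrees of freedom: (2n - 3) - (n - 1), which simplifies to n - 2.
df=$(( (2 * n - 3) - (n - 1) ))

# Test statistic: twice the difference in log likelihood.
stat=$(awk -v a="$lnl_noclock" -v b="$lnl_clock" \
    'BEGIN { printf "%.2f", 2 * (a - b) }')

echo "Test statistic = $stat with $df degrees of freedom"
```

With these made-up numbers the statistic is 12.78 with 8 degrees of freedom.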
Execution:
> Gblocks align_filename
Execution:
> seaview align_filename &
Execution:
> clustalx align_filename &