Exercises and Demonstrations
Relationship between tree and alignment probabilities
The aim of this demo is to give a more intuitive grasp of the
link between branch lengths, substitution models, and how these
parameters affect the probability of observing a given alignment.
Swapping the logic around, this should also help us understand
how different alignments make some trees/models more likely
than others.
We load the following NEXUS
format file into MESQUITE. It describes two trees with the same
topology but different branch lengths.
In MESQUITE we simulate DNA alignments from these trees using the
following menu choices:
- Characters -> Make New Matrix From -> Simulated Matrices on
Current Tree -> Evolve DNA Characters -> Jukes-Cantor
Note that the frequency with which different "patterns" of alignment
columns are found in these alignments differs, e.g. columns where all
residues are the same, columns where A and B have the same residue, etc.
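To make the link between branch length and column patterns more concrete, here is a minimal Python sketch of the Jukes-Cantor model chosen in the menu above. It assumes branch lengths are measured in expected substitutions per site; the function names are ours for illustration and are not part of MESQUITE.

```python
import math

def jc69_p_same(t):
    """Probability that the bases at the two ends of a branch of length t
    (expected substitutions per site) are identical under Jukes-Cantor."""
    return 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)

def jc69_p_specific_change(t):
    """Probability of ending at one particular base that differs from the start."""
    return 0.25 - 0.25 * math.exp(-4.0 * t / 3.0)

# Short branches give mostly identical columns; long branches approach 1/4.
for t in (0.01, 0.1, 0.5, 2.0):
    print(f"t={t}: P(same base)={jc69_p_same(t):.3f}")
```

As the branch length grows, the probability of seeing the same base at both ends falls from 1 towards 0.25, which is why the two trees produce different frequencies of column patterns.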
We can also add additional taxa to the trees, and run further
simulations:
- SHOW Taxa Block (from "Project" Window)
- List -> Add Taxa
- Tree -> Alter/Transform Tree -> Add Taxa (change position with the
Move Branch tool, add a branch
length using the Adjust Branch Length tool)
Based on where the additional branch(es) are placed, and their branch
lengths, what pattern do you expect to see most often in the alignment
compared to the expected-to-be-most-similar OTUs?
Using the Adjust Branch Length tool, create trees (and simulate
datasets from them) where
- almost all alignment positions have the same residue for all
sequences
- three of the sequences almost always have the same residue in a
column, while a fourth sequence often (in more than half of the
columns) has a different residue from the others
- there is no longer
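The effect of branch lengths on column patterns can also be explored outside MESQUITE. The following is a rough sketch, not MESQUITE's simulator: it assumes a star tree (every sequence descends directly from the root), Jukes-Cantor substitution, and branch lengths in expected substitutions per site. All function names and branch-length values are illustrative.

```python
import math
import random

BASES = "ACGT"

def evolve(base, t, rng):
    """Evolve one base along a branch of length t under Jukes-Cantor."""
    p_same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
    if rng.random() < p_same:
        return base
    return rng.choice([b for b in BASES if b != base])

def simulate_star(branch_lengths, n_sites, seed=0):
    """Return a list of alignment columns, one tuple of bases per site."""
    rng = random.Random(seed)
    columns = []
    for _ in range(n_sites):
        root = rng.choice(BASES)
        columns.append(tuple(evolve(root, t, rng) for t in branch_lengths))
    return columns

def fraction_all_same(columns):
    """Fraction of columns in which every sequence has the same base."""
    return sum(1 for c in columns if len(set(c)) == 1) / len(columns)

def fraction_fourth_differs(columns):
    """Fraction of columns where sequences 1-3 agree and sequence 4 differs."""
    return sum(1 for c in columns if c[0] == c[1] == c[2] != c[3]) / len(columns)

short = simulate_star([0.001] * 4, 5000)
long_fourth = simulate_star([0.01, 0.01, 0.01, 2.0], 5000)
print("all short branches, identical columns:", fraction_all_same(short))
print("long fourth branch, fourth differs:", fraction_fourth_differs(long_fourth))
```

Very short branches on all four sequences give almost entirely identical columns, while a single long branch makes the fourth sequence differ in most columns.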
Estimating Phylogenies with RAxML
- Obtain the seed alignment for TreeFam
- Load the alignment into ClustalX and examine it - we remove some
of the sequences to make the ML analysis quicker
- Save in FASTA format
- Rename sequences to make them short and alphanumeric only
- Run the FASTA format file through GBLOCKS to
remove columns that contain gaps or are likely to be misaligned, saving
the resulting output file locally (this editing could also be done
"manually" using SEAVIEW or JALVIEW)
- Load GBLOCKed fasta file into CLUSTALX and save in PHYLIP format
- Load PHYLIP format file into RAxML webserver
provided at CIPRES
- No bootstrapping
- Choose Analysis: WAG + Gamma + F
- Choose Tool: RAxML
- Give email address
- Run Analysis
- View resulting tree in NJPLOT/Dendroscope
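The renaming and format-conversion steps above are done interactively in the exercise, but they can also be scripted. The sketch below is one possible approach, assuming an aligned FASTA file held as text, and it writes a simple sequential PHYLIP layout with names padded to 10 characters. The function names, the `seq001` naming scheme and the example sequences are all hypothetical.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def shorten_names(records, prefix="seq"):
    """Give each sequence a short alphanumeric name.

    Returns the renamed records and a lookup table from the short name
    back to the original name, so the tree can be relabelled later."""
    renamed, lookup = [], {}
    for i, (name, seq) in enumerate(records, start=1):
        short = f"{prefix}{i:03d}"
        lookup[short] = name
        renamed.append((short, seq))
    return renamed, lookup

def to_phylip(records):
    """Write aligned records in a simple sequential PHYLIP layout."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    lines = [f"{len(records)} {lengths.pop()}"]
    for name, seq in records:
        lines.append(f"{name:<10}{seq}")
    return "\n".join(lines) + "\n"

# Hypothetical two-sequence alignment with long, non-alphanumeric names.
example = """>hypothetical|chicken protein (long name)
MK-LV
>hypothetical|human protein (long name)
MKALV
"""
renamed, lookup = shorten_names(parse_fasta(example))
print(to_phylip(renamed))
```

Keeping the lookup table makes it possible to restore the original names when viewing the final tree.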
Carry out a similar analysis using either your own alignment of
interest, or using TF105084
Load the seed alignment for TF105048
into CLUSTALX - run bootstrap analysis with 100 replicates, correcting
for multiple substitutions and examine the result with NJPLOT
- Load this FASTA-format ef1a
dataset (the sequences have already been trimmed to remove
likely-misaligned columns) into CLUSTALX
- Estimate phylogenies for 1000 different bootstrapped datasets
- Examine resulting phylogeny in NJPLOT
- Are there some branches in the tree that you consider very
likely to have been misplaced?
- What is it about these branches that makes you suspect that
they have been misplaced?
- Are the positions of these probably-misplaced branches within
the tree estimated precisely (i.e. with high bootstrap support)?
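For readers unsure what a "bootstrapped dataset" is, the resampling step can be sketched in a few lines of Python. This is only an illustration of the general idea (sampling alignment columns with replacement), not CLUSTALX's implementation; the toy alignment and function name are made up.

```python
import random

def bootstrap_alignment(sequences, rng):
    """Resample alignment columns with replacement, keeping the length."""
    n_cols = len(sequences[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[i] for i in picks) for seq in sequences]

# Toy alignment of three sequences (hypothetical data).
alignment = ["ACGTAC", "ACGTTC", "ACGAAC"]
rng = random.Random(1)
replicates = [bootstrap_alignment(alignment, rng) for _ in range(1000)]
print(replicates[0])
```

A tree is then estimated from each replicate, and the bootstrap support of a branch is the percentage of replicate trees that contain it. High support indicates that the data consistently favour a branch, which is not the same as the branch being correct.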
Removing potentially misaligned regions
Molecular phylogenetic analyses assume that amino acids in the same
column of the alignment are related by substitution events. It follows
that we may want to consider removing columns where we suspect this may
not be the case from the alignment - indeed, you have already done this
using GBLOCKS in a previous exercise.
However, sometimes we would prefer to do this "manually" i.e. by
selecting ourselves those columns to be excluded from the analysis.
To compare automatic with manual "cropping" or "trimming" of your
alignment, download this alignment (TF105805) to your machine.
Begin by processing the alignment using GBLOCKS
(remember, you'll need to convert it to either PIR or FASTA format) -
keep a link to the results page and resulting alignment.
Next, load the alignment into JalView and remove columns from the
alignment that you believe may include mis-aligned residues (Do this by
selecting columns along the top of the alignment, and hitting the
Compare the alignments and phylogenetic trees estimated using (i) the
alignment obtained after processing with GBLOCKS (ii) the alignment you
prepared by making your own decisions about which columns to retain and
(iii) the initial alignment.
Finally, estimate phylogenies (try using both CLUSTALX and RAxML)
from the manually processed and the GBLOCKS-processed alignments.
Q When making your own selection of columns, do you tend to be more or
less conservative than GBLOCKS?
Q Are there any major differences in the phylogenies estimated using
the different alignments? (consider both the topologies and the
bootstrap support for the different branches).
When applied to shorter alignments, GBLOCKS often has an unwanted
effect on the topology and bootstrap values of the estimated phylogenies:
the program excludes so many columns that too much information is
removed from the analysis. However, for
longer alignments, it can be shown that it has a positive effect.
Note that GBLOCKS is not a perfect solution to the problem of preparing
an alignment for phylogenetic analysis. Not only is there a sometimes
significant loss of information; if the alignment passed to GBLOCKS
contains many fragments, or sequences that contain false sequence
(e.g. translated non-coding sequence), GBLOCKS is likely to retain this
"bad" sequence in the processed alignment. Therefore, one needs to be
careful to submit to GBLOCKS only alignments that have been mostly
purged of fragments (GBLOCKS will usually not retain any columns
containing gaps) and that contain no "bad" sequence. This is because
GBLOCKS scans the alignment to find regions where there are no gaps,
and where there are several highly-conserved columns. Thus, if there is
one "bad" sequence present, as long as there are a reasonable number of
sequences in the alignment, these columns will remain highly conserved,
and the "bad" sequence will be included.
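The reasoning above can be illustrated with a toy column filter. This is a deliberately simplified sketch and not the GBLOCKS algorithm, which applies further criteria (for example on block length); the 0.85 conservation threshold, the function names and the sequences are assumptions chosen for illustration.

```python
from collections import Counter

def keep_column(column, min_conserved=0.85):
    """Keep a column only if it has no gaps and its most common residue
    makes up at least min_conserved of the sequences."""
    if "-" in column:
        return False
    most_common_count = Counter(column).most_common(1)[0][1]
    return most_common_count / len(column) >= min_conserved

def retained_columns(sequences, min_conserved=0.85):
    """Return the indices of the columns that pass the filter."""
    return [i for i in range(len(sequences[0]))
            if keep_column([s[i] for s in sequences], min_conserved)]

good = ["MKALV"] * 9      # nine well-aligned, identical toy sequences
bad = "QWERT"             # hypothetical "bad" sequence, wrong at every site
fragment = "MK---"        # hypothetical fragment with trailing gaps

print(retained_columns(good + [bad]))       # all columns survive
print(retained_columns(good + [fragment]))  # gapped columns are dropped
```

With nine good sequences and one "bad" one, every column is still 90% conserved and gap-free, so all columns are kept and the "bad" residues stay in the alignment. A single fragment, by contrast, removes every column it leaves gapped.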
For example, try running GBLOCKS on the two alignments below and
estimating the phylogenies from the resulting alignments. In both
alignments, one of the frog sequences has been deliberately altered to
contain an insertion of "bad" sequence.
Q Can you identify the "bad" sequence in the two alignments (prior
to GBLOCKS processing)?
Q Look at the GBLOCKED alignments - do both of them contain "bad"
sequence?
Q What effect does the "bad" sequence have on the phylogenetic
estimation (both in terms of topology and bootstrap values)? (Note
that the "bad" sequence is copied from another part of the same frog
sequence; thus, in the region of the alignment containing the "bad"
sequence, the frog sequence is equally distantly related to all the
other organisms in the alignment.)
Phylogenetic Estimation - from start to finish
In these exercises you are asked to go from having a question that can
be addressed using phylogenetic approaches, through the process of
obtaining an appropriate set of sequences for the analysis, aligning
them, estimating a phylogeny from them, and examining the final result
in the light of your initial question.
Imagine you are working on the FGF10 protein in humans.
You have planned some experiments to be carried out in chickens to test
some of your hypotheses, therefore you want to determine:
- whether there is an ortholog for this gene in chickens
- if so, how many orthologs
To do this, go through the complete analysis including
- sequence similarity search
- initial alignment of sequences
- examination of alignment, discarding of some sequences, perhaps
collection of additional sequences and inclusion in the alignment
- removal of potentially mis-aligned regions
- estimation of phylogeny
- rooting of phylogeny
In particular, think carefully about
- what would be an appropriate set of sequences to collect
- once you have an initial set of sequences, which ones to keep,
which ones to discard
- once you have a tree, where to root it