Exercises and Demonstrations

Relationship between tree and alignment probabilities


The aim of this demo is to provide a more intuitive grasp of the link between branch lengths, substitution models, and how these parameters affect the probability of observing a given alignment. Turning the logic around, this should also help with our understanding of how different alignments make some trees/models more likely than others.

We load the following NEXUS format file into MESQUITE - this describes two trees with the same topology but with different branch-lengths
[Figures: Mesquite Tree 1, Mesquite Tree 2]
In MESQUITE we simulate DNA alignments from these trees using the following Menu choices
Note that the frequency with which different "patterns" of alignment columns are found differs between the alignments simulated from the two trees, e.g. columns where all residues are the same, columns where A and B have the same pattern, etc.
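
As a rough illustration of why branch lengths change the pattern frequencies, the sketch below uses the Jukes-Cantor (JC69) model. This is an assumption made here for simplicity and not necessarily the model used in the MESQUITE simulation. Under JC69, two sequences separated by a total path length t (expected substitutions per site) show the same nucleotide at a site with probability 1/4 + 3/4 exp(-4t/3). The function names are hypothetical.

```python
import math
import random


def jc69_p_same(t):
    """Probability that two taxa separated by path length t (expected
    substitutions per site) share the same nucleotide at a site under JC69."""
    return 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)


def simulate_pair(n_sites, t, rng):
    """Simulate n_sites alignment columns for two taxa (A and B) whose
    connecting path on the tree has length t. JC69 is an assumption."""
    bases = "ACGT"
    p_same = jc69_p_same(t)
    seq_a, seq_b = [], []
    for _ in range(n_sites):
        a = rng.choice(bases)
        if rng.random() < p_same:
            b = a  # column where A and B agree
        else:
            # under JC69 the three other bases are equally likely
            b = rng.choice([x for x in bases if x != a])
        seq_a.append(a)
        seq_b.append(b)
    return "".join(seq_a), "".join(seq_b)


def fraction_identical(seq_a, seq_b):
    """Fraction of columns in which the two sequences agree."""
    return sum(x == y for x, y in zip(seq_a, seq_b)) / len(seq_a)
```

Comparing fraction_identical for a short and a long path length (say 0.1 and 1.0) should show identical columns becoming rarer as the branches get longer, approaching the 1/4 expected by chance.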

We can also add additional taxa to the trees, and run further simulations:
  1. SHOW Taxa Block (from "Project" Window)
  2. List -> Add Taxa
  3. Tree -> Alter/Transform Tree -> Add Taxa (change position with the Move Branch tool; add a branch length using the Adjust Branch Length tool)
Based on where the additional branch(es) are placed, and their branch lengths, what pattern do you expect to see most often in the alignment compared to the expected-to-be-most-similar OTUs?


Using the Adjust Branch Length tool, create trees (and simulate datasets from them) where

Estimating Phylogenies with RAxML


  1. Obtain the seed alignment for TreeFam family TF105048.
  2. Load the alignment into ClustalX and examine it - we remove some of the sequences to make the ML analysis quicker
  3. Save in FASTA format
  4. Rename sequences to make them short and alphanumeric only
  5. Run the FASTA-format file through GBLOCKS to remove columns that contain gaps or are likely to be misaligned, saving the resulting output file locally (this editing could also be done "manually" using SEAVIEW or JALVIEW)
  6. Load GBLOCKed fasta file into CLUSTALX and save in PHYLIP format
  7. Load PHYLIP format file into RAxML webserver provided at CIPRES
    1. No bootstrapping
    2. Choose Analysis: WAG + Gamma + F
    3. Choose Tool: RAxML
    4. Give email address
    5. Run Analysis
  8. View resulting tree in NJPLOT/Dendroscope
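
Steps 4 and 6 above can also be scripted. The following is a minimal sketch in plain Python, not part of ClustalX, GBLOCKS, or RAxML. The function names and the naming scheme (up to six alphanumeric characters plus a three-digit counter) are assumptions chosen for illustration. It writes a simple sequential PHYLIP layout with names padded to ten characters.

```python
import re


def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def short_name(original, index):
    """Hypothetical naming scheme: first six alphanumeric characters of the
    original name followed by a three-digit counter to keep names unique."""
    prefix = re.sub(r"[^A-Za-z0-9]", "", original)[:6]
    return f"{prefix}{index:03d}"


def rename_records(records):
    """Return renamed records plus a mapping from new name to original name."""
    renamed, mapping = [], {}
    for i, (name, seq) in enumerate(records, start=1):
        new = short_name(name, i)
        mapping[new] = name
        renamed.append((new, seq))
    return renamed, mapping


def to_phylip(records):
    """Write aligned records as sequential PHYLIP text."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    lines = [f" {len(records)} {lengths.pop()}"]
    for name, seq in records:
        lines.append(f"{name:<10}{seq}")  # name padded to ten characters
    return "\n".join(lines) + "\n"
```

Keep the returned mapping so that the original sequence names can be restored when you inspect the final tree.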


Carry out a similar analysis using either your own alignment of interest, or using TF105084

Phylogenetic Bootstrapping


Load the seed alignment for TF105048 into CLUSTALX - run bootstrap analysis with 100 replicates, correcting for multiple substitutions and examine the result with NJPLOT


  1. Load this FASTA-format ef1a dataset (the sequences have already been trimmed to remove likely-misaligned columns) into CLUSTALX
  2. Estimate phylogenies for 1000 different bootstrapped datasets
  3. Examine resulting phylogeny in NJPLOT
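
To make explicit what a "bootstrapped dataset" is: each replicate is built by sampling alignment columns with replacement until the replicate has as many columns as the original alignment. The sketch below shows only this resampling step. It is not CLUSTALX's implementation, and the function name is hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample the columns of an alignment with replacement.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    rng: a random.Random instance, so that replicates are reproducible.
    Returns a new dict with the same names and the same number of columns.
    """
    n_cols = len(next(iter(alignment.values())))
    # The same column indices are used for every sequence, so each
    # resampled column is an intact column of the original alignment.
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in picks)
            for name, seq in alignment.items()}
```

A tree is then estimated from each replicate, and the support for a branch is the fraction of replicate trees in which that branch appears.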

Removing potentially misaligned regions

Molecular phylogenetic analyses assume that amino acids in the same column of the alignment are related by substitution events. It follows that we may want to consider removing columns where we suspect this may not be the case from the alignment - indeed, you have already done this using GBLOCKS in a previous exercise.

However, sometimes we would prefer to do this "manually" i.e. by selecting ourselves those columns to be excluded from the analysis.

To compare automatic with manual "cropping" or "trimming" of your alignment, download this alignment (TF105805) to your local machine.

Begin by processing the alignment using GBLOCKS (remember, you'll need to convert it to either PIR or FASTA format) - keep a link to the results page and resulting alignment.

Next, load the alignment into JalView and remove columns from the alignment that you believe may include mis-aligned residues (Do this by selecting columns along the top of the alignment, and hitting the backspace/delete key)
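
The same kind of column removal can be sketched in a few lines of Python. This is only an illustration of dropping chosen columns from every sequence at once. It does not reproduce the GBLOCKS algorithm, and the function names and the gap characters ("-" and ".") are assumptions.

```python
def columns_with_gaps(alignment, gap_chars="-."):
    """Return the 0-based indices of columns in which any sequence has a gap."""
    seqs = list(alignment.values())
    return [i for i in range(len(seqs[0]))
            if any(s[i] in gap_chars for s in seqs)]


def drop_columns(alignment, columns):
    """Remove the given 0-based columns from every sequence in the alignment."""
    drop = set(columns)
    return {name: "".join(ch for i, ch in enumerate(seq) if i not in drop)
            for name, seq in alignment.items()}
```

For a manual selection, pass your own list of column indices to drop_columns in place of the output of columns_with_gaps.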

Compare the alignments and phylogenetic trees estimated using (i) the alignment obtained after processing with GBLOCKS (ii) the alignment you prepared by making your own decisions about which columns to retain and (iii) the initial alignment.

Finally, estimate phylogenies (try using both CLUSTALX and RAxML) from the manually and the GBLOCK-processed alignments

Q When making your own selection of columns, do you tend to be more or less conservative than GBLOCKS?

Q Are there any major differences in the phylogenies estimated using the different alignments? (consider both the topologies and the bootstrap support for the different branches).

When applied to shorter alignments, GBLOCKS often has an unwanted effect on the topology and bootstrap values of the estimated phylogenies: excluding so many columns simply removes too much information from the analysis. However, for longer alignments, it can be shown that it has a positive effect.

Note that GBLOCKS is not a perfect solution to the problem of preparing an alignment for phylogenetic analysis. Not only is there a sometimes significant loss of information; if the alignment passed to GBLOCKS contains many fragments, or sequences that contain false sequence (e.g. translated non-coding sequence), GBLOCKS is likely to retain this "bad" sequence in the processed alignment. Therefore, one needs to be careful to submit to GBLOCKS only alignments that have been mostly purged of fragments (GBLOCKS will usually not retain any columns containing gaps) and that contain no "bad" sequence. This is because GBLOCKS scans the alignment to find regions where there are no gaps, and where there are several highly-conserved columns. Thus, if there is one "bad" sequence present, as long as there are a reasonable number of sequences in the alignment, these columns will remain highly-conserved, and the "bad" sequence will be included.
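
The point about a single "bad" sequence can be illustrated with a simplified conservation score. The sketch below is not the GBLOCKS algorithm: it scores each gap-free column by the frequency of its most common residue and keeps columns above a threshold. The function names and the 0.85 threshold are assumptions for illustration only.

```python
from collections import Counter


def column_conservation(column):
    """Frequency of the most common residue in one alignment column."""
    counts = Counter(column)
    return counts.most_common(1)[0][1] / len(column)


def conserved_columns(sequences, threshold=0.85):
    """Indices of gap-free columns whose conservation meets the threshold.
    The threshold value is a hypothetical choice, not a GBLOCKS default."""
    kept = []
    for i in range(len(sequences[0])):
        column = [s[i] for s in sequences]
        if "-" not in column and column_conservation(column) >= threshold:
            kept.append(i)
    return kept
```

With nine identical sequences and one that differs at two positions, every column still scores 0.9, so all columns are kept and the divergent residues stay in the alignment.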

For example, try running GBLOCKS on the two alignments below and estimating phylogenies from the resulting alignments. In both alignments, one of the frog sequences has been deliberately altered to contain an insertion of "bad" sequence.
Q Can you identify the "bad" sequence in the two alignments (prior to GBLOCKS processing)?

Q Look at the GBLOCKED alignments - do both of them contain "bad" sequence?

Q What effect does the "bad" sequence have on the phylogenetic estimation (both in terms of topology and bootstrap values)? (Note that the "bad" sequence is copied from another part of the same frog sequence; thus, in the region of the alignment containing the "bad" sequence, the frog sequence is equally distantly related to all the other organisms in the alignment.)

Phylogenetic Estimation - from start to finish

In these exercises you are asked to go from having a question that can be addressed using phylogenetic approaches, through the process of obtaining an appropriate set of sequences for the analysis, aligning them, estimating a phylogeny from them, and examining the final result in the light of your initial question.

Imagine you are working on the FGF10 protein in humans. You have planned some experiments to be carried out in chickens to test some of your hypotheses, therefore you want to determine:
To do this, go through the complete analysis including
In particular, think carefully about

Back to Gibson Team course pages at EMBL.