
Day 4 - 3rd June 2005

Preparing alignments for phylogenetic analysis

Today's exercise differs in style from those of the first three days of the course. Rather than discussing the theory behind phylogenetic analysis in detail, we work through an example of a 'typical' phylogenetic exercise: starting from a question of interest, collecting the sequences to be examined, aligning those sequences, and finally estimating their phylogeny.

 

Following this exercise, the demonstrators will be available to help you begin an analysis of your own set of sequences of interest.

 

 

Exercise 1

The aims of this exercise are:

 

Question of interest

We are studying newt homeobox proteins. We want to see whether it is possible to identify sequences in other vertebrates (e.g. Human, Xenopus) that are orthologous to our newt sequence of interest.

 

 

Collect initial set of appropriate sequences

We will retrieve sequences related to our sequence of interest using BLAST at the NCBI.

 

>BOX5_NOTVI
APHGACQTSGTLRSMSGSMAESLLGSDHSKAAFLEFGTGTHSPQGHYPLHSFHPPTEGPY
GGSGYGGRTLGYPYSPHGHPQHHASPYLPYHQGQHGGSLGHGGSRLDEDTELEKNTVIEN
GEIRINGKGKKIRKPRTIYSSVQLQALNQRFQQTQYLALPERAELAAHLGLTQTQVKIWF
QNKRSKYKKIMKQGSSIQEGEHLHSSASMSPCSPNIPPHWDSPMGTKGGPIGHGSYINNY
GPWYQPHHQDSMPRPQMM

 

Submit the above newt DLL (Distal-less homeodomain protein) sequence as a query to NCBI BLAST, selecting the SWISSPROT database.

 

Is an Amphibian sequence the closest relative of your query sequence?

 

Use the BLAST server to restrict the BLAST output to contain only sequences from "vertebrata" and to contain a maximum of 50 sequences.

 

Select and retrieve sequences in FASTA format, using the output controls. This takes a few steps in NCBI Entrez.

[NCBI names are rather long. You could edit the unneeded text at this stage.]
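If you prefer not to edit the names by hand, the trimming can be scripted. The following is a minimal sketch, assuming that the label you want to keep is the last '|'-separated field of the first word on each header line; the function names and the example header are hypothetical, and real NCBI headers may need a different rule.

```python
def short_name(header: str) -> str:
    """Return a compact label from one FASTA header line.

    Assumption: the useful label is the last non-empty '|'-separated
    field of the first whitespace-delimited word.
    """
    words = header.lstrip(">").split()
    if not words:
        return ""
    fields = [field for field in words[0].split("|") if field]
    return fields[-1] if fields else ""


def shorten_fasta_headers(fasta_text: str) -> str:
    """Rewrite every header line of a FASTA text, leaving sequence lines untouched."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out.append(">" + short_name(line))
        else:
            out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Made-up header and sequence, for illustration only.
    example = ">db|X00000|EXAMPLE_SPECIES made-up description\nMKVLA\n"
    print(shorten_fasta_headers(example), end="")
```

Check the shortened names for duplicates before aligning, since alignment programs generally expect each name to be unique.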

 

 

Multiple sequence analysis of initial sequence set

Load the sequences into CLUSTALX and do an automatic alignment using default settings.

 

 

Examining alignment

Examine the alignment.

 

1. Is the sequence uniformly conserved, or do different columns in the alignment appear to accept substitutions at different rates?

 

2. Given the uniformity (or not) of the conservation, would you expect your phylogenetic analysis to be improved by incorporating a model of between-site rate heterogeneity (e.g. gamma-distributed rates and invariant sites)?

 

3. Are any regions of the alignment likely to be so divergent that one would expect that they do not contain any useful phylogenetic information?

 

4. Are there any sequences in the alignment that you expect are fragments?
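Questions 1 to 3 are best answered by eye in ClustalX, but a crude per-column score can help confirm your impression. The sketch below assumes the alignment is available as a list of equal-length strings with '-' for gaps, and scores each column by the fraction of non-gap residues that match the most common residue. This score is an illustrative choice, not the measure ClustalX displays.

```python
from collections import Counter


def column_conservation(aligned):
    """Return one score per column: fraction of non-gap residues equal to the commonest residue."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for i in range(length):
        residues = [seq[i] for seq in aligned if seq[i] != "-"]
        if not residues:
            scores.append(0.0)  # column contains only gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(residues))
    return scores


if __name__ == "__main__":
    toy = ["MKV-A", "MRV-A", "MKIQA"]  # toy alignment, not real data
    for position, score in enumerate(column_conservation(toy), start=1):
        print(position, round(score, 2))
```

Note that a column with a single non-gap residue scores 1.0, so columns that are mostly gaps should be judged separately. A mixture of high-scoring and low-scoring columns suggests that sites evolve at different rates.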

 

Refine alignment

Delete all the fragments.
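One way to flag candidate fragments is to compare each sequence's ungapped length with the median for the alignment. The sketch below assumes the alignment is held as a dictionary of name to aligned sequence; the 50% cutoff is an arbitrary illustrative threshold, and the final decision should still be made by looking at the alignment.

```python
from statistics import median


def find_fragments(aligned, min_fraction=0.5):
    """Return names of sequences whose ungapped length is below min_fraction of the median."""
    lengths = {name: len(seq.replace("-", "")) for name, seq in aligned.items()}
    cutoff = min_fraction * median(lengths.values())
    return [name for name, n in lengths.items() if n < cutoff]


if __name__ == "__main__":
    # Toy alignment with made-up names, not real data.
    toy = {
        "seqA": "MKVLAAGT",
        "seqB": "MKVLASGT",
        "seqC": "---LA---",
    }
    print(find_fragments(toy))
```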

 

Remove columns for which all sequences contain gaps, and realign the sequences.
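ClustalX can remove these columns for you; the following sketch only illustrates what the operation does. It assumes the alignment is a list of equal-length strings with '-' as the gap character.

```python
def drop_all_gap_columns(aligned):
    """Remove every column in which all sequences have a gap."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    # Keep a column if at least one sequence has a residue there.
    keep = [i for i in range(length) if any(seq[i] != "-" for seq in aligned)]
    return ["".join(seq[i] for i in keep) for seq in aligned]


if __name__ == "__main__":
    toy = ["MK--V", "M---V", "MR--I"]  # toy alignment, not real data
    print(drop_all_gap_columns(toy))
```

Such columns typically appear after sequences carrying insertions have been deleted, which is why this step follows the removal of fragments.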

 

Check for any sequences that appear to be highly divergent or rather unusual within the conserved homeobox region.

 

If there are any such sequences, delete them, remove the resulting all-gap columns, and realign.

 

When the alignment seems to be ready, save the alignment in PIR format.
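ClustalX writes the PIR file for you, but it can be useful to recognise the layout when checking the file before uploading it. The sketch below assumes the usual PIR/NBRF conventions for proteins: a '>P1;' header line carrying the name, a description line (left blank here), and the aligned sequence terminated by '*'. The function name and the 60-character line width are illustrative choices.

```python
def to_pir(aligned, width=60):
    """Format a dictionary of name -> aligned protein sequence as PIR text."""
    blocks = []
    for name, seq in aligned.items():
        lines = [f">P1;{name}", ""]  # header line, then an empty description line
        body = seq + "*"             # '*' marks the end of the sequence
        lines.extend(body[i:i + width] for i in range(0, len(body), width))
        blocks.append("\n".join(lines))
    return "\n".join(blocks) + "\n"


if __name__ == "__main__":
    # Toy alignment with made-up names, not real data.
    print(to_pir({"seqA": "MK-VLA", "seqB": "MKIVLA"}), end="")
```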

 

GBLOCKS processing

The GBLOCKS software automatically identifies regions of your alignment in which the amino acids within each column are likely to be related to one another by substitution processes.

 

Open the GBLOCKS server in your web browser.

 

Upload the PIR-format alignment to the web page.

 

Toggle on smaller blocks and less strict flanking positions.

Why are we doing this?

 

Get blocks and save the resulting alignment as e.g. DLL_gb.pir

 

Make the tree

Load the Gblocked alignment into ClustalX.

 

Does this alignment have many or few informative positions?
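To put a number on this, one common definition is the parsimony-informative site: a column in which at least two different residues each occur in at least two sequences. The sketch below counts such columns, assuming the alignment is a list of equal-length strings and that gaps ('-') are ignored. Other definitions of "informative" are possible, so treat the count as a rough guide.

```python
from collections import Counter


def count_informative_columns(aligned):
    """Count columns where at least two residue types each appear in two or more sequences."""
    informative = 0
    for column in zip(*aligned):  # iterate column by column
        counts = Counter(residue for residue in column if residue != "-")
        shared_states = sum(1 for n in counts.values() if n >= 2)
        if shared_states >= 2:
            informative += 1
    return informative


if __name__ == "__main__":
    toy = ["MKVA", "MKIA", "MRVA", "MRIS"]  # toy alignment, not real data
    print(count_informative_columns(toy), "informative columns out of", len(toy[0]))
```

In the toy alignment only the second and third columns qualify: the first is invariant, and the fourth has a residue found in a single sequence.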

 

Is it suitable for detailed or superficial phylogeny estimation?

 

Toggle on "correct for multiple substitutions"
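Observed differences between two sequences underestimate the true amount of change, because several substitutions may have occurred at the same site. For proteins, the Clustal documentation describes this option as applying Kimura's empirical formula, K = -ln(1 - D - D^2/5), where D is the observed proportion of differing sites. The sketch below assumes that formula only; any special handling of very divergent pairs by the program is not reproduced, and the function names are illustrative.

```python
import math


def p_distance(seq_a, seq_b):
    """Observed proportion of differing sites, ignoring columns with a gap in either sequence."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def kimura_protein_distance(p):
    """Kimura's empirical correction for multiple substitutions in proteins."""
    inner = 1.0 - p - (p * p) / 5.0
    if inner <= 0.0:
        raise ValueError("sequences too divergent for this correction")
    return -math.log(inner)


if __name__ == "__main__":
    d = p_distance("MKVA", "MRVA")  # toy sequences, not real data
    print("observed:", d, "corrected:", kimura_protein_distance(d))
```

The corrected distance is always at least as large as the observed one, and the gap between them widens as the sequences become more divergent.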

 

Estimate tree

 

Display tree in NJPLOT

 

Tree examination:

Is there a Xenopus orthologue of our sequence (BOX5_NOTVI)?

 

Is there another DLL from this newt?

 

Are there other amphibian DLLs?

 

Are DLX3_Notvi and DLL3_Xenla orthologues?

 

Is there something odd about Xenopus DLL numbering?

 

What?

 

Is this common in sequence databases?