Today's exercise differs in style from those of the first three days of the course. Rather than discussing the theory behind phylogenetic analyses in detail, we work through an example of a 'typical' phylogenetic exercise: starting with a question of interest, collecting the sequences to be examined, aligning these sequences, and finally estimating their phylogeny.
Following this exercise, the demonstrators will be available to help you begin an analysis of your own set of sequences of interest.
The aims of this exercise are:
We will retrieve sequences related to our sequence of interest using BLAST at the NCBI.
Submit the above newt DLL (Distal-less homeodomain protein) sequence as a query, selecting the SWISSPROT database in NCBI BLAST.
Is an amphibian sequence the closest relative of your query sequence?
Use the BLAST server to restrict the BLAST output to contain only sequences from "vertebrata" and to contain a maximum of 50 sequences.
Select and retrieve sequences in FASTA format, using the output controls. This takes a few steps in NCBI Entrez.
[NCBI names are rather long. You could edit out the unneeded text at this stage.]
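If you prefer not to edit the names by hand, the shortening can be scripted. The following is only a minimal sketch: it assumes pipe-delimited headers of the general form `db|ACCESSION|ENTRY_NAME description`, and the example record (name and sequence) is invented for illustration. Check the result against your own file, since header layouts vary.

```python
def shorten_fasta_headers(fasta_text):
    """Return FASTA text in which each header keeps only a short identifier."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            first_token = tokens[0] if tokens else ""
            # For pipe-delimited headers, keep the last non-empty field
            # (assumed here to be the entry name).
            fields = [f for f in first_token.split("|") if f]
            name = fields[-1] if fields else first_token
            out.append(">" + name)
        else:
            # Sequence lines are passed through unchanged.
            out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical record, not a real database entry.
example = ">sp|P00001|DLXA_EXAMP Hypothetical protein OS=Example species\nMKTAYIAK\n"
short = shorten_fasta_headers(example)
print(short)
```

Short names are also easier to read on the final tree, but make sure they remain unique after shortening.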
Load the sequences into CLUSTALX and do an automatic alignment using default settings.
Examine the alignment.
1. Is the sequence uniformly conserved, or do different columns in the alignment appear to accept substitutions at different rates?
2. Given the uniformity (or not) of the conservation, would you expect your phylogenetic analysis to be improved by incorporating a model of between-site rate heterogeneity (e.g. gamma-distributed rates and invariant sites)?
3. Are any regions of the alignment likely to be so divergent that one would expect that they do not contain any useful phylogenetic information?
4. Are there any sequences in the alignment that you expect are fragments?
Delete all the fragments.
Remove columns for which all sequences contain gaps, and realign the sequences.
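ClustalX can remove these columns for you, but the operation itself is simple. As a sketch only, assuming the alignment is held as a list of equal-length strings with `-` as the gap character (the toy sequences are invented):

```python
def drop_all_gap_columns(seqs, gap="-"):
    """Remove alignment columns in which every sequence has a gap."""
    if not seqs:
        return []
    # Keep a column if at least one sequence has a residue there.
    keep = [i for i in range(len(seqs[0])) if any(s[i] != gap for s in seqs)]
    return ["".join(s[i] for i in keep) for s in seqs]


# Toy alignment: columns 3 and 4 (0-based 2 and 3) are gaps in every sequence,
# as might happen after deleting a fragment that carried an insertion.
toy = ["MK--RA", "MK--QA", "MR--Q-"]
cleaned = drop_all_gap_columns(toy)
print(cleaned)
```

Columns that still contain a gap in only some sequences are kept, which is why the alignment is realigned afterwards.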
Check for any sequences that appear to be highly divergent or rather unusual within the conserved homeobox region.
If there are any such sequences, delete them, remove the all-gap columns again, and realign.
When the alignment seems to be ready, save the alignment in PIR format.
The GBLOCKS software automatically identifies regions of your alignment in which the amino acids in each column are likely to be related to one another by substitution processes.
Open the GBLOCKS server in your web browser.
Upload the PIR-format alignment to the web page.
Toggle on smaller blocks and less strict flanking positions.
Why are we doing this?
Get the blocks and save the resulting alignment as, e.g., DLL_gb.pir.
Load the Gblocked alignment into ClustalX.
Does this alignment have many or few informative positions?
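One way to put a number on this is to count parsimony-informative columns. The sketch below uses one common definition (a column is informative if at least two different residues each occur in at least two sequences) and ignores gaps and the ambiguity characters `X` and `?`; both choices are assumptions, and the toy alignment is invented.

```python
from collections import Counter


def count_informative_columns(seqs, ignore="-X?"):
    """Count columns with >= 2 residue types that each occur in >= 2 sequences."""
    count = 0
    for column in zip(*seqs):
        residue_counts = Counter(c for c in column if c not in ignore)
        # Residues seen in at least two sequences can support a grouping.
        shared_states = [r for r, n in residue_counts.items() if n >= 2]
        if len(shared_states) >= 2:
            count += 1
    return count


# Toy alignment: columns 2 and 3 are informative, columns 1 and 4 are constant.
toy = ["MKRA", "MKQA", "MRQA", "MRRA"]
n_informative = count_informative_columns(toy)
print(n_informative)
```

Compare the count with the number of sequences: a short block with few informative columns cannot be expected to resolve many branches.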
Is it suitable for detailed or superficial phylogeny estimation?
Toggle on "correct for multiple substitutions".
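To see what such a correction does, consider Kimura's approximation for protein distances, which the Clustal documentation describes as the basis of this option. The sketch below assumes the simple form d = -ln(1 - p - p^2/5), where p is the observed fraction of differing sites; it does not reproduce any special handling that the program applies to very divergent pairs.

```python
import math


def kimura_protein_distance(p):
    """Kimura's approximate correction of an observed protein p-distance."""
    inner = 1.0 - p - 0.2 * p * p
    if inner <= 0:
        # The formula is undefined for very divergent pairs (p above about 0.85).
        raise ValueError("observed distance too large for this correction")
    return -math.log(inner)


# With 30% of sites differing, the corrected distance is larger than 0.30,
# reflecting substitutions hidden by multiple changes at the same site.
d = kimura_protein_distance(0.30)
print(round(d, 3))
```

The correction grows with divergence, so it matters most for the deepest branches of the tree.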
Display the tree in NJPLOT.
Is there a Xenopus orthologue of our sequence (Box5_notvi)?
Is there another DLL from this newt?
Are there other amphibian DLLs?
Are DLX3_Notvi and DLL3_Xenla orthologues?
Is there something odd about Xenopus DLL numbering?
Is this common in sequence databases?