The style of today's exercise is rather different from that of the first three days of the course. Rather than discussing in detail aspects of the theory behind phylogenetic analyses, we provide an example of a 'typical' phylogenetic exercise: beginning with a question of interest, collecting the sequences to be examined, aligning these sequences, and finally estimating their phylogeny.
Following this exercise, the demonstrators will be available to help you begin an analysis of your own set of sequences of interest.
The aims of this exercise are:
We are studying newt homeobox proteins. We want to see whether it is possible to identify sequences orthologous to this sequence in other vertebrates (e.g. human, Xenopus).
We will retrieve sequences related to our sequence of interest using BLAST at the NCBI.
>BOX5_NOTVI
APHGACQTSGTLRSMSGSMAESLLGSDHSKAAFLEFGTGTHSPQGHYPLHSFHPPTEGPY
GGSGYGGRTLGYPYSPHGHPQHHASPYLPYHQGQHGGSLGHGGSRLDEDTELEKNTVIEN
GEIRINGKGKKIRKPRTIYSSVQLQALNQRFQQTQYLALPERAELAAHLGLTQTQVKIWF
QNKRSKYKKIMKQGSSIQEGEHLHSSASMSPCSPNIPPHWDSPMGTKGGPIGHGSYINNY
GPWYQPHHQDSMPRPQMM
Submit the above newt DLL (Distal-less homeodomain protein) sequence as a query to NCBI BLAST, selecting the SWISSPROT database.
Is an amphibian sequence the closest relative of your query sequence?
Use the BLAST server to restrict the BLAST output to contain only sequences from "vertebrata" and to contain a maximum of 50 sequences.
Select and retrieve sequences in FASTA format, using the output controls. This takes a few steps in NCBI Entrez.
[NCBI names are rather long. You could edit out the unneeded text at this stage.]
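Editing the names by hand is fine for a few sequences; for 50 it may be easier to script. The following is a minimal sketch, assuming the headers use the pipe-separated NCBI style in which the last field of the first word is the short entry name. The function names and the example header are hypothetical, so check a few of your own headers before relying on it.

```python
def short_name(header: str) -> str:
    """Return a compact sequence name from one FASTA header line.

    Assumption: the first whitespace-delimited word is pipe-separated and
    its last non-empty field is the name we want to keep.
    """
    tokens = header.lstrip(">").split()
    if not tokens:
        return "unnamed"
    fields = [field for field in tokens[0].split("|") if field]
    return fields[-1] if fields else "unnamed"


def shorten_fasta_headers(fasta_text: str) -> str:
    """Rewrite every header line of a FASTA file, leaving sequences untouched."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out.append(">" + short_name(line))
        else:
            out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Made-up header and sequence, for illustration only.
    example = ">sp|X00000.1|EXAMPLE_SPECA RecName: Full=Example protein\nMKTAYIAK\n"
    print(shorten_fasta_headers(example), end="")
```

Short names also display more readably in the alignment and in the final tree.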
Load the sequences into CLUSTALX and do an automatic alignment using default settings.
Examine the alignment.
1. Is the sequence uniformly conserved, or do different columns in the alignment appear to accept substitutions at different rates?
2. Given the uniformity (or otherwise) of the conservation, would you expect your phylogenetic analysis to be improved by incorporating a model of between-site rate heterogeneity (i.e. gamma-distributed rates and invariant sites)?
3. Are any regions of the alignment so divergent that you would expect them to contain no useful phylogenetic information?
4. Are there any sequences in the alignment that you suspect are fragments?
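Questions 1 to 3 can be judged by eye in CLUSTALX, but a rough per-column score can help. The sketch below is one simple way of doing this, not the score CLUSTALX itself displays: for each column it reports the fraction of sequences carrying the most common residue, so gaps count against a column. The function name and the toy alignment are assumptions made for illustration.

```python
from collections import Counter

GAP_CHARS = "-."


def column_conservation(aligned):
    """Return one score per alignment column, between 0.0 and 1.0.

    The score is (count of the most common residue) / (number of sequences).
    A column of gaps only scores 0.0.
    """
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for i in range(length):
        residues = [seq[i].upper() for seq in aligned if seq[i] not in GAP_CHARS]
        if residues:
            top_count = Counter(residues).most_common(1)[0][1]
            scores.append(top_count / len(aligned))
        else:
            scores.append(0.0)
    return scores


if __name__ == "__main__":
    toy = ["MKIW", "MRIW", "MK-W"]  # toy alignment, not real data
    for position, score in enumerate(column_conservation(toy), start=1):
        print(position, round(score, 2))
```

If the scores vary strongly along the alignment, with a well-conserved core and poorly conserved flanks, that is a hint that sites evolve at different rates.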
Delete all the fragments.
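If it is hard to decide by eye which sequences are fragments, comparing ungapped lengths can help. The sketch below flags sequences whose ungapped length is below some fraction of the median length. The 0.5 threshold and the function name are arbitrary assumptions, and a short sequence is not necessarily a fragment, so inspect each flagged sequence before deleting it.

```python
from statistics import median


def find_fragments(alignment, min_fraction=0.5):
    """Return names of sequences much shorter than the median sequence.

    alignment: dict mapping sequence name to aligned sequence (gaps as - or .)
    min_fraction: assumed cutoff, as a fraction of the median ungapped length
    """
    if not alignment:
        return []
    lengths = {
        name: len(seq.replace("-", "").replace(".", ""))
        for name, seq in alignment.items()
    }
    cutoff = min_fraction * median(lengths.values())
    return [name for name, n in lengths.items() if n < cutoff]


if __name__ == "__main__":
    toy = {  # toy alignment, not real data
        "seq_a": "MKTAYIAKQR",
        "seq_b": "MKTAYIAKQR",
        "seq_c": "----YIA---",
    }
    print(find_fragments(toy))
```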
Remove columns for which all sequences contain gaps, and realign the sequences.
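CLUSTALX can remove gap-only columns for you. The logic of that step is simple, and the following minimal sketch shows it: a column is kept if at least one sequence has a residue there. The function name and toy data are assumptions for illustration. Realignment itself is still done in CLUSTALX.

```python
def drop_all_gap_columns(alignment):
    """Remove columns in which every sequence has a gap character.

    alignment: dict mapping sequence name to aligned sequence (equal lengths)
    Returns a new dict with the same names and the trimmed sequences.
    """
    if not alignment:
        return {}
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all aligned sequences must have the same length")
    # Keep a column if any sequence has something other than a gap in it.
    keep = [i for i in range(length) if any(seq[i] not in "-." for seq in seqs)]
    return {name: "".join(seq[i] for i in keep) for name, seq in zip(names, seqs)}


if __name__ == "__main__":
    toy = {"seq_a": "MK--W", "seq_b": "M---W"}  # toy alignment, not real data
    print(drop_all_gap_columns(toy))
```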
Check for any sequences that appear to be highly divergent or rather unusual within the conserved homeobox region.
If there are any such sequences, delete them, remove the gap-only columns, and realign.
When the alignment seems to be ready, save the alignment in PIR format.
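CLUSTALX writes the PIR file for you, but it helps to know what the format looks like when something goes wrong downstream. The sketch below writes a protein alignment in the usual PIR/NBRF layout: a ">P1;name" line, a free-text description line, then the sequence terminated by "*". The function name, the choice of description text, and the 60-character line width are assumptions; the description line CLUSTALX itself writes may differ.

```python
def to_pir(alignment, width=60):
    """Format an alignment (dict of name -> aligned sequence) as PIR text."""
    lines = []
    for name, seq in alignment.items():
        lines.append(f">P1;{name}")  # P1 marks a complete protein sequence
        lines.append(name)           # description line (free text; assumed here)
        body = seq + "*"             # PIR sequences end with an asterisk
        lines.extend(body[i:i + width] for i in range(0, len(body), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    toy = {"seq_a": "MK-W", "seq_b": "MRAW"}  # toy alignment, not real data
    print(to_pir(toy), end="")
```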
The GBLOCKS software automatically identifies the regions of your alignment in which the amino acids in each column are likely to be related to one another by substitution processes, i.e. the regions that are reliably aligned.
Load the GBLOCKS server into your web browser.
Load the PIR format alignment into the webpage.
Toggle on smaller blocks and less strict flanking positions.
Why are we doing this?
Get the blocks and save the resulting alignment as, e.g., DLL_gb.pir.
Load the Gblocked alignment into ClustalX.
Does this alignment have many or few informative positions?
Is it suitable for detailed or superficial phylogeny estimation?
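One way to put a number on "informative" is to count variable and parsimony-informative columns. The sketch below assumes the parsimony definition, i.e. a column is informative if at least two different residues each occur in at least two sequences; the exercise does not say which definition is intended, so treat this as one reasonable reading. The function name and toy data are hypothetical.

```python
from collections import Counter


def count_sites(aligned):
    """Return (variable, informative) column counts for an alignment.

    variable: columns with more than one residue type (gaps ignored)
    informative: columns where at least two residue types each occur
                 in two or more sequences (parsimony-informative)
    """
    variable = 0
    informative = 0
    for column in zip(*aligned):
        counts = Counter(c.upper() for c in column if c not in "-.")
        if len(counts) > 1:
            variable += 1
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative += 1
    return variable, informative


if __name__ == "__main__":
    toy = ["MKAW", "MKAW", "MRAW", "MRAF"]  # toy alignment, not real data
    print(count_sites(toy))
```

A short block-filtered alignment with only a handful of informative columns can usually support only the broad groupings in a tree, not its fine detail.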
Toggle on "correct for multiple substitutions".
Estimate the tree.
Display the tree in NJPLOT.
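The correction matters because the observed proportion of differing sites underestimates the true amount of change once the same site has been hit more than once. As far as we are aware, the Clustal documentation describes Kimura's empirical formula for proteins, K = -ln(1 - D - D*D/5), where D is the observed proportion of differences; treat that as an assumption and check the help for your version. The sketch below applies the formula to one pair of aligned sequences. The function names and toy sequences are hypothetical.

```python
import math


def p_distance(seq_a, seq_b):
    """Observed proportion of differing sites, ignoring columns with a gap."""
    pairs = [
        (x, y)
        for x, y in zip(seq_a.upper(), seq_b.upper())
        if x not in "-." and y not in "-."
    ]
    if not pairs:
        raise ValueError("no ungapped columns shared by the two sequences")
    return sum(x != y for x, y in pairs) / len(pairs)


def kimura_corrected(p):
    """Kimura's empirical correction for protein distances (assumed formula)."""
    arg = 1.0 - p - 0.2 * p * p
    if arg <= 0.0:
        # The formula breaks down for very divergent pairs (p above about 0.85).
        raise ValueError("observed distance too large for this correction")
    return -math.log(arg)


if __name__ == "__main__":
    d = p_distance("MKAW", "MRAW")  # toy sequences, not real data
    print(d, round(kimura_corrected(d), 4))
```

The corrected distance is always at least as large as the observed one, and the gap between them grows as the sequences become more divergent.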
Is there a Xenopus orthologue of our sequence (BOX5_NOTVI)?
Is there another DLL from this newt?
Are there other amphibian DLLs?
Are DLX3_Notvi and DLL3_Xenla orthologues?
Is there something odd about Xenopus DLL numbering? If so, what?
Is this common in sequence databases?