Evolution and Protein Modular Architecture
EMBL PreDocs 2007
Tuesday October 30th - Wednesday October 31st 2007
Introduction
Our aim is to demonstrate to you some important software tools used to
investigate protein modular architecture. By this we mean the
investigation of (potentially multi-domain) protein sequences to
identify regions (modules) of the proteins that may be associated with
particular functions.
Predicting functions of a protein is the focus of many (perhaps most)
bioinformatic tools for analysing protein sequences. The typical use
case for these tools (at least if the user is a "typical" wet-lab
scientist) is that the scientist has identified a protein relevant to
their area of study, and about which little is known. To understand
the function of this protein in more detail, they will need to do
wet-lab experiments. Bioinformatic prediction provides hypotheses about
the function of the protein that can then be tested in these
experiments. For example, the protein may be predicted to contain a
kinase domain, in which case an obvious experiment would be to confirm
biochemically that the protein does indeed have such activity.
Different modules (and different regions of the same module) are under
different evolutionary constraints. Thus, different protein modules
show different patterns of variation over evolutionary time.
In the previous session with Rolf Apweiler, you will have been shown
how to identify proteins from other organisms (orthologs) and
from the same organism (paralogs) that are closely related to your
protein of interest. Additionally, you will have seen how to create a
multiple sequence alignment (MSA) of related protein sequences.
Looking at protein sequences in an MSA provides valuable insights into
the evolution of the sequences - put simply, regions of the proteins
that are under stronger constraints are less variable between sequences
compared to regions under weaker constraints.
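To make "less variable" concrete, one simple measure is the fraction of sequences that share the most common residue in each alignment column. The sketch below is illustrative only: the function name and the toy alignment are invented, and real analyses often use entropy-based or substitution-matrix-based scores instead.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue, per column.

    `alignment` is a list of equal-length aligned sequences; '-' marks a gap.
    Gaps count in the denominator but never as the most common residue.
    """
    n_seqs = len(alignment)
    scores = []
    for column in zip(*alignment):
        residues = [aa for aa in column if aa != "-"]
        if residues:
            top_count = Counter(residues).most_common(1)[0][1]
            scores.append(top_count / n_seqs)
        else:
            scores.append(0.0)
    return scores


# Toy alignment (made up), four sequences and five columns.
toy_alignment = [
    "MKLVE",
    "MKIVD",
    "MRLV-",
    "MKLVE",
]
print(column_conservation(toy_alignment))  # [1.0, 0.75, 0.75, 1.0, 0.5]
```

Columns scoring close to 1.0 are the candidates for being under stronger constraint.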
In these exercises you will look at several proteins you have already
encountered during the course: pax6, pax2, lbx1 and pitx1. The
sequences you will begin with are all from humans. You will predict
the modular architecture of these proteins, and then examine MSAs of
these proteins to investigate the evolutionary patterns of these
different modules.
Open this link in a new browser window to
access the sequences we will be looking at in these exercises.
Predicting non-globular/globular regions of protein sequences
As Toby discussed in his presentation, the first step in analysing the
modular architecture of a protein is to predict which regions are
globular and non-globular. Globular regions have very different
structures - and hence functions - from non-globular regions. Thus, if
we are interested in predicting function, it is important that we
distinguish between these two types of sequence.
To predict globularity/non-globularity (order/disorder) we recommend
that you use several different predictors of globularity, and compare
the results of the different analyses.
In some cases, the predictions from the different methods all agree
about the boundaries of ordered/disordered regions in your proteins,
and it is easy to decide where the boundaries are. In other cases, the
different methods disagree. When that happens, either spend some time
learning how the methods differ, to try to account for the observed
disagreement, or go to a friendly bioinformatician for help (we're very
happy to help with problems of this kind - come to see us anytime
during your time here at EMBL).
Run the following order/disorder (globular/non-globular) prediction
servers using default options unless otherwise indicated. Begin with
the human pax2 sequence, then, if you have time, also run the other
sequences provided. Alongside the links to the webservers, we have
also given you the output to expect from the servers (in case the
servers are slow or not working while we carry out the exercise).
Follow this link to get the query sequences,
which you should cut-and-paste into the following servers
- RONN
- DisPro
- IUPred - note that under option "Output type" you should choose
"generate plot" rather than the default "raw data only"
- GlobPlot
For each of the sequences, compare the results of the different
predictors. Identify those regions that are predicted by most/all
servers to be either globular or non-globular, and other regions that
are very differently predicted by the different servers. In the next
exercise you will hopefully be able to check the accuracy of some of
your conclusions.
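One simple way to compare the predictors is a per-residue vote. The sketch below assumes you have already transcribed each server's result by hand into a string with one character per residue ('D' for disordered, 'O' for ordered). The predictor names, the strings and the function names are invented for illustration; they are not the output format of any of the servers above.

```python
def consensus_disorder(predictions, min_votes):
    """True at each position where at least `min_votes` predictors say 'D'."""
    calls = list(predictions.values())
    return [
        sum(call == "D" for call in column) >= min_votes
        for column in zip(*calls)
    ]


def regions(flags):
    """Return 1-based inclusive (start, end) ranges where `flags` is True."""
    found = []
    start = None
    for position, flag in enumerate(flags, start=1):
        if flag and start is None:
            start = position
        elif not flag and start is not None:
            found.append((start, position - 1))
            start = None
    if start is not None:
        found.append((start, len(flags)))
    return found


# Hypothetical per-residue calls for a 10-residue toy sequence.
toy_predictions = {
    "predictor_a": "DDDOOOOODD",
    "predictor_b": "DDOOOOOODD",
    "predictor_c": "DDDDOOOOOD",
}

majority = regions(consensus_disorder(toy_predictions, min_votes=2))
unanimous = regions(consensus_disorder(toy_predictions, min_votes=3))
print(majority)   # [(1, 3), (9, 10)]
print(unanimous)  # [(1, 2), (10, 10)]
```

Positions that appear in the majority list but not in the unanimous list are the boundary regions where the predictors disagree and where a closer look is worthwhile.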
Certain protein modules are predicted with high confidence
Certain kinds of protein modules can be predicted with a very high
degree of confidence, for example many globular protein domains such as
protein kinase domains. Use the following webservers that predict such
high-confidence modules to see whether any such modules are present in
your proteins of interest. If such domains are identified in your
sequences, check the annotation of these domains to see whether they
are globular or non-globular in structure, and compare these findings
with the results of the globularity prediction you carried out in the
previous exercise. As we are very confident that a module predicted
using these methods is correct, where we find such modules we can use
them to assess how good the different globularity prediction methods
are.
Follow this link to get the query sequences,
which you should cut-and-paste into the following servers.
- PFAM
- SMART
- CDD
The different prediction servers do not always identify the same
conserved elements in all the different proteins. This is true for at
least one of the proteins used in this exercise. Which ones?
This demonstrates that if you are interested in identifying conserved
modules in your protein of interest you should use the same protein as
a query in several different servers. Often all the servers give the
same/similar results - but as you see here, not always.
The results of these servers give you differing degrees of information
about the conserved modules. Examples of this include information about
the position and phase of splice sites, the set of organisms in which
the module is found, surveys of all the proteins in which the module is
predicted for particular organisms etc.
Use the results of these servers to answer the following questions:
- Is there any evidence for splice junctions to be associated with
module boundaries? (hint - look in particular at pax6)
- How many proteins are predicted to include each of these
different modules in humans?
- Can you think of any ways in which these estimates might be
wrong?
- Are three-dimensional structures known for any of these modules?
Can you find images of these structures?
- Can you find a summary of known functions for the modules?
Predicting protein linear motifs
Other kinds of protein modules can only be predicted with much lower
confidence - for example, most protein linear motifs. The ELM server
maintained by our group provides a resource to help investigate
possible ELMs that may be present in your sequences.
Use the ELM server to identify
possible ELMs present in your sequences.
For each sequence, specify its subcellular localisation and taxon. In
particular, focus on the following motifs
Use the results of the server, and the links above, to address the
following questions:
- What biological activity is associated with these different
motifs?
- Which of your proteins contain these motifs?
- How many other ELMs does the server suggest might be present in
your sequences?
- Why are the motifs colour-coded?
- What is indicated by motifs of the following colours: blue,
grey, blue/red?
- Does the server identify any of the motifs as having been already
demonstrated to be active via experiments?
- Based on the ELMs they contain, would you predict any of these
transcription factors to be bound to genes being repressed (hint -
check whether groucho/TLE is known to be a repressor or an activator of
gene expression)?
You will have noticed that the server suggests that a large number of
different ELMs might be present in your sequences. Most of these are
false positives i.e. are not functional instances of ELMs. ELM provides
several different ways to try and decide which predicted ELMs are more
likely to be true positives (and hence the best candidates for testing
using experimental methods). For example, certain ELMs are restricted
in their taxonomic distribution, and most are relevant only in certain
subcellular contexts (hence you being asked to specify taxon and
subcellular localisation when running the server).
Most ELMs are found outside globular regions. Thus, predicted ELMs
occurring in non-globular regions are better candidates to be true
positives than those in globular regions.
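This filtering step can be sketched in a few lines. The pattern below is a simplified sumoylation-like consensus (hydrophobic residue, lysine, any residue, glutamate) written for illustration and not copied from the ELM database; the sequence, the region coordinates and the function names are likewise made up.

```python
import re


def find_motif_hits(sequence, pattern):
    """Return (start, end, matched_text) using 1-based inclusive coordinates."""
    return [
        (match.start() + 1, match.end(), match.group())
        for match in re.finditer(pattern, sequence)
    ]


def inside_any_region(start, end, regions):
    """True if the span start..end lies entirely within one of the regions."""
    return any(r_start <= start and end <= r_end for r_start, r_end in regions)


# Assumed, simplified pattern; check the ELM entry for the real definition.
toy_pattern = r"[VILMFP]K.E"
toy_sequence = "MAVKTEGSLLIKSEPRRAAVKQEL"
# Hypothetical non-globular regions, e.g. taken from a consensus of predictors.
non_globular = [(1, 8), (18, 24)]

all_hits = find_motif_hits(toy_sequence, toy_pattern)
candidate_hits = [
    hit for hit in all_hits if inside_any_region(hit[0], hit[1], non_globular)
]
print(all_hits)        # [(3, 6, 'VKTE'), (11, 14, 'IKSE'), (20, 23, 'VKQE')]
print(candidate_hits)  # [(3, 6, 'VKTE'), (20, 23, 'VKQE')]
```

The hit at positions 11-14 is dropped because it falls in the region assumed here to be globular.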
Amino-acid residues in non-globular regions that are more strongly
conserved than the surrounding non-globular sequence are also good
candidates to be ELMs - obviously, this is particularly true if many of
the sequences in this region contain the ELM.
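As a rough illustration of "many of the sequences in this region contain the ELM", the sketch below counts how many aligned sequences match a motif pattern within a chosen block of alignment columns. The pattern, the toy alignment, the column range and the function name are all assumptions made for this example.

```python
import re


def motif_conservation(alignment, col_start, col_end, pattern):
    """Fraction of aligned sequences whose residues in columns
    col_start..col_end-1 (0-based, end exclusive) match `pattern`
    once gap characters are removed."""
    compiled = re.compile(pattern)
    matching = 0
    for sequence in alignment:
        segment = sequence[col_start:col_end].replace("-", "")
        if compiled.fullmatch(segment):
            matching += 1
    return matching / len(alignment)


# Assumed, simplified motif pattern (not taken from the ELM database).
toy_pattern = r"[VILMFP]K.E"
toy_alignment = [
    "AS-VKTEGG",
    "ASPIKSEGG",
    "AS-VRTEGG",
    "T--LKQEAG",
]
fraction = motif_conservation(toy_alignment, 3, 7, toy_pattern)
print(fraction)  # 0.75
```

Here three of the four sequences retain the motif; the third has an arginine in place of the lysine and no longer matches.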
Using this information, use the techniques introduced by Rolf in the
previous session to:
- Collect sequences related to your sequence of interest using BLAST
at the EBI searching against either the "UniProt Knowledgebase" or
"UniProtKB/Swiss-Prot" databases, collecting 250 sequences and
alignments. Use the "Download -> Fasta" option to get the fasta
format for the sequences you are interested in aligning.
- Align these sequences using either the EBI MUSCLE server or the EBI
MAFFT server. Cut and paste the sequences obtained from the BLAST
search into the sequence window of the chosen server. Follow the link
to the fasta-format output alignment file, and save this file locally
to your computer. If you have time, try both multiple sequence
alignment programs and see which one gives the better alignment (you
might want to discuss with us during the class how one can judge
alignment quality).
- Download the resulting alignment and view it in clustalX2.
Using the resulting alignments (or, if you don't have time, use the
links below to download pre-prepared alignments for viewing in
clustalx), investigate the pattern of conservation of the different
ELMs predicted to be in non-globular regions of these sequences.
Which of the ELMs do you think are most likely to be functional (and
thus which are the best candidates for testing experimentally?)
If you are having trouble seeing the wood for the trees - i.e. if you
find there are too many predicted ELMs for you to know what to focus on
- try looking at the LIG_EH1 groucho-binding motif, the cyclin and cdk
binding motifs LIG_CYCLIN_1 and MOD_CDK, and the sumoylation site
MOD_SUMO.
Alignments
Patterns of evolution in protein modules
Finally, we want you to look at the patterns of evolution of the
different modules in these protein sequences.
Use the alignments from the previous section, along with predictions
you made using the order/disorder predictors, and the highly-conserved
module prediction using PFAM/SMART/CDD.
For three different types of protein sequence (globular domains, linear
motifs, disordered/unfolded/non-globular regions) compare the
conservation of the amino acid sequences within these alignments.
Consider both whether amino acids within the regions experience many
substitution events (i.e. whether the amino acid residues in the same
column of the alignment tend to be the same/similar or very different
in different sequences) and whether or not the regions tolerate
frequent large or small insertion/deletion events (i.e. whether lots of
sequences in the region are represented only by gap characters).
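The insertion/deletion side of this comparison can be summarised as the fraction of gap characters per column, averaged over each region. The sketch below uses a made-up alignment and hypothetical helper names; the region boundaries are chosen by hand for the example.

```python
def gap_fraction(alignment):
    """Fraction of sequences with a gap ('-') in each alignment column."""
    n_seqs = len(alignment)
    return [column.count("-") / n_seqs for column in zip(*alignment)]


def mean_over(values, start, end):
    """Mean of values[start:end] (0-based columns, end exclusive)."""
    window = values[start:end]
    return sum(window) / len(window)


# Toy alignment (made up): columns 3-5 look like an indel-prone linker.
toy_alignment = [
    "MKV---LLE",
    "MKVAGSLLE",
    "MRV---ILD",
    "MKV--SLLE",
]
indel_profile = gap_fraction(toy_alignment)
print(indel_profile)
# [0.0, 0.0, 0.0, 0.75, 0.75, 0.5, 0.0, 0.0, 0.0]

flank_gaps = mean_over(indel_profile, 0, 3)   # columns 0-2
linker_gaps = mean_over(indel_profile, 3, 6)  # columns 3-5
print(flank_gaps, linker_gaps)
```

A region with a high mean gap fraction tolerates insertions and deletions; a region with few gaps but varied residues is changing mainly by substitution.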
Which of these three different types of sequence is the most conserved?
Can you find cases where you suspect that complete modules are
lost/gained in the course of evolution?
Interpreting the patterns of presence and absence of modules within the
sequences in terms of loss and gain of modules requires at least an
implicit (and, better, an explicit) hypothesis concerning the
phylogenetic relationships of the sequences being studied. To do this
using an explicit hypothesis, you need to estimate a phylogeny for the
sequences. Therefore, if you have time, use clustalx to estimate a
neighbour-joining tree of the different families. To do this either use
the local version of clustalx on your machine, or use the EBI ClustalW
server. Take the modules you suspect of having been lost/gained
during the evolution of the sequences, and map these features onto the
resulting phylogenies (visualise the trees using the Bork group's iTOL [interactive Tree of Life]
server.)
The output of ClustalW probably won't work - so you should use this file as input for iTOL.
PAX alignment
Does this allow you to identify branches on the tree that you believe
are associated with gain/loss events of particular modules?
Think about which kind of assumptions you are making as you infer the
presence of such gain/loss events - for example, concerning the
frequency of certain kinds of evolutionary event.
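One way to make such assumptions explicit is to count the minimum number of gain/loss events a tree requires. The sketch below uses Fitch parsimony, which assumes that gains and losses are equally likely; that is only one possible model and may not suit every module. The tree shape, the sequence names and the presence/absence values are hypothetical.

```python
def fitch(tree, presence):
    """Fitch parsimony for one present/absent character on a rooted binary tree.

    `tree` is either a leaf name (str) or a 2-tuple of subtrees.
    `presence` maps each leaf name to True (module present) or False.
    Returns (candidate ancestral states, minimum number of gain/loss events).
    """
    if isinstance(tree, str):
        return {presence[tree]}, 0
    left, right = tree
    left_states, left_events = fitch(left, presence)
    right_states, right_events = fitch(right, presence)
    shared = left_states & right_states
    if shared:
        # The children agree on at least one state: no extra event needed.
        return shared, left_events + right_events
    # The children disagree: one gain or loss must have happened below here.
    return left_states | right_states, left_events + right_events + 1


# Hypothetical rooted tree: (((seq_a, seq_b), seq_c), seq_d)
toy_tree = ((("seq_a", "seq_b"), "seq_c"), "seq_d")
toy_presence = {"seq_a": True, "seq_b": True, "seq_c": False, "seq_d": False}

root_states, n_events = fitch(toy_tree, toy_presence)
print(root_states, n_events)  # {False} 1
```

In this toy case the most parsimonious explanation is a single gain on the branch leading to seq_a and seq_b. If losses were assumed to be much more frequent than gains, a different scenario could be preferred.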
Pre-calculated Result Pages From Web-Servers
The links below are to pages showing the results of running RONN,
DisProt, IUPred, GlobPlot, PFAM, SMART, CDD, ELM using default values
for the human pax2, pax6, lbx1, and pitx1 sequences.
Some Protein Order/Disorder Predictors
Remember - it is usually safest to input just the "raw" protein sequence, i.e.
"GHYYITS..."
rather than inputting it in FASTA format, i.e. with a description line
">P1248923 protein description
GHYYITS..."
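If you only have a FASTA record, a small helper can strip it down to the raw sequence before pasting. This is a minimal sketch for a single record; the function name is made up, and the example record reuses the placeholder shown above with an invented second line.

```python
def to_raw_sequence(text):
    """Drop FASTA description lines (those starting with '>') and all whitespace.

    Intended for a single record; several records would be concatenated.
    """
    return "".join(
        "".join(line.split())
        for line in text.splitlines()
        if not line.lstrip().startswith(">")
    )


example_record = ">P1248923 protein description\nGHYYITS\nAAKL\n"
print(to_raw_sequence(example_record))  # GHYYITSAAKL
```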
Run the servers using default options unless otherwise indicated
- RONN
- DisProt
- IUPred - note that under option "Output type" you should choose
"generate plot" rather than the default "raw data only"
- GlobPlot
Some Protein Family/Structural Domain Predictors
Linear Motif Investigation
Multiple Sequence Alignment Servers
Phylogenetic Software
- iTOL (interactive
Tree of Life)
- ClustalW
(for drawing NJ trees)
Find Sequences
- ExPASY
- ENTREZ
- GOOGLE - you'd be surprised
how often it's easiest just to google to find the information e.g.
"swissprot src human" as search terms...