Investigating Eukaryotic Linear Motifs (ELMs) using MSAs
Introductory Notes
As we discussed during the presentation, ELMs are important for the
function of many different proteins but, unlike the larger globular
domains, are relatively difficult to identify at the sequence level
given their short length (typically 3 or 4 amino acids).
To make the task of identifying potentially functional ELMs easier, the
ELM server relies on filtering out as many candidate ELMs as possible
on the basis of our knowledge of their context-dependence. For example,
some ELMs only function in the nucleus - thus if you are interested in
ELMs in an extra-cellular protein, you can immediately discount those
that are only functional in the nucleus. Another example of context
dependence is that some ELMs are only functional in certain organisms.
Therefore, when investigating ELMs, it is important to exploit the
knowledge you already possess about your protein of interest.
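The idea of discarding candidates by context can be sketched in a few lines of Python. This is only an illustration of the principle, not how the ELM server is implemented: the motif class names, compartments and taxa below are invented, and `filter_by_context` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MotifClass:
    name: str
    compartments: frozenset  # compartments in which the motif is known to act
    taxa: frozenset          # taxonomic groups in which the motif is known to act


def filter_by_context(motif_classes, compartments, taxon):
    """Keep classes that act in at least one given compartment and in the given taxon."""
    wanted = set(compartments)
    return [m for m in motif_classes
            if m.compartments & wanted and taxon in m.taxa]


# Hypothetical motif classes, invented for illustration only.
CLASSES = [
    MotifClass("TOY_NUCLEAR_SIGNAL", frozenset({"nucleus"}),
               frozenset({"Metazoa", "Fungi"})),
    MotifClass("TOY_SECRETED_SITE", frozenset({"extracellular"}),
               frozenset({"Metazoa"})),
    MotifClass("TOY_SHUTTLE_SITE", frozenset({"nucleus", "cytosol"}),
               frozenset({"Fungi"})),
]

# An extracellular metazoan protein: the nuclear-only classes drop out.
kept = filter_by_context(CLASSES, ["extracellular"], "Metazoa")
print([m.name for m in kept])
```

The more you know about where and in which organism your protein acts, the shorter the list of candidates that remains to be examined.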
As an aside, note that some proteins may operate in several different
contexts. For example, some proteins function in more than one
sub-cellular localisation e.g. shuttling between the
cytoplasm and the nucleus. Other proteins need to function in several
different organismal contexts e.g. the extracellular proteins of the
malaria parasite that function in both the mosquito and their mammalian
hosts. In the case of multiple sub-cellular localisations you can
investigate several at the same time using a single query - however in
the case of multiple organismal contexts, you would need to submit your
query multiple times.
MSAs can help us investigate ELMs because (good) alignments provide a
way of assessing conservation and variation of amino acids at different
sites during evolution. Sites that are important for the function of
the protein evolve more slowly than other sites, and so can be visible
as columns of conserved residues in an alignment. The same approach
used here to look for short functional (and hence conserved) sites in
proteins can therefore be applied to any attempt to identify functional
regions of proteins using an alignment e.g. highly-conserved
active-site residues.
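As a rough illustration of what "columns of conserved residues" means, the sketch below scores each alignment column by the fraction of sequences that carry its most common residue. The scoring scheme is a deliberately simple assumption (real tools use more refined measures), and the toy alignment is invented.

```python
from collections import Counter


def column_conservation(aligned_seqs):
    """Per column, the fraction of sequences carrying the most common residue.

    Gaps never count as the most common residue, so gappy columns score low.
    """
    n = len(aligned_seqs)
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*aligned_seqs):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / n)
    return scores


# Toy alignment, invented for illustration only.
TOY_ALIGNMENT = [
    "MKTLPQLSDE",
    "MRT-PELSDE",
    "MKSLAQLSDE",
    "MQTLPDLTDE",
]

scores = column_conservation(TOY_ALIGNMENT)
fully_conserved = [i for i, s in enumerate(scores) if s == 1.0]
print(fully_conserved)  # 0-based indices of fully conserved columns
```

Columns that score 1.0 are identical in every sequence; a run of high-scoring columns in an otherwise variable region is the kind of signal to look for in this exercise.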
Note, however, that just because a motif is not conserved between
sequences, it does not necessarily follow that the motif is not
functional - it could have appeared in the sequence after the
divergence of the sequence that contains it from the other sequences in
the alignment.
In this exercise we will make use of the ELM server along with MSAs to
explore the presence of ELMs within several sequences, and see how MSAs
are a useful tool in our attempts to identify likely functional ELMs.
Begin with this sequence - the first
step is for you to determine the organismal and sub-cellular context in
which the protein functions.
To do this, locate the UNIPROT entry for the protein by BLASTing the
query sequence at the EBI's BLAST web-server.
Examine the database record whose sequence is the top hit from this
search (which should be essentially identical to your query) to
determine which organism and cellular localisation the protein has.
Proceed to the ELM server - note that you can query the database using
either the SWISSPROT name, ID, or the sequence itself. It should be
faster to use the name/ID; however, sometimes SWISSPROT changes the
name of a sequence, and at other times the server on which the name
lookup depends is not available, so it is important to be able to query
with the sequence itself as well.
Add the information of your query sequence's organism, and sub-cellular
localisation, and submit your query to the ELM database.
The most important information you get from your query is the summary
of features reported by ELM. This shows you the hits to the regular
expressions for the ELMs in the database that meet the organism and
subcellular-localisation criteria (try repeating the query with
different subcellular-localisation constraints to see the effect this
has on the results).
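To make "hits to the regular expressions" concrete, here is a minimal sketch of scanning a protein sequence with a motif pattern. The pattern and the sequence are toy examples written in the style of an ELM regular expression; they are assumptions for illustration and are not copied from the ELM database, so check the database entry for the real pattern of any motif class.

```python
import re


def find_motif_hits(sequence, pattern):
    """Return (start, end, matched_text) for every hit, 1-based and inclusive.

    The pattern is wrapped in a lookahead so that overlapping hits are reported.
    """
    regex = re.compile(f"(?=({pattern}))")
    hits = []
    for match in regex.finditer(sequence):
        text = match.group(1)
        start = match.start() + 1
        hits.append((start, start + len(text) - 1, text))
    return hits


# Hypothetical pattern and sequence, for illustration only.
TOY_PATTERN = r"L[IVLMF].[IVLMF][DE]"
TOY_SEQUENCE = "MSTLLDLDAAPGSLIELEKKT"

hits = find_motif_hits(TOY_SEQUENCE, TOY_PATTERN)
print(hits)
```

Because such patterns are short and degenerate, they match by chance very often, which is why the context filters and the conservation analysis in the rest of this exercise matter.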
As ELMs tend not to be found in globular domains, those regular
expressions that hit in domains are indicated as "Smart filtered" and
are considered less likely to be true than those outside these domains.
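The principle behind this filter can be sketched as a simple interval-overlap test. This is an assumption about the general idea only; the criteria the ELM server actually applies may differ, and the hit names and domain coordinates below are invented.

```python
def split_by_domain_overlap(hits, domains):
    """Split hits into (outside_domains, inside_domains).

    hits:    list of (start, end, name), 1-based inclusive coordinates
    domains: list of (start, end), 1-based inclusive coordinates
    """
    outside, inside = [], []
    for start, end, name in hits:
        overlaps = any(start <= d_end and end >= d_start
                       for d_start, d_end in domains)
        (inside if overlaps else outside).append((start, end, name))
    return outside, inside


# Hypothetical hits and a hypothetical globular domain, for illustration only.
TOY_HITS = [(4, 8, "TOY_MOTIF_A"), (55, 60, "TOY_MOTIF_B"), (118, 121, "TOY_MOTIF_C")]
TOY_DOMAINS = [(40, 110)]

outside, inside = split_by_domain_overlap(TOY_HITS, TOY_DOMAINS)
print("more plausible:", [name for _, _, name in outside])
print("filtered:", [name for _, _, name in inside])
```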
Sometimes, one of the database annotators has examined exactly the same
sequence as you are querying with while annotating an ELM - in that
case, these "known" or "annotated" ELMs are reported to you in purple.
In the opinion of the annotator, there is good, direct, experimental
evidence for this ELM being functional. However, how (apart from doing
experiments) can we investigate whether the other ELMs are "true" - or
at least narrow down the set of ELMs we might want to attempt to verify
experimentally?
The answer is that, as you might have guessed, we can use MSAs.
ELMs typically reside within rapidly-evolving sequence that is
intrinsically disordered i.e. does not, on its own, fold into stable
secondary structure. Such sequence experiences a relatively high
substitution rate - however, as ELMs are important for the function of
the protein, they tend to remain much more resistant to mutations than
the rest of the sequence. Therefore, a good multiple sequence alignment
between a set of sequences that all contain an ELM in the same region
of their sequence should highlight this sequence conservation, and
strengthen our belief that this sequence may be a functional ELM.
At a relatively simple level, ELM already does this - if you query with
a sequence related to one that contains an annotated ELM, and your
sequence also contains a hit to the regular expression of the ELM, then
ELM will report this to you as an "ELM predicted by homology". To see
this in action, query ELM using the following sequences - they have the
same subcellular localisation as the initial sequence.
However, in the absence of curated instances in closely-related
proteins, you have to create a MSA to investigate the conservation of
the motifs (although an automated approach to carrying out such an
analysis is currently being worked on in our lab).
With this in mind, collect a set of sequences related to the initial
one above, and create a multiple sequence alignment from them.
Examine the resulting alignment:
- locate those regions of the initial query sequence that are
identified by ELM as potential motifs (i.e. those motifs that are not
filtered out by the ELM server)
- check whether other sequences in the alignment share these motifs
- if the alignment is good, the corresponding regions of these
sequences should be aligned with the equivalent regions in the other
sequences. However, it is typically difficult to align the regions of
sequences that contain ELMs, so you may well need to manually adjust
the alignment before deciding whether or not the ELM motifs are present
in the related sequences.
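The checks in the list above can also be done outside the alignment viewer. The sketch below searches each ungapped sequence for a motif pattern and reports the alignment column at which each hit starts, so you can see at a glance whether a motif is present in each sequence and whether the hits line up. The alignment, the sequence names and the pattern are all invented for illustration.

```python
import re


def motif_columns(aligned_seq, pattern):
    """0-based alignment columns at which motif hits start in one aligned sequence."""
    # Column index of every non-gap residue, in sequence order.
    col_of_residue = [i for i, c in enumerate(aligned_seq) if c != "-"]
    ungapped = aligned_seq.replace("-", "")
    return [col_of_residue[m.start()]
            for m in re.finditer(f"(?=({pattern}))", ungapped)]


def motif_presence(alignment, pattern):
    """Map each sequence name to the alignment columns where the motif starts."""
    return {name: motif_columns(seq, pattern) for name, seq in alignment.items()}


# Hypothetical alignment and pattern, for illustration only.
TOY_ALIGNMENT = {
    "query":  "MSTLLDLDAAPG--SK",
    "homolA": "MSSLIDLEAAPGTTSK",
    "homolB": "M--LLNLDSSPG--SR",
    "homolC": "MSTAADAAAAPG--SK",
}
TOY_PATTERN = r"L[IVLMF].[IVLMF][DE]"

presence = motif_presence(TOY_ALIGNMENT, TOY_PATTERN)
for name, columns in presence.items():
    print(name, columns)
```

Sequences whose hits start at the same column have the motif aligned; a sequence with a hit at a different column is a candidate for manual adjustment of the alignment, and a sequence with no hit at all may have lost (or never gained) the motif.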
Use the CLUSTALX Edit->Search for String option to help find the
instances of the different motifs in the query sequence (note that this
will not work if the alignment contains gaps in the middle of the
sequences).
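If gaps get in the way of a plain string search, one workaround outside CLUSTALX is to search the aligned sequence with a pattern that tolerates gap characters between residues. The following is a minimal sketch of that idea; `find_in_gapped` is a hypothetical helper and the sequence is invented.

```python
import re


def find_in_gapped(aligned_seq, query):
    """0-based alignment columns where `query` starts, ignoring gaps inside the match."""
    # Allow any number of gap characters between consecutive query residues.
    pattern = "-*".join(re.escape(ch) for ch in query)
    return [m.start() for m in re.finditer(f"(?=({pattern}))", aligned_seq)]


# Hypothetical aligned sequence with a gap inside the motif.
TOY_ALIGNED = "MSTLL--DLDAAPG"

print(TOY_ALIGNED.find("LLDLD"))             # plain search fails because of the gap
print(find_in_gapped(TOY_ALIGNED, "LLDLD"))  # gap-tolerant search finds the motif
```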
Q Is the annotated ELM successfully aligned?
Q If so, is it present in all the sequences?
Q If it is missing from some of the sequences, speculate on the
evolution of the ELM within this group of sequences (i.e. where it is
present and absent, when it might have been lost or gained, possible
functional reasons for its loss/gain)
Q Consider the same set of questions for the other ELMs which are
suggested as possibly present in the sequence
Note that, as far as I am aware, the different MSA creation software is
trained on alignments of globular domains, rather than on ELMs - this
is part of the reason why MSAs of such regions often fail to correctly
align these motifs. With this in mind, create an alignment of these
sequences again, this time using MAFFT and MUSCLE.
Q Do these programmes do a better or a worse job than CLUSTALX at
correctly aligning the motifs?
Extra Exercises
Here are some additional exercises for you to try out if you get
through the ones above with time to spare (don't go through them in
order - just pick out those that sound more interesting to you).
1) Here are some additional sequences for you to examine for ELMs using
MSAs (in the same way as in the main exercise above)
P53_HUMAN
In particular look into the motifs LIG_CYCLIN_1, LIG_MDM2 and
MOD_PIKK_1
EPN1_HUMAN
In particular look at the LIG_AP2alpha_2, LIG_Clathr_ClatBox_1, LIG_EH
and LIG_PIP2_ENTH_1 motifs - note the conservation of motifs in the
same cellular process i.e. endocytosis
2) An additional set of exercises that focus on the ELM server can be
found under this link - if you want to explore the resource in more
detail, try carrying out this set of exercises.
3) ELM is just one of many resources available to aid in the
identification of potentially functional ELMs - although it might well
be the most sophisticated resource of this kind that covers a wide
range of different functional site types, ranging from phosphorylation
sites to nuclear export signals. Most of the other resources focus on
identifying/predicting specific types of ELMs e.g. one is dedicated to
identifying XXXX
To explore some of these other resources, investigate the following
sequences:
Some servers to try:
Note that throughout these exercises the following formatting is used
to specify different types of text:
Bold non-italic text like this gives you instructions about tasks you
should carry out e.g. "View the following webpage"
Italic text specifies questions for you to answer
Back to Gibson Team course pages at EMBL.