Investigating Eukaryotic Linear Motifs (ELMs) using MSAs

Introductory Notes

As we discussed during the presentation, ELMs are important for the function of many different proteins but, unlike the larger globular domains, are relatively difficult to identify at the sequence level given their short length (typically 3 or 4 amino acids).

To make the task of identifying potentially functional ELMs easier, the ELM server filters out as many candidate ELMs as possible on the basis of what is known about their context dependence. For example, some ELMs only function in the nucleus, so if you are interested in ELMs in an extracellular protein you can immediately discount those. Another example of context dependence is that some ELMs are only functional in certain organisms. When investigating ELMs, it is therefore important to exploit the knowledge you already possess about your protein of interest.

As an aside, note that some proteins may operate in several different contexts. Some proteins function in more than one sub-cellular localisation e.g. shuttling between the cytoplasm and the nucleus. Other proteins need to function in several different organismal contexts e.g. the extracellular proteins of the malaria parasite, which function in both the mosquito and the mammalian host. In the case of multiple sub-cellular localisations you can investigate several at the same time using a single query; in the case of multiple organismal contexts, however, you would need to submit your query multiple times.

The reason why MSAs can help us investigate ELMs is that (good) alignments provide a way of assessing conservation and variation of amino acids at different sites during evolution. Sites that are important for the function of the protein evolve more slowly than other sites, and so can be visible as columns of conserved residues in an alignment. The same approach used here to look for short functional (and hence conserved) sites in proteins can therefore be applied to any attempt to identify functional regions of proteins using an alignment e.g. highly-conserved active-site residues.
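To make the idea of "columns of conserved residues" concrete, here is a minimal Python sketch that scores each alignment column by the fraction of sequences carrying the most common residue. The toy alignment and the function name column_conservation are invented for illustration; real analyses usually use more refined scores (e.g. ones that account for similar amino acids).

```python
from collections import Counter

def column_conservation(aligned_seqs):
    """For each column, return the fraction of sequences that carry the
    most common residue in that column. Gaps never count as agreeing."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for col in range(length):
        residues = [s[col] for s in aligned_seqs if s[col] != "-"]
        if not residues:
            # Column made up entirely of gaps
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(aligned_seqs))
    return scores

# Toy alignment (hypothetical sequences, not taken from the exercise)
alignment = [
    "MKRSLDFA",
    "MKRTLEFA",
    "MQRALDF-",
    "MKRPLNFG",
]
scores = column_conservation(alignment)
for position, score in enumerate(scores, start=1):
    print(position, f"{score:.2f}")
```

Columns scoring 1.00 are identical in every sequence; a run of high-scoring columns surrounded by low-scoring ones is the kind of pattern you will be looking for in the exercise.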

Note, however, that just because a motif is not conserved between sequences, it does not necessarily follow that the motif is not functional - it could have appeared in the sequence after the divergence of the sequence that contains it from the other sequences in the alignment.

In this exercise we will use the ELM server along with MSAs to explore the presence of ELMs within several sequences, and see how MSAs are a useful tool in our attempts to identify likely functional ELMs.

Begin with this sequence - the first step is for you to determine the organismal and sub-cellular context in which the protein functions.

To do this, locate the UNIPROT entry for the protein by BLASTing the query sequence at the EBI's BLAST web-server.

Examine the database record whose sequence is the top hit from this search (which should be essentially identical to your query) to determine which organism and cellular localisation the protein has.

Proceed to the ELM server. Note that you can query the database using either the SWISSPROT name, the ID, or the sequence itself. It should be faster to use the name/ID; however, SWISSPROT sometimes changes the name of a sequence, and at other times the server on which the name lookup depends is not available, so it is important to be able to query with the sequence itself as well.

Enter your query sequence's organism and sub-cellular localisation, and submit your query to the ELM database.

The most important information you get from your query is the summary of features reported by ELM. This shows you the hits to the regular expressions for the ELMs in the database that meet the organism and subcellular-localisation criteria (try repeating the query with different subcellular-localisation constraints to see the effect this has on the results).
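To see what a "hit to a regular expression" amounts to, the following Python sketch scans a sequence with a motif pattern and reports the position of every match. The pattern and the sequence are hypothetical examples written for this illustration, not entries copied from the ELM database, and the function name find_motif_hits is likewise made up.

```python
import re

def find_motif_hits(sequence, pattern):
    """Return (start, end, matched_text) for every hit, using 1-based
    inclusive coordinates. Overlapping hits are reported too."""
    hits = []
    # Wrapping the pattern in a lookahead lets the scan restart at every
    # residue, so a hit that overlaps an earlier one is not skipped.
    for m in re.finditer(f"(?=({pattern}))", sequence):
        hits.append((m.start(1) + 1, m.end(1), m.group(1)))
    return hits

# Hypothetical pattern and sequence, for illustration only
toy_pattern = r"[RK].L.{0,1}[FYLIVMP]"
toy_sequence = "MSTKRLLFDAAPRKLSY"

hits = find_motif_hits(toy_sequence, toy_pattern)
for start, end, text in hits:
    print(start, end, text)
```

For this toy input the scan reports three hits (KRLLF at 4-8, RLLF at 5-8 and RKLSY at 13-17). Note how short, permissive patterns match easily by chance, which is why the context filters and the conservation analysis below matter.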

As ELMs tend not to be found in globular domains, those regular expressions that hit in domains are indicated as "Smart filtered" and are considered less likely to be true than those outside these domains.
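The idea behind this kind of filtering can be sketched in a few lines of Python: given the positions of the regular-expression hits and of the globular domains, flag every hit that overlaps a domain. This is only an illustration under the assumption that any overlap counts; the exact rule the ELM server applies is not described here, and the coordinates and the function name flag_hits_in_domains are invented.

```python
def flag_hits_in_domains(hits, domains):
    """Label each (start, end) hit as 'filtered' if it overlaps any
    (start, end) domain, otherwise 'retained'.
    All coordinates are 1-based and inclusive."""
    labelled = []
    for h_start, h_end in hits:
        # Two inclusive ranges overlap if each starts before the other ends
        inside = any(h_start <= d_end and h_end >= d_start
                     for d_start, d_end in domains)
        labelled.append((h_start, h_end, "filtered" if inside else "retained"))
    return labelled

# Hypothetical coordinates, for illustration only
toy_hits = [(12, 16), (48, 53), (120, 124)]
toy_domains = [(40, 110)]

labelled = flag_hits_in_domains(toy_hits, toy_domains)
for start, end, status in labelled:
    print(start, end, status)
```

Here only the hit at 48-53 falls within the domain at 40-110 and is flagged; the other two are retained.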

Sometimes, one of the database annotators has examined exactly the same sequence as you are querying with while annotating an ELM - in that case, these "known" or "annotated" ELMs are reported to you in purple. In the opinion of the annotator, there is good, direct, experimental evidence for this ELM being functional. However, how (apart from doing experiments) can we investigate whether the other ELMs are "true" - or at least narrow down the set of ELMs we might want to attempt to verify experimentally?

The answer is that, as you might have guessed, we can use MSAs.

ELMs typically reside within rapidly-evolving sequence that is intrinsically disordered i.e. does not, on its own, fold into stable secondary structure. Such sequence experiences a relatively high substitution rate - however, as ELMs are important for the function of the protein, they tend to remain much more resistant to mutations than the rest of the sequence. Therefore, a good multiple sequence alignment between a set of sequences that all contain an ELM in the same region of their sequence should highlight this sequence conservation, and strengthen our belief that this sequence may be a functional ELM.

At a relatively simple level, ELM already does this - if you query with a sequence related to one that contains an annotated ELM, and your sequence also contains a hit to the regular expression of the ELM, then ELM will report this to you as an "ELM predicted by homology". To see this in action, query ELM using the following sequences - they have the same subcellular localisation as the initial sequence.
However, in the absence of curated instances in closely-related proteins, you have to create an MSA to investigate the conservation of the motifs (although an automated approach to carrying out such an analysis is currently being worked on in our lab).

With this in mind, collect a set of sequences related to the initial one above, and create a multiple sequence alignment from them.

Examine the resulting alignments:
Use the CLUSTALX Edit->Search for String option to help find the instances of the different motifs in the query sequence (note that this will not work if the alignment contains gaps in the middle of the sequences...)
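If gaps get in the way of a plain string search, one workaround is to search the ungapped sequence and then translate the hit back into alignment columns. The Python sketch below does this for a single aligned row; the row, the motif and the function name find_in_gapped are made up for the example, and "-" and "." are assumed to be the only gap characters.

```python
import re

GAP_CHARS = "-."

def find_in_gapped(aligned_seq, pattern):
    """Search the ungapped sequence for pattern and return each hit as
    (first_column, last_column, matched_text), with 1-based columns."""
    # columns[i] is the alignment column of the i-th ungapped residue
    columns = [i + 1 for i, ch in enumerate(aligned_seq) if ch not in GAP_CHARS]
    ungapped = "".join(ch for ch in aligned_seq if ch not in GAP_CHARS)
    hits = []
    for m in re.finditer(pattern, ungapped):
        if m.end() == m.start():
            continue  # ignore empty matches
        hits.append((columns[m.start()], columns[m.end() - 1], m.group(0)))
    return hits

# Hypothetical aligned row: the motif SLDF is interrupted by a gap
row = "MK--RSL-DFAP"
print("SLDF" in row)            # a plain string search misses it
gapped_hits = find_in_gapped(row, "SLDF")
print(gapped_hits)
```

In this toy row the motif spans columns 6 to 10 of the alignment, even though the literal string SLDF does not occur in the gapped row.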

Q Is the annotated ELM successfully aligned?

Q If so, is it present in all the sequences?

Q If it is missing from some of the sequences, speculate on the evolution of the ELM within this group of sequences (i.e. where it is present and absent, when it might have been lost or gained, possible functional reasons for its loss/gain).

Q Consider the same set of questions for the other ELMs which are suggested as possibly present in the sequence
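The questions above can also be approached with a short script. The Python sketch below reports, for each sequence in an alignment, the alignment columns in which a motif pattern matches, so you can see at a glance where the motif is aligned, where it is shifted and where it is absent. The alignment, the pattern and the function names are all hypothetical, and only "-" is treated as a gap.

```python
import re
from collections import Counter

def motif_columns(aligned_seq, pattern):
    """Return the (first_column, last_column) span of every motif hit in
    one aligned row, searching the ungapped sequence. Columns are 1-based."""
    columns = [i + 1 for i, ch in enumerate(aligned_seq) if ch != "-"]
    ungapped = aligned_seq.replace("-", "")
    return sorted({(columns[m.start()], columns[m.end() - 1])
                   for m in re.finditer(pattern, ungapped)
                   if m.end() > m.start()})

def motif_presence(alignment, pattern):
    """Map each sequence name to the column spans of its motif hits."""
    return {name: motif_columns(row, pattern) for name, row in alignment.items()}

# Hypothetical alignment and pattern, for illustration only
toy_alignment = {
    "seq_A": "MKRSLDFAP-TT",
    "seq_B": "MKRSLEFAPQTT",
    "seq_C": "M-KRTLDFPQTT",
    "seq_D": "MKRSADWAPQTT",
}
toy_pattern = r"[ST]L[DE]F"

presence = motif_presence(toy_alignment, toy_pattern)
for name, spans in presence.items():
    print(name, spans if spans else "motif absent")

# Which column span is shared by the most sequences?
span_counts = Counter(span for spans in presence.values() for span in spans)
print(span_counts.most_common(1))
```

In this toy case seq_A and seq_B carry the motif in the same columns, seq_C carries it one column out of register (a possible misalignment rather than a true difference), and seq_D lacks it altogether.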

Note that, as far as I am aware, the different MSA creation software is trained on alignments of globular domains, rather than on ELMs - this is part of the reason why MSAs of such regions often fail to correctly align these motifs. With this in mind, create an alignment of these sequences again, this time using MAFFT and MUSCLE.

Q Do these programmes do a better or a worse job than CLUSTALX at correctly aligning the motifs?
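If you prefer to run MAFFT and MUSCLE locally rather than through a web server, the Python sketch below builds typical command lines for them. It assumes both programmes are installed on your machine; the file names are hypothetical, and the MUSCLE options differ between version 3 and version 5, so check the documentation of the version you have.

```python
def mafft_command(in_fasta):
    """Command line for MAFFT with automatic strategy selection.
    MAFFT writes the alignment to standard output."""
    return ["mafft", "--auto", in_fasta]

def muscle_command(in_fasta, out_fasta, version=5):
    """Command line for MUSCLE; the option names depend on the version."""
    if version >= 5:
        return ["muscle", "-align", in_fasta, "-output", out_fasta]
    return ["muscle", "-in", in_fasta, "-out", out_fasta]

# Hypothetical file names, for illustration only
unaligned = "related_sequences.fasta"
print(" ".join(mafft_command(unaligned)), "> mafft_alignment.fasta")
print(" ".join(muscle_command(unaligned, "muscle_alignment.fasta")))
print(" ".join(muscle_command(unaligned, "muscle_alignment.fasta", version=3)))
```

The printed commands can be pasted into a terminal, or passed to subprocess.run (redirecting standard output to a file in the case of MAFFT). The resulting alignments can then be examined in the same way as the CLUSTALX one.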

Extra Exercises

Here are some additional exercises for you to try out if you get through the ones above with time to spare (don't go through them in order - just pick out those that sound more interesting to you).

1) Here are some additional sequences for you to examine for ELMs using MSAs (in the same way as in the main exercise above)

In particular look into the motifs LIG_CYCLIN_1, LIG_MDM2 and MOD_PIKK_1

In particular look at the LIG_AP2alpha_2 motifs, LIG_Clathr_ClatBox_1, LIG_EH and LIG_PIP2_ENTH_1 motifs - note the conservation of motifs in the same cellular process i.e. endocytosis

2) An additional set of exercises that focus on the ELM server can be found under this link - if you want to explore the resource in more detail, try carrying out this set of exercises.

3) ELM is just one of many resources available to aid in the identification of potentially functional ELMs, although it might well be the most sophisticated resource of this kind that covers a wide range of different functional site types, ranging from phosphorylation sites to nuclear export signals. Most of the other resources focus on identifying/predicting specific types of ELMs e.g. one is dedicated to identifying XXXX

To explore some of these other resources, investigate the following sequences:
Some servers to try:

Note that throughout these exercises the following formatting is used to specify different types of text:

Bold non-italic text like this gives you instructions about tasks you should carry out e.g. "View the following webpage"

Italic text specifies questions for you to answer

Back to Gibson Team course pages at EMBL.