Evolution and Protein Modular Architecture

EMBL PreDocs 2007

Tuesday October 30th - Wednesday October 31st 2007

Aidan Budd and Toby Gibson


Introduction

Our aim is to demonstrate to you some important software tools used to investigate protein modular architecture. By this we mean the investigation of (potentially multi-domain) protein sequences to identify regions (modules) of the proteins that may be associated with particular functions. 

Predicting functions of a protein is the focus of many (perhaps most) bioinformatic tools for analysing protein sequences. The typical use case for these tools (at least if the user is a "typical" wet-lab scientist) is that the scientist has identified a protein relevant to their area of study, about which little is known. To understand the function of this protein in more detail, they will need to do wet-lab experiments. Bioinformatic prediction provides hypotheses about the function of the protein that can then be tested in these experiments. For example, the protein may be predicted to contain a kinase domain, in which case an obvious experiment would be to confirm, using a biochemical experiment, that the protein does indeed have such activity.

Different modules (and different regions of the same module) are under different evolutionary constraints. Thus, different protein modules show different patterns of variation over evolutionary time.

In the previous session with Rolf Apweiler, you will have been shown how to identify proteins from other organisms (orthologs) and from the same organism (paralogs) that are closely related to your protein of interest. Additionally, you will have seen how to create a multiple sequence alignment (MSA) of related protein sequences. Looking at protein sequences in an MSA provides valuable insights into the evolution of the sequences - put simply, regions of the proteins that are under stronger constraints are less variable between sequences than regions under weaker constraints.

In these exercises you will look at several proteins you have already encountered during the course: pax6, pax2, lbx1 and pitx1. The sequences you will begin with are all from humans. You will predict the modular architecture of these proteins, and then examine MSAs of these proteins to investigate the evolutionary patterns of these different modules.

Open this link in a new browser window to access the sequences we will be looking at in these exercises.

Predicting non-globular/globular regions of protein sequences

As Toby discussed in his presentation, the first step in analysing the modular architecture of a protein is to predict which regions are globular and non-globular. Globular regions have very different structures - and hence functions - from non-globular regions. Thus, if we are interested in predicting function, it is important that we distinguish between these two types of sequence.

To predict globularity/non-globularity (order/disorder) we recommend that you use several different predictors of globularity, and compare the results of the different analyses.

In some cases, the predictions from the different methods all agree about the boundaries of ordered/disordered regions in your proteins - in which case it is easy to decide where you expect the boundaries to be. In other cases, the different methods disagree. If so, either spend some time learning about the differences between the methods, to try and account for the observed differences, or go to a friendly bioinformatician for help (we're very happy to help with problems of this kind - come to see us anytime during your time here at EMBL).

Run the following order/disorder (globular/non-globular) prediction servers using default options unless otherwise indicated. Begin with the human pax2 sequence, then, if you have time, also run the other sequences provided. Alongside the links to the webservers, we have also given you the output to expect from the servers (in case the servers are slow/not working while we carry out the exercise).

Follow this link to get the query sequences, which you should cut-and-paste into the following servers.

For each of the sequences, compare the results of the different predictors. Identify those regions that are predicted by most/all servers to be either globular or non-globular, and other regions that are very differently predicted by the different servers. In the next exercise you will hopefully be able to check the accuracy of some of your conclusions.
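
One way to make this comparison systematic is to reduce each server's output to a per-residue track and look for positions where the tracks agree. The sketch below is only an illustration under stated assumptions: it assumes you have already converted each predictor's output by hand into a string with one character per residue ('D' for predicted disordered/non-globular, 'O' for ordered/globular). The function names and the track format are hypothetical and are not produced by any of the servers.

```python
def consensus(predictions):
    """Combine several per-residue order/disorder tracks.

    predictions: dict mapping a predictor name to a string of equal length,
    with 'D' for predicted disordered and 'O' for predicted ordered.
    Returns a string with 'D' or 'O' where all predictors agree and '?'
    where at least one predictor disagrees.
    """
    tracks = list(predictions.values())
    length = len(tracks[0])
    if any(len(track) != length for track in tracks):
        raise ValueError("all prediction tracks must have the same length")
    result = []
    for column in zip(*tracks):
        # A column is unanimous if it contains only one distinct state.
        result.append(column[0] if len(set(column)) == 1 else "?")
    return "".join(result)


def segments(track):
    """Collapse a track into (state, start, end) tuples, 1-based inclusive."""
    out = []
    start = 0
    for i in range(1, len(track) + 1):
        if i == len(track) or track[i] != track[start]:
            out.append((track[start], start + 1, i))
            start = i
    return out


# Toy tracks, invented for illustration only (not real server output).
toy = {"predictor_a": "DDDOOOOODD", "predictor_b": "DDOOOOOODD"}
for state, first, last in segments(consensus(toy)):
    print(state, first, last)
```

Segments marked '?' are the regions where the predictors disagree, and these are the ones worth checking against the domain predictions in the next exercise.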

Certain protein modules are predicted with high confidence

Certain kinds of protein modules can be predicted with a very high degree of confidence, for example many globular protein domains such as protein kinase domains. Use the following webservers that predict such high-confidence modules to see whether any such modules are present in your proteins of interest. If such domains are identified in your sequences, check the annotation of these domains to see whether they are globular or non-globular in structure, and compare these findings with the results of the globularity prediction you carried out in the previous exercise. As we are very confident that a module predicted using these methods is correct, where we find such modules we can use them to assess how good the different globularity prediction methods are.

Follow this link to get the query sequences, which you should cut-and-paste into the following servers.

The different prediction servers do not always identify the same conserved elements in all the different proteins. This is true for at least one of the proteins used in this exercise. Which ones?

This demonstrates that if you are interested in identifying conserved modules in your protein of interest you should use the same protein as a query in several different servers. Often all the servers give the same/similar results - but as you see here, not always.

The results of these servers give you differing degrees of information about the conserved modules. Examples of this include information about the position and phase of splice sites, the set of organisms in which the module is found, surveys of all the proteins in which the module is predicted for particular organisms etc.

Use the results of these servers to answer the following questions:

Predicting protein linear motifs

Other kinds of protein modules can only be predicted with much lower confidence - for example, most protein linear motifs. The ELM server maintained by our group provides a resource to help investigate possible ELMs that may be present in your sequences.

Use the ELM server to identify possible ELMs present in your sequences. For each sequence, specify its subcellular localisation and taxon. In particular, focus on the following motifs

Use the results of the server, and the links above, to address the following questions:

You will have noticed that the server suggests that a large number of different ELMs might be present in your sequences. Most of these are false positives, i.e. they are not functional instances of ELMs. ELM provides several different ways to try and decide which predicted ELMs are more likely to be true positives (and hence the best candidates for testing using experimental methods). For example, certain ELMs are restricted in their taxonomic distribution, and most are relevant only in certain subcellular contexts (hence you being asked to specify taxon and subcellular localisation when running the server).

Most ELMs are found outside globular regions. Thus, predicted ELMs occurring in non-globular regions are better candidates to be true positives than those in globular regions.
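
This filtering step can be sketched in a few lines of code. The example below is a minimal illustration, assuming you have transcribed the predicted motif positions and your own non-globular region boundaries by hand. The function names are hypothetical, the coordinates are invented, and nothing here reflects the ELM server's own output format or filtering logic.

```python
def in_any_region(start, end, regions):
    """True if the span start..end (1-based, inclusive) lies wholly inside a region."""
    return any(r_start <= start and end <= r_end for r_start, r_end in regions)


def rank_motif_hits(hits, nonglobular_regions):
    """Annotate motif hits and list those in non-globular regions first.

    hits: list of (motif_name, start, end) tuples.
    nonglobular_regions: list of (start, end) tuples.
    Returns a list of (motif_name, start, end, in_nonglobular) tuples.
    """
    annotated = [
        (name, start, end, in_any_region(start, end, nonglobular_regions))
        for name, start, end in hits
    ]
    # sorted() is stable, so the original order is kept within each group.
    return sorted(annotated, key=lambda hit: not hit[3])


# Invented coordinates, for illustration only.
toy_hits = [("MOD_SUMO", 50, 53), ("LIG_EH1", 120, 127)]
toy_regions = [(100, 200)]
for hit in rank_motif_hits(toy_hits, toy_regions):
    print(hit)
```

A hit that only partly overlaps a non-globular region is treated here as lying outside it. That is a deliberately strict choice, and you may prefer a looser rule when region boundaries are uncertain.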

Amino-acid residues in non-globular regions that are more strongly conserved than the surrounding non-globular sequence are also good candidates to be ELMs - obviously, this is particularly true if many of the sequences in this region contain the ELM.

Using this information, use the techniques introduced by Rolf in the previous session to:
  1. Collect sequences related to your sequence of interest using BLAST at the EBI, searching against either the "UniProt Knowledgebase" or "UniProtKB/Swiss-Prot" databases and collecting 250 sequences and alignments. Use the "Download -> Fasta" option to get the fasta format for the sequences you are interested in aligning.
  2. Align these sequences using either the EBI MUSCLE server or the EBI MAFFT server: cut and paste the sequences obtained from the BLAST search into the sequence window of the server. If you have time, try both multiple sequence alignment programs and see which one gives the better alignment (you might want to discuss with us during the class how one can judge alignment quality). Follow the link to the fasta-format output alignment file, and save this file locally to your computer.
  3. Download the resulting alignment and view it in clustalX2.

Using the resulting alignments (or, if you don't have time, use the links below to download pre-prepared alignments for viewing in clustalx), investigate the pattern of conservation of the different ELMs predicted to be in non-globular regions of these sequences.

Which of the ELMs do you think are most likely to be functional (and thus which are the best candidates for testing experimentally?)

If you are having trouble seeing the wood for the trees - i.e. if you find there are too many predicted ELMs for you to know what to focus on - try looking at the LIG_EH1 groucho-binding motif, the cyclin and cdk binding motifs LIG_CYCLIN_1 and MOD_CDK, and the sumoylation site MOD_SUMO.

Alignments


Patterns of evolution in protein modules

Finally, we want you to look at the patterns of evolution of the different modules in these protein sequences.

Use the alignments from the previous section, along with predictions you made using the order/disorder predictors, and the highly-conserved module prediction using PFAM/SMART/CDD.

For three different types of protein sequence (globular domains, linear motifs, disordered/unfolded/non-globular regions), compare the conservation of the amino acid sequences within these alignments. Consider both whether amino acids within the regions experience many substitution events (i.e. whether the amino acid residues in the same column of the alignment tend to be the same/similar or very different in different sequences) and whether or not the regions tolerate frequent large or small insertion/deletion events (i.e. whether lots of sequences in the region are represented only by gap characters).
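
If you would like numbers to accompany what you see in the alignment viewer, both properties can be summarised per column. The following is a minimal sketch, assuming the alignment has already been read into a list of equal-length strings with '-' as the gap character. The function name is hypothetical, and the identity score is a simple stand-in that counts only identical residues and ignores similarity between amino acids.

```python
from collections import Counter


def column_stats(alignment):
    """Summarise each alignment column as (gap_fraction, identity).

    alignment: list of equal-length aligned sequences, '-' marks a gap.
    gap_fraction: fraction of sequences with a gap in the column.
    identity: fraction of the non-gap residues that match the most common
    residue in the column (0.0 if the column contains only gaps).
    """
    n_seqs = len(alignment)
    stats = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        gap_fraction = (n_seqs - len(residues)) / n_seqs
        if residues:
            top_count = Counter(residues).most_common(1)[0][1]
            identity = top_count / len(residues)
        else:
            identity = 0.0
        stats.append((gap_fraction, identity))
    return stats


# Toy alignment, invented for illustration only.
toy_alignment = ["MKV-", "MRV-", "MKVA"]
for position, (gaps, identity) in enumerate(column_stats(toy_alignment), start=1):
    print(position, round(gaps, 2), round(identity, 2))
```

Columns with a high gap fraction point to insertion/deletion events, while columns with low identity but few gaps point to substitutions.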

Which of these three different types of sequence is the most conserved?

Can you find cases where you suspect that complete modules are lost/gained in the course of evolution?

Interpreting the patterns of presence and absence of modules within the sequences in terms of loss and gain of modules requires at least an implicit (and, better, an explicit) hypothesis concerning the phylogenetic relationships of the sequences being studied. To do this using an explicit hypothesis, you need to estimate a phylogeny for the sequences. Therefore, if you have time, use clustalx to estimate a neighbour-joining tree of the different families. To do this, either use the local version of clustalx on your machine, or use the EBI ClustalW server. Take the modules you suspect of having been lost/gained during the evolution of the sequences, and map these features onto the resulting phylogenies (visualise the trees using the Bork group's iTOL [interactive Tree of Life] server).

The output of ClustalW probably won't work - so you should use this file as input for iTOL. PAX alignment

Does this allow you to identify branches on the tree that you believe are associated with gain/loss events of particular modules?

Think about which kind of assumptions you are making as you infer the presence of such gain/loss events - for example, concerning the frequency of certain kinds of evolutionary event.
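
To make one such assumption concrete, the sketch below counts the minimum number of gain/loss events needed to explain a presence/absence pattern on a given rooted tree, using Fitch parsimony. This is not part of the exercise or of any of the servers above; it is a swapped-in illustration. It assumes a strictly binary tree written as nested tuples, hypothetical leaf names, and that gains and losses are equally likely, which is itself exactly the kind of assumption worth questioning.

```python
def fitch(tree, present):
    """Fitch parsimony for a single presence/absence character.

    tree: a leaf name (str) or a 2-tuple of subtrees (rooted, binary).
    present: dict mapping each leaf name to True (module present) or False.
    Returns (possible_states_at_this_node, minimum_number_of_changes).
    """
    if isinstance(tree, str):
        return {present[tree]}, 0
    left_states, left_changes = fitch(tree[0], present)
    right_states, right_changes = fitch(tree[1], present)
    shared = left_states & right_states
    if shared:
        # Both subtrees can share a state: no extra change needed here.
        return shared, left_changes + right_changes
    # No shared state: at least one gain or loss on a branch below this node.
    return left_states | right_states, left_changes + right_changes + 1


# Hypothetical four-sequence tree and module pattern, for illustration only.
toy_tree = (("seqA", "seqB"), ("seqC", "seqD"))
toy_present = {"seqA": True, "seqB": True, "seqC": False, "seqD": False}
root_states, changes = fitch(toy_tree, toy_present)
print(root_states, changes)
```

A different weighting, for example treating the loss of a module as more frequent than its gain, could favour a different history for the same tree.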

Pre-calculated Result Pages From Web-Servers

The links below are to pages showing the results of running RONN, DisProt, IUPred, GlobPlot, PFAM, SMART, CDD and ELM using default values for the human pax2, pax6, lbx1, and pitx1 sequences.




Some Protein Order/Disorder Predictors

Remember - it is usually safest to input just the "raw" protein sequence, i.e.

"GHYYITS..." 

rather than inputting it in FASTA format, i.e. with a description line

">P1248923 protein description
GHYYITS..." 
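
If you have many sequences to prepare, stripping the description line can be done with a few lines of code rather than by hand. This is a minimal sketch, assuming the input holds a single FASTA record. The function name is hypothetical.

```python
def fasta_to_raw(text):
    """Return the raw sequence from a single-record FASTA string.

    Description lines (starting with '>') and blank lines are dropped, and
    the remaining lines are joined into one string.
    """
    lines = [line.strip() for line in text.splitlines()]
    if sum(line.startswith(">") for line in lines) > 1:
        raise ValueError("expected a single FASTA record")
    return "".join(line for line in lines if line and not line.startswith(">"))


# Toy record, invented for illustration only.
print(fasta_to_raw(">example description line\nGHYY\nITS\n"))
```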


Run the servers using default options unless otherwise indicated

Some Protein Family/Structural Domain Predictors


Linear Motif Investigation


Multiple Sequence Alignment Servers


Phylogenetic Software


Find Sequences

