Protein Architecture, Disorder, and Motifs Tutorial

Aidan Budd and Toby Gibson

Tuesday 28th November 2006

Proteins can have dozens of domains and/or short peptide functional sites (so-called linear motifs) where only the local peptide sequence is relevant to function, e.g. phosphorylation sites. Although the peptide embodies all the information for function, linear motifs often regulate the activity of other parts of the protein. In such cases protein function cannot be well understood without an overview of the modular architecture. We'll look at some servers that can help to characterise protein architecture.

We will use:

Exercise 1: Comparing a sequence to a database of protein domains

You have already looked at "simple" sequence similarity searching in a previous exercise. However, it is possible to carry out considerably more sensitive sequence similarity searches by using "profiles" - searches that involve comparing information derived from multiple sequence alignments, rather than just single sequences with each other. Good, large (i.e. involving many sequences) alignments contain useful information concerning the different evolutionary dynamics of individual residues in the alignment, rather than assuming that all residues evolve in the same way (as is the case with pairwise sequence similarity searching). This makes them better at identifying distantly related sequences.
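As a rough illustration of why an alignment carries more information than a single sequence, the sketch below builds a very simple position-specific profile from a tiny, made-up alignment and scores candidate segments against it. It is not the method used by SMART, PFAM or PROSITE; the uniform background frequency, the pseudocount of 1, and the toy sequences are all assumptions chosen for brevity.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores per (ungapped) alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile

def score(profile, segment):
    """Sum the column scores for a segment of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))

# Toy, made-up alignment: columns 1 and 4 are invariant, the others vary.
alignment = ["WYFGK", "WYFGR", "WFFGK", "WYYGK"]
profile = build_profile(alignment)

conserved_hit = score(profile, "WYFGK")  # most common residue in each column
variable_hit = score(profile, "WFYGR")   # rarer residues in the variable columns
unrelated = score(profile, "AAAAA")      # residues never seen in the alignment

print(conserved_hit, variable_hit, unrelated)
```

A segment that differs only in the variable columns still scores well, whereas a mismatch in an invariant column is penalised heavily. A pairwise comparison against one sequence cannot make that distinction.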

There are several very useful databases of modules that are found in multidomain proteins, including PFAM at the Sanger Centre, PROSITE at ISREC and SMART at EMBL. All three resources exploit comparison of query sequences against information based on multiple sequence alignments - this makes them more sensitive than simple BLAST pairwise searches. We will first check for protein domains in the Src oncoprotein using the SMART server.

Open SMART in a new browser window

You may need to choose a SMART mode - if so select NORMAL.

Type src_human into SMART's Sequence ID box. If this method of submission does not work, follow this link to SRS and use the sequence for SRC_HUMAN at the bottom of this page.

Click on the Sequence SMART Button.

The search should take about a minute unless the server is busy.

When you get the results, note the domain "bubble" diagram and the table of matching domains lower in the page. Search through the links provided by SMART to address the following questions:

Q1a: Based on your experience of pairwise sequence similarity searching using BLAST, would you say the E-values are good, i.e. do you expect that the domains predicted by SMART really are present in your protein?

Q1b: Is the domain common i.e. in your opinion, are there copies of the domain found in many different proteins?

Q1c: Does SMART provide you with links to literature describing these domains?

Q1d: Has the 3D structure been solved for any of these domains?

Q1e: Do any disease mutations occur in these domains?

Q1f: What is the longest span of amino acids in which SMART does not identify any domains?

Q1g: Do you think this protein has especially many or few domains?

(You could repeat the SMART query with FBN1_HUMAN, the Marfan Syndrome protein, to see an example of a protein with a large number of domains. Also, by choosing to "Display all proteins with similar domain organisation" on the results page, you can get an idea of the domain composition of a range of different proteins)

Exercise 2: Exploring Protein Order/Disorder

Today we will look at three different web servers (GlobPlot, IUPRED, DisProt) that predict the presence of nonglobular regions of proteins. It can be important to discern nonglobular regions of proteins: they often contain short functional sites, e.g. histone tails (interesting), and they can interfere with protein crystallisation (bad).

GlobPlot

GlobPlot uses "coil preferences" for the amino acids to distinguish nonglobular and globular regions of multidomain proteins. It uses a simple graphical approach based on summing the parameters so that the slope of the graph indicates the nature of the sequence. A rising positive slope has a nonglobular preference while a negative slope indicates globular preference. Unlike sliding window algorithms, this approach is good for finding segments of any length in a sequence.
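The running-sum idea can be sketched in a few lines. This is a simplified illustration only: the propensity values below are invented for the example (they are NOT the Russell/Linding values), residues missing from the table are treated as neutral, and the smoothing that the real server applies is left out.

```python
from itertools import accumulate

# Hypothetical propensities for illustration only; NOT the Russell/Linding set.
# Positive values favour coil/nonglobular, negative values favour globular.
TOY_PROPENSITY = {
    "P": 1.0, "S": 0.6, "G": 0.5, "E": 0.3, "K": 0.3, "Q": 0.3,
    "W": -0.9, "F": -0.8, "I": -0.8, "L": -0.6, "V": -0.6, "Y": -0.5,
}

def cumulative_curve(sequence, propensity=TOY_PROPENSITY):
    """Running sum of per-residue propensities (unknown residues count as 0)."""
    return list(accumulate(propensity.get(aa, 0.0) for aa in sequence))

def rising_segments(curve, min_length=5):
    """Return 0-based, half-open (start, end) ranges where the curve keeps rising."""
    segments = []
    start = None
    previous = 0.0
    for i, value in enumerate(curve):
        if value > previous:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_length:
                segments.append((start, i))
            start = None
        previous = value
    if start is not None and len(curve) - start >= min_length:
        segments.append((start, len(curve)))
    return segments

# Made-up sequence: coil-like stretch, hydrophobic stretch, coil-like tail.
toy_sequence = "PSGSPEKSPG" + "LIVFWYLIVF" + "SPGSP"
curve = cumulative_curve(toy_sequence)
segments = rising_segments(curve)
print(segments)
```

Because the segments are read off the slope of one cumulative curve rather than from a fixed-width window, the same code finds the 10-residue and the 5-residue rising stretches without any change of parameters.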

Open GlobPlot in a new window.

We will run GlobPlot in default mode today. Note that there are parameters to affect the smoothness of the curve, switch off SMART etc. To understand better how GlobPlot works, you could click on the "Propensities" link to see the Russell/Linding amino acid "coil" propensity values that GlobPlot uses to draw a simple graph by summing these values along the sequence.

Type SRC_HUMAN in the swissprot ID box and click GlobPlot NOW. (NB this method of submitting a sequence may not work, in which case cut and paste the sequence from SRS SRC_HUMAN)

GlobPlot can take a minute to run as it uses the SMART server too.

Examine the output. Note the slope of the graph compared to the reported globular domains and see what the coloured blocks denote.

Q2a: Is the slope mainly positive or negative?

Q2b: Are those regions of the slope that are negative associated with any other features represented in the output? Why would one expect this association?

Q2c: Are there any peaks/troughs where the slope inverts?

Q2d: What is the longest "putative unstructured segment" listed by GlobPlot?

Q2e: How well do SMART/Pfam and GlobPlot agree on where there is globular structure?

There are lots of proteins that give interesting and informative GlobPlots:
If there is time you could try some of our favourites such as:

DisProt

We will now use a different web server, DisProt, to predict nonglobular regions in the same protein, human SRC. DisProt takes a different approach from GlobPlot to predicting disordered regions: it uses a trained neural network.

Obtain the sequence for human SRC SRC_HUMAN.

Open the DisProt query page.


Copy and paste the human SRC sequence into the query box, and run the analysis using the default settings.

IUPRED

Here is IUPRED, yet another server that predicts disordered regions in proteins.

Copy and paste the SRC sequence into the IUPRED server. Specify that a plot should be made based on the results, and that a window size of 1000 should be used.


Q3: Do the regions of the sequence predicted by DisProt and IUPRED to be globular/non-globular agree with those identified by SMART and GlobPlot?

Repeat the analysis using GlobPlot, SMART, IUPRED and DisProt with IRS1_HUMAN instead of SRC.

Q4: Do any of these servers tend to under- or over-predict the presence of disorder/order? E.g. does GlobPlot tend to strongly predict disorder in places where the other servers are less definite about their assignment of the sequence to either globular/non-globular categories?

Exercise 3: Searching for short functional sites with the ELM server

Src is an example of a protein that has many small functional sites for modification and/or interaction with ligands. We term these "linear motifs" because they do not require 3D structure for function, needing only to be sufficiently accessible. Motif functions include ligand recognition, amino acid modification, signalling, cell compartment targeting, cleavage and so forth. There are probably fewer categories of motif than of globular domain, but there are probably more motif instances in a eukaryotic proteome. As part of a consortium, we have begun to collect these motifs and develop a new database, ELM. Currently we have more than 100 patterns entered in the database. The ELM server's web interface allows sequences to be compared to the stored patterns. Motif prediction presents difficulties as matches are not statistically significant, so the user needs to think logically about which motifs/domains are incompatible with each other. Part of the ELM project involves developing filters to reduce the number of false positive matches.
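The idea of comparing a sequence to stored patterns and then filtering the matches can be sketched with regular expressions. Everything here is a hypothetical illustration: the two patterns are simplified stand-ins and not actual ELM entries, the sequence and the domain coordinates are made up, and the filter shown (flagging matches that lie inside a globular domain) is only one plausible kind of filter, not a description of how the ELM server is implemented.

```python
import re

# Simplified, hypothetical patterns for illustration; not actual ELM entries.
TOY_PATTERNS = {
    "toy_proline_directed_site": r"[ST]P.[KR]",
    "toy_cyclin_docking": r"[RK].L",
}

def scan(sequence, patterns, domains=()):
    """Return (name, start, end, match, inside_domain) tuples, 1-based inclusive."""
    hits = []
    for name, pattern in patterns.items():
        # The lookahead lets overlapping matches be reported as well.
        for m in re.finditer(f"(?=({pattern}))", sequence):
            start = m.start() + 1
            end = m.start() + len(m.group(1))
            inside = any(d_start <= start and end <= d_end
                         for d_start, d_end in domains)
            hits.append((name, start, end, m.group(1), inside))
    return sorted(hits, key=lambda hit: hit[1])

# Made-up sequence with a hypothetical globular domain at residues 20-34.
toy_sequence = "MGSSPTK" + "A" * 13 + "RKL" + "A" * 5 + "TPGR" + "AA"
toy_domains = [(20, 34)]

hits = scan(toy_sequence, TOY_PATTERNS, toy_domains)
for name, start, end, match, inside in hits:
    flag = "inside domain (candidate for filtering)" if inside else "accessible"
    print(f"{name}\t{start}-{end}\t{match}\t{flag}")
```

Short patterns like these match by chance very often, which is why the matches carry no statistical significance on their own and contextual filters are needed.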

Looking for conserved motifs in the human and Xenopus src protein N-termini:

Open the ELM server query page in a new window.

Type src_human into the SWISS-PROT ID box or submit the sequence directly

Specify species as Homo sapiens and cell compartment as cytosol.
(note that src is actually directed to the inner plasma membrane surface by myristoylation.)

Click on the Submit Button. The results will appear in a new window.

Immediately start a new search with src1_xenla and set species to Xenopus laevis.
The searches should take about 1-2 minutes unless the ELM and SMART servers are busy.

Look over the outputs. Explore the graphics by mouseover and click on some of the links. Then answer the questions below.

Q5a: Note that some of the motifs have been "greyed out" in the output. Comparing these motifs with the other features in the display, can you see which features the greyed-out motifs correlate with?

Q5b: Why do you think these motifs have been greyed out?

Q5c: Are there any annotated motifs, i.e. motifs that have been identified by the curators of the database as being definitely true?

Q5d: For the src CDK site, follow links to find whether CDK2 or CDK5 modifies this site?

Q5e: Does the ELM instance mapper use PHI-BLAST or PSI-BLAST?

Q5f: Find the set of reported motifs that obey the following criteria:
(1) They are in the non-globular N-termini (approx. residues 1-80).
(2) They are found at the same place in human and frog (indicating that there is functional conservation).


NB An easy, and frequently used, way of checking for conservation in position of a motif between related sequences is to obtain a BLAST pairwise alignment between the two sequences, and to examine the region around the motifs from this alignment. You could do this using the BLAST 2 sequences server at the NCBI or using a Smith/Waterman alignment (a more accurate pairwise alignment algorithm than BLAST) at the Pasteur Institute in Paris.

Q5g: Are there conserved N-myristoylation sites?

Q5h: Are there any conserved phosphorylation sites?

Q5i: Are there any conserved cyclin binding sites?

Q5j: Is a cyclin binding site meaningful on its own? Or is there another motif that must be found in the same protein as this motif for it to be functional?

Q5k: Are there more cyclin-binding than CDK phosphorylation sites?

Q5l: Is src likely to be phosphorylated at specific points in the cell cycle?

There are 8 src-like kinases in the human proteome with partially redundant function. Presumably they will be substrates of CDK too?

Q5m: Run new ELM queries with the src-like kinases yes_human and yes_xenla to check whether this kinase too is a CDK substrate.
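The conservation check described in the NB above (do two related sequences carry a motif at the same place in their pairwise alignment?) can be sketched as follows. The sketch assumes you already have the two aligned, gapped sequences, for example copied from a BLAST or Smith/Waterman result; the aligned strings and the pattern below are made up for illustration.

```python
import re

def column_of_residue(aligned):
    """Map each 0-based residue index to its 0-based alignment column."""
    return [col for col, char in enumerate(aligned) if char != "-"]

def motif_columns(aligned, pattern):
    """Alignment columns at which a match to the pattern starts."""
    columns = column_of_residue(aligned)
    ungapped = aligned.replace("-", "")
    # Zero-width lookahead so overlapping matches are not skipped.
    return {columns[m.start()]
            for m in re.finditer(f"(?=(?:{pattern}))", ungapped)}

def conserved_motif_columns(aligned_a, aligned_b, pattern):
    """Columns where the motif starts in both aligned sequences."""
    return sorted(motif_columns(aligned_a, pattern)
                  & motif_columns(aligned_b, pattern))

# Made-up pairwise alignment and a simplified, hypothetical pattern.
aligned_a = "MGSSPTKAA--SPEKLL"
aligned_b = "MG-SPAKAAGGAPEKLL"
pattern = r"[ST]P.[KR]"

conserved = conserved_motif_columns(aligned_a, aligned_b, pattern)
print(conserved)
```

In this toy case the first sequence has two matches but only one of them is aligned with a match in the second sequence, so only that one would count as positionally conserved.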

Exercise 4: Exploring the architecture of the protein Epsin1

Epsin is a protein involved in clathrin-mediated endocytosis. It binds to the membrane, inducing curvature, and is regulated by many adaptor protein interactions. Endocytosis is a highly dynamic process involving many different proteins that come together in transient complexes. The whole system takes extensive advantage of short linear motifs. Let's check some out!

Submit SMART, GlobPlot and ELM queries with entry EPN1_RAT (rat Epsin1) (links for these servers and SRS can be found elsewhere in this tutorial)

In ELM, remember to set cell compartment to cytosol.

Q6a: How many globular domains are reported in the SMART output?

Q6b: Looking at the arrangement of globular domains as predicted by SMART, is there any indication that exon duplication has occurred in the evolution of this gene? It also helps to check further by clicking on "Display all proteins with similar domain composition", then on "display domain architecture of ALL". This allows you to survey the Epsins in many different organisms at the same time.

Q6c: Do SMART and GlobPlot agree about where the sequence is globular?

Q6d: How big is the largest segment of disordered sequence indicated by GlobPlot?


Many short functional motifs are mapped in Epsins already, notably:

Q6e: Which of these motifs are not found by ELM?
Note: the ELM DB does not have anything like full coverage yet so there may be no entry yet for many motifs.

Q6f: Are any of these motifs found by SMART instead?

Q6g: Which of these motifs has been "rescued" from the ENTH domain? (While it is true that motifs tend not to occur in globular domains, they do sometimes occur in loops between secondary structure elements. When an ELM annotator identifies such a case in the literature, they can annotate this particular instance of the motif such that, even though it is present in a domain and would therefore normally be filtered out, it will still be presented to the user. This is a "rescued" motif.) Click on its link to get an idea why it should be rescued.

Q6h: Which of the NPF and DPW instances are not yet annotated in ELM?

Q6i: Are there any other endocytosis targeting ELMs picked up in Epsin?

Q6j: Is/are the match(es) plausible?

Q6k: Are there any other motifs that one might consider following up experimentally? (to decide this, you would look at alignments between closely-related epsins to check for motifs present in the same place in the different sequences which are not present in globular domains)


If you have spare time, you can also have a look at the protein frq FRQ_NEUCR, a circadian clock protein in fungi, and BFA1_YEAST, which is involved in a spindle assembly checkpoint in yeast.

Exercise 5: Looking for linear motif candidates in interaction datasets with STRING and DILIMOT

On a related topic, it is also possible to use other sources of information to predict the presence of motifs. In this case we will use the DILIMOT server, which predicts the presence of motifs based on analysis of a set of physically interacting proteins.

However, the DILIMOT server is slow and there are too many of us, so we will set up the exercise in the evening, bookmark the results page, and analyse the results in the morning.

There is a well-known linear motif in proteins which bind to the PCNA clamp protein of the DNA polymerase complex. Our aim is to see whether we can retrieve this motif from interaction datasets.

(The ELM entry is lig_PCNA)

Go to the STRING webserver

Search PROTEIN Interactors with PCNA_HUMAN as query

This search usually works well with yeast 2-hybrid data. You can either follow the guidelines here or try to vary the parameters to see if you get better or worse results....

Official Guidelines: Select Experiments only and ask for 50 interactors. (Selecting Literature only might be another sensible option).

Once you have the revised dataset, bring up the Summary Network.

Collect the sequences in Fasta format using the save button.


Go to the DILIMOT webserver

Cut and paste the PCNA interaction dataset in FASTA from STRING into the window.

Set species to human.

Click Find Motifs to initiate the motif search.

Now bookmark the results link so we can return in the morning to evaluate the search.


Q7a: Is there a motif something like Q..L..FF? If so how many proteins have the motif? Click on the link.

Q7b: Look at the protein diagrams - Is the location of the motif always the same?

Q7c: DILIMOT uses conservation to upweight the significance of the motif matches. Is the motif present in all the aligned homologues or only in some of them?

Q7d: Are the hydrophobic positions in the motif always conserved? If not, what substitutions are allowed?

Q7e: Is the Gln always conserved?


DILIMOT runs at the edge of the signal to noise for finding short motifs. Currently the protocol only allows a single amino acid at each position. Since most linear motifs do not specify exact matches at every position, the linear motif may be present in other proteins in the dataset but not found in the query. Therefore DILIMOT results provide an entry point to explore a candidate motif. Multiple alignments of the proteins can help to derive a better consensus pattern.
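The effect of allowing only a single amino acid at each position can be seen by comparing an exact pattern with a relaxed one. In this sketch the exact pattern is the Q..L..FF motif from Q7a; the relaxed pattern and the four short sequences are hypothetical, chosen only to show how a consensus derived from an alignment might recover instances that the exact pattern misses. They are not the curated ELM pattern or real PCNA interactors.

```python
import re

# Made-up toy sequences; not real PCNA-binding proteins.
toy_proteins = {
    "protA": "MSEQKTLDSFFKGT",
    "protB": "AAQRGIADYFPLK",
    "protC": "GGQSSMTKFYAA",
    "protD": "MKKLLPPSSTTAA",
}

# One amino acid per defined position, as in the motif from Q7a.
exact = re.compile(r"Q..L..FF")
# Hypothetical relaxed consensus allowing similar residues at some positions.
relaxed = re.compile(r"Q..[ILM]..[FY][FY]")

def proteins_with(pattern, proteins):
    """Names of the proteins containing at least one match to the pattern."""
    return sorted(name for name, seq in proteins.items() if pattern.search(seq))

exact_hits = proteins_with(exact, toy_proteins)
relaxed_hits = proteins_with(relaxed, toy_proteins)

print("exact:  ", exact_hits)
print("relaxed:", relaxed_hits)
```

The relaxed pattern finds more candidates, but it will also match more often by chance, so checking conservation in homologues remains important.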

Take home lessons

We need to know all the functional domains and motifs in a protein to truly understand how it functions: in our examples, the short peptide motifs easily outnumber the globular domains. Bioinformatics resources can help us find many of these components but are by no means comprehensive. Known domains can be assigned with good statistical confidence. In the case of the ELM short functional sites, there is no statistical support for candidates. ELM results should be filtered - partly by ELM itself but also by the user. Checking for conservation in closely related proteins is a good test whether ELM matches should be followed up.