Protein Architecture, Disorder, and Motifs Tutorial
Aidan Budd and Toby Gibson
Tuesday 28th November 2006
Proteins can have dozens of domains and/or short peptide functional
sites (so-called linear motifs) where only the local peptide sequence
is relevant to function, e.g. phosphorylation sites. Although the
peptide embodies all the information for function, linear motifs
often regulate the activity of other parts of the protein. In such cases
protein function cannot be well understood without an overview of the
modular architecture. We'll look at some servers that can help to
characterise protein architecture.
We will use:
- SMART for revealing known domains
- GlobPlot, IUPRED, and DisProt for protein order and disorder
- ELM for exploring potential short peptide functional sites
Exercise 1: Comparing a sequence to a database of protein domains
You have already looked at "simple" sequence similarity searching in a
previous exercise. However, it is possible to carry out considerably
more sensitive sequence similarity searches by using "profiles" -
searches that involve comparing information derived from multiple
sequence alignments, rather than just single sequences with each other.
Good, large (i.e. involving many sequences) alignments contain useful
information concerning the different evolutionary dynamics of
individual residues in the alignment, rather than assuming that all
residues evolve in the same way (as is the case with pairwise sequence
similarity searching). This makes them better at identifying distantly
There are several very useful databases of modules that are found in
multidomain proteins, including PFAM at the Sanger
Centre, PROSITE at
ISREC and SMART at
EMBL. All three resources exploit comparison of query sequences against
information based on multiple sequence alignments - this makes them
more sensitive than simple BLAST pairwise searches. We will first check
domains in the Src oncoprotein using the SMART server.
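To make the idea of profile-based searching more concrete, here is a minimal Python sketch of position-specific scoring. It is an illustration only and not the method used by SMART, PFAM or PROSITE, which rely on more elaborate models. The toy alignment, the uniform background frequencies, the pseudocount and the function names are all assumptions made for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores per alignment column.

    Assumes a uniform background frequency for all 20 amino acids,
    which is a simplification chosen for this sketch.
    """
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*alignment):
        # Gap characters are ignored when counting residues.
        counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, segment):
    """Sum the column-specific scores for a segment of the same length."""
    return sum(column[aa] for column, aa in zip(profile, segment))


# Invented alignment: column 1 is invariant, column 4 is highly variable.
toy_alignment = ["WKDG", "WRDA", "WKES", "WRDT"]
profile = build_profile(toy_alignment)

print(score(profile, "WKDG"))  # matches the conserved W
print(score(profile, "AKDG"))  # same segment with the conserved W replaced
```

The conserved first column contributes far more to the score than the variable last column, which is the extra information a single-sequence pairwise comparison cannot use.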
Open SMART in a new browser window
You may need to choose a SMART mode - if so select NORMAL.
Type src_human into SMART's Sequence ID box. Alternatively, this
method of submission may be broken, in which case follow this link to
SRS and use the sequence for SRC_HUMAN
at the bottom of this page.
Click on the Sequence SMART Button.
The search should take about a minute unless the server is busy.
When you get the results, note the domain "bubble" diagram and the
table of matching domains lower in the page. Search through the links
provided by SMART to address the following questions:
Q1a: Based on your experience of pairwise sequence similarity
searching using BLAST, would you say the E-value scores are
good, i.e. do you expect that the domains predicted by SMART
really are present in your protein?
Q1b: Is the domain common i.e. in your opinion, are there copies of the
domain found in many different proteins?
Q1c: Does SMART provide you with links to literature describing these domains?
Q1d: Has the 3D structure been solved for any of these domains?
Q1e: Do any disease mutations occur in these domains?
Q1f: What is the longest span of amino acids in which SMART does not
identify any domains?
Q1g: Do you think this protein has especially many or few domains?
(You could repeat the SMART query with FBN1_HUMAN, the Marfan Syndrome
protein, to see an example of a protein with a large number of domains.
Also, by choosing "Display all proteins with similar domain
organisation" on the results page, you can get an idea of the domain
composition of a range of different proteins.)
Exercise 2: Exploring Protein Order/Disorder
Today we will look at three different web servers (GlobPlot, IUPRED, DisProt)
that predict the presence of nonglobular regions of proteins. It can
be important to discern nonglobular regions of proteins: they often
have short functional sites, e.g. histone tails (interesting), and they
can interfere with protein crystallisation (bad).
GlobPlot uses "coil preferences" for the amino acids to distinguish
nonglobular and globular regions of multidomain proteins. It uses a
simple graphical approach based on summing the parameters so that the
slope of the graph indicates the nature of the sequence. A rising
positive slope has a nonglobular preference while a negative slope
indicates globular preference. Unlike sliding window algorithms, this
approach is good for finding segments of any length in a sequence.
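The running-sum idea can be sketched in a few lines of Python. This is a simplified illustration, not GlobPlot's actual implementation: the propensity values below are invented for the example (they are not the Russell/Linding values), no curve smoothing is applied, and the function names and the minimum segment length are assumptions.

```python
from itertools import accumulate

# Hypothetical propensities for illustration only: positive values favour
# coil/disorder, negative values favour regular structure.
# These are NOT the Russell/Linding values used by GlobPlot.
TOY_PROPENSITY = {"P": 0.6, "G": 0.4, "S": 0.3, "L": -0.4, "V": -0.5, "I": -0.5}


def running_sum(sequence, propensity=TOY_PROPENSITY, default=0.0):
    """Cumulative sum of per-residue propensities along the sequence."""
    return list(accumulate(propensity.get(aa, default) for aa in sequence))


def rising_segments(curve, min_length=4):
    """Return 0-based, half-open (start, end) spans where the curve keeps rising."""
    segments = []
    start = None
    previous = 0.0  # the curve starts from zero before the first residue
    for i, value in enumerate(curve):
        if value > previous:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_length:
                segments.append((start, i))
            start = None
        previous = value
    if start is not None and len(curve) - start >= min_length:
        segments.append((start, len(curve)))
    return segments


curve = running_sum("PGSPGSLVILVI")
print(rising_segments(curve))  # [(0, 6)]
```

Because the segments are read off the slope of the whole curve rather than from a fixed-size window, a rising stretch of any length is reported as a single segment.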
Open GlobPlot in a new window.
We will run GlobPlot in default mode today. Note that there are
parameters to affect the smoothness of the curve,
switch off SMART etc. To understand better how GlobPlot works, you
could click on the "Propensities" link to see the Russell/Linding
amino acid "coil" propensity values that GlobPlot uses to draw a simple
graph by summing these values along the
Type SRC_HUMAN in the swissprot ID box and click GlobPlot NOW. (NB
this method of submitting a sequence may not work, in which case cut
and paste the sequence from SRS SRC_HUMAN.)
GlobPlot can take a minute to run as it uses the SMART server too.
Examine the output. Note the slope of the graph compared to the
reported globular domains and see what the coloured blocks denote.
Q2a: Is the slope mainly positive or negative?
Q2b: Are those regions of the slope that are negative associated with
any other features represented in the output? Why would one expect this
Q2c: Are there any peaks/troughs where the slope inverts?
Q2d: What is the longest "putative unstructured segment" listed by
Q2e: How well do SMART/Pfam and GlobPlot agree on where there is
There are lots of proteins that give interesting and informative
If there is time you could try some of our favourites such as:
We will now use a different web server, DisProt, to predict nonglobular
regions in the same protein, human SRC. DisProt takes a different
approach from GlobPlot to predicting disordered regions, using a trained
Obtain the sequence for human SRC SRC_HUMAN.
Open the DisProt
Copy and paste the
human SRC sequence into the query box, and run the analysis using the
Here is IUPRED, yet another
server that predicts disordered regions in proteins.
Copy and paste the SRC sequence into the IUPRED server. Specify that a
plot should be made based on the results, and that a window size of
1000 should be used.
Q3: Do the regions of the sequence predicted by DisProt and IUPRED to
be globular/non-globular agree with those identified by SMART and
Repeat the analysis using GlobPlot, SMART, IUPRED and DisProt
instead of SRC.
Q4: Do you see any tendency for any of these servers to
under- or over-predict the presence of disorder/order? E.g. does
GlobPlot tend to strongly predict disorder in places where the other
servers are less definite about their assignment of the sequence to
either the Globular or the Non-Globular category?
Exercise 3: Searching for short functional sites with the ELM server
Src is an example of a protein that has many small functional sites for
modification and/or interaction with ligands. We term these "linear
motifs" because they do not require 3D structure for function, needing
only to be sufficiently accessible. Motif functions include ligand
recognition, amino acid modification, signalling, cell compartment
targeting, cleavage and so forth. There are probably fewer categories of
motif than of globular domains, but there are probably more instances in a
eukaryotic proteome. As part of a consortium, we have begun to collect
these motifs and develop a new database, ELM. Currently we have more
than 100 patterns entered in the database. The ELM server's web
interface allows sequences to be compared to the stored patterns. Motif
prediction presents difficulties as matches are not statistically
significant, so the user needs to think logically about which
motifs/domains are incompatible with each other. Part of the ELM
project involves developing filters to reduce the number of false
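The pattern matching and filtering described in this paragraph can be sketched as follows. This is a minimal illustration and not the ELM server's code: the pattern is a commonly quoted CDK phosphorylation consensus written as a regular expression (not copied from the ELM database), and the sequence, the domain coordinates and the function names are invented for the example.

```python
import re

# Illustrative pattern only: a commonly quoted CDK phosphorylation
# consensus, [ST]-P-x-[KR]. It is not taken from the ELM database.
TOY_PATTERNS = {"toy_CDK_site": r"[ST]P.[KR]"}


def scan(sequence, patterns=TOY_PATTERNS):
    """Yield (name, start, end, text) for each match, 1-based inclusive."""
    for name, pattern in patterns.items():
        # The lookahead allows overlapping matches to be reported too.
        for m in re.finditer(f"(?=({pattern}))", sequence):
            start = m.start(1) + 1
            yield name, start, start + len(m.group(1)) - 1, m.group(1)


def outside_domains(hits, domains):
    """Keep only hits that do not overlap any (start, end) domain span."""
    return [
        hit for hit in hits
        if not any(hit[1] <= d_end and hit[2] >= d_start
                   for d_start, d_end in domains)
    ]


# Invented sequence with one match in a linker and one inside a "domain".
toy_sequence = "MGSPEK" + "A" * 12 + "TPLR" + "AA"
toy_domains = [(15, 24)]  # hypothetical globular domain, residues 15-24

hits = list(scan(toy_sequence))
print(hits)
print(outside_domains(hits, toy_domains))
```

With uniform amino acid frequencies, a pattern like this one is expected to match by chance at about 2/20 x 1/20 x 2/20 = 1 in 2000 positions, which is why a single match carries little statistical weight and context filters are needed.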
Looking for conserved motifs in the human and Xenopus src protein
Open the ELM server query page
in a new window.
Type src_human into the SWISS-PROT ID box or submit the sequence
Specify species as Homo sapiens and cell compartment as cytosol.
(note that src is actually directed to the inner plasma membrane
Click on the Submit Button. The results will appear in a new window.
Immediately start a new search with src1_xenla and set species to
The searches should take about 1-2 minutes unless the ELM and SMART
servers are busy.
Look over the outputs. Explore the graphics by mouseover and click on
some of the links. Then answer the questions below.
Q5a: Note that some of the motifs have been "greyed out" in the
output - can you see, comparing these motifs with the output of other
aspects in the display, which other features these greyed-out motifs
Q5b: Why do you think these motifs have been greyed out?
Q5c: Are there any annotated motifs, i.e. motifs that have been identified
by the curators of the database as being definitely true?
Q5d: For the src CDK site, follow links to find whether CDK2 or CDK5
modifies this site?
Q5e: Does the ELM instance mapper use PHI-BLAST or PSI-BLAST?
Q5f: Find the set of reported motifs that obey the following criteria:
(1) They are in the non-globular N-termini (approx. residues 1-80).
(2) They are found at the same place in human and frog (indicating that
there is functional conservation).
NB An easy, and frequently used, way of checking for conservation in
position of a motif between related sequences is to obtain a BLAST
pairwise alignment between the two sequences, and to examine the region
around the motifs from this alignment. You could do this using the BLAST 2
sequences server at the NCBI or using a Smith/Waterman
alignment (a more accurate pairwise alignment algorithm than BLAST)
at the Pasteur Institute in Paris.
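Once you have a pairwise alignment, checking whether a motif sits at the same place in both sequences amounts to mapping residue numbers through the alignment columns. Here is a minimal Python sketch of that bookkeeping. The two gapped strings are invented for the example (they are not the real human and Xenopus src sequences), and the function name is an assumption.

```python
def map_position(aligned_a, aligned_b, pos_a):
    """Map a 1-based residue number in sequence A to its partner in B.

    Both arguments are rows of the same pairwise alignment, with '-'
    marking gaps. Returns None if the residue is aligned to a gap.
    """
    count_a = count_b = 0
    for char_a, char_b in zip(aligned_a, aligned_b):
        if char_b != "-":
            count_b += 1
        if char_a != "-":
            count_a += 1
            if count_a == pos_a:
                return count_b if char_b != "-" else None
    raise ValueError("position lies beyond the end of sequence A")


# Invented alignment rows of equal length.
toy_a = "MGS-KSPEK"
toy_b = "MGATK-PEK"

print(map_position(toy_a, toy_b, 4))  # residue 4 of A pairs with residue 5 of B
print(map_position(toy_a, toy_b, 5))  # residue 5 of A is opposite a gap: None
```

A motif reported at residues 6-8 in the first sequence would then be looked for at the mapped residues in the second sequence, rather than at the same raw residue numbers.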
Q5g: Are there conserved N-myristoylation sites?
Q5h: Are there any conserved phosphorylation sites?
Q5i: Are there any conserved cyclin binding sites?
Q5j: Is a cyclin binding site meaningful on its own? Or is there
another motif that must be found in the same protein as this motif for
it to be functional?
Q5k: Are there more cyclin-binding than CDK phosphorylation sites?
Q5l: Is src likely to be phosphorylated at specific points in the cell
There are 8 src-like kinases in the human proteome with partially
redundant function. Presumably they will be substrates of CDK too?
Q5m: Run new ELM queries with the src-like kinase entries yes_human and
yes_xenla to check whether Yes too is a CDK substrate.
Exercise 4: Exploring the architecture of the protein Epsin1
Epsin is a protein involved in clathrin-mediated endocytosis. It binds
to the membrane, inducing curvature and is regulated by many adaptor
protein interactions. Endocytosis is a highly dynamic process involving
many different proteins that come together in transient complexes. The
whole system takes extensive advantage of short linear motifs. Let's
check some out!
Submit SMART, GlobPlot and ELM queries with entry EPN1_RAT (rat
Epsin1) (links for these servers and SRS can be found elsewhere in this
In ELM, remember to set cell compartment to cytosol.
Q6a: How many globular domains are reported in the SMART output?
Q6b: Looking at the arrangement of globular domains as predicted by
SMART, is there any indication that exon duplication has occurred in
the evolution of this gene? It also helps to check further by clicking
on "Display all proteins with similar domain composition", then on
"display domain architecture of ALL". This allows
you to survey the Epsins in many different organisms at the same time.
Q6c: Do SMART and GlobPlot agree about where the sequence is globular?
Q6d: How big is the largest segment of disordered sequence indicated by
Many short functional motifs are mapped in Epsins already, notably:
- 2 clathrin boxes
- 3 EH domain binding motifs (NPF motifs)
- 8 AP2-binding motifs (DP[WF] motifs)
- 1 PIP2 binding motif
- 3 Ubiquitin interacting (UIM) motifs
Q6e: Which of these motifs are not found by ELM?
Note: the ELM DB does not have anything like full coverage yet so there
may be no entry yet for many motifs.
Q6f: Are any of these motifs found by SMART instead?
Q6g: Which of these motifs has been "rescued" from the ENTH domain?
(While it is true that motifs tend not to occur in globular domains,
they do sometimes occur in loops between secondary structure elements.
When an ELM annotator identifies such a case in the
literature, they can annotate this particular instance of the motif
such that, even though it is present in a domain, and therefore
would normally be filtered out, it will still be presented to the user
- this is a "rescued" motif). Click on its link to get an idea why it
should be rescued.
Q6h: Which of the NPF and DPW instances are not yet annotated in ELM?
Q6i: Are there any other endocytosis targeting ELMs picked up in Epsin?
Q6j: Are the matches plausible?
Q6k: Are there any other motifs that one might consider following up
experimentally? (to decide this, you would look at alignments between
closely-related epsins to check for motifs present in the same place in
the different sequences which are not present in globular domains)
If you have spare time, you can also have a look at the protein frq (FRQ_NEUCR),
a circadian clock protein in fungi, and BFA1_YEAST,
involved in a spindle assembly checkpoint in yeast.
Exercise 5: Looking for linear motif candidates in interaction
datasets with STRING and DILIMOT
On a related topic, it is also possible to use other sources of
information to predict the presence of motifs - in this case we will
use the DILIMOT server, which predicts the presence of motifs based on
analysis of a set of physically-interacting proteins.
However, the DILIMOT server is slow and there are too many of us, so we
will set up the exercise in the evening, bookmark the results page,
and analyse the results in the morning.
There is a well known linear motif in proteins which bind to the PCNA
clamp protein of the DNA polymerase complex. Our aim is to see
whether we can retrieve the motif from interaction datasets.
(The ELM entry is lig_PCNA)
Go to the STRING webserver
Search PROTEIN Interactors with PCNA_HUMAN as query
This search usually works well with yeast 2-hybrid data. You can either
follow the guidelines here or try to vary the parameters to see if you
get better or worse results....
Official Guidelines: Select Experiments only and ask for 50
interactors. (Selecting Literature only might be another sensible
Once you have the revised dataset, bring up the Summary Network.
Collect the sequences in Fasta format using the save button.
Go to the DILIMOT webserver
Cut and paste the PCNA interaction dataset in FASTA from STRING into
Set species to human.
Click Find Motifs to initiate the motif search.
Now bookmark the results link so we can return in the morning to
evaluate the search.
Q7a: Is there a motif something like Q..L..FF? If so how many proteins
have the motif? Click on the link.
Q7b: Look at the protein diagrams - Is the location of the motif always
Q7c: DILIMOT uses conservation to upweight the significance of the
motif matches. Is the motif present in all the aligned homologues or
only in some of them?
Q7d: Are the hydrophobic positions in the motif always conserved? If
not, what substitutions are allowed?
Q7e: Is the Gln always conserved?
DILIMOT runs at the edge of the signal-to-noise ratio for finding short
motifs. Currently the protocol only allows a single amino acid at each
position. Since most linear motifs do not specify exact matches at
every position, the linear motif may be present in other proteins in
the dataset but not found in the query. Therefore DILIMOT results
provide an entry point to explore a candidate motif. Multiple
alignments of the proteins can help to derive a better consensus
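The effect of allowing only a single amino acid at each position can be illustrated with a short Python sketch. The strict pattern is the Q..L..FF form mentioned in Q7a. The relaxed pattern is an assumed degenerate consensus used here purely for illustration and is not taken from DILIMOT or ELM. The toy sequences and the helper name are invented.

```python
import re

# One residue per defined position, as in the motif from Q7a.
STRICT = re.compile(r"Q..L..FF")
# Assumed degenerate consensus for illustration: similar residues are
# allowed at the hydrophobic and aromatic positions.
RELAXED = re.compile(r"Q..[ILM]..[FY][FY]")

# Invented sequences standing in for an interaction dataset.
toy_proteins = {
    "toy_A": "MSAQKTLDSFFKR",  # exact Q..L..FF
    "toy_B": "GGRQTSITDFFSK",  # Ile in place of Leu
    "toy_C": "MKQRRMVDYFAA",   # Met and Tyr substitutions
    "toy_D": "MKKAAALLGG",     # no motif
}


def proteins_with(pattern, proteins):
    """Return the sorted names of proteins containing the pattern."""
    return sorted(name for name, seq in proteins.items() if pattern.search(seq))


print(proteins_with(STRICT, toy_proteins))   # ['toy_A']
print(proteins_with(RELAXED, toy_proteins))  # ['toy_A', 'toy_B', 'toy_C']
```

In this toy dataset the strict pattern recovers one protein while the relaxed one recovers three, which is why a reported motif is best treated as a starting point to be refined against alignments.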
Take home lessons
We need to know all the functional domains and motifs in a protein to
truly understand how it functions: in our examples, the short peptide
motifs easily outnumber the globular domains. Bioinformatics resources
can help us find many of these components but are by no means
comprehensive. Known domains can be assigned with good statistical
confidence. In the case of the ELM short functional sites, there is no
statistical support for candidates. ELM results should be filtered -
partly by ELM itself but also by the user. Checking for conservation in
closely related proteins is a good test whether ELM matches should be