Function Prediction using STRING Tutorial
Aidan Budd and Lars Juhl Jensen
Wednesday 29th November 2006
In the previous day's tutorials, we looked at a number of different
ways of investigating the function of a protein sequence. In this
practical, we will be looking at STRING,
a webserver/database that integrates a wide range of different methods
for inferring function together in one place - this can provide clues to
the function of a protein that might be completely missed by methods we
have already discussed.
Query STRING using YFHM_ECOLI
- either enter this as an identifier, or if this causes problems,
submit the sequence directly. Make sure you carry out the query in
There are a large number of groups of proteins predicted to interact
with proteins related to this protein. Note that the list of
interactions is ordered by "Score", a measure of how confident STRING
is that a group of proteins interacts with relatives of the query
We are of course interested in finding out what kind of proteins have
been predicted to be interacting with YFHM_ECOLI (or relatives of this
protein). Often the COGs predicted as interactors are annotated on the
results page with comments concerning their possible function. However,
in this case this does not give us any useful information.
To get more information, open in a new tab or window the link
associated with several of the predicted interacting COGs.
This will give you a list of the proteins involved in each COG - by following the
links in this page you can find out the kind of proteins involved in
Q1a: Do the functions of proteins within the same COG tend to be
Q1b: Are there any proteins you already know something about in any of
these COGs, or indeed in the COG containing YFHM_ECOLI? If so, are
these proteins eukaryotic or prokaryotic in origin?
Another way of finding out more about the predicted interactors is to
examine the specific evidence used to make the prediction of the
interaction, by clicking on the icons below the result table. Begin by
examining the evidence from the textmining analysis. This provides you
with a list of text documents (mostly PubMed abstracts) that mention
members of two different COGs - this provides evidence that the two
proteins interact. It therefore deals with single genes that are
members of the COGs, and is a good way of getting a summary of which
members of the COGs have had the most research carried out on them.
Q1c: Several members of the YFHM_ECOLI COG dominate this list - which
are they?
Another source of data included in STRING comes from databases
that contain manually-curated information about groups or pathways of
proteins that are known to interact with each other e.g. KEGG, BIOCARTA
Browse the set of proteins identified as belonging to
the same pathways as YFHM_ECOLI.
Q1d: Is there some kind of trend in the biological function of
these pathways? e.g. signalling, immune response, protein synthesis,
You may have noticed that the information provided from these sets of
evidence was all associated with eukaryotic members of the
YFHM_ECOLI COG - and as we began with a bacterial protein, we may well
be more interested in identifying proteins in bacteria that may
interact with YFHM_ECOLI or related proteins.
Three additional sets of data used to identify potential interactions,
but which apply predominantly to prokaryotes, are the "Neighbourhood",
"Fusion" and "Occurrence" sources of evidence. As the only
experimental, database and text-mining evidence for interactions with
the YFHM_ECOLI COG appears to apply to eukaryotic genes, we will turn
off these data types, and search only using these three evidence types,
by altering the "Active Prediction Methods" at the bottom of the page,
and then "Updating Parameters".
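To see why switching evidence types off changes the score of a predicted interaction, it can help to work through a small example. The sketch below assumes that the per-channel confidences are combined as if they were independent, i.e. 1 minus the product of (1 - score) over the active channels. This is an assumption made for illustration and ignores any corrections STRING itself may apply; the function name, channel names and all numbers are hypothetical and are not real STRING output.

```python
from typing import Dict, Iterable


def combined_score(channel_scores: Dict[str, float], active: Iterable[str]) -> float:
    """Combine per-channel confidences, treating channels as independent.

    Simplified illustration only: the result is 1 - product(1 - s) over the
    active channels. Channels missing from channel_scores count as 0.
    """
    remaining_doubt = 1.0
    for name in active:
        remaining_doubt *= 1.0 - channel_scores.get(name, 0.0)
    return 1.0 - remaining_doubt


# Hypothetical scores for one predicted interaction (invented for this sketch).
example = {
    "neighbourhood": 0.5,
    "fusion": 0.0,
    "occurrence": 0.2,
    "experiments": 0.0,
    "databases": 0.0,
    "textmining": 0.5,
}

# The three evidence types that are left switched on in this part of the exercise.
GENOMIC_CONTEXT = ("neighbourhood", "fusion", "occurrence")

all_channels = combined_score(example, example.keys())
context_only = combined_score(example, GENOMIC_CONTEXT)

print(f"all evidence types:   {all_channels:.2f}")
print(f"genomic context only: {context_only:.2f}")
```

Under this assumption, an interaction supported mainly by textmining drops down the list once textmining is switched off, while interactions supported by the three remaining evidence types keep their relative order.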
Q1e: By browsing the evidence from these three data types, can you
identify a set of proteins you think likely to interact with YFHM_ECOLI or
bacterial relatives of this protein?
Q1f: Can you suggest a function for any
of these likely interactors?
To contrast the information that can be learned from STRING about a
protein compared with simple flat file database searching and sequence
similarity searching that we covered in a previous exercise, try using
SRS to investigate the function of YFHM_ECOLI (i.e. view
the swissprot entry for the protein).
Q1g: Do you find out more or less about the protein using
STRING or swissprot?
STRING has two different modes of operation, "Protein" and "COG" -
clicking on the "interactors wanted:" link on the STRING front page
describes the difference between them. The two modes offer a choice
between greater selectivity (in the Protein mode) and greater
sensitivity (in the COG mode) - the COG mode has the problem that in
some cases the groups are very inclusive e.g. most serine/threonine
kinases are in the same group, and it is then impossible to tease apart
interactions specific to a subset of the members of this group. In
contrast, the Protein mode predicts interaction partners for one
protein in a specific species, helping overcome problems of this kind.
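The trade-off between the two modes can be illustrated with a toy network. The sketch below is not how STRING is implemented; it only shows what happens when protein-level links are collapsed onto groups. All protein names, group names and links are hypothetical placeholders.

```python
# Hypothetical group membership: two kinases share one very inclusive group.
group_of = {
    "kinase_1": "GROUP_K",
    "kinase_2": "GROUP_K",
    "target_a": "GROUP_A",
    "target_b": "GROUP_B",
}

# Hypothetical protein-level links: each kinase has its own specific partner.
protein_edges = [("kinase_1", "target_a"), ("kinase_2", "target_b")]


def protein_partners(edges, protein):
    """Partners of one specific protein (the Protein-mode view)."""
    found = set()
    for p, q in edges:
        if p == protein:
            found.add(q)
        elif q == protein:
            found.add(p)
    return found


def group_partners(edges, group_of, protein):
    """Partner groups of the whole group containing `protein` (the COG-mode view)."""
    own = group_of[protein]
    found = set()
    for p, q in edges:
        gp, gq = group_of[p], group_of[q]
        if gp == own and gq != own:
            found.add(gq)
        elif gq == own and gp != own:
            found.add(gp)
    return found


print(protein_partners(protein_edges, "kinase_1"))
print(sorted(group_partners(protein_edges, group_of, "kinase_1")))
```

In the group-level view, kinase_1 appears linked to both GROUP_A and GROUP_B, even though only one of these links involves kinase_1 itself. That is the loss of selectivity described above; the gain in sensitivity comes from pooling the evidence of all group members.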
As we will now look at an example focusing on eukaryotic signalling
pathways, which involve many different kinases, the mode of choice, for
the reasons outlined above, is the Protein mode. If you are interested,
you could try working through this exercise again, this time using the
COG mode - this should help highlight some of the differences between
We begin by using STRING to gain a quick overview of some of the
interactions associated with the intensively-studied proto-oncogene
RAF, in humans.
Enter the swissprot identifier for the
in Protein mode into STRING, or submit the protein sequence.
The specificity of the Protein mode can be demonstrated by repeating
this search, using a RAF1 sequence from a different animal - RAF1_XENLA.
Submit this sequence to
As this protein is not in the set of precomputed proteins used
by STRING, the information is mapped onto the most similar sequences
that STRING can find, which happen to be in chickens.
Q2a: Can you browse through the results of this search to find out
where the species of the sequences is described?
Note also that within the vertebrates, most of the evidence used by STRING
(experiments, pathway databases) is associated with mammals (and
of course human in particular), and there is only a minimal amount of
information obtainable from STRING about the interaction network
associated with RAF1 in the chicken.
Repeat the search using the mouse
This should give you an amount of information intermediate between that
of the chicken and human networks.
When working with a protein as intensively studied as RAF, the
information provided by STRING can be somewhat overwhelming, making it
difficult "to see the wood for the trees". One way of limiting the
information presented to a more manageable amount is to restrict the
evidence used to construct the interaction networks to the more
reliable forms of evidence (probably the most appropriate is the
manually curated information from pathway databases such as KEGG). This
allows one to use STRING to ask the question "which pathways is my
protein of interest already well known to be involved in".
Based on the results obtained using human RAF, restrict the evidence
to that coming only from databases.
Q2b: Make a list of five different pathways the
protein is involved in.
Now switch on the experiments and textmining evidence.
Q2c: Is the coverage
of these two evidence types greater than that of the database evidence?
Why do you think this is the case?
Rather than just using STRING to browse and explore the interactions
associated with a particular protein, it can also be used to ask the
question "How are two or more genes related to each other in the
protein interaction network" e.g. are they directly interacting with
each other? If not, are they still close to one another in the network?
To ask such a question, use the "List" option on the STRING front page.
We will use this to look at the relationship between Interleukin-4
(IL4_HUMAN) and Raf (RAF1_HUMAN).
Enter the swissprot IDs of these two proteins in
the list, and examine the network view.
Are they direct interactors?
Click "+" under the network to expand the network around the two
proteins - can you identify any proteins that interact with both of
these two proteins? What kind of interactions are these?
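The questions asked in this last exercise (are two proteins directly linked, which partners do they share, and how far apart are they in the network) can also be phrased as simple graph operations. The sketch below uses a small hypothetical network; the node names and links are placeholders and do not represent real STRING results for IL4_HUMAN or RAF1_HUMAN.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from a list of (a, b) links."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path_length(graph, start, goal):
    """Number of links on the shortest path, or None if the nodes are unconnected."""
    if start not in graph or goal not in graph:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Hypothetical links, invented for this sketch.
edges = [
    ("query_1", "partner_x"),
    ("partner_x", "query_2"),
    ("query_1", "partner_y"),
    ("query_2", "partner_z"),
]
graph = build_graph(edges)

direct = "query_2" in graph["query_1"]        # are they direct interactors?
shared = graph["query_1"] & graph["query_2"]  # proteins linked to both queries
distance = shortest_path_length(graph, "query_1", "query_2")

print(direct, sorted(shared), distance)
```

Expanding the network with "+" corresponds to adding more links to the graph before asking these questions again, so shared partners that are invisible in the initial view may appear after expansion.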