Function Prediction using STRING Tutorial

Aidan Budd and Lars Juhl Jensen

Wednesday 29th November 2006

In the previous day's tutorials, we looked at a number of different ways of investigating the function of a protein sequence. In this practical, we will be looking at STRING, a webserver/database that integrates a wide range of different methods for inferring function in one place - this can provide clues to the function of a protein that might be completely missed by the methods we have already discussed.

Exercise 1

Query STRING using YFHM_ECOLI - either enter this as an identifier, or if this causes problems, submit the sequence directly. Make sure you carry out the query in "COG" mode.

There are a large number of groups of proteins predicted to interact with proteins related to this protein. Note that the list of interactions is ordered by "Score", a measure of how confident STRING is that a group of proteins interacts with relatives of the query sequence.
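As a rough illustration of what ordering by "Score" means, the sketch below sorts a handful of predicted interactors from most to least confident. The group names and score values are invented for illustration only and are not STRING output.

```python
# Hypothetical (group name, confidence score) pairs; not real STRING results.
predicted = [
    ("GROUP_A", 0.62),
    ("GROUP_B", 0.91),
    ("GROUP_C", 0.45),
]


def rank_by_score(interactors):
    """Return the interactors sorted from most to least confident."""
    return sorted(interactors, key=lambda pair: pair[1], reverse=True)


ranked = rank_by_score(predicted)
for name, score in ranked:
    print(f"{name}\t{score:.2f}")
```

The most confident prediction comes first, which is the order in which the results table lists the predicted interactors.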

We are of course interested in finding out what kind of proteins have been predicted to be interacting with YFHM_ECOLI (or relatives of this protein). Often the COGs predicted as interactors are annotated on the results page with comments concerning their possible function. However, in this case this does not give us any useful information.

To get more information, follow the links associated with several of the predicted interacting COGs, opening each one in a new tab or window.

This will give you a list of the proteins involved in each COG - by following the links in this page you can find out the kind of proteins involved in the COG.

Q1a: Do the functions of proteins within the same COG tend to be similar?

Q1b: Are there any proteins you already know something about in any of these COGs, or indeed in the COG containing YFHM_ECOLI? If so, are these proteins eukaryotic or prokaryotic in origin?

Another way of finding out more about the predicted interactors is to examine the specific evidence used to predict the interaction, by clicking on the icons below the result table. Begin by examining the evidence from the textmining analysis. This provides you with a list of text documents (mostly PubMed abstracts) that mention members of two different COGs - such co-mentions are taken as evidence that the two proteins interact. This evidence therefore deals with single genes that are members of the COGs, and is a good way of getting a summary of which members of the COGs have had the most research carried out on them.
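The idea of co-mention evidence can be sketched as follows. This is a toy model, not STRING's actual text-mining pipeline: the gene names and "abstracts" are hypothetical, and each abstract is reduced to the set of gene names it mentions.

```python
from collections import Counter

# Hypothetical gene names standing in for members of two COGs.
cog_x = {"geneA", "geneB"}
cog_y = {"geneP", "geneQ"}

# Hypothetical abstracts, each reduced to the set of gene names it mentions.
abstracts = [
    {"geneA", "geneP"},
    {"geneA", "geneQ", "geneZ"},
    {"geneB"},
    {"geneA", "geneP"},
]


def comention_counts(docs, group_1, group_2):
    """Count, for each member of group_1, the documents that mention it
    together with at least one member of group_2."""
    counts = Counter()
    for mentioned in docs:
        if mentioned & group_2:
            for gene in mentioned & group_1:
                counts[gene] += 1
    return counts


counts = comention_counts(abstracts, cog_x, cog_y)
print(counts.most_common())
```

A member that accounts for most of the co-mentions is one that "dominates" the evidence list in the sense used in the next question.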

Q1c: There are several members of the YFHM_ECOLI COG that dominate the proteins involved in this list - which are they?

Another source of data included in STRING comes from databases that contain manually curated information about groups or pathways of proteins that are known to interact with each other, e.g. KEGG and BIOCARTA.

Browse the set of proteins identified as belonging to the same pathways as YFHM_ECOLI.

Q1d: Is there some kind of trend in the biological function of these pathways? e.g. signalling, immune response, protein synthesis, cell adhesion...

You may have noticed that the information provided from these sets of evidence was all associated with eukaryotic members of the YFHM_ECOLI COG - and as we began with a bacterial protein, we may well be more interested in identifying proteins in bacteria that may interact with YFHM_ECOLI or related proteins.

Three additional sets of data used to identify potential interactions, but which apply predominantly to prokaryotes, are the "Neighbourhood", "Fusion" and "Occurence" sources of evidence. As the only experimental, database and text-mining evidence for interactions with the YFHM_ECOLI COG appears to apply to eukaryotic genes, we will turn off the experimental, database and text-mining data types and search using only the three evidence types listed above, by altering the "Active Prediction Methods" at the bottom of the page and then "Updating Parameters".
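To see why switching evidence types off changes the scores, here is a minimal sketch. It assumes that the per-channel confidences are combined as if they were independent probabilities, i.e. 1 - product(1 - s); this is a simplification and may not match the exact scoring STRING uses. The channel names and values are hypothetical.

```python
import math

# Hypothetical per-channel confidence scores for one predicted interaction.
evidence = {
    "neighbourhood": 0.60,
    "fusion": 0.00,
    "occurrence": 0.50,
    "experiments": 0.70,
    "databases": 0.00,
    "textmining": 0.40,
}


def combined_score(scores, active):
    """Combine the active channels as 1 - product(1 - s).

    Assumption: channels are treated as independent sources of evidence.
    """
    remaining = math.prod(1.0 - scores[name] for name in active)
    return 1.0 - remaining


all_channels = list(evidence)
three_channels = ["neighbourhood", "fusion", "occurrence"]

score_all = combined_score(evidence, all_channels)
score_three = combined_score(evidence, three_channels)
print(round(score_all, 3), round(score_three, 3))
```

Under this assumption, removing channels can only lower or preserve the combined score, so some interactions drop out of the list once the eukaryote-dominated evidence is switched off.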

Q1e: By browsing the evidence from these three data types, can you describe a set of proteins you think likely to interact with YFHM_ECOLI or bacterial relatives of this protein?

Q1f: Can you suggest a function for any of these likely interactors?

To contrast the information that can be learned about a protein from STRING with what can be learned from the simple flat file database searching and sequence similarity searching that we covered in a previous exercise, try using SRS to investigate the function of YFHM_ECOLI (i.e. view the Swiss-Prot entry for the protein).

Q1g: Do you find out more or less about the protein using STRING or Swiss-Prot?

Exercise 2

STRING has two different modes of operation, "Protein" and "COG" - the page behind the "interactors wanted:" link on the STRING front page describes the difference between them. The two modes offer a choice between greater selectivity (in the Protein mode) and greater sensitivity (in the COG mode) - the COG mode has the problem that in some cases the groups are very inclusive, e.g. most serine/threonine kinases are in the same group, and it is then impossible to tease apart interactions specific to a subset of the members of this group. In contrast, the Protein mode predicts interaction partners for one protein in a specific species, helping overcome problems of this kind.
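The trade-off between the two modes can be sketched with a toy network. All protein names, group names and interactions below are hypothetical, and the functions only illustrate the principle (pooling evidence over a whole group versus looking at one protein in one species); they are not how STRING is implemented.

```python
# Hypothetical protein-level interactions (suffix = species).
protein_edges = [
    ("kinase1_sp1", "substrateA_sp1"),
    ("kinase2_sp1", "substrateB_sp1"),
    ("kinase1_sp2", "substrateA_sp2"),
]

# Hypothetical assignment of proteins to orthologous groups.
group_of = {
    "kinase1_sp1": "KINASE_GROUP",
    "kinase2_sp1": "KINASE_GROUP",
    "kinase1_sp2": "KINASE_GROUP",
    "substrateA_sp1": "GROUP_A",
    "substrateA_sp2": "GROUP_A",
    "substrateB_sp1": "GROUP_B",
}


def partners_protein_mode(edges, protein):
    """Partners of one specific protein in one specific species."""
    found = set()
    for a, b in edges:
        if a == protein:
            found.add(b)
        elif b == protein:
            found.add(a)
    return found


def partners_group_mode(edges, groups, protein):
    """Partner groups of the whole group that the protein belongs to."""
    own = groups[protein]
    found = set()
    for a, b in edges:
        group_a, group_b = groups[a], groups[b]
        if group_a == own and group_b != own:
            found.add(group_b)
        elif group_b == own and group_a != own:
            found.add(group_a)
    return found


print(partners_protein_mode(protein_edges, "kinase1_sp1"))
print(partners_group_mode(protein_edges, group_of, "kinase1_sp1"))
```

In this toy case the group-level query also reports GROUP_B, although that interaction belongs to a different kinase in the same group; this is the loss of selectivity described above.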

As we will now look at an example focusing on eukaryotic signalling pathways, which involve many different kinases, the mode of choice, for the reasons outlined above, is the Protein mode. If you are interested, you could try working through this exercise again, this time using the COG mode - this should help highlight some of the differences between the modes.

We begin by using STRING to gain a quick overview of some of the interactions associated with the intensively-studied proto-oncogene RAF, in humans.

Enter the Swiss-Prot identifier for the gene, RAF1_HUMAN, into STRING in Protein mode, or submit the protein sequence.

The specificity of the Protein mode can be demonstrated by repeating this search using a RAF1 sequence from a different animal - RAF1_XENLA.

Submit this sequence to STRING.

As this protein is not in the set of precomputed proteins used by STRING, the information is mapped onto the most similar sequences that STRING can find, which happen to be in chickens.

Q2a: Can you browse through the results of this search to find out where the species of the sequences is described?

Note also that within the vertebrates, most of the evidence used by STRING (experiments, pathway databases) is associated with mammals (and of course human in particular), and there is only a minimal amount of information obtainable from STRING about the interaction network associated with RAF1 in the chicken.

Repeat the search using the mouse gene RAF1_MOUSE.

This should give you an amount of information intermediate between that of the chicken and human networks.

When working with a protein as intensively studied as RAF, the information provided by STRING can be somewhat overwhelming, making it difficult "to see the wood for the trees". One way of limiting the information presented to a more manageable amount is to restrict the evidence used to construct the interaction networks to the more reliable forms of evidence (probably the most appropriate is the manually curated information from pathway databases such as KEGG). This allows one to use STRING to ask the question "which pathways is my protein of interest already well known to be involved in?".
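Restricting a network to one kind of evidence amounts to a simple filter, sketched below. The edges and their evidence tags are hypothetical and do not come from STRING.

```python
# Hypothetical network edges, each tagged with the evidence types supporting it.
edges = [
    ("QUERY", "P1", {"databases", "experiments", "textmining"}),
    ("QUERY", "P2", {"textmining"}),
    ("QUERY", "P3", {"experiments", "textmining"}),
    ("QUERY", "P4", {"databases"}),
]


def restrict_to(edge_list, allowed):
    """Keep only edges supported by at least one allowed evidence type,
    and keep only the allowed evidence on each surviving edge."""
    return [(a, b, ev & allowed) for a, b, ev in edge_list if ev & allowed]


database_only = restrict_to(edges, {"databases"})
for a, b, ev in database_only:
    print(a, b, sorted(ev))
```

Only the partners backed by curated database entries remain, which gives a much smaller network to read.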

Based on the results obtained using human RAF, restrict the evidence to that coming only from databases.

Q2b: Make a list of five different pathways the protein is involved in.

Now switch on the experiments and textmining evidence.

Q2c: Is the coverage of these two evidence types greater than that of the database evidence? Why do you think this is the case?

Rather than just using STRING to browse and explore the interactions associated with a particular protein, it can also be used to ask the question "How are two or more genes related to each other in the protein interaction network?", e.g. are they directly interacting with each other? If not, are they still close to one another in the network?

To ask such a question, use the "List" option on the STRING front page. We will use this to look at the relationship between Interleukin-4 (IL4_HUMAN) and Raf (RAF1_HUMAN).

Enter the Swiss-Prot IDs of these two proteins in the list, and examine the network view.

Are they direct interactors?
Click "+" under the network to expand the network around the two proteins - can you identify any proteins that interact with both of these two proteins? What kind of interactions are these?
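The three questions asked here (are two proteins directly linked, do they share interactors, and how far apart are they in the network) can be sketched on a toy undirected network. The node names and edges are hypothetical and say nothing about the real IL4/RAF1 network.

```python
from collections import deque

# Hypothetical undirected network, stored as node -> set of neighbours.
network = {
    "QUERY_1": {"X1", "X2"},
    "QUERY_2": {"X2", "X3"},
    "X1": {"QUERY_1"},
    "X2": {"QUERY_1", "QUERY_2"},
    "X3": {"QUERY_2"},
}


def directly_linked(graph, a, b):
    """True if a and b are neighbours."""
    return b in graph[a]


def shared_interactors(graph, a, b):
    """Nodes that are neighbours of both a and b."""
    return graph[a] & graph[b]


def network_distance(graph, start, goal):
    """Number of edges on a shortest path (breadth-first search),
    or None if the two nodes are not connected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph[node]:
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


print(directly_linked(network, "QUERY_1", "QUERY_2"))
print(shared_interactors(network, "QUERY_1", "QUERY_2"))
print(network_distance(network, "QUERY_1", "QUERY_2"))
```

Expanding the network with "+" corresponds to adding more neighbours to such a graph, which can reveal shared interactors that were not visible at first.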
