Text Database Searching with SRS Tutorial

Aidan Budd and Venkata P. Satagopam

Tuesday 28th November 2006

Biological data is stored in many different databases throughout the world, e.g. see the list on the ExPASy website or the Nucleic Acids Research site. Obviously, for these databases to be useful, one needs to be able to query them efficiently. The purpose of this tutorial is to help you improve your ability to issue effective queries against such databases using methods other than direct comparison of amino acid or nucleotide sequences. We will focus on the swissprot database and SRS (a powerful system designed to query biological databases).

Exercise 1

However, before focusing on querying such databases, we will begin by giving you practice at another important task - that of finding the websites of the databases you are interested in. Obviously, without knowing where these are, you would not be able to query them!

As already discussed, there are databases available containing an amazing range of different kinds of data.

Search the internet (you could try ExPASy, NAR, Google, PubMed) for freely-accessible online databases that focus on collecting the following types of data:
phosphorylation sites
disordered regions of proteins
microarray experiment data

For each of the databases, look to see if there are records for any of the proteins you personally are interested in - if you have no such proteins in mind, then try SRC_HUMAN and PRIO_CHICK.

Q1: Can you find more than one database for any of these types of data?

Q2: Do you encounter any problems while searching for entries related to your proteins of interest within these databases? Why do you think these problems exist?

This exercise aims to demonstrate:

Exercise 2

When studying a particular protein, it obviously makes sense to begin by finding out what is already known about it. One obvious way of doing this is to carry out literature searches using PubMed or the ISI web of science, to identify relevant articles. However, your protein of interest may well have been the focus of intensive study, with many hundreds or even thousands of articles published about it, e.g. SRC (more than 13000 articles) or p53 (more than 40000 articles).

In such cases, a good first step towards getting an overview of the structure and function of the protein is to consult the record for the protein in one of the manually-annotated sequence databases, e.g. swissprot. In this exercise we will look at the swissprot entries for a couple of different proteins, to get an idea of the kind of information that can be found there. We will look at the sequences using SRS, a system for examining and linking between records from different databases, already introduced to you in Venkata's presentation.

Consider the two sequences MLL4_HUMAN and SRC_HUMAN. For each of these sequences answer the following questions:

Q3: What are the different names that the proteins are known by?

Searching other databases for records associated with your protein of interest will sometimes require knowing the name of the protein. However, many proteins have several names, often because they have been studied by different people at different times. When working with a particular protein it is often useful to make a list of the different names it is known by. This can help, for example, when searching literature databases for publications related to the protein. Note that, if you have the option, you can reduce ambiguity by searching databases using either a direct sequence comparison (e.g. a sequence similarity search) or an accession number from one of the primary sequence databases, rather than a "name" of a sequence.

Q4a: Is there any evidence that either of these proteins is phosphorylated?

Q4b: If so, has the responsible kinase been discovered?

Q4c: Which papers/evidence have been used to identify these phosphorylation sites?

Using swissprot to check this kind of information can be useful if you have identified a phosphorylation site experimentally on your protein and want to check whether it has already been identified by someone else, to see whether your discovery is novel or not.

Q5: Has the 3D structure of either of these proteins been solved?

Interpreting the result of mutational analysis is best done in the context of 3D structural information about the protein being mutated - just one of many reasons why it can be useful to learn whether the 3D structure of a protein has already been solved.

Q6: Find out which biological processes the proteins are involved in by following the links to GO.

A useful summary of the function of a protein can be obtained from examining its description using the Gene Ontology (GO). Finding other proteins described using the same GO terms is one way of obtaining a set of functionally-related proteins.

Q7: By following the links from the GO pages, find the names of at least two other proteins that share the same biological process as these.

Note that for many of the above questions there exist specific databases that could be consulted to obtain (probably more comprehensive) answers, e.g. Phospho.ELM for the identification of known phosphorylation sites, or MSD for information on 3D structures associated with a sequence. However, swissprot provides a useful first port of call for checking many different kinds of information, all in the same place.

This exercise aims to demonstrate:

Exercise 3

It has happened to all of us - we sit in a talk, hear that our protein of interest is involved in a story we didn't know about e.g. it interacts with a protein we have never heard of. So then, once we get back to the lab, we want to find out more about this protein. However, we only know the name of the gene, not yet its sequence, which we will surely need to find out more about it. To do this, a typical first step is to use SRS to query a protein sequence database to identify the sequence.

To demonstrate that this situation is not always as easy as one might hope, in this exercise we ask you to use SRS to find the swissprot record for the well-known oncogene k-ras, involved in the ras-raf-mapk signalling pathway - specifically, you should look for the sequence of the human k-ras.

Q8: Going to the SRS start page, does the "quick text search" bring you quickly to the appropriate record?

(In some cases such a search can lead you directly to your sequence of interest.)

Q9: Can you help reduce the number of sequences hit by specifying the name of the organism you are searching for the sequence in as well as the gene name? (Perhaps you already tried this.) Hint: try combining the words with different Boolean operators e.g. &. Does this make a difference?
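To see why the choice of Boolean operator matters, here is a minimal Python sketch that treats each search term as the set of records it hits. The record identifiers and descriptions are invented purely for illustration, and the substring matching is a simplification that says nothing about how SRS itself indexes text.

```python
# Toy records: every identifier and description here is invented for illustration.
records = {
    "REC1": "ras-related protein, human",
    "REC2": "k-ras oncogene, mouse",
    "REC3": "k-ras oncogene, human",
    "REC4": "tyrosine kinase, human",
}


def hits(term):
    """Return the set of record IDs whose text contains the term."""
    return {rid for rid, text in records.items() if term in text}


ras = hits("ras")
human = hits("human")

both = ras & human            # AND: records containing both terms
either = ras | human          # OR: records containing at least one term
ras_not_human = ras - human   # BUT NOT: "ras" records that lack "human"

print(sorted(both), sorted(either), sorted(ras_not_human))
```

In this toy set, combining the terms with AND gives fewer hits than either term alone, while OR gives more.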

An intensively-studied protein of this kind is almost certainly in the swissprot database, rather than the much larger uniprot. Therefore it might help to limit your search to swissprot (a subset of uniprot).

Q10: Does limiting your search to swissprot reduce the number of sequences hit to a manageable number?

If there are still too many sequences hit to easily identify the record you need, then try using the "Query Builder" instead of the "quick text search". The first step is then to choose the database to be searched (in this case swissprot). Experiment with the page until you find out how to specify this database.

The next step is to choose the Standard Query Form. This allows you to limit the fields that each term should be searched in - you may need to search for "ras" in either the "gene name" field or in the "synonym" field - hopefully this will allow you to finally pin down the sequence you are looking for!
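The effect of restricting a term to one field can be sketched in Python as follows. This is only an illustration under stated assumptions: the records, the field names ("gene", "description", "comment") and the whole-word matching rule are all hypothetical and are not taken from swissprot or from SRS.

```python
import re

# Toy records with separate fields; all values are invented for illustration.
records = [
    {"id": "REC1", "gene": "K-RAS", "description": "GTPase", "comment": ""},
    {"id": "REC2", "gene": "RAF1", "description": "Protein kinase",
     "comment": "Activated by Ras"},
    {"id": "REC3", "gene": "GAP1", "description": "Ras GTPase-activating protein",
     "comment": ""},
]


def words(text):
    """Split text into lower-case alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(term, fields=None):
    """Return IDs of records containing term as a whole word.

    If fields is None, every field except "id" is searched.
    """
    found = []
    for rec in records:
        names = fields or [key for key in rec if key != "id"]
        if any(term.lower() in words(rec[name]) for name in names):
            found.append(rec["id"])
    return found


all_fields = search("ras")
gene_only = search("ras", fields=["gene"])
print(all_fields, gene_only)
```

Searching every field returns all three toy records, while searching only the hypothetical "gene" field returns one.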

Q11: Why are so many records that are NOT the sequence you are looking for picked up in these searches? Look in some of these records to see where the word "ras" is found to check this.

This exercise aims to:

Exercise 4

As is clear, one can use SRS to identify a group of sequences fulfilling a complicated set of criteria. For example, one might be interested in looking for characteristics in the sequences of a particular group of proteins - perhaps they tend to include large numbers of potential phosphorylation sites.

For the purpose of this exercise we will look to obtain the set of sequences of those human proteins located in the nucleus that contain tyrosine kinase domains.

To do this, we go to the "Query Builder" as before, and construct a query that will only select those swissprot records with nuclear localisation (this is done using GO - the GO number for nuclear proteins is "GO:0005634", which can be searched for by choosing the "Links: DBxref" search field - this stands for "external database reference", which is where swissprot stores the accession numbers used to link through to other databases).

In the same way, we select those proteins predicted to contain a PFAM tyrosine kinase domain, which has the accession number "PF07714".

Finally, we restrict the search to those whose organism is "human".

(Note that this exercise demonstrates how useful the cross-linking of swissprot to many different databases can be.)
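The logic of the three-part query above (a GO cross-reference, a PFAM cross-reference and an organism restriction, all combined with AND) can be sketched in Python. The two accession numbers are those quoted in the text; the records, their identifiers and the dictionary layout are invented for illustration and do not reflect the real swissprot format.

```python
GO_NUCLEUS = "GO:0005634"     # GO number quoted in the text
PFAM_TYR_KINASE = "PF07714"   # PFAM accession quoted in the text

# Toy records: identifiers, organisms and cross-references are invented.
records = [
    {"id": "REC1", "organism": "human", "xrefs": {"GO:0005634", "PF07714"}},
    {"id": "REC2", "organism": "mouse", "xrefs": {"GO:0005634", "PF07714"}},
    {"id": "REC3", "organism": "human", "xrefs": {"PF07714"}},
    {"id": "REC4", "organism": "human", "xrefs": {"GO:0005634"}},
]


def select(records, organism, required_xrefs):
    """Return IDs of records from the organism that carry every required cross-reference."""
    required = set(required_xrefs)
    return [
        rec["id"]
        for rec in records
        if rec["organism"] == organism and required <= rec["xrefs"]
    ]


selected = select(records, "human", [GO_NUCLEUS, PFAM_TYR_KINASE])
print(selected)
```

A record is selected only if all three conditions hold at once, which is why each extra criterion can only shrink the result set.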

Q12: Do you think that this query has identified the complete set of human nuclear protein kinases? If not, then why not?

To collect the amino acid sequences of this disappointingly small set of proteins, move to the "Tools" bar on the left side of the screen, choose to save the selected entries in fasta format. Note that by going to the "Export" page, one can also dump the results of the search into an excel-format file, which can also be very useful.
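For readers who have not met the fasta format before, the following Python sketch writes a couple of entries in that layout: a header line beginning with ">" followed by the sequence on one or more lines. The identifiers, descriptions and sequences are invented, the helper name to_fasta is hypothetical, and the 60-character line width is just a common convention, not something SRS is known to use.

```python
def to_fasta(entries, width=60):
    """Format (identifier, description, sequence) tuples as fasta text."""
    lines = []
    for identifier, description, sequence in entries:
        lines.append(f">{identifier} {description}")
        # Break the sequence into lines of at most `width` characters.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Toy entries: identifiers, descriptions and sequences are invented.
entries = [
    ("REC1", "toy protein one", "MKT" * 25),
    ("REC2", "toy protein two", "MSEQ"),
]

fasta_text = to_fasta(entries)
print(fasta_text)
```

Many sequence analysis tools accept files in this layout, which is one reason it is a convenient format for saving search results.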

Exercise 5 (only if you have time)

Now a somewhat more complicated exercise, to try out several different aspects of what you have hopefully learnt in this tutorial.

Collect the fasta sequences of the Schizosaccharomyces pombe and Saccharomyces cerevisiae proteins that have transmembrane domains, and are involved in mitosis.
