Text Database Searching with SRS Tutorial
Aidan Budd and Venkata P. Satagopam
Tuesday 28th November 2006
Biological data is stored in many different databases throughout the
world; see, for example, the lists on the ExPASy website or the Nucleic
Acids Research site. For these databases to be useful, one needs to
be able to query them efficiently. The purpose of this tutorial is to
help you improve your ability to issue effective queries against such
databases using methods other than direct comparison of amino acid or
nucleotide sequences. We will focus on the swissprot database and SRS
(a powerful system designed to query biological databases).
Exercise 1
However, before focusing on querying such databases, we will begin by
giving you practice at another important task: finding the websites of
the databases you are interested in. Without knowing where these are,
you would not be able to query them!
As already discussed, there are databases available containing an
amazing range of different kinds of data.
Search the internet (you could try ExPASy, NAR, Google, PubMed) for
freely accessible online databases that focus on collecting the
following types of data:
phosphorylation sites
disordered regions of proteins
microarray experiment data
For each of the databases, look to see if there are records for any
of the proteins you personally are interested in. If you do not have
any such proteins in mind, then try SRC_HUMAN and PRIO_CHICK.
Q1: Can you find more than one database for any of these types
of data?
Q2: Do you encounter any problems while searching for entries
related to your proteins of interest within these databases? Why do you
think these problems exist?
This exercise aims to demonstrate:
- the large number of databases available
- that there are often several different databases available for a
given kind of data
- that it is often not easy to find the records you are interested
in within a database
Exercise 2
When studying a particular protein, it makes sense to begin by finding
out what is already known about it. One obvious way of doing this is
to carry out literature searches using PubMed or the ISI Web of
Science to identify relevant articles. However, your protein of
interest may well have been the focus of intensive study, with many
hundreds or even thousands of articles published about it, e.g. SRC
(more than 13000 articles) or p53 (more than 40000 articles).
In such cases, a good first step towards getting an overview of the
structure and function of the protein is to consult the record for the
protein in one of the manually-annotated sequence databases, e.g.
swissprot. In this exercise we will look at the swissprot entries for a
couple of different proteins, to get an idea of the kind of information
that can be found there. We will look at the sequences using SRS, a
system for examining and linking between records from different
databases, already introduced to you in Venkata's presentation.
Consider the two sequences MLL4_HUMAN and SRC_HUMAN.
For each of these sequences answer the following questions:
Q3: What are the different names that the proteins are known by?
When searching through other databases to find records associated with
your protein of interest, you will sometimes need to know the name of
the protein. However, many proteins have several names, often because
they have been studied by different people at different times. When
working with a particular protein it is often useful to make a list of
the different names it is known by. This can help, for example, when
searching literature databases for publications related to the
protein. Note that, if you have the option, you can reduce ambiguity
by searching databases not with a "name" of a sequence but with either
a direct sequence comparison (e.g. a sequence similarity search) or an
accession number from one of the primary sequence databases.
Q4a: Is there any evidence that either of these proteins is
phosphorylated?
Q4b: If so, has the responsible kinase been discovered?
Q4c: Which papers/evidence have been used to identify these
phosphorylation sites?
Using swissprot to check this kind of information can be useful if you
have identified a phosphorylation site experimentally on your
protein and want to check whether it has already been identified by
someone else, to see whether your discovery is novel or not.
Q5: Has the 3D structure of either of these proteins been solved?
Interpreting the result of mutational analysis is best done in the
context of 3D structural information about the protein being mutated -
just one of many reasons why it can be useful to learn whether the 3D
structure of a protein has already been solved.
Q6: Find out which biological processes the proteins are involved in
by following the links to GO.
A useful summary of the function of a protein can be obtained from
examining its description using the Gene Ontology (GO). Finding other
proteins described using the same GO terms is one way of obtaining a
set of functionally-related proteins.
Q7: By following the links from the GO pages, find the names of at
least two other proteins that share the same biological process as
these.
Note that for many of the above questions there exist specific
databases that could be consulted to obtain (probably more
comprehensive) answers, e.g. Phospho.ELM for the identification of
known phosphorylation sites, or MSD for information on 3D structures
associated with a sequence. However, swissprot is a useful first port
of call for checking lots of different kinds of information all in the
same place.
This exercise aims to demonstrate:
- that swissprot is a very useful resource in the way it can
provide an easily-overviewed summary of a huge amount of data collected
about a particular protein
- examples of some of the different kinds of information contained
in a swissprot entry
Exercise 3
It has happened to all of us: we sit in a talk and hear that our
protein of interest is involved in a story we didn't know about, e.g.
it interacts with a protein we have never heard of. Once we get back
to the lab, we want to find out more about this protein. However, we
only know the name of the gene, not yet its sequence, which we will
surely need in order to find out more. A typical first step is to use
SRS to query a protein sequence database to identify the sequence.
To demonstrate that this task is not always as easy as one might
hope, in this exercise we ask you to use SRS to find the swissprot
record for the well-known oncogene k-ras, involved in the ras-raf-mapk
signalling pathway. Specifically, you should look for the sequence of
the human k-ras.
Q8: Going to the SRS start page, does the "quick text search" bring
you quickly to the appropriate record?
(In some cases such a search can lead you directly to your sequence of
interest.)
Q9: Can you reduce the number of sequences hit by specifying the name
of the organism you are searching in as well as the gene name?
(Perhaps you already tried this.) Hint: try combining the words with
different Boolean operators, e.g. &. Does this make a difference?
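The effect of combining search words with Boolean operators can be
illustrated with a small sketch. This is not SRS itself: the records
are made-up toy data, the `hits` helper is hypothetical, and Python's
set operators `&` and `|` simply stand in for AND and OR.

```python
# Toy records: identifier -> free text. Made-up data for illustration
# only; these are not real swissprot entries.
RECORDS = {
    "REC1": "k-ras oncogene homo sapiens",
    "REC2": "k-ras oncogene mus musculus",
    "REC3": "ras-related protein rab homo sapiens",
    "REC4": "tyrosine kinase src homo sapiens",
}


def hits(word):
    """Return the set of record identifiers whose text contains the word."""
    return {rec_id for rec_id, text in RECORDS.items() if word in text.split()}


gene_hits = hits("k-ras")
organism_hits = hits("sapiens")

either = gene_hits | organism_hits  # OR: records matching at least one word
both = gene_hits & organism_hits    # AND: records matching both words

print(len(either), "records match with OR")
print(len(both), "records match with AND")
```

In this toy data set, OR returns every record, whereas AND narrows the
result to the single record that mentions both the gene and the
organism.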
An intensively-studied protein of this kind is almost certainly in
swissprot, the manually-annotated subset of the much larger uniprot.
It might therefore help to limit your search to swissprot.
Q10: Does limiting your search to swissprot reduce the number of
sequences hit to a manageable number?
If there are still too many sequences hit to easily identify the record
you need, then try using the "Query Builder" instead of the "quick text
search". The first step is then to choose the database to be searched
(in this case swissprot). Experiment with the page until you find out
how to specify this database.
The next step is to choose the Standard Query Form. This allows you to
limit the fields that each term should be searched in. You may need to
search for "ras" in either the "gene name" field or the "synonym"
field. Hopefully this will allow you to finally pin down the sequence
you are looking for!
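The difference between searching all text and searching a specific
field can be sketched as follows. The records are simplified,
hand-written toy data (not copied from swissprot), and the field names
`gene`, `synonyms` and `description` as well as both search functions
are assumptions made for this illustration, not the SRS field names.

```python
# Simplified, hand-written toy records; not copied from swissprot.
RECORDS = [
    {"id": "REC1", "gene": "KRAS", "synonyms": ["K-RAS", "RAS"],
     "description": "GTPase KRas"},
    {"id": "REC2", "gene": "RASA1", "synonyms": [],
     "description": "Ras GTPase-activating protein 1"},
    {"id": "REC3", "gene": "RAF1", "synonyms": [],
     "description": "Kinase activated by Ras"},
    {"id": "REC4", "gene": "SOS1", "synonyms": [],
     "description": "Ras exchange factor"},
]


def search_all_text(term):
    """Match the term anywhere in any field (substring match)."""
    term = term.lower()
    matches = []
    for record in RECORDS:
        text = " ".join([record["gene"], record["description"],
                         *record["synonyms"]]).lower()
        if term in text:
            matches.append(record["id"])
    return matches


def search_gene_fields(term):
    """Match the term only as a whole gene name or whole synonym."""
    term = term.lower()
    matches = []
    for record in RECORDS:
        names = [record["gene"], *record["synonyms"]]
        if term in (name.lower() for name in names):
            matches.append(record["id"])
    return matches


print(search_all_text("ras"))
print(search_gene_fields("ras"))
```

In this sketch the all-text search returns every record, while the
field-restricted search returns only the record whose gene name or
synonym is exactly the search term.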
Q11: Why are so many records that are NOT the sequence you are looking
for picked up in these searches? Look in some of these records to see
where the word "ras" is found to check this.
This exercise aims to:
- introduce you to searching in SRS
- demonstrate that one often has to alter the parameters of
such a search to be able to find the sequence(s) of interest
Exercise 4
As is clear, one can use SRS to identify a group of sequences fulfilling
a complicated set of criteria. For example, one might be interested in
looking for characteristics in the sequences of a particular group of
proteins; perhaps they tend to include large numbers of potential
phosphorylation sites.
For the purpose of this exercise we will look to obtain the set of
sequences of those human proteins located in the nucleus that contain
tyrosine kinase domains.
To do this, we go to the "Query Builder" as before, and construct a
query that will select only those swissprot records with nuclear
localisation. This is done using GO: the GO accession for the nucleus
is "GO:0005634", which can be searched for by choosing the
"Links: DBxref" search field. DBxref stands for "external database
reference", which is where swissprot stores the accession numbers used
to link through to other databases.
In the same way, we select those proteins predicted to contain a PFAM
tyrosine kinase domain, which has the accession number "PF07714".
Finally, we restrict the search to those whose organism is "human".
(Note that this exercise demonstrates how useful the cross-linking of
swissprot to many different databases can be.)
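The logic of this combined query can be sketched in a few lines. The
records are made-up toy data, and the `dbxref` and `organism` keys and
the `select` function are hypothetical stand-ins for the corresponding
search fields. Only the two accessions "GO:0005634" and "PF07714" are
taken from the text above.

```python
# Made-up records; "dbxref" stands in for the cross-reference field
# that holds accession numbers from other databases.
RECORDS = [
    {"id": "REC1", "organism": "human", "dbxref": {"GO:0005634", "PF07714"}},
    {"id": "REC2", "organism": "human", "dbxref": {"GO:0005634"}},
    {"id": "REC3", "organism": "mouse", "dbxref": {"GO:0005634", "PF07714"}},
    {"id": "REC4", "organism": "human", "dbxref": {"PF07714"}},
]


def select(records, organism, required_xrefs):
    """Keep records from the organism that carry every required accession."""
    required = set(required_xrefs)
    return [record["id"] for record in records
            if record["organism"] == organism
            and required <= record["dbxref"]]


selected = select(RECORDS, "human", ["GO:0005634", "PF07714"])
print(selected)
```

Each criterion removes records: one toy record lacks the PFAM
accession, one comes from another organism, and one lacks the GO
accession, so only a single record remains.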
Q12: Do you think that this query has identified the complete set of
human nuclear protein kinases? If not, then why not?
To collect the amino acid sequences of this disappointingly small set
of proteins, move to the "Tools" bar on the left side of the screen and
choose to save the selected entries in fasta format. Note that by going
to the "Export" page, one can also dump the results of the search into
an Excel-format file, which can also be very useful.
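For readers unfamiliar with fasta format, the following minimal sketch
shows what such a file looks like: each entry has a header line
starting with ">" followed by the sequence, wrapped over several lines.
The `to_fasta` function, the line width of 60 and the example entries
are assumptions for illustration; the sequences are made up and this
is not the output of SRS.

```python
def to_fasta(entries, width=60):
    """Format (header, sequence) pairs as fasta text.

    Each entry gets a ">" header line followed by the sequence,
    wrapped at `width` characters per line.
    """
    lines = []
    for header, sequence in entries:
        lines.append(">" + header)
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "".join(line + "\n" for line in lines)


# Made-up entries; the sequences are not real proteins.
entries = [
    ("REC1 hypothetical protein", "MKT" * 30),
    ("REC2 hypothetical protein", "MAGS"),
]
fasta_text = to_fasta(entries)
print(fasta_text, end="")
```

A file in this format can be read by most sequence analysis programs.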
Exercise 5 (only if you have time)
Now a somewhat more complicated exercise, to try out several different
aspects of what you have hopefully learnt in this tutorial.
Collect the fasta sequences of the Schizosaccharomyces pombe and
Saccharomyces cerevisiae proteins that have transmembrane domains and
are involved in mitosis.