Working with SRS
Practical Courses on Basic Sequence Analysis
Introduction to Sequence Retrieval System
Sequence Analysis Service, November 26th-29th 2001
Information about molecular biology is available in a growing number of
biological databases, each specializing in a particular kind of data.
For example, all known enzymatic activities are contained
in the ENZYME database, tertiary structures of known proteins are stored in the
PDB databank, EMBL contains nucleotide sequences, SWISS-PROT
contains protein sequences (mainly translated from the EMBL nucleotide database),
the PROSITE database contains a dictionary of conserved sites and patterns in proteins,
OMIM contains information on mutations involved in human genetic disease, etc.
Many of these databases have direct cross-references to other databases.
For example, EMBL entries are cross-linked to SWISS-PROT, PDB etc.
SWISS-PROT has the most cross-links of all.
Cross-references help to track down related information stored in another
database. Of course, cross-indexing is only useful if you have a way to
use the link, and this is where SRS comes in.
SRS makes a network of all the databases that have been made available inside
it. It has a powerful indexing system to build indices, and a parser that
can be adapted to parse all the very different formats used by the databases.
(Unfortunately database developers have not bothered to use common formats).
Once the index is built, the database is ready for probing with SRS using
SRS's query language.
The query language supports three boolean operators
(BUTNOT !, AND &, OR |)
and two link operators (left link <, right link >).
These operators allow queries to be combined into more complex queries.
The general syntax of the query language is
[database-field:keyword & keyword] > database
To search with only part of a keyword, use the wild card '*'. For example, kinas* matches both kinase and kinases.
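The general syntax above can be assembled mechanically. The following is a minimal Python sketch, not part of SRS: the helper names field_query and link are invented for illustration, and the only query strings used are ones that appear in this handout.

```python
def field_query(database, field, *keywords, op="&"):
    """Build a bracketed field query such as [swissprot-id:acha_human].

    Hypothetical helper: it only formats text following the general
    syntax [database-field:keyword & keyword] given in this handout.
    """
    if not keywords:
        raise ValueError("at least one keyword is required")
    joined = f" {op} ".join(keywords)
    return f"[{database}-{field}:{joined}]"


def link(query, target, operator=">"):
    """Append a link operator ('>' or '<') and a target database name."""
    if operator not in (">", "<"):
        raise ValueError("operator must be '>' or '<'")
    return f"{query} {operator} {target}"


q = field_query("swissprot", "id", "acha_human")
print(q)                    # [swissprot-id:acha_human]
print(link(q, "prosite"))   # [swissprot-id:acha_human] > prosite
```

The resulting strings are plain text; they still have to be typed into a query form or passed to getz to be evaluated by SRS.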
You can use the SRS query language from the Unix command line (command getz) or through a WWW interface (command wgetz). Nearly everything in SRS can be done in either mode, but the ease and
efficiency may vary.
We will concentrate on the WWW interface for most of our searches. However,
we will use the command line interface (getz) to generate a small database from
a group of subsequences that have some property in common.
The WWW interface has many different pages, such as
- Select Library: to select the databanks to search with
- Query Form: to fill out keywords and to select the fields to be displayed etc.
- Query Result: to examine the result
- Query Manager: to manipulate the queries
- View Manager: to manipulate the way of displaying the result
- LINK: to link to other databases
- Download Query: to save the results to your disk
- Databank: to list all available databanks under SRS
and even a few more pages which will be explained during the practicals.
A typical database explained
SRS uses the flat file (text file) format for databases.
A database contains one or more Entries. An Entry contains several different
fields. For example, a SWISS-PROT entry looks like the following
(some fields are omitted in this example).
Fig.1 Sample SWISS-PROT entry
ID   ACH1_CAEEL     STANDARD;      PRT;   498 AA.
DT   01-FEB-1996 (REL. 33, CREATED)
DE   ACETYLCHOLINE RECEPTOR LIKE PROTEIN, ALPHA-TYPE PRECURSOR.
OS   CAENORHABDITIS ELEGANS.
FT   TRANSMEM    231    252       POTENTIAL.
SQ   SEQUENCE   498 AA;  57169 MW;  570ECAC1 CRC32;
     MSVCTLLISC AILAAPTLGS LQERRLYEDL MRNYNNLERP VANHSEPVTV
     DVDEKNQVVY VNAWLDYTWN DYNLVWDKAE YGNITDVRFP AGKIWKPDVL

ID          => name of the field (in this case identification)
ACH1_CAEEL  => id name
DE          => Description field
FT          => Feature Table field
SQ          => Sequence field
(The figure also labels an Accession number; its line is not part of this excerpt.)
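Because every line of a flat-file entry starts with a two-letter field code, an entry can be split into fields with very little code. The sketch below is only an illustration in Python and is not how SRS itself parses entries. It assumes the usual SWISS-PROT layout (code in the first two columns, content from column 6, continuation lines starting with blanks); the function name parse_entry is made up.

```python
from collections import defaultdict

# Lines taken from the sample entry in Fig.1 (the DT line is left out).
SAMPLE = """\
ID   ACH1_CAEEL     STANDARD;      PRT;   498 AA.
DE   ACETYLCHOLINE RECEPTOR LIKE PROTEIN, ALPHA-TYPE PRECURSOR.
OS   CAENORHABDITIS ELEGANS.
FT   TRANSMEM    231    252       POTENTIAL.
SQ   SEQUENCE   498 AA;  57169 MW;  570ECAC1 CRC32;
     MSVCTLLISC AILAAPTLGS LQERRLYEDL MRNYNNLERP VANHSEPVTV
     DVDEKNQVVY VNAWLDYTWN DYNLVWDKAE YGNITDVRFP AGKIWKPDVL
"""


def parse_entry(text):
    """Group the lines of one entry by their two-letter field code."""
    fields = defaultdict(list)
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        code = line[:2]
        if code.strip():
            # A new field line: remember its code and store the content.
            current = code
            fields[current].append(line[5:].strip())
        elif current is not None:
            # A continuation line (e.g. sequence data) belongs to the last field.
            fields[current].append(line.strip())
    return dict(fields)


entry = parse_entry(SAMPLE)
id_name = entry["ID"][0].split()[0]
# The first SQ line is the header; the remaining lines hold the residues.
sequence = "".join(entry["SQ"][1:]).replace(" ", "")
print(id_name, len(sequence))
```

Only the first 100 residues are printed in the figure, so the parsed sequence is shorter than the 498 AA stated on the SQ line.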
- Log in to any of your favorite machines.
- Type netscape4. This brings up the Netscape browser.
- Open the following URL: You get this documentation online!
- Click here (Click with the middle mouse button!)
to start SRS in another window
- You get the SRS home page. Now click on the start button.
You are in the Top Page where you can select databases
of interest and continue.
Finding more about databases
Some people are scared when they first see this page! How about you?
This is the main database selection page. Throughout the SRS session you
will see that the database names
are hyperlinked. Clicking on one of them takes you
to another page where you will see more information
about the database. It also shows you the different fields that this particular
database has, the indexing date of the database, etc. If you are lucky
you will also get a WWW link to the 'home' site of the database.
Browsing the indices
- Select SWISSPROT by clicking on the check box. Press the
- Now you see the Query Form page. There are two types of query forms
available. Click on the 'alternate queryform' button to see the
other one. Choose the one in which all the field names are
hypertext links. All the available data fields are
highlighted and listed with the input field to be filled.
Click on the ID data field. You get the Data-field information page.
It gives you information about the indexing date, status,
number of ids, etc. You can browse through this.
- Type some id names (if you know any) or just click on the List Values
button. Check what you get.
- Similarly, try out other fields.
- Try to find the other way of browsing indices by clicking on the database name.
- How many different types of molecules are present in the EMBL database?
- How many molecule types are unannotated?
- What are the different divisions in the EMBL database?
- Find the shortest author names in the SWISSPROT database
(tip: the regexp character '?' matches any
single character of any value)
- Is there an EMBL entry with a single base?
- Try a similar search with SWISSPROT!
- How many transmembrane regions exist in SWISSPROT with a length of between, say, 20 and 40 amino acids?
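The two wild cards mentioned in this handout ('*' for part of a keyword, '?' for exactly one character) can be tried out away from SRS. The sketch below uses Python's fnmatch module purely as a stand-in, since it gives '*' and '?' the same meaning; it is not SRS's own matcher, and the index values are invented.

```python
from fnmatch import fnmatchcase

# Made-up index values, for illustration only.
values = ["kinase", "kinases", "kinesin", "li", "wu", "lee"]


def match(pattern, candidates):
    """Return the candidates that match a '*' / '?' wild card pattern."""
    return [value for value in candidates if fnmatchcase(value, pattern)]


print(match("kinas*", values))  # ['kinase', 'kinases']
print(match("??", values))      # ['li', 'wu']
```

A pattern made only of '?' characters selects values of exactly that length, which is the idea behind the tip for finding the shortest author names.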
Querying the database
- Go to Top Page. (Click on the TOP PAGE button)
- Select the SWISSPROT database by clicking the check box. Then press the Continue button.
- Type kinase in the All Text field and press the Do Query button.
- Check how many entries you get.
- Question: Is this going to be useful? If so, when?
- Click on an entry name and check whether you get the full entry! Browse through this
entry and see the cross-links to other databanks.
- Go back to the Query Form page using Netscape's Back button. Type kinase in the Description field. Check the result. You can narrow down your search by querying the
appropriate field. With the All Text field you get a lot of entries and
with the Description field you get far fewer.
- Go back to the Query Form page. Click the Description field check box beneath Include in List. Press the Do Query button.
- You see that the Description field now appears in the query result! You can add the other fields in
the same way. Try the Keywords field.
Using the boolean operators (BUTNOT, AND, OR):
There are rather a lot of kinases with very diverse functions,
but you can narrow down your query using boolean operators.
- Go back to the Query Form page. Type tyrosine & kinase in the Description field. Now you should get the total number of entries.
- Go back to the Query Form page. Type tyrosine & kinase ! receptor and check the result!
- Question: did you manage to exclude all transmembrane receptor kinases?
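The three boolean operators behave like set operations on the lists of matching entries. The Python sketch below models this with ordinary sets; the entry names and description words are invented, and the grouping (tyrosine & kinase) ! receptor is an assumption made explicit in the code rather than a statement about how SRS orders the operators.

```python
# Hypothetical entries and the words in their descriptions (invented data).
descriptions = {
    "ENTRY_A": {"tyrosine", "kinase", "receptor"},
    "ENTRY_B": {"tyrosine", "kinase"},
    "ENTRY_C": {"serine", "kinase"},
    "ENTRY_D": {"tyrosine", "phosphatase"},
}


def having(word):
    """Return the names of all entries whose description contains the word."""
    return {name for name, words in descriptions.items() if word in words}


and_result = having("tyrosine") & having("kinase")   # AND, written & in SRS
or_result = having("tyrosine") | having("kinase")    # OR, written | in SRS
butnot_result = and_result - having("receptor")      # BUTNOT, written ! in SRS

print(sorted(and_result))     # ['ENTRY_A', 'ENTRY_B']
print(sorted(or_result))      # ['ENTRY_A', 'ENTRY_B', 'ENTRY_C', 'ENTRY_D']
print(sorted(butnot_result))  # ['ENTRY_B']
```

Note that BUTNOT only removes entries whose description actually contains the excluded word, which is worth keeping in mind for the question above.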
Using the view
- Go to Query Manager Page. You see all your queries are registered here. You can reinspect the
- Select the last query. Select the Sequence Simple view under the 'using the view' option. Click on the VIEW button. Now you get the result in a table. You can add and remove the fields
you want through the View Manager page.
Saving the query result
We will see how to save the query results to your disk.
It is important to know how to do this, e.g. if you want to
make a multiple alignment of the sequences.
Linking the queries with other databases
SRS has two link operators, namely '<' (left link) and '>' (right link).
The query expression
[swissprot-id:acha_human] > prosite
means: link the swissprot acha_human entry to prosite! You get the prosite entry.
[swissprot-id:acha_human] < prosite
means the same as above, but this time you get the swissprot entry. This
confirms that a cross-reference to prosite is available.
Note: The PROSITE database entries each cover
a whole protein family and all matching SWISS-PROT entries are cross-referenced.
Library Network to see how different libraries are linked directly
or indirectly! The numbers shown there are the number of steps needed to link them.
You can click on the numbers to get a clear picture about how the
intermediate libraries are used for indirect links between any two databases.
Another simple way to link queries: In the Query Manager page,
type q1 > embl and press the Expression button. You can
type any valid SRS query syntax here!
- Go to the Query Manager page. Select the last query.
- Click on the Link button.
- Now you are in the Database Link Page. Select the EMBL database and
press the Continue button.
- You get all the EMBL entries which are cross-linked from SWISSPROT.
[swissprot-id:acha_human] > prosite > swissprot
This retrieves all the SWISS-PROT entries for the protein family to which acha_human belongs!
enzyme < pdb
This very simple query gives all the enzyme database entries for which the 3D
structure is known!
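The difference between the two link operators can be modelled with a small cross-reference table. The following Python sketch is only a toy model, not SRS code: ACHA_HUMAN and NEUROTR_ION_CHANNEL are names used in this handout, while ENTRY_X, ENTRY_Y and the three function names are invented.

```python
# Toy cross-reference table: swissprot entry -> prosite entries it points to.
# ENTRY_X and ENTRY_Y are hypothetical entries added for illustration.
xrefs = {
    "ACHA_HUMAN": {"NEUROTR_ION_CHANNEL"},
    "ENTRY_X": {"NEUROTR_ION_CHANNEL"},
    "ENTRY_Y": set(),
}


def link_right(sources, table):
    """Model of 'sources > target': the target entries linked from sources."""
    return set().union(*(table.get(name, set()) for name in sources))


def link_left(sources, table):
    """Model of 'sources < target': the source entries that have a link."""
    return {name for name in sources if table.get(name)}


def link_back(targets, table):
    """Follow the links backwards: all source entries citing any target."""
    return {name for name, cited in table.items() if cited & targets}


family = link_right({"ACHA_HUMAN"}, xrefs)                # like '> prosite'
checked = link_left({"ACHA_HUMAN", "ENTRY_Y"}, xrefs)     # like '< prosite'
members = link_back(family, xrefs)                        # like '> prosite > swissprot'

print(sorted(family))   # ['NEUROTR_ION_CHANNEL']
print(sorted(checked))  # ['ACHA_HUMAN']
print(sorted(members))  # ['ACHA_HUMAN', 'ENTRY_X']
```

In this model the right link returns entries of the target database, the left link returns the starting entries that have a cross-reference, and chaining two links collects the whole family.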
The concept of Sub-Entries and their usage
We now know what a SWISS-PROT entry looks like.
Sub-entries are made from annotated subsequences!
For example, one of the Feature Table fields of a SWISS-PROT entry is
FT ZN_FING 236 260 C4-TYPE.
ZN_FING means Zinc finger. The corresponding subsequence (from 236 to 260) is
C R G S R N C P I D Q H H R N Q C Q Y C R L K K C
SRS treats this as a sub-entry. This is useful since you can
collect sub-entries and use them like any other database with other sequence
analysis packages. (We will make a database of
sub-entries with the command line interface getz.)
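The relation between a Feature Table line and its subsequence is simple arithmetic: positions are counted from 1 and both ends are included, so 236 to 260 covers 25 residues. A minimal Python sketch follows, assuming a placeholder parent sequence; the 'X' padding is invented so that the zinc finger from the text sits at positions 236-260, and the function name sub_entry is made up.

```python
# The zinc finger subsequence quoted in the text (25 residues).
ZN_FINGER = "CRGSRNCPIDQHHRNQCQYCRLKKC"


def sub_entry(sequence, start, end):
    """Return residues start..end, counted from 1 and inclusive, as in an FT line."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("feature coordinates fall outside the sequence")
    return sequence[start - 1:end]


# Placeholder parent sequence: the 'X' residues are NOT real data.
parent = "X" * 235 + ZN_FINGER + "X" * 40

ft_line = "FT ZN_FING 236 260 C4-TYPE."
parts = ft_line.split()
key, start, end = parts[1], int(parts[2]), int(parts[3])

print(key, end - start + 1)           # ZN_FING 25
print(sub_entry(parent, start, end))  # CRGSRNCPIDQHHRNQCQYCRLKKC
```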
- Exercise: Go to the Query Form page. Get all the sub-entries for
zinc finger. Remember the FtKey field. Then click on one of
the sub-entries to see what it looks like. Can you get the original (parent) entry?
- Exercise: Retrieve all the intron subsequences from
the EMBL database. How many entries are there?
Searching other databases
We will try the OMIM database.
OMIM was originally a book (Mendelian Inheritance in Man) of short
review articles on genes and genetically based traits of medical
importance. Now it is on-line. Most of its information is text, so that relevant articles can
be retrieved searching through keywords.
Searching OMIM can be fun!
- Go to Top Page
- Select the OMIM databank and press Continue.
- To learn more about DMD (Duchenne Muscular Dystrophy),
type dmd in the Keyword field.
- Have you ever wondered why you sneeze? Well, this time
search with sneeze inside OMIM and find out about that.
(You can even type 'ACHOO'.)
Getting information from secondary sources
SRS can provide links to external web sites, such as Medline, or,
for example, to a database that is not supported at EMBL. Thus Swiss-Prot
entries for yeast proteins have links to SGD at Stanford. Sometimes
you can learn a lot by using these external links. We'll follow up
links about the human inherited disease phenylketonuria (PKU).
- Go to Top Page
- Select the Swiss-Prot database and press Continue.
Type in phenylketonuria.
Which of the listed entries has classical phenylketonuria?
Use the OMIM and PAHdb links to find out about mutations causing PKU.
- Are PKU mutations dominant or recessive?
- How many different classes of mutations (splice,
missense, etc.) can cause PKU?
- Is it true that missense mutations are the rarest class?
- Are the mutations evenly distributed through the sequence?
- Do SWISSPROT Features list all classes of mutation?
- For an even more striking example of human genetic disease,
go to the Marfan Syndrome entry and repeat the exercise.
Find the swissprot entries from the previous phenylketonuria search that do not have
an omim cross-reference.
getz: Command line interface to SRS
- Log in to any Unix machine, e.g. tau
- Type prepare srs
- If prepare does not work, type
source /app/prepare/prepare first and then type prepare srs
- type ' getz -help'. If this does not work then you have
- Try out the following commands!
Remember you have to protect the query syntax with inverted commas ''.
Lists all the databanks available under SRS.
getz -info swissprot
Prints more information about swissprot library indexing data,
data fields etc.,
Prints only the identification name.
getz '[swissprot-id:acha_human]' -e
Prints the entire entry. The default format is swissprot.
getz '[swissprot-id:acha_human]' -e -sf fasta
Prints the entire entry in FASTA format.
getz '[swissprot-id:acha_human]' -e -sf gcg
Prints the entry in GCG format.
getz '[swissprot-id:acha_human]' -f 'id seq' -sf fasta
Prints only the id line and sequence in FASTA format.
getz '[swissprot-id:acha_human]' -f 'id des seq'
Prints the identification name, description and sequence.
getz '[swissprot-id:acha_human] > prosite' -e
Prints the entire entry of prosite (NEUROTR_ION_CHANNEL).
getz '[swissprot-id:acha_human] > prosite > swissprot'
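When getz is called from a script, the quoting problem mentioned above disappears if the command is built as a list of arguments. The Python sketch below is an illustration under stated assumptions: it uses only the options shown in this handout (-e, -f, -sf), the helper name getz_argv is invented, and nothing is executed, so it runs on a machine without SRS. shlex.join needs Python 3.8 or later.

```python
import shlex


def getz_argv(query, entire=False, fields=None, seq_format=None):
    """Build the argument list for a getz call (hypothetical helper).

    Only the options that appear in this handout are supported:
    -e (entire entry), -f (fields to print) and -sf (sequence format).
    """
    argv = ["getz", query]
    if entire:
        argv.append("-e")
    if fields:
        argv += ["-f", " ".join(fields)]
    if seq_format:
        argv += ["-sf", seq_format]
    return argv


argv = getz_argv("[swissprot-id:acha_human]",
                 fields=["id", "seq"], seq_format="fasta")

# shlex.join shows how the command would have to be typed in a shell.
print(shlex.join(argv))
# getz '[swissprot-id:acha_human]' -f 'id seq' -sf fasta
```

On a machine where getz is installed, the same list could be handed to subprocess.run(argv) without any extra quoting, because no shell is involved.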