Working with SRS

Biocomputing Unit
Sequence Analysis Service
Gibson Group
EMBL

Practical Courses on Basic Sequence Analysis

Introduction to Sequence Retrieval System

Sequence Analysis Service, November 26th-29th 2001

Introduction:

Information about molecular biology is available in a growing number of biological databases, each specializing in a particular kind of data. For example, all known enzymatic activities are contained in the ENZYME database, tertiary structures of known proteins are stored in the PDB databank, EMBL contains nucleotide sequences, SWISS-PROT contains protein sequences (mainly translated from the EMBL nucleotide database), the PROSITE database contains a dictionary of conserved sites and patterns in proteins, OMIM contains information on mutations involved in human genetic disease, etc.

Many of these databases have direct cross-references to other databases. For example EMBL entries are cross-linked to SWISS-PROT, PDB etc. SWISS-PROT has the most cross-links of all. These cross-references help to track down related information stored in another database. Of course, cross-indexing is only useful if you have a way to use the link - and this is where SRS comes in.

SRS links all the databases that have been made available inside it into a network. It has a powerful indexing system to build indices, and a parser that can be adapted to the very different formats used by the databases (unfortunately, database developers have not settled on common formats). Once its indices are built, a database is ready to be queried with the SRS query language.

The query language supports three boolean operators (BUTNOT !, AND &, OR |) and two link operators (left link <, right link >). These operators allow complex queries to be built and existing queries to be combined.

The general syntax of the query language is

[database-field:keyword]
or

[database-field:keyword & keyword] > database

When you want to search with only part of a keyword, use the wild card '*', for example [swissprot-des:transmem*]
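The syntax above can be sketched as plain string building. This is only an illustration of the query format described in this section: the helper names field_query and link_to are hypothetical and are not part of SRS.

```python
def field_query(database, field, *keywords):
    """Build a query such as [swissprot-des:transmem*].

    Several keywords are joined with the AND operator (&),
    following the pattern [database-field:keyword & keyword].
    """
    if not keywords:
        raise ValueError("at least one keyword is required")
    return "[{}-{}:{}]".format(database, field, " & ".join(keywords))


def link_to(query, database):
    """Append a right link (>) from a query to another database."""
    return "{} > {}".format(query, database)


# A wild card search in the description field of SWISS-PROT.
print(field_query("swissprot", "des", "transmem*"))

# A query on the id field, linked to PROSITE.
print(link_to(field_query("swissprot", "id", "acha_human"), "prosite"))
```

The resulting strings can be pasted into the WWW interface or passed to getz.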

You can use the SRS query language from the Unix command line (command getz ) or through a WWW interface (command wgetz ). Nearly everything in SRS can be done in either mode, but the ease and efficiency may vary.

We will concentrate on the WWW interface for most of our searches. However, we will also use the command line interface ( getz ) to generate a small database from a group of subsequences that have some property in common.

The WWW interface has many different pages like

Select Library to select the databanks to search with
Query Form to fill out keywords and to select the fields to be displayed etc.
Query Result to examine the result
Query Manager to manipulate the queries
View Manager to manipulate the way of displaying the result
LINK to link to other databases
Download Query to save the results to your disk
Databank to list all available databanks under SRS

and even a few more pages which will be explained during the practicals.

A typical database explained

SRS uses the flat file (text file) format for databases. A database contains one or more Entries. An Entry contains several different fields. For example a SWISS-PROT entry looks like the following (Some fields are omitted in this example).

Fig.1 Sample SWISS-PROT entry
Click on the highlighted id name to see the full entry

ID   ACH1_CAEEL     STANDARD;      PRT;   498 AA.        <= ID is the name of the field (identification); ACH1_CAEEL is the id name
AC   P48180;                                             <= Accession number
DT   01-FEB-1996 (REL. 33, CREATED)
DE   ACETYLCHOLINE RECEPTOR LIKE PROTEIN, ALPHA-TYPE PRECURSOR.   <= Description field
OS   CAENORHABDITIS ELEGANS.
FT   TRANSMEM    231    252       POTENTIAL.             <= Feature Table field
SQ   SEQUENCE   498 AA;  57169 MW;  570ECAC1 CRC32;      <= Sequence field
     MSVCTLLISC AILAAPTLGS LQERRLYEDL MRNYNNLERP VANHSEPVTV
     DVDEKNQVVY VNAWLDYTWN DYNLVWDKAE YGNITDVRFP AGKIWKPDVL
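To make the entry layout concrete, here is a minimal sketch of how such a flat-file entry could be split into fields by its two-letter line codes. It is not the SRS parser; the function name parse_entry is hypothetical, and the sample text is the shortened entry from Fig.1, so the sequence holds only the first 100 of the 498 residues.

```python
SAMPLE_ENTRY = """\
ID   ACH1_CAEEL     STANDARD;      PRT;   498 AA.
AC   P48180;
DT   01-FEB-1996 (REL. 33, CREATED)
DE   ACETYLCHOLINE RECEPTOR LIKE PROTEIN, ALPHA-TYPE PRECURSOR.
OS   CAENORHABDITIS ELEGANS.
FT   TRANSMEM    231    252       POTENTIAL.
SQ   SEQUENCE   498 AA;  57169 MW;  570ECAC1 CRC32;
     MSVCTLLISC AILAAPTLGS LQERRLYEDL MRNYNNLERP VANHSEPVTV
     DVDEKNQVVY VNAWLDYTWN DYNLVWDKAE YGNITDVRFP AGKIWKPDVL
"""


def parse_entry(text):
    """Group the lines of one entry by their two-letter line code.

    Lines without a line code (they start with blanks) are treated
    as sequence data and concatenated with the spaces removed.
    """
    fields = {}
    sequence_parts = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        code = line[:2]
        rest = line[5:].strip()
        if code.strip():
            fields.setdefault(code, []).append(rest)
        else:
            sequence_parts.append(rest.replace(" ", ""))
    fields["sequence"] = "".join(sequence_parts)
    return fields


entry = parse_entry(SAMPLE_ENTRY)
print(entry["ID"][0].split()[0])   # the id name
print(entry["AC"][0])              # the accession number line
print(len(entry["sequence"]))      # residues present in the shortened sample
```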



Getting Started

Finding more about databases

Some people are scared when they first see this page. How about you? This is the main database selection page. Throughout the SRS session you will see that the database names are highlighted as hyperlinks. Clicking on one takes you to another page with more information about that database, including the fields it contains and the date it was indexed. If you are lucky, you will also get a WWW link to the 'home' site of the database.

Browsing the indices


Querying the database

Using the boolean operators ( BUTNOT, AND, OR ):

There are rather a lot of kinases with very diverse functions but you can narrow down your query using these operators.
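One way to think about the three boolean operators is as set operations on the lists of entries that two queries return. The sketch below models this with Python sets. The entry names are invented placeholders, not real query results, and the function names are hypothetical.

```python
# Invented result sets, for illustration only.
kinase_hits = {"ENTRY_A", "ENTRY_B", "ENTRY_C", "ENTRY_D"}
tyrosine_hits = {"ENTRY_B", "ENTRY_D", "ENTRY_E"}


def srs_and(a, b):
    """AND (&): entries found by both queries."""
    return a & b


def srs_or(a, b):
    """OR (|): entries found by at least one of the queries."""
    return a | b


def srs_butnot(a, b):
    """BUTNOT (!): entries found by the first query but not by the second."""
    return a - b


print(sorted(srs_and(kinase_hits, tyrosine_hits)))
print(sorted(srs_or(kinase_hits, tyrosine_hits)))
print(sorted(srs_butnot(kinase_hits, tyrosine_hits)))
```

AND and BUTNOT narrow a large result such as the kinases, while OR widens it.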

Using the view

Saving the query result

We will see how to save the query results to your disk. It is important to know how to do this, e.g. if you want to make a multiple alignment of the sequences.

Linking the queries with other databases

SRS has two link operators, namely '<' (left link) and '>' (right link). The query expression

[swissprot-id:acha_human] > prosite
means: link the swissprot acha_human entry to prosite. The result is the linked prosite entry.

[swissprot-id:acha_human] < prosite
means the same link as above, but this time the result is the swissprot entry. This is a way to check that the entry has a cross-reference to prosite.
Note: The PROSITE database entries each cover a whole protein family and all matching SWISS-PROT entries are cross-referenced.
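The difference between the two link operators can be modelled with a small cross-reference table. This is a toy model of the behaviour described above, not SRS code: the table contents, the PROSITE identifier PS_TOY_1 and the function names are made up for illustration.

```python
# Toy table: SWISS-PROT entry -> PROSITE entries it cross-references.
XREFS = {
    "acha_human": {"PS_TOY_1"},
    "no_xref_prot": set(),  # an entry without any PROSITE cross-reference
}


def right_link(source_ids, xrefs):
    """A > B: return the entries of B that the given A entries point to."""
    targets = set()
    for source_id in source_ids:
        targets |= xrefs.get(source_id, set())
    return targets


def left_link(source_ids, xrefs):
    """A < B: return the A entries that have at least one link into B."""
    return {source_id for source_id in source_ids if xrefs.get(source_id)}


print(right_link({"acha_human"}, XREFS))    # the linked PROSITE entry
print(left_link({"acha_human"}, XREFS))     # the SWISS-PROT entry itself
print(left_link({"no_xref_prot"}, XREFS))   # empty: nothing to link to
```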

Check the Library Network to see how different libraries are linked directly or indirectly! The numbers shown there are the number of steps needed to link them. You can click on the numbers to get a clear picture about how the intermediate libraries are used for indirect links between any two databases.

Another simple way to link queries: in the Query Manager page, type q1 > embl and press the Expression button. You can type any valid SRS query expression here!

More Examples:

[swissprot-id:acha_human] > prosite > swissprot

This retrieves all the SWISS-PROT entries for the protein family to which acha_human belongs!
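The chained link can be pictured as two hops through a cross-reference table: first from the starting entry to its PROSITE entries, then back to every SWISS-PROT entry that cites one of them. The sketch below uses invented data. Only acha_human comes from the text, and family_members is a hypothetical helper.

```python
# Toy table: SWISS-PROT entry -> PROSITE entries it cross-references.
SWISSPROT_TO_PROSITE = {
    "acha_human": {"PS_TOY_1"},
    "toy_relative": {"PS_TOY_1"},
    "toy_unrelated": {"PS_TOY_2"},
}


def family_members(entry_id, xrefs):
    """Model of [swissprot-id:X] > prosite > swissprot."""
    # First hop: the PROSITE entries linked from the starting entry.
    patterns = xrefs.get(entry_id, set())
    # Second hop: every SWISS-PROT entry sharing one of those PROSITE entries.
    return {sid for sid, refs in xrefs.items() if refs & patterns}


print(sorted(family_members("acha_human", SWISSPROT_TO_PROSITE)))
```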

enzyme < pdb

This very simple query gives all the enzyme database entries for which the 3D structure is known!

The concept of Sub-Entries and their usage

We now know what a SWISS-PROT entry looks like. Sub-entries are made from annotated subsequences. For example, one of the Feature Table fields of SWISS-PROT entry 7UP1_DROME contains
FT   ZN_FING     236    260       C4-TYPE.
ZN_FING means Zinc finger. The corresponding subsequence (from 236 to 260) is

C R G S R N C P I D Q H H R N Q C Q Y C R L K K C

SRS treats this as a sub-entry. This is useful since you can collect and use them like any other database using other sequence analysis packages. (We will make a database of Sub-entries with the command line interface getz).
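Extracting a sub-entry amounts to reading the start and end positions from the FT line and cutting that range out of the sequence. The sketch below shows the arithmetic, assuming positions are 1-based and inclusive (236 to 260 gives 25 residues, matching the subsequence above). The surrounding sequence is a placeholder padded with X, not the real 7UP1_DROME sequence, and the function names are hypothetical.

```python
def parse_feature(ft_line):
    """Split an FT line into (key, start, end, description)."""
    parts = ft_line.split(None, 4)  # parts[0] is the line code "FT"
    key, start, end = parts[1], int(parts[2]), int(parts[3])
    description = parts[4] if len(parts) > 4 else ""
    return key, start, end, description


def subsequence(sequence, start, end):
    """Cut out positions start..end, counted from 1 and inclusive."""
    return sequence[start - 1:end]


ZINC_FINGER = "CRGSRNCPIDQHHRNQCQYCRLKKC"
# Placeholder sequence: X padding around the zinc finger, for illustration only.
toy_sequence = "X" * 235 + ZINC_FINGER + "X" * 10

key, start, end, description = parse_feature(
    "FT   ZN_FING     236    260       C4-TYPE."
)
print(key, start, end, description)
print(subsequence(toy_sequence, start, end))
```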

Searching other databases

We will try the OMIM database.
OMIM was originally a book (Mendelian Inheritance in Man) of short review articles on genes and genetically based traits of medical importance. It is now available on-line. Most of its information is text, so relevant articles can be retrieved by keyword search.

Searching OMIM can be fun!


Getting information from secondary sources

SRS can provide links to external web sites, such as Medline, or to a database that is not supported at EMBL. For example, Swiss-Prot entries for yeast proteins have links to SGD at Stanford. Sometimes you can learn a lot by using these external links. We'll follow up links about the human inherited disease Phenylketonuria (PKU).


getz: Command line interface to SRS