Keyword/Text-based Searching of Bioinformatic Databases

As we have seen, biological data is stored in many different databases throughout the world. For these databases to be useful, one needs to be able to query them efficiently. The purpose of this tutorial is to help you issue effective queries against such databases using methods other than direct comparison of amino acid or nucleotide sequences. We will focus on searches carried out using SRS (Sequence Retrieval System). Many sites around the world offer SRS for querying a range of different databases - one of the very useful features of the system is its ability to interface with many different types of data. This link describes a list of such servers available around the world. However, we will be working with only one of these servers, that provided by Reinhard Schneider's group at the EMBL in Heidelberg, maintained and developed by Venkata Satagopam. Follow this link to access this server.

Simple Queries Using SRS

Tasks - Simple SRS query

Using the basic/simple search tool provided on the first page of the EMBL Heidelberg SRS server, try a range of different queries to find one that retrieves only the UniProt record for the human Aurora Kinase B protein (e.g. you could start with "aurora kinase human").

Note that these searches will likely hit two kinds of proteins - those that are evolutionarily related to the human Aurora Kinase B protein, and those that are evolutionarily unrelated to it.

Questions

  • How can you distinguish between proteins that are related and unrelated to your query protein?
  • Look at the database entries for some of the proteins unrelated to your query protein, and try to work out why they have been included in the set of database hits. In which fields are matches being made to query words?

If you have time, try a similar exercise for the proteins:

  • human NR1H2
  • Saccharomyces cerevisiae p27kip1
  • human Interleukin-5 receptor alpha chain (IL5R)

Note that if you have the UniProt accession number for the sequences of interest it becomes trivial to make a query that finds only the sequences you are interested in.

Note also that essentially what you are doing here is querying the databases using text/keywords, to try and find the records that are related to your protein of interest. Put differently, we are attempting to find those records in the databases that are similar to, or the same as, our protein of interest. There are many different ways to consider similarity of proteins e.g. one might be interested in proteins with similar lengths, similar net charge, similar sub-cellular localisation - one can (and indeed people do) use similarities of this kind to address biological questions. One could argue that almost all bioinformatic approaches are based on identifying similarity between biological entities of interest. Thus, in this example, we are searching for proteins considered SIMILAR to our protein of interest in that they are associated with similar, or the same, keywords/text in a database.

Note that searching biological databases in this way depends on the sequences having previously been annotated - either a human annotator or a computer has analysed the record and has decided to associate a particular text with it. Thus, such methods cannot be used with un-annotated sequences. In most of the rest of this course we will be concentrating on how to carry out your own analyses of such sequences - amongst other things this gives us the ability to work with as-yet-unannotated sequences, along with allowing us to critically assess the annotations of others.

Searching SRS Using Boolean Logic

One way to construct more powerful/discriminating/useful queries is to make use of logic operators in your searches:

  • & (AND) - selects records only if both terms surrounding the operator are in the search field e.g. "nuclear & cytoplasmic" would return records where the search fields contain both these terms
  • | (OR) - selects records if either of the terms surrounding the operator is in the search field e.g. "nuclear | cytoplasmic" would return records where the search fields contain either or both of the terms
  • ! (NOT) - selects records that do NOT contain the term that comes after the operator e.g. "! human" would return only records that do not contain the word human

These are known as Boolean operators as they come from Boolean logic.
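To make the behaviour of the three operators concrete, here is a minimal Python sketch that mimics them with set operations. The record names and their word sets are invented for illustration, and this is not how SRS is implemented internally; it only shows which records each operator would keep.

```python
# Toy "database": hypothetical record names mapped to the words in their searchable text.
records = {
    "REC_A": {"nuclear", "cytoplasmic", "human"},
    "REC_B": {"nuclear", "mouse"},
    "REC_C": {"cytoplasmic", "human"},
    "REC_D": {"membrane", "yeast"},
}


def having(term):
    """Return the names of all records whose text contains the given term."""
    return {name for name, words in records.items() if term in words}


everything = set(records)

both = having("nuclear") & having("cytoplasmic")    # nuclear & cytoplasmic
either = having("nuclear") | having("cytoplasmic")  # nuclear | cytoplasmic
not_human = everything - having("human")            # ! human

print(sorted(both))
print(sorted(either))
print(sorted(not_human))
```

Note that NOT is expressed here as "all records minus the matching ones", which is why a bare "! human" query can return a very large result set on a real database.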

Task - Use Boolean operators in SRS searches

Try creating queries using Boolean operators for the same problems as for the first exercise. Do they allow you to come up with simpler/more-effective queries?

Note that operators of this kind can often be used in other search systems. For example, they can be used within Entrez, and so can be used (for example) to create more sophisticated PubMed queries.

SRS Query Builder With More Sophisticated Searches

By going via the SRS "Query Builder" we have the opportunity to construct more sophisticated queries that should be more efficient at identifying the particular records we are interested in within a database.

The first step is to choose an appropriate database against which to make the searches.

Then the query builder pages can be used to construct searches which look for words only in particular database fields. For example, you could look for records that contain the words "Nuclear Receptor" in the gene description fields only. This would help remove hits to proteins that interact with such receptors but are not themselves such receptors.
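The effect of restricting a search to one field can be sketched in a few lines of Python. The two records, their field names ("description", "comment") and the search() helper are all hypothetical; real UniProt records and the SRS Query Builder have many more fields.

```python
# Two invented records: one is a receptor, the other only mentions receptors in a comment.
records = [
    {"id": "REC_1",
     "description": "Nuclear receptor subfamily 1 member",
     "comment": "Binds DNA."},
    {"id": "REC_2",
     "description": "Coactivator protein",
     "comment": "Interacts with nuclear receptor proteins."},
]


def search(records, phrase, fields=None):
    """Return ids of records containing phrase (case-insensitive).

    If fields is None every field except "id" is searched,
    otherwise only the named fields are searched.
    """
    phrase = phrase.lower()
    hits = []
    for rec in records:
        names = fields if fields is not None else [k for k in rec if k != "id"]
        if any(phrase in rec.get(name, "").lower() for name in names):
            hits.append(rec["id"])
    return hits


all_fields_hits = search(records, "nuclear receptor")
description_hits = search(records, "nuclear receptor", fields=["description"])

print(all_fields_hits)
print(description_hits)
```

Searching every field returns both records, whereas restricting the search to the description field keeps only the record that is itself described as a nuclear receptor.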

Task - Using SRS "Query Builder"

Try creating queries using the Query Builder to restrict searches to particular database fields, for the same problems as in the first exercise. Does this allow you to come up with simpler/more-effective queries?

Again, functionality of this kind is also incorporated into other search engines. For example, the "[ti]" in the Entrez PubMed query "aurora [ti]" restricts the search to PubMed records that contain the word "aurora" in their title; a match found only in the abstract or in some other part of the PubMed record does not count. These techniques can also be used to query other databases available via NCBI's Entrez - for example, searching for protein sequences in the same way you have already tried using SRS (toggle the box next to "Search" on the left side to "Protein").

Try carrying out similar searches using Entrez - again, with the goal of specifying a query that will retrieve only the record for the human aurora kinase. You may find it useful to examine this list of queryable fields and this description of Boolean search operators.
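If you later want to run such Entrez queries from a script rather than from the web page, NCBI's E-utilities accept the same query strings. The sketch below only builds the request URLs for the esearch endpoint and does not contact NCBI. The example terms are illustrative, and the "[Title]" and "[Organism]" tags for the protein database are given as an assumption that you should check against the list of queryable fields. Note that Entrez expects its Boolean operators written in upper case (AND, OR, NOT).

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Base address of the E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an esearch request URL for one Entrez database."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


# Title-restricted PubMed query combining two words with AND.
pubmed_url = esearch_url("pubmed", "aurora[ti] AND kinase[ti]")

# Protein query; the field tags here are assumptions to be verified.
protein_url = esearch_url("protein", "aurora kinase B[Title] AND human[Organism]")

print(pubmed_url)
print(protein_url)

# Decoding the URL again shows that the query string survives the encoding.
print(parse_qs(urlparse(pubmed_url).query)["term"][0])
```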

Making Searches More General Using Wildcards

Another feature of SRS that can be used to create more powerful queries is the use of wildcards.

  • "*" which matches 0 or more of any kind of character e.g. the query "Ubiquit*" would match words within the search path such as "Ubiquitination", "Ubiquitinated", "Ubiquitous" etc.
  • "?" which matches only one such character e.g. the query "Tob?" would match "Toby" but not "Tobias"

For example, if we wanted to collect proteins that are similar to human Aurora Kinase B (which has the record name in UniProt of AURKB_HUMAN) we could try the following query "AURKB_*".

Note that such a search may identify records other than those evolutionarily related to AURKB_HUMAN - can you think why this might happen?

Again, note that wildcard functionality is often incorporated into other search engines - for example, as for Boolean logical operators, these are also available in Entrez PubMed.
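The two wildcards described in this section behave much like shell-style patterns, so Python's standard fnmatch module can be used to experiment with them. This is only an analogy: fnmatchcase is case-sensitive, which may differ from how SRS treats case, and the word list is invented.

```python
from fnmatch import fnmatchcase

words = ["Ubiquitin", "Ubiquitination", "Ubiquitous", "Toby", "Tobias", "Tob"]

# "*" matches zero or more characters of any kind.
star_hits = [w for w in words if fnmatchcase(w, "Ubiquit*")]

# "?" matches exactly one character, so "Tob" (too short) and "Tobias" (too long) fail.
question_hits = [w for w in words if fnmatchcase(w, "Tob?")]

print(star_hits)
print(question_hits)
```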

Using SRS Inter-Database Links

A special and very powerful property of SRS is its ability to link between different databases that are stored within the system. These links allow you to collect the set of records from one database that are cross-referenced to the records you have identified using your search.

SRS has two link operators:

  • '<' (left link) - the result of q1 < q2 will list all entries in q1 which have cross references to entries in q2
  • '>' (right link) - the result of q1 > q2 will list all entries in q2 which have cross references to entries in q1

To use these link operators effectively, visit the "history" page. Here you will find a list of all the different searches you have carried out on SRS so far.

Tasks - using SRS linking capabilities

As an example, take the results from one of your searches to identify the human Aurora Kinase B record where the search hit many different records.

You can get the list of all entries in the PDB database of structures that are linked to the records in this query by selecting the query and then linking to pdb:

"query > pdb"

To find the set of records in "query" that are linked to at least one pdb record, carry out the reverse query:

"query < pdb"

Carry out a query to identify the set of PFAM domains found in human proteins:

  • Restrict the search to just "Organism Name"
  • Use a "right link"

To turn this around, try using "left link" to find out how many UniProt human proteins have not been annotated as containing a PFAM domain.
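The link examples above ("query > pdb" and "query < pdb") can be mimicked with plain Python sets, which may help when deciding which operator a task needs. The record names and cross-references below are invented, and the two helper functions are hypothetical stand-ins for the SRS operators rather than anything SRS itself provides.

```python
# Hypothetical cross-references from protein records to PDB entries.
xrefs = {
    "PROT_1": {"PDB_A", "PDB_B"},
    "PROT_2": set(),
    "PROT_3": {"PDB_C"},
}
pdb = {"PDB_A", "PDB_B", "PDB_C", "PDB_D"}

# Pretend this is the result of an earlier text search.
query = {"PROT_1", "PROT_2"}


def link_to(q, target):
    """Mimic 'q > target': the entries of target cross-referenced by records in q."""
    return {t for name in q for t in xrefs.get(name, set()) if t in target}


def linked_in(q, target):
    """Mimic 'q < target': the records of q with at least one cross-reference into target."""
    return {name for name in q if xrefs.get(name, set()) & target}


structures = link_to(query, pdb)
with_structure = linked_in(query, pdb)
without_structure = query - with_structure

print(sorted(structures))
print(sorted(with_structure))
print(sorted(without_structure))
```

The last line shows one way to think about "records that have no link": take the original query and remove the records that do have one.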

Now, we will use OMIM (a disease database) to find all genes believed to have a genetic influence on cancer. Using links to the LENZYME database of enzymes, and then to PDB, we can collect the set of enzymes believed to be involved in cancer for which there is a PDB entry.

It should also be possible to restrict the search to only those enzymes that are annotated as being hydrolases. If you have time, try and carry out this query also.

Try and think of some datasets that could be interesting for you to work with that could be generated using links - try and carry out these searches with the help of the instructors.

Downloading Sequences Locally Using SRS

Please don't try to download too many sequences! It can cause problems and make things very slow. Stick to at most 100 sequences or so.

We have looked so far at several different ways of querying databases using SRS. However, if you had to visit the page of each record you hit, cutting and pasting the information you need into another file, it could take a long time to extract everything. Thankfully, it is possible to simply download the results of your SRS query to your local computer.

Tasks - downloading sequences from SRS

Using some of the features described above, use SRS to find a set of records that are likely to be related to the human Aurora Kinase B.

Download the fasta sequences for these records to a local file on your desktop.

Hint: First select some sequences. Then click on the "Tools" icon on the left side of the screen...

From here you can use these sequences in a range of different bioinformatic tasks.
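As one example of a follow-on task, the short Python sketch below reads FASTA-formatted text and reports the length of each sequence. The two sequences are made up and held in memory so that the example runs on its own; to use your download instead, pass an opened file (for instance open("my_sequences.fasta"), a hypothetical file name) to read_fasta.

```python
import io


def read_fasta(handle):
    """Parse FASTA text from a file-like object into {header: sequence}."""
    records = {}
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header line closes the previous record, if there was one.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Invented example data standing in for a downloaded file.
example = io.StringIO(">seq1 hypothetical example\nMKT\nAYI\n>seq2 another\nGGS\n")
sequences = read_fasta(example)

# Use the first word of each header as a short identifier.
lengths = {header.split()[0]: len(seq) for header, seq in sequences.items()}
print(lengths)
```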

Learning Objectives

The aim of this session is:
  • To illustrate features of SRS - a powerful tool for text-based searches of biological sequence databases
    • Simple text searches
    • More sophisticated searches
      • Combining keywords using Boolean operators for more sophisticated searching
      • Using wildcards to expand list of records found
      • Restricting searches to different database fields
      • Using links between databases
    • Downloading records based on the results of searches

-- Main.AidanBudd - 25 May 2007

Page last modified on January 29, 2008, at 11:04 AM CET