Biocomputing Unit
Sequence Analysis Service
Gibson Group

Autumn 01 Course

WWW Sequence Database Searching Practical

by Toby Gibson, Chenna Ramu and Aidan Budd, November 26th-29th 2001

In this practical we will run some database search tools provided by EMBL and available through the web. Examining the outputs may reveal differences between the results, depending on the type of algorithm used for the sequence comparison. We will also modify the query and the search set-up to illustrate the importance of a little thought before database searching (or afterwards: better late than never...). Rule No. 1 is "Know your sequence!"

WWW DB search Tools

We will use:

Getting started

The teaching machines are INTEL PCs running the LINUX OS. It will take a few moments to get set up.

Step 1 Choosing an snRNP SM protein as query

SM proteins are found in snRNP complexes. There are quite a number of them in Swiss-Prot and they are fairly divergent, so it is difficult (or impossible) to detect them all in a search with a single query sequence. All SM proteins share a small globular domain, but many have a C-terminal non-globular domain too. This two-part architecture will be used to illustrate the problems of searching with multi-domain proteins.

You now have the sequence of human SM-B protein available in a form that can be cut and pasted into the DB query forms.

Step 2 BLAST2 searching with human SM-B protein

BLAST2 is an upgraded version of BLAST, one of the most widely used database search packages. The BLAST programs find the best matching ungapped sections in a sequence comparison. The most important modification for the user to note in BLAST2 is that neighbouring ungapped segments can now be concatenated by allowing gaps between them. This improves both sensitivity and interpretation of the results.
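The idea of a "best matching ungapped section" can be sketched in a few lines of Python. This is only an illustration of the concept, not how BLAST is implemented: the real programs seed matches with short words, score with a residue exchange matrix and stop extending when the score drops too far. Here every diagonal is scanned by brute force, and the function name and the identity scoring (match=2, mismatch=-1) are assumptions made for the example.

```python
def best_ungapped_segment(a, b, match=2, mismatch=-1):
    """Return (score, start_in_a, start_in_b, length) of the best ungapped local segment.

    Toy identity scoring is an assumption; BLAST itself uses a substitution matrix.
    """
    best = (0, 0, 0, 0)
    # An ungapped alignment lies on one diagonal, identified by offset = j - i.
    for offset in range(-(len(a) - 1), len(b)):
        i = max(0, -offset)
        j = i + offset
        running, start = 0, i
        while i < len(a) and j < len(b):
            running += match if a[i] == b[j] else mismatch
            if running <= 0:
                # A segment that has dropped to zero is abandoned; restart after it.
                running, start = 0, i + 1
            elif running > best[0]:
                best = (running, start, start + offset, i - start + 1)
            i += 1
            j += 1
    return best


# Hypothetical sequences sharing the ungapped block KMLQHIDYR.
print(best_ungapped_segment("AAAKMLQHIDYRCCC", "TTKMLQHIDYRGG"))
```

Two such segments on different diagonals cannot be joined without a gap, which is the limitation that the gapped extension in BLAST2 addresses.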


Step 2B BLAST2 search with SM-B and a filter

Now repeat the search but filter out segments of "reduced sequence complexity".
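To give a feel for what "reduced sequence complexity" means, the sketch below masks windows whose residue composition has low Shannon entropy. It is a deliberately simplified stand-in and not the filter used by the BLAST2 server; the function names, the window length of 12 and the threshold of 2.2 bits are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def window_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace every residue lying in a low-entropy window with 'X'.

    window and threshold are illustrative values, not server defaults.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) < threshold:
            for k in range(start, start + window):
                masked[k] = "X"
    return "".join(masked)


# Hypothetical sequence with a proline-rich stretch in the middle.
print(mask_low_complexity("ACDEFGHIKLMN" + "PPPPPPPPPPPP" + "QRSTVWYACDEF"))
```

Masked residues can no longer contribute to a match, so hits that depend only on a compositionally biased region drop out of the results.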


Step 3 Bic_SW search with human SM-B protein

The Bioccelerator is fast dedicated hardware exclusively designed to speed up dynamic programming (i.e. slow but sensitive) sequence comparison. It is built by the Israeli company Compugen. It can perform a number of search permutations including basic Smith-Waterman, profile searches and Protein v. DNA frame-shifting comparisons. The Smith-Waterman search finds the best matching segments between any two sequences, allowing for gaps to be inserted at any position.

The search will take a couple of minutes (unless the Bic is busy). When it is finished you can look at the high-score list and alignments in the output and compare the results with BLAST2.
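For reference, the Smith-Waterman recurrence behind this kind of search can be written compactly in software. The sketch below is a minimal version under stated assumptions: identity scoring (match=2, mismatch=-1) instead of a residue exchange matrix, and a single linear gap penalty per gapped position. It says nothing about how the Bioccelerator hardware implements the search.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment.

    Toy scoring values are assumptions; real searches use a substitution matrix.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A cell never goes below zero: that is what makes the alignment local.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Hypothetical pair differing by one deleted residue.
print(smith_waterman("KMLQHIDYR", "KMLQIDYR"))
```

Every cell of the matrix is filled, so the cost grows with the product of the two sequence lengths. That is why a full database search of this kind is slow in software and benefits from dedicated hardware.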


Step 3B Bic_SW search with the SM domain only

Now repeat the search but use the globular N-terminal domain only.


Step 4. Bic_profilesearch based on an alignment of SM proteins

Profile searches are among the most sensitive search tools currently available. The raw materials for a profile search are a multiple sequence alignment and a residue exchange matrix (e.g. the Gonnet Pam250 matrix). A profile scores the amino acids at each position in the alignment: conserved positions score more strongly than unconserved ones (whereas with a single query sequence all positions are weighted equally). We can compare the sensitivity of the profile search with that of the searches using a single sequence as query.
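The way a profile combines an alignment with an exchange matrix can be sketched as follows. Everything here is illustrative: the four-column alignment is invented, and toy_exchange_score is a crude stand-in (4 for identity, 1 within a hand-picked residue group, -2 otherwise) used in place of the Gonnet Pam250 matrix. Real profile software also weights the sequences and adds position-specific gap penalties, which this sketch omits.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical residue groups, used only to build a toy exchange matrix.
GROUPS = ["AGST", "ILMV", "FWY", "DE", "KRH", "NQ", "C", "P"]


def toy_exchange_score(x, y):
    """Toy stand-in for a residue exchange matrix (an assumption, not Gonnet)."""
    if x == y:
        return 4
    if any(x in g and y in g for g in GROUPS):
        return 1
    return -2


def build_profile(alignment):
    """One dict per column: score of each amino acid against that column."""
    profile = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        counts = Counter(residues)
        total = len(residues)
        scores = {}
        for aa in AMINO_ACIDS:
            # Average exchange score of aa against the residues seen in the column.
            scores[aa] = sum(
                n / total * toy_exchange_score(obs, aa) for obs, n in counts.items()
            ) if total else 0.0
        profile.append(scores)
    return profile


def score_ungapped(profile, segment):
    """Score a segment of the same length as the profile, without gaps."""
    return sum(col[aa] for col, aa in zip(profile, segment))


# Invented alignment: column 1 is fully conserved, column 3 is variable.
profile = build_profile(["GKLV", "GRIV", "GKMA"])
print(profile[0]["G"], profile[2]["L"])
print(score_ungapped(profile, "GKLV"))
```

In this toy case the conserved first column gives G a higher score than the variable third column gives L, which is the position-dependent weighting that a single query sequence cannot provide.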

Step 4A. Preparing a profile from an SM alignment

(We have prepared an alignment for you, as there is not enough time to do a multiple alignment today.)

Step 4B. Bic_profilesearch with an SM domain profile prepared with the Gonnet Pam250 matrix

The search will take a couple of minutes (unless the Bic is busy). When it is finished you can look at the high-score list and alignments in the output. Use the SRS links to learn about the top hits.


Step 5. Comparing a sequence to a database of protein domains

Since profile searches are so sensitive, it makes sense to query an unknown sequence against a set of profiles for known protein families. There are several very useful databases of modules that are found in multidomain proteins, including PFAM at the Sanger Centre, PROSITE at ISREC and SMART at EMBL. PFAM and SMART use a form of profile technically described as a "hidden Markov model", but the end result is very much like the profiles we just ran. We will search for protein domains in an "unknown" protein using the SMART server.


Take Home Lessons

Hopefully the exercise of varying the query type has illustrated that the way a search is set up is very important. The queries here illustrate the effect of different sequence types. Other parameters also often influence the search sensitivity. For example, for a longer globular domain the Gonnet Pam250 matrix would be expected to outperform the default Blosum62 in the detection of divergent homologues, because it is less stringent and so gives longer optimally matching segments. (Over short matches it is noisier and could perform worse.) In fact we used the Gonnet matrix to make the profiles: because of the extra information in the alignment, profiles usually perform better with Pam250 than with Blosum62. Gap penalties are also critical parameters in dynamic programming and should always be tested by trial and error. In other words, it pays to try several variations of a search and not just accept the results of the first one.
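The sensitivity to gap penalties is easy to demonstrate on a toy example. The sketch below computes a local alignment score with separate gap opening and gap extension penalties and repeats the calculation for several opening penalties. The sequences, the identity scoring and the penalty values are all assumptions chosen to make the effect visible; they are not the settings of any of the servers used in this practical.

```python
def local_score_affine(a, b, gap_open, gap_extend, match=2, mismatch=-1):
    """Best local alignment score; a gap of length k costs gap_open + (k - 1) * gap_extend.

    Toy identity scoring is an assumption made for this illustration.
    """
    neg = float("-inf")
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]    # best score of an alignment ending at (i, j)
    E = [[neg] * cols for _ in range(rows)]  # same, but ending with a gap in a
    F = [[neg] * cols for _ in range(rows)]  # same, but ending with a gap in b
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            # Either open a new gap from H or extend the gap already running.
            E[i][j] = max(H[i][j - 1] - gap_open, E[i][j - 1] - gap_extend)
            F[i][j] = max(H[i - 1][j] - gap_open, F[i - 1][j] - gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


# Hypothetical pair: the first sequence carries a five-residue insertion.
seq_a = "KMLQHIDYRGSWT"
seq_b = "KMLQGSWT"
for gap_open in (1, 3, 6):
    print(gap_open, local_score_affine(seq_a, seq_b, gap_open, gap_extend=1))
```

With the cheaper opening penalties the two conserved blocks are joined across the insertion; with the most expensive one the best alignment falls back to a single four-residue block, so the same pair of sequences looks much less similar.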