Predocs99 Course
WWW Database Searching Practical
by Toby Gibson, Chenna Ramu and Christine Gemünd, October 12th-15th 1999
In this practical we will run some database search tools provided by EMBL
and available through the web. Examination of the outputs may reveal some
differences between the results, depending on the type of algorithm used
in the sequence comparison. We will also modify the query and search setups
to illustrate the importance of a little thought in advance of (or during...)
database searching. Rule No. 1 is "Know your sequence!"
WWW DB search Tools
We will use:
- SRS5 to extract query sequences.
- The Bork group's BLAST2 server for BLAST searches (fast ungapped comparison).
- Our group's Bioccelerators for Smith-Waterman and profile searches (exhaustive gapped comparison).
- WWWProfileWeight to make the profiles from a sequence alignment.
- The Bork group's SMART server to compare a sequence against a protein domain database.
Step 1 Choosing an snRNP SM protein as query
SM proteins are found in snRNP complexes. There are quite a number in Swiss-Prot
and they are fairly divergent, so it is difficult (or impossible) to detect
them all in a search with a single query sequence. All SM proteins share
a small globular domain, but many have a C-terminal non-globular domain
too. This will be used to illustrate the problems of searching with multi-domain
proteins.
- Open a new navigator window and load this page into it.
- Load SRS5 and Start the session.
- Select the Swiss-Prot database and continue.
- Set the field selection (by default All Text) to Description.
- Type snrnp & sm & b in the Description box, then Do Query.
- Click on the entry RSMB_Human.
- Swiss-Prot entries often have useful hypertext links: What does the PFAM link do?
You now have the sequence of human SM-B protein available in a form that
can be cut and pasted into the DB query forms.
Step 2 BLAST2 searching with human SM-B protein
BLAST2 is an upgraded version of BLAST, one of the most widely used database
search packages. The BLAST programs find the best matching ungapped sections
in a sequence comparison. The most important modification for the user to
note in BLAST2 is that neighbouring ungapped segments can now be concatenated
by allowing gaps between them. This improves both sensitivity and interpretation
of the results.
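To make "best matching ungapped section" concrete, here is a minimal Python sketch. It is not the BLAST algorithm itself (BLAST uses word seeding and a residue exchange matrix to be fast); it simply scans every diagonal exhaustively. The function name and the match/mismatch scores are assumptions chosen for illustration.

```python
def best_ungapped_segment(a, b, match=2, mismatch=-1):
    """Return (score, start_a, start_b, length) of the best ungapped local match.

    Toy scoring (assumed): +2 for identical residues, -1 otherwise.
    Positions are 0-based.
    """
    best = (0, 0, 0, 0)
    # Each diagonal corresponds to one fixed offset between the two sequences.
    for offset in range(-(len(b) - 1), len(a)):
        i, j = max(offset, 0), max(-offset, 0)
        run_score, run_start = 0, (i, j)
        while i < len(a) and j < len(b):
            if run_score == 0:
                run_start = (i, j)  # a new candidate segment starts here
            run_score += match if a[i] == b[j] else mismatch
            if run_score <= 0:
                run_score = 0  # drop segments that no longer pay off
            elif run_score > best[0]:
                best = (run_score, run_start[0], run_start[1], i - run_start[0] + 1)
            i += 1
            j += 1
    return best


# Invented toy sequences sharing the segment "GTRVLVE".
print(best_ungapped_segment("AAAGTRVLVELKNG", "CCGTRVLVEMM"))
```

Because no gaps are allowed, an insertion or deletion splits a match into separate segments, which is why joining neighbouring segments in BLAST2 helps.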
- Open a new navigator window and load this page into it.
- Load a BLAST2 query submission page.
- It is worth familiarising yourself with the layout and consulting the help pages.
- Paste in the RSMB_Human sequence from the SRS browser (consult the help for format).
- Select the Swiss-Prot database.
- Set the filter option to none.
- Check the number of top hit descriptions and alignments to be shown: set
to 100 or so.
- Start the BLAST search: this should take a few minutes at most.
- Save output to a new file, so that you do not lose it.
- Examine output, and investigate the detected entries by using the SRS links.
Questions
- 1. How many SM proteins are detected above the first false positive?
- 2. Is there another class of protein that is strongly detected?
- 3. If so, is this biologically meaningful?
- 4. Are the P-values a reliable guide to homology?
Step 2B BLAST 2 search with SM-B and a filter
Now repeat the search but filter out segments of "reduced sequence complexity".
- Reload a BLAST2 query submission page.
- Paste in the RSMB_Human sequence.
- Select the Swiss-Prot database.
- Set the filter option to SEG+XNU.
- Check the number of top hit descriptions and alignments to be shown: set
to 100 or so.
- Start the BLAST search.
- Save output to a new file, so that you do not lose it.
- Examine output, and investigate the detected entries by using the SRS links.
Questions
- 1. How many SM proteins are detected above the first false positive?
- 2. Is there another class of protein that is strongly detected?
- 3. Why are rather few sequences listed?
- 4. How does this setup compare in sensitivity to the unfiltered search?
- 5. Are the P-values a reliable guide to homology?
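The idea of filtering "reduced sequence complexity" in Step 2B can be sketched as follows. This is not the SEG or XNU algorithm, only a simplified analogue: it measures the Shannon entropy of the residue composition in a sliding window and masks low-entropy windows with X. The window length, the threshold and the function names are assumptions for illustration.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=2.2):
    """Replace every residue lying in a low-entropy window with 'X'.

    window and min_entropy are illustrative values, not SEG defaults.
    Sequences shorter than the window are returned unchanged.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) < min_entropy:
            for k in range(start, start + window):
                masked[k] = "X"
    return "".join(masked)


# Invented toy sequence: a diverse stretch followed by a proline run.
print(mask_low_complexity("ACDEFGHIKLMN" + "P" * 12))
```

Masked residues cannot contribute to a match, so hits that rely only on compositionally biased regions should drop out of the list.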
Step 3 Bic_SW search with human SM-B protein
The Bioccelerator is fast dedicated hardware exclusively designed to speed
up dynamic programming (i.e. slow but sensitive) sequence comparison. It is built by the Israeli
company Compugen. It can perform a number of search permutations including
basic Smith-Waterman, profile searches and Protein v. DNA frame-shifting
comparisons. The Smith-Waterman search finds the best matching segments
between any two sequences, allowing for gaps to be inserted at any position.
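As a rough illustration of what the Smith-Waterman comparison computes, here is a minimal score-only sketch in Python. The function name, the identity-based scoring and the single linear gap penalty are assumptions for illustration; a real search would use a residue exchange matrix and separate gap opening and extension penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty).

    Toy scoring (assumed): +2 identity, -1 mismatch, -2 per gap position.
    Only two rows of the dynamic programming matrix are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # 0 restarts the alignment; the other terms extend it by a
            # residue pair, a gap in b, or a gap in a.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Invented toy sequences: the second lacks the F of the first.
print(smith_waterman_score("ACDEFGHIK", "ACDEGHIK"))           # gap is affordable
print(smith_waterman_score("ACDEFGHIK", "ACDEGHIK", gap=-10))  # gap is too costly
```

Every cell of the matrix is filled, which is why the method is exhaustive and slow without dedicated hardware. The second call also shows how strongly the result depends on the gap penalty.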
- Open a new navigator window and load this page into it.
- Load the Bioccelerator home page.
- Go to the Bioccelerator Searches page.
- Select application sw_p.
- It is worth familiarising yourself with the layout and consulting the help links.
- Paste the RSMB_human sequence from the SRS browser into the Query Sequence box.
- Select the Swiss-Prot database. (It may already be the default selection in GCG format).
- Now Do Search to start the Bioccelerator run.
- When you get the output, save to a new file, so that you do not lose it.
The search will take a couple of minutes (unless the Bic is busy). When
it is finished you can look at the high-score list and alignments in the
output and compare the results with BLAST2.
Questions
- 1. How are SM proteins distributed in the output?
- 2. What position is the highest false positive?
- 3. Is another class of proteins strongly detected?
- 4. Are the E-values a reliable guide to the SM protein detections?
- 5. Compared to BLAST:
- (a) Which, if any, is more sensitive?
- (b) Which output is easier to understand?
Step 3B Bic_SW search with the SM domain only
Now repeat the search but use the globular N-terminal domain only.
- Reload the Bioccelerator home page.
- Go to the Bioccelerator Searches page.
- Select application sw_p.
- Paste the range 1-82 of RSMB_human into the Bic query form.
- Select the Swiss-Prot database. (It may already be the default selection).
- Now Do Search to start the Bioccelerator run.
- When you get the output, save to a new file, so that you do not lose it.
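If you prefer to cut out the range programmatically rather than by hand, a small helper like the one below would do it. Sequence ranges such as 1-82 are numbered from 1 and include both ends, whereas Python slices count from 0 and exclude the end. The function name is hypothetical and the sequence is an invented toy string, not RSMB_Human.

```python
def subsequence(seq, start, end):
    """Return residues start..end of seq, numbered from 1 and inclusive."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError("range must satisfy 1 <= start <= end <= len(seq)")
    return seq[start - 1:end]


toy = "ACDEFGHIKL"  # invented placeholder sequence
print(subsequence(toy, 1, 3))  # prints ACD
```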
Questions
- 1. Are more or fewer SM proteins detected?
- 2. Is another class of proteins strongly detected?
- 3. Are the E-values a reliable guide to the SM protein detections?
- 4. Compared to the BLAST filtered search which, if any, is more sensitive?
- 5. Collect a multiple alignment using the buttons in the header:
- Is this useful to judge the detections?
- Which entries have incomplete sequence fragments?
Step 4. Bic_profilesearch based on an alignment of SM proteins
Profile searches are one of the most sensitive search tools currently available.
The raw materials for profile searching are a multiple sequence alignment
in conjunction with a residue exchange matrix (e.g. the Gonnet Pam250 matrix). A profile scores the amino acids at each position in the alignment:
conserved positions score more strongly than unconserved ones (whereas in
a single sequence, they are all equally significant). We can then compare the
sensitivity of the profile search with that of the single-sequence searches.
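The following Python sketch shows one simple way to turn an alignment plus an exchange matrix into a profile: for each column, the score of an amino acid is its average exchange score against the residues observed in that column. This is an assumption for illustration and not necessarily how WWWProfileWeight works (which, for instance, may weight the sequences). The function toy_exchange is a hypothetical stand-in for a real matrix such as Gonnet Pam250, and the alignment is invented.

```python
def toy_exchange(x, y):
    """Hypothetical stand-in for a residue exchange matrix."""
    return 2.0 if x == y else -1.0


def build_profile(alignment, alphabet="ACDEFGHIKLMNPQRSTVWY", exchange=toy_exchange):
    """Return one {amino acid: score} dictionary per alignment column."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        residues = [row[col] for row in alignment if row[col] != "-"]
        # Dividing by the number of sequences (not of residues) means that
        # columns containing gaps give weaker scores.
        profile.append({
            x: sum(exchange(x, r) for r in residues) / len(alignment)
            for x in alphabet
        })
    return profile


# Invented toy alignment; '-' marks a gap.
profile = build_profile(["GKLV", "GRLI", "GK-V"])
print(profile[0]["G"], profile[1]["K"], profile[1]["R"])
```

In the toy alignment the fully conserved first column gives G the maximum score, while in the variable second column K and R both receive intermediate scores.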
Step 4A. Preparing a profile from an SM alignment
(We have prepared an alignment for you as there is not enough time to do
a multiple alignment today).
- Open a new navigator window and load this page into it.
- Load WWWProfileWeight.
- Load the alignment SM_domain.aln in a new Netscape window.
- Cut and Paste the alignment into the Paste box.
- Run ProfileWeight to make the profile.
- Look at the resulting profile:
- (a) See how scores for amino acids vary for each position in the alignment.
- (b) See how the position-specific gap penalties are lowered at existing
gaps.
- (c) Note the suggested gap penalties in the header: these are only a rough
guide.
- Save Profile to save the profile to a file (e.g. as Sm.prf) for use in the profile search.
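Point (b) above can be pictured with a small sketch. The formula here is an assumption chosen for illustration and is not taken from ProfileWeight: the gap penalty at each column is scaled down in proportion to the fraction of sequences that already have a gap there, with a floor so that it never reaches zero.

```python
def column_gap_penalties(alignment, base_open=1.0, floor=0.1):
    """Per-column gap opening penalties, lowered where gaps already exist.

    base_open and floor are illustrative values (assumed).
    """
    n = len(alignment)
    penalties = []
    for col in range(len(alignment[0])):
        gap_fraction = sum(row[col] == "-" for row in alignment) / n
        penalties.append(base_open * max(floor, 1.0 - gap_fraction))
    return penalties


# Invented toy alignment: only the third column contains a gap.
print(column_gap_penalties(["GKLV", "GRLI", "GK-V"]))
```

The biological reasoning is that positions where family members already differ in length (typically loops) are the most likely places for further insertions and deletions.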
Step 4B. Bic_profilesearch with an SM domain profile prepared with the Gonnet Pam250 matrix
- Open a new navigator window and load this page into it.
- Load the Bioccelerator home page.
- Go to the Bioccelerator Searches Page.
- Select Profilesearch in the Application box.
- In the Upload a file box, give the full path of the Sm protein profile file.
- (Alternatively you could cut and paste into the Query Sequence box.)
- Give Gap opening penalty 1.0 and extension penalty 0.1.
- Select the Swiss-Prot database.
- Do Search.
- Save the output to a new file, so that you do not lose it.
The search will take a couple of minutes (unless the Bic is busy). When
it is finished you can look at the high-score list and alignments in the
output. Use the SRS links to learn about the top hits.
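Conceptually, a profile search is a local alignment in which the score for a residue depends on the profile column it is aligned to. The Python sketch below illustrates this with separate gap opening and extension penalties, using the values 1.0 and 0.1 from the steps above. It is not the Bioccelerator implementation. The function name, the convention that a gap of length k costs open + (k - 1) * extension, the fixed penalties for all columns and the score for residues missing from a column are all assumptions.

```python
def profile_local_score(profile, seq, gap_open=1.0, gap_extend=0.1, unknown=-1.0):
    """Best local alignment score of seq against a profile (affine gaps).

    profile is a list of {amino acid: score} dictionaries, one per column.
    Residues absent from a column receive the score `unknown` (assumed).
    """
    neg = float("-inf")
    cols = len(seq)
    h_prev = [0.0] * (cols + 1)   # best scores for the previous profile column
    f_prev = [neg] * (cols + 1)   # alignments ending with a skipped column
    best = 0.0
    for column_scores in profile:
        h_curr = [0.0] * (cols + 1)
        f_curr = [neg] * (cols + 1)
        e = neg                   # alignments ending with an extra residue
        for j in range(1, cols + 1):
            e = max(h_curr[j - 1] - gap_open, e - gap_extend)
            f_curr[j] = max(h_prev[j] - gap_open, f_prev[j] - gap_extend)
            diag = h_prev[j - 1] + column_scores.get(seq[j - 1], unknown)
            h_curr[j] = max(0.0, diag, e, f_curr[j])
            best = max(best, h_curr[j])
        h_prev, f_prev = h_curr, f_curr
    return best


# Invented four-column toy profile favouring G, K, L, V in turn.
toy_profile = [{"G": 2.0}, {"K": 2.0}, {"L": 2.0}, {"V": 2.0}]
print(profile_local_score(toy_profile, "AAGKLVAA"))  # exact match to the profile
print(profile_local_score(toy_profile, "GKALV"))     # one inserted residue
```

Changing gap_open and gap_extend in the calls is a quick way to see how the penalties trade off against the match scores.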
Questions
- 1. How are SM entries distributed in the output?
- 2. Are the E-values a reliable guide to SM protein detections?
- 3. Is the profile search more or less sensitive than the single sequence
queries?
- 4. Collect a multiple alignment using the buttons in the header:
- Is this useful to judge the detections?
- Can you see any conserved positions in the alignment?
Step 5. Comparing a sequence to a database of protein domains
Since profile searches are so sensitive, it would make sense to query an
unknown sequence against a set of profiles for known protein families. There are several very useful databases of modules that are found in multidomain
proteins, including PFAM at the Sanger Centre, PROSITE at ISREC and SMART at EMBL. They use a form of profile technically described as a "hidden
Markov model", but the end result is very much like the profiles we just
ran. We will search for protein domains in an "unknown" protein using the
SMART server.
- Open a new navigator window and load this page into it.
- Load the SMART query page.
- Toggle on HMM searching against PFAM database (includes more domains).
- Get the unknown sequence and cut and paste it into SMART's Sequence box.
- Click on the GO! SMART Button.
- The search should take about a minute unless the server is busy.
- When you get the results, note the domain "bubble" diagram and the table
of matching domains.
Questions
- Based on your recent experiences would you say the E-value scores are good?
- What happens if you click on a domain bubble?
- Is the domain common?
- Is there any literature on the domain?
- Are there structures for any of these domains?
- Is this protein likely to be in the nucleus, cytoplasm or extracellular
compartments?
- Can you say what kind of protein it is?
- Do you think this protein has especially many or few domains?
- Try repeating the SMART search with FBN1_HUMAN, the Marfan Syndrome protein.
Take Home Lessons
Hopefully the exercise of varying the query type has illustrated that the
way a search is set up is very important. The queries here illustrate the
effect of different sequence types. There are other parameters that often
influence the search sensitivity. For example, when a globular domain is
longer, the Gonnet Pam250 matrix would be expected to outperform the default Blosum62 in the detection of divergent homologues, because it is less stringent
and so gives longer optimally matching segments. (Over short matches it
is noisier and could perform worse.) In fact we used the Gonnet matrix to
make the profiles: because of the extra information in the alignment, profiles
usually perform better with Pam250 than Blosum62. Also, gap penalties are
critical parameters in dynamic programming and should always be tested by
trial and error. In other words, it pays to try several variations in the
searches, not just accept the results of the first search.