EMBO YPI Tutorial
WWW Sequence Database Searching Practical
by Toby Gibson, Chenna Ramu and Aidan Budd, August
30th 2002
In this practical we will run some database search tools provided by EMBL
and available through the web. Examination of the outputs may reveal some
differences between the results, depending on the type of algorithm used
in the sequence comparison. We will also modify the query and the search
setup to illustrate the importance of a little thought in advance of database
searching (or afterwards: better late than never...). Rule No. 1 is "Know your
sequence!"
WWW DB search Tools
We will use:
-
SRS5 to extract
query sequences.
-
The Bork group's BLAST2
server for BLAST searches (fast ungapped comparison).
-
Our group's Bioccelerators
for Smith-Waterman and profile searches (exhaustive gapped comparison).
-
WWWProfileWeight
to make the profiles from a sequence alignment.
-
The Bork group's SMART server
to compare a sequence against a protein domain database.
Getting started
The teaching machines are INTEL PCs running the LINUX OS. It will take
a few moments to get set up.
-
Login with your EMBL name and password (may
already have been done for you)
-
If assorted brightly-coloured non-Netscape windows open up... please close them.
-
Start web browser by clicking on "WWW" icon at top left of screen.
-
Configure the browser to make the font the right size and to use JavaScript and Java properly.
-
Go to the top bar of the browser, and choose
-
-> settings
-
-> configure
-
-> browser
-
-> Appearance
-
-> Fixed font
-
-> Fixed
-
The font is now set correctly... to set Java and JavaScript...
-
Next to "Appearance" choose "Java/JavaScript"
-
Under Global Setting click boxes for...
-
Enable Java globally
-
Enable JavaScript globally
-
Click Apply and then OK.
-
Load this page into the browser.
-
Go to the EMBL homepage first (type "www.embl-heidelberg.de" into the address
box)
-
-> Computational (3rd item on right of page)
-
-> Biocomputing Unit (on left side of page)
-
-> Gibson Group
-
-> Courses (near top, on the left... a good idea to bookmark this page)
-
to bookmark... Click "Bookmark" on bar at top of browser window, then choose
"Add Bookmark"
-
to return to this page at any time, just go back to "Bookmark" and choose
the "EMBL sequence analysis" mark
-
-> EMBO YPI course... and there you are!
Step 1 Choosing an snRNP SM protein as query
SM proteins are found in snRNP complexes. There are quite a number
in Swiss-Prot and they are fairly divergent, so it is difficult (or impossible)
to detect them all in a search with a single query sequence. All SM proteins
share a small globular domain, but many have a C-terminal non-globular
domain too. This will be used to illustrate the problems of searching with
multi-domain proteins.
-
Load an SRS5 page by clicking on this link with
the middle mouse button (which should open the link into a new window) and Start a new SRS session
by clicking on the Start button
-
(SRS - Sequence Retrieval System - is a powerful and widely used tool
to retrieve information from sequence databases.)
-
Tick the box for the Swiss-Prot database and click continue.
-
Set the field selection (by default Main Text) to Description.
-
Type snrnp & sm & b in the Description box, then
Do
Query.
-
Click on the entry RSMB_Human.
-
Swiss-Prot entries often have useful hypertext links: What does the PFAM
link do?
You now have the sequence of human SM-B protein available in a form that
can be cut and pasted into the DB query forms, which we will do in Step 2.
Step 2 BLAST2 searching with human SM-B protein
BLAST2 is an upgraded version of BLAST, one of the most widely used
database search packages. The BLAST programs find the best matching ungapped
sections in a sequence comparison. The most important modification for
the user to note in BLAST2 is that neighbouring ungapped segments can now
be concatenated by allowing gaps between them. This improves both sensitivity
and interpretation of the results.
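To make "best matching ungapped section" concrete, here is a minimal Python sketch that scans every diagonal of a comparison and reports the highest-scoring run without gaps. It is only an illustration: the function name and the simple match/mismatch scores are assumptions made for this example, whereas the real BLAST programs use word seeding and a residue exchange matrix to get their speed and sensitivity.

```python
def best_ungapped_segment(a, b, match=1, mismatch=-1):
    """Return (score, start_a, start_b, length) of the best ungapped local match.

    Toy scoring (assumption): +1 for identical residues, -1 otherwise.
    Positions are 0-based.
    """
    best = (0, 0, 0, 0)
    # Each diagonal is identified by offset = i - j.
    for offset in range(-(len(b) - 1), len(a)):
        i = max(offset, 0)
        j = max(-offset, 0)
        run = 0
        run_start = (i, j)
        while i < len(a) and j < len(b):
            if run == 0:
                # A new candidate segment starts here.
                run_start = (i, j)
            run += match if a[i] == b[j] else mismatch
            if run <= 0:
                # The segment is no longer worth extending: drop it.
                run = 0
            elif run > best[0]:
                best = (run, run_start[0], run_start[1], i - run_start[0] + 1)
            i += 1
            j += 1
    return best


if __name__ == "__main__":
    print(best_ungapped_segment("GGTACGTAA", "TTACGTCC"))
```

Because no gaps are allowed, a single inserted residue splits a match into two shorter segments on neighbouring diagonals. That is the situation BLAST2 addresses by joining such segments.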
-
Load a BLAST2 query
submission page into a new window (again, use middle mouse button on BLAST2
link)
-
It is worth familiarising yourself with the layout and consulting the help pages.
-
Paste in the RSMB_Human sequence from the SRS browser (consult the
help for format).
-
cut and paste can be done easily by highlighting the text you wish to copy
using the mouse
-
click with the left mouse button where you want to begin the copied region
-
still holding the left button, drag the mouse until the highlighted text
covers all you want to copy
-
without clicking the left mouse button again in the window with the highlighted
text, move to the window to be pasted into, and left-click in the new window
to select it.
-
position the mouse where you want to paste, and click the middle mouse
button. Wow! It pasted (hopefully)! Aren't computers amazing?
-
Select the Swiss-Prot database.
-
Set the filter option to none.
-
Set the number of top hit descriptions and alignments to be shown to 100.
-
Start the BLAST search: this should take a few minutes at most.
-
Examine output, and investigate the detected entries by using the SRS links.
Questions
-
1. How many SM proteins are detected above the first false positive?
-
2. Is there another class of protein that is strongly detected?
-
3. If so, is this biologically meaningful?
-
4. Are the P-values a reliable guide to homology?
Step 2B BLAST 2 search with SM-B and a filter
Now repeat the search but filter out segments of "reduced sequence complexity".
-
Reload a BLAST2 query
submission page.
-
Paste in the RSMB_Human sequence.
-
Select the Swiss-Prot database.
-
Set the filter option to SEG+XNU.
-
Check the number of top hit descriptions and alignments to be shown: set
to 100 or so.
-
Start the BLAST search.
-
Examine output, and investigate the detected entries by using the SRS links.
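The idea behind filtering "reduced sequence complexity" can be sketched in a few lines of Python. The example below masks any window whose Shannon entropy falls below a threshold. Note that this is not the SEG or XNU algorithm: the function names, window length and threshold are assumptions chosen purely for illustration.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace residues in low-entropy windows with 'X'.

    window and threshold are illustrative values (assumptions),
    not the parameters of any real filter program.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for k in range(start, start + window):
                masked[k] = "X"
    return "".join(masked)


if __name__ == "__main__":
    # Toy sequence (not a real protein) with a proline-rich stretch.
    toy = "MKVLAETWGHRD" + "PPPPPPPPPPPP" + "SDFNQYICTLKA"
    print(mask_low_complexity(toy))
```

Masked residues cannot contribute to a match, so hits that depend only on biased composition (for example proline-rich tails) should drop out of the filtered search.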
Questions
-
1. How many SM proteins are detected above the first false positive?
-
2. Is there another class of protein that is strongly detected?
-
3. Why are rather few sequences listed?
-
4. How does this setup compare in sensitivity to the unfiltered search?
-
5. Are the P-values a reliable guide to homology?
Step 3 Bic_SW search with human SM-B protein
The Bioccelerator is dedicated hardware designed exclusively to speed up
dynamic programming (i.e. slow but sensitive) sequence comparison. It is built
by the Israeli company Compugen. It can perform a number of search permutations
including basic Smith-Waterman, profile searches and Protein v. DNA frame-shifting
comparisons. The Smith-Waterman search finds the best matching segments
between any two sequences, allowing for gaps to be inserted at any position.
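For readers who want to see the dynamic programming behind this, here is a minimal Smith-Waterman sketch in Python that returns only the best local alignment score. The match, mismatch and gap values are assumptions for illustration (a real search would use a residue exchange matrix and separate gap opening and extension penalties), and this is in no way the Bioccelerator's implementation.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b with a linear gap penalty.

    Toy scoring (assumption): +2 match, -1 mismatch, -2 per gap position.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Each cell holds the best score of an alignment ending here;
            # 0 means "start a fresh local alignment".
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # One extra T in the first sequence is bridged by a single gap.
    print(smith_waterman_score("ACGTTACG", "ACGTACG"))
```

Every cell of the query-by-target matrix is filled for every database entry, which is why the method is exhaustive and slow in software, and why dedicated hardware helps.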
-
Load the Bioccelerator
home page.
-
Go to the Bioccelerator Searches page.
-
Select application sw_p.
-
It is worth familiarising yourself with the layout and consulting the help
links.
-
Paste the RSMB_human sequence from the SRS browser into the Query
Sequence box.
-
Select the Swiss-Prot database. (It may already be the default selection
in gcg format).
-
Now Do Search to start the Bioccelerator run.
The search will take a couple of minutes (unless the Bic is busy). When
it is finished you can look at the high-score list and alignments in the
output and compare the results with BLAST2.
Questions
-
1. How are SM proteins distributed in the output?
-
2. What position is the highest false positive?
-
3. Is another class of proteins strongly detected?
-
4. Are the E-values a reliable guide to the SM protein detections?
-
5. Compared to BLAST:
-
(a) Which, if any, is more sensitive?
-
(b) Which output is easier to understand?
Result: Click here
Step 3B Bic_SW search with the SM domain only
Now repeat the search but use the globular N-terminal domain only.
-
Reload the Bioccelerator
home page.
-
Go to the Bioccelerator Searches page.
-
Select application sw_p.
-
Paste the range 1-82 of RSMB_human into the Bic query form.
-
Select the Swiss-Prot database. (It may already be the default selection).
-
Now Do Search to start the Bioccelerator run.
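If you prefer not to count residues by hand, the range can be cut out with a small script. The Python sketch below takes pasted sequence text and returns a 1-based, inclusive range. The function name is hypothetical and the sequence in the example is a toy string, not RSMB_HUMAN.

```python
def subsequence(seq_text, start, end):
    """Return residues start..end (1-based, inclusive) from pasted sequence text.

    FASTA-style header lines beginning with '>' are dropped, as are spaces,
    digits and line breaks.
    """
    lines = [ln for ln in seq_text.splitlines() if not ln.startswith(">")]
    residues = "".join(ch for ch in "".join(lines) if ch.isalpha())
    if not 1 <= start <= end <= len(residues):
        raise ValueError("range lies outside the sequence")
    return residues[start - 1:end]


if __name__ == "__main__":
    toy = ">toy\nABCDE FGHIJ\n"  # placeholder text, not a real protein
    print(subsequence(toy, 2, 4))
```

For this step you would paste the RSMB_human sequence from SRS as `seq_text` and call `subsequence(seq_text, 1, 82)`.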
Questions
-
1. Are more or fewer SM proteins detected?
-
2. Is another class of proteins strongly detected?
-
3. Are the E-values a reliable guide to the SM protein detections?
-
4. Compared to the BLAST filtered search which, if any, is more sensitive?
-
5. Collect a multiple alignment using the buttons in the header:
-
Is this useful to judge the detections?
-
Which entries have incomplete sequence fragments?
Step 4. Bic_profilesearch based on an alignment of SM proteins
Profile searches are one of the most sensitive search tools currently
available. The raw materials for profile searching are a multiple sequence
alignment in conjunction with a residue exchange matrix (e.g. the Gonnet
Pam250 matrix). A profile scores the amino acids at each position in
the alignment: conserved positions score more strongly than unconserved
ones (whereas in a single sequence, they are all equally significant).
We can compare the sensitivity to the searches with a single sequence as
query.
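The way a profile rewards conserved positions can be illustrated with a short Python sketch. For each alignment column, the score of a residue is taken as the average exchange score against the residues observed in that column. Everything here is a simplifying assumption: the toy exchange function stands in for a real matrix such as Gonnet Pam250, there is no sequence weighting and there are no position-specific gap penalties, so this is not how ProfileWeight computes its profiles.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def toy_exchange(x, y):
    """Stand-in for a residue exchange matrix (assumption): 2 if identical, else -1."""
    return 2.0 if x == y else -1.0


def build_profile(alignment, exchange=toy_exchange):
    """Return one {residue: score} dictionary per alignment column."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        # Gap characters do not contribute to the column scores.
        column = [row[col] for row in alignment if row[col] != "-"]
        scores = {}
        for r in ALPHABET:
            if column:
                scores[r] = sum(exchange(r, x) for x in column) / len(column)
            else:
                scores[r] = 0.0
        profile.append(scores)
    return profile


if __name__ == "__main__":
    toy_alignment = ["GAK", "GCK", "GD-"]  # toy alignment, not SM proteins
    profile = build_profile(toy_alignment)
    print(profile[0]["G"], profile[1]["A"])
```

In the toy alignment the first column is fully conserved, so G receives the maximum score there, while no residue scores strongly in the variable second column.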
Step 4A. Preparing a profile from an SM alignment
(We have prepared an alignment for you as there is not enough time
to do a multiple alignment today).
-
Load WWWProfileWeight.
-
Load the alignment
SM_domain.aln in a new Netscape window.
-
Cut and Paste the alignment into the Paste box.
-
Run ProfileWeight to make the profile.
-
Look at the resulting profile:
-
(a) See how scores for amino acids vary for each position in the alignment.
-
(b) See how the position-specific gap penalties are lowered at existing
gaps.
-
(c) Note the suggested gap penalties in the header: these are only a rough
guide.
Step 4B. BIC_Profilesearch with an SM domain profile prepared
with the Gonnet Pam250 matrix
-
Load the Bioccelerator
home page.
-
Go to the Bioccelerator Searches Page.
-
Select Profilesearch in the Application box.
-
Cut and paste the profile into the Query Sequence box.
-
Give Gap opening penalty 1.0 and extension penalty 0.2.
-
Select the Swiss-Prot database.
-
Do Search.
The search will take a couple of minutes (unless the Bic is busy). When
it is finished you can look at the high-score list and alignments in the
output. Use the SRS links to learn about the top hits.
Questions
-
1. How are SM entries distributed in the output?
-
2. Are the E-values a reliable guide to SM protein detections?
-
3. Is the profile search more or less sensitive than the single sequence
queries?
-
4. Collect a multiple alignment using the buttons in the header:
-
Is this useful to judge the detections?
-
Can you see any conserved positions in the alignment?
Step 5. Comparing a sequence to a database of protein
domains
Since profile searches are so sensitive, it would make sense to query
an unknown sequence against a set of profiles for known protein families.
There are several very useful databases of modules that are found in multidomain
proteins, including PFAM
at the Sanger Centre, PROSITE
at ISREC and SMART at EMBL.
They use a form of profile technically described as a "hidden Markov model",
but the end result is very much like the profiles we just ran. We will
search for protein domains in an "unknown" protein using the SMART server.
-
Load the SMART query page.
-
Toggle on PFAM domains (includes more domains).
-
Get the "unknown
sequence" and cut and paste it into SMART's Sequence box.
-
Click on the Sequence SMART Button.
-
The search should take about a minute unless the server is busy.
-
When you get the results, note the domain "bubble" diagram and the table
of matching domains.
Questions
-
Based on your recent experiences would you say the E-value scores are
good?
-
What happens if you click on a domain bubble?
-
Is the domain common?
-
Is there any literature on the domain?
-
Are there structures for any of these domains?
-
Is this protein likely to be in the nucleus, cytoplasm or extracellular
compartments?
-
Can you say what kind of protein it is?
-
Do you think this protein has especially many or few domains?
-
Try repeating the SMART search with FBN1_HUMAN,
the Marfan Syndrome protein.
Take Home Lessons
Hopefully the exercise of varying the query type has illustrated
that the way a search is set up is very important. The queries here illustrate
the effect of different sequence types. There are other parameters that
often influence the search sensitivity. For example when a globular domain
is longer, the Gonnet
Pam250 matrix would be expected to outperform the default Blosum62
in the detection of divergent homologues, because it is less stringent
and so gives longer optimally matching segments. (Over short matches it
is noisier and could perform worse.) In fact we used the Gonnet matrix
to make the profiles: because of the extra information in the alignment,
profiles usually perform better with Pam250 than Blosum62. Also, gap penalties
are critical parameters in dynamic programming and should always be tested
by trial and error. In other words, it pays to try several variations in
the searches, not just accept the results of the first search.