Sequence Similarity Searching Tutorial
Aidan Budd and Toby Gibson
Tuesday 28th November 2006
As has already been discussed, an important aspect of bioinformatics,
particularly for its pragmatic use by wet-lab scientists, is its use in
the identification of a set of evolutionarily related sequences.
(Note that rather than continually specifying that we are referring to
evolutionary relatedness, from here on we will simply use the word
"related".) The most commonly used tool for such work is BLAST. As can
be seen from the citation counts (as of December 2006) of the original
BLAST papers, this software has a huge number of users:
Altschul SF et al.
Basic Local Alignment Search Tool
Journal of Molecular Biology 215 (3): 403-410, 1990
20978 citations
Altschul SF et al.
Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs
Nucleic Acids Research 25 (17): 3389-3402, 1997
17812 citations
Each paper is the most highly cited in all of science for its year
of publication!
Much of molecular and cellular biology involves the quest to discover
functions associated with biological sequences. "Sequence Similarity
Searches" of the kind carried out by BLAST (and other software) provide
a quick way of obtaining a wealth of possible ideas about the function
of a sequence, certainly much quicker than most wet-lab techniques. This
is why such sequence similarity methods are so heavily used.
How can these methods be such powerful tools? This is due to a simple
truth about biological sequences - two related sequences are more
likely to have similar function than two unrelated sequences.
To look at this another way, consider the situation where one is
working on a newly identified sequence, let's call it sequence A. You
discover another sequence, e.g. sequence B, that is shown by BLAST
or some other similar tool to be related to sequence A. If someone
else has already discovered something about the function of sequence B,
then an obvious and hopefully testable hypothesis you can use in your
experimental work is that
sequence A also shares this function - thus greatly narrowing down the
set of experiments you might choose to conduct on sequence A.
As already mentioned, the most frequently-used tool to try and identify
sequences related to a query sequence is the program known as BLAST. The
most common way of using BLAST is to input a "query sequence" (e.g. the
sequence of a protein you are interested in), and to have BLAST compare
this sequence to a set of other sequences (the sequence database).
BLAST then returns to the user those sequences in the database that it
calculates are most likely to be related to your query sequence. For
each comparison between the query sequence and a database sequence that
is reported to the user, BLAST calculates a score that reflects its
estimate of how likely it is that the two sequences are related - the
higher the score, the more likely it is, by BLAST's calculation, that
the database sequence is related to the query sequence. However, rather
than relying on these
scores, most users consult instead the "E-value" for each comparison,
also calculated by BLAST. This E-value is a number ranging between 0
and the total number of sequences in the database searched against - it
is known as the "Expectation"-value, and indicates the number of
sequences that would be expected to have that score (or more) if the
query sequence were compared against a database containing no sequences
related to the query sequence. Thus, a lower E-value indicates that
the sequences are more likely to be related than if the comparison had
a higher E-value. An
E-value of 0.00001 or less (also sometimes written as 1e-5, which is
shorthand for 1.0 * 10^-5) is often used as good initial
evidence that a query and database sequence are related, although
further investigation should always be carried out to obtain additional
support for such a hypothesis.
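As a rough illustration of how the 1e-5 rule of thumb might be applied
to a list of hits, here is a minimal Python sketch. The hit names and
E-values are invented for illustration and are not real BLAST output;
the function name candidate_relatives is likewise just an assumption of
this sketch.

```python
# Hypothetical (made-up) hits: (database sequence name, E-value).
hits = [
    ("seqB", 3e-42),
    ("seqC", 1e-5),
    ("seqD", 0.002),
    ("seqE", 4.0),
]

# 1e-5 is the same number as 0.00001, i.e. 1.0 * 10^-5.
THRESHOLD = 1e-5


def candidate_relatives(hits, threshold=THRESHOLD):
    """Return the names of hits whose E-value is at or below the threshold."""
    return [name for name, evalue in hits if evalue <= threshold]


print(candidate_relatives(hits))
```

Hits that pass such a threshold are only candidates: as noted above,
they still need further investigation before being accepted as related.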
The aim of these exercises is to demonstrate that the way in which you
run BLAST can affect how well the program identifies sequences within a
database that are evolutionarily related to your query sequence. To
judge how well BLAST has carried out this task, we need to consider both
the accuracy (also known as the selectivity) and the sensitivity of the
search. Accuracy refers to the ability of the search to assign high
scores only to sequences in the database that are evolutionarily related
to the query sequence. Sensitivity refers to the ability of the search
to identify as many as possible of the sequences related to the query
sequence by reporting them with high scores.
To illustrate the difference between accuracy and sensitivity, consider
a BLAST search against a database that contains many sequences related
to the query sequence. If the search identifies only one of the database
sequences as related to the query sequence (i.e. only one of the
database sequences is reported with a high score/low E-value), then this
is an insensitive search, as many of the truly related sequences were
not identified as such, i.e. there are many false negatives. However, if
the same search also does not assign a high score/low E-value to any of
the sequences in the database that are unrelated to the query sequence,
then the search was also very accurate, as it did not report any false
positives.
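The scenario above can be put into numbers. The following Python sketch
is one possible way to quantify the two ideas, under the assumption that
sensitivity is taken as the fraction of truly related sequences that
were reported, and accuracy (selectivity) as the fraction of reported
sequences that are truly related. The function name assess_search and
the sequence names are hypothetical.

```python
def assess_search(reported, related):
    """Compare the reported hits with the known set of related sequences.

    reported: names of database sequences given a high score/low E-value
    related:  names of database sequences truly related to the query
    """
    reported, related = set(reported), set(related)
    true_pos = reported & related    # related and reported
    false_pos = reported - related   # reported but unrelated
    false_neg = related - reported   # related but missed
    return {
        "false_positives": len(false_pos),
        "false_negatives": len(false_neg),
        # Fraction of the truly related sequences that were found.
        "sensitivity": len(true_pos) / len(related) if related else None,
        # Fraction of the reported sequences that are truly related.
        "selectivity": len(true_pos) / len(reported) if reported else None,
    }


# Ten related sequences in the database, but only one of them is reported.
related = {f"rel{i}" for i in range(1, 11)}
result = assess_search(reported={"rel1"}, related=related)
print(result)
```

In this example the search misses nine related sequences (low
sensitivity) but reports no unrelated ones (high accuracy), matching the
situation described in the text.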
Note that to assess the accuracy and sensitivity of a search, we have
to already know, before doing the search, which sequences in the
database are related to our query and which aren’t – this is obviously
not normally the case when one is using BLAST in one's own research,
where the aim is usually to try and work out which sequences in the
database
are related to the query sequence. However, by understanding better
how accuracy and sensitivity can be influenced in BLAST, you should be
able to conduct more effective searches using your own sequences in the
future, hence our use of the somewhat artificial situation in these
exercises.
Exercise 1
The following sequence is an SM protein, involved in RNA splicing. You
can obtain the sequence from this link RSMB_HUMAN.
We will use it to search the swissprot database, where reading the
sequence annotation helps us know which sequences are related
and which unrelated to our query sequence.
Note the score that BLAST assigns to each alignment obtained between
the query sequence and a database sequence. BLAST is able to calculate
the significance of the scores obtained, using information such as the
length of the query sequence and the size of the database searched
against. As already mentioned, this significance is reported as an
E-value, a measure of the number of alignments one would expect with
this score (or more) if the database contained only sequences unrelated
to the query sequence.
Q1a: Based on this definition of an
E-value, in a query against a database containing no sequences that are
related to your query, how many sequences would you expect to obtain a
score which has an E-value of <= 1?
Q1b: With E-value <= 10?
Q1c: What is
the largest E-value you could find for a database containing 1000
sequences?
Use the SM protein sequence to query
the swissprot database at the EBI using its NCBI-BLAST2 web server. Run
the search using the default settings, except for changing the database
to swissprot rather than uniprot. Bookmark the result page so that you
can consult the results again later, and give the bookmark a useful
name so that you can easily retrieve it, e.g. rsmbInitialSearch.
Q2: Was this an accurate search i.e. can you find some (many?)
sequences with low E-values that are not SM proteins (based on the
short annotations given to the sequences by swissprot)?
Q3: Was this a sensitive search, i.e. did many of the sequences that
are related to the query sequence have an E-value indicating
relatedness to the query sequence?
Examine the alignments between the query sequence and the unrelated
sequences that score highly (click on the "Blast Result" button).
Q4a: Do these alignments tend to include the same/a similar part of
the query sequence? Or, looked at the other way, is there a region of
the query sequence that tends not to be included in these alignments?
Q4b: Are
there any obvious characteristics of the amino acids of the query
sequence that are involved in these alignments?
Q4c: Can you suggest a way in which one might
be able to increase the accuracy of these BLAST searches by
editing/excluding regions of a query sequence from comparisons made
against database sequences?
Q4d: What do you think the
consequences of this alteration would be on the accuracy and
sensitivity of the search?
Repeat the BLAST search against
SWISSPROT, this time using as a query
one of the sequences that, based on the annotation, you suspect not to
be truly related to the original query sequence (RSMB_HUMAN), but which
has an E-value of less than 1e-3.
Save the results of your search as a
bookmark, and repeat it using a sequence that,
based on the swissprot annotation, you expect to be truly related to
the original query sequence, also with an E-value less than 1e-3 (again
saving the result page as a bookmark).
There should be an obvious difference between the sets of the
sequences retrieved from searches made using sequences you expected
were related to the query sequence, and those that were not.
Q5a: How would
you describe this difference in terms of their function as indicated by
the swissprot annotation of these sequences?
Q5b: Based on this observation,
can you suggest a way in which one could attempt to confirm whether or
not a database sequence with a low E-value is truly related to the
query sequence?
Exercise 2
As you may have realized, there is a tendency for regions of proteins
that contain many of the same amino acids (so-called "low complexity
regions") to
be responsible for causing false positives i.e. database sequences
being reported with low E-values that are not actually related to the
query sequence. Therefore, in almost all cases one
chooses to run such searches omitting these regions of the sequence.
Thankfully, there is an easy way of implementing such a strategy, as
filters have been developed that process the query sequence prior to
submission to BLAST, to remove exactly these regions.
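To give a feel for what such a filter does, here is a minimal Python
sketch. It is not the algorithm used by the BLAST server's filter
option; it is a simplified stand-in that slides a window along the
sequence and masks (replaces with "X") any window whose amino acid
composition has low Shannon entropy. The function names, the window
length of 12 and the entropy cut-off of 2.2 bits are all arbitrary
assumptions chosen for illustration, as is the toy sequence.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=2.2):
    """Replace every residue lying in a low-entropy window with 'X'."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


# Toy sequence (not a real protein): a proline-rich stretch flanked by
# compositionally diverse residues.
toy = "ACDEFGHIKLMN" + "P" * 20 + "ACDEFGHIKLMN"
print(toy)
print(mask_low_complexity(toy))
```

Masking with "X" rather than deleting residues keeps the sequence the
same length, so positions in the alignments still correspond to
positions in the original query.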
Repeat the search using the EBI
NCBI-BLAST2 server and the same query sequence as before, RSMB_HUMAN.
This time switch the filter option to "true", and also set the number
of alignments to be displayed to 250 rather than the default (50).
Again, bookmark the result page.
Q6a: Comment on the accuracy and
sensitivity of this search in comparison to the first search you
carried out using RSMB_HUMAN against SWISSPROT.
Q6b: Are there sequences in the
database whose annotation indicates that they are related to RSMB_HUMAN
but which have relatively high E-values?
Q6c: Given your answer to the
previous question, if a database sequence has a high E-value e.g. 4.0,
does this indicate that the sequence is unrelated to the query protein?
Other factors can also influence the accuracy and sensitivity of the
search.
Repeat the searches several times,
varying some of the parameters including: gapalign, opengap,
extendgap, matrix. Click on the links above these options to learn the
meaning of these different parameters, and bookmark the result page
each time.
Q7a: Which set of parameters gives the
best result, in your opinion?
Q7b: Why do you find this result the
best (think in terms of accuracy and sensitivity of the searches)?
Exercise 3
(only if you have enough time)
We will now look at a different protein family, the annexins. To learn
more about the family, you can have a look at its entry in the SMART
database of protein domains.
Using the query sequence of an annexin
from the single-celled eukaryotic parasite Giardia lamblia, ANXE1_GIALA, carry out a number of different BLAST
searches to identify the set of parameters that you find gives the best
results.
Q8b: Is the set of optimal parameters
for these searches different from that for the RSMB_HUMAN protein? If
so, why do you think this may be the case?
Hopefully these exercises will have illustrated that:
- Sequence similarity searches can be a very useful tool for obtaining
ideas about the function of a sequence
- It is vital to examine the results of such a search critically, as
they often return both false positives and false negatives
- A good way of critically examining the results of such a search is to
carry out further searches using sequences identified as potentially
related to your query from your initial search
- The parameters with which a sequence similarity search is carried out
can have a significant effect on the ability of the search to identify
proteins related to the query sequence.