Sequence Similarity Searching Tutorial

Aidan Budd and Toby Gibson

Tuesday 28th November 2006

As has already been discussed, an important use of bioinformatics, particularly for wet-lab scientists, is the identification of a set of evolutionarily related sequences. (Rather than continually specifying that we are referring to evolutionary relatedness, from here on we will simply use the word "related".) The most commonly used tool for such work is BLAST. The citation counts of the original BLAST papers (as of December 2006) show how many users this software has:

Altschul SF et al.
Basic Local Alignment Search Tool
Journal of Molecular Biology 215 (3): 403-410, 1990
20978 citations

Altschul SF et al.
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
Nucleic Acids Research 25 (17): 3389-3402, 1997
17812 citations

Both papers are the most highly cited in all of science for their year of publication!

Much of molecular and cellular biology involves the quest to discover functions associated with biological sequences. "Sequence Similarity Searches" of the kind carried out by BLAST (and other software) provide a quick way of obtaining a wealth of possible ideas about the function of a sequence, certainly much quicker than most wet-lab techniques. This is why such sequence similarity methods are so heavily used.

How can these methods be such powerful tools? The answer lies in a simple truth about biological sequences: two related sequences are more likely to have similar functions than two unrelated sequences. To look at this another way, suppose you are working on a newly identified sequence, which we will call sequence A. You discover another sequence, sequence B, that BLAST or a similar tool shows to be related to sequence A. If someone else has already discovered something about the function of sequence B, then an obvious and hopefully testable hypothesis for your experimental work is that sequence A shares this function. This greatly narrows down the set of experiments you might choose to conduct on sequence A.

As already mentioned, the most frequently used tool for identifying sequences related to a query sequence is BLAST. The most common way of using BLAST is to input a "query sequence" (e.g. the sequence of a protein you are interested in) and have BLAST compare it to a set of other sequences (the sequence database). BLAST then returns those sequences in the database that it calculates are most likely to be related to your query sequence.

For each reported comparison between the query sequence and a database sequence, BLAST calculates a score reflecting its estimate of how likely it is that the two sequences are related: the higher the score, the more likely BLAST considers the database sequence to be related to the query sequence. However, rather than relying on these scores, most users instead consult the "E-value" that BLAST also calculates for each comparison. The E-value, or "Expectation" value, is a number ranging between 0 and the total number of sequences in the database searched against. It indicates the number of sequences that would be expected to obtain that score (or higher) if the query sequence were compared against a database containing no sequences related to the query sequence. Thus, the lower the E-value of a comparison, the more likely it is that the two sequences are related.

An E-value of 0.00001 or less (also written as 1e-5, which is shorthand for 1.0 * 10^-5) is often used as good initial evidence that a query and database sequence are related, although further investigation should always be carried out to obtain additional support for such a hypothesis.
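To make the relationship between score, database size and E-value concrete, here is a minimal Python sketch. The tutorial does not give the formula BLAST uses. The sketch assumes the standard Karlin-Altschul form E = K * m * n * exp(-lambda * S), where m is the query length and n is the total database length in residues. The function names, the default values of `lam` and `k`, and the lengths in the example are illustrative assumptions, not values taken from this tutorial or from any particular BLAST run.

```python
import math


def expected_hits(raw_score, query_len, db_len, lam=0.267, k=0.041):
    """Estimate an E-value with the Karlin-Altschul form E = K*m*n*exp(-lambda*S).

    lam and k depend on the scoring matrix and gap penalties; the defaults
    here are placeholders chosen for illustration only.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


def looks_related(e_value, threshold=1e-5):
    """Apply the rule of thumb from the text: E <= 1e-5 is good initial evidence."""
    return e_value <= threshold


if __name__ == "__main__":
    # Hypothetical query of 240 residues against a database of 90 million residues.
    for score in (50, 100, 200):
        e = expected_hits(score, query_len=240, db_len=90_000_000)
        print(f"score={score:>3}  E={e:.2e}  passes 1e-5 cutoff: {looks_related(e)}")
```

Under this assumed formula, the E-value for a fixed score grows in proportion to the database size, so the same alignment can look less significant when searched against a larger database.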

The aim of these exercises is to demonstrate that the way in which you run BLAST can affect how well the program identifies sequences in a database that are evolutionarily related to your query sequence. To judge how well BLAST has carried out this task, we need to consider both the accuracy (also known as the selectivity) and the sensitivity of the search. Accuracy refers to the ability of the search to assign high scores only to sequences in the database that are evolutionarily related to the query sequence. Sensitivity refers to the ability of the search to identify as many of the sequences related to the query sequence as possible by reporting them with high scores.
 
To illustrate the difference between accuracy and sensitivity, consider a BLAST search against a database that contains many sequences related to the query sequence. If the search identifies only one of the database sequences as related to the query sequence (i.e. only one of the database sequences is reported with a high score/low E-value), then this is an insensitive search, as many of the truly related sequences were not identified as such, i.e. there are many false negatives. However, if the same search also does not assign a high score/low E-value to any of the sequences in the database that are unrelated to the query sequence, then the search was also very accurate, as it did not report any false positives.
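The scenario above can be expressed as a small calculation. This is only a sketch: the tutorial does not define accuracy and sensitivity numerically, so the sketch assumes sensitivity is the fraction of truly related sequences that were reported, and accuracy is the fraction of reported sequences that are truly related. The function name and the sequence identifiers are hypothetical.

```python
def search_quality(reported, truly_related):
    """Summarise a search given the reported hits and the known related set."""
    reported = set(reported)
    truly_related = set(truly_related)

    true_positives = reported & truly_related
    false_positives = reported - truly_related
    false_negatives = truly_related - reported

    # Fraction of the truly related sequences that the search found.
    sensitivity = len(true_positives) / len(truly_related) if truly_related else 0.0
    # Fraction of reported sequences that are truly related; a search that
    # reports nothing has no false positives, so it is treated as 1.0 here.
    accuracy = len(true_positives) / len(reported) if reported else 1.0

    return {
        "sensitivity": sensitivity,
        "accuracy": accuracy,
        "false_positives": sorted(false_positives),
        "false_negatives": sorted(false_negatives),
    }


if __name__ == "__main__":
    # Ten related sequences in the database, but only one is reported.
    related = [f"related_{i}" for i in range(1, 11)]
    result = search_quality(reported=["related_1"], truly_related=related)
    print(result["sensitivity"], result["accuracy"])
```

In this example the search is accurate (no false positives) but insensitive (nine false negatives), matching the case described in the text.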

Note that to assess the accuracy and sensitivity of a search, we have to already know, before doing the search, which sequences in the database are related to our query and which aren’t – this is obviously not normally the case when one is using BLAST in one's own research, where the aim is usually to try and work out which sequences in the database are related to the query sequence. However, by understanding better how accuracy and sensitivity can be influenced in BLAST, you should be able to conduct more effective searches using your own sequences in the future, hence our use of the somewhat artificial situation in these exercises.

Exercise 1

The following sequence is an SM protein, involved in RNA splicing. You can obtain the sequence from this link RSMB_HUMAN.

We will use it to search the swissprot database, where reading the sequence annotation helps us know which sequences are related and which unrelated to our query sequence.

Note the score that BLAST assigns to each alignment obtained between the query sequence and a database sequence. As already mentioned, BLAST calculates the significance of these scores using information such as the length of the query sequence and the size of the database searched against. This significance is reported as an E-value, a measure of the number of alignments one would expect with this score if the database contained only sequences unrelated to the query sequence.

Q1a: Based on this definition of an E-value, in a query against a database containing no sequences that are related to your query, how many sequences would you expect to obtain a score which has an E-value of <= 1?

Q1b: With E-value <= 10?

Q1c: What is the largest E-value you could find for a database containing 1000 sequences?

Use the SM protein sequence to query the swissprot database at the EBI using its NCBI-BLAST2 web server. Run the search using the default settings, apart from changing the database to swissprot rather than uniprot. Save a link to the result page as a bookmark so that you can consult the results again later. Give the bookmark a useful name so that you can easily retrieve it, e.g. rsmbInitialSearch.

Q2: Was this an accurate search i.e. can you find some (many?) sequences with low E-values that are not SM proteins (based on the short annotations given to the sequences by swissprot)?

Q3: Was this a sensitive search i.e. were there many sequences that are related to the query sequence having an E-value indicating relatedness to the query sequence?

Examine the alignments between the unrelated sequences that score highly (click on the "Blast Result" button).

Q4a: Do these alignments tend to include the same/a similar part of the query sequence? Or, looked at the other way, is there a region of the query sequence that tends not to be included in these alignments?

Q4b: Are there any obvious characteristics of the amino acids of the query sequence that are involved in these alignments?

Q4c: Can you suggest a way in which one might be able to increase the accuracy of these BLAST searches by editing/excluding regions of a query sequence from comparisons made against database sequences?

Q4d: What do you think the consequences of this alteration would be on the accuracy and sensitivity of the search?

Repeat the BLAST search against SWISSPROT, this time using as a query one of the sequences that, based on the annotation, you suspect not to be truly related to the query sequence, but which has an E-value of less than 1e-3.

Save the results of your search as a bookmark, and repeat it using a sequence that, based on the swissprot annotation, you expect to be truly related to your query sequence, also with an E-value of less than 1e-3 (again saving the result page as a bookmark).

There should be an obvious difference between the sets of the sequences retrieved from searches made using sequences you expected were related to the query sequence, and those that were not.

Q5a: How would you describe this difference in terms of their function as indicated by the swissprot annotation of these sequences?

Q5b: Based on this observation, can you suggest a way in which one could attempt to confirm whether or not a database sequence with a low E-value is truly related to the query sequence?

Exercise 2

As you may have realized, regions of proteins that contain many of the same amino acids (so-called "low complexity regions") tend to cause false positives, i.e. database sequences that are reported with low E-values but are not actually related to the query sequence. Therefore, in almost all cases one chooses to run such searches omitting these regions of the sequence. Thankfully, there is an easy way of implementing such a strategy, as filters have been developed that process the query sequence prior to submission to BLAST, to remove exactly these regions.
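To give a feel for what such a filter does, here is a minimal Python sketch that masks low complexity regions using the Shannon entropy of a sliding window. This is a simplified stand-in and not the filter used by the BLAST server. The function names, the window size of 12 and the entropy threshold of 2.2 bits are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=2.2):
    """Replace residues in low-entropy windows with 'X'.

    Every window whose entropy falls below min_entropy is masked in full.
    Sequences shorter than the window are returned unchanged.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


if __name__ == "__main__":
    # Hypothetical sequence: a diverse stretch, a run of one residue, a diverse stretch.
    example = "ACDEFGHIKLMN" + "P" * 20 + "ACDEFGHIKLMN"
    print(mask_low_complexity(example))
```

Masked residues are written as "X" so that the sequence keeps its original length and coordinates, while the masked positions no longer contribute matches to an alignment.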

Repeat the search using the EBI NCBI-BLAST2 server and the same query sequence as before, RSMB_HUMAN. This time switch the filter option to "true", and also set the number of alignments to be displayed to 250 rather than the default (50).

Again, bookmark the result page.


Q6a: Comment on the accuracy and sensitivity of this search in comparison to the first search you carried out using RSMB_HUMAN against SWISSPROT.

Q6b: Are there sequences in the database whose annotation indicates that they are related to RSMB_HUMAN but which have relatively high E-values?

Q6c: Given your answer to the previous question, if a database sequence has a high E-value e.g. 4.0, does this indicate that the sequence is unrelated to the query protein?

Other factors can also influence the accuracy and sensitivity of the search.

Repeat the searches several times, varying some of the parameters, including: gapalign, opengap, extendgap, matrix. Click on the links above these options to learn the meaning of these different parameters, and bookmark the result page each time.
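When varying several parameters it is easy to lose track of which bookmark belongs to which settings. One possible way to plan the runs is sketched below in Python. It only enumerates combinations and builds bookmark names; it does not contact any server. The function names and the specific matrix and gap penalty values are illustrative assumptions, and a given server may accept only some of these combinations.

```python
from itertools import product


def parameter_grid(matrices, open_gaps, extend_gaps):
    """List the parameter combinations to try, one dictionary per search."""
    runs = []
    for matrix in matrices:
        # With gapalign switched off, the gap penalties are not used,
        # so a single ungapped run per matrix is enough.
        runs.append({"matrix": matrix, "gapalign": False,
                     "opengap": None, "extendgap": None})
        for opengap, extendgap in product(open_gaps, extend_gaps):
            runs.append({"matrix": matrix, "gapalign": True,
                         "opengap": opengap, "extendgap": extendgap})
    return runs


def bookmark_name(query, run):
    """Build a descriptive bookmark name for one search."""
    if not run["gapalign"]:
        return f"{query}_{run['matrix']}_ungapped"
    return f"{query}_{run['matrix']}_open{run['opengap']}_ext{run['extendgap']}"


if __name__ == "__main__":
    grid = parameter_grid(["BLOSUM62", "PAM70"], open_gaps=[9, 11], extend_gaps=[1, 2])
    for run in grid:
        print(bookmark_name("rsmb", run))
```

Changing one parameter at a time relative to a baseline search, rather than running the full grid, makes it easier to attribute a change in accuracy or sensitivity to a particular setting.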

Q7a: Which set of parameters give the best result, in your opinion?

Q7b: Why do you find this result the best (think in terms of accuracy and sensitivity of the searches)?

Exercise 3

(only if you have enough time)

We will now look at a different protein family, the annexins. To learn more about the family, you can have a look at its entry in the SMART database of protein domains.

Using the query sequence of an annexin from the single-celled eukaryotic parasite Giardia lamblia, ANXE1_GIALA, carry out a number of different BLAST searches to identify the set of parameters that you find gives the best results.

Q8b: Is the set of optimal parameters for these searches different than those for the RSMB_HUMAN protein? If so, why do you think this may be the case?


Hopefully this exercise will have illustrated that:
  1. Sequence similarity searches can be a very useful tool for obtaining ideas about the function of a sequence
  2. It is vital to examine the results of such a search critically, as they often return both false positives and negatives
  3. A good way of critically examining the results of such a search is to carry out further searches using sequences identified as potentially related to your query from your initial search
  4. The parameters with which a sequence similarity search is carried out can have a significant effect on the ability of the search to identify proteins related to the query sequence.