Test sequence

Sequence Similarity Searches

In the previous session, one of the tasks was to identify proteins within a database that were related to Aurora Kinase B. We saw that many of these text searches returned proteins that turned out not to be evolutionarily related to the kinase.

In some cases, this is exactly what we want to do with our text searches. However, in a situation where we want to analyse the sequence of the kinase and its relatives, we want to avoid hitting sequences that are not actually related to the kinases. Note that if you had used searches that focused on PFAM family IDs, it would be possible to use a text search to easily identify a set of related sequences - sequences with the same PFAM ID should all be related to each other. This, of course, depends on the sequences having been annotated with the domains defined by PFAM families, and the accuracy of the search depends on these annotations having been correctly assigned.

Sequence similarity searching provides an annotation-independent way of identifying such sequences. These methods, as applied to protein sequences (they are also applied to nucleotide sequences, but we will not be looking at such applications in this course), use models of evolution (almost always of globular protein regions) in terms of the frequency of substitution of different amino acids, and of the insertion/deletion of regions within these domains. We previously briefly discussed the importance of similarity measures in bioinformatic analyses - in this case, sequences are considered more similar to each other if they share (within the same positions of an alignment) many amino acids that are known to frequently substitute for each other in globular domains. This measure in many cases finds that evolutionarily-related sequences are more similar to each other than are non-evolutionarily-related sequences.
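
The similarity measure described above can be illustrated with a minimal sketch. The scores in `TOY_SCORES` are illustrative values for a handful of residue pairs, not a complete published substitution matrix, and the default mismatch score is an assumption made purely for this example. Gaps (insertions/deletions) are ignored here, although real search tools do score them.

```python
# Toy substitution scores (illustrative only, NOT a complete published matrix).
# Positive scores: residues that frequently substitute for each other.
TOY_SCORES = {
    ("L", "L"): 4, ("I", "I"): 4, ("L", "I"): 2,
    ("D", "D"): 6, ("E", "E"): 5, ("D", "E"): 2,
    ("K", "K"): 5, ("R", "R"): 5, ("K", "R"): 2,
    ("W", "W"): 11,
}
MISMATCH_DEFAULT = -2  # assumed score for any residue pair not listed above


def pair_score(a: str, b: str) -> int:
    """Look up the score for a residue pair, ignoring the order of the pair."""
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), MISMATCH_DEFAULT))


def alignment_score(seq1: str, seq2: str) -> int:
    """Sum the pair scores over an ungapped alignment of equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("an ungapped alignment needs sequences of equal length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


# Conservative substitutions (L/I, K/R, D/E) still give a positive score...
print(alignment_score("LKDE", "IRED"))
# ...whereas pairs that are not listed as frequent substitutions are penalised.
print(alignment_score("LKDE", "WWWW"))
```

In this toy model, two sequences that differ at every position can still score well, provided the differences are between residues that commonly replace each other.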

Identifying records in a database that correspond to your sequence/entity of interest

As an example - if you were working on Aurora Kinase B in a human system, and have only been focusing on the biochemistry of the protein, you might want to find out its chromosomal location and genomic neighbourhood. One way to do this would be to search with the Aurora Kinase B sequence against the human genome sequence using BLAST - the most commonly used sequence similarity search tool.

Note that one could also do this using keyword searching - however, this has two potential disadvantages:

  • There are often several different names (synonyms) for a gene. Thus, you may search with a name that is not annotated for that particular gene and thus not be able to identify it in the database you are interested in.
  • There are sometimes several genes with similar/the-same names. Thus, you may use a text query to identify what you think is your gene of interest, but you may find instead a completely different, unrelated gene. If you do not realise this error, this could be a very costly mistake, with you drawing completely wrong conclusions about your gene. A typical situation of this kind might occur if your protein is known as p53 - perhaps it is THE p53 - but perhaps it is another protein with the same molecular weight...

How does one identify the sequence in the database you have searched against that is equivalent to your query protein? Ideally the database sequence and query sequence should be exactly the same length and have identical amino acid sequences when lined up against each other - such sequences would be described as "identical".

Note, however, that for several different reasons you may not find such an identical sequence in the database, but one that is almost but not quite identical. This could be because:

  • you are searching with a splice form that is not in the database, although other splice forms are present in the database
  • there are sequencing errors in either your query sequence or the database sequence
  • the start or end of either your query or the database sequence has been mis-predicted
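
As a rough illustration of how a query and its best database hit might be compared, here is a minimal sketch. The function name and the category labels are hypothetical, and the sequences are short made-up strings, not real protein sequences. It only distinguishes a few simple situations and cannot tell you which of the reasons above is responsible.

```python
def describe_match(query: str, hit: str) -> str:
    """Give a coarse description of how a database hit compares to the query."""
    if query == hit:
        return "identical"
    if query in hit or hit in query:
        # One sequence is an exact fragment of the other, as might happen
        # if a start or end position had been predicted differently.
        return "fragment"
    if len(query) == len(hit):
        mismatches = sum(a != b for a, b in zip(query, hit))
        return f"same length, {mismatches} mismatch(es)"
    # Different lengths and no exact containment: an alignment is needed
    # to judge whether, e.g., a different splice form has been hit.
    return "needs alignment"


print(describe_match("MKTAYIAK", "MKTAYIAK"))
print(describe_match("MKTAYIAK", "TAYIAK"))
print(describe_match("MKTAYIAK", "MKTAYLAK"))
```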


Try this yourself, using the Aurora Kinase B sequence and BLAST on the Ensembl website.

  • Can you find a sequence in the predicted peptide database that is identical to this sequence that is taken from UniProt?
  • Which chromosome is the human Aurora Kinase B located on?
  • Which genes are next to it in the genome?

Note that even if you find an identical match to your sequence in a database, you may still be misled by the result of the search. For example, if you expect that the sequence you have hit must come from the same organism as your query sequence (because the sequence will surely have acquired some different amino acids in other organisms), you would often be right (depending on how distantly related the two different organisms are). However, in some cases, protein sequences stay exactly (or nearly exactly) the same between very distantly related organisms.

To illustrate this, carry out a search of the UniProt database at the EBI using their NCBI-BLAST server, using the human alpha cardiac muscle actin sequence.


  • Is the sequence identical in organisms other than human? If so, which organisms is it identical in?

Identifying related sequences

As will be discussed tomorrow, a first step in analysing a protein sequence for functional insights is to examine it for the presence of modules that can be predicted with high confidence e.g. globular domains (and some non-globular functional regions) that are found in databases such as SMART, PFAM or CDD. This would be followed by prediction of non-globular regions of the protein, and then by a search for functional regions within these predicted non-globular regions e.g. using ELM. If there are regions predicted to be globular that are missing from these databases, one should then search for related proteins using sequence similarity searches.

There are two important uses of sequence similarity searches in this workflow. The first of these is explicitly referred to above i.e. the search for proteins related to your sequence of interest, at the end of the analysis - the aim being to discover as many proteins as possible that you are confident are related to your query sequence. The more such proteins you find, the more likely it is that one of them has been studied in a way that allows you to make functional predictions for your own sequence - as always, working on the assumption that sequences that are related to each other are more likely to share similar functions than randomly-selected sequences that are not related to each other. The second is to aid in the investigation of potentially functional parts of non-globular regions - looking at alignments of these regions can reveal portions of the sequence that are more highly conserved than others and which are therefore more likely to be involved in important functions.

One would carry out sequence similarity searches for these two different tasks in different ways. Thus we will look at them separately.

Identifying distant relatives

As mentioned, the aim of such searches is to identify as many different sequences related to your query sequence as possible, to increase the chance that you identify a related sequence for which functional information already exists. At the same time, one wants to be very sure not to mis-identify an unrelated sequence as being related to the query sequence - this could lead to you drawing completely wrong conclusions about your query sequence, which could have very serious consequences for your work. Thus, the set of exercises below is intended to show you how the parameters you use to run your BLAST searches influence both the specificity (ability to discriminate related from unrelated sequences) and the sensitivity (ability to identify as many true positives as possible, i.e. sequences that are truly related to your query sequence) of your searches. One is, of course, faced with a trade-off between these two qualities when making one's own searches - in practice, one repeats such a search using a range of different parameters.
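
The two qualities can be put into numbers once you know which database sequences are truly related to the query, as is the case for the specially designed database used below. The following is a minimal sketch using the standard definitions of sensitivity and specificity; the function name and the identifiers in the example are hypothetical.

```python
def search_quality(hits, relatives, database):
    """Return (sensitivity, specificity) for one search.

    hits      -- identifiers reported by the search
    relatives -- identifiers known to be truly related to the query
    database  -- all identifiers in the searched database
    """
    hits, relatives, database = set(hits), set(relatives), set(database)
    unrelated = database - relatives

    true_pos = len(hits & relatives)    # related and found
    false_neg = len(relatives - hits)   # related but missed
    false_pos = len(hits & unrelated)   # unrelated but reported
    true_neg = len(unrelated - hits)    # unrelated and correctly not reported

    sensitivity = true_pos / (true_pos + false_neg) if relatives else float("nan")
    specificity = true_neg / (true_neg + false_pos) if unrelated else float("nan")
    return sensitivity, specificity


database = {"a", "b", "c", "d", "e", "f", "g", "h"}
relatives = {"a", "b", "c", "d"}
hits = {"a", "b", "c", "e"}  # three relatives found, one unrelated sequence hit
print(search_quality(hits, relatives, database))
```

Reporting more hits (for example by accepting higher E-values) can only raise sensitivity or leave it unchanged, but it risks lowering specificity, which is the trade-off described above.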

For these exercises, we will be carrying out BLAST searches against a database that has been specially designed to illustrate these points, kindly supported by Yan P. Yuan from Peer Bork's group at the EMBL Heidelberg. Note that if you are conducting your own searches, you should instead use the BLAST servers at the EBI, NCBI, or some other such site, where the databases are regularly updated and where there are many associated tools that help you analyse your results further. Some examples of these servers are listed here.

Filtering out low complexity regions

Bork Group BLAST server

As you are hopefully now aware, one of the important components of the algorithm and model that BLAST and other such similarity search programs use to identify related sequences is a model of the frequency of substitutions between different amino acids that are observed in globular protein sequences. Thus, "good" alignments between related globular proteins can be scored highly. However, it is very important to note that these substitution patterns are different for non-globular sequences. As a result, alignments between unrelated, non-globular regions may also score highly - and because such alignments score highly, one may be misled into believing that the sequences are related when they are not. As we have discussed, we are very keen to avoid this!

Thus, for searches of this kind, we typically choose to remove those regions of the sequences that we predict (on the basis of amino acid compositions and repetitions) to be non-globular, using a filter.
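
To give a feel for what such a filter does, here is a simplified sketch that masks windows with a biased amino acid composition, measured as Shannon entropy. This is not the algorithm used by the filters on BLAST servers; the window size and entropy threshold are arbitrary assumptions chosen for illustration, and the sequences are made up.

```python
import math
from collections import Counter


def window_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a window."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq: str, window: int = 12, threshold: float = 2.2) -> str:
    """Replace residues in low-entropy windows with 'X' (assumed parameters)."""
    masked = list(seq)
    for start in range(max(len(seq) - window + 1, 0)):
        if window_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


# A compositionally diverse stretch followed by a run of a single residue.
print(mask_low_complexity("ACDEFGHIKLMNPQRSTVWY" + "Q" * 12))
```

Masked residues are then ignored when alignments are seeded and scored, so a run of glutamines in the query can no longer produce high-scoring matches to unrelated glutamine-rich proteins.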


  • What effect do you think running a search with, rather than without, the filter would have in terms of selectivity and sensitivity?

Task - investigate effect of low-complexity filter on BLAST search results

Query the database "btbd" twice with the Drosophila melanogaster protein scribbler (also known as Brakeless). The first query should be done WITH the low-complexity filter switched on, the second query WITHOUT the filter switched on. This protein is predominantly low-complexity, but contains a Zinc Finger domain somewhere near the middle of the sequence.

The "btbd" database contains a set of sequences specifically chosen to illustrate features of sequence similarity searching. It contains several sets of proteins that are listed below - unless otherwise indicated, proteins that belong to one set are not related to proteins from another set. This should allow you to easily check the output of the BLAST searches for true and false positives and negatives.

  • A20 zinc-finger domain containing proteins
  • Angiomotins (contain coiled-coil regions)
  • Collagens
  • Bacterial Ice-Nucleation proteins
  • Prions
  • rasGEF-domain containing proteins
  • Insect Scribbler/Brakeless-like proteins (contain a C2H2-like Zinc finger domain)
  • Securins
  • Yeast-Eco1p-like proteins (contain an acetyltransferase domain - although this is currently missed in PFAM)
  • C2H2 Zinc-finger domain containing proteins

(NOTE all zinc finger domains are presumed to be related to each other)


  • Is there a difference in the sensitivity and selectivity of these two searches?
  • Which of the searches was more useful in the search for distantly-related protein sequences?

Changing matrices and gap penalties

Hopefully the previous exercise has demonstrated that it is best to carry out searches of this kind with the low-complexity filter switched on.

Other parameters of BLAST searches can also be worth adjusting in the search for distantly-related sequences. To explore the effect these parameters can have on your ability to identify distantly-related sequences, query with this bacterial protein, which contains a TAZ domain. Use a range of different substitution matrices and gap penalties, and perhaps try changing other parameters too. In each case, check the number of alignments that are found with an E-value of less than 0.0001 / 1E-4 (if the sequences have been evolving "normally", such an E-value is strongly indicative of the aligned sequences being related). Note also the changes in the scores and E-values for sequences in the different searches.


  • Which combinations of parameters give the "best" results (i.e. the highest number of sequences that you know are related to the query with E-values below the arbitrary threshold of less than 0.0001 / 1E-4)?
  • Is there a trend in the scores and E-values of alignments obtained from "better" searches compared to "worse" searches?
  • Did switching on the low-complexity filter have an effect on these searches? If not, why not?
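
If you repeat searches with many parameter combinations, counting the hits below the threshold by hand becomes tedious. The sketch below assumes the results are available in the 12-column tab-separated format produced by NCBI BLAST+ with `-outfmt 6` (subject identifier in column 2, E-value in column 11); the server used in this course may present its results differently. The rows in `sample` are invented placeholders, not real search results.

```python
E_VALUE_THRESHOLD = 1e-4  # the arbitrary threshold used in this exercise


def significant_hits(tabular_text: str, threshold: float = E_VALUE_THRESHOLD):
    """Return (subject_id, e_value) pairs with an E-value below the threshold.

    Assumes 12 tab-separated columns per line, as in BLAST+ '-outfmt 6':
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore
    """
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject_id, e_value = fields[1], float(fields[10])
        if e_value < threshold:
            hits.append((subject_id, e_value))
    return hits


def row(subject: str, e_value: str, bit_score: str) -> str:
    """Build one invented result line for the example below."""
    return "\t".join(["query1", subject, "35.0", "100", "60", "5",
                      "1", "100", "1", "100", e_value, bit_score])


sample = "\n".join([row("hitA", "2e-30", "120"),
                    row("hitB", "0.003", "38"),
                    row("hitC", "1.5", "25")])
print(significant_hits(sample))
```

Running this once per parameter combination and comparing the resulting lists against the sets of proteins known to be related gives a quick way to tabulate which settings perform "best".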

You should also have noticed that, while relatively few sequences are found to be significantly (with E-value of less than 0.0001 / 1E-4) similar to the query sequences, there were many sequences known to be related to your query sequence with higher E-values. This illustrates an important point - just because sequences are not "significantly" similar, it does not mean that they are not related to the query protein.

There is a way to check whether a sequence with an insignificant score/E-value is related to the query sequence - if the query and the hit protein are related, then we expect that BLAST searches using the hit sequence would tend to identify a similar set of sequences to those obtained from the initial search.
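
One simple way to quantify "a similar set of sequences" is the fraction of hits shared between the two searches (the Jaccard index). This is a minimal sketch; the function name and the identifiers are hypothetical, and what counts as a high enough overlap is a judgment call rather than a fixed rule.

```python
def hit_overlap(first_hits, second_hits) -> float:
    """Jaccard index of two hit sets: shared hits / all distinct hits."""
    a, b = set(first_hits), set(second_hits)
    if not a and not b:
        return 0.0  # no hits in either search, so nothing is shared
    return len(a & b) / len(a | b)


original_search = {"p1", "p2", "p3"}    # hits found with the query
reciprocal_search = {"p2", "p3", "p4"}  # hits found with one hit as the new query
print(hit_overlap(original_search, reciprocal_search))
```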

Task - investigate the behaviour of "reciprocal" BLAST searches

To check whether this is indeed the case, use the accession numbers of the hit sequences found in the BLAST output to obtain the corresponding sequences via SRS, and repeat the search using the hit protein as the query. Repeat this for several other such hits - do you indeed see that these sequences tend to hit similar sets of sequences in these "reciprocal" searches?

To see how sequences that have significant E-values but are not relatives of the query sequence behave in this context, repeat the search with the Drosophila protein scribbler without using the filter. Carry out similar reciprocal searches using sequences that you know are unrelated to the query sequence but which have significant E-values - is the pattern of sequences hit in this set of reciprocal searches different from that for the reciprocal searches with the "true" positives?

Hopefully, this set of exercises will have demonstrated to you how to use BLAST searches to identify distant relatives of your sequence of interest, and the influence of changing search parameters on your attempt to carry out this task.

Note that if these searches fail to identify related proteins whose sequence is helpful in predicting the function of your protein, there are more sensitive methods available e.g. PSI-BLAST, or creating your own protein profile HMMs for searches using HMMER. However we won't look at those here - ask if you're interested in learning more about them.

When working with globular regions of sequences, we usually switch the low-complexity filters on - typically providing us with a much more specific search than if we didn't use the filters.

However, if one is specifically interested in regions that have low sequence-complexity, this approach will cause problems - particularly in cases where the entire protein is non-globular. In this case, the best approach is to take only hits with very low E-values, where the reciprocal BLAST searches also hit the query protein with low E-values.

As already mentioned, the aim of such searches is to create alignments of non-globular/poorly-conserved regions, to identify local regions of higher conservation which are potential candidates for functionally important sites in your protein of interest.

The main difference with searches of this type (compared to those carried out to determine distant relatedness between proteins) is that you are attempting to identify closely-related sequences in non-globular sequence regions (because more distantly-related sequences are likely to have diverged so much that alignment is no longer feasible in these non-globular regions of interest). Thus, it makes no sense to carry out such searches while using the low-complexity filters, as this would exclude exactly those sequences/regions-of-sequences you are interested in from your analysis.
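
Once closely-related sequences have been collected and aligned, locally conserved positions can be picked out column by column. The sketch below uses a deliberately simple measure (the fraction of sequences sharing the most common residue in each column); the function name is hypothetical, the aligned sequences are made up, and dedicated alignment viewers offer more refined conservation scores.

```python
from collections import Counter


def column_conservation(aligned):
    """Per-column fraction of sequences carrying the most common residue.

    'aligned' is a list of equal-length aligned sequences, with '-' for gaps.
    Gaps count against conservation because the denominator is the number
    of sequences, not the number of residues in the column.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)  # column made up entirely of gaps
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


alignment = ["MSPKRTA",
             "MAPKRSG",
             "M-PKRQE"]
scores = column_conservation(alignment)
fully_conserved = [i for i, s in enumerate(scores) if s == 1.0]
print(fully_conserved)
```

A run of adjacent fully conserved columns inside an otherwise variable non-globular region (here the "PKR" stretch) is the kind of local signal this sort of search is meant to reveal.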

Learning Objectives

The aims of this session are:

  • To provide an overview of the algorithmic/theoretical basis of sequence similarity searching
  • To provide hands-on experience of sequence similarity searching, and to give an overview of
    • Different options available and the influence these options can have on the results of the search
    • Guidelines on carrying out your own searches
    • Differences in the way that disordered/low-complexity sequence behaves in such searches compared to globular sequence
  • To raise awareness of situations in which it makes sense to apply sequence similarity searches (i.e. typical use cases)
    • Working with a sequence that is almost completely uncharacterised, using the search to come up with initial predictions for the function of the sequence
    • Identifying relatives of a protein of interest in a particular organism/group of organisms

After this session, we want you to:

  • Gain at least a basic understanding of the algorithms/theory used in sequence similarity searches
  • Know which servers and tools are available to carry out such searches
  • Know of situations in which it could be useful for you to apply these techniques

Links to more information on sequence similarity searching

-- Main.AidanBudd - 25 May 2007

Page last modified on January 29, 2008, at 11:51 AM CET