Sequence Database Similarity Searching

Nigel Brown and Aidan Budd


We will explore the functionality of the BLAST2 program in relation to sequence alignment concepts.

The easiest way to run BLAST is to use a web-server. Two general purpose public BLAST web-servers are at NCBI and EBI, whose links are given below. We will only use the NCBI BLAST server in these exercises.

In real life you should consider using both tools as they offer different search databases and slightly different functionality. In fact, EBI also offers WU-BLAST2 as well as NCBI-BLAST2 for the same kinds of searches and you should try WU-BLAST2, as it may give different results.

Unfortunately, the sequence databases are redundant: the same sequences appear in the US and European databases under different accession numbers and identifiers, and with different layouts of the associated information, which can be very confusing.

The key databases do cross-reference each other so it is possible to navigate between them. From the US side (NCBI), the Entrez system provides their unifying interface. From the European side (EMBL/EBI) it is SRS that provides the glue.

Nevertheless, your research problem may be driven by information in a published paper, or may make use of commercial materials, that reference accessions in either the US or the European databases. In that case, the corresponding server is the more appropriate one to use.


USA: National Center for Biotechnology Information (NCBI)

Connect to NCBI Blast server

This offers various types of BLAST service:

  • General searches of different types of database (protein, nucleotide using BLAST2, PSI-BLAST).
  • Special searches of whole genomes.
  • Other specialised searches.

Select

  protein blast   Search protein database using a protein query
                  Algorithms: blastp, psi-blast, phi-blast

and bookmark it.


Europe: EMBL European Bioinformatics Institute (EBI)

Connect to EBI Blast server

This offers a choice of:

  • General BLAST programs, including:
    • NCBI-BLAST2 protein and nucleotide search (same as at NCBI).
    • WU-BLAST2 protein and nucleotide search (alternative program).
  • Specialised BLAST services:
    • various specialised databases.
    • advanced BLAST tools (PSI-BLAST, ...).

Exercise 1

The effect of changing search database on scores and E-values.

Use this query (an uncharacterised protein from Methanocaldococcus jannaschii, an archaeon):

>gi|2501594|sp|Q57997|Y577_METJA PROTEIN MJ0577
MSVMYKKILYPTDFSETAEIALKHVKAFKTLKAEEVILLHVIDEREIKKRDIFSLLLGVAGLNKSVEEFE
NELKNKLTEEAKNKMENIKKELEDVGFKVKDIIVVGIPHEEIVKIAEDEGVDIIIMGSHGKTNLKEILLG
SVTENVIKKSNKPVLVVKRKNS

Q1 Look at this sequence. Roughly how long is it?
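
Counting residues by eye is error-prone for longer queries. A minimal Python sketch for counting the residues in a pasted FASTA record is given here; the function name fasta_length and the toy record are illustrative only and are not part of the exercise.

```python
def fasta_length(fasta_text: str) -> int:
    """Return the number of residues in a single-record FASTA string."""
    residues = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and the ">" header line
        residues += len(line)
    return residues


# Toy record for illustration only (not a real protein).
toy = """>toy example
MKTAYIAKQR
QISFVKSHFS
"""
print(fasta_length(toy))  # 20
```

Paste the query from this exercise in place of the toy record to check your estimate.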

1. Access the NCBI BLAST2 server and paste in this sequence.

  • Select database "nr".
  • Accept all defaults, but make a note of the matrix (BLOSUM62).
  • Search.
  • Find the number of sequences and letters in the search database (try "Search summary").
  • Find the hit "sp|Q57951|Y531_METJA" (search in browser for "Y531" until you find the alignment).
  • Write down the score, bit score, E-value.

2. Repeat using database "swissprot"

Q2 For the hit Y531_METJA, what happens to the score? the bit score? the E-value?

Q3 Whatever you decided about the effect on scores, can you explain this?

Q4 Why does E-value change?

Answers


Exercise 2

The effect of changing scoring scheme (matrix) on scores and E-values.

No need to run these tests, just look at the table below.

This was made using the same search as in Exercise 1, by searching with query Y577_METJA and examining hit Y531_METJA.

nr; Number of letters 2,558,340,887; Number of sequences 7,415,798

Score = 81.2 bits (184),  Expect = 1e-13, PAM30    (least remote)
Score = 79.9 bits (182),  Expect = 2e-13, PAM70
Score = 87.9 bits (195),  Expect = 7e-16, BLOSUM80
Score = 85.1 bits (209),  Expect = 2e-15, BLOSUM62 (default)
Score = 87.8 bits (284),  Expect = 3e-16, BLOSUM45 (most remote)

swissprot; Number of letters 132,353,686; Number of sequences 352,264

Score = 81.2 bits (184),  Expect = 6e-15, PAM30    (least remote)
Score = 79.9 bits (182),  Expect = 1e-14, PAM70
Score = 87.9 bits (195),  Expect = 4e-17, BLOSUM80
Score = 85.1 bits (209),  Expect = 1e-16, BLOSUM62 (default)
Score = 87.8 bits (284),  Expect = 1e-17, BLOSUM45 (most remote)

Q1 Are scores still constant between databases (as per Exercise 1), all else being equal?

Q2 What happens to the raw score as the matrix changes? By how much does it vary, in which direction?

Q3 And for bit score?

Q4 E-value?

Q5 What do you conclude about the stability of scores and E-values if you wanted to compare searches from (a) different databases, and (b) under different scoring schemes?

Answers

Note: the same kinds of observations hold if you change other parameters of the alignment scoring scheme, for example the gap opening or extension penalties.
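
The relationship between the raw score (the number in parentheses in the table) and the bit score can be sketched in Python as follows. It uses the standard Karlin-Altschul normalisation, bit score = (lambda * S - ln K) / ln 2. The lambda and K values below are assumptions: they are the values BLAST typically reports for gapped BLOSUM62 with the default gap penalties (11, 1). Every other combination of matrix and gap penalties has its own pair.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw alignment score to a bit score (Karlin-Altschul)."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Assumed statistical parameters for gapped BLOSUM62, gap penalties 11/1.
LAMBDA = 0.267
K = 0.041

# Raw score 209 is the BLOSUM62 row of the table above.
print(round(bit_score(209, LAMBDA, K), 1))  # 85.1
```

Because lambda and K are fitted separately for each scoring scheme, raw scores obtained with different matrices are not on a common scale, whereas bit scores are.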


Exercise 3

Effect of changing scoring scheme on E-value cutoff.

Use this query (an acyltransferase from Neorickettsia sennetsu, a Eubacterium):

>UNIPROT:Q2GE02 1-acyl-sn-glycerol-3-phosphate acyltransferase family protei...
MNKSMRNNGFFAAAAYVTMNVLFNVSVVAWTTLYGLVVLPALFLPPEDVIEVRRVWIRGV
FVLLRFFFNVEFEIRGSENIHLHKQFIVASKHHSPLDVILLADLFQKPAFVLKRSLIFIP
IFGLYMIATKMITVSYSGKSNGLDVLRKMVRQAKVLSKNRTIILYPEGGRTKVGEEVGYK
RGILALYKHLNLPVLPIALNVGQIWPVGYFSNKRSGKAVVQILPPIMPGLKDDEFLQVLR
DAIETNTKHLLSSENKHLLTLQKN 

Q1 Look at this sequence. Roughly how long is it?

1. Go to the NCBI BLAST2 server page again.

  • Select database "swissprot"
  • Accept all defaults, but make a note of the matrix (BLOSUM62).
  • Search.
  • Write down how many hits with E-value < 0.0001 (1e-04).
  • Which is the last hit with E-value < 0.0001? Call this the threshold hit.
  • Write down the score, bit score, E-value for the {first, threshold, last} hits.
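
If you download the results as a tab-separated hit table, the counting in the steps above can be sketched in Python. This assumes the standard 12-column BLAST tabular layout, in which the subject identifier is column 2, the E-value column 11 and the bit score column 12. The sample rows are invented purely for illustration and are not real search results.

```python
def hits_below(tabular_text: str, max_evalue: float = 1e-4):
    """Return (subject, evalue, bit_score) for hits with E-value below the cutoff."""
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject = fields[1]
        evalue = float(fields[10])
        bits = float(fields[11])
        if evalue < max_evalue:
            hits.append((subject, evalue, bits))
    return hits


# Hypothetical rows (made-up identifiers and numbers), one list per hit.
rows = [
    ["query", "hitA", "45.0", "200", "110", "3", "1", "200", "5", "204", "1e-50", "180"],
    ["query", "hitB", "30.0", "180", "126", "5", "10", "189", "12", "191", "5e-05", "48.1"],
    ["query", "hitC", "25.0", "90", "67", "2", "50", "139", "40", "129", "0.02", "35.0"],
]
sample = "\n".join("\t".join(row) for row in rows)

significant = hits_below(sample)
print(len(significant))    # 2
print(significant[-1][0])  # hitB, the threshold hit in this toy table
```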

The query is a 1-acyl-sn-glycerol-3-phosphate acyltransferase.

Q2 Roughly how many hits are there? (We took the default limit of 100).

Q3 Looking at the descriptions, do the hits support its being an acyltransferase?

Q4 Is our cutoff E < 0.0001 reasonable, i.e., are there more proteins with the same apparent function but higher E-values? If so, how high?

Answers

2. Repeat using a new substitution matrix: PAM30

Q5 How many hits overall?

Q6 Are there more acyltransferases with E above 0.0001? What is the highest such E-value?

Q7 What is the PAM30 matrix? (from the slide presentation)

Q8 How does use of a more conservative matrix affect matches?

Answers

3. Repeat with BLOSUM45 (less conservative) and answer the questions as usual.

Answers

Note: did you notice that as you selected a new substitution matrix, the gap penalties changed as well?


Overall conclusions from Exercises 1, 2 and 3

Changing the database can drastically change E-values.
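
The dependence of the E-value on database size can be sketched in Python using the simplified relation E = m * n * 2^(-bit score), where m is the query length and n the number of letters in the database. This is an approximation: BLAST itself uses corrected "effective" lengths, so the absolute values will not match the reported ones exactly. The query length used below is an arbitrary placeholder, since the ratio between the two databases does not depend on it.

```python
def expect_value(bits: float, query_len: int, db_letters: int) -> float:
    """Simplified E-value: search space times 2^(-bit score)."""
    return query_len * db_letters * 2.0 ** (-bits)


# Database sizes quoted in Exercise 2.
NR_LETTERS = 2_558_340_887
SWISSPROT_LETTERS = 132_353_686

QUERY_LEN = 100  # arbitrary placeholder; cancels out in the ratio
BITS = 85.1      # BLOSUM62 bit score from the Exercise 2 table

e_nr = expect_value(BITS, QUERY_LEN, NR_LETTERS)
e_sp = expect_value(BITS, QUERY_LEN, SWISSPROT_LETTERS)
print(round(e_nr / e_sp, 1))  # 19.3
```

For the same bit score, the E-value scales with the size of the database. This is consistent with the roughly 20-fold difference between the nr and swissprot E-values in the Exercise 2 table.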

Changing the scoring scheme drastically changes the E-value and the scores.

Bit scores are the least variable quantity and therefore the most comparable across different searches.

More conservative matrices (PAM30, BLOSUM80) find fewer hits and are less likely to produce false positives. Less conservative matrices (PAM250, BLOSUM45) find more hits at the potential expense of more false positives, although the latter point was not demonstrated here.

Different matrices should be used with different gap penalties, but these are usually set up for you by the system.


Exercise 4

The importance of low complexity filtering.

Use this query (Drosophila scribbler, a putative transcription factor):

>gi|24654891|ref|NP_524678.2| scribbler CG5580-PA, isoform A [Drosophila melanogaster]
MEKKLAKVALNSSNVATESNIKNNKLLANLSAAGAATATTATTGTATTTTATTANQALNFNNKTKATANA
TAASANNRHNNNNNSSAIKKHTNTKQLAGKSPACNSSSSSLSSSSSSNSSESKDTNFEYEDEWNIGGIPE
LLDDLDADIEKSAHSSGGGNQATALNAKQANSSSTSSSSSSKGGASSSSSSAAATGSSSSHKSHKTTLHS
NLSATSPTTIKFTRQPVAIGGANSSSSSSAAAAPSGGSANVVAKGSSSSSSSTSSSSSGKHHHHHHHHSN
SSSSGSSSKGYKSALVAQLNSPSPLNSNSKSLSGSGSGSGNTNGAAGAGAGSTLSSSTFAGFSKGGSLVS
SSAGAAAALAAGSGQQGSKFSAGGMSSQTGSGSGGNNTSNSNNSSSGSGGSGSGSSTGNTASGSGNNNST
SAGGPPSSQGGNNGNGSGSSSSSSSGKSSAKMSIDHQATLDKGLKMKIKRTKPGTKSSEAKHEIVKATDQ
QQNGALGAGSNNSANEDGSSGSSSTNASSLGSTNSSSSASSGSSSSSGSSSSSSKKHLNNASSGSGSSSS
GGGSQNNASGHASGGGSSGSSQSTPQGTKRGSSGHRREKTKDKNAHSNRMSVDKSAAAASAAGEKDTPEK
CSGTGAGGSPCSCNGDVGAPCSHHACIRRAAHMSNSAGNANSSAGTGQSGGSSSMSAVPPGVFTPSAGSP
STGSPSTVVPAAASLLAATGAASSSASQMASSSAGGVGGSGGGANAPGPPGKESAGSIKISSHIAAQLAA
AAASNSYSGSGANTNQGQNSNAGGNGGSESKAAAAAQAKLMAPGMISATMHHTISVPAGTGTGDDDTKSP
PAKRVKHEAGASGAGGGKEMVDICIGTSVGTITEPDCLGPCEPGTSVTLEGIVWHETEGGVLVVNVTWRG
KTYVGTLLDCTRHDWAPPRFCDSPTEELDSRTPKGRGKRGRSAGLTPDLSNFTETRSSIYFSHAQVHSKL
RNGATKGRGGATPSTSPTAFLPPRPEKRKSKDEAPSPLNGDASDGASVGGIGGAGGVNMVNASGIPISAS
GGGLATQPQSLLNPVTGLNVQISTKKCKTASPCAISPVLLECPEQDCSKKYKHANGLRYHQSHAHGAGGG
ASSMDEDSMQAPEDPATPPSPGVASGTGSGASVASSAVPATAPSAGQGTVAVSPNTPLANSSNPVTNGNV
APSAPATGSVTIAAPNTTPSVVETQAPLTGPPPVTPPAPTPICAVATPGAEQSVSSVLPLGNLPLTAGPN
SATQQQQPPTQQQQPQLLVPGGSAASLQQQQQQQQPVAGGSITAGISGQALSQHQQQLMGGLPAMLSDQQ
QQALLQQGALKAGVLRFGPPDGNPLQQQPGQASVNPQTQQSPPRPPSHVQDQQTPSAYAQQAGLKTSPGF
GSVGVGAASSKQKKNRKSPGPSDFEGRVSREDVQSPAYSDISDDSTPVAEQEMLDKSVGQAVTAKHIELM
GKKPTEVGVGVPPPPAPNMYVPGMYQFYPAQQQSAPPPQQQQQQPQYMVQTEPGKPPGLPPALTQAQQQQ
QLQPGAPPPTSQPPSHLLGPPGQQSVAAHLADYSGKNKDPPLDLMTKPQPQPGQPPSQQQQSGQLSGQEN
NGKDVGPPTSQPGSQPPPVNLSAVAGPPPGSLPPGLGGLSALGAAGLGGPGPGKGMPHFYPFNFIPPAYP
YNVDPNFGSVSIVASEEAAKLSGHPGLPPSSQAQQLSGISIKEERLKESPSPHDQPKHMPSQQQMIASKL
IKQEPMTKQEIKQEPNSNPGQQHPPPQQQPAPQPQQQQPPPPQPQQPHALHPKDLQALGAYPAIYQRHSI
NLAVQQAREEELRRYYMFTGRQNSAAAAAAAAAQNAASGGLPPHPGMMHKDEPGMGSAAQQQQQQQQQQM
QIAQQQQQAIQQHHQHLQQQHQAQQQQQQQQHQQQQQQQQQQQQQQQQQQQQQKLKQSQAASAGANNKAT
NLTKDSPKQKGGDDDQPLKVKQEGQKPTMETQGPPPPPTSQYFLHPSYISPTPFGFDPNHPMYRNVLMSA
AGPYNTAPYHLPIPRPYHAPEDLSRNTGTKALDALHHAASQYYTTHKIHELSERALKSPTSGSGPVKVSV
SSPSIGPPQPGGPTSSGPGSGPVSGVLGPGSGSNQQPGSAPGSAGGVPLNLQPPPGGMGPSPGSKPDLSG
PKGHGGVTPGSSLDGHKQSMPGGPPPNGPSGNGAVGGVGGAAGNGAAGGGGAGAADSRSPPPQRHVHTHH
HTHVGLGYPMYPAPYGAAVLASQQAAAVAVINPFPPGPSK

Q1 Look at the scribbler protein sequence. How does it differ from the other queries we have used?

  • Access the BLAST2 data entry form.
  • Select database "nr".
  • Expand "Algorithm parameters", set "Max target sequences" = 10000 and "Expect threshold" = 100. Make sure you have BLOSUM62 set.
  • Search.

Q2 What function do many of the hits suggest?

Q3 If you look at the alignments, do you see anything odd?

  • In the "Algorithm parameters" section, tick the "Low complexity regions" box. Click on the '?' icon to the right of the option to get help on what this means.
  • Rerun the search.

Q4 Was it faster?

Q5 Are there more or fewer results?

Q6 Looking closely at the alignments, do you see anything odd?

Q7 What happens to the bit scores and E-values when filtering is enabled?

Q8 For our earlier searches in Exercises 1 and 2, what would be the effect of enabling complexity filtering on the scores and E-values?

Answers

Keep your last search results (nr with low complexity) for the next Exercise.

Conclusions

Many sequences contain regions of low complexity that may randomly match other such regions in the sequence database.

Generally we don't want these and the search will produce quicker, cleaner results if low complexity filtering is enabled.

Low complexity filtering has no effect on searches when the query contains no low complexity regions.
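
To get a feel for what "low complexity" means, here is a minimal Python sketch that flags windows of a sequence with low Shannon entropy. This is an illustration only and is not the filtering algorithm used by the BLAST server. The window size, the threshold and the toy sequence are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a window."""
    total = len(window)
    counts = Counter(window)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def low_complexity_windows(seq: str, size: int = 12, threshold: float = 2.2):
    """Return (start, window) pairs whose entropy falls below the threshold."""
    flagged = []
    for start in range(len(seq) - size + 1):
        window = seq[start:start + size]
        if shannon_entropy(window) < threshold:
            flagged.append((start, window))
    return flagged


# Toy sequence: a mixed-composition stretch followed by a run of serines.
toy_seq = "MSVMYKKILYPT" + "S" * 12
for start, window in low_complexity_windows(toy_seq):
    print(start, window)
```

A window made of a single repeated residue has entropy 0, whereas a window in which every residue differs has the maximum entropy for that window size.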


Exercise 5

Reciprocal searches.

Scribbler is a fly zinc finger protein, probably therefore a transcription factor. Its zinc finger domain lies centrally.

  • Look at the nr with low complexity filtering result of Exercise 4.
  • Look at the graphical picture of the alignments against the query: the majority of more remote hits lie against a narrow central portion containing the zinc finger region.
  • Near the bottom of the "Descriptions" section you will see an unnamed mouse gene "dbj|BAC36234.1" with a high (insignificant?) E-value around 2.5.
  • Jump down to the alignment for this protein (click on its score).

Q1 What is the sequence range along the query? Does this correspond to the known zinc finger region?

We will try to see if this is also a zinc finger protein.

  • Open a new window ready for a BLAST search and paste "dbj|BAC36234.1" into the query sequence box. The NCBI server will recognise the accession number and fetch the sequence automatically.
  • Select the search database as nr, BLOSUM62, "Max target sequences" = 10000 and "Expect threshold" = 100, all as before.
  • Search.

Q2 What does the graphical alignment look like now?

Q3 What are many of the best hits?

Q4 Was the original scribbler protein (accession: NP_524678.2) found in this reciprocal search? If so, what was the E-value?

  • Select the search database as nr, BLOSUM62, "Max target sequences" = 10000 and "Expect threshold" = 100, all as before, then run the search.

Q5 What does the graphical alignment look like now?

Q6 What are many of the best hits?

Q7 Was the original scribbler protein (accession: NP_524678.2) found in this reciprocal search? If so, what was the E-value?

Q8 Jump to the corresponding alignment. What region of the scribbler gene was matched? And what region of the mouse gene?

Q9 With reference to what you know about the location of the scribbler zinc finger and to the initial scribbler search, have we matched the same C-terminal region of scribbler?

Q10 What about the mouse gene; what region was matched?

Answers

Tentative conclusion

There seems to be evidence that this unassigned mouse gene is related to scribbler by similarity with the C-terminal region, and that it is also a zinc finger protein, by similarity with other mammalian proteins that contain zinc fingers. However, BLAST is unable to demonstrate any putative homology between the fly and mouse zinc finger regions directly. Of course this is just the start of the story.


What might we do next?

There are other mouse genes reported in the scribbler search - perhaps this unassigned mouse gene is more like one of them: we might be able to bridge between proteins.

We could try a more sensitive search using a more divergent substitution matrix, such as BLOSUM45.

We could use a more sensitive search tool altogether: PSI-BLAST is available from the same website and uses the same search databases.


Learning objectives

  • Understand the concept of local alignment as it applies to BLAST.
  • Be aware of the different BLAST services.
  • Understand where BLAST alignment scores and E-values come from.
  • Know how (and why!) to change BLAST parameters.
  • Understand how to analyse BLAST output in the biological context especially with regard to low complexity filtering and reciprocal searches.
  • Further refine your BLAST search output and know how to export data.

END (phew...!)

Page last modified on December 03, 2008, at 02:35 AM CET