Sequence Database Similarity Searching
Nigel Brown and Aidan Budd
We will explore the functionality of the BLAST2 program in relation to sequence alignment concepts.
The easiest way to run BLAST is to use a web server. Two general-purpose public BLAST web servers are at NCBI and EBI, whose links are given below. We will only use the NCBI BLAST server in these exercises.
In real life you should consider using both tools as they offer different search databases and slightly different functionality. In fact, EBI also offers WU-BLAST2 as well as NCBI-BLAST2 for the same kinds of searches and you should try WU-BLAST2, as it may give different results.
Unfortunately, there is redundancy in the sequence databases: the same sequences appear in the US and European databases under different accession numbers and identifiers, and with different layouts of the associated information, which can be very confusing.
The key databases do cross-reference each other so it is possible to navigate between them. From the US side (NCBI), the Entrez system provides their unifying interface. From the European side (EMBL/EBI) it is SRS that provides the glue.
Nevertheless, your research problem may be driven by information in a published paper, or make use of commercial materials, that reference accessions in either the US or the European databases. In that case, the corresponding server is the more appropriate choice.
USA: National Center for Biotechnology Information (NCBI)
Connect to NCBI Blast server
This offers various types of blast service:
protein blast: search a protein database using a protein query (algorithms: blastp, psi-blast, phi-blast)
Choose protein blast and bookmark it.
Europe: EMBL European Bioinformatics Institute (EBI)
Connect to EBI Blast server
This offers a choice of:
The effect of changing search database on scores and E-values.
Use this query (an uncharacterised protein from Methanocaldococcus jannaschii, an archaeon):
>gi|2501594|sp|Q57997|Y577_METJA PROTEIN MJ0577
MSVMYKKILYPTDFSETAEIALKHVKAFKTLKAEEVILLHVIDEREIKKRDIFSLLLGVAGLNKSVEEFE
NELKNKLTEEAKNKMENIKKELEDVGFKVKDIIVVGIPHEEIVKIAEDEGVDIIIMGSHGKTNLKEILLG
SVTENVIKKSNKPVLVVKRKNS
Q1 Look at this sequence. Roughly how long is it?
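If you prefer to check your estimate for Q1 programmatically, the following is a minimal Python sketch that parses FASTA text and reports the number of residues. The helper name read_fasta is our own invention for illustration and is not part of any BLAST tool.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# The Exercise 1 query, one header line followed by the sequence lines.
query = """>gi|2501594|sp|Q57997|Y577_METJA PROTEIN MJ0577
MSVMYKKILYPTDFSETAEIALKHVKAFKTLKAEEVILLHVIDEREIKKRDIFSLLLGVAGLNKSVEEFE
NELKNKLTEEAKNKMENIKKELEDVGFKVKDIIVVGIPHEEIVKIAEDEGVDIIIMGSHGKTNLKEILLG
SVTENVIKKSNKPVLVVKRKNS
"""

for header, seq in read_fasta(query):
    # Print the identifier (first word of the header) and the length.
    print(header.split()[0], len(seq), "residues")
```

The same function can be reused for the longer queries in the later exercises.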
1. Access the NCBI BLAST2 server and paste in this sequence.
2. Repeat using database "swissprot"
Q2 For the hit Y531_METJA, what happens to the score? the bit score? the E-value?
Q3 Whatever you decided about the effect on scores, can you explain this?
Q4 Why does E-value change?
The effect of changing scoring scheme (matrix) on scores and E-values.
No need to run these tests, just look at the table below.
This was made using the same search as in Exercise 1, searching with query Y577_METJA and examining hit Y531_METJA.

nr; Number of letters 2,558,340,887; Number of sequences 7,415,798
  Score = 81.2 bits (184), Expect = 1e-13, PAM30 (least remote)
  Score = 79.9 bits (182), Expect = 2e-13, PAM70
  Score = 87.9 bits (195), Expect = 7e-16, BLOSUM80
  Score = 85.1 bits (209), Expect = 2e-15, BLOSUM62 (default)
  Score = 87.8 bits (284), Expect = 3e-16, BLOSUM45 (most remote)

swissprot; Number of letters 132,353,686; Number of sequences 352,264
  Score = 81.2 bits (184), Expect = 6e-15, PAM30 (least remote)
  Score = 79.9 bits (182), Expect = 1e-14, PAM70
  Score = 87.9 bits (195), Expect = 4e-17, BLOSUM80
  Score = 85.1 bits (209), Expect = 1e-16, BLOSUM62 (default)
  Score = 87.8 bits (284), Expect = 1e-17, BLOSUM45 (most remote)
Q1 Are scores still constant between databases (as per Exercise 1), all else being equal?
Q2 What happens to the raw score as the matrix changes? By how much does it vary, in which direction?
Q3 And for bit score?
Q5 What do you conclude about the stability of scores and E-values if you wanted to compare searches from (a) different databases, and (b) under different scoring schemes?
Note: the same kinds of observations hold if you change other parameters of the alignment scoring scheme, for example the gap opening or extension penalties.
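To put numbers on Q2 and Q3, here is a small Python sketch that compares how much the raw scores and the bit scores move across the five matrices. The values are copied from the Exercise 2 table; the "relative spread" measure, (max - min) / min, is simply a crude choice of ours and not something BLAST reports.

```python
# (matrix, bit score, raw score) for hit Y531_METJA, copied from the table.
# The scores are identical for nr and swissprot, so one set is enough.
rows = [
    ("PAM30", 81.2, 184),
    ("PAM70", 79.9, 182),
    ("BLOSUM80", 87.9, 195),
    ("BLOSUM62", 85.1, 209),
    ("BLOSUM45", 87.8, 284),
]


def relative_spread(values):
    """Return (max - min) / min, a crude measure of variability."""
    lo, hi = min(values), max(values)
    return (hi - lo) / lo


bit_spread = relative_spread([bit for _, bit, _ in rows])
raw_spread = relative_spread([raw for _, _, raw in rows])
print(f"bit scores vary by {bit_spread:.0%}, raw scores by {raw_spread:.0%}")
```

The raw scores move far more than the bit scores, which is the pattern the questions are pointing at.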
Effect of changing scoring scheme on E-value cutoff.
Use this query (an acyltransferase from Neorickettsia sennetsu, a Eubacterium):
>UNIPROT:Q2GE02 1-acyl-sn-glycerol-3-phosphate acyltransferase family protei...
MNKSMRNNGFFAAAAYVTMNVLFNVSVVAWTTLYGLVVLPALFLPPEDVIEVRRVWIRGV
FVLLRFFFNVEFEIRGSENIHLHKQFIVASKHHSPLDVILLADLFQKPAFVLKRSLIFIP
IFGLYMIATKMITVSYSGKSNGLDVLRKMVRQAKVLSKNRTIILYPEGGRTKVGEEVGYK
RGILALYKHLNLPVLPIALNVGQIWPVGYFSNKRSGKAVVQILPPIMPGLKDDEFLQVLR
DAIETNTKHLLSSENKHLLTLQKN
Q1 Look at this sequence. Roughly how long is it?
1. Go to the NCBI BLAST2 server page again.
The query is a 1-acyl-sn-glycerol-3-phosphate acyltransferase.
Q2 Roughly how many hits are there? (We took the default limit of 100).
Q3 Looking at the descriptions, do the hits support its being an acyltransferase?
Q4 Is our cutoff E < 0.0001 reasonable, i.e., are there more proteins with the same apparent function but higher E-values? If so, how high?
2. Repeat using a new substitution matrix: PAM30
Q5 How many hits overall?
Q6 Are there more acyltransferases with E above 0.0001? What is the highest such E-value?
Q7 What is the PAM30 matrix? (from the slide presentation)
Q8 How does use of a more conservative matrix affect matches?
3. Repeat with BLOSUM45 (less conservative) and answer the questions as usual.
Note: did you notice that as you selected a new substitution matrix, the gap penalties changed as well?
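Counting hits by eye on either side of the cutoff gets tedious when you repeat the search with several matrices. The sketch below shows one way to tally them in Python once you have the descriptions and E-values in a list. The function name tally_hits is hypothetical, and the hit list is made up purely for illustration; it is not real BLAST output.

```python
def tally_hits(hits, keyword, cutoff=1e-4):
    """Split hits whose description mentions `keyword` by an E-value cutoff.

    `hits` is a list of (description, evalue) pairs.
    Returns (n_below_cutoff, n_at_or_above_cutoff, highest_evalue),
    where highest_evalue is None if no description matched.
    """
    matching = [e for desc, e in hits if keyword.lower() in desc.lower()]
    below = sum(1 for e in matching if e < cutoff)
    above = len(matching) - below
    highest = max(matching) if matching else None
    return below, above, highest


# Made-up hits for illustration only; these are NOT real search results.
toy_hits = [
    ("1-acyl-sn-glycerol-3-phosphate acyltransferase", 3e-40),
    ("phospholipid/glycerol acyltransferase", 2e-7),
    ("hypothetical protein", 5e-5),
    ("putative acyltransferase", 0.003),
    ("ABC transporter permease", 1.2),
]

below, above, highest = tally_hits(toy_hits, "acyltransferase")
print(f"{below} below cutoff, {above} above, highest E = {highest}")
```

A keyword match on the description line is only a rough proxy for function, so it is still worth reading the hit list yourself.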
Overall conclusions from Exercises 1, 2 and 3
Changing the database can drastically change E-values.
Changing the scoring scheme drastically changes the E-value and the scores.
Bit scores are the least variable quantity and therefore the most comparable across different searches.
More conservative matrices (PAM30, BLOSUM80) find fewer hits and are less likely to produce false positives. Less conservative matrices (PAM250, BLOSUM45) find more hits at the potential expense of more false positives, although the latter point was not demonstrated here.
Different matrices should be used with different gap penalties, but these are usually set up for you by the system.
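These conclusions follow from the standard Karlin-Altschul relations, in which the bit score is S' = (lambda * S - ln K) / ln 2 and the E-value is approximately E = m * n * 2^(-S'), with m the query length and n the number of letters in the database. The Python sketch below applies them to the Exercise 2 table. The values lambda = 0.267 and K = 0.041 are the commonly quoted parameters for gapped BLOSUM62 with gap costs 11/1 and are an assumption on our part, as is the use of raw (rather than effective) lengths, so the absolute E-values will not match BLAST's output exactly.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalise a raw alignment score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_len, db_letters):
    """Simplified E-value, E = m * n * 2^(-S'), ignoring edge corrections."""
    return query_len * db_letters * 2.0 ** (-bits)


# Assumed statistical parameters for gapped BLOSUM62 (gap open 11, extend 1).
LAMBDA, K = 0.267, 0.041

QUERY_LEN = 162                 # residues in the Y577_METJA query
NR_LETTERS = 2_558_340_887      # database sizes from the Exercise 2 table
SWISSPROT_LETTERS = 132_353_686

bits = bit_score(209, LAMBDA, K)  # raw BLOSUM62 score from the table
e_nr = expect_value(bits, QUERY_LEN, NR_LETTERS)
e_sp = expect_value(bits, QUERY_LEN, SWISSPROT_LETTERS)

print(f"bit score = {bits:.1f}")
print(f"E(nr) / E(swissprot) = {e_nr / e_sp:.1f}")
```

The bit score does not depend on the database at all, while the E-value scales in direct proportion to the database size, which is why the same hit has an E-value about twenty times larger in nr than in swissprot.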
The importance of low complexity filtering.
Use this query (Drosophila scribbler, a putative transcription factor):
>gi|24654891|ref|NP_524678.2| scribbler CG5580-PA, isoform A [Drosophila melanogaster] MEKKLAKVALNSSNVATESNIKNNKLLANLSAAGAATATTATTGTATTTTATTANQALNFNNKTKATANA TAASANNRHNNNNNSSAIKKHTNTKQLAGKSPACNSSSSSLSSSSSSNSSESKDTNFEYEDEWNIGGIPE LLDDLDADIEKSAHSSGGGNQATALNAKQANSSSTSSSSSSKGGASSSSSSAAATGSSSSHKSHKTTLHS NLSATSPTTIKFTRQPVAIGGANSSSSSSAAAAPSGGSANVVAKGSSSSSSSTSSSSSGKHHHHHHHHSN SSSSGSSSKGYKSALVAQLNSPSPLNSNSKSLSGSGSGSGNTNGAAGAGAGSTLSSSTFAGFSKGGSLVS SSAGAAAALAAGSGQQGSKFSAGGMSSQTGSGSGGNNTSNSNNSSSGSGGSGSGSSTGNTASGSGNNNST SAGGPPSSQGGNNGNGSGSSSSSSSGKSSAKMSIDHQATLDKGLKMKIKRTKPGTKSSEAKHEIVKATDQ QQNGALGAGSNNSANEDGSSGSSSTNASSLGSTNSSSSASSGSSSSSGSSSSSSKKHLNNASSGSGSSSS GGGSQNNASGHASGGGSSGSSQSTPQGTKRGSSGHRREKTKDKNAHSNRMSVDKSAAAASAAGEKDTPEK CSGTGAGGSPCSCNGDVGAPCSHHACIRRAAHMSNSAGNANSSAGTGQSGGSSSMSAVPPGVFTPSAGSP STGSPSTVVPAAASLLAATGAASSSASQMASSSAGGVGGSGGGANAPGPPGKESAGSIKISSHIAAQLAA AAASNSYSGSGANTNQGQNSNAGGNGGSESKAAAAAQAKLMAPGMISATMHHTISVPAGTGTGDDDTKSP PAKRVKHEAGASGAGGGKEMVDICIGTSVGTITEPDCLGPCEPGTSVTLEGIVWHETEGGVLVVNVTWRG KTYVGTLLDCTRHDWAPPRFCDSPTEELDSRTPKGRGKRGRSAGLTPDLSNFTETRSSIYFSHAQVHSKL RNGATKGRGGATPSTSPTAFLPPRPEKRKSKDEAPSPLNGDASDGASVGGIGGAGGVNMVNASGIPISAS GGGLATQPQSLLNPVTGLNVQISTKKCKTASPCAISPVLLECPEQDCSKKYKHANGLRYHQSHAHGAGGG ASSMDEDSMQAPEDPATPPSPGVASGTGSGASVASSAVPATAPSAGQGTVAVSPNTPLANSSNPVTNGNV APSAPATGSVTIAAPNTTPSVVETQAPLTGPPPVTPPAPTPICAVATPGAEQSVSSVLPLGNLPLTAGPN SATQQQQPPTQQQQPQLLVPGGSAASLQQQQQQQQPVAGGSITAGISGQALSQHQQQLMGGLPAMLSDQQ QQALLQQGALKAGVLRFGPPDGNPLQQQPGQASVNPQTQQSPPRPPSHVQDQQTPSAYAQQAGLKTSPGF GSVGVGAASSKQKKNRKSPGPSDFEGRVSREDVQSPAYSDISDDSTPVAEQEMLDKSVGQAVTAKHIELM GKKPTEVGVGVPPPPAPNMYVPGMYQFYPAQQQSAPPPQQQQQQPQYMVQTEPGKPPGLPPALTQAQQQQ QLQPGAPPPTSQPPSHLLGPPGQQSVAAHLADYSGKNKDPPLDLMTKPQPQPGQPPSQQQQSGQLSGQEN NGKDVGPPTSQPGSQPPPVNLSAVAGPPPGSLPPGLGGLSALGAAGLGGPGPGKGMPHFYPFNFIPPAYP YNVDPNFGSVSIVASEEAAKLSGHPGLPPSSQAQQLSGISIKEERLKESPSPHDQPKHMPSQQQMIASKL IKQEPMTKQEIKQEPNSNPGQQHPPPQQQPAPQPQQQQPPPPQPQQPHALHPKDLQALGAYPAIYQRHSI 
NLAVQQAREEELRRYYMFTGRQNSAAAAAAAAAQNAASGGLPPHPGMMHKDEPGMGSAAQQQQQQQQQQM QIAQQQQQAIQQHHQHLQQQHQAQQQQQQQQHQQQQQQQQQQQQQQQQQQQQQKLKQSQAASAGANNKAT NLTKDSPKQKGGDDDQPLKVKQEGQKPTMETQGPPPPPTSQYFLHPSYISPTPFGFDPNHPMYRNVLMSA AGPYNTAPYHLPIPRPYHAPEDLSRNTGTKALDALHHAASQYYTTHKIHELSERALKSPTSGSGPVKVSV SSPSIGPPQPGGPTSSGPGSGPVSGVLGPGSGSNQQPGSAPGSAGGVPLNLQPPPGGMGPSPGSKPDLSG PKGHGGVTPGSSLDGHKQSMPGGPPPNGPSGNGAVGGVGGAAGNGAAGGGGAGAADSRSPPPQRHVHTHH HTHVGLGYPMYPAPYGAAVLASQQAAAVAVINPFPPGPSK
Q1 Look at the scribbler protein sequence. How does it differ from the other queries we have used?
Q2 What function do many of the hits suggest?
Q3 If you look at the alignments, do you see anything odd?
Q4 Was it faster?
Q5 Are there more or fewer results?
Q6 Looking closely at the alignments, do you see anything odd?
Q7 What happens to the bit scores and E-values when filtering is enabled?
Q8 For our earlier searches in Exercises 1 and 2, what would be the effect of enabling complexity filtering on the scores and E-values?
Keep your last search results (nr with low complexity) for the next Exercise.
Many sequences contain regions of low complexity that may randomly match other such regions in the sequence database.
Generally we don't want these and the search will produce quicker, cleaner results if low complexity filtering is enabled.
Low complexity filtering has no effect on searches when the query contains no low complexity regions.
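To make the idea of "low complexity" concrete, the Python sketch below masks any window of the sequence whose Shannon entropy falls under a threshold. This is a simplified illustration only and not the filter the BLAST server actually applies; the window length of 12 and the threshold of 2.2 bits are assumed values chosen for the example, and the function names are our own.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace residues in every low-entropy window with 'X'."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


# A glutamine-rich stretch in the style of the scribbler query.
poly_q = "Q" * 21 + "KLKQSQAASAGANNKAT"
# The first 30 residues of the Exercise 1 query (MJ0577).
ordinary = "MSVMYKKILYPTDFSETAEIALKHVKAFKT"

print(mask_low_complexity(poly_q))
print(mask_low_complexity(ordinary))
```

The glutamine run is masked while the ordinary stretch passes through unchanged, which mirrors the point above that filtering only matters when the query contains low complexity regions.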
Scribbler is a fly zinc finger protein, probably therefore a transcription factor. Its zinc finger domain lies centrally.
Q1 What is the sequence range along the query? Does this correspond to the known zinc finger region?
We will try to see if this is also a zinc finger protein.
Q2 What does the graphical alignment look like now?
Q3 What are many of the best hits?
Q4 Was the original scribbler protein (accession: NP_524678.2) found in this reciprocal search? If so, what was the E-value?
Q5 What does the graphical alignment look like now?
Q6 What are many of the best hits?
Q7 Was the original scribbler protein (accession: NP_524678.2) found in this reciprocal search? If so, what was the E-value?
Q8 Jump to the corresponding alignment. What region of the scribbler gene was matched? And what region of the mouse gene?
Q9 With reference to what you know about the location of the scribbler zinc finger and to the initial scribbler search, have we matched the same C-terminal region of scribbler?
Q10 What about the mouse gene; what region was matched?
There seems to be evidence that this unassigned mouse gene is related to scribbler by similarity with the C-terminal region, and that it is also a zinc finger protein, by similarity with other mammalian proteins that contain zinc fingers. However, blast is unable to demonstrate any putative homology between the fly and mouse zinc finger regions directly. Of course this is just the start of the story.
What might we do next?
There are other mouse genes reported in the scribbler search - perhaps this unassigned mouse gene is more like one of them: we might be able to bridge between proteins.
We could try a more sensitive search using a more divergent substitution matrix, such as BLOSUM45.
We could use a more sensitive search tool altogether: PSI-BLAST is available from the same website and uses the same search databases.