Winter 03 Course
New Web Tools available from the Gibson Group
Toby Gibson and Group, Jan 26th-29th 2003
The aim of this practical is to get hands-on experience with five web servers that can help to investigate biological sequences. The ELM server is a new tool for investigating functional motifs in proteins. It currently has about 70 motif entries and will soon have more. GlobPlot is a server for exploring protein disorder and globularity - relevant to the ELM server since many functional motifs are in disordered parts of proteins. BLAST2SRS is a server for retrieving homologous proteins that is useful when you want some, but not all, and you are not sure in advance which... Switching to genes, the Gene2EST server uses ESTs to get an overview of gene structure. It allows you to submit gene-size queries (if need be >100,000 bp) to search the EST databases and provides both graphical and alignment output: in favourable cases - i.e. plentiful ESTs - Gene2EST can rapidly give a good description of a gene structure, including alternative splicing. Finally, we introduce a server that is not from our group! The Web of Science server from ISI allows you to peruse the Science Citation Index. As well as counting citations, this server is uniquely useful for searching "forwards in time" through publication abstracts.
Note of Caution: Most of these servers are still under development - they will change, but hopefully for the better!
Getting started
The teaching machines are INTEL PCs running the LINUX OS. It will take a few moments to get set up.
- Login with your EMBL name and password.
- Start netscape from the pullup icon at second lower left of the screen.
- Load this page into it. (It is a few clicks from EMBL's home page).
- Check that java, javascript and style sheets are enabled in the netscape preferences in the advanced options.
- Also in preferences, set the fonts to 14 or 18 point.
Exercise 1. Exploring multidomain proteins with Globplot
It can be important to discern nonglobular regions of proteins: they can have short functional sites e.g. histone tails (interesting) and they can interfere with protein crystallisation (bad). Globplot uses "coil preferences" for the amino acids to distinguish nonglobular and globular regions of multidomain proteins. It uses a simple graphical approach based on summing the parameters so that the slope of the graph indicates the nature of the sequence. A positive (rising) slope indicates a nonglobular preference and a negative (falling) slope a globular preference. Unlike sliding window algorithms, this approach is good for finding segments of any length in a sequence.
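The summing idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the propensity values in TOY_PROPENSITY are invented for the example and are not the Russell/Linding values, and the function names are hypothetical. The real server also smooths the curve, which this sketch does not.

```python
from itertools import accumulate

# Hypothetical per-residue "coil" propensities, invented for illustration.
# These are NOT the Russell/Linding values used by GlobPlot.
TOY_PROPENSITY = {
    "P": 0.5, "S": 0.3, "G": 0.3, "Q": 0.2, "E": 0.1, "K": 0.1,
    "A": -0.1, "L": -0.4, "I": -0.5, "V": -0.4, "F": -0.5, "W": -0.5,
}


def cumulative_curve(sequence, propensity=TOY_PROPENSITY, default=0.0):
    """Running sum of propensities; residues missing from the table score `default`."""
    return list(accumulate(propensity.get(res, default) for res in sequence.upper()))


def rising_segments(curve, min_length=5):
    """Return 1-based inclusive (start, end) ranges where the curve rises at every step."""
    segments = []
    start = None
    previous = 0.0
    for index, value in enumerate(curve, start=1):
        if value > previous:
            if start is None:
                start = index
        else:
            # The rising run ended at the previous residue.
            if start is not None and index - start >= min_length:
                segments.append((start, index - 1))
            start = None
        previous = value
    if start is not None and len(curve) - start + 1 >= min_length:
        segments.append((start, len(curve)))
    return segments


curve = cumulative_curve("PSPSPSLIVLIV")
print(rising_segments(curve))
```

Because the curve is a running sum, a stretch of any length with a consistent preference shows up as a consistent slope, with no window size to choose.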
Investigate several different classes of protein sequence with Globplot:
- Click using the middle mouse button to open GlobPlot in a new window.
- We will run it in default mode today:
- Note that there are parameters to affect the smoothness of the curve, switch off SMART etc.
- Click on Propensities to see the Russell/Linding amino acid "coil" propensity values
- GlobPlot draws a simple graph by summing these values along the sequence -
- Do you think it will work well?
- Type TPIS_HUMAN in the swissprot ID box and click GlobPlot NOW.
- Globplot can take a minute or two to run as it uses the SMART server too.
- Open another Globplot window and repeat with PRP1_HUMAN
- Now we will compare the plots.
- Questions:
- Are the slopes the same sign for both proteins?
- Which protein is nonglobular (positive slope)?
- Which protein is globular (negative slope)?
- Are there any peaks/troughs where the slope inverts?
- Could we use this to collect segments with a given conformational preference?
- What is the longest "putative unstructured segment" listed by Globplot?
- Open another Globplot window and repeat with CBP_HUMAN, a large chromatin protein
- Questions:
- The other proteins had a consistent slope to their graphs, either positive or negative:
- How many globular domains did SMART find?
- Do the globular domains span regions of positive, negative or both slopes?
- Is Globplot good at finding the known globular domains?
- Are there other parts of CBP that are not yet annotated as globular?
- How many strongly nonglobular segments are present in CBP?
- Are they likely to have any function beyond acting as spacers between globular domains?
- Now repeat with the well known P53 tumour suppressor, P53_HUMAN
- P53 has two folded domains:
- Core/DNA-binding domain ~95-292
- Tetramerization domain ~325-356
- Many short functional motifs are mapped in P53, here are some:
- Phosphorylation sites: S15, T18, S20, S315
- Acetylation sites: K320, K373, K382
- Sumoylation site: K386
- NLS: K304-K320
- PIN1 pSerPro isomerisation sites: SP33, SP46, TP81, SP315
- MDM2/TAFII31 FXXPhiPhi interaction site: 19-FSDLW-23
- Questions:
- Does Globplot show the two folded domains (negative slope)?
- Do the regions outside the folded domains have nonglobular, globular or intermediate propensities?
- Are the functional motifs in nonglobular segments (positive slope)?
Notes:
Globplot is quite good at distinguishing globular and nonglobular segments of polypeptide sequence. It should be useful for investigating multidomain proteins. Globplot is best used in conjunction with other tools e.g. SMART and PFAM domain databases, the ELM motif database, multiple sequence alignments and 2D prediction servers like PHD or JPRED.
Exercise 2. Searching for short functional sites with the ELM server
P53 is an example of a protein that has many small functional sites for modification and/or interaction with ligands. We term these "linear motifs" because they do not require 3D structure for function, needing only to be sufficiently accessible. Motif functions include ligand recognition, amino acid modification, signalling, cell compartment targeting, cleavage and so forth. There are probably fewer categories of motif than of globular domain, but there are probably more motif instances in a eukaryotic proteome. As part of a consortium, we have begun to collect these motifs and develop a new database, ELM. Currently we have about 70 patterns entered in the database. We are developing a web interface to allow sequences to be compared to the patterns. Motif prediction presents difficulties as matches are not statistically significant, so the user needs to think logically about which motifs/domains are incompatible with each other. Part of the ELM project involves developing filters to reduce the number of false positive matches.
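To see why short patterns give many matches, here is a minimal Python sketch of pattern scanning. The two patterns and the toy sequence are assumptions written for this example; they are not copied from the ELM database, and scan_motifs is a hypothetical helper, not part of the ELM server.

```python
import re

# Illustrative patterns only: assumptions made for this sketch,
# not the regular expressions stored in ELM.
TOY_PATTERNS = {
    "proline-directed S/T site": r"[ST]P",
    "FXXPhiPhi-like site": r"F..[ILMVFW][ILMVFW]",
}


def scan_motifs(sequence, patterns=TOY_PATTERNS):
    """Return sorted (start, end, name, matched_text) tuples, 1-based inclusive."""
    hits = []
    for name, pattern in patterns.items():
        # Wrapping the pattern in a lookahead reports overlapping matches too.
        for match in re.finditer(f"(?=({pattern}))", sequence):
            start = match.start(1) + 1
            end = start + len(match.group(1)) - 1
            hits.append((start, end, name, match.group(1)))
    return sorted(hits)


# Made-up test sequence, not a real protein.
for hit in scan_motifs("AAETFSDLWKLLSPEA"):
    print(hit)
```

A two-residue pattern such as [ST]P will match by chance many times in any long sequence, which is why context (accessibility, compartment, domain boundaries) matters more than the match itself.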
Note: The ELM Server is still under development!
Looking for targeting signals in proteins by comparing sequences to a set of protein motifs:
- Open a new navigator window and load this page in it.
- Click with the middle button to open the test ELMserver query page in a new window.
- Get the "unknown sequences page" and cut and paste "sequence1" into the Sequence box.
- Cellular location is "not specified" and species is Homo sapiens.
- Click on the Submit Button.
- The search should take about 1-2 minutes unless the ELM and SMART servers are busy.
- Look for matching motifs that might target a protein to a cell compartment or membrane.
- Answer the questions below.
- Do the same for the other protein sequences in new windows too.
Questions
- Has any motif been found that could result in the protein being targeted within the cell?
- To where is the protein being targeted?
- Is the targeting motif N-, C-terminal or somewhere in the middle?
- Would this location eliminate any of the other ELM matches?
- (To the extent that you are aware of the biology!)
- Why are some motifs being filtered out?
- Do you think this is reliable?
- Might there be better ways of doing domain filtering?
- Are compartment targeting signals always N- or C-terminal?
Looking for motifs in the nuclear protein P53:
- Open the test ELMserver query page.
- Click with middle button on P53 and cut and paste the sequence into the Sequence box.
- Specify cellular location as "nucleus" and species as Homo sapiens.
- Click on the Submit Button.
- The search should take about 1-2 minutes unless the ELM and SMART servers are busy.
- Answer the questions below.
Questions
- How many motifs are found after globular domain filtering?
- Is this filter working well?
- Find reported motifs that are in the nonglobular segments, using the P53 globular domain ranges given above.
- Do any of these motifs correspond to the list given above?
- Are there any other plausible motifs?
Notes:
These exercises show the utility - but also the limitations - of using regular expression patterns to search for small motifs. The filters can help a bit, yet run the risk of excluding some genuine matches, as in the case of P53 where the PFAM entry violates the globular domain boundaries. ELM results should not be used uncritically but be put into context. For example, if a protein is known to be phosphorylated, then the phosphorylation motifs, which are mostly badly overpredicted, become pointers to experimentally determine which sites are modified. Multiple sequence alignments are useful - indeed essential - in conjunction with ELM too, since conserved functional sites will show up as conserved sequence blocks in the alignment.
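Checking motif positions against domain ranges by hand is tedious, so here is a minimal Python sketch of a globular domain filter. The ranges and positions are the approximate ones listed for P53 in Exercise 1; the function name and the way sites are stored (by the residue number given in the list) are assumptions of this sketch and say nothing about how the ELM server implements its own filter.

```python
# Approximate folded-domain ranges for P53, as listed in Exercise 1.
P53_DOMAINS = {
    "core/DNA-binding": (95, 292),
    "tetramerization": (325, 356),
}

# Mapped sites from Exercise 1, stored as 1-based inclusive ranges.
# Single sites use the residue number given in the list.
P53_MOTIFS = {
    "phospho S15": (15, 15), "phospho T18": (18, 18),
    "phospho S20": (20, 20), "phospho S315": (315, 315),
    "acetyl K320": (320, 320), "acetyl K373": (373, 373),
    "acetyl K382": (382, 382), "sumo K386": (386, 386),
    "NLS": (304, 320),
    "PIN1 SP33": (33, 33), "PIN1 SP46": (46, 46),
    "PIN1 TP81": (81, 81), "PIN1 SP315": (315, 315),
    "MDM2 FXXPhiPhi": (19, 23),
}


def containing_domain(motif_range, domains):
    """Return the name of the first domain that overlaps the motif range, else None."""
    m_start, m_end = motif_range
    for name, (d_start, d_end) in domains.items():
        if m_start <= d_end and d_start <= m_end:
            return name
    return None


for motif, motif_range in P53_MOTIFS.items():
    print(motif, motif_range, containing_domain(motif_range, P53_DOMAINS))
```

The same function can be reused for motifs reported by the server: a match that overlaps a folded domain is a candidate for filtering, but as noted above the domain boundaries themselves may be wrong.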
Exercise 3. Retrieving subsets of homologous protein sequences using BLAST2SRS
Lots of BLAST servers allow the user to retrieve sequences using links from the output. However, this can be frustrating because often only some sequences are of interest. For example you may only be interested in eukaryotic sequences or only in a branch of a large multigene family. It is hard to collect only the sequences you want without wasting time looking at entries you are not interested in. BLAST2SRS is a server for efficiently collecting protein sequences from SWISSPROT and SPTREMBL databases. It achieves this by interfacing BLAST with SRS, the Sequence Retrieval System.
Using BLAST2SRS to collect subsets of TFIIB-family sequences
- Click with the middle button to load the BLAST2SRS page.
- Click with middle button to get the TF2B_human sequence and cut and paste into the query box.
- Select the SWISSALL database (SWISSPROT and SPTREMBL).
- Click the Run Blast button. The server may take a couple of minutes to provide output.
- Examine the BLAST2SRS output. Note:
- A list of species found.
- Buttons to modify and reset selections
- Keyword box and links to SRS
- Useful information in the table of hits (species gene name) that other BLAST servers ignore.
- Questions:
- Why are fragments flagged?
- Is the E-value cut-off appropriate for these sequences?
- Collect the FASTA sequences:
- What is the name composed of? Is this useful for SPTREMBL entries?
- Use SRS to collect sequence subsets with this list of keywords:
- archaea
- bacteria
- eukaryota
- archaea ! pyrococcus
- metazoa
- eukaryota & BRF1
- Questions
- Are there archaeal and bacterial TFIIB homologues?
- Are there always the same number of TFIIB proteins per species?
- What is the function of BRF1?
- What effect do ! and & have?
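The keyword combinations can be pictured as set operations. The Python sketch below assumes that SRS reads & as "and" and ! as "but not"; the hit table is entirely hypothetical and the entry names are placeholders, not real database entries.

```python
# Hypothetical hit table: entry name -> keywords taken from its annotation.
HITS = {
    "hit_A": {"eukaryota", "metazoa"},
    "hit_B": {"eukaryota", "metazoa", "BRF1"},
    "hit_C": {"eukaryota", "BRF1"},
    "hit_D": {"archaea", "pyrococcus"},
    "hit_E": {"archaea"},
    "hit_F": {"bacteria"},
}


def with_keyword(keyword, hits=HITS):
    """Names of all hits annotated with the given keyword."""
    return {name for name, keywords in hits.items() if keyword in keywords}


# "archaea ! pyrococcus": archaeal hits, but not those from Pyrococcus.
archaea_not_pyrococcus = with_keyword("archaea") - with_keyword("pyrococcus")

# "eukaryota & BRF1": hits that carry both keywords.
eukaryota_and_brf1 = with_keyword("eukaryota") & with_keyword("BRF1")

print(sorted(archaea_not_pyrococcus))
print(sorted(eukaryota_and_brf1))
```

Python's set difference (-) and intersection (&) stand in here for the two operators; compare the counts you get from the server with what this picture predicts.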
Notes:
Probably you have found yourself doing a BLAST search many times with the same query? Sometimes you want to collect a few new or different sequences to add to your collection. Or perhaps you are only interested in getting some chordate sequences with an invertebrate outgroup for phylogenetic analysis. This is the sort of job that BLAST2SRS is designed for. Note that the NCBI BLAST site has begun to offer similar custom retrieval tools: although in general it is much more sophisticated in many ways than BLAST2SRS it is still not quite as powerful for this particular job.
Exercise 4. Using ESTs to reveal gene structure with Gene2EST
The BLAST2 program is well suited to EST detection. Because it tolerates small gaps in alignments, it deals well with sequencing errors in the ESTs. However it is painfully time-consuming to work through the output, especially if there are many 100s of matches. Furthermore, most BLAST2 servers will not allow the user to submit large query sequences. We have tried to address these deficiencies with Gene2EST. The Gene2EST server presents BLAST2-derived output in a way that allows the user to rapidly evaluate the results. A graphical display (viewed with Artemis) provides an overview of the results, while an alignment of the ESTs on the query sequence lets the user follow up the exons in detail.
Getting the Query:
- In a Netscape window, go to SRS and click Start.
- Click the EMBL Box and then click Continue.
- Type HSAK1 in an ID box and click Do Query.
- Click on the EMBL:HSAK1 entry. This entry contains a human adenylate kinase gene AK1.
- Click on the Save button.
- Set Use view to FastaSeqs then click SAVE.
- The sequence is now in a suitable format. (If you want, you can save it in a file as AK1.SEQ).
Using Gene2EST:
- Open a new navigator window and load this page into it.
- Load the Gene2EST query submission page.
- Cut and paste the query sequence into the sequence box and submit the job.
- The search will take a few minutes. Bookmark the page to make sure you can collect the output.
- Note that Repeatmasker runs first to mask out dispersed repeats like Alus.
- It will report these while BLAST is running: Did it find any?
- Examine the BLAST output.
- How many matching ESTs are there?
- Ought we to allow more top hits if we ran the search again?
- How does % identity vary in the hits?
- Are any of the matches spread over multiple HSPs?
Examining the alignment output:
- Click on the Alignment button.
- After a minute or two, the alignment of high identity matches will be provided in a new window.
- To make things quicker, note that the gene is in the usual 5' - 3' orientation.
- Has the 5' exon been found? Is the upstream TATA box TATAAAA or TATAAAT?
- Are the exons bounded by GT and AG splice consensi?
- Has the 3' exon been found? Is the poly-A signal AATAAA or ACTAAA? Is just one poly-A site used?
- Why are there more 3' ESTs than 5' and internal ESTs?
- Are internal exons bordered by plausible splice sites with GT and AG consensi?
- How many exons have been found: six, seven or eight?
- Are there any "funny" matches that don't make good sense?
- Is it possible that these are not true "Expressed Sequence Tags"?
- Could there be some ESTs primed from genomic DNA at runs of As?
- Are there any Retroelements like MIR, L2 or ALUs masked out in the introns?
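The GT/AG check on intron boundaries can also be done programmatically once exon coordinates are known. The Python sketch below uses a made-up 24 bp sequence and made-up exon coordinates, not the AK1 gene; the function names are hypothetical.

```python
def intron_dinucleotides(genomic, exons):
    """For each gap between consecutive exons, return
    (intron_start, intron_end, first_two_bases, last_two_bases).

    `exons` is a sorted list of 1-based inclusive (start, end) pairs
    on the forward strand of `genomic`.
    """
    results = []
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        # 1-based intron left_end+1 .. right_start-1 as a 0-based slice.
        intron = genomic[left_end:right_start - 1].upper()
        results.append((left_end + 1, right_start - 1, intron[:2], intron[-2:]))
    return results


def is_canonical(donor, acceptor):
    """True for the common GT...AG intron boundary."""
    return donor == "GT" and acceptor == "AG"


# Toy example: two 6 bp exons separated by a 12 bp intron (invented data).
TOY_GENOMIC = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"
TOY_EXONS = [(1, 6), (19, 24)]

for start, end, donor, acceptor in intron_dinucleotides(TOY_GENOMIC, TOY_EXONS):
    print(start, end, donor, acceptor, is_canonical(donor, acceptor))
```

An EST match whose implied intron does not start with GT and end with AG is worth a second look: the alignment boundary may simply be off by a base or two, or the match may be one of the "funny" ones.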
Examining the Graphical output with Artemis:
- Click on the EMBL_Entry button. The sequence with ESTs annotated as an EMBL format feature table will be provided in a new window.
- Note how the ESTs are annotated as mRNA features:
- why are there join statements in some ESTs?
- Save to file as e.g. ak1.embl.
- Type art ak1.embl & to start Artemis with the AK1 sequence loaded.
- In a few seconds, the Artemis graphical display should appear.
- Examine the Artemis Display:
- Can you see the exon-intron structure of the gene?
- Click on an exon - How many neighbouring exons are bolded?
- Is there any evidence for alternative splicing in this gene?
- Does the promoter region have a high GC content, typical of many human promoters? (Hint: Use the Display menu).
- Artemis is a display-editor of EMBL features:
- you could edit the file to clean up the results by deleting the implausible matches.
- you could use it to assemble features for the entire mRNA, promoter/polyA signals and the inferred coding content.
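If you prefer to inspect the saved feature table outside Artemis, the location strings of the mRNA features are easy to pull apart. The Python sketch below handles only simple forward-strand locations of the form 12..80 or join(12..80,150..210); the example coordinates are invented, the function names are hypothetical, and strand information from complement() or partial markers such as < and > is ignored.

```python
import re


def parse_location(location):
    """Parse a simple EMBL location into 1-based inclusive (start, end) blocks.

    Only plain ranges and join(...) of plain ranges are interpreted;
    strand and partial markers are ignored.
    """
    return [(int(start), int(end))
            for start, end in re.findall(r"(\d+)\.\.(\d+)", location)]


def gaps_between(blocks):
    """Unaligned stretches of the query between consecutive aligned blocks."""
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(blocks, blocks[1:])]


blocks = parse_location("join(12..80,150..210)")  # invented coordinates
print(blocks)
print(gaps_between(blocks))
```

Each block corresponds to one aligned stretch of an EST on the genomic query, so the gaps between blocks are the places to look for intron boundaries.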
Notes:
Our chosen example works quite well: ESTs alone can quickly reveal the entire gene structure. We need the program RepeatMasker to filter out repetitive elements and it is on by default: genomic repeat sequences are abundantly represented in ESTs, especially from 3' exons. There also seem to be many ESTs primed from intronic poly-A runs in the genomic DNA sequence. If a gene is not very highly expressed, we could struggle to find the true ESTs in the flood of noisy matches. It is an instructive exercise to see the effects of using Repeatmasker filtered and unfiltered on the sequence from EMBL:AF106656 in Gene2EST. After that exercise, one is left to wonder what fraction of the ~12 million public ESTs belong in the "really useful" category....
Exercise 5. Exploring citations with the Science Citation Index
In this practical we will investigate the ISI Web of Science server which provides access to the Science Citation Index. The workhorse of literature retrieval is the freely accessible PubMed service run at Entrez/NCBI. PubMed has many useful properties including cross-links from databases like EMBL and SWISS-PROT that will never be practical for a commercial service like ISI. The advantages of ISI are: a larger set of scientific journals; citation counts; finding more recent articles from a key citation. These features are useful for text mining and complementary to the free PubMed service.
Counting Citations:
You are setting up a new biotech company and want some academic directors on the board. You recently heard Prof. I. W. Mattaj talk at a conference about "nuclear baskets - woven or wickerwork?", a trendy, happening area of cell science. Would he be worth approaching as a director? Has his work been influential or is this just a flash in the pan? You decide to check the citation index...
- Load the ISI Portal page into a navigator window and click through to the Web of Science.
- Note the options, then click on the Full Search button.
- The Full Search date and database set up page loads.
- What database(s) can you search?
- Which is the earliest year you can search?
- Tick the Science citation index box, then click the General Search button.
- Type mattaj i* in the AUTHOR box and click SEARCH. (We'll initially assume there is only one I. Mattaj.)
- Examine the search results page.
- Note the buttons at the top of the page - we will use these later.
- How many articles are listed on the page?
- Is that all the articles found?
- If not, how many articles has the researcher published: < 10, < 50, 100 - 200?
- When was the first article published?
- (Hint - you will have to click to the relevant page.)
- Has the researcher been active enough to warrant our interest?
- Now we want to check if the researcher's more recent work is actively cited.
- Click on the DATE & DB LIMITS button to return to the setup page.
- Click on the From button, select the years 1996 - 2003 and click GENERAL SEARCH.
- Note that your keyword query setup has been remembered!
- You must ALWAYS use the buttons at the top to return to the setup pages.
- Click on Search.
- Choose to Sort results by: Times Cited.
- By clicking on the links find out:
- If the results are correctly sorted by number of citations.
- Does the most cited paper have <10; 10 - 100; 100 - 500; 500 - 1000 citations?
- Is this paper a review or an experimental article?
- Is it being cited much this year?
- How many reviews are in the top 10?
- Are there any papers in this time period with no citations?
- So called "high impact" papers are collected by ISI for the two preceding years:
- Use the buttons to refine the query for the years 2001-2002. Sort results by citation.
- Top journals have an "impact factor" roughly in the range 20-30.
- By this yardstick are any of the recent papers "high impact"?
- You are now in a position to decide whether you want to recruit our subject to the board!
- [Er... citation counts say nothing about personality and appropriateness. If that is our only criterion, we could get a shock when our new colleague turns out to have appalling man-management skills while frittering away venture capital due to his expensive habits, insisting on flying first class, dining in expensive restaurants and staying in top hotels! (Though not in this example of course!)]
Citation counts - and especially short term impact factors - should not be used uncritically. There are many misleading factors and biases such as how many scientists work in a particular field and how often they publish. Impact factors are calculated annually so the month of publication can have a big effect on a given paper. Furthermore, impact factors lead to a focus on short term research goals that might be damaging. As an example, "high impact" Cell and Nature have shorter citation half-lives than NAR - so the most cited NAR papers tend to remain valuable to researchers for longer than those in the trendier journals. The true worth of a paper (whatever that is) may not be apparent for many years. An example would be the apoptosis field which suddenly exploded into action based upon pioneering work over many years by rather few researchers. It would be a shame to so construct the scientific enterprise that there was no room for pioneers any more.
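For reference, the journal impact factor is a simple ratio over a two-year window: citations received in a given year by items the journal published in the two preceding years, divided by the number of citable items it published in those two years. The Python sketch below just encodes that ratio; the numbers in the example are invented and do not describe any real journal.

```python
def two_year_impact_factor(citations_to_previous_two_years, citable_items_previous_two_years):
    """Citations in year Y to items from years Y-1 and Y-2,
    divided by the number of citable items published in Y-1 and Y-2."""
    if citable_items_previous_two_years <= 0:
        raise ValueError("need at least one citable item")
    return citations_to_previous_two_years / citable_items_previous_two_years


# Invented numbers: 5000 citations to 200 citable items.
print(two_year_impact_factor(5000, 200))
```

Because only a two-year window counts, a paper that is cited slowly but steadily for a decade contributes very little to this figure.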
Exercise 6. Collecting articles that cite a key paper
Retrieving articles by citation of a key paper is useful any time that one cannot be sure whether keyable information will be present in the abstracts (when you are interested in a method for example) or when one cannot be sure which keywords are appropriate. The example we will use is the well known "Nuclear Localisation Signal" (NLS) which refers to a substantial body of literature.
The problem:
- Retrieving articles that discuss nuclear localisation signals is clearly going to be awkward:
- Will it be referred to as a signal or a sequence, or even as a motif or pattern?
- Is it localisation or localization? - unfortunately it is both.
- Is NLS sometimes used in abstracts without the full definition?
- What if it also stands for non-linear Schrodinger equation?!
- Will the NLS be mentioned at all in the abstract?
Collecting NLS articles:
- Using your newly gained experience, set up queries of science citation index for all years
- Using these TOPIC keyword combinations:
- Count the number of articles retrieved for:
- nuclear localisation signal
- nuclear localization signal
- nuclear locali* signal* or nuclear locali* sequence* or NLS not schrodinger not nonlinear
- Note the use of logical operators to better specify the query.
- A list of operators and their usage is outlined in the on-line help.
- You might be able to think of even more complex variations but we have probably retrieved most of what we can get by NLS-based keyword search.
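To get a feel for what the * wildcard buys you, the matching can be mimicked locally. The Python sketch below is an assumption about how such wildcards behave (each * stands for zero or more word characters at the end of a word); it is not ISI's implementation, and the titles are invented.

```python
import re


def wildcard_to_regex(phrase):
    """Compile a phrase with trailing * wildcards into a case-insensitive regex.

    Assumption of this sketch: * matches zero or more word characters.
    """
    parts = [re.escape(word).replace(r"\*", r"\w*") for word in phrase.split()]
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)


# Invented titles for illustration.
TITLES = [
    "A nuclear localisation signal in protein X",
    "Nuclear localization signals of protein Y",
    "A nuclear targeting sequence in protein Z",
]

query = wildcard_to_regex("nuclear locali* signal*")
matches = [title for title in TITLES if query.search(title)]
print(matches)
```

One wildcard query covers both spellings and the plural, but a title that says "targeting sequence" still slips through, which is the motivation for the citation-based search that follows.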
Collecting articles via primary NLS work:
- The classical bipartite NLS was clearly described by Dingwall and Laskey in a 1991 TIBS review.
- (Note that the literature goes back several years before that.)
- To get this article, search all years and enter dingwall and laskey into AUTHORS .
- Sort results by Times Cited
- Is the article their joint best-cited paper?
- What variant of nuclear localisation signal is in the abstract?
- Or do they use another definition?
- Ought we to enlarge our keyword query to include or nuclear target* signal* or nuclear target* sequence*?
- Are there more articles citing this paper than we retrieved with the NLS keyword search?
- Is there a 2nd highly cited nuclear targeting paper from this group?
- Do you think the sets of citing papers will strongly overlap?
- Examine the abstracts of the 10 most recent papers citing Dingwall and Laskey:
- How many of these papers could also be retrieved by the complex NLS keyword search?
Comparing annual citations of a paper:
- One worry with using the 1991 Dingwall and Laskey paper to retrieve current articles is that it is now more than 10 years old and its citation rate may be in decline. We can find out if this is a serious limitation.
- Click to Date & DB Limits and set the year to 1995, then click to CITED REF SEARCH.
- Type Dingwall C in AUTHORS and 1991 in CITED YEAR then click LOOKUP.
- This retrieves a list of papers published in 1991 but cited in 1995.
- How many TIBS articles did Dingwall apparently publish in 1991?
- Is he really that prolific?
- Tick the box for the proper citation and click SEARCH.
- How many articles cited the 1991 paper in 1995?
- Return to Date & DB Limits and set the Year to 2001.
- Repeat the process to find out how many articles cite the 1991 paper in 2001.
- How many articles cited the 1991 paper in 2001?
- Is it more or less than in 1995?
- Is it a big difference?
- One could repeat the process for every year since 1991 to get the citation curve for this paper.
As well as pandering to one's vanity by enabling citations to be counted, the WOS server allows one to retrieve articles that cite a key reference. Sometimes this can be a very useful way of searching the literature. For example, if you are interested in applying a method that you are not familiar with, you may want to look in the literature at how and when that method has been applied - and whether it has been modified and improved by anyone. Keyword searches are not much use for doing this because abstracts usually do not detail which methods have been applied. So, finding articles which cite the method is going to be very helpful. By contrast, if you wanted to catch up on recent work with the giant muscle protein, titin, then you just need to do the keyword search since the word titin is extremely likely to occur in the abstracts. So, whether or not collecting articles by citation is useful depends on the kind of literature retrieval you need to do.
But let's end with a warning: don't get too obsessed by citation counts...
The servers we showed today can be useful for various aspects of sequence analysis. Our group servers provide tools that are not really covered by other web sites. However, we are only filling in some perceived gaps in what is available. There are many powerful tools available at sites like the EBI, EXPASY and NCBI. Check them out!
You can find this page at http://www.embl-heidelberg.de/~seqanal/courses/jan03/GibsonTools.html