
Winter 03 Course

New Web Tools available from the Gibson Group

Toby Gibson and Group, Jan 26th-29th 2003

The aim of this practical is to get hands-on experience with five web servers that can help to investigate biological sequences. The ELM server is a new tool for investigating functional motifs in proteins. It currently has about 70 motif entries and will soon have more. GlobPlot is a server for exploring protein disorder and globularity - relevant to the ELM server since many functional motifs are in disordered parts of proteins. BLAST2SRS is a server for retrieving homologous proteins that is useful when you want some, but not all, and you are not sure in advance which... Switching to genes, the Gene2EST server uses ESTs to get an overview of gene structure. It allows you to submit gene-size queries (if need be >100,000 bp) to search the EST databases and provides both graphical and alignment output: in favourable cases - i.e. plentiful ESTs - Gene2EST can rapidly give a good description of a gene structure, including alternative splicing. Finally, we introduce a server that is not from our group! The Web of Science server from ISI allows you to peruse the Science Citation Index. As well as counting citations, this server is uniquely useful for searching "forwards in time" through publication abstracts.

Note of Caution: Most of these servers are still under development - they will change, but hopefully for the better!


Getting started

The teaching machines are Intel PCs running Linux. It will take a few moments to get set up.


Exercise 1. Exploring multidomain proteins with GlobPlot

It can be important to discern nonglobular regions of proteins: they can contain short functional sites, e.g. histone tails (interesting), and they can interfere with protein crystallisation (bad). GlobPlot uses "coil preferences" for the amino acids to distinguish nonglobular and globular regions of multidomain proteins. It uses a simple graphical approach based on summing the parameters along the sequence, so that the slope of the graph indicates the nature of the sequence: a rising (positive) slope indicates a nonglobular preference and a falling (negative) slope a globular one. Unlike sliding window algorithms, this approach is good for finding segments of any length in a sequence.
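The running-sum idea can be sketched in a few lines of Python. This is only an illustration of the principle: the propensity values below are invented for the example and are not the parameters used by the GlobPlot server, and the function names are our own.

```python
from itertools import accumulate

# Hypothetical per-residue "coil preference" values, for illustration only.
# Positive = tends to occur in nonglobular regions, negative = globular.
# These are NOT the parameters used by the real GlobPlot server.
TOY_PROPENSITY = {
    "P": 0.5, "S": 0.3, "G": 0.3, "K": 0.2, "E": 0.2, "Q": 0.2,
    "W": -0.5, "F": -0.4, "I": -0.4, "V": -0.3, "L": -0.3, "Y": -0.3,
}

def cumulative_curve(sequence, propensity=TOY_PROPENSITY):
    """Running sum of per-residue propensities along the sequence."""
    return list(accumulate(propensity.get(aa, 0.0) for aa in sequence.upper()))

def net_slope(curve, start, end):
    """Average slope of the curve over residues start..end (0-based, end exclusive)."""
    before = curve[start - 1] if start > 0 else 0.0
    return (curve[end - 1] - before) / (end - start)

# Toy sequence: a P/S/G-rich stretch followed by a hydrophobic stretch.
seq = "PSGSPKSEPS" + "VILFWVYLIV"
curve = cumulative_curve(seq)
print("first half slope: ", round(net_slope(curve, 0, 10), 2))   # rising
print("second half slope:", round(net_slope(curve, 10, 20), 2))  # falling
```

Because the curve is a plain running sum, the slope can be read off between any two positions, which is why no fixed window length has to be chosen in advance.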

Investigate several different classes of protein sequence with GlobPlot:

Notes:

GlobPlot is quite good at distinguishing globular and nonglobular segments of polypeptide sequence. It should be useful for investigating multidomain proteins. GlobPlot is best used in conjunction with other tools, e.g. the SMART and PFAM domain databases, the ELM motif database, multiple sequence alignments and 2D (secondary structure) prediction servers like PHD or JPRED.

Exercise 2. Searching for short functional sites with the ELM server

P53 is an example of a protein that has many small functional sites for modification and/or interaction with ligands. We term these "linear motifs" because they do not require 3D structure for function, needing only to be sufficiently accessible. Motif functions include ligand recognition, amino acid modification, signalling, cell compartment targeting, cleavage and so forth. There are probably fewer categories of motif than of globular domain, but probably more motif instances in a eukaryotic proteome. As part of a consortium, we have begun to collect these motifs and develop a new database, ELM. Currently we have about 70 patterns entered in the database. We are developing a web interface to allow sequences to be compared to the patterns. Motif prediction presents difficulties because matches to such short patterns are not statistically significant, so the user needs to think logically about which motifs/domains are incompatible with each other. Part of the ELM project involves developing filters to reduce the number of false positive matches.
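To make the idea of comparing a sequence to a pattern concrete, here is a minimal Python sketch. The pattern is the classic N-glycosylation consensus (N, then any residue except P, then S or T, then any residue except P) written as a regular expression; it is used purely as an illustration and is not taken from the ELM database. The sequence and the function name are made up.

```python
import re

# Illustrative pattern only, not an entry from the ELM database.
TOY_MOTIFS = {
    "N-glycosylation-like": r"N[^P][ST][^P]",
}

def find_motifs(sequence, motifs=TOY_MOTIFS):
    """Return (name, start, matched_text) for every match.

    Start positions are 1-based. A lookahead is used so that
    overlapping matches are all reported.
    """
    hits = []
    for name, pattern in motifs.items():
        for m in re.finditer(f"(?=({pattern}))", sequence):
            hits.append((name, m.start() + 1, m.group(1)))
    return hits

toy_seq = "MKNNSTAANPSGNGTW"  # made-up sequence
for hit in find_motifs(toy_seq):
    print(hit)
```

Even this 16-residue toy sequence gives three matches, which illustrates why short patterns on their own overpredict and why context is needed to interpret them.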

Note: The ELM Server is still under development!

Looking for targeting signals in proteins by comparing sequences to a set of protein motifs:

Questions

Looking for motifs in the nuclear protein P53:

Questions

Notes:

These exercises show the utility - but also the limitations - of using regular expression patterns to search for small motifs. The filters can help a bit, yet run the risk of excluding some genuine matches, as in the case of P53 where the PFAM entry violates the globular domain boundaries. ELM results should not be used uncritically but be put into context. For example, if a protein is known to be phosphorylated, then the phosphorylation motifs, which are mostly badly overpredicted, become pointers for determining experimentally which sites are modified. Multiple sequence alignments are useful - indeed essential - in conjunction with ELM too, since conserved functional sites will show up as conserved sequence blocks in the alignment.
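A globular domain filter of the kind discussed above might look roughly like the following Python sketch. This is our own simplified illustration, not the filter implemented in the ELM server; the coordinates and names are invented. Rejected matches are returned rather than discarded, so that cases like P53, where the annotated domain boundaries are questionable, can still be inspected by eye.

```python
def filter_by_domains(motif_hits, domains):
    """Split motif hits into (kept, rejected).

    motif_hits: list of (name, start, end), 1-based inclusive coordinates.
    domains:    list of (start, end) for annotated globular domains.
    A hit is rejected only if it lies entirely inside a domain.
    """
    kept, rejected = [], []
    for hit in motif_hits:
        _, start, end = hit
        inside = any(d_start <= start and end <= d_end
                     for d_start, d_end in domains)
        (rejected if inside else kept).append(hit)
    return kept, rejected

# Made-up coordinates for illustration.
hits = [("motif_A", 5, 9), ("motif_B", 120, 125), ("motif_C", 198, 203)]
domains = [(100, 200)]
kept, rejected = filter_by_domains(hits, domains)
print("kept:", kept)
print("rejected:", rejected)
```

Note that motif_C straddles the domain boundary and is kept: with imprecise domain annotations it seems safer to err on the side of showing a match.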


Exercise 3. Retrieving subsets of homologous protein sequences using BLAST2SRS

Lots of BLAST servers allow the user to retrieve sequences using links from the output. However, this can be frustrating because often only some sequences are of interest. For example, you may only be interested in eukaryotic sequences or only in one branch of a large multigene family. It is hard to collect only the sequences you want without wasting time looking at entries you are not interested in. BLAST2SRS is a server for efficiently collecting protein sequences from the SWISSPROT and SPTREMBL databases. It achieves this by interfacing BLAST with SRS, the Sequence Retrieval System.

Using BLAST2SRS to collect subsets of TFIIB-family sequences

Notes:

You have probably found yourself doing a BLAST search many times with the same query. Sometimes you want to collect a few new or different sequences to add to your collection. Or perhaps you are only interested in getting some chordate sequences with an invertebrate outgroup for phylogenetic analysis. This is the sort of job that BLAST2SRS is designed for. Note that the NCBI BLAST site has begun to offer similar custom retrieval tools: although in general it is much more sophisticated than BLAST2SRS, it is still not quite as powerful for this particular job.
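The kind of selection described here, chordate sequences plus an invertebrate outgroup, can be sketched in Python as a filter over a list of hits annotated with their taxonomic lineage. This only illustrates the logic; it does not use BLAST2SRS or SRS itself, and the record layout, accessions and function name are all made up.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    accession: str   # made-up identifiers below
    lineage: tuple   # taxonomic lineage, broadest rank first
    evalue: float

def select_hits(hits, include_taxon, exclude_taxon=None, max_evalue=1e-5):
    """Keep hits within include_taxon, outside exclude_taxon, below max_evalue."""
    return [
        h for h in hits
        if include_taxon in h.lineage
        and (exclude_taxon is None or exclude_taxon not in h.lineage)
        and h.evalue <= max_evalue
    ]

blast_hits = [
    Hit("TOY001", ("Eukaryota", "Metazoa", "Chordata"), 1e-80),
    Hit("TOY002", ("Eukaryota", "Metazoa", "Arthropoda"), 1e-60),
    Hit("TOY003", ("Archaea", "Euryarchaeota"), 1e-20),
    Hit("TOY004", ("Eukaryota", "Fungi"), 1e-3),
]

chordates = select_hits(blast_hits, "Chordata")
outgroup = select_hits(blast_hits, "Metazoa", exclude_taxon="Chordata")
print([h.accession for h in chordates])
print([h.accession for h in outgroup])
```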


Exercise 4. Using ESTs to reveal gene structure with Gene2EST

The BLAST2 program is well suited to EST detection. Because it tolerates small gaps in alignments, it deals well with sequencing errors in the ESTs. However, it is painfully time-consuming to work through the output, especially if there are many hundreds of matches. Furthermore, most BLAST2 servers will not allow the user to submit large query sequences. We have tried to address these deficiencies with Gene2EST. The Gene2EST server presents BLAST2-derived output in a way that allows the user to rapidly evaluate the results. A graphical display (viewed with Artemis) provides an overview of the results, while an alignment of the ESTs on the query sequence lets the user follow up the exons in detail.
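One way to see how many overlapping EST matches condense into a gene structure overview is to merge their coordinates on the query into blocks, each block being a candidate exon. The Python sketch below is only an illustration of that idea under simplifying assumptions (exact match coordinates, no repeats); it is not how Gene2EST is implemented, and the coordinates are invented.

```python
def merge_intervals(matches, max_gap=0):
    """Merge (start, end) matches on the query into candidate exon blocks.

    Coordinates are 1-based inclusive. Matches that overlap, or that are
    separated by at most max_gap bases, are merged into one block.
    """
    merged = []
    for start, end in sorted(matches):
        if merged and start <= merged[-1][1] + max_gap + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(block) for block in merged]

# Made-up EST match coordinates on a genomic query.
est_matches = [(1000, 1150), (100, 250), (5000, 5300), (120, 260), (1100, 1200)]
print(merge_intervals(est_matches))
```

In real data the block edges are only approximate, since alignments can run a few bases past a splice site, so the alignment output is still needed to pin down exon boundaries.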

Getting the Query:

Using Gene2EST:

Examining the alignment output:

Examining the graphical output with Artemis:

Notes:

Our chosen example works quite well: ESTs alone can quickly reveal the entire gene structure. We need the program RepeatMasker to filter out repetitive elements, and it is on by default: genomic repeat sequences are abundantly represented in ESTs, especially from 3' exons. There also seem to be many ESTs primed from intronic poly-A runs in the genomic DNA sequence. If a gene is not very highly expressed, we could struggle to find the true ESTs in the flood of noisy matches. It is an instructive exercise to compare the Gene2EST results for the sequence from EMBL:AF106656 with and without RepeatMasker filtering. After that exercise, one is left to wonder what fraction of the ~12 million public ESTs belong in the "really useful" category...

Exercise 5. Exploring citations with the Science Citation Index

In this practical we will investigate the ISI Web of Science server which provides access to the Science Citation Index. The workhorse of literature retrieval is the freely accessible PubMed service run at Entrez/NCBI. PubMed has many useful properties including cross-links from databases like EMBL and SWISS-PROT that will never be practical for a commercial service like ISI. The advantages of ISI are: a larger set of scientific journals; citation counts; finding more recent articles from a key citation. These features are useful for text mining and complementary to the free PubMed service.

Counting Citations:

You are setting up a new biotech company and want some academic directors on the board. You recently heard Prof. I. W. Mattaj talk at a conference about "nuclear baskets - woven or wickerwork?", a trendy, happening area of cell science. Would he be worth approaching as a director? Has his work been influential or is this just a flash in the pan? You decide to check the citation index...

Citation counts - and especially short term impact factors - should not be used uncritically. There are many misleading factors and biases such as how many scientists work in a particular field and how often they publish. Impact factors are calculated annually so the month of publication can have a big effect on a given paper. Furthermore, impact factors lead to a focus on short term research goals that might be damaging. As an example, "high impact" Cell and Nature have shorter citation half-lives than NAR - so the most cited NAR papers tend to remain valuable to researchers for longer than those in the trendier journals. The true worth of a paper (whatever that is) may not be apparent for many years. An example would be the apoptosis field which suddenly exploded into action based upon pioneering work over many years by rather few researchers. It would be a shame to so construct the scientific enterprise that there was no room for pioneers any more.
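For reference, the standard two-year impact factor of a journal for year Y is the number of citations received in Y by items published in Y-1 and Y-2, divided by the number of citable items published in those two years. The Python sketch below works through this with invented numbers; the data layout and function name are our own.

```python
def two_year_impact_factor(year, citations, items):
    """Two-year impact factor for `year`.

    citations: {(citing_year, published_year): count}
    items:     {published_year: number of citable items}
    """
    window = (year - 1, year - 2)
    cites = sum(citations.get((year, y), 0) for y in window)
    published = sum(items.get(y, 0) for y in window)
    if published == 0:
        raise ValueError("no citable items in the two-year window")
    return cites / published

# Invented numbers for a hypothetical journal.
toy_citations = {(2002, 2001): 300, (2002, 2000): 500, (2002, 1999): 400}
toy_items = {2001: 100, 2000: 100, 1999: 100}
print(two_year_impact_factor(2002, toy_citations, toy_items))
```

The 400 citations to the 1999 papers do not count at all, which is one way to see why the measure favours journals whose papers are cited quickly over those whose papers stay useful for longer.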


Exercise 6. Collecting articles that cite a key paper

Retrieving articles by citation of a key paper is useful whenever one cannot be sure that searchable information will be present in the abstracts (when you are interested in a method, for example) or which keywords are appropriate. The example we will use is the well known "Nuclear Localisation Signal" (NLS), a term that refers to a substantial body of literature.

The problem:

Collecting NLS articles:

Collecting articles via primary NLS work:

Comparing annual citations of a paper:

As well as pandering to one's vanity by enabling citations to be counted, the WOS server allows one to retrieve articles that cite a key reference. Sometimes this can be a very useful way of searching the literature. For example, if you are interested in applying a method that you are not familiar with, you may want to look in the literature at how and when that method has been applied - and whether it has been modified and improved by anyone. Keyword searches are not much use for doing this because abstracts usually do not detail which methods have been applied. So, finding articles which cite the method is going to be very helpful. By contrast, if you wanted to catch up on recent work with the giant muscle protein, titin, then you just need to do the keyword search since the word titin is extremely likely to occur in the abstracts. So, whether or not collecting articles by citation is useful depends on the kind of literature retrieval you need to do.
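Conceptually, searching "forwards in time" amounts to inverting the reference lists: instead of asking which papers an article cites, we ask which later articles cite a given paper. A minimal Python sketch of that inversion, using made-up records and our own function name (this is not the ISI interface), is:

```python
from collections import defaultdict

def build_cited_by(references):
    """Invert {paper: [papers it cites]} into {paper: [papers citing it]}."""
    cited_by = defaultdict(list)
    for paper, cited in references.items():
        for ref in cited:
            cited_by[ref].append(paper)
    return dict(cited_by)

# Made-up records for illustration.
toy_references = {
    "method_paper_1990": [],
    "application_1995": ["method_paper_1990"],
    "improvement_1999": ["method_paper_1990", "application_1995"],
    "review_2002": ["improvement_1999"],
}

cited_by = build_cited_by(toy_references)
print(cited_by["method_paper_1990"])
```

The citing papers are found whether or not their abstracts mention the method, which is exactly what a keyword search cannot guarantee.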

But let's end with a warning: don't get too obsessed by citation counts...


The servers we showed today can be useful for various aspects of sequence analysis. Our group servers provide tools that are not really covered by other web sites. However, we are only filling in some perceived gaps in what is available. There are many powerful tools available at sites like the EBI, EXPASY and NCBI. Check them out!



You can find this page at http://www.embl-heidelberg.de/~seqanal/courses/jan03/GibsonTools.html