
Predocs04 Course

Basic Tools in Sequence Analysis Practical

by Toby Gibson and the Sequence Analysis Team, October 6th-7th, 2004

In this practical we will use both WWW-based tools and programs installed locally on our LINUX PCs. Together they cover the core sequence analysis activities: querying databases and making multiple sequence alignments.

In the first part, we will run some database similarity search tools provided at EMBL and available through the web. These can be accessed from any computer and are simple to use. Web servers are often the most convenient way to do sequence analysis, but you should be aware that they can be unreliable, need constant care from their providers and are not suited to every task, so sometimes you have to run programs on local machines too. Database search tools are well suited to web servers and we will try out the BLAST and Bioccelerator servers at EMBL. Examination of the outputs may reveal some differences between the results, depending on the type of algorithm used in the sequence comparison. We will also modify the query and the search set-up to illustrate the importance of a little thought in advance of database searching (or afterwards: better late than never...). Rule No. 1 is "Know your sequence"!

In the second part we will investigate protein architecture and function. Many interesting proteins are multidomain and some have natively disordered regions too. Complex protein architecture is indicative of a complex function and complex regulation of function.

In the third part, we will make a multiple alignment of a protein sequence family and calculate a tree from the sequences. We will study the tree to see whether it fits with our current understanding of phylogeny.


Part 1. WWW sequence similarity search Tools

We will use:


Getting started

The teaching machines are INTEL PCs running the LINUX OS. It will take a few moments to get set up.


Step 1 Choosing an snRNP SM protein as query

SM proteins are found in snRNP complexes. There are quite a number in Swiss-Prot and they are fairly divergent, so it is difficult (or impossible) to detect them all in a search with a single query sequence. All SM proteins share a small globular domain, but many have a C-terminal non-globular domain too. This will be used to illustrate the problems of searching with multi-domain proteins.

You now have the sequence of human SM-B protein available in a form that can be cut and pasted into the DB query forms.


Step 2 BLAST2 searching with human SM-B protein

BLAST2 is an upgraded version of BLAST, one of the most widely used database search packages. The BLAST programs find the best matching ungapped sections in a sequence comparison. The most important modification for the user to note in BLAST2 is that neighbouring ungapped segments are concatenated by allowing gaps between them. This improves both sensitivity and interpretation of the results.

Questions

Step 2B BLAST2 search with SM-B and a filter

Now repeat the search but filter out segments of "reduced sequence complexity".
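To make the idea of "reduced sequence complexity" concrete, here is a minimal sketch in Python. It is not the filter used by the BLAST2 server: it simply masks any window whose Shannon entropy falls below a cut-off. The function names, the window length of 12 and the threshold of 2.2 bits are all assumptions chosen for illustration, and the sequence is a toy example rather than SM-B.

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq: str, window: int = 12, threshold: float = 2.2) -> str:
    """Replace with 'X' every residue lying in a window of low entropy.

    window and threshold are illustrative values, not those of any real filter.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


# Toy sequence: a compositionally diverse stretch followed by a proline run.
toy = "ACDEFGHIKLMNQRSTVWY" + "P" * 16
print(mask_low_complexity(toy))
```

Masked residues contribute nothing to the search score, so a biased region can no longer pull in unrelated sequences that merely share the same composition. Note that a window-based filter like this one also masks a few residues on either side of the biased region.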

Questions


Step 3 Bic_SW search with human SM-B protein

The Bioccelerator is fast dedicated hardware exclusively designed to speed up dynamic programming (i.e. slow but sensitive) sequence comparison. It was built by the Israeli company Compugen. It can perform a number of search permutations including basic Smith-Waterman, profile searches and Protein v. DNA frame-shifting comparisons. The Smith-Waterman search finds the best matching segments between any two sequences, allowing for gaps to be inserted at any position.
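The dynamic programming behind a Smith-Waterman comparison can be sketched in a few lines of Python. This is only an illustration of the recurrence, not what the Bioccelerator runs: it assumes a flat match/mismatch score instead of a substitution matrix such as Blosum62, and a single linear gap penalty instead of separate gap opening and extension penalties. The function name and the default scores are made up for the example.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty).

    Each cell holds the best score of an alignment ending at that pair of
    positions; resetting negative values to zero is what makes it local.
    """
    cols = len(b) + 1
    prev = [0] * cols  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# One inserted residue in the second sequence is bridged by a gap.
print(smith_waterman_score("ACGTACGT", "ACGTTACGT"))
```

Every cell of the matrix has to be filled in for every database sequence, which is why the method is slow on ordinary processors and why dedicated hardware is attractive.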

The search will take a couple of minutes (unless the Bic is busy). When it is finished you can look at the high-score list and alignments in the output and compare the results with BLAST2.

Questions

Step 3B Bic_SW search with the SM Domain only

Now repeat the search but use the globular N-terminal domain only.

Questions


Part 1 Take Home Lessons

Hopefully the exercise of varying the query type has illustrated that the way a search is set up is very important. The queries here illustrate the effect of different sequence types. There are other parameters that often influence the search sensitivity. For example, when a globular domain is longer, the Gonnet Pam250 matrix would be expected to outperform the default Blosum62 in the detection of divergent homologues, because it is less stringent and so gives longer optimally matching segments. (Over short matches it is noisier and could perform worse.) If there were time today, we would also have done profile searches using alignments as input. For these we would use the Gonnet matrix to make the profiles: because of the extra information in the alignment, profiles usually perform better with Pam250 than with Blosum62. Also, gap penalties are critical parameters in dynamic programming and should always be tested by trial and error. In other words, it pays to try several variations in the searches and not just accept the results of the first search.


Part 2. Exploring protein architecture and function with SMART, GlobPlot and ELM

Proteins can have dozens of domains and/or short peptide functional sites where only the local peptide sequence is relevant to function, e.g. phosphorylation sites. Although the peptide embodies all the information needed for its own function, these motifs often regulate the activity of other parts of the protein. We'll look at some servers that can help to characterise protein architecture.


We will use:


Step 1. Comparing a sequence to a database of protein domains

The most sensitive sequence searches use "profiles" - queries based on multiple alignments. In fact we can query an unknown sequence against a set of profiles for known protein families. There are several very useful databases of modules that are found in multidomain proteins, including PFAM at the Sanger Centre, PROSITE at ISREC and SMART at EMBL. They use a form of profile technically described as a "hidden Markov model". We will first check for protein domains in the Src oncoprotein using the SMART server.

Questions


Step 2. Exploring order/disorder with GlobPlot

It can be important to discern nonglobular regions of proteins: they often have short functional sites, e.g. histone tails (interesting), and they can interfere with protein crystallisation (bad). GlobPlot uses "coil preferences" for the amino acids to distinguish nonglobular and globular regions of multidomain proteins. It uses a simple graphical approach based on summing the parameters so that the slope of the graph indicates the nature of the sequence. A rising positive slope has a nonglobular preference while a negative slope indicates globular preference. Unlike sliding window algorithms, this approach is good for finding segments of any length in a sequence.
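The running-sum idea can be sketched as follows. This is not GlobPlot itself: the propensity values in TOY_PROPENSITY are invented for illustration (positive for residues assumed to favour coil, negative for residues assumed to favour globular structure), and the function names are hypothetical. The sketch does no smoothing, so a single residue with a negative value ends a rising segment.

```python
from itertools import accumulate

# Invented values for illustration only; residues not listed count as 0.
TOY_PROPENSITY = {"P": 1.0, "G": 0.8, "S": 0.6, "N": 0.4, "D": 0.3,
                  "L": -0.8, "I": -0.9, "V": -0.7, "F": -0.8, "W": -0.9}


def running_sum(seq, scale=TOY_PROPENSITY):
    """Cumulative sum of per-residue propensities: the curve that is plotted."""
    return list(accumulate(scale.get(res, 0.0) for res in seq))


def rising_segments(seq, scale=TOY_PROPENSITY, min_len=4):
    """Return (start, end) ranges, 0-based and end-exclusive, where the curve
    rises at every residue for at least min_len residues."""
    segments, start = [], None
    for i, res in enumerate(seq):
        rising = scale.get(res, 0.0) > 0
        if rising and start is None:
            start = i
        elif not rising and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    if start is not None and len(seq) - start >= min_len:
        segments.append((start, len(seq)))
    return segments


toy = "LIVFW" + "PGSPGS" + "LIVF"
print(running_sum(toy))
print(rising_segments(toy))
```

Because the segment boundaries come from where the slope changes sign, no window length has to be chosen in advance, which is the advantage over sliding-window methods mentioned above.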

Questions:


Step 3. Searching for short functional sites with the ELM server

Src is an example of a protein that has many small functional sites for modification and/or interaction with ligands. We term these "linear motifs" because they do not require 3D structure for function, needing only to be sufficiently accessible. Motif functions include ligand recognition, amino acid modification, signalling, cell compartment targeting, cleavage and so forth. There are probably fewer categories of motif than of globular domain, but there are probably more motif instances in a eukaryotic proteome. As part of a consortium, we have begun to collect these motifs and develop a new database, ELM. Currently we have more than 100 patterns entered in the database. We are developing a web interface to allow sequences to be compared to the patterns. Motif prediction presents difficulties as matches are not statistically significant, so the user needs to think logically about which motifs/domains are incompatible with each other. Part of the ELM project involves developing filters to reduce the number of false positive matches.
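Short sequence patterns of this kind can be expressed as regular expressions. The sketch below shows how a sequence might be scanned against a small pattern set. The two patterns are simplified textbook-style consensus strings written for this example and are not copied from the ELM database; the motif names, the function name and the toy sequence are likewise hypothetical.

```python
import re

# Simplified illustrative patterns, NOT actual ELM entries.
TOY_MOTIFS = {
    "CDK_site_simplified": r"[ST]P.[KR]",  # Ser/Thr-Pro, any residue, Lys/Arg
    "PxxP_simplified": r"P..P",            # minimal proline-rich core
}


def find_motifs(seq, motifs=TOY_MOTIFS):
    """Return (motif name, 1-based start, matched peptide) for every match.

    The lookahead wrapper lets overlapping matches be reported as well.
    """
    hits = []
    for name, pattern in motifs.items():
        for m in re.finditer(f"(?=({pattern}))", seq):
            hits.append((name, m.start() + 1, m.group(1)))
    return sorted(hits, key=lambda hit: hit[1])


for hit in find_motifs("MAASPEKRLPPLPG"):
    print(hit)
```

Patterns this short match by chance in almost any protein, which is why the matches carry no statistical significance on their own and have to be filtered using context such as accessibility, cell compartment and conservation.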

Looking for conserved motifs in the human and Xenopus Src protein N-termini:

Questions

There are 8 src-like kinases in the human proteome with partially redundant function. Presumably they will be substrates of CDK too?


Step 4. Exploring the architecture of the protein Epsin1

Epsin is a protein involved in clathrin-mediated endocytosis. It binds to the membrane, inducing curvature, and is regulated by many adaptor protein interactions. Endocytosis is a highly dynamic process that involves many different proteins coming together in transient complexes. The whole system takes extensive advantage of short linear motifs. Let's check some out!

Questions

Many short functional motifs are mapped in Epsins already, notably:


Part 2 Take Home Lessons

We need to know all the functional domains and motifs in a protein to truly understand how it functions: in our examples, the short peptide motifs easily outnumber the globular domains. Bioinformatics resources can help us find many of these components but are by no means comprehensive. Known domains can be assigned with good statistical confidence. In the case of the ELM short functional sites, there is no statistical support for candidates. ELM results should be filtered, partly by ELM itself but also by the user. Checking for conservation in closely related proteins is a good test of whether ELM matches should be followed up.


Part 3. Making multiple sequence alignments and calculating a tree

We will use:


Getting Started

Both these programs run on desktop Macs and PCs but today we will run them on the LINUX stations where they are already installed.

On your LINUX PC:


Step 1. Getting a set of EF-TU / EF-1A sequences

Elongation factors are found in all species, so they have often been used for phylogenetic investigations. EF-TU in eubacteria and EF-1A in eukaryotes are orthologous factors. There are >150 entries in Swiss-Prot, which would take too long to align today, so we will use the SRS query manager to provide a representative selection.

[swissprot-MainText:
EF11_HUMAN | EF12_HUMAN | EF12_XENLA | EF13_XENLA | EF1A_GIALA |
EF1A_PLAFK | EF1A_WHEAT | EF1A_YEAST | EF11_DROME | EF11_MOUSE |
EF11_SCHPO | EF11_XENLA | EF1A_ARATH | EF1A_CAEEL | EF1A_CHICK |
EF1A_DICDI | EF1A_ENTHI | EF1A_HALHA | EF1A_METJA | EF1A_PODAN |
EF1A_PYRWO | EF1A_SULAC | EF1A_THEAC |
EFTU_BACSU | EFTU_ECOLI | EFTU_HAEIN | EFTU_MYCPN | EFTU_THEAQ |
EFTU_AQUAE | EFTU_RICPR | EFTU_HUMAN | EFTU_BOVIN | EFTU_YEAST]


Step 2. Aligning the elongation factor sequences with Clustal X

Multiple alignments have many uses: revealing important conserved residues, making phylogenies, predicting secondary structure and so on.
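As a small illustration of the first use, the sketch below lists the columns of an alignment in which every sequence has the same residue. The function name and the three short aligned sequences are invented for the example and are not taken from the elongation factor family; gaps are assumed to be written as "-".

```python
def conserved_columns(alignment):
    """Return 0-based indices of columns where all sequences carry the same
    residue. Columns containing a gap ('-') are never reported."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must all have the same length")
    conserved = []
    for i, column in enumerate(zip(*alignment)):
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            conserved.append(i)
    return conserved


toy_alignment = ["GHKSTL",
                 "GHRSTL",
                 "GH-STL"]
print(conserved_columns(toy_alignment))
```

Strict identity is the crudest possible measure. In a real family, columns that vary only between similar residues (such as K and R in the third column above) are often functionally informative too.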

Questions


Step 3. Calculating a tree with Clustal X and displaying it with NJplot

Clustal X uses the neighbour-joining (NJ) method to calculate trees. This is a distance method (based on distances between sequences) that gives reasonable results. NJ is not the best method (that is usually said to be the computationally intensive Maximum-Likelihood approach) but it is fast and good for a quick examination of tree topology. In particular, NJ is less robust than ML to variation in mutation rates between the sequences. A common artefact of unequal rates is that fast evolving sequences (which have long branches) exhibit "long branch attraction": they move toward each other and deeper into the tree than their true positions.
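The neighbour-joining procedure itself is compact enough to sketch. The Python below is a bare-bones illustration, not the Clustal X implementation: it takes a ready-made distance matrix, returns only the branching order as nested tuples, and omits branch lengths and bootstrapping. The function name and the four-taxon distance matrix are made up for the example.

```python
def neighbour_joining(names, matrix):
    """Return the unrooted NJ topology as nested tuples.

    names  : list of sequence labels
    matrix : square list of lists of pairwise distances, same order as names
    """
    nodes = list(names)
    d = {a: {b: matrix[i][j] for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    while len(nodes) > 2:
        n = len(nodes)
        # Sum of distances from each node to all the others.
        total = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        best_pair, best_q = None, None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                # The Q criterion corrects raw distance for overall divergence.
                q = (n - 2) * d[a][b] - total[a] - total[b]
                if best_q is None or q < best_q:
                    best_pair, best_q = (a, b), q
        a, b = best_pair
        new = (a, b)  # the joined pair becomes a new internal node
        d[new] = {}
        for c in nodes:
            if c in (a, b):
                continue
            dist = 0.5 * (d[a][c] + d[b][c] - d[a][b])
            d[new][c] = dist
            d[c][new] = dist
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    return tuple(nodes)


# Distances from a toy tree in which A pairs with B and C pairs with D.
labels = ["A", "B", "C", "D"]
toy_matrix = [[0, 3, 6, 5],
              [3, 0, 7, 6],
              [6, 7, 0, 3],
              [5, 6, 3, 0]]
print(neighbour_joining(labels, toy_matrix))
```

The method is only as good as the distances fed into it: if the distances for fast evolving sequences are underestimated, the Q criterion can pair them with each other or with deep branches, which is one way the long branch artefact described above can arise.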

Calculate the tree:

Display the tree:

Questions


Answers to all the questions are provided on another page.


Part 3 Take Home Lessons

It is said that biology cannot be understood without setting it in an evolutionary context. Comparative sequence analysis is a continuation of the Darwinian tradition. Phylogenetic trees are fascinating in themselves but, in conjunction with multiple sequence alignments, are also important tools for gaining insight into the function of sequence families. However, tree calculations are unreliable unless there are plenty of diagnostic mutations to correctly assign the branching order. Variation in the rate of sequence evolution confounds the algorithms and can give rise to highly misleading trees: as we saw here, parts of the tree were obviously wrong when we applied extrinsic knowledge. Various mechanisms can give rise to rate increases. The obvious one is selection for a new function (which can also help fix neutral mutations by a piggy-back mechanism); conversely, a loss of function mutation can release purifying selection, also increasing fixation of neutral mutations. Other factors such as effective population size are important too: the larger the population size, the lower the likelihood that a given polymorphism will become fixed. Why do thermophilic prokaryotes evolve very slowly even though the chemically induced mutation rate ought to be higher? Perhaps it is because they live in an environment that has existed for 4 billion years, in actual physical locations that change on a slow geological timescale, so that selection is primarily conservative (purifying) and the effective population size is very large? At any rate, this serves to remind us that when we look at sequence divergence we see the accepted mutation rate, and this will depend on many factors.