Predocs06 Course
Basic Tools in Sequence Analysis Practical
by Toby Gibson and the Sequence Analysis Team, October 19th-20th, 2006
In this practical we will use both WWW-based tools and locally installed programs on our LINUX PCs which cover the core sequence analysis activities: querying databases of various types and making multiple sequence alignments.
In the first and second parts, we will run some web tools provided at EMBL. These can be accessed from any computer and are simple to use. Web servers are often the nicest way to do sequence analysis, but you should be aware that they can be unreliable, need constant care from their providers and are not suited to every task. Sometimes you have to run programs on local machines too. Database search tools are well suited to web servers and in the first part we will try out the (now rather old-fashioned and quaint) BLAST server at EMBL. Examination of the outputs may reveal some differences between the results, depending on the parameter setup used in the sequence comparison. This serves to illustrate the importance of a little thought in advance of, or when repeating (better late than never...), database searches. Rule No. 1 is "Know your sequence"!
In the second part we will investigate protein architecture and function with more EMBL web-based tools. Many interesting proteins are multidomain and often have natively disordered regions too. Complex protein architecture is indicative of a complex function and complex regulation of function. Not only are the proteins themselves complex but they assemble into large, flexible, dynamical protein complexes using cooperative interactions involving short peptide motifs.
In the third part, we will make a multiple alignment of a protein sequence family and calculate a tree from the sequences. We will study the tree to see whether it fits with our current understanding of phylogeny.
Part 1. WWW sequence similarity searching
We will use:
Getting started
The teaching machines are INTEL PCs running the LINUX OS. It will take a few moments to get set up.
Step 1 Choosing an snRNP SM protein as query
SM proteins are found in snRNP complexes. There are quite a number in Swiss-Prot and they are fairly divergent, so it is difficult (or impossible) to detect them all in a search with a single query sequence. All SM proteins share a small globular domain, but many have a C-terminal unstructured domain too. This will be used to illustrate the problems of searching with multi-domain proteins.
You now have the sequence of human SM-B protein available in a form that can be cut and pasted into the DB query forms.
Step 2 BLAST2 searching with human SM-B protein
BLAST2 is an upgraded version of BLAST, one of the most widely used database search packages. The BLAST programs find the best matching ungapped sections in a sequence comparison. The most important modification for the user to note in BLAST2 is that neighbouring ungapped segments are concatenated by allowing gaps between them. This improves both sensitivity and interpretation of the results.
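To make "best matching ungapped section" concrete, here is a minimal Python sketch that scores two sequences position by position along a single diagonal and reports the highest-scoring contiguous run. The function name and the match/mismatch scores are assumptions chosen for illustration; real BLAST programs score with a substitution matrix such as Blosum62 and use word seeding to choose which diagonals to examine, neither of which is reproduced here.

```python
def best_ungapped_segment(a, b, match=2, mismatch=-1):
    """Return (score, start, end) of the highest-scoring ungapped run
    when a and b are compared position by position (one diagonal).

    The scores are toy values (an assumption), not a real substitution matrix.
    start/end are 0-based, end exclusive.
    """
    best = (0, 0, 0)
    run_score, run_start = 0, 0
    for i, (x, y) in enumerate(zip(a, b)):
        # A run whose score has dropped to zero or below is abandoned
        # and a fresh run is started at the current position.
        if run_score <= 0:
            run_score, run_start = 0, i
        run_score += match if x == y else mismatch
        if run_score > best[0]:
            best = (run_score, run_start, i + 1)
    return best


if __name__ == "__main__":
    print(best_ungapped_segment("AAAGGGTTT", "CCCGGGCCC"))
```

In BLAST2, several such segments lying on nearby diagonals can then be joined by allowing gaps between them, which is the modification described above.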
Questions
Step 3 BLAST 2 search with SM-B and a filter
Now repeat the search but filter out segments of "reduced sequence complexity".
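As a rough idea of what a "reduced sequence complexity" filter does, the Python sketch below slides a window along the sequence, measures the Shannon entropy of the residue composition in each window, and replaces low-entropy windows with X. This is a simplified stand-in, not the algorithm used by the BLAST server; the window size, the threshold and the function names are all assumptions for illustration.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace every residue that falls in a low-entropy window with 'X'.

    window and threshold are illustrative values (assumptions),
    not the settings of any real filtering program.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


if __name__ == "__main__":
    toy = "MKTAYIAKQR" + "P" * 12 + "DEFGHIKLMN"  # made-up sequence
    print(mask_low_complexity(toy))
```

Masked residues cannot seed or extend matches, so a compositionally biased tail no longer pulls in unrelated proteins that merely share the same bias.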
Questions
Part 1 Take Home Lessons
Hopefully the exercise of varying the parameter for reduced sequence complexity filtering has illustrated that the way a search is set up is very important. The queries here illustrate the effect of different sequence types. There are other parameters that often influence the search sensitivity. For example when a globular domain is longer, the Gonnet Pam250 matrix would be expected to outperform the default Blosum62 in the detection of divergent homologues, because it is less stringent and so gives longer optimally matching segments. (Over short matches it is noisier and could perform worse). There isn't time today or we would have done profile searches using alignments as input. For this we would use the Gonnet matrix to make the profiles: because of the extra information in the alignment, profiles usually perform better with Pam250 than Blosum62. At high sequence divergence, BLAST may also be outperformed by "dynamic programming" algorithms: slow but sensitive sequence alignment. Gap penalties are critical parameters in dynamic programming and in BLAST and should always be tested by trial and error. In other words, it pays to try several variations in the searches, not just accept the results of the first search.
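The sensitivity of dynamic programming to gap penalties can be seen in a small Python sketch of local alignment scoring (the Smith-Waterman recurrence with a linear gap penalty). The scores are toy values and the function name is an assumption; real searches use a substitution matrix and usually separate gap opening and gap extension penalties.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b by dynamic programming.

    Uses a linear gap penalty and toy match/mismatch scores (assumptions).
    Only two rows of the score matrix are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Either align the two residues, open a gap in one sequence,
            # or restart the local alignment at zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # The second sequence carries one inserted residue.
    print(local_alignment_score("AAAATTTT", "AAAACTTTT", gap=-2))   # gap is affordable
    print(local_alignment_score("AAAATTTT", "AAAACTTTT", gap=-10))  # gap is too costly
```

With the mild penalty the best alignment bridges the insertion with a gap; with the harsh penalty the algorithm prefers an ungapped alignment with a mismatch. The same pair of sequences therefore gives different alignments and scores, which is why trying more than one setting pays off.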
Part 2. Exploring protein architecture and function with SMART, GlobPlot and ELM
Proteins can have dozens of domains and/or short peptide functional sites (so-called linear motifs) where only the local peptide sequence is relevant to function, e.g. phosphorylation sites. Although the peptide embodies all the information for its own function, linear motifs often regulate the activity of other parts of the protein. In such cases protein function cannot be well understood without an overview of the modular architecture. We'll look at some servers that can help to characterise protein architecture.
We will use:
Step 1. Comparing a sequence to a database of protein domains
The most sensitive sequence searches use "profiles" - queries based on multiple alignments. In fact we can query an unknown sequence against a set of profiles for known protein families. There are several very useful databases of modules that are found in multidomain proteins, including PFAM at the Sanger Centre, PROSITE at ISREC and SMART at EMBL. They use a form of profile technically described as a "hidden Markov model". We will first check for protein domains in the Src oncoprotein using the SMART server.
Questions
Step 2. Exploring Protein Order/Disorder
Today we will look at two different web servers (GlobPlot and DisProt) that predict the presence of nonglobular regions of proteins. It can be important to discern nonglobular regions of proteins: they often have short functional sites, e.g. histone tails (interesting), and they can interfere with protein crystallisation (bad).
GlobPlot
GlobPlot uses "coil preferences" for the amino acids to distinguish nonglobular and globular regions of multidomain proteins. It uses a simple graphical approach based on summing the parameters so that the slope of the graph indicates the nature of the sequence. A rising positive slope has a nonglobular preference while a negative slope indicates globular preference. Unlike sliding window algorithms, this approach is good for finding segments of any length in a sequence.
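The running-sum idea can be sketched in a few lines of Python. The propensity values below are hypothetical (+1 for residues taken to favour coil, -1 for residues taken to favour globular structure, 0 otherwise) and are not GlobPlot's actual parameters; the function names are likewise assumptions. The sketch sums the values along the sequence and reports stretches where the curve keeps rising.

```python
from itertools import accumulate

# Hypothetical residue classes, for illustration only (not GlobPlot's scale).
DISORDER_PRONE = set("PSGEKQDN")
ORDER_PRONE = set("WFILVYCM")


def propensity(residue):
    """Toy coil preference: +1, -1 or 0."""
    if residue in DISORDER_PRONE:
        return 1
    if residue in ORDER_PRONE:
        return -1
    return 0


def cumulative_curve(seq):
    """Running sum of the per-residue propensities."""
    return list(accumulate(propensity(r) for r in seq))


def rising_segments(curve, min_length=5):
    """Return (start, end) ranges, end exclusive, where the curve rises
    at every step for at least min_length residues."""
    segments, start, previous = [], None, 0
    for i, value in enumerate(curve):
        if value > previous:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_length:
                segments.append((start, i))
            start = None
        previous = value
    if start is not None and len(curve) - start >= min_length:
        segments.append((start, len(curve)))
    return segments


if __name__ == "__main__":
    toy = "LIVFW" + "PSGEKQ" + "LIV"  # made-up sequence
    curve = cumulative_curve(toy)
    print(curve)
    print(rising_segments(curve))
```

Because segments are read off the slope of the whole curve, there is no fixed window length to choose, which is the advantage over sliding window methods noted above.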
Questions:
DisProt
We will now use a different web server, DisProt, to predict nonglobular regions in the same protein, human SRC. DisProt takes a different approach from GlobPlot to predicting disordered regions: it uses a trained neural network.
Question
Step 3. Searching for short functional sites with the ELM server
Src is an example of a protein that has many small functional sites for modification and/or interaction with ligands. We term these "linear motifs" because they do not require 3D structure for function, needing only to be sufficiently accessible. Motif functions include ligand recognition, amino acid modification, signalling, cell compartment targeting, cleavage and so forth. There are probably fewer categories of motif than of globular domains, but there are probably more instances in a eukaryotic proteome. As part of a consortium, we have begun to collect these motifs and develop a new database, ELM. Currently we have more than 100 patterns entered in the database. The ELM server's web interface allows sequences to be compared to the stored patterns. Motif prediction presents difficulties as matches are not statistically significant, so the user needs to think logically about which motifs/domains are incompatible with each other. Part of the ELM project involves developing filters to reduce the number of false positive matches.
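Comparing a sequence to stored patterns amounts to a regular expression scan, as in the Python sketch below. The two patterns are simplified textbook-style consensus motifs written for illustration (a CDK-like phosphorylation site and a PxxP proline-rich site); they are assumptions and not the actual ELM entries, and the pattern names, function name and test sequence are made up.

```python
import re

# Simplified, illustrative patterns (assumptions, not real ELM entries).
PATTERNS = {
    "CDK_site_simplified": r"[ST]P.[KR]",
    "PxxP_simplified": r"P..P",
}


def scan_motifs(seq, patterns=PATTERNS):
    """Return (pattern name, 1-based start, matched peptide) for every match.

    A lookahead is used so that overlapping matches are all reported.
    """
    hits = []
    for name, pattern in patterns.items():
        for m in re.finditer(f"(?=({pattern}))", seq):
            hits.append((name, m.start() + 1, m.group(1)))
    return sorted(hits, key=lambda hit: hit[1])


if __name__ == "__main__":
    for hit in scan_motifs("AASPEKAAPLLPAA"):  # made-up sequence
        print(hit)
```

Patterns this short match by chance in almost any protein, which is why the matches carry no statistical significance and need filtering by context and conservation.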
Looking for conserved motifs in the human and Xenopus src protein N-termini:
Questions
There are 8 src-like kinases in the human proteome with partially redundant function. Presumably they will be substrates of CDK too?
Step 4. Exploring the architecture of the protein Epsin1
Epsin is a protein involved in clathrin-mediated endocytosis. It binds to the membrane, inducing curvature, and is regulated by many adaptor protein interactions. Endocytosis is a highly dynamic process that involves many different proteins coming together in transient complexes. The whole system takes extensive advantage of short linear motifs. Let's check some out!
Questions
Many short functional motifs are mapped in Epsins already, notably:
Part 2. Take home lessons
We need to know all the functional domains and motifs in a protein to truly understand how it functions: in our examples, the short peptide motifs easily outnumber the globular domains. Bioinformatics resources can help us find many of these components but are by no means comprehensive. Known domains can be assigned with good statistical confidence. In the case of the ELM short functional sites, there is no statistical support for candidates. ELM results should be filtered, partly by ELM itself but also by the user. Checking for conservation in closely related proteins is a good test of whether ELM matches should be followed up.
Part 3. Making multiple sequence alignments and calculating a tree
As night follows day, once we start examining a protein, we quickly find we need to align it to its homologues. As we are good evolutionary biologists we need to do "Comparative Analysis", just as Darwin did, except we use the protein sequences and not the morphology of birds' beaks. Conservation of sequence information is dictated by structural and functional requirements and we can often make extremely useful inferences. Of course we soon need to know how the proteins are related: often we have to deconvolute orthology v. paralogy, complicated by gene loss... So then we need to make trees to try and work out how the different genes are related to each other.
We will use:
Getting Started
Both these programs run on desktop Macs and PCs but today we will run them on the LINUX stations where they are already installed.
On your LINUX PC:
Step 1. Getting a set of EF-TU / EF-1A sequences
Translation Elongation factors are found in all species so have often been used for phylogenetic investigations. EF-TU in eubacteria and EF-1A in eukaryotes are orthologous factors. There are >150 entries in SWISSPROT which would take too long to align today so we will use the SRS query manager to provide a representative selection.
[swissprot-MainText:
EF11_HUMAN | EF12_HUMAN | EF12_XENLA | EF13_XENLA | EF1A_GIALA |
EF1A_PLAFK | EF1A_WHEAT | EF1A_YEAST | EF11_DROME | EF11_MOUSE |
EF11_SCHPO | EF11_XENLA | EF1A_ARATH | EF1A_CAEEL | EF1A_CHICK |
EF1A_DICDI | EF1A_ENTHI | EF1A_HALSA | EF1A_METJA | EF1A_PODAN |
EF1A_PYRWO | EF1A_SULAC | EF1A_THEAC |
EFTU_BACSU | EFTU_ECOLI | EFTU_HAEIN | EFTU_MYCPN | EFTU_THEAQ |
EFTU_AQUAE | EFTU_RICPR | EFTU_HUMAN | EFTU_BOVIN | EFTU_YEAST]
Step 2. Aligning the elongation factor sequences with Clustal X
Multiple Alignments have many uses. They are used for revealing important conserved residues, for making phylogenies, designing PCR primers, for secondary structure prediction etc.
Questions
Step 3. Calculating a tree with Clustal X and displaying it with NJplot
Clustal X uses the neighbour-joining method to calculate trees. This is a distance method (based on distances between sequences) that gives reasonable results. NJ is not the best method (usually said to be the computationally intensive Maximum-Likelihood approach) but is fast and good for a quick examination of tree topology. In particular, NJ is less robust than ML to variation in mutation rates between the sequences. A common artefact of unequal rates is that fast evolving sequences (which have long branches) exhibit "long branch attraction" - moving toward each other and deeper into the tree than their true positions.
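To show what "distance method" means in practice, the Python sketch below computes a simple uncorrected distance between two aligned sequences and then applies the neighbour-joining selection criterion to a small distance matrix to decide which pair to join first. The function names and the toy matrix are assumptions for illustration; the sketch performs only the first joining step and does not reproduce the distance corrections or other options of Clustal X.

```python
def p_distance(a, b):
    """Fraction of differing residues between two aligned sequences,
    ignoring columns where either sequence has a gap ('-')."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def nj_pick_pair(dist):
    """Return ((i, j), Q) for the pair minimising the neighbour-joining
    criterion Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k)."""
    n = len(dist)
    totals = [sum(row) for row in dist]
    best_pair, best_q = None, None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best_q is None or q < best_q:
                best_pair, best_q = (i, j), q
    return best_pair, best_q


if __name__ == "__main__":
    # Toy symmetric distance matrix for five taxa (made-up numbers).
    toy = [
        [0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0],
    ]
    print(nj_pick_pair(toy))
```

In this toy matrix the smallest raw distance is between the fourth and fifth taxa, yet the criterion joins the first and second, because each distance is weighed against how far both taxa are from everything else. This is how NJ partly compensates for unequal rates, although, as noted above, it remains less robust than Maximum-Likelihood when rates vary strongly.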
Calculate the tree:
Display the tree:
Questions
Answers to all the questions are provided on another page.
Part 3 Take Home Lessons
It is said that biology cannot be understood without setting it in an evolutionary context. Comparative sequence analysis is a continuation of the Darwinian tradition. Phylogenetic trees are fascinating in themselves but, in conjunction with multiple sequence alignments, are also important tools for gaining insight into the function of sequence families. However, tree calculations are unreliable unless there are plenty of diagnostic mutations to correctly assign the branching order. Variation in the rate of sequence evolution confounds the algorithms and can give rise to highly misleading trees: as we saw here, parts of the tree were obviously wrong once we applied extrinsic knowledge. Various mechanisms can give rise to rate increases: most obviously, selection for a new function (which can also help fix neutral mutations by a piggy-back mechanism); conversely, a loss-of-function mutation can release purifying selection, also increasing fixation of neutral mutations. Other factors such as effective population size are important too: the larger the population size, the lower the likelihood that a given polymorphism will become fixed. Why do thermophilic prokaryotes evolve very slowly even though the chemically induced mutation rate ought to be higher? Perhaps it is because they live in an environment that has existed for 4 billion years, in actual physical locations that change on a slow geological timescale, so selection is primarily conservative (purifying), and the effective population size is very large? At any rate, this serves to remind us that when we look at sequence divergence we see the accepted mutation rate, and this will depend on many factors.