Predocs03 Course
Basic Tools in Sequence Analysis Practical
by Toby Gibson and the Sequence Analysis Team, October 7th-8th, 2003
In this practical we will use both WWW-based tools and locally installed programs on our LINUX PCs which cover the core sequence analysis activities: querying databases and making multiple sequence alignments.
In the first part, we will run some database similarity search tools provided at EMBL and available through the web. These can be accessed from any computer and are simple to use. Web servers are often the nicest way to do sequence analysis. But you should be aware that they can be unreliable, need constant care from their providers and are not suited to every task. Sometimes you have to run programs on local machines too. Database search tools are well-suited to web servers and we will try out the Blast and Bioccelerator servers at EMBL. Examination of the outputs may reveal some differences between the results, depending on the type of algorithm used in the sequence comparison. We will also modify the query and search set ups to illustrate the importance of a little thought in advance of (or better late than never...) database searching. Rule No. 1 is "Know your sequence!
In the second part we will investigate protein architecture and function. Many interesting proteins are multidomain and some have natively disordered regions too. Complex protein architecture is indicative of a complex function and complex regulation of function.
In the third part, we will make a multiple alignment of a protein sequence family and calculate a tree from the sequences. We will study the tree to see whether it fits with our current understanding of phylogeny.
Part 1. WWW sequence similarity search Tools
We will use:
- SRS5 to extract query sequences.
- The Bork group's BLAST2 server for BLAST searches (fast initial ungapped comparison).
- Our group's Bioccelerators for Smith-Waterman searches (exhaustive gapped comparison).
Getting started
The teaching machines are INTEL PCs running the LINUX OS. It will take a few moments to get set up.
- Login with your EMBL name and password.
- Start netscape from the second pullup starting icon at lower left of the screen.
- Load this page into it. (It is a few clicks from EMBL's home page).
- Check that javascript and style sheets are enabled in the netscape edit > preferences > advanced menu.
- The netscape fonts are too small for the screen resolution:
- Using preferences set fonts to 14 or 18 point.
Step 1 Choosing an snRNP SM protein as query
SM proteins are found in snRNP complexes. There are quite a number in Swiss-Prot and they are fairly divergent, so it is difficult (or impossible) to detect them all in a search with a single query sequence. All SM proteins share a small globular domain, but many have a C-terminal non-globular domain too. This will be used to illustrate the problems of searching with multi-domain proteins.
- Open a new navigator window and load this page into it.
- Click using the middle mouse button to open the SRS5 top page in a new window.
- (SRS (Sequence Retrieval System) is a powerful and widely used tool to retrieve information from sequence databases.)
- Click Start the session.
- Tick the box for the Swiss-Prot database and click continue.
- Set the field selection (by default All Text) to Description.
- Type snrnp & sm & b in the Description box, then Do Query.
- Click on the entry RSMB_Human.
- Swiss-prot entries often have useful hypertext links: What does the PFAM link do?
You now have the sequence of human SM-B protein available in a form that can be cut and pasted into the DB query forms.
Step 2 BLAST2 searching with human SM-B protein
BLAST2 is an upgraded version of BLAST, one of the most widely used database search packages. The BLAST programs find the best matching ungapped sections in a sequence comparison. The most important modification for the user to note in BLAST2 is that neighbouring ungapped segments can now be concatenated by allowing gaps between them. This improves both sensitivity and interpretation of the results.
-
- Click using the middle mouse button to open the BLAST2 query submission page.
- It is worth familiarising yourself with the layout and consulting the helps.
- Paste in the RSMB_Human sequence from the SRS browser (consult the help for format).
- Select the Swiss-Prot database.
- Set the filter option to none.
- Check the number of top hit descriptions and alignments to be shown: set to 100 or so.
- Start the BLAST search: this should take a few minutes at most.
- Save output to a new file, so that you do not lose it.
- Examine output, and investigate the detected entries by using the SRS links.
Questions
- 1. How many SM proteins are detected above the first false positive?
- 2. Is there another class of protein that is strongly detected?
- 3. If so, is this biologically meaningful?
- 4. Are the P-values a reliable guide to homology?
Step 2B BLAST 2 search with SM-B and a filter
Now repeat the search but filter out segments of "reduced sequence complexity".
- Reload a BLAST2 query submission page by using the middle mouse button
- Paste in the RSMB_Human sequence.
- Select the Swiss-Prot database.
- Set the filter option to SEG+XNU.
- Check the number of top hit descriptions and alignments to be shown: set to 100 or so.
- Start the BLAST search.
- Save output to a new file, so that you do not lose it.
- Examine output, and investigate the detected entries by using the SRS links.
Questions
- 1. How many SM proteins are detected above the first false positive?
- 2. Is there another class of protein that is strongly detected?
- 3. Why are rather few sequences listed?
- 4. How does this setup compare in sensitivity to the unfiltered search?
- 5. Are the P-values are reliable guide to homology?
Step 3 Bic_SW search with human SM-B protein
The Bioccelerator is fast dedicated hardware exclusively designed to speed up dynamic programming (i.e. slow but sensitive) sequence comparison. It is built by the Israeli company Compugen. It can perform a number of search permutations including basic Smith-Waterman, profile searches and Protein v. DNA frame-shifting comparisons. The Smith-Waterman search finds the best matching segments between any two sequences, allowing for gaps to be inserted at any position.
- Click using the middle mouse button to open the Bioccelerator home page in a new window.
- Go to the Bioccelerator Searches page.
- Select application sw_p.
- It is worth familiarising yourself with the layout and consulting the help links.
- Paste the RSMB_human sequence from the SRS browser into the Query Sequence box.
- Select the Swiss-Prot database. (It may already be the default selection in gcg format).
- Now Do Search to start the Bioccelerator run.
- When you get the output, save to a new file, so that you do not lose it.
The search will take a couple of minutes (unless the Bic is busy). When it is finished you can look at the high-score list and alignments in the output and compare the results with BLAST2.
Questions
- 1. How are SM proteins distributed in the output?
- 2. What position is the highest false positive?
- 3. Is another class of proteins strongly detected?
- 4. Are the E-values a reliable guide to the SM protein detections?
- 5. Compared to BLAST:
- (a) Which, if any, is more sensitive?
- (b) Which output is easier to understand?
Step3B Bic-SW search with the SM Domain only
Now repeat the search but use the globular N-terminal domain only.
- Reload the Bioccelerator home page with the middle button.
- Go to the Bioccelerator Searches page.
- Select application sw_p.
- Paste the range 1-82 of RSMB_human into the Bic query form.
- Select the Swiss-Prot database. (It may already be the default selection).
- Now Do Search to start the Bioccelerator run.
- When you get the output, save to a new file, so that you do not lose it.
Questions
- 1. Are more or less SM proteins detected?
- 2. Is another class of proteins strongly detected?
- 3. Are the E-values a reliable guide to the SM protein detections?
- 4. Compared to the BLAST filtered search which, if any, is more sensitive?
- 5. Collect a multiple alignment using the buttons in the header:
- Is this useful to judge the detections?
- Which entries have incomplete sequence fragments?
Hopefully the exercise of varying the query type has illustrated that the way a search is set up is very important. The queries here illustrate the effect of different sequence types. There are other parameters that often influence the search sensitivity. For example when a globular domain is longer, the Gonnet Pam250 matrix would be expected to outperform the default Blosum62 in the detection of divergent homologues, because it is less stringent and so gives longer optimally matching segments. (Over short matches it is noisier and could perform worse). There isn't time today or we would have done profile searches using alignments as input. For this we would use the Gonnet matrix to make the profiles: because of the extra information in the alignment, profiles usually perform better with Pam250 than Blosum62. Also, gap penalties are critical parameters in dynamic programming and should always be tested by trial and error. In other words, it pays to try several variations in the searches, not just accept the results of the first search.
Part 2. Exploring protein architecture and function with SMART, GlobPlot and ELM
Proteins can have dozens of domains and/or short peptide functional sites where only the local peptide sequence is relevant to function, e.g. phosphorylation sites. Although the peptide embodies all the information for function, these motifs often may regulate activity of other parts of the protein. We'll look at some servers that can help to characterise protein architecture.
We will use:
- SMART for revealing known domains
- GlobPlot for exploring protein order and disorder
- ELM for exploring potential short peptide functional sites
Step 1. Comparing a sequence to a database of protein domains
The most sensitive sequence searches use "profiles" - queries based on multiple alignments. In fact we can query an unknown sequence against a set of profiles for known protein families. There are several very useful databases of modules that are found in multidomain proteins, including PFAM at the Sanger Centre, PROSITE at ISREC and SMART at EMBL. They use a form of profile technically described as a "hidden Markov model". We will first check for protein domains in the Src oncoprotein using the SMART server.
- Click using the middle mouse button to open the SMART query page in a new window.
- Toggle on HMM searching against PFAM domains (includes more domains).
- Type src_human into SMART's Sequence ID box.
- Click on the Sequence SMART Button.
- The search should take about a minute unless the server is busy.
- When you get the results, note the domain "bubble" diagram and the table of matching domains.
Questions
- Based on your recent experiences would you say the E-value scores are good?
- What happens if you click on a domain bubble?
- Is the domain common?
- Is there any literature on the domain?
- Are there structures for any of these domains?
- What is the longest region that has no known domain?
- Do you think this protein has especially many or few domains?
- (You could repeat the SMART query with FBN1_HUMAN, the Marfan Syndrome protein.)
Step 2. Exploring order/disorder with GlobPlot
It can be be important to discern nonglobular regions of proteins: They often have have short functional sites e.g. histone tails (interesting) and they can interfere with protein crystallisation (bad). GlobPlot uses "coil preferences" for the amino acids to distinguish nonglobular and globular regions of multidomain proteins. It uses a simple graphical approach based on summing the parameters so that the slope of the graph indicates the nature of the sequence. A rising positive slope has a nonglobular preference while a negative slope indicates globular preference. Unlike sliding window algorithms, this approach is good for finding segments of any length in a sequence.
- Click using the middle mouse button to open GlobPlot in a new window.
- We will run it in default mode today:
- Note that there are parameters to affect the smoothness of the curve, switch off SMART etc. Click on Propensities to see the Russell/Linding amino acid "coil" propensity values
- GlobPlot draws a simple graph by summing these values along the sequence -
- Do you think it will work well?
- Type SRC_HUMAN in the swissprot ID box and click GlobPlot NOW.
- Globplot can take a minute to run as it uses the SMART server too.
- Examine the output, including the sequences with non-globular assignments below the graph.
Questions:
- Is the slope mainly positive or negative?
- Are there any peaks/troughs where the slope inverts?
- Could we use this to collect segments with a given conformational preference?
- What is the longest "putative unstructured segment" listed by Globplot?
- Do SMART and GlobPlot agree on where there is globular structure?
- There are lots of proteins that give interesting and informative GlobPlots:
- If there was time you could try CBP_HUMAN, P53_HUMAN, PRP1_HUMAN.
Step 3. Searching for short functional sites with the ELM server
Src is an example of a protein that has many small functional sites for modification and/or interaction with ligands. We term these "linear motifs" because they do not require 3D structure for function, needing only to be sufficiently accessible. Motif functions include ligand recognition, amino acid modification, signalling, cell compartment targeting, cleavage and so forth. There are probably less categories of motif than globular domains but there are probably more instances in a eukaryotic proteome. As part of a consortium, we have begun to collect these motifs and develop a new database, ELM. Currently we have about 100 patterns entered in the database. We are developing a web interface to allow sequences to be compared to the patterns. Motif prediction presents difficulties as matches are not statistically significant, so the user needs to think logically about which motifs/domains are incompatible with each other. Part of the ELM project involves developing filters to reduce the number of false positive matches.
Looking for conserved motifs in the human and Xenopus src protein N-termini:
- Click using the middle mouse button to open the ELM server query page in a new window.
- Type src_human into the SWISS-PROT ID box.
- Specify species as Homo sapiens and cellular location as cytoplasm.
- src is actually directed to the inner plasma membrane surface by myristoylation.
- Click on the Submit Button. The results will appear in a new window.
- Immediately start a new search with src1_xenla and set species to Xenopus laevis.
- The searches should take about 1-2 minutes unless the ELM and SMART servers are busy.
- Look over the outputs and answer the questions below.
Questions
- Why have results been sorted by globular domain filtering?
- Is this filter working well?
- (Note: there are usually phosphorylation sites inside tyrosine kinase domains.)
- Find the set of reported motifs that obey the following criteria:
- (1) They are in the non-globular N-termini (approx. residues 1-80).
- (2) They are found at the same place in human and frog (indicating that there is functional conservation).
- Are there conserved N-myristoylation sites?
- Are there any conserved phosphorylation sites?
- Are there any conserved cyclin binding sites?
- Is a cyclin binding site meaningful on its own?
- Are there more cyclin-binding than CDK phosphorylation sites?
- Is src likely to be phosphorylated at specific points in the cell cycle?
There are 8 src-like kinases in the human proteome with partially redundant function. Presumably they will be substrates of CDK too?
- Do new ELM queries with the src-like kinases yes_human and yes_xenla to confirm that they are also CDK substrates.
Step 4. Exploring the architecture of the protein Epsin1
Epsin is a protein involved in clathrin-mediated endocytosis. It binds to the membrane, inducing curvature and is regulated by many adaptor protein interactions. Endocytosis is a highly dynamic process involves many different proteins that come together in transient complexes. The whole system takes extensive advantage of short linear motifs. Let's check some out!
- Submit SMART, GlobPlot and ELM queries with entry O88339 (rat Epsin1).
- In ELM, remember to set cell compartment to cytoplasm.
Questions
- How many globular domains are reported in the SMART output?
- Is there any indication that exon duplication has occurred?
- To check further: Click on the Display all proteins with similar domain composition.
- Do SMART and GlobPlot agree about where the sequence is globular?
- How big is the largest segment of disordered sequence indicated by GlobPlot?
Many short functional motifs are mapped in Epsins already, notably:
- 2 clathrin boxes
- 3 EH domain binding motifs (NPF motifs)
- 8 AP2-binding motifs (DP[WF] motifs)
- 1 Pip2 binding motif
- 3 Ubiquitin interacting (UIM) motifs
- Which of these motifs are not found by ELM?
- Note: the ELM DB does not have anything like full coverage yet so there may be no entry yet.
- Are any found by SMART instead?
- Which of these motifs has been "rescued" from the ENTH domain?
- Click on its link to get an idea why it should be rescued.
- Are there any other endocytosis ELMs picked up inside the ENTH domain?
- Are there any other endocytosis motifs picked up in the Epsin tail that are not in the above list?
- Would it be worth doing an experimental assessment?
- Are there any other motifs that one might consider following up experimentally?
- Would you make a multiple alignment first to check for motif conservation?
Part 2. Take home lessons
We need to know all the functional domains and motifs in a protein to truly understand how it functions: in our examples, the short peptide motifs easily outnumber the globular domains. Bioinformatics resources can help us find many of these components but are by no means comprehensive. Known domains can be assigned with good statistical confidence. In the case of the ELM short functional sites, there is no statistical support for candidates. ELM results should be filtered - partly by ELM itself but also by the user. Checking for conservation in closely related proteins is a good test whether ELM matches should be followed up.
Part 3. Making multiple sequence alignments and calculating a tree
We will use:
- Clustal X for calculating both the alignment and the tree.
- NJplot for displaying and manipulating the tree.
Getting Started
Both these programs run on desktop Macs and PCs but today we will run them on the LINUX stations where they are already installed.
On your LINUX PC:
- Open an Xterm (available from the lower menus of the desktop).
- On the unix command line type clustalx &. The Clustal X program should be set up and appear in a new window.
Step 1. Getting a set of EF-TU / EF-1A sequences
Elongation factors are found in all species so have often been used for phylogenetic investigations. EF-TU in eubacteria and EF-1A in eukaryotes are orthologous factors. There are >150 entries in SWISSPROT which would take too long to align today so we will use the SRS query manager to provide a representative selection.
- Open another Linux Xterm.
- Now start Netscape from the command line.
- In a Netscape window, go to SRS and click Start a new SRS session.
- Check the box for the Swiss-Prot database and continue.
- Click on the Query Manager button:
- The query manager allows "advanced users" to do powerful query and list refinements.
- Using the mouse, select and cut the following query text:
[swissprot-MainText:
EF11_HUMAN | EF12_HUMAN | EF12_XENLA | EF13_XENLA | EF1A_GIALA |
EF1A_PLAFK | EF1A_WHEAT | EF1A_YEAST | EF11_DROME | EF11_MOUSE |
EF11_SCHPO | EF11_XENLA | EF1A_ARATH | EF1A_CAEEL | EF1A_CHICK |
EF1A_DICDI | EF1A_ENTHI | EF1A_HALHA | EF1A_METJA | EF1A_PODAN |
EF1A_PYRWO | EF1A_SULAC | EF1A_THEAC |
EFTU_BACSU | EFTU_ECOLI | EFTU_HAEIN | EFTU_MYCPN | EFTU_THEAQ |
EFTU_AQUAE | EFTU_RICPR | EFTU_HUMAN | EFTU_BOVIN | EFTU_YEAST]
- Paste the text into the query box and click the expression button to execute the query.
- SRS should return a list of entries matching the query.
- Now click the save button.
- Set Use view to FastaSeqs and click the save button.
- Using the netscape menus, save the sequences to a text format file called eftu.fasta.
Step 2. Aligning the elongation factor sequences with Clustal X
Multiple Alignments have many uses. They are used for revealing important conserved residues, for making phylogenies, for secondary structure prediction etc.
- Using the Clustal X menus:
- Invoke File > Load Sequences and select eftu.fasta.
- The sequences will be displayed in the clustalx main window.
- Using the Alignment menu select Do complete alignment and click the align button.
- The alignment calculation will take a couple of minutes, then the alignment will be displayed.
- Examine the alignment and answer the questions below.
Questions
- Are there any completely conserved residues?
- What is the graph under the alignment showing?
- Is the order of the sequences changed after alignment?
- How might the sequences be regrouped?
- Some proteins have N-terminal extensions
- Are these bacterial or eukaryotic?
- What is the function of this extension?
- (Hint: You can check the entry in SRS)
- Are the residues coloured by
- Proximity in the alphabet?
- Physicochemical properties?
- Why are some residues not coloured?
- Are Gly and Pro always coloured?
- What is so special about these residues?
Step 3. Calculating a tree with Clustal X and displaying it with NJplot
Clustal X uses the neighbour-joining method to calculate trees. This is a distance method (based on distances between sequences) that gives reasonable results. NJ is not the best method (usually said to be the computationally intensive Maximum-Likelihood approach) but is fast and good for a quick examination of tree topology. In particular, NJ is less robust than ML to variation in mutation rates between the sequences. A common artefact of unequal rates is that fast evolving sequences (which have long branches) exhibit "long branch attraction" - moving toward each other and deeper into the tree than their true positions.
Calculate the tree:
- Using the tree menu, toggle on Correct for multiple substitutions.
- This approximates the actual number of differences (ie the number of mutations that occurred), not just the visible differences between 2 sequences.
- Then choose Bootstrap N-J tree.
- Set number of bootstrap trials to 100 (to make it go quicker).
- Bootstrapping is a resampling method to look at the stability of branches when a sequence is slightly modified. The idea is that unstable branches should not be used for supporting a phylogeny.
- Note the name of the tree (by default eftu.phb) and click the OK button.
- The tree and bootstrapping calculations may take a minute or so.
Display the tree:
- On the command line in a LINUX Xterminal, type njplot.
- The NJplot window will appear.
- Use File > Open to load in the tree file eftu.phb.
- NJ trees are unrooted and NJplot guesses how to root the tree.
- Is the tree plausibly displayed?
- If not, then use New Outgroup to display the tree so that the longest internal branch roots the tree.
- This corresponds to the last common ancestor of archaea and eubacteria.
- Toggle on the display of Bootstrap values.
- Now examine the tree and answer the questions.
- (If you want to print the tree out, save plot as postscript and print using command lpr -Plw-v111 eftu.ps).
Questions
- Are eukaryotic sequences on the archaeal or eubacterial divide or both?
- Some trees support a clade of fungi and animals with plants outside
- Did toads split off before the plants, animals and fungi diverged?
- Does the annotaton in the Xenopus entries help to understand this?
- Which Xenopus entry has the longest branch?
- Might it's position be artefactual?
- Branch lengths provide an estimate of sequence divergence
- Which sequence is closest to the root?
- Is it closer in time to the last common ancestor?
- Is it closer in sequence to the last common ancestor?
- Is it a result of horizontal transfer of sequences from archaea?
- Are there any eukaryotic and archaeal sequences with notably short branches?
- Chemical reaction rates increase with temperature
- Therefore thermophilic prokaryotes like Aquifex and Pyrococcus will be subject to much higher mutation rates and evolve faster than mesophiles won't they?
- Mitochondria are derived from an ancestral alpha-purple bacterium and Rickettsia is one of their closest relatives
- Do the eukaryotic mitochondrial EF-TUs and Rickettsia share a unique branch?
- Are all the mitochondrial sequences on a unique branch?
- Which likely evolved faster, mitochondrial or bacterial sequences?
- Might it improve the topology to incorporate more sequences from mitochondria and their relatives?
- Is Giardia lambdia the earliest diverged eukaryote?
- Since it lacks a mitochondrion, is this good evidence for the emergence of amoeboid anaerobic eukaryotes with cytoskeleton, nucleus etc. that later engulfed a symbiotic purple bacterium to give rise to the aerobic, mitochondrial eukaryotes?
- Bootstraps report the stability of branches
- Which branch has the lowest bootstrap?
- Do the shortest branch lengths have the highest bootstrap values?
- Do the bootstraps help to warn us that the mitochondrial EF-TUs might be misplaced?
- Can they substitute for a proper rate test (i.e. whether the tree approximates to a constant molecular clock)?
Answers to all the questions are provided on another page. Click here.
Part 3 Take Home Lessons
It is said that biology cannot be understood without setting it in an evolutionary context. Comparative sequence analysis is a continuation of the Darwinian tradition. Phylogenetic trees are fascinating in themselves but, in conjunction with multiple sequence alignments, are also important tools for gaining insight into the function of sequence families. However, tree calculations are unreliable unless there are plenty of diagnostic mutations to correctly assign the branching order. Variation in rate of sequence evolution confounds the algorithms, and can give rise to highly misleading trees: as we saw here, parts of the tree were obviously wrong when we apply extrinsic knowledge. Various mechanisms can give rise to rate increases: obviously selection for a new function (which can also help fix neutral mutations by a piggy-back mechanism); conversely a loss of function mutation can release purifying selection, also increasing fixation of neutral mutations. Other factors such as effective population size are important too: the larger the population size, the lower the likelihood that a given polymorphism will become fixed. Why do thermophilic prokaryotes evolve very slowly even though the chemically induced mutation rate ought to be higher? Perhaps it is because they live in an environment that has existed for 4 billion years, in actual physical locations that change on a slow geological timescale so selection is primarily conservative (purifying), and the effective population size is very large? At any rate, this serves to remind us that when we look at sequence divergence we see the accepted mutation rate and this will depend on many factors.