Predocs01 Course
Basic Tools in Sequence Analysis Practical
by Toby Gibson,Chenna Ramu and Aidan Budd, October 8-9th, 2001
In this practical we will use both WWW-based tools and locally installed programs on our UNIX server which cover the core sequence analysis activities: querying databases and making multiple sequence alignments.
In the first part, we will run some database search tools provided by EMBL and available through the web. These can be accessed from any computer and are simple to use. Web servers are often the nicest way to do sequence analysis. But you should be aware that they can be unreliable, need constant care from their providers and are not suited to every task. Sometimes you have to run programs on local machines too. Database search tools are well-suited to web servers and we will try out the Blast and Bioccelerator servers at EMBL. Examination of the outputs may reveal some differences between the results, depending on the type of algorithm used in the sequence comparison. We will also modify the query and search set ups to illustrate the importance of a little thought in advance of (or better late than never...) database searching. Rule No. 1 is "Know your sequence!"
In the second part we will make a multiple alignment of a protein sequence family and calculate a tree from the sequences. We will study the tree to see whether it fits with our current understanding of phylogeny.
Part 1. WWW DB search Tools
We will use:
- SRS5 to extract query sequences.
- The Bork group's BLAST2 server for BLAST searches (fast ungapped comparison).
- Our group's Bioccelerators for Smith-Waterman and profile searches (exhaustive gapped comparison).
- WWWProfileWeight to make the profiles from a sequence alignment.
- The Bork group's SMART server to compare a sequence against a protein domain database.
Getting started
The teaching machines are INTEL PCs running the LINUX OS. It will take a few moments to get set up.
- Login with your EMBL name and password.
- Start the KDE Desktop by typing exec startx.
- Start netscape from the pullup starting icon at the lower left of the screen.
- Load this page into it. (It is a few clicks from EMBL's home page).
- Check that javascript and style sheets are enabled in the netscape preferences in the advanced options.
Step 1 Choosing an snRNP SM protein as query
SM proteins are found in snRNP complexes. There are quite a number in Swiss-Prot and they are fairly divergent, so it is difficult (or impossible) to detect them all in a search with a single query sequence. All SM proteins share a small globular domain, but many have a C-terminal non-globular domain too. This will be used to illustrate the problems of searching with multi-domain proteins.
- Open a new navigator window and load this page into it.
- Load SRS5 and Start the session.
- (SRS (Sequence Retrieval System) is a powerful and widely used tool to retrieve information from sequence databases.)
- Tick the box for the Swiss-Prot database and click continue.
- Set the field selection (by default All Text) to Description.
- Type snrnp & sm & b in the Description box, then Do Query.
- Click on the entry RSMB_Human.
- Swiss-prot entries often have useful hypertext links: What does the PFAM link do?
You now have the sequence of human SM-B protein available in a form that can be cut and pasted into the DB query forms.
Step 2 BLAST2 searching with human SM-B protein
BLAST2 is an upgraded version of BLAST, one of the most widely used database search packages. The BLAST programs find the best matching ungapped sections in a sequence comparison. The most important modification for the user to note in BLAST2 is that neighbouring ungapped segments can now be concatenated by allowing gaps between them. This improves both sensitivity and interpretation of the results.
- Open a new navigator window and load this page into it.
- Load a BLAST2 query submission page.
- It is worth familiarising yourself with the layout and consulting the helps.
- Paste in the RSMB_Human sequence from the SRS browser (consult the help for format).
- Select the Swiss-Prot database.
- Set the filter option to none.
- Check the number of top hit descriptions and alignments to be shown: set to 100 or so.
- Start the BLAST search: this should take a few minutes at most.
- Save output to a new file, so that you do not lose it.
- Examine output, and investigate the detected entries by using the SRS links.
Questions
- 1. How many SM proteins are detected above the first false positive?
- 2. Is there another class of protein that is strongly detected?
- 3. If so, is this biologically meaningful?
- 4. Are the P-values a reliable guide to homology?
Step 2B BLAST 2 search with SM-B and a filter
Now repeat the search but filter out segments of "reduced sequence complexity".
- Reload a BLAST2 query submission page.
- Paste in the RSMB_Human sequence.
- Select the Swiss-Prot database.
- Set the filter option to SEG+XNU.
- Check the number of top hit descriptions and alignments to be shown: set to 100 or so.
- Start the BLAST search.
- Save output to a new file, so that you do not lose it.
- Examine output, and investigate the detected entries by using the SRS links.
Questions
- 1. How many SM proteins are detected above the first false positive?
- 2. Is there another class of protein that is strongly detected?
- 3. Why are rather few sequences listed?
- 4. How does this setup compare in sensitivity to the unfiltered search?
- 5. Are the P-values are reliable guide to homology?
Step 3 Bic_SW search with human SM-B protein
The Bioccelerator is fast dedicated hardware exclusively designed to speed up dynamic programming (i.e. slow but sensitive) sequence comparison. It is built by the Israeli company Compugen. It can perform a number of search permutations including basic Smith-Waterman, profile searches and Protein v. DNA frame-shifting comparisons. The Smith-Waterman search finds the best matching segments between any two sequences, allowing for gaps to be inserted at any position.
- Open a new navigator window and load this page into it.
- Load the Bioccelerator home page.
- Go to the Bioccelerator Searches page.
- Select application sw_p.
- It is worth familiarising yourself with the layout and consulting the help links.
- Paste the RSMB_human sequence from the SRS browser into the Query Sequence box.
- Select the Swiss-Prot database. (It may already be the default selection in gcg format).
- Now Do Search to start the Bioccelerator run.
- When you get the output, save to a new file, so that you do not lose it.
The search will take a couple of minutes (unless the Bic is busy). When it is finished you can look at the high-score list and alignments in the output and compare the results with BLAST2.
Questions
- 1. How are SM proteins distributed in the output?
- 2. What position is the highest false positive?
- 3. Is another class of proteins strongly detected?
- 4. Are the E-values a reliable guide to the SM protein detections?
- 5. Compared to BLAST:
- (a) Which, if any, is more sensitive?
- (b) Which output is easier to understand?
Step3B Bic-SW search with the SM Domain only
Now repeat the search but use the globular N-terminal domain only.
- Reload the Bioccelerator home page.
- Go to the Bioccelerator Searches page.
- Select application sw_p.
- Paste the range 1-82 of RSMB_human into the Bic query form.
- Select the Swiss-Prot database. (It may already be the default selection).
- Now Do Search to start the Bioccelerator run.
- When you get the output, save to a new file, so that you do not lose it.
Questions
- 1. Are more or less SM proteins detected?
- 2. Is another class of proteins strongly detected?
- 3. Are the E-values a reliable guide to the SM protein detections?
- 4. Compared to the BLAST filtered search which, if any, is more sensitive?
- 5. Collect a multiple alignment using the buttons in the header:
- Is this useful to judge the detections?
- Which entries have incomplete sequence fragments?
Step 4. Bic_profilesearch based on an alignment of SM proteins
Profile searches are one of the most sensitive search tools currently available. The raw materials for profile searching are a multiple sequence alignment in conjunction with a residue exchange matrix (e.g. the Gonnet Pam250 matrix). A profile scores the amino acids at each position in the alignment: conserved positions score more strongly than unconserved ones (whereas in a single sequence, they are all equally significant). We can compare the sensitivity to the searches with a single sequence as query.
Step 4A. Preparing a profile from an SM alignment
(We have prepared an alignment for you as there is not enough time to do a multiple alignment today).
- Open a new navigator window and load this page into it.
- Load WWWProfileWeight.
- Load the alignment SM_domain.aln in a new netscape window.
- Cut and Paste the alignment into the Paste box.
- Run ProfileWeight to make the profile.
- Look at the resulting profile:
- (a) See how scores for amino acids vary for each position in the alignment.
- (b) See how the position-specific gap penalties are lowered at existing gaps.
- (c) Note the suggested gap penalties in the header: these are only a rough guide.
- Save Profile to save the profile to a file (e.g. as Sm.prf) for use in the profile search.
Step 4B. BIC_Profilesearch with an SM domain profile prepared with the Gonnet Pam250 matrix
- Open a new navigator window and load this page into it.
- Load the Bioccelerator home page.
- Go to the Bioccelerator Searches Page.
- Select Profilesearch in the Application box.
- In the Upload a file box, give the full directory name of the Sm protein profile.
- (Alternatively you could cut and paste into the Query Sequence box.)
- Give Gap opening penalty 1.0 and extension penalty 0.2.
- Select the Swiss-Prot database.
- Do Search.
- Save the output to a new file, so that you do not lose it.
The search will take a couple of minutes (unless the Bic is busy). When it is finished you can look at the high-score list and alignments in the output. Use the SRS links to learn about the top hits.
Questions
- 1. How are SM entries distributed in the output?
- 2. Are the E-values a reliable guide to SM protein detections?
- 3. Is the profile search more or less sensitive than the single sequence queries?
- 4. Collect a multiple alignment using the buttons in the header:
- Is this useful to judge the detections?
- Can you see any conserved positions in the alignment?
Step 5. Comparing a sequence to a database of protein domains
Since profile searches are so sensitive, it would make sense to query an unknown sequence against a set of profiles for known protein families. There are several very useful databases of modules that are found in multidomain proteins, including PFAM at the Sanger Centre, PROSITE at ISREC and SMART at EMBL. They use a form of profile technically described as a "hidden Markov model", but the end result is very much like the profiles we just ran. We will search for protein domains in an "unknown" protein using the SMART server.
- Open a new navigator window and load this page in it.
- Load the SMART query page.
- Toggle on HMM searching against PFAM domains (includes more domains).
- Get the "unknown sequence" and cut and paste it into SMART's Sequence box.
- Click on the Sequence SMART Button.
- The search should take about a minute unless the server is busy.
- When you get the results, note the domain "bubble" diagram and the table of matching domains.
Questions
- Based on your recent experiences would you say the E-value scores are good?
- What happens if you click on a domain bubble?
- Is the domain common?
- Is there any literature on the domain?
- Are there structures for any of these domains?
- Is this protein likely to be in the nucleus, cytoplasm or extracellular compartments?
- Can you say what kind of protein it is?
- Do you think this protein has especially many or few domains?
- Try repeating the SMART search with FBN1_HUMAN, the Marfan Syndrome protein.
Part 1 Take Home Lessons
Hopefully the exercise of varying the query type has illustrated that the way a search is set up is very important. The queries here illustrate the effect of different sequence types. There are other parameters that often influence the search sensitivity. For example when a globular domain is longer, the Gonnet Pam250 matrix would be expected to outperform the default Blosum62 in the detection of divergent homologues, because it is less stringent and so gives longer optimally matching segments. (Over short matches it is noisier and could perform worse). In fact we used the Gonnet matrix to make the profiles: because of the extra information in the alignment, profiles usually perform better with Pam250 than Blosum62. Also, gap penalties are critical parameters in dynamic programming and should always be tested by trial and error. In other words, it pays to try several variations in the searches, not just accept the results of the first search.
Part 2. Making multiple sequence alignments and calculating a tree.
We will use:
- Clustal X for calculating both the alignment and the tree.
- NJplot for displaying and manipulating the tree.
Getting Started
Both these programs run on desktop Macs and PCs but today we will run them on TAU, a UNIX server where they are already installed.
On your LINUX PC:
- Open an Xterm (available from the lower menus of the desktop).
- Type xhost +
- This allows tau to open X-windows on the local machine.
- Log on to tau:
- if you are working on a Unix machine, type rlogin tau.
- Type cd /scrap/your_username to change to the scrap disk for the practical.
- On the unix command line type prepare clustalx. The Clustal X program will be set up.
- On the unix command line type prepare njplot. NJplot will be set up.
Step 1. Getting a set of EF-TU / EF-1A sequences
Elongation factors are found in all species so have often been used for phylogenetic investigations. EF-TU in eubacteria and EF-1A in eukaryotes are orthologous factors. There are >150 entries in SWISSPROT which would take too long to align today so we will use the SRS query manager to provide a representative selection.
- Open another Linux Xterm.
- Type cd /net/fileserver4/scrap/yourname/
- We want to work on the central disks and this command should mount /scrap.
- Now start Netscape from the command line.
- In a Netscape window, go to SRS and click Start a new SRS session.
- Check the box for the Swiss-Prot database and continue.
- Click on the Query Manager button:
- The query manager allows "advanced users" to do powerful query and list refinements.
- Using the mouse, select and cut the following query text:
[swissprot-MainText:
EF11_HUMAN | EF12_HUMAN | EF12_XENLA | EF13_XENLA | EF1A_GIALA |
EF1A_PLAFK | EF1A_WHEAT | EF1A_YEAST | EF11_DROME | EF11_MOUSE |
EF11_SCHPO | EF11_XENLA | EF1A_ARATH | EF1A_CAEEL | EF1A_CHICK |
EF1A_DICDI | EF1A_ENTHI | EF1A_HALHA | EF1A_METJA | EF1A_PODAN |
EF1A_PYRWO | EF1A_SULAC | EF1A_THEAC |
EFTU_BACSU | EFTU_ECOLI | EFTU_HAEIN | EFTU_MYCPN | EFTU_THEAQ |
EFTU_AQUAE | EFTU_RICPR | EFTU_HUMAN | EFTU_BOVIN | EFTU_YEAST]
- Paste the text into the query box and click the expression button to execute the query.
- SRS should return a list of entries matching the query.
- Now click the save button.
- Set Use view to FastaSeqs and click the save button.
- Using the netscape menus, save the sequences to a text format file called eftu.fasta.
Step 2. Aligning the elongation factor sequences with Clustal X
Multiple Alignments have many uses. They are used for revealing important conserved residues, for making phylogenies, for secondary structure prediction etc.
- In the TAU terminal, type clustalx.
- clustalx will open a new window.
- Invoke File > Load Sequences and select eftu.fasta.
- The sequences will be displayed in the clustalx main window.
- Using the Alignment menu select Do complete alignment and click the align button.
- The alignment calculation will take a couple of minutes, then the alignment will be displayed.
- Examine the alignment and answer the questions below.
Questions
- Are there any completely conserved residues?
- What is the graph under the alignment showing?
- Is the order of the sequences changed after alignment?
- How might the sequences be regrouped?
- Some proteins have N-terminal extensions
- Are these bacterial or eukaryotic?
- What is the function of this extension?
- (Hint: You can check the entry in SRS)
- Are the residues coloured by
- Proximity in the alphabet?
- Physicochemical properties?
- Why are some residues not coloured?
- Are Gly and Pro always coloured?
- What is so special about these residues?
Step 3. Calculating a tree with Clustal X and displaying it with NJplot
Clustal X uses the neighbour-joining method to calculate trees. This is a distance method (based on distances between sequences) that gives reasonable results. NJ is not the best method (usually said to be the computationally intensive Maximum-Likelihood approach) but is fast and good for a quick examination of tree topology. In particular, NJ is less robust than ML to variation in mutation rates between the sequences. A common artefact of unequal rates is that fast evolving sequences (which have long branches) exhibit "long branch attraction" - moving toward each other and deeper into the tree than their true positions.
Calculate the tree:
- Using the tree menu, toggle on Correct for multiple substitutions.
- This approximates the actual number of differences (ie the number of mutations that occurred), not just the visible differences between 2 sequences.
- (You can see the distance matrix using Output format options: It would be instructive to compare the matrices with and without the correction, but we probably don't have time today.)
- Then choose Bootstrap N-J tree.
- Set number of bootstrap trials to 100 (to make it go quicker).
- Bootstrapping is a resampling method to look at the stability of branches when a sequence is slightly modified. The idea is that unstable branches should not be used for supporting a phylogeny.
- Note the name of the tree (by default eftu.phb) and click the OK button.
- The tree and bootstrapping calculations may take a minute or so.
Display the tree:
- On the command line in a TAU Xterminal, type njplot.
- The NJplot window will appear.
- Use File > Open to load in the tree file eftu.phb.
- NJ trees are unrooted and NJplot guesses how to root the tree.
- Is the tree plausibly displayed?
- If not, then use New Outgroup to display the tree so that the long internal branch roots the tree.
- This corresponds to the last common ancestor of archaea and eubacteria.
- Toggle on the display of Bootstrap values.
- Now examine the tree and answer the questions.
- (If you want to print the tree out, save plot as postscript and print using command lpr -Plw-v111 eftu.ps).
Questions
- Are eukaryotic sequences on the archaeal or eubacterial divide or both?
- Some trees support a clade of fungi and animals with plants outside
- Did toads split off before the plants, animals and fungi diverged?
- Does the annotaton in the Xenopus entries help to understand this?
- Which Xenopus entry has the longest branch?
- Might it's position be artefactual?
- Branch lengths provide an estimate of sequence divergence
- Which sequence is closest to the root?
- Is it closer in time to the last common ancestor?
- Is it closer in sequence to the last common ancestor?
- Is it a result of horizontal transfer of sequences from archaea?
- Are there any eukaryotic and archaeal sequences with notably short branches?
- Chemical reaction rates increase with temperature
- Therefore thermophilic prokaryotes like Aquifex and Pyrococcus will be subject to much higher mutation rates and evolve faster than mesophiles won't they?
- Mitochondria are derived from an ancestral alpha-purple bacterium and Rickettsia is one of their closest relatives
- Do the eukaryotic mitochondrial EF-TUs and Rickettsia share a unique branch?
- Are all the mitochondrial sequences on a unique branch?
- Which likely evolved faster, mitochondrial or bacterial sequences?
- Might it improve the topology to incorporate more sequences from mitochondria and their relatives?
- Is Giardia lambdia the earliest diverged eukaryote?
- Since it lacks a mitochondrion, is this good evidence for the emergence of amoeboid anaerobic eukaryotes with cytoskeleton, nucleus etc. that later engulfed a symbiotic purple bacterium to give rise to the aerobic, mitochondrial eukaryotes?
- Bootstraps report the stability of branches
- Which branch has the lowest bootstrap?
- Do the shortest branch lengths have the highest bootstrap values?
- Do the bootstraps help to warn us that the mitochondrial EF-TUs might be misplaced?
- Can they substitute for a proper rate test (i.e. whether the tree approximates to a constant molecular clock)?
Part 2 Take Home Lessons
It is said that biology cannot be understood without setting it in an evolutionary context. Comparative sequence analysis is a continuation of the Darwinian tradition. Phylogenetic trees are fascinating in themselves but, in conjunction with multiple sequence alignments, are also important tools for gaining insight into the function of sequence families. However, tree calculations are unreliable unless there are plenty of diagnostic mutations to correctly assign the branching order. Variation in rate of sequence evolution confounds the algorithms, and can give rise to highly misleading trees: as we saw here, parts of the tree were obviously wrong when we apply extrinsic knowledge. Various mechanisms can give rise to rate increases: obviously selection for a new function (which can also help fix neutral mutations by a piggy-back mechanism); conversely a loss of function mutation can release purifying selection, also increasing fixation of neutral mutations. Other factors such as effective population size are important too: the larger the population size, the lower the likelihood that a given polymorphism will become fixed. Why do thermophilic prokaryotes evolve very slowly even though the chemically induced mutation rate ought to be higher? Perhaps it is because they live in an environment that has existed for 4 billion years, in actual physical locations that change on a slow geological timescale so selection is primarily conservative (purifying), and the effective population size is very large? At any rate, this serves to remind us that when we look at sequence divergence we see the accepted mutation rate and this will depend on many factors.