Predocs02 Course
Basic Tools in Sequence Analysis Practical
by Toby Gibson, Chenna Ramu, Aidan Budd and Francesca Diella, October 8-9th, 2001
Step 2 BLAST2 searching with human SM-B protein
Questions and Answers
- 1. How many SM proteins are detected above the first false positive?
- About 7, though this could change with database updates. Check the SRS links to see whether the entries carry a PFAM SM link.
- 2. Is there another class of protein that is strongly detected?
- Collagens.
- 3. If so, is this biologically meaningful?
- No, and yes. These proteins cannot be homologous: they work in different compartments and do different things. But it is possible that the Pro-rich regions serve to aggregate the SM proteins in complexes, perhaps as irregular fibres, and that is not so different from Pro-rich collagen assembly, although collagen makes a regular repeat.
- 4. Are the P-values a reliable guide to homology?
- Not in this case where the reduced sequence complexity confounds the underlying assumptions.
Step 2B BLAST 2 search with SM-B and a filter
Questions and Answers
- 1. How many SM proteins are detected above the first false positive?
- Maybe more than before?
- 2. Is there another class of protein that is strongly detected?
- Collagens are no longer detected strongly
- 3. Why are rather few sequences listed?
- The bulk of sequences are excluded by the BLAST cut-off defaults.
- 4. How does this setup compare in sensitivity to the unfiltered search?
- Somewhat more sensitive.
- 5. Are the P-values a reliable guide to homology?
- Up to a point now. Good (low) P-values are reliable, but many SM proteins are still not detected, so P-values are not a reliable guide for the more divergent but related sequences.
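To see why filtering removes the collagen hits, here is a minimal sketch of low-complexity masking, assuming a simple sliding-window Shannon entropy test. This is not the filter that BLAST itself applies; the function names, window size, threshold and toy sequence are all illustrative assumptions.

```python
import math
from collections import Counter

def window_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mask_low_complexity(seq: str, window: int = 12, threshold: float = 2.2) -> str:
    """Replace every residue that falls in a low-entropy window with 'X'.

    The window size and threshold are arbitrary illustrative values.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)

# Hypothetical toy sequence: a Pro/Gly-rich repeat between two varied flanks.
toy = "MKLVDETRGQ" + "PPGPPGPPGPPGPPG" + "AWHNDKSEYT"
masked = mask_low_complexity(toy)
print(toy)
print(masked)
```

Once the repeat is masked it can no longer contribute score, so a search is driven by the more varied regions of the query rather than by composition alone.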
Step 3 Bic_SW search with human SM-B protein
Questions and Answers
- 1. How are SM proteins distributed in the output?
- Answers are roughly similar to the BLAST results.
- 2. What position is the highest false positive?
- 3. Is another class of proteins strongly detected?
- 4. Are the E-values a reliable guide to the SM protein detections?
- 5. Compared to BLAST:
- (a) Which, if any, is more sensitive?
- In this case, probably about the same, the reduced sequence complexity dominates over algorithmic differences.
- (b) Which output is easier to understand?
- I find BIC easier. But many people may already be familiar with BLAST, although I think it is a poor visual format.
Step 3B Bic_SW search with the SM Domain only
Questions and Answers
- 1. Are more or fewer SM proteins detected?
- Answers roughly similar to the BLAST results.
- 2. Is another class of proteins strongly detected?
- 3. Are the E-values a reliable guide to the SM protein detections?
- 4. Compared to the BLAST filtered search which, if any, is more sensitive?
- Not a lot of difference. In theory Smith-Waterman (BIC) has an edge in sensitivity, but this is more likely to be seen with larger domains (and a divergent exchange matrix), where it can enhance the signal-to-noise ratio.
- 5. Collect a multiple alignment using the buttons in the header:
- Is this useful to judge the detections?
- Yes, now the highly conserved columns in SM proteins become clear. Even the more divergent SM proteins should have identical or similar residues in key positions.
- Which entries have incomplete sequence fragments?
- Several; one rat sequence is highly truncated at the N-terminus.
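For readers who have not met the Smith-Waterman algorithm behind Bic_SW, the sketch below computes the best local alignment score by dynamic programming. It is a simplification under stated assumptions: a flat match/mismatch score stands in for an exchange matrix such as Blosum62, the gap penalty is linear, and the function name and toy sequences are hypothetical.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (score only, no traceback)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 is what makes the alignment local:
            # a poorly scoring prefix is simply dropped.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

# Hypothetical toy sequences sharing the segment WKLV.
print(smith_waterman_score("GGWKLVNN", "TTWKLVPP"))
```

Changing `gap` or the match/mismatch values changes which segments score best, which is why such parameters are worth varying between searches.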
Step 4. Bic_profilesearch based on an alignment of SM proteins
Step 4B. BIC_Profilesearch with an SM domain profile prepared with the Gonnet Pam250 matrix
Questions and answers
- 1. How are SM entries distributed in the output?
- There are lots more and they are better grouped without false positives.
- 2. Are the E-values a reliable guide to SM protein detections?
- Much better. Only some truncated SM proteins are missing. Based on our experience, any hit with an E-value better (lower) than e-4 for a globular domain profile should be true, and any hit better than e-2 is usually true.
- 3. Is the profile search more or less sensitive than the single sequence queries?
- Much more sensitive. This is generally true but this example is a bit misleading - in this case, almost every sequence is in the profile query already so it is not surprising it is good.
- 4. Collect a multiple alignment using the buttons in the header:
- Is this useful to judge the detections?
- As before - yes - except now we have even more sequences and greater divergence.
- Can you see any conserved positions in the alignment?
- Yes I hope so.
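The idea of a profile can be sketched as a table of per-column scores built from an alignment. The example below is only an illustration, assuming log-odds scores against a uniform background with simple pseudocounts and gapless scanning; Bic_profilesearch instead uses an exchange matrix (Gonnet Pam250 here) and gap penalties. The alignment, names and parameters are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Per-column log-odds scores (bits) against a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile

def score_segment(profile, segment):
    """Sum the column scores for a segment as long as the profile."""
    return sum(col.get(res, 0.0) for col, res in zip(profile, segment))

def best_window(profile, seq):
    """Return (score, start) of the best gapless match of the profile in seq."""
    width = len(profile)
    return max((score_segment(profile, seq[i:i + width]), i)
               for i in range(len(seq) - width + 1))

# Hypothetical toy alignment of a five-residue block.
toy_alignment = ["GDKVL", "GDRVL", "GEKIL", "GDKVM"]
profile = build_profile(toy_alignment)
print(best_window(profile, "AAAGDKVLAAA"))
```

Conserved columns give large positive scores to the expected residue, so a divergent family member that keeps the key positions can still score well.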
Step 5. Comparing a sequence to a database of protein domains
Questions and answers
- Based on your recent experiences would you say the E-value scores are good?
- Anything below about e-4 is pretty reliable. Above that, false matches may start to creep in.
- What happens if you click on a domain bubble?
- This is really just to show people that they can find out more about the domains.
- Is the domain common?
- Most are
- Is there any literature on the domain?
- should be
- Are there structures for any of these domains?
- yes
- Is this protein likely to be in the nucleus, cytoplasm or extracellular compartments?
- It spans the cytoplasmic and extracellular compartments.
- Can you say what kind of protein it is?
- a transmembrane receptor tyrosine kinase.
- Do you think this protein has especially many or few domains?
- Try repeating the SMART search with FBN1_HUMAN, the Marfan Syndrome protein.
Step 6. Looking for targeting signals in proteins by comparing sequences to a set of protein motifs
Questions and answers
- To where is the protein being targeted?
- Ras to the membrane via CAAX box prenylation
- Catalase to the peroxisome using PTS1
- Annexin to the membrane via N-terminal myristoylation
- Calreticulin to the endoplasmic reticulum via KDEL retention signal
- (note signal peptide not checked here)
- Is the targeting motif N-, C-terminal or in the middle?
- 1 N and 3 C terminal.
- Would this location eliminate any of the other ELM matches?
- (To the extent that you are aware of the biology!)
- Yes e.g. nuclear and cytoplasmic motifs are ruled out for ER-located calreticulin.
- Catalase is exclusively peroxisomal (or you are ill...) where there are few other ELMs.
- Of course some proteins may be found in more than one compartment so caution!
- Why are some motifs being filtered out?
- Because they clash with a statistically verified globular domain.
- Do you think this is reliable?
- No because some ELMs might occur in exposed loops in domains.
- Might there be better ways of doing domain filtering?
- Yes, e.g. structures should give a better idea of whether an ELM instance is feasible.
- Are compartment targeting signals always N- or C-terminal?
- Many types are exclusively terminal but this is not a rigorous principle. Most obviously the NLS is usually internal.
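Short linear motifs of this kind are usually written as regular expressions, and the N- or C-terminal anchoring is part of the pattern. The sketch below illustrates that idea only: the four patterns are simplified approximations, not the curated ELM expressions, and the sequences and function names are hypothetical.

```python
import re

# Simplified, illustrative patterns. These are NOT the curated ELM expressions.
MOTIFS = {
    "CAAX box (prenylation)": r"C[^DENQ][LIVM].$",
    "PTS1 (peroxisome)": r"[SAC][KRH][LM]$",
    "KDEL-type ER retention": r"[KH]DEL$",
    "N-myristoylation": r"^MG[^EDRKHPFYW]..[STAGCN][^P]",
}

def classify_position(start, end, length):
    """Label a match by where it sits in the sequence."""
    if start == 0:
        return "N-terminal"
    if end == length:
        return "C-terminal"
    return "internal"

def find_motifs(seq):
    """Return (motif name, start, end, position label) for every match."""
    hits = []
    for name, pattern in MOTIFS.items():
        for m in re.finditer(pattern, seq):
            hits.append((name, m.start(), m.end(),
                         classify_position(m.start(), m.end(), len(seq))))
    return hits

# Hypothetical toy sequences, not real database entries.
examples = {
    "ras_like": "MTEYKLVVVGAGCMSCKCVLS",
    "pts1_like": "MADSRDPASDQMKHWKEQRAANLSKL",
    "kdel_like": "MLLSVPLLLGLLGLAVAEEKDEL",
    "myr_like": "MGCTLSAEERAALERSK",
}
for label, seq in examples.items():
    print(label, find_motifs(seq))
```

Patterns this short match by chance very often, which is why context such as compartment and domain structure is needed to judge each hit.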
Part 1 Take Home Lessons
Hopefully the exercise of varying the query type has illustrated that the way a search is set up is very important. The queries here illustrate the effect of different sequence types. There are other parameters that often influence the search sensitivity. For example when a globular domain is longer, the Gonnet Pam250 matrix would be expected to outperform the default Blosum62 in the detection of divergent homologues, because it is less stringent and so gives longer optimally matching segments. (Over short matches it is noisier and could perform worse). In fact we used the Gonnet matrix to make the profiles: because of the extra information in the alignment, profiles usually perform better with Pam250 than Blosum62. Also, gap penalties are critical parameters in dynamic programming and should always be tested by trial and error. In other words, it pays to try several variations in the searches, not just accept the results of the first search.
Part 2. Making multiple sequence alignments and calculating a tree.
Step 2. Aligning the elongation factor sequences with Clustal X
Questions and answers
- Are there any completely conserved residues?
- Yes
- How are these marked?
- With a *
- What is the graph under the alignment showing?
- Column conservation. For example, it can be used to quickly find highly conserved blocks.
- Is the order of the sequences changed after alignment?
- Yes
- How might the sequences be regrouped?
- They are clustered into similar groups using the guide tree topology. In this case archaeal, eukaryotic and eubacterial groupings are made.
- Some proteins have N-terminal extensions
- Are these bacterial or eukaryotic?
- eukaryotic EF-TUs
- What is the function of this extension?
- Mitochondrial import sequence
- (Hint: You can check the entry in SRS)
- Are the residues coloured by
- Proximity in the alphabet?
- No
- Physicochemical properties?
- Yes. By testing for a consensus in each column. The algorithm is very simple and anomalies can easily be found if you look carefully.
- Why are some residues not coloured?
- They do not fit to a consensus for that column.
- Are Gly and Pro always coloured?
- Yes
- What is special about these residues?
- They are secondary structure breakers, found mostly in loops. It is useful to see all occurrences.
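The '*' markers and the conservation graph can be imitated with a very simple calculation. The sketch below assumes conservation is just the fraction of sequences sharing the most common residue in a column; Clustal X uses its own column quality score, so this is an approximation of the idea, and the alignment and function names are hypothetical.

```python
from collections import Counter

def column_conservation(alignment):
    """Fraction of sequences that share the most common residue, per column."""
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores

def star_line(alignment):
    """Mark completely conserved, gap-free columns with '*'."""
    return "".join(
        "*" if len(set(col)) == 1 and col[0] != "-" else " "
        for col in zip(*alignment)
    )

# Hypothetical toy alignment.
toy_alignment = ["GKTTLTA", "GKSTLTA", "GKTSL-A"]
for seq in toy_alignment:
    print(seq)
print(star_line(toy_alignment))
print(column_conservation(toy_alignment))
```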
Step 3. Calculating a tree with Clustal X and displaying it with NJplot
Questions and answers
- Are eukaryotic sequences on the archaeal or eubacterial divide or both?
- Why?
- One factor is mitochondrial, the other cytoplasmic
- Some trees support a clade of fungi and animals with plants outside
- What about this one?
- This one does too. This is counterintuitive because fungi look more like plants.
- Did toads split off before the plants, animals and fungi diverged?
- No! but one of the sequences seems to suggest that.
- Does the annotation in the Xenopus entries help to understand this?
- Yes. It is an oocyte specific factor so not typical.
- Which Xenopus entry has the longest branch?
- The oocyte-specific factor.
- Might its position be artefactual?
- Probably. We don't so far find similar sequences in plants and fungi, which is required by this position. The Xenopus factor may be incorrectly placed by long branch attraction caused by rapid sequence evolution.
- Branch lengths provide an estimate of sequence divergence
- Which sequence is closest to the root?
- Aquifex
- Is it closer in time to the last common ancestor?
- No!
- Is it closer in sequence to the last common ancestor?
- Yes, as far as we can tell. It retains more in common with the archaeal sequences than other bacteria do, and these conservative traits suggest it is more like the common ancestor.
- Is it a result of horizontal transfer of sequences from archaea?
- No! It is clearly in the eubacterial clade. It is possible that some archaeal genes have been transferred to Aquifex as it lives in close association with some species. But EF-TU behaves typically for the Aquifex branch point.
- Are there any eukaryotic and archaeal sequences with notably short branches?
- Yes, e.g. Pyrococcus in Archaea and Entamoeba in Eukaryotes. These may be conservative organisms too.
- Chemical reaction rates increase with temperature
- Therefore thermophilic prokaryotes like Aquifex and Pyrococcus will be subject to much higher mutation rates and evolve faster than mesophiles won't they?
- They don't! Exactly the opposite. This is also counterintuitive, but many thermophiles are extremely conservative.
- Mitochondria are derived from an ancestral alpha-purple bacterium and Rickettsia is one of their closest relatives
- Do the eukaryotic mitochondrial EF-TUs and Rickettsia share a unique branch?
- They should do but they don't here.
- Are all the mitochondrial sequences on a unique branch?
- they should be but they aren't
- Which likely evolved faster, mitochondrial or bacterial sequences?
- Must be the mitochondria, as it is they that suffer the long branch attraction. They should be joined to Rickettsia, but still have a long branch projecting out of the tree.
- Might it improve the topology to incorporate more sequences from mitochondria and their relatives?
- It might, even probably, but not definitely. It is usually best to focus on the area of interest and exclude unnecessarily divergent sequences. Many studies have shown this. This can improve the signal-to-noise ratio. But whether it helps will depend on how much signal there is in the first place and whether artefacts like long branch attraction can be overcome.
- Is Giardia lamblia the earliest diverged eukaryote?
- yes in this set.
- Since it lacks a mitochondrion, is this good evidence for the emergence of amoeboid anaerobic eukaryotes with cytoskeleton, nucleus etc. that later engulfed a symbiotic purple bacterium to give rise to the aerobic, mitochondrial eukaryotes?
- This is a parsimonious assumption, as Giardia (and other amitochondrial genera) are all early diverging eukaryotes. But it is wrong. All the amitochondrial eukaryotes analysed so far, like Giardia and microsporidia, have genes that derived from a mitochondrial endosymbiont. So they have secondarily lost their mitochondria. It appears that there never was such a eukaryote. The eukaryotic lineage began with symbiosis of an archaeon and a purple bacterium, and the eukaryotic features evolved thereafter.
- Bootstraps report the stability of branches
- Which branch has the lowest bootstrap?
- Do the shortest branch lengths have the highest bootstrap values?
- No, in general the other way round.
- Do the bootstraps help to warn us that the mitochondrial EF-TUs might be misplaced?
- No
- Can they substitute for a proper rate test (i.e. whether the tree approximates to a constant molecular clock)?
- No. They measure branch stability but not tree distortion due to violation of the clock. This must be tested for separately.
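What a bootstrap measures can be made concrete with a small resampling sketch. It assumes the simplest possible setting: alignment columns are resampled with replacement, distances are plain proportions of differing sites, and the only question asked of each replicate is which pair of sequences is closest. Clustal X instead rebuilds a full neighbour-joining tree per replicate; the function names, toy alignment and replicate count here are hypothetical.

```python
import random

def p_distance(a, b):
    """Proportion of differing sites, ignoring columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)

def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement, keeping the same length."""
    length = len(alignment[0])
    picks = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[i] for i in picks) for seq in alignment]

def closest_pair(alignment):
    """Indices (i, j) of the least distant pair; ties go to the lowest indices."""
    n = len(alignment)
    return min((p_distance(alignment[i], alignment[j]), i, j)
               for i in range(n) for j in range(i + 1, n))[1:]

def pair_support(alignment, pair, replicates=200, seed=0):
    """Fraction of bootstrap replicates in which `pair` is the closest pair."""
    rng = random.Random(seed)
    hits = sum(closest_pair(bootstrap_columns(alignment, rng)) == pair
               for _ in range(replicates))
    return hits / replicates

# Hypothetical toy alignment: two identical sequences and one unrelated one.
toy_alignment = ["ACDEFGHIKL", "ACDEFGHIKL", "WWWWWWWWWW"]
print(pair_support(toy_alignment, (0, 1)))
```

The support value only says how consistently the data favour a grouping under resampling. A grouping produced by a systematic artefact such as long branch attraction can be recovered just as consistently, so high support does not show that the grouping is correct.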