Predocs03 Course
Basic Tools in Sequence Analysis Practical
by Toby Gibson and the Sequence Analysis Team, October 7-8th, 2003
Part 1. WWW sequence similarity search Tools
Step 2 BLAST2 searching with human SM-B protein
Questions and Answers
- 1. How many SM proteins are detected above the first false positive?
- About 7. Could change with database updates. Need to check the SRS links to see if some entries have PFAM SM link.
- 2. Is there another class of protein that is strongly detected?
- 3. If so, is this biologically meaningful?
- No - and yes. These proteins cannot be homologous. They work in different compartments and do different things. But it is possible that the Pro-rich regions serve to aggregate the SM proteins in complexes, perhaps as irregular fibres, and that is not so different from the Pro-rich collagen assembly, although collagen makes a regular repeat.
- 4. Are the P-values a reliable guide to homology?
- Not in this case where the reduced sequence complexity confounds the underlying assumptions.
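The effect of reduced sequence complexity can be made concrete with a small sketch. This is not the filter applied by the BLAST server; it is a simplified stand-in that scores each window by the Shannon entropy of its residue composition. The window length (12), the threshold (2.2 bits) and the toy sequence are all assumptions chosen for illustration, and the toy sequence is not SM-B.

```python
import math
from collections import Counter

def shannon_entropy(segment: str) -> float:
    """Shannon entropy (bits per residue) of the residue composition."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def low_complexity_windows(seq: str, window: int = 12, threshold: float = 2.2):
    """Return (start, entropy) for every window whose entropy is below threshold."""
    hits = []
    for start in range(len(seq) - window + 1):
        h = shannon_entropy(seq[start:start + window])
        if h < threshold:
            hits.append((start, h))
    return hits

# Hypothetical toy sequence: a mixed region followed by a Pro-rich tail.
toy = "MTVGKSSKMLQHIDYRMRCILQ" + "PPGMRPPPPGMRPPP"
for start, h in low_complexity_windows(toy):
    print(start, toy[start:start + 12], round(h, 2))
```

Windows dominated by one or two residue types score low. Matches between such regions are what the search statistics do not model well, which is why the Pro-rich collagens score so strongly against the unfiltered query.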
Step 2B BLAST 2 search with SM-B and a filter
Questions and Answers
- 1. How many SM proteins are detected above the first false positive?
- 2. Is there another class of protein that is strongly detected?
- Collagens are no longer detected strongly
- 3. Why are rather few sequences listed?
- The bulk of sequences are excluded by the BLAST cut-off defaults.
- 4. How does this setup compare in sensitivity to the unfiltered search?
- 5. Are the P-values a reliable guide to homology?
- Up to a point now. Good (low) P-values are reliable, but many SM proteins are still not detected, so the P-values are not a reliable guide for the more divergent but related sequences.
Step 3 Bic_SW search with human SM-B protein
Questions and Answers
- 1. How are SM proteins distributed in the output?
- Answers are roughly similar to the BLAST results.
- 2. What position is the highest false positive?
- 3. Is another class of proteins strongly detected?
- 4. Are the E-values a reliable guide to the SM protein detections?
- 5. Compared to BLAST:
- (a) Which, if any, is more sensitive?
- In this case, probably about the same, the reduced sequence complexity dominates over algorithmic differences.
- (b) Which output is easier to understand?
- I find BIC easier. But many people may already be familiar with BLAST, although I think it is a poor visual format.
Step 3B Bic_SW search with the SM Domain only
Questions and Answers
- 1. Are more or fewer SM proteins detected?
- Answers roughly similar to the BLAST results.
- 2. Is another class of proteins strongly detected?
- 3. Are the E-values a reliable guide to the SM protein detections?
- 4. Compared to the BLAST filtered search which, if any, is more sensitive?
- Not a lot of difference. In theory Smith-Waterman (BIC) has an edge on sensitivity but this is more likely to be seen with larger domains (and a divergent exchange matrix) where it can enhance the signal to noise.
- 5. Collect a multiple alignment using the buttons in the header:
- Is this useful to judge the detections?
- Yes, now the highly conserved columns in SM proteins become clear. Even the more divergent SM proteins should have identical or similar residues in key positions.
- Which entries have incomplete sequence fragments?
- Several; one rat sequence is highly truncated at the N-terminus.
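For readers who have not met Smith-Waterman before, the following is a minimal sketch of the local alignment score it computes. It assumes a flat match/mismatch score and a linear gap penalty, which are simplifications made here for brevity; a real database search uses an exchange matrix and typically affine gap penalties. The function name and default scores are illustrative only.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Resetting to zero is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("XXACDEXX", "YYACDEYY"))
```

Because every cell of the matrix is evaluated, no candidate alignment is skipped by a word-matching heuristic. That is the theoretical edge in sensitivity mentioned above.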
Part 2. Exploring protein architecture and function with SMART, GlobPlot and ELM
Step 1. Comparing a sequence to a database of protein domains
Questions and Answers
- Based on your recent experiences would you say the E-value scores are good?
- What happens if you click on a domain bubble?
- You get access to more info
- Is the domain common?
- They are all common domains
- Is there any literature on the domain?
- Are there structures for any of these domains?
- What is the longest region that has no known domain?
- The N-terminus at about 84 residues
- Do you think this protein has especially many or few domains?
- There are many proteins with just a few domains but also many proteins with huge numbers of domains.
- (You could repeat the SMART query with FBN1_HUMAN, the Marfan Syndrome protein.)
- This protein is made of only two kinds of domain - but has more than 50 of them!
Step 2. Exploring order/disorder with GlobPlot
Questions and Answers
- Is the slope mainly positive or negative?
- Are there any peaks/troughs where the slope inverts?
- Yes - most obviously at the beginning of the SH3 domain
- Could we use this to collect segments with a given conformational preference?
- What is the longest "putative unstructured segment" listed by GlobPlot?
- The N-terminal region is natively unstructured but GlobPlot gives it in several segments with short gaps where the graph went briefly negative.
- Do SMART and GlobPlot agree on where there is globular structure?
- There are lots of proteins that give interesting and informative GlobPlots:
- If there was time you could try CBP_HUMAN, P53_HUMAN, PRP1_HUMAN
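The idea of a running-sum plot, and of collecting segments by the sign of its slope, can be sketched as follows. The propensity table is hypothetical and is NOT the set of values used by GlobPlot. The sketch assumes, as the answers above suggest, that a rising curve marks a preference for disorder. It also works on the raw per-residue slope, whereas the real server smooths the curve first.

```python
from itertools import accumulate

# Hypothetical propensities for illustration only (not the GlobPlot values):
# positive favours disorder, negative favours globular structure.
TOY_PROPENSITY = {
    "P": 1.0, "S": 0.5, "G": 0.5, "Q": 0.3, "E": 0.3, "K": 0.3,
    "W": -1.0, "F": -0.8, "I": -0.8, "L": -0.6, "V": -0.6, "Y": -0.5,
}

def running_sum(seq, table=TOY_PROPENSITY):
    """Cumulative propensity along the sequence (the plotted curve)."""
    return list(accumulate(table.get(res, 0.0) for res in seq))

def rising_segments(seq, min_len=5, table=TOY_PROPENSITY):
    """Half-open (start, end) runs of at least min_len residues where the curve rises."""
    segments, start = [], None
    for i, res in enumerate(seq):
        rising = table.get(res, 0.0) > 0
        if rising and start is None:
            start = i
        elif not rising and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    if start is not None and len(seq) - start >= min_len:
        segments.append((start, len(seq)))
    return segments

print(rising_segments("PPSPGWPPSSQ", min_len=4))
```

In this toy example a single structure-favouring residue splits one unstructured stretch into two segments, which is the same effect as the short gaps seen in the N-terminal region.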
Step 3. Searching for short functional sites with the ELM server
Questions and Answers
- Why have results been sorted by globular domain filtering?
- Because most ELMs are in non-globular regions, some classes exclusively so. ELMs in globular domains are only interesting if they are accessible so they need to be in the surface loops of the protein structure.
- Is this filter working well?
- (Note: there are usually phosphorylation sites in tyrosine kinase domains.)
- It helps a lot but is not perfect. We need a structure filter that eliminates the buried ELMs but not the surface ones. Even then, if the local conformation changes to bury the ELM (as it does with the tyrosine kinase domain!) no filter will help. In such cases, we will have to "hardwire" in the true instances.
- Find the set of reported motifs that obey the following criteria:
- (1) They are in the non-globular N-termini (approx. residues 1-80).
- (2) They are found at the same place in human and frog (indicating that there is functional conservation).
- Are there conserved N-myristoylation sites?
- Are there any conserved phosphorylation sites?
- Yes, notably two CDK sites.
- Are there any conserved cyclin binding sites?
- Is a cyclin binding site meaningful on its own?
- No - it needs the CDK site too. These two sites are obligatorily coupled as their ligand is the cyclin/CDK complex.
- Are there more cyclin-binding than CDK phosphorylation sites?
- No - the other way round!
- Is src likely to be phosphorylated at specific points in the cell cycle?
- There is experimental evidence in the literature about CDK phosphorylation of src. Bony fish src (e.g. Xiphophorus) has one cyclin and one CDK site, so the conservation is ~400 million years old.
- yes_human, yes_xenla?
- These do NOT have CDK sites. Src is different from most of the other paralogues in its strongly conserved cyclin/CDK binding sites.
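The two selection criteria used above (a motif match that falls at the same place in two orthologues) can be sketched with regular expressions. The patterns below are simplified consensus forms assumed for this example; the ELM server's own expressions are more specific. The two aligned sequences are hypothetical toy strings, not src.

```python
import re

# Simplified, assumed patterns (not the ELM server's regular expressions).
PATTERNS = {
    "CDK_site": r"[ST]P.[KR]",
    "cyclin_box": r"[RK].L",
}

def motif_columns(aligned: str, pattern: str) -> set:
    """Alignment columns at which a motif match starts in one gapped sequence."""
    columns = [i for i, ch in enumerate(aligned) if ch != "-"]  # residue index -> column
    ungapped = aligned.replace("-", "")
    # The lookahead reports overlapping matches as well.
    return {columns[m.start()] for m in re.finditer(f"(?={pattern})", ungapped)}

def conserved_motifs(aligned_a: str, aligned_b: str) -> dict:
    """Motif start columns shared by both aligned sequences."""
    return {
        name: sorted(motif_columns(aligned_a, pat) & motif_columns(aligned_b, pat))
        for name, pat in PATTERNS.items()
    }

# Hypothetical toy alignment of two orthologues.
human = "MGSTPAKRALSPEK"
frog = "MG-TPVKRSLTPQR"
print(conserved_motifs(human, frog))
```

Short patterns like these match by chance very often, so positional conservation across species is only a first filter and not proof of function.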
Step 4. Exploring the architecture of the protein Epsin1.
Questions and answers
- How many globular domains are reported in the SMART output?
- One - because the UIM repeats are NOT globular (too small).
- Is there any indication that exon duplication has occurred?
- Introns found between functional motifs/domains. The phase zero introns give a second hint: Coding exons must have same phase splice junctions to be duplicated.
- To check further: Click on the Display all proteins with similar domain composition.
- Now we can find epsins with one less UIM exon and one less phase zero intron. It looks like exon duplication did occur. But there are far more spectacular examples in extracellular proteins.
- Do SMART and GlobPlot agree about where the sequence is globular?
- Partially but the UIM motifs are too small to be natively folded so must have an "induced fit" globularity when bound to ligand. GlobPlot does not see this.
- How big is the largest segment of disordered sequence indicated by GlobPlot?
- The extensive C-terminal region is disordered. GlobPlot does not show it as one continuous segment, though.
- Which of these motifs are not found by ELM?
- UIM and AP2-binding motifs are not yet in ELM. One reported Clathrin box is not detected.
- Are any found by SMART instead?
- Which of these motifs has been "rescued" from the ENTH domain?
- The PiP2 membrane binding motif.
- Click on its link to get an idea why it should be rescued.
- It is always found in the ENTH domain. Structural work indicates some allostery in this region that may regulate when this ELM can function.
- Are there any other endocytosis ELMs picked up inside the ENTH domain?
- Is the match plausible?
- No, for two reasons. First, it is only found in transmembrane proteins (Epsin is not one). Second, the residues in the motif are actually core-packing residues and not available for ELM interactions: a typical use of the globular filter.
- Are there any other endocytosis motifs picked up in the Epsin tail that are not in the above list?
- Yes, the LIG_AP_GamEar_1 motif.
- Would it be worth doing an experimental assessment?
- It is found in related proteins so has a good chance of working here too.
- Are there any other motifs that one might consider following up experimentally?
- It would not be surprising if some phosphorylation sites, SH3, WW motifs were involved in regulating Epsin function. But most of these sites are prone to severe overprediction.
- Would you make a multiple alignment first to check for motif conservation?
- Too right you would! ELMs may come and go during evolution but at least the closer homologues should indicate conservation or the site is unlikely to be genuine and to work in vivo. In vitro, you could easily get phosphorylation of practically any short peptide with say an SP motif (part of the recognition sequence of the "proline-directed" kinases) but there would be a strong danger of overinterpreting the results.
Part 3. Making multiple sequence alignments and calculating a tree.
Step 2. Aligning the elongation factor sequences with Clustal X
Questions and answers
- Are there any completely conserved residues?
- Yes
- How are these marked?
- What is the graph under the alignment showing?
- Column conservation. For example, it can be used to quickly find highly conserved blocks.
- Is the order of the sequences changed after alignment?
- Yes
- How might the sequences be regrouped?
- They are clustered into similar groups using the guide tree topology. In this case archaeal, eukaryotic and eubacterial groupings are made.
- Some proteins have N-terminal extensions
- Are these bacterial or eukaryotic?
- What is the function of this extension?
- Mitochondrial import sequence
- (Hint: You can check the entry in SRS)
- Are the residues coloured by
- Proximity in the alphabet?
- Physicochemical properties?
- Yes. By testing for a consensus in each column. The algorithm is very simple and anomalies can easily be found if you look carefully.
- Why are some residues not coloured?
- They do not fit to a consensus for that column.
- Are Gly and Pro always coloured?
- Yes
- What is special about these residues?
- They are secondary structure breakers, found mostly in loops. It is useful to see all occurrences.
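A simple column conservation measure can be sketched as follows. This is not the quality score that Clustal X plots; it is an assumed, much simpler measure (the fraction of sequences carrying the most common residue in each column), and the toy alignment is hypothetical.

```python
from collections import Counter

def column_conservation(alignment):
    """Per column: fraction of all sequences that carry the most common residue."""
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]  # a gap never counts as a match
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores

def fully_conserved(alignment):
    """Indices of columns in which every sequence has the same residue."""
    return [i for i, s in enumerate(column_conservation(alignment)) if s == 1.0]

# Hypothetical toy alignment.
toy_alignment = ["MKGDP", "MRGEP", "M-GDP"]
print(column_conservation(toy_alignment))
print(fully_conserved(toy_alignment))
```

Runs of high-scoring columns correspond to the conserved blocks that stand out in the graph under the alignment.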
Step 3. Calculating a tree with Clustal X and displaying it with NJplot
Questions and answers
- Are eukaryotic sequences on the archaeal or eubacterial divide or both?
- Why?
- One factor is mitochondrial, the other cytoplasmic
- Some trees support a clade of fungi and animals with plants outside
- What about this one?
- This one does too. This is counterintuitive because fungi look more like plants.
- Did toads split off before the plants, animals and fungi diverged?
- No! but one of the sequences seems to suggest that.
- Does the annotation in the Xenopus entries help to understand this?
- Yes. It is an oocyte-specific factor so not typical.
- Which Xenopus entry has the longest branch?
- The oocyte-specific factor.
- Might its position be artefactual?
- Probably. We don't so far find similar sequences in plants and fungi, which is required by this position. The Xenopus factor may be incorrectly placed by long branch attraction caused by rapid sequence evolution.
- Branch lengths provide an estimate of sequence divergence
- Which sequence is closest to the root?
- Aquifex
- Is it closer in time to the last common ancestor?
- Is it closer in sequence to the last common ancestor?
- Yes, as far as we can tell. It retains more in common with the archaeal sequences than other bacteria do, and these conservative traits suggest it is more like the common ancestor.
- Is it a result of horizontal transfer of sequences from archaea?
- No! It is clearly in the eubacterial clade. It is possible that some archaeal genes have been transferred to Aquifex as it lives in close association with some species. But EF-TU behaves typically for the Aquifex branch point.
- Are there any eukaryotic and archaeal sequences with notably short branches?
- Yes, e.g. Pyrococcus in Archaea and Entamoeba in Eukaryotes. These may be conservative organisms too.
- Chemical reaction rates increase with temperature
- Therefore thermophilic prokaryotes like Aquifex and Pyrococcus will be subject to much higher mutation rates and evolve faster than mesophiles won't they?
- They don't! Exactly the opposite. This is also counterintuitive, but many thermophiles are extremely conservative.
- Mitochondria are derived from an ancestral alpha-purple bacterium and Rickettsia is one of their closest relatives
- Do the eukaryotic mitochondrial EF-TUs and Rickettsia share a unique branch?
- They should do but they don't here.
- Are all the mitochondrial sequences on a unique branch?
- They should be but they aren't.
- Which likely evolved faster, mitochondrial or bacterial sequences?
- Must be the mitochondria, as it is they that suffer the long branch attraction. They should be joined to Rickettsia, but still have a long branch projecting out of the tree.
- Might it improve the topology to incorporate more sequences from mitochondria and their relatives?
- Might, even probably, but not definitely. It is usually best to focus on the area of interest and exclude unnecessarily divergent sequences. Many studies have shown this. This can improve the signal to noise ratio. But whether it helps will depend on how much signal there is in the first place and if artefacts like long branch attraction can be overcome.
- Is Giardia lamblia the earliest diverged eukaryote?
- Yes, in this set.
- Since it lacks a mitochondrion, is this good evidence for the emergence of amoeboid anaerobic eukaryotes with cytoskeleton, nucleus etc. that later engulfed a symbiotic purple bacterium to give rise to the aerobic, mitochondrial eukaryotes?
- This is a parsimonious assumption as Giardia (and other amitochondrial genera) are all early diverging eukaryotes. But it is wrong. All the amitochondrial eukaryotes analysed so far, like Giardia and microsporidia, have genes that derived from a mitochondrial endosymbiont. So they have secondarily lost their mitochondria. It appears that there never was such a eukaryote. The eukaryotic lineage began with symbiosis of an archaeon and a purple bacterium and the eukaryotic features evolved thereafter.
- Bootstraps report the stability of branches
- Which branch has the lowest bootstrap?
- Do the shortest branch lengths have the highest bootstrap values?
- No, in general the other way round.
- Do the bootstraps help to warn us that the mitochondrial EF-TUs might be misplaced?
- No
- Can they substitute for a proper rate test (i.e. whether the tree approximates to a constant molecular clock)?
- No. They measure branch stability but not tree distortion due to violation of the clock. This must be tested for separately.
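What a bootstrap replicate actually resamples can be sketched as follows. Rebuilding a full tree per replicate is too long for a short example, so as a stated simplification the sketch only asks how often a chosen pair of sequences remains the closest pair by uncorrected p-distance. The function names, the default of 200 replicates and the toy alignment are all assumptions made for illustration.

```python
import random
from itertools import combinations

def resample_columns(alignment, rng):
    """One bootstrap replicate: draw alignment columns with replacement."""
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[i] for i in picks) for seq in alignment]

def p_distance(a, b):
    """Fraction of differing positions, skipping columns gapped in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs) if pairs else 1.0

def pair_support(alignment, names, pair, replicates=200, seed=0):
    """Fraction of replicates in which `pair` is the closest pair of sequences."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    wanted = frozenset(pair)
    hits = 0
    for _ in range(replicates):
        rep = dict(zip(names, resample_columns(alignment, rng)))
        closest = min(combinations(names, 2),
                      key=lambda p: p_distance(rep[p[0]], rep[p[1]]))
        hits += frozenset(closest) == wanted
    return hits / replicates

# Hypothetical toy alignment: A and B identical, C different in every column.
print(pair_support(["ACDE", "ACDE", "WWWW"], ["A", "B", "C"], ("A", "B")))
```

The resampling only reshuffles the evidence already present in the columns. A grouping produced by a systematic bias such as unequal rates will tend to recur in most replicates, so high support says nothing about whether the clock assumption holds.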