Predocs04 Course
Basic Tools in Sequence Analysis Practical
by Toby Gibson and the Sequence Analysis Team, October 6-7th, 2004
Part 1. WWW sequence similarity search Tools
Step 2 BLAST2 searching with human SM-B protein
Questions and Answers
- 1. How many SM proteins are detected above the first false positive?
- About 11. Could change with database updates. Need to check the SRS links to see if some entries have a PFAM SM link.
- 2. Is there another class of protein that is strongly detected?
- 3. If so, is this biologically meaningful?
- No - and yes. These proteins cannot be homologous: they work in different compartments and do different things. But it is possible that the Pro-rich regions serve to aggregate the SM proteins into complexes, perhaps as irregular fibres, which is not so different from Pro-rich collagen assembly, although collagen makes a regular repeating structure.
- 4. Are the P-values a reliable guide to homology?
- Not in this case where the reduced sequence complexity confounds the underlying assumptions in the statistical model.
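The low-complexity problem mentioned in the last answer can be illustrated with a small sketch. It scores each window of a sequence by the Shannon entropy of its residue composition and masks the low-entropy windows. This is a simplified stand-in for a real complexity filter, not the filter used by the BLAST2 server; the window length, the cutoff and the toy sequence are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def window_entropy(seq: str, window: int = 12) -> list[float]:
    """Shannon entropy (in bits) of the residue composition of each window."""
    scores = []
    for start in range(len(seq) - window + 1):
        counts = Counter(seq[start:start + window])
        entropy = -sum((n / window) * math.log2(n / window) for n in counts.values())
        scores.append(entropy)
    return scores


def low_complexity_mask(seq: str, window: int = 12, cutoff: float = 2.2) -> str:
    """Replace every residue that lies in a low-entropy window with 'X'."""
    masked = list(seq)
    for start, entropy in enumerate(window_entropy(seq, window)):
        if entropy < cutoff:
            masked[start:start + window] = "X" * window
    return "".join(masked)


# Hypothetical toy sequence with a Pro-rich repeat in the middle
# (this is NOT the real SM-B sequence).
toy = "MKVLDEGTRAHQ" + "PPGMPPPGMPPPGM" + "WSTYNCDEFIKL"
print(low_complexity_mask(toy))
```

A Pro-rich stretch uses only a few residue types, so its entropy is low and it is masked, while the mixed-composition flanks are left alone.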
Step 2B BLAST2 search with SM-B and a filter
Questions and Answers
- 1. How many SM proteins are detected above the first false positive?
- Many more than before, about 50.
- 2. Is there another class of protein that is strongly detected?
- No. Collagens are no longer detected strongly.
- 3. Are there SM-like proteins found in Archaea?
- Yes, most archaea possess them.
- 4. Why are rather few sequences listed?
- The bulk of sequences are excluded by the BLAST cut-off defaults.
- 5. How does this setup compare in sensitivity to the unfiltered search?
- Clearly more sensitive at finding SM proteins.
- 6. Are the P-values a reliable guide to homology?
- Up to a point now. Good (low) P-values are reliable, but there are many SM proteins with weak P-values and still others that are not detected at all. So the P-values are not reliable for the more divergent but related sequences.
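The BLAST2 steps report P-values, whereas the Bic_SW steps report E-values. Under the Poisson assumption commonly used in BLAST-style statistics the two are related by P = 1 - exp(-E). The sketch below assumes this standard relation applies to the servers used here; the function names are illustrative.

```python
import math


def p_value_from_e(e_value: float) -> float:
    """P = 1 - exp(-E): chance of at least one hit this good arising by chance."""
    return -math.expm1(-e_value)


def e_value_from_p(p_value: float) -> float:
    """Inverse relation: E = -ln(1 - P)."""
    return -math.log1p(-p_value)


for e in (1e-10, 0.01, 1.0, 10.0):
    print(f"E = {e:g}  ->  P = {p_value_from_e(e):.3g}")
```

For small values P and E are almost identical, so neither is more trustworthy than the other: both rest on the same statistical model, which low-complexity sequence can violate.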
Step 3 Bic_SW search with human SM-B protein
Questions and Answers
- 1. How are SM proteins distributed in the output?
- Answers are roughly similar to the BLAST results.
- 2. What position is the highest false positive?
- 3. Is another class of proteins strongly detected?
- 4. Are the E-values a reliable guide to the SM protein detections?
- 5. Compared to BLAST:
- (a) Which, if any, is more sensitive?
- In this case, probably about the same, the reduced sequence complexity dominates over algorithmic differences.
- (b) Which output is easier to understand?
- I find BIC easier. But many people may already be familiar with BLAST, although I think it is a poor visual format.
Step 3B Bic_SW search with the SM domain only
Questions and Answers
- 1. Are more or fewer SM proteins detected?
- Answers roughly similar to the BLAST results.
- 2. Is another class of proteins strongly detected?
- 3. Are the E-values a reliable guide to the SM protein detections?
- 4. Are there any bacterial SM-like proteins?
- Yes, but they are more divergent than the archaeal sequences. For example, entry FLJA_SALAB.
- 5. Compared to the BLAST filtered search which, if any, is more sensitive?
- Not a lot of difference. BLAST does better on the highest false positive but misses the bacterial hits. In theory Smith-Waterman (BIC) has an edge on sensitivity, but this is more likely to be seen with larger domains (and a divergent exchange matrix, e.g. Gonnet or PAM250), where it can enhance the signal-to-noise ratio.
- 6. Collect a multiple alignment using the buttons in the header:
- Is this useful to judge the detections?
- Yes, now the highly conserved columns in SM proteins become clear. Even the more divergent SM proteins should have identical or similar residues in key positions.
- Which entries have incomplete sequence fragments?
- At least one rat sequence is truncated at the N-terminus.
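For readers who have not seen the Smith-Waterman algorithm that Bic_SW is based on, here is a minimal sketch of the local alignment score. It makes two simplifying assumptions that a real search does not: a flat match/mismatch score instead of an exchange matrix such as Gonnet or PAM250, and a linear gap penalty instead of separate gap-open and gap-extend costs.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere: this is what makes it local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("XXGATTACAXX", "YYGATTACAYY"))
```

Because every cell of the matrix is evaluated, no candidate alignment is skipped by a word-seeding heuristic, which is where the theoretical sensitivity edge over BLAST comes from.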
Part 2. Exploring protein architecture and function with SMART, GlobPlot and ELM
Step 1. Comparing a sequence to a database of protein domains
Questions and Answers
- Based on your recent experiences would you say the E-value scores are good?
- What happens if you click on a domain bubble?
- You get access to more information about the domain.
- Is the domain common?
- Yes, they are all common domains.
- Is there any literature on the domain?
- Are there structures for any of these domains?
- Are there any disease mutations in the domains?
- Yes, many have been found in kinase domains for example.
- What is the longest region that has no known domain?
- The N-terminal region, at about 84 residues.
- Do you think this protein has especially many or few domains?
- There are many proteins with just a few domains but also many proteins with huge numbers of domains.
- (You could repeat the SMART query with FBN1_HUMAN, the Marfan Syndrome protein.)
- This protein is made of only two kinds of domain - but has more than 50 of them!
Step 2. Exploring order/disorder with GlobPlot
Questions and Answers
- Why does GlobPlot not slide a fixed window length over the sequences?
- Many bioinformatics programs do this. However, a fixed window has no meaning in regard to the conformational parameters used by GlobPlot. The structural segments can be of any length. Therefore the approach used here is more appropriate.
- Is the slope mainly positive or negative?
- Are there any peaks/troughs where the slope inverts?
- Yes - most obviously at the beginning of the SH3 domain
- What is the longest "putative unstructured segment" listed by GlobPlot?
- The N-terminal region is natively unstructured but GlobPlot gives it in several segments with short gaps where the graph went briefly negative.
- Do SMART and GlobPlot agree on where there is globular structure?
- Quite well in this case. The "GlobDoms" picked from the graph have missed part of the SH3 and Kinase domains. But the graph indicates that these regions have mainly globular preference.
- There are lots of proteins that give interesting and informative GlobPlots:
- If there was time you could try CBP_HUMAN, P53_HUMAN, PRP1_HUMAN, ABL1_HUMAN, IRS1_HUMAN, TAU_HUMAN.
- These are all splendid proteins to explore with GlobPlot.
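The running-sum idea behind the GlobPlot graph can be sketched in a few lines. Each residue contributes a propensity value, the graph is the cumulative sum, and segments are read off from where the curve keeps rising, with no fixed window involved. The propensity values below are hypothetical placeholders, NOT the published GlobPlot parameters, and the segment rule is much cruder than whatever smoothing the real server applies.

```python
from itertools import accumulate

# Hypothetical propensities: positive favours disorder, negative favours
# globular structure. Placeholder numbers for illustration only.
TOY_PROPENSITY = {
    "P": 0.5, "S": 0.4, "G": 0.3, "Q": 0.3, "E": 0.2, "K": 0.2, "D": 0.1,
    "N": 0.1, "A": 0.0, "T": 0.0, "R": 0.0, "H": -0.1, "M": -0.2, "C": -0.3,
    "L": -0.4, "V": -0.4, "Y": -0.4, "I": -0.5, "F": -0.5, "W": -0.5,
}


def running_sum(seq: str) -> list[float]:
    """Cumulative propensity along the sequence (the curve that is plotted)."""
    return list(accumulate(TOY_PROPENSITY.get(res, 0.0) for res in seq))


def rising_segments(seq: str, min_len: int = 5) -> list[tuple[int, int]]:
    """Half-open (start, end) ranges where the curve rises at every residue."""
    segments = []
    start = None
    for i, res in enumerate(seq):
        rising = TOY_PROPENSITY.get(res, 0.0) > 0
        if rising and start is None:
            start = i
        elif not rising and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    if start is not None and len(seq) - start >= min_len:
        segments.append((start, len(seq)))
    return segments


print(rising_segments("PSPSPSGQLPSGQEKPLVIFLVIW"))
```

In the printed result a single hydrophobic residue splits one long rising region into two, which mirrors how a natively unstructured region can be reported as several segments with short gaps.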
Step 3. Searching for short functional sites with the ELM server
Questions and Answers
- Why have results been sorted by globular domain filtering?
- Because most ELMs are in non-globular regions, some classes exclusively so. ELMs in globular domains are only interesting if they are accessible so they need to be in the surface loops of the protein structure.
- Is this filter working well?
- (Note: there are usually phosphorylation sites in tyrosine kinase domains.)
- It helps a lot but is not perfect. We need a structure filter that eliminates the buried ELMs but not the surface ones. Even then, if the local conformation changes to bury the ELM (as it does with the tyrosine kinase domain!) no filter will help. In such cases, we will have to "hardwire" in the true instances.
- Find the set of reported motifs that obey the following criteria:
- (1) They are in the non-globular N-termini (approx. residues 1-80).
- (2) They are found at the same place in human and frog (indicating that there is functional conservation).
- Are there conserved N-myristoylation sites?
- Are there any conserved phosphorylation sites?
- Yes, notably two CDK sites.
- Are there any conserved cyclin binding sites?
- Is a cyclin binding site meaningful on its own?
- No - it needs the CDK site too. These two sites are obligatorily coupled as their ligand is the cyclin/CDK complex.
- Are there more cyclin-binding than CDK phosphorylation sites?
- No - the other way round!
- Is src likely to be phosphorylated at specific points in the cell cycle?
- There is experimental evidence in the literature about CDK phosphorylation of src. Bony fish src (e.g. Xiphophorus) has one cyclin and one CDK site, so the conservation is ~400 million years old.
- yes_human, yes_xenla?
- These do NOT have CDK sites. Src is different from most of the other paralogues in its strongly conserved cyclin/CDK binding sites.
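The selection criteria used in this step (motif in the N-terminal ~80 residues, same position in two species, cyclin site only meaningful together with a CDK site) can be expressed as a short regular-expression sketch. The two patterns are simplified approximations written for illustration and are not guaranteed to match the current ELM definitions. The two sequences are invented toy peptides, assumed to be already aligned without gaps, and are not real src sequences.

```python
import re

# Simplified illustrative patterns (assumptions, not the ELM entries themselves).
CDK_SITE = r"[ST]P.[KR]"
CYCLIN_SITE = r"[RK].L"


def motif_starts(pattern: str, seq: str, region_end: int = 80) -> set[int]:
    """0-based start positions of (possibly overlapping) matches in seq[:region_end]."""
    region = seq[:region_end]
    return {m.start() for m in re.finditer(f"(?=(?:{pattern}))", region)}


def conserved_starts(pattern: str, seq_a: str, seq_b: str,
                     region_end: int = 80) -> set[int]:
    """Positions where the motif starts at the same place in both sequences."""
    return motif_starts(pattern, seq_a, region_end) & motif_starts(pattern, seq_b, region_end)


# Hypothetical gap-free aligned toy peptides.
toy_a = "AAASPAKAAARGLAAATPGRAA"
toy_b = "AAASPVRAAAKSLAAAAPGRAA"

cdk = conserved_starts(CDK_SITE, toy_a, toy_b)
cyclin = conserved_starts(CYCLIN_SITE, toy_a, toy_b)
# A cyclin site is only considered meaningful if a conserved CDK site is also present.
coupled = bool(cdk) and bool(cyclin)
print(cdk, cyclin, coupled)
```

With real sequences the positions would have to be mapped through a proper alignment before comparing them.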
Step 4. Exploring the architecture of the protein Epsin1.
Questions and Answers
- How many globular domains are reported in the SMART output?
- One - because the UIM repeats are NOT globular (too small).
- Is there any indication that exon duplication has occurred?
- Yes. Introns are found between functional motifs/domains. The phase zero introns give a second hint: coding exons must have splice junctions of the same phase to be duplicated without disturbing the reading frame.
- To check further: Click on "Display all proteins with similar domain composition".
- Now we can find epsins with one less UIM exon and one less phase zero intron. Looks like exon duplication did occur. But there are much more spectacular examples in extracellular proteins.
- Do SMART and GlobPlot agree about where the sequence is globular?
- Partially but the UIM motifs are too small to be natively folded so must have an "induced fit" globularity when bound to ligand. GlobPlot does not see this.
- How big is the largest segment of disordered sequence indicated by GlobPlot?
- The extensive C-terminal region is disordered. GlobPlot does not show it as continuous though.
- Which of these motifs are not found by ELM?
- UIM and AP2-binding motifs are not yet in ELM. One reported Clathrin box is not detected.
- Are any found by SMART instead?
- Which of these motifs has been "rescued" from the ENTH domain?
- The PiP2 membrane binding motif.
- Click on its link to get an idea why it should be rescued.
- It is always found in the ENTH domain. Structural work indicates some allostery in this region that may regulate when this ELM can function.
- Are there any other endocytosis targeting ELMs picked up in Epsin?
- TRG_ENDOCYTIC_2
- TRG_LysEnd_APsAcLL_1.
- Are the matches plausible?
- No, for two reasons. Both motifs are only found in transmembrane proteins (Epsin is not one). For TRG_ENDOCYTIC_2, the residues in the motif are actually core-packing residues in the ENTH domain and not available for ELM interactions: a typical case for the globular filter.
- Are there any other motifs that one might consider following up experimentally?
- It would not be surprising if some phosphorylation sites, SH3, WW motifs were involved in regulating Epsin function. But most of these sites are prone to severe overprediction.
- If yes, would you make a multiple alignment first to check for motif conservation?
- Too right you would! ELMs may come and go during evolution but at least the closer homologues should indicate conservation or the site is unlikely to be genuine and to work in vivo. In vitro, you could easily get phosphorylation of practically any short peptide with say an SP motif (part of the recognition sequence of the "proline-directed" kinases) but there would be a strong danger of overinterpreting the results.
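The intron-phase argument for exon duplication made earlier in this step can be checked with simple arithmetic. The phase of an intron is the number of coding nucleotides before it, modulo 3. An internal exon flanked by two introns of the same phase has a length divisible by 3, so it can be duplicated without shifting the reading frame. The exon lengths below are hypothetical and are not the real Epsin1 gene structure.

```python
def intron_phases(exon_lengths: list[int]) -> list[int]:
    """Phase (0, 1 or 2) of the intron that follows each exon except the last."""
    phases = []
    total = 0
    for length in exon_lengths[:-1]:
        total += length
        phases.append(total % 3)
    return phases


def symmetric_exons(exon_lengths: list[int]) -> list[int]:
    """Indices of internal exons whose flanking introns share the same phase."""
    phases = intron_phases(exon_lengths)
    return [i for i in range(1, len(exon_lengths) - 1)
            if phases[i - 1] == phases[i]]


# Hypothetical coding exon lengths in nucleotides.
toy_exons = [90, 60, 60, 100, 50]
print(intron_phases(toy_exons))
print(symmetric_exons(toy_exons))
```

In this toy gene exons 1 and 2 (0-based) sit between phase zero introns, so either could have arisen by duplication of the other; the fourth exon could not be duplicated cleanly.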
Part 3. Making multiple sequence alignments and calculating a tree.
Step 2. Aligning the elongation factor sequences with Clustal X
Questions and Answers
- Are there any completely conserved residues?
- Yes
- How are these marked?
- What is the graph under the alignment showing?
- Column conservation. For example, it can be used to quickly find highly conserved blocks.
- Is the order of the sequences changed after alignment?
- Yes
- How might the sequences be regrouped?
- They are clustered into similar groups using the guide tree topology. In this case archaeal, eukaryotic and eubacterial groupings are made.
- Some proteins have N-terminal extensions
- Are these bacterial or eukaryotic?
- What is the function of this extension?
- Mitochondrial import sequence
- (Hint: You can check the entry in SRS)
- Are the residues coloured by
- Proximity in the alphabet?
- Physicochemical properties?
- Yes. By testing for a consensus in each column. The algorithm is very simple and anomalies can easily be found if you look carefully.
- Why are some residues not coloured?
- They do not fit to a consensus for that column.
- Are Gly and Pro always coloured?
- Yes
- What is special about these residues?
- They are secondary structure breakers, found mostly in loops. It is useful to see all occurrences.
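The idea of a per-column conservation score can be sketched as follows. This is a simple identity-based measure (fraction of sequences sharing the most common residue) and is an assumption made for illustration; the quality graph drawn by Clustal X uses a more elaborate score. The toy alignment is invented and is not taken from the elongation factor set.

```python
from collections import Counter


def column_conservation(alignment: list[str]) -> list[float]:
    """Fraction of sequences carrying the most common residue in each column.

    Gaps ('-') never count towards the most common residue.
    """
    n = len(alignment)
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / n)
    return scores


def fully_conserved(alignment: list[str]) -> list[int]:
    """0-based indices of columns where every sequence has the same residue."""
    return [i for i, s in enumerate(column_conservation(alignment)) if s == 1.0]


# Hypothetical toy alignment (all rows must have equal length).
toy_alignment = [
    "MKGDE-L",
    "MKGEEAL",
    "MRGDEAL",
]
print(fully_conserved(toy_alignment))
```

Plotting the scores against column number gives a crude version of the graph under the alignment, and runs of high values point to conserved blocks.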
Step 3. Calculating a tree with Clustal X and displaying it with NJplot
Questions and Answers
- Are eukaryotic sequences on the archaeal or eubacterial divide or both?
- Why?
- One factor is mitochondrial, the other cytoplasmic
- Some trees support a clade of fungi and animals with plants outside
- What about this one?
- This one does too. This is counterintuitive because fungi look more like plants.
- Did toads split off before the plants, animals and fungi diverged?
- No! But one of the sequences seems to suggest that.
- Does the annotation in the Xenopus entries help to understand this?
- Yes. It is an oocyte-specific factor, so not typical.
- Which Xenopus entry has the longest branch?
- The oocyte-specific factor.
- Might its position be artefactual?
- Probably. We don't so far find similar sequences in plants and fungi, which is required by this position. The Xenopus factor may be incorrectly placed by long branch attraction caused by rapid sequence evolution.
- Branch lengths provide an estimate of sequence divergence
- Which sequence is closest to the root?
- Aquifex
- Is it closer in time to the last common ancestor?
- Is it closer in sequence to the last common ancestor?
- Yes, as far as we can tell. It retains more in common with the archaeal sequences than other bacteria do, and these conservative traits suggest it is more like the common ancestor.
- Is it a result of horizontal transfer of sequences from archaea?
- No! It is clearly in the eubacterial clade. It is possible that some archaeal genes have been transferred to Aquifex as it lives in close association with some species. But EF-TU behaves typically for the Aquifex branch point.
- Are there any eukaryotic and archaeal sequences with notably short branches?
- Yes, e.g. Pyrococcus in Archaea and Entamoeba in Eukaryotes. These may be conservative organisms too.
- Chemical reaction rates increase with temperature
- Therefore thermophilic prokaryotes like Aquifex and Pyrococcus will be subject to much higher mutation rates and evolve faster than mesophiles, won't they?
- They don't! Exactly the opposite. This is also counterintuitive, but many thermophiles are extremely conservative.
- Mitochondria are derived from an ancestral alpha-purple bacterium and Rickettsia is one of their closest relatives
- Do the eukaryotic mitochondrial EF-TUs and Rickettsia share a unique branch?
- They should do but they don't here.
- Are all the mitochondrial sequences on a unique branch?
- They should be but they aren't.
- Which likely evolved faster, mitochondrial or bacterial sequences?
- It must be the mitochondrial sequences, as it is they that suffer the long branch attraction. They should be joined to Rickettsia, but would still have a long branch projecting out of the tree.
- Might it improve the topology to incorporate more sequences from mitochondria and their relatives?
- Might, even probably, but not definitely. It is usually best to focus on the area of interest and exclude unnecessarily divergent sequences. Many studies have shown that this can improve the signal-to-noise ratio. But whether it helps will depend on how much signal there is in the first place and if artefacts like long branch attraction can be overcome.
- Is Giardia lamblia the earliest diverged eukaryote?
- Yes, in this set.
- Since it lacks a mitochondrion, is this good evidence for the emergence of amoeboid anaerobic eukaryotes with cytoskeleton, nucleus etc. that later engulfed a symbiotic purple bacterium to give rise to the aerobic, mitochondrial eukaryotes?
- This is a parsimonious assumption, as Giardia (and other amitochondrial genera) are all early diverging eukaryotes. But it is wrong. All the amitochondrial eukaryotes analysed so far, like Giardia and microsporidia, have genes that derived from a mitochondrial endosymbiont. So they have secondarily lost their mitochondria. It appears that there never was such a primitively amitochondrial eukaryote. The eukaryotic lineage began with symbiosis of an archaeon and a purple bacterium, and the eukaryotic features evolved thereafter.
- Bootstraps report the stability of branches
- Which branch has the lowest bootstrap?
- Do the shortest branch lengths have the highest bootstrap values?
- No, in general the other way round.
- Do the bootstraps help to warn us that the mitochondrial EF-TUs might be misplaced?
- No
- Can they substitute for a proper rate test (i.e. whether the tree approximates to a constant molecular clock)?
- No. They measure branch stability but not tree distortion due to violation of the clock. This must be tested for separately.
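What a bootstrap value measures can be illustrated with a deliberately reduced sketch. Alignment columns are resampled with replacement, and for each replicate we ask whether a chosen sequence still has the same nearest neighbour by uncorrected p-distance. A real bootstrap, as in Clustal X, rebuilds the whole tree for every replicate; the nearest-neighbour test, the function names and the toy alignment here are all assumptions made to keep the example short.

```python
import random


def p_distance(a: str, b: str) -> float:
    """Fraction of differing residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def nearest_neighbour(alignment: dict[str, str], name: str) -> str:
    """Name of the sequence closest to `name` (ties go to the first listed)."""
    others = [other for other in alignment if other != name]
    return min(others, key=lambda other: p_distance(alignment[name], alignment[other]))


def bootstrap_support(alignment: dict[str, str], name: str, expected: str,
                      replicates: int = 200, seed: int = 0) -> float:
    """Fraction of column-resampled replicates in which `expected` stays nearest to `name`."""
    rng = random.Random(seed)
    names = list(alignment)
    length = len(alignment[names[0]])
    hits = 0
    for _ in range(replicates):
        cols = [rng.randrange(length) for _ in range(length)]
        resampled = {n: "".join(alignment[n][c] for c in cols) for n in names}
        if nearest_neighbour(resampled, name) == expected:
            hits += 1
    return hits / replicates


# Hypothetical toy alignment.
toy = {"A": "MKLVGDE", "B": "MKLVGDQ", "C": "MRLIGEE", "D": "TRWIAEQ"}
print(bootstrap_support(toy, "A", "B"))
```

The support only says how consistently the columns favour a grouping. If fast-evolving sequences are pulled together by long branch attraction, the columns can favour the wrong grouping just as consistently, which is why a high bootstrap is no substitute for a rate test.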