Slides: Today we will give an introduction to structural domains, disordered regions and motifs. After introducing these concepts we will attempt to show you how to detect similarity between proteins based on a similar module architecture. This is a different way of looking at protein similarity. It can be useful to infer similar functions in proteins that don't share a high degree of sequence similarity.
Structural Domain Prediction
The presentation will introduce the concept of structural domains and some of the tools for predicting secondary structure and domain topology.
Introduce Secondary Structure.
- What is primary structure?
- What is secondary structure?
- What are secondary structure elements?
- What are not?
- First Concept of domain = Structural self-contained folding subunit.
- Using the two target sequences below, run the secondary structure prediction algorithms to determine the likely positions of any secondary structure elements in your protein. While you are waiting for the results you can move on to the next question. All of these servers require an e-mail address to which the results will be sent. Some of them do not accept webmail addresses such as Hotmail, so please use your institutional e-mail address, e.g. email@example.com
o Jpred This will complain that a structure exists - ignore this
o Jufo Keep looking for Jufo
o ExPasy (List of tools)
- How do these methods work?
- Discuss the history of bioinformatic prediction of secondary structure.
- What information do you think both styles are uncovering?
Introduce Tertiary Structure.
What does this mean to you?
Slides on src and abl. Explain topology. Classification of domains. Allows comparison of 3D structures.
We have now assembled the smallest subunits of biological activity. These can now go on to perform biological roles in the cell. Ask what have we left out - any ideas why?
- What is a domain?
- Can you think of examples of classes of proteins that contain domains?
- In what kind of protein was the concept of domain first introduced?
- What do you know about the evolution of the term domain in terms of its changing definition/meaning to biologists?
- Contrast this level of similarity (i.e. structural) with the primary sequence similarity. Multiple alignment of the sequences that you all learned to do yesterday.
Why is modular architecture important? The idea is that domain == function, so two proteins with the same domain have the same function. Underline the idea that primary sequence similarity is not critical for sharing a domain.
Introduction to SMART, and other domain prediction sites.
o Pfam (Click sequence search)
- What are the domains present in this protein?
- What is the difference between this protein in Human and Xenopus?
- Using what you know of the structure and properties of globular domains, explain the distribution of the different physicochemical properties of the residues.
For the proteins you ran earlier, src_human, ABL1_HUMAN, does the secondary structure prediction align with the predicted/known domains?
Using SMART, determine the following for src_human and abl1_human:
- How many other proteins in humans also have the same domain?
- How many have the same domain organisation in all organisms?
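The idea of comparing proteins by module architecture rather than by sequence can be sketched in a few lines. This is a minimal illustration, not what SMART does internally: each protein is reduced to its ordered list of domains, and two architectures are scored by the longest run of domains they share in the same order. The SH3-SH2-kinase core of Src, Csk and Abl is well established; the `FABD` entry, the `toy_receptor` and the function names are assumptions made for the example.

```python
from typing import Dict, List

# Domain order from N- to C-terminus. The SH3-SH2-kinase core is well
# established; "FABD" and "toy_receptor" are illustrative assumptions.
ARCHITECTURES: Dict[str, List[str]] = {
    "src_human": ["SH3", "SH2", "TyrKc"],
    "csk_human": ["SH3", "SH2", "TyrKc"],
    "abl1_human": ["SH3", "SH2", "TyrKc", "FABD"],
    "toy_receptor": ["FN3", "FN3", "TyrKc"],
}


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two domain lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def architecture_similarity(a: List[str], b: List[str]) -> float:
    """Shared ordered domains divided by the longer architecture (0.0 to 1.0)."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


if __name__ == "__main__":
    reference = ARCHITECTURES["src_human"]
    for name, domains in ARCHITECTURES.items():
        score = architecture_similarity(reference, domains)
        print(f"{name:14s} {'-'.join(domains):22s} {score:.2f}")
```

A score of 1.0 means an identical domain organisation; lower scores mean that domains are missing, added or reordered relative to the reference.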
Follow on questions - if time allows.
- Using the following proteins, run them in either SMART or Pfam. What can you tell about the likely interactions of the proteins solely from the annotations/predictions here.
- What can you infer from sequences which share similar domain topologies - it depends. Membrane proteins == good. Enzymes == bad.
Globularity and Non-globularity
Here we will introduce tools to predict Intrinsically Unstructured regions of Proteins.
Using the Breakless sequence, generate disorder predictions, or choose a sequence from here: Data and sequences for exercises.
Using IUPred generate a disorder prediction.
- What is the tendency for your protein to be disordered? i.e. Identify the tendency for disorder in this sequence.
- Using IUPred, compare the results of the long and short predictions for your protein.
If you are using your own sequence, which type of disorder would you expect to see?
Run the sequence now with the DisProt server using the different algorithms.
- Do they agree - if so, would you expect them to - if not - why not?
- Comparing the output of these to the output of IUPred, explain the differences?
- Some 3D structures are the result of residues that are not primary structure neighbours. How do these disorder algorithms take into account long range effects?
What effect does the window size for smoothing have on the prediction? Run your target sequence in IUPred with the long and short window.
- Run GlobPlot on your protein, or any of the ones here: src_human, csk_human, abl1_human.
- Use GlobPlot and alter the parameters to see what effect this has on the prediction.
- Are there likely to be domains missing from one site present in another? If so, what is the best strategy for ensuring that you get the best prediction?
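The effect of the smoothing window can also be explored offline. The sketch below does not reproduce IUPred or GlobPlot; it applies a plain centred moving average to a made-up per-residue score profile and reports the regions that stay above a threshold. The scores, the 0.5 threshold and the function names are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def smooth(scores: List[float], window: int) -> List[float]:
    """Centred moving average; the window is truncated at the sequence ends."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed


def regions_above(scores: List[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """1-based inclusive (start, end) runs where the score exceeds the threshold."""
    regions = []
    start = None
    for i, value in enumerate(scores, start=1):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(scores)))
    return regions


# Hypothetical raw scores: a 3-residue spike and a 12-residue stretch.
RAW = [0.1] * 10 + [0.9] * 3 + [0.1] * 10 + [0.9] * 12 + [0.1] * 8

if __name__ == "__main__":
    for window in (3, 11):
        print(window, regions_above(smooth(RAW, window)))
```

With the short window both the spike and the long stretch are reported; with the long window the three-residue spike is averaged away and only the long stretch remains, which is the kind of difference to look for when comparing short and long predictions.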
Definitions of Domains - how the profiles work.
We will look at the multiple sequence alignment for the Pfam prediction of the insulin receptor domain of the sequence
- Enter the insulin receptor protein into Pfam. Click on the L-Recep Domain. On the side bar, click the alignment - how much similarity is there between the sequences in the alignment? Also take a look at the HMM Logo.
- What does the profile tell you about the type of similarity between the sequences that all have the domain.
- Look at the HMM definition for the Pfam B set - what can you say about the type of similarity that the HMM uses to define the domain. Sequence: epn1_human.
- What is the difference between Pfam A and B - hint take a look at the help pages.
- Compare the results of GlobPlot to those of the SMART domains. How well does the prediction line up with the annotated domains?
- For those domains that are not predicted well, use the information available about them from Pfam, CD-Search and SMART to explain why GlobPlot would not predict them well. Example: mdm2
- How well does the globular prediction of IUPred compare to the globular predictors from earlier?
- How do the disorder predictions line up with the globular domain predictions from earlier on your protein?
- In particular, contrast the scoring of residues in DisProt, IUPred and GlobPlot. Which is better and why?
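One way to put a number on "how well does the prediction line up with the annotated domains" is to compute, for each annotated domain, the fraction of its residues that fall inside predicted globular regions. The sketch below assumes that you have copied start and end coordinates from the server outputs by hand; the coordinates shown are invented and the function name is hypothetical.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # 1-based inclusive (start, end)


def covered_fraction(domain: Region, predicted: List[Region]) -> float:
    """Fraction of residues in `domain` that lie inside any predicted region."""
    start, end = domain
    covered = set()
    for p_start, p_end in predicted:
        lo, hi = max(start, p_start), min(end, p_end)
        if lo <= hi:
            covered.update(range(lo, hi + 1))
    return len(covered) / (end - start + 1)


# Invented coordinates, for illustration only.
annotated_domains = {"domain_A": (10, 19), "domain_B": (50, 60)}
predicted_globular = [(1, 14), (18, 30)]

if __name__ == "__main__":
    for name, region in annotated_domains.items():
        print(name, covered_fraction(region, predicted_globular))
```

A domain with a fraction near 1.0 is well captured by the predictor; a fraction near 0.0 flags a domain worth examining more closely, as in the mdm2 example.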
Using p53_human, run GlobPlot and search for domains using whichever tool you find easiest (e.g. Pfam, DAS). How well do the predictions agree?
Repeat these exercises this time with the Breakless protein. Compare this with the prediction for Insulin Receptor Protein.
- What characteristics do you think the machine learning algorithms such as disprot might be picking up?
- What properties of the amino acid sequence are the predictors identifying? How is this encoded in the machine?
Get familiar with the linear motifs collected in the ELM server. Go to the ELM “browse page” link.
- Choose 2 or more of the above links to address the following questions:
- What biological activity is associated with these different motifs?
- Which proteins contain these motifs?
- Where are the proteins containing these motifs localized in the cell?
- What is the meaning of the 4 prefix classes: CLV, LIG, MOD, TRG?
- Searching for short functional sites with the ELM server
You will use the ELM server to identify possible ELMs present in the query sequences. For each sequence specify its subcellular localisation and taxon.
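Short linear motifs are typically written as regular expressions, so the basic matching step can be sketched in plain Python. The patterns below are simplified consensus sequences chosen for illustration and are not copied from the ELM database; the toy sequence and the `scan` helper are likewise assumptions. The ELM server adds context filters (globular domains, cell compartment, taxon) that this sketch leaves out.

```python
import re
from typing import Dict, List, Tuple

# Simplified consensus patterns; assumptions for illustration only.
MOTIFS: Dict[str, str] = {
    "CDK_site_like": r"[ST]P.[KR]",  # S/T-P-x-K/R
    "cyclin_box_like": r"[RK].L",    # R/K-x-L
}


def scan(sequence: str, pattern: str) -> List[Tuple[int, str]]:
    """Return (1-based start, matched text) for every match, overlaps included."""
    # A lookahead makes each match zero-width, so overlapping hits are found.
    lookahead = re.compile(f"(?=({pattern}))")
    return [(m.start() + 1, m.group(1)) for m in lookahead.finditer(sequence)]


# Invented toy sequence, not a real protein.
TOY_SEQUENCE = "MGSRKLAASPEKTTPAR"

if __name__ == "__main__":
    for name, pattern in MOTIFS.items():
        print(name, scan(TOY_SEQUENCE, pattern))
```

Because such patterns are short and degenerate, they match by chance very often. This is why the server asks for the cell compartment and taxon, and why many hits in the exercises below need to be treated with caution.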
Src is an example of a protein that has many small functional sites for modification and/or interaction with ligands
Looking for conserved motifs in the human and Xenopus src protein.
Open the ELM server query page in a new window.
Type src_human into the SWISS-PROT ID box or submit the sequence directly Src ELM link.
Specify species as Homo sapiens and cell compartment as cytosol. (Note that src is actually directed to the inner plasma membrane surface by myristoylation.)
Click on the Submit Button. The results will appear in a new window.
Immediately start a new search with src1_xenla or submit the sequence directly link and set species to Xenopus laevis. The searches should take about 1-2 minutes unless the ELM and SMART servers are busy.
Look over the outputs: almost everything in the output graphic has mouse over and is clickable - explore! Then answer the questions below.
- Why are the motifs colour-coded?
- What is indicated by motifs, within a structural region, of the following colours: blue, grey, blue/grey?
- Note that some of the motifs have been "greyed out" in the output. Comparing these motifs with the other features in the display, can you see which features the greyed-out motifs correlate with?
- Why do you think these motifs have been greyed out?
- Does the server identify any of the motifs as having been already demonstrated to be active via experiments?
- Why is cell compartment important?
- For the src CDK site, follow links to find whether CDK2 or CDK5 modifies this site?
Find the set of reported motifs that obey the following criteria:
(1) They are in the non-globular N-termini (approx. residues 1-80).
(2) They are found at the same place in human and frog (indicating that there is functional conservation).
N.B. An easy, and frequently used, way of checking for conservation in position of a motif between related sequences is to obtain a BLAST pairwise alignment between the two sequences, and to examine the region around the motifs from this alignment. You could do this using the BLAST 2 sequences server at the NCBI.
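Once you have a pairwise alignment, checking whether a motif sits at the same place in both sequences comes down to mapping a residue number from one sequence to the other through the gapped alignment. The sketch below assumes that you have pasted the two aligned strings (with `-` for gaps) from the BLAST output; the strings shown are invented and are not the real Src sequences, and `map_position` is a hypothetical helper.

```python
from typing import Optional


def map_position(aligned_a: str, aligned_b: str, pos_a: int) -> Optional[int]:
    """Map a 1-based residue number in A to the aligned residue number in B.

    Returns None when the residue in A is aligned to a gap in B.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    i = j = 0  # residues seen so far in A and in B
    for char_a, char_b in zip(aligned_a, aligned_b):
        if char_a != "-":
            i += 1
        if char_b != "-":
            j += 1
        if char_a != "-" and i == pos_a:
            return j if char_b != "-" else None
    raise IndexError("position lies outside sequence A")


# Invented toy alignment, for illustration only.
HUMAN = "MGSNK-SKPKD"
FROG = "MG-NKASKPRD"

if __name__ == "__main__":
    for position in (3, 4, 6):
        print(position, "->", map_position(HUMAN, FROG, position))
```

A motif reported at a given position in the human sequence is a candidate for conservation if the mapped position in the frog sequence carries the same motif in the ELM output.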
- Are there conserved N-myristoylation sites?
- Are there any conserved phosphorylation sites?
- Are there any conserved cyclin binding sites?
- Is a cyclin binding site meaningful on its own? Or is there another motif that must be found in the same protein as this motif for it to be functional?
- Are there more cyclin-binding than CDK phosphorylation sites?
- Is src likely to be phosphorylated at specific points in the cell cycle?
- There are 8 src-like kinases in the human proteome with partially redundant function. Presumably they will be substrates of CDK too?
If there is enough time, run new ELM queries with the src-like kinases yes_human and yes_xenla to check whether Yes too is a CDK substrate.
Exploring unknown sequences.
Use the following sequence link unknown1 to perform a search in the ELM server
- Which kind of protein is it? (Tip: the domain composition should help)
- Is it correct to do the search selecting “cytosol” only?
- Which kinds of ELMs are most likely to be true?
Go to Uniprot to check if you can find additional annotation of linear motifs.
Go to phospho.ELM to see information specific to the phospho-site(s), using accession number P06213. (We will look at the phospho.ELM database in more detail with the next sequence, so do NOT close this window.)
Use the following sequence unknown2 to perform a search in the ELM server.
Have a brief look at the ELM results: follow this link and perform a search in phospho.ELM using accession number P35568.
- Can you identify the relation between this protein and the previous one?
- How many species are annotated? Which has the most sites? Which has the best-annotated sites?
- Which of pTyr, pThr, pSer are most common?
- Which class is the insulin receptor?
- What do LTP and HTP mean? Does HTP outnumber LTP?
- Is there a link to the literature on the site?
- What is NetworKIN?
- Click on the link in the substrate column. What do you get? What is MINT?
- Click on a link in the Interaction Network column. What do you get?
Summary Again 15 minutes at end.