This section will provide an introduction to Native Disorder.
An example of a sequence for which all three predictors give different predictions:
>FBpp0080607 type=protein; loc=2L:complement(18356225..18358753,18358813..18359022); ID=FBpp0080607; name=Acp36DE-PA; parent=FBgn0011559,FBtr0081054; dbxref=FlyBase:FBpp0080607,FlyBase_Annotation_IDs:CG7157-PA,GB_protein:AAF53664.1,GB_protein:AAF53664; MD5=188d260ce4f4f3c72952621fbcb2c7ca; release=r5.1; species=Dmel; length=912;
MWTLTCQQFIALILLGTLVPSESFLCKHCFRKNLEKVHESFRDILSPPIF
GVNPQPLIEVQQPKVTPKPESSQVIHVHQPQVILKPIYYPKVDTISTKNQ
IGIHGPYSQYPSLLPSANLLGIPNQQLINAQDVLSDKDQKQTQVQNNNLH
IRFGVSALREGRNNPSLETISRDKVDKISPALQLQLLRYADSQSQSQTQS
QSASQSESNASSQFQAQEQSNRLLENPPVSESQSQSESQSQSESQKQSQS
QSQRQQQIQTQLQILRQLQQKSNEQSAAQSASQIQSQRQSDSQSNLQLQE
QSQSQSEQGKPIQSQIQILQGLQQKELDDKSASQSQSESKTRQEQQKQLN
LQQLEELSSSLSQSRLGLGQQIQSQLQKNQLDKQFASQFQSQSKSQLEQQ
MQLQLQSLRQLQQKQLDEQSASQSQPQSQVAQQIQSHLQLLRLLQSRLKT
QSALKSDLEQQILLQLKKLTEVQQKQLAEQPTLRPSSKSQSPGQLEQQIL
LQLQNLLQFQQNQLKSDTQTQSQLQESKSNSLSQSQSQSQEQLQLQRDQN
LRQLEQIKLEMQNIRELLQKGKSELQTQSDSQRRIHELYQNILQLNKEKL
SYQLKQLKLKELEDQKKSQAEISKGSNPSNLFIIGQLPSEGKPAPGNQGP
SIEPKLVPQPGSLDKLPSGGGLIGKPASTGLYILSPDFNDLSDYRDQFRL
QQELKKHQNILSLLQRRQNDIKKQQNAQLLLGQQQKEQQAQESINKQQSS
SAGSSSQTKLQQDIQSTGAQGSQQGLQAGSTGLQTSSLQGTESSASQSAA
LQRLKEQEQLRIQTENDQKTSSSSSHSNSQNSQSSSSQSSQASQSEAQRQ
EAGNRNTLLLDQSSSKTQSESKSESSSQSSSHSSSQSTSNSSSNVQSKLQ
GESQALLNNLSG
The aim of this session is:
* Raising awareness of characteristics of intrinsically disordered peptides in terms of:
  o Function
  o Three-dimensional structure
  o Primary amino acid sequence
* Providing an overview of the algorithmic/theoretical basis of some of the methods used to predict disordered regions in proteins
* Providing hands-on experience of several different bioinformatic tools that can be used to predict disordered regions of proteins
* Demonstrating the differences between the results obtained using different prediction methods - showing that if one method does not predict a domain in your sequence, another method might, and thus that it can be good to apply several different methods to the same sequence
* Raising awareness of situations in which it can be useful to identify globular domains (i.e. typical use cases)
  o Deciding which regions of a sequence to exclude from a construct to be used in structural studies
After completing this section, we want you to:
* Be aware of common features and characteristics of disordered peptides
* Gain at least a basic understanding of the algorithms/theory used to predict disordered regions of protein sequences
* Know which servers and tools are available to predict disordered regions in your protein sequences
* Realise that it is often good to analyse the same sequence using several different methods
* Know of situations in which it could be useful for you to apply these techniques
What is Native Disorder
Natively disordered regions are segments of a protein that, under native conditions, do not exhibit a fixed or stable 3D structure. The disorder may be transient or permanent. A more precise definition includes an analysis of the flexibility of the amino acid backbone: if the Ramachandran angles do not exhibit a tendency towards equilibrium positions over time, the region is said to be disordered. In contrast, the Ramachandran angles of residues in stable 3D structures are restricted and therefore settle into equilibrium positions.
Under the induced fit model of protein interactions, a segment of a protein may only form a stable 3D structure when in direct contact with another biomolecule. Examples include DNA binding and protein complex formation. A clear example of this is SH3 ligands: short motifs in the disordered region of one protein that interact with the SH3 domain of another protein.
The traditional view that the structure of a protein determines its function is slowly unravelling (no pun intended). Non-folded or unstructured regions of more and more proteins have been shown to be important in the regulation of pathways, mediating complex formation, enzymatic kinetics etc. These natively disordered regions of proteins contain many signals or motifs that play important biological roles.
Types of algorithm
Although it's a well-used word, there are many differing definitions of precisely what constitutes an algorithm, so I'll add to the fray by suggesting that it is a set of instructions for carrying out transformations (plus, minus, divide) on objects (usually numbers). In biological sequence analysis the objects are usually represented as characters. However, this is just a way of hiding the inner details of the implementation from the user: inside the computer the characters are represented by numbers. In general, algorithms are used to solve problems in a systematic or formalised way, which makes them inherently reproducible.
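To make the point about characters and numbers concrete, here is a minimal Python sketch. The 20-letter alphabet ordering and the function names `encode` and `decode` are assumptions chosen for illustration; real tools may use a different ordering or encoding.

```python
# Illustrative only: the alphabet order below is an arbitrary choice,
# not the encoding used by any particular prediction tool.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence):
    """Turn a protein sequence (letters) into a list of integer indices."""
    return [AA_TO_INDEX[aa] for aa in sequence.upper()]


def decode(indices):
    """Turn a list of integer indices back into letters."""
    return "".join(AMINO_ACIDS[i] for i in indices)


if __name__ == "__main__":
    start = "MWTLTC"  # first six residues of the example sequence
    print(encode(start))
    print(decode(encode(start)))
```

Once the sequence is a list of numbers, it can be fed to arithmetic operations such as averaging or table lookups, which is what the prediction methods do internally.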
Rather than finding the best solution, these algorithms focus on finding good solutions that, while not optimal, take much less time to find. These types of methods are particularly appealing to biologists since they seem to mirror the stochastic mechanisms at work in biology. They offer a method for discerning similarities in a sea of chaos and random variation.
These are probably what you think of when someone mentions the word algorithm. Input flows through the algorithm and is transformed according to a set of decisions along the way. The algorithm can be drawn as a flowchart: at each decision point the input is compared against a threshold and follows one of several branches depending on the result. The internals of the algorithm are fixed, so the same input will always produce the same output.
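A single decision point of such a flowchart might be sketched in Python as follows. The set of "disorder-promoting" residues and the 0.5 threshold are assumptions made up for this illustration and are not taken from any published predictor.

```python
# Hypothetical residue set and threshold, for illustration only.
DISORDER_PROMOTING = set("AEGKPQRS")


def classify(segment, threshold=0.5):
    """Label a segment using one fixed threshold decision."""
    if not segment:
        raise ValueError("segment must not be empty")
    # Fraction of residues in the segment that belong to the chosen set.
    fraction = sum(aa in DISORDER_PROMOTING for aa in segment) / len(segment)
    # Decision point: the branch taken depends only on the input and threshold.
    if fraction >= threshold:
        return "disordered"
    return "ordered"


if __name__ == "__main__":
    print(classify("QSQSQSQS"))
    print(classify("WFYILVWC"))
```

Because nothing inside `classify` changes between calls, running it twice on the same segment always gives the same label.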
Rule-based search methods, if properly defined, will offer the best results. However, the definition may be difficult to achieve or difficult to implement, or the search may take too long to complete. Therefore, probabilistic methods have been created, which have the advantage that they may better encapsulate biological problems. Heuristic-based methods try to find a middle ground between black-box solutions like artificial neural networks (ANNs) and deterministic algorithms. They use rules of thumb to try to improve the speed of the method.
The algorithm used here is based on the pairwise inter-residue interaction energy. Like GlobPlot, this algorithm works on the assumption that intrinsically unstructured proteins (IUPs) have characteristic amino acid frequencies. It assumes that unfolded sequences do not form stable structures because the interactions between their residues are unfavourable. A matrix of the expected pairwise energies of each residue is generated and used to make comparisons between ordered and unordered sequences. A clear separation in the energy content of ordered and unordered sequences is observed.
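The idea of scoring each residue by its expected pairwise interaction energy can be sketched roughly as below. This is not the published method: the function `toy_pair_energy` stands in for the real energy matrix with made-up values (hydrophobic pairs favourable, like charges unfavourable), and the `half_window` default is likewise an assumption.

```python
# Toy residue classes; the real method uses a 20 x 20 energy matrix.
HYDROPHOBIC = set("AVLIMFWC")
POSITIVE = set("KR")
NEGATIVE = set("DE")


def toy_pair_energy(a, b):
    """Made-up pairwise energy: negative values are favourable."""
    if a in HYDROPHOBIC and b in HYDROPHOBIC:
        return -1.0
    if (a in POSITIVE and b in POSITIVE) or (a in NEGATIVE and b in NEGATIVE):
        return 0.5
    return 0.0


def estimated_energies(sequence, half_window=10):
    """Average toy pair energy of each residue with its sequence neighbours."""
    energies = []
    for i, aa in enumerate(sequence):
        start = max(0, i - half_window)
        end = min(len(sequence), i + half_window + 1)
        # Neighbours within the window, excluding the residue itself.
        neighbours = sequence[start:i] + sequence[i + 1:end]
        if not neighbours:
            energies.append(0.0)
            continue
        total = sum(toy_pair_energy(aa, other) for other in neighbours)
        energies.append(total / len(neighbours))
    return energies


if __name__ == "__main__":
    print(estimated_energies("LLLLLKKKKK", half_window=2))
```

Under these toy values, a hydrophobic stretch receives a negative (favourable) average energy, while a run of like charges receives a positive one, which mirrors the separation between ordered and unordered sequences described above.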
Database of Disordered Sequences. If you have experimentally validated sequences please send them here.
They have used the collected data to analyse the frequency of different amino acids in the disordered database and used this to make predictions about the likelihood a sequence will be found in ordered or disordered regions. They used a wide range of properties of amino acids to attempt to classify the amino acids as order promoting or disorder promoting. As an example one of the best predictors was a residue contact scale. Other high performers included hydrophobicity scales.
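A frequency comparison of this kind can be sketched as a log-odds score per amino acid. The two tiny sequence sets in the usage example are invented for illustration and are not taken from the database; the pseudocount and the function names are also assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequences, pseudocount=1.0):
    """Relative frequency of each amino acid, smoothed with a pseudocount."""
    counts = Counter()
    for seq in sequences:
        counts.update(aa for aa in seq.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def log_odds(disordered, ordered):
    """Positive score: enriched in the disordered set; negative: depleted."""
    f_dis = composition(disordered)
    f_ord = composition(ordered)
    return {aa: math.log(f_dis[aa] / f_ord[aa]) for aa in AMINO_ACIDS}


if __name__ == "__main__":
    # Invented toy data, not real database entries.
    toy_disordered = ["QSQSQSESQ", "SQSQKQSQS"]
    toy_ordered = ["LIVFWLIVA", "VLIAFCWLI"]
    scores = log_odds(toy_disordered, toy_ordered)
    for aa in sorted(scores, key=scores.get, reverse=True):
        print(aa, round(scores[aa], 2))
```

Residues with positive scores would be called disorder promoting and those with negative scores order promoting, under the assumptions of this toy example.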
DisProt prediction tools
DisProt uses a range of Artificial Neural Network based prediction methods. The differences between them arise from the dataset used to train the algorithm and from the parameters used internally by the algorithm, such as the size of the population, number of iterations etc. http://www3.interscience.wiley.com/cgi-bin/abstract/106559004/ABSTRACT?CRETRY=1&SRETRY=0
These also use the sliding window scheme to make local predictions about the folded nature of the target protein. These algorithms encode information about the frequencies of amino acids and their positions, together with the flexibility index, hydropathy, net charge and coordination number (each averaged over the window Win), and entropy, a measure of sequence complexity.
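The sliding window scheme can be illustrated with three of the features named above. In this sketch the hydropathy values are the Kyte-Doolittle scale, net charge simply counts K/R as +1 and D/E as -1, and entropy is the Shannon entropy of the window. The window size of 21 and the function names are assumptions, and a real predictor would pass such features to a trained network rather than print them.

```python
import math
from collections import Counter

# Kyte-Doolittle hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def shannon_entropy(window):
    """Sequence complexity of a window, in bits."""
    n = len(window)
    counts = Counter(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def net_charge(window):
    """Simplified net charge: K and R count +1, D and E count -1."""
    positive = sum(window.count(aa) for aa in "KR")
    negative = sum(window.count(aa) for aa in "DE")
    return positive - negative


def window_features(sequence, window_size=21):
    """(mean hydropathy, net charge, entropy) for every full window."""
    if window_size < 1 or window_size > len(sequence):
        raise ValueError("window_size must be between 1 and len(sequence)")
    features = []
    for start in range(len(sequence) - window_size + 1):
        window = sequence[start:start + window_size]
        hydropathy = sum(KYTE_DOOLITTLE[aa] for aa in window) / window_size
        features.append((hydropathy, net_charge(window), shannon_entropy(window)))
    return features


if __name__ == "__main__":
    for row in window_features("QSQSQSESQSQSQSESQKQSQS", window_size=21):
        print(row)
```

Each tuple describes one window position, so the prediction for a residue depends only on its local sequence neighbourhood.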