Burkhard Rost: Publications in molecular biology

Burkhard Rost
Address: EMBL, 69 012 Heidelberg, Germany
Correspondence: Burkhard Rost
e-mail: rost@embl-heidelberg.de




Abstracts


Exercising Multi-layered Networks on Protein Secondary Structure

Burkhard Rost & Chris Sander

in: O. Benhar, S. Brunak, P. DelGiudice and M. Grandolfo (eds.), Neural Networks: From Biology to High Energy Physics, Elba, Italy: Intern. J. of Neural Systems, 209-220 (1992).

The quality of a multi-layered network predicting the secondary structure of proteins is improved substantially by: (i) using information about evolutionarily conserved amino acids (increase of overall accuracy by six percentage points), (ii) balancing the training dynamics (increase of accuracy for strand), and (iii) combining uncorrelated networks in a jury (increase of two percentage points). In addition, appending a second-level structure-to-structure network results in better reproduction of the length of secondary structure segments.
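
The jury step in (iii) can be pictured with a few lines of code. This is a minimal sketch, assuming the jury takes the arithmetic mean of each network's three output values and predicts the state with the largest mean; the abstract does not state the combination rule, and the names `jury_decision` and `STATES` are illustrative only.

```python
STATES = ("H", "E", "L")  # helix, strand, loop


def jury_decision(network_outputs):
    """Average the per-state outputs of several networks and pick the winner.

    network_outputs: list of (helix, strand, loop) tuples, one per network,
    all for the same residue. The averaging rule is an assumption.
    """
    n = len(network_outputs)
    means = [sum(out[i] for out in network_outputs) / n for i in range(3)]
    best = max(range(3), key=lambda i: means[i])
    return STATES[best], means


# Three hypothetical networks that disagree on one residue.
outputs = [(0.6, 0.3, 0.1), (0.2, 0.5, 0.3), (0.5, 0.2, 0.3)]
state, means = jury_decision(outputs)
```

Averaging helps only to the extent that the networks make different errors, which is presumably why the abstract stresses that the jury members are uncorrelated.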


Neural Networks in Chemistry

Burkhard Rost & Gerrit Vriend

CDA News, 8, 24-27 (1993).


Prediction of protein secondary structure at better than 70% accuracy

Burkhard Rost & Chris Sander

J. Mol. Biol., 232, 584-599 (1993).

We have trained a two-layered feed-forward neural network on a non-redundant database of 130 protein chains to predict the secondary structure of water-soluble proteins. A new key aspect is the use of evolutionary information in the form of multiple sequence alignments that are used as input in place of single sequences. The inclusion of protein family information in this form increases the prediction accuracy by 6-8 percentage points. A combination of three levels of networks results in an overall three-state accuracy of 70.8% for globular proteins (sustained performance). If four membrane protein chains are included in the evaluation, the overall accuracy drops to 70.2%. The prediction is well balanced between α-helix, β-strand and loop. 65% of the observed strand residues are predicted correctly. The accuracy in predicting the content of three secondary structure types is comparable to that of circular dichroism spectroscopy. The performance accuracy is verified by a seven-fold cross-validation test, and an additional test on 26 recently solved proteins. Of particular practical importance is the definition of a position-specific reliability index. For the half of the residues predicted with high reliability the overall accuracy increases to better than 82%. A further strength of the method is the more realistic prediction of segment length. The protein family prediction method is available for testing by academic researchers via an electronic mail server.
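
The trade-off between reliability and coverage described above can be illustrated as follows. This is only a sketch: the abstract does not give the formula for the reliability index, so the code assumes an index from 0 to 9 that grows with the gap between the two largest network outputs. The function names are hypothetical.

```python
def reliability_index(outputs):
    """Map one residue's three output values (each in 0..1) to an integer 0..9.

    Assumption: reliability grows with the gap between the largest and the
    second-largest output. The formula is illustrative only.
    """
    top, second = sorted(outputs, reverse=True)[:2]
    return min(9, int(10 * (top - second)))


def accuracy_above(predicted, observed, reliabilities, threshold):
    """Return (accuracy, coverage) for residues with reliability >= threshold."""
    kept = [(p, o) for p, o, r in zip(predicted, observed, reliabilities)
            if r >= threshold]
    if not kept:
        return 0.0, 0.0
    correct = sum(p == o for p, o in kept)
    return correct / len(kept), len(kept) / len(predicted)


# Toy example: the one wrong residue is also the least reliable one.
accuracy, coverage = accuracy_above("HHEL", "HEEL", [5, 0, 9, 5], threshold=5)
```

Raising the threshold keeps fewer residues but, if the index is informative, a larger share of them is predicted correctly.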


Progress in protein structure prediction?

Burkhard Rost, Chris Sander & Reinhard Schneider

TIBS, 18, 120-123 (1993).

Prediction of protein secondary structure is an old problem and progress has been slow over the years. Recently, spectacular success has been claimed in the blind prediction of the catalytic subunit of the cAMP dependent protein kinase. When predictions in this and other test cases are assessed critically, some claims of prediction success turn out to be exaggerated, but a kernel of real progress remains: protein structure prediction can be improved substantially when a family of related sequences is available. Enough so that molecular biologists equipped with a new amino acid sequence and a multiple sequence alignment in hand may be tempted to test the new prediction methods.


Improved prediction of protein secondary structure by use of sequence profiles and neural networks

Burkhard Rost & Chris Sander

Proc. Natl. Acad. Sci. U.S.A., 90, 7558-7562 (1993).

The explosive accumulation of protein sequences in the wake of large scale sequencing projects is in stark contrast to the much slower experimental determination of protein structures. Improved methods of structure prediction from the gene sequence alone are therefore needed. Here, we report a substantial increase in both the accuracy and quality of secondary structure predictions, using a neural network algorithm. The main improvements come from the use of multiple sequence alignments (better overall accuracy), from 'balanced training' (better prediction of β-strands) and from 'structure context training' (better prediction of helix and strand lengths). The best method, cross-validated on seven different test sets purged of sequence similarity to learning sets, achieves a three-state prediction accuracy of 69.7%, significantly better than any previous method. The improved distribution and accuracy of helices and strands makes the predictions well suited for use in practice as a first estimate of structural type of newly sequenced proteins.


Secondary structure prediction of all-helical proteins in two states

Burkhard Rost & Chris Sander

Prot. Engin., 6, 831-836 (1993).

Can secondary structure prediction be improved by prediction rules that focus on a particular structural class of proteins? To help answer this question, we have assessed the accuracy of prediction for all-helical proteins, using two conceptually different methods and two levels of description. An overall two-state single residue accuracy of about 80% can be obtained by a neural network, no matter whether it is trained on two states (helix, non-helix), or first trained on three states (helix, strand, loop) and then evaluated on two states. For four test proteins, this is similar to the accuracy obtained with inductive logic programming. We conclude that on the level of secondary structure, there is no practical advantage in training on two states, especially given the added margin of error in identifying the structural class of a protein. In the further development of these methods, it is increasingly important to focus on aspects of secondary structure that aid in the construction of a correct three-dimensional model, such as the correct placement of segments.


PHD - an automatic server for protein secondary structure prediction

Burkhard Rost, Reinhard Schneider & Chris Sander

CABIOS, 10, 53-60 (1994).

In the middle of 1993, more than 30,000 protein sequences are known. For 1000 of these the three-dimensional (tertiary) structure is experimentally solved. Another 7000 can be modelled by homology. For the remaining 21,000 sequences secondary structure prediction provides a rough estimate of structural features. Predictions in three states rate between 36% (random) and 88% (homology modelling) overall accuracy. Using information about evolutionary conservation as contained in multiple sequence alignments, the secondary structure of 4700 protein sequences was predicted by the automatic e-mail server PHD. For proteins with at least one known homologue, the method has an expected overall three-state accuracy of 71.4% (evaluated on 126 unique protein chains).


Redefining the goals of protein secondary structure prediction

Burkhard Rost, Reinhard Schneider & Chris Sander

J. Mol. Biol., 235, 13-26 (1994).

Secondary structure prediction recently has surpassed the 70% level of average accuracy, evaluated on the single residue states helix, strand and loop (Q3). But the ultimate goal is reliable prediction of tertiary (3D) structure, not 100% single residue accuracy for secondary structure. A comparison of pairs of structurally homologous proteins with divergent sequences reveals that considerable variation in the position and length of secondary structure segments can be accommodated within the same 3D fold. It is therefore sufficient to predict the approximate location of helix, strand, turn and loop segments, provided they are compatible with the formation of 3D structure. Accordingly, we define here a measure of segment overlap (Sov) that is somewhat insensitive to small variations in secondary structure assignments. The new segment overlap measure ranges from an ignorance level of 37% (random protein pairs) via a current level of 72% for a prediction method based on sequence profile input to neural networks (PHD) to an average 90% level for homologous protein pairs. We conclude that the highest scores one can reasonably expect for secondary structure prediction are a single residue accuracy of Q3>85% and a fractional segment overlap of Sov>90%.
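
The single-residue measure Q3 referred to above is the percentage of residues whose predicted state (helix, strand or loop) matches the observed one. The short sketch below computes it for a made-up example in which the predicted helix is merely shifted by one residue; the sequences and the function name `q3` are illustrative, and the segment overlap measure (Sov) is not reproduced here because the abstract does not give its definition.

```python
def q3(predicted, observed):
    """Percentage of residues whose predicted state matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("strings must have equal length")
    if not observed:
        raise ValueError("empty input")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# H = helix, E = strand, L = loop. The predicted helix is shifted by one
# residue, so two residues count as errors although the segment is found.
observed = "LHHHHHLLEEEELL"
predicted = "LLHHHHHLEEEELL"
score = q3(predicted, observed)
```

A per-residue score penalises such a shift at both ends of the helix, which is the kind of small variation a segment-based measure is meant to tolerate.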


Combining evolutionary information and neural networks to predict protein secondary structure

Burkhard Rost & Chris Sander

Proteins, 19, 55-72 (1994).

Using evolutionary information as contained in multiple sequence alignments as input to neural networks, secondary structure can be predicted at significantly increased accuracy. Here, we extend our previous three-level system of neural networks by using additional input information derived from multiple alignments. Using a position-specific conservation weight as part of the input increases performance. Using the number of insertions and deletions reduces the tendency for overprediction and increases overall accuracy. Addition of the global amino acid content yields a further improvement, mainly in predicting structural class. The final network system has a sustained overall accuracy of 71.6% in a multiple cross-validation test on 126 unique protein chains. A test on a new set of 124 recently solved protein structures that have no significant sequence similarity to the learning set confirms the high level of accuracy. The average cross-validated accuracy for all 250 sequence-unique chains is above 72%. Using various data sets, the method is compared to alternative prediction methods, some of which also use multiple alignments: the performance advantage of the network system is at least 6 percentage points in three-state accuracy. In addition, the network estimates secondary structure content from multiple sequence alignments about as well as circular dichroism spectroscopy on a single protein and classifies 75% of the 250 proteins correctly into one of four protein structural classes. Of particular practical importance is the definition of a position-specific reliability index. For 40% of all residues the method has a sustained three-state accuracy of 88%, as high as the overall average for homology modelling. A further strength of the method is greatly increased accuracy in predicting the placement of secondary structure segments.


Evolution and Neural Networks - Protein secondary structure prediction above 71% accuracy

Burkhard Rost, Reinhard Schneider & Chris Sander

In: L. Hunter (ed.) "27th Hawaii International Conference on System Sciences, Wailea, Hawaii, U.S.A." IEEE Society Press, 385-394 (1994).

Some 30,000 protein sequences are known. For 1,000 the structure is experimentally solved. Another 4,000 can be modeled by homology. For the remaining 25,000 sequences, the tertiary structure (3D) cannot be predicted generally from the sequence. A reduction of the problem is the projection of 3D structure onto a one-dimensional string of secondary structure assignments. Predictions in three states rate between 36% (random) and 88% (homology modelling) accuracy. Here, we present an improvement of a neural network system using information about evolutionary conservation. The method achieves a sustained overall accuracy of 71.4%. A test on 45 new proteins confirms the estimated accuracy. Of practical importance is the definition of a reliability index at each residue position: e.g. about 40% of the predicted residues have an expected accuracy of 88%. The method has been made publicly available by an automatic e-mail server.


1D secondary structure prediction through evolutionary profiles

Burkhard Rost & Chris Sander

In: Bohr, H. and Brunak, S. (eds.) "Distance-Based Approaches to Protein Structure Determination" Amsterdam: IOS press, North-Holland, 257-276 (1994).

For only about one third of the new proteins, three-dimensional (3D) structure can be predicted. For the remaining two thirds, a compromise has to be made. An extreme simplification is the projection of 3D structure onto a string of 1D secondary structure assignments. Here, we report how neural networks can be configured such that strand is predicted significantly better, and that the prediction looks like native proteins in terms of the length of predicted segments. Using evolutionary information contained in multiple sequence alignments as input to neural networks, secondary structure can be predicted at significantly increased accuracy. Pre-processing the alignment information by using a position-specific conservation weight and the number of insertions and deletions in each alignment position is found to be advantageous. Addition of the global amino acid content yields a further improvement, mainly in predicting structural class. The final network system has a sustained overall accuracy of more than 72% evaluated on 250 sequence-unique chains. Of particular practical importance is the definition of a position-specific reliability index. For 40% of all residues the method has a sustained three-state accuracy of 88%, as high as the overall average for homology modelling.


Conservation and prediction of solvent accessibility in protein families

Burkhard Rost & Chris Sander

Proteins, 20, 216-226 (1994).

Predicting three-dimensional (3D) protein structure from sequence alone is, in general, currently an insurmountable task. As an intermediate step, a much simpler task has been pursued extensively: prediction of a projection of 3D structure onto 1D strings of secondary structure. Here, we present an analysis of another 1D projection of 3D structure: the relative solvent accessibility of a residue. We show that solvent accessibility is less conserved in 3D families than secondary structure: the average correlation of relative solvent accessibility between 3D homologues is only 0.66. This value provides an effective practical upper limit for the accuracy of predicting accessibility from sequence. We introduce a neural network system that predicts relative solvent accessibility (projected onto 10 discrete states) using evolutionary profiles of amino acid substitutions derived from multiple sequence alignments. Evaluated in a cross-validation test on 126 unique proteins, the correlation between predicted and observed relative accessibility is 0.54. For a ternary (buried, intermediate, exposed) description of relative accessibility the fraction of correctly predicted residue states is about 58%. In absolute terms, this accuracy appears poor, but given the relatively low conservation of accessibility in 3D families (correlation 0.66), the network system is not far from optimal performance. Prediction is best for buried residues, e.g. 86% of the completely buried sites are correctly predicted as having 0% relative accessibility.
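
The correlation values quoted above compare two per-residue series of relative accessibility. A minimal sketch of such a comparison follows, assuming the correlation meant is the ordinary Pearson coefficient (the abstract does not name it); the function name and the toy values are illustrative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Relative accessibility in percent for five residues (made-up values).
observed_acc = [0, 10, 45, 80, 100]
predicted_acc = [5, 0, 30, 60, 90]
r = pearson(predicted_acc, observed_acc)
```

The same function applied to two aligned homologues, instead of prediction and observation, would give the kind of conservation figure the abstract uses as a practical upper limit.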


Structure prediction of proteins - where are we now?

Burkhard Rost & Chris Sander

Curr. Opin. Biotech., 5, 372-380 (1994).

Although the structure-from-sequence prediction problem remains fundamentally unsolved, new and promising methods in 3D, 2D, and 1D have reopened the field. Pseudopotentials or information values derived from the databases can distinguish between correct and incorrect models (3D). Interresidue contacts (2D) can be detected by the analysis of correlated mutations, albeit with low accuracy. Significantly improved prediction of secondary structure (1D) from multiple sequence alignments is now available in daily practice.


Prediction of helical transmembrane segments at 95% accuracy

Burkhard Rost, Rita Casadio, Piero Fariselli & Chris Sander

Protein Science, 4, 521-533 (1995).

We describe a neural network system that predicts the locations of transmembrane helices in integral membrane proteins. By using evolutionary information as input to the network system, the method significantly improved on a previously published neural network prediction method that had been based on single sequence information. The input data was derived from multiple alignments for each position in a window of 13 adjacent residues: amino acid frequency, conservation weights, number of insertions and deletions, and position of the window with respect to the ends of the protein chain. Additional input was the amino acid composition and length of the whole protein. A rigorous cross-validation test on 69 proteins with experimentally determined locations of transmembrane segments yielded an overall two-state per-residue accuracy of 95%. About 94% of all segments were predicted correctly. When applied to known globular proteins as a negative control, the network system incorrectly predicted fewer than 5% of globular proteins as having transmembrane helices. The method was applied to all 269 open reading frames from the complete yeast VIII chromosome. For 59 of these at least two transmembrane helices were predicted. Thus, the prediction is that about one fourth of all proteins from yeast VIII contain one transmembrane helix, and some 20% more than one.
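
The windowed profile input described above can be sketched as follows. The example covers only the amino acid frequencies in a window of 13 residues; conservation weights, insertion and deletion counts, window position and global composition are left out. Zero-filling beyond the chain ends and all names (`column_frequencies`, `window_input`) are assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
WINDOW = 13  # window width stated in the abstract


def column_frequencies(alignment):
    """Per-position amino acid frequencies from equal-length aligned sequences."""
    profile = []
    for column in zip(*alignment):
        # Gap characters and unknown symbols are ignored.
        counts = Counter(c for c in column if c in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append([counts[a] / total if total else 0.0
                        for a in AMINO_ACIDS])
    return profile


def window_input(profile, center, width=WINDOW):
    """Concatenate the frequency vectors of the window centred on `center`.

    Positions beyond the chain ends are filled with zeros (an assumption).
    """
    half = width // 2
    blank = [0.0] * len(AMINO_ACIDS)
    vector = []
    for pos in range(center - half, center + half + 1):
        vector.extend(profile[pos] if 0 <= pos < len(profile) else blank)
    return vector


# A toy alignment of three sequences, four positions long.
alignment = ["ACDA", "ACEA", "GCDA"]
profile = column_frequencies(alignment)
first_window = window_input(profile, center=0)  # 13 * 20 = 260 values
```

Each residue thus yields one fixed-length input vector, so the network sees the local substitution pattern of the whole family rather than a single sequence.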


Protein Structure Prediction by Neural Networks

Burkhard Rost & Chris Sander

In: M. Arbib (ed.) "The Handbook of Brain Theory and Neural Networks" Cambridge, MA: Bradford Books/The MIT Press, 772-775 (1995).

Introduction

What is a protein? The information for life is stored by a four-letter alphabet in the genes (DNA: deoxy-ribonucleic acid). Proteins are, among others, the macromolecules that perform all important tasks in organisms, such as catalysis of biochemical reactions, transport of nutrients, recognition and transmission of signals. Thus, genes are the blueprints and proteins the machinery of life. Proteins are formed by joining the amino acids into a long stretched chain, the protein sequence, a translation of the genes into a 20-letter alphabet of amino acids. Proteins differ in length (from 30 to 30,000 amino acids) and in the arrangement of amino acids (called residues, when joined in proteins). In water, the chain folds up to a unique three-dimensional (3D) structure. The main driving force is the need to pack residues for which a contact with water is energetically unfavourable (hydrophobic residues) into the interior of the molecule. A detailed analysis of the underlying chemistry shows that this is only possible if the protein forms regular patterns of a macroscopic substructure called secondary structure (Fig. 1; for an introduction to protein structure see Branden, 1991).

What determines protein function and structure? The 3D structure of a protein determines its function. But what determines the 3D structure? It is well established that the details of the 3D structure (also called the fold) are uniquely determined by the specificity of the sequence. Can the code be deciphered, i.e. can 3D structure be predicted from sequence? In principle, the code could be deciphered by calculating the physico-chemical force fields determining the fold. Unfortunately, the required computer time to calculate the 3D structure based on first principles is many orders of magnitude beyond today's possibilities. However, it is of practical importance to know the 3D structure. One reason is rational drug design.

Why not simply look by microscope at the 3D structure? Unfortunately, the techniques to experimentally determine 3D structure of a protein are rather complicated. Solving a structure can take from one to several years. Today for some 36,000 proteins the sequence is known, but only for 2,000 has the 3D structure been determined by experiment. Large gene sequencing projects increase the sequence-structure gap further. The most accurate way to predict 3D structure from the sequence is by homology modelling, i.e., search for a protein with similar sequence that has a known 3D structure and then model the 3D structure of the unknown protein in analogy to the known one. Such techniques lead to a reduction of the sequence-structure gap by some 9,000 proteins. However, there are still some 25,000 proteins for which researchers would like to know as much as possible about the structure, and this number is rapidly increasing.

Why can homology modelling be successful? The exchange of a few residues can already destabilise a protein. This implies that the majority of the 20^N possible sequences of length N form different structures. But, has evolution created such an immense variety? Random errors in the DNA sequence lead to a different translation of protein sequences. These 'errors' are the basis of evolution. Mutations resulting in a structural change are not likely to be accepted, since the protein cannot perform its task. Furthermore, the universe of stable structures is not continuous, i.e., minor changes on the level of the 3D structure destabilise the structure. The evolutionary pressure to conserve function and the discontinuity of the universe of structures have the result that structure is evolutionarily more conserved than sequence. Evolution has produced pairs of proteins which have the same 3D structure with only 25% identical residues. Therefore, the 3D structure can be predicted rather accurately by homology if a protein with sufficient sequence identity and known 3D structure is found in the data bank.

Can the egg be unboiled? When an egg is boiled, the proteins it contains unfold. Can this procedure be reversed in theory? Or, in other words, can the encrypted code of protein folding be deciphered from the sequence? Current tools to predict the 3D structure from the sequence are rather limited (Rost and Sander, 1994b). Therefore, the problem has to be simplified. One extreme simplification is the prediction of one dimensional (1D) strings of secondary structure assignment. Others are the location of functionally important residues, the classification of proteins into structurally related families, or the prediction of whether or not a particular residue is buried in the core of the protein.

How can neural networks predict protein structure? In practice, the most successful predictions are based on an analysis of common features in the data bank of known 3D structures. Artificial neural networks are well suited for pattern classification. Here, we shall attempt to show how the application of neural networks as devices for pattern classification can be used for the prediction of protein structure. First, we give examples of how the data bank of known 3D structures can be used to predict secondary structure (1), and other structural features (2). Then, we briefly review attempts to predict entire 3D structures (3). Finally, we give a critical evaluation of the neural network applications by comparing these to alternative approaches and outline the prospects of applying neural networks for protein structure prediction.


Fitting 1D predictions into 3D structures

Burkhard Rost

In: Bohr, H. and Brunak, S. "Protein folds: a distance based approach" Boca Raton, Florida: CRC Press, 132-151 (1995).

The experimental determination of protein structure cannot keep pace with the rapid generation of new sequence information. Can theory contribute? The most successful prediction method - and the only one for prediction of 3D structure - is homology modelling. It is applicable for about one quarter of the proteins. For the rest, the prediction task has to be simplified. An extreme simplification is to project 3D structure onto 1D strings of secondary structure or solvent accessibility. For these 1D aspects of 3D structure, prediction accuracy has been improved significantly by using evolutionary information as input to neural network systems. The gain in accuracy is based on the conservation of secondary structure and relative solvent accessibility within sequence families. Secondary structure and accessibility are conserved, as well, between remote homologues. This fact can be used by fitting 1D predictions into 3D structures to detect such remote homologues. In comparison to other threading approaches, 1D threading is rather flexible. However, two factors decrease detection accuracy: first, the loss of information by projecting 3D structure onto 1D strings (in particular the loss of distances between secondary structure segments); and second, the inaccuracy of predicting 1D structure. A preliminary result is that every fifth remote homologue is detected correctly.


TOPITS: Threading One-dimensional Predictions Into Three-dimensional Structures

Burkhard Rost

In: C. Rawlings, D. Clark, R. Altman, L. Hunter, T. Lengauer and S. Wodak (eds.), Third International Conference on Intelligent Systems for Molecular Biology, Cambridge, England: Menlo Park, CA: AAAI Press, 314-321 (1995).

Homology modelling, currently, is the only theoretical tool which can successfully predict protein 3D structure. As 3D structure is well conserved within sequence families, homology modelling allows prediction of 3D structure for 20% of the SWISSPROT proteins. 20% of the proteins in PDB are remote homologues to another PDB protein, i.e. the structures are homologous but pairwise sequence identity is not significant. Threading techniques attempt to predict such remote homologues based on sequence information to thus increase the scope of homology modelling. Here, a new threading method is presented. First, for a list of PDB proteins, 3D structure was projected onto 1D strings of secondary structure and relative solvent accessibility. Then, secondary structure and solvent accessibility were predicted by neural network systems (PHD) for a search sequence. Finally, the predicted and observed 1D strings were aligned by dynamic programming. The resulting alignment was used to detect remote 3D homologues. Four results stand out. First, even for an optimal prediction of 1D strings (taken from PDB), only about half the hits that ranked above a given threshold were correctly identified as remote homologues; only about 25% of the first hits were correct. Second, real predictions (PHD) were not much worse: about 20% of the first hits were correct. Third, a simple filtering procedure improved prediction performance to about 30% correct first hits. With such a filter, the correct hit ranked among the first three for more than 23 out of 46 cases. Fourth, the combination of the 1D threading and sequence alignments markedly improved the performance of the threading method TOPITS for some selected cases.
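
The alignment step named above, matching a predicted 1D string against an observed one by dynamic programming, can be sketched with a standard global alignment. This is not the TOPITS scoring scheme, which the abstract does not give: the match, mismatch and gap values are placeholders, only secondary structure is aligned (accessibility is omitted), and the name `align_1d` is hypothetical.

```python
def align_1d(predicted, observed, match=1, mismatch=-1, gap=-1):
    """Global alignment of two 1D structure strings (Needleman-Wunsch style).

    Returns (score, aligned_predicted, aligned_observed). The scoring
    parameters are illustrative placeholders.
    """
    n, m = len(predicted), len(observed)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if predicted[i - 1] == observed[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if predicted[i - 1] == observed[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                top.append(predicted[i - 1])
                bottom.append(observed[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(predicted[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(observed[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


# Predicted string of a search sequence against the string of a known fold.
best_score, aligned_pred, aligned_obs = align_1d("HHLEE", "HHHLEE")
```

Ranking all known folds by such an alignment score is one way to produce the hit list whose first entries the abstract evaluates.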


Progress of 1D protein structure prediction at last

Burkhard Rost & Chris Sander

Proteins, 23, 295-300 (1995).

Accuracy of predicting protein secondary structure and solvent accessibility from sequence information has been improved significantly by using information contained in multiple sequence alignments as input to a neural network system. For the Asilomar meeting, predictions for 13 proteins were generated automatically using the publicly available prediction method PHD. The results confirm the estimate of 72% three-state prediction accuracy. The fairly accurate predictions of secondary structure segments made the tool useful as a starting point for modelling of higher dimensional aspects of protein structure.


Bridging the protein sequence-structure gap by predictions?

Burkhard Rost & Chris Sander

Annual Review of Biophysics and Biomolecular Structure, 25, 113-136 (1996).

The problem of accurately predicting protein three-dimensional structure from sequence has yet to be solved. Recently, a number of new and promising methods that work in one, two or three dimensions have invigorated the field. Modelling by homology can yield fairly accurate three-dimensional structures for about 25% of the currently known protein sequences. Techniques for cooperatively fitting sequences into known three-dimensional folds, called threading methods, are capable of increasing this rate by detecting very remote homologies in favourable cases. Prediction of protein structure in two dimensions, i.e., prediction of inter-residue contacts, is in its infancy. Prediction tools that work in one dimension are both mature and generally applicable; they predict secondary structure, residue solvent accessibility, and the location of transmembrane helices with reasonable accuracy. These and other prediction methods have gained immensely from the rapid increase of information in publicly accessible databases. Growing databases will lead to further improvements of prediction methods and thus to narrowing the gap between the number of known protein sequences and known protein structures.


NN which predicts protein secondary structure

Burkhard Rost

In: E. Fiesler and R. Beale (eds.) "Handbook of Neural Computation" New York: Oxford Univ. Press, G4.1 (1996).

Currently, the prediction of three-dimensional (3D) protein structure from sequence alone poses insurmountable difficulties. As an intermediate step, a much simpler task has been pursued extensively: predicting 1D strings of secondary structure. Here, a composite neural network is described which predicts three secondary structure states (helix, strand, loop). The network system comprises two levels of feed-forward networks (one hidden layer each) and a final jury decision over differently trained networks. Training is done by an adaptive-like back-propagation. A key feature of the system is that the input is not only the sequence of one protein but the profile of a whole family of sequences of proteins that have the same 3D structure. The combination of the problem-specific topology and the pre-processing of the input improves prediction accuracy from some 62% to 72%. Furthermore, the specific topology and training procedure successfully correct for shortcomings of both simpler NN and classical methods. Over the last few years, the system has been the best automatic predictor in a very competitive area of research.


PHD: predicting 1D protein structure by profile based neural networks

Burkhard Rost

In: Doolittle, R. (ed.) "Computer Methods for Macromolecular Sequence Analysis" Methods in Enzymology, 266, 525-539 (1996).


Introduction

We still cannot predict protein three-dimensional (3D) structure from sequence alone. But we can predict 3D structure for one fourth of the known protein sequences (SWISSPROT) by homology modelling based on significant sequence identity (>25%) to known 3D structures (PDB). For the remaining roughly 30,000 known sequences, the prediction problem has to be simplified. An extreme simplification is to try to predict projections of 3D structure, e.g., 1D secondary structure, solvent accessibility, or transmembrane location assignments for each residue.

Despite the extreme simplification, the success of 1D predictions has been limited as segments from single sequences (used as input) do not contain sufficient global information about 3D structures. Patterns of amino acid substitutions within sequence families are highly specific for the 3D structure of that family. Using such evolutionary information is the key to a significant improvement of 1D predictions.

In this review I describe three prediction methods that use evolutionary information as input to neural network systems to predict secondary structure (PHDsec), relative solvent accessibility (PHDacc), and transmembrane helices (PHDhtm). I shall also illustrate the possibilities and limitations in practical applications of these methods with results from careful cross-validation experiments on large sets of unique protein structures.

All predictions are made available by an automatic email prediction service (see Availability). The baseline conclusion after some 30,000 requests to the service is that 1D predictions have become accurate enough to be used as a starting point for expert-driven modelling of protein structure.


Refining neural network predictions for helical transmembrane proteins by dynamic programming

Burkhard Rost, Rita Casadio & Piero Fariselli

In: States, D., Agarwal, P., Gaasterland, T., Hunter, L. & Smith, R. F. (eds.), Fourth International Conference on Intelligent Systems for Molecular Biology, St. Louis, U.S.A.: Menlo Park, CA: AAAI Press, pp. 192-200 (1996).

For transmembrane proteins, experimental determination of three-dimensional structure is problematic. However, membrane proteins are important for molecular biology in general, and for drug design in particular. Thus, prediction methods are needed. Here we introduce a method that starts from the output of a profile-based neural network system (PHDhtm). Instead of choosing the neural network output unit with maximal value as prediction, we implemented a dynamic programming-like refinement procedure that aimed at producing the best model for all transmembrane helices compatible with the neural network output. Preliminary results suggest that the refinement was clearly superior to the initial neural network system; and that, in terms of predicting all transmembrane helices of a protein correctly, the method was more accurate than a previously applied empirical filter. The refined prediction was used successfully to predict transmembrane topology based on an empirical rule for the charge difference between extra- and intra-cellular regions (positive-inside rule). The resulting accuracy in predicting topology was better than 80%. Although a more thorough evaluation of the method on a larger data set will be required, the results compared favourably with alternative methods for the prediction of transmembrane helices and topology.
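
The charge-difference idea behind the positive-inside rule can be illustrated with a short sketch. It is not the published PHDhtm procedure: the function name `predict_topology`, the choice to count only lysine (K) and arginine (R), and the use of whole loops rather than membrane-flanking windows are assumptions made for illustration.

```python
# Hypothetical sketch of a positive-inside topology call.
# Assumption: only K and R count as positively charged residues.
POSITIVE = set("KR")


def predict_topology(sequence, helices):
    """Guess on which side of the membrane the N-terminus lies.

    sequence: one-letter amino acid string.
    helices:  sorted, non-overlapping (start, end) pairs, end exclusive.
    Returns ("in" | "out" | "undetermined", charge_difference).
    """
    # Flatten helix boundaries so that consecutive pairs delimit the loops,
    # including the N-terminal and C-terminal stretches.
    bounds = [0] + [pos for helix in helices for pos in helix] + [len(sequence)]
    loops = [sequence[bounds[k]:bounds[k + 1]] for k in range(0, len(bounds), 2)]

    # Count positive charges per loop.
    charges = [sum(res in POSITIVE for res in loop) for loop in loops]

    # Loops 0, 2, 4, ... lie on the same side as the N-terminus.
    same_side = sum(charges[0::2])
    other_side = sum(charges[1::2])
    diff = same_side - other_side

    if diff > 0:
        return "in", diff       # N-terminal side carries more positive charge
    if diff < 0:
        return "out", diff
    return "undetermined", diff
```

For example, a toy sequence with positively charged termini and a neutral middle loop would be assigned an intracellular N-terminus under these assumptions.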


Topology prediction for helical transmembrane proteins at 86% accuracy

Burkhard Rost, Rita Casadio & Piero Fariselli

Prot. Sci., 5, 1704-1718 (1996)

Previously, we introduced a neural network system predicting the locations of transmembrane helices in integral membrane proteins based on evolutionary profiles (PHDhtm). Here, we describe an improvement and an extension of that system. The improvement is achieved by a dynamic programming-like algorithm that optimises helices compatible with the neural network output. The extension is the prediction of topology (orientation of first loop region with respect to membrane) by applying the observation that positively charged residues are more abundant in intra-cytoplasmic than in extra-cytoplasmic regions (positive-inside rule) to the refined prediction of all transmembrane helices. Furthermore, we introduce a method to reduce the number of false positives, i.e., proteins falsely predicted with membrane helices. The evaluation of prediction accuracy is based on a cross-validation and a double-blind test set (in total 131 proteins). The final method appears to be more accurate than other methods published. (1) For almost 89% (+/-3%) of the test proteins all transmembrane helices are predicted correctly. (2) For more than 86% (+/-3%) of the proteins topology is predicted correctly. (3) We define reliability indices which correlate with prediction accuracy: for the most strongly predicted half of the proteins the likelihood of predicting all transmembrane helices correctly rises to 98%; and for two-thirds of the proteins the accuracy of topology prediction was 95%. (4) The rate of proteins for which transmembrane helices are predicted falsely is below 2% (+/-1%). Finally, the method is applied to 1616 sequences of Haemophilus influenzae. We predict 19% of the genome sequences to contain one or more transmembrane helices. This appears to be lower than what we predicted previously for the yeast VIII chromosome (about 25%).
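
A dynamic programming refinement of this kind can be sketched as follows. This is an illustrative toy, not the published algorithm: the per-residue helix scores in [0, 1], the scoring scheme (reward agreement with the network, penalise disagreement), the helix length limits of 18 to 25 residues, and the name `refine_helices` are all assumptions.

```python
import math


def refine_helices(scores, min_len=18, max_len=25):
    """Pick helix segments that best agree with per-residue helix scores.

    scores: hypothetical network outputs in [0, 1], one per residue.
    Returns a list of (start, end) segments, end exclusive. Helices must be
    min_len..max_len long and separated by at least one loop residue.
    """
    n = len(scores)
    # gain[i] > 0 if calling residue i "helix" agrees with the network.
    gain = [2.0 * s - 1.0 for s in scores]
    prefix = [0.0]
    for g in gain:
        prefix.append(prefix[-1] + g)

    neg_inf = -math.inf
    loop = [neg_inf] * (n + 1)    # best score, residue i-1 in a loop
    helix = [neg_inf] * (n + 1)   # best score, a helix ends at residue i-1
    loop_after_helix = [False] * (n + 1)
    helix_start = [-1] * (n + 1)
    loop[0] = 0.0

    for i in range(1, n + 1):
        # Case 1: residue i-1 is a loop residue.
        loop[i] = max(loop[i - 1], helix[i - 1]) - gain[i - 1]
        loop_after_helix[i] = helix[i - 1] > loop[i - 1]
        # Case 2: a helix of an allowed length ends at residue i-1.
        for length in range(min_len, min(max_len, i) + 1):
            start = i - length
            candidate = loop[start] + prefix[i] - prefix[start]
            if candidate > helix[i]:
                helix[i] = candidate
                helix_start[i] = start

    # Trace back from the better of the two final states.
    segments = []
    i = n
    in_helix = helix[n] > loop[n]
    while i > 0:
        if in_helix:
            start = helix_start[i]
            segments.append((start, i))
            i = start
            in_helix = False
        else:
            in_helix = loop_after_helix[i]
            i -= 1
    segments.reverse()
    return segments
```

Unlike picking the maximal output unit per residue, the length constraints force every predicted helix to be long enough to span a membrane, and an overlong high-scoring stretch is split into two helices.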


Pitfalls of protein sequence analysis

Burkhard Rost & Alfonso Valencia

Curr. Opin. Biotech., 7, 457-461 (1996)

Full Paper

INTRODUCTION

Imagine you have a protein sequence, either sequenced in your own lab or pulled down from genome projects or EST production. You decide to let theoretical biology assist you in finding a priori information about your protein that may be useful to accelerate and design experiments. You submit your sequence to database search and/or structure prediction services. The possible pitfalls are numerous, including picking a lousy server or misinterpreting the results. We give examples for common pitfalls collected after 80,000 requests to an automatic prediction service (Table).

What can theory predict of protein structure? In general, protein three-dimensional (3D) structure can NOT be predicted from sequence. However, 3D structure can be predicted by homology modelling, i.e., by using a sequence homologue (>25% sequence identity) with an experimentally determined 3D structure. If no sequence homologue is found in PDB, there still is a chance to predict 3D structure by threading, i.e., by remote homology modelling (<25% sequence identity). However, correct 3D models - and even correct detection of remote homology - from threading are rare. But theory can assist by predicting one-dimensional (1D) aspects of 3D structure, e.g., secondary structure, solvent accessibility, transmembrane helices, binding sites, sequence motifs, and aspects of protein function.

Ease of use bears an ease of misuse. Rapidly developing electronic communication (Internet, World Wide Web) facilitates spreading prediction methods. Experimental biologists submit sequences, theoretical biologists configure automatic services that return predictions. The advantage is that users need not become experts for sequence analysis tools. However, the ease of offering and accessing predictions bears two problems. (1) Inaccurate methods (or insufficiently validated ones) are made available bypassing selection systems such as referees. (2) Users may misinterpret results due to a lack of insight into the features of prediction methods.


Protein fold recognition by prediction-based threading

Burkhard Rost, Reinhard Schneider & Chris Sander

J. Mol. Biol., 270, 471-480 (1997).

Full Paper

In fold recognition by threading one takes the amino acid sequence of a protein and evaluates how well it fits into one of the known three-dimensional (3D) protein structures. The quality of sequence-structure fit is typically evaluated using inter-residue potentials of mean force or other statistical parameters. Here, we present a new approach to evaluating sequence-structure fitness. Starting from the amino acid sequence we first predict secondary structure and solvent accessibility for each residue. We then thread the resulting one-dimensional (1D) profile of predicted structure assignments into each of the known 3D structures. The agreement between predicted and observed structure profile is evaluated using statistical parameters. The optimal threading for each sequence-structure pair is obtained using dynamic programming. The overall best sequence-structure pair constitutes the predicted 3D structure for the input sequence. The method is fine-tuned by adding information from direct sequence-sequence comparison and applying a series of empirical filters. Although the method relies on reduction of 3D information into 1D structure profiles, its accuracy is, surprisingly, not clearly inferior to methods based on evaluation of residue interactions in 3D. We therefore hypothesise that existing 1D-3D threading methods essentially capture not more than the fitness of an amino acid sequence for a particular 1D succession of secondary structure segments and residue solvent accessibility. The prediction-based threading method on average finds any structurally homologous region at first rank in 30% of the cases. For the 17% first hits detected at highest scores, the expected accuracy rose to 70%. However, the task to detect entire folds rather than homologous fragments, was managed much better: depending on the cut-off for what was regarded as an 'entire fold' the first hit was correct in 60-80% of all cases.
The quality of the resulting 3D models depends crucially on the details of the sequence-structure alignments which can be inaccurate in detail even in cases in which the correct fold is detected.
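
The threading step described above can be sketched as a global alignment of two 1D strings by dynamic programming. The sketch below is a simplification under stated assumptions: it aligns only secondary structure strings (H, E, L), uses arbitrary match, mismatch and gap scores rather than the statistical parameters of the paper, and the names `thread_score` and `best_fold` are hypothetical.

```python
def thread_score(predicted, observed, match=1.0, mismatch=-1.0, gap=-2.0):
    """Global alignment score (Needleman-Wunsch) of two 1D structure strings.

    predicted: predicted secondary structure of the query, e.g. "HHHLLEEE".
    observed:  secondary structure observed in a known 3D structure.
    The score values are illustrative assumptions.
    """
    m = len(observed)
    # Row 0: aligning an empty prefix of `predicted` costs only gaps.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, len(predicted) + 1):
        cur = [i * gap] + [0.0] * m
        for j in range(1, m + 1):
            pair = match if predicted[i - 1] == observed[j - 1] else mismatch
            cur[j] = max(
                prev[j - 1] + pair,  # align residue i with position j
                prev[j] + gap,       # gap in the known structure
                cur[j - 1] + gap,    # gap in the query
            )
        prev = cur
    return prev[m]


def best_fold(predicted, library):
    """Return the name of the library structure with the highest score.

    library: dict mapping a structure name to its observed 1D string.
    """
    return max(library, key=lambda name: thread_score(predicted, library[name]))
```

A fuller version would also compare predicted and observed solvent accessibility and keep the traceback, since the alignment details determine the quality of the resulting 3D model.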


Protein structures evolve at random - almost

Burkhard Rost, Sean I. O'Donoghue & Chris Sander
Preprint, EMBL, PDG97 (1997)

Full Paper

Today, we have a detailed and ever-widening knowledge of the evolution of DNA sequences, but what do we really know about the evolution of protein structure? Until recently, the answer was: not much. The first detailed structures were determined 26 years ago; 13 years ago, the database of atomic-resolution protein structures contained just 312 structures (PDB). Since then, due to advances in determination methods, the PDB has grown exponentially; presently it holds over 4000 entries. With this size, we can just begin to analyse the evolution of protein structure. Here, we report an analysis of all pairs of proteins in the PDB which have similar three-dimensional (3D) structures. For each pair, we aligned the 3D structures, and measured the sequence identity (pairwise identical residues) in the aligned regions. The resulting distribution of pair identity scores shows one prominent and unexpected feature: most pairs cluster in an approximately Gaussian peak centred at 8-9% sequence identity. The distribution is surprisingly similar to that expected for `random' pairs of completely unrelated sequences. This result has implications for our understanding of protein folding, and of the effect of convergent (different ancestor) and divergent (same ancestor) evolution on protein structure.
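
The identity measure used above can be illustrated with a small sketch. It assumes that the two structurally aligned sequences are given as equal-length strings with '-' marking gaps, and that identity is taken over the positions aligned without gaps; the function name `percent_identity` and the choice of denominator are assumptions, not details from the paper.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residue pairs among aligned positions.

    aligned_a, aligned_b: equal-length aligned sequences, gaps as `gap`.
    Positions where either sequence has a gap are ignored (an assumption).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    identical = sum(a == b for a, b in pairs)
    return 100.0 * identical / len(pairs)
```

Collecting this value for every pair of similar structures gives the distribution whose peak the abstract places at 8-9% sequence identity.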


Protein structures sustain evolutionary drift

Burkhard Rost

Folding & Design, 2, 519-524 (1997)

Full Paper

A protein sequence folds into a unique three-dimensional protein structure. Different sequences, though, can fold into similar structures. How stable is a protein structure with respect to sequence changes? What percentage of the sequence are 'anchor' residues, i.e., are crucial for protein structure and function? Here, these questions are pursued by analysing large numbers of structurally homologous protein pairs. Most pairs of similar structures have sequence identity as low as expected from randomly related sequences. On average only three to four percent of all residues are 'anchor' residues (residues crucial for maintaining the structure). The symmetric shape of the distribution at low sequence identity suggests that for most structures, four billion years of evolution was sufficient to reach an equilibrium. The mean identities for convergent (different ancestor) and divergent evolution (same ancestor) of proteins to similar structures are quite close, and hence, in most cases it is difficult to distinguish between the two effects. In particular, low levels of sequence identity appear not to be indicative of convergent evolution.


Sisyphus and protein structure prediction

Burkhard Rost & Sean I. O'Donoghue

CABIOS, 13, 345-356 (1997)

Full Paper

The problem of predicting protein structure from sequence remains fundamentally unsolved despite more than three decades of intensive research effort. However, new and promising methods in 3D, 2D, and 1D prediction have reopened the field. Mean-force-potentials derived from the protein databases can distinguish between correct and incorrect models (3D). Inter-residue contacts (2D) can be detected by analysis of correlated mutations, albeit with low accuracy. Secondary structure, solvent accessibility, and transmembrane helices (1D) can be predicted with significantly improved accuracy using multiple sequence alignments. Some of these new prediction methods have proven accurate and reliable enough to be useful in genome analysis, and in experimental structure determination. Moreover, the new generation of theoretical methods is increasingly influencing experiments in molecular biology.


Learning From Evolution To Predict Protein Structure

Burkhard Rost

In: B. Olsson, D. Lundh, A. Narayanam (eds.) `Bio-Computing and Emergent Computation', Skövde, Sweden, Sep 1-2, 1997. World Scientific, 87 - 101 (1997)

Full Paper

In the wake of the genome data flow, we need - more urgently than ever - accurate tools to predict protein structure. The problem of predicting protein structure from sequence remains fundamentally unsolved despite more than three decades of intensive research effort. However, the wealth of evolutionary information deposited in current databases enabled a significant improvement for methods predicting protein structure in 1D: secondary structure, transmembrane helices, and solvent accessibility. In particular, the combination of evolutionary information with neural networks proved extremely successful. The new generation of prediction methods proved to be accurate and reliable enough to be useful in genome analysis, and in experimental structure determination. Moreover, the new generation of theoretical methods is increasingly influencing experiments in molecular biology.


Pedestrian guide to analysing sequence databases

Burkhard Rost & Reinhard Schneider

In: Core techniques in biochemistry (Ashman, K., ed.), pp. in press, Springer, Heidelberg (1998)

Full Paper

Over the past few years our means of communication have changed rapidly due to the growth of the World Wide Web (WWW). The Web enables molecular biologists to immediately access databases, scan literature, find information about related research and researchers, and to trace cell cultures. Wet-lab biologists can uncover information about the protein of interest without having to become experts in sequence analysis. Here, we present a variety of tools; provide an overview of the state of the art in sequence analysis; and describe some of the principles of the methods.


Protein structure prediction in 1D, 2D, and 3D

Burkhard Rost

in: Encyclopedia of Computational Chemistry, in press, (1998)

Full Paper

Introduction

Proteins are the machinery of life. The information for life is stored by a four-letter alphabet in the genes (DNA). Proteins are, among others, the macromolecules that perform all important tasks in organisms, such as catalysis of biochemical reactions, transport of nutrients, recognition, and transmission of signals. Thus, genes are the blueprints or library, and proteins are the machinery of life. Proteins are formed by joining amino acids by peptide bonds into a stretched chain. This protein sequence comprises a translation of the four-letter DNA alphabet into a 20-letter alphabet of native amino acids. Proteins differ in length (from 30 to over 30,000 amino acids), and in the arrangement of the amino acids (dubbed residues, when joined in proteins). In water, the chain folds up into a unique three-dimensional (3D) structure. The main driving force is the need to pack residues for which a contact with water is energetically unfavourable (hydrophobic residues) into the interior of the molecule. A detailed analysis of the underlying chemistry shows that this is only possible if the protein forms regular patterns of a macroscopic substructure called secondary structure (Fig. 1; for an excellent introduction into protein structure for a short review of the basic principles of folding:).

Sequence determines structure determines function. Protein three-dimensional (3D) structure (i.e. the co-ordinates of all atoms) determines protein function. But what determines 3D structure? The hypothesis that structure (also referred to as 'the fold') is uniquely determined by the specificity of the sequence, has been verified for many proteins. While it is now known that particular proteins (chaperones) often play a rôle in the folding pathway, and in correcting misfolds, it is still generally assumed that the final structure is at the free-energy minimum. Thus, all information about the native structure of a protein is coded in the amino acid sequence, plus its native solution environment. Can the code be deciphered, i.e. can 3D structure be predicted from sequence? In principle, the code could be deciphered from physico-chemical principles using, for example, molecular dynamics methods. In practice, however, such approaches are frustrated by two principal obstacles. Firstly, energy differences between native and unfolded proteins are extremely small (order of 1 kcal/mol). Secondly, the high complexity (i.e. co-operativity) of protein folding requires several orders of magnitude more computing time than we anticipate to have over the next decades. Thus, the inaccuracy in experimentally determining the basic parameters, and the limited computing resources become fatal for predicting protein structure from first principles. The only successful structure prediction tools are knowledge-based, using a combination of statistical theory and empirical rules.

The sequence-structure gap is rapidly increasing. Currently, databases for protein sequences (e.g. SWISS-PROT) are expanding rapidly, largely due to large-scale genome sequencing projects. The first four entire genome sequences have been published; they represent all three terrestrial kingdoms: (1) prokaryotes: Haemophilus influenzae and Mycoplasma genitalium; (2) eukaryotes: yeast; and (3) archaea: Methanococcus jannaschii. At least another dozen genomes will be completely sequenced before the end of 1997 (Terry Gaasterland, priv. communication); the entire human genome is likely to be known in the year 2003. This implies that the explosion of genome, and hence, protein sequences is supposedly the only field outgrowing the speed in development of computer hardware. It also implies that, despite significant improvements of structure determination techniques, the gap between the number of proteins for which structure is deposited in public databases (PDB), and the number of proteins for which sequences are known is increasing.

Can the egg be unboiled? When an egg is boiled, the proteins it contains unfold. Can this procedure be reversed in theory? Can the encrypted code of protein structure be deciphered? Or, can theory help to bridge the sequence-structure gap? Indeed, for over 30 years, there has been an ardent search for methods to predict protein structure from the sequence. Many methods were found which looked initially very promising - but always the hope has been dashed. How well do we do?

No general prediction of structure from sequence, yet. An important experiment has been initiated by John Moult (CARB, Washington): those who determine protein structures submitted the sequences of proteins for which they were about to solve the structure to a 'to-be-predicted' database; for each entry in that database predictors could send in their predictions before a given deadline (the public release of the structure); finally, the results were compared, and discussed during a workshop (in Asilomar, California). Two such experiments have been completed: in December 1994 (Proteins special issue, Vol. 23, 1995), and in December 1996 (to be published in Proteins, 1997). The results of both experiments demonstrated clearly that the goal to predict structure from sequence has not been reached, yet. So, no improvement despite ardent attempts, and the explosion of knowledge deposited in databases?

Indeed, there is a flood of literature on protein structure prediction attempting to keep pace with the expanding databases. In this review focus will be laid on recent prediction methods that do actually contribute to bridging the sequence-structure gap, in particular in view of analysing entire genomes. The first section will provide a brief sketch about where we are today in protein structure prediction. The following chapters will sketch the problems, and some of the solutions in database searches, and the prediction of protein structure in 1D, 2D, and 3D (Fig. 1).


Better 1D predictions by experts with machines

Burkhard Rost

Proteins, Suppl. 1, 192-197 (1998)

Full Paper

Accuracy of predicting protein secondary structure and solvent accessibility has been improved significantly by using evolutionary information contained in multiple sequence alignments. For the second Asilomar meeting, predictions were made automatically for all targets using the publicly available prediction service PredictProtein. Additionally, a semi-automatic procedure for generating more informative alignments was used in combination with the PHD prediction methods. Results confirmed the estimates for prediction accuracy. Furthermore, the more informative alignments yielded better predictions. The fairly accurate predictions of 1D structure were successfully used by various groups for the Asilomar meeting as first step towards predicting higher dimensions of protein structure.


Adaptation of protein surfaces to subcellular location

Miguel A. Andrade, Seán I. O'Donoghue & Burkhard Rost

J. Mol. Biol., 276, 517-525 (1998)

Full Paper

In vivo, proteins occur in widely different physio-chemical environments, and, from in vitro studies, we know that protein structures can be very sensitive to environment. However, theoretical studies of protein structure have tended to ignore this complexity. In this paper, we have approached this problem by grouping proteins by their subcellular location and looking for structural properties that are characteristic of each location. We hypothesise that, throughout evolution, each subcellular location has maintained a characteristic physio-chemical environment, and that proteins in each location have adapted to these environments. If so, we would expect that protein structures from different locations will show characteristic differences, particularly at the surface, which is directly exposed to the environment. To test this hypothesis, we have examined all eukaryotic proteins with known three-dimensional structure and for which the subcellular location is known to be either nuclear, cytoplasmic, or extracellular. In agreement with previous studies, we find that the total amino acid composition carries a signal that identifies the subcellular location. This signal was due almost entirely to the surface residues. The surface residue signal was often strong enough to accurately predict subcellular location, given only a knowledge of which residues are at the protein surface. The results suggest how the accuracy of prediction of location from sequence can be improved. We concluded that protein surfaces show adaptation to their subcellular location. The nature of these adaptations suggests several principles that proteins may have used in adapting to particular physio-chemical environments; these principles may be useful for protein design.


3rd Generation Prediction Of Secondary Structure

Burkhard Rost & Chris Sander

in: Webster D. M. (ed.): 'Predicting protein structure'. Humana Press, 1998, in press. (1998)

Full Paper

We still cannot predict protein structure from sequence, in general. But, we can do much better in predicting simplified aspects of structure. Particularly, the field of secondary structure has been revived by a break-through that has been achieved by a combination of elaborated algorithms and evolutionary information available in ever growing data bases. Some of the new, third generation methods for secondary structure prediction are clearly superior to previous methods: beta-strands are predicted more accurately; predicted segments look like those observed; and the overall accuracy is about ten percentage points higher than for methods from previous generations. Performance can be improved even further by using these methods in an 'expert' rather than in an 'automatic' mode.


Marrying structure and genomics

Burkhard Rost

Structure, manuscript in prep. (1998)

Full Paper

Today. Large-scale genome sequencing is filling up the catalogue of natural proteins at a breath-taking speed. Today, we have available not just a large number of sequences, but also glimpses of the inventory of entire organisms. This will soon improve our understanding of cells, in particular, and of life, in general. Three means will contribute: (1) sequencing genomes (genomics), (2) determining protein structures, and (3) determining protein function. Protein structure is interwoven with function (see Structure, in general, [1, 2, 3], in particular). Sequencing and determining function are also routinely combined (e.g. [4]). However, what about the relation between structure determination and genomics?

Tomorrow. Structural genomics, the marriage between protein structure determination and genomics, is already beginning. Here, I attempted to illustrate the likely direction this marriage will take. Structure determination will be pushed by, and profit from genomics. Basing research and technical developments (such as drug design) on all three pillars (sequence, structure, function) will be a big step toward understanding of life.

Objectives. Structure determination will benefit from genomics in two ways (Fig. 1). (1) The mass of available sequences will facilitate quick determination of structure for most existing folds. (2) Sequences for entire organisms will help to unravel missing links in functional pathways, to explore alternative pathways, and to widen our understanding of principal mechanisms and of evolutionary cross-links.


Twilight zone of protein sequence alignments

Burkhard Rost

, submitted, (1998)


Sequence alignments unambiguously distinguish between protein pairs of similar and non-similar structure when the pairwise sequence identity is significant. The signal gets blurred when intruding into the twilight zone of 20-30% sequence identity. I analysed more than a million sequence alignments between protein pairs of known structures to explore the twilight zone. Goals were to unravel clues for why sequence alignments are difficult in the twilight zone, and to define a line distinguishing between true and false positives for low levels of similarity. Six results stood out. (1) When entering the twilight zone, the number of pairs exploded. (2) More than 95% of all pairs detected in the twilight zone had different structures. (3) The levels of significant sequence identity and similarity were confirmed to be dependent on alignment length. For example, if ten residues were similar in an alignment of length 16 (> 60%), structural similarity could not be inferred. (4) Above 30% sequence similarity, more than 95% of the pairs detected were homologous; below 20% less than 20% were homologous. (5) The 'more similar than identical' rule, i.e., discarding all pairs for which percentage similarity was lower than percentage identity, significantly reduced false positives. (6) Similarly successful was sequence space hopping: two proteins were predicted to be homologous whenever proteins were common in the sequence families of both proteins. All findings would be applicable to automatic database searches.
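
Results (5) and (6) translate into two simple filters, sketched below. This is an illustration only: the `Hit` record, the function names, and the assumption that percentage identity and similarity have already been computed by an alignment program are hypothetical, and no length-dependent threshold is implemented because the abstract gives no formula for it.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One alignment hit with precomputed percentages (hypothetical record)."""
    name: str
    identity: float     # percentage of identical residues
    similarity: float   # percentage of similar residues, as reported upstream


def keep_more_similar_than_identical(hits):
    """Rule (5): discard hits whose similarity is lower than their identity."""
    return [hit for hit in hits if hit.similarity >= hit.identity]


def share_family_member(family_a, family_b):
    """Rule (6), sequence space hopping: do the two families overlap?

    family_a, family_b: iterables of identifiers for the sequences found
    in the family of each protein.
    """
    return not set(family_a).isdisjoint(family_b)
```

As a reminder of the length effect in result (3), ten similar residues in an alignment of length 16 give 100 * 10 / 16 = 62.5% similarity, which is still not sufficient to infer structural similarity for such a short alignment.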



Full Paper



