Burkhard Rost
CUBIC Columbia University
Dept Biochemistry & Molecular Biophysics, 630 West 168 th Street, New York, NY 10032, rost@columbia.edu, http://www.columbia.edu/~rost/
contact e-mail:rost@columbia.edu
Neural networks have been applied to many pattern classification problems. Here, I reviewed applications to the problem of predicting protein structure from protein sequence. Initially, methods were designed as a quick and dirty demonstration that artificial intelligence-based machines could solve real-life problems. At that stage, biologists typically reached higher levels of accuracy when using their expertise than computer scientists when using their machines. However, more thourough investigations enabled to include the information used by experts into neural network-based tools. Now, some tools are - on average - as accurate as the best experts; and experts using such tools often arrive at even more accurate predictions. Thus, several neural network-based methods have eventually contributed significantly to advancing the field of bio-informatics, and some are clearly influencing molecular biology.
Key words: prediction of protein structure and function; neural network applications; extraction of information learned by neural networks
List of outstanding questions: (xx) to add after consultation of editor
Proteins constitute lifes machinery. The first bacterial genome was sequenced in 1995 [1] ; the first eukaryote (yeast) followed in 1996 [2] . Meanwhile, more than 20 other genomes have been published [3] , and the human genome (200 times larger than yeast) is expected to be sequenced as one of the first milestones in the next millennium. Why bother? Since genomes contain the blueprint for all parts of lifes machinery. The machinery itself consists of proteins that perform all important tasks in organisms (catalysis of biochemical reactions, transport of nutrients, recognition, and transmission of signals). Proteins are formed by joining 20 different amino acids (dubbed residues, when joined in proteins) into a stretched chain. In water, the chain folds into a unique three-dimensional (3D) structure ( Fig. 1 ; introduction to protein structure: [4] ).
Fig. 1. Representation of HIV-1 protease monomer (Protein Data Bank code 1HHP) in 1D and in 3D. 1D: SEQ , sequence in one-letter code; SEC , secondary structure assignment (E for strand, blank for non-regular structure); note only the first 34 residues corresponding to the first strands (upper left arrows in 3D representation) are shown. 3D: The trace of the protein chain in 3D is plotted schematically as a ribbon. Strands are indicated by arrows, the short helix is on the right towards the end of the protein. Graph made with MOLSCRIPT [48] .
Sequence determines structure determines function. The world of proteins is governed by shape: interactions between proteins are mediated by the key-hole principle, i.e., two proteins interact when they fit to one another like a key into a hole. Thus, protein structure determines protein function. What determines structure? All information about the native structure of a protein is coded in the amino acid sequence, plus its native solution environment [5] . Can we decipher the code, i.e., can we predict 3D structure from sequence? In principle, we could; in practice, such approaches are frustrated by the difficulty of the task resulting from the high complexity of protein structure formation [6] . For over 40 years, there has been an ardent search for methods predicting protein structure from sequence (reviews: [7, 8, 6] ; books: [9] ). Many methods were found which looked initially very promising - but always the hope has been dashed [10] . The most successful predictions are achieved by experts starting combining machine-based predictions with their intuition and expertise [11, 12] .
How can neural networks predict protein structure? In practice, the most successful structure predictions extract patterns from data bases of known protein structures. Neural networks comprise a particular tool for pattern recognition and classification (Box) [13, 14, 15, 16, 17, 8] . To which extent do neural networks contribute to predicting protein structure, in practice? Initially, researchers applied black boxes, and searched improvements through optimising the internal free parameters (training speed, network architecture). Later, researchers have opened the black boxes by extracting, or implementing rules, by carving specific knowledge into the networks, and by using networks to detect errors or outliers in data bases. More recently, the full potential of the tool has been used by combining neural networks with evolutionary information.
No improvement in secondary structure prediction by black boxes. Secondary structure prediction methods distinguish between helix (H), strand (E), and non-regular structure (L). Some stretches of sequence show a particular preference to be in one of these three states. Task is to classify a pattern of w adjacent residues as either H, E, or L ( Fig. 2 ). Simple neural networks reached values for prediction accuracy of around 62% (residues predicted correctly in either H, E, or L) [18, 19, 20] . This was similar to the best methods 30 years of research had resulted in by the end of the 80's [21] . Further attempts to improve performance by changing the network details failed [13, 14] . In contrast, combining neural networks with other prediction methods succeeded to some extent [22] . The basic problem was that the input information (i.e. the training data) was not sufficient to make use of the ability of neural networks to higher order correlations: networks with and without hidden layer (Box) achieve similar levels of accuracy [20] .
Fig. 2. Simple neural network for secondary structure prediction. For simplification the protein sequence given consists of two amino acid types (S and P). The protein sequence is translated into patterns by shifting a window of w adjacent residues (shown w = 5; typical values in practice are w = 13-21) through the protein. The output of the network is uniquely determined (Box). Suppose the output would be: 0.2, 0.4, 0.5 for the three output states (H, E, L). For known examples the desired output is also known (1, 0, 0 if the central residue is in a helix). Consequently, the network error is given by the difference between actual network output and desired output. The only free variables are the connections. Training or learning means changing the connections such that the error decreases for the given examples. A training set typically comprises some 30,000 examples. If training is successful, the patterns are correctly classified. But how can new patterns be classified correctly? The hope is that the network succeeds in extracting general rules by the classification of the training patterns. The generalisation ability is checked by another set of test samples for which the mapping of sequence window to secondary structure is known as well. Sufficient testing is crucial and has to meet two requirements. First, any significant sequence similarity between test and training set has to be removed. Second, evaluations of expected prediction accuracy have to be based on a sufficient number of test proteins (> 100).
Prediction of functional class. Methods predicting functional similarities between proteins have been based on two types of neural networks: (1) on layered feed-forward networks (Box) usually trained by simple back-propagation [23, 15] (e.g. (1a) multiple networks [24] using proteins of similar sequences as input, or (1b) on single networks using different amino acid features as input [25] , some other examples for networks recognising functional motifs are: [26, 27, 28, 29, 30] ), and (2) on Kohonen maps [31, 15] (e.g. (2a) using the frequency with which any of the 20*20 possible residue pairs occurs in the sequence [32] , or (2b) using the information extracted from database annotations [33] ). There are two ways to describe the principle difference between these two types of networks. Firstly, the network types can be contrasted in terms of the final result of their function: Kohonen maps provide a more continuous topography of similarity, whereas back-propagation networks differentiate the input into larger categories determined by the number of output units. Secondly, the network types can be contrasted in terms of the way they are trained: Kohonen maps find an unknown classification scheme, whereas back-propagation networks learn from known examples. Consequently, back-propagation networks are useful to learn a classification into known features (e.g. types of secondary structure), Kohonen maps have been applied to render a general classification scheme of proteins (e.g. A and B, are similar, and A is more similar to C, than B). Such a classification is a priori not evident (and on itself an area of controversial research, e.g., attempting to answer questions like: "Are we more similar to an orang-outang than to a pig?"). One hope guiding such analyses is to end up with similarities between proteins that might help to learn about details in protein function.
Extracting rules from, and implementing rules into neural networks. Genome sequences do contain some information about protein structure [34] . A prerequisite to uncover this result was to learn the genetic code by a neural network, i.e., the mapping between the four-letter alphabet of the nucleic acids (DNA), and the 20-letter alphabet of the amino acids (proteins). Analysing the rules learned by the network suggested evidence for a particular scenario for the evolution of the genetic code Fig. 3 . In a similar attempt to extract rules by specific modulation of the training procedure Tchoumatchenko, Vissotsky, and Ganascia extracted more complicated rules from secondary structure predicting neural networks than were available by statistical analysis [35] . For the case of learning the genetic code, the extraction has led to interesting hypotheses. For the case of secondary structure predictions, the problem was to make use of the complex rules extracted.
Fig. 3. Learning the genetic code. The four-letter nucleic acid code from the genomes is translated into a 20-letter amino acid code from proteins. Three nucleic acids (dubbed one codon) code for one amino acid. This implies that the four nucleic acid can code for 4*4*4 = 64 amino acids, i.e., the code is redundant: some amino acids are coded for by more than one codon, and three codons are used for stop-signals during the translation procedure. The minimal network that learned the genetic code had two hidden units [47] . The four graph represent the connections between the 20 input and the two hidden units. The untrained network with randomly assigned weights locates all 61 points near the centre of the square; after seven training epochs the points have moved into a transient local minimum, where the activities of the intermediate units are close to one and the activities of all the output units are close to zero; at 30 epochs the groups have started to segregate, but are still mixed; finally at 13,000 epochs the network positions the 61 codons in groups on the edge of the circular region. After the four epochs shown the number of correctly classified codons was 2, 6, 26 and 61, respectively. The final grouping separates hydrophobic residues (top: IMVPF) from hydrophilic (centre right and left: YQHKNEDR), and others (lower right: TSAGPCW). The figure is taken from [47] .
Carving biology into neural networks. Two problems are common to most secondary structure prediction methods (including simple networks, Fig. 2 ): (1) strands are predicted at almost random levels of accuracy, (2) and predicted secondary structure segments are too short [21] . The common explanation for the first problem was that strands are s Table 13, Table 14, Table 15, Table 16, Table 17, Table 18, Table 19, Table 20, Table 21 residues. The training dynamics of neural networks revealed that networks learned to classify helix, and loop ten times faster than strand [17] . Consequently, the idea was to simply increase the frequency in presenting strand residues during training. This change of the training improved strand accuracy significantly, indicating that the inferior prediction of strand did NOT result primarily from long-range interactions, but from technical problems. The second problem of predicting too short segments originates from the fact that the sliding window (Fig. 2) erases the correlation between adjacent residues. This shortcoming was corrected by introducing a second level network [17] ( Fig. 4 ). Such a network system learned correlations between adjacent residues. These examples illustrate that neural networks can easily be tailored to particular problems.
Fig. 4. Second level neural network [43, 17] . (1) The window of w adjacent residues is shifted through the protein (here w = 5). For each window secondary structure is predicted for the central residue (shown three windows with central residues S, P, S). (2) The prediction of this first level network is fed into a second level network. This is again realised by shifting a window of w adjacent predictions through the protein (for the second level w = 3). The final prediction of secondary structure is valid for the central residue of the second window (here a P).
Detecting database errors during training. Neural networks generalise by extracting the underlying physico-chemical principles from the training data. Obviously, this requires a correct training set. Søren Brunak has pioneered the idea to unravel errors in the training set by monitoring samples that could not be learned even when the networks were trained until over-fitting the data [36, 37, 38] . This technique has not only been used successfully to identify errors, and inconsistencies in public data bases, but also to improve the network performance.
Long-range information in multiple sequence alignments. Some residues can be replaced by others without changing structure. However, not every amino acid can be replaced by any other. On the contrary, one evolutionary step (exchange of one residue) can destabilise a structure. Residue substitution patterns observed in protein families are highly specific for a particular structure, and thus, contain more information about structure than single sequences. Furthermore, multiple alignments of sequence families implicitly also carry information about interactions between residues further apart in sequence than w residues. These evolutionary patterns are being used by experts [39, 40, 41, 42] . However, this information can also be incorporated into neural networks in the following way 17 . (1) A sequence of unknown structure (U) is aligned against a database of known sequences. (2) Proteins with significant sequence identity to U are extracted. (3) For each sequence position the profile of residue exchanges in the final multiple alignment is compiled, and fed into a network ( Fig. 5 ).
Fig. 5. Feeding evolutionary information into a neural network system [43, 17] . (1) A sequence family is aligned (shown are the sequence of unknown structure and three aligned relatives). (2) For each sequence position of profile is compiled that gives the percentage of S, or P in the alignment (shown in centre for window of five adjacent residues). (3) Instead of using binary input units (0 or 1), now the profile is fed into the first neural network. (4) Finally, the output is again fed into a second level network (Fig. 4).
Significant improvement of secondary structure prediction. Using evolutionary information has improved prediction accuracy from 62% to over 70% [43, 44] , or even over 72% [17] . Such a profile based neural network system was the first method to surpass the magic line of 70% accuracy, and has proven to be todays most accurate method at the structure prediction contests [11, 12] . This success is based on the ability of the neural networks to extract higher-order information from the input patterns. Whereas, segments taken from single sequences did not contain enough information to result in a different performance of networks with and without hidden units, segments taken from multiple sequence alignments did.
Finding the optimal path through the network output. The neural network system designed to predict secondary structure for water-soluble proteins failed in predicting helices inserted into the lipid-bilayer of membranes. However, the networks have been re-trained to also predict transmembrane helices [17] . Again information from multiple alignments improved prediction accuracy significantly [17] . This example illustrated the obvious problem that neural networks cannot do magic: generalisation fails for examples that have not been presented sufficiently while learning. A more technical issue was the incorporation of additional globular information by combining the neural network with a post-processing algorithm. The principal idea for this has been to regard the neural network prediction as an energy landscape and to search the best path through this landscape given that transmembrane helices are constrained to a minimal and a maximal length. The final system has achieved a significantly higher accuracy than the simple neural network-based system, and has been applied to analysing entire genomes [45] .
Structure prediction: work in progress... Protein 3D structures are encoded by a linear sequence of amino acid residues. To predict 3D structure from sequence is a task challenging enough to have occupied a generation of researchers. Have we finally succeeded? The bad news is: we still cannot predict structure for any sequence. The good news is: we have come closer, and growing databases facilitate the task. A solution of the structure prediction problem would supposedly change experimental molecular biology more than any other theoretical method. We may witness such a break-through in the near future. However, the lessons from the Asilomar prediction contests were that we may need a common frame-work to co-ordinate the efforts of the researchers in the field.
Contributions of neural networks to the structure prediction problem. Almost any imaginable algorithm has been applied to the secondary structure prediction problem. However, once researchers left the path of trying to optimise black-boxes, it was through neural network applications that many break-throughs were achieved. For example, a neural network system for predicting various aspects of 1D structure based on evolutionary information is by far the most widely used prediction method [46] . Other network-based methods are unique, or superior in their field [32, 44, 33, 29, 30] . Furthermore, neural networks revealed data base errors, and principles underlying protein structures [37, 46, 47, 38] . Thus, neural networks applications have matured from loosing the competition with experts to means used by experts arriving at more reliable predictions than ever before.
Will neural network constitute THE tool assisting molecular biology? In this brief review I covered a small percentage of the neural network methods used for protein structure prediction, only. A different area of an increasing number of neural network applications is the analysis of genome sequences. Sequencing the human genome takes enormous human, and financial resources. The reason is an anticipated profit for health care, in particular, or for understanding the mechanism of life, bio-diversity and evolution, in general. To arrive at this goal we need to attach knowledge about protein structure and function to the genes. In this process, data-driven bio-informatics may assist. Will neural network applications enable further advances? Protein structure prediction is not likely to be managed through a single genial discovery. Instead, it appears we need to direct an orchestra of different prediction methods such that they combine to a melody.
Simple neural network: The simplest layered feed-forward neural network consists of a layer of input units (here two), and a layer of output unit(s) (here one). Signals are transmitted from input to output layer (feed-forward) via the connections (Js). The network dynamic consists of a linear and a non-linear step. (1) The value of each input unit (example: 0 for unit 1; 1 for unit 2) is multiplied with the strength of the connection; the products sum to a local field (h) representing the signal that arrives at the output unit. The multiplication represents a projection of the input vector onto the vector of the connections. (2) The final output is determined by applying a sigmoid function (shown is the hyperbolic tangent) to the local field. The result is that the output is constrained to values between 0 and 1. On the right hand side the potential of such a network is illustrated: the open, and the dark circles are separated by a line.
Two-layered neural network: Two open and two dark circles can obviously not be separated by a single straight line. Two lines would enable the separation, but how can a neural network introduce two lines? The simple trick is the introduction of a layer of hidden layers (hidden as neither input nor output). The dynamics of such a network are identical to the simple network without hidden layer.
Training a neural network: How can particular pattern classification problems be implemented? The input is fixed by the pattern, as well is the desired output. The output for a given set of connections is uniquely determined by the dynamics of the network described above. The actual network error can be written as:
The free variables that contain the potential of the network to learn a given problem are the connections between the layers of units. The simplest way to reduce the network error is by changing the connections according to the derivative of the error with respect to the connections, i.e., by a gradient descent that assures to move downhill in the error-landscape:
This is often referred to as back-propagating the error through the neural network [23, 15, 16] . To avoid being trapped in local minima, in practise, the actual training is typically performed by a variant of this algorithm that permits up-hill moves (conjugate gradient descent [23, 15, 16] ).
Generalisation ability: With enough hidden units neural networks can learn to separate any set of patterns. Typical applications require to extract particular features (underlying rules) present in the patterns rather than to learn the known examples by heart. A successful extraction of such features permits the network to generalise, i.e., to also correctly classify patterns that have not been learned explicitly. Generalisation requires a balance between the number of training examples (enough to enable feature extraction), and the number of connections (enough to separate patterns). As a rule-of-thumb the number of connections should be an order of magnitude lower than the number of patterns to avoid over-fitting the training data (this learning-by-heart of the training set is also referred to as over-training).