Burkhard Rost: Publications in molecular biology
Burkhard Rost
Address: EMBL, 69 012 Heidelberg, Germany
Correspondence: Burkhard Rost
e-mail: rost@embl-heidelberg.de
The quality of a multi-layered network predicting the secondary structure of
proteins is improved substantially by: (i) using information about
evolutionarily conserved amino acids (increase of overall accuracy by six
percentage points), (ii) balancing the training dynamics (increase of accuracy
for strand), and (iii) combining uncorrelated networks in a jury (increase of two
percentage points). In addition, appending a second level
structure-to-structure network results in better reproduction of the length of
secondary structure segments.
We have trained a two layered feed-forward neural network on a non-redundant
database of 130 protein chains to predict the secondary structure of
water-soluble proteins. A new key aspect is the use of evolutionary
information in the form of multiple sequence alignments that are used as input
in place of single sequences. The inclusion of protein family information in
this form increases the prediction accuracy by 6-8 percentage points. A
combination of three levels of networks results in an overall three state
accuracy of 70.8% for globular proteins (sustained performance). If four
membrane protein chains are included in the evaluation, the overall accuracy
drops to 70.2%. The prediction is well balanced between α-helix,
β-strand and loop. 65% of the observed strand residues are predicted
correctly. The accuracy in predicting the content of three secondary structure
types is comparable to that of circular dichroism spectroscopy. The
performance accuracy is verified by a seven-fold cross-validation test, and an
additional test on 26 recently solved proteins. Of particular practical
importance is the definition of a position-specific reliability index. For the
half of the residues predicted with high reliability the overall accuracy
increases to better than 82%. A further strength of the method is the more
realistic prediction of segment length. The protein family prediction method
is available for testing by academic researchers via an electronic mail
server.
Prediction of protein secondary structure is an old problem and progress has
been slow over the years. Recently, spectacular success has been claimed in the
blind prediction of the catalytic subunit of the cAMP dependent protein kinase.
When predictions in this and other test cases are assessed critically, some
claims of prediction success turn out to be exaggerated, but a kernel of real
progress remains: protein structure prediction can be improved substantially
when a family of related sequences is available, enough so that molecular
biologists with a new amino acid sequence and a multiple sequence
alignment in hand may be tempted to test the new prediction methods.
The explosive accumulation of protein sequences in the wake of large scale
sequencing projects is in stark contrast to the much slower experimental
determination of protein structures. Improved methods of structure prediction
from the gene sequence alone are therefore needed. Here, we report a
substantial increase in both the accuracy and quality of secondary structure
predictions, using a neural network algorithm. The main improvements come
from the use of multiple sequence alignments (better overall accuracy), from
'balanced training' (better prediction of β-strands) and from
'structure context training' (better prediction of helix and strand lengths).
The best method, cross-validated on seven different test sets purged of
sequence similarity to learning sets, achieves a three-state prediction
accuracy of 69.7%, significantly better than any previous method. The improved
distribution and accuracy of helices and strands makes the predictions well
suited for use in practice as a first estimate of the structural type of newly
sequenced proteins.
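The 'balanced training' mentioned above can be illustrated with a small sketch. The abstract does not spell out the procedure, so the code below assumes one plausible reading: in each training cycle the three states (helix, strand, loop) are presented equally often, regardless of how frequent they are in the database. The function name balanced_batch and the (features, state) data layout are hypothetical.

```python
import random


def balanced_batch(examples, per_class, rng):
    """Draw `per_class` training examples for every state, with replacement.

    `examples` is a list of (features, state) pairs. This is an assumed
    reading of 'balanced training', not the published procedure.
    """
    by_state = {}
    for example in examples:
        by_state.setdefault(example[1], []).append(example)
    batch = []
    for state in sorted(by_state):
        # Rare states (e.g. strand) are sampled as often as frequent ones.
        batch.extend(rng.choices(by_state[state], k=per_class))
    rng.shuffle(batch)
    return batch


# Toy data set in which loop dominates and strand is rare.
examples = ([(i, "L") for i in range(50)]
            + [(i, "H") for i in range(30)]
            + [(i, "E") for i in range(20)])
batch = balanced_batch(examples, per_class=10, rng=random.Random(0))
print(len(batch))
```

Sampling with replacement keeps the sketch short; it means a rare state is seen repeatedly, which is the intended effect here.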
Can secondary structure prediction be improved by prediction rules that focus
on a particular structural class of proteins? To help answer this question, we
have assessed the accuracy of prediction for all-helical proteins, using two
conceptually different methods and two levels of description. An overall
two-state single residue accuracy of about 80% can be obtained by a neural
network, no matter whether it is trained on two states (helix, non-helix), or
first trained on three states (helix, strand, loop) and then evaluated on two
states. For four test proteins, this is similar to the accuracy obtained with
inductive logic programming. We conclude that on the level of secondary
structure, there is no practical advantage in training on two states,
especially given the added margin of error in identifying the structural class
of a protein. In the further development of these methods, it is increasingly
important to focus on aspects of secondary structure that aid in the
construction of a correct three-dimensional model, such as the correct placement
of segments.
In the middle of 1993, more than 30,000 protein sequences are known. For 1000
of these the three-dimensional (tertiary) structure is experimentally solved.
Another 7000 can be modelled by homology. For the remaining 21,000 sequences
secondary structure prediction provides a rough estimate of structural
features. Predictions in three states rate between 36% (random) and 88%
(homology modelling) overall accuracy. Using information about evolutionary
conservation as contained in multiple sequence alignments, the secondary
structure of 4700 protein sequences was predicted by the automatic e-mail
server PHD. For proteins with at least one known homologue, the method has an
expected overall three-state accuracy of 71.4% (evaluated on 126 unique
protein chains).
Secondary structure prediction recently has surpassed the 70% level of average
accuracy, evaluated on the single residue states helix, strand and loop (Q3).
But the ultimate goal is reliable prediction of tertiary (3D) structure, not
100% single residue accuracy for secondary structure. A comparison of pairs of
structurally homologous proteins with divergent sequences reveals that
considerable variation in the position and length of secondary structure
segments can be accommodated within the same 3D fold. It is therefore
sufficient to predict the approximate location of helix, strand, turn and loop
segments, provided they are compatible with the formation of 3D structure.
Accordingly, we define here a measure of segment overlap (Sov) that is somewhat
insensitive to small variations in secondary structure assignments. The new
segment overlap measure ranges from an ignorance level of 37% (random protein
pairs) via a current level of 72% for a prediction method based on sequence
profile input to neural networks (PHD) to an average 90% level for homologous
protein pairs. We conclude that the highest scores one can reasonably expect
for secondary structure prediction are a single residue accuracy of Q3>85%
and a fractional segment overlap of Sov>90%.
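As an illustration of the single residue measure Q3 referred to above, the following minimal sketch computes the percentage of residues predicted in the correct state. It assumes that observed and predicted structure are given as equal-length strings over the letters H (helix), E (strand) and L (loop); the function name q3_accuracy and the example strings are made up for illustration. The segment overlap measure Sov is not implemented here.

```python
def q3_accuracy(observed: str, predicted: str) -> float:
    """Percentage of residues whose predicted state equals the observed state."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted strings must have equal length")
    if not observed:
        raise ValueError("empty input")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


# Both segments are found, but each is predicted one residue too short.
observed = "LLHHHHHHLLEEEELL"
predicted = "LLHHHHHLLLEEELLL"
print(q3_accuracy(observed, predicted))  # 87.5
```

The example shows why a per-residue score alone can be misleading: the helix and the strand are both located correctly, yet slightly shifted segment ends already cost two of sixteen residues.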
Using evolutionary information as contained in multiple sequence alignments as
input to neural networks, secondary structure can be predicted at significantly
increased accuracy. Here, we extend our previous three-level system of neural
networks by using additional input information derived from multiple
alignments. Using a position-specific conservation weight as part of the input
increases performance. Using the number of insertions and deletions reduces
the tendency for overprediction and increases overall accuracy. Addition of
the global amino acid content yields a further improvement, mainly in
predicting structural class. The final network system has a sustained overall
accuracy of 71.6% in a multiple cross-validation test on 126 unique protein
chains. A test on a new set of 124 recently solved protein structures that
have no significant sequence similarity to the learning set confirms the high
level of accuracy. The average cross-validated accuracy for all 250
sequence-unique chains is above 72%. Using various data sets, the method is
compared to alternative prediction methods, some of which also use multiple
alignments: the performance advantage of the network system is at least 6
percentage points in three-state accuracy. In addition, the network estimates
secondary structure content from multiple sequence alignments about as well as
circular dichroism spectroscopy on a single protein and classifies 75% of the
250 proteins correctly into one of four protein structural classes. Of
particular practical importance is the definition of a position-specific
reliability index. For 40% of all residues the method has a sustained
three-state accuracy of 88%, as high as the overall average for homology
modelling. A further strength of the method is greatly increased accuracy in
predicting the placement of secondary structure segments.
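The abstract above does not define the reliability index. A common construction, assumed here for illustration only, scales the difference between the two largest network output values to an integer from 0 to 9. The second function shows how such an index can be used to trade coverage for accuracy. Both function names are hypothetical.

```python
def reliability_index(outputs):
    """Integer 0..9 from the gap between the two largest output values.

    Assumes each output value lies in [0, 1]; the exact definition used by
    the published method is not given in the text.
    """
    top, second = sorted(outputs, reverse=True)[:2]
    return min(9, int(10 * (top - second)))


def accuracy_above(observed, predicted, indices, threshold):
    """Return (coverage %, accuracy %) for residues with index >= threshold."""
    kept = [(o, p) for o, p, r in zip(observed, predicted, indices) if r >= threshold]
    if not kept:
        return 0.0, 0.0
    coverage = 100.0 * len(kept) / len(observed)
    accuracy = 100.0 * sum(o == p for o, p in kept) / len(kept)
    return coverage, accuracy


print(reliability_index((0.9, 0.05, 0.05)))   # clear winner, high index
print(reliability_index((0.4, 0.35, 0.25)))   # close call, low index
```

Raising the threshold keeps fewer residues but, if the index is informative, a larger fraction of them is predicted correctly.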
Some 30,000 protein sequences are known. For 1,000 the structure is
experimentally solved. Another 4,000 can be modeled by homology. For the
remaining 25,000 sequences, the tertiary structure (3D) cannot be predicted
generally from the sequence. A reduction of the problem is the projection of
3D structure onto a one-dimensional string of secondary structure assignments.
Predictions in three states rate between 36% (random) and 88% (homology
modelling) accuracy. Here, we present an improvement of a neural network
system using information about evolutionary conservation. The method achieves
a sustained overall accuracy of 71.4%. A test on 45 new proteins confirms the
estimated accuracy. Of practical importance is the definition of a reliability
index at each residue position: e.g. about 40% of the predicted residues have
an expected accuracy of 88%. The method has been made publicly available by an
automatic e-mail server.
For only about one third of the new proteins, three-dimensional (3D) structure
can be predicted. For the remaining two thirds, a compromise has to be made.
An extreme simplification is the projection of 3D structure onto a string of 1D
secondary structure assignments. Here, we report how neural networks can be
configured such that strand is predicted significantly better, and that the
prediction looks like native proteins in terms of the length of predicted
segments. Using evolutionary information contained in multiple sequence
alignments as input to neural networks, secondary structure can be predicted at
significantly increased accuracy. Pre-processing the alignment information by
using a position-specific conservation weight and the number of insertions and
deletions in each alignment position is found to be advantageous. Addition of
the global amino acid content yields a further improvement, mainly in
predicting structural class. The final network system has a sustained overall
accuracy of more than 72% evaluated on 250 sequence-unique chains. Of
particular practical importance is the definition of a position-specific
reliability index. For 40% of all residues the method has a sustained
three-state accuracy of 88%, as high as the overall average for homology
modelling.
Predicting three-dimensional (3D) protein structure from sequence alone is,
in general, currently an insurmountable task. As an intermediate step, a much
simpler task has been pursued extensively: prediction of a projection of 3D
structure onto 1D strings of secondary structure. Here, we present an analysis
of another 1D projection of 3D structure: the relative solvent accessibility of
a residue. We show that solvent accessibility is less conserved in 3D families
than secondary structure: the average correlation of relative solvent
accessibility between 3D homologues is only 0.66. This value provides an
effective practical upper limit for the accuracy of predicting accessibility
from sequence. We introduce a neural network system that predicts relative
solvent accessibility (projected onto 10 discrete states) using evolutionary
profiles of amino acid substitutions derived from multiple sequence alignments.
Evaluated in a cross-validation test on 126 unique proteins, the correlation
between predicted and observed relative accessibility is 0.54. For a ternary
(buried, intermediate, exposed) description of relative accessibility the
fraction of correctly predicted residue states is about 58%. In absolute
terms, this accuracy appears poor, but given the relatively low conservation of
accessibility in 3D families (correlation 0.66), the network system is not far
from optimal performance. Prediction is best for buried residues, e.g. 86% of
the completely buried sites are correctly predicted as having 0% relative
accessibility.
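The two evaluation measures used above, the correlation between predicted and observed relative accessibility and the fraction of correct ternary states, can be sketched as follows. The thresholds separating buried, intermediate and exposed residues are not stated in the abstract; the values 9% and 36% used below are assumptions, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def ternary_state(rel_acc, buried_below=9.0, exposed_from=36.0):
    """Map relative accessibility (in %) to b(uried), i(ntermediate), e(xposed).

    The two thresholds are assumed values, not taken from the text.
    """
    if rel_acc < buried_below:
        return "b"
    if rel_acc < exposed_from:
        return "i"
    return "e"


observed = [0.0, 5.0, 20.0, 60.0, 80.0]
predicted = [0.0, 15.0, 25.0, 40.0, 90.0]
print(round(pearson(observed, predicted), 2))
print("".join(ternary_state(v) for v in observed))
```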
Although the structure-from-sequence prediction problem remains fundamentally
unsolved, new and promising methods in 3D, 2D, and 1D have reopened the field.
Pseudopotentials or information values derived from the databases can
distinguish between correct and incorrect models (3D). Interresidue contacts
(2D) can be detected by the analysis of correlated mutations, albeit with low
accuracy. Significantly improved prediction of secondary structure (1D) from
multiple sequence alignments is now available in daily practice.
We describe a neural network system that predicts the locations of
transmembrane helices in integral membrane proteins. By using evolutionary
information as input to the network system, the method significantly improved
on a previously published neural network prediction method that had been based
on single sequence information. The input data was derived from multiple
alignments for each position in a window of 13 adjacent residues: amino acid
frequency, conservation weights, number of insertions and deletions, and
position of the window with respect to the ends of the protein chain.
Additional input was the amino acid composition and length of the whole
protein. A rigorous cross-validation test on 69 proteins with experimentally
determined locations of transmembrane segments yielded an overall two-state
per-residue accuracy of 95%. About 94% of all segments were predicted
correctly. When applied to known globular proteins as a negative control, the
network system incorrectly predicted fewer than 5% of globular proteins as
having transmembrane helices. The method was applied to all 269 open reading
frames from the complete yeast VIII chromosome. For 59 of these at least two
transmembrane helices were predicted. Thus, the prediction is that about one
fourth of all proteins from yeast VIII contain at least one transmembrane helix,
and some 20% more than one.
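The network input described above can be illustrated with a minimal sketch of its first component, the amino acid frequency at each alignment position within a window of 13 adjacent residues. Conservation weights, insertion and deletion counts, window position, composition and length are left out. The alignment format (equal-length strings with '-' for gaps) and the function names are assumptions made for this example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(alignment):
    """Amino acid frequencies per alignment position; gaps are ignored."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment if seq[i] in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append([counts[a] / total if total else 0.0 for a in AMINO_ACIDS])
    return profile


def window_input(profile, centre, width=13):
    """Flatten the profile rows of a window centred on `centre`.

    Window positions beyond the chain ends are filled with zero rows.
    """
    half = width // 2
    zero = [0.0] * len(AMINO_ACIDS)
    rows = [profile[i] if 0 <= i < len(profile) else zero
            for i in range(centre - half, centre + half + 1)]
    return [value for row in rows for value in row]


alignment = ["ACD", "ACE", "A-E"]
profile = column_frequencies(alignment)
print(len(window_input(profile, centre=0)))  # 13 positions x 20 amino acids = 260
```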
What is a protein?
The information for life is stored by a four-letter
alphabet in the genes (DNA: deoxy-ribonucleic acid). Proteins are, among
others, the macromolecules that perform all important tasks in organisms, such
as catalysis of biochemical reactions, transport of nutrients, recognition and
transmission of signals. Thus, genes are the blueprints and proteins the
machinery of life. Proteins are formed by joining the amino acids into a long
stretched chain, the protein sequence, a translation of the genes into a
20-letter alphabet of amino acids. Proteins differ in length (from 30 to
30,000 amino acids) and in the arrangement of amino acids (called residues,
when joined in proteins). In water, the chain folds up to a unique
three-dimensional (3D) structure. The main driving force is the need to pack
residues for which a contact with water is energetically unfavourable
(hydrophobic residues) into the interior of the molecule. A detailed analysis
of the underlying chemistry shows that this is only possible if the protein
forms regular patterns of a macroscopic substructure called secondary structure
(Fig. 1; for an introduction to protein structure see Branden, 1991).
What determines protein function and structure? The 3D structure of a
protein determines its function. But what determines the 3D structure? It is
well established that the details of the 3D structure (also called the fold),
are uniquely determined by the specificity of the sequence. Can the code be
deciphered, i.e. can 3D structure be predicted from sequence? In principle,
the code could be deciphered by calculating the physico-chemical force fields
determining the fold. Unfortunately, the required computer time to calculate
the 3D structure based on first principles is many orders of magnitude beyond
today's possibilities. However, it is of practical importance to know the 3D
structure. One reason is rational drug design.
Why not simply look by microscope at the 3D structure? Unfortunately,
the techniques to experimentally determine 3D structure of a protein are rather
complicated. Solving a structure can take from one to several years. Today
for some 36,000 proteins the sequence is known, but only for 2,000 has the 3D
structure been determined by experiment. Large gene sequencing projects
increase the sequence-structure gap further. The most accurate way to predict
3D structure from the sequence is by homology modelling, i.e., search for a
protein with similar sequence that has a known 3D structure and then model the
3D structure of the unknown protein in analogy to the known one. Such
techniques lead to a reduction of the sequence-structure gap by some 9,000
proteins. However, there are still some 25,000 proteins for which researchers
would like to know as much as possible about the structure, and this number is
rapidly increasing.
Why can homology modelling be successful? The exchange of a few
residues can already destabilise a protein. This implies that the majority of
the 20^N possible sequences of length N form different structures.
But, has evolution created such an immense variety? Random errors in the DNA
sequence lead to a different translation of protein sequences. These 'errors'
are the basis of evolution. Mutations resulting in a structural change are not
likely to be accepted, since the protein cannot perform its task. Furthermore,
the universe of stable structures is not continuous, i.e., minor changes on the
level of the 3D structure destabilise the structure. The evolutionary pressure
to conserve function and the discontinuity of the universe of structures have
the result that structure is evolutionarily more conserved than sequence.
Evolution has produced pairs of proteins which have the same 3D structure with
only 25% identical residues. Therefore, the 3D structure can be predicted
rather accurately by homology if a protein with sufficient sequence identity
and known 3D structure is found in the data bank.
Can the egg be unboiled? When an egg is boiled, the proteins it
contains unfold. Can this procedure be reversed in theory? Or, in other
words, can the encrypted code of protein folding be deciphered from the
sequence? Current tools to predict the 3D structure from the sequence are
rather limited (Rost and Sander, 1994b). Therefore, the problem has to be
simplified. One extreme simplification is the prediction of one dimensional
(1D) strings of secondary structure assignment. Others are the location of
functionally important residues, the classification of proteins into
structurally related families, or the prediction of whether or not a particular
residue is buried in the core of the protein.
How can neural networks predict protein structure? In practice, the
most successful predictions are based on an analysis of common features in the
data bank of known 3D structures. Artificial neural networks are well suited
for pattern classification. Here, we shall attempt to show how the application
of neural networks as devices for pattern classification can be used for the
prediction of protein structure. First, we give examples of how the data bank
of known 3D structures can be used to predict secondary structure (1), and
other structural features (2). Then, we briefly review attempts to predict
entire 3D structures (3). Finally, we give a critical evaluation of the neural
network applications by comparing these to alternative approaches and outline
the prospects of applying neural networks for protein structure prediction.
The experimental determination of protein structure cannot keep pace with the
rapid generation of new sequence information. Can theory contribute? The most
successful prediction method - and the only one for prediction of 3D structure
- is homology modelling. It is applicable for about one quarter of the
proteins. For the rest, the prediction task has to be simplified. An extreme
simplification is to project 3D structure onto 1D strings of secondary
structure or solvent accessibility. For these 1D aspects of 3D structure,
prediction accuracy has been improved significantly by using evolutionary
information as input to neural network systems. The gain in accuracy is based on
the conservation of secondary structure and relative solvent accessibility
within sequence families. Secondary structure and accessibility are conserved,
as well, between remote homologues. This fact can be used by fitting 1D
predictions into 3D structures to detect such remote homologues. In comparison
to other threading approaches, 1D threading is rather flexible. However, two
factors decrease detection accuracy. First, the loss of information by
projecting 3D structure onto 1D strings (in particular the loss of distances
between secondary structure segments). And second, the inaccuracy of
predicting 1D structure. A preliminary result is that every fifth remote
homologue is detected correctly.
Homology modelling, currently, is the only theoretical tool which can
successfully predict protein 3D structure. As 3D structure is well conserved
within sequence families, homology modelling allows prediction of 3D structure for
20% of the SWISSPROT proteins. 20% of the proteins in PDB are remote homologues to
another PDB protein, i.e. the structures are homologous but pairwise sequence
identity is not significant. Threading techniques attempt to predict such
remote homologues based on sequence information to thus increase the scope of
homology modelling. Here, a new threading method is presented. First, for a
list of PDB proteins, 3D structure was projected onto 1D strings of secondary
structure and relative solvent accessibility. Then, secondary structure and
solvent accessibility were predicted by neural network systems (PHD) for a
search sequence. Finally, the predicted and observed 1D strings were aligned
by dynamic programming. The resulting alignment was used to detect remote 3D
homologues. Four results stand out. First, even for an optimal prediction of
1D strings (taken from PDB), only about half the hits that ranked above a given
threshold were correctly identified as remote homologues; only about 25% of the
first hits were correct. Second, real predictions (PHD) were not much worse:
about 20% of the first hits were correct. Third, a simple filtering procedure
improved prediction performance to about 30% correct first hits. With such a
filter, the correct hit ranked among the first three for more than 23 out of 46
cases. Fourth, the combination of the 1D threading and sequence alignments
markedly improved the performance of the threading method TOPITS for some
selected cases.
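The alignment step of the threading method described above can be sketched with a standard global alignment by dynamic programming (Needleman-Wunsch). This is a simplified illustration, not the published method: it aligns secondary structure strings only (no solvent accessibility), the match, mismatch and gap scores are arbitrary assumptions, and the filtering procedure is omitted. The function names and the toy fold library are hypothetical.

```python
def align_score(predicted, observed, match=1, mismatch=-1, gap=-2):
    """Global alignment score between two 1D structure strings."""
    rows, cols = len(predicted) + 1, len(observed) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            same = predicted[i - 1] == observed[j - 1]
            diagonal = score[i - 1][j - 1] + (match if same else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]


def rank_folds(predicted, library):
    """Sort fold names by alignment score against the predicted string, best first."""
    return sorted(library, key=lambda name: align_score(predicted, library[name]),
                  reverse=True)


library = {"fold_a": "HHHHLLEEEE", "fold_b": "EEEELLHHHH"}
print(rank_folds("HHHHLLEEEE", library))
```

The first entry of the ranked list corresponds to the 'first hit' discussed in the text.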
Accuracy of predicting protein secondary structure and solvent accessibility
from sequence information has been improved significantly by using information
contained in multiple sequence alignments as input to a neural network system.
For the Asilomar meeting, predictions for 13 proteins were generated
automatically using the publicly available prediction method PHD. The results
confirm the estimate of 72% three-state prediction accuracy. The fairly
accurate predictions of secondary structure segments made the tool useful as a
starting point for modelling of higher dimensional aspects of protein
structure.
The problem of accurately predicting protein three-dimensional structure from
sequence has yet to be solved. Recently, a number of new and promising methods
that work in one, two or three dimensions have invigorated the field.
Modelling by homology can yield fairly accurate three-dimensional structures
for about 25% of the currently known protein sequences. Techniques for
cooperatively fitting sequences into known three-dimensional folds, called
threading methods, are capable of increasing this rate by detecting very remote
homologies in favourable cases. Prediction of protein structure in two
dimensions, i.e., prediction of inter-residue contacts, is in its infancy.
Prediction tools that work in one dimension are both mature and generally
applicable; they predict secondary structure, residue solvent accessibility,
and the location of transmembrane helices with reasonable accuracy. These and
other prediction methods have gained immensely from the rapid increase of
information in publicly accessible databases. Growing databases will lead to
further improvements of prediction methods and thus to narrowing the gap
between the number of known protein sequences and known protein structures.
Currently, the prediction of three-dimensional (3D) protein structure from
sequence alone poses insurmountable difficulties. As an intermediate step, a
much simpler task has been pursued extensively: predicting 1D strings of
secondary structure. Here, a composite neural network is described which
predicts three secondary structure states (helix, strand, loop). The network
system comprises two levels of feed-forward networks (one hidden layer each)
and a final jury decision over differently trained networks. Training is done
by an adaptive-like back-propagation. An important feature of the system
is that the input is not only the sequence of one protein but the profile of a
whole family of sequences of proteins which have the same 3D structure. The
combination of the problem-specific topology and the pre-processing of the
input improves prediction accuracy from some 62% to 72%. Furthermore, the
specific topology and training procedure successfully correct for shortcomings
of both simpler NN and classical methods. Over the last years, the system has
been the best automatic predictor in a very competitive area of research.
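The final jury decision over differently trained networks can be sketched as follows. The abstract does not say how the networks are combined; this example assumes a simple arithmetic average of the output values per residue, followed by a winner-takes-all choice among helix, strand and loop. The function name and the data layout are hypothetical.

```python
STATES = ("H", "E", "L")  # helix, strand, loop


def jury_decision(network_outputs):
    """Average the outputs of several networks and pick the winning state.

    network_outputs[n][i] holds the (helix, strand, loop) output values of
    network n at residue i. Plain averaging is an assumption.
    """
    n_networks = len(network_outputs)
    n_residues = len(network_outputs[0])
    prediction = []
    for i in range(n_residues):
        mean = [sum(net[i][s] for net in network_outputs) / n_networks
                for s in range(len(STATES))]
        prediction.append(STATES[mean.index(max(mean))])
    return "".join(prediction)


net1 = [(0.6, 0.3, 0.1), (0.2, 0.5, 0.3)]
net2 = [(0.5, 0.1, 0.4), (0.1, 0.3, 0.6)]
print(jury_decision([net1, net2]))  # HL
```

In the example the first network alone would predict strand for the second residue; the more decisive loop output of the second network tips the average.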
We still cannot predict protein three-dimensional (3D) structure from sequence
alone. But, we can predict 3D structure for one fourth of the known protein
sequences (SWISSPROT) by homology modelling based on significant sequence
identity (>25%) to known 3D structures (PDB). For the remaining
approximately 30,000 known sequences, the prediction problem has to be simplified. An
extreme simplification is to try to predict projections of 3D structure, e.g.,
1D secondary structure, solvent accessibility, or transmembrane location
assignments for each residue.
Despite the extreme simplification, the success of 1D predictions has been
limited as segments from single sequences (used as input) do not contain
sufficient global information about 3D structures. Patterns of amino acid
substitutions within sequence families are highly specific for the 3D structure
of that family. Using such evolutionary information is the key to a
significant improvement of 1D predictions.
In this review I describe three prediction methods that use evolutionary
information as input to neural network systems to predict secondary structure
(PHDsec), relative solvent accessibility (PHDacc), and transmembrane helices
(PHDhtm). I shall also illustrate the possibilities and limitations in
practical applications of these methods with results from careful
cross-validation experiments on large sets of unique protein structures.
All predictions are made available by an automatic email prediction service
(see Availability). The baseline conclusion after some 30,000 requests to the
service is that 1D predictions have become accurate enough to be used as a
starting point for expert-driven modelling of protein structure.
For transmembrane proteins experimental determination of three-dimensional
structure is problematic. However, membrane proteins have an important impact on
molecular biology in general, and on drug design in particular. Thus,
prediction methods are needed. Here we introduce a method that started from the
output of a profile-based neural network system (PHDhtm). Instead of choosing
the neural network output unit with maximal value as prediction, we implemented
a dynamic programming-like refinement procedure that aimed at producing the
best model for all transmembrane helices compatible with the neural network
output. Preliminary results suggest that the refinement was clearly superior
to the initial neural network system; and that, in terms of
predicting all transmembrane helices of a protein correctly, the method was
more accurate than a previously applied empirical filter. The refined
prediction was used successfully to predict transmembrane topology based on an
empirical rule for the charge difference between extra- and intra-cellular
regions (positive-inside rule). The resulting accuracy in predicting topology
was better than 80%. Although a more thorough evaluation of the method on a
larger data set will be required, the results compared favourably with
alternative methods for the prediction of transmembrane helices and topology.
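The topology step can be illustrated with a minimal sketch of the positive-inside rule: loops alternate between the two sides of the membrane, and the side whose loops carry more positively charged residues is taken to be intra-cellular. The sketch counts only lysine (K) and arginine (R), which is a simplification of the charge difference mentioned in the text. The function name, the helix coordinate format and the tie-breaking choice are assumptions.

```python
POSITIVE = set("KR")  # lysine and arginine


def predict_topology(sequence, helices):
    """Predict the location of the first (N-terminal) loop: 'in' or 'out'.

    `helices` is a list of (start, end) index pairs along the chain, end
    exclusive, sorted by position. Ties are resolved as 'in' (arbitrary).
    """
    boundaries = [0] + [pos for helix in helices for pos in helix] + [len(sequence)]
    # Loop regions are the stretches between consecutive helices.
    loops = [sequence[boundaries[k]:boundaries[k + 1]]
             for k in range(0, len(boundaries), 2)]
    charge = [sum(res in POSITIVE for res in loop) for loop in loops]
    same_side = sum(charge[0::2])   # loops on the side of the first loop
    other_side = sum(charge[1::2])  # loops on the opposite side
    return "in" if same_side >= other_side else "out"


two_helices = "KKRA" + "L" * 20 + "DDEA" + "L" * 20 + "RKKA"
print(predict_topology(two_helices, [(4, 24), (28, 48)]))  # in
```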
Previously, we introduced a neural network system predicting the locations of
transmembrane helices in integral membrane proteins based on evolutionary
profiles (PHDhtm). Here, we describe an improvement and an extension of that
system. The improvement is achieved by a dynamic programming-like algorithm
that optimises helices compatible with the neural network output. The
extension is the prediction of topology (orientation of first loop region with
respect to membrane) by applying the observation that positively charged
residues are more abundant in intra-cytoplasmic regions to the refined
prediction of all transmembrane helices. Furthermore, we introduce a method to
reduce the number of false positives, i.e., proteins falsely predicted with
membrane helices. The evaluation of prediction accuracy is based on a
cross-validation and a double-blind test set (in total 131 proteins). The
final method appears to be more accurate than other methods published. (1) For
almost 89% (+/-3%) of the test proteins all transmembrane helices are predicted
correctly. (2) For more than 86% (+/-3%) of the proteins topology is predicted
correctly. (3) We define reliability indices which correlate with prediction
accuracy: for the most strongly predicted half of the proteins the likelihood
of predicting all transmembrane helices correctly rises to 98%; and for
two-thirds of the proteins the accuracy of topology prediction was 95%. (4)
The rate of proteins for which transmembrane helices are predicted falsely is
below 2% (+/-1%). Finally, the method is applied to 1616 sequences of
Haemophilus influenzae. We predict 19% of the genome sequences to contain one
or more transmembrane helices. This appears to be lower than what we predicted
previously for the yeast VIII chromosome (about 25%).
INTRODUCTION
Imagine you have a protein sequence, either sequenced in your own lab or
pulled down from genome projects or EST production. You decide to let
theoretical biology assist you in finding a priori information about your
protein that may be useful to accelerate and design experiments. You submit
your sequence to database search and/or structure prediction services. The
possible pitfalls are numerous, including picking a lousy server or
misinterpreting the results. We give examples for common pitfalls collected
after 80,000 requests to an automatic prediction service (Table).
What can theory predict of protein structure? In general, protein
three-dimensional (3D) structure can NOT be predicted from sequence. However,
3D structure can be predicted by homology modelling, i.e., by using a sequence
homologue (>25% sequence identity) with an experimentally determined 3D
structure. If no sequence homologue is found in PDB, there still is a chance
to predict 3D structure by threading, i.e., by remote homology modelling
(<25% sequence identity). However, correct 3D models - and even correct
detection of remote homology - from threading are rare. But theory can assist
by predicting one-dimensional (1D) aspects of 3D structure, e.g., secondary
structure, solvent accessibility, transmembrane helices, binding sites,
sequence motifs, and aspects of protein function.
Ease of use bears an ease of misuse. Rapidly developing electronic
communication (Internet, World Wide Web) facilitates spreading prediction
methods. Experimental biologists submit sequences, theoretical biologists
configure automatic services that return predictions. The advantage is that
users need not become experts in sequence analysis tools. However, the ease of
offering and accessing predictions bears two problems. (1) Inaccurate methods
(or insufficiently validated ones) are made available bypassing selection
systems such as referees. (2) Users may misinterpret results due to a lack of
insight into the features of prediction methods.
Full Paper
In fold recognition by threading one takes the amino acid sequence of a protein
and evaluates how well it fits into one of the known three-dimensional (3D)
protein structures. The quality of sequence-structure fit is typically
evaluated using inter-residue potentials of mean force or other statistical
parameters. Here, we present a new approach to evaluating sequence-structure
fitness. Starting from the amino acid sequence we first predict secondary
structure and solvent accessibility for each residue. We then thread the
resulting one-dimensional (1D) profile of predicted structure assignments into
each of the known 3D structures. The agreement between predicted and observed
structure profile is evaluated using statistical parameters. The optimal
threading for each sequence-structure pair is obtained using dynamic
programming. The overall best sequence-structure pair constitutes the
predicted 3D structure for the input sequence. The method is fine-tuned by
adding information from direct sequence-sequence comparison and applying a
series of empirical filters. Although the method relies on reduction of 3D
information into 1D structure profiles, its accuracy is, surprisingly, not
clearly inferior to methods based on evaluation of residue interactions in 3D.
We therefore hypothesise that existing 1D-3D threading methods essentially
capture not more than the fitness of an amino acid sequence for a particular 1D
succession of secondary structure segments and residue solvent accessibility.
The prediction-based threading method on average finds a structurally
homologous region at first rank in 30% of the cases. For the 17% of first hits
detected at highest scores, the expected accuracy rose to 70%. However, the
task of detecting entire folds, rather than homologous fragments, was managed much
better: depending on the cut-off for what was regarded as an 'entire fold' the
first hit was correct in 60-80% of all cases. The quality of the resulting 3D
models depends crucially on the details of the sequence-structure alignments
which can be inaccurate in detail even in cases in which the correct fold is
detected.
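The threading step described above (aligning a predicted 1D profile against the 1D profile of each known structure by dynamic programming, then picking the best-scoring structure) can be sketched as below. This is a minimal illustration under stated assumptions: each residue is encoded as a two-character state such as "Hb" (helix, buried) or "Ee" (strand, exposed), the match/mismatch scores and the gap penalty are invented for the example, and a plain global alignment stands in for the statistical parameters and empirical filters of the actual method.

```python
def state_score(pred, obs):
    """Score one predicted state against one observed state.

    States are two-character strings: secondary structure (H, E, L)
    followed by accessibility (b = buried, e = exposed).
    The +1/-1 values are illustrative assumptions.
    """
    ss = 1 if pred[0] == obs[0] else -1
    acc = 1 if pred[1] == obs[1] else -1
    return ss + acc


def thread_score(predicted, observed, gap=-2):
    """Best global alignment score of two 1D profiles (dynamic programming)."""
    n, m = len(predicted), len(observed)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap  # leading gaps in the observed profile
    for j in range(1, m + 1):
        dp[0][j] = j * gap  # leading gaps in the predicted profile
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + state_score(predicted[i - 1], observed[j - 1]),
                dp[i - 1][j] + gap,
                dp[i][j - 1] + gap,
            )
    return dp[n][m]


def best_fold(predicted, library):
    """Name of the library structure whose profile fits the prediction best.

    library: dict mapping a structure name to its observed 1D profile.
    """
    return max(library, key=lambda name: thread_score(predicted, library[name]))
```

Because only 1D strings are compared, each alignment costs time proportional to the product of the two profile lengths, which is what makes scanning the whole structure library feasible.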
Full Paper
Today, we have a detailed and ever-widening knowledge of the evolution of
DNA sequences, but what do we really know about the evolution of protein
structure? Until recently, the answer was: not much. The first detailed
structures were determined 26 years ago; 13 years ago, the database of
atomic-resolution protein structures contained just 312 structures (PDB).
Since then, due to advances in determination methods, the PDB has grown
exponentially; presently it holds over 4000 entries. With this size, we can
just begin to analyse the evolution of protein structure. Here, we report an
analysis of all pairs of proteins in the PDB which have similar
three-dimensional (3D) structures. For each pair, we aligned the 3D
structures, and measured the sequence identity (pairwise identical residues) in
the aligned regions. The resulting distribution of pair identity scores shows
one prominent and unexpected feature: most pairs cluster in an approximately
Gaussian peak centred at 8-9% sequence identity. The distribution is
surprisingly similar to that expected for `random' pairs of completely
unrelated sequences. This result has implications for our understanding of
protein folding, and of the effect of convergent (different ancestor) and
divergent (same ancestor) evolution on protein structure.
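The quantity measured for each pair (percentage of identical residues within the aligned regions) can be computed as in the sketch below. The input format, two equal-length strings with "-" marking gaps, is an assumption made for illustration; the structural alignment itself is taken as given.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over aligned (non-gap) positions."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Only positions where both sequences have a residue count as aligned.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    identical = sum(1 for a, b in pairs if a == b)
    return 100.0 * identical / len(pairs)
```

Collecting this value over all structurally similar pairs gives the distribution of pair identity scores discussed in the abstract.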
Full Paper
A protein sequence folds into a unique three-dimensional protein structure.
Different sequences, though, can fold into similar structures. How stable is a
protein structure with respect to sequence changes? What percentage of the
sequence are 'anchor' residues, i.e., are crucial for protein structure and
function? Here, these questions are pursued by analysing large numbers of
structurally homologous protein pairs. Most pairs of similar structures have
sequence identity as low as expected from randomly related sequences. On
average only three to four percent of all residues are 'anchor' residues
(residues crucial for maintaining the structure). The symmetric shape of the
distribution at low sequence identity suggests that for most structures, four
billion years of evolution was sufficient to reach an equilibrium. The mean
identities for convergent (different ancestor) and divergent evolution (same
ancestor) of proteins to similar structures are quite close, and hence, in most
cases it is difficult to distinguish between the two effects. In particular,
low levels of sequence identity appear not to be indicative of convergent
evolution.
Full Paper
The problem of predicting protein structure from sequence remains fundamentally
unsolved despite more than three decades of intensive research effort.
However, new and promising methods in 3D, 2D, and 1D prediction have reopened
the field. Mean-force-potentials derived from the protein databases can
distinguish between correct and incorrect models (3D). Inter-residue contacts
(2D) can be detected by analysis of correlated mutations, albeit with low
accuracy. Secondary structure, solvent accessibility, and transmembrane
helices (1D) can be predicted with significantly improved accuracy using
multiple sequence alignments. Some of these new prediction methods have proven
accurate and reliable enough to be useful in genome analysis, and in
experimental structure determination. Moreover, the new generation of
theoretical methods is increasingly influencing experiments in molecular
biology.
Full Paper
In the wake of the genome data flow, we need - more urgently than ever -
accurate tools to predict protein structure. The problem of predicting protein
structure from sequence remains fundamentally unsolved despite more than three
decades of intensive research effort. However, the wealth of evolutionary
information deposited in current databases enabled a significant improvement
for methods predicting protein structure in 1D: secondary structure,
transmembrane helices, and solvent accessibility. In particular, the
combination of evolutionary information with neural networks proved extremely
successful. The new generation of prediction methods proved to be accurate and
reliable enough to be useful in genome analysis, and in experimental structure
determination. Moreover, the new generation of theoretical methods is
increasingly influencing experiments in molecular biology.
Full Paper
Over the past few years our means of communication have changed rapidly due to
the growth of the World Wide Web (WWW). The Web enables molecular biologists to
immediately access databases, scan literature, find information about related
research and researchers, and to trace cell cultures. Wet-lab biologists can
uncover information about the protein of interest without having to become
experts in sequence analysis. Here, we present a variety of tools, provide an
overview of the state of the art in sequence analysis, and describe some of
the principles of the methods.
Full Paper
Proteins are the machinery of life. The information for life is stored
by a four-letter alphabet in the genes (DNA). Proteins are, among others, the
macromolecules that perform all important tasks in organisms, such as catalysis
of biochemical reactions, transport of nutrients, recognition, and transmission
of signals. Thus, genes are the blueprints or library, and proteins are the
machinery of life. Proteins are formed by joining amino acids by peptide bonds
into a stretched chain. This protein sequence comprises a translation of the
four-letter DNA alphabet into a 20-letter alphabet of native amino acids.
Proteins differ in length (from 30 to over 30,000 amino acids), and in the
arrangement of the amino acids (dubbed residues, when joined in proteins). In
water, the chain folds up into a unique three-dimensional (3D) structure. The
main driving force is the need to pack residues for which a contact with water
is energetically unfavourable (hydrophobic residues) into the interior of the
molecule. A detailed analysis of the underlying chemistry shows that this is
only possible if the protein forms regular patterns of a macroscopic
substructure called secondary structure (Fig. 1; for an excellent introduction
into protein structure for a short review of the basic principles of folding:).
Sequence determines structure determines function. Protein
three-dimensional (3D) structure (i.e. the co-ordinates of all atoms)
determines protein function. But what determines 3D structure? The hypothesis
that structure (also referred to as 'the fold') is uniquely determined by the
specificity of the sequence, has been verified for many proteins. While it is
now known that particular proteins (chaperones) often play a rôle in the
folding pathway, and in correcting misfolds, it is still generally assumed that
the final structure is at the free-energy minimum. Thus, all information about
the native structure of a protein is coded in the amino acid sequence, plus its
native solution environment. Can the code be deciphered, i.e. can 3D structure
be predicted from sequence? In principle, the code could be deciphered from
physico-chemical principles using, for example, molecular dynamics methods. In
practice, however, such approaches are frustrated by two principal obstacles.
Firstly, energy differences between native and unfolded proteins are extremely
small (order of 1 kcal/mol). Secondly, the high complexity (i.e.
co-operativity) of protein folding requires several orders of magnitude more
computing time than we anticipate having over the next decades. Thus, the
inaccuracy in experimentally determining the basic parameters, and the limited
computing resources become fatal for predicting protein structure from first
principles.
The only successful structure prediction tools are knowledge-based, using a
combination of statistical theory and empirical rules.
The sequence-structure gap is rapidly increasing. Currently, databases
for protein sequences (e.g. SWISS-PROT) are expanding rapidly, largely due to
large-scale genome sequencing projects. The first four entire genome sequences
have been published; they represent all three terrestrial kingdoms: (1)
prokaryotes: Haemophilus influenzae and Mycoplasma genitalium; (2) eucaryotes:
yeast; and (3) archeans: Methanococcus jannaschii. At least another dozen
genomes will be completely sequenced before the end of 1997 (Terry Gaasterland,
priv. communication); the entire human genome is likely to be known in the year
2003. This implies that the explosion of genome, and hence, protein sequences
is supposedly the only field outgrowing the speed in development of computer
hardware. It also implies that, despite significant improvements in structure
determination techniques, the gap between the number of proteins for which
structure is deposited in public databases (PDB), and the number of proteins
for which sequences are known is increasing.
Can the egg be unboiled? When an egg is boiled, the proteins it
contains unfold. Can this procedure be reversed in theory? Can the encrypted
code of protein structure be deciphered? Or, can theory help to bridge the
sequence-structure gap? Indeed, for over 30 years, there has been an ardent
search for methods to predict protein structure from the sequence. Many
methods were found which looked initially very promising - but always the hope
has been dashed. How well do we do?
No general prediction of structure from sequence, yet. An important
experiment has been initiated by John Moult (CARB, Washington): those who
determine protein structures submitted the sequences of proteins for which they
were about to solve the structure to a 'to-be-predicted' database; for each
entry in that database predictors could send in their predictions before a
given deadline (the public release of the structure); finally, the results were
compared, and discussed during a workshop (in Asilomar, California). Two such
experiments have been completed: in December 1994 (Proteins special issue, Vol.
23, 1995), and in December 1996 (to be published in Proteins, 1997). The
results of both experiments demonstrated clearly that the goal to predict
structure from sequence has not been reached, yet. So, no improvement despite
ardent attempts, and the explosion of knowledge deposited in databases?
Indeed, there is a flood of literature on protein structure prediction
attempting to keep pace with the expanding databases. In this review focus
will be laid on recent prediction methods that do actually contribute to
bridging the sequence-structure gap, in particular in view of analysing entire
genomes. The first section will provide a brief sketch about where we are
today in protein structure prediction. The following chapters will sketch the
problems, and some of the solutions in database searches, and the prediction of
protein structure in 1D, 2D, and 3D (Fig. 1).
Full Paper
Accuracy of predicting protein secondary structure and solvent accessibility
has been improved significantly by using evolutionary information contained in
multiple sequence alignments. For the second Asilomar meeting, predictions
were made automatically for all targets using the publicly available prediction
service PredictProtein. Additionally, a semi-automatic procedure for
generating more informative alignments was used in combination with the PHD
prediction methods. Results confirmed the estimates for prediction accuracy.
Furthermore, the more informative alignments yielded better predictions. The
fairly accurate predictions of 1D structure were successfully used by various
groups for the Asilomar meeting as first step towards predicting higher
dimensions of protein structure.
Miguel A. Andrade, Seán I. O'Donoghue, & Burkhard Rost
J. Mol. Biol., 276, 517-525 (1998)
Full Paper
In vivo, proteins occur in widely different physio-chemical
environments, and, from in vitro studies, we know that
protein structures can be very sensitive to environment. However,
theoretical studies of protein structure have tended to ignore
this complexity. In this paper, we have approached this problem
by grouping proteins by their subcellular location and looking
for structural properties that are characteristic of each location.
We hypothesise that, throughout evolution, each subcellular location
has maintained a characteristic physio-chemical environment, and
that proteins in each location have adapted to these environments.
If so, we would expect that protein structures from different
locations will show characteristic differences, particularly at
the surface, which is directly exposed to the environment. To
test this hypothesis, we have examined all eukaryotic proteins
with known three-dimensional structure and for which the subcellular
location is known to be either nuclear, cytoplasmic, or extracellular.
In agreement with previous studies, we find that the total amino
acid composition carries a signal that identifies the subcellular
location. This signal was due almost entirely to the surface residues.
The surface residue signal was often strong enough to accurately
predict subcellular location, given only a knowledge of which
residues are at the protein surface. The results suggest how the
accuracy of prediction of location from sequence can be improved.
We concluded that protein surfaces show adaptation to their subcellular
location. The nature of these adaptations suggests several principles
that proteins may have used in adapting to particular physio-chemical
environments; these principles may be useful for protein design.
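The surface-composition signal described above can be illustrated with a short sketch: given a sequence and a per-residue flag marking surface positions, compute the amino acid composition of the surface residues only. The input layout and the squared-distance comparison of two compositions are assumptions made for illustration; the abstract does not specify how the signal was turned into a prediction.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def surface_composition(sequence, is_surface):
    """Fraction of each of the 20 amino acids among surface residues.

    sequence:   one-letter amino acid string.
    is_surface: iterable of booleans, one per residue (True = exposed).
    """
    surface = [aa for aa, exposed in zip(sequence, is_surface)
               if exposed and aa in AMINO_ACIDS]
    counts = Counter(surface)
    total = len(surface)
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}


def composition_distance(comp_a, comp_b):
    """Sum of squared differences between two composition profiles
    (an illustrative choice of distance, not taken from the paper)."""
    return sum((comp_a[aa] - comp_b[aa]) ** 2 for aa in AMINO_ACIDS)
```

A protein's surface profile could then be compared against average profiles for nuclear, cytoplasmic, and extracellular proteins, with the closest profile suggesting the location.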
Burkhard Rost & Chris Sander
in: Webster D. M. (ed.): 'Predicting protein structure'. Humana Press, 1998, in press. (1998)
Full Paper
We still cannot predict protein structure from sequence, in general.
But, we can do much better in predicting simplified aspects of
structure. Particularly, the field of secondary structure has
been revived by a break-through that has been achieved by a combination
of elaborated algorithms and evolutionary information available
in ever growing data bases. Some of the new, third generation
methods for secondary structure prediction are clearly superior
to previous methods: [[beta]]-strands are
predicted more accurately; predicted segments look like those
observed; and the overall accuracy is about ten percentage points
higher than for methods from previous generations. Performance
can be improved even further by using these methods in an 'expert'
rather than in an 'automatic' mode.
Burkhard Rost
Structure, manuscript in prep. (1998)
Full Paper
Today. Large-scale genome sequencing is filling up the
catalogue of natural proteins at a breath-taking speed. Today,
we have available not just a large number of sequences, but also
glimpses of the inventory of entire organisms. This will soon
improve our understanding of cells, in particular, and of life,
in general. Three means will contribute: (1) sequencing genomes
(genomics), (2) determining protein structures, and (3) determining
protein function. Protein structure is interwoven with function
(see Structure, in general, [1, 2, 3], in particular).
Sequencing and determining function are also routinely combined
(e.g. [4] ). However, what about the relation between structure
determination and genomics?
Tomorrow. Structural genomics, the marriage between protein
structure determination and genomics, is already beginning. Here,
I attempted to illustrate the likely direction this marriage will
take. Structure determination will be pushed by, and profit from
genomics. Basing research and technical developments (such as
drug design) on all three pillars (sequence, structure, function)
will be a big step toward understanding of life.
Objectives. Structure determination will benefit from
genomics in two ways (Fig. 1). (1) The mass of available sequences
will facilitate quick determination of structure for most existing
folds. (2) Sequences for entire organisms will help to unravel
missing links in functional pathways, to explore alternative pathways,
and to widen our understanding of principal mechanisms and of
evolutionary cross-links.
Sequence alignments unambiguously distinguish between protein pairs of similar
and non-similar structure when the pairwise sequence identity is significant.
The signal gets blurred when intruding into the twilight zone of 20-30%
sequence identity. I analysed more than a million sequence alignments between
protein pairs of known structures to explore the twilight zone. Goals were to
unravel clues for why sequence alignments are difficult in the twilight zone,
and to define a line distinguishing between true and false positives for low
levels of similarity. Six results stood out. (1) When entering the twilight
zone, the number of pairs exploded. (2) More than 95% of all pairs detected in
the twilight zone had different structures. (3) The level of significant
sequence identity and similarity were confirmed to be dependent on alignment
length. For example, if ten residues were similar in an alignment of length 16
(> 60%), structural similarity could not be inferred. (4) Above 30%
sequence similarity, more than 95% of the pairs detected were homologous; below
20%, less than 20% were homologous. (5) The 'more similar than identical' rule,
i.e., discarding all pairs for which percentage similarity was lower
than percentage identity, significantly reduced false positives. (6) Similarly
successful was sequence space hopping: two proteins were predicted to be
homologous whenever proteins were common in the sequence families of both
proteins. All findings would be applicable to automatic database searches.
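Two of the filters above, the 'more similar than identical' rule (5) and sequence space hopping (6), are simple enough to sketch directly. The function names and the representation of a sequence family as a collection of protein identifiers are assumptions made for illustration; how the families and percentages are obtained is outside this sketch.

```python
def passes_similarity_rule(pct_identity, pct_similarity):
    """'More similar than identical' rule: discard a pair when its
    percentage similarity is lower than its percentage identity."""
    return pct_similarity >= pct_identity


def linked_by_hopping(family_a, family_b):
    """Sequence space hopping: predict two proteins to be homologous
    when their sequence families share at least one member.

    family_a, family_b: iterables of protein identifiers.
    """
    return bool(set(family_a) & set(family_b))
```

In an automatic database search, the first check would be applied to each candidate hit in the twilight zone, and the second to the families collected for the query and for the hit.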
, (1998)
Full Paper