Characterization of drug-induced transcriptional modules: towards drug repositioning and functional understanding.
Iskar, M., Zeller, G., Blattmann, P., Campillos, M., Kuhn, M., Kaminska, K.H., Runz, H., Gavin, A.C., Pepperkok, R., van Noort, V. & Bork, P.
Mol Syst Biol. 2013 Apr 30;9:662. doi: 10.1038/msb.2013.20.
In pharmacology, it is crucial to understand the complex biological responses that drugs elicit in the human organism and how well they can be inferred from model organisms. We therefore identified a large set of drug-induced transcriptional modules from genome-wide microarray data of drug-treated human cell lines and rat liver, and first characterized their conservation. Over 70% of these modules were common for multiple cell lines and 15% were conserved between the human in vitro and the rat in vivo system. We then illustrate the utility of conserved and cell-type-specific drug-induced modules by predicting and experimentally validating (i) gene functions, e.g., 10 novel regulators of cellular cholesterol homeostasis and (ii) new mechanisms of action for existing drugs, thereby providing a starting point for drug repositioning, e.g., novel cell cycle inhibitors and new modulators of alpha-adrenergic receptor, peroxisome proliferator-activated receptor and estrogen receptor. Taken together, the identified modules reveal the conservation of transcriptional responses towards drugs across cell types and organisms, and improve our understanding of both the molecular basis of drug action and human biology.
Country-specific antibiotic use practices impact the human gut resistome.
Forslund, K., Sunagawa, S., Roat Kultima, J., Mende, D., Arumugam, M., Typas, A. & Bork, P.
Genome Res. 2013 Apr 8.
Despite increasing concerns over inappropriate use of antibiotics in medicine and food production, population-level resistance transfer into the human gut microbiota has not been demonstrated beyond individual case studies. To determine the "antibiotic resistance potential" for entire microbial communities, we employ metagenomic data and quantify the totality of known resistance genes in each community (its resistome) for 68 classes and subclasses of antibiotics. In 252 fecal metagenomes from three countries, we show that the most abundant resistance determinants are those for antibiotics also used in animals, and for antibiotics that have been available longer. Resistance genes are also more abundant in samples from Spain, Italy and France than from Denmark, the US, or Japan. Where comparable country-level data on antibiotic use in both humans and animals are available, differences in these statistics match the observed resistance potential differences. The results are robust over time as the antibiotic resistance determinants of individuals persist in the human gut flora for at least a year.
Genomic variation landscape of the human gut microbiome.
Schloissnig, S., Arumugam, M., Sunagawa, S., Mitreva, M., Tap, J., Zhu, A., Waller, A., Mende, D.R., Kultima, J.R., Martin, J., Kota, K., Sunyaev, S.R., Weinstock, G.M. & Bork, P.
Nature. 2013 Jan 3;493(7430):45-50. doi: 10.1038/nature11711. Epub 2012 Dec 5.
Whereas large-scale efforts have rapidly advanced the understanding and practical impact of human genomic variation, the practical impact of variation is largely unexplored in the human microbiome. We therefore developed a framework for metagenomic variation analysis and applied it to 252 faecal metagenomes of 207 individuals from Europe and North America. Using 7.4 billion reads aligned to 101 reference species, we detected 10.3 million single nucleotide polymorphisms (SNPs), 107,991 short insertions/deletions, and 1,051 structural variants. The average ratio of non-synonymous to synonymous polymorphism rates of 0.11 was more variable between gut microbial species than across human hosts. Subjects sampled at varying time intervals exhibited individuality and temporal stability of SNP variation patterns, despite considerable composition changes of their gut microbiota. This indicates that individual-specific strains are not easily replaced and that an individual might have a unique metagenomic genotype, which may be exploitable for personalized diet or drug intake.
Cell type-specific nuclear pores: a case in point for context-dependent stoichiometry of molecular machines.
Ori, A., Banterle, N., Iskar, M., Andres-Pons, A., Escher, C., Khanh Bui, H., Sparks, L., Solis-Mezarino, V., Rinner, O., Bork, P., Lemke, E.A. & Beck, M.
Mol Syst Biol. 2013 Mar 19;9:648. doi: 10.1038/msb.2013.4.
To understand the structure and function of large molecular machines, accurate knowledge of their stoichiometry is essential. In this study, we developed an integrated targeted proteomics and super-resolution microscopy approach to determine the absolute stoichiometry of the human nuclear pore complex (NPC), possibly the largest eukaryotic protein complex. We show that the human NPC has a previously unanticipated stoichiometry that varies across cancer cell types, tissues and in disease. Using large-scale proteomics, we provide evidence that more than one third of the known, well-defined nuclear protein complexes display a similar cell type-specific variation of their subunit stoichiometry. Our data point to compositional rearrangement as a widespread mechanism for adapting the functions of molecular machines toward cell type-specific constraints and context-dependent needs, and highlight the need of deeper investigation of such structural variants.
Orthologous gene clusters and taxon signature genes for viruses of prokaryotes.
Kristensen, D.M., Waller, A.S., Yamada, T., Bork, P., Mushegian, A.R. & Koonin, E.V.
J Bacteriol. 2013 Mar;195(5):941-50. doi: 10.1128/JB.01801-12. Epub 2012 Dec 7.
Viruses are the most abundant biological entities on earth and encompass a vast amount of genetic diversity. The recent rapid increase in the number of sequenced viral genomes has created unprecedented opportunities for gaining new insight into the structure and evolution of the virosphere. Here, we present an update of the phage orthologous groups (POGs), a collection of 4,542 clusters of orthologous genes from bacteriophages that now also includes viruses infecting archaea and encompasses more than 1,000 distinct virus genomes. Analysis of this expanded data set shows that the number of POGs keeps growing without saturation and that a substantial majority of the POGs remain specific to viruses, lacking homologues in prokaryotic cells, outside known proviruses. Thus, the great majority of virus genes apparently remains to be discovered. A complementary observation is that numerous viral genomes remain poorly, if at all, covered by POGs. The genome coverage by POGs is expected to increase as more genomes are sequenced. Taxon-specific, single-copy signature genes that are not observed in prokaryotic genomes outside detected proviruses were identified for two-thirds of the 57 taxa (those with genomes available from at least 3 distinct viruses), with half of these present in all members of the respective taxon. These signatures can be used to specifically identify the presence and quantify the abundance of viruses from particular taxa in metagenomic samples and thus gain new insights into the ecology and evolution of viruses in relation to their hosts.
The microbiome explored: recent insights and future challenges.
Blaser, M., Bork, P., Fraser, C., Knight, R. & Wang, J.
Nat Rev Microbiol. 2013 Mar;11(3):213-7. doi: 10.1038/nrmicro2973. Epub 2013 Feb 4.
One of the most exciting scientific advances in recent years has been the realization that commensal microorganisms are not simple 'passengers' in our bodies, but instead have key roles in our physiology, including our immune responses and metabolism, as well as in disease. These insights have been obtained, in part, through the work of large-scale, consortium-driven metagenomic projects. Here, five experts in the field of microbiome research discuss the most surprising and exciting new findings, and outline the future steps that will be necessary to elucidate the numerous roles of the microbiota in human health and disease and to develop viable therapeutic strategies.
Consistent mutational paths predict eukaryotic thermostability.
van Noort, V., Bradatsch, B., Arumugam, M., Amlacher, S., Bange, G., Creevey, C., Falk, S., Mende, D.R., Sinning, I., Hurt, E. & Bork, P.
BMC Evol Biol. 2013 Jan 10;13:7. doi: 10.1186/1471-2148-13-7.
BACKGROUND: Proteomes of thermophilic prokaryotes have been instrumental in structural biology and successfully exploited in biotechnology; however, many proteins required for eukaryotic cell function are absent from bacteria or archaea. With Chaetomium thermophilum, Thielavia terrestris and Thielavia heterothallica, three genome sequences of thermophilic eukaryotes have been published. RESULTS: Studying the genomes and proteomes of these thermophilic fungi, we found common strategies of thermal adaptation across the different kingdoms of Life, including amino acid biases and a reduced genome size. A phylogenetics-guided comparison of thermophilic proteomes with those of other, mesophilic Sordariomycetes revealed consistent amino acid substitutions associated with thermophily that were also present in an independent lineage of thermophilic fungi. The most consistent pattern is the substitution of lysine by arginine, which we could find in almost all lineages but has not been extensively used in protein stability engineering. By exploiting mutational paths towards the thermophiles, we could predict particular amino acid residues in individual proteins that contribute to thermostability and validated some of them experimentally. By determining the three-dimensional structure of an exemplar protein from C. thermophilum (Arx1), we could also characterise the molecular consequences of some of these mutations. CONCLUSIONS: The comparative analysis of these three genomes not only enhances our understanding of the evolution of thermophily, but also provides new ways to engineer protein stability.
MOCAT: a metagenomics assembly and gene prediction toolkit.
Kultima, J.R., Sunagawa, S., Li, J., Chen, W., Chen, H., Mende, D.R., Arumugam, M., Pan, Q., Liu, B., Qin, J., Wang, J. & Bork, P.
PLoS One. 2012;7(10):e47656. doi: 10.1371/journal.pone.0047656. Epub 2012 Oct 17.
MOCAT is a highly configurable, modular pipeline for fast, standardized processing of single or paired-end sequencing data generated by the Illumina platform. The pipeline uses state-of-the-art programs to quality control, map, and assemble reads from metagenomic samples sequenced at a depth of several billion base pairs, and predict protein-coding genes on assembled metagenomes. Mapping against reference databases allows for read extraction or removal, as well as abundance calculations. Relevant statistics for each processing step can be summarized into multi-sheet Excel documents and queryable SQL databases. MOCAT runs on UNIX machines and integrates seamlessly with the SGE and PBS queuing systems, commonly used to process large datasets. The open source code and modular architecture allow users to modify or exchange the programs that are utilized in the various processing steps. Individual processing steps and parameters were benchmarked and tested on artificial, real, and simulated metagenomes resulting in an improvement of selected quality metrics. MOCAT can be freely downloaded at http://www.bork.embl.de/mocat/.
eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges.
Powell, S., Szklarczyk, D., Trachana, K., Roth, A., Kuhn, M., Muller, J., Arnold, R., Rattei, T., Letunic, I., Doerks, T., Jensen, L.J., von Mering, C. & Bork, P.
Nucleic Acids Res. 2012 Jan;40(Database issue):D284-9. Epub 2011 Nov 16.
Orthologous relationships form the basis of most comparative genomic and metagenomic studies and are essential for proper phylogenetic and functional analyses. The third version of the eggNOG database (http://eggnog.embl.de) contains non-supervised orthologous groups constructed from 1133 organisms, doubling the number of genes with orthology assignment compared to eggNOG v2. The new release is the result of a number of improvements and expansions: (i) the underlying homology searches are now based on the SIMAP database; (ii) the orthologous groups have been extended to 41 levels of selected taxonomic ranges enabling much more fine-grained orthology assignments; and (iii) the newly designed web page is considerably faster with more functionality. In total, eggNOG v3 contains 721,801 orthologous groups, encompassing a total of 4,396,591 genes. Additionally, we updated 4873 and 4850 original COGs and KOGs, respectively, to include all 1133 organisms. At the universal level, covering all three domains of life, 101,208 orthologous groups are available, while the others are applicable at 40 more limited taxonomic ranges. Each group is amended by multiple sequence alignments and maximum-likelihood trees and broad functional descriptions are provided for 450,904 orthologous groups (62.5%).
InterPro in 2011: new developments in the family and domain prediction database.
Hunter, S., Jones, P., Mitchell, A., Apweiler, R., Attwood, T.K., Bateman, A., Bernard, T., Binns, D., Bork, P., Burge, S., de Castro, E., Coggill, P., Corbett, M., Das, U., Daugherty, L., Duquenne, L., Finn, R.D., Fraser, M., Gough, J., Haft, D., Hulo, N., Kahn, D., Kelly, E., Letunic, I., Lonsdale, D., Lopez, R., Madera, M., Maslen, J., McAnulla, C., McDowall, J., McMenamin, C., Mi, H., Mutowo-Muellenet, P., Mulder, N., Natale, D., Orengo, C., Pesseat, S., Punta, M., Quinn, A.F., Rivoire, C., Sangrador-Vegas, A., Selengut, J.D., Sigrist, C.J., Scheremetjew, M., Tate, J., Thimmajanarthanan, M., Thomas, P.D., Wu, C.H., Yeats, C. & Yong, S.Y.
Nucleic Acids Res. 2012 Jan;40(Database issue):D306-12. Epub 2011 Nov 16.
InterPro (http://www.ebi.ac.uk/interpro/) is a database that integrates diverse information about protein families, domains and functional sites, and makes it freely available to the public via Web-based interfaces and services. Central to the database are diagnostic models, known as signatures, against which protein sequences can be searched to determine their potential function. InterPro has utility in the large-scale analysis of whole genomes and meta-genomes, as well as in characterizing individual protein sequences. Herein we give an overview of new developments in the database and its associated software since 2009, including updates to database content, curation processes and Web and programmatic interfaces.
Transcription start site associated RNAs in bacteria.
Yus, E., Guell, M., Vivancos, A.P., Chen, W.H., Lluch-Senar, M., Delgado, J., Gavin, A.C., Bork, P. & Serrano, L.
Mol Syst Biol. 2012 May 22;8:585. doi: 10.1038/msb.2012.16.
Here, we report the genome-wide identification of small RNAs associated with transcription start sites (TSSs), termed tssRNAs, in Mycoplasma pneumoniae. tssRNAs were also found to be present in a bacterium from a different phylum, Escherichia coli. Similar to the recently identified promoter-associated tiny RNAs (tiRNAs) in eukaryotes, tssRNAs are associated with active promoters. Evidence suggests that these tssRNAs are distinct from previously described abortive transcription RNAs. tssRNAs have an average size of 45 bases and map exactly to the beginning of cognate full-length transcripts and to cryptic TSSs. Expression of bacterial tssRNAs requires factors other than the standard RNA polymerase holoenzyme. We have found that the RNA polymerase is halted at tssRNA positions in vivo, which may indicate that a pausing mechanism exists to prevent transcription in the absence of genes. These results suggest that small RNAs associated with TSSs could be a universal feature of bacterial transcription.
Prediction and identification of sequences coding for orphan enzymes using genomic and metagenomic neighbours.
Yamada, T., Waller, A.S., Raes, J., Zelezniak, A., Perchat, N., Perret, A., Salanoubat, M., Patil, K.R., Weissenbach, J. & Bork, P.
Mol Syst Biol. 2012 May 8;8:581. doi: 10.1038/msb.2012.13.
Despite the current wealth of sequencing data, one-third of all biochemically characterized metabolic enzymes lack a corresponding gene or protein sequence, and as such can be considered orphan enzymes. They represent a major gap between our molecular and biochemical knowledge, and consequently are not amenable to modern systemic analyses. As 555 of these orphan enzymes have metabolic pathway neighbours, we developed a global framework that utilizes the pathway and (meta)genomic neighbour information to assign candidate sequences to orphan enzymes. For 131 orphan enzymes (37% of those for which (meta)genomic neighbours are available), we associate sequences to them using scoring parameters with an estimated accuracy of 70%, implying functional annotation of 16 345 gene sequences in numerous (meta)genomes. As a case in point, two of these candidate sequences were experimentally validated to encode the predicted activity. In addition, we augmented the currently available genome-scale metabolic models with these new sequence-function associations and were able to expand the models by on average 8%, with a considerable change in the flux connectivity patterns and improved essentiality prediction.
Cross-talk between phosphorylation and lysine acetylation in a genome-reduced bacterium.
van Noort, V., Seebacher, J., Bader, S., Mohammed, S., Vonkova, I., Betts, M.J., Kuhner, S., Kumar, R., Maier, T., O'Flaherty, M., Rybin, V., Schmeisky, A., Yus, E., Stulke, J., Serrano, L., Russell, R.B., Heck, A.J., Bork, P. & Gavin, A.C.
Mol Syst Biol. 2012 Feb 28;8:571. doi: 10.1038/msb.2012.4.
Protein post-translational modifications (PTMs) represent important regulatory states that when combined have been hypothesized to act as molecular codes and to generate a functional diversity beyond genome and transcriptome. We systematically investigate the interplay of protein phosphorylation with other post-transcriptional regulatory mechanisms in the genome-reduced bacterium Mycoplasma pneumoniae. Systematic perturbations by deletion of its only two protein kinases and its unique protein phosphatase identified not only the protein-specific effect on the phosphorylation network, but also a modulation of proteome abundance and lysine acetylation patterns, mostly in the absence of transcriptional changes. Reciprocally, deletion of the two putative N-acetyltransferases affects protein phosphorylation, confirming cross-talk between the two PTMs. The measured M. pneumoniae phosphoproteome and lysine acetylome revealed that both PTMs are very common, that (as in Eukaryotes) they often co-occur within the same protein and that they are frequently observed at interaction interfaces and in multifunctional proteins. The results imply previously unreported hidden layers of post-transcriptional regulation intertwining phosphorylation with lysine acetylation and other mechanisms that define the functional state of a cell.
Deciphering a global network of functionally associated post-translational modifications.
Minguez, P., Parca, L., Diella, F., Mende, D.R., Kumar, R., Helmer-Citterich, M., Gavin, A.C., van Noort, V. & Bork, P.
Mol Syst Biol. 2012 Jul 17;8:599. doi: 10.1038/msb.2012.31.
Various post-translational modifications (PTMs) fine-tune the functions of almost all eukaryotic proteins, and co-regulation of different types of PTMs has been shown within and between a number of proteins. Aiming at a more global view of the interplay between PTM types, we collected modifications for 13 frequent PTM types in 8 eukaryotes, compared their speed of evolution and developed a method for measuring PTM co-evolution within proteins based on the co-occurrence of sites across eukaryotes. As many sites are still to be discovered, this is a considerable underestimate; yet, assuming that most co-evolving PTMs are functionally associated, we found that PTM types are vastly interconnected, forming a global network that comprises in human alone >50 000 residues in about 6000 proteins. We predict substantial PTM type interplay in secreted and membrane-associated proteins and in the context of particular protein domains and short-linear motifs. The global network of co-evolving PTM types implies a complex and intertwined post-translational regulation landscape that is likely to regulate multiple functional states of many if not all eukaryotic proteins.
Drug discovery in the age of systems biology: the rise of computational approaches for data integration.
Iskar, M., Zeller, G., Zhao, X.M., van Noort, V. & Bork, P.
Curr Opin Biotechnol. 2011 Dec 5.
The increased availability of large-scale open-access resources on bioactivities of small molecules has a significant impact on pharmacology, facilitated mainly by computational approaches that digest the vast amounts of data. We discuss here how computational data integration enables systemic views on a drug's action and allows us to tackle complex problems such as the large-scale prediction of drug targets, drug repurposing, the molecular mechanisms, cellular responses or side effects. We particularly focus on computational methods that leverage various cell-based transcriptional, proteomic and phenotypic profiles of drug response in order to gain a systemic view of drug action at the molecular, cellular and whole-organism scale.
Prediction of drug combinations by integrating molecular and pharmacological data.
Zhao, X.M., Iskar, M., Zeller, G., Kuhn, M., van Noort, V. & Bork, P.
PLoS Comput Biol. 2011 Dec;7(12):e1002323. Epub 2011 Dec 29.
Combinatorial therapy is a promising strategy for combating complex disorders due to improved efficacy and reduced side effects. However, screening new drug combinations exhaustively is impractical considering all possible combinations between drugs. Here, we present a novel computational approach to predict drug combinations by integrating molecular and pharmacological data. Specifically, drugs are represented by a set of their properties, such as their targets or indications. By integrating several of these features, we show that feature patterns enriched in approved drug combinations are not only predictive for new drug combinations but also provide insights into mechanisms underlying combinatorial therapy. Further analysis confirmed that among our top ranked predictions of effective combinations, 69% are supported by literature, while the others represent novel potential drug combinations. We believe that our proposed approach can help to limit the search space of drug combinations and provide a new way to effectively utilize existing drugs for new purposes.
A holistic approach to marine eco-systems biology.
Karsenti, E., Acinas, S.G., Bork, P., Bowler, C., De Vargas, C., Raes, J., Sullivan, M., Arendt, D., Benzoni, F., Claverie, J.M., Follows, M., Gorsky, G., Hingamp, P., Iudicone, D., Jaillon, O., Kandels-Lewis, S., Krzic, U., Not, F., Ogata, H., Pesant, S., Reynaud, E.G., Sardet, C., Sieracki, M.E., Speich, S., Velayoudon, D., Weissenbach, J. & Wincker, P.
PLoS Biol. 2011 Oct;9(10):e1001177. doi: 10.1371/journal.pbio.1001177. Epub 2011 Oct 18.
The structure, robustness, and dynamics of ocean plankton ecosystems remain poorly understood due to sampling, analysis, and computational limitations. The Tara Oceans consortium organizes expeditions to help fill this gap at the global level.
Orthology prediction methods: a quality assessment using curated protein families.
Trachana, K., Larsson, T.A., Powell, S., Chen, W.H., Doerks, T., Muller, J. & Bork, P.
Bioessays. 2011 Oct;33(10):769-80. doi: 10.1002/bies.201100062. Epub 2011 Aug 19.
The increasing number of sequenced genomes has prompted the development of several automated orthology prediction methods. Tests to evaluate the accuracy of predictions and to explore biases caused by biological and technical factors are therefore required. We used 70 manually curated families to analyze the performance of five public methods in Metazoa. We analyzed the strengths and weaknesses of the methods and quantified the impact of biological and technical challenges. From the latter part of the analysis, genome annotation emerged as the largest single influencer, affecting up to 30% of the performance. Generally, most methods did well in assigning orthologous groups, but they failed to assign the exact number of genes for half of the groups. The publicly available benchmark set (http://eggnog.embl.de/orthobench/) should facilitate the improvement of current orthology assignment protocols, which is of utmost importance for many fields of biology and should be tackled by a broad scientific community.
Insight into structure and assembly of the nuclear pore complex by utilizing the genome of a eukaryotic thermophile.
Amlacher, S., Sarges, P., Flemming, D., van Noort, V., Kunze, R., Devos, D.P., Arumugam, M., Bork, P. & Hurt, E.
Cell. 2011 Jul 22;146(2):277-89.
Despite decades of research, the structure and assembly of the nuclear pore complex (NPC), which is composed of approximately 30 nucleoporins (Nups), remain elusive. Here, we report the genome of the thermophilic fungus Chaetomium thermophilum (ct) and identify the complete repertoire of Nups therein. The thermophilic proteins show improved properties for structural and biochemical studies compared to their mesophilic counterparts, and purified ctNups enabled the reconstitution of the inner pore ring module that spans the width of the NPC from the anchoring membrane to the central transport channel. This module is composed of two large Nups, Nup192 and Nup170, which are flexibly bridged by short linear motifs made up of linker Nups, Nic96 and Nup53. This assembly illustrates how Nup interactions can generate structural plasticity within the NPC scaffold. Our findings therefore demonstrate the utility of the genome of a thermophilic eukaryote for studying complex molecular machines.
iPath2.0: interactive pathway explorer.
Yamada, T., Letunic, I., Okuda, S., Kanehisa, M. & Bork, P.
Nucleic Acids Res. 2011 Jul;39(Web Server issue):W412-5. Epub 2011 May 5.
iPath2.0 is a web-based tool (http://pathways.embl.de) for the visualization and analysis of cellular pathways. Its primary map summarizes the metabolism in biological systems as annotated to date. Nodes in the map correspond to various chemical compounds and edges represent series of enzymatic reactions. In two other maps, iPath2.0 provides an overview of secondary metabolite biosynthesis and a hand-picked selection of important regulatory pathways and other functional modules, allowing a more general overview of protein functions in a genome or metagenome. iPath2.0's main interface is an interactive Flash-based viewer, which allows users to easily navigate and explore the complex pathway maps. In addition to the default pre-computed overview maps, iPath offers several data mapping tools. Users can upload various types of data and completely customize all nodes and edges of iPath2.0's maps. These customized maps give users an intuitive overview of their own data, guiding the analysis of various genomics and metagenomics projects.
Interactive Tree Of Life v2: online annotation and display of phylogenetic trees made easy.
Letunic, I. & Bork, P.
Nucleic Acids Res. 2011 Jul;39(Web Server issue):W475-8. Epub 2011 Apr 5.
Interactive Tree Of Life (http://itol.embl.de) is a web-based tool for the display, manipulation and annotation of phylogenetic trees. It is freely available and open to everyone. In addition to classical tree viewer functions, iTOL offers many novel ways of annotating trees with various additional data. Current version introduces numerous new features and greatly expands the number of supported data set types. Trees can be interactively manipulated and edited. A free personal account system is available, providing management and sharing of trees in user defined workspaces and projects. Export to various bitmap and vector graphics formats is supported. Batch access interface is available for programmatic access or inclusion of interactive trees into other web services.
Enterotypes of the human gut microbiome.
Arumugam, M., Raes, J., Pelletier, E., Le Paslier, D., Yamada, T., Mende, D.R., Fernandes, G.R., Tap, J., Bruls, T., Batto, J.M., Bertalan, M., Borruel, N., Casellas, F., Fernandez, L., Gautier, L., Hansen, T., Hattori, M., Hayashi, T., Kleerebezem, M., Kurokawa, K., Leclerc, M., Levenez, F., Manichanh, C., Nielsen, H.B., Nielsen, T., Pons, N., Poulain, J., Qin, J., Sicheritz-Ponten, T., Tims, S., Torrents, D., Ugarte, E., Zoetendal, E.G., Wang, J., Guarner, F., Pedersen, O., de Vos, W.M., Brunak, S., Dore, J., Antolin, M., Artiguenave, F., Blottiere, H.M., Almeida, M., Brechot, C., Cara, C., Chervaux, C., Cultrone, A., Delorme, C., Denariaz, G., Dervyn, R., Foerstner, K.U., Friss, C., van de Guchte, M., Guedon, E., Haimet, F., Huber, W., van Hylckama-Vlieg, J., Jamet, A., Juste, C., Kaci, G., Knol, J., Lakhdari, O., Layec, S., Le Roux, K., Maguin, E., Merieux, A., Melo Minardi, R., M'rini, C., Muller, J., Oozeer, R., Parkhill, J., Renault, P., Rescigno, M., Sanchez, N., Sunagawa, S., Torrejon, A., Turner, K., Vandemeulebrouck, G., Varela, E., Winogradsky, Y., Zeller, G., Weissenbach, J., Ehrlich, S.D. & Bork, P.
Nature. 2011 May 12;473(7346):174-80. Epub 2011 Apr 20.
Our knowledge of species and functional composition of the human gut microbiome is rapidly increasing, but it is still based on very few cohorts and little is known about variation across the world. By combining 22 newly sequenced faecal metagenomes of individuals from four countries with previously published data sets, here we identify three robust clusters (referred to as enterotypes hereafter) that are not nation or continent specific. We also confirmed the enterotypes in two published, larger cohorts, indicating that intestinal microbiota variation is generally stratified, not continuous. This indicates further the existence of a limited number of well-balanced host-microbial symbiotic states that might respond differently to diet and drug intake. The enterotypes are mostly driven by species composition, but abundant molecular functions are not necessarily provided by abundant species, highlighting the importance of a functional analysis to understand microbial communities. Although individual host properties such as body mass index, age, or gender cannot explain the observed enterotypes, data-driven marker genes or functional modules can be identified for each of these host properties. For example, twelve genes significantly correlate with age and three functional modules with the body mass index, hinting at a diagnostic potential of microbial markers.
The ecoresponsive genome of Daphnia pulex.
Colbourne, J.K., Pfrender, M.E., Gilbert, D., Thomas, W.K., Tucker, A., Oakley, T.H., Tokishita, S., Aerts, A., Arnold, G.J., Basu, M.K., Bauer, D.J., Caceres, C.E., Carmel, L., Casola, C., Choi, J.H., Detter, J.C., Dong, Q., Dusheyko, S., Eads, B.D., Frohlich, T., Geiler-Samerotte, K.A., Gerlach, D., Hatcher, P., Jogdeo, S., Krijgsveld, J., Kriventseva, E.V., Kultz, D., Laforsch, C., Lindquist, E., Lopez, J., Manak, J.R., Muller, J., Pangilinan, J., Patwardhan, R.P., Pitluck, S., Pritham, E.J., Rechtsteiner, A., Rho, M., Rogozin, I.B., Sakarya, O., Salamov, A., Schaack, S., Shapiro, H., Shiga, Y., Skalitzky, C., Smith, Z., Souvorov, A., Sung, W., Tang, Z., Tsuchiya, D., Tu, H., Vos, H., Wang, M., Wolf, Y.I., Yamagata, H., Yamada, T., Ye, Y., Shaw, J.R., Andrews, J., Crease, T.J., Tang, H., Lucas, S.M., Robertson, H.M., Bork, P., Koonin, E.V., Zdobnov, E.M., Grigoriev, I.V., Lynch, M. & Boore, J.L.
Science. 2011 Feb 4;331(6017):555-61. doi: 10.1126/science.1197761.
We describe the draft genome of the microcrustacean Daphnia pulex, which is only 200 megabases and contains at least 30,907 genes. The high gene count is a consequence of an elevated rate of gene duplication resulting in tandem gene clusters. More than a third of Daphnia's genes have no detectable homologs in any other available proteome, and the most amplified gene families are specific to the Daphnia lineage. The coexpansion of gene families interacting within metabolic pathways suggests that the maintenance of duplicated genes is not random, and the analysis of gene expression under different environmental conditions reveals that numerous paralogs acquire divergent expression patterns soon after duplication. Daphnia-specific genes, including many additional loci within sequenced regions that are otherwise devoid of annotations, are the most responsive genes to ecological challenges.
The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored.
Szklarczyk, D., Franceschini, A., Kuhn, M., Simonovic, M., Roth, A., Minguez, P., Doerks, T., Stark, M., Muller, J., Bork, P., Jensen, L.J. & von Mering, C.
Nucleic Acids Res. 2011 Jan;39(Database issue):D561-8. Epub 2010 Nov 2.
An essential prerequisite for any systems-level understanding of cellular functions is to correctly uncover and annotate all functional interactions among proteins in the cell. Toward this goal, remarkable progress has been made in recent years, both in terms of experimental measurements and computational prediction techniques. However, public efforts to collect and present protein interaction information have struggled to keep up with the pace of interaction discovery, partly because protein-protein interaction information can be error-prone and require considerable effort to annotate. Here, we present an update on the online database resource Search Tool for the Retrieval of Interacting Genes (STRING); it provides uniquely comprehensive coverage and ease of access to both experimental as well as predicted interaction information. Interactions in STRING are provided with a confidence score, and accessory information such as protein domains and 3D structures is made available, all within a stable and consistent identifier space. New features in STRING include an interactive network viewer that can cluster networks on demand, updated on-screen previews of structural information including homology models, extensive data updates and strongly improved connectivity and integration with third-party resources. Version 9.0 of STRING covers more than 1100 completely sequenced organisms; the resource can be reached at http://string-db.org.
Network neighbors of drug targets contribute to drug side-effect similarity.
Brouwers, L., Iskar, M., Zeller, G., van Noort, V. & Bork, P.
PLoS One. 2011;6(7):e22187. Epub 2011 Jul 13.
In pharmacology, it is essential to identify the molecular mechanisms of drug action in order to understand adverse side effects. These adverse side effects have been used to infer whether two drugs share a target protein. However, side-effect similarity of drugs could also be caused by their target proteins being close in a molecular network, which as such could cause similar downstream effects. In this study, we investigated the proportion of side-effect similarities that is due to targets that are close in the network compared to shared drug targets. We found that only a minor fraction of side-effect similarities (5.8%) are caused by drugs targeting proteins close in the network, compared to side-effect similarities caused by overlapping drug targets (64%). Moreover, these targets that cause similar side effects are more often in a linear part of the network, having two or fewer interactions, than drug targets in general. Based on the examples, we gained novel insight into the molecular mechanisms of side effects associated with several drug targets. Looking forward, such analyses will be extremely useful in the process of drug development to better understand adverse side effects.
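The distinction drawn in this abstract, between drug pairs that share a target and pairs whose targets are merely close in the network, can be illustrated with a small sketch. This is not the authors' pipeline: the toy network, the protein names, the function names and the `max_distance=1` cutoff for "close" are all assumptions made for illustration.

```python
from collections import deque


def shortest_distance(network, sources, goals):
    """Breadth-first search: fewest edges from any source to any goal, or None."""
    goals = set(goals)
    seen = set(sources)
    queue = deque((node, 0) for node in sources)
    while queue:
        node, dist = queue.popleft()
        if node in goals:
            return dist
        for neighbor in network.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def classify_pair(network, targets_a, targets_b, max_distance=1):
    """Label a drug pair by how its target sets relate in the network.

    max_distance is a hypothetical cutoff for "close in the network".
    """
    dist = shortest_distance(network, targets_a, targets_b)
    if dist == 0:
        return "shared target"
    if dist is not None and dist <= max_distance:
        return "network neighbors"
    return "distant"


# Toy undirected interaction network (adjacency lists); names are made up.
network = {"P1": ["P2"], "P2": ["P1", "P3"], "P3": ["P2"], "P4": []}

print(classify_pair(network, {"P1"}, {"P1", "P4"}))
print(classify_pair(network, {"P1"}, {"P2"}))
print(classify_pair(network, {"P1"}, {"P3"}))
```

Counting how many side-effect-similar drug pairs fall into each label would give proportions of the kind reported above, under whatever definition of network closeness is chosen.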
SmashCell: a software framework for the analysis of single-cell amplified genome sequences.
Harrington, E.D., Arumugam, M., Raes, J., Bork, P. & Relman, D.A.
Bioinformatics. 2010 Dec 1;26(23):2979-80. Epub 2010 Oct 21.
SUMMARY: Recent advances in single-cell manipulation technology, whole genome amplification and high-throughput sequencing have now made it possible to sequence the genome of an individual cell. The bioinformatic analysis of these genomes, however, is far more complicated than the analysis of those generated using traditional, culture-based methods. In order to simplify this analysis, we have developed SmashCell (Simple Metagenomics Analysis SHell - for sequences from single Cells). It is designed to automate the main steps in microbial genome analysis (assembly, gene prediction, functional annotation) in a way that allows parameter and algorithm exploration at each step in the process. It also manages the data created by these analyses and provides visualization methods for rapid analysis of the results. AVAILABILITY: The SmashCell source code and a comprehensive manual are available at http://asiago.stanford.edu/SmashCell CONTACT: email@example.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
SmashCommunity: a metagenomic annotation and analysis tool.
Arumugam, M., Harrington, E.D., Foerstner, K.U., Raes, J. & Bork, P.
Bioinformatics. 2010 Dec 1;26(23):2977-8. Epub 2010 Oct 19.
SUMMARY: SmashCommunity is a stand-alone metagenomic annotation and analysis pipeline suitable for data from Sanger and 454 sequencing technologies. It supports state-of-the-art software for essential metagenomic tasks such as assembly and gene prediction. It provides tools to estimate the quantitative phylogenetic and functional compositions of metagenomes, to compare compositions of multiple metagenomes and to produce intuitive visual representations of such analyses. AVAILABILITY: SmashCommunity source code and documentation are available at http://www.bork.embl.de/software/smash CONTACT: firstname.lastname@example.org SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
A systematic screen for protein-lipid interactions in Saccharomyces cerevisiae.
Gallego, O., Betts, M.J., Gvozdenovic-Jeremic, J., Maeda, K., Matetzki, C., Aguilar-Gurrieri, C., Beltran-Alvarez, P., Bonn, S., Fernandez-Tornero, C., Jensen, L.J., Kuhn, M., Trott, J., Rybin, V., Muller, C.W., Bork, P., Kaksonen, M., Russell, R.B. & Gavin, A.C.
Mol Syst Biol. 2010 Nov 30;6:430. doi: 10.1038/msb.2010.87.
Protein-metabolite networks are central to biological systems, but are incompletely understood. Here, we report a screen to catalog protein-lipid interactions in yeast. We used arrays of 56 metabolites to measure lipid-binding fingerprints of 172 proteins, including 91 with predicted lipid-binding domains. We identified 530 protein-lipid associations, the majority of which are novel. To show the data set's biological value, we studied further several novel interactions with sphingolipids, a class of conserved bioactive lipids with an elusive mode of action. Integration of live-cell imaging suggests new cellular targets for these molecules, including several with pleckstrin homology (PH) domains. Validated interactions with Slm1, a regulator of actin polarization, show that PH domains can have unexpected lipid-binding specificities and can act as coincidence sensors for both phosphatidylinositol phosphates and phosphorylated sphingolipids.
A human gut microbial gene catalogue established by metagenomic sequencing.
Qin, J., Li, R., Raes, J., Arumugam, M., Burgdorf, K.S., Manichanh, C., Nielsen, T., Pons, N., Levenez, F., Yamada, T., Mende, D.R., Li, J., Xu, J., Li, S., Li, D., Cao, J., Wang, B., Liang, H., Zheng, H., Xie, Y., Tap, J., Lepage, P., Bertalan, M., Batto, J.M., Hansen, T., Le Paslier, D., Linneberg, A., Nielsen, H.B., Pelletier, E., Renault, P., Sicheritz-Ponten, T., Turner, K., Zhu, H., Yu, C., Li, S., Jian, M., Zhou, Y., Li, Y., Zhang, X., Li, S., Qin, N., Yang, H., Wang, J., Brunak, S., Dore, J., Guarner, F., Kristiansen, K., Pedersen, O., Parkhill, J., Weissenbach, J., Bork, P., Ehrlich, S.D. & Wang, J.
Nature. 2010 Mar 4;464(7285):59-65.
To understand the impact of gut microbes on human health and well-being it is crucial to assess their genetic potential. Here we describe the Illumina-based metagenomic sequencing, assembly and characterization of 3.3 million non-redundant microbial genes, derived from 576.7 gigabases of sequence, from faecal samples of 124 European individuals. The gene set, approximately 150 times larger than the human gene complement, contains an overwhelming majority of the prevalent (more frequent) microbial genes of the cohort and probably includes a large proportion of the prevalent human intestinal microbial genes. The genes are largely shared among individuals of the cohort. Over 99% of the genes are bacterial, indicating that the entire cohort harbours between 1,000 and 1,150 prevalent bacterial species and each individual at least 160 such species, which are also largely shared. We define and describe the minimal gut metagenome and the minimal gut bacterial genome in terms of functions present in all individuals and most bacteria, respectively.
High-resolution transcription atlas of the mitotic cell cycle in budding yeast.
Granovskaia, M.V., Jensen, L.J., Ritchie, M.E., Toedling, J., Ning, Y., Bork, P., Huber, W. & Steinmetz, L.M.
Genome Biol. 2010 Mar 1;11(3):R24.
ABSTRACT: BACKGROUND: Extensive transcription of non-coding RNAs has been detected in eukaryotic genomes and is thought to constitute an additional layer in the regulation of gene expression. Despite this role, their transcription through the cell cycle has not been studied; genome-wide approaches have only focused on protein-coding genes. To explore the complex transcriptome architecture underlying the budding yeast cell cycle, we used 8 bp tiling arrays to generate a 5 minute-resolution, strand-specific expression atlas of the whole genome. RESULTS: We discovered 523 antisense transcripts, of which 80 cycle or are located opposite periodically expressed mRNAs, 135 unannotated intergenic non-coding RNAs, of which 11 cycle, and 109 cell-cycle-regulated protein-coding genes that had not previously been shown to cycle. We detected periodic expression coupling of sense and antisense transcript pairs, including antisense transcripts opposite of key cell-cycle regulators, like FAR1 and TAF2. CONCLUSIONS: Our dataset presents the most comprehensive resource to date on gene expression during the budding yeast cell cycle, revealing both protein-coding and non-coding RNA periodicity of expression and the first that profiles non-annotated RNAs. It enables hypothesis-driven mechanistic studies concerning the functions of non-coding RNAs.
Evolution and regulation of cellular periodic processes: a role for paralogues.
Trachana, K., Jensen, L.J. & Bork, P.
EMBO Rep. 2010 Mar;11(3):233-8. Epub 2010 Feb 19.
Several cyclic processes take place within a single organism. For example, the cell cycle is coordinated with the 24 h diurnal rhythm in animals and plants, and with the 40 min ultradian rhythm in budding yeast. To examine the evolution of periodic gene expression during these processes, we performed the first systematic comparison in three organisms (Homo sapiens, Arabidopsis thaliana and Saccharomyces cerevisiae) by using public microarray data. We observed that although diurnal-regulated and ultradian-regulated genes are not generally cell-cycle-regulated, they tend to have cell-cycle-regulated paralogues. Thus, diverged temporal expression of paralogues seems to facilitate cellular orchestration under different periodic stimuli. Lineage-specific functional repertoires of periodic-associated paralogues imply that this mode of regulation might have evolved independently in several organisms.
Ancient animal microRNAs and the evolution of tissue identity.
Christodoulou, F., Raible, F., Tomer, R., Simakov, O., Trachana, K., Klaus, S., Snyman, H., Hannon, G.J., Bork, P. & Arendt, D.
Nature. 2010 Feb 25;463(7284):1084-8. Epub 2010 Jan 31.
The spectacular escalation in complexity in early bilaterian evolution correlates with a strong increase in the number of microRNAs. To explore the link between the birth of ancient microRNAs and body plan evolution, we set out to determine the ancient sites of activity of conserved bilaterian microRNA families in a comparative approach. We reason that any specific localization shared between protostomes and deuterostomes (the two major superphyla of bilaterian animals) should probably reflect an ancient specificity of that microRNA in their last common ancestor. Here, we investigate the expression of conserved bilaterian microRNAs in Platynereis dumerilii, a protostome retaining ancestral bilaterian features, in Capitella, another marine annelid, in the sea urchin Strongylocentrotus, a deuterostome, and in sea anemone Nematostella, representing an outgroup to the bilaterians. Our comparative data indicate that the oldest known animal microRNA, miR-100, and the related miR-125 and let-7 were initially active in neurosecretory cells located around the mouth. Other sets of ancient microRNAs were first present in locomotor ciliated cells, specific brain centres, or, more broadly, one of four major organ systems: central nervous system, sensory tissue, musculature and gut. These findings reveal that microRNA evolution and the establishment of tissue identities were closely coupled in bilaterian evolution. Also, they outline a minimum set of cell types and tissues that existed in the protostome-deuterostome ancestor.
AQUA: automated quality improvement for multiple sequence alignments.
Muller, J., Creevey, C.J., Thompson, J.D., Arendt, D. & Bork, P.
Bioinformatics. 2010 Jan 15;26(2):263-5. Epub 2009 Nov 19.
Multiple sequence alignment (MSA) is a central tool in most modern biology studies. However, despite generations of valuable tools, human experts are still able to improve automatically generated MSAs. In an effort to automatically identify the most reliable MSA for a given protein family, we propose a very simple protocol, named AQUA for 'Automated quality improvement for multiple sequence alignments'. Our current implementation relies on two alignment programs (MUSCLE and MAFFT), one refinement program (RASCAL) and one assessment program (NORMD), but other programs could be incorporated at any of the three steps. Availability: AQUA is implemented in Tcl/Tk and runs in command line on all platforms. The source code is available under the GNU GPL license. Source code, README and Supplementary data are available at http://www.bork.embl.de/Docu/AQUA.
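The three-step protocol described in this abstract (align, refine, assess, then keep the most reliable MSA) can be sketched as a generic selection loop. This is a minimal sketch and not the AQUA implementation, which is written in Tcl/Tk: the function `select_best_alignment`, the toy aligners and the toy score below are hypothetical stand-ins for MUSCLE/MAFFT, RASCAL and NORMD, and scoring both the raw and the refined alignments is an assumption.

```python
def select_best_alignment(sequences, aligners, refine, assess):
    """Run every aligner, refine each result, score all candidates, keep the best.

    aligners: dict mapping a label to a callable(sequences) -> alignment
    refine:   callable(alignment) -> alignment
    assess:   callable(alignment) -> float, higher meaning more reliable
    """
    candidates = []
    for name, align in aligners.items():
        raw = align(sequences)
        candidates.append((name, raw))
        candidates.append((name + "+refined", refine(raw)))
    scored = [(assess(msa), label, msa) for label, msa in candidates]
    return max(scored, key=lambda item: item[0])


# Toy stand-ins so the sketch runs without any external program.
def pad_right(seqs):
    width = max(len(s) for s in seqs)
    return [s.ljust(width, "-") for s in seqs]


def pad_left(seqs):
    width = max(len(s) for s in seqs)
    return [s.rjust(width, "-") for s in seqs]


def identity_refine(msa):
    return list(msa)


def conserved_fraction(msa):
    """Toy quality score: fraction of gap-free, fully conserved columns."""
    columns = list(zip(*msa))
    conserved = sum(1 for col in columns if "-" not in col and len(set(col)) == 1)
    return conserved / len(columns)


score, label, best = select_best_alignment(
    ["ACGT", "CGT"],
    {"toy_pad_right": pad_right, "toy_pad_left": pad_left},
    identity_refine,
    conserved_fraction,
)
print(label, score, best)
```

Passing the programs in as callables mirrors the abstract's remark that other programs could be incorporated at any of the three steps.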
Drug-induced regulation of target expression.
Iskar, M., Campillos, M., Kuhn, M., Jensen, L.J., van Noort, V. & Bork, P.
PLoS Comput Biol. 2010 Sep 9;6(9). pii: e1000925.
Drug perturbations of human cells lead to complex responses upon target binding. One of the known mechanisms is a (positive or negative) feedback loop that adjusts the expression level of the respective target protein. To quantify this mechanism systems-wide in an unbiased way, drug-induced differential expression of drug target mRNA was examined in three cell lines using the Connectivity Map. To overcome various biases in this valuable resource, we have developed a computational normalization and scoring procedure that is applicable to gene expression recording upon heterogeneous drug treatments. In 1290 drug-target relations, corresponding to 466 drugs acting on 167 drug targets studied, 8% of the targets are subject to regulation at the mRNA level. We confirmed systematically that in particular G-protein coupled receptors, when serving as known targets, are regulated upon drug treatment. We further newly identified drug-induced differential regulation of Lanosterol 14-alpha demethylase, Endoplasmin, DNA topoisomerase 2-alpha and Calmodulin 1. The feedback regulation in these and other targets is likely to be relevant for the success or failure of the molecular intervention.
Impact of genome reduction on bacterial metabolism and its regulation.
Yus, E., Maier, T., Michalodimitrakis, K., van Noort, V., Yamada, T., Chen, W.H., Wodke, J.A., Guell, M., Martinez, S., Bourgeois, R., Kuhner, S., Raineri, E., Letunic, I., Kalinina, O.V., Rode, M., Herrmann, R., Gutierrez-Gallego, R., Russell, R.B., Gavin, A.C., Bork, P. & Serrano, L.
Science. 2009 Nov 27;326(5957):1263-8.
To understand basic principles of bacterial metabolism organization and regulation, but also the impact of genome size, we systematically studied one of the smallest bacteria, Mycoplasma pneumoniae. A manually curated metabolic network of 189 reactions catalyzed by 129 enzymes allowed the design of a defined, minimal medium with 19 essential nutrients. More than 1300 growth curves were recorded in the presence of various nutrient concentrations. Measurements of biomass indicators, metabolites, and 13C-glucose experiments provided information on directionality, fluxes, and energetics; integration with transcription profiling enabled the global analysis of metabolic regulation. Compared with more complex bacteria, the M. pneumoniae metabolic network has a more linear topology and contains a higher fraction of multifunctional enzymes; general features such as metabolite concentrations, cellular energetics, adaptability, and global gene expression responses are similar, however.
Transcriptome complexity in a genome-reduced bacterium.
Guell, M., van Noort, V., Yus, E., Chen, W.H., Leigh-Bell, J., Michalodimitrakis, K., Yamada, T., Arumugam, M., Doerks, T., Kuhner, S., Rode, M., Suyama, M., Schmidt, S., Gavin, A.C., Bork, P. & Serrano, L.
Science. 2009 Nov 27;326(5957):1268-71.
To study basic principles of transcriptome organization in bacteria, we analyzed one of the smallest self-replicating organisms, Mycoplasma pneumoniae. We combined strand-specific tiling arrays, complemented by transcriptome sequencing, with more than 252 spotted arrays. We detected 117 previously undescribed, mostly noncoding transcripts, 89 of them in antisense configuration to known genes. We identified 341 operons, of which 139 are polycistronic; almost half of the latter show decaying expression in a staircase-like manner. Under various conditions, operons could be divided into 447 smaller transcriptional units, resulting in many alternative transcripts. Frequent antisense transcripts, alternative transcripts, and multiple regulators per gene imply a highly dynamic transcriptome, more similar to that of eukaryotes than previously thought.
Evolution of biomolecular networks: lessons from metabolic and protein interactions.
Yamada, T. & Bork, P.
Nat Rev Mol Cell Biol. 2009 Nov;10(11):791-803.
Despite only becoming popular at the beginning of this decade, biomolecular networks are now frameworks that facilitate many discoveries in molecular biology. The nodes of these networks are usually proteins (specifically enzymes in metabolic networks), whereas the links (or edges) are their interactions with other molecules. These networks are made up of protein-protein interactions or enzyme-enzyme interactions through shared metabolites in the case of metabolic networks. Evolutionary analysis has revealed that changes in the nodes and links in protein-protein interaction and metabolic networks are subject to different selection pressures owing to distinct topological features. However, many evolutionary constraints can be uncovered only if temporal and spatial aspects are included in the network analysis.
ASTD: The Alternative Splicing and Transcript Diversity database.
Koscielny, G., Le Texier, V., Gopalakrishnan, C., Kumanduri, V., Riethoven, J.J., Nardone, F., Stanley, E., Fallsehr, C., Hofmann, O., Kull, M., Harrington, E., Boue, S., Eyras, E., Plass, M., Lopez, F., Ritchie, W., Moucadel, V., Ara, T., Pospisil, H., Herrmann, A., G Reich, J., Guigo, R., Bork, P., Doeberitz, M.K., Vilo, J., Hide, W., Apweiler, R., Thanaraj, T.A. & Gautheret, D.
Genomics. 2009 Mar;93(3):213-20. Epub 2008 Dec 24.
The Alternative Splicing and Transcript Diversity database (ASTD) gives access to a vast collection of alternative transcripts that integrate transcription initiation, polyadenylation and splicing variant data. Alternative transcripts are derived from the mapping of transcribed sequences to the complete human, mouse and rat genomes using an extension of the computational pipeline developed for the ASD (Alternative Splicing Database) and ATD (Alternative Transcript Diversity) databases, which are now superseded by ASTD. For the human genome, ASTD identifies splicing variants, transcription initiation variants and polyadenylation variants in 68%, 68% and 62% of the gene set, respectively, consistent with current estimates for transcription variation. Users can access ASTD through a variety of browsing and query tools, including expression state-based queries for the identification of tissue-specific isoforms. Participating laboratories have experimentally validated a subset of ASTD-predicted alternative splice forms and alternative polyadenylation forms that were not previously reported. The ASTD database can be accessed at http://www.ebi.ac.uk/astd.
Sequence-based feature prediction and annotation of proteins.
Juncker, A.S., Jensen, L.J., Pierleoni, A., Bernsel, A., Tress, M.L., Bork, P., von Heijne, G., Valencia, A., Ouzounis, C.A., Casadio, R. & Brunak, S.
Genome Biol. 2009 Feb 2;10(2):206.
ABSTRACT: A recent trend in computational methods for annotation of protein function is that many prediction tools are combined in complex workflows and pipelines to facilitate the analysis of feature combinations, for example, the entire repertoire of kinase-binding motifs in the human proteome.
SMART 6: recent updates and new developments.
Letunic, I., Doerks, T. & Bork, P.
Nucleic Acids Res. 2009 Jan;37(Database issue):D229-32. Epub 2008 Oct 31.
Simple modular architecture research tool (SMART) is an online tool (http://smart.embl.de/) for the identification and annotation of protein domains. It provides a user-friendly platform for the exploration and comparative study of domain architectures in both proteins and genes. The current release of SMART contains manually curated models for 784 protein domains. Recent developments were focused on further data integration and improving user friendliness. The underlying protein database based on completely sequenced genomes was greatly expanded and now includes 630 species, compared to 191 in the previous release. As an initial step towards integrating information on biological pathways into SMART, our domain annotations were extended with data on metabolic pathways and links to several pathways resources. The interaction network view was completely redesigned and is now available for more than 2 million proteins. In addition to the standard web access to the database, users can now query SMART using distributed annotation system (DAS) or through a simple object access protocol (SOAP) based web service.
InterPro: the integrative protein signature database.
Hunter, S., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Bork, P., Das, U., Daugherty, L., Duquenne, L., Finn, R.D., Gough, J., Haft, D., Hulo, N., Kahn, D., Kelly, E., Laugraud, A., Letunic, I., Lonsdale, D., Lopez, R., Madera, M., Maslen, J., McAnulla, C., McDowall, J., Mistry, J., Mitchell, A., Mulder, N., Natale, D., Orengo, C., Quinn, A.F., Selengut, J.D., Sigrist, C.J., Thimma, M., Thomas, P.D., Valentin, F., Wilson, D., Wu, C.H. & Yeats, C.
Nucleic Acids Res. 2009 Jan;37(Database issue):D211-5. Epub 2008 Oct 21.
The InterPro database (http://www.ebi.ac.uk/interpro/) integrates together predictive models or 'signatures' representing protein domains, families and functional sites from multiple, diverse source databases: Gene3D, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY and TIGRFAMs. Integration is performed manually and approximately half of the total approximately 58,000 signatures available in the source databases belong to an InterPro entry. Recently, we have started to also display the remaining un-integrated signatures via our web interface. Other developments include the provision of non-signature data, such as structural data, in new XML files on our FTP site, as well as the inclusion of matchless UniProtKB proteins in the existing match XML files. The web interface has been extended and now links out to the ADAN predicted protein-protein interaction database and the SPICE and Dasty viewers. The latest public release (v18.0) covers 79.8% of UniProtKB (v14.1) and consists of 16 549 entries. InterPro data may be accessed either via the web address above, via web services, by downloading files by anonymous FTP or by using the InterProScan search software (http://www.ebi.ac.uk/Tools/InterProScan/).
Proteome organization in a genome-reduced bacterium.
Kühner, S., van Noort, V., Betts, M.J., Leo-Macias, A., Batisse, C., Rode, M., Yamada, T., Maier, T., Bader, S., Beltran-Alvarez, P., Castaño-Diez, D., Chen, W.H., Devos, D., Güell Cargol, M., Norambuena, T., Racke, I., Rybin, V., Schmidt, A., Yus, E., Aebersold, R., Herrmann, R., Böttcher, B., Frangakis, A.S., Russell, R.B., Serrano, L., Bork, P. & Gavin, A.C.
Science. 2009 Nov 27;326(5957):1235-40.
Molecular eco-systems biology: towards an understanding of community function.
Raes, J. & Bork, P.
Nat Rev Microbiol. 2008 Sep;6(9):693-9. Epub 2008 Jun 30.
Systems-biology approaches, which are driven by genome sequencing and high-throughput functional genomics data, are revolutionizing single-cell-organism biology. With the advent of various high-throughput techniques that aim to characterize complete microbial ecosystems (metagenomics, meta-transcriptomics and meta-metabolomics), we propose that the time is ripe to consider molecular systems biology at the ecosystem level (eco-systems biology). Here, we discuss the necessary data types that are required to unite molecular microbiology and ecology to develop an understanding of community function and discuss the potential shortcomings of these approaches.
Evolution of the phospho-tyrosine signaling machinery in premetazoan lineages.
Pincus, D., Letunic, I., Bork, P. & Lim, W.A.
Proc Natl Acad Sci U S A. 2008 Jul 15;105(28):9680-4. Epub 2008 Jul 3.
Multicellular animals use a three-part molecular toolkit to mediate phospho-tyrosine signaling: Tyrosine kinases (TyrK), protein tyrosine phosphatases (PTP), and Src Homology 2 (SH2) domains function, respectively, as "writers," "erasers," and "readers" of phospho-tyrosine modifications. How did this system of three components evolve, given their interdependent function? Here, we examine the usage of these components in 41 eukaryotic genomes, including the newly sequenced genome of the choanoflagellate, Monosiga brevicollis, the closest known unicellular relative to metazoans. This analysis indicates that SH2 and PTP domains likely evolved earliest: a handful of these domains are found in premetazoan eukaryotes lacking tyrosine kinases, most likely to deal with limited tyrosine phosphorylation cross-catalyzed by promiscuous Ser/Thr kinases. Modern TyrK proteins, however, are only observed in two lineages, metazoans and choanoflagellates. These two lineages show a dramatic coexpansion of all three domain families. Concurrent expansion of the three domain families is consistent with a stepwise evolutionary model in which preexisting SH2 and PTP domains were of limited utility until the appearance of the TyrK domain in the last common ancestor of metazoans and choanoflagellates. The emergence of the full three-component signaling system, with its dramatically increased encoding potential, may have contributed to the advent of metazoan multicellularity.
KEGG Atlas mapping for global analysis of metabolic pathways.
Okuda, S., Yamada, T., Hamajima, M., Itoh, M., Katayama, T., Bork, P., Goto, S. & Kanehisa, M.
Nucleic Acids Res. 2008 Jul 1;36(Web Server issue):W423-6. Epub 2008 May 13.
KEGG Atlas is a new graphical interface to the KEGG suite of databases, especially to the systems information in the PATHWAY and BRITE databases. It currently consists of a single global map and an associated viewer for metabolism, covering about 120 KEGG metabolic pathway maps and about 10 BRITE hierarchies. The viewer allows the user to navigate and zoom the global map under the Ajax technology. The mapping of high-throughput experimental data onto the global map is the main use of KEGG Atlas. In the global metabolism map, the node (circle) is a chemical compound and the edge (line) is a set of reactions linked to a set of KEGG Orthology (KO) entries for enzyme genes. Once gene identifiers in different organisms are converted to the K number identifiers in the KO system, corresponding line segments can be highlighted in the global map, allowing the user to view genome sequence data as organism-specific pathways, gene expression data as up- or down-regulated pathways, etc. Once chemical compounds are converted to the C number identifiers in KEGG, metabolomics data can also be displayed in the global map. KEGG Atlas is available at http://www.genome.jp/kegg/atlas/.
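The mapping step described in this abstract (organism-specific gene identifiers converted to K numbers, which then select line segments on the global map) can be sketched as follows. This is an illustrative sketch only and does not call any KEGG service: the gene names, the K numbers used as placeholders, the edge labels, the function `highlight_edges` and the fold-change threshold are all assumptions.

```python
from collections import defaultdict
from statistics import mean


def highlight_edges(expression, gene_to_ko, edge_kos, threshold=1.0):
    """Assign each map edge a state from expression data routed through K numbers.

    expression: gene identifier -> log2 fold change
    gene_to_ko: gene identifier -> K number (genes without a mapping are skipped)
    edge_kos:   edge label -> set of K numbers linked to that edge
    """
    per_ko = defaultdict(list)
    for gene, value in expression.items():
        ko = gene_to_ko.get(gene)
        if ko is not None:
            per_ko[ko].append(value)

    states = {}
    for edge, kos in edge_kos.items():
        values = [mean(per_ko[ko]) for ko in kos if ko in per_ko]
        if not values:
            states[edge] = "no data"
            continue
        level = mean(values)
        if level >= threshold:
            states[edge] = "up"
        elif level <= -threshold:
            states[edge] = "down"
        else:
            states[edge] = "unchanged"
    return states


# Placeholder identifiers, chosen only to show the data flow.
gene_to_ko = {"geneA": "K00001", "geneB": "K00002", "geneC": "K00002"}
edge_kos = {"edge1": {"K00001"}, "edge2": {"K00002"}, "edge3": {"K00003"}}
expression = {"geneA": 2.0, "geneB": -1.5, "geneC": -2.5, "geneD": 3.0}

states = highlight_edges(expression, gene_to_ko, edge_kos)
print(states)
```

Routing everything through K numbers is what lets data from different organisms land on the same global map; the same pattern would apply to metabolomics data keyed by C numbers on the nodes.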
The genome of the choanoflagellate Monosiga brevicollis and the origin of metazoans.
King, N., Westbrook, M.J., Young, S.L., Kuo, A., Abedin, M., Chapman, J., Fairclough, S., Hellsten, U., Isogai, Y., Letunic, I., Marr, M., Pincus, D., Putnam, N., Rokas, A., Wright, K.J., Zuzow, R., Dirks, W., Good, M., Goodstein, D., Lemons, D., Li, W., Lyons, J.B., Morris, A., Nichols, S., Richter, D.J., Salamov, A., Sequencing, J.G., Bork, P., Lim, W.A., Manning, G., Miller, W.T., McGinnis, W., Shapiro, H., Tjian, R., Grigoriev, I.V. & Rokhsar, D.
Nature. 2008 Feb 14;451(7180):783-8.
Choanoflagellates are the closest known relatives of metazoans. To discover potential molecular mechanisms underlying the evolution of metazoan multicellularity, we sequenced and analysed the genome of the unicellular choanoflagellate Monosiga brevicollis. The genome contains approximately 9,200 intron-rich genes, including a number that encode cell adhesion and signalling protein domains that are otherwise restricted to metazoans. Here we show that the physical linkages among protein domains often differ between M. brevicollis and metazoans, suggesting that abundant domain shuffling followed the separation of the choanoflagellate and metazoan lineages. The completion of the M. brevicollis genome allows us to reconstruct with increasing resolution the genomic changes that accompanied the origin of metazoans.
iPath: interactive exploration of biochemical pathways and networks.
Letunic, I., Yamada, T., Kanehisa, M. & Bork, P.
Trends Biochem Sci. 2008 Feb 12;.
iPath is an open-access online tool (http://pathways.embl.de) for visualizing and analyzing metabolic pathways. An interactive viewer provides straightforward navigation through various pathways and enables easy access to the underlying chemicals and enzymes. Customized pathway maps can be generated and annotated using various external data. For example, by merging human genome data with two important gut commensals, iPath can pinpoint the complementarity of the host-symbiont metabolic capacities.
Enhanced function annotations for Drosophila serine proteases: a case study for systematic annotation of multi-member gene families.
Shah, P.K., Tripathi, L.P., Jensen, L.J., Gahnim, M., Mason, C., Furlong, E.E., Rodrigues, V., White, K.P., Bork, P. & Sowdhamini, R.
Gene. 2008 Jan 15;407(1-2):199-215. Epub 2007 Oct 15.
Systematically annotating function of enzymes that belong to large protein families encoded in a single eukaryotic genome is a very challenging task. We carried out such an exercise to annotate function for serine-protease family of the trypsin fold in Drosophila melanogaster, with an emphasis on annotating serine-protease homologues (SPHs) that may have lost their catalytic function. Our approach involves data mining and data integration to provide function annotations for 190 Drosophila gene products containing serine-protease-like domains, of which 35 are SPHs. This was accomplished by analysis of structure-function relationships, gene-expression profiles, large-scale protein-protein interaction data, literature mining and bioinformatic tools. We introduce functional residue clustering (FRC), a method that performs hierarchical clustering of sequences using properties of functionally important residues and utilizes correlation co-efficient as a quantitative similarity measure to transfer in vivo substrate specificities to proteases. We show that the efficiency of transfer of substrate-specificity information using this method is generally high. FRC was also applied on Drosophila proteases to assign putative competitive inhibitor relationships (CIRs). Microarray gene-expression data were utilized to uncover a large-scale and dual involvement of proteases in development and in immune response. We found specific recruitment of SPHs and proteases with CLIP domains in immune response, suggesting evolution of a new function for SPHs. We also suggest existence of separate downstream protease cascades for immune response against bacterial/fungal infections and parasite/parasitoid infections. We verify quality of our annotations using information from RNAi screens and other evidence types. Utilization of such multi-fold approaches results in 10-fold increase of function annotation for Drosophila serine proteases and demonstrates value in increasing annotations in multiple genomes.
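The clustering idea named in this abstract (hierarchical clustering with a correlation coefficient as the similarity measure) can be illustrated with a minimal sketch. This is not the authors' FRC implementation: the numeric property vectors, the average-linkage rule, and the `min_similarity` cut-off are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cluster(profiles, min_similarity):
    """Average-linkage agglomerative clustering on correlation similarity.

    profiles: dict name -> property vector (a hypothetical numeric encoding
    of functionally important residues). Merging stops once no pair of
    clusters has a mean pairwise correlation >= min_similarity.
    """
    clusters = [[name] for name in sorted(profiles)]

    def linkage(a, b):
        # Mean correlation over all cross-cluster pairs.
        sims = [pearson(profiles[p], profiles[q]) for p in a for q in b]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        (i, j), best = max(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < min_similarity:
            break
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(clusters)
```

Under these assumptions, an annotation such as substrate specificity could then be transferred between sequences that end up in the same cluster.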
NetworKIN: a resource for exploring cellular phosphorylation networks.
Linding, R., Jensen, L.J., Pasculescu, A., Olhovsky, M., Colwill, K., Bork, P., Yaffe, M.B. & Pawson, T.
Nucleic Acids Res. 2008 Jan;36(Database issue):D695-9. Epub 2007 Nov 2.
Protein kinases control cellular responses by phosphorylating specific substrates. Recent proteome-wide mapping of protein phosphorylation sites by mass spectrometry has discovered thousands of in vivo sites. Systematically assigning all 518 human kinases to all these sites is a challenging problem. The NetworKIN database (http://networkin.info) integrates consensus substrate motifs with context modelling for improved prediction of cellular kinase-substrate relations. Based on the latest human phosphoproteome from the Phospho.ELM and PhosphoSite databases, the resource offers insight into phosphorylation-modulated interaction networks. Here, we describe how NetworKIN can be used for both global and targeted molecular studies. Via the web interface users can query the database of precomputed kinase-substrate relations or obtain predictions on novel phosphoproteins. The database currently contains a predicted phosphorylation network with 20,224 site-specific interactions involving 3978 phosphoproteins and 73 human kinases from 20 families.
STITCH: interaction networks of chemicals and proteins.
Kuhn, M., von Mering, C., Campillos, M., Jensen, L.J. & Bork, P.
Nucleic Acids Res. 2008 Jan;36(Database issue):D684-8. Epub 2007 Dec 15.
The knowledge about interactions between proteins and small molecules is essential for the understanding of molecular and cellular functions. However, information on such interactions is widely dispersed across numerous databases and the literature. To facilitate access to this data, STITCH ('search tool for interactions of chemicals') integrates information about interactions from metabolic pathways, crystal structures, binding experiments and drug-target relationships. Inferred information from phenotypic effects, text mining and chemical structure similarity is used to predict relations between chemicals. STITCH further allows exploring the network of chemical relations, also in the context of associated binding proteins. Each proposed interaction can be traced back to the original data sources. Our database contains interaction information for over 68,000 different chemicals, including 2200 drugs, and connects them to 1.5 million genes across 373 genomes and their interactions contained in the STRING database. STITCH is available at http://stitch.embl.de/.
eggNOG: automated construction and annotation of orthologous groups of genes.
Jensen, L.J., Julien, P., Kuhn, M., von Mering, C., Muller, J., Doerks, T. & Bork, P.
Nucleic Acids Res. 2008 Jan;36(Database issue):D250-4. Epub 2007 Oct 16.
The identification of orthologous genes forms the basis for most comparative genomics studies. Existing approaches either lack functional annotation of the identified orthologous groups, hampering the interpretation of subsequent results, or are manually annotated and thus lag behind the rapid sequencing of new genomes. Here we present the eggNOG database ('evolutionary genealogy of genes: Non-supervised Orthologous Groups'), which contains orthologous groups constructed from Smith-Waterman alignments through identification of reciprocal best matches and triangular linkage clustering. Applying this procedure to 312 bacterial, 26 archaeal and 35 eukaryotic genomes yielded 43 582 coarse-grained orthologous groups of which 9724 are extended versions of those from the original COG/KOG database. We also constructed more fine-grained groups for selected subsets of organisms, such as the 19 914 mammalian orthologous groups. We automatically annotated our non-supervised orthologous groups with functional descriptions, which were derived by identifying common denominators for the genes based on their individual textual descriptions, annotated functional categories, and predicted protein domains. The orthologous groups in eggNOG contain 1 241 751 genes and provide at least a broad functional description for 77% of them. Users can query the resource for individual genes via a web interface or download the complete set of orthologous groups at http://eggnog.embl.de.
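The reciprocal-best-match step mentioned in this abstract can be sketched as follows. This is only an illustration of the general technique, not the eggNOG pipeline: the score dictionaries and function names are hypothetical, and the subsequent triangular linkage clustering is not shown.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring hit in the other genome.

    scores: dict (query, subject) -> alignment score, for example a
    Smith-Waterman score (assumed input format).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_matches(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)
```

A gene whose best hit points elsewhere (for example a paralogue that scores lower in the reverse direction) is dropped, which is what makes the match reciprocal.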
SuperTarget and Matador: resources for exploring drug-target relationships.
Gunther, S., Kuhn, M., Dunkel, M., Campillos, M., Senger, C., Petsalaki, E., Ahmed, J., Urdiales, E.G., Gewiess, A., Jensen, L.J., Schneider, R., Skoblo, R., Russell, R.B., Bourne, P.E., Bork, P. & Preissner, R.
Nucleic Acids Res. 2008 Jan;36(Database issue):D919-22. Epub 2007 Oct 16.
The molecular basis of drug action is often not well understood. This is partly because the very abundant and diverse information generated in the past decades on drugs is hidden in millions of medical articles or textbooks. Therefore, we developed a one-stop data warehouse, SuperTarget that integrates drug-related information about medical indication areas, adverse drug effects, drug metabolization, pathways and Gene Ontology terms of the target proteins. An easy-to-use query interface enables the user to pose complex queries, for example to find drugs that target a certain pathway, interacting drugs that are metabolized by the same cytochrome P450 or drugs that target the same protein but are metabolized by different enzymes. Furthermore, we provide tools for 2D drug screening and sequence comparison of the targets. The database contains more than 2500 target proteins, which are annotated with about 7300 relations to 1500 drugs; the vast majority of entries have pointers to the respective literature source. A subset of these drugs has been annotated with additional binding information and indirect interactions and is available as a separate resource called Matador. SuperTarget and Matador are available at http://insilico.charite.de/supertarget and http://matador.embl.de.
Selective maintenance of Drosophila tandemly arranged duplicated genes during evolution.
Quijano, C., Tomancak, P., Lopez-Marti, J., Suyama, M., Bork, P., Milan, M., Torrents, D. & Manzanares, M.
Genome Biol. 2008;9(12):R176. Epub 2008 Dec 16.
BACKGROUND: The physical organization and chromosomal localization of genes within genomes is known to play an important role in their function. Most genes arise by duplication and move along the genome by random shuffling of DNA segments. Higher order structuring of the genome occurs in eukaryotes, where groups of physically linked genes are co-expressed. However, the contribution of gene duplication to gene order has not been analyzed in detail, as it is believed that co-expression due to recent duplicates would obscure other domains of co-expression. RESULTS: We have catalogued ordered duplicated genes in Drosophila melanogaster, and found that one in five of all genes is organized as tandem arrays. Furthermore, among arrays that have been spatially conserved over longer periods than would be expected on the basis of random shuffling, a disproportionate number contain genes encoding developmental regulators. Using in situ gene expression data for more than half of the Drosophila genome, we find that genes in these conserved clusters are co-expressed to a much higher extent than other duplicated genes. CONCLUSIONS: These results reveal the existence of functional constraints in insects that retain copies of genes encoding developmental and regulatory proteins as neighbors, allowing their co-expression. This co-expression may be the result of shared cis-regulatory elements or a shared need for a specific chromatin structure. Our results highlight the association between genome architecture and the gene regulatory networks involved in the construction of the body plan.
Large gene overlaps in prokaryotic genomes: result of functional constraints or mispredictions?
Palleja, A., Harrington, E.D. & Bork, P.
BMC Genomics. 2008 Jul 15;9:335.
BACKGROUND: Across the fully sequenced microbial genomes there are thousands of examples of overlapping genes. Many of these are only a few nucleotides long and are thought to function by permitting the coordinated regulation of gene expression. However, there should also be selective pressure against long overlaps, as the existence of overlapping reading frames increases the risk of deleterious mutations. Here we examine the longest overlaps and assess whether they are the product of special functional constraints or of erroneous annotation. RESULTS: We analysed the genes that overlap by 60 bps or more among 338 fully-sequenced prokaryotic genomes. The likely functional significance of an overlap was determined by comparing each of the genes to its respective orthologs. If a gene showed a significantly different length from its orthologs it was considered unlikely to be functional and therefore the result of an error either in sequencing or gene prediction. Focusing on 715 co-directional overlaps longer than 60 bps, we classified the erroneous ones into five categories: i) 5'-end extension of the downstream gene due to either a mispredicted start codon or a frameshift at 5'-end of the gene (409 overlaps), ii) fragmentation of a gene caused by a frameshift (163), iii) 3'-end extension of the upstream gene due to either a frameshift at 3'-end of a gene or point mutation at the stop codon (68), iv) Redundant gene predictions (4), v) 5' & 3'-end extension which is a combination of i) and iii) (71). We also studied 75 divergent overlaps that could be classified as misannotations of group i). Nevertheless we found some convergent long overlaps (54) that might be true overlaps, although an important part of convergent overlaps could be classified as group iii) (124). 
CONCLUSION: Among the 968 overlaps larger than 60 bps which we analysed, we did not find a single real one among the co-directional and divergent orientations and concluded that there had been an excessive number of misannotations. Only convergent orientation seems to permit some long overlaps, although convergent overlaps are also hampered by misannotations. We propose a simple rule to flag these erroneous gene length predictions to facilitate automatic annotation.
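The screening step described above (finding gene pairs that overlap by 60 bp or more and sorting them into co-directional, convergent and divergent orientations) can be sketched in a few lines. The `Gene` record, its coordinate convention and the function names are assumptions for illustration; the authors' ortholog-length comparison used to flag misannotations is not reproduced here.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # 1-based, inclusive; start < end regardless of strand (assumed)
    end: int
    strand: str  # "+" or "-"


def overlap_length(a, b):
    """Number of shared positions between two genes (0 if they do not overlap)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def orientation(a, b):
    """Classify a gene pair by the strands of the left and right gene."""
    left, right = sorted((a, b), key=lambda g: g.start)
    if left.strand == right.strand:
        return "co-directional"
    # "+" then "-": 3' ends face each other; "-" then "+": 5' ends face each other.
    return "convergent" if left.strand == "+" else "divergent"


def long_overlaps(genes, min_bp=60):
    """List (name, name, overlap, orientation) for pairs overlapping >= min_bp."""
    ordered = sorted(genes, key=lambda g: g.start)
    found = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start > a.end:
                break  # later genes start even further right
            n = overlap_length(a, b)
            if n >= min_bp:
                found.append((a.name, b.name, n, orientation(a, b)))
    return found
```

Each reported pair would then be a candidate for the kind of length check against orthologs that the abstract describes.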
Circular reasoning rather than cyclic expression.
Jensen, L.J., de Lichtenberg, U., Jensen, T.S., Brunak, S. & Bork, P.
Genome Biol. 2008;9(6):403. Epub 2008 Jun 23.
A response to "Combined analysis reveals a core set of cycling genes" by Y Lu, S Mahony, PV Benos, R Rosenfeld, I Simon, LL Breeden and Z Bar-Joseph. Genome Biol 2007, 8:R146.
A nitrile hydratase in the eukaryote Monosiga brevicollis.
Foerstner, K.U., Doerks, T., Muller, J., Raes, J. & Bork, P.
PLoS One. 2008;3(12):e3976. Epub 2008 Dec 19.
Bacterial nitrile hydratases (NHases) are important industrial catalysts and waste water remediation tools. In a global computational screening of conventional and metagenomic sequence data for NHases, we detected the two usually separated NHase subunits fused in one protein of the choanoflagellate Monosiga brevicollis, a recently sequenced unicellular model organism from the closest sister group of Metazoa. This is the first time that an NHase is found in eukaryotes and the first time it is observed as a fusion protein. The presence of an intron, subunit fusion and expressed sequence tags covering parts of the gene exclude contamination and suggest a functional gene. Phylogenetic analyses and genomic context imply a probable ancient horizontal gene transfer (HGT) from proteobacteria. The newly discovered NHase might open biotechnological routes due to its unconventional structure, its new type of host and its apparent integration into eukaryotic protein networks.
Genome-wide experimental determination of barriers to horizontal gene transfer.
Sorek, R., Zhu, Y., Creevey, C.J., Francino, M.P., Bork, P. & Rubin, E.M.
Science. 2007 Nov 30;318(5855):1449-52. Epub 2007 Oct 18.
Horizontal gene transfer, in which genetic material is transferred from the genome of one organism to that of another, has been investigated in microbial species mainly through computational sequence analyses. To address the lack of experimental data, we studied the attempted movement of 246,045 genes from 79 prokaryotic genomes into Escherichia coli and identified genes that consistently fail to transfer. We studied the mechanisms underlying transfer inhibition by placing coding regions from different species under the control of inducible promoters. Our data suggest that toxicity to the host inhibited transfer regardless of the species of origin and that increased gene dosage and associated increased expression may be a predominant cause for transfer failure. Although these experimental studies examined transfer solely into E. coli, a computational analysis of gene-transfer rates across available bacterial and archaeal genomes supports that the barriers observed in our study are general across the tree of life.
4DXpress: a database for cross-species expression pattern comparisons.
Haudry, Y., Berube, H., Letunic, I., Weeber, P.D., Gagneur, J., Girardot, C., Kapushesky, M., Arendt, D., Bork, P., Brazma, A., Furlong, E., Wittbrodt, J. & Henrich, T.
Nucleic Acids Res. 2007 Oct 4.
In the major animal model species like mouse, fish or fly, detailed spatial information on gene expression over time can be acquired through whole mount in situ hybridization experiments. In these species, expression patterns of many genes have been studied and data has been integrated into dedicated model organism databases like ZFIN for zebrafish, MEPD for medaka, BDGP for Drosophila or GXD for mouse. However, a central repository that allows users to query and compare gene expression patterns across different species has not yet been established. Therefore, we have integrated expression patterns for zebrafish, Drosophila, medaka and mouse into a central public repository called 4DXpress (expression database in four dimensions). Users can query anatomy ontology-based expression annotations across species and quickly jump from one gene to the orthologues in other species. Genes are linked to public microarray data in ArrayExpress. We have mapped developmental stages between the species to be able to compare developmental time phases. We store the largest collection of gene expression patterns available to date in an individual resource, reflecting 16 505 annotated genes. 4DXpress will be an invaluable tool for developmental as well as for computational biologists interested in gene regulation and evolution. 4DXpress is available at http://ani.embl.de/4DXpress.
Target-specific requirements for enhancers of decapping in miRNA-mediated gene silencing.
Eulalio, A., Rehwinkel, J., Stricker, M., Huntzinger, E., Yang, S.F., Doerks, T., Dorner, S., Bork, P., Boutros, M. & Izaurralde, E.
Genes Dev. 2007 Oct 15;21(20):2558-70. Epub 2007 Sep 27.
microRNAs (miRNAs) silence gene expression by suppressing protein production and/or by promoting mRNA decay. To elucidate how silencing is accomplished, we screened an RNA interference library for suppressors of miRNA-mediated regulation in Drosophila melanogaster cells. In addition to proteins known to be required for miRNA biogenesis and function (i.e., Drosha, Pasha, Dicer-1, AGO1, and GW182), the screen identified the decapping activator Ge-1 as being required for silencing by miRNAs. Depleting Ge-1 alone and/or in combination with other decapping activators (e.g., DCP1, EDC3, HPat, or Me31B) suppresses silencing of several miRNA targets, indicating that miRNAs elicit mRNA decapping. A comparison of gene expression profiles in cells depleted of AGO1 or of individual decapping activators shows that approximately 15% of AGO1-targets are also regulated by Ge-1, DCP1, and HPat, whereas 5% are dependent on EDC3 and LSm1-7. These percentages are underestimated because decapping activators are partially redundant. Furthermore, in the absence of active translation, some miRNA targets are stabilized, whereas others continue to be degraded in a miRNA-dependent manner. These findings suggest that miRNAs mediate post-transcriptional gene silencing by more than one mechanism.
Get the most out of your metagenome: computational analysis of environmental sequence data.
Raes, J., Foerstner, K.U. & Bork, P.
Curr Opin Microbiol. 2007 Oct;10(5):490-8. Epub 2007 Oct 23.
New advances in sequencing technologies bring random shotgun sequencing of ecosystems within reach of smaller labs, but the complexity of metagenomics data can be overwhelming. Recently, many novel computational tools have been developed to unravel ecosystem properties starting from fragmented sequences. In addition, the so-called 'comparative metagenomics' approaches have allowed the discovery of specific genomic and community adaptations to environmental factors. However, many of the parameters extracted from these data to describe the environment at hand (e.g. genomic features, functional complement, phylogenetic composition) are interdependent and influenced by technical aspects of sample preparation and data treatment, leading to various pitfalls during analysis. To avoid this and complement existing initiatives in data standards, we propose a minimal standard for metagenomics data analysis ('MINIMESS') to be able to take full advantage of the power of comparative metagenomics in understanding microbial life on earth.
Quantitative assessment of protein function prediction from metagenomics shotgun sequences.
Harrington, E.D., Singh, A.H., Doerks, T., Letunic, I., von Mering, C., Jensen, L.J., Raes, J. & Bork, P.
Proc Natl Acad Sci U S A. 2007 Aug 28;104(35):13913-8. Epub 2007 Aug 23.
To assess the potential of protein function prediction in environmental genomics data, we analyzed shotgun sequences from four diverse and complex habitats. Using homology searches as well as customized gene neighborhood methods that incorporate intergenic and evolutionary distances, we inferred specific functions for 76% of the 1.4 million predicted ORFs in these samples (83% when nonspecific functions are considered). Surprisingly, these fractions are only slightly smaller than the corresponding ones in completely sequenced genomes (83% and 86%, respectively, by using the same methodology) and considerably higher than previously thought. For as many as 75,448 ORFs (5% of the total), only neighborhood methods can assign functions, illustrated here by a previously undescribed gene associated with the well characterized heme biosynthesis operon and a potential transcription factor that might regulate a coupling between fatty acid biosynthesis and degradation. Our results further suggest that, although functions can be inferred for most proteins on earth, many functions remain to be discovered in numerous small, rare protein families.
Evolution of cell cycle control: same molecular machines, different regulation.
de Lichtenberg, U., Jensen, T.S., Brunak, S., Bork, P. & Jensen, L.J.
Cell Cycle. 2007 Aug 1;6(15):1819-25. Epub 2007 Jun 4.
Decades of research, together with the availability of whole genomes, have made it clear that many of the core components involved in the cell cycle are conserved across eukaryotes, both functionally and structurally. These proteins are organized in complexes and modules that are activated or deactivated at specific stages during the cell cycle through a wide variety of mechanisms including transcriptional regulation, phosphorylation, subcellular translocation and targeted degradation. In a series of integrative analyses of different genome-scale data sets, we have studied how these different layers of regulation together control the activity of cell cycle complexes and how this regulation has evolved. The results show surprisingly poor conservation of both the transcriptional and the post-translational regulation of individual genes and proteins; however, the changes in one layer of regulation are often mirrored by changes in other layers, implying that independent layers of control coevolve. By taking a bird's eye view of the cell cycle, we demonstrate how the modular organization of cellular systems possesses a built-in flexibility, which allows evolution to find many different solutions for assembling the same molecular machines just in time for action.
Use of pathway analysis and genome context methods for functional genomics of Mycoplasma pneumoniae nucleotide metabolism.
Pachkov, M., Dandekar, T., Korbel, J., Bork, P. & Schuster, S.
Gene. 2007 Jul 15;396(2):215-25. Epub 2007 Mar 24.
Elementary modes analysis allows one to reveal whether a set of known enzymes is sufficient to sustain functionality of the cell. Moreover, it is helpful in detecting missing reactions and predicting which enzymes could fill these gaps. Here, we perform a comprehensive elementary modes analysis and a genomic context analysis of Mycoplasma pneumoniae nucleotide metabolism, and search for new enzyme activities. The purine and pyrimidine networks are reconstructed by assembling enzymes annotated in the genome or found experimentally. We show that these reaction sets are sufficient for enabling synthesis of DNA and RNA in M. pneumoniae. Special focus is on the key modes for growth. Moreover, we make an educated guess on the nutritional requirements of this micro-organism. For the case that M. pneumoniae does not require adenine as a substrate, we suggest adenylosuccinate synthetase (EC 6.3.4.4), adenylosuccinate lyase (EC 4.3.2.2) and GMP reductase (EC 1.7.1.7) to be operative. GMP reductase activity is putatively assigned to the NRDI_MYCPN gene on the basis of the genomic context analysis. For the pyrimidine network, we suggest CTP synthase (EC 6.3.4.2) to be active. Further experiments on the nutritional requirements are needed to make a decision. Pyrimidine metabolism appears to be more appropriate as a drug target than purine metabolism since it shows lower plasticity.
Systematic discovery of in vivo phosphorylation networks.
Linding, R., Jensen, L.J., Ostheimer, G.J., van Vugt, M.A., Jorgensen, C., Miron, I.M., Diella, F., Colwill, K., Taylor, L., Elder, K., Metalnikov, P., Nguyen, V., Pasculescu, A., Jin, J., Park, J.G., Samson, L.D., Woodgett, J.R., Russell, R.B., Bork, P., Yaffe, M.B. & Pawson, T.
Cell. 2007 Jun 29;129(7):1415-26. Epub 2007 Jun 14.
Protein kinases control cellular decision processes by phosphorylating specific substrates. Thousands of in vivo phosphorylation sites have been identified, mostly by proteome-wide mapping. However, systematically matching these sites to specific kinases is presently infeasible, due to limited specificity of consensus motifs, and the influence of contextual factors, such as protein scaffolds, localization, and expression, on cellular substrate specificity. We have developed an approach (NetworKIN) that augments motif-based predictions with the network context of kinases and phosphoproteins. The latter provides 60%-80% of the computational capability to assign in vivo substrate specificity. NetworKIN pinpoints kinases responsible for specific phosphorylations and yields a 2.5-fold improvement in the accuracy with which phosphorylation networks can be constructed. Applying this approach to DNA damage signaling, we show that 53BP1 and Rad50 are phosphorylated by CDK1 and ATM, respectively. We describe a scalable strategy to evaluate predictions, which suggests that BCLAF1 is a GSK-3 substrate.
Protein function space: viewing the limits or limited by our view?
Raes, J., Harrington, E.D., Singh, A.H. & Bork, P.
Curr Opin Struct Biol. 2007 Jun;17(3):362-9. Epub 2007 Jun 15.
Given that the number of protein functions on earth is finite, the rapid expansion of biological knowledge and the concomitant exponential increase in the number of protein sequences should, at some point, enable the estimation of the limits of protein function space. The functional coverage of protein sequences can be investigated using computational methods, especially given the massive amount of data being generated by large-scale environmental sequencing (metagenomics). In completely sequenced genomes, the fraction of proteins to which at least some functional features can be assigned has recently risen to as much as approximately 85%. Although this fraction is more uncertain in metagenomics surveys, because of environmental complexities and differences in analysis protocols, our global knowledge of protein functions still appears to be considerable. However, when we consider protein families, continued sequencing seems to yield an ever-increasing number of novel families. Until we reconcile these two views, the limits of protein space will remain obscured.
Sequence-based factors influencing the expression of heterologous genes in the yeast Pichia pastoris--A comparative view on 79 human genes.
Boettner, M., Steffens, C., von Mering, C., Bork, P., Stahl, U. & Lang, C.
J Biotechnol. 2007 May 31;130(1):1-10. Epub 2007 Feb 28.
High yield expression of heterologous proteins is usually a matter of "trial and error". In the search of parameters with a major impact on expression, we have applied a comparative analysis to 79 different human cDNAs expressed in Pichia pastoris. The cDNAs were cloned in an expression vector for intracellular expression and recombinant protein expression was monitored in a standardized procedure and classified with respect to the expression level. Of all sequence-based parameters with a possible influence on the expression level, more than 10 were analysed. Three of those factors proved to have a statistically significant association with the expression level. Low abundance of AT-rich regions in the cDNA associates with a high expression level. A comparatively high isoelectric point of the recombinant protein associates with failure of expression and, finally, the occurrence of a protein homologue in yeast is associated with detectable protein expression. Interestingly, some often discussed factors like codon usage or GC content did not show a significant impact on protein yield. These results could provide a basis for a knowledge-oriented optimisation of gene sequences both to increase protein yields and to help target selection and the design of high-throughput expression approaches.
Splicing factors stimulate polyadenylation via USEs at non-canonical 3' end formation signals.
Danckwardt, S., Kaufmann, I., Gentzel, M., Foerstner, K.U., Gantzert, A.S., Gehring, N.H., Neu-Yilik, G., Bork, P., Keller, W., Wilm, M., Hentze, M.W. & Kulozik, A.E.
EMBO J. 2007 Apr 26.
The prothrombin (F2) 3' end formation signal is highly susceptible to thrombophilia-associated gain-of-function mutations. In its unusual architecture, the F2 3' UTR contains an upstream sequence element (USE) that compensates for weak activities of the non-canonical cleavage site and the downstream U-rich element. Here, we address the mechanism of USE function. We show that the F2 USE contains a highly conserved nonameric core sequence, which promotes 3' end formation in a position- and sequence-dependent manner. We identify proteins that specifically interact with the USE, and demonstrate their function as trans-acting factors that promote 3' end formation. Interestingly, these include the splicing factors U2AF35, U2AF65 and hnRNPI. We show that these splicing factors not only modulate 3' end formation via the USEs contained in the F2 and the complement C2 mRNAs, but also in the biocomputationally identified BCL2L2, IVNS and ACTR mRNAs, suggesting a broader functional role. These data uncover a novel mechanism that functionally links the splicing and 3' end formation machineries of multiple cellular mRNAs in an USE-dependent manner.
Quantitative phylogenetic assessment of microbial communities in diverse environments.
von Mering, C., Hugenholtz, P., Raes, J., Tringe, S.G., Doerks, T., Jensen, L.J., Ward, N. & Bork, P.
Science. 2007 Feb 23;315(5815):1126-30. Epub 2007 Feb 1.
The taxonomic composition of environmental communities is an important indicator of their ecology and function. We used a set of protein-coding marker genes, extracted from large-scale environmental shotgun sequencing data, to provide a more direct, quantitative, and accurate picture of community composition than that provided by traditional ribosomal RNA-based approaches depending on the polymerase chain reaction. Mapping marker genes from four diverse environmental data sets onto a reference species phylogeny shows that certain communities evolve faster than others. The method also enables determination of preferred habitats for entire microbial clades and provides evidence that such habitat preferences are often remarkably stable over time.
Quantification of insect genome divergence.
Zdobnov, E.M. & Bork, P.
Trends Genet. 2007 Jan;23(1):16-20. Epub 2006 Nov 9.
The recent sequencing of twelve insect genomes has enabled us to quantify their divergence using synteny conservation and sequence identity of single-copy orthologs. Protein identity correlates well with synteny and is about three times more conserved, an observation consistent with comparisons among vertebrates. The observed distribution of the lengths of synteny blocks follows a power law and differs from the expectations of the currently accepted random breakage model. Our results show that there is only limited selection for conservation of gene order and reveal a few hundred genes, proximity among which seems to be vital.
STRING 7--recent developments in the integration and prediction of protein interactions.
von Mering, C., Jensen, L.J., Kuhn, M., Chaffron, S., Doerks, T., Kruger, B., Snel, B. & Bork, P.
Nucleic Acids Res. 2007 Jan;35(Database issue):D358-62. Epub 2006 Nov 10.
Information on protein-protein interactions is still mostly limited to a small number of model organisms, and originates from a wide variety of experimental and computational techniques. The database and online resource STRING generalizes access to protein interaction data, by integrating known and predicted interactions from a variety of sources. The underlying infrastructure includes a consistent body of completely sequenced genomes and exhaustive orthology classifications, based on which interaction evidence is transferred between organisms. Although primarily developed for protein interaction analysis, the resource has also been successfully applied to comparative genomics, phylogenetics and network studies, which are all facilitated by programmatic access to the database backend and the availability of compact download files. As of release 7, STRING has almost doubled to 373 distinct organisms, and contains more than 1.5 million proteins for which associations have been pre-computed. Novel features include AJAX-based web-navigation, inclusion of additional resources such as BioGRID, and detailed protein domain annotation. STRING is available at http://string.embl.de/
New developments in the InterPro database.
Mulder, N.J., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Bork, P., Buillard, V., Cerutti, L., Copley, R., Courcelle, E., Das, U., Daugherty, L., Dibley, M., Finn, R., Fleischmann, W., Gough, J., Haft, D., Hulo, N., Hunter, S., Kahn, D., Kanapin, A., Kejariwal, A., Labarga, A., Langendijk-Genevaux, P.S., Lonsdale, D., Lopez, R., Letunic, I., Madera, M., Maslen, J., McAnulla, C., McDowall, J., Mistry, J., Mitchell, A., Nikolskaya, A.N., Orchard, S., Orengo, C., Petryszak, R., Selengut, J.D., Sigrist, C.J., Thomas, P.D., Valentin, F., Wilson, D., Wu, C.H. & Yeats, C.
Nucleic Acids Res. 2007 Jan;35(Database issue):D224-8.
InterPro is an integrated resource for protein families, domains and functional sites, which integrates the following protein signature databases: PROSITE, PRINTS, ProDom, Pfam, SMART, TIGRFAMs, PIRSF, SUPERFAMILY, Gene3D and PANTHER. The latter two new member databases have been integrated since the last publication in this journal. There have been several new developments in InterPro, including an additional reading field, new database links, extensions to the web interface and additional match XML files. InterPro has always provided matches to UniProtKB proteins on the website and in the match XML file on the FTP site. Additional matches to proteins in UniParc (UniProt archive) are now available for download in the new match XML files only. The latest InterPro release (13.0) contains more than 13 000 entries, covering over 78% of all proteins in UniProtKB. The database is available for text- and sequence-based searches via a webserver (http://www.ebi.ac.uk/interpro), and for download by anonymous FTP (ftp://ftp.ebi.ac.uk/pub/databases/interpro). The InterProScan search tool is now also available via a web service at http://www.ebi.ac.uk/Tools/webservices/WSInterProScan.html.
Prediction of effective genome size in metagenomic samples.
Raes, J., Korbel, J.O., Lercher, M.J., von Mering, C. & Bork, P.
Genome Biol. 2007;8(1):R10.
We introduce a novel computational approach to predict effective genome size (EGS; a measure that includes multiple plasmid copies, inserted sequences, and associated phages and viruses) from short sequencing reads of environmental genomics (or metagenomics) projects. We observe considerable EGS differences between environments and link this with ecologic complexity as well as species composition (for instance, the presence of eukaryotes). For example, we estimate EGS in a complex, organism-dense farm soil sample at about 6.3 megabases (Mb) whereas that of the bacteria therein is only 4.7 Mb; for bacteria in a nutrient-poor, organism-sparse ocean surface water sample, EGS is as low as 1.6 Mb. The method also permits evaluation of completion status and assembly bias in single-genome sequencing projects.
Identification of tightly regulated groups of genes during Drosophila melanogaster embryogenesis.
Hooper, S.D., Boue, S., Krause, R., Jensen, L.J., Mason, C.E., Ghanim, M., White, K.P., Furlong, E.E. & Bork, P.
Mol Syst Biol. 2007;3:72. Epub 2007 Jan 16.
Time-series analysis of whole-genome expression data during Drosophila melanogaster development indicates that up to 86% of its genes change their relative transcript level during embryogenesis. By applying conservative filtering criteria and requiring 'sharp' transcript changes, we identified 1534 maternal genes, 792 transient zygotic genes, and 1053 genes whose transcript levels increase during embryogenesis. Each of these three categories is dominated by groups of genes where all transcript levels increase and/or decrease at similar times, suggesting a common mode of regulation. For example, 34% of the transiently expressed genes fall into three groups, with increased transcript levels between 2.5-12, 11-20, and 15-20 h of development, respectively. We highlight common and distinctive functional features of these expression groups and identify a coupling between downregulation of transcript levels and targeted protein degradation. By mapping the groups to the protein network, we also predict and experimentally confirm new functional associations.
Insights into social insects from the genome of the honeybee Apis mellifera.
The Honeybee Genome Sequencing Consortium (Bork, P.)
Nature. 2006 Nov 23;444(7118):512.
Assessing systems properties of yeast mitochondria through an interaction map of the organelle.
Perocchi, F., Jensen, L.J., Gagneur, J., Ahting, U., von Mering, C., Bork, P., Prokisch, H. & Steinmetz, L.M.
PLoS Genet. 2006 Oct 20;2(10):e170.
Mitochondria carry out specialized functions; compartmentalized, yet integrated into the metabolic and signaling processes of the cell. Although many mitochondrial proteins have been identified, understanding their functional interrelationships has been a challenge. Here we construct a comprehensive network of the mitochondrial system. We integrated genome-wide datasets to generate an accurate and inclusive mitochondrial parts list. Together with benchmarked measures of protein interactions, a network of mitochondria was constructed in their cellular context, including extra-mitochondrial proteins. This network also integrates data from different organisms to expand the known mitochondrial biology beyond the information in the existing databases. Our network brings together annotated and predicted functions into a single framework. This enabled, for the entire system, a survey of mutant phenotypes, gene regulation, evolution, and disease susceptibility. Furthermore, we experimentally validated the localization of several candidate proteins and derived novel functional contexts for hundreds of uncharacterized proteins. Our network thus advances the understanding of the mitochondrial system in yeast and identifies properties of genes underlying human mitochondrial disorders.
Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation.
Letunic, I. & Bork, P.
Bioinformatics. 2006 Oct 18;.
SUMMARY: Interactive Tree Of Life (iTOL) is a web based tool for the display, manipulation and annotation of phylogenetic trees. Trees can be interactively pruned and re-rooted. Various types of data such as genome sizes or protein domain repertoires can be mapped onto the tree. Export to several bitmap and vector graphics formats is supported. AVAILABILITY: iTOL is available through WWW at http://itol.embl.de.
Opsins and clusters of sensory G-protein-coupled receptors in the sea urchin genome.
Raible, F., Tessmar-Raible, K., Arboleda, E., Kaller, T., Bork, P., Arendt, D. & Arnone, M.I.
Dev Biol. 2006 Sep 5;.
Rhodopsin-type G-protein-coupled receptors (GPCRs) contribute the majority of sensory receptors in vertebrates. With 979 members, they form the largest GPCR family in the sequenced sea urchin genome, constituting more than 3% of all predicted genes. The sea urchin genome encodes at least six Opsin proteins. Of these, one rhabdomeric, one ciliary and two G(o)-type Opsins can be assigned to ancient bilaterian Opsin subfamilies. Moreover, we identified four greatly expanded subfamilies of rhodopsin-type GPCRs that we call sea urchin specific rapidly expanded lineages of GPCRs (surreal-GPCRs). Our analysis of two of these groups revealed genomic clustering and single-exon gene structures similar to the most expanded group of vertebrate rhodopsin-type GPCRs, the olfactory receptors. We hypothesize that these genes arose by rapid duplication in the echinoid lineage and act as chemosensory receptors of the animal. In support of this, group B surreal-GPCRs are most prominently expressed in distinct classes of pedicellariae and tube feet of the adult sea urchin, structures that have previously been shown to react to chemical stimuli and to harbor sensory neurons in echinoderms. Notably, these structures also express different opsins, indicating that sea urchins possess an intricate molecular set-up to sense their environment.
Computational characterization of multiple Gag-like human proteins.
Campillos, M., Doerks, T., Shah, P.K. & Bork, P.
Trends Genet. 2006 Sep 15;.
In a genome-wide analysis, we have identified 85 human genes encoding 103 protein isoforms that resemble retroviral Gag proteins. These genes were domesticated from retrotransposons in at least five independent events during vertebrate evolution and were subsequently duplicated further in mammals. Structural insights into the mammalian proteins can be inferred by homology to Gag from viruses such as HIV; in turn, the cellular roles of the mammalian Gag homologs, such as apoptosis-related functions and binding to ubiquitin ligases, might hint at further functionality of viral Gag itself.
PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments.
Suyama, M., Torrents, D. & Bork, P.
Nucleic Acids Res. 2006 Jul 1;34(Web Server issue):W609-12.
PAL2NAL is a web server that constructs a multiple codon alignment from the corresponding aligned protein sequences. Such codon alignments can be used to evaluate the type and rate of nucleotide substitutions in coding DNA for a wide range of evolutionary analyses, such as the identification of levels of selective constraint acting on genes, or to perform DNA-based phylogenetic studies. The server takes a protein sequence alignment and the corresponding DNA sequences as input. In contrast to other existing applications, this server is able to construct codon alignments even if the input DNA sequence has mismatches with the input protein sequence, or contains untranslated regions and polyA tails. The server can also deal with frame shifts and inframe stop codons in the input models, and is thus suitable for the analysis of pseudogenes. Another distinct feature is that the user can specify a subregion of the input alignment in order to specifically analyze functional domains or exons of interest. The PAL2NAL server is available at http://www.bork.embl.de/pal2nal.
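The core idea of threading coding DNA onto a protein alignment can be sketched in a few lines of Python. This is only a minimal illustration of the simplest case, not the PAL2NAL implementation: it assumes each DNA sequence is an exact, in-frame coding sequence for its protein, and it does not handle the mismatches, untranslated regions, frame shifts or in-frame stop codons that the server is described as tolerating. The function names `thread_codons` and `codon_alignment` are hypothetical.

```python
def thread_codons(aligned_protein: str, cds: str) -> str:
    """Map one aligned protein row (gaps written as '-') onto its coding DNA.

    Assumption: cds starts at the first codon and holds one codon per residue.
    """
    residues = aligned_protein.replace("-", "")
    if len(cds) < 3 * len(residues):
        raise ValueError("coding sequence is too short for the protein")
    pieces = []
    pos = 0  # index of the next unused nucleotide in cds
    for symbol in aligned_protein:
        if symbol == "-":
            pieces.append("---")  # a protein gap becomes a three-nucleotide gap
        else:
            pieces.append(cds[pos:pos + 3])  # copy the codon for this residue
            pos += 3
    return "".join(pieces)


def codon_alignment(protein_alignment: dict, cds_by_name: dict) -> dict:
    """Build a codon alignment for every named row of a protein alignment."""
    return {
        name: thread_codons(row, cds_by_name[name])
        for name, row in protein_alignment.items()
    }


if __name__ == "__main__":
    proteins = {"seqA": "MK-L", "seqB": "MKAL"}
    dna = {"seqA": "ATGAAACTG", "seqB": "ATGAAAGCTCTG"}
    for name, row in codon_alignment(proteins, dna).items():
        print(name, row)
```

Because every alignment column expands to exactly three nucleotides, all rows of the resulting codon alignment keep the same length, which is what downstream substitution-rate analyses require.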
mRNA degradation by miRNAs and GW182 requires both CCR4:NOT deadenylase and DCP1:DCP2 decapping complexes.
Behm-Ansmant, I., Rehwinkel, J., Doerks, T., Stark, A., Bork, P. & Izaurralde, E.
Genes Dev. 2006 Jun 30;.
MicroRNAs (miRNAs) silence the expression of target genes post-transcriptionally. Their function is mediated by the Argonaute proteins (AGOs), which colocalize to P-bodies with mRNA degradation enzymes. Mammalian P-bodies are also marked by the GW182 protein, which interacts with the AGOs and is required for miRNA function. We show that depletion of GW182 leads to changes in mRNA expression profiles strikingly similar to those observed in cells depleted of the essential Drosophila miRNA effector AGO1, indicating that GW182 functions in the miRNA pathway. When GW182 is bound to a reporter transcript, it silences its expression, bypassing the requirement for AGO1. Silencing by GW182 is effected by changes in protein expression and mRNA stability. Similarly, miRNAs silence gene expression by repressing protein expression and/or by promoting mRNA decay, and both mechanisms require GW182. mRNA degradation, but not translational repression, by GW182 or miRNAs is inhibited in cells depleted of CAF1, NOT1, or the decapping DCP1:DCP2 complex. We further show that the N-terminal GW repeats of GW182 interact with the PIWI domain of AGO1. Our findings indicate that GW182 links the miRNA pathway to mRNA degradation by interacting with AGO1 and promoting decay of at least a subset of miRNA targets.
LSAT: learning about alternative transcripts in MEDLINE.
Shah, P.K. & Bork, P.
Bioinformatics. 2006 Apr 1;22(7):857-65. Epub 2006 Jan 12.
MOTIVATION: Generation of alternative transcripts from the same gene is an important biological event because it contributes to functional diversity in eukaryotes. In this work, we address the task of extracting information on this complex topic using a two-step procedure involving machine learning and information extraction. RESULTS: In the first step, we trained a classifier that inductively learns to identify sentences about physiological transcript diversity from MEDLINE abstracts. Using a large hand-built corpus, we compared the sentence classification performance of various text categorization methods. Support vector machines (SVMs), followed by the maximum entropy classifier, outperformed other methods for the sentence classification task. The SVM with the radial basis function kernel and optimized parameters achieved an Fbeta-measure of 91% during 4-fold cross validation and of 74% when applied to all sentences in more than 12 million abstracts of MEDLINE. In the second step, we identified eight frequently present semantic categories in the sentences and performed a limited amount of semantic role labeling. The role labeling step also achieved a very high Fbeta-measure for all eight categories. AVAILABILITY: The results of our two-step procedure are summarized in the LSAT database of alternative transcripts. LSAT is available at http://www.bork.embl.de/LSAT. CONTACT: email@example.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
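For readers unfamiliar with the Fbeta-measure quoted above, the following Python sketch shows the standard definition, a weighted harmonic mean of precision and recall. The abstract does not state which beta was used, so the default of 1.0 here is an assumption, and the function name `f_beta` is hypothetical.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score; beta > 1 weights recall more, beta < 1 precision.

    beta = 1.0 is an assumed default, not a value taken from the abstract.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid dividing by zero when nothing was retrieved correctly
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Illustrative inputs only; these are not figures from the paper.
    print(f_beta(0.9, 0.8))
    print(f_beta(0.9, 0.8, beta=2.0))
```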
Proteome survey reveals modularity of the yeast cell machinery.
Gavin, A.C., Aloy, P., Grandi, P., Krause, R., Boesche, M., Marzioch, M., Rau, C., Jensen, L.J., Bastuck, S., Dumpelfeld, B., Edelmann, A., Heurtier, M.A., Hoffman, V., Hoefert, C., Klein, K., Hudak, M., Michon, A.M., Schelder, M., Schirle, M., Remor, M., Rudi, T., Hooper, S., Bauer, A., Bouwmeester, T., Casari, G., Drewes, G., Neubauer, G., Rick, J.M., Kuster, B., Bork, P., Russell, R.B. & Superti-Furga, G.
Nature. 2006 Mar 30;440(7084):631-6. Epub 2006 Jan 22.
Protein complexes are key molecular entities that integrate multiple gene products to perform cellular functions. Here we report the first genome-wide screen for complexes in an organism, budding yeast, using affinity purification and mass spectrometry. Through systematic tagging of open reading frames (ORFs), the majority of complexes were purified several times, suggesting screen saturation. The richness of the data set enabled a de novo characterization of the composition and organization of the cellular machinery. The ensemble of cellular proteins partitions into 491 complexes, of which 257 are novel, that differentially combine with additional attachment proteins or protein modules to enable a diversification of potential functions. Support for this modular organization of the proteome comes from integration with available data on expression, localization, function, evolutionary conservation, protein structure and binary interactions. This study provides the largest collection of physically determined eukaryotic cellular machines so far and a platform for biological data integration and modelling.
Toward automatic reconstruction of a highly resolved tree of life.
Ciccarelli, F.D., Doerks, T., von Mering, C., Creevey, C.J., Snel, B. & Bork, P.
Science. 2006 Mar 3;311(5765):1283-7.
We have developed an automatable procedure for reconstructing the tree of life with branch lengths comparable across all three domains. The tree has its basis in a concatenation of 31 orthologs occurring in 191 species with sequenced genomes. It revealed interdomain discrepancies in taxonomic classification. Systematic detection and subsequent exclusion of products of horizontal gene transfer increased phylogenetic resolution, allowing us to confirm accepted relationships and resolve disputed and preliminary classifications. For example, we place the phylum Acidobacteria as a sister group of delta-Proteobacteria, support a Gram-positive origin of Bacteria, and suggest a thermophilic last universal common ancestor.
Comparative analysis of environmental sequences: potential and challenges.
Förstner, K.U., von Mering, C. & Bork, P.
Philos Trans R Soc Lond B Biol Sci. 2006 Mar 29;361(1467):519-23.
Environmental sequencing, also dubbed metagenomics, is increasingly being used to obtain insights into organismal communities in diverse habitats, and has a variety of foreseeable applications in biotechnology and medicine. The first public large-scale data already provide a wealth of information hidden in vast amounts of fragmented pieces of DNA from unknown species residing in these environments. Comparative sequence analysis is essential for the interpretation of such data. However, different layers of complexity that are intrinsic to each sample require the establishment of some baselines for comparison: how to normalize for the differences in phylogenetic and functional diversity, how to avoid biases from incomplete data, and how to deal with differences in species dominance or genome sizes? Here we discuss a few of these items and delineate some simple discriminative sequence properties for four distinct habitats.
Extraction of regulatory gene/protein networks from Medline.
Saric, J., Jensen, L.J., Ouzounova, R., Rojas, I. & Bork, P.
Bioinformatics. 2006 Mar 15;22(6):645-50. Epub 2005 Jul 26.
MOTIVATION: We have previously developed a rule-based approach for extracting information on the regulation of gene expression in yeast. The biomedical literature, however, contains information on several other equally important regulatory mechanisms, in particular phosphorylation, which we have now extended our rule-based system to extract as well. RESULTS: This paper presents new results for extraction of relational information from biomedical text. We have improved our system, STRING-IE, to capture both new types of linguistic constructs and new types of biological information [i.e. (de-)phosphorylation]. The precision remains stable with a slight increase in recall. From almost one million PubMed abstracts related to four model organisms, we manage to extract regulatory networks and binary phosphorylations comprising 3,319 relation chunks. The accuracy is 83-90% and 86-95% for gene expression and (de-)phosphorylation relations, respectively. To achieve this, we made use of an organism-specific resource of gene/protein names considerably larger than those used in most other biology related information extraction approaches. These names were included in the lexicon when retraining the part-of-speech (POS) tagger on the GENIA corpus. For the domain in question, an accuracy of 96.4% was attained on POS tags. It should be noted that the rules were developed for yeast and successfully applied to both abstracts and full-text articles related to other organisms with comparable accuracy. AVAILABILITY: The revised GENIA corpus, the POS tagger, the extraction rules and the full sets of extracted relations are available from http://www.bork.embl.de/Docu/STRING-IE
Identification and analysis of evolutionarily cohesive functional modules in protein networks.
Campillos, M., von Mering, C., Jensen, L.J. & Bork, P.
Genome Res. 2006 Mar;16(3):374-82. Epub 2006 Jan 31.
The increasing number of sequenced genomes makes it possible to infer the evolutionary history of functional modules, i.e., groups of proteins that contribute jointly to the same cellular function in a given species. Here we identify and analyze those prokaryotic functional modules, whose composition remains largely unchanged during evolution, and study their properties. Such "cohesive" modules have a large number of internal functional connections, encode genes that tend to be in close proximity in prokaryotic genomes, and correspond to physical complexes or complex functional systems like the flagellar apparatus. Cohesive modules are enriched in processes such as energy and amino acid metabolism, cell motility, and intracellular trafficking, or secretion. By grouping genes into modules we achieve a more precise estimate of their age and find that the young modules are often horizontally transferred between species and are enriched in functions involved in interactions with the environment, implying that they play an important role in the adaptation of species to new environments.
Literature mining for the biologist: from information retrieval to biological discovery.
Jensen, L.J., Saric, J. & Bork, P.
Nat Rev Genet. 2006 Feb;7(2):119-29.
For the average biologist, hands-on literature mining currently means a keyword search in PubMed. However, methods for extracting biomedical facts from the scientific literature have improved considerably, and the associated tools will probably soon be used in many laboratories to automatically annotate and analyse the growing number of system-wide experimental data sets. Owing to the increasing body of text and the open-access policies of many journals, literature mining is also becoming useful for both hypothesis generation and biological discovery. However, the latter will require the integration of literature and high-throughput data, which should encourage close collaborations between biologists and computational linguists.
SMART 5: domains in the context of genomes and networks.
Letunic, I., Copley, R.R., Pils, B., Pinkert, S., Schultz, J. & Bork, P.
Nucleic Acids Res. 2006 Jan 1;34(Database issue):D257-60.
The Simple Modular Architecture Research Tool (SMART) is an online resource (http://smart.embl.de/) used for protein domain identification and the analysis of protein domain architectures. Many new features were implemented to make SMART more accessible to scientists from different fields. The new 'Genomic' mode in SMART makes it easy to analyze domain architectures in completely sequenced genomes. Domain annotation has been updated with a detailed taxonomic breakdown and a prediction of the catalytic activity for 50 SMART domains is now available, based on the presence of essential amino acids. Furthermore, intrinsically disordered protein regions can be identified and displayed. The network context is now displayed in the results page for more than 350 000 proteins, enabling easy analyses of domain interactions.
Co-evolution of transcriptional and posttranslational cell-cycle regulation.
Jensen, L.J., Jensen, T.S., de Lichtenberg, U., Brunak, S. & Bork, P.
Nature. 2006;443(7111):594-7.
Medusa: a simple tool for interaction graph analysis.
Hooper, S.D. & Bork, P.
Bioinformatics. 2005 Dec 15;21(24):4432-3. Epub 2005 Sep 27.
SUMMARY: Medusa is a Java application for visualizing and manipulating graphs of interaction, such as data from the STRING database. It features an intuitive user interface developed with the help of biologists. Medusa is optimized for accessing protein interaction data from STRING, but can be used for any type of graph from any scientific field. AVAILABILITY: Medusa, along with sample datasets and instructions, can be downloaded from http://www.bork.embl.de/medusa CONTACT: firstname.lastname@example.org.
Very-KIND is a novel nervous system specific guanine nucleotide exchange factor for Ras GTPases.
Mees, A., Rock, R., Ciccarelli, F.D., Leberfinger, C.B., Borawski, J.M., Bork, P., Wiese, S., Gessler, M. & Kerkhoff, E.
Gene Expr Patterns. 2005 Dec;6(1):79-85. Epub 2005 Aug 15.
The kinase non-catalytic c-lobe domain (KIND) evolved from the catalytic protein kinase fold into a potential protein interaction module for signalling proteins. Spir family actin organizers and the non-receptor phosphatase type 13 (PTP type 13) encode a KIND domain in the very N-terminal parts of the proteins. Here we report the characterization and cloning of a third member of the KIND protein family, which we have named very-KIND (VKIND) because of its two KIND domains. Like the other members of the protein family, VKIND has a KIND domain at the N-terminus. A second KIND domain is located in the central part of the protein. The C-terminal half encodes a guanine nucleotide exchange factor motif for Ras-like GTPases (RasGEF) and a RasGEF N-terminal module (RasGEFN). There is only one VKIND gene in the mammalian genomes and up to now we have found the gene only in vertebrates. During mouse embryogenesis the VKIND gene was specifically expressed in the developing nervous system. In adult mice Northern hybridizations revealed high expression only in brain. Low expression could be detected in ovary. In situ hybridizations showed a specific expression of VKIND in neuronal cells of the granular and Purkinje cell layers of the cerebellum.
Environments shape the nucleotide composition of genomes.
Foerstner, K.U., von Mering, C., Hooper, S.D. & Bork, P.
EMBO Rep 2005 Dec;6(12):1208-13.
To test the impact of environments on genome evolution, we analysed the relative abundance of the nucleotides guanine and cytosine ('GC content') of large numbers of sequences from four distinct environmental samples (ocean surface water, farm soil, an acidophilic mine drainage biofilm and deep-sea whale carcasses). We show that the GC content of complex microbial communities seems to be globally and actively influenced by the environment. The observed nucleotide compositions cannot be easily explained by distinct phylogenetic origins of the species in the environments; the genomic GC content may change faster than was previously thought, and is also reflected in the amino-acid composition of the proteins in these habitats.
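The quantity compared across samples in the abstract above, GC content, is straightforward to compute. The Python sketch below is a minimal illustration, not the authors' pipeline: it assumes reads are plain nucleotide strings, ignores ambiguous bases such as N, and pools all reads of a sample into one fraction. The function names are hypothetical.

```python
def gc_content(sequence: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return gc / total


def sample_gc_content(reads: list) -> float:
    """Pooled GC content of a sample: all reads are treated as one sequence."""
    return gc_content("".join(reads))


if __name__ == "__main__":
    # Toy reads for illustration; not data from the study.
    print(sample_gc_content(["GGCCAATT", "GCGCGC", "ATATNN"]))
```

Pooling weights each read by its length; averaging per-read values instead would give short reads the same weight as long ones, which is a design choice to make explicitly when comparing samples.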
Vertebrate-type intron-rich genes in the marine annelid Platynereis dumerilii.
Raible, F., Tessmar-Raible, K., Osoegawa, K., Wincker, P., Jubin, C., Balavoine, G., Ferrier, D., Benes, V., de Jong, P., Weissenbach, J., Bork, P. & Arendt, D.
Science 2005 Nov 25;310(5752):1325-6.
Previous genome comparisons have suggested that one important trend in vertebrate evolution has been a sharp rise in intron abundance. By using genomic data and expressed sequence tags from the marine annelid Platynereis dumerilii, we provide direct evidence that about two-thirds of human introns predate the bilaterian radiation but were lost from insect and nematode genomes to a large extent. A comparison of coding exon sequences confirms the ancestral nature of Platynereis and human genes. Thus, the urbilaterian ancestor had complex, intron-rich genes that have been retained in Platynereis and human.
Spore number control and breeding in Saccharomyces cerevisiae: a key role for a self-organizing system.
Taxis, C., Keller, P., Kavagiou, Z., Jensen, L.J., Colombelli, J., Bork, P., Stelzer, E.H.K. & Knop, M.
J Cell Biol 2005 Nov 21;171(4):627-40. Epub 2005 Nov 14.
Spindle pole bodies (SPBs) provide a structural basis for genome inheritance and spore formation during meiosis in yeast. Upon carbon source limitation during sporulation, the number of haploid spores formed per cell is reduced. We show that precise spore number control (SNC) fulfills two functions. SNC maximizes the production of spores (1-4) that are formed by a single cell. This is regulated by the concentration of three structural meiotic SPB components, which is dependent on available amounts of carbon source. Using experiments and computer simulation, we show that the molecular mechanism relies on a self-organizing system, which is able to generate particular patterns (different numbers of spores) in dependency on one single stimulus (gradually increasing amounts of SPB constituents). We also show that SNC enhances intratetrad mating, whereby maximal amounts of germinated spores are able to return to a diploid lifestyle without intermediary mitotic division. This is beneficial for the immediate fitness of the population of postmeiotic cells.
Palindromic repetitive DNA elements with coding potential in Methanocaldococcus jannaschii.
Suyama, M., Lathe, W.C. 3rd & Bork, P.
FEBS Lett 2005 Oct 10;579(24):5281-6.
We have identified 141 novel palindromic repetitive elements in the genome of the euryarchaeon Methanocaldococcus jannaschii. The total length of these elements is 14.3 kb, which corresponds to 0.9% of the total genomic sequence and 6.3% of all extragenic regions. The elements can be divided into three groups (MJRE1-3) based on sequence similarity. The low sequence identity within each of the groups suggests a rather old origin of these elements in M. jannaschii. Three MJRE2 elements were located within protein coding regions without disrupting the coding potential of the host genes, indicating that insertion of repeats might be a widespread mechanism to enhance sequence diversity in coding regions.
Nonsense-mediated mRNA decay factors act in concert to regulate common mRNA targets.
Rehwinkel, J., Letunic, I., Raes, J., Bork, P. & Izaurralde, E.
RNA 2005 Oct;11(10):1530-44.
Nonsense-mediated mRNA decay (NMD) is a surveillance pathway that degrades mRNAs containing nonsense codons, and regulates the expression of naturally occurring transcripts. While NMD is not essential in yeast or nematodes, UPF1, a key NMD effector, is essential in mice. Here we show that NMD components are required for cell proliferation in Drosophila. This raises the question of whether NMD effectors diverged functionally during evolution. To address this question, we examined expression profiles in Drosophila cells depleted of all known metazoan NMD components. We show that UPF1, UPF2, UPF3, SMG1, SMG5, and SMG6 regulate in concert the expression of a cohort of genes with functions in a wide range of cellular activities, including cell cycle progression. Only a few transcripts were regulated exclusively by individual factors, suggesting that these proteins act mainly in the NMD pathway and their role in mRNA decay has not diverged substantially. Finally, the vast majority of NMD targets in Drosophila are not orthologs of targets previously identified in yeast or human cells. Thus phenotypic differences observed across species following inhibition of NMD can be largely attributed to changes in the repertoire of regulated genes.
G2D: a tool for mining genes associated with disease.
Perez-Iratxeta, C., Wjst, M., Bork, P. & Andrade, M.A.
BMC Genet 2005 Aug 22;6:45.
BACKGROUND: Human inherited diseases can be associated by genetic linkage with one or more genomic regions. The availability of the complete sequence of the human genome allows examining those locations for an associated gene. We previously developed an algorithm to prioritize genes on a chromosomal region according to their possible relation to an inherited disease using a combination of data mining on biomedical databases and gene sequence analysis. RESULTS: We have implemented this method as a web application in our site G2D (Genes to Diseases). It allows users to inspect any region of the human genome to find candidate genes related to a genetic disease of their interest. In addition, the G2D server includes pre-computed analyses of candidate genes for 552 linked monogenic diseases without an associated gene, and the analysis of 18 asthma loci. CONCLUSION: G2D can be publicly accessed at http://www.ogic.ca/projects/g2d_2/.
Structural genomics of human proteins--target selection and generation of a public catalogue of expression clones.
Bussow, K., Scheich, C., Sievert, V., Harttig, U., Schultz, J., Simon, B., Bork, P., Lehrach, H. & Heinemann, U.
Microb Cell Fact 2005 Jul 5;4:21.
BACKGROUND: The availability of suitable recombinant protein is still a major bottleneck in protein structure analysis. The Protein Structure Factory, part of the international structural genomics initiative, targets human proteins for structure determination. It has implemented high throughput procedures for all steps from cloning to structure calculation. This article describes the selection of human target proteins for structure analysis, our high throughput cloning strategy, and the expression of human proteins in Escherichia coli host cells. RESULTS AND CONCLUSION: Protein expression and sequence data of 1414 E. coli expression clones representing 537 different proteins are presented. 139 human proteins (18%) could be expressed and purified in soluble form and with the expected size. All E. coli expression clones are publicly available to facilitate further functional characterisation of this set of human proteins.
DCD - a novel plant specific domain in proteins involved in development and programmed cell death.
Tenhaken, R., Doerks, T. & Bork, P.
BMC Bioinformatics 2005 Jul 11;6:169.
BACKGROUND: Recognition of microbial pathogens by plants triggers the hypersensitive reaction, a common form of programmed cell death in plants. These dying cells generate signals that activate the plant immune system and alarm the neighboring cells as well as the whole plant to activate defense responses to limit the spread of the pathogen. The molecular mechanisms behind the hypersensitive reaction are largely unknown except for the recognition process of pathogens. We delineate the NRP-gene in soybean, which is specifically induced during this programmed cell death and contains a novel protein domain, which is commonly found in different plant proteins. RESULTS: The sequence analysis of the protein encoded by the NRP-gene from soybean led to the identification of a novel domain, which we named DCD, because it is found in plant proteins involved in development and cell death. The domain is shared by several proteins in the Arabidopsis and the rice genomes, which otherwise show a different protein architecture. Biological studies indicate a role of these proteins in phytohormone response, embryo development and programmed cell death by pathogens or ozone. CONCLUSION: It is tempting to speculate that the DCD domain mediates signaling in plant development and programmed cell death and could thus be used to identify interacting proteins to gain further molecular insights into these processes.
Consistency of genome-based methods in measuring Metazoan evolution.
Zdobnov, E.M., von Mering, C., Letunic, I. & Bork, P.
FEBS Lett 2005 Jun 13;579(15):3355-61. Epub 2005 Apr 18.
Seven distinct genome-wide divergence measures were applied pairwise to the nine sequenced animal genomes of human, mouse, rat, chicken, pufferfish, fruit fly, mosquito, and two nematode worms (Caenorhabditis briggsae and Caenorhabditis elegans). Qualitatively, all of these divergence measures are found to correlate with the estimated time since speciation; however, marked deviations are observed in a few lineages. The distinct genome divergence measures also correlate well among themselves, indicating that most of the processes shaping genomes are dominated by neutral events. The deviations from the clock-like scenario in some lineages are observed consistently by several measures, implicitly confirming their reliability.
Extraction of transcript diversity from scientific literature.
Shah, P.K., Jensen, L.J., Boue, S. & Bork, P.
PLoS Comput Biol. 2005 Jun;1(1):e10. Epub 2005 Jun 24.
Transcript diversity generated by alternative splicing and associated mechanisms contributes heavily to the functional complexity of biological systems. The numerous examples of the mechanisms and functional implications of these events are scattered throughout the scientific literature. Thus, it is crucial to have a tool that can automatically extract the relevant facts and collect them in a knowledge base that can aid the interpretation of data from high-throughput methods. We have developed and applied a composite text-mining method for extracting information on transcript diversity from the entire MEDLINE database in order to create a database of genes with alternative transcripts. It contains information on tissue specificity, number of isoforms, causative mechanisms, functional implications, and experimental methods used for detection. We have mined this resource to identify 959 instances of tissue-specific splicing. Our results in combination with those from EST-based methods suggest that alternative splicing is the preferred mechanism for generating transcript diversity in the nervous system. We provide new annotations for 1,860 genes with the potential for generating transcript diversity. We assign the MeSH term "alternative splicing" to 1,536 additional abstracts in the MEDLINE database and suggest new MeSH terms for other events. We have successfully extracted information about transcript diversity and semiautomatically generated a database, LSAT, that can provide a quantitative understanding of the mechanisms behind tissue-specific gene expression. LSAT (Literature Support for Alternative Transcripts) is publicly available at http://www.bork.embl.de/LSAT/.
Towards cellular systems in 4D.
Bork, P. & Serrano, L.
Cell 2005 May 20;121(4):507-9.
Structural similarity to bridge sequence space: finding new families on the bridges.
Shah, P.K., Aloy, P., Bork, P. & Russell, R.B.
Protein Sci 2005 May;14(5):1305-14.
Structures for protein domains have increased rapidly in recent years owing to advances in structural biology and structural genomics projects. New structures are often similar to those solved previously, and such similarities can give insights into function by linking poorly understood families to those that are better characterized. They also allow the possibility of combining information to find still more proteins adopting a similar structure and sometimes a similar function, and to reprioritize families in structural genomics pipelines. We explore this possibility here by preparing merged profiles for pairs of structurally similar, but not necessarily sequence-similar, domains within the SMART and Pfam databases by way of the Structural Classification of Proteins (SCOP). We show that such profiles are often able to successfully identify further members of the same superfamily and thus can be used to increase the sensitivity of database searching methods like HMMer and PSI-BLAST. We perform detailed benchmarks using the SMART and Pfam databases with four complete genomes frequently used as annotation benchmarks. We quantify the associated increase in structural information in Swissprot and discuss examples illustrating the applicability of this approach to understand functional and evolutionary relationships between protein families.
Generation and annotation of the DNA sequences of human chromosomes 2 and 4.
Hillier, L.W., Graves, T.A., Fulton, R.S., Fulton, L.A., Pepin, K.H., Minx, P., Wagner-McPherson, C., Layman, D., Wylie, K., Sekhon, M., Becker, M.C., Fewell, G.A., Delehaunty, K.D., Miner, T.L., Nash, W.E., Kremitzki, C., Oddy, L., Du, H., Sun, H., Bradshaw-Cordum, H., Ali, J., Carter, J., Cordes, M., Harris, A., Isak, A., van Brunt, A., Nguyen, C., Du, F., Courtney, L., Kalicki, J., Ozersky, P., Abbott, S., Armstrong, J., Belter, E.A., Caruso, L., Cedroni, M., Cotton, M., Davidson, T., Desai, A., Elliott, G., Erb, T., Fronick, C., Gaige, T., Haakenson, W., Haglund, K., Holmes, A., Harkins, R., Kim, K., Kruchowski, S.S., Strong, C.M., Grewal, N., Goyea, E., Hou, S., Levy, A., Martinka, S., Mead, K., McLellan, M.D., Meyer, R., Randall-Maher, J., Tomlinson, C., Dauphin-Kohlberg, S., Kozlowicz-Reilly, A., Shah, N., Swearengen-Shahid, S., Snider, J., Strong, J.T., Thompson, J., Yoakum, M., Leonard, S., Pearman, C., Trani, L., Radionenko, M., Waligorski, J.E., Wang, C., Rock, S.M., Tin-Wollam, A.M., Maupin, R., Latreille, P., Wendl, M.C., Yang, S.P., Pohl, C., Wallis, J.W., Spieth, J., Bieri, T.A., Berkowicz, N., Nelson, J.O., Osborne, J., Ding, L., Meyer, R., Sabo, A., Shotland, Y., Sinha, P., Wohldmann, P.E., Cook, L.L., Hickenbotham, M.T., Eldred, J., Williams, D., Jones, T.A., She, X., Ciccarelli, F.D., Izaurralde, E., Taylor, J., Schmutz, J., Myers, R.M., Cox, D.R., Huang, X., McPherson, J.D., Mardis, E.R., Clifton, S.W., Warren, W.C., Chinwalla, A.T., Eddy, S.R., Marra, M.A., Ovcharenko, I., Furey, T.S., Miller, W., Eichler, E.E., Bork, P., Suyama, M., Torrents, D., Waterston, R.H. & Wilson, R.K.
Nature 2005 Apr 7;434(7034):724-31.
Human chromosome 2 is unique to the human lineage in being the product of a head-to-head fusion of two intermediate-sized ancestral chromosomes. Chromosome 4 has received attention primarily related to the search for the Huntington's disease gene, but also for genes associated with Wolf-Hirschhorn syndrome, polycystic kidney disease and a form of muscular dystrophy. Here we present approximately 237 million base pairs of sequence for chromosome 2, and 186 million base pairs for chromosome 4, representing more than 99.6% of their euchromatic sequences. Our initial analyses have identified 1,346 protein-coding genes and 1,239 pseudogenes on chromosome 2, and 796 protein-coding genes and 778 pseudogenes on chromosome 4. Extensive analyses confirm the underlying construction of the sequence, and expand our understanding of the structure and evolution of mammalian chromosomes, including gene deserts, segmental duplications and highly variant regions.
Systematic association of genes to phenotypes by genome and literature mining.
Korbel, J.O., Doerks, T., Jensen, L.J., Perez-Iratxeta, C., Kaczanowski, S., Hooper, S.D., Andrade, M.A. & Bork, P.
PLoS Biol 2005 Apr 5;3(5):e134.
One of the major challenges of functional genomics is to unravel the connection between genotype and phenotype. So far no global analysis has attempted to explore those connections in the light of the large phenotypic variability seen in nature. Here, we use an unsupervised, systematic approach for associating genes and phenotypic characteristics that combines literature mining with comparative genome analysis. We first mine the MEDLINE literature database for terms that reflect phenotypic similarities of species. Subsequently we predict the likely genomic determinants: genes specifically present in the respective genomes. In a global analysis involving 92 prokaryotic genomes we retrieve 323 clusters containing a total of 2,700 significant gene-phenotype associations. Some clusters contain mostly known relationships, such as genes involved in motility or plant degradation, often with additional hypothetical proteins associated with those phenotypes. Other clusters comprise unexpected associations; for example, a group of terms related to food and spoilage is linked to genes predicted to be involved in bacterial food poisoning. Among the clusters, we observe an enrichment of pathogenicity-related associations, suggesting that the approach reveals many novel genes likely to play a role in infectious diseases.
Comparative metagenomics of microbial communities.
Tringe, S.G., von Mering, C., Kobayashi, A., Salamov, A.A., Chen, K., Chang, H.W., Podar, M., Short, J.M., Mathur, E.J., Detter, J.C., Bork, P., Hugenholtz, P. & Rubin, E.M.
Science 2005 Apr 22;308(5721):554-7.
The species complexity of microbial communities and challenges in culturing representative isolates make it difficult to obtain assembled genomes. Here we characterize and compare the metabolic capabilities of terrestrial and marine microbial communities using largely unassembled sequence data obtained by shotgun sequencing DNA isolated from the various environments. Quantitative gene content analysis reveals habitat-specific fingerprints that reflect known characteristics of the sampled environments. The identification of environment-specific genes through a gene-centric comparative analysis presents new opportunities for interpreting and diagnosing environments.
The WHy domain mediates the response to desiccation in plants and bacteria.
Ciccarelli, F.D. & Bork, P.
Bioinformatics 2005 Apr 15;21(8):1304-7. Epub 2004 Dec 14.
MOTIVATION: The hypersensitive response (HR) is a process activated by plants after microbial infection. Its main phenotypic effects are both a programmed death of the plant cells near the infection site and a reduction of the microbial proliferation. Although many resistance genes (R genes) associated with HR have been identified, very little is known about the molecular mechanisms activated after their expression. RESULTS: The analysis of the product of one of the R genes, the Hin1 protein, led to the identification of a novel domain, which we named WHy because it is detectable in proteins involved in Water stress and Hypersensitive response. The expression of this domain during both biotic infection and response to desiccation points to a molecular machinery common to these two stress conditions. Moreover, its presence in a restricted number of bacteria suggests a possible use for marking plant pathogenicity. CONTACT: email@example.com SUPPLEMENTARY INFORMATION: Supplementary data (Figures S1 and S2 and Table S1) and the alignment in clustal format are available at http://www.bork.embl.de/~ciccarel/WHy_add_data.html.
Complex genomic rearrangements lead to novel primate gene function.
Ciccarelli, F.D., von Mering, C., Suyama, M., Harrington, E.D., Izaurralde, E. & Bork, P.
Genome Res 2005 Mar;15(3):343-51. Epub 2005 Feb 14.
Orthologous genes that maintain a single-copy status in a broad range of species may indicate a selection against gene duplication. If this is the case, then duplicates of such genes that do survive may have escaped the dosage control by rapid and sizable changes in their function. To test this hypothesis and to develop a strategy for the identification of novel gene functions, we have analyzed 22 primate-specific intrachromosomal duplications of genes with a single-copy ortholog in all other completely sequenced metazoans. When comparing this set to genes not exposed to the single-copy status constraint, we observed a higher tendency of the former to modify their gene structure, often through complex genomic rearrangements. The analysis of the most dramatic of these duplications, affecting approximately 10% of human Chromosome 2, enabled a detailed reconstruction of the events leading to the appearance of a novel gene family. The eight members of this family originated from the highly conserved nucleoporin RanBP2 by several genetic rearrangements such as segmental duplications, inversions, translocations, exon loss, and domain accretion. We have experimentally verified that at least one of the newly formed proteins has a cellular localization different from RanBP2's, and we show that positive selection did act on specific domains during evolution.
Dynamic complex formation during the yeast cell cycle.
de Lichtenberg, U., Jensen, L.J., Brunak, S. & Bork, P.
Science 2005 Feb 4;307(5710):724-7.
To analyze the dynamics of protein complexes during the yeast cell cycle, we integrated data on protein interactions and gene expression. The resulting time-dependent interaction network places both periodically and constitutively expressed proteins in a temporal cell cycle context, thereby revealing previously unknown components and modules. We discovered that most complexes consist of both periodically and constitutively expressed subunits, which suggests that the former control complex activity by a mechanism of just-in-time assembly. Consistent with this, we show that additional regulation through targeted degradation and phosphorylation by Cdc28p (Cdk1) specifically affects the periodically expressed proteins.
STRING: known and predicted protein-protein associations, integrated and transferred across organisms.
von Mering, C., Jensen, L.J., Snel, B., Hooper, S.D., Krupp, M., Foglierini, M., Jouffre, N., Huynen, M.A. & Bork, P.
Nucleic Acids Res 2005 Jan 1;33 Database Issue:D433-7.
A full description of a protein's function requires knowledge of all partner proteins with which it specifically associates. From a functional perspective, 'association' can mean direct physical binding, but can also mean indirect interaction such as participation in the same metabolic pathway or cellular process. Currently, information about protein association is scattered over a wide variety of resources and model organisms. STRING aims to simplify access to this information by providing a comprehensive, yet quality-controlled collection of protein-protein associations for a large number of organisms. The associations are derived from high-throughput experimental data, from the mining of databases and literature, and from predictions based on genomic context analysis. STRING integrates and ranks these associations by benchmarking them against a common reference set, and presents evidence in a consistent and intuitive web interface. Importantly, the associations are extended beyond the organism in which they were originally described, by automatic transfer to orthologous protein pairs in other organisms, where applicable. STRING currently holds 730,000 proteins in 180 fully sequenced organisms, and is available at http://string.embl.de/.
InterPro, progress and status in 2005.
Mulder, N.J., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Bradley, P., Bork, P., Bucher, P., Cerutti, L., Copley, R., Courcelle, E., Das, U., Durbin, R., Fleischmann, W., Gough, J., Haft, D., Harte, N., Hulo, N., Kahn, D., Kanapin, A., Krestyaninova, M., Lonsdale, D., Lopez, R., Letunic, I., Madera, M., Maslen, J., McDowall, J., Mitchell, A., Nikolskaya, A.N., Orchard, S., Pagni, M., Ponting, C.P., Quevillon, E., Selengut, J., Sigrist, C.J., Silventoinen, V., Studholme, D.J., Vaughan, R. & Wu, C.H.
Nucleic Acids Res 2005 Jan 1;33(Database issue):D201-5.
InterPro, an integrated documentation resource of protein families, domains and functional sites, was created to integrate the major protein signature databases. Currently, it includes PROSITE, Pfam, PRINTS, ProDom, SMART, TIGRFAMs, PIRSF and SUPERFAMILY. Signatures are manually integrated into InterPro entries that are curated to provide biological and functional information. Annotation is provided in an abstract, Gene Ontology mapping and links to specialized databases. New features of InterPro include extended protein match views, taxonomic range information and protein 3D structure data. One of the new match views is the InterPro Domain Architecture view, which shows the domain composition of protein matches. Two new entry types were introduced to better describe InterPro entries: these are active site and binding site. PIRSF and the structure-based SUPERFAMILY are the latest member databases to join InterPro, and CATH and PANTHER are soon to be integrated. InterPro release 8.0 contains 11 007 entries, representing 2573 domains, 8166 families, 201 repeats, 26 active sites, 21 binding sites and 20 post-translational modification sites. InterPro covers over 78% of all proteins in the Swiss-Prot and TrEMBL components of UniProt. The database is available for text- and sequence-based searches via a webserver (http://www.ebi.ac.uk/interpro), and for download by anonymous FTP (ftp://ftp.ebi.ac.uk/pub/databases/interpro).
Comparative architectures of mammalian and chicken genomes reveal highly variable rates of genomic rearrangements across different lineages.
Bourque, G., Zdobnov, E.M., Bork, P., Pevzner, P.A. & Tesler, G.
Genome Res 2005 Jan;15(1):98-110. Epub 2004 Dec 08.
Molecular evolution studies are usually based on the analysis of individual genes and thus reflect only small-range variations in genomic sequences. A complementary approach is to study the evolutionary history of rearrangements in entire genomes based on the analysis of gene orders. The progress in whole genome sequencing provides an unprecedented level of detailed sequence data to infer genome rearrangements through comparative approaches. The comparative analysis of recently sequenced rodent genomes with the human genome revealed evidence for a larger number of rearrangements than previously thought and led to the reconstruction of the putative genomic architecture of the murid rodent ancestor, while the architecture of the ancestral mammalian genome and the rate of rearrangements in the human lineage remained unknown. Sequencing the chicken genome provides an opportunity to reconstruct the architecture of the ancestral mammalian genome by using chicken as an outgroup. Our analysis reveals a very low rate of rearrangements and, in particular, interchromosomal rearrangements in chicken, in the early mammalian ancestor, or in both. The suggested number of interchromosomal rearrangements between the mammalian ancestor and chicken, during an estimated 500 million years of evolution, only slightly exceeds the number of interchromosomal rearrangements that happened in the mouse lineage, over the course of about 87 million years.
Protein coding potential of retroviruses and other transposable elements in vertebrate genomes.
Zdobnov, E.M., Campillos, M., Harrington, E.D., Torrents, D. & Bork, P.
Nucleic Acids Res 2005 Feb 16;33(3):946-54. Print 2005.
We suggest an annotation strategy for genes encoded by retroviruses and transposable elements (RETRA genes) based on a set of marker protein domains. Usually RETRA genes are masked in vertebrate genomes prior to the application of automated gene prediction pipelines under the assumption that they provide no selective advantage to the host. Yet, we show that about 1000 genes in four vertebrate gene sets analyzed contain at least one RETRA gene marker domain. Using the conservation of genomic neighborhood (synteny), we were able to discriminate between RETRA genes with putative functionality in the vertebrates and those that probably function only in the context of mobile elements. We identified 35 such genes in human, along with their corresponding mouse and rat orthologs, which included almost all known human genes with similarity to mobile elements. The results also imply that the vast majority of the remaining RETRA genes in current gene sets are unlikely to encode vertebrate functions. To automatically annotate RETRA genes in other vertebrate genomes, we provide as a tool a set of marker protein domains and a manually refined list of domesticated or ancestral RETRA genes for rescuing genes with vertebrate functions.
Is there biological research beyond Systems Biology? A comparative analysis of terms.
Molecular Systems Biology 2005
Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution.
Hillier, L.W., Miller, W., Birney, E., Warren, W., Hardison, R.C., Ponting, C.P., Bork, P., Burt, D.W., Groenen, M.A., Delany, M.E., Dodgson, J.B., Chinwalla, A.T., Cliften, P.F., Clifton, S.W., Delehaunty, K.D., Fronick, C., Fulton, R.S., Graves, T.A., Kremitzki, C., Layman, D., Magrini, V., McPherson, J.D., Miner, T.L., Minx, P., Nash, W.E., Nhan, M.N., Nelson, J.O., Oddy, L.G., Pohl, C.S., Randall-Maher, J., Smith, S.M., Wallis, J.W., Yang, S.P., Romanov, M.N., Rondelli, C.M., Paton, B., Smith, J., Morrice, D., Daniels, L., Tempest, H.G., Robertson, L., Masabanda, J.S., Griffin, D.K., Vignal, A., Fillon, V., Jacobbson, L., Kerje, S., Andersson, L., Crooijmans, R.P., Aerts, J., van der Poel, J.J., Ellegren, H., Caldwell, R.B., Hubbard, S.J., Grafham, D.V., Kierzek, A.M., McLaren, S.R., Overton, I.M., Arakawa, H., Beattie, K.J., Bezzubov, Y., Boardman, P.E., Bonfield, J.K., Croning, M.D., Davies, R.M., Francis, M.D., Humphray, S.J., Scott, C.E., Taylor, R.G., Tickle, C., Brown, W.R., Rogers, J., Buerstedde, J.M., Wilson, S.A., Stubbs, L., Ovcharenko, I., Gordon, L., Lucas, S., Miller, M.M., Inoko, H., Shiina, T., Kaufman, J., Salomonsen, J., Skjoedt, K., Wong, G.K., Wang, J., Liu, B., Wang, J., Yu, J., Yang, H., Nefedov, M., Koriabine, M., Dejong, P.J., Goodstadt, L., Webber, C., Dickens, N.J., Letunic, I., Suyama, M., Torrents, D., von Mering, C., Zdobnov, E.M., Makova, K., Nekrutenko, A., Elnitski, L., Eswara, P., King, D.C., Yang, S., Tyekucheva, S., Radakrishnan, A., Harris, R.S., Chiaromonte, F., Taylor, J., He, J., Rijnkels, M., Griffiths-Jones, S., Ureta-Vidal, A., Hoffman, M.M., Severin, J., Searle, S.M., Law, A.S., Speed, D., Waddington, D., Cheng, Z., Tuzun, E., Eichler, E., Bao, Z., Flicek, P., Shteynberg, D.D., Brent, M.R., Bye, J.M., Huckle, E.J., Chatterji, S., Dewey, C., Pachter, L., Kouranov, A., Mourelatos, Z., Hatzigeorgiou, A.G., Paterson, A.H., Ivarie, R., Brandstrom, M., Axelsson, E., Backstrom, N., Berlin, S., Webster, M.T., Pourquie, O., 
Reymond, A., Ucla, C., Antonarakis, S.E., Long, M., Emerson, J.J., Betran, E., Dupanloup, I., Kaessmann, H., Hinrichs, A.S., Bejerano, G., Furey, T.S., Harte, R.A., Raney, B., Siepel, A., Kent, W.J., Haussler, D., Eyras, E., Castelo, R., Abril, J.F., Castellano, S., Camara, F., Parra, G., Guigo, R., Bourque, G., Tesler, G., Pevzner, P.A., Smit, A., Fulton, L.A., Mardis, E.R. & Wilson, R.K.
Nature 2004 Dec 9;432(7018):695-716.
We present here a draft genome sequence of the red jungle fowl, Gallus gallus. Because the chicken is a modern descendant of the dinosaurs and the first non-mammalian amniote to have its genome sequenced, the draft sequence of its genome--composed of approximately one billion base pairs of sequence and an estimated 20,000-23,000 genes--provides a new perspective on vertebrate genome evolution, while also improving the annotation of mammalian genomes. For example, the evolutionary distance between chicken and human provides high specificity in detecting functional elements, both non-coding and coding. Notably, many conserved non-coding sequences are far from genes and cannot be assigned to defined functional classes. In coding regions the evolutionary dynamics of protein domains and orthologous groups illustrate processes that distinguish the lineages leading to birds and mammals. The distinctive properties of avian microchromosomes, together with the inferred patterns of conserved synteny, provide additional insights into vertebrate chromosome architecture.
Shared components of protein complexes--versatile building blocks or biochemical artefacts?
Krause, R., von Mering, C., Bork, P. & Dandekar, T.
Bioessays 2004 Dec;26(12):1333-43.
Protein complexes perform many important functions in the cell. Large-scale studies of protein-protein interactions have not only revealed new complexes but have also placed many proteins into multiple complexes. Whilst the advocates of hypothesis-free research touted the discovery of these shared components as new links between diverse cellular processes, critical commentators denounced many of the findings as artefacts, thus questioning the usefulness of large-scale approaches. Here, we survey proteins known to be shared between complexes, as established in the literature, and compare them to shared components found in high-throughput screens. We discuss the various challenges to the identification and functional interpretation of bona fide shared components, namely contaminants, variant and megacomplexes, and transient interactions, and suggest that many of the novel shared components found in high-throughput screens are neither the results of contamination nor central components, but appear to be primarily regulatory links in cellular processes.
Gene expression profiling of the rat superior olivary complex using serial analysis of gene expression.
Koehl, A., Schmidt, N., Rieger, A., Pilgram, S.M., Letunic, I., Bork, P., Soto, F., Friauf, E. & Nothwang, H.G.
Eur J Neurosci 2004 Dec;20(12):3244-58.
The superior olivary complex (SOC) is an auditory brainstem region that represents a favourable system to study rapid neurotransmission and the maturation of neuronal circuits. Here we performed serial analysis of gene expression (SAGE) on the SOC in 60-day-old Sprague-Dawley rats to identify genes specifically important for its function and to create a transcriptome reference for the subsequent identification of age-related or disease-related changes. Sequencing of 31 035 tags identified 10 473 different transcripts. Fifty-seven per cent of the unique tags with a count greater than four were statistically more highly represented in the SOC than in the hippocampus. Among them were genes encoding proteins involved in energy supply, the glutamate/glutamine shuttle, and myelination. Approximately 80 plasma membrane transporters, receptors, channels, and vesicular transporters were identified, and 25% of them displayed a significantly higher expression level in the SOC than in the hippocampus. Some of the plasma membrane proteins were not previously characterized in the SOC, e.g. the purinergic receptor subunit P2X(6) and the metabotropic GABA receptor Gpr51. Differential gene expression between SOC and hippocampus was confirmed using RNA in situ hybridization or immunohistochemistry. The extensive gene inventory presented here will facilitate the dissection of the molecular mechanisms underlying specific SOC functions, and the comparison with other SAGE libraries from brain will ease the identification of promoters to generate region-specific transgenic animals. The analysis will be part of the publicly available database ID-GRAB.
Comparison of computational methods for the identification of cell cycle regulated genes.
de Lichtenberg, U., Jensen, L.J., Fausboll, A., Jensen, T.S., Bork, P. & Brunak, S.
Bioinformatics 2004 Oct 28.
MOTIVATION: DNA microarrays have been used extensively to study the cell cycle transcription programme in a number of model organisms. The Saccharomyces cerevisiae data in particular have been subjected to a wide range of bioinformatics analysis methods, aimed at identifying the correct and complete set of periodically expressed genes. RESULTS: Here, we provide the first thorough benchmark of such methods, surprisingly revealing that most new and more mathematically advanced methods actually perform worse than the analysis published with the original microarray data sets. We show that this loss of accuracy specifically affects methods that only model the shape of the expression profile without taking into account the magnitude of regulation. We present a simple permutation-based method that performs better than most existing methods. SUPPLEMENTARY INFORMATION: Results and benchmark sets are available at http://www.cbs.dtu.dk/cellcycle.
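The abstract above does not spell out the permutation-based method, so the following is only a minimal sketch of the general idea of a permutation test for periodicity in one expression profile. The Fourier-style score, the function names, and the example time points and period are all assumptions made for illustration and are not taken from the paper.

```python
import math
import random

def fourier_score(values, times, period):
    """Squared projection of the mean-centred profile onto a sinusoid of the given period."""
    omega = 2.0 * math.pi / period
    mean = sum(values) / len(values)
    cos_sum = sum((v - mean) * math.cos(omega * t) for v, t in zip(values, times))
    sin_sum = sum((v - mean) * math.sin(omega * t) for v, t in zip(values, times))
    return cos_sum ** 2 + sin_sum ** 2

def periodicity_p_value(values, times, period, n_perm=1000, seed=0):
    """Empirical p-value: how often a time-shuffled profile scores at least as high."""
    rng = random.Random(seed)
    observed = fourier_score(values, times, period)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # destroys temporal order, keeps the value distribution
        if fourier_score(shuffled, times, period) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical data: 12 samples taken every 10 minutes, assumed 60-minute cycle.
times = list(range(0, 120, 10))
periodic_profile = [math.sin(2.0 * math.pi * t / 60.0) for t in times]
print(periodicity_p_value(periodic_profile, times, period=60.0))
```

Because shuffling preserves the set of values, this score captures only the shape of the profile. The abstract reports that methods ignoring the magnitude of regulation lose accuracy, so a fuller analysis would presumably pair a shape score like this with a separate measure of how strongly the gene is regulated.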
Gene annotation from scientific literature using mappings between keyword systems.
Perez, A.J., Perez-Iratxeta, C., Bork, P., Thode, G. & Andrade, M.A.
Bioinformatics 2004 Sep 1;20(13):2084-91. Epub 2004 Apr 01.
MOTIVATION: The description of genes in databases by keywords helps the non-specialist to quickly grasp the properties of a gene and increases the efficiency of computational tools that are applied to gene data (e.g. searching a gene database for sequences related to a particular biological process). However, the association of keywords to genes or protein sequences is a difficult process that ultimately implies examination of the literature related to a gene. RESULTS: To support this task, we present a procedure to derive keywords from the set of scientific abstracts related to a gene. Our system is based on the automated extraction of mappings between related terms from different databases using a model of fuzzy associations that can be applied with all generality to any pair of linked databases. We tested the system by annotating genes of the SWISS-PROT database with keywords derived from the abstracts linked to their entries (stored in the MEDLINE database of scientific references). The performance of the annotation procedure was much better for SWISS-PROT keywords (recall of 47%, precision of 68%) than for Gene Ontology terms (recall of 8%, precision of 67%). AVAILABILITY: The algorithm can be publicly accessed and used for the annotation of sequences through a web server at http://www.bork.embl.de/kat
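As a reminder of how the recall and precision figures quoted above are defined, here is a small sketch that compares a set of predicted keywords with a curated set for a single entry. The keyword sets are invented for illustration, and the paper's own evaluation protocol may differ (for example in how results are pooled across entries).

```python
def precision_recall(predicted, curated):
    """Return (precision, recall) of predicted keywords against curated ones."""
    predicted, curated = set(predicted), set(curated)
    true_positives = len(predicted & curated)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(curated) if curated else 0.0
    return precision, recall

# Hypothetical keyword sets for one database entry.
predicted = {"Kinase", "ATP-binding", "Membrane"}
curated = {"Kinase", "ATP-binding", "Transferase", "Nucleus"}
precision, recall = precision_recall(predicted, curated)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision is the fraction of predicted keywords that are correct, and recall is the fraction of curated keywords that were recovered. High precision with low recall, as reported for Gene Ontology terms, means few wrong predictions but many missed annotations.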
Homology-based functional proteomics by mass spectrometry: application to the Xenopus microtubule-associated proteome.
Liska, A.J., Popov, A.V., Sunyaev, S., Coughlin, P., Habermann, B., Shevchenko, A., Bork, P., Karsenti, E. & Shevchenko, A.
Proteomics 2004 Sep;4(9):2707-21.
The application of functional proteomics to important model organisms with unsequenced genomes is restricted because of the limited ability to identify proteins by conventional mass spectrometry (MS) methods. Here we applied MS and sequence-similarity database searching strategies to characterize the Xenopus laevis microtubule-associated proteome. We identified over 40 unique, and many novel, microtubule-bound proteins, as well as two macromolecular protein complexes involved in protein translation. This finding was corroborated by electron microscopy showing the presence of ribosomes on spindles assembled from frog egg extracts. Taken together, these results suggest that protein translation occurs on the spindle during meiosis in the Xenopus oocyte. These findings were made possible by the application of sequence-similarity methods, which extended mass spectrometric protein identification capabilities by 2-fold compared to conventional methods.
ArrayProspector: a web resource of functional associations inferred from microarray expression data.
Jensen, L.J., Lagarde, J., von Mering, C. & Bork, P.
Nucleic Acids Res 2004 Jul 1;32(Web Server issue):W445-8.
DNA microarray experiments have provided vast amounts of data which can be used for inferring gene function. However, most methods for predicting functional associations between genes from expression data are not suited to simultaneous analysis of multiple datasets, and a comprehensive resource of coexpression-based predictions is currently lacking. Here, we present an interactive web resource of gene associations predicted by applying a novel algorithm to all expression data in the Stanford Microarray Database. The underlying pre-computed database currently contains more than 200 000 high-confidence gene associations in 12 different species sampled from a broad taxonomic range. The resource allows every association to be inspected visually and can be accessed at http://www.bork.embl.de/ArrayProspector.
Analysis of genomic context: prediction of functional associations from conserved bidirectionally transcribed gene pairs.
Korbel, J.O., Jensen, L.J., von Mering, C. & Bork, P.
Nat Biotechnol 2004 Jul;22(7):911-7.
Several widely used methods for predicting functional associations between proteins are based on the systematic analysis of genomic context. Efforts are ongoing to improve these methods and to search for novel aspects in genomes that could be exploited for function prediction. Here, we use gene expression data to demonstrate two functional implications of genome organization: first, chromosomal proximity indicates gene coregulation in prokaryotes independent of relative gene orientation; and second, adjacent bidirectionally transcribed genes (that is, 'divergently' organized coding regions) with conserved gene orientation are strongly coregulated. We further demonstrate that such bidirectionally transcribed gene pairs are functionally associated and derive from this a novel genomic context method that reliably predicts links between >2,500 pairs of genes in approximately 100 species. Around 650 of these functional associations are supported by other genomic context methods. In most instances, one gene encodes a transcriptional regulator, and the other a nonregulatory protein. In-depth analysis in Escherichia coli shows that the vast majority of these regulators both control transcription of the divergently transcribed target gene/operon and auto-regulate their own biosynthesis. The method thus enables the prediction of target processes and regulatory features for several hundred transcriptional regulators.
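To make the notion of a divergently transcribed gene pair concrete, the sketch below scans gene coordinates on one linear replicon and reports adjacent genes whose orientation is minus strand followed by plus strand. The `Gene` record, the `max_gap` parameter, and the example genes are hypothetical. The step the abstract treats as essential, checking that the orientation is conserved across species, is not implemented here.

```python
from typing import List, NamedTuple, Optional, Tuple

class Gene(NamedTuple):
    name: str
    start: int
    end: int
    strand: str  # '+' or '-'

def divergent_pairs(genes: List[Gene], max_gap: Optional[int] = None) -> List[Tuple[str, str]]:
    """Adjacent gene pairs transcribed away from each other ('-' then '+')."""
    ordered = sorted(genes, key=lambda g: g.start)
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        if left.strand == "-" and right.strand == "+":
            gap = right.start - left.end  # intergenic distance in base pairs
            if max_gap is None or gap <= max_gap:
                pairs.append((left.name, right.name))
    return pairs

# Hypothetical annotation for a short genomic region.
region = [
    Gene("regA", 100, 900, "-"),
    Gene("tgtB", 1000, 2000, "+"),
    Gene("geneC", 2100, 3000, "+"),
    Gene("geneD", 3100, 4000, "-"),
    Gene("geneE", 4500, 5000, "+"),
]
print(divergent_pairs(region))               # both divergent pairs
print(divergent_pairs(region, max_gap=300))  # only pairs with a short intergenic gap
```

The sketch treats the replicon as linear, so a pair spanning the origin of a circular chromosome would be missed. Candidate pairs found this way would still need the cross-species conservation check before being read as functional associations.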
Protein interaction networks from yeast to human.
Bork, P., Jensen, L.J., von Mering, C., Ramani, A.K., Lee, I. & Marcotte, E.M.
Curr Opin Struct Biol 2004 Jun;14(3):292-9.
Protein interaction networks summarize large amounts of protein-protein interaction data, both from individual, small-scale experiments and from automated high-throughput screens. The past year has seen a flood of new experimental data, especially on metazoans, as well as an increasing number of analyses designed to reveal aspects of network topology, modularity and evolution. As only minimal progress has been made in mapping the human proteome using high-throughput screens, the transfer of interaction information within and across species has become increasingly important. With more and more heterogeneous raw data becoming available, proper data integration and quality control have become essential for reliable protein network reconstruction, and will be especially important for reconstructing the human protein interaction network.
BLAST2GENE: a comprehensive conversion of BLAST output into independent genes and gene fragments.
Suyama, M., Torrents, D. & Bork, P.
Bioinformatics 2004 Apr 1.
SUMMARY: BLAST2GENE is a program that allows a detailed analysis of genomic regions containing completely or partially duplicated genes. From a BLAST (or BL2SEQ) comparison of a protein or nucleotide query sequence with any genomic region of interest, BLAST2GENE processes all high scoring pairwise alignments (HSPs) and provides the disposition of all independent copies along the genomic fragment. The results are provided in text and PostScript formats to allow an automatic and visual evaluation of the respective region. AVAILABILITY: The program is available upon request from the authors. A web server of BLAST2GENE is maintained at http://www.bork.embl.de/blast2gene.
Genome sequence of the Brown Norway rat yields insights into mammalian evolution.
Gibbs, R.A., Weinstock, G.M., Metzker, M.L., Muzny, D.M., Sodergren, E.J., Scherer, S., Scott, G., Steffen, D., Worley, K.C., Burch, P.E., Okwuonu, G., Hines, S., Lewis, L., DeRamo, C., Delgado, O., Dugan-Rocha, S., Miner, G., Morgan, M., Hawes, A., Gill, R., Celera, Holt, R.A., Adams, M.D., Amanatides, P.G., Baden-Tillson, H., Barnstead, M., Chin, S., Evans, C.A., Ferriera, S., Fosler, C., Glodek, A., Gu, Z., Jennings, D., Kraft, C.L., Nguyen, T., Pfannkoch, C.M., Sitter, C., Sutton, G.G., Venter, J.C., Woodage, T., Smith, D., Lee, H.M., Gustafson, E., Cahill, P., Kana, A., Doucette-Stamm, L., Weinstock, K., Fechtel, K., Weiss, R.B., Dunn, D.M., Green, E.D., Blakesley, R.W., Bouffard, G.G., De Jong, P.J., Osoegawa, K., Zhu, B., Marra, M., Schein, J., Bosdet, I., Fjell, C., Jones, S., Krzywinski, M., Mathewson, C., Siddiqui, A., Wye, N., McPherson, J., Zhao, S., Fraser, C.M., Shetty, J., Shatsman, S., Geer, K., Chen, Y., Abramzon, S., Nierman, W.C., Havlak, P.H., Chen, R., Durbin, K.J., Egan, A., Ren, Y., Song, X.Z., Li, B., Liu, Y., Qin, X., Cawley, S., Worley, K.C., Cooney, A.J., D'Souza, L.M., Martin, K., Wu, J.Q., Gonzalez-Garay, M.L., Jackson, A.R., Kalafus, K.J., McLeod, M.P., Milosavljevic, A., Virk, D., Volkov, A., Wheeler, D.A., Zhang, Z., Bailey, J.A., Eichler, E.E., Tuzun, E., Birney, E., Mongin, E., Ureta-Vidal, A., Woodwark, C., Zdobnov, E., Bork, P., Suyama, M., Torrents, D., Alexandersson, M., Trask, B.J., Young, J.M., Huang, H., Wang, H., Xing, H., Daniels, S., Gietzen, D., Schmidt, J., Stevens, K., Vitt, U., Wingrove, J., Camara, F., Mar Alba, M., Abril, J.F., Guigo, R., Smit, A., Dubchak, I., Rubin, E.M., Couronne, O., Poliakov, A., Hubner, N., Ganten, D., Goesele, C., Hummel, O., Kreitler, T., Lee, Y.A., Monti, J., Schulz, H., Zimdahl, H., Himmelbauer, H., Lehrach, H., Jacob, H.J., Bromberg, S., Gullings-Handley, J., Jensen-Seaman, M.I., Kwitek, A.E., Lazar, J., Pasko, D., Tonellato, P.J., Twigger, S., Ponting, C.P., Duarte, J.M., Rice, S., 
Goodstadt, L., Beatson, S.A., Emes, R.D., Winter, E.E., Webber, C., Brandt, P., Nyakatura, G., Adetobi, M., Chiaromonte, F., Elnitski, L., Eswara, P., Hardison, R.C., Hou, M., Kolbe, D., Makova, K., Miller, W., Nekrutenko, A., Riemer, C., Schwartz, S., Taylor, J., Yang, S., Zhang, Y., Lindpaintner, K., Andrews, T.D., Caccamo, M., Clamp, M., Clarke, L., Curwen, V., Durbin, R., Eyras, E., Searle, S.M., Cooper, G.M., Batzoglou, S., Brudno, M., Sidow, A., Stone, E.A., Venter, J.C., Payseur, B.A., Bourque, G., Lopez-Otin, C., Puente, X.S., Chakrabarti, K., Chatterji, S., Dewey, C., Pachter, L., Bray, N., Yap, V.B., Caspi, A., Tesler, G., Pevzner, P.A., Haussler, D., Roskin, K.M., Baertsch, R., Clawson, H., Furey, T.S., Hinrichs, A.S., Karolchik, D., Kent, W.J., Rosenbloom, K.R., Trumbower, H., Weirauch, M., Cooper, D.N., Stenson, P.D., Ma, B., Brent, M., Arumugam, M., Shteynberg, D., Copley, R.R., Taylor, M.S., Riethman, H., Mudunuri, U., Peterson, J., Guyer, M., Felsenfeld, A., Old, S., Mockrin, S. & Collins, F.
Nature 2004 Apr 1;428(6982):493-521.
The laboratory rat (Rattus norvegicus) is an indispensable tool in experimental medicine and drug development, having made inestimable contributions to human health. We report here the genome sequence of the Brown Norway (BN) rat strain. The sequence represents a high-quality 'draft' covering over 90% of the genome. The BN rat sequence is the third complete mammalian genome to be deciphered, and three-way comparisons with the human and mouse genomes resolve details of mammalian evolution. This first comprehensive analysis includes genes and proteins and their relation to human disease, repeated sequences, comparative genome-wide studies of mammalian orthologous chromosomal regions and rearrangement breakpoints, reconstruction of ancestral karyotypes and the events leading to existing species, rates of variation, and lineage-specific and lineage-independent evolutionary events such as expansion of gene families, orthology relations and protein evolution.
Structure-based assembly of protein complexes in yeast.
Aloy, P., Böttcher, B., Ceulemans, H., Leutwein, C., Mellwig, C., Fischer, S., Gavin, A.C., Bork, P., Superti-Furga, G., Serrano, L. & Russell, R.B.
Science 2004 Mar 26;303(5666):2026-9.
Images of entire cells are preceding atomic structures of the separate molecular machines that they contain. The resulting gap in knowledge can be partly bridged by protein-protein interactions, bioinformatics, and electron microscopy. Here we use interactions of known three-dimensional structure to model a large set of yeast complexes, which we also screen by electron microscopy. For 54 of 102 complexes, we obtain at least partial models of interacting subunits. For 29, including the exosome, the chaperonin containing TCP-1, a 3'-messenger RNA degradation complex, and RNA polymerase II, the process suggests atomic details not easily seen by homology, involving the combination of two or more known structures. We also consider interactions between complexes (cross-talk) and use these to construct a structure-based network of molecular machines in the cell.
Global analysis of bacterial transcription factors to predict cellular target processes.
Doerks, T., Andrade, M.A., Lathe, W. 3rd, von Mering, C. & Bork, P.
Trends Genet 2004 Mar;20(3):126-31.
Whole-genome sequences are now available for >100 bacterial species, giving unprecedented power to comparative genomics approaches. We have applied genome-context methods to predict target processes that are regulated by transcription factors (TFs). Of 128 orthologous groups of proteins annotated as TFs, to date, 36 are functionally uncharacterized; in our analysis we predict a probable cellular target process or biochemical pathway for half of these functionally uncharacterized TFs.
The HUPO PSI's molecular interaction format--a community standard for the representation of protein interaction data.
Hermjakob, H., Montecchi-Palazzi, L., Bader, G., Wojcik, J., Salwinski, L., Ceol, A., Moore, S., Orchard, S., Sarkans, U., von Mering, C., Roechert, B., Poux, S., Jung, E., Mersch, H., Kersey, P., Lappe, M., Li, Y., Zeng, R., Rana, D., Nikolski, M., Husi, H., Brun, C., Shanker, K., Grant, S.G., Sander, C., Bork, P., Zhu, W., Pandey, A., Brazma, A., Jacq, B., Vidal, M., Sherman, D., Legrain, P., Cesareni, G., Xenarios, I., Eisenberg, D., Steipe, B., Hogue, C. & Apweiler, R.
Nat Biotechnol 2004 Feb;22(2):177-83.
A major goal of proteomics is the complete description of the protein interaction network underlying cell physiology. A large number of small scale and, more recently, large-scale experiments have contributed to expanding our understanding of the nature of the interaction network. However, the necessary data integration across experiments is currently hampered by the fragmentation of publicly available protein interaction data, which exists in different formats in databases, on authors' websites or sometimes only in print publications. Here, we propose a community standard data model for the representation and exchange of protein interaction data. This data model has been jointly developed by members of the Proteomics Standards Initiative (PSI), a work group of the Human Proteome Organization (HUPO), and is supported by major protein interaction data providers, in particular the Biomolecular Interaction Network Database (BIND), Cellzome (Heidelberg, Germany), the Database of Interacting Proteins (DIP), Dana Farber Cancer Institute (Boston, MA, USA), the Human Protein Reference Database (HPRD), Hybrigenics (Paris, France), the European Bioinformatics Institute's (EMBL-EBI, Hinxton, UK) IntAct, the Molecular Interactions (MINT, Rome, Italy) database, the Protein-Protein Interaction Database (PPID, Edinburgh, UK) and the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING, EMBL, Heidelberg, Germany).
RanBP2/Nup358 Provides a Major Binding Site for NXF1-p15 Dimers at the Nuclear Pore Complex and Functions in Nuclear mRNA Export.
Forler, D., Rabut, G., Ciccarelli, F.D., Herold, A., Kocher, T., Niggeweg, R., Bork, P., Ellenberg, J. & Izaurralde, E.
Mol Cell Biol 2004 Feb;24(3):1155-67.
Metazoan NXF1-p15 heterodimers promote the nuclear export of bulk mRNA across nuclear pore complexes (NPCs). In vitro, NXF1-p15 forms a stable complex with the nucleoporin RanBP2/Nup358, a component of the cytoplasmic filaments of the NPC, suggesting a role for this nucleoporin in mRNA export. We show that depletion of RanBP2 from Drosophila cells inhibits proliferation and mRNA export. Concomitantly, the localization of NXF1 at the NPC is strongly reduced and a significant fraction of this normally nuclear protein is detected in the cytoplasm. Under the same conditions, the steady-state subcellular localization of other nuclear or cytoplasmic proteins and CRM1-mediated protein export are not detectably affected, indicating that the release of NXF1 into the cytoplasm and the inhibition of mRNA export are not due to a general defect in NPC function. The specific role of RanBP2 in the recruitment of NXF1 to the NPC is highlighted by the observation that depletion of CAN/Nup214 also inhibits cell proliferation and mRNA export but does not affect NXF1 localization. Our results indicate that RanBP2 provides a major binding site for NXF1 at the cytoplasmic filaments of the NPC, thereby restricting its diffusion in the cytoplasm after NPC translocation. In RanBP2-depleted cells, NXF1 diffuses freely through the cytoplasm. Consequently, the nuclear levels of the protein decrease and export of bulk mRNA is impaired.
The Helmholtz Network for Bioinformatics: an integrative web portal for bioinformatics resources.
Crass, T., Antes, I., Basekow, R., Bork, P., Buning, C., Christensen, M., Claussen, H., Ebeling, C., Ernst, P., Gailus-Durner, V., Glatting, K.H., Gohla, R., Gossling, F., Grote, K., Heidtke, K., Herrmann, A., O'Keeffe, S., Kiesslich, O., Kolibal, S., Korbel, J.O., Lengauer, T., Liebich, I., Van Der Linden, M., Luz, H., Meissner, K., Von Mering, C., Mevissen, H.T., Mewes, H.W., Michael, H., Mokrejs, M., Muller, T., Pospisil, H., Rarey, M., Reich, J.G., Schneider, R., Schomburg, D., Schulze-Kremer, S., Schwarzer, K., Sommer, I., Springstubbe, S., Suhai, S., Thoppae, G., Vingron, M., Warfsmann, J., Werner, T., Wetzler, D., Wingender, E. & Zimmer, R.
Bioinformatics 2004 Jan 22;20(2):268-270.
SUMMARY: The Helmholtz Network for Bioinformatics (HNB) is a joint venture of eleven German bioinformatics research groups that offers convenient access to numerous bioinformatics resources through a single web portal. The 'Guided Solution Finder' which is available through the HNB portal helps users to locate the appropriate resources to answer their queries by employing a detailed, tree-like questionnaire. Furthermore, automated complex tool cascades ('tasks'), involving resources located on different servers, have been implemented, allowing users to perform comprehensive data analyses without the requirement of further manual intervention for data transfer and re-formatting. Currently, automated cascades for the analysis of regulatory DNA segments as well as for the prediction of protein functional properties are provided. AVAILABILITY: The HNB portal is available at http://www.hnbioinfo.de
SMART 4.0: towards genomic data integration.
Letunic, I., Copley, R.R., Schmidt, S., Ciccarelli, F.D., Doerks, T., Schultz, J., Ponting, C.P. & Bork, P.
Nucleic Acids Res 2004 Jan 1;32(1):D142-4.
SMART (Simple Modular Architecture Research Tool) is a web tool (http://smart.embl.de/) for the identification and annotation of protein domains, and provides a platform for the comparative study of complex domain architectures in genes and proteins. The January 2004 release of SMART contains 685 protein domains. New developments in SMART are centred on the integration of data from completed metazoan genomes. SMART now uses predicted proteins from complete genomes in its source sequence databases, and integrates these with predictions of orthology. New visualization tools have been developed to allow analysis of gene intron-exon structure within the context of protein domain structure, and to align these displays to provide schematic comparisons of orthologous genes, or multiple transcripts from the same gene. Other improvements include the ability to query SMART by Gene Ontology terms, improved structure database searching and batch retrieval of multiple entries.
Extracting regulatory gene expression networks from PubMed.
Sarik, J., Jensen, L.J., Ouzounova, R., Rojas, I. & Bork, P.
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics 2004;192-199.
Quality analysis and integration of large scale molecular data sets.
Jensen, L.J. & Bork, P.
Drug Discovery Today (TARGETS) 2004;3:51-56.
Estimating rates of alternative splicing in mammals and invertebrates.
Harrington, E.D., Boue, S., Valcarcel, J., Reich, J.G. & Bork, P.
Nature Genet 2004;36:916-917.
Functional clues for hypothetical proteins based on genomic context analysis in prokaryotes.
Doerks, T., von Mering, C. & Bork, P.
Nucleic Acids Res 2004;32(21):6321-6.
Three integrated genomic context methods were used to annotate uncharacterized proteins in 102 bacterial genomes. Of 7853 orthologous groups with unknown function containing 45,110 proteins, 1738 groups could be linked to functionally associated partners. In many cases, those partners are uncharacterized themselves (hinting at newly identified modules) or have been described in general terms only. However, we were able to assign pathways, cellular processes or physical complexes for 273 groups (encompassing 3624 previously functionally uncharacterized proteins).
Sequences and structures in context.
Bork, P. & Orengo, C.
Curr Opin Struct Biol 2004;14:261-263.
Genome evolution reveals biochemical networks and functional modules.
von Mering, C., Zdobnov, E.M., Tsoka, S., Ciccarelli, F.D., Pereira-Leal, J.B., Ouzounis, C.A. & Bork, P.
Proc Natl Acad Sci U S A 2003 Dec 23;100(26):15428-33.
The analysis of completely sequenced genomes uncovers an astonishing variability between species in terms of gene content and order. During genome history, the genes are frequently rearranged, duplicated, lost, or transferred horizontally between genomes. These events appear to be stochastic, yet they are under selective constraints resulting from the functional interactions between genes. These genomic constraints form the basis for a variety of techniques that employ systematic genome comparisons to predict functional associations among genes. The most powerful techniques to date are based on conserved gene neighborhood, gene fusion events, and common phylogenetic distributions of gene families. Here we show that these techniques, if integrated quantitatively and applied to a sufficiently large number of genomes, have reached a resolution which allows the characterization of function at a higher level than that of the individual gene: global modularity becomes detectable in a functional protein network. In Escherichia coli, the predicted modules can be benchmarked by comparison to known metabolic pathways. We found as many as 74% of the known metabolic enzymes clustering together in modules, with an average pathway specificity of at least 84%. The modules extend beyond metabolism, and have led to hundreds of reliable functional predictions both at the protein and pathway level. The results indicate that modularity in protein networks is intrinsically encoded in present-day genomes.
The PAM domain, a multi-protein complex-associated module with an all-alpha-helix fold.
Ciccarelli, F.D., Izaurralde, E. & Bork, P.
BMC Bioinformatics 2003 Dec 19;4(1):64.
Background: Multimeric protein complexes have a role in many cellular pathways and are highly interconnected with various other proteins. The characterization of their domain composition and organization provides useful information on the specific role of each region of their sequence. Results: We identified a new module, the PAM domain (PCI/PINT associated module), present in single subunits of well characterized multiprotein complexes, like the regulatory lid of the 26S proteasome, the COP-9 signalosome and the Sac3-Thp1 complex. This module is an around 200 residue long domain with a predicted TPR-like all-alpha-helical fold. Conclusions: The occurrence of the PAM domain in specific subunits of multimeric protein complexes, together with the role of other all-alpha-helical folds in protein-protein interactions, suggest a function for this domain in mediating transient binding to diverse target proteins.
Impact of selection, mutation rate and genetic drift on human genetic variation.
Sunyaev, S., Kondrashov, F.A., Bork, P. & Ramensky, V.
Hum Mol Genet 2003 Dec 15;12(24):3325-30.
The accumulation of genome-wide information on single nucleotide polymorphisms in humans provides an unprecedented opportunity to detect the evolutionary forces responsible for heterogeneity of the level of genetic variability across loci. Previous studies have shown that history of recombination events has produced long haplotype blocks in the human genome, which contribute to this heterogeneity. Other factors, however, such as natural selection or the heterogeneity of mutation rates across loci, may also lead to heterogeneity of genetic variability. We compared synonymous and non-synonymous variability within human genes with their divergence from murine orthologs. We separately analyzed the non-synonymous variants predicted to damage protein structure or function and the variants predicted to be functionally benign. The predictions were based on comparative sequence analysis and, in some cases, on the analysis of protein structure. A strong correlation between non-synonymous, benign variability and non-synonymous human-mouse divergence suggests that selection played an important role in shaping the pattern of variability in coding regions of human genes. However, the lack of correlation between deleterious variability and evolutionary divergence shows that a substantial proportion of the observed non-synonymous single-nucleotide polymorphisms reduces fitness and never reaches fixation. Evolutionary and medical implications of the impact of selection on human polymorphisms are discussed.
A genome-wide survey of human pseudogenes.
Torrents, D., Suyama, M., Zdobnov, E. & Bork, P.
Genome Res 2003 Dec;13(12):2559-67.
We screened all intergenic regions in the human genome to identify pseudogenes with a combination of homology searches and a functionality test using the ratio of silent to replacement nucleotide substitutions (KA/KS). We identified 19,724 regions of which 95% +/- 3% are estimated to evolve neutrally and thus are likely to encode pseudogenes. Half of these have no detectable truncation in their pseudocoding regions and therefore are not identifiable by methods that require the presence of truncations to prove nonfunctionality. A comparative analysis with the mouse genome showed that 70% of these pseudogenes have a retrotranspositional origin (processed), and the rest arose by segmental duplication (nonprocessed). Although the spread of both types of pseudogenes correlates with chromosome size, nonprocessed pseudogenes appear to be enriched in regions with high gene density. It is likely that the human pseudogenes identified here represent only a small fraction of the total, which probably exceeds the number of genes.
Alternative splicing and evolution.
Boue, S., Letunic, I. & Bork, P.
Bioessays 2003 Nov;25(11):1031-4.
Alternative splicing (AS) is a critical post-transcriptional event leading to an increase in transcriptome diversity. Recent bioinformatics studies revealed a high frequency of alternative splicing. Although the extent of AS conservation among mammals is still being discussed, it has been argued that major forms of alternatively spliced transcripts are much better conserved than minor forms. This suggests that alternative splicing plays a major role in genome evolution, allowing new exons to evolve with less constraint.
A comprehensive set of protein complexes in yeast: mining large scale protein-protein interaction screens.
Krause, R., von Mering, C. & Bork, P.
Bioinformatics 2003 Oct 12;19(15):1901-8.
MOTIVATION: The analysis of protein-protein interactions allows for detailed exploration of the cellular machinery. The biochemical purification of protein complexes followed by identification of components by mass spectrometry is currently the method that delivers the most reliable information, although the data sets are still difficult to interpret. Consolidating individual experiments into protein complexes, especially for high-throughput screens, is complicated by many contaminants, the occurrence of proteins in otherwise dissimilar purifications due to functional re-use and technical limitations in the detection. A non-redundant collection of protein complexes from experimental data would be useful for biological interpretation, but manual assembly is tedious and often inconsistent. RESULTS: Here, we introduce a measure to define similarity within collections of purifications and generate a set of minimally redundant, comprehensive complexes using unsupervised clustering. AVAILABILITY: Programs and results are freely available from http://www.bork.embl-heidelberg.de/Docu/purclust/
Nonsense-mediated mRNA decay in Drosophila: at the intersection of the yeast and mammalian pathways.
Gatfield, D., Unterholzner, L., Ciccarelli, F.D., Bork, P. & Izaurralde, E.
EMBO J 2003 Aug 1;22(15):3960-70.
The nonsense-mediated mRNA decay (NMD) pathway promotes the rapid degradation of mRNAs containing premature stop codons (PTCs). In Caenorhabditis elegans, seven genes (smg1-7) playing an essential role in NMD have been identified. Only SMG2-4 (known as UPF1-3) have orthologs in Saccharomyces cerevisiae. Here we show that the Drosophila orthologs of UPF1-3, SMG1, SMG5 and SMG6 are required for the degradation of PTC-containing mRNAs, but that there is no SMG7 ortholog in this organism. In contrast, orthologs of SMG5-7 are encoded by the human genome and all three are required for NMD. In human cells, exon boundaries have been shown to play a critical role in defining PTCs. This role is mediated by components of the exon junction complex (EJC). Contrary to expectation, however, we show that the components of the EJC are dispensable for NMD in Drosophila cells. Consistently, PTC definition occurs independently of exon boundaries in Drosophila. Our findings reveal that despite conservation of the NMD machinery, different mechanisms have evolved to discriminate premature from natural stop codons in metazoa.
The DNA sequence of human chromosome 7.
Hillier, L.W., Fulton, R.S., Fulton, L.A., Graves, T.A., Pepin, K.H., Wagner-McPherson, C., Layman, D., Maas, J., Jaeger, S., Walker, R., Wylie, K., Sekhon, M., Becker, M.C., O'Laughlin, M.D., Schaller, M.E., Fewell, G.A., Delehaunty, K.D., Miner, T.L., Nash, W.E., Cordes, M., Du, H., Sun, H., Edwards, J., Bradshaw-Cordum, H., Ali, J., Andrews, S., Isak, A., Vanbrunt, A., Nguyen, C., Du, F., Lamar, B., Courtney, L., Kalicki, J., Ozersky, P., Bielicki, L., Scott, K., Holmes, A., Harkins, R., Harris, A., Strong, C.M., Hou, S., Tomlinson, C., Dauphin-Kohlberg, S., Kozlowicz-Reilly, A., Leonard, S., Rohlfing, T., Rock, S.M., Tin-Wollam, A.M., Abbott, A., Minx, P., Maupin, R., Strowmatt, C., Latreille, P., Miller, N., Johnson, D., Murray, J., Woessner, J.P., Wendl, M.C., Yang, S.P., Schultz, B.R., Wallis, J.W., Spieth, J., Bieri, T.A., Nelson, J.O., Berkowicz, N., Wohldmann, P.E., Cook, L.L., Hickenbotham, M.T., Eldred, J., Williams, D., Bedell, J.A., Mardis, E.R., Clifton, S.W., Chissoe, S.L., Marra, M.A., Raymond, C., Haugen, E., Gillett, W., Zhou, Y., James, R., Phelps, K., Iadanoto, S., Bubb, K., Simms, E., Levy, R., Clendenning, J., Kaul, R., Kent, W.J., Furey, T.S., Baertsch, R.A., Brent, M.R., Keibler, E., Flicek, P., Bork, P., Suyama, M., Bailey, J.A., Portnoy, M.E., Torrents, D., Chinwalla, A.T., Gish, W.R., Eddy, S.R., McPherson, J.D., Olson, M.V., Eichler, E.E., Green, E.D., Waterston, R.H. & Wilson, R.K.
Nature 2003 Jul 10;424(6945):157-64.
Human chromosome 7 has historically received prominent attention in the human genetics community, primarily related to the search for the cystic fibrosis gene and the frequent cytogenetic changes associated with various forms of cancer. Here we present more than 153 million base pairs representing 99.4% of the euchromatic sequence of chromosome 7, the first metacentric chromosome completed so far. The sequence has excellent concordance with previously established physical and genetic maps, and it exhibits an unusual amount of segmentally duplicated sequence (8.2%), with marked differences between the two arms. Our initial analyses have identified 1,150 protein-coding genes, 605 of which have been confirmed by complementary DNA sequences, and an additional 941 pseudogenes. Of genes confirmed by transcript sequences, some are polymorphic for mutations that disrupt the reading frame.
ELM server: a new resource for investigating short functional sites in modular eukaryotic proteins.
Puntervoll, P., Linding, R., Gemund, C., Chabanis-Davidson, S., Mattingsdal, M., Cameron, S., Martin, D.M., Ausiello, G., Brannetti, B., Costantini, A., Ferre, F., Maselli, V., Via, A., Cesareni, G., Diella, F., Superti-Furga, G., Wyrwicz, L., Ramu, C., McGuigan, C., Gudavalli, R., Letunic, I., Bork, P., Rychlewski, L., Kuster, B., Helmer-Citterich, M., Hunter, W.N., Aasland, R. & Gibson, T.J.
Nucleic Acids Res 2003 Jul 1;31(13):3625-30.
Multidomain proteins predominate in eukaryotic proteomes. Individual functions assigned to different sequence segments combine to create a complex function for the whole protein. While on-line resources are available for revealing globular domains in sequences, there has hitherto been no comprehensive collection of small functional sites/motifs comparable to the globular domain resources, yet these are as important for the function of multidomain proteins. Short linear peptide motifs are used for cell compartment targeting, protein-protein interaction, regulation by phosphorylation, acetylation, glycosylation and a host of other post-translational modifications. ELM, the Eukaryotic Linear Motif server at http://elm.eu.org/, is a new bioinformatics resource for investigating candidate short non-globular functional motifs in eukaryotic proteins, aiming to fill the void in bioinformatics tools. Sequence comparisons with short motifs are difficult to evaluate because the usual significance assessments are inappropriate. Therefore the server is implemented with several logical filters to eliminate false positives. Current filters are for cell compartment, globular domain clash and taxonomic range. In favourable cases, the filters can reduce the number of retained matches by an order of magnitude or more.
Update on XplorMed: A web server for exploring scientific literature.
Perez-Iratxeta, C., Perez, A.J., Bork, P. & Andrade, M.A.
Nucleic Acids Res 2003 Jul 1;31(13):3866-8.
As scientific literature databases like MEDLINE increase in size, so does the time required to search them. Scientists must frequently inspect long lists of references manually, often just reading the titles. XplorMed is a web tool that aids MEDLINE searching by summarizing the subjects contained in the results, thus allowing users to focus on subjects of interest. Here we describe new features added to XplorMed during the last 2 years (http://www.bork.embl-heidelberg.de/xplormed/).
Systematic discovery of analogous enzymes in thiamin biosynthesis.
Morett, E., Korbel, J.O., Rajan, E., Saab-Rincon, G., Olvera, L., Olvera, M., Schmidt, S., Snel, B. & Bork, P.
Nat Biotechnol 2003 Jul;21(7):790-5.
In all genome-sequencing projects completed to date, a considerable number of 'gaps' have been found in the biochemical pathways of the respective species. In many instances, missing enzymes are displaced by analogs, functionally equivalent proteins that have evolved independently and lack sequence and structural similarity. Here we fill such gaps by analyzing anticorrelating occurrences of genes across species. Our approach, applied to the thiamin biosynthesis pathway comprising approximately 15 catalytic steps, predicts seven instances in which known enzymes have been displaced by analogous proteins. So far we have verified four predictions by genetic complementation, including three proteins for which there was no previous experimental evidence of a role in the thiamin biosynthesis pathway. For one hypothetical protein, biochemical characterization confirmed the predicted thiamin phosphate synthase (ThiE) activity. The results demonstrate the ability of our computational approach to predict specific functions without taking into account sequence similarity.
The KIND module: a putative signalling domain evolved from the C lobe of the protein kinase fold.
Ciccarelli, F.D., Bork, P. & Kerkhoff, E.
Trends Biochem Sci 2003 Jul;28(7):349-52.
Metabolites: a helping hand for pathway evolution?
Schmidt, S., Sunyaev, S., Bork, P. & Dandekar, T.
Trends Biochem Sci 2003 Jun;28(6):336-41.
The evolution of enzymes and pathways is under debate. Recent studies show that recruitment of single enzymes from different pathways could be the driving force for pathway evolution. Other mechanisms of evolution, such as pathway duplication, enzyme specialization, de novo invention of pathways or retro-evolution of pathways, appear to be less abundant. Twenty percent of enzyme superfamilies are quite variable, not only in changing reaction chemistry or metabolite type but in changing both at the same time. These variable superfamilies account for nearly half of all known reactions. The most frequently occurring metabolites provide a helping hand for such changes because they can be accommodated by many enzyme superfamilies. Thus, a picture is emerging in which new pathways are evolving from central metabolites by preference, thereby keeping the overall topology of the metabolic network.
Information extraction from full text scientific articles: where are the keywords?
Shah, P.K., Perez-Iratxeta, C., Bork, P. & Andrade, M.A.
BMC Bioinformatics 2003 May 29;4(1):20.
BACKGROUND: To date, many of the methods for information extraction of biological information from scientific articles are restricted to the abstract of the article. However, full text articles in electronic version, which offer larger sources of data, are currently available. Several questions arise as to whether the effort of scanning full text articles is worthy, or whether the information that can be extracted from the different sections of an article can be relevant. RESULTS: In this work we addressed those questions showing that the keyword content of the different sections of a standard scientific article (abstract, introduction, methods, results, and discussion) is very heterogeneous. CONCLUSIONS: Although the abstract contains the best ratio of keywords per total of words, other sections of the article may be a better source of biologically relevant data.
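The "ratio of keywords per total of words" compared across sections can be sketched as follows. The tokenization rule, the keyword set, and the section texts are all hypothetical; the abstract does not describe how the authors tokenized text or chose keywords.

```python
import re


def keyword_stats(text, keywords):
    """Return (keyword hits, total words, hits / total) for one section."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    hits = sum(1 for word in words if word in keywords)
    ratio = hits / len(words) if words else 0.0
    return hits, len(words), ratio


# Hypothetical keyword list and section texts, for illustration only.
keywords = {"kinase", "domain", "phosphorylation"}
sections = {
    "abstract": "Kinase binds the kinase domain",
    "results": "We observed phosphorylation of the domain in all samples "
               "and the kinase was active under every condition tested here",
}

for name, text in sections.items():
    hits, total, ratio = keyword_stats(text, keywords)
    print(f"{name}: {hits} keywords in {total} words (ratio {ratio:.2f})")
```

In this toy input the abstract has the higher ratio while the longer section has the same number of keyword hits spread over more words, which mirrors the distinction the abstract draws between keyword density and the absolute amount of extractable information.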
The way we write.
Netzel, R., Perez-Iratxeta, C., Bork, P. & Andrade, M.A.
EMBO Rep 2003 May;4(5):446-51.
Pathogenesis of DNA repair-deficient cancers: a statistical meta-analysis of putative Real Common Target genes.
Woerner, S.M., Benner, A., Sutter, C., Schiller, M., Yuan, Y.P., Keller, G., Bork, P., Doeberitz, M.K. & Gebert, J.F.
Oncogene 2003 Apr 17;22(15):2226-35.
DNA mismatch repair deficiency is observed in about 15% of human colorectal, gastric, and endometrial tumors and at lower frequencies in a minority of other tumors; it causes insertion/deletion mutations at short repetitive sequences, recognized as microsatellite instability (MSI). Evolution of tumors, including those with MSI, is a continuous process of mutation and selection favoring neoplastic growth. Mutations in microsatellite-bearing genes that promote tumor cell growth in general (Real Common Target genes) are assumed to be the driving force during MSI carcinogenesis. Thus, microsatellite mutations in these genes should occur more frequently than mutations in microsatellite genes without contribution to malignancy (ByStander genes). So far, only a few Real Common Target genes have been identified by functional studies. Thus, comprehensive analysis of microsatellite mutations will provide important clues to the understanding of MSI-driven carcinogenesis. Here, we evaluated published mutation frequencies on 194 repeat tracts in 137 genes in MSI-H colorectal, endometrial, and gastric carcinomas and propose a statistical model that aims to identify Real Common Target genes. According to our model, nine genes including BAX and TGFbetaRII were identified as Real Common Targets in colorectal cancer, one gene in gastric cancer, and three genes in endometrial cancer. Microsatellite mutations in five additional genes seem to be counterselected in gastrointestinal tumors. Overall, its general applicability, its capacity for unlimited data analysis, and its inclusion of mutation data generated by different groups on different sets of tumors make this model a useful tool for predicting Real Common Target genes with specificity for MSI-H tumors of different organs, guiding subsequent functional studies to the most likely targets among numerous microsatellite-harboring genes.
Function prediction and protein networks.
Huynen, M.A., Snel, B., Mering, C. & Bork, P.
Curr Opin Cell Biol 2003 Apr;15(2):191-8.
In the genomics era, the interactions between proteins are at the center of attention. Genomic-context methods used to predict these interactions have been put on a quantitative basis, revealing that they are at least on an equal footing with genomics experimental data. A survey of experimentally confirmed predictions proves the applicability of these methods, and new concepts to predict protein interactions in eukaryotes have been described. Finally, the interaction networks that can be obtained by combining the predicted pair-wise interactions have enough internal structure to detect higher levels of organization, such as 'functional modules'.
The identification of a conserved domain in both spartin and spastin, mutated in hereditary spastic paraplegia.
Ciccarelli, F.D., Proukakis, C., Patel, H., Cross, H., Azam, S., Patton, M.A., Bork, P. & Crosby, A.H.
Genomics 2003 Apr;81(4):437-41.
Multiple sequence alignment has revealed the presence of a sequence domain of approximately 80 amino acids in two molecules, spartin and spastin, mutated in hereditary spastic paraplegia. The domain, which corresponds to a slightly extended version of the recently described ESP domain of unknown function, was also identified in VPS4, SKD1, RPK118, and SNX15, all of which have a well established and consistent role in endosomal trafficking. Recent functional information indicates that spastin is likely to be involved in microtubule interaction. With this new information relating to its likely function, we propose the more descriptive name 'MIT' (contained within microtubule-interacting and trafficking molecules) for the domain and predict endosomal trafficking as the principal functionality of all molecules in which it is present.
Increase of functional diversity by alternative splicing.
Kriventseva, E.V., Koch, I., Apweiler, R., Vingron, M., Bork, P., Gelfand, M.S. & Sunyaev, S.
Trends Genet 2003 Mar;19(3):124-8.
A large-scale analysis of protein isoforms arising from alternative splicing shows that alternative splicing tends to insert or delete complete protein domains more frequently than expected by chance, whereas disruption of domains and other structural modules is less frequent. If domain regions are disrupted, the functional effect, as predicted from 3D structure, is frequently equivalent to removal of the entire domain. Also, short alternative splicing events within domains, which might preserve folded structure, target functional residues more frequently than expected. Thus, it seems that positive selection has had a major role in the evolution of alternative splicing.
Bioinformatics in the post-sequence era.
Kanehisa, M. & Bork, P.
Nat Genet 2003 Mar;33 Suppl:305-10.
In the past decade, bioinformatics has become an integral part of research and development in the biomedical sciences. Bioinformatics now has an essential role both in deciphering genomic, transcriptomic and proteomic data generated by high-throughput experimental technologies and in organizing information gathered from traditional biology. Sequence-based methods of analyzing individual genes or proteins have been elaborated and expanded, and methods have been developed for analyzing large numbers of genes or proteins simultaneously, such as in the identification of clusters of related genes and networks of interacting proteins. With the complete genome sequences for an increasing number of organisms at hand, bioinformatics is beginning to provide both conceptual bases and practical methods for detecting systemic functional behaviors of the cell and the organism.
STRING: a database of predicted functional associations between proteins.
von Mering, C., Huynen, M., Jaeggi, D., Schmidt, S., Bork, P. & Snel, B.
Nucleic Acids Res 2003 Jan 1;31(1):258-61.
Functional links between proteins can often be inferred from genomic associations between the genes that encode them: groups of genes that are required for the same function tend to show similar species coverage, are often located in close proximity on the genome (in prokaryotes), and tend to be involved in gene-fusion events. The database STRING is a precomputed global resource for the exploration and analysis of these associations. Since the three types of evidence differ conceptually, and the number of predicted interactions is very large, it is essential to be able to assess and compare the significance of individual predictions. Thus, STRING contains a unique scoring-framework based on benchmarks of the different types of associations against a common reference set, integrated in a single confidence score per prediction. The graphical representation of the network of inferred, weighted protein interactions provides a high-level view of functional linkage, facilitating the analysis of modularity in biological processes. STRING is updated continuously, and currently contains 261 033 orthologs in 89 fully sequenced genomes. The database predicts functional interactions at an expected level of accuracy of at least 80% for more than half of the genes; it is online at http://www.bork.embl-heidelberg.de/STRING/.
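The idea of integrating several evidence types into a single confidence score can be sketched as below. The abstract does not give the integration formula; the sketch assumes that each evidence type yields an independent probability that the association is real and combines them as 1 - Π(1 - s_i). The function name and the example scores are hypothetical.

```python
def combined_confidence(scores):
    """Combine per-evidence probabilities under an independence assumption.

    Each score is the estimated probability that one evidence type
    (e.g. shared species coverage, gene neighbourhood, gene fusion)
    correctly supports the association.
    """
    remaining_doubt = 1.0
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
        remaining_doubt *= 1.0 - score
    return 1.0 - remaining_doubt


# Hypothetical benchmarked scores for one protein pair.
evidence = {"co-occurrence": 0.5, "neighbourhood": 0.5, "fusion": 0.0}
print(combined_confidence(evidence.values()))
```

Under this assumption, two moderately confident and independent lines of evidence yield a higher combined score than either alone, and an evidence type with score 0 leaves the result unchanged.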
The InterPro Database, 2003 brings increased coverage and new features.
Mulder, N.J., Apweiler, R., Attwood, T.K., Bairoch, A., Barrell, D., Bateman, A., Binns, D., Biswas, M., Bradley, P., Bork, P., Bucher, P., Copley, R.R., Courcelle, E., Das, U., Durbin, R., Falquet, L., Fleischmann, W., Griffiths-Jones, S., Haft, D., Harte, N., Hulo, N., Kahn, D., Kanapin, A., Krestyaninova, M., Lopez, R., Letunic, I., Lonsdale, D., Silventoinen, V., Orchard, S.E., Pagni, M., Peyruc, D., Ponting, C.P., Selengut, J.D., Servant, F., Sigrist, C.J., Vaughan, R. & Zdobnov, E.M.
Nucleic Acids Res 2003 Jan 1;31(1):315-8.
InterPro, an integrated documentation resource of protein families, domains and functional sites, was created in 1999 as a means of amalgamating the major protein signature databases into one comprehensive resource. PROSITE, Pfam, PRINTS, ProDom, SMART and TIGRFAMs have been manually integrated and curated and are available in InterPro for text- and sequence-based searching. The results are provided in a single format that rationalises the results that would be obtained by searching the member databases individually. The latest release of InterPro contains 5629 entries describing 4280 families, 1239 domains, 95 repeats and 15 post-translational modifications. Currently, the combined signatures in InterPro cover more than 74% of all proteins in SWISS-PROT and TrEMBL, an increase of nearly 15% since the inception of InterPro. New features of the database include improved searching capabilities and enhanced graphical user interfaces for visualisation of the data. The database is available via a webserver (http://www.ebi.ac.uk/interpro) and anonymous FTP (ftp://ftp.ebi.ac.uk/pub/databases/interpro).
The human genome: genes, pseudogenes, and variation on chromosome 7.
Waterston, R.H., Hillier, L.W., Fulton, L.A., Fulton, R.S., Graves, T.A., Pepin, K.H., Bork, P., Suyama, M., Torrents, D., Chinwalla, A.T., Mardis, E.R., McPherson, J.D. & Wilson, R.K.
Cold Spring Harb Symp Quant Biol 2003;68:13-22.
Detection and characterization of pseudogenes.
Torrents, D., Suyama, M. & Bork, P.
In "Bioinformatics and Genomes." M. Andrade (Ed.). Horizon Sci. Press, 197-209
Nanoelectrospray tandem mass spectrometry and sequence similarity searching for identification of proteins from organisms with unknown genomes.
Shevchenko, A., Sunyaev, S., Liska, A., Bork, P. & Shevchenko, A.
Methods Mol Biol 2003;211:221-34.
High rate of gene displacement in vitamin biosynthetic pathways.
Morett, E., Saab-Rincon, G., Merino, E., Bork, P., Rajan, E., Olvera, L. & Olvera, M.
In "Bioinformatics and Genomes." M. Andrade (Ed.). Horizon Sci. Press, 69-79
Initial sequencing and comparative analysis of the mouse genome.
Waterston, R.H., Lindblad-Toh, K., Birney, E., et al.
Nature 2002 Dec 5;420(6915):520-62.
The sequence of the mouse genome is a key informational tool for understanding the contents of the human genome and a key experimental tool for biomedical research. Here, we report the results of an international collaboration to produce a high-quality draft sequence of the mouse genome. We also present an initial comparative analysis of the mouse and human genomes, describing some of the insights that can be gleaned from the two sequences. We discuss topics including the analysis of the evolutionary forces shaping the size, structure and sequence of the genomes; the conservation of large-scale synteny across most of the genomes; the much lower extent of sequence orthology covering less than half of the genomes; the proportions of the genomes under selection; the number of protein-coding genes; the expansion of gene families related to reproduction and immunity; the evolution of proteins; and the identification of intraspecies polymorphism.
Identification and characterization of UEV3, a human cDNA with similarities to inactive E2 ubiquitin-conjugating enzymes.
Kloor, M., Bork, P., Duwe, A., Klaes, R., von Knebel Doeberitz, M. & Ridder, R.
Biochim Biophys Acta 2002 Dec 12;1579(2-3):219-24.
Recent studies have shown that ubiquitination is an essential factor in endosomal sorting and virus assembly. The human TSG101 gene has been demonstrated to belong to a group of genes coding for apparently inactive E2 ubiquitin-conjugating enzymes, which exert regulatory effects on E2 activity in cellular ubiquitination processes. In this study, a novel human cDNA (UEV3) encoding a putative protein of 379 amino acids was isolated from a human placenta library that may represent a partial paralogue of human TSG101. The predicted protein contains an N-terminal domain homologous to the catalytic domain of ubiquitin-conjugating enzymes (Ubc), which is fused to a sequence showing significant homology to members of the lactate dehydrogenase protein family. The UEV3 gene is located on chromosome 11 closely adjacent to TSG101 and LDH-C. Northern blot and UEV3-specific reverse transcription/polymerase chain reaction (RT/PCR) analyses of various colon carcinoma cell lines as well as both normal and tumor samples from colon revealed an expression of the UEV3 cDNA in all tested samples.
Comparative genome and proteome analysis of Anopheles gambiae and Drosophila melanogaster.
Zdobnov, E.M., von Mering, C., Letunic, I., Torrents, D., Suyama, M., Copley, R.R., Christophides, G.K., Thomasova, D., Holt, R.A., Subramanian, G.M., Mueller, H.M., Dimopoulos, G., Law, J.H., Wells, M.A., Birney, E., Charlab, R., Halpern, A.L., Kokoza, E., Kraft, C.L., Lai, Z., Lewis, S., Louis, C., Barillas-Mury, C., Nusskern, D., Rubin, G.M., Salzberg, S.L., Sutton, G.G., Topalis, P., Wides, R., Wincker, P., Yandell, M., Collins, F.H., Ribeiro, J., Gelbart, W.M., Kafatos, F.C. & Bork, P.
Science 2002 Oct 4;298(5591):149-59.
Comparison of the genomes and proteomes of the two diptera Anopheles gambiae and Drosophila melanogaster, which diverged about 250 million years ago, reveals considerable similarities. However, numerous differences are also observed; some of these must reflect the selection and subsequent adaptation associated with different ecologies and life strategies. Almost half of the genes in both genomes are interpreted as orthologs and show an average sequence identity of about 56%, which is slightly lower than that observed between the orthologs of the pufferfish and human (diverged about 450 million years ago). This indicates that these two insects diverged considerably faster than vertebrates. Aligned sequences reveal that orthologous genes have retained only half of their intron/exon structure, indicating that intron gains or losses have occurred at a rate of about one per gene per 125 million years. Chromosomal arms exhibit significant remnants of homology between the two species, although only 34% of the genes colocalize in small "microsyntenic" clusters, and major interarm transfers as well as intra-arm shuffling of gene order are detected.
The genome sequence of the malaria mosquito Anopheles gambiae.
Holt, R.A., Subramanian, G.M., Halpern, A., Sutton, G.G., Charlab, R., Nusskern, D.R., Wincker, P., Clark, A.G., Ribeiro, J.M., Wides, R., Salzberg, S.L., Loftus, B., Yandell, M., Majoros, W.H., Rusch, D.B., Lai, Z., Kraft, C.L., Abril, J.F., Anthouard, V., Arensburger, P., Atkinson, P.W., Baden, H., de Berardinis, V., Baldwin, D., Benes, V., Biedler, J., Blass, C., Bolanos, R., Boscus, D., Barnstead, M., Cai, S., Center, A., Chatuverdi, K., Christophides, G.K., Chrystal, M.A., Clamp, M., Cravchik, A., Curwen, V., Dana, A., Delcher, A., Dew, I., Evans, C.A., Flanigan, M., Grundschober-Freimoser, A., Friedli, L., Gu, Z., Guan, P., Guigo, R., Hillenmeyer, M.E., Hladun, S.L., Hogan, J.R., Hong, Y.S., Hoover, J., Jaillon, O., Ke, Z., Kodira, C., Kokoza, E., Koutsos, A., Letunic, I., Levitsky, A., Liang, Y., Lin, J.J., Lobo, N.F., Lopez, J.R., Malek, J.A., McIntosh, T.C., Meister, S., Miller, J., Mobarry, C., Mongin, E., Murphy, S.D., O'Brochta, D.A., Pfannkoch, C., Qi, R., Regier, M.A., Remington, K., Shao, H., Sharakhova, M.V., Sitter, C.D., Shetty, J., Smith, T.J., Strong, R., Sun, J., Thomasova, D., Ton, L.Q., Topalis, P., Tu, Z., Unger, M.F., Walenz, B., Wang, A., Wang, J., Wang, M., Wang, X., Woodford, K.J., Wortman, J.R., Wu, M., Yao, A., Zdobnov, E.M., Zhang, H., Zhao, Q., Zhao, S., Zhu, S.C., Zhimulev, I., Coluzzi, M., della Torre, A., Roth, C.W., Louis, C., Kalush, F., Mural, R.J., Myers, E.W., Adams, M.D., Smith, H.O., Broder, S., Gardner, M.J., Fraser, C.M., Birney, E., Bork, P., Brey, P.T., Venter, J.C., Weissenbach, J., Kafatos, F.C., Collins, F.H. & Hoffman, S.L.
Science 2002 Oct 4;298(5591):129-49.
Anopheles gambiae is the principal vector of malaria, a disease that afflicts more than 500 million people and causes more than 1 million deaths each year. Tenfold shotgun sequence coverage was obtained from the PEST strain of A. gambiae and assembled into scaffolds that span 278 million base pairs. A total of 91% of the genome was organized in 303 scaffolds; the largest scaffold was 23.1 million base pairs. There was substantial genetic variation within this strain, and the apparent existence of two haplotypes of approximately equal frequency ("dual haplotypes") in a substantial fraction of the genome likely reflects the outbred nature of the PEST strain. The sequence produced a conservative inference of more than 400,000 single-nucleotide polymorphisms that showed a markedly bimodal density distribution. Analysis of the genome sequence revealed strong evidence for about 14,000 protein-encoding transcripts. Prominent expansions in specific families of proteins likely involved in cell adhesion and immunity were noted. An expressed sequence tag analysis of genes regulated by blood feeding provided insights into the physiological adaptations of a hematophagous insect.
Immunity-related genes and gene families in Anopheles gambiae.
Christophides, G.K., Zdobnov, E., Barillas-Mury, C., Birney, E., Blandin, S., Blass, C., Brey, P.T., Collins, F.H., Danielli, A., Dimopoulos, G., Hetru, C., Hoa, N.T., Hoffmann, J.A., Kanzok, S.M., Letunic, I., Levashina, E.A., Loukeris, T.G., Lycett, G., Meister, S., Michel, K., Moita, L.F., Muller, H.M., Osta, M.A., Paskewitz, S.M., Reichhart, J.M., Rzhetsky, A., Troxler, L., Vernick, K.D., Vlachou, D., Volz, J., von Mering, C., Xu, J., Zheng, L., Bork, P. & Kafatos, F.C.
Science 2002 Oct 4;298(5591):159-65.
We have identified 242 Anopheles gambiae genes from 18 gene families implicated in innate immunity and have detected marked diversification relative to Drosophila melanogaster. Immune-related gene families involved in recognition, signal modulation, and effector systems show a marked deficit of orthologs and excessive gene expansions, possibly reflecting selection pressures from different pathogens encountered in these insects' very different life-styles. In contrast, the multifunctional Toll signal transduction pathway is substantially conserved, presumably because of counterselection for developmental stability. Representative expression profiles confirm that sequence diversification is accompanied by specific responses to different immune challenges. Alternative RNA splicing may also contribute to expansion of the immune repertoire.
The genome sequence of Bifidobacterium longum reflects its adaptation to the human gastrointestinal tract.
Schell, M.A., Karmirantzou, M., Snel, B., Vilanova, D., Berger, B., Pessi, G., Zwahlen, M.C., Desiere, F., Bork, P., Delley, M., Pridmore, R.D. & Arigoni, F.
Proc Natl Acad Sci U S A 2002 Oct 29;99(22):14422-7.
Bifidobacteria are Gram-positive prokaryotes that naturally colonize the human gastrointestinal tract (GIT) and vagina. Although not numerically dominant in the complex intestinal microflora, they are considered key commensals that promote a healthy GIT. We determined the 2.26-Mb genome sequence of an infant-derived strain of Bifidobacterium longum, and identified 1,730 possible coding sequences organized in a 60%-GC circular chromosome. Bioinformatic analysis revealed several physiological traits that could partially explain the successful adaptation of this bacterium to the colon. An unexpectedly large number of the predicted proteins appeared to be specialized for catabolism of a variety of oligosaccharides, some possibly released by rare or novel glycosyl hydrolases acting on "nondigestible" plant polymers or host-derived glycoproteins and glycoconjugates. This ability to scavenge from a large variety of nutrients likely contributes to the competitiveness and persistence of bifidobacteria in the colon. Many genes for oligosaccharide metabolism were found in self-regulated modules that appear to have arisen in part from gene duplication or horizontal acquisition. Complete pathways for all amino acids, nucleotides, and some key vitamins were identified; however, routes for Asp and Cys were atypical. More importantly, genome analysis provided insights into the reciprocal interactions of bifidobacteria with their hosts. We identified polypeptides that showed homology to most major proteins needed for production of glycoprotein-binding fimbriae, structures that could possibly be important for adhesion and persistence in the GIT. We also found a eukaryotic-type serine protease inhibitor (serpin) possibly involved in the reported immunomodulatory activity of bifidobacteria.
Comparative analysis of protein interaction networks.
Bioinformatics 2002 Oct;18 Suppl 2:S64.
Recent advances in proteomics and computational biology have led to a flood of protein interaction data and resulting interaction networks (e.g. Gavin et al., 2002). Here I first analyse the status and quality of parts lists (genes and proteins), then comparatively assess large-scale protein interaction data (von Mering et al., 2002) and finally try to identify biologically meaningful units (e.g. pathways, cellular processes) within interaction networks that are derived from the conservation of gene neighborhood (Snel et al., 2002). Possible extensions of gene neighborhood analysis to eukaryotes (von Mering and Bork, 2002) will be discussed.
Human non-synonymous SNPs: server and survey.
Ramensky, V., Bork, P. & Sunyaev, S.
Nucleic Acids Res 2002 Sep 1;30(17):3894-900.
Human single nucleotide polymorphisms (SNPs) represent the most frequent type of human population DNA variation. One of the main goals of SNP research is to understand the genetics of the human phenotype variation and especially the genetic basis of human complex diseases. Non-synonymous coding SNPs (nsSNPs) comprise a group of SNPs that, together with SNPs in regulatory regions, are believed to have the highest impact on phenotype. Here we present a World Wide Web server to predict the effect of an nsSNP on protein structure and function. The prediction method enabled analysis of the publicly available SNP database HGVbase, which gave rise to a dataset of nsSNPs with predicted functionality. The dataset was further used to compare the effect of various structural and functional characteristics of amino acid substitutions responsible for phenotypic display of nsSNPs. We also studied the dependence of selective pressure on the structural and functional properties of proteins. We found that in our dataset the selection pressure against deleterious SNPs depends on the molecular function of the protein, although it is insensitive to several other protein features considered. The strongest selective pressure was detected for proteins involved in transcription regulation.
InterPro: an integrated documentation resource for protein families, domains and functional sites.
Mulder, N.J., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Biswas, M., Bradley, P., Bork, P., Bucher, P., Copley, R., Courcelle, E., Durbin, R., Falquet, L., Fleischmann, W., Gouzy, J., Griffith-Jones, S., Haft, D., Hermjakob, H., Hulo, N., Kahn, D., Kanapin, A., Krestyaninova, M., Lopez, R., Letunic, I., Orchard, S., Pagni, M., Peyruc, D., Ponting, C.P., Servant, F. & Sigrist, C.J.
Brief Bioinform 2002 Sep;3(3):225-35.
The exponential increase in the submission of nucleotide sequences to the nucleotide sequence database by genome sequencing centres has resulted in a need for rapid, automatic methods for classification of the resulting protein sequences. There are several signature and sequence cluster-based methods for protein classification, each resource having distinct areas of optimum application owing to the differences in the underlying analysis methods. In recognition of this, InterPro was developed as an integrated documentation resource for protein families, domains and functional sites, to rationalise the complementary efforts of the individual protein signature database projects. The member databases - PRINTS, PROSITE, Pfam, ProDom, SMART and TIGRFAMs - form the InterPro core. Related signatures from each member database are unified into single InterPro entries. Each InterPro entry includes a unique accession number, functional descriptions and literature references, and links are made back to the relevant member database(s). Release 4.0 of InterPro (November 2001) contains 4,691 entries, representing 3,532 families, 1,068 domains, 74 repeats and 15 sites of post-translational modification (PTMs) encoded by different regular expressions, profiles, fingerprints and hidden Markov models (HMMs). Each InterPro entry lists all the matches against SWISS-PROT and TrEMBL (2,141,621 InterPro hits from 586,124 SWISS-PROT and TrEMBL protein sequences). The database is freely accessible for text- and sequence-based searches.
NEAT: a domain duplicated in genes near the components of a putative Fe3+ siderophore transporter from Gram-positive pathogenic bacteria.
Andrade, M.A., Ciccarelli, F.D., Perez-Iratxeta, C. & Bork, P.
Genome Biol 2002 Aug 15;3(9):RESEARCH0047.
BACKGROUND: Iron uptake from the host is essential for bacteria that infect animals. To find potential targets for drugs active against pathogenic bacteria, we have searched all completely sequenced genomes of pathogenic bacteria for genes relevant for iron transport. RESULTS: We identified a protein domain that appears in variable copy number in bacterial genes that are usually in the vicinity of a putative Fe3+ siderophore transporter. Accordingly, we have denoted this domain NEAT for 'near transporter'. Most of the bacterial species containing this domain are pathogenic. Sequence features indicate that the domain is anchored to the extracellular side of the membrane. The domain seems to be under high selective pressure for rapid independent duplications that are typical of sequences involved in signaling and binding. CONCLUSIONS: The NEAT domain might be functionally related to iron transport. The taxonomic specificity of this domain and its predicted extracellular position could make it an interesting target for designing new drugs against some highly pathogenic bacteria.
SPG20 is mutated in Troyer syndrome, an hereditary spastic paraplegia.
Patel, H., Cross, H., Proukakis, C., Hershberger, R., Bork, P., Ciccarelli, F.D., Patton, M.A., McKusick, V.A. & Crosby, A.H.
Nat Genet 2002 Aug;31(4):347-8.
Troyer syndrome (TRS) is an autosomal recessive complicated hereditary spastic paraplegia (HSP) that occurs with high frequency in the Old Order Amish. We report mapping of the TRS locus to chromosome 13q12.3 and identify a frameshift mutation in SPG20, encoding spartin. Comparative sequence analysis indicates that spartin shares similarity with molecules involved in endosomal trafficking and with spastin, a molecule implicated in microtubule interaction that is commonly mutated in HSP.
Predicting protein cellular localization using a domain projection method.
Mott, R., Schultz, J., Bork, P. & Ponting, C.P.
Genome Res 2002 Aug;12(8):1168-74.
We investigate the co-occurrence of domain families in eukaryotic proteins to predict protein cellular localization. Approximately half (300) of SMART domains form a "small-world network", linked by no more than seven degrees of separation. Projection of the domains onto two-dimensional space reveals three clusters that correspond to cellular compartments containing secreted, cytoplasmic, and nuclear proteins. The projection method takes into account the existence of "bridging" domains, that is, instances where two domains might not occur with each other but frequently co-occur with a third domain; in such circumstances the domains are neighbors in the projection. While the majority of domains are specific to a compartment ("locale"), and hence may be used to localize any protein that contains such a domain, a small subset of domains either are present in multiple locales or occur in transmembrane proteins. Comparison with previously annotated proteins shows that SMART domain data used with this approach can predict, with 92% accuracy, the localizations of 23% of eukaryotic proteins. The coverage and accuracy will increase with improvements in domain database coverage. This method is complementary to approaches that use amino-acid composition or identify sorting sequences; these methods may be combined to further enhance prediction accuracy.
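The notions of a domain co-occurrence network, "degrees of separation", and "bridging" domains can be made concrete with a short sketch. It only builds the graph and measures path lengths; it does not reproduce the paper's two-dimensional projection. The protein names and their domain compositions are hypothetical.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_cooccurrence_graph(proteins):
    """Link two domains whenever they occur together in at least one protein."""
    graph = defaultdict(set)
    for domains in proteins.values():
        for a, b in combinations(sorted(set(domains)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def degrees_of_separation(graph, start, goal):
    """Shortest path length between two domains (breadth-first search)."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None  # the two domains are not connected


# Hypothetical domain compositions, for illustration only.
proteins = {
    "protein_1": ["SH2", "SH3"],
    "protein_2": ["SH3", "PH"],
    "protein_3": ["EGF"],
}
graph = build_cooccurrence_graph(proteins)
print(degrees_of_separation(graph, "SH2", "PH"))
print(degrees_of_separation(graph, "SH2", "EGF"))
```

In this toy input SH2 and PH never share a protein but both co-occur with SH3, so SH3 acts as a bridging domain and the two are separated by two steps; EGF occurs alone and is unreachable.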
The rhodanese/Cdc25 phosphatase superfamily. Sequence-structure-function relations.
Bordo, D. & Bork, P.
EMBO Rep 2002 Aug;3(8):741-6.
Rhodanese domains are ubiquitous structural modules occurring in the three major evolutionary phyla. They are found as tandem repeats, with the C-terminal domain hosting the properly structured active-site Cys residue, as single domain proteins or in combination with distinct protein domains. An increasing number of reports indicate that rhodanese modules are versatile sulfur carriers that have adapted their function to fulfill the need for reactive sulfane sulfur in distinct metabolic and regulatory pathways. Recent investigations have shown that rhodanese domains are also structurally related to the catalytic subunit of Cdc25 phosphatase enzymes and that the two enzyme families are likely to share a common evolutionary origin. In this review, the rhodanese/Cdc25 phosphatase superfamily is analyzed. Although the identification of their biological substrates has thus far proven elusive, the emerging picture points to a role for the amino-acid composition of the active-site loop in substrate recognition/specificity. Furthermore, the frequently observed association of catalytically inactive rhodanese modules with other protein domains suggests a distinct regulatory role for these inactive domains, possibly in connection with signaling.
Association of genes to genetically inherited diseases using data mining.
Perez-Iratxeta, C., Bork, P. & Andrade, M.A.
Nat Genet 2002 Jul;31(3):316-9.
Although approximately one-quarter of the roughly 4,000 genetically inherited diseases currently recorded in respective databases (LocusLink, OMIM) are already linked to a region of the human genome, about 450 have no known associated gene. Finding disease-related genes requires laborious examination of hundreds of possible candidate genes (sometimes, these are not even annotated; see, for example, refs 3,4). The public availability of the human genome draft sequence has fostered new strategies to map molecular functional features of gene products to complex phenotypic descriptions, such as those of genetically inherited diseases. Owing to recent progress in the systematic annotation of genes using controlled vocabularies, we have developed a scoring system for the possible functional relationships of human genes to 455 genetically inherited diseases that have been mapped to chromosomal regions without assignment of a particular gene. In a benchmark of the system with 100 known disease-associated genes, the disease-associated gene was among the 8 best-scoring genes with a 25% chance, and among the best 30 genes with a 50% chance, showing that there is a relationship between the score of a gene and its likelihood of being associated with a particular disease. The scoring also indicates that for some diseases, the chance of identifying the underlying gene is higher.
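The benchmark figures quoted above (the true gene among the 8 best-scoring candidates in 25% of cases, among the best 30 in 50%) are top-k hit rates. The sketch below shows how such a rate could be computed from ranked candidate scores; it is not the authors' scoring system, and the gene names and scores are hypothetical.

```python
def rank_of(scores, gene):
    """1-based rank of a gene when candidates are sorted by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(gene) + 1


def top_k_rate(cases, k):
    """Fraction of benchmark cases whose true gene ranks within the top k."""
    hits = sum(1 for scores, true_gene in cases if rank_of(scores, true_gene) <= k)
    return hits / len(cases)


# Hypothetical benchmark: (candidate scores in the mapped region, true gene).
cases = [
    ({"g1": 0.9, "g2": 0.4, "g3": 0.1}, "g1"),
    ({"g1": 0.2, "g2": 0.8, "g3": 0.5}, "g1"),
]
for k in (1, 3):
    print(f"top-{k} rate: {top_k_rate(cases, k):.2f}")
```

Plotting this rate against k gives the kind of curve from which statements such as "among the best 30 genes with a 50% chance" are read off.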
A complex prediction: three-dimensional model of the yeast exosome.
Aloy, P., Ciccarelli, F.D., Leutwein, C., Gavin, A.C., Superti-Furga, G., Bork, P., Böttcher, B. & Russell, R.B.
EMBO Rep 2002 Jul;3(7):628-35.
We present a model of the yeast exosome based on the bacterial degradosome component polynucleotide phosphorylase (PNPase). Electron microscopy shows the exosome to resemble PNPase but with key differences likely related to the position of RNA binding domains, and to the location of domains unique to the exosome. We use various techniques to reduce the many possible models of exosome subunits based on PNPase to just one. The model suggests numerous experiments to probe exosome function, particularly with respect to subunits making direct atomic contacts and conserved, possibly functional residues within the predicted central pore of the complex.
Teamed up for transcription.
von Mering, C. & Bork, P.
Nature 2002 Jun 20;417(6891):797-8.
Comparative genomic analysis in the region of a major Plasmodium-refractoriness locus of Anopheles gambiae.
Thomasova, D., Ton, L.Q., Copley, R.R., Zdobnov, E.M., Wang, X., Hong, Y.S., Sim, C., Bork, P., Kafatos, F.C. & Collins, F.H.
Proc Natl Acad Sci U S A 2002 Jun 11;99(12):8179-84.
We have sequenced six overlapping clones from a library of bacterial artificial chromosome (BAC) clones derived from a laboratory strain of the mosquito, Anopheles gambiae, the major vector of human malaria in Africa. The resulting uninterrupted 528-kb sequence is from the 8C region of the mosquito 2R chromosome, at or very near the major refractoriness locus associated with melanotic encapsulation of parasites. This sequence represents the first extensive view of the mosquito genome structure encompassing 48 genes. Genomic comparison reveals that the majority of the orthologues are found in six microsyntenic clusters in Drosophila melanogaster. A BAC clone that is wholly contained within this region demonstrates the existence of a remarkable degree of local polymorphism in this species, which may prove important for its population structure and vectorial capacity.
Computing fuzzy associations for the analysis of biological literature.
Perez-Iratxeta, C., Keer, H.S., Bork, P. & Andrade, M.A.
Biotechniques 2002 Jun;32(6):1380-2, 1384-5.
The increase of information in biology makes it difficult for researchers in any field to keep current with the literature. The MEDLINE database of scientific abstracts can be quickly scanned using electronic mechanisms. Potentially interesting abstracts can be selected by matching words joined by Boolean operators. However, this means of selecting documents is not optimal. Nonspecific queries have to be effected, resulting in large numbers of irrelevant abstracts that have to be manually scanned. To facilitate this analysis, we have developed a system that compiles a summary of subjects and related documents on the results of a MEDLINE query. For this, we have applied a fuzzy binary relation formalism that deduces relations between words present in a set of abstracts preprocessed with a standard grammatical tagger. Those relations are used to derive ensembles of related words and their associated subsets of abstracts. The algorithm can be used publicly at http://www.bork.embl-heidelberg.de/xplormed/.
Exploring MEDLINE abstracts with XplorMed.
Perez-Iratxeta, C., Bork, P. & Andrade, M.A.
Drugs Today (Barc) 2002 Jun;38(6):381-9.
XplorMed is a publicly available web tool conceived to make life easier for MEDLINE(c) users looking for scientific information. Searching scientific literature is an information retrieval problem. Abstracts that are of possible interest to the user are usually selected by a keyword search followed by manual screening, which often results in the retrieval of a large number of abstracts. Interesting references can be buried among irrelevant ones because of nonspecific queries. XplorMed is intended to extract dependency relations between the words of the abstracts. These relations can be filtered and arranged to deduce different subjects in the query and offer a condensed view of the abstract, allowing users to select texts of interest without having to read them all. XplorMed is available at http://www.bork.embl-heidelberg.de/xplormed.
Comparative assessment of large-scale data sets of protein-protein interactions.
von Mering, C., Krause, R., Snel, B., Cornell, M., Oliver, S.G., Fields, S. & Bork, P.
Nature 2002 May 23;417(6887):399-403.
Comprehensive protein-protein interaction maps promise to reveal many aspects of the complex regulatory network underlying cellular function. Recently, large-scale approaches have predicted many new protein interactions in yeast. To measure their accuracy and potential as well as to identify biases, strengths and weaknesses, we compare the methods with each other and with a reference set of previously reported protein interactions.
The identification of functional modules from the genomic association of genes.
Snel, B., Bork, P. & Huynen, M.A.
Proc Natl Acad Sci U S A 2002 Apr 30;99(9):5890-5.
By combining the pairwise interactions between proteins, as predicted by the conserved co-occurrence of their genes in operons, we obtain protein interaction networks. Here we study the properties of such networks to identify functional modules: sets of proteins that together are involved in a biological process. The complete network contains 3,033 orthologous groups of proteins in 38 genomes. It consists of one giant component, containing 1,611 orthologous groups, and of 516 small disjointed clusters that, on average, contain only 2.7 orthologous groups. These small clusters have a homogeneous functional composition and thus represent functional modules in themselves. Analysis of the giant component reveals that it is a scale-free, small-world network with a high degree of local clustering (C = 0.6). It consists of locally highly connected subclusters that are connected to each other by linker proteins. The linker proteins tend to have multiple functions, or are involved in multiple processes and have an above average probability of being essential. By splitting up the giant component at these linker proteins, we identify 265 subclusters that tend to have a homogeneous functional composition. The rare functional inhomogeneities in our subclusters reflect the mixing of different types of (molecular) functions in a single cellular process, exemplified by subclusters containing both metabolic enzymes as well as the transcription factors that regulate them. Comparative genome analysis, thus, allows identification of a level of functional interaction between that of pairwise interactions, and of the complete genome.
BSD: a novel domain in transcription factors and synapse-associated proteins.
Doerks, T., Huber, S., Buchner, E. & Bork, P.
Trends Biochem Sci 2002 Apr;27(4):168-70.
This article describes a novel domain, BSD, that is present in basal transcription factors, synapse-associated proteins and several hypothetical proteins. It occurs in a variety of species ranging from primal protozoan to human. The BSD domain is characterized by three predicted alpha helices, which probably form a three-helical bundle, as well as by conserved tryptophan and phenylalanine residues, located at the C terminus of the domain.
A versatile structural domain analysis server using profile weight matrices.
Schmidt, S., Bork, P. & Dandekar, T.
J Chem Inf Comput Sci 2002 Mar-Apr;42(2):405-7.
The WEB tool "AnDom" assigns to a given protein sequence all experimentally determined structural domains contained within it, including multidomain and large proteins. The server uses profile specific matrices from custom generated multiple sequence alignments of all known SCOP domains (SCOP version 1.50). Prediction time is short, allowing numerous applications for structural genomics including investigation of complex eucaryotic protein families. The WWW server is at http://www.bork.embl-heidelberg.de/AnDom, and profiles can be downloaded at ftp.bork.embl-heidelberg.de/pub/users/schmidt/AnDom.
SHOT: a web server for the construction of genome phylogenies.
Korbel, J.O., Snel, B., Huynen, M.A. & Bork, P.
Trends Genet 2002 Mar;18(3):158-62.
With the increasing availability of genome sequences, new methods are being proposed that exploit information from complete genomes to classify species in a phylogeny. Here we present SHOT, a web server for the classification of genomes on the basis of shared gene content or the conservation of gene order that reflects the dominant, phylogenetic signal in these genomic properties. In general, the genome trees are consistent with classical gene-based phylogenies, although some interesting exceptions indicate massive horizontal gene transfer. SHOT is a useful tool for analysing the tree of life from a genomic point of view. It is available at http://www.Bork.EMBL-Heidelberg.de/SHOT.
AMOP, a protein module alternatively spliced in cancer cells.
Ciccarelli, F.D., Doerks, T. & Bork, P.
Trends Biochem Sci 2002 Mar;27(3):113-5.
This article describes a new extracellular domain, AMOP (adhesion-associated domain in MUC4 and other proteins). This domain occurs in putative cell adhesion molecules and in some splice variants of MUC4. MUC4 splice variants are overexpressed in several tumours; in particular, they are highly expressed in pancreatic carcinomas but not in normal pancreas. The presence of AMOP in cell adhesion molecules could be indicative of a role for this domain in adhesion.
Protein domain analysis in the era of complete genomes.
Copley, R.R., Doerks, T., Letunic, I. & Bork, P.
FEBS Lett 2002 Feb 20;513(1):129-34.
Domains present one of the most useful levels at which to understand protein function, and domain family-based analysis has had a profound impact on the study of individual proteins. Protein domain discovery has been progressing steadily over the past 30 years. What are the realistically achievable goals of sequence-based domain analysis, and how far off are they for the sequences encoded in eukaryotic genomes? Here we address some of the issues involved in better coverage of sequence-based domain annotation, and the integration of these results within the wider context of genomes, structures and function.
Genome and protein evolution in eukaryotes.
Copley, R.R., Letunic, I. & Bork, P.
Curr Opin Chem Biol 2002 Feb;6(1):39-45.
The past year has seen the completion of the genome sequence of the flowering plant Arabidopsis thaliana and the initial sequence reports of the human genome. The availability of completely sequenced eukaryotic genomes from disparate phylogenetic lineages has opened the door to comparative analyses and a better understanding of the evolutionary processes shaping genomes. Complex many-to-many relationships between genes from different species appear to be the norm, suggesting that transfer of detailed functional annotation will not be straightforward. In addition to expansion and contraction of gene families, new genes evolve from recombination of pre-existing domains, although some domain families do appear to have evolved recently and to be specific to restricted phylogenetic lineages. The overall picture is of a huge diversity of gene content within eukaryotic genomes, reflecting different functional demands in different species.
CASH: a beta-helix domain widespread among carbohydrate-binding proteins.
Ciccarelli, F.D., Copley, R.R., Doerks, T., Russell, R.B. & Bork, P.
Trends Biochem Sci 2002 Feb;27(2):59-62.
In this article, we describe a novel, widespread domain (CASH) that is shared by many carbohydrate-binding proteins and sugar hydrolases. This domain occurs in more than 1000 proteins distributed among all three kingdoms of life. The CASH domain is characterized by internal repetitions of glycines and hydrophobic residues that correspond to the repetitive units of a predicted or observed right-handed beta-helix structure of the pectate lyase superfamily.
Functional organization of the yeast proteome by systematic analysis of protein complexes.
Gavin, A.C., Bosche, M., Krause, R., Grandi, P., Marzioch, M., Bauer, A., Schultz, J., Rick, J.M., Michon, A.M., Cruciat, C.M., Remor, M., Hofert, C., Schelder, M., Brajenovic, M., Ruffner, H., Merino, A., Klein, K., Hudak, M., Dickson, D., Rudi, T., Gnau, V., Bauch, A., Bastuck, S., Huhse, B., Leutwein, C., Heurtier, M.A., Copley, R.R., Edelmann, A., Querfurth, E., Rybin, V., Drewes, G., Raida, M., Bouwmeester, T., Bork, P., Seraphin, B., Kuster, B., Neubauer, G. & Superti-Furga, G.
Nature 2002 Jan 10;415(6868):141-7.
Most cellular processes are carried out by multiprotein complexes. The identification and analysis of their components provides insight into how the ensemble of expressed proteins (proteome) is organized into functional units. We used tandem-affinity purification (TAP) and mass spectrometry in a large-scale approach to characterize multiprotein complexes in Saccharomyces cerevisiae. We processed 1,739 genes, including 1,143 human orthologues of relevance to human biology, and purified 589 protein assemblies. Bioinformatic analysis of these assemblies defined 232 distinct multiprotein complexes and proposed new cellular roles for 344 proteins, including 231 proteins with no previous functional annotation. Comparison of yeast and human complexes showed that conservation across species extends from single proteins to their molecular environment. Our analysis provides an outline of the eukaryotic proteome as a network of protein complexes at a level of organization beyond binary interactions. This higher-order map contains fundamental biological information and offers the context for a more reasoned and informed approach to drug discovery.
Recent improvements to the SMART domain-based sequence annotation resource.
Letunic, I., Goodstadt, L., Dickens, N.J., Doerks, T., Schultz, J., Mott, R., Ciccarelli, F., Copley, R.R., Ponting, C.P. & Bork, P.
Nucleic Acids Res 2002 Jan 1;30(1):242-4.
SMART (Simple Modular Architecture Research Tool, http://smart.embl-heidelberg.de) is a web-based resource used for the annotation of protein domains and the analysis of domain architectures, with particular emphasis on mobile eukaryotic domains. Extensive annotation for each domain family is available, providing information relating to function, subcellular localization, phyletic distribution and tertiary structure. The January 2002 release has added more than 200 hand-curated domain models. This brings the total to over 600 domain families that are widely represented among nuclear, signalling and extracellular proteins. Annotation now includes links to the Online Mendelian Inheritance in Man (OMIM) database in cases where a human disease is associated with one or more mutations in a particular domain. We have implemented new analysis methods and updated others. New advanced queries provide direct access to the SMART relational database using SQL. This database now contains information on intrinsic sequence features such as transmembrane regions, coiled-coils, signal peptides and internal repeats. SMART output can now be easily included in users' documents. A SMART mirror has been created at http://smart.ox.ac.uk.
HGVbase: a human sequence variation database emphasizing data quality and a broad spectrum of data sources.
Fredman, D., Siegfried, M., Yuan, Y.P., Bork, P., Lehvaslaiho, H. & Brookes, A.J.
Nucleic Acids Res 2002 Jan 1;30(1):387-91.
HGVbase (Human Genome Variation database; http://hgvbase.cgb.ki.se, formerly known as HGBASE) is an academic effort to provide a high quality and non-redundant database of available genomic variation data of all types, mostly comprising single nucleotide polymorphisms (SNPs). Records include neutral polymorphisms as well as disease-related mutations. Online search tools facilitate data interrogation by sequence similarity and keyword queries, and searching by genome coordinates is now being implemented. Downloads are freely available in XML, Fasta, SRS, SQL and tagged-text file formats. Each entry is presented in the context of its surrounding sequence and many records are related to neighboring human genes and affected features therein. Population allele frequencies are included wherever available. Thorough semi-automated data checking ensures internal consistency and addresses common errors in the source information. To keep pace with recent growth in the field, we have developed tools for fully automated annotation. All variants have been uniquely mapped to the draft genome sequence and are referenced to positions in EMBL/GenBank files. Data utility is enhanced by provision of genotyping assays and functional predictions. Recent data structure extensions allow the capture of haplotype and genotype information, and a new initiative (along with BiSC and HUGO-MDI) aims to create a central repository for the broad collection of clinical mutations and associated disease phenotypes of interest.
Genomes in flux: the evolution of archaeal and proteobacterial gene content.
Snel, B., Bork, P. & Huynen, M.A.
Genome Res 2002 Jan;12(1):17-25.
In the course of evolution, genomes are shaped by processes like gene loss, gene duplication, horizontal gene transfer, and gene genesis (the de novo origin of genes). Here we reconstruct the gene content of ancestral Archaea and Proteobacteria and quantify the processes connecting them to their present day representatives based on the distribution of genes in completely sequenced genomes. We estimate that the ancestor of the Proteobacteria contained around 2500 genes, and the ancestor of the Archaea around 2050 genes. Although it is necessary to invoke horizontal gene transfer to explain the content of present day genomes, gene loss, gene genesis, and simple vertical inheritance are quantitatively the most dominant processes in shaping the genome. Together they result in a turnover of gene content such that even the lineage leading from the ancestor of the Proteobacteria to the relatively large genome of Escherichia coli has lost at least 950 genes. Gene loss, unlike the other processes, correlates fairly well with time. This clock-like behavior suggests that gene loss is under negative selection, while the processes that add genes are under positive selection.
Systematic identification of novel protein domain families associated with nuclear functions.
Doerks, T., Copley, R.R., Schultz, J., Ponting, C.P. & Bork, P.
Genome Res 2002 Jan;12(1):47-56.
A systematic computational analysis of protein sequences containing known nuclear domains led to the identification of 28 novel domain families. This represents a 26% increase in the starting set of 107 known nuclear domain families used for the analysis. Most of the novel domains are present in all major eukaryotic lineages, but 3 are species specific. For about 500 of the 1200 proteins that contain these new domains, nuclear localization could be inferred, and for 700, additional features could be predicted. For example, we identified a new domain, likely to have a role downstream of the unfolded protein response; a nematode-specific signalling domain; and a widespread domain, likely to be a noncatalytic homolog of ubiquitin-conjugating enzymes.
Alternative splicing and genome complexity.
Brett, D., Pospisil, H., Valcarcel, J., Reich, J. & Bork, P.
Nat Genet 2002 Jan;30(1):29-30.
Alternative splicing of mRNA allows many gene products with different functions to be produced from a single coding sequence. It has recently been proposed as a mechanism by which higher-order diversity is generated. Here we show, using large-scale expressed sequence tag (EST) analysis, that among seven different eukaryotes the amount of alternative splicing is comparable, with no large differences between humans and other animals.
Von Mering, C. and Bork, P.