eggNOG v4.0: nested orthology inference across 3686 organisms.
Powell, S., Forslund, K., Szklarczyk, D., Trachana, K., Roth, A., Huerta-Cepas, J., Gabaldon, T., Rattei, T., Creevey, C., Kuhn, M., Jensen, L.J., von Mering, C. & Bork, P.
Nucleic Acids Res. 2013 Dec 1.
With the increasing availability of various 'omics data, high-quality orthology assignment is crucial for evolutionary and functional genomics studies. We here present the fourth version of the eggNOG database (available at http://eggnog.embl.de) that derives nonsupervised orthologous groups (NOGs) from complete genomes, and then applies a comprehensive characterization and analysis pipeline to the resulting gene families. Compared with the previous version, we have more than tripled the underlying species set to cover 3686 organisms, keeping track with genome project completions while prioritizing the inclusion of high-quality genomes to minimize error propagation from incomplete proteome sets. Major technological advances include (i) a robust and scalable procedure for the identification and inclusion of high-quality genomes, (ii) provision of orthologous groups for 107 different taxonomic levels compared with 41 in eggNOGv3, (iii) identification and annotation of particularly closely related orthologous groups, facilitating analysis of related gene families, (iv) improvements of the clustering and functional annotation approach, (v) adoption of a revised tree building procedure based on the multiple alignments generated during the process and (vi) implementation of quality control procedures throughout the entire pipeline. As in previous versions, eggNOGv4 provides multiple sequence alignments and maximum-likelihood trees, as well as broad functional annotation. Users can access the complete database of orthologous groups via a web interface, as well as through bulk download.
STITCH 4: integration of protein-chemical interactions with user data.
Kuhn, M., Szklarczyk, D., Pletscher-Frankild, S., Blicher, T.H., von Mering, C., Jensen, L.J. & Bork, P.
Nucleic Acids Res. 2013 Nov 28.
STITCH is a database of protein-chemical interactions that integrates many sources of experimental and manually curated evidence with text-mining information and interaction predictions. Available at http://stitch.embl.de, the resulting interaction network includes 390 000 chemicals and 3.6 million proteins from 1133 organisms. Compared with the previous version, the number of high-confidence protein-chemical interactions in human has increased by 45%, to 367 000. In this version, we added features for users to upload their own data to STITCH in the form of internal identifiers, chemical structures or quantitative data. For example, a user can now upload a spreadsheet with screening hits to easily check which interactions are already known. To increase the coverage of STITCH, we expanded the text mining to include full-text articles and added a prediction method based on chemical structures. We further changed our scheme for transferring interactions between species to rely on orthology rather than protein similarity. This improves the performance within protein families, where scores are now transferred only to orthologous proteins, but not to paralogous proteins. STITCH can be accessed with a web-interface, an API and downloadable files.
A quantitative liposome microarray to systematically characterize protein-lipid interactions.
Saliba, A.E., Vonkova, I., Ceschia, S., Findlay, G.M., Maeda, K., Tischer, C., Deghou, S., van Noort, V., Bork, P., Pawson, T., Ellenberg, J. & Gavin, A.C.
Nat Methods. 2013 Nov 24. doi: 10.1038/nmeth.2734.
Lipids have a role in virtually all biological processes, acting as structural elements, scaffolds and signaling molecules, but they are still largely under-represented in known biological networks. Here we describe a liposome microarray-based assay (LiMA), a method that measures protein recruitment to membranes in a quantitative, automated, multiplexed and high-throughput manner.
Metagenomic species profiling using universal phylogenetic marker genes.
Sunagawa, S., Mende, D.R., Zeller, G., Izquierdo-Carrasco, F., Berger, S.A., Kultima, J.R., Coelho, L.P., Arumugam, M., Tap, J., Nielsen, H.B., Rasmussen, S., Brunak, S., Pedersen, O., Guarner, F., de Vos, W.M., Wang, J., Li, J., Dore, J., Ehrlich, S.D., Stamatakis, A. & Bork, P.
Nat Methods. 2013 Oct 20. doi: 10.1038/nmeth.2693.
To quantify known and unknown microorganisms at species-level resolution using shotgun sequencing data, we developed a method that establishes metagenomic operational taxonomic units (mOTUs) based on single-copy phylogenetic marker genes. Applied to 252 human fecal samples, the method revealed that on average 43% of the species abundance and 58% of the richness cannot be captured by current reference genome-based methods. An implementation of the method is available at http://www.bork.embl.de/software/mOTU/.
Exploring nucleo-cytoplasmic large DNA viruses in Tara Oceans microbial metagenomes.
Hingamp, P., Grimsley, N., Acinas, S.G., Clerissi, C., Subirana, L., Poulain, J., Ferrera, I., Sarmento, H., Villar, E., Lima-Mendez, G., Faust, K., Sunagawa, S., Claverie, J.M., Moreau, H., Desdevises, Y., Bork, P., Raes, J., de Vargas, C., Karsenti, E., Kandels-Lewis, S., Jaillon, O., Not, F., Pesant, S., Wincker, P. & Ogata, H.
ISME J. 2013 Sep;7(9):1678-95. doi: 10.1038/ismej.2013.59. Epub 2013 Apr 11.
Nucleo-cytoplasmic large DNA viruses (NCLDVs) constitute a group of eukaryotic viruses that can have crucial ecological roles in the sea by accelerating the turnover of their unicellular hosts or by causing diseases in animals. To better characterize the diversity, abundance and biogeography of marine NCLDVs, we analyzed 17 metagenomes derived from microbial samples (0.2-1.6 µm size range) collected during the Tara Oceans Expedition. The sample set includes ecosystems under-represented in previous studies, such as the Arabian Sea oxygen minimum zone (OMZ) and Indian Ocean lagoons. By combining computationally derived relative abundance and direct prokaryote cell counts, the abundance of NCLDVs was found to be in the order of 10^4-10^5 genomes ml^-1 for the samples from the photic zone and 10^2-10^3 genomes ml^-1 for the OMZ. The Megaviridae and Phycodnaviridae dominated the NCLDV populations in the metagenomes, although most of the reads classified in these families showed large divergence from known viral genomes. Our taxon co-occurrence analysis revealed a potential association between viruses of the Megaviridae family and eukaryotes related to oomycetes. In support of this predicted association, we identified six cases of lateral gene transfer between Megaviridae and oomycetes. Our results suggest that marine NCLDVs probably outnumber eukaryotic organisms in the photic layer (per given water mass) and that metagenomic sequence analyses promise to shed new light on the biodiversity of marine viruses and their interactions with potential hosts.
Richness of human gut microbiome correlates with metabolic markers.
Le Chatelier, E., Nielsen, T., Qin, J., Prifti, E., Hildebrand, F., Falony, G., Almeida, M., Arumugam, M., Batto, J.M., Kennedy, S., Leonard, P., Li, J., Burgdorf, K., Grarup, N., Jorgensen, T., Brandslund, I., Nielsen, H.B., Juncker, A.S., Bertalan, M., Levenez, F., Pons, N., Rasmussen, S., Sunagawa, S., Tap, J., Tims, S., Zoetendal, E.G., Brunak, S., Clement, K., Dore, J., Kleerebezem, M., Kristiansen, K., Renault, P., Sicheritz-Ponten, T., de Vos, W.M., Zucker, J.D., Raes, J., Hansen, T., Bork, P., Wang, J., Ehrlich, S.D., Pedersen, O., Guedon, E., Delorme, C., Layec, S., Khaci, G., van de Guchte, M., Vandemeulebrouck, G., Jamet, A., Dervyn, R., Sanchez, N., Maguin, E., Haimet, F., Winogradski, Y., Cultrone, A., Leclerc, M., Juste, C., Blottiere, H., Pelletier, E., LePaslier, D., Artiguenave, F., Bruls, T., Weissenbach, J., Turner, K., Parkhill, J., Antolin, M., Manichanh, C., Casellas, F., Boruel, N., Varela, E., Torrejon, A., Guarner, F., Denariaz, G., Derrien, M., van Hylckama Vlieg, J.E., Veiga, P., Oozeer, R., Knol, J., Rescigno, M., Brechot, C., M'Rini, C., Merieux, A. & Yamada, T.
Nature. 2013 Aug 29;500(7464):541-6. doi: 10.1038/nature12506.
We are facing a global metabolic health crisis provoked by an obesity epidemic. Here we report the human gut microbial composition in a population sample of 123 non-obese and 169 obese Danish individuals. We find two groups of individuals that differ by the number of gut microbial genes and thus gut bacterial richness. They contain known and previously unknown bacterial species at different proportions; individuals with a low bacterial richness (23% of the population) are characterized by more marked overall adiposity, insulin resistance and dyslipidaemia and a more pronounced inflammatory phenotype when compared with high bacterial richness individuals. The obese individuals among the lower bacterial richness group also gain more weight over time. Only a few bacterial species are sufficient to distinguish between individuals with high and low bacterial richness, and even between lean and obese participants. Our classifications based on variation in the gut microbiome identify subsets of individuals in the general white adult population who may be at increased risk of progressing to adiposity-associated co-morbidities.
Metagenomic 16S rDNA Illumina tags are a powerful alternative to amplicon sequencing to explore diversity and structure of microbial communities.
Logares, R., Sunagawa, S., Salazar, G., Cornejo-Castillo, F.M., Ferrera, I., Sarmento, H., Hingamp, P., Ogata, H., de Vargas, C., Lima-Mendez, G., Raes, J., Poulain, J., Jaillon, O., Wincker, P., Kandels-Lewis, S., Karsenti, E., Bork, P. & Acinas, S.G.
Environ Microbiol. 2013 Aug 18. doi: 10.1111/1462-2920.12250.
Sequencing of 16S rDNA polymerase chain reaction (PCR) amplicons is the most common approach for investigating environmental prokaryotic diversity, despite the known biases introduced during PCR. Here we show that 16S rDNA fragments derived from Illumina-sequenced environmental metagenomes (mi tags) are a powerful alternative to 16S rDNA amplicons for investigating the taxonomic diversity and structure of prokaryotic communities. As part of the Tara Oceans global expedition, marine plankton was sampled in three locations, resulting in 29 subsamples for which metagenomes were produced by shotgun Illumina sequencing (ca. 700 Gb). For comparative analyses, a subset of samples was also selected for Roche-454 sequencing using both shotgun (m454 tags; 13 metagenomes, ca. 2.4 Gb) and 16S rDNA amplicon (454 tags; ca. 0.075 Gb) approaches. Our results indicate that by overcoming PCR biases related to amplification and primer mismatch, mi tags may provide more realistic estimates of community richness and evenness than amplicon 454 tags. In addition, mi tags can capture expected beta diversity patterns. Using mi tags is now economically feasible given the dramatic reduction in high-throughput sequencing costs, having the advantage of retrieving simultaneously both taxonomic (Bacteria, Archaea and Eukarya) and functional information from the same microbial community.
Accurate and universal delineation of prokaryotic species.
Mende, D.R., Sunagawa, S., Zeller, G. & Bork, P.
Nat Methods. 2013 Aug;10(9):881-4. doi: 10.1038/nmeth.2575. Epub 2013 Jul 28.
The exponentially increasing number of sequenced genomes necessitates fast, accurate, universally applicable and automated approaches for the delineation of prokaryotic species. We developed specI (species identification tool; http://www.bork.embl.de/software/specI/), a method to group organisms into species clusters based on 40 universal, single-copy phylogenetic marker genes. Applied to 3,496 prokaryotic genomes, specI identified 1,753 species clusters. Of 314 discrepancies with a widely used taxonomic classification, >62% were resolved by literature support.
Country-specific antibiotic use practices impact the human gut resistome.
Forslund, K., Sunagawa, S., Kultima, J.R., Mende, D.R., Arumugam, M., Typas, A. & Bork, P.
Genome Res. 2013 Jul;23(7):1163-9. doi: 10.1101/gr.155465.113. Epub 2013 Apr 8.
Despite increasing concerns over inappropriate use of antibiotics in medicine and food production, population-level resistance transfer into the human gut microbiota has not been demonstrated beyond individual case studies. To determine the "antibiotic resistance potential" for entire microbial communities, we employ metagenomic data and quantify the totality of known resistance genes in each community (its resistome) for 68 classes and subclasses of antibiotics. In 252 fecal metagenomes from three countries, we show that the most abundant resistance determinants are those for antibiotics also used in animals and for antibiotics that have been available longer. Resistance genes are also more abundant in samples from Spain, Italy, and France than from Denmark, the United States, or Japan. Where comparable country-level data on antibiotic use in both humans and animals are available, differences in these statistics match the observed resistance potential differences. The results are robust over time as the antibiotic resistance determinants of individuals persist in the human gut flora for at least a year.
Systematic identification of proteins that elicit drug side effects.
Kuhn, M., Al Banchaabouchi, M., Campillos, M., Jensen, L.J., Gross, C., Gavin, A.C. & Bork, P.
Mol Syst Biol. 2013;9:663. doi: 10.1038/msb.2013.10.
Side effect similarities of drugs have recently been employed to predict new drug targets, and networks of side effects and targets have been used to better understand the mechanism of action of drugs. Here, we report a large-scale analysis to systematically predict and characterize proteins that cause drug side effects. We integrated phenotypic data obtained during clinical trials with known drug-target relations to identify overrepresented protein-side effect combinations. Using independent data, we confirm that most of these overrepresentations point to proteins which, when perturbed, cause side effects. Of 1428 side effects studied, 732 were predicted to be predominantly caused by individual proteins, at least 137 of them backed by existing pharmacological or phenotypic data. We prove this concept in vivo by confirming our prediction that activation of the serotonin 7 receptor (HTR7) is responsible for hyperesthesia in mice, which, in turn, can be prevented by a drug that selectively inhibits HTR7. Taken together, we show that a large fraction of complex drug side effects are mediated by individual proteins and create a reference for such relations.
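The overrepresentation step described above can be illustrated with a one-sided hypergeometric (Fisher-type) test. This is a minimal sketch only: the abstract does not state which statistic the authors used, and the function name, argument names and example counts below are hypothetical.

```python
from math import comb


def overrepresentation_p(overlap, with_target, with_effect, total_drugs):
    """One-sided hypergeometric p-value for a protein/side-effect pair.

    Returns P(X >= overlap), where X is the number of drugs that both hit
    the protein and show the side effect, if `with_effect` drugs were drawn
    at random from `total_drugs` drugs, of which `with_target` hit the protein.
    (Illustrative choice of test; not taken from the paper.)
    """
    upper = min(with_target, with_effect)
    total = comb(total_drugs, with_effect)
    tail = sum(
        comb(with_target, i) * comb(total_drugs - with_target, with_effect - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 100 drugs, 10 hit the protein, 8 show the side effect,
# and 5 drugs do both. A small p-value flags the pair as overrepresented.
p = overrepresentation_p(overlap=5, with_target=10, with_effect=8, total_drugs=100)
print(f"p = {p:.3g}")
```

In a screen over many protein/side-effect pairs, such p-values would still need correction for multiple testing before any pair is called overrepresented.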
Characterization of drug-induced transcriptional modules: towards drug repositioning and functional understanding.
Iskar, M., Zeller, G., Blattmann, P., Campillos, M., Kuhn, M., Kaminska, K.H., Runz, H., Gavin, A.C., Pepperkok, R., van Noort, V. & Bork, P.
Mol Syst Biol. 2013;9:662. doi: 10.1038/msb.2013.20.
In pharmacology, it is crucial to understand the complex biological responses that drugs elicit in the human organism and how well they can be inferred from model organisms. We therefore identified a large set of drug-induced transcriptional modules from genome-wide microarray data of drug-treated human cell lines and rat liver, and first characterized their conservation. Over 70% of these modules were common for multiple cell lines and 15% were conserved between the human in vitro and the rat in vivo system. We then illustrate the utility of conserved and cell-type-specific drug-induced modules by predicting and experimentally validating (i) gene functions, e.g., 10 novel regulators of cellular cholesterol homeostasis and (ii) new mechanisms of action for existing drugs, thereby providing a starting point for drug repositioning, e.g., novel cell cycle inhibitors and new modulators of alpha-adrenergic receptor, peroxisome proliferator-activated receptor and estrogen receptor. Taken together, the identified modules reveal the conservation of transcriptional responses towards drugs across cell types and organisms, and improve our understanding of both the molecular basis of drug action and human biology.
Cell type-specific nuclear pores: a case in point for context-dependent stoichiometry of molecular machines.
Ori, A., Banterle, N., Iskar, M., Andres-Pons, A., Escher, C., Khanh Bui, H., Sparks, L., Solis-Mezarino, V., Rinner, O., Bork, P., Lemke, E.A. & Beck, M.
Mol Syst Biol. 2013 Mar 19;9:648. doi: 10.1038/msb.2013.4.
To understand the structure and function of large molecular machines, accurate knowledge of their stoichiometry is essential. In this study, we developed an integrated targeted proteomics and super-resolution microscopy approach to determine the absolute stoichiometry of the human nuclear pore complex (NPC), possibly the largest eukaryotic protein complex. We show that the human NPC has a previously unanticipated stoichiometry that varies across cancer cell types, tissues and in disease. Using large-scale proteomics, we provide evidence that more than one third of the known, well-defined nuclear protein complexes display a similar cell type-specific variation of their subunit stoichiometry. Our data point to compositional rearrangement as a widespread mechanism for adapting the functions of molecular machines toward cell type-specific constraints and context-dependent needs, and highlight the need of deeper investigation of such structural variants.
Genomic variation landscape of the human gut microbiome.
Schloissnig, S., Arumugam, M., Sunagawa, S., Mitreva, M., Tap, J., Zhu, A., Waller, A., Mende, D.R., Kultima, J.R., Martin, J., Kota, K., Sunyaev, S.R., Weinstock, G.M. & Bork, P.
Nature. 2013 Jan 3;493(7430):45-50. doi: 10.1038/nature11711. Epub 2012 Dec 5.
Whereas large-scale efforts have rapidly advanced the understanding and practical impact of human genomic variation, the practical impact of variation is largely unexplored in the human microbiome. We therefore developed a framework for metagenomic variation analysis and applied it to 252 faecal metagenomes of 207 individuals from Europe and North America. Using 7.4 billion reads aligned to 101 reference species, we detected 10.3 million single nucleotide polymorphisms (SNPs), 107,991 short insertions/deletions, and 1,051 structural variants. The average ratio of non-synonymous to synonymous polymorphism rates of 0.11 was more variable between gut microbial species than across human hosts. Subjects sampled at varying time intervals exhibited individuality and temporal stability of SNP variation patterns, despite considerable composition changes of their gut microbiota. This indicates that individual-specific strains are not easily replaced and that an individual might have a unique metagenomic genotype, which may be exploitable for personalized diet or drug intake.
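The ratio of non-synonymous to synonymous polymorphism rates mentioned above is commonly computed as pN/pS, with each rate normalised by the number of sites of that type. The sketch below assumes this standard definition; the function name and the counts are made up for illustration and are not data from the study.

```python
def pn_ps(nonsyn_snps, nonsyn_sites, syn_snps, syn_sites):
    """Ratio of non-synonymous to synonymous polymorphism rates (pN/pS).

    pN = non-synonymous SNPs per non-synonymous site
    pS = synonymous SNPs per synonymous site
    """
    if nonsyn_sites <= 0 or syn_sites <= 0:
        raise ValueError("site counts must be positive")
    if syn_snps == 0:
        raise ValueError("pN/pS is undefined without synonymous SNPs")
    p_n = nonsyn_snps / nonsyn_sites
    p_s = syn_snps / syn_sites
    return p_n / p_s


# Hypothetical gene: 11 non-synonymous SNPs over 3000 non-synonymous sites,
# 30 synonymous SNPs over 900 synonymous sites.
ratio = pn_ps(11, 3000, 30, 900)
print(round(ratio, 2))
```

A ratio well below 1, as in this made-up example, is usually read as a sign of purifying selection acting on the gene.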
DvD: An R/Cytoscape pipeline for drug repurposing using public repositories of gene expression data.
Pacini, C., Iorio, F., Goncalves, E., Iskar, M., Klabunde, T., Bork, P. & Saez-Rodriguez, J.
Bioinformatics. 2013 Jan 1;29(1):132-4. doi: 10.1093/bioinformatics/bts656. Epub 2012 Nov 4.
SUMMARY: Drug versus Disease (DvD) provides a pipeline, available through R or Cytoscape, for the comparison of drug and disease gene expression profiles from public microarray repositories. Negatively correlated profiles can be used to generate hypotheses of drug-repurposing, whereas positively correlated profiles may be used to infer side effects of drugs. DvD allows users to compare drug and disease signatures with dynamic access to databases Array Express, Gene Expression Omnibus and data from the Connectivity Map. Availability and implementation: R package (submitted to Bioconductor) under GPL 3 and Cytoscape plug-in freely available for download at www.ebi.ac.uk/saezrodriguez/DVD/.
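The idea of comparing drug and disease profiles by the sign of their correlation can be sketched in a few lines. This is not DvD's scoring method, which the summary does not specify; Pearson correlation, the 0.5 threshold, the function names and the toy profiles are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("profiles must not be constant")
    return cov / sqrt(var_x * var_y)


def interpret(drug_profile, disease_profile, threshold=0.5):
    """Label a drug/disease pair by the sign and strength of correlation."""
    r = pearson(drug_profile, disease_profile)
    if r <= -threshold:
        return "repurposing candidate"   # drug reverses the disease signature
    if r >= threshold:
        return "possible side effect"    # drug mimics the disease signature
    return "no clear relation"


# Toy log fold-changes for four genes (hypothetical values).
disease = [2.0, -1.0, 0.5, -1.5]
drug_a = [-2.0, 1.0, -0.5, 1.5]    # mirrors the disease profile
drug_b = [1.0, -0.5, 0.25, -0.75]  # follows the disease profile
print(interpret(drug_a, disease))
print(interpret(drug_b, disease))
```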
Human monogenic disease genes have frequently functionally redundant paralogs.
Chen, W.H., Zhao, X.M., van Noort, V. & Bork, P.
PLoS Comput Biol. 2013 May;9(5):e1003073. doi: 10.1371/journal.pcbi.1003073. Epub 2013 May 16.
Mendelian disorders are often caused by mutations in genes that are not lethal but induce functional distortions leading to diseases. Here we study the extent to which gene duplicates might compensate for genes causing monogenic diseases. We provide evidence for pervasive functional redundancy of human monogenic disease genes (MDs) by duplicates, showing that (1) genes involved in human genetic disorders are enriched in duplicates and (2) duplicated disease genes tend to have higher functional similarities with their closest paralogs than duplicated non-disease genes of similar age. We propose that functional compensation by duplication of genes masks the phenotypic effects of deleterious mutations and reduces the probability of purging the defective genes from the human population; this functional compensation could be further enhanced by stronger purifying selection between disease genes and their duplicates, as well as their orthologous counterparts, compared to non-disease genes. However, due to the intrinsic expression stochasticity among individuals, the deleterious mutations could still be present as genetic diseases in some subpopulations where the duplicate copies are expressed at low abundances. Consequently, the defective genes are linked to genetic disorders while they continue propagating within the population. Our results provide insight into the molecular basis underlying the spreading of duplicated disease genes.
Orthologous gene clusters and taxon signature genes for viruses of prokaryotes.
Kristensen, D.M., Waller, A.S., Yamada, T., Bork, P., Mushegian, A.R. & Koonin, E.V.
J Bacteriol. 2013 Mar;195(5):941-50. doi: 10.1128/JB.01801-12. Epub 2012 Dec 7.
Viruses are the most abundant biological entities on earth and encompass a vast amount of genetic diversity. The recent rapid increase in the number of sequenced viral genomes has created unprecedented opportunities for gaining new insight into the structure and evolution of the virosphere. Here, we present an update of the phage orthologous groups (POGs), a collection of 4,542 clusters of orthologous genes from bacteriophages that now also includes viruses infecting archaea and encompasses more than 1,000 distinct virus genomes. Analysis of this expanded data set shows that the number of POGs keeps growing without saturation and that a substantial majority of the POGs remain specific to viruses, lacking homologues in prokaryotic cells, outside known proviruses. Thus, the great majority of virus genes apparently remains to be discovered. A complementary observation is that numerous viral genomes remain poorly, if at all, covered by POGs. The genome coverage by POGs is expected to increase as more genomes are sequenced. Taxon-specific, single-copy signature genes that are not observed in prokaryotic genomes outside detected proviruses were identified for two-thirds of the 57 taxa (those with genomes available from at least 3 distinct viruses), with half of these present in all members of the respective taxon. These signatures can be used to specifically identify the presence and quantify the abundance of viruses from particular taxa in metagenomic samples and thus gain new insights into the ecology and evolution of viruses in relation to their hosts.
The microbiome explored: recent insights and future challenges.
Blaser, M., Bork, P., Fraser, C., Knight, R. & Wang, J.
Nat Rev Microbiol. 2013 Mar;11(3):213-7. doi: 10.1038/nrmicro2973. Epub 2013 Feb 4.
One of the most exciting scientific advances in recent years has been the realization that commensal microorganisms are not simple 'passengers' in our bodies, but instead have key roles in our physiology, including our immune responses and metabolism, as well as in disease. These insights have been obtained, in part, through the work of large-scale, consortium-driven metagenomic projects. Here, five experts in the field of microbiome research discuss the most surprising and exciting new findings, and outline the future steps that will be necessary to elucidate the numerous roles of the microbiota in human health and disease and to develop viable therapeutic strategies.
Consistent mutational paths predict eukaryotic thermostability.
van Noort, V., Bradatsch, B., Arumugam, M., Amlacher, S., Bange, G., Creevey, C., Falk, S., Mende, D.R., Sinning, I., Hurt, E. & Bork, P.
BMC Evol Biol. 2013 Jan 10;13:7. doi: 10.1186/1471-2148-13-7.
BACKGROUND: Proteomes of thermophilic prokaryotes have been instrumental in structural biology and successfully exploited in biotechnology; however, many proteins required for eukaryotic cell function are absent from bacteria or archaea. With Chaetomium thermophilum, Thielavia terrestris and Thielavia heterothallica, three genome sequences of thermophilic eukaryotes have been published. RESULTS: Studying the genomes and proteomes of these thermophilic fungi, we found common strategies of thermal adaptation across the different kingdoms of Life, including amino acid biases and a reduced genome size. A phylogenetics-guided comparison of thermophilic proteomes with those of other, mesophilic Sordariomycetes revealed consistent amino acid substitutions associated with thermophily that were also present in an independent lineage of thermophilic fungi. The most consistent pattern is the substitution of lysine by arginine, which we could find in almost all lineages but which has not been extensively used in protein stability engineering. By exploiting mutational paths towards the thermophiles, we could predict particular amino acid residues in individual proteins that contribute to thermostability and validated some of them experimentally. By determining the three-dimensional structure of an exemplar protein from C. thermophilum (Arx1), we could also characterise the molecular consequences of some of these mutations. CONCLUSIONS: The comparative analysis of these three genomes not only enhances our understanding of the evolution of thermophily, but also provides new ways to engineer protein stability.
Role for urea in nitrification by polar marine Archaea.
Alonso-Saez, L., Waller, A.S., Mende, D.R., Bakker, K., Farnelid, H., Yager, P.L., Lovejoy, C., Tremblay, J.E., Potvin, M., Heinrich, F., Estrada, M., Riemann, L., Bork, P., Pedros-Alio, C. & Bertilsson, S.
Proc Natl Acad Sci U S A. 2012 Oct 30;109(44):17989-94. doi: 10.1073/pnas.1201914109. Epub 2012 Oct 1.
Despite the high abundance of Archaea in the global ocean, their metabolism and biogeochemical roles remain largely unresolved. We investigated the population dynamics and metabolic activity of Thaumarchaeota in polar environments, where these microorganisms are particularly abundant and exhibit seasonal growth. Thaumarchaeota were more abundant in deep Arctic and Antarctic waters and grew throughout the winter at surface and deeper Arctic halocline waters. However, in situ single-cell activity measurements revealed a low activity of this group in the uptake of both leucine and bicarbonate (<5% Thaumarchaeota cells active), which is inconsistent with known heterotrophic and autotrophic thaumarchaeal lifestyles. These results suggested the existence of alternative sources of carbon and energy. Our analysis of an environmental metagenome from the Arctic winter revealed that Thaumarchaeota had pathways for ammonia oxidation and, unexpectedly, an abundance of genes involved in urea transport and degradation. Quantitative PCR analysis confirmed that most polar Thaumarchaeota had the potential to oxidize ammonia, and a large fraction of them had urease genes, enabling the use of urea to fuel nitrification. Thaumarchaeota from Arctic deep waters had a higher abundance of urease genes than those near the surface suggesting genetic differences between closely related archaeal populations. In situ measurements of urea uptake and concentration in Arctic waters showed that small-sized prokaryotes incorporated the carbon from urea, and the availability of urea was often higher than that of ammonium. Therefore, the degradation of urea may be a relevant pathway for Thaumarchaeota and other microorganisms exposed to the low-energy conditions of dark polar waters.
MOCAT: a metagenomics assembly and gene prediction toolkit.
Kultima, J.R., Sunagawa, S., Li, J., Chen, W., Chen, H., Mende, D.R., Arumugam, M., Pan, Q., Liu, B., Qin, J., Wang, J. & Bork, P.
PLoS One. 2012;7(10):e47656. doi: 10.1371/journal.pone.0047656. Epub 2012 Oct 17.
MOCAT is a highly configurable, modular pipeline for fast, standardized processing of single or paired-end sequencing data generated by the Illumina platform. The pipeline uses state-of-the-art programs to quality control, map, and assemble reads from metagenomic samples sequenced at a depth of several billion base pairs, and predict protein-coding genes on assembled metagenomes. Mapping against reference databases allows for read extraction or removal, as well as abundance calculations. Relevant statistics for each processing step can be summarized into multi-sheet Excel documents and queryable SQL databases. MOCAT runs on UNIX machines and integrates seamlessly with the SGE and PBS queuing systems, commonly used to process large datasets. The open source code and modular architecture allow users to modify or exchange the programs that are utilized in the various processing steps. Individual processing steps and parameters were benchmarked and tested on artificial, real, and simulated metagenomes resulting in an improvement of selected quality metrics. MOCAT can be freely downloaded at http://www.bork.embl.de/mocat/.
Drug discovery in the age of systems biology: the rise of computational approaches for data integration.
Iskar, M., Zeller, G., Zhao, X.M., van Noort, V. & Bork, P.
Curr Opin Biotechnol. 2012 Aug;23(4):609-16. doi: 10.1016/j.copbio.2011.11.010. Epub 2011 Dec 5.
The increased availability of large-scale open-access resources on bioactivities of small molecules has a significant impact on pharmacology, facilitated mainly by computational approaches that digest the vast amounts of data. We discuss here how computational data integration enables systemic views on a drug's action and makes it possible to tackle complex problems such as the large-scale prediction of drug targets, drug repurposing, molecular mechanisms, cellular responses or side effects. We particularly focus on computational methods that leverage various cell-based transcriptional, proteomic and phenotypic profiles of drug response in order to gain a systemic view of drug action at the molecular, cellular and whole-organism scale.
The human small intestinal microbiota is driven by rapid uptake and conversion of simple carbohydrates.
Zoetendal, E.G., Raes, J., van den Bogert, B., Arumugam, M., Booijink, C.C., Troost, F.J., Bork, P., Wels, M., de Vos, W.M. & Kleerebezem, M.
ISME J. 2012 Jul;6(7):1415-26. doi: 10.1038/ismej.2011.212. Epub 2012 Jan 19.
The human gastrointestinal tract (GI tract) harbors a complex community of microbes. The microbiota composition varies between different locations in the GI tract, but most studies focus on the fecal microbiota, and that inhabiting the colonic mucosa. Consequently, little is known about the microbiota at other parts of the GI tract, which is especially true for the small intestine because of its limited accessibility. Here we deduce an ecological model of the microbiota composition and function in the small intestine, using complementing culture-independent approaches. Phylogenetic microarray analyses demonstrated that microbiota compositions that are typically found in effluent samples from ileostomists (subjects without a colon) can also be encountered in the small intestine of healthy individuals. Phylogenetic mapping of small intestinal metagenome of three different ileostomy effluent samples from a single individual indicated that Streptococcus sp., Escherichia coli, Clostridium sp. and high G+C organisms are most abundant in the small intestine. The compositions of these populations fluctuated in time and correlated to the short-chain fatty acids profiles that were determined in parallel. Comparative functional analysis with fecal metagenomes identified functions that are overrepresented in the small intestine, including simple carbohydrate transport phosphotransferase systems (PTS), central metabolism and biotin production. Moreover, metatranscriptome analysis supported high level in-situ expression of PTS and carbohydrate metabolic genes, especially those belonging to Streptococcus sp. Overall, our findings suggest that rapid uptake and fermentation of available carbohydrates contribute to maintaining the microbiota in the human small intestine.
Younger genes are less likely to be essential than older genes, and duplicates are less likely to be essential than singletons of the same age.
Chen, W.H., Trachana, K., Lercher, M.J. & Bork, P.
Mol Biol Evol. 2012 Jul;29(7):1703-6. doi: 10.1093/molbev/mss014. Epub 2012 Jan 19.
Recently duplicated genes are believed to often overlap in function and expression. A priori, they are thus less likely to be essential. Although this was indeed observed in yeast, mouse singletons and duplicates were reported to be equally often essential. This contradiction can only partly be explained by experimental biases. We herein show that older genes (i.e., genes with earlier phyletic origin) are more likely to be essential, regardless of their duplication status. At a given phyletic gene age, duplicates are always less likely to be essential compared with singletons. The "paradoxical" high essentiality among mouse gene duplicates is then caused by different age profiles of singletons and duplicates, with the latter tending to be derived from older genes.
Annotation of the M. tuberculosis hypothetical orfeome: adding functional information to more than half of the uncharacterized proteins.
Doerks, T., van Noort, V., Minguez, P. & Bork, P.
PLoS One. 2012;7(4):e34302. doi: 10.1371/journal.pone.0034302. Epub 2012 Apr 2.
The genome of Mycobacterium tuberculosis (H37Rv) contains 4,019 protein coding genes, of which more than thousand have been categorized as 'hypothetical' implying that for these not even weak functional associations could be identified so far. We here predict reliable functional indications for half of this large hypothetical orfeome: 497 genes can be annotated based on orthology, and another 125 can be linked to interacting proteins via integrated genomic context analysis and literature mining. The assignments include newly identified clusters of interacting proteins, hypothetical genes that are associated to well known pathways and putative disease-relevant targets. All together, we have raised the fraction of the proteome with at least some functional annotation to 88% which should considerably enhance the interpretation of large-scale experiments targeting this medically important organism.
A 15q24 microdeletion in transient myeloproliferative disease (TMD) and acute megakaryoblastic leukaemia (AMKL) implicates PML and SUMO3 in the leukaemogenesis of TMD/AMKL.
Haemmerling, S., Behnisch, W., Doerks, T., Korbel, J.O., Bork, P., Moog, U., Hentze, S., Grasshoff, U., Bonin, M., Riess, O., Janssen, J.W., Jauch, A., Bartram, C.R., Reinhardt, D., Koch, K.A., Bandapalli, O.R. & Kulozik, A.E.
Br J Haematol. 2012 Apr;157(2):180-7. doi: 10.1111/j.1365-2141.2012.09028.x. Epub 2012 Feb 1.
Transient myeloproliferative disorder (TMD) of the newborn and acute megakaryoblastic leukaemia (AMKL) in children with Down syndrome (DS) represent paradigmatic models of leukaemogenesis. Chromosome 21 gene dosage effects and truncating mutations of the X-chromosomal transcription factor GATA1 synergize to trigger TMD and AMKL in most patients. Here, we report the occurrence of TMD, which spontaneously remitted and later progressed to AMKL in a patient without DS but with a distinct dysmorphic syndrome. Genetic analysis of the leukaemic clone revealed somatic trisomy 21 and a truncating GATA1 mutation. The analysis of the patient's normal blood cell DNA on a genomic single nucleotide polymorphism (SNP) array revealed a de novo germ line 2.58 Mb 15q24 microdeletion including 41 known genes encompassing the tumour suppressor PML. Genomic context analysis of proteins encoded by genes that are included in the microdeletion, chromosome 21-encoded proteins and GATA1 suggests that the microdeletion may trigger leukaemogenesis by disturbing the balance of a hypothetical regulatory network of normal megakaryopoiesis involving PML, SUMO3 and GATA1. The 15q24 microdeletion may thus represent the first genetic hit to initiate leukaemogenesis and implicates PML and SUMO3 as novel components of the leukaemogenic network in TMD/AMKL.
Assessment of metagenomic assembly using simulated next generation sequencing data.
Mende, D.R., Waller, A.S., Sunagawa, S., Jarvelin, A.I., Chan, M.M., Arumugam, M., Raes, J. & Bork, P.
PLoS One. 2012;7(2):e31386. Epub 2012 Feb 23.
Due to the complexity of the protocols and a limited knowledge of the nature of microbial communities, simulating metagenomic sequences plays an important role in testing the performance of existing tools and data analysis methods with metagenomic data. We developed metagenomic read simulators with platform-specific (Sanger, pyrosequencing, Illumina) base-error models, and simulated metagenomes of differing community complexities. We first evaluated the effect of rigorous quality control on Illumina data. Although quality filtering removed a large proportion of the data, it greatly improved the accuracy and contig lengths of resulting assemblies. We then compared the quality-trimmed Illumina assemblies to those from Sanger and pyrosequencing. For the simple community (10 genomes) all sequencing technologies assembled a similar amount and accurately represented the expected functional composition. For the more complex community (100 genomes) Illumina produced the best assemblies and more correctly resembled the expected functional composition. For the most complex community (400 genomes) there was very little assembly of reads from any sequencing technology. However, due to the longer read length the Sanger reads still represented the overall functional composition reasonably well. We further examined the effect of scaffolding of contigs using paired-end Illumina reads. It dramatically increased contig lengths of the simple community and yielded minor improvements to the more complex communities. Although the increase in contig length was accompanied by increased chimericity, it resulted in more complete genes and a better characterization of the functional repertoire. The metagenomic simulators developed for this research are freely available.
The Tenth Asia Pacific Bioinformatics Conference (APBC 2012). Introduction.
Chen, Y.P. & Bork, P.
BMC Genomics. 2012;13 Suppl 1:I1. doi: 10.1186/1471-2164-13-S1-I1. Epub 2012 Jan 17.
SMART 7: recent updates to the protein domain annotation resource.
Letunic, I., Doerks, T. & Bork, P.
Nucleic Acids Res. 2012 Jan;40(Database issue):D302-5. doi: 10.1093/nar/gkr931. Epub 2011 Nov 3.
SMART (Simple Modular Architecture Research Tool) is an online resource (http://smart.embl.de/) for the identification and annotation of protein domains and the analysis of protein domain architectures. SMART version 7 contains manually curated models for 1009 protein domains, 200 more than in the previous version. The current release introduces several novel features and a streamlined user interface resulting in a faster and more comfortable workflow. The underlying protein databases were greatly expanded, resulting in a 2-fold increase in number of annotated domains and features. The database of completely sequenced genomes now includes 1133 species, compared to 630 in the previous release. Domain architecture analysis results can now be exported and visualized through the iTOL phylogenetic tree viewer. 'metaSMART' was introduced as a novel subresource dedicated to the exploration and analysis of domain architectures in various metagenomics data sets. An advanced full text search engine was implemented, covering the complete annotations for SMART and Pfam domains, as well as the complete set of protein descriptions, allowing users to quickly find relevant information.
STITCH 3: zooming in on protein-chemical interactions.
Kuhn, M., Szklarczyk, D., Franceschini, A., von Mering, C., Jensen, L.J. & Bork, P.
Nucleic Acids Res. 2012 Jan;40(Database issue):D876-80. doi: 10.1093/nar/gkr1011. Epub 2011 Nov 9.
To facilitate the study of interactions between proteins and chemicals, we have created STITCH, an aggregated database of interactions connecting over 300,000 chemicals and 2.6 million proteins from 1133 organisms. Compared to the previous version, the number of chemicals with interactions and the number of high-confidence interactions both increase 4-fold. The database can be accessed interactively through a web interface, displaying interactions in an integrated network view. It is also available for computational studies through downloadable files and an API. As an extension in the current version, we offer the option to switch between two levels of detail, namely whether stereoisomers of a given compound are shown as a merged entity or as separate entities. Separate display of stereoisomers is necessary, for example, for carbohydrates and chiral drugs. Combining the isomers increases the coverage, as interaction databases and publications found through text mining will often refer to compounds without specifying the stereoisomer. The database is accessible at http://stitch.embl.de/.
OGEE: an online gene essentiality database.
Chen, W.H., Minguez, P., Lercher, M.J. & Bork, P.
Nucleic Acids Res. 2012 Jan;40(Database issue):D901-6. doi: 10.1093/nar/gkr986. Epub 2011 Nov 10.
OGEE is an Online GEne Essentiality database. Its main purpose is to enhance our understanding of the essentiality of genes. This is achieved by collecting not only experimentally tested essential and non-essential genes, but also associated gene features such as expression profiles, duplication status, conservation across species, evolutionary origins and involvement in embryonic development. We focus on large-scale experiments and complement our data with text-mining results. Genes are organized into data sets according to their sources. Genes with variable essentiality status across data sets are tagged as conditionally essential, highlighting the complex interplay between gene functions and environments. Linked tools allow the user to compare gene essentiality among different gene groups, or compare features of essential genes to non-essential genes, and visualize the results. OGEE is freely available at http://ogeedb.embl.de.
eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges.
Powell, S., Szklarczyk, D., Trachana, K., Roth, A., Kuhn, M., Muller, J., Arnold, R., Rattei, T., Letunic, I., Doerks, T., Jensen, L.J., von Mering, C. & Bork, P.
Nucleic Acids Res. 2012 Jan;40(Database issue):D284-9. Epub 2011 Nov 16.
Orthologous relationships form the basis of most comparative genomic and metagenomic studies and are essential for proper phylogenetic and functional analyses. The third version of the eggNOG database (http://eggnog.embl.de) contains non-supervised orthologous groups constructed from 1133 organisms, doubling the number of genes with orthology assignment compared to eggNOG v2. The new release is the result of a number of improvements and expansions: (i) the underlying homology searches are now based on the SIMAP database; (ii) the orthologous groups have been extended to 41 levels of selected taxonomic ranges enabling much more fine-grained orthology assignments; and (iii) the newly designed web page is considerably faster with more functionality. In total, eggNOG v3 contains 721,801 orthologous groups, encompassing a total of 4,396,591 genes. Additionally, we updated 4873 and 4850 original COGs and KOGs, respectively, to include all 1133 organisms. At the universal level, covering all three domains of life, 101,208 orthologous groups are available, while the others are applicable at 40 more limited taxonomic ranges. Each group is amended by multiple sequence alignments and maximum-likelihood trees and broad functional descriptions are provided for 450,904 orthologous groups (62.5%).
InterPro in 2011: new developments in the family and domain prediction database.
Hunter, S., Jones, P., Mitchell, A., Apweiler, R., Attwood, T.K., Bateman, A., Bernard, T., Binns, D., Bork, P., Burge, S., de Castro, E., Coggill, P., Corbett, M., Das, U., Daugherty, L., Duquenne, L., Finn, R.D., Fraser, M., Gough, J., Haft, D., Hulo, N., Kahn, D., Kelly, E., Letunic, I., Lonsdale, D., Lopez, R., Madera, M., Maslen, J., McAnulla, C., McDowall, J., McMenamin, C., Mi, H., Mutowo-Muellenet, P., Mulder, N., Natale, D., Orengo, C., Pesseat, S., Punta, M., Quinn, A.F., Rivoire, C., Sangrador-Vegas, A., Selengut, J.D., Sigrist, C.J., Scheremetjew, M., Tate, J., Thimmajanarthanan, M., Thomas, P.D., Wu, C.H., Yeats, C. & Yong, S.Y.
Nucleic Acids Res. 2012 Jan;40(Database issue):D306-12. Epub 2011 Nov 16.
InterPro (http://www.ebi.ac.uk/interpro/) is a database that integrates diverse information about protein families, domains and functional sites, and makes it freely available to the public via Web-based interfaces and services. Central to the database are diagnostic models, known as signatures, against which protein sequences can be searched to determine their potential function. InterPro has utility in the large-scale analysis of whole genomes and meta-genomes, as well as in characterizing individual protein sequences. Herein we give an overview of new developments in the database and its associated software since 2009, including updates to database content, curation processes and Web and programmatic interfaces.
Transcription start site associated RNAs in bacteria.
Yus, E., Guell, M., Vivancos, A.P., Chen, W.H., Lluch-Senar, M., Delgado, J., Gavin, A.C., Bork, P. & Serrano, L.
Mol Syst Biol. 2012 May 22;8:585. doi: 10.1038/msb.2012.16.
Here, we report the genome-wide identification of small RNAs associated with transcription start sites (TSSs), termed tssRNAs, in Mycoplasma pneumoniae. tssRNAs were also found to be present in a different bacterial phyla, Escherichia coli. Similar to the recently identified promoter-associated tiny RNAs (tiRNAs) in eukaryotes, tssRNAs are associated with active promoters. Evidence suggests that these tssRNAs are distinct from previously described abortive transcription RNAs. tssRNAs have an average size of 45 bases and map exactly to the beginning of cognate full-length transcripts and to cryptic TSSs. Expression of bacterial tssRNAs requires factors other than the standard RNA polymerase holoenzyme. We have found that the RNA polymerase is halted at tssRNA positions in vivo, which may indicate that a pausing mechanism exists to prevent transcription in the absence of genes. These results suggest that small RNAs associated with TSSs could be a universal feature of bacterial transcription.
Prediction and identification of sequences coding for orphan enzymes using genomic and metagenomic neighbours.
Yamada, T., Waller, A.S., Raes, J., Zelezniak, A., Perchat, N., Perret, A., Salanoubat, M., Patil, K.R., Weissenbach, J. & Bork, P.
Mol Syst Biol. 2012 May 8;8:581. doi: 10.1038/msb.2012.13.
Despite the current wealth of sequencing data, one-third of all biochemically characterized metabolic enzymes lack a corresponding gene or protein sequence, and as such can be considered orphan enzymes. They represent a major gap between our molecular and biochemical knowledge, and consequently are not amenable to modern systemic analyses. As 555 of these orphan enzymes have metabolic pathway neighbours, we developed a global framework that utilizes the pathway and (meta)genomic neighbour information to assign candidate sequences to orphan enzymes. For 131 orphan enzymes (37% of those for which (meta)genomic neighbours are available), we associate sequences to them using scoring parameters with an estimated accuracy of 70%, implying functional annotation of 16 345 gene sequences in numerous (meta)genomes. As a case in point, two of these candidate sequences were experimentally validated to encode the predicted activity. In addition, we augmented the currently available genome-scale metabolic models with these new sequence-function associations and were able to expand the models by on average 8%, with a considerable change in the flux connectivity patterns and improved essentiality prediction.
Cross-talk between phosphorylation and lysine acetylation in a genome-reduced bacterium.
van Noort, V., Seebacher, J., Bader, S., Mohammed, S., Vonkova, I., Betts, M.J., Kuhner, S., Kumar, R., Maier, T., O'Flaherty, M., Rybin, V., Schmeisky, A., Yus, E., Stulke, J., Serrano, L., Russell, R.B., Heck, A.J., Bork, P. & Gavin, A.C.
Mol Syst Biol. 2012 Feb 28;8:571. doi: 10.1038/msb.2012.4.
Protein post-translational modifications (PTMs) represent important regulatory states that when combined have been hypothesized to act as molecular codes and to generate a functional diversity beyond genome and transcriptome. We systematically investigate the interplay of protein phosphorylation with other post-transcriptional regulatory mechanisms in the genome-reduced bacterium Mycoplasma pneumoniae. Systematic perturbations by deletion of its only two protein kinases and its unique protein phosphatase identified not only the protein-specific effect on the phosphorylation network, but also a modulation of proteome abundance and lysine acetylation patterns, mostly in the absence of transcriptional changes. Reciprocally, deletion of the two putative N-acetyltransferases affects protein phosphorylation, confirming cross-talk between the two PTMs. The measured M. pneumoniae phosphoproteome and lysine acetylome revealed that both PTMs are very common, that (as in Eukaryotes) they often co-occur within the same protein and that they are frequently observed at interaction interfaces and in multifunctional proteins. The results imply previously unreported hidden layers of post-transcriptional regulation intertwining phosphorylation with lysine acetylation and other mechanisms that define the functional state of a cell.
Deciphering a global network of functionally associated post-translational modifications.
Minguez, P., Parca, L., Diella, F., Mende, D.R., Kumar, R., Helmer-Citterich, M., Gavin, A.C., van Noort, V. & Bork, P.
Mol Syst Biol. 2012 Jul 17;8:599. doi: 10.1038/msb.2012.31.
Various post-translational modifications (PTMs) fine-tune the functions of almost all eukaryotic proteins, and co-regulation of different types of PTMs has been shown within and between a number of proteins. Aiming at a more global view of the interplay between PTM types, we collected modifications for 13 frequent PTM types in 8 eukaryotes, compared their speed of evolution and developed a method for measuring PTM co-evolution within proteins based on the co-occurrence of sites across eukaryotes. As many sites are still to be discovered, this is a considerable underestimate, yet, assuming that most co-evolving PTMs are functionally associated, we found that PTM types are vastly interconnected, forming a global network that comprise in human alone >50 000 residues in about 6000 proteins. We predict substantial PTM type interplay in secreted and membrane-associated proteins and in the context of particular protein domains and short-linear motifs. The global network of co-evolving PTM types implies a complex and intertwined post-translational regulation landscape that is likely to regulate multiple functional states of many if not all eukaryotic proteins.
Prediction of drug combinations by integrating molecular and pharmacological data.
Zhao, X.M., Iskar, M., Zeller, G., Kuhn, M., van Noort, V. & Bork, P.
PLoS Comput Biol. 2011 Dec;7(12):e1002323. Epub 2011 Dec 29.
Combinatorial therapy is a promising strategy for combating complex disorders due to improved efficacy and reduced side effects. However, screening new drug combinations exhaustively is impractical considering all possible combinations between drugs. Here, we present a novel computational approach to predict drug combinations by integrating molecular and pharmacological data. Specifically, drugs are represented by a set of their properties, such as their targets or indications. By integrating several of these features, we show that feature patterns enriched in approved drug combinations are not only predictive for new drug combinations but also provide insights into mechanisms underlying combinatorial therapy. Further analysis confirmed that among our top ranked predictions of effective combinations, 69% are supported by literature, while the others represent novel potential drug combinations. We believe that our proposed approach can help to limit the search space of drug combinations and provide a new way to effectively utilize existing drugs for new purposes.
A holistic approach to marine eco-systems biology.
Karsenti, E., Acinas, S.G., Bork, P., Bowler, C., De Vargas, C., Raes, J., Sullivan, M., Arendt, D., Benzoni, F., Claverie, J.M., Follows, M., Gorsky, G., Hingamp, P., Iudicone, D., Jaillon, O., Kandels-Lewis, S., Krzic, U., Not, F., Ogata, H., Pesant, S., Reynaud, E.G., Sardet, C., Sieracki, M.E., Speich, S., Velayoudon, D., Weissenbach, J. & Wincker, P.
PLoS Biol. 2011 Oct;9(10):e1001177. doi: 10.1371/journal.pbio.1001177. Epub 2011 Oct 18.
The structure, robustness, and dynamics of ocean plankton ecosystems remain poorly understood due to sampling, analysis, and computational limitations. The Tara Oceans consortium organizes expeditions to help fill this gap at the global level.
Orthology prediction methods: a quality assessment using curated protein families.
Trachana, K., Larsson, T.A., Powell, S., Chen, W.H., Doerks, T., Muller, J. & Bork, P.
Bioessays. 2011 Oct;33(10):769-80. doi: 10.1002/bies.201100062. Epub 2011 Aug 19.
The increasing number of sequenced genomes has prompted the development of several automated orthology prediction methods. Tests to evaluate the accuracy of predictions and to explore biases caused by biological and technical factors are therefore required. We used 70 manually curated families to analyze the performance of five public methods in Metazoa. We analyzed the strengths and weaknesses of the methods and quantified the impact of biological and technical challenges. From the latter part of the analysis, genome annotation emerged as the largest single influencer, affecting up to 30% of the performance. Generally, most methods did well in assigning orthologous group but they failed to assign the exact number of genes for half of the groups. The publicly available benchmark set (http://eggnog.embl.de/orthobench/) should facilitate the improvement of current orthology assignment protocols, which is of utmost importance for many fields of biology and should be tackled by a broad scientific community.
Insight into structure and assembly of the nuclear pore complex by utilizing the genome of a eukaryotic thermophile.
Amlacher, S., Sarges, P., Flemming, D., van Noort, V., Kunze, R., Devos, D.P., Arumugam, M., Bork, P. & Hurt, E.
Cell. 2011 Jul 22;146(2):277-89.
Despite decades of research, the structure and assembly of the nuclear pore complex (NPC), which is composed of approximately 30 nucleoporins (Nups), remain elusive. Here, we report the genome of the thermophilic fungus Chaetomium thermophilum (ct) and identify the complete repertoire of Nups therein. The thermophilic proteins show improved properties for structural and biochemical studies compared to their mesophilic counterparts, and purified ctNups enabled the reconstitution of the inner pore ring module that spans the width of the NPC from the anchoring membrane to the central transport channel. This module is composed of two large Nups, Nup192 and Nup170, which are flexibly bridged by short linear motifs made up of linker Nups, Nic96 and Nup53. This assembly illustrates how Nup interactions can generate structural plasticity within the NPC scaffold. Our findings therefore demonstrate the utility of the genome of a thermophilic eukaryote for studying complex molecular machines.
iPath2.0: interactive pathway explorer.
Yamada, T., Letunic, I., Okuda, S., Kanehisa, M. & Bork, P.
Nucleic Acids Res. 2011 Jul;39(Web Server issue):W412-5. Epub 2011 May 5.
iPath2.0 is a web-based tool (http://pathways.embl.de) for the visualization and analysis of cellular pathways. Its primary map summarizes the metabolism in biological systems as annotated to date. Nodes in the map correspond to various chemical compounds and edges represent series of enzymatic reactions. In two other maps, iPath2.0 provides an overview of secondary metabolite biosynthesis and a hand-picked selection of important regulatory pathways and other functional modules, allowing a more general overview of protein functions in a genome or metagenome. iPath2.0's main interface is an interactive Flash-based viewer, which allows users to easily navigate and explore the complex pathway maps. In addition to the default pre-computed overview maps, iPath offers several data mapping tools. Users can upload various types of data and completely customize all nodes and edges of iPath2.0's maps. These customized maps give users an intuitive overview of their own data, guiding the analysis of various genomics and metagenomics projects.
Interactive Tree Of Life v2: online annotation and display of phylogenetic trees made easy.
Letunic, I. & Bork, P.
Nucleic Acids Res. 2011 Jul;39(Web Server issue):W475-8. Epub 2011 Apr 5.
Interactive Tree Of Life (http://itol.embl.de) is a web-based tool for the display, manipulation and annotation of phylogenetic trees. It is freely available and open to everyone. In addition to classical tree viewer functions, iTOL offers many novel ways of annotating trees with various additional data. Current version introduces numerous new features and greatly expands the number of supported data set types. Trees can be interactively manipulated and edited. A free personal account system is available, providing management and sharing of trees in user defined workspaces and projects. Export to various bitmap and vector graphics formats is supported. Batch access interface is available for programmatic access or inclusion of interactive trees into other web services.
Enterotypes of the human gut microbiome.
Arumugam, M., Raes, J., Pelletier, E., Le Paslier, D., Yamada, T., Mende, D.R., Fernandes, G.R., Tap, J., Bruls, T., Batto, J.M., Bertalan, M., Borruel, N., Casellas, F., Fernandez, L., Gautier, L., Hansen, T., Hattori, M., Hayashi, T., Kleerebezem, M., Kurokawa, K., Leclerc, M., Levenez, F., Manichanh, C., Nielsen, H.B., Nielsen, T., Pons, N., Poulain, J., Qin, J., Sicheritz-Ponten, T., Tims, S., Torrents, D., Ugarte, E., Zoetendal, E.G., Wang, J., Guarner, F., Pedersen, O., de Vos, W.M., Brunak, S., Dore, J., Antolin, M., Artiguenave, F., Blottiere, H.M., Almeida, M., Brechot, C., Cara, C., Chervaux, C., Cultrone, A., Delorme, C., Denariaz, G., Dervyn, R., Foerstner, K.U., Friss, C., van de Guchte, M., Guedon, E., Haimet, F., Huber, W., van Hylckama-Vlieg, J., Jamet, A., Juste, C., Kaci, G., Knol, J., Lakhdari, O., Layec, S., Le Roux, K., Maguin, E., Merieux, A., Melo Minardi, R., M'rini, C., Muller, J., Oozeer, R., Parkhill, J., Renault, P., Rescigno, M., Sanchez, N., Sunagawa, S., Torrejon, A., Turner, K., Vandemeulebrouck, G., Varela, E., Winogradsky, Y., Zeller, G., Weissenbach, J., Ehrlich, S.D. & Bork, P.
Nature. 2011 May 12;473(7346):174-80. Epub 2011 Apr 20.
Our knowledge of species and functional composition of the human gut microbiome is rapidly increasing, but it is still based on very few cohorts and little is known about variation across the world. By combining 22 newly sequenced faecal metagenomes of individuals from four countries with previously published data sets, here we identify three robust clusters (referred to as enterotypes hereafter) that are not nation or continent specific. We also confirmed the enterotypes in two published, larger cohorts, indicating that intestinal microbiota variation is generally stratified, not continuous. This indicates further the existence of a limited number of well-balanced host-microbial symbiotic states that might respond differently to diet and drug intake. The enterotypes are mostly driven by species composition, but abundant molecular functions are not necessarily provided by abundant species, highlighting the importance of a functional analysis to understand microbial communities. Although individual host properties such as body mass index, age, or gender cannot explain the observed enterotypes, data-driven marker genes or functional modules can be identified for each of these host properties. For example, twelve genes significantly correlate with age and three functional modules with the body mass index, hinting at a diagnostic potential of microbial markers.
The ecoresponsive genome of Daphnia pulex.
Colbourne, J.K., Pfrender, M.E., Gilbert, D., Thomas, W.K., Tucker, A., Oakley, T.H., Tokishita, S., Aerts, A., Arnold, G.J., Basu, M.K., Bauer, D.J., Caceres, C.E., Carmel, L., Casola, C., Choi, J.H., Detter, J.C., Dong, Q., Dusheyko, S., Eads, B.D., Frohlich, T., Geiler-Samerotte, K.A., Gerlach, D., Hatcher, P., Jogdeo, S., Krijgsveld, J., Kriventseva, E.V., Kultz, D., Laforsch, C., Lindquist, E., Lopez, J., Manak, J.R., Muller, J., Pangilinan, J., Patwardhan, R.P., Pitluck, S., Pritham, E.J., Rechtsteiner, A., Rho, M., Rogozin, I.B., Sakarya, O., Salamov, A., Schaack, S., Shapiro, H., Shiga, Y., Skalitzky, C., Smith, Z., Souvorov, A., Sung, W., Tang, Z., Tsuchiya, D., Tu, H., Vos, H., Wang, M., Wolf, Y.I., Yamagata, H., Yamada, T., Ye, Y., Shaw, J.R., Andrews, J., Crease, T.J., Tang, H., Lucas, S.M., Robertson, H.M., Bork, P., Koonin, E.V., Zdobnov, E.M., Grigoriev, I.V., Lynch, M. & Boore, J.L.
Science. 2011 Feb 4;331(6017):555-61. doi: 10.1126/science.1197761.
We describe the draft genome of the microcrustacean Daphnia pulex, which is only 200 megabases and contains at least 30,907 genes. The high gene count is a consequence of an elevated rate of gene duplication resulting in tandem gene clusters. More than a third of Daphnia's genes have no detectable homologs in any other available proteome, and the most amplified gene families are specific to the Daphnia lineage. The coexpansion of gene families interacting within metabolic pathways suggests that the maintenance of duplicated genes is not random, and the analysis of gene expression under different environmental conditions reveals that numerous paralogs acquire divergent expression patterns soon after duplication. Daphnia-specific genes, including many additional loci within sequenced regions that are otherwise devoid of annotations, are the most responsive genes to ecological challenges.
The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored.
Szklarczyk, D., Franceschini, A., Kuhn, M., Simonovic, M., Roth, A., Minguez, P., Doerks, T., Stark, M., Muller, J., Bork, P., Jensen, L.J. & von Mering, C.
Nucleic Acids Res. 2011 Jan;39(Database issue):D561-8. Epub 2010 Nov 2.
An essential prerequisite for any systems-level understanding of cellular functions is to correctly uncover and annotate all functional interactions among proteins in the cell. Toward this goal, remarkable progress has been made in recent years, both in terms of experimental measurements and computational prediction techniques. However, public efforts to collect and present protein interaction information have struggled to keep up with the pace of interaction discovery, partly because protein-protein interaction information can be error-prone and require considerable effort to annotate. Here, we present an update on the online database resource Search Tool for the Retrieval of Interacting Genes (STRING); it provides uniquely comprehensive coverage and ease of access to both experimental as well as predicted interaction information. Interactions in STRING are provided with a confidence score, and accessory information such as protein domains and 3D structures is made available, all within a stable and consistent identifier space. New features in STRING include an interactive network viewer that can cluster networks on demand, updated on-screen previews of structural information including homology models, extensive data updates and strongly improved connectivity and integration with third-party resources. Version 9.0 of STRING covers more than 1100 completely sequenced organisms; the resource can be reached at http://string-db.org.
Network neighbors of drug targets contribute to drug side-effect similarity.
Brouwers, L., Iskar, M., Zeller, G., van Noort, V. & Bork, P.
PLoS One. 2011;6(7):e22187. Epub 2011 Jul 13.
In pharmacology, it is essential to identify the molecular mechanisms of drug action in order to understand adverse side effects. These adverse side effects have been used to infer whether two drugs share a target protein. However, side-effect similarity of drugs could also be caused by their target proteins being close in a molecular network, which as such could cause similar downstream effects. In this study, we investigated the proportion of side-effect similarities that is due to targets that are close in the network compared to shared drug targets. We found that only a minor fraction of side-effect similarities (5.8%) are caused by drugs targeting proteins close in the network, compared to side-effect similarities caused by overlapping drug targets (64%). Moreover, these targets that cause similar side effects are more often in a linear part of the network, having two or fewer interactions, than drug targets in general. Based on the examples, we gained novel insight into the molecular mechanisms of side effects associated with several drug targets. Looking forward, such analyses will be extremely useful in the process of drug development to better understand adverse side effects.
SmashCell: a software framework for the analysis of single-cell amplified genome sequences.
Harrington, E.D., Arumugam, M., Raes, J., Bork, P. & Relman, D.A.
Bioinformatics. 2010 Dec 1;26(23):2979-80. Epub 2010 Oct 21.
SUMMARY: Recent advances in single-cell manipulation technology, whole genome amplification and high-throughput sequencing have now made it possible to sequence the genome of an individual cell. The bioinformatic analysis of these genomes, however, is far more complicated than the analysis of those generated using traditional, culture-based methods. In order to simplify this analysis, we have developed SmashCell (Simple Metagenomics Analysis SHell for sequences from single Cells). It is designed to automate the main steps in microbial genome analysis (assembly, gene prediction, functional annotation) in a way that allows parameter and algorithm exploration at each step in the process. It also manages the data created by these analyses and provides visualization methods for rapid analysis of the results. AVAILABILITY: The SmashCell source code and a comprehensive manual are available at http://asiago.stanford.edu/SmashCell. CONTACT: email@example.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
SmashCommunity: a metagenomic annotation and analysis tool.
Arumugam, M., Harrington, E.D., Foerstner, K.U., Raes, J. & Bork, P.
Bioinformatics. 2010 Dec 1;26(23):2977-8. Epub 2010 Oct 19.
SUMMARY: SmashCommunity is a stand-alone metagenomic annotation and analysis pipeline suitable for data from Sanger and 454 sequencing technologies. It supports state-of-the-art software for essential metagenomic tasks such as assembly and gene prediction. It provides tools to estimate the quantitative phylogenetic and functional compositions of metagenomes, to compare compositions of multiple metagenomes and to produce intuitive visual representations of such analyses. AVAILABILITY: SmashCommunity source code and documentation are available at http://www.bork.embl.de/software/smash CONTACT: firstname.lastname@example.org SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
A systematic screen for protein-lipid interactions in Saccharomyces cerevisiae.
Gallego, O., Betts, M.J., Gvozdenovic-Jeremic, J., Maeda, K., Matetzki, C., Aguilar-Gurrieri, C., Beltran-Alvarez, P., Bonn, S., Fernandez-Tornero, C., Jensen, L.J., Kuhn, M., Trott, J., Rybin, V., Muller, C.W., Bork, P., Kaksonen, M., Russell, R.B. & Gavin, A.C.
Mol Syst Biol. 2010 Nov 30;6:430. doi: 10.1038/msb.2010.87.
Protein-metabolite networks are central to biological systems, but are incompletely understood. Here, we report a screen to catalog protein-lipid interactions in yeast. We used arrays of 56 metabolites to measure lipid-binding fingerprints of 172 proteins, including 91 with predicted lipid-binding domains. We identified 530 protein-lipid associations, the majority of which are novel. To show the data set's biological value, we studied further several novel interactions with sphingolipids, a class of conserved bioactive lipids with an elusive mode of action. Integration of live-cell imaging suggests new cellular targets for these molecules, including several with pleckstrin homology (PH) domains. Validated interactions with Slm1, a regulator of actin polarization, show that PH domains can have unexpected lipid-binding specificities and can act as coincidence sensors for both phosphatidylinositol phosphates and phosphorylated sphingolipids.
A human gut microbial gene catalogue established by metagenomic sequencing.
Qin, J., Li, R., Raes, J., Arumugam, M., Burgdorf, K.S., Manichanh, C., Nielsen, T., Pons, N., Levenez, F., Yamada, T., Mende, D.R., Li, J., Xu, J., Li, S., Li, D., Cao, J., Wang, B., Liang, H., Zheng, H., Xie, Y., Tap, J., Lepage, P., Bertalan, M., Batto, J.M., Hansen, T., Le Paslier, D., Linneberg, A., Nielsen, H.B., Pelletier, E., Renault, P., Sicheritz-Ponten, T., Turner, K., Zhu, H., Yu, C., Li, S., Jian, M., Zhou, Y., Li, Y., Zhang, X., Li, S., Qin, N., Yang, H., Wang, J., Brunak, S., Dore, J., Guarner, F., Kristiansen, K., Pedersen, O., Parkhill, J., Weissenbach, J., Bork, P., Ehrlich, S.D. & Wang, J.
Nature. 2010 Mar 4;464(7285):59-65.
To understand the impact of gut microbes on human health and well-being it is crucial to assess their genetic potential. Here we describe the Illumina-based metagenomic sequencing, assembly and characterization of 3.3 million non-redundant microbial genes, derived from 576.7 gigabases of sequence, from faecal samples of 124 European individuals. The gene set, approximately 150 times larger than the human gene complement, contains an overwhelming majority of the prevalent (more frequent) microbial genes of the cohort and probably includes a large proportion of the prevalent human intestinal microbial genes. The genes are largely shared among individuals of the cohort. Over 99% of the genes are bacterial, indicating that the entire cohort harbours between 1,000 and 1,150 prevalent bacterial species and each individual at least 160 such species, which are also largely shared. We define and describe the minimal gut metagenome and the minimal gut bacterial genome in terms of functions present in all individuals and most bacteria, respectively.
High-resolution transcription atlas of the mitotic cell cycle in budding yeast.
Granovskaia, M.V., Jensen, L.J., Ritchie, M.E., Toedling, J., Ning, Y., Bork, P., Huber, W. & Steinmetz, L.M.
Genome Biol. 2010 Mar 1;11(3):R24.
BACKGROUND: Extensive transcription of non-coding RNAs has been detected in eukaryotic genomes and is thought to constitute an additional layer in the regulation of gene expression. Despite this role, their transcription through the cell cycle has not been studied; genome-wide approaches have only focused on protein-coding genes. To explore the complex transcriptome architecture underlying the budding yeast cell cycle, we used 8 bp tiling arrays to generate a 5 minute-resolution, strand-specific expression atlas of the whole genome. RESULTS: We discovered 523 antisense transcripts, of which 80 cycle or are located opposite periodically expressed mRNAs, 135 unannotated intergenic non-coding RNAs, of which 11 cycle, and 109 cell-cycle-regulated protein-coding genes that had not previously been shown to cycle. We detected periodic expression coupling of sense and antisense transcript pairs, including antisense transcripts opposite of key cell-cycle regulators, like FAR1 and TAF2. CONCLUSIONS: Our dataset presents the most comprehensive resource to date on gene expression during the budding yeast cell cycle, revealing both protein-coding and non-coding RNA periodicity of expression and the first that profiles non-annotated RNAs. It enables hypothesis-driven mechanistic studies concerning the functions of non-coding RNAs.
Evolution and regulation of cellular periodic processes: a role for paralogues.
Trachana, K., Jensen, L.J. & Bork, P.
EMBO Rep. 2010 Mar;11(3):233-8. Epub 2010 Feb 19.
Several cyclic processes take place within a single organism. For example, the cell cycle is coordinated with the 24 h diurnal rhythm in animals and plants, and with the 40 min ultradian rhythm in budding yeast. To examine the evolution of periodic gene expression during these processes, we performed the first systematic comparison in three organisms (Homo sapiens, Arabidopsis thaliana and Saccharomyces cerevisiae) by using public microarray data. We observed that although diurnal-regulated and ultradian-regulated genes are not generally cell-cycle-regulated, they tend to have cell-cycle-regulated paralogues. Thus, diverged temporal expression of paralogues seems to facilitate cellular orchestration under different periodic stimuli. Lineage-specific functional repertoires of periodic-associated paralogues imply that this mode of regulation might have evolved independently in several organisms.
Ancient animal microRNAs and the evolution of tissue identity.
Christodoulou, F., Raible, F., Tomer, R., Simakov, O., Trachana, K., Klaus, S., Snyman, H., Hannon, G.J., Bork, P. & Arendt, D.
Nature. 2010 Feb 25;463(7284):1084-8. Epub 2010 Jan 31.
The spectacular escalation in complexity in early bilaterian evolution correlates with a strong increase in the number of microRNAs. To explore the link between the birth of ancient microRNAs and body plan evolution, we set out to determine the ancient sites of activity of conserved bilaterian microRNA families in a comparative approach. We reason that any specific localization shared between protostomes and deuterostomes (the two major superphyla of bilaterian animals) should probably reflect an ancient specificity of that microRNA in their last common ancestor. Here, we investigate the expression of conserved bilaterian microRNAs in Platynereis dumerilii, a protostome retaining ancestral bilaterian features, in Capitella, another marine annelid, in the sea urchin Strongylocentrotus, a deuterostome, and in sea anemone Nematostella, representing an outgroup to the bilaterians. Our comparative data indicate that the oldest known animal microRNA, miR-100, and the related miR-125 and let-7 were initially active in neurosecretory cells located around the mouth. Other sets of ancient microRNAs were first present in locomotor ciliated cells, specific brain centres, or, more broadly, one of four major organ systems: central nervous system, sensory tissue, musculature and gut. These findings reveal that microRNA evolution and the establishment of tissue identities were closely coupled in bilaterian evolution. Also, they outline a minimum set of cell types and tissues that existed in the protostome-deuterostome ancestor.
AQUA: automated quality improvement for multiple sequence alignments.
Muller, J., Creevey, C.J., Thompson, J.D., Arendt, D. & Bork, P.
Bioinformatics. 2010 Jan 15;26(2):263-5. Epub 2009 Nov 19.
Multiple sequence alignment (MSA) is a central tool in most modern biology studies. However, despite generations of valuable tools, human experts are still able to improve automatically generated MSAs. In an effort to automatically identify the most reliable MSA for a given protein family, we propose a very simple protocol, named AQUA for 'Automated quality improvement for multiple sequence alignments'. Our current implementation relies on two alignment programs (MUSCLE and MAFFT), one refinement program (RASCAL) and one assessment program (NORMD), but other programs could be incorporated at any of the three steps. Availability: AQUA is implemented in Tcl/Tk and runs in command line on all platforms. The source code is available under the GNU GPL license. Source code, README and Supplementary data are available at http://www.bork.embl.de/Docu/AQUA.
Drug-induced regulation of target expression.
Iskar, M., Campillos, M., Kuhn, M., Jensen, L.J., van Noort, V. & Bork, P.
PLoS Comput Biol. 2010 Sep 9;6(9). pii: e1000925.
Drug perturbations of human cells lead to complex responses upon target binding. One of the known mechanisms is a (positive or negative) feedback loop that adjusts the expression level of the respective target protein. To quantify this mechanism systems-wide in an unbiased way, drug-induced differential expression of drug target mRNA was examined in three cell lines using the Connectivity Map. To overcome various biases in this valuable resource, we have developed a computational normalization and scoring procedure that is applicable to gene expression recording upon heterogeneous drug treatments. In 1290 drug-target relations, corresponding to 466 drugs acting on 167 drug targets studied, 8% of the targets are subject to regulation at the mRNA level. We confirmed systematically that in particular G-protein coupled receptors, when serving as known targets, are regulated upon drug treatment. We further newly identified drug-induced differential regulation of Lanosterol 14-alpha demethylase, Endoplasmin, DNA topoisomerase 2-alpha and Calmodulin 1. The feedback regulation in these and other targets is likely to be relevant for the success or failure of the molecular intervention.
Impact of genome reduction on bacterial metabolism and its regulation.
Yus, E., Maier, T., Michalodimitrakis, K., van Noort, V., Yamada, T., Chen, W.H., Wodke, J.A., Guell, M., Martinez, S., Bourgeois, R., Kuhner, S., Raineri, E., Letunic, I., Kalinina, O.V., Rode, M., Herrmann, R., Gutierrez-Gallego, R., Russell, R.B., Gavin, A.C., Bork, P. & Serrano, L.
Science. 2009 Nov 27;326(5957):1263-8.
To understand basic principles of bacterial metabolism organization and regulation, but also the impact of genome size, we systematically studied one of the smallest bacteria, Mycoplasma pneumoniae. A manually curated metabolic network of 189 reactions catalyzed by 129 enzymes allowed the design of a defined, minimal medium with 19 essential nutrients. More than 1300 growth curves were recorded in the presence of various nutrient concentrations. Measurements of biomass indicators, metabolites, and 13C-glucose experiments provided information on directionality, fluxes, and energetics; integration with transcription profiling enabled the global analysis of metabolic regulation. Compared with more complex bacteria, the M. pneumoniae metabolic network has a more linear topology and contains a higher fraction of multifunctional enzymes; general features such as metabolite concentrations, cellular energetics, adaptability, and global gene expression responses are similar, however.
Transcriptome complexity in a genome-reduced bacterium.
Guell, M., van Noort, V., Yus, E., Chen, W.H., Leigh-Bell, J., Michalodimitrakis, K., Yamada, T., Arumugam, M., Doerks, T., Kuhner, S., Rode, M., Suyama, M., Schmidt, S., Gavin, A.C., Bork, P. & Serrano, L.
Science. 2009 Nov 27;326(5957):1268-71.
To study basic principles of transcriptome organization in bacteria, we analyzed one of the smallest self-replicating organisms, Mycoplasma pneumoniae. We combined strand-specific tiling arrays, complemented by transcriptome sequencing, with more than 252 spotted arrays. We detected 117 previously undescribed, mostly noncoding transcripts, 89 of them in antisense configuration to known genes. We identified 341 operons, of which 139 are polycistronic; almost half of the latter show decaying expression in a staircase-like manner. Under various conditions, operons could be divided into 447 smaller transcriptional units, resulting in many alternative transcripts. Frequent antisense transcripts, alternative transcripts, and multiple regulators per gene imply a highly dynamic transcriptome, more similar to that of eukaryotes than previously thought.
Evolution of biomolecular networks: lessons from metabolic and protein interactions.
Yamada, T. & Bork, P.
Nat Rev Mol Cell Biol. 2009 Nov;10(11):791-803.
Despite only becoming popular at the beginning of this decade, biomolecular networks are now frameworks that facilitate many discoveries in molecular biology. The nodes of these networks are usually proteins (specifically enzymes in metabolic networks), whereas the links (or edges) are their interactions with other molecules. These networks are made up of protein-protein interactions or enzyme-enzyme interactions through shared metabolites in the case of metabolic networks. Evolutionary analysis has revealed that changes in the nodes and links in protein-protein interaction and metabolic networks are subject to different selection pressures owing to distinct topological features. However, many evolutionary constraints can be uncovered only if temporal and spatial aspects are included in the network analysis.
ASTD: The Alternative Splicing and Transcript Diversity database.
Koscielny, G., Le Texier, V., Gopalakrishnan, C., Kumanduri, V., Riethoven, J.J., Nardone, F., Stanley, E., Fallsehr, C., Hofmann, O., Kull, M., Harrington, E., Boue, S., Eyras, E., Plass, M., Lopez, F., Ritchie, W., Moucadel, V., Ara, T., Pospisil, H., Herrmann, A., Reich, J.G., Guigo, R., Bork, P., Doeberitz, M.K., Vilo, J., Hide, W., Apweiler, R., Thanaraj, T.A. & Gautheret, D.
Genomics. 2009 Mar;93(3):213-20. Epub 2008 Dec 24.
The Alternative Splicing and Transcript Diversity database (ASTD) gives access to a vast collection of alternative transcripts that integrate transcription initiation, polyadenylation and splicing variant data. Alternative transcripts are derived from the mapping of transcribed sequences to the complete human, mouse and rat genomes using an extension of the computational pipeline developed for the ASD (Alternative Splicing Database) and ATD (Alternative Transcript Diversity) databases, which are now superseded by ASTD. For the human genome, ASTD identifies splicing variants, transcription initiation variants and polyadenylation variants in 68%, 68% and 62% of the gene set, respectively, consistent with current estimates for transcription variation. Users can access ASTD through a variety of browsing and query tools, including expression state-based queries for the identification of tissue-specific isoforms. Participating laboratories have experimentally validated a subset of ASTD-predicted alternative splice forms and alternative polyadenylation forms that were not previously reported. The ASTD database can be accessed at http://www.ebi.ac.uk/astd.
Sequence-based feature prediction and annotation of proteins.
Juncker, A.S., Jensen, L.J., Pierleoni, A., Bernsel, A., Tress, M.L., Bork, P., von Heijne, G., Valencia, A., Ouzounis, C.A., Casadio, R. & Brunak, S.
Genome Biol. 2009 Feb 2;10(2):206.
A recent trend in computational methods for annotation of protein function is that many prediction tools are combined in complex workflows and pipelines to facilitate the analysis of feature combinations, for example, the entire repertoire of kinase-binding motifs in the human proteome.
SMART 6: recent updates and new developments.
Letunic, I., Doerks, T. & Bork, P.
Nucleic Acids Res. 2009 Jan;37(Database issue):D229-32. Epub 2008 Oct 31.
Simple modular architecture research tool (SMART) is an online tool (http://smart.embl.de/) for the identification and annotation of protein domains. It provides a user-friendly platform for the exploration and comparative study of domain architectures in both proteins and genes. The current release of SMART contains manually curated models for 784 protein domains. Recent developments were focused on further data integration and improving user friendliness. The underlying protein database based on completely sequenced genomes was greatly expanded and now includes 630 species, compared to 191 in the previous release. As an initial step towards integrating information on biological pathways into SMART, our domain annotations were extended with data on metabolic pathways and links to several pathways resources. The interaction network view was completely redesigned and is now available for more than 2 million proteins. In addition to the standard web access to the database, users can now query SMART using distributed annotation system (DAS) or through a simple object access protocol (SOAP) based web service.
InterPro: the integrative protein signature database.
Hunter, S., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Bork, P., Das, U., Daugherty, L., Duquenne, L., Finn, R.D., Gough, J., Haft, D., Hulo, N., Kahn, D., Kelly, E., Laugraud, A., Letunic, I., Lonsdale, D., Lopez, R., Madera, M., Maslen, J., McAnulla, C., McDowall, J., Mistry, J., Mitchell, A., Mulder, N., Natale, D., Orengo, C., Quinn, A.F., Selengut, J.D., Sigrist, C.J., Thimma, M., Thomas, P.D., Valentin, F., Wilson, D., Wu, C.H. & Yeats, C.
Nucleic Acids Res. 2009 Jan;37(Database issue):D211-5. Epub 2008 Oct 21.
The InterPro database (http://www.ebi.ac.uk/interpro/) integrates together predictive models or 'signatures' representing protein domains, families and functional sites from multiple, diverse source databases: Gene3D, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY and TIGRFAMs. Integration is performed manually and approximately half of the total approximately 58,000 signatures available in the source databases belong to an InterPro entry. Recently, we have started to also display the remaining un-integrated signatures via our web interface. Other developments include the provision of non-signature data, such as structural data, in new XML files on our FTP site, as well as the inclusion of matchless UniProtKB proteins in the existing match XML files. The web interface has been extended and now links out to the ADAN predicted protein-protein interaction database and the SPICE and Dasty viewers. The latest public release (v18.0) covers 79.8% of UniProtKB (v14.1) and consists of 16 549 entries. InterPro data may be accessed either via the web address above, via web services, by downloading files by anonymous FTP or by using the InterProScan search software (http://www.ebi.ac.uk/Tools/InterProScan/).
Proteome organization in a genome-reduced bacterium.
Kuhner, S., van Noort, V., Betts, M.J., Leo-Macias, A., Batisse, C., Rode, M., Yamada, T., Maier, T., Bader, S., Beltran-Alvarez, P., Castano-Diez, D., Chen, W.H., Devos, D., Guell, M., Norambuena, T., Racke, I., Rybin, V., Schmidt, A., Yus, E., Aebersold, R., Herrmann, R., Bottcher, B., Frangakis, A.S., Russell, R.B., Serrano, L., Bork, P. & Gavin, A.C.
Science. 2009 Nov 27;326(5957):1235-40. doi: 10.1126/science.1176343.
The genome of Mycoplasma pneumoniae is among the smallest found in self-replicating organisms. To study the basic principles of bacterial proteome organization, we used tandem affinity purification-mass spectrometry (TAP-MS) in a proteome-wide screen. The analysis revealed 62 homomultimeric and 116 heteromultimeric soluble protein complexes, of which the majority are novel. About a third of the heteromultimeric complexes show higher levels of proteome organization, including assembly into larger, multiprotein complex entities, suggesting sequential steps in biological processes, and extensive sharing of components, implying protein multifunctionality. Incorporation of structural models for 484 proteins, single-particle electron microscopy, and cellular electron tomograms provided supporting structural details for this proteome organization. The data set provides a blueprint of the minimal cellular machinery required for life.
Molecular eco-systems biology: towards an understanding of community function.
Raes, J. & Bork, P.
Nat Rev Microbiol. 2008 Sep;6(9):693-9. Epub 2008 Jun 30.
Systems-biology approaches, which are driven by genome sequencing and high-throughput functional genomics data, are revolutionizing single-cell-organism biology. With the advent of various high-throughput techniques that aim to characterize complete microbial ecosystems (metagenomics, meta-transcriptomics and meta-metabolomics), we propose that the time is ripe to consider molecular systems biology at the ecosystem level (eco-systems biology). Here, we discuss the necessary data types that are required to unite molecular microbiology and ecology to develop an understanding of community function and discuss the potential shortcomings of these approaches.
Evolution of the phospho-tyrosine signaling machinery in premetazoan lineages.
Pincus, D., Letunic, I., Bork, P. & Lim, W.A.
Proc Natl Acad Sci U S A. 2008 Jul 15;105(28):9680-4. Epub 2008 Jul 3.
Multicellular animals use a three-part molecular toolkit to mediate phospho-tyrosine signaling: Tyrosine kinases (TyrK), protein tyrosine phosphatases (PTP), and Src Homology 2 (SH2) domains function, respectively, as "writers," "erasers," and "readers" of phospho-tyrosine modifications. How did this system of three components evolve, given their interdependent function? Here, we examine the usage of these components in 41 eukaryotic genomes, including the newly sequenced genome of the choanoflagellate, Monosiga brevicollis, the closest known unicellular relative to metazoans. This analysis indicates that SH2 and PTP domains likely evolved earliest: a handful of these domains are found in premetazoan eukaryotes lacking tyrosine kinases, most likely to deal with limited tyrosine phosphorylation cross-catalyzed by promiscuous Ser/Thr kinases. Modern TyrK proteins, however, are only observed in two lineages, metazoans and choanoflagellates. These two lineages show a dramatic coexpansion of all three domain families. Concurrent expansion of the three domain families is consistent with a stepwise evolutionary model in which preexisting SH2 and PTP domains were of limited utility until the appearance of the TyrK domain in the last common ancestor of metazoans and choanoflagellates. The emergence of the full three-component signaling system, with its dramatically increased encoding potential, may have contributed to the advent of metazoan multicellularity.
KEGG Atlas mapping for global analysis of metabolic pathways.
Okuda, S., Yamada, T., Hamajima, M., Itoh, M., Katayama, T., Bork, P., Goto, S. & Kanehisa, M.
Nucleic Acids Res. 2008 Jul 1;36(Web Server issue):W423-6. Epub 2008 May 13.
KEGG Atlas is a new graphical interface to the KEGG suite of databases, especially to the systems information in the PATHWAY and BRITE databases. It currently consists of a single global map and an associated viewer for metabolism, covering about 120 KEGG metabolic pathway maps and about 10 BRITE hierarchies. The viewer allows the user to navigate and zoom the global map under the Ajax technology. The mapping of high-throughput experimental data onto the global map is the main use of KEGG Atlas. In the global metabolism map, the node (circle) is a chemical compound and the edge (line) is a set of reactions linked to a set of KEGG Orthology (KO) entries for enzyme genes. Once gene identifiers in different organisms are converted to the K number identifiers in the KO system, corresponding line segments can be highlighted in the global map, allowing the user to view genome sequence data as organism-specific pathways, gene expression data as up- or down-regulated pathways, etc. Once chemical compounds are converted to the C number identifiers in KEGG, metabolomics data can also be displayed in the global map. KEGG Atlas is available at http://www.genome.jp/kegg/atlas/.
iPath: interactive exploration of biochemical pathways and networks.
Letunic, I., Yamada, T., Kanehisa, M. & Bork, P.
Trends Biochem Sci. 2008 Mar;33(3):101-3. doi: 10.1016/j.tibs.2008.01.001. Epub 2008 Feb 13.
iPath is an open-access online tool (http://pathways.embl.de) for visualizing and analyzing metabolic pathways. An interactive viewer provides straightforward navigation through various pathways and enables easy access to the underlying chemicals and enzymes. Customized pathway maps can be generated and annotated using various external data. For example, by merging human genome data with two important gut commensals, iPath can pinpoint the complementarity of the host-symbiont metabolic capacities.
The genome of the choanoflagellate Monosiga brevicollis and the origin of metazoans.
King, N., Westbrook, M.J., Young, S.L., Kuo, A., Abedin, M., Chapman, J., Fairclough, S., Hellsten, U., Isogai, Y., Letunic, I., Marr, M., Pincus, D., Putnam, N., Rokas, A., Wright, K.J., Zuzow, R., Dirks, W., Good, M., Goodstein, D., Lemons, D., Li, W., Lyons, J.B., Morris, A., Nichols, S., Richter, D.J., Salamov, A., JGI Sequencing, Bork, P., Lim, W.A., Manning, G., Miller, W.T., McGinnis, W., Shapiro, H., Tjian, R., Grigoriev, I.V. & Rokhsar, D.
Nature. 2008 Feb 14;451(7180):783-8.
Choanoflagellates are the closest known relatives of metazoans. To discover potential molecular mechanisms underlying the evolution of metazoan multicellularity, we sequenced and analysed the genome of the unicellular choanoflagellate Monosiga brevicollis. The genome contains approximately 9,200 intron-rich genes, including a number that encode cell adhesion and signalling protein domains that are otherwise restricted to metazoans. Here we show that the physical linkages among protein domains often differ between M. brevicollis and metazoans, suggesting that abundant domain shuffling followed the separation of the choanoflagellate and metazoan lineages. The completion of the M. brevicollis genome allows us to reconstruct with increasing resolution the genomic changes that accompanied the origin of metazoans.
4DXpress: a database for cross-species expression pattern comparisons.
Haudry, Y., Berube, H., Letunic, I., Weeber, P.D., Gagneur, J., Girardot, C., Kapushesky, M., Arendt, D., Bork, P., Brazma, A., Furlong, E.E., Wittbrodt, J. & Henrich, T.
Nucleic Acids Res. 2008 Jan;36(Database issue):D847-53. Epub 2007 Oct 4.
In the major animal model species like mouse, fish or fly, detailed spatial information on gene expression over time can be acquired through whole mount in situ hybridization experiments. In these species, expression patterns of many genes have been studied and data has been integrated into dedicated model organism databases like ZFIN for zebrafish, MEPD for medaka, BDGP for Drosophila or GXD for mouse. However, a central repository that allows users to query and compare gene expression patterns across different species has not yet been established. Therefore, we have integrated expression patterns for zebrafish, Drosophila, medaka and mouse into a central public repository called 4DXpress (expression database in four dimensions). Users can query anatomy ontology-based expression annotations across species and quickly jump from one gene to the orthologues in other species. Genes are linked to public microarray data in ArrayExpress. We have mapped developmental stages between the species to be able to compare developmental time phases. We store the largest collection of gene expression patterns available to date in an individual resource, reflecting 16 505 annotated genes. 4DXpress will be an invaluable tool for developmental as well as for computational biologists interested in gene regulation and evolution. 4DXpress is available at http://ani.embl.de/4DXpress.
Enhanced function annotations for Drosophila serine proteases: a case study for systematic annotation of multi-member gene families.
Shah, P.K., Tripathi, L.P., Jensen, L.J., Gahnim, M., Mason, C., Furlong, E.E., Rodrigues, V., White, K.P., Bork, P. & Sowdhamini, R.
Gene. 2008 Jan 15;407(1-2):199-215. Epub 2007 Oct 15.
Systematically annotating function of enzymes that belong to large protein families encoded in a single eukaryotic genome is a very challenging task. We carried out such an exercise to annotate function for serine-protease family of the trypsin fold in Drosophila melanogaster, with an emphasis on annotating serine-protease homologues (SPHs) that may have lost their catalytic function. Our approach involves data mining and data integration to provide function annotations for 190 Drosophila gene products containing serine-protease-like domains, of which 35 are SPHs. This was accomplished by analysis of structure-function relationships, gene-expression profiles, large-scale protein-protein interaction data, literature mining and bioinformatic tools. We introduce functional residue clustering (FRC), a method that performs hierarchical clustering of sequences using properties of functionally important residues and utilizes correlation co-efficient as a quantitative similarity measure to transfer in vivo substrate specificities to proteases. We show that the efficiency of transfer of substrate-specificity information using this method is generally high. FRC was also applied on Drosophila proteases to assign putative competitive inhibitor relationships (CIRs). Microarray gene-expression data were utilized to uncover a large-scale and dual involvement of proteases in development and in immune response. We found specific recruitment of SPHs and proteases with CLIP domains in immune response, suggesting evolution of a new function for SPHs. We also suggest existence of separate downstream protease cascades for immune response against bacterial/fungal infections and parasite/parasitoid infections. We verify quality of our annotations using information from RNAi screens and other evidence types. Utilization of such multi-fold approaches results in 10-fold increase of function annotation for Drosophila serine proteases and demonstrates value in increasing annotations in multiple genomes.
NetworKIN: a resource for exploring cellular phosphorylation networks.
Linding, R., Jensen, L.J., Pasculescu, A., Olhovsky, M., Colwill, K., Bork, P., Yaffe, M.B. & Pawson, T.
Nucleic Acids Res. 2008 Jan;36(Database issue):D695-9. Epub 2007 Nov 2.
Protein kinases control cellular responses by phosphorylating specific substrates. Recent proteome-wide mapping of protein phosphorylation sites by mass spectrometry has discovered thousands of in vivo sites. Systematically assigning all 518 human kinases to all these sites is a challenging problem. The NetworKIN database (http://networkin.info) integrates consensus substrate motifs with context modelling for improved prediction of cellular kinase-substrate relations. Based on the latest human phosphoproteome from the Phospho.ELM and PhosphoSite databases, the resource offers insight into phosphorylation-modulated interaction networks. Here, we describe how NetworKIN can be used for both global and targeted molecular studies. Via the web interface users can query the database of precomputed kinase-substrate relations or obtain predictions on novel phosphoproteins. The database currently contains a predicted phosphorylation network with 20,224 site-specific interactions involving 3978 phosphoproteins and 73 human kinases from 20 families.
STITCH: interaction networks of chemicals and proteins.
Kuhn, M., von Mering, C., Campillos, M., Jensen, L.J. & Bork, P.
Nucleic Acids Res. 2008 Jan;36(Database issue):D684-8. Epub 2007 Dec 15.
The knowledge about interactions between proteins and small molecules is essential for the understanding of molecular and cellular functions. However, information on such interactions is widely dispersed across numerous databases and the literature. To facilitate access to this data, STITCH ('search tool for interactions of chemicals') integrates information about interactions from metabolic pathways, crystal structures, binding experiments and drug-target relationships. Inferred information from phenotypic effects, text mining and chemical structure similarity is used to predict relations between chemicals. STITCH further allows exploring the network of chemical relations, also in the context of associated binding proteins. Each proposed interaction can be traced back to the original data sources. Our database contains interaction information for over 68,000 different chemicals, including 2200 drugs, and connects them to 1.5 million genes across 373 genomes and their interactions contained in the STRING database. STITCH is available at http://stitch.embl.de/.
eggNOG: automated construction and annotation of orthologous groups of genes.
Jensen, L.J., Julien, P., Kuhn, M., von Mering, C., Muller, J., Doerks, T. & Bork, P.
Nucleic Acids Res. 2008 Jan;36(Database issue):D250-4. Epub 2007 Oct 16.
The identification of orthologous genes forms the basis for most comparative genomics studies. Existing approaches either lack functional annotation of the identified orthologous groups, hampering the interpretation of subsequent results, or are manually annotated and thus lag behind the rapid sequencing of new genomes. Here we present the eggNOG database ('evolutionary genealogy of genes: Non-supervised Orthologous Groups'), which contains orthologous groups constructed from Smith-Waterman alignments through identification of reciprocal best matches and triangular linkage clustering. Applying this procedure to 312 bacterial, 26 archaeal and 35 eukaryotic genomes yielded 43 582 coarse-grained orthologous groups of which 9724 are extended versions of those from the original COG/KOG database. We also constructed more fine-grained groups for selected subsets of organisms, such as the 19 914 mammalian orthologous groups. We automatically annotated our non-supervised orthologous groups with functional descriptions, which were derived by identifying common denominators for the genes based on their individual textual descriptions, annotated functional categories, and predicted protein domains. The orthologous groups in eggNOG contain 1 241 751 genes and provide at least a broad functional description for 77% of them. Users can query the resource for individual genes via a web interface or download the complete set of orthologous groups at http://eggnog.embl.de.
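The two clustering steps named in the eggNOG abstract above (reciprocal best matches, then triangular linkage) can be illustrated with a minimal Python sketch. It is not the eggNOG pipeline: the function names, the score dictionary and the toy genes are hypothetical, plain numbers stand in for Smith-Waterman scores, and ties are resolved by first occurrence.

```python
from itertools import combinations


def reciprocal_best_hits(scores):
    """Return gene pairs that are each other's best hit across species.

    scores maps (query, subject) -> alignment score; each gene is a
    (species, name) tuple. All names here are hypothetical.
    """
    best = {}  # (query gene, subject species) -> (score, subject gene)
    for (query, subject), score in scores.items():
        if query[0] == subject[0]:
            continue  # ignore within-species hits
        key = (query, subject[0])
        if key not in best or score > best[key][0]:
            best[key] = (score, subject)
    pairs = set()
    for (query, _species), (_score, subject) in best.items():
        back = best.get((subject, query[0]))
        if back is not None and back[1] == query:
            pairs.add(frozenset((query, subject)))
    return pairs


def find_triangles(pairs):
    """Return every set of three genes that are pairwise reciprocal best hits."""
    neighbours = {}
    for pair in pairs:
        a, b = tuple(pair)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    found = set()
    for pair in pairs:
        a, b = tuple(pair)
        for c in neighbours[a] & neighbours[b]:
            found.add(frozenset((a, b, c)))
    return found


def merge_triangles(triangles):
    """Merge triangles that share an edge (union-find) into groups."""
    triangles = list(triangles)
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edge_owner = {}
    for idx, tri in enumerate(triangles):
        for edge in combinations(sorted(tri), 2):
            if edge in edge_owner:
                parent[find(idx)] = find(edge_owner[edge])
            else:
                edge_owner[edge] = idx
    groups = {}
    for idx, tri in enumerate(triangles):
        groups.setdefault(find(idx), set()).update(tri)
    return sorted(sorted(members) for members in groups.values())


# Toy data: four species (A-D); a1/b1/c1/d1 form two triangles sharing
# the edge b1-c1, while a2/b2 are reciprocal best hits without a triangle.
a1, b1, c1, d1 = ("A", "a1"), ("B", "b1"), ("C", "c1"), ("D", "d1")
a2, b2 = ("A", "a2"), ("B", "b2")
example_scores = {}
for g1, g2, s in [(a1, b1, 100), (a1, c1, 90), (b1, c1, 95), (b1, d1, 80),
                  (c1, d1, 85), (a2, b2, 70), (a1, b2, 30), (a2, b1, 20)]:
    example_scores[(g1, g2)] = s
    example_scores[(g2, g1)] = s

rbh = reciprocal_best_hits(example_scores)
groups = merge_triangles(find_triangles(rbh))
print(groups)
```

In this toy input the pair a2/b2 is a reciprocal best match but never joins a group, because no third species closes a triangle around it.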
SuperTarget and Matador: resources for exploring drug-target relationships.
Gunther, S., Kuhn, M., Dunkel, M., Campillos, M., Senger, C., Petsalaki, E., Ahmed, J., Urdiales, E.G., Gewiess, A., Jensen, L.J., Schneider, R., Skoblo, R., Russell, R.B., Bourne, P.E., Bork, P. & Preissner, R.
Nucleic Acids Res. 2008 Jan;36(Database issue):D919-22. Epub 2007 Oct 16.
The molecular basis of drug action is often not well understood. This is partly because the very abundant and diverse information generated in the past decades on drugs is hidden in millions of medical articles or textbooks. Therefore, we developed a one-stop data warehouse, SuperTarget that integrates drug-related information about medical indication areas, adverse drug effects, drug metabolization, pathways and Gene Ontology terms of the target proteins. An easy-to-use query interface enables the user to pose complex queries, for example to find drugs that target a certain pathway, interacting drugs that are metabolized by the same cytochrome P450 or drugs that target the same protein but are metabolized by different enzymes. Furthermore, we provide tools for 2D drug screening and sequence comparison of the targets. The database contains more than 2500 target proteins, which are annotated with about 7300 relations to 1500 drugs; the vast majority of entries have pointers to the respective literature source. A subset of these drugs has been annotated with additional binding information and indirect interactions and is available as a separate resource called Matador. SuperTarget and Matador are available at http://insilico.charite.de/supertarget and http://matador.embl.de.
Selective maintenance of Drosophila tandemly arranged duplicated genes during evolution.
Quijano, C., Tomancak, P., Lopez-Marti, J., Suyama, M., Bork, P., Milan, M., Torrents, D. & Manzanares, M.
Genome Biol. 2008;9(12):R176. Epub 2008 Dec 16.
BACKGROUND: The physical organization and chromosomal localization of genes within genomes is known to play an important role in their function. Most genes arise by duplication and move along the genome by random shuffling of DNA segments. Higher order structuring of the genome occurs in eukaryotes, where groups of physically linked genes are co-expressed. However, the contribution of gene duplication to gene order has not been analyzed in detail, as it is believed that co-expression due to recent duplicates would obscure other domains of co-expression. RESULTS: We have catalogued ordered duplicated genes in Drosophila melanogaster, and found that one in five of all genes is organized as tandem arrays. Furthermore, among arrays that have been spatially conserved over longer periods than would be expected on the basis of random shuffling, a disproportionate number contain genes encoding developmental regulators. Using in situ gene expression data for more than half of the Drosophila genome, we find that genes in these conserved clusters are co-expressed to a much higher extent than other duplicated genes. CONCLUSIONS: These results reveal the existence of functional constraints in insects that retain copies of genes encoding developmental and regulatory proteins as neighbors, allowing their co-expression. This co-expression may be the result of shared cis-regulatory elements or a shared need for a specific chromatin structure. Our results highlight the association between genome architecture and the gene regulatory networks involved in the construction of the body plan.
Large gene overlaps in prokaryotic genomes: result of functional constraints or mispredictions?
Palleja, A., Harrington, E.D. & Bork, P.
BMC Genomics. 2008 Jul 15;9:335.
BACKGROUND: Across the fully sequenced microbial genomes there are thousands of examples of overlapping genes. Many of these are only a few nucleotides long and are thought to function by permitting the coordinated regulation of gene expression. However, there should also be selective pressure against long overlaps, as the existence of overlapping reading frames increases the risk of deleterious mutations. Here we examine the longest overlaps and assess whether they are the product of special functional constraints or of erroneous annotation. RESULTS: We analysed the genes that overlap by 60 bps or more among 338 fully-sequenced prokaryotic genomes. The likely functional significance of an overlap was determined by comparing each of the genes to its respective orthologs. If a gene showed a significantly different length from its orthologs it was considered unlikely to be functional and therefore the result of an error either in sequencing or gene prediction. Focusing on 715 co-directional overlaps longer than 60 bps, we classified the erroneous ones into five categories: i) 5'-end extension of the downstream gene due to either a mispredicted start codon or a frameshift at 5'-end of the gene (409 overlaps), ii) fragmentation of a gene caused by a frameshift (163), iii) 3'-end extension of the upstream gene due to either a frameshift at 3'-end of a gene or point mutation at the stop codon (68), iv) Redundant gene predictions (4), v) 5' & 3'-end extension which is a combination of i) and iii) (71). We also studied 75 divergent overlaps that could be classified as misannotations of group i). Nevertheless we found some convergent long overlaps (54) that might be true overlaps, although an important part of convergent overlaps could be classified as group iii) (124). CONCLUSION: Among the 968 overlaps larger than 60 bps which we analysed, we did not find a single real one among the co-directional and divergent orientations and concluded that there had been an excessive number of misannotations. Only convergent orientation seems to permit some long overlaps, although convergent overlaps are also hampered by misannotations. We propose a simple rule to flag these erroneous gene length predictions to facilitate automatic annotation.
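A flagging rule of the kind proposed in the abstract above might look like the following Python sketch. It is an illustration, not the authors' rule: the `Gene` class, the function names and the 20% length tolerance are assumptions (the abstract speaks only of a "significantly different length"). Only the 60 bp overlap threshold comes from the text.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Gene:
    name: str
    start: int   # 1-based, inclusive
    end: int     # inclusive
    strand: str  # "+" or "-"

    @property
    def length(self):
        return self.end - self.start + 1


def overlap_length(a, b):
    """Number of shared base pairs between two genes on the same replicon."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def orientation(a, b):
    """Classify a gene pair as co-directional, convergent or divergent."""
    first, second = sorted((a, b), key=lambda g: g.start)
    if first.strand == second.strand:
        return "co-directional"
    # "+" followed by "-": 3' ends face each other (convergent)
    return "convergent" if first.strand == "+" else "divergent"


def length_is_suspicious(gene, ortholog_lengths, tolerance=0.2):
    """Assumed criterion: length deviates from the ortholog median by > tolerance."""
    if not ortholog_lengths:
        return False  # nothing to compare against
    typical = median(ortholog_lengths)
    return abs(gene.length - typical) > tolerance * typical


def flag_overlap(a, b, orthologs, min_overlap=60, tolerance=0.2):
    """Return a report for overlaps of min_overlap bp or more, else None."""
    shared = overlap_length(a, b)
    if shared < min_overlap:
        return None
    suspects = [g.name for g in (a, b)
                if length_is_suspicious(g, orthologs.get(g.name, []), tolerance)]
    return {"orientation": orientation(a, b),
            "overlap": shared,
            "suspect_genes": suspects}


# Toy example: geneB looks 5'-extended compared with its orthologs.
upstream = Gene("geneA", 1, 900, "+")
downstream = Gene("geneB", 801, 1700, "+")
ortholog_lengths = {"geneA": [900, 897, 903], "geneB": [700, 690, 705]}
report = flag_overlap(upstream, downstream, ortholog_lengths)
print(report)
```

In this toy case the overlap is 100 bp and only geneB is flagged, since it is 900 bp long against an ortholog median of 700 bp.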
Circular reasoning rather than cyclic expression.
Jensen, L.J., de Lichtenberg, U., Jensen, T.S., Brunak, S. & Bork, P.
Genome Biol. 2008;9(6):403. Epub 2008 Jun 23.
A response to "Combined analysis reveals a core set of cycling genes" by Y Lu, S Mahony, PV Benos, R Rosenfeld, I Simon, LL Breeden and Z Bar-Joseph. Genome Biol 2007, 8:R146.
A nitrile hydratase in the eukaryote Monosiga brevicollis.
Foerstner, K.U., Doerks, T., Muller, J., Raes, J. & Bork, P.
PLoS One. 2008;3(12):e3976. Epub 2008 Dec 19.
Bacterial nitrile hydratases (NHases) are important industrial catalysts and waste water remediation tools. In a global computational screening of conventional and metagenomic sequence data for NHases, we detected the two usually separated NHase subunits fused in one protein of the choanoflagellate Monosiga brevicollis, a recently sequenced unicellular model organism from the closest sister group of Metazoa. This is the first time that an NHase is found in eukaryotes and the first time it is observed as a fusion protein. The presence of an intron, subunit fusion and expressed sequence tags covering parts of the gene exclude contamination and suggest a functional gene. Phylogenetic analyses and genomic context imply a probable ancient horizontal gene transfer (HGT) from proteobacteria. The newly discovered NHase might open biotechnological routes due to its unconventional structure, its new type of host and its apparent integration into eukaryotic protein networks.
Genome-wide experimental determination of barriers to horizontal gene transfer.
Sorek, R., Zhu, Y., Creevey, C.J., Francino, M.P., Bork, P. & Rubin, E.M.
Science. 2007 Nov 30;318(5855):1449-52. Epub 2007 Oct 18.
Horizontal gene transfer, in which genetic material is transferred from the genome of one organism to that of another, has been investigated in microbial species mainly through computational sequence analyses. To address the lack of experimental data, we studied the attempted movement of 246,045 genes from 79 prokaryotic genomes into Escherichia coli and identified genes that consistently fail to transfer. We studied the mechanisms underlying transfer inhibition by placing coding regions from different species under the control of inducible promoters. Our data suggest that toxicity to the host inhibited transfer regardless of the species of origin and that increased gene dosage and associated increased expression may be a predominant cause for transfer failure. Although these experimental studies examined transfer solely into E. coli, a computational analysis of gene-transfer rates across available bacterial and archaeal genomes supports that the barriers observed in our study are general across the tree of life.
Target-specific requirements for enhancers of decapping in miRNA-mediated gene silencing.
Eulalio, A., Rehwinkel, J., Stricker, M., Huntzinger, E., Yang, S.F., Doerks, T., Dorner, S., Bork, P., Boutros, M. & Izaurralde, E.
Genes Dev. 2007 Oct 15;21(20):2558-70. Epub 2007 Sep 27.
microRNAs (miRNAs) silence gene expression by suppressing protein production and/or by promoting mRNA decay. To elucidate how silencing is accomplished, we screened an RNA interference library for suppressors of miRNA-mediated regulation in Drosophila melanogaster cells. In addition to proteins known to be required for miRNA biogenesis and function (i.e., Drosha, Pasha, Dicer-1, AGO1, and GW182), the screen identified the decapping activator Ge-1 as being required for silencing by miRNAs. Depleting Ge-1 alone and/or in combination with other decapping activators (e.g., DCP1, EDC3, HPat, or Me31B) suppresses silencing of several miRNA targets, indicating that miRNAs elicit mRNA decapping. A comparison of gene expression profiles in cells depleted of AGO1 or of individual decapping activators shows that approximately 15% of AGO1-targets are also regulated by Ge-1, DCP1, and HPat, whereas 5% are dependent on EDC3 and LSm1-7. These percentages are underestimated because decapping activators are partially redundant. Furthermore, in the absence of active translation, some miRNA targets are stabilized, whereas others continue to be degraded in a miRNA-dependent manner. These findings suggest that miRNAs mediate post-transcriptional gene silencing by more than one mechanism.
Get the most out of your metagenome: computational analysis of environmental sequence data.
Raes, J., Foerstner, K.U. & Bork, P.
Curr Opin Microbiol. 2007 Oct;10(5):490-8. Epub 2007 Oct 23.
New advances in sequencing technologies bring random shotgun sequencing of ecosystems within reach of smaller labs, but the complexity of metagenomics data can be overwhelming. Recently, many novel computational tools have been developed to unravel ecosystem properties starting from fragmented sequences. In addition, the so-called 'comparative metagenomics' approaches have allowed the discovery of specific genomic and community adaptations to environmental factors. However, many of the parameters extracted from these data to describe the environment at hand (e.g. genomic features, functional complement, phylogenetic composition) are interdependent and influenced by technical aspects of sample preparation and data treatment, leading to various pitfalls during analysis. To avoid this and complement existing initiatives in data standards, we propose a minimal standard for metagenomics data analysis ('MINIMESS') to be able to take full advantage of the power of comparative metagenomics in understanding microbial life on earth.
Quantitative assessment of protein function prediction from metagenomics shotgun sequences.
Harrington, E.D., Singh, A.H., Doerks, T., Letunic, I., von Mering, C., Jensen, L.J., Raes, J. & Bork, P.
Proc Natl Acad Sci U S A. 2007 Aug 28;104(35):13913-8. Epub 2007 Aug 23.
To assess the potential of protein function prediction in environmental genomics data, we analyzed shotgun sequences from four diverse and complex habitats. Using homology searches as well as customized gene neighborhood methods that incorporate intergenic and evolutionary distances, we inferred specific functions for 76% of the 1.4 million predicted ORFs in these samples (83% when nonspecific functions are considered). Surprisingly, these fractions are only slightly smaller than the corresponding ones in completely sequenced genomes (83% and 86%, respectively, by using the same methodology) and considerably higher than previously thought. For as many as 75,448 ORFs (5% of the total), only neighborhood methods can assign functions, illustrated here by a previously undescribed gene associated with the well characterized heme biosynthesis operon and a potential transcription factor that might regulate a coupling between fatty acid biosynthesis and degradation. Our results further suggest that, although functions can be inferred for most proteins on earth, many functions remain to be discovered in numerous small, rare protein families.
Evolution of cell cycle control: same molecular machines, different regulation.
de Lichtenberg, U., Jensen, T.S., Brunak, S., Bork, P. & Jensen, L.J.
Cell Cycle. 2007 Aug 1;6(15):1819-25. Epub 2007 Jun 4.
Decades of research have, together with the availability of whole genomes, made it clear that many of the core components involved in the cell cycle are conserved across eukaryotes, both functionally and structurally. These proteins are organized in complexes and modules that are activated or deactivated at specific stages during the cell cycle through a wide variety of mechanisms including transcriptional regulation, phosphorylation, subcellular translocation and targeted degradation. In a series of integrative analyses of different genome-scale data sets, we have studied how these different layers of regulation together control the activity of cell cycle complexes and how this regulation has evolved. The results show surprisingly poor conservation of both the transcriptional and the post-translational regulation of individual genes and proteins; however, the changes in one layer of regulation are often mirrored by changes in other layers, implying that independent layers of control coevolve. By taking a bird's eye view of the cell cycle, we demonstrate how the modular organization of cellular systems possesses a built-in flexibility, which allows evolution to find many different solutions for assembling the same molecular machines just in time for action.
Use of pathway analysis and genome context methods for functional genomics of Mycoplasma pneumoniae nucleotide metabolism.
Pachkov, M., Dandekar, T., Korbel, J., Bork, P. & Schuster, S.
Gene. 2007 Jul 15;396(2):215-25. Epub 2007 Mar 24.
Elementary modes analysis allows one to reveal whether a set of known enzymes is sufficient to sustain functionality of the cell. Moreover, it is helpful in detecting missing reactions and predicting which enzymes could fill these gaps. Here, we perform a comprehensive elementary modes analysis and a genomic context analysis of Mycoplasma pneumoniae nucleotide metabolism, and search for new enzyme activities. The purine and pyrimidine networks are reconstructed by assembling enzymes annotated in the genome or found experimentally. We show that these reaction sets are sufficient for enabling synthesis of DNA and RNA in M. pneumoniae. Special focus is on the key modes for growth. Moreover, we make an educated guess on the nutritional requirements of this micro-organism. For the case that M. pneumoniae does not require adenine as a substrate, we suggest adenylosuccinate synthetase (EC 6.3.4.4), adenylosuccinate lyase (EC 4.3.2.2) and GMP reductase (EC 1.6.6.8) to be operative. GMP reductase activity is putatively assigned to the NRDI_MYCPN gene on the basis of the genomic context analysis. For the pyrimidine network, we suggest CTP synthase (EC 6.3.4.2) to be active. Further experiments on the nutritional requirements are needed to make a decision. Pyrimidine metabolism appears to be more appropriate as a drug target than purine metabolism since it shows lower plasticity.
Splicing factors stimulate polyadenylation via USEs at non-canonical 3' end formation signals.
Danckwardt, S., Kaufmann, I., Gentzel, M., Foerstner, K.U., Gantzert, A.S., Gehring, N.H., Neu-Yilik, G., Bork, P., Keller, W., Wilm, M., Hentze, M.W. & Kulozik, A.E.
EMBO J. 2007 Jun 6;26(11):2658-69. Epub 2007 Apr 26.
The prothrombin (F2) 3' end formation signal is highly susceptible to thrombophilia-associated gain-of-function mutations. In its unusual architecture, the F2 3' UTR contains an upstream sequence element (USE) that compensates for weak activities of the non-canonical cleavage site and the downstream U-rich element. Here, we address the mechanism of USE function. We show that the F2 USE contains a highly conserved nonameric core sequence, which promotes 3' end formation in a position- and sequence-dependent manner. We identify proteins that specifically interact with the USE, and demonstrate their function as trans-acting factors that promote 3' end formation. Interestingly, these include the splicing factors U2AF35, U2AF65 and hnRNPI. We show that these splicing factors not only modulate 3' end formation via the USEs contained in the F2 and the complement C2 mRNAs, but also in the biocomputationally identified BCL2L2, IVNS and ACTR mRNAs, suggesting a broader functional role. These data uncover a novel mechanism that functionally links the splicing and 3' end formation machineries of multiple cellular mRNAs in a USE-dependent manner.
Systematic discovery of in vivo phosphorylation networks.
Linding, R., Jensen, L.J., Ostheimer, G.J., van Vugt, M.A., Jorgensen, C., Miron, I.M., Diella, F., Colwill, K., Taylor, L., Elder, K., Metalnikov, P., Nguyen, V., Pasculescu, A., Jin, J., Park, J.G., Samson, L.D., Woodgett, J.R., Russell, R.B., Bork, P., Yaffe, M.B. & Pawson, T.
Cell. 2007 Jun 29;129(7):1415-26. Epub 2007 Jun 14.
Protein kinases control cellular decision processes by phosphorylating specific substrates. Thousands of in vivo phosphorylation sites have been identified, mostly by proteome-wide mapping. However, systematically matching these sites to specific kinases is presently infeasible, due to limited specificity of consensus motifs, and the influence of contextual factors, such as protein scaffolds, localization, and expression, on cellular substrate specificity. We have developed an approach (NetworKIN) that augments motif-based predictions with the network context of kinases and phosphoproteins. The latter provides 60%-80% of the computational capability to assign in vivo substrate specificity. NetworKIN pinpoints kinases responsible for specific phosphorylations and yields a 2.5-fold improvement in the accuracy with which phosphorylation networks can be constructed. Applying this approach to DNA damage signaling, we show that 53BP1 and Rad50 are phosphorylated by CDK1 and ATM, respectively. We describe a scalable strategy to evaluate predictions, which suggests that BCLAF1 is a GSK-3 substrate.
Protein function space: viewing the limits or limited by our view?
Raes, J., Harrington, E.D., Singh, A.H. & Bork, P.
Curr Opin Struct Biol. 2007 Jun;17(3):362-9. Epub 2007 Jun 15.
Given that the number of protein functions on earth is finite, the rapid expansion of biological knowledge and the concomitant exponential increase in the number of protein sequences should, at some point, enable the estimation of the limits of protein function space. The functional coverage of protein sequences can be investigated using computational methods, especially given the massive amount of data being generated by large-scale environmental sequencing (metagenomics). In completely sequenced genomes, the fraction of proteins to which at least some functional features can be assigned has recently risen to as much as approximately 85%. Although this fraction is more uncertain in metagenomics surveys, because of environmental complexities and differences in analysis protocols, our global knowledge of protein functions still appears to be considerable. However, when we consider protein families, continued sequencing seems to yield an ever-increasing number of novel families. Until we reconcile these two views, the limits of protein space will remain obscured.
Sequence-based factors influencing the expression of heterologous genes in the yeast Pichia pastoris--A comparative view on 79 human genes.
Boettner, M., Steffens, C., von Mering, C., Bork, P., Stahl, U. & Lang, C.
J Biotechnol. 2007 May 31;130(1):1-10. Epub 2007 Feb 28.
High yield expression of heterologous proteins is usually a matter of "trial and error". In the search of parameters with a major impact on expression, we have applied a comparative analysis to 79 different human cDNAs expressed in Pichia pastoris. The cDNAs were cloned in an expression vector for intracellular expression and recombinant protein expression was monitored in a standardized procedure and classified with respect to the expression level. Of all sequence-based parameters with a possible influence on the expression level, more than 10 were analysed. Three of those factors proved to have a statistically significant association with the expression level. Low abundance of AT-rich regions in the cDNA associates with a high expression level. A comparatively high isoelectric point of the recombinant protein associates with failure of expression and, finally, the occurrence of a protein homologue in yeast is associated with detectable protein expression. Interestingly, some often discussed factors like codon usage or GC content did not show a significant impact on protein yield. These results could provide a basis for a knowledge-oriented optimisation of gene sequences both to increase protein yields and to help target selection and the design of high-throughput expression approaches.
Quantitative phylogenetic assessment of microbial communities in diverse environments.
von Mering, C., Hugenholtz, P., Raes, J., Tringe, S.G., Doerks, T., Jensen, L.J., Ward, N. & Bork, P.
Science. 2007 Feb 23;315(5815):1126-30. Epub 2007 Feb 1.
The taxonomic composition of environmental communities is an important indicator of their ecology and function. We used a set of protein-coding marker genes, extracted from large-scale environmental shotgun sequencing data, to provide a more direct, quantitative, and accurate picture of community composition than that provided by traditional ribosomal RNA-based approaches depending on the polymerase chain reaction. Mapping marker genes from four diverse environmental data sets onto a reference species phylogeny shows that certain communities evolve faster than others. The method also enables determination of preferred habitats for entire microbial clades and provides evidence that such habitat preferences are often remarkably stable over time.
Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation.
Letunic, I. & Bork, P.
Bioinformatics. 2007 Jan 1;23(1):127-8. Epub 2006 Oct 18.
Interactive Tree Of Life (iTOL) is a web-based tool for the display, manipulation and annotation of phylogenetic trees. Trees can be interactively pruned and re-rooted. Various types of data such as genome sizes or protein domain repertoires can be mapped onto the tree. Export to several bitmap and vector graphics formats is supported. AVAILABILITY: iTOL is available at http://itol.embl.de
Quantification of insect genome divergence.
Zdobnov, E.M. & Bork, P.
Trends Genet. 2007 Jan;23(1):16-20. Epub 2006 Nov 9.
The recent sequencing of twelve insect genomes has enabled us to quantify their divergence using synteny conservation and sequence identity of single-copy orthologs. Protein identity correlates well with synteny and is about three times more conserved, an observation consistent with comparisons among vertebrates. The observed distribution of the lengths of synteny blocks follows a power law and differs from the expectations of the currently accepted random breakage model. Our results show that there is only limited selection for conservation of gene order and reveal a few hundred genes, proximity among which seems to be vital.
STRING 7--recent developments in the integration and prediction of protein interactions.
von Mering, C., Jensen, L.J., Kuhn, M., Chaffron, S., Doerks, T., Kruger, B., Snel, B. & Bork, P.
Nucleic Acids Res. 2007 Jan;35(Database issue):D358-62. Epub 2006 Nov 10.
Information on protein-protein interactions is still mostly limited to a small number of model organisms, and originates from a wide variety of experimental and computational techniques. The database and online resource STRING generalizes access to protein interaction data, by integrating known and predicted interactions from a variety of sources. The underlying infrastructure includes a consistent body of completely sequenced genomes and exhaustive orthology classifications, based on which interaction evidence is transferred between organisms. Although primarily developed for protein interaction analysis, the resource has also been successfully applied to comparative genomics, phylogenetics and network studies, which are all facilitated by programmatic access to the database backend and the availability of compact download files. As of release 7, STRING has almost doubled to 373 distinct organisms, and contains more than 1.5 million proteins for which associations have been pre-computed. Novel features include AJAX-based web-navigation, inclusion of additional resources such as BioGRID, and detailed protein domain annotation. STRING is available at http://string.embl.de/
New developments in the InterPro database.
Mulder, N.J., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Bork, P., Buillard, V., Cerutti, L., Copley, R., Courcelle, E., Das, U., Daugherty, L., Dibley, M., Finn, R., Fleischmann, W., Gough, J., Haft, D., Hulo, N., Hunter, S., Kahn, D., Kanapin, A., Kejariwal, A., Labarga, A., Langendijk-Genevaux, P.S., Lonsdale, D., Lopez, R., Letunic, I., Madera, M., Maslen, J., McAnulla, C., McDowall, J., Mistry, J., Mitchell, A., Nikolskaya, A.N., Orchard, S., Orengo, C., Petryszak, R., Selengut, J.D., Sigrist, C.J., Thomas, P.D., Valentin, F., Wilson, D., Wu, C.H. & Yeats, C.
Nucleic Acids Res. 2007 Jan;35(Database issue):D224-8.
InterPro is an integrated resource for protein families, domains and functional sites, which integrates the following protein signature databases: PROSITE, PRINTS, ProDom, Pfam, SMART, TIGRFAMs, PIRSF, SUPERFAMILY, Gene3D and PANTHER. The latter two new member databases have been integrated since the last publication in this journal. There have been several new developments in InterPro, including an additional reading field, new database links, extensions to the web interface and additional match XML files. InterPro has always provided matches to UniProtKB proteins on the website and in the match XML file on the FTP site. Additional matches to proteins in UniParc (UniProt archive) are now available for download in the new match XML files only. The latest InterPro release (13.0) contains more than 13 000 entries, covering over 78% of all proteins in UniProtKB. The database is available for text- and sequence-based searches via a webserver (http://www.ebi.ac.uk/interpro), and for download by anonymous FTP (ftp://ftp.ebi.ac.uk/pub/databases/interpro). The InterProScan search tool is now also available via a web service at http://www.ebi.ac.uk/Tools/webservices/WSInterProScan.html.
Prediction of effective genome size in metagenomic samples.
Raes, J., Korbel, J.O., Lercher, M.J., von Mering, C. & Bork, P.
Genome Biol. 2007;8(1):R10.
We introduce a novel computational approach to predict effective genome size (EGS; a measure that includes multiple plasmid copies, inserted sequences, and associated phages and viruses) from short sequencing reads of environmental genomics (or metagenomics) projects. We observe considerable EGS differences between environments and link this with ecologic complexity as well as species composition (for instance, the presence of eukaryotes). For example, we estimate EGS in a complex, organism-dense farm soil sample at about 6.3 megabases (Mb) whereas that of the bacteria therein is only 4.7 Mb; for bacteria in a nutrient-poor, organism-sparse ocean surface water sample, EGS is as low as 1.6 Mb. The method also permits evaluation of completion status and assembly bias in single-genome sequencing projects.
Identification of tightly regulated groups of genes during Drosophila melanogaster embryogenesis.
Hooper, S.D., Boue, S., Krause, R., Jensen, L.J., Mason, C.E., Ghanim, M., White, K.P., Furlong, E.E. & Bork, P.
Mol Syst Biol. 2007;3:72. Epub 2007 Jan 16.
Time-series analysis of whole-genome expression data during Drosophila melanogaster development indicates that up to 86% of its genes change their relative transcript level during embryogenesis. By applying conservative filtering criteria and requiring 'sharp' transcript changes, we identified 1534 maternal genes, 792 transient zygotic genes, and 1053 genes whose transcript levels increase during embryogenesis. Each of these three categories is dominated by groups of genes where all transcript levels increase and/or decrease at similar times, suggesting a common mode of regulation. For example, 34% of the transiently expressed genes fall into three groups, with increased transcript levels between 2.5-12, 11-20, and 15-20 h of development, respectively. We highlight common and distinctive functional features of these expression groups and identify a coupling between downregulation of transcript levels and targeted protein degradation. By mapping the groups to the protein network, we also predict and experimentally confirm new functional associations.
Opsins and clusters of sensory G-protein-coupled receptors in the sea urchin genome.
Raible, F., Tessmar-Raible, K., Arboleda, E., Kaller, T., Bork, P., Arendt, D. & Arnone, M.I.
Dev Biol. 2006 Dec 1;300(1):461-75. Epub 2006 Sep 5.
Rhodopsin-type G-protein-coupled receptors (GPCRs) contribute the majority of sensory receptors in vertebrates. With 979 members, they form the largest GPCR family in the sequenced sea urchin genome, constituting more than 3% of all predicted genes. The sea urchin genome encodes at least six Opsin proteins. Of these, one rhabdomeric, one ciliary and two G(o)-type Opsins can be assigned to ancient bilaterian Opsin subfamilies. Moreover, we identified four greatly expanded subfamilies of rhodopsin-type GPCRs that we call sea urchin specific rapidly expanded lineages of GPCRs (surreal-GPCRs). Our analysis of two of these groups revealed genomic clustering and single-exon gene structures similar to the most expanded group of vertebrate rhodopsin-type GPCRs, the olfactory receptors. We hypothesize that these genes arose by rapid duplication in the echinoid lineage and act as chemosensory receptors of the animal. In support of this, group B surreal-GPCRs are most prominently expressed in distinct classes of pedicellariae and tube feet of the adult sea urchin, structures that have previously been shown to react to chemical stimuli and to harbor sensory neurons in echinoderms. Notably, these structures also express different opsins, indicating that sea urchins possess an intricate molecular set-up to sense their environment.
Computational characterization of multiple Gag-like human proteins.
Campillos, M., Doerks, T., Shah, P.K. & Bork, P.
Trends Genet. 2006 Nov;22(11):585-9. Epub 2006 Sep 18.
In a genome-wide analysis, we have identified 85 human genes encoding 103 protein isoforms that resemble retroviral Gag proteins. These genes were domesticated from retrotransposons in at least five independent events during vertebrate evolution and were subsequently duplicated further in mammals. Structural insights into the mammalian proteins can be inferred by homology to Gag from viruses such as HIV; in turn, the cellular roles of the mammalian Gag homologs, such as apoptosis-related functions and binding to ubiquitin ligases, might hint at further functionality of viral Gag itself.
Insights into social insects from the genome of the honeybee Apis mellifera.
Sequencing Consortium, T.H. (Bork, P.)
Nature. 2006 Nov 23;444(7118):512.
Co-evolution of transcriptional and post-translational cell-cycle regulation.
Jensen, L.J., Jensen, T.S., de Lichtenberg, U., Brunak, S. & Bork, P.
Nature. 2006 Oct 5;443(7111):594-7. Epub 2006 Sep 27.
DNA microarray studies have shown that hundreds of genes are transcribed periodically during the mitotic cell cycle of humans, budding yeast, fission yeast and the plant Arabidopsis thaliana. Here we show that despite the fact that the protein complexes involved in this process are largely the same among all eukaryotes, their regulation has evolved considerably. Our comparative analysis of several large-scale data sets reveals that although the regulated subunits of each protein complex are expressed just before its time of action, the identity of the periodically expressed proteins differs significantly between organisms. Moreover, we show that these changes in transcriptional regulation have co-evolved with post-translational control independently in several lineages; loss or gain of cell-cycle-regulated transcription of specific genes is often mirrored by changes in phosphorylation of the proteins that they encode. Our results indicate that many different solutions have evolved for assembling the same molecular machines at the right time during the cell cycle, involving both transcriptional and post-translational layers that jointly control the dynamics of biological systems.
Assessing systems properties of yeast mitochondria through an interaction map of the organelle.
Perocchi, F., Jensen, L.J., Gagneur, J., Ahting, U., von Mering, C., Bork, P., Prokisch, H. & Steinmetz, L.M.
PLoS Genet. 2006 Oct 20;2(10):e170.
Mitochondria carry out specialized functions; compartmentalized, yet integrated into the metabolic and signaling processes of the cell. Although many mitochondrial proteins have been identified, understanding their functional interrelationships has been a challenge. Here we construct a comprehensive network of the mitochondrial system. We integrated genome-wide datasets to generate an accurate and inclusive mitochondrial parts list. Together with benchmarked measures of protein interactions, a network of mitochondria was constructed in their cellular context, including extra-mitochondrial proteins. This network also integrates data from different organisms to expand the known mitochondrial biology beyond the information in the existing databases. Our network brings together annotated and predicted functions into a single framework. This enabled, for the entire system, a survey of mutant phenotypes, gene regulation, evolution, and disease susceptibility. Furthermore, we experimentally validated the localization of several candidate proteins and derived novel functional contexts for hundreds of uncharacterized proteins. Our network thus advances the understanding of the mitochondrial system in yeast and identifies properties of genes underlying human mitochondrial disorders.
mRNA degradation by miRNAs and GW182 requires both CCR4:NOT deadenylase and DCP1:DCP2 decapping complexes.
Behm-Ansmant, I., Rehwinkel, J., Doerks, T., Stark, A., Bork, P. & Izaurralde, E.
Genes Dev. 2006 Jul 15;20(14):1885-98. Epub 2006 Jun 30.
MicroRNAs (miRNAs) silence the expression of target genes post-transcriptionally. Their function is mediated by the Argonaute proteins (AGOs), which colocalize to P-bodies with mRNA degradation enzymes. Mammalian P-bodies are also marked by the GW182 protein, which interacts with the AGOs and is required for miRNA function. We show that depletion of GW182 leads to changes in mRNA expression profiles strikingly similar to those observed in cells depleted of the essential Drosophila miRNA effector AGO1, indicating that GW182 functions in the miRNA pathway. When GW182 is bound to a reporter transcript, it silences its expression, bypassing the requirement for AGO1. Silencing by GW182 is effected by changes in protein expression and mRNA stability. Similarly, miRNAs silence gene expression by repressing protein expression and/or by promoting mRNA decay, and both mechanisms require GW182. mRNA degradation, but not translational repression, by GW182 or miRNAs is inhibited in cells depleted of CAF1, NOT1, or the decapping DCP1:DCP2 complex. We further show that the N-terminal GW repeats of GW182 interact with the PIWI domain of AGO1. Our findings indicate that GW182 links the miRNA pathway to mRNA degradation by interacting with AGO1 and promoting decay of at least a subset of miRNA targets.
PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments.
Suyama, M., Torrents, D. & Bork, P.
Nucleic Acids Res. 2006 Jul 1;34(Web Server issue):W609-12.
PAL2NAL is a web server that constructs a multiple codon alignment from the corresponding aligned protein sequences. Such codon alignments can be used to evaluate the type and rate of nucleotide substitutions in coding DNA for a wide range of evolutionary analyses, such as the identification of levels of selective constraint acting on genes, or to perform DNA-based phylogenetic studies. The server takes a protein sequence alignment and the corresponding DNA sequences as input. In contrast to other existing applications, this server is able to construct codon alignments even if the input DNA sequence has mismatches with the input protein sequence, or contains untranslated regions and polyA tails. The server can also deal with frame shifts and inframe stop codons in the input models, and is thus suitable for the analysis of pseudogenes. Another distinct feature is that the user can specify a subregion of the input alignment in order to specifically analyze functional domains or exons of interest. The PAL2NAL server is available at http://www.bork.embl.de/pal2nal.
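The core idea of threading codons onto an aligned protein sequence can be illustrated with a minimal Python sketch. This is not the PAL2NAL implementation: the function name `protein_to_codon_alignment` is hypothetical, and the sketch assumes the DNA is an exact, in-frame coding sequence with no untranslated regions, frame shifts or mismatches, which are precisely the cases the server is described as handling.

```python
def protein_to_codon_alignment(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Thread codons from an ungapped coding sequence onto an aligned protein.

    Assumption (not from the abstract): the coding sequence starts in frame
    and supplies exactly one codon per non-gap residue, in order.
    """
    residues = sum(1 for c in aligned_protein if c != gap)
    if len(cds) < 3 * residues:
        raise ValueError("coding sequence is too short for the aligned protein")

    codons = []
    pos = 0  # current offset into the coding sequence
    for c in aligned_protein:
        if c == gap:
            # A gap in the protein alignment becomes a three-column gap.
            codons.append(gap * 3)
        else:
            # Each aligned residue consumes the next codon.
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)


if __name__ == "__main__":
    print(protein_to_codon_alignment("M-K", "ATGAAA"))
```

Applying the function to every sequence in a protein alignment would give a codon alignment of three times the column count, under the simplifying assumptions stated above.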
LSAT: learning about alternative transcripts in MEDLINE.
Shah, P.K. & Bork, P.
Bioinformatics. 2006 Apr 1;22(7):857-65. Epub 2006 Jan 12.
MOTIVATION: Generation of alternative transcripts from the same gene is an important biological event due to their contribution in creating functional diversity in eukaryotes. In this work, we choose the task of extracting information around this complex topic using a two-step procedure involving machine learning and information extraction. RESULTS: In the first step, we trained a classifier that inductively learns to identify sentences about physiological transcript diversity from the MEDLINE abstracts. Using a large hand-built corpus, we compared the sentence classification performance of various text categorization methods. Support vector machines (SVMs) followed by the maximum entropy classifier outperformed other methods for the sentence classification task. The SVM with the radial basis function kernel and optimized parameters achieved Fbeta-measure of 91% during the 4-fold cross validation and of 74% when applied to all sentences in more than 12 million abstracts of MEDLINE. In the second step, we identified eight frequently present semantic categories in the sentences and performed a limited amount of semantic role labeling. The role labeling step also achieved very high Fbeta-measure for all eight categories. AVAILABILITY: The results of our two-step procedure are summarized in the LSAT database of alternative transcripts. LSAT is available at http://www.bork.embl.de/LSAT CONTACT: email@example.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
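For readers unfamiliar with the Fbeta-measure quoted above, the standard definition combines precision and recall as shown in the short Python sketch below. The abstract does not state which value of beta was used, so the default of 1.0 here is an assumption, and the function name `f_beta` is illustrative only.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score: weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily; beta < 1 favours precision.
    The default beta=1.0 is an assumption, not a value from the abstract.
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Both precision and recall are zero; the score is defined as 0.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


if __name__ == "__main__":
    print(f_beta(0.9, 0.8))
    print(f_beta(0.9, 0.8, beta=2.0))
```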
Proteome survey reveals modularity of the yeast cell machinery.
Gavin, A.C., Aloy, P., Grandi, P., Krause, R., Boesche, M., Marzioch, M., Rau, C., Jensen, L.J., Bastuck, S., Dumpelfeld, B., Edelmann, A., Heurtier, M.A., Hoffman, V., Hoefert, C., Klein, K., Hudak, M., Michon, A.M., Schelder, M., Schirle, M., Remor, M., Rudi, T., Hooper, S., Bauer, A., Bouwmeester, T., Casari, G., Drewes, G., Neubauer, G., Rick, J.M., Kuster, B., Bork, P., Russell, R.B. & Superti-Furga, G.
Nature. 2006 Mar 30;440(7084):631-6. Epub 2006 Jan 22.
Protein complexes are key molecular entities that integrate multiple gene products to perform cellular functions. Here we report the first genome-wide screen for complexes in an organism, budding yeast, using affinity purification and mass spectrometry. Through systematic tagging of open reading frames (ORFs), the majority of complexes were purified several times, suggesting screen saturation. The richness of the data set enabled a de novo characterization of the composition and organization of the cellular machinery. The ensemble of cellular proteins partitions into 491 complexes, of which 257 are novel, that differentially combine with additional attachment proteins or protein modules to enable a diversification of potential functions. Support for this modular organization of the proteome comes from integration with available data on expression, localization, function, evolutionary conservation, protein structure and binary interactions. This study provides the largest collection of physically determined eukaryotic cellular machines so far and a platform for biological data integration and modelling.
Toward automatic reconstruction of a highly resolved tree of life.
Ciccarelli, F.D., Doerks, T., von Mering, C., Creevey, C.J., Snel, B. & Bork, P.
Science. 2006 Mar 3;311(5765):1283-7.
We have developed an automatable procedure for reconstructing the tree of life with branch lengths comparable across all three domains. The tree has its basis in a concatenation of 31 orthologs occurring in 191 species with sequenced genomes. It revealed interdomain discrepancies in taxonomic classification. Systematic detection and subsequent exclusion of products of horizontal gene transfer increased phylogenetic resolution, allowing us to confirm accepted relationships and resolve disputed and preliminary classifications. For example, we place the phylum Acidobacteria as a sister group of delta-Proteobacteria, support a Gram-positive origin of Bacteria, and suggest a thermophilic last universal common ancestor.
Comparative analysis of environmental sequences: potential and challenges.
Förstner, K.U., von Mering, C. & Bork, P.
Philos Trans R Soc Lond B Biol Sci. 2006 Mar 29;361(1467):519-23.
Environmental sequencing, also dubbed metagenomics, is increasingly being used to obtain insights into organismal communities in diverse habitats, and has a variety of potential applications foreseeable in biotechnology and medicine. The first public large-scale data already provide a wealth of information hidden in vast amounts of fragmented pieces of DNA from unknown species residing in these environments. Comparative sequence analysis is essential for the interpretation of such data. However, different layers of complexity that are intrinsic to each sample require the establishment of some baselines for comparison: how to normalize for the differences in phylogenetic and functional diversity, how to avoid biases from incomplete data, and how to deal with differences in species dominance or genome sizes? Here we discuss a few of these items and delineate some simple discriminative sequence properties for four distinct habitats.
Extraction of regulatory gene/protein networks from Medline.
Saric, J., Jensen, L.J., Ouzounova, R., Rojas, I. & Bork, P.
Bioinformatics. 2006 Mar 15;22(6):645-50. Epub 2005 Jul 26.
MOTIVATION: We have previously developed a rule-based approach for extracting information on the regulation of gene expression in yeast. The biomedical literature, however, contains information on several other equally important regulatory mechanisms, in particular phosphorylation, which we have now extended our rule-based system to extract as well. RESULTS: This paper presents new results for extraction of relational information from biomedical text. We have improved our system, STRING-IE, to capture new types of linguistic constructs as well as new types of biological information [i.e. (de-)phosphorylation]. The precision remains stable with a slight increase in recall. From almost one million PubMed abstracts related to four model organisms, we manage to extract regulatory networks and binary phosphorylations comprising 3,319 relation chunks. The accuracy is 83-90% and 86-95% for gene expression and (de-)phosphorylation relations, respectively. To achieve this, we made use of an organism-specific resource of gene/protein names considerably larger than those used in most other biology-related information extraction approaches. These names were included in the lexicon when retraining the part-of-speech (POS) tagger on the GENIA corpus. For the domain in question, an accuracy of 96.4% was attained on POS tags. It should be noted that the rules were developed for yeast and successfully applied to both abstracts and full-text articles related to other organisms with comparable accuracy. AVAILABILITY: The revised GENIA corpus, the POS tagger, the extraction rules and the full sets of extracted relations are available from http://www.bork.embl.de/Docu/STRING-IE
Identification and analysis of evolutionarily cohesive functional modules in protein networks.
Campillos, M., von Mering, C., Jensen, L.J. & Bork, P.
Genome Res. 2006 Mar;16(3):374-82. Epub 2006 Jan 31.
The increasing number of sequenced genomes makes it possible to infer the evolutionary history of functional modules, i.e., groups of proteins that contribute jointly to the same cellular function in a given species. Here we identify and analyze those prokaryotic functional modules, whose composition remains largely unchanged during evolution, and study their properties. Such "cohesive" modules have a large number of internal functional connections, encode genes that tend to be in close proximity in prokaryotic genomes, and correspond to physical complexes or complex functional systems like the flagellar apparatus. Cohesive modules are enriched in processes such as energy and amino acid metabolism, cell motility, and intracellular trafficking, or secretion. By grouping genes into modules we achieve a more precise estimate of their age and find that the young modules are often horizontally transferred between species and are enriched in functions involved in interactions with the environment, implying that they play an important role in the adaptation of species to new environments.
Literature mining for the biologist: from information retrieval to biological discovery.
Jensen, L.J., Saric, J. & Bork, P.
Nat Rev Genet. 2006 Feb;7(2):119-29.
For the average biologist, hands-on literature mining currently means a keyword search in PubMed. However, methods for extracting biomedical facts from the scientific literature have improved considerably, and the associated tools will probably soon be used in many laboratories to automatically annotate and analyse the growing number of system-wide experimental data sets. Owing to the increasing body of text and the open-access policies of many journals, literature mining is also becoming useful for both hypothesis generation and biological discovery. However, the latter will require the integration of literature and high-throughput data, which should encourage close collaborations between biologists and computational linguists.
SMART 5: domains in the context of genomes and networks.
Letunic, I., Copley, R.R., Pils, B., Pinkert, S., Schultz, J. & Bork, P.
Nucleic Acids Res. 2006 Jan 1;34(Database issue):D257-60.
The Simple Modular Architecture Research Tool (SMART) is an online resource (http://smart.embl.de/) used for protein domain identification and the analysis of protein domain architectures. Many new features were implemented to make SMART more accessible to scientists from different fields. The new 'Genomic' mode in SMART makes it easy to analyze domain architectures in completely sequenced genomes. Domain annotation has been updated with a detailed taxonomic breakdown and a prediction of the catalytic activity for 50 SMART domains is now available, based on the presence of essential amino acids. Furthermore, intrinsically disordered protein regions can be identified and displayed. The network context is now displayed in the results page for more than 350 000 proteins, enabling easy analyses of domain interactions.
Medusa: a simple tool for interaction graph analysis.
Hooper, S.D. & Bork, P.
Bioinformatics. 2005 Dec 15;21(24):4432-3. Epub 2005 Sep 27.
SUMMARY: Medusa is a Java application for visualizing and manipulating graphs of interaction, such as data from the STRING database. It features an intuitive user interface developed with the help of biologists. Medusa is optimized for accessing protein interaction data from STRING, but can be used for any type of graph from any scientific field. AVAILABILITY: Medusa, along with sample datasets and instructions, can be downloaded from http://www.bork.embl.de/medusa CONTACT: firstname.lastname@example.org.
Very-KIND is a novel nervous system specific guanine nucleotide exchange factor for Ras GTPases.
Mees, A., Rock, R., Ciccarelli, F.D., Leberfinger, C.B., Borawski, J.M., Bork, P., Wiese, S., Gessler, M. & Kerkhoff, E.
Gene Expr Patterns. 2005 Dec;6(1):79-85. Epub 2005 Aug 15.
The kinase non-catalytic c-lobe domain (KIND) evolved from the catalytic protein kinase fold into a potential protein interaction module for signalling proteins. Spir family actin organizers and the non-receptor phosphatase type 13 (PTP type 13) encode a KIND domain in the very N-terminal parts of the proteins. Here we report the characterization and cloning of a third member of the KIND protein family, which we have named very-KIND (VKIND) because of its two KIND domains. Like the other members of the protein family, VKIND has a KIND domain at the N-terminus. A second KIND domain is located in the central part of the protein. The C-terminal half encodes a guanine nucleotide exchange factor motif for Ras-like GTPases (RasGEF) and a RasGEF N-terminal module (RasGEFN). There is only one VKIND gene in the mammalian genomes and up to now we have found the gene only in vertebrates. During mouse embryogenesis the VKIND gene was specifically expressed in the developing nervous system. In adult mice Northern hybridizations revealed high expression only in brain. Low expression could be detected in ovary. In situ hybridizations showed a specific expression of VKIND in neuronal cells of the granular and Purkinje cell layers of the cerebellum.
Environments shape the nucleotide composition of genomes.
Foerstner, K.U., von Mering, C., Hooper, S.D. & Bork, P.
EMBO Rep 2005 Dec;6(12):1208-13.
To test the impact of environments on genome evolution, we analysed the relative abundance of the nucleotides guanine and cytosine ('GC content') of large numbers of sequences from four distinct environmental samples (ocean surface water, farm soil, an acidophilic mine drainage biofilm and deep-sea whale carcasses). We show that the GC content of complex microbial communities seems to be globally and actively influenced by the environment. The observed nucleotide compositions cannot be easily explained by distinct phylogenetic origins of the species in the environments; the genomic GC content may change faster than was previously thought, and is also reflected in the amino-acid composition of the proteins in these habitats.
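As an illustration of the quantity analysed above, GC content is the fraction of guanine and cytosine among the unambiguous bases of a sequence. The Python sketch below is a generic calculation, not the authors' pipeline; the function name `gc_content` and the choice to ignore ambiguous characters such as N are assumptions.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G and C among the A, C, G and T bases.

    Assumption: ambiguous symbols (for example N) are ignored rather than
    counted in the denominator.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return gc / total


if __name__ == "__main__":
    print(gc_content("GGCCAATT"))
```

Averaging this value over the reads of each environmental sample would give a per-sample GC content comparable across habitats, under the stated assumptions.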
Vertebrate-type intron-rich genes in the marine annelid Platynereis dumerilii.
Raible, F., Tessmar-Raible, K., Osoegawa, K., Wincker, P., Jubin, C., Balavoine, G., Ferrier, D., Benes, V., de Jong, P., Weissenbach, J., Bork, P. & Arendt, D.
Science 2005 Nov 25;310(5752):1325-6.
Previous genome comparisons have suggested that one important trend in vertebrate evolution has been a sharp rise in intron abundance. By using genomic data and expressed sequence tags from the marine annelid Platynereis dumerilii, we provide direct evidence that about two-thirds of human introns predate the bilaterian radiation but were lost from insect and nematode genomes to a large extent. A comparison of coding exon sequences confirms the ancestral nature of Platynereis and human genes. Thus, the urbilaterian ancestor had complex, intron-rich genes that have been retained in Platynereis and human.
Spore number control and breeding in Saccharomyces cerevisiae: a key role for a self-organizing system.
Taxis, C., Keller, P., Kavagiou, Z., Jensen, L.J., Colombelli, J., Bork, P., Stelzer, E.H.K. & Knop, M.
J Cell Biol 2005 Nov 21;171(4):627-40. Epub 2005 Nov 14.
Spindle pole bodies (SPBs) provide a structural basis for genome inheritance and spore formation during meiosis in yeast. Upon carbon source limitation during sporulation, the number of haploid spores formed per cell is reduced. We show that precise spore number control (SNC) fulfills two functions. SNC maximizes the production of spores (1-4) that are formed by a single cell. This is regulated by the concentration of three structural meiotic SPB components, which is dependent on available amounts of carbon source. Using experiments and computer simulation, we show that the molecular mechanism relies on a self-organizing system, which is able to generate particular patterns (different numbers of spores) depending on a single stimulus (gradually increasing amounts of SPB constituents). We also show that SNC enhances intratetrad mating, whereby maximal amounts of germinated spores are able to return to a diploid lifestyle without intermediary mitotic division. This is beneficial for the immediate fitness of the population of postmeiotic cells.
Palindromic repetitive DNA elements with coding potential in Methanocaldococcus jannaschii.
Suyama, M., Lathe, W.C. 3rd & Bork, P.
FEBS Lett 2005 Oct 10;579(24):5281-6.
We have identified 141 novel palindromic repetitive elements in the genome of the euryarchaeon Methanocaldococcus jannaschii. The total length of these elements is 14.3 kb, which corresponds to 0.9% of the total genomic sequence and 6.3% of all extragenic regions. The elements can be divided into three groups (MJRE1-3) based on sequence similarity. The low sequence identity within each of the groups suggests a rather old origin of these elements in M. jannaschii. Three MJRE2 elements were located within the protein coding regions without disrupting the coding potential of the host genes, indicating that insertion of repeats might be a widespread mechanism to enhance sequence diversity in coding regions.
Nonsense-mediated mRNA decay factors act in concert to regulate common mRNA targets.
Rehwinkel, J., Letunic, I., Raes, J., Bork, P. & Izaurralde, E.
RNA 2005 Oct;11(10):1530-44.
Nonsense-mediated mRNA decay (NMD) is a surveillance pathway that degrades mRNAs containing nonsense codons, and regulates the expression of naturally occurring transcripts. While NMD is not essential in yeast or nematodes, UPF1, a key NMD effector, is essential in mice. Here we show that NMD components are required for cell proliferation in Drosophila. This raises the question of whether NMD effectors diverged functionally during evolution. To address this question, we examined expression profiles in Drosophila cells depleted of all known metazoan NMD components. We show that UPF1, UPF2, UPF3, SMG1, SMG5, and SMG6 regulate in concert the expression of a cohort of genes with functions in a wide range of cellular activities, including cell cycle progression. Only a few transcripts were regulated exclusively by individual factors, suggesting that these proteins act mainly in the NMD pathway and their role in mRNA decay has not diverged substantially. Finally, the vast majority of NMD targets in Drosophila are not orthologs of targets previously identified in yeast or human cells. Thus phenotypic differences observed across species following inhibition of NMD can be largely attributed to changes in the repertoire of regulated genes.
G2D: a tool for mining genes associated with disease.
Perez-Iratxeta, C., Wjst, M., Bork, P. & Andrade, M.A.
BMC Genet 2005 Aug 22;6:45.
BACKGROUND: Human inherited diseases can be associated by genetic linkage with one or more genomic regions. The availability of the complete sequence of the human genome allows examining those locations for an associated gene. We previously developed an algorithm to prioritize genes on a chromosomal region according to their possible relation to an inherited disease using a combination of data mining on biomedical databases and gene sequence analysis. RESULTS: We have implemented this method as a web application in our site G2D (Genes to Diseases). It allows users to inspect any region of the human genome to find candidate genes related to a genetic disease of their interest. In addition, the G2D server includes pre-computed analyses of candidate genes for 552 linked monogenic diseases without an associated gene, and the analysis of 18 asthma loci. CONCLUSION: G2D can be publicly accessed at http://www.ogic.ca/projects/g2d_2/.
Structural genomics of human proteins--target selection and generation of a public catalogue of expression clones.
Bussow, K., Scheich, C., Sievert, V., Harttig, U., Schultz, J., Simon, B., Bork, P., Lehrach, H. & Heinemann, U.
Microb Cell Fact 2005 Jul 5;4:21.
BACKGROUND: The availability of suitable recombinant protein is still a major bottleneck in protein structure analysis. The Protein Structure Factory, part of the international structural genomics initiative, targets human proteins for structure determination. It has implemented high throughput procedures for all steps from cloning to structure calculation. This article describes the selection of human target proteins for structure analysis, our high throughput cloning strategy, and the expression of human proteins in Escherichia coli host cells. RESULTS AND CONCLUSION: Protein expression and sequence data of 1414 E. coli expression clones representing 537 different proteins are presented. 139 human proteins (18%) could be expressed and purified in soluble form and with the expected size. All E. coli expression clones are publicly available to facilitate further functional characterisation of this set of human proteins.
DCD - a novel plant specific domain in proteins involved in development and programmed cell death.
Tenhaken, R., Doerks, T. & Bork, P.
BMC Bioinformatics 2005 Jul 11;6:169.
BACKGROUND: Recognition of microbial pathogens by plants triggers the hypersensitive reaction, a common form of programmed cell death in plants. These dying cells generate signals that activate the plant immune system and alarm the neighboring cells as well as the whole plant to activate defense responses to limit the spread of the pathogen. The molecular mechanisms behind the hypersensitive reaction are largely unknown except for the recognition process of pathogens. We delineate the NRP-gene in soybean, which is specifically induced during this programmed cell death and contains a novel protein domain, which is commonly found in different plant proteins. RESULTS: The sequence analysis of the protein, encoded by the NRP-gene from soybean, led to the identification of a novel domain, which we named DCD, because it is found in plant proteins involved in development and cell death. The domain is shared by several proteins in the Arabidopsis and the rice genomes, which otherwise show a different protein architecture. Biological studies indicate a role of these proteins in phytohormone response, embryo development and programmed cell death by pathogens or ozone. CONCLUSION: It is tempting to speculate that the DCD domain mediates signaling in plant development and programmed cell death and could thus be used to identify interacting proteins to gain further molecular insights into these processes.
Consistency of genome-based methods in measuring Metazoan evolution.
Zdobnov, E.M., von Mering, C., Letunic, I. & Bork, P.
FEBS Lett 2005 Jun 13;579(15):3355-61. Epub 2005 Apr 18.
Seven distinct genome-wide divergence measures were applied pairwise to the nine sequenced animal genomes of human, mouse, rat, chicken, pufferfish, fruit fly, mosquito, and two nematode worms (Caenorhabditis briggsae and Caenorhabditis elegans). Qualitatively, all of these divergence measures are found to correlate with the estimated time since speciation; however, marked deviations are observed in a few lineages. The distinct genome divergence measures also correlate well among themselves, indicating that most of the processes shaping genomes are dominated by neutral events. The deviations from the clock-like scenario in some lineages are observed consistently by several measures, implicitly confirming their reliability.
Extraction of transcript diversity from scientific literature.
Shah, P.K., Jensen, L.J., Boue, S. & Bork, P.
PLoS Comput Biol. 2005 Jun;1(1):e10. Epub 2005 Jun 24.
Transcript diversity generated by alternative splicing and associated mechanisms contributes heavily to the functional complexity of biological systems. The numerous examples of the mechanisms and functional implications of these events are scattered throughout the scientific literature. Thus, it is crucial to have a tool that can automatically extract the relevant facts and collect them in a knowledge base that can aid the interpretation of data from high-throughput methods. We have developed and applied a composite text-mining method for extracting information on transcript diversity from the entire MEDLINE database in order to create a database of genes with alternative transcripts. It contains information on tissue specificity, number of isoforms, causative mechanisms, functional implications, and experimental methods used for detection. We have mined this resource to identify 959 instances of tissue-specific splicing. Our results in combination with those from EST-based methods suggest that alternative splicing is the preferred mechanism for generating transcript diversity in the nervous system. We provide new annotations for 1,860 genes with the potential for generating transcript diversity. We assign the MeSH term "alternative splicing" to 1,536 additional abstracts in the MEDLINE database and suggest new MeSH terms for other events. We have successfully extracted information about transcript diversity and semiautomatically generated a database, LSAT, that can provide a quantitative understanding of the mechanisms behind tissue-specific gene expression. LSAT (Literature Support for Alternative Transcripts) is publicly available at http://www.bork.embl.de/LSAT/.
Is there biological research beyond Systems Biology? A comparative analysis of terms.
Mol Syst Biol. 2005;1:2005.0012. Epub 2005 May 25.
Towards cellular systems in 4D.
Bork, P. & Serrano, L.
Cell 2005 May 20;121(4):507-9.
Structural similarity to bridge sequence space: finding new families on the bridges.
Shah, P.K., Aloy, P., Bork, P. & Russell, R.B.
Protein Sci 2005 May;14(5):1305-14.
Structures for protein domains have increased rapidly in recent years owing to advances in structural biology and structural genomics projects. New structures are often similar to those solved previously, and such similarities can give insights into function by linking poorly understood families to those that are better characterized. They also allow the possibility of combining information to find still more proteins adopting a similar structure and sometimes a similar function, and to reprioritize families in structural genomics pipelines. We explore this possibility here by preparing merged profiles for pairs of structurally similar, but not necessarily sequence-similar, domains within the SMART and Pfam databases by way of the Structural Classification of Proteins (SCOP). We show that such profiles are often able to successfully identify further members of the same superfamily and thus can be used to increase the sensitivity of database searching methods like HMMer and PSI-BLAST. We perform detailed benchmarks using the SMART and Pfam databases with four complete genomes frequently used as annotation benchmarks. We quantify the associated increase in structural information in Swissprot and discuss examples illustrating the applicability of this approach to understand functional and evolutionary relationships between protein families.
Comparison of computational methods for the identification of cell cycle-regulated genes.
de Lichtenberg, U., Jensen, L.J., Fausboll, A., Jensen, T.S., Bork, P. & Brunak, S.
Bioinformatics. 2005 Apr 1;21(7):1164-71. Epub 2004 Oct 28.
MOTIVATION: DNA microarrays have been used extensively to study the cell cycle transcription programme in a number of model organisms. The Saccharomyces cerevisiae data in particular have been subjected to a wide range of bioinformatics analysis methods, aimed at identifying the correct and complete set of periodically expressed genes. RESULTS: Here, we provide the first thorough benchmark of such methods, surprisingly revealing that most new and more mathematically advanced methods actually perform worse than the analysis published with the original microarray data sets. We show that this loss of accuracy specifically affects methods that only model the shape of the expression profile without taking into account the magnitude of regulation. We present a simple permutation-based method that performs better than most existing methods.
Generation and annotation of the DNA sequences of human chromosomes 2 and 4.
Hillier, L.W., Graves, T.A., Fulton, R.S., Fulton, L.A., Pepin, K.H., Minx, P., Wagner-McPherson, C., Layman, D., Wylie, K., Sekhon, M., Becker, M.C., Fewell, G.A., Delehaunty, K.D., Miner, T.L., Nash, W.E., Kremitzki, C., Oddy, L., Du, H., Sun, H., Bradshaw-Cordum, H., Ali, J., Carter, J., Cordes, M., Harris, A., Isak, A., van Brunt, A., Nguyen, C., Du, F., Courtney, L., Kalicki, J., Ozersky, P., Abbott, S., Armstrong, J., Belter, E.A., Caruso, L., Cedroni, M., Cotton, M., Davidson, T., Desai, A., Elliott, G., Erb, T., Fronick, C., Gaige, T., Haakenson, W., Haglund, K., Holmes, A., Harkins, R., Kim, K., Kruchowski, S.S., Strong, C.M., Grewal, N., Goyea, E., Hou, S., Levy, A., Martinka, S., Mead, K., McLellan, M.D., Meyer, R., Randall-Maher, J., Tomlinson, C., Dauphin-Kohlberg, S., Kozlowicz-Reilly, A., Shah, N., Swearengen-Shahid, S., Snider, J., Strong, J.T., Thompson, J., Yoakum, M., Leonard, S., Pearman, C., Trani, L., Radionenko, M., Waligorski, J.E., Wang, C., Rock, S.M., Tin-Wollam, A.M., Maupin, R., Latreille, P., Wendl, M.C., Yang, S.P., Pohl, C., Wallis, J.W., Spieth, J., Bieri, T.A., Berkowicz, N., Nelson, J.O., Osborne, J., Ding, L., Meyer, R., Sabo, A., Shotland, Y., Sinha, P., Wohldmann, P.E., Cook, L.L., Hickenbotham, M.T., Eldred, J., Williams, D., Jones, T.A., She, X., Ciccarelli, F.D., Izaurralde, E., Taylor, J., Schmutz, J., Myers, R.M., Cox, D.R., Huang, X., McPherson, J.D., Mardis, E.R., Clifton, S.W., Warren, W.C., Chinwalla, A.T., Eddy, S.R., Marra, M.A., Ovcharenko, I., Furey, T.S., Miller, W., Eichler, E.E., Bork, P., Suyama, M., Torrents, D., Waterston, R.H. & Wilson, R.K.
Nature 2005 Apr 7;434(7034):724-31.
Human chromosome 2 is unique to the human lineage in being the product of a head-to-head fusion of two intermediate-sized ancestral chromosomes. Chromosome 4 has received attention primarily related to the search for the Huntington's disease gene, but also for genes associated with Wolf-Hirschhorn syndrome, polycystic kidney disease and a form of muscular dystrophy. Here we present approximately 237 million base pairs of sequence for chromosome 2, and 186 million base pairs for chromosome 4, representing more than 99.6% of their euchromatic sequences. Our initial analyses have identified 1,346 protein-coding genes and 1,239 pseudogenes on chromosome 2, and 796 protein-coding genes and 778 pseudogenes on chromosome 4. Extensive analyses confirm the underlying construction of the sequence, and expand our understanding of the structure and evolution of mammalian chromosomes, including gene deserts, segmental duplications and highly variant regions.
Systematic association of genes to phenotypes by genome and literature mining.
Korbel, J.O., Doerks, T., Jensen, L.J., Perez-Iratxeta, C., Kaczanowski, S., Hooper, S.D., Andrade, M.A. & Bork, P.
PLoS Biol 2005 Apr 5;3(5):e134.
One of the major challenges of functional genomics is to unravel the connection between genotype and phenotype. So far no global analysis has attempted to explore those connections in the light of the large phenotypic variability seen in nature. Here, we use an unsupervised, systematic approach for associating genes and phenotypic characteristics that combines literature mining with comparative genome analysis. We first mine the MEDLINE literature database for terms that reflect phenotypic similarities of species. Subsequently we predict the likely genomic determinants: genes specifically present in the respective genomes. In a global analysis involving 92 prokaryotic genomes we retrieve 323 clusters containing a total of 2,700 significant gene-phenotype associations. Some clusters contain mostly known relationships, such as genes involved in motility or plant degradation, often with additional hypothetical proteins associated with those phenotypes. Other clusters comprise unexpected associations; for example, a group of terms related to food and spoilage is linked to genes predicted to be involved in bacterial food poisoning. Among the clusters, we observe an enrichment of pathogenicity-related associations, suggesting that the approach reveals many novel genes likely to play a role in infectious diseases.
Comparative metagenomics of microbial communities.
Tringe, S.G., von Mering, C., Kobayashi, A., Salamov, A.A., Chen, K., Chang, H.W., Podar, M., Short, J.M., Mathur, E.J., Detter, J.C., Bork, P., Hugenholtz, P. & Rubin, E.M.
Science 2005 Apr 22;308(5721):554-7.
The species complexity of microbial communities and challenges in culturing representative isolates make it difficult to obtain assembled genomes. Here we characterize and compare the metabolic capabilities of terrestrial and marine microbial communities using largely unassembled sequence data obtained by shotgun sequencing DNA isolated from the various environments. Quantitative gene content analysis reveals habitat-specific fingerprints that reflect known characteristics of the sampled environments. The identification of environment-specific genes through a gene-centric comparative analysis presents new opportunities for interpreting and diagnosing environments.
The WHy domain mediates the response to desiccation in plants and bacteria.
Ciccarelli, F.D. & Bork, P.
Bioinformatics 2005 Apr 15;21(8):1304-7. Epub 2004 Dec 14.
MOTIVATION: The hypersensitive response (HR) is a process activated by plants after microbial infection. Its main phenotypic effects are both a programmed death of the plant cells near the infection site and a reduction of the microbial proliferation. Although many resistance genes (R genes) associated to HR have been identified, very little is known about the molecular mechanisms activated after their expression. RESULTS: The analysis of the product of one of the R genes, the Hin1 protein, led to the identification of a novel domain, which we named WHy because it is detectable in proteins involved in Water stress and Hypersensitive response. The expression of this domain during both biotic infection and response to desiccation points to a molecular machinery common to these two stress conditions. Moreover, its presence in a restricted number of bacteria suggests a possible use for marking plant pathogenicity. CONTACT: email@example.com SUPPLEMENTARY INFORMATION: Supplementary data (Figures S1 and S2 and Table S1) and the alignment in clustal format are available at http://www.bork.embl.de/~ciccarel/WHy_add_data.html.
Complex genomic rearrangements lead to novel primate gene function.
Ciccarelli, F.D., von Mering, C., Suyama, M., Harrington, E.D., Izaurralde, E. & Bork, P.
Genome Res 2005 Mar;15(3):343-51. Epub 2005 Feb 14.
Orthologous genes that maintain a single-copy status in a broad range of species may indicate a selection against gene duplication. If this is the case, then duplicates of such genes that do survive may have escaped the dosage control by rapid and sizable changes in their function. To test this hypothesis and to develop a strategy for the identification of novel gene functions, we have analyzed 22 primate-specific intrachromosomal duplications of genes with a single-copy ortholog in all other completely sequenced metazoans. When comparing this set to genes not exposed to the single-copy status constraint, we observed a higher tendency of the former to modify their gene structure, often through complex genomic rearrangements. The analysis of the most dramatic of these duplications, affecting approximately 10% of human Chromosome 2, enabled a detailed reconstruction of the events leading to the appearance of a novel gene family. The eight members of this family originated from the highly conserved nucleoporin RanBP2 by several genetic rearrangements such as segmental duplications, inversions, translocations, exon loss, and domain accretion. We have experimentally verified that at least one of the newly formed proteins has a cellular localization different from RanBP2's, and we show that positive selection did act on specific domains during evolution.
Dynamic complex formation during the yeast cell cycle.
de Lichtenberg, U., Jensen, L.J., Brunak, S. & Bork, P.
Science 2005 Feb 4;307(5710):724-7.
To analyze the dynamics of protein complexes during the yeast cell cycle, we integrated data on protein interactions and gene expression. The resulting time-dependent interaction network places both periodically and constitutively expressed proteins in a temporal cell cycle context, thereby revealing previously unknown components and modules. We discovered that most complexes consist of both periodically and constitutively expressed subunits, which suggests that the former control complex activity by a mechanism of just-in-time assembly. Consistent with this, we show that additional regulation through targeted degradation and phosphorylation by Cdc28p (Cdk1) specifically affects the periodically expressed proteins.
STRING: known and predicted protein-protein associations, integrated and transferred across organisms.
von Mering, C., Jensen, L.J., Snel, B., Hooper, S.D., Krupp, M., Foglierini, M., Jouffre, N., Huynen, M.A. & Bork, P.
Nucleic Acids Res 2005 Jan 1;33 Database Issue:D433-7.
A full description of a protein's function requires knowledge of all partner proteins with which it specifically associates. From a functional perspective, 'association' can mean direct physical binding, but can also mean indirect interaction such as participation in the same metabolic pathway or cellular process. Currently, information about protein association is scattered over a wide variety of resources and model organisms. STRING aims to simplify access to this information by providing a comprehensive, yet quality-controlled collection of protein-protein associations for a large number of organisms. The associations are derived from high-throughput experimental data, from the mining of databases and literature, and from predictions based on genomic context analysis. STRING integrates and ranks these associations by benchmarking them against a common reference set, and presents evidence in a consistent and intuitive web interface. Importantly, the associations are extended beyond the organism in which they were originally described, by automatic transfer to orthologous protein pairs in other organisms, where applicable. STRING currently holds 730,000 proteins in 180 fully sequenced organisms, and is available at http://string.embl.de/.
InterPro, progress and status in 2005.
Mulder, N.J., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Bradley, P., Bork, P., Bucher, P., Cerutti, L., Copley, R., Courcelle, E., Das, U., Durbin, R., Fleischmann, W., Gough, J., Haft, D., Harte, N., Hulo, N., Kahn, D., Kanapin, A., Krestyaninova, M., Lonsdale, D., Lopez, R., Letunic, I., Madera, M., Maslen, J., McDowall, J., Mitchell, A., Nikolskaya, A.N., Orchard, S., Pagni, M., Ponting, C.P., Quevillon, E., Selengut, J., Sigrist, C.J., Silventoinen, V., Studholme, D.J., Vaughan, R. & Wu, C.H.
Nucleic Acids Res 2005 Jan 1;33(Database issue):D201-5.
InterPro, an integrated documentation resource of protein families, domains and functional sites, was created to integrate the major protein signature databases. Currently, it includes PROSITE, Pfam, PRINTS, ProDom, SMART, TIGRFAMs, PIRSF and SUPERFAMILY. Signatures are manually integrated into InterPro entries that are curated to provide biological and functional information. Annotation is provided in an abstract, Gene Ontology mapping and links to specialized databases. New features of InterPro include extended protein match views, taxonomic range information and protein 3D structure data. One of the new match views is the InterPro Domain Architecture view, which shows the domain composition of protein matches. Two new entry types were introduced to better describe InterPro entries: these are active site and binding site. PIRSF and the structure-based SUPERFAMILY are the latest member databases to join InterPro, and CATH and PANTHER are soon to be integrated. InterPro release 8.0 contains 11 007 entries, representing 2573 domains, 8166 families, 201 repeats, 26 active sites, 21 binding sites and 20 post-translational modification sites. InterPro covers over 78% of all proteins in the Swiss-Prot and TrEMBL components of UniProt. The database is available for text- and sequence-based searches via a webserver (http://www.ebi.ac.uk/interpro), and for download by anonymous FTP (ftp://ftp.ebi.ac.uk/pub/databases/interpro).
Comparative architectures of mammalian and chicken genomes reveal highly variable rates of genomic rearrangements across different lineages.
Bourque, G., Zdobnov, E.M., Bork, P., Pevzner, P.A. & Tesler, G.
Genome Res 2005 Jan;15(1):98-110. Epub 2004 Dec 08.
Molecular evolution studies are usually based on the analysis of individual genes and thus reflect only small-range variations in genomic sequences. A complementary approach is to study the evolutionary history of rearrangements in entire genomes based on the analysis of gene orders. The progress in whole genome sequencing provides an unprecedented level of detailed sequence data to infer genome rearrangements through comparative approaches. The comparative analysis of recently sequenced rodent genomes with the human genome revealed evidence for a larger number of rearrangements than previously thought and led to the reconstruction of the putative genomic architecture of the murid rodent ancestor, while the architecture of the ancestral mammalian genome and the rate of rearrangements in the human lineage remained unknown. Sequencing the chicken genome provides an opportunity to reconstruct the architecture of the ancestral mammalian genome by using chicken as an outgroup. Our analysis reveals a very low rate of rearrangements and, in particular, interchromosomal rearrangements in chicken, in the early mammalian ancestor, or in both. The suggested number of interchromosomal rearrangements between the mammalian ancestor and chicken, during an estimated 500 million years of evolution, only slightly exceeds the number of interchromosomal rearrangements that happened in the mouse lineage, over the course of about 87 million years.
Protein coding potential of retroviruses and other transposable elements in vertebrate genomes.
Zdobnov, E.M., Campillos, M., Harrington, E.D., Torrents, D. & Bork, P.
Nucleic Acids Res 2005 Feb 16;33(3):946-54. Print 2005.
We suggest an annotation strategy for genes encoded by retroviruses and transposable elements (RETRA genes) based on a set of marker protein domains. Usually RETRA genes are masked in vertebrate genomes prior to the application of automated gene prediction pipelines under the assumption that they provide no selective advantage to the host. Yet, we show that about 1000 genes in four vertebrate gene sets analyzed contain at least one RETRA gene marker domain. Using the conservation of genomic neighborhood (synteny), we were able to discriminate between RETRA genes with putative functionality in the vertebrates and those that probably function only in the context of mobile elements. We identified 35 such genes in human, along with their corresponding mouse and rat orthologs, which included almost all known human genes with similarity to mobile elements. The results also imply that the vast majority of the remaining RETRA genes in current gene sets are unlikely to encode vertebrate functions. To automatically annotate RETRA genes in other vertebrate genomes, we provide as a tool a set of marker protein domains and a manually refined list of domesticated or ancestral RETRA genes for rescuing genes with vertebrate functions.
Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution.
Hillier, L.W., Miller, W., Birney, E., Warren, W., Hardison, R.C., Ponting, C.P., Bork, P., Burt, D.W., Groenen, M.A., Delany, M.E., Dodgson, J.B., Chinwalla, A.T., Cliften, P.F., Clifton, S.W., Delehaunty, K.D., Fronick, C., Fulton, R.S., Graves, T.A., Kremitzki, C., Layman, D., Magrini, V., McPherson, J.D., Miner, T.L., Minx, P., Nash, W.E., Nhan, M.N., Nelson, J.O., Oddy, L.G., Pohl, C.S., Randall-Maher, J., Smith, S.M., Wallis, J.W., Yang, S.P., Romanov, M.N., Rondelli, C.M., Paton, B., Smith, J., Morrice, D., Daniels, L., Tempest, H.G., Robertson, L., Masabanda, J.S., Griffin, D.K., Vignal, A., Fillon, V., Jacobbson, L., Kerje, S., Andersson, L., Crooijmans, R.P., Aerts, J., van der Poel, J.J., Ellegren, H., Caldwell, R.B., Hubbard, S.J., Grafham, D.V., Kierzek, A.M., McLaren, S.R., Overton, I.M., Arakawa, H., Beattie, K.J., Bezzubov, Y., Boardman, P.E., Bonfield, J.K., Croning, M.D., Davies, R.M., Francis, M.D., Humphray, S.J., Scott, C.E., Taylor, R.G., Tickle, C., Brown, W.R., Rogers, J., Buerstedde, J.M., Wilson, S.A., Stubbs, L., Ovcharenko, I., Gordon, L., Lucas, S., Miller, M.M., Inoko, H., Shiina, T., Kaufman, J., Salomonsen, J., Skjoedt, K., Wong, G.K., Wang, J., Liu, B., Wang, J., Yu, J., Yang, H., Nefedov, M., Koriabine, M., Dejong, P.J., Goodstadt, L., Webber, C., Dickens, N.J., Letunic, I., Suyama, M., Torrents, D., von Mering, C., Zdobnov, E.M., Makova, K., Nekrutenko, A., Elnitski, L., Eswara, P., King, D.C., Yang, S., Tyekucheva, S., Radakrishnan, A., Harris, R.S., Chiaromonte, F., Taylor, J., He, J., Rijnkels, M., Griffiths-Jones, S., Ureta-Vidal, A., Hoffman, M.M., Severin, J., Searle, S.M., Law, A.S., Speed, D., Waddington, D., Cheng, Z., Tuzun, E., Eichler, E., Bao, Z., Flicek, P., Shteynberg, D.D., Brent, M.R., Bye, J.M., Huckle, E.J., Chatterji, S., Dewey, C., Pachter, L., Kouranov, A., Mourelatos, Z., Hatzigeorgiou, A.G., Paterson, A.H., Ivarie, R., Brandstrom, M., Axelsson, E., Backstrom, N., Berlin, S., Webster, M.T., Pourquie, O., Reymond, A., Ucla, C., Antonarakis, S.E., Long, M., Emerson, J.J., Betran, E., Dupanloup, I., Kaessmann, H., Hinrichs, A.S., Bejerano, G., Furey, T.S., Harte, R.A., Raney, B., Siepel, A., Kent, W.J., Haussler, D., Eyras, E., Castelo, R., Abril, J.F., Castellano, S., Camara, F., Parra, G., Guigo, R., Bourque, G., Tesler, G., Pevzner, P.A., Smit, A., Fulton, L.A., Mardis, E.R. & Wilson, R.K.
Nature 2004 Dec 9;432(7018):695-716.
We present here a draft genome sequence of the red jungle fowl, Gallus gallus. Because the chicken is a modern descendant of the dinosaurs and the first non-mammalian amniote to have its genome sequenced, the draft sequence of its genome--composed of approximately one billion base pairs of sequence and an estimated 20,000-23,000 genes--provides a new perspective on vertebrate genome evolution, while also improving the annotation of mammalian genomes. For example, the evolutionary distance between chicken and human provides high specificity in detecting functional elements, both non-coding and coding. Notably, many conserved non-coding sequences are far from genes and cannot be assigned to defined functional classes. In coding regions the evolutionary dynamics of protein domains and orthologous groups illustrate processes that distinguish the lineages leading to birds and mammals. The distinctive properties of avian microchromosomes, together with the inferred patterns of conserved synteny, provide additional insights into vertebrate chromosome architecture.
Shared components of protein complexes--versatile building blocks or biochemical artefacts?
Krause, R., von Mering, C., Bork, P. & Dandekar, T.
Bioessays 2004 Dec;26(12):1333-43.
Protein complexes perform many important functions in the cell. Large-scale studies of protein-protein interactions have not only revealed new complexes but have also placed many proteins into multiple complexes. Whilst the advocates of hypothesis-free research touted the discovery of these shared components as new links between diverse cellular processes, critical commentators denounced many of the findings as artefacts, thus questioning the usefulness of large-scale approaches. Here, we survey proteins known to be shared between complexes, as established in the literature, and compare them to shared components found in high-throughput screens. We discuss the various challenges to the identification and functional interpretation of bona fide shared components, namely contaminants, variant and megacomplexes, and transient interactions, and suggest that many of the novel shared components found in high-throughput screens are neither the results of contamination nor central components, but appear to be primarily regulatory links in cellular processes.
Gene expression profiling of the rat superior olivary complex using serial analysis of gene expression.
Koehl, A., Schmidt, N., Rieger, A., Pilgram, S.M., Letunic, I., Bork, P., Soto, F., Friauf, E. & Nothwang, H.G.
Eur J Neurosci 2004 Dec;20(12):3244-58.
The superior olivary complex (SOC) is an auditory brainstem region that represents a favourable system to study rapid neurotransmission and the maturation of neuronal circuits. Here we performed serial analysis of gene expression (SAGE) on the SOC in 60-day-old Sprague-Dawley rats to identify genes specifically important for its function and to create a transcriptome reference for the subsequent identification of age-related or disease-related changes. Sequencing of 31 035 tags identified 10 473 different transcripts. Fifty-seven per cent of the unique tags with a count greater than four were statistically more highly represented in the SOC than in the hippocampus. Among them were genes encoding proteins involved in energy supply, the glutamate/glutamine shuttle, and myelination. Approximately 80 plasma membrane transporters, receptors, channels, and vesicular transporters were identified, and 25% of them displayed a significantly higher expression level in the SOC than in the hippocampus. Some of the plasma membrane proteins were not previously characterized in the SOC, e.g. the purinergic receptor subunit P2X(6) and the metabotropic GABA receptor Gpr51. Differential gene expression between SOC and hippocampus was confirmed using RNA in situ hybridization or immunohistochemistry. The extensive gene inventory presented here will facilitate the dissection of the molecular mechanisms underlying specific SOC functions and the comparison with other SAGE libraries from brain will ease the identification of promoters to generate region-specific transgenic animals. The analysis will be part of the publicly available database ID-GRAB.
Gene annotation from scientific literature using mappings between keyword systems.
Perez, A.J., Perez-Iratxeta, C., Bork, P., Thode, G. & Andrade, M.A.
Bioinformatics 2004 Sep 1;20(13):2084-91. Epub 2004 Apr 01.
MOTIVATION: The description of genes in databases by keywords helps the non-specialist to quickly grasp the properties of a gene and increases the efficiency of computational tools that are applied to gene data (e.g. searching a gene database for sequences related to a particular biological process). However, the association of keywords to genes or protein sequences is a difficult process that ultimately implies examination of the literature related to a gene. RESULTS: To support this task, we present a procedure to derive keywords from the set of scientific abstracts related to a gene. Our system is based on the automated extraction of mappings between related terms from different databases using a model of fuzzy associations that can be applied with all generality to any pair of linked databases. We tested the system by annotating genes of the SWISS-PROT database with keywords derived from the abstracts linked to their entries (stored in the MEDLINE database of scientific references). The performance of the annotation procedure was much better for SWISS-PROT keywords (recall of 47%, precision of 68%) than for Gene Ontology terms (recall of 8%, precision of 67%). AVAILABILITY: The algorithm can be publicly accessed and used for the annotation of sequences through a web server at http://www.bork.embl.de/kat
Homology-based functional proteomics by mass spectrometry: application to the Xenopus microtubule-associated proteome.
Liska, A.J., Popov, A.V., Sunyaev, S., Coughlin, P., Habermann, B., Shevchenko, A., Bork, P., Karsenti, E. & Shevchenko, A.
Proteomics 2004 Sep;4(9):2707-21.
The application of functional proteomics to important model organisms with unsequenced genomes is restricted because of the limited ability to identify proteins by conventional mass spectrometry (MS) methods. Here we applied MS and sequence-similarity database searching strategies to characterize the Xenopus laevis microtubule-associated proteome. We identified over 40 unique, and many novel, microtubule-bound proteins, as well as two macromolecular protein complexes involved in protein translation. This finding was corroborated by electron microscopy showing the presence of ribosomes on spindles assembled from frog egg extracts. Taken together, these results suggest that protein translation occurs on the spindle during meiosis in the Xenopus oocyte. These findings were made possible due to the application of sequence-similarity methods, which extended mass spectrometric protein identification capabilities by 2-fold compared to conventional methods.
BLAST2GENE: a comprehensive conversion of BLAST output into independent genes and gene fragments.
Suyama, M., Torrents, D. & Bork, P.
Bioinformatics. 2004 Aug 12;20(12):1968-70. Epub 2004 Mar 22.
SUMMARY: BLAST2GENE is a program that allows a detailed analysis of genomic regions containing completely or partially duplicated genes. From a BLAST (or BL2SEQ) comparison of a protein or nucleotide query sequence with any genomic region of interest, BLAST2GENE processes all high scoring pairwise alignments (HSPs) and provides the disposition of all independent copies along the genomic fragment. The results are provided in text and PostScript formats to allow an automatic and visual evaluation of the respective region. AVAILABILITY: The program is available upon request from the authors. A web server of BLAST2GENE is maintained at http://www.bork.embl.de/blast2gene
ArrayProspector: a web resource of functional associations inferred from microarray expression data.
Jensen, L.J., Lagarde, J., von Mering, C. & Bork, P.
Nucleic Acids Res 2004 Jul 1;32(Web Server issue):W445-8.
DNA microarray experiments have provided vast amounts of data which can be used for inferring gene function. However, most methods for predicting functional associations between genes from expression data are not suited to simultaneous analysis of multiple datasets, and a comprehensive resource of coexpression-based predictions is currently lacking. Here, we present an interactive web resource of gene associations predicted by applying a novel algorithm to all expression data in the Stanford Microarray Database. The underlying pre-computed database currently contains more than 200 000 high-confidence gene associations in 12 different species sampled from a broad taxonomic range. The resource allows every association to be inspected visually and can be accessed at http://www.bork.embl.de/ArrayProspector.
Analysis of genomic context: prediction of functional associations from conserved bidirectionally transcribed gene pairs.
Korbel, J.O., Jensen, L.J., von Mering, C. & Bork, P.
Nat Biotechnol 2004 Jul;22(7):911-7.
Several widely used methods for predicting functional associations between proteins are based on the systematic analysis of genomic context. Efforts are ongoing to improve these methods and to search for novel aspects in genomes that could be exploited for function prediction. Here, we use gene expression data to demonstrate two functional implications of genome organization: first, chromosomal proximity indicates gene coregulation in prokaryotes independent of relative gene orientation; and second, adjacent bidirectionally transcribed genes (that is, 'divergently' organized coding regions) with conserved gene orientation are strongly coregulated. We further demonstrate that such bidirectionally transcribed gene pairs are functionally associated and derive from this a novel genomic context method that reliably predicts links between >2,500 pairs of genes in approximately 100 species. Around 650 of these functional associations are supported by other genomic context methods. In most instances, one gene encodes a transcriptional regulator, and the other a nonregulatory protein. In-depth analysis in Escherichia coli shows that the vast majority of these regulators both control transcription of the divergently transcribed target gene/operon and auto-regulate their own biosynthesis. The method thus enables the prediction of target processes and regulatory features for several hundred transcriptional regulators.
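The notion of a 'divergently' organized gene pair in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' method: the `Gene` record, the `divergent_pairs` function, the `max_gap` cutoff and the example gene names are all hypothetical, and the step that checks conservation of the orientation across species is not shown.

```python
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    name: str
    start: int   # leftmost coordinate on the chromosome
    end: int     # rightmost coordinate on the chromosome
    strand: str  # "+" or "-"


def divergent_pairs(genes: List[Gene], max_gap: int = 300) -> List[Tuple[str, str]]:
    """Return adjacent gene pairs that are transcribed away from each other.

    A pair counts as divergent when the left gene lies on the minus strand
    and its right-hand neighbour lies on the plus strand, so both start
    sites face a shared intergenic region. max_gap is a hypothetical cutoff
    on the length of that region, not a value taken from the paper.
    """
    ordered = sorted(genes, key=lambda g: g.start)
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        gap = right.start - left.end
        if left.strand == "-" and right.strand == "+" and 0 <= gap <= max_gap:
            pairs.append((left.name, right.name))
    return pairs


# Made-up coordinates for illustration only.
example = [
    Gene("regA", 100, 900, "-"),
    Gene("targB", 1050, 2000, "+"),
    Gene("geneC", 2100, 2900, "+"),
    Gene("geneD", 3000, 3800, "-"),
]
print(divergent_pairs(example))
```

In this toy input only regA and targB qualify; geneC and geneD face each other (convergent) and are skipped.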
Protein interaction networks from yeast to human.
Bork, P., Jensen, L.J., von Mering, C., Ramani, A.K., Lee, I. & Marcotte, E.M.
Curr Opin Struct Biol 2004 Jun;14(3):292-9.
Protein interaction networks summarize large amounts of protein-protein interaction data, both from individual, small-scale experiments and from automated high-throughput screens. The past year has seen a flood of new experimental data, especially on metazoans, as well as an increasing number of analyses designed to reveal aspects of network topology, modularity and evolution. As only minimal progress has been made in mapping the human proteome using high-throughput screens, the transfer of interaction information within and across species has become increasingly important. With more and more heterogeneous raw data becoming available, proper data integration and quality control have become essential for reliable protein network reconstruction, and will be especially important for reconstructing the human protein interaction network.
Genome sequence of the Brown Norway rat yields insights into mammalian evolution.
Gibbs, R.A., Weinstock, G.M., Metzker, M.L., Muzny, D.M., Sodergren, E.J., Scherer, S., Scott, G., Steffen, D., Worley, K.C., Burch, P.E., Okwuonu, G., Hines, S., Lewis, L., DeRamo, C., Delgado, O., Dugan-Rocha, S., Miner, G., Morgan, M., Hawes, A., Gill, R., Celera, Holt, R.A., Adams, M.D., Amanatides, P.G., Baden-Tillson, H., Barnstead, M., Chin, S., Evans, C.A., Ferriera, S., Fosler, C., Glodek, A., Gu, Z., Jennings, D., Kraft, C.L., Nguyen, T., Pfannkoch, C.M., Sitter, C., Sutton, G.G., Venter, J.C., Woodage, T., Smith, D., Lee, H.M., Gustafson, E., Cahill, P., Kana, A., Doucette-Stamm, L., Weinstock, K., Fechtel, K., Weiss, R.B., Dunn, D.M., Green, E.D., Blakesley, R.W., Bouffard, G.G., De Jong, P.J., Osoegawa, K., Zhu, B., Marra, M., Schein, J., Bosdet, I., Fjell, C., Jones, S., Krzywinski, M., Mathewson, C., Siddiqui, A., Wye, N., McPherson, J., Zhao, S., Fraser, C.M., Shetty, J., Shatsman, S., Geer, K., Chen, Y., Abramzon, S., Nierman, W.C., Havlak, P.H., Chen, R., Durbin, K.J., Egan, A., Ren, Y., Song, X.Z., Li, B., Liu, Y., Qin, X., Cawley, S., Worley, K.C., Cooney, A.J., D'Souza, L.M., Martin, K., Wu, J.Q., Gonzalez-Garay, M.L., Jackson, A.R., Kalafus, K.J., McLeod, M.P., Milosavljevic, A., Virk, D., Volkov, A., Wheeler, D.A., Zhang, Z., Bailey, J.A., Eichler, E.E., Tuzun, E., Birney, E., Mongin, E., Ureta-Vidal, A., Woodwark, C., Zdobnov, E., Bork, P., Suyama, M., Torrents, D., Alexandersson, M., Trask, B.J., Young, J.M., Huang, H., Wang, H., Xing, H., Daniels, S., Gietzen, D., Schmidt, J., Stevens, K., Vitt, U., Wingrove, J., Camara, F., Mar Alba, M., Abril, J.F., Guigo, R., Smit, A., Dubchak, I., Rubin, E.M., Couronne, O., Poliakov, A., Hubner, N., Ganten, D., Goesele, C., Hummel, O., Kreitler, T., Lee, Y.A., Monti, J., Schulz, H., Zimdahl, H., Himmelbauer, H., Lehrach, H., Jacob, H.J., Bromberg, S., Gullings-Handley, J., Jensen-Seaman, M.I., Kwitek, A.E., Lazar, J., Pasko, D., Tonellato, P.J., Twigger, S., Ponting, C.P., Duarte, J.M., Rice, S., 
Goodstadt, L., Beatson, S.A., Emes, R.D., Winter, E.E., Webber, C., Brandt, P., Nyakatura, G., Adetobi, M., Chiaromonte, F., Elnitski, L., Eswara, P., Hardison, R.C., Hou, M., Kolbe, D., Makova, K., Miller, W., Nekrutenko, A., Riemer, C., Schwartz, S., Taylor, J., Yang, S., Zhang, Y., Lindpaintner, K., Andrews, T.D., Caccamo, M., Clamp, M., Clarke, L., Curwen, V., Durbin, R., Eyras, E., Searle, S.M., Cooper, G.M., Batzoglou, S., Brudno, M., Sidow, A., Stone, E.A., Venter, J.C., Payseur, B.A., Bourque, G., Lopez-Otin, C., Puente, X.S., Chakrabarti, K., Chatterji, S., Dewey, C., Pachter, L., Bray, N., Yap, V.B., Caspi, A., Tesler, G., Pevzner, P.A., Haussler, D., Roskin, K.M., Baertsch, R., Clawson, H., Furey, T.S., Hinrichs, A.S., Karolchik, D., Kent, W.J., Rosenbloom, K.R., Trumbower, H., Weirauch, M., Cooper, D.N., Stenson, P.D., Ma, B., Brent, M., Arumugam, M., Shteynberg, D., Copley, R.R., Taylor, M.S., Riethman, H., Mudunuri, U., Peterson, J., Guyer, M., Felsenfeld, A., Old, S., Mockrin, S. & Collins, F.
Nature 2004 Apr 1;428(6982):493-521.
The laboratory rat (Rattus norvegicus) is an indispensable tool in experimental medicine and drug development, having made inestimable contributions to human health. We report here the genome sequence of the Brown Norway (BN) rat strain. The sequence represents a high-quality 'draft' covering over 90% of the genome. The BN rat sequence is the third complete mammalian genome to be deciphered, and three-way comparisons with the human and mouse genomes resolve details of mammalian evolution. This first comprehensive analysis includes genes and proteins and their relation to human disease, repeated sequences, comparative genome-wide studies of mammalian orthologous chromosomal regions and rearrangement breakpoints, reconstruction of ancestral karyotypes and the events leading to existing species, rates of variation, and lineage-specific and lineage-independent evolutionary events such as expansion of gene families, orthology relations and protein evolution.
Structure-based assembly of protein complexes in yeast.
Aloy, P., Böttcher, B., Ceulemans, H., Leutwein, C., Mellwig, C., Fischer, S., Gavin, A.C., Bork, P., Superti-Furga, G., Serrano, L. & Russell, R.B.
Science 2004 Mar 26;303(5666):2026-9.
Images of entire cells are preceding atomic structures of the separate molecular machines that they contain. The resulting gap in knowledge can be partly bridged by protein-protein interactions, bioinformatics, and electron microscopy. Here we use interactions of known three-dimensional structure to model a large set of yeast complexes, which we also screen by electron microscopy. For 54 of 102 complexes, we obtain at least partial models of interacting subunits. For 29, including the exosome, the chaperonin containing TCP-1, a 3'-messenger RNA degradation complex, and RNA polymerase II, the process suggests atomic details not easily seen by homology, involving the combination of two or more known structures. We also consider interactions between complexes (cross-talk) and use these to construct a structure-based network of molecular machines in the cell.
Global analysis of bacterial transcription factors to predict cellular target processes.
Doerks, T., Andrade, M.A., Lathe, W. 3rd, von Mering, C. & Bork, P.
Trends Genet 2004 Mar;20(3):126-31.
Whole-genome sequences are now available for >100 bacterial species, giving unprecedented power to comparative genomics approaches. We have applied genome-context methods to predict target processes that are regulated by transcription factors (TFs). Of 128 orthologous groups of proteins annotated as TFs, to date, 36 are functionally uncharacterized; in our analysis we predict a probable cellular target process or biochemical pathway for half of these functionally uncharacterized TFs.
The HUPO PSI's molecular interaction format--a community standard for the representation of protein interaction data.
Hermjakob, H., Montecchi-Palazzi, L., Bader, G., Wojcik, J., Salwinski, L., Ceol, A., Moore, S., Orchard, S., Sarkans, U., von Mering, C., Roechert, B., Poux, S., Jung, E., Mersch, H., Kersey, P., Lappe, M., Li, Y., Zeng, R., Rana, D., Nikolski, M., Husi, H., Brun, C., Shanker, K., Grant, S.G., Sander, C., Bork, P., Zhu, W., Pandey, A., Brazma, A., Jacq, B., Vidal, M., Sherman, D., Legrain, P., Cesareni, G., Xenarios, I., Eisenberg, D., Steipe, B., Hogue, C. & Apweiler, R.
Nat Biotechnol 2004 Feb;22(2):177-83.
A major goal of proteomics is the complete description of the protein interaction network underlying cell physiology. A large number of small scale and, more recently, large-scale experiments have contributed to expanding our understanding of the nature of the interaction network. However, the necessary data integration across experiments is currently hampered by the fragmentation of publicly available protein interaction data, which exists in different formats in databases, on authors' websites or sometimes only in print publications. Here, we propose a community standard data model for the representation and exchange of protein interaction data. This data model has been jointly developed by members of the Proteomics Standards Initiative (PSI), a work group of the Human Proteome Organization (HUPO), and is supported by major protein interaction data providers, in particular the Biomolecular Interaction Network Database (BIND), Cellzome (Heidelberg, Germany), the Database of Interacting Proteins (DIP), Dana Farber Cancer Institute (Boston, MA, USA), the Human Protein Reference Database (HPRD), Hybrigenics (Paris, France), the European Bioinformatics Institute's (EMBL-EBI, Hinxton, UK) IntAct, the Molecular Interactions (MINT, Rome, Italy) database, the Protein-Protein Interaction Database (PPID, Edinburgh, UK) and the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING, EMBL, Heidelberg, Germany).
RanBP2/Nup358 Provides a Major Binding Site for NXF1-p15 Dimers at the Nuclear Pore Complex and Functions in Nuclear mRNA Export.
Forler, D., Rabut, G., Ciccarelli, F.D., Herold, A., Kocher, T., Niggeweg, R., Bork, P., Ellenberg, J. & Izaurralde, E.
Mol Cell Biol 2004 Feb;24(3):1155-67.
Metazoan NXF1-p15 heterodimers promote the nuclear export of bulk mRNA across nuclear pore complexes (NPCs). In vitro, NXF1-p15 forms a stable complex with the nucleoporin RanBP2/Nup358, a component of the cytoplasmic filaments of the NPC, suggesting a role for this nucleoporin in mRNA export. We show that depletion of RanBP2 from Drosophila cells inhibits proliferation and mRNA export. Concomitantly, the localization of NXF1 at the NPC is strongly reduced and a significant fraction of this normally nuclear protein is detected in the cytoplasm. Under the same conditions, the steady-state subcellular localization of other nuclear or cytoplasmic proteins and CRM1-mediated protein export are not detectably affected, indicating that the release of NXF1 into the cytoplasm and the inhibition of mRNA export are not due to a general defect in NPC function. The specific role of RanBP2 in the recruitment of NXF1 to the NPC is highlighted by the observation that depletion of CAN/Nup214 also inhibits cell proliferation and mRNA export but does not affect NXF1 localization. Our results indicate that RanBP2 provides a major binding site for NXF1 at the cytoplasmic filaments of the NPC, thereby restricting its diffusion in the cytoplasm after NPC translocation. In RanBP2-depleted cells, NXF1 diffuses freely through the cytoplasm. Consequently, the nuclear levels of the protein decrease and export of bulk mRNA is impaired.
The Helmholtz Network for Bioinformatics: an integrative web portal for bioinformatics resources.
Crass, T., Antes, I., Basekow, R., Bork, P., Buning, C., Christensen, M., Claussen, H., Ebeling, C., Ernst, P., Gailus-Durner, V., Glatting, K.H., Gohla, R., Gossling, F., Grote, K., Heidtke, K., Herrmann, A., O'Keeffe, S., Kiesslich, O., Kolibal, S., Korbel, J.O., Lengauer, T., Liebich, I., Van Der Linden, M., Luz, H., Meissner, K., Von Mering, C., Mevissen, H.T., Mewes, H.W., Michael, H., Mokrejs, M., Muller, T., Pospisil, H., Rarey, M., Reich, J.G., Schneider, R., Schomburg, D., Schulze-Kremer, S., Schwarzer, K., Sommer, I., Springstubbe, S., Suhai, S., Thoppae, G., Vingron, M., Warfsmann, J., Werner, T., Wetzler, D., Wingender, E. & Zimmer, R.
Bioinformatics 2004 Jan 22;20(2):268-270.
SUMMARY: The Helmholtz Network for Bioinformatics (HNB) is a joint venture of eleven German bioinformatics research groups that offers convenient access to numerous bioinformatics resources through a single web portal. The 'Guided Solution Finder' which is available through the HNB portal helps users to locate the appropriate resources to answer their queries by employing a detailed, tree-like questionnaire. Furthermore, automated complex tool cascades ('tasks'), involving resources located on different servers, have been implemented, allowing users to perform comprehensive data analyses without the requirement of further manual intervention for data transfer and re-formatting. Currently, automated cascades for the analysis of regulatory DNA segments as well as for the prediction of protein functional properties are provided. AVAILABILITY: The HNB portal is available at http://www.hnbioinfo.de
SMART 4.0: towards genomic data integration.
Letunic, I., Copley, R.R., Schmidt, S., Ciccarelli, F.D., Doerks, T., Schultz, J., Ponting, C.P. & Bork, P.
Nucleic Acids Res 2004 Jan 1;32(1):D142-4.
SMART (Simple Modular Architecture Research Tool) is a web tool (http://smart.embl.de/) for the identification and annotation of protein domains, and provides a platform for the comparative study of complex domain architectures in genes and proteins. The January 2004 release of SMART contains 685 protein domains. New developments in SMART are centred on the integration of data from completed metazoan genomes. SMART now uses predicted proteins from complete genomes in its source sequence databases, and integrates these with predictions of orthology. New visualization tools have been developed to allow analysis of gene intron-exon structure within the context of protein domain structure, and to align these displays to provide schematic comparisons of orthologous genes, or multiple transcripts from the same gene. Other improvements include the ability to query SMART by Gene Ontology terms, improved structure database searching and batch retrieval of multiple entries.
Extracting regulatory gene expression networks from PubMed.
Saric, J., Jensen, L.J., Ouzounova, R., Rojas, I. & Bork, P.
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics 2004;192-199.
Quality analysis and integration of large scale molecular data sets.
Jensen, L.J. & Bork, P.
Drug Discovery Today (TARGETS) 2004;3:51-56.
Estimating rates of alternative splicing in mammals and invertebrates.
Harrington, E.D., Boue, S., Valcarcel, J., Reich, J.G. & Bork, P.
Nature Genet 2004;36:916-917.
Functional clues for hypothetical proteins based on genomic context analysis in prokaryotes.
Doerks, T., von Mering, C. & Bork, P.
Nucleic Acids Res 2004;32(21):6321-6.
Three integrated genomic context methods were used to annotate uncharacterized proteins in 102 bacterial genomes. Of 7853 orthologous groups with unknown function containing 45,110 proteins, 1738 groups could be linked to functionally associated partners. In many cases, those partners are uncharacterized themselves (hinting at newly identified modules) or have been described in general terms only. However, we were able to assign pathways, cellular processes or physical complexes for 273 groups (encompassing 3624 previously functionally uncharacterized proteins).
Sequences and structures in context.
Bork, P. & Orengo, C.
Curr Opin Struct Biol 2004;14:261-263.
Genome evolution reveals biochemical networks and functional modules.
von Mering, C., Zdobnov, E.M., Tsoka, S., Ciccarelli, F.D., Pereira-Leal, J.B., Ouzounis, C.A. & Bork, P.
Proc Natl Acad Sci U S A 2003 Dec 23;100(26):15428-33.
The analysis of completely sequenced genomes uncovers an astonishing variability between species in terms of gene content and order. During genome history, the genes are frequently rearranged, duplicated, lost, or transferred horizontally between genomes. These events appear to be stochastic, yet they are under selective constraints resulting from the functional interactions between genes. These genomic constraints form the basis for a variety of techniques that employ systematic genome comparisons to predict functional associations among genes. The most powerful techniques to date are based on conserved gene neighborhood, gene fusion events, and common phylogenetic distributions of gene families. Here we show that these techniques, if integrated quantitatively and applied to a sufficiently large number of genomes, have reached a resolution which allows the characterization of function at a higher level than that of the individual gene: global modularity becomes detectable in a functional protein network. In Escherichia coli, the predicted modules can be benchmarked by comparison to known metabolic pathways. We found as many as 74% of the known metabolic enzymes clustering together in modules, with an average pathway specificity of at least 84%. The modules extend beyond metabolism, and have led to hundreds of reliable functional predictions both at the protein and pathway level. The results indicate that modularity in protein networks is intrinsically encoded in present-day genomes.
The PAM domain, a multi-protein complex-associated module with an all-alpha-helix fold.
Ciccarelli, F.D., Izaurralde, E. & Bork, P.
BMC Bioinformatics 2003 Dec 19;4(1):64.
Background: Multimeric protein complexes have a role in many cellular pathways and are highly interconnected with various other proteins. The characterization of their domain composition and organization provides useful information on the specific role of each region of their sequence. Results: We identified a new module, the PAM domain (PCI/PINT associated module), present in single subunits of well characterized multiprotein complexes, like the regulatory lid of the 26S proteasome, the COP-9 signalosome and the Sac3-Thp1 complex. This module is a domain of around 200 residues with a predicted TPR-like all-alpha-helical fold. Conclusions: The occurrence of the PAM domain in specific subunits of multimeric protein complexes, together with the role of other all-alpha-helical folds in protein-protein interactions, suggests a function for this domain in mediating transient binding to diverse target proteins.
Impact of selection, mutation rate and genetic drift on human genetic variation.
Sunyaev, S., Kondrashov, F.A., Bork, P. & Ramensky, V.
Hum Mol Genet 2003 Dec 15;12(24):3325-30.
The accumulation of genome-wide information on single nucleotide polymorphisms in humans provides an unprecedented opportunity to detect the evolutionary forces responsible for heterogeneity of the level of genetic variability across loci. Previous studies have shown that history of recombination events has produced long haplotype blocks in the human genome, which contribute to this heterogeneity. Other factors, however, such as natural selection or the heterogeneity of mutation rates across loci, may also lead to heterogeneity of genetic variability. We compared synonymous and non-synonymous variability within human genes with their divergence from murine orthologs. We separately analyzed the non-synonymous variants predicted to damage protein structure or function and the variants predicted to be functionally benign. The predictions were based on comparative sequence analysis and, in some cases, on the analysis of protein structure. A strong correlation between non-synonymous, benign variability and non-synonymous human-mouse divergence suggests that selection played an important role in shaping the pattern of variability in coding regions of human genes. However, the lack of correlation between deleterious variability and evolutionary divergence shows that a substantial proportion of the observed non-synonymous single-nucleotide polymorphisms reduces fitness and never reaches fixation. Evolutionary and medical implications of the impact of selection on human polymorphisms are discussed.
A genome-wide survey of human pseudogenes.
Torrents, D., Suyama, M., Zdobnov, E. & Bork, P.
Genome Res 2003 Dec;13(12):2559-67.
We screened all intergenic regions in the human genome to identify pseudogenes with a combination of homology searches and a functionality test using the ratio of silent to replacement nucleotide substitutions (KA/KS). We identified 19,724 regions of which 95% +/- 3% are estimated to evolve neutrally and thus are likely to encode pseudogenes. Half of these have no detectable truncation in their pseudocoding regions and therefore are not identifiable by methods that require the presence of truncations to prove nonfunctionality. A comparative analysis with the mouse genome showed that 70% of these pseudogenes have a retrotranspositional origin (processed), and the rest arose by segmental duplication (nonprocessed). Although the spread of both types of pseudogenes correlates with chromosome size, nonprocessed pseudogenes appear to be enriched in regions with high gene density. It is likely that the human pseudogenes identified here represent only a small fraction of the total, which probably exceeds the number of genes.
Alternative splicing and evolution.
Boue, S., Letunic, I. & Bork, P.
Bioessays 2003 Nov;25(11):1031-4.
Alternative splicing (AS) is a critical post-transcriptional event leading to an increase in transcriptome diversity. Recent bioinformatics studies revealed a high frequency of alternative splicing. Although the extent of AS conservation among mammals is still being discussed, it has been argued that major forms of alternatively spliced transcripts are much better conserved than minor forms. This suggests that alternative splicing plays a major role in genome evolution, allowing new exons to evolve with less constraint.
A comprehensive set of protein complexes in yeast: mining large scale protein-protein interaction screens.
Krause, R., von Mering, C. & Bork, P.
Bioinformatics 2003 Oct 12;19(15):1901-8.
MOTIVATION: The analysis of protein-protein interactions allows for detailed exploration of the cellular machinery. The biochemical purification of protein complexes followed by identification of components by mass spectrometry is currently the method that delivers the most reliable information, although the data sets are still difficult to interpret. Consolidating individual experiments into protein complexes, especially for high-throughput screens, is complicated by many contaminants, the occurrence of proteins in otherwise dissimilar purifications due to functional re-use and technical limitations in the detection. A non-redundant collection of protein complexes from experimental data would be useful for biological interpretation, but manual assembly is tedious and often inconsistent. RESULTS: Here, we introduce a measure to define similarity within collections of purifications and generate a set of minimally redundant, comprehensive complexes using unsupervised clustering. AVAILABILITY: Programs and results are freely available from http://www.bork.embl-heidelberg.de/Docu/purclust/
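As a rough illustration of the idea of scoring similarity between purifications and merging them without supervision, the Python sketch below uses the Jaccard index and a greedy merge. Both are stand-ins chosen for brevity: the abstract does not state which similarity measure or clustering algorithm the authors used, and the function names, the threshold and the example data are hypothetical.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared proteins divided by all proteins seen in either purification."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_purifications(
    purifications: Dict[str, Set[str]], threshold: float = 0.5
) -> List[Set[str]]:
    """Greedily merge purifications whose Jaccard similarity >= threshold.

    The threshold of 0.5 is an arbitrary illustrative value.
    """
    clusters = [set(proteins) for proteins in purifications.values()]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if jaccard(clusters[i], clusters[j]) >= threshold:
                    clusters[i] |= clusters[j]  # absorb cluster j into i
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break  # restart the scan after every merge
    return clusters


# Made-up purifications keyed by bait; protein names are placeholders.
example = {
    "bait1": {"A", "B", "C"},
    "bait2": {"A", "B", "C", "D"},
    "bait3": {"X", "Y"},
}
print(merge_purifications(example))
```

Here the first two purifications share three of four proteins and are merged, while the third stays separate.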
Nonsense-mediated mRNA decay in Drosophila: at the intersection of the yeast and mammalian pathways.
Gatfield, D., Unterholzner, L., Ciccarelli, F.D., Bork, P. & Izaurralde, E.
EMBO J 2003 Aug 1;22(15):3960-70.
The nonsense-mediated mRNA decay (NMD) pathway promotes the rapid degradation of mRNAs containing premature stop codons (PTCs). In Caenorhabditis elegans, seven genes (smg1-7) playing an essential role in NMD have been identified. Only SMG2-4 (known as UPF1-3) have orthologs in Saccharomyces cerevisiae. Here we show that the Drosophila orthologs of UPF1-3, SMG1, SMG5 and SMG6 are required for the degradation of PTC-containing mRNAs, but that there is no SMG7 ortholog in this organism. In contrast, orthologs of SMG5-7 are encoded by the human genome and all three are required for NMD. In human cells, exon boundaries have been shown to play a critical role in defining PTCs. This role is mediated by components of the exon junction complex (EJC). Contrary to expectation, however, we show that the components of the EJC are dispensable for NMD in Drosophila cells. Consistently, PTC definition occurs independently of exon boundaries in Drosophila. Our findings reveal that despite conservation of the NMD machinery, different mechanisms have evolved to discriminate premature from natural stop codons in metazoa.
The DNA sequence of human chromosome 7.
Hillier, L.W., Fulton, R.S., Fulton, L.A., Graves, T.A., Pepin, K.H., Wagner-McPherson, C., Layman, D., Maas, J., Jaeger, S., Walker, R., Wylie, K., Sekhon, M., Becker, M.C., O'Laughlin, M.D., Schaller, M.E., Fewell, G.A., Delehaunty, K.D., Miner, T.L., Nash, W.E., Cordes, M., Du, H., Sun, H., Edwards, J., Bradshaw-Cordum, H., Ali, J., Andrews, S., Isak, A., Vanbrunt, A., Nguyen, C., Du, F., Lamar, B., Courtney, L., Kalicki, J., Ozersky, P., Bielicki, L., Scott, K., Holmes, A., Harkins, R., Harris, A., Strong, C.M., Hou, S., Tomlinson, C., Dauphin-Kohlberg, S., Kozlowicz-Reilly, A., Leonard, S., Rohlfing, T., Rock, S.M., Tin-Wollam, A.M., Abbott, A., Minx, P., Maupin, R., Strowmatt, C., Latreille, P., Miller, N., Johnson, D., Murray, J., Woessner, J.P., Wendl, M.C., Yang, S.P., Schultz, B.R., Wallis, J.W., Spieth, J., Bieri, T.A., Nelson, J.O., Berkowicz, N., Wohldmann, P.E., Cook, L.L., Hickenbotham, M.T., Eldred, J., Williams, D., Bedell, J.A., Mardis, E.R., Clifton, S.W., Chissoe, S.L., Marra, M.A., Raymond, C., Haugen, E., Gillett, W., Zhou, Y., James, R., Phelps, K., Iadanoto, S., Bubb, K., Simms, E., Levy, R., Clendenning, J., Kaul, R., Kent, W.J., Furey, T.S., Baertsch, R.A., Brent, M.R., Keibler, E., Flicek, P., Bork, P., Suyama, M., Bailey, J.A., Portnoy, M.E., Torrents, D., Chinwalla, A.T., Gish, W.R., Eddy, S.R., McPherson, J.D., Olson, M.V., Eichler, E.E., Green, E.D., Waterston, R.H. & Wilson, R.K.
Nature 2003 Jul 10;424(6945):157-64.
Human chromosome 7 has historically received prominent attention in the human genetics community, primarily related to the search for the cystic fibrosis gene and the frequent cytogenetic changes associated with various forms of cancer. Here we present more than 153 million base pairs representing 99.4% of the euchromatic sequence of chromosome 7, the first metacentric chromosome completed so far. The sequence has excellent concordance with previously established physical and genetic maps, and it exhibits an unusual amount of segmentally duplicated sequence (8.2%), with marked differences between the two arms. Our initial analyses have identified 1,150 protein-coding genes, 605 of which have been confirmed by complementary DNA sequences, and an additional 941 pseudogenes. Of genes confirmed by transcript sequences, some are polymorphic for mutations that disrupt the reading frame.
ELM server: a new resource for investigating short functional sites in modular eukaryotic proteins.
Puntervoll, P., Linding, R., Gemund, C., Chabanis-Davidson, S., Mattingsdal, M., Cameron, S., Martin, D.M., Ausiello, G., Brannetti, B., Costantini, A., Ferre, F., Maselli, V., Via, A., Cesareni, G., Diella, F., Superti-Furga, G., Wyrwicz, L., Ramu, C., McGuigan, C., Gudavalli, R., Letunic, I., Bork, P., Rychlewski, L., Kuster, B., Helmer-Citterich, M., Hunter, W.N., Aasland, R. & Gibson, T.J.
Nucleic Acids Res 2003 Jul 1;31(13):3625-30.
Multidomain proteins predominate in eukaryotic proteomes. Individual functions assigned to different sequence segments combine to create a complex function for the whole protein. While on-line resources are available for revealing globular domains in sequences, there has hitherto been no comprehensive collection of small functional sites/motifs comparable to the globular domain resources, yet these are as important for the function of multidomain proteins. Short linear peptide motifs are used for cell compartment targeting, protein-protein interaction, regulation by phosphorylation, acetylation, glycosylation and a host of other post-translational modifications. ELM, the Eukaryotic Linear Motif server at http://elm.eu.org/, is a new bioinformatics resource for investigating candidate short non-globular functional motifs in eukaryotic proteins, aiming to fill the void in bioinformatics tools. Sequence comparisons with short motifs are difficult to evaluate because the usual significance assessments are inappropriate. Therefore the server is implemented with several logical filters to eliminate false positives. Current filters are for cell compartment, globular domain clash and taxonomic range. In favourable cases, the filters can reduce the number of retained matches by an order of magnitude or more.
Update on XplorMed: A web server for exploring scientific literature.
Perez-Iratxeta, C., Perez, A.J., Bork, P. & Andrade, M.A.
Nucleic Acids Res 2003 Jul 1;31(13):3866-8.
As scientific literature databases like MEDLINE increase in size, so does the time required to search them. Scientists must frequently inspect long lists of references manually, often just reading the titles. XplorMed is a web tool that aids MEDLINE searching by summarizing the subjects contained in the results, thus allowing users to focus on subjects of interest. Here we describe new features added to XplorMed during the last 2 years (http://www.bork.embl-heidelberg.de/xplormed/).
Systematic discovery of analogous enzymes in thiamin biosynthesis.
Morett, E., Korbel, J.O., Rajan, E., Saab-Rincon, G., Olvera, L., Olvera, M., Schmidt, S., Snel, B. & Bork, P.
Nat Biotechnol 2003 Jul;21(7):790-5.
In all genome-sequencing projects completed to date, a considerable number of 'gaps' have been found in the biochemical pathways of the respective species. In many instances, missing enzymes are displaced by analogs, functionally equivalent proteins that have evolved independently and lack sequence and structural similarity. Here we fill such gaps by analyzing anticorrelating occurrences of genes across species. Our approach, applied to the thiamin biosynthesis pathway comprising approximately 15 catalytic steps, predicts seven instances in which known enzymes have been displaced by analogous proteins. So far we have verified four predictions by genetic complementation, including three proteins for which there was no previous experimental evidence of a role in the thiamin biosynthesis pathway. For one hypothetical protein, biochemical characterization confirmed the predicted thiamin phosphate synthase (ThiE) activity. The results demonstrate the ability of our computational approach to predict specific functions without taking into account sequence similarity.
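The idea of 'anticorrelating occurrences of genes across species' can be sketched in Python as a comparison of presence/absence profiles. This is a simplified illustration under stated assumptions, not the published procedure: the scoring function, the 0.9 threshold, the gene labels and the profiles are all hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def anticorrelation(a: List[int], b: List[int]) -> float:
    """Fraction of species in which exactly one of the two genes is present."""
    if len(a) != len(b) or not a:
        raise ValueError("profiles must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def candidate_analogs(
    profiles: Dict[str, List[int]], threshold: float = 0.9
) -> List[Tuple[str, str, float]]:
    """List gene pairs whose occurrence profiles are nearly complementary.

    The threshold is an arbitrary illustrative value.
    """
    hits = []
    for (g1, p1), (g2, p2) in combinations(sorted(profiles.items()), 2):
        score = anticorrelation(p1, p2)
        if score >= threshold:
            hits.append((g1, g2, score))
    return hits


# Made-up profiles: one column per species, 1 = gene present, 0 = absent.
profiles = {
    "known_enzyme": [1, 1, 0, 0, 1, 0],
    "candidate":    [0, 0, 1, 1, 0, 1],
    "other":        [1, 1, 1, 1, 1, 0],
}
print(candidate_analogs(profiles))
```

A pair that scores close to 1 is present in complementary sets of species, which is the pattern expected when one protein has displaced the other in a pathway.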
The KIND module: a putative signalling domain evolved from the C lobe of the protein kinase fold.
Ciccarelli, F.D., Bork, P. & Kerkhoff, E.
Trends Biochem Sci 2003 Jul;28(7):349-52.
Metabolites: a helping hand for pathway evolution?
Schmidt, S., Sunyaev, S., Bork, P. & Dandekar, T.
Trends Biochem Sci 2003 Jun;28(6):336-41.
The evolution of enzymes and pathways is under debate. Recent studies show that recruitment of single enzymes from different pathways could be the driving force for pathway evolution. Other mechanisms of evolution, such as pathway duplication, enzyme specialization, de novo invention of pathways or retro-evolution of pathways, appear to be less abundant. Twenty percent of enzyme superfamilies are quite variable, not only in changing reaction chemistry or metabolite type but in changing both at the same time. These variable superfamilies account for nearly half of all known reactions. The most frequently occurring metabolites provide a helping hand for such changes because they can be accommodated by many enzyme superfamilies. Thus, a picture is emerging in which new pathways are evolving from central metabolites by preference, thereby keeping the overall topology of the metabolic network.
Information extraction from full text scientific articles: where are the keywords?
Shah, P.K., Perez-Iratxeta, C., Bork, P. & Andrade, M.A.
BMC Bioinformatics 2003 May 29;4(1):20.
BACKGROUND: To date, many of the methods for information extraction of biological information from scientific articles are restricted to the abstract of the article. However, full text articles in electronic version, which offer larger sources of data, are currently available. Several questions arise as to whether the effort of scanning full text articles is worthy, or whether the information that can be extracted from the different sections of an article can be relevant. RESULTS: In this work we addressed those questions showing that the keyword content of the different sections of a standard scientific article (abstract, introduction, methods, results, and discussion) is very heterogeneous. CONCLUSIONS: Although the abstract contains the best ratio of keywords per total of words, other sections of the article may be a better source of biologically relevant data.
The way we write.
Netzel, R., Perez-Iratxeta, C., Bork, P. & Andrade, M.A.
EMBO Rep 2003 May;4(5):446-51.
Pathogenesis of DNA repair-deficient cancers: a statistical meta-analysis of putative Real Common Target genes.
Woerner, S.M., Benner, A., Sutter, C., Schiller, M., Yuan, Y.P., Keller, G., Bork, P., Doeberitz, M.K. & Gebert, J.F.
Oncogene 2003 Apr 17;22(15):2226-35.
DNA mismatch repair deficiency is observed in about 15% of human colorectal, gastric, and endometrial tumors and in lower frequencies in a minority of other tumors thereby causing insertion/deletion mutations at short repetitive sequences, recognized as microsatellite instability (MSI). Evolution of tumors, including those with MSI, is a continuous process of mutation and selection favoring neoplastic growth. Mutations in microsatellite-bearing genes that promote tumor cell growth in general (Real Common Target genes) are assumed to be the driving force during MSI carcinogenesis. Thus, microsatellite mutations in these genes should occur more frequently than mutations in microsatellite genes without contribution to malignancy (ByStander genes). So far, only a few Real Common Target genes have been identified by functional studies. Thus, comprehensive analysis of microsatellite mutations will provide important clues to the understanding of MSI-driven carcinogenesis. Here, we evaluated published mutation frequencies on 194 repeat tracts in 137 genes in MSI-H colorectal, endometrial, and gastric carcinomas and propose a statistical model that aims to identify Real Common Target genes. According to our model nine genes including BAX and TGFbetaRII were identified as Real Common Targets in colorectal cancer, one gene in gastric cancer, and three genes in endometrial cancer. Microsatellite mutations in five additional genes seem to be counterselected in gastrointestinal tumors. Overall, the general applicability, the capacity to unlimited data analysis, the inclusion of mutation data generated by different groups on different sets of tumors make this model a useful tool for predicting Real Common Target genes with specificity for MSI-H tumors of different organs, guiding subsequent functional studies to the most likely targets among numerous microsatellite harboring genes.
Function prediction and protein networks.
Huynen, M.A., Snel, B., Mering, C. & Bork, P.
Curr Opin Cell Biol 2003 Apr;15(2):191-8.
In the genomics era, the interactions between proteins are at the center of attention. Genomic-context methods used to predict these interactions have been put on a quantitative basis, revealing that they are at least on an equal footing with genomics experimental data. A survey of experimentally confirmed predictions proves the applicability of these methods, and new concepts to predict protein interactions in eukaryotes have been described. Finally, the interaction networks that can be obtained by combining the predicted pair-wise interactions have enough internal structure to detect higher levels of organization, such as 'functional modules'.
The identification of a conserved domain in both spartin and spastin, mutated in hereditary spastic paraplegia.
Ciccarelli, F.D., Proukakis, C., Patel, H., Cross, H., Azam, S., Patton, M.A., Bork, P. & Crosby, A.H.
Genomics 2003 Apr;81(4):437-41.
Multiple sequence alignment has revealed the presence of a sequence domain of approximately 80 amino acids in two molecules, spartin and spastin, mutated in hereditary spastic paraplegia. The domain, which corresponds to a slightly extended version of the recently described ESP domain of unknown function, was also identified in VPS4, SKD1, RPK118, and SNX15, all of which have a well established and consistent role in endosomal trafficking. Recent functional information indicates that spastin is likely to be involved in microtubule interaction. With this new information relating to its likely function, we propose the more descriptive name 'MIT' (contained within microtubule-interacting and trafficking molecules) for the domain and predict endosomal trafficking as the principal functionality of all molecules in which it is present.
Increase of functional diversity by alternative splicing.
Kriventseva, E.V., Koch, I., Apweiler, R., Vingron, M., Bork, P., Gelfand, M.S. & Sunyaev, S.
Trends Genet 2003 Mar;19(3):124-8.
A large-scale analysis of protein isoforms arising from alternative splicing shows that alternative splicing tends to insert or delete complete protein domains more frequently than expected by chance, whereas disruption of domains and other structural modules is less frequent. If domain regions are disrupted, the functional effect, as predicted from 3D structure, is frequently equivalent to removal of the entire domain. Also, short alternative splicing events within domains, which might preserve folded structure, target functional residues more frequently than expected. Thus, it seems that positive selection has had a major role in the evolution of alternative splicing.
Bioinformatics in the post-sequence era.
Kanehisa, M. & Bork, P.
Nat Genet 2003 Mar;33 Suppl:305-10.
In the past decade, bioinformatics has become an integral part of research and development in the biomedical sciences. Bioinformatics now has an essential role both in deciphering genomic, transcriptomic and proteomic data generated by high-throughput experimental technologies and in organizing information gathered from traditional biology. Sequence-based methods of analyzing individual genes or proteins have been elaborated and expanded, and methods have been developed for analyzing large numbers of genes or proteins simultaneously, such as in the identification of clusters of related genes and networks of interacting proteins. With the complete genome sequences for an increasing number of organisms at hand, bioinformatics is beginning to provide both conceptual bases and practical methods for detecting systemic functional behaviors of the cell and the organism.
STRING: a database of predicted functional associations between proteins.
von Mering, C., Huynen, M., Jaeggi, D., Schmidt, S., Bork, P. & Snel, B.
Nucleic Acids Res 2003 Jan 1;31(1):258-61.
Functional links between proteins can often be inferred from genomic associations between the genes that encode them: groups of genes that are required for the same function tend to show similar species coverage, are often located in close proximity on the genome (in prokaryotes), and tend to be involved in gene-fusion events. The database STRING is a precomputed global resource for the exploration and analysis of these associations. Since the three types of evidence differ conceptually, and the number of predicted interactions is very large, it is essential to be able to assess and compare the significance of individual predictions. Thus, STRING contains a unique scoring-framework based on benchmarks of the different types of associations against a common reference set, integrated in a single confidence score per prediction. The graphical representation of the network of inferred, weighted protein interactions provides a high-level view of functional linkage, facilitating the analysis of modularity in biological processes. STRING is updated continuously, and currently contains 261 033 orthologs in 89 fully sequenced genomes. The database predicts functional interactions at an expected level of accuracy of at least 80% for more than half of the genes; it is online at http://www.bork.embl-heidelberg.de/STRING/.
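The abstract does not spell out how the per-evidence scores are integrated into a single confidence score. One common way to combine evidence types, shown here purely as an assumption, is to treat the scores as independent probabilities and compute the chance that at least one of them is correct. The function name and the example values are hypothetical.

```python
from math import prod


def combined_confidence(scores):
    """Combine independent per-evidence confidences as 1 - prod(1 - s_i).

    Assumption: each score is a probability in [0, 1] and the evidence
    types are independent. This is an illustrative scheme, not a formula
    quoted from the abstract.
    """
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
    return 1.0 - prod(1.0 - s for s in scores)


# Hypothetical scores for one protein pair from the three evidence types.
evidence = {"neighborhood": 0.6, "co_occurrence": 0.5, "fusion": 0.0}
total = combined_confidence(evidence.values())
print(round(total, 3))
```

Under this scheme a pair with weak support from several evidence types can still reach a high combined score, while an evidence type with a score of 0 leaves the result unchanged.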
The InterPro Database, 2003 brings increased coverage and new features.
Mulder, N.J., Apweiler, R., Attwood, T.K., Bairoch, A., Barrell, D., Bateman, A., Binns, D., Biswas, M., Bradley, P., Bork, P., Bucher, P., Copley, R.R., Courcelle, E., Das, U., Durbin, R., Falquet, L., Fleischmann, W., Griffiths-Jones, S., Haft, D., Harte, N., Hulo, N., Kahn, D., Kanapin, A., Krestyaninova, M., Lopez, R., Letunic, I., Lonsdale, D., Silventoinen, V., Orchard, S.E., Pagni, M., Peyruc, D., Ponting, C.P., Selengut, J.D., Servant, F., Sigrist, C.J., Vaughan, R. & Zdobnov, E.M.
Nucleic Acids Res 2003 Jan 1;31(1):315-8.
InterPro, an integrated documentation resource of protein families, domains and functional sites, was created in 1999 as a means of amalgamating the major protein signature databases into one comprehensive resource. PROSITE, Pfam, PRINTS, ProDom, SMART and TIGRFAMs have been manually integrated and curated and are available in InterPro for text- and sequence-based searching. The results are provided in a single format that rationalises the results that would be obtained by searching the member databases individually. The latest release of InterPro contains 5629 entries describing 4280 families, 1239 domains, 95 repeats and 15 post-translational modifications. Currently, the combined signatures in InterPro cover more than 74% of all proteins in SWISS-PROT and TrEMBL, an increase of nearly 15% since the inception of InterPro. New features of the database include improved searching capabilities and enhanced graphical user interfaces for visualisation of the data. The database is available via a webserver (http://www.ebi.ac.uk/interpro) and anonymous FTP (ftp://ftp.ebi.ac.uk/pub/databases/interpro).
The human genome: genes, pseudogenes, and variation on chromosome 7.
Waterston, R.H., Hillier, L.W., Fulton, L.A., Fulton, R.S., Graves, T.A., Pepin, K.H., Bork, P., Suyama, M., Torrents, D., Chinwalla, A.T., Mardis, E.R., McPherson, J.D. & Wilson, R.K.
Cold Spring Harb Symp Quant Biol 2003;68:13-22.
Detection and characterization of pseudogenes.
Torrents, D., Suyama, M. & Bork, P.
In "Bioinformatics and Genomes." M. Andrade (Ed.). Horizon Sci. Press, 197-209
Nanoelectrospray tandem mass spectrometry and sequence similarity searching for identification of proteins from organisms with unknown genomes.
Shevchenko, A., Sunyaev, S., Liska, A., Bork, P. & Shevchenko, A.
Methods Mol Biol 2003;211:221-34.
High rate of gene displacement in vitamine biosynthetic pathways.
Morett, E., Saab-Rincon, G., Merino, E., Bork, P., Rajan, E., Olvera, L. & Olvera, M.
In "Bioinformatics and Genomes." M. Andrade (Ed.). Horizon Sci. Press, 69-79
Initial sequencing and comparative analysis of the mouse genome.
Waterston, R.H., Lindblad-Toh, K., Birney, E., et al.
Nature 2002 Dec 5;420(6915):520-62.
The sequence of the mouse genome is a key informational tool for understanding the contents of the human genome and a key experimental tool for biomedical research. Here, we report the results of an international collaboration to produce a high-quality draft sequence of the mouse genome. We also present an initial comparative analysis of the mouse and human genomes, describing some of the insights that can be gleaned from the two sequences. We discuss topics including the analysis of the evolutionary forces shaping the size, structure and sequence of the genomes; the conservation of large-scale synteny across most of the genomes; the much lower extent of sequence orthology covering less than half of the genomes; the proportions of the genomes under selection; the number of protein-coding genes; the expansion of gene families related to reproduction and immunity; the evolution of proteins; and the identification of intraspecies polymorphism.
Identification and characterization of UEV3, a human cDNA with similarities to inactive E2 ubiquitin-conjugating enzymes.
Kloor, M., Bork, P., Duwe, A., Klaes, R., von Knebel Doeberitz, M. & Ridder, R.
Biochim Biophys Acta 2002 Dec 12;1579(2-3):219-24.
Recent studies have shown that ubiquitination is an essential factor in endosomal sorting and virus assembly. The human TSG101 gene has been demonstrated to belong to a group of genes coding for apparently inactive E2 ubiquitin-conjugating enzymes, which exert regulatory effects on E2 activity in cellular ubiquitination processes. In this study, a novel human cDNA (UEV3) encoding a putative protein of 379 amino acids was isolated from a human placenta library that may represent a partial paralogue of human TSG101. The predicted protein contains an N-terminal domain homologous to the catalytic domain of ubiquitin-conjugating enzymes (Ubc), which is fused to a sequence showing significant homology to members of the lactate dehydrogenase protein family. The UEV3 gene is located on chromosome 11 closely adjacent to TSG101 and LDH-C. Northern blot and UEV3-specific reverse transcription/polymerase chain reaction (RT/PCR) analyses of various colon carcinoma cell lines as well as both normal and tumor samples from colon revealed an expression of the UEV3 cDNA in all tested samples.
Comparative genome and proteome analysis of Anopheles gambiae and Drosophila melanogaster.
Zdobnov, E.M., von Mering, C., Letunic, I., Torrents, D., Suyama, M., Copley, R.R., Christophides, G.K., Thomasova, D., Holt, R.A., Subramanian, G.M., Mueller, H.M., Dimopoulos, G., Law, J.H., Wells, M.A., Birney, E., Charlab, R., Halpern, A.L., Kokoza, E., Kraft, C.L., Lai, Z., Lewis, S., Louis, C., Barillas-Mury, C., Nusskern, D., Rubin, G.M., Salzberg, S.L., Sutton, G.G., Topalis, P., Wides, R., Wincker, P., Yandell, M., Collins, F.H., Ribeiro, J., Gelbart, W.M., Kafatos, F.C. & Bork, P.
Science 2002 Oct 4;298(5591):149-59.
Comparison of the genomes and proteomes of the two diptera Anopheles gambiae and Drosophila melanogaster, which diverged about 250 million years ago, reveals considerable similarities. However, numerous differences are also observed; some of these must reflect the selection and subsequent adaptation associated with different ecologies and life strategies. Almost half of the genes in both genomes are interpreted as orthologs and show an average sequence identity of about 56%, which is slightly lower than that observed between the orthologs of the pufferfish and human (diverged about 450 million years ago). This indicates that these two insects diverged considerably faster than vertebrates. Aligned sequences reveal that orthologous genes have retained only half of their intron/exon structure, indicating that intron gains or losses have occurred at a rate of about one per gene per 125 million years. Chromosomal arms exhibit significant remnants of homology between the two species, although only 34% of the genes colocalize in small "microsyntenic" clusters, and major interarm transfers as well as intra-arm shuffling of gene order are detected.
The genome sequence of the malaria mosquito Anopheles gambiae.
Holt, R.A., Subramanian, G.M., Halpern, A., Sutton, G.G., Charlab, R., Nusskern, D.R., Wincker, P., Clark, A.G., Ribeiro, J.M., Wides, R., Salzberg, S.L., Loftus, B., Yandell, M., Majoros, W.H., Rusch, D.B., Lai, Z., Kraft, C.L., Abril, J.F., Anthouard, V., Arensburger, P., Atkinson, P.W., Baden, H., de Berardinis, V., Baldwin, D., Benes, V., Biedler, J., Blass, C., Bolanos, R., Boscus, D., Barnstead, M., Cai, S., Center, A., Chatuverdi, K., Christophides, G.K., Chrystal, M.A., Clamp, M., Cravchik, A., Curwen, V., Dana, A., Delcher, A., Dew, I., Evans, C.A., Flanigan, M., Grundschober-Freimoser, A., Friedli, L., Gu, Z., Guan, P., Guigo, R., Hillenmeyer, M.E., Hladun, S.L., Hogan, J.R., Hong, Y.S., Hoover, J., Jaillon, O., Ke, Z., Kodira, C., Kokoza, E., Koutsos, A., Letunic, I., Levitsky, A., Liang, Y., Lin, J.J., Lobo, N.F., Lopez, J.R., Malek, J.A., McIntosh, T.C., Meister, S., Miller, J., Mobarry, C., Mongin, E., Murphy, S.D., O'Brochta, D.A., Pfannkoch, C., Qi, R., Regier, M.A., Remington, K., Shao, H., Sharakhova, M.V., Sitter, C.D., Shetty, J., Smith, T.J., Strong, R., Sun, J., Thomasova, D., Ton, L.Q., Topalis, P., Tu, Z., Unger, M.F., Walenz, B., Wang, A., Wang, J., Wang, M., Wang, X., Woodford, K.J., Wortman, J.R., Wu, M., Yao, A., Zdobnov, E.M., Zhang, H., Zhao, Q., Zhao, S., Zhu, S.C., Zhimulev, I., Coluzzi, M., della Torre, A., Roth, C.W., Louis, C., Kalush, F., Mural, R.J., Myers, E.W., Adams, M.D., Smith, H.O., Broder, S., Gardner, M.J., Fraser, C.M., Birney, E., Bork, P., Brey, P.T., Venter, J.C., Weissenbach, J., Kafatos, F.C., Collins, F.H. & Hoffman, S.L.
Science 2002 Oct 4;298(5591):129-49.
Anopheles gambiae is the principal vector of malaria, a disease that afflicts more than 500 million people and causes more than 1 million deaths each year. Tenfold shotgun sequence coverage was obtained from the PEST strain of A. gambiae and assembled into scaffolds that span 278 million base pairs. A total of 91% of the genome was organized in 303 scaffolds; the largest scaffold was 23.1 million base pairs. There was substantial genetic variation within this strain, and the apparent existence of two haplotypes of approximately equal frequency ("dual haplotypes") in a substantial fraction of the genome likely reflects the outbred nature of the PEST strain. The sequence produced a conservative inference of more than 400,000 single-nucleotide polymorphisms that showed a markedly bimodal density distribution. Analysis of the genome sequence revealed strong evidence for about 14,000 protein-encoding transcripts. Prominent expansions in specific families of proteins likely involved in cell adhesion and immunity were noted. An expressed sequence tag analysis of genes regulated by blood feeding provided insights into the physiological adaptations of a hematophagous insect.
Immunity-related genes and gene families in Anopheles gambiae.
Christophides, G.K., Zdobnov, E., Barillas-Mury, C., Birney, E., Blandin, S., Blass, C., Brey, P.T., Collins, F.H., Danielli, A., Dimopoulos, G., Hetru, C., Hoa, N.T., Hoffmann, J.A., Kanzok, S.M., Letunic, I., Levashina, E.A., Loukeris, T.G., Lycett, G., Meister, S., Michel, K., Moita, L.F., Muller, H.M., Osta, M.A., Paskewitz, S.M., Reichhart, J.M., Rzhetsky, A., Troxler, L., Vernick, K.D., Vlachou, D., Volz, J., von Mering, C., Xu, J., Zheng, L., Bork, P. & Kafatos, F.C.
Science 2002 Oct 4;298(5591):159-65.
We have identified 242 Anopheles gambiae genes from 18 gene families implicated in innate immunity and have detected marked diversification relative to Drosophila melanogaster. Immune-related gene families involved in recognition, signal modulation, and effector systems show a marked deficit of orthologs and excessive gene expansions, possibly reflecting selection pressures from different pathogens encountered in these insects' very different life-styles. In contrast, the multifunctional Toll signal transduction pathway is substantially conserved, presumably because of counterselection for developmental stability. Representative expression profiles confirm that sequence diversification is accompanied by specific responses to different immune challenges. Alternative RNA splicing may also contribute to expansion of the immune repertoire.
The genome sequence of Bifidobacterium longum reflects its adaptation to the human gastrointestinal tract.
Schell, M.A., Karmirantzou, M., Snel, B., Vilanova, D., Berger, B., Pessi, G., Zwahlen, M.C., Desiere, F., Bork, P., Delley, M., Pridmore, R.D. & Arigoni, F.
Proc Natl Acad Sci U S A 2002 Oct 29;99(22):14422-7.
Bifidobacteria are Gram-positive prokaryotes that naturally colonize the human gastrointestinal tract (GIT) and vagina. Although not numerically dominant in the complex intestinal microflora, they are considered as key commensals that promote a healthy GIT. We determined the 2.26-Mb genome sequence of an infant-derived strain of Bifidobacterium longum, and identified 1,730 possible coding sequences organized in a 60%-GC circular chromosome. Bioinformatic analysis revealed several physiological traits that could partially explain the successful adaptation of this bacteria to the colon. An unexpectedly large number of the predicted proteins appeared to be specialized for catabolism of a variety of oligosaccharides, some possibly released by rare or novel glycosyl hydrolases acting on "nondigestible" plant polymers or host-derived glycoproteins and glycoconjugates. This ability to scavenge from a large variety of nutrients likely contributes to the competitiveness and persistence of bifidobacteria in the colon. Many genes for oligosaccharide metabolism were found in self-regulated modules that appear to have arisen in part from gene duplication or horizontal acquisition. Complete pathways for all amino acids, nucleotides, and some key vitamins were identified; however, routes for Asp and Cys were atypical. More importantly, genome analysis provided insights into the reciprocal interactions of bifidobacteria with their hosts. We identified polypeptides that showed homology to most major proteins needed for production of glycoprotein-binding fimbriae, structures that could possibly be important for adhesion and persistence in the GIT. We also found a eukaryotic-type serine protease inhibitor (serpin) possibly involved in the reported immunomodulatory activity of bifidobacteria.
Comparative analysis of protein interaction networks.
Bioinformatics 2002 Oct;18 Suppl 2:S64.
Recent advances in proteomics and computational biology have led to a flood of protein interaction data and resulting interaction networks (e.g. Gavin et al., 2002). Here I first analyse the status and quality of parts lists (genes and proteins), then comparatively assess large-scale protein interaction data (von Mering et al., 2002) and finally try to identify biologically meaningful units (e.g. pathways, cellular processes) within interaction networks that are derived from the conservation of gene neighborhood (Snel et al., 2002). Possible extensions of gene neighborhood analysis to eukaryotes (von Mering and Bork, 2002) will be discussed.
Human non-synonymous SNPs: server and survey.
Ramensky, V., Bork, P. & Sunyaev, S.
Nucleic Acids Res 2002 Sep 1;30(17):3894-900.
Human single nucleotide polymorphisms (SNPs) represent the most frequent type of human population DNA variation. One of the main goals of SNP research is to understand the genetics of the human phenotype variation and especially the genetic basis of human complex diseases. Non-synonymous coding SNPs (nsSNPs) comprise a group of SNPs that, together with SNPs in regulatory regions, are believed to have the highest impact on phenotype. Here we present a World Wide Web server to predict the effect of an nsSNP on protein structure and function. The prediction method enabled analysis of the publicly available SNP database HGVbase, which gave rise to a dataset of nsSNPs with predicted functionality. The dataset was further used to compare the effect of various structural and functional characteristics of amino acid substitutions responsible for phenotypic display of nsSNPs. We also studied the dependence of selective pressure on the structural and functional properties of proteins. We found that in our dataset the selection pressure against deleterious SNPs depends on the molecular function of the protein, although it is insensitive to several other protein features considered. The strongest selective pressure was detected for proteins involved in transcription regulation.
InterPro: an integrated documentation resource for protein families, domains and functional sites.
Mulder, N.J., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Biswas, M., Bradley, P., Bork, P., Bucher, P., Copley, R., Courcelle, E., Durbin, R., Falquet, L., Fleischmann, W., Gouzy, J., Griffith-Jones, S., Haft, D., Hermjakob, H., Hulo, N., Kahn, D., Kanapin, A., Krestyaninova, M., Lopez, R., Letunic, I., Orchard, S., Pagni, M., Peyruc, D., Ponting, C.P., Servant, F. & Sigrist, C.J.
Brief Bioinform 2002 Sep;3(3):225-35.
The exponential increase in the submission of nucleotide sequences to the nucleotide sequence database by genome sequencing centres has resulted in a need for rapid, automatic methods for classification of the resulting protein sequences. There are several signature and sequence cluster-based methods for protein classification, each resource having distinct areas of optimum application owing to the differences in the underlying analysis methods. In recognition of this, InterPro was developed as an integrated documentation resource for protein families, domains and functional sites, to rationalise the complementary efforts of the individual protein signature database projects. The member databases - PRINTS, PROSITE, Pfam, ProDom, SMART and TIGRFAMs - form the InterPro core. Related signatures from each member database are unified into single InterPro entries. Each InterPro entry includes a unique accession number, functional descriptions and literature references, and links are made back to the relevant member database(s). Release 4.0 of InterPro (November 2001) contains 4,691 entries, representing 3,532 families, 1,068 domains, 74 repeats and 15 sites of post-translational modification (PTMs) encoded by different regular expressions, profiles, fingerprints and hidden Markov models (HMMs). Each InterPro entry lists all the matches against SWISS-PROT and TrEMBL (2,141,621 InterPro hits from 586,124 SWISS-PROT and TrEMBL protein sequences). The database is freely accessible for text- and sequence-based searches.
NEAT: a domain duplicated in genes near the components of a putative Fe3+ siderophore transporter from Gram-positive pathogenic bacteria.
Andrade, M.A., Ciccarelli, F.D., Perez-Iratxeta, C. & Bork, P.
Genome Biol 2002 Aug 15;3(9):RESEARCH0047.
BACKGROUND: Iron uptake from the host is essential for bacteria that infect animals. To find potential targets for drugs active against pathogenic bacteria, we have searched all completely sequenced genomes of pathogenic bacteria for genes relevant for iron transport. RESULTS: We identified a protein domain that appears in variable copy number in bacterial genes that are usually in the vicinity of a putative Fe3+ siderophore transporter. Accordingly, we have denoted this domain NEAT for 'near transporter'. Most of the bacterial species containing this domain are pathogenic. Sequence features indicate that the domain is anchored to the extracellular side of the membrane. The domain seems to be under high selective pressure for rapid independent duplications that are typical of sequences involved in signaling and binding. CONCLUSIONS: The NEAT domain might be functionally related to iron transport. The taxonomic specificity of this domain and its predicted extracellular position could make it an interesting target for designing new drugs against some highly pathogenic bacteria.
SPG20 is mutated in Troyer syndrome, an hereditary spastic paraplegia.
Patel, H., Cross, H., Proukakis, C., Hershberger, R., Bork, P., Ciccarelli, F.D., Patton, M.A., McKusick, V.A. & Crosby, A.H.
Nat Genet 2002 Aug;31(4):347-8.
Troyer syndrome (TRS) is an autosomal recessive complicated hereditary spastic paraplegia (HSP) that occurs with high frequency in the Old Order Amish. We report mapping of the TRS locus to chromosome 13q12.3 and identify a frameshift mutation in SPG20, encoding spartin. Comparative sequence analysis indicates that spartin shares similarity with molecules involved in endosomal trafficking and with spastin, a molecule implicated in microtubule interaction that is commonly mutated in HSP.
Predicting protein cellular localization using a domain projection method.
Mott, R., Schultz, J., Bork, P. & Ponting, C.P.
Genome Res 2002 Aug;12(8):1168-74.
We investigate the co-occurrence of domain families in eukaryotic proteins to predict protein cellular localization. Approximately half (300) of SMART domains form a "small-world network", linked by no more than seven degrees of separation. Projection of the domains onto two-dimensional space reveals three clusters that correspond to cellular compartments containing secreted, cytoplasmic, and nuclear proteins. The projection method takes into account the existence of "bridging" domains, that is, instances where two domains might not occur with each other but frequently co-occur with a third domain; in such circumstances the domains are neighbors in the projection. While the majority of domains are specific to a compartment ("locale"), and hence may be used to localize any protein that contains such a domain, a small subset of domains either are present in multiple locales or occur in transmembrane proteins. Comparison with previously annotated proteins shows that SMART domain data used with this approach can predict, with 92% accuracy, the localizations of 23% of eukaryotic proteins. The coverage and accuracy will increase with improvements in domain database coverage. This method is complementary to approaches that use amino-acid composition or identify sorting sequences; these methods may be combined to further enhance prediction accuracy.
The rhodanese/Cdc25 phosphatase superfamily. Sequence-structure-function relations.
Bordo, D. & Bork, P.
EMBO Rep 2002 Aug;3(8):741-6.
Rhodanese domains are ubiquitous structural modules occurring in the three major evolutionary phyla. They are found as tandem repeats, with the C-terminal domain hosting the properly structured active-site Cys residue, as single domain proteins or in combination with distinct protein domains. An increasing number of reports indicate that rhodanese modules are versatile sulfur carriers that have adapted their function to fulfill the need for reactive sulfane sulfur in distinct metabolic and regulatory pathways. Recent investigations have shown that rhodanese domains are also structurally related to the catalytic subunit of Cdc25 phosphatase enzymes and that the two enzyme families are likely to share a common evolutionary origin. In this review, the rhodanese/Cdc25 phosphatase superfamily is analyzed. Although the identification of their biological substrates has thus far proven elusive, the emerging picture points to a role for the amino-acid composition of the active-site loop in substrate recognition/specificity. Furthermore, the frequently observed association of catalytically inactive rhodanese modules with other protein domains suggests a distinct regulatory role for these inactive domains, possibly in connection with signaling.
Association of genes to genetically inherited diseases using data mining.
Perez-Iratxeta, C., Bork, P. & Andrade, M.A.
Nat Genet 2002 Jul;31(3):316-9.
Although approximately one-quarter of the roughly 4,000 genetically inherited diseases currently recorded in respective databases (LocusLink, OMIM) are already linked to a region of the human genome, about 450 have no known associated gene. Finding disease-related genes requires laborious examination of hundreds of possible candidate genes (sometimes, these are not even annotated; see, for example, refs 3,4). The public availability of the human genome draft sequence has fostered new strategies to map molecular functional features of gene products to complex phenotypic descriptions, such as those of genetically inherited diseases. Owing to recent progress in the systematic annotation of genes using controlled vocabularies, we have developed a scoring system for the possible functional relationships of human genes to 455 genetically inherited diseases that have been mapped to chromosomal regions without assignment of a particular gene. In a benchmark of the system with 100 known disease-associated genes, the disease-associated gene was among the 8 best-scoring genes with a 25% chance, and among the best 30 genes with a 50% chance, showing that there is a relationship between the score of a gene and its likelihood of being associated with a particular disease. The scoring also indicates that for some diseases, the chance of identifying the underlying gene is higher.
A complex prediction: three-dimensional model of the yeast exosome.
Aloy, P., Ciccarelli, F.D., Leutwein, C., Gavin, A.C., Superti-Furga, G., Bork, P., Böttcher, B. & Russell, R.B.
EMBO Rep 2002 Jul;3(7):628-35.
We present a model of the yeast exosome based on the bacterial degradosome component polynucleotide phosphorylase (PNPase). Electron microscopy shows the exosome to resemble PNPase but with key differences likely related to the position of RNA binding domains, and to the location of domains unique to the exosome. We use various techniques to reduce the many possible models of exosome subunits based on PNPase to just one. The model suggests numerous experiments to probe exosome function, particularly with respect to subunits making direct atomic contacts and conserved, possibly functional residues within the predicted central pore of the complex.
Teamed up for transcription.
von Mering, C. & Bork, P.
Nature 2002 Jun 20;417(6891):797-8.
Comparative genomic analysis in the region of a major Plasmodium-refractoriness locus of Anopheles gambiae.
Thomasova, D., Ton, L.Q., Copley, R.R., Zdobnov, E.M., Wang, X., Hong, Y.S., Sim, C., Bork, P., Kafatos, F.C. & Collins, F.H.
Proc Natl Acad Sci U S A 2002 Jun 11;99(12):8179-84.
We have sequenced six overlapping clones from a library of bacterial artificial chromosome (BAC) clones derived from a laboratory strain of the mosquito, Anopheles gambiae, the major vector of human malaria in Africa. The resulting uninterrupted 528-kb sequence is from the 8C region of the mosquito 2R chromosome, at or very near the major refractoriness locus associated with melanotic encapsulation of parasites. This sequence represents the first extensive view of the mosquito genome structure encompassing 48 genes. Genomic comparison reveals that the majority of the orthologues are found in six microsyntenic clusters in Drosophila melanogaster. A BAC clone that is wholly contained within this region demonstrates the existence of a remarkable degree of local polymorphism in this species, which may prove important for its population structure and vectorial capacity.
Computing fuzzy associations for the analysis of biological literature.
Perez-Iratxeta, C., Keer, H.S., Bork, P. & Andrade, M.A.
Biotechniques 2002 Jun;32(6):1380-2, 1384-5.
The increase of information in biology makes it difficult for researchers in any field to keep current with the literature. The MEDLINE database of scientific abstracts can be quickly scanned using electronic mechanisms. Potentially interesting abstracts can be selected by matching words joined by Boolean operators. However, this means of selecting documents is not optimal. Nonspecific queries have to be effected, resulting in large numbers of irrelevant abstracts that have to be manually scanned. To facilitate this analysis, we have developed a system that compiles a summary of subjects and related documents on the results of a MEDLINE query. For this, we have applied a fuzzy binary relation formalism that deduces relations between words present in a set of abstracts preprocessed with a standard grammatical tagger. Those relations are used to derive ensembles of related words and their associated subsets of abstracts. The algorithm can be used publicly at http://www.bork.embl-heidelberg.de/xplormed/.
Exploring MEDLINE abstracts with XplorMed.
Perez-Iratxeta, C., Bork, P. & Andrade, M.A.
Drugs Today (Barc) 2002 Jun;38(6):381-9.
XplorMed is a publicly available web tool conceived to make life easier for MEDLINE(c) users looking for scientific information. Searching scientific literature is an information retrieval problem. Abstracts that are of possible interest to the user are usually selected by a keyword search followed by manual screening, which often results in the retrieval of a large number of abstracts. Interesting references can be buried among irrelevant ones because of nonspecific queries. XplorMed is intended to extract dependency relations between the words of the abstracts. These relations can be filtered and arranged to deduce different subjects in the query and offer a condensed view of the abstract, allowing users to select texts of interest without having to read them all. XplorMed is available at http://www.bork.embl-heidelberg.de/xplormed.
Comparative assessment of large-scale data sets of protein-protein interactions.
von Mering, C., Krause, R., Snel, B., Cornell, M., Oliver, S.G., Fields, S. & Bork, P.
Nature 2002 May 23;417(6887):399-403.
Comprehensive protein-protein interaction maps promise to reveal many aspects of the complex regulatory network underlying cellular function. Recently, large-scale approaches have predicted many new protein interactions in yeast. To measure their accuracy and potential as well as to identify biases, strengths and weaknesses, we compare the methods with each other and with a reference set of previously reported protein interactions.
The identification of functional modules from the genomic association of genes.
Snel, B., Bork, P. & Huynen, M.A.
Proc Natl Acad Sci U S A 2002 Apr 30;99(9):5890-5.
By combining the pairwise interactions between proteins, as predicted by the conserved co-occurrence of their genes in operons, we obtain protein interaction networks. Here we study the properties of such networks to identify functional modules: sets of proteins that together are involved in a biological process. The complete network contains 3,033 orthologous groups of proteins in 38 genomes. It consists of one giant component, containing 1,611 orthologous groups, and of 516 small disjointed clusters that, on average, contain only 2.7 orthologous groups. These small clusters have a homogeneous functional composition and thus represent functional modules in themselves. Analysis of the giant component reveals that it is a scale-free, small-world network with a high degree of local clustering (C = 0.6). It consists of locally highly connected subclusters that are connected to each other by linker proteins. The linker proteins tend to have multiple functions, or are involved in multiple processes and have an above average probability of being essential. By splitting up the giant component at these linker proteins, we identify 265 subclusters that tend to have a homogeneous functional composition. The rare functional inhomogeneities in our subclusters reflect the mixing of different types of (molecular) functions in a single cellular process, exemplified by subclusters containing both metabolic enzymes as well as the transcription factors that regulate them. Comparative genome analysis, thus, allows identification of a level of functional interaction between that of pairwise interactions, and of the complete genome.
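The network measures named in this abstract (giant component, small disjoint clusters, local clustering coefficient C) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names and the toy links below are hypothetical, and the input is assumed to be a plain list of pairwise links between orthologous-group identifiers.

```python
from collections import defaultdict


def build_graph(pairs):
    """Build undirected adjacency sets from (a, b) links; self-links are ignored."""
    graph = defaultdict(set)
    for a, b in pairs:
        if a != b:
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def connected_components(graph):
    """Return the connected components as sets, largest first."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], {start}
        while stack:
            node = stack.pop()
            for nb in graph[node]:
                if nb not in comp:
                    comp.add(nb)
                    stack.append(nb)
        seen |= comp
        components.append(comp)
    return sorted(components, key=len, reverse=True)


def local_clustering(graph, node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    nbs = list(graph[node])
    k = len(nbs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbs[j] in graph[nbs[i]]
    )
    return 2.0 * links / (k * (k - 1))


def average_clustering(graph, nodes):
    """Mean local clustering coefficient over the given nodes."""
    nodes = list(nodes)
    return sum(local_clustering(graph, n) for n in nodes) / len(nodes)


# Hypothetical toy links, for illustration only.
toy_pairs = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("E", "F")]
toy_graph = build_graph(toy_pairs)
toy_components = connected_components(toy_graph)
giant = toy_components[0]
print(len(giant), len(toy_components) - 1, average_clustering(toy_graph, giant))
```

In this toy network the largest component plays the role of the giant component, and the remaining components correspond to the small disjoint clusters; the averaged coefficient is the quantity reported as C in the abstract.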
BSD: a novel domain in transcription factors and synapse-associated proteins.
Doerks, T., Huber, S., Buchner, E. & Bork, P.
Trends Biochem Sci 2002 Apr;27(4):168-70.
This article describes a novel domain, BSD, that is present in basal transcription factors, synapse-associated proteins and several hypothetical proteins. It occurs in a variety of species ranging from primal protozoan to human. The BSD domain is characterized by three predicted alpha helices, which probably form a three-helical bundle, as well as by conserved tryptophan and phenylalanine residues, located at the C terminus of the domain.
A versatile structural domain analysis server using profile weight matrices.
Schmidt, S., Bork, P. & Dandekar, T.
J Chem Inf Comput Sci 2002 Mar-Apr;42(2):405-7.
The WEB tool "AnDom" assigns to a given protein sequence all experimentally determined structural domains contained within it, including multidomain and large proteins. The server uses profile specific matrices from custom generated multiple sequence alignments of all known SCOP domains (SCOP version 1.50). Prediction time is short allowing numerous applications for structural genomics including investigation of complex eucaryotic protein families. The WWW server is at http://www.bork.embl-heidelberg.de/AnDom, and profiles can be downloaded at ftp.bork.embl-heidelberg.de/pub/users/schmidt/AnDom.
SHOT: a web server for the construction of genome phylogenies.
Korbel, J.O., Snel, B., Huynen, M.A. & Bork, P.
Trends Genet 2002 Mar;18(3):158-62.
With the increasing availability of genome sequences, new methods are being proposed that exploit information from complete genomes to classify species in a phylogeny. Here we present SHOT, a web server for the classification of genomes on the basis of shared gene content or the conservation of gene order that reflects the dominant, phylogenetic signal in these genomic properties. In general, the genome trees are consistent with classical gene-based phylogenies, although some interesting exceptions indicate massive horizontal gene transfer. SHOT is a useful tool for analysing the tree of life from a genomic point of view. It is available at http://www.Bork.EMBL-Heidelberg.de/SHOT.
AMOP, a protein module alternatively spliced in cancer cells.
Ciccarelli, F.D., Doerks, T. & Bork, P.
Trends Biochem Sci 2002 Mar;27(3):113-5.
This article describes a new extracellular domain, AMOP (for adhesion-associated domain in MUC4 and other proteins). This domain occurs in putative cell adhesion molecules and in some splice variants of MUC4. MUC4 splice variants are overexpressed in several tumours; in particular, they are highly expressed in pancreatic carcinomas but not in normal pancreas. The presence of AMOP in cell adhesion molecules could be indicative of a role for this domain in adhesion.
Protein domain analysis in the era of complete genomes.
Copley, R.R., Doerks, T., Letunic, I. & Bork, P.
FEBS Lett 2002 Feb 20;513(1):129-34.
Domains present one of the most useful levels at which to understand protein function, and domain family-based analysis has had a profound impact on the study of individual proteins. Protein domain discovery has been progressing steadily over the past 30 years. What are the realistically achievable goals of sequence-based domain analysis, and how far off are they for the sequences encoded in eukaryotic genomes? Here we address some of the issues involved in better coverage of sequence-based domain annotation, and the integration of these results within the wider context of genomes, structures and function.
Genome and protein evolution in eukaryotes.
Copley, R.R., Letunic, I. & Bork, P.
Curr Opin Chem Biol 2002 Feb;6(1):39-45.
The past year has seen the completion of the genome sequence of the flowering plant Arabidopsis thaliana and the initial sequence reports of the human genome. The availability of completely sequenced eukaryotic genomes from disparate phylogenetic lineages has opened the door to comparative analyses and a better understanding of the evolutionary processes shaping genomes. Complex many-to-many relationships between genes from different species appear to be the norm, suggesting that transfer of detailed functional annotation will not be straightforward. In addition to expansion and contraction of gene families, new genes evolve from recombination of pre-existing domains, although some domain families do appear to have evolved recently and to be specific to restricted phylogenetic lineages. The overall picture is of a huge diversity of gene content within eukaryotic genomes, reflecting different functional demands in different species.
CASH: a beta-helix domain widespread among carbohydrate-binding proteins.
Ciccarelli, F.D., Copley, R.R., Doerks, T., Russell, R.B. & Bork, P.
Trends Biochem Sci 2002 Feb;27(2):59-62.
In this article, we describe a novel, widespread domain (CASH) that is shared by many carbohydrate-binding proteins and sugar hydrolases. This domain occurs in more than 1000 proteins distributed among all three kingdoms of life. The CASH domain is characterized by internal repetitions of glycines and hydrophobic residues that correspond to the repetitive units of a predicted or observed right-handed beta-helix structure of the pectate lyase superfamily.
Functional organization of the yeast proteome by systematic analysis of protein complexes.
Gavin, A.C., Bosche, M., Krause, R., Grandi, P., Marzioch, M., Bauer, A., Schultz, J., Rick, J.M., Michon, A.M., Cruciat, C.M., Remor, M., Hofert, C., Schelder, M., Brajenovic, M., Ruffner, H., Merino, A., Klein, K., Hudak, M., Dickson, D., Rudi, T., Gnau, V., Bauch, A., Bastuck, S., Huhse, B., Leutwein, C., Heurtier, M.A., Copley, R.R., Edelmann, A., Querfurth, E., Rybin, V., Drewes, G., Raida, M., Bouwmeester, T., Bork, P., Seraphin, B., Kuster, B., Neubauer, G. & Superti-Furga, G.
Nature 2002 Jan 10;415(6868):141-7.
Most cellular processes are carried out by multiprotein complexes. The identification and analysis of their components provides insight into how the ensemble of expressed proteins (proteome) is organized into functional units. We used tandem-affinity purification (TAP) and mass spectrometry in a large-scale approach to characterize multiprotein complexes in Saccharomyces cerevisiae. We processed 1,739 genes, including 1,143 human orthologues of relevance to human biology, and purified 589 protein assemblies. Bioinformatic analysis of these assemblies defined 232 distinct multiprotein complexes and proposed new cellular roles for 344 proteins, including 231 proteins with no previous functional annotation. Comparison of yeast and human complexes showed that conservation across species extends from single proteins to their molecular environment. Our analysis provides an outline of the eukaryotic proteome as a network of protein complexes at a level of organization beyond binary interactions. This higher-order map contains fundamental biological information and offers the context for a more reasoned and informed approach to drug discovery.
Recent improvements to the SMART domain-based sequence annotation resource.
Letunic, I., Goodstadt, L., Dickens, N.J., Doerks, T., Schultz, J., Mott, R., Ciccarelli, F., Copley, R.R., Ponting, C.P. & Bork, P.
Nucleic Acids Res 2002 Jan 1;30(1):242-4.
SMART (Simple Modular Architecture Research Tool, http://smart.embl-heidelberg.de) is a web-based resource used for the annotation of protein domains and the analysis of domain architectures, with particular emphasis on mobile eukaryotic domains. Extensive annotation for each domain family is available, providing information relating to function, subcellular localization, phyletic distribution and tertiary structure. The January 2002 release has added more than 200 hand-curated domain models. This brings the total to over 600 domain families that are widely represented among nuclear, signalling and extracellular proteins. Annotation now includes links to the Online Mendelian Inheritance in Man (OMIM) database in cases where a human disease is associated with one or more mutations in a particular domain. We have implemented new analysis methods and updated others. New advanced queries provide direct access to the SMART relational database using SQL. This database now contains information on intrinsic sequence features such as transmembrane regions, coiled-coils, signal peptides and internal repeats. SMART output can now be easily included in users' documents. A SMART mirror has been created at http://smart.ox.ac.uk.
HGVbase: a human sequence variation database emphasizing data quality and a broad spectrum of data sources.
Fredman, D., Siegfried, M., Yuan, Y.P., Bork, P., Lehvaslaiho, H. & Brookes, A.J.
Nucleic Acids Res 2002 Jan 1;30(1):387-91.
HGVbase (Human Genome Variation database; http://hgvbase.cgb.ki.se, formerly known as HGBASE) is an academic effort to provide a high quality and non-redundant database of available genomic variation data of all types, mostly comprising single nucleotide polymorphisms (SNPs). Records include neutral polymorphisms as well as disease-related mutations. Online search tools facilitate data interrogation by sequence similarity and keyword queries, and searching by genome coordinates is now being implemented. Downloads are freely available in XML, Fasta, SRS, SQL and tagged-text file formats. Each entry is presented in the context of its surrounding sequence and many records are related to neighboring human genes and affected features therein. Population allele frequencies are included wherever available. Thorough semi-automated data checking ensures internal consistency and addresses common errors in the source information. To keep pace with recent growth in the field, we have developed tools for fully automated annotation. All variants have been uniquely mapped to the draft genome sequence and are referenced to positions in EMBL/GenBank files. Data utility is enhanced by provision of genotyping assays and functional predictions. Recent data structure extensions allow the capture of haplotype and genotype information, and a new initiative (along with BiSC and HUGO-MDI) aims to create a central repository for the broad collection of clinical mutations and associated disease phenotypes of interest.
Genomes in flux: the evolution of archaeal and proteobacterial gene content.
Snel, B., Bork, P. & Huynen, M.A.
Genome Res 2002 Jan;12(1):17-25.
In the course of evolution, genomes are shaped by processes like gene loss, gene duplication, horizontal gene transfer, and gene genesis (the de novo origin of genes). Here we reconstruct the gene content of ancestral Archaea and Proteobacteria and quantify the processes connecting them to their present day representatives based on the distribution of genes in completely sequenced genomes. We estimate that the ancestor of the Proteobacteria contained around 2500 genes, and the ancestor of the Archaea around 2050 genes. Although it is necessary to invoke horizontal gene transfer to explain the content of present day genomes, gene loss, gene genesis, and simple vertical inheritance are quantitatively the most dominant processes in shaping the genome. Together they result in a turnover of gene content such that even the lineage leading from the ancestor of the Proteobacteria to the relatively large genome of Escherichia coli has lost at least 950 genes. Gene loss, unlike the other processes, correlates fairly well with time. This clock-like behavior suggests that gene loss is under negative selection, while the processes that add genes are under positive selection.
Systematic identification of novel protein domain families associated with nuclear functions.
Doerks, T., Copley, R.R., Schultz, J., Ponting, C.P. & Bork, P.
Genome Res 2002 Jan;12(1):47-56.
A systematic computational analysis of protein sequences containing known nuclear domains led to the identification of 28 novel domain families. This represents a 26% increase in the starting set of 107 known nuclear domain families used for the analysis. Most of the novel domains are present in all major eukaryotic lineages, but 3 are species specific. For about 500 of the 1200 proteins that contain these new domains, nuclear localization could be inferred, and for 700, additional features could be predicted. For example, we identified a new domain, likely to have a role downstream of the unfolded protein response; a nematode-specific signalling domain; and a widespread domain, likely to be a noncatalytic homolog of ubiquitin-conjugating enzymes.
Alternative splicing and genome complexity.
Brett, D., Pospisil, H., Valcarcel, J., Reich, J. & Bork, P.
Nat Genet 2002 Jan;30(1):29-30.
Alternative splicing of mRNA allows many gene products with different functions to be produced from a single coding sequence. It has recently been proposed as a mechanism by which higher-order diversity is generated. Here we show, using large-scale expressed sequence tag (EST) analysis, that among seven different eukaryotes the amount of alternative splicing is comparable, with no large differences between humans and other animals.
Proteinkomplexe und Netzwerke: Neue Herausforderungen für die Proteomik [Protein complexes and networks: new challenges for proteomics].
von Mering, C. & Bork, P.
Genomexpress 2/02, 2-4
Conservation of gene co-regulation in prokaryotes and eukaryotes.
Snel, B., Bork, P. & Huynen, M.A.
Trends Biotechnol 2002;20(10):410.
Exon duplication: A driving force in eukaryotic gene evolution.
Letunic, I., Copley, R. & Bork, P.
Hum Mol Genet 2002;11:1561-1567.
Comparative genome analysis of the mollicutes.
Dandekar, T., Snel, B., Schmidt, S., Lathe, W., Suyama, M., Huynen, M. & Bork, P.
In "Mollicutes" (eds Herrmann et al.)
Sequence analysis of multidomain proteins: past perspectives and future directions.
Copley, R.R., Ponting, C.P., Schultz, J. & Bork, P.
Adv Protein Chem 2002;61:75-98.
The Spir actin organizers are involved in vesicle transport processes.
Kerkhoff, E., Simpson, J.C., Leberfinger, C.B., Otto, I.M., Doerks, T., Bork, P., Rapp, U.R., Raabe, T. & Pepperkok, R.
Curr Biol 2001 Dec 11;11(24):1963-8.
The p150-Spir protein, which was discovered as a phosphorylation target of the Jun N-terminal kinase, is an essential regulator of the polarization of the Drosophila oocyte. Spir proteins are highly conserved between species and belong to the family of Wiskott-Aldrich homology region 2 (WH2) proteins involved in actin organization. The C-terminal region of Spir encodes a zinc finger structure highly homologous to FYVE motifs. A region with high homology between the Spir family proteins is located adjacent (N-terminal) to the modified FYVE domain and is designated as "Spir-box." The Spir-box has sequence similarity to a region of rabphilin-3A, which mediates interaction with the small GTPase Rab3A. Coexpression of p150-Spir and green fluorescent protein-tagged Rab GTPases in NIH 3T3 cells revealed that the Spir protein colocalized specifically with the Rab11 GTPase, which is localized at the trans-Golgi network (TGN), post-Golgi vesicles, and the recycling endosome. The distinct Spir localization pattern was dependent on the integrity of the modified FYVE finger motif and the Spir-box. Overexpression of a mouse Spir-1 dominant interfering mutant strongly inhibited the transport of the vesicular stomatitis virus G (VSV G) protein to the plasma membrane. The viral protein was arrested in membrane structures, largely colocalizing with the TGN marker TGN46. Our findings that the Spir actin organizer is targeted to intracellular membrane structures by its modified FYVE zinc finger and is involved in vesicle transport processes provide a novel link between actin organization and intracellular transport.
Novel protein domains and repeats in Drosophila melanogaster: insights into structure, function, and evolution.
Ponting, C.P., Mott, R., Bork, P. & Copley, R.R.
Genome Res 2001 Dec;11(12):1996-2008.
Sequence database searching methods, such as BLAST, are invaluable for predicting molecular function on the basis of sequence similarities among single regions of proteins. Searches of whole databases, however, are not optimized to detect multiple homologous regions within a single polypeptide. Here we have used the prospero algorithm to perform self-comparisons of all predicted Drosophila melanogaster gene products. Predicted repeats, and their homologs from all species, were analyzed further to detect hitherto unappreciated evolutionary relationships. Results included the identification of novel tandem repeats in the human X-linked retinitis pigmentosa type-2 gene product, repeated segments in cystinosin, associated with a defect in cystine transport, and 'nested' homologous domains in dysferlin, whose gene is mutated in limb girdle muscular dystrophy. Novel signaling domain families were found that may regulate the microtubule-based cytoskeleton and ubiquitin-mediated proteolysis, respectively. Two families of glycosyl hydrolases were shown to contain internal repetitions that hint at their evolution via a piecemeal, modular approach. In addition, three examples of fruit fly genes were detected with tandem exons that appear to have arisen via internal duplication. These findings demonstrate how completely sequenced genomes can be exploited to further understand the relationships between molecular structure, function, and evolution.
The phylogenetic distribution of frataxin indicates a role in iron-sulfur cluster protein assembly.
Huynen, M.A., Snel, B., Bork, P. & Gibson, T.J.
Hum Mol Genet 2001 Oct 1;10(21):2463-8.
Much has been learned about the cellular pathology of Friedreich's ataxia, a recessive neurodegenerative disease resulting from insufficient expression of the mitochondrial protein frataxin. However, the biochemical function of frataxin has remained obscure, hampering attempts at therapeutic intervention. To predict functional interactions of frataxin with other proteins we investigated whether its gene specifically co-occurs with any other genes in sequenced genomes. In 56 available genomes we identified two genes with identical phylogenetic distributions to the frataxin/cyaY gene: hscA and hscB/JAC1. These genes have not only emerged in the same evolutionary lineage as the frataxin gene, they have also been lost at least twice with it, and they have been horizontally transferred with it in the evolution of the mitochondria. The proteins encoded by hscA and hscB, the chaperone HSP66 and the co-chaperone HSP20, have been shown to be required for the synthesis of 2Fe-2S clusters on ferredoxin in proteobacteria. JAC1, an ortholog of hscB, and SSQ1, a paralog of hscA, have been shown to be required for iron-sulfur cluster assembly in mitochondria of Saccharomyces cerevisiae. Combining data on the co-occurrence of genes in genomes with experimental and predicted cellular localization data of their proteins supports the hypothesis that frataxin is directly involved in iron-sulfur cluster protein assembly. The combined data indicate that frataxin is specifically involved in the same sub-process as HSP20/Jac1p.
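The core idea of the search described in this abstract, finding genes whose phylogenetic distribution is identical to that of a query gene, can be sketched as follows. This is a simplified illustration rather than the published method: the function name and the genome labels `g1` to `g4` are hypothetical, the toy presence/absence table is invented for demonstration, and orthology assignment is assumed to have been done beforehand.

```python
def identical_profiles(presence, query):
    """Return genes whose set of genomes exactly matches that of the query gene.

    presence: dict mapping gene name -> iterable of genome identifiers
              in which an ortholog of that gene was found.
    """
    target = frozenset(presence[query])
    return sorted(
        gene
        for gene, genomes in presence.items()
        if gene != query and frozenset(genomes) == target
    )


# Invented toy table: genome identifiers g1..g4 are placeholders.
toy_presence = {
    "cyaY": {"g1", "g2", "g4"},
    "hscA": {"g1", "g2", "g4"},
    "hscB": {"g1", "g2", "g4"},
    "geneX": {"g1", "g2", "g3", "g4"},
    "geneY": {"g2"},
}

print(identical_profiles(toy_presence, "cyaY"))
```

Exact matching is the strictest criterion; a tolerance for a few mismatching genomes would be a natural relaxation, but that is not something the abstract describes.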
XplorMed: a tool for exploring MEDLINE abstracts.
Perez-Iratxeta, C., Bork, P. & Andrade, M.A.
Trends Biochem Sci 2001 Sep;26(9):573-5.
The most frequent access to the MEDLINE database of scientific abstracts is by keyword search. However, this is often not sufficient because although the user might find all the useful abstracts, these are buried in hundreds that are irrelevant. The exploratory tool XplorMed has been developed to analyse the result of any MEDLINE query. It suggests main groups of related topics and documents, sparing the user the need of reading all abstracts.
Evolution of tuf genes: ancient duplication, differential loss and gene conversion.
Lathe, W.C. 3rd & Bork, P.
FEBS Lett 2001 Aug 3;502(3):113-6.
The tuf gene of eubacteria, encoding the EF-tu elongation factor, was duplicated early in the evolution of the taxon. Phylogenetic and genomic location analysis of 20 complete eubacterial genomes suggests that this ancient duplication has been differentially lost and maintained in eubacteria.
Systematic identification of genes with coding microsatellites mutated in DNA mismatch repair-deficient cancer cells.
Woerner, S.M., Gebert, J., Yuan, Y.P., Sutter, C., Ridder, R., Bork, P. & von Knebel Doeberitz, M.
Int J Cancer 2001 Jul 1;93(1):12-9.
Microsatellite instability (MSI) caused by deficient DNA mismatch-repair functions is a hallmark of cancers associated with the hereditary nonpolyposis colorectal cancer (HNPCC) syndrome but is also found in about 15% of all sporadic tumors. Most affected microsatellites reside in untranslated intergenic or intronic sequences. However, recently a few genes with coding microsatellites were also shown to be mutational targets in MSI-positive cancers and might represent important mutation targets in their pathogenesis. The systematic identification of such genes and the analysis of their mutation frequency in MSI-positive cancers might thus reveal major clues to their functional role in MSI-associated carcinogenesis. We therefore initiated a systematic database search in 33,595 distinctly annotated human genes and identified 17,654 potentially coding mononucleotide repeats (cMNRs) and 2,028 coding dinucleotide repeats (cDNRs), which consist of n ≥ 6 and n ≥ 4 repeat units, respectively. Expression pattern and mutation frequency of 19 of these genes with the longest repeats were compared between DNA mismatch repair-deficient (MSI(+)) and proficient (MSS) cancer cells. Instability frequencies in these coding microsatellite genes ranged from 10% to 100% in MSI-H tumor cells, whereas MSS cancer cells did not show mutations. RT-PCR analysis further showed that most of the affected genes (10/15) were highly expressed in tumor cells. The approach outlined here identified a new set of genes frequently affected by mutations in MSI-positive tumor cells. It will lead to novel and highly specific diagnostic and therapeutic targets for microsatellite unstable cancers.
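The repeat thresholds stated in this abstract (mononucleotide repeats of at least 6 units, dinucleotide repeats of at least 4 units) can be expressed as a small sequence scan. The sketch below is only an illustration under simple assumptions: the input is taken to be a coding sequence over the alphabet A, C, G, T, the function name and example sequence are hypothetical, and the regular expressions are not the authors' database search.

```python
import re

# One base repeated at least 6 times (the base itself plus 5 or more copies).
MONO = re.compile(r"([ACGT])\1{5,}")
# Two different bases forming a unit that is repeated at least 4 times
# (the first unit plus 3 or more copies); the lookahead excludes units like "AA".
DI = re.compile(r"([ACGT])(?!\1)([ACGT])(?:\1\2){3,}")


def find_repeats(cds):
    """Return (kind, unit, start, n_units) tuples for repeats found in a sequence."""
    cds = cds.upper()
    hits = []
    for m in MONO.finditer(cds):
        hits.append(("mono", m.group(1), m.start(), len(m.group(0))))
    for m in DI.finditer(cds):
        unit = m.group(1) + m.group(2)
        hits.append(("di", unit, m.start(), len(m.group(0)) // 2))
    return hits


# Hypothetical example sequence: an A7 tract followed by a (CA)4 tract.
example = "ATG" + "A" * 7 + "G" + "CA" * 4 + "TT"
for hit in find_repeats(example):
    print(hit)
```

Start positions are 0-based offsets into the sequence. Restricting the scan to annotated coding regions, as the study did, would require gene annotations that are outside the scope of this sketch.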
Frameshift peptide-derived T-cell epitopes: a source of novel tumor-specific antigens.
Linnebacher, M., Gebert, J., Rudy, W., Woerner, S., Yuan, Y.P., Bork, P. & von Knebel Doeberitz, M.
Int J Cancer 2001 Jul 1;93(1):6-11.
Microsatellite instability (MSI) caused by defective DNA mismatch repair (MMR) is a hallmark of hereditary nonpolyposis colorectal cancers (HNPCC) but also occurs in about 15% of sporadic tumors. If instability affects microsatellites in coding regions, translational frameshifts lead to truncated proteins often marked by unique frameshift peptide sequences at their C-terminus. Since MSI tumors show enhanced lymphocytic infiltration and our previous analysis identified numerous coding mono- and dinucleotide repeat-bearing candidate genes as targets of genetic instability, we examined the role of frameshift peptides in triggering cellular immune responses. Using peptide pulsed autologous CD40-activated B cells, we have generated cytotoxic T lymphocytes (CTL) that specifically recognize HLA-A2.1-restricted peptides derived from frameshift sequences. Among 16 frameshift peptides predicted from mutations in 8 different genes, 3 peptides conferred specific lysis of target cells exogenously loaded with cognate peptide. One peptide derived from a (-1) frameshift mutation in the TGFbetaIIR gene gave rise to a CTL bulk culture capable of lysing the MSI colorectal cancer cell line HCT116 carrying this frameshift mutation. Given the huge number of human coding microsatellites and assuming only a fraction being mutated and encoding immunologically relevant peptides in MSI tumors, frameshift protein sequences represent a novel subclass of tumor-specific antigens. It is tempting to speculate that a frameshift peptide-directed vaccination approach not only could offer new treatment modalities for existing MSI tumors but also might benefit asymptomatic at-risk individuals in HNPCC families by a prophylactic vaccination strategy.
Inversions and the dynamics of eukaryotic gene order.
Huynen, M.A., Snel, B. & Bork, P.
Trends Genet 2001 Jun;17(6):304-6.
Comparisons of the gene order in closely related genomes reveal a major role for inversions in the genome shuffling process. In contrast to prokaryotes, where the inversions are predominantly large, half of the inversions between Saccharomyces cerevisiae and Candida albicans appear to be small, often encompassing only a single gene. Overall the genome rearrangement rate appears higher in eukaryotes than in prokaryotes, and the current genome data do not indicate that functional constraints on the co-expression of neighboring genes have a large role in conserving eukaryotic gene order. Nevertheless, qualitatively interesting examples of conservation of gene order in eukaryotes can be observed.
Comparison of ARM and HEAT protein repeats.
Andrade, M.A., Petosa, C., O'Donoghue, S.I., Muller, C.W. & Bork, P.
J Mol Biol 2001 May 25;309(1):1-18.
ARM and HEAT motifs are tandemly repeated sequences of approximately 50 amino acid residues that occur in a wide variety of eukaryotic proteins. An exhaustive search of sequence databases detected new family members and revealed that at least 1 in 500 eukaryotic protein sequences contain such repeats. It also rendered the similarity between ARM and HEAT repeats, believed to be evolutionarily related, readily apparent. All the proteins identified in the database searches could be clustered by sequence similarity into four groups: canonical ARM-repeat proteins and three groups of the more divergent HEAT-repeat proteins. This allowed us to build improved sequence profiles for the automatic detection of repeat motifs. Inspection of these profiles indicated that the individual repeat motifs of all four classes share a common set of seven highly conserved hydrophobic residues, which in proteins of known three-dimensional structure are buried within or between repeats. However, the motifs differ at several specific residue positions, suggesting important structural or functional differences among the classes. Our results illustrate that ARM and HEAT-repeat proteins, while having a common phylogenetic origin, have since diverged significantly. We discuss evolutionary scenarios that could account for the great diversity of repeats observed.
Charting the proteomes of organisms with unsequenced genomes by MALDI-quadrupole time-of-flight mass spectrometry and BLAST homology searching.
Shevchenko, A., Sunyaev, S., Loboda, A., Bork, P., Ens, W. & Standing, K.G.
Anal Chem 2001 May 1;73(9):1917-26.
MALDI-quadrupole time-of-flight mass spectrometry was applied to identify proteins from organisms whose genomes are still unknown. The identification was carried out by successively searching a sequence database, first with a peptide mass fingerprint, then with a packet of noninterpreted MS/MS spectra, and finally with peptide sequences obtained by automated interpretation of the MS/MS spectra. An "MS BLAST" homology searching protocol was developed to overcome specific limitations imposed by mass spectrometric data, such as the limited accuracy of de novo sequence predictions. This approach was tested in a small-scale proteomic project involving the identification of 15 bands of gel-separated proteins from the methylotrophic yeast Pichia pastoris, whose genome has not yet been sequenced and which is only distantly related to other fungi.
Prediction of deleterious human alleles.
Sunyaev, S., Ramensky, V., Koch, I., Lathe W, 3rd, Kondrashov, A.S. & Bork, P.
Hum Mol Genet 2001 Mar 15;10(6):591-7.
Single nucleotide polymorphisms (SNPs) constitute the bulk of human genetic variation, occurring with an average density of approximately 1/1000 nucleotides of a genotype. SNPs are either neutral allelic variants or are under selection of various strengths, and the impact of SNPs on fitness remains unknown. Identification of SNPs affecting human phenotype, especially leading to risks of complex disorders, is one of the key problems of medical genetics. SNPs in protein-coding regions that cause amino acid variants (non-synonymous cSNPs) are most likely to affect phenotypes. We have developed a straightforward and reliable method based on physical and comparative considerations that estimates the impact of an amino acid replacement on the three-dimensional structure and function of the protein. We estimate that approximately 20% of common human non-synonymous SNPs damage the protein. The average minor allele frequency of such SNPs in our data set was two times lower than that of benign non-synonymous SNPs. The average human genotype carries approximately 10^3 damaging non-synonymous SNPs that together cause a substantial reduction in fitness.
Pheophorbide A from Solanum diflorum interferes with NF-kappa B activation.
Heinrich, M., Bork, P.M., Schmitz, M.L., Rimpler, H., Frei, B. & Sticher, O.
Planta Med 2001 Mar;67(2):156-7.
Continuing our search for biogenic NF-kappa B inhibitors we investigated Solanum diflorum, used by the Istmo Sierra Zapotec Indians of Mexico in the treatment of inflammatory skin conditions. It became obvious very early that the active substance seems to be a degradation product of chlorophyll. Pheophorbide A was identified as one of the key compounds responsible for the NF-kappa B inhibitory activity. The compound interferes with NF-kappa B activation, was cytotoxic if exposed to light, but devoid of any cytotoxic activity in the dark.
DDT -- a novel domain in different transcription and chromosome remodeling factors.
Doerks, T., Copley, R. & Bork, P.
Trends Biochem Sci 2001 Mar;26(3):145-6.
Homology-based sequence analyses have revealed the presence of a novel domain (DDT) in bromodomain PHD finger transcription factors (BPTFs), chromatin remodeling factors of the BAZ-family and other putative nuclear proteins. This domain is characterized by a number of conserved aromatic and charged residues and is predicted to consist of three alpha helices. Recent studies indicate a likely DNA-binding function for the DDT domain.
Bork, P. & Copley, R.
Nature 2001 Feb 15;409(6822):815.
The draft sequences. Filling in the gaps.
Bork, P. & Copley, R.
Nature 2001 Feb 15;409(6822):818-20.
Molecular characterization of a cDNA encoding functional human CLK4 kinase and localization to chromosome 5q35 [correction of 4q35].
Schultz, J., Jones, T., Bork, P., Sheer, D., Blencke, S., Steyrer, S., Wellbrock, U., Bevec, D., Ullrich, A. & Wallasch, C.
Genomics 2001 Feb 1;71(3):368-70.
Phosphorylated serine- and arginine-rich (SR) proteins play an important role in the formation of spliceosomes, possibly controlling the regulation of alternative splicing. Enzymes that phosphorylate the SR proteins belong to the family of CDC2/CDC28-like kinases (CLK). Employing nucleotide sequence comparison of human expressed sequence tag sequences to the murine counterpart, we identified, cloned, and recombinantly expressed the human orthologue to the murine CLK4 cDNA. When fused to glutathione S-transferase, the catalytically active human CLK4 is able to autophosphorylate and to phosphorylate myelin basic protein, but not histone H2B as a substrate. Inspection of mRNA accumulation demonstrated gene expression in all human tissues, with the most prominent abundance in liver, kidney, brain, and heart. Using fluorescence in situ hybridization, the human CLK4 cDNA was localized to band q35 on chromosome 5 [corrected].
Integration of genome data and protein structures: prediction of protein folds, protein interactions and "molecular phenotypes" of single nucleotide polymorphisms.
Sunyaev, S., Lathe W, 3rd & Bork, P.
Curr Opin Struct Biol 2001 Feb;11(1):125-30.
With the massive amount of sequence and structural data being produced, new avenues emerge for exploiting the information therein for applications in several fields. Fold distributions can be mapped onto entire genomes to learn about the nature of the protein universe and many of the interactions between proteins can now be predicted solely on the basis of the genomic context of their genes. Furthermore, by utilising the new incoming data on single nucleotide polymorphisms by mapping them onto three-dimensional structures of proteins, problems concerning population, medical and evolutionary genetics can be addressed.
The black-pearl gene of Drosophila defines a novel conserved protein family and is required for larval growth and survival.
Becker, S., Gehrsitz, A., Bork, P., Buchner, S. & Buchner, E.
Gene 2001 Jan 10;262(1-2):15-22.
Using a transposon insertion line of the Drosophila Genome Project we have cloned the black-pearl gene (blp), analyzed cDNA clones, generated various mutants, and characterized their phenotypes. The blp gene codes for a protein of 15.7 kDa calculated molecular weight that has been conserved from yeast to plants and mammals with high homology. A domain of these new proteins shows distant similarity to DnaJ domains indicating a functionally relevant interaction with other proteins. The P element insertion in line P1539 lies within the 5' untranslated leader of the black-pearl gene. Flies homozygous for this insertion are semi-lethal, escapers produce very few offspring and show melanotic inclusions in the hemocoel ('black pearls') similar to various melanotic 'tumor' mutants. Two small deletions confined to the blp gene and two EMS-induced mutations are homozygous lethal. These null mutants appear normal up to a prolonged first instar larval stage but fail to grow and die. Thus in Drosophila the blp gene is specifically required for larval growth. The evolutionary conservation in both unicellular and multicellular organisms suggests for the new protein family described here a fundamental role in cell growth.
Evolution of prokaryotic gene order: genome rearrangements in closely related species.
Suyama, M. & Bork, P.
Trends Genet 2001 Jan 1;17(1):10-13.
Conservation of gene order in prokaryotes has become important in predicting protein function because, over the evolutionary timescale, genomes are shuffled so that local gene-order conservation reflects the functional constraints within the protein. Here, we compare closely related genomes to identify the rate with which gene order is disrupted and to infer the genes involved in the genome rearrangement.
Post-translational GPI lipid anchor modification of proteins in kingdoms of life: analysis of protein sequence data from complete genomes.
Eisenhaber, B., Bork, P. & Eisenhaber, F.
Protein Eng 2001 Jan;14(1):17-25.
To investigate the occurrence of glycosylphosphatidylinositol (GPI) lipid anchor modification in various taxonomic ranges, potential substrate proteins have been searched for in completely sequenced genomes. We applied the big-pi predictor for the recognition of propeptide cleavage and anchor attachment sites with a new, generalized analytical form of the extreme-value distribution for evaluating false-positive prediction rates. (i) We find that GPI modification is present among lower and higher Eukaryota (approximately 0.5% of all proteins) but it seems absent in all eubacterial and three archaeobacterial species studied. Four other archaean genomes appear to encode such a fraction of substrate proteins (in the range of eukaryotes) that they cannot be explained as false-positive predictions. This result supports the possible existence of GPI anchor modification in an archaean subgroup. (ii) The frequency of GPI-modified proteins on various chromosomes of a given eukaryotic species is different. (iii) Lists of potentially GPI-modified proteins in complete genomes with their predicted cleavage sites are available at http://mendel.imp.univie.ac.at/gpi/gpi_genomes.html. (iv) Orthologues of known transamidase subunits have been found only for Eukarya. Inconsistencies in domain structure among homologues, some of which may indicate sequencing errors, are described. We present a refined model of the transamidase complex.
Quod erat demonstrandum? The mystery of experimental validation of apparently erroneous computational analyses of protein sequences.
Iyer, L.M., Aravind, L., Bork, P., Hofmann, K., Mushegian, A.R., Zhulin, I.B. & Koonin, E.V.
Genome Biol 2001;2(12):RESEARCH0051.
BACKGROUND: Computational predictions are critical for directing the experimental study of protein functions. Therefore it is paradoxical when an apparently erroneous computational prediction seems to be supported by experiment. RESULTS: We analyzed six cases where application of novel or conventional computational methods for protein sequence and structure analysis led to non-trivial predictions that were subsequently supported by direct experiments. We show that, on all six occasions, the original prediction was unjustified, and in at least three cases, an alternative, well-supported computational prediction, incompatible with the original one, could be derived. The most unusual cases involved the identification of an archaeal cysteinyl-tRNA synthetase, a dihydropteroate synthase and a thymidylate synthase, for which experimental verifications of apparently erroneous computational predictions were reported. Using sequence-profile analysis, multiple alignment and secondary-structure prediction, we have identified the unique archaeal 'cysteinyl-tRNA synthetase' as a homolog of extracellular polygalactosaminidases, and the 'dihydropteroate synthase' as a member of the beta-lactamase-like superfamily of metal-dependent hydrolases. CONCLUSIONS: In each of the analyzed cases, the original computational predictions could be refuted and, in some instances, alternative strongly supported predictions were obtained. The nature of the experimental evidence that appears to support these predictions remains an open question. Some of these experiments might signify discovery of extremely unusual forms of the respective enzymes, whereas the results of others could be due to artifacts.
TAP (NXF1) belongs to a multigene family of putative RNA export factors with a conserved modular architecture.
Herold, A., Suyama, M., Rodrigues, J.P., Braun, I.C., Kutay, U., Carmo-Fonseca, M., Bork, P. & Izaurralde, E.
Mol Cell Biol 2000 Dec;20(23):8996-9008.
Vertebrate TAP (also called NXF1) and its yeast orthologue, Mex67p, have been implicated in the export of mRNAs from the nucleus. The TAP protein includes a noncanonical RNP-type RNA binding domain, four leucine-rich repeats, an NTF2-like domain that allows heterodimerization with p15 (also called NXT1), and a ubiquitin-associated domain that mediates the interaction with nucleoporins. Here we show that TAP belongs to an evolutionarily conserved family of proteins that has more than one member in higher eukaryotes. Not only the overall domain organization but also residues important for p15 and nucleoporin interaction are conserved in most family members. We characterize two of four human TAP homologues and show that one of them, NXF2, binds RNA, localizes to the nuclear envelope, and exhibits RNA export activity. NXF3, which does not bind RNA or localize to the nuclear rim, has no RNA export activity. Database searches revealed that although only one p15 (nxt) gene is present in the Drosophila melanogaster and Caenorhabditis elegans genomes, there is at least one additional p15 homologue (p15-2 [also called NXT2]) encoded by the human genome. Both human p15 homologues bind TAP, NXF2, and NXF3. Together, our results indicate that the TAP-p15 mRNA export pathway has diversified in higher eukaryotes compared to yeast, perhaps reflecting a greater substrate complexity.
Homology among (βα)8 barrels: implications for the evolution of metabolic pathways.
Copley, R.R. & Bork, P.
J Mol Biol 2000 Nov 3;303(4):627-41.
We provide statistically reliable sequence evidence indicating that at least 12 of 23 SCOP (βα)8 (TIM) barrel superfamilies share a common origin. This includes all but one of the known and predicted TIM barrels found in central metabolism. The statistical evidence is complemented by an examination of the details of protein structure, with certain structural locations favouring catalytic residues even though the nature of their molecular function may change. The combined analysis of sequence, structure and function also enables us to propose a phylogeny of TIM barrels. Based on these data, we are able to examine differing theories of pathway and enzyme evolution, by mapping known TIM barrel folds to the pathways of central metabolism. The results favour widespread recruitment of enzymes between pathways, rather than a "backwards evolution" model, and support the idea that modern proteins may have arisen from common ancestors that bound key metabolites.
Functional genomic analysis of cell division in C. elegans using RNAi of genes on chromosome III.
Gonczy, P., Echeverri, G., Oegema, K., Coulson, A., Jones, S.J., Copley, R.R., Duperon, J., Oegema, J., Brehm, M., Cassin, E., Hannak, E., Kirkham, M., Pichler, S., Flohrs, K., Goessen, A., Leidel, S., Alleaume, A.M., Martin, C., Ozlu, N., Bork, P. & Hyman, A.A.
Nature 2000 Nov 16;408(6810):331-6.
Genome sequencing projects generate a wealth of information; however, the ultimate goal of such projects is to accelerate the identification of the biological function of genes. This creates a need for comprehensive studies to fill the gap between sequence and function. Here we report the results of a functional genomic screen to identify genes required for cell division in Caenorhabditis elegans. We inhibited the expression of approximately 96% of the approximately 2,300 predicted open reading frames on chromosome III using RNA-mediated interference (RNAi). By using an in vivo time-lapse differential interference contrast microscopy assay, we identified 133 genes (approximately 6%) necessary for distinct cellular processes in early embryos. Our results indicate that these genes represent most of the genes on chromosome III that are required for proper cell division in C. elegans embryos. The complete data set, including sample time-lapse recordings, has been deposited in an open access database. We found that approximately 47% of the genes associated with a differential interference contrast phenotype have clear orthologues in other eukaryotes, indicating that this screen provides putative gene functions for other species as well.
Gene context conservation of a higher order than operons.
Lathe, W.C., Snel, B. & Bork, P.
Trends Biochem Sci 2000 Oct;25(10):474-9.
Operons, co-transcribed and co-regulated contiguous sets of genes, are poorly conserved over short periods of evolutionary time. The gene order, gene content and regulatory mechanisms of operons can be very different, even in closely related species. Here, we present several lines of evidence which suggest that, although an operon and its individual genes and regulatory structures are rearranged when comparing the genomes of different species, this rearrangement is a conservative process. Genomic rearrangements invariably maintain individual genes in very specific functional and regulatory contexts. We call this conserved context an uber-operon.
GRAM, a novel domain in glucosyltransferases, myotubularins and other putative membrane-associated proteins.
Doerks, T., Strauss, M., Brendel, M. & Bork, P.
Trends Biochem Sci 2000 Oct;25(10):483-5.
STRING: a web-server to retrieve and display the repeatedly occurring neighbourhood of a gene.
Snel, B., Lehmann, G., Bork, P. & Huynen, M.A.
Nucleic Acids Res 2000 Sep 15;28(18):3442-4.
The repeated occurrence of genes in each other's neighbourhood on genomes has been shown to indicate a functional association between the proteins they encode. Here we introduce STRING (search tool for recurring instances of neighbouring genes), a tool to retrieve and display the genes a query gene repeatedly occurs with in clusters on the genome. The tool performs iterative searches and visualises the results in their genomic context. By finding the genomically associated genes for a query, it delineates a set of potentially functionally associated genes. The usefulness of STRING is illustrated with an example that suggests a functional context for an RNA methylase with unknown specificity.
Re-annotating the Mycoplasma pneumoniae genome sequence: adding value, function and reading frames.
Dandekar, T., Huynen, M., Regula, J.T., Ueberle, B., Zimmermann, C.U., Andrade, M.A., Doerks, T., Sanchez-Pulido, L., Snel, B., Suyama, M., Yuan, Y.P., Herrmann, R. & Bork, P.
Nucleic Acids Res 2000 Sep 1;28(17):3278-88.
Four years after the original sequence submission, we have re-annotated the genome of Mycoplasma pneumoniae to incorporate novel data. The total number of ORFs has been increased from 677 to 688 (10 new proteins were predicted in intergenic regions, two further were newly identified by mass spectrometry and one protein ORF was dismissed) and the number of RNAs from 39 to 42 genes. For 19 of the now 35 tRNAs and for six other functional RNAs the exact genome positions were re-annotated and two new tRNA(Leu) and a small 200 nt RNA were identified. Sixteen protein reading frames were extended and eight shortened. For each ORF a consistent annotation vocabulary has been introduced. Annotation reasoning, annotation categories and comparisons to other published data on M.pneumoniae functional assignments are given. Experimental evidence includes 2-dimensional gel electrophoresis in combination with mass spectrometry as well as gene expression data from this study. Compared to the original annotation, we increased the number of proteins with predicted functional features from 349 to 458. The increase includes 36 new predictions and 73 protein assignments confirmed by the published literature. Furthermore, there are 23 reductions and 30 additions with respect to the previous annotation. mRNA expression data support transcription of 184 of the functionally unassigned reading frames.
EST analysis online: WWW tools for detection of SNPs and alternative splice forms.
Brett, D., Lehmann, G., Hanke, J., Gross, S., Reich, J. & Bork, P.
Trends Genet 2000 Sep;16(9):416-8.
SNP frequencies in human genes: an excess of rare alleles and differing modes of selection.
Sunyaev, S.R., Lathe WC, 3rd, Ramensky, V.E. & Bork, P.
Trends Genet. 2000 Aug;16(8):335-7.
Predicting protein function by genomic context: quantitative evaluation and qualitative inferences.
Huynen, M., Snel, B., Lathe, W. & Bork, P.
Genome Res 2000 Aug;10(8):1204-10.
Various new methods have been proposed to predict functional interactions between proteins based on the genomic context of their genes. The types of genomic context that they use are Type I: the fusion of genes; Type II: the conservation of gene-order or co-occurrence of genes in potential operons; and Type III: the co-occurrence of genes across genomes (phylogenetic profiles). Here we compare these types for their coverage, their correlations with various types of functional interaction, and their overlap with homology-based function assignment. We apply the methods to Mycoplasma genitalium, the standard benchmarking genome in computational and experimental genomics. Quantitatively, conservation of gene order is the technique with the highest coverage, applying to 37% of the genes. By combining gene order conservation with gene fusion (6%), the co-occurrence of genes in operons in absence of gene order conservation (8%), and the co-occurrence of genes across genomes (11%), significant context information can be obtained for 50% of the genes (the categories overlap). Qualitatively, we observe that the functional interactions between genes are stronger as the requirements for physical neighborhood on the genome are more stringent, while the fraction of potential false positives decreases. Moreover, only in cases in which gene order is conserved in a substantial fraction of the genomes, in this case six out of twenty-five, does a single type of functional interaction (physical interaction) clearly dominate (>80%). In other cases, complementary function information from homology searches, which is available for most of the genes with significant genomic context, is essential to predict the type of interaction. Using a combination of genomic context and homology searches, new functional features can be predicted for 10% of M. genitalium genes.
Prediction of structural domains of TAP reveals details of its interaction with p15 and nucleoporins.
Suyama, M., Doerks, T., Braun, I.C., Sattler, M., Izaurralde, E. & Bork, P.
EMBO Rep. 2000 Jul;1(1):53-8.
Vertebrate TAP is a nuclear mRNA export factor homologous to yeast Mex67p. The middle domain of TAP binds directly to p15, a protein related to the nuclear transport factor 2 (NTF2), whereas its C-terminal domain interacts with various nucleoporins, the components of the nuclear pore complex (NPC). Here, we report that the middle domain of TAP is also similar to NTF2, as well as to regions in Ras-GAP SH3 domain binding protein (G3BP) and some plant protein kinases. Based on the known three-dimensional structure of NTF2 homodimer, a heterodimerization model of TAP and p15 could be inferred. This model was confirmed by site-directed mutagenesis of residues located at the dimer interface. Furthermore, the C-terminus of TAP was found to contain a ubiquitin-associated (UBA) domain. By site-directed mutagenesis we show that a conserved loop in this domain plays an essential role in mediating TAP-nucleoporin interaction.
NAIL-Network Analysis Interface for Linking HMMER results.
Sanchez-Pulido, L., Yuan, Y.P., Andrade, M.A. & Bork, P.
Bioinformatics 2000 Jul;16(7):656-7.
SUMMARY: Network Analysis Interface for Linking HMMER results (NAIL) is a web-based tool for the analysis of results from a HMMER protein database-search. NAIL facilitates the selection of protein hits and the creation of an alignment, which can be used for a new sequence similarity search.
Automated annotation of GPI anchor sites: case study C. elegans.
Eisenhaber, B., Bork, P., Yuan, Y., Loffler, G. & Eisenhaber, F.
Trends Biochem Sci 2000 Jul;25(7):340-1.
L27, a novel heterodimerization domain in receptor targeting proteins Lin-2 and Lin-7.
Doerks, T., Bork, P., Kamberov, E., Makarova, O., Muecke, S. & Margolis, B.
Trends Biochem Sci 2000 Jul;25(7):317-8.
Anopheles gambiae pilot gene discovery project: identification of mosquito innate immunity genes from expressed sequence tags generated from immune-competent cell lines.
Dimopoulos, G., Casavant, T.L., Chang, S., Scheetz, T., Roberts, C., Donohue, M., Schultz, J., Benes, V., Bork, P., Ansorge, W., Soares, M.B. & Kafatos, F.C.
Proc Natl Acad Sci U S A 2000 Jun 6;97(12):6619-24.
Together with AIDS and tuberculosis, malaria is at the top of the list of devastating infectious diseases. However, molecular genetic studies of its major vector, Anopheles gambiae, are still quite limited. We have conducted a pilot gene discovery project to accelerate progress in the molecular analysis of vector biology, with emphasis on the mosquito's antimalarial immune defense. A total of 5,925 expressed sequence tags were determined from normalized cDNA libraries derived from immune-responsive hemocyte-like cell lines. The 3,242 expressed sequence tag-containing cDNA clones were grouped into 2,380 clone clusters, potentially representing unique genes. Of these, 1,118 showed similarities to known genes from other organisms, but only 27 were identical to previously known mosquito genes. We identified 38 candidate genes, based on sequence similarity, that may be implicated in immune reactions including antimalarial defense; 19 of these were shown experimentally to be inducible by bacterial challenge, lending support to their proposed involvement in mosquito immunity.
Automated extraction of information in molecular biology.
Andrade, M.A. & Bork, P.
FEBS Lett 2000 Jun 30;476(1-2):12-7.
We review data mining techniques in molecular biology, specifically those that extract information from the scientific literature itself. As more of the biological literature is published electronically, there is an opportunity, and even a need, to automatically summarize the literature in a customized way, for example by associating keywords to a topic. These keywords can be extracted from relevant publications. The process of keyword extraction can be automated and optimized to keep literature pointers automatically up-to-date or to filter relevant information from the literature. To illustrate these points, OMIM (Online Mendelian Inheritance in Man), a database of human inherited diseases, was linked to the literature and keywords were derived that covered distinct aspects such as genetic information on the one hand and disease-specific protein and phenotypic information on the other. They were used to extract information that is helpful for keeping entries about disease up-to-date.
More than 1,000 putative new human signalling proteins revealed by EST data mining.
Schultz, J., Doerks, T., Ponting, C.P., Copley, R.R. & Bork, P.
Nat Genet 2000 Jun;25(2):201-4.
Cloning procedures aided by homology searches of EST databases have accelerated the pace of discovery of new genes, but EST database searching remains an involved and onerous task. More than 1.6 million human EST sequences have been deposited in public databases, making it difficult to identify ESTs that represent new genes. Compounding the problems of scale are difficulties in detection associated with a high sequencing error rate and low sequence similarity between distant homologues. We have developed a new method, coupling BLAST-based searches with a domain identification protocol, that filters candidate homologues. Application of this method in a large-scale analysis of 100 signalling domain families has led to the identification of ESTs representing more than 1,000 novel human signalling genes. The 4,206 publicly available ESTs representing these genes are a valuable resource for rapid cloning of novel human signalling proteins. For example, we were able to identify ESTs of at least 106 new small GTPases, of which 6 are likely to belong to new subfamilies. In some cases, further analyses of genomic DNA led to the discovery of previously unidentified full-length protein sequences. This is exemplified by the in silico cloning (prediction of a gene product sequence using only genomic and EST sequence data) of a new type of GTPase with two catalytic domains.
Exploitation of gene context.
Huynen, M., Snel, B., Lathe, W. & Bork, P.
Curr Opin Struct Biol 2000 Jun;10(3):366-70.
Recently, a number of techniques have been proposed that use completely sequenced genomes for the function prediction of individual proteins encoded therein. They use the fusion of genes, their conserved location in operons or merely their co-occurrence in genomes to predict the existence of functional interactions between the proteins they encode. This type of information complements functional features that are predicted by classical homology-based search techniques.
EST comparison indicates 38% of human mRNAs contain possible alternative splice forms.
Brett, D., Hanke, J., Lehmann, G., Haase, S., Delbruck, S., Krueger, S., Reich, J. & Bork, P.
FEBS Lett. 2000 May 26;474(1):83-6.
Expressed sequence tag (EST) databases represent a large volume of information on expressed genes including tissue type, expression profile and exon structure. In this study we create an extensive data set of human alternative splicing. We report the analysis of 7867 non-redundant mRNAs, 3011 of which contained alternative splice forms (38% of all mRNAs analysed). From a total of 12572 ESTs 4560 different possible alternative splice forms were detected. Interestingly, 70% of the alternative splice forms correspond to exon deletion events with only 30% exonic insertions. We experimentally verified 19 different splice forms from 16 genes in a total subset of 20 studied; all of the respective genes are of medical relevance.
Homology-based method for identification of protein repeats using statistical significance estimates.
Andrade, M.A., Ponting, C.P., Gibson, T.J. & Bork, P.
J Mol Biol 2000 May 5;298(3):521-37
Short protein repeats, frequently with a length between 20 and 40 residues, represent a significant fraction of known proteins. Many repeats appear to possess high amino acid substitution rates and thus recognition of repeat homologues is highly problematic. Even if the presence of a certain repeat family is known, the exact locations and the number of repetitive units often cannot be determined using current methods. We have devised an iterative algorithm based on optimal and sub-optimal score distributions from profile analysis that estimates the significance of all repeats that are detected in a single sequence. This procedure allows the identification of homologues at alignment scores lower than the highest optimal alignment score for non-homologous sequences. The method has been used to investigate the occurrence of eleven families of repeats in Saccharomyces cerevisiae, Caenorhabditis elegans and Homo sapiens accounting for 1055, 2205 and 2320 repeats, respectively. For these examples, the method is both more sensitive and more selective than conventional homology search procedures. The method allowed the detection in the SwissProt database of more than 2000 previously unrecognised repeats belonging to the 11 families. In addition, the method was used to merge several repeat families that previously were supposed to be distinct, indicating common phylogenetic origins for these families. Copyright 2000 Academic Press.
Towards a structural basis of human non-synonymous single nucleotide polymorphisms.
Sunyaev, S., Ramensky, V. & Bork, P.
Trends Genet 2000 May;16(5):198-200.
REF, an evolutionary conserved family of hnRNP-like proteins, interacts with TAP/Mex67p and participates in mRNA nuclear export.
Stutz, F., Bachi, A., Doerks, T., Braun, I.C., Seraphin, B., Wilm, M., Bork, P. & Izaurralde, E.
RNA 2000 Apr;6(4):638-50.
Vertebrate TAP and its yeast ortholog Mex67p are involved in the export of messenger RNAs from the nucleus. TAP has also been implicated in the export of simian type D viral RNAs bearing the constitutive transport element (CTE). Although TAP directly interacts with CTE-bearing RNAs, the mode of interaction of TAP/Mex67p with cellular mRNAs is different from that with the CTE RNA and is likely to be mediated by protein-protein interactions. Here we show that Mex67p directly interacts with Yra1p, an essential yeast hnRNP-like protein. This interaction is evolutionarily conserved as Yra1p also interacts with TAP. Conditional expression in yeast cells implicates Yra1p in the export of cellular mRNAs. Database searches revealed that Yra1p belongs to an evolutionarily conserved family of hnRNP-like proteins having more than one member in Mus musculus, Xenopus laevis, Caenorhabditis elegans, and Schizosaccharomyces pombe and at least one member in several species including plants. The murine members of the family directly interact with TAP. Because members of this protein family are characterized by the presence of one RNP-motif RNA-binding domain and exhibit RNA-binding activity, we called these proteins REF-bps for RNA and export factor binding proteins. Thus, Yra1p and members of the REF family of hnRNP-like proteins may facilitate the interaction of TAP/Mex67p with cellular mRNAs.
Powers and pitfalls in sequence analysis: the 70% hurdle.
Genome Res 2000 Apr;10(4):398-400.
The p150-Spir protein provides a link between c-Jun N-terminal kinase function and actin reorganization.
Otto, I.M., Raabe, T., Rennefahrt, U.E., Bork, P., Rapp, U.R. & Kerkhoff, E.
Curr Biol 2000 Mar 23;10(6):345-8.
The Jun N-terminal kinase (JNK) is a downstream effector of Rac and Cdc42 GTPases involved in actin reorganization [1-3]. A role of the Drosophila JNK homologue, Basket (DJNK/Bsk), in the regulation of cell shape changes and actin reorganization arises from its function in the process of dorsal closure [4-6]. One potential mechanism for induction of cytoskeletal changes by JNK is via transcriptional activation of the decapentaplegic gene (dpp, a member of the TGFbeta superfamily). A direct link between JNK signalling and actin organization has not yet been found, however. We have identified a novel DJNK-interacting protein, p150-Spir, that belongs to the Wiskott-Aldrich syndrome protein (WASP) homology domain 2 (WH2) family of proteins involved in actin reorganization. It is a multidomain protein with a cluster of four WH2 domains, a modified FYVE zinc-finger motif, and a DEJL motif, a docking site for JNK, at its carboxy-terminal end. In mouse fibroblasts, p150-Spir colocalized with F-actin and its overexpression induced clustering of filamentous actin around the nucleus. When coexpressed with p150-Spir in NIH 3T3 cells, JNK translocated to and colocalized with p150-Spir at discrete spots around the nucleus. Carboxy-terminal sequences of p150-Spir were phosphorylated by JNK both in vitro and in vivo. We conclude that p150-Spir is a downstream target of JNK function and provides a direct link between JNK and actin organization.
Discovery, scoring and utilization of human single nucleotide polymorphisms: a multidisciplinary problem.
Isaksson, A., Landegren, U., Syvanen, A.C., Bork, P., Stein, C., Ortigao, F. & Brookes, A.J.
Eur J Hum Genet 2000 Feb;8(2):154-6.
There are great hopes that the most common form of human genetic variation, single nucleotide polymorphisms (SNPs), can be used to improve radically biological understanding and to advance medicine. However, considerable controversy exists over just how SNPs can be applied to gain these insights. The second international SNP meeting, held at Schloss Hohenkammer, Munich, Germany, brought together leading international scientists from academia and industry to look at these issues from a multidisciplinary perspective. Topics that were covered spanned SNP discovery, scoring technologies, population genetics, disease studies, commercial dimensions, pharmacogenomics, bioinformatics, and legal considerations. SNP discovery is picking up speed; The SNP Consortium (TSC) is set to produce 300,000 publicly available SNPs within 2 years. Improved technologies for scoring SNPs are reducing hands-on time and cost, although truly high-throughput methods are still lacking for genome-wide population-based studies. Large numbers of SNPs have already been analysed in diverse populations. The results emphasise the importance of considering population history when using SNPs to search for genetic risk factors. Opinions on the feasibility of extensive SNP-based analysis of complex disease vary. However, combining expertise from several fields will be key to achieving optimal utilization of SNPs.
SMART: a web-based tool for the study of genetically mobile domains.
Schultz, J., Copley, R.R., Doerks, T., Ponting, C.P. & Bork, P.
Nucleic Acids Res 2000 Jan 1;28(1):231-4.
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures (http://SMART.embl-heidelberg.de). More than 400 domain families found in signalling, extra-cellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database as well as search parameters and taxonomic information are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.
HGBASE: a database of SNPs and other variations in and around human genes.
Brookes, A.J., Lehvaslaiho, H., Siegfried, M., Boehm, J.G., Yuan, Y.P., Sarkar, C.M., Bork, P. & Ortigao, F.
Nucleic Acids Res 2000 Jan 1;28(1):356-60
Human genome polymorphism is expected to play a key role in defining the etiologic basis of phenotypic differences between individuals in aspects such as drug responses and common disease predisposition. Relevant functional DNA changes will probably be located in or near to transcribed sequences, and include many single nucleotide polymorphisms. To aid the future analysis of such genome variation, HGBASE (Human Genic Bi-Allelic SEquences) was constructed as a means to gather human gene-linked polymorphisms from all possible public sources, and show these as a non-redundant set of records in a standardized and user-friendly database endowed with text and sequence based search facilities. After 1 year of presence on the WWW, the HGBASE project has compiled data for over 22 000 records, and this number continues to triple every 6-12 months with data harvested or submitted from all major public genome databases and published literature from the previous decade. Extensive annotation enhancement, internal consistency checking and manual review of every record is undertaken to address potential errors and deficiencies sometimes present in the original source data. The fully polished and comprehensive database is made freely available to all at http://hgbase.cgr.ki.se
Genome evolution. Gene fusion versus gene fission.
Snel, B., Bork, P. & Huynen, M.
Trends Genet 2000 Jan;16(1):9-11.
Individual variation in protein-coding sequences of human genome.
Sunyaev, S., Hanke, J., Brett, D., Aydin, A., Zastrow, I., Lathe, W., Bork, P. & Reich, J.
Adv Protein Chem 2000;54:409-37.
Evolution of domain families.
Ponting, C.P., Schultz, J., Copley, R.R., Andrade, M.A. & Bork, P.
Adv Protein Chem 2000;54:185-244.
Exploitation of gene context to infer evolution and to predict function.
Bork, P., Snel, B., Lehmann, G., Suyama, M., Dandekar, T., Lathe, W. & Huynen, M.
In "Comparative Genomics", Sankoff, D. & Nadeau, J.H. (eds.) Kluwer Acad. Publ. pp. 281-294
Analysis of amino acid sequence.
Bork, P. (ed.), Acad. Press
Advances in Protein Chemistry
Genome and proteome informatics
Bork, P. & Eisenberg, D.
Curr. Opin. Struct. Biol. 2000 10 341-342
Preface: Protein sequence analysis
Adv. Prot. Chem. 2000 54 xi-xv
The three-dimensional structure of the HRDC domain and implications for the Werner and Bloom syndrome proteins.
Liu, Z., Macias, M.J., Bottomley, M.J., Stier, G., Linge, J.P., Nilges, M., Bork, P. & Sattler, M.
Structure Fold Des 1999 Dec 15;7(12):1557-66
BACKGROUND: The HRDC (helicase and RNaseD C-terminal) domain is found at the C terminus of many RecQ helicases, including the human Werner and Bloom syndrome proteins. RecQ helicases have been shown to unwind DNA in an ATP-dependent manner. However, the specific functional roles of these proteins in DNA recombination and replication are not known. An HRDC domain exists in both of the human RecQ homologues that are implicated in human disease and may have an important role in their function. RESULTS: We have determined the three-dimensional structure of the HRDC domain in the Saccharomyces cerevisiae RecQ helicase Sgs1p by nuclear magnetic resonance (NMR) spectroscopy. The structure resembles auxiliary domains in bacterial DNA helicases and other proteins that interact with nucleic acids. We show that a positively charged region on the surface of the Sgs1p HRDC domain can interact with DNA. Structural similarities to bacterial DNA helicases suggest that the HRDC domain functions as an auxiliary domain in RecQ helicases. Homology models of the Werner and Bloom HRDC domains show different surface properties when compared with Sgs1p. CONCLUSIONS: The HRDC domain represents a structural scaffold that resembles auxiliary domains in proteins that are involved in nucleic acid metabolism. In Sgs1p, the HRDC domain could modulate the helicase function via auxiliary contacts to DNA. However, in the Werner and Bloom syndrome helicases the HRDC domain may have a role in their functional differences by mediating diverse molecular interactions.
Prediction of nonsynonymous single nucleotide polymorphisms in human disease-associated genes.
Sunyaev, S., Hanke, J., Aydin, A., Wirkner, U., Zastrow, I., Reich, J. & Bork, P.
J Mol Med 1999 Nov;77(11):754-60
Analysis of human genetic variation can shed light on the problem of the genetic basis of complex disorders. Nonsynonymous single nucleotide polymorphisms (SNPs), which affect the amino acid sequence of proteins, are believed to be the most frequent type of variation associated with the respective disease phenotype. Complete enumeration of nonsynonymous SNPs in the candidate genes will enable further association studies on panels of affected and unaffected individuals. Experimental detection of SNPs requires implementation of expensive technologies and is still far from being routine. Alternatively, SNPs can be identified by computational analysis of a publicly available expressed sequence tag (EST) database following experimental verification. We performed in silico analysis of amino acid variation for 471 proteins with a documented history of experimental variation studies and with confirmed association with human diseases. This allowed us to evaluate the level of completeness of the current knowledge of nonsynonymous SNPs in well studied, medically relevant genes and to estimate the proportion of new variants which can be added with the help of computer-aided mining in EST databases. Our results suggest that approx. 50% of frequent nonsynonymous variants are already stored in public databases. Computational methods based on the scan of an EST database can add significantly to the current knowledge, but they are greatly limited by the size of EST databases and the nonuniform coverage of genes by ESTs. Nevertheless, a considerable number of new candidate nonsynonymous SNPs in genes of medical interest were found by the EST screening procedure.
A lipid-binding domain in Wnt: a case of mistaken identity?
Barnes, M.R., Russell, R.B., Copley, R.R., Ponting, C.P., Bork, P., Cumberledge, S., Reichsman, F. & Moore, H.M.
Curr Biol 1999 Oct 7;9(19):R717-9
Pathway alignment: application to the comparative analysis of glycolytic enzymes.
Dandekar, T., Schuster, S., Snel, B., Huynen, M. & Bork, P.
Biochem J 1999 Oct 1;343 Pt 1:115-24
Comparative analysis of metabolic pathways in different genomes yields important information on their evolution, on pharmacological targets and on biotechnological applications. In this study on glycolysis, three alternative ways of comparing biochemical pathways are combined: (1) analysis and comparison of biochemical data, (2) pathway analysis based on the concept of elementary modes, and (3) a comparative genome analysis of 17 completely sequenced genomes. The analysis reveals a surprising plasticity of the glycolytic pathway. Isoenzymes in different species are identified and compared; deviations from the textbook standard are detailed. Several potential pharmacological targets and by-passes (such as the Entner-Doudoroff pathway) to glycolysis are examined and compared in the different species. Archaean, bacterial and parasite specific adaptations are identified and described.
Solution structure of the receptor tyrosine kinase EphB2 SAM domain and identification of two distinct homotypic interaction sites.
Smalla, M., Schmieder, P., Kelly, M., Ter Laak, A., Krause, G., Ball, L., Wahl, M., Bork, P. & Oschkinat, H.
Protein Sci 1999 Oct;8(10):1954-61
The sterile alpha motif (SAM) is a protein interaction domain of around 70 amino acids present predominantly in the N- and C-termini of more than 60 diverse proteins that participate in signal transduction and transcriptional repression. SAM domains have been shown to homo- and hetero-oligomerize and to mediate specific protein-protein interactions. A highly conserved subclass of SAM domains is present at the intracellular C-terminus of more than 40 Eph receptor tyrosine kinases that are involved in the control of axonal pathfinding upon ephrin-induced oligomerization and activation in the event of cell-cell contacts. These SAM domains appear to participate in downstream signaling events via interactions with cytosolic proteins. We determined the solution structure of the EphB2 receptor SAM domain and studied its association behavior. The structure consists of five helices forming a compact structure without binding pockets or exposed conserved aromatic residues. Concentration-dependent chemical shift changes of NMR signals reveal two distinct well-separated areas on the domains' surface sensitive to the formation of homotypic oligomers in solution. These findings are supported by analytical ultracentrifugation studies. The conserved Tyr932, which was reported to be essential for the interaction with SH2 domains after phosphorylation, is buried in the hydrophobic core of the structure. The weak capability of the isolated EphB2 receptor SAM domain to form oligomers is supposed to be relevant in vivo when the driving force of ligand binding induces receptor oligomerization. A formation of SAM tetramers is thought to provide an appropriate contact area for the binding of a low-molecular-weight phosphotyrosine phosphatase and to initiate further downstream responses.
Alternative splicing of human genes: more the rule than the exception?
Hanke, J., Brett, D., Zastrow, I., Aydin, A., Delbruck, S., Lehmann, G., Luft, F., Reich, J. & Bork, P.
Trends Genet 1999 Oct;15(10):389-90
Prediction of potential GPI-modification sites in proprotein sequences.
Eisenhaber, B., Bork, P. & Eisenhaber, F.
J Mol Biol 1999 Sep 24;292(3):741-58
Glycosylphosphatidylinositol (GPI) lipid anchoring is a common posttranslational modification known mainly from extracellular eukaryotic proteins. Attachment of the GPI moiety to the carboxyl terminus (omega-site) of the polypeptide follows after proteolytic cleavage of a C-terminal propeptide. For the first time, a new prediction technique locating potential GPI-modification sites in precursor sequences has been applied for large-scale protein sequence database searches. The composite prediction function (with separate parametrisation for metazoan and protozoan proteins) consists of terms evaluating both amino acid type preferences at sequence positions near a supposed omega-site as well as the concordance with general physical properties encoded in multi-residue correlation within the motif sequence. The latter terms are especially successful in rejecting non-appropriate sequences from consideration. The algorithm has been validated with a self-consistency and two jack-knife tests for the learning set of fully annotated sequences from the SWISS-PROT database as well as with a newly created database "big-Pi" (more than 300 GPI-motif mutations extracted from original literature sources). The accuracy of predicting the effect of mutations in the GPI sequence motif was above 83%. Lists of potential precursor proteins which are non-annotated in SWISS-PROT and SPTrEMBL are presented on the WWW-page http://www.embl-heidelberg.de/beisenha/gpi/gpi_prediction.html. The algorithm has been implemented in the prototype software "big-Pi predictor" which may find application as a genome annotation and target selection tool. Copyright 1999 Academic Press.
Associative database of protein sequences.
Hanke, J., Lehmann, G., Bork, P. & Reich, J.G.
Bioinformatics 1999 Sep;15(9):741-8
MOTIVATION: We present a new concept that combines data storage and data analysis in genome research, based on an associative network memory. As an illustration, 115 000 conserved regions from over 73 000 published sequences (i.e. from the entire annotated part of the SWISSPROT sequence database) were identified and clustered by a self-organizing network. Similarity and kinship, as well as degree of distance between the conserved protein segments, are visualized as neighborhood relationship on a two-dimensional topographical map. RESULTS: Such a display overcomes the restrictions of linear list processing and allows local and global sequence relationships to be studied visually. Families are memorized as prototype vectors of conserved regions. On a massive parallel machine, clustering and updating of the database take only a few seconds; a rapid analysis of incoming data such as protein sequences or ESTs is carried out on present-day workstations. AVAILABILITY: Access to the database is available at http://www.bioinf.mdc-berlin.de/unter2.html CONTACT: (hanke,lehmann,reich)@mdc-berlin.de
Domain organization of Mac-2 binding protein and its oligomerization to linear and ring-like structures.
Muller, S.A., Sasaki, T., Bork, P., Wolpensinger, B., Schulthess, T., Timpl, R., Engel, A. & Engel, J.
J Mol Biol 1999 Aug 27;291(4):801-13
The multidomain Mac-2 binding protein (M2BP) is present in serum and in the extracellular matrix in the form of linear and ring-shaped oligomers, which interact with galectin-3, fibronectin, collagens, integrins and other large glycoproteins. Domain 1 of M2BP (M2BP-1) shows homology with the cysteine-rich SRCR domain of scavenger receptor. Domains 2 and 3 are related to the dimerization domains BTB/POZ and IVR of the Drosophila kelch protein. Recombinant M2BP, its N-terminal domain M2BP-1 and a fragment consisting of putative domains 2, 3 and 4 (M2BP-2,3,4) were investigated by scanning transmission electron microscopy, transmission electron microscopy, analytical ultracentrifugation and binding assays. The ring oligomers formed by the intact protein are comprised of approximately 14 nm long segments composed of two 92 kDa M2BP monomers. Although the rings vary in size, decamers predominate. The various linear oligomers also observed are probably ring precursors; among these, dimers predominate. M2BP-1 exhibits a native fold, does not oligomerize and is inactive in cell attachment. M2BP-2,3,4 aggregates to heterogeneous, protein-filled ring-like structures as shown by metal shadowed preparations. These aggregates retain the cell-adhesive potential indicating native folding. It is hypothesized that the rings provide an interaction pattern for multivalent interactions of M2BP with target molecules or complexes of ligands. Copyright 1999 Academic Press.
A latrophilin/CL-1-like GPS domain in polycystin-1
Ponting, C.P., Hofmann, K. & Bork, P.
Curr Biol 1999 Aug 26;9(16):R585-8
Variation and evolution of the citric-acid cycle: a genomic perspective.
Huynen, M.A., Dandekar, T. & Bork, P.
This is a review article.
Trends Microbiol 1999 Jul;7(7):281-91
The presence of genes encoding enzymes involved in the citric-acid cycle has been studied in 19 completely sequenced genomes. In the majority of species, the cycle appears to be incomplete or absent. Several distinct, incomplete cycles reflect adaptations to different environments. Their distribution over the phylogenetic tree hints at precursors in the evolution of the citric-acid cycle.
Evaluation of human-readable annotation in biomolecular sequence databases with biological rule libraries.
Eisenhaber, F. & Bork, P.
Bioinformatics 1999 Jul-Aug;15(7-8):528-35
MOTIVATION: Computer-based selection of entries from sequence databases with respect to a related functional description, e.g. with respect to a common cellular localization or contributing to the same phenotypic function, is a difficult task. Automatic semantic analysis of annotations is not only hampered by incomplete functional assignments. A major problem is that annotations are written in a rich, non-formalized language and are meant for reading by a human expert. This person can extract from the text considerably more information than is immediately apparent due to his extended biological background knowledge and logical reasoning. APPROACH: A technique of automated annotation evaluation based on a combination of lexical analysis and the usage of biological rule libraries has been developed. The proposed algorithm generates new functional descriptors from the annotation of a given entry using the semantic units of the annotation as prepositions for implications executed in accordance with the rule library. RESULTS: The prototype of a software system, the Meta_A(nnotator) program, is described and the results of its application to sequence attribute assignment and sequence selection problems, such as cellular localization and sequence domain annotation of SWISS-PROT entries, are presented. The current software version assigns useful subcellular localization qualifiers to approximately 88% of all SWISS-PROT entries. As shown by demonstrative examples, the combination of sequence and annotation analysis is a powerful approach for the detection of mutual annotation/sequence inconsistencies. AVAILABILITY: Results for the cellular localization assignment can be viewed at the URL http://www.bork.embl-heidelberg.de/CELL_LOC/CELL_LOC.html.
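The rule-library idea in the abstract above, in which annotation terms act as premises of implications that yield new descriptors, can be illustrated with a small forward-chaining sketch. This is only an illustration under stated assumptions: the function name `apply_rules` and the three rules are hypothetical and are not taken from the Meta_A program or its actual rule library.

```python
def apply_rules(keywords, rules):
    """Forward-chain over (premises, conclusion) rules.

    A rule fires when all of its premises are among the known terms;
    its conclusion is then added and may enable further rules.
    Returns only the newly derived descriptors.
    """
    derived = set(keywords)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived - set(keywords)


# Hypothetical rules for illustration only (not the published rule library).
RULES = [
    (frozenset({"signal peptide"}), "secretory pathway"),
    (frozenset({"secretory pathway", "gpi-anchor"}), "cell surface"),
    (frozenset({"dna-binding", "transcription regulation"}), "nucleus"),
]

# Terms assumed to have been extracted from an entry's annotation.
new_descriptors = apply_rules({"signal peptide", "gpi-anchor"}, RULES)
print(sorted(new_descriptors))
```

In this toy run the second rule can only fire after the first has added its conclusion, which is why the loop repeats until no rule adds anything new.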
Domains in plexins: links to integrins and transcription factors.
Bork, P., Doerks, T., Springer, T.A. & Snel, B.
Trends Biochem Sci 1999 Jul;24(7):261-3
Eukaryotic signalling domain homologues in archaea and bacteria. Ancient ancestry and horizontal gene transfer.
Ponting, C.P., Aravind, L., Schultz, J., Bork, P. & Koonin, E.V.
J Mol Biol 1999 Jun 18;289(4):729-45
Phyletic distributions of eukaryotic signalling domains were studied using recently developed sensitive methods for protein sequence analysis, with an emphasis on the detection and accurate enumeration of homologues in bacteria and archaea. A major difference was found between the distributions of enzyme families that are typically found in all three divisions of cellular life and non-enzymatic domain families that are usually eukaryote-specific. Previously undetected bacterial homologues were identified for plant pathogenesis-related proteins, Pad1, von Willebrand factor type A, src homology 3 and YWTD repeat-containing domains. Comparisons of the domain distributions in eukaryotes and prokaryotes enabled distinctions to be made between the domains originating prior to the last common ancestor of all known life forms and those apparently originating as consequences of horizontal gene transfer events. A number of transfers of signalling domains from eukaryotes to bacteria were confidently identified, in contrast to only a single case of apparent transfer from eukaryotes to archaea. Copyright 1999 Academic Press.
A novel transactivation domain in parkin
Morett, E. & Bork, P.
Trends Biochem Sci 1999 Jun;24(6):229-31
Protein families in multicellular organisms.
Copley, R.R., Schultz, J., Ponting, C.P. & Bork, P.
This is a review article.
Curr Opin Struct Biol 1999 Jun;9(3):408-15
The complete sequence of the nematode worm Caenorhabditis elegans contains the genetic machinery that is required to undertake the core biological processes of single cells. However, the genome also encodes proteins that are associated with multicellularity, as well as others that are lineage-specific expansions of phylogenetically widespread families and yet more that are absent in non-nematodes. Ongoing analysis is beginning to illuminate the similarities and differences among human proteins and proteins that are encoded by the genomes of the multicellular worm and the unicellular yeast, and will be essential in determining the reliability of transferring experimental data among phylogenetically distant species.
No Sec7-homology domain in guanine-nucleotide-exchange factors that act on Ras and Rho
Ponting, C.P., Bork, P., Schultz, J. & Aravind, L.
Trends Biochem Sci 1999 May;24(5):177-8
SMART: identification and annotation of domains from signalling and extracellular protein sequences.
Ponting, C.P., Schultz, J., Milpetz, F. & Bork, P.
Nucleic Acids Res 1999 Jan 1;27(1):229-32
SMART is a simple modular architecture research tool and database that provides domain identification and annotation on the WWW (http://coot.embl-heidelberg.de/SMART). The tool compares query sequences with its databases of domain sequences and multiple alignments whilst concurrently identifying compositionally biased regions such as signal peptide, transmembrane and coiled coil segments. Annotated and unannotated regions of the sequence can be used as queries in searches of sequence databases. The SMART alignment collection represents more than 250 signalling and extracellular domains. Each alignment is curated to assign appropriate domain boundaries and to ensure its quality. In addition, each domain is annotated extensively with respect to cellular localisation, species distribution, functional class, tertiary structure and functionally important residues.
Genome phylogeny based on gene content.
Snel, B., Bork, P. & Huynen, M.A.
Nat Genet 1999 Jan;21(1):108-10
Species phylogenies derived from comparisons of single genes are rarely consistent with each other, due to horizontal gene transfer, unrecognized paralogy and highly variable rates of evolution. The advent of completely sequenced genomes allows the construction of a phylogeny that is less sensitive to such inconsistencies and more representative of whole-genomes than are single-gene trees. Here, we present a distance-based phylogeny constructed on the basis of gene content, rather than on sequence identity, of 13 completely sequenced genomes of unicellular species. The similarity between two species is defined as the number of genes that they have in common divided by their total number of genes. In this type of phylogenetic analysis, evolutionary distance can be interpreted in terms of evolutionary events such as the acquisition and loss of genes, whereas the underlying properties (the gene content) can be interpreted in terms of function. As such, it takes a position intermediate to phylogenies based on single genes and phylogenies based on phenotypic characteristics. Although our comprehensive genome phylogeny is independent of phylogenies based on the level of sequence identity of individual genes, it correlates with the standard reference of prokaryotic phylogeny based on sequence similarity of 16S rRNA. Thus, shared gene content between genomes is quantitatively determined by phylogeny, rather than by phenotype, and horizontal gene transfer has only a limited role in determining the gene content of genomes.
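The gene-content similarity described in the abstract above can be sketched in a few lines. The abstract does not spell out how "their total number of genes" is normalized, so this sketch assumes the denominator is the gene count of the smaller genome and takes distance as one minus similarity; both choices, the function names and the toy gene sets are assumptions made for illustration only.

```python
from itertools import combinations


def gene_content_similarity(genes_a, genes_b):
    """Shared genes divided by the gene count of the smaller genome.

    The choice of denominator is an assumption of this sketch.
    """
    shared = len(genes_a & genes_b)
    denom = min(len(genes_a), len(genes_b))
    return shared / denom if denom else 0.0


def distance_matrix(genomes):
    """Pairwise distances (1 - similarity) keyed by (name, name) tuples."""
    names = sorted(genomes)
    dist = {(n, n): 0.0 for n in names}
    for a, b in combinations(names, 2):
        d = 1.0 - gene_content_similarity(genomes[a], genomes[b])
        dist[(a, b)] = dist[(b, a)] = d
    return dist


# Toy genomes as sets of orthologous-gene identifiers (hypothetical data).
toy = {
    "sp1": {"g1", "g2", "g3", "g4"},
    "sp2": {"g1", "g2", "g3", "g5", "g6"},
    "sp3": {"g1", "g7"},
}
D = distance_matrix(toy)
print(D[("sp1", "sp2")])
```

A matrix of this kind could then be passed to any distance-based tree-building method; that step is not shown here.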
Lateral gene transfer, genome surveys and the phylogeny of prokaryotes.
Huynen, M.A., Snel, B. & Bork, P.
Science 1999 286 1443a
Sequence analysis, dotplot, aligning sequences, indels and gap-penalty
Gibson, T. & Bork, P.
In "Encyclopedia of Molecular Biology", T. Creighton (ed.), Wiley & Sons Inc. New York. 86-90; 765-766; 961-962; 1252; 2220-2324
Phospholipases A and Wnt-like proteins are unlikely to share common ancestry.
Copley, R., Ponting, C. & Bork, P.
Curr. Biol. 1999 9 R718-R719
Genome comparisons to monitor molecular evolution.
Bork, P., Dandekar, T., Snel, B. & Huynen, M.A.
In "Microbiology and Infection", Goebel, U.B. & Ruf, B.R. (eds.), Einhorn-presse Verlag, Reinbek, 80-92
A point of entry into genomics.
Bork, P. & Huynen, M.A.
Nature Genet. 1999;23:273.
Biospektrum 1999 3 172
Applying logic programming to derive novel functional information of genomes.
Bansal, A.K. & Bork, P.
In "Lecture notes in computer science 1551: Practical aspects of declarative languages", Gupta, G. (ed.), Springer Verlag, 275-289
Homology-based gene prediction using neural nets.
Cai, Y. & Bork, P.
Anal Biochem 1998 Dec 15;265(2):269-74.
We have developed and implemented a method for computational gene identification called GIN (gene identification using neural nets and homology information) that has been particularly designed to avoid false positive predictions. It thus predicts 55% of all genes tested correctly, has a specificity of 99%, but also has an overall accuracy of 92% on a benchmark set of 570 vertebrate genes constructed by Burset and Guigo. The method combines homology searches in protein and expressed sequence tag databases with several neural networks designed to recognize start codons, Poly(A) signals, stop codons, and splice sites. Predicted exons are assembled into genes using a homology-based scoring function. GIN is able to recognize multiple genes within genomic DNA as demonstrated by the identification of a globin gene (gamma-globin-1(G)) that has not been annotated as a coding region in the widely used test set of Burset and Guigo. Furthermore, GIN identifies more than 107 other protein hits in noncoding regions and classifies them into possible pseudogenes or splice variants.
Identical variant TSG101 transcripts in soft tissue sarcomas and various non-neoplastic tissues.
Willeke, F., Ridder, R., Bork, P., Klaes, R., Mechtersheimer, G., Schwarzbach, M., Zimmer, D., Kloor, M., Lehnert, T., Herfarth, C. & von Knebel Doeberitz, M.
Mol Carcinog 1998 Dec 23(4) 195-200
Inactivation of the TSG101 gene was recently shown to induce malignant transformation of NIH/3T3 fibroblasts. Abnormal TSG101 transcription profiles were observed in various human cancers, and large intragenic deletions of the TSG101 gene were reported for a series of human breast cancer specimens, pointing to a potential tumor-suppressive activity of TSG101. However, subsequent more detailed studies on a large panel of breast carcinoma samples did not confirm the tumor-associated genomic deletions. Here we analyzed the transcription patterns of the TSG101 gene in soft-tissue sarcomas and non-neoplastic human tissues. Forty-five of 71 soft tissue sarcoma samples (63%) displayed variant transcripts; however, identical aberrant transcripts were also detected in seven of 15 non-neoplastic control tissues. Restriction fragment length polymorphism analysis of the TSG101 gene excluded major genomic rearrangements in the soft tissue sarcoma samples. Northern blot analysis revealed a very low abundance of variant transcripts as compared with the wild-type TSG101 transcript. These data point to aberrant splicing of the TSG101 mRNA in normal and transformed human mesenchymal tissues rather than tumor-specific alterations of the TSG101 gene. In summary, this analysis does not support a pathogenic role for altered TSG101 expression in human soft tissue sarcomas.
Sequence properties of GPI-anchored proteins near the omega-site: constraints for the polypeptide binding site of the putative transamidase.
Eisenhaber, B., Bork, P. & Eisenhaber, F.
Protein Eng 1998 Dec 11(12) 1155-1161
Glycosylphosphatidylinositol (GPI) anchoring is a common post-translational modification of extracellular eukaryotic proteins. Attachment of the GPI moiety to the carboxyl terminus (omega-site) of the polypeptide occurs after proteolytic cleavage of a C-terminal propeptide. In this work, the sequence pattern for GPI-modification was analyzed in terms of physical amino acid properties based on a database analysis of annotated proprotein sequences. In addition to a refinement of previously described sequence signals, we report conserved sequence properties in the regions omega - 11...omega - 1 and omega + 4...omega + 5. We present statistical evidence for volume-compensating residue exchanges with respect to the positions omega - 1...omega + 2. Differences between protozoan and metazoan GPI-modification motifs consist mainly in variations of preferences to amino acid types at the positions near the omega-site and in the overall motif length. The variations of polypeptide substrates are exploited to suggest a model of the polypeptide binding site of the putative transamidase, the enzyme catalyzing the GPI-modification. The volume of the active site cleft accommodating the four residues omega - 1...omega + 2 appears to be approximately 540 Å³.
Predicting function: from genes to genomes and back.
Bork, P., Dandekar, T., Diaz-Lazcoz, Y., Eisenhaber, F., Huynen, M. & Yuan, Y.
This is a review article.
J Mol Biol 1998 Nov 6 283(4) 707-725
Predicting function from sequence using computational tools is a highly complicated procedure that is generally done for each gene individually. This review focuses on the added value that is provided by completely sequenced genomes in function prediction. Various levels of sequence annotation and function prediction are discussed, ranging from genomic sequence to that of complex cellular processes. Protein function is currently best described in the context of molecular interactions. In the near future it will be possible to predict protein function in the context of higher order processes such as the regulation of gene expression, metabolic pathways and signalling cascades. The analysis of such higher levels of function description uses, besides the information from completely sequenced genomes, also the additional information from proteomics and expression data. The final goal will be to elucidate the mapping between genotype and phenotype. Copyright 1998 Academic Press.
Conservation of gene order: a fingerprint of proteins that physically interact.
Dandekar, T., Snel, B., Huynen, M. & Bork, P.
Trends Biochem Sci 1998 Sep;23(9):324-8.
A systematic comparison of nine bacterial and archaeal genomes reveals a low level of gene-order (and operon architecture) conservation. Nevertheless, a number of gene pairs are conserved. The proteins encoded by conserved gene pairs appear to interact physically. This observation can therefore be used to predict functions of, and interactions between, prokaryotic gene products.
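The idea of scanning genomes for conserved gene pairs can be illustrated with a minimal sketch. It assumes each genome is already reduced to an ordered list of orthologous-group identifiers; the identifiers, the `min_genomes` threshold and the function names are hypothetical, and strand and operon boundaries are ignored. It is not the procedure used in the paper.

```python
def adjacent_pairs(gene_order):
    """Return the unordered pairs of neighbouring ortholog groups in one genome."""
    return {frozenset(p) for p in zip(gene_order, gene_order[1:]) if p[0] != p[1]}


def conserved_pairs(genomes, min_genomes=2):
    """Return neighbouring pairs that occur in at least `min_genomes` genomes."""
    counts = {}
    for order in genomes.values():
        for pair in adjacent_pairs(order):
            counts[pair] = counts.get(pair, 0) + 1
    return {pair for pair, n in counts.items() if n >= min_genomes}


# Hypothetical toy input: gene order given as ortholog-group labels.
genomes = {
    "genome_A": ["og1", "og2", "og3", "og4"],
    "genome_B": ["og3", "og2", "og1", "og5"],
    "genome_C": ["og4", "og1", "og2", "og6"],
}

for pair in sorted(sorted(p) for p in conserved_pairs(genomes, min_genomes=3)):
    print(pair)
```

Pairs that stay adjacent across many otherwise rearranged genomes would be the candidates for physically interacting gene products.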
Evolution of new protein function: recombinational enhancer Fis originated by horizontal gene transfer from the transcriptional regulator NtrC.
Morett, E. & Bork, P.
FEBS Lett 1998 Aug 14 433(1-2) 108-112
New protein function is thought to evolve mostly by gene duplication and divergence. Here we present phylogenetic evidence that the multifunctional protein Fis of the gamma proteobacterial species derived from the COOH-terminal domain of an ancestral alpha proteobacterial NtrC transcriptional regulatory protein. All of the known enterobacterial fis genes are preceded by an open reading frame, named yhdG, that is highly similar to nifR3, a gene that forms an operon with ntrC in several alpha proteobacterial species. Thus, we propose that yhdG and fis were acquired by a lineage ancestral to the gamma proteobacteria in a single horizontal gene transfer event, and later diverged to their present functions.
Homology-based fold predictions for Mycoplasma genitalium proteins.
Huynen, M., Doerks, T., Eisenhaber, F., Orengo, C., Sunyaev, S., Yuan, Y. & Bork, P.
J Mol Biol 1998 Jul 17;280(3):323-6.
Homology search techniques based on the iterative PSI-BLAST method in combination with various filters for low sequence complexity are applied to assign folds to all Mycoplasma genitalium proteins. The resulting procedure (implemented as a web server) is able to predict at least one domain in 37% of these proteins automatically, with an estimated accuracy higher than 98%. Taking structural features such as coiled coil or transmembrane regions aside, folds can be assigned to more than half of the globular proteins in a bacterium just by iterative sequence comparison.
Conformational stability studies of the pleckstrin DEP domain: definition of the domain boundaries.
Kharrat, A., Millevoi, S., Baraldi, E., Ponting, C.P., Bork, P. & Pastore, A.
Biochim Biophys Acta 1998 Jun 11;1385(1):157-64.
Pleckstrin is the major substrate of protein kinase C in platelets. It contains at its N- and C-termini two pleckstrin homology (PH) domains which have been proposed to mediate protein-protein and protein-lipid interactions. A new module, called DEP, has recently been identified by sequence analysis in the central region of pleckstrin. In order to study this module, several recombinant polypeptides corresponding to the DEP module and N- and C-termini extended forms have been expressed. Using circular dichroism (CD) and nuclear magnetic resonance (NMR) techniques, the domain boundaries have been determined that yield a soluble and folded pleckstrin DEP domain. This comprises 93 amino acids with an alpha/beta fold in agreement with secondary structure predictions. Stability studies indicate that the regions surrounding the DEP domain do not contribute to its stability suggesting that the phosphorylation sites at S113, T114 and S117 are in an unstructured region. Identification of the regions of pleckstrin that are folded shall facilitate determination of its structure and function.
Protein annotation: detective work for function prediction.
Doerks, T., Bairoch, A. & Bork, P.
This is a review article.
Trends Genet 1998 Jun 14(6) 248-250
SMART, a simple modular architecture research tool: identification of signaling domains.
Schultz, J., Milpetz, F., Bork, P. & Ponting, C.P.
Proc Natl Acad Sci U S A 1998 May 26 95(11) 5857-5864
Accurate multiple alignments of 86 domains that occur in signaling proteins have been constructed and used to provide a Web-based tool (SMART: simple modular architecture research tool) that allows rapid identification and annotation of signaling domain sequences. The majority of signaling proteins are multidomain in character with a considerable variety of domain combinations known. Comparison with established databases showed that 25% of our domain set could not be deduced from SwissProt and 41% could not be annotated by Pfam. SMART is able to determine the modular architectures of single sequences or genomes; application to the entire yeast genome revealed that at least 6.7% of its genes contain one or more signaling domains, approximately 350 greater than previously annotated. The process of constructing SMART predicted (i) novel domain homologues in unexpected locations such as band 4.1-homologous domains in focal adhesion kinases; (ii) previously unknown domain families, including a citron-homology domain; (iii) putative functions of domain families after identification of additional family members, for example, a ubiquitin-binding role for ubiquitin-associated domains (UBA); (iv) cellular roles for proteins, such as predicted DEATH domains in netrin receptors further implicating these molecules in axonal guidance; (v) signaling domains in known disease genes such as SPRY domains in both marenostrin/pyrin and Midline 1; (vi) domains in unexpected phylogenetic contexts such as diacylglycerol kinase homologues in yeast and bacteria; and (vii) likely protein misclassifications exemplified by a predicted pleckstrin homology domain in a Candida albicans protein, previously described as an integrin.
Measuring genome evolution.
Huynen, M.A. & Bork, P.
This is a review article.
Proc Natl Acad Sci U S A 1998 May 26 95(11) 5849-5856
The determination of complete genome sequences provides us with an opportunity to describe and analyze evolution at the comprehensive level of genomes. Here we compare nine genomes with respect to their protein coding genes at two levels: (i) we compare genomes as "bags of genes" and measure the fraction of orthologs shared between genomes and (ii) we quantify correlations between genes with respect to their relative positions in genomes. Distances between the genomes are related to their divergence times, measured as the number of amino acid substitutions per site in a set of 34 orthologous genes that are shared among all the genomes compared. We establish a hierarchy of rates at which genomes have changed during evolution. Protein sequence identity is the most conserved, followed by the complement of genes within the genome. Next is the degree of conservation of the order of genes, whereas gene regulation appears to evolve at the highest rate. Finally, we show that some genomes are more highly organized than others: they show a higher degree of the clustering of genes that have orthologs in other genomes.
Differential gene expression in mammary carcinoma cell lines: identification of DRIM, a new gene down-regulated in metastasis.
Schwirzke, M., Gnirke, A., Bork, P., Tarin, D. & Weidle, U.H.
Anticancer Res 1998 May-Jun;18(3A):1409-21.
Differential display technique was applied to a pair of cell lines derived from human breast carcinoma cell line MDA-MB 435 with metastatic and non-metastatic properties in the nude mouse system, with the objective to isolate genes involved in metastasis. DRIM (Down-Regulated In Metastasis) was the only gene found to be differentially expressed in this system. DRIM encodes a protein comprising 2785 amino acids with significant homology to a protein in yeast and C. elegans. The protein contains a conserved positively charged tail and several HEAT repeats, designated after four functionally characterized proteins in which the repeat was detected. Most of the hydrophobic regions of DRIM can be assigned to HEAT repeats. Expression of DRIM at the RNA level was investigated in several normal tissues and tumor cell lines.
Differential genome analysis applied to the species-specific features of Helicobacter pylori.
Huynen, M., Dandekar, T. & Bork, P.
FEBS Lett 1998 Apr 10;426(1):1-5.
We introduce a simple and rapid strategy to identify genes that are responsible for species-specific phenotypes. The genome of a species that has a specific phenotype is compared with at least one, closely related, species that lacks this phenotype. Homologous genes that are shared among the species compared are identified and discarded from the list of candidates for species-specific genes. The process is automated and rapidly yields a small subset of the genome that likely contains genes responsible for the species-specific features. Functions are assigned to the genes, and dubious annotations are filtered out. Information is extracted not only from the presence of genes, but also from their absence with respect to known phenotypes. We have applied the technique to identify a set of species-specific genes in Helicobacter pylori by comparing it with its closest relatives for which complete genome sequences are available, Haemophilus influenzae and Escherichia coli. Of the genes of this set for which functional features can be obtained, a large fraction (63%, 123 proteins) is (potentially) involved in H. pylori's interaction with its host. We hypothesize that a family of outer membrane proteins is critical for the ability of H. pylori to colonize host cells in highly acidic environments.
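The core filtering step of this strategy (discard every gene that has a homolog in a related species lacking the phenotype) amounts to a set subtraction. The sketch below assumes homology has already been decided elsewhere, for example by a similarity search, and is supplied as a lookup table; the gene identifiers, species labels and function name are hypothetical and do not come from the paper.

```python
def species_specific_genes(target_genes, homologs, reference_species):
    """Keep target genes with no detected homolog in any reference species.

    homologs maps a gene identifier to the set of species in which a
    homolog was detected (an assumed, precomputed input).
    """
    refs = set(reference_species)
    return [g for g in target_genes if not (homologs.get(g, set()) & refs)]


# Hypothetical toy data.
target_genes = ["geneA", "geneB", "geneC", "geneD"]
homologs = {
    "geneA": {"species_1", "species_2"},
    "geneB": set(),
    "geneC": {"species_2"},
    "geneD": {"species_3"},  # homolog only outside the reference set
}

candidates = species_specific_genes(target_genes, homologs, ["species_1", "species_2"])
print(candidates)
```

The remaining candidates would still need functional annotation and filtering of dubious assignments, as the abstract describes.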
Wanted: subcellular localization of proteins based on sequence.
Eisenhaber, F. & Bork, P.
Trends Cell Biol 1998 Apr;8(4):169-70.
Predicting functions from protein sequences--where are the bottlenecks?
Bork, P. & Koonin, E.V.
This is a review article.
Nat Genet 1998 Apr 18(4) 313-318
The exponential growth of sequence data does not necessarily lead to an increase in knowledge about the functions of genes and their products. Prediction of function using comparative sequence analysis is extremely powerful but, if not performed appropriately, may also lead to the creation and propagation of assignment errors. While current homology detection methods can cope with the data flow, the identification, verification and annotation of functional features need to be drastically improved.
Merging extracellular domains: fold prediction for laminin G-like and amino-terminal thrombospondin-like modules based on homology to pentraxins.
Beckmann, G., Hanke, J., Bork, P. & Reich, J.G.
J Mol Biol 1998 Feb 6 275(5) 725-730
Using a new method for construction and database searches of sequence consensus strings, we have identified a new superfamily of protein modules comprising laminin G, thrombospondin N and the pentraxin families. The conserved patterns correspond mainly to hydrophobic core residues located in central beta strands of the known three-dimensional structures of two pentraxins, the human C-reactive protein and the serum amyloid P-component. Thus, we predict a similar jellyroll fold for all members of this superfamily. In addition, the conservation of two exposed aspartate residues in the majority of superfamily members suggests hitherto unrecognised functional sites.
Characterization of targeting domains by sequence analysis: glycogen-binding domains in protein phosphatases.
Bork, P., Dandekar, T., Eisenhaber, F. & Huynen, M.
J Mol Med 1998 Feb;76(2):77-9.
Towards detection of orthologues in sequence databases.
Yuan, Y.P., Eulenstein, O., Vingron, M. & Bork, P.
MOTIVATION: Numerous homologous sequences from diverse species can be retrieved from databases using programs such as BLAST. However, due to multigene families, evolutionary relationship often cannot be easily determined and proper functional assignment becomes difficult. Thus, discrimination between orthologues and paralogues within BLAST output lists of homologous sequences becomes more and more important. RESULT: We therefore developed a method that attempts to construct a reconciled tree from a gene tree of selected sequences and its corresponding phylogenetic tree of the species involved (species tree). An interface on the Web is developed to enable users to analyse the BLAST result. BLAST outputs are parsed and, for the selected sequences, multiple alignments are constructed either globally or for local regions. Bootstrapped trees are returned and compared with the expected species tree. In cases of discrepancies, gene duplications are assumed and a reconciled tree is computed. The reconciled tree shows probable orthologues and paralogues as predicted.
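Comparing a gene tree with the expected species tree and assuming duplications where they disagree can be sketched with the standard last-common-ancestor mapping. This is a common textbook technique chosen for illustration, not necessarily the authors' implementation; the nested-tuple tree encoding, the gene names and the function names are all assumptions. Bootstrapping and alignment are left out.

```python
def clades(tree):
    """List all clades of a nested-tuple species tree as frozensets of leaves."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    out, leaves = [], set()
    for child in tree:
        sub = clades(child)
        out.extend(sub)
        leaves |= sub[-1]  # last entry is the child's own clade
    out.append(frozenset(leaves))
    return out


def lca(species_set, species_clades):
    """Smallest species-tree clade containing every species in species_set."""
    return min((c for c in species_clades if species_set <= c), key=len)


def label_events(gene_tree, gene_to_species, species_tree):
    """Label each internal gene-tree node as 'duplication' or 'speciation'."""
    species_clades = clades(species_tree)
    events = []

    def visit(node):
        if isinstance(node, str):
            return frozenset([gene_to_species[node]]), [node]
        mapped_children, species, genes = [], set(), []
        for child in node:
            child_species, child_genes = visit(child)
            mapped_children.append(lca(child_species, species_clades))
            species |= child_species
            genes += child_genes
        mapped = lca(frozenset(species), species_clades)
        # A node mapping to the same clade as one of its children is a duplication.
        kind = "duplication" if mapped in mapped_children else "speciation"
        events.append((tuple(genes), kind))
        return frozenset(species), genes

    visit(gene_tree)
    return events


# Hypothetical toy example with two paralogous gene copies in human and mouse.
species_tree = (("human", "mouse"), "fly")
gene_tree = ((("human_a", "mouse_a"), ("human_b", "mouse_b")), "fly_x")
gene_to_species = {"human_a": "human", "mouse_a": "mouse",
                   "human_b": "human", "mouse_b": "mouse", "fly_x": "fly"}

events = label_events(gene_tree, gene_to_species, species_tree)
for genes, kind in events:
    print(kind, genes)
```

Gene pairs whose last common node is a speciation are probable orthologues; pairs separated by a duplication node are probable paralogues.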
Elementary modes analysis illustrated with human red cell metabolism.
Schuster, S., Fell, D.A., Pfeiffer, T., Dandekar, T. & Bork, P.
In "BioThermoKinetics in the Post Genomic Era," Larsson, C., Pahlman, I.-L. & Gustafsson, L. (eds.), 1998, Chalmers, Goeteborg, pp. 332-339
Sequence and structure of proteins
Eisenhaber, F. & Bork, P.
In "Recombinant protein, monoclonal antibodies and therapeutic genes," (series Biotechnology, 2nd. Edition), 1998, Rehm, H.-J. & Reed, G. (eds.), Wiley-VCH, Weinheim
Systematic genomic screening and analysis of mRNA in untranslated regions and mRNA precursors: combining experimental and computational approaches.
Dandekar, T., Beyer, K., Bork, P., Kenealy, M.R., Pantopoulos, K., Hentze, M., Sonntag-Buck, V., Flouriot, G., Gannon, F. & Schreiber, S.
MOTIVATION: The untranslated regions (UTRs) of mRNA upstream (5'UTR) and downstream (3'UTR) of the open reading frame, as well as the mRNA precursor, carry important regulatory sequences. To reveal unidentified regulatory signals, we combine information from experiments with computational approaches. Depending on available knowledge, three different strategies are employed. RESULTS: Searching with a consensus template, new RNAs with regulatory RNA elements can be identified in genomic screens. By this approach, we identify new candidate regulatory motifs resembling iron-responsive elements in the 5'UTRs of HemA, FepB and FrdB mRNA from Escherichia coli. If an RNA element is not yet defined, it may be analyzed by combining results from SELEX (selective enrichment of ligands by exponential amplification) and a search of databases from RNA or genomic sequences. A cleavage stimulating factor (CstF) binding element 3' of the polyadenylation site in the mRNA precursor serves as a test example. Alternatively, the regulatory RNA element may be found by studying different RNA foldings and their correlation with simple experimental tests. We delineate a novel instability element in the 3'UTR of the estrogen receptor mRNA in this way. AVAILABILITY: Strategy, methods and programs are available on request from T. Dandekar. CONTACT: email@example.com
Frame: detection of genomic sequencing errors.
Brown, N.P., Sander, C. & Bork, P.
Bioinformatics 1998 14(4) 367-371
MOTIVATION: The underlying error rate for genomic sequencing sometimes results in the introduction of artificial frameshifts and in-frame stop codons into putative protein encoding genes. Severe errors are then introduced into the inferred transcripts through mis-translation or premature termination. RESULTS: We describe a system for screening segments of DNA for frameshift and in-frame stop errors in coding regions. The method is based on homology matching using blastx to compare all six reading frames of the query nucleotide sequence against selected protein sequence databases. Fragments of protein matching neighbouring regions of the query DNA are united and extended laterally to define candidate open reading frames, within which, frameshifts and stops are identified. Suitable targets include prokaryotic or other intron-free genomic sequence and complementary DNAs. As an example of its use, we report here two frameshifted ORFs that deviate from the original TIGR sequence annotations for the recently released Helicobacter pylori genome. AVAILABILITY: The tool is accessible via the URL http://www.sander.ebi.ac.uk/frame/. CONTACT: firstname.lastname@example.org.
Deriving biological knowledge from genomic sequence.
Bork, P. & Eisenberg, D.
Curr Opin Struc Biol 1998 8 331-332
Automated pair-wise comparisons of microbial genomes.
Bansal, A.K., Bork, P. & Stuckey
Mathematical Modelling and Scientific Computing 1998 9 1-23
Secreted fringe-like signaling molecules may be glycosyltransferases
Yuan, Y.P., Schultz, J., Mlodzik, M. & Bork, P.
Cell 1997 Jan 10;88(1):9-11
Pleckstrin's repeat performance: a novel domain in G-protein signaling?
Ponting, C.P. & Bork, P.
Trends Biochem Sci 1996 Jul;21(7):245-6
Divergent evolution of a beta/alpha-barrel subclass: detection of numerous phosphate-binding sites by motif search.
Bork, P., Gellerich, J., Groth, H., Hooft, R. & Martin, F.
Protein Sci 1995 Feb;4(2):268-74.
Study of the most conserved region in many beta/alpha-barrels, the phosphate-binding site, revealed a sequence motif in a few beta/alpha-barrels with known tertiary structure, namely glycolate oxidase (GOX), cytochrome b2 (Cyb2), tryptophan synthase alpha subunit (TrpA), and the indoleglycerolphosphate synthase (TrpC). Database searches identified this motif in numerous other enzyme families: (1) IMP dehydrogenase (IMPDH) and GMP reductase (GuaC); (2) phosphoribosylformimino-5-aminoimidazol carboxamide ribotide isomerase (HisA) and the cyclase-producing D-erythro-imidazole-glycerolphosphate (HisF) of the histidine biosynthetic pathway; (3) dihydroorotate dehydrogenase (PyrD); (4) glutamate synthase (GltB); (5) ThiE and ThiG involved in the biosynthesis of thiamine as well as related proteins; (6) an uncharacterized open reading frame from Erwinia herbicola; and (7) a glycerol uptake operon antiterminator regulatory protein (GlpP). Secondary structure predictions of the different families mentioned above revealed an alternating order of beta-strands and alpha-helices in agreement with a beta/alpha-barrel-like topology. The putative phosphate-binding site is always found near the C-terminus of the enzymes, which are all at least about 200 amino acids long. This is compatible with its assumed location between strand 7 and helix 8. The identification of a significant motif in functionally diverse enzymes suggests a divergent evolution of at least a considerable fraction of beta/alpha-barrels. In addition to the known accumulation of beta/alpha-barrels in the tryptophan biosynthetic pathway, we observe clusters of these enzymes in histidine biosynthesis, purine metabolism, and apparently also in thiamine biosynthesis. The substrates are mostly heterocyclic compounds.