The eukaryotic linear motif resource ELM: 10 years and counting.
Dinkel, H., Van Roey, K., Michael, S., Davey, N.E., Weatheritt, R.J., Born, D., Speck, T., Kruger, D., Grebnev, G., Kuban, M., Strumillo, M., Uyar, B., Budd, A., Altenberg, B., Seiler, M., Chemes, L.B., Glavina, J., Sanchez, I.E., Diella, F. & Gibson, T.J.
Nucleic Acids Res. 2014 Jan;42(Database issue):D259-66. doi: 10.1093/nar/gkt1047. Epub 2013 Nov 7.
The eukaryotic linear motif (ELM; http://elm.eu.org) resource is a hub for collecting, classifying and curating information about short linear motifs (SLiMs). For >10 years, this resource has provided the scientific community with a freely accessible guide to the biology and function of linear motifs. The current version of ELM contains approximately 200 different motif classes with over 2400 experimentally validated instances manually curated from >2000 scientific publications. Furthermore, detailed information about motif-mediated interactions has been annotated and made available in standard exchange formats. Where appropriate, links are provided to resources such as switches.elm.eu.org and KEGG pathways.
Drift and conservation of differential exon usage across tissues in primate species.
Reyes, A., Anders, S., Weatheritt, R.J., Gibson, T.J., Steinmetz, L.M. & Huber, W.
Proc Natl Acad Sci U S A. 2013 Sep 17;110(38):15377-82. doi: 10.1073/pnas.1307202110. Epub 2013 Sep 3.
Alternative usage of exons provides genomes with plasticity to produce different transcripts from the same gene, modulating the function, localization, and life cycle of gene products. It affects most human genes. For a limited number of cases, alternative functions and tissue-specific roles are known. However, recent high-throughput sequencing studies have suggested that much alternative isoform usage across tissues is nonconserved, raising the question of the extent of its functional importance. We address this question in a genome-wide manner by analyzing the transcriptomes of five tissues for six primate species, focusing on exons that are 1:1 orthologous in all six species. Our results support a model in which differential usage of exons has two major modes: First, most of the exons show only weak differences, which are dominated by interspecies variability and may reflect neutral drift and noisy splicing. These cases dominate the genome-wide view and explain why conservation appears to be so limited. Second, however, a sizeable minority of exons show strong differences between tissues, which are mostly conserved. We identified a core set of 3,800 exons from 1,643 genes that show conservation of strongly tissue-dependent usage patterns from human at least to macaque. This set is enriched for exons encoding protein-disordered regions and untranslated regions. Our findings support the theory that isoform regulation is an important target of evolution in primates, and our method provides a powerful tool for discovering potentially functional tissue-dependent isoforms.
The transience of transient overexpression.
Gibson, T.J., Seiler, M. & Veitia, R.A.
Nat Methods. 2013 Aug;10(8):715-21. doi: 10.1038/nmeth.2534.
Much of what is known about mammalian cell regulation has been achieved with the aid of transiently transfected cells. However, overexpression can violate balanced gene dosage, affecting protein folding, complex assembly and downstream regulation. To avoid these problems, genome engineering technologies now enable the generation of stable cell lines expressing modified proteins at (almost) native levels.
The switches.ELM resource: a compendium of conditional regulatory interaction interfaces.
Van Roey, K., Dinkel, H., Weatheritt, R.J., Gibson, T.J. & Davey, N.E.
Sci Signal. 2013 Apr 2;6(269):rs7. doi: 10.1126/scisignal.2003345.
Short linear motifs (SLiMs) are protein interaction sites that play an important role in cell regulation by controlling protein activity, localization, and local abundance. The functionality of a SLiM can be modulated in a context-dependent manner to induce a gain, loss, or exchange of binding partners, which will affect the function of the SLiM-containing protein. As such, these conditional interactions underlie molecular decision-making in cell signaling. We identified multiple types of pre- and posttranslational switch mechanisms that can regulate the function of a SLiM and thereby control its interactions. The collected examples of experimentally characterized SLiM-based switch mechanisms were curated in the freely accessible switches.ELM resource (http://switches.elm.eu.org). On the basis of these examples, we defined and integrated rules to analyze SLiMs for putative regulatory switch mechanisms. We applied these rules to known validated SLiMs, providing evidence that more than half of these are likely to be pre- or posttranslationally regulated. In addition, we showed that posttranslationally modified sites are enriched around SLiMs, which enables cooperative and integrative regulation of protein interaction interfaces. We foresee switches.ELM complementing available resources to extend our knowledge of the molecular mechanisms underlying cell signaling.
SLiMPrints: conservation-based discovery of functional motif fingerprints in intrinsically disordered protein regions.
Davey, N.E., Cowan, J.L., Shields, D.C., Gibson, T.J., Coldwell, M.J. & Edwards, R.J.
Nucleic Acids Res. 2012 Nov 1;40(21):10628-41. doi: 10.1093/nar/gks854. Epub 2012 Sep 12.
Large portions of higher eukaryotic proteomes are intrinsically disordered, and abundant evidence suggests that these unstructured regions of proteins are rich in regulatory interaction interfaces. A major class of disordered interaction interfaces are the compact and degenerate modules known as short linear motifs (SLiMs). As a result of the difficulties associated with the experimental identification and validation of SLiMs, our understanding of these modules is limited, advocating the use of computational methods to focus experimental discovery. This article evaluates the use of evolutionary conservation as a discriminatory technique for motif discovery. A statistical framework is introduced to assess the significance of relatively conserved residues, quantifying the likelihood a residue will have a particular level of conservation given the conservation of the surrounding residues. The framework is expanded to assess the significance of groupings of conserved residues, a metric that forms the basis of SLiMPrints (short linear motif fingerprints), a de novo motif discovery tool. SLiMPrints identifies relatively overconstrained proximal groupings of residues within intrinsically disordered regions, indicative of putatively functional motifs. Finally, the human proteome is analysed to create a set of highly conserved putative motif instances, including a novel site on translation initiation factor eIF2A that may regulate translation through binding of eIF4E.
A Proteome-wide screen for mammalian SxIP motif-containing microtubule plus-end tracking proteins.
Jiang, K., Toedt, G., Montenegro Gouveia, S., Davey, N.E., Hua, S., van der Vaart, B., Grigoriev, I., Larsen, J., Pedersen, L.B., Bezstarosti, K., Lince-Faria, M., Demmers, J., Steinmetz, M.O., Gibson, T.J. & Akhmanova, A.
Curr Biol. 2012 Oct 9;22(19):1800-7. doi: 10.1016/j.cub.2012.07.047. Epub 2012 Aug 9.
Microtubule plus-end tracking proteins (+TIPs) are structurally and functionally diverse factors that accumulate at the growing microtubule plus-ends, connect them to various cellular structures, and control microtubule dynamics [1, 2]. EB1 and its homologs are +TIPs that can autonomously recognize growing microtubule ends and recruit a variety of other proteins to them. Numerous +TIPs bind to end binding (EB) proteins through natively unstructured basic and serine-rich polypeptide regions containing a core SxIP motif (serine-any amino acid-isoleucine-proline). The SxIP consensus sequence is short, and the surrounding sequences show high variability, raising the possibility that undiscovered SxIP-containing +TIPs are encoded in mammalian genomes. Here, we performed a proteome-wide search for mammalian SxIP-containing +TIPs by combining biochemical and bioinformatics approaches. We have identified a set of previously uncharacterized EB partners that have the capacity to accumulate at the growing microtubule ends, including protein kinases, a small GTPase, and centriole-, membrane-, and actin-associated proteins. We show that one of the newly identified +TIPs, CEP104, interacts with CP110 and CEP97 at the centriole and is required for ciliogenesis. Our study reveals the complexity of the mammalian +TIP interactome and provides a basis for investigating the molecular crosstalk between microtubule ends and other cellular structures.
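As an illustration only, the core SxIP consensus described in this abstract (serine, any amino acid, isoleucine, proline) can be written as a regular expression and scanned against a sequence. This is a minimal sketch and not the screen used in the paper, which combined biochemical and bioinformatics approaches; the function name find_sxip and the toy sequence are hypothetical, and the pattern ignores the basic, serine-rich sequence context that the abstract mentions.

```python
import re

# Core SxIP consensus: serine, any one residue, isoleucine, proline.
# [A-Z] stands in for "any amino acid" in one-letter code (an assumption
# made for this sketch; it does not exclude non-standard letters).
SXIP_PATTERN = re.compile(r"S[A-Z]IP")


def find_sxip(sequence):
    """Return (1-based start position, matched tetrapeptide) for each SxIP hit."""
    return [(m.start() + 1, m.group(0))
            for m in SXIP_PATTERN.finditer(sequence.upper())]


if __name__ == "__main__":
    toy_sequence = "MAGSRIPKKSSTIPAA"  # made-up sequence, not a real protein
    for position, match in find_sxip(toy_sequence):
        print(position, match)
```

A bare tetrapeptide match like this will occur frequently by chance, which is consistent with the abstract's point that the consensus is short and needs additional evidence to identify genuine +TIPs.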
RACK1 research - ships passing in the night?
Gibson, T.J.
FEBS Lett. 2012 Aug 14;586(17):2787-9. doi: 10.1016/j.febslet.2012.04.048. Epub 2012 May 8.
It should not be surprising that a protein with a name like RACK1 - short for receptor for activated C kinase 1 - is found in a variety of signaling complexes. Its alternative name, the splendidly unmemorable GNB2L1 - short for guanine nucleotide-binding protein subunit beta-2-like 1 - should reinforce this link to signaling complexes. There are currently over 400 publications listed in PubMed mentioning RACK1/GNB2L1 in the abstract, so it is certainly an actively studied protein with much involvement in different aspects of cell regulation being reported. RACK1 binds to the 40S ribosomal subunit, suggesting it links cell regulation and translation. It is also a target of intracellular parasites. And yet does this protein have the profile that it should? And why are there two kinds of RACK1 researcher who do not seem to communicate well?
Linear motifs: lost in (pre)translation.
Weatheritt, R.J. & Gibson, T.J.
Trends Biochem Sci. 2012 Aug;37(8):333-41. doi: 10.1016/j.tibs.2012.05.001. Epub 2012 Jun 15.
Pretranslational modification by alternative splicing, alternative promoter usage and RNA editing enables the production of multiple protein isoforms from a single gene. A large quantity of data now supports the notion that short linear motifs (SLiMs), which are protein interaction modules enriched within intrinsically disordered regions, are key for the functional diversification of these isoforms. The inclusion or removal of these SLiMs can switch the subcellular localisation of an isoform, promote cooperative associations, refine the affinity of an interaction, coordinate phase transitions within the cell, and even create isoforms of opposing function. This article discusses the novel functionality enabled by the addition or removal of SLiM-containing exons by pretranslational modifications, such as alternative splicing and alternative promoter usage, and how these alterations enable the creation and modulation of complex regulatory and signalling pathways.
Prion infected rhesus monkeys to study differential transcription of Alu DNA elements and editing of Alu transcripts in neuronal cells and blood cells.
Kiesel, P., Bodemer, W., Gibson, T.J., Zischler, H. & Kaup, F.J.
J Med Primatol. 2012 Jun;41(3):176-82. doi: 10.1111/j.1600-0684.2012.00535.x. Epub 2012 Mar 2.
Background: Rhesus monkeys were used as a non-human primate model to study small non-coding RNA after infection with human sporadic and variant Creutzfeldt-Jakob prions. Methods: Tissue-specific Alu DNA element transcription and editing of transcripts were assessed in neuronal cells and blood cells (buffy coat). Results: Tissue/cell-specific transcription and editing patterns were obtained. Active Alu DNA elements belonged to several Alu DNA families; they could be located on several chromosomes, and their genomic sites were identified. Deamination by adenosine deaminase acting on RNA and apolipoprotein B editing complex was found. Conclusions: Different Alu transcription and editing programmes exist and may depend on the infection status.
iELM--a web server to explore short linear motif-mediated interactions.
Weatheritt, R.J., Jehl, P., Dinkel, H. & Gibson, T.J.
Nucleic Acids Res. 2012 Jul;40(Web Server issue):W364-9. doi: 10.1093/nar/gks444. Epub 2012 May 25.
The recent expansion in our knowledge of protein-protein interactions (PPIs) has allowed the annotation and prediction of hundreds of thousands of interactions. However, the function of many of these interactions remains elusive. The interactions of Eukaryotic Linear Motif (iELM) web server provides a resource for predicting the function and positional interface for a subset of interactions mediated by short linear motifs (SLiMs). The iELM prediction algorithm is based on the annotated SLiM classes from the Eukaryotic Linear Motif (ELM) resource and allows users to explore both annotated and user-generated PPI networks for SLiM-mediated interactions. By incorporating the annotated information from the ELM resource, iELM provides functional details of PPIs. This can be used in proteomic analysis, for example, to infer whether an interaction promotes complex formation or degradation. Furthermore, details of the molecular interface of the SLiM-mediated interactions are also predicted. This information is displayed in a fully searchable table, as well as graphically with the modular architecture of the participating proteins extracted from the UniProt and Phospho.ELM resources. A network figure is also presented to aid the interpretation of results. The iELM server supports single protein queries as well as large-scale proteomic submissions and is freely available at http://i.elm.eu.org.
Linear motifs confer functional diversity onto splice variants.
Weatheritt, R.J., Davey, N.E. & Gibson, T.J.
Nucleic Acids Res. 2012 Aug;40(15):7123-31. doi: 10.1093/nar/gks442. Epub 2012 May 25.
The pre-translational modification of messenger ribonucleic acids (mRNAs) by alternative promoter usage and alternative splicing is an important source of pleiotropy. Despite intensive efforts, our understanding of the functional implications of this dynamically created diversity is still incomplete. Using the available knowledge of interaction modules, particularly within intrinsically disordered regions (IDRs), we analysed the occurrences of protein modules within alternative exons. We find that regions removed or included by pre-translational variation are enriched in linear motifs suggesting that the removal or inclusion of exons containing these interaction modules is an important regulatory mechanism. In particular, we observe that PDZ-, PTB-, SH2- and WW-domain binding motifs are more likely to occur within alternative exons. We also determine that regions removed or included by alternative promoter usage are enriched in IDRs suggesting that protein isoform diversity is tightly coupled to the modulation of IDRs. This study, therefore, demonstrates that short linear motifs are key components for establishing protein diversity between splice variants.
Motif switches: decision-making in cell regulation.
Van Roey, K., Gibson, T.J. & Davey, N.E.
Curr Opin Struct Biol. 2012 Jun;22(3):378-85. doi: 10.1016/j.sbi.2012.03.004. Epub 2012 Apr 3.
Tight regulation of gene products from transcription to protein degradation is required for reliable and robust control of eukaryotic cell physiology. Many of the mechanisms directing cell regulation rely on proteins detecting the state of the cell through context-dependent, tuneable interactions. These interactions underlie the ability of proteins to make decisions by combining regulatory information encoded in a protein's expression level, localisation and modification state. This raises the question, how do proteins integrate available information to correctly make decisions? Over the past decade pioneering work on the nature and function of intrinsically disordered protein regions has revealed many elegant switching mechanisms that underlie cell signalling and regulation, prompting a reevaluation of their role in cooperative decision-making.
The identification of short linear motif-mediated interfaces within the human interactome.
Weatheritt, R.J., Luck, K., Petsalaki, E., Davey, N.E. & Gibson, T.J.
Bioinformatics. 2012 Apr 1;28(7):976-82. doi: 10.1093/bioinformatics/bts072. Epub 2012 Feb 10.
MOTIVATION: Eukaryotic proteins are highly modular, containing multiple interaction interfaces that mediate binding to a network of regulators and effectors. Recent advances in high-throughput proteomics have rapidly expanded the number of known protein-protein interactions (PPIs); however, the molecular basis for the majority of these interactions remains to be elucidated. There has been a growing appreciation of the importance of a subset of these PPIs, namely those mediated by short linear motifs (SLiMs), particularly the canonical and ubiquitous SH2, SH3 and PDZ domain-binding motifs. However, these motif classes represent only a small fraction of known SLiMs and outside these examples little effort has been made, either bioinformatically or experimentally, to discover the full complement of motif instances. RESULTS: In this article, interaction data are analysed to identify and characterize an important subset of PPIs, those involving SLiMs binding to globular domains. To do this, we introduce iELM, a method to identify interactions mediated by SLiMs and add molecular details of the interaction interfaces to both interacting proteins. The method identifies SLiM-mediated interfaces from PPI data by searching for known SLiM-domain pairs. This approach was applied to the human interactome to identify a set of high-confidence putative SLiM-mediated PPIs. AVAILABILITY: iELM is freely available at http://elmint.embl.de. CONTACT: firstname.lastname@example.org. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
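The idea of searching PPI data for known SLiM-domain pairs can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the iELM method itself, which is not specified in detail in the abstract: the function name slim_mediated, the protein identifiers P1 to P3, and the motif and domain labels are all hypothetical toy values.

```python
def slim_mediated(ppis, motifs, domains, known_pairs):
    """Flag interactions that could be explained by a known motif-domain pair.

    ppis        -- iterable of (protein_a, protein_b) interactions
    motifs      -- dict: protein -> set of motif classes found in it
    domains     -- dict: protein -> set of globular domains it contains
    known_pairs -- set of (motif_class, domain) pairs known to bind
    Returns a list of (motif_protein, motif, domain_protein, domain) tuples.
    """
    hits = []
    for a, b in ppis:
        # Either partner may carry the motif, so test both orientations.
        for motif_protein, domain_protein in ((a, b), (b, a)):
            for motif in sorted(motifs.get(motif_protein, ())):
                for domain in sorted(domains.get(domain_protein, ())):
                    if (motif, domain) in known_pairs:
                        hits.append((motif_protein, motif, domain_protein, domain))
    return hits


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    known_pairs = {("PxxP", "SH3"), ("pY", "SH2")}
    motifs = {"P1": {"PxxP"}, "P3": {"pY"}}
    domains = {"P2": {"SH3"}, "P3": {"SH3"}}
    ppis = [("P1", "P2"), ("P2", "P3"), ("P3", "P1")]
    for hit in slim_mediated(ppis, motifs, domains, known_pairs):
        print(hit)
```

In this toy network the P2-P3 interaction is not flagged, because the only motif present (pY) has no matching domain in the partner; such a lookup only annotates interactions for which both halves of a known pair are present.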
Attributes of short linear motifs.
Davey, N.E., Van Roey, K., Weatheritt, R.J., Toedt, G., Uyar, B., Altenberg, B., Budd, A., Diella, F., Dinkel, H. & Gibson, T.J.
Mol Biosyst. 2012 Jan;8(1):268-81. doi: 10.1039/c1mb05231d. Epub 2011 Sep 12.
Traditionally, protein-protein interactions were thought to be mediated by large, structured domains. However, it has become clear that the interactome comprises a wide range of binding interfaces with varying degrees of flexibility, ranging from rigid globular domains to disordered regions that natively lack structure. Enrichment for disorder in highly connected hub proteins and its correlation with organism complexity hint at the functional importance of disordered regions. Nevertheless, they have not yet been extensively characterised. Shifting the attention from globular domains to disordered regions of the proteome might bring us closer to elucidating the dense and complex connectivity of the interactome. An important class of disordered interfaces are the compact mono-partite, short linear motifs (SLiMs, or eukaryotic linear motifs (ELMs)). They are evolutionarily plastic and interact with relatively low affinity due to the limited number of residues that make direct contact with the binding partner. These features confer to SLiMs the ability to evolve convergently and mediate transient interactions, which is imperative to network evolution and to maintain robust cell signalling, respectively. The ability to discriminate biologically relevant SLiMs by means of different attributes will improve our understanding of the complexity of the interactome and aid development of bioinformatics tools for motif discovery. In this paper, the curated instances currently available in the Eukaryotic Linear Motif (ELM) database are analysed to provide a clear overview of the defining attributes of SLiMs. These analyses suggest that functional SLiMs have higher levels of conservation than their surrounding residues, frequently evolve convergently, preferentially occur in disordered regions and often form a secondary structure when bound to their interaction partner. These results advocate searching for small groupings of residues in disordered regions with higher relative conservation and a propensity to form a secondary structure. Finally, the most interesting conclusions are examined in regard to their functional consequences.
ELM--the database of eukaryotic linear motifs.
Dinkel, H., Michael, S., Weatheritt, R.J., Davey, N.E., Van Roey, K., Altenberg, B., Toedt, G., Uyar, B., Seiler, M., Budd, A., Jodicke, L., Dammert, M.A., Schroeter, C., Hammer, M., Schmidt, T., Jehl, P., McGuigan, C., Dymecka, M., Chica, C., Luck, K., Via, A., Chatr-Aryamontri, A., Haslam, N., Grebnev, G., Edwards, R.J., Steinmetz, M.O., Meiselbach, H., Diella, F. & Gibson, T.J.
Nucleic Acids Res. 2012 Jan 1;40(D1):D242-D251. Epub 2011 Nov 21.
Linear motifs are short, evolutionarily plastic components of regulatory proteins and provide low-affinity interaction interfaces. These compact modules play central roles in mediating every aspect of the regulatory functionality of the cell. They are particularly prominent in mediating cell signaling, controlling protein turnover and directing protein localization. Given their importance, our understanding of motifs is surprisingly limited, largely as a result of the difficulty of discovery, both experimentally and computationally. The Eukaryotic Linear Motif (ELM) resource at http://elm.eu.org provides the biological community with a comprehensive database of known experimentally validated motifs, and an exploratory tool to discover putative linear motifs in user-submitted protein sequences. The current update of the ELM database comprises 1800 annotated motif instances representing 170 distinct functional classes, including approximately 500 novel instances and 24 novel classes. Several older motif class entries have been also revisited, improving annotation and adding novel instances. Furthermore, addition of full-text search capabilities, an enhanced interface and simplified batch download has improved the overall accessibility of the ELM data. The motif discovery portion of the ELM resource has added conservation, and structural attributes have been incorporated to aid users to discriminate biologically relevant motifs from stochastically occurring non-functional instances.
A comparative analysis to study editing of small noncoding BC200- and Alu transcripts in brain of prion-inoculated rhesus monkeys (M. mulatta).
Kiesel, P., Kues, A., Kaup, F.J., Bodemer, W., Gibson, T.J. & Zischler, H.
J Toxicol Environ Health A. 2012;75(7):391-401. doi: 10.1080/15287394.2012.670896.
Small retroelements (short interspersed elements, abbreviated SINEs) are abundant in vertebrate genomes. Using RNA isolated from rhesus monkey cerebellum and buffy coat, reverse-transcription polymerase chain reaction (RT-PCR) was applied to clone cDNA of BC200 and Alu RNAs. Transcripts containing Alu-SINE sequences may be subjected to extensive RNA editing by ADAR (adenosine deaminases that act on RNA) deamination. Abundance of Alu transcripts was determined with real-time RT-PCR and was significantly higher than BC200 (brain cytoplasmic) in cerebellum. BC200 transcripts were absent from buffy coat cells. Availability of the rhesus genome sequence allowed the BC200 transcripts to be mapped to the specific locus on chromosome 13. Both the qualitative and quantitative characteristics of BC200 expression argue for the BC200 transcripts being generated by RNA polymerase III. In cerebellum, Alu transcripts often possessed base exchanges (A to G) consistent with ADAR editing and, somewhat unexpectedly, C to T exchanges consistent with APOBEC (apolipoprotein B editing complex) editing. In contrast, the BC200 transcripts, which as RNA polymerase III transcripts play a role in dendritic RNA translation, appeared not to be deaminated, despite the presence of editing of Alu in the same tissue. To assess whether neuronal disease might influence editing of BC200 and Alu-SINE transcripts in cerebellum, RNA was isolated from two rhesus monkeys that were inoculated with prions from human variant Creutzfeldt-Jakob disease (vCJD). Regardless of prion-induced neurodegeneration, no BC200 RNA editing was observed, while Alu RNA continued to show both ADAR and APOBEC editing. Thus, BC200 RNAs do not appear to become accessible to editing enzymes despite infected neurons being subjected to severe stress, damage, and eventually cell death.
Possible editing of Alu transcripts in blood cells of sporadic Creutzfeldt-Jakob disease (sCJD).
Kiesel, P., Gibson, T.J., Ciesielczyk, B., Bodemer, M., Kaup, F.J., Bodemer, W., Zischler, H. & Zerr, I.
J Toxicol Environ Health A. 2011 Jan;74(2-4):88-95.
Editing of RNA molecules gained major interest when coding mRNA was analyzed. A small, noncoding, Alu DNA element transcript that may act as regulatory RNA in cells was examined in this study. Alu DNA element transcription was determined in buffy coat from healthy humans and human sporadic Creutzfeldt-Jakob disease (sCJD) cases. In addition, non-sCJD controls, mostly dementia cases and Alzheimer's disease (AD) cases, were included. The Alu cDNA sequences were aligned to genomic Alu DNA elements by database search. A comparison of best aligned Alu DNA sequences with our RNA/cDNA clones revealed editing by deamination by ADAR (adenosine deaminase acting on RNA) and APOBEC (apolipoprotein B editing complex). Nucleotide exchanges like a G instead of an A or a T instead of a C in our cDNA sequences versus genomic Alu DNA pointed to recent mutations. To confirm this, our Alu cDNA sequences were aligned not only to genomic human Alu DNA but also to the respective genomic DNA of the chimpanzee and rhesus. Enhanced ADAR correlated with A-G exchanges in dementia, AD, and sCJD was noted when compared to healthy controls as well as APOBEC-related C-T exchanges. The APOBEC-related mutations were higher in healthy controls than in cases suffering from neurodegeneration, with the exception of the dementia group with the prion protein gene (PRNP) MV genotype. Hence, this study may be considered the first real-time analysis of Alu DNA element transcripts with regard to editing of the respective Alu transcripts in human blood cells.
Phospho.ELM: a database of phosphorylation sites--update 2011.
Dinkel, H., Chica, C., Via, A., Gould, C.M., Jensen, L.J., Gibson, T.J. & Diella, F.
Nucleic Acids Res. 2011 Jan;39(Database issue):D261-7. Epub 2010 Nov 9.
The Phospho.ELM resource (http://phospho.elm.eu.org) is a relational database designed to store in vivo and in vitro phosphorylation data extracted from the scientific literature and phosphoproteomic analyses. The resource has been actively developed for more than 7 years and currently comprises 42,574 serine, threonine and tyrosine non-redundant phosphorylation sites. Several new features have been implemented, such as structural disorder/order and accessibility information and a conservation score. Additionally, the conservation of the phosphosites can now be visualized directly on the multiple sequence alignment used for the score calculation. Finally, special emphasis has been put on linking to external resources such as interaction networks and other databases.
From sequence to structural analysis in protein phosphorylation motifs.
Via, A., Diella, F., Gibson, T.J. & Helmer-Citterich, M.
Front Biosci. 2011 Jan 1;16:1261-75.
Phosphorylation is the most widely studied post-translational modification occurring in cells. While mass spectrometry-based proteomics experiments are uncovering thousands of novel in vivo phosphorylation sites, the identification of kinase specificity rules still remains a relatively slow and often inefficacious task. In the last twenty years, many efforts have been devoted to the experimental and computational identification of sequence and structural motifs encoding kinase-substrate interaction key residues and the phosphorylated amino acid itself. In this review, we retrace the road to the discovery of phosphorylation sequence motifs, examine the progress achieved in the detection of three-dimensional motifs and discuss their importance in the understanding of regulation and de-regulation of many cellular processes.
Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega.
Sievers, F., Wilm, A., Dineen, D., Gibson, T.J., Karplus, K., Li, W., Lopez, R., McWilliam, H., Remmert, M., Soding, J., Thompson, J.D. & Higgins, D.G.
Mol Syst Biol. 2011 Oct 11;7:539. doi: 10.1038/msb.2011.75.
Multiple sequence alignments are fundamental to many sequence analysis methods. Most alignments are computed using the progressive alignment heuristic. These methods are starting to become a bottleneck in some analysis pipelines when faced with data sets of the size of many thousands of sequences. Some methods allow computation of larger data sets while sacrificing quality, and others produce high-quality alignments, but scale badly with the number of sequences. In this paper, we describe a new program called Clustal Omega, which can align virtually any number of protein sequences quickly and that delivers accurate alignments. The accuracy of the package on smaller test cases is similar to that of the high-quality aligners. On larger data sets, Clustal Omega outperforms other packages in terms of execution time and quality. Clustal Omega also has powerful features for adding sequences to and exploiting information in existing alignments, making use of the vast amount of precomputed information in public databases like Pfam.
How viruses hijack cell regulation.
Davey, N.E., Trave, G. & Gibson, T.J.
Trends Biochem Sci. 2011 Mar;36(3):159-69. Epub 2010 Dec 9.
Viruses, as obligate intracellular parasites, are the pathogens that have the most intimate relationship with their host, and as such, their genomes have been shaped directly by interactions with the host proteome. Every step of the viral life cycle, from entry to budding, is orchestrated through interactions with cellular proteins. Accordingly, viruses will hijack and manipulate these proteins utilising any achievable mechanism. Yet, the extensive interactions of viral proteomes have yielded a conundrum: how do viruses commandeer so many diverse pathways and processes, given the obvious spatial constraints imposed by their compact genomes? One important approach is slowly being revealed: the extensive mimicry of host protein short linear motifs (SLiMs).
Ancient protostome origin of chemosensory ionotropic glutamate receptors and the evolution of insect taste and olfaction.
Croset, V., Rytz, R., Cummins, S.F., Budd, A., Brawand, D., Kaessmann, H., Gibson, T.J. & Benton, R.
PLoS Genet. 2010 Aug 19;6(8). pii: e1001064.
Ionotropic glutamate receptors (iGluRs) are a highly conserved family of ligand-gated ion channels present in animals, plants, and bacteria, which are best characterized for their roles in synaptic communication in vertebrate nervous systems. A variant subfamily of iGluRs, the Ionotropic Receptors (IRs), was recently identified as a new class of olfactory receptors in the fruit fly, Drosophila melanogaster, hinting at a broader function of this ion channel family in detection of environmental, as well as intercellular, chemical signals. Here, we investigate the origin and evolution of IRs by comprehensive evolutionary genomics and in situ expression analysis. In marked contrast to the insect-specific Odorant Receptor family, we show that IRs are expressed in olfactory organs across Protostomia (a major branch of the animal kingdom that encompasses arthropods, nematodes, and molluscs), indicating that they represent an ancestral protostome chemosensory receptor family. Two subfamilies of IRs are distinguished: conserved "antennal IRs," which likely define the first olfactory receptor family of insects, and species-specific "divergent IRs," which are expressed in peripheral and internal gustatory neurons, implicating this family in taste and food assessment. Comparative analysis of drosophilid IRs reveals the selective forces that have shaped the repertoires in flies with distinct chemosensory preferences. Examination of IR gene structure and genomic distribution suggests both non-allelic homologous recombination and retroposition contributed to the expansion of this multigene family. Together, these findings lay a foundation for functional analysis of these receptors in both neurobiological and evolutionary studies. Furthermore, this work identifies novel targets for manipulating chemosensory-driven behaviours of agricultural pests and disease vectors.
EpiC: an open resource for exploring epitopes to aid antibody-based experiments.
Haslam, N.J. & Gibson, T.J.
J Proteome Res. 2010 Jul 2;9(7):3759-63.
Antibodies are a primary research tool for a diverse range of experiments in biology, from development to pathology. Their utility is derived from their ability to specifically identify proteins at a high level of sensitivity. This diversity of experimental requirements stretches the capabilities of these key research reagents. However, antibodies seem well placed to answer the challenges of the forthcoming proteome-scale biology. Their use in such a wide variety of experimental requirements impacts on the choice of epitope used to raise the antibody. Understanding the constraints imposed by the experimental configuration is crucial to developing well-characterized affinity reagents. Their application to a wide range of biological fields and relatively low-cost of manufacture has ensured that the demand for a resource of well-characterized antibodies will remain high and that they will be an important biological resource for the foreseeable future. This demand will only increase as the number of therapeutic targets continues to grow. Current tools to aid in the production of affinity reagents are disparate and not freely available. We present a freely available Web resource ( http://epic.embl.de ) for the proteomics community; the Epitope Choice Resource (EpiC) for the selection of epitopes and characterization of the target protein. It provides the community with a single Web-based portal for the exploration of epitopes on a target protein and connects over the Internet to a wide range of bioinformatic tools ensuring that data being presented are up to date.
Transcription of Alu DNA elements in blood cells of sporadic Creutzfeldt-Jakob disease (sCJD).
Kiesel, P., Gibson, T.J., Ciesielczyk, B., Bodemer, M., Kaup, F.J., Bodemer, W., Zischler, H. & Zerr, I.
Prion. 2010 Apr-Jun;4(2):87-93. Epub 2010 Apr 5.
Alu DNA elements were long considered to be of no biological significance and thus have been only poorly defined. However, in the past Alu DNA elements with well-defined nucleotide sequences have been suspected to contribute to disease, but the role of Alu DNA element transcripts has rarely been investigated. For the first time, we determined in a real-time approach Alu DNA element transcription in buffy coat cells isolated from the blood of humans suffering from sporadic Creutzfeldt-Jakob disease (sCJD) and other neurodegenerative disorders. The reverse transcribed Alu transcripts were amplified and their cDNA sequences were aligned to genomic regions best fitted to database genomic Alu DNA element sequences deposited in the UCSC and NCBI data bases. Our cloned Alu RNA/cDNA sequences were widely distributed in the human genome and preferably belonged to the "young" Alu Y family. We also observed that some RNA/cDNA clones could be aligned to several chromosomes because of the same degree of identity and score to resident genomic Alu DNA elements. These elements, called paralogues, have purportedly been recently generated by retrotransposition. Along with cases of sCJD we also included cases of dementia and Alzheimer disease (AD). Each group revealed a divergent pattern of transcribed Alu elements. Chromosome 2 was the most preferred site in sCJD cases, besides chromosome 17; in AD cases chromosome 11 was overrepresented whereas chromosomes 2, 3 and 17 were preferred active Alu loci in controls. Chromosomes 2, 12 and 17 gave rise to Alu transcripts in dementia cases. The detection of putative Alu paralogues widely differed depending on the disease. A detailed data search revealed that some cloned Alu transcripts originated from RNA polymerase III transcription since the genomic sites of their Alu elements were found between genes. Other Alu DNA elements could be located close to or within coding regions of genes. 
In general, our observations suggest that identification and genomic localization of active Alu DNA elements could be further developed as a surrogate marker for differential gene expression in disease. A sufficient number of cases are necessary for statistical significance before Alu DNA elements can be considered useful to differentiate neurodegenerative diseases from controls.
ELM: the status of the 2010 eukaryotic linear motif resource.
Gould, C.M., Diella, F., Via, A., Puntervoll, P., Gemund, C., Chabanis-Davidson, S., Michael, S., Sayadi, A., Bryne, J.C., Chica, C., Seiler, M., Davey, N.E., Haslam, N., Weatheritt, R.J., Budd, A., Hughes, T., Pas, J., Rychlewski, L., Trave, G., Aasland, R., Helmer-Citterich, M., Linding, R. & Gibson, T.J.
Nucleic Acids Res. 2010 Jan;38(Database issue):D167-80. doi: 10.1093/nar/gkp1016. Epub 2009 Nov 17.
Linear motifs are short segments of multidomain proteins that provide regulatory functions independently of protein tertiary structure. Much of intracellular signalling passes through protein modifications at linear motifs. Many thousands of linear motif instances, most notably phosphorylation sites, have now been reported. Although clearly very abundant, linear motifs are difficult to predict de novo in protein sequences due to the difficulty of obtaining robust statistical assessments. The ELM resource at http://elm.eu.org/ provides an expanding knowledge base, currently covering 146 known motifs, with annotation that includes >1300 experimentally reported instances. ELM is also an exploratory tool for suggesting new candidates of known linear motifs in proteins of interest. Information about protein domains, protein structure and native disorder, cellular and taxonomic contexts is used to reduce or deprecate false positive matches. Results are graphically displayed in a 'Bar Code' format, which also displays known instances from homologous proteins through a novel 'Instance Mapper' protocol based on PHI-BLAST. ELM server output provides links to the ELM annotation as well as to a number of remote resources. Using the links, researchers can explore the motifs, proteins, complex structures and associated literature to evaluate whether candidate motifs might be worth experimental investigation.
A community standard format for the representation of protein affinity reagents.
Gloriam, D.E., Orchard, S., Bertinetti, D., Bjorling, E., Bongcam-Rudloff, E., Borrebaeck, C.A., Bourbeillon, J., Bradbury, A.R., de Daruvar, A., Dubel, S., Frank, R., Gibson, T.J., Gold, L., Haslam, N., Herberg, F.W., Hiltke, T., Hoheisel, J.D., Kerrien, S., Koegl, M., Konthur, Z., Korn, B., Landegren, U., Montecchi-Palazzi, L., Palcy, S., Rodriguez, H., Schweinsberg, S., Sievert, V., Stoevesandt, O., Taussig, M.J., Ueffing, M., Uhlen, M., van der Maarel, S., Wingren, C., Woollard, P., Sherman, D.J. & Hermjakob, H.
Mol Cell Proteomics. 2010 Jan;9(1):1-10. doi: 10.1074/mcp.M900185-MCP200. Epub 2009 Aug 11.
Protein affinity reagents (PARs), most commonly antibodies, are essential reagents for protein characterization in basic research, biotechnology, and diagnostics as well as the fastest growing class of therapeutics. Large numbers of PARs are available commercially; however, their quality is often uncertain. In addition, currently available PARs cover only a fraction of the human proteome, and their cost is prohibitive for proteome scale applications. This situation has triggered several initiatives involving large scale generation and validation of antibodies, for example the Swedish Human Protein Atlas and the German Antibody Factory. Antibodies targeting specific subproteomes are being pursued by members of Human Proteome Organisation (plasma and liver proteome projects) and the United States National Cancer Institute (cancer-associated antigens). ProteomeBinders, a European consortium, aims to set up a resource of consistently quality-controlled protein-binding reagents for the whole human proteome. An ultimate PAR database resource would allow consumers to visit one on-line warehouse and find all available affinity reagents from different providers together with documentation that facilitates easy comparison of their cost and quality. However, in contrast to, for example, nucleotide databases among which data are synchronized between the major data providers, current PAR producers, quality control centers, and commercial companies all use incompatible formats, hindering data exchange. Here we propose Proteomics Standards Initiative (PSI)-PAR as a global community standard format for the representation and exchange of protein affinity reagent data. The PSI-PAR format is maintained by the Human Proteome Organisation PSI and was developed within the context of ProteomeBinders by building on a mature proteomics standard format, PSI-molecular interaction, which is a widely accepted and established community standard for molecular interaction data. 
Further information and documentation are available on the PSI-PAR web site.
Cell regulation: determined to signal discrete cooperation.
Gibson, T.J.
Trends Biochem Sci. 2009 Oct;34(10):471-82. Epub 2009 Sep 8.
Do kinases cascade? How well is cell regulation understood? What are the best ways to model regulatory systems? Attempts to answer such questions can have bearings on the way in which research is conducted. Fortunately there are recurring themes in regulatory processes from many different cellular contexts, which might provide useful guidance. Three principles seem to be almost universal: regulatory interactions are cooperative; regulatory decisions are made by large dynamic protein complexes; and regulation is intricately networked. A fourth principle, although not universal, is remarkably common: regulatory proteins are actively placed where they are needed. Here, I argue that the true nature of cell signalling and our perceptions of it are in a state of discord. This raises the question: Are our misconceptions detrimental to progress in biomedical science?
Dimerization and protein binding specificity of the U2AF homology motif of the splicing factor Puf60.
Corsini, L., Hothorn, M., Stier, G., Rybin, V., Scheffzek, K., Gibson, T.J. & Sattler, M.
J Biol Chem. 2009 Jan 2;284(1):630-9. doi: 10.1074/jbc.M805395200. Epub 2008 Oct 29.
PUF60 is an essential splicing factor functionally related and homologous to U2AF(65). Its C-terminal domain belongs to the family of U2AF (U2 auxiliary factor) homology motifs (UHM), a subgroup of RNA recognition motifs that bind to tryptophan-containing linear peptide motifs (UHM ligand motifs, ULMs) in several nuclear proteins. Here, we show that the Puf60 UHM is mainly monomeric in physiological buffer, whereas its dimerization is induced upon the addition of SDS. The crystal structure of PUF60-UHM at 2.2 angstroms resolution, NMR data, and mutational analysis reveal that the dimer interface is mediated by electrostatic interactions involving a flexible loop. Using glutathione S-transferase pulldown experiments, isothermal titration calorimetry, and NMR titrations, we find that Puf60-UHM binds to ULM sequences in the splicing factors SF1, U2AF65, and SF3b155. Compared with U2AF65-UHM, Puf60-UHM has distinct binding preferences to ULMs in the N terminus of SF3b155. Our data suggest that the functional cooperativity between U2AF65 and Puf60 may involve simultaneous interactions of the two proteins with SF3b155.
KEPE--a motif frequently superimposed on sumoylation sites in metazoan chromatin proteins and transcription factors.
Diella, F., Chabanis, S., Luck, K., Chica, C., Ramu, C., Nerlov, C. & Gibson, T.J.
Bioinformatics. 2009 Jan 1;25(1):1-5. Epub 2008 Nov 24.
MOTIVATION: We noted that the sumoylation site in C/EBP homologues is conserved beyond the canonical consensus sequence for sumoylation. Therefore, we investigated whether this pattern might define a more general protein motif. RESULTS: We undertook a survey of the human proteome using a regular expression based on the C/EBP motif. This revealed significant enrichment of the motif using different Gene Ontology terms (e.g. 'transcription') that pertain to the nucleus. When considering requirements for the motif to be functional (evolutionary conservation, structural accessibility of the motif and proper cell localization of the protein), more than 130 human proteins were retrieved from the UniProt/Swiss-Prot database. These candidates were particularly enriched in transcription factors, including FOS, JUN, Hif-1alpha, MLL2 and members of the KLF, MAF and NFATC families; chromatin modifiers like CHD-8, HDAC4 and DNA Top1; and the transcriptional regulatory kinases HIPK1 and HIPK2. The KEPE motif appears to be restricted to the metazoan lineage and has three length variants (short, medium and long), which do not appear to interchange.
A structure filter for the Eukaryotic Linear Motif Resource.
Via, A., Gould, C.M., Gemund, C., Gibson, T.J. & Helmer-Citterich, M.
BMC Bioinformatics. 2009 Oct 24;10:351.
BACKGROUND: Many proteins are highly modular, being assembled from globular domains and segments of natively disordered polypeptides. Linear motifs, short sequence modules functioning independently of protein tertiary structure, are most abundant in natively disordered polypeptides but are also found in accessible parts of globular domains, such as exposed loops. The prediction of novel occurrences of known linear motifs attempts the difficult task of distinguishing functional matches from stochastically occurring non-functional matches. Although functionality can only be confirmed experimentally, confidence in a putative motif is increased if a motif exhibits attributes associated with functional instances such as occurrence in the correct taxonomic range, cellular compartment, conservation in homologues and accessibility to interacting partners. Several tools now use these attributes to classify putative motifs based on confidence of functionality. RESULTS: Current methods assessing motif accessibility do not consider much of the information available, either predicting accessibility from primary sequence or regarding any motif occurring in a globular region as low confidence. We present a method considering accessibility and secondary structural context derived from experimentally solved protein structures to rectify this situation. Putatively functional motif occurrences are mapped onto a representative domain, given that a high quality reference SCOP domain structure is available for the protein itself or a close relative. Candidate motifs can then be scored for solvent-accessibility and secondary structure context. The scores are calibrated on a benchmark set of experimentally verified motif instances compared with a set of random matches. A combined score yields 3-fold enrichment for functional motifs assigned to high confidence classifications and 2.5-fold enrichment for random motifs assigned to low confidence classifications. 
The structure filter is implemented as a pipeline with both a graphical interface via the ELM resource http://elm.eu.org/ and through a Web Service protocol. CONCLUSION: New occurrences of known linear motifs require experimental validation as the bioinformatics tools currently have limited reliability. The ELM structure filter will aid users assessing candidate motifs presenting in globular structural regions. Most importantly, it will help users to decide whether to expend their valuable time and resources on experimental testing of interesting motif candidates.
Phosphorylation of S776 and 14-3-3 binding modulate ataxin-1 interaction with splicing factors.
de Chiara, C., Menon, R.P., Strom, M., Gibson, T.J. & Pastore, A.
PLoS One. 2009 Dec 23;4(12):e8372.
Ataxin-1 (Atx1), a member of the polyglutamine (polyQ) expanded protein family, is responsible for spinocerebellar ataxia type 1. Requirements for developing the disease are polyQ expansion, nuclear localization and phosphorylation of S776. Using a combination of bioinformatics, cell and structural biology approaches, we have identified a UHM ligand motif (ULM), present in proteins associated with splicing, in the C-terminus of Atx1 and shown that Atx1 interacts with and influences the function of the splicing factor U2AF65 via this motif. ULM comprises S776 of Atx1 and overlaps with a nuclear localization signal and a 14-3-3 binding motif. We demonstrate that phosphorylation of S776 provides the molecular switch which discriminates between 14-3-3 and components of the spliceosome. We also show that an S776D Atx1 mutant previously designed to mimic phosphorylation is unsuitable for this aim because of the different chemical properties of the two groups. Our results indicate that Atx1 is part of a complex network of interactions with splicing factors and suggest that development of the pathology is the consequence of a competition of aggregation with native interactions. Studies of the interactions formed by non-expanded Atx1 thus provide valuable hints for understanding both the function of the non-pathologic protein and the causes of the disease.
Evidence for the concerted evolution between short linear protein motifs and their flanking regions.
Chica, C., Diella, F. & Gibson, T.J.
PLoS One. 2009 Jul 8;4(7):e6052.
BACKGROUND: Linear motifs are short modules of protein sequences that play a crucial role in mediating and regulating many protein-protein interactions. The function of linear motifs strongly depends on the context, e.g. functional instances mainly occur inside flexible regions that are accessible for interaction. Sometimes linear motifs appear as isolated islands of conservation in multiple sequence alignments. However, they also occur in larger blocks of sequence conservation, suggesting an active role for the neighbouring amino acids. RESULTS: The evolution of regions flanking 116 functional linear motif instances was studied. The conservation of the amino acid sequence and order/disorder tendency of those regions was related to presence/absence of the instance. For the majority of the analysed instances, the pairs of sequences conserving the linear motif were also observed to maintain a similar local structural tendency and/or to have higher local sequence conservation when compared to pairs of sequences where one is missing the linear motif. Furthermore, those instances have a higher chance to co-evolve with the neighbouring residues in comparison to the distant ones. Those findings are supported by examples where the regulation of the linear motif-mediated interaction has been shown to depend on the modifications (e.g. phosphorylation) at neighbouring positions or is thought to benefit from the binding versatility of disordered regions. CONCLUSION: The results suggest that flanking regions are relevant for linear motif-mediated interactions, both at the structural and sequence level. More interestingly, they indicate that the prediction of linear motif instances can be enriched with contextual information by performing a sequence analysis similar to the one presented here. This can facilitate the understanding of the role of these predicted instances in determining the protein function inside the broader context of the cellular network where they arise.
Malectin: a novel carbohydrate-binding protein of the endoplasmic reticulum and a candidate player in the early steps of protein N-glycosylation.
Schallus, T., Jaeckh, C., Feher, K., Palma, A.S., Liu, Y., Simpson, J.C., Mackeen, M., Stier, G., Gibson, T.J., Feizi, T., Pieler, T. & Muhle-Goll, C.
Mol Biol Cell. 2008 Aug;19(8):3404-14. Epub 2008 Jun 4.
N-Glycosylation starts in the endoplasmic reticulum (ER) where a 14-sugar glycan composed of three glucoses, nine mannoses, and two N-acetylglucosamines (Glc(3)Man(9)GlcNAc(2)) is transferred to nascent proteins. The glucoses are sequentially trimmed by ER-resident glucosidases. The Glc(3)Man(9)GlcNAc(2) moiety is the substrate for oligosaccharyltransferase; the Glc(1)Man(9)GlcNAc(2) and Man(9)GlcNAc(2) intermediates are signals for glycoprotein folding and quality control in the calnexin/calreticulin cycle. Here, we report a novel membrane-anchored ER protein that is highly conserved in animals and that recognizes the Glc(2)-N-glycan. Structure determination by nuclear magnetic resonance showed that its luminal part is a carbohydrate binding domain that recognizes glucose oligomers. Carbohydrate microarray analyses revealed a uniquely selective binding to a Glc(2)-N-glycan probe. The localization, structure, and binding specificity of this protein, which we have named malectin, open the way to studies of its role in the genesis, processing and secretion of N-glycosylated proteins.
A careful disorderliness in the proteome: sites for interaction and targets for future therapies.
Russell, R.B. & Gibson, T.J.
FEBS Lett. 2008 Apr 9;582(8):1271-5. doi: 10.1016/j.febslet.2008.02.027. Epub 2008 Feb 20.
The community of scientists interested in studying intrinsically unstructured (or disordered) proteins has emerged in recent years. What began as a controversial idea has become an established phenomenon. The new, greater focus on proteins that are in some way normally unstructured promises to provide a greater understanding of protein function, particularly with respect to protein-protein interactions. These regions also offer new possibilities into how interactions can be targeted by small molecules.
Discovery of candidate KEN-box motifs using cell cycle keyword enrichment combined with native disorder prediction and motif conservation.
Michael, S., Trave, G., Ramu, C., Chica, C. & Gibson, T.J.
Bioinformatics. 2008 Feb 15;24(4):453-7. Epub 2008 Jan 9.
MOTIVATION: KEN-box-mediated target selection is one of the mechanisms used in the proteasomal destruction of mitotic cell cycle proteins via the APC/C complex. While annotating the Eukaryotic Linear Motif resource (ELM, http://elm.eu.org/), we found that KEN motifs were significantly enriched in human protein entries with cell cycle keywords in the UniProt/Swiss-Prot database, implying that KEN-boxes might be more common than reported. RESULTS: Matches to short linear motifs in protein database searches are not, per se, significant. KEN-box enrichment with cell cycle Gene Ontology terms suggests that collectively these motifs are functional but does not prove that any given instance is so. Candidates were surveyed for native disorder prediction using GlobPlot and IUPred and for motif conservation in homologues. Among >25 strong new candidates, the most notable are human HIPK2, CHFR, CDC27, Dab2, Upf2, kinesin Eg5, DNA Topoisomerase 1 and yeast Cdc5 and Swi5. A similar number of weaker candidates were present. These proteins have yet to be tested for APC/C targeted destruction, providing potential new avenues of research.
Phospho.ELM: a database of phosphorylation sites--update 2008.
Diella, F., Gould, C.M., Chica, C., Via, A. & Gibson, T.J.
Nucleic Acids Res. 2008 Jan;36(Database issue):D240-4. Epub 2007 Oct 25.
Phospho.ELM is a manually curated database of eukaryotic phosphorylation sites. The resource includes data collected from published literature as well as high-throughput data sets. The current release of Phospho.ELM (version 7.0, July 2007) contains 4078 phospho-protein sequences covering 12 025 phospho-serine, 2362 phospho-threonine and 2083 phospho-tyrosine sites. The entries provide information about the phosphorylated proteins and the exact position of known phosphorylated instances, the kinases responsible for the modification (where known) and links to bibliographic references. The database entries have hyperlinks to easily access further information from UniProt, PubMed, SMART, ELM, MSD as well as links to the protein interaction databases MINT and STRING. A new BLAST search tool, complementary to retrieval by keyword and UniProt accession number, allows users to submit a protein query (by sequence or UniProt accession) to search against the curated data set of phosphorylated peptides. Phospho.ELM is available on line at: http://phospho.elm.eu.org.
A new protein linear motif benchmark for multiple sequence alignment software.
Perrodou, E., Chica, C., Poch, O., Gibson, T.J. & Thompson, J.D.
BMC Bioinformatics. 2008 Apr 25;9:213.
BACKGROUND: Linear motifs (LMs) are abundant short regulatory sites used for modulating the functions of many eukaryotic proteins. They play important roles in post-translational modification, cell compartment targeting, docking sites for regulatory complex assembly and protein processing and cleavage. Methods for LM detection are now being developed that are strongly dependent on scores for motif conservation in homologous proteins. However, most LMs are found in natively disordered polypeptide segments that evolve rapidly, unhindered by structural constraints on the sequence. These regions of modular proteins are difficult to align using classical multiple sequence alignment programs that are specifically optimised to align the globular domains. As a consequence, poor motif alignment quality is hindering efforts to detect new LMs. RESULTS: We have developed a new benchmark, as part of the BAliBASE suite, designed to assess the ability of standard multiple alignment methods to detect and align LMs. The reference alignments are organised into different test sets representing real alignment problems and contain examples of experimentally verified functional motifs, extracted from the Eukaryotic Linear Motif (ELM) database. The benchmark has been used to evaluate and compare a number of multiple alignment programs. With distantly related proteins, the worst alignment program correctly aligns 48% of LMs compared to 73% for the best program. However, the performance of all the programs is adversely affected by the introduction of other sequences containing false positive motifs. The ranking of the alignment programs based on LM alignment quality is similar to that observed when considering full-length protein alignments, however little correlation was observed between LM and overall alignment quality for individual alignment test cases. 
CONCLUSION: We have shown that none of the programs currently available is capable of reliably aligning LMs in distantly related sequences and we have highlighted a number of specific problems. The results of the tests suggest possible ways to improve program accuracy for difficult, divergent sequences.
Understanding eukaryotic linear motifs and their role in cell signaling and regulation.
Diella, F., Haslam, N., Chica, C., Budd, A., Michael, S., Brown, N.P., Trave, G. & Gibson, T.J.
Front Biosci. 2008 May 1;13:6580-603.
It is now clear that a detailed picture of cell regulation requires a comprehensive understanding of the abundant short protein motifs through which signaling is channeled. The current body of knowledge has slowly accumulated through piecemeal experimental investigation of individual motifs in signaling. Computational methods contributed little to this process. A new generation of bioinformatics tools will aid the future investigation of motifs in regulatory proteins, and the disordered polypeptide regions in which they frequently reside. Allied to high throughput methods such as phosphoproteomics, signaling networks are becoming amenable to experimental deconstruction. In this review, we summarise the current state of linear motif biology, which uses low affinity interactions to create cooperative, combinatorial and highly dynamic regulatory protein complexes. The discrete deterministic properties implicit to these assemblies suggest that models for cell regulatory networks in systems biology should neither be overly dependent on stochastic nor on smooth deterministic approximations.
A tree-based conservation scoring method for short linear motifs in multiple alignments of protein sequences.
Chica, C., Labarga, A., Gould, C.M., Lopez, R. & Gibson, T.J.
BMC Bioinformatics. 2008 May 6;9:229.
BACKGROUND: The structure of many eukaryotic cell regulatory proteins is highly modular. They are assembled from globular domains, segments of natively disordered polypeptides and short linear motifs. The latter are involved in protein interactions and formation of regulatory complexes. The function of such proteins, which may be difficult to define, is the aggregate of the subfunctions of the modules. It is therefore desirable to efficiently predict linear motifs with some degree of accuracy, yet sequence database searches return results that are not significant. RESULTS: We have developed a method for scoring the conservation of linear motif instances. It requires only primary sequence-derived information (e.g. multiple alignment and sequence tree) and takes into account the degenerate nature of linear motif patterns. On our benchmarking, the method accurately scores 86% of the known positive instances, while distinguishing them from random matches in 78% of the cases. The conservation score is implemented as a real time application designed to be integrated into other tools. It is currently accessible via a Web Service or through a graphical interface. CONCLUSION: The conservation score improves the prediction of linear motifs, by discarding those matches that are unlikely to be functional because they have not been conserved during the evolution of the protein sequences. It is especially useful for instances in non-structured regions of the proteins, where a domain masking filtering strategy is not applicable.
Clustal W and Clustal X version 2.0.
Larkin, M.A., Blackshields, G., Brown, N.P., Chenna, R., McGettigan, P.A., McWilliam, H., Valentin, F., Wallace, I.M., Wilm, A., Lopez, R., Thompson, J.D., Gibson, T.J. & Higgins, D.G.
Bioinformatics. 2007 Nov 1;23(21):2947-8. Epub 2007 Sep 10.
SUMMARY: The Clustal W and Clustal X multiple sequence alignment programs have been completely rewritten in C++. This will facilitate the further development of the alignment algorithms in the future and has allowed proper porting of the programs to the latest versions of Linux, Macintosh and Windows operating systems. AVAILABILITY: The programs can be run on-line from the EBI web server: http://www.ebi.ac.uk/tools/clustalw2. The source code and executables for Windows, Linux and Macintosh computers are available from the EBI ftp site ftp://ftp.ebi.ac.uk/pub/software/clustalw2/
ProteomeBinders: planning a European resource of affinity reagents for analysis of the human proteome.
Taussig, M.J., Stoevesandt, O., Borrebaeck, C.A., Bradbury, A.R., Cahill, D., Cambillau, C., de Daruvar, A., Dubel, S., Eichler, J., Frank, R., Gibson, T.J., Gloriam, D., Gold, L., Herberg, F.W., Hermjakob, H., Hoheisel, J.D., Joos, T.O., Kallioniemi, O., Koegl, M., Konthur, Z., Korn, B., Kremmer, E., Krobitsch, S., Landegren, U., van der Maarel, S., McCafferty, J., Muyldermans, S., Nygren, P.A., Palcy, S., Pluckthun, A., Polic, B., Przybylski, M., Saviranta, P., Sawyer, A., Sherman, D.J., Skerra, A., Templin, M., Ueffing, M. & Uhlen, M.
Nat Methods. 2007 Jan;4(1):13-7.
Systematic discovery of new recognition peptides mediating protein interaction networks.
Neduva, V., Linding, R., Su-Angrand, I., Stark, A., de Masi, F., Gibson, T.J., Lewis, J.D., Serrano, L. & Russell, R.B.
PLoS Biol. 2005 Dec;3(12):e405. Epub 2005 Nov 15.
Many aspects of cell signalling, trafficking, and targeting are governed by interactions between globular protein domains and short peptide segments. These domains often bind multiple peptides that share a common sequence pattern, or "linear motif" (e.g., SH3 binding to PxxP). Many domains are known, though comparatively few linear motifs have been discovered. Their short length (three to eight residues), and the fact that they often reside in disordered regions in proteins makes them difficult to detect through sequence comparison or experiment. Nevertheless, each new motif provides critical molecular details of how interaction networks are constructed, and can explain how one protein is able to bind to very different partners. Here we show that binding motifs can be detected using data from genome-scale interaction studies, and thus avoid the normally slow discovery process. Our approach based on motif over-representation in non-homologous sequences, rediscovers known motifs and predicts dozens of others. Direct binding experiments reveal that two predicted motifs are indeed protein-binding modules: a DxxDxxxD protein phosphatase 1 binding motif with a KD of 22 microM and a VxxxRxYS motif that binds Translin with a KD of 43 microM. We estimate that there are dozens or even hundreds of linear motifs yet to be discovered that will give molecular insight into protein networks and greatly illuminate cellular processes.
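As an illustration of the pattern notation used in the abstract above (PxxP, DxxDxxxD, VxxxRxYS), the following minimal sketch treats each lowercase "x" as a wildcard for any residue and scans a sequence for matches. It is not the over-representation method of the paper; the function names and the toy sequence are hypothetical and chosen only for demonstration.

```python
import re

def motif_to_regex(pattern):
    """Compile an 'x'-wildcard motif such as 'DxxDxxxD' into a regex.

    Assumption: lowercase 'x' means "any residue"; every other
    character is taken as a literal amino-acid letter.
    """
    body = "".join("." if ch == "x" else re.escape(ch) for ch in pattern)
    # A capturing group inside a lookahead reports overlapping matches too.
    return re.compile("(?=(" + body + "))")

def find_motif(pattern, sequence):
    """Return (1-based start, matched segment) for every occurrence."""
    rx = motif_to_regex(pattern)
    return [(m.start() + 1, m.group(1)) for m in rx.finditer(sequence)]

# Hypothetical toy sequence, not a real protein.
toy = "MKDAADKLGDPAPPLPVAAKRSYSQ"

for motif in ("PxxP", "DxxDxxxD", "VxxxRxYS"):
    print(motif, find_motif(motif, toy))
```

A plain pattern match like this only lists candidate sites. As several abstracts in this list stress, such matches are not significant on their own and need supporting context such as conservation, native disorder and experimental validation.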
Anopheles gambiae SRPN2 facilitates midgut invasion by the malaria parasite Plasmodium berghei.
Michel, K., Budd, A., Pinto, S., Gibson, T.J. & Kafatos, F.C.
EMBO Rep 2005 Sep;6(9):891-7.
We report on a phylogenetic and functional analysis of genes encoding three mosquito serpins (SRPN1, SRPN2 and SRPN3), which resemble known inhibitors of prophenoloxidase-activating enzymes in other insects. Following RNA interference induction by double-stranded RNA injection, knockdown of SRPN2 in adult Anopheles gambiae produced a notable phenotype: the appearance of melanotic pseudotumours, which increased in size and number with time, indicating spontaneous melanization and association with an observed lifespan reduction. Furthermore, knockdown of SRPN2 strongly interfered with the invasion of A. gambiae midguts by the rodent malaria parasite Plasmodium berghei. It did not affect ookinete formation, but markedly reduced oocyst numbers, by 97%, as a result of increased ookinete lysis and melanization.
Patterns and clusters within the PSM column in TiBS, 1992-2004.
McEntyre, J.R. & Gibson, T.J.
Trends Biochem Sci 2004 Dec;29(12):627-33.
Sequence similarities among proteins can infer biological function and evolutionary relationships--a powerful approach for investigating new proteins and suggesting future experiments. The availability of public sequence databases and freely distributed tools for sequence analysis has meant that researchers from all over the world can use this approach. For the past 12 years, the Protein Sequence Motif column in TiBS has provided a platform for documenting interesting discoveries from sequence analyses. As the column comes to an end, we look at the published contributions over the years and reflect on sequence analysis through the beginning of the genomic era.
The C terminus of fragile X mental retardation protein interacts with the multi-domain Ran-binding protein in the microtubule-organising centre.
Menon, R.P., Gibson, T.J. & Pastore, A.
J Mol Biol 2004 Oct 8;343(1):43-53.
Absence of the fragile X mental retardation protein (FMRP) causes fragile X syndrome, the most common form of hereditary mental retardation. FMRP is a mainly cytoplasmic protein thought to be involved in repression of translation, through a complex network of protein-protein and protein-RNA interactions. Most of the currently known protein partners of FMRP recognise the conserved N terminus of the protein. No interaction has yet been mapped to the highly charged, poorly conserved C terminus, so far thought to be involved in RNA recognition through an RGG motif. In the present study, we show that a two-hybrid bait containing residues 419-632 of human FMRP fishes out a protein that spans the sequence of the Ran-binding protein in the microtubule-organising centre (RanBPM/RanBP9). Specific interaction of RanBPM with FMRP was confirmed by in vivo and in vitro assays. In brain tissue sections, RanBPM is highly expressed in the neurons of cerebral cortex and the cerebellar purkinje cells, in a pattern similar to that described for FMRP. Sequence analysis shows that RanBPM is a multi-domain protein. The interaction with FMRP was mapped in a newly identified CRA motif present in the RanBPM C terminus. Our results suggest that the functional role of RanBPM binding is modulation of the RNA-binding properties of FMRP.
Phospho.ELM: a database of experimentally verified phosphorylation sites in eukaryotic proteins.
Diella, F., Cameron, S., Gemund, C., Linding, R., Via, A., Kuster, B., Sicheritz-Ponten, T., Blom, N. & Gibson, T.J.
BMC Bioinformatics 2004 Jun 22;5(1):79.
BACKGROUND: Post-translational phosphorylation is one of the most common protein modifications. Phosphoserine, threonine and tyrosine residues play critical roles in the regulation of many cellular processes. The fast growing number of research reports on protein phosphorylation points to a general need for an accurate database dedicated to phosphorylation to provide easily retrievable information on phosphoproteins. DESCRIPTION: Phospho.ELM http://phospho.elm.eu.org is a new resource containing experimentally verified phosphorylation sites manually curated from the literature and is developed as part of the ELM (Eukaryotic Linear Motif) resource. Phospho.ELM constitutes the largest searchable collection of phosphorylation sites available to the research community. The Phospho.ELM entries store information about substrate proteins with the exact positions of residues known to be phosphorylated by cellular kinases. Additional annotation includes literature references, subcellular compartment, tissue distribution, and information about the signaling pathways involved as well as links to the molecular interaction database MINT. Phospho.ELM version 2.0 contains 1703 phosphorylation site instances for 556 phosphorylated proteins. CONCLUSION: Phospho.ELM will be a valuable tool both for molecular biologists working on protein phosphorylation sites and for bioinformaticians developing computational predictions on the specificity of phosphorylation reactions.
Nucleosome binding by the bromodomain and PHD finger of the transcriptional cofactor p300.
Ragvin, A., Valvatne, H., Erdal, S., Arskog, V., Tufteland, K.R., Breen, K., Øyan, A.M., Eberharter, A., Gibson, T.J., Becker, P.B. & Aasland, R.
J Mol Biol 2004 Apr 2;337(4):773-88.
The PHD finger and the bromodomain are small protein domains that occur in many proteins associated with phenomena related to chromatin. The bromodomain has been shown to bind acetylated lysine residues on histone tails. Lysine acetylation is one of several histone modifications that have been proposed to form the basis for a mechanism for recording epigenetically stable marks in chromatin, known as the histone code. The bromodomain is therefore thought to read a part of the histone code. Since PHD fingers often occur in proteins next to bromodomains, we have tested the hypothesis that the PHD finger can also interact with nucleosomes. Using two different in vitro assays, we found that the bromodomain/PHD finger region of the transcriptional cofactor p300 can bind to nucleosomes that have a high degree of histone acetylation. In a nucleosome retention assay, both domains were required for binding. Replacement of the p300 PHD finger with other PHD fingers resulted in loss of nucleosome binding. In an electrophoretic mobility shift assay, each domain alone showed, however, nucleosome-binding activity. The binding of the isolated PHD finger to nucleosomes was independent of the histone acetylation levels. Our data are consistent with a model where the two domains cooperate in nucleosome binding. In this model, both the bromodomain and the PHD finger contact the nucleosome while simultaneously interacting with each other.
Bacterial alpha2-macroglobulins: colonization factors acquired by horizontal gene transfer from the metazoan genome?
Budd, A., Blandin, S., Levashina, E.A. & Gibson, T.J.
Genome Biol 2004;5(6):R38. Epub 2004 May 26.
BACKGROUND: Invasive bacteria are known to have captured and adapted eukaryotic host genes. They also readily acquire colonizing genes from other bacteria by horizontal gene transfer. Closely related species such as Helicobacter pylori and Helicobacter hepaticus, which exploit different host tissues, share almost none of their colonization genes. The protease inhibitor alpha2-macroglobulin provides a major metazoan defense against invasive bacteria, trapping attacking proteases required by parasites for successful invasion. RESULTS: Database searches with metazoan alpha2-macroglobulin sequences revealed homologous sequences in bacterial proteomes. The bacterial alpha2-macroglobulin phylogenetic distribution is patchy and violates the vertical descent model. Bacterial alpha2-macroglobulin genes are found in diverse clades, including purple bacteria (proteobacteria), fusobacteria, spirochetes, bacteroidetes, deinococcids, cyanobacteria, planctomycetes and thermotogae. Most bacterial species with bacterial alpha2-macroglobulin genes exploit higher eukaryotes (multicellular plants and animals) as hosts. Both pathogenically invasive and saprophytically colonizing species possess bacterial alpha2-macroglobulins, indicating that bacterial alpha2-macroglobulin is a colonization rather than a virulence factor. CONCLUSIONS: Metazoan alpha2-macroglobulins inhibit proteases of pathogens. The bacterial homologs may function in reverse to block host antimicrobial defenses. alpha2-macroglobulin was probably acquired one or more times from metazoan hosts and has then spread widely through other colonizing bacterial species by more than 10 independent horizontal gene transfers. yfhM-like bacterial alpha2-macroglobulin genes are often found tightly linked with pbpC, encoding an atypical peptidoglycan transglycosylase, PBP1C, that does not function in vegetative peptidoglycan synthesis. We suggest that YfhM and PBP1C are coupled together as a periplasmic defense and repair system. Bacterial alpha2-macroglobulins might provide useful targets for enhancing vaccine efficacy in combating infections.
Protein disorder prediction. Implications for structural proteomics.
Linding, R., Jensen, L.J., Diella, F., Bork, P., Gibson, T.J. & Russell, R.B.
Structure (Camb) 2003 Nov;11(11):1453-9.
A great challenge in the proteomics and structural genomics era is to predict protein structure and function, including identification of those proteins that are partially or wholly unstructured. Disordered regions in proteins often contain short linear peptide motifs (e.g., SH3 ligands and targeting signals) that are important for protein function. We present here DisEMBL, a computational tool for prediction of disordered/unstructured regions within a protein sequence. As no clear definition of disorder exists, we have developed parameters based on several alternative definitions and introduced a new one based on the concept of "hot loops," i.e., coils with high temperature factors. Avoiding potentially disordered segments in protein expression constructs can increase expression, foldability, and stability of the expressed protein. DisEMBL is thus useful for target selection and the design of constructs as needed for many biochemical studies, particularly structural biology and structural genomics projects. The tool is freely available via a web interface (http://dis.embl.de) and can be downloaded for use in large-scale studies.
ELM server: a new resource for investigating short functional sites in modular eukaryotic proteins.
Puntervoll, P., Linding, R., Gemund, C., Chabanis-Davidson, S., Mattingsdal, M., Cameron, S., Martin, D.M., Ausiello, G., Brannetti, B., Costantini, A., Ferre, F., Maselli, V., Via, A., Cesareni, G., Diella, F., Superti-Furga, G., Wyrwicz, L., Ramu, C., McGuigan, C., Gudavalli, R., Letunic, I., Bork, P., Rychlewski, L., Kuster, B., Helmer-Citterich, M., Hunter, W.N., Aasland, R. & Gibson, T.J.
Nucleic Acids Res 2003 Jul 1;31(13):3625-30.
Multidomain proteins predominate in eukaryotic proteomes. Individual functions assigned to different sequence segments combine to create a complex function for the whole protein. While on-line resources are available for revealing globular domains in sequences, there has hitherto been no comprehensive collection of small functional sites/motifs comparable to the globular domain resources, yet these are as important for the function of multidomain proteins. Short linear peptide motifs are used for cell compartment targeting, protein-protein interaction, regulation by phosphorylation, acetylation, glycosylation and a host of other post-translational modifications. ELM, the Eukaryotic Linear Motif server at http://elm.eu.org/, is a new bioinformatics resource for investigating candidate short non-globular functional motifs in eukaryotic proteins, aiming to fill the void in bioinformatics tools. Sequence comparisons with short motifs are difficult to evaluate because the usual significance assessments are inappropriate. Therefore the server is implemented with several logical filters to eliminate false positives. Current filters are for cell compartment, globular domain clash and taxonomic range. In favourable cases, the filters can reduce the number of retained matches by an order of magnitude or more.
Multiple sequence alignment with the Clustal series of programs.
Chenna, R., Sugawara, H., Koike, T., Lopez, R., Gibson, T.J., Higgins, D.G. & Thompson, J.D.
Nucleic Acids Res 2003 Jul 1;31(13):3497-500.
The Clustal series of programs are widely used in molecular biology for the multiple alignment of both nucleic acid and protein sequences and for preparing phylogenetic trees. The popularity of the programs depends on a number of factors, including not only the accuracy of the results, but also the robustness, portability and user-friendliness of the programs. New features include NEXUS and FASTA format output, printing range numbers and faster tree calculation. Although Clustal was originally developed to run on a local computer, numerous Web servers have been set up, notably at the EBI (European Bioinformatics Institute) (http://www.ebi.ac.uk/clustalw/).
BLAST2SRS, a web server for flexible retrieval of related protein sequences in the SWISS-PROT and SPTrEMBL databases.
Bimpikis, K., Budd, A., Linding, R. & Gibson, T.J.
Nucleic Acids Res 2003 Jul 1;31(13):3792-4.
SRS (Sequence Retrieval System) is a widely used keyword search engine for querying biological databases. BLAST2 is the most widely used tool to query databases by sequence similarity search. These tools allow users to retrieve sequences by shared keyword or by shared similarity, with many public web servers available. However, with the increasingly large datasets available it is now quite common that a user is interested in some subset of homologous sequences but has no efficient way to restrict retrieval to that set. By allowing the user to control SRS from the BLAST output, BLAST2SRS (http://blast2srs.embl.de/) aims to meet this need. This server therefore combines the two ways to search sequence databases: similarity and keyword.
The phylogenetic distribution of frataxin indicates a role in iron-sulfur cluster protein assembly.
Huynen, M.A., Snel, B., Bork, P. & Gibson, T.J.
Hum Mol Genet 2001 Oct 1;10(21):2463-8.
Much has been learned about the cellular pathology of Friedreich's ataxia, a recessive neurodegenerative disease resulting from insufficient expression of the mitochondrial protein frataxin. However, the biochemical function of frataxin has remained obscure, hampering attempts at therapeutic intervention. To predict functional interactions of frataxin with other proteins we investigated whether its gene specifically co-occurs with any other genes in sequenced genomes. In 56 available genomes we identified two genes with identical phylogenetic distributions to the frataxin/cyaY gene: hscA and hscB/JAC1. These genes have not only emerged in the same evolutionary lineage as the frataxin gene, they have also been lost at least twice with it, and they have been horizontally transferred with it in the evolution of the mitochondria. The proteins encoded by hscA and hscB, the chaperone HSP66 and the co-chaperone HSP20, have been shown to be required for the synthesis of 2Fe-2S clusters on ferredoxin in proteobacteria. JAC1, an ortholog of hscB, and SSQ1, a paralog of hscA, have been shown to be required for iron-sulfur cluster assembly in mitochondria of Saccharomyces cerevisiae. Combining data on the co-occurrence of genes in genomes with experimental and predicted cellular localization data of their proteins supports the hypothesis that frataxin is directly involved in iron-sulfur cluster protein assembly. They indicate that frataxin is specifically involved in the same sub-process as HSP20/Jac1p.
The gene mutated in ataxia-ocular apraxia 1 encodes the new HIT/Zn-finger protein aprataxin.
Moreira, M.C., Barbot, C., Tachi, N., Kozuka, N., Uchida, E., Gibson, T., Mendonca, P., Costa, M., Barros, J., Yanagisawa, T., Watanabe, M., Ikeda, Y., Aoki, M., Nagata, T., Coutinho, P., Sequeiros, J. & Koenig, M.
Nat Genet 2001 Oct;29(2):189-93.
The newly recognized ataxia-ocular apraxia 1 (AOA1; MIM 208920) is the most frequent cause of autosomal recessive ataxia in Japan and is second only to Friedreich ataxia in Portugal. It shares several neurological features with ataxia-telangiectasia, including early onset ataxia, oculomotor apraxia and cerebellar atrophy, but does not share its extraneurological features (immune deficiency, chromosomal instability and hypersensitivity to X-rays). AOA1 is also characterized by axonal motor neuropathy and the later decrease of serum albumin levels and elevation of total cholesterol. We have identified the gene causing AOA1 and the major Portuguese and Japanese mutations. This gene encodes a new, ubiquitously expressed protein that we named aprataxin. This protein is composed of three domains that share distant homology with the amino-terminal domain of polynucleotide kinase 3'-phosphatase (PNKP), with histidine-triad (HIT) proteins and with DNA-binding C2H2 zinc-finger proteins, respectively. PNKP is involved in DNA single-strand break repair (SSBR) following exposure to ionizing radiation and reactive oxygen species. Fragile-HIT proteins (FHIT) cleave diadenosine tetraphosphate, which is potentially produced during activation of the SSBR complex. The results suggest that aprataxin is a nuclear protein with a role in DNA repair reminiscent of the function of the protein defective in ataxia-telangiectasia, but that would cause a phenotype restricted to neurological signs when mutant.
The SAND domain structure defines a novel DNA-binding fold in transcriptional regulation.
Bottomley, M.J., Collard, M.W., Huggenvik, J.I., Liu, Z., Gibson, T.J. & Sattler, M.
Nat Struct Biol 2001 Jul;8(7):626-33.
The SAND domain is a conserved sequence motif found in a number of nuclear proteins, including the Sp100 family and NUDR. These are thought to play important roles in chromatin-dependent transcriptional regulation and are linked to many diseases. We have determined the three-dimensional (3D) structure of the SAND domain from Sp100b. The structure represents a novel alpha/beta fold, in which a conserved KDWK sequence motif is found within an alpha-helical, positively charged surface patch. For NUDR, the SAND domain is shown to be sufficient to mediate DNA binding. Using mutational analyses and chemical shift perturbation experiments, the DNA binding surface is mapped to the alpha-helical region encompassing the KDWK motif. The DNA binding activity of wild type and mutant proteins in vitro correlates with transcriptional regulation activity of full length NUDR in vivo. The evolutionarily conserved SAND domain defines a new DNA binding fold that is involved in chromatin-associated transcriptional regulation.
Gene2EST: a BLAST2 server for searching expressed sequence tag (EST) databases with eukaryotic gene-sized queries.
Gemund, C., Ramu, C., Altenberg-Greulich, B. & Gibson, T.J.
Nucleic Acids Res 2001 Mar 15;29(6):1272-7.
Expressed sequence tags (ESTs) are randomly sequenced cDNA clones. Currently, nearly 3 million human and 2 million mouse ESTs provide valuable resources that enable researchers to investigate the products of gene expression. The EST databases have proven to be useful tools for detecting homologous genes, for exon mapping, revealing differential splicing, etc. With the increasing availability of large amounts of poorly characterised eukaryotic (notably human) genomic sequence, ESTs have now become a vital tool for gene identification, sometimes yielding the only unambiguous evidence for the existence of a gene expression product. However, BLAST-based Web servers available to the general user have not kept pace with these developments and do not provide appropriate tools for querying EST databases with large highly spliced genes, often spanning 50 000-100 000 bases or more. Here we describe Gene2EST (http://woody.embl-heidelberg.de/gene2est/), a server that brings together a set of tools enabling efficient retrieval of ESTs matching large DNA queries and their subsequent analysis. RepeatMasker is used to mask dispersed repetitive sequences (such as Alu elements) in the query, BLAST2 for searching EST databases and Artemis for graphical display of the findings. Gene2EST combines these components into a Web resource targeted at the researcher who wishes to study one or a few genes to a high level of detail.
RuNAway Disease: A two cycle model for transmissible spongiform encephalopathies (TSEs) wherein SINE proliferation drives PrP overproduction.
Genome Biol 2001;2(7):Preprint 0006.
BACKGROUND: Despite decades of research, the agent responsible for transmitting spongiform encephalopathies (TSEs) has not been identified. The Prion hypothesis, which dominates the field, supposes that modified host PrP protein, termed PrPSc, acts as the transmissible agent. This model fits the observation that TSE diseases elicit almost no immune reaction. Prion transmission has not been verified, however, as it has not been possible to produce pure PrPSc aggregates. One long-standing objection to the Prion model is the observation that TSE disease agents show classical genetic behaviours, such as reproducible strain variation, while also responding to selection for novel traits such as adaptation to new hosts. Moreover, evidence has been steadily accumulating that infectious titre is decoupled from the quantity (or even the presence) of PrPSc deposits. Rather awkwardly for the Prion hypothesis, PrP0/0 knockout mice have been found to incubate and transmit TSE agents (despite themselves being refractory to TSE disease). HYPOTHESIS: In this article, a new scheme, RuNAway, is proposed whereby uncontrolled proliferation of a type of parasitic gene, the small dispersed repeat sequences (SINEs), in somatic cells induces overproduction of PrP with pathogenic consequences. The RuNAway scheme involves twin tandem positive feedback loops: triggering the second loop leads to the pathogenic disease. This model is consistent with the long latency period and much shorter visible disease progression typical of TSEs.
Object-oriented parsing of biological databases with python.
Ramu, C., Gemund, C. & Gibson, T.J.
Bioinformatics 2000 Jul;16(7):628-38
Motivation: While database activities in the biological area are increasing rapidly, rather little is done in the area of parsing them in a simple and object-oriented way. Results: We present here an elegant, simple yet powerful way of parsing biological flat-file databases. We have taken EMBL, SWISSPROT and GENBANK as examples. EMBL and SWISS-PROT do not differ much in the format structure. GENBANK has a very different format structure than EMBL and SWISS-PROT. Extracting the desired fields in an entry (for example a sub-sequence with an associated feature) for later analysis is a constant need in the biological sequence-analysis community: this is illustrated with tools to make new splice-site databases. The interface to the parser is abstract in the sense that the access to all the databases is independent from their different formats, since parsing instructions are hidden. Availability: The modules are available at http://shag.embl-heidelberg.de:8000/Biopy/ Contact: email@example.com Supplementary information: http://shag.embl-heidelberg.de:8000/Biopy/
An estimate of large-scale sequencing accuracy.
Hill, F., Gemund, C., Benes, V., Ansorge, W. & Gibson, T.J.
EMBO Rep. 2000 Jul;1(1):29-31.
The accuracy of large-scale DNA sequencing is difficult to estimate without redundant effort. We have found that the mobile genetic element IS10, a component of the transposon Tn10, has contaminated a significant number of clones in the public databases, as a result of the use of the transposon in bacterial cloning strain construction. These contaminations need to be annotated as such. More positively, by defining the range of sequence variation in IS10, we have been able to determine that the rate of sequencing errors is very low, most likely surpassing the stated aim of one error or less in ten thousand bases.
Homology-based method for identification of protein repeats using statistical significance estimates.
Andrade, M.A., Ponting, C.P., Gibson, T.J. & Bork, P.
J Mol Biol 2000 May 5;298(3):521-37
Short protein repeats, frequently with a length between 20 and 40 residues, represent a significant fraction of known proteins. Many repeats appear to possess high amino acid substitution rates and thus recognition of repeat homologues is highly problematic. Even if the presence of a certain repeat family is known, the exact locations and the number of repetitive units often cannot be determined using current methods. We have devised an iterative algorithm based on optimal and sub-optimal score distributions from profile analysis that estimates the significance of all repeats that are detected in a single sequence. This procedure allows the identification of homologues at alignment scores lower than the highest optimal alignment score for non-homologous sequences. The method has been used to investigate the occurrence of eleven families of repeats in Saccharomyces cerevisiae, Caenorhabditis elegans and Homo sapiens accounting for 1055, 2205 and 2320 repeats, respectively. For these examples, the method is both more sensitive and more selective than conventional homology search procedures. The method allowed the detection in the SwissProt database of more than 2000 previously unrecognised repeats belonging to the 11 families. In addition, the method was used to merge several repeat families that previously were supposed to be distinct, indicating common phylogenetic origins for these families. Copyright 2000 Academic Press.
Evidence in favour of ancient octaploidy in the vertebrate genome.
Gibson, T.J. & Spring, J.
Biochem Soc Trans 2000 Feb;28(2):259-64.
Vertebrate genomes are larger than those of invertebrates and show evidence of extensive gene duplication, including many collinear chromosomal segments. On the basis of this intra-genomic synteny, it has been proposed that two rounds of whole genome duplication (octaploidy) occurred early in the vertebrate lineage. Recently, this early vertebrate octaploidy has been challenged on the basis of gene trees. We report new linkage groups encompassing the matrilin (MATN), syndecan (SDC), Eyes Absent (EYA), HCK kinase and SRC kinase paralogous gene quartets. In contrast to other studies, the sequence trees are weakly supportive of ancient octaploidy. It is concluded that there is no strong evidence against the octaploidy, provided that consecutive genome duplication was rapid.
Assignment of the 1H, 15N, and 13C resonances of the C-terminal domain of frataxin, the protein responsible for Friedreich ataxia
Musco, G., de Tommasi, T., Stier, G., Kolmerer, B., Bottomley, M., Adinolfi, S., Muskett, F.W., Gibson, T.J., Frenkiel, T.A. & Pastore, A.
J Biomol NMR 1999 Sep;15(1):87-8.
Formin defines a large family of morphoregulatory genes and functions in establishment of the polarising region.
Zeller, R., Haramis, A.G., Zuniga, A., McGuigan, C., Dono, R., Davidson, G., Chabanis, S. & Gibson, T.
Cell Tissue Res 1999 Mar 29;296(1):85-93.
Formin was originally isolated as the gene affected by the murine limb deformity (ld) mutations, which disrupt the epithelial-mesenchymal interactions regulating patterning of the vertebrate limb autopod. More recently, a rapidly growing number of genes with similarity to formin have been isolated from many different species including fungi and plants. Genetic and biochemical analysis shows that formin family members function in cellular processes regulating either cytokinesis and/or cell polarisation. Another common feature among formin family members is their requirement in morphogenetic processes such as budding and conjugation of yeast, establishment of Drosophila oocyte polarity and vertebrate limb pattern formation. Vertebrate formins are predominantly nuclear proteins which control polarising activity in limb buds through establishment of the SHH/FGF-4 feedback loop. Formin acts in the limb bud mesenchyme to induce apical ectodermal ridge (AER) differentiation and FGF-4 expression in the posterior AER compartment. Finally, disruption of the epithelial-mesenchymal interactions controlling induction of metanephric kidneys in ld mutant embryos indicates that formin might function more generally in transduction of morphogenetic signals during embryonic pattern formation.
Multiple sequence alignment with Clustal X.
Jeanmougin, F., Thompson, J.D., Gouy, M., Higgins, D.G. & Gibson, T.J.
Trends Biochem Sci 1998 Oct;23(10):403-405.
A new method for isolating tyrosine kinase substrates used to identify fish, an SH3 and PX domain-containing protein, and Src substrate.
Lock, P., Abram, C.L., Gibson, T. & Courtneidge, S.A.
EMBO J 1998 Aug 3;17(15):4346-4357.
We describe a method for identifying tyrosine kinase substrates using anti-phosphotyrosine antibodies to screen tyrosine-phosphorylated cDNA expression libraries. Several potential Src substrates were identified including Fish, which has five SH3 domains and a recently discovered phox homology (PX) domain. Fish is tyrosine-phosphorylated in Src-transformed fibroblasts (suggesting that it is a target of Src in vivo) and in normal cells following treatment with several growth factors. Treatment of cells with cytochalasin D also resulted in rapid tyrosine phosphorylation of Fish, concomitant with activation of Src. These data suggest that Fish is involved in signalling by tyrosine kinases, and imply a specialized role in the actin cytoskeleton.
The APECED polyglandular autoimmune syndrome protein, AIRE-1, contains the SAND domain and is probably a transcription factor
Gibson, T.J., Ramu, C., Gemund, C. & Aasland, R.
Trends Biochem Sci 1998 Jul;23(7):242-244.
Genetic redundancy in vertebrates: polyploidy and persistence of genes encoding multidomain proteins.
Gibson, T.J. & Spring, J.
This is a review article.
Trends Genet 1998 Feb;14(2):46-49.
The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools.
Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F. & Higgins, D.G.
Nucleic Acids Res. 1997 Dec 15;25(24):4876-82.
CLUSTAL X is a new windows interface for the widely-used progressive multiple sequence alignment program CLUSTAL W. The new system is easy to use, providing an integrated system for performing multiple sequence and profile alignments and analysing the results. CLUSTAL X displays the sequence alignment in a window on the screen. A versatile sequence colouring scheme allows the user to highlight conserved features in the alignment. Pull-down menus provide all the options required for traditional multiple sequence and profile alignment. New features include: the ability to cut-and-paste sequences to change the order of the alignment, selection of a subset of the sequences to be realigned, and selection of a sub-range of the alignment to be realigned and inserted back into the original alignment. Alignment quality analysis can be performed and low-scoring segments or exceptional residues can be highlighted. Quality analysis and realignment of selected residue ranges provide the user with a powerful tool to improve and refine difficult alignments and to trap errors in input sequences. CLUSTAL X has been compiled on SUN Solaris, IRIX5.3 on Silicon Graphics, Digital UNIX on DECstations, Microsoft Windows (32 bit) for PCs, Linux ELF for x86 PCs, and Macintosh PowerMac.
PairWise and SearchWise: finding the optimal alignment in a simultaneous comparison of a protein profile against all DNA translation frames.
Birney, E., Thompson, J.D. & Gibson, T.J.
Nucleic Acids Res 1996 Jul 15;24(14):2730-9
DNA translation frames can be disrupted for several reasons, including: (i) errors in sequence determination; (ii) RNA processing, such as intron removal and guide RNA editing; (iii) less commonly, polymerase frameshifting during transcription or ribosomal frameshifting during translation. Frameshifts frequently confound computational activities involving homologous sequences, such as database searches and inferences on structure, function or phylogeny made from multiple alignments. A dynamic alignment algorithm is reported here which compares a protein profile (a residue scoring matrix for one or more aligned sequences) against the three translation frames of a DNA strand, allowing frameshifting. The algorithm has been incorporated into a new package, WiseTools, for comparison of biological sequences. A protein profile can be compared against either a DNA sequence or a protein sequence. The program PairWise may be used interactively for alignment of any two sequence inputs. SearchWise can perform combinations of searches through DNA or protein databases by a protein profile or DNA sequence. Routine application of the programs has revealed a set of database entries with frameshifts caused by errors in sequence determination.
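The idea of scoring a protein against all translation frames while tolerating frameshifts can be illustrated with a small dynamic programme. This is only a minimal sketch and not the WiseTools implementation: it assumes a single protein sequence instead of a profile, a simple match/mismatch score, one DNA strand, and a flat frameshift penalty. The function name `frameshift_alignment_score` and all parameter values are hypothetical.

```python
# Standard genetic code, built from the conventional TCAG base ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def frameshift_alignment_score(protein, dna, match=2, mismatch=-1, frameshift=2):
    """Best score for aligning the whole protein somewhere along the DNA.

    Illustrative scoring only (assumed values): each residue is paired with
    one codon, and skipping one or two nucleotides (a frame change) costs
    `frameshift`. The start and end positions in the DNA are free.
    """
    neg_inf = float("-inf")
    n, m = len(protein), len(dna)
    # score[i][j]: best score with the first i residues aligned and the
    # alignment ending just before DNA offset j.
    score = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    score[0] = [0] * (m + 1)  # the alignment may start at any DNA offset
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = neg_inf
            if j >= 3:
                # Pair residue i with the codon ending at offset j.
                residue = CODON_TABLE.get(dna[j - 3:j], "X")
                step = match if residue == protein[i - 1] else mismatch
                best = score[i - 1][j - 3] + step
            for skipped in (1, 2):
                # Skip one or two nucleotides: a shift into another frame.
                if j >= skipped:
                    best = max(best, score[i][j - skipped] - frameshift)
            score[i][j] = best
    return max(score[n])  # the alignment may end at any DNA offset


# An intact reading frame, and the same sequence with one inserted base.
intact_score = frameshift_alignment_score("MKV", "ATGAAAGTT")
shifted_score = frameshift_alignment_score("MKV", "ATGAAACGTT")
```

With these assumed scores the inserted base is bridged by a single frameshift penalty, so the shifted sequence scores lower than the intact one but higher than an alignment forced to stay in one frame.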
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.
Thompson, J.D., Higgins, D.G. & Gibson, T.J.
Nucleic Acids Res. 1994 Nov 11;22(22):4673-80.
The sensitivity of the commonly used progressive multiple sequence alignment method has been greatly improved for the alignment of divergent protein sequences. Firstly, individual weights are assigned to each sequence in a partial alignment in order to down-weight near-duplicate sequences and up-weight the most divergent ones. Secondly, amino acid substitution matrices are varied at different alignment stages according to the divergence of the sequences to be aligned. Thirdly, residue-specific gap penalties and locally reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than regular secondary structure. Fourthly, positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage the opening up of new gaps at these positions. These modifications are incorporated into a new program, CLUSTAL W, which is freely available.
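The first modification, sequence weighting, can be illustrated with a short sketch. It assumes that weights are derived from the branch lengths of a guide tree, with each branch's length shared equally among the sequences below it; the abstract does not spell this out. The tree encoding, the function names and the example branch lengths are all hypothetical.

```python
# A leaf is a sequence name (str); an internal node is a list of
# (child, branch_length) pairs. This encoding is an illustrative choice.

def leaf_names(tree):
    """Return the names of all sequences below this node."""
    if isinstance(tree, str):
        return [tree]
    names = []
    for child, _ in tree:
        names.extend(leaf_names(child))
    return names


def sequence_weights(tree, inherited=0.0, weights=None):
    """Sum branch lengths from root to leaf, sharing each branch equally
    among the sequences that descend from it."""
    if weights is None:
        weights = {}
    if isinstance(tree, str):
        weights[tree] = inherited
        return weights
    for child, length in tree:
        share = length / len(leaf_names(child))
        sequence_weights(child, inherited + share, weights)
    return weights


# Hypothetical guide tree: A and B are near-duplicates, C is divergent.
example_tree = [
    ([("A", 0.1), ("B", 0.1)], 0.4),
    ("C", 0.6),
]
weights = sequence_weights(example_tree)
```

In this toy tree A and B split the branch they share, so each receives about 0.3, while the divergent sequence C keeps its full 0.6.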