Impact of genomic structural variation in Drosophila melanogaster based on population-scale sequencing.
Zichner, T., Garfield, D.A., Rausch, T., Stutz, A.M., Cannavo, E., Braun, M., Furlong, E.E. & Korbel, J.O.
Genome Res. 2013 Mar;23(3):568-79. doi: 10.1101/gr.142646.112. Epub 2012 Dec 6.
Genomic structural variation (SV) is a major determinant for phenotypic variation. Although it has been extensively studied in humans, the nucleotide resolution structure of SVs within the widely used model organism Drosophila remains unknown. We report a highly accurate, densely validated map of unbalanced SVs comprising 8962 deletions and 916 tandem duplications in 39 lines derived from short-read DNA sequencing in a natural population (the "Drosophila melanogaster Genetic Reference Panel," DGRP). Most SVs (>90%) were inferred at nucleotide resolution, and a large fraction was genotyped across all samples. Comprehensive analyses of SV formation mechanisms using the short-read data revealed an abundance of SVs formed by mobile element and nonhomologous end-joining-mediated rearrangements, and clustering of variants into SV hotspots. We further observed a strong depletion of SVs overlapping genes, which, along with population genetics analyses, suggests that these SVs are often deleterious. We inferred several gene fusion events also highlighting the potential role of SVs in the generation of novel protein products. Expression quantitative trait locus (eQTL) mapping revealed the functional impact of our high-resolution SV map, with quantifiable effects at >100 genic loci. Our map represents a resource for population-level studies of SVs in an important model organism.
Phenotypic impact of genomic structural variation: insights from and for human disease.
Weischenfeldt, J., Symmons, O., Spitz, F. & Korbel, J.O.
Nat Rev Genet. 2013 Feb;14(2):125-38. doi: 10.1038/nrg3373.
Genomic structural variants have long been implicated in phenotypic diversity and human disease, but dissecting the mechanisms by which they exert their functional impact has proven elusive. Recently, however, developments in high-throughput DNA sequencing and chromosomal engineering technology have facilitated the analysis of structural variants in human populations and model systems in unprecedented detail. In this Review, we describe how structural variants can affect molecular and cellular processes, leading to complex organismal phenotypes, including human disease. We further present advances in delineating disease-causing elements that are affected by structural variants, and we discuss future directions for research on the functional consequences of structural variants.
Primate genome architecture influences structural variation mechanisms and functional consequences.
Gokcumen, O., Tischler, V., Tica, J., Zhu, Q., Iskow, R.C., Lee, E., Fritz, M.H., Langdon, A., Stutz, A.M., Pavlidis, P., Benes, V., Mills, R.E., Park, P.J., Lee, C. & Korbel, J.O.
Proc Natl Acad Sci U S A. 2013 Sep 6.
Although nucleotide resolution maps of genomic structural variants (SVs) have provided insights into the origin and impact of phenotypic diversity in humans, comparable maps in nonhuman primates have thus far been lacking. Using massively parallel DNA sequencing, we constructed fine-resolution genomic structural variation maps in five chimpanzees, five orang-utans, and five rhesus macaques. The SV maps, which are comprised of thousands of deletions, duplications, and mobile element insertions, revealed a high activity of retrotransposition in macaques compared with great apes. By comparison, nonallelic homologous recombination is specifically active in the great apes, which is correlated with architectural differences between the genomes of great apes and macaque. Transcriptome analyses across nonhuman primates and humans revealed effects of species-specific whole-gene duplication on gene expression. We identified 13 gene duplications coinciding with the species-specific gain of tissue-specific gene expression in keeping with a role of gene duplication in the promotion of diversification and the acquisition of unique functions. Differences in the present day activity of SV formation mechanisms that our study revealed may contribute to ongoing diversification and adaptation of great ape and Old World monkey lineages.
Identification of a Ninein (NIN) mutation in a family with spondyloepimetaphyseal dysplasia with joint laxity (leptodactylic type)-like phenotype.
Grosch, M., Gruner, B., Spranger, S., Stutz, A.M., Rausch, T., Korbel, J.O., Seelow, D., Nurnberg, P., Sticht, H., Lausch, E., Zabel, B., Winterpacht, A. & Tagariello, A.
Matrix Biol. 2013 May 9. pii: S0945-053X(13)00070-X. doi: 10.1016/j.matbio.2013.05.001.
Spondyloepimetaphyseal dysplasia with joint laxity-leptodactylic type (SEMDJL2) is an autosomal dominant skeletal dysplasia which is characterized by midface hypoplasia, short stature, joint laxity with dislocations, genua valga, progressive scoliosis, and slender fingers. Recently, heterozygous missense mutations in KIF22, a gene which encodes a member of the kinesin-like protein family, have been identified in sporadic as well as familial cases of SEMDJL2. In the present study homozygosity mapping and whole-exome sequencing were combined to analyze a consanguineous family with a phenotype resembling SEMDJL2. We identified homozygous missense mutations in the two nearby genes NIN (Ninein) and POLE2 (DNA polymerase epsilon subunit B) which segregate with the disease in the family and were not present in 500 healthy control individuals and in the 1094 control individuals contained within the 1000-genomes database. We present several lines of evidence that mutant Ninein is most likely causative for the SEMDJL2-like phenotype. The centrosomal protein NIN shows a functional relationship with KIF22 and other proteins associated with chromosome congression/movement, centrosomal function, and ciliogenesis, which have been associated with skeletal dysplasias. Moreover, compound heterozygous missense mutations at more N-terminal positions of Ninein have very recently been identified in a family with microcephalic primordial dwarfism. Together with the present report this strongly supports a fundamental role of Ninein in skeletal development.
Whole-exome sequencing links caspase recruitment domain 11 (CARD11) inactivation to severe combined immunodeficiency.
Greil, J., Rausch, T., Giese, T., Bandapalli, O.R., Daniel, V., Bekeredjian-Ding, I., Stutz, A.M., Drees, C., Roth, S., Ruland, J., Korbel, J.O. & Kulozik, A.E.
J Allergy Clin Immunol. 2013 May;131(5):1376-83.e3. doi: 10.1016/j.jaci.2013.02.012. Epub 2013 Apr 3.
BACKGROUND: Primary immunodeficiencies represent model diseases for the mechanistic understanding of the human innate and adaptive immune response. They are clinically highly relevant per se because in patients with severe combined immunodeficiency (SCID), infections caused by opportunistic pathogens are typically life-threatening early in life. OBJECTIVES: We aimed at defining and functionally characterizing a novel form of SCID in an infant of consanguineous parents who presented with life-threatening Pneumocystis jirovecii pneumonia using a comprehensive immunologic and whole-exome genetic diagnostic strategy. METHODS: Analysis of leukocyte subpopulations was performed by using multicolor flow cytometry and was combined with stimulation tests for T-cell function. The search for a disease-causing mutation was performed with diagnostic whole-exome sequencing and systematic variant categorization. Reconstitution assays were used for validating the loss-of-function mutation. RESULTS: The novel entity of SCID was characterized by agammaglobulinemia and profoundly deficient T-cell function despite quantitatively normal T and B lymphocytes. Genetic analysis revealed a single pathogenic homozygous nonsense mutation of the caspase recruitment domain 11 (CARD11) gene. In reconstitution assays we demonstrated that the patient-derived truncated CARD11 protein is defective in antigen receptor signaling and nuclear factor kappaB activation. CONCLUSION: We show that an inactivating CARD11 mutation links defective nuclear factor kappaB signaling to a novel cause of autosomal recessive SCID.
The Genomic and Transcriptomic Landscape of a HeLa Cell Line.
Landry, J.J., Pyl, P.T., Rausch, T., Zichner, T., Tekkedil, M.M., Stutz, A.M., Jauch, A., Aiyar, R.S., Pau, G., Delhomme, N., Gagneur, J., Korbel, J.O., Huber, W. & Steinmetz, L.M.
G3 (Bethesda). 2013 Mar 26. pii: g3.113.005777v6. doi: 10.1534/g3.113.005777.
HeLa is the most widely used model cell line for studying human cellular and molecular biology. To date, no genomic reference for this cell line has been released, and experiments have relied on the human reference genome. Effective design and interpretation of molecular genetic studies done using HeLa cells requires accurate genomic information. Here we present a detailed genomic and transcriptomic characterization of a HeLa cell line. We performed DNA and RNA sequencing of a HeLa Kyoto cell line and analyzed its mutational portfolio and gene expression profile. Segmentation of the genome according to copy number revealed a remarkably high level of aneuploidy and numerous large structural variants at unprecedented resolution. The extensive genomic rearrangements are indicative of catastrophic chromosome shattering, known as chromothripsis. Our analysis of the HeLa gene expression profile revealed that several pathways, including cell cycle and DNA repair, exhibit significantly different expression patterns from those in normal human tissues. Our results provide the first detailed account of genomic variants in the HeLa genome, yielding insight into their impact on gene expression and cellular function as well as their origins. This study underscores the importance of accounting for the strikingly aberrant characteristics of HeLa cells when designing and interpreting experiments, and has implications for the use of HeLa as a model of human biology.
Criteria for inference of chromothripsis in cancer genomes.
Korbel, J.O. & Campbell, P.J.
Cell. 2013 Mar 14;152(6):1226-36. doi: 10.1016/j.cell.2013.02.023.
Chromothripsis scars the genome when localized chromosome shattering and repair occurs in a one-off catastrophe. Outcomes of this process are detectable as massive DNA rearrangements affecting one or a few chromosomes. Although recent findings suggest a crucial role of chromothripsis in cancer development, the reproducible inference of this process remains challenging, requiring that cataclysmic one-off rearrangements be distinguished from localized lesions that occur progressively. We describe conceptual criteria for the inference of chromothripsis, based on ruling out the alternative hypothesis that stepwise rearrangements occurred. Robust means of inference may facilitate in-depth studies on the impact of, and the mechanisms underlying, chromothripsis.
Integrative genomic analyses reveal an androgen-driven somatic alteration landscape in early-onset prostate cancer.
Weischenfeldt, J., Simon, R., Feuerbach, L., Schlangen, K., Weichenhan, D., Minner, S., Wuttig, D., Warnatz, H.J., Stehr, H., Rausch, T., Jager, N., Gu, L., Bogatyrova, O., Stutz, A.M., Claus, R., Eils, J., Eils, R., Gerhauser, C., Huang, P.H., Hutter, B., Kabbe, R., Lawerenz, C., Radomski, S., Bartholomae, C.C., Falth, M., Gade, S., Schmidt, M., Amschler, N., Hass, T., Galal, R., Gjoni, J., Kuner, R., Baer, C., Masser, S., von Kalle, C., Zichner, T., Benes, V., Raeder, B., Mader, M., Amstislavskiy, V., Avci, M., Lehrach, H., Parkhomchuk, D., Sultan, M., Burkhardt, L., Graefen, M., Huland, H., Kluth, M., Krohn, A., Sirma, H., Stumm, L., Steurer, S., Grupp, K., Sultmann, H., Sauter, G., Plass, C., Brors, B., Yaspo, M.L., Korbel, J.O. & Schlomm, T.
Cancer Cell. 2013 Feb 11;23(2):159-70. doi: 10.1016/j.ccr.2013.01.002.
Early-onset prostate cancer (EO-PCA) represents the earliest clinical manifestation of prostate cancer. To compare the genomic alteration landscapes of EO-PCA with "classical" (elderly-onset) PCA, we performed deep sequencing-based genomics analyses in 11 tumors diagnosed at young age, and pursued comparative assessments with seven elderly-onset PCA genomes. Remarkable age-related differences in structural rearrangement (SR) formation became evident, suggesting distinct disease pathomechanisms. Whereas EO-PCAs harbored a prevalence of balanced SRs, with a specific abundance of androgen-regulated ETS gene fusions including TMPRSS2:ERG, elderly-onset PCAs displayed primarily non-androgen-associated SRs. Data from a validation cohort of > 10,000 patients showed age-dependent androgen receptor levels and a prevalence of SRs affecting androgen-regulated genes, further substantiating the activity of a characteristic "androgen-type" pathomechanism in EO-PCA.
Genomic deletion of MAP3K7 at 6q12-22 is associated with early PSA recurrence in prostate cancer and absence of TMPRSS2:ERG fusions.
Kluth, M., Hesse, J., Heinl, A., Krohn, A., Steurer, S., Sirma, H., Simon, R., Mayer, P.S., Schumacher, U., Grupp, K., Izbicki, J.R., Pantel, K., Dikomey, E., Korbel, J.O., Plass, C., Sauter, G., Schlomm, T. & Minner, S.
Mod Pathol. 2013 Feb 1. doi: 10.1038/modpathol.2012.236.
6q12-22 is the second most commonly deleted genomic region in prostate cancer. Mapping studies have described a minimally deleted area at 6q15, containing MAP3K7/TAK1, which was recently shown to have tumor suppressive properties. To determine prevalence and clinical significance of MAP3K7 alterations in prostate cancer, a tissue microarray containing 4699 prostate cancer samples was analyzed by fluorescence in situ hybridization. Heterozygous MAP3K7 deletions were found in 18.48% of 2289 interpretable prostate cancers. MAP3K7 deletions were significantly associated with advanced tumor stage (P<0.0001), high Gleason grade (P<0.0001), lymph node metastasis (P<0.0108) and early biochemical recurrence (P<0.0001). MAP3K7 alterations were typically limited to the loss of one allele as homozygous deletions were virtually absent and sequencing analyses revealed no evidence for MAP3K7 mutations in 15 deleted and in 14 non-deleted cancers. There was a striking inverse association of MAP3K7 deletions and TMPRSS2:ERG fusion status with 26.7% 6q deletions in 1125 ERG-negative and 11.1% 6q deletions in 1198 ERG-positive cancers (P<0.0001). However, the strong prognostic role of 6q deletions was retained in both ERG-positive and ERG-negative cancers (P<0.0001 each). In summary, our study identifies MAP3K7 deletion as a prominent feature in ERG-negative prostate cancer with strong association to tumor aggressiveness. MAP3K7 alterations are typically limited to one allele of the gene. Together with the demonstrated tumor suppressive function in cell line experiments and lacking evidence for inactivation through hypermethylation, these results indicate MAP3K7 as a gene for which haploinsufficiency is substantially tumorigenic.
Recurrent somatic alterations of FGFR1 and NTRK2 in pilocytic astrocytoma.
Jones, D.T., Hutter, B., Jager, N., Korshunov, A., Kool, M., Warnatz, H.J., Zichner, T., Lambert, S.R., Ryzhova, M., Quang, D.A., Fontebasso, A.M., Stutz, A.M., Hutter, S., Zuckermann, M., Sturm, D., Gronych, J., Lasitschka, B., Schmidt, S., Seker-Cin, H., Witt, H., Sultan, M., Ralser, M., Northcott, P.A., Hovestadt, V., Bender, S., Pfaff, E., Stark, S., Faury, D., Schwartzentruber, J., Majewski, J., Weber, U.D., Zapatka, M., Raeder, B., Schlesner, M., Worth, C.L., Bartholomae, C.C., von Kalle, C., Imbusch, C.D., Radomski, S., Lawerenz, C., van Sluis, P., Koster, J., Volckmann, R., Versteeg, R., Lehrach, H., Monoranu, C., Winkler, B., Unterberg, A., Herold-Mende, C., Milde, T., Kulozik, A.E., Ebinger, M., Schuhmann, M.U., Cho, Y.J., Pomeroy, S.L., von Deimling, A., Witt, O., Taylor, M.D., Wolf, S., Karajannis, M.A., Eberhart, C.G., Scheurlen, W., Hasselblatt, M., Ligon, K.L., Kieran, M.W., Korbel, J.O., Yaspo, M.L., Brors, B., Felsberg, J., Reifenberger, G., Collins, V.P., Jabado, N., Eils, R., Lichter, P. & Pfister, S.M.
Nat Genet. 2013 Aug;45(8):927-32. doi: 10.1038/ng.2682. Epub 2013 Jun 30.
Pilocytic astrocytoma, the most common childhood brain tumor, is typically associated with mitogen-activated protein kinase (MAPK) pathway alterations. Surgically inaccessible midline tumors are therapeutically challenging, showing sustained tendency for progression and often becoming a chronic disease with substantial morbidities. Here we describe whole-genome sequencing of 96 pilocytic astrocytomas, with matched RNA sequencing (n = 73), conducted by the International Cancer Genome Consortium (ICGC) PedBrain Tumor Project. We identified recurrent activating mutations in FGFR1 and PTPN11 and new NTRK2 fusion genes in non-cerebellar tumors. New BRAF-activating changes were also observed. MAPK pathway alterations affected all tumors analyzed, with no other significant mutations identified, indicating that pilocytic astrocytoma is predominantly a single-pathway disease. Notably, we identified the same FGFR1 mutations in a subset of H3F3A-mutated pediatric glioblastoma with additional alterations in the NF1 gene. Our findings thus identify new potential therapeutic targets in distinct subsets of pilocytic astrocytoma and childhood glioblastoma.
Recurrent mutation of the ID3 gene in Burkitt lymphoma identified by integrated genome, exome and transcriptome sequencing.
Richter, J., Schlesner, M., Hoffmann, S., Kreuz, M., Leich, E., Burkhardt, B., Rosolowski, M., Ammerpohl, O., Wagener, R., Bernhart, S.H., Lenze, D., Szczepanowski, M., Paulsen, M., Lipinski, S., Russell, R.B., Adam-Klages, S., Apic, G., Claviez, A., Hasenclever, D., Hovestadt, V., Hornig, N., Korbel, J.O., Kube, D., Langenberger, D., Lawerenz, C., Lisfeld, J., Meyer, K., Picelli, S., Pischimarov, J., Radlwimmer, B., Rausch, T., Rohde, M., Schilhabel, M., Scholtysik, R., Spang, R., Trautmann, H., Zenz, T., Borkhardt, A., Drexler, H.G., Moller, P., Macleod, R.A., Pott, C., Schreiber, S., Trumper, L., Loeffler, M., Stadler, P.F., Lichter, P., Eils, R., Kuppers, R., Hummel, M., Klapper, W., Rosenstiel, P., Rosenwald, A., Brors, B. & Siebert, R.
Nat Genet. 2012 Nov 11;44(12):1316-1320. doi: 10.1038/ng.2469. Epub 2012 Nov 11.
Burkitt lymphoma is a mature aggressive B-cell lymphoma derived from germinal center B cells. Its cytogenetic hallmark is the Burkitt translocation t(8;14)(q24;q32) and its variants, which juxtapose the MYC oncogene with one of the three immunoglobulin loci. Consequently, MYC is deregulated, resulting in massive perturbation of gene expression. Nevertheless, MYC deregulation alone seems not to be sufficient to drive Burkitt lymphomagenesis. By whole-genome, whole-exome and transcriptome sequencing of four prototypical Burkitt lymphomas with immunoglobulin gene (IG)-MYC translocation, we identified seven recurrently mutated genes. One of these genes, ID3, mapped to a region of focal homozygous loss in Burkitt lymphoma. In an extended cohort, 36 of 53 molecularly defined Burkitt lymphomas (68%) carried potentially damaging mutations of ID3. These were strongly enriched at somatic hypermutation motifs. Only 6 of 47 other B-cell lymphomas with the IG-MYC translocation (13%) carried ID3 mutations. These findings suggest that cooperation between ID3 inactivation and IG-MYC translocation is a hallmark of Burkitt lymphomagenesis.
An integrated map of genetic variation from 1,092 human genomes.
1000 Genomes Project Consortium
Nature. 2012 Nov 1;491(7422):56-65. doi: 10.1038/nature11632.
By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.
DELLY: structural variant discovery by integrated paired-end and split-read analysis.
Rausch, T., Zichner, T., Schlattl, A., Stutz, A.M., Benes, V. & Korbel, J.O.
Bioinformatics. 2012 Sep 15;28(18):i333-i339.
MOTIVATION: The discovery of genomic structural variants (SVs) at high sensitivity and specificity is an essential requirement for characterizing naturally occurring variation and for understanding pathological somatic rearrangements in personal genome sequencing data. Of particular interest are integrated methods that accurately identify simple and complex rearrangements in heterogeneous sequencing datasets at single-nucleotide resolution, as an optimal basis for investigating the formation mechanisms and functional consequences of SVs. RESULTS: We have developed an SV discovery method, called DELLY, that integrates short insert paired-ends, long-range mate-pairs and split-read alignments to accurately delineate genomic rearrangements at single-nucleotide resolution. DELLY is suitable for detecting copy-number variable deletion and tandem duplication events as well as balanced rearrangements such as inversions or reciprocal translocations. DELLY thus makes it possible to ascertain the full spectrum of genomic rearrangements, including complex events. On simulated data, DELLY compares favorably to other SV prediction methods across a wide range of sequencing parameters. On real data, DELLY reliably uncovers SVs from the 1000 Genomes Project and cancer genomes, and validation experiments of randomly selected deletion loci show a high specificity. AVAILABILITY: DELLY is available at www.korbel.embl.de/software.html. CONTACT: email@example.com.
Subgroup-specific structural variation across 1,000 medulloblastoma genomes.
Northcott, P.A., Shih, D.J., Peacock, J., Garzia, L., Morrissy, A.S., Zichner, T., Stutz, A.M., Korshunov, A., Reimand, J., Schumacher, S.E., Beroukhim, R., Ellison, D.W., Marshall, C.R., Lionel, A.C., Mack, S., Dubuc, A., Yao, Y., Ramaswamy, V., Luu, B., Rolider, A., Cavalli, F.M., Wang, X., Remke, M., Wu, X., Chiu, R.Y., Chu, A., Chuah, E., Corbett, R.D., Hoad, G.R., Jackman, S.D., Li, Y., Lo, A., Mungall, K.L., Nip, K.M., Qian, J.Q., Raymond, A.G., Thiessen, N.T., Varhol, R.J., Birol, I., Moore, R.A., Mungall, A.J., Holt, R., Kawauchi, D., Roussel, M.F., Kool, M., Jones, D.T., Witt, H., Fernandez-L, A., Kenney, A.M., Wechsler-Reya, R.J., Dirks, P., Aviv, T., Grajkowska, W.A., Perek-Polnik, M., Haberler, C.C., Delattre, O., Reynaud, S.S., Doz, F.F., Pernet-Fattet, S.S., Cho, B.K., Kim, S.K., Wang, K.C., Scheurlen, W., Eberhart, C.G., Fevre-Montange, M., Jouvet, A., Pollack, I.F., Fan, X., Muraszko, K.M., Gillespie, G.Y., Di Rocco, C., Massimi, L., Michiels, E.M., Kloosterhof, N.K., French, P.J., Kros, J.M., Olson, J.M., Ellenbogen, R.G., Zitterbart, K., Kren, L., Thompson, R.C., Cooper, M.K., Lach, B., McLendon, R.E., Bigner, D.D., Fontebasso, A., Albrecht, S., Jabado, N., Lindsey, J.C., Bailey, S., Gupta, N., Weiss, W.A., Bognar, L., Klekner, A., Van Meter, T.E., Kumabe, T., Tominaga, T., Elbabaa, S.K., Leonard, J.R., Rubin, J.B., Liau, L.M., Van Meir, E.G., Fouladi, M., Nakamura, H., Cinalli, G., Garami, M., Hauser, P., Saad, A.G., Iolascon, A., Jung, S., Carlotti, C.G., Vibhakar, R., Ra, Y.S., Robinson, S., Zollo, M., Faria, C.C., Chan, J.A., Levy, M.L., Sorensen, P.H., Meyerson, M., Pomeroy, S.L., Cho, Y.J., Bader, G.D., Tabori, U., Hawkins, C.E., Bouffet, E., Scherer, S.W., Rutka, J.T., Malkin, D., Clifford, S.C., Jones, S.J., Korbel, J.O., Pfister, S.M., Marra, M.A. & Taylor, M.D.
Nature. 2012 Aug 2;488(7409):49-56.
Medulloblastoma, the most common malignant paediatric brain tumour, is currently treated with nonspecific cytotoxic therapies including surgery, whole-brain radiation, and aggressive chemotherapy. As medulloblastoma exhibits marked intertumoural heterogeneity, with at least four distinct molecular variants, previous attempts to identify targets for therapy have been underpowered because of small sample sizes. Here we report somatic copy number aberrations (SCNAs) in 1,087 unique medulloblastomas. SCNAs are common in medulloblastoma, and are predominantly subgroup-enriched. The most common region of focal copy number gain is a tandem duplication of SNCAIP, a gene associated with Parkinson's disease, which is exquisitely restricted to Group 4alpha. Recurrent translocations of PVT1, including PVT1-MYC and PVT1-NDRG1, that arise through chromothripsis are restricted to Group 3. Numerous targetable SCNAs, including recurrent events targeting TGF-beta signalling in Group 3, and NF-kappaB signalling in Group 4, suggest future avenues for rational, targeted therapy.
Dissecting the genomic complexity underlying medulloblastoma.
Jones, D.T., Jager, N., Kool, M., Zichner, T., Hutter, B., Sultan, M., Cho, Y.J., Pugh, T.J., Hovestadt, V., Stutz, A.M., Rausch, T., Warnatz, H.J., Ryzhova, M., Bender, S., Sturm, D., Pleier, S., Cin, H., Pfaff, E., Sieber, L., Wittmann, A., Remke, M., Witt, H., Hutter, S., Tzaridis, T., Weischenfeldt, J., Raeder, B., Avci, M., Amstislavskiy, V., Zapatka, M., Weber, U.D., Wang, Q., Lasitschka, B., Bartholomae, C.C., Schmidt, M., von Kalle, C., Ast, V., Lawerenz, C., Eils, J., Kabbe, R., Benes, V., van Sluis, P., Koster, J., Volckmann, R., Shih, D., Betts, M.J., Russell, R.B., Coco, S., Tonini, G.P., Schuller, U., Hans, V., Graf, N., Kim, Y.J., Monoranu, C., Roggendorf, W., Unterberg, A., Herold-Mende, C., Milde, T., Kulozik, A.E., von Deimling, A., Witt, O., Maass, E., Rossler, J., Ebinger, M., Schuhmann, M.U., Fruhwald, M.C., Hasselblatt, M., Jabado, N., Rutkowski, S., von Bueren, A.O., Williamson, D., Clifford, S.C., McCabe, M.G., Collins, V.P., Wolf, S., Wiemann, S., Lehrach, H., Brors, B., Scheurlen, W., Felsberg, J., Reifenberger, G., Northcott, P.A., Taylor, M.D., Meyerson, M., Pomeroy, S.L., Yaspo, M.L., Korbel, J.O., Korshunov, A., Eils, R., Pfister, S.M. & Lichter, P.
Nature. 2012 Aug 2;488(7409):100-5.
Medulloblastoma is an aggressively growing tumour, arising in the cerebellum or medulla/brain stem. It is the most common malignant brain tumour in children, and shows tremendous biological and clinical heterogeneity. Despite recent treatment advances, approximately 40% of children experience tumour recurrence, and 30% will die from their disease. Those who survive often have a significantly reduced quality of life. Four tumour subgroups with distinct clinical, biological and genetic profiles are currently identified. WNT tumours, showing activated wingless pathway signalling, carry a favourable prognosis under current treatment regimens. SHH tumours show hedgehog pathway activation, and have an intermediate prognosis. Group 3 and 4 tumours are molecularly less well characterized, and also present the greatest clinical challenges. The full repertoire of genetic events driving this distinction, however, remains unclear. Here we describe an integrative deep-sequencing analysis of 125 tumour-normal pairs, conducted as part of the International Cancer Genome Consortium (ICGC) PedBrain Tumor Project. Tetraploidy was identified as a frequent early event in Group 3 and 4 tumours, and a positive correlation between patient age and mutation rate was observed. Several recurrent mutations were identified, both in known medulloblastoma-related genes (CTNNB1, PTCH1, MLL2, SMARCA4) and in genes not previously linked to this tumour (DDX3X, CTDNEP1, KDM6A, TBR1), often in subgroup-specific patterns. RNA sequencing confirmed these alterations, and revealed the expression of what are, to our knowledge, the first medulloblastoma fusion genes identified. Chromatin modifiers were frequently altered across all subgroups. These findings enhance our understanding of the genomic complexity and heterogeneity underlying medulloblastoma, and provide several potential targets for new therapeutics, especially for Group 3 and 4 patients.
A 15q24 microdeletion in transient myeloproliferative disease (TMD) and acute megakaryoblastic leukaemia (AMKL) implicates PML and SUMO3 in the leukaemogenesis of TMD/AMKL.
Haemmerling, S., Behnisch, W., Doerks, T., Korbel, J.O., Bork, P., Moog, U., Hentze, S., Grasshoff, U., Bonin, M., Riess, O., Janssen, J.W., Jauch, A., Bartram, C.R., Reinhardt, D., Koch, K.A., Bandapalli, O.R. & Kulozik, A.E.
Br J Haematol. 2012 Apr;157(2):180-7. doi: 10.1111/j.1365-2141.2012.09028.x. Epub 2012 Feb 1.
Transient myeloproliferative disorder (TMD) of the newborn and acute megakaryoblastic leukaemia (AMKL) in children with Down syndrome (DS) represent paradigmatic models of leukaemogenesis. Chromosome 21 gene dosage effects and truncating mutations of the X-chromosomal transcription factor GATA1 synergize to trigger TMD and AMKL in most patients. Here, we report the occurrence of TMD, which spontaneously remitted and later progressed to AMKL in a patient without DS but with a distinct dysmorphic syndrome. Genetic analysis of the leukaemic clone revealed somatic trisomy 21 and a truncating GATA1 mutation. The analysis of the patient's normal blood cell DNA on a genomic single nucleotide polymorphism (SNP) array revealed a de novo germ line 2.58 Mb 15q24 microdeletion including 41 known genes encompassing the tumour suppressor PML. Genomic context analysis of proteins encoded by genes that are included in the microdeletion, chromosome 21-encoded proteins and GATA1 suggests that the microdeletion may trigger leukaemogenesis by disturbing the balance of a hypothetical regulatory network of normal megakaryopoiesis involving PML, SUMO3 and GATA1. The 15q24 microdeletion may thus represent the first genetic hit to initiate leukaemogenesis and implicates PML and SUMO3 as novel components of the leukaemogenic network in TMD/AMKL.
Driver mutations in histone H3.3 and chromatin remodelling genes in paediatric glioblastoma.
Schwartzentruber, J., Korshunov, A., Liu, X.Y., Jones, D.T., Pfaff, E., Jacob, K., Sturm, D., Fontebasso, A.M., Quang, D.A., Tonjes, M., Hovestadt, V., Albrecht, S., Kool, M., Nantel, A., Konermann, C., Lindroth, A., Jager, N., Rausch, T., Ryzhova, M., Korbel, J.O., Hielscher, T., Hauser, P., Garami, M., Klekner, A., Bognar, L., Ebinger, M., Schuhmann, M.U., Scheurlen, W., Pekrun, A., Fruhwald, M.C., Roggendorf, W., Kramm, C., Durken, M., Atkinson, J., Lepage, P., Montpetit, A., Zakrzewska, M., Zakrzewski, K., Liberski, P.P., Dong, Z., Siegel, P., Kulozik, A.E., Zapatka, M., Guha, A., Malkin, D., Felsberg, J., Reifenberger, G., von Deimling, A., Ichimura, K., Collins, V.P., Witt, H., Milde, T., Witt, O., Zhang, C., Castelo-Branco, P., Lichter, P., Faury, D., Tabori, U., Plass, C., Majewski, J., Pfister, S.M. & Jabado, N.
Nature. 2012 Jan 29;482(7384):226-31. doi: 10.1038/nature10833.
Glioblastoma multiforme (GBM) is a lethal brain tumour in adults and children. However, DNA copy number and gene expression signatures indicate differences between adult and paediatric cases. To explore the genetic events underlying this distinction, we sequenced the exomes of 48 paediatric GBM samples. Somatic mutations in the H3.3-ATRX-DAXX chromatin remodelling pathway were identified in 44% of tumours (21/48). Recurrent mutations in H3F3A, which encodes the replication-independent histone 3 variant H3.3, were observed in 31% of tumours, and led to amino acid substitutions at two critical positions within the histone tail (K27M, G34R/G34V) involved in key regulatory post-translational modifications. Mutations in ATRX (alpha-thalassaemia/mental retardation syndrome X-linked) and DAXX (death-domain associated protein), encoding two subunits of a chromatin remodelling complex required for H3.3 incorporation at pericentric heterochromatin and telomeres, were identified in 31% of samples overall, and in 100% of tumours harbouring a G34R or G34V H3.3 mutation. Somatic TP53 mutations were identified in 54% of all cases, and in 86% of samples with H3F3A and/or ATRX mutations. Screening of a large cohort of gliomas of various grades and histologies (n = 784) showed H3F3A mutations to be specific to GBM and highly prevalent in children and young adults. Furthermore, the presence of H3F3A/ATRX-DAXX/TP53 mutations was strongly associated with alternative lengthening of telomeres and specific gene expression profiles. This is, to our knowledge, the first report to highlight recurrent mutations in a regulatory histone in humans, and our data suggest that defects of the chromatin architecture underlie paediatric and young adult GBM pathogenesis.
Genome Sequencing of Pediatric Medulloblastoma Links Catastrophic DNA Rearrangements with TP53 Mutations.
Rausch, T., Jones, D.T., Zapatka, M., Stutz, A.M., Zichner, T., Weischenfeldt, J., Jager, N., Remke, M., Shih, D., Northcott, P.A., Pfaff, E., Tica, J., Wang, Q., Massimi, L., Witt, H., Bender, S., Pleier, S., Cin, H., Hawkins, C., Beck, C., von Deimling, A., Hans, V., Brors, B., Eils, R., Scheurlen, W., Blake, J., Benes, V., Kulozik, A.E., Witt, O., Martin, D., Zhang, C., Porat, R., Merino, D.M., Wasserman, J., Jabado, N., Fontebasso, A., Bullinger, L., Rucker, F.G., Dohner, K., Dohner, H., Koster, J., Molenaar, J.J., Versteeg, R., Kool, M., Tabori, U., Malkin, D., Korshunov, A., Taylor, M.D., Lichter, P., Pfister, S.M. & Korbel, J.O.
Cell. 2012 Jan 20;148(1-2):59-71.
Genomic rearrangements are thought to occur progressively during tumor development. Recent findings, however, suggest an alternative mechanism, involving massive chromosome rearrangements in a one-step catastrophic event termed chromothripsis. We report the whole-genome sequencing-based analysis of a Sonic-Hedgehog medulloblastoma (SHH-MB) brain tumor from a patient with a germline TP53 mutation (Li-Fraumeni syndrome), uncovering massive, complex chromosome rearrangements. Integrating TP53 status with microarray and deep sequencing-based DNA rearrangement data in additional patients reveals a striking association between TP53 mutation and chromothripsis in SHH-MBs. Analysis of additional tumor entities substantiates a link between TP53 mutation and chromothripsis, and indicates a context-specific
High-resolution genomic profiling of chronic lymphocytic leukemia reveals new recurrent genomic alterations.
Edelmann, J., Holzmann, K., Miller, F., Winkler, D., Buhler, A., Zenz, T., Bullinger, L., Kuhn, M.W., Gerhardinger, A., Bloehdorn, J., Radtke, I., Su, X., Ma, J., Pounds, S., Hallek, M., Lichter, P., Korbel, J., Busch, R., Mertens, D., Downing, J.R., Stilgenbauer, S. & Dohner, H.
Blood. 2012 Dec 6;120(24):4783-94. doi: 10.1182/blood-2012-04-423517. Epub 2012 Oct 9.
To identify genomic alterations in chronic lymphocytic leukemia (CLL), we performed single-nucleotide polymorphism-array analysis using Affymetrix Version 6.0 on 353 samples from untreated patients entered in the CLL8 treatment trial. Based on paired-sample analysis (n = 144), a mean of 1.8 copy number alterations per patient were identified; approximately 60% of patients carried no copy number alterations other than those detected by fluorescence in situ hybridization analysis. Copy-neutral loss-of-heterozygosity was detected in 6% of CLL patients and was found most frequently on 13q, 17p, and 11q. Minimally deleted regions were refined on 13q14 (deleted in 61% of patients) to the DLEU1 and DLEU2 genes, on 11q22.3 (27% of patients) to ATM, on 2p16.1-2p15 (gained in 7% of patients) to a 1.9-Mb fragment containing 9 genes, and on 8q24.21 (5% of patients) to a segment 486 kb proximal to the MYC locus. 13q deletions exhibited proximal and distal breakpoint cluster regions. Among the most common novel lesions were deletions at 15q15.1 (4% of patients), with the smallest deletion (70.48 kb) found in the MGA locus. Sequence analysis of MGA in 59 samples revealed a truncating mutation in one CLL patient lacking a 15q deletion. MNT at 17p13.3, which in addition to MGA and MYC encodes for the network of MAX-interacting proteins, was also deleted recurrently.
Relating CNVs to transcriptome data at fine resolution: Assessment of the effect of variant size, type, and overlap with functional regions.
Schlattl, A., Anders, S., Waszak, S.M., Huber, W. & Korbel, J.O.
Genome Res. 2011 Dec;21(12):2004-13. Epub 2011 Aug 23.
Copy-number variants (CNVs) form an abundant class of genetic variation with a presumed widespread impact on individual traits. While recent advances, such as the population-scale sequencing of human genomes, facilitated the fine-scale mapping of CNVs, the phenotypic impact of most of these CNVs remains unclear. By relating copy-number genotypes to transcriptome sequencing data, we have evaluated the impact of CNVs, mapped at fine scale, on gene expression. Based on data from 129 individuals with ancestry from two populations, we identified CNVs associated with the expression of 110 genes, with 13% of the associations involving complex, multiallelic CNVs. Categorization of CNVs according to variant type, size, and gene overlap enabled us to examine the impact of different CNV classes on expression variation. While many small (<4 kb) CNVs were associated with expression variation, overall we observed an enrichment of large duplications and deletions, including large intergenic CNVs, relative to the entire set of expression-associated CNVs. Furthermore, the copy number of genes intersecting with CNVs typically correlated positively with the genes' expression, and also was more strongly correlated with expression than nearby single nucleotide polymorphisms, suggesting a frequent causal role of CNVs in expression quantitative trait loci (eQTLs). We also elucidated unexpected cases of negative correlations between copy number and expression by assessing the CNVs' effects on the structure and regulation of genes. Finally, we examined dosage compensation of transcript levels. Our results suggest that association studies can gain in resolution and power by including fine-scale CNV information, such as those obtained from population-scale sequencing.
Challenges in studying genomic structural variant formation mechanisms: the short-read dilemma and beyond.
Onishi-Seebacher, M. & Korbel, J.O.
Bioessays. 2011 Nov;33(11):840-50. doi: 10.1002/bies.201100075. Epub 2011 Sep 30.
Next-generation sequencing (NGS) technologies have revolutionised the analysis of genomic structural variants (SVs), providing significant insights into SV de novo formation based on analyses of rearrangement breakpoint junctions. The short DNA reads generated by NGS, however, have also created novel obstacles by biasing the ascertainment of SVs, an aspect that we refer to as the 'short-read dilemma'. For example, recent studies have found that SVs are often complex, with SV formation generating large numbers of breakpoints in a single event (multi-breakpoint SVs) or structurally polymorphic loci having multiple allelic states (multi-allelic SVs). This complexity may be obscured in short reads, unless the data is analysed and interpreted within its wider genomic context. We discuss how novel approaches will help to overcome the short-read dilemma, and how integration of other sources of information, including the structure of chromatin, may help in the future to deepen the understanding of SV formation processes.
A comprehensive map of mobile element insertion polymorphisms in humans.
Stewart, C., Kural, D., Stromberg, M.P., Walker, J.A., Konkel, M.K., Stutz, A.M., Urban, A.E., Grubert, F., Lam, H.Y., Lee, W.P., Busby, M., Indap, A.R., Garrison, E., Huff, C., Xing, J., Snyder, M.P., Jorde, L.B., Batzer, M.A., Korbel, J.O., Marth, G.T.; 1000 Genomes Project.
PLoS Genet. 2011 Aug;7(8):e1002236. Epub 2011 Aug 18.
As a consequence of the accumulation of insertion events over evolutionary time, mobile elements now comprise nearly half of the human genome. The Alu, L1, and SVA mobile element families are still duplicating, generating variation between individual genomes. Mobile element insertions (MEI) have been identified as causes for genetic diseases, including hemophilia, neurofibromatosis, and various cancers. Here we present a comprehensive map of 7,380 MEI polymorphisms from the 1000 Genomes Project whole-genome sequencing data of 185 samples in three major populations detected with two detection methods. This catalog enables us to systematically study mutation rates, population segregation, genomic distribution, and functional properties of MEI polymorphisms and to compare MEI to SNP variation from the same individuals. Population allele frequencies of MEI and SNPs are described, broadly, by the same neutral ancestral processes despite vastly different mutation mechanisms and rates, except in coding regions where MEI are virtually absent, presumably due to strong negative selection. A direct comparison of MEI and SNP diversity levels suggests a differential mobile element insertion rate among populations.
Mapping copy number variation by population-scale genome sequencing.
Mills, R.E., Walter, K., Stewart, C., Handsaker, R.E., Chen, K., Alkan, C., Abyzov, A., Yoon, S.C., Ye, K., Cheetham, R.K., Chinwalla, A., Conrad, D.F., Fu, Y., Grubert, F., Hajirasouliha, I., Hormozdiari, F., Iakoucheva, L.M., Iqbal, Z., Kang, S., Kidd, J.M., Konkel, M.K., Korn, J., Khurana, E., Kural, D., Lam, H.Y., Leng, J., Li, R., Li, Y., Lin, C.Y., Luo, R., Mu, X.J., Nemesh, J., Peckham, H.E., Rausch, T., Scally, A., Shi, X., Stromberg, M.P., Stutz, A.M., Urban, A.E., Walker, J.A., Wu, J., Zhang, Y., Zhang, Z.D., Batzer, M.A., Ding, L., Marth, G.T., McVean, G., Sebat, J., Snyder, M., Wang, J., Ye, K., Eichler, E.E., Gerstein, M.B., Hurles, M.E., Lee, C., McCarroll, S.A., Korbel, J.O.; 1000 Genomes Project.
Nature. 2011 Feb 3;470(7332):59-65.
Genomic structural variants (SVs) are abundant in humans, differing from other forms of variation in extent, origin and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (that is, copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analysing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differe
International network of cancer genome projects.
International Cancer Genome Consortium.
Nature. 2010;464(7291):993-998.
A map of human genome variation from population-scale sequencing.
1000 Genomes Project Consortium.
Nature. 2010 Oct 28;467(7319):1061-73
The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10^-8 per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.
The baker's yeast diploid genome is remarkably stable in vegetative growth and meiosis.
Nishant, K.T., Wei, W., Mancera, E., Argueso, J.L., Schlattl, A., Delhomme, N., Ma, X., Bustamante, C.D., Korbel, J.O., Gu, Z., Steinmetz, L.M. & Alani, E.
PLoS Genet. 2010 Sep 9;6(9):e1001109.
Accurate estimates of mutation rates provide critical information to analyze genome evolution and organism fitness. We used whole-genome DNA sequencing, pulse-field gel electrophoresis, and comparative genome hybridization to determine mutation rates in diploid vegetative and meiotic mutation accumulation lines of Saccharomyces cerevisiae. The vegetative lines underwent only mitotic divisions while the meiotic lines underwent a meiotic cycle every approximately 20 vegetative divisions. Similar base substitution rates were estimated for both lines. Given our experimental design, these measures indicated that the meiotic mutation rate is within the range of being equal to zero to being 55-fold higher than the vegetative rate. Mutations detected in vegetative lines were all heterozygous while those in meiotic lines were homozygous. A quantitative analysis of intra-tetrad mating events in the meiotic lines showed that inter-spore mating is primarily responsible for rapidly fixing mutations to homozygosity as well as for removing mutations. We did not observe 1-2 nt insertion/deletion (in-del) mutations in any of the sequenced lines and only one structural variant in a non-telomeric location was found. However, a large number of structural variations in subtelomeric sequences were seen in both vegetative and meiotic lines that did not affect viability. Our results indicate that the diploid yeast nuclear genome is remarkably stable during the vegetative and meiotic cell cycles and support the hypothesis that peripheral regions of chromosomes are more dynamic than gene-rich central sections where structural rearrangements could be deleterious. This work also provides an improved estimate for the mutational load carried by diploid organisms.
Potential and challenges of personalized genomics and the 1000 Genomes Project.
Stütz, A.M. & Korbel, J.O.
Medizinische Genetik. 2010 Jun;22(2):242-7.
The ability to sequence entire individual human genomes has heralded a new era in human genetics. Such advances in sequencing technologies make it possible to address new questions such as the generation of a comprehensive map of common and rare genetic variants in humans. The 1000 Genomes Project will analyze 2500 genomes and is expected to greatly expand our knowledge about genomic variation, both on single nucleotide polymorphisms and genomic structural variants in a number of human ethnic populations. Furthermore, the possibility to use these new sequencing technologies for such large scale projects will be evaluated. Finally, new bioinformatics solutions will be developed to efficiently store and process such large volumes of data for the scientific community. This catalogue of common and rare variations will facilitate the development of better methods for phenotype-genotype associations and help uncover the molecular bases for a variety of diseases in the near future.
Variation in transcription factor binding among humans.
Kasowski, M., Grubert, F., Heffelfinger, C., Hariharan, M., Asabere, A., Waszak, S.M., Habegger, L., Rozowsky, J., Shi, M., Urban, A.E., Hong, M.Y., Karczewski, K.J., Huber, W., Weissman, S.M., Gerstein, M.B., Korbel, J.O. & Snyder, M.
Science. 2010 Apr 9;328(5975):232-5. Epub 2010 Mar 18.
Differences in gene expression may play a major role in speciation and phenotypic diversity. We examined genome-wide differences in transcription factor (TF) binding in several humans and a single chimpanzee by using chromatin immunoprecipitation followed by sequencing. The binding sites of RNA polymerase II (PolII) and a key regulator of immune responses, nuclear factor kappaB (p65), were mapped in 10 lymphoblastoid cell lines, and 25 and 7.5% of the respective binding regions were found to differ between individuals. Binding differences were frequently associated with single-nucleotide polymorphisms and genomic structural variants, and these differences were often correlated with differences in gene expression, suggesting functional consequences of binding variation. Furthermore, comparing PolII binding between humans and chimpanzee suggests extensive divergence in TF binding. Our results indicate that many differences in individuals and species occur at the level of TF binding, and they provide insight into the genetic events responsible for these differences.
Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library.
Lam, H.Y., Mu, X.J., Stutz, A.M., Tanzer, A., Cayting, P.D., Snyder, M., Kim, P.M., Korbel, J.O. & Gerstein, M.B.
Nat Biotechnol. 2010 Jan;28(1):47-55. Epub 2009 Dec 27.
Structural variants (SVs) are a major source of human genomic variation; however, characterizing them at nucleotide resolution remains challenging. Here we assemble a library of breakpoints at nucleotide resolution from collating and standardizing ~2,000 published SVs. For each breakpoint, we infer its ancestral state (through comparison to primate genomes) and its mechanism of formation (e.g., nonallelic homologous recombination, NAHR). We characterize breakpoint sequences with respect to genomic landmarks, chromosomal location, sequence motifs and physical properties, finding that the occurrence of insertions and deletions is more balanced than previously reported and that NAHR-formed breakpoints are associated with relatively rigid, stable DNA helices. Finally, we demonstrate an approach, BreakSeq, for scanning the reads from short-read sequenced genomes against our breakpoint library to accurately identify previously overlooked SVs, which we then validate by PCR. As new data become available, we expect our BreakSeq approach will become more sensitive and facilitate rapid SV genotyping of personal genomes.
Systematic inference of copy-number genotypes from personal genome sequencing data reveals extensive olfactory receptor gene content diversity.
Waszak, S.M., Hasin, Y., Zichner, T., Olender, T., Keydar, I., Khen, M., Stütz, A.M., Schlattl, A., Lancet, D. & Korbel, J.O.
PLoS Comput Biol. 2010 Nov 11;6(11):e1000988.
Copy-number variations (CNVs) are widespread in the human genome, but comprehensive assignments of integer locus copy-numbers (i.e., copy-number genotypes) that, for example, enable discrimination of homozygous from heterozygous CNVs, have remained challenging. Here we present CopySeq, a novel computational approach with an underlying statistical framework that analyzes the depth-of-coverage of high-throughput DNA sequencing reads, and can incorporate paired-end and breakpoint junction analysis based CNV-analysis approaches, to infer locus copy-number genotypes. We benchmarked CopySeq by genotyping 500 chromosome 1 CNV regions in 150 personal genomes sequenced at low-coverage. The assessed copy-number genotypes were highly concordant with our performed qPCR experiments (Pearson correlation coefficient 0.94), and with the published results of two microarray platforms (95-99% concordance). We further demonstrated the utility of CopySeq for analyzing gene regions enriched for segmental duplications by comprehensively inferring copy-number genotypes in the CNV-enriched >800 olfactory receptor (OR) human gene and pseudogene loci. CopySeq revealed that OR loci display an extensive range of locus copy-numbers across individuals, with zero to two copies in some OR loci, and two to nine copies in others. Among genetic variants affecting OR loci we identified deleterious variants including CNVs and SNPs affecting approximately 15% and approximately 20% of the human OR gene repertoire, respectively, implying that genetic variants with a possible impact on smell perception are widespread. Finally, we found that for several OR loci the reference genome appears to represent a minor-frequency variant, implying a necessary revision of the OR repertoire for future functional studies. CopySeq can ascertain genomic structural variation in specific gene families as well as at a genome-wide scale, where it may enable the quantitative evaluation of CNVs in genome-wide association studies involving high-throughput sequencing.
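Illustrative note (not part of the abstract): the basic idea of turning depth-of-coverage into an integer copy-number genotype can be sketched as below. This is a naive ratio-and-round estimate under the assumption of a diploid baseline; it is not the CopySeq statistical framework, and the function name and parameters are hypothetical.

```python
from statistics import median

def copy_number_from_depth(locus_depth, reference_depths, ploidy=2):
    """Naive integer copy-number estimate from read depth.

    locus_depth: mean read depth over the locus of interest.
    reference_depths: depths of regions assumed to be at normal copy number.
    ploidy: copy number assumed for the reference regions (assumption: 2).
    """
    baseline = median(reference_depths)
    if baseline <= 0:
        raise ValueError("baseline depth must be positive")
    # Scale the depth ratio by the baseline ploidy and round to an integer.
    return max(0, round(ploidy * locus_depth / baseline))
```

With a baseline depth of about 30x, a locus covered at about 15x would be called as one copy (a heterozygous deletion) and a locus at about 1x as zero copies (a homozygous deletion).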
The genetic architecture of Down syndrome phenotypes revealed by high-resolution analysis of human segmental trisomies.
Korbel, J.O., Tirosh-Wagner, T., Urban, A.E., Chen, X.N., Kasowski, M., Dai, L., Grubert, F., Erdman, C., Gao, M.C., Lange, K., Sobel, E.M., Barlow, G.M., Aylsworth, A.S., Carpenter, N.J., Clark, R.D., Cohen, M.Y., Doran, E., Falik-Zaccai, T., Lewin, S.O., Lott, I.T., McGillivray, B.C., Moeschler, J.B., Pettenati, M.J., Pueschel, S.M., Rao, K.W., Shaffer, L.G., Shohat, M., Van Riper, A.J., Warburton, D., Weissman, S., Gerstein, M.B., Snyder, M. & Korenberg, J.R.
Proc Natl Acad Sci U S A. 2009 Jul 21;106(29):12031-6. Epub 2009 Jul 13.
Down syndrome (DS), or trisomy 21, is a common disorder associated with several complex clinical phenotypes. Although several hypotheses have been put forward, it is unclear as to whether particular gene loci on chromosome 21 (HSA21) are sufficient to cause DS and its associated features. Here we present a high-resolution genetic map of DS phenotypes based on an analysis of 30 subjects carrying rare segmental trisomies of various regions of HSA21. By using state-of-the-art genomics technologies we mapped segmental trisomies at exon-level resolution and identified discrete regions of 1.8-16.3 Mb likely to be involved in the development of 8 DS phenotypes, 4 of which are congenital malformations, including acute megakaryocytic leukemia, transient myeloproliferative disorder, Hirschsprung disease, duodenal stenosis, imperforate anus, severe mental retardation, DS-Alzheimer Disease, and DS-specific congenital heart disease (DSCHD). Our DS-phenotypic maps located DSCHD to a <2-Mb interval. Furthermore, the map enabled us to present evidence against the necessary involvement of other loci as well as specific hypotheses that have been put forward in relation to the etiology of DS, i.e., the presence of a single DS consensus region and the sufficiency of DSCR1 and DYRK1A, or APP, in causing several severe DS phenotypes. Our study demonstrates the value of combining advanced genomics with cohorts of rare patients for studying DS, a prototype for the role of copy-number variation in complex disease.
Distinct genomic aberrations associated with ERG rearranged prostate cancer.
Demichelis, F., Setlur, S.R., Beroukhim, R., Perner, S., Korbel, J.O., Lafargue, C.J., Pflueger, D., Pina, C., Hofer, M.D., Sboner, A., Svensson, M.A., Rickman, D.S., Urban, A., Snyder, M., Meyerson, M., Lee, C., Gerstein, M.B., Kuefer, R. & Rubin, M.A.
Genes Chromosomes Cancer. 2009 Apr;48(4):366-80.
Emerging molecular and clinical data suggest that ETS fusion prostate cancer represents a distinct molecular subclass, driven most commonly by a hormonally regulated promoter and characterized by an aggressive natural history. The study of the genomic landscape of prostate cancer in the light of ETS fusion events is required to understand the foundation of this molecularly and clinically distinct subtype. We performed genome-wide profiling of 49 primary prostate cancers and identified 20 recurrent chromosomal copy number aberrations, mainly occurring as genomic losses. Co-occurring events included losses at 19q13.32 and 1p22.1. We discovered three genomic events associated with ERG rearranged prostate cancer, affecting 6q, 7q, and 16q. 6q loss in nonrearranged prostate cancer is accompanied by gene expression deregulation in an independent dataset and by protein deregulation of MYO6. To analyze copy number alterations within the ETS genes, we performed a comprehensive analysis of all 27 ETS genes and of the 3 Mbp genomic area between ERG and TMPRSS2 (21q) with an unprecedented resolution (30 bp). We demonstrate that high-resolution tiling arrays can be used to pin-point breakpoints leading to fusion events. This study provides further support to define a distinct molecular subtype of prostate cancer based on the presence of ETS gene rearrangements.
Quantifying environmental adaptation of metabolic pathways in metagenomics.
Gianoulis, T.A., Raes, J., Patel, P.V., Bjornson, R., Korbel, J.O., Letunic, I., Yamada, T., Paccanaro, A., Jensen, L.J., Snyder, M., Bork, P. & Gerstein, M.B.
Proc Natl Acad Sci U S A. 2009 Feb 3;106(5):1374-9. Epub 2009 Jan 22.
Recently, approaches have been developed to sample the genetic content of heterogeneous environments (metagenomics). However, by what means these sequences link distinct environmental conditions with specific biological processes is not well understood. Thus, a major challenge is how the usage of particular pathways and subnetworks reflects the adaptation of microbial communities across environments and habitats, i.e., how network dynamics relates to environmental features. Previous research has treated environments as discrete, somewhat simplified classes (e.g., terrestrial vs. marine), and searched for obvious metabolic differences among them (i.e., treating the analysis as a typical classification problem). However, environmental differences result from combinations of many factors, which often vary only slightly. Therefore, we introduce an approach that employs correlation and regression to relate multiple, continuously varying factors defining an environment to the extent of particular microbial pathways present in a geographic site. Moreover, rather than looking only at individual correlations (one-to-one), we adapted canonical correlation analysis and related techniques to define an ensemble of weighted pathways that maximally covaries with a combination of environmental variables (many-to-many), which we term a metabolic footprint. Applied to available aquatic datasets, we identified footprints predictive of their environment that can potentially be used as biosensors. For example, we show a strong multivariate correlation between the energy-conversion strategies of a community and multiple environmental gradients (e.g., temperature). Moreover, we identified covariation in amino acid transport and cofactor synthesis, suggesting that limiting amounts of cofactor can (partially) explain increased import of amino acids in nutrient-limited conditions.
PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data.
Korbel, J.O., Abyzov, A., Mu, X.J., Carriero, N., Cayting, P., Zhang, Z., Snyder, M. & Gerstein, M.B.
Genome Biol. 2009 Feb 23;10(2):R23.
Personal-genomics endeavors, such as the "1000 Genomes Project", are generating maps of genomic structural variants (SVs) by analyzing ends of massively sequenced genome-fragments. To process these we developed Paired-End Mapper (PEMer; http://sv.gersteinlab.org/pemer). This comprises a parallelizable analysis pipeline, compatible with several next-generation sequencing platforms; simulation-based error models, yielding confidence-values for each SV; and a back-end database. The simulations demonstrated high SV-reconstruction efficiency for PEMer's coverage-adjusted multi-cutoff scoring-strategy and showed its relative insensitivity to base-calling errors.
MSB: a mean-shift-based approach for the analysis of structural variation in the genome.
Wang, L.Y., Abyzov, A., Korbel, J.O., Snyder, M. & Gerstein, M.
Genome Res. 2009 Jan;19(1):106-17. Epub 2008 Nov 26.
Genome structural variation includes segmental duplications, deletions, and other rearrangements, and array-based comparative genomic hybridization (array-CGH) is a popular technology for determining this. Drawing relevant conclusions from array-CGH requires computational methods for partitioning the chromosome into segments of elevated, reduced, or unchanged copy number. Several approaches have been described, most of which attempt to explicitly model the underlying distribution of data based on particular assumptions. Often, they optimize likelihood functions for estimating model parameters, by expectation maximization or related approaches; however, this requires good parameter initialization through prespecifying the number of segments. Moreover, convergence is difficult to achieve, since many parameters are required to characterize an experiment. To overcome these limitations, we propose a nonparametric method without a global criterion to be optimized. Our method involves mean-shift-based (MSB) procedures; it considers the observed array-CGH signal as sampling from a probability-density function, uses a kernel-based approach to estimate local gradients for this function, and iteratively follows them to determine local modes of the signal. Overall, our method achieves robust discontinuity-preserving smoothing, thus accurately segmenting chromosomes into regions of duplication and deletion. It does not require the number of segments as input, nor does its convergence depend on this. We successfully applied our method to both simulated data and array-CGH experiments on glioblastoma and adenocarcinoma. We show that it performs at least as well as, and often better than, 10 previously published algorithms. Finally, we show that our approach can be extended to segmenting the signal resulting from the depth-of-coverage of mapped reads from next-generation sequencing.
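Illustrative note (not part of the abstract): the kernel-based, mode-seeking smoothing described above can be sketched as generic mean-shift filtering in the joint probe-position/signal-value domain. This is a minimal sketch under stated assumptions, not the authors' MSB implementation; the function name, the Gaussian kernel, and the bandwidths h_pos and h_val are hypothetical choices made for illustration.

```python
import math

def mean_shift_smooth(values, h_pos=3.0, h_val=0.2, tol=1e-4, max_iter=100):
    """Discontinuity-preserving smoothing of a 1D signal by mean shift.

    Each probe (index, value) is moved repeatedly to the kernel-weighted
    mean of all probes until it settles on a local mode of the joint
    position/value density. The value at that mode is the smoothed signal.
    """
    points = list(enumerate(values))
    smoothed = []
    for p, y in points:
        p = float(p)
        for _ in range(max_iter):
            wsum = psum = ysum = 0.0
            for q, z in points:
                # Gaussian weight: probes that are far away in position OR
                # in signal value contribute little, so steps stay sharp.
                w = math.exp(-0.5 * (((p - q) / h_pos) ** 2
                                     + ((y - z) / h_val) ** 2))
                wsum += w
                psum += w * q
                ysum += w * z
            new_p, new_y = psum / wsum, ysum / wsum
            moved = abs(new_p - p) + abs(new_y - y)
            p, y = new_p, new_y
            if moved < tol:
                break
        smoothed.append(y)
    return smoothed
```

Note that the number of segments is never supplied: probes on either side of a copy-number step converge to different modes, and segment boundaries can then be read off wherever consecutive smoothed values jump.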
High-resolution copy-number variation map reflects human olfactory receptor diversity and evolution.
Hasin, Y., Olender, T., Khen, M., Gonzaga-Jauregui, C., Kim, P.M., Urban, A.E., Snyder, M., Gerstein, M.B., Lancet, D. & Korbel, J.O.
PLoS Genet. 2008 Nov;4(11):e1000249. Epub 2008 Nov 7.
Olfactory receptors (ORs), which are involved in odorant recognition, form the largest mammalian protein superfamily. The genomic content of OR genes is considerably reduced in humans, as reflected by the relatively small repertoire size and the high fraction (approximately 55%) of human pseudogenes. Since several recent low-resolution surveys suggested that OR genomic loci are frequently affected by copy-number variants (CNVs), we hypothesized that CNVs may play an important role in the evolution of the human olfactory repertoire. We used high-resolution oligonucleotide tiling microarrays to detect CNVs across 851 OR gene and pseudogene loci. Examining genomic DNA from 25 individuals with ancestry from three populations, we identified 93 OR gene loci and 151 pseudogene loci affected by CNVs, generating a mosaic of OR dosages across persons. Our data suggest that approximately 50% of the CNVs involve more than one OR, with the largest CNV spanning 11 loci. In contrast to earlier reports, we observe that CNVs are more frequent among OR pseudogenes than among intact genes, presumably due to both selective constraints and CNV formation biases. Furthermore, our results show an enrichment of CNVs among ORs with a close human paralog or lacking a one-to-one ortholog in chimpanzee. Interestingly, among the latter we observed an enrichment in CNV losses over gains, a finding potentially related to the known diminution of the human OR repertoire. Quantitative PCR experiments performed for 122 sampled ORs agreed well with the microarray results and uncovered 23 additional CNVs. Importantly, these experiments allowed us to uncover nine common deletion alleles that affect 15 OR genes and five pseudogenes. Comparison to the chimpanzee reference genome revealed that all of the deletion alleles are human derived, therefore indicating a profound effect of human-specific deletions on the individual OR gene content. Furthermore, these deletion alleles may be used in future genetic association studies of olfactory inter-individual differences.
The current excitement about copy-number variation: how it relates to gene duplications and protein families.
Korbel, J.O., Kim, P.M., Chen, X., Urban, A.E., Weissman, S., Snyder, M. & Gerstein, M.B.
Curr Opin Struct Biol. 2008 May 27.
Following recent technological advances there has been an increasing interest in genome structural variants (SVs), in particular copy-number variants (CNVs) - large-scale duplications and deletions. Although not immediately evident, CNV surveys make a conceptual connection between the fields of population genetics and protein families, in particular with regard to the stability and expandability of families. The mechanisms giving rise to CNVs can be considered as fundamental processes underlying gene duplication and loss; duplicated genes being the results of 'successful' copies, fixed and maintained in the population. Conversely, many 'unsuccessful' duplicates remain in the genome as pseudogenes. Here, we survey studies on CNVs, highlighting issues related to protein families. In particular, CNVs tend to affect specific gene functional categories, such as those associated with environmental response, and are depleted in genes related to basic cellular processes. Furthermore, CNVs occur more often at the periphery of the protein interaction network. In comparison, protein families associated with successful and unsuccessful duplicates are associated with similar functional categories but are differentially placed in the interaction network. These trends are likely reflective of CNV formation biases and natural selection, both of which differentially influence distinct protein families.
Analysis of copy number variation in the rhesus macaque genome identifies candidate loci for evolutionary and human disease studies.
Lee, A.S., Gutierrez-Arcelus, M., Perry, G.H., Vallender, E.J., Johnson, W.E., Miller, G.M., Korbel, J.O. & Lee, C.
Hum Mol Genet. 2008 Apr 15;17(8):1127-36. Epub 2008 Jan 7.
Copy number variants (CNVs) are heritable gains and losses of genomic DNA in normal individuals. While copy number variation is widely studied in humans, our knowledge of CNVs in other mammalian species is more limited. We have designed a custom array-based comparative genomic hybridization (aCGH) platform with 385 000 oligonucleotide probes based on the reference genome sequence of the rhesus macaque (Macaca mulatta), the most widely studied non-human primate in biomedical research. We used this platform to identify 123 CNVs among 10 unrelated macaque individuals, with 24% of the CNVs observed in multiple individuals. We found that segmental duplications were significantly enriched at macaque CNV loci. We also observed significant overlap between rhesus macaque and human CNVs, suggesting that certain genomic regions are prone to recurrent CNV formation and instability, even across a total of approximately 50 million years of primate evolution (approximately 25 million years in each lineage). Furthermore, for eight of the CNVs that were observed in both humans and macaques, previous human studies have reported a relationship between copy number and gene expression or disease susceptibility. Therefore, the rhesus macaque offers an intriguing, non-human primate outbred model organism with which hypotheses concerning the specific functions of phenotypically relevant human CNVs can be tested.
Positive selection at the protein network periphery: evaluation in terms of structural constraints and cellular context.
Kim, P.M., Korbel, J.O. & Gerstein, M.B.
Proc Natl Acad Sci U S A. 2007 Dec 18;104(51):20274-9. Epub 2007 Dec 12.
Because of recent advances in genotyping and sequencing, human genetic variation and adaptive evolution in the primate lineage have become major research foci. Here, we examine the relationship between genetic signatures of adaptive evolution and network topology. We find a striking tendency of proteins that have been under positive selection (as compared with the chimpanzee) to be located at the periphery of the interaction network. Our results are based on the analysis of two types of genome evolution, both in terms of intra- and interspecies variation. First, we looked at single-nucleotide polymorphisms and their fixed variants, single-nucleotide differences in the human genome relative to the chimpanzee. Second, we examine fixed structural variants, specifically large segmental duplications and their polymorphic precursors known as copy number variants. We propose two complementary mechanisms that lead to the observed trends. First, we can rationalize them in terms of constraints imposed by protein structure: We find that positively selected sites are preferentially located on the exposed surface of proteins. Because central network proteins (hubs) are likely to have a larger fraction of their surface involved in interactions, they tend to be constrained and under negative selection. Conversely, we show that the interaction network roughly maps to cellular organization, with the periphery of the network corresponding to the cellular periphery (i.e., extracellular space or cell membrane). This suggests that the observed positive selection at the network periphery may be due to an increase of adaptive events on the cellular periphery responding to changing environments.
Paired-end mapping reveals extensive structural variation in the human genome.
Korbel, J.O., Urban, A.E., Affourtit, J.P., Godwin, B., Grubert, F., Simons, J.F., Kim, P.M., Palejev, D., Carriero, N.J., Du, L., Taillon, B.E., Chen, Z., Tanzer, A., Saunders, A.C., Chi, J., Yang, F., Carter, N.P., Hurles, M.E., Weissman, S.M., Harkins, T.T., Gerstein, M.B., Egholm, M. & Snyder, M.
Science. 2007 Oct 19;318(5849):420-6. Epub 2007 Sep 27.
Structural variation of the genome involves kilobase- to megabase-sized deletions, duplications, insertions, inversions, and complex combinations of rearrangements. We introduce high-throughput and massive paired-end mapping (PEM), a large-scale genome-sequencing method to identify structural variants (SVs) approximately 3 kilobases (kb) or larger that combines the rescue and capture of paired ends of 3-kb fragments, massive 454 sequencing, and a computational approach to map DNA reads onto a reference genome. PEM was used to map SVs in an African and in a putatively European individual and identified shared and divergent SVs relative to the reference genome. Overall, we fine-mapped more than 1000 SVs and documented that the number of SVs among humans is much larger than initially hypothesized; many of the SVs potentially affect gene function. The breakpoint junction sequences of more than 200 SVs were determined with a novel pooling strategy and computational analysis. Our analysis provided insights into the mechanisms of SV formation in humans.
Use of pathway analysis and genome context methods for functional genomics of Mycoplasma pneumoniae nucleotide metabolism.
Pachkov, M., Dandekar, T., Korbel, J., Bork, P. & Schuster, S.
Gene. 2007 Jul 15;396(2):215-25. Epub 2007 Mar 24.
Elementary modes analysis allows one to reveal whether a set of known enzymes is sufficient to sustain functionality of the cell. Moreover, it is helpful in detecting missing reactions and predicting which enzymes could fill these gaps. Here, we perform a comprehensive elementary modes analysis and a genomic context analysis of Mycoplasma pneumoniae nucleotide metabolism, and search for new enzyme activities. The purine and pyrimidine networks are reconstructed by assembling enzymes annotated in the genome or found experimentally. We show that these reaction sets are sufficient for enabling synthesis of DNA and RNA in M. pneumoniae. Special focus is on the key modes for growth. Moreover, we make an educated guess on the nutritional requirements of this micro-organism. For the case that M. pneumoniae does not require adenine as a substrate, we suggest adenylosuccinate synthetase (EC 6.3.4.4), adenylosuccinate lyase (EC 4.3.2.2) and GMP reductase (EC 1.7.1.7) to be operative. GMP reductase activity is putatively assigned to the NRDI_MYCPN gene on the basis of the genomic context analysis. For the pyrimidine network, we suggest CTP synthase (EC 6.3.4.2) to be active. Further experiments on the nutritional requirements are needed to make a decision. Pyrimidine metabolism appears to be more appropriate as a drug target than purine metabolism since it shows lower plasticity.
Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome.
Korbel, J.O., Urban, A.E., Grubert, F., Du, J., Royce, T.E., Starr, P., Zhong, G., Emanuel, B.S., Weissman, S.M., Snyder, M. & Gerstein, M.B.
Proc Natl Acad Sci U S A. 2007 Jun 12;104(24):10110-5. Epub 2007 Jun 5.
Copy-number variants (CNVs) are an abundant form of genetic variation in humans. However, approaches for determining exact CNV breakpoint sequences (physical deletion or duplication boundaries) across individuals, crucial for associating genotype to phenotype, have been lacking so far, and the vast majority of CNVs have been reported with approximate genomic coordinates only. Here, we report an approach, called BreakPtr, for fine-mapping CNVs (available from http://breakptr.gersteinlab.org). We statistically integrate both sequence characteristics and data from high-resolution comparative genome hybridization experiments in a discrete-valued, bivariate hidden Markov model. Incorporation of nucleotide-sequence information allows us to take into account the fact that recently duplicated sequences (e.g., segmental duplications) often coincide with breakpoints. In anticipation of an upcoming increase in CNV data, we developed an iterative, "active" approach to initially scoring with a preliminary model, performing targeted validations, retraining the model, and then rescoring, and a flexible parameterization system that intuitively collapses from a full model of 2,503 parameters to a core one of only 10. Using our approach, we accurately mapped >400 breakpoints on chromosome 22 and a region of chromosome 11, refining the boundaries of many previously approximately mapped CNVs. Four predicted breakpoints flanked known disease-associated deletions. We validated an additional four predicted CNV breakpoints by sequencing. Overall, our results suggest a predictive resolution of approximately 300 bp. This level of resolution enables more precise correlations between CNVs and across individuals than previously possible, allowing the study of CNV population frequencies. Further, it enabled us to demonstrate a clear Mendelian pattern of inheritance for one of the CNVs.
Structured RNAs in the ENCODE selected regions of the human genome.
Washietl, S., Pedersen, J.S., Korbel, J.O., Stocsits, C., Gruber, A.R., Hackermuller, J., Hertel, J., Lindemeyer, M., Reiche, K., Tanzer, A., Ucla, C., Wyss, C., Antonarakis, S.E., Denoeud, F., Lagarde, J., Drenkow, J., Kapranov, P., Gingeras, T.R., Guigo, R., Snyder, M., Gerstein, M.B., Reymond, A., Hofacker, I.L. & Stadler, P.F.
Genome Res. 2007 Jun;17(6):852-64.
Functional RNA structures play an important role both in the context of noncoding RNA transcripts as well as regulatory elements in mRNAs. Here we present a computational study to detect functional RNA structures within the ENCODE regions of the human genome. Since structural RNAs in general lack characteristic signals in primary sequence, comparative approaches evaluating evolutionary conservation of structures are most promising. We have used three recently introduced programs based on either phylogenetic-stochastic context-free grammar (EvoFold) or energy directed folding (RNAz and AlifoldZ), yielding several thousand candidate structures (corresponding to approximately 2.7% of the ENCODE regions). EvoFold has its highest sensitivity in highly conserved and relatively AU-rich regions, while RNAz favors slightly GC-rich regions, resulting in a relatively small overlap between methods. Comparison with the GENCODE annotation points to functional RNAs in all genomic contexts, with a slightly increased density in 3'-UTRs. While we estimate a significant false discovery rate of approximately 50%-70%, many of the predictions can be further substantiated by additional criteria: 248 loci are predicted by both RNAz and EvoFold, and an additional 239 RNAz or EvoFold predictions are supported by the (more stringent) AlifoldZ algorithm. Five hundred seventy RNAz structure predictions fall into regions that show signs of selection pressure also on the sequence level (i.e., conserved elements). More than 700 predictions overlap with noncoding transcripts detected by oligonucleotide tiling arrays. One hundred seventy-five selected candidates were tested by RT-PCR in six tissues, and expression could be verified in 43 cases (24.6%).
The DART classification of unannotated transcription within the ENCODE regions: associating transcription with known and novel loci.
Rozowsky, J.S., Newburger, D., Sayward, F., Wu, J., Jordan, G., Korbel, J.O., Nagalakshmi, U., Yang, J., Zheng, D., Guigo, R., Gingeras, T.R., Weissman, S., Miller, P., Snyder, M. & Gerstein, M.B.
Genome Res. 2007 Jun;17(6):732-45.
For the approximately 1% of the human genome in the ENCODE regions, only about half of the transcriptionally active regions (TARs) identified with tiling microarrays correspond to annotated exons. Here we categorize this large amount of "unannotated transcription." We use a number of disparate features to classify the 6988 novel TARs: array expression profiles across cell lines and conditions, sequence composition, phylogenetic profiles (presence/absence of syntenic conservation across 17 species), and locations relative to genes. In the classification, we first filter out TARs with unusual sequence composition and those likely resulting from cross-hybridization. We then associate some of those remaining with proximal exons having correlated expression profiles. Finally, we cluster unclassified TARs into putative novel loci, based on similar expression and phylogenetic profiles. To encapsulate our classification, we construct a Database of Active Regions and Tools (DART.gersteinlab.org). DART has special facilities for rapidly handling and comparing many sets of TARs and their heterogeneous features, synchronizing across builds, and interfacing with other resources. Overall, we find that approximately 14% of the novel TARs can be associated with known genes, while approximately 21% can be clustered into approximately 200 novel loci. We observe that TARs associated with genes are enriched in the potential to form structural RNAs and many novel TAR clusters are associated with nearby promoters. To benchmark our classification, we design a set of experiments for testing the connectivity of novel TARs. Overall, we find that 18 of the 46 connections tested validate by RT-PCR and four of five sequenced PCR products confirm connectivity unambiguously.
What is a gene, post-ENCODE? History and updated definition.
Gerstein, M.B., Bruce, C., Rozowsky, J.S., Zheng, D., Du, J., Korbel, J.O., Emanuelsson, O., Zhang, Z.D., Weissman, S. & Snyder, M.
Genome Res. 2007 Jun;17(6):669-81.
While sequencing of the human genome surprised us with how many protein-coding genes there are, it did not fundamentally change our perspective on what a gene is. In contrast, the complex patterns of dispersed regulation and pervasive transcription uncovered by the ENCODE project, together with non-genic conservation and the abundance of noncoding RNA genes, have challenged the notion of the gene. To illustrate this, we review the evolution of operational definitions of a gene over the past century--from the abstract elements of heredity of Mendel and Morgan to the present-day ORFs enumerated in the sequence databanks. We then summarize the current ENCODE findings and provide a computational metaphor for the complexity. Finally, we propose a tentative update to the definition of a gene: A gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products. Our definition side-steps the complexities of regulation and transcription by removing the former altogether from the definition and arguing that final, functional gene products (rather than intermediate transcripts) should be used to group together entities associated with a single gene. It also manifests how integral the concept of biological function is in defining genes.
Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project.
ENCODE Project Consortium.
Nature. 2007 Jun 14;447(7146):799-816.
Prediction of effective genome size in metagenomic samples.
Raes, J., Korbel, J.O., Lercher, M.J., von Mering, C. & Bork, P.
Genome Biol. 2007;8(1):R10.
We introduce a novel computational approach to predict effective genome size (EGS; a measure that includes multiple plasmid copies, inserted sequences, and associated phages and viruses) from short sequencing reads of environmental genomics (or metagenomics) projects. We observe considerable EGS differences between environments and link this with ecologic complexity as well as species composition (for instance, the presence of eukaryotes). For example, we estimate EGS in a complex, organism-dense farm soil sample at about 6.3 megabases (Mb) whereas that of the bacteria therein is only 4.7 Mb; for bacteria in a nutrient-poor, organism-sparse ocean surface water sample, EGS is as low as 1.6 Mb. The method also permits evaluation of completion status and assembly bias in single-genome sequencing projects.
Global identification and characterization of transcriptionally active regions in the rice genome.
Li, L., Wang, X., Sasidharan, R., Stolc, V., Deng, W., He, H., Korbel, J., Chen, X., Tongprasit, W., Ronald, P., Chen, R., Gerstein, M. & Wang Deng, X.
PLoS ONE. 2007 Mar 14;2(3):e294.
Genome tiling microarray studies have consistently documented rich transcriptional activity beyond the annotated genes. However, systematic characterization and transcriptional profiling of the putative novel transcripts on the genome scale are still lacking. We report here the identification of 25,352 and 27,744 transcriptionally active regions (TARs) not encoded by annotated exons in the rice (Oryza sativa) subspecies japonica and indica, respectively. The non-exonic TARs account for approximately two thirds of the total TARs detected by tiling arrays and represent transcripts likely conserved between japonica and indica. Transcription of 21,018 (83%) japonica non-exonic TARs was verified through expression profiling in 10 tissue types using a re-array in which annotated genes and TARs were each represented by five independent probes. Subsequent analyses indicate that about 80% of the japonica TARs that were not assigned to annotated exons can be assigned to various putatively functional or structural elements of the rice genome, including splice variants, uncharacterized portions of incompletely annotated genes, antisense transcripts, duplicated gene fragments, and potential non-coding RNAs. These results provide a systematic characterization of non-exonic transcripts in rice and thus expand the current view of the complexity and dynamics of the rice transcriptome.
A supervised hidden Markov model framework for efficiently segmenting tiling array data in transcriptional and ChIP-chip experiments: systematically incorporating validated biological knowledge.
Du, J., Rozowsky, J.S., Korbel, J.O., Zhang, Z.D., Royce, T.E., Schultz, M.H., Snyder, M. & Gerstein, M.
Bioinformatics. 2006 Dec 15;22(24):3016-24. Epub 2006 Oct 12.
MOTIVATION: Large-scale tiling array experiments are becoming increasingly common in genomics. In particular, the ENCODE project requires the consistent segmentation of many different tiling array datasets into 'active regions' (e.g. finding transfrags from transcriptional data and putative binding sites from ChIP-chip experiments). Previously, such segmentation was done in an unsupervised fashion mainly based on characteristics of the signal distribution in the tiling array data itself. Here we propose a supervised framework for doing this. It has the advantage of explicitly incorporating validated biological knowledge into the model and allowing for formal training and testing. METHODOLOGY: In particular, we use a hidden Markov model (HMM) framework, which is capable of explicitly modeling the dependency between neighboring probes and whose extended version (the generalized HMM) also allows explicit description of state duration density. We introduce a formal definition of the tiling-array analysis problem, and explain how we can use this to describe sampling small genomic regions for experimental validation to build up a gold-standard set for training and testing. We then describe various ideal and practical sampling strategies (e.g. maximizing signal entropy within a selected region versus using gene annotation or known promoters as positives for transcription or ChIP-chip data, respectively). RESULTS: For the practical sampling and training strategies, we show how the size and noise in the validated training data affects the performance of an HMM applied to the ENCODE transcriptional and ChIP-chip experiments. In particular, we show that the HMM framework is able to efficiently process tiling array data as well as or better than previous approaches. For the idealized sampling strategies, we show how we can assess their performance in a simulation framework and how a maximum entropy approach, which samples sub-regions with very different signal intensities, gives the maximally performing gold-standard. This latter result has strong implications for the optimum way medium-scale validation experiments should be carried out to verify the results of the genome-scale tiling array experiments.
High-resolution mapping of DNA copy alterations in human chromosome 22 using high-density tiling oligonucleotide arrays.
Urban, A.E., Korbel, J.O., Selzer, R., Richmond, T., Hacker, A., Popescu, G.V., Cubells, J.F., Green, R., Emanuel, B.S., Gerstein, M.B., Weissman, S.M. & Snyder, M.
Proc Natl Acad Sci U S A. 2006 Mar 21;103(12):4534-9. Epub 2006 Mar 14.
Deletions and amplifications of the human genomic sequence (copy number polymorphisms) are the cause of numerous diseases and a potential cause of phenotypic variation in the normal population. Comparative genomic hybridization (CGH) has been developed as a useful tool for detecting alterations in DNA copy number that involve blocks of DNA several kilobases or larger in size. We have developed high-resolution CGH (HR-CGH) to detect accurately and with relatively little bias the presence and extent of chromosomal aberrations in human DNA. Maskless array synthesis was used to construct arrays containing 385,000 oligonucleotides with isothermal probes of 45-85 bp in length; arrays tiling the beta-globin locus and chromosome 22q were prepared. Arrays with a 9-bp tiling path were used to map a 622-bp heterozygous deletion in the beta-globin locus. Arrays with an 85-bp tiling path were used to analyze DNA from patients with copy number changes in the pericentromeric region of chromosome 22q. Heterozygous deletions and duplications as well as partial triploidies and partial tetraploidies of portions of chromosome 22q were mapped with high resolution (typically up to 200 bp) in each patient, and the precise breakpoints of two deletions were confirmed by DNA sequencing. Additional peaks potentially corresponding to known and novel additional CNPs were also observed. Our results demonstrate that HR-CGH allows the detection of copy number changes in the human genome at an unprecedented level of resolution.
Similar gene expression profiles do not imply similar tissue functions.
Yanai, I., Korbel, J.O., Boue, S., McWeeney, S.K., Bork, P. & Lercher, M.J.
Trends Genet. 2006 Mar;22(3):132-8. Epub 2006 Feb 9.
Although similarities in gene expression among tissues are commonly inferred to reflect functional constraints, this has never been formally tested. Furthermore, it is unclear which evolutionary processes are responsible for the observed similarities. When examining genome-wide expression data in mouse, we found that patterns of expression similarity between tissues extend to genes that are unlikely to function in the tissues. Thus, ectopic expression can seem coordinated across tissues. This indicates that knowledge of gene expression patterns per se is insufficient to infer gene function. Ectopic expression is possibly explained as expression leakage, caused by spreading of chromatin modifications or the transcription apparatus into neighboring genes.
Novel transcribed regions in the human genome.
Rozowsky, J., Wu, J., Lian, Z., Nagalakshmi, U., Korbel, J.O., Kapranov, P., Zheng, D., Dyke, S., Newburger, P., Miller, P., Gingeras, T.R., Weissman, S., Gerstein, M. & Snyder, M.
Cold Spring Harb Symp Quant Biol. 2006;71:111-6.
We have used genomic tiling arrays to identify transcribed regions throughout the human genome. Analysis of the mapping results of RNA isolated from five cell/tissue types, NB4 cells, NB4 cells treated with retinoic acid (RA), NB4 cells treated with 12-O-tetradecanoylphorbol-13 acetate (TPA), neutrophils, and placenta, throughout the ENCODE region reveals a large number of novel transcribed regions. Interestingly, neutrophils exhibit a great deal of novel expression in several intronic regions. Comparison of the hybridization results of NB4 cells treated with different stimuli relative to untreated cells reveals that many new regions are expressed upon cell differentiation. One such region is the Hox locus, which contains a large number of novel regions expressed in a number of cell types. Analysis of the trinucleotide composition of the novel transcribed regions reveals that it is similar to that of known exons. These results suggest that many of the novel transcribed regions may have a functional role.
Systematic association of genes to phenotypes by genome and literature mining.
Korbel, J.O., Doerks, T., Jensen, L.J., Perez-Iratxeta, C., Kaczanowski, S., Hooper, S.D., Andrade, M.A. & Bork, P.
PLoS Biol. 2005 Apr 5;3(5):e134.
One of the major challenges of functional genomics is to unravel the connection between genotype and phenotype. So far no global analysis has attempted to explore those connections in the light of the large phenotypic variability seen in nature. Here, we use an unsupervised, systematic approach for associating genes and phenotypic characteristics that combines literature mining with comparative genome analysis. We first mine the MEDLINE literature database for terms that reflect phenotypic similarities of species. Subsequently we predict the likely genomic determinants: genes specifically present in the respective genomes. In a global analysis involving 92 prokaryotic genomes we retrieve 323 clusters containing a total of 2,700 significant gene-phenotype associations. Some clusters contain mostly known relationships, such as genes involved in motility or plant degradation, often with additional hypothetical proteins associated with those phenotypes. Other clusters comprise unexpected associations; for example, a group of terms related to food and spoilage is linked to genes predicted to be involved in bacterial food poisoning. Among the clusters, we observe an enrichment of pathogenicity-related associations, suggesting that the approach reveals many novel genes likely to play a role in infectious diseases.
Analysis of genomic context: prediction of functional associations from conserved bidirectionally transcribed gene pairs.
Korbel, J.O., Jensen, L.J., von Mering, C. & Bork, P.
Nat Biotechnol. 2004 Jul;22(7):911-7.
Several widely used methods for predicting functional associations between proteins are based on the systematic analysis of genomic context. Efforts are ongoing to improve these methods and to search for novel aspects in genomes that could be exploited for function prediction. Here, we use gene expression data to demonstrate two functional implications of genome organization: first, chromosomal proximity indicates gene coregulation in prokaryotes independent of relative gene orientation; and second, adjacent bidirectionally transcribed genes (that is, 'divergently' organized coding regions) with conserved gene orientation are strongly coregulated. We further demonstrate that such bidirectionally transcribed gene pairs are functionally associated and derive from this a novel genomic context method that reliably predicts links between >2,500 pairs of genes in approximately 100 species. Around 650 of these functional associations are supported by other genomic context methods. In most instances, one gene encodes a transcriptional regulator, and the other a nonregulatory protein. In-depth analysis in Escherichia coli shows that the vast majority of these regulators both control transcription of the divergently transcribed target gene/operon and auto-regulate their own biosynthesis. The method thus enables the prediction of target processes and regulatory features for several hundred transcriptional regulators.
Transgene methylation in mice reflects copy number but not expression level.
Pena, R.N., Webster, J., Kwan, S., Korbel, J. & Whitelaw, B.A.
Mol Biotechnol. 2004 Mar;26(3):215-20.
In mammals, CpG methylation is one of the mechanisms of epigenetic control over the linear sequence of bases of deoxyribonucleic acid (DNA); about 70% of CpG dinucleotides are methylated. The actual signal that triggers DNA methylation is not known, although repetitive DNA has been shown to be an attractive template for DNA methylases. To address methylation events associated with transgenic copy number, we have analyzed transgenes that are actively transcribed in a tissue-specific manner. We have compared gross transgene methylation by restriction-enzyme digestion in expressing and nonexpressing tissues. The observed pattern suggests that the DNA methylation machinery can recognize repeated genomic sequences independently of their transcriptional activity.
The Helmholtz Network for Bioinformatics: an integrative web portal for bioinformatics resources.
Crass, T., Antes, I., Basekow, R., Bork, P., Buning, C., Christensen, M., Claussen, H., Ebeling, C., Ernst, P., Gailus-Durner, V., Glatting, K.H., Gohla, R., Gossling, F., Grote, K., Heidtke, K., Herrmann, A., O'Keeffe, S., Kiesslich, O., Kolibal, S., Korbel, J.O., Lengauer, T., Liebich, I., Van Der Linden, M., Luz, H., Meissner, K., Von Mering, C., Mevissen, H.T., Mewes, H.W., Michael, H., Mokrejs, M., Muller, T., Pospisil, H., Rarey, M., Reich, J.G., Schneider, R., Schomburg, D., Schulze-Kremer, S., Schwarzer, K., Sommer, I., Springstubbe, S., Suhai, S., Thoppae, G., Vingron, M., Warfsmann, J., Werner, T., Wetzler, D., Wingender, E. & Zimmer, R.
Bioinformatics. 2004 Jan 22;20(2):268-270.
SUMMARY: The Helmholtz Network for Bioinformatics (HNB) is a joint venture of eleven German bioinformatics research groups that offers convenient access to numerous bioinformatics resources through a single web portal. The 'Guided Solution Finder' which is available through the HNB portal helps users to locate the appropriate resources to answer their queries by employing a detailed, tree-like questionnaire. Furthermore, automated complex tool cascades ('tasks'), involving resources located on different servers, have been implemented, allowing users to perform comprehensive data analyses without the requirement of further manual intervention for data transfer and re-formatting. Currently, automated cascades for the analysis of regulatory DNA segments as well as for the prediction of protein functional properties are provided. AVAILABILITY: The HNB portal is available at http://www.hnbioinfo.de
Systematic discovery of analogous enzymes in thiamin biosynthesis.
Morett, E., Korbel, J.O., Rajan, E., Saab-Rincon, G., Olvera, L., Olvera, M., Schmidt, S., Snel, B. & Bork, P.
Nat Biotechnol. 2003 Jul;21(7):790-5.
In all genome-sequencing projects completed to date, a considerable number of 'gaps' have been found in the biochemical pathways of the respective species. In many instances, missing enzymes are displaced by analogs, functionally equivalent proteins that have evolved independently and lack sequence and structural similarity. Here we fill such gaps by analyzing anticorrelating occurrences of genes across species. Our approach, applied to the thiamin biosynthesis pathway comprising approximately 15 catalytic steps, predicts seven instances in which known enzymes have been displaced by analogous proteins. So far we have verified four predictions by genetic complementation, including three proteins for which there was no previous experimental evidence of a role in the thiamin biosynthesis pathway. For one hypothetical protein, biochemical characterization confirmed the predicted thiamin phosphate synthase (ThiE) activity. The results demonstrate the ability of our computational approach to predict specific functions without taking into account sequence similarity.
Compositional asymmetries and predicted origins of replication of the Saccharomyces cerevisiae genome.
Korbel, J.O., Assmus, H., Kielbasa, S. & Herzel, H.
In "Bioinformatics of Genome Regulation and Structure." N. Kolchanov and R. Hofestaedt (Eds.). Kluwer Academic Publishers.
SHOT: a web server for the construction of genome phylogenies.
Korbel, J.O., Snel, B., Huynen, M.A. & Bork, P.
Trends Genet 2002 Mar;18(3):158-62.
With the increasing availability of genome sequences, new methods are being proposed that exploit information from complete genomes to classify species in a phylogeny. Here we present SHOT, a web server for the classification of genomes on the basis of shared gene content or the conservation of gene order that reflects the dominant, phylogenetic signal in these genomic properties. In general, the genome trees are consistent with classical gene-based phylogenies, although some interesting exceptions indicate massive horizontal gene transfer. SHOT is a useful tool for analysing the tree of life from a genomic point of view. It is available at http://www.Bork.EMBL-Heidelberg.de/SHOT.