Massive DNA encyclopedia scraps the junk


ENCODE researchers found that most of our DNA has a function: controlling when and where genes are turned on and off.

Today, an international team of researchers reveal that much of what has been called ‘junk DNA’ in the human genome is actually a massive control panel with millions of switches regulating the activity of our genes. Without these switches, genes would not work – and mutations in these regions might lead to human disease. Discovered by hundreds of scientists working on the ENCODE Project, the new information is so comprehensive and complex that it has given rise to a new publishing model in which electronic documents and datasets are interconnected.

Just as the Human Genome Project revolutionised biomedical research, ENCODE will drive new understanding and open new avenues for biomedical science. Led by the National Human Genome Research Institute (NHGRI) in the US and the EMBL-European Bioinformatics Institute (EMBL-EBI) in the UK, ENCODE now presents a detailed map of genome function that identifies 4 million gene ‘switches’. This essential reference will help researchers pinpoint very specific areas of research for human disease. The findings are published in 30 connected, open-access papers appearing in three science journals: Nature, Genome Biology and Genome Research.

“Our genome is simply alive with switches: millions of places that determine whether a gene is switched on or off,” says Ewan Birney of EMBL-EBI, lead analysis coordinator for ENCODE. “The Human Genome Project showed that only 2% of the genome contains genes, the instructions to make proteins. With ENCODE, we can see that around 80% of the genome is actively doing something. We found that a much bigger part of the genome – a surprising amount, in fact – is involved in controlling when and where proteins are produced, than in simply manufacturing the building blocks.”

“ENCODE data can be used by any disease researcher, whatever pathology they may be interested in,” said Ian Dunham of EMBL-EBI, who played a key role in coordinating the analysis. “In many cases you may have a good idea of which genes are involved in your disease, but you might not know which switches are involved. Sometimes these switches are very surprising, because their location might seem more logically connected to a completely different disease. ENCODE gives us a set of very valuable leads to follow to discover key mechanisms at play in health and disease. Those can be exploited to create entirely new medicines, or to repurpose existing treatments.”

“ENCODE gives us the knowledge we need to look beyond the linear structure of the genome to how the whole network is connected,” commented Michael Snyder, professor and chair at Stanford University and a principal investigator on ENCODE. “We are beginning to understand the information generated in genome-wide association studies – not just where certain genes are located, but which sequences control them. Because of the complex, three-dimensional shape of our genome, those controls are sometimes far from the gene they regulate and looping around to make contact. Were it not for ENCODE, we might never have looked in those regions. This is a major step toward understanding the wiring diagram of a human being. ENCODE helps us look deeply into the regulatory circuit that tells us how all of the parts come together to make a complex being.”

Until recently, generating and storing large volumes of data has been a challenge in biomedical research. Now, with the falling cost and rising productivity of genome sequencing, the focus has shifted to analysis – making sense of the data produced in genome-wide association studies. ENCODE partners have been working systematically through the human genome, using the same computational and wet-lab methods and reagents in laboratories distributed throughout the world.

To give some sense of the scale of the project: ENCODE combined the efforts of 442 scientists in 32 labs in the UK, US, Spain, Singapore and Japan. They generated and analysed over 15 terabytes (15 trillion bytes) of raw data, all of which is now publicly available. The study used around 300 years’ worth of computer time studying 147 cell types to determine what turns specific genes on and off, and how that ‘switch’ differs between cell types.

The articles published today represent hundreds of pages of research. But the digital publishing group at Nature recognises that ‘pages’ are a thing of the past. All of the published ENCODE content, in all three journals, is connected digitally through topical ‘threads’, so that readers can follow their area of interest between papers and all the way down to the original data.

“Getting the best people with the best expertise together is what this is all about,” said Ewan Birney. “ENCODE has really shown that leading life scientists are very good at collaborating closely on a large scale to produce excellent foundational resources that the whole community can use.”

“Until now, everyone’s been generating and publishing this data piecemeal and unintentionally trapping it in niche communities and static publications. How could anyone outside that community exploit that knowledge if they don’t know it’s there?” commented Roderic Guigo of the Centre de Regulació Genómica (CRG) in Barcelona, Spain. “We have now an interactive encyclopaedia that everyone can refer to, and that will make a huge difference.”

Press Contacts


Mary Todd Bergman
Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK

Tel: +44 (0)1223 494 665
Fax: +44 (0) 1223 492 621


Sonia Furtado Neves
Meyerhofstraße 1, 69117 Heidelberg, Germany

Tel: +49 (0) 6221 387 8263
Fax: +49 (0) 6221 387 8525


Further information:

Watch Ewan Birney of EMBL-EBI, Tim Hubbard of the Wellcome Trust Sanger Institute and Roderic Guigo of CRG talk about ENCODE – with Spanish subtitles.

ENCODE in Numbers:

Consortium members (authors on the main Nature paper): 442
Actively participating institutes: 32 (in the US, UK, Spain, Japan and Singapore)
PIs (Principal Investigators): 24
Papers published in Nature: 6
Papers published in Genome Research: 18
Papers published in Genome Biology: 6
Topic ‘threads’ stemming from main paper: 13
Special features: 3 virtual machines, 1 website, 1 iPad app
Cell types studied: 147
Experiments run: 1649
Antibodies ChIP’ed: 235
Transcription factors identified: 119
Sites at which a gene is touching a protein: ~4 000 000
Size of Figure 1 (all data represented at smallest resolution): 30 kilometres x 16 metres
DNA sequence generated: 5 terabases (5 x 10^12 bases)
Disk space used*: 15 terabytes (15 x 10^12 bytes)
Files analysed: 11 972

ENCODE Wiki**
Content pages: 741
Uploaded files: 2955
Edits per day: 11
Page edits since 2008: 18 500
Views per day (average over 4.5 yr): 150
Total views: 248 140

*All sequence and analysis files generated on EMBL-EBI servers.
**The ENCODE consortium could operate efficiently only by sharing information instantly and in an accessible way. The members built a Wiki to keep their work central, manageable and up to date.
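As a rough sanity check on the wiki figures above, the short Python sketch below compares each reported total with the total implied by its per-day rate. It assumes that the 4.5-year averaging window stated for views also applies to edits, and that a year averages 365.25 days; neither assumption comes from the release. The names `WIKI_STATS`, `implied_total` and `relative_gap` are illustrative only. The last lines divide the reported disk space by the reported sequence volume.

```python
# Figures copied from the "ENCODE in Numbers" table; nothing here is new data.
YEARS = 4.5              # averaging window stated for the wiki view count
DAYS_PER_YEAR = 365.25   # assumption: mean calendar year length

WIKI_STATS = {
    # label: (reported per-day rate, reported total)
    "page edits": (11, 18_500),    # assumption: same 4.5-year window as views
    "page views": (150, 248_140),
}

def implied_total(per_day, years=YEARS):
    """Total implied by a daily rate sustained over `years`."""
    return per_day * years * DAYS_PER_YEAR

def relative_gap(reported, implied):
    """Absolute difference as a fraction of the reported total."""
    return abs(reported - implied) / reported

gaps = {}
for label, (per_day, reported) in WIKI_STATS.items():
    implied = implied_total(per_day)
    gaps[label] = relative_gap(reported, implied)
    print(f"{label}: reported {reported}, implied {implied:.0f}, gap {gaps[label]:.1%}")

# Storage footprint per sequenced base: 15 terabytes over 5 terabases.
bytes_per_base = 15e12 / 5e12
print(f"bytes per base: {bytes_per_base:.1f}")
```

Under these assumptions both totals land within a few percent of the rounded daily rates, which suggests the table is internally consistent.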

Source Articles:


Paper Title

Topic Threads




An integrated encyclopedia of DNA elements in the human genome

1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13

The ENCODE Project Consortium.

Nature (6 September 2012)
DOI: 10.1038/nature11247


The accessible chromatin landscape of the human genome

1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13

Thurman, Rynes and Humbert et al.

Nature (6 September 2012)
DOI: 10.1038/nature11232


An expansive human regulatory lexicon encoded in transcription factor footprints

1, 2, 3, 7, 10, 12, 13

Neph, Vierstra, Stergachis and Reynolds et al.

Nature (6 September 2012)
DOI: 10.1038/nature11212


Architecture of the human regulatory network derived from ENCODE data

4, 8, 9, 10, 11, 12, 13

Gerstein, Kundaje, Hariharan, Landt and Yan et al.

Nature (6 September 2012)
DOI: 10.1038/nature11245


Landscape of transcription in human cells

3, 5, 6, 8, 12

Djebali and Davis et al.

Nature (6 September 2012)
DOI: 10.1038/nature11233


The long-range interaction landscape of gene promoters

4, 8, 9, 10

Sanyal and Lajoie et al.

Nature (6 September 2012)
DOI: 10.1038/nature11279

Genome Research

Deep sequencing of subcellular RNA fractions shows splicing to be predominantly co-transcriptional in the human genome but inefficient for lncRNAs


Tilgner et al.

Genome Res. (6 September 2012)
DOI: 10.1101/gr.134445.111

Genome Research

RNA editing in the human ENCODE RNA-seq data


Park et al.

Genome Res. (6 September 2012)
DOI: 10.1101/gr.134957.111

Genome Research

Discovery of hundreds of mirtrons in mouse and human small RNA data


Ladewig et al.

Genome Res. (6 September 2012)
DOI: 10.1101/gr.133553.111

Genome Research

Long noncoding RNAs are rarely translated in two human cell lines

6, 11

Bánfai et al.

Genome Res. (6 September 2012)
DOI: 10.1101/gr.134767.111

Genome Research

Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors

1, 2, 4

Wang et al.

Genome Res. (6 September 2012)
DOI: 10.1101/gr.139105.112

Genome Research

Understanding transcriptional regulation by integrative analysis of transcription factor binding data

4, 11

Cheng et al.

Genome Res. (6 September 2012)
DOI: 10.1101/gr.136838.111

Genome Research

A highly integrated and complex PPARGC1A transcription factor binding network in HepG2 cells


Charos et al.

Genome Res. (6 September 2012)
DOI: 10.1101/gr.127761.111

Genome Research

Widespread plasticity in CTCF occupancy linked to DNA methylation


Wang et al.

Genome Res. (6 September 2012)
DOI: 10.1101/gr.136101.111

Genome Research

Personal and population genomics of human regulatory variation

12, 13

Vernot et al.

Genome Res. (6 September 2012)
DOI: 10.1101/gr.134890.111

Genome Research

Combining RT-PCR-seq and RNA-seq to catalog all genic elements encoded in the human genome

3, 6

Howald et al.

Genome Res. (6 September 2012)
DOI: 10.1101/gr.134478.111

Genome Research

Predicting cell-type–specific gene expression from regions of open chromatin

1, 2, 4, 8

Natarajan et al.

Genome Res. (6 September 2012)
DOI: 10.1101/gr.135129.111

Genome Research

Sequence and chromatin determinants of cell-type-specific transcription factor binding

1, 11

Arvey et al.

Genome Res. (6 September 2012)
DOI: 10.1101/gr.127712.111

Genome Research

Ubiquitous heterogeneity and asymmetry of the chromatin environment at regulatory elements


Kundaje et al.

Genome Res. (6 September 2012)
DOI: 10.1101/gr.136366.111

Genome Research

Linking disease associations with regulatory information in the human genome


Schaub et al.

Genome Res. (6 September 2012)
DOI: 10.1101/gr.136127.111

Genome Research

GENCODE: The reference human genome annotation for the ENCODE project

3, 6

Harrow et al.

Genome Res. (6 September 2012)
DOI: 10.1101/gr.135350.111

Genome Research

The GENCODE v7 catalog of human long noncoding RNAs: Analysis of their gene structure, evolution, and expression


Derrien et al.

Genome Res. (6 September 2012)
DOI: 10.1101/gr.132159.111

Genome Research

Annotation of functional variation in personal genomes using RegulomeDB


Boyle et al.

Genome Res. (6 September 2012)
DOI: 10.1101/gr.137323.112

Genome Research

ChIP-seq guidelines and practices used by the ENCODE and modENCODE consortia


Landt et al.

Genome Res. (6 September 2012)
DOI: 10.1101/gr.136184.111

Genome Biology

Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription related factors

1, 2, 4, 8, 9, 10, 11

Yip et al.

Genome Biol. (6 September 2012)
DOI: 10.1186/gb-2012-13-9-r48

Genome Biology

Analysis of variation at transcription factor binding sites in Drosophila and humans

1, 12, 13

Spivakov et al.

Genome Biol. (6 September 2012)
DOI: 10.1186/gb-2012-13-9-r49

Genome Biology

Functional analysis of transcription factor binding sites in human promoters


Whitfield et al.

Genome Biol. (6 September 2012)
DOI: 10.1186/gb-2012-13-9-r50

Genome Biology

The GENCODE pseudogene resource

3, 6, 13

Pei et al.

Genome Biol. (6 September 2012)
DOI: 10.1186/gb-2012-13-9-r51

Genome Biology

Cell type-specific binding patterns reveal that TCF7L2 can be tethered to the genome by association with GATA3

1, 8

Frietze et al.

Genome Biol. (6 September 2012)
DOI: 10.1186/gb-2012-13-9-r52

Genome Biology

Modeling gene expression using chromatin features in various cellular contexts

4, 5

Dong et al.

Genome Biol. (6 September 2012)
DOI: 10.1186/gb-2012-13-9-r53

BMC Genetics

Simultaneous SNP identification and assessment of allele-specific bias from ChIP-seq data


Ni et al.

BMC Gen. (6 September 2012)
DOI: 10.1186/1471-2156-13-46

ENCODE explorer: topic threads




Transcription factor motifs

1, 2, 3, 11, 17, 18, 24, 25, 26, 27, 29

ENCODE discovers many new transcription factor binding site motifs and explores their properties

Chromatin patterns at transcription factor binding sites

1, 2, 3, 11, 17, 19, 25

Both chromatin accessibility and histone modifications have different patterns around different transcription factor binding sites

Characterisation of intergenic regions and gene definition

2, 3, 5, 16, 21, 28

The prevalence and analysis of ENCODE data is changing the definition and characterisation of intergenic and genic regions

RNA and chromatin modification patterns around promoters

1, 2, 4, 6, 11, 12, 17, 25, 30

Patterns of gene expression can be modelled using the histone modification and transcription factor binding at promoters

Epigenetic regulation of RNA processing

1, 5, 7, 30

ENCODE data explores properties of RNA processing and its relationship to histone modifications

Non-coding RNA characterisation

1, 2, 5, 10, 16, 21, 22, 28

Many novel and previously known non-coding RNA species are characterised in ENCODE

DNA methylation

1, 2, 3, 14

ENCODE analysis identifies dynamic DNA methylation patterns and relationships to regulatory elements

Enhancer discovery and characterisation

1, 2, 4, 5, 6, 17, 25, 29

ENCODE has developed methods to discover enhancers, and characterised them both with other datasets and with specific experiments

Three-dimensional connections across the genome

1, 2, 4, 6, 25

ENCODE mapped long-range looping interactions between functional elements and genes, placing them in a three-dimensional context to reveal their functional relationships

Characterisation of network topology

1, 2, 3, 4, 6, 13, 25

Describing the various types of regulatory “wiring” implicit in the genome

Machine learning approaches to genomics

1, 2, 4, 10, 12, 18, 25

ENCODE has applied machine learning approaches to enable integration and exploration of large and diverse data

Impact of functional information on understanding variation

1, 2, 3, 4, 5, 15, 20, 23, 26

ENCODE provides an initial interpretation of many human variants and plausible leads for the role of many variants identified in GWAS

Impact of evolutionary selection on functional regions

1, 2, 3, 4, 15, 26, 28

The imprint of evolutionary selection on ENCODE regulatory elements is manifested between species and within human populations
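The thread table above lists, for each topic, a set of numbers that appear to identify papers in the collection. A reader who starts from a paper rather than a topic needs the reverse lookup. The minimal Python sketch below inverts three rows copied from the table; the names `THREADS` and `threads_by_paper` are hypothetical, and the release does not describe how the publishers implement the threads.

```python
from collections import defaultdict

# Three rows copied verbatim from the topic-thread table (thread -> paper numbers).
THREADS = {
    "DNA methylation": [1, 2, 3, 14],
    "Epigenetic regulation of RNA processing": [1, 5, 7, 30],
    "Three-dimensional connections across the genome": [1, 2, 4, 6, 25],
}

def threads_by_paper(threads):
    """Invert a thread -> papers mapping into paper -> threads."""
    index = defaultdict(list)
    for name, papers in threads.items():
        for paper in papers:
            index[paper].append(name)
    return dict(index)

index = threads_by_paper(THREADS)

# List every thread (among the three sampled rows) that touches paper 1.
for name in index[1]:
    print(name)
```

Extending `THREADS` with the remaining ten rows would give the full paper-to-thread index for the collection.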



Policy regarding use

Press and Picture Releases

EMBL press and picture releases including photographs, graphics, movies and videos are copyrighted by EMBL. They may be freely reprinted and distributed for non-commercial use via print, broadcast and electronic media, provided that proper attribution to authors, photographers and designers is made.