The importance of in situ hybridisation data is reflected by the fact that
high-throughput screens have been performed for the major model species and
made accessible to the scientific community through specialized databases. I
have contributed to this effort by developing two species-specific databases
and one cross-species database for gene expression data.
I have developed expression pattern databases for two model species worked on
at EMBL:
Medaka is a small freshwater fish that has been a model system for genetics in
Japan for around 100 years. Today it is used in developmental biology,
molecular biology and ecology. Its genome has recently been sequenced and is
available in EnsEMBL (www.ensembl.org).
Platynereis is an annelid worm that is important because of its basal position
in evolution. It is placed close to the divergence of Protostomia (Drosophila,
C. elegans) and Deuterostomia (vertebrates).
· MEPD: Medaka Expression Pattern Database http://ani.embl.de:8080/mepd/
· PEPD: Platynereis Expression Pattern Database http://ani.embl.de:8080/pepd/
PEPD is not yet published, whereas MEPD has been described in two publications
(Henrich T, 2003, 2005) and has developed into a central resource within the
medaka scientific community.
Many species-specific databases containing in situ expression data have been
developed in recent years. However, there has been no common resource that
allows expression patterns to be compared across species boundaries. We have
developed such a resource (Haudry Y, 2007). To compare gene expression patterns
between species, we have to establish relationships between genes,
developmental stages and anatomical structures.
Relationships between genes can be established through sequence homology, and
public resources (e.g. EnsEMBL) can be used for that purpose. Since June 2006
EnsEMBL has used a new tree-based algorithm to predict orthologs, which is much
more reliable than the previous BLAST reciprocal best hit approach.
I have developed a strategy to establish relationships between developmental
stages of different species. To avoid a combinatorial explosion of pairwise
mappings, I developed an ontology of developmental stages for all bilaterian
animals. The stages of each species are mapped onto this ontology, and links
between species are derived through the cross-species ontology. So far I have
mapped the stages of zebrafish, medaka, mouse and Drosophila.
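
The hub idea can be sketched in a few lines of Python. With n species, direct
pairwise stage mappings need on the order of n(n-1)/2 tables, whereas mapping
each species once onto a shared ontology needs only n. The stage names, the
ontology terms and the function names below are placeholders chosen for
illustration; they are not the curated mappings.

```python
# Illustrative sketch: each species' stages are mapped once onto a shared
# cross-species stage ontology; species-to-species links are then derived.
# All stage names and ontology terms are hypothetical placeholders.
from collections import defaultdict

# species -> {species-specific stage: cross-species ontology term}
STAGE_TO_COMMON = {
    "zebrafish": {
        "shield": "gastrula",
        "75%-epiboly": "gastrula",
        "10-somite": "segmentation",
    },
    "medaka": {"stage 15": "gastrula", "stage 21": "segmentation"},
    "mouse": {"E7.5": "gastrula", "E8.5": "segmentation"},
}


def build_reverse_index(stage_to_common):
    """Index: ontology term -> species -> list of species-specific stages."""
    index = defaultdict(lambda: defaultdict(list))
    for species, mapping in stage_to_common.items():
        for stage, term in mapping.items():
            index[term][species].append(stage)
    return index


def equivalent_stages(species_a, stage_a, species_b,
                      stage_to_common=STAGE_TO_COMMON):
    """Return the stages of species_b that share an ontology term with stage_a."""
    term = stage_to_common[species_a].get(stage_a)
    if term is None:
        return []  # stage not mapped onto the shared ontology
    index = build_reverse_index(stage_to_common)
    return sorted(index[term][species_b])


print(equivalent_stages("medaka", "stage 15", "zebrafish"))
```

Adding a further species then only requires one new species-to-ontology
mapping; no existing mapping has to be touched.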
The establishment of relationships between anatomical structures will be the
major challenge of the project. Some of these relationships (common knowledge
of homologous structures) can be entered beforehand; others will be established
through the analysis of gene expression comparisons.
In addition to our cross-species repository, we are integrating in situ data
with microarray data and are therefore developing an in situ warehouse in the
ArrayExpress group at the EBI in Hinxton, UK. This warehouse will be optimized
for fast queries and will contain only the data necessary for searches. Links
from the warehouse back to the repository will give access to the full
information on an entry.
The major aim of this database is the establishment of a platform:
· for the scientific community to query cross-species gene expression patterns
and microarray data
· to easily compile data for analysis using bioinformatics tools
For the annotation of gene expression patterns in medaka I have developed a
detailed ontology representing anatomical structures and their relationships to
each other across all 44 developmental stages. This was done in collaboration
with zebrafish researchers (zfin.org) so that corresponding structures in the
two fish species are given the same names. The ontology consists of over 4000
terms and is accessible at obo.sourceforge.net.
It is also used to annotate mutant phenotypes and is described in a publication
on a fish mutant database (Henrich T, 2004).
To exchange and compare expression data from different species, it is important
to develop a standard for in situ hybridisation data. For this reason we have
joined an initiative that is working towards that goal.
The first step in this process is to agree on the minimal information needed to
fully describe and repeat any kind of in situ hybridisation experiment. The
name of this specification is MISFISHIE: Minimum Information Specification For
In Situ Hybridization and Immunohistochemistry Experiments
(http://mged.sourceforge.net/misfishie).
The initiative consists of groups that have experience in producing or
organising in situ hybridisation data and groups that were involved in
developing a similar standard for microarray experiments (MIAME: Minimum
Information About a Microarray Experiment).
MISFISHIE describes 6 categories:
· Experimental design
· Biomaterial and treatment (e.g. specimen, fixation, sections)
· Reporters (probe sequence or antibody gene id)
· Staining (protocol)
· Imaging (magnification, illumination, microscope)
· Image characterization (expression pattern annotation)
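
As an illustration of how such a checklist could be applied to a database
entry, the following minimal Python sketch reports which of the six categories
are missing from a record. The field names, the record layout and the example
values are assumptions made for this sketch; they are not taken from the
MISFISHIE specification.

```python
# Illustrative sketch: check that an in situ record has an entry for each of
# the six MISFISHIE categories. Field names and values are hypothetical.
MISFISHIE_CATEGORIES = (
    "experimental_design",
    "biomaterial_and_treatment",
    "reporters",
    "staining",
    "imaging",
    "image_characterization",
)


def missing_categories(record):
    """Return the categories that are absent or empty in `record` (a dict)."""
    return [c for c in MISFISHIE_CATEGORIES if not record.get(c)]


example_record = {
    "experimental_design": "expression screen, wild-type embryos",
    "biomaterial_and_treatment": {"specimen": "medaka embryo", "fixation": "PFA"},
    "reporters": {"probe_sequence": "ACGTACGT"},  # placeholder sequence
    "staining": "whole-mount in situ protocol",
    "imaging": {"magnification": "10x"},
    # "image_characterization" is intentionally left out
}

print(missing_categories(example_record))
```

A check of this kind only tests for presence; whether the content of each
category is sufficient to repeat the experiment still has to be judged by a
curator.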
The MISFISHIE standard has been announced in a publication (Deutsch EW et al
2006) and a detailed description is published in Nature Biotechnology.
In collaboration with Mirana Ramialison, a former master's student of mine, we
have applied a hierarchical clustering algorithm to characterise 600 medaka
genes annotated in MEPD. Several synexpression groups could be identified, for
example a group of genes expressed in the otic vesicles and the yolk, or a
group of genes expressed in the marginal zones of the tectum and other
proliferative tissues.
The cluster of genes expressed in the proliferative zones (ciliary marginal
zone of the retina, tectum marginal zone, fin periphery) consists of 48 genes.
By systematically applying a motif discovery pipeline to the 2 kb upstream
regions of these genes, we discovered three groups of evolutionarily conserved
motifs specifically over-represented in this dataset. These motifs correspond
to both novel and known transcription factor binding sites. One of them is the
Sox binding site. Interestingly, the Sox motif and one unknown binding motif
can be found in the upstream region of sox3 itself, in a genomic region
conserved within teleosts, as well as in the upstream region of calmodulin,
where it is conserved from fish to mammals. It has been shown that Sox3 and
Calmodulin promote proliferation (Xia Y., 2000) and that another member of the
Sox family (Sox9) directly binds to Calmodulin (Harley VR., 1996). Therefore,
it appears likely that these two proteins are co-regulated by the same
trans-acting factors in order to ensure co-expression with respect to time,
space and levels.
The functionality of the module containing the two motifs was tested by
transgene expression with GFP as a reporter. The resulting transgenic animals
recapitulate the expression pattern of the endogenous calmodulin gene.
Injection of the same region lacking the tandem motif resulted in the loss of
expression in the proliferative zones. This demonstrates that the tandem motif,
identified by the bioinformatics pipeline, is necessary and sufficient to
trigger in vivo expression of calmodulin in the proliferative regions. These
data support the accuracy of our motif discovery pipeline.
Zebrafish is the model species with the most complete description of expression
patterns. More than 5000 genes are described throughout the major developmental
stages (from gastrulation to segmentation). We have compiled an expression
matrix with the described zebrafish genes as rows and the anatomical structures
as columns, and have run a hierarchical clustering algorithm on that matrix
(binary distance, complete linkage). The result is shown below. Going back to
the individual gene entries, the expression pattern descriptions of genes that
end up in the same cluster turn out to be very similar: the genes cypa and ctss
are expressed in the yolk syncytial layer; nanos and h1m are expressed in
primordial germ cells. This is a promising result, as it shows that a
large-scale clustering algorithm is able to classify genes into groups with
similar expression patterns. We will therefore use the zebrafish clustering for
the comparison with the promoter clustering (see below).
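
The clustering step can be illustrated with a small, self-contained Python
sketch. It assumes that "binary distance" means the Jaccard-type distance on
presence/absence vectors (the proportion of mismatches among the positions
where at least one gene is expressed); this is an assumption, not a statement
about the software actually used. The matrix is toy data: only the two
co-expression pairs quoted above come from the text, and "geneX" and the
third structure are made up.

```python
# Illustrative sketch: complete-linkage agglomerative clustering of genes on a
# binary expression matrix (rows = genes, columns = anatomical structures).
from itertools import combinations

STRUCTURES = ["yolk syncytial layer", "germ cells", "notochord"]
EXPRESSION = {
    "cypa":  [1, 0, 0],
    "ctss":  [1, 0, 0],
    "nanos": [0, 1, 0],
    "h1m":   [0, 1, 0],
    "geneX": [0, 0, 1],  # hypothetical gene, for illustration only
}


def binary_distance(a, b):
    """Jaccard-type distance: mismatches among positions where either is 1."""
    either = sum(1 for x, y in zip(a, b) if x or y)
    if either == 0:
        return 0.0
    differ = sum(1 for x, y in zip(a, b) if bool(x) != bool(y))
    return differ / either


def complete_linkage(matrix, n_clusters):
    """Merge the closest clusters until n_clusters remain.

    The distance between two clusters is the maximum pairwise gene distance.
    """
    clusters = [[gene] for gene in matrix]

    def dist(c1, c2):
        return max(binary_distance(matrix[g], matrix[h]) for g in c1 for h in c2)

    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: dist(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i stays valid
    return sorted(sorted(c) for c in clusters)


print(complete_linkage(EXPRESSION, 3))
```

On this toy matrix the genes with identical annotation vectors are merged
first, which mirrors the observation that genes in the same cluster have very
similar expression pattern descriptions. For a matrix of several thousand
genes a library implementation would be preferable to this quadratic loop.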
To be able to cluster expression patterns between different species we need to
establish relationships between genes, developmental stages and anatomical
entities.
We have compiled a list of 54 transcription factor binding sites involved in
developmental processes, taken from public resources: Transfac
(www.gene-regulation.com) and Jaspar (jaspar.cgb.ki.se/). We are now adapting a
motif search algorithm developed in the institute (Etwiller et al, submitted)
to run on a genome-wide scale using distributed runs on our Linux farm. The
search algorithm calculates a score for each hit based on the conservation of
the site in other vertebrate organisms, distinguishing conservation in fish
(Fugu, Tetraodon, zebrafish) from conservation in mammals (mouse, human, rat).
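
The actual scoring scheme is not described here. Purely to illustrate what
"distinguishing fish and mammalian conservation" could look like, the
following Python sketch gives each hit two separate scores: the fraction of
the listed fish species and the fraction of the listed mammalian species in
which the site is conserved. The function name, the input format and the
scoring rule are assumptions made for this sketch.

```python
# Illustrative sketch only: a hypothetical per-hit conservation score that
# keeps fish and mammalian conservation apart. Not the published algorithm.
FISH = ("fugu", "tetraodon", "zebrafish")
MAMMALS = ("mouse", "human", "rat")


def conservation_scores(conserved_in):
    """Score one motif hit.

    conserved_in: set of species names in which the site is conserved.
    Returns the conserved fraction within each species group.
    """
    fish = sum(1 for s in FISH if s in conserved_in) / len(FISH)
    mammal = sum(1 for s in MAMMALS if s in conserved_in) / len(MAMMALS)
    return {"fish": fish, "mammal": mammal}


# A hit conserved in two fish species and one mammal.
print(conservation_scores({"fugu", "zebrafish", "human"}))
```

Keeping the two values separate makes it possible to tell sites conserved only
within fish from sites conserved across vertebrates.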