1       Current Research

1.1     Expression Pattern Databases

The importance of in situ hybridisation data is reflected by the fact that high-throughput screens have been performed for the major model species and made accessible to the scientific community through specialized databases. I have contributed to this effort by developing two species-specific databases and one cross-species database for gene expression data.

1.1.1    Species Specific Databases

I have developed expression pattern databases for two model species worked on at EMBL. Medaka is a small freshwater fish that has been a model system for genetics in Japan for around 100 years. Today it is used in developmental biology, molecular biology and ecology. Its genome has recently been sequenced and is available in EnsEMBL (www.ensembl.org).

Platynereis is an annelid worm that is important because of its basal position in evolution. It is placed close to the divergence of Protostomia (Drosophila, C. elegans) and Deuterostomia (vertebrates).

 

·       MEPD: Medaka Expression Pattern Database  http://ani.embl.de:8080/mepd/

·       PEPD: Platynereis Expression Pattern Database http://ani.embl.de:8080/pepd/

 

PEPD is not yet published, whereas MEPD has been described in two publications (Henrich T, 2003, 2005) and has developed into a central resource within the medaka scientific community.

1.1.2    Cross Species Database

Many species-specific databases containing in situ expression data have been developed in recent years. However, there is no common resource that allows expression patterns to be compared across species boundaries. We have developed such a resource (Haudry Y, 2007). To compare gene expression patterns between species, we have to establish relationships between genes, developmental stages and anatomical structures.

Relationships between genes can be established through sequence homology, and public resources (e.g. EnsEMBL) can be used for that purpose. Since June 2006 EnsEMBL has used a new tree-based algorithm to predict orthologs, which is much more reliable than the previous BLAST reciprocal best hit approach.

I have developed a strategy to establish relationships between developmental stages of different species. To avoid a combinatorial explosion of pairwise stage comparisons, I developed an ontology of developmental stages for all bilaterian animals. The stages of each species are mapped onto this ontology, and links between species are then made through the cross-species ontology. So far I have mapped the stages of zebrafish, medaka, mouse and Drosophila.
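The idea of linking species through one shared stage ontology, rather than through pairwise stage tables, can be illustrated with a minimal sketch. All stage labels and common terms below are hypothetical placeholders, not entries from the actual ontology, and the function names are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical mapping: (species, species-specific stage) -> shared ontology term.
# Every label here is an illustrative placeholder, not real ontology content.
STAGE_TO_COMMON = {
    ("zebrafish", "zf_stage_x"): "common_stage_1",
    ("medaka", "ol_stage_x"): "common_stage_1",
    ("mouse", "mm_stage_x"): "common_stage_1",
    ("zebrafish", "zf_stage_y"): "common_stage_2",
    ("medaka", "ol_stage_y"): "common_stage_2",
}


def build_index(stage_to_common):
    """Invert the mapping: shared term -> set of (species, stage) pairs."""
    index = defaultdict(set)
    for (species, stage), common in stage_to_common.items():
        index[common].add((species, stage))
    return index


def equivalent_stages(species, stage, target_species,
                      stage_to_common=STAGE_TO_COMMON):
    """Return the stages of target_species that share a common term with
    the given stage. Only one mapping per species is needed, so no
    pairwise species-to-species table has to be maintained."""
    common = stage_to_common.get((species, stage))
    if common is None:
        return []
    index = build_index(stage_to_common)
    return sorted(s for sp, s in index[common] if sp == target_species)


print(equivalent_stages("zebrafish", "zf_stage_x", "medaka"))
```

With N species this design needs N mappings onto the shared ontology instead of N(N-1)/2 pairwise mappings, which is the combinatorial explosion the text refers to.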

Establishing relationships between anatomical structures will be the major challenge of the project. Some of these relationships (common knowledge of homologous structures) can be entered beforehand; others will be established through comparative analysis of gene expression.

In addition to our cross-species repository, we are integrating in situ data with microarray data and are therefore developing an in situ warehouse in the ArrayExpress group at the EBI in Hinxton, UK. This warehouse will be optimized for fast queries and will contain only the data necessary for the searches. Links from the warehouse back to the repository will be established to access the full information on an entry.

The major aim of this database is the establishment of a platform:

·       for the scientific community to query cross species gene expression patterns and microarray data

·       to easily compile data for data analysis using bioinformatics tools

 

 

1.2     Ontologies

For the annotation of gene expression patterns in medaka I have developed a detailed ontology representing anatomical structures and their relationships to each other for all 44 developmental stages. This was done in collaboration with zebrafish researchers (zfin.org) in order to give the same names to corresponding structures in the two fish species. The ontology consists of over 4000 terms and is accessible at obo.sourceforge.net.

It is also used to annotate mutant phenotypes and is described in a publication on a fish mutant database (Henrich T, 2004).

1.3     MISFISHIE: A Standard for In situ Expression Data

In order to exchange and compare expression data from different species, it is important to develop a standard for in situ hybridisation data. For this reason we have joined an initiative working towards that goal.

The first step in this process is to agree on the minimal information needed to fully describe and repeat any kind of in situ hybridisation experiment. This specification is called MISFISHIE: Minimum Information Specification For In Situ Hybridization and Immunohistochemistry Experiments (http://mged.sourceforge.net/misfishie).

The initiative consists of groups that have experience in producing or organising in situ hybridisation data and groups that were involved in developing a similar standard for microarray experiments (MIAME: Minimum information about a microarray experiment).

MISFISHIE describes 6 categories:

·       Experimental design

·       Biomaterial and treatment (e.g. specimen, fixation, sections)

·       Reporters (probe sequence or antibody gene id)

·       Staining (protocol)

·       Imaging (magnification, illumination, microscope)

·       Image characterization (expression pattern annotation)

The MISFISHIE standard has been announced in a publication (Deutsch EW et al 2006) and a detailed description is published in Nature Biotechnology.
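As an illustration of how the six categories listed above could be checked for completeness in a submission, here is a minimal sketch. The class name, field names and the free-text representation are assumptions made for this example; the specification itself defines what each category must contain.

```python
from dataclasses import dataclass, fields


@dataclass
class MisfishieRecord:
    """Hypothetical container with one free-text field per category."""
    experimental_design: str = ""
    biomaterial_and_treatment: str = ""
    reporters: str = ""
    staining: str = ""
    imaging: str = ""
    image_characterization: str = ""


def missing_categories(record):
    """Return the names of categories that are still empty."""
    return [f.name for f in fields(record)
            if not getattr(record, f.name).strip()]


record = MisfishieRecord(
    experimental_design="example design description",
    reporters="example probe description",
)
print(missing_categories(record))
```

A record would only be considered complete in this sketch when `missing_categories` returns an empty list.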

1.4     Data Analysis

1.4.1    Expression Clustering in Medaka

In collaboration with Mirana Ramialison, a former master's student of mine, we have applied a hierarchical clustering algorithm to characterise 600 medaka genes annotated in MEPD. Several synexpression groups could be identified, for example a group of genes expressed in the otic vesicles and the yolk, or a group of genes expressed in the marginal zones of the tectum and other proliferative tissues.

The cluster of genes expressed in the proliferative zones (ciliary marginal zone of the retina, tectum marginal zone, fin periphery) consists of 48 genes. By systematically applying a motif discovery pipeline to the 2 kb upstream regions of these genes, we discovered three groups of evolutionarily conserved motifs specifically over-represented in this dataset. These motifs correspond to both novel and known transcription factor binding sites. One of them is the Sox binding site. Interestingly, the Sox motif and one unknown binding motif can be found in the upstream region of sox3 itself, in a genomic region conserved within teleosts, as well as in the upstream region of calmodulin, where it is conserved from fish to mammals. It has been shown that Sox3 and Calmodulin promote proliferation (Xia Y., 2000) and that another member of the Sox family (Sox9) directly binds to Calmodulin (Harley VR., 1996). Therefore, it appears likely that these two proteins are co-regulated by the same trans-acting factors in order to ensure co-expression with respect to time, space and levels.

The functionality of the module containing the two motifs was tested by transgene expression with GFP as reporter. The resulting transgenic animals recapitulate the expression pattern of the endogenous calmodulin gene. The injection of the same region lacking the tandem motif resulted in the loss of expression in the proliferative zones. This demonstrates that this tandem motif, identified in the bioinformatics pipeline, is necessary and sufficient to trigger in vivo expression of calmodulin in the proliferative regions. These data support the accuracy of our motif discovery pipeline.

1.4.2    Expression Clustering in Zebrafish and Multi Species

Zebrafish is the model species with the most complete description of expression patterns. More than 5000 genes are described throughout the major developmental stages (from gastrulation to segmentation). We have compiled an expression matrix with the described zebrafish genes as rows and the anatomical structures as columns, and have run a hierarchical clustering algorithm on that matrix (binary distance, complete linkage clustering). The result is shown below. Genes that end up in the same cluster turn out to have very similar expression pattern descriptions in their individual entries: the genes cypa and ctss are expressed in the yolk syncytial layer; nanos and h1m are expressed in primary germ cells. This is a promising result, as it shows that a large-scale clustering algorithm is able to classify genes into groups with similar expression patterns. We will therefore use the zebrafish clustering for the comparison with the promoter clustering (see below).
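The clustering step can be sketched as follows on a toy matrix. This is a minimal illustration, not the analysis itself: it assumes that "binary distance" means the Jaccard-type distance on presence/absence data (as in R's dist(method="binary")), uses SciPy, and the gene names, structure names and the cut-off of 0.5 are hypothetical placeholders.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Toy presence/absence matrix: rows are genes, columns are anatomical
# structures. All names and values are placeholders for illustration.
genes = ["gene_a", "gene_b", "gene_c", "gene_d"]
structures = ["structure_1", "structure_2", "structure_3"]
expression = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 1],
], dtype=bool)

# Assumption: "binary distance" is taken to be the Jaccard distance.
distances = pdist(expression, metric="jaccard")

# Complete linkage: the distance between two clusters is the largest
# distance between any pair of their members.
tree = linkage(distances, method="complete")

# Cut the tree at a hypothetical distance threshold to obtain groups.
labels = fcluster(tree, t=0.5, criterion="distance")

clusters = {}
for gene, label in zip(genes, labels):
    clusters.setdefault(int(label), []).append(gene)
print(clusters)
```

In this toy example the genes with identical annotation rows fall into the same group, which mirrors the observation that co-clustered genes have very similar expression pattern descriptions.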

To cluster expression patterns across different species, we need to establish relationships between genes, developmental stages and anatomical entities.

1.4.3    Motif Pipeline

We have compiled a list of 54 transcription factor binding sites involved in developmental processes, taken from public resources: Transfac (www.gene-regulation.com) and Jaspar (jaspar.cgb.ki.se/). We are now adapting a motif search algorithm developed in the institute (Etwiller et al, submitted) to run on a genome-wide scale using distributed runs on our Linux farm. The search algorithm calculates a score for each hit based on the conservation of the site in other vertebrate organisms, distinguishing fish (Fugu, Tetraodon, zebrafish) from mammalian (mouse, human, rat) conservation.
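The actual scoring function is not described here, so the following is only a hypothetical sketch of what scoring a hit separately for fish and mammalian conservation could look like. The function name and the choice of a simple fraction of species per group are assumptions made for this example; only the two species groups come from the text.

```python
# Species groups as named in the text.
FISH = ("Fugu", "Tetraodon", "zebrafish")
MAMMALS = ("mouse", "human", "rat")


def conservation_score(conserved_in):
    """Hypothetical score: the fraction of species in each group in which
    the binding site hit is conserved, reported separately per group."""
    conserved = set(conserved_in)
    fish = sum(species in conserved for species in FISH) / len(FISH)
    mammal = sum(species in conserved for species in MAMMALS) / len(MAMMALS)
    return {"fish": fish, "mammal": mammal}


# A hit conserved in all three fish but in none of the mammals.
print(conservation_score(["Fugu", "Tetraodon", "zebrafish"]))
```

Keeping the two values separate, rather than collapsing them into one number, makes it possible to tell a site conserved only within fish from one conserved from fish to mammals.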