Over the last decade, a great number of genomes from different organisms have been decoded. Knowing a genome sequence is an important step towards understanding it. However, the sequence itself provides little information about the role of the genes contained within the genome. Long-standing questions remain: What are the functions of the different genes? In what cellular processes do they participate? How are genes regulated? In which cell types, and under which conditions, do genes become active? How do various diseases or treatments influence the activity of genes? Or, conversely, how do genes contribute to diseases?
Transcriptomics, or the global analysis of gene expression, also called genome-wide expression profiling, is a way to answer these questions. While the genome represents an inventory of every available gene in an organism, the transcriptome reflects the genes that are being actively expressed at any given time. By studying the patterns of gene expression in different experimental conditions, researchers can gain an understanding of the genes and pathways involved in biological processes. A gene expression level is a numerical value assessing how strongly a gene is over-expressed (intensively active) or under-expressed (weakly active) compared with its activity under normal conditions (outside the experiment). Transcriptomics aims at discovering genes involved in similar biological processes using expression level measurements. So-called "in silico" annotations are deduced from overall gene expression measurements in particular experimental contexts.
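The text does not say how an expression level is computed. As a minimal sketch, assuming the common convention of a log2 ratio between the intensity measured in the experiment and the intensity measured in a reference (normal) condition, the over-/under-expression idea can be illustrated as follows. The gene names, intensities and the threshold of 1.0 (a two-fold change) are hypothetical values chosen for illustration only.

```python
import math

def log2_ratio(experiment: float, reference: float) -> float:
    """Log2 of the experiment/reference intensity ratio (assumed convention)."""
    if experiment <= 0 or reference <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(experiment / reference)

def classify(level: float, threshold: float = 1.0) -> str:
    """Label a gene from its log2 ratio; the threshold is an illustrative choice."""
    if level >= threshold:
        return "over-expressed"
    if level <= -threshold:
        return "under-expressed"
    return "unchanged"

# Hypothetical (experiment, reference) intensities for three made-up genes.
measurements = {
    "geneA": (800.0, 200.0),
    "geneB": (50.0, 400.0),
    "geneC": (310.0, 300.0),
}

levels = {gene: log2_ratio(e, r) for gene, (e, r) in measurements.items()}
calls = {gene: classify(level) for gene, level in levels.items()}

for gene in measurements:
    print(f"{gene}: log2 ratio = {levels[gene]:+.2f} ({calls[gene]})")
```

With this convention a positive value means the gene is more active than in the reference condition and a negative value means it is less active, so the two directions are treated symmetrically.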
My research aims to analyse data generated by gene expression profiling experiments in order to extract new knowledge from these data. The study focuses on high-throughput techniques that are able to profile the expression of thousands of genes in parallel. These include hybridization-based technologies, such as DNA microarrays, and sequencing-based approaches like SAGE (Serial Analysis of Gene Expression) and MPSS (Massively Parallel Signature Sequencing).
The analysis process includes statistical and mathematical analysis of the measured gene expression, but also the use of biological knowledge. Biological knowledge refers to biological information describing known gene properties and relationships. This knowledge is widely available from public sources, but it is spread over heterogeneous sources of information such as research papers, biological ontologies or regulation networks. Currently, most of the information is stored as plain text in millions of biological research papers. However, a growing number of repositories make their data available in more structured formats, sometimes organized with ontologies.
A widely used source of annotations is the Gene Ontology (GO). GO is a controlled vocabulary developed by a consortium of scientists that is used to describe (annotate) a gene or a gene product with regard to its molecular functions (its activities or abilities, e.g. the catalysis of a biochemical reaction), cellular components (its localizations in the cell, e.g. in the plasma membrane) and biological processes (the processes in which it participates, e.g. respiration). Other gene-centric annotations include phenotypic annotations (the measurable characteristics of an organism controlled by genes), disease annotations (the genetic diseases associated with genes), tissue-expression patterns (the association of genes with the tissues in which they are preferentially expressed) and homology information (genes in different species that share a common ancestor). However, as no gene operates in isolation, it is also important to consider information about the complex molecular networks orchestrating the activity of cells.
The study includes several key points:
- to improve the relevance and utility of extracted knowledge by extending existing data mining techniques and adapting them to biological data. The next challenge here is to integrate biological knowledge into all phases of the data mining process so as to make the most of existing knowledge,
- to integrate data coming from different sources and deal with recurrent problems in genome annotation, including inconsistent function descriptions, false (positive or negative) assignments, unsupported predictions and haphazard use of terms (activities concerning this point are detailed here),
- to represent knowledge (prior knowledge as well as generated knowledge) along with contextual or uncertainty information, in a form that enables some kind of reasoning (automatic inference, detection of inconsistencies).
The Thea project (Tools for High-throughput Experiment Analysis)
Thea (Tools for High-throughput Experiments Analysis) is an integrated information processing system that allows convenient handling of data. It automatically annotates data produced by classification systems with selected biological information drawn from a knowledge base, and lets the user either manually search and browse these annotations or automatically generate meaningful generalizations according to statistical criteria (data mining).
Co-expressed Gene Groups Analysis (CGGA)
CGGA (Co-expressed Gene Groups Analysis) is a novel data analysis method that automatically finds groups of genes that are functionally enriched, i.e. share the same functional annotations, and are co-expressed. CGGA automatically integrates microarray information, i.e. gene expression profiles, with the functional annotations of the genes obtained from genome-wide information sources such as Gene Ontology. By applying CGGA to well-known microarray experiments, we have identified the principal functionally enriched and co-expressed gene groups, and we have shown that this approach enhances and accelerates the interpretation of DNA microarray experiments [4–7].
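The statistics used by CGGA are not described in this text. As a generic illustration of what "functionally enriched" can mean, the sketch below uses a one-sided hypergeometric test, a widely used way of asking whether an annotation appears in a gene group more often than expected by chance. It is an assumed stand-in, not the CGGA procedure itself, and the gene identifiers and set sizes are made up.

```python
from math import comb

def enrichment_p_value(population: int, annotated: int, group: int, hits: int) -> float:
    """P(X >= hits) when `group` genes are drawn without replacement from a
    population of `population` genes, `annotated` of which carry the term."""
    upper = min(annotated, group)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, group - k)
        for k in range(hits, upper + 1)
    )
    return favourable / comb(population, group)

# Hypothetical toy data: 20 background genes, 5 of them annotated with a term,
# and a co-expressed group of 4 genes.
background = {f"g{i}" for i in range(1, 21)}
term_genes = {f"g{i}" for i in range(1, 6)}
group = {"g1", "g2", "g3", "g10"}

hits = len(group & term_genes)
p_value = enrichment_p_value(len(background), len(term_genes), len(group), hits)
print(f"{hits} of {len(group)} genes carry the term, p = {p_value:.4f}")
```

A small p-value suggests that the annotation is over-represented in the group. In practice many terms are tested at once, so a multiple-testing correction would normally be applied on top of this.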
GenMiner is an adaptation of closed-itemset-based association rule extraction to genomic data. It takes advantage of the novel NorDi discretization method and of the Close algorithm to efficiently generate minimal non-redundant association rules. GenMiner facilitates the integration of numerous sources of biological information, such as gene expression measurements and annotations, and can tacitly integrate qualitative information on biological conditions (age, sex, etc.). We validated this approach by analyzing the microarray datasets used by Eisen et al. in 1998 together with several sources of biological annotations: GO gene product descriptions, KEGG molecular interaction and reaction networks, PubMed/Medline literature information, Saccharomyces Genome Database phenotypic annotations and transcriptional regulators. The extracted associations revealed significant co-annotated and co-expressed gene patterns, showing important biological relationships between genes and their features. Several of these relationships are supported by recent biological literature [8–11].
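To make the notion of an association rule over integrated data concrete, here is a minimal sketch that computes the support and confidence of a single rule on a toy dataset. It implements neither the Close algorithm nor the NorDi discretization: the expression values are assumed to be already discretized into labels, the rule is evaluated by brute-force counting, and all gene names and items are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable

# Each gene is one "transaction" mixing annotation items and discretized
# expression items (hypothetical labels, assumed already discretized).
genes: Dict[str, FrozenSet[str]] = {
    "g1": frozenset({"annot:ribosome", "heat_shock:down"}),
    "g2": frozenset({"annot:ribosome", "heat_shock:down"}),
    "g3": frozenset({"annot:ribosome", "annot:nucleolus", "heat_shock:down"}),
    "g4": frozenset({"annot:ribosome"}),
    "g5": frozenset({"annot:glycolysis", "heat_shock:up"}),
}

def count(itemset: Iterable[str]) -> int:
    """Number of genes whose item set contains every item of `itemset`."""
    wanted = frozenset(itemset)
    return sum(1 for items in genes.values() if wanted <= items)

def support(itemset: Iterable[str]) -> float:
    """Fraction of genes containing the whole itemset."""
    return count(itemset) / len(genes)

def confidence(antecedent: Iterable[str], consequent: Iterable[str]) -> float:
    """Fraction of genes matching the antecedent that also match the consequent."""
    antecedent = frozenset(antecedent)
    matching = count(antecedent)
    if matching == 0:
        raise ValueError("antecedent matches no gene")
    return count(antecedent | frozenset(consequent)) / matching

rule_support = support({"annot:ribosome", "heat_shock:down"})
rule_confidence = confidence({"annot:ribosome"}, {"heat_shock:down"})
print(f"annot:ribosome -> heat_shock:down "
      f"(support = {rule_support:.2f}, confidence = {rule_confidence:.2f})")
```

A rule of this kind reads as "genes annotated with this term tend to show this expression pattern", which is the sort of co-annotated and co-expressed relationship described above. Enumerating all candidate rules this way does not scale, which is why dedicated closed-itemset algorithms are used in practice.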
1. Martinez R, Christen R, Pasquier C, Pasquier N: Exploratory Analysis of Cancer SAGE Data. In 9th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD’05), Discovery Challenge. Porto, Portugal; 2005.
2. Pasquier N, Pasquier C, Brisson L, Collard M: Mining Gene Expression Data using Domain Knowledge. International Journal of Software and Informatics (IJSI) 2008, 2:215–231.
3. Pasquier C, Girardot F, Jevardat de Fombelle K, Christen R: THEA: ontology-driven analysis of microarray data. Bioinformatics (Oxford, England) 2004, 20:2636–43.
4. Martinez R, Pasquier N, Collard M, Pasquier C, Lopez-Perez L: Co-expressed gene groups analysis (CGGA): An automatic tool for the interpretation of microarray experiments. Journal of Integrative Bioinformatics 2006, 3:1–12.
5. Martinez R, Pasquier N, Pasquier C, Collard M, Lopez-Perez L: Analyse des groupes de gènes co-exprimés (AGGC): un outil automatique pour l’interprétation des expériences de biopuces. In 13ème Rencontres de la Société Francophone de Classification (SFC’06). Edited by Cépaduès Editions. Metz; 2006:267–276.
6. Martinez R, Pasquier N, Pasquier C, Collard M, Lopez-Perez L: Analyse des groupes de gènes co-exprimés: un outil automatique pour l’interprétation des expériences de biopuces (version étendue). Revue des Nouvelles Technologies de l’Information (RNTI-C-2), Classification : points de vue croisés 2008, 831:263–74.
7. Martinez R, Pasquier N, Pasquier C, Lopez-Perez L: Interpreting microarray experiments via co-expressed gene groups analysis. In 9th International Conference of Discovery Science (ICDS’06), Lecture Notes in Computer Science. Volume 4265. Barcelona: Springer Berlin Heidelberg; 2006:316–320.
8. Martinez R, Pasquier C, Pasquier N: GenMiner: Mining Informative Association Rules from Genomic Data. In IEEE International Conference on Bioinformatics and Biomedicine (BIBM’07). Fremont, Silicon Valley, CA: IEEE; 2007:15–22.
9. Martinez R, Pasquier N, Pasquier C: Mining Association Rule Bases from Integrated Genomic Data and Annotations. In 5th International Conference on Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB’08). Edited by Berlin S. Vietri sul Mare, Salerno, Italy; 2008:33–43.
10. Martinez R, Pasquier N, Pasquier C: GenMiner: mining non-redundant association rules from integrated gene expression data and annotations. Bioinformatics (Oxford, England) 2008, 24:2643–4.
11. Martinez R, Pasquier N, Pasquier C: Mining Association Rule Bases from Integrated Genomic Data and Annotations (extended version). Lecture Notes in Bioinformatics 2009, 5488:78–90.