Supplementary Materials for:

THEA: ontology-driven analysis of microarray data. Bioinformatics (Oxford, England) 2004

Data sources

Data pre-processing

The CEL files were downloaded from the Genome-wide Expression Patterns of Drosophila in Response to Immune Challenge Homepage and processed with Bioconductor's affy package (Bioconductor is open-source software; the affy package is dedicated to the analysis of Affymetrix data). The releases used were Bioconductor 1.2 and affy 1.2.30 (for a review of Bioconductor's specificities, see for example Dudoit et al., 2003). Expression indexes were calculated with the 'rma' method using the default options, as described by Irizarry et al., 2003; this is arguably one of the most pertinent ways to process Affymetrix array measurements so far (for comparisons see Irizarry et al., 2003).

All possible treated vs control (i.e. infected vs non-infected) ratios were computed. These results were then submitted to a SAM multiclass analysis in order to select the genes showing statistically significant variation of expression across experimental conditions. The chosen parameters ensured that less than 1% false positives were selected and led to the selection of 1623 probe-sets, of which only those showing a mean fold-change of at least 1.3 in any comparison were retained for further analysis. This further reduced the data to 1290 probe-sets, corresponding to 1277 independent gene products. For each of these selected probe-sets and each experimental condition, the mean of the logged expression values across replicates was calculated; these means were subsequently used to calculate the different infected/uninfected logged expression ratios.

This dataset was then uploaded to the bioinformatics web site of CNIO (Herrero et al., 2003), where a preprocessor (Herrero et al., 2003) allowed the merging of the replicates (using the median of the ratios) as well as the generation of input files for the different classification programs available on the same server. We chose to perform a SOTA analysis (Herrero et al., 2001) using the 'correlation coefficient (linear)' metric and '90% variability' as the end training condition. The generated tree was loaded into THEA and analyzed using the program's features.
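The log-ratio and fold-change steps described above can be illustrated with a minimal Python sketch. It is not the code used for the analysis: the probe-set names, condition labels and values are hypothetical, the logged values are assumed to be base 2 (as reported by RMA), and the 1.3 threshold is assumed to apply in both directions (induction or repression).

```python
from statistics import mean

# Hypothetical log2 expression values: probe-set -> condition -> replicates.
log_expr = {
    "probe_A": {"uninfected": [7.0, 7.2], "infected_1h": [8.1, 8.3], "infected_6h": [7.1, 7.0]},
    "probe_B": {"uninfected": [5.0, 5.1], "infected_1h": [5.1, 5.0], "infected_6h": [5.2, 5.1]},
}

MIN_FOLD_CHANGE = 1.3  # threshold quoted in the text


def log_ratios(conditions, control="uninfected"):
    """Mean log2 expression of each condition minus the mean of the control."""
    means = {name: mean(values) for name, values in conditions.items()}
    return {name: m - means[control] for name, m in means.items() if name != control}


def passes_filter(ratios, min_fold=MIN_FOLD_CHANGE):
    """True if any comparison reaches `min_fold`, up or down (assumption)."""
    return any(2 ** abs(r) >= min_fold for r in ratios.values())


# Keep only the probe-sets that pass the fold-change filter.
selected = {}
for probe, conditions in log_expr.items():
    ratios = log_ratios(conditions)
    if passes_filter(ratios):
        selected[probe] = ratios
```

With these made-up values only probe_A is retained, because its 1 h comparison exceeds the threshold while no comparison for probe_B does.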

Processed Data files

RMA data

Expression levels calculated with the RMA method.

Expression ratios

Logged infected/uninfected expression ratios, with the replicates merged by GEPAS (using the median of the ratios).
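As a rough illustration of merging replicates by the median of their ratios, the following Python sketch shows the operation on made-up numbers. It does not reproduce the GEPAS preprocessor itself; the comparison labels and values are hypothetical.

```python
from statistics import median

# Hypothetical replicate log ratios for one probe-set, per comparison.
replicate_ratios = {
    "infected_1h": [1.0, 1.4, 1.1],
    "infected_6h": [0.2, -0.1],
}

# One merged value per comparison: the median of the replicate ratios.
merged = {
    comparison: median(values)
    for comparison, values in replicate_ratios.items()
}
```

The median is less sensitive than the mean to a single aberrant replicate, which is a common reason for preferring it when few replicates are available.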

SOTA tree

Result of the SOTA analysis (Herrero et al., 2001) using the 'correlation coefficient (linear)' metric and '90% variability' as the end training condition.
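To make the 'correlation coefficient (linear)' metric more concrete, here is a minimal Python sketch of a distance based on the Pearson correlation between two expression profiles. How the SOTA implementation converts the coefficient into a distance is not stated here; the `1 - r` form below is a common convention and should be read as an assumption, and the function names are illustrative.

```python
from math import sqrt


def pearson(x, y):
    """Linear (Pearson) correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlation_distance(x, y):
    """Assumed convention: 0 for identical trends, 2 for opposite trends."""
    return 1.0 - pearson(x, y)
```

Under this convention, two genes whose log ratios rise and fall together across conditions are close even if the amplitudes differ, which is why correlation-based metrics are often chosen for clustering expression profiles.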


Statistics of the namings performed by THEA using the cutoffs described in the legend of Figure 3 of the paper. (MF: Molecular Function, CC: Cellular Component, BF: Biological Function, DGA: Drosophila Gross Anatomy, DDS: Drosophila Developmental Stage).


Comparison of THEA and other related programs.