THE FIRST LAWS OF GENOMICS.
SLONIMSKI P.P.1, MOSSE M.O.1,
GOLIK P.1, HENAUT A.1, DIAZ Y1, RISLER
COMET J.P.2, AUDE J.C.2, WOZNIAK A.2, GLEMET E.2, CODANI J. J.2.
1 Centre de Genetique Moleculaire du CNRS, Avenue de la
91190 Gif-sur-Yvette, FRANCE.
2 INRIA, Batiment 8, BP 105, 78153 Le Chesnay Cedex, FRANCE.
Several complete microbial genome sequences have recently been established. They comprise 6087 Open Reading Frames (ORFs, i.e. genes potentially coding for proteins) from the budding yeast (a eukaryote) and more than 10 000 ORFs from Mycoplasma, Haemophilus, Methanococcus, Synechocystis and Escherichia coli (eubacteria and archaebacteria). Using specially developed mathematical and computer tools, we have made pairwise alignments (by Smith and Waterman dynamic programming) of all proteins from all genomes, estimated the sequence similarities (by calculating the Z-score of each alignment and the corresponding probability, which follows the Pareto distribution) and classified homologous proteins into clusters of paralogs (within the same genome) and orthologs (between different genomes). Contrary to the usual approach, where a single, arbitrary cut-off value is used to construct the single-link connective clusters of similar proteins, we have introduced a continuous "probability-threshold-scanning" approach. The 150 million pairwise alignments are scanned and the connective clusters of similar proteins are constructed for every probability threshold. In this manner we construct not just one set of connective clusters linked by a single similarity threshold, but a spectrum of sets, obtained by increasing the similarity threshold step by step. The analysis of this spectrum discloses a number of unexpected and singular regularities. They have strong predictive value and can therefore be viewed as laws of genomics.
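The threshold-scanning idea can be illustrated with a minimal sketch. It assumes that each pairwise alignment has already been reduced to a triple (i, j, p), where p is the probability attached to the Z-score, and that two proteins are linked when p is at or below the current threshold. The function names, the toy data and the use of a union-find structure are illustrative assumptions, not the tools used in this work.

```python
from collections import Counter


def single_link_clusters(n_items, scored_pairs, threshold):
    """Return single-link clusters: connected components of the graph whose
    edges are the pairs with probability <= threshold (assumed convention)."""
    parent = list(range(n_items))

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, p in scored_pairs:
        if p <= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_i] = root_j

    groups = {}
    for x in range(n_items):
        groups.setdefault(find(x), []).append(x)
    return sorted(groups.values())


def scan_thresholds(n_items, scored_pairs, thresholds):
    """Map each threshold to {cluster size: number of clusters of that size}."""
    spectrum = {}
    for t in thresholds:
        clusters = single_link_clusters(n_items, scored_pairs, t)
        spectrum[t] = dict(Counter(len(c) for c in clusters))
    return spectrum


# Hypothetical toy data: six proteins and three scored alignments.
pairs = [(0, 1, 1e-12), (1, 2, 1e-6), (3, 4, 1e-9)]
spectrum = scan_thresholds(6, pairs, [1e-10, 1e-8, 1e-5])
for t, sizes in spectrum.items():
    print(t, sizes)
```

This sketch rebuilds the clusters from scratch at every threshold for clarity. With many millions of alignments, one would more likely sort the pairs once by probability and merge clusters incrementally as the threshold is relaxed.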
The first law concerns paralogs. To our surprise, we found that the frequency of paralogous duplications follows a very simple distribution. This distribution can be described by two strictly equivalent formulae:
(1) number of sequences S_n in clusters of n sequences: S_n = N / 2^{n+1}
(2) number of clusters C_n of n sequences: C_n = N / (n * 2^{n+1})
where N is the total number of sequences (ORFs) in a given genome and n is 2, 3, 4, 5 or 6.
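The equivalence of the two formulae follows from the fact that a cluster of n sequences contains n sequences, so S_n = n * C_n. The short sketch below checks this with exact fractions and adds up the share of a genome that the law assigns to small clusters (n = 2 to 6). The function names are illustrative only.

```python
from fractions import Fraction


def fraction_in_clusters(n):
    """S_n / N: share of all ORFs found in clusters of exactly n paralogs."""
    return Fraction(1, 2 ** (n + 1))


def clusters_per_orf(n):
    """C_n / N: number of clusters of size n per ORF in the genome."""
    return fraction_in_clusters(n) / n


small_sizes = range(2, 7)  # the law is stated for n = 2 .. 6
fractions = {n: fraction_in_clusters(n) for n in small_sizes}
total = sum(fractions.values())

for n, share in fractions.items():
    # Each cluster of size n holds n sequences, hence S_n = n * C_n.
    assert clusters_per_orf(n) * n == share
    print(f"n={n}: S_n/N = {share}, C_n/N = {clusters_per_orf(n)}")
print(f"share of the genome in clusters of 2 to 6 paralogs: {total}")
```

Under the law as stated, the small clusters together account for 31/128 of the ORFs, i.e. a little under a quarter of a genome.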
Thus, 1/8 of the total number of genes in a given genome is present in clusters of two (and only two) homologous sequences, 1/16 in clusters of three (and only three) homologous sequences, and so on. We have verified by a series of statistical tests (e.g. the chi-squared conformity test) that this law holds true for all genomes analysed, whether prokaryotic or eukaryotic, and whatever the size of the genome. Interestingly, it does not apply to orthologous duplications, which shows that it describes a novel property of complete, individual genomes as such. This is reinforced by the fact that clusters of 2, 3, 4, 5 and 6 paralogs, although present in the same proportions in different organisms, are often not homologous between genomes. It should be stressed that the distribution of such clusters does not obey the Poisson distribution and that the N / 2^{n+1} law holds true whatever scoring matrix is used (e.g. PAM 250 or BLOSUM 62).
Before the completion of the DNA sequence of B. subtilis, we predicted the distribution of small paralogous clusters in this organism. This prediction has been verified experimentally: there are 268, 79, 31, 14 and 5 clusters containing respectively 2, 3, 4, 5 and 6 paralogs among the 3986 ORFs present in the genome, in perfect agreement with our theory.
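The B. subtilis figures can be set against formula (2) directly. The sketch below computes the expected number of clusters for N = 3986 and a simple chi-squared statistic over the five size classes. The choice of classes and the absence of a significance threshold are assumptions of this illustration, since the exact design of the conformity tests is not described here.

```python
def expected_clusters(total_orfs, n):
    """Formula (2): C_n = N / (n * 2^(n+1))."""
    return total_orfs / (n * 2 ** (n + 1))


N_ORFS = 3986  # ORFs in the B. subtilis genome, as quoted in the text
observed = {2: 268, 3: 79, 4: 31, 5: 14, 6: 5}  # observed cluster counts

expected = {n: expected_clusters(N_ORFS, n) for n in observed}

# Pearson chi-squared statistic restricted to the five small-cluster classes.
chi2 = sum((observed[n] - expected[n]) ** 2 / expected[n] for n in observed)

for n in observed:
    print(f"n={n}: observed {observed[n]:3d}, expected {expected[n]:6.1f}")
print(f"chi-squared statistic over the five classes: {chi2:.2f}")
```

The expected counts come out at roughly 249, 83, 31, 12 and 5. The largest relative deviation is in the class of pairs (268 observed against about 249 expected).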
In contrast to these small clusters, the large clusters of paralogous duplications (beginning with clusters of 7 sequences) behave quite differently. Some of them are very large (more than 30 homologous sequences in a single cluster). As expected, they are more abundant in the larger genomes than in the smaller ones. These large expansions appear to be determined by Darwinian selection of adaptive functions: for example, a eukaryote amplifies specific families of mitochondrial carriers and transcription regulators which are totally absent from prokaryotes, while E. coli and Synechocystis amplify specific families of regulators and sensors which are practically absent from the yeast and the other bacterial genomes.
The second law of genomics concerns orthologs. By analysing the number of protein sequences and the number of protein families common to different species, we have established the Global Genomic Proximity index. It demonstrates that Eucarya (here, yeast) are evolutionarily closer to Archaea (here, Methanococcus) than to Eubacteria. A more unexpected result concerns the part of each genome that is homologous to other genomes. The universal orthologs common to all five species examined represent a practically constant fraction of every genome. In other words, the number of orthologous protein sequences is directly proportional to the genome size, owing to the pari passu duplication of paralogs.