TITLE : "Construction of protein sequences families"

AUTHORS : Jean-Christophe AUDE Jean-Paul COMET


We present a method to decompose large sets of proteins into families. We provide a pyramidal clustering of proteins within each family and a hierachical clustering of families. These clustering methods are based on dissimilarity index. The Z-score [PL88] provides statistical significances of alignment scores, that can be combined with the Wiener information formula to create this index. This method has been applied to the whole yeast genome (6482 sequences).

A Z-score is simply the comparison of an actual alignment score with the scores obtained on a set of random sequences by a Monte-Carlo process. Scores are calculated using the Smith & Waterman [SW81] algorithm.

Here we determine a relationship between the calculated Z-score, the variance of its distribution and the number of randomizations in the Monte-Carlo process. We observe that pairwise alignments, performed on random sequences having the same amino acid composition as real proteins, result in Z-scores that fit the distribution of extreme values. For real sequences, however, we observe an over-representation of the highest Z-scores. Thus, if the significance of an alignment score is based on the theoretical distribution of extreme values, which is usually the case, then the significance of the highest Z-scores will be overestimated. We determine a cut-off value which splits higher and lower Z-scores. We therefore introduce a model of a mixed distribution for estimating the significance of an alignment, based on the distribution of the extreme values for the lowest Z-scores and on an empirical estimation of the distribution of the highest Z-scores. We use this model to estimate the probability associated with a Z-score. Wiener information formula gives us the dissimilarity index for this probability.

The computation of Z-scores require intensive computations. It has been achieved using LASSAP software [GC96] running on a 32 nodes IBM SP2 machine and driving a BIOCCELERATOR board.

We use a graph to represent pairwise sequences dissimilarities. The vertex set is composed of sequences and the edges of the dissimilarity index. We select edges with a value greater than a given threshold and compute connected components to build families. We use a threshold based on the cut-off value which splits higher and lower Z-scores.

To understand how these families are organized together we build a hierarchy of all components. We gather components regarding the lowest dissimilarity index between all sequences in each of them. Thus we keep an equivalence between the hierachy and connected components with a upper threshold.

Families are simply built upon the transitive link between sequences. Since this information is not informative enough, we use pyramidal clustering on each family. Pyramidal clustering is an extension of hierarchy that allows overlapping clusters, and detection of local orders. This method highlights sub-families within a family and highlights sequences which share properties among some sub-families. These informations cannot be found with unroot-trees such as used in CLUSTAL [HS88].