Nicolas PASQUIER ♦ Université Côte d'Azur

Benchmark Datasets

Sparse Datasets

These synthetic datasets were produced with Rakesh Agrawal's data generator and simulate market basket data.

  • T10I4D100: Each transaction contains up to 10 items with an average size of 4 items and the dataset contains 100000 transactions.
  • T20I6D100: Each transaction contains up to 20 items with an average size of 6 items and the dataset contains 100000 transactions.
  • T25I10D10: Each transaction contains up to 25 items with an average size of 10 items and the dataset contains 10000 transactions.
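
The naming scheme in the list above can be decoded mechanically. The following is a minimal sketch in Python, assuming the reading given on this page: T is the upper bound on items per transaction, I is the average size in items, and D is the number of transactions in thousands (inferred from the three entries). The names SyntheticParams and parse_name are hypothetical helpers and are not part of the data generator.

```python
import re
from typing import NamedTuple


class SyntheticParams(NamedTuple):
    max_items: int      # T: upper bound on items per transaction (per the list above)
    avg_items: int      # I: average size in items (per the list above)
    transactions: int   # D: number of transactions (the name gives thousands)


# Assumed name layout: T<digits>I<digits>D<digits>, e.g. "T10I4D100"
_NAME = re.compile(r"T(\d+)I(\d+)D(\d+)")


def parse_name(name: str) -> SyntheticParams:
    """Decode a synthetic dataset name into its three parameters."""
    match = _NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a T..I..D.. dataset name: {name!r}")
    t, i, d = (int(group) for group in match.groups())
    # Assumption: D counts thousands of transactions (D100 -> 100000).
    return SyntheticParams(t, i, d * 1000)
```

For instance, parse_name("T20I6D100") yields max_items=20, avg_items=6 and transactions=100000, matching the second entry of the list.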

Dense Datasets

These datasets were constructed from data publicly available at the UCI Data Repository.

  • Mushroom: This dataset has served as a benchmark for association rule extraction algorithms, including Apriori, Close and MaxMiner. It is a dense dataset containing 8416 lines (one per mushroom), each with 23 attributes describing mushroom features.
  • C73D10: Dataset constructed from the PUMS Kansas 1990 census data. The 10000 data lines were not truncated and contain 73 attributes.
  • C20D10: Dataset constructed from the PUMS Kansas 1990 census data. The 10000 data lines were truncated to include only the first 20 attributes.
  • Census: Dataset constructed from the PUMS Kansas 1990 census data. The 32561 data lines were truncated to include only the first 15 attributes.
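
The truncation described for C20D10 and Census keeps only the leading attributes of each data line. The sketch below illustrates that step in Python under a stated assumption: the file format is not specified on this page, so whitespace-separated values with one data line per record are assumed here. The function name truncate_attributes is hypothetical.

```python
from typing import Iterable, Iterator


def truncate_attributes(lines: Iterable[str], n_attributes: int) -> Iterator[str]:
    """Yield each data line reduced to its first n_attributes fields."""
    for line in lines:
        # Assumption: attribute values are separated by whitespace.
        fields = line.split()
        if fields:  # skip blank lines
            yield " ".join(fields[:n_attributes])
```

Applied to the 73-attribute census lines with n_attributes=20, this would produce lines shaped like those of C20D10; whether the original preprocessing worked exactly this way is not stated.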

Reference

Generating a condensed representation for association rules, Nicolas Pasquier, Rafik Taouil, Yves Bastide, Gerd Stumme and Lotfi Lakhal, Journal of Intelligent Information Systems, 24(1):29-60, 2005.