Thursday, August 6, 2009

Datasets for Association Rule Mining

A typical transaction record consists of a transaction ID and a list of items, with one transaction per row. Sometimes the items are represented as boolean values: 0 if the item was not bought, 1 if it was. The most commonly used format for market basket data, however, lists only the numeric item identifiers of each transaction, without any other information:
1 3 5 9 11 20 31 45 49
3 7 11 12 15 20 43...
This format has to be converted before it can be used by ARMiner and ARtool, since those tools can only evaluate binary data. Both ARMiner and ARtool provide a special converter for that purpose, which has to be run before the data can be analyzed. WEKA needs a special ASCII data format (*.arff) for data analysis, containing information about the attributes and a boolean representation of the items. Since there is no common format for input data, a dataset stored in one format cannot be evaluated with different tools as it stands. In this paper, we present a dataset generator that is able to generate datasets that are readable by ARMiner, ARtool, WEKA and other data mining tools. Additionally, the generator can produce large market basket datasets with timestamps to simulate transactions in both retail and e-commerce environments.
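
To make the conversion concrete, here is a minimal Python sketch that turns the numeric item-list format into boolean rows and then into ARFF text. It is not the generator described in the paper and not the ARMiner or ARtool converter. The function names, the `item_<id>` attribute names and the relation name `market_basket` are assumptions chosen for illustration, and the second sample line uses only the visible part of the truncated example above.

```python
def parse_baskets(lines):
    """Parse lines of whitespace-separated item ids into sorted lists of ints."""
    baskets = []
    for line in lines:
        items = sorted({int(tok) for tok in line.split()})
        if items:  # skip empty lines
            baskets.append(items)
    return baskets


def to_boolean_rows(baskets):
    """Return (item_ids, rows); rows[i][j] is 1 if basket i contains item_ids[j]."""
    item_ids = sorted({item for basket in baskets for item in basket})
    column = {item: j for j, item in enumerate(item_ids)}
    rows = []
    for basket in baskets:
        row = [0] * len(item_ids)
        for item in basket:
            row[column[item]] = 1
        rows.append(row)
    return item_ids, rows


def to_arff(item_ids, rows, relation="market_basket"):
    """Render boolean rows as ARFF text with one nominal {0,1} attribute per item.

    The attribute naming scheme (item_<id>) is an assumption, not a WEKA requirement.
    """
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute item_{item} {{0,1}}" for item in item_ids]
    lines += ["", "@data"]
    lines += [",".join(str(v) for v in row) for row in rows]
    return "\n".join(lines) + "\n"


# Illustrative input: the two example transactions, second one cut at the ellipsis.
sample = ["1 3 5 9 11 20 31 45 49", "3 7 11 12 15 20 43"]
baskets = parse_baskets(sample)
item_ids, rows = to_boolean_rows(baskets)
arff_text = to_arff(item_ids, rows)
print(arff_text)
```

Each distinct item becomes one column, so the boolean representation grows with the number of distinct items rather than with basket size. That is why sparse numeric lists are the more compact storage format for large market basket datasets.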
