This directory contains methods for assessing data mining results using swap randomization as described in Aristides Gionis, Heikki Mannila, Taneli Mielikäinen, and Panayiotis Tsaparas. Assessing data mining results via swap randomization. In Mark Craven and Dimitrios Gunopulos (Eds.): The Twelfth Annual SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2006). ACM, 2006. Swap randomization for transaction databases and frequent itemsets ------------------------------------------------------------------ Perl scripts swaptdb.pl and swapfreq.pl produce swap-randomized transaction databases and frequent itemsets. freqchanges.pl compares the frequencies of frequent itemsets in the original and swap-randomized transaction databases. Usage: perl swaptdb.pl tdbfile tdbprefix stepsize steps parameters tdbfile : the name of the file consisting the transaction database tdbprefix : the prefix of the filenames of the swap-randomized transaction databases produced by the program stepsize : the number of attempted swaps between the stored transaction databases steps : the number of swap-randomized transaction databases produced The format of the transaction database is the same as used in FIMI repository (http://fimi.cs.helsinki.fi), i.e., the items in the transaction databases are positive integers and each row in the file is one transaction represented as a list of items in ascending order. Usage: perl swapfreq.pl tdbfile freqprefix tmpfile minsupp stepsize steps parameters tdbfile : the name of the file consisting the transaction database freqprefix : the prefix of the filenames of the swap-randomized transaction databases produced by the program tmpfile : the name of the temporary file used to store the swap-randomized transaction databases minsupp : the minimum support threshold for frequent itemset mining stepsize : the number of attempted swaps between the stored transaction databases steps : the number of swap-randomized transaction databases produced The format of the transaction database is the same as for swaptdb.pl. The script need a program 'fim_all' for mining frequent itemsets, as given in the FIMI repository. Usage: perl freqchanges.pl tdbfile tmpfile1 tmpfile2 minsupp minsupporig \ minsuppswap stepsize parameters tdbfile : the name of the file consisting the transaction database freqprefix : the prefix of the filenames of the swap-randomized transaction databases produced by the program tmpfile1 : the name of the temporary file used to store the frequent itemsets in the swap-randomized transaction databases tmpfile2 : the name of the temporary file used to store the swap-randomized transaction databases minsupp : the minimum support threshold for items to be considered minsupporig : the minimum support threshold for frequent itemset mining in the original transaction database minsuppswap : the minimum support threshold for frequent itemset mining in the swap-randomized transaction database stepsize : the number of attempted swaps for producing the swap-randomized transaction database The format of the transaction database is the same as with swaptdb.pl and swapfreq.pl, and the some implementation of 'fim_all' e.g. from the FIMI repository is needed. The columns of the output are the following: freqorig freeqswap relerrorig relerrswap avgcorrorig avgcorrswap \ freqswap/freqorig freqorig/freqswap ratioorig ratioswap ; itemset freqorig : the support of the itemset in the original data freqswap : the support of the itemset in the swap-randomized data relerrorig : (freqorig-freqswap)/freqorig relerrswap : (freqorig-freqswap)/freqorig avgcorrorig : the average correlation between the items of the itemset in the original transaction database avgcorrswap : the average correlation between the items of the itemset in the swap-randomized transaction database swaplift : freqswap/freqorig liftswapped : freqorig/freqswap ratioorig : the lift in the original transaction database ratioswap : the lift in the swap-randomized transaction database itemset : the list of the items in the itemset in ascending order Mex file for swaping a 0-1 matrix --------------------------------- C code that compiles and runs through Matlab. File: swap.c In order to compile follow the following steps inside Matlab: 1. set the compiler option: >> mex -setup (in our setup it is option 2 for gcc) 2. Compile the file: >> mex swap.c Usage of function: Y = swap(X) is the default call X is the input binary matrix. Y is the output swapped matrix. The number of swaps attempted is the number of ones of matrix X. Y = swap(X,k) attempts to perform k swaps. [Y,t] = swap(X) records t, the actual number of swaps that took place. Matlab function to test significance of clustering -------------------------------------------------- Compute clustering error on the original data and the swapped data File: clusteringtest.m Usage: [orig,permuted,sw] = clusteringtest(Dinp,k,nosamples,L); Input Dinp: input 0-1 matrix k: number of clusters to be tested nosamples: number of samples (swapped datasets) to be drawn L: number of swaps to be attempted in order to generate each sample Output orig: error on the original data permuted: vector of errors on permuted data sw: vector of actual swaps that took place in each sample For example, generate artificial data that contain clustered structure (use files gendata.m and randgd.m provided in this directory) >> A = gendata(100,20,5,0.1); % 100 points, 20 clusters, 5 clusters, 0.1 noise and run >> [or,perm,sw] = clusteringtest(A,5,100);