----- What: -----

This is the readme file for the Correlation Clustering Weighted Partial MaxSAT datasets. These datasets are part of the publication: J. Berg, A. Hyttinen, M. Jarvisalo: "Applications of MaxSAT in Data Analysis".

---------- References ----------

The datasets are the product of the following paper:

Optimal Correlation Clustering via MaxSAT. Jeremias Berg and Matti Jarvisalo. In Wei Ding, Takashi Washio, Hui Xiong, George Karypis, Bhavani M. Thuraisingham, Diane J. Cook and Xindong Wu, editors, Proceedings of the 2013 IEEE 13th International Conference on Data Mining Workshops (ICDMW 2013), pages 750-757, IEEE Computer Society, 2013.

----------------------- Problem Specifics -----------------------

These instances were generated using three different encodings of the correlation clustering problem into Weighted Partial MaxSAT. For details on correlation clustering or the encodings, please see the references.

----------------------- Data Sets used -----------------------

The problems were created from a variety of datasets, including:

Protein 1, 2, 3, and 4: Similarity values between amino acid sequences of proteins. Obtainable from http://www.paccanarolab.org/scps.

ORL: The AT&T ORL database of images of faces. Obtainable from http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html

Ionosphere: The UCI Ionosphere dataset, for classification of radar returns from the ionosphere. Obtainable from http://archive.ics.uci.edu/ml/

Breastcancer: The LIBSVM breast-cancer dataset, originally named "Wisconsin Breast Cancer" in UCI. Obtainable from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

Ecoli: The UCI Ecoli dataset, containing protein localization sites. Obtainable from http://archive.ics.uci.edu/ml/

Vowel: The LIBSVM Vowel dataset, originally from UCI, with 10 features. Obtainable from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

----------------------- Computation of Similarity values -----------------------

Before the creation of the MaxSAT instances, a pairwise similarity measure over the points in the data set had to be computed.

The four protein datasets already contained a similarity measure (calculated by BLAST) in the range [0,1]. In order to fit the data into the clustering setting outlined in the references, we linearly transformed these values by subtracting 0.5, so that the final similarity values all lie between -0.5 and 0.5.

For the other data sets we first calculated the Euclidean distance between all pairs of points and normalized the resulting values to the range [0,1]. We then linearly transformed these values to the range [-0.5, 0.5]. Finally, we simulated sparsity in the data by filtering out all similarity values s for which |s| < 0.2. The filtering step was omitted for the protein data, as that data was already incomplete.
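For illustration, the following is a minimal Python/NumPy sketch of this preprocessing for the non-protein datasets. The function name, the normalization details, and the sign convention (small distance mapped to high similarity) are assumptions made for this sketch; the preprocessing is described above only informally, so this is not a reproduction of the original scripts.

import numpy as np

def pairwise_similarities(points, threshold=0.2):
    """Sketch of the similarity preprocessing described above (assumed details).

    points: (n, d) array-like of feature vectors (non-protein datasets).
    Returns a dict mapping index pairs (i, j), i < j, to similarity values;
    pairs with |similarity| < threshold are dropped to simulate sparsity.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)

    # Pairwise Euclidean distances.
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))

    # Normalize distances to [0, 1], then map linearly to [-0.5, 0.5].
    # The orientation (small distance -> similarity near +0.5) is an assumption.
    sims = 0.5 - dists / dists.max()

    # Simulate sparsity: keep only pairs with |s| >= threshold.
    return {(i, j): sims[i, j]
            for i in range(n)
            for j in range(i + 1, n)
            if abs(sims[i, j]) >= threshold}

# Example: pairwise_similarities(np.random.rand(20, 5))

The protein instances skip both the distance computation (BLAST similarities are used directly, shifted by -0.5) and the filtering step.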
----------- FILE NAMES -----------

Files are named following the convention:

(Rounded)_<dataset>_<encoding>_N_D.wcnf

where:

Rounded    - indicates whether or not the weights in the MaxSAT instance are rounded to whole numbers.
<dataset>  - the name of the dataset used.
<encoding> - which of the three encodings was used; see the references for details.
N          - the number of points from the full dataset included in the creation. A value n indicates that points 1 through n were taken from the dataset.
D          - the similarity values were originally all normalized to the range -0.5 through 0.5. The parameter D simulates missing information in the data: a value d indicates that only similarity values s for which |s| > d were considered.

A small parsing sketch is included at the end of this file.

------- CONTACT -------

In case of questions, please check the original paper first; after that you can contact:

Jeremias Berg
email: jeremias.berg@cs.helsinki.fi
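For convenience, below is a minimal Python sketch for decoding instance file names according to the convention above. The exact separators and field order are an assumption reconstructed from this readme, and the example file name in the comments is hypothetical; adjust the pattern if it does not match the actual files.

import os
import re

# Assumed pattern: (Rounded_)<dataset>_<encoding>_<N>_<D>.wcnf
# e.g. "Rounded_protein1_enc1_100_0.2.wcnf" (hypothetical file name).
NAME_RE = re.compile(
    r"^(?P<rounded>Rounded_)?"
    r"(?P<dataset>.+)_"
    r"(?P<encoding>[^_]+)_"
    r"(?P<n>\d+)_"
    r"(?P<d>[0-9.]+)\.wcnf$"
)

def parse_instance_name(path):
    """Split an instance file name into its naming-convention fields."""
    m = NAME_RE.match(os.path.basename(path))
    if m is None:
        raise ValueError("file name does not match the assumed convention: %s" % path)
    return {
        "rounded": m.group("rounded") is not None,  # weights rounded to whole numbers?
        "dataset": m.group("dataset"),              # source dataset
        "encoding": m.group("encoding"),            # which of the three encodings
        "n_points": int(m.group("n")),              # points 1..n taken from the dataset
        "threshold": float(m.group("d")),           # only |s| > d similarities kept
    }

# Example (hypothetical name):
# parse_instance_name("Rounded_protein1_enc1_100_0.2.wcnf")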