Three Concepts: Probability | Projects

Project III: The Legend of Golden Standard

Completed projects

(none yet)

Background

When designing a learning algorithm to construct models from data sets with non-informative prior information (i.e., no preference for the structure in advance) a challenging and interesting task is to evaluate the "quality" of the learning algorithm. There are several possible ways to study the performance of the learning algorithm, and many of the schemes are based on simulating the future prediction tasks by reusing the data available. These approaches have problems of their own and they tend to be complex for cases where one is interested in probabilistic models for the joint distributions. Therefore in many cases in the literature the so-called synthetic or "Golden Standard" approach is used to evaluate the learning algorithm. In this approach one first selects a "true model" (Golden standard) and then generates data stochastically from this model. The quality of the learning algorithm is judged by its ability to reconstruct this model from the generated data. In this project we will study this evaluation approach.

The Task

1. Synthetic data generator

The first task is to write a program that generates data from a discrete Bayesian network. It should take as input a Bayesian network in a format described below produce as output a set of i.i.d. data vectors generated stochastically from this model. The amount of generated data is naturally a parameter for the program. The generated data set should be in tabular format in a text file so that it can be used by B-Course software. The usual general requirements for the software hold: it is reasonably documented and runs in our machine (CSL#2 Linux). All documentation should be presented as a WEB-page.

The network description language is based on the one used by the Hugin tool. The data generator should accept networks written in the grammar below. Since writing a proper parser may be a big effort for those who are not exprerienced parser writers, data generators that cheat by using regular expression matching or something similiar will be tolerated, as long as they are able to parse the Hugin networks exported by B-Course (B-Course refers these as "Hugin Lite files")

ť The grammar

2. Evaluation of the learning algorithm in B-Course

The second task is to study the actual process of evaluating a learning algorithm against the Golden Standard. Although writing a Bayesian network learner on your own would be a useful task, for this project you should use B-Course learning engine for the candidate algorithm to be evaluated. In this case we are particularly interested in

How well the B-Course learner can "recover" the Golden Standard network for Bayesian networks of varying complexity and for data sets of different sizes? Following the dependency modeling aspects underlying B-Course, we are here mainly interested in comparing the structural differences, not parameters.
What is a good metric in comparing the original Bayesian network and the B-Course constructed Bayesian network? Although often one is interested in the actual difference of distributions (e.g., K-L divergence), you should for this project consider graphical measures (Number of missing arcs, extra arcs, their combinations etc.). Observe that in considering the arc directions one has to be careful with the equivalence of the Bayesian networks as many of the arcs can be reversed without changing the distribution.

The test setting should be approximately as follows

Varying the Bayesian network size: 5, 15, 50 nodes
Varying the dependency structure complexity (against the maximum of n*n arcs): 0% (all independent), 10%, 40% of possible structural dependencies
Varying the average node incoming/outgoing degree (i.e., the number of arcs pointing to or from a node): 1, 3, 5
Varying the size of the generated data set: 100, 1000, 10000

One should thus test combinations of these test parameters with several different Bayesian networks. The observations and empirical results should be reported as a WEB-page with discussion on the possible reasons for the differences between the generating Bayesian network and the discovered network structure. Notice that B-Course provides you with a nice graphical representation of the constructed Bayesian network and it would be helpful to show at least some of the Golden Standard networks in graphical format also.

Along with the documentation, the sources for the generator program, and the network descriptions used must be provided.

Automated B-Course Driver

As the above setting requires you to repeat network learning multiple times, it doesn't necessarily make sense to use B-Course by hand. To ease your life we provide you a Python script which takes your sampled data (in normal B-Course format), sends it to B-Course, learns a network (with varying number of iterations) and gives you the learned network as a HUGIN file (bnetwork.net) and a nice PNG graph (bnetwork.png). Here's the driver: auto-bcourse.py .

Usage:
python auto-bcourse.py samples.txt 30

The last number makes script to wait for 30 seconds before returning the net. The longer you wait the better network you'll get, roughly.

Evaluation

The project will be evaluated both for the first part (generator design) and the analysis of the experiments. All reported (documented) bugs found in B-Course will give you extra bonus, but should be reported immediately to Teemu!

Hints

B-Course search is stochastic and thus can produce different networks for different runs. This cannot be totally eliminated but one should let B-Course search engine to search long enough!




Three Concepts: Probability	2007