University of Helsinki, Department of Computer Science

582636 Probabilistic Models (4 cr) / Todennäköisyysmallit (4 op), Spring 2009

Format

There will be several smaller exercises and, at the end of the course, one larger home assignment. The smaller exercises can give 5-10 points altogether, and the larger home assignment about another 10 points; the maximum number of points, including the final exam, is 60 (the exact numbers will be decided later). To earn the points, you need to send your answers to the exercises by email before the session where the exercises are discussed. Send your answers by email to "petri dot myllymaki at cs dot helsinki dot fi" AND to "teemu dot roos at cs dot helsinki dot fi".

  • Next exercise session: Thursday, Feb. 19 at 16:15-18:00 in B222.
  • Topic: Exercises 8-12.
  • Deadline for sending your contributions by email: Thu Feb. 19, 16:15.

Exercises will be discussed first in groups, and non-Finnish speakers are also welcome to attend the exercise sessions (they can form an English-speaking group).

Exercises

N.B. In all the exercises, do not just give the answer, but also show the derivation of how you obtained it.

  1. The apartment rents in Helsinki are modeled using two variables: the Type (public, private, other) and the Rent (low, medium, upper_medium, high). The joint probability distribution P(Type,Rent) is given by

    P(Rent,Type)      public   private   other
    low                0.17      0.01     0.02
    medium             0.44      0.03     0.01
    upper_medium       0.09      0.07     0.01
    high               0.00      0.14     0.01

    1. What is P(Type), the marginal distribution of Type?
    2. What is P(Rent), the marginal distribution of Rent?
    3. What is P(Rent | Type=private)?
    4. What is P(Type | Rent=low)?

  2. Let us consider a coin tossing experiment as a Bernoulli model, and denote heads by "1" and tails by "0". Let the parameter q denote the probability of "1" (and 1-q the probability of "0"). A Python sketch of the Beta-Bernoulli computations used here is given after the exercise list.
    1. We toss the coin 10 times and observe D="1110110101".
      1. What are the maximum likelihood parameters?
      2. What is the likelihood of the data with these parameters?
      3. What is the likelihood of the data assuming that the coin is perfectly fair?
      4. What is the posterior distribution P(q | D), assuming
        1. the uniform prior?
        2. the Jeffreys prior?
      5. In the given order, for each observation X, calculate its predictive probability given the preceding sequence, i.e., calculate P(X=1), P(X=1 | D={1}), P(X=1 | D={11}), P(X=0 | D={111}), P(X=1 | D={1110}), and so on (altogether, 10 probabilities), by using
        1. The maximum likelihood parameters.
        2. Bayesian inference (all parameters) with the uniform prior.
        Calculate also the product of these 10 predictive probabilities.
    2. We toss the coin 10 times and observe D="1111111000".
      1. What are the maximum likelihood parameters?
      2. What is the likelihood of the data with these parameters?
      3. What is the likelihood of the data assuming that the coin is perfectly fair?
      4. What is the posterior distribution P(q | D), assuming
        1. the uniform prior?
        2. the Jeffreys prior?
      5. As in a., in the given order, for each observation X, calculate its predictive probability given the preceding sequence (altogether, 10 probabilities), by using
        1. The maximum likelihood parameters.
        2. Bayesian inference (all parameters) with the uniform prior.
        Calculate also the product of these 10 predictive probabilities.

  3. Let us consider rolling a 6-sided die as a multinomial model (i.i.d. multi-valued Bernoulli). We roll the die 60 times, and observe data D with the following counts for the sides (a Python sketch of the corresponding Dirichlet computations is given after the exercise list):

    X=I:    10 times
    X=II:    8 times
    X=III:  14 times
    X=IV:    5 times
    X=V:    12 times
    X=VI:   11 times

    1. What are the maximum likelihood parameters, given this data?
    2. What is the posterior distribution P(q1,q2,...,q6| D), assuming
      1. the uniform prior (Dir(1,1,1,1,1,1))?
      2. the Jeffreys prior (Dir(0.5,0.5,0.5,0.5,0.5,0.5))?
    3. What is P(X | D), the predictive distribution for the next result, given D (give all 6 probabilities), by using Bayesian inference with the uniform prior?
    4. Give an example of a Dirichlet prior distribution so that P(X=I | D) = P(X=IV | D) = 0.20.
  4. Show by using the d-separation criterion that a node in a Bayesian network is conditionally independent of all the other nodes, given its Markov blanket.
  5. List all the marginal and conditional pairwise independencies represented by the following DAG:

    [The DAG figure is not included in this text version; the conditional probabilities in Exercise 6 imply the structure A->C, B->C, C->D.]

  6. Given the above Bayesian network, suppose the variables are binary and the conditional probabilities are as follows (a brute-force enumeration sketch in Python is given after the exercise list):

    P(A=1)= 0.8
    P(B=1)= 0.1
    P(C=1 | A=0,B=0) = 0.1
    P(C=1 | A=1,B=0) = 0.6
    P(C=1 | A=0,B=1) = 0.7
    P(C=1 | A=1,B=1) = 0.9
    P(D=1 | C=1) = 0.9
    P(D=1 | C=0) = 0.3

    1. Construct the joint probability table of the distribution represented by this model.
    2. What is max P(A,B,C,D), i.e., the probability of the most probable joint configuration of (A,B,C,D)?
    3. Compute the marginal probabilities P(A), P(B), P(C) and P(D).
    4. Compute the following conditional probabilities:
      P(A=1 | D=1)
      P(B=1 | D=1)
      P(A=1 | D=0)
      P(B=1 | D=0)
      P(A=1 | C=1)
      P(B=1 | C=1)
      P(A=0 | C=1)
      P(B=0 | C=1)
      P(C=1 | D=0)
      P(C=1 | A=1)
      P(D=1 | A=1)
      P(A=1 | C=1, D=1)
      P(B=1 | C=1, D=1)
      P(A=1 | C=1, D=0)
      P(B=1 | C=1, D=0)
      P(A=1 | C=1, B=1)
      P(A=1 | C=1, B=0)
      P(A=1 | B=1)
      P(A=1 | B=0).
  7. List all 3-node Bayesian networks, and partition them into equivalence classes. For the notation, let us call the nodes A,B and C, and let AB mean that there is a directed arc from A to B. For example, {} is an empty network with no arcs, and {AB,AC,BC} is a fully connected network with arcs A->B, A->C and B->C.
  8. Location estimation with an HMM (this is a somewhat challenging task, but it gives more points, and you have 2 WEEKS to submit your solution, i.e. the deadline is Feb 19th). A Python sketch of the forward filtering and Viterbi computations is given after the exercise list.

    Assume that we have 4 possible locations labeled A, B, C and D. Our observations consist of measurements of signals X and Y with possible values {low,medium,high}. We model this domain by a first-order HMM where the location is the hidden variable, and X and Y are assumed to be independent given the location. The transition probabilities between the hidden state L(t) at time t and the hidden state L(t+1) at time t+1 are as follows:

    P(L(t+1)=A | L(t) = A) = 1/3
    P(L(t+1)=B | L(t) = A) = 1/3
    P(L(t+1)=C | L(t) = A) = 1/3

    P(L(t+1)=B | L(t) = B) = 1/3
    P(L(t+1)=A | L(t) = B) = 1/3
    P(L(t+1)=D | L(t) = B) = 1/3

    P(L(t+1)=C | L(t) = C) = 1/3
    P(L(t+1)=A | L(t) = C) = 1/3
    P(L(t+1)=D | L(t) = C) = 1/3

    P(L(t+1)=D | L(t) = D) = 1/3
    P(L(t+1)=B | L(t) = D) = 1/3
    P(L(t+1)=C | L(t) = D) = 1/3

    The conditional probabilities of X and Y are as follows:
    P(X=low | L=A) = 0.1
    P(X=medium | L=A) = 0.2
    P(X=high | L=A) = 0.7
    P(X=low | L=B) = 0.3
    P(X=medium | L=B) = 0.4
    P(X=high | L=B) = 0.3
    P(X=low | L=C) = 0.3
    P(X=medium | L=C) = 0.4
    P(X=high | L=C) = 0.3
    P(X=low | L=D) = 0.7
    P(X=medium | L=D) = 0.2
    P(X=high | L=D) = 0.1

    P(Y=low | L=A) = 0.7
    P(Y=medium | L=A) = 0.2
    P(Y=high | L=A) = 0.1
    P(Y=low | L=B) = 0.3
    P(Y=medium | L=B) = 0.4
    P(Y=high | L=B) = 0.3
    P(Y=low | L=C) = 0.2
    P(Y=medium | L=C) = 0.3
    P(Y=high | L=C) = 0.5
    P(Y=low | L=D) = 0.1
    P(Y=medium | L=D) = 0.2
    P(Y=high | L=D) = 0.7

    We make the following observations (X,Y) during 6 time steps: O(1)=(high,low), O(2)=(medium,medium), O(3)=(low,high), O(4)=(medium,medium), O(5)=(high,medium), O(6)=(medium,low). Let us assume that a priori (before making the observations) all states are equally probable, i.e., P(L(0)=A) = P(L(0)=B) = P(L(0)=C) = P(L(0)=D) = 1/4.

    1. Compute the predictive distribution over the hidden states after each observation, i.e., P(L(1) |O(1)), P(L(2) |O(1),O(2)),..., and P(L(6) |O(1),...O(6)).
    2. Use the Viterbi algorithm to compute the most probable path between the hidden states L(1)->L(2)->...->L(6). What is the path and what is its probability?
  9. Let us consider the following Naive Bayes classifier: the hidden node H has three possible values 1, 2 and 3, and the four predictors A, B, C and D are binary (N.B. the class variable here is "H", not "C"). The parameters are as follows (a Python sketch of these computations is given after the exercise list):

    P(H=1) = 0.7
    P(H=2) = 0.2
    P(H=3) = 0.1
    P(A=1 | H=1) = 0.9
    P(B=1 | H=1) = 0.1
    P(C=1 | H=1) = 0.1
    P(D=1 | H=1) = 0.6
    P(A=1 | H=2) = 0.2
    P(B=1 | H=2) = 0.9
    P(C=1 | H=2) = 0.7
    P(D=1 | H=2) = 0.3
    P(A=1 | H=3) = 0.4
    P(B=1 | H=3) = 0.6
    P(C=1 | H=3) = 0.3
    P(D=1 | H=3) = 0.95

    1. Compute the classification distribution P(H | A,B,C,D) for each of the 16 cases (A,B,C,D).
    2. Compute the probability P(A,B,C,D) for each of the 16 cases (by marginalizing over H).
  10. Take the 10 most probable vectors from the list produced in part b. of the previous exercise, and assign each of them to its most probable class to form your training data set D. Now, based on this data, learn a Naive Bayes classifier by
    1. Using the maximum likelihood parameters.
    2. Using the expected parameters with the BDeu prior and an equivalent sample size of 1.
  11. Using the two Naive Bayes classifiers (ML parameters/expected BDeu parameters) learned in the previous exercise:
    1. Compute the classification distribution P(H | A,B,C,D) for each of the 16 cases (A,B,C,D).
    2. Compute the 0/1 loss on the "test set" consisting of the 6 cases not found in the training set D, assuming that the correct class in each case is the one maximizing the classification probability given by the original generating NB classifier (Exercise 9.a.).
    3. Compute the logarithmic loss on the same "test set" of 6 cases not in the training set D, again assuming that the correct class in each case is the one maximizing the classification probability given by the original generating NB classifier (Exercise 9.a.).
  12. (N.B. Item a. below was slightly modified on Friday morning, Feb. 13th). Still using the same training data set D of size 10:
    1. What is the marginal likelihood with the Naive Bayes structure (with BDeu priors, equivalent sample size 1)? Calculate it in 3 different ways to make sure they all produce the same result: using the "gamma-formula" directly, and using the predictive distribution sequentially with two different orderings of the data (e.g. take any ordering first and then its reverse). A Python sketch of the gamma-formula is given after the exercise list.
    2. What is the marginal likelihood with the empty graph (with BDeu priors, equivalent sample size 1)?
    3. What should be the ratio of the prior probabilities of these two structures to make their posterior probabilities P(M |D) equal?
    4. Can you find a Bayesian network structure producing a higher marginal likelihood than either of the two models above?
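
Code sketches for selected exercises

A minimal Python sketch related to Exercise 2, assuming the standard Beta-Bernoulli conjugate formulas: with a Beta(a,b) prior on q, observing h ones and t zeros gives the posterior Beta(a+h, b+t), and the predictive probability of the next "1" is (a+h)/(a+b+h+t). The function names below are illustrative only.

    def sequential_predictions(data, a=1.0, b=1.0):
        """Predictive probability of each observation given the preceding ones,
        under a Beta(a, b) prior (a = b = 1: uniform; a = b = 0.5: Jeffreys)."""
        preds = []
        h = t = 0
        for x in data:
            p_one = (a + h) / (a + b + h + t)       # P(next = 1 | observations so far)
            preds.append(p_one if x == 1 else 1.0 - p_one)
            if x == 1:
                h += 1
            else:
                t += 1
        return preds

    def ml_sequential_predictions(data):
        """Same, but plugging in the maximum likelihood estimate h/(h+t) at each step.
        The ML estimate is undefined before the first observation; 0.5 is used here
        only as a placeholder convention."""
        preds = []
        h = t = 0
        for x in data:
            n = h + t
            p_one = h / n if n > 0 else 0.5
            preds.append(p_one if x == 1 else 1.0 - p_one)
            if x == 1:
                h += 1
            else:
                t += 1
        return preds

    if __name__ == "__main__":
        D = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1]          # the data of Exercise 2.1
        preds = sequential_predictions(D)           # uniform prior
        product = 1.0
        for p in preds:
            product *= p
        print(preds)
        print("product of the predictive probabilities:", product)

For the Bayesian predictions, the product of the 10 predictive probabilities equals the marginal likelihood of the whole sequence, which is why it does not depend on the order of the observations.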
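
A similar sketch related to Exercise 3, assuming the Dirichlet-multinomial conjugate formulas: with a symmetric Dirichlet(alpha,...,alpha) prior and counts n_1,...,n_6, the posterior is Dirichlet(alpha+n_1,...,alpha+n_6) and the predictive probability of side k is (alpha+n_k)/(60+6*alpha).

    counts = {"I": 10, "II": 8, "III": 14, "IV": 5, "V": 12, "VI": 11}

    def ml_estimate(counts):
        """Maximum likelihood parameters: relative frequencies."""
        total = sum(counts.values())
        return {k: n / total for k, n in counts.items()}

    def predictive(counts, alpha):
        """P(X = k | D) under a symmetric Dirichlet(alpha, ..., alpha) prior."""
        total = sum(counts.values()) + alpha * len(counts)
        return {k: (n + alpha) / total for k, n in counts.items()}

    print("ML:", ml_estimate(counts))
    print("uniform prior Dir(1,...,1):", predictive(counts, 1.0))
    print("Jeffreys prior Dir(0.5,...,0.5):", predictive(counts, 0.5))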
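
A brute-force enumeration sketch related to Exercise 6, assuming the structure implied by the given conditional probabilities (arcs A->C, B->C and C->D), so that P(A,B,C,D) = P(A) P(B) P(C|A,B) P(D|C); every marginal and conditional probability asked for can then be read off the 16-row joint table. The helper functions are illustrative only.

    from itertools import product

    p_a1 = 0.8
    p_b1 = 0.1
    p_c1 = {(0, 0): 0.1, (1, 0): 0.6, (0, 1): 0.7, (1, 1): 0.9}   # P(C=1 | A, B)
    p_d1 = {0: 0.3, 1: 0.9}                                       # P(D=1 | C)

    def joint(a, b, c, d):
        pa = p_a1 if a else 1 - p_a1
        pb = p_b1 if b else 1 - p_b1
        pc = p_c1[(a, b)] if c else 1 - p_c1[(a, b)]
        pd = p_d1[c] if d else 1 - p_d1[c]
        return pa * pb * pc * pd

    table = {abcd: joint(*abcd) for abcd in product([0, 1], repeat=4)}
    names = ("A", "B", "C", "D")

    def prob(**evidence):
        """Sum the joint table over the configurations consistent with the evidence."""
        return sum(p for cfg, p in table.items()
                   if all(cfg[names.index(k)] == v for k, v in evidence.items()))

    def cond(target, given):
        """P(target | given); both arguments are dicts such as {"A": 1}."""
        return prob(**{**target, **given}) / prob(**given)

    print("sum of the joint table:", sum(table.values()))   # sanity check: should be 1
    print("max P(A,B,C,D):", max(table.values()))
    print("P(A=1 | D=1):", cond({"A": 1}, {"D": 1}))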
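
A sketch related to Exercise 8 of forward filtering (for the state distributions after each observation) and of the Viterbi algorithm, assuming that all transitions not listed in the exercise have probability 0. The value returned by the Viterbi function is the joint probability of the path and the observations; divide by the probability of the observations if the conditional probability of the path is wanted.

    states = ["A", "B", "C", "D"]
    prior = {s: 0.25 for s in states}                  # P(L(0))

    # transition[r][s] = P(L(t+1)=s | L(t)=r)
    transition = {
        "A": {"A": 1/3, "B": 1/3, "C": 1/3, "D": 0.0},
        "B": {"A": 1/3, "B": 1/3, "C": 0.0, "D": 1/3},
        "C": {"A": 1/3, "B": 0.0, "C": 1/3, "D": 1/3},
        "D": {"A": 0.0, "B": 1/3, "C": 1/3, "D": 1/3},
    }
    emit_x = {"A": {"low": 0.1, "medium": 0.2, "high": 0.7},
              "B": {"low": 0.3, "medium": 0.4, "high": 0.3},
              "C": {"low": 0.3, "medium": 0.4, "high": 0.3},
              "D": {"low": 0.7, "medium": 0.2, "high": 0.1}}
    emit_y = {"A": {"low": 0.7, "medium": 0.2, "high": 0.1},
              "B": {"low": 0.3, "medium": 0.4, "high": 0.3},
              "C": {"low": 0.2, "medium": 0.3, "high": 0.5},
              "D": {"low": 0.1, "medium": 0.2, "high": 0.7}}
    observations = [("high", "low"), ("medium", "medium"), ("low", "high"),
                    ("medium", "medium"), ("high", "medium"), ("medium", "low")]

    def emission(s, obs):
        x, y = obs
        return emit_x[s][x] * emit_y[s][y]             # X and Y independent given the location

    def filtering(observations):
        """P(L(t) | O(1),...,O(t)) for t = 1, ..., T."""
        belief = dict(prior)
        results = []
        for obs in observations:
            # predict one step ahead, then condition on the observation and normalize
            predicted = {s: sum(belief[r] * transition[r][s] for r in states) for s in states}
            unnorm = {s: predicted[s] * emission(s, obs) for s in states}
            z = sum(unnorm.values())
            belief = {s: unnorm[s] / z for s in states}
            results.append(belief)
        return results

    def viterbi(observations):
        """Most probable path L(1) -> ... -> L(T) and its joint probability with the observations."""
        init = {s: sum(prior[r] * transition[r][s] for r in states) for s in states}
        delta = {s: init[s] * emission(s, observations[0]) for s in states}
        backpointers = []
        for obs in observations[1:]:
            new_delta, pointers = {}, {}
            for s in states:
                best = max(states, key=lambda r: delta[r] * transition[r][s])
                new_delta[s] = delta[best] * transition[best][s] * emission(s, obs)
                pointers[s] = best
            backpointers.append(pointers)
            delta = new_delta
        last = max(states, key=lambda s: delta[s])
        path = [last]
        for pointers in reversed(backpointers):
            path.append(pointers[path[-1]])
        return list(reversed(path)), delta[last]

    for t, belief in enumerate(filtering(observations), start=1):
        print("P(L(%d) | O(1..%d)) =" % (t, t), belief)
    print(viterbi(observations))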
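
A sketch related to Exercise 9, assuming the usual Naive Bayes factorization P(H,A,B,C,D) = P(H) P(A|H) P(B|H) P(C|H) P(D|H); normalizing over H gives the classification distribution, and the normalizing constant is the marginal probability of the case.

    from itertools import product

    p_h = {1: 0.7, 2: 0.2, 3: 0.1}
    p_x1 = {"A": {1: 0.9, 2: 0.2, 3: 0.4},     # p_x1[V][h] = P(V=1 | H=h)
            "B": {1: 0.1, 2: 0.9, 3: 0.6},
            "C": {1: 0.1, 2: 0.7, 3: 0.3},
            "D": {1: 0.6, 2: 0.3, 3: 0.95}}

    def likelihood(case, h):
        """P(A,B,C,D | H = h) for case = (a, b, c, d)."""
        p = 1.0
        for var, value in zip("ABCD", case):
            p1 = p_x1[var][h]
            p *= p1 if value == 1 else 1.0 - p1
        return p

    for case in product([0, 1], repeat=4):
        unnorm = {h: p_h[h] * likelihood(case, h) for h in p_h}
        marginal = sum(unnorm.values())                        # P(A,B,C,D)
        posterior = {h: unnorm[h] / marginal for h in unnorm}  # P(H | A,B,C,D)
        print(case, round(marginal, 4), {h: round(p, 3) for h, p in posterior.items()})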
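
A sketch related to Exercise 12 of the "gamma-formula" for the BDeu marginal likelihood of a single node: with equivalent sample size s, q parent configurations and r values, the hyperparameters are alpha_ijk = s/(q*r) and alpha_ij = s/q, and the node's term is the product over parent configurations j of Gamma(alpha_ij)/Gamma(alpha_ij+N_ij) times the product over values k of Gamma(alpha_ijk+N_ijk)/Gamma(alpha_ijk); the marginal likelihood of a network is the product of its node terms. The example counts below are hypothetical, not the Exercise 10 training data.

    from math import lgamma, exp

    def node_log_marginal_likelihood(counts_per_parent_config, r, ess=1.0):
        """Log BDeu marginal likelihood term of one node.

        counts_per_parent_config: one count vector of length r per parent
        configuration (for a node with no parents, a single vector)."""
        q = len(counts_per_parent_config)
        alpha_ijk = ess / (q * r)
        alpha_ij = ess / q
        log_ml = 0.0
        for counts in counts_per_parent_config:
            n_ij = sum(counts)
            log_ml += lgamma(alpha_ij) - lgamma(alpha_ij + n_ij)
            for n_ijk in counts:
                log_ml += lgamma(alpha_ijk + n_ijk) - lgamma(alpha_ijk)
        return log_ml

    # Hypothetical example: a binary node (r=2) with one ternary parent (q=3).
    example_counts = [[3, 1], [0, 2], [2, 2]]
    print(exp(node_log_marginal_likelihood(example_counts, r=2, ess=1.0)))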

Home assignment

Background
When designing a learning algorithm that constructs models from data sets under non-informative prior information (i.e., no preference for any structure in advance), a challenging and interesting task is to evaluate the "quality" of the learning algorithm. There are several possible ways to study the performance of a learning algorithm, and many of the schemes are based on simulating future prediction tasks by reusing the available data. These approaches have problems of their own, and they tend to be complex when one is interested in probabilistic models of joint distributions. Therefore, the literature often uses the so-called synthetic or "Golden Standard" approach: one first selects a "true model" (the Golden Standard) and then generates data stochastically from this model. The quality of the learning algorithm is judged by its ability to reconstruct this model from the generated data. In this project we will study this Golden Standard evaluation approach.

The Task
First you need to write a program that generates data from a discrete Bayesian network (a minimal sampling sketch is given after this list of questions). The generated data set should be in tabular format in a text file so that it can be used by the B-Course software (if you plan to use it, see below). Your task is then to study the actual process of evaluating a learning algorithm against the Golden Standard. Although writing a Bayesian network learner of your own would be a useful exercise, for this project you can use the B-Course learning engine as the candidate algorithm to be evaluated. In this case we are particularly interested in the following questions:
  • How well can the B-Course learner "recover" the Golden Standard network for Bayesian networks of varying complexity and for data sets of different sizes? Following the dependency modeling aspects underlying B-Course, we are here interested in comparing the structural differences only, not parameters.
  • What is a good metric for comparing the original Bayesian network with the Bayesian network constructed by B-Course? Although one is often interested in the actual difference between the distributions (e.g., K-L divergence), for this project you should consider graphical measures (the number of missing arcs, extra arcs, combinations of these, etc.). Observe that when considering arc directions one has to be careful with the equivalence of Bayesian networks, as many of the arcs can be reversed without changing the distribution. In this project you are free to choose the metric used, as long as it is properly explained; a sketch of one such comparison is given below, after the test setting.
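
Below is a minimal sketch of forward (ancestral) sampling from a discrete Bayesian network, under assumed data structures; the example network, its probabilities, and the tab-separated output format are only illustrative, so check the exact input format that B-Course expects before relying on the generated file.

    import random

    # Hypothetical example network: variables in a topological order, the parents of
    # each variable, and a CPT mapping each parent-value combination to a distribution
    # over the variable's own values.
    network = {
        "order":   ["A", "B", "C"],
        "parents": {"A": [], "B": ["A"], "C": ["A", "B"]},
        "values":  {"A": [0, 1], "B": [0, 1], "C": [0, 1]},
        "cpt": {
            "A": {(): [0.3, 0.7]},
            "B": {(0,): [0.8, 0.2], (1,): [0.4, 0.6]},
            "C": {(0, 0): [0.9, 0.1], (0, 1): [0.5, 0.5],
                  (1, 0): [0.6, 0.4], (1, 1): [0.1, 0.9]},
        },
    }

    def sample_one(net, rng):
        """Sample each variable in topological order, given its already sampled parents."""
        sample = {}
        for var in net["order"]:
            parent_values = tuple(sample[p] for p in net["parents"][var])
            dist = net["cpt"][var][parent_values]
            sample[var] = rng.choices(net["values"][var], weights=dist, k=1)[0]
        return sample

    def write_data(net, n, path, seed=0):
        """Write n sampled cases to a tab-separated text file, one case per row."""
        rng = random.Random(seed)
        with open(path, "w") as f:
            f.write("\t".join(net["order"]) + "\n")
            for _ in range(n):
                s = sample_one(net, rng)
                f.write("\t".join(str(s[v]) for v in net["order"]) + "\n")

    write_data(network, 1000, "samples.txt")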

The test setting should be approximately as follows:

  • Vary the size of the data generating Bayesian network: e.g., 5, 15, 50 nodes
  • Vary the complexity of the dependency structure (relative to the maximum of n(n-1)/2 arcs in a DAG over n nodes): e.g., 0% (all variables independent), 10%, 40% of the possible structural dependencies
  • Vary the average node incoming/outgoing degree (i.e., the number of arcs pointing to or from a node): e.g., 1, 3, 5
  • Vary the size of the generated data set: e.g., 100, 1000, 10000

To get maximum points, one should test combinations of these test parameters with several different Bayesian networks.
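
Below is a minimal sketch of one possible graphical comparison between the generating ("Golden Standard") structure and a learned structure: counts of missing, extra and reversed arcs, with both graphs given as sets of (parent, child) pairs. Whether a reversed arc should count as an error depends on whether the reversal changes the equivalence class, so treat this only as a starting point for a metric of your own.

    def compare_structures(true_arcs, learned_arcs):
        """Count arcs missing from, added by, or reversed in the learned structure."""
        missing = {a for a in true_arcs
                   if a not in learned_arcs and (a[1], a[0]) not in learned_arcs}
        extra = {a for a in learned_arcs
                 if a not in true_arcs and (a[1], a[0]) not in true_arcs}
        reversed_arcs = {a for a in true_arcs if (a[1], a[0]) in learned_arcs}
        return {"missing": len(missing), "extra": len(extra), "reversed": len(reversed_arcs)}

    print(compare_structures({("A", "C"), ("B", "C"), ("C", "D")},
                             {("A", "C"), ("D", "C"), ("A", "B")}))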

Automated B-Course Driver
As the above setting requires you to repeat the network learning multiple times, it doesn't necessarily make sense to use B-Course by hand. To ease your life we provide you with a Python script which takes your sampled data (in the normal B-Course format), sends it to B-Course, learns a network (with a varying number of iterations) and gives you the learned network as a HUGIN file (bnetwork.net) and a nice PNG graph (bnetwork.png). (HUGIN Lite is a free tool for working with Bayesian networks. You may download it and use it if you find it useful, but this is not required.)

Here's the driver: auto-bcourse.py .
Usage: python auto-bcourse.py samples.txt 30

The last number makes the script wait for 30 seconds before returning the net. The longer you wait, the better the network you will (normally) get. The B-Course search is stochastic and can thus produce different networks on different runs; this cannot be totally eliminated, so you should let the B-Course search engine search long enough!

Deliverables
The description of your setup, the empirical results and your conclusions should be reported as a PDF document, with a discussion of the possible reasons for the differences between the generating Bayesian network and the discovered network structure. Notice that since B-Course provides you with nice graphical pictures of the constructed Bayesian networks, it would be helpful to also show at least some of the Golden Standard networks in graphical form. There is no minimum (or maximum) page limit for your document, but naturally the more comprehensive and/or insightful it is, the more points you will earn.

In addition to the documentation of the experimental setup and the results, the source code for the programs used must also be provided. All reported (documented) bugs found in B-Course will earn you extra bonus points.

Timetable
The deadline for the deliverables is Tuesday, March 3, at 8:00am. Deliver by email to both Petri and Teemu. The deadline is strict: late submissions will not be considered, and extensions will not be granted.

Solutions

  • Exercises 1-3.
    N.B. There was an error in the lecture notes on page 32 in the formula for the posterior of the Bernoulli model, which is now corrected.
  • Exercises 4-7.


Petri Myllymäki