# Probabilistic Factor Analysis Methods

Joseph H. Sakaya & Suleiman A. Khan

15 - 19 May, 2017

This tutorial presents an overview of probabilistic factor analysis I cannot conceal the fact here that in the specific application of these rules, I foresee many things happening which can cause one to be badly mistaken if he does not proceed cautiously. James Bernoulli methods as used in practice. It uses the Stan programming language for model specification and inference. We will study the Bayesian variants of four models viz. PCA, CCA, GFA and MTF in addition to topic modelling. After corroborating these models on artificial data, we will apply them on real world data sets and examine their behaviour in terms of interpretability.

## Getting Started

Stan is a Turing-complete probabilistic programming language used for performing statistical inference of Bayesian models. A minimal Stan model is defined by program blocks: data, parameters, transformed parameters and model. Of these, all except the model block are optional. Consider, for instance, inferring a Gaussian with an unknown mean and variance, defined as \begin{align} \mu &\sim \mathcal{N}(0,10)\\ x &\sim \mathcal{N}(\mu, \sigma^2) \end{align} This is easily represented in Stan as follows,

            gaussian <- "
data {
int<lower=1> n;
vector[n] x;
}
parameters {
real mu;
real<lower=0> sigma;
}
model {
mu ~ normal(0, 10);
x ~ normal(mu, sigma);
} "

library("rstan")
n <- 1000
x <- rnorm(n, 5, 10)
data <- list(x=x, n=n)
m <- stan_model(model_code = gaussian)
samples <- sampling(m, data=data, iter=10000, chains=1)
mu <- mean(extract(samples)$mu) sigma <- mean(extract(samples)$sigma)


The structure of this snippet is simple. First, generate or load existing data; then, initialise and indicate the parameters to monitor; lastly, run the inference. Stan offers several inference alternatives including efficient sampling strategies such as Hamiltonian Monte Carlo (HMC) as well as optimisation-based inference like variational Bayes (no need to bother with the details though). Note that we have not specified any prior for sigma . Stan assumes uniform prior over such parameters by default, rendering the model non-conjugate. Stan can perform inference on non-conjugate models. This allows you to pay more attention to modelling your data faithfully rather than fretting over the ugly details of inference that such non-conjugate models otherwise entail.

In order to familiarise yourself with Stan spend some time with part two, II Stan Modeling Language, of the Stan reference. Most constructs in Stan are similar to programming languages you already know — you need to spend just enough time to be conversant in:

1. Primitive data types such as int and real.
2. Linear algebra data types such as vector and row_vector and matrix.
3. Placing upper and lower constraints on the data types.
4. Sampling statements such as y ~ normal(mu, sigma) or
y ~ gamma(alpha, beta).
5. for loops and traversal order of linear algebra data types.
6. Rewriting code in vectorized form.
7. What the various blocks mean as well as the use of optional blocks such as the transformed parameters block.
8. Marginalisation of discrete parameters. Read chapter 12. Finite Mixture from the Stan reference.

If your answer to the above is in the affirmative, feel free to handle the questions. We provide starter code only in R (although we promise help with Python as well). Feel free to ask for help anytime.

### Using R and Python in the Cubbli machines

R users can simply use the terminal or RStudio. The package rstan comes already installed. Use require(rstan) to import the library.

Python users can use pystan installed in the virtual environment:

source /cs/group/r-course/pystan/bin/activate  Alternatively, you're free to install your own versions of rstan and pystan. ### Marking scheme The exercises sessions held from May 15 to May 18 carry a total of 10 points. Each session gives you 2.5 points. ### Questions Implement the following models with Stan. Use vectorized code and the transformed parameters block where appropriate. You will need at least 10 of 15 points to complete today's exercise. 1. Beta-Bernoulli model [½ pt] \begin{align} y &\sim \text{Bern}(\theta)\\ \theta &\sim \text{Beta}(\alpha, \beta) \end{align} Starter code and solution. 2. Beta-Binomial model [½ pt] \begin{align} y &\sim \text{Bin}(N, \theta) \\ \theta &\sim \text{Beta}(\alpha, \beta) \end{align} Starter code and solution. 3. Poisson-Gamma model [½ pt] \begin{align} y &\sim \text{Poisson}(\lambda) \\ \lambda &\sim \text{Gamma}(\alpha_0, \beta_0) \\ \end{align} Starter code and solution. 4. Normal with location [½ pt] \begin{align} y &\sim \mathcal{N}(\mu, \sigma_0^2) \\ \mu &\sim \mathcal{N}(\mu_0, \sigma_0^2) \end{align} Starter code and solution. 5. Normal with location and scale [2 pts] \begin{align} y &\sim \mathcal{N}(\mu, \sigma^2) \\ \mu &\sim \mathcal{N}(\mu_0, \sigma_0^2) \\ \text{and, }\sigma^2 &\sim \text{InvGam}(\alpha_0, \beta_0) \\ \text{or, }\sigma &\sim \text{logNormal}(\mu_0, \sigma_0^2)\\ \text{or, }\sigma &\sim \text{Cauchy}(\mu_0, \sigma_0) \\ \end{align} Starter code and solution. 6. Multivariate normal with location and scale [2 pts] \begin{align} \mathbf y &\sim \mathcal{N}(\boldsymbol\mu, \sigma^2 \mathbf I) \\ \boldsymbol\mu &\sim \mathcal{N}(\boldsymbol\mu_0, \sigma_0^2 \mathbf{I}) \\ \sigma^2 &\sim \text{InvGam}(\alpha_0, \beta_0) \\ \end{align} Starter code and solution. 7. Bayesian linear regression [2 pts] \begin{align} \tau &\sim \text{Gam}(\alpha_0, \beta_0)\\ \sigma^2 &\sim \text{InvGam}(\alpha_0, \beta_0) \\ \mathbf w &\sim \mathcal{N}(0, \tau^{-1} \mathbf I)\\ y &\sim \mathcal{N}(\mathbf w^{\top} \mathbf x, \sigma^2)\\ K &= \text{number of dimensions} \end{align} Starter code and solution. 8. Bayesian linear regression with ARD prior [2 pts] \begin{align} \alpha_{i=1\ldots K} &\sim \text{Gam}(\alpha_0, \beta_0)\\ \sigma^2 &\sim \text{InvGam}(\alpha_0, \beta_0) \\ \mathbf w &\sim \mathcal{N}(0, \boldsymbol \alpha^{-1} \mathbf I)\\ y &\sim \mathcal{N}(\mathbf w^{\top} \mathbf x, \sigma^2)\\ K &= \text{number of dimensions} \end{align} Starter code and solution. 9. Bayesian logistic regression with ARD prior [2 pts] \begin{align} \alpha_{i=1\ldots K} &\sim \text{Gam}(\alpha_0, \beta_0)\\ \mathbf w &\sim \mathcal{N}(0, \boldsymbol \alpha^{-1} \mathbf I)\\ y &\sim \text{Bern}(\sigma(\mathbf w^{\top} \mathbf x))\\ K &= \text{number of dimensions} \end{align} Starter code and solution. Take a moment to appreciate the fact that you can solve in a trivial piece of code what once constituted a paper. 10. Normal mixture with isotropic covariance and marginalisation [3 pts] \begin{align} \mathbf{\pi} & \sim \operatorname{Dir}(\alpha_0) \\ \mathbf{\sigma}^2_{i=1 \dots K} & \sim \text{InvGam}(\alpha_0, \beta_0) \\ \boldsymbol{\mu}_{i=1 \dots K} & \sim \mathcal{N}(\boldsymbol{\mu}_0, \sigma_0^2 \mathbf I) \\ \mathbf{z}_{i = 1 \dots N} & \sim \operatorname{Cat}(\mathbf{\pi}) \\ \mathbf{x}_{i=1 \dots N} & \sim \mathcal{N}(\boldsymbol{\mu}_{z_i}, {\sigma^2_{z_i}} \mathbf I) \\ K &= \text{number of mixing components} \\ N &= \text{number of data points} \end{align} Starter code and solution. Modify your code to replace the isotropic covariance with diagonal covariance. ## Principal Component Analysis Matrix factorisation is the decomposition of a matrix into a product of two or more matrices, with a mind to capture the interaction between two entities in terms of lower-dimensional latent features. It involves modelling correlations in higher-dimensional data as a lower-dimensional, oriented subspace. Consider a matrix\mathbf{X} \in \Re^{n \times d}$generated by a linear transformation of a zero-mean, unit-variance Gaussian matrix$\mathbf{Z} \in \Re^{n \times d}$and some additional zero mean diagonal covariance Gaussian noise$\boldsymbol{\epsilon}$, where$k \leq d$, $$\mathbf{x}_n = \mathbf{W}\mathbf{z}_n + \boldsymbol{\epsilon}_n$$ $$\mathbf{z}_n \sim \mathcal{N}(0, \mathbf{I}_k)\qquad \boldsymbol \epsilon_n \sim \mathcal{N}(0, \mathbf{\Psi})$$ where$\mathbf{W} \in \Re^{k \times d}$constitutes the linear transformation, also known as the factor loadings or the projection matrix. Marginalising over$\mathbf{z}_n$, it is not hard to see that$\mathbf{x}_n \sim \mathcal{N}(0,\mathbf{WW}^\top + \mathbf{\Psi})$. By constraining the noise to be a diagonal covariance matrix,$\mathbf{x}_n$becomes conditionally independent for a given$\mathbf{z}_n$. This results in the$\boldsymbol \epsilon_n$capturing the independent variance unique to a axis, while$\mathbf{W}$captures the rest of the correlations presumed to be of interest. The components of the resulting latent vectors$\mathbf{z}_n$are uncorrelated. Factor analysis differs from traditional PCA in that it differentiates between the variance and the correlations. This can be rectified by replacing$\mathbf \Psi$with an isotropic covariance, in which case the model simply reduces to PCA. Thus, PCA can be easily re-interpreted as a linear-Gaussian latent variable model. However, a fully Bayesian treatment of PCA involves introducing priors over$\mathbf{W}$and$\boldsymbol{\alpha}$. Figure 1: Hinton diagrams of the matrix$W$in which each element is depicted as a square (blue for negative, red for positive) with its intensity proportional to the magnitude of that element. It is seen that the extraneous dimensions are effectively set to zero and the model accurately deduces the true number of components to be$5. ### Optional reading Bishop's paper on Bayesian PCA provides an excellent description of the model. ### Questions To claim 2.5 points today, you have to implement Bayesian PCA with Stan and test it on a gene expression measurement dataset. The generative model for Bayesian PCA with ARD prior is given as follows: \begin{align} \mathbf{x}_n &\sim \mathcal{N}(\mathbf W\mathbf z_n, \tau^{-1} \mathbf I_D)\\ \mathbf{W}_k|\alpha_k &\sim \mathcal{N}(0, \alpha_k^{-1} \mathbf I_D)\\ \alpha_k &\sim \text{Gam}(\alpha_0, \beta_0)\\ \tau &\sim \text{Gam}(\alpha_0, \beta_0)\\ \mathbf z &\sim \mathcal{N}(0, \mathbf I_K)\\ N &= \text{number of data points}\\ D &= \text{dimension of data point} \\ K &= \text{latent dimension} \\ \end{align} 1. Write the Stan snippet of the generative model. 2. Run both the NUTS sampler and VB on the toy data set and reproduce Figure 1. 3. Vectorize the code. You will need no more than 2 loops in your code — one in the transformed parameters block and the other in the model block. Figure 2: PCA latent variables identify a well-known biological process in cancer cells Consider a matrix of gene expression measurements of a set of genes(D = 1106)$over a set of drugs$(N = 78)$. Each cell in the matrix indicates if the gene (column) is up-regulated (positive) or down-regulated (negative). It is hypothesised that such data can be used to understand the mechanisms (genes) through with drugs affect the cells. The matrix is the response of a given type of cancer cell and will constitute our input matrix. Run the model on the data set. For R users, access to the data set is already provided in the starter code. Python users can access it here. To interpret the results, inspect$\boldsymbol \alpha$and choose a component$k$, explaining your choice. For the selected component$k$, 1. Collect the row indices of the 10 highest and lowest scores of the$k^{th}$component of$\mathbf Z$, and 2. The column indices of the 20 highest scores of the$k^{th}$component of$\mathbf W$. Subset the original data matrix$\mathbf X, using the row and column indices, and plot the heat map of the resulting sub-matrix similar to the one shown in Figure 2. Convince yourself that this particular strategy lends itself well to interpretation. ### Extra (optional) Figure 3: The log of the estimated ARD matrix shown as Hinton diagrams. Blue corresponds to active components while red corresponds to inactive ones. Figure 4: The true latent components used for generating the data. The first two components are shared between the two views, while the last two are specific to one view each. Figure 5: The estimated latent components of our model. Note that the components can be invariant to order The generative story of Bayesian PCA for continuous data goes as follows, \begin{align} \mathbf{W}_k &\sim \mathcal{N}(0, \tau^{-1} \mathbf I_D)\\ \mathbf z &\sim \mathcal{N}(0, \mathbf I_K)\\ \mathbf{x}_n &\sim \mathcal{N}(\mathbf W\mathbf z_n, \tau_m^{-1} \mathbf I_D)\\ \end{align} Is the claim Latent Dirichlet Allocation (LDA) is PCA for discrete data true or false? Use the generative model for LDA to justify your position either way. ## Canonical Correlation Analysis and Group Factor Analysis For two data vectors\mathbf x^1$and$\mathbf x^2, efficient inference on Bayesian CCA can be done by introducing a group-wise sparsity prior for the projection matrix, as was explained in class. Recall that the IBFA model, \begin{align} \mathbf x^{(m)} &\sim \mathcal{N}(\mathbf A^{(m)}\mathbf z + \mathbf B^{(m)}\mathbf z^{(m)}, \tau^{(m)} \mathbf I)\\ \mathbf z^{(m)} &\sim \mathcal{N}(\mathbf 0,\mathbf I)\\ \mathbf z &\sim \mathcal{N}(\mathbf 0,\mathbf I) \end{align} captures the shared correlation common to both data sets with the shared latent variable\mathbf{z}$and models the variation within a data set with the view-specific latent variable$\mathbf z^{(m)}$. Now, by introducing the group-wise sparsity on$\mathbf W$as, $$\mathbf W = \begin{bmatrix} \mathbf A^{(1)} &\mathbf B^{(1)} & \mathbf 0\\ \mathbf A^{(2)} & \mathbf 0 & \mathbf B^{(1)} \end{bmatrix}$$ $$\mathbf \Sigma = \begin{bmatrix} \mathbf (\tau^{(1)})^{-1} \mathbf I & \mathbf 0\\ \mathbf 0 & \mathbf (\tau^{(2)})^{-1} \mathbf I\\ \end{bmatrix}$$ and concatenating$\mathbf z = [\mathbf z^{(1)} ,\mathbf z^{(2)}], IBFA can alternatively be written as, \begin{align} \mathbf x &\sim \mathcal{N} (\mathbf{Wz}, \mathbf\Sigma)\\ \mathbf z &\sim \mathcal{N}(\mathbf 0,\mathbf I ) \end{align} In other words, we are now effectively analysing the feature-wise concatenation of the data sources with a single latent variable model with diagonal noise covariance, and a projection matrix with a specific structure, enforced by introducing an ARD prior for each view. Bayesian GFA can be interpreted as an extension of factor models, where the task is to model relationships between data sets (or views), describing dependencies between views rather than the independent variables. Alternatively, it can also be thought of as an extension of Bayesian CCA to more than 2 views. "The figure above illustrates group factor analysis of three data sets or views. The feature-wise concatenation of the data sets\mathbf X_m$is factorised as a product of the latent variables$\mathbf Z$and factor loadings$\mathbf W$. The factor loadings are group-wise sparse, so that each factor is active only in some subset of views (or all of them). from Bayesian Group Factor Analysis, Virtanen et al. The factors active in just one of the views model the structured noise, variation independent of all other views, whereas the rest model the dependencies. The nature of each of the factors is learnt automatically, without needing to specify the numbers of different factor types (whose number could be exponential in the number of views) beforehand." ### Optional Reading You can refer to the Bayesian CCA as well as the Bayesian GFA paper for more details. Figure 6: The ARD matrix$\alphaused in data generation is shown. Blue corresponds to active components and red to inactive ones. Figure 7: The log of the estimated ARD matrix. Inactive components are in red. ### Questions You will have to complete at the very least the Bayesian CCA model to claim 2.5 points for today's exercise session. Bayesian GFA is trivially extendable. Bayesian CCA is formally expressed as follows: \begin{align} \mathbf x_n^{(m)} &\sim \mathcal{N} ( \mathbf W^{(m)} \mathbf z_n , ( \tau^{( m )} )^{-1}\mathbf I)\\ \mathbf W_k^{(m)}| \alpha_k^{(m)} &\sim \mathcal{N} ( 0, ( \alpha_k )^{-1} \mathbf I)\\ \alpha_k^{(m)} &\sim \text{Gamma}( \alpha_0 , \beta_0 )\\ \tau^{( m )} &\sim \text{Gamma}( \alpha_0 , \beta_0 )\\ \mathbf z_n &\sim \mathcal{N} ( 0, \mathbf I_k ) \end{align} 1. Write the Stan snippet for Bayesian CCA. Start with the Stan code for Bayesian PCA (you can access the solution to Bayesian PCA). 2. Reproduce the ARD matrix in Figure 3. 3. Reproduce the latent components shown in Figure 5. 4. If we were to ignore the structure in\mathbf W$and assume a single$\tau$across all views, what would this model reduce to? The formulation of Bayesian GFA is the same as for CCA, except$m$is extended to more than two views. 1. If your Stan code for Bayesian CCA is correct, it should be easily extendable to Bayesian GFA. 2. The data generation uses the ARD matrix in Figure 6. 3. Reproduce the ARD matrix in Figure 7. Remember that the model is invariant to order, so you don't need to reproduce it exactly. ## Tensor factorization Canonical decomposition(CP) factorizes a tensor into low-dimensional latent variables in all the modesA tensor of mode 2 is a matrix. For this exercise, we use a tensor of mode 3. Figure 8: Illustration of tensor factorization. The tensor$\mathcal{Y}$can be factorized into a low-dimensional component space$\mathbf X$,$\mathbf W$and$\mathbf U$which represents relationships across the drugs, genes and cancers. as shown in Figure 8. The latent variables are also commonly referred to as factors or components. For the tensor$\mathcal{Y} \in \mathbb{R}^{N \times D \times L}$, CP identifies latent variables with$K$components in all the modes$\mathbf X \in \mathbb{R}^{N \times K}$,$\mathbf W \in \mathbb{R}^{D \times K}$and$\mathbf U \in \mathbb{R}^{L \times K}$as $$\mathcal{Y} \approx \sum_{k=1}^{K} \mathbf x_k \circ \mathbf w_k \circ \mathbf u_k$$ where$\circ$is the outer product. Since you are not expected to know about tensor factorization methods until tomorrow, you can imagine tensor factorization to be an extension of Bayesian PCA for tensors. Figure 9: Illustration of Bayesian tensor factorization with ARD. The latent components of$\mathbf X$(top),$\mathbf W$(middle) and$\mathbf U$(bottom) are shown, with the ARD switching off the extraneous components. ### Questions Given tensor$\mathcal{Y} \in \mathbb{R}^{N \times D \times L}$, the Bayesian formulation of the CP factorization for a three-mode tensor into corresponding latent variables (or components)$\mathbf X \in \mathbb{R}^{N \times K}$,$\mathbf W \in \mathbb{R}^{D \times K}$and$\mathbf U \in \mathbb{R}^{L \times K}$, with ARD priors on$\mathbf W, is given as follows: \begin{align} {y}_{n,d,l} &\sim \mathcal{N} \left( \sum_{k=1}^K \left(x_{n,k} \cdot w_{d,k} \cdot u_{l,k}\right) , \tau^{-1}\right)\\ {x}_{n,k} &\sim \mathcal{N}(0,1 )\\ {u}_{l,k} &\sim \mathcal{N}(0,1 )\\ {w}_{d,k} &\sim \mathcal{N}(0, \alpha_k^{-1} )\\ \tau &\sim \text{Gam}(\alpha_0, \beta_0)\\ \alpha_k &\sim \text{Gam}(\alpha_0, \beta_0)\\ \end{align} An ideal way to write a Stan solution for this is as follows: 1. Rewrite the matrix multiplication\mathbf Z * \mathbf W'$in the Bayesian PCA snippet as a sum of outer products over$\sum_{k=1}^{K} \mathbf z_k \circ \mathbf w_k$. 2. Then extend the outer product sum to the 3 matrices$\mathbf X$,$\mathbf W$and$\mathbf U$as$\sum_{k=1}^K (x_{n,k} \cdot w_{d,k} \cdot u_{l,k})\$
3. Run the Stan code on the toy data set and reproduce the plots in Figure 9.
4. If you are able to, try and optimize the code.

## Epilogue

We trust that the exercises have proven to be of value in furthering your understanding of probabilistic factor models. Naturally, learning a new language from scratch and implementing complicated models is challenging, so some of the exercises might have proven harder than expected — which is why we provided you with rather copious (and admittedly buggy) starter codes.

Do remember to leave feedback about the exercises on the course feedback form. If you are at all interested in probabilistic programming languages and black-box inference methods, now is a great time to work in these areas. There are several such popular languages under active development:

You have mostly been using variational Bayes(VB) for inference. In contrast to MCMC methods, VB is extremely scalable and efficient. If you want to know more about the latest trends in such inference methods, you are welcome to look up the NIPS workshop held every year on approximate inference.

That is all, folks.