Chapter 3 Multivariate normal distributions and numerical linear algebra

The multivariate normal distribution is one of the most commonly encountered distributions in statistics.

The \(d\)-dimensional multivariate normal \(\mathcal{N}(\mathbf{\mu}, \mathbf{\Sigma})\) is parametrised by the \(d\)-dimensional mean vector \(\mathbf{\mu}\) and a symmetric positive-definite \(d \times d\) covariance matrix \(\mathbf{\Sigma}\).

Some important properties of symmetric positive-definite matrices \(\mathbf{\Sigma}\):

  • all eigenvalues of symmetric matrices are real; and
  • all eigenvalues of symmetric positive-definite matrices are positive.

3.1 Cholesky decomposition

A symmetric positive-definite matrix \(\mathbf{\Sigma}\) can be represented as \[ \mathbf{\Sigma} = \mathbf{L} \mathbf{L}^T, \] where \(\mathbf{L}\) is a lower-triangular matrix. \(\mathbf{L}\) is called the Cholesky decomposition of \(\mathbf{\Sigma}\).

Note: NumPy and SciPy provide Cholesky routines with incompatible default conventions: numpy.linalg.cholesky returns the lower-triangular factor \(\mathbf{L}\) used here, while scipy.linalg.cholesky by default returns the upper-triangular factor \(\mathbf{U} = \mathbf{L}^T\). scipy.linalg.cholesky can be made to return \(\mathbf{L}\) using scipy.linalg.cholesky(..., lower=True).
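The difference is easy to check numerically; a minimal sketch (the example matrix below is arbitrary):

```python
import numpy as np
import scipy.linalg

# An arbitrary symmetric positive-definite example matrix.
Sigma = np.array([[4.0, 2.0],
                  [2.0, 3.0]])

L = np.linalg.cholesky(Sigma)                   # lower-triangular L
U = scipy.linalg.cholesky(Sigma)                # upper-triangular U = L^T (SciPy default)
L2 = scipy.linalg.cholesky(Sigma, lower=True)   # lower-triangular, matching NumPy

assert np.allclose(L, U.T) and np.allclose(L, L2)
assert np.allclose(L @ L.T, Sigma)
```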

3.2 Evaluating the multivariate normal density

The log-density of the \(d\)-dimensional multivariate normal is: \[ \log p(\mathbf{x}) = -\frac{d}{2} \log(2 \pi) -\frac{1}{2} \log | \det \mathbf{\Sigma} | - \frac{1}{2} (\mathbf{x}-\mathbf{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x}-\mathbf{\mu}).\]

It contains two terms that are more difficult to evaluate: \[ \log | \det \mathbf{\Sigma} | \] and \[ - \frac{1}{2} (\mathbf{x}-\mathbf{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x}-\mathbf{\mu}).\]

Both difficult terms can be evaluated efficiently and in a numerically stable manner using the Cholesky decomposition of \(\mathbf{\Sigma}\).

3.2.1 Determinant evaluation using the Cholesky decomposition

Assume \(\mathbf{\Sigma} = \mathbf{L} \mathbf{L}^T\) with \(\mathbf{L} = \begin{pmatrix} l_{11} & 0 & 0 & \dots & 0 \\ l_{21} & l_{22} & 0 & \dots & 0 \\ l_{31} & l_{32} & l_{33} & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ l_{d1} & l_{d2} & l_{d3} & \dots & l_{dd} \\ \end{pmatrix}.\)

Now using basic properties of the determinant and the logarithm, we obtain \[\begin{align*} \log \det \boldsymbol{\Sigma} &= \log\left( \det ( \mathbf{L} \mathbf{L}^T) \right) = \log\left( \det \mathbf{L} \det \mathbf{L}^T \right) \\ &= \log\left( (\det \mathbf{L})^2 \right) = 2 \log\left( \det \mathbf{L} \right) \\ &= 2 \log\left( \prod_{i=1}^d l_{ii} \right) = 2 \sum_{i=1}^d \log\left( l_{ii} \right). \end{align*}\]
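In code, the log-determinant reduces to a sum over the diagonal of \(\mathbf{L}\); a minimal sketch (the random test matrix is arbitrary, and numpy.linalg.slogdet serves as an independent check):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Sigma = A @ A.T + 5 * np.eye(5)            # random symmetric positive-definite matrix

L = np.linalg.cholesky(Sigma)
logdet = 2.0 * np.sum(np.log(np.diag(L)))  # 2 * sum_i log(l_ii)

sign, logdet_ref = np.linalg.slogdet(Sigma)
assert sign > 0 and np.isclose(logdet, logdet_ref)
```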

3.2.2 Quadratic form evaluation with Cholesky

We can simplify the quadratic form as \[\begin{align*} (\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) &= (\mathbf{x}-\boldsymbol{\mu})^T (\mathbf{L} \mathbf{L}^T)^{-1} (\mathbf{x}-\boldsymbol{\mu}) \\ &= (\mathbf{x}-\boldsymbol{\mu})^T \mathbf{L}^{-T} \mathbf{L}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \\ &= (\mathbf{L}^{-1}(\mathbf{x}-\boldsymbol{\mu}))^T \mathbf{L}^{-1}(\mathbf{x}-\boldsymbol{\mu}) \\ &= \mathbf{z}^T \mathbf{z} = \sum_{i=1}^d z_i^2, \end{align*}\] where \(\mathbf{z} = (z_1, \dots, z_d)^T = \mathbf{L}^{-1}(\mathbf{x}-\mathbf{\mu})\).

Note: You should never explicitly compute \(\mathbf{\Sigma}^{-1}\)!

To compute \(\mathbf{z} = \mathbf{L}^{-1}(\mathbf{x}-\mathbf{\mu})\), we solve the linear system of equations \[ \mathbf{L} \mathbf{z} = \mathbf{x}-\mathbf{\mu}. \] In our case \(\mathbf{L}\) is triangular, so this can be solved in \(\mathcal{O}(d^2)\) operations using forward substitution, e.g. with scipy.linalg.solve_triangular.
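A minimal sketch of the quadratic-form evaluation (the test values are arbitrary, and the explicit-inverse computation appears only as a correctness check):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(1)
d = 4
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)              # random symmetric positive-definite matrix
mu = rng.standard_normal(d)
x = rng.standard_normal(d)

L = np.linalg.cholesky(Sigma)
z = solve_triangular(L, x - mu, lower=True)  # forward substitution: L z = x - mu
quad = z @ z                                 # (x - mu)^T Sigma^{-1} (x - mu)

# Check against the (discouraged) explicit inverse.
assert np.isclose(quad, (x - mu) @ np.linalg.inv(Sigma) @ (x - mu))
```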

Even in general, the best algorithm for solving \(\mathbf{\Sigma} \mathbf{x} = \mathbf{b}\) for positive-definite \(\mathbf{\Sigma}\) uses the Cholesky decomposition of \(\mathbf{\Sigma}\) internally.

Note: Even if you are given \(\mathbf{\Sigma}^{-1}\) for free, it may still be better (as fast or faster, more stable) to use the Cholesky approach.

Note: SciPy provides slightly faster routines for solving linear systems with Cholesky factors: scipy.linalg.cho_factor and scipy.linalg.cho_solve.
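Putting the pieces together, here is a sketch of a complete log-density evaluation using cho_factor and cho_solve, with scipy.stats.multivariate_normal as an independent check (the helper name mvn_logpdf and the test values are our own choices):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.stats import multivariate_normal

def mvn_logpdf(x, mu, Sigma):
    """Log-density of N(mu, Sigma) at x via a Cholesky factorisation."""
    d = len(mu)
    c, low = cho_factor(Sigma, lower=True)     # c holds the Cholesky factor
    logdet = 2.0 * np.sum(np.log(np.diag(c)))
    dev = x - mu
    quad = dev @ cho_solve((c, low), dev)      # dev^T Sigma^{-1} dev
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)

rng = np.random.default_rng(2)
d = 3
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)
mu, x = rng.standard_normal(d), rng.standard_normal(d)
assert np.isclose(mvn_logpdf(x, mu, Sigma),
                  multivariate_normal(mu, Sigma).logpdf(x))
```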

3.3 Simulating multivariate normal random draws

The Box-Muller transform (Sec. 2.2.1) can be used to simulate multivariate normal random draws with zero mean and identity covariance \(\mathbf{I}_d\).

Theorem 3.1 Assuming \(\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_d)\), we can obtain \(\mathbf{y} \sim \mathcal{N}(\mathbf{\mu}, \mathbf{\Sigma})\) using the transformation \[\mathbf{y} = \mathbf{L} \mathbf{x} + \mathbf{\mu},\] where \(\mathbf{L}\) is the Cholesky decomposition of \(\mathbf{\Sigma}\): \(\mathbf{\Sigma} = \mathbf{L} \mathbf{L}^T\).

Proof. \(\mathbf{y}\) remains multivariate normal because an invertible affine transformation preserves the form of the density: after the substitution \(\mathbf{x} = \mathbf{A} \mathbf{z} + \mathbf{b}\) with invertible \(\mathbf{A}\), the (unnormalised) multivariate normal density still has the form of a multivariate normal over \(\mathbf{z}\): \[\begin{align*} p(\mathbf{x}) &= \frac{1}{Z} \exp(- \frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}))\\ &= \frac{1}{Z'} \exp(- \frac{1}{2} (\mathbf{A} \mathbf{z} + \mathbf{b} -\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{A} \mathbf{z} + \mathbf{b}-\boldsymbol{\mu}))\\ &= \frac{1}{Z'} \exp(- \frac{1}{2} (\mathbf{z} - \mathbf{A}^{-1}(\boldsymbol{\mu} - \mathbf{b}))^T (\mathbf{A}^{-1} \boldsymbol{\Sigma} (\mathbf{A}^{-1})^{T})^{-1} (\mathbf{z} - \mathbf{A}^{-1}(\boldsymbol{\mu} - \mathbf{b}))). \end{align*}\] Applying this to \(\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_d)\) with \(\mathbf{A} = \mathbf{L}^{-1}\) and \(\mathbf{b} = -\mathbf{L}^{-1}\boldsymbol{\mu}\) (so that \(\mathbf{z} = \mathbf{L}\mathbf{x} + \boldsymbol{\mu} = \mathbf{y}\)) shows that \(\mathbf{y}\) is multivariate normal.

The mean and covariance of \(\mathbf{y}\) can be evaluated as \[ \operatorname{E}[\mathbf{y}] = \mathbf{L} \operatorname{E}[\mathbf{x}] + \mathbf{\mu} = \mathbf{\mu} \] \[ \operatorname{Cov}[\mathbf{y}] = \operatorname{E}[ (\mathbf{y} - \operatorname{E}[\mathbf{y}]) (\mathbf{y} - \operatorname{E}[\mathbf{y}])^T ] = \operatorname{E}[ (\mathbf{L} \mathbf{x})(\mathbf{L} \mathbf{x})^T] = \operatorname{E}[ \mathbf{L} \mathbf{x} \mathbf{x}^T \mathbf{L}^T ] = \mathbf{L} \operatorname{E}[ \mathbf{x} \mathbf{x}^T ] \mathbf{L}^T = \mathbf{L} \mathbf{I}_d \mathbf{L}^T = \mathbf{\Sigma}. \]
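A minimal sampling sketch based on Theorem 3.1 (the standard normal draws come from NumPy's generator rather than a hand-written Box-Muller transform, and the example parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

L = np.linalg.cholesky(Sigma)
n = 100_000
X = rng.standard_normal((n, 2))   # rows x ~ N(0, I_2)
Y = X @ L.T + mu                  # rows y = L x + mu ~ N(mu, Sigma)

# The sample mean and covariance should be close to mu and Sigma.
print(Y.mean(axis=0))
print(np.cov(Y, rowvar=False))
```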

3.4 Further numerical linear algebra tricks

  • If \(\mathbf{\Sigma}\) is not positive definite because of numerical problems (i.e. it has negative eigenvalues), adding a small jitter term, \(\mathbf{\Sigma} + \epsilon \mathbf{I}_d\) with \(\epsilon > | \lambda_{\text{min}} |\), restores positive definiteness.
  • Written algorithms often use the matrix trace \[ \operatorname{tr}{}(\mathbf{A}) = \sum_{i=1}^d a_{ii}, \] which depends only on the diagonal elements of its argument.

    If the argument \(\mathbf{A}\) is a complex expression, evaluating it explicitly is very inefficient. Consider e.g. \[ \operatorname{tr}{}(\mathbf{A}^T \mathbf{B}) = \sum_{i,j} a_{ij} b_{ij}, \] which is much faster (\(\mathcal{O}(n^2)\)) to compute directly than via the matrix product (\(\mathcal{O}(n^3)\)); see the sketch after this list.
    • Using np.trace(A.T @ B) would waste the off-diagonal elements computed in the matrix product
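A minimal sketch of both tricks (the matrices are random test inputs and the jitter constant 1e-8 is an arbitrary small value; timing the two trace variants is left to Exercise 3):

```python
import numpy as np

rng = np.random.default_rng(4)

# Jitter fix: repair a numerically non-positive-definite matrix.
C = rng.standard_normal((4, 4))
C = 0.5 * (C + C.T)                       # symmetric, but typically indefinite
lam_min = np.linalg.eigvalsh(C).min()
if lam_min <= 0:
    C = C + (abs(lam_min) + 1e-8) * np.eye(C.shape[0])
np.linalg.cholesky(C)                     # now succeeds

# Trace trick: tr(A^T B) without forming the full matrix product.
n = 500
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
slow = np.trace(A.T @ B)                  # O(n^3), wastes the off-diagonal elements
fast = np.sum(A * B)                      # O(n^2), computes only what is needed
assert np.isclose(slow, fast)
```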

3.5 Numerical integration

Bayesian statistics provides a theoretically well-founded framework for answering many statistical questions. Unfortunately it is often difficult to apply in practice because the answers depend on the evaluation of complicated integrals which cannot be solved analytically.

Numerical integration concerns the approximation of definite integrals through finite sums: \[ \int_V f(x) \mathrm{d}x \approx \sum_i w_i f(x_i). \] There are a number of efficient methods for this problem when the space is one-dimensional, \(V\) is bounded and \(f(x)\) is well-behaved.

The simplest methods for numerical integration are based on a uniform grid. Different ways of interpolating the function between the grid points result in different algorithms with different weights \(w_i\) (a short comparison in code follows the list):

  • Piece-wise constant interpolation yields the so-called rectangle rule.
  • Piece-wise linear interpolation yields the so-called trapezoidal rule.
  • Piece-wise quadratic interpolation yields the so-called Simpson’s rule.
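Here is the promised comparison of the three rules on \(\int_0^\pi \sin x \,\mathrm{d}x = 2\) (a sketch; np.trapezoid is called np.trapz in NumPy versions before 2.0):

```python
import numpy as np
from scipy.integrate import simpson

x = np.linspace(0.0, np.pi, 101)   # uniform grid
y = np.sin(x)
h = x[1] - x[0]

rect = h * np.sum(np.sin(x[:-1] + h / 2.0))   # rectangle rule (midpoint variant)
trap = np.trapezoid(y, x)                     # trapezoidal rule
simp = simpson(y, x=x)                        # Simpson's rule

# Simpson's rule is markedly more accurate on the same grid.
print(rect - 2.0, trap - 2.0, simp - 2.0)
```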

Given a fixed number of function evaluations, more accurate methods can be obtained by allowing the evaluation points to be non-uniformly spaced. Gaussian quadrature methods have been developed to yield optimal accuracy for a small number of function evaluations in various settings such as targets with a known weighting function.

While these simple quadrature methods work well in 1D, they become impractical very quickly when the dimensionality of the space increases. If accurately covering the integration domain in 1D requires \(n\) points, the number of points needed to cover a \(d\)-dimensional domain with equal coverage is of the order \(n^d\), which becomes very large very quickly.

No efficient general methods for high-dimensional numerical integration are known. Monte Carlo methods are often the best choice.

3.5.1 Software for numerical integration

Numerical software packages usually include functions for computing integrals that apply a number of advanced tricks to make the process significantly more efficient.

In Python, standard 1D numerical integration is performed by scipy.integrate.quad. This function uses an adaptive method to choose the evaluation points and therefore needs access to the target function directly. The module scipy.integrate also has functions for higher-dimensional integration.
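For example, quad can verify the normalisation of the standard normal pdf (taken from scipy.stats here for convenience), even over an infinite interval:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

value, abserr = quad(norm.pdf, -np.inf, np.inf)
print(value, abserr)   # value equals 1 to within the reported error estimate
```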

Sometimes the integrand cannot be easily written as a full function but can only be evaluated at a fixed set of points. In Python, the functions scipy.integrate.trapezoid and scipy.integrate.simpson (named scipy.integrate.trapz and scipy.integrate.simps in older SciPy versions) can compute integrals from such a fixed set of function evaluations.
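A sketch with a fixed grid of evaluations (the truncation range [-8, 8] is an arbitrary choice that makes the truncated tails negligible):

```python
import numpy as np
from scipy.integrate import trapezoid, simpson
from scipy.stats import norm

x = np.linspace(-8.0, 8.0, 1001)   # fixed evaluation points
y = norm.pdf(x)                    # only the values are needed, not a callable

print(trapezoid(y, x), simpson(y, x=x))   # both are close to 1
```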

Similar functions exist in other numerical software, too.

3.6 Some best practices

  • Never explicitly compute matrix inverses \(\mathbf{A}^{-1}\) in real code
  • Cholesky decomposition is your friend when dealing with covariance matrices
    • Always use Cholesky instead of explicit inverse
    • Always use Cholesky instead of explicit determinant
  • Never evaluate the trace of a complex matrix expression by computing the full expression first, because many of the computed elements will be wasted
  • Numerical quadrature works well in low dimensions; check your software's documentation for the available implementations
  • High-dimensional integration is difficult; Monte Carlo methods may be your best choice in general

3.7 Exercises

  1. Verify that the formulas derived for the log-determinant and the quadratic form in the normal density using Cholesky decomposition are valid.
  2. Use numerical integration to verify that the normal pdf implemented in Sec. 2.3 is properly normalised.
  3. Test and time the two ways of evaluating the trace of the product of two square matrices illustrated by the equation \[ \operatorname{tr}{}(\mathbf{A}^T \mathbf{B}) = \sum_{i,j} a_{ij} b_{ij}. \]