Approximating the full joint posterior distribution with a single Gaussian is often inaccurate. An alternative is to approximate the marginal posterior distribution of some subset of the parameters, referred to as the marginal Laplace approximation, and then integrate out the remaining parameters using another method.
- Integrated. The hyperparameters are integrated out using numerical integration.
- Nested. Because the Laplace approximation of the latent field sits inside that numerical integration over the hyperparameters.
- Laplace approximations. The method used to obtain the parameters (mode and curvature) of the Gaussian approximation.
When is it useful?
Wherever we have a latent Gaussian. Almost always.
Following Dan Simpson, we can write a linear mixed model as
$$y = X\beta + Zu + \varepsilon,$$
where
- Fixed effects. $X$ contains the covariates and $\beta$ is a vector of unknown regression coefficients with a Gaussian prior.
- Random effects. $Z$ is a known matrix that describes the random effects, with $u \sim N(0, \Sigma_u(\theta))$.
We can put fixed and random effects together as $A = [X, Z]$, with a latent field (or latent Gaussian component) $x = (\beta, u)$.
This allows us to re-write the model as a three-stage hierarchical model
$$y \mid x, \theta \sim p(y \mid x, \theta), \qquad x \mid \theta \sim p(x \mid \theta), \qquad \theta \sim p(\theta),$$
where $x$ is the latent field, and $\theta$ are the hyperparameters. For the latent field, $x$, we can write
$$x \mid \theta \sim N\big(0, \Sigma(\theta)\big),$$
and parametrise by a precision matrix $Q(\theta) = \Sigma(\theta)^{-1}$.
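As a minimal sketch of this setup (the dimensions, names, and the iid stand-in precision for $u$ are all hypothetical, not taken from any particular implementation), the combined design matrix stacks $X$ and $Z$, and the joint prior precision of $x = (\beta, u)$ is block diagonal:

```python
import numpy as np

# Hypothetical dimensions: n observations, p covariates, m random effects.
n, p, m = 100, 3, 20
rng = np.random.default_rng(0)

X = rng.normal(size=(n, p))                                               # fixed-effects design
Z = (rng.integers(0, m, size=n)[:, None] == np.arange(m)).astype(float)   # group indicators

A = np.hstack([X, Z])                       # combined design: eta = A @ x, with x = (beta, u)

def latent_precision(theta, tau_beta=1e-4):
    """Prior precision Q(theta) of the latent field x = (beta, u).

    beta gets a vague Gaussian prior (small fixed precision tau_beta);
    u gets a theta-dependent precision -- here simply exp(theta) * I as a
    stand-in for a structured (e.g. ICAR) precision.
    """
    Q = np.zeros((p + m, p + m))
    Q[:p, :p] = tau_beta * np.eye(p)
    Q[p:, p:] = np.exp(theta) * np.eye(m)
    return Q

Q = latent_precision(theta=0.0)             # (p + m) x (p + m), block diagonal
```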
What am I approximating?
Marginalise out the latent field $x$ and then use standard inference techniques on the hyperparameters $\theta$.
The posterior can be written as
$$p(\theta \mid y) = \frac{p(x, \theta \mid y)}{p(x \mid \theta, y)} \propto \frac{p(y \mid x, \theta)\, p(x \mid \theta)\, p(\theta)}{p(x \mid \theta, y)}$$
- GMRF specifies $p(x \mid \theta)$
- prior $p(\theta)$ is supplied
- measurement model gives us $p(y \mid x, \theta)$
- normalising constant $p(y)$ is ignored since $y$ is fixed
The other term in the denominator, $p(x \mid \theta, y)$, is estimated with Laplace's method: it is replaced by a Gaussian approximation $\tilde{p}_G(x \mid \theta, y)$ that matches the mode and the curvature at the mode, and the whole ratio is evaluated at the mode.
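One way to write this out explicitly, on the log scale and up to an additive constant (using the mode $\mu^*(\theta)$ and the precision $P(\theta) = Q + \operatorname{diag}(c)$ of the Gaussian approximation derived in the next section):
$$\log \tilde{p}(\theta \mid y) = \log p(\theta) + \log p\big(y \mid \mu^*(\theta), \theta\big) + \log p\big(\mu^*(\theta) \mid \theta\big) - \tfrac{1}{2} \log \det P(\theta) + \text{const},$$
since the Gaussian approximation evaluated at its own mode is $(2\pi)^{-d/2} \det P(\theta)^{1/2}$ with $d = \dim(x)$. This $\log\det$ term (and the one hidden inside $\log p(\mu^* \mid \theta)$, which involves $\log\det Q(\theta)$) is what the sparse operations at the end of these notes are needed for.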
The likelihood
The measurement model is assumed to factor over observations,
$$p(y \mid x, \theta) = \prod_i p(y_i \mid x_i, \theta),$$
so its log is a sum of terms, each depending on a single component of $x$.
Laplace approximation
Denoting $f(x) = \log p(x \mid \theta, y) + \text{const} = -\tfrac{1}{2} x^\top Q x + \sum_i \log p(y_i \mid x_i, \theta)$, we have the following approximation around any given $x_0$:
$$
\begin{aligned}
f(x) &\approx -\tfrac{1}{2} x^\top Q x + \sum_i \left[ \log p(y_i \mid x_{0,i}, \theta) + g_i (x_i - x_{0,i}) - \tfrac{1}{2} c_i (x_i - x_{0,i})^2 \right] \\
     &= -\tfrac{1}{2} x^\top \big( Q + \operatorname{diag}(c) \big) x + b^\top x + \text{const},
\end{aligned}
$$
where $g_i$ is the first derivative and $c_i$ the negative second derivative of $\log p(y_i \mid x_i, \theta)$ at $x_{0,i}$, and $b = g + \operatorname{diag}(c)\, x_0$.
In the last line we group the quadratic and linear terms.
The constant is then determined by using the fact that this must be a probability distribution in $x$ (i.e. use the value of the constant from the log probability function of the Gaussian with the same quadratic and linear terms), giving $x \mid \theta, y \approx N\big(\mu, (Q + \operatorname{diag}(c))^{-1}\big)$ with mean $\mu$ solving $(Q + \operatorname{diag}(c))\,\mu = b$.
Evaluate at the mode $x_0 = \mu^*$, the point that maximises $f$. Find this with Newton's method, which amounts to iteratively solving for the maximum of the very same quadratic approximation. Iteratively solve for $\mu^{(k+1)}$ in
$$\big(Q + \operatorname{diag}(c^{(k)})\big)\, \mu^{(k+1)} = b^{(k)},$$
with $b^{(k)}$ and $c^{(k)}$ evaluated at $\mu^{(k)}$, until the iterates stabilise.
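A minimal sketch of this Newton iteration, assuming (purely for illustration) a Poisson likelihood with log link in which each $y_i$ depends on the corresponding latent component $x_i$; the function and variable names are hypothetical:

```python
import numpy as np

def find_mode(Q, y, n_iter=50, tol=1e-8):
    """Newton's method for the mode of p(x | theta, y), with GMRF prior
    x ~ N(0, Q^{-1}) and Poisson likelihood y_i ~ Poisson(exp(x_i)).

    Each iteration solves the quadratic (Gaussian) approximation exactly:
        (Q + diag(c)) mu_new = b,
    where c = exp(mu) is the negative second derivative of the log-likelihood
    and b = (y - exp(mu)) + c * mu collects the linear terms.
    """
    mu = np.zeros(Q.shape[0])
    for _ in range(n_iter):
        c = np.exp(mu)                        # curvature of -log p(y | x) at mu
        b = (y - c) + c * mu                  # gradient term re-centred at mu
        mu_new = np.linalg.solve(Q + np.diag(c), b)
        if np.max(np.abs(mu_new - mu)) < tol:
            mu = mu_new
            break
        mu = mu_new
    P = Q + np.diag(np.exp(mu))               # precision of the Gaussian approximation
    return mu, P
```

Because the negative Hessian here is exactly $Q + \operatorname{diag}(c)$, each solve is a full Newton step, and the returned pair $(\mu, P)$ defines the Gaussian approximation $\tilde{p}_G(x \mid \theta, y)$.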
Where can sparsity help?
Looking at the precision matrix $Q(\theta)$ again:
- Often we make a Markov assumption on $u$, for example in the ICAR model in spatial settings, where each area depends only on its nearest neighbours. This makes the precision matrix sparse.
- This is the Gaussian Markov random field (GMRF) structure.
- Usually the number of random effects (the dimension of $u$) dominates the number of covariates (the dimension of $\beta$).
This means $Q$, and hence $Q + \operatorname{diag}(c)$, is normally sparse. The four operations needed in a sparse implementation are (a sketch follows the list):
- Adding $\operatorname{diag}(c)$ to the diagonal of $Q$.
- Multiplying a vector: $(Q + \operatorname{diag}(c))\, v$.
- Solving linear systems $(Q + \operatorname{diag}(c))\, \mu = b$ (the Newton step).
- Computing $\log \det\big(Q + \operatorname{diag}(c)\big)$.
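A small sketch of those four operations with a sparse ICAR-style precision, using `scipy.sparse`; the chain graph, the stand-in values of $c$ and $b$, and all names are hypothetical, and a real implementation would typically use a sparse Cholesky factorisation (e.g. CHOLMOD) for the solve and the log-determinant:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

# Hypothetical ICAR-style precision on a chain graph: Q[i, i] = number of
# neighbours, Q[i, j] = -1 for neighbours, plus a small jitter so Q is proper.
m = 1000
degree = 2.0 * np.ones(m)
degree[0] = degree[-1] = 1.0
Q = sp.diags([-np.ones(m - 1), degree + 1e-6, -np.ones(m - 1)],
             offsets=[-1, 0, 1], format="csc")

c = np.exp(np.zeros(m))            # stand-in curvature from the likelihood
b = np.ones(m)                     # stand-in linear term

# 1. Adding diag(c) to the diagonal -- the result stays sparse.
P = Q + sp.diags(c, format="csc")

# 2. Multiplying a vector.
v = P @ b

# 3. Solving the linear system of the Newton step.
lu = splu(P)
mu = lu.solve(b)

# 4. Computing log det(P) from the same sparse factorisation
#    (P is positive definite here, so the product of |U_ii| gives det(P)).
logdet = np.sum(np.log(np.abs(lu.U.diagonal())))
```

In practice the solve and the log-determinant come from a single sparse factorisation of $Q + \operatorname{diag}(c)$, which is the step where the GMRF sparsity pays off.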
More details in Dan’s blog and the README of the jax implementation.
Resources used:
- Adam Howes’ thesis chapter on (R-)INLA
- Dan Simpson’s blog
- Junpeng Lao’s attempt in pymc3
- That Stan team paper
- INLA from scratch
- jax implementation
And the relevant pymc issue.