Approximating the full joint posterior distribution with a single Gaussian is often inaccurate. An alternative is to approximate the marginal posterior distribution of a subset of the parameters, referred to as the marginal Laplace approximation.

Then, integrate out the remaining parameters using another method.

  • Integrated. The hyperparameters are integrated out using numerical integration (see the sketch after this list).
  • Nested. Because we need a Laplace approximation of the latent field nested inside the approximation of the hyperparameter posterior.
  • Laplace approximations. The method used to obtain the parameters (mean and precision) of the Gaussian approximation.
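Putting the three together (using the notation introduced below, with latent field $x$ and hyperparameters $\theta$), the quantity targeted is, roughly,

$$\tilde p(x_i \mid y) \approx \sum_k \tilde p(x_i \mid \theta_k, y)\, \tilde p(\theta_k \mid y)\, \Delta_k,$$

a numerical integral over a set of hyperparameter values $\theta_k$ with weights $\Delta_k$ (integrated), where every term needs a Laplace approximation of $p(x \mid \theta_k, y)$ inside the approximation of $p(\theta \mid y)$ (nested).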

When is it useful?

Wherever we have a latent Gaussian. Almost always.

Following Dan Simpson, we can write a linear mixed model as

$$y = X\beta + Zu + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I),$$

where

  • Fixed effects. $X$ contains the covariates and $\beta$ is a vector of unknown regression coefficients with a Gaussian prior, $\beta \sim N(0, \tau_\beta^{-1} I)$.
  • Random effects. $Z$ is a known matrix that describes the random effects, and $u$ has a Gaussian prior whose precision depends on the hyperparameters, $u \sim N(0, Q_u(\theta)^{-1})$.

We can put fixed and random effects together into a single design matrix $A = [X \; Z]$, with a latent field (or latent Gaussian component) $x = (\beta, u)$.

This allows us to re-write the model as a three-stage hierarchical model

$$y \mid x, \theta \sim p(y \mid Ax, \theta), \qquad x \mid \theta \sim p(x \mid \theta), \qquad \theta \sim p(\theta),$$

where $x$ is the latent field and $\theta$ are the hyperparameters. For the latent field, $x$, we can write

$$x \mid \theta \sim N(\mu(\theta), \Sigma(\theta))$$

and parametrise by a precision matrix

$$Q(\theta) = \Sigma(\theta)^{-1}.$$
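A minimal sketch of this construction, with toy shapes and a placeholder for $Q_u(\theta)$ (none of these names come from the implementation referenced later):

import jax.numpy as jnp

n, p, m = 100, 3, 50                  # observations, covariates, random effects (toy sizes)
X = jnp.ones((n, p))                  # fixed-effect design matrix
Z = jnp.zeros((n, m))                 # random-effect design matrix
A = jnp.concatenate([X, Z], axis=1)   # joint design: the linear predictor is A @ x

tau_beta = 0.01                       # weak prior precision on beta (assumed value)
Q_u = jnp.eye(m)                      # placeholder for Q_u(theta), e.g. an ICAR precision

# latent field x = (beta, u) has a block-diagonal prior precision
Q = jnp.block([
    [tau_beta * jnp.eye(p), jnp.zeros((p, m))],
    [jnp.zeros((m, p)),     Q_u],
])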

What am I approximating?

Marginalise out $x$ and then use standard inference techniques on $\theta$.

The posterior can be written as

$$p(\theta \mid y) = \frac{p(x, \theta \mid y)}{p(x \mid \theta, y)} = \frac{p(y \mid x, \theta)\, p(x \mid \theta)\, p(\theta)}{p(y)\, p(x \mid \theta, y)}$$

for any value of $x$, where

  • the GMRF specifies $p(x \mid \theta)$
  • the prior $p(\theta)$ is supplied
  • the measurement model gives us $p(y \mid x, \theta)$
  • the normalising constant $p(y)$ is ignored since $y$ is fixed
log_prior_fn(params) + log_marginal_likelihood(params, y)
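A sketch of the "integrated" step over the hyperparameters, assuming the two callables from the line above, the observed data y, and a one-dimensional grid (all illustrative assumptions):

import jax.numpy as jnp
from jax.scipy.special import logsumexp

theta_grid = jnp.linspace(-3.0, 3.0, 51)      # toy 1-D grid over a hyperparameter

# unnormalised log posterior over theta, evaluated pointwise on the grid
log_post = jnp.array([
    log_prior_fn(theta) + log_marginal_likelihood(theta, y)
    for theta in theta_grid
])

# normalise on the grid; the ignored constant P(y) cancels here
weights = jnp.exp(log_post - logsumexp(log_post))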

The other term in the denominator, $p(x \mid \theta, y)$, is estimated with Laplace's method: a Gaussian approximation $\tilde p_G(x \mid \theta, y)$ whose mean matches the mode and whose precision matches the curvature of the log density at the mode.

The marginal likelihood

# evaluated at x = \hat{x}_0 below
log_laplace_approx = self._conditional_gaussian_approximation(
    x, y, gaussian
).logpdf(x)
log_marginal_likelihood = (    # P(y | params) =
    gaussian.logpdf(x)         # P(x | params)
    + self.f(x, y).sum()       # * P(y | x, params)
    - log_laplace_approx       # / P(x | y, params)
)

Laplace approximation

Denoting $f_i(x_i) = \log p(y_i \mid x_i, \theta)$, we have the following approximation around any given $x_0$:

$$\begin{aligned}
\log p(x \mid y, \theta) &= \text{const} - \tfrac{1}{2}(x - \mu)^\top Q (x - \mu) + \sum_i f_i(x_i) \\
&\approx \text{const} - \tfrac{1}{2}(x - \mu)^\top Q (x - \mu) + \sum_i \left[ f_i(x_{0,i}) + f_i'(x_{0,i})(x_i - x_{0,i}) + \tfrac{1}{2} f_i''(x_{0,i})(x_i - x_{0,i})^2 \right] \\
&= \text{const} - \tfrac{1}{2} x^\top \big( Q - \operatorname{diag}(f''(x_0)) \big) x + x^\top \big( Q\mu + f'(x_0) - x_0 \odot f''(x_0) \big).
\end{aligned}$$

In the last line we group the quadratic and linear terms: the approximation is Gaussian with precision $Q - \operatorname{diag}(f''(x_0))$ and information vector $Q\mu + f'(x_0) - x_0 \odot f''(x_0)$.

def _conditional_gaussian_approximation(
    self, x, y, unconditional_gaussian
) -> gmrf.Gaussian:
    """The Gaussian approximation of P(x | y, params) around the given x."""
    fpp = self.fpp(x, y)
    # precision of the approximation: Q - diag(f''(x))
    precision = unconditional_gaussian.precision.add_diag(-fpp)
    linear_part = (
        # information vector Q @ mu, i.e. self.precision @ self.mean
        unconditional_gaussian.information_vector
        + self.fp(x, y)
        - x * fpp
    )
    # Gaussian with this precision and mean = precision^{-1} @ linear_part
    return gmrf.Gaussian(precision, precision.solve(linear_part))

The constant is then determined using the fact that this must be a probability distribution in $x$ (i.e. use the value of the constant from the log probability function of the Gaussian with the same quadratic and linear terms).
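Concretely, for a Gaussian with precision $P$ and mean $m$ in $n$ dimensions, matching the quadratic and linear terms above, the log density is

$$\log \mathcal{N}(x; m, P^{-1}) = \tfrac{1}{2}\log\det P - \tfrac{n}{2}\log 2\pi - \tfrac{1}{2}(x - m)^\top P (x - m),$$

so the constant (everything that does not depend on $x$) is read off as $\tfrac{1}{2}\log\det P - \tfrac{n}{2}\log 2\pi - \tfrac{1}{2} m^\top P m$.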

Evaluate at $x = \hat{x}_0$, the value that maximises $p(x \mid y, \theta)$. Find this with Newton's method, which amounts to iteratively solving for the maximum of the very same quadratic approximation. Iteratively solve for $x_{k+1}$ in

$$x_{k+1} = \big( Q - \operatorname{diag}(f''(x_k)) \big)^{-1} \big( Q\mu + f'(x_k) - x_k \odot f''(x_k) \big)$$

until stabilisation.

def newton_step(x, y, gaussian):
    return self._conditional_gaussian_approximation(x, y, gaussian).mean

fpi = jaxopt.FixedPointIteration(newton_step)
x = fpi.run(gaussian.mean, y, gaussian).params

Where can sparsity help?

Looking at the precision matrix of the latent field again,

$$Q(\theta) = \begin{pmatrix} \tau_\beta I & 0 \\ 0 & Q_u(\theta) \end{pmatrix}$$

  • Often we make a Markov assumption on $u$, for example in the ICAR model in spatial settings, where each element depends (conditionally) only on its nearest neighbours. This makes the precision matrix sparse; a toy construction follows this list.
    • This is the Gaussian Markov Random Field (GMRF)
  • Usually the random-effect dimension dominates the number of covariates, $\dim(u) \gg \dim(\beta)$.
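For instance, an ICAR-style precision on a toy chain graph (illustrative only, not the interface of the implementation referenced below) is sparse, with nonzeros only between neighbours:

import numpy as np
import scipy.sparse as sp

m = 6                                  # toy number of random effects
off = np.ones(m - 1)
W = sp.diags([off, off], offsets=[-1, 1])              # adjacency matrix of a chain graph
D = sp.diags(np.asarray(W.sum(axis=1)).ravel())        # diagonal of neighbour counts

tau = 1.0                              # precision scale hyperparameter (assumed)
Q_u = tau * (D - W)                    # ICAR precision: sparse and banded
print(Q_u.toarray())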

This means $Q$ is normally sparse. The four operations needed in a sparse implementation are listed below (a sketch follows the list):

  • Adding a vector to the diagonal, $Q - \operatorname{diag}(f'')$.
  • Multiplying by a vector, e.g. the information vector $Q\mu$.
  • Solving linear systems (the Newton step).
  • Computing $\log\det Q$ (for the Gaussian normalising constant).
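A sketch of these four operations with scipy.sparse standing in for the sparse GMRF interface (the log-determinant via sparse LU assumes a symmetric positive-definite matrix; a sparse Cholesky would be the more standard choice):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

m = 6
main, off = 2.0 * np.ones(m), -np.ones(m - 1)
Q = sp.diags([main, off, off], offsets=[0, -1, 1], format="csc")  # toy sparse SPD precision
fpp = -np.ones(m)                          # stand-in for f''(x_0), negative for a log-concave likelihood
mu = np.ones(m)                            # stand-in for the prior mean

# 1. add a vector to the diagonal: Q - diag(f''(x_0))
Q_cond = sp.csc_matrix(Q - sp.diags(fpp))

# 2. multiply by a vector, e.g. the information vector Q @ mu
info = Q @ mu

# 3. solve a linear system (the Newton step)
lu = splu(Q_cond)
x_new = lu.solve(info)

# 4. log-determinant, needed for the Gaussian normalising constant;
#    for an SPD matrix, log det = sum of log |diag(U)| from the LU factors
logdet = np.sum(np.log(np.abs(lu.U.diagonal())))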

More details in Dan’s blog and the README of the jax implementation.

Resources used:

And the relevant pymc issue.