Gaussian Processes

Dr. Youngsoo Baek

Mar 18, 2025

Recalling what went before the break

Is Linear Fit the Right Way?

There are various ways to fit a model to time-varying behavior of the curves.

A brief agenda

By the end of this lecture you should:

Understand the basic concepts of Gaussian processes (GP) and their usefulness in statistical modeling.
Compute posterior and predictive inference for a simple GP model.
Use GP as a building block in modeling time-correlated data.

Motivation

Let \(i\) index distinct eyes (\(i=1,2,\ldots,n\)) and \(t\) index within-eye data in time (\(t=1,2,\ldots,n_i\)).

Suppose we are interested in effects of time-invariant predictors (age and iop):

\[Y_{it} = \underbrace{\mathbf{x}_i\boldsymbol{\beta}}_{\text{Fixed in time}} + \underbrace{\eta_{it}}_{\text{Varying in time}} + \epsilon_{it}\]

In a previous lecture, we included only time (\(X_{it}\)) and fit a random slope regression.

\[\eta_{it} = \theta_{0i} + \underbrace{X_{it}}_{=Time}\theta_{1i}.\]

What if we want a nonlinear model for time effects?

Exploratory Analysis of Population Predictors

A review of computation / linear algebra needed to sample

How can we sample a 3D normal random vector?

p <- 3
mu <- c(3, -1, 5)
Sigma <- matrix(
  c(
    1.0, 0.6, 0.3,
    0.6, 1.0, 0.6,
    0.3, 0.6, 1.0
  ),
  nrow = 3, ncol = 3
)
z <- rnorm(p)
R <- t(chol(Sigma)) ## R %*% t(R) == Sigma
x <- R %*% z

# This is the same as
x <- rmvnorm(1, mu, Sigma)

From previous lecture: R decomposes into D (diagonal) and L (Cholesky factor of a correlation matrix).

Constructing a Gaussian Random Function in Time

What about “random functions”? Can we extend this idea to arbitrarily many time points? What does that even mean?

We start at time \(t=0\) from \(X_0 = 0\).
When time passes by amount of \(h > 0\), we want

\[X_{t+h} - X_t \sim N(0,h).\] Note: \(\mathbb{E}[X_t^2] = \mathbb{V}(X_t) = t\) (Why?)

Each increment will be independent from each other.

Now think of passing to the limit: Somehow it works (!) and we have a process defined at every \(t\).

Pointwise Observations from a Brownian Motion

Suppose we observe this process \(X_t\) at time points \((t_1,t_2,\ldots,t_N)\). Then the distribution of the following vector is a multivariate normal:

\[ \begin{bmatrix} X_1\\ \vdots \\ X_N \end{bmatrix}\sim N\left(\mathbf{0},\boldsymbol{\Sigma}\right),\; \boldsymbol{\Sigma} = \begin{bmatrix} t_1 & t_1 & \cdots & t_1 \\ t_1 & t_2 & \cdots & t_2 \\ \vdots & \vdots & \ddots & \vdots \\ t_1 & t_2 & \cdots & t_N \end{bmatrix} \]

The covariance structure comes from solving the equation:

\[ h = \mathbb{V}(X_{t+h} - X_t) = \underbrace{\mathbb{E}[X_{t+h}^2]}_{=t+h} + \underbrace{\mathbb{E}[X_t^2]}_{=t} - 2\mathbb{C}(X_{t+h},X_t). \]

Visualizing the Brownian Motion

What if I want different correlation structures?

For longitudinal data,

We are not interested in an indefinitely long time span.
Within the time window, the data can exhibit stable behavior.

Hypothesis 1 : The marginal variability of \(X_t\) should stay the same.
Hypothesis 2 : Time correlation should decay in, and only depend on, the amount of time elapsed.
Hypothesis 3 : We have expectations about the span of correlation and smoothness of the process.

These are all a priori hypotheses. The data may come from something very different!

Stationary Gaussian processes

A different process results from a stationary kernel:

\[ \mathbb{C}(X_t, X_s) = C(|t-s|). \]

We want \(C(0) = \sigma^2 > 0\) and \(C\to 0\) as \(h = |t-s|\to\infty\). Some common choices:

Exponential kernel: \(C(h) = \sigma^2\exp(-h/\rho)\)
Square exponential kernel: \(C(h) = \sigma^2\exp\{-h^2/(2\rho^2)\}\)
Matérn kernels

Visualizing the process

Interpreting Hyperparameters and Kernels

Often, we fix the overall kernel function while learning its “bandwidth” \(\rho\) due to poor identifiability from the data.

Default Prior Choices

Stan team often recommends \(\rho^{-1} \sim Gamma(5,5)\)
A priori the prior is concentrated around 1. For a prior mean, distance increase of 1 corrresponds to a multiplicative decay of correlation by \(e^{-1}\sim 37\%\).
Beware of the units! A year correlation is very different from that over a day.

Gaussian Process Covariance Functions in Stan

A list of available kernels is available here.

The squared exponential kernel (exponentiated quadratic kernel) is given by:

\[C(\mathbf{x}_i,\mathbf{x}_j) = \sigma^2 \exp\left(-\frac{|\mathbf{x}_i - \mathbf{x}_j|^2}{2l^2}\right)\]

matrix gp_exp_quad_cov(array[] real x, real sigma, real length_scale)

For us, \(\mathbf{x}\) is 1D (time)…but it does not have to be!

Prior Sampling in Stan

Once we have a GP model, sampling from the prior is equivalent to sampling from a multivariate normal.

data {
  int<lower=1> N;
  array[N] real x;
  real<lower=0> sigma;
  real<lower=0> l;
}
transformed data {
  // 0. A "small nugget" to stabilize matrix root computation
  // This is for.better numerical stability in taking large matrix roots
  real delta = 1e-9;

  // 1. Compute the squared exponential kernel matrix
  // x is the time variable
  vector[N] mu = rep_vector(0, N);
  matrix[N, N] R_C;
  matrix[N, N] C = gp_exp_quad_cov(x, sigma, l);
  for (i in 1:N) {
    C[i, i] = C[i, i] + delta;
  }

  // 2. Compute the root of C by Cholesky decomposition
  R_C = cholesky_decompose(C);
}
generated quantities {
  // 3. Sample from the prior: multivariate_normal(0, C)
  f ~ multi_normal_cholesky(mu, R_C)
}

Posterior inference

For each \(i\)-th eye: time effect vector is given an independent prior based on the GP model.

\[\boldsymbol{\eta}_i = (\eta_{i1},\ldots,\eta_{in_i})^\top \stackrel{ind}{\sim} N_{n_i}(\mathbf{0},\mathbf{C}_i)\]

Each \(\eta_{it}\) is a pointwise value of the process \(\eta_i(t) = \eta_{it}\).

\[\mathbf{C}_i = \begin{bmatrix} C(0) & C(|t_{i1}-t_{i2}|) & \cdots & C(|t_{i1} - t_{in_i}|)\\ C(|t_{i1} - t_{i2}|) & C(0) & \cdots & C(|t_{i,2} - t_{in_i}|)\\ \vdots & \vdots & \ddots & \vdots \end{bmatrix}\]

Remember: Different eyes need not have the same number of encounters, nor need they have measurements at the same time.

The full model

For \(i\) (\(i = 1,\ldots,n\)) and \(t\) (\(t = 1,\ldots,n_i\)),

\[\begin{align*} Y_{it} &= \mathbf{x}_{i}\boldsymbol{\beta} + \eta_{i}(t) + \epsilon_{it}, \quad \epsilon_{it} \stackrel{iid}{\sim} N(0,\sigma^2),\\ \boldsymbol{\eta}_i &\stackrel{ind}{\sim} N_{n_i}(\mathbf{0},\mathbf{C}_i),\\ \boldsymbol{\Omega} &\sim f(\boldsymbol{\Omega}). \end{align*}\]

\(\boldsymbol{\Omega}\) represents the population parameters: \(\boldsymbol{\beta}\), \(\sigma^2\), and any kernel parameters determining \(\mathbf{C}_i\)
For concreteness: we use a squared exponential kernel \(C\) and model \(\mathbb{C}(\eta_i(t),\eta_i(t')) = \alpha^2\exp(-|t-t'|^2/2\rho^2).\) Thus \(\boldsymbol{\Omega} = (\boldsymbol{\beta},\sigma^2,\alpha,\rho)\).

First Look at the Data

head(dataset)

  pat_id eye_id mean_deviation      time      age      iop
1      1      1          -7.69 0.0000000 51.55616 10.87303
2      1      1          -9.95 0.5753425 51.55616 10.87303
3      1      1          -9.58 1.0547945 51.55616 10.87303
4      1      1          -9.53 1.5726027 51.55616 10.87303
5      1      1          -9.18 2.0136986 51.55616 10.87303
6      1      1          -9.63 2.5671233 51.55616 10.87303

Model Specification

We want to fit a model estimating the effects of age and iop on mean deviation, adjusting for nonlinear time effects:

\[\begin{align*} Y_{it} &= \underbrace{\beta_0 + \text{age}_i\beta_1 + \text{iop}_i\beta_2}_{=\mathbf{x}_{i}\boldsymbol{\beta}} + \eta_{it} + \epsilon_{it}\\ \boldsymbol{\eta}_i &= \begin{bmatrix} \eta_{i1}\\ \vdots\\ \eta_{in_i} \end{bmatrix} \stackrel{iid}{\sim} N(0,\mathbf{C}_i)\\ \epsilon_{it} &\stackrel{iid}{\sim} N(0,\sigma^2) \end{align*}\]

Our model for \(\eta_{it}\) is nonlinear in time: time enters the covariance structure of \(\mathbf{C}_i\) producing a function \(\eta_i(t) = \eta_{it}\).

Preparing the Data

A few data processing steps are needed to handle the data structure with more ease.

\(N\) will denote the number of all observations: \(N = \sum_{i=1}^{n} n_i\).
\(n\) will denote the total number of eyes.
\(p\) will denote the number of predictors fixed across time (\(p=3\): intercept, slopes for age and iop).
A separate vector of the number of observations (\(n_i\)) for each eye will be stored as s.

Preparing the Data

library(dplyr)

# Group predictors fixed across time
fixed_df <- dataset %>% 
  group_by(eye_id) %>% 
  reframe(age = unique(age), iop = unique(iop))
Xmat <- model.matrix(~age + iop, data = fixed_df)

# Number of measurements for each eye
groupsizes <- dataset %>% 
  group_by(eye_id) %>% 
  summarise(n = n()) %>% 
  pull(n)

stan_data <- list(
  N = dim(dataset)[1],
  n = max(dataset$eye_id),
  p = dim(Xmat)[2],
  t = dataset$time,
  Y = dataset$mean_deviation,
  s = groupsizes,
  X = Xmat
)

Conditional GP Model

data {
  int<lower=0> N;       // total number of observations
  int<lower=1> n;       // number of eyes
  int<lower=1> p;       // fixed effects dimension
  vector[N] Y;          // observation
  matrix[n, p] X;        // fixed effects predictors
  array[N] real t;      // obs time points
  array[n] int s;       // sizes of within-pt obs
}
transformed data {
  real delta = 1e-9;
}
parameters {
  // Fixed effects model
  vector[p] beta;
  real<lower=0> sigma;
  // GP parameters
  vector[N] z;
  real<lower=0> alpha;
  real<lower=0> rho;
}
transformed parameters {
  vector[n] mu = X * beta;
}
model {
  beta ~ normal(0,3);
  z ~ std_normal();
  alpha ~ std_normal();
  sigma ~ std_normal();
  rho ~ inv_gamma(5,5);

  vector[N] mu_rep;
  vector[N] eta;
  
  int pos;
  pos = 1;
  // Ragged loop computing the mean for each time obs
  for (i in 1:n) {
    // GP covariance for the k-th eye
    int n_i = s[i];
    int pos_end = pos + n_i - 1;
    matrix[n_i, n_i] R_C;
    matrix[n_i, n_i] C = gp_exp_quad_cov(segment(t, pos, n_i), alpha, rho);
    for (j in 1:n_i) {
      // adding a small term to the diagonal entries
      C[j, j] = C[j, j] + delta;
    }
    R_C = cholesky_decompose(C);
    
    // Mean of data at each time
    mu_rep[pos:pos_end] = rep_vector(mu[i], n_i);
    // GP for the i-th eye
    eta[pos:pos_end] = R_C * segment(z, pos, n_i);
    pos = pos_end + 1;
  }
  // Normal observation model centered at mu_rep + eta
  Y ~ normal(mu_rep + eta, sigma);
}

Assessing Convergence

traceplot(gp_condl_fit, pars = c("beta", "sigma", "alpha", "rho"))

Assessing Convergence

bayesplot::mcmc_acf(gp_condl_fit, regex_pars = c("beta", "sigma", "alpha", "rho"))

Marginal Model Specification

The conditional model is pretty slow (can be improved) and results in poor mixing (can be run longer). Marginalizing the random effect can stabilize the computation.

In vector notation, our observation model for the \(i\)-th eye is

\[ \mathbf{Y}_i = \begin{bmatrix} Y_{i1}\\ \vdots \\ Y_{in_i} \end{bmatrix} = \mathbf{x}_i\boldsymbol{\beta} + \boldsymbol{\eta}_i + \boldsymbol{\epsilon}_i \] By marginalizing out \(\boldsymbol{\eta}_i\), we obtain

\[ \mathbf{Y}_i \sim N_{n_i}(\mathbf{x}_i\boldsymbol{\beta},\mathbf{C}_i + \sigma^2\mathbf{I}_{n_i}) \]

Fitting a Marginal Model

Only the model segment has to be changed.

model {
  beta ~ normal(0,3);
  alpha ~ std_normal();
  sigma ~ std_normal();
  rho ~ inv_gamma(5,5);

  int pos;
  pos = 1;
  // Ragged loop computing joint likelihood for each eye
  for (i in 1:n) {
    // GP covariance for the k-th eye
    int n_i = s[i];
    vector[n_i] mu_rep;
    
    matrix[n_i, n_i] R_C;
    matrix[n_i, n_i] C = gp_exp_quad_cov(segment(t, pos, n_i), alpha, rho);
    for (j in 1:n_i) {
      // Add noise variance to the diagonal entries
      C[j, j] = C[j, j] + square(sigma);
    }
    R_C = cholesky_decompose(C);
    // Marginal model for the i-th eye
    mu_rep = rep_vector(mu[i], n_i);
    target += multi_normal_cholesky_lpdf(to_vector(segment(Y, pos, n_i)) | mu_rep, R_C);
    pos = pos + n_i;
  }
}

Assessing Convergence

traceplot(gp_marginal_fit, pars = c("beta", "sigma", "alpha", "rho"))

Assessing Convergence

bayesplot::mcmc_acf(gp_marginal_fit, regex_pars = c("beta", "sigma", "alpha", "rho"))

Estimates of eye-level Predictors

Learned GP hyperparameter

The posterior mean of \(\rho\) (in year) is very large: the pattern of mean deviation change is very gradual over time.
A posteriori we believe mean deviations of an eye, 10 years apart and adjusted for age and iop, are still strongly correlated (\(e^{-5/6}\sim 43\%\)).

Random effects covariance structure for the first 5 years

Predictive inference

The model is continuous in nature and defined, in principle, for all times. This is very convenient for visualiztion, interpolation, and forecasting.

\[ \mathbf{Y}_i = \underbrace{\mathbf{f}_{i}}_{ = \mathbf{x}_{i}\boldsymbol{\beta} + \boldsymbol{\eta}_{i} } + \boldsymbol{\epsilon}_i \]

The vector \(\mathbf{f}_{i}\) may be interpreted as the denoised latent process of an eye-specific mean deviation over time: \((f_{i1},\ldots,f_{in_i})^\top\).

\[ \mathbb{E}[f_{it}] = \mathbf{x}_i\boldsymbol{\beta},\; \mathbb{C}(f_{it}, f_{it'}) = \mathbb{C}(\eta_{it},\eta_{it'}) \]

Predictive Inference using Normal Conditioning

Say \(\mathbf{f}^{pred}_i\) is the values of \(f_i\) at \(m_i\) new time points which we want to predict, based on our GP model:

\[\begin{align*} \mathbf{f}^{pred}_i &\sim N_{m_i}(\mathbf{x}_i\boldsymbol{\beta},\mathbf{C}^{pred}_i)\\ \mathbf{Y}_i = \mathbf{f}_i + \boldsymbol{\epsilon}_i &\sim N_{n_1}(\mathbf{x}_i\boldsymbol{\beta},\mathbf{C}_i + \sigma^2\mathbf{I}_{n_i}) \end{align*}\]

An analytical formula exists for the conditional distribution of \(\mathbf{f}^{pred}\) given observed \(Y_{it}\) at \(n_i\) time points.

\[ \mathbf{f}^{pred}_i | \mathbf{Y}_i \sim N\left(\mathbf{x}_i\boldsymbol{\beta} + \mathbf{k}_i(\mathbf{C}_i + \sigma^2\mathbf{I})^{-1}(\mathbf{Y}_i - \mathbf{x}_i\boldsymbol{\beta}), \mathbf{C}_i^{pred}-\mathbf{k}_i(\mathbf{C}_i + \sigma^2\mathbf{I})^{-1}\mathbf{k}_i^\top\right) \] where \(\mathbf{k}_i = \mathbb{C}(\mathbf{f}^{pred}_i, \mathbf{f}_i)\) (“cross-covariances”).

This stems from a more general fact about conditional distributions of jointly normal vectors: e.g., \(\mathbf{f}_i^{pred}\) and \(\mathbf{f}_i\) need not have the same mean.

New Data for Predictive Inference

Suppose we are interested in understanding the mean deviation trend for the first 5 years from baseline (\(t^{pred}\in [0,5]\)).

stan_data <- list(
  N = dim(dataset)[1],
  n = max(dataset$eye_id),
  p = dim(Xmat)[2],
  Y = dataset$mean_deviation,
  t = dataset$time,
  s = groupsizes,
  X = Xmat,
  # Time window for which we want predictive
  t_pred = seq(0, 5, by = 0.25),
  # Total length of this window
  N_pred = 21
)

Stan Implementation

See the Stan Help page for details.

functions {
  // Analytical formula for latent GP conditional on Gaussian observations
  vector gp_pred_rng(array[] real x_pred,
                     vector Y,
                     array[] real x,
                     real mu,
                     real alpha,
                     real rho,
                     real sigma,
                     real delta) {
    int N1 = rows(Y);
    int N2 = size(x_pred);
    vector[N2] f_pred;
    {
      matrix[N1, N1] L_Sigma;
      vector[N1] Sigma_div_y;
      matrix[N1, N2] C_x_xpred;
      matrix[N1, N2] v_pred;
      vector[N2] fpred_mu;
      matrix[N2, N2] cov_fpred;
      matrix[N2, N2] diag_delta;
      matrix[N1, N1] Sigma;
      Sigma = gp_exp_quad_cov(x, alpha, rho);
      for (n in 1:N1) {
        Sigma[n, n] = Sigma[n, n] + square(sigma);
      }
      L_Sigma = cholesky_decompose(Sigma);
      Sigma_div_y = mdivide_left_tri_low(L_Sigma, Y - mu);
      Sigma_div_y = mdivide_right_tri_low(Sigma_div_y', L_Sigma)';
      C_x_xpred = gp_exp_quad_cov(x, x_pred, alpha, rho);
      fpred_mu = (C_x_xpred' * Sigma_div_y);
      v_pred = mdivide_left_tri_low(L_Sigma, C_x_xpred);
      cov_fpred = gp_exp_quad_cov(x_pred, alpha, rho) - v_pred' * v_pred;
      diag_delta = diag_matrix(rep_vector(delta, N2));

      f_pred = multi_normal_rng(fpred_mu, cov_fpred + diag_delta);
    }
    return f_pred;
  }
}
//...
generated quantities {
  matrix[n,Np] f_pred;
  matrix[Np,Np] Cp;
  Cp = gp_exp_quad_cov(t_pred, alpha, rho);
  // Posterior predictive on fixed time grid for all eyes
  int pos;
  pos = 1;
  for (i in 1:n) {
    int n_i = s[i];
    f_pred[i,] = mu[i] + 
      gp_pred_rng(
          t_pred,
          segment(Y, pos, n_i),
          segment(t, pos, n_i),
          mu[i], 
          alpha,
          rho,
          sigma,
          delta
        )';
    pos = pos + n_i;
  }
}

Visualizing latent effects using predictive formulae

Stan speed concerns

GP is notorious for not being scalable. Tl;dr is the need for matrix root / inverse computation.
For datasets covered in this class, Stan works well. For research purposes, worth exploring other specialized toolkits.

Code MCMC yourself to make it faster (e.g., using Rcpp)
Do a bit of software research on Wikipedia
Scalable Approximations (some of them will be coming soon!)

A summary

GP is a flexible, high-dimensional model for handling correlated measurements over time.
With some basic knowledge about conditioning Gaussian random variables, we can implement posterior computation and prediction / interpolation.
Programming in Stan is straightforward but can be expensive with large number of observations.

Prepare for next class

Work on HW 04 which is due March 25
Complete reading to prepare for next Thursday’s lecture
Thursday’s lecture: Using GP to model spatial data