Scalable Gaussian Processes #1

Christine Shen

Apr 01, 2025

Review of the last lecture

TBU

Motivating dataset

Recall we worked with a dataset on women aged 15-49 sampled from the 2013-14 Democratic Republic of Congo (DRC) Demographic and Health Survey. Variables are:

  • loc_id: location id (i.e. survey cluster).

  • hemoglobin: hemoglobin level (g/dL).

  • anemia: anemia classifications.

  • age: age in years.

  • urban: urban vs. rural.

  • LATNUM: latitude.

  • LONGNUM: longitude.

Motivating dataset

  loc_id hemoglobin     anemia age urban   LATNUM  LONGNUM
1      1       12.5 not anemic  28 rural 0.220128 21.79508
2      1       12.6 not anemic  42 rural 0.220128 21.79508
3      1       13.3 not anemic  15 rural 0.220128 21.79508
4      1       12.9 not anemic  28 rural 0.220128 21.79508
5      1       10.4       mild  32 rural 0.220128 21.79508
6      1       12.2 not anemic  42 rural 0.220128 21.79508

Modeling goals:

  • Learn the associations of age and urbanicity with hemoglobin, accounting for unmeasured spatial confounders.

  • Create a predicted map of hemoglobin across the spatial surface, controlling for age and urbanicity, with uncertainty quantification.

Map of Sud-Kivu state

Last time, we focused on one state with ~500 observations at ~30 locations.

Map of the DRC

Today we will extend the analysis to the full dataset with ~8,600 observations at ~500 locations.

Modeling

$$Y_j(u_i) = \alpha + x_j(u_i)\beta + \theta(u_i) + \epsilon_j(u_i), \qquad \epsilon_j(u_i) \overset{\text{iid}}{\sim} N(0, \sigma^2)$$

Data objects:

  • i∈{1,…,n} indexes unique locations.

  • j∈{1,…,ni} indexes individuals at each location.

  • Yj(ui) denotes the observation of individual j at location ui.

  • xj(ui)=(ageij/10,urbani)∈R1×p, where p=2 is the number of predictors (excluding intercept).

Modeling

$$Y_j(u_i) = \alpha + x_j(u_i)\beta + \theta(u_i) + \epsilon_j(u_i), \qquad \epsilon_j(u_i) \overset{\text{iid}}{\sim} N(0, \sigma^2)$$

Population parameters:

  • α∈R is the intercept.

  • β∈Rp is the vector of regression coefficients.

  • σ²∈R+ is the residual error variance (nugget).

Location-specific parameters:

  • ui=(latitudei,longitudei)∈R2 denotes coordinates of location i.

  • θ(ui) denotes the spatial intercept at location ui.

Location-specific notation

$$Y(u_i) = \alpha 1_{n_i} + X(u_i)\beta + \theta(u_i) 1_{n_i} + \epsilon(u_i), \qquad \epsilon(u_i) \sim N_{n_i}(0, \sigma^2 I)$$

  • Y(ui)=(Y1(ui),…,Yni(ui))⊤.

  • X(ui) is an ni×p dimensional matrix with rows xj(ui).

  • ϵ(ui)=(ϵ1(ui),…,ϵni(ui))⊤.

Full data notation

$$Y = \alpha 1_N + X\beta + Z\theta + \epsilon, \qquad \epsilon \sim N_N(0, \sigma^2 I)$$

  • Y=(Y(u1)⊤,…,Y(un)⊤)⊤∈RN, with N=∑i=1nni.

  • X∈RN×p stacks X(ui).

  • θ=(θ(u1),…,θ(un))⊤∈Rn.

  • Z is an N×n dimensional block diagonal binary matrix. Each row contains a single 1, in the column $i$ that corresponds to the location of $Y_j(u_i)$:
$$Z = \begin{bmatrix} 1_{n_1} & 0 & \dots & 0 \\ 0 & 1_{n_2} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & \dots & 0 & 1_{n_n} \end{bmatrix}.$$

Modeling

We specify the following model:
$$Y = \alpha 1_N + X\beta + Z\theta + \epsilon, \qquad \epsilon \sim N_N(0, \sigma^2 I),$$
with priors

  • $\theta(u) \mid \tau, \rho \sim GP(0, C(\cdot,\cdot))$, where $C$ is the Matérn 3/2 covariance function with magnitude $\tau$ and length scale $\rho$.
  • $\alpha^* \sim N(0, 4^2)$. This is the intercept after centering $X$.
  • $\beta_j \mid \sigma_\beta \sim N(0, \sigma_\beta^2)$, $j \in \{1, \dots, p\}$.
  • $\sigma \sim \text{Half-Normal}(0, 2^2)$.
  • $\tau \sim \text{Half-Normal}(0, 4^2)$.
  • $\rho \sim \text{Inv-Gamma}(5, 5)$.
  • $\sigma_\beta \sim \text{Half-Normal}(0, 2^2)$.

Computational issues with GP

Effectively, the prior for $\theta$ is $\theta \mid \tau, \rho \sim N_n(0, C)$, with $C \in \mathbb{R}^{n \times n}$. Matérn 3/2 is an isotropic covariance function: $C(u_i, u_j) = C(\|u_i - u_j\|)$.

$$C = \begin{bmatrix} C(0) & C(\|u_1 - u_2\|) & \cdots & C(\|u_1 - u_n\|) \\ C(\|u_1 - u_2\|) & C(0) & \cdots & C(\|u_2 - u_n\|) \\ \vdots & \vdots & \ddots & \vdots \\ C(\|u_1 - u_n\|) & C(\|u_2 - u_n\|) & \cdots & C(0) \end{bmatrix}.$$

This is not scalable because we need to invert an n×n dense covariance matrix at each MCMC iteration, which requires O(n³) floating point operations (flops) and O(n²) memory.
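For reference, the Matérn 3/2 covariance function used here can be written as follows (assuming, as is standard, that the magnitude $\tau$ denotes the marginal standard deviation):

$$C(r) = \tau^2 \left(1 + \frac{\sqrt{3}\, r}{\rho}\right) \exp\!\left(-\frac{\sqrt{3}\, r}{\rho}\right), \qquad r = \|u_i - u_j\|.$$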

Scalable GP methods overview

The computational issues motivated exploration in scalable GP methods. Existing scalable methods broadly fall under two categories.

  1. Sparsity methods
  • sparsity in C, e.g., covariance tapering (@furrer2006covariance).
  • sparsity in C⁻¹, e.g., Vecchia approximation (@vecchia1988estimation) and nearest-neighbor GP (@datta2016hierarchical).
  2. Low-rank methods
  • approximate C on a low-dimensional subspace.
  • e.g., process convolution (@higdon2002space) and inducing point methods (@snelson2005sparse).

Hilbert space method for GP

  • @solin2020hilbert introduced a Hilbert space method for reduced-rank Gaussian process regression (HSGP).

  • @riutort2023practical discussed how to practically implement HSGP.

  • Tutorial code is available in several probabilistic programming languages:

    • stan
    • NumPyro
    • pyMC

Lecture plan

Today:

  • How does HSGP work
  • Why HSGP is scalable
  • How to use HSGP for Bayesian geospatial model fitting and posterior predictive sampling

Thursday:

  • Parameter tuning for HSGP
  • How to implement HSGP in stan

HSGP approximation

Given:

  • an isotropic covariance function C which admits a power spectral density, e.g., the Matérn family, and
  • a compact domain Θ⊂Rd with smooth boundaries. For our purposes, we only consider boxes, e.g., [−1,1]×[−1,1].

HSGP approximates the $(i,j)$ element of the corresponding $n \times n$ covariance matrix $C$ as
$$C_{ij} = C(\|u_i - u_j\|) \approx \sum_{k=1}^{m} s_k\, \phi_k(u_i)\, \phi_k(u_j).$$

HSGP approximation

$$C_{ij} = C(\|u_i - u_j\|) \approx \sum_{k=1}^{m} s_k\, \phi_k(u_i)\, \phi_k(u_j).$$

  • $s_k \in \mathbb{R}_+$ are positive scalars which depend on the covariance function $C$ and its parameters $\tau$ and $\rho$.
  • $\phi_k : \Theta \to \mathbb{R}$ are basis functions which depend only on $\Theta$ (an explicit one-dimensional construction is sketched below).
  • $m$ is the number of basis functions. Note: even with an infinite sum ($m \to \infty$), this remains an approximation (see @solin2020hilbert).
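For intuition, here is a sketch of the construction in @solin2020hilbert for one dimension on $\Theta = [-L, L]$: the basis functions are eigenfunctions of the Laplacian with Dirichlet boundary conditions, and the weights come from the spectral density $S(\cdot)$ of the covariance function (in $d > 1$, basis functions are tensor products of one-dimensional ones and the eigenvalues add):

$$\phi_k(x) = \sqrt{\tfrac{1}{L}} \sin\!\left(\frac{\pi k (x + L)}{2L}\right), \qquad \lambda_k = \left(\frac{\pi k}{2L}\right)^2, \qquad s_k = S\!\left(\sqrt{\lambda_k}\right).$$

For the one-dimensional Matérn 3/2 covariance with magnitude $\tau$ and length scale $\rho$, the spectral density is $S(\omega) = \tau^2\, 4\lambda^3 / (\lambda^2 + \omega^2)^2$ with $\lambda = \sqrt{3}/\rho$; see @solin2020hilbert for the general Matérn family.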

HSGP approximation

In matrix notation,

C≈ΦSΦ⊤.

  • Φ∈Rn×m is a feature matrix. Only depends on Θ and the observed locations.
  • S∈Rm×m is diagonal. Depends on the covariance function C and parameters τ and ρ.

$$\Phi = \begin{bmatrix} \phi_1(u_1) & \dots & \phi_m(u_1) \\ \vdots & \ddots & \vdots \\ \phi_1(u_n) & \dots & \phi_m(u_n) \end{bmatrix}, \qquad S = \begin{bmatrix} s_1 & & \\ & \ddots & \\ & & s_m \end{bmatrix}.$$

Why HSGP is scalable

C≈ΦSΦ⊤.

  • Φ only depends on Θ and the observed locations, so it can be pre-calculated.
  • No matrix inversion is required.
  • Each MCMC iteration requires O(nm + m) flops, vs O(n³) for a full GP (a rough flop count is given below).
  • Ideally m ≪ n, but HSGP can be faster even for m > n.
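As a rough illustration with today's dataset ($n \approx 500$ locations): one dense inversion of $C$ is on the order of $500^3 \approx 1.25 \times 10^8$ flops, whereas one HSGP update with, say, $m = 400$ basis functions (a hypothetical choice) is on the order of $nm = 2 \times 10^5$ flops.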

Model reparameterization

Under HSGP, approximately
$$\theta \overset{d}{=} \Phi S^{1/2} b, \qquad b \sim N_m(0, I).$$

Therefore we can reparameterize the model as

$$Y = \alpha 1_N + X\beta + Z\theta + \epsilon \approx \alpha 1_N + X\beta + \underbrace{Z\Phi S^{1/2}}_{W}\, b + \epsilon$$

Note the resemblance to linear regression:

  • $W = Z\Phi S^{1/2} \in \mathbb{R}^{N \times m}$ is a known design matrix given parameters $\tau$ and $\rho$.
  • b is an unknown parameter vector with prior Nm(0,I).

HSGP in stan

We can fit the reparameterized model directly in stan.

This is called the non-centered parameterization in the stan documentation. It is recommended for computational efficiency in hierarchical models.

transformed data {
  matrix[n,m] PHI;   // feature matrix Phi, pre-computed from the basis functions
  matrix[N,n] Z;     // block diagonal binary matrix mapping observations to locations
}
parameters {
  real alpha;
  real<lower=0> sigma;
  vector[p] beta;
  vector[m] b;
  vector<lower=0>[m] sqrt_S;  // diagonal of S^{1/2}; in practice computed from tau and rho (see note below)
}
model {
  vector[n] theta = PHI * (sqrt_S .* b);
  target += normal_lupdf(y | alpha + X * beta + Z * theta, sigma);
  target += normal_lupdf(b | 0, 1);
  ...
}
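Note that in this sketch sqrt_S is declared as a parameter for simplicity; in a complete program it would be computed from τ and ρ. A minimal sketch of that step, assuming sqrt_lambda (the square roots of the basis-function eigenvalues) is pre-computed in transformed data, tau and rho are declared as parameters, and spd_matern32() is a user-written helper (hypothetical, to be defined in a functions block) returning the Matérn 3/2 spectral density:

transformed parameters {
  vector[m] sqrt_S;
  for (k in 1:m) {
    // spectral density of the Matern 3/2 kernel at frequency sqrt_lambda[k];
    // spd_matern32() is a hypothetical user-defined function
    sqrt_S[k] = sqrt(spd_matern32(sqrt_lambda[k], tau, rho));
  }
}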

Posterior predictive distribution

We want to make predictions for Y∗=(Y(un+1),…,Y(un+q))⊤, observations at q new locations. Define θ∗=(θ(un+1),…,θ(un+q))⊤, Ω=(α,β,σ,τ,ρ). Recall:

$$f(Y^* \mid Y) = \int f(Y^*, \theta^*, \theta, \Omega \mid Y)\, d\theta^*\, d\theta\, d\Omega = \int \underbrace{f(Y^* \mid \theta^*, \Omega)}_{(1)}\; \underbrace{f(\theta^* \mid \theta, \Omega)}_{(2)}\; \underbrace{f(\theta, \Omega \mid Y)}_{(3)}\, d\theta^*\, d\theta\, d\Omega$$

  1. Likelihood: f(Y∗|θ∗,Ω) – remains the same as for GP

  2. Kriging: f(θ∗|θ,Ω) – we will focus on this next

  3. Posterior distribution: f(θ,Ω|Y) – we have just discussed

Kriging

Recall under the GP prior,

$$\begin{bmatrix} \theta \\ \theta^* \end{bmatrix} \Bigg|\, \Omega \sim N_{n+q}\left( \begin{bmatrix} 0_n \\ 0_q \end{bmatrix}, \begin{bmatrix} C & C_+ \\ C_+^\top & C^* \end{bmatrix} \right),$$

where C is the covariance of θ, C* is the covariance of θ*, and C+ is the cross covariance matrix between θ and θ*.

Therefore, by properties of the multivariate normal,
$$\theta^* \mid (\theta, \Omega) \sim N_q(E_{\theta^*}, V_{\theta^*}), \quad \text{where } E_{\theta^*} = C_+^\top C^{-1}\theta, \quad V_{\theta^*} = C^* - C_+^\top C^{-1} C_+.$$

Kriging under HSGP

Under HSGP, $C^* \approx \Phi^* S \Phi^{*\top}$ and $C_+ \approx \Phi S \Phi^{*\top}$, where
$$\Phi^* = \begin{bmatrix} \phi_1(u_{n+1}) & \dots & \phi_m(u_{n+1}) \\ \vdots & \ddots & \vdots \\ \phi_1(u_{n+q}) & \dots & \phi_m(u_{n+q}) \end{bmatrix} \in \mathbb{R}^{q \times m}$$
is the feature matrix for the new locations. Therefore, approximately,
$$\begin{bmatrix} \theta \\ \theta^* \end{bmatrix} \sim N_{n+q}\left( \begin{bmatrix} 0_n \\ 0_q \end{bmatrix}, \begin{bmatrix} \Phi S \Phi^\top & \Phi S \Phi^{*\top} \\ \Phi^* S \Phi^\top & \Phi^* S \Phi^{*\top} \end{bmatrix} \right).$$
Claim: $\theta^* \mid (\theta, \Omega) = (\Phi^* S \Phi^\top)(\Phi S \Phi^\top)^\dagger\, \theta$, where $A^\dagger$ denotes a generalized inverse of matrix $A$, i.e., $A A^\dagger A = A$.

Kriging under HSGP

Claim: $\theta^* \mid (\theta, \Omega) = (\Phi^* S \Phi^\top)(\Phi S \Phi^\top)^\dagger\, \theta$.

Sketch proof below, see details in class:

  1. By properties of the multivariate normal, $\theta^* \mid (\theta, \Omega) \sim N_q(E^{HS}_{\theta^*}, V^{HS}_{\theta^*})$, with
$$E^{HS}_{\theta^*} = (\Phi^* S \Phi^\top)(\Phi S \Phi^\top)^\dagger\, \theta, \qquad V^{HS}_{\theta^*} = (\Phi^* S \Phi^{*\top}) - (\Phi^* S \Phi^\top)(\Phi S \Phi^\top)^\dagger (\Phi S \Phi^{*\top}).$$
  2. Show that if $\Phi$ has full column rank, which is true under HSGP, then for any matrix $A$ of proper dimension,
$$S\Phi^\top(\Phi S \Phi^\top)^\dagger \Phi A = A. \tag{1}$$
Therefore $V^{HS}_{\theta^*} \equiv 0$ (a verification sketch of (1) is given below).
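One way to verify (1), as a sketch under the assumption that $(\Phi S\Phi^\top)^\dagger$ is the Moore-Penrose pseudoinverse and $S$ is positive definite: write the thin SVD $\Phi S^{1/2} = U\Sigma V^\top$, with $U \in \mathbb{R}^{n \times m}$ having orthonormal columns, $\Sigma \in \mathbb{R}^{m \times m}$ diagonal and invertible (since $\Phi$ has full column rank), and $V \in \mathbb{R}^{m \times m}$ orthogonal. Then $\Phi S\Phi^\top = U\Sigma^2 U^\top$, $(\Phi S\Phi^\top)^\dagger = U\Sigma^{-2}U^\top$, and

$$S\Phi^\top(\Phi S\Phi^\top)^\dagger \Phi = \left(S^{1/2} V \Sigma U^\top\right)\left(U\Sigma^{-2}U^\top\right)\left(U\Sigma V^\top S^{-1/2}\right) = S^{1/2} V V^\top S^{-1/2} = I.$$

Multiplying on the right by any conformable $A$ gives (1).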

Kriging under HSGP

Under the reparameterized model, $\theta = \Phi S^{1/2} b$ for $b \sim N_m(0, I)$. Therefore,
$$\theta^* \mid (\theta, \Omega) = (\Phi^* S \Phi^\top)(\Phi S \Phi^\top)^\dagger\, \theta = (\Phi^* S \Phi^\top)(\Phi S \Phi^\top)^\dagger\, (\Phi S^{1/2} b) = \Phi^* S^{1/2} b \quad \text{(by equation (1) on the last slide)}.$$

During MCMC sampling, we can obtain posterior predictive samples for $\theta^*$ through posterior samples of $b$ and $S$. Let superscript $(s)$ denote the $s$th posterior sample:

$$\theta^{*(s)} = \Phi^* \left(S^{(s)}\right)^{1/2} b^{(s)}.$$

Kriging under HSGP – alternative view

Under the reparameterized model, there is another (perhaps more intuitive) way to recognize the kriging distribution under HSGP.

We model $\theta = \Phi S^{1/2} b$, where $b$ is treated as the unknown parameter. Therefore, for kriging:
$$\theta^* \mid (\theta, \Omega) = \Phi^* S^{1/2} b \mid (b, \Omega) = \Phi^* S^{1/2} b.$$

HSGP kriging in stan

Kriging under HSGP can be easily implemented in stan.

transformed data {
  matrix[q,m] PHI_new;   // feature matrix at the q prediction locations, pre-computed
}
parameters {
  vector[m] b;
  vector<lower=0>[m] sqrt_S;
}
generated quantities {
  vector[q] theta_new = PHI_new * (sqrt_S .* b);   // kriging draws of theta*
}
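To go from θ* draws to draws of Y* itself (the likelihood step (1) in the posterior predictive decomposition), we would also add the fixed effects and residual noise. A minimal sketch, assuming a hypothetical matrix X_new holding the (centered) covariates at the prediction locations, with alpha, beta, and sigma from the fitted model:

generated quantities {
  vector[q] theta_new = PHI_new * (sqrt_S .* b);   // as above
  vector[q] y_new;                                 // posterior predictive draws of Y*
  for (i in 1:q) {
    // X_new is a hypothetical q x p matrix of covariates at the new locations
    y_new[i] = normal_rng(alpha + X_new[i] * beta + theta_new[i], sigma);
  }
}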

Recap

HSGP is a low rank approximation method for GP.

$$C_{ij} \approx \sum_{k=1}^{m} s_k\, \phi_k(u_i)\, \phi_k(u_j), \qquad C \approx \Phi S \Phi^\top,$$

  • for a covariance function C which admits a power spectral density,
  • on a box Θ⊂Rd,
  • with m basis functions.

We have talked about:

  • why HSGP is scalable.
  • how to do posterior sampling and posterior predictive sampling in stan.

HSGP parameters

@solin2020hilbert showed that the HSGP approximation can be made arbitrarily accurate as Θ and m increase.

But how to choose:

  1. size of the box Θ.

  2. number of basis functions m.

HSGP approximation box

Due to the design of HSGP, the approximation is less accurate near the boundaries of Θ.

  • Suppose all the coordinates are centered. Let $S_l = \max_i |u_{il}|$ for $l = 1, \dots, d$ and $i = 1, \dots, (n+q)$, so that $\Theta_S = \prod_{l=1}^{d} [-S_l, S_l]$ is the smallest box which contains all observed and prediction locations. We should at least ensure $\Theta \supset \Theta_S$.
  • We want the box to be large enough to ensure good boundary accuracy. Letting $c_l \geq 1$ be boundary factors, we consider $\Theta = \prod_{l=1}^{d} [-L_l, L_l]$ with $L_l = c_l S_l$ (a small numeric illustration follows below).
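As a small numeric illustration with hypothetical values: if, after centering, the first coordinate spans [−2, 2] and the second spans [−1, 1], then $S_1 = 2$, $S_2 = 1$, and choosing $c_1 = c_2 = 1.5$ gives $L_1 = 3$, $L_2 = 1.5$, i.e., $\Theta = [-3, 3] \times [-1.5, 1.5]$.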

HSGP approximation box

How much the approximation accuracy deteriorates towards the boundaries depends on the smoothness of the true surface.

  • E.g., the larger the length scale ρ, the smoother the surface and the further the boundary effect reaches into the domain, so a larger box is needed for the same level of boundary accuracy.

HSGP approximation box

The larger the box,

  • the more basis functions we need for the same level of overall accuracy,
  • hence higher run time.

HSGP basis functions

The total number of basis functions is $m = \prod_{l=1}^{d} m_l$, i.e., we need to decide on the $m_l$'s, the number of basis functions for each dimension.

  • The higher m is, the better the overall approximation accuracy, and the higher the runtime.
  • m scales exponentially in d, hence the HSGP computational complexity O(nm + m) also scales exponentially in d. Therefore HSGP is only recommended for d ≤ 3, at most 4 (see the example below).
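For example, with $d = 2$ and $m_1 = m_2 = 20$ basis functions per dimension, $m = 400$; the same per-dimension choice in $d = 4$ would already give $m = 20^4 = 160{,}000$.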

Relationship between c and m

@riutort2023practical used simulations with a squared exponential covariance function to investigate the relationship between c and m.

Notice that c needs to be above a certain minimum value to achieve a reasonable boundary accuracy.

Relationship between c, m and ρ

To summarize:

  1. The boundary factor c needs to be above a minimum value, otherwise the HSGP approximation is poor no matter how large m is.
  2. As ρ increases, the surface is smoother but the boundary effect reaches further, so
  • c needs to increase to retain boundary accuracy.
  • m then needs to increase to retain overall accuracy.
  3. As c increases, m needs to increase to retain overall accuracy.
  4. As m increases, run time increases. Hence we want to minimize m and c while maintaining a certain accuracy level.

And we do not know ρ in advance. So how should we choose c and m?

Prepare for next class

  1. Work on HW 05, which is due Apr 8
  2. Complete reading to prepare for Thursday’s lecture
  3. Thursday’s lecture:
  • Parameter tuning for HSGP
  • How to implement HSGP in stan

References
