Regularization

Prof. Sam Berchuck

Feb 13, 2025

Review of last lecture

  • On Tuesday, we learned about robust regression.

    • Heteroskedasticity

    • Heavy-tailed distributions

    • Median regression

  • These were all models for the observed data $Y_i$.

  • Today, we will focus on prior specifications for β.

Sparsity in regression problems

  • Supervised learning can be cast as the problem of estimating a set of coefficients $\beta = \{\beta_j\}_{j=1}^p$ that determines some functional relationship between a set of predictors $\{x_{ij}\}_{j=1}^p$ and a target variable $Y_i$.

  • This is a central focus of statistics and machine learning.

  • Challenges arise in “large-p” problems where, in order to avoid overly complex models that predict poorly, some form of dimension reduction is needed.

  • Finding a sparse solution, where some $\beta_j$ are zero, is desirable.

Bayesian sparse estimation

  • From a Bayesian-learning perspective, there are two main sparse-estimation alternatives: discrete mixtures and shrinkage priors.

  • Discrete mixtures have been very popular, with the spike-and-slab prior being the gold standard.

    • They make it easy to force $\beta_j$ to be exactly zero, but they require specifying discrete parameters.

  • Shrinkage priors pull $\beta_j$ toward zero through regularization, but struggle to produce exact zeros.

    • In recent years, shrinkage priors have become the dominant approach to Bayesian sparsity.

Global-local shrinkage

  • Let’s assume $Y \sim N(\alpha + X\beta, \sigma^2 I_n)$.

  • Sparsity can be induced in $\beta$ using a global-local prior,

$$\beta_j \mid \lambda_j, \tau \overset{\text{ind}}{\sim} N(0, \lambda_j^2\tau^2), \qquad \lambda_j \overset{\text{iid}}{\sim} f(\lambda_j).$$

  • $\tau^2$ is the global variance term.

  • $\lambda_j$ is the local term.

  • The degree of sparsity depends on the choice of $f(\lambda_j)$.

Spike-and-slab prior

  • Discrete parameter specification,

$$\beta_j \mid \lambda_j, \tau \overset{\text{ind}}{\sim} N(0, \lambda_j^2\tau^2), \qquad \lambda_j \overset{\text{iid}}{\sim} \text{Bernoulli}(\pi).$$

  • $\lambda_j \in \{0, 1\}$; thus this model permits exact zeros.

  • The number of zeros is dictated by π, which can either be pre-specified or given a prior.

  • Discrete parameters cannot be specified in Stan!

Spike-and-slab prior

  • Spike-and-slab can be written generally as a two-component mixture of Gaussians,

$$\beta_j \mid \lambda_j, \tau, \omega \overset{\text{ind}}{\sim} \lambda_j N(0, \tau^2) + (1 - \lambda_j) N(0, \omega^2), \qquad \lambda_j \overset{\text{iid}}{\sim} \text{Bernoulli}(\pi).$$

  • $\omega \ll \tau$, and the indicator variable $\lambda_j \in \{0, 1\}$ denotes whether $\beta_j$ is close to zero (comes from the “spike”, $\lambda_j = 0$) or non-zero (comes from the “slab”, $\lambda_j = 1$).

  • Often $\omega = 0$ (the spike is a true spike).
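Although Stan cannot sample the discrete indicators $\lambda_j$, they can be marginalized out of the two-component mixture, leaving a continuous target. Below is a minimal sketch of this idea (hypothetical and simplified for illustration; the slab scale tau, spike scale omega, and inclusion probability pi_mix are treated as fixed data rather than estimated, and the weak priors on alpha and sigma are assumptions):

data {
  int<lower = 1> n;
  int<lower = 1> p;
  vector[n] Y;
  matrix[n, p] X;
  real<lower = 0, upper = 1> pi_mix;  // prior inclusion probability
  real<lower = 0> tau;                // slab scale
  real<lower = 0> omega;              // spike scale (omega << tau, but > 0)
}
parameters {
  real alpha;
  real<lower = 0> sigma;
  vector[p] beta;
}
model {
  // likelihood
  target += normal_lpdf(Y | alpha + X * beta, sigma);
  // weakly informative priors
  target += normal_lpdf(alpha | 0, 3);
  target += normal_lpdf(sigma | 0, 3);
  // spike-and-slab prior with lambda_j marginalized out:
  // p(beta_j) = pi_mix * N(0, tau^2) + (1 - pi_mix) * N(0, omega^2)
  for (j in 1:p) {
    target += log_mix(pi_mix,
                      normal_lpdf(beta[j] | 0, tau),
                      normal_lpdf(beta[j] | 0, omega));
  }
}

Note that the marginalization requires $\omega > 0$; the point-mass spike ($\omega = 0$) cannot be represented this way.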

Ridge regression

  • Ridge regression is motivated by extending linear regression to the setting where:

    • there are too many predictors (sparsity is desired) and/or,

    • $X^\top X$ is ill-conditioned, i.e., singular or nearly singular (multicollinearity).

  • The OLS estimate becomes unstable: $\hat{\beta}_{OLS} = (X^\top X)^{-1} X^\top Y$.

Ridge regression

The ridge estimator minimizes the penalized sum of squares,

$$\hat{\beta}_{RIDGE} = \arg\min_{\beta} \|Y - \mu\|_2^2 + \lambda \sum_{j=1}^p \beta_j^2$$

  • $\mu = \alpha + X\beta$.

  • $\|v\|_2^2 = v^\top v$ is the squared $L_2$ norm.

  • $\hat{\beta}_{RIDGE} = (X^\top X + \lambda I_p)^{-1} X^\top Y$

    • Adding $\lambda$ to the diagonal of $X^\top X$ stabilizes the inverse, which otherwise becomes unstable under multicollinearity.
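As a quick check of this closed form (a sketch assuming, for simplicity, that $Y$ and the columns of $X$ are centered so the intercept drops out), setting the gradient of the penalized sum of squares to zero gives the ridge normal equations:

$$\frac{\partial}{\partial \beta}\left\{\|Y - X\beta\|_2^2 + \lambda\,\beta^\top\beta\right\} = -2X^\top(Y - X\beta) + 2\lambda\beta = 0 \quad\Longrightarrow\quad (X^\top X + \lambda I_p)\,\beta = X^\top Y.$$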

Bayesian ridge prior

Ridge regression can be obtained using the following global-local shrinkage prior,

$$\beta_j \mid \lambda_j, \tau \overset{\text{ind}}{\sim} N(0, \lambda_j^2\tau^2), \qquad \lambda_j^2 = 1/\lambda, \qquad \tau^2 = \sigma^2.$$

  • This is equivalent to: $f(\beta_j \mid \lambda, \sigma) \overset{\text{iid}}{\sim} N(0, \sigma^2/\lambda)$.

  • How is this equivalent to ridge regression?

Bayesian ridge prior

  • The negative log-posterior is proportional to,

$$\frac{\|Y - \mu\|_2^2}{2\sigma^2} + \frac{\lambda\sum_{j=1}^p \beta_j^2}{2\sigma^2}.$$

  • The posterior mean and mode are $\hat{\beta}_{RIDGE}$.

  • Since a single $\lambda$ penalizes the squared norm of all the $\beta_j$, the covariates are typically standardized so that they are on a common scale.

  • Bayesian statistics is inherently performing regularization!
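For concreteness, here is a minimal Stan sketch of the Bayesian ridge prior (a hypothetical, simplified program: lambda is passed in as fixed data rather than estimated, and the weak priors on alpha and sigma are assumptions). The prior scale $\sigma/\sqrt{\lambda}$ follows directly from $\beta_j \sim N(0, \sigma^2/\lambda)$.

data {
  int<lower = 1> n;
  int<lower = 1> p;
  vector[n] Y;
  matrix[n, p] X;
  real<lower = 0> lambda;  // ridge penalty, fixed by the user
}
parameters {
  real alpha;
  real<lower = 0> sigma;
  vector[p] beta;
}
model {
  // likelihood
  target += normal_lpdf(Y | alpha + X * beta, sigma);
  // ridge prior: beta_j ~ N(0, sigma^2 / lambda)
  target += normal_lpdf(beta | 0, sigma / sqrt(lambda));
  // weakly informative priors
  target += normal_lpdf(alpha | 0, 3);
  target += normal_lpdf(sigma | 0, 3);
}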

Lasso regression

The least absolute shrinkage and selection operator (lasso) estimator minimizes the penalized sum of squares,

$$\hat{\beta}_{LASSO} = \arg\min_{\beta} \|Y - \mu\|_2^2 + \lambda \sum_{j=1}^p |\beta_j|$$

  • $\lambda = 0$ reduces to the OLS estimator.

  • $\lambda \to \infty$ leads to $\hat{\beta}_{LASSO} = 0$.

  • Lasso is desirable because it can set some $\beta_j$ exactly to zero.

Bayesian lasso prior

Lasso regression can be obtained using the following global-local shrinkage prior,

$$\beta_j \mid \lambda_j, \tau \overset{\text{ind}}{\sim} N(0, \lambda_j^2\tau^2), \qquad \lambda_j^2 \overset{\text{iid}}{\sim} \text{Exponential}(0.5).$$

  • This is equivalent to: $f(\beta_j \mid \tau) \overset{\text{iid}}{\sim} \text{Laplace}(0, \tau)$.

  • How is this equivalent to lasso regression?

Bayesian lasso prior

  • The negative log-posterior is proportional to,

$$\frac{\|Y - \mu\|_2^2}{2\sigma^2} + \frac{\sum_{j=1}^p |\beta_j|}{\tau}.$$

  • Lasso is recovered by specifying $\lambda = 1/\tau$ (up to a constant factor of $2\sigma^2$ from matching the two objectives).

  • The posterior mode is $\hat{\beta}_{LASSO}$.

  • As $\lambda$ increases, more coefficients are set to zero (fewer variables are selected), and among the non-zero coefficients, more shrinkage is employed.
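A minimal Stan sketch of the Bayesian lasso, written with the marginal Laplace (double exponential) prior above (hypothetical and simplified for illustration; the half-normal prior on tau and the weak priors on alpha and sigma are assumptions, not a course recommendation):

data {
  int<lower = 1> n;
  int<lower = 1> p;
  vector[n] Y;
  matrix[n, p] X;
}
parameters {
  real alpha;
  real<lower = 0> sigma;
  real<lower = 0> tau;
  vector[p] beta;
}
model {
  // likelihood
  target += normal_lpdf(Y | alpha + X * beta, sigma);
  // Bayesian lasso prior: beta_j ~ Laplace(0, tau)
  target += double_exponential_lpdf(beta | 0, tau);
  // weakly informative priors
  target += normal_lpdf(alpha | 0, 3);
  target += normal_lpdf(sigma | 0, 3);
  target += normal_lpdf(tau | 0, 1);
}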

Bayesian lasso does not work

  • There is a consensus that the Bayesian lasso does not work well.

  • It does not yield $\beta_j$ that are exactly zero, and it can overly shrink non-zero $\beta_j$.

  • The gold-standard sparsity-inducing prior in Bayesian statistics is the horseshoe prior.

Relevance vector machine

  • Before we get to the horseshoe, we consider one more global-local prior, called the relevance vector machine.

  • This model can be obtained using the following prior,

$$\beta_j \mid \lambda_j, \tau \overset{\text{ind}}{\sim} N(0, \lambda_j^2\tau^2), \qquad \lambda_j^2 \overset{\text{iid}}{\sim} \text{Inverse-Gamma}\!\left(\tfrac{\nu}{2}, \tfrac{\nu}{2}\right).$$

  • This is equivalent to: $f(\beta_j \mid \tau) \overset{\text{iid}}{\sim} t_\nu(0, \tau)$.
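This equivalence follows by integrating $\lambda_j^2$ out of the normal-inverse-gamma mixture; a brief sketch of the calculation:

$$p(\beta_j \mid \tau) = \int_0^\infty N(\beta_j \mid 0, \lambda_j^2\tau^2)\,\text{Inv-Gamma}\!\left(\lambda_j^2 \,\middle|\, \tfrac{\nu}{2}, \tfrac{\nu}{2}\right) d\lambda_j^2 \;\propto\; \left(1 + \frac{\beta_j^2}{\nu\tau^2}\right)^{-\frac{\nu + 1}{2}},$$

which is the kernel of the $t_\nu(0, \tau)$ density.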

Horseshoe prior

  • The horseshoe prior is specified as,

$$\beta_j \mid \lambda_j, \tau \overset{\text{ind}}{\sim} N(0, \lambda_j^2\tau^2), \qquad \lambda_j \overset{\text{iid}}{\sim} C^+(0, 1),$$

    where $C^+(0, 1)$ is a half-Cauchy distribution for the local parameter $\lambda_j$.

    • The $\lambda_j$’s are the local shrinkage parameters.

    • $\tau$ is the global shrinkage parameter.

Half-Cauchy distribution

A random variable $X \sim C^+(\mu, \sigma)$ follows a half-Cauchy distribution with location $\mu$ and scale $\sigma > 0$, and has density

$$f(X \mid \mu, \sigma) = \frac{2}{\pi\sigma}\,\frac{1}{1 + (X - \mu)^2/\sigma^2}, \qquad X \geq \mu.$$

  • The half-Cauchy distribution with $\mu = 0$ is a useful prior for non-negative parameters that may be very large, as allowed by the very heavy tails of the Cauchy distribution.

Half-Cauchy distribution in Stan

In Stan, the half-Cauchy distribution can be specified by putting a constraint on the parameter definition.

parameters {
  real<lower = 0> lambda;  // lower bound restricts lambda to [0, Inf)
}
model {
  // full Cauchy density; the constraint truncates it to the positive half
  // (the omitted factor of 2 is a constant and does not affect sampling)
  target += cauchy_lpdf(lambda | 0, 1);
}

Half-Cauchy distribution

[Figure: half-Cauchy density.]

Horseshoe prior

The horseshoe prior has two interesting features that make it particularly useful as a shrinkage prior for sparse problems.

  1. It has flat, Cauchy-like tails that allow strong signals to remain large (that is, un-shrunk) a posteriori.

  2. It has an infinitely tall spike at the origin that provides severe shrinkage for the zero elements of β.

As we will see, these are key elements that make the horseshoe an attractive choice for handling sparse vectors.

Relation to other shrinkage priors

$$\beta_j \mid \lambda_j, \tau \sim N(0, \lambda_j^2\tau^2), \qquad \lambda_j \sim f(\lambda_j)$$

  1. $\lambda_j^2 = 1/\lambda$ implies ridge regression.

  2. $\lambda_j^2 \sim \text{Exponential}(0.5)$ implies the lasso.

  3. $\lambda_j^2 \sim \text{Inverse-Gamma}\!\left(\tfrac{\nu}{2}, \tfrac{\nu}{2}\right)$ implies the relevance vector machine.

  4. $\lambda_j \sim C^+(0, 1)$ implies the horseshoe.

Horseshoe density

[Figure: density of the horseshoe prior; from Carvalho et al. 2009.]

Shrinkage of each prior

  • Define the posterior mean of $\beta_j$ as $\bar{\beta}_j$ and the maximum likelihood estimator of $\beta_j$ as $\hat{\beta}_j$.

  • The following relationship holds:

$$\bar{\beta}_j = (1 - \kappa_j)\hat{\beta}_j, \qquad \kappa_j = \frac{1}{1 + n\sigma^{-2}\tau^2 s_j^2\lambda_j^2}.$$

  • $\kappa_j$ is called the shrinkage factor for $\beta_j$.

  • $s_j^2 = \text{Var}(x_j)$ is the variance of each predictor.
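This relationship comes from normal-normal conjugacy; here is a brief sketch, under the simplifying assumption of an (approximately) orthogonal design so that $\hat{\beta}_j \mid \beta_j \sim N\!\left(\beta_j, \sigma^2/(n s_j^2)\right)$:

$$\mathbb{E}[\beta_j \mid \hat{\beta}_j, \lambda_j, \tau] = \frac{\lambda_j^2\tau^2}{\lambda_j^2\tau^2 + \sigma^2/(n s_j^2)}\,\hat{\beta}_j = \left(1 - \frac{1}{1 + n\sigma^{-2}s_j^2\tau^2\lambda_j^2}\right)\hat{\beta}_j = (1 - \kappa_j)\,\hat{\beta}_j.$$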

Standardization of predictors

  • In regularization problems, predictors are standardized (to mean zero and standard deviation one).

  • This means that $s_j = 1$.

  • Shrinkage parameter:

$$\kappa_j = \frac{1}{1 + n\sigma^{-2}\tau^2\lambda_j^2}.$$

  • $\kappa_j = 1$ implies complete shrinkage.

  • $\kappa_j = 0$ implies no shrinkage.

Shrinkage parameter

[Figure: prior density of the shrinkage parameter; from Carvalho et al. 2009.]

Horseshoe shrinkage parameter

  • Choosing $\lambda_j \sim C^+(0, 1)$ implies $\kappa_j \sim \text{Beta}(0.5, 0.5)$ (taking $n\sigma^{-2}\tau^2 = 1$), a density that is symmetric and unbounded at both 0 and 1.

  • This horseshoe-shaped shrinkage profile expects to see two things a priori:

    1. Strong signals ($\kappa \approx 0$, no shrinkage), and

    2. Zeros ($\kappa \approx 1$, total shrinkage).
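A sketch of the change of variables behind this result (again taking $n\sigma^{-2}\tau^2 = 1$, so that $\kappa_j = 1/(1 + \lambda_j^2)$): with $p(\lambda_j) = \frac{2}{\pi(1 + \lambda_j^2)}$ and $\lambda_j = \sqrt{(1 - \kappa_j)/\kappa_j}$,

$$p(\kappa_j) = p\big(\lambda_j(\kappa_j)\big)\left|\frac{d\lambda_j}{d\kappa_j}\right| = \frac{2\kappa_j}{\pi}\cdot\frac{1}{2\,\kappa_j^{3/2}(1 - \kappa_j)^{1/2}} = \frac{1}{\pi}\,\kappa_j^{-1/2}(1 - \kappa_j)^{-1/2},$$

which is exactly the $\text{Beta}(0.5, 0.5)$ density, unbounded at both ends.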

Similarity to spike-and-slab

  • A horseshoe prior can be considered as a continuous approximation to the spike-and-slab prior.

    • The spike-and-slab places a discrete probability mass at exactly zero (the “spike”) and a separate distribution around non-zero values (the “slab”).

    • The horseshoe prior smoothly approximates this behavior with a very concentrated distribution near zero.

[Figure: from Piironen and Vehtari 2017.]

Choosing a prior for τ

  • Carvalho et al. 2009 suggest $\tau \sim C^+(0, 1)$.

  • Polson and Scott 2011 recommend $\tau \mid \sigma \sim C^+(0, \sigma^2)$.

  • Another prior comes from a quantity called the effective number of non-zero coefficients,

$$m_{\text{eff}} = \sum_{j=1}^p (1 - \kappa_j).$$

Global shrinkage parameter τ

  • The prior mean can be shown to be,

$$E[m_{\text{eff}} \mid \tau, \sigma] = \frac{\tau\sigma^{-1}\sqrt{n}}{1 + \tau\sigma^{-1}\sqrt{n}}\,p.$$

  • Setting $E[m_{\text{eff}} \mid \tau, \sigma] = p_0$ (a prior guess for the number of non-zero coefficients) and solving for $\tau$ yields,

$$\tau_0 = \frac{p_0}{p - p_0}\,\frac{\sigma}{\sqrt{n}}.$$
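As a quick worked example (hypothetical numbers, chosen only for illustration): with $p = 100$ candidate predictors, a prior guess of $p_0 = 5$ non-zero coefficients, $\sigma = 1$, and $n = 200$ observations,

$$\tau_0 = \frac{5}{100 - 5}\cdot\frac{1}{\sqrt{200}} \approx 0.053 \times 0.071 \approx 0.0037,$$

so the global scale is set far below 1, encoding the prior belief that most coefficients are essentially zero.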

Global shrinkage parameter τ

[Figure: from Piironen and Vehtari 2017.]

Non-Gaussian observation models

  • The reference value:

$$\tau_0 = \frac{p_0}{p - p_0}\,\frac{\sigma}{\sqrt{n}}.$$

  • This framework can be applied to non-Gaussian observation models using plug-in values for $\sigma$.

    • Gaussian approximations to the likelihood.

    • For example, for logistic regression $\sigma = 2$.

Coding up the model in Stan

The horseshoe model has the following form,

$$\beta_j \mid \lambda_j, \tau \overset{\text{ind}}{\sim} N(0, \lambda_j^2\tau^2), \qquad \lambda_j \overset{\text{iid}}{\sim} C^+(0, 1), \qquad \tau \sim C^+(0, \tau_0^2).$$

An efficient (non-centered) parameterization: $\beta_j = \tau\lambda_j z_j$, with $z_j \overset{\text{iid}}{\sim} N(0, 1)$.

Horseshoe in Stan

data {
  int<lower = 1> n;
  int<lower = 1> p;
  vector[n] Y;
  matrix[n, p] X;
  real<lower = 0> tau0;  // scale of the global shrinkage prior
}
parameters {
  real alpha;     
  real<lower = 0> sigma;
  vector[p] z;
  vector<lower = 0>[p] lambda;
  real<lower = 0> tau;
}
transformed parameters {
  vector[p] beta;
  // non-centered parameterization: beta_j = tau * lambda_j * z_j
  beta = tau * lambda .* z;
}
model {
  // likelihood
  target += normal_lpdf(Y | alpha + X * beta, sigma);
  // population parameters
  target += normal_lpdf(alpha | 0, 3);
  target += normal_lpdf(sigma | 0, 3);
  // horseshoe prior
  target += std_normal_lpdf(z);
  target += cauchy_lpdf(lambda | 0, 1);
  target += cauchy_lpdf(tau | 0, tau0);
}

Prepare for next class

  • Work on HW 03, which was just assigned.

  • Complete the reading to prepare for next Tuesday’s lecture.

  • Tuesday’s lecture: Classification
