Feb 13, 2025
On Tuesday, we learned about robust regression.
Heteroskedasticity
Heavy-tailed distributions
Median regression
These were all models for the observed data.
Today, we will focus on prior specifications for the regression coefficients.
Supervised learning can be cast as the problem of estimating a set of coefficients $\beta = (\beta_1, \ldots, \beta_p)$ that relate the predictors to the response.
This is a central focus of statistics and machine learning.
Challenges arise in “large $p$, small $n$” settings, where the number of predictors is large relative to the number of observations.
Finding a sparse solution, where some of the coefficients are exactly (or effectively) zero, is often desirable in these settings.
From a Bayesian-learning perspective, there are two main sparse-estimation alternatives: discrete mixtures and shrinkage priors.
Discrete mixtures have been very popular, with the spike-and-slab prior being the gold standard.
Shrinkage priors force the coefficient estimates toward zero, without placing a point mass at exactly zero.
Let’s assume a spike-and-slab prior on each coefficient: $\beta_j \mid \gamma_j \sim \gamma_j\,\mathrm{N}(0, \tau^2) + (1 - \gamma_j)\,\delta_0$, where $\gamma_j \in \{0, 1\}$ is a discrete inclusion indicator.
Sparsity can be induced into $\beta$ by setting some of the indicators $\gamma_j$ to zero.
The degree of sparsity depends on the choice of prior for the indicators.
The number of zeros is dictated by the discrete indicators $\gamma_j$.
Discrete parameters cannot be specified in Stan!
Often, continuous shrinkage priors are used instead, since they sidestep the discrete indicators entirely.
Ridge regression is motivated by extending linear regression to the setting where:
there are too many predictors (sparsity is desired) and/or,
the predictors are highly correlated (multicollinearity).
In either case the OLS estimate $\hat\beta_{\mathrm{OLS}} = (X^\top X)^{-1} X^\top Y$ becomes unstable, because $X^\top X$ is nearly (or exactly) singular.
The ridge estimator minimizes the penalized sum of squares,
$$\hat\beta_{\mathrm{ridge}} = \arg\min_{\beta} \; \lVert Y - X\beta \rVert_2^2 + \lambda \sum_{j=1}^{p} \beta_j^2 .$$
Ridge regression can be obtained using the following global-local shrinkage prior,
$$\beta_j \mid \lambda_j, \tau \sim \mathrm{N}(0, \tau^2 \lambda_j^2), \qquad \lambda_j = 1 \text{ for all } j.$$
This is equivalent to:
$$\beta_j \mid \tau \sim \mathrm{N}(0, \tau^2).$$
How is this equivalent to ridge regression?
The posterior mean and mode are both
$$\hat\beta = (X^\top X + \lambda I)^{-1} X^\top Y, \qquad \text{with } \lambda = \sigma^2 / \tau^2 .$$
Since the conditional posterior of $\beta$ is Gaussian, the mean and mode coincide, and they are exactly the ridge estimator with penalty $\lambda = \sigma^2/\tau^2$.
Bayesian statistics is inherently performing regularization!
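To make the prior-as-penalty view concrete, here is a minimal Stan sketch of ridge regression; it is illustrative (the prior scales for alpha, sigma, and tau are assumptions), not code from the lecture.
data {
  int<lower = 1> n;
  int<lower = 1> p;
  vector[n] Y;
  matrix[n, p] X;
}
parameters {
  real alpha;
  real<lower = 0> sigma;
  real<lower = 0> tau;    // global scale shared by all coefficients
  vector[p] beta;
}
model {
  // likelihood
  target += normal_lpdf(Y | alpha + X * beta, sigma);
  // weakly informative priors on the population parameters
  target += normal_lpdf(alpha | 0, 3);
  target += normal_lpdf(sigma | 0, 3);
  target += normal_lpdf(tau | 0, 1);
  // ridge (Gaussian) prior: for fixed sigma and tau, the posterior mode of beta
  // is the ridge estimate with lambda = sigma^2 / tau^2
  target += normal_lpdf(beta | 0, tau);
}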
The least absolute shrinkage and selection operator (lasso) estimator minimizes the penalized sum of squares,
$$\hat\beta_{\mathrm{lasso}} = \arg\min_{\beta} \; \lVert Y - X\beta \rVert_2^2 + \lambda \sum_{j=1}^{p} |\beta_j| .$$
Lasso is desirable because it can set some coefficients exactly to zero, performing estimation and variable selection at the same time.
Lasso regression can be obtained using the following global-local shrinkage prior,
$$\beta_j \mid \lambda_j, \tau \sim \mathrm{N}(0, \tau^2 \lambda_j^2), \qquad \lambda_j^2 \sim \mathrm{Exponential}(1/2).$$
This is equivalent to:
$$\beta_j \mid \tau \sim \mathrm{Laplace}(0, \tau)$$
(a double-exponential prior), after integrating out the local scales $\lambda_j$.
How is this equivalent to lasso regression?
Lasso is recovered by specifying: $\lambda = 2\sigma^2/\tau$, so that minimizing the negative log posterior matches minimizing the lasso objective.
The posterior mode is exactly the lasso estimate (for fixed $\sigma$ and $\tau$).
As full Bayesian inference reports the whole posterior rather than just its mode, posterior draws and the posterior mean of $\beta$ never contain exact zeros.
There is a consensus that the Bayesian lasso does not work well.
It does not yield sparse posterior summaries, and the Laplace prior shrinks large signals too aggressively while not shrinking the noise coefficients enough.
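For comparison, here is a minimal Stan sketch of the Bayesian lasso (again with illustrative prior scales), which replaces the normal prior on the coefficients with a double-exponential prior.
data {
  int<lower = 1> n;
  int<lower = 1> p;
  vector[n] Y;
  matrix[n, p] X;
}
parameters {
  real alpha;
  real<lower = 0> sigma;
  real<lower = 0> tau;    // global scale of the Laplace prior
  vector[p] beta;
}
model {
  // likelihood
  target += normal_lpdf(Y | alpha + X * beta, sigma);
  // population parameters
  target += normal_lpdf(alpha | 0, 3);
  target += normal_lpdf(sigma | 0, 3);
  target += normal_lpdf(tau | 0, 1);
  // Laplace prior on the coefficients; for fixed sigma and tau the posterior
  // mode (not the mean) matches the lasso solution
  target += double_exponential_lpdf(beta | 0, tau);
}
Posterior draws of beta from this program are never exactly zero, which is one concrete symptom of the criticism above.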
The gold-standard sparsity-inducing prior in Bayesian statistics is the horseshoe prior.
Before we get to the horseshoe, let’s look at one more global-local prior: the relevance vector machine.
This model can be obtained using the following prior,
$$\beta_j \mid \lambda_j \sim \mathrm{N}(0, \lambda_j^2), \qquad \lambda_j^2 \sim \mathrm{Inv\text{-}Gamma}(a, b),$$
which, after integrating out $\lambda_j^2$, gives each $\beta_j$ a Student-$t$ prior.
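A minimal Stan sketch of this prior, with inverse-gamma(1, 1) hyperparameters chosen purely for illustration:
data {
  int<lower = 1> n;
  int<lower = 1> p;
  vector[n] Y;
  matrix[n, p] X;
}
parameters {
  real alpha;
  real<lower = 0> sigma;
  vector<lower = 0>[p] lambda2;   // one local variance per coefficient
  vector[p] beta;
}
model {
  // likelihood and population parameters
  target += normal_lpdf(Y | alpha + X * beta, sigma);
  target += normal_lpdf(alpha | 0, 3);
  target += normal_lpdf(sigma | 0, 3);
  // inverse-gamma local variances; marginally each beta_j has a Student-t prior
  target += inv_gamma_lpdf(lambda2 | 1, 1);
  target += normal_lpdf(beta | 0, sqrt(lambda2));
}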
The horseshoe prior is specified as,
$$\beta_j \mid \lambda_j, \tau \sim \mathrm{N}(0, \tau^2 \lambda_j^2), \qquad \lambda_j \sim \mathrm{C}^{+}(0, 1),$$
with a prior on the global scale $\tau$ to be discussed below.
A random variable $\lambda \sim \mathrm{C}^{+}(0, 1)$ follows a half-Cauchy distribution: a standard Cauchy distribution truncated to the positive real line.
In Stan, the half-Cauchy distribution can be specified by putting a constraint on the parameter definition.
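For example, declaring the scale with a lower bound of zero and giving it a full Cauchy density yields a half-Cauchy prior:
parameters {
  real<lower = 0> tau;   // the constraint restricts tau to the positive reals
}
model {
  // combined with the constraint above, this is a half-Cauchy(0, 1) prior
  target += cauchy_lpdf(tau | 0, 1);
}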

The horseshoe prior has two interesting features that make it particularly useful as a shrinkage prior for sparse problems.
It has flat, Cauchy-like tails that allow strong signals to remain large (that is, un-shrunk) a posteriori.
It has an infinitely tall spike at the origin that provides severe shrinkage for the zero elements of $\beta$.
As we will see, these are key elements that make the horseshoe an attractive choice for handling sparse vectors.

Define the posterior mean of $\beta_j$, conditional on the scales, as $\bar\beta_j = \mathrm{E}[\beta_j \mid Y, \lambda_j, \tau, \sigma]$.
The following relationship holds:
$$\bar\beta_j = (1 - \kappa_j)\,\hat\beta_j,$$
where $\hat\beta_j$ is the maximum-likelihood (least-squares) estimate and $\kappa_j \in [0, 1]$ is the shrinkage factor for coefficient $j$.
In regularization problems, predictors are standardized (to mean zero and standard deviation one).
This means that $X^\top X \approx n I$, so that each coefficient is shrunk independently of the others.
Shrinkage parameter:
$$\kappa_j = \frac{1}{1 + n\,\sigma^{-2}\,\tau^2\,\lambda_j^2}.$$
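A quick numerical check, with assumed values $n = 100$, $\sigma = 1$, and $\tau = 0.01$, shows how the local scale moves the shrinkage factor between its extremes:
$$\lambda_j = 1: \quad \kappa_j = \frac{1}{1 + 100 \cdot (0.01)^2 \cdot 1^2} = \frac{1}{1.01} \approx 0.99 \quad \text{(shrunk almost entirely to zero)},$$
$$\lambda_j = 100: \quad \kappa_j = \frac{1}{1 + 100 \cdot (0.01)^2 \cdot 100^2} = \frac{1}{101} \approx 0.01 \quad \text{(essentially unshrunk)}.$$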

Choosing $\lambda_j \sim \mathrm{C}^{+}(0, 1)$ implies that, in the unit-scale case ($n = \sigma = \tau = 1$), the shrinkage factor has a $\mathrm{Beta}(1/2, 1/2)$ prior, whose U-shaped density looks like a horseshoe.
This horseshoe-shaped shrinkage profile expects to see two things a priori:
Strong signals ($\kappa_j \approx 0$: essentially no shrinkage).
Zeros ($\kappa_j \approx 1$: near-total shrinkage toward zero).
A horseshoe prior can be considered as a continuous approximation to the spike-and-slab prior.
The spike-and-slab places a discrete probability mass at exactly zero (the “spike”) and a separate distribution around non-zero values (the “slab”).
The horseshoe prior smoothly approximates this behavior with a very concentrated distribution near zero.

Carvalho et al. 2009 suggest a fully Bayesian treatment of the global scale, $\tau \sim \mathrm{C}^{+}(0, 1)$.
Polson and Scott 2011 recommend scaling the global prior by the noise level, $\tau \mid \sigma \sim \mathrm{C}^{+}(0, \sigma)$.
Another prior comes from a quantity called the effective number of nonzero coefficients,
$$m_{\mathrm{eff}} = \sum_{j=1}^{p} (1 - \kappa_j).$$
Setting the prior expectation of $m_{\mathrm{eff}}$ to a guess $p_0$ of the number of relevant predictors leads to the global scale
$$\tau_0 = \frac{p_0}{p - p_0}\,\frac{\sigma}{\sqrt{n}}.$$
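For illustration, with assumed values of $p = 100$ predictors, a prior guess of $p_0 = 5$ relevant predictors, $n = 200$ observations, and $\sigma = 1$:
$$\tau_0 = \frac{5}{95} \cdot \frac{1}{\sqrt{200}} \approx 0.0037 .$$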

This framework can be applied to non-Gaussian observation models by using plug-in values for $\sigma$, obtained from Gaussian approximations to the likelihood.
For example, for logistic regression the plug-in value is $\tilde\sigma = 2$, since the variance of the Gaussian approximation, $1/(\mu(1-\mu))$, equals $4$ at $\mu = 1/2$ (see the sketch after the Stan program below).
The horseshoe model has the following form,
$$Y_i \sim \mathrm{N}(\alpha + x_i^\top \beta, \sigma^2), \qquad \beta_j \mid \lambda_j, \tau \sim \mathrm{N}(0, \tau^2 \lambda_j^2), \qquad \lambda_j \sim \mathrm{C}^{+}(0, 1), \qquad \tau \sim \mathrm{C}^{+}(0, \tau_0).$$
Efficient parameter transformation (non-centered parameterization):
$$\beta_j = \tau\,\lambda_j\,z_j, \qquad z_j \sim \mathrm{N}(0, 1),$$
which is what the Stan program below implements.
data {
  int<lower = 1> n;              // number of observations
  int<lower = 1> p;              // number of predictors
  vector[n] Y;                   // response
  matrix[n, p] X;                // predictor matrix
  real<lower = 0> tau0;          // scale for the half-Cauchy prior on tau
}
parameters {
  real alpha;                    // intercept
  real<lower = 0> sigma;         // residual standard deviation
  vector[p] z;                   // standardized coefficients (non-centered)
  vector<lower = 0>[p] lambda;   // local shrinkage scales
  real<lower = 0> tau;           // global shrinkage scale
}
transformed parameters {
  vector[p] beta;
  beta = tau * lambda .* z;      // beta_j = tau * lambda_j * z_j
}
model {
  // likelihood
  target += normal_lpdf(Y | alpha + X * beta, sigma);
  // population parameters
  target += normal_lpdf(alpha | 0, 3);
  target += normal_lpdf(sigma | 0, 3);
  // horseshoe prior
  target += std_normal_lpdf(z);
  target += cauchy_lpdf(lambda | 0, 1);
  target += cauchy_lpdf(tau | 0, tau0);
}
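As a sketch of the non-Gaussian case mentioned above, the same horseshoe structure can be paired with a logistic likelihood; $\sigma$ no longer appears as a parameter and only enters through the plug-in value $\tilde\sigma = 2$ used when computing tau0 outside of Stan. The program mirrors the Gaussian model above and is illustrative rather than the assigned code.
data {
  int<lower = 1> n;
  int<lower = 1> p;
  array[n] int<lower = 0, upper = 1> Y;   // binary outcomes
  matrix[n, p] X;
  real<lower = 0> tau0;                   // computed with the plug-in sigma = 2
}
parameters {
  real alpha;
  vector[p] z;
  vector<lower = 0>[p] lambda;
  real<lower = 0> tau;
}
transformed parameters {
  vector[p] beta = tau * lambda .* z;
}
model {
  // logistic likelihood
  target += bernoulli_logit_lpmf(Y | alpha + X * beta);
  // intercept
  target += normal_lpdf(alpha | 0, 3);
  // horseshoe prior
  target += std_normal_lpdf(z);
  target += cauchy_lpdf(lambda | 0, 1);
  target += cauchy_lpdf(tau | 0, tau0);
}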
Work on HW 03, which was just assigned.
Complete the reading to prepare for next Tuesday’s lecture.
Tuesday’s lecture: Classification

