Robust Regression

Prof. Sam Berchuck

Feb 11, 2025

Review of last lecture

  • On Thursday, we started to branch out from linear regression.

  • We learned about approaches for nonlinear regression.

  • Today we will address approaches for robust regression, which will generalize the assumption of homoskedasticity (and also the normality assumption).

A motivating research question

  • In today’s lecture, we will look at data on serum concentration (grams per litre) of immunoglobulin-G (IgG) in 298 children aged from 6 months to 6 years.

    • A detailed discussion of this data set may be found in Isaacs et al. (1983) and Royston and Altman (1994).
  • For each child $i$, we define $Y_i$ as the serum IgG concentration and $X_i$ as the child’s age in years.

Pulling the data

library(Brq)      # provides the ImmunogG data set
data("ImmunogG")
head(ImmunogG)
  IgG Age
1 1.5 0.5
2 2.7 0.5
3 1.9 0.5
4 4.0 0.5
5 1.9 0.5
6 4.4 0.5

Visualizing IgG data
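
The slide showed a scatterplot of the data. A minimal base-R sketch of how such a plot might be produced (column names taken from the head() output above):

plot(ImmunogG$Age, ImmunogG$IgG,
     xlab = "Age (years)", ylab = "IgG (g/L)",
     main = "IgG serum concentration by age")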

Modeling the association between age and IgG

  • Linear regression can be written as follows for i=1,…,n,

$$Y_i = \alpha + \beta X_i + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2).$$

  • $\beta$ represents the change in IgG serum concentration associated with a one-year increase in age.

Linear regression assumptions

$$Y_i = \alpha + \beta X_i + \epsilon_i = \mu_i + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2).$$

Assumptions:

  1. $Y_i$ are independent observations (independence).

  2. $Y_i$ is linearly related to $X_i$ (linearity).

  3. $\epsilon_i = Y_i - \mu_i$ is normally distributed (normality).

  4. $\epsilon_i$ has constant variance across $X_i$ (homoskedasticity).

Assessing assumptions
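
The slide showed diagnostic plots. A minimal sketch of how one might assess these assumptions, using an ordinary lm() fit for illustration:

fit <- lm(IgG ~ Age, data = ImmunogG)
plot(fitted(fit), resid(fit))           # fan shape suggests heteroskedasticity
abline(h = 0, lty = 2)
qqnorm(resid(fit)); qqline(resid(fit))  # tail deviations suggest non-normality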

Robust regression

  • Today we will learn about regression techniques that are robust to the assumptions of linear regression.

  • We will introduce the idea of robust regression by exploring ways to generalize the homoskedastic variance assumption in linear regression.

  • We will touch on heteroskedasticity, heavy-tailed distributions, and median regression (more generally quantile regression).

Why is robust regression not more common?

  • Despite their desirable properties, robust methods are not widely used. Why?

    • Historically computationally complex.

    • Not available in statistical software packages.

  • Bayesian modeling using Stan alleviates these bottlenecks!

Heteroskedasticity

  • Heteroskedasticity is the violation of the assumption of constant variance.

  • How can we handle this?

  • In OLS, there are approaches like heteroskedasticity-consistent standard errors, but these do not correspond to a generative model.

  • In the Bayesian framework, we generally like to write down generative models.

Modeling the scale with covariates

  • One option is to allow the scale to be modeled as a function of covariates.

  • It is common to model the log-transformation of the scale (or variance), which maps it onto the real line $\mathbb{R}$,

$$\log \tau_i = z_i \gamma,$$

where $z_i = (z_{i1}, \ldots, z_{iq})$ is a $q$-dimensional vector of covariates (the rows of the matrix $Z$ in the Stan program below) and $\gamma$ are parameters that regress the covariates onto the log standard deviation.

  • Other options include: $\log \tau_i = z_i \gamma + \nu_i$ with $\nu_i \sim N(0, \sigma^2)$, or $\log \tau_i = f(\mu_i)$.

  • Any plausible generative model can be specified!

Modeling the scale with covariates

data {
  int<lower = 1> n;  // number of observations
  int<lower = 1> p;  // number of mean covariates
  int<lower = 1> q;  // number of variance covariates
  vector[n] Y;
  matrix[n, p] X;    // covariates for the mean
  matrix[n, q] Z;    // covariates for the log standard deviation
}
parameters {
  real alpha;
  vector[p] beta;
  vector[q] gamma;
}
transformed parameters {
  vector[n] tau = exp(Z * gamma);  // observation-specific standard deviations
}
model {
  target += normal_lpdf(Y | alpha + X * beta, tau);
}
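
A sketch of how this program might be fit from R with rstan; the file name het_scale.stan and the choice of Age as the single covariate in both X and Z are assumptions for illustration:

library(rstan)
stan_data <- list(n = nrow(ImmunogG), p = 1, q = 1,
                  Y = ImmunogG$IgG,
                  X = cbind(ImmunogG$Age),   # mean covariates
                  Z = cbind(ImmunogG$Age))   # variance covariates
# "het_scale.stan" is a hypothetical file containing the program above
fit <- stan(file = "het_scale.stan", data = stan_data, chains = 4, iter = 2000)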

Heteroskedastic variance

  • We can write the regression model using an observation-specific variance, $Y_i = \alpha + x_i \beta + \epsilon_i$, $\epsilon_i \overset{ind}{\sim} N(0, \tau_i^2)$.

  • One way of writing the variance is $\tau_i^2 = \sigma^2 \lambda_i$, where:

    • $\sigma^2$ is a global scale parameter.

    • $\lambda_i$ is an observation-specific scale parameter.

  • In the Bayesian framework, we must place a prior on $\lambda_i$.

A prior to induce a heavy-tail

  • A common prior for $\lambda_i$ is as follows:

$$\lambda_i \overset{iid}{\sim} \text{Inverse-Gamma}\left(\frac{\nu}{2}, \frac{\nu}{2}\right).$$

  • Under this prior, the marginal likelihood for $Y_i$ is equivalent to a Student-t distribution,

$$Y_i = \alpha + x_i \beta + \epsilon_i, \quad \epsilon_i \overset{iid}{\sim} t_\nu(0, \sigma).$$

Understanding the equivalence

  • The heteroskedastic variance assumption is equivalent to assuming a heavy-tailed error distribution.

$$Y_i = \alpha + x_i \beta + \epsilon_i, \quad \epsilon_i \overset{iid}{\sim} t_\nu(0, \sigma)$$

$$\iff$$

$$Y_i = \alpha + x_i \beta + \epsilon_i, \quad \epsilon_i \overset{ind}{\sim} N(0, \sigma^2 \lambda_i), \quad \lambda_i \overset{iid}{\sim} \text{Inverse-Gamma}\left(\frac{\nu}{2}, \frac{\nu}{2}\right)$$

  • Note that since the number of $\lambda_i$ parameters equals the number of observations, this model will not have a proper posterior distribution without a proper prior distribution on $\lambda_i$.

Understanding the equivalence

$$f(Y_i) = \int_0^\infty f(Y_i, \lambda_i) \, d\lambda_i = \int_0^\infty f(Y_i \mid \lambda_i) f(\lambda_i) \, d\lambda_i = \int_0^\infty N(Y_i; \mu_i, \sigma^2 \lambda_i) \, \text{Inverse-Gamma}\left(\lambda_i; \frac{\nu}{2}, \frac{\nu}{2}\right) d\lambda_i = t_\nu(\mu_i, \sigma).$$

  • The marginal likelihood can be viewed as a scale mixture of Gaussians, with Inverse-Gamma mixing over the variance.

Understanding the equivalence

A random variable $T_i \overset{iid}{\sim} t_\nu$ can be written as a function of Gaussian and $\chi^2$ random variables,

$$T_i = \frac{Z_i}{\sqrt{W_i / \nu}}, \quad Z_i \overset{iid}{\sim} N(0, 1), \quad W_i \overset{iid}{\sim} \chi^2_\nu$$
$$= Z_i \sqrt{\nu V_i}, \quad V_i = W_i^{-1} \overset{iid}{\sim} \text{Inv-}\chi^2_\nu$$
$$= \sqrt{\lambda_i} \, Z_i, \quad \lambda_i = \nu V_i \overset{iid}{\sim} \text{Inverse-Gamma}\left(\frac{\nu}{2}, \frac{\nu}{2}\right).$$

  • We then have: $Y_i = \mu_i + \sigma T_i \sim t_\nu(\mu_i, \sigma)$.
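
A quick Monte Carlo check of this equivalence (a sketch; nu = 5 is an arbitrary choice): draws of $\sqrt{\lambda_i} Z_i$ with $\lambda_i \sim$ Inverse-Gamma$(\nu/2, \nu/2)$ should match draws from $t_\nu$.

set.seed(1)
nu <- 5; m <- 1e5
lambda <- 1 / rgamma(m, shape = nu / 2, rate = nu / 2)  # Inverse-Gamma(nu/2, nu/2) draws
mix <- sqrt(lambda) * rnorm(m)                          # scale mixture of normals
qqplot(mix, rt(m, df = nu)); abline(0, 1)               # points should fall on the identity line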

Student-t in Stan

data {
  int<lower = 1> n;
  int<lower = 1> p;
  vector[n] Y;
  matrix[n, p] X;
}
parameters {
  real alpha;
  vector[p] beta;
  real<lower = 0> sigma;
  real<lower = 0> nu;  // degrees of freedom
}
model {
  target += student_t_lpdf(Y | nu, alpha + X * beta, sigma);
}

Student-t in Stan: mixture

data {
  int<lower = 1> n;
  int<lower = 1> p;
  vector[n] Y;
  matrix[n, p] X;
}
parameters {
  real alpha;
  vector[p] beta;
  real<lower = 0> sigma;
  vector<lower = 0>[n] lambda;  // observation-specific scales; must be positive
  real<lower = 0> nu;
}
transformed parameters {
  vector[n] tau = sigma * sqrt(lambda);  // tau_i = sigma * sqrt(lambda_i)
}
model {
  target += normal_lpdf(Y | alpha + X * beta, tau);
  target += inv_gamma_lpdf(lambda | 0.5 * nu, 0.5 * nu);  // Inverse-Gamma(nu/2, nu/2) mixing
}

Why heavy-tailed distributions?

  • Replacing the normal distribution with a distribution with heavy-tails (e.g., Student-t, Laplace) is a common approach to robust regression.

  • Robust regression refers to regression methods which are less sensitive to outliers or small sample sizes.

  • Linear regression, including Bayesian regression with normally distributed errors, is sensitive to outliers because the normal distribution has thin tails.

  • The heteroskedastic model we just explored is only one example of a robust regression model.

Visualizing heavy-tailed distributions

Another example of robust regression

  • Let’s revisit our general heteroskedastic regression, $Y_i = \alpha + x_i \beta + \epsilon_i$, $\epsilon_i \overset{ind}{\sim} N(0, \sigma^2 \lambda_i)$.

  • We can induce another form of robust regression using the following prior for $\lambda_i$: $\lambda_i \overset{iid}{\sim} \text{Exponential}(1/2)$.

  • Under this prior, the induced marginal model is $Y_i = \alpha + x_i \beta + \epsilon_i$, $\epsilon_i \overset{iid}{\sim} \text{Laplace}(\mu = 0, \sigma)$ (a simulation check follows below).

  • This has a really nice interpretation!
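
A quick simulation check of this claim (a sketch, with sigma = 1): the normal scale mixture with Exponential(1/2) mixing should match a Laplace(0, sigma) sample, here generated as a difference of two independent exponentials.

set.seed(1)
m <- 1e5; sigma <- 1
lambda <- rexp(m, rate = 1 / 2)                 # Exponential(1/2) mixing draws
mix <- sigma * sqrt(lambda) * rnorm(m)          # normal scale mixture
lap <- rexp(m, 1 / sigma) - rexp(m, 1 / sigma)  # Laplace(0, sigma) via difference of exponentials
qqplot(mix, lap); abline(0, 1)                  # points should fall on the identity line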

Laplace distribution

Suppose a variable $Y_i$ follows a Laplace (or double exponential) distribution; then the pdf is given by,

$$f(Y_i \mid \mu, \sigma) = \frac{1}{2\sigma} \exp\left\{-\frac{|Y_i - \mu|}{\sigma}\right\}$$

  • $E[Y_i] = \mu$

  • $V(Y_i) = 2\sigma^2$

  • Under the Laplace likelihood, estimation of $\mu$ is equivalent to estimating the population median of $Y_i$ (a numerical check follows below).
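
A quick numerical illustration of this fact using the IgG data: the value of $\mu$ maximizing the Laplace likelihood (equivalently, minimizing $\sum_i |Y_i - \mu|$) coincides with the sample median.

obj <- function(m) sum(abs(ImmunogG$IgG - m))  # negative Laplace log-likelihood, up to constants
optimize(obj, range(ImmunogG$IgG))$minimum     # numerical minimizer
median(ImmunogG$IgG)                           # matches the sample median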

Median regression using Laplace

Least absolute deviation (LAD) regression minimizes the following objective function,

$$\left(\hat{\alpha}_{\text{LAD}}, \hat{\beta}_{\text{LAD}}\right) = \underset{\alpha, \beta}{\arg\min} \sum_{i=1}^n |Y_i - \mu_i|, \quad \mu_i = \alpha + x_i \beta.$$

The Bayesian analog is the Laplace distribution,

$$f(\mathbf{Y} \mid \alpha, \beta, \sigma) = \left(\frac{1}{2\sigma}\right)^n \exp\left\{-\frac{\sum_{i=1}^n |Y_i - \mu_i|}{\sigma}\right\}.$$

Median regression using Laplace

  • The Laplace distribution is analogous to least absolute deviations because the kernel of the log density is $-|Y_i - \mu| / \sigma$, so maximizing the likelihood also minimizes the sum of absolute distances.

  • The Laplace distribution is also known as the double-exponential distribution (two exponential distributions joined back-to-back symmetrically around $\mu$, each with scale $\sigma$).

  • Thus, a linear regression with Laplace errors is analogous to a median regression.

  • Why is median regression considered more robust than mean regression? (The sketch below illustrates.)
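
A small simulation suggests an answer: a single gross outlier pulls the least-squares slope but barely moves the median-regression slope. The sketch below uses the quantreg package (an assumption; it is not used elsewhere in the lecture).

library(quantreg)           # assumed available; provides rq() for quantile regression
set.seed(1)
x <- 1:20
y <- 2 * x + rnorm(20)
y[20] <- 200                # one gross outlier
coef(lm(y ~ x))             # least-squares slope is pulled upward
coef(rq(y ~ x, tau = 0.5))  # median-regression slope stays near 2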

Laplace regression in Stan

data {
  int<lower = 1> n;
  int<lower = 1> p;
  vector[n] Y;
  matrix[n, p] X;
}
parameters {
  real alpha;
  vector[p] beta;
  real<lower = 0> sigma;
}
model {
  // Laplace (double exponential) likelihood
  target += double_exponential_lpdf(Y | alpha + X * beta, sigma);
}

Laplace regression in Stan: mixture

data {
  int<lower = 1> n;
  int<lower = 1> p;
  vector[n] Y;
  matrix[n, p] X;
}
parameters {
  real alpha;
  vector[p] beta;
  real<lower = 0> sigma;
  vector<lower = 0>[n] lambda;  // observation-specific scales; must be positive
}
transformed parameters {
  vector[n] tau = sigma * sqrt(lambda);
}
model {
  target += normal_lpdf(Y | alpha + X * beta, tau);
  target += exponential_lpdf(lambda | 0.5);  // Exponential(1/2) mixing gives Laplace errors marginally
}

Returning to IgG

Posterior of β

Model                                  Mean  Lower  Upper
Laplace                                0.73   0.56   0.89
Student-t                              0.69   0.55   0.82
Gaussian                               0.69   0.56   0.83
Gaussian with Covariates in Variance   0.76   0.62   0.89

Model Comparison

Model                                  elpd_diff  elpd_loo    looic
Student-t                                   0.00   -624.42  1248.84
Gaussian                                   -2.01   -626.43  1252.86
Gaussian with Covariates in Variance       -6.09   -630.51  1261.02
Laplace                                   -13.10   -637.52  1275.05
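
A comparison of this kind can be produced with the loo package. A sketch of how it might be computed, assuming each Stan program is augmented with a generated quantities block that stores the pointwise log-likelihood in log_lik, and where fit_t and fit_g denote hypothetical stanfit objects for the Student-t and Gaussian models:

library(loo)
ll_t <- extract_log_lik(fit_t)  # pointwise log-likelihood matrices
ll_g <- extract_log_lik(fit_g)
loo_t <- loo(ll_t)
loo_g <- loo(ll_g)
loo_compare(loo_t, loo_g)       # elpd_diff is relative to the best-scoring model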

Examining the Student-t model fit

Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean   sd 2.5% 97.5% n_eff Rhat
alpha   3.32    0.00 0.21 2.91  3.74  2084    1
beta[1] 0.69    0.00 0.07 0.55  0.82  2122    1
sigma   1.72    0.00 0.10 1.53  1.91  2744    1
nu      7.80    0.05 2.35 4.40 13.29  2617    1

Samples were drawn using NUTS(diag_e) at Sun Feb  9 14:54:53 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).


Summary of robust regression

  • Robust regression techniques can be used when the assumptions of constant variance and/or normality of the residuals do not hold.

  • Heteroskedastic variance can be viewed as inducing heavy-tailed distributions.

  • Regression with heavy-tailed Student-t or Laplace errors is robust to outliers.

  • Laplace regression is equivalent to median regression.

Prepare for next class

  • Work on HW 02, which is due before next class.

  • Complete the reading to prepare for next Thursday’s lecture.

  • Thursday’s lecture: Regularization
