This article walks through a full Bayesian linear regression workflow: construct the design matrix, derive the posterior, compare the MLE and MAP estimates, and compute the predictive distribution for a new input.
Setup
Task
Given training data, predict the output for a new input and quantify predictive uncertainty.
Data
Three training points with input $x \in \mathbb{R}$ and output $y \in \mathbb{R}$:
| $n$ | $x_n$ | $y_n$ |
|---|---|---|
| 1 | 0 | 1.2 |
| 2 | 1 | 2.8 |
| 3 | 2 | 4.1 |
Model
$$ y = w_0 + w_1 x + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2) $$

Use the feature map

$$ \psi(x) = (1, x)^\top $$

so that

$$ y = w^\top \psi(x) + \varepsilon, \qquad w = (w_0, w_1)^\top \in \mathbb{R}^2 $$

Notation
- $w \in \mathbb{R}^2$: weight vector
- $\Psi \in \mathbb{R}^{N \times D}$: design matrix, with $N = 3$ data points and $D = 2$ features
- $y \in \mathbb{R}^N$: observed response vector
- $S \in \mathbb{R}^{D \times D}$: prior covariance matrix
- $\sigma^2 \in \mathbb{R}$: known noise variance
- $\mu_{\text{post}}, \Sigma_{\text{post}}$: posterior mean and covariance
Fixed values
- Noise variance: $\sigma^2 = 1$
- Prior: $w \sim \mathcal{N}(0, S)$ with $S = 2I$
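Before working through the matrices by hand, the setup can be encoded in NumPy. This is a sketch of the quantities defined above (variable names are my own choices):

```python
import numpy as np

# Training data from the table above
x = np.array([0.0, 1.0, 2.0])
y = np.array([1.2, 2.8, 4.1])

# Feature map psi(x) = (1, x), applied row-wise to build the design matrix
Psi = np.column_stack([np.ones_like(x), x])

# Fixed values: known noise variance and zero-mean Gaussian prior w ~ N(0, S)
sigma2 = 1.0
S = 2.0 * np.eye(2)

print(Psi)
```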
Phase 0: Build the Matrices
Design matrix $\Psi$
Each row is $\psi(x_n)^\top = (1, x_n)$:
$$ \Psi = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{pmatrix} \in \mathbb{R}^{3 \times 2} $$

Observed vector

$$ y = \begin{pmatrix} 1.2 \\ 2.8 \\ 4.1 \end{pmatrix} $$

Precompute $\Psi^\top \Psi$

$$ \Psi^\top \Psi = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{pmatrix} = \begin{pmatrix} 3 & 3 \\ 3 & 5 \end{pmatrix} $$

Precompute $\Psi^\top y$

$$ \Psi^\top y = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \end{pmatrix}\begin{pmatrix} 1.2 \\ 2.8 \\ 4.1 \end{pmatrix} = \begin{pmatrix} 8.1 \\ 11.0 \end{pmatrix} $$

Precompute $S^{-1}$

$$ S^{-1} = (2I)^{-1} = \frac{1}{2}I = \begin{pmatrix} 0.5 & 0 \\ 0 & 0.5 \end{pmatrix} $$

Phase 1: Derive the Posterior
Start from
$$ \log p(w \mid y, \Psi) = \log p(w) + \log p(y \mid w, \Psi) + \text{const} $$

Prior term

$$ \log p(w) = -\frac{1}{2} w^\top S^{-1} w + \text{const} $$

Likelihood term

$$ \log p(y \mid w, \Psi) = -\frac{1}{2\sigma^2}\lVert y - \Psi w \rVert^2 + \text{const} $$

Plug in $\sigma^2 = 1$:

$$ \log p(w \mid y) = -\frac{1}{2} w^\top S^{-1} w - \frac{1}{2}(y - \Psi w)^\top (y - \Psi w) + \text{const} $$

Expand the quadratic form:

$$ \lVert y - \Psi w \rVert^2 = y^\top y - 2 y^\top \Psi w + w^\top \Psi^\top \Psi w $$

Collect the terms involving $w$:

$$ \log p(w \mid y) = -\frac{1}{2} w^\top (S^{-1} + \sigma^{-2}\Psi^\top \Psi) w + \sigma^{-2} y^\top \Psi w + \text{const} $$

This is a Gaussian quadratic form, so

$$ w \mid y \sim \mathcal{N}(\mu_{\text{post}}, \Sigma_{\text{post}}) $$

with

$$ \Sigma_{\text{post}} = (\sigma^{-2}\Psi^\top \Psi + S^{-1})^{-1} $$

and

$$ \mu_{\text{post}} = \sigma^{-2}\Sigma_{\text{post}}\Psi^\top y $$

Phase 2: Compute the Posterior Numerically
Step 1: Posterior precision matrix
$$ \Sigma_{\text{post}}^{-1} = \sigma^{-2}\Psi^\top \Psi + S^{-1} = \begin{pmatrix} 3 & 3 \\ 3 & 5 \end{pmatrix} + \begin{pmatrix} 0.5 & 0 \\ 0 & 0.5 \end{pmatrix} = \begin{pmatrix} 3.5 & 3 \\ 3 & 5.5 \end{pmatrix} $$

Step 2: Invert to get $\Sigma_{\text{post}}$

For a $2 \times 2$ matrix,

$$ \begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1} = \frac{1}{ad - bc}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix} $$

so

$$ \det(\Sigma_{\text{post}}^{-1}) = 3.5 \times 5.5 - 3 \times 3 = 19.25 - 9 = 10.25 $$

Therefore,

$$ \Sigma_{\text{post}} = \frac{1}{10.25}\begin{pmatrix} 5.5 & -3 \\ -3 & 3.5 \end{pmatrix} \approx \begin{pmatrix} 0.537 & -0.293 \\ -0.293 & 0.341 \end{pmatrix} $$

Step 3: Posterior mean

$$ \mu_{\text{post}} = \Sigma_{\text{post}}\Psi^\top y = \frac{1}{10.25}\begin{pmatrix} 5.5 & -3 \\ -3 & 3.5 \end{pmatrix}\begin{pmatrix} 8.1 \\ 11.0 \end{pmatrix} \approx \begin{pmatrix} 1.127 \\ 1.385 \end{pmatrix} $$

Interpretation
The posterior mean says
- $w_0 \approx 1.13$ for the intercept
- $w_1 \approx 1.39$ for the slope
Looking back at the data, the line is clearly increasing, and the slope is around $1.4$ to $1.5$. The prior mean is $0$, so the posterior is pulled slightly toward the origin. That is the regularization effect of Bayesian inference.
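The hand computation above can be reproduced in a few lines of NumPy. The following is a minimal sketch (variable names are my own):

```python
import numpy as np

Psi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # design matrix
y = np.array([1.2, 2.8, 4.1])
sigma2 = 1.0
S_inv = 0.5 * np.eye(2)  # prior precision, since S = 2I

# Posterior precision, covariance, and mean
precision = Psi.T @ Psi / sigma2 + S_inv
Sigma_post = np.linalg.inv(precision)
mu_post = Sigma_post @ Psi.T @ y / sigma2

print(mu_post)  # ≈ [1.127, 1.385]
```

In practice one would solve the linear system rather than form the explicit inverse, but the inverse mirrors the derivation and is also needed later for the predictive variance.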
Phase 3: MLE, MAP, and Ridge
MLE
Without a prior, the least-squares estimate is
$$ \hat{w}_{\text{MLE}} = (\Psi^\top \Psi)^{-1}\Psi^\top y $$

and

$$ (\Psi^\top \Psi)^{-1} = \frac{1}{3 \times 5 - 9}\begin{pmatrix} 5 & -3 \\ -3 & 3 \end{pmatrix} = \frac{1}{6}\begin{pmatrix} 5 & -3 \\ -3 & 3 \end{pmatrix} = \begin{pmatrix} 0.833 & -0.5 \\ -0.5 & 0.5 \end{pmatrix} $$

So

$$ \hat{w}_{\text{MLE}} = \begin{pmatrix} 0.833 & -0.5 \\ -0.5 & 0.5 \end{pmatrix}\begin{pmatrix} 8.1 \\ 11.0 \end{pmatrix} = \begin{pmatrix} 1.250 \\ 1.450 \end{pmatrix} $$

MAP
Because the posterior is Gaussian, the posterior mean and posterior mode are the same. So in this example,
$$ \hat{w}_{\text{MAP}} = \mu_{\text{post}} \approx \begin{pmatrix} 1.127 \\ 1.385 \end{pmatrix} $$

Ridge connection
With $\sigma^2 = 1$ and $S = 2I$, maximizing the posterior is equivalent to minimizing
$$ \lVert y - \Psi w \rVert^2 + \frac{1}{2}\lVert w \rVert^2 $$

which is exactly a ridge regression objective with penalty $\lambda = 0.5$.
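The ridge connection is easy to check numerically. The sketch below (same assumed setup as before) computes the MLE and the closed-form ridge solution with $\lambda = 0.5$; the latter should coincide with the posterior mean:

```python
import numpy as np

Psi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.2, 2.8, 4.1])

# MLE: ordinary least squares via the normal equations
w_mle = np.linalg.solve(Psi.T @ Psi, Psi.T @ y)  # = [1.25, 1.45]

# Ridge with lambda = 0.5; note the intercept is penalized too,
# matching the isotropic prior over both components of w
lam = 0.5
w_ridge = np.linalg.solve(Psi.T @ Psi + lam * np.eye(2), Psi.T @ y)

print(w_mle)    # [1.25 1.45]
print(w_ridge)  # ≈ [1.127, 1.385], identical to the MAP estimate
```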
Comparison
| | $w_0$ | $w_1$ |
|---|---|---|
| MLE | 1.250 | 1.450 |
| MAP / Posterior mean | 1.127 | 1.385 |
| Prior mean | 0 | 0 |
The Bayesian estimate sits between the MLE and the prior mean. A stronger prior, meaning a smaller covariance $S$, would pull the estimate more aggressively toward $0$.
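To see this pull concretely, here is a small sketch that recomputes the posterior mean for several prior covariances $S = sI$ (the particular $s$ values are arbitrary choices for illustration):

```python
import numpy as np

Psi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.2, 2.8, 4.1])

# Smaller s means a stronger prior and more shrinkage toward the
# zero prior mean; large s approaches the MLE
for s in [100.0, 2.0, 0.1]:
    S_inv = np.eye(2) / s
    mu = np.linalg.solve(Psi.T @ Psi + S_inv, Psi.T @ y)
    print(f"s = {s:6.1f}  ->  mu_post ≈ {mu.round(3)}")
```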
Phase 4: Predictive Distribution
Goal
For a new input $x_* = 3$, compute the predictive distribution of $y_*$.
Formula
$$ p(y_* \mid x_*, y, \Psi) = \mathcal{N}(y_*; \mu_*, \sigma_*^2) $$

where

$$ \mu_* = \psi(x_*)^\top \mu_{\text{post}} $$

and

$$ \sigma_*^2 = \psi(x_*)^\top \Sigma_{\text{post}} \psi(x_*) + \sigma^2 $$

Step 1: Feature vector

$$ \psi(x_*) = \psi(3) = \begin{pmatrix} 1 \\ 3 \end{pmatrix} $$

Step 2: Predictive mean

$$ \mu_* = \begin{pmatrix} 1 & 3 \end{pmatrix}\begin{pmatrix} 1.127 \\ 1.385 \end{pmatrix} \approx 5.282 $$

Step 3: Predictive variance
Using the exact posterior covariance,
$$ \psi(x_*)^\top \Sigma_{\text{post}} \psi(x_*) = \begin{pmatrix} 1 & 3 \end{pmatrix}\frac{1}{10.25}\begin{pmatrix} 5.5 & -3 \\ -3 & 3.5 \end{pmatrix}\begin{pmatrix} 1 \\ 3 \end{pmatrix} \approx 1.854 $$

Therefore,

$$ \sigma_*^2 \approx 1.854 + 1 = 2.854 $$

So the predictive distribution is

$$ y_* \mid x_* = 3 \sim \mathcal{N}(5.282, 2.854) $$

Interpretation
The predictive mean is about $5.28$. Since $\sqrt{2.854} \approx 1.689$, a rough $\pm 2$ standard deviation interval is
$$ [5.282 - 2(1.689),\ 5.282 + 2(1.689)] \approx [1.90,\ 8.66] $$

This interval is fairly wide because $x_* = 3$ lies outside the observed training range, which only goes up to $x = 2$. Bayesian linear regression naturally reflects this by increasing predictive uncertainty farther away from the data.
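Phase 4 can be sketched end to end in NumPy, building on the posterior computed earlier:

```python
import numpy as np

Psi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.2, 2.8, 4.1])
sigma2 = 1.0
S_inv = 0.5 * np.eye(2)

# Posterior, as in Phase 2
Sigma_post = np.linalg.inv(Psi.T @ Psi / sigma2 + S_inv)
mu_post = Sigma_post @ Psi.T @ y / sigma2

# Predictive distribution at x_* = 3
psi_star = np.array([1.0, 3.0])
mu_star = psi_star @ mu_post                          # ≈ 5.28
var_star = psi_star @ Sigma_post @ psi_star + sigma2  # ≈ 2.854

sd = np.sqrt(var_star)
print(f"y_* ~ N({mu_star:.3f}, {var_star:.3f}); "
      f"±2 sd interval [{mu_star - 2 * sd:.2f}, {mu_star + 2 * sd:.2f}]")
```

Note the two sources of predictive variance: the first term shrinks with more data, while the noise floor $\sigma^2$ does not.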