This article walks through a full Bayesian linear regression workflow: construct the design matrix, derive the posterior, compare MLE and MAP, and make a predictive distribution for a new input.

Setup

Task

Given training data, predict the output for a new input and quantify predictive uncertainty.

Data

Three training points with input $x \in \mathbb{R}$ and output $y \in \mathbb{R}$:

| $n$ | $x_n$ | $y_n$ |
| --- | --- | --- |
| 1 | 0 | 1.2 |
| 2 | 1 | 2.8 |
| 3 | 2 | 4.1 |

Model

$$ y = w_0 + w_1 x + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2) $$

Use the feature map

$$ \psi(x) = (1, x)^\top $$

so that

$$ y = w^\top \psi(x) + \varepsilon, \qquad w = (w_0, w_1)^\top \in \mathbb{R}^2 $$

Notation

  • $w \in \mathbb{R}^2$: weight vector
  • $\Psi \in \mathbb{R}^{N \times D}$: design matrix, with $N = 3$ data points and $D = 2$ features
  • $y \in \mathbb{R}^N$: observed response vector
  • $S \in \mathbb{R}^{D \times D}$: prior covariance matrix
  • $\sigma^2 \in \mathbb{R}$: known noise variance
  • $\mu_{\text{post}}, \Sigma_{\text{post}}$: posterior mean and covariance

Fixed values

  • Noise variance: $\sigma^2 = 1$
  • Prior: $w \sim \mathcal{N}(0, S)$ with
$$ S = 2I = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix} $$
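As a concrete reference for what follows, the fixed quantities can be set up in a few lines of NumPy (a sketch; the variable names are my own):

```python
import numpy as np

# Training inputs and outputs from the data table
x = np.array([0.0, 1.0, 2.0])
y = np.array([1.2, 2.8, 4.1])

# Fixed quantities: known noise variance and isotropic prior covariance S = 2I
sigma2 = 1.0
S = 2.0 * np.eye(2)
```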

Phase 0: Build the Matrices

Design matrix $\Psi$

Each row is $\psi(x_n)^\top = (1, x_n)$:

$$ \Psi = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{pmatrix} \in \mathbb{R}^{3 \times 2} $$

Observed vector

$$ y = \begin{pmatrix} 1.2 \\ 2.8 \\ 4.1 \end{pmatrix} $$

Precompute $\Psi^\top \Psi$

$$ \Psi^\top \Psi = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{pmatrix} = \begin{pmatrix} 3 & 3 \\ 3 & 5 \end{pmatrix} $$

Precompute $\Psi^\top y$

$$ \Psi^\top y = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \end{pmatrix}\begin{pmatrix} 1.2 \\ 2.8 \\ 4.1 \end{pmatrix} = \begin{pmatrix} 8.1 \\ 11.0 \end{pmatrix} $$

Precompute $S^{-1}$

$$ S^{-1} = (2I)^{-1} = \frac{1}{2}I = \begin{pmatrix} 0.5 & 0 \\ 0 & 0.5 \end{pmatrix} $$
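The Phase 0 precomputations above can be checked with a short NumPy sketch (variable names are illustrative):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])
y = np.array([1.2, 2.8, 4.1])

# Design matrix: each row is psi(x_n)^T = (1, x_n)
Psi = np.column_stack([np.ones_like(x), x])

PtP = Psi.T @ Psi                       # [[3, 3], [3, 5]]
Pty = Psi.T @ y                         # [8.1, 11.0]
S_inv = np.linalg.inv(2.0 * np.eye(2))  # 0.5 * I
```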

Phase 1: Derive the Posterior

Start from

$$ \log p(w \mid y, \Psi) = \log p(w) + \log p(y \mid w, \Psi) + \text{const} $$

Prior term

$$ \log p(w) = -\frac{1}{2} w^\top S^{-1} w + \text{const} $$

Likelihood term

$$ \log p(y \mid w, \Psi) = -\frac{1}{2\sigma^2}\lVert y - \Psi w \rVert^2 + \text{const} $$

Plug in $\sigma^2 = 1$:

$$ \log p(w \mid y) = -\frac{1}{2} w^\top S^{-1} w - \frac{1}{2}(y - \Psi w)^\top (y - \Psi w) + \text{const} $$

Expand the quadratic form:

$$ \lVert y - \Psi w \rVert^2 = y^\top y - 2 y^\top \Psi w + w^\top \Psi^\top \Psi w $$

Collect the terms involving $w$:

$$ \log p(w \mid y) = -\frac{1}{2} w^\top (S^{-1} + \sigma^{-2}\Psi^\top \Psi) w + \sigma^{-2} y^\top \Psi w + \text{const} $$

This is a Gaussian quadratic form, so

$$ w \mid y \sim \mathcal{N}(\mu_{\text{post}}, \Sigma_{\text{post}}) $$

with

$$ \Sigma_{\text{post}} = (\sigma^{-2}\Psi^\top \Psi + S^{-1})^{-1} $$

and

$$ \mu_{\text{post}} = \sigma^{-2}\Sigma_{\text{post}}\Psi^\top y $$

Phase 2: Compute the Posterior Numerically

Step 1: Posterior precision matrix

$$ \Sigma_{\text{post}}^{-1} = \sigma^{-2}\Psi^\top \Psi + S^{-1} = \begin{pmatrix} 3 & 3 \\ 3 & 5 \end{pmatrix} + \begin{pmatrix} 0.5 & 0 \\ 0 & 0.5 \end{pmatrix} = \begin{pmatrix} 3.5 & 3 \\ 3 & 5.5 \end{pmatrix} $$

Step 2: Invert to get $\Sigma_{\text{post}}$

For a $2 \times 2$ matrix,

$$ \begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1} = \frac{1}{ad - bc}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix} $$

so

$$ \det(\Sigma_{\text{post}}^{-1}) = 3.5 \times 5.5 - 3 \times 3 = 19.25 - 9 = 10.25 $$

Therefore,

$$ \Sigma_{\text{post}} = \frac{1}{10.25}\begin{pmatrix} 5.5 & -3 \\ -3 & 3.5 \end{pmatrix} \approx \begin{pmatrix} 0.537 & -0.293 \\ -0.293 & 0.341 \end{pmatrix} $$

Step 3: Posterior mean

With $\sigma^2 = 1$, the factor $\sigma^{-2}$ equals $1$, so

$$ \mu_{\text{post}} = \sigma^{-2}\Sigma_{\text{post}}\Psi^\top y = \frac{1}{10.25}\begin{pmatrix} 5.5 & -3 \\ -3 & 3.5 \end{pmatrix}\begin{pmatrix} 8.1 \\ 11.0 \end{pmatrix} \approx \begin{pmatrix} 1.127 \\ 1.385 \end{pmatrix} $$
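The whole Phase 2 computation fits in a few NumPy lines, following the Phase 1 formulas directly (a sketch):

```python
import numpy as np

Psi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.2, 2.8, 4.1])
sigma2 = 1.0
S_inv = 0.5 * np.eye(2)

# Posterior precision, covariance, and mean
precision = Psi.T @ Psi / sigma2 + S_inv     # [[3.5, 3], [3, 5.5]]
Sigma_post = np.linalg.inv(precision)
mu_post = Sigma_post @ (Psi.T @ y) / sigma2  # ~ [1.127, 1.385]
```

In practice one would prefer `np.linalg.solve(precision, Psi.T @ y / sigma2)` over forming the inverse explicitly; the explicit inverse is kept here only to mirror the hand calculation.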

Interpretation

The posterior mean gives

  • $w_0 \approx 1.13$ for the intercept
  • $w_1 \approx 1.39$ for the slope

Looking back at the data, the outputs increase by $1.6$ and then $1.3$ per unit step in $x$, so the data alone suggest a slope around $1.4$ to $1.5$. The prior mean is $0$, so the posterior mean is pulled slightly toward the origin. That shrinkage is the regularization effect of the prior.

Phase 3: MLE, MAP, and Ridge

MLE

Without a prior, maximizing the Gaussian likelihood is equivalent to ordinary least squares, giving

$$ \hat{w}_{\text{MLE}} = (\Psi^\top \Psi)^{-1}\Psi^\top y $$

and

$$ (\Psi^\top \Psi)^{-1} = \frac{1}{3 \times 5 - 9}\begin{pmatrix} 5 & -3 \\ -3 & 3 \end{pmatrix} = \frac{1}{6}\begin{pmatrix} 5 & -3 \\ -3 & 3 \end{pmatrix} \approx \begin{pmatrix} 0.833 & -0.5 \\ -0.5 & 0.5 \end{pmatrix} $$

So

$$ \hat{w}_{\text{MLE}} = \frac{1}{6}\begin{pmatrix} 5 & -3 \\ -3 & 3 \end{pmatrix}\begin{pmatrix} 8.1 \\ 11.0 \end{pmatrix} = \begin{pmatrix} 1.250 \\ 1.450 \end{pmatrix} $$
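The same MLE can be obtained with `np.linalg.lstsq`, which avoids forming $(\Psi^\top \Psi)^{-1}$ explicitly (a sketch):

```python
import numpy as np

Psi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.2, 2.8, 4.1])

# Ordinary least squares; equivalent to (Psi^T Psi)^{-1} Psi^T y here
w_mle, *_ = np.linalg.lstsq(Psi, y, rcond=None)  # ~ [1.25, 1.45]
```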

MAP

Because the posterior is Gaussian, the posterior mean and posterior mode are the same. So in this example,

$$ \hat{w}_{\text{MAP}} = \mu_{\text{post}} \approx \begin{pmatrix} 1.127 \\ 1.385 \end{pmatrix} $$

Ridge connection

With $\sigma^2 = 1$ and $S = 2I$, maximizing the posterior is equivalent to minimizing

$$ \lVert y - \Psi w \rVert^2 + \frac{1}{2}\lVert w \rVert^2 $$

which is exactly a ridge regression objective with penalty $\lambda = 0.5$.
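This equivalence is easy to verify numerically: solving the ridge normal equations with $\lambda = 0.5$ reproduces the posterior mean (a sketch):

```python
import numpy as np

Psi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.2, 2.8, 4.1])
lam = 0.5  # penalty implied by sigma^2 = 1 and S = 2I

# Ridge normal equations: (Psi^T Psi + lam * I) w = Psi^T y
w_ridge = np.linalg.solve(Psi.T @ Psi + lam * np.eye(2), Psi.T @ y)
# w_ridge ~ [1.127, 1.385], matching the MAP estimate
```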

Comparison

| Estimate | $w_0$ | $w_1$ |
| --- | --- | --- |
| MLE | 1.250 | 1.450 |
| MAP / Posterior mean | 1.127 | 1.385 |
| Prior mean | 0 | 0 |

The Bayesian estimate sits between the MLE and the prior mean. A stronger prior, meaning a smaller covariance $S$, would pull the estimate more aggressively toward $0$.

Phase 4: Predictive Distribution

Goal

For a new input $x_* = 3$, compute the predictive distribution of $y_*$.

Formula

$$ p(y_* \mid x_*, y, \Psi) = \mathcal{N}(y_*; \mu_*, \sigma_*^2) $$

where

$$ \mu_* = \psi(x_*)^\top \mu_{\text{post}} $$

and

$$ \sigma_*^2 = \psi(x_*)^\top \Sigma_{\text{post}} \psi(x_*) + \sigma^2 $$

Step 1: Feature vector

$$ \psi(x_*) = \psi(3) = \begin{pmatrix} 1 \\ 3 \end{pmatrix} $$

Step 2: Predictive mean

$$ \mu_* = \begin{pmatrix} 1 & 3 \end{pmatrix}\begin{pmatrix} 1.127 \\ 1.385 \end{pmatrix} \approx 5.282 $$

Step 3: Predictive variance

Using the exact posterior covariance,

$$ \psi(x_*)^\top \Sigma_{\text{post}} \psi(x_*) = \begin{pmatrix} 1 & 3 \end{pmatrix}\frac{1}{10.25}\begin{pmatrix} 5.5 & -3 \\ -3 & 3.5 \end{pmatrix}\begin{pmatrix} 1 \\ 3 \end{pmatrix} \approx 1.854 $$

Therefore,

$$ \sigma_*^2 \approx 1.854 + 1 = 2.854 $$

So the predictive distribution is

$$ y_* \mid x_* = 3 \sim \mathcal{N}(5.282, 2.854) $$
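Phase 4 in code, reusing the exact posterior from Phase 2 (a sketch):

```python
import numpy as np

# Exact posterior quantities from Phase 2
Sigma_post = np.array([[5.5, -3.0], [-3.0, 3.5]]) / 10.25
mu_post = Sigma_post @ np.array([8.1, 11.0])
sigma2 = 1.0

psi_star = np.array([1.0, 3.0])  # feature vector for x_* = 3
mu_star = psi_star @ mu_post                          # ~ 5.28
var_star = psi_star @ Sigma_post @ psi_star + sigma2  # ~ 2.854
```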

Interpretation

The predictive mean is about $5.28$. Since $\sqrt{2.854} \approx 1.689$, a rough $\pm 2$ standard deviation interval (approximately $95\%$ coverage for a Gaussian) is

$$ [5.282 - 2(1.689),\ 5.282 + 2(1.689)] \approx [1.90,\ 8.66] $$

This interval is fairly wide because $x_* = 3$ lies outside the observed training range, which only goes up to $x = 2$. Bayesian linear regression naturally reflects this by increasing predictive uncertainty farther away from the data.