Neural Network Lifecycle: Forward Pass, Backpropagation, and Parameter Updates

This article walks through a small 2-2-1 neural network (two inputs, two hidden units, one output) to show one complete training step: a forward pass, backpropagation, and a gradient-descent parameter update. If you want to expand any step further, a language model or a textbook can help unpack the details.

Setup

Task

Input: $x \in \mathbb{R}^2$
Output: $\hat{y} \in \mathbb{R}$

Structure

$$ \begin{aligned} h &= \operatorname{ReLU}(W^{(1)}x + b^{(1)}) \\ \hat{y} &= W^{(2)}h + b^{(2)} \end{aligned} $$

Parameter dimensionality

$$ \begin{aligned} W^{(1)} &\in \mathbb{R}^{2 \times 2}, \quad b^{(1)} \in \mathbb{R}^2 \\ W^{(2)} &\in \mathbb{R}^{1 \times 2}, \quad b^{(2)} \in \mathbb{R} \end{aligned} $$

Loss function

$$ L = \frac{1}{2}(\hat{y} - y)^2 $$
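The setup above can be written as a few small functions. This is a minimal sketch in plain Python with no dependencies; the helper names (`relu`, `affine`, `predict`, `loss`) and the choice to store matrices as lists of rows are assumptions made for illustration, not part of the original derivation.

```python
def relu(v):
    """Apply ReLU element-wise to a list of numbers."""
    return [max(0.0, a) for a in v]


def affine(W, v, b):
    """Return W v + b for a matrix W (list of rows) and vectors v, b."""
    return [sum(w * a for w, a in zip(row, v)) + bias for row, bias in zip(W, b)]


def predict(params, x):
    """Forward pass: h = ReLU(W1 x + b1), y_hat = W2 h + b2."""
    W1, b1, W2, b2 = params
    h = relu(affine(W1, x, b1))
    # W2 is a 1x2 matrix and b2 a scalar, matching the shapes listed above.
    return affine(W2, h, [b2])[0]


def loss(y_hat, y):
    """Squared-error loss L = 1/2 (y_hat - y)^2."""
    return 0.5 * (y_hat - y) ** 2
```

Keeping `W2` as a one-row matrix mirrors the $1 \times 2$ shape in the parameter table, so the same `affine` helper serves both layers.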

Phase 0: Initialization

$$ \begin{aligned} W^{(1)} &= \begin{pmatrix} 0.3 & -0.1 \\ 0.2 & 0.4 \end{pmatrix}, \quad b^{(1)} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \\ W^{(2)} &= \begin{pmatrix} 0.5 & -0.3 \end{pmatrix}, \quad b^{(2)} = 0 \end{aligned} $$

Phase 1: Forward Pass

Input:

$$ x = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \quad y = 1 $$

1. Pre-activation

$$ z^{(1)} = W^{(1)}x + b^{(1)} = \begin{pmatrix} 0.3 & -0.1 \\ 0.2 & 0.4 \end{pmatrix}\begin{pmatrix} 1 \\ 2 \end{pmatrix} = \begin{pmatrix} 0.1 \\ 1.0 \end{pmatrix} $$

2. Activation

$$ h = \operatorname{ReLU}(z^{(1)}) = \begin{pmatrix} \max(0, 0.1) \\ \max(0, 1.0) \end{pmatrix} = \begin{pmatrix} 0.1 \\ 1.0 \end{pmatrix} $$

3. Output

$$ \hat{y} = W^{(2)}h + b^{(2)} = \begin{pmatrix} 0.5 & -0.3 \end{pmatrix}\begin{pmatrix} 0.1 \\ 1.0 \end{pmatrix} = 0.05 - 0.3 = -0.25 $$

4. Loss

$$ L = \frac{1}{2}(\hat{y} - y)^2 = \frac{1}{2}(-0.25 - 1)^2 = \frac{1}{2}(-1.25)^2 = 0.78125 $$
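The four forward-pass steps can be reproduced numerically. The sketch below uses plain Python lists and the initial values from Phase 0; the variable names are illustrative, and `W2` is stored as the single row of $W^{(2)}$ for brevity.

```python
W1 = [[0.3, -0.1], [0.2, 0.4]]
b1 = [0.0, 0.0]
W2 = [0.5, -0.3]  # the single row of the 1x2 matrix W^(2)
b2 = 0.0
x, y = [1.0, 2.0], 1.0

# 1. Pre-activation: z1 = W1 x + b1
z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]

# 2. Activation: h = ReLU(z1)
h = [max(0.0, z) for z in z1]

# 3. Output: y_hat = W2 h + b2
y_hat = sum(w * hi for w, hi in zip(W2, h)) + b2

# 4. Loss: L = 1/2 (y_hat - y)^2
L = 0.5 * (y_hat - y) ** 2

# Rounded for display, since floating-point sums are not exact.
print([round(z, 6) for z in z1], round(y_hat, 6), round(L, 6))
```

Up to floating-point rounding, the results should agree with the hand calculation: $z^{(1)} = (0.1, 1.0)^\top$, $\hat{y} = -0.25$, and $L = 0.78125$.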

Phase 2: Backpropagation

1. Output error signal

$$ \frac{\partial L}{\partial \hat{y}} = \hat{y} - y = -1.25 $$

2. Gradients for the output layer

$$ \begin{aligned} \frac{\partial L}{\partial W^{(2)}} &= \frac{\partial L}{\partial \hat{y}} h^\top = -1.25 \begin{pmatrix} 0.1 & 1.0 \end{pmatrix} = \begin{pmatrix} -0.125 & -1.25 \end{pmatrix} \\ \frac{\partial L}{\partial b^{(2)}} &= \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial b^{(2)}} = (\hat{y} - y)(1) = -1.25 \end{aligned} $$

3. Error signal for the hidden layer

$$ \frac{\partial L}{\partial h} = \left(W^{(2)}\right)^\top \frac{\partial L}{\partial \hat{y}} = \begin{pmatrix} 0.5 \\ -0.3 \end{pmatrix}(-1.25) = \begin{pmatrix} -0.625 \\ 0.375 \end{pmatrix} $$

4. Pass through ReLU

Both entries of

$$ z^{(1)} = \begin{pmatrix} 0.1 \\ 1.0 \end{pmatrix} $$

are positive, so

$$ \operatorname{ReLU}'(z^{(1)}) = \begin{pmatrix} 1 \\ 1 \end{pmatrix} $$

and therefore

$$ \frac{\partial L}{\partial z^{(1)}} = \frac{\partial L}{\partial h} \odot \operatorname{ReLU}'(z^{(1)}) = \begin{pmatrix} -0.625 \\ 0.375 \end{pmatrix} $$
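In this example the ReLU step leaves the error signal unchanged because both hidden units are active. The sketch below contrasts that with a hypothetical pre-activation whose first entry is negative, a value that does not occur in the worked example and is used only to show the masking effect. The function name is illustrative, and treating the derivative at exactly zero as 0 is a common convention assumed here.

```python
def backprop_through_relu(dL_dh, z):
    """Element-wise dL/dh * ReLU'(z): keep the gradient where z > 0, else 0."""
    return [g if a > 0 else 0.0 for g, a in zip(dL_dh, z)]


dL_dh = [-0.625, 0.375]

# Values from the worked example: both units are active, signal passes through.
print(backprop_through_relu(dL_dh, [0.1, 1.0]))   # [-0.625, 0.375]

# Hypothetical case: the first unit is inactive, so its gradient is zeroed.
print(backprop_through_relu(dL_dh, [-0.1, 1.0]))  # [0.0, 0.375]
```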

5. Gradients for the first layer

$$ \begin{aligned} \frac{\partial L}{\partial W^{(1)}} &= \frac{\partial L}{\partial z^{(1)}} x^\top = \begin{pmatrix} -0.625 \\ 0.375 \end{pmatrix}\begin{pmatrix} 1 & 2 \end{pmatrix} = \begin{pmatrix} -0.625 & -1.25 \\ 0.375 & 0.75 \end{pmatrix} \\ \frac{\partial L}{\partial b^{(1)}} &= \frac{\partial L}{\partial z^{(1)}} = \begin{pmatrix} -0.625 \\ 0.375 \end{pmatrix} \end{aligned} $$
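The five backpropagation steps translate almost line for line into code. The sketch below recomputes the gradients in plain Python and then compares one entry, $\partial L / \partial W^{(1)}_{12}$, against a central finite difference as an independent sanity check. The finite-difference check, the step size `eps`, and all function names are additions for illustration and are not part of the derivation above.

```python
W1 = [[0.3, -0.1], [0.2, 0.4]]
b1 = [0.0, 0.0]
W2 = [0.5, -0.3]  # the single row of the 1x2 matrix W^(2)
b2 = 0.0
x, y = [1.0, 2.0], 1.0


def forward(W1, b1, W2, b2, x):
    """Return (z1, h, y_hat) for the 2-2-1 network."""
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    h = [max(0.0, z) for z in z1]
    y_hat = sum(w * hi for w, hi in zip(W2, h)) + b2
    return z1, h, y_hat


def loss_fn(W1, b1, W2, b2, x, y):
    """Squared-error loss for a single example."""
    return 0.5 * (forward(W1, b1, W2, b2, x)[2] - y) ** 2


def perturb(W, i, j, delta):
    """Return a copy of matrix W with entry (i, j) shifted by delta."""
    out = [row[:] for row in W]
    out[i][j] += delta
    return out


z1, h, y_hat = forward(W1, b1, W2, b2, x)

# Backward pass, in the same order as steps 1-5.
dL_dy = y_hat - y                                          # 1. output error signal
dL_dW2 = [dL_dy * hi for hi in h]                          # 2. dL/dy * h^T
dL_db2 = dL_dy
dL_dh = [w * dL_dy for w in W2]                            # 3. (W2)^T * dL/dy
dL_dz1 = [g if z > 0 else 0.0 for g, z in zip(dL_dh, z1)]  # 4. ReLU mask
dL_dW1 = [[g * xi for xi in x] for g in dL_dz1]            # 5. outer product with x
dL_db1 = list(dL_dz1)

# Central finite difference on W1[0][1] (row 1, column 2 in the math notation).
eps = 1e-6
numeric = (loss_fn(perturb(W1, 0, 1, eps), b1, W2, b2, x, y)
           - loss_fn(perturb(W1, 0, 1, -eps), b1, W2, b2, x, y)) / (2 * eps)

print("analytic:", round(dL_dW1[0][1], 6), "numeric:", round(numeric, 6))
```

Both numbers should come out close to $-1.25$, the top-right entry of $\partial L / \partial W^{(1)}$. The same check can be repeated for any other parameter by perturbing that entry instead.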

Phase 3: Parameter Update

Using gradient descent,

$$ \theta_{\text{new}} = \theta_{\text{old}} - \alpha \nabla_\theta L $$

Take $\alpha = 0.1$.

Updated second layer

$$ \begin{aligned} W^{(2)}_{\text{new}} &= W^{(2)} - \alpha \frac{\partial L}{\partial W^{(2)}} = \begin{pmatrix} 0.5 & -0.3 \end{pmatrix} - 0.1 \begin{pmatrix} -0.125 & -1.25 \end{pmatrix} = \begin{pmatrix} 0.5125 & -0.175 \end{pmatrix} \\ b^{(2)}_{\text{new}} &= b^{(2)} - \alpha \frac{\partial L}{\partial b^{(2)}} = 0 - 0.1(-1.25) = 0.125 \end{aligned} $$

Updated first layer

$$ \begin{aligned} W^{(1)}_{\text{new}} &= W^{(1)} - \alpha \frac{\partial L}{\partial W^{(1)}} = \begin{pmatrix} 0.3 & -0.1 \\ 0.2 & 0.4 \end{pmatrix} - 0.1 \begin{pmatrix} -0.625 & -1.25 \\ 0.375 & 0.75 \end{pmatrix} = \begin{pmatrix} 0.3625 & 0.025 \\ 0.1625 & 0.325 \end{pmatrix} \\ b^{(1)}_{\text{new}} &= b^{(1)} - \alpha \frac{\partial L}{\partial b^{(1)}} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} - 0.1 \begin{pmatrix} -0.625 \\ 0.375 \end{pmatrix} = \begin{pmatrix} 0.0625 \\ -0.0375 \end{pmatrix} \end{aligned} $$
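To confirm that the update moves in a useful direction, the step can be applied in code and the loss re-evaluated on the same example. This is a sketch in plain Python that takes the Phase 2 gradients as given; re-running the forward pass after the update is an extra check added here, and the helper name `loss_at` is illustrative.

```python
alpha = 0.1
x, y = [1.0, 2.0], 1.0

# Parameters from Phase 0 and gradients from Phase 2.
W1 = [[0.3, -0.1], [0.2, 0.4]]
b1 = [0.0, 0.0]
W2 = [0.5, -0.3]  # the single row of the 1x2 matrix W^(2)
b2 = 0.0
dW1 = [[-0.625, -1.25], [0.375, 0.75]]
db1 = [-0.625, 0.375]
dW2 = [-0.125, -1.25]
db2 = -1.25

# Gradient-descent step: theta_new = theta_old - alpha * gradient.
W1_new = [[w - alpha * g for w, g in zip(w_row, g_row)]
          for w_row, g_row in zip(W1, dW1)]
b1_new = [b - alpha * g for b, g in zip(b1, db1)]
W2_new = [w - alpha * g for w, g in zip(W2, dW2)]
b2_new = b2 - alpha * db2


def loss_at(W1, b1, W2, b2):
    """Forward pass followed by the squared-error loss on (x, y)."""
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    h = [max(0.0, z) for z in z1]
    y_hat = sum(w * hi for w, hi in zip(W2, h)) + b2
    return 0.5 * (y_hat - y) ** 2


loss_before = loss_at(W1, b1, W2, b2)
loss_after = loss_at(W1_new, b1_new, W2_new, b2_new)
print(round(loss_before, 5), round(loss_after, 5))
```

With these numbers the loss on this single example falls from 0.78125 to roughly 0.294, so one step with $\alpha = 0.1$ already moves $\hat{y}$ noticeably closer to the target.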

Flowcharts

[Figure: Flowchart of the neural network lifecycle]

[Figure: Forward and backward flow reference]

Handwritten Form

[Figure: Handwritten derivation of the neural network lifecycle]
