The Reparameterization Trick

The Objective Setup

In variational inference, a core objective is maximizing the Evidence Lower Bound (ELBO). Let $x$ be observed data, $z$ be continuous latent variables, $p_\theta(x|z)$ be the generative likelihood, and $q_\phi(z|x)$ be the parameterized variational distribution. We wish to compute the gradient of the reconstruction term with respect to the variational parameters $\phi$:

$$ \nabla_\phi \mathcal{J}(\phi) = \nabla_\phi \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] $$

The Problem with Direct Differentiation

Directly computing this gradient is problematic because the integration measure depends on the parameters $\phi$ with respect to which we are differentiating:

$$ \nabla_\phi \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] = \nabla_\phi \int \log p_\theta(x|z) \, q_\phi(z|x) \, dz $$

By Leibniz’s rule, passing the gradient inside the integral yields:

$$ \int \log p_\theta(x|z) \nabla_\phi q_\phi(z|x) \, dz $$

Using the log-derivative trick ($\nabla_\phi q = q \nabla_\phi \log q$), we obtain the score function (REINFORCE) estimator:

$$ \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z) \, \nabla_\phi \log q_\phi(z|x)] $$

Why this fails in practice: While this Monte Carlo estimator is unbiased, its variance is prohibitively high. The term $\log p_\theta(x|z)$ acts as a scalar “reward” for the sample $z$. The estimator completely ignores the geometry and differentiability of the likelihood model $p_\theta$; it blindly perturbs $\phi$, samples $z$, and scales the gradient by the resulting scalar log-likelihood.
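A minimal numerical sketch of this estimator (not from the lecture; it assumes a toy model $\log p_\theta(x|z) = -\frac{1}{2}(x-z)^2$ with $q_\phi = \mathcal{N}(\mu, 1)$, for which the true gradient with respect to $\mu$ is $x - \mu$) shows that it is unbiased but noisy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: log p(x|z) = -0.5 * (x - z)^2, with q_phi(z) = N(mu, 1).
x, mu = 2.0, 0.0

def score_function_estimate(mu, n_samples):
    """One REINFORCE estimate of d/dmu E_q[log p(x|z)] for q = N(mu, 1)."""
    z = rng.normal(mu, 1.0, size=n_samples)
    log_p = -0.5 * (x - z) ** 2          # scalar "reward" per sample
    score = z - mu                       # d/dmu log N(z; mu, 1)
    return np.mean(log_p * score)

estimates = np.array([score_function_estimate(mu, 100) for _ in range(2000)])
# The true gradient here is x - mu = 2: the mean is right, the spread is wide.
print(estimates.mean(), estimates.std())
```

Even with 100 samples per estimate, the spread across estimates stays large relative to the gradient itself.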

The Reparameterization Trick

To resolve this, we assume $q_\phi(z|x)$ is a distribution from a location-scale family, specifically a diagonal multivariate Gaussian $\mathcal{N}(\mu_\phi(x), \text{diag}(\sigma^2_\phi(x)))$.

We can express the random variable $z \sim q_\phi(z|x)$ via a deterministic, differentiable transformation $g_\phi$ of an auxiliary noise variable $\epsilon$:

$$ z = g_\phi(\epsilon, x) = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon $$

where $\epsilon \sim p(\epsilon) = \mathcal{N}(0, I)$ and $\odot$ denotes the Hadamard (element-wise) product.
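A short sketch of this sampling path in plain NumPy; the values for $\mu_\phi(x)$ and $\sigma_\phi(x)$ are illustrative constants rather than encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, sigma, n_samples, rng):
    """Draw z ~ N(mu, diag(sigma^2)) as a deterministic map of eps ~ N(0, I)."""
    eps = rng.standard_normal((n_samples, mu.shape[0]))
    return mu + sigma * eps              # elementwise (Hadamard) product

mu = np.array([1.0, -2.0])               # stand-in for mu_phi(x)
sigma = np.array([0.5, 2.0])             # stand-in for sigma_phi(x)
z = reparameterize(mu, sigma, 100_000, rng)
print(z.mean(axis=0), z.std(axis=0))     # approximately mu and sigma
```

All randomness lives in `eps`; `mu` and `sigma` enter only through the deterministic map, which is exactly what lets gradients flow through it.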

Decoupling Stochasticity and Differentiability

By the Law of the Unconscious Statistician (LOTUS), we push the expectation forward from the $\phi$-dependent measure space to a fixed measure space. The objective becomes:

$$ \mathcal{J}(\phi) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)}[\log p_\theta(x|g_\phi(\epsilon, x))] $$

This achieves a strict architectural decoupling:

  1. The stochastic source is isolated strictly within $\epsilon \sim \mathcal{N}(0, I)$, a base distribution that is entirely independent of $\phi$.
  2. The differentiable transformation is isolated within the deterministic mapping $g_\phi(\epsilon, x)$.

Enabling Gradient Flow

Because the probability measure $p(\epsilon)$ no longer depends on $\phi$, the gradient operator can pass through the expectation without invoking the log-derivative trick:

$$ \nabla_\phi \mathcal{J}(\phi) = \nabla_\phi \mathbb{E}_{\epsilon \sim p(\epsilon)}[\log p_\theta(x|g_\phi(\epsilon, x))] = \mathbb{E}_{\epsilon \sim p(\epsilon)}[\nabla_\phi \log p_\theta(x|g_\phi(\epsilon, x))] $$

Applying the multivariate chain rule yields:

$$ \nabla_\phi \mathcal{J}(\phi) = \mathbb{E}_{\epsilon \sim p(\epsilon)} \left[ \nabla_z \log p_\theta(x|z) \Big|_{z=g_\phi(\epsilon, x)} \, \nabla_\phi g_\phi(\epsilon, x) \right] $$

For the Gaussian case, the gradients with respect to the specific parameters are simply:

$$ \nabla_{\mu_\phi} \mathcal{J} = \mathbb{E}_{\epsilon}\left[\nabla_z \log p_\theta(x|z)\right] $$

$$ \nabla_{\sigma_\phi} \mathcal{J} = \mathbb{E}_{\epsilon}\left[\nabla_z \log p_\theta(x|z) \odot \epsilon\right] $$

Connection to Low-Variance Monte Carlo Estimation

The new Monte Carlo estimator for the gradient is:

$$ \widehat{\nabla_\phi \mathcal{J}} = \frac{1}{L} \sum_{l=1}^L \nabla_z \log p_\theta(x|z^{(l)}) \nabla_\phi g_\phi(\epsilon^{(l)}, x), \quad \epsilon^{(l)} \sim \mathcal{N}(0,I) $$

Why the variance is radically lower: This is a pathwise derivative estimator. Instead of treating $\log p_\theta$ as a black-box scalar reward, the gradient now explicitly depends on $\nabla_z \log p_\theta(x|z)$. The analytic gradient of the likelihood surface dictates precisely how the latent variable $z$ should move to increase the objective, and $\nabla_\phi g_\phi$ dictates how the parameters $\mu_\phi, \sigma_\phi$ must change to effect that movement in $z$. Leveraging this first-order geometric information structurally reduces the variance of the gradient estimate, making stochastic gradient descent on the ELBO tractable.
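The variance claim can be checked numerically. The sketch below uses a toy model of my own choosing, not from the lecture: $\log p_\theta(x|z) = -\frac{1}{2}(x-z)^2$ with $q = \mathcal{N}(\mu, 1)$, so $g_\phi(\epsilon) = \mu + \epsilon$ and the true gradient with respect to $\mu$ is $x - \mu$. It compares the spread of the pathwise and score-function estimators at an equal sample budget:

```python
import numpy as np

rng = np.random.default_rng(0)
x, mu = 2.0, 0.0   # toy model: log p(x|z) = -0.5 (x - z)^2, q = N(mu, 1)

def pathwise_estimate(mu, n_samples):
    """Pathwise estimate of d/dmu E[log p(x|z)]: z = mu + eps, dz/dmu = 1."""
    eps = rng.standard_normal(n_samples)
    z = mu + eps
    return np.mean(x - z)                # grad_z log p(x|z) at the samples

def score_estimate(mu, n_samples):
    """Score-function (REINFORCE) estimate of the same gradient."""
    z = rng.normal(mu, 1.0, size=n_samples)
    return np.mean(-0.5 * (x - z) ** 2 * (z - mu))

path = np.array([pathwise_estimate(mu, 100) for _ in range(2000)])
score = np.array([score_estimate(mu, 100) for _ in range(2000)])
print(path.std(), score.std())           # pathwise is markedly tighter
```

Both estimators target the same gradient ($x - \mu = 2$ here), but the pathwise estimator's spread is several times smaller at the same cost.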

SGD

Initial Intuition:

Explain why SGD or mini-batch gradient descent is preferred over full batch gradient descent when the loss is an average over a very large dataset. Make the explanation mathematically precise using E(w) = (1/N) Σ_n L(t_n, y(x_n, w)), and distinguish computational cost, unbiased gradient estimation, and variance. Do not discuss topics not appearing in the lecture.

Explanation

To understand why Stochastic Gradient Descent (SGD) and mini-batch gradient descent are preferred over full batch gradient descent for very large datasets, we must analyze the mathematical structure of the objective function.

Given the loss function defined as an average over the entire dataset of size $N$:

$$E(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^N L(t_n, y(\mathbf{x}_n, \mathbf{w}))$$

The true (full batch) gradient with respect to the weights $\mathbf{w}$ is:

$$\nabla E(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^N \nabla L(t_n, y(\mathbf{x}_n, \mathbf{w}))$$

In mini-batch gradient descent, we approximate this true gradient using a random subset (batch) $\mathcal{B}$ of size $M$ (where $1 \le M \ll N$). The mini-batch gradient is:

$$\nabla E_{\mathcal{B}}(\mathbf{w}) = \frac{1}{M} \sum_{i \in \mathcal{B}} \nabla L(t_i, y(\mathbf{x}_i, \mathbf{w}))$$

(Note: Pure SGD is the special case where $M=1$).

The preference for SGD/mini-batch over full batch is driven by three distinct mathematical and statistical properties:

1. Computational Cost

For a single weight update, full batch gradient descent requires computing the gradient $\nabla L$ for all $N$ data points. The computational complexity per update step is $\mathcal{O}(N)$. When $N$ is very large (e.g., millions), a single step toward the minimum takes an enormous amount of time.

Conversely, a mini-batch update has a computational complexity of $\mathcal{O}(M)$ per step. Because $E(\mathbf{w})$ is an average, large datasets often contain highly redundant information. Calculating the gradient over a small, representative sample of size $M$ yields a direction that is largely similar to the full batch gradient. Therefore, in the time it takes full batch to make one update, mini-batch can make $N/M$ updates, leading to vastly faster initial progress in minimizing the loss function.

2. Unbiased Gradient Estimation

Although mini-batch gradient descent only uses a fraction of the data, it provides an unbiased estimator of the true gradient.

Because the mini-batch $\mathcal{B}$ is sampled uniformly at random from the full dataset, the expected value of the gradient for a single randomly chosen example $i$ is exactly the dataset average:

$$\mathbb{E}[\nabla L_i] = \frac{1}{N} \sum_{n=1}^N \nabla L_n = \nabla E(\mathbf{w})$$

By linearity of expectation, the expected value of the mini-batch gradient over the random sampling of batches is exactly the full batch gradient:

$$\mathbb{E}_{\mathcal{B}}[\nabla E_{\mathcal{B}}(\mathbf{w})] = \mathbb{E}_{\mathcal{B}}\left[ \frac{1}{M} \sum_{i \in \mathcal{B}} \nabla L_i \right] = \frac{1}{M} \sum_{i \in \mathcal{B}} \mathbb{E}[\nabla L_i] = \nabla E(\mathbf{w})$$

This unbiasedness guarantees that, on average, the mini-batch update moves the weights in the correct steepest descent direction, ensuring convergence despite not looking at the entire dataset at once.
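A quick simulation of this unbiasedness claim, using stand-in per-example gradient vectors rather than gradients from a real model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in per-example gradients grad L_n at a fixed w: N random D-vectors.
N, D = 10_000, 3
grads = rng.normal(size=(N, D))
full_grad = grads.mean(axis=0)           # the full-batch gradient

def minibatch_grad(M):
    """One mini-batch gradient from a uniformly sampled batch of size M."""
    idx = rng.integers(0, N, size=M)
    return grads[idx].mean(axis=0)

# Averaging many mini-batch gradients recovers the full-batch gradient.
avg = np.mean([minibatch_grad(32) for _ in range(5000)], axis=0)
print(np.abs(avg - full_grad).max())
```

Any single mini-batch gradient is noisy, but their average converges to the full-batch gradient, which is exactly what unbiasedness means.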

3. Variance

While the mini-batch gradient is unbiased, it is a random variable and therefore possesses variance. This variance dictates how “noisy” or erratic the weight updates will be.

Assuming the samples in the mini-batch are drawn independently, the variance of the mini-batch gradient is the variance of a single sample’s gradient divided by the batch size $M$:

$$\text{Var}(\nabla E_{\mathcal{B}}(\mathbf{w})) = \text{Var}\left( \frac{1}{M} \sum_{i \in \mathcal{B}} \nabla L_i \right) = \frac{1}{M} \text{Var}(\nabla L_i)$$

This mathematical property dictates the preference between pure SGD, mini-batch, and full batch:

• Pure SGD ($M=1$): Has the highest variance. The gradient estimates are noisy, causing the trajectory of $\mathbf{w}$ to oscillate wildly around the optimum.
• Full Batch ($M=N$): Has zero variance (it is deterministic), offering a perfectly smooth trajectory, but at the prohibitively high computational cost discussed above.
• Mini-batch ($1 < M \ll N$): Reduces the variance of pure SGD by a factor of $1/M$.

Mini-batch gradient descent is ultimately preferred because it leverages this variance property: by increasing $M$, we reduce the variance of the gradient estimate by a factor of $1/M$ (and its standard deviation by $1/\sqrt{M}$), yielding more stable convergence than pure SGD, while each update remains computationally inexpensive compared to the $\mathcal{O}(N)$ cost of full batch updates.
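The $1/M$ scaling can be verified directly, again with stand-in scalar per-example gradients rather than a real model:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
grads = rng.normal(size=N)               # stand-in per-example gradients

def batch_estimates(M, trials=4000):
    """Return `trials` mini-batch gradient estimates of batch size M."""
    idx = rng.integers(0, N, size=(trials, M))
    return grads[idx].mean(axis=1)

v1 = batch_estimates(1).var()            # pure SGD
v64 = batch_estimates(64).var()          # mini-batch, M = 64
print(v1, v64, v1 / v64)                 # ratio close to 64
```

The variance ratio between $M=1$ and $M=64$ comes out near 64, matching $\text{Var} \propto 1/M$.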


Multi-layer neural network

Explain why a multi-layer neural network with identity activation in every hidden layer is still only a linear model. Show explicitly how a composition of linear layers collapses into a single affine map, and explain why this destroys the universal approximation property.

Answer

To understand why a multi-layer neural network with identity activations is nothing more than a simple linear model, we have to look at the mathematics of how neural network layers interact.

In short: The composition of multiple affine (linear) transformations is simply another single affine transformation. Adding layers without adding non-linearities does not increase the mathematical capacity of the network; it only adds redundant parameters.

Here is the explicit breakdown of why this happens and why it destroys the network’s ability to learn complex patterns.


1. The Mathematical Collapse: Explicit Proof

Let’s define a standard feedforward neural network. In a typical network, the output of a specific layer $l$ is calculated as: $h_l = \sigma(W_l h_{l-1} + b_l)$

Where:

• $h_{l-1}$ is the input from the previous layer.
• $W_l$ is the weight matrix.
• $b_l$ is the bias vector.
• $\sigma$ is the activation function.

If we use an identity activation function, then $\sigma(x) = x$. The function does nothing, and the equation simplifies to an affine transformation: $h_l = W_l h_{l-1} + b_l$

Now, let’s look at a 3-layer neural network (Input $\rightarrow$ Hidden 1 $\rightarrow$ Hidden 2 $\rightarrow$ Output) with input $x$:

Layer 1: $h_1 = W_1 x + b_1$

Layer 2: $h_2 = W_2 h_1 + b_2$

Layer 3 (Output): $y = W_3 h_2 + b_3$

To see what the network is actually doing, we substitute the equation for $h_1$ into $h_2$, and then $h_2$ into $y$:

Step 1: Substitute Layer 1 into Layer 2

$h_2 = W_2 (W_1 x + b_1) + b_2$
$h_2 = (W_2 W_1)x + (W_2 b_1 + b_2)$

Step 2: Substitute Layer 2 into Layer 3

$y = W_3 \Big[ (W_2 W_1)x + (W_2 b_1 + b_2) \Big] + b_3$
$y = (W_3 W_2 W_1)x + (W_3 W_2 b_1 + W_3 b_2 + b_3)$

The Collapse

Notice that matrix multiplication is associative. $W_3 W_2 W_1$ is just a sequence of matrices multiplied together, which results in a single new matrix. Let’s call it $W_{equivalent}$. Similarly, $(W_3 W_2 b_1 + W_3 b_2 + b_3)$ is just a sequence of matrix-vector multiplications and vector additions, resulting in a single new vector. Let’s call it $b_{equivalent}$.

Therefore, the entire 3-layer network can be rewritten exactly as:

$$y = W_{equivalent}x + b_{equivalent}$$

No matter if you have 3 layers or 3,000 layers, if there are no non-linear activation functions (like ReLU, Sigmoid, or Tanh) between them, the entire network algebraically collapses into a single affine map. You have effectively built standard linear regression.
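This collapse is easy to verify numerically. The sketch below builds three random affine layers and checks that the single precomputed pair $(W_{equivalent}, b_{equivalent})$ reproduces them exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three random affine layers with identity activations.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(5, 4)), rng.normal(size=5)
W3, b3 = rng.normal(size=(2, 5)), rng.normal(size=2)

def deep_linear(x):
    """The 3-layer network with identity activations."""
    h1 = W1 @ x + b1
    h2 = W2 @ h1 + b2
    return W3 @ h2 + b3

# The equivalent single affine map, collapsed algebraically.
W_eq = W3 @ W2 @ W1
b_eq = W3 @ W2 @ b1 + W3 @ b2 + b3

x = rng.normal(size=3)
print(np.allclose(deep_linear(x), W_eq @ x + b_eq))   # True
```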


2. Why this Destroys the Universal Approximation Property

The Universal Approximation Theorem (UAT) states that a feedforward network with at least one hidden layer and a non-linear activation function can approximate any continuous function to an arbitrary degree of accuracy, provided it has enough neurons.

When you use linear (identity) activations, you lose this property entirely. Here is why:

A. Geometric Rigidity (Straight Lines Only)

An affine function ($y = Wx + b$) represents a hyperplane (a straight line in 2D, a flat plane in 3D, etc.). Because the deep linear network collapses into a single affine function, it can only ever draw flat decision boundaries or map flat relationships.

If your data forms a curve (e.g., $y = x^2$, a sine wave, or clusters of images that require complex topological boundaries), a flat plane will never be able to fit it. You cannot approximate a curve using a single straight line.

B. The XOR Problem

The simplest proof of this limitation is the classic XOR problem. If you try to separate data points representing the XOR logic gate, you cannot draw a single straight line to separate the True values from the False values. Because a linear neural network only ever acts as a single straight line, it mathematically cannot solve the XOR problem, no matter how wide or deep it is.
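Conversely, a single hidden layer with a non-linearity suffices. The weights below are a standard hand-constructed ReLU solution to XOR (an illustrative choice, not from the lecture):

```python
import numpy as np

# One hidden ReLU layer computing XOR exactly.
W1 = np.array([[1.0, 1.0], [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])

def xor_net(x):
    h = np.maximum(0.0, W1 @ x + b1)     # the non-linearity is essential here
    return w2 @ h

for x, t in [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]:
    print(x, xor_net(np.array(x, dtype=float)), t)
```

Replace `np.maximum(0.0, ...)` with the identity and the network collapses back to a single line, which provably cannot separate XOR.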

C. The Role of Non-Linearity

Non-linear activation functions (like ReLU) are what allow a neural network to “bend,” “fold,” and “piece together” the input space.

• A linear transformation rotates, scales, and shifts space.
• A non-linear activation bends and breaks that space.

By alternating between moving the space (affine transformations) and bending the space (non-linearities), neural networks can approximate highly complex, warped, and curved real-world functions. By removing the non-linear “bending” step, the network is doomed to only ever scale, rotate, and shift inputs in a strictly flat, linear manner.


Non-Linear Activation Functions

Explain why non-linear activation functions are essential for the expressive power of neural networks. Contrast linear networks with non-linear networks, and connect this to the claim that multi-layer feed-forward networks can act as universal function approximators.

Answer

To understand why non-linear activation functions are the lifeblood of modern neural networks, we must look at the mathematics of how networks process information and the geometry of how they shape data.

Here is an explanation of why non-linearity is essential for the expressive power of neural networks, contrasting linear and non-linear networks, and connecting this to the Universal Approximation Theorem.


1. The Trap of Linear Networks: “Linear Collapse”

To understand the necessity of non-linearities, we must first look at what happens when a neural network is entirely linear.

A linear activation function simply outputs the input it receives, perhaps multiplied by a constant (e.g., $f(x) = x$). If a neural network uses only linear activation functions, the entire network—no matter how many layers or thousands of neurons it has—mathematically collapses into a single-layer linear regression model.

The Math: Imagine a two-layer neural network.

• The output of the first layer is $h = W_1x + b_1$ (where $W$ is weights, $b$ is biases, and $x$ is the input).
• The output of the second layer is $y = W_2h + b_2$.

If we substitute the first layer into the second, we get:

$y = W_2(W_1x + b_1) + b_2$
$y = (W_2W_1)x + (W_2b_1 + b_2)$

Because multiplying two matrices ($W_2W_1$) just yields another matrix, and adding two vectors ($W_2b_1 + b_2$) yields another vector, the equation simplifies to: $y = W_{new}x + b_{new}$

The Consequence: A million-layer linear network has the exact same expressive power as a one-layer linear network. It can only draw straight lines (or flat hyperplanes) through data. It is utterly incapable of solving non-linear problems, such as the famous XOR logic gate, or recognizing a face in an image.

2. The Power of Non-Linear Networks

When we introduce a non-linear activation function—such as ReLU (Rectified Linear Unit), Sigmoid, or Tanh—we prevent this mathematical collapse.

If we apply a non-linear function $\sigma$ to the hidden layer, the equation becomes: $y = W_2(\sigma(W_1x + b_1)) + b_2$

Because $\sigma$ is non-linear, you can no longer factor out the matrices and collapse the layers. Each new layer adds genuine depth and complexity.

The Geometric Intuition: Imagine your dataset is a crumpled piece of paper with red dots on the inside and blue dots on the outside.

• A linear network can only take a flat sheet of glass and slice it straight through the paper. It will never separate all the red dots from the blue dots.
• A non-linear network can stretch, fold, warp, and twist the space the paper exists in. By passing the data through successive non-linear layers, the network effectively “uncrumples” the paper until the red and blue dots are neatly separated, at which point a final linear layer can draw a straight line between them.

This ability to warp and mold data spaces is what we mean by expressive power.

3. Connection to the Universal Approximation Theorem

The absolute necessity of non-linear activation functions is formally proven by the Universal Approximation Theorem (UAT).

The UAT states that a feed-forward neural network with just one hidden layer containing a finite number of neurons can approximate any continuous function to any desired degree of accuracy—provided it uses a non-linear activation function.

Why is non-linearity the secret ingredient for universal approximation? You can think of it through the analogy of building blocks:

  1. Building “Bumps”: If you take two non-linear functions (like two Sigmoid curves or two ReLUs) and subtract one from the other, you can create a localized “bump” or a square step.
  2. Creating Lego Bricks: Each hidden neuron in a network can be configured to represent one of these little bumps occurring at a specific point along an axis.
  3. Approximating the Curve: By changing the weights, the network can change the height and width of these bumps. If you have enough neurons, you can line up thousands of these distinct bumps side-by-side to approximate any complex curve or shape—just like how digital pixels can form a curved circle, or how Riemann sums approximate an integral in calculus.

If the activation functions were linear, adding them together would only ever yield a straight line. You cannot build a “bump” out of straight lines. Therefore, you cannot build a universal approximator.
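The bump construction can be sketched directly; the steepness $k$ and the two shift points are illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two steep sigmoids, shifted and subtracted: a localized "bump" around 0.5.
k = 50.0                                  # steepness (illustrative)
xs = np.linspace(-1.0, 2.0, 301)
bump = sigmoid(k * (xs - 0.25)) - sigmoid(k * (xs - 0.75))

print(bump[np.argmin(np.abs(xs - 0.5))])  # near 1 inside the bump
print(bump[0], bump[-1])                  # near 0 outside it
```

Summing many such bumps with different heights and positions is the Riemann-sum-style approximation the argument above describes; the same subtraction applied to two linear functions would only ever produce another linear function.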

Summary

Non-linear activation functions are essential because they prevent deep neural networks from mathematically collapsing into simple linear regressions. They provide the network with the expressive power to warp space and learn complex, real-world patterns. Ultimately, they are the mathematical tools that allow a network to construct the localized “building blocks” required by the Universal Approximation Theorem to map any input to any output.


Backpropagation

Explain why backpropagation is computationally more efficient than computing the derivative with respect to each weight independently from scratch. Focus on repeated terms in the chain rule, error signals, and reverse traversal of the computation graph. Use the logistic least-squares example or a simple two-layer network.

Answer

To understand why backpropagation is computationally vastly superior to calculating derivatives independently from scratch, we must look at how the chain rule of calculus applies to neural networks.

At its core, backpropagation is an application of dynamic programming (specifically, reverse-mode automatic differentiation). It calculates the gradient of the loss function with respect to all weights in a single backward sweep by caching and reusing intermediate results, rather than redundantly recalculating them.

To illustrate this, let’s use a simple two-layer network with a logistic sigmoid activation and a least-squares loss function.

1. The Setup: A Two-Layer Logistic Least-Squares Network

Imagine a network with an input $x$, one hidden neuron, and one output neuron.

• Hidden layer pre-activation: $z_1 = w_1 x$
• Hidden layer activation: $a_1 = \sigma(z_1)$ (where $\sigma$ is the logistic sigmoid)
• Output layer pre-activation: $z_2 = w_2 a_1$
• Output layer prediction: $\hat{y} = \sigma(z_2)$
• Loss function (Least Squares): $L = \frac{1}{2}(\hat{y} - y)^2$

(Note: Biases are omitted for mathematical brevity, but the logic remains identical).

2. The Inefficient Way: Independent Computation (From Scratch)

If we were to compute the derivative of the loss $L$ with respect to each weight independently, we would trace the path from the loss backward to the specific weight, expanding the chain rule fully every time.

For $w_2$ (Output layer weight):

$$ \frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_2} \cdot \frac{\partial z_2}{\partial w_2} $$

For $w_1$ (Hidden layer weight):

$$ \frac{\partial L}{\partial w_1} = \left( \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_2} \right) \cdot \frac{\partial z_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1} $$

The Problem: Repeated Terms

Look at the terms in the parentheses for $w_1$. The expression $\left( \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_2} \right)$ was already completely calculated when we found the derivative for $w_2$.

If we compute derivatives independently, the computer “forgets” this calculation and evaluates it from scratch for $w_1$. Now imagine a modern network with 100 layers and billions of weights. If you calculate the derivative for a weight in the 1st layer, you have to multiply the gradients through layers 2 through 100. If you have 10,000 weights in the 1st layer, you will independently do that 100-layer multiplication 10,000 times. The computational complexity explodes.

3. The Efficient Way: Backpropagation

Backpropagation solves this through reverse traversal of the computation graph and the use of error signals. Because a neural network has many inputs/weights but only one scalar output (the Loss), traversing the graph from the Loss backward allows us to compute how sensitive the loss is to every single node in the network exactly once.

The Error Signal ($\delta$)

Backprop defines an “error signal,” usually denoted as $\delta$, for each layer. $\delta$ is the derivative of the loss with respect to the pre-activation ($z$) of a specific layer.

• $\delta_2 = \frac{\partial L}{\partial z_2}$
• $\delta_1 = \frac{\partial L}{\partial z_1}$

Let’s trace the graph in reverse, exactly as backpropagation does:

Step A: Compute the Output Error Signal ($\delta_2$)

Instead of calculating everything for a specific weight, backprop first asks: how does the loss change with respect to the output node $z_2$?

$$ \delta_2 = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_2} $$

For our logistic least-squares model, this evaluates to: $\delta_2 = (\hat{y} - y) \cdot \sigma'(z_2)$. We calculate this scalar value once and save it in memory.

Step B: Compute Gradients for Output Weights

Now, to find the gradient for $w_2$, we just multiply the saved error signal by the local derivative:

$$ \frac{\partial L}{\partial w_2} = \delta_2 \cdot \frac{\partial z_2}{\partial w_2} = \delta_2 \cdot a_1 $$

Step C: Propagate the Error Signal Backward ($\delta_1$)

We move leftwards in the graph. We need the error signal for the hidden layer ($\delta_1$). Because we saved $\delta_2$, we don’t start from the loss. We just pull $\delta_2$ backward through the weights and the hidden activation function:

$$ \delta_1 = \delta_2 \cdot \frac{\partial z_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} $$

For our model, this evaluates to: $\delta_1 = \delta_2 \cdot w_2 \cdot \sigma'(z_1)$. We calculate this value once and save it in memory.

Step D: Compute Gradients for Hidden Weights

To find the gradient for $w_1$, we simply use the newly saved $\delta_1$:

$$ \frac{\partial L}{\partial w_1} = \delta_1 \cdot \frac{\partial z_1}{\partial w_1} = \delta_1 \cdot x $$
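The four steps above can be sketched end-to-end and checked against finite differences (scalar weights with illustrative values; biases omitted, as in the setup):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(w1, w2, x, y):
    """Forward pass, then one backward sweep that reuses the error signals."""
    z1 = w1 * x
    a1 = sigmoid(z1)
    z2 = w2 * a1
    y_hat = sigmoid(z2)
    L = 0.5 * (y_hat - y) ** 2
    delta2 = (y_hat - y) * y_hat * (1.0 - y_hat)   # Step A: dL/dz2, once
    dw2 = delta2 * a1                              # Step B
    delta1 = delta2 * w2 * a1 * (1.0 - a1)         # Step C: reuses delta2
    dw1 = delta1 * x                               # Step D
    return L, dw1, dw2

w1, w2, x, y = 0.7, -1.3, 0.5, 1.0
L, dw1, dw2 = loss_and_grads(w1, w2, x, y)

# Central finite differences should agree with the backprop gradients.
eps = 1e-6
num_dw1 = (loss_and_grads(w1 + eps, w2, x, y)[0]
           - loss_and_grads(w1 - eps, w2, x, y)[0]) / (2 * eps)
print(dw1, num_dw1)
```

Note that $\delta_2$ is computed exactly once and then reused both for $\partial L / \partial w_2$ and inside $\delta_1$; the naive approach would recompute it.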

Summary of Efficiency

  1. Elimination of Redundancy: By defining the error signal $\delta$, backpropagation groups repeated chain-rule terms together. Once a $\delta$ is calculated for a layer, it is used to compute the gradients for all the weights in that layer, and it is also passed backward to compute the $\delta$ for the previous layer.
  2. Reverse Traversal Advantage: Computing forward (from weights to loss) would require one full pass per weight. Computing backward (from loss to weights) allows us to compute the gradient for all weights in exactly one backward pass.
  3. Computational Complexity: If a network has $W$ weights, the naive independent method requires roughly $O(W^2)$ operations (or worse, depending on depth). Backpropagation reduces this to $O(W)$ operations, requiring roughly the same amount of computation as a single forward pass. Without this mathematical trick, training deep neural networks would be computationally impossible.

Overfitting and L2 Regularization

Explain why highly expressive neural networks are vulnerable to overfitting. Connect universal approximation to memorizing training data, and explain how L2 regularization and early stopping help mitigate this problem. Stay within the concepts used in the quiz and lecture.

Answer

Here is an explanation of why highly expressive neural networks are vulnerable to overfitting and how we can mitigate it, based on core machine learning concepts.

The Problem: Universal Approximation and Memorization

A neural network is considered highly expressive when it has a large number of parameters (weights and biases) and non-linear activation functions. This expressiveness is governed by the Universal Approximation Theorem, which states that a neural network with enough capacity (e.g., enough hidden neurons/layers) can approximate virtually any continuous mathematical function to an arbitrary level of precision.

While this makes neural networks incredibly powerful, it is also their Achilles’ heel. Because a highly expressive network can represent any function, it has the capacity to draw a wildly complex, jagged decision boundary that passes exactly through every single point in the training data.

Instead of learning the underlying, generalizable pattern of the data, the network simply memorizes the training data, including all the random noise, fluctuations, and outliers. When this happens, the model will achieve a near-perfect score on the training set but will perform poorly on unseen test data. This phenomenon is called overfitting.

In short: Universal approximation guarantees the network can fit the training data perfectly, making it inherently prone to memorization if left unchecked.

The Solutions: Controlling Expressiveness

To prevent overfitting, we must constrain the network so it cannot use its full expressive capacity to memorize noise. Two primary ways to do this are L2 Regularization and Early Stopping.

1. L2 Regularization (Weight Decay)

L2 regularization tackles the problem by directly penalizing the complexity of the network. It does this by adding a penalty term to the loss function that is proportional to the sum of the squared weights in the network.

• How it works: Because the optimizer is trying to minimize the total loss, it is forced to keep the weights as small as possible.
• Why it prevents memorization: Large weights are required for a neural network to create the sharp, jagged, highly sensitive mathematical functions needed to connect every noisy data point perfectly. By forcing the weights to stay small, L2 regularization restricts the network to drawing smoother, simpler functions. It effectively “dampens” the network’s expressiveness, forcing it to capture the broad, general trend of the data rather than the noisy specifics.
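A minimal sketch of the penalty's effect (the $\lambda$ and learning rate values are illustrative; the data gradient is zeroed out so only the decay term acts):

```python
import numpy as np

def l2_regularized_grad(w, data_grad, lam):
    """Gradient of E(w) + (lam/2) * ||w||^2: the data term plus weight decay."""
    return data_grad + lam * w

# With the data gradient zeroed out, the penalty alone shrinks the weights.
w = np.array([3.0, -2.0])
lr, lam = 0.1, 0.5
for _ in range(100):
    w = w - lr * l2_regularized_grad(w, np.zeros_like(w), lam)
print(w)   # pulled close to zero by the decay term
```

Each step multiplies the weights by $(1 - \eta\lambda)$, which is why L2 regularization is also called weight decay: absent pressure from the data term, weights shrink geometrically toward zero.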

2. Early Stopping

Early stopping tackles the problem by limiting the time the network has to memorize the data.

• How it works: When a network trains, it naturally learns the simple, broad, and generalizable patterns first. As training progresses into later epochs, it begins to use its universal approximation powers to learn the highly specific noise of the training set.
• Why it prevents memorization: During training, we monitor the network’s error on a separate “validation set.” Initially, both training error and validation error go down. However, the moment the network starts memorizing the training noise, its performance on the validation set will start to get worse. Early stopping simply halts the training process at the exact epoch where the validation error is at its lowest. By stopping the training early, we freeze the network’s weights after it has learned the general patterns, but before it has had the opportunity to memorize the noise.
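A sketch of the early-stopping loop itself; the function names, the patience value, and the synthetic U-shaped validation curve are all illustrative, not from the lecture:

```python
import numpy as np

def early_stopping(train_step, val_error, max_epochs=200, patience=10):
    """Track validation error each epoch; stop once it has not improved
    for `patience` consecutive epochs, and report the best epoch found."""
    best_err, best_epoch, bad = np.inf, 0, 0
    for epoch in range(max_epochs):
        train_step(epoch)                 # one epoch of weight updates
        err = val_error(epoch)            # error on the held-out set
        if err < best_err - 1e-12:
            best_err, best_epoch, bad = err, epoch, 0
        else:
            bad += 1
            if bad >= patience:
                break                     # validation error stopped improving
    return best_epoch, best_err

# Synthetic U-shaped validation curve with its minimum at epoch 50.
best_epoch, best_err = early_stopping(lambda e: None,
                                      lambda e: (e - 50) ** 2 / 100.0)
print(best_epoch, best_err)
```

In practice one would also checkpoint the weights at `best_epoch` and restore them after the loop exits; the sketch only tracks the epoch index.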