██╗      █████╗ ██╗   ██╗███████╗██████╗ ███╗   ██╗ ██████╗ ██████╗ ███╗   ███╗
██║     ██╔══██╗╚██╗ ██╔╝██╔════╝██╔══██╗████╗  ██║██╔═══██╗██╔══██╗████╗ ████║
██║     ███████║ ╚████╔╝ █████╗  ██████╔╝██╔██╗ ██║██║   ██║██████╔╝██╔████╔██║
██║     ██╔══██║  ╚██╔╝  ██╔══╝  ██╔══██╗██║╚██╗██║██║   ██║██╔══██╗██║╚██╔╝██║
███████╗██║  ██║   ██║   ███████╗██║  ██║██║ ╚████║╚██████╔╝██║  ██║██║ ╚═╝ ██║
╚══════╝╚═╝  ╚═╝   ╚═╝   ╚══════╝╚═╝  ╚═╝╚═╝  ╚═══╝ ╚═════╝ ╚═╝  ╚═╝╚═╝     ╚═╝

██╗   ██╗███████╗
██║   ██║██╔════╝
██║   ██║███████╗
╚██╗ ██╔╝╚════██║
 ╚████╔╝ ███████║
  ╚═══╝  ╚══════╝

██████╗  █████╗ ████████╗ ██████╗██╗  ██╗███╗   ██╗ ██████╗ ██████╗ ███╗   ███╗
██╔══██╗██╔══██╗╚══██╔══╝██╔════╝██║  ██║████╗  ██║██╔═══██╗██╔══██╗████╗ ████║
██████╔╝███████║   ██║   ██║     ███████║██╔██╗ ██║██║   ██║██████╔╝██╔████╔██║
██╔══██╗██╔══██║   ██║   ██║     ██╔══██║██║╚██╗██║██║   ██║██╔══██╗██║╚██╔╝██║
██████╔╝██║  ██║   ██║   ╚██████╗██║  ██║██║ ╚████║╚██████╔╝██║  ██║██║ ╚═╝ ██║
╚═════╝ ╚═╝  ╚═╝   ╚═╝    ╚═════╝╚═╝  ╚═╝╚═╝  ╚═══╝ ╚═════╝ ╚═╝  ╚═╝╚═╝     ╚═╝
                    

Every machine learning engineer has been there. You're three hours into debugging why your model won't converge, your loss curve looks like a seismograph during an earthquake, and you're starting to question your life choices. The culprit? Your activations are all over the place — some neurons firing like they're on espresso, others barely whispering.

This is where normalization steps in, but not as some magical cure-all. BatchNorm and LayerNorm are more like two different approaches to managing a chaotic group project. BatchNorm is the coordinator who looks at what everyone else is doing and tries to bring some consistency across the team. LayerNorm is the methodical one who focuses on getting each individual's work properly organized before moving forward.

The choice between them isn't academic — it's practical. Pick wrong, and you'll spend your weekend wondering why your transformer is acting like a broken calculator or why your CNN thinks every image is a cat. Pick right, and training becomes the smooth, predictable process it should be.

Motivation and Early Works

Batch normalization was introduced primarily to address internal covariate shift, though the full picture of why it works is more nuanced than originally thought.

Internal Covariate Shift: The Original Motivation

Internal covariate shift refers to the change in the distribution of layer inputs during training. Here's what happens:

The Problem

As neural network parameters update during training, the distribution of inputs to each layer constantly shifts. For example, if an early layer's weights change, all subsequent layers receive inputs with different statistical properties than they had in previous iterations. This creates a moving target for each layer - just as a layer starts to adapt to one input distribution, that distribution changes.

"The training is complicated by the fact that the inputs in each layer are affected by the parameters of the preceding layers - so that small changes to the network amplify as the network becomes deeper."

Why This Matters

Each layer must continuously readjust to these shifting input distributions. This slows training down, forces lower learning rates and more careful parameter initialization, and makes it notoriously hard to train networks with saturating nonlinearities.

The Batch Normalization Solution

By normalizing inputs to each layer to have zero mean and unit variance, batch normalization ensures that regardless of how previous layers' parameters change, the current layer always receives inputs with a consistent statistical distribution.

The normalization operation is: $$\frac{x - \mu}{\sigma}$$ where μ and σ are the batch mean and standard deviation. The method then applies learnable parameters γ (scale) and β (shift) to maintain the network's representational capacity.
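To make this concrete, here is a minimal PyTorch sketch (the batch size and feature count are made-up toy values): `nn.BatchNorm1d` stores γ as its `weight` and β as its `bias`, and in training mode normalizes each feature using the statistics of the current mini-batch.

```python
import torch
import torch.nn as nn

# Toy setup (shapes chosen arbitrarily): a mini-batch of 32 samples, 8 features each.
x = torch.randn(32, 8)

bn = nn.BatchNorm1d(num_features=8)   # gamma lives in bn.weight, beta in bn.bias
y = bn(x)                             # normalize each feature over the batch, then scale and shift

# With gamma initialized to 1 and beta to 0, each feature of y comes out with
# (approximately) zero mean and unit variance across the batch.
print(y.mean(dim=0))                  # ~0 for every feature
print(y.var(dim=0, unbiased=False))   # ~1 for every feature
```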

All this is cool, but how did they arrive at the idea in the first place?

It was observed that a neuron's output could become extremely large or small, so that after passing through the activation function it landed in the saturated regime. That spells trouble: the gradients there are tiny, so rebalancing such neurons requires enormous updates. If, however, we could keep the distribution of the nonlinearity's inputs stable, the optimizer would be far less likely to get stuck in the saturated regime, and training would accelerate.

They also build on the long-standing observation that a network converges faster if its inputs are whitened - i.e. linearly transformed to have zero mean and unit variance, and decorrelated.

Beyond Internal Covariate Shift: Additional Benefits

Research has revealed that batch normalization provides several other crucial advantages:

1. Smoothing the Loss Landscape

Batch normalization makes the optimization landscape much smoother. This means gradients are more predictable and stable, allowing for higher learning rates and more robust training. The loss function becomes less sensitive to parameter initialization and changes.

2. Gradient Flow Improvement

BN also has a beneficial effect on the gradient flow through the network, by reducing the dependence of gradients on the scale of the parameters or of their initial values.

3. Regularization Effect

Batch normalization introduces noise through the batch statistics, since each mini-batch has a slightly different mean and variance. This stochastic element acts as a form of regularization, similar to dropout, and helps prevent overfitting; a short demonstration follows this list.

4. Reduced Sensitivity to Initialization

Networks with batch normalization are much less sensitive to weight initialization schemes. You can often use larger initial weights without destabilizing training.

5. Faster Convergence

The combination of smoother optimization and better gradient flow typically leads to faster convergence, often requiring fewer epochs to reach good performance.
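To see where the regularization noise from point 3 comes from, consider the same example normalized as part of two different mini-batches. This is a small NumPy illustration with made-up numbers, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(size=4)                      # one example with 4 features

# The same example ends up in two different mini-batches.
batch_a = np.vstack([sample, rng.normal(size=(15, 4))])
batch_b = np.vstack([sample, rng.normal(size=(15, 4))])

def normalize(batch, eps=1e-5):
    mu = batch.mean(axis=0)
    var = batch.var(axis=0)
    return (batch - mu) / np.sqrt(var + eps)

# Row 0 is the identical input both times, yet its normalized value depends on
# the statistics of whichever batch it was grouped with. That batch-to-batch
# jitter is the noise acting as a regularizer.
print(normalize(batch_a)[0])
print(normalize(batch_b)[0])
```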

The Nuanced Reality

Interestingly, recent research suggests that the "internal covariate shift" explanation, while intuitive, may not be the complete story. Studies have shown that batch normalization can be effective even when internal covariate shift is artificially increased. The smoothing of the loss landscape appears to be a more fundamental mechanism.

The technique works by fundamentally changing how information flows through the network during both forward and backward passes, creating more stable and predictable training dynamics. This makes it one of the most impactful innovations in deep learning, enabling the training of much deeper networks with greater ease and reliability.

Normalization via Mini-Batch Statistics

Since the full whitening of each layer’s inputs is costly and not everywhere differentiable, we make two necessary simplifications. The first is that instead of whitening the features in layer inputs and outputs jointly, we will normalize each scalar feature independently, by making it have the mean of zero and the variance of 1. For a layer with d-dimensional input \( \mathbf{x} = (x^{(1)} \ldots x^{(d)}) \), we will normalize each dimension

\[ \hat{x}^{(k)} = \frac{x^{(k)} - \mathbb{E}[x^{(k)}]}{\sqrt{\text{Var}[x^{(k)}]}} \]

where the expectation and variance are computed over the training data set. As shown in LeCun et al., 1998b, such normalization speeds up convergence, even when the features are not decorrelated.

Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. To address this, we make sure that the transformation inserted in the network can represent the identity transform. To accomplish this, we introduce, for each activation \( x^{(k)} \), a pair of learnable parameters \( \gamma^{(k)}, \beta^{(k)} \) that scale and shift the normalized value: \( y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)} \). Setting \( \gamma^{(k)} = \sqrt{\text{Var}[x^{(k)}]} \) and \( \beta^{(k)} = \mathbb{E}[x^{(k)}] \) would recover the original activations, if that were the optimal thing to do.

We refer to this as the Batch Normalizing Transform and present it in Algorithm 1. In the algorithm, \( \epsilon \) is a constant added to the mini-batch variance for numerical stability.

Input: Values of \( x \) over a mini-batch: \( \mathcal{B} = \{x_1 \ldots x_m\} \);
Parameters to be learned: \( \gamma, \beta \)
Output: \( \{y_i = \text{BN}_{\gamma,\beta}(x_i)\} \)

\[ \mu_\mathcal{B} \leftarrow \frac{1}{m} \sum_{i=1}^m x_i \quad \text{// mini-batch mean} \]

\[ \sigma_\mathcal{B}^2 \leftarrow \frac{1}{m} \sum_{i=1}^m (x_i - \mu_\mathcal{B})^2 \quad \text{// mini-batch variance} \]

\[ \hat{x}_i \leftarrow \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}} \quad \text{// normalize} \]

\[ y_i \leftarrow \gamma \hat{x}_i + \beta = \text{BN}_{\gamma,\beta}(x_i) \quad \text{// scale and shift} \]

Algorithm 1: Batch Normalizing Transform, applied to activation \( x \) over a mini-batch.
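A direct transcription of Algorithm 1 into NumPy might look like the following sketch; the input shape and parameter values are illustrative assumptions, not part of the original algorithm:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch Normalizing Transform applied to a mini-batch x of shape (m, d)."""
    mu = x.mean(axis=0)                     # mini-batch mean
    var = x.var(axis=0)                     # mini-batch variance (biased, as in Algorithm 1)
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize
    return gamma * x_hat + beta             # scale and shift

# Illustrative usage: m = 64 examples, d = 10 activations, deliberately shifted
# and scaled away from zero mean / unit variance.
x = 3.0 * np.random.randn(64, 10) + 5.0
gamma, beta = np.ones(10), np.zeros(10)

y = batch_norm(x, gamma, beta)
print(y.mean(axis=0))   # ~0 per activation
print(y.var(axis=0))    # ~1 per activation
```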

The BN transform can be added to a network to manipulate any activation. In the notation \( y = \text{BN}_{\gamma,\beta}(x) \), we interpret the operation as normalizing the input \( x \) over the mini-batch, followed by a learnable affine transformation with parameters \( \gamma \) and \( \beta \). These parameters allow the network to preserve or re-learn the original distribution of activations if needed, ensuring that the transformation does not limit the expressiveness of the model.
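To make the "does not limit expressiveness" claim concrete: if the network learns \( \gamma = \sqrt{\sigma_\mathcal{B}^2 + \epsilon} \) and \( \beta = \mu_\mathcal{B} \), the transform collapses to the identity. A quick numerical check, again with made-up data:

```python
import numpy as np

eps = 1e-5
x = 3.0 * np.random.randn(64, 10) + 5.0

mu = x.mean(axis=0)
var = x.var(axis=0)
x_hat = (x - mu) / np.sqrt(var + eps)

# Choosing gamma = sqrt(var + eps) and beta = mu undoes the normalization exactly,
# so the layer can always fall back to the identity if that is what training prefers.
gamma = np.sqrt(var + eps)
beta = mu
y = gamma * x_hat + beta

print(np.allclose(y, x))   # True
```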