Normalization-free transformers are subcritical, Part 2.
Overview
In my previous blog post, I demonstrated empirically that normalization-free (DyT/Derf) transformers [1, 2] have worse gradient propagation properties than the standard pre-LN transformer – namely, they exhibit much stronger gradient amplification (approximately stretched-exponential growth, as opposed to the power-law growth of the pre-LN baseline). Although the theoretical analysis correctly characterized the gap between the models at a qualitative level, it did not account for the attention mechanism.
In this blog post, I modify the theoretical argument by reintroducing attention, using the theoretical framework developed in [3], which restricts the initial token configurations to permutation-invariant ones. We generalize the analysis in [3] to normalization-free transformers by replacing the LayerNorms with pointwise activation functions. We show that attention does not change the mechanism that makes gradient propagation in normalization-free transformers inferior to that in pre-LN transformers. However, we can now demonstrate not only qualitative agreement between theoretical and empirical activation norms, gradients, and Jacobians, but also perfect quantitative agreement.
How to read this post
This post contains the following sections:
1. Mean-field framework
- 1.1 Introduction
- 1.2 Setup
- 1.3 Forward signal propagation
- 1.4 Backward gradient propagation
2. LayerNorm vs. Derf/DyT
- 2.1 Theory
- 2.2 Experiments
3. References
Readers interested primarily in why normalization-free Transformers have worse gradient propagation may want to proceed to Section 2.1, which relies on results from Sections 1.3 and 1.4.
Section 1.1 overviews mean-field theory at initialization, which tracks the dynamics of a pair of activation vectors. Section 1.2 introduces the notation for the Transformer. Sections 1.3 and 1.4 extend the mean-field recursion relations of [3] to normalization-free transformers. Section 2.1 uses these relations to compare gradient propagation for LayerNorm versus DyT/Derf-like pointwise normalizations, and Section 2.2 validates the theory against measurements in a ViT.
Mean-field framework
Introduction
For a general introduction to the theory of signal propagation and the mean-field formalism in the large-width limit at initialization, I refer the reader to my previous blog post.
[3] observed that, for permutation-equivariant transformers (i.e., with bidirectional attention and no positional encoding), the mean-field theory at initialization effectively reduces to the layer-to-layer evolution of just two degrees of freedom, provided the initial token configuration is permutation-invariant: the component-wise variance of the activation vector at a given position, and the covariance between components of activation vectors at different positions. Geometrically, the former is the squared norm of the activation vector at a given position, normalized by the hidden dimension, while the latter is the normalized dot product between activation vectors at different positions; consequently, their ratio is the cosine similarity between activation vectors at different positions.
The two main ingredients in the signal-propagation calculation at initialization are (co)variance propagation through pointwise nonlinearities and through linear layers. For the former, let us illustrate the calculation using a pointwise nonlinearity applied to two activation vectors at different positions. The component-wise covariance of the two vectors is specified as follows: components with the same index share a common variance and covariance, and components with different indices are uncorrelated. We then compute the (non-centered) covariance after applying the nonlinearity as a two-dimensional Gaussian expectation of the product of the transformed components. In this expectation, the integration variables are scalar dummy variables rather than multi-component activation vectors – we have simply suppressed the component index. Even if the original activations have zero mean, passing through some nonlinearities (e.g., ReLU) can introduce a non-zero mean; however, it is eliminated by the subsequent linear transformation.
A subsequent linear transformation with zero-mean weights (assumed Gaussian for simplicity) multiplies this covariance by the total weight variance per output component, i.e., the component-wise weight variance times the input dimension. Thus, combining the nonlinearity with the linear transformation rescales the post-nonlinearity Gaussian expectation by this factor.
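As a sanity check of this rule, here is a minimal Monte Carlo sketch. The names q, c, and sigma2 (the total weight variance per output component, i.e., the component-wise variance times the fan-in) are illustrative choices rather than the notation of this post.

```python
# Monte Carlo sketch of (co)variance propagation through a pointwise
# nonlinearity followed by a zero-mean Gaussian linear layer.
# q = per-component variance, c = per-component covariance between positions,
# sigma2 = total weight variance per output component (illustrative names).
import numpy as np

rng = np.random.default_rng(0)

def relu(t):
    return np.maximum(t, 0.0)

def propagate(q, c, phi, sigma2, n_samples=1_000_000):
    """Return sigma2 * E[phi(x) phi(y)] for jointly Gaussian (x, y)
    with Var[x] = Var[y] = q and Cov[x, y] = c."""
    cov = np.array([[q, c], [c, q]])
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n_samples).T
    return sigma2 * np.mean(phi(x) * phi(y))

print(propagate(q=1.0, c=0.5, phi=relu, sigma2=2.0))  # covariance between positions
print(propagate(q=1.0, c=1.0, phi=relu, sigma2=2.0))  # variance (the c -> q limit)
```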
Setup
Assume a Transformer with a given context size and number of layers, alternating (bidirectional) self-attention layers and position-wise MLP layers with ReLU activation, each wrapped in a residual connection. The input to each residual branch is normalized – either with LayerNorm or with a pointwise transform such as DyT/Derf. For simplicity, we assume single-head attention; in the case of multi-head attention, the signal-propagation equations remain exactly the same. The dynamics of the activation vectors are given by the following equations:
Note that the layer index here counts attention and MLP layers separately, rather than Transformer blocks. The normalized activation vectors entering the residual branches are:
Here the normalization may be LayerNorm or a pointwise transform; for example, Derf applies an erf nonlinearity whose argument is scaled by a parameter. The attention scores between each query and each key are computed in the standard way:
All weights are initialized from zero-mean Gaussian distributions, with variances that are shared across Transformer blocks: in the attention layer, the query, key, value, and output projections each have their own component-wise variance; in the MLP layer, the two linear transformations likewise have their own component-wise variances. In each case, the component-wise variance is normalized by the input dimension of the layer.
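For concreteness, here is a minimal PyTorch sketch of one attention layer followed by one MLP layer of such a normalization-free Transformer (single head, bidirectional attention, no positional encoding). The class and argument names are illustrative; for brevity a single weight variance is shared across the attention projections and another across the MLP projections, and the 1/fan-in scaling is my reading of the initialization described above.

```python
# Minimal sketch: one attention layer + one MLP layer, each with a residual
# connection whose branch input is passed through a pointwise normalization.
import math
import torch
import torch.nn as nn

class NormFreeBlock(nn.Module):
    def __init__(self, dim, norm_fn, sigma2_attn=1.0, sigma2_mlp=1.0, mlp_ratio=4):
        super().__init__()
        self.norm_fn = norm_fn  # e.g. lambda x: torch.erf(alpha * x) for a Derf-like transform
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.o = nn.Linear(dim, dim, bias=False)
        self.fc1 = nn.Linear(dim, mlp_ratio * dim, bias=False)
        self.fc2 = nn.Linear(mlp_ratio * dim, dim, bias=False)
        for lin, s2 in [(self.q, sigma2_attn), (self.k, sigma2_attn),
                        (self.v, sigma2_attn), (self.o, sigma2_attn),
                        (self.fc1, sigma2_mlp), (self.fc2, sigma2_mlp)]:
            # zero-mean Gaussian init, component-wise variance s2 / fan_in
            nn.init.normal_(lin.weight, std=math.sqrt(s2 / lin.in_features))

    def forward(self, x):                      # x: (tokens, dim), no positional encoding
        h = self.norm_fn(x)                    # normalize the attention-branch input
        scores = (self.q(h) @ self.k(h).T) / math.sqrt(h.shape[-1])
        x = x + self.o(torch.softmax(scores, dim=-1) @ self.v(h))  # attention layer
        h = self.norm_fn(x)                    # normalize the MLP-branch input
        x = x + self.fc2(torch.relu(self.fc1(h)))                  # MLP layer
        return x

block = NormFreeBlock(dim=64, norm_fn=lambda x: torch.erf(x))
out = block(torch.randn(16, 64))               # 16 tokens, hidden dimension 64
```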
Forward signal propagation
With a number of simplifying assumptions about the statistics of attention scores (see Assumption 2 in [3]), one can solve for the dynamics of the variance and the covariance (Eqs. (6) and (7)):
For brevity, these equations are written in terms of shorthand combinations of the weight variances introduced above. We recall that the context size is the number of tokens/patches. Finally, the remaining quantities appearing in Eqs. (6) and (7) are the variance and covariance after propagation through the normalization, computed as Gaussian expectations in the same way as in Section 1.1.
The coefficients multiplying these quantities in the MLP terms of Eqs. (6) and (7) arise from covariance propagation through the ReLU nonlinearity; the angle entering those coefficients is determined by the post-normalization cosine similarity between positions.
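For reference, covariance propagation through ReLU for jointly Gaussian inputs has a standard closed form (the degree-one arc-cosine kernel). The sketch below, with illustrative names q, c, and theta, checks that closed form against a Monte Carlo estimate; it is a sanity check, not the post's exact Eq. (6)/(7) notation.

```python
# ReLU covariance map for jointly Gaussian inputs with variance q and
# covariance c: E[relu(x) relu(y)] = (q / (2*pi)) * (sin(theta) + (pi - theta) * cos(theta)),
# where cos(theta) = c / q. Checked against Monte Carlo below.
import numpy as np

def relu_cov_closed_form(q, c):
    theta = np.arccos(np.clip(c / q, -1.0, 1.0))
    return (q / (2 * np.pi)) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

def relu_cov_monte_carlo(q, c, n=2_000_000, seed=0):
    rng = np.random.default_rng(seed)
    x, y = rng.multivariate_normal([0.0, 0.0], [[q, c], [c, q]], size=n).T
    return np.mean(np.maximum(x, 0.0) * np.maximum(y, 0.0))

print(relu_cov_closed_form(2.0, 1.0), relu_cov_monte_carlo(2.0, 1.0))
print(relu_cov_closed_form(2.0, 2.0))  # c -> q limit reproduces E[relu(x)^2] = q / 2
```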
Backward gradient propagation
To characterize backward gradient propagation from an earlier layer to a later layer, we use the Frobenius norm of the Jacobian of the later layer's activations with respect to the earlier layer's activations, averaged over weight initializations – the APJN (averaged partial Jacobian norm) [4]:
In the large-width limit, the APJN satisfies the recursion relation [4]
Equivalently, the APJN between two layers factorizes into a product of single-layer APJNs. In our setup, the single-layer factors take the following form:
Here the new quantity is the variance obtained by propagating through the derivative of the normalization function:
This expression is somewhat ambiguous for LayerNorm, so in that case we simply define the corresponding quantity explicitly to avoid confusion. In both cases – the pointwise transform and LayerNorm – it arises from differentiating the normalization.
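To illustrate how such derivative-propagated quantities can be evaluated numerically, here is a small Monte Carlo sketch. It assumes, for illustration only, that Derf acts as erf of the input scaled by a parameter alpha; the exact parameterization used in this post may differ.

```python
# Monte Carlo sketch of the (co)variance obtained by propagating through the
# derivative of a pointwise normalization. The Derf form erf(alpha * x),
# with derivative (2*alpha/sqrt(pi)) * exp(-(alpha*x)**2), is assumed here.
import numpy as np

def deriv_cov(q, c, alpha=1.0, n=2_000_000, seed=0):
    rng = np.random.default_rng(seed)
    x, y = rng.multivariate_normal([0.0, 0.0], [[q, c], [c, q]], size=n).T
    dphi = lambda t: (2 * alpha / np.sqrt(np.pi)) * np.exp(-(alpha * t) ** 2)
    return np.mean(dphi(x) * dphi(y))

print(deriv_cov(q=1.0, c=1.0))  # "variance" case (same position)
print(deriv_cov(q=1.0, c=0.3))  # covariance case (different positions)
```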
In fact, for LayerNorm and Derf, all of these normalization-related quantities can be computed analytically. We provide the expressions here for completeness.
LayerNorm:
Derf (as a function of its parameter):
LayerNorm vs. Derf/DyT
Theory
We now have all the components (Eqs. (6), (7), and (12)) to show that, for DyT/Derf-like normalization functions, the APJN grows approximately as a stretched exponential (an exponential of a fractional power of the depth), whereas in the standard pre-LN setup it grows approximately as a power law. The general argument is identical to that in my previous blog post, so we omit the details here. The key idea is that in both cases the activation variance grows linearly with depth, as follows from Eq. (6); however, in Eq. (12), the per-layer amplification factor approaches one much more slowly with depth for DyT/Derf-like normalization functions (roughly as an inverse square root of the depth) than for LayerNorm (roughly as the inverse of the depth). This implies the stated behavior of the APJN, which is given by a product of these factors.
This conclusion remains valid in the presence of attention. In the forward pass, the linear growth of the activation variance persists because the attention contribution is bounded. In the backward pass, since the covariance cannot exceed the variance, the attention denominator in Eq. (12) can suppress the per-layer factor only by a bounded amount. And even if it could do more, the MLP contribution remains the same as without attention, producing stretched-exponential growth for DyT/Derf-like normalization functions and power-law growth for LayerNorm.
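A small numerical sketch of this product-of-factors mechanism: per-layer factors that approach one like a/l give power-law growth of the product, while factors that approach one like a/sqrt(l) give stretched-exponential growth. The specific factors below are illustrative stand-ins, not the exact expressions from Eq. (12).

```python
# Product-of-factors illustration for the APJN growth argument.
import numpy as np

a = 1.0
depths = np.arange(1, 5001)
log_prod_powerlaw = np.cumsum(np.log1p(a / depths))            # sum a/l      ~ a * log(L)
log_prod_stretched = np.cumsum(np.log1p(a / np.sqrt(depths)))  # sum a/sqrt(l) ~ 2 * a * sqrt(L)

print(log_prod_powerlaw[-1] / np.log(depths[-1]))    # approaches a:  product ~ L**a (power law)
print(log_prod_stretched[-1] / np.sqrt(depths[-1]))  # approaches 2a: product ~ exp(2a*sqrt(L))
```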
Experiments
We compute the layer-wise variance and covariance, as well as the APJN, both from the mean-field analysis and by estimating them from the ViT model, for the pre-LN baseline and for Derf with various values of its parameter.
Fig. 1 compares the variance and covariance computed from the mean-field analysis (left) with those estimated from a ViT forward pass (right), where the input to the first Transformer block is a generated permutation-invariant token configuration. Both use the same initial (co)variance values, number of layers, and context size. The ViT model is initialized as in vit_large_patch16_224, with the corresponding hidden dimension and component-wise weight standard deviations; the mean-field weight variances are set to match.
Fig. 2 (left) compares the APJN computed from the mean-field analysis with that estimated from the ViT model via Hutchinson’s method [4]. Fig. 2 (right) shows gradient amplification coefficients estimated from the ViT backward pass on a batch of permutation-invariant token configurations. The observed gradient amplification is slightly larger than the APJN, likely because the gradients lie in a subspace corresponding to larger-than-average Jacobian eigenvalues.
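For readers who want to reproduce estimates like those in the left panel, here is a minimal sketch of a Hutchinson-style estimate of the squared Frobenius norm of a partial Jacobian using vector-Jacobian products from autograd. The `blocks` list, layer indices, and the width normalization are placeholders, and the normalization convention may differ from that of [4].

```python
# Hutchinson-style estimate of ||d h_{l1} / d h_{l0}||_F^2 via autograd.
import torch

def apjn_estimate(blocks, x0, l0, l1, n_probes=16):
    """Estimate the squared Frobenius norm of the partial Jacobian between
    layers l0 and l1 (normalized by the layer-l0 size, as a placeholder)."""
    h = x0
    for blk in blocks[:l0]:
        h = blk(h)
    h_l0 = h.detach().requires_grad_(True)   # treat layer-l0 activations as the input
    h = h_l0
    for blk in blocks[l0:l1]:
        h = blk(h)
    total = 0.0
    for _ in range(n_probes):
        v = torch.randn_like(h)              # E[||J^T v||^2] = ||J||_F^2 for Gaussian probes
        (g,) = torch.autograd.grad(h, h_l0, grad_outputs=v, retain_graph=True)
        total += g.pow(2).sum().item()
    return total / n_probes / h_l0.numel()
```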
Overall, matching the pre-LN gradient amplification behavior requires choosing a smaller value of the Derf parameter. However, smaller parameter values also yield smaller updates to the residual stream. If instead we try to align pre-LN and Derf by matching the magnitude of the residual-stream update, then the gradient amplification in Derf becomes much larger. Concretely, choosing the parameter so that the corresponding curves for pre-LN and Derf are as close as possible leads to large gradient amplification.
Figure 1. (a) The component-wise variance of the activation vector at a given position, (b) the covariance between components of activation vectors at different positions, and (c) their ratio, the cosine similarity. Left: mean-field analysis; right: values estimated from a ViT forward pass on a batch of permutation-invariant token configurations with the same initial (co)variance. The black solid line indicates the pre-LN baseline. Colored lines show Derf variants with varying values of the parameter.
Figure 2. Left: APJN. Solid lines indicate values computed from mean-field theory (MFT). Crosses indicate values obtained from a ViT via Hutchinson’s method [4]. Right: gradient amplification coefficients estimated from a ViT backward pass on a batch of permutation-invariant token configurations.