Normalization-free transformers are subcritical
Overview
In this blog post, we connect recent work proposing LayerNorm-free Transformers [1, 2], in which (pre-)LayerNorm (LN) is replaced by pointwise activation functions such as $\tanh$ or $\mathrm{erf}$, with the body of work studying criticality in wide neural networks at initialization. We note that, in residual networks, replacing LN with $\tanh$-like pointwise functions leads to subcritical behavior, whereas LayerNorm achieves criticality, as previously shown in [3, 4]. In practical terms, residual networks with Dynamic Tanh (DyT) or Dynamic Erf (Derf) exhibit worse gradient propagation than pre-LN residual networks: gradients grow stretched-exponentially with depth (from the last layer to the first) rather than following a power law. This can cause training instability and may require more careful hyperparameter tuning to avoid divergence.
We analyze how the initialization of the parameter $\alpha$, which rescales the input to the nonlinearity in DyT/Derf, affects gradient propagation in the model proposed in [3]. For networks of finite depth, smaller values of $\alpha$ give rise to exponentially growing gradients, whereas larger values of $\alpha$ give rise to stretched-exponentially growing gradients. Larger values of $\alpha$ lead to stronger gradient amplification, which may explain several empirical observations in DyT/Derf models: training is typically more stable at smaller $\alpha$, while overly large $\alpha$ can lead to divergence; deeper models require smaller $\alpha$. Moreover, if $\alpha$ is chosen so that the residual-update size in DyT/Derf matches that of pre-LN, DyT/Derf exhibits much stronger gradient amplification than pre-LN. We show that the qualitative gradient behavior in ViT aligns well with the theory, even though the theory does not account for attention blocks. In addition, we empirically show that DyT/Derf models initialized with larger $\alpha$ may benefit from learning rate warmup more than pre-LN.
How to read this post
The blog post contains the following sections:
- 1. Background
- 2. Analysis of criticality in LayerNorm-free Transformers
- 3. Conclusion
- 4. References
Readers interested only in the empirical results on gradient propagation in ViT and training stability with DyT and LN may want to proceed directly to Sections 2.3 and 2.4. Readers who want a theoretical explanation but are already familiar with criticality and mean-field theory at initialization may want to start from Section 2.1, although Section 1.2 may be useful for the notation. Readers who want the full picture from scratch may want to start from Section 1.1.
Sections 1.1–2.1 review existing theoretical results on signal propagation in deep neural networks and provide a gentle introduction to the topic. Namely, Section 1.1 introduces the ordered and chaotic phases, as well as criticality, for signal propagation in neural networks. Section 1.2 introduces the mean-field formalism in the large-width limit, which allows one to quantify signal-propagation properties. Section 2.1 compares the signal-propagation behavior of a toy residual network with LayerNorm and with a $\tanh$-like nonlinearity, and shows theoretically why the former leads to better gradient propagation.
Section 2.2 extends the theoretical analysis by introducing the DyT/Derf parameter $\alpha$, which rescales the input to the nonlinearity. Section 2.3 studies signal propagation empirically in a Transformer model and compares DyT models to the pre-LN baseline. Section 2.4 studies the effect of warmup on training stability for DyT models and compares it to the pre-LN baseline.
1. Background
1.1 Criticality: the big picture
It has been known for some time that a network’s signal-propagation properties at initialization directly affect its training stability, which in turn is strongly correlated with the final performance [3–6]. As a striking example, modern Transformers commonly use residual connections and pre-LayerNorm (rather than post-LayerNorm or no normalization) – both of which are known to improve gradient propagation and help prevent exponentially vanishing or exploding gradients [7–10].
The notion of “good” or “bad” signal propagation can be formalized by introducing the partial Jacobian between the model’s activations at layers $\ell_0$ and $\ell$, with $\ell \ge \ell_0$. This Jacobian determines how perturbations to activations at layer $\ell_0$ propagate to layer $\ell$ in the forward pass and how gradients propagate from layer $\ell$ back to layer $\ell_0$ in the backward pass. Its squared Frobenius norm, averaged over weight initializations – $\mathcal{J}^{\ell_0,\ell}$, the averaged partial Jacobian norm (APJN) – is the simplest scalar measure that relates the typical gradient magnitudes at layers $\ell_0$ and $\ell$ (details below) [3, 11, 12].
For a given neural network architecture, the partial Jacobian depends on hyperparameters such as the initialization variances of weights and biases, $\sigma_w^2$ and $\sigma_b^2$, as well as less standard ones such as the residual scaling $\mu$ and the parameter $\alpha$ in DyT/Derf. Thus, each point in hyperparameter space may be characterized by the resulting APJN behavior.
If the APJN grows (or decays) exponentially with depth, $\mathcal{J}^{\ell_0,\ell} \sim e^{\pm(\ell-\ell_0)/\xi}$, the model is said to be in the chaotic (or ordered) phase. The depth scale $\xi$, called the correlation length, determines the effective depth over which signals propagate efficiently. Empirically, it has been shown that when the model depth substantially exceeds $\xi$, training becomes unstable [3, 5, 6]. Therefore, to achieve stable training, it is desirable to make $\xi$ as large as possible.
Typically, as one approaches the region of hyperparameters where $\xi$ diverges, the asymptotic APJN scaling switches to a power law, $\mathcal{J}^{\ell_0,\ell} \sim (\ell-\ell_0)^{\zeta}$, with critical exponent $\zeta$. In this case, the model is said to be at criticality.
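To make the three regimes concrete, here is a minimal numerical illustration (plain NumPy, not tied to any particular architecture) of how per-layer amplification factors compound over 100 layers:

```python
# Per-layer amplification factors compound multiplicatively, so a constant
# factor != 1 is exponential in depth, while a factor -> 1 + zeta/l gives a power law.
import numpy as np

l = np.arange(1, 101)
print(np.prod(np.full(100, 1.05)))  # chaotic:  1.05^100 ~ 131.5 (exponential growth)
print(np.prod(np.full(100, 0.95)))  # ordered:  0.95^100 ~ 0.006 (exponential decay)
print(np.prod(1.0 + 1.0 / l))       # critical: prod(1 + 1/l) = 101 ~ L^1 (power law)
```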
Let us illustrate these ideas using the figures from [3]. That work studies a residual network of the form

$$h^{\ell+1} = \mu\, h^{\ell} + W^{\ell+1}\phi\big(\tilde h^{\ell}\big) + b^{\ell+1}, \tag{1}$$

where $\ell = 0, \dots, L-1$, $h^{\ell} \in \mathbb{R}^{N}$, and $\phi$ is a pointwise nonlinearity. Here $\tilde h^{\ell}$ is either $\mathrm{LN}(h^{\ell})$ or simply $h^{\ell}$ if no normalization layer is used. We take $h^{0}$ to be the initial embedding. The components of the weight matrix $W^{\ell+1}$ are initialized from $\mathcal{N}(0, \sigma_w^2/N)$, and the components of the bias vector $b^{\ell+1}$ are initialized from $\mathcal{N}(0, \sigma_b^2)$. The parameter $\mu$, which is typically set to $1$ in modern Transformers, is the residual scaling that controls the contribution of the residual stream. When $\mu = 0$, the model reduces to a feedforward network without residual connections.
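For readers who prefer code, here is a minimal NumPy sketch of the toy model in Eq. (1) at initialization (variable names follow the notation above; this is an illustration, not the reference implementation from [3]):

```python
import numpy as np

def layer_norm(h):
    # Parameter-free LayerNorm over the last dimension.
    return (h - h.mean(-1, keepdims=True)) / h.std(-1, keepdims=True)

def forward(h0, L=50, mu=1.0, sigma_w=1.0, sigma_b=0.0, phi=np.tanh, use_ln=True):
    """Run Eq. (1) for L layers; return final h and per-layer K^l = ||h^l||^2 / N."""
    N = h0.shape[-1]
    h, Ks = h0, [float(np.mean(h0 ** 2))]
    for _ in range(L):
        W = np.random.normal(0.0, sigma_w / np.sqrt(N), size=(N, N))
        b = np.random.normal(0.0, sigma_b, size=N)
        h_tilde = layer_norm(h) if use_ln else h
        h = mu * h + phi(h_tilde) @ W.T + b  # Eq. (1)
        Ks.append(float(np.mean(h ** 2)))
    return h, np.array(Ks)

# e.g.: _, Ks = forward(np.random.normal(size=1024))
```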
Figure 1. Empirical phase diagrams and training accuracy for a deep MLP with 50 layers (with/without residual connections and LayerNorm, with different activation functions). Figures reprinted from [3] with permission from the authors. Upper panel: $\chi$ quantifies proximity to criticality: $\chi = 1$ corresponds to criticality, $\chi > 1$ to the chaotic phase, and $\chi < 1$ to the ordered phase. Solid lines indicate the criticality boundary predicted by the infinite-width calculation. Lower panel: training accuracy of the deep MLP on FashionMNIST. The dashed white lines denote the (analytical) critical lines.
Fig. 1 (upper panel) shows phase diagrams in the $(\sigma_b^2, \sigma_w^2)$ plane for the model in Eq. (1), for multiple choices of the nonlinearity $\phi$, multiple values of the residual scaling $\mu$, and with and without LayerNorm. Each point in the phase diagram is characterized by the theoretically computed quantity $\chi$, which is roughly equal to $\mathcal{J}^{\ell,\ell+1}$ at large depth and will be defined more rigorously below. Criticality is achieved when $\chi = 1$, while $\chi < 1$ ($\chi > 1$) corresponds to the ordered (chaotic) phase. Fig. 1 (lower panel) shows the training accuracy of the same models on FashionMNIST.
We would like to draw the reader’s attention to three points:
- The setup with pre-LN and $\mu = 1$, which is dominant in real models used in practice, is everywhere-critical. No matter how one chooses the initialization variances $(\sigma_b^2, \sigma_w^2)$, the backpropagated gradient norms follow a power law.
- The setup without pre-LN, with $\phi = \mathrm{erf}$ and $\mu = 1$ (i.e. Derf with fixed $\alpha = 1$), is not everywhere critical, but it is closer to criticality than setups with non-$\tanh$-like activation functions such as ReLU or GELU. In this setup, the APJN grows stretched-exponentially, $\mathcal{J}^{0,\ell} \sim e^{c\sqrt{\ell}}$, which is faster than power-law growth but slower than exponential growth.
- Note how well criticality at initialization correlates with final training accuracy: the more the model deviates from criticality at initialization, the less trainable it is.
1.2 Criticality: large-width limit and mean-field formalism
Signal-propagation analysis becomes much simpler in the large-width limit. In this regime, each component of the activation vector at layer $\ell$, $h^{\ell}_i$, is a sum of many (approximately) independent terms, so by the central limit theorem it can be treated as Gaussian. Moreover, components with different indices are independent because the rows of the weight matrix are independent, and they share the same (typically zero) mean and variance due to permutation symmetry across units. Thus, the distribution of the activation vector at layer $\ell$ is characterized by a single quantity, its variance $K^{\ell}$, so that $h^{\ell} \sim \mathcal{N}(0, K^{\ell} I_N)$, where $I_N$ denotes the $N \times N$ identity matrix. Knowledge of $K^{\ell}$ then allows one to compute the expectation of any function of $h^{\ell}$. The value of $K^{\ell+1}$ can be computed recursively from Eq. (1) as
$$K^{\ell+1} = \mu^2 K^{\ell} + \sigma_w^2\, \mathbb{E}_{h \sim \mathcal{N}(0, \tilde K^{\ell})}\big[\phi(h)^2\big] + \sigma_b^2, \tag{2}$$

where $\tilde K^{\ell}$ is either $1$ if LayerNorm is used, or $K^{\ell}$ otherwise. Note that in the expectation above, $h$ is a scalar dummy variable representing a single component of the activation vector at layer $\ell$. Intuitively, $K^{\ell}$ is the squared norm of $h^{\ell}$ normalized by width. In the large-$N$ limit, the fluctuations of this normalized squared norm vanish, so

$$K^{\ell} = \lim_{N \to \infty} \frac{1}{N}\,\big\|h^{\ell}\big\|_2^2. \tag{3}$$
The reduction of the dynamics to tracking the single quantity $K^{\ell}$ at each layer is often referred to as mean-field theory at initialization. Intuitively, Eq. (2) can be interpreted as a consequence of applying the Pythagorean theorem to the three vectors on the right-hand side of Eq. (1) – $\mu h^{\ell}$, $W^{\ell+1}\phi(\tilde h^{\ell})$, and $b^{\ell+1}$ – which become approximately orthogonal in the large-width limit.
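As a sanity check on Eq. (2), the recursion for $K^{\ell}$ can be iterated numerically. A minimal sketch using Gauss–Hermite quadrature for the Gaussian expectation (the helper names are ours):

```python
import numpy as np

# Probabilists' Gauss-Hermite nodes/weights: sum(w * f(x)) ~ E_{h~N(0,1)}[f(h)].
x, w = np.polynomial.hermite_e.hermegauss(101)
w = w / np.sqrt(2.0 * np.pi)

def gauss_mean(f, var):
    """E_{h ~ N(0, var)}[f(h)] by quadrature."""
    return float(np.sum(w * f(np.sqrt(var) * x)))

def K_recursion(K0, L, mu=1.0, sigma_w=1.0, sigma_b=0.0, phi=np.tanh, use_ln=True):
    """Iterate Eq. (2); with LayerNorm the expectation is taken at unit variance."""
    Ks = [K0]
    for _ in range(L):
        K = Ks[-1]
        K_tilde = 1.0 if use_ln else K
        Ks.append(mu ** 2 * K
                  + sigma_w ** 2 * gauss_mean(lambda h: phi(h) ** 2, K_tilde)
                  + sigma_b ** 2)
    return np.array(Ks)
```

The result can be compared against the empirical $\frac{1}{N}\|h^{\ell}\|_2^2$ from the forward-pass sketch above; the two agree increasingly well as $N$ grows.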
Let us now turn to the question of gradient propagation. Given two layers $\ell_0$ and $\ell$, with $\ell \ge \ell_0$, the gradients of a loss function $\mathcal{L}$ at these two layers are related through the Jacobian matrix:

$$\frac{\partial \mathcal{L}}{\partial h^{\ell_0}_{j}} = \sum_{i} \frac{\partial \mathcal{L}}{\partial h^{\ell}_{i}}\, J^{\ell_0,\ell}_{ij}, \tag{4}$$
where the partial Jacobian matrix between layers $\ell_0$ and $\ell$ is defined as

$$J^{\ell_0,\ell}_{ij} \equiv \frac{\partial h^{\ell}_{i}}{\partial h^{\ell_0}_{j}}. \tag{5}$$
Thus, it is natural that the squared Frobenius norm of this Jacobian, averaged over weight initializations and normalized by width – the APJN, $\mathcal{J}^{\ell_0,\ell} \equiv \frac{1}{N}\,\mathbb{E}_\theta\big[\|J^{\ell_0,\ell}\|_F^2\big]$ – relates gradient norms at layers $\ell_0$ and $\ell$ [3]:

$$\mathbb{E}_\theta\!\left[\bigg\|\frac{\partial \mathcal{L}}{\partial h^{\ell_0}}\bigg\|_2^2\right] \approx \mathcal{J}^{\ell_0,\ell}\;\mathbb{E}_\theta\!\left[\bigg\|\frac{\partial \mathcal{L}}{\partial h^{\ell}}\bigg\|_2^2\right]. \tag{6}$$
In the large-width limit, the APJN satisfies the recursion relation [3]

$$\mathcal{J}^{\ell_0,\ell+1} = \chi^{\ell}\,\mathcal{J}^{\ell_0,\ell}, \qquad \chi^{\ell} \equiv \mathcal{J}^{\ell,\ell+1}. \tag{7}$$
Equivalently, $\mathcal{J}^{\ell_0,\ell} = \prod_{\ell'=\ell_0}^{\ell-1} \chi^{\ell'}$ for $\ell > \ell_0$. The asymptotic behavior of $\chi^{\ell}$ as $\ell$ becomes large determines the phase of a deep network: values greater than $1$ correspond to the chaotic phase, in which gradients grow exponentially (from later layers to earlier ones); values less than $1$ correspond to the ordered phase, in which gradients decay exponentially; and values close to $1$ correspond to criticality. The quantity $\chi$ used in Fig. 1 is the asymptotic value of $\chi^{\ell}$ as $\ell \to \infty$, which is approximated by the final-layer value $\chi^{L-1}$.
For the model in Eq. (1) with LayerNorm,

$$\chi^{\ell} = \mu^2 + \frac{\sigma_w^2}{K^{\ell}}\,\mathbb{E}_{h \sim \mathcal{N}(0,1)}\big[\phi'(h)^2\big], \tag{8}$$

and without LayerNorm,

$$\chi^{\ell} = \mu^2 + \sigma_w^2\,\mathbb{E}_{h \sim \mathcal{N}(0,K^{\ell})}\big[\phi'(h)^2\big]. \tag{9}$$
For derivations of Eqs. (6)–(9), we refer the reader to [3]. All of these expressions follow directly from the definition of the APJN. Note the additional factor of $1/K^{\ell}$ in Eq. (8) relative to Eq. (9), which arises from the LayerNorm Jacobian – it is crucial for achieving criticality in pre-LN residual networks.
Overall, once $K^{\ell}$ is obtained from the variance-propagation equation (2), Eq. (8) or Eq. (9) determines the APJN and, consequently, the layer-wise behavior of the gradient norm.
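Putting Eqs. (2) and (7)–(9) together, the APJN can be computed layer by layer. A sketch continuing the previous snippet (reusing `gauss_mean` and `K_recursion`):

```python
def chi(K, mu=1.0, sigma_w=1.0, dphi=lambda h: 1.0 - np.tanh(h) ** 2, use_ln=True):
    """One-layer APJN chi^l: Eq. (8) with LayerNorm, Eq. (9) without."""
    if use_ln:
        # The 1/K factor comes from the LayerNorm Jacobian.
        return mu ** 2 + (sigma_w ** 2 / K) * gauss_mean(lambda h: dphi(h) ** 2, 1.0)
    return mu ** 2 + sigma_w ** 2 * gauss_mean(lambda h: dphi(h) ** 2, K)

def apjn(Ks, **kw):
    """J^{0,l} as the running product of chi^{l'}, Eq. (7)."""
    Js = [1.0]
    for K in Ks[:-1]:
        Js.append(Js[-1] * chi(K, **kw))
    return np.array(Js)
```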
2. Analysis of criticality in LayerNorm-free Transformers
2.1 LayerNorm vs. tanh-like nonlinearity: criticality perspective
Let us compare the behavior of $\chi^{\ell}$ for the setups with and without LayerNorm. We will stick to the standard practical choices of $\mu = 1$ and $\sigma_b^2 = 0$.
LayerNorm:

$$K^{\ell+1} = K^{\ell} + \sigma_w^2\,\mathbb{E}_{h \sim \mathcal{N}(0,1)}\big[\phi(h)^2\big], \qquad \chi^{\ell} = 1 + \frac{\sigma_w^2}{K^{\ell}}\,\mathbb{E}_{h \sim \mathcal{N}(0,1)}\big[\phi'(h)^2\big]. \tag{10}$$
Note that we have rewritten the expectations in Eqs. (2) and (8) in terms of a standard Gaussian, making it explicit that they do not depend on $K^{\ell}$. This is consistent with the fact that LayerNorm normalizes individual activations to have unit variance, or, equivalently, rescales the activation vector to have norm $\sqrt{N}$. At each layer, $K^{\ell}$ receives the same increment and therefore grows linearly with $\ell$. Consequently, for large $\ell$, $\chi^{\ell} \approx 1 + \zeta/\ell$, where $\zeta$ is a constant determined by Eq. (10) and the choice of nonlinearity $\phi$. Thus, the APJN grows as a power law, $\mathcal{J}^{\ell_0,\ell} \sim (\ell/\ell_0)^{\zeta}$, and so do the squared gradient norms in the backward pass. For $\phi = \tanh$ or $\phi = \mathrm{erf}$, $\zeta$ is an order-one constant.
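Continuing the snippets above, one can verify the power law numerically and extract $\zeta$ from the slope of $\log \mathcal{J}^{0,\ell}$ versus $\log \ell$ (here for $\phi = \tanh$, $\mu = 1$, $\sigma_b^2 = 0$):

```python
Ks = K_recursion(K0=1.0, L=5000, use_ln=True)   # K^l grows linearly with depth
Js = apjn(Ks, use_ln=True)                      # APJN J^{0,l}
l = np.arange(1, len(Js))
zeta = np.polyfit(np.log(l[1000:]), np.log(Js[1:][1000:]), 1)[0]
print(f"fitted critical exponent zeta ~ {zeta:.3f}")  # power law J ~ l^zeta
```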
No LayerNorm:

$$K^{\ell+1} = K^{\ell} + \sigma_w^2\,\mathbb{E}_{h \sim \mathcal{N}(0,K^{\ell})}\big[\phi(h)^2\big], \qquad \chi^{\ell} = 1 + \sigma_w^2\,\mathbb{E}_{h \sim \mathcal{N}(0,K^{\ell})}\big[\phi'(h)^2\big]. \tag{11}$$
In this case, the expectations depend explicitly on $K^{\ell}$, and the behavior of $\chi^{\ell}$ depends strongly on the choice of nonlinearity $\phi$.
For $\tanh$-like nonlinearities $\phi$, the quantity $\phi(h)^2$ is close to $1$ on the whole real line except in the vicinity of $h = 0$. As $K^{\ell}$ grows, the probability mass near $h = 0$ decreases and $\mathbb{E}_{h \sim \mathcal{N}(0,K^{\ell})}[\phi(h)^2]$ approaches $1$. Thus, $K^{\ell+1} \approx K^{\ell} + \sigma_w^2$ for large $\ell$, so $K^{\ell}$ again grows linearly with depth. A more geometric way to see this is to note that once the pre-activations in Eq. (1) are large enough, a $\tanh$-like nonlinearity effectively saturates and acts as a normalization, so the term $W^{\ell+1}\phi(h^{\ell})$ has approximately constant norm.
On the other hand, $\phi'(h)^2$ is concentrated near $h = 0$. As $K^{\ell}$ grows, the Gaussian density at the origin becomes approximately $1/\sqrt{2\pi K^{\ell}}$, and consequently $\chi^{\ell} \approx 1 + c/\sqrt{K^{\ell}}$ for large $\ell$, where $c$ is a constant determined by Eq. (11). Since $K^{\ell}$ grows linearly, summing the logarithms gives $\log \mathcal{J}^{\ell_0,\ell} \approx \sum_{\ell'} c/\sqrt{K^{\ell'}} \propto \sqrt{\ell}$. Thus, the APJN grows stretched-exponentially, $\mathcal{J}^{\ell_0,\ell} \sim e^{c'\sqrt{\ell}}$, and so do the gradient norms in the backward pass.
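The stretched-exponential behavior can be checked the same way: in the no-LN case, $(\log \mathcal{J}^{0,\ell})^2$ should become linear in $\ell$ at large depth (continuing the snippets above):

```python
Ks = K_recursion(K0=1.0, L=5000, use_ln=False)
Js = apjn(Ks, use_ln=False)
log2J = np.log(Js[1:]) ** 2     # (log J)^2 ~ const * l  <=>  J ~ exp(c * sqrt(l))
print(np.diff(log2J)[-3:])      # approximately constant increments at large depth
```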
We also note that reshaping Eq. (1) into a more MLP-like form, as in modern Transformers – e.g., replacing $W^{\ell+1}\phi(\tilde h^{\ell})$ by a two-layer block $W_2^{\ell+1}\,\psi\big(W_1^{\ell+1}\tilde h^{\ell}\big)$ with a ReLU-like $\psi$, or by $W_2^{\ell+1}\,\psi\big(W_1^{\ell+1}\phi(h^{\ell})\big)$ with a ReLU-like $\psi$ and a $\tanh$-like $\phi$ – does not change the results qualitatively. For example, an intermediate ReLU nonlinearity preserves the form of Eq. (10) and of Eq. (11) up to a redefinition of the effective $\sigma_w^2$.
2.2 Introducing $\alpha$: finite-depth transition from exponential to stretched-exponential behavior
Let us now consider the Derf nonlinearity as defined in [2], i.e. $\phi(h) = \mathrm{erf}(\alpha h)$. The expectations in Eq. (11) can be computed explicitly:

$$K^{\ell+1} = K^{\ell} + \sigma_w^2\,\frac{2}{\pi}\arcsin\!\left(\frac{2\alpha^2 K^{\ell}}{1 + 2\alpha^2 K^{\ell}}\right), \qquad \chi^{\ell} = 1 + \sigma_w^2\,\frac{4\alpha^2}{\pi}\,\frac{1}{\sqrt{1 + 4\alpha^2 K^{\ell}}}. \tag{12}$$
The depth $\ell^*$, defined by $\alpha^2 K^{\ell^*} \sim 1$, separates two APJN growth regimes. For $\ell < \ell^*$, the nonlinearity is effectively linear, $\mathrm{erf}(\alpha h) \approx \frac{2\alpha}{\sqrt{\pi}} h$, and both $K^{\ell}$ and $\mathcal{J}^{0,\ell}$ grow exponentially at the same rate: $K^{\ell} \approx K^{0}\big(1 + \tfrac{4\sigma_w^2\alpha^2}{\pi}\big)^{\ell}$ and $\chi^{\ell} \approx 1 + \tfrac{4\sigma_w^2\alpha^2}{\pi}$. For $\ell > \ell^*$, the growth of $\mathcal{J}^{0,\ell}$ is stretched-exponential, as discussed above. Assuming $\alpha^2 K^{0} \ll 1$, the transition depth can be roughly estimated from

$$\alpha^2 K^{0} \left(1 + \frac{4\sigma_w^2\alpha^2}{\pi}\right)^{\ell^*} \sim 1,$$

which yields

$$\ell^* \approx \frac{\log\big(1/(\alpha^2 K^{0})\big)}{\log\big(1 + 4\sigma_w^2\alpha^2/\pi\big)}.$$
The value of $\ell^*$ decreases as $\alpha$ increases: for large $\alpha$, the transition to stretched-exponential behavior occurs in early layers (possibly in the first layer), whereas for small $\alpha$ the APJN grows exponentially for many layers before entering the stretched-exponential regime.
For a network of finite depth $L$, this implies that if $\ell^* \gtrsim L$ (small $\alpha$) one observes only exponential growth; if $\ell^* \lesssim 1$ (large $\alpha$) one observes purely stretched-exponential growth; and if $1 \lesssim \ell^* \lesssim L$ one observes a transition from exponential to stretched-exponential growth. Note, however, that a larger $\alpha$ always implies stronger signal amplification from the first layer to the last, i.e. a larger $\mathcal{J}^{0,\ell}$ for any fixed $\ell$. Fig. 2 shows $K^{\ell}$ and $\mathcal{J}^{0,\ell}$ computed from the recursion in Eq. (12) for Derf with multiple values of $\alpha$; it also shows $K^{\ell}$ and $\mathcal{J}^{0,\ell}$ for the pre-LN setup in Eq. (10).
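A minimal sketch of the closed-form Derf recursion in Eq. (12), together with the transition-depth estimate above (continuing the earlier snippets; $\mu = 1$, and `l_star` is our name for the estimate):

```python
def derf_recursion(K0, L, alpha, sigma_w=1.0, sigma_b=0.0):
    """Iterate Eq. (12) and accumulate the APJN J^{0,l}."""
    Ks, Js = [K0], [1.0]
    for _ in range(L):
        K = Ks[-1]
        chi_l = 1.0 + sigma_w ** 2 * (4.0 * alpha ** 2 / np.pi) / np.sqrt(1.0 + 4.0 * alpha ** 2 * K)
        Js.append(Js[-1] * chi_l)
        Ks.append(K + sigma_w ** 2 * (2.0 / np.pi)
                  * np.arcsin(2.0 * alpha ** 2 * K / (1.0 + 2.0 * alpha ** 2 * K))
                  + sigma_b ** 2)
    return np.array(Ks), np.array(Js)

def l_star(K0, alpha, sigma_w=1.0):
    """Transition depth: K^l grows exponentially until alpha^2 K^l ~ 1."""
    return np.log(1.0 / (alpha ** 2 * K0)) / np.log(1.0 + 4.0 * sigma_w ** 2 * alpha ** 2 / np.pi)

# e.g.: Ks, Js = derf_recursion(K0=1.0, L=200, alpha=0.1); print(l_star(1.0, 0.1))
```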
Figure 2. (a) Layer-wise component variance of the activation vector, $K^{\ell}$, and (b) the APJN $\mathcal{J}^{0,\ell}$, which characterizes amplification of the squared gradient norm from layer $\ell$ to layer $0$, in the toy model (1) with Derf and with pre-LN. For Derf, for each value of $\alpha$, the black dot marks $\ell^*$, defined by $\alpha^2 K^{\ell^*} = 1$, and indicates the transition to the stretched-exponential regime, i.e. the point at which $K^{\ell}$ begins to grow linearly and $\mathcal{J}^{0,\ell}$ begins to grow stretched-exponentially (in the backward direction). The solid curves correspond to the stretched-exponential regime ($\ell > \ell^*$), while the dashed curves correspond to the exponential regime ($\ell < \ell^*$). The black curve shows the pre-LN setup.
2.3 Empirical gradient norms in a Transformer with DyT
We empirically measure activation and gradient norms at initialization in a ViT model, using a random batch from CIFAR-100, comparing DyT with different values of $\alpha$ to the pre-LN baseline. Fig. 3 (a) shows the component-wise activation variance at each layer, $K^{\ell}$. Fig. 3 (b) shows the layer-wise gradient amplification coefficient $\|\partial \mathcal{L}/\partial h^{\ell}\|_2^2 \,/\, \|\partial \mathcal{L}/\partial h^{L}\|_2^2$. Fig. 3 (c) shows that the squared logarithm of this coefficient is approximately linear in $\ell$, indicating stretched-exponential growth for larger $\alpha$. Averages are taken over the batch dimension and patches.
Despite the presence of attention layers in ViT, which makes the analytic treatment more challenging, the qualitative gradient behavior agrees well with the simplified model studied above. In particular, matching the pre-LN gradient amplification behavior requires choosing a smaller $\alpha$. However, smaller values of $\alpha$ in DyT produce smaller updates to the residual stream. Thus, comparing the models purely by gradient amplification can be misleading. A perhaps more natural comparison is to align the pre-LN setup with DyT by matching the magnitude of the residual-stream update. Concretely, one chooses $\alpha$ so that the $K^{\ell}$ curves for pre-LN and DyT are as close as possible, which results in a large $\alpha$. Under this alignment, gradient amplification in DyT is much larger.
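For reference, a hedged PyTorch sketch of the measurement behind Fig. 3: record per-block activation variances and gradient norms at initialization with hooks. It assumes a ViT whose residual blocks are stored in `model.blocks` (as in common timm-style implementations); adapt the module paths to your model:

```python
import torch

acts, grads = {}, {}

def attach_probes(model):
    for i, blk in enumerate(model.blocks):
        def fwd_hook(mod, inp, out, i=i):
            acts[i] = out.detach().pow(2).mean().item()               # K^l estimate
        def bwd_hook(mod, gin, gout, i=i):
            grads[i] = gout[0].detach().pow(2).sum(-1).mean().item()  # ~ ||dL/dh^l||^2
        blk.register_forward_hook(fwd_hook)
        blk.register_full_backward_hook(bwd_hook)

# usage: attach_probes(model); loss = criterion(model(x), y); loss.backward()
# gradient amplification relative to the last block: grads[l] / grads[max(grads)]
```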
Figure 3. Empirical measurements in a ViT model: (a) the layer-wise component variance of the activation vector, $K^{\ell}$; and the layer-wise gradient amplification coefficients, shown (b) directly and (c) as the squared logarithm. Averages are taken over the batch dimension and patches. For DyT, we use the same conventions as in Fig. 2: the dot marks $\ell^*$ (defined by $\alpha^2 K^{\ell^*} = 1$), solid/dashed curves indicate the stretched-exponential/exponential regimes, and the black curve is the pre-LN baseline. Overall, DyT exhibits much faster gradient-norm amplification from later layers to earlier ones – especially when $\alpha$ is chosen so that the residual-update size is comparable to pre-LN; this effect can be mitigated by choosing $\alpha$ sufficiently small for the given model depth.
2.4 DyT benefits from learning rate warmup
[1, 2] trained their DyT/Derf ViT models with learning-rate warmup. We empirically show that, in addition to tuning the initial $\alpha$, DyT Transformers may require careful tuning of the number of warmup steps when $\alpha$ is large, whereas this is less important for the pre-LN variant. We train a ViT-B (12 layers, hidden dimension 768) on CIFAR-100, with DyT and with pre-LN, varying the number of warmup epochs, the learning rate, and the DyT parameter $\alpha$ at initialization. Fig. 4 (a) shows that reducing the number of warmup epochs can destabilize training with DyT, while using too many warmup epochs can slow training: the model without warmup diverges, while the model with 3 warmup epochs converges slightly faster than the model with 10 warmup epochs. Fig. 4 (b) compares the DyT variant to the pre-LN variant without warmup as the learning rate is varied. The DyT model trains stably only at the lowest learning rate, and converges more slowly than the pre-LN baseline. As argued in the previous section, it is natural to align pre-LN with DyT based on the size of the residual-stream updates; for this reason, we compare pre-LN to a DyT variant with a relatively large $\alpha$ here. Fig. 4 (c) shows the test accuracy after 9 epochs as a function of $\alpha$ and the number of warmup epochs, confirming that warmup becomes more important as $\alpha$ increases. Finally, Fig. 4 (d) shows that choosing $\alpha$ too small can also slow training, possibly due to the smaller residual-stream updates.
The models were trained with AdamW and a batch size of 256. After the initial warmup, the learning rate was held constant. We ran the experiments for only 30 epochs due to limited resources; fully training the models would require at least a few hundred epochs.
Overall, a good initialization of $\alpha$ should avoid both extremes: values so small that training slows down and values so large that training becomes unstable. Increasing the number of warmup epochs can help stabilize training at larger $\alpha$. The pre-LN setup exhibits better training stability across a wider range of learning rates than DyT, provided that $\alpha$ in DyT is chosen so as not to overly reduce the size of the residual updates relative to pre-LN.
Figure 4. Effect of learning-rate warmup, learning rate, and $\alpha$ initialization on training stability in ViT-B (12 layers, hidden dimension 768) on CIFAR-100 for DyT and pre-LN. (a) Warmup sweep for DyT. (b) Learning-rate sweep comparing DyT (at a relatively large $\alpha$) to pre-LN without warmup. (c) Test accuracy at epoch 9 versus $\alpha$ and the number of warmup epochs. (d) $\alpha$ sweep. At larger values of $\alpha$, DyT is more sensitive to warmup and exhibits worse training stability than pre-LN across a range of learning rates.
3. Conclusion
Normalization-free Transformers that replace LayerNorm with saturating pointwise functions (DyT/Derf) gain computational simplicity at the cost of worse signal propagation across layers. In the mean-field picture, pre-LN residual networks are effectively critical at initialization: gradient norms grow only as a power law with depth. By contrast, DyT/Derf activation functions are subcritical in residual networks: they lead to stretched-exponential gradient amplification, which can make optimization more sensitive to hyperparameters.
A key hyperparameter in DyT/Derf is the scale $\alpha$ at initialization. Larger $\alpha$ implies stronger overall gradient amplification from late to early layers, especially relative to the pre-LN architecture. The criticality perspective also helps rationalize the training recipes in [1, 2]: smaller $\alpha$ improves training stability, deeper models require smaller $\alpha$, and learning-rate warmup becomes increasingly important at larger $\alpha$.
Practical takeaways.
- Pre-LN yields critical gradient propagation “by default” in deep residual networks; DyT/Derf does not.
- Increasing $\alpha$ increases end-to-end gradient amplification (and instability risk).
- Depth matters: as depth increases, matching pre-LN gradient amplification in DyT/Derf requires decreasing $\alpha$.
- Warmup for DyT/Derf can be a stabilizer, similar in spirit to warmup in architectures with poor early gradient flow.