μP (the Maximal Update Parameterisation) is an influential, theoretically grounded prescription for scaling various neural network architectures such that the layer activations (and other quantities such as the learning rate) remain stable during training (neither shrinking nor exploding) as the model size (i.e. width and depth) grows.
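As a concrete (width-only) illustration of the prescription, here is a minimal numpy sketch of the usual μP scaling table for an MLP under SGD: input and hidden weights initialised with variance 1/fan_in, the readout with variance 1/fan_in², and layerwise learning rates of order width, 1 and 1/width respectively. All names, sizes and constants are illustrative assumptions, not taken from any single paper.

```python
import numpy as np

# Hedged sketch of width-only muP (the SGD variant), assuming the usual scaling
# table: input and hidden weights ~ N(0, 1/fan_in), the readout ~ N(0, 1/fan_in^2),
# with SGD learning rates of order width, 1 and 1/width for the input, hidden and
# readout layers respectively. Names, sizes and constants are illustrative.

def mup_init(width, d_in=32, d_out=4, seed=0):
    rng = np.random.default_rng(seed)
    return {
        "W_in": rng.normal(0.0, 1.0 / np.sqrt(d_in), (d_in, width)),   # var 1/fan_in
        "W_h": rng.normal(0.0, 1.0 / np.sqrt(width), (width, width)),  # var 1/fan_in
        "W_out": rng.normal(0.0, 1.0 / width, (width, d_out)),         # var 1/fan_in^2
    }

def features(params, x):
    h1 = np.maximum(x @ params["W_in"], 0.0)    # input layer + ReLU
    return np.maximum(h1 @ params["W_h"], 0.0)  # hidden layer + ReLU

def mup_sgd_lrs(width, base_width=256, base_lr=0.1):
    # Layerwise SGD learning rates relative to an (illustrative) base width.
    mult = width / base_width
    return {"W_in": base_lr * mult, "W_h": base_lr, "W_out": base_lr / mult}

# Under these scalings the typical hidden-activation size stays O(1) in width:
x = np.random.default_rng(1).normal(size=(8, 32))
for width in (128, 512, 2048):
    h = features(mup_init(width), x)
    print(width, round(float(np.sqrt((h ** 2).mean())), 3))
```

The printed hidden-activation RMS stays roughly constant as the width grows, which is exactly the stability property μP is designed to preserve (and which the standard parameterisation loses).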
- Key papers
- Depth extensions
- Understanding hyperparameter transfer
- Spectral perspective
- Other optimisers
- Other architectures
- On weight decay
- Miscellaneous
- Further resources
- Tensor Programs IV: Feature Learning in Infinite-Width Neural Networks: original paper introducing μP for SGD, building on the "Tensor Programs" formalism. The main motivation was to find a parameterisation that (i) allows for as much feature learning as possible (μP is maximal in this sense), unlike the NTK, and (ii) remains stable with respect to the model width, unlike the standard parameterisation.
- Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer: building on the previous paper, this shows that under μP, many optimal hyperparameters such as the learning rate also remain stable across models of different widths (including GPT-3), allowing for zero-shot hyperparameter transfer without tuning at large scale. It also extends μP from SGD to Adam.
- Tensor Programs IVb: Adaptive Optimization in the Infinite-Width Limit: fully works out the μP theory for adaptive optimisers including Adam.
- Depth Dependence of μP Learning Rates in ReLU MLPs
- Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks: concurrently with the following paper, this proposed an extension of μP to model depth for ResNets (with unit block depth) by rescaling each residual block and parameter update by the square root of the depth. Experiments with fully connected ResNets on CIFAR10.
- Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit: concurrently with the previous paper, this proposed a slightly different depth-extension of μP using dynamical mean field theory (DMFT). Unlike Yang et al. (2024), they do not rescale the learning rate of Adam, but this rescaling is reintroduced in the next paper. Experiments with both CNNs and ViTs (with and without LayerNorm) on both CIFAR10 and ImageNet.
- Infinite Limits of Multi-head Transformer Dynamics: also relying on DMFT, derives width and depth limits for multi-head attention transformers, providing principled scalings for SGD and heuristic scalings for Adam.
- Don’t be lazy: CompleteP enables compute-efficient deep transformers: in contrast to previous depth extensions of μP, this proposes rescaling the transformer residual blocks by the inverse depth 1/L (rather than 1/√L), based on both empirical results and a theoretical notion of non-lazy learning across all model layers. This parameterisation requires rescaling other quantities such as LayerNorm and Adam's weight decay parameter. Experiments with large-scale LLMs, also revealing new compute-optimal regimes.
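The competing branch rescalings in the depth extensions above can be contrasted in a toy ResNet: scaling each residual branch by 1/√L (as in Tensor Programs VI and the DMFT line of work) or by 1/L (as in CompleteP) both keep the forward pass stable as the depth L grows. The linear-plus-tanh blocks and all sizes below are illustrative assumptions, not any paper's exact setup.

```python
import numpy as np

# Hedged sketch of the residual-branch rescalings behind the depth extensions of
# muP: 1/sqrt(L) per block (Tensor Programs VI / DMFT) vs 1/L (CompleteP). The
# blocks are simple tanh-activated linear maps, purely for illustration.

def resnet_forward(x, blocks, alpha):
    h = x
    for W in blocks:
        h = h + alpha * np.tanh(h @ W)  # residual block, scaled by alpha
    return h

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 64))
for depth in (16, 256):
    blocks = [rng.normal(0, 1 / np.sqrt(64), (64, 64)) for _ in range(depth)]
    for name, alpha in [("1/sqrt(L)", depth ** -0.5), ("1/L", 1.0 / depth)]:
        h = resnet_forward(x, blocks, alpha)
        print(depth, name, round(float(np.sqrt((h ** 2).mean())), 3))
```

With either scaling the output RMS stays of the same order as the input even at depth 256, whereas an unscaled residual stream (alpha = 1) would blow up; the papers above differ in which scaling keeps every layer learning non-lazily.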
- Super Consistency of Neural Network Landscapes and Learning Rate Transfer: investigates the phenomenon of learning rate transfer from an optimisation perspective, showing that under μP and its depth extension (but not the NTK), certain quantities including the largest eigenvalue of the loss Hessian (aka sharpness) remain consistent across different model scales (i.e. widths and depths). Comprehensive experiments with ResNets, ViTs and GPT-2.
- On the Provable Separation of Scales in Maximal Update Parameterization
- A Proof of Learning Rate Transfer under μP
- Understanding the Mechanisms of Fast Hyperparameter Transfer
- Optimal learning rate scaling depends on data in deep scalar linear networks
- A Spectral Condition for Feature Learning: shows an interesting equivalence between μP and a spectral condition on the weight matrices: their norms and those of their updates should scale as √(fan_out/fan_in). This perspective partly inspired the Muon optimiser.
- Extending μP: Spectral Conditions for Feature Learning Across Optimizers
- Towards a Principled Muon under μ𝖯: Ensuring Spectral Conditions throughout Training
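As a toy illustration of such spectral conditions, the sketch below rescales a raw gradient so that the update's spectral norm hits a target of order √(fan_out/fan_in), in the spirit of (though much cruder than) Muon-style updates, which orthogonalise rather than merely rescale. The target constant and all sizes are illustrative assumptions.

```python
import numpy as np

# Hedged sketch of the spectral condition: weights and their updates should have
# spectral norm of order sqrt(fan_out / fan_in), so that inputs with O(1)-sized
# entries map to outputs with O(1)-sized entries. Illustrative only.

def spectral_norm(W):
    # Largest singular value of W.
    return float(np.linalg.svd(W, compute_uv=False)[0])

def spectrally_rescaled(G, eta=0.1):
    # Rescale a raw gradient so the update's spectral norm equals the target
    # eta * sqrt(fan_out / fan_in). (Muon goes further and orthogonalises G.)
    fan_out, fan_in = G.shape
    target = eta * np.sqrt(fan_out / fan_in)
    return G * (target / spectral_norm(G))

rng = np.random.default_rng(0)
dW = spectrally_rescaled(rng.normal(size=(1024, 256)))  # fan_out=1024, fan_in=256
print(round(spectral_norm(dW), 6))  # 0.1 * sqrt(1024 / 256) = 0.2
```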
- On the Parameterization of Second-Order Optimization Effective Towards the Infinite Width: derives a feature-learning (μP-like) infinite-width limit parameterisation for second-order methods including K-FAC and Shampoo. Experiments with MLPs, CNNs, ResNets and a simplified language model.
- Effective Sharpness Aware Minimization Requires Layerwise Perturbation Scaling: derives a μP extension for sharpness aware minimisation (SAM) with stable learning rate and perturbation radius across model widths. Experiments with MLPs, ResNets & ViTs.
- Learning Rate Scaling across LoRA Ranks and Transfer to Full Finetuning
- Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales
- u-μP: The Unit-Scaled Maximal Update Parametrization: proposes a simple extension of μP (u-μP) for LLMs such that activations, weights and gradients have unit variance (rather than merely constant variance) with respect to model width, showing that this helps with low-precision training.
- μnit Scaling: Simple and Scalable FP8 LLM Training: proposes a more efficient unit-scaled μP parameterisation for low-precision LLM training.
- Sparse maximal update parameterization: A holistic approach to sparse training dynamics: derives a μP extension for random unstructured static (weight) sparsity with stable feature learning across both model widths and sparsity level. Experiments with LLMs.
- Scaling Diffusion Transformers Efficiently via μP
- Optimal Embedding Learning Rate in LLMs: The Effect of Vocabulary Size
- On Feature Learning in Structured State Space Models
- Local Loss Optimization in the Infinite Width: Stable Parameterization of Predictive Coding Networks and Target Propagation
- μPC: Scaling Predictive Coding to 100+ Layer Networks
- On the Infinite Width and Depth Limits of Predictive Coding Networks
- μ-Parametrization for Mixture of Experts
- Transfer Paramatters: Optimal per-Module Hyperparameters Across All Scaling Axes
- GQA-μP: The Maximal Parameterization Update for Grouped Query Attention and Fully Sharded Data Parallel
- μLO: Compute-Efficient Meta-Generalization of Learned Optimizers
- Arithmetic-Mean μP for Modern Architectures: A Unified Learning-Rate Scale for CNNs and ResNets
- Hyperparameter Transfer with Mixture-of-Experts Layers
- Hyperparameter Transfer Laws for Non-Recurrent Multi-Path Neural Networks
- μpscaling small models: Principled warm starts and hyperparameter transfer
The role of weight decay with respect to depth-transfer is discussed in the CompleteP work (Dey et al., 2025).
- How to set AdamW's weight decay as you scale model and dataset size
- Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training
- Weight Decay may matter more than muP for Learning Rate Transfer in Practice
- Robust Layerwise Scaling Rules by Proper Weight Decay Tuning
- Lecture Notes on Infinite-Width Limits of Neural Networks: these notes (see especially Section 4) provide a detailed and pedagogical derivation of (width-only) μP for MLPs.
- Feature-Learning Networks Are Consistent Across Widths At Realistic Scales: empirically shows that the behaviour of finite-width networks is remarkably consistent across model widths used in practice, validating μP. See the next paper for a similar study also considering depth.
- Function-Space Learning Rates: impressive paper developing an efficient method (requiring only a few extra backward passes) to measure the change in the network function induced by parameter updates, achieving hyperparameter transfer across width, depth and even LoRA rank for many architectures including transformers. The empirical nature of this approach has the advantage, over μP, of not needing to derive scalings on a case-by-case basis.
- The Optimization Landscape of SGD Across the Feature Learning Strength
- Over-Alignment vs Over-Fitting: The Role of Feature Learning Strength in Generalization
- Scaling Exponents Across Parameterizations and Optimizers
- Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks
- On the Surprising Effectiveness of Large Learning Rates under Standard Width Scaling
- A thorough reproduction and evaluation of μP
- The lazy (NTK) & rich (μP) regimes: A gentle tutorial
- An Empirical Study of μP Learning Rate Transfer
- μP for RL: Mitigating Feature Inconsistencies During Reinforcement Learning
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
- my high-level blog post on μP and its extensions
- Microsoft's post introducing μP
- Microsoft's post on the original hyperparameter transfer results
- this post by Speechmatics
- this post by Cerebras
- this conversation with Greg Yang focused on "Tensor Programs"
- this talk on the scaling exponents of different parameterisations
- the original `mup` GitHub repo (PyTorch)
- the `nanoGPT-mup` repo (PyTorch)
Contributions are welcome! To add a paper or submit a correction, please open an issue or submit a pull request.