μP (the Maximal Update Parameterisation) is an influential, theoretically grounded prescription for scaling various neural network architectures such that the layer activations (and other quantities such as the learning rate) remain stable during training (neither shrinking nor exploding) as the model size (i.e. width and depth) grows.
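As a concrete (width-only) illustration of the prescription, here is a minimal numpy sketch of the usual μP scaling table for an MLP under SGD: input and hidden weights initialised with variance 1/fan_in, the readout with variance 1/fan_in², and layerwise learning rates of order width, 1 and 1/width respectively. All names, sizes and constants are illustrative assumptions, not taken from any single paper.

```python
import numpy as np

# Hedged sketch of width-only muP (the SGD variant), assuming the usual scaling
# table: input and hidden weights ~ N(0, 1/fan_in), the readout ~ N(0, 1/fan_in^2),
# with SGD learning rates of order width, 1 and 1/width for the input, hidden and
# readout layers respectively. Names, sizes and constants are illustrative.

def mup_init(width, d_in=32, d_out=4, seed=0):
    rng = np.random.default_rng(seed)
    return {
        "W_in": rng.normal(0.0, 1.0 / np.sqrt(d_in), (d_in, width)),   # var 1/fan_in
        "W_h": rng.normal(0.0, 1.0 / np.sqrt(width), (width, width)),  # var 1/fan_in
        "W_out": rng.normal(0.0, 1.0 / width, (width, d_out)),         # var 1/fan_in^2
    }

def features(params, x):
    h1 = np.maximum(x @ params["W_in"], 0.0)    # input layer + ReLU
    return np.maximum(h1 @ params["W_h"], 0.0)  # hidden layer + ReLU

def mup_sgd_lrs(width, base_width=256, base_lr=0.1):
    # Layerwise SGD learning rates relative to an (illustrative) base width.
    mult = width / base_width
    return {"W_in": base_lr * mult, "W_h": base_lr, "W_out": base_lr / mult}

# Under these scalings the typical hidden-activation size stays O(1) in width:
x = np.random.default_rng(1).normal(size=(8, 32))
for width in (128, 512, 2048):
    h = features(mup_init(width), x)
    print(width, round(float(np.sqrt((h ** 2).mean())), 3))
```

The printed hidden-activation RMS stays roughly constant as the width grows, which is exactly the stability property μP is designed to preserve (and which the standard parameterisation loses).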
- Key papers
- Depth extensions
- Understanding hyperparameter transfer
- Spectral perspective
- Other optimisers
- Other architectures
- On weight decay
- Miscellaneous
- Further resources
- Tensor Programs IV: Feature Learning in Infinite-Width Neural Networks: original paper introducing μP for SGD, building on the "Tensor Programs" formalism. The main motivation was to find a parameterisation that (i) allows for as much feature learning as possible (μP is maximal in this sense), unlike the NTK, and (ii) remains stable with respect to the model width, unlike the standard parameterisation.
- Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer: building on the previous paper, this shows that under μP, many optimal hyperparameters such as the learning rate also remain stable across models of different widths (including GPT-3), allowing for zero-shot hyperparameter transfer without tuning at large scale. It also extends μP from SGD to Adam.
- Tensor Programs IVb: Adaptive Optimization in the Infinite-Width Limit: fully works out the μP theory for adaptive optimisers including Adam.
- Depth Dependence of μP Learning Rates in ReLU MLPs
- Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks: concurrently with the following paper, this proposed an extension of μP to model depth for ResNets (with unit block depth) by rescaling each residual block and parameter update by the square root of the depth. Experiments with fully connected ResNets on CIFAR10.
- Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit: concurrently with the previous paper, this proposed a slightly different depth-extension of μP using dynamical mean field theory (DMFT). Unlike Yang et al. (2024), they do not rescale the learning rate of Adam, but this rescaling is reintroduced in the next paper. Experiments with both CNNs and ViTs (with and without LayerNorm) on both CIFAR10 and ImageNet.
- Infinite Limits of Multi-head Transformer Dynamics: also relying on DMFT, derives width and depth limits for multi-head attention transformers, providing principled scalings for SGD and heuristic scalings for Adam.
- Don’t be lazy: CompleteP enables compute-efficient deep transformers: in contrast to previous depth extensions of μP, this proposes rescaling the transformer residual blocks by the inverse depth 1/L (rather than 1/√L), based on both empirical results and a theoretical notion of non-lazy learning across all model layers. This parameterisation requires rescaling other quantities such as LayerNorm and Adam's weight decay parameter. Experiments with large-scale LLMs, also revealing new compute-optimal regimes.
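The competing branch rescalings in the depth extensions above can be contrasted in a toy ResNet: scaling each residual branch by 1/√L (as in Tensor Programs VI and the DMFT line of work) or by 1/L (as in CompleteP) both keep the forward pass stable as the depth L grows. The linear-plus-tanh blocks and all sizes below are illustrative assumptions, not any paper's exact setup.

```python
import numpy as np

# Hedged sketch of the residual-branch rescalings behind the depth extensions of
# muP: 1/sqrt(L) per block (Tensor Programs VI / DMFT) vs 1/L (CompleteP). The
# blocks are simple tanh-activated linear maps, purely for illustration.

def resnet_forward(x, blocks, alpha):
    h = x
    for W in blocks:
        h = h + alpha * np.tanh(h @ W)  # residual block, scaled by alpha
    return h

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 64))
for depth in (16, 256):
    blocks = [rng.normal(0, 1 / np.sqrt(64), (64, 64)) for _ in range(depth)]
    for name, alpha in [("1/sqrt(L)", depth ** -0.5), ("1/L", 1.0 / depth)]:
        h = resnet_forward(x, blocks, alpha)
        print(depth, name, round(float(np.sqrt((h ** 2).mean())), 3))
```

With either scaling the output RMS stays of the same order as the input even at depth 256, whereas an unscaled residual stream (alpha = 1) would blow up; the papers above differ in which scaling keeps every layer learning non-lazily.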
- Super Consistency of Neural Network Landscapes and Learning Rate Transfer: investigates the phenomenon of learning rate transfer from an optimisation perspective, showing that under μP and its depth extension (but not the NTK), certain quantities including the largest eigenvalue of the loss Hessian (aka sharpness) remain consistent across different model scales (i.e. widths and depths). Comprehensive experiments with ResNets, ViTs and GPT-2.
- On the Provable Separation of Scales in Maximal Update Parameterization
- A Proof of Learning Rate Transfer under μP
- Understanding the Mechanisms of Fast Hyperparameter Transfer
- Optimal learning rate scaling depends on data in deep scalar linear networks
- A Spectral Condition for Feature Learning: shows an interesting equivalence between μP and a spectral condition on the weight matrices: their norms and those of their updates should scale as √(fan_out/fan_in). This perspective partly inspired the Muon optimiser.
- Extending μP: Spectral Conditions for Feature Learning Across Optimizers
- Towards a Principled Muon under μ𝖯: Ensuring Spectral Conditions throughout Training
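As a toy illustration of such spectral conditions, the sketch below rescales a raw gradient so that the update's spectral norm hits a target of order √(fan_out/fan_in), in the spirit of (though much cruder than) Muon-style updates, which orthogonalise rather than merely rescale. The target constant and all sizes are illustrative assumptions.

```python
import numpy as np

# Hedged sketch of the spectral condition: weights and their updates should have
# spectral norm of order sqrt(fan_out / fan_in), so that inputs with O(1)-sized
# entries map to outputs with O(1)-sized entries. Illustrative only.

def spectral_norm(W):
    # Largest singular value of W.
    return float(np.linalg.svd(W, compute_uv=False)[0])

def spectrally_rescaled(G, eta=0.1):
    # Rescale a raw gradient so the update's spectral norm equals the target
    # eta * sqrt(fan_out / fan_in). (Muon goes further and orthogonalises G.)
    fan_out, fan_in = G.shape
    target = eta * np.sqrt(fan_out / fan_in)
    return G * (target / spectral_norm(G))

rng = np.random.default_rng(0)
dW = spectrally_rescaled(rng.normal(size=(1024, 256)))  # fan_out=1024, fan_in=256
print(round(spectral_norm(dW), 6))  # 0.1 * sqrt(1024 / 256) = 0.2
```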
- On the Parameterization of Second-Order Optimization Effective Towards the Infinite Width: derives a feature-learning (μP-like) infinite-width limit parameterisation for second-order methods including K-FAC and Shampoo. Experiments with MLPs, CNNs, ResNets and a simplified language model.
- Effective Sharpness Aware Minimization Requires Layerwise Perturbation Scaling: derives a μP extension for sharpness aware minimisation (SAM) with stable learning rate and perturbation radius across model widths. Experiments with MLPs, ResNets & ViTs.
- Learning Rate Scaling across LoRA Ranks and Transfer to Full Finetuning
- Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales
- u-μP: The Unit-Scaled Maximal Update Parametrization: proposes a simple extension of μP (u-μP) for LLMs such that activations, weights and gradients have unit variance (rather than merely constant variance) with respect to model width, showing that this helps with low-precision training.
- μnit Scaling: Simple and Scalable FP8 LLM Training: proposes a more efficient unit-scaled μP parameterisation for low-precision LLM training.
- Sparse maximal update parameterization: A holistic approach to sparse training dynamics: derives a μP extension for random unstructured static (weight) sparsity with stable feature learning across both model widths and sparsity level. Experiments with LLMs.
- Scaling Diffusion Transformers Efficiently via μP
- Optimal Embedding Learning Rate in LLMs: The Effect of Vocabulary Size
- On Feature Learning in Structured State Space Models
- Local Loss Optimization in the Infinite Width: Stable Parameterization of Predictive Coding Networks and Target Propagation
- μPC: Scaling Predictive Coding to 100+ Layer Networks
- On the Infinite Width and Depth Limits of Predictive Coding Networks
- μ-Parametrization for Mixture of Experts
- Transfer Paramatters: Optimal per-Module Hyperparameters Across All Scaling Axes
- GQA-μP: The Maximal Parameterization Update for Grouped Query Attention and Fully Sharded Data Parallel
- μLO: Compute-Efficient Meta-Generalization of Learned Optimizers
- Arithmetic-Mean μP for Modern Architectures: A Unified Learning-Rate Scale for CNNs and ResNets
- Hyperparameter Transfer with Mixture-of-Experts Layers
- Hyperparameter Transfer Laws for Non-Recurrent Multi-Path Neural Networks
- μpscaling small models: Principled warm starts and hyperparameter transfer
The role of weight decay with respect to depth-transfer is discussed in the CompleteP work (Dey et al., 2025).
- How to set AdamW's weight decay as you scale model and dataset size
- Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training
- Weight Decay may matter more than muP for Learning Rate Transfer in Practice
- Robust Layerwise Scaling Rules by Proper Weight Decay Tuning
- Lecture Notes on Infinite-Width Limits of Neural Networks: these notes (see especially Section 4) provide a detailed and pedagogical derivation of (width-only) μP for MLPs.
- Feature-Learning Networks Are Consistent Across Widths At Realistic Scales: empirically shows that the behaviour of finite-width networks is remarkably consistent across model widths used in practice, validating μP. See the next paper for a similar study also considering depth.
- Function-Space Learning Rates: impressive paper developing an efficient method (requiring only a few extra backward passes) to measure the change in the network function induced by parameter updates, achieving hyperparameter transfer across width, depth and even LoRA rank for many architectures including transformers. The empirical nature of this approach has the advantage, over μP, of not needing to derive scalings on a case-by-case basis.
- The Optimization Landscape of SGD Across the Feature Learning Strength
- Over-Alignment vs Over-Fitting: The Role of Feature Learning Strength in Generalization
- Scaling Exponents Across Parameterizations and Optimizers
- Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks
- On the Surprising Effectiveness of Large Learning Rates under Standard Width Scaling
- A thorough reproduction and evaluation of μP
- The lazy (NTK) & rich (μP) regimes: A gentle tutorial
- An Empirical Study of μP Learning Rate Transfer
- μP for RL: Mitigating Feature Inconsistencies During Reinforcement Learning
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
- my high-level blog post on μP and its extensions
- Microsoft's post introducing μP
- Microsoft's post on the original hyperparameter transfer results
- this post by Speechmatics
- this post by Cerebras
- this conversation with Greg Yang focused on "Tensor Programs"
- this talk on the scaling exponents of different parameterisations
- the original `mup` GitHub repo (PyTorch)
- the `nanoGPT-mup` repo (PyTorch)
Contributions are welcome! To add a paper or submit a correction, please open an issue or submit a pull request.