[Performance] Optimize AdamW GPU kernel: use device-side lr/beta_pow accessors, float64 accumulators#78830
Open
zhengshengning wants to merge 5 commits intoPaddlePaddle:developfrom
Open
[Performance] Optimize AdamW GPU kernel: use device-side lr/beta_pow accessors, float64 accumulators#78830zhengshengning wants to merge 5 commits intoPaddlePaddle:developfrom
zhengshengning wants to merge 5 commits intoPaddlePaddle:developfrom
Conversation
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
|
你的PR提交成功,感谢你对开源项目的贡献! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
PR Category
Performance Optimization
PR Types
Performance
Description
背景与动机
当前 AdamW GPU kernel 存在以下性能和精度问题:
不必要的 host-device 同步:每次 step 都需将
beta1_pow、beta2_pow和learning_rate从 GPU 拷贝到 CPU(memory_utils::Copy+dev_ctx.Wait()),阻塞 GPU 流水线,在高频训练中带来可观的额外延迟。精度损失:
beta_pow使用 float32 存储,随着训练步数增加,精度误差会累积;学习率也存在 float32 精度截断问题,之前通过lr_ratio补偿的方式属于 workaround,逻辑复杂且易错。beta_pow 和学习率改为 float64 后,与之前 float32 存储相比,训练结果的数值精度会有所提升(更接近理论值),不会引起精度下降。已有依赖 float32 beta_pow 行为的下游测试可能需要更新数值容忍度。
本次改动
引入
AdamWLrAccessor和AdamWBiasCorrAccessor模板结构体:根据 lr / beta_pow 是否在 CPU 上,分别走 CPU 直传标量路径和 GPU 设备指针路径,彻底消除 GPU 在 CPU 上的 host copy 同步。使用
__shared__内存缓存 per-block 标量:lr、bias_correction1/2_sqrt、step_size、lr * weight_decay等只需 thread 0 读取一次后广播,减少重复计算。beta_pow 累加器改为 float64:
beta1_pow、beta2_pow统一使用 FLOAT64 存储,提升长训练精度,消除之前通过lr_ratio补偿 float32 精度损失的 workaround。学习率 dtype 改为 float64:
_create_global_learning_rate统一使用paddle.float64,配合 kernel 侧直接读取double*,端到端避免 float32 精度截断。涉及模块
paddle/phi/kernels/gpu/adamw_kernel.cu:kernel 结构重构,引入 accessor 模板,shared memory 优化,beta_pow 类型更新,kernel registration 类型约束。python/paddle/optimizer/adamw.py:beta_pow 累加器强制 FLOAT64,移除 lr_ratio 精度补偿逻辑。python/paddle/optimizer/optimizer.py:全局学习率统一改为 float64。性能
是否引起精度变化
否