_Krauss - CSDN Blog

Original: Why Does Pre Norm Perform Worse Than Post Norm?

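This article discusses the difference between Pre Norm (normalization applied before the sublayer, inside the residual branch) and Post Norm (normalization applied after the residual addition) in Transformer models, and explains why Pre Norm, although easier to train, ends up performing worse than Post Norm. The core argument is that Pre Norm unintentionally "dilutes" the model's effective depth: stacking more layers behaves more like increasing width than increasing depth, which hurts final quality.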
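To make the two layouts concrete, here is a minimal sketch in plain Python. It assumes the standard definitions, Pre Norm as `x + f(Norm(x))` and Post Norm as `Norm(x + f(x))`. The names `layer_norm`, `toy_sublayer`, `pre_norm_block`, `post_norm_block`, and `stack` are hypothetical and chosen only for illustration, and the tanh sublayer is a stand-in for attention or a feed-forward layer, not anything taken from the article.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and approximately unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or an FFN: an elementwise tanh."""
    return [math.tanh(0.5 * v) for v in x]

def pre_norm_block(x, f=toy_sublayer):
    # Pre Norm: x + f(Norm(x)). The residual path itself is never normalized,
    # so each layer only adds a new term onto an ever-growing running sum.
    return [a + b for a, b in zip(x, f(layer_norm(x)))]

def post_norm_block(x, f=toy_sublayer):
    # Post Norm: Norm(x + f(x)). Normalization follows the residual addition,
    # so the whole stream is rescaled at every layer.
    return layer_norm([a + b for a, b in zip(x, f(x))])

def stack(block, x, depth):
    """Apply the same block `depth` times in sequence."""
    for _ in range(depth):
        x = block(x)
    return x

def rms(x):
    """Root mean square of a vector, used here as a rough scale measure."""
    return math.sqrt(sum(v * v for v in x) / len(x))

x0 = [1.0, -2.0, 0.5, 3.0]
pre_out = stack(pre_norm_block, x0, depth=8)
post_out = stack(post_norm_block, x0, depth=8)
print("Pre Norm  output RMS:", rms(pre_out))
print("Post Norm output RMS:", rms(post_out))
```

Unrolling the Pre Norm recursion gives `x_L = x_0 + f(Norm(x_0)) + ... + f(Norm(x_{L-1}))`, a sum of per-layer contributions sitting on one unnormalized stream. That is one way to read the "more like width than depth" claim. In Post Norm, every layer renormalizes the entire stream, including the contributions of all earlier layers.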

2025-10-13 09:51:30 574
