Several Regularization Methods in Deep Learning

This article surveys regularization strategies in machine learning, including L1 and L2 penalties, Dropout, and adversarial training, all aimed at reducing overfitting and improving a model's ability to generalize.


Regularization is any strategy intended to reduce test error (possibly at the expense of training error). Broadly, it works by adjusting the model's capacity, that is, by enlarging or restricting the set of functions the model is able to fit.

  1. Parameter norm penalties (references: [1], [2], [3])
    See "Regularized approximation" on page 297 of *Convex Optimization*.
    Usually only the weights are penalized, not the biases.
    Basic form:
    $\widetilde{J}(\pmb{\theta};\pmb{X},y)=J(\pmb{\theta};\pmb{X},y)+\alpha\Omega(\pmb{\theta})$
    Common choices (a minimal sketch of these penalties in code follows this item):

    • $L^2$ regularization (weight decay, ridge regression, Tikhonov regularization)
      $\Omega(\pmb{\theta})=\frac{1}{2}\lVert\pmb{w}\rVert_2^2$
    • $L^1$ regularization
      $\Omega(\pmb{\theta})=\lVert\pmb{w}\rVert_1=\sum_i|w_i|$
    • Constraining the penalty term, e.g. $\Omega(\pmb{\theta})<k$, via the generalized Lagrangian
      $\mathcal{L}(\pmb{\theta},\alpha;\pmb{X},y)=J(\pmb{\theta};\pmb{X},y)+\alpha(\Omega(\pmb{\theta})-k)$
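The penalties above can be made concrete with a short NumPy sketch. It is only an illustration: the coefficient `alpha`, the placeholder `base_loss` value, and the random weight matrix `W` are assumptions, not values from the post.

```python
import numpy as np

def l2_penalty(W):
    # Omega(theta) = 1/2 * ||w||_2^2 (weights only, biases are not penalized)
    return 0.5 * np.sum(W ** 2)

def l1_penalty(W):
    # Omega(theta) = ||w||_1 = sum_i |w_i|
    return np.sum(np.abs(W))

def penalized_objective(base_loss, W, alpha=0.01, kind="l2"):
    # J_tilde(theta; X, y) = J(theta; X, y) + alpha * Omega(theta)
    omega = l2_penalty(W) if kind == "l2" else l1_penalty(W)
    return base_loss + alpha * omega

W = np.random.default_rng(0).normal(size=(64, 32))
print(penalized_objective(base_loss=0.7, W=W, alpha=0.01, kind="l1"))
```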
  2. Dataset augmentation
    Apply label-preserving transformations (e.g. flips, crops, small translations) to the training inputs to create additional examples; a sketch follows.
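A small NumPy sketch of such transformations, assuming a grayscale image array; the flip probability and the maximum shift are arbitrary illustrative choices.

```python
import numpy as np

def augment(image, rng, max_shift=4):
    # Random horizontal flip
    if rng.random() < 0.5:
        image = image[:, ::-1]
    # Zero-pad, then randomly crop back to the original size (a small translation)
    h, w = image.shape
    padded = np.pad(image, max_shift, mode="constant")
    top = rng.integers(0, 2 * max_shift + 1)
    left = rng.integers(0, 2 * max_shift + 1)
    return padded[top:top + h, left:left + w]

rng = np.random.default_rng(0)
augmented = augment(rng.random((28, 28)), rng)   # placeholder 28x28 image
```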

  3. Noise injection

    • Inject noise into the input data (for some models this is equivalent to a norm penalty on the weights; references: [4], [5])
    • Add noise to the hidden units (e.g. Dropout)
    • Add noise to the weights (references: [6], [7], [8])
    • Add noise to the output targets (motivated by the fact that some fraction of dataset labels is wrong)
      • Label smoothing (reference: [9])
        Instead of hard 0/1 targets, the softmax is trained toward $\frac{\epsilon}{k-1}$ for the incorrect classes and $1-\epsilon$ for the correct class; a sketch follows.
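A minimal NumPy sketch of label smoothing as described above; `eps=0.1` is just an illustrative value.

```python
import numpy as np

def smooth_labels(y_onehot, eps=0.1):
    # Correct class: 1 - eps; each of the other k-1 classes: eps / (k - 1)
    k = y_onehot.shape[-1]
    return y_onehot * (1.0 - eps) + (1.0 - y_onehot) * eps / (k - 1)

y = np.eye(5)[[0, 3]]              # two one-hot labels over k = 5 classes
print(smooth_labels(y, eps=0.1))   # rows become 0.9 for the true class, 0.025 elsewhere
```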
  4. Multi-task learning (references: [10], [11])
    Sharing part of the model (e.g. its early layers) across related tasks acts as a soft constraint on the parameters and tends to improve generalization on each task; a sketch of a shared trunk with task-specific heads follows.
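A minimal multi-task sketch in `tf.keras`; the layer sizes, the two task heads, and the losses are assumptions for illustration. Both objectives update the shared trunk.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

inputs = tf.keras.Input(shape=(32,))
shared = layers.Dense(64, activation="relu")(inputs)    # parameters shared by both tasks
shared = layers.Dense(64, activation="relu")(shared)

head_a = layers.Dense(10, activation="softmax", name="task_a")(shared)   # classification head
head_b = layers.Dense(1, name="task_b")(shared)                          # regression head

model = models.Model(inputs=inputs, outputs=[head_a, head_b])
model.compile(optimizer="adam",
              loss={"task_a": "sparse_categorical_crossentropy", "task_b": "mse"})
```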

  5. Early stopping (references: [12], [13])
    For a simple linear model with a quadratic error trained by plain gradient descent, early stopping is equivalent to $L^2$ regularization; a sketch of a patience-based loop follows.
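A sketch of patience-based early stopping. `train_one_epoch`, `validation_loss`, and the `model.params` attribute are hypothetical stand-ins for whatever training framework is in use.

```python
import copy

def fit_with_early_stopping(model, patience=5, max_epochs=200):
    best_val, best_params, since_best = float("inf"), None, 0
    for _ in range(max_epochs):
        train_one_epoch(model)            # hypothetical: one pass over the training set
        val = validation_loss(model)      # hypothetical: loss on a held-out validation set
        if val < best_val:
            best_val, best_params = val, copy.deepcopy(model.params)
            since_best = 0
        else:
            since_best += 1
            if since_best >= patience:    # validation has stopped improving
                break
    model.params = best_params            # roll back to the best parameters seen
    return model
```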

  6. Parameter tying (reference: [14])
    Basic idea: for similar tasks, the weights of the corresponding models are likely to be close to one another.
    The basic approach is a parameter norm penalty on their difference (a sketch follows):
    $\Omega(\pmb{w}^{(A)},\pmb{w}^{(B)})=\lVert\pmb{w}^{(A)}-\pmb{w}^{(B)}\rVert_2^2$
    The most prominent member of this family is parameter sharing.
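A NumPy sketch of the tying penalty above added to a joint objective; `alpha` and the task losses are placeholders.

```python
import numpy as np

def tying_penalty(w_a, w_b):
    # Omega(w_A, w_B) = ||w_A - w_B||_2^2
    return np.sum((w_a - w_b) ** 2)

def joint_objective(loss_a, loss_b, w_a, w_b, alpha=0.1):
    # Task losses plus a term that pulls the two weight vectors toward each other
    return loss_a + loss_b + alpha * tying_penalty(w_a, w_b)
```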

  7. Bagging (references: [15], [16], [17], [18], [19])
    Train several different models separately and decide the final output by voting over (or averaging) their predictions; a sketch follows.
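A sketch of the bagging procedure, assuming a `train_fn` helper that fits one model on the data it is given and models exposing an sklearn-style `predict_proba`; both are assumptions for illustration.

```python
import numpy as np

def train_bagged_ensemble(train_fn, X, y, n_models=5, seed=0):
    # Each model is trained on its own bootstrap resample (sampling with replacement)
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))
        models.append(train_fn(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # Average the members' predicted probabilities (majority voting is the hard-label analogue)
    return np.mean([m.predict_proba(X) for m in models], axis=0)
```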

  8. Dropout (references: [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33])
    Randomly delete units of a hidden layer by multiplying their activations by zero; each unit is zeroed with probability $p$ (kept with probability $1-p$), a hand-tuned hyperparameter.
    At inference time, layers trained with Dropout should be corrected with the weight scaling inference rule: multiply that layer's outgoing weights by the keep probability $1-p$.
    Dropout is best explained from the Bagging point of view: each random mask defines a different sub-network, and inference approximates averaging over this ensemble. A sketch of both phases follows.
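A NumPy sketch of both phases: stochastic masking during training and the weight scaling inference rule at test time. The array shapes and `p = 0.5` are illustrative.

```python
import numpy as np

def dropout_train(h, p, rng):
    # Zero each hidden activation with probability p during training
    mask = (rng.random(h.shape) >= p).astype(h.dtype)
    return h * mask

def scaled_weights_for_inference(W, p):
    # Weight scaling inference rule: scale the outgoing weights by the keep probability 1 - p
    return W * (1.0 - p)

rng = np.random.default_rng(0)
h = rng.random((4, 8))                                   # placeholder hidden activations
W = rng.random((8, 3))                                   # placeholder outgoing weights
train_out = dropout_train(h, p=0.5, rng=rng) @ W         # stochastic forward pass
test_out = h @ scaled_weights_for_inference(W, p=0.5)    # deterministic forward pass
```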

  9. Adversarial training (references: [34], [35], [36])
    Train on adversarially perturbed inputs, for example those produced by the fast gradient sign method of [35], which moves an input $\pmb{x}$ to $\pmb{x}+\epsilon\,\mathrm{sign}(\nabla_{\pmb{x}}J(\pmb{\theta};\pmb{x},y))$; a sketch follows.
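A sketch of generating a fast-gradient-sign perturbation; the gradient of the loss with respect to the input (`grad_x`) is assumed to be supplied by whatever framework computes the model's loss.

```python
import numpy as np

def fgsm_perturb(x, grad_x, eps=0.01):
    # x_adv = x + eps * sign(grad_x J(theta; x, y))
    return x + eps * np.sign(grad_x)

# During adversarial training, part of each minibatch is replaced by
# fgsm_perturb(x, grad_x) and trained on with the original labels.
```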


References
[1] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012c). Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580.
[2] Srebro, N. and Shraibman, A. (2005). Rank, trace-norm and max-norm. In Proceedings of the 18th Annual Conference on Learning Theory, pages 545–560. Springer-Verlag.
[3] Tibshirani, R. J. (1995). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B, 58, 267–288.

[4] Bishop, C. M. (1995a). Regularization and complexity control in feed-forward networks. In Proceedings International Conference on Artificial Neural Networks ICANN’95, volume 1, pages 141–148.
[5] Bishop, C. M. (1995b). Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1), 108–116.

[6] Jim, K.-C., Giles, C. L., and Horne, B. G. (1996). An analysis of noise in recurrent neural networks: convergence and generalization. IEEE Transactions on Neural Networks, 7(6), 1424–1438.
[7] Graves, A. (2011). Practical variational inference for neural networks. In NIPS’2011.
[8] Hochreiter, S. and Schmidhuber, J. (1995). Simplifying neural nets by discovering flat minima. In Advances in Neural Information Processing Systems 7, pages 529–536. MIT Press.

[9] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2015). Rethinking the Inception Architecture for Computer Vision. ArXiv e-prints.

[10] Caruana, R. (1993). Multitask connectionist learning. In Proc. 1993 Connectionist Models Summer School, pages 372–379.
[11] Baxter, J. (1995). Learning internal representations. In Proceedings of the 8th International Conference on Computational Learning Theory (COLT’95), pages 311–320, Santa Cruz, California. ACM Press.

[12] Bishop, C. M. (1995a). Regularization and complexity control in feed-forward networks. In Proceedings International Conference on Artificial Neural Networks ICANN’95, volume 1, pages 141–148.
[13] Sjöberg, J. and Ljung, L. (1995). Overtraining, regularization and searching for a minimum, with application to neural networks. International Journal of Control, 62(6), 1391–1407.

[14] Lasserre, J. A., Bishop, C. M., and Minka, T. P. (2006). Principled hybrids of generative and discriminative models. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR’06), pages 87–94, Washington, DC, USA. IEEE Computer Society.

[15] Breiman, L. (1994). Bagging predictors. Machine Learning, 24(2), 123–140.
[16] Koren, Y. (2009). The BellKor solution to the Netflix grand prize.
[17] Freund, Y. and Schapire, R. E. (1996a). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of Thirteenth International Conference, pages 148–156, USA. ACM.
[18] Freund, Y. and Schapire, R. E. (1996b). Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages 325–332.
[19] Schwenk, H. and Bengio, Y. (1998). Training methods for adaptive boosting of neural networks. In M. Jordan, M. Kearns, and S. Solla, editors, Advances in Neural Information Processing Systems 10 (NIPS’97), pages 647–653. MIT Press.

[20] Srivastava, N. (2013). Improving Neural Networks With Dropout. Master’s thesis, U. Toronto.
[21] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15, 1929–1958.
[22] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2014a). Going deeper with convolutions. Technical report, arXiv:1409.4842.
[23] Warde-Farley, D., Goodfellow, I. J., Courville, A., and Bengio, Y. (2014). An empirical analysis of dropout in piecewise linear networks. In ICLR’2014.
[24] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012c). Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580.
[25] Goodfellow, I. J., Mirza, M., Courville, A., and Bengio, Y. (2013b). Multi-prediction deep Boltzmann machines. In NIPS26. NIPS Foundation.
[26] Gal, Y. and Ghahramani, Z. (2015). Bayesian convolutional neural networks with Bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158.
[27] Bayer, J. and Osendorfer, C. (2014). Learning stochastic recurrent networks. ArXiv e-prints.
[28] Pascanu, R., Gülçehre, Ç., Cho, K., and Bengio, Y. (2014a). How to construct deep recurrent neural networks. In ICLR’2014.
[29] Xiong, H. Y., Barash, Y., and Frey, B. J. (2011). Bayesian prediction of tissue-regulated splicing using RNA sequence and cellular context. Bioinformatics, 27(18), 2554–2562.
[30] Neal, R. M. (1996). Bayesian Learning for Neural Networks. Lecture Notes in Statistics. Springer.
[31] Wager, S., Wang, S., and Liang, P. (2013). Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems 26, pages 351–359.
[32] Wang, S. and Manning, C. (2013). Fast dropout training. In ICML’2013.
[33] Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. (2013). Regularization of neural networks using dropconnect. In ICML’2013.

[34] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. (2014b). Intriguing properties of neural networks. ICLR, abs/1312.6199.
[35] Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014b). Explaining and harnessing adversarial examples. CoRR, abs/1412.6572.
[36] Miyato, T., Maeda, S., Koyama, M., Nakae, K., and Ishii, S. (2015). Distributional smoothing with virtual adversarial training. In ICLR. Preprint: arXiv:1507.00677.

### Regularization in Deep Learning

#### What regularization is

Regularization is a family of techniques for preventing overfitting, aimed at improving a model's generalization to unseen data. When a model is too complex it may fit the noise or incidental details of the training data and therefore fail to transfer to new samples; by imposing additional constraints, regularization reduces this risk.

#### Common regularization methods and what they do

1. **L2 regularization (ridge)**
   L2 regularization adds the sum of squared weights to the loss as a penalty term, which suppresses large weight values, makes the learned function smoother, and reduces the chance of overfitting. Concretely, the optimizer is pushed toward weight combinations of smaller magnitude.
2. **L1 regularization**
   Unlike L2, L1 uses the sum of absolute weight values as the extra cost. Besides the basic anti-overfitting effect, it induces sparsity: the coefficients of unimportant features are shrunk all the way to zero.
3. **Dropout**
   Dropout is a stochastic deactivation mechanism designed for neural networks: in each iteration, a fraction of units (and their connections) is temporarily dropped with a given probability. This forces the remaining units to cooperate rather than rely on any single path, making the overall architecture more stable and robust.
4. **Early stopping**
   A simple but effective method that monitors performance on a validation set to decide when to stop training. If further updates start to hurt validation performance, training is stopped and the best parameter configuration seen so far is kept.
5. **Data augmentation**
   Transformations applied to the original inputs generate additional training examples, e.g. image rotations, scaling, and similar preprocessing operations.
6. **Batch normalization**
   It speeds up convergence and improves gradient propagation, and also has an indirect regularizing effect, helping to mitigate the problems caused by internal covariate shift.
7. **Weight decay**
   Weight decay is usually treated as another name for L2 regularization; with plain gradient descent the two are equivalent.
8. **Max-norm constraints**
   The norm of each individual filter (weight vector) in a layer is constrained to stay below a fixed threshold, which limits the effective complexity of the model.
9. **Noise injection**
   Artificial random perturbations are added to activations or hidden-layer outputs, simulating the unavoidable uncertainty of real-world data and encouraging the model to ignore irrelevant variation and focus on the underlying signal.

#### Formulas and Python examples

The following shows the penalty terms of a few typical schemes together with `tf.keras` code; all snippets extend the same `Sequential` model.

- **L2 regularization**
  \[ J(\theta) = Loss(y,\hat{y}) + \lambda ||\theta||_2^2 \]

```python
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

model = models.Sequential()
# Penalize the squared L2 norm of this layer's kernel with coefficient 0.01
model.add(layers.Dense(64, activation='relu', kernel_regularizer=regularizers.l2(0.01)))
```

- **L1 regularization**
  \[ J(\theta) = Loss(y,\hat{y}) + \alpha ||\theta||_1 \]

```python
# Penalize the L1 norm of the kernel, encouraging sparse weights
model.add(layers.Dense(64, activation='relu', kernel_regularizer=regularizers.l1(0.01)))
```

- **Dropout**

```python
# Randomly zero 50% of the previous layer's outputs during training
model.add(layers.Dropout(rate=0.5))
```