7.1 Parameter Norm Penalties
Many regularization approaches are based on limiting the capacity of the model by adding a parameter norm penalty $\Omega(\theta)$ to the objective function $J$:

$$\tilde{J}(\theta;X,y)=J(\theta;X,y)+\alpha\Omega(\theta)$$

where $\alpha\in[0,\infty)$ is a hyperparameter that weights the relative contribution of the norm penalty $\Omega$ to the standard objective function $J$.
In neural networks, the parameters include the weights and biases of the affine transformation at each layer. We typically penalize only the weights and leave the biases unregularized:
- Fitting a bias accurately requires far less data than fitting a weight.
- Each weight specifies how two variables interact, whereas each bias controls only a single variable. Leaving the biases unregularized therefore does not induce much variance.
- Regularizing the bias parameters can introduce significant underfitting.
Therefore, we use the vector $w$ to denote all the weights that should be affected by the norm penalty, while the vector $\theta$ denotes all the parameters, including both $w$ and the unregularized parameters.
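The weights-only convention above can be sketched in a few lines of numpy. This is a minimal illustration, not from the original text: the parameter names (`W1`, `b1`, …) and the `l2_penalty` helper are hypothetical, and weights are distinguished from biases purely by name.

```python
import numpy as np

# Hypothetical two-layer network parameters: weight matrices W1, W2 and biases b1, b2.
rng = np.random.default_rng(0)
params = {
    "W1": rng.normal(size=(4, 3)),
    "b1": np.zeros(3),
    "W2": rng.normal(size=(3, 2)),
    "b2": np.zeros(2),
}

def l2_penalty(params, alpha):
    """Compute alpha/2 * sum of squared weights; biases are left unregularized."""
    return 0.5 * alpha * sum(
        np.sum(p ** 2) for name, p in params.items() if name.startswith("W")
    )

# The penalty depends only on the weight matrices; changing a bias leaves it unchanged.
penalty = l2_penalty(params, alpha=0.1)
```

Here $\theta$ corresponds to the full `params` dict, while $w$ corresponds to the entries the penalty actually touches.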
7.1.1 $L^2$ Parameter Regularization

The $L^2$ parameter norm penalty, commonly known as weight decay, adds the regularization term $\Omega(\theta)=\frac{1}{2}\|w\|_2^2$ to the objective function, driving the weights closer to the origin.
We can gain some insight into the behavior of weight-decay regularization by studying the gradient of the regularized objective function. Assuming no bias parameter, the regularized objective is:

$$\tilde{J}(w;X,y)=\frac{\alpha}{2}w^{\top}w+J(w;X,y)$$

with the corresponding gradient:

$$\nabla_w\tilde{J}(w;X,y)=\alpha w+\nabla_wJ(w;X,y)$$

A single gradient-descent step updates the weights as:

$$w\leftarrow w-\epsilon(\alpha w+\nabla_wJ(w;X,y))$$

Written another way:

$$w\leftarrow(1-\epsilon\alpha)w-\epsilon\nabla_wJ(w;X,y)$$

We can see that adding weight decay modifies the learning rule: on each step, the weight vector is first shrunk, i.e. multiplied by the constant factor $(1-\epsilon\alpha)$ with $\epsilon\alpha<1$, before the usual gradient update is performed.
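The algebraic equivalence of the two update forms above is easy to check numerically. A minimal sketch, using a random vector as a stand-in for $\nabla_w J(w;X,y)$:

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=5)
grad_J = rng.normal(size=5)   # stand-in for the data-dependent gradient ∇_w J(w; X, y)
eps, alpha = 0.01, 0.1        # learning rate and weight-decay coefficient

# Direct form: w <- w - eps * (alpha * w + ∇_w J)
w_direct = w - eps * (alpha * w + grad_J)

# Shrink-then-step form: w <- (1 - eps*alpha) * w - eps * ∇_w J
w_shrink = (1 - eps * alpha) * w - eps * grad_J

# Both forms produce the same update.
assert np.allclose(w_direct, w_shrink)
```

The second form makes the mechanism explicit: the weights are scaled by $1-\epsilon\alpha=0.999$ before the ordinary gradient step.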
1. Let $w^*$ be the value of the weights that minimizes the unregularized objective:

$$w^*=\arg\min_wJ(w)$$

and consider a quadratic approximation $\hat{J}(\theta)$ to the objective function in the neighborhood of $w^*$.
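For concreteness, the quadratic approximation referred to here is the standard second-order Taylor expansion around $w^*$; because $w^*$ is a minimum of $J$, the gradient (first-order) term vanishes:

```latex
\hat{J}(\theta) = J(w^*) + \frac{1}{2}(w - w^*)^{\top} H (w - w^*)
```

where $H$ is the Hessian matrix of $J$ with respect to $w$, evaluated at $w^*$. Since $w^*$ is a minimum, $H$ is positive semidefinite.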