Chapter 3: Regularization
1 Addressing Overfitting
| Reduce number of features | Regularization |
| --- | --- |
| Manually select which features to keep | Keep all the features, but reduce the magnitude/values of the parameters $\theta_j$ |
| Model selection algorithm (automatically select) | Works well when we have a lot of features, each of which contributes a bit to predicting $y$ |
| May discard useful information | |
2 Regularization
2.1 Theory
Small values for parameters $\theta_0,\theta_1,\cdots,\theta_n$:
- “Simpler” hypothesis
- Less prone to overfitting
2.2 Model
- $\mathop{\text{minimize}}\limits_{\theta}\ J(\theta)$
$$J(\theta)=\frac{1}{2m}\left[\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2+\lambda\sum_{j=1}^n\theta_j^2\right]$$
- $\lambda$: the regularization parameter
- $\lambda\sum_{j=1}^n\theta_j^2$: shrinks every single parameter $\theta_j$
[Note] The regularization sum starts at $\theta_1$ (in practice, also penalizing $\theta_0$ makes little difference).
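As a quick sanity check on the formula above, here is a minimal NumPy sketch of this cost (the function name is illustrative, and it assumes $X$ already carries a leading bias column of ones):

```python
import numpy as np

def cost_reg_linear(theta, X, y, lam):
    """Regularized linear regression cost J(theta).

    Assumes X has shape (m, n+1) with a leading column of ones,
    theta has length n+1, and theta[0] is not regularized.
    """
    m = len(y)
    errors = X @ theta - y                          # h_theta(x^(i)) - y^(i) for all i
    cost = (errors @ errors) / (2 * m)              # (1/2m) * sum of squared errors
    reg = (lam / (2 * m)) * np.sum(theta[1:] ** 2)  # penalty skips theta_0
    return cost + reg
```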
2.3 Problems caused by setting $\lambda$ too large
- All parameters $\theta_1,\cdots,\theta_n$ are penalized so heavily that they end up close to zero, leaving $h_\theta(x)\approx\theta_0$ (a flat line).
- The algorithm results in underfitting: it fails to fit even the training data well.
3 Regularized Linear Regression
3.1 Gradient Descent
- repeat until convergence {

$$\begin{aligned} \theta_0&:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)x_0^{(i)}\\ \theta_j&:=\theta_j-\alpha\left[\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}+\frac{\lambda}{m}\theta_j\right] \end{aligned}$$

Equivalently: $\theta_j:=\theta_j\left(1-\alpha\frac{\lambda}{m}\right)-\alpha\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}$

($j=1,2,3,\cdots,n$)

}
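A minimal NumPy sketch of this loop, using the "equivalent" shrink-then-step form above (the fixed iteration count in place of a convergence test, and all names, are illustrative assumptions):

```python
import numpy as np

def gradient_descent_reg(X, y, alpha, lam, num_iters=1000):
    """Batch gradient descent for regularized linear regression.

    theta_0 is updated without the regularization term; each other
    theta_j is first shrunk by (1 - alpha * lam / m), then takes the
    usual gradient step. X is assumed to have a leading ones column.
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        errors = X @ theta - y      # h_theta(x^(i)) - y^(i), all examples at once
        grad = (X.T @ errors) / m   # unregularized gradient for every j
        theta[0] -= alpha * grad[0]                                   # no penalty on theta_0
        theta[1:] = theta[1:] * (1 - alpha * lam / m) - alpha * grad[1:]
    return theta
```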
3.2 Normal Equation
$$\theta=\left(X^TX+\lambda\left[\begin{matrix} 0\\ &1\\ &&1\\ &&&\ddots\\ &&&&1 \end{matrix}\right]_{(n+1)\times(n+1)}\right)^{-1}X^Ty$$
- As long as $\lambda>0$, the matrix being inverted is guaranteed to be invertible, so regularization also resolves the non-invertibility problem (e.g. when $m\le n$).
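A corresponding NumPy sketch (the helper name is illustrative; `np.linalg.solve` is used rather than an explicit inverse, which is the standard numerical choice):

```python
import numpy as np

def normal_equation_reg(X, y, lam):
    """Closed-form regularized linear regression.

    L is the (n+1) x (n+1) identity with its top-left entry zeroed,
    so theta_0 goes unpenalized; for lam > 0 the system is solvable.
    """
    L = np.eye(X.shape[1])
    L[0, 0] = 0                                   # do not regularize theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```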
4 Regularized Logistic Regression
4.1 Gradient Descent
- $J(\theta)=-\frac{1}{m}\sum_{i=1}^m\left[y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right]+\frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2$
- repeat until convergence {

$$\begin{aligned} \theta_0&:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)x_0^{(i)}\\ \theta_j&:=\theta_j-\alpha\left[\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}+\frac{\lambda}{m}\theta_j\right] \end{aligned}$$

($j=1,2,3,\cdots,n$; simultaneously update all $\theta_j$)

}
```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def costReg(theta, X, y, learningRate):
    # Regularized logistic cost; here "learningRate" plays the role of lambda.
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
    second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
    reg = (learningRate / (2 * len(X))) * np.sum(np.power(theta[:, 1:theta.shape[1]], 2))
    return np.sum(first - second) / len(X) + reg
```
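For use with a gradient-based optimizer, a matching gradient sketch under the same matrix conventions (the name `gradientReg` is an illustrative addition, not part of the original notes):

```python
def gradientReg(theta, X, y, learningRate):
    # Gradient of the regularized cost above; theta_0's component carries no penalty.
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    error = sigmoid(X * theta.T) - y            # (m, 1) column of h - y
    grad = (X.T * error).T / len(X)             # (1, n+1) unregularized gradient
    grad[0, 1:] = grad[0, 1:] + (learningRate / len(X)) * theta[0, 1:]  # add (lambda/m)*theta_j, j >= 1
    return np.array(grad).ravel()
```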
4.2 Advanced optimization
Octave code:

```octave
function [jVal, gradient] = costFunction(theta)
  jVal = [...code to compute J(theta)...];
  gradient = [...code to compute derivative of J(theta)...];
end

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
```
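The same pattern in Python, as a hedged sketch using `scipy.optimize.minimize` together with the `costReg`/`gradientReg` helpers above (the toy dataset and the choice of the BFGS solver are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy data: m = 3 examples, a bias column plus one feature.
X = np.array([[1.0, 0.5],
              [1.0, 1.5],
              [1.0, 2.5]])
y = np.array([[0.0], [0.0], [1.0]])
initialTheta = np.zeros(X.shape[1])

# 'maxiter' mirrors MaxIter above; jac supplies the analytic gradient.
result = minimize(fun=costReg, x0=initialTheta, args=(X, y, 1.0),
                  jac=gradientReg, method='BFGS', options={'maxiter': 100})
print(result.x)  # optimized theta
```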