Notes on Machine Learning By Andrew Ng (4)
Click here to check former notes.
Regularization
The problem of overfitting
Addressing overfitting
Options:
- Reduce the number of features
  - Manually select which features to keep.
  - Model selection algorithm (later in course).
  - Downside: may abandon some useful information.
- Regularization
  - Keep all the features, but reduce the magnitude/values of the parameters $\theta_j$.
  - Works well when we have a lot of features, each of which contributes a bit to predicting $y$.
Cost Function
Regularization
Small values for the parameters $\theta_j$, $1 \leq j \leq n$:
- “Simpler” hypothesis
- Less prone to overfitting
$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^n \theta_j^2\right]$$

Regularization term: $\lambda \sum_{j=1}^n \theta_j^2$

Regularization parameter: $\lambda$, which controls a trade-off between two different goals.

- The first goal, captured by the first term of the objective, $\sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$, is to fit the training data well.
- The second goal is to keep the parameters small, which helps avoid overfitting.

If $\lambda$ is set too large, then $\theta_j \approx 0$ for $1 \leq j \leq n$, which leads to underfitting.
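The cost function above can be written as a short NumPy sketch. This is a minimal illustration, not code from the course; the function name `regularized_cost` and the array shapes are my own assumptions ($X$ is the $m \times (n+1)$ design matrix whose first column is all ones, and $\theta_0$ is not penalized, following the convention above).

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized linear-regression cost J(theta).

    theta: (n+1,) parameters, X: (m, n+1) design matrix with a
    leading column of ones, y: (m,) targets, lam: lambda.
    theta[0] is excluded from the penalty.
    """
    m = len(y)
    residual = X @ theta - y
    fit_term = residual @ residual           # sum of squared errors
    reg_term = lam * np.sum(theta[1:] ** 2)  # skip theta_0
    return (fit_term + reg_term) / (2 * m)
```

Setting `lam = 0` recovers the ordinary least-squares cost, which is a quick sanity check on the implementation.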
Regularized linear regression
Gradient descent
Modify the update rule as follows.
Repeat {

$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_0^{(i)}$$

$$\theta_j := \theta_j - \alpha \left[\frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} + \frac{\lambda}{m}\theta_j\right] \quad (j = 1, 2, \cdots, n)$$

}

$$\rightarrow \theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$
Since $1 - \alpha\frac{\lambda}{m} < 1$ (and in practice close to 1), every update first shrinks $\theta_j$ slightly toward zero, then applies the usual gradient step.
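One iteration of the rewritten update can be sketched in vectorized NumPy. This is my own illustration of the rule above, not course code; note how the shrink factor $1 - \alpha\lambda/m$ is applied to every parameter except $\theta_0$.

```python
import numpy as np

def gradient_step(theta, X, y, alpha, lam):
    """One step of regularized gradient descent for linear regression.

    Implements theta_j := theta_j * (1 - alpha*lam/m)
                          - (alpha/m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i),
    with theta_0 left unpenalized (shrink factor 1).
    """
    m = len(y)
    grad = X.T @ (X @ theta - y) / m        # unregularized gradient
    shrink = np.full_like(theta, 1.0 - alpha * lam / m)
    shrink[0] = 1.0                         # do not shrink theta_0
    return theta * shrink - alpha * grad
```

With `lam = 0` this reduces to the plain gradient-descent step, matching the unregularized update.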
Normal equation
Modify it.
$$\theta = \left(X^T X + \lambda \begin{bmatrix} 0 \\ & 1 \\ && 1 \\ &&& \ddots \\ &&&& 1 \end{bmatrix}\right)^{-1} X^T y$$

where the matrix is $(n+1) \times (n+1)$, with a $0$ in the top-left entry so that $\theta_0$ is not penalized.
Non-invertibility
If $m \leq n$, where $m$ is the number of examples and $n$ is the number of features, then $X^T X$ is a singular matrix. But once the $\lambda$ term is added (with $\lambda > 0$), the matrix becomes invertible.
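The closed-form solution above can be sketched with NumPy. This is my own illustration (the name `normal_equation` is not from the course); it uses `np.linalg.solve` rather than forming an explicit inverse, which is the standard numerically safer choice.

```python
import numpy as np

def normal_equation(X, y, lam):
    """Closed-form regularized linear regression.

    Solves (X^T X + lam * L) theta = X^T y, where L is the
    (n+1)x(n+1) identity with its (0, 0) entry zeroed so that
    theta_0 is not penalized.
    """
    n_plus_1 = X.shape[1]
    L = np.eye(n_plus_1)
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```

With `lam = 0` and well-conditioned data this reproduces ordinary least squares; with `lam > 0` the added matrix keeps the system solvable even when $m \leq n$.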
Regularized logistic regression
Just like linear regression, we add a term to the cost function to penalize the parameters $\theta_j$.
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^m \left[ y^{(i)} \log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2$$
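The regularized logistic cost can also be sketched in NumPy. This is my own illustration, following the common convention of scaling the penalty by $\frac{1}{2m}$ and excluding $\theta_0$; the names `sigmoid` and `logistic_cost` are assumptions, not course code.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, the hypothesis of logistic regression."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y, lam):
    """Regularized logistic-regression cost.

    Cross-entropy averaged over m examples, plus
    (lam / (2m)) * sum_j theta_j^2 with theta_0 excluded.
    """
    m = len(y)
    h = sigmoid(X @ theta)
    cross_entropy = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    penalty = lam * np.sum(theta[1:] ** 2) / (2 * m)
    return cross_entropy + penalty
```

At `theta = 0` the hypothesis outputs $0.5$ for every example, so the unregularized cost is exactly $\log 2$ — a handy sanity check.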
Annotation
It has to be clear that although logistic and linear regression end up with the same-looking update equation in gradient descent, they are two very different algorithms: each has its own hypothesis ($h_\theta(x) = \theta^T x$ for linear regression, $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$ for logistic regression). They look alike only because each cost function is matched to its hypothesis, and differentiating each cost happens to yield the same form of gradient update.
Click here to see the following note.