Introduction
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Supervised learning: “right answers” given
- Regression: predict continuous valued output. (Housing price prediction)
- Classification: predict discrete valued output (0 or 1). (Breast cancer: malignant or benign)
Unsupervised learning: clustering
Applications: organizing computing clusters, market segmentation, social network analysis, astronomical data analysis.
Cocktail party problem: separating mixed audio recordings.
Algorithm (one line of Octave):
[W, s, v] = svd((repmat(sum(x.*x, 1), size(x, 1), 1) .* x) * x');
Suggested tool: Octave.
Linear Regression with One Variable
Notation:
- m = Number of training examples
- x’s = “input” variable/features
- y’s = “output” variable/“target” variable
- (x,y) = one training example
- $(x^{(i)}, y^{(i)})$ = the i-th training example
Training Set (m = 47):

Size in feet² (x) | Price ($) in 1000's (y) |
---|---|
2104 | 460 |
1416 | 232 |
1534 | 315 |
852 | 178 |
… | … |
Hypothesis: $h_{\theta}\left( x \right) = \theta_0 + \theta_1 x$
Cost function (squared error cost function):
$J\left( \theta_0, \theta_1 \right) = \frac{1}{2m}\sum_{i=1}^m{\left( h_{\theta}\left( x^{(i)} \right) - y^{(i)} \right)^2}$
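As a reference, here is a minimal Octave sketch of this cost function (the function name computeCost and the argument layout are illustrative, not from these notes):

```octave
% Squared error cost for univariate linear regression.
% x, y: m-by-1 vectors of inputs and targets; theta = [theta0; theta1].
function J = computeCost(x, y, theta)
  m = length(y);                         % number of training examples
  h = theta(1) + theta(2) * x;           % h_theta(x^(i)) for every example
  J = (1 / (2 * m)) * sum((h - y) .^ 2);
end
```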
Goal: $\underset{\theta_0, \theta_1}{\min} J\left( \theta_0, \theta_1 \right)$
Simplified case: $\theta_0 = 0$.
Gradient descent
repeat until convergence {
$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J\left( \theta_0, \theta_1 \right)$ (for j = 0 and j = 1)
}
$\alpha$ is the learning rate; if $\alpha$ is too small, gradient descent can be slow.
Simultaneous update:
$temp0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J\left( \theta_0, \theta_1 \right)$
$temp1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J\left( \theta_0, \theta_1 \right)$
$\theta_0 := temp0$
$\theta_1 := temp1$
Gradient descent can converge to a local minimum (where the slope is 0), even with the learning rate $\alpha$ fixed.
As we approach a local minimum, gradient descent will automatically take smaller steps (the derivative gradually shrinks), so there is no need to decrease $\alpha$ over time.
$\frac{\partial}{\partial \theta_j} J\left( \theta_0, \theta_1 \right) = \frac{\partial}{\partial \theta_j}\left[ \frac{1}{2m}\sum_{i=1}^m{\left( h_{\theta}\left( x^{(i)} \right) - y^{(i)} \right)^2} \right] = \frac{\partial}{\partial \theta_j}\frac{1}{2m}\sum_{i=1}^m{\left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2}$
j = 0: $\frac{\partial}{\partial \theta_0} J\left( \theta_0, \theta_1 \right) = \frac{1}{m}\sum_{i=1}^m{\left( h_{\theta}\left( x^{(i)} \right) - y^{(i)} \right)}$
j = 1: $\frac{\partial}{\partial \theta_1} J\left( \theta_0, \theta_1 \right) = \frac{1}{m}\sum_{i=1}^m{\left( h_{\theta}\left( x^{(i)} \right) - y^{(i)} \right) \cdot x^{(i)}}$
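Putting the two partial derivatives together, a minimal Octave sketch of batch gradient descent with a simultaneous update could look like this (names such as gradientDescentUni, alpha and num_iters are my own, not from the course material):

```octave
% Batch gradient descent for h_theta(x) = theta0 + theta1 * x.
% x, y: m-by-1 vectors; theta = [theta0; theta1]; alpha: learning rate.
function theta = gradientDescentUni(x, y, theta, alpha, num_iters)
  m = length(y);
  for iter = 1:num_iters
    h = theta(1) + theta(2) * x;                            % current predictions
    temp0 = theta(1) - alpha * (1 / m) * sum(h - y);        % j = 0 update
    temp1 = theta(2) - alpha * (1 / m) * sum((h - y) .* x); % j = 1 update
    theta = [temp0; temp1];                                 % simultaneous update
  end
end
```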
Linear Regression with Multiple Variables
Linear regression with multiple variables (multiple features).
Notation:
- n = number of features
- $x^{(i)}$ = input (features) of the i-th training example (an n×1 column vector)
- $x_j^{(i)}$ = value of feature j in the i-th training example
$h_{\theta}\left( x \right) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$
Define $x_0 = 1$ (i.e., $x_0^{(i)} = 1$).
$x = \left[ \begin{array}{c} x_0\\ x_1\\ \vdots\\ x_n\\ \end{array} \right] \in \mathbb{R}^{n+1}, \quad \theta = \left[ \begin{array}{c} \theta_0\\ \theta_1\\ \vdots\\ \theta_n\\ \end{array} \right] \in \mathbb{R}^{n+1}$
$h_{\theta}\left( x \right) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^T x$
Cost function: $J\left( \theta_0, \theta_1, \cdots, \theta_n \right) = \frac{1}{2m}\sum_{i=1}^m{\left( h_{\theta}\left( x^{(i)} \right) - y^{(i)} \right)^2}$
Gradient descent:
Repeat {
$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J\left( \theta_0, \theta_1, \cdots, \theta_n \right)$
} (simultaneously update for every j = 0, …, n)
New algorithm (n ≥ 1):
Repeat {
$\theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i=1}^m{\left( h_{\theta}\left( x^{(i)} \right) - y^{(i)} \right) \cdot x_j^{(i)}}$
} (simultaneously update for every j = 0, …, n)
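In vectorized form this update rule can be sketched in Octave as follows (assuming X is the m×(n+1) design matrix with a leading column of ones; the function name is illustrative):

```octave
% Vectorized batch gradient descent for multivariate linear regression.
% X: m-by-(n+1) design matrix (first column all ones); y: m-by-1; theta: (n+1)-by-1.
function theta = gradientDescentMulti(X, y, theta, alpha, num_iters)
  m = length(y);
  for iter = 1:num_iters
    h = X * theta;                                 % hypothesis for all examples
    theta = theta - (alpha / m) * (X' * (h - y));  % updates every theta_j at once
  end
end
```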
Feature Scaling
Idea: Make sure features are on a similar scale.
E.g. $x_1$ = size (0-2000 feet²), $x_2$ = number of bedrooms (1-5).
(Contours of the cost function.) Without scaling, gradient descent is slow: it oscillates back and forth and takes a long time to find a path to the global minimum.
With feature scaling:
$x_1 = \frac{size\left( feet^2 \right)}{2000}, \quad x_2 = \frac{number\,\,of\,\,bedrooms}{5}$
$0 \leqslant x_1, x_2 \leqslant 1$
Get every feature into approximately a $-1 \leqslant x_i \leqslant 1$ range.
$0 \leqslant x_1 \leqslant 3$ √
$-100 \leqslant x_3 \leqslant 100$ ×
$-2 \leqslant x_2 \leqslant 0.5$ √
$-0.0001 \leqslant x_4 \leqslant 0.0001$ ×
Mean normalization
Replace $x_i$ with $x_i - \mu_i$ to make features have approximately zero mean. (Do not apply to $x_0 = 1$.)
E.g. $x_1 = \frac{size - 1000}{2000}, \quad x_2 = \frac{\#bedrooms - 2}{5}, \quad -0.5 \leqslant x_1, x_2 \leqslant 0.5$
Numerator: subtract the average value of $x_1$ over the training set.
Denominator: the range (max - min) or the standard deviation.
The denominator for $x_2$ could also be 4; it does not need to be exact.
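A possible Octave sketch of mean normalization (dividing by the standard deviation here; dividing by the range max - min would also match the notes; featureNormalize is an illustrative name):

```octave
% Mean-normalize every column of X (do not include the x0 = 1 column).
% Returns mu and sigma as well, so the same scaling can be applied to new examples.
function [X_norm, mu, sigma] = featureNormalize(X)
  mu = mean(X);                                    % 1-by-n per-feature means
  sigma = std(X);                                  % 1-by-n per-feature standard deviations
  X_norm = (X - repmat(mu, size(X, 1), 1)) ./ repmat(sigma, size(X, 1), 1);
end
```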
For sufficiently small $\alpha$, $J(\theta)$ should decrease on every iteration. But if $\alpha$ is too small, gradient descent can be slow to converge.
Summary:
- If $\alpha$ is too small: convergence is slow.
- If $\alpha$ is too large: $J(\theta)$ may not decrease on every iteration and may not converge. (Slow convergence is also possible.)
To choose $\alpha$, try …, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …
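One way to compare these candidate values in Octave is to run a few iterations for each and plot $J(\theta)$ against the iteration number (a sketch assuming X, y and m are already defined; the variable names are illustrative):

```octave
% Try several learning rates and record J(theta) after every iteration,
% then plot the curves: J should decrease steadily for a good alpha.
alphas = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1];
num_iters = 100;
J_history = zeros(num_iters, numel(alphas));
for k = 1:numel(alphas)
  theta = zeros(size(X, 2), 1);
  for iter = 1:num_iters
    theta = theta - (alphas(k) / m) * (X' * (X * theta - y));        % gradient step
    J_history(iter, k) = (1 / (2 * m)) * sum((X * theta - y) .^ 2);  % cost after the step
  end
end
plot(1:num_iters, J_history);            % one curve per learning rate
xlabel('iteration'); ylabel('J(\theta)');
```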
Normal equation: a method to solve for $\theta$ analytically (no need for feature scaling).
$\theta \in \mathbb{R}^{n+1}$
$J\left( \theta_0, \theta_1, \cdots, \theta_n \right) = \frac{1}{2m}\sum_{i=1}^m{\left( h_{\theta}\left( x^{(i)} \right) - y^{(i)} \right)^2}$
Set $\frac{\partial}{\partial \theta_j} J\left( \theta \right) = \cdots = 0$ (for every j)
Solve for $\theta_0, \theta_1, \cdots, \theta_n$.
Example: m = 4

(x0 = 1) | Size (feet²) | Number of bedrooms | Number of floors | Age of home (years) | Price ($1000) |
---|---|---|---|---|---|
x0 | x1 | x2 | x3 | x4 | y |
1 | 2104 | 5 | 1 | 45 | 460 |
1 | 1416 | 3 | 2 | 40 | 232 |
1 | 1534 | 3 | 2 | 30 | 315 |
1 | 852 | 2 | 1 | 36 | 178 |
$X = \left[ \begin{matrix} 1& 2104& 5& 1& 45\\ 1& 1416& 3& 2& 40\\ 1& 1534& 3& 2& 30\\ 1& 852& 2& 1& 36\\ \end{matrix} \right], \quad y = \left[ \begin{array}{c} 460\\ 232\\ 315\\ 178\\ \end{array} \right]$
$\theta = (X^TX)^{-1}X^Ty$ is the $\theta$ that minimizes the cost function. (For a proof, see 西瓜书.)
m examples $(x^{(1)}, y^{(1)}), \cdots, (x^{(m)}, y^{(m)})$; n features.
$x^{(i)} = \left[ \begin{array}{c} x_{0}^{(i)}\\ x_{1}^{(i)}\\ \vdots\\ x_{n}^{(i)}\\ \end{array} \right] \in \mathbb{R}^{n+1}$
design matrix $X = \left[ \begin{array}{c} \left( x^{(1)} \right)^T\\ \left( x^{(2)} \right)^T\\ \vdots\\ \left( x^{(m)} \right)^T\\ \end{array} \right]$
Octave: pinv(X'*X)*X'*y % pinv is the pseudo-inverse function
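For instance, on the m = 4 example above, a minimal Octave sketch would be:

```octave
% Normal equation on the 4-example housing table above.
% pinv is used instead of inv so this still works if X'*X happens to be singular.
X = [1 2104 5 1 45;
     1 1416 3 2 40;
     1 1534 3 2 30;
     1  852 2 1 36];
y = [460; 232; 315; 178];
theta = pinv(X' * X) * X' * y   % theta that minimizes J(theta)
```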
Gradient Descent | Normal Equation |
---|---|
(1) Need to choose $\alpha$ | (1) No need to choose $\alpha$ |
(2) Needs many iterations | (2) No need to iterate |
(3) Works well even when n is large | (3) Need to compute $(X^TX)^{-1}$ |
 | (4) Slow if n is very large |
e.g. $n = 10^6$ | e.g. n = 100 or 1000 |

($\gets$ from around n = 10,000, the normal equation becomes slow and gradient descent is preferred.)
What if $X^TX$ is non-invertible?
(1) Redundant features (linearly dependent). E.g. $x_1$ = size in feet², $x_2$ = size in m²; then $x_1 = (3.28)^2 x_2$.
(2) Too many features (e.g. m ≤ n).
Delete some features or use regularization (later).
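As an illustrative check (my own toy example, reusing y from the table above): a linearly dependent feature makes $X^TX$ singular, although pinv still returns a solution; deleting the redundant feature is the cleaner fix.

```octave
% x2 is just x1 converted from feet^2 to m^2, so the columns of X are linearly dependent.
x1 = [2104; 1416; 1534; 852];     % size in feet^2
x2 = x1 / 3.28^2;                 % same size in m^2 -> redundant feature
X  = [ones(4, 1), x1, x2];
y  = [460; 232; 315; 178];
rank(X' * X)                      % 2 < 3, so X' * X is not invertible
theta = pinv(X' * X) * X' * y     % pinv still works; better: drop x2 or regularize
```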