Andrew Ng Machine Learning Notes (1), by LKP

Introduction

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

Supervised learning: “right answers” given

  • Regression: predict a continuous-valued output (e.g., housing price prediction).
  • Classification: predict a discrete-valued output such as 0 or 1 (e.g., breast cancer: malignant vs. benign).

Unsupervised learning: clustering
     Applications: organizing computing clusters, market segmentation, social network analysis, astronomical data analysis.

Cocktail party problem: separating the individual sources from mixed audio recordings.
Algorithm (one line of Octave):
[W,s,v] = svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');

Recommended tool: Octave
 

Univariate Linear Regression

Notation:

  • m = Number of training examples
  • x’s = “input” variable/features
  • y’s = “output” variable/“target” variable
  • (x,y) = one training example
  • $(x^{(i)}, y^{(i)})$ = the $i$-th training example

Training set (m = 47; first four examples):

| Size in feet² (x) | Price in \$1000s (y) |
|---|---|
| 2104 | 460 |
| 1416 | 232 |
| 1534 | 315 |
| 852 | 178 |

Hypothesis: $h_{\theta}(x) = \theta_0 + \theta_1 x$
Cost function (squared error cost function): $J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left( h_{\theta}(x^{(i)}) - y^{(i)} \right)^2$
Goal: $\underset{\theta_0, \theta_1}{\min}\, J(\theta_0, \theta_1)$

Simplified: $\theta_0 = 0$
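
As a rough illustration of the cost function, the sketch below evaluates the squared error cost on the four training examples listed in these notes, for the simplified case with the intercept fixed at zero. The function names `hypothesis` and `squared_error_cost` and the candidate slope values are made up for this example.

```python
def hypothesis(theta0, theta1, x):
    """h_theta(x) = theta0 + theta1 * x"""
    return theta0 + theta1 * x


def squared_error_cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = 1/(2m) * sum of squared prediction errors."""
    m = len(xs)
    total = sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)


# The four examples from the training set: size in feet^2, price in $1000s.
sizes = [2104, 1416, 1534, 852]
prices = [460, 232, 315, 178]

# Simplified case theta0 = 0: J becomes a function of theta1 alone.
# The candidate slopes below are arbitrary values chosen for illustration.
for theta1 in (0.0, 0.1, 0.2, 0.3):
    print(theta1, squared_error_cost(0.0, theta1, sizes, prices))
```

Among these candidates the cost is smallest near a slope of 0.2, which is the kind of behaviour the minimization goal is after.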

Gradient descent
repeat until convergence {
      $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$   (for j = 0 and j = 1)
}
$\alpha$ is the learning rate. If $\alpha$ is too small, gradient descent can be slow.

Simultaneous update:
$\text{temp0} := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
$\text{temp1} := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$
$\theta_0 := \text{temp0}$
$\theta_1 := \text{temp1}$
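
To see why the temporary variables matter, here is a minimal sketch that compares one simultaneous step with one incorrect sequential step. It uses the partial derivatives of the squared error cost that these notes derive further on; the function names, the toy data, and the learning rate are assumptions made for illustration.

```python
def partials(theta0, theta1, xs, ys):
    """Partial derivatives of the squared error cost w.r.t. theta0 and theta1."""
    m = len(xs)
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    d0 = sum(errors) / m
    d1 = sum(e * x for e, x in zip(errors, xs)) / m
    return d0, d1


def simultaneous_step(theta0, theta1, alpha, xs, ys):
    """Correct: both partials are evaluated at the OLD parameters."""
    d0, d1 = partials(theta0, theta1, xs, ys)
    temp0 = theta0 - alpha * d0
    temp1 = theta1 - alpha * d1
    return temp0, temp1


def sequential_step(theta0, theta1, alpha, xs, ys):
    """Incorrect: theta0 is overwritten before theta1's partial is computed."""
    d0, _ = partials(theta0, theta1, xs, ys)
    theta0 = theta0 - alpha * d0
    _, d1 = partials(theta0, theta1, xs, ys)  # already sees the new theta0
    theta1 = theta1 - alpha * d1
    return theta0, theta1


# Toy data (hypothetical): points on the line y = x.
xs, ys = [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]
print(simultaneous_step(0.0, 0.0, 0.1, xs, ys))
print(sequential_step(0.0, 0.0, 0.1, xs, ys))
```

The two variants agree on the first parameter but not on the second, so the sequential version is a different algorithm rather than a harmless shortcut.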

Gradient descent can converge to a local minimum (where the slope is 0) even with the learning rate $\alpha$ fixed.

As we approach a local minimum, gradient descent will automatically take smaller steps, because the magnitude of the derivative gradually shrinks. So there is no need to decrease $\alpha$ over time.
 
$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j}\left[ \frac{1}{2m}\sum_{i=1}^{m}\left( h_{\theta}(x^{(i)}) - y^{(i)} \right)^2 \right] = \frac{\partial}{\partial \theta_j} \frac{1}{2m}\sum_{i=1}^{m}\left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2$

$j = 0$: $\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m}\sum_{i=1}^{m}\left( h_{\theta}(x^{(i)}) - y^{(i)} \right)$

$j = 1$: $\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m}\sum_{i=1}^{m}\left( h_{\theta}(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}$
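
Putting the two partial derivatives into the update rule gives batch gradient descent for one variable. The sketch below is one possible implementation under stated assumptions: the toy data lie exactly on y = 1 + 2x, and the learning rate and iteration count are picked by hand. It also records the length of each step to illustrate that steps shrink with a fixed learning rate.

```python
def gradient_descent(xs, ys, alpha, iterations):
    """Batch gradient descent for h(x) = theta0 + theta1 * x, starting from (0, 0)."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    step_sizes = []
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                             # the j = 0 partial
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m  # the j = 1 partial
        # Tuple assignment performs the simultaneous update.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
        step_sizes.append(alpha * (grad0 ** 2 + grad1 ** 2) ** 0.5)
    return theta0, theta1, step_sizes


# Hypothetical toy data generated from y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
theta0, theta1, steps = gradient_descent(xs, ys, alpha=0.05, iterations=5000)
print(theta0, theta1)
print(steps[0], steps[-1])
```

The last step is many orders of magnitude shorter than the first even though the learning rate never changed, because the gradient itself goes to zero near the minimum.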
 

Multivariate Linear Regression

Linear regression with multiple variables (multiple features)
Notation:

  • n = number of features
  • $x^{(i)}$ = input (features) of the $i$-th training example (a column vector, $n \times 1$)
  • $x_j^{(i)}$ = value of feature $j$ in the $i$-th training example

$h_{\theta}(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$
Define $x_0 = 1$ (that is, $x_0^{(i)} = 1$ for every example).
$x = \left[ \begin{array}{c} x_0\\ x_1\\ \vdots\\ x_n\\ \end{array} \right] \in \mathbb{R}^{n+1}$,    $\theta = \left[ \begin{array}{c} \theta_0\\ \theta_1\\ \vdots\\ \theta_n\\ \end{array} \right] \in \mathbb{R}^{n+1}$
$h_{\theta}(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^T x$
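
The compact form of the hypothesis is just a dot product once the constant feature is prepended. A minimal sketch follows; the helper name `predict` and the parameter values are hypothetical and not fitted to any data.

```python
def predict(theta, features):
    """Evaluate the hypothesis as a dot product after prepending x0 = 1."""
    x = [1.0] + list(features)
    if len(x) != len(theta):
        raise ValueError("theta must have one more entry than there are features")
    return sum(t * xi for t, xi in zip(theta, x))


# Hypothetical parameters: intercept, weight for size, weight for bedrooms.
theta = [80.0, 0.1, 20.0]
print(predict(theta, [2104, 5]))  # 80 + 0.1 * 2104 + 20 * 5
```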

Cost function: $J(\theta_0, \theta_1, \cdots, \theta_n) = \frac{1}{2m}\sum_{i=1}^{m}\left( h_{\theta}(x^{(i)}) - y^{(i)} \right)^2$

Gradient descent
Repeat {
      $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1, \cdots, \theta_n)$
}           (simultaneously update for every j = 0, …, n)

New algorithm (n ≥ 1)
Repeat {
      $\theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left( h_{\theta}(x^{(i)}) - y^{(i)} \right) \cdot x_j^{(i)}$
}           (simultaneously update for every j = 0, …, n)
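
The update for every j can be sketched in plain Python as below. This assumes each row of the design matrix already starts with the constant feature 1; the function name, the toy data (generated from y = 1 + 2·x1 + 3·x2), and the hyperparameters are illustrative choices, not part of the original notes.

```python
def gradient_descent_multi(X, y, alpha, iterations):
    """Batch gradient descent for linear regression with n features.

    X: list of rows, each row being [1, x1, ..., xn]; y: list of targets.
    """
    m, num_params = len(X), len(X[0])
    theta = [0.0] * num_params
    for _ in range(iterations):
        # Prediction errors h(x_i) - y_i, all computed from the current theta.
        errors = [sum(t * xj for t, xj in zip(theta, row)) - yi
                  for row, yi in zip(X, y)]
        # One partial derivative per parameter j.
        gradient = [sum(e * row[j] for e, row in zip(errors, X)) / m
                    for j in range(num_params)]
        # Building a new list updates every theta_j simultaneously.
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
    return theta


# Hypothetical toy data from y = 1 + 2*x1 + 3*x2.
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0], [1.0, 2.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0, 8.0]
theta = gradient_descent_multi(X, y, alpha=0.1, iterations=5000)
print(theta)
```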

Feature Scaling
Idea: make sure features are on a similar scale.
E.g. $x_1$ = size (0-2000 feet²)
       $x_2$ = number of bedrooms (1-5)

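(Figure: contour plot of $J(\theta)$)

When the features are on very different scales, gradient descent is slow: it oscillates back and forth and takes a long time to find its way to the global minimum.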

With feature scaling: $x_1 = \frac{\text{size (feet}^2\text{)}}{2000}$, $x_2 = \frac{\text{number of bedrooms}}{5}$, so that $0 \leqslant x_1, x_2 \leqslant 1$.
Get every feature into approximately a $-1 \leqslant x_i \leqslant 1$ range. Examples:
  • $0 \leqslant x_1 \leqslant 3$: acceptable (√)
  • $-2 \leqslant x_2 \leqslant 0.5$: acceptable (√)
  • $-100 \leqslant x_3 \leqslant 100$: range too large (×)
  • $-0.0001 \leqslant x_4 \leqslant 0.0001$: range too small (×)

Mean normalization
Replace $x_i$ with $x_i - \mu_i$ so that features have approximately zero mean. (Do not apply this to $x_0 = 1$.)
E.g. $x_1 = \frac{\text{size} - 1000}{2000}$, $x_2 = \frac{\#\text{bedrooms} - 2}{5}$, giving $-0.5 \leqslant x_1, x_2 \leqslant 0.5$.
Numerator: the value subtracted, $\mu_i$, is the average value of $x_i$ in the training set.
Denominator: the range (max - min) or the standard deviation.
For $x_2$ the denominator could also be 4; it does not need to be exact.
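
Mean normalization with the range as the denominator can be sketched as follows. The helper name `mean_normalize` is made up, and the input is the size column of the four training examples in these notes; dividing by the standard deviation instead would be an equally valid choice.

```python
def mean_normalize(values):
    """Return (v - mean) / (max - min) for each value in one feature column."""
    mu = sum(values) / len(values)
    value_range = max(values) - min(values)
    if value_range == 0:
        raise ValueError("feature is constant; do not normalize it (e.g. x0 = 1)")
    return [(v - mu) / value_range for v in values]


sizes = [2104, 1416, 1534, 852]  # size in feet^2 from the training examples
scaled = mean_normalize(sizes)
print(scaled)
```

After this transformation the column has mean zero up to rounding and spans a range of exactly one, so it sits comfortably inside the recommended interval.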

For sufficiently small $\alpha$, $J(\theta)$ should decrease on every iteration. But if $\alpha$ is too small, gradient descent can be slow to converge.
Summary:

  • If $\alpha$ is too small: slow convergence.
  • If $\alpha$ is too large: $J(\theta)$ may not decrease on every iteration and may not converge. (Slow convergence is also possible.)

To choose $\alpha$, try …, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …
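
One way to act on this advice is to run a few iterations for each candidate and check whether the cost went down every time. The sketch below does that for the single-variable case; the toy data, the iteration count, and the helper names are assumptions for illustration only.

```python
def cost(theta0, theta1, xs, ys):
    """Squared error cost for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def cost_history(xs, ys, alpha, iterations):
    """Run gradient descent from (0, 0) and record the cost after every iteration."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    history = [cost(theta0, theta1, xs, ys)]
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
        history.append(cost(theta0, theta1, xs, ys))
    return history


def decreases_every_iteration(history):
    """True if each recorded cost is strictly smaller than the previous one."""
    return all(later < earlier for earlier, later in zip(history, history[1:]))


# Hypothetical toy data from y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
for alpha in (0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0):
    history = cost_history(xs, ys, alpha, 50)
    print(alpha, decreases_every_iteration(history), history[-1])
```

On this data the small rates decrease the cost steadily but slowly, while the largest ones make it grow, which matches the summary above. Plotting the recorded history against the iteration number gives the usual diagnostic curve.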
Normal equation: a method to solve for $\theta$ analytically. (No need for feature scaling.)

$\theta \in \mathbb{R}^{n+1}$,     $J(\theta_0, \theta_1, \cdots, \theta_n) = \frac{1}{2m}\sum_{i=1}^{m}\left( h_{\theta}(x^{(i)}) - y^{(i)} \right)^2$
                    $\frac{\partial}{\partial \theta_j} J(\theta) = \cdots = 0$ (for every $j$)
Solve for $\theta_0, \theta_1, \cdots, \theta_n$.

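Example: m = 4

| $x_0$ | Size (feet²) $x_1$ | Number of bedrooms $x_2$ | Number of floors $x_3$ | Age of home (years) $x_4$ | Price (\$1000) $y$ |
|---|---|---|---|---|---|
| 1 | 2104 | 5 | 1 | 45 | 460 |
| 1 | 1416 | 3 | 2 | 40 | 232 |
| 1 | 1534 | 3 | 2 | 30 | 315 |
| 1 | 852 | 2 | 1 | 36 | 178 |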

$X = \left[ \begin{matrix} 1 & 2104 & 5 & 1 & 45\\ 1 & 1416 & 3 & 2 & 40\\ 1 & 1534 & 3 & 2 & 30\\ 1 & 852 & 2 & 1 & 36\\ \end{matrix} \right]$,      $y = \left[ \begin{array}{c} 460\\ 232\\ 315\\ 178\\ \end{array} \right]$

$\theta = (X^TX)^{-1}X^Ty$ is the $\theta$ that minimizes the cost function. For the proof, see the "Watermelon Book" (西瓜书, Zhou Zhihua's Machine Learning).

m examples $(x^{(1)}, y^{(1)}), \cdots, (x^{(m)}, y^{(m)})$; n features.
$x^{(i)} = \left[ \begin{array}{c} x_0^{(i)}\\ x_1^{(i)}\\ \vdots\\ x_n^{(i)}\\ \end{array} \right] \in \mathbb{R}^{n+1}$,      design matrix $X = \left[ \begin{array}{c} \left( x^{(1)} \right)^T\\ \left( x^{(2)} \right)^T\\ \vdots\\ \left( x^{(m)} \right)^T\\ \end{array} \right]$
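
For readers who want the idea behind the closed form without looking up the reference, here is a short derivation sketch in matrix notation. It assumes that $X^TX$ is invertible and uses the design matrix $X$ and the target vector $y$ as defined in these notes.

```latex
\begin{aligned}
J(\theta) &= \frac{1}{2m}\,(X\theta - y)^T (X\theta - y) \\
\nabla_{\theta} J(\theta) &= \frac{1}{m}\, X^T (X\theta - y) \\
\nabla_{\theta} J(\theta) = 0 \;&\Longrightarrow\; X^T X\,\theta = X^T y \\
&\Longrightarrow\; \theta = (X^T X)^{-1} X^T y
\end{aligned}
```

The $j$-th component of the gradient in the second line is the same sum over training examples that appears in the gradient descent update, so setting every partial derivative to zero is exactly the system in the third line.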
Octave: pinv(X'*X)*X'*y   % pinv computes the pseudoinverse
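
For comparison with the Octave one-liner, the sketch below solves the normal equation in plain Python using exact fractions. It is not a port of `pinv`: it assumes $X^TX$ is invertible and raises an error otherwise. For that reason it uses only the intercept and the size column of the example, since with all five columns and only four rows $X^TX$ would be singular. All helper names are made up for this illustration.

```python
from fractions import Fraction


def solve(A, b):
    """Solve A * t = b by Gauss-Jordan elimination on exact fractions."""
    n = len(A)
    aug = [[Fraction(v) for v in row] + [Fraction(bv)] for row, bv in zip(A, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular; the normal equation has no unique solution")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


def normal_equation(X, y):
    """Return theta solving (X^T X) theta = X^T y."""
    columns = list(zip(*X))  # the rows of X^T
    XtX = [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]
    Xty = [sum(a * b for a, b in zip(ci, y)) for ci in columns]
    return solve(XtX, Xty)


# Intercept column and size (feet^2) from the example; prices in $1000s.
X = [[1, 2104], [1, 1416], [1, 1534], [1, 852]]
y = [460, 232, 315, 178]
theta = normal_equation(X, y)
print([float(t) for t in theta])
```

A quick sanity check on the result is that the residuals are orthogonal to every column of the design matrix, which is just the normal equation restated.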

| Gradient Descent | Normal Equation |
|---|---|
| (1) Need to choose $\alpha$ | (1) No need to choose $\alpha$ |
| (2) Needs many iterations | (2) No need to iterate |
| (3) Works well even when n is large | (3) Need to compute $(X^TX)^{-1}$ |
| | (4) Slow if n is very large |
| e.g. n = 10^6 | e.g. n = 100, 1000 |

Around n = 10000, the choice starts to shift toward gradient descent.

What if $X^TX$ is non-invertible?
(1) Redundant features (linearly dependent)
E.g. $x_1$ = size in feet²
        $x_2$ = size in m²,       so $x_1 = (3.28)^2 x_2$
(2) Too many features (e.g. $m \leqslant n$)
Delete some features or use regularization (covered later).
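
The redundant-feature case can be checked numerically. In the sketch below the sizes in square metres are hypothetical values, the size in square feet is derived from them with the factor 3.28 squared, and exact fractions are used so that the determinant comes out as exactly zero rather than a tiny rounding error. The helper names `gram` and `det` are made up for this example.

```python
from fractions import Fraction


def gram(X):
    """Return X^T X for a matrix given as a list of rows."""
    columns = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]


def det(M):
    """Determinant by Laplace expansion along the first row (small matrices only)."""
    if len(M) == 1:
        return M[0][0]
    total = 0
    for j, value in enumerate(M[0]):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * value * det(minor)
    return total


k = Fraction(328, 100) ** 2                                # feet^2 per m^2
sizes_m2 = [Fraction(s) for s in (195, 132, 143, 79)]      # hypothetical sizes in m^2
X_redundant = [[Fraction(1), k * s, s] for s in sizes_m2]  # columns: x0, size in feet^2, size in m^2
X_reduced = [row[:2] for row in X_redundant]               # redundant column deleted

print(det(gram(X_redundant)))     # exactly zero: X^T X is singular
print(det(gram(X_reduced)) != 0)  # invertible once the redundant feature is dropped
```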
