Introduction
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Supervised learning: “right answers” given
- Regression: predict continuous valued output. (Housing price prediction)
- Classification: predict discrete valued output (0 or 1). (Breast cancer: malignant or benign)
Unsupervised learning: clustering
Applications: organizing computing clusters, market segmentation, social network analysis, astronomical data analysis.
Cocktail party problem: separating mixed audio recordings.
Algorithm (one line of Octave):
[W, s, v] = svd((repmat(sum(x.*x, 1), size(x, 1), 1) .* x) * x');
Suggested tool: Octave.
Linear Regression with One Variable
Notation:
- m = Number of training examples
- x’s = “input” variable/features
- y’s = “output” variable/“target” variable
- (x,y) = one training example
- $(x^{(i)}, y^{(i)})$ = the i-th training example
Training Set (m = 47):

Size in feet² (x) | Price ($) in 1000's (y) |
---|---|
2104 | 460 |
1416 | 232 |
1534 | 315 |
852 | 178 |
… | … |
Hypothesis: $h_{\theta}\left( x \right) = \theta_0 + \theta_1 x$
Cost function (squared error cost function):
$J\left( \theta_0, \theta_1 \right) = \frac{1}{2m}\sum_{i=1}^m{\left( h_{\theta}\left( x^{(i)} \right) - y^{(i)} \right)^2}$
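As a reference, here is a minimal Octave sketch of this cost function (the function name computeCost and the argument layout are illustrative, not from these notes):

```octave
% Squared error cost for univariate linear regression.
% x, y: m-by-1 vectors of inputs and targets; theta = [theta0; theta1].
function J = computeCost(x, y, theta)
  m = length(y);                         % number of training examples
  h = theta(1) + theta(2) * x;           % h_theta(x^(i)) for every example
  J = (1 / (2 * m)) * sum((h - y) .^ 2);
end
```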
Goal: $\underset{\theta_0, \theta_1}{\min} J\left( \theta_0, \theta_1 \right)$
Simplified case: $\theta_0 = 0$.
Gradient descent
repeat until convergence {
$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J\left( \theta_0, \theta_1 \right)$ (for j = 0 and j = 1)
}
$\alpha$ is the learning rate; if $\alpha$ is too small, gradient descent can be slow.
Simultaneous update:
$temp0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J\left( \theta_0, \theta_1 \right)$
$temp1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J\left( \theta_0, \theta_1 \right)$
$\theta_0 := temp0$
$\theta_1 := temp1$
Gradient descent can converge to a local minimum (where the slope is 0), even with the learning rate $\alpha$ fixed.
As we approach a local minimum, gradient descent will automatically take smaller steps (the derivative gradually shrinks), so there is no need to decrease $\alpha$ over time.
$\frac{\partial}{\partial \theta_j} J\left( \theta_0, \theta_1 \right) = \frac{\partial}{\partial \theta_j}\left[ \frac{1}{2m}\sum_{i=1}^m{\left( h_{\theta}\left( x^{(i)} \right) - y^{(i)} \right)^2} \right] = \frac{\partial}{\partial \theta_j}\frac{1}{2m}\sum_{i=1}^m{\left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2}$
j = 0: $\frac{\partial}{\partial \theta_0} J\left( \theta_0, \theta_1 \right) = \frac{1}{m}\sum_{i=1}^m{\left( h_{\theta}\left( x^{(i)} \right) - y^{(i)} \right)}$
j = 1: $\frac{\partial}{\partial \theta_1} J\left( \theta_0, \theta_1 \right) = \frac{1}{m}\sum_{i=1}^m{\left( h_{\theta}\left( x^{(i)} \right) - y^{(i)} \right) \cdot x^{(i)}}$
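Putting the two partial derivatives together, a minimal Octave sketch of batch gradient descent with a simultaneous update could look like this (names such as gradientDescentUni, alpha and num_iters are my own, not from the course material):

```octave
% Batch gradient descent for h_theta(x) = theta0 + theta1 * x.
% x, y: m-by-1 vectors; theta = [theta0; theta1]; alpha: learning rate.
function theta = gradientDescentUni(x, y, theta, alpha, num_iters)
  m = length(y);
  for iter = 1:num_iters
    h = theta(1) + theta(2) * x;                            % current predictions
    temp0 = theta(1) - alpha * (1 / m) * sum(h - y);        % j = 0 update
    temp1 = theta(2) - alpha * (1 / m) * sum((h - y) .* x); % j = 1 update
    theta = [temp0; temp1];                                 % simultaneous update
  end
end
```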
Linear Regression with Multiple Variables
Linear regression with multiple variables (multiple features).
Notation:
- n = number of features
- $x^{(i)}$ = input (features) of the i-th training example (an n×1 column vector)
- $x_j^{(i)}$ = value of feature j in the i-th training example
$h_{\theta}\left( x \right) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$
Define $x_0 = 1$ (i.e., $x_0^{(i)} = 1$).
$x = \left[ \begin{array}{c} x_0\\ x_1\\ \vdots\\ x_n\\ \end{array} \right] \in \mathbb{R}^{n+1}, \quad \theta = \left[ \begin{array}{c} \theta_0\\ \theta_1\\ \vdots\\ \theta_n\\ \end{array} \right] \in \mathbb{R}^{n+1}$
$h_{\theta}\left( x \right) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^T x$
Cost function: $J\left( \theta_0, \theta_1, \cdots, \theta_n \right) = \frac{1}{2m}\sum_{i=1}^m{\left( h_{\theta}\left( x^{(i)} \right) - y^{(i)} \right)^2}$
Gradient descent:
Repeat {
$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J\left( \theta_0, \theta_1, \cdots, \theta_n \right)$
} (simultaneously update for every j = 0, …, n)
New algorithm (n ≥ 1):
Repeat {
$\theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i=1}^m{\left( h_{\theta}\left( x^{(i)} \right) - y^{(i)} \right) \cdot x_j^{(i)}}$
} (simultaneously update for every j = 0, …, n)
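In vectorized form this update rule can be sketched in Octave as follows (assuming X is the m×(n+1) design matrix with a leading column of ones; the function name is illustrative):

```octave
% Vectorized batch gradient descent for multivariate linear regression.
% X: m-by-(n+1) design matrix (first column all ones); y: m-by-1; theta: (n+1)-by-1.
function theta = gradientDescentMulti(X, y, theta, alpha, num_iters)
  m = length(y);
  for iter = 1:num_iters
    h = X * theta;                                 % hypothesis for all examples
    theta = theta - (alpha / m) * (X' * (h - y));  % updates every theta_j at once
  end
end
```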
Feature Scaling
Idea: Make sure features are on a similar scale.
E.g. $x_1$ = size (0-2000 feet²), $x_2$ = number of bedrooms (1-5).
(Contours of the cost function.) Without scaling, gradient descent is slow: it oscillates back and forth and takes a long time to find a path to the global minimum.
With feature scaling:
$x_1 = \frac{size\left( feet^2 \right)}{2000}, \quad x_2 = \frac{number\,\,of\,\,bedrooms}{5}$
$0 \leqslant x_1, x_2 \leqslant 1$
Get every feature into approximately a $-1 \leqslant x_i \leqslant 1$ range.
$0 \leqslant x_1 \leqslant 3$ √
$-100 \leqslant x_3 \leqslant 100$ ×
$-2 \leqslant x_2 \leqslant 0.5$ √
$-0.0001 \leqslant x_4 \leqslant 0.0001$ ×
Mean normalization
Replace $x_i$ with $x_i - \mu_i$ to make features have approximately zero mean. (Do not apply to $x_0 = 1$.)
E.g. $x_1 = \frac{size - 1000}{2000}, \quad x_2 = \frac{\#bedrooms - 2}{5}, \quad -0.5 \leqslant x_1, x_2 \leqslant 0.5$
Numerator: subtract the average value of $x_1$ over the training set.
Denominator: the range (max - min) or the standard deviation.
The denominator for $x_2$ could also be 4; it does not need to be exact.
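A possible Octave sketch of mean normalization (dividing by the standard deviation here; dividing by the range max - min would also match the notes; featureNormalize is an illustrative name):

```octave
% Mean-normalize every column of X (do not include the x0 = 1 column).
% Returns mu and sigma as well, so the same scaling can be applied to new examples.
function [X_norm, mu, sigma] = featureNormalize(X)
  mu = mean(X);                                    % 1-by-n per-feature means
  sigma = std(X);                                  % 1-by-n per-feature standard deviations
  X_norm = (X - repmat(mu, size(X, 1), 1)) ./ repmat(sigma, size(X, 1), 1);
end
```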
For sufficiently small $\alpha$, $J(\theta)$ should decrease on every iteration. But if $\alpha$ is too small, gradient descent can be slow to converge.
Summary:
- If $\alpha$ is too small: convergence is slow.
- If $\alpha$ is too large: $J(\theta)$ may not decrease on every iteration and may not converge. (Slow convergence is also possible.)
To choose $\alpha$, try …, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …
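One way to compare these candidate values in Octave is to run a few iterations for each and plot $J(\theta)$ against the iteration number (a sketch assuming X, y and m are already defined; the variable names are illustrative):

```octave
% Try several learning rates and record J(theta) after every iteration,
% then plot the curves: J should decrease steadily for a good alpha.
alphas = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1];
num_iters = 100;
J_history = zeros(num_iters, numel(alphas));
for k = 1:numel(alphas)
  theta = zeros(size(X, 2), 1);
  for iter = 1:num_iters
    theta = theta - (alphas(k) / m) * (X' * (X * theta - y));        % gradient step
    J_history(iter, k) = (1 / (2 * m)) * sum((X * theta - y) .^ 2);  % cost after the step
  end
end
plot(1:num_iters, J_history);            % one curve per learning rate
xlabel('iteration'); ylabel('J(\theta)');
```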
Normal equation: a method to solve for $\theta$ analytically (no need for feature scaling).
$\theta \in \mathbb{R}^{n+1}$
$J\left( \theta_0, \theta_1, \cdots, \theta_n \right) = \frac{1}{2m}\sum_{i=1}^m{\left( h_{\theta}\left( x^{(i)} \right) - y^{(i)} \right)^2}$
Set $\frac{\partial}{\partial \theta_j} J\left( \theta \right) = \cdots = 0$ (for every j)
Solve for $\theta_0, \theta_1, \cdots, \theta_n$.
Example: m = 4

(x0 = 1) | Size (feet²) | Number of bedrooms | Number of floors | Age of home (years) | Price ($1000) |
---|---|---|---|---|---|
x0 | x1 | x2 | x3 | x4 | y |
1 | 2104 | 5 | 1 | 45 | 460 |
1 | 1416 | 3 | 2 | 40 | 232 |
1 | 1534 | 3 | 2 | 30 | 315 |
1 | 852 | 2 | 1 | 36 | 178 |
$X = \left[ \begin{matrix} 1& 2104& 5& 1& 45\\ 1& 1416& 3& 2& 40\\ 1& 1534& 3& 2& 30\\ 1& 852& 2& 1& 36\\ \end{matrix} \right], \quad y = \left[ \begin{array}{c} 460\\ 232\\ 315\\ 178\\ \end{array} \right]$
$\theta = (X^TX)^{-1}X^Ty$ is the $\theta$ that minimizes the cost function. (For a proof, see 西瓜书.)
m examples $(x^{(1)}, y^{(1)}), \cdots, (x^{(m)}, y^{(m)})$; n features.
$x^{(i)} = \left[ \begin{array}{c} x_{0}^{(i)}\\ x_{1}^{(i)}\\ \vdots\\ x_{n}^{(i)}\\ \end{array} \right] \in \mathbb{R}^{n+1}$
design matrix $X = \left[ \begin{array}{c} \left( x^{(1)} \right)^T\\ \left( x^{(2)} \right)^T\\ \vdots\\ \left( x^{(m)} \right)^T\\ \end{array} \right]$
Octave: pinv(X'*X)*X'*y % pinv is the pseudo-inverse function
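For instance, on the m = 4 example above, a minimal Octave sketch would be:

```octave
% Normal equation on the 4-example housing table above.
% pinv is used instead of inv so this still works if X'*X happens to be singular.
X = [1 2104 5 1 45;
     1 1416 3 2 40;
     1 1534 3 2 30;
     1  852 2 1 36];
y = [460; 232; 315; 178];
theta = pinv(X' * X) * X' * y   % theta that minimizes J(theta)
```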
Gradient Descent | Normal Equation |
---|---|
(1) Need to choose $\alpha$ | (1) No need to choose $\alpha$ |
(2) Needs many iterations | (2) No need to iterate |
(3) Works well even when n is large | (3) Need to compute $(X^TX)^{-1}$ |
 | (4) Slow if n is very large |
e.g. $n = 10^6$ | e.g. n = 100 or 1000 |

($\gets$ from around n = 10,000, the normal equation becomes slow and gradient descent is preferred.)
What if $X^TX$ is non-invertible?
(1) Redundant features (linearly dependent). E.g. $x_1$ = size in feet², $x_2$ = size in m²; then $x_1 = (3.28)^2 x_2$.
(2) Too many features (e.g. m ≤ n).
Delete some features or use regularization (later).
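As an illustrative check (my own toy example, reusing y from the table above): a linearly dependent feature makes $X^TX$ singular, although pinv still returns a solution; deleting the redundant feature is the cleaner fix.

```octave
% x2 is just x1 converted from feet^2 to m^2, so the columns of X are linearly dependent.
x1 = [2104; 1416; 1534; 852];     % size in feet^2
x2 = x1 / 3.28^2;                 % same size in m^2 -> redundant feature
X  = [ones(4, 1), x1, x2];
y  = [460; 232; 315; 178];
rank(X' * X)                      % 2 < 3, so X' * X is not invertible
theta = pinv(X' * X) * X' * y     % pinv still works; better: drop x2 or regularize
```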