I went through Stanford's open course on Machine Learning (taught by Professor Andrew Ng) and took some notes, written up here for future reference. If you find any mistakes in these notes, please let me know.
Other notes in this series:
Linear Regression
Classification and logistic regression
Generalized Linear Models
Generative Learning algorithms

Classification and logistic regression
1 Logistic regression
$h_{\theta}(x) = g(\theta^T x) = \frac{1}{1+e^{-\theta^T x}}$, where $g(z) = \frac{1}{1+e^{-z}}$ (the logistic function / sigmoid function)
$p(y=1|x;\theta) = h_\theta(x)$
$p(y=0|x;\theta) = 1 - h_\theta(x)$
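As a quick illustration, here is a minimal NumPy sketch of the hypothesis (the function names `sigmoid` and `hypothesis` are my own, not from the lecture notes):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x), interpreted as p(y = 1 | x; theta)."""
    return sigmoid(np.dot(theta, x))
```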
The two cases can be written compactly as:

$$p(y|x;\theta) = (h_\theta(x))^y (1 - h_\theta(x))^{1-y}$$
Assuming the $m$ training examples were generated independently, the likelihood of the parameters is:

$$
\begin{aligned}
L(\theta) &= p(\vec{y}\,|\,X;\theta) \\
&= \prod_{i=1}^{m} p(y^{(i)}\,|\,x^{(i)};\theta) \\
&= \prod_{i=1}^{m} (h_\theta(x^{(i)}))^{y^{(i)}} (1 - h_\theta(x^{(i)}))^{1-y^{(i)}}
\end{aligned}
$$

Taking the logarithm gives the log-likelihood:

$$
\begin{aligned}
\ell(\theta) &= \log L(\theta) \\
&= \log \prod_{i=1}^{m} (h_\theta(x^{(i)}))^{y^{(i)}} (1 - h_\theta(x^{(i)}))^{1-y^{(i)}} \\
&= \sum_{i=1}^{m} \log\left((h_\theta(x^{(i)}))^{y^{(i)}} (1 - h_\theta(x^{(i)}))^{1-y^{(i)}}\right) \\
&= \sum_{i=1}^{m} \left(\log (h_\theta(x^{(i)}))^{y^{(i)}} + \log (1 - h_\theta(x^{(i)}))^{1-y^{(i)}}\right) \\
&= \sum_{i=1}^{m} \left(y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1 - h_\theta(x^{(i)}))\right)
\end{aligned}
$$
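To make the final expression concrete, here is a sketch of $\ell(\theta)$ in NumPy (the `eps` guard against $\log 0$ and the variable names are my own additions):

```python
import numpy as np

def log_likelihood(theta, X, y):
    """ell(theta) = sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ].

    X is the (m, n) design matrix; y is an (m,) vector of 0/1 labels.
    """
    p = 1.0 / (1.0 + np.exp(-X @ theta))  # h_theta(x^(i)) for all examples
    eps = 1e-12                           # numerical guard against log(0)
    return np.sum(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))
```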
To maximize $L(\theta)$ (equivalently, its logarithm $\ell(\theta)$), we use gradient ascent:

$$\theta := \theta + \alpha\,\nabla_\theta \ell(\theta)$$

(Note the $+$ here, in contrast to the $-$ in the gradient descent algorithm studied earlier: we are now maximizing a function rather than minimizing one, and $h_\theta(x)$ is also a different function.)
Working on one coordinate $\theta_j$ at a time:

$$
\begin{aligned}
\frac{\partial}{\partial\theta_j}\ell(\theta)
&= \frac{\partial}{\partial\theta_j}\sum_{i=1}^{m}\left(y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1 - h_\theta(x^{(i)}))\right) \\
&= \sum_{i=1}^{m} \frac{\partial}{\partial\theta_j}\left(y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1 - h_\theta(x^{(i)}))\right) \\
&= \sum_{i=1}^{m} \left(\frac{y^{(i)}}{h_\theta(x^{(i)})}\,\frac{\partial}{\partial\theta_j} h_\theta(x^{(i)}) + \frac{1-y^{(i)}}{1 - h_\theta(x^{(i)})}\,\frac{\partial}{\partial\theta_j}\big(1 - h_\theta(x^{(i)})\big)\right) \\
&= \sum_{i=1}^{m} \left(\frac{y^{(i)}}{h_\theta(x^{(i)})}\,\frac{\partial}{\partial\theta_j} h_\theta(x^{(i)}) - \frac{1-y^{(i)}}{1 - h_\theta(x^{(i)})}\,\frac{\partial}{\partial\theta_j} h_\theta(x^{(i)})\right) \\
&= \sum_{i=1}^{m} \frac{y^{(i)} - h_\theta(x^{(i)})}{h_\theta(x^{(i)})\,(1 - h_\theta(x^{(i)}))}\,\frac{\partial}{\partial\theta_j} h_\theta(x^{(i)}) \\
&= \sum_{i=1}^{m} \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}
\end{aligned}
$$

The last step uses the sigmoid derivative $g'(z) = g(z)(1 - g(z))$, which gives

$$\frac{\partial}{\partial\theta_j} h_\theta(x^{(i)}) = h_\theta(x^{(i)})\,(1 - h_\theta(x^{(i)}))\,\frac{\partial}{\partial\theta_j}\theta^T x^{(i)} = h_\theta(x^{(i)})\,(1 - h_\theta(x^{(i)}))\, x_j^{(i)}$$
The resulting gradient ascent update rule is:

$$\theta_j := \theta_j + \alpha \sum_{i=1}^{m} \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}$$
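Putting this update rule into code, here is a minimal batch gradient ascent sketch, vectorized over all coordinates $j$ at once (`alpha` and `iters` are illustrative choices, not values from the notes):

```python
import numpy as np

def gradient_ascent(X, y, alpha=0.01, iters=1000):
    """Batch gradient ascent on ell(theta):
    theta_j := theta_j + alpha * sum_i (y^(i) - h_theta(x^(i))) * x_j^(i),
    computed for every j at once as X^T (y - p)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ theta))  # h_theta(x^(i)) for all i
        theta += alpha * (X.T @ (y - p))      # the gradient derived above
    return theta
```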
2 Digression: The perceptron learning algorithm
Define the function $g(z)$ as:
$$g(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}$$
If we let $h_\theta(x) = g(\theta^T x)$ with this threshold function as $g$, then the update rule

$$\theta_j := \theta_j + \alpha \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}$$

yields the perceptron learning algorithm.
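As a sketch, a single perceptron update might look like this in NumPy (note that when the prediction is correct, $y^{(i)} - h_\theta(x^{(i)}) = 0$ and $\theta$ is left unchanged; the function name is my own):

```python
import numpy as np

def perceptron_update(theta, x, y, alpha=1.0):
    """One update with the threshold hypothesis:
    h(x) = 1 if theta^T x >= 0 else 0, then
    theta_j := theta_j + alpha * (y - h(x)) * x_j."""
    h = 1.0 if np.dot(theta, x) >= 0.0 else 0.0
    return theta + alpha * (y - h) * x
```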
3 Another algorithm for maximizing $\ell(\theta)$: Newton's method
Suppose we have a function $f(\theta)$ and want to find a $\theta$ such that $f(\theta) = 0$. Newton's method performs the update:
$$\theta := \theta - \frac{f(\theta)}{f'(\theta)}$$
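For intuition, here is a scalar Newton's method sketch, applied to finding the positive root of $f(\theta) = \theta^2 - 2$ (an example of my own choosing):

```python
def newton_root(f, f_prime, theta, tol=1e-10, max_iter=50):
    """Iterate theta := theta - f(theta) / f'(theta) until convergence."""
    for _ in range(max_iter):
        step = f(theta) / f_prime(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Converges to sqrt(2) ~ 1.41421356 in a handful of iterations.
root = newton_root(lambda t: t * t - 2.0, lambda t: 2.0 * t, theta=1.0)
```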
So how do we find a $\theta$ that maximizes $\ell(\theta)$? We need $\ell'(\theta) = 0$ (this condition holds at both maxima and minima of $\ell(\theta)$: extrema occur at stationary points, where the first derivative vanishes). Applying Newton's method with $f(\theta) = \ell'(\theta)$ gives:
$$\theta := \theta - \frac{\ell'(\theta)}{\ell''(\theta)}$$
In our logistic regression setting, $\theta$ is vector-valued, so Newton's method must be generalized accordingly (the Newton-Raphson method):
$$\theta := \theta - H^{-1}\,\nabla_\theta \ell(\theta), \qquad H_{ij} = \frac{\partial^2 \ell(\theta)}{\partial\theta_i\,\partial\theta_j}$$

where $H$ is the Hessian matrix.
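For logistic regression, differentiating the gradient derived in section 1 gives the closed forms $\nabla_\theta \ell = X^T(y - p)$ and $H = -X^T \operatorname{diag}(p(1-p))\,X$, where $p_i = h_\theta(x^{(i)})$. A minimal Newton-Raphson sketch under those formulas:

```python
import numpy as np

def newton_logistic(X, y, iters=10):
    """Newton's method for maximizing ell(theta):
    theta := theta - H^{-1} grad, with
      grad = X^T (y - p)   and   H = -X^T diag(p * (1 - p)) X."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(x^(i)) for all i
        grad = X.T @ (y - p)                   # gradient of ell(theta)
        H = -(X.T * (p * (1.0 - p))) @ X       # Hessian of ell(theta)
        theta -= np.linalg.solve(H, grad)      # theta := theta - H^{-1} grad
    return theta
```

Solving the linear system with `np.linalg.solve` instead of explicitly inverting $H$ is the standard numerically stabler way to apply the $H^{-1}\nabla_\theta\ell(\theta)$ step.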