Following Section 6.1 of Chapter 6 in 《统计学习方法》, we derive the loss function and the gradient-descent update formula for the parameter $w$.
The sigmoid function is:

$$g(z)=\frac{1}{1+e^{-z}}$$

Given a sample $x$, a linear function can be used to combine its features linearly:
$$z=w_0+w_1x_1+w_2x_2+\dots+w_nx_n=\sum_{i=0}^{n}w_ix_i=w^TX$$

(here $x_0=1$, so $w_0$ acts as the bias term). Applying the sigmoid function, the prediction function is:
$$h_w(x)=g(w^TX)=\frac{1}{1+e^{-w^TX}}$$
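As a quick sketch of the two formulas above (NumPy assumed; the sample and weights below are invented for illustration):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, X):
    # h_w(x) = g(w^T x); X carries a leading column of ones so w_0 is the bias
    return sigmoid(X @ w)

# hypothetical sample with bias term x_0 = 1 and two features
X = np.array([[1.0, 2.0, -1.0]])
w = np.array([0.5, -0.25, 1.0])
p = predict(w, X)  # estimated probability that Y = 1
```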
$$P(Y=1\mid X)=h_w(x)$$

$$P(Y=0\mid X)=1-h_w(x)$$

These two cases combine into a single expression:

$$P(Y\mid X)=h_w(x)^y\,(1-h_w(x))^{1-y}$$
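A small check that the combined expression really reduces to the two cases above (the input value is arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_y_given_x(y, h):
    # P(Y=y|X) = h^y * (1-h)^(1-y), for y in {0, 1}
    return h**y * (1.0 - h)**(1 - y)

h = sigmoid(0.3)          # some predicted probability
p1 = p_y_given_x(1, h)    # should recover h
p0 = p_y_given_x(0, h)    # should recover 1 - h
```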
The maximum-likelihood function is:

$$L(w)=\prod_{i=1}^m h_w(x_i)^{y_i}\,(1-h_w(x_i))^{1-y_i}$$
Taking the logarithm:

$$\log L(w)=\sum_{i=1}^m\log\left[h_w(x_i)^{y_i}(1-h_w(x_i))^{1-y_i}\right]=\sum_{i=1}^m\left[y_i\log h_w(x_i)+(1-y_i)\log(1-h_w(x_i))\right]$$
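The log-likelihood can be computed directly, and exponentiating it should recover the product form $L(w)$ (toy data invented for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    # log L(w) = sum_i [ y_i*log(h_i) + (1-y_i)*log(1-h_i) ]
    h = sigmoid(X @ w)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# toy data; first column of X is the bias feature x_0 = 1
X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
w = np.array([0.1, 0.8])

ll = log_likelihood(w, X, y)  # a sum of log-probabilities, so ll <= 0
```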
The loss function is the negative of the average log-likelihood:

$$J(w)=-\frac{1}{m}\sum_{i=1}^m\left[y_i\cdot\log h_w(x_i)+(1-y_i)\log(1-h_w(x_i))\right]$$

$$=-\frac{1}{m}\sum_{i=1}^m\left[y_i\cdot\ln\frac{1}{1+e^{-wx_i}}+(1-y_i)\cdot\ln\frac{e^{-wx_i}}{1+e^{-wx_i}}\right]$$

$$=-\frac{1}{m}\sum_{i=1}^m\left[\ln\frac{1}{1+e^{wx_i}}+y_i\cdot\ln\frac{1}{e^{-wx_i}}\right]$$

$$=\frac{1}{m}\sum_{i=1}^m\left[-wx_iy_i+\ln(1+e^{wx_i})\right]$$

(the third line uses $\frac{e^{-wx_i}}{1+e^{-wx_i}}=\frac{1}{1+e^{wx_i}}$ and $\ln\frac{1}{e^{-wx_i}}=wx_i$).
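The algebra above can be sanity-checked numerically: the cross-entropy form and the final simplified form of $J(w)$ should agree on any data (toy data invented):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_cross_entropy(w, X, y):
    # J(w) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ]
    h = sigmoid(X @ w)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def loss_simplified(w, X, y):
    # J(w) = (1/m) * sum[ -w.x_i * y_i + log(1 + e^{w.x_i}) ]
    z = X @ w
    return np.mean(-z * y + np.log(1.0 + np.exp(z)))

X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
w = np.array([0.1, 0.8])
```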
Gradient descent

The gradient of $J(w)$ with respect to the parameter component $w_j$ is:
$$\frac{\partial J(w)}{\partial w_j}=\frac{1}{m}\sum_{i=1}^m\left[-x_{i,j}y_i+\frac{x_{i,j}\cdot e^{wx_i}}{1+e^{wx_i}}\right]=\frac{1}{m}\sum_{i=1}^m x_{i,j}\left(\frac{1}{1+e^{-wx_i}}-y_i\right)=\frac{1}{m}\sum_{i=1}^m\left[h_w(x_i)-y_i\right]x_{i,j}$$
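The analytic gradient can be verified against central finite differences of $J(w)$ (toy data invented):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y):
    # simplified form: (1/m) * sum[ -z_i*y_i + log(1 + e^{z_i}) ]
    z = X @ w
    return np.mean(-z * y + np.log(1.0 + np.exp(z)))

def gradient(w, X, y):
    # dJ/dw_j = (1/m) * sum_i [h_w(x_i) - y_i] * x_{i,j}
    h = sigmoid(X @ w)
    return X.T @ (h - y) / len(y)

X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
w = np.array([0.1, 0.8])

# central finite-difference approximation of each partial derivative
eps = 1e-6
num_grad = np.array([
    (loss(w + eps * np.eye(2)[j], X, y) -
     loss(w - eps * np.eye(2)[j], X, y)) / (2 * eps)
    for j in range(2)
])
```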
So the final update rule for each component $w_j$ of the parameter $w$ is:

$$w_j := w_j-\alpha\sum_{i=1}^m\left[h_w(x_i)-y_i\right]x_{i,j}$$

(the constant factor $\frac{1}{m}$ has been absorbed into the learning rate $\alpha$). For stochastic gradient descent, which updates on a single sample $(x,y)$ at a time, the update for $w$ is:
$$w_j := w_j-\alpha\left[h_w(x)-y\right]x_j$$
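Putting it all together, the single-sample update above can be turned into a minimal stochastic-gradient-descent trainer (the synthetic data, seed, and hyperparameters are arbitrary choices for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m = 200
x1 = rng.normal(size=m)
X = np.column_stack([np.ones(m), x1])  # bias column x_0 = 1
y = (x1 > 0).astype(float)             # linearly separable labels

w = np.zeros(2)
alpha = 0.1
for epoch in range(20):
    for i in rng.permutation(m):
        # SGD: w_j := w_j - alpha * [h_w(x) - y] * x_j, one sample at a time
        h_i = sigmoid(X[i] @ w)
        w -= alpha * (h_i - y[i]) * X[i]

acc = np.mean((sigmoid(X @ w) > 0.5) == (y == 1))  # training accuracy
```

Because the gradient of the log-loss has the same simple "error times feature" shape as linear regression, the whole trainer reduces to the one-line update inside the loop.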