Dataset Format
In multi-class classification problems in machine learning, the dataset is usually formatted as follows. The features and label of the $i$-th sample are written as:
$$x^i=(x_1^i,x_2^i,x_3^i,\ldots,x_d^i)^T \in R^d$$
$$y^i=(y_1^i,y_2^i,y_3^i,\ldots,y_m^i)^T \in R^m$$
where $d$ is the dimensionality of the input features and $m$ is the number of output classes; the labels are one-hot encoded. The full dataset can then be written as:
$$X=[x^1,x^2,\ldots,x^n]=\begin{bmatrix} x^1_1 & x^2_1 & \ldots & x^n_1 \\ x^1_2 & x^2_2 & \ldots & x^n_2 \\ \vdots & \vdots & \ddots & \vdots \\ x^1_d & x^2_d & \ldots & x^n_d \end{bmatrix} \in R^{d\times n}$$
$$Y=[y^1,y^2,\ldots,y^n] \in R^{m\times n}$$
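As a concrete illustration, the layout above can be built in NumPy. This is a minimal sketch with hypothetical toy values (the sizes $n=4$, $d=3$, $m=2$ and the `features`/`labels` arrays are made up for the example):

```python
import numpy as np

# Hypothetical toy data: n = 4 samples, d = 3 features, m = 2 classes.
features = np.array([[0.5, 1.2, -0.3],
                     [1.0, 0.1,  0.7],
                     [-0.2, 0.8, 1.5],
                     [0.3, -1.1, 0.9]])   # shape (n, d), one sample per row
labels = np.array([0, 1, 1, 0])           # integer class indices

d, m, n = 3, 2, 4

# Stack samples as COLUMNS, matching X ∈ R^{d×n} above.
X = features.T                            # shape (d, n)

# One-hot encode the labels into Y ∈ R^{m×n}: column i is y^i.
Y = np.zeros((m, n))
Y[labels, np.arange(n)] = 1.0
```

Each column of `X` is one sample $x^i$, and each column of `Y` is its one-hot label $y^i$, matching the $R^{d\times n}$ and $R^{m\times n}$ shapes above.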
Classification Expression Based on a Linear Model plus Softmax
For a single sample:
$$o = Wx+b = \begin{bmatrix} w_{11} & w_{12} & \ldots & w_{1d} \\ w_{21} & w_{22} & \ldots & w_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ w_{m1} & w_{m2} & \ldots & w_{md} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix} = \begin{bmatrix} w_{11}x_1+w_{12}x_2+\ldots+w_{1d}x_d+b_1 \\ w_{21}x_1+w_{22}x_2+\ldots+w_{2d}x_d+b_2 \\ \vdots \\ w_{m1}x_1+w_{m2}x_2+\ldots+w_{md}x_d+b_m \end{bmatrix}$$
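The matrix form and the expanded element-wise form are the same computation; a small sketch can confirm this (the sizes $d=3$, $m=4$ and the random values are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 4                      # illustrative sizes

W = rng.normal(size=(m, d))      # weight matrix W ∈ R^{m×d}
b = rng.normal(size=(m,))        # bias b ∈ R^m
x = rng.normal(size=(d,))        # a single sample x ∈ R^d

o = W @ x + b                    # the linear score o = Wx + b, o ∈ R^m

# Element-wise check against the expanded form o_j = Σ_k w_{jk} x_k + b_j.
o_expanded = np.array([sum(W[j, k] * x[k] for k in range(d)) + b[j]
                       for j in range(m)])
```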
To turn the output into a multi-class prediction, the softmax function is applied. The softmax function is defined as follows.
Given
$$o=\begin{bmatrix} o_1 \\ o_2 \\ \vdots \\ o_m \end{bmatrix}$$
$$\hat{y}=\begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_m \end{bmatrix} = \mathrm{softmax}(o)=\begin{bmatrix} \frac{e^{o_1}}{\sum_{i=1}^{m}e^{o_i}} \\ \frac{e^{o_2}}{\sum_{i=1}^{m}e^{o_i}} \\ \vdots \\ \frac{e^{o_m}}{\sum_{i=1}^{m}e^{o_i}} \end{bmatrix}$$
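A direct implementation of this formula can overflow for large $o_j$; a common trick (not discussed in the text, but standard practice) is to subtract $\max(o)$ first, which leaves the result unchanged because the factor $e^{-\max(o)}$ cancels between numerator and denominator:

```python
import numpy as np

def softmax(o):
    """Numerically stable softmax: subtracting max(o) leaves the
    result unchanged but prevents overflow in exp."""
    z = np.exp(o - np.max(o))
    return z / z.sum()

probs = softmax(np.array([1.0, 2.0, 3.0]))
```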
For multi-class classification, we minimize the negative log-likelihood as the loss function, whose expression is
$$L=-\log P(Y|X)= \sum_{i=1}^{n}-\log P(y^i|x^i)=\sum_{i=1}^{n}l(y^i,\hat{y}^i)$$
Here $l(y,\hat{y})$ is the loss defined for a single sample, written as
$$l(y,\hat{y})=-\sum_{j=1}^{m}y_j\log\hat{y}_j$$
where $y_j$ is the label value, $\hat{y}_j$ is the predicted value, and $m$ is the length of the one-hot vector, i.e., the number of classes.
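This per-sample cross-entropy loss is easy to sketch; the `eps` guard against $\log 0$ is an implementation detail I am adding, not part of the formula above:

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """l(y, ŷ) = -Σ_j y_j log ŷ_j for a one-hot y; eps guards log(0)."""
    return -np.sum(y * np.log(y_hat + eps))

y = np.array([0.0, 1.0, 0.0])          # one-hot label: class 1
y_hat = np.array([0.1, 0.7, 0.2])      # predicted distribution
loss = cross_entropy(y, y_hat)         # only the true class's term survives
```

Because $y$ is one-hot, only the predicted probability of the true class contributes: here the loss is $-\log 0.7$.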
Derivatives via the Chain Rule
Chain-rule expressions
Computing the derivatives with respect to $w_{jk}$ and $b_j$ requires the chain rule. The formulas are as follows:
$$\frac{\partial l}{\partial w_{jk}} = \frac{\partial l}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial o_j} \frac{\partial o_j}{\partial w_{jk}}$$
$$\frac{\partial l}{\partial b_j} = \frac{\partial l}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial o_j} \frac{\partial o_j}{\partial b_j}$$
Solving for $\frac{\partial l}{\partial o_j}$
Since
$$l(y,\hat{y})=-\sum_{j=1}^{m}y_j\log\hat{y}_j, \qquad \hat{y}_j=\frac{e^{o_j}}{\sum_{i=1}^{m}e^{o_i}},$$
we can express $l$ directly in terms of $o_j$:
$$l(y,\hat{y})=-\sum_{j=1}^{m}y_j\log\frac{e^{o_j}}{\sum_{i=1}^{m}e^{o_i}} = \sum_{j=1}^{m}y_j\log\sum_{i=1}^{m}e^{o_i}-\sum_{j=1}^{m}y_j\log e^{o_j} = \log\sum_{i=1}^{m}e^{o_i}-\sum_{j=1}^{m}y_j o_j$$
where the last step uses $\sum_{j=1}^{m}y_j=1$ for one-hot labels.
Therefore
$$\frac{\partial l}{\partial o_j}= \frac{e^{o_j}}{\sum_{i=1}^{m}e^{o_i}}-y_j$$
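This derivative can be checked numerically against central finite differences, using the $l = \log\sum_i e^{o_i} - \sum_j y_j o_j$ form derived above (the specific `o` and `y` values are arbitrary test inputs):

```python
import numpy as np

def softmax(o):
    z = np.exp(o - np.max(o))
    return z / z.sum()

def loss(o, y):
    # l = log Σ_i e^{o_i} − Σ_j y_j o_j, the form derived above
    return np.log(np.sum(np.exp(o))) - np.dot(y, o)

o = np.array([0.2, -1.0, 0.5])
y = np.array([0.0, 0.0, 1.0])          # one-hot label

analytic = softmax(o) - y              # ∂l/∂o_j = e^{o_j}/Σ e^{o_i} − y_j

# Central finite differences as an independent check.
h = 1e-6
numeric = np.zeros_like(o)
for j in range(len(o)):
    op, om = o.copy(), o.copy()
    op[j] += h
    om[j] -= h
    numeric[j] = (loss(op, y) - loss(om, y)) / (2 * h)
```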
Solving for $\frac{\partial o_j}{\partial w_{jk}}$
The expression for $o_j$ in terms of $w_{jk}$ can be written as
$$o_j=w_{j1}x_1+w_{j2}x_2+\ldots+w_{jd}x_d+b_j$$
Hence
$$\frac{\partial o_j}{\partial w_{jk}}=x_k$$
Solving for $\frac{\partial o_j}{\partial b_j}$
The expression for $o_j$ in terms of $b_j$ is
$$o_j=w_{j1}x_1+w_{j2}x_2+\ldots+w_{jd}x_d+b_j$$
which gives
$$\frac{\partial o_j}{\partial b_j}=1$$
Final Expressions
Gradient expressions
$$\frac{\partial l}{\partial w_{jk}} = \frac{\partial l}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial o_j} \frac{\partial o_j}{\partial w_{jk}} = \left(\frac{e^{o_j}}{\sum_{i=1}^{m}e^{o_i}}-y_j\right)x_k$$
$$\frac{\partial l}{\partial b_j} = \frac{\partial l}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial o_j} \frac{\partial o_j}{\partial b_j} = \frac{e^{o_j}}{\sum_{i=1}^{m}e^{o_i}}-y_j$$
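Stacking the per-element formulas over all $j,k$ gives compact matrix forms: the full weight gradient is the outer product $(\hat{y}-y)x^T$ and the bias gradient is $\hat{y}-y$. A sketch with illustrative sizes (the shapes and random values are assumptions for the example):

```python
import numpy as np

def softmax(o):
    z = np.exp(o - np.max(o))
    return z / z.sum()

rng = np.random.default_rng(1)
d, m = 3, 4                       # illustrative sizes
W = rng.normal(size=(m, d))
b = rng.normal(size=(m,))
x = rng.normal(size=(d,))
y = np.eye(m)[2]                  # one-hot label for class 2

y_hat = softmax(W @ x + b)

# The per-element formulas stack into matrix form:
# ∂l/∂w_{jk} = (ŷ_j − y_j) x_k  ⇒  dW = (ŷ − y) xᵀ, an outer product
# ∂l/∂b_j    =  ŷ_j − y_j       ⇒  db =  ŷ − y
dW = np.outer(y_hat - y, x)
db = y_hat - y
```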
Gradient Update Expressions
Update expression for $w$
Since
$$\frac{\partial l}{\partial w_{jk}} = \left(\frac{e^{o_j}}{\sum_{i=1}^{m}e^{o_i}}-y_j\right)x_k,$$
the gradient update expression is
$$w_{jk}=w_{jk}-\eta\frac{\partial l}{\partial w_{jk}} = w_{jk}-\eta \left(\frac{e^{o_j}}{\sum_{i=1}^{m}e^{o_i}}-y_j\right)x_k$$
Update expression for $b$
Since
$$\frac{\partial l}{\partial b_j} = \frac{e^{o_j}}{\sum_{i=1}^{m}e^{o_i}}-y_j,$$
the gradient update expression is
$$b_j=b_j-\eta\frac{\partial l}{\partial b_j} = b_j-\eta\left(\frac{e^{o_j}}{\sum_{i=1}^{m}e^{o_i}}-y_j\right)$$
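Putting the update rules together, a full per-sample SGD training loop for softmax regression can be sketched as follows. The toy problem (2D points labeled by the sign of $x_1+x_2$), the sample count, the epoch count, and the learning rate $\eta=0.1$ are all assumptions chosen for illustration:

```python
import numpy as np

def softmax(o):
    z = np.exp(o - np.max(o))
    return z / z.sum()

# Hypothetical linearly separable toy problem: d = 2 features, m = 2 classes.
rng = np.random.default_rng(0)
xs = rng.normal(size=(100, 2))
labels = (xs[:, 0] + xs[:, 1] > 0).astype(int)
ys = np.eye(2)[labels]                 # one-hot targets

d, m = 2, 2
W = np.zeros((m, d))
b = np.zeros(m)
eta = 0.1                              # learning rate η

for epoch in range(50):
    for x, y in zip(xs, ys):
        y_hat = softmax(W @ x + b)
        # w_{jk} ← w_{jk} − η(ŷ_j − y_j)x_k, applied for all j, k at once
        W -= eta * np.outer(y_hat - y, x)
        # b_j ← b_j − η(ŷ_j − y_j)
        b -= eta * (y_hat - y)

preds = np.array([softmax(W @ x + b).argmax() for x in xs])
accuracy = (preds == labels).mean()
```

On this separable toy data the learned decision boundary should closely track $x_1+x_2=0$, so training accuracy ends up near 1.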
References
Zhang, Aston; Lipton, Zachary C.; Li, Mu; Smola, Alexander J. Dive into Deep Learning.