Neural Network Learning (3)
Introduction to Loss Functions
The examples and derivations in the previous notes were all based on the squared loss function. However, the case discussed there is a classification problem, and the squared loss is generally not suitable for classification, so we need to use the cross-entropy loss function (Cross-Entropy Loss Function) instead.
Squared loss function:
$$\begin{aligned} E_{(i)} = \frac{1}{2} \sum\limits_{k=1}^{n_L} \left(y_k^{(i)} - o_k^{(i)}\right)^2 \end{aligned}$$
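As a quick illustration (a minimal sketch, not part of the original post), the squared loss above can be computed directly for one sample; the values below are made up for the example:

```python
import numpy as np

def squared_loss(y, o):
    # E_(i) = 1/2 * sum_k (y_k - o_k)^2 for a single sample
    return 0.5 * np.sum((y - o) ** 2)

y = np.array([0.0, 1.0, 0.0])   # target output
o = np.array([0.2, 0.7, 0.1])   # network output
print(squared_loss(y, o))       # ≈ 0.07
```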
Cross-entropy loss function: generally used for classification problems.
Suppose the sample label $y \in \{1,\ldots,C\}$ is a discrete class, and the model output $f(x,\theta) \in [0,1]^C$ is the conditional probability distribution over the class labels, i.e.
$$\begin{aligned} p(y=c|x,\theta)=f_c(x,\theta) \end{aligned}$$
which must satisfy

$$\begin{aligned} f_c(x,\theta) \in [0,1], \qquad \sum\limits_{c=1}^{C} f_c(x,\theta)=1 \end{aligned}$$
We therefore introduce the softmax function as the output-layer function, which guarantees that all the probabilities sum to 1.
Softmax function:
$$\begin{aligned} p(y=c|x,\theta)=f_c(x,\theta)=\frac{e^{z_c}}{\sum\limits_{c'=1}^{C} e^{z_{c'}}} \end{aligned}$$
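The softmax definition above can be sketched in a few lines of NumPy (this is an illustrative helper, not the post's program; subtracting the maximum is a standard numerical-stability trick that does not change the result):

```python
import numpy as np

def softmax(z):
    # exp of shifted logits, normalized so the outputs sum to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p)        # ≈ [0.659 0.242 0.099]
print(p.sum())  # 1.0
```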
Cross-entropy function:
$$\begin{aligned} E_{(i)} = -\sum\limits_{c=1}^{C} y_c \log f_c(x,\theta) \end{aligned}$$
Since only one component of the (one-hot) label is 1, the expression above simplifies to:
$$\begin{aligned} E_{(i)} = -\log f_c(x,\theta) \end{aligned}$$

where $c$ is the true class of the sample.
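Numerically, the two forms give the same value when the label is one-hot; the small sketch below (illustrative values, not from the post) checks this:

```python
import numpy as np

def cross_entropy(y_onehot, f):
    # full form: E = -sum_c y_c * log f_c
    return -np.sum(y_onehot * np.log(f))

f = np.array([0.1, 0.7, 0.2])   # softmax output of the network
y = np.array([0.0, 1.0, 0.0])   # one-hot label: true class is the second one
print(cross_entropy(y, f))      # equals -log(0.7) ≈ 0.357
```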
Following the example from the first note, Neural Network Learning (1), we can write out the cost function:
$$\begin{aligned} E &= -\sum\limits_{c=1}^{C} y_c \log f_c(x,\theta) \\ &= -y_1\log\frac{e^{z_1^{(3)}}}{e^{z_1^{(3)}} + e^{z_2^{(3)}}} - y_2\log\frac{e^{z_2^{(3)}}}{e^{z_1^{(3)}} + e^{z_2^{(3)}}} \end{aligned}$$
Backpropagation Derivation
As in the first note, Neural Network Learning (1), we first need to derive the output-layer error term
$$\begin{aligned} \delta_i^{(L)}=\frac{\partial E}{\partial z_i^{(L)}} \end{aligned}$$
By the chain rule for composite functions:
$$\begin{aligned} \delta_i^{(L)}=\frac{\partial E}{\partial z_i^{(L)}}=\sum\limits_{j=1}^{n_L} \frac{\partial E_j}{\partial f_j^{(L)}}\frac{\partial f_j^{(L)}}{\partial z_i^{(L)}} \end{aligned}$$
where:
$$\begin{aligned} \frac{\partial E_j}{\partial f_j^{(L)}} = \frac{\partial(-y_j\log f_j)}{\partial f_j} = -y_j\frac{1}{f_j} \end{aligned}$$
Quotient rule for derivatives:
$$\begin{aligned} \left(\frac{u}{v}\right)'=\frac{u'v-uv'}{v^2} \end{aligned}$$
For $\partial f_j^{(L)}/\partial z_i^{(L)}$ we need to consider two cases:
- When $i = j$:

$$\begin{aligned} \frac{\partial f_i}{\partial z_i} &= \frac{\partial}{\partial z_i}\left(\frac{e^{z_i}}{\sum\limits_{c=1}^{C} e^{z_c}}\right) \\ &= \frac{e^{z_i}\sum\limits_{c=1}^{C} e^{z_c} - (e^{z_i})^2}{\left(\sum\limits_{c=1}^{C} e^{z_c}\right)^2} \\ &= \frac{e^{z_i}}{\sum\limits_{c=1}^{C} e^{z_c}}\left(1-\frac{e^{z_i}}{\sum\limits_{c=1}^{C} e^{z_c}}\right) \\ &= f_i(1-f_i) \end{aligned}$$

- When $i \neq j$:

$$\begin{aligned} \frac{\partial f_j}{\partial z_i} &= \frac{\partial}{\partial z_i}\left(\frac{e^{z_j}}{\sum\limits_{c=1}^{C} e^{z_c}}\right) \\ &= -e^{z_j}\left(\frac{1}{\sum\limits_{c=1}^{C} e^{z_c}}\right)^2 e^{z_i} \\ &= -\frac{e^{z_i}}{\sum\limits_{c=1}^{C} e^{z_c}}\cdot\frac{e^{z_j}}{\sum\limits_{c=1}^{C} e^{z_c}} \\ &= -f_i f_j \end{aligned}$$
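The two cases together form the softmax Jacobian, and they can be verified with finite differences. The sketch below (illustrative, with made-up input values) compares the analytic form $\mathrm{diag}(f) - f f^{\mathrm{T}}$ against a numerical Jacobian:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

z = np.array([0.5, -1.2, 2.0])
f = softmax(z)
C = len(z)
eps = 1e-6

# Finite-difference Jacobian: J[j, i] = d f_j / d z_i
J = np.zeros((C, C))
for i in range(C):
    zp = z.copy(); zp[i] += eps
    zm = z.copy(); zm[i] -= eps
    J[:, i] = (softmax(zp) - softmax(zm)) / (2 * eps)

# Analytic form from the two cases: f_i(1-f_i) on the diagonal, -f_i f_j off it
J_analytic = np.diag(f) - np.outer(f, f)
print(np.max(np.abs(J - J_analytic)))  # tiny: the two forms agree
```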
Therefore:
$$\begin{aligned} \delta_i^{(L)}&=\frac{\partial E}{\partial z_i^{(L)}} \\ &=\sum\limits_{c\neq i}^{C} \left(-y_c\frac{1}{f_c}\right)(-f_i f_c)+\left(-y_i\frac{1}{f_i}\right)f_i(1-f_i) \\ &=\sum\limits_{c\neq i}^{C} y_c f_i + y_i f_i - y_i \\ &=f_i\sum\limits_{c=1}^{C} y_c - y_i \end{aligned}$$
Since exactly one component of $y_c$ equals 1 (so $\sum_{c=1}^{C} y_c = 1$), the result can be written as:
$$\begin{aligned} \delta_i^{(L)}=\frac{\partial E}{\partial z_i^{(L)}} = f_i-y_i \end{aligned} \tag{1}$$
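Equation (1) can also be sanity-checked numerically (an illustrative sketch with made-up logits, not part of the original post): the central-difference gradient of the softmax cross-entropy should match $f_i - y_i$.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, y):
    # E = -sum_c y_c * log f_c(z)
    return -np.sum(y * np.log(softmax(z)))

z = np.array([0.3, 1.5, -0.7])   # output-layer pre-activations
y = np.array([0.0, 1.0, 0.0])    # one-hot label
eps = 1e-6

# Numerical gradient dE/dz_i by central differences
grad_num = np.zeros_like(z)
for i in range(len(z)):
    zp = z.copy(); zp[i] += eps
    zm = z.copy(); zm[i] -= eps
    grad_num[i] = (cross_entropy(zp, y) - cross_entropy(zm, y)) / (2 * eps)

# Analytic gradient from equation (1): delta_i = f_i - y_i
grad_analytic = softmax(z) - y
print(np.max(np.abs(grad_num - grad_analytic)))  # tiny: the formula checks out
```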
From here, we simply propagate the error backwards and update the weights using the recurrence derived earlier:
$$\begin{aligned} \delta^{(l)} &= \left((\omega^{(l+1)})^{\mathrm{T}}\delta^{(l+1)}\right)\odot f'(z^{(l)}) \\ \nabla_{\omega^{(l)}}E &= \delta^{(l)}\left(a^{(l-1)}\right)^{\mathrm{T}} \end{aligned}$$
Program Example
```python
import numpy as np
from sklearn.datasets import load_digits               # handwritten-digit dataset
from sklearn.preprocessing import LabelBinarizer       # one-hot label encoding
from sklearn.model_selection import train_test_split   # train/test split

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def dsigmoid(x):
    # derivative of sigmoid, expressed in terms of its output value
    return x * (1 - x)

def softmax(x):
    total = 0
    temp = np.zeros(len(x[0]))
    for i in range(len(x[0])):
        total += np.exp(x[0][i])
    for i in range(len(x[0])):
        temp[i] = np.exp(x[0][i]) / total
    temp = np.atleast_2d(temp)
    return temp

class NeuralNetwork:
    def __init__(self, layers):  # e.g. (64, 80, 10)
        # Initialize the weights in [-1, 1]; the "+1" column/row holds the bias
        self.V = np.random.random((layers[0] + 1, layers[1] + 1)) * 2 - 1
        self.W = np.random.random((layers[1] + 1, layers[2])) * 2 - 1

    def train(self, X, y, lr=0.11, epochs=10000):
        # Append the bias input: the last column is all ones
        temp = np.ones([X.shape[0], X.shape[1] + 1])
        temp[:, 0:-1] = X
        X = temp
        for n in range(epochs + 1):
            # Pick one random training sample: randint() draws an int in range
            i = np.random.randint(X.shape[0])
            x = [X[i]]
            # Promote to a 2-D row vector
            x = np.atleast_2d(x)
            # L1: hidden-layer activations (input layer -> hidden layer)
            # L2: output-layer probabilities (hidden layer -> output layer)
            L1 = sigmoid(np.dot(x, self.V))
            L2 = softmax(np.dot(L1, self.W))
            # L2_delta: output-layer error term, y - f (the negative of
            #           equation (1), since we add the update below)
            # L1_delta: hidden-layer error term
            L2_delta = y[i] - L2
            L1_delta = L2_delta.dot(self.W.T) * dsigmoid(L1)
            # Gradient-descent weight updates
            self.W += lr * L1.T.dot(L2_delta)
            self.V += lr * x.T.dot(L1_delta)
            # Decay the learning rate late in training
            # (only takes effect when epochs > 40000)
            if n > 40000:
                lr = lr * 0.99
            # Report accuracy every 1000 iterations
            if n % 1000 == 0:
                predictions = []
                for j in range(X_test.shape[0]):
                    # The prediction is the label whose output value is largest
                    o = self.predict(X_test[j])
                    predictions.append(np.argmax(o))
                # np.equal(): True where elements match, False otherwise
                accuracy = np.mean(np.equal(predictions, y_test))
                print('epoch:', n, 'accuracy:', accuracy)

    def predict(self, x):
        # Append the bias input: the last entry is 1
        temp = np.ones([x.shape[0] + 1])
        temp[0:-1] = x
        x = temp
        # Promote to a 2-D row vector
        x = np.atleast_2d(x)
        # L1: hidden-layer activations; L2: output-layer probabilities
        L1 = sigmoid(np.dot(x, self.V))
        L2 = softmax(np.dot(L1, self.W))
        return L2

# Load the data: 8x8 digit images
digits = load_digits()
X = digits.data
Y = digits.target
# Normalize the inputs: large raw values, even after multiplying by small
# weights, push the sigmoid toward 1, which hurts learning
X -= X.min()
X /= X.max()
NN = NeuralNetwork([64, 80, 10])
# Split the data with sklearn
X_train, X_test, y_train, y_test = train_test_split(X, Y)
# Binarize the labels: convert the original decimal labels to one-hot vectors
labels_train = LabelBinarizer().fit_transform(y_train)
labels_test = LabelBinarizer().fit_transform(y_test)
print('start training')
NN.train(X_train, labels_train, epochs=40000)
print('training finished')
```
Conclusion
After applying the derivation above and changing the training procedure accordingly, the accuracy easily reaches 98% and convergence is noticeably faster, which shows that the derivation is effective.
One question remains, though: why does switching to this cost function make the whole program train so much better?