Cross Entropy
Cross entropy measures the distance between two distributions. Let $x_i$ be the $i$-th sample, $p(x)$ the predicted distribution, and $q(x)$ the true distribution. The cross-entropy loss is
$$H(p,q)=-\sum_{i=1}^{n} q(x_i)\log\big(p(x_i)\big)$$
In short, the more the two distributions differ, the larger the loss.
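As a quick illustration (a minimal sketch with made-up distributions, not from the original post), the formula above can be evaluated directly with numpy:

```python
import numpy as np

# Hypothetical true distribution q and predicted distribution p over 3 classes
q = np.array([0.0, 1.0, 0.0])   # one-hot "true" distribution
p = np.array([0.2, 0.7, 0.1])   # predicted distribution

# H(p, q) = -sum_i q(x_i) * log(p(x_i))
eps = 1e-12                      # avoid log(0)
H = -np.sum(q * np.log(p + eps))
print(H)                         # ~0.357; grows as p drifts away from q
```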
Binary classification
sigmoid + binary cross-entropy loss
Let $x$ be the feature and let $y$ take one of two values, 0 or 1. We can use the sigmoid function
$$\sigma(x)=\frac{1}{1+e^{-x}}$$
to compute the probabilities
$$p(y=1\mid x)=p=\frac{1}{1+e^{-x}}$$
$$p(y=0\mid x)=1-p=\frac{e^{-x}}{1+e^{-x}}$$
Plugging these into the cross entropy, the loss for a sample with label $y=1$ is
$$H(p,q)=-\log\frac{1}{1+e^{-x}}$$
and in general $H(p,q)=-\big[y\log p+(1-y)\log(1-p)\big]$.
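A minimal numpy sketch of sigmoid plus binary cross-entropy (the scores and labels here are made up for illustration):

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical raw scores and 0/1 labels
scores = np.array([2.0, -1.0, 0.5])
y      = np.array([1,    0,    1])

p = sigmoid(scores)                       # p(y=1|x) for each sample
eps = 1e-12                               # avoid log(0)
# binary cross-entropy: -[y*log(p) + (1-y)*log(1-p)], averaged over samples
loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
print(loss)
```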
Multi-class classification
softmax + multi-class cross-entropy loss
Softmax normalizes the scores, and the normalized values can be interpreted as probabilities.
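For example (made-up scores, just to show the normalization):

```python
import numpy as np

scores = np.array([2.0, 1.0, 0.1])               # hypothetical class scores
probs = np.exp(scores) / np.sum(np.exp(scores))  # softmax
print(probs, probs.sum())                        # entries in (0,1), sum to 1
```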
Forward pass notation:
- $x$ is a sample
- $y_i$ is the true label of the $i$-th sample
- $j$ ranges over all classes
- $s$ is the score obtained by passing $x$ through $W$

To stay consistent with the code below, we write $s=XW$; the more common textbook form is $s=W^{T}x$.
Total loss:
$$L=-\frac{1}{N}\sum_{i}\log\left(\frac{e^{s_{y_i}}}{\sum_{j}e^{s_{j}}}\right)+0.5\times\lambda\|W\|^{2},\qquad s=XW$$
The total loss splits into two parts: the empirical risk and the regularization term.
- Empirical risk: $-\frac{1}{N}\sum_{i}\log\left(\frac{e^{s_{y_i}}}{\sum_{j}e^{s_{j}}}\right)$
- Regularization term: $0.5\times\lambda\|W\|^{2}$, where the factor 0.5 is there purely to make the gradient cleaner (see the numpy sketch below).
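Putting the forward pass and the two terms together (a hedged sketch; the toy sizes `N`, `D`, `C` and the random data are assumptions, not from the original):

```python
import numpy as np

np.random.seed(0)
N, D, C = 5, 4, 3                          # samples, features, classes (toy sizes)
X = np.random.randn(N, D)                  # data
W = 0.01 * np.random.randn(D, C)           # weights
y = np.random.randint(C, size=N)           # true labels
reg = 0.1                                  # lambda

s = X.dot(W)                               # scores, shape (N, C): s = XW
probs = np.exp(s) / np.sum(np.exp(s), axis=1, keepdims=True)  # row-wise softmax
empirical_risk = -np.mean(np.log(probs[np.arange(N), y]))     # -1/N * sum_i log(...)
regularization = 0.5 * reg * np.sum(W * W)                    # 0.5 * lambda * ||W||^2
L = empirical_risk + regularization
print(L)
```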
To compute the gradient, decompose the loss. Define $L_i$, the loss of the $i$-th sample (here $i$ indexes samples and $j$ indexes classes):
$$L_{i}=-\log\frac{e^{s_{y_i}}}{\sum_{j}e^{s_{j}}}$$
Substituting $L_i$ back into the loss gives, finally:
$$L=\frac{1}{N}\sum_{i}L_{i}+0.5\times\lambda\|W\|^{2},\qquad L_{i}=-\log\frac{e^{s_{y_i}}}{\sum_{j}e^{s_{j}}},\qquad s=XW$$
tips:
- If the scores are very negative numbers, $e^{s}$ underflows toward 0; a denominator near 0 means the final result is undefined (a 0/0 indeterminate form).
- If the scores are very large positive numbers, $e^{s}$ overflows, again making the whole expression undefined.
Fix: subtract the maximum score $\max_{j}s_{j}$ from every score, i.e. compute $e^{s_{j}-\max_{j}s_{j}}$ instead of $e^{s_{j}}$ (sketched in code below).
- The largest exponent passed to exp is then 0, which rules out overflow.
- Likewise, at least one term in the denominator equals 1, which rules out division by zero caused by underflow in the denominator. (from the Deep Learning book)
$$\frac{e^{s_{y_i}}}{\sum_{j}e^{s_{j}}}=\frac{Ce^{s_{y_i}}}{C\sum_{j}e^{s_{j}}}=\frac{e^{s_{y_i}+\log C}}{\sum_{j}e^{s_{j}+\log C}},\qquad \log C=-\max_{j}s_{j}$$
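A minimal sketch of the stabilized softmax (the score values are made up; the shift does not change the output):

```python
import numpy as np

scores = np.array([1000.0, 1001.0, 1002.0])  # large scores: np.exp(scores) would overflow

def softmax_stable(s):
    shifted = s - np.max(s)                  # largest exponent becomes 0
    e = np.exp(shifted)                      # no overflow; at least one entry is exactly 1
    return e / np.sum(e)

print(softmax_stable(scores))                # same result the unshifted softmax would give
```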
Backward pass: computing the gradient. By the chain rule,
$$\frac{\partial L}{\partial W}=\frac{\partial L}{\partial L_{i}}\times\frac{\partial L_{i}}{\partial s_{j}}\times\frac{\partial s_{j}}{\partial W},\qquad \frac{\partial L}{\partial L_{i}}=\frac{1}{N}$$
$$L_{i}=-\log\frac{e^{s_{y_i}}}{\sum_{j}e^{s_{j}}}=-s_{y_i}+\log\Big(\sum_{j}e^{s_{j}}\Big)$$
Note that $s_{y_i}$ involves only the correct class, so $\frac{\partial L_{i}}{\partial s_{j}}$ splits into two cases, worked out below.
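The two cases (a standard derivation, consistent with the loop code below):
$$\frac{\partial L_{i}}{\partial s_{j}}=\frac{\partial}{\partial s_{j}}\Big(-s_{y_i}+\log\sum_{k}e^{s_{k}}\Big)=
\begin{cases}\dfrac{e^{s_{j}}}{\sum_{k}e^{s_{k}}}-1, & j=y_i\\[2ex]\dfrac{e^{s_{j}}}{\sum_{k}e^{s_{k}}}, & j\neq y_i\end{cases}$$
Since $s_{j}=x_{i}W_{:,j}$, we also have $\frac{\partial s_{j}}{\partial W_{:,j}}=x_{i}$, which is why the code accumulates `dW[:, j] += (p_j - 1) * X[i]` for the correct class and `dW[:, j] += p_j * X[i]` otherwise.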
Computing the gradient with explicit loops:
# encoding: utf-8
from builtins import range
import numpy as np
from random import shuffle
from past.builtins import xrange

def softmax_loss_naive(W, X, y, reg):
    """
    Softmax loss function, naive implementation (with loops)

    Inputs:
    - W: A numpy array of shape (D, C) containing weights.
    - X: A numpy array of shape (N, D) containing a minibatch of data.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c means
      that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength

    Returns a tuple of:
    - loss as single float
    - gradient with respect to weights W; an array of same shape as W
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)

    #############################################################################
    # TODO: Compute the softmax loss and its gradient using explicit loops.     #
    # Store the loss in loss and the gradient in dW. If you are not careful     #
    # here, it is easy to run into numeric instability. Don't forget the        #
    # regularization!                                                           #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    num_classes = W.shape[1]
    num_train = X.shape[0]
    # W: (D, C), X: (N, D)
    for i in xrange(num_train):
        scores = X[i].dot(W)                      # (C,) scores for sample i
        shift_scores = scores - np.max(scores)    # shift for numerical stability
        loss += -shift_scores[y[i]] + np.log(np.sum(np.exp(shift_scores)))
        for j in xrange(num_classes):
            # softmax probability of class j for sample i (a scalar)
            softmax_output = np.exp(shift_scores[j]) / np.sum(np.exp(shift_scores))
            if j == y[i]:
                dW[:, j] += (-1 + softmax_output) * X[i]
            else:
                dW[:, j] += softmax_output * X[i]
    loss /= num_train
    loss += 0.5 * reg * np.sum(W * W)
    dW = dW / num_train + reg * W
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    return loss, dW
Vectorized (no explicit loops):
def softmax_loss_vectorized(W, X, y, reg):
    """
    Softmax loss function, vectorized version.

    Inputs and outputs are the same as softmax_loss_naive.
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)

    #############################################################################
    # TODO: Compute the softmax loss and its gradient using no explicit loops.  #
    # Store the loss in loss and the gradient in dW. If you are not careful     #
    # here, it is easy to run into numeric instability. Don't forget the        #
    # regularization!                                                           #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    num_classes = W.shape[1]
    num_train = X.shape[0]
    scores = X.dot(W)                                               # (N, C)
    shift_scores = scores - np.max(scores, axis=1, keepdims=True)   # stability shift
    probabilities = np.exp(shift_scores) / np.sum(np.exp(shift_scores), axis=1, keepdims=True)  # (N, C)
    loss = -np.sum(np.log(probabilities[range(num_train), list(y)]))
    loss = loss / num_train + 0.5 * reg * np.sum(W * W)

    # Gradient w.r.t. the scores: probability minus 1 at the correct class
    dscores = probabilities.copy()
    dscores[range(num_train), list(y)] -= 1
    dW = (X.T).dot(dscores)
    dW = dW / num_train + reg * W
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    return loss, dW
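A quick sanity check (a hedged sketch; the toy shapes and random data are assumptions) is to run both implementations on the same inputs and confirm they agree:

```python
import numpy as np

np.random.seed(1)
N, D, C = 10, 6, 4                          # toy sizes
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)
W = 0.001 * np.random.randn(D, C)
reg = 0.05

loss_naive, grad_naive = softmax_loss_naive(W, X, y, reg)
loss_vec, grad_vec = softmax_loss_vectorized(W, X, y, reg)

print(abs(loss_naive - loss_vec))           # should be ~0
print(np.max(np.abs(grad_naive - grad_vec)))  # should be ~0
```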