Today we cover the four cross-entropy loss functions built into TensorFlow:
- tf.nn.sigmoid_cross_entropy_with_logits
- tf.nn.softmax_cross_entropy_with_logits_v2
- tf.nn.sparse_softmax_cross_entropy_with_logits
- tf.nn.weighted_cross_entropy_with_logits
1. tf.nn.sigmoid_cross_entropy_with_logits
A note up front: this loss function requires logits and labels to be float32 or float64, so do not define labels as an int type when using it!
tf.nn.sigmoid_cross_entropy_with_logits(
_sentinel=None,
labels=None,
logits=None,
name=None
)
What scenarios is this loss function used for?
Measures the probability error in discrete classification tasks in which each class is independent and not mutually exclusive. For instance, one could perform multilabel classification where a picture can contain both an elephant and a dog at the same time.
This loss function measures the probability error for classes that are independent of each other but not mutually exclusive. (Note: logits are the unnormalized probabilities, i.e., the raw output of the network's output layer, because the loss function applies the sigmoid/softmax normalization itself; see the linked blog post for details.)
For brevity, let x = logits, z = labels. The logistic loss is
\begin{aligned}
  &z * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x)) \\
= &z * -log(1 / (1 + exp(-x))) + (1 - z) * -log(exp(-x) / (1 + exp(-x))) \\
= &z * log(1 + exp(-x)) + (1 - z) * (-log(exp(-x)) + log(1 + exp(-x))) \\
= &z * log(1 + exp(-x)) + (1 - z) * (x + log(1 + exp(-x))) \\
= &(1 - z) * x + log(1 + exp(-x)) \\
= &x - x * z + log(1 + exp(-x))
\end{aligned}
For x < 0, to avoid overflow in exp(-x), we reformulate the above:
\begin{aligned}
  &x - x * z + log(1 + exp(-x)) \\
= &log(exp(x)) - x * z + log(1 + exp(-x)) \\
= &-x * z + log(1 + exp(x))
\end{aligned}
PS: So what exactly is overflow?
Definition: overflow (or underflow) occurs when the number of bits provided by a variable's data type cannot accommodate a given value.
Consider an example. Suppose a short int variable, which occupies 2 bytes of memory, stores the bit pattern 0111 1111 1111 1111.
This is the binary representation of 32,767, the largest value this data type can hold. We will not go into the details of how negative numbers are stored; it is enough to know that a short int can store both positive and negative numbers. A number whose high-order (leftmost) bit is 0 is interpreted as positive, and a number whose high-order bit is 1 is interpreted as negative.
If we add 1 to the number stored above, the variable takes on the bit pattern 1000 0000 0000 0000.
But this is not 32,768. Instead it is interpreted as a negative number, which is not the expected result. The binary 1 has "carried" into the high-order bit position; this is what is called overflow.
Similarly, when an integer variable holds the most negative value its type allows and 1 is subtracted from it, the 1 in the high-order bit becomes 0 and the result is interpreted as a positive number. This is another example of overflow.
Besides overflow, floating-point values can also underflow. This can happen when a value gets too close to zero: such a small number needs more digits of precision to represent than the variable can hold.
In short, overflow means the bits of a variable's data type cannot store the given value!
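If you want to see this wrap-around behavior for yourself, here is a minimal NumPy sketch (np.int16 is chosen here simply to mirror the 2-byte short int above):
import numpy as np

x = np.int16(32767)                     # largest value a 16-bit signed integer can hold
print(np.binary_repr(x, width=16))      # 0111111111111111
# Adding 1 carries into the sign bit, so the value wraps around to -32768
print(x + np.int16(1))                  # -32768 (NumPy may also emit an overflow RuntimeWarning)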
When x < 0 and very far from zero, exp(-x) can become extremely large, causing overflow!
Hence, to ensure stability and avoid overflow, the implementation uses this equivalent formulation.
max(x, 0) - x * z + log(1 + exp(-abs(x)))
Now let's look at a concrete example using this loss function:
import numpy as np
import tensorflow as tf

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

labels = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
logits = np.array([[11., 8., 7.], [10., 14., 3.], [1., 2., 4.]])

# Compute the loss by hand, following the API's internal formula
# Single-label case: each image has exactly one class label
y_pred = sigmoid(logits)
prob_error1 = -labels * np.log(y_pred) - (1 - labels) * np.log(1 - y_pred)
print(".................................................................")
print("----------single-label loss: \n", prob_error1)

# Multi-label case: one image can carry several class labels
labels1 = np.array([[0., 1., 0.], [1., 1., 0.], [0., 0., 1.]])
logits1 = np.array([[1., 8., 7.], [10., 14., 3.], [1., 2., 4.]])
y_pred1 = sigmoid(logits1)
prob_error2 = -labels1 * np.log(y_pred1) - (1 - labels1) * np.log(1 - y_pred1)
print(".................................................................")
print("----------multi-label loss: \n", prob_error2)

with tf.Session() as sess:
    # Call the API directly; logits are the unnormalized probabilities
    print("***********************************************************************")
    print("----------single-label loss: \n",
          sess.run(tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)))
    print("***********************************************************************")
    print("----------multi-label loss: \n",
          sess.run(tf.nn.sigmoid_cross_entropy_with_logits(labels=labels1, logits=logits1)))
What do you observe in the results? Also try to trigger overflow when x < 0 and very far from zero, and then use the reformulated expression above to fix it.
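As a reference for that exercise, here is a minimal NumPy sketch (the helper names naive_loss and stable_loss are just for illustration): the naive formula x - x*z + log(1 + exp(-x)) overflows for a very negative logit, while the reformulated expression stays finite:
import numpy as np

def naive_loss(x, z):
    # Direct transcription of x - x*z + log(1 + exp(-x)); exp(-x) overflows for very negative x
    return x - x * z + np.log(1 + np.exp(-x))

def stable_loss(x, z):
    # Equivalent formulation used for stability; exp(-abs(x)) never exceeds 1
    return np.maximum(x, 0) - x * z + np.log(1 + np.exp(-np.abs(x)))

x = np.array([-1000.0])   # a very negative logit
z = np.array([1.0])
print(naive_loss(x, z))   # [inf] plus an overflow RuntimeWarning, since exp(1000) overflows float64
print(stable_loss(x, z))  # [1000.], the correct loss value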
2. tf.nn.softmax_cross_entropy_with_logits_v2
tf.nn.softmax_cross_entropy_with_logits_v2(
labels,
logits,
axis=None,
name=None,
dim=None
)
What scenarios is this loss function used for?
Measures the probability error in discrete classification tasks in which the classes are mutually exclusive (each entry is in exactly one class). For example, each CIFAR-10 image is labeled with one and only one label: an image can be a dog or a truck, but not both.
NOTE: While the classes are mutually exclusive, their probabilities need not be. All that is required is that each row of labels is a valid probability distribution. If they are not, the computation of the gradient will be incorrect.
This loss function is only suitable for single-label binary or multi-class classification, i.e., each image has exactly one class label, whereas with tf.nn.sigmoid_cross_entropy_with_logits an image can carry multiple class labels. Also, a "valid probability distribution" means the classes are mutually exclusive, but the corresponding probabilities need not be hard 0/1 values (soft labels are allowed).
Note:
- tf.nn.sparse_softmax_cross_entropy_with_logits requires that each label specify one and only one class.
- This op applies softmax to the logits internally (which is more efficient), so its input must be unnormalized logits. Do not feed it the output of softmax, otherwise the results will be incorrect.
- tf.nn.softmax_cross_entropy_with_logits backpropagates only into logits; tf.nn.softmax_cross_entropy_with_logits_v2 backpropagates into both logits and labels. To block backpropagation into labels, pass the labels tensor through tf.stop_gradient before handing it to this function (see the sketch below).
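Here is a minimal usage sketch (TF 1.x session style, reusing the labels/logits arrays from the example above) that checks the op against a manual softmax + cross-entropy computation and shows how to stop gradients into labels:
import numpy as np
import tensorflow as tf

# One-hot labels: each row is a valid probability distribution with exactly one class
labels = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
logits = np.array([[11., 8., 7.], [10., 14., 3.], [1., 2., 4.]])

# Manual computation: softmax over the unnormalized logits, then cross entropy per row
softmax = np.exp(logits) / np.sum(np.exp(logits), axis=1, keepdims=True)
manual_loss = -np.sum(labels * np.log(softmax), axis=1)
print("----------manual loss: \n", manual_loss)

with tf.Session() as sess:
    # Pass raw logits; the op applies softmax internally
    api_loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=logits)
    print("----------API loss: \n", sess.run(api_loss))
    # To block gradients from flowing into labels, wrap them with tf.stop_gradient first
    api_loss_ng = tf.nn.softmax_cross_entropy_with_logits_v2(
        labels=tf.stop_gradient(labels), logits=logits)
    print("----------API loss (labels stopped): \n", sess.run(api_loss_ng))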
3. tf.nn.sparse_softmax_cross_entropy_with_logits
tf.nn.sparse_softmax_cross_entropy_with_logits(
_sentinel=None,
labels=None,
logits=None,
name=None
)
What scenarios is this loss function used for?
Measures the probability error in discrete classification tasks in which the classes are mutually exclusive (each entry is in exactly one class). For example, each CIFAR-10 image is labeled with one and only one label: an image can be a dog or a truck, but not both.
NOTE: For this operation, the probability of a given label is considered exclusive. That is, soft classes are not allowed, and the labels vector must provide a single specific index for the true class for each row of logits (each minibatch entry). For soft softmax classification with a probability distribution for each entry, see softmax_cross_entropy_with_logits_v2.
This loss function is essentially the same as tf.nn.softmax_cross_entropy_with_logits_v2; the difference is that for tf.nn.sparse_softmax_cross_entropy_with_logits the label of each example must be exclusive, i.e., the labels vector gives a single index marking the true class (see the sketch below).
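Here is a minimal sketch (TF 1.x session style, values for illustration only) contrasting the sparse variant, which takes integer class indices, with the dense v2 variant, which takes one-hot rows; both should print the same per-example losses:
import numpy as np
import tensorflow as tf

logits = np.array([[11., 8., 7.], [10., 14., 3.], [1., 2., 4.]])
# Sparse labels: one integer class index per example (must be int32/int64)
sparse_labels = np.array([0, 1, 2], dtype=np.int64)
# Equivalent one-hot labels for the dense v2 variant
onehot_labels = np.eye(3)[sparse_labels]

with tf.Session() as sess:
    sparse_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=sparse_labels, logits=logits)
    dense_loss = tf.nn.softmax_cross_entropy_with_logits_v2(
        labels=onehot_labels, logits=logits)
    print("----------sparse loss: \n", sess.run(sparse_loss))
    print("----------dense  loss: \n", sess.run(dense_loss))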
4. tf.nn.weighted_cross_entropy_with_logits
tf.nn.weighted_cross_entropy_with_logits(
targets,
logits,
pos_weight,
name=None
)
What scenarios is this loss function used for?
This is like sigmoid_cross_entropy_with_logits() except that pos_weight, allows one to trade off recall and precision by up- or down-weighting the cost of a positive error relative to a negative error. The usual cross-entropy cost is defined as: labels * -log(sigmoid(logits)) + (1 - labels) * -log(1 - sigmoid(logits)) .
A value pos_weight > 1 decreases the false negative count, hence increasing the recall. Conversely setting pos_weight < 1 decreases the false positive count and increases the precision.
The usual cross-entropy cost is defined as:
targets * -log(sigmoid(logits)) + (1 - targets) * -log(1 - sigmoid(logits))
pos_weight is introduced as a multiplicative coefficient on the positive-target term of the loss expression:
targets * -log(sigmoid(logits)) * pos_weight + (1 - targets) * -log(1 - sigmoid(logits))
In fact this loss function is similar to sigmoid_cross_entropy_with_logits(), so it is likewise used for binary (multi-label) classification. The difference is the extra weight parameter that scales the contribution of the positive-sample loss; clearly this is aimed at the case of imbalanced positive and negative samples.
For brevity, let x = logits, z = labels, q = pos_weight. The loss is:
\begin{aligned}
  &qz * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x)) \\
= &qz * -log(1 / (1 + exp(-x))) + (1 - z) * -log(exp(-x) / (1 + exp(-x))) \\
= &qz * log(1 + exp(-x)) + (1 - z) * (-log(exp(-x)) + log(1 + exp(-x))) \\
= &qz * log(1 + exp(-x)) + (1 - z) * (x + log(1 + exp(-x))) \\
= &(1 - z) * x + (qz + 1 - z) * log(1 + exp(-x)) \\
= &(1 - z) * x + (1 + (q - 1) * z) * log(1 + exp(-x))
\end{aligned}
Setting l = (1 + (q - 1) * z), to ensure stability and avoid overflow, the implementation uses:
(1 - z) * x + l * (log(1 + exp(-abs(x))) + max(-x, 0))
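To wrap up, here is a minimal sketch (TF 1.x session style; the arrays and the pos_weight value are just for illustration, and the TF 1.x keyword targets= is assumed) comparing a manual computation of the weighted loss with the API call:
import numpy as np
import tensorflow as tf

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

labels = np.array([[0., 1., 0.], [1., 1., 0.], [0., 0., 1.]])
logits = np.array([[1., 8., 7.], [10., 14., 3.], [1., 2., 4.]])
pos_weight = 2.0   # a weight > 1 penalizes false negatives more heavily, boosting recall

# Manual computation: only the positive-target term is scaled by pos_weight
y_pred = sigmoid(logits)
manual_loss = -pos_weight * labels * np.log(y_pred) - (1 - labels) * np.log(1 - y_pred)
print("----------manual weighted loss: \n", manual_loss)

with tf.Session() as sess:
    api_loss = tf.nn.weighted_cross_entropy_with_logits(
        targets=labels, logits=logits, pos_weight=pos_weight)
    print("----------API weighted loss: \n", sess.run(api_loss))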
Reference: the API documentation on the official TensorFlow website.