Focal Loss for Dense Object Detection: Understanding the Paper and the Code


wanghua609


Article link: https://blog.csdn.net/weixin_38145317/article/details/98621095


Paper: https://arxiv.org/pdf/1708.02002.pdf

Reference: https://www.aiuai.cn/aifarm636.html

Terminology:

hard examples: samples that are difficult to classify

easy examples: samples that are easy to classify

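Preface:

Object detection has two classic architectures. The first is the two-stage approach represented by Faster R-CNN: the first stage focuses on extracting proposals, and the second stage classifies each proposal and regresses its precise coordinates. Two-stage detectors are more accurate, but because the second stage has to classify and regress every proposal separately, speed suffers. The second is the one-stage approach represented by YOLO and SSD, which drops the proposal step and performs classification and regression in a single stage. These detectors are faster, but their accuracy falls well short of the two-stage detectors. Is there a way to reach high accuracy with a one-stage detector as well? Focal loss is designed to solve exactly this problem.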

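Why is the accuracy of one-stage detectors low?

The authors attribute the lower accuracy of one-stage detectors to class imbalance (foreground-background class imbalance; in my view this simply means the imbalance between positive and negative samples):

1. There are so many negative examples that their total loss becomes very large and drowns out the loss of the positives, which hinders convergence toward the objective.

2. Most negative examples do not lie in the transition region between foreground and background, so they are classified unambiguously (such easily classified negatives are called easy negatives). During training their background-class score is high; put differently, the loss of each individual example is small, and so is its gradient in the backward pass. Because the gradient is small, easy negative examples contribute little to the convergence of the parameters. What we need more are examples with a large loss that influence convergence more strongly, namely hard positive/negative examples.
Note that the first point says the loss of the negatives is large because there are so many of them in absolute terms, so their total loss is large; the second point says the loss of an easy negative is small, which refers to a single example.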

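OHEM is another example-selection method that has become popular in recent years. It sorts the examples by loss and trains only on those with the largest loss, which ensures that training focuses on hard examples. Its drawback is that it discards all easy examples, so easy positive examples can no longer help to improve accuracy further.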
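The selection step described for OHEM can be sketched in a few lines. This is only an illustration of "sort by loss, keep the largest": the function name `select_hard_examples`, the `keep` parameter, and the loss values are hypothetical and do not come from the OHEM paper or any library.

```python
def select_hard_examples(losses, keep):
    """Return the indices of the `keep` largest losses, hardest first."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return order[:keep]


# Hypothetical per-example losses: two hard examples among three easy ones.
losses = [0.05, 2.30, 0.02, 1.10, 0.01]
hard = select_hard_examples(losses, keep=2)
print(hard)  # [1, 3]
```

Every example outside the returned indices is dropped from the update, which is the drawback noted above: easy positives never contribute.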

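The figure illustrates the four kinds of examples: hard positive, hard negative, easy positive, and easy negative. It shows intuitively that easy negatives make up the vast majority.

Focal loss aims to solve the extreme foreground-background class imbalance that one-stage object detectors encounter during training (for example, foreground : background = 1 : 1000).

We first consider the cross-entropy (CE) loss commonly used for binary classification: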

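$CE(p,y)=-[y\log(p)+(1-y)\log(1-p)]=\begin{cases} -\log(p) & \text{if } y=1 \\ -\log(1-p) & \text{otherwise} \end{cases}$ (1)

Here y is the ground-truth label of the training sample and takes the value 0 or 1 (for example, in a binary task that decides whether a photo shows a person, 1 means a person and 0 means not a person). p is the network's prediction for this sample, a probability in [0, 1] that the label is 1.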

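The sample imbalance problem

1. Imbalance between easy and hard samples (easy-hard example imbalance)

An easy sample is one for which the network's predicted value for the correct class is p > 0.5, meaning the network already predicts it well. Suppose such samples are very numerous, say 9000 samples with a predicted value of 0.75, alongside 100 samples that are hard to classify (predicted values between 0.4 and 0.5), say with a predicted value of 0.45. Using base-10 logarithms, the loss of these 9100 samples is:

$-9000\log(0.75)-100\log(0.45)\approx 9000\times0.124+100\times0.346=1116+34.6$

The easy and hard samples contribute to the loss in a ratio of 1116 : 34.6; in other words, the hard samples are drowned out by the large number of easy samples.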

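2. Imbalance between positive and negative samples (positive-negative example imbalance)

A detector typically predicts tens of thousands of boxes per image, that is, tens of thousands of examples, yet an image contains few positive examples; most boxes are background (negative samples). Negative samples also incur a loss. Suppose there are 9000 negative samples with a predicted value of 0.25 and 100 positive samples with a predicted value of 0.75. Using base-10 logarithms, the loss of these 9100 samples is:

$-9000\log(1-0.25)-100\log(0.75) \\ \approx 9000\times0.124+100\times0.124\\ =1116+12.4$

The negative and positive samples contribute to the loss in a ratio of 1116 : 12.4; in other words, the positive samples are drowned out by the large number of negative samples.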
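The two hand calculations above can be reproduced with a short script. This is a minimal sketch: `total_loss` is a hypothetical helper name, and because it uses unrounded base-10 logarithms the totals come out slightly higher than the hand-rounded 1116, while the ratios tell the same story.

```python
import math


def total_loss(count, p_t):
    """Summed cross-entropy (base-10 log) of `count` samples whose
    correct-class probability is p_t."""
    return -count * math.log10(p_t)


# Case 1: 9000 easy samples (p_t = 0.75) vs. 100 hard samples (p_t = 0.45).
easy, hard = total_loss(9000, 0.75), total_loss(100, 0.45)

# Case 2: 9000 negatives (p = 0.25, so p_t = 1 - p = 0.75)
# vs. 100 positives (p_t = 0.75).
neg, pos = total_loss(9000, 1 - 0.25), total_loss(100, 0.75)

print(f"easy : hard = {easy:.1f} : {hard:.1f}")
print(f"neg  : pos  = {neg:.1f} : {pos:.1f}")
```

In the second case every sample has the same individual loss, so the 90 : 1 ratio comes purely from the sample counts.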

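Note: many other references write the cross-entropy loss as

$CE(p,y)=-[y\log p+(1-y)\log(1-p)]$ (1--1)

Since the ground-truth label in binary classification can only be 0 or 1, forms (1) and (1--1) are equivalent.

If we define

$p_{t}=\begin{cases} p & \text{if } y=1 \\ 1-p & \text{otherwise} \end{cases}$ (2)

then (1) and (1--1) can be written as

$CE(p,y)=CE(p_{t})=-\log(p_{t})$ (3)

This is the standard cross-entropy loss.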
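As a quick check that (1--1) and (3) agree, the sketch below evaluates both forms on a few label/prediction pairs. The function names are hypothetical, and the natural logarithm is used, as is usual in implementations.

```python
import math


def p_t(p, y):
    """Eq. (2): the probability the model assigns to the true class."""
    return p if y == 1 else 1.0 - p


def ce_two_term(p, y):
    """Eq. (1--1): two-term binary cross-entropy."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def ce_pt(p, y):
    """Eq. (3): the same loss written in terms of p_t."""
    return -math.log(p_t(p, y))


for p, y in [(0.9, 1), (0.9, 0), (0.3, 1)]:
    print(p, y, ce_two_term(p, y), ce_pt(p, y))
```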

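Contribution 1: Balanced Cross Entropy, addressing the imbalance between sample classes

A common way to address the positive-negative imbalance is to introduce a weighting factor $\alpha \in[0,1]$ that reduces the weight of the negative samples. (1--1) can then be rewritten as

$CE(p,y)=-[\alpha y\log p+(1-\alpha)(1-y)\log(1-p)]$ (4)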
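A minimal sketch of (4) makes the role of the two weights concrete. The function name `balanced_ce` is hypothetical, and `alpha=0.9` below is an illustrative value chosen only to show positives being up-weighted; it is not a recommendation from the paper.

```python
import math


def balanced_ce(p, y, alpha=0.25):
    """Eq. (4): alpha weights the y = 1 term, (1 - alpha) weights the y = 0 term."""
    return -(alpha * y * math.log(p) + (1 - alpha) * (1 - y) * math.log(1 - p))


# A positive and a negative that are equally well classified (p_t = 0.75 for both).
pos = balanced_ce(0.75, 1, alpha=0.9)  # weighted by alpha
neg = balanced_ce(0.25, 0, alpha=0.9)  # weighted by 1 - alpha
print(pos / neg)
```

With equal p_t the ratio of the two losses is exactly alpha / (1 - alpha), so the factor directly controls how much a positive counts relative to a negative.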

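So how does the weighting factor in (4) change the contribution of the negative samples to the loss?

Example 1: suppose the ground-truth annotations y_true and the predictions y_pred are

$y_{true}=\begin{bmatrix} 0.3 &0 \\ 0.45&-1 \\ 0.7& 1\\ 1&1 \end{bmatrix},~~~ y_{pred}=\begin{bmatrix} 0.4 \\ 0.6 \\ 0.8\\ 0.9 \end{bmatrix}$

Here y_true holds the anchors generated from the manually annotated image. The first column is the IoU between the anchor and a ground-truth box, and the second column is the anchor state. If IoU < 0.4, the anchor is treated as a negative sample and marked 0.

If 0.4 < IoU < 0.5, the anchor is a hard (ambiguous) sample and marked -1; it is ignored when the loss is computed.

If IoU > 0.5, the anchor is a positive sample and marked 1.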
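The three rules above amount to a small mapping from IoU to anchor state. The sketch below is an assumption-laden illustration: the function and parameter names are hypothetical, and the handling of the exact boundary values (IoU equal to 0.4 or 0.5) is a choice made here, since the text only gives strict inequalities.

```python
def anchor_state(iou, neg_thresh=0.4, pos_thresh=0.5):
    """Map an IoU to the state stored in the last column of y_true:
    1 = object, 0 = background, -1 = ignore."""
    if iou >= pos_thresh:  # boundary handling is an assumption
        return 1
    if iou < neg_thresh:
        return 0
    return -1


# The four anchors of Example 1.
print([anchor_state(iou) for iou in (0.3, 0.45, 0.7, 1.0)])  # [0, -1, 1, 1]
```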

Step 1: discard the ignored anchors (state -1) together with their predictions. y_true and y_pred become

$y_{true}=\begin{bmatrix} 0.3 &0 \\ 0.7& 1\\ 1&1 \end{bmatrix},~~~ y_{pred}=\begin{bmatrix} 0.4 \\ 0.8\\ 0.9 \end{bmatrix}$

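Step 2: build the weight vector alpha_factor. Following (4), entries whose label equals 1 (here the anchor with IoU = 1) receive $\alpha$ and all other entries receive $1-\alpha$, which is what the code further down prints for $\alpha$ = 0.25 (0.75, 0.75, 0.25):

$\alpha_{factor} =\begin{bmatrix} 1-\alpha \\ 1-\alpha\\ \alpha \end{bmatrix}$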

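Suppose the loss computed from y_true and y_pred with the plain cross-entropy is

$loss=\frac{loss_1+loss_2+loss_3}{positive_{num}}$

where

$\begin{bmatrix} loss_1 \\ loss_2\\ loss_3 \end{bmatrix}=\begin{bmatrix} -0.3\log0.4-(1-0.3)\log(1-0.4)\\-0.7\log0.8-(1-0.7)\log(1-0.8)\\ -1\log0.9-(1-1)\log(1-0.9)\end{bmatrix}$

positive_num is the number of positive samples in this batch, here 2 (count the anchors in y_true whose second column equals 1). Dividing by the number of positive samples, rather than by the number of all samples, keeps the large number of negative samples from dominating the loss. See my other blog post on this point:

https://mp.csdn.net/postedit/100536017

Step 3: the three samples (the negative, the positive, and the special positive with IoU = 1) contribute to the loss in the ratio loss1 : loss2 : loss3.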

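Step 4: now look at the loss after adding the weight alpha. Following (4) and the alpha_factor printed by the code (0.75, 0.75, 0.25 for $\alpha$ = 0.25), the label-1 entry is weighted by $\alpha$ and the others by $1-\alpha$:

$loss_\alpha=\frac{\sum \alpha_{factor}\odot \begin{bmatrix}loss_1 \\ loss_2\\ loss_3\end{bmatrix}}{positive_{num}} \\=\frac{\sum \begin{bmatrix} 1-\alpha \\ 1-\alpha\\ \alpha \end{bmatrix}\odot \begin{bmatrix}loss_1 \\ loss_2\\ loss_3\end{bmatrix}}{positive_{num}}\\=\frac{(1-\alpha) loss_1+(1-\alpha) loss_2+\alpha\, loss_3}{positive_{num}}$ (5)

Step 5: with the balancing factor alpha, the three samples (the negative, the positive, and the special positive) contribute to the loss in the ratio

$loss_1:loss_2:\frac{\alpha}{1-\alpha}loss_3$

With the paper's choice of alpha = 0.25,

$\frac{\alpha}{1-\alpha}=\frac{1}{3}<1$

so relative to the original weights 1 : 1 : 1, the entries whose label is not 1 now count three times as much as the label-1 entry. The paper reports that alpha = 0.25 works best together with gamma = 2, because the modulating factor introduced below already suppresses the many easy negatives, so less emphasis needs to be placed on the positives.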

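Note: if an anchor belongs to the background, that is, its label is 0, does it still take part in the loss computation?

Answer: yes. Such an anchor is a negative sample and takes part in the loss computation.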

import numpy as np
import tensorflow as tf
import keras
import keras.backend as k

alpha = 0.25
# One image, four anchors; each row is [IoU with the ground-truth box, anchor state].
y_true = np.array([[[0.3, 0], [0.45, -1], [0.7, 1], [1, 1]]])
print(y_true, y_true.shape)
labels = y_true[:, :, :-1]  # the annotation data without the last column, which holds the anchor state -1, 0, 1
anchor_state = y_true[:, :, -1]  # the last column on its own: -1 for ignore, 0 for background, 1 for object
print('labels', labels)
print('anchor_state', anchor_state)
# The state is 1 if the anchor's IoU with some object is > 0.5, -1 if 0.4 < IoU < 0.5, and 0 if IoU < 0.4.
# Predictions, shape [batch_size, num_anchors, 2]: column 0 is the predicted confidence, column 1 the state.
classification = np.array([[[0.4, 0], [0.6, -1], [0.8, 1], [0.99, 1]]])

# Indices of every anchor whose state is not -1, i.e. foreground (IoU > 0.5) and background (IoU < 0.4).
indices = tf.where(keras.backend.not_equal(anchor_state, -1))

labels = tf.gather_nd(labels, indices)  # update labels: the ignored anchors (state -1) are dropped
# [[0.3]
#  [0.7]
#  [1. ]]
print('labels', k.eval(labels))
classification = tf.gather_nd(classification, indices)  # drop the predictions of the ignored anchors as well
# [[0.4  0.  ]
#  [0.8  1.  ]
#  [0.99 1.  ]]
print('classification', k.eval(classification))

# compute the focal loss
alpha_factor = keras.backend.ones_like(labels) * alpha  # same shape as labels, every element = alpha = 0.25

# Entries with label == 1 (here IoU == 1) keep alpha = 0.25; all others get 1 - alpha = 0.75.
alpha_factor = tf.where(keras.backend.equal(labels, 1), alpha_factor, 1 - alpha_factor)
print('alpha_factor ', k.eval(alpha_factor))
# [[0.75]
#  [0.75]
#  [0.25]]

Contribution 2: a modulating factor with focusing parameter $\gamma$ (sometimes written r), addressing easy examples versus hard examples

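How can the loss function distinguish between easy examples and hard examples?

The approach is to assign different weights to different samples.

If we can design a loss that is large when the network meets a hard example and small when it meets an easy example, then during backpropagation the network can concentrate on optimizing the hard examples. The focal loss proposed by the RetinaNet authors is

$FL(p_t)=-(1-p_t)^\gamma\log(p_t)$

where $\gamma \geq 0$ (sometimes written r) is the focusing parameter.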

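How should we read this loss? Assume the ground-truth label is y = 1 and the exponent in the formula above is $\gamma$ = 2 (logarithms are base 10), and consider three cases.

(1) The network predicts y_pre = 0.9. This sample is clearly easy for the network, so its contribution to the loss is

$FL(p_t)=-(1-p_t)^\gamma\log(p_t)=-(1-0.9)^2\log(0.9)\approx0.01\times0.046=0.00046$

That is, compared with the loss without the weight $(1-p_t)^\gamma$, this sample's contribution is reduced by a factor of 100.

(2) The network predicts y_pre = 0.51, just above 0.5. This sample is only barely classified correctly, and a little noise could easily flip it. Its contribution to the loss is

$FL(p_t)=-(1-p_t)^\gamma\log(p_t)=-(1-0.51)^2\log(0.51)\approx0.2401\times0.29\approx0.07$

The weight 0.2401 is much larger than the previous 0.01, so as training proceeds the gradient updates are influenced more by such samples, which pushes their score toward 1.

(3) The prediction is 0.1. This is clearly wrong: the true label is 1, yet the network assigns probability 0.1 to class 1 and 0.9 to class 0. For the model this is a hard example, one it easily misjudges. The weight is now (1-0.1)^2 = 0.81, higher than in both previous cases, so the model should pay the most attention to this sample during gradient updates.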
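The three cases can be tabulated with a few lines of code. This is a minimal sketch with a hypothetical function name; it uses the natural logarithm, as implementations do, so the absolute loss values differ from the base-10 hand calculation by a constant factor while the weights (1 - p_t)^gamma are identical.

```python
import math


def focal_loss(p_t, gamma=2.0):
    """FL(p_t) = -(1 - p_t)^gamma * log(p_t), natural logarithm."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


for p in (0.9, 0.51, 0.1):
    weight = (1.0 - p) ** 2   # modulating factor for gamma = 2
    ce = -math.log(p)         # plain cross-entropy for comparison
    print(f"p_t={p:<4} weight={weight:.4f} CE={ce:.4f} FL={focal_loss(p):.4f}")
```

Setting `gamma=0` turns the modulating factor into 1 and recovers the plain cross-entropy, which is a convenient sanity check.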

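TensorFlow implementation

import tensorflow as tf

def compute_focal_loss(logits, labels, alpha=tf.constant([[0.5], [0.5]]), class_num=2, gamma=2):
    '''
    :param logits: raw class scores, shape [N, class_num]
    :param labels: integer class ids, flattened to shape [N]
    :param alpha: per-class weights, shape [class_num, 1]
    :return: scalar focal loss, averaged over the N samples
    '''
    labels = tf.reshape(labels, [-1])
    labels = tf.cast(labels, tf.int32)
    labels = tf.one_hot(labels, class_num, on_value=1.0, off_value=0.0)
    pred = tf.nn.softmax(logits)

    # (1 - p)^gamma * -log(p) for every class; tf.math.log replaces the TF1-only tf.log
    temp_loss = -1 * tf.pow(1 - pred, gamma) * tf.math.log(pred)

    # the one-hot mask keeps only the true-class term; matmul applies the class weight alpha
    focal_loss = tf.reduce_mean(tf.matmul(temp_loss * labels, alpha))

    return focal_loss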
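To make the arithmetic of the TensorFlow version easy to follow, here is a dependency-free sketch of the same computation: softmax, pick the true-class probability, apply the class weight and the modulating factor, then average over samples. The names `softmax` and `focal_loss_reference` and the example logits are hypothetical and exist only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax of one row of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss_reference(logits, labels, alpha=(0.5, 0.5), gamma=2.0):
    """Mean over samples of alpha[c] * (1 - p_c)^gamma * -log(p_c),
    where c is the true class of the sample."""
    total = 0.0
    for row, c in zip(logits, labels):
        p = softmax(row)[c]
        total += alpha[c] * (1.0 - p) ** gamma * -math.log(p)
    return total / len(labels)


# Hypothetical logits: the same scores, once with the matching label
# (confidently right) and once with the other label (confidently wrong).
logits = [[4.0, 0.0], [4.0, 0.0]]
labels = [0, 1]
print(focal_loss_reference(logits, labels))
```

Almost all of the printed loss comes from the second, misclassified sample, which is the behaviour focal loss is designed to produce.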

Keras implementation

import keras
from keras_retinanet import backend  # assumption: keras-retinanet's backend module, which provides where() and gather_nd()

def focal(alpha=0.25, gamma=2.0):  # assumed enclosing factory that supplies alpha and gamma
    def _focal(y_true, y_pred):
        """ Compute the focal loss given the target tensor and the predicted tensor.

        As defined in https://arxiv.org/abs/1708.02002

        Args
            y_true: Tensor of target data from the generator with shape (B, N, num_classes).
            y_pred: Tensor of predicted data from the network with shape (B, N, num_classes).

        Returns
            The focal loss of y_pred w.r.t. y_true.
        """
        labels         = y_true[:, :, :-1]  # the annotation data without the last column, which holds the anchor state -1, 0, 1
        anchor_state   = y_true[:, :, -1]  # the last column on its own: -1 for ignore, 0 for background, 1 for object
        # The state is 1 if the anchor's IoU with some object is > 0.5, -1 if 0.4 < IoU < 0.5, and 0 if IoU < 0.4.
        classification = y_pred  # predictions, one row per anchor

        # Indices of every anchor whose state is not -1, i.e. foreground and background anchors.
        indices        = backend.where(keras.backend.not_equal(anchor_state, -1))
        labels         = backend.gather_nd(labels, indices)  # update labels: the ignored anchors are dropped
        classification = backend.gather_nd(classification, indices)  # drop the predictions of the ignored anchors as well

        # compute the focal loss
        alpha_factor = keras.backend.ones_like(labels) * alpha  # same shape as labels, every element = alpha = 0.25
        # Entries with label == 1 keep alpha = 0.25; all others get 1 - alpha = 0.75.
        alpha_factor = backend.where(keras.backend.equal(labels, 1), alpha_factor, 1 - alpha_factor)
        # Open question from the toy example: if labels held the IoU of each anchor, most values
        # would be < 1 and nearly every entry would receive 1 - alpha.
        # (To my understanding, keras-retinanet passes one-hot class targets here, so labels are 0 or 1.)
        focal_weight = backend.where(keras.backend.equal(labels, 1), 1 - classification, classification)
        # focal_weight is 1 - y_pred where the label is 1 and y_pred everywhere else
        focal_weight = alpha_factor * focal_weight ** gamma

        cls_loss = focal_weight * keras.backend.binary_crossentropy(target=labels, output=classification)

        # compute the normalizer: the number of positive anchors
        normalizer = backend.where(keras.backend.equal(anchor_state, 1))  # positive anchors only, negatives are not counted
        normalizer = keras.backend.cast(keras.backend.shape(normalizer)[0], keras.backend.floatx())
        # Negatives vastly outnumber positives, so the sum is divided by the number of positives only (at least 1).
        # Dividing by all anchors would make the loss very small and could make training fail.
        normalizer = keras.backend.maximum(1.0, normalizer)

        return keras.backend.sum(cls_loss) / normalizer

    return _focal

The details of this Keras implementation are:

Steps 1 and 2 are the same as in Example 1, with $\alpha_{factor}=\begin{bmatrix} 1-\alpha & 1-\alpha & \alpha \end{bmatrix}^T$ as printed by the code (0.75, 0.75, 0.25 for $\alpha$ = 0.25).

Step 3: build focal_weight. At the entry whose label is 1 (the special positive) the modulating factor is $(1-p)^\gamma$; everywhere else it is $p^\gamma$, that is

$focal_{weight} =\begin{bmatrix} y_{pred}^\gamma \\ y_{pred}^\gamma \\ (1- y_{pred} )^\gamma \end{bmatrix}=\begin{bmatrix} 0.4^\gamma \\0.8^\gamma \\ (1- 0.9 )^\gamma \end{bmatrix}$

Step 4: build the loss with the combined weight

$composed_{loss}=\frac{\sum \alpha_{factor}\odot focal_{weight}\odot \begin{bmatrix}loss_1 \\ loss_2\\ loss_3\end{bmatrix}}{positive_{num}}\\=\frac{(1-\alpha)\, y1_{pred}^\gamma\, loss_1+(1-\alpha)\, y2_{pred}^\gamma\, loss_2+\alpha\,(1-y3_{pred})^\gamma\, loss_3}{positive_{num}}$

Code walkthrough (note that most of the products are element-wise)

import numpy as np
import tensorflow as tf
import keras
import keras.backend as k

alpha = 0.25
gamma = 2.
# One image, four anchors; each row is [IoU with the ground-truth box, anchor state].
y_true = np.array([[[0.3, 0], [0.45, -1], [0.7, 1], [1, 1]]])
print(y_true, y_true.shape)
labels = y_true[:, :, :-1]  # the annotation data without the last column, which holds the anchor state -1, 0, 1
anchor_state = y_true[:, :, -1]  # the last column on its own: -1 for ignore, 0 for background, 1 for object
print('labels', labels)
print('anchor_state', anchor_state)
# The state is 1 if the anchor's IoU with some object is > 0.5, -1 if 0.4 < IoU < 0.5, and 0 if IoU < 0.4.
# Predictions, shape [batch_size, num_anchors, 1]: one predicted confidence per anchor.
classification = np.array([[[0.4], [0.6], [0.8], [0.99]]])
# Indices of every anchor whose state is not -1, i.e. foreground (IoU > 0.5) and background (IoU < 0.4).
indices = tf.where(keras.backend.not_equal(anchor_state, -1))

labels = tf.gather_nd(labels, indices)  # update labels: the ignored anchors (state -1) are dropped
# [[0.3]
#  [0.7]
#  [1. ]]
print('labels', k.eval(labels))
classification = tf.gather_nd(classification, indices)  # drop the predictions of the ignored anchors as well
# [[0.4 ]
#  [0.8 ]
#  [0.99]]
print('classification', k.eval(classification))

# compute the focal loss
alpha_factor = keras.backend.ones_like(labels) * alpha  # same shape as labels, every element = alpha = 0.25

# Entries with label == 1 (here IoU == 1) keep alpha = 0.25; all others get 1 - alpha = 0.75.
alpha_factor = tf.where(keras.backend.equal(labels, 1), alpha_factor, 1 - alpha_factor)
print('alpha_factor ', k.eval(alpha_factor))
# [[0.75]
#  [0.75]
#  [0.25]]
# focal_weight is 1 - y_pred where the label is 1 and y_pred everywhere else
focal_weight = tf.where(keras.backend.equal(labels, 1), 1 - classification, classification)
print('focal_weight ', k.eval(focal_weight))
# [[0.4 ]
#  [0.8 ]
#  [0.01]]
# print('focal_weight square ', k.eval(focal_weight ** gamma))

focal_weight = alpha_factor * focal_weight ** gamma
print('focal_weight2 ', focal_weight, k.eval(focal_weight))
# Element-wise -[t*ln(p) + (1-t)*ln(1-p)] with t = labels and p = classification
cross_loss = keras.backend.binary_crossentropy(target=labels, output=classification)
print('cross_loss ', k.eval(cross_loss))
# [[0.63246516]
#  [0.63903186]
#  [0.01005034]]

cls_loss = focal_weight * cross_loss
print('cls_loss  ', k.eval(cls_loss))
# [[7.58958187e-02]
#  [3.06735293e-01]
#  [2.51258396e-07]]

# compute the normalizer: the number of positive anchors
normalizer = tf.where(keras.backend.equal(anchor_state, 1))  # find all positive anchors
print('normalizer  ', k.eval(normalizer))
# [[0 2]
#  [0 3]]
normalizer = keras.backend.cast(keras.backend.shape(normalizer)[0], keras.backend.floatx())  # number of positives, = 2
print('normalizer2  ', k.eval(normalizer))
normalizer = keras.backend.maximum(1.0, normalizer)  # 2; guard: if there is no positive sample at all, use 1
print('normalizer3  ', k.eval(normalizer))
loss = keras.backend.sum(cls_loss) / normalizer  # summed loss divided by the number of positive anchors
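The printed values can be cross-checked without TensorFlow or Keras. The sketch below repeats the walkthrough in plain Python for the three anchors that remain after the ignored one is dropped; the tuple layout and the helper name `bce` are choices made here for illustration only.

```python
import math

alpha, gamma = 0.25, 2.0
# Anchors left after dropping the ignored one: (label, prediction, anchor state).
anchors = [(0.3, 0.4, 0), (0.7, 0.8, 1), (1.0, 0.99, 1)]


def bce(target, p):
    """Element-wise binary cross-entropy with a (possibly soft) target."""
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


cls_loss = []
for label, p, _ in anchors:
    a = alpha if label == 1 else 1.0 - alpha   # alpha_factor
    w = (1.0 - p) if label == 1 else p         # focal_weight before the power
    cls_loss.append(a * w ** gamma * bce(label, p))

# Divide by the number of positive anchors (state == 1), with a floor of 1.
normalizer = max(1.0, sum(1 for _, _, state in anchors if state == 1))
loss = sum(cls_loss) / normalizer
print(cls_loss, loss)
```

The three entries of `cls_loss` agree with the values printed by the Keras code (about 7.59e-02, 3.07e-01 and 2.51e-07), and the final loss is their sum divided by 2.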

Detail: how Keras's binary_crossentropy relates to the TensorFlow op

from tensorflow.python.ops import math_ops
from tensorflow.python.ops import clip_ops
from tensorflow.python.framework import ops
import tensorflow as tf
import keras.backend as K

y_target = K.constant(value=[0.3])  # target value
output = K.constant(value=[0.4])  # network prediction (a probability)
epsilon_ = ops.convert_to_tensor(K.epsilon(), output.dtype.base_dtype)  # 1e-07
# Clamp the prediction to [epsilon, 1 - epsilon] so that exactly 0 or 1 cannot produce log(0); 0.4 is left unchanged.
output = clip_ops.clip_by_value(output, epsilon_, 1 - epsilon_)
# Convert the probability back to a logit, so the prediction is transformed twice: ln(0.4 / 0.6) ≈ -0.405
output = math_ops.log(output / (1 - output))
print('output ', K.eval(output))
loss_tf = tf.nn.sigmoid_cross_entropy_with_logits(labels=y_target, logits=output)
print("the loss is: ", K.eval(loss_tf))  # ≈ 0.6325, the first cross_loss entry printed earlier
# The same value by hand: x - x*z + log(1 + exp(-x)), with x the logit and z the target
loss2 = output - output * y_target + math_ops.log(1 + 1 / tf.exp(output))
print('loss2', K.eval(loss2))
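The same round trip (probability, clip, logit, sigmoid cross-entropy) can be written in plain Python to see that it gives the ordinary cross-entropy. This is a sketch under stated assumptions: the function names are hypothetical, and the numerically stable form `max(x, 0) - x*z + log(1 + exp(-|x|))` is the one given in the TensorFlow documentation for `tf.nn.sigmoid_cross_entropy_with_logits`.

```python
import math


def bce_from_probability(p, z, eps=1e-7):
    """Clip p, convert it to a logit, then apply the stable sigmoid cross-entropy."""
    p = min(max(p, eps), 1.0 - eps)
    x = math.log(p / (1.0 - p))  # logit of the clipped probability
    return max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))


def bce_direct(p, z):
    """Textbook form: -[z*log(p) + (1 - z)*log(1 - p)]."""
    return -(z * math.log(p) + (1.0 - z) * math.log(1.0 - p))


print(bce_from_probability(0.4, 0.3), bce_direct(0.4, 0.3))
```

The clipping only matters at the extremes: `bce_direct(1.0, 1.0)` would evaluate `log(0)` in its second term, whereas the clipped version stays finite.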
