YoloV1

ReLuJie

Last modified: 2023-07-17 17:22:24


First published: 2019-07-16 21:51:13

Article link: https://blog.csdn.net/On_theway10/article/details/96190963


Timeline

Motivation

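Before Yolo-v1 was released, the detection SOTA was held by Fast RCNN [Faster RCNN had only just come out at the time]. Although Fast RCNN's inference speed (0.45fps) was a big improvement over the earlier RCNN (0.02fps), it was still far from real-time detection (>=25fps). Yolo-v1 therefore focuses mainly on detection speed: its accuracy is lower than Fast RCNN's, but its speed (>=45fps) is much higher!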

Idea

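Unlike Fast RCNN, which uses selective search to generate proposals, yolo-v1 predicts a fixed number (B=2) of bboxes inside each grid cell of the feature_map and uses them in place of proposals. These boxes are regressed directly; there are no preset anchor shapes. This greatly simplifies proposal generation and therefore greatly speeds up detection!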

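Concretely, it divides input_img evenly into an SxS (7x7) grid and predicts B (B=2) bboxes per grid cell, but it assumes that each cell predicts only one object [provided that the object's center point falls inside that cell]. For example, in the figure below the yellow cell is responsible for detecting 'person', because the geometric center of the blue bbox lies inside the yellow cell!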
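The responsibility rule above can be sketched in a few lines. This is a minimal illustration, assuming box centers are given in coordinates normalized by the image size and that cells are indexed with floor; the helper name `responsible_cell` is made up for this example.

```python
import math

S = 7  # grid size, as in the text


def responsible_cell(x_center, y_center, S=S):
    """Return (row, col) of the grid cell that contains a normalized box center."""
    # Clamp so that a center lying exactly on the right/bottom edge (1.0)
    # still maps to the last cell instead of falling outside the grid.
    col = min(int(math.floor(x_center * S)), S - 1)
    row = min(int(math.floor(y_center * S)), S - 1)
    return row, col


# A box whose center sits at (0.52, 0.30) in normalized image coordinates
row, col = responsible_cell(0.52, 0.30)
print(row, col)
```

Only the cell returned here is trained to predict that object; all other cells treat it as background.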

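For each grid cell, it predicts B bboxes plus the conditional probabilities of the cell belonging to each of the C classes. Each bbox has 5 parameters x, y, w, h, conf_score:

x, y : coordinates of the bbox center relative to the bounds of the grid cell, so after normalization they lie in [0,1];

w, h : width and height of the bbox [relative to the image size], so after normalization they lie in [0,1];

conf_score : Pr(Object) x IOU(pred, truth) at training time | Pr(Class_i) x IOU(pred, truth) at test time, where Pr(Class_i) x IOU = Pr(Class_i | Object) x Pr(Object) x IOU is the class-specific score.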
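To make the two confidence definitions concrete, here is a small sketch. The `iou` helper and all numeric values are assumptions chosen for illustration; boxes are taken to be (x_center, y_center, w, h) in the same normalized units.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_center, y_center, w, h) boxes."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2

    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


pred = (0.5, 0.5, 0.4, 0.4)   # hypothetical predicted box
truth = (0.5, 0.5, 0.2, 0.4)  # hypothetical ground-truth box

# Training: the confidence target is Pr(Object) x IOU, with Pr(Object) = 1
# for the box responsible for an object and 0 otherwise.
conf_target = 1.0 * iou(pred, truth)

# Test: the network outputs a confidence and Pr(Class_i | Object);
# their product is the class-specific score used to rank detections.
predicted_conf = 0.8          # hypothetical network output
p_class_given_object = 0.9    # hypothetical network output
class_score = p_class_given_object * predicted_conf
print(conf_target, class_score)
```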

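Taking the Pascal-voc dataset as an example, with (S, B) = (7, 2) each grid cell of the image outputs a vector of length (4+1)x2 + 20 = 30, so the total output is a (7,7,30) tensor. The overall input-output mapping is shown in the figure below: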
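The arithmetic above, and one way to slice a cell's 30-vector, can be sketched as follows. The ordering [box 1, box 2, class probabilities] is an assumption that matches the label-generation code later in this post; `split_cell` is a hypothetical helper.

```python
S, B, C = 7, 2, 20
depth = B * 5 + C  # (4 + 1) x 2 + 20 = 30 values per cell


def split_cell(vec, B=B, C=C):
    """Split one cell's output into B boxes of 5 values and C class probabilities."""
    boxes = [vec[b * 5:(b + 1) * 5] for b in range(B)]  # each: x, y, w, h, conf
    class_probs = vec[B * 5:B * 5 + C]
    return boxes, class_probs


# Dummy cell vector 0.0, 1.0, ..., 29.0 so the slices are easy to inspect
cell = [float(i) for i in range(depth)]
boxes, class_probs = split_cell(cell)
print(depth, S * S * depth)       # values per cell, values in the full output
print(boxes[1], len(class_probs))
```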

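One drawback of this strategy: it explicitly limits how close together different objects in input_img can be and still be detected. Among objects whose geometric centers fall in the same grid cell, only the one with the highest confidence is kept and the rest are ignored. For example, in the left image of the figure above there are 9 Santa Clauses in the lower-left corner, but only 5 of them are detected!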

Loss

Yolo's loss function consists of three parts: classification-loss, confidence-loss and localization-loss:
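For reference, the loss as given in the original YOLO paper can be written as follows. Here $\mathbb{1}_{ij}^{obj}$ is 1 when the j-th bbox of cell i is responsible for an object, $\mathbb{1}_{ij}^{noobj}$ is its complement, $\mathbb{1}_{i}^{obj}$ is 1 when an object's center falls in cell i, and the paper sets $\lambda_{coord} = 5$ and $\lambda_{noobj} = 0.5$. The first two lines are the localization-loss, the next two the confidence-loss, and the last the classification-loss.

$$
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
$$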

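Notes:

1. x, y are the coordinates of the bounding box center relative to the top-left corner of the grid cell it belongs to;

2. w, h are the width and height of the bounding box relative to the whole image;

3. All bbox position values are normalized;

4. $C_{i}$ is the predicted localization confidence of the box, i.e. the network's estimate of Pr(Object) x IOU(pred, truth); $\hat{C}_{i}$ is IOU(pred, gt);

5. $p_{i}$ is the model's class probabilities over the C classes for that cell; $\hat{p}_{i}$ is the corresponding ground-truth class label (one-hot);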

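Remark : bboxes come in different scales, and the same offset clearly affects a large-bbox and a small-bbox differently. To reflect this difference, the localization-loss uses [(sqrt(w)-sqrt(w'))^2 + (sqrt(h)-sqrt(h'))^2]: for the same offset, the penalty on a small-bbox is larger!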
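A quick numeric check of the remark above. The widths and the offset are arbitrary values picked for illustration, and `wh_penalty` is a hypothetical helper covering only the width term.

```python
import math


def wh_penalty(w, w_hat):
    """Square-root width term of the localization-loss for one box."""
    return (math.sqrt(w) - math.sqrt(w_hat)) ** 2


offset = 0.05  # same absolute error applied to both boxes

small = wh_penalty(0.10 + offset, 0.10)  # small box, width 0.10
large = wh_penalty(0.80 + offset, 0.80)  # large box, width 0.80
plain = offset ** 2                      # penalty without the square root

print(small, large, plain)
```

Without the square root both boxes would receive the same penalty (`plain`); with it, the small box is penalized several times more than the large one.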

Data processing

# label generation
# (method of a dataset class that defines self.S, self.B and self.C)
import numpy as np

def generate_yolov1_target(self, labels, bboxes):
    """
    Build the YOLOv1 training target from the bboxes of one image.

    Input
    ----------
    labels : list
        [0, 1, 4, 2, ...]
        class index of each bbox
    bboxes : list, normalized by image shape
        [[x_center, y_center, width, height], ...]

    Returns
    -------
    np.ndarray
        shape [self.S, self.S, self.B*5+self.C]
    """
    num_elements = self.B * 5 + self.C
    num_bboxes = len(bboxes)
    if num_bboxes == 0:
        return np.zeros((self.S, self.S, num_elements))
    # np.int / np.float were removed in NumPy 1.24; use the builtin types
    labels = np.array(labels, dtype=int)
    bboxes = np.array(bboxes, dtype=float)
    np_target = np.zeros((self.S, self.S, num_elements))
    np_class = np.zeros((num_bboxes, self.C))
    # one-hot
    for i in range(num_bboxes):
        np_class[i, labels[i]] = 1

    x_center = bboxes[:, 0].reshape(-1, 1)
    y_center = bboxes[:, 1].reshape(-1, 1)
    box_w = bboxes[:, 2].reshape(-1, 1)
    box_h = bboxes[:, 3].reshape(-1, 1)
    
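    # Differs from the paper: the cell index uses ceil(...) - 1 instead of floor,
    # and the offset is measured from the cell center (in image-normalized units)
    # rather than from the cell's top-left corner.
    # Clip so that a center at exactly 0 does not produce index -1.
    x_grid_idx = np.clip(np.ceil(x_center * self.S) - 1, 0, self.S - 1)
    y_grid_idx = np.clip(np.ceil(y_center * self.S) - 1, 0, self.S - 1)

    x_offset = x_center - (x_grid_idx + 0.5) / self.S
    y_offset = y_center - (y_grid_idx + 0.5) / self.S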

    box_conf = np.ones_like(x_center)
    temp = np.concatenate([x_offset, y_offset, box_w, box_h, box_conf], axis=1) # N x 5
    temp = np.repeat(temp, self.B, axis=0).reshape(num_bboxes, -1) # N x (5 * self.B)
    temp = np.concatenate([temp, np_class], axis=1) # N x (5 * self.B + self.C)
    for i in range(num_bboxes):
        np_target[int(y_grid_idx[i]), int(x_grid_idx[i])] = temp[i]
    return np_target

Architecture | pipeline

Architecture

Inference

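After the convolutional network, Yolov1 attaches two FC layers: the first adjusts the dimensionality, and the second maps back to the 7 x 7 x 30 grid output.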
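The shape flow through the two FC layers can be sketched in plain Python. This is only a toy under stated assumptions: the weights are random, and `FEAT_CH` and `HIDDEN` are deliberately tiny stand-ins (the paper uses a 7 x 7 x 1024 feature map and a 4096-unit first FC layer). The leaky ReLU with slope 0.1 and the linear final layer follow the paper; `dense`, `leaky_relu` and `random_layer` are hypothetical helpers.

```python
import random

S, B, C = 7, 2, 20
FEAT_CH = 4   # hypothetical; the paper's final conv map has 1024 channels
HIDDEN = 32   # hypothetical; the paper's first FC layer has 4096 units
DEPTH = B * 5 + C


def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def leaky_relu(x, slope=0.1):
    return [v if v > 0 else slope * v for v in x]


def random_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
features = [rng.random() for _ in range(S * S * FEAT_CH)]  # flattened conv features
w1, b1 = random_layer(len(features), HIDDEN, rng)
w2, b2 = random_layer(HIDDEN, S * S * DEPTH, rng)

hidden = leaky_relu(dense(features, w1, b1))  # first FC: dimensionality adjustment
flat_out = dense(hidden, w2, b2)              # second FC: 7 * 7 * 30 = 1470 values

# Reshape the flat vector into the S x S x 30 grid
grid = [[flat_out[(r * S + c) * DEPTH:(r * S + c + 1) * DEPTH] for c in range(S)]
        for r in range(S)]
print(len(flat_out), len(grid), len(grid[0]), len(grid[0][0]))
```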

Advantage

Fast! [Can be used for real-time detection tasks.]
Single-stage and end-to-end.
Predicts bboxes with a larger field of view [gets fewer false positives in background areas].
Generalizes better [can generalize from natural images to other domains such as artwork].