YOLOv5: Improving Small-Object Detection on the VisDrone2019 Dataset

This article introduces the VisDrone2019 dataset, a large annotated benchmark for drone-based image analysis with high-resolution images and densely packed object instances. It then shows how YOLOv5 can borrow ideas from the EfficientDet architecture, in particular its BiFPN structure, and how adding an extra small-object detection layer improves performance on small targets.


VisDrone2019 is a drone aerial-photography dataset with image resolutions up to 2000x1500. The training set contains 6,471 images with 343,205 labels, an average of about 53 object instances per image, so the object density is very high and most objects are very small (under 32 pixels). The validation and test sets contain 548 and 1,610 images respectively. The dataset covers 10 classes: pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motor. All models in this article are trained on the training set, validated on the validation set, and finally evaluated on the test set.
The VisDrone Dataset is a large-scale benchmark created by the AISKYEYE team at the Lab of Machine Learning and Data Mining, Tianjin University, China. It contains carefully annotated ground truth data for various computer vision tasks related to drone-based image and video analysis.

VisDrone is composed of 288 video clips with 261,908 frames and 10,209 static images, captured by various drone-mounted cameras. The dataset covers a wide range of aspects, including location (14 different cities across China), environment (urban and rural), objects (pedestrians, vehicles, bicycles, etc.), and density (sparse and crowded scenes). The dataset was collected using various drone platforms under different scenarios and weather and lighting conditions. These frames are manually annotated with over 2.6 million bounding boxes of targets such as pedestrians, cars, bicycles, and tricycles. Attributes like scene visibility, object class, and occlusion are also provided for better data utilization.
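The object-density and object-size figures quoted above can be spot-checked from the converted YOLO labels. The sketch below is my own addition rather than part of the original post; it assumes the VisDrone2019-DET-train split has already been downloaded and converted with the script shown further down, and it simply counts boxes per image and the share of boxes smaller than 32x32 pixels.

from pathlib import Path
from PIL import Image

root = Path('../datasets/VisDrone/VisDrone2019-DET-train')  # path used in VisDrone.yaml below
n_img = n_box = n_small = 0
for lbl in (root / 'labels').glob('*.txt'):
    img_w, img_h = Image.open((root / 'images' / lbl.name).with_suffix('.jpg')).size
    n_img += 1
    for line in lbl.read_text().splitlines():
        if not line.strip():
            continue
        _, _, _, w, h = line.split()  # class, x_center, y_center, width, height (all normalized)
        n_box += 1
        if float(w) * img_w < 32 and float(h) * img_h < 32:  # COCO-style "small object" threshold
            n_small += 1

print(f'{n_box} boxes in {n_img} images ({n_box / n_img:.1f} per image), '
      f'{100 * n_small / n_box:.1f}% smaller than 32x32 px')

The official Ultralytics dataset config (VisDrone.yaml), including the download and label-conversion script, is reproduced below.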

# Ultralytics YOLO 🚀, AGPL-3.0 license
# VisDrone2019-DET dataset https://github.com/VisDrone/VisDrone-Dataset by Tianjin University
# Example usage: yolo train data=VisDrone.yaml
# parent
# ├── ultralytics
# └── datasets
#     └── VisDrone  ← downloads here (2.3 GB)


# Train/val/test sets as 1) dir: path/to/imgs, 2) file: path/to/imgs.txt, or 3) list: [path/to/imgs1, path/to/imgs2, ..]
path: ../datasets/VisDrone  # dataset root dir
train: VisDrone2019-DET-train/images  # train images (relative to 'path')  6471 images
val: VisDrone2019-DET-val/images  # val images (relative to 'path')  548 images
test: VisDrone2019-DET-test-dev/images  # test images (optional)  1610 images

# Classes
names:
  0: pedestrian
  1: people
  2: bicycle
  3: car
  4: van
  5: truck
  6: tricycle
  7: awning-tricycle
  8: bus
  9: motor


# Download script/URL (optional) ---------------------------------------------------------------------------------------
download: |
  import os
  from pathlib import Path

  from ultralytics.utils.downloads import download

  def visdrone2yolo(dir):
      from PIL import Image
      from tqdm import tqdm

      def convert_box(size, box):
          # Convert VisDrone box to YOLO xywh box
          dw = 1. / size[0]
          dh = 1. / size[1]
          return (box[0] + box[2] / 2) * dw, (box[1] + box[3] / 2) * dh, box[2] * dw, box[3] * dh

      (dir / 'labels').mkdir(parents=True, exist_ok=True)  # make labels directory
      pbar = tqdm((dir / 'annotations').glob('*.txt'), desc=f'Converting {dir}')
      for f in pbar:
          img_size = Image.open((dir / 'images' / f.name).with_suffix('.jpg')).size
          lines = []
          with open(f, 'r') as file:  # read annotation.txt
              for row in [x.split(',') for x in file.read().strip().splitlines()]:
                  if row[4] == '0':  # VisDrone 'ignored regions' class 0
                      continue
                  cls = int(row[5]) - 1
                  box = convert_box(img_size, tuple(map(int, row[:4])))
                  lines.append(f"{cls} {' '.join(f'{x:.6f}' for x in box)}\n")
                  with open(str(f).replace(f'{os.sep}annotations{os.sep}', f'{os.sep}labels{os.sep}'), 'w') as fl:
                      fl.writelines(lines)  # write label.txt


  # Download
  dir = Path(yaml['path'])  # dataset root dir
  urls = ['https://github.com/ultralytics/yolov5/releases/download/v1.0/VisDrone2019-DET-train.zip',
          'https://github.com/ultralytics/yolov5/releases/download/v1.0/VisDrone2019-DET-val.zip',
          'https://github.com/ultralytics/yolov5/releases/download/v1.0/VisDrone2019-DET-test-dev.zip',
          'https://github.com/ultralytics/yolov5/releases/download/v1.0/VisDrone2019-DET-test-challenge.zip']
  download(urls, dir=dir, curl=True, threads=4)

  # Convert
  for d in 'VisDrone2019-DET-train', 'VisDrone2019-DET-val', 'VisDrone2019-DET-test-dev':
      visdrone2yolo(dir / d)  # convert VisDrone annotations to YOLO labels
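
For readers unfamiliar with the raw VisDrone annotation format: each line of an annotations/*.txt file is <bbox_left>,<bbox_top>,<bbox_width>,<bbox_height>,<score>,<object_category>,<truncation>,<occlusion>. The score field is 0 for ignored regions, which is why the script skips rows with row[4] == '0', and the 1-based object_category is shifted to the 0-based YOLO class with int(row[5]) - 1. convert_box() turns a top-left corner plus width/height in pixels into the normalized center-x, center-y, width, height that YOLO expects. A quick check with made-up numbers (my own example, not part of the yaml):

# Illustrative check of convert_box with made-up numbers
def convert_box(size, box):  # size = (img_w, img_h), box = (left, top, w, h) in pixels
    dw, dh = 1. / size[0], 1. / size[1]
    return (box[0] + box[2] / 2) * dw, (box[1] + box[3] / 2) * dh, box[2] * dw, box[3] * dh

# A 50x40 px box whose top-left corner is at (100, 200) in a 1920x1080 image:
print(convert_box((1920, 1080), (100, 200, 50, 40)))
# -> (0.0651..., 0.2037..., 0.0260..., 0.0370...)  normalized center-x, center-y, width, height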

YOLOv5


Method 1: Replacing the FPN+PAN neck with a BiFPN structure

Paper: "EfficientDet: Scalable and Efficient Object Detection"

EfficientDet is an architecture proposed by Mingxing Tan, Ruoming Pang, and Quoc V. Le at Google Brain. It combines EfficientNet (from the same team) with the newly proposed BiFPN and achieved new state-of-the-art (SOTA) detection results.

Contributions:
  • A new feature-fusion method: the repeated, weighted bi-directional feature pyramid network (BiFPN).
  • A compound scaling method (as in EfficientNet) that jointly scales resolution, depth, width, the feature-fusion network, and the box/class prediction networks.

BiFPN stands for Bidirectional Feature Pyramid Network, a weighted bi-directional (top-down plus bottom-up) feature pyramid. Each fusion node uses fast normalized fusion: the output is sum_i(w_i * I_i) / (epsilon + sum_j w_j), where the w_i are learnable weights, so the network learns how much each input feature level contributes.

In this article, the PANet neck of YOLOv5 is replaced with an EfficientDet-style BiFPN. Fusing shallow and deep features in both the top-down and bottom-up directions strengthens the flow of feature information between layers and noticeably improves YOLOv5's detection accuracy.
The BiFPN in the original paper fuses five feature levels, while the stock YOLOv5s neck only feeds three detection heads, so only three levels are fused here.

To combine YOLOv5 with BiFPN, the following places in the code need to be modified:

1. Modify common.py

# add to models/common.py (torch and nn are already imported there)
class Concat_BiFPN(nn.Module):
    # Weighted fusion of two same-shaped feature maps (BiFPN "fast normalized fusion")
    def __init__(self, c1):  # c1 kept for compatibility with the yaml args; not used here
        super().__init__()
        self.w = nn.Parameter(torch.ones(2, dtype=torch.float32), requires_grad=True)  # learnable fusion weights
        self.epsilon = 0.0001
        self.swish = nn.SiLU()  # Swish and SiLU are the same activation

    def forward(self, x):
        weight = self.w / (torch.sum(self.w, dim=0) + self.epsilon)  # normalize the weights
        # weighted element-wise sum of the two inputs, followed by Swish
        return self.swish(weight[0] * x[0] + weight[1] * x[1])
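
A quick shape check (my own snippet, not from the original post): the module expects a list of two feature maps with identical shapes and returns a single fused map of the same shape, which is why the yaml entries that use it must fuse layers at the same resolution and channel width.

import torch

m = Concat_BiFPN(64)                      # c1 is accepted but unused in the module above
p_top_down = torch.randn(1, 64, 40, 40)   # e.g. an upsampled deeper feature map
p_lateral = torch.randn(1, 64, 40, 40)    # lateral feature map with matching shape
print(m([p_top_down, p_lateral]).shape)   # torch.Size([1, 64, 40, 40])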

2. Modify yolo.py

The two feature maps are no longer concatenated along the channel dimension; they are fused by an element-wise (pairwise) addition, so the output channel count is no longer the sum of the inputs but equals the channel count of a single input. parse_model() in models/yolo.py therefore has to compute the output channels accordingly.
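
For reference, this is roughly where the change goes in parse_model() of models/yolo.py. It is an excerpt-style sketch based on the stock YOLOv5 parsing code, not a standalone script; adapt it to your own version:

# inside parse_model() in models/yolo.py (sketch)
elif m is Concat:
    c2 = sum(ch[x] for x in f)        # plain concat: output channels are the sum of the inputs
elif m is Concat_BiFPN:
    c2 = max(ch[x] for x in f)        # weighted element-wise add: output channels stay the same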

3. Modify the model configuration file

Add Concat_BiFPN to yolov5s.yaml in the YOLOv5 code at the fusion points dictated by your own network structure.

Results: with the BiFPN neck, mAP@0.5 is 0.348 and mAP@0.5:0.95 is 0.192.

Method 2: Adding a P2 small-object detection layer

The second approach adds an extra small-object detection layer so that YOLOv5 can pick up small targets; only the yaml file under models/ needs to be edited.
The original yaml file:

# parameters
nc: 80  # number of classes
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple
 
# anchors
anchors:
  - [10,13, 16,30, 33,23]  # P3/8
  - [30,61, 62,45, 59,119]  # P4/16
  - [116,90, 156,198, 373,326]  # P5/32
 
# YOLOv5 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Focus, [64, 3]],  # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],  # 1-P2/4
   [-1, 3, C3, [128]],
   [-1, 1, Conv, [256, 3, 2]],  # 3-P3/8
   [-1, 9, C3, [256]],
   [-1, 1, Conv, [512, 3, 2]],  # 5-P4/16
   [-1, 9, C3, [512]],
   [-1, 1, Conv, [1024, 3, 2]],  # 7-P5/32
   [-1, 1, SPP, [1024, [5, 9, 13]]],
   [-1, 3, C3, [1024, False]],  # 9
  ]
 
# YOLOv5 head
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],  # cat backbone P4
   [-1, 3, C3, [512, False]],  # 13
 
   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],  # cat backbone P3
   [-1, 3, C3, [256, False]],  # 17 (P3/8-small)
 
   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 14], 1, Concat, [1]],  # cat head P4
   [-1, 3, C3, [512, False]],  # 20 (P4/16-medium)
 
   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 10], 1, Concat, [1]],  # cat head P5
   [-1, 3, C3, [1024, False]],  # 23 (P5/32-large)
 
   [[17, 20, 23], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]

The modified yaml file:

# parameters
nc: 10  # number of classes
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple
 
# anchors
anchors:
  - [5,6, 8,14, 15,11]  # P2/4
  - [10,13, 16,30, 33,23]  # P3/8
  - [30,61, 62,45, 59,119]  # P4/16
  - [116,90, 156,198, 373,326]  # P5/32
 
# YOLOv5 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Focus, [64, 3]],  # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],  # 1-P2/4
   [-1, 3, C3, [128]],
   [-1, 1, Conv, [256, 3, 2]],  # 3-P3/8
   [-1, 9, C3, [256]],
   [-1, 1, Conv, [512, 3, 2]],  # 5-P4/16
   [-1, 9, C3, [512]],
   [-1, 1, Conv, [1024, 3, 2]],  # 7-P5/32
   [-1, 1, SPP, [1024, [5, 9, 13]]],
   [-1, 3, C3, [1024, False]],  # 9
  ]
 
# YOLOv5 head
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],  # cat backbone P4
   [-1, 3, C3, [512, False]],  # 13
 
   [-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],  # cat backbone P3
   [-1, 3, C3, [512, False]],  # 17
 
   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 2], 1, Concat, [1]],  # cat backbone P2
   [-1, 3, C3, [256, False]],  # 21 (P2/4-xsmall)
 
   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 18], 1, Concat, [1]],  # cat head P3
   [-1, 3, C3, [256, False]],  # 24 (P3/8-small)
 
   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 14], 1, Concat, [1]],  # cat head P4
   [-1, 3, C3, [512, False]],  # 27 (P4/16-medium)
 
   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 10], 1, Concat, [1]],  # cat head P5
   [-1, 3, C3, [1024, False]],  # 30 (P5/32-large)
 
   [[21, 24, 27, 30], 1, Detect, [nc, anchors]],  # Detect(P2, P3, P4, P5)
  ]
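
To actually train the 4-head model on VisDrone, the modified yaml has to be saved under some name and passed to train.py. The snippet below is a sketch of my own; the file name yolov5s-visdrone-p2.yaml and the hyperparameters are illustrative values, not settings from the original post. Run it from inside a yolov5 repository checkout.

import train  # yolov5/train.py

train.run(
    data='VisDrone.yaml',                   # dataset config shown at the top of this post
    cfg='models/yolov5s-visdrone-p2.yaml',  # the 4-head model yaml above (hypothetical file name)
    weights='yolov5s.pt',                   # COCO-pretrained warm start
    imgsz=640,                              # example values
    epochs=100,
    batch_size=16,
)
# CLI equivalent:
# python train.py --data VisDrone.yaml --cfg models/yolov5s-visdrone-p2.yaml \
#        --weights yolov5s.pt --img 640 --epochs 100 --batch-size 16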

Method 3: Using the BiFPN structure together with the P2 small-object detection layer

The yaml file for the combined structure:

# YOLOv5 🚀 by Ultralytics, GPL-3.0 license
 
# Parameters: all of P2, P3, P4 and P5 are output, so this variant can detect smaller objects than the standard 3-head model
nc: 10  # number of classes (VisDrone)
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple
anchors: 4  # AutoAnchor evolves 4 anchors per P output layer
 
# YOLOv5 v6.0 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [64, 6, 2, 2]],  # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],  # 1-P2/4
   [-1, 3, C3, [128]],
   [-1, 1, Conv, [256, 3, 2]],  # 3-P3/8
   [-1, 6, C3, [256]],
   [-1, 1, Conv, [512, 3, 2]],  # 5-P4/16
   [-1, 9, C3, [512]],
   [-1, 1, Conv, [1024, 3, 2]],  # 7-P5/32
   [-1, 3, C3, [1024]],
   [-1, 1, SPPF, [1024, 5]],  # 9
  ]
 
# YOLOv5 v6.0 head with (P2, P3, P4, P5) outputs
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, BiFPN_Add2, [256,256]],  # cat backbone P4
   [-1, 3, C3, [512, False]],  # 13
 
   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, BiFPN_Add2, [128,128]],  # cat backbone P3
   [-1, 3, C3, [256, False]],  # 17 (P3/8-small)
 
   [-1, 1, Conv, [128, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 2], 1, BiFPN_Add2, [64,64]],  # cat backbone P2
   [-1, 1, C3, [128, False]],  # 21 (P2/4-xsmall)
 
   [-1, 1, Conv, [128, 3, 2]],
   [-1, 1, Conv, [256, 1, 1]],
   [[-1, 17, 4], 1, BiFPN_Add3, [128,128]],  # cat head P3
   [-1, 3, C3, [256, False]],  # 25 (P3/8-small)
 
   [-1, 1, Conv, [256, 3, 2]],
   [-1, 1, Conv, [512, 1, 1]],
   [[-1, 13, 6], 1, BiFPN_Add3, [256,256]],  # cat head P4
   [-1, 3, C3, [512, False]],  # 29 (P4/16-medium)
 
   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 10], 1, BiFPN_Add2, [256,256]],  # cat head P5
   [-1, 3, C3, [1024, False]],  # 32 (P5/32-large)
 
   [[21, 25, 29, 32], 1, Detect, [nc, anchors]],  # Detect(P2, P3, P4, P5)
  ]
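
Note that the yaml above uses BiFPN_Add2 and BiFPN_Add3, which are not defined earlier in this post. A minimal sketch of these modules, written by me to be consistent with the [c_in, c_out] arguments used in the yaml (weighted two- or three-input fusion followed by a 1x1 convolution), would go into models/common.py; they also need to be registered in parse_model() in the same way as Concat_BiFPN above.

import torch
import torch.nn as nn


class BiFPN_Add2(nn.Module):
    # Weighted fusion of two same-shaped feature maps, followed by a 1x1 conv
    def __init__(self, c1, c2):
        super().__init__()
        self.w = nn.Parameter(torch.ones(2, dtype=torch.float32), requires_grad=True)
        self.epsilon = 0.0001
        self.conv = nn.Conv2d(c1, c2, kernel_size=1, stride=1, padding=0)
        self.silu = nn.SiLU()

    def forward(self, x):
        w = self.w / (torch.sum(self.w, dim=0) + self.epsilon)  # fast normalized fusion weights
        return self.conv(self.silu(w[0] * x[0] + w[1] * x[1]))


class BiFPN_Add3(nn.Module):
    # Weighted fusion of three same-shaped feature maps, followed by a 1x1 conv
    def __init__(self, c1, c2):
        super().__init__()
        self.w = nn.Parameter(torch.ones(3, dtype=torch.float32), requires_grad=True)
        self.epsilon = 0.0001
        self.conv = nn.Conv2d(c1, c2, kernel_size=1, stride=1, padding=0)
        self.silu = nn.SiLU()

    def forward(self, x):
        w = self.w / (torch.sum(self.w, dim=0) + self.epsilon)
        return self.conv(self.silu(w[0] * x[0] + w[1] * x[1] + w[2] * x[2]))

As with Concat_BiFPN, the inputs fused by each node must share the same spatial size and channel count.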