Paddle officially provides a face-mask detection example, but it is two-stage: a face detector runs first, and each cropped face is then binary-classified as masked or not.
facemask previously implemented single-stage mask detection based on YOLOv2, but it is not fast enough on mobile and is really only suitable for server-side deployment; due to a lack of training samples, its measured accuracy was also unimpressive.
FaceMaskDetection is a small mask-detection network open-sourced by AIZOO. The input is 260x260, the backbone has only 8 layers, and there are five localization/classification heads, for a total of only 28 convolutional layers. Each conv layer uses 32, 64, or 128 channels, so the whole model has only about 1.015 million parameters. In practice it is very fast, running in real time on a CPU, and since the network structure is open-sourced the latency can be reduced further. The authors provide near-universal framework support: caffe, pytorch, tensorflow, onnx, and mxnet. The one thing missing was paddle; fortunately, after some effort that gap has now been filled, with code available at FaceMaskDetection.
This article takes the simplest route: converting from caffe to paddle. The conversion is trivial; after `pip install x2paddle`, just run the commands below.
x2paddle --framework=caffe --prototxt=models/face_mask_detection.prototxt --weight=models/face_mask_detection.caffemodel --save_dir=./ --params_merge
mv inference_model models/paddle
The next step is adapting the inference code. Paddle offers two input modes, zero-copy and data copy; zero-copy hands the data pointer directly to the network input, saving one copy, and is therefore faster. The other thing to watch is preprocessing: the image must be in RGB, converted to floats in [0, 1], and resized to 260x260. See paddle_infer.py for details:
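The snippet below assumes a `predictor` has already been created with the feed/fetch ops disabled, which is required for `zero_copy_run`. A sketch using the Paddle 1.x `AnalysisConfig` API (the model/params file names are assumptions and depend on the x2paddle version used for the conversion above):

```python
from paddle.fluid.core import AnalysisConfig, create_paddle_predictor

# Load the x2paddle-converted model (file names may vary by x2paddle version)
config = AnalysisConfig("models/paddle/__model__", "models/paddle/__params__")
config.disable_gpu()                     # CPU inference
config.switch_use_feed_fetch_ops(False)  # required for zero_copy_run
predictor = create_paddle_predictor(config)
```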
import cv2
import numpy as np

# Preprocess: BGR -> RGB, resize to 260x260, scale to [0, 1], NCHW float32
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
height, width, _ = img.shape
image_resized = cv2.resize(img, target_shape)  # target_shape = (260, 260)
image_np = image_resized / 255.0
image_np = image_np.transpose(2, 0, 1)  # HWC -> CHW
img = np.expand_dims(image_np, axis=0).copy()
img = img.astype("float32")

# Feed the input tensor and run with zero-copy
input_names = predictor.get_input_names()
input_tensor = predictor.get_input_tensor(input_names[0])
input_tensor.copy_from_cpu(img)
predictor.zero_copy_run()

# Fetch the anchor-encoded boxes and the class scores
output_names = predictor.get_output_names()
y_bboxes_output = predictor.get_output_tensor(output_names[0])
y_cls_output = predictor.get_output_tensor(output_names[1])
y_bboxes_output = y_bboxes_output.copy_to_cpu()
y_cls_output = y_cls_output.copy_to_cpu()
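The two outputs are still anchor-encoded box regressions and raw class scores: they must be decoded against the anchor grid and filtered with non-maximum suppression (the AIZOO repo ships decode and NMS utilities for this). As an illustration of the NMS step only, a minimal self-contained numpy sketch, independent of the repo's helpers:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.4):
    """Greedy NMS on [xmin, ymin, xmax, ymax] boxes; returns kept indices."""
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the top box with the remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # Drop boxes overlapping the kept one too much
        order = order[1:][iou <= iou_thresh]
    return keep
```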
A per-layer analysis of parameter counts and FLOPs shows the model has about 1M parameters and 759M FLOPs, which is still a bit heavy; conv2d_1 and conv2d_2 should be the focus of optimization. The compute hotspots are listed below:
face_mask_detection.prototxt
| layer name | Filter Shape | Output Size | Params | FLOPs | Ratio |
|---|---|---|---|---|---|
| conv2d_0 | (32, 3, 3, 3) | (1, 32, 260, 260) | 864 | 58406400 | 7.693% |
| conv2d_1 | (64, 32, 3, 3) | (1, 64, 130, 130) | 18432 | 311500800 | 41.029% |
| conv2d_2 | (64, 64, 3, 3) | (1, 64, 65, 65) | 36864 | 155750400 | 20.514% |
| conv2d_3 | (64, 64, 3, 3) | (1, 64, 33, 33) | 36864 | 40144896 | 5.288% |
| conv2d_4 | (128, 64, 3, 3) | (1, 128, 17, 17) | 73728 | 21307392 | 2.806% |
| conv2d_5 | (128, 128, 3, 3) | (1, 128, 9, 9) | 147456 | 11943936 | 1.573% |
| conv2d_6 | (64, 128, 3, 3) | (1, 64, 5, 5) | 73728 | 1843200 | 0.243% |
| conv2d_7 | (64, 64, 3, 3) | (1, 64, 3, 3) | 36864 | 331776 | 0.044% |
| cls_0_insert_conv2d | (64, 64, 3, 3) | (1, 64, 33, 33) | 36864 | 40144896 | 5.288% |
| cls_1_insert_conv2d | (64, 128, 3, 3) | (1, 64, 17, 17) | 73728 | 21307392 | 2.806% |
| cls_2_insert_conv2d | (64, 128, 3, 3) | (1, 64, 9, 9) | 73728 | 5971968 | 0.787% |
| cls_3_insert_conv2d | (64, 64, 3, 3) | (1, 64, 5, 5) | 36864 | 921600 | 0.121% |
| cls_4_insert_conv2d | (64, 64, 3, 3) | (1, 64, 3, 3) | 36864 | 331776 | 0.044% |
| loc_0_insert_conv2d | (64, 64, 3, 3) | (1, 64, 33, 33) | 36864 | 40144896 | 5.288% |
| loc_1_insert_conv2d | (64, 128, 3, 3) | (1, 64, 17, 17) | 73728 | 21307392 | 2.806% |
| loc_2_insert_conv2d | (64, 128, 3, 3) | (1, 64, 9, 9) | 73728 | 5971968 | 0.787% |
| loc_3_insert_conv2d | (64, 64, 3, 3) | (1, 64, 5, 5) | 36864 | 921600 | 0.121% |
| loc_4_insert_conv2d | (64, 64, 3, 3) | (1, 64, 3, 3) | 36864 | 331776 | 0.044% |
| cls_0_conv | (8, 64, 3, 3) | (1, 8, 33, 33) | 4608 | 5018112 | 0.661% |
| cls_1_conv | (8, 64, 3, 3) | (1, 8, 17, 17) | 4608 | 1331712 | 0.175% |
| cls_2_conv | (8, 64, 3, 3) | (1, 8, 9, 9) | 4608 | 373248 | 0.049% |
| cls_3_conv | (8, 64, 3, 3) | (1, 8, 5, 5) | 4608 | 115200 | 0.015% |
| cls_4_conv | (8, 64, 3, 3) | (1, 8, 3, 3) | 4608 | 41472 | 0.005% |
| loc_0_conv | (16, 64, 3, 3) | (1, 16, 33, 33) | 9216 | 10036224 | 1.322% |
| loc_1_conv | (16, 64, 3, 3) | (1, 16, 17, 17) | 9216 | 2663424 | 0.351% |
| loc_2_conv | (16, 64, 3, 3) | (1, 16, 9, 9) | 9216 | 746496 | 0.098% |
| loc_3_conv | (16, 64, 3, 3) | (1, 16, 5, 5) | 9216 | 230400 | 0.03% |
| loc_4_conv | (16, 64, 3, 3) | (1, 16, 3, 3) | 9216 | 82944 | 0.011% |
Layers num: 28
Total number of parameters: 1010016
Total number of FLOPs: 759223296
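The Params and FLOPs columns follow directly from the layer shapes: a k x k convolution with C_in input and C_out output channels has C_out * C_in * k * k weights (bias ignored here), and the FLOPs counted in the table are weights * H_out * W_out, i.e. multiplications only. A quick sketch checking this against the first two rows:

```python
def conv_stats(c_out, c_in, k, h_out, w_out):
    """Parameter and multiply counts for a k x k conv, matching the table above."""
    params = c_out * c_in * k * k
    flops = params * h_out * w_out
    return params, flops

print(conv_stats(32, 3, 3, 260, 260))   # conv2d_0 -> (864, 58406400)
print(conv_stats(64, 32, 3, 130, 130))  # conv2d_1 -> (18432, 311500800)
```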
The author also provides a curated dataset on Baidu Netdisk (extraction code: eyfz), split as shown in the table:
| | From WIDER Face | From MAFA | Total |
|---|---|---|---|
| Training set (images) | 3114 | 3006 | 6120 |
| Validation set (images) | 780 | 1059 | 1839 |
I wrote a small script to visualize the annotations:
import cv2
import os
import xml.etree.ElementTree as ET
from tqdm import tqdm

def show(split="train", togt=True):
    files = os.listdir(split)
    files = [f for f in files if f.endswith("jpg")]
    os.makedirs("gt", exist_ok=True)  # output directory for annotated images
    for file in tqdm(files):
        img = cv2.imread(split + "/" + file)
        # Each image has a Pascal VOC style xml with the same basename
        xml_path = split + "/" + file.replace("jpg", "xml")
        tree = ET.parse(xml_path)
        root = tree.getroot()
        for obj in root.findall('object'):
            name = obj.find('name').text
            bbox = obj.find('bndbox')
            xmin = int(bbox.find('xmin').text)
            ymin = int(bbox.find('ymin').text)
            xmax = int(bbox.find('xmax').text)
            ymax = int(bbox.find('ymax').text)
            cv2.rectangle(img, (xmin, ymin), (xmax, ymax), (255, 0, 0))
            cv2.putText(img, name, (xmin, ymin), 1, 1, (0, 0, 255))
        #cv2.imshow("img", img)
        #cv2.waitKey()
        cv2.imwrite("gt/" + file, img)

show()
The annotations cover a wide range of head poses, and even a hand covering the face is not labeled as wearing a mask.
The original author did not release training code, but a training setup can be found in ssd-models.