An Introduction to MTCNN for Face Detection

Overview

Paper:
Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks

Overall structure:
[Figure: overall structure of the MTCNN cascade]

The network's advantages:
1. It uses a cascade of classifiers, each of which is relatively simple, so it runs fast.
2. It uses multi-task training, exploiting the intrinsic connection between landmark localization and face classification, which improves accuracy.
3. It outputs landmarks, so face alignment can be done directly during face recognition without adding another network.

Networks and Loss Functions

Overall network architecture:
[Figure: P-Net, R-Net, and O-Net architectures]

Now let's look at the loss of each stage.
Because training is multi-task, all three networks handle classification, bounding box regression, and landmark localization.

  1. Face classification: The learning objective is formulated as a two-class classification problem. For each sample x_i, we use the cross-entropy loss:

     L_i^{det} = -( y_i^{det} log(p_i) + (1 - y_i^{det}) log(1 - p_i) )

     where p_i is the probability produced by the network that indicates sample x_i being a face, and y_i^{det} ∈ {0, 1} denotes the ground-truth label.
  2. Bounding box regression: For each candidate window, we predict the offset between it and the nearest ground truth (i.e., the bounding boxes' left top, height, and width). The learning objective is formulated as a regression problem, and we employ the Euclidean loss for each sample x_i:

     L_i^{box} = || ŷ_i^{box} - y_i^{box} ||_2^2

     where ŷ_i^{box} is the regression target obtained from the network and y_i^{box} is the ground-truth coordinate. There are four coordinates, including left top, height, and width, and thus y_i^{box} ∈ R^4.
  3. Facial landmark localization: Similar to the bounding box regression task, facial landmark detection is formulated as a regression problem and we minimize the Euclidean loss:

     L_i^{landmark} = || ŷ_i^{landmark} - y_i^{landmark} ||_2^2

     where ŷ_i^{landmark} is the facial landmark's coordinate obtained from the network and y_i^{landmark} is the ground-truth coordinate. There are five facial landmarks (left eye, right eye, nose, left mouth corner, right mouth corner), and thus y_i^{landmark} ∈ R^{10}.

Overall loss:

min Σ_{i=1}^{N} Σ_{j ∈ {det, box, landmark}} α_j β_i^j L_i^j

where N is the number of training samples, α_j denotes the task importance (α_det = 1, α_box = 0.5, α_landmark = 0.5 in P-Net and R-Net, while α_landmark = 1 in O-Net), and β_i^j ∈ {0, 1} is the sample-type indicator.
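The combined objective weights the three per-sample task losses and masks out tasks that do not apply to a given sample type. A minimal numpy sketch, using the paper's P-Net/R-Net α values as defaults:

```python
import numpy as np

def total_loss(det_losses, box_losses, lm_losses,
               det_mask, box_mask, lm_mask,
               a_det=1.0, a_box=0.5, a_lm=0.5):
    """Weighted multi-task loss over a mini-batch.

    Each *_losses array holds one task's per-sample loss; each *_mask is
    the beta indicator (1 if that sample type trains that task, else 0).
    The alpha defaults follow the P-Net/R-Net setting; for O-Net the
    landmark weight would be a_lm=1.0.
    """
    return (a_det * np.sum(det_mask * det_losses)
            + a_box * np.sum(box_mask * box_losses)
            + a_lm * np.sum(lm_mask * lm_losses))
```

For example, a negative sample would have det_mask = 1 but box_mask = lm_mask = 0, so only its classification loss contributes.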

Training

Overview

Training datasets

In the paper the authors mainly use the WIDER FACE and CelebA datasets: WIDER FACE is used primarily to train the detection task, and CelebA primarily to train the landmarks. The training set is divided into four sample types: negatives, positives, part faces, and landmark faces, in a ratio of 3 : 1 : 1 : 2.

we use four different kinds of data annotation in our training process: (i) Negatives: Regions that the Intersection-over-Union (IoU) ratio less than 0.3 to any ground-truth faces; (ii) Positives: IoU above 0.65 to a ground truth face; (iii) Part faces: IoU between 0.4 and 0.65 to a ground truth face; and (iv) Landmark faces: faces labeled 5 landmarks' positions.
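The IoU thresholds above can be sketched as a small labeling helper. One caveat: the paper does not say what happens to crops in the 0.3–0.4 IoU band; common implementations simply discard them, which is what the "ignore" branch below assumes:

```python
import numpy as np

def iou(box, boxes):
    """IoU of one box [x1, y1, x2, y2] against an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def label_crop(crop, gt_boxes):
    """Assign a sample type to a crop using the paper's IoU thresholds."""
    best = iou(crop, gt_boxes).max()
    if best < 0.3:
        return "negative"
    if best >= 0.65:
        return "positive"
    if best >= 0.4:
        return "part"
    return "ignore"  # 0.3 <= IoU < 0.4: typically discarded (assumption)
```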

Training covers three tasks:

Face classification: trained with positives and negatives
Bounding box regression: trained with positives and part faces
Landmark localization: trained with landmark faces

For the details of the training steps, see:
https://github.com/AITTSMD/MTCNN-Tensorflow

1. Download the WIDER FACE training part only from the official website, unzip it to replace WIDER_train, and put it into the prepare_data folder.
2. Download the landmark training data from here, unzip it, and put it into the prepare_data folder.
3. Run prepare_data/gen_12net_data.py to generate training data (face detection part) for PNet.
4. Run gen_landmark_aug_12.py to generate training data (facial landmark part) for PNet.
5. Run gen_imglist_pnet.py to merge the two parts of training data.
6. Run gen_PNet_tfrecords.py to generate tfrecords for PNet.
7. After training PNet, run gen_hard_example to generate training data (face detection part) for RNet.
8. Run gen_landmark_aug_24.py to generate training data (facial landmark part) for RNet.
9. Run gen_imglist_rnet.py to merge the two parts of training data.
10. Run gen_RNet_tfrecords.py to generate tfrecords for RNet (run this script four times to generate the tfrecords of neg, pos, part, and landmark respectively).
11. After training RNet, run gen_hard_example to generate training data (face detection part) for ONet.
12. Run gen_landmark_aug_48.py to generate training data (facial landmark part) for ONet.
13. Run gen_imglist_onet.py to merge the two parts of training data.
14. Run gen_ONet_tfrecords.py to generate tfrecords for ONet (run this script four times to generate the tfrecords of neg, pos, part, and landmark respectively).

P-NET

Collecting the P-Net training set involves two steps:
1. Randomly crop positive and negative samples, constrained by the agreed IoU ranges for positives and negatives. Note that the crop size can be adjusted to the needs of the actual project; by default cropping starts from 12 × 12.
2. From the landmark dataset, crop out the face, resize it to 12 × 12, and adjust the corresponding landmark positions.
One thing worth noting: MTCNN's activation function is PReLU.
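For reference, PReLU is a leaky-style activation with a learned negative slope; a minimal sketch (in MTCNN alpha is a learned per-channel parameter, so the fixed 0.25 here is only an illustrative initial value):

```python
import numpy as np

def prelu(x, alpha=0.25):
    """PReLU: identity for positive inputs, alpha * x for negative ones.

    alpha would be a trainable per-channel parameter in the real network;
    a scalar is used here purely for illustration.
    """
    return np.where(x > 0, x, alpha * x)
```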

Remarks:
In https://github.com/AITTSMD/MTCNN-Tensorflow the author mentions that CelebA has annotation errors and uses http://mmlab.ie.cuhk.edu.hk/archive/CNN_FacePoint.htm instead. My own question: isn't that dataset rather small? And how would the data-balance problem be handled? Personally I still lean toward using the CelebA data.
To compensate for the shortage of samples, the author adds some data augmentation methods during sample processing, which is worth borrowing.

R-NET

R-Net: We use the first stage of our framework to detect faces from WIDER FACE [24] to collect positives, negatives, and part faces, while landmark faces are detected from CelebA [23].
In other words, the detection boxes produced by P-Net are sorted into positives, negatives, and part (hard) samples, which are then fed to R-Net for training.
Landmark samples are collected the same way as for P-Net, except the crops are resized to 24 × 24.
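The landmark adjustment after cropping and resizing amounts to a coordinate transform. A sketch of it below; storing the targets normalized to [0, 1] relative to the crop is an assumption based on common MTCNN training code, not something the paper specifies:

```python
import numpy as np

def adjust_landmarks(landmarks, crop_box):
    """Map absolute landmark coordinates into a cropped face region.

    landmarks: (5, 2) array of (x, y) in the original image.
    crop_box:  [x1, y1, x2, y2] of the cropped face.
    Returns coordinates normalized to [0, 1] relative to the crop;
    multiply by the target size (e.g. 24) for pixel coordinates
    in the resized patch.
    """
    x1, y1, x2, y2 = crop_box
    w, h = x2 - x1, y2 - y1
    return (landmarks - np.array([x1, y1])) / np.array([w, h])
```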

O-NET

O-Net is trained the same way as R-Net, except that its positives and negatives are produced by running the first two stages of the cascade.

online hard sample mining

In particular, in each mini-batch, we sort the loss computed in the forward propagation phase from all samples and select the top 70% of them as hard samples. Then we only compute the gradient from the hard samples in the backward propagation phase. That means we ignore the easy samples that are less helpful to strengthen the detector while training.
In other words, when computing the loss, only the samples whose losses rank in the top 70% are counted. (PS: personally I find this a bit hand-wavy, but then again, plenty of things are...)
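The top-70% selection described above can be sketched in a few lines of numpy (a real implementation would do this inside the training graph so that easy samples receive zero gradient):

```python
import numpy as np

def ohem_loss(per_sample_losses, keep_ratio=0.7):
    """Online hard sample mining: keep only the largest 70% of the
    per-sample losses in the mini-batch and average those; the easy
    samples are dropped and contribute no gradient."""
    losses = np.asarray(per_sample_losses, dtype=float)
    keep = max(1, int(len(losses) * keep_ratio))
    hard = np.sort(losses)[::-1][:keep]
    return hard.mean()
```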

That's all for now; I'll note any other pitfalls once I actually run the training.
