Face Recognition Series, Part 3 | MTCNN Explained (Part 2)

just_sort

Article link: https://blog.csdn.net/just_sort/article/details/103162909

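Preface

The previous post covered how MTCNN works and how it is trained. This post walks through MTCNN from the perspective of its source code. The code analyzed here is mtcnn.cpp from https://github.com/ElegantGod/ncnn on GitHub.

Network Structure

Here is the MTCNN network structure again, so it is easy to refer to while reading the annotated code.
[Figure: MTCNN network structure]

MTCNN Code Execution Flow

[Figure: MTCNN code execution flow]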

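Key Parameters in the Code

nms_threshold: the IOU threshold used by the three rounds of non-maximum suppression that filter face boxes. Each of the three networks can have its own value. If it is set too high, NMS suppresses too few boxes and the later stages do a lot of redundant computation.
threshold: the face score threshold. Each of the three networks can have its own value. If it is set too low, many boxes pass, which increases the amount of computation and may cause non-face boxes to end up reported as faces.
mean_vals: the mean values of the input image for the three networks. They need to be set explicitly.
norm_vals: the scaling coefficients of the input image for the three networks. They need to be set explicitly.
min_size: the smallest face size that can be detected. It is one of the parameters that control the number of image pyramid levels: the smaller it is, the more levels and the more computation. This code uses 40.
factor: the scaling coefficient used when building the image pyramid, in the range (0,1). It is one of the parameters that control the number of image pyramid levels: the larger it is, the more levels and the more computation. This post uses 0.709.
MIN_DET_SIZE: the width and height of the PNet input image, both 12.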

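Code Walkthrough

Building the Image Pyramid

The key parameters minsize and factor together determine the number of levels in the image pyramid, that is, how many resized images are generated.

The code for this part is: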

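	// Keep scaling down until the shorter side reaches 12
	int MIN_DET_SIZE = 12;
	// Smallest face size that can be detected
	int minsize = 40;
	float m = (float)MIN_DET_SIZE / minsize;
	// minl is declared earlier in the function (not shown in this excerpt);
	// it is assumed to hold the shorter side of the input image
	minl *= m;
	float factor = 0.709;
	int factor_count = 0;
	vector<float> scales_;
	while (minl>MIN_DET_SIZE) {
		if (factor_count>0)m = m*factor;
		scales_.push_back(m);
		minl *= factor;
		factor_count++;
	}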

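In this code, MIN_DET_SIZE means that the scaled image must not get smaller than 12, so the original image is scaled down until it reaches 12. The vector scales_ stores the scale factor of each level, and its size is the number of images that will be generated. minsize is the smallest detectable face size, set to 40 here. The size of the scaled image can be computed with the following formula:
$minL = orgL \times (12/minsize) \times factor^{n}$, where $n$ counts how many times factor has been applied. The loop keeps adding levels while $minL > 12$, so when it ends $n$ equals the length of scales_, which is the number of image pyramid levels.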

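PNet

PNet performs only two tasks: face classification and bounding-box regression. As covered in the previous post, PNet takes a 12*12 input, and that is indeed how it is trained. At test time, however, there is no need to resize every pyramid image to 12*12 before feeding it to PNet. Because PNet is a fully convolutional network, each resized image can be fed to the network directly for the forward pass. The outputs are then not $1 \times 1 \times 2$ and $1 \times 1 \times 4$ but $m \times m \times 2$ and $m \times m \times 4$. This way there is no need to crop all the $12 \times 12 \times 3$ patches from the resized image and feed them in one by one. The whole image goes in at once, and the results are then traced back to find where in the input image the $12 \times 12$ patch behind each output lies.
For every image in the pyramid, the forward pass yields the face probability and the box regression result. Each image gives $m \times m \times 2$ classification scores and $m \times m \times 4$ regression coordinates, and combining them with scales_ maps every sliding window back to the original image to obtain its real coordinates.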
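The mapping from an output cell back to the original image can be sketched as follows. This is an illustrative sketch only: the names Window and cell_to_window are hypothetical, and the stride of 2 and window size of 12 are taken from the generateBbox function in mtcnn.cpp.

```cpp
#include <cmath>

struct Window { int x1, y1, x2, y2; };

// Hypothetical helper: maps PNet output cell (row, col), computed on an image
// resized by `scale`, back to its 12x12 window in original-image coordinates.
Window cell_to_window(int row, int col, double scale,
                      int stride = 2, int cell_size = 12) {
    Window w;
    // Top-left corner of the window in the resized image, divided by the
    // scale to get back to original-image pixels.
    w.x1 = static_cast<int>(std::round((stride * col + 1) / scale));
    w.y1 = static_cast<int>(std::round((stride * row + 1) / scale));
    // Bottom-right corner: the window covers cell_size pixels per side.
    w.x2 = static_cast<int>(std::round((stride * col + 1 + cell_size) / scale));
    w.y2 = static_cast<int>(std::round((stride * row + 1 + cell_size) / scale));
    return w;
}
```

At the first pyramid scale of 0.3, cell (0, 0) maps to the window from (3, 3) to (43, 43), which is 40 pixels wide. That is why minsize is described as the smallest detectable face size.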

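Next, regions with low scores are discarded using the threshold parameter described above, and one round of NMS removes some of the redundant overlapping boxes. PNet ends up with a set of candidate face boxes. They are still coarse, so processing continues with the next stage. The PNet code is: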

for (size_t i = 0; i < scales_.size(); i++) {
		int hs = (int)ceil(img_h*scales_[i]);
		int ws = (int)ceil(img_w*scales_[i]);
		//ncnn::Mat in = ncnn::Mat::from_pixels_resize(image_data, ncnn::Mat::PIXEL_RGB2BGR, img_w, img_h, ws, hs);
		ncnn::Mat in;
		resize_bilinear(img_, in, ws, hs);
		//in.substract_mean_normalize(mean_vals, norm_vals);
		ncnn::Extractor ex = Pnet.create_extractor();
		ex.set_light_mode(true);
		ex.input("data", in);
		ncnn::Mat score_, location_;
		ex.extract("prob1", score_);
		ex.extract("conv4-2", location_);
		std::vector<Bbox> boundingBox_;
		std::vector<orderScore> bboxScore_;
		generateBbox(score_, location_, boundingBox_, bboxScore_, scales_[i]);
		nms(boundingBox_, bboxScore_, nms_threshold[0]);

		for (vector<Bbox>::iterator it = boundingBox_.begin(); it != boundingBox_.end(); it++) {
			if ((*it).exist) {
				firstBbox_.push_back(*it);
				order.score = (*it).score;
				order.oriOrder = count;
				firstOrderScore_.push_back(order);
				count++;
			}
		}
		bboxScore_.clear();
		boundingBox_.clear();
	}

Two key functions are used here, generateBbox and nms. Let's go through them in turn, starting with generateBbox:

// From the PNet output, keep the sliding windows whose score suggests a face,
// recording each window's position, regression coordinates, score and index
void mtcnn::generateBbox(ncnn::Mat score, ncnn::Mat location, std::vector<Bbox>& boundingBox_, std::vector<orderScore>& bboxScore_, float scale) {
	int stride = 2;
	int cellsize = 12;
	int count = 0;
	// score p: probability that the window is a face
	float *p = score.channel(1);
	// box regression offsets
	float *plocal = location.channel(0);
	Bbox bbox;
	orderScore order;
	for (int row = 0; row<score.h; row++) {
		for (int col = 0; col<score.w; col++) {
			if (*p>threshold[0]) {
				bbox.score = *p;
				order.score = *p;
				order.oriOrder = count;
				// coordinates in the original image
				bbox.x1 = round((stride*col + 1) / scale);
				bbox.y1 = round((stride*row + 1) / scale);
				bbox.x2 = round((stride*col + 1 + cellsize) / scale);
				bbox.y2 = round((stride*row + 1 + cellsize) / scale);
				bbox.exist = true;
				// area in the original image
				bbox.area = (bbox.x2 - bbox.x1)*(bbox.y2 - bbox.y1);
				// regression offsets of the current box; index by the current
				// cell (the listing originally read element [0] for every box)
				for (int channel = 0; channel<4; channel++)
					bbox.regreCoord[channel] = location.channel(channel)[row * location.w + col];
				boundingBox_.push_back(bbox);
				bboxScore_.push_back(order);
				count++;
			}
			p++;
			plocal++;
		}
	}
}

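Before reading the non-maximum suppression (NMS) code, it helps to understand the principle. In short: when two boxes are very close in space, the one with the higher score is taken as the reference and the IOU (the degree of overlap) between the two is computed. If the overlap exceeds the threshold, the box with the lower score is suppressed, because there is no need to output two nearly identical boxes and keeping the higher-scoring one is enough. I also plan to survey the various NMS algorithms and explain how they work; this is already on my object detection learning roadmap, and the mind map is in the object detection roadmap post under the deep learning section of my WeChat official account. The code is below. The comments describe it as a fight for the champion's ring, which makes it easier to follow: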

void mtcnn::nms(std::vector<Bbox> &boundingBox_, std::vector<orderScore> &bboxScore_, const float overlap_threshold, string modelname) {
	if (boundingBox_.empty()) {
		return;
	}
	std::vector<int> heros;
	//sort the score
	sort(bboxScore_.begin(), bboxScore_.end(), cmpScore);

	int order = 0;
	float IOU = 0;
	float maxX = 0;
	float maxY = 0;
	float minX = 0;
	float minY = 0;
	// Rule: whoever steps into the ring as champion is always a winner
	while (bboxScore_.size()>0) {
		order = bboxScore_.back().oriOrder; // ID of the highest-scoring warrior
		bboxScore_.pop_back(); // the warrior steps out
		if (order<0)continue; // already dead? next! (order is set by (*it).oriOrder = -1; below)
		if (boundingBox_.at(order).exist == false) continue; // already eliminated, skip
		heros.push_back(order); // record the champion's ID
		boundingBox_.at(order).exist = false;// this Bbox is now the champion; it signs the life-and-death register

		for (int num = 0; num<boundingBox_.size(); num++) {
			if (boundingBox_.at(num).exist) {// warriors that are still alive
				//the iou
				maxX = (boundingBox_.at(num).x1>boundingBox_.at(order).x1) ? boundingBox_.at(num).x1 : boundingBox_.at(order).x1;
				maxY = (boundingBox_.at(num).y1>boundingBox_.at(order).y1) ? boundingBox_.at(num).y1 : boundingBox_.at(order).y1;
				minX = (boundingBox_.at(num).x2<boundingBox_.at(order).x2) ? boundingBox_.at(num).x2 : boundingBox_.at(order).x2;
				minY = (boundingBox_.at(num).y2<boundingBox_.at(order).y2) ? boundingBox_.at(num).y2 : boundingBox_.at(order).y2;
				//maxX1 and maxY1 reuse 
				maxX = ((minX - maxX + 1)>0) ? (minX - maxX + 1) : 0;
				maxY = ((minY - maxY + 1)>0) ? (minY - maxY + 1) : 0;
				//IOU reuse for the area of two bbox
				IOU = maxX * maxY;
				if (!modelname.compare("Union"))
					IOU = IOU / (boundingBox_.at(num).area + boundingBox_.at(order).area - IOU);
				else if (!modelname.compare("Min")) {
					IOU = IOU / ((boundingBox_.at(num).area<boundingBox_.at(order).area) ? boundingBox_.at(num).area : boundingBox_.at(order).area);
				}
				if (IOU>overlap_threshold) {
					boundingBox_.at(num).exist = false; // IOU with the champion is large enough: the challenger falls
					for (vector<orderScore>::iterator it = bboxScore_.begin(); it != bboxScore_.end(); it++) {
						if ((*it).oriOrder == num) {
							(*it).oriOrder = -1;// mark the warrior as dead
							break;
						}
					}
				}
				// challengers far away from the champion are spared and may become champions later
			}
		}
	}
	// strike the champions off the register: they survived
	for (int i = 0; i<heros.size(); i++)
		boundingBox_.at(heros.at(i)).exist = true;
}

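RNet

Unlike PNet, this stage needs each candidate region to be resized to (24,24). The rest of the process is the same as in PNet: run nms. At the end there is one extra post-processing step, refineAndSquareBbox, which turns all remaining boxes into squares and restricts their borders to the width and height of the original image. Note that in this stage refineAndSquareBbox runs after nms.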

//second stage
	count = 0;
	for (vector<Bbox>::iterator it = firstBbox_.begin(); it != firstBbox_.end(); it++) {
		if ((*it).exist) {
			ncnn::Mat tempIm;
			copy_cut_border(img, tempIm, (*it).y1, img_h - (*it).y2, (*it).x1, img_w - (*it).x2);
			ncnn::Mat in;
			resize_bilinear(tempIm, in, 24, 24);
			ncnn::Extractor ex = Rnet.create_extractor();
			ex.set_light_mode(true);
			ex.input("data", in);
			ncnn::Mat score, bbox;
			ex.extract("prob1", score);
			ex.extract("conv5-2", bbox);
			if ((score[1])>threshold[1]) {
				for (int channel = 0; channel<4; channel++)
					it->regreCoord[channel] = bbox[channel];
				it->area = (it->x2 - it->x1)*(it->y2 - it->y1);
				it->score = score[1];
				secondBbox_.push_back(*it);
				order.score = it->score;
				order.oriOrder = count++;
				secondBboxScore_.push_back(order);
			}
			else {
				(*it).exist = false;
			}
		}
	}
	printf("secondBbox_.size()=%zu\n", secondBbox_.size());
	if (count<1)return;
	nms(secondBbox_, secondBboxScore_, nms_threshold[1]);
	refineAndSquareBbox(secondBbox_, img_h, img_w);
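The body of refineAndSquareBbox is not shown in this post. The sketch below only illustrates the two steps described above, making a box square and clamping it to the image, under the assumption that the square is centred on the original box. Rect and square_and_clip are hypothetical names, and the real function in mtcnn.cpp may do more than this.

```cpp
#include <algorithm>

struct Rect { int x1, y1, x2, y2; };

// Hypothetical sketch: grow the shorter side so the box becomes a square
// centred on the original box, then clamp it to the image bounds.
Rect square_and_clip(Rect r, int img_h, int img_w) {
    int w = r.x2 - r.x1;
    int h = r.y2 - r.y1;
    int side = std::max(w, h);
    // Shift the top-left corner so the extra length is split on both sides.
    r.x1 += (w - side) / 2;
    r.y1 += (h - side) / 2;
    r.x2 = r.x1 + side;
    r.y2 = r.y1 + side;
    // Keep the box inside [0, img_w - 1] x [0, img_h - 1].
    r.x1 = std::max(r.x1, 0);
    r.y1 = std::max(r.y1, 0);
    r.x2 = std::min(r.x2, img_w - 1);
    r.y2 = std::min(r.y2, img_h - 1);
    return r;
}
```

A box that touches the image border may no longer be exactly square after clamping. In this sketch that is accepted as is, since the crop is resized to a fixed size before the next network anyway.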

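ONet

Compared with the previous two stages, ONet adds facial landmark (keypoint) regression. Also note that in this stage refineAndSquareBbox runs before nms. The boxes that come out of this stage are the face boxes we have been looking for. That wraps it up.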

count = 0;
	for (vector<Bbox>::iterator it = secondBbox_.begin(); it != secondBbox_.end(); it++) {
		if ((*it).exist) {
			ncnn::Mat tempIm;
			copy_cut_border(img, tempIm, (*it).y1, img_h - (*it).y2, (*it).x1, img_w - (*it).x2);
			ncnn::Mat in;
			resize_bilinear(tempIm, in, 48, 48);
			ncnn::Extractor ex = Onet.create_extractor();
			ex.set_light_mode(true);
			ex.input("data", in);
			ncnn::Mat score, bbox, keyPoint;
			ex.extract("prob1", score);
			ex.extract("conv6-2", bbox);
			ex.extract("conv6-3", keyPoint);
			if (score[1]>threshold[2]) {
				for (int channel = 0; channel<4; channel++)
					it->regreCoord[channel] = bbox[channel];
				it->area = (it->x2 - it->x1)*(it->y2 - it->y1);
				it->score = score[1];
				for (int num = 0; num<5; num++) {
					(it->ppoint)[num] = it->x1 + (it->x2 - it->x1)*keyPoint[num];
					(it->ppoint)[num + 5] = it->y1 + (it->y2 - it->y1)*keyPoint[num + 5];
				}

				thirdBbox_.push_back(*it);
				order.score = it->score;
				order.oriOrder = count++;
				thirdBboxScore_.push_back(order);
			}
			else
				(*it).exist = false;
		}
	}

	printf("thirdBbox_.size()=%zu\n", thirdBbox_.size());
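The landmark step inside the loop above can be isolated as a small function. This is an illustrative sketch with a hypothetical name, landmarks_to_pixels. It assumes, as the loop suggests, that the ONet keypoint output holds five x values followed by five y values, each expressed as a fraction of the box width or height.

```cpp
#include <array>

// Hypothetical helper: converts 10 normalized landmark values
// (x0..x4 followed by y0..y4, relative to the box) into absolute
// pixel coordinates in the original image.
std::array<float, 10> landmarks_to_pixels(const std::array<float, 10>& kp,
                                          float x1, float y1,
                                          float x2, float y2) {
    std::array<float, 10> out{};
    for (int i = 0; i < 5; ++i) {
        out[i] = x1 + (x2 - x1) * kp[i];          // x of landmark i
        out[i + 5] = y1 + (y2 - y1) * kp[i + 5];  // y of landmark i
    }
    return out;
}
```

For a box from (10, 20) to (110, 220), a landmark at (0.5, 0.5) lands at pixel (60, 120), the centre of the box.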