基于OpenCV和YOLOv3深度学习的目标检测

最新推荐文章于 2024-10-10 00:03:54 发布

峻峰飞阳

最新推荐文章于 2024-10-10 00:03:54 发布

阅读量2.5k

点赞数 2

分类专栏：人工智能机器学习

人工智能同时被 2 个专栏收录

12 篇文章 0 订阅

订阅专栏

机器学习

11 篇文章 0 订阅

订阅专栏

本文翻译自Deep Learning based Object Detection using YOLOv3 with OpenCV ( Python / C++ )

基于OpenCV和YOLOv3深度学习的目标检测

本文，我们学习如何在OpenCV上使用目前较为先进的目标检测技术YOLOv3。

YOLOv3是当前流行的目标检测算法YOLO(You Only Look Once)的最新变种算法。所发行的模型能识别图片和视频中的80种物体，而且更重要的是它实时性强，而且准确度接近Single Shot MultiBox（SSD）。

从OpenCV 3.4.2开始，我们可以很容易的在OpenCV应用中使用YOLOv3模型（即OpemCV-3.4.2开始支持YOLOv3这网络框架）。

YOLO是什么原理？

我们可以把目标检测看成是目标定位和目标识别的结合。

在传统的计算机视觉方法中，采用滑动窗口查找不同区域和大小的目标。因为这是消耗量较大的算法，通常假定目标的纵横比是固定的。

早期的基于深度学习的目标检测算法，如R-CNN和快速R-CNN，采用选择型搜索（Selective Search）来缩小必须测试的边界框的数量（本文的边界框指的是，在预测到疑似所识别到的目标后，在图片上把对象框出的一个矩形）。

另外一种称为Overfeat的方法，通过卷积地计算滑动窗口，以多个尺度扫描了图像。

然后有人提出了快速R-CNN算法，使用Region Proposal Network(RPN)区别将要测试的边界框。通过巧妙的设计，用于目标识别的特征点，也被RPN用于提出潜在的边界框，因此节省了大量的计算。

然而，YOLO使用了完全不同的方法解决目标检测问题。它将图像进行神经网络的一次性正向处理。SSD是另外一种将图像进行神经网络一次性正向处理的方法，但是YOLOv3比SSD实现了更高的精度，同时又较快的运算速度。YOLOv3在M40，TitanX和1080Ti这类GPU上实时效果更好。

让我们看看YOLO如何在一张图片中检测目标。

首先，它把原图按比例平均分解成一张有13x13网格的图片。这169个单元会根据原图的大小而改变。对于一张416x416像素的图片，每个图片单元的大小是32x32像素。处理图片时，会以图片单元为单位，预测单位中的多个边界框。

对于每个边界框，这个网络会计算所包含物体的边界框的置信度，同时计算所包含的目标是属于一个特定类别的可能性大小。

非最大抑制（non-maximum suppression）可以消除低置信度的边界框，以及把同时包围着单个物体的多个高置信度的边界框消除到只剩下一个。

YOLOv3的作者，Joseph Redmon和Ali Farhadi，让YOLOv3比前一代YOLOv2更加精确和快速。YOLOv3在处理多个不同尺寸图片的场合中得到了优化。他们还通过加大了网络，并添加快捷链接将其引入剩余网络来改进网络。

为什么选择OpenCV的YOLO

这里有三个理由。

容易整合到现有的OpenCV程序中：如果应用程序已经使用了OpenCV，并想简单地使用YOLOv3，完全不需要担心Darknet源代码的编译和建立。
OpenCV的CPU版本的运算速度比Darknet+OpenMP快9倍：OpenCV的DNN模块，其CPU运行是十分快的。举个例子，当用了OpenMP的Darknet在CPU上处理一张图片消耗2秒，OpenCV的实现只需要0.22秒。具体请看下面的表格。
支持Python。Darknet是用C语言写的，因此并不支持Python。相反，OpenCV是支持Python的。会有支持Darknet的编程接口。

在Darknet和OpenCV上跑YOLOv3的速度测试

下面的表格展示了在Darknet和OpenCV上YOLOv3的性能差距，输入图片的尺寸是416x416。不出所料，GPU版本的Darknet在性能上比其他方式优越。同时，理所当然的Darknet配合OpenMP会好于没有OpenMP的Darknet，因为OpenMP支持多核的CPU。

意外的是，CPU版本的OpenCV在执行DNN的运算速度，是9倍的快过Darknet和OpenML。

表1. 分别在Darknet和OpenCV上跑YOLOv3的速度对比
OS	Framework	CPU/GPU	Time(ms)/Frame
Linux 16.04	Darknet	12x Intel Core i7-6850K CPU @ 3.60GHz	9370
Linux 16.04	Darknet + OpenMP	12x Intel Core i7-6850K CPU @ 3.60GHz	1942
Linux 16.04	OpenCV [CPU]	12x Intel Core i7-6850K CPU @ 3.60GHz	220
Linux 16.04	Darknet	NVIDIA GeForce 1080 Ti GPU	23
macOS	DarkNet	2.5 GHz Intel Core i7 CPU	7260
macOS	OpenCV [CPU]	2.5 GHz Intel Core i7 CPU	400

注意：我们在GPU版本的OpenCV上跑DNN时候遇到了困难。本文工作只是测试了Intel的GPU，因此如果没有Intel的GPU，程序会自动切换到CPU上跑相关算法。

采用YOLOv3的目标检测，C++/Python两种语言

让我们看看，如何在YOLOv3在OpenCV运行目标检测。

第1步：下载模型。

我们先从命令行中执行脚本getModels.sh开始。

sudo chmod a+x getModels.sh
./getModels.sh

//译者添加：
Windows下替代方案：
1、http://gnuwin32.sourceforge.net/packages/wget.htm 安装wget
cd 到wget安装目录，执行
wget https://pjreddie.com/media/files/yolov3.weights
wget https://github.com/pjreddie/darknet/blob/master/cfg/yolov3.cfg?raw=true -O ./yolov3.cfg
wget https://github.com/pjreddie/darknet/blob/master/data/coco.names?raw=true -O ./coco.names

执行命令后开始下载yolov3.weights文件（包括了提前训练好的网络的权值），和yolov3.cfg文件（包含了网络的配置方式）和coco.names（包括了COCO数据库中使用的80种不同的目标种类名字）。

第2步：初始化参数

YOLOv3算法的预测结果就是边界框。每一个边界框都旁随着一个置信值。第一阶段中，全部低于置信度阀值的都会排除掉。

对剩余的边界框执行非最大抑制算法，以去除重叠的边界框。非最大抑制由一个参数nmsThrehold控制。读者可以尝试改变这个数值，观察输出的边界框的改变。

接下来，设置输入图片的宽度（inpWidth）和高度（inpHeight）。我们设置他们为416，以便对比YOLOv3作者提供的Darknets的C代码。如果想要更快的速度，读者可以把宽度和高度设置为320。如果想要更准确的结果，改变他们到608。

Python代码：

# Initialize the parameters
confThreshold = 0.5 #Confidence threshold
nmsThreshold = 0.4 #Non-maximum suppression threshold
inpWidth = 416 #Width of network's input image
inpHeight = 416 #Height of network's input image

C++代码：

// Initialize the parameters
float confThreshold = 0.5; // Confidence threshold
float nmsThreshold = 0.4; // Non-maximum suppression threshold
int inpWidth = 416; // Width of network's input image
int inpHeight = 416; // Height of network's input image

第3步：读取模型和类别

文件coco.names包含了训练好的模型能识别的所有目标名字。我们读出各个类别的名字。

接着，我们读取了网络，其包含两个部分：

yolov3.weights: 预训练得到的权重。
yolov3.cfg：配置文件

我们把DNN的后端设置为OpenCV，目标设置为CPU。可以通过使cv.dnn.DNN_TARGET_OPENCL置为GPU，尝试设定偏好的运行目标为GPU。但是要记住当前的OpenCV版本只在Intel的GPU上测试，如果没有Intel的GPU则程序会自动设置为CPU。

Python:

# Load names of classes
classesFile = "coco.names";
classes = None
with open(classesFile, 'rt') as f:
classes = f.read().rstrip('\n').split('\n')
# Give the configuration and weight files for the model and load the network using them.
modelConfiguration = "yolov3.cfg";
modelWeights = "yolov3.weights";
net = cv.dnn.readNetFromDarknet(modelConfiguration, modelWeights)
net.setPreferableBackend(cv.dnn.DNN_BACKEND_OPENCV)
net.setPreferableTarget(cv.dnn.DNN_TARGET_CPU)

C++

// Load names of classes
string classesFile = "coco.names";
ifstream ifs(classesFile.c_str());
string line;
while (getline(ifs, line)) classes.push_back(line);
// Give the configuration and weight files for the model
String modelConfiguration = "yolov3.cfg";
String modelWeights = "yolov3.weights";
// Load the network
Net net = readNetFromDarknet(modelConfiguration, modelWeights);
net.setPreferableBackend(DNN_BACKEND_OPENCV);
net.setPreferableTarget(DNN_TARGET_CPU);

第4步：读取输入

这一步我们读取图像，视频流或者网络摄像头。另外，我们也使用Videowriter（OpenCV里的一个类）以视频方式保存带有输出边界框的每一帧图片。

Python

outputFile = "yolo_out_py.avi"
if (args.image):
# Open the image file
if not os.path.isfile(args.image):
print("Input image file ", args.image, " doesn't exist")
sys.exit(1)
cap = cv.VideoCapture(args.image)
outputFile = args.image[:-4]+'_yolo_out_py.jpg'
elif (args.video):
# Open the video file
if not os.path.isfile(args.video):
print("Input video file ", args.video, " doesn't exist")
sys.exit(1)
cap = cv.VideoCapture(args.video)
outputFile = args.video[:-4]+'_yolo_out_py.avi'
else:
# Webcam input
cap = cv.VideoCapture(0)
# Get the video writer initialized to save the output video
if (not args.image):
vid_writer = cv.VideoWriter(outputFile, cv.VideoWriter_fourcc('M','J','P','G'), 30, (round(cap.get(cv.CAP_PROP_FRAME_WIDTH)),round(cap.get(cv.CAP_PROP_FRAME_HEIGHT))))

C++

outputFile = "yolo_out_cpp.avi";
if (parser.has("image"))
{
// Open the image file
str = parser.get<String>("image");
ifstream ifile(str);
if (!ifile) throw("error");
cap.open(str);
str.replace(str.end()-4, str.end(), "_yolo_out.jpg");
outputFile = str;
}
else if (parser.has("video"))
{
// Open the video file
str = parser.get<String>("video");
ifstream ifile(str);
if (!ifile) throw("error");
cap.open(str);
str.replace(str.end()-4, str.end(), "_yolo_out.avi");
outputFile = str;
}
// Open the webcaom
else cap.open(parser.get<int>("device"));
// Get the video writer initialized to save the output video
if (!parser.has("image")) {
video.open(outputFile, VideoWriter::fourcc('M','J','P','G'), 28, Size(cap.get(CAP_PROP_FRAME_WIDTH), cap.get(CAP_PROP_FRAME_HEIGHT)));
}

第5步：处理每一帧

输入到神经网络的图像需要以一种叫bolb的格式保存。

读取了输入图片或者视频流的一帧图像后，这帧图像需要经过bolbFromImage()函数处理为神经网络的输入类型bolb。在这个过程中，图像像素以一个1/255的比例因子，被缩放到0到1之间。同时，图像在不裁剪的情况下，大小调整到416x416。注意我们没有降低图像平均值，因此传递[0,0,0]到函数的平均值输入，保持swapRB参数到默认值1。

输出的bolb传递到网络，经过网络正向处理，网络输出了所预测到的一个边界框清单。这些边界框通过后处理，滤除了低置信值的。我们随后再详细的说明后处理的步骤。我们在每一帧的左上方打印出了推断时间。伴随着最后的边界框的完成，图像保存到硬盘中，之后可以作为图像输入或者通过Videowriter作为视频流输入。

Python：

while cv.waitKey(1) < 0:
# get frame from the video
hasFrame, frame = cap.read()
# Stop the program if reached end of video
if not hasFrame:
print("Done processing !!!")
print("Output file is stored as ", outputFile)
cv.waitKey(3000)
break
# Create a 4D blob from a frame.
blob = cv.dnn.blobFromImage(frame, 1/255, (inpWidth, inpHeight), [0,0,0], 1, crop=False)
# Sets the input to the network
net.setInput(blob)
# Runs the forward pass to get output of the output layers
outs = net.forward(getOutputsNames(net))
# Remove the bounding boxes with low confidence
postprocess(frame, outs)
# Put efficiency information. The function getPerfProfile returns the
# overall time for inference(t) and the timings for each of the layers(in layersTimes)
t, _ = net.getPerfProfile()
label = 'Inference time: %.2f ms' % (t * 1000.0 / cv.getTickFrequency())
cv.putText(frame, label, (0, 15), cv.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255))
# Write the frame with the detection boxes
if (args.image):
cv.imwrite(outputFile, frame.astype(np.uint8));
else:
vid_writer.write(frame.astype(np.uint8))

c++

// Process frames.
while (waitKey(1) < 0)
{
// get frame from the video
cap >> frame;
// Stop the program if reached end of video
if (frame.empty()) {
cout << "Done processing !!!" << endl;
cout << "Output file is stored as " << outputFile << endl;
waitKey(3000);
break;
}
// Create a 4D blob from a frame.
blobFromImage(frame, blob, 1/255.0, cvSize(inpWidth, inpHeight), Scalar(0,0,0), true, false);
//Sets the input to the network
net.setInput(blob);
// Runs the forward pass to get output of the output layers
vector<Mat> outs;
net.forward(outs, getOutputsNames(net));
// Remove the bounding boxes with low confidence
postprocess(frame, outs);
// Put efficiency information. The function getPerfProfile returns the
// overall time for inference(t) and the timings for each of the layers(in layersTimes)
vector<double> layersTimes;
double freq = getTickFrequency() / 1000;
double t = net.getPerfProfile(layersTimes) / freq;
string label = format("Inference time for a frame : %.2f ms", t);
putText(frame, label, Point(0, 15), FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 0, 255));
// Write the frame with the detection boxes
Mat detectedFrame;
frame.convertTo(detectedFrame, CV_8U);
if (parser.has("image")) imwrite(outputFile, detectedFrame);
else video.write(detectedFrame);
}

现在，让我们详细分析一下上面调用的函数。

第5a步：得到输出层的名字

OpenCV的网络类中的前向功能需要结束层，直到它在网络中运行。因为我们需要运行整个网络，所以我们需要识别网络中的最后一层。我们通过使用getUnconnectedOutLayers()获得未连接的输出层的名字，该层基本就是网络的最后层。然后我们运行前向网络，得到输出，如前面的代码片段（net.forward(getOutputsNames(net))）。
python:

# Get the names of the output layers
def getOutputsNames(net):
# Get the names of all the layers in the network
layersNames = net.getLayerNames()
# Get the names of the output layers, i.e. the layers with unconnected outputs
return [layersNames[i[0] - 1] for i in net.getUnconnectedOutLayers()]

c++

// Get the names of the output layers
vector<String> getOutputsNames(const Net& net)
{
static vector<String> names;
if (names.empty())
{
//Get the indices of the output layers, i.e. the layers with unconnected outputs
vector<int> outLayers = net.getUnconnectedOutLayers();
//get the names of all the layers in the network
vector<String> layersNames = net.getLayerNames();
// Get the names of the output layers in names
names.resize(outLayers.size());
for (size_t i = 0; i < outLayers.size(); ++i)
names[i] = layersNames[outLayers[i] - 1];
}
return names;
}

第5b步：后处理网络输出

网络输出的每个边界框都分别由一个包含着类别名字和5个元素的向量表示。

头四个元素代表center_x, center_y, width和height。第五个元素表示包含着目标的边界框的置信度。

其余的元素是和每个类别（如目标种类）有关的置信度。边界框分配给最高分数对应的那一种类。

一个边界框的最高分数也叫做它的置信度（confidence）。如果边界框的置信度低于规定的阀值，算法上不再处理这个边界框。

置信度大于或等于置信度阀值的边界框，将进行非最大抑制。这会减少重叠的边界框数目。
Python

# Remove the bounding boxes with low confidence using non-maxima suppression
def postprocess(frame, outs):
frameHeight = frame.shape[0]
frameWidth = frame.shape[1]
classIds = []
confidences = []
boxes = []
# Scan through all the bounding boxes output from the network and keep only the
# ones with high confidence scores. Assign the box's class label as the class with the highest score.
classIds = []
confidences = []
boxes = []
for out in outs:
for detection in out:
scores = detection[5:]
classId = np.argmax(scores)
confidence = scores[classId]
if confidence > confThreshold:
center_x = int(detection[0] * frameWidth)
center_y = int(detection[1] * frameHeight)
width = int(detection[2] * frameWidth)
height = int(detection[3] * frameHeight)
left = int(center_x - width / 2)
top = int(center_y - height / 2)
classIds.append(classId)
confidences.append(float(confidence))
boxes.append([left, top, width, height])
# Perform non maximum suppression to eliminate redundant overlapping boxes with
# lower confidences.
indices = cv.dnn.NMSBoxes(boxes, confidences, confThreshold, nmsThreshold)
for i in indices:
i = i[0]
box = boxes[i]
left = box[0]
top = box[1]
width = box[2]
height = box[3]
drawPred(classIds[i], confidences[i], left, top, left + width, top + height)

c++

// Remove the bounding boxes with low confidence using non-maxima suppression
void postprocess(Mat& frame, const vector<Mat>& outs)
{
vector<int> classIds;
vector<float> confidences;
vector<Rect> boxes;
for (size_t i = 0; i < outs.size(); ++i)
{
// Scan through all the bounding boxes output from the network and keep only the
// ones with high confidence scores. Assign the box's class label as the class
// with the highest score for the box.
float* data = (float*)outs[i].data;
for (int j = 0; j < outs[i].rows; ++j, data += outs[i].cols)
{
Mat scores = outs[i].row(j).colRange(5, outs[i].cols);
Point classIdPoint;
double confidence;
// Get the value and location of the maximum score
minMaxLoc(scores, 0, &confidence, 0, &classIdPoint);
if (confidence > confThreshold)
{
int centerX = (int)(data[0] * frame.cols);
int centerY = (int)(data[1] * frame.rows);
int width = (int)(data[2] * frame.cols);
int height = (int)(data[3] * frame.rows);
int left = centerX - width / 2;
int top = centerY - height / 2;
classIds.push_back(classIdPoint.x);
confidences.push_back((float)confidence);
boxes.push_back(Rect(left, top, width, height));
}
}
}
// Perform non maximum suppression to eliminate redundant overlapping boxes with
// lower confidences
vector<int> indices;
NMSBoxes(boxes, confidences, confThreshold, nmsThreshold, indices);
for (size_t i = 0; i < indices.size(); ++i)
{
int idx = indices[i];
Rect box = boxes[idx];
drawPred(classIds[idx], confidences[idx], box.x, box.y,
box.x + box.width, box.y + box.height, frame);
}
}

非最大抑制由参数nmsThreshold控制。如果nmsThreshold设置太少，比如0.1，我们可能检测不到相同或不同种类的重叠目标。如果设置得太高，比如1，可能出现一个目标有多个边界框包围。所以我们在上面的代码使用了0.4这个中间的值。下面的gif展示了NMS阀值改变时候的效果。

第5c步：画出计算得到的边界框

最后，经过非最大抑制后，得到了边界框。我们把边界框在输入帧上画出，并标出种类名和置信值。

Python

# Draw the predicted bounding box
def drawPred(classId, conf, left, top, right, bottom):
# Draw a bounding box.
cv.rectangle(frame, (left, top), (right, bottom), (0, 0, 255))
label = '%.2f' % conf
# Get the label for the class name and its confidence
if classes:
assert(classId < len(classes))
label = '%s:%s' % (classes[classId], label)
#Display the label at the top of the bounding box
labelSize, baseLine = cv.getTextSize(label, cv.FONT_HERSHEY_SIMPLEX, 0.5, 1)
top = max(top, labelSize[1])
cv.putText(frame, label, (left, top), cv.FONT_HERSHEY_SIMPLEX, 0.5, (255,255,255))

c++

// Draw the predicted bounding box
void drawPred(int classId, float conf, int left, int top, int right, int bottom, Mat& frame)
{
//Draw a rectangle displaying the bounding box
rectangle(frame, Point(left, top), Point(right, bottom), Scalar(0, 0, 255));
//Get the label for the class name and its confidence
string label = format("%.2f", conf);
if (!classes.empty())
{
CV_Assert(classId < (int)classes.size());
label = classes[classId] + ":" + label;
}
//Display the label at the top of the bounding box
int baseLine;
Size labelSize = getTextSize(label, FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);
top = max(top, labelSize.height);
putText(frame, label, Point(left, top), FONT_HERSHEY_SIMPLEX, 0.5, Scalar(255,255,255));