Convolutional Neural Network - Object Detection

Overview

Challenges in object detection:

  • deformation
  • occlusion
  • dense vs sparse
  • hard negative mining: Imbalance between positive and negative examples
  • overlapping boxes

 

 

Datasets

The PASCAL Visual Object Classes (PASCAL VOC) dataset is a well-known benchmark for object detection, classification, segmentation and related tasks. As of 2012, it contains around 10,000 training and validation images annotated with bounding boxes. Although PASCAL VOC covers only 20 categories, it is still considered a reference dataset for the object detection problem.

Since 2013, ImageNet has also released an object detection dataset with bounding boxes. Its training set contains around 500,000 images across 200 categories. It is rarely used because its size demands substantial computational power for training, and the large number of classes complicates the recognition task. A comparison between the 2014 ImageNet dataset and the 2012 PASCAL VOC dataset is available here.

The Common Objects in COntext (COCO) dataset, on the other hand, was developed by Microsoft and is described by T.-Y. Lin et al. (2015). It is used for multiple challenges: caption generation, object detection, keypoint detection and object segmentation. The detection challenge consists of localizing the objects in an image with bounding boxes and assigning each of them to one of 80 categories.

Apart from the general object detection benchmarks, vehicle and pedestrian detection is an important application in self-driving research. The KITTI and CityScapes datasets provide street-view images for such specialized tasks. Note that these datasets also contain stereo images from which depth information can be learned.

Metrics

To assess spatial precision, we first need to remove the boxes with low confidence (the model usually outputs many more boxes than there are actual objects). The Intersection over Union (IoU) is a common metric that reflects the overlap between a predicted box and the ground-truth box: it is the area of their intersection divided by the area of their union. The higher the IoU, the better the predicted location of the box for a given object.
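The IoU computation can be sketched in a few lines of Python. This is a minimal illustration, assuming boxes are given as corner coordinates `(x1, y1, x2, y2)`; the function names and the `min_score` default are hypothetical and not taken from any particular library.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_by_confidence(detections, min_score=0.5):
    """Keep only (box, score) pairs whose score reaches min_score (assumed threshold)."""
    return [(box, score) for box, score in detections if score >= min_score]
```

For example, two 2x2 boxes that overlap in a 1x1 square have an intersection of 1 and a union of 7, so their IoU is 1/7.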

In binary classification, the Average Precision (AP) metric summarizes the precision-recall curve; details are provided here. The metric most commonly used in object detection challenges is the mean Average Precision (mAP), which is simply the mean of the Average Precisions computed over all the classes of the challenge.
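As a rough sketch of how AP and mAP can be computed, the code below ranks the detections of one class by confidence, builds the precision-recall curve and sums the area under its interpolated envelope. It assumes each detection has already been marked as a true or false positive (for instance by an IoU threshold against the ground truth), and the all-point interpolation used here is one common convention rather than the only one; all names are illustrative.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the interpolated precision-recall curve for one class."""
    if num_ground_truth == 0 or not scores:
        return 0.0
    # Rank detections by decreasing confidence.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp, fp = 0, 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Interpolate: make precision non-increasing when read from left to right.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum the rectangles between consecutive recall values.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values, given as a {class_name: ap} dict."""
    return sum(ap_per_class.values()) / len(ap_per_class)
```

With three ranked detections marked true, false, true and two ground-truth objects, the interpolated curve gives an AP of 5/6.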

The mAP score is usually computed for a fixed IoU threshold, but at a single threshold a model can emit a large number of candidate boxes with little penalty. The COCO challenge defines an official metric to discourage this over-generation of boxes: it averages the mAP scores over a range of IoU thresholds (0.50 to 0.95 in steps of 0.05), which penalizes large numbers of poorly localized or wrongly classified boxes.
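The averaging over IoU values can be written as a small helper. This sketch assumes a caller-supplied function `map_at_iou(threshold)` (a hypothetical name) that returns the mAP at a single IoU threshold, and uses the ten thresholds 0.50, 0.55, ..., 0.95 of the primary COCO metric.

```python
# The ten IoU thresholds 0.50, 0.55, ..., 0.95.
COCO_IOU_THRESHOLDS = [0.5 + 0.05 * i for i in range(10)]


def coco_style_map(map_at_iou):
    """Average a per-threshold mAP function over the COCO IoU thresholds."""
    values = [map_at_iou(t) for t in COCO_IOU_THRESHOLDS]
    return sum(values) / len(values)
```

A detector whose boxes are only loosely aligned scores well at the 0.50 threshold but poorly at 0.90, so its averaged score drops even if its mAP at a single loose threshold looks strong.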

  

R-CNN, SPP-Net, Fast R-CNN and Faster R-CNN

As we know, image classification is probably the most common use of deep neural networks in computer vision. Most CNNs are designed to extract features from an image that can later be used to recognize it. This leads to an intuitive strategy for object detection, region-based classification: region proposals, or regions of interest (RoIs), are proposed first and then fed into classifiers to decide whether each one contains an object and, if so, what it is.

The R-CNN network (Regions with CNN features) was proposed by Ross Girshick et al. in 2014. For each image, about 2,000 regions of interest are extracted, and each RoI is warped to the same size and shape before its CNN features are computed. The features are then fed into a classifier and a regressor to determine the class of the object (possibly background) and the $4\times 1$ offset $(dx, dy, dw, dh)$ for the bounding box.

For training, the CNN is usually pre-trained for image classification and later fine-tuned for object detection. A classifier (e.g. a linear SVM) is then trained for object detection.
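To make the role of the offset $(dx, dy, dw, dh)$ concrete, here is a minimal sketch of how a predicted offset can be applied to a proposal box. It assumes the parameterization used in the R-CNN paper, where the proposal is given by its center, width and height, the center shift is scaled by the box size and the size change is predicted in log space; the function name is illustrative.

```python
import math


def apply_offsets(proposal, offsets):
    """Refine a proposal (center_x, center_y, width, height) with (dx, dy, dw, dh)."""
    px, py, pw, ph = proposal
    dx, dy, dw, dh = offsets
    gx = pw * dx + px        # shift the center, scaled by the box width
    gy = ph * dy + py        # shift the center, scaled by the box height
    gw = pw * math.exp(dw)   # rescale the width (log-space offset)
    gh = ph * math.exp(dh)   # rescale the height (log-space offset)
    return gx, gy, gw, gh
```

An all-zero offset leaves the proposal unchanged, which is why this form is convenient for regression: the network only has to learn small corrections.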

 

R-CNN: Process and Network Structure

The best R-CNN models have achieved a 62.4% mAP score over the PASCAL VOC 2012 test dataset and a 31.4% mAP score over the 2013 ImageNet dataset.

In 2015, two improvements on R-CNN were proposed. The first, called "Fast R-CNN", was proposed by Girshick, R. The purpose of the Fast Region-based Convolutional Network (Fast R-CNN) is to reduce the time consumed by the large number of model passes needed to analyse all region proposals.

Unlike R-CNN, Fast R-CNN runs the convolutional layers once on the whole image and maps the RoIs, still proposed by the selective search method (see this for algorithm detail), onto the resulting feature maps. These RoI patches then go into the RoI pooling layer. The size of the input patch may differ, but the output always has a fixed height and width, which are hyperparameters. Fast R-CNN achieves this by max pooling over a fixed grid of sub-windows; a spatial transformer using bilinear interpolation can serve the same purpose. The core idea is to build a correspondence between the RoI feature patch and the original image coordinates.

The fixed-size RoI features are then fed into the fully connected layers, producing a feature vector. This vector is used to predict the class of the observed object with a softmax classifier and to refine the bounding box location with a linear regressor.
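The RoI pooling step described above can be sketched as follows. This is a simplified, single-channel illustration in pure Python that uses max pooling over a grid of sub-windows; it assumes the RoI is already expressed in feature-map coordinates as end-exclusive row and column ranges, and the function name is hypothetical.

```python
import math


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the RoI (r0, c0, r1, c1), end-exclusive, into an out_h x out_w grid."""
    r0, c0, r1, c1 = roi
    h, w = r1 - r0, c1 - c0
    pooled = []
    for i in range(out_h):
        # Row range of the i-th bin; floor/ceil keep every bin non-empty.
        rs = r0 + math.floor(i * h / out_h)
        re = r0 + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            cs = c0 + math.floor(j * w / out_w)
            ce = c0 + math.ceil((j + 1) * w / out_w)
            row.append(max(feature_map[r][c]
                           for r in range(rs, re)
                           for c in range(cs, ce)))
        pooled.append(row)
    return pooled
```

Whatever the RoI size, the output has the same `out_h` by `out_w` shape, which is what lets the fully connected layers accept proposals of arbitrary size.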

   

 

Architecture for Fast R-CNN

The best Fast R-CNN models have reached mAP scores of 70.0% for the 2007 PASCAL VOC test dataset, 68.8% for the 2010 PASCAL VOC test dataset and 68.4% for the 2012 PASCAL VOC test dataset.

Later in the same year, another method called "Faster R-CNN" was published. The major difference between Fast R-CNN and this method is that it trains a region proposal network (RPN) to provide proposals rather than relying on an external region proposal technique. The RPN and the classification network share a base CNN structure. During training, Faster R-CNN uses an alternating scheme.

 

The best Faster R-CNN models have obtained mAP scores of 78.8% over the 2007 PASCAL VOC test dataset and 75.9% over the 2012 PASCAL VOC test dataset. They have been trained with the PASCAL VOC and COCO datasets. One of these models is 34 times faster than Fast R-CNN using the selective search method.

SSD

 

 

 

R-FCN

 

Summary

 

 

R-CNN

RoI proposals are generated on the input images

Image patches are warped to a fixed size and then fed to the whole network

Fast R-CNN

RoI proposals are applied to the feature maps (after the conv. feature extraction layers), then the feature patches are fed to the fc. layers

Faster R-CNN

RoIs are proposed from the feature maps (after conv.), then the proposals and the feature map are fed to the fc. layers

Shared conv. network; four loss functions

RoI proposals: "anchors" of different scales and aspect ratios
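As an illustration of anchors of different scales and aspect ratios, the sketch below generates the anchors centered on one location. The default scales and ratios (three of each, giving nine anchors) follow the values reported in the Faster R-CNN paper; the function name and the convention that a ratio means height divided by width are assumptions made for this example.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors centered at (cx, cy), one per scale/ratio pair."""
    anchors = []
    for s in scales:
        for r in ratios:
            # Keep the area close to s * s while setting height / width = r.
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

The RPN then scores each anchor as object or background and regresses an offset for it, so the anchors play the role that externally generated proposals play in Fast R-CNN.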

SSD

Extra conv. feature layers that decrease in size progressively are added to the base conv. network, which allows multi-scale prediction

Default boxes: k boxes of different aspect ratios per location, each producing c+4 outputs (# of classes + bbox offsets), which requires k(c+4) filters in total
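The k(c+4) count can be checked with a little arithmetic. The helper below is a hypothetical illustration: it returns the number of filters needed at one prediction layer and, for an m x n feature map, the total number of outputs that layer produces.

```python
def ssd_head_size(num_classes, boxes_per_location, fmap_h, fmap_w):
    """Filters per prediction layer and total outputs for one feature map."""
    # Each default box predicts c class scores plus 4 box offsets.
    filters = boxes_per_location * (num_classes + 4)
    # Every filter is evaluated at each of the m x n feature-map locations.
    outputs = filters * fmap_h * fmap_w
    return filters, outputs
```

For instance, with c = 20 and k = 6 on a 10 x 10 feature map, the layer needs 144 filters and yields 14,400 outputs.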

R-FCN

Like Faster R-CNN, the method has a region proposal sub-network and a region classification sub-network. In the classification stage, a conv. layer is added to generate position-sensitive score maps: for each of the c+1 categories there is a $k^2$ spatial grid, giving $k^2(c+1)$ output channels. This is followed by a position-sensitive RoI pooling layer
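A simplified sketch of position-sensitive RoI pooling is given below. It assumes the score maps are stored as a nested list indexed by grid cell first and category second, so that there are $k^2$ groups of c+1 maps, and that the RoI is given in feature-map coordinates with end-exclusive ranges. Each grid cell of the RoI is average-pooled only from its own group of maps, and the $k^2$ cells then vote by averaging. The layout and names are choices made for this example.

```python
import math


def ps_roi_pool(score_maps, roi, k):
    """score_maps[i * k + j][category][row][col]; roi = (r0, c0, r1, c1), end-exclusive."""
    r0, c0, r1, c1 = roi
    h, w = r1 - r0, c1 - c0
    num_categories = len(score_maps[0])
    scores = [0.0] * num_categories
    for i in range(k):
        rs = r0 + math.floor(i * h / k)
        re = r0 + math.ceil((i + 1) * h / k)
        for j in range(k):
            cs = c0 + math.floor(j * w / k)
            ce = c0 + math.ceil((j + 1) * w / k)
            for cat in range(num_categories):
                # Cell (i, j) reads only from the maps dedicated to that cell.
                plane = score_maps[i * k + j][cat]
                values = [plane[r][c]
                          for r in range(rs, re)
                          for c in range(cs, ce)]
                scores[cat] += sum(values) / len(values)
    # Vote: average the k * k cell responses for each category.
    return [s / (k * k) for s in scores]
```

Because each cell looks at a different group of maps, a category only scores highly when its parts appear in the right relative positions inside the RoI.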

 

References

Girshick, R., et al. "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation." IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, 2014: 580-587.

Girshick, R. "Fast R-CNN." IEEE International Conference on Computer Vision, IEEE Computer Society, 2015: 1440-1448.

Ren, S., He, K., Girshick, R., et al. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks." International Conference on Neural Information Processing Systems, MIT Press, 2015: 91-99.

Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., et al. "SSD: Single Shot MultiBox Detector." 2015: 21-37.

Dai, J., Li, Y., He, K., et al. "R-FCN: Object Detection via Region-based Fully Convolutional Networks." 2016.

https://blog.csdn.net/windtalkersm/article/details/79704777

https://medium.com/comet-app/review-of-deep-learning-algorithms-for-object-detection-c1f3d437b852

https://blog.csdn.net/guoyunfei20/article/details/78723646

Reposted from: https://www.cnblogs.com/everythingbagel/p/9176434.html
