R-FCN Paper Notes

Compared with the earlier Faster R-CNN, R-FCN runs faster and is more efficient, with no drop in accuracy. So what exactly is the R-FCN network, and how does it differ from Faster R-CNN?

I. The basic structure of R-FCN:

1. A base convolutional network: ResNet-101 ---------------> extracts the shared base features

2. An RPN (taken from Faster R-CNN)

3. A position-sensitive prediction layer

4. A decision layer: ROI pooling + voting


--- The idea motivating R-FCN:

Classification wants features that are translation invariant, while detection requires an accurate response to object translation. In Faster R-CNN-style methods, everything before ROI pooling is convolutional and therefore translation invariant, but once ROI pooling is inserted, the network after it no longer has this property. We therefore need position-sensitive score maps to fold object-location information into ROI pooling.

--- The goal:

Move as much of the costly convolution as possible into the shared subnetwork up front, i.e., the network that produces the shared feature maps.

--- The concrete improvement:

In Faster R-CNN with a ResNet backbone, the first 91 layers are shared, ROI pooling is inserted, and the last 10 layers run per ROI (not shared).

R-FCN moves all 101 layers into the shared subnetwork and uses only a single convolutional layer for the final prediction, greatly reducing computation.

--- Implementation:


Backbone architecture:    ResNet-101 -------- remove the final fully connected layer of the original ResNet-101, keep the first 100 layers, and append a 1*1*1024 fully convolutional layer (the 100-layer output has 2048 channels, so a 1*1 convolution is introduced for dimensionality reduction).
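As a sanity check on the shapes, a 1*1 convolution is just a per-pixel linear map on the channel dimension. A minimal numpy sketch (the spatial size and random weights are illustrative, not from the paper):

```python
import numpy as np

# Illustrative shapes only: a 1x1 convolution is a per-pixel linear map
# over channels, here reducing 2048 backbone channels to 1024.
H, W, C_in, C_out = 4, 4, 2048, 1024
feat = np.random.randn(H, W, C_in)
kernel = np.random.randn(C_in, C_out) * 0.01  # 1x1 conv = channel matmul
reduced = feat @ kernel                        # shape (H, W, 1024)
```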

The k^2(C+1) conv layer:     ResNet's output is W*H*1024. Convolving it with k^2(C+1) kernels of size 1024*1*1 yields k^2(C+1) position-sensitive score maps, each of size W*H. This convolution is the prediction step. With k=3, an ROI is divided into a 3*3 grid whose 9 positions are: top-left (upper-left corner), top-center, top-right, middle-left, middle-center, middle-right, bottom-left, bottom-center, bottom-right (lower-right corner).
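A quick check of the score-map count for PASCAL VOC (C = 20 object classes plus one background class, k = 3):

```python
# Score-map count for k = 3 and C = 20 (PASCAL VOC), plus background.
k, C = 3, 20
num_score_maps = k * k * (C + 1)  # 9 position blocks, 21 maps each
```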

Physical meaning of the k^2(C+1) feature maps: in the paper's figure there are k*k=9 colors, and each color's slab (of size W*H*(C+1)) encodes the probability that an object part appears at one particular relative position (the first, yellow slab corresponds to the top-left position, the last, light-blue slab to the bottom-right). In total there are k^2*(C+1) feature maps. For a single map, z(i,j,c) denotes the c-th map in the (i + k(j-1))-th slab (1 <= i,j <= 3). The pair (i,j) selects one of the 9 positions, say top-left (i=j=1), and c selects the class, say 'person'. If the pixel at position (x,y) in z(i,j,c) has value `value`, then `value` is the probability that position (x,y) in the original image belongs to a person (c='person') and, specifically, to the top-left part of that person (i=j=1).
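The indexing convention above can be captured by a small helper that maps (i, j, c) to a flat channel index (a hypothetical helper, assuming channels are laid out block by block as described):

```python
def score_map_index(i, j, c, k=3, num_classes=21):
    """Hypothetical helper: 0-based flat channel index of score map z(i, j, c),
    the c-th class map inside the (i + k*(j-1))-th position block
    (i, j, c are 1-based, as in the text above)."""
    block = i + k * (j - 1)                      # 1-based position block
    return (block - 1) * num_classes + (c - 1)   # 0-based channel index
```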

ROI pooling: the same pooling as in Faster R-CNN, i.e., a single-level SPP structure, used to map the feature maps of differently sized ROIs to features of the same dimension. That is, no matter how large the ROI is, lay an n*n grid of bins over it and pool (average) all the pixels inside each bin; regardless of the image size, the pooled ROI feature is then n*n. Note that ROI pooling is done on each feature map independently, not jointly across channels.
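The per-map average pooling can be sketched in numpy as follows (bin boundaries are simply even splits of a cropped ROI here, a simplification of the real coordinate mapping):

```python
import numpy as np

def roi_avg_pool(fmap, n=3):
    """Average-pool one single-channel ROI feature map into an n*n grid of
    bins; each map/channel is pooled independently, as noted above.
    Bin edges are even splits of the ROI (a simplification)."""
    H, W = fmap.shape
    ys = np.linspace(0, H, n + 1).astype(int)
    xs = np.linspace(0, W, n + 1).astype(int)
    out = np.empty((n, n))
    for bi in range(n):
        for bj in range(n):
            out[bi, bj] = fmap[ys[bi]:ys[bi + 1], xs[bj]:xs[bj + 1]].mean()
    return out
```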

Position-sensitive ROI pooling: from each colored slab (of depth C+1), only the bin at the corresponding position is extracted, and these k*k bins are assembled into a new slab of size (C+1)*W'*H'. For example, in the paper's figure, the first, yellow slab contributes only its top-left bin, and the last, light-blue slab only its bottom-right bin. Reassembling all the bins produces the thin slab shown on the right of that figure (the figure shows the pooled output, so each bin on each face is already a single pixel; before pooling, each bin corresponds to a region of many pixels). The output of this ROI pooling is a (C+1)*k*k slab.
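Putting the previous two steps together, a minimal numpy sketch of position-sensitive ROI pooling on one cropped ROI (the row-major block ordering here is an assumption made for simplicity, not the paper's exact layout):

```python
import numpy as np

def ps_roi_pool(score_maps, k=3, num_classes=21):
    """Position-sensitive ROI pooling sketch for one cropped ROI.
    score_maps: (k*k*num_classes, H, W) scores inside the ROI.
    Bin (i, j) for class c is average-pooled ONLY from the map of class c
    in position block (i, j) (row-major block order, an assumption)."""
    _, H, W = score_maps.shape
    ys = np.linspace(0, H, k + 1).astype(int)
    xs = np.linspace(0, W, k + 1).astype(int)
    out = np.empty((num_classes, k, k))
    for i in range(k):
        for j in range(k):
            block = i * k + j                     # which block feeds this bin
            maps = score_maps[block * num_classes:(block + 1) * num_classes]
            region = maps[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            out[:, i, j] = region.mean(axis=(1, 2))
    return out
```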



Voting: the k*k bins are simply summed (per class, done independently for each class) to obtain one score per class; a softmax then gives the final per-class probabilities, which are used to compute the loss.
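The vote step is just a per-class sum followed by a softmax; a minimal sketch:

```python
import numpy as np

def vote_and_softmax(pooled):
    """pooled: (num_classes, k, k) output of position-sensitive ROI pooling.
    Sum the k*k bins per class ("voting"), then softmax over classes."""
    scores = pooled.sum(axis=(1, 2))   # one score per class
    e = np.exp(scores - scores.max())  # numerically stable softmax
    return e / e.sum()
```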


The loss function is similar to Faster R-CNN's: it consists of a classification loss and a regression loss. Classification uses cross-entropy (log loss); regression uses the smooth-L1 loss.
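Minimal numpy sketches of the two loss terms for a single ROI (unweighted; the full objective additionally balances the two terms with a loss weight):

```python
import numpy as np

def cls_loss(probs, label):
    """Classification term: cross-entropy, i.e. the negative log-probability
    of the true class, on the softmax output."""
    return -np.log(probs[label])

def smooth_l1(residuals):
    """Regression term: smooth-L1 applied elementwise to the box residuals,
    then summed. 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise."""
    a = np.abs(residuals)
    return np.where(a < 1, 0.5 * a * a, a - 0.5).sum()
```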


Summary:

R-FCN is a reworking of the Faster R-CNN framework. First, the VGG16 base is replaced with ResNet. Second, the Fast R-CNN head is replaced: the prediction is done with convolution first, and ROI pooling is applied afterward. Since ROI pooling discards location information, location is injected before pooling by assigning different score maps to detect different parts of the object; after pooling, combining the score maps obtained at the different positions recovers the original location information.

Reference: https://blog.csdn.net/baidu_32173921/article/details/71741970