READING NOTE: Learning Spatial Regularization with Image-level Supervisions for Multi-label Image Classification

TITLE: Learning Spatial Regularization with Image-level Supervisions for Multi-label Image Classification

AUTHOR: Feng Zhu, Hongsheng Li, Wanli Ouyang, Nenghai Yu, Xiaogang Wang

ASSOCIATION: University of Science and Technology of China, University of Sydney, The Chinese University of Hong Kong

FROM: arXiv:1702.05891

CONTRIBUTIONS

  1. An end-to-end deep neural network for multi-label image classification is proposed, which exploits both semantic and spatial relations of labels by training learnable convolutions on the label attention maps. Such relations are learned with only image-level supervision. Investigation and visualization of the learned models demonstrate that the model can effectively capture semantic and spatial relations between labels.
  2. The proposed algorithm has great generalization capability and works well on data with different types of labels.

METHOD

The proposed Spatial Regularization Net (SRN) takes visual features from the main net as inputs and learns to regularize spatial relations between labels. Such relations are exploited based on the learned attention maps for the multiple labels. Label confidences from both main net and SRN are aggregated to generate final classification confidences. The whole network is a unified framework and is trained in an end-to-end manner.
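As a minimal sketch of the final aggregation step (the equal weighting alpha = 0.5 and the batch/label sizes below are illustrative assumptions, not values taken from the note):

```python
import torch

def final_confidence(y_cls, y_sr, alpha=0.5):
    """Aggregate label confidences from the main net (y_cls) and the SRN (y_sr).

    Both inputs are (batch, C) confidence vectors; alpha = 0.5 assumes the two
    streams are weighted equally, which is an illustrative choice here.
    """
    return alpha * y_cls + (1.0 - alpha) * y_sr

# Example: combine confidences for a batch of 4 images over C = 80 labels.
y_cls = torch.randn(4, 80)
y_sr = torch.randn(4, 80)
y = final_confidence(y_cls, y_sr)   # (4, 80)
```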

The scheme of SRN is illustrated in the following figure.

Overall Framework of SRN

To train the network,

  1. Finetune only the main net on the target dataset. Both f_cnn and f_cls are learned with a cross-entropy classification loss.
  2. Fix f_cnn and f_cls. Train f_att and conv1 with a cross-entropy classification loss.
  3. Train f_sr with a cross-entropy classification loss while fixing all other sub-networks.
  4. Jointly finetune the whole network with the joint loss (a sketch of this staged schedule is given below).
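A minimal sketch of the staged schedule, assuming a PyTorch-style implementation; the module definitions below are placeholders standing in for the actual sub-networks, and the channel and label counts are assumptions:

```python
import torch
import torch.nn as nn

# Placeholder sub-networks: in the real model f_cnn is the ResNet-101 trunk and
# the other parts follow the paper's structure; dummies here only illustrate
# the freeze/unfreeze schedule. C = 80 labels is an assumption.
f_cnn = nn.Conv2d(3, 1024, kernel_size=3, padding=1)
f_cls = nn.Linear(1024, 80)
f_att = nn.Conv2d(1024, 80, kernel_size=1)
conv1 = nn.Conv2d(1024, 80, kernel_size=1)
f_sr  = nn.Linear(80, 80)
all_parts = [f_cnn, f_cls, f_att, conv1, f_sr]

def set_trainable(modules, flag):
    """Freeze (flag=False) or unfreeze (flag=True) the given sub-networks."""
    for m in modules:
        for p in m.parameters():
            p.requires_grad = flag

def optimizer_for(modules, lr=1e-3):
    """SGD over the trainable parameters of the given sub-networks."""
    params = [p for m in modules for p in m.parameters() if p.requires_grad]
    return torch.optim.SGD(params, lr=lr, momentum=0.9)

# Step 1: finetune only the main net (f_cnn + f_cls) on the target dataset.
set_trainable(all_parts, False); set_trainable([f_cnn, f_cls], True)
opt = optimizer_for([f_cnn, f_cls])

# Step 2: fix f_cnn and f_cls; train f_att and conv1.
set_trainable(all_parts, False); set_trainable([f_att, conv1], True)
opt = optimizer_for([f_att, conv1])

# Step 3: train f_sr with all other sub-networks fixed.
set_trainable(all_parts, False); set_trainable([f_sr], True)
opt = optimizer_for([f_sr])

# Step 4: jointly finetune the whole network with the joint loss.
set_trainable(all_parts, True)
opt = optimizer_for(all_parts)
```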

The main network follows the structure of ResNet-101 and is finetuned on the target dataset. The attention maps and confidence maps both have C channels, where C is the number of categories. In step 2, the two outputs are merged by element-wise multiplication and average-pooled into a C-dimensional confidence vector. In step 3, f_sr takes the merged maps as input instead of average pooling them. f_sr is implemented as three convolution layers with ReLU nonlinearity followed by one fully-connected layer, as shown in the following figure.
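A rough sketch of this attention/confidence branch in PyTorch terms; the single 1x1 convolutions, the spatial-softmax normalization of the attention maps, and the channel counts are simplifying assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionConfidenceHead(nn.Module):
    """Attention maps and confidence maps merged by element-wise multiplication
    and average-pooled into a C-dimensional confidence vector (step 2).
    In step 3 the merged maps would be fed to f_sr instead of being pooled.
    """
    def __init__(self, in_channels=1024, num_labels=80):
        super().__init__()
        self.f_att = nn.Conv2d(in_channels, num_labels, kernel_size=1)  # attention maps
        self.conv1 = nn.Conv2d(in_channels, num_labels, kernel_size=1)  # confidence maps

    def forward(self, x):                       # x: (N, in_channels, H, W) from the main net
        n, _, h, w = x.shape
        att = self.f_att(x)                     # (N, C, H, W)
        # Normalize each label's attention map over spatial positions (assumption).
        att = F.softmax(att.view(n, -1, h * w), dim=-1).view(n, -1, h, w)
        conf = self.conv1(x)                    # (N, C, H, W)
        merged = att * conf                     # element-wise multiplication
        y_att = merged.mean(dim=(2, 3))         # spatial average pooling -> (N, C)
        return y_att, merged

# Example on a dummy 14x14 feature map.
head = AttentionConfidenceHead()
y_att, merged = head(torch.randn(2, 1024, 14, 14))  # y_att: (2, 80), merged: (2, 80, 14, 14)
```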

Structure of f_sr

conv4 is composed of single-channel filters; in Caffe, this can be implemented with the "group" parameter. The design reflects the fact that one label usually relates semantically to only a small number of other labels, so measuring spatial relations against unrelated attention maps is unnecessary.
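A minimal sketch of such a grouped, single-channel convolution in PyTorch terms (the channel count and kernel size are placeholders, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

# With groups equal to the number of input channels, every filter convolves
# with a single channel only, so each output measures spatial structure within
# its own (label-related) map and never mixes unrelated attention maps.
channels = 512                                # placeholder channel count
conv4 = nn.Conv2d(channels, channels,
                  kernel_size=14,             # kernel covering the whole 14x14 map (assumption)
                  groups=channels)            # single-channel filters, like Caffe's "group"

x = torch.randn(2, channels, 14, 14)          # dummy maps from the previous layer
out = conv4(x)                                # (2, 512, 1, 1): one response per channel
```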
