Orderless Recurrent Models for Multi-label Classification (CVPR2020)

Orderless Recurrent Models for Multi-label Classification Paper PDF

Introduction

Multi-label classification is the task of assigning a wide range of visual concepts (labels) to images. The large variety of concepts and the uncertain relations among them make this a very challenging task. RNNs have demonstrated good performance in many tasks that require processing variable-length sequential data, including multi-label classification, and they naturally take the relation patterns among labels into account during training. However, since RNNs produce sequential outputs, the labels need to be ordered for the multi-label classification task.

Several recent works have tried to address this issue by imposing an arbitrary, but consistent, ordering on the ground truth label sequences. Despite alleviating the problem, these approaches fall short of solving it, and many of the original issues remain. For example, in an image that features a clearly visible and prominent dog, the LSTM may choose to predict that label first, since the evidence for it is very strong. However, if dog is not the label that happens to come first in the chosen ordering, the network will be penalized for that output, and then penalized again for not predicting dog in the "correct" step according to the ground truth sequence. In this way, the training process can become very slow.

In this paper, we propose to dynamically align the ground truth labels with the predicted label sequence. There are two ways of doing this: predicted label alignment (PLA) and minimal loss alignment (MLA). We empirically show that these approaches lead to faster training and also eliminate other nuisances such as repeated labels in the predicted sequence.

Innovation

  1. Orderless recurrent models with the minimal loss alignment (MLA) and predicted label alignment (PLA)

Method

Image-to-sequence model

[Figure: image-to-sequence model architecture (CNN encoder with attention and LSTM decoder)]
This type of model consists of a CNN (encoder) part that extracts a compact visual representation from the image, and an RNN (decoder) part that uses this encoding to generate a sequence of labels, modeling the label dependencies.

Linearized activations from the fourth convolutional layer are used as input to the attention module, along with the hidden state of the LSTM at each time step, so that the attention module can focus on different parts of the image at every step. The attention-weighted features are then concatenated with the word embedding of the class predicted in the previous time step and given to the LSTM as input for the current time step.

The predictions for the current time step $t$ are computed in the following way:
$$
\begin{aligned}
x_{t} &= E \cdot \hat{l}_{t-1} \\
h_{t} &= \mathrm{LSTM}(x_{t}, h_{t-1}, c_{t-1}) \\
p_{t} &= W \cdot h_{t} + b
\end{aligned}
$$

where $E$ is a word embedding matrix and $\hat{l}_{t-1}$ is the index of the label predicted at the previous time step. $c_{t-1}$ and $h_{t-1}$ are the cell and hidden states of the previous LSTM unit. The prediction vector is denoted by $p_t$, and $W$ and $b$ are the weights and the bias of the fully connected layer.
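
To make a single decoding step concrete, the following is a minimal PyTorch-style sketch of the equations above. The class and attribute names (`DecoderStep`, `embedding`, `lstm_cell`, `classifier`), the additive attention form, and the feature dimensions are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One decoder time step: x_t = E·l_{t-1}, h_t = LSTM(x_t, h_{t-1}, c_{t-1}), p_t = W·h_t + b."""

    def __init__(self, num_labels, embed_dim=256, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(num_labels, embed_dim)        # E
        # simple additive attention over spatial CNN features (illustrative choice)
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.lstm_cell = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_labels)         # W, b

    def forward(self, prev_label, features, h, c):
        # prev_label: (B,) indices of the labels predicted at the previous step
        # features:   (B, R, feat_dim) linearized conv activations over R spatial regions
        x = self.embedding(prev_label)                               # (B, embed_dim)
        scores = self.att_score(torch.tanh(
            self.att_feat(features) + self.att_hid(h).unsqueeze(1))) # (B, R, 1)
        alpha = torch.softmax(scores, dim=1)                         # attention weights per region
        context = (alpha * features).sum(dim=1)                      # attention-weighted features
        h, c = self.lstm_cell(torch.cat([x, context], dim=1), (h, c))
        logits = self.classifier(h)                                  # p_t
        return logits, h, c
```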

Training recurrent models

To train the model, a dataset with pairs of images and sets of labels is used. Let $(I, L)$ be one of these pairs, containing an image $I$ and its $n$ labels $L = \{l_1, l_2, \ldots, l_n\}$, $l_i \in \mathbb{L}$, with $\mathbb{L}$ the set of all labels with cardinality $m = |\mathbb{L}|$, including the start and end tokens.

The predictions $p_t$ of the LSTM are collected in the matrix $P = [p_1\, p_2\, \ldots\, p_n]$, $P \in \mathbb{R}^{m \times n}$. When the number of predicted labels $k$ is larger than $n$, we only keep the first $n$ prediction vectors. In case $k$ is smaller than $n$, we pad the matrix with empty vectors to obtain the desired dimensions.
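
As a small sketch of how $P$ could be assembled under this truncate-or-pad rule (treating the "empty" padding vectors as zero vectors is an assumption):

```python
import torch

def collect_predictions(step_logits, n):
    """Stack per-step prediction vectors into P (shape m x n),
    truncating when k > n and padding when k < n.

    step_logits: list of k tensors of shape (m,), one prediction vector per decoded step.
    """
    cols = list(step_logits[:n])                     # keep only the first n prediction vectors if k > n
    while len(cols) < n:                             # pad with empty (zero) vectors if k < n
        cols.append(torch.zeros_like(step_logits[0]))
    return torch.stack(cols, dim=1)                  # P in R^{m x n}
```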

We can now define the standard cross-entropy loss for recurrent models as:
$$
\mathfrak{L} = -\operatorname{tr}(T \log(P))
\quad \text{with} \quad
\begin{cases}
T_{tj} = 1, & \text{if } l_{t} = j \\
T_{tj} = 0, & \text{otherwise}
\end{cases}
\tag{3}
$$

where $T \in \mathbb{R}^{n \times m}$ contains the ground truth label for each time step $t$. The loss is computed by comparing the prediction of the model at step $t$ with the label at the same step of the ground truth sequence.
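
For reference, a minimal PyTorch sketch of this fixed-order loss; it picks the entry $(l_t, t)$ of $\log(P)$ at every step, which is exactly $-\operatorname{tr}(T\log(P))$ for the $T$ defined above. Applying a softmax over labels to obtain probabilities from the logits $p_t$ is an assumption:

```python
import torch
import torch.nn.functional as F

def fixed_order_loss(P, ordered_labels):
    """Standard sequence cross entropy with a fixed label order.

    P: (m, n) logits, one column per time step.
    ordered_labels: the n ground-truth label indices in the imposed order (e.g. frequent-first).
    """
    log_probs = F.log_softmax(P, dim=0)        # log p_{jt}: probabilities over labels at each step
    targets = torch.as_tensor(ordered_labels)
    steps = torch.arange(P.shape[1])
    # summing entry (l_t, t) for every t equals tr(T log(P)) with T_tj = 1 iff l_t = j
    return -log_probs[targets, steps].sum()
```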
For inherently orderless tasks like multi-label classification, where labels come in no particular order, it becomes essential to minimize unnecessary penalization, and several approaches have been proposed in the literature. The most popular solution to improve the alignment between ground truth and predicted labels consists of defining an arbitrary criterion by which the labels are sorted, such as frequent-first, rare-first, or dictionary order. However, these methods delay convergence, as the network has to learn the arbitrary ordering in addition to predicting the correct labels for the image. Furthermore, any misalignment between the predictions and the labels still results in a higher loss and misleading updates to the network.

Orderless recurrent models

To alleviate the problems caused by imposing a fixed order to the labels, we propose to align them to the predictions of the network before computing the loss. We consider two different strategies to achieve this:
The first strategy, called minimal loss alignment (MLA), is computed with:
$$
\mathfrak{L} = \min_{T} \; -\operatorname{tr}(T \log(P))
\quad \text{s.t.} \quad
\begin{cases}
T_{tj} \in \{0, 1\}, & \sum_{j} T_{tj} = 1 \\
\sum_{t} T_{tj} = 1, & \forall j \in L \\
\sum_{t} T_{tj} = 0, & \forall j \notin L
\end{cases}
$$
where $T \in \mathbb{R}^{n \times m}$ is a permutation matrix, constrained so that each time step is assigned exactly one ground truth label ($\sum_{j} T_{tj} = 1$) and each label in the ground truth set $L$ is assigned to a time step. The matrix $T$ is chosen in such a way as to minimize the summed cross-entropy loss. This minimization is an assignment problem and can be solved with the Hungarian algorithm.
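
A minimal sketch of the MLA loss, assuming probabilities are obtained with a softmax over labels and using the Hungarian algorithm as implemented in SciPy (`scipy.optimize.linear_sum_assignment`):

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def mla_loss(P, label_set):
    """Minimal loss alignment (MLA): assign ground-truth labels to time steps
    so that the summed cross entropy is minimal.

    P: (m, n) logits for n time steps over m labels.
    label_set: list of the n ground-truth label indices (unordered).
    """
    log_probs = F.log_softmax(P, dim=0)                       # (m, n) log probabilities
    # cost of assigning label j to time step t is -log p_{jt}
    cost = -log_probs[label_set, :].detach().cpu().numpy()    # (n, n): rows = labels, cols = time steps
    rows, cols = linear_sum_assignment(cost)                  # Hungarian algorithm
    aligned = torch.zeros(P.shape[1], dtype=torch.long)
    for r, t in zip(rows, cols):
        aligned[t] = label_set[r]                             # ground-truth label assigned to step t
    return F.cross_entropy(P.t(), aligned)                    # cross entropy with re-ordered targets
```

The assignment itself is computed on detached costs, so gradients flow only through the cross entropy with the re-ordered targets.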

We also consider the predicted label alignment (PLA) solution. If we predict a label which is in the set of ground truth labels for the image, then we do not wish to change it. That leads to the following optimization problem:

$$
\mathfrak{L} = \min_{T} \; -\operatorname{tr}(T \log(P))
\quad \text{s.t.} \quad
\begin{cases}
T_{tj} \in \{0, 1\}, & \sum_{j} T_{tj} = 1 \\
T_{tj} = 1, & \text{if } \hat{l}_{t} \in L \text{ and } j = \hat{l}_{t} \\
\sum_{t} T_{tj} = 1, & \forall j \in L \\
\sum_{t} T_{tj} = 0, & \forall j \notin L
\end{cases}
$$

where $\hat{l}_{t}$ is the label predicted by the model at step $t$. Here we first fix those elements of the matrix $T$ for which we know that the prediction is in the ground truth set $L$, and apply the Hungarian algorithm to assign the remaining labels. This second approach results in higher losses than the first one, since there are more restrictions on the matrix $T$. Nevertheless, it is more consistent with the labels that were actually predicted by the LSTM.
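
A corresponding sketch for PLA under the same assumptions: time steps whose prediction already hits a ground-truth label are fixed first, and only the remaining labels and free time steps are passed to the Hungarian algorithm.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def pla_loss(P, label_set):
    """Predicted label alignment (PLA): keep correct predictions in place,
    align the remaining ground-truth labels with minimal loss."""
    n = P.shape[1]
    log_probs = F.log_softmax(P, dim=0)
    predicted = P.argmax(dim=0)                                # \hat{l}_t for each time step
    aligned = torch.zeros(n, dtype=torch.long)
    remaining, free_steps = list(label_set), []
    for t in range(n):
        l_hat = predicted[t].item()
        if l_hat in remaining:                                 # prediction is in the ground-truth set: fix it
            aligned[t] = l_hat
            remaining.remove(l_hat)
        else:
            free_steps.append(t)
    if remaining:                                              # assign leftover labels to leftover steps
        cost = -log_probs[remaining][:, free_steps].detach().cpu().numpy()
        rows, cols = linear_sum_assignment(cost)
        for r, c in zip(rows, cols):
            aligned[free_steps[c]] = remaining[r]
    return F.cross_entropy(P.t(), aligned)
```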
[Figure 4: example image and its label-to-time-step cost matrix]

To further illustrate our proposed approach to training orderless recurrent models, we consider an example image and its cost matrix (see Figure 4). The cost matrix shows the cost of assigning each label to the different time steps, computed as the negative logarithm of the probability at the corresponding time step. Although the MLA approach finds the order that yields the lowest loss, in some cases this can cause misguided gradients, as in the example in the figure. MLA puts the label chair at time step $t_3$, although the network already predicts it at time step $t_4$. The gradients therefore force the network to output chair instead of sports ball, even though sports ball is also one of the labels.

Experiments

Convergence rate

[Figure: convergence rate comparison]

Co-occurrence in the predictions

[Figure: label co-occurrence in the predictions]

Comparison of different ordering methods

[Figure: comparison of different label-ordering methods]

Comparison with the state of the art

[Figure: comparison with the state of the art]
