Multi-Modal Multi-Scale Deep Learning for Large-Scale Image Annotation

Abstract

Image annotation aims to annotate a given image with a variable number of class labels corresponding to diverse visual concepts. In this paper, we address two main issues in large-scale image annotation: 1) how to learn a rich feature representation suitable for predicting a diverse set of visual concepts ranging from objects and scenes to abstract concepts; 2) how to annotate an image with the optimal number of class labels. To address the first issue, we propose a novel multi-scale deep model for extracting rich and discriminative features capable of representing a wide range of visual concepts. Specifically, a novel two-branch deep neural network architecture is proposed which comprises a very deep main network branch and a companion feature fusion network branch designed for fusing the multi-scale features computed from the main branch. The deep model is also made multi-modal by taking noisy user-provided tags as model input to complement the image input. For tackling the second issue, we introduce a label quantity prediction auxiliary task alongside the main label prediction task to explicitly estimate the optimal label number for a given image. Extensive experiments are carried out on two large-scale image annotation benchmark datasets and the results show that our method significantly outperforms the state-of-the-art.


I. INTRODUCTION

As a closely related task, image annotation [11]–[15] aims to describe (rather than merely recognise) an image by annotating all visual concepts that appear in the image. This brings about a number of differences and new challenges compared to the image recognition problem. First, image annotation is a multi-label multi-class classification problem [16], [17], instead of a single-label multi-class classification problem as in image recognition. The model thus needs to predict multiple labels for a given image, whose content is typically richer than that of the images used in image recognition. The second difference, which is more subtle yet more significant, concerns the label types: labels in image annotation can refer to a much wider and more diverse range of visual concepts including scene properties, objects, attributes, actions, aesthetics etc. This is because image annotation benchmark datasets have become larger and larger [18]–[21] by crawling images and associated noisy user-provided tags from social media websites such as Flickr. These images are uploaded by a large variety of users capturing an extremely diverse set of visual scenes containing a wide range of visual concepts. Finally, in image annotation, the task is to predict not only more than one label, but also a variable number of labels – an image with simple content may require only one or two labels to describe, whilst more complicated content necessitates more labels. In summary, these three differences make image annotation a much harder problem that is far from being solved.


Existing methods on image annotation focus on solving the multi-label classification problem. Two lines of research exist. In the first line, the focus is on modeling the correlation of different labels for a given image (e.g. cow typically co-exists with grass) [22] to make the multi-label prediction more robust and accurate. In the second line, side information such as the noisy user-provided tags is used as additional model input from a different modality (i.e., text) [23]. These models ignore the variable label number problem and simply predict the top-k most probable labels per image [11]–[15]. This is clearly sub-optimal as illustrated in Fig. 1.


More recently the variable label number problem has been identified and a number of solutions have been proposed [24]–[26]. These solutions treat the image annotation problem as an image-to-text translation problem and solve it using an encoder-decoder model. Specifically, the image content is encoded using a convolutional neural network (CNN) and then fed into a recurrent neural network (RNN) [27]–[29] which outputs a label sequence. The RNN model is able to automatically determine the length of the label sequence, thus producing variable numbers of labels for different images. However, one fundamental limitation of these existing approaches is that the original training labels are orderless whilst the RNN model requires a certain output label order for training. Existing methods have to introduce artificial orders such as rarest-first or most-frequent-first. This indirect approach thus leads to sub-optimal estimation of the label quantity.


Nevertheless the most important difference and challenge, i.e. the rich and diverse label space problem, has never been tackled explicitly. In particular, recent image annotation models use deep learning in various ways: they either directly apply a CNN pretrained on the ImageNet image recognition task to extract features followed by a non-deep multi-label classification model [23]; or fine-tune a pretrained CNN on the image annotation benchmark datasets and obtain both the feature representation and classifier jointly [22], [30]. However, all the CNN models used were designed originally for the image recognition task. Specifically, only the final layer feature output is used as input to the classifier. It has been demonstrated [31]–[34] that features computed by a deep CNN correspond to visual concepts of higher levels of abstraction when progressing from bottom to top layer feature maps, e.g., filters learned in bottom layers could represent colour and texture whilst those in top layers represent object parts. This means that only the most abstract features were used by the classifier in the existing models. However, as mentioned earlier, for image annotation, the visual concepts to be annotated/labeled have drastically different levels of abstraction; for instance, ‘grass’ and ‘sand’ can be sufficiently described by colour and texture oriented features computed at the bottom layers of a CNN, whilst ‘person’, ‘flower’ and ‘reflection’ are more abstract and thus require features learned at the top layers. These image recognition-oriented CNNs, with or without fine-tuning, are thus unsuitable for the image annotation task because they fail to provide rich representations at different abstraction scales.


In this paper, we propose a novel multi-modal multi-scale deep neural network (see Fig. 2) to address both the diverse label space problem and variable label number problem. First, to extract and fuse features learned at different scales, a novel deep CNN architecture is proposed. It comprises two network branches: a main CNN branch which can be any contemporary CNN such as GoogleNet [8] or ResNet [5]; and a companion multi-scale feature fusion branch which fuses features extracted at different layers of the main branch whilst maintaining the feature dimension as well as performing automated feature selection. Second, to estimate the optimal number of labels that is needed to annotate a given image, we explicitly formulate an optimum label quantity estimation task and optimize it jointly with the label prediction main task. Finally, to further improve the label prediction accuracy, the proposed model is made multi-modal by taking both the image and noisy user-provided tags as input.


Our main contributions are as follows: (1) A novel multi-scale deep CNN architecture is proposed which is capable of effectively extracting and fusing features at different scales corresponding to visual concepts of different levels of abstraction. (2) The variable label number problem is solved by explicitly estimating the optimum label quantity in a given image. (3) We also formulate a multi-modal extension of the model to utilize the noisy user-provided tags. Extensive experiments are carried out on two large-scale benchmark datasets: NUS-WIDE [35] and MSCOCO [36]. The experimental results demonstrate that our model outperforms the state-of-the-art alternatives, often by a significant margin.


II. RELATED WORK

A. Multi-Scale CNN Models

Although multi-scale representation learning has never been attempted for image annotation, there are existing efforts on designing CNN architectures that enable multi-scale feature fusion. Gong et al. [31] noticed that the robustness of global features was limited due to the lack of geometric invariance, and thus proposed a multi-scale orderless pooling (MOP-CNN) scheme, which concatenates orderless Vectors of Locally Aggregated Descriptors (VLAD) pooled from CNN activations at multiple levels. Yang and Ramanan [32] argued that different scales of features can be used in different image classification tasks through multi-task learning, and developed directed acyclic graph structured CNNs (DAG-CNNs) to extract multi-scale features for both coarse and fine-grained classification tasks. Cai et al. [33] presented a multi-scale CNN (MS-CNN) model for fast object detection, which performs object detection using both lower and higher output layers. Liu et al. [34] proposed a multi-scale triplet CNN model for person re-identification. The results reported in [31]–[34] have shown that multi-scale features are indeed effective for image content analysis. In this paper, a completely new network architecture is designed by introducing the companion feature fusion branch. One of the key advantages of our architecture is that the feature fusion takes place in each layer without increasing the feature dimension, thus maintaining the same final feature dimension as the main branch. This means that our architecture can adopt an arbitrarily deep network in the main branch without worrying about the explosion of the fused feature dimension.


B. Label Quantity Prediction for Image Annotation

Early works [22], [23], [30] on image annotation assign the top-k predicted class labels to each image. However, the quantities of class labels in different images vary significantly, and top-k annotation degrades the performance of image annotation. To overcome this issue, recent works start to predict variable numbers of class labels for different images. Most of them adopt a CNN-RNN encoder-decoder architecture, where the CNN subnetwork encodes the input pixels of images into visual features, and the RNN subnetwork decodes the visual features into a label prediction path [24]–[26]. Specifically, the RNN can not only perform classification, but also control the number of output class labels – the model stops emitting labels once an end-of-label-sequence token is predicted. Since the RNN requires an ordered sequential list as input, the unordered label set has to be transformed in advance based on heuristic rules such as the rare-first rule [26] or the frequent-first rule [25]. These artificially imposed rules can lead to suboptimal estimation of the label quantity. In contrast, in this work our model directly estimates the label quantity without making any assumption on the label order.


C. Image Annotation Using Side Information

A number of recent works attempted to improve the label prediction accuracy by exploiting side information collected from the social media websites where the benchmark images were collected. This user-provided side information can be extracted from the noisy tags [22]–[24] and group labels [22], [37], [38]. Johnson et al. [23] found neighborhoods of images nonparametrically according to the image metadata, and combined the visual features of each image and its neighborhoods. Liu et al. [24] filtered noisy tags to obtain true tags, which serve as the supervision to train the CNN model and also set the initial state of the RNN. Hu et al. [22] utilized both tags and group labels to form a multi-layer group-concept-tag graph, which can encode the diverse label relations. Different from [22], which adopted a deep model, the group labels in [37], [38] were used to learn context information for image annotation without considering deep models. In this paper, we introduce an additional network branch which takes noisy tags as input, and the extracted textual features are fused with the visual features extracted by CNN models to boost the annotation accuracy. Moreover, since user-provided tags may be incorrect or incomplete, some recent works have focused on training with noisy labels. Liu et al. [39] use importance reweighting to modify any loss function for classification with noisy labels, and extend the random classification noise to a bounded case [40]. Moreover, Yu et al. [41] prove that biased complementary labels can enhance multi-class classification. These methods have been successfully applied in binary classification and multi-class classification, and may lead to improvements when extended to our large-scale image annotation setting where each image is annotated with one or more tags and labels (but this is out of the scope of this paper).


III. THE PROPOSED MODEL

As shown in Fig. 2, our multi-modal multi-scale deep model for large-scale image annotation consists of four components: visual feature learning, textual feature learning, multi-class classification, and label quantity prediction. Specifically, a multi-scale CNN subnetwork is proposed to extract visual features from raw image pixels, and a multi-layer perceptron subnetwork is applied to extract textual features from noisy user-provided tags. The joint visual and textual features are connected to a simple fully connected layer for multi-class classification. To determine the optimum number of labels required to annotate a given image, we utilize another multi-layer perceptron subnetwork to predict the quantity of class labels. The results of multi-class classification and label quantity prediction are finally merged for image annotation. In the following, we first give the details of the four components of our model, and then provide the algorithms for the model training and test process.


A. Multi-Scale CNN for Visual Feature Learning

Our multi-scale CNN subnetwork consists of two branches (see Fig. 2): the main CNN branch which encodes an image into both low-level and high-level features, and the feature fusion branch which fuses multi-scale features extracted from the main branch.


Main CNN Branch. Given the raw pixels of an image I, the main CNN branch encodes them into K levels of feature maps M_1, M_2, \cdots, M_K through a series of scales S_1(\cdot), S_2(\cdot), \cdots, S_K(\cdot). Each scale can be a composite function of operations, such as convolution (Conv), pooling (Pooling), batch normalization (BN), and an activation function. The encoding process can be formulated as:


                                                                                        M_0 = I

                                                                   M_l = S_l(M_{l-1}),\quad l = 1, 2, \cdots, K

For the main CNN branch, the last feature map M_K is often used to produce the final feature vector, e.g., the conv5_3 layer of the ResNet-101 model [5].

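As a minimal illustration of this recursive encoding, the sketch below (TensorFlow/Keras) applies a list of scale functions to an input tensor and collects the intermediate feature maps M_1, ..., M_K. The toy Conv-BN-ReLU-Pool scale functions are purely illustrative stand-ins for the actual convolutional stages of the main branch.

```python
import tensorflow as tf

def encode_multiscale(image, scale_fns):
    """Apply the scales S_1..S_K sequentially and collect M_1..M_K.

    image:     tensor of shape (batch, H, W, 3), i.e. M_0 = I
    scale_fns: list of callables, each mapping M_{l-1} to M_l
    """
    feature_maps = []
    m = image                      # M_0 = I
    for scale in scale_fns:        # M_l = S_l(M_{l-1})
        m = scale(m)
        feature_maps.append(m)
    return feature_maps            # [M_1, ..., M_K]

# Toy scale functions (Conv -> BN -> ReLU -> Pool), for illustration only.
def make_toy_scale(filters):
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(filters, 3, padding='same'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.MaxPooling2D(2),
    ])

scales = [make_toy_scale(f) for f in (64, 128, 256)]
maps = encode_multiscale(tf.zeros([1, 224, 224, 3]), scales)  # three feature maps
```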

Feature Fusion Branch. When the multi-scale feature maps M_1, M_2, \cdots, M_K are obtained by the main branch, the feature fusion branch is proposed to combine these original feature maps iteratively via a set of fusion functions \{F_l(\cdot)\}, as shown in Fig. 3. The fused feature map \widetilde{M}_l is formulated as follows:


                                                                                   \widetilde{M}_1 = M_1

                                                             \widetilde{M}_l = M_l + F_l(\widetilde{M}_{l-1}),\quad l = 2, \cdots, K

In this paper, we define the fusion function F_l(\cdot) as:


                                                                           F_l(\cdot )=\phi_{l2}(\phi_{l1}(\cdot))

where \phi_{l1}(\cdot) and \phi_{l2}(\cdot) are two composite functions, each consisting of three consecutive operations: a convolution (Conv), followed by batch normalization (BN) and a rectified linear unit (ReLU). The difference between \phi_{l1}(\cdot) and \phi_{l2}(\cdot) lies in the convolution layer. The 3×3 Conv in \phi_{l1}(\cdot) is used to guarantee that M_l and F_l(\widetilde{M}_{l-1}) have the same height and width, while the 1×1 Conv in \phi_{l2}(\cdot) can not only increase the dimension and exchange information between different channels, but also reduce the number of parameters and improve computational efficiency [5], [8]. At the end of the fused feature map \widetilde{M}_K, an average pooling layer is used to extract the final visual feature vector f_v for image annotation.

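The snippet below is a minimal sketch of this fusion step in TensorFlow/Keras. It assumes the 3×3 convolution in \phi_{l1} uses stride 2 so that F_l(\widetilde{M}_{l-1}) matches the spatial size of M_l, and that the 1×1 convolution in \phi_{l2} outputs the same number of channels as M_l; the intermediate channel width of \phi_{l1} is not specified in the text and is left as a parameter.

```python
import tensorflow as tf

def make_fusion_fn(mid_channels, out_channels, stride=2):
    """F_l(.) = phi_l2(phi_l1(.)): a 3x3 Conv-BN-ReLU followed by a 1x1 Conv-BN-ReLU."""
    phi_l1 = tf.keras.Sequential([   # 3x3 Conv matches the height/width of M_l
        tf.keras.layers.Conv2D(mid_channels, 3, strides=stride, padding='same'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
    ])
    phi_l2 = tf.keras.Sequential([   # 1x1 Conv matches the channel count of M_l
        tf.keras.layers.Conv2D(out_channels, 1),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
    ])
    return lambda x: phi_l2(phi_l1(x))

def fuse(feature_maps, fusion_fns):
    """Iterative fusion: M~_1 = M_1, M~_l = M_l + F_l(M~_{l-1})."""
    fused = feature_maps[0]
    for m_l, f_l in zip(feature_maps[1:], fusion_fns):
        fused = m_l + f_l(fused)
    return fused   # M~_K, later reduced by average pooling into f_v
```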

In this paper, we take ResNet-101 [5] as the main CNN branch, and select the final layers of five scales (i.e., conv1, conv2_3, conv3_4, conv4_23 and conv5_3, in total 5 final convolutional layers) for multi-scale fusion. In particular, the feature maps at the five scales have sizes of 112×112, 56×56, 28×28, 14×14 and 7×7, with 64, 256, 512, 1024 and 2048 channels, respectively. The architecture of our multi-scale CNN subnetwork is shown in Fig. 4.

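As a rough sketch of how these five scale endpoints could be exposed with tf.keras, the snippet below builds a multi-output model from a pretrained ResNet-101. The layer names ('conv1_relu', 'conv2_block3_out', etc.) are assumptions based on the standard tf.keras.applications naming and should be verified against the loaded model.

```python
import tensorflow as tf

# Assumed names of the five scale endpoints in tf.keras' ResNet-101.
SCALE_LAYERS = ['conv1_relu', 'conv2_block3_out', 'conv3_block4_out',
                'conv4_block23_out', 'conv5_block3_out']

def build_main_branch():
    base = tf.keras.applications.ResNet101(
        include_top=False, weights='imagenet', input_shape=(224, 224, 3))
    outputs = [base.get_layer(name).output for name in SCALE_LAYERS]
    # Expected map sizes: 112x112x64, 56x56x256, 28x28x512, 14x14x1024, 7x7x2048
    return tf.keras.Model(inputs=base.input, outputs=outputs)

main_branch = build_main_branch()
m1, m2, m3, m4, m5 = main_branch(tf.zeros([1, 224, 224, 3]))
# The fused 7x7x2048 map (m5 used here as a stand-in) is reduced to f_v by
# global average pooling, giving a 2,048-dimensional visual feature vector.
f_v = tf.keras.layers.GlobalAveragePooling2D()(m5)
```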

B. Multi-Layer Perceptron Model for Textual Feature Learning

We further investigate how to learn textual features from the noisy tags provided by users on social media websites. The noisy tags of image I are represented as a binary vector t = (t_1, t_2, \cdots, t_T), where T is the size of the tag vocabulary, and t_j = 1 if image I is annotated with tag j. Since the binary vector is sparse and noisy, we encode the raw vector into a dense textual feature vector f_t using a multi-layer perceptron model, which consists of two hidden layers (each with 2,048 units), as shown in Fig. 2. Note that only a simple neural network model is used for textual feature learning. Our consideration is that the noisy tags carry high-level semantic information and a complicated model would degrade the performance of textual feature learning. This observation has also been reported in [42]–[44] in the field of natural language processing.

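A minimal sketch of this textual branch in TensorFlow/Keras is given below; the hidden-layer activation is not specified in the text, so ReLU is assumed.

```python
import tensorflow as tf

T = 1000  # tag vocabulary size (the 1,000 most frequent tags are kept per dataset)

# Two hidden layers of 2,048 units each, mapping the sparse binary tag
# vector t to a dense textual feature vector f_t.
textual_branch = tf.keras.Sequential([
    tf.keras.layers.Dense(2048, activation='relu', input_shape=(T,)),
    tf.keras.layers.Dense(2048, activation='relu'),
])

f_t = textual_branch(tf.zeros([1, T]))  # shape (1, 2048), same size as f_v
```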

In this paper, the visual feature vector f_v and the textual feature vector f_t are forced to have the same dimension, which enables them to play equally important roles in feature learning. Following a multi-modal feature learning strategy, we concatenate the visual and textual feature vectors f_v and f_t into a joint feature vector f = [f_v, f_t] for the subsequent multi-class classification and label quantity prediction.


C. Multi-Class Classification and Label Quantity Prediction

Multi-Class Classification. Since each image can be annotated with one or more classes, we define a multi-class classifier for image annotation. Specifically, the joint visual and textual feature is connected to a fully connected layer for logit calculation, followed by a sigmoid function for probability calculation, as shown in Fig. 2.


Label Quantity Prediction. We formulate the label quantity prediction problem as a regression problem, and adopt a multi-layer perceptron model as the regressor. As shown in Fig. 2, the regressor consists of two hidden layers with 512 and 256 units, respectively. In this paper, to avoid overfitting of the regressor, dropout layers are applied to all hidden layers with a dropout rate of 0.5.

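A sketch of the two heads on top of the joint feature f is shown below (TensorFlow/Keras). The regressor's hidden-layer activation is assumed to be ReLU, and the joint feature dimension of 4,096 follows from concatenating the 2,048-dimensional f_v and f_t.

```python
import tensorflow as tf

C = 81              # number of class labels (e.g. 81 for NUS-WIDE)
FEATURE_DIM = 4096  # dimension of the joint feature f = [f_v, f_t]

# Multi-label classifier: one fully connected layer producing logits z_ij,
# turned into per-class probabilities by a sigmoid at prediction time.
classifier = tf.keras.Sequential([
    tf.keras.layers.Dense(C, input_shape=(FEATURE_DIM,)),
])

# Label quantity regressor: hidden layers of 512 and 256 units with
# dropout 0.5, and a single scalar output m_hat.
regressor = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation='relu', input_shape=(FEATURE_DIM,)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1),
])
```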

D. Model Training

During model training, we apply a multi-stage strategy and divide the architecture into several branches: visual feature learning, textual feature learning, multi-class classification, and label quantity prediction. Specifically, the original ResNet model is first fine-tuned with a small learning rate. When the fine-tuning is finished, we fix the parameters of ResNet and train the multi-scale blocks. In this paper, the training of the textual model is separated from that of the visual model, and thus the two models can be trained in parallel. After visual and textual feature learning, we fix the parameters of the visual and textual models, and train the multi-class classification and label quantity prediction models separately.


To provide further insights on model training, we define the loss functions for training the four branches as follows.


Sigmoid Cross Entropy Loss. For training the first three branches, the features are first connected to a fully connected layer to compute the logits \{z_{ij}\} for image I_i, and the sigmoid cross entropy loss is then applied for classification module training:


                                     L_{cls} = -\sum_i\sum_j \left[ y_{ij}\log(p_{ij}) + (1-y_{ij})\log(1-p_{ij}) \right]

where y_{ij} = 1 if image I_i is annotated with class j , otherwise y_{ij} = 0; and p_{ij} = \sigma(z_{ij} ) with \sigma(\cdot ) as the sigmoid function.

Mean Squared Error Loss. For training the label quantity prediction model, the features are also connected to a fully connected layer with one output unit to compute the predicted label quantity \widehat{m}_i for image I_i. We then apply the following mean squared error loss for regression model training:


                                                                             L_{reg} = \sum_i (\widehat{m}_i-m_i)^2

where m_i is the ground-truth label quantity of image I_i.

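A minimal TensorFlow sketch of the two losses is given below; y_true is the binary label matrix \{y_{ij}\} (as floats), logits are the raw scores \{z_{ij}\}, and, following the formulas above, both losses are summed rather than averaged.

```python
import tensorflow as tf

def classification_loss(y_true, logits):
    """Sigmoid cross-entropy L_cls summed over images and classes.

    y_true: float binary label matrix of shape (N, C); logits: raw scores z_ij.
    """
    per_entry = tf.nn.sigmoid_cross_entropy_with_logits(labels=y_true, logits=logits)
    return tf.reduce_sum(per_entry)

def quantity_loss(m_true, m_pred):
    """Squared-error loss L_reg between predicted and ground-truth label counts."""
    return tf.reduce_sum(tf.square(m_pred - m_true))
```
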
In this paper, the four branches of our model are not trained in a single process. Our main consideration is that our model has an over-complicated architecture and thus the overfitting issue still exists even if a large dataset is provided for model training. In future work, we will make efforts to develop a robust algorithm for training our model end-to-end.


E. Test Process

At test time, we first extract the joint visual and textual feature vector from each test image, and then perform multi-class classification and label quantity prediction in parallel. Once the predicted label probabilities p and the predicted label quantity \widehat{m} have been obtained for the test image, we select the top \widehat{m} candidates as our final annotations.

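A small sketch of this selection step (plain NumPy, with a hypothetical annotate helper) is shown below; rounding the regressed quantity to the nearest positive integer is an assumption about how \widehat{m} is quantised.

```python
import numpy as np

def annotate(probs, m_hat):
    """Pick the top-m_hat most probable labels for one test image.

    probs: per-class probabilities p, shape (C,); m_hat: predicted label quantity.
    """
    m = max(1, int(round(float(m_hat))))   # quantise the regressed quantity
    return np.argsort(probs)[::-1][:m]     # indices of the selected class labels

# Example: probabilities from the classifier, m_hat from the quantity regressor.
labels = annotate(np.array([0.9, 0.1, 0.7, 0.3]), m_hat=2.2)  # -> array([0, 2])
```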

IV. EXPERIMENTS

A. Datasets and Settings

1) Datasets: The two most widely used benchmark datasets for large-scale image annotation are selected for performance evaluation. The first dataset is NUS-WIDE [35], which consists of 269,648 images and 81 class labels from Flickr image metadata. The number of Flickr images varies across different studies, since some Flickr links have become invalid. For fair comparison, we also remove images without any social tag. As a result, we obtain 94,570 training images and 55,335 test images. Some examples of this dataset are shown in Fig. 5. Over 20% of images from both the training and test data have more than 3 labels, while more than 35% of images have only one single label. The maximum label quantity in the training and test data is 14 and 12, respectively. The second dataset is MSCOCO [36], which consists of 87,188 images and 80 class labels from the 2015 MSCOCO image captioning challenge. The training/test split of the MSCOCO dataset is 56,414/30,774. More than 25% of images from both the training and test data have over 3 labels, while over 20% of images have only one single label. The maximum label quantity in the training and test data is 16 and 15, respectively. The visualized distribution of label quantity on the NUS-WIDE and MSCOCO datasets is shown in Fig. 6. In summary, there exists a distinct variety of label quantities in the two datasets, and thus it is necessary to predict the label quantity for accurate and complete annotation. In addition, there are originally 5,018 and 170,339 noisy user tags released with NUS-WIDE and MSCOCO, respectively. We only keep the 1,000 most frequent tags for each dataset as side information.


2) Evaluation Metrics: The per-class and per-image metrics including precision and recall have been widely used in related works [22]–[26], [30]. In this paper, the per-class precision (C-P) and per-class recall (C-R) are obtained by computing the mean precision and recall over all the classes, while the overall per-image precision (I-P) and overall per-image recall (I-R) are computed by averaging over all the test images. Moreover, the per-class F1-score (C-F1) and overall per-image F1-score (I-F1) are used for comprehensive performance evaluation by combining precision and recall with the harmonic mean. The six metrics are defined as follows:


C-P = \frac{1}{C}\sum_{j=1}^C\frac{NI_{j}^{c}}{NI_{j}^{p}}            I-P = \frac{\sum_{i=1}^{N}NL^c_i}{\sum_{i=1}^{N}NL^p_i}

C-R = \frac{1}{C}\sum_{j=1}^C\frac{NI_{j}^{c}}{NI_{j}^{g}}           I-R = \frac{\sum_{i=1}^{N}NL^c_i}{\sum_{i=1}^{N}NL^g_i}

C-F1 = \frac{2\cdot C-P\cdot C-R}{C-P+C-R}  I-F1 = \frac{2\cdot I-P\cdot I-R}{I-P+I-R}

where C is the number of classes, N is the number of test images, NI_{j}^{c} is the number of images correctly labelled as class j,  NI_{j}^{g}is the number of images that have a ground-truth label of class j, NI_{j}^{p} is the number of images predicted as class j, NL_{i}^{c} is the number of correctly annotated labels for image i, NL_{i}^{g}  is the number of ground-truth labels for image i, and NL_{i}^{p} is the number of predicted labels for image i.


According to [30], the above per-class metrics are biased toward the infrequent classes, while the above per-image metrics are biased toward the frequent classes. Similar observations have also been presented in [45]. As a remedy, following the idea of [45], we define a new metric called H-F1 as the harmonic mean of C-F1 and I-F1:


                                                                 H-F1 = \frac{2\cdot C-F1\cdot I-F1}{C-F1+I-F1}

Since H-F1 takes both per-class and per-image F1-scores into account, it enables us to make easy comparison between different methods for large-scale image annotation.

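For reference, a straightforward NumPy sketch of all seven metrics, computed from binary ground-truth and prediction matrices, is given below; the small eps terms guard against empty classes and are an implementation convenience, not part of the definitions above.

```python
import numpy as np

def annotation_metrics(y_true, y_pred, eps=1e-12):
    """Per-class and per-image precision/recall/F1 plus H-F1.

    y_true, y_pred: binary matrices of shape (N, C) for N test images, C classes.
    """
    tp = (y_true * y_pred).astype(float)   # correctly annotated labels

    # Per-class metrics: average over the C classes.
    c_p = np.mean(tp.sum(axis=0) / (y_pred.sum(axis=0) + eps))
    c_r = np.mean(tp.sum(axis=0) / (y_true.sum(axis=0) + eps))
    c_f1 = 2 * c_p * c_r / (c_p + c_r + eps)

    # Overall per-image metrics: pool counts over all test images.
    i_p = tp.sum() / (y_pred.sum() + eps)
    i_r = tp.sum() / (y_true.sum() + eps)
    i_f1 = 2 * i_p * i_r / (i_p + i_r + eps)

    h_f1 = 2 * c_f1 * i_f1 / (c_f1 + i_f1 + eps)
    return {'C-P': c_p, 'C-R': c_r, 'C-F1': c_f1,
            'I-P': i_p, 'I-R': i_r, 'I-F1': i_f1, 'H-F1': h_f1}
```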

3) Settings: In this paper, the basic CNN module makes use of ResNet-101 [5], which is pretrained on the ImageNet 2012 classification challenge dataset [10]. Our experiments are all conducted on TensorFlow. The input images are resized to 224×224 pixels. For training the basic CNN model, the multi-scale CNN model, and the multi-class classifier, the learning rate is set to 0.001 for NUS-WIDE and 0.002 for MSCOCO, respectively. The learning rate is multiplied by 0.1 after every 80,000 iterations, up to 160,000 iterations with early stopping. For training the textual feature learning model and the label quantity prediction model, the learning rate is set to 0.1 for NUS-WIDE and 0.0001 for MSCOCO, respectively. These models are trained using the Momentum optimizer with a momentum of 0.9. The decay rate of batch normalization and the weight decay are set to 0.9997.

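In TensorFlow 2 terms, the step schedule and Momentum optimizer described above could be set up roughly as follows (shown for the NUS-WIDE visual branch; early stopping and weight decay are omitted from this sketch):

```python
import tensorflow as tf

# Multiply the learning rate by 0.1 after every 80,000 iterations (staircase),
# starting from 0.001 (the NUS-WIDE setting for the visual branch).
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.001, decay_steps=80000, decay_rate=0.1, staircase=True)

# SGD with momentum 0.9 plays the role of the Momentum optimizer.
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)
```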

4) Compared Methods: We conduct two groups of experiments for evaluation and choose competitors to compare accordingly: (1) We first make comparison among several variants of our complete model shown in Fig. 2 by removing one or more components from our model, so that the effectiveness of each component can be evaluated properly. (2) To compare with a wider range of image annotation methods, we also compare with the published results on the two benchmark datasets. These include the state-of-the-art deep learning methods for large-scale image annotation.

B. Effectiveness Evaluation for Model Components

We conduct the first group of experiments to evaluate the effectiveness of the main components of our model for large-scale image annotation. Five closely related models are included in the component effectiveness evaluation: (1) CNN denotes the original ResNet-101 model; (2) MS-CNN denotes the multi-scale ResNet model shown in Fig. 3; (3) MS-CNN+Tags denotes the multi-modal multi-scale ResNet model that learns both visual and textual features for image annotation; (4) MS-CNN+LQP denotes the multi-scale ResNet model that performs both multi-class classification and label quantity prediction (LQP) for image annotation; (5) MS-CNN+Tags+LQP denotes our complete model shown in Fig. 2. Note that the five models can be distinguished by whether they are multi-scale, multi-modal, or perform LQP. This enables us to evaluate the effectiveness of each component of our model.


The results on the two benchmark datasets are shown in Tables I and II, respectively. Here, we also show the upper bounds of our complete model (i.e. MS-CNN+Tags+LQP) obtained by directly using the ground-truth label quantities to refine the predicted annotations (without LQP). We can make the following observations: (1) Since label quantity prediction is explicitly addressed in our model (unlike RNN-based models), it indeed leads to significant improvements according to the H-F1 score (10.42 percent for NUS-WIDE, and 6.18 percent for MSCOCO) when MS-CNN+Tags is used for feature learning. The improvements achieved by label quantity prediction become smaller when only MS-CNN is used for feature learning. This is because the quality of label quantity prediction degrades when the social tags are not used as textual features. (2) The social tags are crucial not only for label quantity prediction but also for the final image annotation. Specifically, the textual features extracted from social tags yield significant gains according to the H-F1 score (8.74 percent for NUS-WIDE, and 4.86 percent for MSCOCO) when both MS-CNN and LQP are adopted for image annotation. This is also supported by the gains achieved by MS-CNN+Tags over MS-CNN. (3) The effectiveness of MS-CNN is verified by the comparison of MS-CNN vs. CNN. Admittedly, only small gains (> 1% in terms of H-F1) have been obtained by MS-CNN. However, this is still impressive since ResNet-101 is a very powerful CNN model. In summary, we have evaluated the effectiveness of all the components of our complete model (shown in Fig. 2).


C. Comparison to the State-of-the-Art Methods

In this group of experiments, we compare our method with the state-of-the-art deep learning methods for large-scale image annotation. The following competitors are included: (1) CNN+Logistic [22]: This model fits a logistic regression classifier for each class label. (2) CNN+Softmax [30]: It is a CNN model that uses softmax as the classifier and the cross entropy as the loss function. (3) CNN+WARP [30]: It uses the same CNN model as above, but a weighted approximate ranking loss function is adopted for training to promote the prec@K metric. (4) CNN-RNN [26]: It is a CNN-RNN model that uses output fusion to merge CNN output features and RNN outputs. (5) RIA [25]: In this CNN-RNN model, the CNN output features are used to set the hidden state of a Long Short-Term Memory (LSTM) network [46]. (6) TagNeighbour [23]: It uses a non-parametric approach to find image neighbours according to metadata, and then aggregates image features for classification. (7) SINN [22]: It uses different concept layers of tags, groups, and labels to model the semantic correlation between concepts of different abstraction levels, and a bidirectional RNN-like algorithm is developed to integrate information for prediction. (8) SR-CNN-RNN [24]: This CNN-RNN model uses a semantically regularised embedding layer as the interface between the CNN and RNN.


The results on the two benchmark datasets are shown in Tables III and IV, respectively. Here, we also show the upper bounds of our model obtained by directly using the ground-truth label quantities to refine the predicted annotations (without LQP). It can be clearly seen that our model outperforms the state-of-the-art deep learning methods according to the H-F1 score. This provides further evidence that our multi-modal multi-scale CNN model along with label quantity prediction is indeed effective in large-scale image annotation. Moreover, it is also noted that our model with explicit label quantity prediction yields better results than the CNN-RNN models with implicit label quantity prediction (i.e. SR-CNN-RNN, RIA, and SINN). This shows that RNN is not the only model suitable for label quantity prediction. In particular, when label quantity prediction is done explicitly as in our model, we can pay more attention to the CNN model itself for deep feature learning. Considering that RNN-based models need prior knowledge on class labels, our model is expected to have wider use in real-world applications. In addition, the annotation methods that adopt multi-modal feature learning or label quantity prediction are generally shown to outperform the methods that consider neither of the two components for image annotation.


D. Further Evaluation

1) Alternative Multi-Scale CNN Models: We also make a comparison with two variant versions of our multi-scale CNN model (denoted as MS-CNN): 1) MS-CNN-AvgPool, which directly combines the final layers of the five scales (i.e. conv1, conv2_3, conv3_4, conv4_23 and conv5_3) using average pooling, resulting in a 3,904-dimensional feature vector; 2) MS-CNN-MaxPool, which only replaces the 3×3 Conv in Fig. 3 by max pooling, without changing any of the other parts of our MS-CNN. The results on the two benchmark datasets are shown in Table V. It is clearly shown that all the multi-scale CNN models outperform the baseline CNN model (i.e. ResNet-101), due to multi-scale feature learning. Moreover, the multi-scale fusion method used in our MS-CNN is shown to yield about 1% gains over those used in MS-CNN-AvgPool and MS-CNN-MaxPool. In particular, as compared to MS-CNN-AvgPool, our MS-CNN does not increase the feature dimension and thus is more practical.


In addition, we provide the ablative results using different fusion methods in Table VI. Here, “sum” represents element-wise feature map summation, and “concat” means that we directly concatenate different levels of features after down-sampling. As lower-level features are fused, the performance (H-F1) of the “sum” fusion increases linearly, while the performance of the “concat” fusion improves only marginally. This shows that the “sum” fusion benefits more from low-level features while maintaining the same feature dimension. The reason is that simply down-sampling and concatenating low-level feature maps throws away many pixels and makes little use of low-level information such as texture and shape.


2) Alternative Textual Features: We explore different types of sub-networks for textual feature extraction from noisy tags. Word2vec [47] is one of the most popular methods for textual feature extraction. Since the 1,000 most frequent tags are unordered and cannot form meaningful sentences, it is hard to directly generate word2vec features from these tags. Instead, we compare the MLP textual features with GloVe [48] word vectors, which are pre-trained on the Wikipedia 2014 + Gigaword 5 dataset and then fine-tuned on the two datasets. Specifically, we encode each tag into an embedding vector, and accumulate those vectors for each image into the final word2vec features. As shown in Table VII, our MLP features outperform the word2vec features by about 2 percent on the two datasets. A possible reason is that the word2vec features are more sensitive to the correctness of tag words. An attention mechanism could help to extract more stable word2vec features, but this is not the focus of this paper.


3) Results of Label Quantity Prediction: We have evaluated the effectiveness of LQP in the above experiments, but have not shown the quality of LQP itself. In this paper, to measure the quality of LQP, two metrics are computed as follows: 1) Accuracy: the predicted label quantities are first quantized to their nearest integers, and then compared to the ground-truth label quantities to obtain the accuracy; 2) Mean Squared Error (MSE): the mean squared error is computed by directly comparing the predicted label quantities to the ground-truth ones. The results of LQP on the two benchmark datasets are shown in Tables VIII and IX, respectively. It can be seen that: (1) More than 45% of label quantities are correctly predicted in all cases. (2) The textual features extracted from social tags yield significant gains when MSE is used as the measure. In addition, our LQP method is also shown to be able to handle the extreme cases where the label quantity is very small or very large. For images which have only one single label, the predicted label quantity is 1.20±0.52 on NUS-WIDE and 1.16±0.47 on MSCOCO, which means that our LQP method can handle the case when the label quantity reaches the minimum number. For the few images which have more than 8 labels, the predicted label quantity on NUS-WIDE and MSCOCO is 6.04 ± 1.34 and 6.32 ± 1.12, respectively. Note that such images with large label quantities are rare in both the training and test sets (< 0.5% for NUS-WIDE and < 1.5% for MSCOCO), and thus it is difficult for any model to precisely estimate the label quantity of these outliers. Although the predicted number is only a conservative estimate of the ground-truth label quantity, our LQP method can still capture the important information that there exist multiple objects in these images.

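The two quality measures reduce to a few lines of NumPy, as sketched below:

```python
import numpy as np

def lqp_quality(m_true, m_pred):
    """Quality of label quantity prediction: rounding accuracy and MSE."""
    m_true = np.asarray(m_true, dtype=float)
    m_pred = np.asarray(m_pred, dtype=float)
    accuracy = np.mean(np.round(m_pred) == m_true)  # quantise to nearest integer
    mse = np.mean((m_pred - m_true) ** 2)           # direct squared deviation
    return accuracy, mse
```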

We also explore other label quantity determination methods. As shown in Table X, taking a proper threshold (e.g. p = 0.3) of classification probability for determining the optimal label number of each image can help to yield better annotations, as compared to the conventional top-k label selection method (i.e. “MS-CNN”). However, it is hard to select the optimal threshold for label quantity determination. Therefore, our LQP method is clearly shown to outperform the threshold method.


Another candidate method for label quantity prediction is “Classification”, which regards each possible label quantity as one category and applies the softmax cross entropy for training. Note that our LQP method is different from this “Classification” method in that label quantity prediction is formulated as a regression problem using the mean squared error loss (thus denoted as “Regression”). The experimental results in Table XI demonstrate that label quantity prediction consistently leads to performance improvements no matter which LQP method is adopted. Moreover, the “Regression” method outperforms “Classification” by over 2%, which indicates that the “Classification” method is not a sound choice for label quantity prediction. The explanation is that a huge gap between the ground-truth quantity and the predicted number cannot be properly penalized by the classification loss. In contrast, the mean squared error used in our LQP model can ensure that the prediction is not far away from the ground-truth quantity.


4) End-to-End Training: We can also train our whole model in an end-to-end manner. Specifically, the visual and textual representation learning module (MS-CNN+Tags) is pre-trained as initialization, and then we fine-tune all parameters except those of the vanilla ResNet. The objective function is the sum of the multi-label classification loss and the label quantity regression loss. The comparative results in Table XII show the effectiveness of end-to-end training, which means that classification and label quantity prediction can promote each other, and a better multi-modal feature representation is achieved by the interaction between the visual and textual branches.

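As a minimal sketch, and assuming the two losses are simply summed with equal weight as stated, the end-to-end objective is:

```python
import tensorflow as tf

def joint_loss(y_true, logits, m_true, m_pred):
    """End-to-end objective: sigmoid cross-entropy classification loss plus
    squared-error label quantity loss, summed with equal weight."""
    l_cls = tf.reduce_sum(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=y_true, logits=logits))
    l_reg = tf.reduce_sum(tf.square(m_pred - m_true))
    return l_cls + l_reg
```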

5) Qualitative Results of Image Annotation: Fig. 7 shows several annotation examples on NUS-WIDE when our model is employed. Here, annotations with green font are included in the ground-truth class labels, while annotations with blue font are correctly tagged but not included in the ground-truth class labels (i.e. the ground-truth labels may be incomplete). It is observed that the images in the first two rows are all annotated exactly correctly, while this is not true for the images in the last two rows. However, by checking the extra annotations (with blue font) generated for the images in the last two rows, we find that they are all consistent with the content of the images.


V. CONCLUSION

We have proposed a novel multi-modal multi-scale deep learning model for large-scale image annotation. Compared to existing models, the main difference is the multi-scale feature learning architecture designed to extract and fuse features at different layers suitable for representing visual concepts of different levels of abstraction. Furthermore, different from the RNN-based models with implicit label quantity prediction, a regressor is directly added to our model for explicit label quantity prediction. Extensive experiments are carried out to demonstrate that the proposed model outperforms the state-of-the-art methods. Moreover, each component of our model has also been shown to be effective in large-scale image annotation. A number of directions are worth further investigation. Firstly, the current multi-scale feature fusion architecture is based on the ResNet structure; it is desirable to investigate how other types of architecture can be integrated into our framework for multi-scale feature fusion. Secondly, other types of side information (e.g. group labels) can be fused into our model. Finally, extracting and fusing multi-scale deep features are also important for other visual recognition tasks which require multi-scale or spatial-layout-sensitive representations. Part of the ongoing work is thus to investigate how the proposed model can be applied to those tasks.

