Paper Translation: Self-Supervised Learning by Predicting Image Rotations


Preface

Paper link: https://arxiv.org/abs/1803.07728
Spyros Gidaris, Praveer Singh, Nikos Komodakis
University Paris-Est, LIGM
Ecole des Ponts ParisTech
{spyros.gidaris,praveer.singh,nikos.komodakis}@enpc.fr


ABSTRACT

Over the last years, deep convolutional neural networks (ConvNets) have transformed the field of computer vision thanks to their unparalleled capacity to learn high level semantic image features. However, in order to successfully learn those features, they usually require massive amounts of manually labeled data, which is both expensive and impractical to scale.

Therefore, unsupervised semantic feature learning, i.e., learning without requiring manual annotation effort, is of crucial importance in order to successfully harvest the vast amount of visual data that are available today. In our work we propose to learn image features by training ConvNets to recognize the 2d rotation that is applied to the image that it gets as input.

We demonstrate both qualitatively and quantitatively that this apparently simple task actually provides a very powerful supervisory signal for semantic feature learning. We exhaustively evaluate our method in various unsupervised feature learning benchmarks and we exhibit in all of them state-of-the-art performance.

Specifically, our results on those benchmarks demonstrate dramatic improvements w.r.t. prior state-of-the-art approaches in unsupervised representation learning and thus significantly close the gap with supervised feature learning. For instance, in PASCAL VOC 2007 detection task our unsupervised pre-trained AlexNet model achieves the state-of-the-art (among unsupervised methods) mAP of 54.4% that is only 2.4 points lower from the supervised case.

We get similarly striking results when we transfer our unsupervised learned features on various other tasks, such as ImageNet classification, PASCAL classification, PASCAL segmentation, and CIFAR-10 classification. The code and models of our paper will be published on: https://github.com/gidariss/FeatureLearningRotNet.

1 INTRODUCTION

In recent years, the widespread adoption of deep convolutional neural networks (LeCun et al., 1998) (ConvNets) in computer vision has led to tremendous progress in the field. Specifically, by training ConvNets on the object recognition (Russakovsky et al., 2015) or the scene classification (Zhou et al., 2014) tasks with a massive amount of manually labeled data, they manage to learn powerful visual representations suitable for image understanding tasks. For instance, the image features learned by ConvNets in this supervised manner have achieved excellent results when they are transferred to other vision tasks, such as object detection (Girshick, 2015), semantic segmentation (Long et al., 2015), or image captioning (Karpathy & Fei-Fei, 2015). However, supervised feature learning has the main limitation of requiring intensive manual labeling effort, which is both expensive and infeasible to scale on the vast amount of visual data that are available today.

Due to that, there is lately an increased interest to learn high level ConvNet based representations in an unsupervised manner that avoids manual annotation of visual data. Among them, a prominent paradigm is the so-called self-supervised learning that defines an annotation free pretext task, using only the visual information present on the images or videos, in order to provide a surrogate supervision signal for feature learning.

For example, in order to learn features, Zhang et al. (2016a) and Larsson et al. (2016) train ConvNets to colorize gray scale images, Doersch et al. (2015) and Noroozi & Favaro (2016) predict the relative position of image patches, and Agrawal et al. (2015) predict the egomotion (i.e., self-motion) of a moving vehicle between two consecutive frames.

The rationale behind such self-supervised tasks is that solving them will force the ConvNet to learn semantic image features that can be useful for other vision tasks. In fact, image representations learned with the above self-supervised tasks, although they have not managed to match the performance of supervised-learned representations, they have proved to be good alternatives for transferring on other vision tasks, such as object recognition, object detection, and semantic segmentation (Zhang et al., 2016a; Larsson et al., 2016; Zhang et al., 2016b; Larsson et al., 2017; Doersch et al., 2015; Noroozi & Favaro, 2016; Noroozi et al., 2017; Pathak et al., 2016a; Doersch & Zisserman, 2017).

Other successful cases of unsupervised feature learning are clustering based methods (Dosovitskiy et al., 2014; Liao et al., 2016; Yang et al., 2016), reconstruction based methods (Bengio et al., 2007; Huang et al., 2007; Masci et al., 2011), and methods that involve learning generative probabilistic models Goodfellow et al. (2014); Donahue et al. (2016); Radford et al. (2015).

Our work follows the self-supervised paradigm and proposes to learn image representations by training ConvNets to recognize the geometric transformation that is applied to the image that it gets as input. More specifically, we first define a small set of discrete geometric transformations, then each of those geometric transformations are applied to each image on the dataset and the produced transformed images are fed to the ConvNet model that is trained to recognize the transformation of each image.
(Note: my personal understanding of this sentence is that we train a model to recognize geometric transformations, where the transformations are applied to the input images.)

In this formulation, it is the set of geometric transformations that actually defines the classification pretext task that the ConvNet model has to learn. Therefore, in order to achieve unsupervised semantic feature learning, it is of crucial importance to properly choose those geometric transformations (we further discuss this aspect of our methodology in section 2.2). What we propose is to define the geometric transformations as the image rotations by 0, 90, 180, and 270 degrees. Thus, the ConvNet model is trained on the 4-way image classification task of recognizing one of the four image rotations (see Figure 2).

We argue that in order for a ConvNet model to be able to recognize the rotation transformation that was applied to an image, it will need to understand the concept of the objects depicted in the image (see Figure 1), such as their location in the image, their type, and their pose. Throughout the paper we support that argument both qualitatively and quantitatively. Furthermore, we demonstrate in the experimental section of the paper that despite the simplicity of our self-supervised approach, the task of predicting rotation transformations provides a powerful surrogate supervision signal for feature learning and leads to dramatic improvements on the relevant benchmarks.

Note that our self-supervised task is different from the work of Dosovitskiy et al. (2014) and Agrawal et al. (2015) that also involves geometric transformations. Dosovitskiy et al. (2014) train a ConvNet model to yield representations that are discriminative between images and at the same time invariant on geometric and chromatic transformations. In contrast, we train a ConvNet model to recognize the geometric transformation applied to an image. It is also fundamentally different from the egomotion method of Agrawal et al. (2015), which employs a ConvNet model with a siamese-like architecture that takes as input two consecutive video frames and is trained to predict (through regression) their camera transformation. Instead, in our approach, the ConvNet takes as input a single image to which we have applied a random geometric transformation (i.e., rotation) and is trained to recognize (through classification) this geometric transformation without having access to the initial image.

Our contributions are:

• We propose a new self-supervised task that is very simple and at the same time, as we demonstrate throughout the paper, offers a powerful supervisory signal for semantic feature learning.
• We exhaustively evaluate our self-supervised method under various settings (e.g. semi-supervised or transfer learning settings) and in various vision tasks (i.e., CIFAR-10, ImageNet, Places, and PASCAL classification, detection, or segmentation tasks).
• In all of them, our novel self-supervised formulation demonstrates state-of-the-art results with dramatic improvements w.r.t. prior unsupervised approaches.
• As a consequence we show that for several important vision tasks, our self-supervised learning approach significantly narrows the gap between unsupervised and supervised feature learning.

In the following sections, we describe our self-supervised methodology in §2, we provide experimental results in §3, and finally we conclude in §4.

Figure 1: Images rotated by random multiples of 90 degrees (e.g., 0, 90, 180, or 270 degrees). The core intuition of our self-supervised feature learning approach is that if someone is not aware of the concepts of the objects depicted in the images, he cannot recognize the rotation that was applied to them.

2 METHODOLOGY

2.1 OVERVIEW

The goal of our work is to learn ConvNet based semantic features in an unsupervised manner. To achieve that goal we propose to train a ConvNet model $F(\cdot)$ to estimate the geometric transformation applied to an image that is given to it as input. Specifically, we define a set of $K$ discrete geometric transformations $G = \{g(\cdot \mid y)\}_{y=1}^{K}$, where $g(\cdot \mid y)$ is the operator that applies to image $X$ the geometric transformation with label $y$, yielding the transformed image $X^y = g(X \mid y)$. The ConvNet model $F(\cdot)$ gets as input an image $X^{y^*}$ (where the label $y^*$ is unknown to model $F(\cdot)$) and yields as output a probability distribution over all possible geometric transformations:

$$F(X^{y^*} \mid \theta) = \{F^y(X^{y^*} \mid \theta)\}_{y=1}^{K}, \qquad (1)$$

where $F^y(X^{y^*} \mid \theta)$ is the predicted probability for the geometric transformation with label $y$ and $\theta$ are the learnable parameters of model $F(\cdot)$. Therefore, given a set of $N$ training images $D = \{X_i\}_{i=0}^{N}$, the self-supervised training objective that the ConvNet model must learn to solve is:

$$\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} loss(X_i, \theta), \qquad (2)$$

where the loss function $loss(\cdot)$ is defined as:

$$loss(X_i, \theta) = -\frac{1}{K} \sum_{y=1}^{K} \log\left(F^y\big(g(X_i \mid y) \mid \theta\big)\right). \qquad (3)$$

In the following subsection we describe the type of geometric transformations that we propose in our work.
(Note: as I understand the three steps above, a geometric transformation is first applied to the image, which is then passed to the model; the loss is the negative log of the model's predicted probability for the applied transformation, summed over the transformations.)
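
To make the formulation concrete, the following is a minimal PyTorch sketch of Eqs. (2) and (3) for the rotation case introduced next (my own illustration, not the authors' released code; `net` stands for any ConvNet that outputs K = 4 rotation logits, and `rotate_batch` is a helper I define here):

```python
import torch
import torch.nn.functional as F_nn

def rotate_batch(x):
    """Given a batch x of shape (B, C, H, W), return the K = 4 rotated
    copies (0/90/180/270 degrees) stacked into one batch of size 4B,
    together with the rotation labels y in {0, 1, 2, 3}."""
    rotated = [torch.rot90(x, k, dims=(2, 3)) for k in range(4)]
    labels = torch.arange(4, device=x.device).repeat_interleave(x.size(0))
    return torch.cat(rotated, dim=0), labels

def rotation_loss(net, x):
    """Eq. (3) in batch form: the mean cross-entropy, i.e. the -log of
    the probability the model assigns to the rotation actually applied."""
    xr, y = rotate_batch(x)
    logits = net(xr)                      # (4B, 4) scores over rotations
    return F_nn.cross_entropy(logits, y)  # mean of -log F^y(g(X|y) | theta)
```

A standard optimizer step on this loss, averaged over minibatches, then realizes the objective of Eq. (2).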

2.2 CHOOSING GEOMETRIC TRANSFORMATIONS: IMAGE ROTATIONS

In the above formulation, the geometric transformations $G$ must define a classification task that should force the ConvNet model to learn semantic features useful for visual perception tasks (e.g., object detection or image classification). In our work we propose to define the set of geometric transformations $G$ as all the image rotations by multiples of 90 degrees, i.e., the 2d image rotations by 0, 90, 180, and 270 degrees (see Figure 2). More formally, if $Rot(X, \varphi)$ is an operator that rotates image $X$ by $\varphi$ degrees, then our set of geometric transformations consists of the $K = 4$ image rotations $G = \{g(X \mid y)\}_{y=1}^{4}$, where $g(X \mid y) = Rot(X, (y-1)\,90)$.

Figure 2: Illustration of the self-supervised task that we propose for semantic feature learning. Given four possible geometric transformations, the 0, 90, 180, and 270 degrees rotations, we train a ConvNet model $F(\cdot)$ to recognize the rotation that is applied to the image that it gets as input. $F^y(X^{y^*})$ is the probability of rotation transformation $y$ predicted by model $F(\cdot)$ when it gets as input an image that has been transformed by the rotation transformation $y^*$.

(Note: as the figure makes clear, the image is first rotated and then fed to the ConvNet, which is trained to predict the rotation angle.)

Forcing the learning of semantic features:
The core intuition behind using these image rotations as the set of geometric transformations relates to the simple fact that it is essentially impossible for a ConvNet model to effectively perform the above rotation recognition task unless it has first learned to recognize and detect classes of objects as well as their semantic parts in images.

More specifically,to successfully predict the rotation of an image the ConvNet model must necessarily learn to localize salient objects in the image, recognize their orientation and object type, and then relate the object orientation with the dominant orientation that each type of object tends to be depicted within the available images.

(Note: my understanding is that the authors make an assumption here: if a model can tell whether an image has been rotated, then it must have learned both the class features and the location features of the objects to be recognized.)

In Figure 3b we visualize some attention maps generated by a model trained on the rotation recognition task. These attention maps are computed based on the magnitude of activations at each spatial cell of a convolutional layer and essentially reflect where the network puts most of its focus in order to classify an input image. We observe, indeed, that in order for the model to accomplish the rotation prediction task it learns to focus on high level object parts in the image, such as eyes, nose, tails, and heads.

By comparing them with the attention maps generated by a model trained on the object recognition task in a supervised way (see Figure 3a) we observe that both models seem to focus on roughly the same image regions.

Furthermore, in Figure 4 we visualize the first layer filters that were learned by an AlexNet model trained on the proposed rotation recognition task. As can be seen, they appear to have a big variety of edge filters on multiple orientations and multiple frequencies. Remarkably, these filters seem to have a greater amount of variety even than the filters learned by the supervised object recognition task.

(Note: the left side of this figure shows the supervised model; comparing its attention maps with the results on the right, obtained with the method of this paper, it is clear that the paper's approach achieves even better focus, especially when the convolution kernels are small.)

Figure 3: Attention maps generated by an AlexNet model trained (a) to recognize objects (supervised), and (b) to recognize image rotations (self-supervised). In order to generate the attention map of a conv. layer we first compute the feature maps of this layer, then we raise each feature activation on the power p, and finally we sum the activations at each location of the feature map. For the conv. layers 1, 2, and 3 we used the powers p = 1, p = 2, and p = 4 respectively. For visualization of our self-supervised model’s attention maps for all the rotated versions of the images see Figure 6 in appendix A.
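
As a rough sketch of the attention-map recipe the caption describes (my own rendering under the stated assumptions; `feature_map` is a hypothetical pre-computed activation tensor of one conv. layer):

```python
import torch

def attention_map(feature_map, p=2):
    """feature_map: (C, H, W) activations of a conv. layer.
    Raise the magnitude of every activation to the power p and sum over
    the channel dimension, giving one (H, W) saliency map."""
    return feature_map.abs().pow(p).sum(dim=0)
```

Following the caption, one would call this with p = 1, 2, and 4 for conv. layers 1, 2, and 3 respectively.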

Absence of low-level visual artifacts:

An additional important advantage of using image rotations by multiples of 90 degrees over other geometric transformations, is that they can be implemented by flip and transpose operations (as we will see below) that do not leave any easily detectable low-level visual artifacts that will lead the ConvNet to learn trivial features with no practical value for the vision perception tasks.

In contrast, had we decided to use as geometric transformations, e.g., scale and aspect ratio image transformations, in order to implement them we would need to use image resizing routines that leave easily detectable image artifacts.

(Note: in this part the authors explain that, compared with other transformations, their approach leaves behind no detectable low-level image artifacts.)

Well-posedness:

Furthermore, human captured images tend to depict objects in an “up-standing” position, thus making the rotation recognition task well defined, i.e., given an image rotated by 0, 90, 180, or 270 degrees, there is usually no ambiguity of what is the rotation transformation (with the exception of images that only depict round objects).

(Note: because the paper restricts itself to these four rotations, there is no ambiguity; it is always possible to determine which rotation was applied.)

In contrast, that is not the case for the object scale that varies significantly on human captured images.

Implementing image rotations:

In order to implement the image rotations by 90, 180, and 270 degrees (the 0 degrees case is the image itself), we use flip and transpose operations. Specifically, for 90 degrees rotation we first transpose the image and then flip it vertically (upside-down flip), for 180 degrees rotation we flip the image first vertically and then horizontally (left-right flip), and finally for 270 degrees rotation we first flip vertically the image and then we transpose it.
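
The following NumPy sketch is my illustration of this flip/transpose recipe (it assumes an (H, W, C) image array; the function and variable names are mine):

```python
import numpy as np

def rotate_90(img):
    # transpose the two spatial axes, then flip upside-down
    return np.flipud(img.transpose(1, 0, 2))

def rotate_180(img):
    # flip vertically, then horizontally
    return np.fliplr(np.flipud(img))

def rotate_270(img):
    # flip vertically, then transpose
    return np.flipud(img).transpose(1, 0, 2)

def g(img, y):
    """The operator g(X|y) = Rot(X, (y - 1) * 90) for y in {1, 2, 3, 4}."""
    return [lambda x: x, rotate_90, rotate_180, rotate_270][y - 1](img)
```

Because only flips and transposes are involved, no interpolation is performed, which is exactly why no resizing artifacts are introduced.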

2.3 DISCUSSION

The simple formulation of our self-supervised task has several advantages. It has the same computational cost as supervised learning, similar training convergence speed (that is significantly faster than image reconstruction based approaches; our AlexNet model trains in around 2 days using a single Titan X GPU), and can trivially adopt the efficient parallelization schemes devised for supervised learning (Goyal et al., 2017), making it an ideal candidate for unsupervised learning on internet-scale data (i.e., billions of images).

Furthermore, our approach does not require any special image pre-processing routine in order to avoid learning trivial features, as many other unsupervised or self-supervised approaches do.

3 EXPERIMENTAL RESULTS

In this section we conduct an extensive evaluation of our approach on the most commonly used image datasets, such as CIFAR-10 (Krizhevsky & Hinton, 2009), ImageNet (Russakovsky et al., 2015),PASCAL (Everingham et al., 2010), and Places205 (Zhou et al., 2014), as well as on various vision tasks, such as object detection, object segmentation, and image classification. We also consider several learning scenarios, including transfer learning and semi-supervised learning. In all cases, we compare our approach with corresponding state-of-the-art methods.

Figure 4: First layer filters learned by an AlexNet model trained on (a) the supervised object recognition task and (b) the self-supervised task of recognizing rotated images. We observe that the filters learned by the self-supervised task are mostly oriented edge filters on various frequencies and, remarkably, they seem to have more variety than those learned on the supervised task.

Table 1: Evaluation of the unsupervised learned features by measuring the classification accuracy that they achieve when we train a non-linear object classifier on top of them. The reported results are from CIFAR-10. The size of the ConvB1 feature maps is 96 × 16 × 16 and the size of the rest feature maps is 192 × 8 × 8.

Closing remarks

Following the authors' idea, after training the model to recognize image rotations we can strip off the fully connected layers it used for classification and attach new convolutional and fully connected layers. In the second round of training the earlier weights are kept fixed and only the newly added layers are trained, which yields the final result.

What follows in the paper is a series of evaluations by the authors; the core idea is essentially complete at this point, so I will not repeat them here. Readers interested in the test results can consult the original paper and the authors' code.
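
As a rough PyTorch sketch of that transfer recipe (my own illustration, not the paper's evaluation code; `rotnet_features` stands for the pretrained convolutional trunk of the rotation model, and the 192-channel width is an assumption borrowed from the feature-map sizes quoted in Table 1):

```python
import torch.nn as nn

def build_transfer_model(rotnet_features, num_classes=10):
    """Freeze the feature extractor learned on rotation prediction and
    attach a fresh classification head that is trained from scratch."""
    for param in rotnet_features.parameters():
        param.requires_grad = False   # keep the pretrained weights fixed

    return nn.Sequential(
        rotnet_features,              # frozen conv layers from the RotNet
        nn.AdaptiveAvgPool2d(1),      # pool each feature map to one value
        nn.Flatten(),
        nn.Linear(192, num_classes),  # new, trainable classifier head
    )
```

Only the parameters that still have `requires_grad=True` (the new head) should then be passed to the optimizer.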
