A Framework For Contrastive Self-Supervised Learning And Designing A New Approach

weixin_26630173

Original article: https://towardsdatascience.com/a-framework-for-contrastive-self-supervised-learning-and-designing-a-new-approach-3caab5d29619

This is the partner blog matching our new paper: A Framework For Contrastive Self-Supervised Learning And Designing A New Approach (by William Falcon and Kyunghyun Cho).

In the last year, a stream of “novel” self-supervised learning algorithms has set new state-of-the-art results in AI research: AMDIM, CPC, SimCLR, BYOL, Swav, and others.

In our recent paper, we formulate a conceptual framework for characterizing contrastive self-supervised learning approaches. We use the framework to analyze three of these leading approaches (SimCLR, CPC, and AMDIM) and show that, although they seem different on the surface, they are in fact slight tweaks of one another.

In this blog we will:

Review self-supervised learning.
Review contrastive learning.
Propose a framework for comparing recent approaches.
Compare CPC, AMDIM, Moco, SimCLR, and BYOL using our framework.
Formulate a new approach, YADIM, using our framework.
Describe some of our results.
Describe the computational requirements to achieve these results.

The majority of this work was conducted while at Facebook AI Research.

Implementations

All of the augmentations and approaches described in this article are implemented in PyTorch Lightning, which lets you train on arbitrary hardware and makes side-by-side comparison of the approaches much easier.

AMDIM

BYOL

CPC V2 (to our knowledge, the only verified implementation outside of DeepMind).

Moco V2

SimCLR

Self-Supervised Learning

Recall that in supervised learning, a system is given an input (x) and a label (y).

[Figure: Supervised learning. Input on the left, label on the right.]

In self-supervised learning, the system is only given (x). Instead of a (y), the system “learns to predict part of its input from other parts of its input” [reference].

In fact, this formulation is so generic that you can get creative about ways of “splitting” up the input. These strategies are called pretext tasks and researchers have tried all sorts of approaches. Here are three examples: (1) predicting relative locations of two patches, (2) solving a jigsaw puzzle, (3) colorizing an image.

Although the approaches above are full of creativity, they don’t work well in practice. However, a more recent stream of approaches that use contrastive learning has started to dramatically close the gap with supervised learning on ImageNet.

Contrastive Learning

A fundamental idea behind most machine learning algorithms is that similar examples should be grouped together and far from other clusters of related examples.

This idea is behind one of the earliest works on contrastive learning, Learning a Similarity Metric Discriminatively, with Application to Face Verification, by Chopra et al. in 2005.

The animation below illustrates this main idea:

Contrastive learning achieves this by using three key ingredients: an anchor, a positive, and one or more negative representations. To create a positive pair, we need two examples that are similar; for a negative pair, we use a third example that is not similar.

But in self-supervised learning, we don’t know the labels of the examples. So, there’s no way to know whether two images are similar or not.

However, if we assume that each image is its own class, then we can come up with all sorts of ways of forming these triplets (the positive and negative pairs). This means that in a dataset of size N, we now have N labels!

Now that we know the labels (kind of) for each image, we can use data augmentations to generate these triplets.

Characteristic 1: Data augmentation pipeline

The first way we can characterize a contrastive self-supervised learning approach is by defining a data augmentation pipeline.

A data augmentation pipeline A(x) applies a sequence of stochastic transformations to the same input.

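As a minimal sketch of what A(x) could look like, the Python below chains two toy stochastic transforms over an image stored as nested lists of pixel intensities. The names `random_flip`, `random_brightness`, and `make_pipeline` are illustrative assumptions, not functions from the paper or from any library, and the transforms are much simpler than the ones the approaches below actually use.

```python
import random


def random_flip(image, rng):
    """Mirror the image left to right with probability 0.5."""
    if rng.random() < 0.5:
        return [row[::-1] for row in image]
    return image


def random_brightness(image, rng):
    """Scale every pixel by one random factor, clipping at 1.0."""
    scale = rng.uniform(0.8, 1.2)
    return [[min(1.0, pixel * scale) for pixel in row] for row in image]


def make_pipeline(transforms):
    """Build A(x): apply each stochastic transform in sequence."""
    def pipeline(image, rng):
        for transform in transforms:
            image = transform(image, rng)
        return image
    return pipeline


augment = make_pipeline([random_flip, random_brightness])
rng = random.Random(0)
image = [[0.1, 0.2], [0.3, 0.4]]
view = augment(image, rng)  # one random view of the same input
```

Because every transform draws from `rng`, calling `augment` again on the same image generally yields a different view, which is the property the later sections rely on.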

In deep learning, data augmentation aims to build representations that are invariant to noise in the raw input. For example, the network should recognize a pig as a pig even if the image is rotated, the colors are removed, or the pixels are “jittered” around.

In contrastive learning, the data augmentation pipeline has a secondary goal: to generate the anchor, positive, and negative examples that are fed to the encoder and used for extracting representations.

CPC pipeline

CPC introduced a pipeline that applies transforms like color jitter, random greyscale, and random flip, but it also introduced a special transform that splits an image into overlapping sub-patches.

Using this pipeline, CPC can generate many sets of positive and negative samples. In practice, this process is applied to a batch of examples where we can use the rest of the examples in the batch as the negative samples.

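The patch-splitting transform can be sketched as a sliding window whose stride is smaller than the patch size, so neighbouring patches overlap. The function name `split_into_patches` and the patch size and stride used here are assumptions chosen for illustration, not the values used by CPC.

```python
def split_into_patches(image, patch_size, stride):
    """Return a grid of square patches; a stride below patch_size makes them overlap."""
    height, width = len(image), len(image[0])
    grid = []
    for top in range(0, height - patch_size + 1, stride):
        row_of_patches = []
        for left in range(0, width - patch_size + 1, stride):
            patch = [r[left:left + patch_size] for r in image[top:top + patch_size]]
            row_of_patches.append(patch)
        grid.append(row_of_patches)
    return grid


# An 8x8 toy "image" whose pixel value encodes its position.
image = [[y * 8 + x for x in range(8)] for y in range(8)]
grid = split_into_patches(image, patch_size=4, stride=2)  # 3x3 grid of 4x4 patches
```

With a stride of half the patch size, each patch shares half of its columns with its right-hand neighbour.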

AMDIM pipeline

AMDIM takes a slightly different approach. After it performs the standard transforms (jitter, flip, etc…), it generates two versions of an image by applying the data augmentation pipeline twice to the same image.

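A rough sketch of the two-view idea, assuming a placeholder `augment` function that only adds small noise (a stand-in for a real pipeline): each image in a batch is augmented twice, views with the same index form positive pairs, and views with different indices serve as negatives. The helper names are hypothetical.

```python
import random


def augment(image, rng):
    """Placeholder stochastic augmentation: add small noise to a flat image."""
    return [pixel + rng.uniform(-0.05, 0.05) for pixel in image]


def two_views(batch, rng):
    """Apply the augmentation pipeline twice to every image in the batch."""
    views_a = [augment(image, rng) for image in batch]
    views_b = [augment(image, rng) for image in batch]
    return views_a, views_b


def label_pairs(batch_size):
    """Index pairs (i, j) into views_a and views_b: same index is positive."""
    positives = [(i, i) for i in range(batch_size)]
    negatives = [(i, j) for i in range(batch_size)
                 for j in range(batch_size) if i != j]
    return positives, negatives


rng = random.Random(0)
batch = [[0.1, 0.2], [0.5, 0.6], [0.9, 1.0]]
views_a, views_b = two_views(batch, rng)
positives, negatives = label_pairs(len(batch))
```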

This idea was actually proposed in 2014 in a paper by Dosovitskiy et al. The idea is to use a “seed” image to generate many versions of the same image.

SimCLR, Moco, Swav, BYOL pipelines

The pipeline in AMDIM worked so well that every approach that has followed uses the same pipeline but makes slight tweaks to the transforms that happen beforehand (some add jitter, some add Gaussian blur, and so on). However, most of these tweaks are inconsequential compared with the main idea introduced in AMDIM.

In our paper, we ran ablations on the impact of these transforms and found that the choice of transforms is critical to the performance of the approach. In fact, we believe that the success of these approaches is mostly driven by the particular choice of transforms.

These findings are in line with similar results posted by SimCLR and BYOL.

The video below illustrates the SimCLR pipeline in more detail.

Characteristic 2: Encoder

The second way we characterize these methods is by the choice of encoder. Most of the approaches above use ResNets of various widths and depths.

When these methods began to come out, CPC and AMDIM designed custom encoders. Our ablations found that AMDIM did not generalize well to a different encoder, while CPC suffered less from a change in the encoder.

Every approach since CPC has settled on a ResNet-50. While there may be better architectures that we’ve yet to invent, standardizing on the ResNet-50 means we can focus on the other characteristics, so that improvements come from better training methods and not from better architectures.

One finding did hold true for every ablation: wider encoders perform much better in contrastive learning.

Characteristic 3: Representation extraction

The third way to characterize these methods is by the strategy they employ to extract representations. This is arguably where the “magic” happens in all of these methods and where they differ the most.

To understand why this is important, let’s first define what we mean by representations. A representation is the set of unique characteristics that allow a system (or a human) to understand what makes an object that object and not a different one.

This Quora post uses the example of trying to classify a shape. To classify shapes successfully, a good representation might be the number of corners detected in the shape.

In this collection of methods for contrastive learning, these representations are extracted in various ways.

CPC

CPC introduces the idea of learning representations by predicting the “future” in latent space. In practice this means two things:

1) Treat an image as a timeline with the past at the top left and the future at the bottom right.

2) The predictions don’t happen at the pixel level; instead, they use the outputs of the encoder (i.e., the latent space).

Finally, the representation extraction happens by formulating a prediction task that uses the output of the encoder (H) as targets for the context vectors generated by a projection head (which the authors call a context encoder).

In our paper, we find that this prediction task is unnecessary as long as the data augmentation pipeline is strong enough. And while there are a lot of hypotheses about what makes a good pipeline, we suggest that a strong pipeline creates positive pairs that share a similar global structure but have a different local structure.

AMDIM

AMDIM, on the other hand, compares representations across views, using feature maps extracted from intermediate layers of a convolutional neural network (CNN). Let’s unpack this into two parts: 1) multiple views of an image, and 2) intermediate layers of a CNN.

1) Recall that the data augmentation pipeline of AMDIM generates two versions of the same image.

2) Each version is passed into the same encoder to extract feature maps for each image. AMDIM does not discard the intermediate feature maps generated by the encoder but instead uses them to make comparisons across spatial scales. Recall that as an input makes its way through the layers of a CNN, the receptive fields encode information for different scales of an input.

AMDIM leverages these ideas by making the comparisons across the intermediate outputs of a CNN. The following animation illustrates how these comparisons are made across the three feature maps generated by the encoder.

The rest of these methods make slight tweaks to the idea proposed by AMDIM.

SimCLR

SimCLR uses the same idea as AMDIM but makes two tweaks.

A) Use only the last feature map

B) Run that feature map through a projection head and compare both vectors (similar to the CPC context projection).

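To make tweak B concrete, here is a sketch of a projection head as a small two-layer network (linear, ReLU, linear) applied to a feature vector before the comparison. The layer sizes and weights are toy values chosen so the arithmetic is easy to follow, and the helper names are assumptions; a real head would be a trained module in a deep learning framework.

```python
def linear(x, weights, bias):
    """Dense layer: one output per row of weights."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in x]


def projection_head(features, w1, b1, w2, b2):
    """Map encoder features to the space where views are compared."""
    return linear(relu(linear(features, w1, b1)), w2, b2)


features = [1.0, -2.0]                      # stand-in for the last feature map
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[2.0, 3.0]], [0.5]
projection = projection_head(features, w1, b1, w2, b2)
```

The two projected vectors, one per view, are what get compared by the similarity measure.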

Moco

As we mentioned earlier, contrastive learning needs negative samples to work. Normally this is done by comparing an image in a batch against the other images in a batch.

Moco does the same thing as AMDIM (with the last feature map only) but keeps a history of the batches it has seen, stored as a queue of encoded representations, and uses it to increase the number of negative samples. The effect is that the number of negative samples used to provide a contrastive signal grows beyond a single batch size.

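The history of past batches can be pictured as a fixed-size first-in, first-out queue: each new batch of encoded vectors is appended and the oldest entries fall off. The class name `NegativeQueue` and the queue size are assumptions for this sketch, and strings stand in for encoded vectors.

```python
from collections import deque


class NegativeQueue:
    """Fixed-size FIFO store of representations from earlier batches."""

    def __init__(self, max_size):
        self.items = deque(maxlen=max_size)

    def negatives(self):
        """Everything currently stored can serve as a negative sample."""
        return list(self.items)

    def enqueue(self, batch_representations):
        """Add the newest batch; the oldest entries are dropped automatically."""
        self.items.extend(batch_representations)


queue = NegativeQueue(max_size=4)
queue.enqueue(["a", "b", "c"])   # first batch
queue.enqueue(["d", "e"])        # second batch pushes out the oldest entry
available = queue.negatives()
```

The number of negatives is then set by `max_size` rather than by the batch size.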

BYOL

BYOL uses the same main ideas as AMDIM (with the last feature map only) but makes two changes.

BYOL uses two encoders instead of one. The second encoder is a copy of the first, but instead of updating its weights on every pass, it updates them as a rolling average of the first encoder’s weights.
BYOL does not use negative samples. Instead, it relies on the rolling weight updates to give a contrastive signal to the training. However, a recent ablation found that this may not be necessary and that adding batch normalization is in fact what ensures the system does not generate trivial solutions.

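The rolling-average update of the second encoder can be sketched as an exponential moving average over the weights. The function name `ema_update`, the flat weight lists, and the decay value are assumptions for illustration.

```python
def ema_update(target_weights, online_weights, decay):
    """Move each target weight a small step toward the online weight."""
    return [decay * t + (1.0 - decay) * o
            for t, o in zip(target_weights, online_weights)]


online = [1.0, 1.0]   # weights of the first encoder (trained by gradient descent)
target = [0.0, 1.0]   # weights of the second encoder (never trained directly)
target = ema_update(target, online, decay=0.9)
```

A decay close to 1 means the second encoder changes slowly and lags behind the first.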

Swav

Swav frames its representation extraction task as one of “online clustering” in which it enforces “consistency between codes from different augmentations of the same image” [reference]. So, it’s the same approach as AMDIM (using only the last feature map), but instead of comparing the vectors directly against each other, it computes their similarity against a set of K precomputed codes.

In practice, this means that Swav generates K clusters and for each encoded vector it compares against those clusters to learn new representations. This work can be viewed as mixing the ideas of AMDIM and Noise as Targets.

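A rough illustration of comparing a vector against K clusters rather than against another vector: score the encoded vector against each cluster vector with a dot product and normalize the scores with a softmax. This is only a sketch under simplifying assumptions; the names, the temperature, and the toy cluster vectors are hypothetical, and the sketch omits how the real method computes and balances its codes.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def soft_assignment(vector, clusters, temperature=0.5):
    """Softmax over the similarities between one vector and K cluster vectors."""
    scores = [dot(vector, c) / temperature for c in clusters]
    highest = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


clusters = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # K = 3 toy cluster vectors
assignment = soft_assignment([1.0, 0.0], clusters)
```

Two views of the same image would then be trained to produce consistent assignments.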

Characteristic 3, takeaways

The representation extraction strategy is where these approaches all differ. However, the changes are very subtle, and without rigorous ablations it’s hard to tell what actually drives the results.

From our experiments, we found that the CPC and AMDIM strategies have a negligible effect on the results but instead add complexity. The primary driver that makes these approaches work is the data augmentation pipeline.

Characteristic 4: Similarity measure

The fourth characteristic we can use to compare these approaches is the similarity measure they use. All of the approaches above use a dot product or cosine similarity. Although our paper does not list these ablations, our experiments show that the choice of similarity is largely inconsequential.

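For reference, the two measures differ only in normalization: cosine similarity is the dot product divided by the lengths of the two vectors. A small sketch with toy vectors:

```python
import math


def dot(a, b):
    """Dot product: grows with both alignment and vector length."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Dot product of the length-normalized vectors: depends on direction only."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [6.0, 8.0]   # same direction, different lengths
dot_score = dot(a, b)
cosine_score = cosine(a, b)
```

Here the dot product is large because `b` is long, while the cosine similarity only reports that the two vectors point the same way.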

Characteristic 5: Loss function

The fifth characteristic we use to compare these approaches is the choice of loss function. All of these approaches (except BYOL) have converged on using an NCE loss. The NCE loss has two parts, a numerator and a denominator. The numerator pulls similar vectors close together, and the denominator pushes all other vectors far apart.

Without the denominator, the loss can trivially become a constant, and thus the representations learned will not be useful.

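One common way to write such a loss, shown here as a sketch rather than the exact form used by any of the approaches above, is the negative log of a ratio: the exponentiated similarity of the positive pair over the sum of exponentiated similarities of the positive and all negatives. The dot-product similarity, the temperature parameter, and the function names are assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nce_loss(anchor, positive, negatives, temperature=1.0):
    """-log( exp(sim(anchor, positive)) / sum of exp(sim) over positive and negatives )."""
    numerator = math.exp(dot(anchor, positive) / temperature)
    denominator = numerator + sum(
        math.exp(dot(anchor, negative) / temperature) for negative in negatives
    )
    return -math.log(numerator / denominator)


anchor = [1.0, 0.0]
# Positive aligned with the anchor, negatives elsewhere: low loss.
good = nce_loss(anchor, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# Positive opposite to the anchor, a negative aligned with it: high loss.
bad = nce_loss(anchor, [-1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]])
```

The loss falls as the positive moves toward the anchor and as the negatives move away from it, which matches the roles of the numerator and denominator described above.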

BYOL, however, drops the need for the denominator and instead relies on the weighted updates to the second encoder to provide the contrastive signal. As mentioned earlier, though, recent ablations suggest that this may not actually be the driver of the contrastive signal.

In this video, I give a full explanation on the NCE loss using SimCLR as an example.

Yet Another DIM (YADIM)

We wanted to show the usefulness of our framework by generating a new approach to self-supervised learning without pretext motivations or involved representation extraction strategies. We call this new approach Yet Another DIM (YADIM).

YADIM can be characterized as follows:

Characteristic 1: Data augmentation pipeline

For YADIM we merge the pipelines of CPC and AMDIM.

Characteristic 2: Encoder

We use the encoder from AMDIM, although any encoder, such as a ResNet-50, can also work.

Characteristic 3: Representation extraction

The YADIM strategy is simple. Encode the multiple versions of an image and use the last feature map to make a comparison. There is no projection head or other complicated comparison strategy.

Characteristic 4: Similarity metric

We stick to the dot product for YADIM.

Characteristic 5: Loss function

We also use the NCE loss.

YADIM Results

Even though our only meaningful choice was to merge the pipelines of AMDIM and CPC, YADIM still manages to do really well compared with other approaches.

Unlike all the related approaches, we generate the results above by actually implementing each approach ourselves. In fact, our implementation of CPC V2 is, to our knowledge, the first public implementation outside of DeepMind.

More importantly, we use PyTorch Lightning to standardize all implementations so we can objectively distill the main drivers of the above results.

Computational efficiency

The methods above are trained using huge amounts of computing resources. The prohibitive costs mean that we did not conduct a rigorous hyperparameter search but simply used the hyperparameters from STL-10 to train on ImageNet.

Using PyTorch Lightning to distribute the computations efficiently, we were able to bring an epoch through ImageNet down to about 3 minutes using 16-bit precision.

These are the compute resources we used for each approach

Key takeaways

We introduced a conceptual framework to compare and more easily design contrastive learning approaches.
AMDIM, CPC, SimCLR, Moco, BYOL, and Swav differ from each other in subtle ways. The main differences are found in how they extract representations.
AMDIM and CPC introduced the key ideas used by other approaches. SimCLR, Moco, BYOL, and Swav can be viewed as variants of AMDIM.
The choice of the encoder does not matter as long as it is wide.
The representation extraction strategy does not matter as long as the data augmentation pipeline generates good positive and negative inputs.
Using our framework we can formulate new CSL approaches. We designed YADIM (Yet Another DIM) as an example that performs on par with competing approaches.
The cost of training these approaches means that only a handful of research groups in the world can continue to make progress. However, our release of all these algorithms in a standardized way at least alleviates the issue of implementing these algorithms and verifying those implementations.
Since most of the results are driven by wider networks and specific data augmentation pipelines, we suspect the current line of research may have limited room to improve.

Acknowledgments

As noted in our paper, I’d like to thank some of the authors of CPC, AMDIM, and BYOL for helpful discussions.

Most of this work was conducted while at Facebook AI Research. The ablations and long training times would not have been possible without the FAIR compute resources.

I’d also like to thank colleagues at FAIR and NYU CILVR for helpful discussions: Stephen Roller, Margaret Li, Cinjon Resnick, Ethan Perez, Shubho Sengupta, and Soumith Chintala.

PyTorch Lightning

In addition, this work happens to have been one of the main reasons for creating PyTorch Lightning: rapid iteration on ideas using massive computing resources without getting caught up in all the engineering details required to train models at this scale.

Finally, I’d like to thank my advisors Kyunghyun Cho and Yann LeCun for their patience as I worked on this research while building PyTorch Lightning in parallel.