The Trade-offs of Large-Scale Machine Learning


Large-scale machine learning

What defines large-scale machine learning? This seemingly innocent question is often answered with petabytes of data and hundreds of GPUs. It turns out that large-scale machine learning does not have much to do with all of that. In 2013, Léon Bottou gave a class on the topic at Institut Poincaré. The class is still as relevant today as it was then. This post is a short summary of it.


The fundamental hypothesis of machine learning

Most of the recent progress in machine learning has been driven by the learning paradigm in which we train a model ƒ from existing data. We estimate ƒ using a training set and measure the final performance using the test set. The validation set is used for selecting the hyperparameters of the model.

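To make the protocol concrete, here is a minimal sketch of the three-way split; the toy dataset, the logistic-regression model, and the hyperparameter grid are placeholders chosen purely for illustration.

```python
# Minimal sketch of the train / validation / test protocol described above.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

# Split into training, validation and test sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# The training set estimates f; the validation set selects the hyperparameter.
best_C, best_val = None, -1.0
for C in (0.01, 0.1, 1.0, 10.0):
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val_acc = accuracy_score(y_val, model.predict(X_val))
    if val_acc > best_val:
        best_C, best_val = C, val_acc

# The test set measures the final performance, once, with the chosen hyperparameter.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, final_model.predict(X_test)))
```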

In practice, we proceed by taking two shortcuts.


  • Approximation error: because we cannot search through the infinite set of all possible functions F* in the universe, we work within a subspace of functions F.

  • Estimation error: because the true data distribution is unknown, we do not minimize the risk, but the empirical risk computed from the available data.


In mathematical terms:


where f′ is the best function we can find given the dataset, R(f) is the estimated risk (expected loss) at f, and R* is the minimum statistical risk (the true risk).

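One standard way to write this decomposition, following Bottou and Bousquet's companion paper and introducing f*_F (not named in the text) for the best function inside the family F, is:

```latex
% Sketch of the excess-error decomposition into approximation and estimation
% terms; f^*_F denotes the best function within the restricted family F.
\varepsilon
  = \mathbb{E}\big[ R(f') - R^{*} \big]
  = \underbrace{\mathbb{E}\big[ R(f^{*}_{F}) - R^{*} \big]}_{\text{approximation error}}
  + \underbrace{\mathbb{E}\big[ R(f') - R(f^{*}_{F}) \big]}_{\text{estimation error}}
```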

This approximation/estimation tradeoff is well-captured by the following diagram. Given a finite amount of data, we can trade approximation for estimation. As the model complexity grows, the approximation error decreases, but the estimation error increases (at a constant amount of data). The question, therefore, becomes: how complex of a model can you afford with your data?


[Figure: as model complexity grows, the approximation error decreases while the estimation error increases, for a fixed amount of data.]

In the real world, we take a third shortcut:


  • Optimization error: finding the exact minimum of the empirical risk is often costly. Since we are already minimizing a surrogate function instead of the ideal function itself, why should we care about finding its perfect minimum? We therefore settle for finding the minimum within a certain error ρ, such that:

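One way to write this condition, following the companion paper and introducing R_n for the empirical risk and a tilde for the approximate solution we actually compute, is:

```latex
% Sketch: the approximate solution only has to come within rho of the empirical
% minimum, which adds a third (optimization) term to the error decomposition.
R_n(\tilde{f}) \;\le\; R_n(f') + \rho
\qquad\Longrightarrow\qquad
\varepsilon
  = \underbrace{\mathbb{E}\big[ R(f^{*}_{F}) - R^{*} \big]}_{\text{approximation}}
  + \underbrace{\mathbb{E}\big[ R(f') - R(f^{*}_{F}) \big]}_{\text{estimation}}
  + \underbrace{\mathbb{E}\big[ R(\tilde{f}) - R(f') \big]}_{\text{optimization}}
```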

The final error is therefore composed of three components: the approximation error, the estimation error, and the optimization error. The problem becomes one of finding the optimal function space F, number of examples n, and optimization error ρ subject to a budget constraint, either on the number of examples n or on the computing time T. Léon Bottou and Olivier Bousquet develop an in-depth study of this tradeoff in The Tradeoffs of Large Scale Learning.

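Their formulation can be paraphrased as a constrained minimization (a sketch, not the paper's exact statement):

```latex
% Pick the family F, the number of examples n, and the tolerance rho so as to
% minimize the total error, under a budget on examples or on computing time.
\min_{F,\; n,\; \rho}\;
  \varepsilon_{\text{app}} + \varepsilon_{\text{est}} + \varepsilon_{\text{opt}}
\quad \text{subject to} \quad
  n \le n_{\max}
  \;\;\text{or}\;\;
  T(F, n, \rho) \le T_{\max}
```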

The fundamental difference between small-scale learning and large-scale learning lies in the budget constraint. Small-scale learning is constrained by the number of examples, while large-scale learning is constrained by computing time.


This seemingly simple definition of large-scale machine learning is quite general and powerful. While the term large-scale often triggers references to petabytes of data and thousands of GPUs, practitioners often realize that these aspects are irrelevant to the underlying constraint (computing time).


With this definition in mind, you could be working on a truly gigantic dataset such as the entire Google StreetView database and have access to a supercomputer that lets you iterate extremely fast on the full dataset, and you would still not be doing large-scale machine learning.


The constraint of time

Being constrained by time, large-scale learning induces more complex tradeoffs than small-scale learning. We need to make an optimal choice of F, n and ρ within a given time budget. Because time is the bottleneck, we can only run a limited number of experiments per day. Therefore, these choices are often made concurrently. If we choose to decrease the optimization error ρ, a constant time budget forces us to reduce either the complexity of the model or the number of examples, which in turn has adverse effects on the estimation and approximation errors.


In practice, we often proceed by sampling all possible configurations and end up with a graph like the one below. The optimal configuration depends on the computing time budget (i.e. different time budgets yield different optimal configurations).


[Figure: error of sampled configurations as a function of training time; different time budgets yield different optimal configurations.]
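As an illustration of such a sweep (not taken from the original class), here is a minimal sketch that samples configurations of model capacity, dataset size, and optimization tolerance, and stops when an illustrative time budget is exhausted; the dataset, model family, and numbers are placeholders.

```python
# Sketch of sampling (F, n, rho) configurations under a computing-time budget.
import time
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=20_000, n_features=30, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

TIME_BUDGET_S = 60.0                            # illustrative budget for the whole sweep
hidden_sizes = [(16,), (64,), (256,)]           # model capacity, standing in for F
n_examples = [2_000, 5_000, len(X_train)]       # dataset size, standing in for n
tolerances = [1e-2, 1e-4]                       # optimization tolerance, standing in for rho

results, elapsed = [], 0.0
for h, n, tol in product(hidden_sizes, n_examples, tolerances):
    if elapsed >= TIME_BUDGET_S:
        break                                   # time, not the number of examples, is the constraint
    start = time.perf_counter()
    clf = MLPClassifier(hidden_layer_sizes=h, tol=tol, max_iter=200, random_state=0)
    clf.fit(X_train[:n], y_train[:n])
    elapsed += time.perf_counter() - start
    results.append(((h, n, tol), clf.score(X_val, y_val)))

# The optimal configuration depends on how much of the budget was usable.
best_config, best_score = max(results, key=lambda r: r[1])
print(best_config, best_score, f"elapsed={elapsed:.1f}s")
```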

Focusing on the data and the task

Another striking difference between small-scale and large-scale machine learning is the focus of the effort. With small-scale machine learning, a lot of the focus is on the model and the algorithms. With large-scale machine learning, the focus shifts towards the data and the task. The time spent on the task and the data is significant and often much larger than anticipated.


Why?


  • Experiments cost more at scale (in hardware and engineering time). Therefore, the cost of working with bad data or on the wrong task is higher. In this context, it makes sense to spend extended periods of time just discussing the task or doing data cleanup. This is not such a bad thing actually. For some reason, I always feel some comfort seeing engineers and researchers discuss the task at length. Something deep inside my engineering self makes me think that these hours of discussion might save us a lot more time down the road. The sarcastic software engineering saying “weeks of coding can save hours of planning” translates, in the machine learning world, into: “weeks of training can save hours of task definition”.

  • Large-scale systems tend to be more dynamic and to interact with the real world. This, in turn, creates more opportunities for data quality issues, as well as questions about what exactly the model is trying to achieve (e.g. should we take causality effects into account?).

  • Large datasets allow for more features and more complex models. More features mean more time spent on data quality. More complex models, on the other hand, almost always translate into initially disappointing results, immediately followed by a questioning of the task the model is trying to solve (rather than a push for an even more complex model).


Focusing on the data requires thinking about which kind of data is most valuable to add. Let’s assume for instance that we are working on a multi-class classification model. Adding more data will probably make the model more accurate. However, accuracy improvements are subject to diminishing returns. Breadth improvements, on the other hand, are not: adding examples of new classes that were never seen before could improve the model significantly.


It is therefore best to focus on queries near the boundary of the known area (a technique referred to as active learning).


[Figure: active learning concentrates labeling effort on queries near the boundary of the region the model already knows.]
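A minimal sketch of one common instantiation of this idea, pool-based uncertainty sampling; the dataset, the model, and the query size are placeholders, and this is not necessarily the exact strategy the class had in mind.

```python
# Sketch of pool-based active learning: repeatedly label the pool examples
# the current model is least confident about (those near its decision boundary).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=5_000, n_features=20, n_classes=3,
                           n_informative=10, random_state=0)

labeled = list(rng.choice(len(X), size=50, replace=False))       # small seed set
pool = [i for i in range(len(X)) if i not in set(labeled)]

for _ in range(10):
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    probs = model.predict_proba(X[pool])
    uncertainty = 1.0 - probs.max(axis=1)                         # low top-class confidence
    query = [pool[i] for i in np.argsort(-uncertainty)[:25]]      # examples to send for labeling
    labeled.extend(query)
    pool = [i for i in pool if i not in set(query)]

print("labeled examples:", len(labeled))
```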

Engineering learning systems, at scale

The typical approach to solving a complex problem in large-scale machine learning is to subdivide it into smaller subproblems and to solve each of them separately. The training strategy can be either (1) training each module independently, (2) sequential training (train module n, then use its output to train module n+1), or (3) global training. Global training is harder but often better. Training neural networks for self-driving cars provides a rich example of global training at scale. Global training comes with a number of challenges, however, such as some modules training faster than others, data imbalance, and individual modules taking over the learning capacity of the whole network.

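A minimal sketch of strategies (2) and (3) on a toy two-module pipeline, written in PyTorch; the modules, the proxy objective used for sequential training, and the data are all placeholders for illustration.

```python
# Sketch: sequential vs. global (end-to-end) training of a two-module pipeline.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(1024, 16)                 # placeholder inputs
y = torch.randint(0, 2, (1024,))          # placeholder labels

loss_fn = nn.CrossEntropyLoss()

def train(params, forward, steps=200):
    opt = torch.optim.Adam(params, lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(forward(X), y).backward()
        opt.step()

# (2) Sequential training: train module n on a proxy objective, freeze it,
#     then train module n+1 on its outputs.
encoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU())    # module n
proxy_head = nn.Linear(32, 2)                            # stand-in proxy task
train(list(encoder.parameters()) + list(proxy_head.parameters()),
      lambda x: proxy_head(encoder(x)))
for p in encoder.parameters():
    p.requires_grad_(False)
head = nn.Linear(32, 2)                                   # module n+1
train(head.parameters(), lambda x: head(encoder(x)))

# (3) Global training: both modules optimized jointly on the final objective.
encoder2 = nn.Sequential(nn.Linear(16, 32), nn.ReLU())
head2 = nn.Linear(32, 2)
train(list(encoder2.parameters()) + list(head2.parameters()),
      lambda x: head2(encoder2(x)))
```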

Deep learning and transfer learning

One of the great discoveries of deep learning is how well pre-trained networks work for a task they have not been trained for. In computer vision, for instance, surprisingly good performance can be obtained using the last layers of convnets trained on ImageNet. Generic unsupervised subtasks seem to work well.

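As an illustration of reusing those pretrained features (a sketch, assuming torchvision and a placeholder downstream task with 10 classes):

```python
# Sketch of transfer by feature extraction: keep a convnet pretrained on
# ImageNet frozen and train only a small classifier on its features.
# (torchvision >= 0.13 API; older versions use pretrained=True instead of weights=...)
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = nn.Identity()           # drop the ImageNet classification layer
backbone.eval()
for p in backbone.parameters():
    p.requires_grad_(False)           # keep the pretrained features frozen

classifier = nn.Linear(512, 10)       # 512 = resnet18 feature size; 10 classes is a placeholder
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Placeholder batch standing in for a real DataLoader over your task's images.
images = torch.randn(32, 3, 224, 224)
labels = torch.randint(0, 10, (32,))

with torch.no_grad():
    features = backbone(images)       # features from the pretrained convnet
loss = loss_fn(classifier(features), labels)
loss.backward()
optimizer.step()
```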

Another formulation of this is known as transfer learning: in the vicinity of an interesting task (with expensive labels), there are often less interesting tasks (with cheap labels) that can be put to good use.


A typical example is labeling faces in a database of pictures. While the interesting task might be expensive to label (face -> name), another task might be much easier to label: are two face images of the same person? A labeled dataset can simply be constructed by observing that two faces appearing in the same image are likely to be different persons, while faces in successive frames are likely to be the same person.

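A small sketch of how such cheap pair labels could be assembled from video face detections using this heuristic; the frame and track structures are assumptions made for the example.

```python
# Sketch: build cheap same-person / different-person pairs from video frames.
# Heuristic from the text: faces co-occurring in one frame are probably different
# people; the same tracked face in successive frames is probably the same person.
from itertools import combinations

# Assumed input: per-frame face crops tagged with a track id from a face tracker.
frames = [
    [{"track": 1, "crop": "frame0_face_a"}, {"track": 2, "crop": "frame0_face_b"}],
    [{"track": 1, "crop": "frame1_face_a"}, {"track": 2, "crop": "frame1_face_b"}],
]

pairs = []  # (crop_a, crop_b, label) with 1 = same person, 0 = different persons
for prev, curr in zip(frames, frames[1:]):
    prev_by_track = {face["track"]: face for face in prev}
    for face in curr:
        if face["track"] in prev_by_track:          # same track across successive frames
            pairs.append((prev_by_track[face["track"]]["crop"], face["crop"], 1))
for frame in frames:
    for a, b in combinations(frame, 2):             # co-occurring faces in a single frame
        pairs.append((a["crop"], b["crop"], 0))

print(pairs)
```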

Solving a more complex task and transferring features often allows us to leverage more data of a different nature.


Conclusion

Large-scale machine learning has little to do with massive hardware and petabytes of data, even though these appear naturally in the process. At scale, time becomes the bottleneck and induces complex and concurrent trade-offs on the function space, the size of the dataset, and the optimization error. The focus expands from the models to the data and the task. New engineering challenges arise around distributed systems. In short, things get a lot more fun.


The original class by Leon Bottou contains a lot more material. Check it out!


Large-scale machine learning Revisited, by Leon Bottou, Big Data: theoretical and practical challenges Workshop, May 2013, Institut Henri Poincaré


Thanks to Flavian Vasile and Sergey Ivanov for reading drafts of this article.


Like what you are reading? Check out our latest publications!


Want to join the crowd? Check out our current openings:


Translated from: https://medium.com/criteo-labs/the-trade-offs-of-large-scale-machine-learning-71ad0cf7469f
