A Survey on Distributed Machine Learning

JOOST VERBRAEKEN, MATTHIJS WOLTING, JONATHAN KATZY, and JEROEN KLOPPENBURG, Delft University of Technology, Netherlands
TIM VERBELEN, imec - Ghent University, Belgium
JAN S. RELLERMEYER, Delft University of Technology, Netherlands

ACM Comput. Surv., Vol. 53, No. 2, Article 30, Publication date: March 2020.
DOI: https://doi.org/10.1145/3377454

The demand for artificial intelligence has grown significantly over the past decade, and this growth has been fueled by advances in machine learning techniques and the ability to leverage hardware acceleration. However, to increase the quality of predictions and render machine learning solutions feasible for more complex applications, a substantial amount of training data is required. Although small machine learning models can be trained with modest amounts of data, the input for training larger models such as neural networks grows exponentially with the number of parameters. Since the demand for processing training data has outpaced the increase in computation power of computing machinery, there is a need for distributing the machine learning workload across multiple machines, and turning the centralized into a distributed system. These distributed systems present new challenges: first and foremost, the efficient parallelization of the training process and the creation of a coherent model. This article provides an extensive overview of the current state-of-the-art in the field by outlining the challenges and opportunities of distributed machine learning over conventional (centralized) machine learning, discussing the techniques used for distributed machine learning, and providing an overview of the systems that are available.

CCS Concepts: • General and reference → Surveys and overviews; • Computing methodologies → Machine learning; • Computer systems organization → Distributed architectures;

Additional Key Words and Phrases: Distributed machine learning, distributed systems

ACM Reference format:
Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, and Jan S. Rellermeyer. 2020. A Survey on Distributed Machine Learning. ACM Comput. Surv. 53, 2, Article 30 (March 2020), 33 pages. https://doi.org/10.1145/3377454

1 INTRODUCTION

The rapid development of new technologies in recent years has led to an unprecedented growth of data collection. Machine Learning (ML) algorithms are increasingly being used to analyze datasets and build decision-making systems for which an algorithmic solution is not feasible due to the complexity of the problem. Examples include controlling self-driving cars [23], recognizing speech [8], or predicting consumer behavior [82].

In some cases, the long runtime of training the models steers solution designers towards using distributed systems for an increase of parallelization and total amount of I/O bandwidth, as the training data required for sophisticated applications can easily be in the order of terabytes [29]. In other cases, a centralized solution is not even an option when data are inherently distributed or too big to store on single machines. Examples include transaction processing in larger enterprises on data that are stored in different locations [19] or astronomical data that are too large to move and centralize [124].

To make these types of datasets accessible as training data for machine learning problems, algorithms have to be chosen and implemented that enable parallel computation, data distribution, and resilience to failures. A rich and diverse ecosystem of research has been conducted in this field, which we categorize and discuss in this article. In contrast to prior surveys on distributed machine learning [119, 123] or related fields [87, 121, 122, 143, 152, 170], we apply a holistic view to the problem and discuss the practical aspects of state-of-the-art machine learning from a distributed systems angle.

Section 2 provides an in-depth discussion of the system challenges of machine learning and how ideas from High Performance Computing (HPC) have been adopted for acceleration and increased scalability. Section 3 describes a reference architecture for distributed machine learning covering the entire stack from algorithms to the network communication patterns that can be employed to exchange state between individual nodes. Section 4 presents the ecosystem of the most widely used systems and libraries as well as their underlying designs. Finally, Section 5 discusses the main challenges of distributed machine learning.

2 MACHINE LEARNING—A HIGH-PERFORMANCE COMPUTING CHALLENGE?
Recent years have seen a proliferation of machine learning technology in increasingly complex applications. While various competing approaches and algorithms have emerged, the data representations used are strikingly similar in structure. The majority of computation in machine learning workloads amounts to basic transformations on vectors, matrices, or tensors—well-known problems from linear algebra. The need to optimize such operations has been a highly active area of research in the high-performance computing community for decades. As a result, some techniques and libraries from the HPC community (e.g., BLAS [89] or MPI [62]) have been successfully adopted and integrated into systems by the machine learning community. At the same time, the HPC community has identified machine learning to be an emerging high-value workload and has started to apply HPC methodology to them. Coates et al. [38] were able to train a 1B parameter network on their Commodity Off-The-Shelf High Performance Computing (COTS HPC) system in just three days. You et al. [165] optimized the training of a neural network on Intel’s Knights Landing, a chip designed for HPC applications. Kurth et al. [84] demonstrated how deep learning problems like extracting weather patterns can be optimized and scaled efficiently on large parallel HPC systems. Yan et al. [162] have addressed the challenge of scheduling deep neural network applications on cloud computing infrastructure by modeling the workload demand with techniques like lightweight profiling, which are borrowed from HPC. Li et al. [91] investigated the resilience characteristics of deep neural networks with regard to hardware errors when running on accelerators, which are frequently deployed in major HPC systems.

Like for other large-scale computational challenges, there are two fundamentally different and complementary ways of accelerating workloads: adding more resources to a single machine (vertical scaling or scaling up) and adding more nodes to the system (horizontal scaling or scaling out).

2.1 Scaling Up

Among the scale-up solutions, adding programmable GPUs is the most common method and various systematic efforts have shown the benefits of doing so [18, 78, 125]. GPUs feature a high number of hardware threads. For example, the Nvidia Titan V and Nvidia Tesla V100 have a total of 5,120 cores, which makes them approximately 47× faster for deep learning than a regular server CPU (namely an Intel Xeon E5-2690v4) [107]. Originally the applications of GPUs for machine learning were limited because GPUs used a pure SIMD (Single Instruction, Multiple Data) [51] model that did not allow the cores to execute a different branch of the code; all threads had to perform the exact same program. Over the years GPUs have shifted to more flexible architectures where the overhead of branch divergence is reduced, but diverging branches are still inefficient [66]. The proliferation of GPGPUs (General-Purpose GPUs, i.e., GPUs that can execute arbitrary code) has led the vendors to design custom products that can be added to conventional machines as accelerators and no longer fulfill any role in the graphics subsystem of the machine. For example, the Nvidia Tesla GPU series is meant for highly parallel computing and designed for deployment in supercomputers and clusters. When a sufficient degree of parallelism is offered by the workload, GPUs can significantly accelerate machine learning algorithms. For example, Meuth [100] reported a speed-up of up to 200× over conventional CPUs for an image recognition algorithm using a Pretrained Multilayer Perceptron (MLP).

An alternative to generic GPUs for acceleration is the use of Application Specific Integrated Circuits (ASICs), which implement specialized functions through a highly optimized design. In recent times, the demand for such chips has risen significantly [99]. When applied to, e.g., Bitcoin mining, ASICs have a significant competitive advantage over GPUs and CPUs due to their high performance and power efficiency [144]. Since matrix multiplications play a prominent role in many machine learning algorithms, these workloads are highly amenable to acceleration through ASICs. Google applied this concept in their Tensor Processing Unit (TPU) [128], which, as the name suggests, is an ASIC that specializes in calculations on tensors (n-dimensional arrays), and is designed to accelerate their Tensorflow [1, 2] framework, a popular building block for machine learning models. The most important component of the TPU is its Matrix Multiply unit based on a systolic array. TPUs use a MIMD (Multiple Instructions, Multiple Data) [51] architecture that, unlike GPUs, allows them to execute diverging branches efficiently. TPUs are attached to the server system through the PCI Express bus. This provides them with a direct connection to the CPU, which allows for a high aggregated bandwidth of 63 GB/s (PCI-e5x16). Multiple TPUs can be used in a data center, and the individual units can collaborate in a distributed setting. The benefit of the TPU over regular CPU/GPU setups is not only its increased processing power but also its power efficiency, which is important in large-scale applications due to the cost of energy and its limited availability in large-scale data centers. When running benchmarks, Jouppi et al. [80] found that the performance per watt of a TPU can approach 200× that of a traditional system. Further benchmarking by Sato et al. [128] indicated that the total processing power of a TPU or GPU can be up to 70× higher than that of a CPU for a typical neural network, with performance improvements varying from 3.5× to 71×, depending on the task at hand.

Chen et al. [32] developed DianNao, a hardware accelerator for large-scale neural networks with a small area footprint. Their design introduces a Neuro-Functional Unit (NFU) in a pipeline that multiplies all inputs, adds the results, and, in a staggered manner after all additions have been performed, optionally applies an activation function like a sigmoid function. The experimental evaluation using the different layers of several large neural network structures [48, 70, 90, 132, 133] shows a performance speedup of three orders of magnitude and an energy reduction of more than 20× compared to using a general-purpose 128-bit 2 GHz SIMD CPU.

Han et al. [70] address the challenge that accessing the weights of neurons from DRAM is a costly operation and can dominate the energy profile of processing. Leveraging a deep compression technique, they are able to put the weights into SRAM and accelerate the resulting sparse matrix-vector multiplications through efficient weight sharing. The result is a 2.9× higher throughput and a 19× improved energy efficiency compared to DianNao.

Even general-purpose CPUs have increased the availability and width of vector instructions in recent product generations to accelerate the processing of computationally intensive problems like machine learning algorithms. These vector instructions are part of the AVX-512 family [126], with enhanced variable precision and support for single-precision floating-point operations. In addition to the mainstream players, there are more specialized designs available such as the Epiphany [110]. This special-purpose CPU is designed with a MIMD architecture that uses an array of processors, each of which accesses the same memory, to speed up execution of floating-point operations. This is faster than giving every processor its own memory, because communication between processors is expensive. The newest chip of the manufacturer Adapteva is the Epiphany V, which contains 1,024 cores on a single chip [109]. Although Adapteva has not published power consumption specifications of the Epiphany V yet, it has released numbers suggesting a power usage of only 2 Watts [4].

2.2 Scaling Out
While there are many different strategies to increase the processing power of a single machine for large-scale machine learning, there are reasons to prefer a scale-out design or combine the two approaches, as often seen in HPC. The first reason is the generally lower equipment cost, both in terms of initial investment and maintenance. The second reason is the resilience against failures because, when a single processor fails within an HPC application, the system can still continue operating by initiating a partial recovery (e.g., based on communication-driven checkpointing [46] or partial re-computation [168]). The third reason is the increase in aggregate I/O bandwidth compared to a single machine [49]. Training ML models is a highly data-intensive task, and the ingestion of data can become a serious performance bottleneck [67]. Since every node has a dedicated I/O subsystem, scaling out is an effective technique for reducing the impact of I/O on the workload performance by effectively parallelizing the reads and writes over multiple machines. A major challenge of scaling out is that not all ML algorithms lend themselves to a distributed computing model, which can thus only be used for algorithms that can achieve a high degree of parallelism.

2.3 Discussion
The lines between traditional supercomputers, grids, and the cloud are increasingly getting blurred when it comes to the best execution environment for demanding workloads like machine learning. For instance, GPUs and accelerators are now more common in major cloud datacenters [134, 135]. As a result, parallelization of the machine learning workload has become paramount to achieving acceptable performance at large scale. When transitioning from a centralized solution to a distributed system, however, the typical challenges of distributed computing in the form of performance, scalability, failure resilience, or security apply [40]. The following section presents a systematic discussion of the different aspects of distributed machine learning and develops a reference architecture by which all existing systems can be categorized.

3 A REFERENCE ARCHITECTURE FOR DISTRIBUTED MACHINE LEARNING

Designing a generic system that enables an efficient distribution of regular machine learning is challenging, since every algorithm has a distinct communication pattern [78, 105, 127, 145, 149, 151]. Despite various different concepts and implementations for distributed machine learning, we have identified a common architectural framework that covers the entire design space. Each of the following subsections discusses a particular area where designers of machine learning solutions need to make a decision.

In general, the problem of machine learning can be separated into the training and the prediction phase (Figure 1).

Fig. 1. General overview of machine learning. During the training phase an ML model is optimized using training data and by tuning hyper parameters. Then the trained model is deployed to provide predictions for new data fed into the system.
The Training phase involves training a machine learning model by feeding it a large body of training data and updating it using an ML algorithm. An overview of applicable and commonly used algorithms is given in Section 3.1. Aside from choosing a suitable algorithm for a given problem, we also need to find an optimal set of hyperparameters for the chosen algorithm, which is described in Section 3.2. The final outcome of the training phase is a Trained Model, which can then be deployed. The Prediction phase is used for deploying the trained model in practice. The trained model accepts new data as input and produces a prediction as output. While the training phase of the model is typically computationally intensive and requires the availability of large datasets, the inference can be performed with less computing power.

The training phase and prediction phase are not mutually exclusive. Incremental learning combines the training phase and inference phase and continuously trains the model by using new data from the prediction phase.

When it comes to distribution, there are two fundamentally different ways of partitioning the problem across all machines: parallelizing the data or the model [119] (Figure 2). These two methods can also be applied simultaneously [161].

Fig. 2. Parallelism in distributed machine learning. Data parallelism trains multiple instances of the same model on different subsets of the training dataset, while model parallelism distributes parallel paths of a single model to multiple nodes.
In the Data-Parallel approach, the data are partitioned as many times as there are worker nodes in the system and all worker nodes subsequently apply the same algorithm to different datasets. The same model is available to all worker nodes (either through centralization or through replication) so a single coherent output emerges naturally. The technique can be used with every ML algorithm that makes an independent and identically distributed (i.i.d.) assumption over the data samples (i.e., most ML algorithms [161]). In the Model-Parallel approach, exact copies of the entire datasets are processed by the worker nodes, which operate on different parts of the model. The model is therefore the aggregate of all model parts. The model-parallel approach cannot automatically be applied to every machine learning algorithm, because the model parameters generally cannot be split up. One option is to train different instances of the same or similar model, and aggregate the outputs of all trained models using methodologies like ensembling (Section 3.3).
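
To make the data-parallel pattern concrete, the following is a minimal sketch in Python/NumPy, assuming a simple linear model with a squared-error loss; the toy data and all function and variable names are illustrative rather than taken from any particular framework. Each worker computes a gradient on its own shard of the training data, and the averaged gradient is applied to a single shared copy of the model so that all workers stay consistent.

import numpy as np

def local_gradient(w, X, y):
    # Squared-error gradient of a linear model on one worker's data shard.
    return X.T @ (X @ w - y) / len(y)

def data_parallel_step(w, shards, lr=0.1):
    # One data-parallel update: every worker computes a gradient on its own
    # shard (in parallel in a real deployment); the gradients are averaged
    # and applied once, so every worker keeps an identical model replica.
    grads = [local_gradient(w, X, y) for X, y in shards]
    return w - lr * np.mean(grads, axis=0)

# Toy usage: split one dataset into 4 shards ("workers") and run a few steps.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(400, 3)), rng.normal(size=400)
shards = list(zip(np.array_split(X, 4), np.array_split(y, 4)))
w = np.zeros(3)
for _ in range(100):
    w = data_parallel_step(w, shards)

In a real distributed deployment, the gradient-averaging step would be realized by the communication patterns discussed in Sections 3.4 and 3.5 (e.g., AllReduce or a parameter server) rather than by a local list comprehension.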

The final architectural decision is the topology of the distributed machine learning system. The different nodes that form the distributed system need to be connected through a specific architectural pattern to fulfill a common task. However, the choice of pattern has implications on the role that a node can play, the degree of communication between nodes, and the failure resilience of the whole deployment. A discussion of commonly used topologies is presented in Section 3.4.

In practice, the three layers of architecture (machine learning, parallelism, topology) are not independent. The combining factor is their impact on the amount of communication required to train the model, which is discussed in Section 3.5.

3.1 Machine Learning Algorithms
ML algorithms learn to make decisions or predictions based on data. We categorize current ML algorithms based on the following three characteristics:

Feedback—the type of feedback that is given to the algorithm while learning.
Purpose—the desired end result of the algorithm.
Method—the nature of model evolution that occurs when given feedback.
3.1.1 Feedback. To train an algorithm, it requires feedback so it can gradually improve the quality of the model. There are several different types of feedback [164]:

Supervised learning uses training data that consist of input objects (usually vectors) and the corresponding desired output values. Supervised learning algorithms attempt to find a function that maps the input data to the desired output. Then, this function can be applied to new input data to predict the output. One of the goals is to minimize both the bias and variance error of the predicted results. The bias error is caused by simplifying assumptions made by the learning algorithm to facilitate learning the target function. However, methods with high bias have lower predictive performance on problems that do not fully satisfy the assumptions. For example, a linear model will not be able to give accurate predictions if the underlying data have a non-linear behavior. The variance captures how much the results of the ML algorithm change for a different training set. A high variance means that the algorithm is modeling the specifics of the training data without finding the underlying (hidden) mapping between the inputs and the outputs. Unfortunately, eliminating both the bias and the variance is typically impossible, a phenomenon known as the bias-variance trade-off [54]. The more complex the model, the more training data are required to train the algorithm to gain an accurate prediction from the model. For example, when the dimensionality of the data is high, the output may depend on a convoluted combination of input factors, which requires a high number of data samples to detect the relations between these dimensions.
Unsupervised learning uses training data that consist of input objects (usually vectors) without output values. Unsupervised learning algorithms aim at finding a function that describes the structure of the data and group the unsorted input data. Because the input data are unlabeled, they lack a clear output accuracy metric. The most common use case of unsupervised learning is to cluster data together based on similarities and hidden patterns. Unsupervised learning is also used for problems like dimensionality reduction where the key features of data are extracted. In this case, the feedback is generated using a similarity metric.
Semi-supervised learning uses a (generally small) amount of labeled data, supplemented by a comparatively large amount of unlabeled data. Clustering can be used to extrapolate known labels onto unlabeled data points. This is done under the assumption that similar data points share the same label.
Reinforcement learning is used to train an agent that has to take actions in an environment based on its observations. Feedback relies on a reward or cost function that evaluates the states of the system. The biggest challenge here is the credit assignment problem, or how to determine which actions actually lead to higher reward in the long run. Bagnell and Ng [13] showed that a local reward system is beneficial for the scalability of the learning problem, since global schemes require samples that scale roughly linearly with the number of participating nodes.
3.1.2 Purpose. ML algorithms can be used for a wide variety of purposes, such as classifying an image or predicting the probability of an event. They are often used for the following tasks [85]:

Anomaly detection is used to identify data samples that differ significantly from the majority of the data. These anomalies, which are also called outliers, are used in a wide range of applications including video surveillance, fraud detection in credit card transactions, or health monitoring with on-body sensors.
Classification is the problem of categorizing unknown data points into categories seen during training. This is an inherently supervised process; the unsupervised equivalent of classification is clustering.
Clustering groups data points that are similar according to a given metric. Small datasets can be clustered by manually labeling every instance, but for larger datasets that might be infeasible, which justifies the need for automatic labeling the instances (namely, clustering).
Dimensionality reduction is the problem of reducing the number of variables in the input data. This can either be achieved by selecting only relevant variables (feature selection), or by creating new variables that represent multiple others (feature extraction).
Representation learning attempts to find proper representations of input data for, e.g., feature detection, classification, clustering, encoding, or matrix factorization. This often also implies a dimensionality reduction.
Regression is the problem of estimating how a so-called dependent variable changes in value when other variables change with a certain amount.
3.1.3 Method. Every effective ML algorithm needs a method that forces the algorithm to improve itself based on new input data so it can improve its accuracy. We identify five different groups of ML methods that distinguish themselves through the way the algorithm learns:

Evolutionary Algorithms (EAs) [57] (and specifically Genetic algorithms) learn iteratively based on evolution. The model that actually solves the problem is represented by a set of properties, called its genotype. The performance of the model is measured using a score, calculated using a fitness function. After calculating the fitness score of all generated models, the next iteration creates new genotypes based on mutation and crossover of models that produce more accurate estimates. Genetic algorithms can be used to create other algorithms, such as neural networks, belief networks, decision trees, and rule sets.
Stochastic Gradient Descent (SGD)–based algorithms minimize a loss function defined on the outputs of the model by adapting the model’s parameters in the direction of the negative gradient (the multi-variable derivative of a function). The gradient descent is called stochastic, as the gradient is calculated from a randomly sampled subset of the training data. The loss function is typically a proxy for the actual error to be minimized; for example, the mean squared error between the model outputs and desired outputs in the case of a regression problem, or the negative log likelihood of the ground truth class according to the model in the case of classification. The typical training procedure then becomes:
(1) Present a batch of randomly sampled training data.
(2) Calculate the loss function of the model output and the desired output.
(3) Calculate the gradient with respect to the model parameters.
(4) Adjust the model parameters in the direction of the negative gradient, multiplied by a chosen learning rate.
(5) Repeat.
SGD is the most commonly used training method for a variety of ML models; a minimal sketch of this training loop is given at the end of this subsection.
Support Vector Machines (SVMs) map data points to high-dimensional vectors for classification and clustering purposes. For data points in a p-dimensional space, a (p-1)-dimensional hyperplane can be used as a classifier. A reasonable choice would be the hyperplane that properly separates the data points in two groups based on their labels by the largest possible margin. Sometimes special transformation equations (called kernels) are used to transform all data points to a different representation, in which it is easier to find such a hyperplane.
Perceptrons [104] are binary classifiers that label input vectors as “active” or “inactive.” A perceptron assigns a weight to all inputs and then sums over the products of these weights and their input. The outcome of this is compared to a threshold to determine the label. Perceptron-based algorithms commonly use the entire batch of training data in their attempt to find a solution that is optimal for the whole set. They are binary, and therefore primarily used for binary classification.
Artificial Neural Networks (ANNs) are perceptron-based systems that consist of multiple layers: an input layer, one or more hidden layers, and an output layer. Each layer consists of nodes connected to the previous and next layers through edges with associated weights (usually called synapses). Unlike regular perceptrons, these nodes usually apply an activation function on the output to introduce non-linearities.
The model is defined by the state of the entire network and can be changed by altering (1) the weights of the synapses, (2) the layout of the network, or (3) the activation function of nodes.
Because neural networks require a large number of nodes, the understandability of a neural network's decision process is low compared to, e.g., decision trees.
Neural networks are extensively studied because of their ability to analyze enormous sets of data. They can be categorized into several subgroups based on network layout:
Deep Neural Networks (DNNs), are artificial neural networks that have many hidden layers. This allows the neural network to learn hierarchical feature abstractions of the data, with increasing abstraction the deeper you go in the network.
Convolutional Neural Networks (CNNs/ConvNets) are deep, feed-forward neural networks that use convolution layers with nodes connected to only a few nodes in the previous layer. These values are then pooled using pooling layers. It can be seen as a way of recognizing abstract features in the data. The convolution makes the network consider only local data. This makes the represented algorithms spatially invariant, which is why they are sometimes called Space Invariant Artificial Neural Networks (SIANN). Chaining multiple of these convolution and pooling layers together can make the network capable of recognizing complicated constructs in big datasets. Examples of this are cats in images or the contextual meaning of a sentence in a paragraph.
Recurrent Neural Networks (RNNs) keep track of a temporal state in addition to weights, which means that previous inputs of the network influence its current decisions. Recurrent synapses give the network a memory. This can help with discovering temporal patterns in data. Blocks of nodes in recurrent networks operate as cells with distinct memories and can store information for an arbitrarily long timespan.
Hopfield Networks are a type of non-reflexive, symmetric recurrent neural network that have an energy related to every state of the network as a whole. They are guaranteed to converge on a local minimum after some number of network updates.
Self-Organizing Maps (SOMs)/Self-Organizing Feature Maps (SOFMs) are neural networks that learn through unsupervised competitive learning, in which nodes compete for access to specific inputs. This causes the nodes to become highly specialized, which reduces redundancy. The iterations effectively move the map closer to the training data, which is the reason for its name. Some subtypes include the Time Adaptive Self-Organizing Map (TASOM, automatically adjust the learning rate and neighborhood size of each neuron independently), Binary Tree TASOM (BTASOM, tree of TASOM networks), and Growing Self-Organizing map (GSOM, identify a suitable map size in the SOM by starting with a minimal set of nodes and growing the map by heuristically adding new nodes at the periphery).
Stochastic Neural Networks make use of stochastic transfer functions or stochastic weights, which allows them to escape the local minima that impede the convergence to a global minimum of normal neural networks. An example is a Boltzmann machine where each neuron output is represented as a binary value and the likelihood of the neuron firing depends on the network of other neurons.
Auto-encoders are a type of neural network that are trained specifically to encode and decode data. Since auto-encoders are trained to perform decoding separately from encoding, the encoded version of the data is a form of dimensionality reduction of the data.
Generative Adversarial Networks (GAN) are generative models that are trained using a minimax game between a generator and discriminator network [58]. The goal is to train a neural network to generate data from a training set distribution. To achieve this, a discriminator neural network is trained at the same time to learn to discriminate between real dataset samples and generated samples by the generator. The discriminator is trained to minimize the classification errors, whereas the generator is trained to maximize the classification errors, in effect generating data that are indistinguishable from the real data.
Rule-Based Machine Learning (RBML) Algorithms [156] use a set of rules that each represent a small part of the problem. These rules usually express a condition, as well as a value for when that condition is met. Because of the clear if-then relation, rules lend themselves to simple interpretation compared to more abstract types of ML algorithms, such as neural networks.
Association Rule Learning is a rule-based machine learning method that focuses on finding relations between different variables in datasets. Example relatedness metrics are Support (how often variables appear together), Confidence (how often a causal rule is true), and Collective Strength (inverse likelihood of the current data distribution if a given rule does not exist).
Decision Trees, sometimes called “CART” trees (after Classification And Regression Trees), use rule-based machine learning to create a set of rules and decision branches. Traversing the tree involves applying the rules at each step until a leaf of the tree is reached. This leaf represents the decision or classification for that input.
Topic Models (TM) [21] are statistical models for finding and mapping semantic structures in large and unstructured collections of data, most often applied on text data.
Latent Dirichlet Allocation [22] constructs a mapping between documents and a probabilistic set of topics using the assumption that documents have few different topics and that those topics use few different words. It is used to learn what unstructured documents are about based on a few keywords.
Latent Semantic Analysis (LSA)/Latent Semantic Indexing (LSI) creates a big matrix of documents and topics in an attempt to classify documents or to find relations between topics. LSA/LSI assumes a Gaussian distribution for topics and documents. LSA/LSI does not have a way of dealing with words that have multiple meanings.
Naive Bayes Classifiers are relatively simple probabilistic classifiers that assume different features to be independent. They can be trained quickly using supervised learning but are less accurate than more complicated approaches.
Probabilistic Latent Semantic Analysis (PLSA)/Probabilistic Latent Semantic Indexing (PLSI) is the same as LSA/LSI, except that PLSA/PLSI assumes a Poisson distribution for topics and documents instead of the Gaussian distribution that is assumed by LSA/LSI. The reason is that a Poisson distribution appears to model the real world better [72]. Some subtypes include Multinomial Asymmetric Hierarchical Analysis (MASHA), Hierarchical Probabilistic Latent Semantic Analysis (HPLSA), and Latent Dirichlet Allocation (LDA).
Matrix Factorization algorithms can be applied for identifying latent factors or finding missing values in matrix-structured data. For example, many recommender systems are based on matrix factorization of the User-Item Rating Matrix to find new items users might be interested in, given their rating on other items [83]. Similarly, factorizing a Drug compound-Target Protein Matrix is used for new drug discovery [63]. As this problem scales with O(F^3), with F the dimensionality of the features, recent research focuses on scaling these methods to larger feature dimensions [142].
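
As a concrete illustration of the procedure listed under Stochastic Gradient Descent above (steps (1) through (5)), the following minimal Python/NumPy sketch trains a toy linear model; the dataset, loss, batch size, and learning rate are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(42)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)  # toy training data
w = np.zeros(5)                                           # model parameters
lr, batch_size = 0.05, 32                                 # chosen hyperparameters

for step in range(500):
    # (1) Present a batch of randomly sampled training data.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    # (2) Calculate the loss between the model output and the desired output
    #     (here: mean squared error, acting as the proxy loss).
    loss = np.mean((Xb @ w - yb) ** 2)
    # (3) Calculate the gradient of the loss with respect to the model parameters.
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size
    # (4) Adjust the parameters in the direction of the negative gradient,
    #     multiplied by the chosen learning rate.
    w -= lr * grad
    # (5) Repeat with the next randomly sampled batch.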
3.2 Hyperparameter Optimization
The performances of many of the algorithms presented in the previous sections are largely impacted by the choice of a multitude of algorithm hyperparameters. For example, in stochastic gradient descent, one has to choose the batch size, the learning rate, the initialization of the model, and so on. Often, the optimal values of these hyperparameters are different for each problem domain, ML model, and dataset.

There are several algorithms that can be used to automatically optimize the parameters of the machine learning algorithms and that can be re-used across different ML algorithm families.

These include:

First-order algorithms that use at least one first-derivative of the function that maps the parameter value to the accuracy of the ML algorithm using that parameter. Examples are stochastic gradient descent (SGD) [24], stochastic dual coordinate ascent [136], or conjugate gradient methods [42, 69].
Second-order techniques that use any second-derivative of the function that maps the parameter value to the accuracy of the ML algorithm using that parameter. Examples are Newton’s method [120] (which requires computing the Hessian matrix, and is therefore generally infeasible), Quasi-Newton methods [28] (which approximate Newton’s method by updating the Hessian by analyzing successive gradient vectors instead of recomputing the Hessian in every iteration), or L-BFGS [95].
Coordinate descent [158] (also called coordinate-wise minimization), which minimizes at each iteration a single variable while keeping all other variables at their value of the current iteration.
The Markov-Chain Monte-Carlo method [26], which works by successively proposing new parameters randomly drawn from a multivariate normal distribution centered on the old parameters, and accepting these new parameters with a probability that depends on the likelihood of the old and the new parameters.
A naive but often-used strategy is grid search, which exhaustively evaluates every combination in a grid of potential values for each hyperparameter [88].
Random search uses randomly chosen trials for sampling hyperparameter values, which often yields better results in terms of efficiency compared to grid search, finding better parameter values for the same compute budget [17] (see the sketch after this list).
Bayesian hyperparameter optimization techniques use the Bayesian framework to iteratively sample hyperparameter values [146]. These model each trial as a sample from a Gaussian process (GP) and use the GP to choose the most informative samples in the next trial.
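
To contrast the grid search and random search strategies mentioned above, the sketch below evaluates a stand-in objective once over a fixed grid and once with the same number of randomly drawn trials; the objective function, parameter names, and value ranges are purely illustrative.

import itertools
import random

def validation_score(lr, reg):
    # Stand-in for training a model with these hyperparameters and
    # returning its validation accuracy.
    return 1.0 - abs(lr - 0.037) - abs(reg - 0.002)

# Grid search: exhaustively evaluate every combination on a fixed grid.
lrs, regs = [0.001, 0.01, 0.1], [0.0001, 0.001, 0.01]
grid_best = max(itertools.product(lrs, regs), key=lambda p: validation_score(*p))

# Random search: the same budget of 9 trials, but values drawn at random from
# continuous (log-scaled) ranges, which often finds better values per trial [17].
random.seed(0)
trials = [(10 ** random.uniform(-3, -1), 10 ** random.uniform(-4, -2)) for _ in range(9)]
random_best = max(trials, key=lambda p: validation_score(*p))

print("grid search best:  ", grid_best)
print("random search best:", random_best)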
3.3 Combining Multiple Algorithms: Ensemble Methods
For some applications, a single model is not accurate enough to solve the problem. To alleviate this issue, multiple models can be combined in so-called Ensemble Learning. For example, when machine learning algorithms are performed on inherently distributed data sources and centralization is thus not an option, the setup requires training to happen in two separate stages: first in the local sites where the data are stored, and second in the global site that aggregates over the individual results of the first stage [77]. This aggregation can be achieved by applying ensemble methods in the global site.

Various different ways exist to perform ensembling, such as [50]:

Bagging is the process of building multiple classifiers and combining them into one (a minimal sketch of this kind of aggregation follows this list).
Boosting is the process of training new models with the data that are misclassified by the previous models.
Bucketing is the process of training many different models and eventually selecting the one that has the best performance.
Random Forests [25] use multiple decision trees and averaging the prediction made by the individual trees to increase the overall accuracy. Different trees are given the same “voting power.”
Stacking is when multiple classifiers are trained on the dataset, and one new classifier uses the output of the other classifiers as input in an attempt to reduce the variance.
Learning Classifier Systems (LCSs) is a modular system of learning approaches. An LCS iterates over data points from the dataset, completing the entire learning process in each iteration. The main idea is that an LCS has a limited number of rules. A genetic algorithm forces suboptimal rules out of the rule set. There are many different attributes that can drastically change the performance of an LCS depending on the dataset, including the Michigan-style vs. Pittsburgh-style architecture [113], supervised vs. reinforcement learning [81], incremental vs. batch learning [37], online vs. offline training, strength-based vs. accuracy-based [157], and complete mapping vs. best mapping.
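
A minimal sketch of the simplest form of such aggregation, majority voting over models that were trained independently (e.g., one per local data site), is shown below; the toy classifier and data are placeholders used only for illustration.

import numpy as np

class ThresholdClassifier:
    # Toy stand-in for a model trained at one local site.
    def __init__(self, threshold):
        self.threshold = threshold
    def predict(self, X):
        return (X[:, 0] > self.threshold).astype(int)

def ensemble_predict(models, X):
    # Bagging-style aggregation: every independently trained model casts a
    # vote per sample and the ensemble returns the majority class.
    votes = np.stack([m.predict(X) for m in models])   # shape: (n_models, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)

# Toy usage: three "local" models combined at the global site.
models = [ThresholdClassifier(t) for t in (0.4, 0.5, 0.6)]
X_new = np.random.default_rng(1).normal(0.5, 0.2, size=(10, 1))
print(ensemble_predict(models, X_new))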
3.4 Topologies
Another consideration for the design of a distributed machine learning deployment is the structure in which the computers within the cluster are organized. A deciding factor for the topology is the degree of distribution that the system is designed to implement. Figure 3 shows four possible topologies, in accordance with the general taxonomy of distributed communication networks by Baran [15].

Fig. 3. Distributed machine learning topologies based on the degree of distribution.
Centralized systems (Figure 3(a)) employ a strictly hierarchical approach to aggregation, which happens in a single central location. Decentralized systems allow for intermediate aggregation, either with a replicated model that is consistently updated when the aggregate is broadcast to all nodes, such as in tree topologies (Figure 3(b)), or with a partitioned model that is sharded over multiple parameter servers (Figure 3(c)). Fully distributed systems (Figure 3(d)) consist of a network of independent nodes that ensemble the solution together and where no specific roles are assigned to certain nodes.

There are several distinct topologies that have become popular choices for distributed machine learning clusters:

Trees. Tree-like topologies have the advantage that they are easy to scale and manage, as each node only has to communicate with its parent and child nodes. For example, in the AllReduce [5] paradigm, nodes in a tree accumulate their local gradients with those from their children and pass this sum to their parent node to calculate a global gradient.
Rings. In situations where the communication system does not provide efficient support for broadcast or where communication overhead needs to be kept to a minimum, ring topologies for AllReduce patterns simplify the structure by only requiring neighbor nodes to synchronize through messages. This is, e.g., commonly used between multiple GPUs on the same machine [76].
Parameter Server. The Parameter Server paradigm (PS) [155] uses a decentralized set of workers with a centralized set of masters that maintain the shared state. All model parameters are stored in a shard on each parameter server, from which all clients read and write as a key-value store (a minimal sketch of this pull/push interaction is given after this list). An advantage is that all model parameters (within a shard) are in a global shared memory, which makes it easy to inspect the model. A disadvantage of the topology is that the parameter servers can form a bottleneck, because they are handling all communication. To partially alleviate this issue, the techniques for bridging computation and communication mentioned in Section 3.5.2 are used.
Peer-to-Peer. In contrast to centralized state, in the fully distributed model, every node has its own copy of the parameters and the workers communicate directly with each other. This has the advantage of typically higher scalability than a centralized model and the elimination of single points of failure in the system [52]. An example implementation of this model is a peer-to-peer network, in which nodes broadcast updates to all other nodes to form a data-parallel processing framework. Since full broadcast is typically prohibitive due to the volume of communication, Sufficient Factor Broadcasting (SFB) [94] has been proposed to reduce the communication overhead. The parameter matrix in SFB is decomposed into so-called sufficient factors, i.e., two vectors that are sufficient to reconstruct the update matrix. SFB only broadcasts these sufficient factors and lets the workers reconstruct the updates. Other models limit the degree of communication to less-frequent synchronization points while allowing the individual models to temporarily diverge. Gossip Learning [138] is built around the idea that models are mobile and perform independent random walks through the peer-to-peer network. Since this forms a data- and model-parallel processing framework, the models evolve differently and need to be combined through ensembling. In Gossip Learning, this happens continuously on the nodes by combining the current model with a limited cache of previous visitors.
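
The sketch below simulates, within a single Python process, the pull/push interaction between workers and one parameter-server shard with synchronous gradient averaging; the class and function names are illustrative and not taken from any existing parameter-server implementation.

import numpy as np

class ParameterServer:
    # Minimal key-value-style store holding one shard of the model parameters.
    def __init__(self, dim):
        self.params = np.zeros(dim)
    def pull(self):
        # Workers read the current parameters.
        return self.params.copy()
    def push(self, gradients, lr=0.1):
        # The server aggregates the workers' gradients and updates the shared state.
        self.params -= lr * np.mean(gradients, axis=0)

def worker_gradient(params, X, y):
    # Each worker computes a gradient on its local shard of the data.
    return X.T @ (X @ params - y) / len(y)

# Toy synchronous training with one server shard and three workers.
rng = np.random.default_rng(0)
server = ParameterServer(dim=4)
shards = [(rng.normal(size=(50, 4)), rng.normal(size=50)) for _ in range(3)]
for _ in range(200):
    w = server.pull()                                      # workers pull the model
    grads = [worker_gradient(w, X, y) for X, y in shards]  # computed in parallel in practice
    server.push(grads)                                     # server aggregates and updates

In a real deployment the pull and push calls would be network requests to the parameter-server shards, and the degree of synchronization between them is governed by the bridging models discussed in Section 3.5.2.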
3.5 Communication
As previously discussed, the need for more sophisticated machine learning-based setups quickly outgrows the capabilities of a single machine. There are several ways to partition the data and/or the program and to distribute these evenly across all machines. The choice of distribution, however, has direct implications on the amount of communication required to train the model.

3.5.1 Computation Time vs. Communication vs. Accuracy. When Distributed Machine Learning is used, one aims for the best accuracy at the lowest computation and communication cost. However, for complex ML problems, the accuracy usually increases with processing more training data, and sometimes by increasing the ML model size, hence increasing the computation cost. Parallelizing the learning can reduce computation time, as long as the communication costs are not becoming dominant. This can become a problem if the model being trained is not sufficiently large in comparison to the data. If the data are already distributed (e.g., cloud-native data), then there is no alternative to either moving the data or the computation.

Splitting up the dataset across different machines and training a separate model on a separate part of the dataset avoids communication, but this reduces the accuracy of the individual models trained on each machine. By ensembling all these models, the overall accuracy can be improved. However, the computation time is typically not much lower, since the individual models still have to take the same number of model update steps to converge.

By already synchronizing the different models during training (e.g., by combining the calculated gradients on all machines in case of gradient descent), the computation time can be reduced by converging faster to a local optimum. This, however, leads to an increase of communication cost as the model size increases.

Therefore, practical deployments require seeking the amount of communication needed to achieve the desired accuracy within an acceptable computation time.

3.5.2 Bridging Computation and Communication. To schedule and balance the workload, there are three concerns that have to be taken into account [161]:

Identifying which tasks can be executed in parallel.
Deciding the task execution order.
Ensuring a balanced load distribution across the available machines.
After deciding on these three issues, the information between nodes should be communicated as efficiently as possible. There are several techniques that enable the interleaving of parallel computation and inter-worker communication. These techniques trade off fast and correct model convergence (towards the top of the following list) against faster and fresher updates (towards the bottom of the list).

Bulk Synchronous Parallel (BSP) is the simplest model in which programs ensure consistency by synchronizing between each computation and communication phase [161]. An example of a program following the BSP bridging model is MapReduce.
An advantage is that serializable BSP ML programs are guaranteed to output a correct solution. A disadvantage is that finished workers must wait at every synchronization barrier until all other workers are finished, which results in overhead in the event of some workers progressing slower than others [34].
Stale Synchronous Parallel (SSP) relaxes the synchronization overhead by allowing the faster workers to move ahead for a certain number of iterations. If this bound is exceeded, then the faster workers are paused until the slowest ones catch up (a minimal sketch of this staleness check follows this list). Workers operate on cached versions of the data and only commit changes at the end of a task cycle, which can cause other workers to operate on stale data. The main advantage of SSP is that it still enjoys strong model convergence guarantees. A disadvantage, however, is that when the staleness becomes too high (e.g., when a significant number of machines slows down), the convergence rates quickly deteriorate. The mechanism is comparable to conits [166], used in distributed systems, which likewise bound how inconsistent the data that workers operate on may become.
Approximate Synchronous Parallel (ASP) limits how inaccurate a parameter can be. This contrasts with SSP, which limits how stale a parameter can be. An advantage is that, whenever an aggregated update is insignificant, the server can delay synchronization indefinitely. A disadvantage is that it can be hard to choose the parameter that defines which updates are significant and which are not [73].
Barrierless Asynchronous Parallel [65]/Total Asynchronous Parallel [73] (BAP/TAP) lets worker machines communicate in parallel without waiting for each other. The advantage is that it usually obtains the highest possible speedup. A disadvantage is that the model can converge slowly or even develop incorrectly, because, unlike BSP and SSP, the error grows with the delay [65].
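The following minimal sketch, with illustrative names only, shows the staleness check that distinguishes these bridging models: a staleness bound of 0 recovers BSP, while an unbounded value behaves like BAP/TAP.

    # Illustrative staleness check as used by a Stale Synchronous Parallel coordinator.
    # With staleness_bound = 0 this degenerates to BSP; with an unbounded value it
    # behaves like barrierless (BAP/TAP) execution.
    def may_proceed(worker_clock, all_worker_clocks, staleness_bound):
        slowest = min(all_worker_clocks)
        return worker_clock - slowest <= staleness_bound

    clocks = {"w0": 12, "w1": 10, "w2": 11}          # iterations completed per worker
    print(may_proceed(clocks["w0"], clocks.values(), staleness_bound=2))   # True
    print(may_proceed(clocks["w0"], clocks.values(), staleness_bound=1))   # False: w0 must wait for w1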
3.5.3 Communication Strategies. Communication is an important contributor to defining the performance and scalability of distributed processing [27]. Several communication management strategies [161] are used to spread and reduce the amount of data exchanged between machines:

To prevent bursts of communication over the network (e.g., after a mapper is finished), continuous communication is used, such as in the state-of-the-art implementation Bösen [155].
Neural networks are composed out of layers, the training of which (using the back-propagation gradient descent algorithm) is highly sequential. Because the top layers of neural networks contain the most parameters while accounting for only a small part of the total computation, Wait-free Backpropagation (WFBP) [171] was proposed. WFBP exploits the neural network structure by sending out the parameter updates of the top layers while still computing the updates for the lower layers, hence hiding most of the communication latency (a minimal sketch of this overlap follows this list).
Because WFBP does not reduce the communication overhead, hybrid communication (HybComm) [171] was proposed. Effectively, it combines Parameter Servers (PS) [155] with Sufficient Factor Broadcasting (SFB) [159], choosing the best communication method depending on the sparsity of the parameter tensor. See below for more information about PS (under Centralized Storage) and SFB (under Decentralized Storage).
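As an illustration of the overlap that WFBP exploits, the sketch below starts a non-blocking all-reduce for each layer's gradient as soon as it is available and only waits once the whole backward pass has finished. It assumes mpi4py on top of an MPI-3 installation; the layer sizes and the backward() function are placeholders.

    # Wait-free back-propagation style overlap (sketch): communicate the gradient of
    # an upper layer while the gradients of the lower layers are still being computed.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    def backward(layer):
        # placeholder for the real per-layer gradient computation
        return np.random.randn(layer["size"])

    layers = [{"name": "fc2", "size": 4096}, {"name": "fc1", "size": 4096}, {"name": "conv", "size": 256}]

    requests, grads, reduced = [], [], []
    for layer in layers:                       # top-down order of back-propagation
        grad = backward(layer)
        out = np.empty_like(grad)
        grads.append(grad)                     # keep send buffers alive until Waitall
        reduced.append(out)
        requests.append(comm.Iallreduce(grad, out, op=MPI.SUM))   # communication starts immediately

    MPI.Request.Waitall(requests)              # wait only once, after all layers are done
    averaged = [g / comm.Get_size() for g in reduced]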
3.6 Discussion
While machine learning and artificial intelligence are disciplines with a long history in computer science, recent advancements in technology have caused certain areas like neural networks to experience unprecedented popularity and impact on novel applications. As with many emerging topics, functionality has been the primary concern, and the non-functional aspects have only played a secondary role in the discussion of the technology. As a result, the community has only a preliminary understanding of how distributed ML algorithms and systems behave as a workload and which classes of problems have a higher affinity to a certain methodology when considering performance or efficiency.

However, as with similar topics like big data analytics, systems aspects are increasingly becoming more important as the technology matures and consumers become more mindful about resource consumption and return on investment. This has caused ML algorithms and systems to be increasingly co-designed, i.e., adapting algorithms to make better use of systems resources and designing novel systems that support certain classes of algorithms better. We expect this trend to continue and accelerate, eventually leading to a new wave of distributed machine learning systems that are more autonomous in their ability to optimize computation and distribution for given hardware resources. This would significantly lower the burden of adopting distributed machine learning, in the same way that popular libraries have democratized machine learning in general by raising the level of abstraction from numerical computing to a simple and approachable templated programming style, or similar to the way that paradigms like MapReduce [44] have made processing of large datasets accessible.

4 THE DISTRIBUTED MACHINE LEARNING ECOSYSTEM
The problem of processing a large volume of data on a cluster of machines is not restricted to machine learning but has been studied for a long time in distributed systems and database research. As a result, some practical implementations use general-purpose distributed platforms as the foundation for distributed machine learning. Popular frameworks like Apache Spark [168, 169] have seized the opportunity of machine learning being an emerging workload and now provide optimized libraries (e.g., MLlib [98]). On the other end of the spectrum, purpose-built machine learning libraries that were originally designed to run on a single machine have started to receive support for execution in a distributed setting. For instance, the popular library Keras [35] received backends to run atop Google's Tensorflow [1] and Microsoft's CNTK [129]. Nvidia extended their machine learning stack with their Collective Communications Library (NCCL) [106], which was originally designed to support multiple GPUs on the same machine, but version 2 introduced the ability to run on multiple nodes [76]. The center of this ecosystem (Figure 4) is inhabited by systems natively built for distributed machine learning and designed around a specific algorithmic and operational model, e.g., Distributed Ensemble Learning, Parallel Synchronous Stochastic Gradient Descent (SGD), or Parameter Servers. While the majority of these systems are intended to be set up and operated by the user on-premises, there is an increasingly large diversity of machine learning services offered through a cloud delivery model, many centered around established distributed machine learning systems enhanced by a surrounding platform that makes the technology more consumable for data scientists and decision makers.

Fig. 4. Distributed machine learning ecosystem. Both general-purpose distributed frameworks and single-machine ML systems and libraries are converging towards distributed machine learning. Cloud emerges as a new delivery model for ML.
4.1 General Purpose Distributed Computing Frameworks
Distributed systems for processing massive amounts of data largely rely on utilizing a number of commodity servers, each of them with a relatively small storage capacity and computing power, rather than one expensive large server. This strategy has proven more affordable compared to using more expensive specialized hardware, as long as sufficient fault tolerance is built into the software, a concept that Google has pioneered [16] and that has increasingly found traction in the industry. Furthermore, the scale-out model offers a higher aggregate I/O bandwidth compared to using a smaller number of more powerful machines, since every node comes with its own I/O subsystem. This can be highly beneficial in data-intensive applications where data ingestion is a significant part of the workload [116].

4.1.1 Storage. The storage layer of existing frameworks is commonly based on the Google File System (GFS) [55] or comparable implementations. GFS is owned by and used within Google to handle all big data storage needs in the company. GFS splits up the data that are uploaded to the cluster into chunks, which are then distributed over the chunk servers. The chunks are replicated (the degree of replication is configurable and the default is three-way [55]) to protect the data from becoming unavailable in the event of machine failures. The data on the chunk servers can then be accessed by a user through contacting the master, which serves as a name node and provides the locations for every chunk of a file. The GFS architecture was adopted by an open-source framework called Hadoop [103], which was initially developed by Yahoo! and is now open source and maintained at the Apache Foundation. Its storage layer, named Hadoop File System or HDFS [141], started off as essentially a copy of the GFS design with only minor differences in nomenclature.

4.1.2 Compute. While the storage architecture has essentially converged to a block-based model, there exist many competing frameworks for scheduling and distributing tasks to compute resources with different features and trade-offs.

MapReduce is a framework (and underlying architecture) for processing data that was developed by Google [44] to process data in a distributed setting. The architecture consists of multiple phases and borrows concepts from functional programming. First, all data are split into tuples (called key-value pairs) during the map phase. This is comparable to a mapping of a second-order function to a set in functional programming. The map phase can be executed fully parallel, since there are no data dependencies between mapping a function to two different values in the set. Then, during the shuffle phase, these tuples are exchanged between nodes and passed on. This is strictly necessary, since aggregation generally has data dependencies and it has to be ensured that all tuples belonging to the same key are processed by the same node for correctness. In the subsequent reduce phase, the aggregation is performed on the tuples to generate a single output value per key. This is similar to a fold operation in functional programming, which rolls up a collection using a second-order function that produces a single result value. Fold, however, cannot be parallelized, since every fold step depends on the previous step. Shuffling the data and reducing by key is the enabler of parallelism in the reduce phase.

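A tiny in-memory imitation of the three phases may help to make this data flow concrete; the functions below mirror the roles described above (here counting how often each class label occurs in a training set) and are not tied to any particular framework's API.

    # In-memory imitation of the map, shuffle, and reduce phases of MapReduce.
    from collections import defaultdict

    records = [("img1", "cat"), ("img2", "dog"), ("img3", "cat"), ("img4", "bird")]

    def map_phase(record):                 # emit key-value pairs; no cross-record dependency
        _, label = record
        return [(label, 1)]

    def shuffle(pairs):                    # group values of the same key on the same "node"
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):         # aggregate all values belonging to one key
        return key, sum(values)

    mapped = [pair for record in records for pair in map_phase(record)]
    result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
    print(result)                          # {'cat': 2, 'dog': 1, 'bird': 1}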
The main benefit of this framework is that the data can be distributed across a large number of machines while tasks of the same phase have no data dependencies and can therefore be executed entirely in parallel. Those same machines can be nodes in a GFS (or similar) storage cluster, so instead of moving data to the program, the program can be moved to the data for an increase of data locality and better performance. The program is usually several orders of magnitude smaller to transfer over the wire, and is therefore much more efficient to pass around. Furthermore, in compliance with the idea of scale-out, MapReduce implements fault-tolerance in software by monitoring the health of the worker nodes through heartbeat messages and rescheduling tasks that failed to healthy nodes. Typically, the granularity of a task equals the size of a single block in the input dataset so a node failure should only affect a fraction of the overall application and the system is able to recover gracefully. Chu et al. [36] have mapped several ML algorithms to the MapReduce framework to exploit parallelism for multicore machines.

The MapReduce architecture is similar to the Bulk Synchronous Parallel (BSP) paradigm, which preceded it. However, there are some subtle differences. For instance, the MapReduce framework does not allow communication between worker nodes in the map phase. Instead, it only allows cross-communication during the shuffle phase, in between the map and reduce phases [115], for a reduction of synchronization barriers and an increase in parallelism. Goodrich et al. [59] have shown that all BSP programs can be converted into MapReduce programs. Pace [115], in turn, proposed that all MapReduce applications should be modeled as BSP tasks to combine the benefits of theoretical correctness of the BSP paradigm with the efficient execution of MapReduce.

MapReduce as a framework is proprietary to Google. The architecture behind it, however, has been recreated in the aforementioned open source Hadoop framework, which leverages HDFS where MapReduce uses GFS but is similar in its overall architecture. Advanced variants have liberated themselves from the strict tree topology of MapReduce data flows towards more flexible structures such as forests (Dryad [75]) or generic Directed Acyclic Graphs (DAGs).

Apache Spark. MapReduce and Hadoop heavily rely on the distributed file system in every phase of the execution. Even intermediate results are stored on the storage layer, which can be a liability for iterative workloads that need to access the same data repeatedly. Transformations in linear algebra, as they occur in many ML algorithms, are typically highly iterative in nature. Furthermore, the paradigm of map and reduce operations is not ideal to support the data flow of iterative tasks, since it essentially restricts it to a tree-structure [86]. Apache Spark has been developed in response to this challenge. It is capable of executing a directed acyclic graph of transformations (like mappings) and actions (like reductions) fully in memory [137]. Because of its structure, Spark can be significantly faster than MapReduce for more complex workloads. When, for example, two consecutive map phases are needed, two MapReduce tasks would need to be executed, both of which would need to write all (intermediate) data to disk. Spark, however, can keep all the data in memory, which saves expensive reads from the disk.

The data structure that Spark was originally designed around is called a Resilient Distributed Dataset (RDD). Such datasets are read-only, and new instances can only be created from data stored on disk or by transforming existing RDDs [167]. The resilient part comes into play when data are lost: Each RDD is given a lineage graph that shows what transformations have been executed on it. This lineage graph ensures that, if some data are lost, Spark can trace the path the RDD has followed from the lineage graph and recalculate any lost data. It is important that the lineage graph does not contain cycles (i.e., is a Directed Acyclic Graph). Otherwise, Spark would run into infinite loops and be unable to recreate the RDD. In practice, the need for re-computation as a result of data loss due to node failure can lead to ripple effects [167]. Spark allows for checkpointing of data to prevent extensive re-computation. Checkpoints have to be explicitly requested and essentially materialize the intermediate state while truncating the RDD lineage graph. Systems like TR-Spark [163] have automated the generation of checkpoints to make Spark able to run on transient resources, where interruption of the execution has to be considered the norm.

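A minimal PySpark sketch of this model is shown below: transformations only record lineage, cache() keeps a dataset in memory for iterative access, and checkpoint() materializes it so the lineage does not have to be replayed after a failure. The paths and the parsing logic are placeholders.

    # Minimal PySpark sketch: lazy transformations build a lineage; cache() and
    # checkpoint() control recomputation behavior. Paths and parsing are placeholders.
    from pyspark import SparkContext

    sc = SparkContext(appName="rdd-lineage-sketch")
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")        # hypothetical location

    raw = sc.textFile("hdfs:///data/training.csv")        # hypothetical dataset
    parsed = raw.map(lambda line: line.split(","))        # transformation 1 (recorded in lineage)
    labeled = parsed.filter(lambda cols: cols[-1] != "")  # transformation 2

    labeled.cache()        # keep in memory for the iterative passes that follow
    labeled.checkpoint()   # truncate the lineage graph at this point

    print(labeled.count()) # an action finally triggers execution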
Apache Spark also includes MLlib, a scalable machine learning library that implements many ML algorithms for classification, regression, decision trees, clustering, and topic modeling. It also provides several utilities for building ML workflows, implementing often-used feature transformations, hyperparameter tuning, and so on. As MLlib uses Spark’s APIs, it immediately benefits from the scale-out and failure resilience features of Spark. MLLib relies on the Scala linear algebra package Breeze [64], which in turn utilizes netlib-java [98] for optimization, a bridge for libraries such as BLAS [20] and LAPACK [9], which are widely used in high-performance computing.

4.2 Natively Distributed Machine Learning Systems
As a result of the rising popularity of machine learning in many applications, several domain-specific frameworks have been developed around specific distribution models. In this section, the characteristics of the most popular implementations are summarized.

4.2.1 Distributed Ensemble Learning. Many generic frameworks and ML libraries have limited support for distributed training, even though they are fast and effective on a single machine. One way to achieve distribution with these frameworks is through training separate models for subsets of the available data. At prediction time, the outputs of those instances can then be combined through standard ensemble model aggregation [111].

Models that follow this strategy are not dependent on any specific library. They can be orchestrated using existing distribution frameworks (such as MapReduce [44]). The training process involves training individual models on independent machines in parallel. Neither orchestration nor communication are necessary once training has started. Training on m machines with m subsets of the data results in m different models. Each of these can use separate parameters or even algorithms. At prediction time, all trained models can then be run on new data, after which the output of each one is aggregated. This can once again be distributed if needed.

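A small sketch of this train-separately, aggregate-at-prediction pattern is given below; scikit-learn is used only as a stand-in for an arbitrary single-machine library, and in a real deployment each fit call would run on a different machine.

    # Sketch of distributed ensemble learning: train one model per data subset and
    # aggregate the predictions afterwards. In practice every fit() would run on a
    # separate machine, orchestrated by e.g. MapReduce.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.random.randn(3000, 20)
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

    m = 3
    models = []
    for X_part, y_part in zip(np.array_split(X, m), np.array_split(y, m)):
        models.append(LogisticRegression(max_iter=200).fit(X_part, y_part))   # independent training

    X_new = np.random.randn(5, 20)
    avg_prob = np.mean([model.predict_proba(X_new)[:, 1] for model in models], axis=0)
    prediction = (avg_prob > 0.5).astype(int)   # simple averaging as the ensemble aggregation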
One large drawback is that this method is dependent on proper subdivision of the training data. If large biases are present in the training sets of some of the models, then those instances could cause biased output of the ensemble. If the data are divided manually, then it is paramount to ensure independence and identical distribution of the data (i.i.d.). If, however, the dataset is inherently distributed, then this is not straightforward to achieve.

There is a large number of existing frameworks available for this method, as any machine learning framework can be used. Some popular implementations use Tensorflow [1], MXNet [33], and PyTorch [117].

4.2.2 Parallel Synchronous Stochastic Gradient Descent. Synchronized parallelism is often the most straightforward to program and reason about. Existing distribution libraries (such as Message Passing Interface (MPI) [62]) can typically be reused for this purpose. Most approaches rely on the AllReduce operation [5] where the compute nodes are arranged in a tree-like topology. Initially, each node calculates a local gradient value, accumulates these with the values received from its children and sends these up to its parent (reduce phase). Eventually, the root node obtains the global sum and broadcasts this back down to the leaf nodes (broadcast phase). Then each node updates its local model with regard to the received global gradient.

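A minimal sketch of one such synchronous step with mpi4py is shown below; the model and the local gradient computation are placeholders, and only the communication pattern is of interest.

    # One synchronous data-parallel SGD step with MPI AllReduce (mpi4py). The local
    # gradient computation is a placeholder.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    n_workers = comm.Get_size()

    weights = np.zeros(1000)                  # every worker holds an identical replica

    def local_gradient(w):
        # placeholder: gradient of the loss on this worker's mini-batch
        return np.random.randn(w.size)

    for step in range(100):
        grad = local_gradient(weights)
        total = np.empty_like(grad)
        comm.Allreduce(grad, total, op=MPI.SUM)        # blocking: all workers synchronize here
        weights -= 0.01 * (total / n_workers)          # identical update on every worker

Such a script would typically be launched with one process per worker, for example via mpiexec -n 4.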
Baidu AllReduce uses common high performance computing technology (mainly MPI and its AllReduce operation) to iteratively train SGD models on separate mini-batches of the training data [56]. AllReduce is used to apply each of the workers’ gradients onto the last common model state after each operation and then propagate the result of that operation back to each worker. This is an inherently synchronous process, blocking on the result of each worker’s training iteration before continuing to the next.

Baidu includes a further optimization from Patarasuk and Yuan [118] in this process, called a Ring AllReduce, to reduce the required amount of communication. By structuring the cluster of machines as a ring (with each node having only two neighbors) and cascading the reduction operation, it is possible to utilize all bandwidth optimally. The bottleneck, then, is the highest latency between neighboring nodes.

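The bandwidth argument can be made concrete with a single-process simulation of the reduce-scatter half of Ring AllReduce: in every step each node exchanges only one chunk (1/p of its data) with its ring neighbor, and after p - 1 steps every node owns one fully reduced chunk. The subsequent allgather half, which circulates these reduced chunks around the ring in the same way, is omitted here.

    # Single-process simulation of the reduce-scatter half of ring AllReduce.
    import numpy as np

    p, chunk = 4, 5
    data = [np.arange(p * chunk, dtype=float) + 100 * i for i in range(p)]   # per-node vectors
    expected = np.sum(data, axis=0)                                          # reference result
    chunks = [np.split(d, p) for d in data]                                  # p chunks per node

    for step in range(p - 1):
        # collect what each node sends this step before anyone applies a receive
        sends = [(node, (node - step) % p, chunks[node][(node - step) % p].copy()) for node in range(p)]
        for node, idx, payload in sends:
            chunks[(node + 1) % p][idx] += payload       # neighbor accumulates the received chunk

    # After p - 1 steps, node i holds the fully reduced chunk (i + 1) mod p.
    for node in range(p):
        owned = (node + 1) % p
        assert np.allclose(chunks[node][owned], np.split(expected, p)[owned])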
Baidu claims linear speedup when applying this technique to train deep learning networks. However, it has only been demonstrated on relatively small clusters (five nodes each, though each node has multiple GPUs that communicate with each other through the same system). The approach lacks fault tolerance by default, as no node in the ring can be missed. This could be counteracted using redundancy (at cost of efficiency). If this is not done, however, then the scalability of the method is bounded by the probability of all nodes being available. This probability can be low when using large numbers of commodity machines and networking, which is needed to facilitate big data. Baidu’s system has been integrated into Tensorflow as an alternative to the built-in Parameter Server–based approach (described below).

Horovod [131] takes a very similar approach to that of Baidu: It adds a layer of AllReduce-based MPI training to Tensorflow. One difference is that Horovod uses the NVIDIA Collective Communications Library (NCCL) for increased efficiency when training on (Nvidia) GPUs. This also enables use of multiple GPUs on a single node. Data-parallelizing an existing Tensorflow model is relatively simple, since only a few lines of code need to be added, wrapping the default Tensorflow training routine in a distributed AllReduce operation. When benchmarked on Inception v4 [148] and ResNet-101 [68] using 128 GPUs, the average GPU utilization is about 88%, compared to about 50% in Tensorflow’s Parameter Server approach. However, Horovod lacks fault tolerance (just like in Baidu’s approach) and therefore suffers from the same scalability issues [53].

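The sketch below shows how little code is involved in wrapping an existing training loop with Horovod, here using its PyTorch binding for brevity; the model and data are stand-ins, and exact API details may differ between Horovod versions.

    # Sketch of wrapping a training loop with Horovod's AllReduce-based data parallelism.
    import torch
    import horovod.torch as hvd

    hvd.init()                                      # in practice also pin one GPU per process via hvd.local_rank()

    model = torch.nn.Linear(128, 10)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())   # common practice: scale lr with workers

    # Average gradients across workers on every step and start from identical replicas.
    optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)

    batches = [(torch.randn(32, 128), torch.randint(0, 10, (32,))) for _ in range(10)]  # stand-in data
    for features, labels in batches:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(features), labels)
        loss.backward()
        optimizer.step()                            # gradients are all-reduced before the update

Such a script is typically launched with one process per GPU, for example via horovodrun -np 4 python train.py.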
Caffe2 (primarily maintained by Facebook) distributes ML through, once again, AllReduce algorithms. It does this by using NCCL between GPUs on a single host and custom code between hosts based on Facebook's Gloo [47] library to abstract away different interconnects. Facebook uses Ring AllReduce (which offers better bandwidth and parallelism guarantees) but also recursive halving and doubling (a divide-and-conquer approach that offers better latency guarantees). According to their paper, this improves performance in latency-limited situations, such as for small buffer sizes and large server counts. Using this approach, Facebook managed to train ResNet-50 [68] in the span of one hour [61], achieving linear scaling with the number of GPUs. They achieved 90% efficiency, measured up to 352 GPUs. However, once again, no fault-tolerance is present.

CNTK or The Microsoft Cognitive Toolkit offers multiple modes of data-parallel distribution. Many of them use the Ring AllReduce tactic as previously described, making the same trade-off of linear scalability over fault-tolerance. The library offers two innovations:

1-bit stochastic gradient descent (Seide et al. [130]) is an implementation of SGD that quantizes training gradients to a single bit per value. This reduces the number of bits that need to be communicated when doing distributed training by a large constant factor (a minimal sketch of the quantization with error feedback follows this list).
Block-momentum SGD (Chen and Huo [31]) divides the training set into m blocks and n splits. Each of the n machines trains a split on each block. Then the gradients calculated for all splits within a block are averaged to obtain the weights for the block. Finally, the block updates are merged into the global model while applying block-level momentum and learning rate.
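The following numpy sketch illustrates the first idea, 1-bit quantization with error feedback: only the sign of each gradient entry plus one shared scale is transmitted, and the quantization error is carried over into the next local gradient. The choice of scale below is one of several possibilities.

    # Minimal sketch of 1-bit gradient quantization with error feedback.
    import numpy as np

    residual = np.zeros(1000)            # per-worker quantization error carried over

    def encode(grad):
        global residual
        corrected = grad + residual
        scale = np.mean(np.abs(corrected))          # one possible choice of scale
        bits = np.sign(corrected)                   # 1 bit per entry on the wire
        residual = corrected - scale * bits         # remember what was lost
        return bits, scale

    def decode(bits, scale):
        return scale * bits

    grad = np.random.randn(1000)
    bits, scale = encode(grad)
    approx = decode(bits, scale)         # what the receiving side reconstructs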
When benchmarked on a Microsoft speech LSTM, average scaling efficiencies of 85%+ are achieved for small numbers of GPUs (up to 16), but scalability drops significantly (below 70%) when scaling past that. However, the direct comparison of this number to the other synchronous frameworks' results is questionable, as the dependency structure of an LSTM is significantly different from that of an ordinary DNN due to the introduction of temporal state [139].

4.2.3 Parallel Asynchronous Stochastic Gradient Descent and Parameter Servers. Asynchronous approaches tend to be more complex to implement, and it can be more difficult to trace and debug their runtime behavior. However, because it avoids frequent synchronization barriers, asynchronism alleviates many problems that occur in clusters with high failure rates or inconsistent performance.

DistBelief [43] is one of the early practical implementations of large-scale distributed ML, and it was developed by Google. They encountered the limitations of GPU training and built DistBelief to counteract them. DistBelief supports data- and model-parallel training on tens of thousands of CPU cores (though GPU support was later introduced as well [2]). They reported a speedup of more than 12x when using 81 machines training a huge model with 1.7B parameters.

To achieve efficient model-parallelism, DistBelief exploits the structure of neural networks and defines a model as a computation graph where each node implements an operation transforming inputs to outputs. Every machine executes the training of a part of the computation graph’s nodes, which can span subsets of multiple layers of the neural network. Communication is only required at those points where a node’s output is used as the input of a node trained by another machine. Partitioning the model across a cluster is transparent and requires no structural modifications. However, the efficiency of a given partitioning is greatly affected by the architecture of the model and requires careful design. For example, locally connected models lend themselves better for model-parallelism because of limited cross-partition communication. In contrast, fully connected models have more substantial cross-partition dependencies and are therefore harder to efficiently distribute through DistBelief.

To further parallelize model training, data parallelism is applied on top of the model parallelism. A centralized sharded Parameter Server is used to allow each of a set of model replicas (which may be model-parallel internally) to share parameters. DistBelief supports two different methods of data parallelism, both of which are resilient to processing speed variance between model replicas as well as replica failure:

Downpour Stochastic Gradient Descent is an asynchronous alternative to the inherently sequential SGD. Each replica of the model fetches the latest model parameters from the Parameter Server every n_fetch steps, updates these parameters in accordance with the model, and pushes the tracked parameter gradients to the Parameter Server every n_push steps. The parameters n_fetch and n_push can be increased to achieve lower communication overhead. Fetching and pushing can happen as a background process, allowing training to continue.
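The role of n_fetch and n_push can be illustrated with the following worker-loop sketch; the server object with pull() and push() methods and the gradient computation are hypothetical placeholders, not the API of any particular system.

    # Illustrative Downpour-style worker loop: parameters are refreshed from the
    # parameter server only every n_fetch steps and accumulated gradients are pushed
    # only every n_push steps.
    import numpy as np

    def compute_gradient(weights, batch):
        # placeholder for the real gradient of the loss on one mini-batch
        return 0.001 * weights + np.random.randn(weights.size) * 0.01

    def run_worker(server, data_iterator, n_fetch=5, n_push=5, lr=0.01, steps=1000):
        weights = server.pull()                       # local copy of the model
        accumulated = np.zeros_like(weights)
        for step in range(1, steps + 1):
            grad = compute_gradient(weights, next(data_iterator))
            weights -= lr * grad                      # local update between synchronizations
            accumulated += grad
            if step % n_push == 0:
                server.push(accumulated)              # ideally asynchronous / in the background
                accumulated[:] = 0.0
            if step % n_fetch == 0:
                weights = server.pull()               # may be stale relative to other workers
        return weights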
Downpour SGD is more resilient to machine failures than SGD, as it allows the training to continue even if some model replicas are off-line. However, the optimization process itself becomes less predictable due to parameters that are out of sync. The authors found relaxing consistency requirements to be remarkably effective, but offer no theoretical support for this. Tactics that contribute to robustness are the application of adaptive learning rates through AdaGrad [45] and warm starting the model through training a single model replica for a while before scaling up to the full number of machines. The authors make note of the absence of stability issues after applying these.
Distributed L-BFGS makes use of an external coordinator process that divides training work between model replicas, as well as some operations on the parameters between the parameter server shards. Training happens through L-BFGS, as is clear from the name.
Each of the shards of the Parameter Server hold a fraction of the parameter space of a model. The model replicas pull the parameters from all shards and each parallelized part of the model only retrieves those parameters that it needs.

Performance improvements are high, but the methodology is very expensive in terms of computational complexity. The best speedup (Downpour SGD with AdaGrad) achieved an 80% decrease in training time on ImageNet, but this was achieved by using more than 500 machines and more than 1K CPU cores. It has to be noted that DistBelief did not support distributed GPU training at the time of Dean et al. [43], which could reduce the required resources significantly and is in fact used by almost all other implementations mentioned in this section.

DIANNE (DIstributed Artificial Neural NEtworks) [39] is a Java-based distributed deep learning framework using the Torch native backend for executing the necessary computations. It uses a modular OSGi-based distribution framework [154] that allows different components of the deep learning system to be executed on different nodes of the infrastructure. Each basic building block of a neural network can be deployed on a specific node, hence enabling model-parallelism. DIANNE also provides basic learner, evaluator, and parameter server components that can be scaled and provide a Downpour SGD implementation similar to DistBelief.

Tensorflow [1, 2] is the evolution of DistBelief, developed to replace DistBelief within Google. It borrows the concepts of a computation graph and parameter server from it. It also applies subsequent optimizations to the parameter server model, such as optimizations for training convolutional neural networks [34] and innovations regarding consistency models and fault tolerance [92, 93]. Unlike DistBelief, TensorFlow was made available as open source software.

TensorFlow represents both model algorithms and state as a dataflow graph, of which the execution can be distributed. This facilitates different parallelization schemes that can take, e.g., state locality into account. The level of abstraction of the dataflow graph is mathematical operations on tensors (i.e., n-dimensional matrices). This is in contrast to DistBelief, which abstracts at the level of individual layers. Consequently, defining a new type of neural network layer in Tensorflow requires no custom code: it can be represented as a subgraph of a larger model, composed of fundamental math operations. A Tensorflow model is first defined as a symbolic dataflow graph. Once this graph has been constructed, it is optimized and then executed on the available hardware. This execution model allows Tensorflow to tailor its operations towards the types of devices available to it. When working with, e.g., GPUs or TPUs (Tensor Processing Units [80]), Tensorflow can take into account the asynchronicity and the intolerance or sensitivity to branching that is inherent to these devices, without requiring any changes to the model itself.

Shi and Chu [138] show Tensorflow achieving about 50% efficiency when training ResNet-50 [68] on a four-node, InfiniBand-connected cluster and about 75% efficiency on GoogleNet [147], showing that the communication overhead plays an important role and also depends on the architecture of the neural network being optimized.

MXNet [33] uses a strategy very similar to that of Tensorflow: Models are represented as dataflow graphs, which are executed on hardware that is abstracted away and coordinated by using a parameter server. However, MXNet also supports the imperative definition of dataflow graphs as operations on n-dimensional arrays, which simplifies the implementation of certain kinds of networks.

MXNet’s Parameter Server, KVStore, is implemented on top of a traditional key-value store. The KVStore supports pushing key-value pairs from a device to the store, as well as pulling the current value of a key from the store. There is support for user-defined update logic that is executed when a new value is pushed. The KVStore can also enforce different consistency models (currently limited to sequential and eventually consistent execution). It is a two-tier system: Updates by multiple threads and GPUs are merged on the local machine before they are pushed to the full cluster. The KVStore abstraction theoretically enables the implementation of (stale-)synchronicity, although only an asynchronous implementation is present at the time of writing.

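The basic push/pull usage of the KVStore, following the interface documented by MXNet, looks roughly as follows; a 'local' store is created here, whereas distributed training would create, e.g., a 'dist_async' store backed by server processes.

    # Basic push/pull usage of MXNet's KVStore (key-value interface).
    import mxnet as mx

    kv = mx.kv.create('local')
    shape = (2, 3)

    kv.init(3, mx.nd.ones(shape))                 # register key 3 with an initial value
    kv.push(3, mx.nd.ones(shape) * 2)             # send an update for key 3
    out = mx.nd.zeros(shape)
    kv.pull(3, out=out)                           # fetch the current value of key 3
    print(out.asnumpy())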
On a small cluster of 10 machines equipped with a GPU, MXNet achieves almost linear speedup compared to a single machine when training GoogleNet [147] with more than 10 passes over the data [33].

DMTK or the Distributed Machine Learning Toolkit [102] from Microsoft includes a Parameter Server called Multiverso. This can be used together with CNTK to enable Asynchronous SGD instead of the default Allreduce-based distribution in CNTK.

4.2.4 Parallel Stale-synchronous Stochastic Gradient Descent.

Petuum [160] aims to provide a generic platform for any type of machine learning (as long as it is iteratively convergent) on big data and big models (hundreds of billions of parameters). It supports data- and model-parallelism. The Petuum approach exploits ML’s error tolerance, dynamic structural dependencies, and non-uniform convergence to achieve good scalability on large datasets and models. This is in contrast to, for example, Spark, which focuses on fault tolerance and recovery. The platform uses stale synchronicity to exploit inherent tolerance of machine learning against errors, since a minor amount of staleness will only have minor effects on convergence. Dynamic scheduling policies are employed to exploit dynamic structural dependencies, which helps minimize parallelization error and synchronization cost. Finally, unconverged parameter prioritization takes advantage of non-uniform convergence by reducing computational cost on parameters that are already near optimal.

Petuum uses the Parameter Server paradigm to keep track of the parameters of the model being trained. The Parameter Server is also responsible for maintaining the staleness guarantees. In addition, it exposes a scheduler that lets the model developer control the ordering of parallelized model updates.

When developing a model using Petuum, developers have to implement a method named push, which is responsible for each of the parallelized model training operations. Its implementation should pull the model state from the parameter server, run a training iteration, and push a gradient to the parameter server. Petuum by default manages the scheduling aspect and the parameter merging logic automatically, so data-parallel models do not require any additional operations. However, if model-parallelism is desired, the schedule method (which tells each of the parallel workers what parameters they need to train) and the pull method (which defines the aggregation logic for each of the generated parameter gradients) need to be implemented as well.

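The division of labor between these three methods might look roughly as follows; note that this is an illustrative Python paraphrase of the structure described above, not Petuum's actual API.

    # Illustrative paraphrase (not Petuum's actual API) of the three user-supplied
    # methods: schedule() assigns parameters to workers, push() performs one local
    # update step, and pull() aggregates the resulting gradients.
    import numpy as np

    class MyModel:
        def schedule(self, n_workers, all_keys):
            # model-parallel case: assign a disjoint slice of parameters to each worker
            return [all_keys[i::n_workers] for i in range(n_workers)]

        def push(self, params, keys, batch):
            # pull state, run one training iteration, and return a gradient for `keys`
            return {k: 0.01 * params[k] + 0.001 * np.random.randn() for k in keys}

        def pull(self, params, gradients):
            # aggregation logic: apply every worker's gradient to the shared parameters
            for grads in gradients:
                for k, g in grads.items():
                    params[k] -= g

    params = {k: np.random.randn() for k in range(8)}
    model = MyModel()
    assignments = model.schedule(n_workers=2, all_keys=list(params.keys()))
    grads = [model.push(params, keys, batch=None) for keys in assignments]
    model.pull(params, grads)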
Petuum provides an abstraction layer that also allows it to run on systems using YARN (the Hadoop job scheduler) and HDFS (the Hadoop file system), which simplifies compatibility with pre-existing clusters.

4.2.5 Parallel Hybrid-synchronous SGD. Both synchronous and asynchronous approaches have some significant drawbacks, as is explored by Chen et al. [30]. A few frameworks attempt to find a middle ground instead that combines some of the best properties of each model of parallelism and diminishes some of the drawbacks.

MXNet-MPI [96] takes an approach to distributed ML (using a modified version of MXNet as a proof of concept) that combines some of the best aspects of both asynchronous (Parameter Server) and synchronous (MPI) implementations. The idea here is to use the same architecture as described in the MXNet section. Instead of having single workers communicate with the parameter server, however, those workers are clustered together into groups that internally apply synchronous SGD over MPI with AllReduce. This has the benefit of easy linear scalability of the synchronous MPI approach and fault tolerance of the asynchronous Parameter Server approach.

4.3 Machine Learning in the Cloud
Several cloud operators have added machine learning as a service to their cloud offerings. Most providers offer multiple options for executing machine learning tasks in their clouds, ranging from IaaS-level services (VM instances with pre-packaged ML software) to SaaS-level solutions (Machine Learning as a Service). Much of the technology offered consists of standard distributed machine learning systems and libraries. Among other things, Google's Cloud Machine Learning Engine offers support for TensorFlow and even provides TPU instances [60]. Microsoft Azure Machine Learning allows model deployment through Azure Kubernetes, through a batch service, or by using CNTK VMs [101]. As a competitor to Google's TPUs, Azure supports accelerating ML applications through FPGAs [114]. Amazon AWS has introduced SageMaker, a hosted service for building and training machine learning models in the cloud. The service includes support for TensorFlow, MXNet, and Spark [7]. IBM has bundled their cloud machine learning offerings under the Watson brand [74]. Services include Jupyter notebooks, Tensorflow, and Keras. The cloud-based delivery model is becoming more important, as it lowers the barrier of entry to designing smart applications that leverage machine learning techniques. However, the cloud is not only a consumer of distributed machine learning technology but is also fueling the development of new systems and approaches that feed back into the ecosystem to handle the large scale of these deployments.

5 CONCLUSIONS AND CURRENT CHALLENGES
Distributed Machine Learning is a thriving ecosystem with a variety of solutions that differ in architecture, algorithms, performance, and efficiency. Some fundamental challenges had to be overcome to make distributed machine learning viable in the first place, such as finding mechanisms to efficiently parallelize the processing of data while combining the outcome into a single coherent model. Now that there are industry-grade systems available and in view of the ever-growing appetite for tackling more complex problems with machine learning, distributed machine learning is increasingly becoming the norm and single-machine solutions the exception, similar to how data processing in general had developed in the past decade. There are, however, still many open challenges that are crucial to the long-term success of distributed machine learning.

5.1 Performance
A trade-off that is seen frequently is the reduction of wall-clock time at the expense of total aggregate processing time (i.e., decreased efficiency) by adding additional resources. When compute resources are affordable enough, many real-world use cases of machine learning benefit most from being trained rapidly. The fact that this often implies a large increase in total compute resources and the associated energy consumption is not considered important as long as a model saves more money than it costs to train. A good example of this is found in Dean et al. [43], where wall-clock time speedup factors are achieved by increasing the number of machines quadratically or worse. It still delivered Google a competitive advantage for years. Distributed use of GPUs, as in Tensorflow, has better properties, but often still exhibits efficiency below 75%. These performance concerns are much less severe in the context of synchronous SGD-based frameworks, which often do achieve linear speedups in benchmarks. However, most of these benchmarks test at most a few hundred machines, whereas the scale at which, e.g., DistBelief is demonstrated can be two orders of magnitude larger. The research community could clearly benefit from more independent studies that report on the performance and scalability of these systems for larger and more realistic applications, and that could provide valuable insights to guide research into workload optimization and system architecture.

5.2 Fault Tolerance
Synchronous AllReduce-based approaches seem to scale significantly better than the parameter server approach (up to a certain cluster size), but suffer from a lack of fault-tolerance: Failure of a single machine blocks the entire training process. At smaller scales, this might still be a manageable problem. However, past a certain number of nodes, the probability of any node being unavailable becomes high enough to result in near-continuous stalling. Common implementations of these HPC-inspired patterns, such as MPI and NCCL, lack fault-tolerance completely. Although there are efforts to counteract some of this, production-ready solutions are lacking. Some of the described implementations allow for checkpointing to counteract this, but significant effort is necessary to enable true fault-tolerance, as is described in Amatya et al. [6]. It is also possible to reduce the probability of failure for each individual node, but this requires very specific hardware that is expensive and not generally available in commodity scale-out data centers or in the cloud. Asynchronous implementations do not suffer from this problem as much. They are designed to explicitly tolerate straggling [41] (slow-running) and failing nodes, with only minimal impact on training performance. The question for ML operators, then, is whether they prefer performance or fault tolerance, and whether they are constrained by either one. Hybrid approaches even offer a way to customize these characteristics, although they are not frequently found in use yet. It would be interesting to see whether an even better approach exists, or whether there is an efficient way to implement fault-tolerant AllReduce.
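As a back-of-the-envelope illustration of why fully synchronous training stalls at scale, assume each worker is independently unavailable (failed or straggling) during a given step with some probability p; the chance that a synchronous step is blocked then grows rapidly with the number of workers. The value of p below is purely illustrative:

```python
# Probability that at least one of N workers blocks a synchronous AllReduce step,
# assuming independent per-worker unavailability p per step (p is illustrative).

def p_step_blocked(p, workers):
    return 1.0 - (1.0 - p) ** workers

for workers in (10, 100, 1_000, 10_000):
    print(f"{workers:6d} workers: P(step blocked) = {p_step_blocked(1e-4, workers):.2%}")
```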

5.3 Privacy
There are scenarios in which it is beneficial or even mandatory to isolate different subsets of the training data from each other [79]. The most extreme case is when a model needs to be trained on datasets that each reside on different machines or clusters and may under no circumstances be co-located or even moved. Peer-to-peer topologies like Gossip Learning [112] fully embrace this principle.

Another approach to training models in a privacy-sensitive context is the use of a distributed ensemble model. This allows a perfect separation of the training data subsets, with the drawback that a method must be found to properly balance each trained model's output so that the combined result is unbiased.
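One simple balancing heuristic is to weight each member's predicted class probabilities, e.g., by the size of the data partition it was trained on. The sketch below is only one possible scheme and uses hypothetical toy inputs:

```python
import numpy as np

# Minimal sketch: combine the class-probability outputs of models trained on
# disjoint data partitions, weighted by partition size. Inputs are toy values.

def ensemble_predict(probas, partition_sizes):
    weights = np.asarray(partition_sizes, dtype=float)
    weights /= weights.sum()
    stacked = np.stack(probas)                 # (n_models, n_samples, n_classes)
    return np.tensordot(weights, stacked, 1)   # weighted average per sample

p1 = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])  # model trained on partition 1
p2 = np.array([[0.7, 0.3], [0.5, 0.5], [0.1, 0.9]])  # model trained on partition 2
print(ensemble_predict([p1, p2], partition_sizes=[1000, 3000]).argmax(axis=1))
```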

Parameter server–based systems can be useful in the context of privacy, as the training of a model can be separated from the training result. Abadi et al. [3] discuss several algorithms that are able to train models efficiently while maintaining differential privacy. These parameter server–based systems assume that no sensitive properties of the underlying data leak into the model itself, which turns out to be difficult to guarantee in practice. Recently, Bagdasaryan et al. [12] showed that it is possible for attackers to inject a backdoor into the jointly trained model.
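A minimal sketch in the spirit of the differentially private SGD of Abadi et al. [3]: each per-example gradient is clipped to a fixed L2 norm, the clipped gradients are summed, and calibrated Gaussian noise is added before averaging. The hyperparameters and toy gradients below are illustrative and not taken from the paper:

```python
import numpy as np

# Sketch of one differentially private gradient step in the spirit of DP-SGD [3]:
# clip each per-example gradient to L2 norm C, sum, add Gaussian noise with
# standard deviation sigma * C, then average. All values below are toy numbers.

def dp_gradient_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    rng = rng or np.random.default_rng(0)
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    noisy_sum = clipped.sum(axis=0) + rng.normal(
        0.0, noise_multiplier * clip_norm, size=per_example_grads.shape[1])
    return noisy_sum / len(per_example_grads)

grads = np.random.default_rng(1).normal(size=(32, 10))  # 32 examples, 10 parameters
print(dp_gradient_step(grads)[:3])
```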

Federated learning systems can be deployed where multiple parties jointly learn an accurate deep neural network while keeping the data itself local and confidential. Privacy of the respective data was believed to be preserved by applying differential privacy, as shown by Shokri and Shmatikov [140] and McMahan et al. [97]. However, Hitaj et al. [71] devised an attack based on GANs, showing that record-level differential privacy is generally ineffective in federated learning systems.
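The server-side aggregation step of federated averaging, as proposed by McMahan et al. [97], can be sketched as a dataset-size-weighted average of the client models; only parameters and example counts leave the clients. The client values below are toy data:

```python
import numpy as np

# Sketch of the server-side step of federated averaging [97]: clients train
# locally and upload only their parameter vectors and local dataset sizes;
# the server computes a weighted average. Client values below are toy data.

def federated_average(client_params, n_examples):
    weights = np.asarray(n_examples, dtype=float)
    weights /= weights.sum()
    return sum(w * p for w, p in zip(weights, client_params))

clients = [np.full(4, 1.0), np.full(4, 3.0), np.full(4, 5.0)]
print(federated_average(clients, n_examples=[100, 200, 700]))  # leans towards client 3
```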

Additionally, it is possible to introduce statistical noise into each subset of the training data with the intention of rendering its sensitive characteristics unidentifiable to other parties. Balcan et al. [14] touch on this subject but make it clear that the resulting privacy in this scenario depends on the number of statistical queries required to learn the dataset, which puts an upper bound on the usefulness of the model itself.
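As a small, hedged illustration of this idea, a party can answer a statistical query over its local partition (here, a feature mean) with Laplace noise before sharing the value; the sensitivity and epsilon below are toy numbers chosen only for the example:

```python
import numpy as np

# Sketch: answer a statistical query (a feature mean) over a local partition
# with Laplace noise before sharing it. Epsilon and value range are illustrative.

def noisy_mean(values, epsilon=0.5, value_range=1.0, rng=None):
    rng = rng or np.random.default_rng(0)
    sensitivity = value_range / len(values)          # sensitivity of the mean query
    return np.mean(values) + rng.laplace(0.0, sensitivity / epsilon)

local_feature = np.random.default_rng(1).uniform(0.0, 1.0, size=1000)
print(noisy_mean(local_feature))
```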

For a more in-depth discussion on privacy in distributed deep learning, we refer to Vepakomma et al. [153]. In conclusion, while theoretical results exist, current frameworks do not offer much support for even basic forms of privacy. It could be interesting to investigate fundamental approaches to facilitate distributed privacy, which could then be integrated into the currently popular frameworks.

5.4 Portability
With the proliferation of machine learning, and in particular deep learning, a myriad of different libraries and frameworks for creating and training neural networks has emerged. However, once a model is trained, one is often tied to the framework at hand for deploying it in production, as each framework uses a custom format to store the results. For example, TensorFlow [2] uses a SavedModel directory, which includes a protocol buffer defining the whole computation graph. Caffe [78] also uses a binary protocol buffer for storing saved models, but with a custom schema. Theano [18] uses pickle to serialize models represented by Python objects, and PyTorch [117] has a built-in save method that serializes to a custom ASCII or binary format.
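As a concrete example of such framework-specific serialization, the sketch below saves and reloads a toy model with PyTorch's built-in mechanism; the resulting file is in PyTorch's own format and cannot be read directly by other frameworks. Model architecture and file name are arbitrary examples:

```python
import torch
import torch.nn as nn

# Sketch of framework-specific serialization: torch.save writes the parameters
# in PyTorch's own binary format, readable only through PyTorch again.

model = nn.Linear(16, 2)
torch.save(model.state_dict(), "model.pt")        # PyTorch-specific checkpoint

restored = nn.Linear(16, 2)
restored.load_state_dict(torch.load("model.pt"))  # requires PyTorch to load
```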

Portability also becomes increasingly important with respect to the hardware platform on which one wants to deploy. Although the x86_64 and ARM processor architectures are the mainstream for executing applications in the server and mobile device markets, respectively, we witness a shift towards using GPU hardware for efficiently executing neural network models [108]. As machine learning models become more widespread, we also see more development of custom ASICs, such as TPUs [128] in Google Cloud or dedicated neural network hardware in the latest iPhone [11]. This diversification makes it more difficult to ensure that a trained model can run on any of these hardware platforms.

A first step towards portability is the rise of a number of framework-independent specifications for defining machine learning models and computation graphs. The Open Neural Network Exchange (ONNX) format specifies a protocol buffer schema for an extensible computation graph model, together with definitions of standard operators and data types. Currently, ONNX is supported out of the box by frameworks such as Caffe, PyTorch, CNTK, and MXNet, and converters exist, e.g., for TensorFlow. Similar efforts towards a common model format specification are driven by Apple with their Core ML format [10] and by the Khronos Group with the Neural Network Exchange Format [150].
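As an illustration of such an exchange format, the sketch below exports a small PyTorch model to ONNX via torch.onnx.export; the model architecture, file name, and tensor names are arbitrary examples:

```python
import torch
import torch.nn as nn

# Sketch: export a (toy) trained PyTorch model to the framework-independent
# ONNX format so it can be consumed by other runtimes and hardware back ends.

model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()

dummy_input = torch.randn(1, 16)  # example input that traces the computation graph
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["features"], output_names=["logits"])

# The resulting protocol buffer can then be validated independently, e.g.:
#   import onnx; onnx.checker.check_model(onnx.load("model.onnx"))
```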

REFERENCES
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Retrieved from https://www.tensorflow.org/.
Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16). 265–283.
Martin Abadi, Andy Chu, Ian Goodfellow, Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep learning with differential privacy. In Proceedings of the 23rd ACM Conference on Computer and Communications Security (ACM CCS’16). 308–318. Retrieved from https://arxiv.org/abs/1607.00133.
Adapteva, Inc. 2017. E64G401 Epiphany 64-core Microprocessor Datasheet. Retrieved from http://www.adapteva.com/docs/e64g401_datasheet.pdf.
Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, and John Langford. 2014. A reliable effective terascale linear learning system. J. Mach. Learn. Res. 15, 1 (2014), 1111–1133.
Vinay Amatya, Abhinav Vishnu, Charles Siegel, and Jeff Daily. 2017. What does fault tolerant deep learning need from MPI? CoRR abs/1709.03316 (2017). arxiv:1709.03316 http://arxiv.org/abs/1709.03316.
Amazon Web Services. 2018. Amazon SageMaker. Retrieved from https://aws.amazon.com/sagemaker/developer-resources/.
Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, Jie Chen, Jingdong Chen, Zhijie Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Ke Ding, Niandong Du, Erich Elsen, Jesse Engel, Weiwei Fang, Linxi Fan, Christopher Fougner, Liang Gao, Caixia Gong, Awni Hannun, Tony Han, Lappi Johannes, Bing Jiang, Cai Ju, Billy Jun, Patrick LeGresley, Libby Lin, Junjie Liu, Yang Liu, Weigao Li, Xiangang Li, Dongpeng Ma, Sharan Narang, Andrew Ng, Sherjil Ozair, Yiping Peng, Ryan Prenger, Sheng Qian, Zongfeng Quan, Jonathan Raiman, Vinay Rao, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Kavya Srinet, Anuroop Sriram, Haiyuan Tang, Liliang Tang, Chong Wang, Jidong Wang, Kaifu Wang, Yi Wang, Zhijian Wang, Zhiqian Wang, Shuang Wu, Likai Wei, Bo Xiao, Wen Xie, Yan Xie, Dani Yogatama, Bin Yuan, Jun Zhan, and Zhenyao Zhu. 2016. Deep speech 2 : End-to-end speech recognition in English and Mandarin. In Proceedings of the 33rd International Conference on Machine Learning, Maria Florina Balcan and Kilian Q. Weinberger (Eds.), Vol. 48. PMLR, New York, New York, 173–182. Retrieved from http://proceedings.mlr.press/v48/amodei16.html.
Edward Anderson, Zhaojun Bai, Jack Dongarra, Anne Greenbaum, Alan McKenney, Jeremy Du Croz, Sven Hammarling, James Demmel, C. Bischof, and Danny Sorensen. 1990. LAPACK: A portable linear algebra library for high-performance computers. In Proceedings of the ACM/IEEE Conference on Supercomputing. IEEE Computer Society Press, 2–11.
Apple. 2017. Core ML Model Format Specification. Retrieved from https://apple.github.io/coremltools/coremlspecification/.
Apple. 2018. A12 Bionic. Retrieved from https://www.apple.com/iphone-xs/a12-bionic/.
Eugene Bagdasaryan, Andreas Veit, Yiqing Hua, Deborah Estrin, and Vitaly Shmatikov. 2018. How to backdoor federated learning. arXiv preprint arXiv:1807.00459 (2018).
Drew Bagnell and Andrew Y. Ng. 2006. On local rewards and scaling distributed reinforcement learning. In Proceedings of the International Conference on Advances in Neural Information Processing Systems. 91–98.
Maria-Florina Balcan, Avrim Blum, Shai Fine, and Yishay Mansour. 2012. Distributed learning, communication complexity and privacy. In Proceedings of the Conference on Learning Theory. 26–1.
Paul Baran. 1962. On Distributed Communication Networks. Rand Corporation.
Luiz André Barroso, Jeffrey Dean, and Urs Hölzle. 2003. Web search for a planet: The Google cluster architecture. IEEE Micro 2 (2003), 22–28.
James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13 ( Feb. 2012), 281–305.
James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. 2010. Theano: A CPU and GPU math compiler in Python. In Proceedings of the 9th Python in Science Conference, Vol. 1.
Philip A. Bernstein and Eric Newcomer. 2009. Principles of Transaction Processing. Morgan Kaufmann.
L. Susan Blackford, Antoine Petitet, Roldan Pozo, Karin Remington, R. Clint Whaley, James Demmel, Jack Dongarra, Iain Duff, Sven Hammarling, Greg Henry, et al. 2002. An updated set of basic linear algebra subprograms (BLAS). ACM Trans. Math. Softw. 28, 2 (2002), 135–151.
David M. Blei. 2012. Probabilistic topic models. Commun. ACM 55, 4 (2012), 77–84.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3, Jan. (2003), 993–1022.
Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin Zhang, Jake Zhao, and Karol Zieba. 2016. End to end learning for self-driving cars. CoRR abs/1604.07316 (2016). arxiv:1604.07316 http://arxiv.org/abs/1604.07316.
Léon Bottou. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of the COMPSTAT’2010. Springer, 177–186.
Leo Breiman. 2001. Random forests. Mach. Learn. 45, 1 ( 1 Oct. 2001), 5–32.
Stephen Brooks. 1998. Markov chain Monte Carlo method and its application. J. Roy. Statist. Soc.: Series D (the Statist.) 47, 1 (1998), 69–100.
Rajkumar Buyya et al. 1999. High Performance Cluster Computing: Architectures and Systems. Prentice Hall, Upper SaddleRiver, NJ, 999.
Richard H. Byrd, Samantha L. Hansen, Jorge Nocedal, and Yoram Singer. 2016. A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim. 26, 2 (2016), 1008–1031.
K. Canini, T. Chandra, E. Ie, J. McFadden, K. Goldman, M. Gunter, J. Harmsen, K. LeFevre, D. Lepikhin, T. L. Llinares, et al. 2012. Sibyl: A system for large scale supervised machine learning. Tech. Talk 1 (2012), 113.
Jianmin Chen, Rajat Monga, Samy Bengio, and Rafal Józefowicz. 2016. Revisiting distributed synchronous SGD. CoRR abs/1604.00981 (2016). arxiv:1604.00981 http://arxiv.org/abs/1604.00981.
Kai Chen and Qiang Huo. 2016. Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’16). IEEE, 5880–5884.
Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. ACM SIGPLAN Not. 49, 4 (2014), 269–284.
Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR abs/1512.01274 (2015). arxiv:1512.01274 http://arxiv.org/abs/1512.01274.
Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. 2014. Project Adam: Building an efficient and scalable deep learning training system. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI’14). USENIX Association, 571–582. Retrieved from https://www.usenix.org/conference/osdi14/technical-sessions/presentation/chilimbi.
François Chollet et al. 2015. Keras. Retrieved from https://keras.io/.
Cheng-Tao Chu, Sang K. Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Kunle Olukotun, and Andrew Y. Ng. 2007. Map-reduce for machine learning on multicore. In Proceedings of the International Conference on Advances in Neural Information Processing Systems. 281–288.
Scott H. Clearwater, Tze-Pin Cheng, Haym Hirsh, and Bruce G. Buchanan. 1989. Incremental batch learning. In Proceedings of the 6th International Workshop on Machine Learning. Elsevier, 366–370.
Adam Coates, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, and Ng Andrew. 2013. Deep learning with COTS HPC systems. In Proceedings of the International Conference on Machine Learning. 1337–1345.
Elias De Coninck, Steven Bohez, Sam Leroux, Tim Verbelen, Bert Vankeirsbilck, Pieter Simoens, and Bart Dhoedt. 2018. DIANNE: A modular framework for designing, training and deploying deep neural networks on heterogeneous distributed infrastructure. J. Syst. Softw. 141 (2018), 52–65.
George F. Coulouris, Jean Dollimore, and Tim Kindberg. 2005. Distributed Systems: Concepts and Design. Pearson Education.
Henggang Cui, James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Abhimanu Kumar, Jinliang Wei, Wei Dai, Gregory R. Ganger, Phillip B. Gibbons, et al. 2014. Exploiting bounded staleness to speed Up big data analytics. In Proceedings of the USENIX Annual Technical Conference. 37–48.
Yu-Hong Dai and Yaxiang Yuan. 1999. A nonlinear conjugate gradient method with a strong global convergence property. SIAM J. Optim. 10, 1 (1999), 177–182.
Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. 2012. Large scale distributed deep networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Vol. 1 (NIPS’12). Curran Associates Inc., 1223–1231. Retrieved from http://dl.acm.org/citation.cfm?id=2999134.2999271.
Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Conference on Operating Systems Design & Implementation, Vol. 6 (OSDI’04). USENIX Association, 10–10.
John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12 ( July 2011), 2121–2159. Retrieved from http://dl.acm.org/citation.cfm?id=1953048.2021068.
Elmootazbellah Nabil Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson. 2002. A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv. 34, 3 (2002), 375–408.
Facebook. 2017. Gloo. Retrieved from https://github.com/facebookincubator/gloo.
Clément Farabet, Berin Martini, Benoit Corda, Polina Akselrod, Eugenio Culurciello, and Yann LeCun. 2011. Neuflow: A runtime reconfigurable dataflow processor for vision. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW’11). IEEE, 109–116.
Michael Ferdman, Almutaz Adileh, Onur Kocberber, Stavros Volos, Mohammad Alisafaee, Djordje Jevdjic, Cansu Kaynak, Adrian Daniel Popescu, Anastasia Ailamaki, and Babak Falsafi. 2012. Clearing the clouds: A study of emerging scale-out workloads on modern hardware. In ACM SIGPLAN Not., Vol. 47. ACM, 37–48.
Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. 2014. Do we need hundreds of classifiers to solve real world classification problems. J. Mach. Learn. Res 15, 1 (2014), 3133–3181.
Michael J. Flynn. 1972. Some computer organizations and their effectiveness. IEEE Trans. Comput. 100, 9 (1972), 948–960.
Ian Foster and Adriana Iamnitchi. 2003. On death, taxes, and the convergence of peer-to-peer and grid computing. In Proceedings of the International Workshop on Peer-to-Peer Systems. Springer, 118–128.
Ermias Gebremeskel. 2018. Analysis and comparison of distributed training techniques for deep neural networks in a dynamic environment. (2018).
Stuart Geman, Elie Bienenstock, and René Doursat. 1992. Neural networks and the bias/variance dilemma. Neural Comput. 4, 1 (1992), 1–58.
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. 2003. The Google file system. In Proceedings of the ACM Symposium on Operating Systems Principles.
Andrew Gibiansky. 2017. Bringing HPC Techniques to Deep Learning. Retrieved from http://research.baidu.com/bringing-hpc-techniques-deep-learning/.
Yue-Jiao Gong, Wei-Neng Chen, Zhi-Hui Zhan, Jun Zhang, Yun Li, Qingfu Zhang, and Jing-Jing Li. 2015. Distributed evolutionary algorithms and their models: A survey of the state-of-the-art. Appl. Soft Comput. 34 (2015), 286–300. DOI: https://doi.org/10.1016/j.asoc.2015.04.061
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Proceedings of the International Conference on Advances in Neural Information Processing Systems, Vol. 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 2672–2680.
Michael T. Goodrich, Nodari Sitchinava, and Qin Zhang. 2011. Sorting, searching, and simulation in the MapReduce framework. In Proceedings of the International Symposium on Algorithms and Computation. 374–383.
Google. 2017. Google Cloud TPU. Retrieved from https://cloud.google.com/tpu.
Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. CoRR abs/1706.02677 (2017). arxiv:1706.02677 http://arxiv.org/abs/1706.02677.
William D. Gropp, William Gropp, Ewing Lusk, and Anthony Skjellum. 1999. Using MPI: Portable Parallel Programming with the Message-passing Interface. Vol. 1. The MIT Press.
Mehmet Gönen. 2012. Predicting drug-target interactions from chemical and genomic kernels using Bayesian matrix factorization. Bioinformatics 28, 18 (2012), 2304–2310.
D. Hall, D. Ramage, et al. 2009. Breeze: Numerical Processing Library for Scala. Retrieved from https://github.com/scalanlp/breeze.
Minyang Han and Khuzaima Daudjee. 2015. Giraph unchained: Barrierless asynchronous parallel execution in pregel-like graph processing systems. Proc. VLDB Endow. 8, 9 ( May 2015), 950–961. DOI: https://doi.org/10.14778/2777598.2777604
Tianyi David Han and Tarek S. Abdelrahman. 2011. Reducing branch divergence in GPU programs. In Proceedings of the 4th Workshop on General Purpose Processing on Graphics Processing Units. 3 (Mar. 2011) 1–8.
Elmar Haußmann. 2018. Accelerating I/O Bound Deep Learning on Shared Storage. Retrieved from https://blog.riseml.com/accelerating-io-bound-deep-learning-e0e3f095fd0.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. CoRR abs/1512.03385 (2015). arxiv:1512.03385 http://arxiv.org/abs/1512.03385.
Magnus Rudolph Hestenes, Eduard Stiefel, and others. 1952. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards 49, 6 (1952), 409–436.
Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012).
Briland Hitaj, Giuseppe Ateniese, and Fernando Perez-Cruz. 2017. Deep models under the GAN: Information leakage from collaborative deep learning. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security. ACM, 603–618.
Thomas Hofmann. 1999. Probabilistic latent semantic analysis. In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 289–296.
Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, and Onur Mutlu. 2017. Gaia: Geo-distributed machine learning approaching LAN speeds. In Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI’17). USENIX Association, 629–647. Retrieved from https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/hsieh.
IBM Cloud. 2018. IBM Watson Machine Learning. Retrieved from https://www.ibm.com/cloud/machine-learning.
Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. 2007. Dryad: Distributed data-parallel programs from sequential building blocks. In ACM SIGOPS Operating Systems Review, Vol. 41. ACM, 59–72.
Sylvain Jeaugey. 2017. NCCL 2.0. Retrieved from http://on-demand.gputechconf.com/gtc/2017/presentation/s7155-jeaugey-nccl.pdf.
Genlin Ji and Xiaohan Ling. 2007. Ensemble learning based distributed clustering. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 312–321.
Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 675–678.
Michael I. Jordan and Tom M. Mitchell. 2015. Machine learning: Trends, perspectives, and prospects. Science 349, 6245 (2015), 255–260.
Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. 2017. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the ACM/IEEE 44th International Symposium on Computer Architecture (ISCA’17). IEEE, 1–12.
Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. 1996. Reinforcement learning: A survey. J. Artific. Intell. Res. 4 (1996), 237–285.
Amir E. Khandani, Adlar J. Kim, and Andrew W. Lo. 2010. Consumer credit-risk models via machine-learning algorithms. J. Bank. Fin. 34, 11 (2010), 2767–2787.
Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 ( Aug. 2009), 30–37.
Thorsten Kurth, Jian Zhang, Nadathur Satish, Evan Racah, Ioannis Mitliagkas, Md Mostofa Ali Patwary, Tareq Malas, Narayanan Sundaram, Wahid Bhimji, Mikhail Smorkalov, et al. 2017. Deep learning at 15pf: Supervised and semi-supervised classification for scientific data. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 7.
Donghwoon Kwon, Hyunjoo Kim, Jinoh Kim, Sang C. Suh, Ikkyun Kim, and Kuinam J. Kim. 2017. A survey of deep learning-based network anomaly detection. Cluster Comput. ( 27 Sep. 2017). DOI: https://doi.org/10.1007/s10586-017-1117-8
Ralf Lammel. 2008. Google’s MapReduce programming model—Revisited. Sci. Comput. Prog. 70, 1 (2008), 1.
Sara Landset, Taghi M. Khoshgoftaar, Aaron N. Richter, and Tawfiq Hasanin. 2015. A survey of open source tools for machine learning with big data in the Hadoop ecosystem. J. Big Data 2, 1 (2015), 24.
Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra, and Y. Bengio. 2007. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the International Conference on Machine Learning, Vol. 227, 473–480. DOI: https://doi.org/10.1145/1273496.1273556
Chuck L. Lawson, Richard J. Hanson, David R. Kincaid, and Fred T. Krogh. 1979. Basic linear algebra subprograms for Fortran usage. ACM Trans. Math. Softw. 5, 3 (1979), 308–323.
Quoc V. Le. 2013. Building high-level features using large scale unsupervised learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’13). IEEE, 8595–8598.
Guanpeng Li, Siva Kumar Sastry Hari, Michael Sullivan, Timothy Tsai, Karthik Pattabiraman, Joel Emer, and Stephen W. Keckler. 2017. Understanding error propagation in deep learning neural network (DNN) accelerators and applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 8.
Mu Li, David G. Andersen, Alexander Smola, and Kai Yu. 2014. Communication efficient distributed machine learning with the parameter server. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS’14), Vol. 1. The MIT Press, 19–27.
Mu Li, David G. Anderson, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling distributed machine learning with the parameter server. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI’14). 583–598.
Mu Li, Li Zhou, Zichao Yang, Aaron Li, Fei Xia, David G. Andersen, and Alexander Smola. 2013. Parameter server for distributed machine learning. In Big Learning NIPS Workshop, Vol. 6. 2 pages.
Dong C. Liu and Jorge Nocedal. 1989. On the limited memory BFGS method for large scale optimization. Math. Prog. 45, 1–3 (1989), 503–528.
A. R. Mamidala, G. Kollias, C. Ward, and F. Artico. 2018. MXNET-MPI: Embedding MPI parallelism in parameter server task model for scaling deep learning. ArXiv e-prints ( Jan. 2018). arxiv:cs.DC/1801.03855.
H. Brendan McMahan, Eider Moore, Daniel Ramage, and Blaise Agüera y Arcas. 2016. Federated learning of deep networks using model averaging. (2016).
Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, D. B. Tsai, Manish Amde, Sean Owen, et al. 2016. MLlib: Machine learning in Apache Spark. J. Mach. Learn. Res. 17, 1 (2016), 1235–1241.
Cade Metz. 2018. Big bets on AI open a new frontier for chip start-ups, too. The New York Times 14 Jan. (2018). Retrieved from https://www.nytimes.com/2018/01/14/technology/artificial-intelligence-chip-start-ups.html.
Ryan J. Meuth. 2007. GPUs surpass computers at repetitive calculations. IEEE Potent. 26, 6 (2007), 12–23.
Microsoft. 2018. Microsoft Azure Machine Learning. Retrieved from https://azure.microsoft.com/en-us/overview/machine-learning/.
Microsoft Inc. 2015. Distributed Machine Learning Toolkit (DMTK). Retrieved from http://www.dmtk.io.
Jyoti Nandimath, Ekata Banerjee, Ankur Patil, Pratima Kakade, Saumitra Vaidya, and Divyansh Chaturvedi. 2013. Big data analysis using Apache Hadoop. In Proceedings of the IEEE 14th International Conference on Information Reuse & Integration (IRI’13). IEEE, 700–703.
Fairuz Amalina Narudin, Ali Feizollah, Nor Badrul Anuar, and Abdullah Gani. 2016. Evaluation of machine learning classifiers for mobile malware detection. Soft Comput. 20, 1 (2016), 343–357.
David Newman, Arthur Asuncion, Padhraic Smyth, and Max Welling. 2009. Distributed algorithms for topic models. J. Mach. Learn. Res. 10, Aug. (2009), 1801–1828.
NVIDIA Corporation. 2015. NVIDIA Collective Communications Library (NCCL). Retrieved from https://developer.nvidia.com/nccl.
NVIDIA Corporation. 2017. Nvidia Tesla V100. Retrieved from https://www.nvidia.com/en-us/data-center/tesla-v100/.
Kyoung-Su Oh and Keechul Jung. 2004. GPU implementation of neural networks. Pattern Recog. 37, 6 (2004), 1311–1314.
Andreas Olofsson. 2016. Epiphany-V: A 1024 processor 64-bit RISC system-on-chip. (2016).
Andreas Olofsson, Tomas Nordström, and Zain Ul-Abdin. 2014. Kickstarting high-performance energy-efficient manycore architectures with epiphany. arXiv preprint arXiv:1412.5538 (2014).
David W. Opitz and Richard Maclin. 1999. Popular ensemble methods: An empirical study. (1999).
Róbert Ormándi, István Hegedűs, and Márk Jelasity. 2013. Gossip learning with linear models on fully distributed data. Concurr. Comput.: Pract. Exper. 25, 4 (2013), 556–571.
A. Orriols-Puig, J. Casillas, and E. Bernado-Mansilla. 2009. Fuzzy-UCS: A Michigan-style learning fuzzy-classifier system for supervised learning. IEEE Trans. Evol. Comput. 13, 2 ( Apr. 2009), 260–283.
Kalin Ovtcharov, Olatunji Ruwase, Joo-Young Kim, Jeremy Fowers, Karin Strauss, and Eric S. Chung. 2015. Accelerating deep convolutional neural networks using specialized hardware. Micros. Res. Whitep. 2, 11 (2015).
Matthew Felice Pace. 2012. BSP vs MapReduce. Proced. Comput. Sci. 9 (2012), 246–255.
Louis Papageorgiou, Picasi Eleni, Sofia Raftopoulou, Meropi Mantaiou, Vasileios Megalooikonomou, and Dimitrios Vlachakis. 2018. Genomic big data hitting the storage bottleneck. EMBnet. J. 24 (2018).
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. (2017).
Pitch Patarasuk and Xin Yuan. 2009. Bandwidth optimal all-reduce algorithms for clusters of workstations. J. Parallel Distrib. Comput. 69, 2 ( Feb. 2009), 117–124.
Diego Peteiro-Barral and Bertha Guijarro-Berdiñas. 2013. A survey of methods for distributed machine learning. Prog. Artific. Intell. 2, 1 (2013), 1–11.
Boris T. Polyak. 2007. Newton’s method and its use in optimization. Europ. J. Operat. Res. 181, 3 (2007), 1086–1096.
Daniel Pop. 2016. Machine learning and cloud computing: Survey of distributed and SaaS solutions. arXiv preprint arXiv:1603.08767 (2016).
Foster Provost and Venkateswarlu Kolluri. 1999. A survey of methods for scaling up inductive algorithms. Data Min. Knowl. Disc. 3, 2 (1999), 131–169.
Junfei Qiu, Qihui Wu, Guoru Ding, Yuhua Xu, and Shuo Feng. 2016. A survey of machine learning for big data processing. EURASIP J. Adv. Sig. Proc. 2016, 1 (2016), 67.
Ioan Raicu, Ian Foster, Alex Szalay, and Gabriela Turcu. 2006. Astroportal: A science gateway for large-scale astronomy data analysis. In Proceedings of the TeraGrid Conference. 12–15.
Rajat Raina, Anand Madhavan, and Andrew Y. Ng. 2009. Large-scale deep unsupervised learning using graphics processors. In Proceedings of the 26th International Conference on Machine Learning. ACM, 873–880.
James Reinders. 2013. AVX-512 instructions. Intel Corporation (2013).
Peter Richtárik and Martin Takáč. 2016. Distributed coordinate descent method for learning with big data. J. Mach. Learn. Res. 17, 1 (2016), 2657–2681.
Kaz Sato, Cliff Young, and David Patterson. 2017. An in-depth look at Google’s first Tensor Processing Unit (TPU). Google Cloud Big Data Mach. Learn. Blog 12 (2017).
Frank Seide and Amit Agarwal. 2016. CNTK: Microsoft’s open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2135–2135.
Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 2014. 1-bit stochastic gradient descent and application to data-parallel distributed training of speech DNNs. In Proceedings of the Conference of the International Speech Communication Association (Interspeech’14). Retrieved from https://www.microsoft.com/en-us/research/publication/1-bit-stochastic-gradient-descent-and-application-to-data-parallel-distributed-training-of-speech-dnns/.
Alexander Sergeev and Mike Del Balso. 2018. Horovod: Fast and easy distributed deep learning in TensorFlow. Retrieved from https://arxiv.org/abs/1802.05799.
Pierre Sermanet, Soumith Chintala, and Yann LeCun. 2012. Convolutional neural networks applied to house numbers digit classification. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR’12). IEEE, 3288–3291.
Pierre Sermanet and Yann LeCun. 2011. Traffic sign recognition with multi-scale convolutional networks. In Proceedings of the International Joint Conference on Neural Networks (IJCNN’11). IEEE, 2809–2813.
Amazon Web Services. 2016. Introducing Amazon EC2 P2 Instances, the Largest GPU-Powered Virtual Machine in the Cloud. Retrieved from https://aws.amazon.com/about-aws/whats-new/2016/09/introducing-amazon-ec2-p2-instances-the-largest-gpu-powered-virtual-machine-in-the-cloud/.
Amazon Web Services. 2017. Amazon EC2 F1 Instances. Retrieved from https://aws.amazon.com/ec2/instance-types/f1/.
Shai Shalev-Shwartz and Tong Zhang. 2013. Stochastic dual coordinate ascent methods for regularized loss minimization. (2013).
James G. Shanahan and Laing Dai. 2015. Large scale distributed data science using Apache Spark. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2323–2324.
Shaohuai Shi and Xiaowen Chu. 2017. Performance modeling and evaluation of distributed deep learning frameworks on GPUs. CoRR abs/1711.05979 (2017). arxiv:1711.05979 Retrieved from http://arxiv.org/abs/1711.05979.
Shaohuai Shi, Qiang Wang, Pengfei Xu, and Xiaowen Chu. 2016. Benchmarking state-of-the-art deep learning software tools. In Proceedings of the 7th International Conference on Cloud Computing and Big Data (CCBD’16). IEEE, 99–104.
Reza Shokri and Vitaly Shmatikov. 2015. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, 1310–1321.
Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. 2010. The Hadoop distributed file system. In Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST’10). IEEE, 1–10.
J. Simm, A. Arany, P. Zakeri, T. Haber, J. K. Wegner, V. Chupakhin, H. Ceulemans, and Y. Moreau. 2017. Macau: Scalable Bayesian factorization with high-dimensional side information using MCMC. In Proceedings of the IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP’17). DOI: https://doi.org/10.1109/MLSP.2017.8168143
Dilpreet Singh and Chandan K. Reddy. 2015. A survey on platforms for big data analytics. J. Big Data 2, 1 (2015), 8.
Michael John Sebastian Smith. 1997. Application-specific Integrated Circuits. Vol. 7. Addison-Wesley, Reading, MA.
Alexander Smola and Shravan Narayanamurthy. 2010. An architecture for parallel topic models. Proc. VLDB Endow. 3, 1–2 (2010), 703–710.
Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Proceedings of the International Conference on Advances in Neural Information Processing Systems, Vol. 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 2951–2959.
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2014. Going deeper with convolutions. CoRR abs/1409.4842 (2014). arxiv:1409.4842 http://arxiv.org/abs/1409.4842.
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. Rethinking the inception architecture for computer vision. CoRR abs/1512.00567 (2015). arxiv:1512.00567 Retrieved from http://arxiv.org/abs/1512.00567.
Martin Takác, Avleen Singh Bijral, Peter Richtárik, and Nati Srebro. 2013. Mini-batch primal and dual methods for SVMs. In Proceedings of the International Conference on Machine Learning (ICML’13). 1022–1030.
The Khronos Group. 2018. Neural Network Exchange Format (NNEF). Retrieved from https://www.khronos.org/registry/NNEF/specs/1.0/nnef-1.0.pdf.
K. I. Tsianos, S. F. Lawlor, and M. G. Rabbat. 2012. Communication/computation tradeoffs in consensus-based distributed optimization. In Proceedings of the International Conference on Advances in Neural Information Processing Systems.
Sujatha R. Upadhyaya. 2013. Parallel approaches to machine learning—A comprehensive survey. J. Parallel Distrib. Comput. 73, 3 (2013), 284–292.
Praneeth Vepakomma, Tristan Swedish, Ramesh Raskar, Otkrist Gupta, and Abhimanyu Dubey. 2018. No peek: A survey of private distributed deep learning. arXiv preprint arXiv:1812.03288 (2018).
Tim Verbelen, Pieter Simoens, Filip De Turck, and Bart Dhoedt. 2012. AIOLOS: Middleware for improving mobile application performance through cyber foraging. J. Syst. Softw. 85, 11 (2012), 2629–2639.
Jinliang Wei, Wei Dai, Aurick Qiao, Qirong Ho, Henggang Cui, Gregory R. Ganger, Phillip B. Gibbons, Garth A. Gibson, and Eric P. Xing. 2015. Managed communication and consistency for fast data-parallel iterative analytics. In Proceedings of the 6th ACM Symposium on Cloud Computing (SoCC’15). ACM, 381–394.
Sholom M. Weiss and Nitin Indurkhya. 1995. Rule-based machine learning methods for functional prediction. J. Artific. Intell. Res. 3 (1995), 383–403.
Stewart W. Wilson. 1995. Classifier fitness based on accuracy. Evolut. Comput. 3, 2 (1995), 149–175.
Stephen J. Wright. 2015. Coordinate descent algorithms. Math. Prog. 151, 1 (2015), 3–34.
P. Xie, J. K. Kim, Y. Zhou, Q. Ho, A. Kumar, Y. Yu, and E. Xing. 2015. Distributed machine learning via sufficient factor broadcasting. arXiv preprint (2015).
E. P. Xing, Q. Ho, W. Dai, J. K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, and Y. Yu. 2013. Petuum: A new platform for distributed machine learning on big data. ArXiv e-prints (Dec. 2013). arxiv:stat.ML/1312.7651.
Eric P. Xing, Qirong Ho, Pengtao Xie, and Dai Wei. 2016. Strategies and principles of distributed machine learning on big data. Engineering 2, 2 (2016), 179–195.
Feng Yan, Olatunji Ruwase, Yuxiong He, and Evgenia Smirni. 2016. SERF: Efficient scheduling for fast deep neural network serving via judicious parallelism. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’16). IEEE, 300–311.
Ying Yan, Yanjie Gao, Yang Chen, Zhongxin Guo, Bole Chen, and Thomas Moscibroda. 2016. TR-Spark: Transient computing for big data analytics. In Proceedings of the 7th ACM Symposium on Cloud Computing (SoCC’16). ACM, New York, NY, 484–496.
Dani Yogatama, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Wang Ling. 2016. Learning to compose words into sentences with reinforcement learning. CoRR abs/1611.09100 (2016). arxiv:1611.09100 Retrieved from http://arxiv.org/abs/1611.09100.
Yang You, Aydın Buluç, and James Demmel. 2017. Scaling deep learning on GPU and Knights Landing clusters. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM.
Haifeng Yu and Amin Vahdat. 2002. Design and evaluation of a conit-based continuous consistency model for replicated services. ACM Trans. Comput. Syst. 20, 3 (2002), 239–282.
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, 2–2.
Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud’10). USENIX Association.
Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, et al. 2016. Apache Spark: A unified engine for big data processing. Commun. ACM 59, 11 (2016), 56–65.
Li Zeng, Ling Li, Lian Duan, Kevin Lu, Zhongzhi Shi, Maoguang Wang, Wenjuan Wu, and Ping Luo. 2012. Distributed data mining: A survey. Inf. Technol. Manag. 13, 4 (2012), 403–409.
Hao Zhang, Zeyu Zheng, Shizhen Xu, Wei Dai, Qirong Ho, Xiaodan Liang, Zhiting Hu, Jinliang Wei, Pengtao Xie, and Eric P. Xing. 2017. Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters. arXiv preprint (2017).