Getting Oriented in the RAPIDS Distributed ML Ecosystem, Part 1: ETL

For a long time, being a data scientist who works with large datasets and/or models has meant mastering two sets of tools: one for local work and one for “big data”. pandas, numpy, and scikit-learn make it easy to work on your local machine, but they can’t handle anything too big to fit in RAM. Once data gets too big, or training too costly, you have to move on to a “big data” tool that pools the resources of several machines to get the job done. That has traditionally meant Apache Spark, which, though powerful, requires learning a brand new API and maybe even a brand new language (performant Spark code is written in Scala).

Enter Dask. Dask is a distributed ETL tool that’s tightly integrated into the Python data science ecosystem. Dask is extremely popular among data scientists because its core API is a subset of the pandas, numpy, and scikit-learn APIs. This flattens the learning curve considerably: most Pythonistas can be productive with Dask almost immediately.
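
As a rough illustration of how closely the call shapes track each other, here is a minimal sketch; the CSV path and column names are placeholders, not anything from this post:

import pandas as pd
import dask.dataframe as dd

# pandas: eager and fully in-memory
df = pd.read_csv("trips.csv")
print(df.groupby("passenger_count").fare_amount.mean())

# dask: the same call shape, but lazy and split across partitions;
# nothing actually runs until .compute() is called
ddf = dd.read_csv("trips.csv")
print(ddf.groupby("passenger_count").fare_amount.mean().compute())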

As part of its RAPIDS initiative, NVIDIA is going one step further, partnering with the community to build an ecosystem for distributed data science on GPUs on top of Dask. Their new CUDA-accelerated dataframe library, called cuDF, already boasts some pretty impressive results — like this one from Capital One Labs showing a log-scale speedup for an internal ETL job that was previously being run on CPU:

[Figure: Capital One Labs benchmark showing the log-scale speedup of the GPU-accelerated ETL job over the CPU version]

This blog post, the first of two exploring this emerging ecosystem, is an introduction to distributed ETL using the dask, cudf, and dask_cudf APIs. We build the following mental map of the ecosystem:

[Figure: a mental map of the Dask and RAPIDS distributed data science ecosystem]

Note that this post assumes familiarity with the Python data science ecosystem.

Distributed ETL on CPU with Dask

ETL, aka extract-transform-load, is well-accepted terminology for the speculative, exploratory phase that precedes pretty much any data analysis, data science, or machine learning project. In data science, ETL is traditionally performed using the NumPy and Pandas libraries in a Jupyter notebook. These three tools form the backbone of the modern “data science stack”.
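
For concreteness, a typical single-machine ETL step looks something like the sketch below; the file and column names are invented for illustration:

import pandas as pd

# load, filter, and derive a feature, all in memory on one machine
df = pd.read_csv("trips.csv", parse_dates=["pickup_datetime"])
df = df[df.fare_amount > 0]                      # drop bad rows
df["tip_pct"] = df.tip_amount / df.fare_amount   # feature engineering
print(df.groupby("passenger_count").tip_pct.mean())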

However, these libraries were only ever intended to work with datasets small enough to fit in memory. A good rule of thumb for Pandas is that you need roughly 5x the size of a dataset in RAM to work with it comfortably. So if your machine has 16 GB of RAM, a dataset of around 3 GB is about the largest you can comfortably “fit” onto your machine before you start to run into problems (you might be able to go slightly bigger if you’re careful about how you use memory).
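
If you want to know how close you are to that limit, pandas can report a frame’s actual in-memory footprint. A quick sketch, using synthetic data:

import numpy as np
import pandas as pd

# ten million float64 values at 8 bytes each is roughly 80 MB
df = pd.DataFrame({"fare_amount": np.random.rand(10_000_000)})

# deep=True also counts object (string) columns; everything here is numeric
print(df.memory_usage(deep=True).sum() / 1e6, "MB")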

To comfortably work with any dataset larger than that (“big data”), you need to shift up to a so-called cluster computing framework. Cluster computing frameworks work by distributing the work required among many machines: your RAM becomes the sum of all of the machines’ RAM, your FLOPS the sum of all their FLOPS. The two cluster computing frameworks best known in the Python community are Apache Spark and Dask (see the JetBrains 2019 Developer Survey for some numbers).
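
With Dask, moving from one machine to many is mostly a question of where the client points. A sketch, with a placeholder scheduler address:

from dask.distributed import Client

# connect to a scheduler that workers on other machines have joined
# (the address below is a placeholder, not a real host)
client = Client("tcp://scheduler.internal:8786")

# the client reports the pooled resources: total workers, cores, and RAM
print(client)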

If you are working on a greenfield machine learning project, I recommend choosing Dask over Spark for your cluster computing framework.

While the folks behind RAPIDS have indicated that they intend to support both Dask and Spark eventually, for a variety of reasons, they have prioritized Dask as their target platform for now. If machine learning pipelines are your primary use case, you want to write maximally performant code, and you don’t have an existing Spark cluster tying you down, Dask is the way to go. :)

Let’s now take a quick look at an example of Dask in action. Run conda install -c conda-forge dask[dataframe] distributed scikit-learn and you can then run the following example script:

from dask.distributed import Client, LocalCluster
import dask.dataframe as dd
from sklearn.datasets import make_classification

# create a dask cluster client
cluster = LocalCluster()
client = Client(cluster)

# create an example dataset and transform it into a dask dataframe
X_train, y_train = make_classification(
    n_features=2, n_redundant=0, n_informative=2,
    random_state=1, n_clusters_per_class=1, n_samples=1000
)
X_train, y_train = dd.from_array(X_train), dd.from_array(y_train)

# compute a trivial sum
X_train.sum().compute()

This script creates a distributed Dask cluster on your local machine (despite the use of dask.distributed, all the nodes are running on your local machine, so it’s not truly distributed), creates some synthetic data, transforms it into a dask.dataframe, and computes its sum.

The great thing about Dask is that its dataframe API closely mirrors the pandas API. Many of the functions you are familiar with from pandas continue to work the same way they did before. Under the hood, Dask partitions the dataset into memory blocks. Each block is a pandas DataFrame containing a disjoint chunk of the overall dataset. Here’s a diagram from the Dask documentation explaining how it works:

[Figure: diagram from the Dask documentation showing a Dask DataFrame partitioned into multiple pandas DataFrames]

When you call the “Dask-ified” version of a DataFrame function, Dask has each machine in the cluster compute a partial result on the data blocks it has locally. These results are then sent back to the primary node, combined, and returned to the end-user.
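
You can see this block structure, and the partial-then-combine pattern, directly. A small sketch on toy data:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"x": range(10)})
ddf = dd.from_pandas(pdf, npartitions=3)

print(ddf.npartitions)                       # 3
# each partition is an ordinary pandas DataFrame holding a disjoint chunk
print(type(ddf.get_partition(0).compute()))  # <class 'pandas.core.frame.DataFrame'>
# the sum is computed per block, then the partial sums are combined
print(ddf.x.sum().compute())                 # 45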

Of course, the pandas API is huge, and Dask does not attempt to implement every single DataFrame feature offered. Dask only provides a subset of the pandas API. The more popular methods are there, as are the easy-to-implement primitives.

The most up-to-date reference on what’s in and what’s not is the Dask DataFrame API Reference. All of the Pandas mainstays are there: assign, apply, groupby, loc, iloc, resample, rolling, merge, join, astype. Even some more exotic functions, like melt and pipe, have been implemented.
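
A quick sketch chaining a few of these, just to show that the call shapes carry over unchanged (the toy columns are made up):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"key": ["a", "b", "a", "b"], "val": [1.0, 2.0, 3.0, 4.0]})
ddf = dd.from_pandas(pdf, npartitions=2)

result = (
    ddf.assign(val2=ddf.val * 2)   # same call shape as in pandas
       .groupby("key")
       .val2.mean()
       .compute()                  # nothing executes until this point
)
print(result)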

To get your hands dirty with Dask yourself, I recommend checking out Dask’s SciPy 2020 Tutorial.

Distributed ETL on GPU

Now that we know how we would run distributed ETL jobs on CPU, let’s take a look at the GPU side of things.

The RAPIDS cuDF library provides two related DataFrame APIs: cudf.DataFrame for a single GPU, and dask_cudf.DataFrame for ETL across multiple GPUs. For installation instructions, refer to the RAPIDS Getting Started page; on my test machine, conda install -c rapidsai -c nvidia -c conda-forge -c defaults cudf=0.14 python=3.7 cudatoolkit=10.1 did the trick. Here’s an example cudf call demonstrating the API:

import cudf
import dask_cudf

df = cudf.DataFrame(
    {'a': list(range(20)),
     'b': list(reversed(range(20))),
     'c': list(range(20))}
)
ddf = dask_cudf.from_cudf(df, npartitions=2)
print(ddf.loc[0:5].compute())

   a   b  c
0  0  19  0
1  1  18  1
2  2  17  2
3  3  16  3
4  4  15  4
5  5  14  5

Assuming you have a dask.distributed cluster up and running, a cudf.DataFrame can be sharded amongst GPUs by transforming it into a dask_cudf.DataFrame. The opposite is also true: a dask_cudf.DataFrame can be gathered back onto a single GPU. Compute operations reduce the result onto one GPU by default, so if you call compute on a dask_cudf.DataFrame, expect to get a cudf.DataFrame back out.
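
A sketch of that round trip, reusing the df and ddf objects from the snippet above (the exact type path printed may vary with the cudf version):

# shard a single-GPU frame across the cluster's GPUs
ddf = dask_cudf.from_cudf(df, npartitions=2)

# each block dask schedules work on is itself a cudf.DataFrame
print(type(ddf.get_partition(0).compute()))

# gather back onto one GPU: compute() on a dask_cudf.DataFrame
# returns a plain cudf.DataFrame
gathered = ddf.compute()
print(type(gathered))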

As you might have noticed from the name, dask_cudf is tightly coupled to dask. Under the hood, Dask is still handling task scheduling and execution. The main difference is that the memory blocks it uses are now cudf.DataFrame objects, not pandas.DataFrame ones.

However, whilst cudf and dask both inherit from the pandas API, the subsets they implement are somewhat different. cudf is also a whole lot newer than Dask, which leads to a significantly smaller API. And dask_cudf, which is newer still, is even more limited in terms of the operations it can perform.

Take apply for example. This is one of the most important functions in pandas, but it’s not directly present in dask_cudf. Instead, you use map_partitions, which takes a function as input and maps it over the raw memory blocks:

ddf.map_partitions(lambda df: df + 1).compute()

This RDD-like API is workable when you want to compute something row-wise on the raw dataframe, but it really limits what you can do after a groupby or rolling operation. It’s also notable that apply_rows, being GPU-accelerated, accepts a more limited set of function signatures than the apply method in pandas or dask: the function must be JIT-compilable by numba.
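
For completeness, here is roughly what a numba-JIT-able kernel looks like with apply_rows, following the pattern in the cudf documentation; the column names and arithmetic are invented, and the exact signature may differ between cudf versions:

import numpy as np
import cudf

def kernel(x, y, out):
    # numba compiles this loop for the GPU; only simple numeric
    # operations are allowed inside the kernel
    for i, (a, b) in enumerate(zip(x, y)):
        out[i] = a * a + b * b

df = cudf.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})
df = df.apply_rows(
    kernel,
    incols=["x", "y"],
    outcols={"out": np.float64},
    kwargs={},
)
print(df)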

For a deeper dive into cudf, I recommend checking out 10 Minutes to cudf and Dask-cuDF.

Benchmarks

In summary, dask is stable and feature-complete, cudf is bleeding-edge and incomplete, but developing rapidly.

The foibles of working on the bleeding edge are potentially very much worth it if the speedups from moving from CPU to GPU are big enough.

For a rule-of-thumb estimate of how much of a performance improvement dask_cudf offers over dask alone, I used Spell to launch a basic computational job on three different instances:

  • cpu-huge, aka c5.18xlarge, one of the biggest CPU-only instances you can get on AWS

  • v100x4 (p3.8xlarge), a GPU server with four V100s (the most powerful cloud GPU currently available in AWS) onboard

  • t4x4 (g4dn.xlarge), a GPU server with four T4 GPUs onboard (the most recent and most cost-effective cloud GPU available — but less powerful than the V100)

The objective was to perform a value_counts() operation on the passenger_count column in a 16 GB dataset of taxi trips (to learn more about this dataset check out my previous blog post, “Getting started with large-scale ETL jobs using Dask and AWS EMR”). While this is not really a robust benchmark, it’s good enough to get a sense of the speedup that cudf can offer.

On a giant CPU instance, using dask, this operation (which requires scanning through the entire dataset) took just over 1 second to perform. First, reading the data into memory:

from dask.distributed import Client, LocalCluster
import dask.dataframe as dd
cluster = LocalCluster()
client = Client(cluster)
df = dd.read_parquet('/mnt/data/2019-taxi-dataset/')

Then running the operation:

%%time
# cpu-huge
df.passenger_count.value_counts().compute()

CPU times: user 3.16 s, sys: 1.76 s, total: 4.92 s
Wall time: 1.08 s

Running this same computation on GPU using dask_cudf resulted in a 10x speedup.

We start by reading the data into GPU memory and partitioning it into four pieces (this happens to be the most efficient partitioning scheme for this particular combination of dataset and operation):

from dask_cuda import LocalCUDACluster
import dask_cudf
from dask.distributed import Client, wait

cluster = LocalCUDACluster()
client = Client(cluster)
ddf = dask_cudf.read_parquet('/mnt/data/2019-taxi-dataset/')
ddf = ddf.repartition(npartitions=4)
ddf = ddf.persist()
wait(ddf)

We then run the code samples:

%%time
# t4x4
ddf.passenger_count.value_counts().compute()

CPU times: user 17.5 ms, sys: 39 µs, total: 17.6 ms
Wall time: 93.4 ms

%%time
# v100x4
ddf.passenger_count.value_counts().compute()

CPU times: user 28.1 ms, sys: 0 ns, total: 28.1 ms
Wall time: 93.1 ms

This is a huge speedup that’s consistent (in miniature) with the log-scale speedups Capital One Labs saw in their benchmarks. Reducing the runtime for an important ETL pipeline task from 60 minutes to 6 minutes is a huge boon for developer productivity, and can enable you to build more complex and useful processing pipelines that would have been computationally intractable on a CPU. All good things when you’re in the business of turning data into insights!

That’s all for now. Stay tuned for a future blog post exploring how dask_ml and cuml enable you to go one step further by distributing your model training and model scoring, too.

Originally published at: https://medium.com/rapids-ai/getting-oriented-in-the-rapids-distributed-ml-ecosystem-part-1-etl-6aaa03f7da4d
