The Modin View of Scaling Pandas

Recently, a blog post pitted a variety of data science tools against each other in a series of head-to-head comparisons. I wanted to take the opportunity to talk about our vision for Modin and where we'd like to take the field of data science.

Modin (https://github.com/modin-project/modin) takes a different view of how systems should be built. Modin is designed around enabling the data scientist to be more productive. When I created Modin as a PhD student in the RISELab at UC Berkeley, I noticed a dangerous trend: data science tools are being built with hardware performance in mind, but becoming more and more complex. This complexity and performance burden is being pushed onto the data scientist. These tools optimize for processor time at the cost of the data scientist’s time.

This trend is harmful to the overall field of data science: we don't want to force every data scientist into becoming a distributed systems expert, or to require that they understand low-level implementation details just to avoid being penalized on performance. Modin is disrupting the data science tooling space by prioritizing the data scientist's time over hardware time. To this end, Modin has:

  1. No upfront cost to learning a new API
  2. Integration with the Python ecosystem
  3. Integration with Ray/Dask clusters (Run on/with what you have!)

Modin started as a drop-in replacement for Pandas, because that is where we saw the biggest need. Using Modin is as simple as pip install modin[ray] or pip install modin[dask] and then changing the import statement:

# import pandas as pd
import modin.pandas as pd
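
From there, existing pandas-style code runs unchanged; Modin partitions the work across all available cores via Ray or Dask under the hood. A minimal, self-contained sketch (the CSV file and column name here are hypothetical):

import modin.pandas as pd

# the same calls you would make with pandas, now executed in parallel
df = pd.read_csv("a_file.csv")
summary = df.groupby("some_column").mean()
print(summary.head())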

Modin is evolving into more than just a drop-in replacement for pandas (try the latest version, 0.7.4), and I will discuss this later in this post. First, I'd like to do a more detailed comparison of Modin vs. other libraries, since many comparisons leave out crucial information.

These systems aim to solve different problems, so fair comparisons are challenging. I will try to stick to the qualitative properties to avoid bias.

Dask Dataframe vs. Modin

Dask Dataframe (https://github.com/dask/dask) covers roughly 50% of the Pandas API due to how the underlying system is architected. Dask Dataframe is a row store, like many SQL systems before it. However, dataframes are not SQL tables. This architecture allows Dask Dataframe to scale extremely well, but prevents it from supporting all Pandas APIs. Dask Dataframe also requires the user to add .compute() or .persist() calls to their data science workload to manage when computation is triggered. Users are also required to specify the number of partitions. Dask is more mature, with more than 5 years since its first commit.
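
To make that requirement concrete, here is a minimal sketch of the extra steps Dask Dataframe asks of the user (the file and column names are hypothetical):

import dask.dataframe as dd

# the user chooses the partitioning and triggers execution explicitly
ddf = dd.read_csv("a_file.csv", blocksize="64MB")
ddf = ddf.repartition(npartitions=8)
result = ddf.groupby("some_column").mean().compute()  # nothing runs until .compute()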

Modin currently covers more than 80% of the Pandas API. Modin is architected as a flexible column store, so the partitioning within Modin can be changed based on the operation that needs to be computed. This allows Modin to scale algorithms that strict column or row stores could not. Modin also does not place any computation or partitioning burden on the user: optimizations on the user's query are run under the hood, without input from the user. Modin is still in its infancy, at 2 years since its first commit.

Modin integrates with Dask’s scheduling API, so Modin can run on top of Dask clusters as well. More on this comparison here.

Vaex vs. Modin

Vaex (https://github.com/vaexio/vaex) is a query engine on top of HDF5 and Arrow memory-mapped files, so if your data is already in HDF5 format, it is a solid choice. Going from other file formats (CSV, JSON, Parquet) into a Vaex-compatible format takes non-trivial compute time and disk space. Vaex supports roughly 40% of the Pandas API, and does not support key Pandas features like the Pandas row Index. Vaex has had nearly 6 years since its first commit.
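
For example, a CSV typically gets converted into an HDF5 file before Vaex can memory-map it. A rough sketch (the file names are hypothetical, and the exact conversion options vary by Vaex version):

import vaex

# one-time conversion: writes an HDF5 copy of the CSV to disk
df = vaex.from_csv("a_file.csv", convert="a_file.hdf5")

# later sessions can memory-map the converted file directly
df = vaex.open("a_file.hdf5")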

As of writing, Modin relies on Pandas for its HDF5 file reader, and does not support the memory-mapped style of querying that Vaex supports. Modin scales to clusters, while Vaex tries to help users avoid the need for clusters by using memory-mapped files.
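
In practice, that means an HDF5 read in Modin currently executes through the pandas reader. A small sketch (the file and key are hypothetical):

import modin.pandas as pd

# Modin mirrors the pandas signature; the HDF5 read itself is delegated to pandas
df = pd.read_hdf("a_file.h5", key="data")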

RAPIDS cuDF vs. Modin

cuDF (https://github.com/rapidsai/cudf) is a dataframe library with a Pandas-like API that runs on NVIDIA GPUs. cuDF is bottlenecked by the amount of memory (up to 48GB), and spilling over to host memory comes with a high overhead. The performance you get on the GPU is great as long as the data reasonably fits into the GPU memory. You can use multiple GPUs with cuDF and Dask, and the resulting distributed GPU dataframe will have the same general properties outlined above in the Dask section. cuDF has had 3 years since the first commit.
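
cuDF keeps the pandas look and feel, so a typical snippet reads almost like pandas code. A minimal sketch (hypothetical file and column; requires an NVIDIA GPU with the RAPIDS stack installed):

import cudf

# the dataframe lives in GPU memory, so it must fit within the GPU's capacity
gdf = cudf.read_csv("a_file.csv")
result = gdf.groupby("some_column").mean()
print(result.head())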

As of writing, Modin does not have GPU support, but Modin's architecture is layered, so Modin could integrate GPU kernels alongside CPU kernels. Because of the flexibility of the underlying architecture, Modin can even integrate hybrid CPU/GPU/*PU kernels. With this GPU support, we will need to ensure it is implemented in a way that lets users avoid thinking about GPU memory constraints. More on this comparison here.

Ray vs. Modin

Modin is the data processing engine for Ray. Ray is a low-level distributed computing framework for scaling Python code, while Modin has higher level APIs for data manipulation and querying.
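
The difference in abstraction level looks roughly like this (a sketch; the task function and file are hypothetical):

import ray
import modin.pandas as pd

# Ray: low-level tasks that you define and schedule yourself
ray.init()

@ray.remote
def square(x):
    return x * x

print(ray.get([square.remote(i) for i in range(4)]))

# Modin: a dataframe API; the underlying Ray tasks are scheduled for you
df = pd.read_csv("a_file.csv")
print(df.describe())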

The future of Modin

Modin has disrupted the interactive data science space, centering the discussion on data science productivity rather than benchmark performance. As Modin continues to mature, it is evolving into a platform for empowering data science at every scale. A few big features are coming at the end of July, with more later this year. I will discuss a couple of these features here; if you'd like to find more, feel free to explore our future plans on the GitHub Issue tracker or ZenHub board.

Scale out with one line of code

One of the major difficulties data scientists face is that they often must switch between Jupyter notebooks to run code on different clusters or environments. This becomes cumbersome when trying to debug code locally, so we have developed an API for switching between running code locally and running it on different cloud or cluster environments with just one line of code. Note: this API may change, and it is not yet available in the latest release (0.7.4). Feedback welcome!

import modin
import modin.pandas as pd

test_cluster = modin.cloud.create(provider="AWS",
                                  instance="m4.xlarge",
                                  numnodes=2)

with modin.local():  # run this code locally
    df_local = pd.read_csv("a_file.csv")
    df_local.head()

with modin.cloud.deploy(test_cluster):  # run on my test cluster
    df_test = pd.read_csv("s3://bucket/a_file.csv")
    df_test.head()

with modin.local():
    df_local.count()

with modin.cloud.deploy(test_cluster):
    df_test.count()

with modin.cloud.deploy(production):  # run this on production
    df_prod = ...

This works in standard Jupyter notebooks or any Python interpreter, and it enables data scientists to be more productive than ever within the context of a single notebook.

NumPy and scikit-learn APIs

These APIs have long been requested and are being built out. They will integrate with the existing Pandas engine and avoid common bottlenecks when moving between Modin's dataframe and NumPy or scikit-learn. As is the case with the Pandas API, they will be drop-in replacements as well.

import modin.pandas as pd
import modin.numpy as np
from modin.sklearn.model_selection import train_test_split
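
Since these APIs are still being built out, the following is only a hypothetical sketch of how such a drop-in workflow might look (the file and column names are made up, and the signature is assumed to mirror scikit-learn's train_test_split):

import modin.pandas as pd
from modin.sklearn.model_selection import train_test_split

df = pd.read_csv("a_file.csv")
X = df.drop(columns=["target"])
y = df["target"]

# assumed to mirror sklearn.model_selection.train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)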

As is the case with dataframes, Modin will be layered to integrate other technologies and compute kernels. We can integrate other computation engines or kernels, e.g. nums (another RISELab project).

Summary

Data scientists love their tools, and we love data scientists. Modin is designed around enabling the data scientist to be more productive with the tools that they love. We are working to shift the focus of data science tooling to value the data scientist’s time more than the time of the hardware they are using.

Source: https://towardsdatascience.com/the-modin-view-of-scaling-pandas-825215533122
