Benchmarking Pandas Alternatives — Polars | Modin | Pandarallel | pySpark


Although Pandas is flexible and convenient on small datasets, its weakness at big-data scale limits its usefulness in data science. Finding a replacement for Pandas in the big-data era is an important problem for algorithm engineers.

Of course, public clouds and analytical databases are everywhere these days, and big data is no problem for them, but the convenience and efficiency of their commercial offerings usually come at a cost. Is there a low-cost option?

Reflecting on Pandas's weak showing on big data: Pandas was designed to run on a single core, so it cannot exploit a machine's multi-core CPU. The community has produced many alternatives that target exactly this weakness.

Below, I benchmark the Pandas, Polars, Modin, and Pandarallel frameworks, plus the big-data mainstay Spark via its Python API, pySpark, running a UDF over a modest dataset, as a reference for choosing a framework in the future.

The dataset used here has shape (45, 500000); the task is to MD5-hash each column's values and keep the tail of the digest (via an apply function). Test machine: MacBook Pro i5 / 16 GB / 512 GB.

TL;DR: to save time, here are each framework's results up front; every figure is the average of three runs.

| # | Framework   | Version | Operation | Duration       |
|---|-------------|---------|-----------|----------------|
| 1 | Pandas      | 1.3.5   | read data | 0:00:01.208665 |
|   |             |         | apply     | 0:05:14.027412 |
| 2 | Polars      | 0.13.34 | read data | 0:00:00.280509 |
|   |             |         | apply     | 0:00:50.164057 |
| 3 | Modin       | 0.12.1  | read data | 0:00:03.598980 |
|   |             |         | apply     | 0:04:08.191504 |
| 4 | Pandarallel | 1.6.1   | read data | 0:00:01.165021 |
|   |             |         | apply     | 0:01:51.759348 |
| 5 | pySpark     | 3.2.1   | read data | 0:00:00.378601 |
|   |             |         | apply     | 0:00:22.682952 |

Quick conclusions:

a. Read speed: Polars > pySpark >> Pandarallel > Pandas > Modin

b. Apply speed: pySpark > Polars > Pandarallel >> Modin > Pandas

c. On the apply workload, Modin and Pandarallel do not deliver the large speedups they advertise; with Pandarallel in particular, you can clearly hear the laptop fans spin up;

d. Polars is impressive;

e. Spark shows its strength in big-data processing;

Note: H2O.ai has already benchmarked these frameworks; here I add a test of one more common operation, the apply function, for reference:

Database-like ops benchmark: https://h2oai.github.io/db-benchmark/


 

~~~~~~~~~~~~~~ Now for the head-to-head ~~~~~~~~~~~~~

The apply function is:

from hashlib import md5

def apply_md5(item):
    # Hash the string form of the value and keep the low 31 bits
    # of the digest (hexdigest -> int, then modulo 2**31).
    encoder = md5()
    encoder.update(str(item).encode('utf-8'))
    k = int(encoder.hexdigest(), 16)
    h = k % 2147483648  # 2**31
    return h
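The post never shows its timing harness. The durations are printed exactly like Python timedelta objects, so a minimal sketch that would reproduce the format (my assumption, not the author's actual code):

from datetime import datetime

start = datetime.now()
# ... the operation under test, e.g. a read_csv call or the apply loop ...
print(datetime.now() - start)  # prints e.g. 0:00:01.208665 (str of a timedelta)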

1. Pandas Test

Read the dataset, timing the operation:

import pandas as pd
df_data = pd.read_csv(data_file, names=col_list)
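Neither data_file nor col_list is defined in the post's snippets; given the col_0 … col_44 headers in the preview below, a hypothetical reconstruction of col_list would be:

col_list = [f'col_{i}' for i in range(45)]  # hypothetical; matches the 45 columns col_0 … col_44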

Show the raw data with df_data.head():

(df_data.head(): the first five rows of raw integer data across columns col_0 … col_44; the original table did not survive extraction and is omitted.)

Run the apply function, timing the operation:

# Row-wise apply (axis=1) over each column in turn, replacing every
# value with its 31-bit MD5 hash.
for col in df_data.columns:
    df_data[col] = df_data.apply(lambda x: apply_md5(x[col]), axis=1)

Show the result with df_data.head():

(df_data.head() after the apply: every cell is now a 31-bit integer hash such as 1869972635; the garbled table is omitted.)

2. Polars Test

Polars highlights:

  • Polars is extremely fast at I/O;
  • Polars is written in Rust with an Apache Arrow memory model; Python is just a front-end wrapper;
  • Polars exposes two APIs: an Eager API and a Lazy API;
  • the Eager API resembles Pandas, with broadly similar syntax, and executes immediately;
  • the Lazy API is Spark-like: a query is first turned into a logical plan, which is then reorganized and optimized to cut execution time and memory use (see the sketch after this list).
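The benchmark below only exercises the Eager API. For illustration, a minimal Lazy API sketch (my own example, assuming a CSV file 'data.csv' with numeric columns col_0 and col_1):

import polars as pl

# Build a logical plan; nothing is read or computed yet.
lazy_plan = (
    pl.scan_csv('data.csv')           # lazy counterpart of read_csv
      .filter(pl.col('col_0') > 0)    # recorded into the plan
      .select([pl.col('col_0'), pl.col('col_1')])
)

# Optimization and execution happen only here.
result = lazy_plan.collect()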

Documentation:

List context and row-wise compute - Polars - User Guide: https://pola-rs.github.io/polars-book/user-guide/dsl/list_context.html

Install: pip3 install polars -i https://pypi.mirrors.ustc.edu.cn/simple/

Read the dataset, timing it:

import polars as pl

pl_data = pl.read_csv(data_file, has_header=False, new_columns=col_list)

Run the apply function, timing it:

# Column-wise apply: map apply_md5 over every column in a single select.
pl_data = pl_data.select([
    pl.col(col).apply(lambda s: apply_md5(s)) for col in pl_data.columns
])

View the results:

3. Modin Test

Modin highlights:

  • uses the DataFrame as its basic data type;
  • has the same application programming interface (API) as Pandas;
  • where Pandas only ever uses one core, Modin uses all of them;
  • handles datasets from 1 MB to 1 TB+;
  • runs the same code on a single machine or a cluster, using Ray underneath (see the sketch after this list);
  • users don't need to know how many cores they have, or specify how to distribute the data.
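The execution engine can also be pinned explicitly; a hedged sketch (MODIN_ENGINE is a documented Modin setting, but the value shown is just one option):

import os

# Pin Modin's execution engine before the first modin import.
os.environ['MODIN_ENGINE'] = 'ray'   # 'dask' also works if installed

import modin.pandas as pd            # the only code change vs. plain pandas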

Documentation:

Scale your pandas workflow by changing a single line of code — Modin documentation: https://modin.readthedocs.io/en/stable/

Install: pip3 install "modin[ray]" -i https://pypi.mirrors.ustc.edu.cn/simple/

To avoid the error "ModuleNotFoundError: No module named 'aiohttp.signals'", pin aiohttp: pip3 install aiohttp==3.7 -i https://pypi.mirrors.ustc.edu.cn/simple/

Read the dataset, timing it:

import modin.pandas as pd  # drop-in replacement for `import pandas as pd`

md_data = pd.read_csv(data_file, names=col_list)

Run the apply function, timing it:

for col in md_data.columns:
    md_data[col] = md_data.apply(lambda x: apply_md5(x[col]), axis=1)

View the results:

(md_data.head(): identical hashed output to the Pandas result above; the garbled table is omitted.)

4. Pandarallel Test

Pandarallel highlights:

  • parallelizes Pandas with very little effort;
  • has no file reader of its own and relies on Pandas for I/O.

Documentation:

pandarallel · PyPI: https://pypi.org/project/pandarallel/

Read the dataset, timing it:

import pandas as pd
from pandarallel import pandarallel
pandarallel.initialize()

dp_data = pd.read_csv(data_file, names=col_list)
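pandarallel.initialize() also accepts tuning options; a sketch with two of them (the values are illustrative, not what this benchmark used):

from pandarallel import pandarallel

pandarallel.initialize(
    nb_workers=4,       # worker processes; defaults to the number of physical cores
    progress_bar=True,  # show per-worker progress bars during parallel_apply
)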

Run the apply function, timing it:

for col in dp_data.columns:
    dp_data[col] = dp_data.parallel_apply(lambda x: apply_md5(x[col]), axis=1)

View the results:

(dp_data.head(): identical hashed output to the Pandas result above; the garbled table is omitted.)

5. pySpark Test

There is already plenty of Spark material; for a side-by-side Pandas/pySpark comparison see, for example: pyspark系列--pandas和pyspark对比 (CSDN blog): https://blog.csdn.net/suzyu12345/article/details/79673483

Install: pip3 install pyspark -i https://pypi.mirrors.ustc.edu.cn/simple/

Read the dataset, timing it:

from pyspark.sql import SparkSession
import pyspark.pandas as ps  # pandas-on-Spark API (Spark >= 3.2)

spark = SparkSession.builder.appName('testpyspark').getOrCreate()

ps_data = ps.read_csv(data_file, names=col_list)

Run the apply function, timing it:

# Column-wise apply via the pandas-on-Spark Series.apply.
for col in ps_data.columns:
    ps_data[col] = ps_data[col].apply(apply_md5)
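For comparison, the same transformation through the native Spark DataFrame API would look roughly like this (my own sketch, reusing the assumed data_file and col_list; the post itself uses the pandas-on-Spark API above):

from pyspark.sql import functions as F
from pyspark.sql.types import LongType

# Wrap apply_md5 as a Spark UDF returning a 64-bit integer.
md5_udf = F.udf(apply_md5, LongType())

sdf = spark.read.csv(data_file, header=False).toDF(*col_list)
sdf = sdf.select([md5_udf(F.col(c)).alias(c) for c in sdf.columns])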

View the results:

(ps_data.head(): the same hashed values across all 45 columns, col_0 … col_44; the garbled wide table is omitted.)

Summary

| # | Framework   | Version | Operation | Duration       | Read rank | Apply rank |
|---|-------------|---------|-----------|----------------|-----------|------------|
| 1 | Pandas      | 1.3.5   | read data | 0:00:01.208665 | 4         | 5          |
|   |             |         | apply     | 0:05:14.027412 |           |            |
| 2 | Polars      | 0.13.34 | read data | 0:00:00.280509 | 1         | 2          |
|   |             |         | apply     | 0:00:50.164057 |           |            |
| 3 | Modin       | 0.12.1  | read data | 0:00:03.598980 | 5         | 4          |
|   |             |         | apply     | 0:04:08.191504 |           |            |
| 4 | Pandarallel | 1.6.1   | read data | 0:00:01.165021 | 3         | 3          |
|   |             |         | apply     | 0:01:51.759348 |           |            |
| 5 | pySpark     | 3.2.1   | read data | 0:00:00.378601 | 2         | 1          |
|   |             |         | apply     | 0:00:22.682952 |           |            |

a. Read speed: Polars > pySpark >> Pandarallel > Pandas > Modin

b. Apply speed: pySpark > Polars > Pandarallel >> Modin > Pandas

c. On the apply workload, Modin and Pandarallel do not deliver the large speedups they advertise; with Pandarallel in particular, you can clearly hear the laptop fans spin up;

d. Polars is impressive, and with its support for various chart types it is a viable stand-in for Pandas; note, though, that while Polars is somewhat similar to Pandas, many of its APIs are used differently, so there is a learning curve;

e. pySpark shows its serious strength in big-data processing; compared with Pandas and Polars it is weaker at data analysis, but it integrates some machine-learning capability;

References:

List context and row-wise compute - Polars - User Guide: https://pola-rs.github.io/polars-book/user-guide/dsl/list_context.html

Scale your pandas workflow by changing a single line of code — Modin documentation: https://modin.readthedocs.io/en/stable/

Python/Pandas如何处理百亿行,数十列的数据? - 知乎

Why Python is Slow: Looking Under the Hood | Pythonic Perambulations

Scaling Pandas: Dask vs Ray vs Modin vs Vaex vs RAPIDS

Database-like ops benchmark: https://h2oai.github.io/db-benchmark/

pyspark系列--pandas和pyspark对比 - CSDN blog: https://blog.csdn.net/suzyu12345/article/details/79673483
