sparksql: visualizing group distributions (histogram)

炼丹师666

Categories: data processing, spark

Article link: https://blog.csdn.net/wj1298250240/article/details/103947070

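This article shows how to visualize the distribution of the 'balance' field in a large dataset with Spark SQL by aggregating the data first and then drawing a histogram. The RDD flatMap and histogram functions produce the histogram data, and matplotlib then draws it as a bar chart with a title and a suitable bar width.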

Reference:
https://blog.csdn.net/weixin_39599711/article/details/79072691

# With millions of rows, the second method is clearly not viable, so the data has to be aggregated first.
hists = fraud_df.select('balance').rdd.flatMap(lambda row: row).histogram(20)
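To make the shape of `hists` concrete, here is a minimal pure-Python sketch of what `histogram(20)` returns when given an integer: a pair of (bucket edges, counts), with evenly spaced buckets between the minimum and maximum and the last bucket closed on the right. The function name `even_histogram` is hypothetical, and this is an illustration of the return shape rather than Spark's implementation.

```python
def even_histogram(values, buckets):
    """Hypothetical stand-in that mimics the return shape of RDD.histogram(int)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # Assumption for this sketch: the data is not constant.
        raise ValueError("values must span a non-empty range")
    step = (hi - lo) / buckets
    # buckets + 1 edges: the left edge of every bucket plus the final right edge.
    edges = [lo + i * step for i in range(buckets)] + [hi]
    counts = [0] * buckets
    for v in values:
        # Buckets are [a, b) except the last one, which also includes the maximum.
        idx = min(int((v - lo) / step), buckets - 1)
        counts[idx] += 1
    return edges, counts


edges, counts = even_histogram([0, 1, 2, 3, 4, 5, 6, 7, 8, 10], 5)
# There is always one more edge than there are counts,
# which is why the plotting code drops the last edge with [:-1].
```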
To plot the histogram, you can simply call matplotlib as shown below.

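import matplotlib.pyplot as plt

# Plot: hists is (bucket_edges, counts); drop the last edge so each count has one left edge.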
data = {
    'bins': hists[0][:-1],
    'freq': hists[1]
}

fig = plt.figure(figsize=(12,9))
ax = fig.add_subplot(1, 1, 1)
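# width is in x-axis ('balance') units; 2000 is hard-coded and should match the bucket width of your data.
ax.bar(data['bins'], data['freq'], width=2000)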
ax.set_title('Histogram of \'balance\'')

plt.savefig('B05793_05_22.png', dpi=300)
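The plotting snippet above hard-codes `width=2000`, which only fits data whose buckets happen to be 2000 units wide. One possible alternative, sketched here under the assumption that `hists` has the (edges, counts) shape described earlier, is to derive each bar's width from the bucket edges and left-align the bars with `align="edge"`. The helper names `bar_geometry` and `plot_histogram` and the output file name are hypothetical.

```python
def bar_geometry(edges):
    """Return the left edge and width of every bucket."""
    lefts = list(edges[:-1])
    widths = [right - left for left, right in zip(edges[:-1], edges[1:])]
    return lefts, widths


def plot_histogram(edges, counts, title, path):
    """Draw pre-aggregated histogram data as a bar chart and save it to a file."""
    # Imported lazily so bar_geometry can be used without matplotlib installed.
    import matplotlib.pyplot as plt

    lefts, widths = bar_geometry(edges)
    fig, ax = plt.subplots(figsize=(12, 9))
    # align="edge" makes each bar start at its bucket's left edge
    # instead of being centered on it.
    ax.bar(lefts, counts, width=widths, align="edge")
    ax.set_title(title)
    fig.savefig(path, dpi=300)
    plt.close(fig)


# Example call with the Spark result from above (not executed here):
# plot_histogram(hists[0], hists[1], "Histogram of 'balance'", "balance_hist.png")
```

Only the 20 bucket counts ever reach the driver, so this stays cheap no matter how many rows `fraud_df` holds.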