The difference between coalesce and repartition in Spark

Latest recommended article published on 2023-04-18 09:05:52

左林右李02

Category: spark

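Copyright notice: This is an original article by the blogger, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.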

Article link: https://blog.csdn.net/u011624157/article/details/104947192

Looking at the Spark source code:

  /**
   * Returns a new Dataset that has exactly `numPartitions` partitions, when the fewer partitions
   * are requested. If a larger number of partitions is requested, it will stay at the current
   * number of partitions. Similar to coalesce defined on an `RDD`, this operation results in
   * a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not
   * be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions.
   *
   * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
   * this may result in your computation taking place on fewer nodes than
   * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
   * you can call repartition. This will add a shuffle step, but means the
   * current upstream partitions will be executed in parallel (per whatever
   * the current partitioning is).
   *
   * @group typedrel
   * @since 1.6.0
   */
  def coalesce(numPartitions: Int): Dataset[T] = withTypedPlan {
    Repartition(numPartitions, shuffle = false, logicalPlan)
  }

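1. coalesce can only reduce the number of partitions. It merges existing partitions to keep data movement to a minimum; because the dependency is narrow, no shuffle takes place.
2. repartition can either increase or decrease the number of partitions. It creates new partitions and performs a full shuffle.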
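
The two points above can be checked with a small experiment. The following is a minimal sketch, assuming a local Spark installation; the object name `CoalesceVsRepartition`, the helper `partitionCounts`, and the partition numbers (8, 2, 16) are illustrative choices and do not come from the original article.

```scala
import org.apache.spark.sql.SparkSession

object CoalesceVsRepartition {
  // Hypothetical helper: returns the partition count observed after each operation.
  def partitionCounts(spark: SparkSession): Seq[(String, Int)] = {
    // A Dataset created with 8 partitions (the numbers are illustrative).
    val ds = spark.range(0L, 1000L, 1L, 8)
    Seq(
      "original"        -> ds.rdd.getNumPartitions,
      // Fewer partitions requested: coalesce merges existing partitions.
      "coalesce(2)"     -> ds.coalesce(2).rdd.getNumPartitions,
      // More partitions requested: coalesce stays at the current number.
      "coalesce(16)"    -> ds.coalesce(16).rdd.getNumPartitions,
      // repartition shuffles, so it can go up as well as down.
      "repartition(16)" -> ds.repartition(16).rdd.getNumPartitions,
      "repartition(2)"  -> ds.repartition(2).rdd.getNumPartitions
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("coalesce-vs-repartition")
      .master("local[4]") // local mode, for experimentation only
      .getOrCreate()
    try {
      partitionCounts(spark).foreach { case (op, n) =>
        println(s"$op -> $n partitions")
      }
    } finally {
      spark.stop()
    }
  }
}
```

Going by the Scaladoc quoted above, `coalesce(16)` on an 8-partition Dataset should still report 8 partitions, while `repartition(16)` should report 16.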

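When reducing the number of partitions, whether to use coalesce or repartition depends on the situation. If the partition count has to shrink drastically (for example, down to a single partition), use repartition: it adds a shuffle, but the upstream computation still runs in parallel on the existing partitions.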
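
To see the trade-off for a drastic reduction, one option is to compare the physical plans of the two calls. This is a sketch under stated assumptions: the object name `DrasticCoalesce`, the helper `upstream`, and the `id * 2` projection are hypothetical stand-ins for real upstream work, and the exact text printed by `explain()` varies between Spark versions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DrasticCoalesce {
  // Hypothetical upstream stage: 8 partitions, each applying a projection.
  def upstream(spark: SparkSession): DataFrame =
    spark.range(0L, 100000L, 1L, 8).selectExpr("id * 2 AS doubled")

  // Narrow dependency: no shuffle, so the upstream work is pulled
  // into the single remaining partition.
  def narrowed(spark: SparkSession): DataFrame = upstream(spark).coalesce(1)

  // Adds a shuffle: the upstream work keeps its 8-way parallelism
  // and the rows are then moved into one partition.
  def shuffled(spark: SparkSession): DataFrame = upstream(spark).repartition(1)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("drastic-coalesce")
      .master("local[4]") // local mode, for experimentation only
      .getOrCreate()
    try {
      // The coalesce plan is expected to contain a Coalesce node and no Exchange.
      narrowed(spark).explain()
      // The repartition plan is expected to contain an Exchange (shuffle) node.
      shuffled(spark).explain()
      // Both variants end with one partition and the same rows.
      println(narrowed(spark).rdd.getNumPartitions)
      println(shuffled(spark).rdd.getNumPartitions)
    } finally {
      spark.stop()
    }
  }
}
```

Both variants produce the same data in one partition; the difference lies in where the upstream projection executes, which is why the Scaladoc recommends repartition when coalesce would leave the computation on too few nodes.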
