Scala Data Structure Notes

luohaifang

Article link: https://blog.csdn.net/luohaifang/article/details/107780405

coalesce and repartition

First, the source code:

  def coalesce(numPartitions: Int): Dataset[T] = withTypedPlan {
    Repartition(numPartitions, shuffle = false, logicalPlan)
  }

  def repartition(numPartitions: Int): Dataset[T] = withTypedPlan {
    Repartition(numPartitions, shuffle = true, logicalPlan)
  }

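The only difference is the shuffle flag, that is, whether the data is redistributed across executors, which costs network I/O. I once ran into a problem with this: a job read a large amount of data (a few hundred GB by my estimate), ran it through a lot of computation logic, and finally had to write the result to a single CSV file. The write was done like this: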

scvTB.coalesce(1).write.mode(SaveMode.Overwrite).option("header","true").csv("/test/")

The job was submitted with these parameters:

--num-executors 3 --executor-memory 2G --executor-cores 2 --driver-memory 1G --conf spark.default.parallelism=12 --conf spark.storage.memoryFraction=0.5 --conf spark.shuffle.memoryFraction=0.3

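Every run died when it reached the final write. The cause turned out to be that data read from many partitions was all funneled into a single partition, which could not cope, so the job failed. Because coalesce(1) adds no shuffle boundary, the upstream computation is also pulled into that single task.
Reading the code again recently, I noticed this issue once more. A fixed number of output files can also be produced with repartition, and for large outputs I personally think a shuffle should be used, to relieve the excessive pressure on individual partitions.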

scvTB.repartition(1).write.mode(SaveMode.Overwrite)
      .option("delimiter", ",")
      .option("quote", "")
      .option("nullValue", "\\N")
      .option("ignoreLeadingWhiteSpace", false)
      .option("ignoreTrailingWhiteSpace", false)
      .csv(filePath)
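
The claim that a shuffle relieves pressure on individual partitions can be illustrated with a toy model in plain Scala. This is only a sketch under stated assumptions: partitions are modeled as in-memory vectors, and `PartitionModel`, `coalesceLike` and `repartitionLike` are hypothetical names, not Spark APIs. Spark's real coalesce and repartition are more involved (locality-aware grouping, shuffle files on disk), but the basic contrast is the same: merging whole partitions keeps any skew, while redistributing records evens it out.

```scala
// Toy model only: partitions are plain in-memory vectors, not Spark partitions.
object PartitionModel {
  type Partition = Vector[Int]

  // coalesce-like: merge whole neighbouring partitions into n groups.
  // No record leaves the group its partition belongs to, so no shuffle is
  // needed, but skew in the input carries over to the output.
  def coalesceLike(parts: Vector[Partition], n: Int): Vector[Partition] = {
    require(n > 0 && n <= parts.size, "toy model only shrinks the partition count")
    parts.zipWithIndex
      .groupBy { case (_, i) => i * n / parts.size }
      .toVector
      .sortBy { case (group, _) => group }
      .map { case (_, members) => members.flatMap { case (p, _) => p } }
  }

  // repartition-like: deal every record round-robin over n targets.
  // Any record may move, which is what a shuffle pays for, and the
  // resulting partitions are balanced.
  def repartitionLike(parts: Vector[Partition], n: Int): Vector[Partition] = {
    require(n > 0, "need at least one target partition")
    val indexed = parts.flatten.zipWithIndex
    Vector.tabulate(n)(target => indexed.collect { case (rec, i) if i % n == target => rec })
  }

  def main(args: Array[String]): Unit = {
    // One heavy partition (97 records) and three tiny ones.
    val skewed: Vector[Partition] = Vector(Vector.fill(97)(0), Vector(1), Vector(2), Vector(3))
    println(coalesceLike(skewed, 2).map(_.size))    // Vector(98, 2)
    println(repartitionLike(skewed, 2).map(_.size)) // Vector(50, 50)
  }
}
```

With a target of 1 both variants end in a single partition, so in that case the benefit of the shuffle lies in the upstream stage still running in parallel rather than in balance.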

I also read some blog posts online, such as https://www.cnblogs.com/jiangxiaoxian/p/9539760.html, and things seemed to click.

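Table T holds 10 GB of data in 100 partitions, and the resources are --executor-memory 2g --executor-cores 2 --num-executors 5. We want a single result file.
1) With coalesce: sql("select * from T").coalesce(1)
Four of the 5 executors sit idle and only 1 actually reads data and does work, so efficiency is extremely low. coalesce should therefore be used with care; it can also cause OOM problems, which is a topic for another time.
2) With repartition: sql("select * from T").repartition(1)
This is much more efficient: all 5 executors run in parallel (10 tasks at a time), the data is then shuffled to a single node, and finally written to one file.
What if I no longer want one file but 10 files? Does coalesce then become more efficient than repartition? (coalesce has no shuffle, so it is as if each task on the executors claims 10 of the input partitions.)
And what if I want some number other than 10? In fact, once the number of files to produce exceeds executors x vcores, coalesce is usually more efficient (a rule of thumb, not an absolute).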
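
The arithmetic behind this rule of thumb can be sketched in plain Scala. This is a simplified model, not Spark's scheduler: `ParallelismModel`, `Cluster` and `readParallelism` are hypothetical names, and the model assumes only that without a shuffle the read stage has as many tasks as output partitions, while with a shuffle it keeps one task per input partition. Locality, skew and the cost of the shuffle itself are ignored.

```scala
// Simplified model of how many read tasks can run at once; not Spark code.
object ParallelismModel {
  final case class Cluster(executors: Int, coresPerExecutor: Int) {
    // Number of tasks the cluster can run at the same time.
    def slots: Int = executors * coresPerExecutor
  }

  // Parallelism of the stage that reads the table.
  // shuffle = false models coalesce(targetFiles): the read stage shrinks to targetFiles tasks.
  // shuffle = true models repartition(targetFiles): the read stage keeps inputPartitions tasks.
  def readParallelism(inputPartitions: Int, targetFiles: Int, shuffle: Boolean, c: Cluster): Int = {
    val readTasks = if (shuffle) inputPartitions else math.min(inputPartitions, targetFiles)
    math.min(readTasks, c.slots)
  }

  def main(args: Array[String]): Unit = {
    val cluster = Cluster(executors = 5, coresPerExecutor = 2) // 10 slots, as in the example
    for (files <- Seq(1, 10, 20)) {
      val viaCoalesce = readParallelism(100, files, shuffle = false, cluster)
      val viaRepartition = readParallelism(100, files, shuffle = true, cluster)
      println(s"files=$files coalesce=$viaCoalesce repartition=$viaRepartition")
    }
  }
}
```

In this model coalesce(1) leaves 9 of the 10 slots idle, while from 10 files upward both variants keep every slot busy, so the shuffle that repartition adds buys no extra read parallelism.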

close and stop

Again, the source code first:

  /**
   * Stop the underlying `SparkContext`.
   *
   * @since 2.0.0
   */
  def stop(): Unit = {
    sparkContext.stop()
  }

  /**
   * Synonym for `stop()`.
   *
   * @since 2.1.0
   */
  override def close(): Unit = stop()

The close method simply calls stop; it is literally a "synonym" for stop.
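
One practical consequence of `close()` being an override is that the session can be handled like any other closeable resource. The sketch below shows the usual loan pattern with try/finally in plain Scala. `DemoSession` is a hypothetical stand-in that mirrors the stop/close pair from the source above so the example runs without Spark on the classpath, and `withResource` is an illustrative helper, not a Spark or standard-library function.

```scala
object CloseDemo {
  // Hypothetical stand-in for SparkSession: close() delegates to stop().
  final class DemoSession extends java.io.Closeable {
    private var stopped = false
    def isStopped: Boolean = stopped
    def stop(): Unit = { stopped = true }
    override def close(): Unit = stop()
  }

  // Loan pattern: run body, then always close the resource, even if body throws.
  def withResource[R <: AutoCloseable, A](resource: R)(body: R => A): A =
    try body(resource)
    finally resource.close()

  def main(args: Array[String]): Unit = {
    val session = new DemoSession
    val result = withResource(session)(_ => 1 + 1)
    println(s"result=$result, stopped=${session.isStopped}") // result=2, stopped=true
  }
}
```

Assuming a real SparkSession is available, it could be passed to the same helper in place of `DemoSession`, since it is closeable in the same way.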
