Spark Core: Comparing and Optimizing Common Operators

zhangjian_eng

Article link: https://blog.csdn.net/zhangjian_eng/article/details/117909052

1. Operator optimization

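groupByKey(): performs no map-side pre-aggregation, so every value is shuffled. For sum, average, and similar operations, use PairRDDFunctions.aggregateByKey or PairRDDFunctions.reduceByKey instead.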
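As a minimal sketch of the advice above, the following computes a per-key sum and average with reduceByKey and aggregateByKey, next to a groupByKey version for comparison. The object name, the sample data, and the `local[2]` master are assumptions made for illustration only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object; the names are illustrative, not from the original post.
object PreAggregationSketch {
  // Sum per key. reduceByKey combines values inside each partition before the shuffle.
  def sumByKey(pairs: RDD[(String, Int)]): Map[String, Int] =
    pairs.reduceByKey(_ + _).collect().toMap

  // Average per key. The accumulator is a (sum, count) pair built up per partition,
  // merged across partitions, and divided at the end.
  def avgByKey(pairs: RDD[(String, Int)]): Map[String, Double] =
    pairs
      .aggregateByKey((0, 0))(
        (acc, v) => (acc._1 + v, acc._2 + 1),
        (a, b) => (a._1 + b._1, a._2 + b._2))
      .mapValues { case (sum, count) => sum.toDouble / count }
      .collect().toMap

  // Same sum with groupByKey: every single value crosses the shuffle first.
  def sumByKeyGrouped(pairs: RDD[(String, Int)]): Map[String, Int] =
    pairs.groupByKey().mapValues(_.sum).collect().toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("pre-aggregation-sketch")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 4), ("a", 3), ("b", 6), ("a", 5)), 2)
      println(sumByKey(pairs))
      println(avgByKey(pairs))
      println(sumByKeyGrouped(pairs))
    } finally sc.stop()
  }
}
```

All three functions return the same sums; the difference is only in how much data is moved during the shuffle.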

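mapPartitions and map:
map: the function is called once per record; per-record overhead is higher, memory consumption is low.
mapPartitions: the function is called once per partition and receives an iterator over that partition's records, so setup cost (for example opening a connection) is paid once per partition. Memory consumption can be high, and OOM is possible, if the function materializes the whole partition instead of streaming through the iterator.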
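The difference in how often the function runs can be sketched as follows. `Resource` is a hypothetical stand-in for something expensive to create, and the accumulators only count how many times the setup code runs; neither appears in the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.util.LongAccumulator

// Hypothetical helper object; names are illustrative.
object MapPartitionsSketch {
  // Stand-in for a resource that is costly to construct (assumption for this sketch).
  class Resource { def transform(x: Int): Int = x * 2 }

  // map: the closure, and therefore the setup, runs once per record.
  def withMap(rdd: RDD[Int], setups: LongAccumulator): Array[Int] =
    rdd.map { x =>
      setups.add(1)
      new Resource().transform(x)
    }.collect()

  // mapPartitions: the closure runs once per partition.
  def withMapPartitions(rdd: RDD[Int], setups: LongAccumulator): Array[Int] =
    rdd.mapPartitions { iter =>
      setups.add(1)
      val resource = new Resource()
      iter.map(x => resource.transform(x)) // stays lazy, the partition is not buffered
    }.collect()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("map-partitions-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 6, 2)
      val perRecord = sc.longAccumulator("setups with map")
      val perPartition = sc.longAccumulator("setups with mapPartitions")
      withMap(rdd, perRecord)
      withMapPartitions(rdd, perPartition)
      println(s"map setups: ${perRecord.sum}, mapPartitions setups: ${perPartition.sum}")
    } finally sc.stop()
  }
}
```

Returning `iter.map(...)` rather than converting the iterator to a list keeps memory use close to that of map; buffering the partition is what makes mapPartitions memory-hungry.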
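checkpoint: call cache() together with checkpoint(); otherwise the RDD is computed twice, once by the action and once more by the job that writes the checkpoint files.
checkpoint() marks the RDD for checkpointing. The RDD is saved to a file inside the checkpoint directory set with SparkContext#setCheckpointDir, and all references to its parent RDDs are removed. It must be called before any job has been executed on the RDD. Persisting the RDD in memory is strongly recommended, because otherwise writing it to the file requires recomputation.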

  /**
   * Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
   * directory set with `SparkContext#setCheckpointDir` and all references to its parent
   * RDDs will be removed. This function must be called before any job has been
   * executed on this RDD. It is strongly recommended that this RDD is persisted in
   * memory, otherwise saving it on a file will require recomputation.
   */
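  @Test // local test; assumes sc is a SparkContext provided by the test class
  def checkpointTest(): Unit = {
    sc.setCheckpointDir("cp")
    val word = sc.textFile("in/wc").flatMap(_.split(" "))
    val res = word.map((_, 1)).reduceByKey(_ + _)
    res.checkpoint()
    res.cache()
    res.foreach(println) // the checkpoint is only written when an action runs
    println(res.toDebugString) // shows CachedPartitions; the in-memory cache keeps the lineage, the on-disk checkpoint drops it
  }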

persist():

 /**
   * Persist this RDD with the default storage level (`MEMORY_ONLY`).
   */
  def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

cache():

  /**
   * Persist this RDD with the default storage level (`MEMORY_ONLY`).
   */
  def cache(): this.type = persist()
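A small sketch of how the two definitions above relate: cache() yields MEMORY_ONLY, persist(level) accepts an explicit level, and unpersist() resets the level. The object name and sample data are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical helper object; names are illustrative.
object PersistSketch {
  // Returns the storage levels observed after cache(), persist(level) and unpersist().
  def levels(sc: SparkContext): (StorageLevel, StorageLevel, StorageLevel) = {
    val cached = sc.parallelize(1 to 10).cache() // same as persist(StorageLevel.MEMORY_ONLY)
    val onDisk = sc.parallelize(1 to 10).persist(StorageLevel.MEMORY_AND_DISK)
    val afterCache = cached.getStorageLevel
    val afterPersist = onDisk.getStorageLevel
    cached.unpersist() // drops cached blocks and resets the level
    onDisk.unpersist()
    (afterCache, afterPersist, cached.getStorageLevel)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("persist-sketch")
    val sc = new SparkContext(conf)
    try println(levels(sc)) finally sc.stop()
  }
}
```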

2. Operators on two RDDs

rdd1.union(rdd2)

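The result is a plain concatenation of both RDDs (duplicates are kept). The number of partitions is the sum of the two inputs' partition counts, unless both RDDs share the same partitioner.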

rdd1.join(rdd2)

Inner join: a key that is missing from either RDD is dropped.

rdd1.cogroup(rdd2)

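Full grouping: every key that appears in either RDD is kept. For two RDD[(String, Int)] inputs it returns RDD[(String, (Iterable[Int], Iterable[Int]))].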

rdd1.leftOuterJoin(rdd2)

Left outer join: every key of rdd1 is kept. For two RDD[(String, Int)] inputs it returns RDD[(String, (Int, Option[Int]))].
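The four operators above can be compared on a tiny pair of RDDs. This is a sketch under assumed sample data; the object name, the keys, and the partition counts are chosen only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object; names and data are illustrative.
object TwoRddSketch {
  def inputs(sc: SparkContext): (RDD[(String, Int)], RDD[(String, Int)]) = {
    val rdd1 = sc.parallelize(Seq(("a", 1), ("b", 2)), 2)   // 2 partitions
    val rdd2 = sc.parallelize(Seq(("a", 10), ("c", 30)), 3) // 3 partitions
    (rdd1, rdd2)
  }

  // union: number of partitions and number of records after concatenation.
  def unionShape(rdd1: RDD[(String, Int)], rdd2: RDD[(String, Int)]): (Int, Long) = {
    val u = rdd1.union(rdd2)
    (u.getNumPartitions, u.count())
  }

  // join: only keys present on both sides survive.
  def inner(rdd1: RDD[(String, Int)], rdd2: RDD[(String, Int)]): Map[String, (Int, Int)] =
    rdd1.join(rdd2).collect().toMap

  // leftOuterJoin: all keys of rdd1, with None where rdd2 has no match.
  def left(rdd1: RDD[(String, Int)], rdd2: RDD[(String, Int)]): Map[String, (Int, Option[Int])] =
    rdd1.leftOuterJoin(rdd2).collect().toMap

  // cogroup: all keys from both sides, values grouped per side.
  def grouped(rdd1: RDD[(String, Int)], rdd2: RDD[(String, Int)]): Map[String, (List[Int], List[Int])] =
    rdd1.cogroup(rdd2).mapValues { case (l, r) => (l.toList, r.toList) }.collect().toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("two-rdd-sketch")
    val sc = new SparkContext(conf)
    try {
      val (rdd1, rdd2) = inputs(sc)
      println(unionShape(rdd1, rdd2))
      println(inner(rdd1, rdd2))
      println(left(rdd1, rdd2))
      println(grouped(rdd1, rdd2))
    } finally sc.stop()
  }
}
```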

3. Operators that trigger a shuffle

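partitionBy: if the given partitioner differs from the RDD's current partitioner, a new ShuffledRDD is created; if they are equal, the RDD itself is returned and no shuffle happens.
HashPartitioner is the default partitioner Spark uses when none is specified.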

  /**
   * Return a copy of the RDD partitioned using the specified partitioner.
   */
  def partitionBy(partitioner: Partitioner): RDD[(K, V)] = self.withScope {
    if (keyClass.isArray && partitioner.isInstanceOf[HashPartitioner]) {
      throw new SparkException("HashPartitioner cannot partition array keys.")
    }
    if (self.partitioner == Some(partitioner)) {
      self
    } else {
      new ShuffledRDD[K, V, V](self, partitioner)
    }
  }

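repartition: always triggers a shuffle, because it calls coalesce with shuffle = true (coalesce on its own defaults to shuffle = false).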

  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }
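Both source excerpts in this section can be observed from the outside. The sketch below checks that partitionBy with an equal partitioner returns the same RDD, and compares repartition with a plain coalesce. The object name, data, and partition counts are assumptions for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical helper object; names are illustrative.
object ShuffleSketch {
  // True when a second partitionBy with an equal partitioner returns the very same RDD.
  def samePartitionerIsNoOp(sc: SparkContext): Boolean = {
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3), ("d", 4)), 4)
    val once = pairs.partitionBy(new HashPartitioner(2))  // shuffles into 2 partitions
    val twice = once.partitionBy(new HashPartitioner(2))  // equal partitioner
    twice eq once
  }

  // Partition counts after repartition(8), coalesce(8) and coalesce(2) on a 4-partition RDD.
  def partitionCounts(sc: SparkContext): (Int, Int, Int) = {
    val rdd = sc.parallelize(1 to 100, 4)
    (rdd.repartition(8).getNumPartitions, // shuffle, so the count can grow
     rdd.coalesce(8).getNumPartitions,    // no shuffle, so the count cannot grow
     rdd.coalesce(2).getNumPartitions)    // no shuffle, partitions are merged
  }

  // Whether the lineage of repartition / coalesce contains a ShuffledRDD.
  def lineageHasShuffle(sc: SparkContext): (Boolean, Boolean) = {
    val rdd = sc.parallelize(1 to 100, 4)
    (rdd.repartition(2).toDebugString.contains("ShuffledRDD"),
     rdd.coalesce(2).toDebugString.contains("ShuffledRDD"))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("shuffle-sketch")
    val sc = new SparkContext(conf)
    try {
      println(samePartitionerIsNoOp(sc))
      println(partitionCounts(sc))
      println(lineageHasShuffle(sc))
    } finally sc.stop()
  }
}
```

When the goal is only to reduce the number of partitions, coalesce without a shuffle is usually the cheaper choice; repartition is needed to increase the count or to rebalance skewed partitions.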

4. Common actions

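takeOrdered(n): returns the n smallest elements in sorted order; take(n) returns the first n elements without sorting.
aggregate(): the zero value is applied both within each partition and when merging across partitions; aggregateByKey applies it only within partitions, not across them.
saveAsTextFile("")
saveAsSequenceFile("")
saveAsObjectFile("")
countByKey(): counts the number of occurrences of each key.
foreach(func): runs func on every element, on the executors rather than on the driver.
foreachPartition(): runs once per partition; memory consumption can be high and may lead to OOM if the partition is buffered, the same trade-off as mapPartitions.
Each action triggers sc.runJob() at least once; the DAGScheduler submits the job and hands the resulting task sets to the TaskScheduler.
orcOurRDD: RDD[(NullWritable, Writable)] can be saved to HDFS with: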
orcOurRDD.saveAsNewAPIHadoopFile(outPath,classOf[NullWritable],classOf[Writable],classOf[OrcNewOutputFormat],hadoopConf)
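The zero-value difference between aggregate() and aggregateByKey noted in the list above is easiest to see with a non-neutral zero value. The following is a sketch with assumed sample data (zero value 10, two partitions); takeOrdered and countByKey are included for comparison. The object name is hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object; names and data are illustrative.
object ActionSketch {
  // aggregate: the zero value is used once per partition and once more for the final merge.
  // Partitions [1, 2] and [3, 4]: 10 + (10 + 1 + 2) + (10 + 3 + 4) = 40
  def aggregateTotal(sc: SparkContext): Int =
    sc.parallelize(1 to 4, 2).aggregate(10)(_ + _, _ + _)

  // aggregateByKey: the zero value is used once per key per partition, not in the merge.
  // Key "a" in both partitions: (10 + 1 + 2) + (10 + 3 + 4) = 30
  def aggregateByKeyTotal(sc: SparkContext): Map[String, Int] =
    sc.parallelize(Seq(("a", 1), ("a", 2), ("a", 3), ("a", 4)), 2)
      .aggregateByKey(10)(_ + _, _ + _)
      .collect().toMap

  // takeOrdered sorts before taking, so it returns the two smallest elements.
  def smallestTwo(sc: SparkContext): Array[Int] =
    sc.parallelize(Seq(5, 1, 4, 2), 2).takeOrdered(2)

  // countByKey returns the number of records per key to the driver.
  def keyCounts(sc: SparkContext): scala.collection.Map[String, Long] =
    sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)), 2).countByKey()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("action-sketch")
    val sc = new SparkContext(conf)
    try {
      println(aggregateTotal(sc))
      println(aggregateByKeyTotal(sc))
      println(smallestTwo(sc).mkString(", "))
      println(keyCounts(sc))
    } finally sc.stop()
  }
}
```

In practice the zero value should be the neutral element of the operation (0 for a sum), which makes the result independent of the number of partitions.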