1. Operator optimization
groupByKey()
: No map-side pre-aggregation, so every record is shuffled. For sum, average, and similar aggregations, use PairRDDFunctions.aggregateByKey or PairRDDFunctions.reduceByKey instead.
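As a minimal sketch of the replacement described above, the following computes a per-key sum with reduceByKey and a per-key average with aggregateByKey. The object name AggSketch, the sample data, and the local[2] master are assumptions for illustration, and spark-core is assumed to be on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AggSketch {
  // Assumption: local mode; reuses an existing context if one is already running.
  def sc: SparkContext = SparkContext.getOrCreate(
    new SparkConf().setMaster("local[2]").setAppName("agg-sketch"))

  def scores = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 4)), 2)

  // Sum per key: values are combined inside each partition before the shuffle.
  def sums(): Map[String, Int] =
    scores.reduceByKey(_ + _).collect().toMap

  // Average per key: carry (sum, count) as the accumulator, divide at the end.
  def averages(): Map[String, Double] =
    scores
      .aggregateByKey((0, 0))(
        (acc, v) => (acc._1 + v, acc._2 + 1),   // fold a value into the accumulator
        (x, y) => (x._1 + y._1, x._2 + y._2))   // merge accumulators across partitions
      .mapValues { case (sum, count) => sum.toDouble / count }
      .collect().toMap
}
```

With this sample data, sums() should give a -> 4 and b -> 4, and averages() should give a -> 2.0 and b -> 4.0.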
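mapPartitions vs. map:
map
: The function is called once per record. Per-record call and setup overhead is high, but only one record is held at a time, so memory use is low.
mapPartitions
: The function is called once per partition and receives an iterator over all of that partition's records. Setup cost is paid once per partition, so overhead is low, but memory use is high if the whole partition is materialized.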
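The trade-off above can be sketched as follows: an object that is costly to build is created once per partition inside mapPartitions, and the iterator is transformed lazily so the partition is never held in memory all at once. The DecimalFormat here is only a stand-in for an expensive resource, and the object name and local[2] master are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MapPartitionsSketch {
  // Assumption: local mode; reuses an existing context if one is already running.
  def sc: SparkContext = SparkContext.getOrCreate(
    new SparkConf().setMaster("local[2]").setAppName("map-partitions-sketch"))

  def formatted(): Array[String] =
    sc.parallelize(1 to 6, 2).mapPartitions { iter =>
      // Built once per partition, not once per record.
      val fmt = new java.text.DecimalFormat("000")
      // iter.map is lazy: records stream through one at a time.
      // Calling iter.toList here instead would load the whole partition into memory.
      iter.map(n => fmt.format(n.toLong))
    }.collect()
}
```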
checkpoint
: Pair checkpoint with cache. Otherwise the RDD is computed more than once: first by the job that the action triggers, then again by the separate job that writes the checkpoint files.
The Scaladoc on RDD.checkpoint, quoted from the Spark source, states the contract:
/**
* Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
* directory set with `SparkContext#setCheckpointDir` and all references to its parent
* RDDs will be removed. This function must be called before any job has been
* executed on this RDD. It is strongly recommended that this RDD is persisted in
* memory, otherwise saving it on a file will require recomputation.
*/
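@Test // local test; assumes `sc` is a SparkContext field of the enclosing test class
def checkpointTest(): Unit = {
  sc.setCheckpointDir("cp")
  val word = sc.textFile("in/wc").flatMap(_.split(" "))
  val res = word.map((_, 1)).reduceByKey(_ + _)
  res.checkpoint() // only marks the RDD; nothing is written yet
  res.cache()      // lets the checkpoint job reuse the computed partitions
  res.foreach(println) // checkpointing is only triggered by an action
  println(res.toDebugString) // shows CachedPartitions; cache keeps the lineage, the on-disk checkpoint drops it
}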
persist()
:
/**
* Persist this RDD with the default storage level (`MEMORY_ONLY`).
*/
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
cache()
:
/**
* Persist this RDD with the default storage level (`MEMORY_ONLY`).
*/
def cache(): this.type = persist()
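Since cache() is just persist() with MEMORY_ONLY, the only reason to call persist directly is to pick a different storage level. A small sketch, with an assumed object name and local[2] master:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Assumption: local mode; reuses an existing context if one is already running.
  def sc: SparkContext = SparkContext.getOrCreate(
    new SparkConf().setMaster("local[2]").setAppName("persist-sketch"))

  def levels(): (StorageLevel, StorageLevel) = {
    // cache() marks the RDD with the default level, MEMORY_ONLY.
    val cached = sc.parallelize(1 to 3).cache()
    // persist(level) lets partitions that do not fit in memory spill to disk.
    val spilled = sc.parallelize(1 to 3).persist(StorageLevel.MEMORY_AND_DISK)
    (cached.getStorageLevel, spilled.getStorageLevel)
  }
}
```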
2. Operators on two RDDs
rdd1.union(rdd2)
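Union of the two RDDs as a plain concatenation, with no deduplication. When the inputs do not share a partitioner, the partition count is the sum of the two inputs' partition counts.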
rdd1.join(rdd2)
Inner join: a key that is missing from either RDD is dropped from the result.
rdd1.cogroup(rdd2)
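Full outer grouping: every key present in either RDD appears once. For RDD[(String, Int)] inputs it returns RDD[(String, (Iterable[Int], Iterable[Int]))]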
rdd1.leftOuterJoin(rdd2)
Left outer join: every key of rdd1 is kept. For RDD[(String, Int)] inputs it returns RDD[(String, (Int, Option[Int]))]
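The four operators in this section can be compared side by side on two tiny pair RDDs. This is a sketch with made-up data; the object name, the partition counts (2 and 3), and the local[2] master are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TwoRddSketch {
  // Assumption: local mode; reuses an existing context if one is already running.
  def sc: SparkContext = SparkContext.getOrCreate(
    new SparkConf().setMaster("local[2]").setAppName("two-rdd-sketch"))

  def left = sc.parallelize(Seq(("a", 1), ("b", 2)), 2)    // 2 partitions
  def right = sc.parallelize(Seq(("a", 10), ("c", 30)), 3) // 3 partitions

  // union: plain concatenation, partition counts add up.
  def unionPartitions(): Int = left.union(right).getNumPartitions

  // join: only keys present on both sides survive.
  def joined(): Map[String, (Int, Int)] = left.join(right).collect().toMap

  // leftOuterJoin: every left key survives; a missing right value is None.
  def leftJoined(): Map[String, (Int, Option[Int])] =
    left.leftOuterJoin(right).collect().toMap

  // cogroup: every key from either side, with the values of each side grouped.
  def cogrouped(): Map[String, (List[Int], List[Int])] =
    left.cogroup(right)
      .mapValues { case (l, r) => (l.toList, r.toList) }
      .collect().toMap
}
```

Here only "a" appears in the join result, "b" shows up in leftOuterJoin with None, and "c" appears only in cogroup with an empty left side.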
3. Operators that trigger a shuffle
partitionBy
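: Creates a new ShuffledRDD unless the RDD's current partitioner already equals the one passed in. The check compares partitioners, not just partition counts.
HashPartitioner is the usual choice, and it is what the by-key operators fall back to when no partitioner is given.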
/**
* Return a copy of the RDD partitioned using the specified partitioner.
*/
def partitionBy(partitioner: Partitioner): RDD[(K, V)] = self.withScope {
if (keyClass.isArray && partitioner.isInstanceOf[HashPartitioner]) {
throw new SparkException("HashPartitioner cannot partition array keys.")
}
if (self.partitioner == Some(partitioner)) {
self
} else {
new ShuffledRDD[K, V, V](self, partitioner)
}
}
repartition always triggers a shuffle, because it delegates to coalesce with shuffle = true:
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
coalesce(numPartitions, shuffle = true)
}
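Both source excerpts above can be checked from the outside. The sketch below assumes a local[2] master and an invented object name; it relies on HashPartitioner instances with the same partition count comparing equal, and on toDebugString listing the RDD class names in the lineage.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object ShuffleSketch {
  // Assumption: local mode; reuses an existing context if one is already running.
  def sc: SparkContext = SparkContext.getOrCreate(
    new SparkConf().setMaster("local[2]").setAppName("shuffle-sketch"))

  def pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)), 2)

  // partitionBy hands back the very same RDD when the partitioner is already equal.
  def samePartitionerIsNoOp(): Boolean = {
    val once = pairs.partitionBy(new HashPartitioner(4))
    val twice = once.partitionBy(new HashPartitioner(4))
    twice eq once
  }

  // repartition puts a ShuffledRDD in the lineage; coalesce without shuffle does not.
  def lineageHasShuffle(useRepartition: Boolean): Boolean = {
    val out = if (useRepartition) pairs.repartition(2) else pairs.coalesce(1)
    out.toDebugString.contains("ShuffledRDD")
  }
}
```

When the goal is only to reduce the number of partitions, coalesce(n) without shuffle avoids the shuffle that repartition(n) always pays for.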
4. Common actions
takeOrdered()
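: Returns the first n elements in sorted order (ascending by default), whereas take(n) returns the first n elements without sorting.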
aggregate()
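: The zero value is applied both within each partition and in the merge across partitions. aggregateByKey applies it only within partitions, not in the cross-partition merge.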
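The difference in how the zero value is applied shows up as soon as it is not neutral. A sketch with a deliberately non-zero initial value of 10 and two partitions (object name, data, and local[2] master are assumptions):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ZeroValueSketch {
  // Assumption: local mode; reuses an existing context if one is already running.
  def sc: SparkContext = SparkContext.getOrCreate(
    new SparkConf().setMaster("local[2]").setAppName("zero-value-sketch"))

  // aggregate: 10 is added once per partition (2x) and once more in the final merge.
  // (10 + 1 + 2) + (10 + 3 + 4) + 10 = 40
  def aggregateTotal(): Int =
    sc.parallelize(Seq(1, 2, 3, 4), 2).aggregate(10)(_ + _, _ + _)

  // aggregateByKey: 10 is added once per key per partition, not in the merge.
  // (10 + 1 + 2) + (10 + 3 + 4) = 30
  def byKeyTotal(): Map[String, Int] =
    sc.parallelize(Seq(("a", 1), ("a", 2), ("a", 3), ("a", 4)), 2)
      .aggregateByKey(10)(_ + _, _ + _)
      .collect().toMap
}
```

In practice the zero value should be the identity of the operation (0 for sums, an empty collection for appends) so that the result does not depend on the number of partitions.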
saveAsTextFile("")
saveAsSequenceFile("")
saveAsObjectFile("")
countByKey()
: Counts how many times each key occurs and returns the result to the driver as a Map.
foreach(func)
: Applies func to every element. The function runs on the executors, not on the driver.
foreachPartition()
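: Runs once per partition. Memory use can be high and may lead to OOM if the partition is materialized; the trade-off is the same as for mapPartitions.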
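A typical use of foreachPartition is to open one external resource per partition rather than one per record. The sketch below uses two accumulators as stand-ins: one counts how often the hypothetical per-partition setup runs, the other counts records processed. The object name, accumulator names, and local[2] master are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ForeachPartitionSketch {
  // Assumption: local mode; reuses an existing context if one is already running.
  def sc: SparkContext = SparkContext.getOrCreate(
    new SparkConf().setMaster("local[2]").setAppName("foreach-partition-sketch"))

  def counts(): (Long, Long) = {
    val opened = sc.longAccumulator("setups")    // stand-in for "connections opened"
    val written = sc.longAccumulator("records")  // stand-in for "records written"
    sc.parallelize(1 to 10, 2).foreachPartition { iter =>
      opened.add(1)                      // per-partition setup happens once here
      iter.foreach(_ => written.add(1))  // records stream through the iterator
    }
    (opened.sum, written.sum)
  }
}
```

With 10 records in 2 partitions, the setup runs twice while all 10 records are processed.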
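Each action triggers at least one sc.runJob(), which hands the job to the DAGScheduler; the DAGScheduler splits it into stages and submits their task sets to the TaskScheduler.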
orcOurRDD: RDD[(NullWritable, Writable)]
Saving the RDD to HDFS:
orcOurRDD.saveAsNewAPIHadoopFile(outPath, classOf[NullWritable], classOf[Writable], classOf[OrcNewOutputFormat], hadoopConf)