Spark
GScallion
Spark: Summary of Technical Points
I. Solutions for OOM
1. Driver runs out of memory
   1) Input data too large: increase driver memory (--driver-memory)
   2) Data pulled back to the driver: collect returns the full dataset to the driver; use foreach instead
2. Executor runs out of memory
   1) map-type operations (map, flatMap, filter, mapPartitions, etc.) produce large amounts of data
      - use repartition to reduce the amount of data each task computes, and hence each task's output
      - reduce intermediate output: replace several chained m… (excerpt truncated)
Posted 2022-04-25 11:12:02
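The collect-vs-foreach point above can be sketched as follows. This is a minimal illustration written for this summary (the object name is hypothetical), not code from the post; it assumes a standard Spark 2.x environment:

```scala
import org.apache.spark.sql.SparkSession

object OomAvoidanceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("oom-sketch").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 1000000)

    // Risky: collect() materializes the entire RDD in the driver's heap,
    // which can OOM the driver on a large dataset.
    // val all = rdd.collect()

    // Safer: foreach runs on the executors; nothing beyond task status
    // is shipped back to the driver.
    rdd.foreach(x => if (x % 250000 == 0) println(x))

    spark.stop()
  }
}
```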
Spark: coalesce / repartition Source Analysis
Spark version: 2.4.0. Source location: org/apache/spark/rdd/RDD.scala. Example:
scala> val x = (1 to 10).toList
x: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> val df1 = x.toDF("number")
df1: org.apache.spark.sql.DataFrame = [number: int]
scala> df1.rdd.partitions.siz… (excerpt truncated)
Posted 2021-02-01 15:46:38
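A quick spark-shell sketch of the behavioral difference the post analyzes (assuming an `sc` from spark-shell, Spark 2.4): coalesce with the default `shuffle = false` can only narrow partitions, while repartition(n) is simply coalesce(n, shuffle = true).

```scala
val rdd = sc.parallelize(1 to 10, 4)

rdd.coalesce(2).partitions.size                 // 2: narrowed without a shuffle
rdd.coalesce(8).partitions.size                 // stays 4: cannot grow without a shuffle
rdd.coalesce(8, shuffle = true).partitions.size // 8: shuffle allows growing
rdd.repartition(8).partitions.size              // 8: repartition(n) == coalesce(n, shuffle = true)
```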
Spark: accumulator Source Analysis
Spark version: 2.4.0. Source location: org/apache/spark/util/AccumulatorV2.scala. When a long-valued accumulator is needed, Spark provides the following subclass of AccumulatorV2:
/**
 * An [[AccumulatorV2 accumulator]] for computing sum, count, and average of 64-bit integers.
 *
 * @since 2.0.0
 */
class LongAccumulator extends AccumulatorV2[jl… (excerpt truncated)
Posted 2021-01-28 16:54:26
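A minimal usage sketch of LongAccumulator, assuming an `sc` from spark-shell (Spark 2.x); `sc.longAccumulator` registers an instance of the class shown above:

```scala
val acc = sc.longAccumulator("evens")   // registers a named LongAccumulator
sc.parallelize(1 to 100).foreach { x =>
  if (x % 2 == 0) acc.add(1L)           // add() runs on the executors
}
println(acc.value)                      // driver-side read of the merged sum: 50
println(acc.count)                      // LongAccumulator also tracks count and avg
```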
Spark: broadcast Source Analysis
Spark version: 2.4.0. Source location: org/apache/spark/SparkContext.scala. Example:
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
scala> broadcastVar.value
res0: Array[Int] = Array(1,… (excerpt truncated)
Posted 2021-01-28 15:16:57
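A typical use of broadcast beyond the shell example above is shipping a small lookup table to every executor once, instead of capturing it in each task's closure. A sketch (assuming an `sc` from spark-shell; the variable names are illustrative):

```scala
val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b"))  // shipped once per executor

val data = sc.parallelize(Seq(1, 2, 1))
val joined = data.map(k => (k, lookup.value.getOrElse(k, "?"))).collect()
// joined contains (1,"a"), (2,"b"), (1,"a")

lookup.unpersist()  // release broadcast blocks on the executors when done
```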
Spark: sortByKey Source Analysis
Spark version: 2.4.0. Source location: org.apache.spark.rdd.OrderedRDDFunctions. Source excerpt:
/**
 * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
 * `collect` or `save` on the resulting RDD will return or output an ordered list of… (excerpt truncated)
Posted 2021-01-27 16:42:40
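A short usage sketch of the method documented above, assuming an `sc` from spark-shell:

```scala
val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))

// Each output partition holds a sorted, non-overlapping key range,
// so collect() returns a globally ordered result.
pairs.sortByKey().collect()                        // (1,a), (2,b), (3,c)
pairs.sortByKey(ascending = false).keys.collect()  // 3, 2, 1
```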
Spark: groupByKey Source Analysis
Spark version: 2.4.0. Code location: org.apache.spark.rdd.PairRDDFunctions. Signatures:
groupByKey(): RDD[(K, Iterable[V])]
groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
Example:
val source: RDD[(Int, Int)] = sc.parallelize(Seq((1, 1), (1, 2), (2, 2), (2, 3)))
val groupByKeyRDD: RD… (excerpt truncated)
Posted 2021-01-26 18:40:33
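Completing the example above with a hypothetical continuation (assuming an `sc` from spark-shell); note that groupByKey shuffles every value, so when the goal is a per-key aggregate, reduceByKey or aggregateByKey is usually cheaper:

```scala
val source = sc.parallelize(Seq((1, 1), (1, 2), (2, 2), (2, 3)))

val grouped = source.groupByKey()              // RDD[(Int, Iterable[Int])]
grouped.mapValues(_.sum).collect().toMap       // Map(1 -> 3, 2 -> 5)
```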
Spark: foldByKey Source Analysis
Spark version: 2.4.0. Code location: org.apache.spark.rdd.PairRDDFunctions. Signatures:
foldByKey(zeroValue: V, numPartitions: Int)(func: (V, V) => V): RDD[(K, V)]
foldByKey(zeroValue: V)(func: (V, V) => V): RDD[(K, V)]
Example:
object FoldByKeyDemo { def main(args: Array[String]):… (excerpt truncated)
Posted 2021-01-26 17:59:03
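A one-line sketch of the signatures listed above (assuming an `sc` from spark-shell): foldByKey behaves like reduceByKey with an explicit zero value that seeds the fold in each partition.

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 5)))

// Zero value 0 for Int addition; the zero may be applied once per partition,
// so it must be a neutral element for func.
pairs.foldByKey(0)(_ + _).collect().toMap  // Map(a -> 4, b -> 5)
```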
Spark: aggregateByKey Source Analysis
Spark version: 2.4.0. Code location: org.apache.spark.rdd.PairRDDFunctions. The two public overloads both delegate to a third method for the actual computation. Method 1:
/**
 * Aggregate the values of each key, using given combine functions and a neutral "zero value".
 * This function can return a different result type, U, than the ty… (excerpt truncated)
Posted 2021-01-26 15:56:26
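A sketch of the "different result type U" point from the scaladoc above (assuming an `sc` from spark-shell): values are Int, but the per-key result is a (sum, count) pair.

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 5)))

val sumCount = pairs.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),   // seqOp: fold a value into a partition-local accumulator
  (a, b)   => (a._1 + b._1, a._2 + b._2)  // combOp: merge accumulators across partitions
).collect().toMap
// sumCount: Map(a -> (4,2), b -> (5,1))
```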
Spark: combineByKey Source Analysis
Spark version: 2.4.0. Code location: org.apache.spark.rdd.PairRDDFunctions. Code excerpt:
/**
 * Generic function to combine the elements for each key using a custom set of aggregation
 * functions. This method is here for backward compatibility. It does not provide combiner… (excerpt truncated)
Posted 2021-01-26 15:09:20
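combineByKey is the general form that reduceByKey, foldByKey, and aggregateByKey are built on. A sketch of its three functions (assuming an `sc` from spark-shell):

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 5)))

val combined = pairs.combineByKey(
  (v: Int) => List(v),                     // createCombiner: first value seen for a key in a partition
  (acc: List[Int], v: Int) => v :: acc,    // mergeValue: fold further values into the combiner
  (a: List[Int], b: List[Int]) => a ::: b  // mergeCombiners: merge combiners across partitions
).mapValues(_.sorted).collect().toMap
// combined: Map(a -> List(1, 3), b -> List(5))
```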
Spark: reduceByKey Source Analysis
Code location: org.apache.spark.rdd.PairRDDFunctions. The three related method overloads:
Method 1… (excerpt truncated)
Posted 2021-01-26 11:26:52
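A minimal usage sketch (assuming an `sc` from spark-shell): reduceByKey combines values map-side before the shuffle, which is why it is generally preferred over groupByKey followed by a reduction.

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 5)))

pairs.reduceByKey(_ + _).collect().toMap  // Map(a -> 4, b -> 5)
```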