1. map
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
val cleanF = sc.clean(f)
new MapPartitionsRDD[U, T](this, (_, _, iter) => iter.map(cleanF))
}
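Processes the input one element at a time and emits exactly one output per input: the number of elements stays the same, while the element type may change.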
// map
val dataRDD: RDD[Int] = sc.makeRDD(List(1,2,3,4))
val mapRDD = dataRDD.map(_*2)
mapRDD.collect().foreach(println(_))
-------------print--------------
2
4
6
8
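The map operator runs partition by partition. Within a single partition, execution behaves like a single thread, so elements are processed in order; across partitions there is no ordering guarantee. Elements are also processed one at a time: an element passes through all of the chained logic in the same stage before the next element starts.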
val dataRDD: RDD[Int] = sc.makeRDD(List(1,2,3,4), 2)
dataRDD.saveAsTextFile("data")
dataRDD.map(
num => {
println(">>>>>> " + num * 2)
num * 2
}
).map(
num => {
println("!!!!!! " + num * 2)
num * 2
}
).collect()
-------------print--------------
>>>>>> 2
!!!!!! 4
>>>>>> 4
!!!!!! 8
>>>>>> 6
!!!!!! 12
>>>>>> 8
!!!!!! 16
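Here 1 and 2 are in one partition, and 3 and 4 are in the other. Because the two partitions run as separate tasks, lines from different partitions may interleave differently from run to run.
Question: if map is followed by a mapPartitions operator, does the next step wait until all elements have been processed?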
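One way to reason about this question without a cluster is to look at plain Scala iterators, since the source above shows that both operators simply pass the partition's Iterator along (iter.map(cleanF), cleanedF(iter)). The sketch below is not Spark code; LazyPipelineSketch, run and materialize are hypothetical names used only for illustration. It suggests that the answer depends on what the mapPartitions function does with its iterator: a lazy it.map keeps the element-by-element interleaving, while materializing the iterator (toList, max, and so on) drains the upstream map first.

```scala
import scala.collection.mutable.ListBuffer

// Plain-Scala stand-in for one partition; no Spark involved (illustrative assumption).
object LazyPipelineSketch {
  // Returns the order in which the two steps touched the elements.
  def run(materialize: Boolean): List[String] = {
    val log = ListBuffer.empty[String]

    // Stand-in for the iterator produced by an upstream map step.
    val mapped: Iterator[Int] = Iterator(1, 2).map { n => log += s"map $n"; n * 2 }

    // Stand-in for the function handed to mapPartitions.
    val partFn: Iterator[Int] => Iterator[Int] =
      if (materialize) it => it.toList.iterator.map { n => log += s"part $n"; n }
      else it => it.map { n => log += s"part $n"; n }

    // Consume the result, as an action such as collect() would.
    partFn(mapped).foreach(_ => ())
    log.toList
  }

  def main(args: Array[String]): Unit = {
    println(run(materialize = false)) // List(map 1, part 2, map 2, part 4)
    println(run(materialize = true))  // List(map 1, map 2, part 2, part 4)
  }
}
```

In the lazy case each element still flows through both steps before the next one starts; only the materializing variant waits for the whole partition.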
2. mapPartitions
def mapPartitions[U: ClassTag](
f: Iterator[T] => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U] = withScope {
val cleanedF = sc.clean(f)
new MapPartitionsRDD(
this,
(_: TaskContext, _: Int, iter: Iterator[T]) => cleanedF(iter),
preservesPartitioning)
}
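Takes an iterator and returns an iterator, so the number of output elements may differ from the number of input elements. The logic runs once per partition rather than once per element. Compared with map this resembles batch processing and can improve efficiency somewhat, but if the function keeps references to the partition's data (for example by materializing the whole iterator), that memory cannot be released until the partition is finished, which risks an out-of-memory error.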
val dataRDD: RDD[Int] = sc.makeRDD(List(1,2,3,4),2)
val mapPartRDD: RDD[Int] = dataRDD.mapPartitions(
it => {
println("######## ")
it.map(_ + 1)
}
)
mapPartRDD.collect().foreach(println(_))
-------------print--------------
########
########
2
3
4
5
The output shows that the println ran twice in total, once per partition.
// Find the maximum value in each partition
val rdd = sc.makeRDD(List(1,2,3,4), 2)
val mpRDD = rdd.mapPartitions(
iter => {
List(iter.max).iterator
}
)
mpRDD.collect().foreach(println)
-------------print--------------
2
4
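To make the "once per partition" point above measurable, the following sketch counts how often each function body runs. It assumes a local Spark installation on the classpath; the object name MapPartitionsSetupSketch and the accumulator names are hypothetical, and sc.longAccumulator is used only as a convenient counter.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MapPartitionsSetupSketch {
  // Returns (times the map function ran, times the mapPartitions function ran).
  def run(): (Long, Long) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("MapPartitionsSetupSketch")
    val sc = new SparkContext(conf)
    try {
      val perElement = sc.longAccumulator("perElement")     // hypothetical counter name
      val perPartition = sc.longAccumulator("perPartition") // hypothetical counter name
      val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)

      // map: the function body runs once for every element.
      rdd.map { n => perElement.add(1); n + 1 }.collect()

      // mapPartitions: the function body runs once for every partition, so
      // one-off setup placed here is shared by all elements of that partition.
      rdd.mapPartitions { it => perPartition.add(1); it.map(_ + 1) }.collect()

      (perElement.value.longValue(), perPartition.value.longValue())
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run()) // (4,2)
}
```

This is why mapPartitions is a natural place for setup work that is expensive per call, as long as the function keeps consuming the iterator lazily.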
3. mapPartitionsWithIndex
Takes a function of two arguments, the partition index and an iterator over that partition, and returns an iterator.
def mapPartitionsWithIndex[U: ClassTag](
f: (Int, Iterator[T]) => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U] = withScope {
val cleanedF = sc.clean(f)
new MapPartitionsRDD(
this,
(_: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(index, iter),
preservesPartitioning)
}
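Computation is done per partition; each iterator represents all the data in one partition.
demo: get the data of the partition with index 1 (the second partition)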
val rdd = sc.makeRDD(List(1,2,3,3,3,4,3,4,3,2,1,4), 2)
rdd.saveAsTextFile("data")
val mpiRDD = rdd.mapPartitionsWithIndex(
(index, iter) => {
if ( index == 1 ) {
iter
} else {
Nil.iterator
}
}
)
mpiRDD.collect().foreach(println)
-------------print--------------
3
4
3
2
1
4
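A related use of mapPartitionsWithIndex is to tag every element with the index of the partition it lives in, which makes the data layout visible without writing files. The sketch below assumes a local Spark installation; PartitionIndexSketch is a hypothetical name.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionIndexSketch {
  // Returns each element paired with the index of its partition.
  def run(): List[(Int, Int)] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("PartitionIndexSketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)
      rdd.mapPartitionsWithIndex(
        // index is the partition number, iter holds that partition's elements
        (index, iter) => iter.map(num => (index, num))
      ).collect().toList
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit =
    run().foreach(println) // (0,1) (0,2) (1,3) (1,4), one pair per line
}
```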
4. flatMap
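Flattens the data: each input element is mapped to a collection, and the resulting collections are concatenated into one RDD.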
/**
* Return a new RDD by first applying a function to all elements of this
* RDD, and then flattening the results.
*/
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
val cleanF = sc.clean(f)
new MapPartitionsRDD[U, T](this, (_, _, iter) => iter.flatMap(cleanF))
}
demo1: splitting strings
val rdd: RDD[String] = sc.makeRDD(List(
"Hello Scala", "Hello Spark"
))
val flatRDD: RDD[String] = rdd.flatMap(
s => {
s.split(" ")
}
)
flatRDD.collect().foreach(println)
-------------print--------------
Hello
Scala
Hello
Spark
demo2: pattern matching
val rdd: RDD[Any] = sc.makeRDD(List(
List(1, 2), List(3, 4), "hello"
))
val flatRDD: RDD[Any] = rdd.flatMap(
data => {
data match {
case list: List[_] => list
case dat => List(dat)
}
}
)
flatRDD.collect().foreach(println)
-------------print--------------
1
2
3
4
hello
5. glom
Converts the data in each partition into an array.
/**
* Return an RDD created by coalescing all elements within each partition into an array.
*/
def glom(): RDD[Array[T]] = withScope {
new MapPartitionsRDD[Array[T], T](this, (_, _, iter) => Iterator(iter.toArray))
}
demo1
val rdd : RDD[Int] = sc.makeRDD(List(1,2,3,4), 2)
val glomRDD: RDD[Array[Int]] = rdd.glom()
glomRDD.collect().foreach(data=> println(data.mkString(",")))
-------------print--------------
1,2
3,4
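Question: what happens if the elements in a partition do not all have the same type?
The element type of the RDD is then Any, so glom produces Array[Any] arrays: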
val rdd: RDD[Any] = sc.makeRDD(List(1,3,"a",2,4,"C"), 2)
val glomRDD: RDD[Array[Any]] = rdd.glom()
glomRDD.collect().foreach(
data=> {
data.foreach(
data1 =>{
data1 match {
case a : Int => println(a+1)
case b : String => println(b)
case _ => println("unknown")
}
}
)
}
)
-------------print--------------
2
4
a
3
5
C
6. groupBy
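Groups the data by the given rule. By default the number of partitions stays the same, but the data is shuffled and regrouped so that all elements with the same key end up in the same partition. This can lead to data skew.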
/**
* Return an RDD of grouped items. Each group consists of a key and a sequence of elements
* mapping to that key. The ordering of elements within each group is not guaranteed, and
* may even differ each time the resulting RDD is evaluated.
*
* @note This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
* or `PairRDDFunctions.reduceByKey` will provide much better performance.
*/
def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
groupBy[K](f, defaultPartitioner(this))
}
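As the source comment explains, groupBy only groups and does not aggregate; if you need aggregation, reduceByKey or aggregateByKey will perform better.
groupBy takes a function and returns an RDD of tuples, RDD[(K, Iterable[T])].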
demo1
val rdd: RDD[Int] = sc.makeRDD(List(1,2,3,4), 2)
val grpRDD: RDD[(Int, Iterable[Int])] = rdd.groupBy(
data => data % 2
)
grpRDD.collect().foreach(println(_))
-------------print--------------
(0,CompactBuffer(2, 4))
(1,CompactBuffer(1, 3))
The type parameter K in RDD[(K, Iterable[T])] is the type of the grouping key.
demo2
val rdd: RDD[Int] = sc.makeRDD(List(1,2,3,4), 2)
val grpRDD: RDD[(Int, Iterable[Int])] = rdd.groupBy(
data => data % 2
)
grpRDD.flatMap(
data =>{
data._2
}
).collect().foreach(println(_))
-------------print--------------
2
4
1
3
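The source comment quoted in this section recommends reduceByKey when the goal of grouping is an aggregation. The sketch below computes the same per-key sum both ways so the two styles can be compared; it assumes a local Spark installation, and GroupVsReduceSketch is a hypothetical name.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GroupVsReduceSketch {
  // Returns the per-key sums computed via groupBy and via reduceByKey.
  def run(): (Map[Int, Int], Map[Int, Int]) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("GroupVsReduceSketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)

      // groupBy first collects every element of a group, then sums the group.
      val viaGroupBy = rdd.groupBy(_ % 2).mapValues(_.sum).collect().toMap

      // reduceByKey combines values key by key instead of building whole groups.
      val viaReduceByKey = rdd.map(num => (num % 2, num)).reduceByKey(_ + _).collect().toMap

      (viaGroupBy, viaReduceByKey)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    val (grouped, reduced) = run()
    println(grouped) // Map(0 -> 6, 1 -> 4)
    println(reduced) // Map(0 -> 6, 1 -> 4)
  }
}
```

Both variants give the same result here; the difference the source comment points at is cost on large data, not correctness.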
7. filter
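Takes a predicate; elements for which it returns true are kept and the rest are discarded. The number of partitions does not change, so if some partitions lose far more elements than others, the remaining data can become unevenly distributed across partitions (data skew).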
/**
* Return a new RDD containing only the elements that satisfy a predicate.
*/
def filter(f: T => Boolean): RDD[T] = withScope {
val cleanF = sc.clean(f)
new MapPartitionsRDD[T, T](
this,
(_, _, iter) => iter.filter(cleanF),
preservesPartitioning = true)
}
demo
val dataRDD = sc.makeRDD(List(
1,2,3,4
),1)
val dataRDD1 = dataRDD.filter(_%2 == 0)
dataRDD1.collect().foreach(println)
-------------print--------------
2
4
8. distinct
Removes duplicate elements.
/**
* Return a new RDD containing the distinct elements in this RDD.
*/
def distinct(): RDD[T] = withScope {
distinct(partitions.length)
}
demo
val rdd = sc.makeRDD(List(1,2,3,4,1,2,3,4))
val rdd1: RDD[Int] = rdd.distinct()
rdd1.collect().foreach(println)
-------------print--------------
1
2
3
4
Question: does distinct shuffle, and does it reduce the number of partitions?
val rdd = sc.makeRDD(List(1,2,2,3,2,1,3,4), 2)
rdd.saveAsTextFile("data1")
val distinctRDD = rdd.distinct()
distinctRDD.saveAsTextFile("data2")
distinctRDD.collect().foreach(println(_))
-------------print--------------
4
2
1
3
The results show that distinct keeps the number of partitions unchanged, but the data is shuffled during execution.
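One way to see why a shuffle is involved is to express distinct by hand with reduceByKey, which is a shuffle operator. The sketch below is an illustration under that assumption, not a copy of Spark's implementation; it assumes a local Spark installation, and DistinctSketch is a hypothetical name.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DistinctSketch {
  // Returns (built-in distinct, hand-written equivalent), both sorted for comparison.
  def run(): (List[Int], List[Int]) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("DistinctSketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.makeRDD(List(1, 2, 2, 3, 2, 1, 3, 4), 2)

      val builtIn = rdd.distinct().collect().sorted.toList

      // Use each element as a key, keep one entry per key, then drop the dummy value.
      // reduceByKey moves equal keys into the same partition, which requires a shuffle.
      val byHand = rdd.map(num => (num, null))
        .reduceByKey((first, _) => first)
        .map(_._1)
        .collect().sorted.toList

      (builtIn, byHand)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run()) // (List(1, 2, 3, 4),List(1, 2, 3, 4))
}
```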
9. coalesce
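Decreases or increases the number of partitions. The two commonly used parameters are the new number of partitions and a Boolean that says whether to shuffle; the latter defaults to false and may be omitted.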
def coalesce(numPartitions: Int, shuffle: Boolean = false,
partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
(implicit ord: Ordering[T] = null)
demo
val rdd = sc.makeRDD(List(1,2,3,4,5,6), 3)
val newRDD1: RDD[Int] = rdd.coalesce(2)
val newRDD2: RDD[Int] = rdd.coalesce(2,true)
newRDD1.saveAsTextFile("output1")
newRDD2.saveAsTextFile("output2")
To increase the number of partitions, shuffle must be enabled; otherwise the partition count cannot grow.
val rdd = sc.makeRDD(List(1,2,3,4,5,6), 3)
val newRDD1: RDD[Int] = rdd.coalesce(4)
newRDD1.saveAsTextFile("output1")
val rdd = sc.makeRDD(List(1,2,3,4,5,6), 3)
val newRDD1: RDD[Int] = rdd.coalesce(4,true)
newRDD1.saveAsTextFile("output2")
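Comparing the two output directories (output1 still contains 3 part files, output2 contains 4) shows that without passing true to enable shuffle, the number of partitions does not change.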
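The same behavior can be checked without writing files by asking each resulting RDD for its partition count. This is a minimal sketch assuming a local Spark installation; CoalesceSketch is a hypothetical name.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceSketch {
  // Returns the partition counts after coalesce(2), coalesce(4) and coalesce(4, shuffle = true).
  def run(): (Int, Int, Int) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("CoalesceSketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.makeRDD(List(1, 2, 3, 4, 5, 6), 3)

      val shrunk = rdd.coalesce(2).getNumPartitions                     // shrinking works without shuffle
      val notGrown = rdd.coalesce(4).getNumPartitions                   // growing without shuffle keeps 3
      val grown = rdd.coalesce(4, shuffle = true).getNumPartitions      // growing needs shuffle

      (shrunk, notGrown, grown)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run()) // (2,3,4)
}
```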
10. repartition
repartition also resets the partitioning; see its source:
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
coalesce(numPartitions, shuffle = true)
}
repartition is implemented on top of coalesce and always performs a shuffle (shuffle = true).
11. sortBy
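Sorts the data by the key computed by the first parameter. The second parameter controls the sort direction and defaults to ascending (true). The third parameter sets the number of partitions of the result; if omitted, the partition count is unchanged.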
/**
* Return this RDD sorted by the given key function.
*/
def sortBy[K](
f: (T) => K,
ascending: Boolean = true,
numPartitions: Int = this.partitions.length)
(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
this.keyBy[K](f)
.sortByKey(ascending, numPartitions)
.values
}
demo1
val rdd = sc.makeRDD(List(6,2,4,5,3,1), 2)
val newRDD = rdd.sortBy(
num => num,
false
)
newRDD.saveAsTextFile("output")
The result has two partitions.
demo2
val rdd = sc.makeRDD(List(6,2,4,5,3,1), 2)
val newRDD = rdd.sortBy(
num => num,
false,
1
)
newRDD.saveAsTextFile("output")
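The result has one partition.
From the behavior above and from the source (sortBy delegates to sortByKey), it follows that sortBy always performs a shuffle.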
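The shuffle can also be observed in the lineage of the sorted RDD rather than inferred from the output files. The sketch below assumes a local Spark installation; SortBySketch is a hypothetical name, and checking toDebugString for the text "ShuffledRDD" is just one convenient way to inspect the lineage.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SortBySketch {
  // Returns (sorted elements, whether the lineage contains a ShuffledRDD,
  //          partition count when numPartitions = 1 is requested).
  def run(): (List[Int], Boolean, Int) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("SortBySketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.makeRDD(List(6, 2, 4, 5, 3, 1), 2)

      // Descending sort; the partition count is left at its default.
      val sorted = rdd.sortBy(num => num, ascending = false)

      // Descending sort into a single partition.
      val single = rdd.sortBy(num => num, ascending = false, numPartitions = 1)

      (sorted.collect().toList,
        sorted.toDebugString.contains("ShuffledRDD"),
        single.getNumPartitions)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run()) // (List(6, 5, 4, 3, 2, 1),true,1)
}
```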