Value type
map
rdd.map: the source of the map method is as follows:
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
// Underlying source (getPartitions in MapPartitionsRDD)
override def getPartitions: Array[Partition] = firstParent[T].partitions
Inside map, the getPartitions of the new MapPartitionsRDD simply returns the parent RDD's partitions via firstParent[T].partitions, which shows that the number of partitions is unchanged after the map operator. Note also that, in the end, it is Scala's own map method that is called on each partition's iterator.
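A quick way to confirm the unchanged partition count (a minimal sketch; the 3-partition input RDD here is made up for illustration and assumes an existing SparkContext sc):
val before: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4), 3)
val after: RDD[Int] = before.map(_ * 10)
println(before.getNumPartitions) // 3
println(after.getNumPartitions)  // 3: map does not change the partition count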
mapPartitions
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

val conf: SparkConf = new SparkConf().setAppName("haha").setMaster("local[2]")
val sc = new SparkContext(conf)
val line: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4))
// mapPartitions receives a whole partition as an iterator and must return an iterator
val mapPRDD: RDD[Int] = line.mapPartitions(
  iter => iter.map(num => num * 2)
)
mapPRDD.collect().foreach(println)
sc.stop()
Suppose there are N elements and M partitions: the function passed to map is called N times (once per element), while the function passed to mapPartitions is called M times, each call processing one whole partition at a time.
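A small sketch that makes the call counts visible (the println calls are purely illustrative and assume a local run where the output is observable):
val data: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6), 2) // N = 6 elements, M = 2 partitions
data.map { num =>
  println(s"map call for element $num")           // printed 6 times, once per element
  num * 2
}.collect()
data.mapPartitions { iter =>
  println("mapPartitions call for one partition") // printed 2 times, once per partition
  iter.map(_ * 2)
}.collect()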
mapPartitions pulls an entire partition into the computation at once, while map pulls elements one at a time, so when a single partition holds a large amount of data, mapPartitions can easily run out of memory and throw an OOM error. A related note on shallow copying, demonstrated by the example below: clone only copies the outermost container; it does not copy the User objects stored in the array, only the outer array itself.
package com.aiyunxiao.rddByjava;

import java.util.ArrayList;

public class clone {
    public static void main(String[] args) {
        ArrayList<User> list = new ArrayList<User>();
        User user = new User();
        user.name = "zhangsan";
        list.add(user);
        // clone() is a shallow copy: list2 is a new ArrayList,
        // but it holds references to the same User objects as list
        ArrayList<User> list2 = (ArrayList<User>) list.clone();
        User user2 = list2.get(0);
        user2.name = "lisi"; // mutates the single shared User instance
        System.out.println(user.name);
        System.out.println(user2.name);
    }
}

class User {
    public String name;
}
Output: both lines print lisi, because the cloned list still references the same User object:
lisi
lisi
mapPartitionsWithIndex
package com.aiyunxiao.RDDTest

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object mapPartitionsWithIndex {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("haha").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val line: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8, 9))
    // Tag each element with the index of the partition it lives in
    val mapIndexRDD: RDD[(Int, Int)] = line.mapPartitionsWithIndex((index, iter) =>
      iter.map(num => (index, num))
    )
    // Sum the elements that live in partition 1
    val tuples: Array[(Int, Int)] = mapIndexRDD.collect()
    val tuples1: Array[(Int, Int)] = tuples.filter(t => t._1 == 1)
    println(tuples1.map(_._2).sum)
    sc.stop()
  }
}
flatMap
package com.aiyunxiao.RDDTest

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object flatMap {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("haha").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val line: RDD[Int] = sc.makeRDD(List(1, 2, 3))
    // Each element is mapped to a collection (its square and cube), and the collections are flattened
    val rdd: RDD[Int] = line.flatMap(num => List(num * num, num * num * num))
    rdd.collect().foreach(println)
    sc.stop()
  }
}
Note: the function passed to flatMap must return an iterable collection, which is then flattened into the resulting RDD.
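Splitting strings is the classic case, since split returns an Array, which is iterable (a hypothetical word-splitting sketch, not part of the original example):
val sentences: RDD[String] = sc.makeRDD(List("hello spark", "hello scala"))
// split(" ") returns an Array[String]; flatMap flattens the arrays into one RDD of words
val words: RDD[String] = sentences.flatMap(sentence => sentence.split(" "))
words.collect().foreach(println) // hello, spark, hello, scala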
glom
package com.aiyunxiao.RDDTest

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object glom {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("haha").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val line: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5))
    // Collect the elements of each partition into an array
    val rddGlom: RDD[Array[Int]] = line.glom()
    // Find the maximum value of each partition
    val res: RDD[Int] = rddGlom.map(list => list.max)
    // Sum the per-partition maxima (print it so the result is not discarded)
    println(res.sum())
    res.collect().foreach(println)
    sc.stop()
  }
}
groupBy
package com.aiyunxiao.RDDTest

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object groupBy {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("haha").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val line: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5))
    // Group by parity: the key is whether the element is even
    val groupByRdd: RDD[(Boolean, Iterable[Int])] = line.groupBy(num => num % 2 == 0)
    // Alternative: use the remainder itself as the key
    val value: RDD[(Int, Iterable[Int])] = line.groupBy(num => num % 2)
    groupByRdd.collect().foreach(println)
    value.collect().foreach(println)
    sc.stop()
  }
}
filter
package com.aiyunxiao.RDDTest

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object filter {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("haha").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val line: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5))
    // Keep only the elements for which the predicate is true (the even numbers)
    val res: RDD[Int] = line.filter(num => num % 2 == 0)
    res.collect().foreach(println)
    sc.stop()
  }
}
sample
package com.aiyunxiao.RDDTest

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object sample {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("haha").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val line: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4))
    // Sample without replacement, keeping each element with probability 0.5
    val res: RDD[Int] = line.sample(false, 0.5)
    res.collect().foreach(println)
    sc.stop()
  }
}
Note: the first parameter of sample (withReplacement) determines which sampling algorithm is used underneath: true selects Poisson sampling, false selects Bernoulli sampling. For the second parameter (fraction), sample generates a random number for each element and compares it with this value; if the random number is below the fraction, the element is kept. This is repeated for every element.
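For comparison, a sketch of the with-replacement variant, reusing the line RDD from the example above (the fraction and seed values are made up for illustration); when withReplacement is true, the fraction is the expected number of times each element is selected and may exceed 1:
// withReplacement = false: Bernoulli sampling, fraction is the keep probability per element
val bernoulli: RDD[Int] = line.sample(withReplacement = false, fraction = 0.5, seed = 42L)
// withReplacement = true: Poisson sampling, fraction is the expected count per element
val poisson: RDD[Int] = line.sample(withReplacement = true, fraction = 2.0, seed = 42L)
println(bernoulli.collect().mkString(","))
println(poisson.collect().mkString(","))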
distinct
package com.aiyunxiao.RDDTest

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object distinct {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("haha").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val line: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 1, 2, 3, 1, 2, 3))
    val res: RDD[Int] = line.distinct()
    res.collect().foreach(println)
    sc.stop()
  }
}
Note: during deduplication the data is broken apart and regrouped across partitions; this operation is called a shuffle. Identical values have to end up in the same partition, so downstream tasks must wait for the shuffle to finish, and the shuffle writes intermediate data to disk.
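Conceptually, distinct is built on such a shuffle; the sketch below shows the idea (a simplified equivalent, not quoted from the Spark source), reusing the line RDD from the example above:
// Map every element to a (value, null) pair, let reduceByKey shuffle identical keys
// into the same partition and collapse duplicates, then keep only the keys
val deduped: RDD[Int] = line.map(x => (x, null)).reduceByKey((x, _) => x).map(_._1)
println(deduped.collect().mkString(","))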
coalesce(numPartitions) & repartition (shuffle by default)
Shrinks the number of partitions to the specified value; whether to shuffle can be specified.
package sparkCore

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object makeRDD {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("make").setMaster("local")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    val rdd1: RDD[Int] = sc.makeRDD(List(1, 2, 2, 1, 2, 4, 8, 9), 2)
    // Print the contents of each of the original 2 partitions
    val value: RDD[Array[Int]] = rdd1.glom()
    val array: Array[Array[Int]] = value.collect()
    array.foreach(a => println(a.mkString(",")))
    // Increase to 3 partitions; going up requires shuffle = true
    val rdd2: RDD[Int] = rdd1.coalesce(3, true)
    val value2: RDD[Array[Int]] = rdd2.glom()
    val array2: Array[Array[Int]] = value2.collect()
    array2.foreach(a => println(a.mkString(",")))
  }
}
coalesce can readjust the number of partitions upward or downward. If the original partition count is 3, calling coalesce(2) works directly and involves no shuffle; if the original count is 2 and you want 3, you must also pass true as the second argument to enable the shuffle. In other words, increasing the number of partitions requires a shuffle, while keeping it the same or reducing it does not. By default, coalesce does not shuffle.
The repartition() operator also calls coalesce under the hood, with shuffle enabled by default. Source:
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}
If you only need to reduce the number of partitions, prefer coalesce so the shuffle can be avoided.
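One way to see the difference is the lineage: toDebugString shows a ShuffledRDD only when a shuffle is involved (a minimal sketch, reusing rdd1 from the example above):
// Reducing 2 -> 1 partitions without shuffle: no ShuffledRDD appears in the lineage
println(rdd1.coalesce(1).toDebugString)
// repartition (i.e. coalesce with shuffle = true) always introduces a ShuffledRDD
println(rdd1.repartition(4).toDebugString)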
sortBy
package sparkCore

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object makeRDD {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("make").setMaster("local")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    val rdd1: RDD[Int] = sc.makeRDD(List(1, 2, 2, 1, 2, 4, 8, 9), 2)
    // Sort by the element itself, in ascending order (the second argument)
    val value: RDD[Int] = rdd1.sortBy(num => num, true)
    value.collect().foreach(println)
  }
}
You can specify ascending or descending order (ascending by default); sortBy also shuffles, because a global sort is required.
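The key-extraction function can be anything; a small sketch sorting key-value pairs by their value in descending order (the sample data here is hypothetical):
val scores: RDD[(String, Int)] = sc.makeRDD(List(("a", 3), ("b", 1), ("c", 2)))
// ascending = false sorts from the largest value down
val byScoreDesc: RDD[(String, Int)] = scores.sortBy(pair => pair._2, ascending = false)
byScoreDesc.collect().foreach(println) // (a,3), (c,2), (b,1)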
union & subtract & intersection & zip
package sparkCore

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object makeRDD {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("make").setMaster("local")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    val rdd1: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4), 2)
    val rdd2: RDD[Int] = sc.makeRDD(List(4, 5, 6, 7), 2)
    // 1,2,3,4,4,5,6,7  union: concatenates without deduplicating
    val r1: RDD[Int] = rdd1.union(rdd2)
    r1.collect().foreach(println)
    // 1,2,3 (order may vary)  subtract: keeps only elements unique to the left RDD; causes a shuffle
    val r2: RDD[Int] = rdd1.subtract(rdd2)
    r2.collect().foreach(println)
    // 4  intersection
    val r3: RDD[Int] = rdd1.intersection(rdd2)
    r3.collect().foreach(println)
    // zip: (1,4),(2,5),(3,6),(4,7)
    // the two RDDs must have the same number of elements and the same number of partitions
    // (which is why rdd2 is also created with 2 partitions above)
    val r4: RDD[(Int, Int)] = rdd1.zip(rdd2)
    r4.collect().foreach(println)
  }
}
Key-value type
partitionBy()
package sparkCore

import org.apache.spark.rdd.RDD
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

object makeRDD {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("make").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    val rdd1 = sc.makeRDD(List(("a", 1), ("b", 1), ("c", 1)), 3)
    // Show which partition each pair starts in
    println(rdd1.mapPartitionsWithIndex(
      (index, datas) => datas.map(data => (index, data))
    ).collect().mkString(","))
    // Repartition the data with a custom partitioner
    // (Spark provides HashPartitioner by default)
    val res: RDD[(String, Int)] = rdd1.partitionBy(new MyPartition(3))
    println(res.mapPartitionsWithIndex(
      (index, datas) => datas.map(data => (index, data))
    ).collect().mkString(","))
  }
}

// Custom partitioner:
// 1. Extend Partitioner
// 2. Override numPartitions and getPartition
class MyPartition(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions

  // Send every key to partition 1, regardless of its value
  override def getPartition(key: Any): Int = 1
}
Output:
(0,(a,1)),(1,(b,1)),(2,(c,1))
(1,(a,1)),(1,(b,1)),(1,(c,1))
This code shows how to define a custom partitioner. By default, Spark uses the HashPartitioner.
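For comparison, the same repartitioning with the built-in HashPartitioner (a minimal sketch, reusing rdd1 from the example above); HashPartitioner places each key according to the non-negative remainder of key.hashCode divided by the number of partitions:
import org.apache.spark.HashPartitioner

val hashed: RDD[(String, Int)] = rdd1.partitionBy(new HashPartitioner(3))
println(hashed.mapPartitionsWithIndex(
  (index, datas) => datas.map(data => (index, data))
).collect().mkString(","))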