I. map, mapPartitions, mapPartitionsWithIndex, foreach, foreachPartition
1. map vs. mapPartitions
map takes a function and applies it to every element:
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
val cleanF = sc.clean(f)
new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
mapPartitions takes a function and applies it once to each partition, passing in an iterator over that partition's elements:
def mapPartitions[U: ClassTag](
f: Iterator[T] => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U] = withScope {
val cleanedF = sc.clean(f)
new MapPartitionsRDD(
this,
(context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
preservesPartitioning)
}
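When to use which:
mapPartitions calls the function once per partition instead of once per element, so it can run faster when each call has a setup cost (for example, opening a connection) and the partitions fit comfortably in memory.
If the function materializes a whole partition (for example with toList) and the partition is very large, it can cause an OOM.
Notes:
1. The function passed to mapPartitions receives different input from the one passed to map: map's function gets a single element, while mapPartitions' function f gets an Iterator over the partition. P.S.: an Iterator can be traversed only once and lacks some collection methods (such as reverse), so convert it with toList first when you need them.
2. The function must return an Iterator, so finish with .iterator after working on a List.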
val sc = new SparkContext(conf)
val rdd1: RDD[Int] = sc.parallelize(List(1,2,3,4,5,6,7,8),2)
println( rdd1.mapPartitions(_.toList.reverse.iterator).collect().toBuffer)
Result:
ArrayBuffer(4, 3, 2, 1, 8, 7, 6, 5)
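To make the difference concrete, here is a minimal plain-Scala sketch that models an RDD as a list of partitions, with no Spark dependency. The names PartitionModel, split, mapModel and mapPartitionsModel are made up for this illustration; they only mimic how the two operators hand data to the user function.

```scala
object PartitionModel {
  // An "RDD" here is just a list of partitions; this is an illustration, not Spark.
  type Partitions[T] = List[List[T]]

  // Roughly mimics parallelize(data, n): n contiguous, nearly equal slices.
  def split[T](data: List[T], n: Int): Partitions[T] =
    (0 until n).toList.map { i =>
      val start = (i.toLong * data.length / n).toInt
      val end = ((i + 1).toLong * data.length / n).toInt
      data.slice(start, end)
    }

  // map: the user function sees one element at a time.
  def mapModel[T, U](parts: Partitions[T])(f: T => U): Partitions[U] =
    parts.map(_.map(f))

  // mapPartitions: the user function sees one Iterator per partition.
  def mapPartitionsModel[T, U](parts: Partitions[T])(
      f: Iterator[T] => Iterator[U]): Partitions[U] =
    parts.map(p => f(p.iterator).toList)

  def main(args: Array[String]): Unit = {
    val parts = split(List(1, 2, 3, 4, 5, 6, 7, 8), 2)
    println(mapModel(parts)(_ * 10).flatten)
    // reverse works per partition, hence 4,3,2,1 followed by 8,7,6,5
    println(mapPartitionsModel(parts)(_.toList.reverse.iterator).flatten)
  }
}
```

The reverse is applied inside each partition, never across the whole data set, which is why the two halves stay in place.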
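2. mapPartitionsWithIndex
Takes a function f with two parameters: the first, index, is the partition number (starting from 0); the second is an Iterator over the elements of that partition.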
def mapPartitionsWithIndex[U: ClassTag](
f: (Int, Iterator[T]) => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U] = withScope {
val cleanedF = sc.clean(f)
new MapPartitionsRDD(
this,
(context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(index, iter),
preservesPartitioning)
}
3. Worked example:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
val conf: SparkConf = new SparkConf()
conf.setMaster("local[*]")
conf.setAppName("mymap1")
val sc = new SparkContext(conf)
val rdd1: RDD[Int] = sc.parallelize(List(1,2,3,4,5,6,7,8),2)
val rdd2: RDD[Int] = rdd1.map(_*10)
val rdd3: RDD[Int] = rdd1.mapPartitions(_.map(_*10))
val myfunc1=(index:Int,it:Iterator[Int])=>{it.toList.map(x=>"[partition:"+index+" value:"+x+"]").iterator}
// myfunc1 tags each element with the number of the partition it lives in
val rdd4: RDD[String] = rdd1.mapPartitionsWithIndex(myfunc1)
val myfunc2=(index:Int,it:Iterator[Int])=>{it.toList.map(x=>index*10+x).iterator}
// myfunc2 returns partition number * 10 + element value
val rdd5: RDD[Int] = rdd1.mapPartitionsWithIndex(myfunc2)
println(rdd2.partitions.length) // number of partitions
println(rdd2.collect().toBuffer)
println(rdd3.collect().toBuffer)
println(rdd4.collect().toBuffer)
println(rdd5.collect().toBuffer)
Output:
2
ArrayBuffer(10, 20, 30, 40, 50, 60, 70, 80)
ArrayBuffer(10, 20, 30, 40, 50, 60, 70, 80)
ArrayBuffer([partition:0 value:1], [partition:0 value:2], [partition:0 value:3], [partition:0 value:4], [partition:1 value:5], [partition:1 value:6], [partition:1 value:7], [partition:1 value:8])
ArrayBuffer(1, 2, 3, 4, 15, 16, 17, 18)
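The index handling can be sketched without Spark as well. In this plain-Scala model (IndexedPartitionModel and mapPartitionsWithIndexModel are hypothetical names, not Spark API) the partition number is simply the position of the partition in a list:

```scala
object IndexedPartitionModel {
  // Mimics mapPartitionsWithIndex: f receives (partition index, iterator over that partition).
  def mapPartitionsWithIndexModel[T, U](parts: List[List[T]])(
      f: (Int, Iterator[T]) => Iterator[U]): List[U] =
    parts.zipWithIndex.flatMap { case (p, idx) => f(idx, p.iterator).toList }

  def main(args: Array[String]): Unit = {
    val parts = List(List(1, 2, 3, 4), List(5, 6, 7, 8))
    // Iterator.map is enough for a single-pass transformation; no toList is required here.
    println(mapPartitionsWithIndexModel(parts)((idx, it) => it.map(x => idx * 10 + x)))
  }
}
```

Because the transformation only needs one pass, the user function can return it.map(...) directly instead of materializing the partition first.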
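4. foreach, foreachPartition
map and mapPartitions are transformations: both are lazy and both return a new RDD. They differ in what the function you pass works on: a single element for map, an Iterator over a partition for mapPartitions.
foreach and foreachPartition are actions: they return nothing (Unit) and are used to write results out.
mapPartitions and foreachPartition define logic that runs once per partition, so variables and objects created in that function are shared by all records of the partition. map and foreach define logic that runs once per record.
In all four, the records of one partition are processed sequentially inside a single task, while different partitions are processed in parallel. How many tasks actually run at the same time depends on the available CPU cores; with one core everything runs serially.
When writing to a database, prefer foreachPartition.
foreach and foreachPartition are both actions and run on the executors. foreach writes to the target store one record at a time, so a connection opened inside its function is opened once per record. The function passed to foreachPartition receives an Iterator over a whole partition, so one connection per partition is enough, which is far more efficient. If that function buffers an entire partition in memory before writing, a large partition can still cause an OOM; increasing the number of partitions is one way to mitigate this.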
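The cost difference can be counted in a small plain-Scala sketch. FakeConnection is a hypothetical stand-in for a real database client, and the partitions are ordinary lists rather than an RDD; the only thing measured is how many connections get opened.

```scala
object ConnectionCountSketch {
  // Hypothetical stand-in for a database client; it only counts how often it is opened.
  final class FakeConnection(counter: Array[Int]) {
    counter(0) += 1
    def write(record: Int): Unit = () // pretend to send the record
    def close(): Unit = ()
  }

  // foreach-style: the connection is opened inside the per-record function.
  def perRecord(parts: List[List[Int]]): Int = {
    val opened = Array(0)
    parts.foreach(_.foreach { r =>
      val conn = new FakeConnection(opened)
      try conn.write(r) finally conn.close()
    })
    opened(0)
  }

  // foreachPartition-style: one connection per partition, reused for every record.
  def perPartition(parts: List[List[Int]]): Int = {
    val opened = Array(0)
    parts.foreach { p =>
      val conn = new FakeConnection(opened)
      try p.iterator.foreach(conn.write) finally conn.close()
    }
    opened(0)
  }

  def main(args: Array[String]): Unit = {
    val parts = List(List(1, 2, 3, 4), List(5, 6, 7, 8))
    // prints "per record: 8, per partition: 2"
    println(s"per record: ${perRecord(parts)}, per partition: ${perPartition(parts)}")
  }
}
```

Note that the per-partition version streams the iterator straight into the connection, so nothing forces the whole partition into memory.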
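// Incorrect usage:
val dStream = periodic.foreachRDD(rdd => {
//val jedis = RedisUtil.getConnectionFromPool // a connection defined here is created on the driver and would have to be serialized to the executors, so it cannot be used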
val pStream = rdd.mapPartitions(partition => {
val newPartition = partition.map(jsonobject => {
// val jedis = RedisUtil.getConnectionFromPool // this would open a connection for every record, exhaust the connection pool and cause timeout exceptions
// process the record
...
jsonobject
})
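jedis.close() // Iterator.map is lazy, so the connection may be closed here before it is ever used and then fail when the records are finally processed (reported as a null-pointer exception)
// newPartition.toIterator // with this line commented out the function no longer returns an iterator, so the whole mapPartitions call is broken
}).persist()
// the same misplaced-connection mistakes apply to foreachPartition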
pStream.foreachPartition(partition => {
val mongoClient = mongoClient()
val jedis = RedisUtil.getConnectionFromPool
val pipeline = jedis.pipelined()
val commandList = new util.ArrayList[WriteModel[Document]]()
partition.foreach(line => {
// process the record
...
})
// keep the number of database connections low and write in batches
if(pipeline!=null){
pipeline.sync()
pipeline.close()
}
jedis.close()
if (commandList.size() > 0) {
val result = mongoClient.getMongoColl(dbName, devDictionaryCollName).bulkWrite(commandList, new BulkWriteOptions().ordered(false))
println("*** bulk write result: ***" + result.toString)
}
})
pStream.unpersist()
})
// Correct usage:
val dStream = periodic.foreachRDD(rdd => {
val pStream = rdd.mapPartitions(partition => {
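val jedis = RedisUtil.getConnectionFromPool // correct place to obtain the connection: once per partition, on the executor
val newPartition = partition.map(jsonobject => {
// process the record
...
jsonobject
}).toList // Iterator.map is lazy: force it with toList so every record is processed before jedis.close() on the next line; otherwise the connection is already closed when the records are finally evaluated
jedis.close()
newPartition.toIterator
}).persist() // cache the RDD in memory so the actions below can reuse it
// run the output action on the transformed RDD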
pStream.foreachPartition(partition => {
val mongoClient = mongoClient()
val jedis = RedisUtil.getConnectionFromPool
val pipeline = jedis.pipelined()
val commandList = new util.ArrayList[WriteModel[Document]]()
partition.foreach(line => {
// process the record
...
})
// keep the number of database connections low and write in batches
if(pipeline!=null){
pipeline.sync()
pipeline.close()
}
jedis.close()
if (commandList.size() > 0) {
val result = mongoClient.getMongoColl(dbName, devDictionaryCollName).bulkWrite(commandList, new BulkWriteOptions().ordered(false))
println("*** bulk write result: ***" + result.toString)
}
})
pStream.unpersist()
})
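The laziness pitfall described in the comments of the two examples can be reproduced without Spark or Redis. The sketch below uses a hypothetical FakeConnection that throws once it has been closed; the exact exception in a real client will differ.

```scala
object LazyIteratorPitfall {
  // Hypothetical resource that refuses to work once closed.
  final class FakeConnection {
    private var open = true
    def lookup(x: Int): Int = {
      if (!open) throw new IllegalStateException("connection already closed")
      x * 10
    }
    def close(): Unit = { open = false }
  }

  // Wrong: Iterator.map is lazy, so lookup only runs after close() has been called.
  def lazyVersion(partition: Iterator[Int]): Iterator[Int] = {
    val conn = new FakeConnection
    val mapped = partition.map(conn.lookup)
    conn.close()
    mapped
  }

  // Right, at the cost of buffering the partition: force evaluation before close().
  def eagerVersion(partition: Iterator[Int]): Iterator[Int] = {
    val conn = new FakeConnection
    val mapped = partition.map(conn.lookup).toList
    conn.close()
    mapped.iterator
  }

  def main(args: Array[String]): Unit = {
    println(eagerVersion(Iterator(1, 2, 3)).toList) // List(10, 20, 30)
    val failed = scala.util.Try(lazyVersion(Iterator(1, 2, 3)).toList).isFailure
    println(s"lazy version failed: $failed") // lazy version failed: true
  }
}
```

The toList call trades memory for correctness: the whole partition is held in memory until the iterator is handed back.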
II. groupByKey, reduceByKey, aggregate, aggregateByKey, combineByKey
1. groupByKey
def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
groupByKey(defaultPartitioner(self))
}
The groupByKey(defaultPartitioner(self)) call above is implemented as follows:
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
// groupByKey shouldn't use map side combine because map side combine does not
// reduce the amount of data shuffled and requires all map side data be inserted
// into a hash table, leading to more objects in the old gen.
val createCombiner = (v: V) => CompactBuffer(v)
val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
2. reduceByKey
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
reduceByKey(defaultPartitioner(self), func)
}
// The reduceByKey(defaultPartitioner(self), func) call above is implemented as follows:
def reduceByKey(partitioner: Partitioner,
func: (V, V) => V): RDD[(K, V)] = self.withScope {
combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}
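Compared with groupByKey, reduceByKey first reduces the values of each key inside every partition (a map-side combine) and only then reduces across partitions, so less data is shuffled. The two also take different arguments: groupByKey takes no function and returns all values of a key, while reduceByKey takes a function func: (V, V) => V that decides how values are combined; its result has the same type V as the input values.
Similarities:
1. Both operate on RDD[(K, V)].
2. Both group or aggregate by key.
3. By default both use defaultPartitioner, which normally keeps the parent's partition count (unless spark.default.parallelism is set); both also accept an explicit partition count or Partitioner.
Differences:
1. groupByKey takes no aggregation function and returns RDD[(K, Iterable[V])].
2. reduceByKey requires an aggregation function and returns RDD[(K, V)], where V is the aggregated value.
3. groupByKey().mapValues(_.reduce(func)) produces the same result as reduceByKey(func).
Most important difference:
reduceByKey aggregates within each partition first and then aggregates globally per key.
groupByKey does no local aggregation; it only groups.
Conclusion:
When either operator would work, prefer reduceByKey.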
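A rough way to see why the map-side combine matters is to count how many records would have to cross the shuffle. The sketch below is a plain-Scala model under simplifying assumptions (partitions are lists, "shuffled" simply means "still present after the local step"); ShuffleVolumeSketch and its functions are hypothetical names.

```scala
object ShuffleVolumeSketch {
  type Partitions = List[List[(String, Int)]]

  // groupByKey-style: every record crosses the shuffle unchanged.
  def recordsToShuffle(parts: Partitions): Int = parts.map(_.size).sum

  // reduceByKey-style: combine inside each partition first, so at most
  // one record per key and partition is left to shuffle.
  def mapSideCombine(parts: Partitions)(f: (Int, Int) => Int): Partitions =
    parts.map(_.groupBy(_._1).toList.map { case (k, kvs) => (k, kvs.map(_._2).reduce(f)) })

  // Final per-key merge after the shuffle; identical for both paths.
  def mergeByKey(parts: Partitions)(f: (Int, Int) => Int): Map[String, Int] =
    parts.flatten.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).reduce(f)) }

  def main(args: Array[String]): Unit = {
    val parts: Partitions = List(
      List(("a", 1), ("b", 1), ("a", 1), ("a", 1)),
      List(("b", 1), ("b", 1), ("a", 1), ("c", 1)))
    val combined = mapSideCombine(parts)(_ + _)
    println(recordsToShuffle(parts))    // 8 records without local combine
    println(recordsToShuffle(combined)) // 5 records with local combine
    println(mergeByKey(parts)(_ + _) == mergeByKey(combined)(_ + _)) // true: same final result
  }
}
```

The final result is identical either way; only the amount of intermediate data changes, which is the reason for preferring reduceByKey when an aggregation function exists.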
3. aggregate
aggregate(zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U
Worked example:
val rdd1: RDD[Int] = sc.parallelize(List(1,2,3,4,5,6),2)
val r2: Int = rdd1.aggregate(5)(math.max(_,_),_+_)
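In partition 0: 5 is compared with 1, giving 5; that 5 is compared with 2, giving 5; that 5 is compared with 3, giving 5.
In partition 1: 5 is compared with 4, giving 5; that 5 is compared with 5, giving 5; that 5 is compared with 6, giving 6.
The second function then starts with the initial value, here 5, on the left: 5 + the result of one partition + the result of the other = 5 + 5 + 6 = 16.
Important: the initial value is used by the first function in every partition and once more by the second function.
When the second function merges the partitions, the order in which the partition results arrive is not guaranteed.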
val conf: SparkConf = new SparkConf()
conf.setMaster("local[*]")
conf.setAppName("myaggregate")
val sc = new SparkContext(conf)
val rdd1: RDD[Int] = sc.parallelize(List(1,2,3,4,5,6),2)
def func1(index: Int, iter: Iterator[(Int)]) : Iterator[String] = {
iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
}
println(rdd1.mapPartitionsWithIndex(func1).collect().toBuffer)
val r1: Int = rdd1.aggregate(0)(math.max(_,_),_+_)
val r2: Int = rdd1.aggregate(5)(math.max(_,_),_+_)
println("r1:"+r1)
println("r2:"+r2)
val rdd2 = sc.parallelize(List("a","b","c","d","e","f"),2)
def func2(index: Int, iter: Iterator[(String)]) : Iterator[String] = {
iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
}
println( rdd2.mapPartitionsWithIndex(func2).collect.toBuffer)
val r3: String = rdd2.aggregate("")(_ + _, _ + _)
val r4: String = rdd2.aggregate("=")(_ + _, _ + _)
println("r3:"+r3)
println("r4:"+r4)
Result:
ArrayBuffer([partID:0, val: 1], [partID:0, val: 2], [partID:0, val: 3], [partID:1, val: 4], [partID:1, val: 5], [partID:1, val: 6])
r1:9
r2:16
ArrayBuffer([partID:0, val: a], [partID:0, val: b], [partID:0, val: c], [partID:1, val: d], [partID:1, val: e], [partID:1, val: f])
r3:abcdef // or defabc
r4:==abc=def // or ==def=abc; the order is not guaranteed
Exercise: implement the following requirement.
Requirement: compute the count and the sum of the elements.
Data: val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8,9), 3)
Expected result: (9, 45)
val a = sc.parallelize(Array(1,2,3,4,5,6,7,8,9), 3)
val tuple: (Int, Int) = a.aggregate((0,0))((x, y)=>(x._1+1,x._2+y), (i, j)=>(i._1+j._1,i._2+j._2))
println(tuple)
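The aggregate walkthrough can be checked with a small plain-Scala model that folds each partition and then folds the partial results, seeding both steps with the zero value. aggregateModel is a hypothetical helper, and it always merges partitions in index order, whereas Spark gives no such guarantee.

```scala
object AggregateModel {
  // Models RDD.aggregate: zero seeds every partition fold AND the final merge.
  def aggregateModel[T, U](parts: List[List[T]])(zero: U)(
      seqOp: (U, T) => U, combOp: (U, U) => U): U = {
    val perPartition = parts.map(_.foldLeft(zero)(seqOp))
    perPartition.foldLeft(zero)(combOp)
  }

  def main(args: Array[String]): Unit = {
    val ints = List(List(1, 2, 3), List(4, 5, 6))
    println(aggregateModel(ints)(0)(math.max(_, _), _ + _)) // 0 + 3 + 6 = 9
    println(aggregateModel(ints)(5)(math.max(_, _), _ + _)) // 5 + 5 + 6 = 16

    val strs = List(List("a", "b", "c"), List("d", "e", "f"))
    println(aggregateModel(strs)("=")(_ + _, _ + _)) // ==abc=def in index order

    val nine = List(List(1, 2, 3), List(4, 5, 6), List(7, 8, 9))
    // (count, sum): zero (0, 0) is neutral, so using it twice does no harm
    println(aggregateModel(nine)((0, 0))(
      (acc, x) => (acc._1 + 1, acc._2 + x),
      (a, b) => (a._1 + b._1, a._2 + b._2))) // (9,45)
  }
}
```

The count-and-sum case works regardless of merge order because addition is commutative and (0, 0) is a neutral element; the string case is where the ordering caveat shows.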
4. aggregateByKey
aggregateByKey(zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)]
The second argument is the partitioner; it can be omitted, in which case the default partitioner is used:
def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
aggregateByKey(zeroValue, defaultPartitioner(self))(seqOp, combOp)
}
The aggregateByKey(zeroValue, defaultPartitioner(self))(seqOp, combOp) call above
is implemented as follows:
def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
// Serialize the zero value to a byte array so that we can get a new clone of it on each key
val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
val zeroArray = new Array[Byte](zeroBuffer.limit)
zeroBuffer.get(zeroArray)
lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))
// We will clean the combiner closure later in `combineByKey`
val cleanedSeqOp = self.context.clean(seqOp)
combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
cleanedSeqOp, combOp, partitioner)
}
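Notes:
1. The initial value is used only by the within-partition function, never by the cross-partition function.
Within a partition, the first time a key is seen the initial value is passed as the first argument of the within-partition function.
2. aggregateByKey operates on key-value data. Both functions only ever see values, never keys: seqOp folds a value V from (K, V) into an accumulator U, and combOp merges two accumulators.
Example: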
val b = sc.parallelize(List(("mouse", 2),("cat",200), ("cat", 5),
("mouse", 4),("cat", 12), ("dog", 12)), 2)
def func3(index: Int, iter: Iterator[(String, Int)]) : Iterator[String] = {
iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
}
println(b.mapPartitionsWithIndex(func3).collect.toBuffer)
val r: RDD[(String, Int)] = b.aggregateByKey(100)(math.max(_,_),_+_)
println(r.collect().toBuffer)
Output:
ArrayBuffer([partID:0, val: (mouse,2)], [partID:0, val: (cat,200)], [partID:0, val: (cat,5)], [partID:1, val: (mouse,4)], [partID:1, val: (cat,12)], [partID:1, val: (dog,12)])
ArrayBuffer((dog,100), (cat,300), (mouse,200))
Step-by-step analysis:
ArrayBuffer(
[partID:0, val: (mouse,2)],[partID:0, val: (cat,200)], [partID:0, val: (cat,5)],
[partID:1, val: (mouse,4)], [partID:1, val: (cat,12)],[partID:1, val: (dog,12)])
val r: RDD[(String, Int)] =
b.aggregateByKey(100)(math.max(_,_),_+_)
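------------------
Within each partition:
Partition 0:
mouse max(100, 2) = 100
cat max(100, 200) = 200
cat max(200, 5) = 200
Partition 1:
mouse max(100, 4) = 100
cat max(100, 12) = 100
dog max(100, 12) = 100
----------------
Across partitions:
mouse 100 + 100 = 200
cat 200 + 100 = 300
dog 100 (dog appears in only one partition, so combOp is never called and the initial value is not added again)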
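The same trace can be reproduced with a plain-Scala model of aggregateByKey. It is a sketch under the assumption that partitions are ordinary lists; aggregateByKeyModel is a hypothetical name, not Spark API. The key point it encodes is that the zero value is used once per key per partition and never in the cross-partition merge.

```scala
object AggregateByKeyModel {
  def aggregateByKeyModel[K, V, U](parts: List[List[(K, V)]])(zero: U)(
      seqOp: (U, V) => U, combOp: (U, U) => U): Map[K, U] = {
    // Inside each partition: the first value of a key is folded into zero,
    // later values into the running result for that key.
    val perPartition: List[Map[K, U]] = parts.map { p =>
      p.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
        acc.updated(k, seqOp(acc.getOrElse(k, zero), v))
      }
    }
    // Across partitions: combOp merges partial results; zero is NOT applied again.
    perPartition.flatMap(_.toList).groupBy(_._1).map { case (k, kus) =>
      (k, kus.map(_._2).reduce(combOp))
    }
  }

  def main(args: Array[String]): Unit = {
    val b = List(
      List(("mouse", 2), ("cat", 200), ("cat", 5)),
      List(("mouse", 4), ("cat", 12), ("dog", 12)))
    // mouse -> 200, cat -> 300, dog -> 100 (map print order may vary)
    println(aggregateByKeyModel(b)(100)(math.max(_, _), _ + _))
  }
}
```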
5. combineByKey
combineByKey(createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C)
def combineByKey[C](
createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C): RDD[(K, C)] = self.withScope {
combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners)(null)
}
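Like aggregateByKey, it can return values of a different type from the input values.
combineByKey parameters:
(1) createCombiner:
combineByKey() walks through all elements of a partition, so each element's key is either new or the same as the key of an earlier element. For a new key, combineByKey() calls createCombiner() to create the initial accumulator value for that key.
(2) mergeValue:
For a key that has already been seen earlier in the current partition, it calls mergeValue() to merge the accumulator's current value with the new value.
(3) mergeCombiners:
Each partition is processed independently, so the same key can have several accumulators. When two or more partitions hold an accumulator for the same key, the user-supplied mergeCombiners() merges the per-partition results.
Unlike aggregateByKey, the initial value becomes an initial function.
As with aggregateByKey, that initial function is applied only within partitions.
Example: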
val rdd5 = sc.textFile("hdfs://mini1:9000/data/wcount").flatMap(_.split(" ")).map((_,1))
def func4(index:Int,iter:Iterator[(String,Int)]):Iterator[String] ={
iter.toList.map(x=>"[partitionID:"+index+"value:"+x+"]").iterator
}
println("rdd5 partition count: "+rdd5.partitions.length)
println(rdd5.mapPartitionsWithIndex(func4).collect().toBuffer)
val rdd6: RDD[(String, Int)] = rdd5.combineByKey(x=>x+10, (a:Int, b:Int)=>a+b, (i:Int, j:Int)=>i+j)
println(rdd6.collect().toBuffer)
ArrayBuffer([partitionID:0value:(hello,1)],
[partitionID:0value:(china,1)], [partitionID:0value:(hello,1)],
[partitionID:0value:(jack,1)], [partitionID:0value:(hello,1)],
[partitionID:0value:(jim,1)], [partitionID:0value:(hello,1)],
[partitionID:0value:(tom,1)],
[partitionID:1value:(hello,1)], [partitionID:1value:(hadoop,1)],
[partitionID:1value:(hello,1)],
[partitionID:1value:(china,1)], [partitionID:1value:(nihao,1)],
[partitionID:1value:(zhongguo,1)], [partitionID:1value:(mingbai,1)],
[partitionID:1value:(lala,1)], [partitionID:1value:(weiwei,1)],
[partitionID:1value:(xingfen,1)], [partitionID:1value:(xiaoguo,1)],
[partitionID:2value:(mimi,1)], [partitionID:2value:(lala,1)],
[partitionID:2value:(mingbai,1)])
ArrayBuffer((xiaoguo,11), (tom,11), (weiwei,11), (hadoop,11),
(hello,26), (jack,11), (jim,11), (mingbai,22),
(lala,22), (zhongguo,11), (mimi,11), (china,22), (xingfen,11), (nihao,11))
Example 2:
Separating the singles from the non-singles (grouping the animal names by a 1/2 flag)
val rdd4 = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val rdd5 = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)
val rdd6 = rdd5.zip(rdd4)
val rdd7 = rdd6.combineByKey(List(_), (x: List[String], y: String)
=> x :+ y, (m: List[String], n: List[String]) => m ++ n)
println(rdd7.collect().toBuffer)
Output:
ArrayBuffer((1,List(dog, cat, turkey)), (2,List(gnu, salmon, rabbit, wolf, bear, bee)))
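The three callbacks can be traced with a plain-Scala model of combineByKey. As before this is only a sketch: combineByKeyModel is a hypothetical helper, partitions are plain lists, and partition results are merged in index order.

```scala
object CombineByKeyModel {
  def combineByKeyModel[K, V, C](parts: List[List[(K, V)]])(
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    val perPartition = parts.map { p =>
      p.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.get(k) match {
          case Some(c) => acc.updated(k, mergeValue(c, v))  // key already seen in this partition
          case None    => acc.updated(k, createCombiner(v)) // first occurrence in this partition
        }
      }
    }
    // One accumulator per key and partition is left; merge them across partitions.
    perPartition.flatMap(_.toList).groupBy(_._1).map { case (k, kcs) =>
      (k, kcs.map(_._2).reduce(mergeCombiners))
    }
  }

  def main(args: Array[String]): Unit = {
    val words = List(
      List(("hello", 1), ("china", 1), ("hello", 1)),
      List(("hello", 1), ("china", 1)))
    // +10 is applied once per key per partition: hello = (11 + 1) + 11 = 23, china = 11 + 11 = 22
    println(combineByKeyModel(words)(
      (v: Int) => v + 10, (c: Int, v: Int) => c + v, (a: Int, b: Int) => a + b))

    val animals = List(
      List((1, "dog"), (1, "cat"), (2, "gnu")),
      List((2, "salmon"), (2, "rabbit"), (1, "turkey")),
      List((2, "wolf"), (2, "bear"), (2, "bee")))
    println(combineByKeyModel(animals)(
      (v: String) => List(v),
      (c: List[String], v: String) => c :+ v,
      (a: List[String], b: List[String]) => a ++ b))
  }
}
```

The word-count variant makes the "initial function only within partitions" rule visible: a key that occurs in two partitions picks up the +10 twice, a key in one partition only once.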
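III. coalesce, repartition, repartitionAndSortWithinPartitions, partitionBy
coalesce and repartition change the number of partitions, and with it the degree of parallelism.
coalesce does not shuffle by default, so it cannot increase the number of partitions; to increase it, pass true as the second argument (shuffle) to allow a shuffle.
repartition simply calls coalesce with shuffle = true, so it can both increase and decrease the number of partitions: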
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
coalesce(numPartitions, shuffle = true)
}
val conf: SparkConf = new SparkConf()
conf.setMaster("local[*]")
conf.setAppName("myaggregate")
val sc = new SparkContext(conf)
val rdd1: RDD[Int] = sc.parallelize(List(1,2,3,4,5,6,7,8,9),3)
println(rdd1.partitions.length)
val rdd2: RDD[Int] = rdd1.coalesce(2)
println(rdd1.partitions.length)
println(rdd2.partitions.length)
val rdd3: RDD[Int] = rdd1.coalesce(4)
println(rdd3.partitions.length) // still 3: without a shuffle the partition count cannot grow
val rdd33: RDD[Int] = rdd1.coalesce(4,true)
println(rdd33.partitions.length) // 4: allowed because shuffle = true
val rdd4 = rdd2.repartition(2)
println(rdd4.partitions.length)
val rdd5 = rdd2.repartition(4)
println(rdd5.partitions.length)
Output:
3
3
2
3
4
2
4
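The partition counts above follow from a simple rule, which the plain-Scala sketch below models. It is a deliberately simplified picture: coalesceModel is a hypothetical function, the no-shuffle branch merges neighbouring partitions, and the shuffle branch uses plain round-robin, which is not how Spark actually places records.

```scala
object CoalesceModel {
  def coalesceModel[T](parts: List[List[T]], n: Int, shuffle: Boolean = false): List[List[T]] =
    if (shuffle) {
      // With a shuffle, records may move freely, so any target count is possible.
      val all = parts.flatten.zipWithIndex
      (0 until n).toList.map(i => all.collect { case (x, j) if j % n == i => x })
    } else if (n >= parts.length) {
      // Without a shuffle, partitions can only be merged, never split.
      parts
    } else {
      // Merge whole parent partitions into n groups.
      (0 until n).toList.map { i =>
        val start = (i.toLong * parts.length / n).toInt
        val end = ((i + 1).toLong * parts.length / n).toInt
        parts.slice(start, end).flatten
      }
    }

  // repartition is coalesce with shuffle = true.
  def repartitionModel[T](parts: List[List[T]], n: Int): List[List[T]] =
    coalesceModel(parts, n, shuffle = true)

  def main(args: Array[String]): Unit = {
    val rdd1 = List(List(1, 2, 3), List(4, 5, 6), List(7, 8, 9))
    println(coalesceModel(rdd1, 2).length)                 // 2
    println(coalesceModel(rdd1, 4).length)                 // 3
    println(coalesceModel(rdd1, 4, shuffle = true).length) // 4
    println(repartitionModel(coalesceModel(rdd1, 2), 4).length) // 4
  }
}
```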
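3. repartitionAndSortWithinPartitions
Takes a partitioner, repartitions the data with it and sorts the records inside each resulting partition by key. The partitioner (here the hash of the key) only decides which partition a record lands in; the sort uses the key's ordering, and the values have no influence on the result. Because foreach(println) runs one task per partition, the order in which the partitions print is not guaranteed.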
val rdd6: RDD[(String, Int)] = sc.parallelize(List(("apple",8),("caption",9),
("dog",4),("cat",34),("fish",5),("apen",1),("drink",23),("car",6)),3)
rdd6.repartitionAndSortWithinPartitions(new HashPartitioner(2)).foreach(println)
Output:
(apen,1)
(apple,8)
(caption,9)
(car,6)
(cat,34)
(dog,4)
(drink,23)
(fish,5)
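The two steps, placing by hash and sorting by key, can be separated in a plain-Scala model. hashPartition follows the same rule as Spark's HashPartitioner for non-null keys (non-negative hashCode modulo the partition count); repartitionAndSortModel is a hypothetical helper and handles no null keys.

```scala
object RepartitionAndSortModel {
  // Non-negative hashCode modulo the partition count decides the target partition.
  def hashPartition(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Step 1: place every record by the hash of its key.
  // Step 2: sort each partition by key ordering; values play no role.
  def repartitionAndSortModel[K: Ordering, V](
      data: List[(K, V)], numPartitions: Int): List[List[(K, V)]] =
    (0 until numPartitions).toList.map { p =>
      data.filter { case (k, _) => hashPartition(k, numPartitions) == p }.sortBy(_._1)
    }

  def main(args: Array[String]): Unit = {
    val data = List(("apple", 8), ("caption", 9), ("dog", 4), ("cat", 34),
      ("fish", 5), ("apen", 1), ("drink", 23), ("car", 6))
    repartitionAndSortModel(data, 2).zipWithIndex.foreach { case (part, idx) =>
      println(s"partition $idx: $part")
    }
  }
}
```

Each partition comes out sorted on its own; there is no ordering guarantee between partitions.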
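4. partitionBy
Takes a Partitioner, including a custom one you define yourself, and redistributes a key-value RDD according to it.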
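As a sketch of what a custom partitioner looks like, the example below uses a local trait, PartitionerLike, with the same two members as org.apache.spark.Partitioner (numPartitions and getPartition) so that it runs without Spark. FirstLetterPartitioner, its a-to-m rule and partitionByModel are all made up for illustration.

```scala
object CustomPartitionerSketch {
  // Local stand-in with the same two members as org.apache.spark.Partitioner,
  // so the sketch runs without Spark on the classpath.
  trait PartitionerLike {
    def numPartitions: Int
    def getPartition(key: Any): Int
  }

  // Hypothetical rule: string keys whose first character is at or before 'm'
  // go to partition 0, everything else to partition 1.
  object FirstLetterPartitioner extends PartitionerLike {
    val numPartitions: Int = 2
    def getPartition(key: Any): Int = key match {
      case s: String if s.nonEmpty && s.head.toLower <= 'm' => 0
      case _ => 1
    }
  }

  // Models partitionBy: each record goes to the partition its key maps to.
  def partitionByModel[V](data: List[(String, V)], p: PartitionerLike): List[List[(String, V)]] =
    (0 until p.numPartitions).toList.map(i => data.filter(kv => p.getPartition(kv._1) == i))

  def main(args: Array[String]): Unit = {
    val data = List(("apple", 8), ("zebra", 1), ("dog", 4), ("tiger", 5))
    // List(List((apple,8), (dog,4)), List((zebra,1), (tiger,5)))
    println(partitionByModel(data, FirstLetterPartitioner))
  }
}
```

With Spark on the classpath, the same getPartition logic would live in a class extending org.apache.spark.Partitioner, and an instance of it would be passed to partitionBy on a key-value RDD.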