Common Spark operators
----The difference between map and flatMap in Spark:
map applies the given function to each input element and returns exactly one output object per input.
flatMap is a combination of two operations, performing "map first, then flatten":
Step 1: like map, apply the given function to each input element, returning one result (typically a collection) per input;
Step 2: flatten all of those results into a single flat collection.
scala> rdd5.map(t=>{
t.split("\t")}).collect
res9: Array[Array[String]] = Array(Array(75, 2018-09-17, BK181713017, 小一),
Array(75, 2018-09-17, BK181913016, 小二),
Array(75, 2018-09-17, BK181913062, 小四))
scala> rdd5.flatMap(t=>{
t.split("\t")}).collect
res8: Array[String] =
Array(75, 2018-09-17, BK181713017, 小一, 75, 2018-09-17, BK181913016, 小二, 75, 2018-09-17, BK181913062, 小四, 75, 2018-09-17, BK181913007)
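Plain Scala collections share the same map/flatMap semantics as RDDs, so the difference can be checked without a Spark shell. A minimal standalone sketch (the sample lines below are hypothetical stand-ins for rdd5's records):

```scala
// Hypothetical tab-separated records, mimicking rdd5's data above.
val lines = Seq("75\t2018-09-17\tBK181713017\t小一",
                "75\t2018-09-17\tBK181913016\t小二")

// map: one Array[String] per input line -> a nested structure.
val mapped: Seq[Array[String]] = lines.map(_.split("\t"))

// flatMap: split each line, then flatten everything into one flat sequence.
val flatMapped: Seq[String] = lines.flatMap(_.split("\t"))

println(mapped.length)     // 2 (one array per line)
println(flatMapped.length) // 8 (all fields flattened together)
```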
scala> val a = sc.parallelize(1 to 10, 3)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
----glom: gathers the elements of each partition into an array, making the partition layout visible
scala> a.glom.collect
res0: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10))
----map operator
----x: each element of the RDD
scala> a.map(x=>(x*2)).glom.collect
res1: Array[Array[Int]] = Array(Array(2, 4, 6), Array(8, 10, 12), Array(14, 16, 18, 20))
map and mapPartitions differ mainly in call granularity:
map's transformation function is applied to every element of the RDD, while mapPartitions' function is applied once per partition, receiving that partition's elements as an iterator.
For parallelize(1 to 10, 3), the map function executes 10 times, while the mapPartitions function executes 3 times.
----mapPartitions operator
----x: the iterator over one partition's elements (the function is called once per partition)
----y: each element within that partition
scala> a.mapPartitions(x=>(x.map(y=>(y*2)))).glom.collect
res2: Array[Array[Int]] = Array(Array(2, 4, 6), Array(8, 10, 12), Array(14, 16, 18, 20))
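The call-granularity difference can be simulated with plain Scala, using grouped(4) as a stand-in for the 3 partitions of parallelize(1 to 10, 3). This is a sketch of the semantics only; in Spark the partitioning is handled by the scheduler:

```scala
var mapCalls = 0
var partitionCalls = 0

val data = (1 to 10).toList

// map-style: the function body runs once per element.
val doubled = data.map { x => mapCalls += 1; x * 2 }

// mapPartitions-style: the function body runs once per "partition",
// receiving that partition's elements as a whole.
val doubledByPart = data.grouped(4).toList.flatMap { part =>
  partitionCalls += 1
  part.map(_ * 2)
}

println(mapCalls)        // 10: once per element
println(partitionCalls)  // 3: once per simulated partition
```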
----mapPartitionsWithIndex operator
----index: the index number of each partition
----x: the iterator over one partition's elements
----y: each element within that partition
scala> a.mapPartitionsWithIndex((index,x)=>x.map(y=>(index,y))).glom.collect
res4: Array[Array[(Int, Int)]] = Array(Array((0,1), (0,2), (0,3)), Array((1,4), (1,5), (1,6)), Array((2,7), (2,8), (2,9), (2,10)))
----union: merges two RDDs (duplicates are kept)
scala> val b = sc.makeRDD(1 to 5,2)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at makeRDD at <console>:24
scala> b.glom.collect
res5: Array[Array[Int]] = Array(Array(1, 2), Array(3, 4, 5))
----union produces a new RDD
scala> a.union(b)
res6: org.apache.spark.rdd.RDD[Int] = UnionRDD[12] at union at <console>:28
scala> res6.glom.collect
res7: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10), Array(1, 2), Array(3, 4, 5))
----intersection: returns the elements common to both RDDs (deduplicated)
scala> a.intersection(b)
res8: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[19] at intersection at <console>:28
scala> res8.collect
res9: Array[Int] = Array(3, 4, 1, 5, 2)
scala> res8.glom.collect
res10: Array[Array[Int]] = Array(Array(3), Array(4, 1), Array(5, 2))
----subtract: set-difference operator
----keeps the elements of the first RDD that do not also appear in the second RDD
scala> val rdd1 = sc.parallelize(1 to 7)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[104] at parallelize at <console>:24
scala> val rdd2 = sc.makeRDD(3 to 6)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[106] at makeRDD at <console>:24
scala> rdd1.subtract(rdd2).collect
res55: Array[Int] = Array(1, 2, 7)
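On plain Scala collections, the same semantics can be expressed as a filter against the second collection's elements (a sketch of what subtract computes, not Spark's implementation):

```scala
val r1 = (1 to 7).toList
val r2 = (3 to 6).toSet

// Keep the elements of r1 that do not appear in r2.
val left = r1.filterNot(r2.contains)
println(left) // List(1, 2, 7)
```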
----sortBy operator: ascending by default (true); pass false for descending
----sortBy source signature:
def sortBy[K](
    f: (T) => K,
    ascending: Boolean = true,                    // sort ascending? defaults to true
    numPartitions: Int = this.partitions.length)  // optionally redefine the number of partitions
scala> val c = a.intersection(b)
c: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[27] at intersection at <console>:27
scala> c.collect
res12: Array[Int] = Array(3, 4, 1, 5, 2)
----sortBy: without the flag it defaults to true (ascending)
scala> c.sortBy(x=>x).collect
res14: Array[Int] = Array(1, 2, 3, 4, 5)
----passing true explicitly gives the same result
scala> c.sortBy(x=>x,true).collect
res15: Array[Int] = Array(1, 2, 3, 4, 5)
----sortBy: false (descending)
scala> c.sortBy(x=>x,false).collect
res16: Array[Int] = Array(5, 4, 3, 2, 1)
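Scala collections offer an analogous sortBy; instead of a Boolean flag it takes an implicit Ordering, so descending order comes from Ordering.Int.reverse. A standalone sketch reproducing the results above:

```scala
val xs = List(3, 4, 1, 5, 2)

// Ascending is the default ordering.
val asc = xs.sortBy(identity)
// Descending: pass the reversed ordering explicitly.
val desc = xs.sortBy(identity)(Ordering.Int.reverse)

println(asc)  // List(1, 2, 3, 4, 5)
println(desc) // List(5, 4, 3, 2, 1)
```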
----distinct: deduplication operator
----(since deduplication is bound to reduce the data volume, you can pass a parameter to redefine the default number of partitions)
scala> val c = a.union(b)
c: org.apache.spark.rdd.RDD[Int] = UnionRDD[48] at union at <console>:27
scala> c.collect
res17: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5)
scala> c.distinct.collect
res18: Array