1. pair RDD
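Spark provides a set of dedicated operations for RDDs whose elements are key-value pairs. Such RDDs are called pair RDDs. Pair RDDs are a building block of many programs because they expose operations that act on each key in parallel or regroup data across nodes. For example, a plain RDD has countByValue, while a pair RDD provides reduceByKey.
2. Creating a pair RDD
Given the plain RDD transformations covered earlier and the definition of a pair RDD, a pair RDD can be obtained from a plain RDD by applying a transformation such as map that turns each element into a (key, value) tuple.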
val nums = sc.parallelize(List(1,2,3,4,5))
val pairNums = nums map (x => (x,1))
println(pairNums.collect.mkString(","))
3. Pair RDD transformations
3.1 reduceByKey (aggregate by key)
Format:
pairRDD reduceByKey ( => )
val nums = sc.parallelize(List(1,2,3,4,5,6,7,8,9))
val all = nums sample (true, 100000)
val pairNums = all map (x => (x,1))
val sumNums = pairNums reduceByKey(_+_)
println(nums.collect.mkString(","))
println(pairNums.collect.mkString(","))
println(sumNums.collect.mkString(","))
val result = sumNums map (x => (x._1,x._2/100000.toDouble))
println(result.collect.mkString(","))
What if the count is raised to 10 billion?
Remarkably, even at 10 billion the job finished in 6.7 minutes.
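To make the counting step concrete, here is a minimal sketch in plain Scala collections (no Spark) of the result that reduceByKey(_+_) produces on (x, 1) pairs. The object ReduceByKeySketch and its helper are hypothetical names used only for illustration; they model the result, not Spark's distributed execution.

```scala
object ReduceByKeySketch {
  // Hypothetical local model: combine all values that share a key with `op`.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(op: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(op) }

  def main(args: Array[String]): Unit = {
    // Stand-in for the sampled numbers: each occurrence becomes (x, 1).
    val sampled = Seq(1, 2, 2, 3, 3, 3)
    val counts = reduceByKey(sampled.map(x => (x, 1)))(_ + _)
    println(counts)
  }
}
```

Dividing each count by the expected number of draws per element, as the result line in the example does, turns these counts into relative frequencies.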
3.2 groupByKey (group by key)
Format:
pairRDD groupByKey
val initNums = sc parallelize ( 0 to 9)
val pairNums= initNums sample(true,20) map (x => (x, x+"x"))
println(pairNums.collect.mkString(","))
println(pairNums.groupByKey.collect.mkString(","))
3.3 keys (get the keys)
Format:
pairRDD keys
val initNums = sc parallelize ( 0 to 9)
val pairNums = initNums sample(true,20) map (x => (x, x+1))
println(pairNums.collect.mkString(","))
println(pairNums.keys.collect.mkString(","))
3.4 values (get the values)
Format:
pairRDD values
val initNums = sc parallelize ( 0 to 9)
val pairNums= initNums sample(true,20) map (x => (x, x+"x"))
println(pairNums.collect.mkString(","))
println(pairNums.values.collect.mkString(","))
3.5 sortByKey (sort by key)
Format:
pairRDD sortByKey
val nums = sc parallelize ( 0 to 9) map (x => (10 - x, x))
println(nums.collect.mkString(","))
println(nums.sortByKey().collect.mkString(","))
3.6 mapValues (map over values)
Format:
pairRDD mapValues ( => )
val nums = sc parallelize (0 to 9) map (x => (x%4,x))
nums collect() foreach print
nums mapValues ( x => x * 10 ) collect() foreach print
3.7 flatMapValues (flat-map over values)
Format:
pairRDD flatMapValues( => )
val nums = sc parallelize ( 0 to 9 ) map ( x => ( x, x ))
nums collect() foreach print
nums flatMapValues ( x => x to 10 ) collect() foreach print
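3.8 combineByKey (custom aggregation by key)
Format:
pairRDD combineByKey( => , => , => )
First => (createCombiner): converts an element into the result type
First => : parameter: the element
Second => (mergeValue): merges an element into the accumulator within a partition
Second => : parameters: result type, element
Third => (mergeCombiners): merges accumulators across partitions
Third => : parameters: result type, result type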
val nums = sc parallelize(List(("A",66),("B",56),("C",88),("D",99),("A",33),("B",67858),("C",8987),("D",11231)))
type Mt = (Int,Int)
nums.combineByKey(a => (a,1),(x:Mt,s)=>(x._1+s,x._2 + 1),(c:Mt,d:Mt)=>(c._1+d._1,c._2+d._2)) map{ case (key,value) => (key, value._1/value._2.toDouble)} collect() foreach print
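The role of the three functions can be modeled without Spark. The sketch below is a hypothetical local model (CombineByKeySketch is not a Spark API) that treats each inner Seq as one partition: the first two functions run inside a partition, and the third merges the per-partition accumulators. It assumes nothing about how Spark actually schedules the work.

```scala
object CombineByKeySketch {
  // Hypothetical model of combineByKey over explicit partitions.
  def combineByKey[K, V, C](partitions: Seq[Seq[(K, V)]])(
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // Within each partition: the first value of a key creates the accumulator,
    // every later value of that key is merged into it.
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.get(k) match {
          case Some(c) => acc.updated(k, mergeValue(c, v))
          case None    => acc.updated(k, createCombiner(v))
        }
      }
    }
    // Across partitions: accumulators of the same key are merged pairwise.
    perPartition.foldLeft(Map.empty[K, C]) { (merged, m) =>
      m.foldLeft(merged) { case (acc, (k, c)) =>
        acc.get(k) match {
          case Some(prev) => acc.updated(k, mergeCombiners(prev, c))
          case None       => acc.updated(k, c)
        }
      }
    }
  }

  // Per-key average, using a (sum, count) accumulator as in the example above.
  def averages(partitions: Seq[Seq[(String, Int)]]): Map[String, Double] =
    combineByKey[String, Int, (Int, Int)](partitions)(
      v => (v, 1),
      (c, v) => (c._1 + v, c._2 + 1),
      (a, b) => (a._1 + b._1, a._2 + b._2)
    ).map { case (k, (sum, n)) => k -> sum / n.toDouble }

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq("A" -> 66, "B" -> 56), Seq("A" -> 33, "B" -> 44))
    println(averages(parts))
  }
}
```

Because the accumulator type (Int, Int) differs from the value type Int, this is a case where reduceByKey alone would not be enough.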
3.9 subtractByKey (difference by key)
Format:
pairRDD1 subtractByKey pairRDD2
val num1 = sc parallelize(List(("A",66),("B",78),("C",77),("D",88)))
val num2 = sc parallelize(List(("A",66),("C",77)))
num1 collect() foreach print
num2 collect() foreach print
num1 subtractByKey num2 collect() foreach print
3.10 join (inner join)
Format:
pairRDD1 join pairRDD2
val num1 = sc parallelize(List(("A",66),("B",78),("C",77),("D",88)))
val num2 = sc parallelize(List(("A",778),("C",899)))
num1 collect() foreach print
num2 collect() foreach print
num1 join num2 collect() foreach print
3.11 rightOuterJoin (right outer join)
Format:
pairRDD1 rightOuterJoin pairRDD2
val num1 = sc parallelize(List(("A",66),("B",78),("C",77),("D",88)))
val num2 = sc parallelize(List(("A",778),("C",899)))
num1 collect() foreach print
num2 collect() foreach print
num1 rightOuterJoin num2 collect() foreach print
3.12 leftOuterJoin (left outer join)
Format:
pairRDD1 leftOuterJoin pairRDD2
val num1 = sc parallelize(List(("A",66),("B",78),("C",77),("D",88)))
val num2 = sc parallelize(List(("A",778),("C",899)))
num1 collect() foreach print
num2 collect() foreach print
num1 leftOuterJoin num2 collect() foreach print
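In a left outer join, every pair from the left side is kept, and the right-hand value is wrapped in an Option because a match may be missing. A minimal local sketch of that shape, using plain Scala collections, follows; OuterJoinSketch is a hypothetical name and the code is not Spark's implementation.

```scala
object OuterJoinSketch {
  // Hypothetical local model of leftOuterJoin on two sequences of pairs.
  def leftOuterJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, Option[W]))] =
    left.flatMap { case (k, v) =>
      // All right-hand values stored under the same key.
      val matches = right.collect { case (rk, w) if rk == k => w }
      if (matches.isEmpty) Seq((k, (v, Option.empty[W])))   // no match: None
      else matches.map(w => (k, (v, Some(w): Option[W])))   // one row per match
    }

  def main(args: Array[String]): Unit = {
    val num1 = Seq("A" -> 66, "B" -> 78, "C" -> 77, "D" -> 88)
    val num2 = Seq("A" -> 778, "C" -> 899)
    leftOuterJoin(num1, num2).foreach(println)
  }
}
```

A right outer join mirrors this: every pair from the right side is kept and the left-hand value becomes the Option.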
3.13 cogroup (group by key across RDDs)
Format:
pairRDD1 cogroup pairRDD2
val num1 = sc parallelize(List(("A",66),("B",78),("C",77),("D",88)))
val num2 = sc parallelize(List(("A",778),("C",899)))
num1 collect() foreach print
num2 collect() foreach print
num1 cogroup num2 collect() foreach print
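cogroup is neither a plain union nor an intersection: it keeps every key that appears in either input and pairs the values from the left with the values from the right, either of which may be empty. The sketch below models that result with plain Scala collections; CogroupSketch is a hypothetical helper and uses Seq where Spark returns Iterable.

```scala
object CogroupSketch {
  // Hypothetical local model of cogroup on two sequences of pairs.
  def cogroup[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    // Every key that occurs on either side.
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.map { k =>
      val ls = left.filter(_._1 == k).map(_._2)   // values from the left input
      val rs = right.filter(_._1 == k).map(_._2)  // values from the right input
      (k, (ls, rs))
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val num1 = Seq("A" -> 66, "B" -> 78, "C" -> 77, "D" -> 88)
    val num2 = Seq("A" -> 778, "C" -> 899)
    cogroup(num1, num2).foreach(println)
  }
}
```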
3.14 Transformation cheat sheet
Operation | Method | Format |
---|---|---|
Aggregate by key | reduceByKey | pairRDD reduceByKey ( => ) |
Group by key | groupByKey | pairRDD groupByKey |
Get keys | keys | pairRDD keys |
Get values | values | pairRDD values |
Sort by key | sortByKey | pairRDD sortByKey |
Map over values | mapValues | pairRDD mapValues ( => ) |
Flat-map over values | flatMapValues | pairRDD flatMapValues( => ) |
Custom aggregation by key | combineByKey | pairRDD combineByKey( => , => , => ) |
Difference by key | subtractByKey | pairRDD1 subtractByKey pairRDD2 |
Inner join | join | pairRDD1 join pairRDD2 |
Right outer join | rightOuterJoin | pairRDD1 rightOuterJoin pairRDD2 |
Left outer join | leftOuterJoin | pairRDD1 leftOuterJoin pairRDD2 |
Group by key across RDDs | cogroup | pairRDD1 cogroup pairRDD2 |
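4. Pair RDD transformations by category
4.1 Element-wise operations
4.1.1 map
A pair RDD is still an RDD (of tuples), so every operation available on a plain RDD can also be used on a pair RDD.
Format:
pairRDD map {case (key,value) => (key, value')}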
val keys = sc parallelize( 1 to 6)
val pairs = keys map (x => (x,x*x))
pairs map {case (key,value) => (key, value * 10 )} collect() foreach print
4.1.2 filter
Format:
pairRDD filter {case (key,value) => Boolean}
val keys = sc parallelize( 1 to 6)
val pairs = keys map (x => (x,x*x*10))
pairs collect() foreach print
pairs filter {case (key,value) => value < 100 } collect() foreach print
4.1.3 keys
Format:
pairRDD keys
val keys = sc parallelize( 1 to 6)
val pairs = keys map (x => (x,x*x*10))
pairs collect() foreach print
println(pairs.keys.collect.mkString(","))
4.1.4 values
Format:
pairRDD values
val keys = sc parallelize( 1 to 6)
val pairs = keys map (x => (x,x*x*10))
pairs collect() foreach print
println(pairs.keys.collect.mkString(","))
println(pairs.values.collect.mkString(","))
4.1.5 mapValues
Format:
pairRDD mapValues ( => )
val keys = sc parallelize( 1 to 6)
val pairs = keys map (x => (x,x*x*10))
pairs collect() foreach print
pairs mapValues ( x => x /10 ) collect() foreach print
4.2 Aggregation operations
4.2.1 reduceByKey
Format:
pairRDD reduceByKey ( => )
val keys = sc parallelize (List("A","B","C","D"))
val values = sc parallelize (List(1,2,3,4))
val pairs = keys cartesian values
pairs collect() foreach print
pairs reduceByKey ((a,b) => ( if (a > b) a else b)) collect() foreach print
4.2.2 foldByKey
Format:
pairRDD foldByKey (value)( => )
val keys = sc parallelize (List("A","B","C","D"))
val values = sc parallelize (List(1,2,3,4))
val pairs = sc parallelize (keys cartesian values collect,1)
pairs collect() foreach print
pairs.foldByKey(3)((a,b) => (a+b)) collect()
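With a single partition, the initial value 3 is folded in once per key, so every key yields 3 + 1 + 2 + 3 + 4 = 13.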
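The result depends on the partition count because the initial value seeds the fold once per key in each partition that contains the key. The following sketch models this with plain Scala collections, where each inner Seq stands for one partition. FoldByKeySketch is a hypothetical helper, not a Spark API, and it only illustrates why a non-neutral initial value such as 3 changes the total when the same data is split over more partitions.

```scala
object FoldByKeySketch {
  // Hypothetical model: `zero` seeds the fold once per key in every partition.
  def foldByKey[K, V](partitions: Seq[Seq[(K, V)]], zero: V)(op: (V, V) => V): Map[K, V] = {
    // Fold inside each partition, starting from `zero` for every key.
    val perPartition = partitions.map { part =>
      part.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).foldLeft(zero)(op) }
    }
    // Merge the per-partition results of each key with the same function.
    perPartition.flatten.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(op) }
  }

  def main(args: Array[String]): Unit = {
    val one = Seq(Seq("A" -> 1, "A" -> 2, "A" -> 3, "A" -> 4))
    val two = Seq(Seq("A" -> 1, "A" -> 2), Seq("A" -> 3, "A" -> 4))
    println(foldByKey(one, 3)(_ + _)) // one partition
    println(foldByKey(two, 3)(_ + _)) // two partitions: 3 is added twice
  }
}
```

Using a neutral initial value (0 for addition, 1 for multiplication) keeps the result independent of the partitioning.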
4.2.3 aggregateByKey
Format:
pairRDD aggregateByKey(value)( => , => )
val keys = sc parallelize (List("A","B","C","D"))
val values = sc parallelize (List(1,2,3,4))
val pairs = sc parallelize (keys cartesian values collect,1)
pairs collect() foreach print
pairs.aggregateByKey("")((a,b)=>(a+""+b),(s,t)=>s+t) collect()
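4.2.4 combineByKey
Format:
pairRDD combineByKey( => , => , => )
First => (createCombiner): converts an element into the result type
First => : parameter: the element
Second => (mergeValue): merges an element into the accumulator within a partition
Second => : parameters: result type, element (the parameter order is fixed)
Third => (mergeCombiners): merges accumulators across partitions
Third => : parameters: result type, result type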
val keys = sc parallelize (List("A","B","C","D"))
val values = sc parallelize (List(1,2,3,4))
val pairs = sc parallelize (keys cartesian values collect,1)
pairs combineByKey(x=>x.toDouble,(a:Double,b:Int)=>(a + b.toDouble),(a:Double,b:Double)=>(a+b)) collect
4.3 Grouping operations
4.3.1 groupBy
Format:
pairRDD groupBy ( => )
val keys = sc parallelize (List("A","B","C","D"))
val values = sc parallelize (List(1,2,3,4))
val pairs = sc parallelize (keys cartesian values collect,1)
pairs groupBy{case(key,value) => key} map {case(key,value) => (key, value map (x => x._2))} collect
This is equivalent to groupByKey:
val keys = sc parallelize (List("A","B","C","D"))
val values = sc parallelize (List(1,2,3,4))
val pairs = sc parallelize (keys cartesian values collect,1)
pairs groupByKey() collect
4.3.2 groupByKey
Format:
pairRDD groupByKey
val keys = sc parallelize (List("A","B","C","D"))
val values = sc parallelize (List(1,2,3,4))
val pairs = sc parallelize (keys cartesian values collect,1)
pairs groupByKey() collect
4.3.3 cogroup
Format:
pairRDD1 cogroup pairRDD2 [cogroup pairRDD3 …]
val keys = sc parallelize (List("A","B","C","D"))
val values = sc parallelize (List(1,2,3,4))
val pairs1 = sc parallelize (keys cartesian values collect)
val keys = sc parallelize(List("B","C"))
val values = sc parallelize(5 to 8)
val pairs2 = sc parallelize (keys cartesian values collect)
pairs1 collect() foreach print
pairs2 collect() foreach print
pairs1 cogroup pairs2 collect() foreach print
4.4 Join operations
4.4.1 join
Format:
pairRDD1 join pairRDD2
val keys = sc parallelize (1 to 3)
val values = sc parallelize ('A' to 'C')
val pairs1 = sc parallelize ( keys cartesian values collect)
keys collect() foreach print
values collect() foreach print
pairs1 collect() foreach print
val keys = sc parallelize (2 to 4)
val values = sc parallelize ( 'M' to 'O')
val pairs2 = sc parallelize ( keys cartesian values collect)
keys collect() foreach print
values collect() foreach print
pairs2 collect() foreach print
pairs1 join pairs2 collect() foreach print
4.4.2 leftOuterJoin
Format:
pairRDD1 leftOuterJoin pairRDD2
val keys = sc parallelize ( 1 to 3)
val values = sc parallelize( 'A' to 'C')
keys collect() foreach print
values collect() foreach print
val pairs1 = keys cartesian values
val keys = sc parallelize ( 2 to 3)
val values = sc parallelize ('A' to 'D')
keys collect() foreach print
values collect() foreach print
val pairs2 = keys cartesian values
pairs1 collect() foreach print
pairs2 collect() foreach print
pairs1 leftOuterJoin pairs2 collect() foreach print
4.4.3 rightOuterJoin
Format:
pairRDD1 rightOuterJoin pairRDD2
val keys = sc parallelize ( 1 to 2)
val values = sc parallelize ( 'A' to 'C')
val pairs1 = keys cartesian values
keys collect() foreach print
values collect() foreach print
val keys = sc parallelize ( 2 to 4)
val values = sc parallelize ('B' to 'D')
val pairs2 = keys cartesian values
keys collect() foreach print
values collect() foreach print
pairs1 collect() foreach print
pairs2 collect() foreach print
pairs1 rightOuterJoin pairs2 collect() foreach print
4.5 Sorting operations
4.5.1 sortByKey
Format:
pairRDD sortByKey
val keys = sc parallelize (List(3,2,1))
val values = sc parallelize ( 'M' to 'O')
val pairs = keys cartesian values
keys collect() foreach print
values collect() foreach print
pairs collect() foreach print
pairs sortByKey() collect() foreach print
5. Pair RDD actions
5.1 countByKey
Format:
pairRDD countByKey
val keys = sc parallelize( 1 to 8)
val values = sc parallelize ( 'A' to 'E')
val pairs = keys cartesian values
keys collect() foreach print
values collect() foreach print
pairs collect() foreach print
pairs countByKey
5.2 collectAsMap
Format:
pairRDD collectAsMap
val keys = sc parallelize( 1 to 7)
val values = sc parallelize( 'E' to 'H')
val pairs = keys cartesian values
keys collect() foreach print
values collect() foreach print
pairs collect() foreach print
pairs collectAsMap
5.3 lookup
Format:
pairRDD lookup key
val keys = sc parallelize ( 1 to 5)
val values = sc parallelize ( 'M' to 'Z')
val pair = keys cartesian values
keys collect() foreach print
values collect() foreach print
pair collect() foreach print
pair lookup 2
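The three actions return local values to the driver, and their results differ in shape when a key has several values, as it does in the cartesian examples above. The sketch below models them on a local Seq with plain Scala collections. PairActionsSketch and its helpers are hypothetical stand-ins, and the rule for which value collectAsMap keeps for a repeated key is an assumption of this model (here the later pair overwrites the earlier one); the point is only that a single value per key survives.

```scala
object PairActionsSketch {
  // Number of pairs stored under each key.
  def countByKey[K, V](pairs: Seq[(K, V)]): Map[K, Long] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.size.toLong }

  // All values stored under one key, in input order.
  def lookup[K, V](pairs: Seq[(K, V)], key: K): Seq[V] =
    pairs.collect { case (k, v) if k == key => v }

  // Assumption of this model: for a repeated key the later pair wins,
  // so only one value per key remains in the map.
  def collectAsMap[K, V](pairs: Seq[(K, V)]): Map[K, V] = pairs.toMap

  def main(args: Array[String]): Unit = {
    // Local equivalent of `keys cartesian values` for 2 keys and 3 values.
    val pairs = for (k <- 1 to 2; v <- 'A' to 'C') yield (k, v)
    println(countByKey(pairs))
    println(lookup(pairs, 2))
    println(collectAsMap(pairs))
  }
}
```

If every value per key is needed on the driver, grouping first (for example with groupByKey) and then collecting avoids losing values.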