Creating a Pair RDD
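The examples below assume sc is an existing SparkContext, as provided automatically in the pyspark shell. Outside the shell, a minimal setup sketch (the app name is an arbitrary placeholder):
from pyspark import SparkContext
sc = SparkContext(appName="pairRDD-demo")  # create the context used by the examples below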
map(): turn each element into a (key, value) tuple, producing the pair RDD
rdd = sc.parallelize(["hadoop", "spark", "hive", "spark", "mapreduce"])
pairRDD = rdd.map(lambda word: (word, 1))
pairRDD.foreach(print)
Result:
('hadoop', 1)
('spark', 1)
('spark', 1)
('mapreduce', 1)
('hive', 1)
reduceByKey(): merge the values of all pairs that share the same key, using the given function
pairRDD.reduceByKey(lambda a,b:a+b).foreach(print)
Result:
('spark', 2)
('hive', 1)
('hadoop', 1)
('mapreduce', 1)
groupByKey(): group all key-value pairs by key; each key maps to an iterable of its values
pairRDD.groupByKey().foreach(print)
Result:
('hive', <pyspark.resultiterable.ResultIterable object at 0x7fd746d09978>)
('hadoop', <pyspark.resultiterable.ResultIterable object at 0x7fd746d09978>)
('spark', <pyspark.resultiterable.ResultIterable object at 0x7fd746d09978>)
('mapreduce', <pyspark.resultiterable.ResultIterable object at 0x7fd746c706d8>)
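The ResultIterable objects printed above are just iterator wrappers, so the grouped values are not directly visible. One option (a small sketch, not part of the original example) is to convert each group to a list with mapValues(list):
pairRDD.groupByKey().mapValues(list).foreach(print)  # print each key with its grouped values
This would print entries such as ('spark', [1, 1]) instead of the opaque ResultIterable objects.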
keys(): return an RDD of all the keys
pairRDD.keys().foreach(print)
Result:
spark
hadoop
hive
spark
mapreduce
values(): return an RDD of all the values
pairRDD.values().foreach(print)
Result:
1
1
1
1
1
sortByKey(): sort the key-value pairs by key
pairRDD.sortByKey().foreach(print)
Result:
('hadoop', 1)
('hive', 1)
('mapreduce', 1)
('spark', 1)
('spark', 1)
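sortByKey() also accepts an ascending flag; a quick sketch of a descending sort on the same data:
pairRDD.sortByKey(ascending=False).foreach(print)  # keys in descending order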
mapValues(): apply a given function to the value of every key-value pair, leaving the keys unchanged
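A minimal sketch of mapValues() on the same pairRDD (the multiplier 10 is an arbitrary choice):
pairRDD.mapValues(lambda v: v * 10).foreach(print)  # transforms only the values, keys stay the same
This would print pairs such as ('spark', 10), with the original keys preserved.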
Actions:
countByKey(): count the number of elements for each key, returned to the driver as a dict
pairRDD.countByKey()
Result:
defaultdict(<class 'int'>, {'hadoop': 1, 'spark': 2, 'hive': 1, 'mapreduce': 1})
lookup(key): return all the values associated with the given key, as a list
pairRDD.lookup('spark')
Result:
[1, 1]