1. Value type (valueType)

map: map(func)

Maps each data item of the source RDD to a new element through the user-defined function f passed to map. In the (older) Spark source code, the map operator amounts to initializing a new RDD, MappedRDD(this, sc.clean(f)).
# Apply func to every element of the dataset and return a new RDD
rdd1 = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9], 3)
rdd2 = rdd1.map(lambda x: x + 1)
rdd2.collect()  # [2, 3, 4, 5, 6, 7, 8, 9, 10]

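A further sketch (assuming the same SparkContext sc as above): map is applied element-wise and preserves the partitioning, which glom() makes visible.

rdd1 = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9], 3)
# glom() gathers each partition into a list, exposing the layout
print(rdd1.glom().collect())                       # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(rdd1.map(lambda x: x + 1).glom().collect())  # [[2, 3, 4], [5, 6, 7], [8, 9, 10]]
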
groupBy

Groups the elements by the key that func computes for each element, returning a new RDD of (key, iterable-of-elements) pairs.
x = sc.parallelize([1, 2, 3])
y = x.groupBy(lambda x: 'A' if (x % 2 == 1) else 'B')
print(y.mapValues(list).collect())  # [('A', [1, 3]), ('B', [2])]

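A small follow-up sketch, assuming the same x and y as above: each group can be aggregated directly with mapValues.

# Sum the values inside each group (the iterable supports sum())
sums = y.mapValues(lambda vals: sum(vals))
print(sums.collect())  # [('A', 4), ('B', 2)]
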
filter

Selects all elements for which func returns true and returns them as a new RDD.
rdd1 = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9], 3)
rdd2 = rdd1.map(lambda x: x*2)
rdd3 = rdd2.filter(lambda x: x>4)
rdd3.collect()  # [6, 8, 10, 12, 14, 16, 18]

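One more sketch on the same rdd3: filter keeps the original partitioning, so after filtering some partitions may hold fewer (or no) elements.

# Each inner list is one partition of the filtered RDD
rdd3.glom().collect()  # [[6], [8, 10, 12], [14, 16, 18]]
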
flatMap

flatMap first performs the map operation and then flattens all the resulting collections into a single RDD.
rdd1 = sc.parallelize(["a b c", "d e f", "h i j"])
rdd2 = rdd1.flatMap(lambda x: x.split(" "))
rdd3 = rdd1.map(lambda x: x.split(" "))
rdd2.collect()  # ['a', 'b', 'c', 'd', 'e', 'f', 'h', 'i', 'j']
rdd3.collect()  # [['a', 'b', 'c'], ['d', 'e', 'f'], ['h', 'i', 'j']]

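A typical use, sketched on the same rdd1: flatMap is the first half of the word-count pattern, pairing each word with a 1 (the reduceByKey half appears later in this section).

# Split lines into words, then tag each word with a count of 1
pairs = rdd1.flatMap(lambda line: line.split(" ")).map(lambda w: (w, 1))
pairs.collect()
# [('a', 1), ('b', 1), ('c', 1), ('d', 1), ('e', 1), ('f', 1), ('h', 1), ('i', 1), ('j', 1)]
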
2. Double value type (DoubleValueType)

union: takes the union of two RDDs
rdd1 = sc.parallelize([('a', 1), ('b', 2)])
rdd2 = sc.parallelize([('c', 1), ('b', 3)])
rdd3 = rdd1.union(rdd2)
rdd3.collect()  # [('a', 1), ('b', 2), ('c', 1), ('b', 3)]

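Note, sketched on the same rdd1: union does not deduplicate; chain distinct() if repeats should be removed (output order may vary).

rdd_dup = rdd1.union(rdd1)
rdd_dup.collect()             # [('a', 1), ('b', 2), ('a', 1), ('b', 2)]
rdd_dup.distinct().collect()  # [('a', 1), ('b', 2)]
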
intersection: takes the intersection of two RDDs
rdd1 = sc.parallelize([('a', 1), ('b', 2)])
rdd2 = sc.parallelize([('c', 1), ('b', 3)])
rdd3 = rdd1.union(rdd2)
rdd4 = rdd3.intersection(rdd2)
rdd4.collect()  # [('c', 1), ('b', 3)]

3. Key-value type

groupByKey

Groups by key, where the key is the 0th element of each tuple, and returns a new RDD.
rdd1 = sc.parallelize([('a', 1), ('b', 2)])
rdd2 = sc.parallelize([('c', 1), ('b', 3)])
rdd3 = rdd1.union(rdd2)
rdd4 = rdd3.groupByKey()
result = rdd4.collect()
# [('a', <pyspark.resultiterable.ResultIterable object at 0x7fba6a5e5898>),
#  ('c', <pyspark.resultiterable.ResultIterable object at 0x7fba6a5e5518>),
#  ('b', <pyspark.resultiterable.ResultIterable object at 0x7fba6a5e5f28>)]

# In the result of groupByKey, each value is an Iterable
>>> result[2]
('b', <pyspark.resultiterable.ResultIterable object at 0x7fba6c18e518>)
>>> result[2][1]
<pyspark.resultiterable.ResultIterable object at 0x7fba6c18e518>
>>> list(result[2][1])
[2, 3]

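To materialize all the grouped values as plain lists at once, a common follow-up (a sketch, using the same rdd4 as above) is mapValues(list):

rdd4.mapValues(list).collect()  # [('a', [1]), ('c', [1]), ('b', [2, 3])]
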
reduceByKey

Combines the key-value pairs that share the same key using the given function.

>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> rdd.reduceByKey(lambda x, y: x + y).collect()
[('b', 1), ('a', 2)]

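Putting the pieces together, a sketch reusing the word-count pattern from the flatMap example: flatMap splits lines into words, map pairs each word with 1, and reduceByKey sums the counts (output order may vary).

lines = sc.parallelize(["a b c", "d e f", "a b"])
counts = lines.flatMap(lambda s: s.split(" ")) \
              .map(lambda w: (w, 1)) \
              .reduceByKey(lambda x, y: x + y)
counts.collect()  # [('a', 2), ('b', 2), ('c', 1), ('d', 1), ('e', 1), ('f', 1)]
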
sortByKey

Sorts the RDD by key.

sortByKey(ascending=True, numPartitions=None, keyfunc=<function RDD.<lambda>>)
>>> tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
>>> sc.parallelize(tmp).sortByKey().first()
('1', 3)
>>> sc.parallelize(tmp).sortByKey(True, 1).collect()
[('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)]
>>> sc.parallelize(tmp).sortByKey(True, 2).collect()
[('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)]
>>> tmp2 = [('Mary', 1), ('had', 2), ('a', 3), ('little', 4), ('lamb', 5)]
>>> tmp2.extend([('whose', 6), ('fleece', 7), ('was', 8), ('white', 9)])
>>> sc.parallelize(tmp2).sortByKey(True, 3, keyfunc=lambda k: k.lower()).collect()
[('a', 3), ('fleece', 7), ('had', 2), ('lamb', 5), ...('white', 9), ('whose', 6)]

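For descending order, a quick sketch on the same tmp list: pass ascending=False.

>>> sc.parallelize(tmp).sortByKey(False).collect()
[('d', 4), ('b', 2), ('a', 1), ('2', 5), ('1', 3)]
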
countByValue

Counts how many times each value occurs in the RDD and returns the result to the driver as a dictionary of (value, count) pairs.

>>> x = sc.parallelize([1, 3, 1, 2, 3])
>>> y = x.countByValue()
>>> print(x.collect())
[1, 3, 1, 2, 3]
>>> print(y)
defaultdict(<class 'int'>, {1: 2, 3: 2, 2: 1})

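Since countByValue is an action that pulls the whole result back to the driver, an equivalent distributed sketch for large RDDs (assuming the same x as above; output order may vary) keeps the counts as an RDD instead:

>>> x.map(lambda v: (v, 1)).reduceByKey(lambda a, b: a + b).collect()
[(1, 2), (2, 1), (3, 2)]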