*ByKey Operations
These operations apply to RDDs of (key, value) pairs.
Operation | Description |
---|---|
sortByKey | sorts the data by key |
reduceByKey | merges the values for each key |
reduceByKeyLocally | merges the values for each key and returns the result to the master as a dictionary |
sampleByKey | returns a sampled subset of the RDD |
subtractByKey | returns the pairs whose keys are not present in the other RDD |
aggregateByKey | aggregates the values for each key |
combineByKey | merges the values for each key with custom combiner functions |
countByKey | counts the elements for each key |
foldByKey | merges the values for each key starting from a zero value |
groupByKey | groups the values by key |
sortByKey
sortByKey(ascending=True, numPartitions=None, keyfunc=<function RDD.<lambda>>)
Example:
sc = spark.sparkContext
tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5), ('-2', 'test')]
res = sc.parallelize(tmp).sortByKey(False)
print(res.collect())
res = sc.parallelize(tmp).sortByKey(True)
print(res.collect())
Output:
[('d', 4), ('b', 2), ('a', 1), ('2', 5), ('1', 3), ('-2', 'test')]
[('-2', 'test'), ('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)]
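The keyfunc parameter is not exercised above. As a minimal sketch (tmp2 is a made-up dataset), keyfunc can normalize each key before comparison, e.g. for a case-insensitive sort:
tmp2 = [('Mary', 1), ('had', 2), ('a', 3), ('little', 4), ('lamb', 5)]
# keyfunc lowercases each key before comparison, so 'Mary' sorts among the lowercase keys
res = sc.parallelize(tmp2).sortByKey(True, 1, keyfunc=lambda k: k.lower())
print(res.collect())
[('a', 3), ('had', 2), ('lamb', 5), ('little', 4), ('Mary', 1)]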
reduceByKey
Merge the values for each key using an associative and commutative reduce function.
Example:
from operator import add
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
print(sorted(rdd.reduceByKey(add).collect()))
Result:
[('a', 2), ('b', 1)]
reduceByKeyLocally
reduceByKeyLocally(func)
Note: the result is returned to the master node.
Merge the values for each key using an associative and commutative reduce function, but return the results immediately to the master as a dictionary.
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
print(rdd.reduceByKeyLocally(add))
{'a': 2, 'b': 1}
sampleByKey
sampleByKey(withReplacement, fractions, seed=None)
- withReplacement: whether to sample with replacement
- fractions: the sampling rate for each key
- seed: the random seed
Return a subset of this RDD sampled by key (via stratified sampling). Create a sample of this RDD using variable sampling rates for different keys as specified by fractions, a key to sampling rate map.
fractions = {"a": 0.2, "b": 0.1}
rdd = sc.parallelize(fractions.keys()).cartesian(sc.parallelize(range(0, 10)))
sample = rdd.sampleByKey(True, fractions, 2)
print(sample.collect())
sample = rdd.sampleByKey(False, fractions, 2)
print(sample.collect())
[('a', 7), ('b', 1)]
[('a', 2), ('a', 9), ('b', 8)]
groupByKey
groupByKey(numPartitions=None, partitionFunc=<function portable_hash>)
Group the values for each key in the RDD into a single sequence. Hash-partitions the resulting RDD with numPartitions partitions.
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
print(sorted(rdd.groupByKey().mapValues(len).collect()))
print(sorted(rdd.groupByKey().mapValues(list).collect()))
[('a', 2), ('b', 1)]
[('a', [1, 1]), ('b', [1])]
An interesting demo from the official docs:
fractions = {"a": 0.2, "b": 0.1}
rdd = sc.parallelize(fractions.keys()).cartesian(sc.parallelize(range(0, 10)))
sample = dict(rdd.sampleByKey(False, fractions, 2).groupByKey().collect())
print(sample)
print(type(sample['a']))
print(list(sample['a']))
{'b': <pyspark.resultiterable.ResultIterable object at 0x1159ff890>, 'a': <pyspark.resultiterable.ResultIterable object at 0x1159ff2d0>}
<class 'pyspark.resultiterable.ResultIterable'>
[2, 9]
subtractByKey
subtractByKey(other, numPartitions=None)
Return each (key, value) pair in self that has no pair with matching key in other.
x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 2)])
y = sc.parallelize([("a", 3), ("c", None)])
print(sorted(x.subtractByKey(y).collect()))
[('b', 4), ('b', 5)]
aggregateByKey
aggregateByKey(zeroValue, seqFunc, combFunc, numPartitions=None, partitionFunc=<function portable_hash>)
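aggregateByKey generalizes reduceByKey: the merge starts from a zeroValue accumulator whose type may differ from the value type; seqFunc folds a value into the accumulator within a partition, and combFunc merges accumulators across partitions. A minimal sketch, using a made-up (sum, count) accumulator to collect per-key sums and counts:
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
# zeroValue (0, 0) is the initial (sum, count) accumulator for each key
seq_func = lambda acc, v: (acc[0] + v, acc[1] + 1)    # fold one value into the accumulator
comb_func = lambda a, b: (a[0] + b[0], a[1] + b[1])   # merge two partial accumulators
print(sorted(rdd.aggregateByKey((0, 0), seq_func, comb_func).collect()))
[('a', (3, 2)), ('b', (1, 1))]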
combineByKey
combineByKey(createCombiner, mergeValue, mergeCombiners, numPartitions=None, partitionFunc=<function portable_hash>)
Merges the values for each key using a set of custom functions.
Three functions are involved:
- createCombiner: turns the first value for a key into an accumulator
- mergeValue: merges a value into the accumulator within a partition
- mergeCombiners: merges accumulators across partitions
x = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])

def to_list(a):
    return [a]

def append(a, b):
    a.append(b)
    return a

def extend(a, b):
    a.extend(b)
    return a

res = sorted(x.combineByKey(to_list, append, extend).collect())
print(res)
[('a', [1, 2]), ('b', [1])]
countByKey
Counts the number of elements for each key and returns the result to the master as a dictionary.
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
print(rdd.countByKey())
defaultdict(<class 'int'>, {'a': 2, 'b': 1})
foldByKey
foldByKey(zeroValue, func, numPartitions=None, partitionFunc=<function portable_hash>)
Merges the values for each key.
- zeroValue: the neutral initial value merged with each key's values
- func: the function applied to the values
Merge the values for each key using an associative function “func” and a neutral “zeroValue” which may be added to the result an arbitrary number of times, and must not change the result (e.g., 0 for addition, or 1 for multiplication.).
rdd = sc.parallelize([("a", 3), ("b", 100), ("a", 5)])
print(sorted(rdd.foldByKey(0, add).collect()))
[('a', 8), ('b', 100)]