*ByKey Operations
These operations apply to RDDs of (key, value) pairs.
Operation | Description |
---|---|
sortByKey | sorts the data by key |
reduceByKey | merges the values for each key |
reduceByKeyLocally | merges the values for each key and returns the result to the master as a dictionary |
sampleByKey | returns a sampled subset of the RDD |
subtractByKey | returns the pairs whose keys are not present in the other RDD |
aggregateByKey | aggregates the values for each key |
combineByKey | merges the values for each key with custom combiner functions |
countByKey | counts the elements for each key |
foldByKey | merges the values for each key starting from a zero value |
groupByKey | groups the values by key |
sortByKey
sortByKey(ascending=True, numPartitions=None, keyfunc=<function RDD.<lambda>>)
Example:
sc = spark.sparkContext
tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5), ('-2', 'test')]
res = sc.parallelize(tmp).sortByKey(False)
print(res.collect())
res = sc.parallelize(tmp).sortByKey(True)
print(res.collect())
Output:
[('d', 4), ('b', 2), ('a', 1), ('2', 5), ('1', 3), ('-2', 'test')]
[('-2', 'test'), ('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)]
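The keyfunc parameter is not exercised above. As a minimal sketch (tmp2 is a made-up dataset), keyfunc can normalize each key before comparison, e.g. for a case-insensitive sort:
tmp2 = [('Mary', 1), ('had', 2), ('a', 3), ('little', 4), ('lamb', 5)]
# keyfunc lowercases each key before comparison, so 'Mary' sorts among the lowercase keys
res = sc.parallelize(tmp2).sortByKey(True, 1, keyfunc=lambda k: k.lower())
print(res.collect())
[('a', 3), ('had', 2), ('lamb', 5), ('little', 4), ('Mary', 1)]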
reduceByKey
Merge the values for each key using an associative and commutative reduce function.
Example:
from operator import add
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
print(sorted(rdd.reduceByKey(add).collect()))
Result:
[('a', 2), ('b', 1)]
reduceByKeyLocally
reduceByKeyLocally(func)
Note: the result is returned to the master node.
Merge the values for each key using an associative and commutative reduce function, but return the results immediately to the master as a dictionary.
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
print(rdd.reduceByKeyLocally(add))
{'a': 2, 'b': 1}
sampleByKey
sampleByKey(withReplacement, fractions, seed=None)
- withReplacement: whether to sample with replacement
- fractions: the sampling rate for each key
- seed: the random seed
Return a subset of this RDD sampled by key (via stratified sampling). Create a sample of this RDD using variable sampling rates for different keys as specified by fractions, a key to sampling rate map.
fractions = {"a": 0.2, "b": 0.1}
rdd = sc.parallelize(fractions.keys()).cartesian(sc.parallelize(range(0, 10)))
sample = rdd.sampleByKey(True, fractions, 2)
print(sample.collect())
sample = rdd.sampleByKey(False, fractions, 2)
print(sample.collect())
[('a', 7), ('b', 1)]
[('a', 2), ('a', 9), ('b', 8)]
groupByKey
groupByKey(numPartitions=None, partitionFunc=<function portable_hash>)
Group the values for each key in the RDD into a single sequence. Hash-partitions the resulting RDD with numPartitions partitions.
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
print(sorted(rdd.groupByKey().mapValues(len).collect()))
print(sorted(rdd.groupByKey().mapValues(list).collect()))
[('a', 2), ('b', 1)]
[('a', [1, 1]), ('b', [1])]
An interesting demo from the official docs:
fractions = {"a": 0.2, "b": 0.1}
rdd = sc.parallelize(fractions.keys()).cartesian(sc.parallelize(range(0, 10)))
sample = dict(rdd.sampleByKey(False, fractions, 2).groupByKey().collect())
print(sample)
print(type(sample['a']))
print(list(sample['a']))
{'b': <pyspark.resultiterable.ResultIterable object at 0x1159ff890>, 'a': <pyspark.resultiterable.ResultIterable object at 0x1159ff2d0>}
<class 'pyspark.resultiterable.ResultIterable'>
[2, 9]
subtractByKey
subtractByKey(other, numPartitions=None)
Return each (key, value) pair in self that has no pair with matching key in other.
x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 2)])
y = sc.parallelize([("a", 3), ("c", None)])
print(sorted(x.subtractByKey(y).collect()))
[('b', 4), ('b', 5)]
aggregateByKey
aggregateByKey(zeroValue, seqFunc, combFunc, numPartitions=None, partitionFunc=<function portable_hash>)
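aggregateByKey generalizes reduceByKey: the merge starts from a zeroValue accumulator whose type may differ from the value type; seqFunc folds a value into the accumulator within a partition, and combFunc merges accumulators across partitions. A minimal sketch, using a made-up (sum, count) accumulator to collect per-key sums and counts:
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
# zeroValue (0, 0) is the initial (sum, count) accumulator for each key
seq_func = lambda acc, v: (acc[0] + v, acc[1] + 1)    # fold one value into the accumulator
comb_func = lambda a, b: (a[0] + b[0], a[1] + b[1])   # merge two partial accumulators
print(sorted(rdd.aggregateByKey((0, 0), seq_func, comb_func).collect()))
[('a', (3, 2)), ('b', (1, 1))]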
combineByKey
combineByKey(createCombiner, mergeValue, mergeCombiners, numPartitions=None, partitionFunc=<function portable_hash>)
Merges the values for each key using a set of custom functions.
Three functions are involved:
- createCombiner: turns the first value for a key into an accumulator
- mergeValue: merges a value into the accumulator within a partition
- mergeCombiners: merges accumulators across partitions
x = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])

def to_list(a):
    return [a]

def append(a, b):
    a.append(b)
    return a

def extend(a, b):
    a.extend(b)
    return a

res = sorted(x.combineByKey(to_list, append, extend).collect())
print(res)
[('a', [1, 2]), ('b', [1])]
countByKey
Counts the number of elements for each key and returns the result to the master as a dictionary.
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
print(rdd.countByKey())
defaultdict(<class 'int'>, {'a': 2, 'b': 1})
foldByKey
foldByKey(zeroValue, func, numPartitions=None, partitionFunc=<function portable_hash>)
Merges the values for each key.
- zeroValue: the neutral initial value merged with each key's values
- func: the function applied to the values
Merge the values for each key using an associative function “func” and a neutral “zeroValue” which may be added to the result an arbitrary number of times, and must not change the result (e.g., 0 for addition, or 1 for multiplication.).
rdd = sc.parallelize([("a", 3), ("b", 100), ("a", 5)])
print(sorted(rdd.foldByKey(0, add).collect()))
[('a', 8), ('b', 100)]