Spark RDD *ByKey Operations

*ByKey Operations

These operations apply to RDDs whose elements are (key, value) pairs.

Overview of the operations:

  • sortByKey: sort the data by key
  • reduceByKey: merge the values for each key
  • reduceByKeyLocally: merge the values for each key and return the result to the master as a dictionary
  • sampleByKey: return a sampled subset of the RDD
  • subtractByKey: return the pairs whose keys do not appear in the other RDD
  • aggregateByKey: aggregate the values for each key with an aggregation function
  • combineByKey: merge the values for each key with custom combiner functions
  • countByKey: count the number of elements for each key
  • foldByKey: merge the values for each key with a neutral zero value
  • groupByKey: group the values for each key
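All examples below assume a running SparkSession named spark; a minimal local setup might look like the following sketch (the master URL and app name are illustrative):

from pyspark.sql import SparkSession

# Build a local SparkSession and grab its SparkContext for the RDD examples.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("bykey-demo") \
    .getOrCreate()
sc = spark.sparkContext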

sortByKey

sortByKey(ascending=True, numPartitions=None, keyfunc=<function RDD.<lambda>>)
Example:

sc = spark.sparkContext
tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5), ('-2', 'test')]
res = sc.parallelize(tmp).sortByKey(False)   # descending by key
print(res.collect())
res = sc.parallelize(tmp).sortByKey(True)    # ascending by key
print(res.collect())

Output:

[('d', 4), ('b', 2), ('a', 1), ('2', 5), ('1', 3), ('-2', 'test')]
[('-2', 'test'), ('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)]
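The keyfunc argument transforms each key before comparison. A minimal sketch of a case-insensitive sort, mirroring the example in the official docs:

tmp2 = [('Mary', 1), ('had', 2), ('a', 3), ('little', 4), ('lamb', 5)]
res = sc.parallelize(tmp2).sortByKey(True, 1, keyfunc=lambda k: k.lower())
print(res.collect())
# [('a', 3), ('had', 2), ('lamb', 5), ('little', 4), ('Mary', 1)]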

reduceByKey

Merge the values for each key using an associative and commutative reduce function.

Example:

from operator import add

rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
print(sorted(rdd.reduceByKey(add).collect()))

Output:

[('a', 2), ('b', 1)]

reduceByKeyLocally(func)

Note: the result is returned to the master node.
Merge the values for each key using an associative and commutative reduce function, but return the results immediately to the master as a dictionary.

rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
print(rdd.reduceByKeyLocally(add))

{'a': 2, 'b': 1}

sampleByKey

sampleByKey(withReplacement, fractions, seed=None)

  • withReplacement: whether to sample with replacement
  • fractions: the sampling rate for each key
  • seed: random seed

Return a subset of this RDD sampled by key (via stratified sampling). Create a sample of this RDD using variable sampling rates for different keys as specified by fractions, a key to sampling rate map.

fractions = {"a": 0.2, "b": 0.1}
rdd = sc.parallelize(fractions.keys()).cartesian(sc.parallelize(range(0, 10)))
sample = rdd.sampleByKey(True, fractions, 2)    # with replacement
print(sample.collect())
sample = rdd.sampleByKey(False, fractions, 2)   # without replacement
print(sample.collect())

[('a', 7), ('b', 1)]
[('a', 2), ('a', 9), ('b', 8)]

groupByKey

groupByKey(numPartitions=None, partitionFunc=<function portable_hash>)

Group the values for each key in the RDD into a single sequence. Hash-partitions the resulting RDD with numPartitions partitions.

rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
print(sorted(rdd.groupByKey().mapValues(len).collect()))
print(sorted(rdd.groupByKey().mapValues(list).collect()))

[('a', 2), ('b', 1)]
[('a', [1, 1]), ('b', [1])]

An interesting demo from the official docs:

fractions = {"a": 0.2, "b": 0.1}
rdd = sc.parallelize(fractions.keys()).cartesian(sc.parallelize(range(0, 10)))
sample = dict(rdd.sampleByKey(False, fractions, 2).groupByKey().collect())

print(sample)
print(type(sample['a']))
print(list(sample['a']))

{'b': <pyspark.resultiterable.ResultIterable object at 0x1159ff890>, 'a': <pyspark.resultiterable.ResultIterable object at 0x1159ff2d0>}
<class 'pyspark.resultiterable.ResultIterable'>
[2, 9]

subtractByKey(other, numPartitions=None)

Return each (key, value) pair in self that has no pair with matching key in other.

>>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 2)])
>>> y = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(x.subtractByKey(y).collect())
[('b', 4), ('b', 5)]

aggregateByKey

aggregateByKey(zeroValue, seqFunc, combFunc, numPartitions=None, partitionFunc=<function portable_hash>)
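Aggregate the values of each key using a neutral zeroValue plus two functions: seqFunc merges one value into the accumulator within a partition, and combFunc merges two accumulators across partitions. The accumulator type may differ from the value type. A minimal sketch that builds a per-key (sum, count) pair (the helper names seq and comb are illustrative):

rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])

def seq(acc, v):
    # merge one value into the (sum, count) accumulator within a partition
    return (acc[0] + v, acc[1] + 1)

def comb(a, b):
    # merge two (sum, count) accumulators from different partitions
    return (a[0] + b[0], a[1] + b[1])

print(sorted(rdd.aggregateByKey((0, 0), seq, comb).collect()))
# [('a', (3, 2)), ('b', (1, 1))]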

combineByKey

combineByKey(createCombiner, mergeValue, mergeCombiners, numPartitions=None, partitionFunc=<function portable_hash>)

Merge the values for each key using user-defined functions. It takes three functions:

  • createCombiner: turns a value V into the accumulator type C (creates a one-element combiner)
  • mergeValue: merges a value V into a combiner C within a partition
  • mergeCombiners: merges two combiners C across partitions

x = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])

def to_list(a):
    return [a]

def append(a, b):
    a.append(b)
    return a

def extend(a, b):
    a.extend(b)
    return a

res = sorted(x.combineByKey(to_list, append, extend).collect())
print(res)

[('a', [1, 2]), ('b', [1])]

countByKey

Count the number of elements for each key, returning the result to the master as a dictionary.

rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
print(rdd.countByKey())

defaultdict(<class 'int'>, {'a': 2, 'b': 1})

foldByKey

foldByKey(zeroValue, func, numPartitions=None, partitionFunc=<function portable_hash>)

Merge the values for each key.

  • zeroValue: a neutral initial value that is merged into each key's values
  • func: the function applied to the values

Merge the values for each key using an associative function "func" and a neutral "zeroValue" which may be added to the result an arbitrary number of times, and must not change the result (e.g., 0 for addition, or 1 for multiplication).

rdd = sc.parallelize([("a", 3), ("b", 100), ("a", 5)])
print(sorted(rdd.foldByKey(0, add).collect()))

[('a', 8), ('b', 100)]
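The zero value must be neutral for the chosen func: 0 for addition, 1 for multiplication. A minimal sketch of the multiplication case, reusing the rdd defined above:

from operator import mul

print(sorted(rdd.foldByKey(1, mul).collect()))
# [('a', 15), ('b', 100)]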
