Commonly Used Spark Python APIs

__WILL

Category: Big Data and Distributed Systems. Tags: spark

Article link: https://blog.csdn.net/u010560443/article/details/50611233

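Summary (auto-generated by CSDN): This article introduces Spark's Python API in detail, covering transformations such as map, filter, and flatMap, and actions such as count, collect, and reduce. Worked examples show how each operation is used and what it returns, to help readers understand how to process data in Spark.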

1. Overview

Transformations

Actions

2. Examples

2.1 Transformations

map(f, preservesPartitioning=False)

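Maps an RDD[T] to an RDD[U] by applying the closure f to each element. The number of elements and the number of partitions stay the same (a one-to-one mapping).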

>>> rdd = sc.parallelize(["b", "a", "c"])
>>> sorted(rdd.map(lambda x: (x, 1)).collect())
[('a', 1), ('b', 1), ('c', 1)]
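
The one-to-one behaviour can be pictured without a cluster. The following is a minimal pure-Python sketch, not Spark itself: partitions are modelled as a list of lists, and `model_map` and the two-partition layout are hypothetical names and data chosen only for illustration.

```python
# Pure-Python model of RDD.map (illustration only, does not use Spark).
# A "partitioned dataset" is modelled as a list of lists.

def model_map(partitions, f):
    """Apply f to every element, keeping each element in its partition."""
    return [[f(x) for x in part] for part in partitions]

# Hypothetical layout: the same three elements spread over two partitions.
partitions = [["b"], ["a", "c"]]
mapped = model_map(partitions, lambda x: (x, 1))

# The partition count and the per-partition element counts are unchanged.
same_layout = [len(p) for p in mapped] == [len(p) for p in partitions]

# Flattening the partitions plays the role of collect().
collected = sorted(x for part in mapped for x in part)
print(same_layout)  # True
print(collected)    # [('a', 1), ('b', 1), ('c', 1)]
```

In this model, as in the description above, every input element produces exactly one output element and never moves to another partition.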

mapPartitions(f, preservesPartitioning=False)

Return a new RDD by applying a function to each partition of this RDD.

>>> rdd = sc.parallelize([1, 2, 3, 4], 2)
>>> def f(iterator): yield sum(iterator)
>>> rdd.mapPartitions(f).collect()
[3, 7]
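
The difference from map is that f receives an iterator over a whole partition rather than a single element. A rough pure-Python sketch of that contract follows; `model_map_partitions` is a hypothetical helper, and the split into `[1, 2]` and `[3, 4]` is an assumption that is consistent with the `[3, 7]` result shown above.

```python
# Pure-Python model of RDD.mapPartitions (illustration only, no Spark).
from itertools import chain

def model_map_partitions(partitions, f):
    """Call f once per partition, passing an iterator over its elements."""
    return [list(f(iter(part))) for part in partitions]

def partition_sum(iterator):
    # f may yield any number of results per partition; here it yields one.
    yield sum(iterator)

# Assumed layout of [1, 2, 3, 4] split into 2 partitions.
partitions = [[1, 2], [3, 4]]
result = model_map_partitions(partitions, partition_sum)

# Flattening the per-partition outputs plays the role of collect().
collected = list(chain.from_iterable(result))
print(collected)  # [3, 7]
```

Because f runs once per partition, this style suits work with a per-call setup cost, such as opening a connection, that should not be repeated for every element.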

partitionBy(numPartitions, partitionFunc=)

Return a copy of the RDD partitioned using the specified partitioner.

>>> pairs = sc.parallelize([1, 2, 3, 4, 2, 4, 1]).map(lambda x: (x, x))
>>> sets = pairs.partitionBy(2).glom().collect()
>>> len(set(sets[0]).intersection(set(sets[1])))
0
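
The empty intersection in the example follows from the fact that each key is routed to exactly one partition. Below is a simplified pure-Python sketch of that idea. It assumes a key is placed in bucket `partition_func(key) % num_partitions` and uses Python's built-in `hash` as a stand-in for Spark's default partition function; `model_partition_by` is a hypothetical name.

```python
# Pure-Python model of RDD.partitionBy (illustration only, no Spark).

def model_partition_by(pairs, num_partitions, partition_func=hash):
    """Route each (key, value) pair to a bucket chosen from its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        # Assumption: the bucket index is partition_func(key) % num_partitions.
        partitions[partition_func(key) % num_partitions].append((key, value))
    return partitions

pairs = [(x, x) for x in [1, 2, 3, 4, 2, 4, 1]]
sets = model_partition_by(pairs, 2)   # plays the role of glom().collect()

print(sets[0])  # [(2, 2), (4, 4), (2, 2), (4, 4)]
print(sets[1])  # [(1, 1), (3, 3), (1, 1)]

# No pair appears in both partitions, because equal keys share a bucket.
overlap = len(set(sets[0]).intersection(set(sets[1])))
print(overlap)  # 0
```

Duplicated keys such as 2 and 4 end up together in the same partition, which is what makes per-key operations possible without further data movement.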

mapValues(f)

Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD’s partitioning.

>>> x = sc.parallelize([("a", ["apple", "banana", "lemon"]), ("b", ["grapes"])])
>>> def f(x): return len(x)
>>> x.mapValues(f).collect()
[('a', 3), ('b', 1)]
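
To make the "keys unchanged" guarantee concrete, here is a small pure-Python sketch, not Spark itself. `model_map_values` is a hypothetical helper that applies f to the value of each pair and passes the key through untouched.

```python
# Pure-Python model of RDD.mapValues (illustration only, no Spark).

def model_map_values(pairs, f):
    """Apply f to each value; the key of every pair is left as-is."""
    return [(key, f(value)) for key, value in pairs]

x = [("a", ["apple", "banana", "lemon"]), ("b", ["grapes"])]
result = model_map_values(x, len)
print(result)  # [('a', 3), ('b', 1)]

# The keys come out in the same order and with the same values as they went in.
keys_unchanged = [k for k, _ in result] == [k for k, _ in x]
print(keys_unchanged)  # True
```

Since f never sees or alters the keys, pairs that were grouped by key before the call are still grouped the same way afterwards, which is why the original partitioning can be retained.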
