[PySpark Taming] A Practical Guide to Using RDDs in PySpark
🎯 1. Introduction
Spark programs are built from two kinds of operations: transformations and actions. Transformations only define the computation logic and are not executed immediately (the "lazy" property); their purpose is to turn one RDD into a new RDD. Actions do not produce a new RDD; they trigger the actual execution and return the result we want.
💡 2. Code Usage
2.1 Common RDD transformations
- map: similar to Python's built-in map; applies a function to each element of the RDD and returns a new RDD. Call collect() when you need the actual values.
- reduce: repeatedly combines the first two elements of the RDD with the given function, replacing them with the result, until one value remains. Note that reduce returns a plain Python object, not an RDD.
- reduceByKey: merges the values of each key using the given function.
- filter: keeps only the elements that satisfy a condition.
- distinct: removes duplicate elements.
- join: merges two or more key-value RDDs on the key, similar to a table join in SQL.
- union() and intersection(): the former concatenates two RDDs without deduplication; the latter returns the intersection of the two RDDs.
The code is as follows:
from pyspark import SparkContext
sc = SparkContext("local", "reduce example")
rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.reduce(lambda a, b : a + b)) # output 15
rdd_2 = sc.parallelize([(1, 10), (2, 20), (3, 30), (1, 1), (3, 4)])
print(rdd_2.reduceByKey(lambda x, y: x + y).collect()) # [(1, 11), (2, 20), (3, 34)] (order may vary)
# map and reduce are the most commonly used RDD functions, often to serialize results before saving, e.g.:
df_user_query.rdd.map(lambda x: json.dumps((x[0], x[1])))
2.2 Common action operations
- collect(): return all elements of the dataset as a list
- count(): return the number of elements in the dataset
- take(n): return the first n elements of the dataset
- takeOrdered(n): return the n smallest elements (ascending order)
- takeOrdered(n, lambda x: -x): return the n largest elements (descending order)
- first(): return the first element of the dataset
- min(): return the minimum value
- max(): return the maximum value
- stdev(): compute the standard deviation
- sum(): compute the sum
- mean(): compute the mean
- countByKey(): count the number of records for each key
- lookup(key): return the values associated with the given key
- foreach(func): apply func to every element of the collection
The code is as follows:
1. map(func)
from pyspark import SparkContext
sc = SparkContext("local", "map example")
nums = sc.parallelize([1, 2, 3, 4, 5])
squares = nums.map(lambda x: x*x)
print(squares.collect()) # output [1, 4, 9, 16, 25]
2. filter(func)
from pyspark import SparkContext
sc = SparkContext("local", "filter example")
nums = sc.parallelize([1, 2, 3, 4, 5])
even_nums = nums.filter(lambda x: x % 2 == 0)
print(even_nums.collect()) # output [2, 4]
3. flatMap(func)
from pyspark import SparkContext
sc = SparkContext("local", "flatMap example")
words = sc.parallelize(["Hello world", "Goodbye world"])
split_words = words.flatMap(lambda x: x.split(" "))
print(split_words.collect()) # output ['Hello', 'world', 'Goodbye', 'world']
4. distinct(numPartitions=None)
from pyspark import SparkContext
sc = SparkContext("local", "distinct example")
nums = sc.parallelize([1, 2, 3, 3, 4, 4, 5])
distinct_nums = nums.distinct()
print(distinct_nums.collect()) # output [1, 2, 3, 4, 5] (order may vary)
5. groupByKey(numPartitions=None)
from pyspark import SparkContext
sc = SparkContext("local", "groupByKey example")
pairs = sc.parallelize([(1, 2), (2, 3), (1, 4), (3, 5)])
grouped_pairs = pairs.groupByKey()
print(grouped_pairs.collect()) # output [(1, <ResultIterable>), (2, <ResultIterable>), (3, <ResultIterable>)]
6. reduceByKey(func, numPartitions=None)
from pyspark import SparkContext
sc = SparkContext("local", "reduceByKey example")
pairs = sc.parallelize([(1, 2), (2, 3), (1, 4), (3, 5)])
sum_by_key = pairs.reduceByKey(lambda x, y: x + y)
print(sum_by_key.collect()) # output [(1, 6), (2, 3), (3, 5)] (order may vary)
7. sortByKey(ascending=True, numPartitions=None, keyfunc=lambda x: x)
from pyspark import SparkContext
sc = SparkContext("local", "sortByKey example")
pairs = sc.parallelize([(1, "apple"), (3, "banana"), (2, "orange")])
sorted_pairs = pairs.sortByKey()
print(sorted_pairs.collect()) # output [(1, 'apple'), (2, 'orange'), (3, 'banana')]
8. join(other, numPartitions=None)
from pyspark import SparkContext
sc = SparkContext("local", "join example")
names = sc.parallelize([(1, "Alice"), (2, "Bob"), (3, "Charlie")])
scores = sc.parallelize([(1, 80), (2, 90), (3, 85)])
joined_data = names.join(scores)
print(joined_data.collect()) # output [(1, ('Alice', 80)), (2, ('Bob', 90)), (3, ('Charlie', 85))] (order may vary)
9. union(other)
from pyspark import SparkContext
sc = SparkContext("local", "union example")
nums1 = sc.parallelize([1, 2, 3])
nums2 = sc.parallelize([3, 4, 5])
union_nums = nums1.union(nums2)
print(union_nums.collect()) # output [1, 2, 3, 3, 4, 5]
10. mapValues(func)
from pyspark import SparkContext
sc = SparkContext("local", "mapValues example")
pairs = sc.parallelize([(1, 2), (2, 3), (3, 4)])
mapped_values = pairs.mapValues(lambda x: x*10)
print(mapped_values.collect()) # output [(1, 20), (2, 30), (3, 40)]
from pyspark import SparkContext
sc = SparkContext("local", "mapValues example")
pairs = sc.parallelize([(1, "apple"), (2, "banana"), (3, "orange")])
mapped_values = pairs.mapValues(lambda x: x.upper())
print(mapped_values.collect()) # output [(1, 'APPLE'), (2, 'BANANA'), (3, 'ORANGE')]
11. keys()
from pyspark import SparkContext
sc = SparkContext("local", "keys example")
pairs = sc.parallelize([(1, 2), (2, 3), (3, 4)])
keys = pairs.keys()
print(keys.collect()) # output [1, 2, 3]
12. values()
from pyspark import SparkContext
sc = SparkContext("local", "values example")
pairs = sc.parallelize([(1, 2), (2, 3), (3, 4)])
values = pairs.values()
print(values.collect()) # output [2, 3, 4]
13. cogroup(other, numPartitions=None)
from pyspark import SparkContext
sc = SparkContext("local", "cogroup example")
names = sc.parallelize([(1, "Alice"), (2, "Bob"), (3, "Charlie")])
scores = sc.parallelize([(1, 80), (2, 90), (3, 85), (1, 75)])
cogrouped_data = names.cogroup(scores)
print(cogrouped_data.collect()) # output: for each key, a pair of <ResultIterable> objects, e.g. [(1, (<ResultIterable>, <ResultIterable>)), ...]
14. subtract(other, numPartitions=None)
from pyspark import SparkContext
sc = SparkContext("local", "subtract example")
nums1 = sc.parallelize([1, 2, 3, 4, 5])
nums2 = sc.parallelize([3, 4, 5, 6, 7])
subtracted_nums = nums1.subtract(nums2)
print(subtracted_nums.collect()) # output [1, 2] (order may vary)
15. sample(withReplacement, fraction, seed=None)
from pyspark import SparkContext
sc = SparkContext("local", "sample example")
nums = sc.parallelize(range(10))
sampled_nums = nums.sample(False, 0.5)
print(sampled_nums.collect()) # output varies between runs, e.g. [1, 3, 5, 7]
16. takeOrdered(num, key=None)
from pyspark import SparkContext
sc = SparkContext("local", "takeOrdered example")
nums = sc.parallelize([5, 8, 1, 3, 9, 2])
ordered_nums = nums.takeOrdered(3)
print(ordered_nums) # output [1, 2, 3]
17. zip(other)
from pyspark import SparkContext
sc = SparkContext("local", "zip example")
nums1 = sc.parallelize([1, 2, 3])
nums2 = sc.parallelize([4, 5, 6])
zipped_data = nums1.zip(nums2)
print(zipped_data.collect()) # output [(1, 4), (2, 5), (3, 6)]
18. mapPartitions(func)
from pyspark import SparkContext
sc = SparkContext("local", "mapPartitions example")
nums = sc.parallelize([1, 2, 3, 4, 5], 2)
def sum_partitions(iterator):
    yield sum(iterator)
sums = nums.mapPartitions(sum_partitions)
print(sums.collect()) # output [3, 12]
19. repartition(numPartitions)
from pyspark import SparkContext
sc = SparkContext("local", "repartition example")
nums = sc.parallelize([1, 2, 3, 4, 5], 2)
repartitioned_nums = nums.repartition(4)
print(repartitioned_nums.getNumPartitions()) # output 4
20. pipe(command, env=None, checkCode=False)
from pyspark import SparkContext
sc = SparkContext("local", "pipe example")
nums = sc.parallelize([1, 2, 3, 4, 5])
result = nums.pipe("head -n 3")
print(result.collect()) # e.g. ['1', '2', '3'] (pipe runs the command once per partition, so the result depends on partitioning)
21. coalesce(numPartitions)
from pyspark import SparkContext
sc = SparkContext("local", "coalesce example")
nums = sc.parallelize([1, 2, 3, 4, 5], 5)
coalesced_nums = nums.coalesce(2)
print(coalesced_nums.getNumPartitions()) # output 2
22. glom()
from pyspark import SparkContext
sc = SparkContext("local", "glom example")
nums = sc.parallelize([1, 2, 3, 4, 5], 2)
glommed_data = nums.glom()
print(glommed_data.collect()) # output [[1, 2], [3, 4, 5]]
23. flatMapValues(func)
from pyspark import SparkContext
sc = SparkContext("local", "flatMapValues example")
pairs = sc.parallelize([(1, "apple"), (2, "banana"), (3, "orange")])
mapped_values = pairs.flatMapValues(lambda x: [x, x.upper()])
print(mapped_values.collect()) # output [(1, 'apple'), (1, 'APPLE'), (2, 'banana'), (2, 'BANANA'), (3, 'orange'), (3, 'ORANGE')]
24. zipWithIndex()
from pyspark import SparkContext
sc = SparkContext("local", "zipWithIndex example")
nums = sc.parallelize([1, 2, 3, 4, 5])
indexed_nums = nums.zipWithIndex()
print(indexed_nums.collect()) # output [(1, 0), (2, 1), (3, 2), (4, 3), (5, 4)]
25. mapPartitionsWithIndex(func)
from pyspark import SparkContext
sc = SparkContext("local", "mapPartitionsWithIndex example")
nums = sc.parallelize([1, 2, 3, 4, 5], 2)
def sum_partitions_with_index(partition_index, iterator):
    yield (partition_index, sum(iterator))
sums_by_partition_index = nums.mapPartitionsWithIndex(sum_partitions_with_index)
print(sums_by_partition_index.collect()) # output [(0, 3), (1, 12)]
26. keyBy(func)
from pyspark import SparkContext
sc = SparkContext("local", "keyBy example")
words = sc.parallelize(["apple", "banana", "orange"])
keyed_words = words.keyBy(lambda x: x[0])
print(keyed_words.collect()) # output [('a', 'apple'), ('b', 'banana'), ('o', 'orange')]
In practice, we usually save data in Parquet format, which can shrink the stored size by an order of magnitude.
💡 3. Notes
- RDDs are immutable; every transformation produces a new RDD.
- Actions (such as collect and count) trigger the actual computation.
- Caching can improve performance, but it increases memory usage.
- Understand RDD partitioning and parallelism in order to tune performance.
💡 4. Summary
RDDs are the core data structure in PySpark and provide a rich set of operations for processing large-scale datasets. Through the code examples in this post, we have seen how to create RDDs, run transformations and actions, and use more advanced features such as pair RDDs and aggregations. Hopefully this post helps you understand RDDs better and apply them to your own large-scale data processing.