PySpark RDD zip, zipWithUniqueId, and zipWithIndex Explained

NoOne-csdn

Article link: https://blog.csdn.net/weixin_40161254/article/details/87935192

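1. zip(other)
Zips this RDD with another one, returning key-value pairs made of the first element in each RDD, the second element in each RDD, etc. Assumes that the two RDDs have the same number of partitions and the same number of elements in each partition (e.g. one was made through a map on the other).
In short, zip pairs up two RDDs element by element and returns (key, value) pairs.
Precondition:
The two RDDs must have the same number of partitions, and corresponding partitions must hold the same number of elements.

Example: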

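from pyspark import SparkContext

sc = SparkContext.getOrCreate()  # reuse the running context, or start one

x = sc.parallelize(range(5), 2)
y = sc.parallelize(range(1000, 1005), 2)
# glom() groups the elements of each partition into a list
a = x.zip(y).glom().collect()
print(a)
# without glom(), collect() returns one flat list of pairs
a = x.zip(y).collect()
print(a)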

Output:
[[(0, 1000), (1, 1001)], [(2, 1002), (3, 1003), (4, 1004)]]
[(0, 1000), (1, 1001), (2, 1002), (3, 1003), (4, 1004)]
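
The precondition matters because zip pairs elements partition by partition. The following is a minimal pure-Python sketch of that pairing rule, not Spark's implementation. `zip_partitions` is a hypothetical helper introduced only for illustration; it takes each RDD as a list of partitions (the shape that glom().collect() returns) and rejects inputs whose partition layouts differ.

```python
def zip_partitions(left, right):
    """Pair two RDD-like lists of partitions element by element.

    Hypothetical helper for illustration only: it models the rule that
    both sides need the same number of partitions and the same number
    of elements in each partition.
    """
    if len(left) != len(right):
        raise ValueError("different number of partitions")
    zipped = []
    for part_left, part_right in zip(left, right):
        if len(part_left) != len(part_right):
            raise ValueError("different number of elements in a partition")
        zipped.append(list(zip(part_left, part_right)))
    return zipped


# Same layout as the example above: two partitions of sizes 2 and 3.
x_parts = [[0, 1], [2, 3, 4]]
y_parts = [[1000, 1001], [1002, 1003, 1004]]
result = zip_partitions(x_parts, y_parts)
print(result)

# Same five elements, but split 3 + 2 instead of 2 + 3: rejected.
try:
    zip_partitions(x_parts, [[1000, 1001, 1002], [1003, 1004]])
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)
```

A layout that satisfies the rule by construction is the one the documentation names: one RDD derived from the other through map.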

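2. zipWithIndex()
Zips this RDD with its element indices.
The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index.
Returns: (element, index) pairs.
The index does not depend on how many partitions the RDD has: for an RDD built from a list, each index equals the element's position in that list. According to the PySpark documentation, this method needs to trigger a Spark job when the RDD has more than one partition.

Example: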
# the same six elements split into 1, 2, and 6 partitions
a = sc.parallelize(list('abczyx'), 1).zipWithIndex().glom().collect()
print(a)

a = sc.parallelize(list('abczyx'), 2).zipWithIndex().glom().collect()
print(a)

a = sc.parallelize(list('abczyx'), 6).zipWithIndex().glom().collect()
print(a)

Output:

[[('a', 0), ('b', 1), ('c', 2), ('z', 3), ('y', 4), ('x', 5)]]
[[('a', 0), ('b', 1), ('c', 2)], [('z', 3), ('y', 4), ('x', 5)]]
[[('a', 0)], [('b', 1)], [('c', 2)], [('z', 3)], [('y', 4)], [('x', 5)]]
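
The ordering rule implies that each partition's first index equals the total number of elements in all earlier partitions. The pure-Python sketch below models that rule on a list of partitions; it is an illustration under that assumption, not Spark's code, and `zip_with_index` and `flatten` are hypothetical helper names.

```python
from itertools import accumulate


def zip_with_index(partitions):
    """Model zipWithIndex on a list of partitions (illustration only)."""
    sizes = [len(part) for part in partitions]
    # Starting index of each partition = total size of all earlier partitions.
    offsets = [0] + list(accumulate(sizes))[:-1]
    return [
        [(item, offset + pos) for pos, item in enumerate(part)]
        for part, offset in zip(partitions, offsets)
    ]


def flatten(partitions):
    """Concatenate all partitions into one flat list."""
    return [pair for part in partitions for pair in part]


one = zip_with_index([list("abczyx")])
two = zip_with_index([list("abc"), list("zyx")])
six = zip_with_index([[ch] for ch in "abczyx"])
print(two)
# The flattened pairs are identical for every partition layout.
print(flatten(one) == flatten(two) == flatten(six))
```

Because the offsets need the size of every earlier partition, the sizes have to be known before any index can be assigned, which fits the documented note that a multi-partition RDD triggers a job.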

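3. zipWithUniqueId()

Zips this RDD with generated unique Long ids.

Items in the kth partition will get ids k, n+k, 2n+k, …, where n is the number of partitions. So there may exist gaps, but this method won't trigger a Spark job, which is different from zipWithIndex.
Returns (element, id) pairs; unlike zipWithIndex, the ids depend on the partitioning.
In the sequence k, n+k, 2n+k, …:
n is the total number of partitions
k is the index of the partition that holds the element
both the partition index and the position inside a partition are counted from 0, so the i-th element of partition k gets the id i*n + k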
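
The id rule can be written out in a few lines. This is a minimal pure-Python sketch that models the documented formula on a list of partitions, not Spark's implementation; `zip_with_unique_id` is a hypothetical helper name, and the partition layouts are the ones used in the two-partition and four-partition examples in this section.

```python
def zip_with_unique_id(partitions):
    """Model zipWithUniqueId on a list of partitions (illustration only)."""
    n = len(partitions)  # total number of partitions
    return [
        # item i (0-based) of partition k (0-based) gets id i * n + k
        [(item, i * n + k) for i, item in enumerate(part)]
        for k, part in enumerate(partitions)
    ]


two = zip_with_unique_id([list("abc"), list("zyx")])
four = zip_with_unique_id([list("45"), list("ab"), list("cz"), list("yx")])
print(two)
print(four)
```

Each id needs only the partition index and the element's position inside its own partition, so no partition has to know the size of any other.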

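Example: one partition

a = sc.parallelize(list('abczyx'), 1).zipWithUniqueId().glom().collect()
print(a)

Output (with a single partition, n = 1 and k = 0, so the ids are the same as those from zipWithIndex()):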
[[('a', 0), ('b', 1), ('c', 2), ('z', 3), ('y', 4), ('x', 5)]]

Two partitions

Code:
rdd = sc.parallelize(list('abczyx'), 2)
print(rdd.glom().collect())  # show how the elements are split across partitions
a = rdd.zipWithUniqueId().glom().collect()
print(a)

Output:

[['a', 'b', 'c'], ['z', 'y', 'x']]
[[('a', 0), ('b', 2), ('c', 4)], [('z', 1), ('y', 3), ('x', 5)]]

Four partitions

rdd = sc.parallelize(list('45abczyx'), 4)
print(rdd.glom().collect())  # show how the elements are split across partitions
a = rdd.zipWithUniqueId().glom().collect()
print(a)

Output:
[['4', '5'], ['a', 'b'], ['c', 'z'], ['y', 'x']]
[[('4', 0), ('5', 4)], [('a', 1), ('b', 5)], [('c', 2), ('z', 6)], [('y', 3), ('x', 7)]]

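Analysis:

With n = 4 partitions, partition k hands out the ids k, 4+k, 8+k, and so on. Partition 0 (['4', '5']) gets 0 and 4, partition 1 (['a', 'b']) gets 1 and 5, partition 2 (['c', 'z']) gets 2 and 6, and partition 3 (['y', 'x']) gets 3 and 7.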
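
The examples above use evenly filled partitions, so the ids happen to cover every number from 0 upward. The gaps mentioned in the documentation appear when partitions hold different numbers of elements. Below is a small pure-Python sketch of the id rule on a made-up uneven layout; the layout is an assumption chosen for illustration, not output from a Spark run.

```python
# Hypothetical uneven layout: 2 partitions holding 1 and 3 elements.
partitions = [["a"], ["b", "c", "d"]]
n = len(partitions)

# Item i (0-based) of partition k (0-based) gets id i * n + k.
ids = {
    item: i * n + k
    for k, part in enumerate(partitions)
    for i, item in enumerate(part)
}
print(ids)

# Ids below the maximum that no element received.
used = set(ids.values())
gaps = sorted(set(range(max(used) + 1)) - used)
print(gaps)
```

The ids stay unique, but they are not consecutive, so they should not be treated as positions in the dataset.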
