How to Get Row Numbers When Processing Data with Spark

old_R

Article link: https://blog.csdn.net/ddfgaet/article/details/101479168

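Because Spark processes data in parallel, you cannot keep a counter in your driver program to track which element is currently being processed. Spark provides zipWithIndex, which pairs every element with an index. These indices are unique and globally ordered, starting from 0.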

public RDD<scala.Tuple2<T,Object>> zipWithIndex()
Zips this RDD with its element indices. The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index.
This is similar to Scala's zipWithIndex but it uses Long instead of Int as the index type. This method needs to trigger a spark job when this RDD contains more than one partitions.


Note that some RDDs, such as those returned by groupBy(), do not guarantee order of elements in a partition. The index assigned to each element is therefore not guaranteed, and may even change if the RDD is reevaluated. If a fixed ordering is required to guarantee the same index assignments, you should sort the RDD with sortByKey() or save it to a file.
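
The ordering rule quoted above can be illustrated with a small plain-Python model. This is only a sketch, not Spark code: the partitions are represented as a list of lists, and the helper name zip_with_index is hypothetical. It assumes that the extra job mentioned in the documentation is used to count the elements of each partition so that every partition knows its starting offset.

```python
from itertools import accumulate


def zip_with_index(partitions):
    """Plain-Python model (not Spark) of the zipWithIndex ordering rule."""
    # Pass 1: count the elements in each partition.
    # Assumption: this corresponds to the extra job Spark has to trigger.
    sizes = [len(part) for part in partitions]
    # The starting offset of a partition is the total size of all earlier ones.
    offsets = [0] + list(accumulate(sizes))[:-1]
    # Pass 2: each partition numbers its own elements, starting at its offset.
    return [
        [(item, offset + i) for i, item in enumerate(part)]
        for part, offset in zip(partitions, offsets)
    ]


# Three hypothetical partitions of unequal size.
partitions = [["a", "b"], ["c"], ["d", "e", "f"]]
indexed = zip_with_index(partitions)
print(indexed)
```

In this model the first item of the first partition gets index 0 and the last item of the last partition gets the largest index, with no gaps in between. With a real RDD, the equivalent call would simply be rdd.zipWithIndex().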

Another function that gives you unique identifiers is zipWithUniqueId. Unlike zipWithIndex, zipWithUniqueId guarantees only uniqueness, not contiguity: there may be gaps between the IDs.
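
Why gaps appear can be seen in a second plain-Python sketch. Again this is a model, not Spark code, and the helper name zip_with_unique_id is hypothetical. It follows the scheme described in the Spark API documentation for zipWithUniqueId, where items in the k-th of n partitions receive the IDs k, n+k, 2n+k, and so on, so no partition needs to know the size of any other.

```python
def zip_with_unique_id(partitions):
    """Plain-Python model (not Spark) of the zipWithUniqueId numbering scheme."""
    n = len(partitions)  # number of partitions
    # The i-th item of the k-th partition gets the ID i * n + k.
    # The ID depends only on local information, so no counting pass is needed.
    return [
        [(item, i * n + k) for i, item in enumerate(part)]
        for k, part in enumerate(partitions)
    ]


# Three hypothetical partitions of unequal size.
partitions = [["a", "b"], ["c"], ["d", "e", "f"]]
tagged = zip_with_unique_id(partitions)
print(tagged)
```

Here the IDs handed out are 0, 3, 1, 2, 5 and 8: all distinct, but 4, 6 and 7 are skipped because the partitions differ in size. If consecutive row numbers are required, zipWithIndex is the better fit; if only uniqueness matters, this scheme avoids the extra counting pass.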
