PySpark RDD value-retrieval actions: collect, take, top, and first

G_scsd

Category: pyspark. Tags: pyspark, collect, take, top, first

Article link: https://blog.csdn.net/Gscsd_T/article/details/103540896

1. PySpark version

Version 2.3.0.

2. collect()

collect()

Return a list that contains all of the elements in this RDD.

Note: This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.

Example:

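from pyspark import SparkContext, SparkConf
# Run Spark locally under the application name "quzhi"
conf = SparkConf().setMaster("local").setAppName("quzhi")
sc = SparkContext(conf=conf)
# Build an RDD holding the integers 0 through 9
lines1 = sc.parallelize(list(range(10)))
# collect() brings every element back to the driver as a Python list
print('lines1= ', lines1.collect())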

Output:

lines1=  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
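To make the driver-memory warning concrete, here is a minimal pure-Python sketch that models an RDD as a list of partitions. The function name `collect_model` and the three-way split are assumptions made for illustration; this is not Spark's code. It only shows that the collected list is as large as all partitions combined.

```python
from itertools import chain


def collect_model(partitions):
    """Concatenate every partition, in partition order, into one list."""
    return list(chain.from_iterable(partitions))


# range(10) split into three hypothetical partitions
partitions = [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
result = collect_model(partitions)

print(result)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# The result holds every element of every partition at once
print(len(result) == sum(len(p) for p in partitions))  # True
```

When only a few elements are needed, an action that bounds the result size, such as take(num), keeps the amount of data sent to the driver small.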

3. take()

take(num)

Take the first num elements of the RDD.

It works by first scanning one partition and then using the results from that partition to estimate how many additional partitions are needed to satisfy the limit.

Translated from the Scala implementation in RDD#take().

Note: This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.

>>> sc.parallelize([2, 3, 4, 5, 6]).cache().take(2)
[2, 3]
>>> sc.parallelize([2, 3, 4, 5, 6]).take(10)
[2, 3, 4, 5, 6]
>>> sc.parallelize(range(100), 100).filter(lambda x: x > 90).take(3)
[91, 92, 93]
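The scan-then-estimate strategy described above can be sketched in pure Python. This is a simplified model, not Spark's implementation: the RDD is represented as a list of partitions, and the name `take_model` as well as the exact growth rule (quadruple when nothing was found, otherwise extrapolate from the hit rate with a 50% margin) are assumptions chosen for illustration.

```python
def take_model(partitions, num):
    """Return (first `num` elements, number of partitions scanned)."""
    items = []
    scanned = 0
    total = len(partitions)
    while len(items) < num and scanned < total:
        if scanned == 0:
            to_try = 1                      # always start with one partition
        elif not items:
            to_try = scanned * 4            # nothing found yet: grow quickly
        else:
            # Extrapolate from the hit rate so far, overshooting by 50%
            estimate = int(1.5 * num * scanned / len(items)) - scanned
            to_try = max(1, min(estimate, scanned * 4))
        for part in partitions[scanned:scanned + to_try]:
            # Take only as many elements as are still missing
            items.extend(part[:num - len(items)])
        scanned += to_try
    return items[:num], min(scanned, total)


# One partition is enough when it already holds `num` elements
print(take_model([[2, 3], [4, 5], [6]], 2))    # ([2, 3], 1)
# Asking for more than exists scans everything and returns what is there
print(take_model([[2, 3], [4, 5], [6]], 10))   # ([2, 3, 4, 5, 6], 3)

# 100 single-element partitions after a filter that keeps only x > 90
sparse = [[x] if x > 90 else [] for x in range(100)]
print(take_model(sparse, 3))                   # ([91, 92, 93], 100)
```

The last call shows the cost of a selective filter: the matching elements sit in the final partitions, so the model ends up scanning all of them in batches of 1, 4, 20, and then the rest.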

Example:

# take(n): return the first n elements of the RDD
print('take(2)= ', lines1.take(2))

Output:

take(2)=  [0, 1]

4. top()

top(num, key=None)

Get the top N elements from an RDD.

Note: This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.

Note: It returns the list sorted in descending order.

>>> sc.parallelize([10, 4, 2, 12, 3]).top(1)
[12]
>>> sc.parallelize([2, 3, 4, 5, 6], 2).top(2)
[6, 5]
>>> sc.parallelize([10, 4, 2, 12, 3]).top(3, key=str)
[4, 3, 2]
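One way to compute such a result without sorting the whole dataset is to keep only the num largest elements of each partition and then merge those candidates. The sketch below models that idea in pure Python with `heapq.nlargest`; the name `top_model` and the partition layouts are assumptions for illustration, and it should not be read as Spark's actual code.

```python
import heapq


def top_model(partitions, num, key=None):
    """Return the `num` largest elements in descending order."""
    candidates = []
    for part in partitions:
        # Each partition contributes at most `num` candidates
        candidates.extend(heapq.nlargest(num, part, key=key))
    # Merge the per-partition candidates into the final answer
    return heapq.nlargest(num, candidates, key=key)


print(top_model([[10, 4], [2, 12, 3]], 1))           # [12]
print(top_model([[2, 3], [4, 5, 6]], 2))             # [6, 5]
# With key=str, elements are compared as strings: "4" > "3" > "2" > "12" > "10"
print(top_model([[10, 4], [2, 12, 3]], 3, key=str))  # [4, 3, 2]
```

The key=str case explains the result [4, 3, 2] in the documentation example: string comparison is lexicographic, so "12" and "10" rank below "2".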

Example:

# top(num) returns the num largest elements in descending order, regardless of the order of elements in the RDD
lines2 = sc.parallelize(list(range(0, 10))[::-1])
print('lines2= ', lines2.collect())
print('lines1.top(2)= ', lines1.top(2))
print('lines2.top(2)= ', lines2.top(2))

Output:

lines2=  [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
lines1.top(2)=  [9, 8]
lines2.top(2)=  [9, 8]

5. first()

first()

Return the first element in this RDD.

>>> sc.parallelize([2, 3, 4]).first()
2
>>> sc.parallelize([]).first()
Traceback (most recent call last):
    ...
ValueError: RDD is empty
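Because first() raises ValueError on an empty RDD, calling code often needs a guard. The following pure-Python sketch models the behavior on a list of partitions and adds a fallback wrapper. Both `first_model` and `first_or_default` are hypothetical names introduced here; neither is part of the PySpark API.

```python
def first_model(partitions):
    """Return the first element found, scanning partitions in order."""
    for part in partitions:
        for element in part:
            return element
    # Mirrors the error message shown in the documentation example
    raise ValueError("RDD is empty")


def first_or_default(partitions, default=None):
    """Hypothetical helper: fall back to `default` instead of raising."""
    try:
        return first_model(partitions)
    except ValueError:
        return default


print(first_model([[], [2, 3], [4]]))           # 2
print(first_or_default([[], []], default=-1))   # -1
```

With a real RDD, a similar guard could be written around rdd.first() with try/except ValueError, or by checking rdd.isEmpty() beforehand.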

Example:

# first() returns the first element of the RDD
print('lines1.first()= ', lines1.first())
print('lines2.first()= ', lines2.first())

Output:

lines1.first()=  0
lines2.first()=  9