PySpark: map and flatMap

zlbingo

Category: Rookie_Spark. Tags: big data, python

Article link: https://blog.csdn.net/zlbingo/article/details/113118584

map and flatMap

map

🌀 Purpose: Return a new RDD by applying a function to each element of this RDD.

In other words, the function is called once per element, and the return values make up the new RDD.

☀️ Syntax

>>> rdd = sc.parallelize(["b", "a", "c"])
>>> rdd.map(lambda x: (x, 1)).collect()
[('b', 1), ('a', 1), ('c', 1)]

flatMap

🌀 Purpose: Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.

In other words, the function is called once per element, and the iterables it returns are flattened into a single new RDD.

☀️ Syntax

>>> rdd = sc.parallelize([2, 3, 4])
>>> rdd.flatMap(lambda x: range(1, x)).collect()
[1, 1, 2, 1, 2, 3]
>>> rdd.flatMap(lambda x: [(x, x), (x, x)]).collect()
[(2, 2), (2, 2), (3, 3), (3, 3), (4, 4), (4, 4)]
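
To make the flattening step concrete without a running Spark session, here is a minimal sketch in plain Python. The helper name local_flat_map is hypothetical (it is not part of PySpark), and an ordinary list stands in for the RDD; the sketch only mimics the semantics described above on a single machine.

```python
from itertools import chain


def local_flat_map(func, elements):
    """Hypothetical local stand-in for RDD.flatMap.

    Applies func to every element, then concatenates the iterables
    that func returned into one flat list.
    """
    return list(chain.from_iterable(func(x) for x in elements))


data = [2, 3, 4]  # same input as the flatMap example above

# Each x produces range(1, x); the ranges are joined end to end.
ranges = local_flat_map(lambda x: range(1, x), data)
print(ranges)

# Each x produces two tuples; the small lists are joined end to end.
pairs = local_flat_map(lambda x: [(x, x), (x, x)], data)
print(pairs)
```

Under these assumptions the two printed lists match the outputs of the collect() calls shown above.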

The difference between the two

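map applies the function to every row of the RDD (the content of each row changes, but it is still a single row) and returns a new RDD whose size is the same as that of the original RDD.
flatMap first applies the function to every row of the RDD, then flattens the elements produced for each row (each resulting element becomes a row of its own), and finally collects all of those elements into the new RDD.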
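
The size claim can be checked locally with a small sketch. It uses plain Python lists in place of RDDs, and the helpers local_map and local_flat_map are hypothetical names introduced only for illustration. It suggests that a map-style operation always keeps the element count, while a flatMap-style operation yields the sum of the lengths of the returned iterables, which can be larger or smaller than the input.

```python
from itertools import chain


def local_map(func, elements):
    """Hypothetical stand-in for RDD.map: one output per input."""
    return [func(x) for x in elements]


def local_flat_map(func, elements):
    """Hypothetical stand-in for RDD.flatMap: outputs are flattened."""
    return list(chain.from_iterable(func(x) for x in elements))


def below(x):
    """Return the integers 1 .. x-1 as a list (empty when x is 1)."""
    return list(range(1, x))


growing = [1, 2, 4]    # below() yields 0 + 1 + 3 = 4 items in total
shrinking = [1, 1, 2]  # below() yields 0 + 0 + 1 = 1 item in total

# map: the number of elements never changes
map_count_growing = len(local_map(below, growing))
map_count_shrinking = len(local_map(below, shrinking))

# flatMap: the number of elements depends on what the function returns
flat_count_growing = len(local_flat_map(below, growing))
flat_count_shrinking = len(local_flat_map(below, shrinking))

print(map_count_growing, flat_count_growing)
print(map_count_shrinking, flat_count_shrinking)
```

An element for which the function returns an empty iterable contributes nothing to the flattened result, which is why the flattened count can drop below the input count.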

If the explanation above is still a little confusing, a concrete example should help.

Here is a text file, for_test.txt, with the following content:

hello world
a new line
hello
the end

The file has four lines, so when Spark reads it in, the resulting RDD should have a size of 4.

from pyspark import SparkContext, SparkConf

# Run Spark locally with a single worker thread
conf = SparkConf().setMaster("local").setAppName("test")
sc = SparkContext(conf=conf)

# Each line of the text file becomes one element of the RDD
test_rdd = sc.textFile('for_test.txt')
print(f'the count of test_rdd is {test_rdd.count()}')
print(f'{test_rdd.collect()}\n')

# map: each line is replaced by one list of words, so the count stays the same
map_rdd = test_rdd.map(lambda x: x.split(' '))
print(f'the count of map_rdd is {map_rdd.count()}')
print(f'{map_rdd.collect()}\n')

# flatMap: the word lists are flattened, so every word becomes its own element
flatMap_rdd = test_rdd.flatMap(lambda x: x.split(' '))
print(f'the count of flatMap_rdd is {flatMap_rdd.count()}')
print(f'{flatMap_rdd.collect()}')

the count of test_rdd is 4
['hello world', 'a new line', 'hello', 'the end']

the count of map_rdd is 4
[['hello', 'world'], ['a', 'new', 'line'], ['hello'], ['the', 'end']]

the count of flatMap_rdd is 8
['hello', 'world', 'a', 'new', 'line', 'hello', 'the', 'end']

The RDD that was read in has a size of 4, which means it has four rows: hello world, a new line, hello, and the end.

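When we apply map, the function runs once on each row of the RDD. Take the first row: it is split on spaces and a list is returned, so the first row changes from a string into a list. One row is still one row; it does not become two or more rows. The new RDD therefore has the same number of rows as the original.
When we apply flatMap, the first step is the same as in map: a list is produced. The first row, for example, yields the two-element list ['hello', 'world']. The elements of that list are then flattened (each one becomes a row of the new RDD), so one row of the original RDD becomes two rows of the new one. The other rows are handled the same way, which gives ['hello', 'world', 'a', 'new', 'line', 'hello', 'the', 'end']. The RDD produced by flatMap may therefore have a different number of rows from the original.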

As the example above shows, the only thing flatMap does beyond map is the extra flatten step.
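
The "map plus flatten" view can be sketched in plain Python using the four lines of for_test.txt as data. This is only a local illustration under the assumption that a list models the RDD; the variable names are made up for the example.

```python
from itertools import chain

# The four lines of for_test.txt, standing in for test_rdd
lines = ["hello world", "a new line", "hello", "the end"]

# Step 1 (what map does): one list of words per line
mapped = [line.split(" ") for line in lines]

# Step 2 (the extra flatten step): join the per-line lists into one list
flattened = list(chain.from_iterable(mapped))

print(len(mapped), mapped)
print(len(flattened), flattened)
```

In PySpark itself, the same flatten step can be applied to an RDD of lists with an identity function, for example map_rdd.flatMap(lambda x: x).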
