PySpark in Practice (3): Analyzing the wordcount Operators

Latest recommended article published 2023-08-21 03:06:37

落叶1210

Categories: Big Data, pyspark. Tags: pyspark, wordcount

Article link: https://blog.csdn.net/luoye4321/article/details/93934099

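PySpark is essentially a Python layer that calls into Spark's Scala jars. Taking the wordcount from the previous article as an example, one piece of the code is: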

rdd.flatMap(lambda x: x.split()).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y).foreach(lambda x: print(x))
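Here flatMap and map are transformation operators, and so is reduceByKey, which also returns a new RDD.

foreach is an action operator. Adding a transformation to an RDD performs no computation by itself; the chained transformations only run once an action is invoked.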
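
The lazy behaviour described above can be imitated in plain Python, without Spark. This is only an analogy: it assumes that Python's built-in lazy `map` stands in for a transformation and that consuming the iterator stands in for an action. The names `log` and `tag` are made up for the illustration.

```python
# Plain-Python analogy of lazy transformations (this is not Spark itself).
log = []  # records every time the "transformation" function really runs


def tag(word):
    log.append(word)  # side effect, so that execution can be observed
    return (word, 1)


words = ["a", "b", "a"]

# "Transformation": building the lazy pipeline runs nothing yet.
pipeline = map(tag, words)
calls_before_action = len(log)

# "Action": consuming the pipeline forces tag() to run on every element.
result = list(pipeline)
calls_after_action = len(log)

print(calls_before_action, calls_after_action)  # 0 3
print(result)  # [('a', 1), ('b', 1), ('a', 1)]
```

Nothing is appended to `log` until the pipeline is consumed, which mirrors how an RDD defers its transformations until an action arrives.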

Let's look at map and flatMap in rdd.py. The source code is:

def map(self, f, preservesPartitioning=False):
    """
    Return a new RDD by applying a function to each element of this RDD.

    >>> rdd = sc.parallelize(["b", "a", "c"])
    >>> sorted(rdd.map(lambda x: (x, 1)).collect())
    [('a', 1), ('b', 1), ('c', 1)]
    """
    def func(_, iterator):
        return map(f, iterator)
    return self.mapPartitionsWithIndex(func, preservesPartitioning)

def flatMap(self, f, preservesPartitioning=False):
    """
    Return a new RDD by first applying a function to all elements of this
    RDD, and then flattening the results.

    >>> rdd = sc.parallelize([2, 3, 4])
    >>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
    [1, 1, 1, 2, 2, 3]
    >>> sorted(rdd.flatMap(lambda x: [(x, x), (x, x)]).collect())
    [(2, 2), (2, 2), (3, 3), (3, 3), (4, 4), (4, 4)]
    """
    def func(s, iterator):
        return chain.from_iterable(map(f, iterator))
    return self.mapPartitionsWithIndex(func, preservesPartitioning)
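
To see what the two inner `func` helpers do, their bodies can be run by hand on a single partition. This is a local sketch that assumes a partition is just a Python iterable; `map_func`, `flat_map_func` and `partition` are illustrative names, not part of the PySpark API.

```python
from itertools import chain


def map_func(f, iterator):
    # Same body as the inner func of RDD.map: one output per input element.
    return map(f, iterator)


def flat_map_func(f, iterator):
    # Same body as the inner func of RDD.flatMap: every f(x) is itself
    # iterable, and chain.from_iterable splices those pieces together.
    return chain.from_iterable(map(f, iterator))


partition = [2, 3, 4]  # stand-in for the elements of one partition

mapped = list(map_func(lambda x: (x, 1), partition))
flattened = sorted(flat_map_func(lambda x: range(1, x), partition))

print(mapped)     # [(2, 1), (3, 1), (4, 1)]
print(flattened)  # [1, 1, 1, 2, 2, 3]
```

The second result matches the doctest in the flatMap docstring, which suggests the local sketch follows the same logic per partition.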

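map takes two parameters. The first is the function f. The second, preservesPartitioning, is a boolean flag indicating whether the result keeps the parent RDD's partitioner (it is not the number of partitions). What type is f, then? The Scala source makes this clearer: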

/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}

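map is a generic method, and the type parameter U can be any type. The signature states the type of f explicitly: f: T => U, that is, a function that takes a T and returns a U (in wordcount it happens to be an anonymous function). The operator returns a new RDD[U].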

/**
 * Return a new RDD by first applying a function to all elements of this
 * RDD, and then flattening the results.
 */
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}

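Compared with map, flatMap adds one step. Here f returns a TraversableOnce[U], that is, a collection of U values for each input element, and flatMap then flattens these per-element collections into a single collection of U.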
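
The two Scala signatures can be approximated with Python type hints. The sketch below is an assumption-laden translation rather than anything from Spark: `map_elements` and `flat_map_elements` are hypothetical names, and `Iterable[U]` is used as a rough stand-in for `TraversableOnce[U]`.

```python
from itertools import chain
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def map_elements(f: Callable[[T], U], items: Iterable[T]) -> Iterator[U]:
    # Rough counterpart of f: T => U, producing exactly one U per T.
    return map(f, items)


def flat_map_elements(
    f: Callable[[T], Iterable[U]], items: Iterable[T]
) -> Iterator[U]:
    # Rough counterpart of f: T => TraversableOnce[U]; each result is a
    # collection, and the collections are flattened into one stream of U.
    return chain.from_iterable(map(f, items))


lines = ["hello world", "hello spark"]

nested = list(map_elements(str.split, lines))
flat = list(flat_map_elements(str.split, lines))

print(nested)  # [['hello', 'world'], ['hello', 'spark']]
print(flat)    # ['hello', 'world', 'hello', 'spark']
```

With the same function, the map variant keeps one list per line while the flatMap variant yields a single flat sequence of words.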

With that in mind, let's look at the code again:

rdd=sc.textFile(txtfile)

rdd is a collection whose elements are the lines of the text file, conceptually similar to an Array[String] with one entry per line.

rdd.flatMap(lambda x: x.split()).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y).foreach(lambda x: print(x))

The flatMap step first splits each line on whitespace, so each line yields a list of words (conceptually an Array[String]). The flattening step then merges these per-line lists into a single collection of words.
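
The whole wordcount chain can be traced on a small in-memory input without a cluster. The following is a plain-Python imitation, not PySpark: the sample `lines` and the helper `reduce_by_key` are assumptions made for this illustration, and no partitioning or shuffling is modelled.

```python
from itertools import chain

lines = ["hello spark", "hello pyspark"]  # stand-in for sc.textFile(txtfile)

# flatMap step: split every line, then flatten into one sequence of words.
words = list(chain.from_iterable(line.split() for line in lines))

# map step: pair every word with the count 1.
pairs = [(word, 1) for word in words]


def reduce_by_key(func, key_value_pairs):
    # Hypothetical local helper: merge the values of equal keys with func.
    merged = {}
    for key, value in key_value_pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# reduceByKey step: add up the 1s of each word.
counts = reduce_by_key(lambda x, y: x + y, pairs)

# foreach step: print every (word, count) pair.
for item in counts.items():
    print(item)
```

Each intermediate variable corresponds to one operator in the chained expression, which makes it easier to inspect what every stage produces.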