Spark中foreachRDD、foreachPartition和foreach解读

最新推荐文章于 2021-02-23 15:20:49 发布

程序员小陶

最新推荐文章于 2021-02-23 15:20:49 发布

阅读量3.7k

点赞数 3

文章标签： java python spark 大数据编程语言

本文链接：https://blog.csdn.net/qq_31975963/article/details/107662660

版权

精选30+云产品，助力企业轻松上云！>>>

点击蓝色“大数据每日哔哔”关注我

加个“星标”，第一时间获取大数据架构，实战经验

区别

最近有不少同学问我，Spark 中 foreachRDD、foreachPartition和foreach 的区别，工作中经常会用错或不知道怎么用，今天简单聊聊它们之间的区别：

其实区别它们很简单，首先是作用范围不同，foreachRDD 作用于 DStream中每一个时间间隔的 RDD，foreachPartition 作用于每一个时间间隔的RDD中的每一个 partition，foreach 作用于每一个时间间隔的 RDD 中的每一个元素。

http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd

SparkStreaming 中对 foreachRDD的说明。

foreach 与 foreachPartition都是在每个partition中对iterator进行操作,不同的是,foreach是直接在每个partition中直接对iterator执行foreach操作,而传入的function只是在foreach内部使用,而foreachPartition是在每个partition中把iterator给传入的function,让function自己对iterator进行处理（可以避免内存溢出）

一个简单的例子

在Spark 官网中，foreachRDD被划分到Output Operations on DStreams中，所有我们首先要明确的是，它是一个输出操作的算子，然后再来看官网对它的含义解释：The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.

最常用的输出操作,需要一个函数作为参数，函数作用于DStream中的每一个RDD,函数将RDD中的数据输出到外部系统，如文件、数据库,在driver上执行

函数中通常要有action算子，因为foreachRDD本身是transform算子

官网还给出了开发者常见的错误：

Often writing data to external system requires creating a connection object (e.g. TCP connection to a remote server) and using it to send data to a remote system. For this purpose, a developer may inadvertently try creating a connection object at the Spark driver, and then try to use it in a Spark worker to save records in the RDDs. For example ：(中文解析见代码下方)

// ① 这种写法是错误的 ❌dstream.foreachRDD { rdd =>  val connection = createNewConnection()  // executed at the driver  rdd.foreach { record =>    connection.send(record) // executed at the worker  }}

上面说的是我们使用foreachRDD向外部系统输出数据时，通常要创建一个连接对象，如果像上面的代码中创建在 driver 上就是错误的，因为foreach在每个节点上执行时节点上并没有连接对象。driver节点就一个，而worker节点有多个。

所以，我们改成下面这样：

// ② 把创建连接写在 forech 里面，RDD 中的每个元素都会创建一个连接dstream.foreachRDD { rdd =>  rdd.foreach { record =>    val connection = createNewConnection() // executed at the worker    connection.send(record) // executed at the worker    connection.close()  }}

这时不会出现计算节点没有连接对象的情况。但是，这样写会在每次循环RDD的时候都会创建一个连接，创建连接和关闭连接都很频繁，造成系统不必要的开销。

可以通过使用 foreachPartirion 来解决这类问题：

// ③ 使用foreachPartitoin来减少连接的创建，RDD的每个partition创建一个链接dstream.foreachRDD { rdd =>  rdd.foreachPartition { partitionOfRecords =>    val connection = createNewConnection()    partitionOfRecords.foreach(record => connection.send(record))    connection.close()  }}

上面这种方式还可以优化，虽然连接申请变少了，但是对一每一个partition来说，连接还是没有办法复用，所以我们可以引入静态连接池。官方说明：该连接池必须是静态的、懒加载的。

// ④ 使用静态连接池，可以增加连接的复用、减少连接的创建和关闭。dstream.foreachRDD { rdd =>  rdd.foreachPartition { partitionOfRecords =>    // ConnectionPool is a static, lazily initialized pool of connections    val connection = ConnectionPool.getConnection()    partitionOfRecords.foreach(record => connection.send(record))    ConnectionPool.returnConnection(connection)  // return to the pool for future reuse  }}

这里需要注意的是：使用连接池中的连接应按需创建，如果有一段时间不使用，则应超时，这样实现了向外部系统最有效地发送地数据。

总结

文中所展示的例子也是初学者容易犯的错误，通过分析 RDD 数据输出到第三方系统的例子，我们可以很清晰的理解foreachRDD、foreachPartiton、foreach 的作用范围。

（完）

对了，学累了别忘记起身运动运动