Understanding foreachRDD, foreachPartition, and foreach in Spark

foreachRDD, foreachPartition, and foreach differ mainly in their scope (a quick sketch follows the list):

  • foreachRDD operates on the RDD generated for each batch interval of a DStream.
  • foreachPartition operates on each partition of each batch interval's RDD.
  • foreach operates on each individual element of each batch interval's RDD.

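A minimal sketch of the three scopes (dstream and the per-record handling are placeholders, not a complete program):

dstream.foreachRDD { rdd =>              // runs once per batch interval, on the driver
  rdd.foreachPartition { partition =>    // runs once per partition of that RDD, on an executor
    partition.foreach { record =>        // runs once per element, on an executor
      // process a single record here
    }
  }
}
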
In the official Spark documentation, foreachRDD is listed under "Output Operations on DStreams", so the first thing to be clear about is that it is an output operator. Here is how the docs describe it:

Output Operation: foreachRDD(func)

Meaning: The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.

Breaking that down, foreachRDD:

  • is the most commonly used output operation;
  • takes a function as its argument, which is applied to each RDD of the DStream;
  • should use that function to push each RDD's data to an external system, such as files or a database;
  • executes that function on the driver;
  • usually needs an RDD action inside the function: foreachRDD only registers the function, and the RDD operations inside it are lazy, so without an action nothing gets computed (see the sketch below).
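
A minimal sketch of that last point (parse and send stand in for real logic):

dstream.foreachRDD { rdd =>
  val parsed = rdd.map(parse)  // lazy: registering this transformation runs nothing
  // Without an RDD action such as foreach, collect, or saveAsTextFile,
  // this batch would simply be received and then discarded.
  parsed.foreach(record => send(record))
}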


  • The official docs also point out mistakes that developers commonly make:

Often writing data to external system requires creating a connection object (e.g. TCP connection to a remote server) and using it to send data to a remote system. For this purpose, a developer may inadvertently try creating a connection object at the Spark driver, and then try to use it in a Spark worker to save records in the RDDs. For example :
 

dstream.foreachRDD { rdd =>
  val connection = createNewConnection()  // executed at the driver
  rdd.foreach { record =>
    connection.send(record) // executed at the worker
  }
}

This is incorrect as this requires the connection object to be serialized and sent from the driver to the worker. Such connection objects are rarely transferrable across machines. This error may manifest as serialization errors (connection object not serializable), initialization errors (connection object needs to be initialized at the workers), etc. The correct solution is to create the connection object at the worker.

In other words, when we use foreachRDD to write data to an external system, we usually have to create a connection object. Creating it on the driver, as in the code above, is wrong: the connection object would have to be serialized and shipped to the workers where foreach actually runs, and such objects are rarely serializable. This usually shows up as a serialization error or an initialization error.

However, this can lead to another common mistake - creating a new connection for every record. For example:
 

dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    val connection = createNewConnection()  // executed at the worker, created anew for every record
    connection.send(record)
    connection.close()
  }
}

Typically, creating a connection object has time and resource overheads. Therefore, creating and destroying a connection object for each record can incur unnecessarily high overheads and can significantly reduce the overall throughput of the system. A better solution is to use rdd.foreachPartition - create a single connection object and send all the records in a RDD partition using that connection.

This version does not fail, but it creates a connection object for every single element, which wastes resources. foreach fits scenarios where you operate on each element independently; when a connection object is needed, use foreachPartition instead, so that only one connection is created per partition and reused for every element in it:
 

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>  // with foreachPartition, one connection per partition
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}

Finally, this can be further optimized by reusing connection objects across multiple RDDs/batches. One can maintain a static pool of connection objects that can be reused as RDDs of multiple batches are pushed to the external system, thus further reducing the overheads.

Note that the connections in the pool should be lazily created on demand and timed out if not used for a while. This achieves the most efficient sending of data to external systems.

Going a step further, connection objects can be reused across batches of RDDs through a connection pool; note that the pool must be static and lazily initialized. The official example:
 

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)  // return to the pool for future reuse
  }
}
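
The docs leave ConnectionPool unspecified. A minimal sketch of what such a static, lazily initialized pool could look like, assuming java.net.Socket as the connection type and a made-up endpoint:

import java.net.Socket
import java.util.concurrent.ConcurrentLinkedQueue

// A Scala object is initialized lazily, once per JVM (so once per executor),
// which matches the "static, lazily initialized" requirement above.
object ConnectionPool {
  private val pool = new ConcurrentLinkedQueue[Socket]()

  def getConnection(): Socket = {
    // Reuse a pooled connection when one is available; create on demand otherwise
    val conn = pool.poll()
    if (conn == null || conn.isClosed) new Socket("external-host", 9999)  // assumed endpoint
    else conn
  }

  def returnConnection(conn: Socket): Unit = pool.offer(conn)

  // A production pool would also time out connections that stay idle too long,
  // as the note above recommends; that bookkeeping is omitted for brevity.
}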

This point is also made in the SparkLearning book.

Below is a program of my own that uses a connection pool to talk to Redis; so far it has run without problems:

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

public class RedisUtil {
    private static JedisPool pool = null;

    // Static initializer: the pool is created once per JVM, i.e. once per executor
    static {
        if (pool == null) {
            JedisPoolConfig config = new JedisPoolConfig();
            config.setMaxIdle(25);
            config.setMaxWaitMillis(1000 * 100);
            config.setTestOnBorrow(true);  // validate a connection when it is borrowed
            pool = new JedisPool(config, "Redis IP", 6379);
        }
    }

    public static Jedis getConnection() {
        return pool.getResource();
    }

    // Return the connection to the pool; discard it if an exception occurred while using it
    public static void closeConnection(Jedis jedis, Boolean exceptionExist) {
        if (jedis != null) {
            if (exceptionExist) {
                pool.returnBrokenResource(jedis);
            } else {
                pool.returnResource(jedis);
            }
        }
    }
}
linesFormat.foreachRDD(rdd => {
  rdd.foreachPartition(it => {
    var exceptionExist = false
    val jedis = RedisUtil.getConnection
    it.foreach(record => {
      try {
        if (jedis != null && jedis.exists("abc" + record._2 + record._3)) {
          // Read the cached hash fields from Redis
          val value = jedis.hmget("abc" + record._2 + record._3, "a1", "a2", "a3", "a4", "a5", "a6")
          // ... use the data ...
        }
      } catch {
        case e: Exception =>
          println(e)
          exceptionExist = true
      }
    })
    // Return the connection once per partition; broken connections are discarded
    RedisUtil.closeConnection(jedis, exceptionExist)
  })
})

Other points to remember:

DStreams are executed lazily by the output operations, just like RDDs are lazily executed by RDD actions. Specifically, RDD actions inside the DStream output operations force the processing of the received data. Hence, if your application does not have any output operation, or has output operations like dstream.foreachRDD() without any RDD action inside them, then nothing will get executed. The system will simply receive the data and discard it.

By default, output operations are executed one-at-a-time. And they are executed in the order they are defined in the application.