Using foreachRDD in Spark Streaming and Handling the Closure Problem It Causes

番茄炒蛋213

Article link: https://blog.csdn.net/mcdull213/article/details/105646713

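1. Introduction

foreachRDD is the output operation used to sink Spark Streaming data to an external system. The function passed to foreachRDD runs in the driver process, while the RDD operations invoked inside it (such as foreach) run on the executors, so anything sent from the driver to the executors has to be serialized.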

2. Test

Requirement: write the word count (WC) results of the stream to MySQL.

MySQLUtils

import java.sql.{Connection, DriverManager}

object MySQLUtils {

  /**
    * Get a connection.
    *
    * @return a new JDBC connection to the demo database
    */
  def getConnection(): Connection = {
    Class.forName("com.mysql.jdbc.Driver")
    DriverManager.getConnection(
      "jdbc:mysql://localhost:3306/demo?useSSL=false&serverTimezone=UTC",
      "root",
      "123456"
    )
  }

  /**
    * Close a connection.
    *
    * @param connection the connection to close; null is ignored
    */
  def closeConnection(connection: Connection): Unit = {
    if (connection != null) connection.close()
  }

}

2.1 Version 1.0

    stream.map((_, 1)).reduceByKey(_ + _) // word count
      .foreachRDD(rdd => { // write the result to the database
        val connection = MySQLUtils.getConnection() // executed on the driver
        rdd.foreach(wc => {
          val sql = "insert into wc(word,cnt) values(?,?)"
          val statement = connection.prepareStatement(sql) // executed on the executor
          statement.setString(1, wc._1)
          statement.setInt(2, wc._2)
          statement.execute()
        })
        MySQLUtils.closeConnection(connection)
      })

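Running this fails with "Task not serializable". The real cause is that the connection cannot be serialized.

The function passed to foreachRDD runs on the driver, while the function passed to foreach runs on the executors. The foreach function references the connection created on the driver, so Spark has to serialize that connection in order to ship the function to the executors. Connection is not serializable, hence the error. This is a closure problem.

Closure: a function that references a variable defined outside of it.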
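
The failure can be reproduced without Spark, because Spark ships closures to executors using Java serialization. The following is a minimal sketch under that assumption: `ClosureDemo`, `canSerialize`, `capturing` and `selfContained` are hypothetical names, and a `ByteArrayOutputStream` stands in for the JDBC connection since neither is serializable.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ClosureDemo {
  // Returns true if Java serialization accepts the object, false if it rejects it.
  def canSerialize(obj: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(obj); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }

  // Version 1.0 shape: the closure captures a resource created outside of it.
  def capturing(): String => Unit = {
    val resource = new ByteArrayOutputStream() // not Serializable, like java.sql.Connection
    (word: String) => resource.write(word.getBytes("UTF-8"))
  }

  // Version 2.0 shape: the closure creates the resource inside its own body.
  def selfContained(): String => Unit =
    (word: String) => new ByteArrayOutputStream().write(word.getBytes("UTF-8"))
}
```

Calling `ClosureDemo.canSerialize(ClosureDemo.capturing())` is rejected because the captured resource travels with the function, whereas the self-contained function captures nothing and serializes without complaint.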

2.2 Version 2.0

To get around the closure problem, we create the connection inside foreach.

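    val result = stream.flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _) // word count
    result.foreachRDD(rdd => { // write the result to the database
      rdd.foreach(pair => {
        val connection = MySQLUtils.getConnection() // executed on the executor, once per record
        try {
          // Bind parameters instead of interpolating values into the SQL string
          val statement = connection.prepareStatement("insert into wc(word,cnt) values(?,?)")
          statement.setString(1, pair._1)
          statement.setInt(2, pair._2)
          statement.execute()
        } finally {
          MySQLUtils.closeConnection(connection)
        }
      })
    })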

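This runs fine and the data is written to MySQL. But with, say, 100 million records, a connection is opened and closed for every single record, which is clearly not acceptable.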

2.3 Version 3.0

To address the problem in Version 2.0, we optimize further and create one connection per partition.


    val result = stream.flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _)

    result.foreachRDD(rdd => {
      rdd.foreachPartition(partition => {
        val connection = MySQLUtils.getConnection() // one connection per partition
        try {
          // Prepare once, then reuse the statement for every record in the partition
          val statement = connection.prepareStatement("insert into wc(word,cnt) values(?,?)")
          partition.foreach(pair => {
            statement.setString(1, pair._1)
            statement.setInt(2, pair._2)
            statement.execute()
          })
        } finally {
          MySQLUtils.closeConnection(connection)
        }
      })
    })

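At this point the code is usable and acceptable, but it can be optimized further: if the data volume is large and there are many partitions, the number of database connections will still be high.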
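
To make the difference between the per-record and per-partition approaches concrete, here is a small sketch that only counts how many connections each shape would open. Plain Scala collections stand in for the partitions of an RDD, and `ConnectionCountDemo` and its members are hypothetical names used for illustration.

```scala
object ConnectionCountDemo {
  type Record = (String, Int)

  // Version 2.0 shape: a connection is opened for every record.
  def openedPerRecord(partitions: Seq[Seq[Record]]): Int = {
    var opened = 0
    partitions.foreach { partition =>
      partition.foreach { _ => opened += 1 } // open, write, close for each record
    }
    opened
  }

  // Version 3.0 shape: a connection is opened once per partition and reused.
  def openedPerPartition(partitions: Seq[Seq[Record]]): Int = {
    var opened = 0
    partitions.foreach { partition =>
      opened += 1                   // open once for the whole partition
      partition.foreach { _ => () } // every record is written on the same connection
    }
    opened
  }

  // Two simulated partitions holding five records in total.
  val sample: Seq[Seq[Record]] = Seq(
    Seq(("a", 1), ("b", 2), ("c", 3)),
    Seq(("d", 4), ("e", 5))
  )
}
```

For the sample data, the per-record shape opens five connections and the per-partition shape opens two, so the cost now scales with the number of partitions rather than the number of records.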

2.4 Version 4.0

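We can use a database connection pool instead: size the pool according to the actual production workload, borrow a connection from the pool when one is needed, and return it when done.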
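
The borrow-and-return idea can be sketched in a few lines. This is only an illustration of the principle, not how any particular pooling library is implemented: `PoolDemo`, `SimplePool`, `withResource` and `idleCount` are hypothetical names, and a real pool also handles timeouts, validation and broken connections.

```scala
import java.util.concurrent.ArrayBlockingQueue

object PoolDemo {
  // Hypothetical minimal pool: pre-creates `size` resources and lends them out.
  final class SimplePool[A](size: Int, create: () => A) {
    private val idle = new ArrayBlockingQueue[A](size)
    (1 to size).foreach(_ => idle.put(create()))

    // Borrows a resource, runs `body`, and always returns the resource to the pool.
    def withResource[B](body: A => B): B = {
      val resource = idle.take() // blocks while every resource is in use
      try body(resource)
      finally idle.put(resource)
    }

    // Number of resources currently available for borrowing.
    def idleCount: Int = idle.size()
  }
}
```

However many times `withResource` is called, only `size` resources are ever created, which is what caps the number of database connections.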

ScalikeJDBC can help with this: http://scalikejdbc.org/

1) Add the dependencies

    <dependency>
        <groupId>org.scalikejdbc</groupId>
        <artifactId>scalikejdbc_${scala.tools.version}</artifactId>
        <version>${scalikejdbc.version}</version>
    </dependency>
    <dependency>
        <groupId>org.scalikejdbc</groupId>
        <artifactId>scalikejdbc-config_${scala.tools.version}</artifactId>
        <version>${scalikejdbc.version}</version>
    </dependency>

2) Create the database connection configuration

Create application.conf under the resources directory:

db.default.driver = "com.mysql.jdbc.Driver"
db.default.url = "jdbc:mysql://localhost:3306/demo?useSSL=false&serverTimezone=UTC"
db.default.user = "root"
db.default.password = "123456"

db.default.poolInitialSize = 10
db.default.poolMaxSize = 20
db.default.connectionTimeoutMillis = 3000

3) Usage

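    import scalikejdbc._
    import scalikejdbc.config.DBs

    // Compute the word counts
    val result = stream.flatMap(_.split(","))
      .map((_, 1))
      .reduceByKey(_ + _)

    // Parse application.conf and initialize the connection pool.
    // Note: this call runs in the driver JVM; on a cluster, each executor JVM
    // needs its own initialization before DB.autoCommit is used there.
    DBs.setupAll()
    result.foreachRDD(rdd => {
      rdd.foreachPartition(partition => {
        partition.foreach(pair => {
          DB.autoCommit {
            implicit session => {
              // If the configured DB name is not "default", use NamedDB("") to select it;
              // the "default" name does not need to be specified.
              // The connection pool is used by default.
              SQL("insert into wc(word,cnt) values(?,?)")
                .bind(pair._1, pair._2)
                .update
                .apply()
            }
          }
        })
      })
    })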
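
When the job runs on a cluster rather than in local mode, the executors are separate JVMs, so a pool set up on the driver is not available to them. One common way to handle this is to trigger the setup lazily, once per JVM, from inside foreachPartition. The sketch below shows only the once-per-JVM mechanism in plain Scala: `PoolInitializer`, `ensureInitialized` and `calls` are hypothetical names, and the counter stands in for the real `DBs.setupAll()` call.

```scala
object PoolInitializer {
  @volatile private var setupCalls = 0

  // A lazy val in an object is evaluated at most once per JVM, on first access.
  private lazy val initialized: Boolean = {
    setupCalls += 1 // stand-in for DBs.setupAll()
    true
  }

  // Safe to call from every partition; only the first call performs the setup.
  def ensureInitialized(): Boolean = initialized

  // Number of times the setup body actually ran.
  def calls: Int = setupCalls
}
```

In the streaming job, `PoolInitializer.ensureInitialized()` would be the first statement inside `rdd.foreachPartition`, so each executor sets up its own pool the first time it processes a partition.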