【Spark #76】Saving Spark Computation Results to MySQL

package spark.examples.db

import java.sql.{PreparedStatement, Connection, DriverManager}

import com.mysql.jdbc.Driver
import org.apache.spark.{SparkContext, SparkConf}

object SparkMySQLIntegration {

  case class Person(name: String, age: Int)

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SparkMySQLIntegration").setMaster("local")
    val sc = new SparkContext(conf)
    val data = sc.parallelize(List(("Tom", 31), ("Jack", 22), ("Mary", 25)))
    def func(iter: Iterator[(String, Int)]): Unit = {
//      Class.forName("com.mysql.jdbc.Driver ")  // see pitfall 1 below: unnecessary, and the stray trailing space in the class name caused a ClassNotFoundException
      var conn: Connection = null
      val d: Driver = null  // unused reference to the MySQL Driver class; a JDBC 4 driver registers itself with DriverManager
      var pstmt: PreparedStatement = null
      try {
        val url="jdbc:mysql://localhost:3306/person";
        val user="root";
        val password=""
        // Open the connection inside foreachPartition so that it is opened on the worker, not on the driver
        conn = DriverManager.getConnection(url, user, password)
        // Prepare the statement once per partition and reuse it for every row;
        // preparing it inside the loop would leak a PreparedStatement per record.
        val sql = "insert into TBL_PERSON(name, age) values (?, ?)"
        pstmt = conn.prepareStatement(sql)
        while (iter.hasNext) {
          val item = iter.next()
          println(item._1 + "," + item._2)
          pstmt.setString(1, item._1)
          pstmt.setInt(2, item._2)
          pstmt.executeUpdate()
        }
      } catch {
        case e: Exception => e.printStackTrace()
      } finally {
        if (pstmt != null) {
          pstmt.close()
        }
        if (conn != null) {
          conn.close()
        }
      }
    }
    data.foreachPartition(func);
  }

}

This code ran into two pitfalls:

1. Following the usual Java JDBC habit, the first step would be to register the MySQL driver with Class.forName("com.mysql.jdbc.Driver"). Here that is not needed (a JDBC 4 driver on the classpath registers itself with DriverManager), and the call even failed with a ClassNotFoundException although com.mysql.jdbc.Driver was clearly on the classpath; note that the class-name string in the original call contained a stray trailing space, which would explain the exception.

That is why the call is left commented out in the code above.

2. When running this code locally it kept failing with a MySQL syntax error near "(?, ?)", and no amount of staring at the statement revealed anything wrong. The culprit turned out to be that pstmt.executeUpdate() had been written as pstmt.executeUpdate(sql). IntelliJ IDEA compiles this without complaint because PreparedStatement inherits executeUpdate(String) from Statement; that variant executes the raw SQL text, so the literal "?" placeholders reach MySQL and trigger the syntax error.
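
Side by side, using the pstmt and sql from the code above:

// Wrong: Statement.executeUpdate(String), which PreparedStatement inherits, ignores the
// bound parameters and sends the raw "insert into TBL_PERSON(name, age) values (?, ?)"
// text to MySQL, which then reports a syntax error near "(?, ?)".
// pstmt.executeUpdate(sql)

// Right: execute the statement that was prepared above, with its parameters bound.
pstmt.executeUpdate()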

Best practices for writing Spark RDDs to MySQL and other storage systems

Writing a Spark RDD out to a storage system, whether a relational database such as MySQL or a NoSQL store such as MongoDB or HBase, puts real pressure on the storage side: each partition of the RDD can hold a large amount of data, and the limited number of connections the storage servers can accept has to be used sparingly. Here are some best practices:

  • You can write your own custom writer and call a transform on your RDD to write each element to a database of your choice, but there are many ways to write something that looks like it would work, yet does not work well in a distributed environment. Here are some things to watch out for:
  • A common naive mistake is to open a connection on the Spark driver program, and then try to use that connection on the Spark workers. The connection should be opened on the Spark worker, such as by calling forEachPartition and opening the connection inside that function.
  • Use partitioning to control the parallelism for writing to your data storage. Your data storage may not support too many concurrent connections.
  • Use batching for writing out multiple objects at a time if batching is optimal for your data storage.
  • Make sure your write mechanism is resilient to failures.
  • Writing out a very large dataset can take a long time, which increases the chance something can go wrong - a network failure, etc.
  • Consider utilizing a static pool of database connections on your Spark workers (a sketch follows after this list).
  • If you are writing to a sharded data storage, partition your RDD to match your sharding strategy. That way each of your Spark workers only connects to one database shard, rather than each Spark worker connecting to every database shard.
  • Be cautious when writing out so much data, and make sure you understand the distributed nature of Spark!
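
To make the connection-pool point concrete, here is a minimal sketch, assuming HikariCP is available on the executors' classpath; the WorkerConnectionPool object, URL, credentials, pool size, and table are illustrative and not part of the original example. Because a Scala object is initialized at most once per JVM, each executor lazily builds one small pool and every task running on that executor reuses it:

import java.sql.Connection
import com.zaxxer.hikari.{HikariConfig, HikariDataSource}

// Initialized lazily, at most once per executor JVM, so all tasks in that JVM
// share the same small pool instead of opening their own connections.
object WorkerConnectionPool {
  lazy val dataSource: HikariDataSource = {
    val config = new HikariConfig()
    config.setJdbcUrl("jdbc:mysql://localhost:3306/person")
    config.setUsername("root")
    config.setPassword("")
    config.setMaximumPoolSize(4) // keep the per-executor connection count modest
    new HikariDataSource(config)
  }

  // Write one partition's rows through a pooled connection.
  def writePartition(iter: Iterator[(String, Int)]): Unit = {
    val conn: Connection = dataSource.getConnection
    try {
      val pstmt = conn.prepareStatement("insert into TBL_PERSON(name, age) values (?, ?)")
      try {
        iter.foreach { case (name, age) =>
          pstmt.setString(1, name)
          pstmt.setInt(2, age)
          pstmt.executeUpdate()
        }
      } finally {
        pstmt.close()
      }
    } finally {
      conn.close() // returns the connection to the pool rather than tearing it down
    }
  }
}

On the driver side the call is then data.foreachPartition(iter => WorkerConnectionPool.writePartition(iter)); the pool itself is never serialized, it is created on each executor the first time the object is touched there.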

**The list above mentions batching. Batching is a very effective way to conserve connection resources: group multiple inserts or updates into a single batch and send them to the storage engine over one connection. It is worth looking at the batch APIs of both MySQL and MongoDB.**
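
A rough sketch of what batching looks like with plain JDBC, reusing the data RDD and TBL_PERSON table from the example above; the batch size of 500 and the rewriteBatchedStatements=true URL option are illustrative (the latter is a MySQL Connector/J setting that lets the driver rewrite a batch into a single multi-row INSERT):

import java.sql.DriverManager

data.foreachPartition { iter =>
  val conn = DriverManager.getConnection(
    "jdbc:mysql://localhost:3306/person?rewriteBatchedStatements=true", "root", "")
  try {
    conn.setAutoCommit(false) // one commit per flush instead of one per row
    val pstmt = conn.prepareStatement("insert into TBL_PERSON(name, age) values (?, ?)")
    try {
      var pending = 0
      iter.foreach { case (name, age) =>
        pstmt.setString(1, name)
        pstmt.setInt(2, age)
        pstmt.addBatch()              // queue the row instead of sending it immediately
        pending += 1
        if (pending % 500 == 0) pstmt.executeBatch() // flush every 500 rows
      }
      pstmt.executeBatch()            // flush the remaining rows
      conn.commit()
    } finally {
      pstmt.close()
    }
  } finally {
    conn.close()
  }
}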