Writing data from Spark to ClickHouse via the official JDBC driver

Since earlier tests showed that Spark 2.4.0 and later cannot write to ClickHouse through the native JDBC interface (see the previous post), I tried the official JDBC driver instead.

Background

  • ClickHouse with two shards, no replicas
  • Hive partitions are read and written to the two shards alternately, one partition per shard (see the sketch below)
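
A minimal sketch of that routing, assuming the two shard URLs used in the implementation (shardFor is a hypothetical helper; the actual code below keeps a running index instead):

val chUrls = Array(
  "jdbc:clickhouse://1.1.1.1:8123/default",
  "jdbc:clickhouse://2.2.2.2:8123/default")

// 1-based Hive partition index -> shard URL, alternating between the two shards
def shardFor(partitionIndex: Int): String = chUrls((partitionIndex - 1) % chUrls.length)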

Implementation

import java.util.Random

import org.apache.spark.SparkConf
import org.apache.spark.sql.types.{DoubleType, LongType, StringType}
import org.apache.spark.sql.{SaveMode, SparkSession}
import ru.yandex.clickhouse.ClickHouseDataSource

object OfficialJDBCDriver {

  val chDriver = "ru.yandex.clickhouse.ClickHouseDriver"
  val chUrls = Array(
    "jdbc:clickhouse://1.1.1.1:8123/default",
    "jdbc:clickhouse://2.2.2.2:8123/default")

  def main(args: Array[String]): Unit = {
    if (args.length < 3) {
      System.err.println("Usage: OfficialJDBCDriver <tableName> <partitions> <batchSize>\n" +
        "  <tableName> is the Hive table name\n" +
        "  <partitions> is the comma-separated list of Hive partitions to write to ClickHouse, e.g. 20200516,20200517\n" +
        "  <batchSize> is the JDBC batch size, e.g. 1000\n\n")
      System.exit(1)
    }

    val (tableName, partitions, batchSize) = (args(0), args(1).split(","), args(2).toInt)
    val sparkConf: SparkConf =
      new SparkConf()
        .setAppName("OfficialJDBCDriver")

    val spark: SparkSession =
      SparkSession
        .builder()
        .enableHiveSupport()
        .config(sparkConf)
        .getOrCreate()

    val pro = new java.util.Properties
    pro.put("driver", chDriver)
    pro.setProperty("user", "default")
    pro.setProperty("password", "123456")
    var chShardIndex = 1
    for (partition <- partitions) {
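      // Round-robin: the i-th Hive partition is written to shard (i - 1) % chUrls.length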
      val chUrl = chUrls((chShardIndex - 1) % chUrls.length)
      val sql = s"select * from tmp.$tableName where dt = $partition"
      val df = spark.sql(sql)
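      // Build the column list and the matching "?" placeholders from the DataFrame schema,
      // e.g. "id, name, score" and "?, ?, ?", to assemble the INSERT statement below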
      val (fieldNames, placeholders) = df.schema.fieldNames.foldLeft(("", ""))(
        (acc, name) =>
          if (acc._1.nonEmpty && acc._2.nonEmpty)
            (acc._1 + ", " + name, acc._2 + ", " + "?")
          else (name, "?")
      )
      val insertSQL = s"insert into my_table ($fieldNames) values ($placeholders)"
      df.foreachPartition(records => {
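        // One ClickHouse connection per Spark partition: JDBC objects are not
        // serializable, so they must be created here on the executors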
        try {
          var count = 0
          val chDatasource = new ClickHouseDataSource(chUrl, pro)
          val chConn = chDatasource.getConnection("default", "123456")
          val psmt = chConn.prepareStatement(insertSQL)
          while (records.hasNext) {
            val record = records.next()
            var fieldIndex = 1
            record.schema.fields.foreach(field => {
              field.dataType match {
                case StringType =>
                  psmt.setString(fieldIndex, record.getAs[String](field.name))
                case LongType =>
                  psmt.setLong(fieldIndex, record.getAs[Long](field.name))
                case DoubleType =>
                  psmt.setDouble(fieldIndex, record.getAs[Double](field.name))
                // add handling for any other column types you need here
                case _ => println(s"other type: ${field.dataType}")
              }
              fieldIndex += 1
            })
            psmt.addBatch()
            count += 1
            // flush once batchSize rows have been buffered
            if (count % batchSize == 0) {
              psmt.executeBatch()
              psmt.clearBatch()
            }
          }
          // flush any remaining rows in the last, partial batch
          psmt.executeBatch()
          psmt.close()
          chConn.close()
        } catch {
          case e: Exception =>
            e.printStackTrace()
        }
      })
      chShardIndex += 1
    }
    spark.close()
  }
}

Test results:

With Spark 2.4.0, using 4 executors (1 core and 4 GB of memory each) and a batch size of 50,000, writing 5 million rows took 76 seconds.
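
For reference, a submission matching that setup could look like the line below (the jar name, Hive table, and partition values are placeholders; the three trailing arguments are <tableName> <partitions> <batchSize> as parsed in main):

spark-submit --class OfficialJDBCDriver --num-executors 4 --executor-cores 1 --executor-memory 4g official-jdbc-driver.jar my_hive_table 20200516,20200517 50000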
