前言
一般的将Spark dataframe写入clickhouse的代码
.
// An highlighted block
val prop=new Properties()
val ckDriver="ru.yandex.clickhouse.ClickHouseDriver"
prop.put("driver",ckDriver)
prop.put("user",username)
prop.put("password",password)
df.write.mode(saveMode="append")
.option("batchsize", "20000")
.option("isolationLevel", "NONE") // 设置事务
.option("numPartitions", "1") // 设置并发
.jdbc(url, table,prop)
问题
如果Spark Dataframe中包含Array(String)类型的列,会报如下错:
Caused by: java.lang.IllegalArgumentException: Can’t get JDBC type for array
解决
这里的driver和url是要对应的,官网提供了两种jdbc的连接Clickhouse JDBC
通过原生clickhouse jdbc的PreparedStatement写入:
def insertToCkWithStatement(database :String,tableName :String,df:DataFrame,parallels:Int = 100): Unit ={
df.repartition(parallels).foreachPartition(rows =>{
Class.forName(driver) // com.github.housepower.jdbc.ClickHouseDriver ru.yandex.clickhouse.ClickHouseDriver
var connection: ClickHouseConnection = null
var pstmt: PreparedStatement= null
val properties=new ClickHouseProperties()
properties.setUser(user)
properties.setPassword(password)
properties.setDatabase(database)
try{
connection = new ClickHouseDataSource(url, properties).getConnection()
pstmt= connection.prepareStatement(s"insert into $database.$tableName values (?,?,?)")
rows.foreach(line => {
count += 1
line.schema.fields.foreach(field => {
indexField += 1
val columnName = field.name
// 这里对Array(String)类型的特殊处理
if (columnName.equals("url_req_arr")){
var value = line.getAs[Seq[String]](columnName)
if (value == null) value = Seq[String]()
val arr = connection.createArrayOf("String", value.toArray)
pstmt.setArray(indexField, arr)
// 处理Int类型
}else if (field.dataType.equals(IntegerType)){
pstmt.setInt(indexField, line.getAs[Int](columnName))
// 处理String类型
}else{
pstmt.setString(indexField, line.getAs[String](columnName))
}
})
// 批量写入
pstmt.addBatch()
indexField = 0
if (count >= 100000){
pstmt.executeBatch()
count = 0
}
})
pstmt.executeBatch()
}catch {
case ex: SQLException =>
println(ex)
case _ =>
println("There are other exceptions!!!")
} finally {
if (statement != null)
statement.close()
if(connection != null)
connection.close()
}
})
}
心得体会:
1、这里的Connection用的是ru.yandex.clickhouse.ClickHouseConnection,之前一直用的是java.sql.Connection,虽然数据能写入但会出现各种问题,比如会出现乱码, 端口只有9000能使用,8123端口一直使用不了等;
2、如果用Statement操作,目前还未找到能将Array(String)写入的方法。而且,Statement仅仅支持单条的插入,而不能addBatch和executeBatch
3、写入速度经测试,4千万数据6个G左右,差不多5分钟能完全写入,舒服极了。