Spark: Optimizing writing a DataFrame to SQL Server

I am using the code below to write a DataFrame of 43 columns and about 2,000,000 rows into a table in SQL Server:

dataFrame
  .write
  .format("jdbc")
  .mode("overwrite")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("url", url)
  .option("dbtable", tablename)
  .option("user", user)
  .option("password", password)
  .save()

Sadly, while it works fine for small DataFrames, it is either extremely slow or times out for large ones. Any hints on how to optimize it?

I've tried setting rewriteBatchedStatements=true.

Thanks.
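For reference, the JDBC writer does expose a batchsize option, and the write parallelism follows the DataFrame's partition count, so the standard tuning looks roughly like the sketch below; the specific values here are only placeholders, not settings I am recommending:

// Sketch of the built-in JDBC write path with the usual tuning knobs.
// "batchsize" controls how many rows go into each INSERT batch (Spark's
// default is 1000); repartition() caps the number of concurrent JDBC
// connections. The values 8 and 10000 are illustrative placeholders.
dataFrame
  .repartition(8)
  .write
  .format("jdbc")
  .mode("overwrite")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("url", url)
  .option("dbtable", tablename)
  .option("user", user)
  .option("password", password)
  .option("batchsize", "10000")
  .save()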

Solution

We resorted to using the azure-sqldb-spark library instead of Spark's built-in JDBC export. It provides a bulkCopyToSqlDB method that performs a true bulk insert and is a lot faster. It's a bit less convenient to use than the built-in writer, but in my experience it's still worth it.

We use it more or less like this:

import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._
import com.microsoft.azure.sqldb.spark.query._

val options = Map(
  "url"          -> "***",
  "databaseName" -> "***",
  "user"         -> "***",
  "password"     -> "***",
  "driver"       -> "com.microsoft.sqlserver.jdbc.SQLServerDriver"
)

// First make sure the table exists, with the correct column types,
// and is properly cleaned up if necessary.
val query = dropAndCreateQuery(df, "myTable")
val createConfig = Config(options ++ Map("QueryCustom" -> query))
spark.sqlContext.sqlDBQuery(createConfig)

val bulkConfig = Config(options ++ Map(
  "dbTable"           -> "myTable",
  "bulkCopyBatchSize" -> "20000",
  "bulkCopyTableLock" -> "true",
  "bulkCopyTimeout"   -> "600"
))

df.bulkCopyToSqlDB(bulkConfig)

As you can see, we generate the CREATE TABLE query ourselves. You can let the library create the table, but it will just do dataFrame.limit(0).write.sqlDB(config), which can still be pretty inefficient, probably requires you to cache your DataFrame, and doesn't let you choose the SaveMode.
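The dropAndCreateQuery helper used above is our own code, not part of the library. A minimal sketch of what such a helper could look like, assuming a simplified mapping from Spark SQL types to SQL Server types (the mapping and the generated DDL are illustrative, not what we actually run in production):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._

// Hypothetical helper: builds a DROP + CREATE TABLE statement from the
// DataFrame schema. The type mapping is a simplified assumption; a real
// version would handle nullability, string lengths and more types.
def dropAndCreateQuery(df: DataFrame, table: String): String = {
  def sqlType(dt: DataType): String = dt match {
    case IntegerType    => "INT"
    case LongType       => "BIGINT"
    case DoubleType     => "FLOAT"
    case BooleanType    => "BIT"
    case TimestampType  => "DATETIME2"
    case DateType       => "DATE"
    case d: DecimalType => s"DECIMAL(${d.precision},${d.scale})"
    case _              => "NVARCHAR(MAX)" // fallback, including StringType
  }
  val columns = df.schema.fields
    .map(f => s"[${f.name}] ${sqlType(f.dataType)}")
    .mkString(", ")
  s"IF OBJECT_ID('$table', 'U') IS NOT NULL DROP TABLE $table; " +
    s"CREATE TABLE $table ($columns)"
}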

Also potentially interesting: we had to use an ExclusionRule when adding this library to our sbt build, or the assembly task would fail.

libraryDependencies += "com.microsoft.azure" % "azure-sqldb-spark" % "1.0.2" excludeAll(
  ExclusionRule(organization = "org.apache.spark")
)
