Connecting Spark to a database via the jdbc() method

The jdbc() method

Spark SQL supports reading data from other databases over JDBC as a data source. This should be preferred over JdbcRDD, because the results come back as a DataFrame, which can easily be processed in Spark SQL or joined with other data sources. The JDBC data source is also easier to use from Java or Python, since it does not require the user to supply a ClassTag. (Note that this is different from the Spark SQL JDBC server, which lets other applications run queries through Spark SQL.)

The method lives on the DataFrameReader class in org.apache.spark.sql.

It comes in three forms:

Form 1:

def jdbc(url: String, table: String, properties: Properties): DataFrame

Construct a DataFrame representing the database table accessible via JDBC URL url named table and connection properties.

Since

    1.4.0

Example:

import java.util.Properties

// MySQL
val url = "jdbc:mysql://mysqlHost:3306/database"
// Oracle (note: Oracle's default port is 1521, not 3306)
//val url = "jdbc:oracle:thin:@//oracleHost:1521/database"

val tableName = "table"

// Set the connection user & password
val prop = new Properties
prop.setProperty("user", "username")
prop.setProperty("password", "pwd")

// Read the table
val jdbcDF = spark.read.jdbc(url, tableName, prop)

Check the parallelism:

jdbcDF.rdd.partitions.size // returns 1

Drawback:

This call has a parallelism of 1: all of the data lands in a single partition, so no matter how many resources you allocate, only one task does the work. Performance suffers accordingly, and on even a moderately large table it will OOM in minutes.

Put more bluntly: once a table reaches tens of millions of rows, do not use this form.
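One partial mitigation, sketched below (reusing spark, url, tableName and prop from the example above, and assuming the table is small enough to read at all): the JDBC read itself still runs as a single task, but an immediate repartition() at least spreads the downstream processing across the cluster.

// The read is still one task, but later stages run in parallel
val jdbcDF = spark.read.jdbc(url, tableName, prop).repartition(10)
jdbcDF.rdd.partitions.size // returns 10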

Form 2:


def jdbc(url: String, table: String, predicates: Array[String], connectionProperties: Properties): DataFrame

Construct a DataFrame representing the database table accessible via JDBC URL url named table using connection properties. The predicates parameter gives a list of expressions suitable for inclusion in WHERE clauses; each one defines one partition of the DataFrame.

Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems.

url
    JDBC database url of the form jdbc:subprotocol:subname
    
table
    Name of the table in the external database.
    
predicates
    Condition in the where clause for each partition.
    
connectionProperties
    JDBC database connection arguments, a list of arbitrary string tag/value. Normally at least a "user" and "password" property should be included. "fetchsize" can be used to control the number of rows per fetch.

Since

    1.4.0

This form adds a predicates parameter to raise the read parallelism, so the fetch runs in parallel.
predicates is an Array[String]; each element is a SQL filter condition, and each condition defines one partition.

For example:

// MySQL
var mysql_predicates = "to_days(START_TIME)>=to_days('2018-01-01') and to_days(START_TIME)<to_days('2018-10-01')#to_days(START_TIME)>=to_days('2018-10-01') and to_days(START_TIME)<to_days('2019-01-01')#to_days(START_TIME)>=to_days('2019-01-01') and to_days(START_TIME)<to_days('2019-11-08')"

var mysql_predicates_array = mysql_predicates.split("#")

// Oracle
var oracle_predicates = "to_char(START_TIME)>='2018-01-01' and to_char(START_TIME)<'2018-10-01'#to_char(START_TIME)>='2018-10-01' and to_char(START_TIME)<'2019-01-01'#to_char(START_TIME)>='2019-01-01' and to_char(START_TIME)<='2019-11-01'"

var oracle_predicates_array = oracle_predicates.split("#")

Here START_TIME is a timestamp column in our data.
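With the array built, the read itself looks like the sketch below (url, tableName and prop as defined in Form 1). Each array element becomes the WHERE condition of one partition, so three predicates yield three partitions.

// One partition per predicate
val jdbcDF = spark.read.jdbc(url, tableName, mysql_predicates_array, prop)
jdbcDF.rdd.partitions.size // returns 3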

Form 3:



def jdbc(url: String, table: String, columnName: String, lowerBound: Long, upperBound: Long, numPartitions: Int, connectionProperties: Properties): DataFrame

Construct a DataFrame representing the database table accessible via JDBC URL url named table. Partitions of the table will be retrieved in parallel based on the parameters passed to this function.

Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems.

url

    JDBC database url of the form jdbc:subprotocol:subname.
table

    Name of the table in the external database.
columnName

    the name of a column of numeric, date, or timestamp type that will be used for partitioning.
lowerBound

    the minimum value of columnName used to decide partition stride.
upperBound

    the maximum value of columnName used to decide partition stride.
numPartitions

    the number of partitions. This, along with lowerBound (inclusive), upperBound (exclusive), form partition strides for generated WHERE clause expressions used to split the column columnName evenly. When the input is less than 1, the number is set to 1.
connectionProperties

    JDBC database connection arguments, a list of arbitrary string tag/value. Normally at least a "user" and "password" property should be included. "fetchsize" can be used to control the number of rows per fetch and "queryTimeout" can be used to wait for a Statement object to execute to the given number of seconds.

Since

    1.4.0
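To make the stride arithmetic concrete: with lowerBound = 0, upperBound = 1000 and numPartitions = 4, the stride is 250 and Spark generates WHERE clauses roughly like the following (an illustration of the splitting logic, not verbatim Spark output):

colName < 250 or colName is null
colName >= 250 AND colName < 500
colName >= 500 AND colName < 750
colName >= 750

Note that the bounds only shape the strides; rows outside [lowerBound, upperBound) are not filtered out, they simply all land in the first or last partition, which can cause skew.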

Partitioning on a Long-typed column

Example:

val url = "jdbc:mysql://mysqlHost:3306/database"
val tableName = "table"

val columnName = "colName"
val lowerBound = 1,
val upperBound = 10000000,
val numPartitions = 10,

// 设置连接用户&密码
val prop = new java.util.Properties
prop.setProperty("user","username")
prop.setProperty("password","pwd")

// 取得该表数据
val jdbcDF = spark.read.jdbc(url,tableName,columnName,lowerBound,upperBound,numPartitions,prop)

// ... further processing

Drawback:

jdbcDF.rdd.partitions.size // returns 10
This spreads the rows whose colName values fall between 1 and 10000000 across 10 partitions. It is very convenient to use, but the limitation is equally obvious: this overload only accepts an integer-typed column as the partitioning key.
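For the partitions to stay balanced, lowerBound and upperBound should match the actual data. A hedged sketch of one way to obtain them, passing a subquery as the table argument (colName, tableName, url and prop as above; the alias t is required by most databases):

// Fetch the real min/max of the partition column first
val row = spark.read
  .jdbc(url, s"(SELECT MIN(colName) AS lo, MAX(colName) AS hi FROM $tableName) AS t", prop)
  .head()
val lowerBound = row.getAs[Number]("lo").longValue
val upperBound = row.getAs[Number]("hi").longValue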

Going further:

One more knob worth tuning when reading from a database over JDBC:

Set the number of rows fetched per round trip:

// Number of rows Spark fetches from the database per round trip;
// some JDBC drivers default to a very low value (Oracle, for example, fetches 10 rows at a time)
val fetchsize: Int = 10000
var data = spark.read.option("fetchsize", fetchsize).jdbc(url, table, predicates, prop)


Everything above deals with reading from the database.

Next, one way to speed up writing to the database:

Set batchsize (for writes):

var data = spark.read.option("batchsize",100000).jdbc(url, table, predicates, prop)

Summary: Form 1 is acceptable only for small tables; for anything large, use Form 2 (predicates) or Form 3 (a numeric column with bounds) to read in parallel, and tune fetchsize for reads and batchsize for writes.

References: 凯文の博客 | 超叔csdn 博客 | banana` 的博客 | 大壮vip 博客
