Ways to Connect Spark SQL to JDBC, with Code

Latest recommended article published 2024-07-25 10:39:07

Aricya


Tags: spark sql java

Article link: https://blog.csdn.net/Aricya/article/details/128240496

Preface

Spark SQL includes a data source that reads data from other databases over JDBC. Prefer it to JdbcRDD: the results come back as a DataFrame, so they can be processed in Spark SQL or joined with other data sources with little effort. The JDBC data source is also easier to use from Java or Python because it does not require the user to supply a ClassTag.

JDBC itself is the standard Java API for talking to relational databases, so the JDBC driver for your database (MySQL, Oracle, and so on) must be on the Spark classpath. Spark reads the table by sending SQL queries to the database over that connection. For a large table or a complex query, a single connection quickly becomes the bottleneck, so how the read is split into partitions largely determines performance. The three methods below differ mainly in how they partition the read.

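Method 1:

 This call has a parallelism of 1: all of the data is read into a single partition, so no matter how many resources you allocate, only one task does the work. Throughput is correspondingly poor, and on even a moderately large table the job can run out of memory (OOM) within minutes. As a rule of thumb, do not use this overload on tables with tens of millions of rows: a single task would have to pull every row through one JDBC connection, which is slow and likely to bring the executor down.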

def jdbc(url: String, table: String, properties: Properties): DataFrame

Construct a DataFrame representing the database table accessible via JDBC URL url named table and connection properties.

Since

    1.4.0

import java.util.Properties

// MySQL
val url = "jdbc:mysql://mysqlHost:3306/database"
// Oracle (the default listener port is 1521, not 3306)
// val url = "jdbc:oracle:thin:@//oracleHost:1521/database"

val tableName = "table"

// Set the connection user and password
val prop = new Properties
prop.setProperty("user", "username")
prop.setProperty("password", "pwd")

// Read the whole table into a single partition
val jdbcDF = spark.read.jdbc(url, tableName, prop)
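
The Spark API documentation quoted in this article notes that the connection properties can also carry "fetchsize" (rows per fetch) and "queryTimeout" (seconds to wait for a statement). A minimal sketch of setting them is shown here; the object name JdbcProps, the helper connectionProps, and the values 1000 and 60 are hypothetical and chosen only for illustration.

```scala
import java.util.Properties

object JdbcProps {
  // Hypothetical helper: collects the connection properties in one place.
  // "fetchsize" and "queryTimeout" are the property names mentioned in the
  // Spark API documentation; suitable values depend on your database.
  def connectionProps(user: String,
                      password: String,
                      fetchSize: Int,
                      queryTimeoutSec: Int): Properties = {
    val p = new Properties()
    p.setProperty("user", user)
    p.setProperty("password", password)
    p.setProperty("fetchsize", fetchSize.toString)          // rows per round trip
    p.setProperty("queryTimeout", queryTimeoutSec.toString) // seconds
    p
  }
}

// Example values only; tune them for your own workload.
val tunedProp = JdbcProps.connectionProps("username", "pwd", 1000, 60)
```

The resulting object can be passed wherever the examples in this article pass prop. A larger fetch size may reduce round trips, but it does not change the number of partitions.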

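Method 2:

     This overload adds a predicates parameter, which raises the parallelism so that the table is read in parallel. predicates is an Array[String]; each element is a SQL condition suitable for a WHERE clause and defines exactly one partition of the DataFrame. The conditions can filter on any column, but they should not overlap and should together cover every row you need; otherwise rows are read twice or missed.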

def jdbc(url: String, table: String, predicates: Array[String], connectionProperties: Properties): DataFrame

Construct a DataFrame representing the database table accessible via JDBC URL url named table using connection properties. The predicates parameter gives a list of expressions suitable for inclusion in WHERE clauses; each one defines one partition of the DataFrame.

Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems.

url
    JDBC database url of the form jdbc:subprotocol:subname
    
table
    Name of the table in the external database.
    
predicates
    Condition in the where clause for each partition.
    
connectionProperties
    JDBC database connection arguments, a list of arbitrary string tag/value. Normally at least a "user" and "password" property should be included. "fetchsize" can be used to control the number of rows per fetch.

Since

    1.4.0

MySQL:
val mysql_predicates = "to_days(START_TIME)>=to_days('2018-01-01') and to_days(START_TIME)<to_days('2018-10-01')#to_days(START_TIME)>=to_days('2018-10-01') and to_days(START_TIME)<to_days('2019-01-01')#to_days(START_TIME)>=to_days('2019-01-01') and to_days(START_TIME)<to_days('2019-11-08')"

// One array element per partition
val mysql_predicates_array = mysql_predicates.split("#")

Oracle:
// An explicit format keeps the string comparison independent of the session's NLS_DATE_FORMAT
val oracle_predicates = "to_char(START_TIME, 'YYYY-MM-DD')>='2018-01-01' and to_char(START_TIME, 'YYYY-MM-DD')<'2018-10-01'#to_char(START_TIME, 'YYYY-MM-DD')>='2018-10-01' and to_char(START_TIME, 'YYYY-MM-DD')<'2019-01-01'#to_char(START_TIME, 'YYYY-MM-DD')>='2019-01-01' and to_char(START_TIME, 'YYYY-MM-DD')<='2019-11-01'"

val oracle_predicates_array = oracle_predicates.split("#")
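
Writing the ranges by hand makes it easy to leave a gap or an overlap between partitions. The sketch below shows one possible way to generate them from an ordered list of boundary dates. The names PredicateSketch and dateRangePredicates are hypothetical, and the generated conditions assume MySQL-style date literals compared directly against the column; an Oracle version would need its own date conversion.

```scala
object PredicateSketch {
  // Builds one half-open range [start, end) per consecutive pair of
  // boundaries, so adjacent partitions neither overlap nor leave a gap.
  def dateRangePredicates(column: String, boundaries: Seq[String]): Array[String] = {
    require(boundaries.size >= 2, "need at least two boundaries")
    boundaries.sliding(2).map { case Seq(start, end) =>
      s"$column >= '$start' AND $column < '$end'"
    }.toArray
  }
}

// Same date ranges as the MySQL example in this section
val preds = PredicateSketch.dateRangePredicates(
  "START_TIME",
  Seq("2018-01-01", "2018-10-01", "2019-01-01", "2019-11-08"))

preds.foreach(println)
```

Passing preds as the predicates argument of spark.read.jdbc(url, table, predicates, connectionProperties) would give one partition per generated condition. Rows outside the first and last boundary are not read at all, so the boundaries need to span the data you want.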

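Method 3:

 For the example in this section, jdbcDF.rdd.partitions.size returns 10: the values 1 to 10000000 of the column colName are split into 10 ranges, one per partition. This is very convenient, but it has an obvious limitation: because lowerBound and upperBound are Long values, this overload can in practice only partition on an integer column. Note also that the bounds only decide the partition stride and do not filter the table, so rows outside the range still end up in the first or last partition.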

def jdbc(url: String, table: String, columnName: String, lowerBound: Long, upperBound: Long, numPartitions: Int, connectionProperties: Properties): DataFrame

Construct a DataFrame representing the database table accessible via JDBC URL url named table. Partitions of the table will be retrieved in parallel based on the parameters passed to this function.

Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems.

url

    JDBC database url of the form jdbc:subprotocol:subname.
table

    Name of the table in the external database.
columnName

    the name of a column of numeric, date, or timestamp type that will be used for partitioning.
lowerBound

    the minimum value of columnName used to decide partition stride.
upperBound

    the maximum value of columnName used to decide partition stride.
numPartitions

    the number of partitions. This, along with lowerBound (inclusive), upperBound (exclusive), form partition strides for generated WHERE clause expressions used to split the column columnName evenly. When the input is less than 1, the number is set to 1.
connectionProperties

    JDBC database connection arguments, a list of arbitrary string tag/value. Normally at least a "user" and "password" property should be included. "fetchsize" can be used to control the number of rows per fetch and "queryTimeout" can be used to wait for a Statement object to execute to the given number of seconds.

Since

    1.4.0

val url = "jdbc:mysql://mysqlHost:3306/database"
val tableName = "table"

val columnName = "colName"
val lowerBound = 1L
val upperBound = 10000000L
val numPartitions = 10

// Set the connection user and password
val prop = new java.util.Properties
prop.setProperty("user", "username")
prop.setProperty("password", "pwd")

// Read the table in parallel, partitioned on columnName
val jdbcDF = spark.read.jdbc(url, tableName, columnName, lowerBound, upperBound, numPartitions, prop)

// Further operations on jdbcDF go here
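
To make the role of lowerBound, upperBound, and numPartitions more concrete, here is a simplified sketch of how such bounds can be turned into one WHERE clause per partition. It is an approximation written for this article, not Spark's actual implementation, and the exact rounding may differ between Spark versions. The names StrideSketch and rangePredicates are hypothetical.

```scala
object StrideSketch {
  // Simplified model: split [lowerBound, upperBound) into equal strides.
  // The first partition has no lower limit and the last has no upper limit,
  // so rows outside the bounds are still read rather than filtered out.
  def rangePredicates(column: String,
                      lowerBound: Long,
                      upperBound: Long,
                      numPartitions: Int): Seq[String] = {
    require(lowerBound < upperBound, "lowerBound must be less than upperBound")
    val n = math.max(numPartitions, 1) // values below 1 are treated as 1
    val stride = upperBound / n - lowerBound / n
    (0 until n).map { i =>
      val lo = lowerBound + i * stride
      val hi = lo + stride
      (i == 0, i == n - 1) match {
        case (true, true)   => "1 = 1" // single partition: no filtering
        case (true, false)  => s"$column < $hi OR $column IS NULL"
        case (false, true)  => s"$column >= $lo"
        case (false, false) => s"$column >= $lo AND $column < $hi"
      }
    }
  }
}

// Same values as the Method 3 example
StrideSketch.rangePredicates("colName", 1L, 10000000L, 10).foreach(println)
```

Under this model each of the 10 partitions covers 1000000 consecutive values of colName. If the real values are skewed toward one end of the range, the partitions will be uneven even though the strides are equal, which is one reason to pick bounds close to the column's actual minimum and maximum.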
