The jdbc() method
Spark SQL supports reading from other databases over JDBC as a data source. This should be preferred over using JdbcRDD, because the result comes back as a DataFrame and can easily be processed in Spark SQL or joined with other data sources. The JDBC data source is also easier to use from Java or Python, since it does not require the user to provide a ClassTag. (Note that this is different from the Spark SQL JDBC server, which allows other applications to run queries using Spark SQL.)
The method lives on the DataFrameReader class in org.apache.spark.sql.
Form 1:
def jdbc(url: String, table: String, properties: Properties): DataFrame
Construct a DataFrame representing the database table accessible via JDBC URL url named table and connection properties.
Since: 1.4.0
Example:
import java.util.Properties

// mysql
val url = "jdbc:mysql://mysqlHost:3306/database"
// oracle (Oracle's default listener port is 1521, not 3306)
//val url = "jdbc:oracle:thin:@//oracleHost:1521/database"
val tableName = "table"
// set the connection user & password
val prop = new Properties
prop.setProperty("user", "username")
prop.setProperty("password", "pwd")
// read the table
val jdbcDF = spark.read.jdbc(url, tableName, prop)
Check the parallelism (number of partitions):
jdbcDF.rdd.partitions.size // returns 1
Drawbacks:
This form reads with a parallelism of 1: all of the data lands in a single partition, so no matter how many resources you give the job, only one task actually does the work. The efficiency is predictably poor, and on even a moderately large table you can hit an OOM in no time.
Put more bluntly: once a table reaches the tens-of-millions-of-rows range, do not use this form.
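If you are stuck with this form (say, on a small-to-medium table), one common mitigation is to repartition right after the read so that downstream transformations run in parallel; note that the JDBC read itself still happens in a single task. A minimal sketch, reusing the url, tableName and prop values from the example above (the partition count 8 is an arbitrary placeholder):
// Repartition after the single-partition JDBC read so that subsequent
// transformations are parallel; the read itself is still one task.
val repartitionedDF = spark.read.jdbc(url, tableName, prop).repartition(8)
repartitionedDF.rdd.partitions.size // returns 8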
Form 2:
def jdbc(url: String, table: String, predicates: Array[String], connectionProperties: Properties): DataFrame
Construct a DataFrame representing the database table accessible via JDBC URL url named table using connection properties. The predicates parameter gives a list of expressions suitable for inclusion in WHERE clauses; each one defines one partition of the DataFrame.
Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems.
url: JDBC database url of the form jdbc:subprotocol:subname
table: Name of the table in the external database.
predicates: Condition in the where clause for each partition.
connectionProperties: JDBC database connection arguments, a list of arbitrary string tag/value. Normally at least a "user" and "password" property should be included. "fetchsize" can be used to control the number of rows per fetch.
Since: 1.4.0
The predicates parameter is added to raise the degree of parallelism, so the read is split across tasks and speeds up.
predicates is an Array[String]; each element is a SQL filter condition, and each condition defines one partition.
For example:
MySQL:
val mysql_predicates = "to_days(START_TIME)>=to_days('2018-01-01') and to_days(START_TIME)<to_days('2018-10-01')#to_days(START_TIME)>=to_days('2018-10-01') and to_days(START_TIME)<to_days('2019-01-01')#to_days(START_TIME)>=to_days('2019-01-01') and to_days(START_TIME)<to_days('2019-11-08')"
val mysql_predicates_array = mysql_predicates.split("#")
Oracle:
val oracle_predicates = "to_char(START_TIME,'yyyy-mm-dd')>='2018-01-01' and to_char(START_TIME,'yyyy-mm-dd')<'2018-10-01'#to_char(START_TIME,'yyyy-mm-dd')>='2018-10-01' and to_char(START_TIME,'yyyy-mm-dd')<'2019-01-01'#to_char(START_TIME,'yyyy-mm-dd')>='2019-01-01' and to_char(START_TIME,'yyyy-mm-dd')<='2019-11-01'"
val oracle_predicates_array = oracle_predicates.split("#")
Here START_TIME is a date/timestamp column in our data.
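The predicates array is then passed directly to jdbc(); each element becomes the WHERE clause of one partition. A minimal sketch, reusing the url, tableName and prop values from the Form 1 example (the "fetchsize" line is optional and only shown because the docs above mention it; the value 1000 is an arbitrary placeholder):
// Optional: control how many rows are fetched per database round trip.
prop.setProperty("fetchsize", "1000")
// Three predicate strings => three partitions => three parallel read tasks.
val jdbcDF2 = spark.read.jdbc(url, tableName, mysql_predicates_array, prop)
jdbcDF2.rdd.partitions.size // returns 3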
Form 3:
def jdbc(url: String, table: String, columnName: String, lowerBound: Long, upperBound: Long, numPartitions: Int, connectionProperties: Properties): DataFrame
Construct a DataFrame representing the database table accessible via JDBC URL url named table. Partitions of the table will be retrieved in parallel based on the parameters passed to this function.
Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems.
url: JDBC database url of the form jdbc:subprotocol:subname.
table: Name of the table in the external database.
columnName: the name of a column of numeric, date, or timestamp type that will be used for partitioning.
lowerBound: the minimum value of columnName used to decide partition stride.
upperBound: the maximum value of columnName used to decide partition stride.
numPartitions: the number of partitions. This, along with lowerBound (inclusive), upperBound (exclusive), form partition strides for generated WHERE clause expressions used to split the column columnName evenly. When the input is less than 1, the number is set to 1.
connectionProperties: JDBC database connection arguments, a list of arbitrary string tag/value. Normally at least a "user" and "password" property should be included. "fetchsize" can be used to control the number of rows per fetch and "queryTimeout" can be used to wait for a Statement object to execute to the given number of seconds.
Since: 1.4.0
Partitioning by a numeric (Long) column.
Example:
val url = "jdbc:mysql://mysqlHost:3306/database"
val tableName = "table"
val columnName = "colName"
val lowerBound = 1L
val upperBound = 10000000L
val numPartitions = 10
// set the connection user & password
val prop = new java.util.Properties
prop.setProperty("user", "username")
prop.setProperty("password", "pwd")
// read the table
val jdbcDF = spark.read.jdbc(url, tableName, columnName, lowerBound, upperBound, numPartitions, prop)
// some operations
jdbcDF.rdd.partitions.size // returns 10
Drawbacks:
This call spreads the rows whose colName values fall between 1 and 10000000 across 10 partitions. It is very convenient to use, but the limitation is just as obvious: in this overload only a numeric (integer-like) column can serve as the partition key.
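Under the hood, Spark turns lowerBound, upperBound and numPartitions into per-partition WHERE clauses on colName with a stride of roughly (upperBound - lowerBound) / numPartitions; rows outside the bounds are not filtered out, they simply all land in the first or last partition, which can skew the load. A quick diagnostic sketch to check whether the split is balanced (spark_partition_id is a standard function in org.apache.spark.sql.functions):
import org.apache.spark.sql.functions.spark_partition_id
// Count rows per partition to see whether the chosen bounds and
// numPartitions spread the colName values reasonably evenly.
jdbcDF
  .groupBy(spark_partition_id().alias("partition_id"))
  .count()
  .orderBy("partition_id")
  .show(numPartitions)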