Download the MySQL JDBC driver (Connector/J) from:
https://dev.mysql.com/downloads/connector/j/
Using JDBC in Spark
1. Add the driver jar in spark-env.sh:
export SPARK_CLASSPATH=/path/mysql-connector-java-5.1.42.jar
2. Or pass the jar when submitting a job:
--jars /path/mysql-connector-java-5.1.42.jar
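A full submit command might then look like the following sketch (the application class and jar names are placeholders, not from the original):

spark-submit --class com.example.MyApp --jars /path/mysql-connector-java-5.1.42.jar my-app.jar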
Connecting to MySQL from the Spark shell:
spark-shell --jars "/path/mysql-connector-java-5.1.42.jar"
Tables from a remote database can be loaded as a DataFrame or Spark SQL temporary view using the Data Sources API. Users specify the JDBC connection properties in the data source options; user and password are normally provided this way for logging in to the data source. In addition to the connection properties, Spark supports the following case-insensitive options:
JDBC connection properties
Property names and meanings
url: The JDBC URL to connect to, e.g. jdbc:mysql://ip:3306
dbtable: The JDBC table to read. A subquery in parentheses can be used instead of a full table.
driver: The class name of the JDBC driver used to connect to this URL, e.g. com.mysql.jdbc.Driver
partitionColumn, lowerBound, upperBound, numPartitions: These options apply only to reads, and all of them must be specified together. They describe how to partition the table when reading it in parallel from multiple workers. partitionColumn must be a numeric column of the table. lowerBound and upperBound are used only to decide the partition stride, not to filter rows: all rows of the table are partitioned and returned.
fetchsize: Applies only to reads. The JDBC fetch size, which determines how many rows are fetched per round trip. This can help tune performance for JDBC drivers that default to a low fetch size (e.g. Oracle fetches 10 rows at a time).
batchsize: Applies only to writes. The JDBC batch size, which determines how many rows are inserted per round trip. This can help tune JDBC driver performance. Defaults to 1000.
isolationLevel: Applies only to writes. The transaction isolation level for the current connection. It can be NONE, READ_COMMITTED, READ_UNCOMMITTED, REPEATABLE_READ, or SERIALIZABLE, corresponding to the standard isolation levels defined by JDBC's Connection object. Defaults to READ_UNCOMMITTED. See the java.sql.Connection documentation.
truncate: Applies only to writes. When SaveMode.Overwrite is enabled, this option causes Spark to truncate the existing table in MySQL instead of dropping and recreating it. This can be more efficient and prevents the table metadata (e.g. indexes) from being removed. However, it does not work in some cases, such as when the new data has a different schema. Defaults to false.
createTableOptions: Applies only to writes. Allows database-specific table and partition options to be set when creating a table, e.g. CREATE TABLE t (name string) ENGINE=InnoDB.
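These options can also be combined to expose a JDBC table as the Spark SQL temporary view mentioned above. A minimal sketch (the view name user_test_view is arbitrary; the host and credentials are the same placeholders used in the examples below):

spark.sql("""
CREATE TEMPORARY VIEW user_test_view
USING org.apache.spark.sql.jdbc
OPTIONS (
  driver "com.mysql.jdbc.Driver",
  url "jdbc:mysql://ip:3306",
  dbtable "db.user_test",
  user "test",
  password "123456"
)
""")
spark.sql("SELECT * FROM user_test_view").show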
spark jdbc read MySQL
// Read using individual .option(...) calls
val jdbcDF11 = spark.read.format("jdbc")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("url", "jdbc:mysql://ip:3306")
  .option("dbtable", "db.user_test")
  .option("user", "test")
  .option("password", "123456")
  .option("fetchsize", "3")
  .load()

jdbcDF11.show

// The same read expressed as an options Map
val jdbcDF12 = spark.read.format("jdbc").options(
  Map(
    "driver" -> "com.mysql.jdbc.Driver",
    "url" -> "jdbc:mysql://ip:3306",
    "dbtable" -> "db.user_test",
    "user" -> "test",
    "password" -> "123456",
    "fetchsize" -> "3")).load()

jdbcDF12.show
jdbc(url: String, table: String, properties: Properties): DataFrame
//-----------------------------------
import java.util.Properties

// jdbc(url: String, table: String, properties: Properties): DataFrame
val readConnProperties1 = new Properties()
readConnProperties1.put("driver", "com.mysql.jdbc.Driver")
readConnProperties1.put("user", "test")
readConnProperties1.put("password", "123456")
readConnProperties1.put("fetchsize", "3")

val jdbcDF1 = spark.read.jdbc(
  "jdbc:mysql://ip:3306",
  "db.user_test",
  readConnProperties1)

jdbcDF1.show
+---+------+---+
|uid|gender|age|
+---+------+---+
|  2|     2| 20|
|  3|     1| 30|
|  4|     2| 40|
|  5|     1| 50|
|  6|     2| 60|
|  7|     1| 25|
|  8|     2| 35|
|  9|     1| 70|
| 10|     2| 80|
|  1|     1| 18|
+---+------+---+
// Default parallelism is 1
jdbcDF1.rdd.partitions.size
Int = 1
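The loaded DataFrame can also be registered as a temporary view and queried with SQL; a minimal sketch (the view name is arbitrary):

// Register the JDBC-backed DataFrame and query it with Spark SQL
jdbcDF1.createOrReplaceTempView("user_test")
spark.sql("SELECT gender, count(*) AS cnt FROM user_test GROUP BY gender").show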
//-------------------------
// jdbc(url: String, table: String, properties: Properties): DataFrame
val readConnProperties4 = new Properties()
readConnProperties4.put("driver", "com.mysql.jdbc.Driver")
readConnProperties4.put("user", "test")
readConnProperties4.put("password", "123456")
readConnProperties4.put("fetchsize", "3")

val jdbcDF4 = spark.read.jdbc(
  "jdbc:mysql://ip:3306",
  // The parentheses and table alias are required; the subquery filters data on the database side
  "(select * from db.user_test where gender=1) t",
  readConnProperties4)

jdbcDF4.show
+---+------+---+
|uid|gender|age|
+---+------+---+
|  3|     1| 30|
|  5|     1| 50|
|  7|     1| 25|
|  9|     1| 70|
|  1|     1| 18|
+---+------+---+
jdbc(url: String, table: String,
columnName: String, lowerBound: Long, upperBound: Long, numPartitions: Int,
connectionProperties: Properties): DataFrame
import java.util.Properties

val readConnProperties2 = new Properties()
readConnProperties2.put("driver", "com.mysql.jdbc.Driver")
readConnProperties2.put("user", "test")
readConnProperties2.put("password", "123456")
readConnProperties2.put("fetchsize", "2")

val columnName = "uid"
val lowerBound = 1
val upperBound = 6
val numPartitions = 3

val jdbcDF2 = spark.read.jdbc(
  "jdbc:mysql://ip:3306",
  "db.user_test",
  columnName,
  lowerBound,
  upperBound,
  numPartitions,
  readConnProperties2)

jdbcDF2.show
+---+------+---+
|uid|gender|age|
+---+------+---+
|  2|     2| 20|
|  1|     1| 18|
|  3|     1| 30|
|  4|     2| 40|
|  5|     1| 50|
|  6|     2| 60|
|  7|     1| 25|
|  8|     2| 35|
|  9|     1| 70|
| 10|     2| 80|
+---+------+---+
// Parallelism is 3, matching numPartitions
jdbcDF2.rdd.partitions.size
Int = 3
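For intuition, Spark turns columnName, lowerBound, upperBound, and numPartitions into one range predicate per partition. A rough sketch of the clauses generated for the read above (illustrative only; the exact form can vary by Spark version):

partition 0: WHERE uid < 3 OR uid IS NULL
partition 1: WHERE uid >= 3 AND uid < 5
partition 2: WHERE uid >= 5

Rows with uid outside [1, 6] still fall into the first or last partition, which is why lowerBound and upperBound never filter rows.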
jdbc(url: String, table: String, predicates: Array[String], connectionProperties: Properties): DataFrame
predicates: Condition in the WHERE clause for each partition.
import java.util.Properties

val readConnProperties3 = new Properties()
readConnProperties3.put("driver", "com.mysql.jdbc.Driver")
readConnProperties3.put("user", "test")
readConnProperties3.put("password", "123456")
readConnProperties3.put("fetchsize", "2")

val arr = Array(
  (1, 50),
  (2, 60))

// These conditions both split the data into partitions (one per array element)
// and filter it
val predicates = arr.map { case (gender, age) =>
  s" gender = $gender " + s" AND age < $age "
}

// An alternative set of predicates partitioning by date range
// (defined here for reference, not used below)
val predicates1 = Array(
  "2017-05-01" -> "2017-05-20",
  "2017-06-01" -> "2017-06-05"
).map { case (start, end) =>
  s"cast(create_time as date) >= date '$start' " +
  s"AND cast(create_time as date) <= date '$end'"
}

val jdbcDF3 = spark.read.jdbc(
  "jdbc:mysql://ip:3306",
  "db.user_test",
  predicates,
  readConnProperties3)

jdbcDF3.show
+---+------+---+
|uid|gender|age|
+---+------+---+
|  3|     1| 30|
|  7|     1| 25|
|  1|     1| 18|
|  2|     2| 20|
|  4|     2| 40|
|  8|     2| 35|
+---+------+---+
// Parallelism is 2, matching the number of elements in arr
jdbcDF3.rdd.partitions.size
Int = 2
spark jdbc write MySQL
// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._

val dataList: List[(Double, String, Double, Double, String, Double, Double, Double, Double)] = List(
  (0, "male", 37, 10, "no", 3, 18, 7, 4),
  (0, "female", 27, 4, "no", 4, 14, 6, 4),
  (0, "female", 32, 15, "yes", 1, 12, 1, 4),
  (0, "male", 57, 15, "yes", 5, 18, 6, 5),
  (0, "male", 22, 0.75, "no", 2, 17, 6, 3),
  (0, "female", 32, 1.5, "no", 2, 17, 5, 5),
  (0, "female", 22, 0.75, "no", 2, 12, 1, 3),
  (0, "male", 57, 15, "yes", 2, 14, 4, 4),
  (0, "female", 32, 15, "yes", 4, 16, 1, 2))

val colArray: Array[String] = Array(
  "affairs", "gender", "age", "yearsmarried", "children",
  "religiousness", "education", "occupation", "rating")

val df = dataList.toDF(colArray: _*)

df.write.mode("overwrite").format("jdbc").options(
  Map(
    "driver" -> "com.mysql.jdbc.Driver",
    "url" -> "jdbc:mysql://ip:3306",
    "dbtable" -> "db.affairs",
    "user" -> "test",
    "password" -> "123456",
    "batchsize" -> "1000",
    "truncate" -> "true")).save()
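Writes can equivalently go through DataFrameWriter.jdbc(url: String, table: String, connectionProperties: Properties). A minimal sketch reusing df from above (writeConnProperties is an illustrative name, not from the original):

import java.util.Properties

val writeConnProperties = new Properties()
writeConnProperties.put("driver", "com.mysql.jdbc.Driver")
writeConnProperties.put("user", "test")
writeConnProperties.put("password", "123456")
writeConnProperties.put("batchsize", "1000")

// Append instead of overwrite to keep existing rows
df.write.mode("append").jdbc("jdbc:mysql://ip:3306", "db.affairs", writeConnProperties)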