1. Configuration
# MySQL database configuration
mysql.driver=com.mysql.jdbc.Driver
mysql.url=jdbc:mysql://localhost:3306/test?serverTimezone=UTC
mysql.user=******
mysql.password=******
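Depending on the Connector/J version, the JDBC URL often needs a few more parameters than `serverTimezone`. A sketch with commonly used options (the exact set is environment-dependent; these values are illustrative for a local test instance):

```properties
# Optional URL parameters (illustrative): suppress SSL warnings on a local
# test server, force UTF-8, and batch INSERT statements for faster writes
mysql.url=jdbc:mysql://localhost:3306/test?serverTimezone=UTC&useSSL=false&characterEncoding=utf8&rewriteBatchedStatements=true
```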
2. Example requirement
1. Sample data in the MySQL score table
mysql> select * from score;
+-------+-------------+---------------+-------------+---------------+
| st_id | st_name | chinese_score | match_score | english_score |
+-------+-------------+---------------+-------------+---------------+
| 1 | SunHui | 84 | 91 | 71 |
| 2 | ZhouWen | 91 | 74.5 | 94 |
| 4 | ZhengShuang | 86 | 73 | 79 |
| 5 | LiuHua | 89 | 86 | 95 |
| 6 | ZhaoSi | 79 | 88 | 81 |
| 7 | ZhangSan | 85 | 96 | 92.5 |
| 8 | WeiZai | 76 | 79 | 82 |
| 9 | WuFei | 77 | 89 | 92 |
| 10 | ChenYe | 89 | 76.5 | 84 |
| 11 | LiSi | 81 | 90 | 93 |
| 12 | LiuChang | 76 | 92 | 97 |
| 13 | WuHao | 86 | 92 | 97 |
| 14 | HuangXin | 87 | 76 | 98 |
| 15 | HuKai | 73 | 89 | 90 |
| 16 | LiuBin | 77 | 79 | 94 |
| 17 | XiaLe | 76 | 89 | 84 |
| 18 | WangGang | 76 | 94 | 87 |
+-------+-------------+---------------+-------------+---------------+
2. Requirement: compute each student's total across the three subjects and write it to the total_score table
3. Complete implementation
package apply

import java.io.FileInputStream
import java.sql.{Connection, DriverManager}
import java.util.Properties

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

/**
 * @Program: apply
 * @Author: H.w
 * @Date: 2020/10/16
 * @description: Reading and writing MySQL from Spark
 */
object OperateDataFrameToMySQL {
  // Lower the console log level to cut down on Spark's log noise
  Logger.getLogger("org").setLevel(Level.ERROR)

  val spark = SparkSession
    .builder()
    .appName("OperateDataFrameToMySQL")
    .master("local[*]")
    .getOrCreate()

  /**
   * Read a MySQL table into a DataFrame
   * @param spark     active SparkSession
   * @param tableName name of the table to read
   * @param proPath   path to the properties file
   * @return DataFrame backed by the JDBC source
   */
  def readMySQLTable(spark: SparkSession, tableName: String, proPath: String): DataFrame = {
    val properties = getProperties(proPath)
    spark.read
      .format("jdbc")
      .option("url", properties.getProperty("mysql.url"))
      .option("driver", properties.getProperty("mysql.driver"))
      .option("user", properties.getProperty("mysql.user"))
      .option("password", properties.getProperty("mysql.password"))
      .option("dbtable", tableName)
      .load()
  }

  /**
   * Save a DataFrame as a MySQL table
   * @param dataFrame data to write
   * @param tableName name of the target table
   * @param saveMode  save mode (Append or Overwrite)
   * @param proPath   path to the properties file
   */
  def saveAsMysqlTable(dataFrame: DataFrame, tableName: String, saveMode: SaveMode, proPath: String): Unit = {
    val properties = getProperties(proPath)
    // The keys in the properties file differ from the keys Spark's JDBC
    // writer expects, so build a new Properties object with Spark's names
    val prop = new Properties()
    prop.setProperty("url", properties.getProperty("mysql.url"))
    prop.setProperty("driver", properties.getProperty("mysql.driver"))
    prop.setProperty("user", properties.getProperty("mysql.user"))
    prop.setProperty("password", properties.getProperty("mysql.password"))
    // When saveMode is Overwrite, the table must be emptied before writing.
    // There are two ways to do that: TRUNCATE (keeps the schema) or
    // DROP plus CREATE (rebuilds it). TRUNCATE is used here so the
    // existing column types are preserved, then the data is appended.
    if (saveMode == SaveMode.Overwrite) {
      var conn: Connection = null
      try {
        conn = DriverManager.getConnection(
          prop.getProperty("url"),
          prop.getProperty("user"),
          prop.getProperty("password")
        )
        val stmt = conn.createStatement()
        stmt.execute(s"TRUNCATE TABLE $tableName")
        stmt.close()
      } catch {
        case e: Exception =>
          println("MySQL Error:")
          e.printStackTrace()
      } finally {
        if (conn != null) conn.close()
      }
    }
    dataFrame.write.mode(SaveMode.Append).jdbc(prop.getProperty("url"), tableName, prop)
  }

  /**
   * Load connection settings from the properties file
   * @param proPath path to the properties file
   * @return loaded Properties
   */
  def getProperties(proPath: String): Properties = {
    val properties = new Properties()
    properties.load(new FileInputStream(proPath))
    properties
  }

  def main(args: Array[String]): Unit = {
    val proPath = "src/main/resources/job.properties"
    val scoreDF: DataFrame = readMySQLTable(spark, "score", proPath)
    scoreDF.createOrReplaceTempView("score")
    val totalScoreDF = spark.sql(
      "SELECT st_id, st_name, chinese_score + match_score + english_score AS total_score FROM score")
    saveAsMysqlTable(totalScoreDF, "total_score", SaveMode.Append, proPath)
  }
}
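The SQL step in main can equivalently be written with the DataFrame API, which avoids the temp view and gives compile-time checking of the method chain. A sketch assuming the same scoreDF as above:

```scala
import org.apache.spark.sql.functions.col

// Equivalent to the SQL query: sum the three subject columns per student
val totalScoreDF = scoreDF.select(
  col("st_id"),
  col("st_name"),
  (col("chinese_score") + col("match_score") + col("english_score")).as("total_score")
)
```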
4. Output
mysql> select * from total_score;
+-------+-------------+-------------+
| st_id | st_name | total_score |
+-------+-------------+-------------+
| 1 | SunHui | 246 |
| 2 | ZhouWen | 259.5 |
| 4 | ZhengShuang | 238 |
| 5 | LiuHua | 270 |
| 6 | ZhaoSi | 248 |
| 7 | ZhangSan | 273.5 |
| 8 | WeiZai | 237 |
| 9 | WuFei | 258 |
| 10 | ChenYe | 249.5 |
| 11 | LiSi | 264 |
| 12 | LiuChang | 265 |
| 13 | WuHao | 275 |
| 14 | HuangXin | 261 |
| 15 | HuKai | 252 |
| 16 | LiuBin | 250 |
| 17 | XiaLe | 249 |
| 18 | WangGang | 257 |
+-------+-------------+-------------+
5. Summary
1. It is best to create the target table in advance; otherwise Spark creates it with default column types, and the TEXT type it picks for strings wastes resources
mysql> desc score;
+---------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------------+-------------+------+-----+---------+-------+
| st_id | int(11) | NO | PRI | NULL | |
| st_name | varchar(20) | YES | | NULL | |
| chinese_score | double | YES | | NULL | |
| match_score | double | YES | | NULL | |
| english_score | double | YES | | NULL | |
+---------------+-------------+------+-----+---------+-------+
5 rows in set (0.07 sec)
mysql> desc total_score;
+-------------+---------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+---------+------+-----+---------+-------+
| st_id | int(11) | YES | | NULL | |
| st_name | text | YES | | NULL | |
| total_score | double | YES | | NULL | |
+-------------+---------+------+-----+---------+-------+
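Following the point above, the target table can be created up front so that st_name gets a bounded VARCHAR instead of TEXT. One possible DDL (the column sizes are illustrative; match them to your data):

```sql
CREATE TABLE total_score (
  st_id       INT,
  st_name     VARCHAR(20),
  total_score DOUBLE
);
```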
2. The built-in writer saves data partition by partition, opening one MySQL connection per partition, so the simplest optimization is to repartition the DataFrame before saving. Watch out for data skew, though, or the repartition may not actually improve throughput.
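The repartition-before-save idea can be sketched as follows (the partition count 4 is illustrative; tune it to your cluster size and MySQL's connection limits):

```scala
// Balance the data across a small, fixed number of partitions before
// writing, so only that many MySQL connections are opened in parallel
val repartitionedDF = totalScoreDF.repartition(4)
saveAsMysqlTable(repartitionedDF, "total_score", SaveMode.Append, proPath)
```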