For the basics of reading MySQL from Spark, see:

The requirement: read one day's worth of data from a particular MySQL table.
String beginTime = day + " 00:00:00";
String endTime = day + " 23:59:59";

Dataset<Row> jdbcTable = spark.read().format("jdbc")
        .option("url", PropertiesConfig.URL)
        .option("dbtable", tableName)
        .option("user", PropertiesConfig.USERNAME)
        .option("password", PropertiesConfig.PASSWORD)
        .load()
        .filter("create_time between '" + beginTime + "' and '" + endTime + "'");
It turns out Spark loads every row of tableName into the cluster and only then applies the filter, which is very slow.
So the question became: can the filtering be pushed down to MySQL ahead of time? There is a way:
String tableName = "(select * from device_info where create_time between '"
        + beginTime + "' and '" + endTime + "') as device_time_filter";

Dataset<Row> jdbcTable = spark.read().format("jdbc")
        .option("url", PropertiesConfig.URL)
        .option("dbtable", tableName)
        .option("user", PropertiesConfig.USERNAME)
        .option("password", PropertiesConfig.PASSWORD)
        .load();
With an index on create_time in the MySQL table, this runs much faster.
The principle: MySQL filters device_info first and exposes only the matching rows as a derived table aliased device_time_filter, so Spark never has to load the full table.
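The subquery string passed to "dbtable" can be factored into a small helper so the date range is built in one place. This is a minimal sketch; the class and method names (DayFilterQuery, dayRangeSubquery) are my own, not from the original code:

```java
// Hypothetical helper: builds the derived-table expression that Spark
// passes to MySQL via the "dbtable" option, so MySQL applies the
// create_time range itself instead of Spark filtering after a full load.
public class DayFilterQuery {

    // day is expected in "yyyy-MM-dd" form, e.g. "2019-06-01".
    // The trailing alias is required because "dbtable" must be a table
    // or a parenthesized subquery with an alias.
    public static String dayRangeSubquery(String table, String day) {
        String beginTime = day + " 00:00:00";
        String endTime = day + " 23:59:59";
        return "(select * from " + table
                + " where create_time between '" + beginTime
                + "' and '" + endTime + "') as device_time_filter";
    }
}
```

Usage: pass the result straight to `.option("dbtable", DayFilterQuery.dayRangeSubquery("device_info", day))`. Note that day is concatenated into raw SQL here, as in the original post, so it should come from trusted input only.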