Background
Spark reads data from a MySQL table containing roughly five million rows.
Code:
String url = "jdbc:mysql://localhost:3306/demo?useSSL=true&characterEncoding=utf-8&serverTimezone=UTC";
// Name of the table to read
String table = "test";
// Connection properties: the MySQL user, password, and JDBC driver class
Properties connectionProperties = new Properties();
connectionProperties.put("user", "***");
connectionProperties.put("password", "***");
connectionProperties.put("driver", "com.mysql.cj.jdbc.Driver");
// Read the whole table in one go (session is an existing SparkSession)
Dataset<Row> datasetRdd = session.read().jdbc(url, table, connectionProperties);
Errors:
1. java.lang.OutOfMemoryError: GC overhead limit exceeded
2. Container exited with a non-zero exit code 143.
Killed by external signal
Root-cause analysis:
Because no partitioning options are set, this read runs with a parallelism of 1: Spark issues a single JDBC query and pulls the entire table into one partition. No matter how many resources you allocate, only one task does the work, so throughput is poor, and on a table of any real size that one executor quickly exhausts its heap. The GC overhead limit exceeded error is the JVM giving up after spending nearly all its time in garbage collection, and exit code 143 means the container received SIGTERM (143 = 128 + 15), typically because the cluster manager killed it for exceeding its memory limit.
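You can confirm the parallelism by asking the resulting Dataset how many partitions it has; with the unpartitioned read above it reports 1:
// Prints "partitions = 1" for the unpartitioned read: every row lands in a single task.
System.out.println("partitions = " + datasetRdd.rdd().getNumPartitions());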
Solution:
// Count the rows first so we have an upper bound for the partition range
Dataset<Row> count = session.read().jdbc(url, "(select count(*) as allnum from test) as t1", connectionProperties);
long allcount = count.head().getAs("allnum");
// Partitioned read on the numeric column "id": 10 partitions covering ids 1..allcount
Dataset<Row> datasetRdd = session.read().jdbc(url, table, "id", 1, allcount, 10, connectionProperties);
The extra arguments name the partition column ("id"), its lower and upper bounds, and the number of partitions; Spark turns them into 10 parallel queries, each fetching one stride of the id range. The bounds do not filter rows (the first and last partitions pick up anything outside them), they only determine the stride, so using the row count as the upper bound works well only if ids start near 1 and are roughly contiguous.
Note: the derived table (select count(*) as allnum from test) above must be given an alias (as t1); without it MySQL rejects the query with: java.sql.SQLSyntaxErrorException: Every derived table must have its own alias
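For reference, here is a self-contained sketch of the same fix written against the option-based DataFrameReader API, using min(id)/max(id) as the bounds instead of the row count, which is more robust when ids do not start at 1 or have gaps. The URL, credentials, table, and column names are assumptions carried over from the snippets above, not part of the original post:

import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PartitionedJdbcRead {
    public static void main(String[] args) {
        SparkSession session = SparkSession.builder()
                .appName("PartitionedJdbcRead")
                .getOrCreate();

        // Placeholder URL and credentials, as in the snippets above
        String url = "jdbc:mysql://localhost:3306/demo?useSSL=true&characterEncoding=utf-8&serverTimezone=UTC";
        Properties props = new Properties();
        props.put("user", "***");
        props.put("password", "***");
        props.put("driver", "com.mysql.cj.jdbc.Driver");

        // Ask MySQL for the actual id range instead of assuming ids run from 1 to count(*)
        Row bounds = session.read()
                .jdbc(url, "(select min(id) as minid, max(id) as maxid from test) as t1", props)
                .head();
        // Go through Number so this works whether the column maps to Integer or Long
        long minId = ((Number) bounds.getAs("minid")).longValue();
        long maxId = ((Number) bounds.getAs("maxid")).longValue();

        // The same partitioned read, expressed through reader options:
        // Spark splits [minId, maxId] into 10 strides and runs one query per stride
        Dataset<Row> df = session.read()
                .format("jdbc")
                .option("url", url)
                .option("dbtable", "test")
                .option("user", "***")
                .option("password", "***")
                .option("driver", "com.mysql.cj.jdbc.Driver")
                .option("partitionColumn", "id")
                .option("lowerBound", minId)
                .option("upperBound", maxId)
                .option("numPartitions", 10)
                .load();

        System.out.println("partitions = " + df.rdd().getNumPartitions()); // now 10, not 1
        session.stop();
    }
}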