Background
Spark reads data from a MySQL table containing roughly five million rows.
Code:
String url = "jdbc:mysql://localhost:3306/demo?useSSL=true&characterEncoding=utf-8&serverTimezone=UTC";
// Name of the table to read
String table = "test";
// Connection properties: the MySQL user, password, and JDBC driver class
Properties connectionProperties = new Properties();
connectionProperties.put("user", "***");
connectionProperties.put("password", "***");
connectionProperties.put("driver", "com.mysql.cj.jdbc.Driver");
// Read the whole table in one go (session is an existing SparkSession)
Dataset<Row> datasetRdd = session.read().jdbc(url, table, connectionProperties);
Errors:
1. java.lang.OutOfMemoryError: GC overhead limit exceeded
2. Container exited with a non-zero exit code 143.
Killed by external signal
Root-cause analysis:
Because no partitioning options are set, this read runs with a parallelism of 1: Spark issues a single JDBC query and pulls the entire table into one partition. No matter how many resources you allocate, only one task does the work, so throughput is poor, and on a table of any real size that one executor quickly exhausts its heap. The GC overhead limit exceeded error is the JVM giving up after spending nearly all its time in garbage collection, and exit code 143 means the container received SIGTERM (143 = 128 + 15), typically because the cluster manager killed it for exceeding its memory limit.
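You can confirm the parallelism by asking the resulting Dataset how many partitions it has; with the unpartitioned read above it reports 1:
// Prints "partitions = 1" for the unpartitioned read: every row lands in a single task.
System.out.println("partitions = " + datasetRdd.rdd().getNumPartitions());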
Solution:
// Count the rows first so we have an upper bound for the partition range
Dataset<Row> count = session.read().jdbc(url, "(select count(*) as allnum from test) as t1", connectionProperties);
long allcount = count.head().getAs("allnum");
// Partitioned read on the numeric column "id": 10 partitions covering ids 1..allcount
Dataset<Row> datasetRdd = session.read().jdbc(url, table, "id", 1, allcount, 10, connectionProperties);
The extra arguments name the partition column ("id"), its lower and upper bounds, and the number of partitions; Spark turns them into 10 parallel queries, each fetching one stride of the id range. The bounds do not filter rows (the first and last partitions pick up anything outside them), they only determine the stride, so using the row count as the upper bound works well only if ids start near 1 and are roughly contiguous.
Note: the derived table (select count(*) as allnum from test) above must be given an alias (as t1); without it MySQL rejects the query with: java.sql.SQLSyntaxErrorException: Every derived table must have its own alias
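For reference, here is a self-contained sketch of the same fix written against the option-based DataFrameReader API, using min(id)/max(id) as the bounds instead of the row count, which is more robust when ids do not start at 1 or have gaps. The URL, credentials, table, and column names are assumptions carried over from the snippets above, not part of the original post:

import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PartitionedJdbcRead {
    public static void main(String[] args) {
        SparkSession session = SparkSession.builder()
                .appName("PartitionedJdbcRead")
                .getOrCreate();

        // Placeholder URL and credentials, as in the snippets above
        String url = "jdbc:mysql://localhost:3306/demo?useSSL=true&characterEncoding=utf-8&serverTimezone=UTC";
        Properties props = new Properties();
        props.put("user", "***");
        props.put("password", "***");
        props.put("driver", "com.mysql.cj.jdbc.Driver");

        // Ask MySQL for the actual id range instead of assuming ids run from 1 to count(*)
        Row bounds = session.read()
                .jdbc(url, "(select min(id) as minid, max(id) as maxid from test) as t1", props)
                .head();
        // Go through Number so this works whether the column maps to Integer or Long
        long minId = ((Number) bounds.getAs("minid")).longValue();
        long maxId = ((Number) bounds.getAs("maxid")).longValue();

        // The same partitioned read, expressed through reader options:
        // Spark splits [minId, maxId] into 10 strides and runs one query per stride
        Dataset<Row> df = session.read()
                .format("jdbc")
                .option("url", url)
                .option("dbtable", "test")
                .option("user", "***")
                .option("password", "***")
                .option("driver", "com.mysql.cj.jdbc.Driver")
                .option("partitionColumn", "id")
                .option("lowerBound", minId)
                .option("upperBound", maxId)
                .option("numPartitions", 10)
                .load();

        System.out.println("partitions = " + df.rdd().getNumPartitions()); // now 10, not 1
        session.stop();
    }
}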