Optimizing slow Spark reads of MySQL table data
When using SparkSQL to pull MySQL data for analysis, you will often run into the following problems:
1. With a large data volume, if there is only one worker, one executor, and one task, you are fine as long as the executor has enough memory or you have enough workers; otherwise you will hit an OutOfMemoryError.
2. With fixed resources and a parallelism of one, processing is extremely slow. The fix is to use RDD partitions: split the job of fetching one large table into multiple tasks that run in parallel, each fetching only a small slice of the data. Pulling the data over multiple connections at once turns out to be much faster.
To increase parallelism when Spark reads MySQL, set the options highlighted in red in the figure below:
![在这里插入图片描述](https://img-blog.csdnimg.cn/2019102409580527.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl8zODE2MzMzMQ==,size_16,color_FFFFFF,t_70)
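The options shown in the screenshot can be sketched in PySpark roughly as follows. The connection URL, table name, credentials, and bound values here are placeholder assumptions for illustration; adjust them for your own MySQL instance:

```python
# Hypothetical JDBC connection details -- replace with your own.
jdbc_options = {
    "url": "jdbc:mysql://localhost:3306/testdb",
    "dbtable": "orders",
    "user": "root",
    "password": "secret",
    "partitionColumn": "id",  # numeric column used to split the read
    "lowerBound": "1",        # with upperBound, defines the stride only (no filtering)
    "upperBound": "1000000",
    "numPartitions": "10",    # max parallel tasks / concurrent JDBC connections
}

# With a live SparkSession, this would read the table in 10 parallel tasks:
# df = spark.read.format("jdbc").options(**jdbc_options).load()
```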
Parameter reference:
| Option | Description |
|---|---|
| partitionColumn, lowerBound, upperBound | These options must all be specified if any of them is specified. In addition, numPartitions must be specified. They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric column from the table in question. Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in the table. So all rows in the table will be partitioned and returned. This option applies only to reading. |
| numPartitions | The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing. |
partitionColumn: the partitioning column. It must be numeric (an int column works); the table's id is the usual choice.
lowerBound: the lower bound of the partition range. If it is, say, 10000, all rows with values below 10000 fall into the first partition's task.
upperBound: the upper bound of the partition range. Rows between lowerBound and upperBound are split across partitions, while rows outside the bounds fall into the first and last boundary partitions rather than being filtered out.
numPartitions: how many partitions to split the range between the bounds into.
Note: all four options must be specified together; otherwise Spark throws an error.
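To make the boundary behavior concrete, here is a small pure-Python sketch, modeled on (but not identical to) how Spark's JDBC reader splits the range into per-partition WHERE clauses. Note how the first and last predicates are open-ended, which is why rows outside [lowerBound, upperBound) are still read rather than filtered out:

```python
def partition_predicates(column, lower, upper, num_partitions):
    # Simplified sketch of Spark's JDBC range splitting; assumes
    # num_partitions >= 2 and integer bounds.
    stride = (upper - lower) // num_partitions
    current = lower + stride
    # First partition is open below, so rows under lowerBound are included.
    preds = [f"{column} < {current} OR {column} IS NULL"]
    for _ in range(num_partitions - 2):
        preds.append(f"{column} >= {current} AND {column} < {current + stride}")
        current += stride
    # Last partition is open above, so rows at or past upperBound are included.
    preds.append(f"{column} >= {current}")
    return preds

# Example: 4 partitions over ids 0..100, stride = 25
for pred in partition_predicates("id", 0, 100, 4):
    print(pred)
```

Each predicate becomes one task's query, so at most numPartitions JDBC connections hit MySQL concurrently.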