Notes on a Spark Streaming Performance Optimization

fct2001140269

Article link: https://blog.csdn.net/fct2001140269/article/details/98215427

// Knowledge, even the mere shadow of knowledge, becomes your armor and protects you from being devoured by ignorance.

1. Result after optimization:

The execution time of a single job dropped from about 3 min to about 40 s, which is more than 4x faster.

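2. Root cause analysis and fix

Before optimization, the original Spark Streaming program ran very slowly. The causes were:

(1) The main problem: calling repartition adds an extra stage, and the shuffle data for that stage has to be written to disk, which is inefficient;

(2) For reduce operations such as reduceByKey, the number of post-shuffle partitions was not configured at either the cluster or the application level. As a result, reduceByKey produced only a small number of partitions, the job could not take advantage of multi-core parallelism, and processing slowed down.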

The details are shown in the figures below.

Figure 1-0: DAG before optimization

Figure 1-1: Task parallelism in the DAG before optimization

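Note: in the original job, stage 1001 and stage 1002 ran with only 2 tasks and 11 tasks respectively, far fewer than the 16 configured cores, which left 14 and 5 cores idle respectively.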
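
The idle-core figures in the note above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `stage_utilization` is hypothetical, and it assumes each task occupies exactly one core (Spark's default of `spark.task.cpus=1`).

```python
def stage_utilization(total_cores, num_tasks):
    """Return (busy_cores, idle_cores) for one stage.

    Assumes one core per task, so a stage can never keep more cores
    busy than it has tasks.
    """
    busy = min(total_cores, num_tasks)
    return busy, total_cores - busy


TOTAL_CORES = 16  # cores configured for the application in the article

# Task counts per stage as reported in the note above.
for stage, tasks in {"stage 1001": 2, "stage 1002": 11}.items():
    busy, idle = stage_utilization(TOTAL_CORES, tasks)
    print(f"{stage}: {busy} busy, {idle} idle of {TOTAL_CORES} cores")
```

Once a stage has at least as many tasks as there are cores, the idle count in this model drops to zero, which is the goal of raising the partition count.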

Figure 2-0: DAG after optimization

Figure 2-1: Task parallelism in the DAG after optimization

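Note: (1) The optimization removed the unnecessary repartition operator, which joins the two middle stages into one (benefit: no shuffle write to disk between them, and the work stays in memory, which is more efficient); (2) at the same time, spark.default.parallelism was set to 40, so that the number of partitions after a shuffle is large enough to keep all cores busy, which increases parallelism.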
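
The article does not show its code, so the following PySpark sketch is only an assumption about what the two changes could look like. The names `SPARK_SETTINGS`, `build_context` and `count_by_key` are hypothetical. `pyspark` is imported lazily inside `build_context` so the module can be loaded without Spark installed.

```python
NUM_PARTITIONS = 40  # value the article uses for spark.default.parallelism

# Settings to apply to the application (hypothetical container name).
SPARK_SETTINGS = {
    "spark.default.parallelism": str(NUM_PARTITIONS),
}


def build_context(app_name="streaming-tuning-sketch"):
    """Create a SparkContext with the settings above (requires pyspark)."""
    from pyspark import SparkConf, SparkContext  # lazy import

    conf = SparkConf().setAppName(app_name)
    for key, value in SPARK_SETTINGS.items():
        conf.set(key, value)
    return SparkContext(conf=conf)


def count_by_key(pair_rdd, num_partitions=NUM_PARTITIONS):
    """Sum the values per key with an explicit post-shuffle partition count.

    There is deliberately no repartition() call before reduceByKey:
    reduceByKey already shuffles, so an extra repartition would only
    add another stage and another shuffle write.
    """
    return pair_rdd.reduceByKey(lambda a, b: a + b, numPartitions=num_partitions)
```

Passing `numPartitions` to `reduceByKey` controls a single operation, while `spark.default.parallelism` changes the default for the whole application. Which of the two the original job needed is not stated in the article.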

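Final comparison before and after optimization

Figure 3-0: Job execution time and delay before optimization

Figure 3-1: Job execution time and delay after optimization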

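Summary: the execution time of a single job dropped from about 3 min to about 40 s, which is more than 4x faster.

Note: the Spark batch interval is a 1 min time window, and the data volume is 100 records per minute. The difference in execution time is clear: before optimization a job took a little over 3 min, and after optimization it takes about 40 s.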
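
The numbers in this note matter for more than raw speed: a streaming job stays stable only if each batch finishes within the batch interval. The sketch below uses a deliberately simplified model (batches processed one at a time, constant processing time). The function names are hypothetical, and 180 s stands in for "a little over 3 min".

```python
def speedup(before_s, after_s):
    """Ratio of execution time before and after optimization."""
    return before_s / after_s


def scheduling_delay(batch_interval_s, processing_s, num_batches):
    """Delay accumulated after num_batches batches in the simplified model.

    Each batch that takes longer than the interval pushes every later
    batch back by the difference. If processing fits in the interval,
    no delay builds up.
    """
    backlog_per_batch = max(0, processing_s - batch_interval_s)
    return backlog_per_batch * num_batches


BATCH_INTERVAL_S = 60  # 1 min batch interval from the article

print(speedup(180, 40))                             # ratio of 3 min to 40 s
print(scheduling_delay(BATCH_INTERVAL_S, 180, 10))  # before: delay keeps growing
print(scheduling_delay(BATCH_INTERVAL_S, 40, 10))   # after: no backlog
```

Under this model, the job before optimization falls further behind with every batch, while the optimized job keeps up with the 1 min interval.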

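3. The six figures above summarize the effect of the optimization. The takeaway:

The spark.default.parallelism setting needs to be configured for the cluster. It determines the number of partitions after a shuffle. If it is not set, the number of partitions depends on the environment (differences in operating system and scheduling system), so it varies from one deployment to another.

Below is the official documentation's description of spark.default.parallelism:

from: http://spark.apache.org/docs/latest/configuration.html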

Property Name: spark.default.parallelism
Default: For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:
    Local mode: number of cores on the local machine
    Mesos fine grained mode: 8
    Others: total number of cores on all executor nodes or 2, whichever is larger
Meaning: Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
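
The default rule quoted above can be written out as a small model. This is only a paraphrase of the documented table in Python, not Spark's actual implementation. The function names and the cluster-manager strings are hypothetical.

```python
def default_parallelize_partitions(cluster_manager, local_cores=0,
                                   total_executor_cores=0):
    """Documented default for operations with no parent RDD (e.g. parallelize)."""
    if cluster_manager == "local":
        return local_cores
    if cluster_manager == "mesos-fine-grained":
        return 8
    # All other cluster managers: total executor cores, but at least 2.
    return max(total_executor_cores, 2)


def default_shuffle_partitions(parent_partition_counts, configured=None):
    """Documented default for shuffle operations such as reduceByKey and join.

    If spark.default.parallelism is configured, this model uses it;
    otherwise it uses the largest partition count among the parent RDDs.
    """
    if configured is not None:
        return configured
    return max(parent_partition_counts)


# A parent RDD with only 2 partitions yields only 2 reduce tasks by default.
print(default_shuffle_partitions([2]))
# With spark.default.parallelism=40 the shuffle produces 40 partitions.
print(default_shuffle_partitions([2], configured=40))
```

This is consistent with the symptom described in the article: with no setting, the post-shuffle partition count follows the parent RDD, which can be much smaller than the number of cores.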

from:http://spark.apache.org/docs/latest/tuning.html

Level of Parallelism

Clusters will not be fully utilized unless you set the level of parallelism for each operation high enough. Spark automatically sets the number of “map” tasks to run on each file according to its size (though you can control it through optional parameters to SparkContext.textFile, etc), and for distributed “reduce” operations, such as groupByKey and reduceByKey, it uses the largest parent RDD’s number of partitions. You can pass the level of parallelism as a second argument (see the spark.PairRDDFunctions documentation), or set the config property spark.default.parallelism to change the default. In general, we recommend 2-3 tasks per CPU core in your cluster.
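
As a quick check of the "2-3 tasks per CPU core" guideline against this article's setup, the sketch below computes the suggested range for 16 cores. The function name `recommended_task_range` is hypothetical.

```python
def recommended_task_range(total_cores, low_per_core=2, high_per_core=3):
    """Return the (min, max) task count suggested by the 2-3 tasks per core rule."""
    return total_cores * low_per_core, total_cores * high_per_core


TOTAL_CORES = 16          # cores configured in the article
CHOSEN_PARALLELISM = 40   # spark.default.parallelism used in the article

low, high = recommended_task_range(TOTAL_CORES)
print(f"suggested range: {low} to {high} tasks")
print("chosen value within range:", low <= CHOSEN_PARALLELISM <= high)
```

For 16 cores the guideline gives 32 to 48 tasks, so the value of 40 chosen in the article sits inside the suggested range.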
