A Deeper Understanding of Spark Internals

Xeon-Shao

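Copyright notice: This is an original article by the author, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/sdujava2011/article/details/50603427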

This post summarizes the slides from Aaron Davidson's talk at Spark Summit 2014. The talk walks through optimizing a Spark program that counts the names starting with each letter.

I have lightly organized the slides and added some of my own understanding.

Goal: Understanding how Spark runs, focus on performance

• Major core components:

– Execution Model

– The Shuffle

– Caching

Why understand internals?

Goal: Find number of distinct names per “first letter”

sc.textFile("hdfs:/names")
  .map(name => (name.charAt(0), name))
  .groupByKey()
  .mapValues(names => names.toSet.size)
  .collect()

[Figure: how the elements of the RDD are transformed]

Spark Execution Model

1. Create DAG of RDDs to represent computation

2. Create logical execution plan for DAG

3. Schedule and execute individual tasks
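The three steps above can be observed on the example job. The following is a minimal sketch, assuming a local SparkContext and a small in-memory list standing in for hdfs:/names (both are illustration-only assumptions, as are the object and method names). `toDebugString` describes the RDD lineage without running a job, and the `ShuffledRDD` entry in its output marks where the scheduler has to split the work into two stages.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  // Builds the example job's RDD chain and returns its lineage description.
  // No task runs here: transformations are lazy until an action is called.
  def lineage(sc: SparkContext): String = {
    // Assumption: a tiny in-memory dataset replaces the HDFS file.
    val names = sc.parallelize(Seq("Ahir", "Pat", "Andy"), numSlices = 2)
    val counts = names
      .map(name => (name.charAt(0), name))
      .groupByKey()                       // introduces a shuffle dependency
      .mapValues(ns => ns.toSet.size)
    counts.toDebugString
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lineage-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(lineage(sc))
    finally sc.stop()
  }
}
```

Calling `collect()` on the final RDD is what would trigger step 3, the scheduling and execution of tasks.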

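One book's explanation of why a shuffle is needed struck me as very sensible: a shuffle is required because data that shares some common characteristic must ultimately be gathered on a single compute node to be processed.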

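I also used to hold a misconception: having read that Spark computes in memory, I assumed that Spark computations really happen purely in memory. I now see how naive that was, and there is a long road ahead. From reading related books together with these slides, I learned that since Spark 0.8, Shuffle Write persists data to disk. In other words, the shuffle still reads and writes disk and remains a performance bottleneck, which leaves me somewhat unsure about how Spark differs from Hadoop.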
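The idea that records sharing a key must end up on the same node can be sketched without Spark at all. The following is a simplified in-memory model, not Spark's implementation: `ShuffleSketch`, `bucketOf`, and `shuffle` are hypothetical names, and the real shuffle also writes its intermediate data to disk, as noted above. The bucket rule (a non-negative modulus of the key's hash code) is the same idea a hash partitioner uses.

```scala
object ShuffleSketch {
  // Assign a key to a reducer bucket: equal keys always get the same bucket.
  def bucketOf(key: Any, numReducers: Int): Int = {
    val mod = key.hashCode % numReducers
    if (mod < 0) mod + numReducers else mod
  }

  // Simulate one shuffle in memory. Each reducer r gathers, from every
  // map-side partition, the records whose key falls into bucket r.
  def shuffle[K, V](mapPartitions: Seq[Seq[(K, V)]], numReducers: Int): Vector[Seq[(K, V)]] =
    Vector.tabulate(numReducers) { r =>
      mapPartitions.flatMap { part =>
        part.filter { case (k, _) => bucketOf(k, numReducers) == r }
      }
    }

  def main(args: Array[String]): Unit = {
    val mapSide = Seq(
      Seq(('A', "Ahir"), ('P', "Pat")),
      Seq(('A', "Andy"))
    )
    shuffle(mapSide, numReducers = 2).zipWithIndex.foreach {
      case (records, r) => println(s"reducer $r: $records")
    }
  }
}
```

Every record crosses from its map-side partition to a reducer bucket, which is why shipping whole names (rather than small partial results) makes the shuffle expensive.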

What went wrong?

• Too few partitions to get good concurrency

• Large per-key groupBy()

• Shipped all data across the cluster

Common issue checklist

1. Ensure enough partitions for concurrency

2. Minimize memory consumption (esp. of sorting and large keys in groupBys)

3. Minimize amount of data shuffled

4. Know the standard library

1 & 2 are about tuning number of partitions!

Importance of Partition Tuning

• Main issue: too few partitions

– Less concurrency

– More susceptible to data skew

– Increased memory pressure for groupBy, reduceByKey, sortByKey, etc.

• Secondary issue: too many partitions

• Need “reasonable number” of partitions

– Commonly between 100 and 10,000 partitions

– Lower bound: At least ~2x number of cores in cluster

– Upper bound: Ensure tasks take at least 100ms
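The lower bound from the list above can be applied mechanically. The following is a minimal sketch, assuming the total core count is known to the caller; `ensureEnoughPartitions` is a hypothetical helper, not a Spark API. It only raises the partition count when the RDD has fewer than roughly 2x the number of cores.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PartitionSketch {
  // Hypothetical helper: repartition only if the RDD is below ~2x the core count.
  // Note that repartition itself performs a shuffle.
  def ensureEnoughPartitions[T](rdd: RDD[T], totalCores: Int): RDD[T] = {
    val minPartitions = 2 * totalCores
    if (rdd.getNumPartitions < minPartitions) rdd.repartition(minPartitions) else rdd
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partition-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 100, numSlices = 2)
      val tuned = ensureEnoughPartitions(rdd, totalCores = 4)
      println(s"${rdd.getNumPartitions} -> ${tuned.getNumPartitions} partitions")
    } finally sc.stop()
  }
}
```

Many shuffle operators also accept a partition count directly, for example `reduceByKey(_ + _, numPartitions)`, which avoids the extra shuffle that a separate `repartition` would add.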

Memory Problems

• Symptoms:

– Inexplicably bad performance

– Inexplicable executor/machine failures

(can indicate too many shuffle files too)

• Diagnosis:

– Set spark.executor.extraJavaOptions to include

• -XX:+PrintGCDetails

• -XX:+HeapDumpOnOutOfMemoryError

– Check dmesg for oom-killer logs

• Resolution:

– Increase spark.executor.memory

– Increase number of partitions

– Re-evaluate program structure (!)
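The diagnosis and resolution settings above can be written down as a configuration. The following is a minimal sketch using `SparkConf`; the two property keys and the JVM flags come from the slide, while the application name and the `4g` memory value are arbitrary placeholders to be chosen per cluster.

```scala
import org.apache.spark.SparkConf

object MemoryDiagnosisConf {
  // Builds a configuration with GC logging and heap dumps enabled on executors.
  // Assumption: "4g" is a placeholder, not a recommended value.
  def build(): SparkConf =
    new SparkConf()
      .setAppName("memory-diagnosis-sketch")
      .set("spark.executor.extraJavaOptions",
        "-XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError")
      .set("spark.executor.memory", "4g")
}
```

These settings need to be in place before the SparkContext is created, because executors read them when they are launched.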

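Now that we know what is wrong with our program, as well as how to diagnose problems in Spark and some of the available fixes, we can revise the earlier program.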

[Figure: the original program code]

[Figure: the modified code]

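Shuffle optimization was mentioned earlier, but I do not yet understand the shuffle process in detail, and the slides do not spell out which step optimizes it. Based on another article, my tentative conclusion is that replacing groupByKey with reduceByKey is the step that includes the shuffle optimization.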
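One rewrite consistent with the checklist and with the groupByKey-to-reduceByKey change can be sketched as follows. This is an illustration under stated assumptions, not necessarily the exact code from the slides: it assumes names are deduplicated first, that every name is non-empty, and that the partition count is passed in by the caller. `DistinctNamesSketch` and `countPerFirstLetter` are hypothetical names.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object DistinctNamesSketch {
  // Counts distinct names per first letter without grouping whole names by key.
  def countPerFirstLetter(names: RDD[String], numPartitions: Int): Map[Char, Int] =
    names
      .distinct(numPartitions)            // dedupe and set the partition count in one shuffle
      .map(name => (name.charAt(0), 1))   // ship a small (letter, 1) pair, not the name
      .reduceByKey(_ + _)                 // partial sums are combined before the shuffle
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("distinct-names-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Assumption: a tiny in-memory dataset replaces hdfs:/names.
      val names = sc.parallelize(Seq("Ahir", "Pat", "Andy", "Ahir"))
      println(countPerFirstLetter(names, numPartitions = 6))
    } finally sc.stop()
  }
}
```

Compared with groupByKey, reduceByKey sums the counts inside each partition first, so only one small pair per letter per partition crosses the network in the second shuffle, and no single key has to hold all of its names in memory.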
