Summary of Spark Principles

Spark Architecture

[Figure: Spark architecture diagram]

Spark Job Submission Flow

Stage Division Logic

Dependency Types

  • Wide dependency: a ShuffleDependency. A wide dependency usually corresponds to a shuffle operation, where a single parent RDD partition has to be distributed to multiple different child RDD partitions, which may involve transferring data across several nodes; operators such as groupBy and join create wide dependencies.
  • Narrow dependency: each parent RDD partition feeds only one child RDD partition, and the computation can usually be completed inside a single node (see the sketch after this list).
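The difference is visible directly on the RDD API. Below is a minimal sketch (it assumes an existing SparkContext named sc; the data and partition counts are illustrative) that inspects RDD.dependencies to contrast a narrow OneToOneDependency with a wide ShuffleDependency:

```scala
// Minimal sketch: assumes an existing SparkContext `sc`; data is illustrative.
val nums    = sc.parallelize(1 to 100, 4)      // 4 partitions
val doubled = nums.map(_ * 2)                  // narrow: one parent partition -> one child partition
val pairs   = doubled.map(n => (n % 10, n))
val grouped = pairs.groupByKey()               // wide: records are re-partitioned by key

// `dependencies` exposes how each RDD depends on its parent(s):
println(doubled.dependencies)  // a OneToOneDependency (narrow)
println(grouped.dependencies)  // a ShuffleDependency (wide)
```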

DAG

A Spark job forms a DAG (directed acyclic graph) from the dependencies between its RDDs. The DAG is submitted to the DAGScheduler, which splits it into multiple mutually dependent stages; the split criterion is the wide/narrow dependency, and a new stage is cut wherever a wide dependency appears. Each stage is a group of one or more parallel tasks, and those tasks are submitted to the TaskScheduler as a TaskSet. The cutting rule works from the last RDD backwards: whenever a wide dependency is encountered, a stage boundary is placed there. Later stages depend on earlier ones, so a stage can only run after all of the stages it depends on have finished. The lineage sketch below shows where these boundaries appear.

[Figure: DAG split into stages at wide-dependency boundaries]
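A minimal sketch of such a lineage (assuming an existing SparkContext named sc and a hypothetical input path); the indentation in toDebugString marks the shuffle boundaries that the DAGScheduler turns into separate stages:

```scala
// Minimal sketch: `sc` is an existing SparkContext; the input path is hypothetical.
val wordCounts = sc.textFile("hdfs:///tmp/input.txt")
  .flatMap(_.split("\\s+"))   // narrow: stays in the same stage
  .map(word => (word, 1))     // narrow: stays in the same stage
  .reduceByKey(_ + _)         // wide: the DAGScheduler cuts a new stage here

// toDebugString prints the lineage; indented blocks correspond to the
// stages created at each shuffle boundary.
println(wordCounts.toDebugString)
```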

Why Spark Is Faster Than MapReduce

Threading Model

  • Faster task startup time: Spark forks a thread, while MapReduce brings up a new JVM.
    In Spark, each executor runs as a single process; an executor holds n cores, and a task thread is forked on each core (see the sketch below).
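A minimal sketch of the corresponding resource configuration (the app name and all resource numbers are illustrative assumptions, not values from the original post):

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch; app name and resource numbers are illustrative.
// Each executor is one JVM process; with 4 cores per executor it can run
// up to 4 task threads concurrently, so 4 executors give 16 concurrent tasks.
val spark = SparkSession.builder()
  .appName("executor-thread-model")          // hypothetical app name
  .config("spark.executor.instances", "4")   // 4 executor processes
  .config("spark.executor.cores", "4")       // 4 task threads per executor
  .config("spark.executor.memory", "4g")
  .getOrCreate()
```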

Pipeline Model

The pipeline computation model is a computational idea, a pattern, rather than a concrete component.
Spark's pipeline model is equivalent to executing a composed higher-order function: as each record is read, it is pushed through all of the logic of the current stage before the result is materialized. MapReduce, by contrast, runs one operation over the entire dataset, materializes the complete intermediate result, and only then runs the next operation. Pipelining removes a large amount of I/O and keeps the computation within a stage in memory, which is one of the reasons Spark is faster than MapReduce.

For example:
Pipeline model: read one record, apply map() and then flatMap() to it in sequence, then read the next record.
MapReduce model: iterate over all 100 million records applying map(), then read those 100 million records again to apply flatMap().
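A minimal sketch of such a pipelined stage (assuming an existing SparkContext named sc and a hypothetical input path); map and flatMap are narrow transformations, so Spark fuses them into one stage and each record flows through both before the next one is read:

```scala
// Minimal sketch: `sc` is an existing SparkContext; the input path is hypothetical.
val lines = sc.textFile("hdfs:///tmp/input.txt")
val words = lines
  .map(_.toLowerCase)        // applied to each record as it is read
  .flatMap(_.split("\\s+"))  // applied to the same record immediately afterwards
// No intermediate dataset is written between map and flatMap; nothing runs
// at all until an action such as count() is called.
println(words.count())
```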

Caching

It is doubtful, because at the moment HDFS can also utilize a cache, but in general the Spark cache is quite good, especially the Spark SQL part, which caches data in an optimized column-oriented form (a columnar layout that can be read and written quickly).
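A minimal sketch of DataFrame caching (assuming an existing SparkSession named spark; the path and column name are hypothetical); cached DataFrames use the columnar in-memory format mentioned above:

```scala
// Minimal sketch: `spark` is an existing SparkSession; path and column are hypothetical.
val events = spark.read.parquet("hdfs:///tmp/events.parquet")
events.cache()      // lazy: the cache is populated by the first action
events.count()      // forces materialization into the in-memory columnar cache
events.groupBy("user_id").count().show()  // subsequent queries read from the cache
```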

Do you think that Spark processes all the transformations in memory?
Answer: You would be disappointed, but the heart of Spark, the "shuffle", writes data to disk. If you have a "group by" statement in your Spark SQL query, or you are just transforming an RDD into a pair RDD and calling some aggregation by key on it, you are forcing Spark to distribute data among the partitions based on the hash value of the key.
The shuffle process consists of two phases, usually referred to as "map" and "reduce". The "map" side just calculates the hash value of your key (or another partitioning function, if you set one manually) and outputs the data to N separate files on the local filesystem, where N is the number of partitions on the "reduce" side. The "reduce" side polls the "map" side for the data and merges it into new partitions. So if you have an RDD of M partitions and you transform it into a pair RDD with N partitions, M*N files are created on the local filesystems of your cluster, holding all the data of that RDD. There are some optimizations available to reduce the number of files, and there is also work underway to pre-sort them and then "merge" them on the "reduce" side, but this does not change the fact that each time you need to shuffle your data, you are writing it to the disks.
So, finally, Spark is not an in-memory technology. It is a technology that lets you efficiently use an in-memory LRU cache, with possible eviction to disk when memory is full. It does not have built-in persistence functionality (neither in-memory nor on-disk), and it puts all of the dataset's data on the local filesystems during the shuffle process.
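A minimal sketch of a transformation that triggers exactly this shuffle path (assuming an existing SparkContext named sc; the data and partition counts are illustrative); the map side writes shuffle files to local disk on each node, and the reduce side fetches and merges them:

```scala
// Minimal sketch: `sc` is an existing SparkContext; data and partition counts are illustrative.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)), 4)  // M = 4 partitions
// reduceByKey with an explicit partition count hash-partitions records by key:
// the map side writes per-partition shuffle files to the local filesystem,
// and the reduce side pulls and merges them into N = 8 new partitions.
val reduced = pairs.reduceByKey(_ + _, 8)
reduced.collect().foreach(println)
```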

This article draws on the two blog posts below; the notes above are my summary after studying them. Questions about Spark are welcome.
https://blog.csdn.net/qq_16038125/article/details/80359414
https://www.cnblogs.com/gaoxing/p/5041806.html
