Summary of Spark Principles

Spark Architecture

[Figure: Spark architecture diagram]

Spark Job Submission Flow

Stage Division Logic

Dependency Types

  • Wide dependency: a ShuffleDependency. A wide dependency usually corresponds to a shuffle operation, where a single parent RDD partition has to be distributed to multiple different child RDD partitions, which may involve transferring data across several nodes; operators such as groupBy and join create wide dependencies.
  • Narrow dependency: each parent RDD partition feeds only one child RDD partition, and the computation can usually be completed inside a single node (see the sketch after this list).
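The difference is visible directly on the RDD API. Below is a minimal sketch (it assumes an existing SparkContext named sc; the data and partition counts are illustrative) that inspects RDD.dependencies to contrast a narrow OneToOneDependency with a wide ShuffleDependency:

```scala
// Minimal sketch: assumes an existing SparkContext `sc`; data is illustrative.
val nums    = sc.parallelize(1 to 100, 4)      // 4 partitions
val doubled = nums.map(_ * 2)                  // narrow: one parent partition -> one child partition
val pairs   = doubled.map(n => (n % 10, n))
val grouped = pairs.groupByKey()               // wide: records are re-partitioned by key

// `dependencies` exposes how each RDD depends on its parent(s):
println(doubled.dependencies)  // a OneToOneDependency (narrow)
println(grouped.dependencies)  // a ShuffleDependency (wide)
```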

DAG

A Spark job forms a DAG (directed acyclic graph) from the dependencies between its RDDs. The DAG is submitted to the DAGScheduler, which splits it into multiple mutually dependent stages; the split criterion is the wide/narrow dependency, and a new stage is cut wherever a wide dependency appears. Each stage is a group of one or more parallel tasks, and those tasks are submitted to the TaskScheduler as a TaskSet. The cutting rule works from the last RDD backwards: whenever a wide dependency is encountered, a stage boundary is placed there. Later stages depend on earlier ones, so a stage can only run after all of the stages it depends on have finished. The lineage sketch below shows where these boundaries appear.

[Figure: DAG split into stages at wide-dependency boundaries]
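A minimal sketch of such a lineage (assuming an existing SparkContext named sc and a hypothetical input path); the indentation in toDebugString marks the shuffle boundaries that the DAGScheduler turns into separate stages:

```scala
// Minimal sketch: `sc` is an existing SparkContext; the input path is hypothetical.
val wordCounts = sc.textFile("hdfs:///tmp/input.txt")
  .flatMap(_.split("\\s+"))   // narrow: stays in the same stage
  .map(word => (word, 1))     // narrow: stays in the same stage
  .reduceByKey(_ + _)         // wide: the DAGScheduler cuts a new stage here

// toDebugString prints the lineage; indented blocks correspond to the
// stages created at each shuffle boundary.
println(wordCounts.toDebugString)
```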

Why Spark Is Faster Than MapReduce

Threading Model

  • Faster task startup time: Spark forks a thread, while MapReduce brings up a new JVM.
    In Spark, each executor runs as a single process; an executor holds n cores, and a task thread is forked on each core (see the sketch below).
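A minimal sketch of the corresponding resource configuration (the app name and all resource numbers are illustrative assumptions, not values from the original post):

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch; app name and resource numbers are illustrative.
// Each executor is one JVM process; with 4 cores per executor it can run
// up to 4 task threads concurrently, so 4 executors give 16 concurrent tasks.
val spark = SparkSession.builder()
  .appName("executor-thread-model")          // hypothetical app name
  .config("spark.executor.instances", "4")   // 4 executor processes
  .config("spark.executor.cores", "4")       // 4 task threads per executor
  .config("spark.executor.memory", "4g")
  .getOrCreate()
```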

Pipeline Model

The pipeline computation model is a computational idea, a pattern, rather than a concrete component.
Spark's pipeline model is equivalent to executing a composed higher-order function: as each record is read, it is pushed through all of the logic of the current stage before the result is materialized. MapReduce, by contrast, runs one operation over the entire dataset, materializes the complete intermediate result, and only then runs the next operation. Pipelining removes a large amount of I/O and keeps the computation within a stage in memory, which is one of the reasons Spark is faster than MapReduce.

For example:
Pipeline model: read one record, apply map() and then flatMap() to it in sequence, then read the next record.
MapReduce model: iterate over all 100 million records applying map(), then read those 100 million records again to apply flatMap().
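A minimal sketch of such a pipelined stage (assuming an existing SparkContext named sc and a hypothetical input path); map and flatMap are narrow transformations, so Spark fuses them into one stage and each record flows through both before the next one is read:

```scala
// Minimal sketch: `sc` is an existing SparkContext; the input path is hypothetical.
val lines = sc.textFile("hdfs:///tmp/input.txt")
val words = lines
  .map(_.toLowerCase)        // applied to each record as it is read
  .flatMap(_.split("\\s+"))  // applied to the same record immediately afterwards
// No intermediate dataset is written between map and flatMap; nothing runs
// at all until an action such as count() is called.
println(words.count())
```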

Caching

It is doubtful, because at the moment HDFS can also utilize a cache, but in general the Spark cache is quite good, especially the Spark SQL part, which caches data in an optimized column-oriented form (a columnar layout that can be read and written quickly).
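A minimal sketch of DataFrame caching (assuming an existing SparkSession named spark; the path and column name are hypothetical); cached DataFrames use the columnar in-memory format mentioned above:

```scala
// Minimal sketch: `spark` is an existing SparkSession; path and column are hypothetical.
val events = spark.read.parquet("hdfs:///tmp/events.parquet")
events.cache()      // lazy: the cache is populated by the first action
events.count()      // forces materialization into the in-memory columnar cache
events.groupBy("user_id").count().show()  // subsequent queries read from the cache
```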

Do you think that Spark processes all the transformations in memory?
Answer: You would be disappointed, but the heart of Spark, the "shuffle", writes data to disk. If you have a "group by" statement in your Spark SQL query, or you are just transforming an RDD into a pair RDD and calling some aggregation by key on it, you are forcing Spark to distribute data among the partitions based on the hash value of the key.
The shuffle process consists of two phases, usually referred to as "map" and "reduce". The "map" side just calculates the hash value of your key (or another partitioning function, if you set one manually) and outputs the data to N separate files on the local filesystem, where N is the number of partitions on the "reduce" side. The "reduce" side polls the "map" side for the data and merges it into new partitions. So if you have an RDD of M partitions and you transform it into a pair RDD with N partitions, M*N files are created on the local filesystems of your cluster, holding all the data of that RDD. There are some optimizations available to reduce the number of files, and there is also work underway to pre-sort them and then "merge" them on the "reduce" side, but this does not change the fact that each time you need to shuffle your data, you are writing it to the disks.
So, finally, Spark is not an in-memory technology. It is a technology that lets you efficiently use an in-memory LRU cache, with possible eviction to disk when memory is full. It does not have built-in persistence functionality (neither in-memory nor on-disk), and it puts all of the dataset's data on the local filesystems during the shuffle process.
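A minimal sketch of a transformation that triggers exactly this shuffle path (assuming an existing SparkContext named sc; the data and partition counts are illustrative); the map side writes shuffle files to local disk on each node, and the reduce side fetches and merges them:

```scala
// Minimal sketch: `sc` is an existing SparkContext; data and partition counts are illustrative.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)), 4)  // M = 4 partitions
// reduceByKey with an explicit partition count hash-partitions records by key:
// the map side writes per-partition shuffle files to the local filesystem,
// and the reduce side pulls and merges them into N = 8 new partitions.
val reduced = pairs.reduceByKey(_ + _, 8)
reduced.collect().foreach(println)
```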

This article draws on the two blog posts below; the notes above are my summary after studying them. Questions about Spark are welcome.
https://blog.csdn.net/qq_16038125/article/details/80359414
https://www.cnblogs.com/gaoxing/p/5041806.html
