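1 Review of the previous lesson
Big Data in Practice, Lesson 14, Spark-Core02: https://blog.csdn.net/zhikanjiani/article/details/90786069
1.1 Review of Spark RDD
Question: do the following three methods run on the Driver or on the Executors?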
RDD property | Method in the source code | Location (Driver or Executor) | Input | Output |
---|---|---|---|---|
A list of partitions | getPartitions | ? | - | Array[Partition] |
A function for computing each split | compute | ? | Partition | Iterator |
A list of dependencies on other RDDs | getDependencies | ? | - | Seq[Dependency[_]] (the list of dependencies) |
- A list of partitions.
- A function for computing each split; a split is the same concept as a partition.
- A list of dependencies on other RDDs.
Recap of the methods behind these properties:
- A list of partitions: getPartitions
- A function for computing each split: compute
- A list of dependencies on other RDDs: getDependencies
Optional properties:
- A Partitioner for key-value RDDs
- A list of preferred locations to compute each split on
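As a way to reason about the question above, here is a minimal plain-Scala sketch with no Spark dependency. MiniRDD, Part, SeqRDD and MappedRDD are hypothetical names invented for this illustration; they are not Spark classes. In Spark itself, getPartitions and getDependencies are evaluated on the Driver while the job is being planned, and compute runs inside tasks on the Executors. The sketch only mimics that split inside a single process.

```scala
// Toy model of the three required RDD methods; NOT Spark's real classes.
final case class Part(index: Int)

trait MiniRDD[+T] {
  def parents: Seq[MiniRDD[Any]]        // plays the role of getDependencies
  def partitions: Seq[Part]             // plays the role of getPartitions
  def compute(split: Part): Iterator[T] // plays the role of compute
}

// A source dataset: deals a local sequence round-robin into numSlices partitions.
class SeqRDD[T](data: Seq[T], numSlices: Int) extends MiniRDD[T] {
  def parents: Seq[MiniRDD[Any]] = Nil
  def partitions: Seq[Part] = (0 until numSlices).map(i => Part(i))
  def compute(split: Part): Iterator[T] =
    data.iterator.zipWithIndex.collect { case (x, i) if i % numSlices == split.index => x }
}

// A derived dataset: same partitions as its parent, computed by mapping the parent's iterator.
class MappedRDD[T, U](parent: MiniRDD[T], f: T => U) extends MiniRDD[U] {
  def parents: Seq[MiniRDD[Any]] = Seq(parent)
  def partitions: Seq[Part] = parent.partitions
  def compute(split: Part): Iterator[U] = parent.compute(split).map(f)
}

val source = new SeqRDD(Seq(1, 2, 3, 4, 5), numSlices = 2)
val doubled = new MappedRDD[Int, Int](source, _ * 2)

// "Driver side": list the partitions. "Executor side": run compute once per partition.
val result = doubled.partitions.map(p => doubled.compute(p).toList)
```

Here result holds one list per partition, which mirrors the idea that one task processes one partition.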
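2 Stage analysis
2.1 The Stage concept
- Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you will see this term used in the driver's logs.
- In other words, every job is split into smaller sets of tasks, and each such set is called a stage. When one stage depends on another, the earlier stage must finish before the later one can run.
- Every action triggers one job, and every job is split into smaller sets of tasks. This resembles MapReduce, where reduce only runs after map has finished.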
2.2 Testing stages in spark-shell
- Start Spark with the command spark-shell --master local[2]
- scala> sc.parallelize(List(1,1,2,2,3,3,3,3,4,4,4,5)).map((_,1)).collect
res1: Array[(Int, Int)] = Array((1,1), (1,1), (2,1), (2,1), (3,1), (3,1), (3,1), (3,1), (4,1), (4,1), (4,1), (5,1))
- The web UI at hadoop002:4040 now shows Job Id 0.
- scala> sc.parallelize(List(1,1,2,2,3,3,3,3,4,4,4,5)).map((_,1)).reduceByKey(_+_).collect
res2: Array[(Int, Int)] = Array((4,3), (2,2), (1,2), (3,4), (5,1))
- The web UI at hadoop002:4040 now shows Job Id 1.
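To check what the second command computes without starting a cluster, the same logic can be sketched with plain Scala collections. This is only a local equivalent, under the assumption that groupBy plus reduce stands in for reduceByKey. Spark returns the pairs in no particular order, so the sketch sorts them by key before comparing.

```scala
val nums = List(1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5)

// Same first step as in the shell: pair every element with the count 1.
val pairs = nums.map((_, 1))

// Local stand-in for reduceByKey(_+_): group by key, then sum the values of each group.
val counts: Map[Int, Int] =
  pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).reduce(_ + _)) }

// Sort by key so the result is deterministic.
val sorted = counts.toList.sortBy(_._1)
```

The sorted list contains the same five pairs as res2 above, only in key order.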
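2.3 Why is a job triggered?
Job:
A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
- Every action operator that is called, such as collect, becomes one job.
- As shown above, we called the action collect twice, so two jobs were generated.
- Job Id starts at 0 and is incremented by 1 for every action that runs, which is why the following ids are 1, then 2, and so on.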
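2.4 How a job produces stages
- One job is split into stages. In the Description column of the web UI, entries such as "collect at" and "map at" are named after the last operator of each stage.
The job above uses three operators: parallelize, map and reduceByKey. reduceByKey requires a shuffle, and a shuffle introduces a stage boundary. For example, what would otherwise be a single stage is cut in two as soon as a shuffle is encountered.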
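The cutting rule can be sketched as a small plain-Scala function. This is a toy model, not how Spark's scheduler is implemented: Op and splitIntoStages are hypothetical names, and the shuffle flag is supplied by hand instead of being derived from real dependencies.

```scala
// Hypothetical operator descriptor: its name and whether it requires a shuffle.
final case class Op(name: String, shuffle: Boolean)

// Walk the operators in order and start a new stage in front of every shuffle operator.
def splitIntoStages(ops: List[Op]): List[List[String]] =
  ops.foldLeft(List.empty[List[String]]) { (stages, op) =>
    stages match {
      case current :: done if !op.shuffle => (op.name :: current) :: done
      case _                              => List(op.name) :: stages
    }
  }.map(_.reverse).reverse

val job = List(
  Op("parallelize", shuffle = false),
  Op("map", shuffle = false),
  Op("reduceByKey", shuffle = true),
  Op("collect", shuffle = false)
)
val stages = splitIntoStages(job)
```

The first stage ends with map and the second with collect, which matches the "map at" and "collect at" descriptions mentioned above.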
For RDDs we have already covered transformations and actions; the next thing to understand is cache.
3 RDD Persistence
Purpose: making jobs faster.
Aside on the JVM and the Java Memory Model: computation happens on the CPU while the data lives in memory, so data has to be loaded before it can be used, and that costs CPU time in many places. This raises the question of data consistency: every core runs its own thread, so how is a consistent view of the data guaranteed?
volatile: guarantees visibility and ordering (but not atomicity).
- One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations.
- In other words, one of Spark's most important features is persisting data in memory; that memory belongs to the executors.
- When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.
- You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark's cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
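3.1 Testing RDD caching in spark-shell
Read a file:
- scala> val lines = sc.textFile("hdfs://hadoop002:9000/wordcount/input/ruozeinput.txt")
lines: org.apache.spark.rdd.RDD[String] = hdfs://hadoop002:9000/wordcount/input/ruozeinput.txt MapPartitionsRDD[6] at textFile at <console>:24
Cache the file:
cache behaves like a transformation: it is lazy, does nothing by itself, and only takes effect once an action triggers it.
- scala> lines.cache
res0: lines.type = hdfs://hadoop002:9000/wc_input/ruozeinput.txt MapPartitionsRDD[1] at textFile at <console>:24
Trigger it with collect:
- scala> lines.collect
[Stage 0:> (0 + 2) res1: Array[String] = Array(hello world john, hello world, hello)
Note 1:
Calling cache triggers nothing; the work only happens when collect is called. The Storage page then shows the cache information, with the 2 partitions cached. The Stage page shows that the input file is only 53B, while the cached data in memory takes 200B, so the data is larger in memory than on disk.
Note 2:
Run lines.collect once more and refresh the Stage page: the input size is now 240B, and the cache information on the Storage page is gone.
Where this helps: suppose a job has to compute sums per domain name. When several requirements in one job share the same intermediate data, that shared part can be extracted and cached.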
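The lazy behaviour of cache and the eager behaviour of unpersist can be modelled in a few lines of plain Scala. LazyDataset is a hypothetical stand-in for an RDD, invented for this sketch. It only counts how often the source is read and does not reproduce Spark's block manager.

```scala
// Toy stand-in for an RDD: counts how often the underlying source is actually read.
class LazyDataset[T](load: () => Seq[T]) {
  var loads = 0                              // number of times the source was read
  private var cacheRequested = false
  private var cached: Option[Seq[T]] = None

  // Lazy: only records the request, reads nothing.
  def cache(): this.type = { cacheRequested = true; this }

  // Eager: drops the cached data immediately.
  def unpersist(): this.type = { cacheRequested = false; cached = None; this }

  // The "action": serves from the cache when possible, otherwise reads the source.
  def collect(): Seq[T] = cached.getOrElse {
    loads += 1
    val data = load()
    if (cacheRequested) cached = Some(data)
    data
  }
}

val lines = new LazyDataset(() => Seq("hello world john", "hello world", "hello"))
lines.cache()
val loadsAfterCache = lines.loads       // cache alone has read nothing
lines.collect()
lines.collect()
val loadsAfterTwoActions = lines.loads  // the second collect was served from the cache
lines.unpersist()
lines.collect()
val loadsAfterUnpersist = lines.loads   // after unpersist the source is read again
```

The three counters show the pattern from the shell session: nothing happens at cache time, the first action fills the cache, and unpersist takes effect at once.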
3.2 What is the difference between cache and persist?
- def cache(): this.type = persist() // cache simply calls persist
- def persist(): this.type = persist(StorageLevel.MEMORY_ONLY) // the no-argument persist calls the overload with a memory-only level
Look at StorageLevel.scala:
MEMORY_ONLY is the default storage level in Spark Core.
@DeveloperApi
class StorageLevel private(
private var _useDisk: Boolean, // whether to use disk
private var _useMemory: Boolean, // whether to use memory
private var _useOffHeap: Boolean,
private var _deserialized: Boolean,
private var _replication: Int = 1)
extends Externalizable {
/**
* Various [[org.apache.spark.storage.StorageLevel]] defined and utility functions for creating
* new storage levels.
*/
object StorageLevel {
val NONE = new StorageLevel(false, false, false, false)
val DISK_ONLY = new StorageLevel(true, false, false, false)
val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2) // 2 is the number of replicas
val MEMORY_ONLY = new StorageLevel(false, true, false, true)
val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
val OFF_HEAP = new StorageLevel(true, true, true, false, 1)
3.3 Testing in spark-shell
- scala> lines.unpersist(true) // clears the cache; cache is lazy, whereas unpersist is eager
res9: lines.type = hdfs://hadoop002:9000/wordcount/input/ruozeinput.txt MapPartitionsRDD[6] at textFile at <console>:24
- scala> import org.apache.spark.storage.StorageLevel
import org.apache.spark.storage.StorageLevel
- scala> lines.persist(StorageLevel.MEMORY_AND_DISK_SER_2)
res10: lines.type = hdfs://hadoop002:9000/wordcount/input/ruozeinput.txt MapPartitionsRDD[6] at textFile at <console>:24
Note:
StorageLevel relates one-to-many to DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_SER_2 and the other constants: one class, many predefined levels.
Running count materializes the persisted RDD. MEMORY_AND_DISK_SER_2 asks for a second replica, and this local session has no peer to hold it, so each block logs a replication warning:
- scala> lines.count
19/08/11 04:37:13 WARN storage.RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
19/08/11 04:37:13 WARN storage.BlockManager: Block rdd_6_0 replicated to only 0 peer(s) instead of 1 peers
19/08/11 04:37:13 WARN storage.RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
19/08/11 04:37:13 WARN storage.BlockManager: Block rdd_6_1 replicated to only 0 peer(s) instead of 1 peers
res11: Long = 3
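Extension:
Read up on the JMM (Java Memory Model).
Data consistency: atomicity, visibility, ordering.
volatile: i++ on a volatile variable is still not atomic, so it is not safe in a multithreaded environment.
Interview question:
What is the difference between lines.cache() and lines.persist()?
def cache(): this.type = persist()
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY) // overload; memory only
With no arguments the two are equivalent: cache() calls persist(), which uses MEMORY_ONLY, the default storage level in Spark Core. persist can also be given an explicit StorageLevel.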
4 Choosing a storage level in Spark Core
http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence
Storage Level | Meaning |
---|---|
MEMORY_ONLY | Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they’re needed. This is the default level. |
Explanation: the RDD is stored as deserialized Java objects. If the RDD does not fit in memory, some partitions cannot be cached and are recomputed whenever they are needed.
Storage Level | Meaning |
---|---|
MEMORY_AND_DISK | Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don’t fit on disk, and read them from there when they’re needed. |
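Explanation: the data is again stored deserialized; partitions that do not fit in memory are written to disk and read from there when needed.
MEMORY_AND_DISK stores data deserialized, while MEMORY_ONLY_SER (Java and Scala) stores it serialized. Serialization saves space but costs more CPU.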
4.1 Which Storage Level to Choose?
Spark’s storage levels are meant to provide different trade-offs between memory usage and CPU efficiency. We recommend going through the following process to select one:
The default, and the first choice:
- If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
If there is CPU to spare, use this one:
- If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access. (Java and Scala)
If recomputing on the CPU is as fast as reading from disk, caching on disk gains nothing:
- Don’t spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk.
Spark already provides fault tolerance through recomputation, so replicas are usually not necessary:
- Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.
Summary: in Spark Core, every level that involves disk is ruled out; the instructor's (PK's) company uses MEMORY_ONLY and MEMORY_ONLY_SER.
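The selection process quoted above can be written down as a small decision function. This is only one reading of the guide, not an official algorithm: chooseStorageLevel and its parameter names are made up for this sketch, and picking MEMORY_AND_DISK_SER for the disk case is an assumption. The function returns level names as strings so that it runs without Spark.

```scala
// Inputs are the answers to the guide's questions; all names here are hypothetical.
def chooseStorageLevel(
    fitsDeserialized: Boolean,     // does the RDD fit in memory as plain Java objects?
    fitsSerialized: Boolean,       // does it fit once serialized?
    recomputeIsExpensive: Boolean, // costly functions, or filters that drop most of the data
    needFastRecovery: Boolean      // e.g. serving requests from a web application
): String = {
  val base =
    if (fitsDeserialized) "MEMORY_ONLY"                // default, most CPU-efficient
    else if (fitsSerialized) "MEMORY_ONLY_SER"         // trade CPU for space
    else if (recomputeIsExpensive) "MEMORY_AND_DISK_SER" // assumption: spill only when recomputing hurts
    else "MEMORY_ONLY_SER"                             // partitions that do not fit are recomputed
  // Replicated variants carry the _2 suffix.
  if (needFastRecovery) base + "_2" else base
}

val defaultChoice = chooseStorageLevel(true, true, false, false)
val tightMemory   = chooseStorageLevel(false, true, false, false)
val webServing    = chooseStorageLevel(false, false, true, true)
```

Under the summary above, only the first two outcomes would be used in practice; the disk and replicated branches are kept to mirror the guide.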
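- Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion.
- Spark monitors cache usage automatically, and the cache can also be cleared by hand.
For example, after calling lines.cache, call lines.unpersist(true) once the data is no longer needed.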
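To make the LRU idea concrete, here is a toy cache built on java.util.LinkedHashMap in access order. LruCache is a hypothetical class for illustration only. It counts entries rather than bytes and says nothing about how Spark sizes or evicts real blocks.

```scala
import java.util.{LinkedHashMap => JLinkedHashMap}

// Toy LRU cache keyed by block name; capacity counts entries, not bytes.
class LruCache[K, V](capacity: Int) {
  // accessOrder = true keeps entries ordered from least to most recently used.
  private val entries = new JLinkedHashMap[K, V](16, 0.75f, true) {
    override def removeEldestEntry(eldest: java.util.Map.Entry[K, V]): Boolean =
      size() > capacity
  }

  def put(key: K, value: V): Unit = { entries.put(key, value); () }

  // A successful get counts as a use and moves the entry to the "recent" end.
  def get(key: K): Option[V] =
    if (entries.containsKey(key)) Some(entries.get(key)) else None

  // containsKey does not count as a use.
  def contains(key: K): Boolean = entries.containsKey(key)
}

val cache = new LruCache[String, Int](capacity = 2)
cache.put("rdd_6_0", 100)
cache.put("rdd_6_1", 100)
cache.get("rdd_6_0")       // touch rdd_6_0, so rdd_6_1 becomes the least recently used
cache.put("rdd_7_0", 100)  // over capacity: the least recently used entry is dropped
```

After the last put, rdd_6_1 is gone while rdd_6_0 and rdd_7_0 remain, because rdd_6_0 was read more recently than rdd_6_1.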
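It is best to call sc.stop() at the end of the application.
- Don't spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk.
- This involves one concept: recomputing.
Suppose an RDD whose data lives on HDFS: textFile reads the data in, a map turns it into a new RDD, and then an action is triggered.
Computing on an RDD means computing on each of its partitions. Applying reduceByKey or groupByKey to RDD1 produces another RDD, so there are now two RDDs; where the data of each partition is stored is not fixed.
Suppose the third partition (P2) is lost. RDDs carry their dependencies (RDD1 was derived from RDD0), so the information about the parent RDD can always be found. In the figure, recovery only requires going back to P2 of RDD0. This is called lineage.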
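Recovery through lineage can be sketched with plain Scala maps. Everything here is a stand-in: rdd0 and rdd1 are ordinary Maps from partition index to data, and toUpper is an arbitrary per-partition transformation chosen for the example. The point is that only the lost partition is recomputed, from the matching parent partition.

```scala
// RDD0: three partitions of raw lines (stand-in for data read with textFile).
val rdd0: Map[Int, Seq[String]] = Map(
  0 -> Seq("hello world"),
  1 -> Seq("hello spark"),
  2 -> Seq("hello john")
)

// The recorded transformation RDD0 -> RDD1 (a per-partition, narrow operation).
val toUpper: Seq[String] => Seq[String] = _.map(_.toUpperCase)

// Materialized partitions of RDD1.
var rdd1: Map[Int, Seq[String]] = rdd0.map { case (i, part) => i -> toUpper(part) }

// Partition P2 of RDD1 is lost.
rdd1 = rdd1 - 2
val lostKeys = rdd0.keySet -- rdd1.keySet

// Lineage says RDD1.P2 = toUpper(RDD0.P2), so only that parent partition is read again.
val recovered = toUpper(rdd0(2))
rdd1 = rdd1 + (2 -> recovered)
```

Partitions 0 and 1 of rdd1 are never touched during recovery, which is what makes recomputation cheap for this kind of dependency.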
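4.2 Narrow and wide dependencies in Spark
Dependency:
Cut the graph wherever reduceByKey or groupByKey appears; this splits it into stage0 and stage1.
Narrow dependency, definition:
- Each partition of the parent RDD is used by at most one partition of the child RDD (map, filter, union, join with inputs co-partitioned).
Wide dependency (involves a shuffle):
- A partition of the parent RDD is used by multiple partitions of the child RDD.
groupByKey: the same parent data is used by several partitions of the child RDD.
Difference between narrow and wide dependencies:
With a narrow dependency, if a partition is lost, only the matching partition of the parent RDD has to be recomputed.
With a wide dependency, if a partition is lost, much more has to be recomputed, because the lost data depends on many parent partitions.
Reading the figure:
- map and filter are computed one-to-one.
- union: two RDDs become one RDD.
- join with inputs co-partitioned: each parent partition is used by at most one child partition, so this is a narrow dependency.
Interview question: is join a wide dependency?
It depends. A join whose inputs are co-partitioned is narrow; otherwise it is wide, because a wide dependency means that a parent partition is used by multiple child partitions.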
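The definition can be turned into a small check. In this sketch a dependency is described by a hypothetical Map from each child partition to the set of parent partitions it reads; isNarrow is an invented helper, not a Spark API. A dependency counts as narrow when no parent partition is read by more than one child partition.

```scala
// deps(child) = the parent partition indices that this child partition reads.
def isNarrow(deps: Map[Int, Set[Int]]): Boolean = {
  // One entry per (child, parent) edge, keeping only the parent index.
  val parentUses = deps.values.toList.flatMap(_.toList)
  // Narrow: every parent partition appears at most once across all children.
  parentUses.groupBy(identity).values.forall(_.size == 1)
}

// map / filter: child partition i reads parent partition i.
val mapDeps = Map(0 -> Set(0), 1 -> Set(1), 2 -> Set(2))

// A child may read several parents and still be narrow, as long as no parent is shared.
val mergeDeps = Map(0 -> Set(0, 1), 1 -> Set(2))

// groupByKey: every child partition reads every parent partition.
val groupByKeyDeps = Map(0 -> Set(0, 1, 2), 1 -> Set(0, 1, 2))
```

With these inputs, mapDeps and mergeDeps are narrow and groupByKeyDeps is wide, which is the distinction the interview question is probing.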
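RDD stage analysis:
Analysis: stages are cut at shuffle operators. Read the black parts of the figure as lost partitions: the black part in stage 2 is recomputed from the current stage, while the black part in stage 3 is recomputed starting from A.
Each action produces one job, a job consists of N stages, and each stage consists of N tasks.
Misconception: one operator produces one task? That is wrong.
Narrow dependencies run as a pipeline. In the figure, a map and a union are applied in one pass, straight through to the end: C --> D --> F
How MapReduce handles 1+1+1+1:
1+1 ==> 2 (every step has to be written out before the next one starts)
2+1 ==> 3
3+1 ==> 4
One partition corresponds to one task.
Summary: action ==> job ==> n stages ==> n tasks
Unlike the step-by-step 1+1+1+1 of MapReduce, narrow dependencies run as one pipeline all the way through.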
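Pipelining can be observed with a plain Scala Iterator, which is also the type that compute returns. The sketch below is an analogy rather than Spark code: the trace buffer is an instrument added for the example, and the chained map and filter stand for two narrow transformations inside one task.

```scala
import scala.collection.mutable.ArrayBuffer

val trace = ArrayBuffer.empty[String]
val partition = Iterator(1, 2, 3)

// Two narrow steps chained on the same iterator: nothing runs yet.
val pipelined = partition
  .map { x => trace += s"map($x)"; x + 1 }
  .filter { x => trace += s"filter($x)"; x % 2 == 0 }

val ranBeforeAction = trace.nonEmpty

// The "task": pulls each element through both steps before touching the next one.
val out = pipelined.toList
```

The trace interleaves map and filter per element instead of finishing all maps first, so no intermediate collection is materialized between the two steps.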
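4.3 Working with Key-Value Pairs
- Some operations are only available on RDDs of key-value pairs. The most common ones are distributed shuffle operations, such as grouping or aggregating the elements by a key.
- For an RDD of tuples such as (1,2), the key-value pair operations are available in the PairRDDFunctions class, which automatically wraps around an RDD of tuples.
- We never use PairRDDFunctions directly, so an implicit conversion must be happening behind the scenes.
reduceByKey is defined in the PairRDDFunctions class: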
class PairRDDFunctions[K, V](self: RDD[(K, V)])
(implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null)
extends Logging with Serializable {
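The class documentation says: Extra functions available on RDDs of (key, value) pairs through an implicit conversion.
To find that implicit conversion, look in RDD.scala.
The method in RDD.scala:
Intuition: turning a person into Superman. A person is passed in, and a new Superman is created with that person wrapped inside.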
implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
(implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V] = {
new PairRDDFunctions(rdd)
}
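The same wrapping pattern can be reproduced in plain Scala. Box, PairBoxFunctions and boxToPairBoxFunctions are hypothetical names that imitate RDD, PairRDDFunctions and rddToPairRDDFunctions. The sketch works on a local Seq and leaves out the ClassTag and Ordering parameters of the real method.

```scala
import scala.language.implicitConversions

// Hypothetical stand-ins: Box plays the role of RDD.
class Box[T](val items: Seq[T])

// Extra methods that only make sense for boxes of (key, value) tuples.
class PairBoxFunctions[K, V](self: Box[(K, V)]) {
  // Merge the values of each key with f, like reduceByKey but on a local Seq.
  def reduceByKey(f: (V, V) => V): Map[K, V] =
    self.items.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }
}

// The "person to Superman" step: wrap a Box of tuples so the extra methods become visible.
implicit def boxToPairBoxFunctions[K, V](box: Box[(K, V)]): PairBoxFunctions[K, V] =
  new PairBoxFunctions(box)

val words = new Box(Seq("hello", "world", "hello"))
val pairs = new Box(words.items.map((_, 1)))

// Box has no reduceByKey; the compiler inserts boxToPairBoxFunctions(pairs) here.
val counts = pairs.reduceByKey(_ + _)
```

Calling reduceByKey on words would not compile, because words is not a Box of tuples and the conversion does not apply to it.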
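Now look at the map operator:
It takes a function and returns an RDD.
lines.map(...) is therefore an RDD; when the function returns a tuple, the result is an RDD of tuples, which is what the implicit conversion applies to.
Test:
Set a breakpoint on implicit def rddToPairRDDFunctions in RDD.scala, run LogApp.scala in debug mode, and check whether execution stops at that position.
Interview question:
Which class is reduceByKey defined in?
PairRDDFunctions, in PairRDDFunctions.scala.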