Understanding Spark Core in Depth


1. Understanding Spark Stages

Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you’ll see this term used in the driver’s logs.

We know that every action triggers a job. Each job is split into smaller sets of tasks, and each set is called a stage, so one job can contain several stages. Stages depend on one another: for example, stage B can only start after stage A has finished.

scala> sc.parallelize(List(1,2,3,4,5,6,7,8)).map((_,1)).reduceByKey(_+_)
res0: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[3] at reduceByKey at <console>:25

The code above only builds transformations, so no job runs yet. As soon as an action such as collect is called on the resulting RDD, a job is submitted (job 0 in the Spark web UI). That job contains two stages: the first covers parallelize and map, the second covers reduceByKey. Because reduceByKey is a shuffle operator, a stage boundary is created at the shuffle, so everything before the shuffle forms one stage and everything after it forms another.
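
Continuing the REPL session above, here is a sketch of triggering the action: calling collect on res0 is what actually submits the job, and the shuffle introduced by reduceByKey splits it into two stages (the res number and the element order are illustrative):

scala> res0.collect
res1: Array[(Int, Int)] = Array((1,1), (2,1), (3,1), (4,1), (5,1), (6,1), (7,1), (8,1))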

2. cache/persist Explained

One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.

You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.

In addition, each persisted RDD can be stored using a different storage level, allowing you, for example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), replicate it across nodes. These levels are set by passing a StorageLevel object (Scala, Java, Python) to persist(). The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory). The full set of storage levels is listed further below.

scala> val lines = sc.textFile("hdfs://stg.bihdp01.hairongyi.local:8020/user/hdfs/test.txt")
lines: org.apache.spark.rdd.RDD[String] = hdfs://stg.bihdp01.hairongyi.local:8020/user/hdfs/test.txt MapPartitionsRDD[1] at textFile at <console>:24

scala> lines.cache
res1: lines.type = hdfs://stg.bihdp01.hairongyi.local:8020/user/hdfs/test.txt MapPartitionsRDD[1] at textFile at <console>:24

scala> lines.collect
res2: Array[String] = Array(hello	world	hello, world	welcome	hello)

After the collect action runs, the Storage tab of the Spark web UI shows that the data has been cached.
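
You can also confirm this from the REPL (a sketch; getStorageLevel is the standard RDD accessor for the assigned level, but the exact rendering of the output may differ across Spark versions):

scala> lines.getStorageLevel
res3: org.apache.spark.storage.StorageLevel = StorageLevel(memory, deserialized, 1 replicas)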

The source code behind cache and persist

  /**
   * Persist this RDD with the default storage level (`MEMORY_ONLY`).
   * cache simply delegates to persist().
   */
  def cache(): this.type = persist()

  /**
   * Persist this RDD with the default storage level (`MEMORY_ONLY`).
   * The no-argument persist() in turn calls the overloaded persist, passing StorageLevel.MEMORY_ONLY (memory-only storage).
   */
  def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
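
So, as the source shows, the two calls below are interchangeable (a minimal sketch, reusing the lines RDD from the example above):

import org.apache.spark.storage.StorageLevel

lines.cache()                             // uses the default level
// ...is equivalent to:
lines.persist(StorageLevel.MEMORY_ONLY)   // exactly what cache() does internally

Note that once a level has been assigned, persisting the same RDD with a different level throws an exception, so pick the level before the first action runs.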

StorageLevel Explained

// The constructor flags: whether to use disk, whether to use memory, whether to use off-heap storage, whether to keep data deserialized, and the replication factor
class StorageLevel private(
    private var _useDisk: Boolean,
    private var _useMemory: Boolean,
    private var _useOffHeap: Boolean,
    private var _deserialized: Boolean,
    private var _replication: Int = 1)

// Here you can see how the predefined storage levels are built on the constructor above. StorageLevel.MEMORY_ONLY uses no disk, uses memory, no off-heap storage, and keeps data deserialized, which is why cache() stores data in memory only.
object StorageLevel {
  val NONE = new StorageLevel(false, false, false, false)
  val DISK_ONLY = new StorageLevel(true, false, false, false)
  val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
  val MEMORY_ONLY = new StorageLevel(false, true, false, true)
  val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
  val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
  val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
  val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
  val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
  val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
  val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
  val OFF_HEAP = new StorageLevel(true, true, true, false, 1)
  // ... (remaining members omitted)
}
Storage Level: MEMORY_ONLY
Meaning: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.

Storage Level: MEMORY_AND_DISK
Meaning: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.

Storage Level: MEMORY_ONLY_SER (Java and Scala)
Meaning: Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
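
Picking one of these levels is just a matter of passing it to persist(). A minimal sketch (the variable name is illustrative; data.txt is the same sample file used later in this post):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("data.txt")
logs.persist(StorageLevel.MEMORY_AND_DISK_SER)  // serialized in memory, spill what doesn't fit to disk
logs.count()                                    // the first action materializes the cached blocks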

Spark’s storage levels are meant to provide different trade-offs between memory usage and CPU efficiency. We recommend going through the following process to select one:

Spark's storage levels trade memory usage against CPU efficiency; choose one as follows:

  • If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.

    In most cases the default level is the right choice.

  • If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access. (Java and Scala)

    If the default level cannot hold your data, use the MEMORY_ONLY_SER serialized level and pick a fast serialization library; it is much more space-efficient while still reasonably fast to access.

  • Don’t spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk.

    Do not spill to disk unless the functions that compute your dataset are expensive or they filter out a large amount of data; otherwise, recomputing a partition can be about as fast as reading it from disk.

  • Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.

    If you need fast fault recovery, use one of the replicated storage levels. Every storage level is fully fault-tolerant by recomputing lost data, but the replicated levels let you keep running tasks on the RDD without waiting for a lost partition to be recomputed.
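
Two quick sketches for the bullets above (the RDD names are illustrative, not from the original post):

import org.apache.spark.storage.StorageLevel

pairs.persist(StorageLevel.MEMORY_ONLY_SER)   // second bullet: serialized, more space-efficient
hotRdd.persist(StorageLevel.MEMORY_ONLY_2)    // last bullet: 2 replicas for fast failure recovery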

Removing Data

Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.

Spark automatically monitors cache usage on each node and evicts old data partitions in a least-recently-used (LRU) fashion. If you want to remove an RDD manually instead of waiting for it to be evicted from the cache, call the RDD.unpersist() method.
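
For example (a sketch, reusing the lines RDD cached earlier):

lines.unpersist()   // drops the cached blocks; the RDD can still be recomputed from its lineage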

3. RDD Dependencies

1. Narrow dependency

Each partition of the parent RDD is used by at most one partition of the child RDD.

2. Wide (shuffle) dependency

A partition of the parent RDD may be used by multiple partitions of the child RDD.

val rddB = rddA.groupBy(...)           // wide dependency: groupBy shuffles

val rddF = rddC.map(...).union(rddE)   // narrow dependencies: map and union do not shuffle

val rddG = rddB.join(rddF)             // wide dependency: join shuffles

A stage boundary is created at every shuffle operator, so groupBy and join (both shuffles) split this job into three stages.

action ==> job ==> n stages ==> n tasks per stage
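
A minimal runnable sketch of a lineage with the same shape (assuming a live SparkContext sc; the data is made up). toDebugString prints the lineage, and each shuffle boundary appears as a new indentation level, marking a stage:

val rddA = sc.parallelize(1 to 10).map(x => (x % 3, x))
val rddB = rddA.groupByKey()                                   // wide: shuffles by key
val rddF = sc.parallelize(1 to 10).map(x => (x % 3, x * 10))   // narrow: map only
val rddG = rddB.join(rddF)                                     // wide: another shuffle

println(rddG.toDebugString)   // shuffle boundaries show up as indented "+-" sections
rddG.collect()                // the action: one job, three stages in this case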

4. Key-Value Pairs

While most Spark operations work on RDDs containing any type of objects, a few special operations are only available on RDDs of key-value pairs. The most common ones are distributed “shuffle” operations, such as grouping or aggregating the elements by a key.

In Scala, these operations are automatically available on RDDs containing Tuple2 objects (the built-in tuples in the language, created by simply writing (a, b)). The key-value pair operations are available in the PairRDDFunctions class, which automatically wraps around an RDD of tuples.

In other words: most Spark operations work on RDDs of any object type, but a handful of special operations, most notably the distributed shuffle operations such as grouping or aggregating values by key, exist only on RDDs of key-value pairs.

In Scala, these operations appear automatically on RDDs of Tuple2 (the language's built-in tuples, created simply by writing (a, b)); they are defined in the PairRDDFunctions class, which implicitly wraps an RDD of tuples.

val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)

Which class does reduceByKey come from?

Under the hood, reduceByKey relies on an implicit conversion that wraps the plain RDD in a PairRDDFunctions.

/**
 * Extra functions available on RDDs of (key, value) pairs through an implicit conversion.
 * This is where reduceByKey and the other key-value operations are defined.
 */
class PairRDDFunctions[K, V](self: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null)
  extends Logging with Serializable {

And in the RDD source code, the implicit conversion itself:

// The implicit conversion that upgrades a plain RDD[(K, V)] to PairRDDFunctions
implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V] = {
    new PairRDDFunctions(rdd)
  }