Spark Basics 2: Spark Core Architecture

Spark Core Architecture

I. Workflow

1. WordCount workflow

    val conf = new SparkConf().setMaster("local").setAppName("WordCount")
    val context = new SparkContext(conf)
    val rdd1 = context.textFile("input")       // read every file under the input directory
    val rdd2 = rdd1.flatMap(_.split(" "))      // split each line into words
    val rdd3 = rdd2.map((_, 1))                // map each word to a (word, 1) pair
    val rdd4 = rdd3.reduceByKey(_ + _)         // aggregate counts, keeping the upstream partition count
    val rdd5 = rdd3.reduceByKey(_ + _, 3)      // same aggregation, but with 3 partitions on the reduce side
    val result1 = rdd4.collect()               // action => job1
    val result2 = rdd5.collect()               // action => job2
    println(result1.mkString("|"))
    println(result2.mkString("||"))
    context.stop()
  1. The input directory contains two files, a.txt and b.txt.

  2. The driver submits a job each time an action operator is called (collect above is called twice, so there are two jobs).

  3. The DAGScheduler splits each job into stages.

  4. Taking job1 as an example, it is split into two stages: stage0 (the map side) and stage1 (the reduce side).

  5. stage0 contains three RDDs: rdd1, rdd2 and rdd3.

  6. stage0 has two tasks. The task count is determined by the partition (split) count of the last RDD in the stage; since input contains two files, there are two splits.

  7. The task count of stage1 can be set explicitly, as rdd5 above does with 3; if it is not set, it is the same as the task count of stage0.

  8. Final result: 2 jobs, 4 stages, 9 tasks (see the breakdown below and the verification sketch that follows it).

    job1:
        stage0
            task0
            task1
        stage1
            task0
            task1

    job2:
        stage0
            task0
            task1
        stage1
            task0
            task1
            task2

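
The task counts above can be checked from the partition counts of the RDDs. A minimal sketch, continuing the WordCount code and assuming the two-file input directory:

    println(rdd3.getNumPartitions) // 2 => stage0 of both jobs runs 2 tasks
    println(rdd4.getNumPartitions) // 2 => stage1 of job1 runs 2 tasks
    println(rdd5.getNumPartitions) // 3 => stage1 of job2 runs 3 tasks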

2. RDD

2.1 Source code

As the source code below shows, an RDD (Resilient Distributed Dataset) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel.

/**
 * A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
 * partitioned collection of elements that can be operated on in parallel. This class contains the
 * basic operations available on all RDDs, such as `map`, `filter`, and `persist`. In addition,
 * [[org.apache.spark.rdd.PairRDDFunctions]] contains operations available only on RDDs of key-value
 * pairs, such as `groupByKey` and `join`;
 * [[org.apache.spark.rdd.DoubleRDDFunctions]] contains operations available only on RDDs of
 * Doubles; and
 * [[org.apache.spark.rdd.SequenceFileRDDFunctions]] contains operations available on RDDs that
 * can be saved as SequenceFiles.
 * All operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)])
 * through implicit.
 *
 * Internally, each RDD is characterized by five main properties:
 *
 *  - A list of partitions
 *  - A function for computing each split
 *  - A list of dependencies on other RDDs
 *  - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
 *  - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
 *    an HDFS file)
 *
 * All of the scheduling and execution in Spark is done based on these methods, allowing each RDD
 * to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for
 * reading data from a new storage system) by overriding these functions. Please refer to the
 * <a href="http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf">Spark paper</a>
 * for more details on RDD internals.
 */
abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {
    ....
}
2.2 Key properties

Every RDD is characterized by five main properties:

  1. A list of partitions

    /**
       * Implemented by subclasses to return the set of partitions in this RDD. This method will only
       * be called once, so it is safe to implement a time-consuming computation in it.
       *
       * The partitions in this array must satisfy the following property:
       *   `rdd.partitions.zipWithIndex.forall { case (partition, index) => partition.index == index }`
       */
      protected def getPartitions: Array[Partition]
    
  2. A function for computing each split

      /**
       * :: DeveloperApi ::
       * Implemented by subclasses to compute a given partition.
       */
      @DeveloperApi
      def compute(split: Partition, context: TaskContext): Iterator[T]
    
  3. A list of dependencies on other RDDs

      /**
       * Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
       * be called once, so it is safe to implement a time-consuming computation in it.
       */
      protected def getDependencies: Seq[Dependency[_]] = deps
    
  4. An optional partitioner for key-value RDDs (e.g. a key-value RDD may be hash-partitioned with a HashPartitioner)

      /** Optionally overridden by subclasses to specify how they are partitioned. */
      @transient val partitioner: Option[Partitioner] = None
    
  5. An optional list of preferred locations to compute each split on (e.g. the block locations for an HDFS file)

      /**
       * Optionally overridden by subclasses to specify placement preferences.
       */
      protected def getPreferredLocations(split: Partition): Seq[String] = Nil
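
A concrete subclass has to supply at least the first two of these properties. Below is a minimal sketch of a custom RDD (the class names SimpleRangeRDD and RangePartition are illustrative, not taken from Spark) that overrides the partition list and the compute function and keeps the defaults for the other three:

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // one partition covering the element range [from, until)
    class RangePartition(val index: Int, val from: Int, val until: Int) extends Partition

    class SimpleRangeRDD(sc: SparkContext, n: Int, numSlices: Int)
      extends RDD[Int](sc, Nil) {                  // Nil: no parent dependencies (property 3)

      // property 1: the list of partitions
      override protected def getPartitions: Array[Partition] =
        Array.tabulate[Partition](numSlices) { i =>
          new RangePartition(i, i * n / numSlices, (i + 1) * n / numSlices)
        }

      // property 2: how to compute one split
      override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
        val p = split.asInstanceOf[RangePartition]
        (p.from until p.until).iterator
      }

      // properties 4 and 5 (partitioner, getPreferredLocations) keep their defaults (None / Nil)
    }

With the context from the WordCount example, new SimpleRangeRDD(context, 10, 3).collect() returns 0 to 9, the three partitions holding 3, 3 and 4 elements.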
    
2.3 Creating an RDD
  1. By applying a transformation operator to another RDD
  2. By new:
    1. new an RDD subclass directly
    2. Create it via a SparkContext method:
      1. makeRDD
      2. textFile
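
A brief sketch of the creation paths listed above, reusing the context from the WordCount example (local mode assumed):

    val fromFile       = context.textFile("input")            // created via a SparkContext method
    val fromCollection = context.makeRDD(List(1, 2, 3, 4), 2) // makeRDD with an explicit slice count
    val transformed    = fromCollection.map(_ * 2)            // created from another RDD via a transformation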

3. Parallelism and partitions

When creating an RDD:

  1. textFile returns a HadoopRDD (strictly speaking, a MapPartitionsRDD whose parent is a HadoopRDD)
  2. makeRDD (which internally calls the parallelize() method) returns a ParallelCollectionRDD
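
Which concrete RDD class comes back can be checked with toDebugString, which prints the lineage (a rough sketch, reusing the context above):

    println(context.makeRDD(1 to 10).toDebugString)  // shows a ParallelCollectionRDD
    println(context.textFile("input").toDebugString) // shows a MapPartitionsRDD on top of a HadoopRDD
    println(context.defaultParallelism)              // the n in local[n] unless spark.default.parallelism is set
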
3.1 HadoopRDD

From the source code, the partition count is computed as follows (a worked example follows the list):

  1. minPartitions = min(defaultParallelism, 2)
    1. defaultParallelism = the value of spark.default.parallelism, if set
    2. if spark.default.parallelism is not set, it is the n in local[n]
  2. The InputFormat (TextInputFormat by default; from this point on these are Hadoop classes, using the old mapred API) computes all input splits
    1. If the input is empty, it becomes a single split
    2. If it is not empty and splittable (isSplitable), the files are cut by splitSize
    3. splitSize = Math.max(minSize, Math.min(goalSize, blockSize))
      1. minSize is 1
      2. blockSize is the file's block size, typically 128 MB
      3. goalSize is the target size: totalSize / (numSplits == 0 ? 1 : numSplits)
        1. totalSize is the total size of the input files
        2. numSplits = minPartitions
    4. Files are cut into chunks of splitSize; the last chunk may be up to splitSize * 1.1
  3. Depending on whether empty-split filtering is enabled, empty splits are filtered out
  4. If a split is a FileSplit, there is some additional handling
  5. Final partition count = number of splits
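
A worked sketch of the splitSize arithmetic above (the 300 MB input size is hypothetical; this reproduces only the formula, not the Hadoop code itself):

    // assume a single 300 MB input file and minPartitions = 2
    val totalSize = 300L * 1024 * 1024
    val numSplits = 2                                                  // numSplits = minPartitions
    val blockSize = 128L * 1024 * 1024                                 // typical HDFS block size
    val minSize   = 1L

    val goalSize  = totalSize / (if (numSplits == 0) 1 else numSplits) // 150 MB
    val splitSize = math.max(minSize, math.min(goalSize, blockSize))   // capped at the 128 MB block size

    // the file is cut into splitSize chunks while the remainder exceeds splitSize * 1.1:
    // 300 MB => splits of 128 MB, 128 MB and 44 MB, i.e. 3 splits = 3 partitions
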
3.2 ParallelCollectionRDD

Partition count:

  1. The number of slices specified by the user
  2. Otherwise:
    1. defaultParallelism = the value of spark.default.parallelism, if set
    2. if spark.default.parallelism is not set, it is the n in local[n]

Partitioning strategy:

  1. Elements of the Seq are assigned to partitions by index
  2. The split is roughly even; if the elements cannot be divided evenly, the later partitions receive the extra elements (see the usage sketch after the source code below)
/**
 * Slice a collection into numSlices sub-collections. One extra thing we do here is to treat Range
 * collections specially, encoding the slices as other Ranges to minimize memory cost. This makes
 * it efficient to run Spark over RDDs representing large sets of numbers. And if the collection
 * is an inclusive Range, we use inclusive range for the last slice.
 */
def slice[T: ClassTag](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
  if (numSlices < 1) {
    throw new IllegalArgumentException("Positive number of partitions required")
  }
  // Sequences need to be sliced at the same set of index positions for operations
  // like RDD.zip() to behave as expected
  def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
    (0 until numSlices).iterator.map { i =>
      val start = ((i * length) / numSlices).toInt
      val end = (((i + 1) * length) / numSlices).toInt
      (start, end)
    }
  }
  seq match {
    case r: Range =>
      positions(r.length, numSlices).zipWithIndex.map { case ((start, end), index) =>
        // If the range is inclusive, use inclusive range for the last slice
        if (r.isInclusive && index == numSlices - 1) {
          new Range.Inclusive(r.start + start * r.step, r.end, r.step)
        }
        else {
          new Range(r.start + start * r.step, r.start + end * r.step, r.step)
        }
      }.toSeq.asInstanceOf[Seq[Seq[T]]]
    case nr: NumericRange[_] =>
      // For ranges of Long, Double, BigInteger, etc
      val slices = new ArrayBuffer[Seq[T]](numSlices)
      var r = nr
      for ((start, end) <- positions(nr.length, numSlices)) {
        val sliceSize = end - start
        slices += r.take(sliceSize).asInstanceOf[Seq[T]]
        r = r.drop(sliceSize)
      }
      slices
    case _ =>
      val array = seq.toArray // To prevent O(n^2) operations for List etc
      positions(array.length, numSlices).map { case (start, end) =>
          array.slice(start, end).toSeq
      }.toSeq
  }
}
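
A rough usage sketch of this slicing behaviour (local mode assumed; glom() turns each partition into an array so the distribution becomes visible):

    val rdd = context.makeRDD(List(1, 2, 3, 4, 5), 2)
    rdd.glom().collect().zipWithIndex.foreach { case (part, i) =>
      println(s"partition $i: ${part.mkString(",")}")  // partition 0: 1,2  /  partition 1: 3,4,5
    }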

II. Tips

  1. In a class under the src/main directory, right-click a method and use Go To ==> Test to quickly create a test class
  2. Search for Live Templates in the settings to configure code templates