Spark Basics 2 - Spark Core Architecture
Spark Core Architecture
I. Flow
1. WordCount flow
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local").setAppName("WordCount")
val context = new SparkContext(conf)
val rdd1 = context.textFile("input")
val rdd2 = rdd1.flatMap(_.split(" "))
val rdd3 = rdd2.map((_, 1))
val rdd4 = rdd3.reduceByKey(_ + _)    // keeps the parent's partition count
val rdd5 = rdd3.reduceByKey(_ + _, 3) // explicitly requests 3 partitions
val result1 = rdd4.collect() // action -> job1
val result2 = rdd5.collect() // action -> job2
println(result1.mkString("|"))
println(result2.mkString("||"))
context.stop()
- The input directory contains two files, a.txt and b.txt.
- The driver submits a job each time an action operator is called (collect above is called twice, so there are two jobs).
- The DAGScheduler divides each job into stages.
- Taking job1 as an example, it is split into two stages: stage0 (map side) and stage1 (reduce side).
- stage0 contains three RDDs: rdd1, rdd2, and rdd3.
- stage0 has two tasks; the task count is determined by the number of partitions (splits) of the last RDD in the stage, and since the input directory holds two files there are two splits.
- The number of tasks in stage1 can be set manually, as with rdd5 above which requests three; if it is not set, stage1 has the same number of tasks as stage0 (the snippet after the breakdown below shows how to verify the partition counts).
- Final tally: 2 jobs, 4 stages, 9 tasks.
job1:
  stage0
    task0
    task1
  stage1
    task0
    task1
job2:
  stage0
    task0
    task1
  stage1
    task0
    task1
    task2
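A quick way to sanity-check these counts is to print the partition numbers from within the same application. The snippet below is a minimal sketch reusing rdd4 and rdd5 from the WordCount code above; the expected values assume the two-file input directory described earlier.

// Run inside the WordCount application above.
println(rdd4.getNumPartitions) // 2 -> inherited from rdd3's partition count
println(rdd5.getNumPartitions) // 3 -> set explicitly in reduceByKey(_ + _, 3)
// toDebugString prints the lineage; the shuffle boundary is where the stage split happens
println(rdd4.toDebugString)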
2. RDD
2.1 Source code
As the source below shows, an RDD (Resilient Distributed Dataset) is the basic abstraction in Spark: an immutable, partitioned collection of elements that can be operated on in parallel.
/**
* A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
* partitioned collection of elements that can be operated on in parallel. This class contains the
* basic operations available on all RDDs, such as `map`, `filter`, and `persist`. In addition,
* [[org.apache.spark.rdd.PairRDDFunctions]] contains operations available only on RDDs of key-value
* pairs, such as `groupByKey` and `join`;
* [[org.apache.spark.rdd.DoubleRDDFunctions]] contains operations available only on RDDs of
* Doubles; and
* [[org.apache.spark.rdd.SequenceFileRDDFunctions]] contains operations available on RDDs that
* can be saved as SequenceFiles.
* All operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)])
* through implicit.
*
* Internally, each RDD is characterized by five main properties:
*
* - A list of partitions
* - A function for computing each split
* - A list of dependencies on other RDDs
* - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
* - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
* an HDFS file)
*
* All of the scheduling and execution in Spark is done based on these methods, allowing each RDD
* to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for
* reading data from a new storage system) by overriding these functions. Please refer to the
* <a href="http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf">Spark paper</a>
* for more details on RDD internals.
*/
abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {
  ....
}
2.2 Key properties
Every RDD is characterized by five main properties (the snippet after this list shows how they surface in the public API):
- A list of partitions
/**
 * Implemented by subclasses to return the set of partitions in this RDD. This method will only
 * be called once, so it is safe to implement a time-consuming computation in it.
 *
 * The partitions in this array must satisfy the following property:
 *   `rdd.partitions.zipWithIndex.forall { case (partition, index) => partition.index == index }`
 */
protected def getPartitions: Array[Partition]
- A function for computing each split
/**
 * :: DeveloperApi ::
 * Implemented by subclasses to compute a given partition.
 */
@DeveloperApi
def compute(split: Partition, context: TaskContext): Iterator[T]
- A list of dependencies on other RDDs
/**
 * Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
 * be called once, so it is safe to implement a time-consuming computation in it.
 */
protected def getDependencies: Seq[Dependency[_]] = deps
- A partitioner (optional): only key-value RDDs have one, e.g. a HashPartitioner for a hash-partitioned RDD
/** Optionally overridden by subclasses to specify how they are partitioned. */
@transient val partitioner: Option[Partitioner] = None
- Preferred locations for computing each split (e.g. block locations for an HDFS file)
/**
 * Optionally overridden by subclasses to specify placement preferences.
 */
protected def getPreferredLocations(split: Partition): Seq[String] = Nil
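These five properties also surface through RDD's public API. The snippet below is illustrative only, reusing rdd4 from the WordCount example in section 1; the commented values are what that example would be expected to print.

// Inspecting the five properties on rdd4 from the WordCount example (values are assumptions).
println(rdd4.partitions.length)   // partition list -> 2 partitions
println(rdd4.dependencies)        // parent dependencies -> a ShuffleDependency on rdd3
println(rdd4.partitioner)         // Some(HashPartitioner) after reduceByKey
rdd4.partitions.foreach(p => println(rdd4.preferredLocations(p))) // preferred locations per split
// compute() is not called by user code; the scheduler invokes it when a task runs on a partition.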
2.3 Creating an RDD
- By calling a transformation operator on an existing RDD (all three creation paths are sketched below)
- With new
  - Calling new on an RDD subclass directly
  - Calling a SparkContext method that creates one for you
    - makeRDD
    - textFile
    - …
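A minimal sketch of these creation paths, assuming a SparkContext named sc is already available (for example the context created in section 1); the values are illustrative.

// Assumes an existing SparkContext `sc`.
val fromCollection = sc.makeRDD(Seq(1, 2, 3, 4))   // SparkContext method over a local collection
val fromFile       = sc.textFile("input")          // SparkContext method over a file or directory
val fromTransform  = fromCollection.map(_ * 2)     // transformation on an existing RDD
// Calling `new` on an RDD subclass directly is rare in application code and is
// mostly done when implementing a custom data source.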
3. Parallelism and partitioning
When creating an RDD:
- textFile creates a HadoopRDD under the hood (the RDD it returns is a MapPartitionsRDD that extracts the text values from it)
- makeRDD (which calls parallelize() internally) returns a ParallelCollectionRDD
3.1 HadoopRDD
From the source, the partition count is computed as follows (a worked sketch of the splitSize math follows this list):
- minPartitions = min(defaultParallelism, 2)
  - defaultParallelism = the value of spark.default.parallelism
  - if spark.default.parallelism is not set, it is the n in local[n]
- The InputFormat (TextInputFormat by default; from here on these are Hadoop classes, using the old mapred API) computes the input splits
  - an empty file becomes a single split
  - a non-empty, splittable (isSplitable) file is cut as follows:
    - splitSize = Math.max(minSize, Math.min(goalSize, blockSize))
      - minSize is 1
      - blockSize is the file's block size, typically 128 MB
      - goalSize = totalSize / (numSplits == 0 ? 1 : numSplits)
        - totalSize is the total size of the input files
        - numSplits = minPartitions
    - the file is cut into chunks of splitSize; the last chunk may be up to splitSize * 1.1
- Depending on whether empty splits are ignored, empty splits are filtered out
- FileSplit splits get some additional handling
- Final partition count = number of splits
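The split-size math above can be sketched as a small self-contained program; SplitSizeSketch and the example numbers are illustrative assumptions, not the actual Hadoop code.

// Simplified sketch of the old mapred FileInputFormat split-size math (illustrative).
object SplitSizeSketch {
  def splitSizeOf(totalSize: Long, numSplits: Int, blockSize: Long, minSize: Long = 1L): Long = {
    val goalSize = totalSize / (if (numSplits == 0) 1 else numSplits) // goalSize = totalSize / numSplits
    Math.max(minSize, Math.min(goalSize, blockSize))                  // splitSize = max(minSize, min(goalSize, blockSize))
  }

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024
    // e.g. 300 MB of input text, minPartitions = 2, 128 MB blocks
    val size = splitSizeOf(totalSize = 300 * mb, numSplits = 2, blockSize = 128 * mb)
    // goalSize = 150 MB, so splitSize = min(150 MB, 128 MB) = 128 MB -> 3 splits (128 + 128 + 44 MB)
    println(s"splitSize = ${size / mb} MB")
  }
}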
3.2 ParallelCollectionRDD
Partition count:
- The number of slices the user specifies, if given
- Otherwise:
  - defaultParallelism = the value of spark.default.parallelism
  - if spark.default.parallelism is not set, it is the n in local[n]
Partitioning strategy:
- Elements of the Seq are assigned to partitions by index
- The split is roughly even; when the elements cannot be divided evenly, the later partitions receive the extra elements (a usage example follows the slice source below)
/**
* Slice a collection into numSlices sub-collections. One extra thing we do here is to treat Range
* collections specially, encoding the slices as other Ranges to minimize memory cost. This makes
* it efficient to run Spark over RDDs representing large sets of numbers. And if the collection
* is an inclusive Range, we use inclusive range for the last slice.
*/
def slice[T: ClassTag](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
  if (numSlices < 1) {
    throw new IllegalArgumentException("Positive number of partitions required")
  }
  // Sequences need to be sliced at the same set of index positions for operations
  // like RDD.zip() to behave as expected
  def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
    (0 until numSlices).iterator.map { i =>
      val start = ((i * length) / numSlices).toInt
      val end = (((i + 1) * length) / numSlices).toInt
      (start, end)
    }
  }
  seq match {
    case r: Range =>
      positions(r.length, numSlices).zipWithIndex.map { case ((start, end), index) =>
        // If the range is inclusive, use inclusive range for the last slice
        if (r.isInclusive && index == numSlices - 1) {
          new Range.Inclusive(r.start + start * r.step, r.end, r.step)
        } else {
          new Range(r.start + start * r.step, r.start + end * r.step, r.step)
        }
      }.toSeq.asInstanceOf[Seq[Seq[T]]]
    case nr: NumericRange[_] =>
      // For ranges of Long, Double, BigInteger, etc
      val slices = new ArrayBuffer[Seq[T]](numSlices)
      var r = nr
      for ((start, end) <- positions(nr.length, numSlices)) {
        val sliceSize = end - start
        slices += r.take(sliceSize).asInstanceOf[Seq[T]]
        r = r.drop(sliceSize)
      }
      slices
    case _ =>
      val array = seq.toArray // To prevent O(n^2) operations for List etc
      positions(array.length, numSlices).map { case (start, end) =>
        array.slice(start, end).toSeq
      }.toSeq
  }
}
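For example, in spark-shell (where sc is the local SparkContext), glom() shows how the slice logic above distributes elements; the commented output is what the positions math predicts for 10 elements over 3 slices.

// Inspect element distribution per partition (assumes a SparkContext `sc`, e.g. spark-shell).
val parts = sc.makeRDD(1 to 10, numSlices = 3).glom().collect()
parts.zipWithIndex.foreach { case (p, i) =>
  println(s"partition $i: ${p.mkString(",")}")
}
// partition 0: 1,2,3
// partition 1: 4,5,6
// partition 2: 7,8,9,10   <- the last partition gets the extra element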
II. Tips
- Right-click a method of a class under the src/main directory and choose Go To ==> Test to quickly create a test class.
- Search for Live Templates in Settings to configure code templates.