Spark Basics 2 - Spark Core Architecture
Spark Core Architecture
I. Flow
1. WordCount flow
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local").setAppName("WordCount")
val context = new SparkContext(conf)
val rdd1 = context.textFile("input")
val rdd2 = rdd1.flatMap(_.split(" "))
val rdd3 = rdd2.map((_, 1))
val rdd4 = rdd3.reduceByKey(_ + _)    // keeps the parent's partition count
val rdd5 = rdd3.reduceByKey(_ + _, 3) // explicitly requests 3 partitions
val result1 = rdd4.collect() // action -> job1
val result2 = rdd5.collect() // action -> job2
println(result1.mkString("|"))
println(result2.mkString("||"))
context.stop()
- The input directory contains two files, a.txt and b.txt.
- The driver submits a job each time an action operator is called (collect above is called twice, so there are two jobs).
- The DAGScheduler divides each job into stages.
- Taking job1 as an example, it is split into two stages: stage0 (map side) and stage1 (reduce side).
- stage0 contains three RDDs: rdd1, rdd2, and rdd3.
- stage0 has two tasks; the task count is determined by the number of partitions (splits) of the last RDD in the stage, and since the input directory holds two files there are two splits.
- The number of tasks in stage1 can be set manually, as with rdd5 above which requests three; if it is not set, stage1 has the same number of tasks as stage0 (the snippet after the breakdown below shows how to verify the partition counts).
- Final tally: 2 jobs, 4 stages, 9 tasks.
job1:
  stage0
    task0
    task1
  stage1
    task0
    task1
job2:
  stage0
    task0
    task1
  stage1
    task0
    task1
    task2
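A quick way to sanity-check these counts is to print the partition numbers from within the same application. The snippet below is a minimal sketch reusing rdd4 and rdd5 from the WordCount code above; the expected values assume the two-file input directory described earlier.

// Run inside the WordCount application above.
println(rdd4.getNumPartitions) // 2 -> inherited from rdd3's partition count
println(rdd5.getNumPartitions) // 3 -> set explicitly in reduceByKey(_ + _, 3)
// toDebugString prints the lineage; the shuffle boundary is where the stage split happens
println(rdd4.toDebugString)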
2. RDD
2.1 Source code
As the source below shows, an RDD (Resilient Distributed Dataset) is the basic abstraction in Spark: an immutable, partitioned collection of elements that can be operated on in parallel.
/**
* A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
* partitioned collection of elements that can be operated on in parallel. This class contains the
* basic operations available on all RDDs, such as `map`, `filter`, and `persist`. In addition,
* [[org.apache.spark.rdd.PairRDDFunctions]] contains operations available only on RDDs of key-value
* pairs, such as `groupByKey` and `join`;
* [[org.apache.spark.rdd.DoubleRDDFunctions]] contains operations available only on RDDs of
* Doubles; and
* [[org.apache.spark.rdd.SequenceFileRDDFunctions]] contains operations available on RDDs that
* can be saved as SequenceFiles.
* All operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)])
* through implicit.
*
* Internally, each RDD is characterized by five main properties:
*
* - A list of partitions
* - A function for computing each split
* - A list of dependencies on other RDDs
* - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
* - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
* an HDFS file)
*
* All of the scheduling and execution in Spark is done based on these methods, allowing each RDD
* to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for
* reading data from a new storage system) by overriding these functions. Please refer to the
* <a href="http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf">Spark paper</a>
* for more details on RDD internals.
*/
abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {
  ....
}
2.2 Key properties
Every RDD is characterized by five main properties (the snippet after this list shows how they surface in the public API):
- A list of partitions
/**
 * Implemented by subclasses to return the set of partitions in this RDD. This method will only
 * be called once, so it is safe to implement a time-consuming computation in it.
 *
 * The partitions in this array must satisfy the following property:
 *   `rdd.partitions.zipWithIndex.forall { case (partition, index) => partition.index == index }`
 */
protected def getPartitions: Array[Partition]
- A function for computing each split
/**
 * :: DeveloperApi ::
 * Implemented by subclasses to compute a given partition.
 */
@DeveloperApi
def compute(split: Partition, context: TaskContext): Iterator[T]
- A list of dependencies on other RDDs
/**
 * Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
 * be called once, so it is safe to implement a time-consuming computation in it.
 */
protected def getDependencies: Seq[Dependency[_]] = deps
- A partitioner (optional): only key-value RDDs have one, e.g. a HashPartitioner for a hash-partitioned RDD
/** Optionally overridden by subclasses to specify how they are partitioned. */
@transient val partitioner: Option[Partitioner] = None
- Preferred locations for computing each split (e.g. block locations for an HDFS file)
/**
 * Optionally overridden by subclasses to specify placement preferences.
 */
protected def getPreferredLocations(split: Partition): Seq[String] = Nil
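These five properties also surface through RDD's public API. The snippet below is illustrative only, reusing rdd4 from the WordCount example in section 1; the commented values are what that example would be expected to print.

// Inspecting the five properties on rdd4 from the WordCount example (values are assumptions).
println(rdd4.partitions.length)   // partition list -> 2 partitions
println(rdd4.dependencies)        // parent dependencies -> a ShuffleDependency on rdd3
println(rdd4.partitioner)         // Some(HashPartitioner) after reduceByKey
rdd4.partitions.foreach(p => println(rdd4.preferredLocations(p))) // preferred locations per split
// compute() is not called by user code; the scheduler invokes it when a task runs on a partition.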
2.3 Creating an RDD
- By calling a transformation operator on an existing RDD (all three creation paths are sketched below)
- With new
  - Calling new on an RDD subclass directly
  - Calling a SparkContext method that creates one for you
    - makeRDD
    - textFile
    - …
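A minimal sketch of these creation paths, assuming a SparkContext named sc is already available (for example the context created in section 1); the values are illustrative.

// Assumes an existing SparkContext `sc`.
val fromCollection = sc.makeRDD(Seq(1, 2, 3, 4))   // SparkContext method over a local collection
val fromFile       = sc.textFile("input")          // SparkContext method over a file or directory
val fromTransform  = fromCollection.map(_ * 2)     // transformation on an existing RDD
// Calling `new` on an RDD subclass directly is rare in application code and is
// mostly done when implementing a custom data source.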
3. Parallelism and partitioning
When creating an RDD:
- textFile creates a HadoopRDD under the hood (the RDD it returns is a MapPartitionsRDD that extracts the text values from it)
- makeRDD (which calls parallelize() internally) returns a ParallelCollectionRDD
3.1 HadoopRDD
From the source, the partition count is computed as follows (a worked sketch of the splitSize math follows this list):
- minPartitions = min(defaultParallelism, 2)
  - defaultParallelism = the value of spark.default.parallelism
  - if spark.default.parallelism is not set, it is the n in local[n]
- The InputFormat (TextInputFormat by default; from here on these are Hadoop classes, using the old mapred API) computes the input splits
  - an empty file becomes a single split
  - a non-empty, splittable (isSplitable) file is cut as follows:
    - splitSize = Math.max(minSize, Math.min(goalSize, blockSize))
      - minSize is 1
      - blockSize is the file's block size, typically 128 MB
      - goalSize = totalSize / (numSplits == 0 ? 1 : numSplits)
        - totalSize is the total size of the input files
        - numSplits = minPartitions
    - the file is cut into chunks of splitSize; the last chunk may be up to splitSize * 1.1
- Depending on whether empty splits are ignored, empty splits are filtered out
- FileSplit splits get some additional handling
- Final partition count = number of splits
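The split-size math above can be sketched as a small self-contained program; SplitSizeSketch and the example numbers are illustrative assumptions, not the actual Hadoop code.

// Simplified sketch of the old mapred FileInputFormat split-size math (illustrative).
object SplitSizeSketch {
  def splitSizeOf(totalSize: Long, numSplits: Int, blockSize: Long, minSize: Long = 1L): Long = {
    val goalSize = totalSize / (if (numSplits == 0) 1 else numSplits) // goalSize = totalSize / numSplits
    Math.max(minSize, Math.min(goalSize, blockSize))                  // splitSize = max(minSize, min(goalSize, blockSize))
  }

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024
    // e.g. 300 MB of input text, minPartitions = 2, 128 MB blocks
    val size = splitSizeOf(totalSize = 300 * mb, numSplits = 2, blockSize = 128 * mb)
    // goalSize = 150 MB, so splitSize = min(150 MB, 128 MB) = 128 MB -> 3 splits (128 + 128 + 44 MB)
    println(s"splitSize = ${size / mb} MB")
  }
}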
3.2 ParallelCollectionRDD
Partition count:
- The number of slices the user specifies, if given
- Otherwise:
  - defaultParallelism = the value of spark.default.parallelism
  - if spark.default.parallelism is not set, it is the n in local[n]
Partitioning strategy:
- Elements of the Seq are assigned to partitions by index
- The split is roughly even; when the elements cannot be divided evenly, the later partitions receive the extra elements (a usage example follows the slice source below)
/**
* Slice a collection into numSlices sub-collections. One extra thing we do here is to treat Range
* collections specially, encoding the slices as other Ranges to minimize memory cost. This makes
* it efficient to run Spark over RDDs representing large sets of numbers. And if the collection
* is an inclusive Range, we use inclusive range for the last slice.
*/
def slice[T: ClassTag](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
  if (numSlices < 1) {
    throw new IllegalArgumentException("Positive number of partitions required")
  }
  // Sequences need to be sliced at the same set of index positions for operations
  // like RDD.zip() to behave as expected
  def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
    (0 until numSlices).iterator.map { i =>
      val start = ((i * length) / numSlices).toInt
      val end = (((i + 1) * length) / numSlices).toInt
      (start, end)
    }
  }
  seq match {
    case r: Range =>
      positions(r.length, numSlices).zipWithIndex.map { case ((start, end), index) =>
        // If the range is inclusive, use inclusive range for the last slice
        if (r.isInclusive && index == numSlices - 1) {
          new Range.Inclusive(r.start + start * r.step, r.end, r.step)
        } else {
          new Range(r.start + start * r.step, r.start + end * r.step, r.step)
        }
      }.toSeq.asInstanceOf[Seq[Seq[T]]]
    case nr: NumericRange[_] =>
      // For ranges of Long, Double, BigInteger, etc
      val slices = new ArrayBuffer[Seq[T]](numSlices)
      var r = nr
      for ((start, end) <- positions(nr.length, numSlices)) {
        val sliceSize = end - start
        slices += r.take(sliceSize).asInstanceOf[Seq[T]]
        r = r.drop(sliceSize)
      }
      slices
    case _ =>
      val array = seq.toArray // To prevent O(n^2) operations for List etc
      positions(array.length, numSlices).map { case (start, end) =>
        array.slice(start, end).toSeq
      }.toSeq
  }
}
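For example, in spark-shell (where sc is the local SparkContext), glom() shows how the slice logic above distributes elements; the commented output is what the positions math predicts for 10 elements over 3 slices.

// Inspect element distribution per partition (assumes a SparkContext `sc`, e.g. spark-shell).
val parts = sc.makeRDD(1 to 10, numSlices = 3).glom().collect()
parts.zipWithIndex.foreach { case (p, i) =>
  println(s"partition $i: ${p.mkString(",")}")
}
// partition 0: 1,2,3
// partition 1: 4,5,6
// partition 2: 7,8,9,10   <- the last partition gets the extra element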
II. Tips
- Right-click a method of a class under the src/main directory and choose Go To ==> Test to quickly create a test class.
- Search for Live Templates in Settings to configure code templates.