I. Action Operators
Executing an action operator triggers the execution of the whole job; operators such as collect gather the data of every partition into the driver's memory.
1. Common action operators
val data: RDD[Int] = context.makeRDD(List(1,2,3,4), 2)
// Number of elements in the data source
val count: Long = data.count()
// First element of the data source
val first = data.first()
// The first n elements of the data source, returned as an array
val take = data.take(4)
// Collect the results to the driver
data.collect()
// Save the data as serialized objects
data.saveAsObjectFile("output")
// saveAsSequenceFile requires a key-value RDD, so convert to (key, value) pairs first
data.map((_, 1)).saveAsSequenceFile("output")
2. aggregate
val data: RDD[Int] = context.makeRDD(List(1,2,3,4), 2)
// Compute within each partition, then combine the partition results across partitions
val i: Int = data.aggregate(0)(_ + _, _ + _)
Differences between aggregateByKey and aggregate:
1. aggregateByKey's initial value only participates in the intra-partition computation.
2. aggregate's initial value participates in both the intra-partition and the inter-partition computation.
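A minimal sketch of this difference with concrete numbers (the pair RDD and the initial value 10 are made up for illustration, assuming the same local context as above):
val nums: RDD[Int] = context.makeRDD(List(1, 2, 3, 4), 2)
// partitions are (1, 2) and (3, 4); the initial value 10 is used once per partition
// and once more between partitions: (10+1+2) + (10+3+4) + 10 = 40
val aggregated: Int = nums.aggregate(10)(_ + _, _ + _)
val pairs: RDD[(String, Int)] = context.makeRDD(List(("a", 1), ("a", 2), ("a", 3), ("a", 4)), 2)
// the initial value 10 is only used inside each partition: (10+1+2) + (10+3+4) = 30 for key "a"
val aggregatedByKey: RDD[(String, Int)] = pairs.aggregateByKey(10)(_ + _, _ + _)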
3. fold
// When the intra-partition and inter-partition functions are the same, fold can be used;
// its initial value likewise participates in both the intra-partition and inter-partition computation
val i: Int = data.fold(1)(_ + _)
4. countByKey
// Count occurrences of each value (works on any RDD)
val valueCounts: collection.Map[Int, Long] = data.countByValue()
// Count occurrences of each key (requires a key-value RDD)
val keyCounts: collection.Map[String, Long] = data.map(("num", _)).countByKey()
5. foreach
The rdd.foreach() operator runs inside the executor tasks, so the data is processed (and printed) out of order.
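A quick sketch of the contrast (using the data RDD from above; the exact printed values are not the point, the ordering is):
// collect() first pulls every partition back to the driver, so the driver prints 1 2 3 4 in order
data.collect().foreach(println)
// foreach runs inside each executor task, so the elements are printed in no guaranteed order
data.foreach(println)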
II. Lineage and Dependencies
The relationship between two adjacent RDDs is called a dependency; the chain of dependencies across multiple consecutive RDDs is called lineage.
val data: RDD[Int] = context.makeRDD(List(1,2,3,4), 2)
data.toDebugString // the RDD's lineage
data.dependencies // the RDD's dependencies
--------------------------------------------------
1. When one partition of the new RDD depends on exactly one partition of the old RDD, the relationship is one-to-one: a narrow dependency.
// Extends NarrowDependency
@DeveloperApi
class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
override def getParents(partitionId: Int): List[Int] = List(partitionId)
}
2. When one partition of the new RDD depends on multiple partitions of the old RDD, the relationship is a shuffle dependency.
// Extends the plain Dependency
@DeveloperApi
class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](...)
extends Dependency[Product2[K, V]] {...}
Stage division, reading the source: total number of stages = 1 + number of shuffle dependencies
rdd.collect() // start from collect, which triggers the job
---------------------------------------------------
def collect(): Array[T] = withScope {
// Pass the current RDD object in
val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
Array.concat(results: _*)
}
---------------------------------------------------
Follow the calls all the way into runJob
---------------------------------------------------
def runJob[T, U](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
callSite: CallSite,
resultHandler: (Int, U) => Unit,
properties: Properties): Unit = {
// Submit the job
val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
}
---------------------------------------------------
def submitJob[T, U](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
callSite: CallSite,
resultHandler: (Int, U) => Unit,
properties: Properties): JobWaiter[U] = {
val waiter = new JobWaiter[U](this, jobId, partitions.size, resultHandler)
// eventProcessLoop.post sends the JobSubmitted event, which eventually reaches handleJobSubmitted
eventProcessLoop.post(JobSubmitted(
jobId, rdd, func2, partitions.toArray, callSite, waiter,
Utils.cloneProperties(properties)))
waiter
}
---------------------------------------------------
def post(event: E): Unit = {
if (!stopped.get) {
if (eventThread.isAlive) { // once the eventThread is started, its run loop keeps calling onReceive
eventQueue.put(event)
}
}
}
---------------------------------------------------
private[spark] val eventThread = new Thread(name) {
override def run(): Unit = {
try {
while (!stopped.get) {
val event = eventQueue.take() // take an event from the queue
try {
onReceive(event)
}
....
}
}
}
---------------------------------------------------
override def onReceive(event: DAGSchedulerEvent): Unit = {
doOnReceive(event)
}
---------------------------------------------------
private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
// Pattern match on the event type
case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
...
}
---------------------------------------------------
private[scheduler] def handleJobSubmitted(jobId: Int,
finalRDD: RDD[_],
func: (TaskContext, Iterator[_]) => _,
partitions: Array[Int],
callSite: CallSite,
listener: JobListener,
properties: Properties): Unit = {
var finalStage: ResultStage = null
try {
// Create the stages
finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
}
}
---------------------------------------------------
private def createResultStage(
rdd: RDD[_],
func: (TaskContext, Iterator[_]) => _,
partitions: Array[Int],
jobId: Int,
callSite: CallSite): ResultStage = {
// Get the shuffle dependencies of the current RDD
val (shuffleDeps, resourceProfiles) = getShuffleDependenciesAndResourceProfiles(rdd)
val resourceProfile = mergeResourceProfilesForStage(resourceProfiles)
checkBarrierStageWithDynamicAllocation(rdd)
checkBarrierStageWithNumSlots(rdd, resourceProfile)
checkBarrierStageWithRDDChainPattern(rdd, partitions.toSet.size)
// Get or create the parent stages
val parents = getOrCreateParentStages(shuffleDeps, jobId)
val id = nextStageId.getAndIncrement()
// A ResultStage is always created, no matter what
val stage = new ResultStage(id, rdd, func, partitions, parents, jobId,
callSite, resourceProfile.id)
stageIdToStage(id) = stage
updateJobIdStageIdMaps(jobId, stage)
stage
}
---------------------------------------------------
private[scheduler] def getShuffleDependenciesAndResourceProfiles(
rdd: RDD[_]): (HashSet[ShuffleDependency[_, _, _]], HashSet[ResourceProfile]) = {
val parents = new HashSet[ShuffleDependency[_, _, _]]
val resourceProfiles = new HashSet[ResourceProfile]
val visited = new HashSet[RDD[_]]
val waitingForVisit = new ListBuffer[RDD[_]]
waitingForVisit += rdd // add the final RDD to the list of RDDs to visit
while (waitingForVisit.nonEmpty) {
// take the next RDD to visit
val toVisit = waitingForVisit.remove(0)
if (!visited(toVisit)) {
visited += toVisit
Option(toVisit.getResourceProfile).foreach(resourceProfiles += _)
toVisit.dependencies.foreach {
// pattern match: add each ShuffleDependency to parents; otherwise keep walking up the narrow dependencies
case shuffleDep: ShuffleDependency[_, _, _] => parents += shuffleDep
case dependency => waitingForVisit.prepend(dependency.rdd)
}
}
}
(parents, resourceProfiles)
}
---------------------------------------------------
private def getOrCreateParentStages(shuffleDeps: HashSet[ShuffleDependency[_, _, _]],
firstJobId: Int): List[Stage] = {
// create (or reuse) one ShuffleMapStage per shuffle dependency
shuffleDeps.map { shuffleDep => getOrCreateShuffleMapStage(shuffleDep, firstJobId)
}.toList
}
Task division, reading the source. Conclusion first: the number of tasks in a stage = the number of partitions of the last RDD in that stage.
private[scheduler] def handleJobSubmitted(jobId: Int,
finalRDD: RDD[_],
func: (TaskContext, Iterator[_]) => _,
partitions: Array[Int],
callSite: CallSite,
listener: JobListener,
properties: Properties): Unit = {
var finalStage: ResultStage = null
try {
finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
} catch {
...
}
...
// Task division: submit the final stage
submitStage(finalStage)
}
-------------------------------------------------------
private def submitStage(stage: Stage): Unit = {
val jobId = activeJobForStage(stage)
if (jobId.isDefined) {
if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
// check whether any parent stages are still missing
val missing = getMissingParentStages(stage).sortBy(_.id)
if (missing.isEmpty) {
// split this stage into tasks
submitMissingTasks(stage, jobId.get)
} else {
for (parent <- missing) {
submitStage(parent)
}
}
}
}
}
-------------------------------------------------------
private def submitMissingTasks(stage: Stage, jobId: Int): Unit = {
val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
val tasks: Seq[Task[_]] = try {
stage match {
case stage: ShuffleMapStage =>
stage.pendingPartitions.clear()
partitionsToCompute.map { id =>
val locs = taskIdToLocations(id)
val part = partitions(id)
stage.pendingPartitions += id
new ShuffleMapTask(stage.id, stage.latestInfo.attemptNumber,
taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
Option(sc.applicationId), sc.applicationAttemptId, stage.rdd.isBarrier())
}
// How many tasks are created here depends on the number of elements in partitionsToCompute
case stage: ResultStage =>
partitionsToCompute.map { id =>
val p: Int = stage.partitions(id)
val part = partitions(p)
val locs = taskIdToLocations(id)
new ResultTask(stage.id, stage.latestInfo.attemptNumber,
taskBinary, part, locs, id, properties, serializedTaskMetrics,
Option(jobId), Option(sc.applicationId), sc.applicationAttemptId,
stage.rdd.isBarrier())
}
}
}
}
---------------------------------------------
stage: ResultStage
override def findMissingPartitions(): Seq[Int] = {
val job = activeJob.get
// numPartitions is the number of partitions of the job (i.e. of the final RDD)
(0 until job.numPartitions).filter(id => !job.finished(id))
}
---------------------------------------------
stage: ShuffleMapStage
override def findMissingPartitions(): Seq[Int] = {
mapOutputTrackerMaster
.findMissingPartitions(shuffleDep.shuffleId)
.getOrElse(0 until numPartitions)
}
III. Persistence
If we want to reuse an RDD, it may look as if the object is being reused, but the data is not: an RDD itself does not store data, so reusing it re-runs the whole computation. To avoid that, we persist the data.
1. The cache method
val value: RDD[(String, Int)] = data.map {
case (a, b) => (a, b + 1)
}
value.cache() // cache the RDD's data in memory
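A short sketch of why caching matters (a made-up example; data is assumed to be an RDD[(String, Int)] as above, and the println only serves to make the recomputation visible):
val mapped: RDD[(String, Int)] = data.map { case (a, b) =>
  println("map is running") // printed again for every action unless the RDD is cached
  (a, b + 1)
}
mapped.collect() // first action: the map function runs
mapped.collect() // second action: without cache() the map function runs again from the source
Calling mapped.cache() before the first action would let the second collect read the data from memory instead of recomputing it.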
2. persist: lets you choose whether the data is persisted in memory or on disk
val value: RDD[(String, Int)] = data.map {
case (a, b) => (a, b + 1)
}
value.persist(StorageLevel.DISK_ONLY) // persist the RDD's data on disk
3. checkpoint: sets a checkpoint; usually used together with cache
// The checkpoint directory must be set first, usually an HDFS path
context.setCheckpointDir("cp")
val value: RDD[(String, Int)] = data.map(word => {
(word, 1)
})
value.checkpoint() // set a checkpoint
※: Differences between the three approaches
cache: caches the data in memory; caching adds an extra dependency to the lineage but does not change the existing lineage.
persist: can persist the data on disk; the persisted files are removed once the job finishes.
checkpoint: saves the data to disk, cuts the lineage, and the files are not removed when the job finishes.
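As noted above, checkpoint is usually combined with cache; a minimal sketch of the common pattern (made-up data, reusing the checkpoint directory "cp" from the earlier example):
context.setCheckpointDir("cp")
val words: RDD[(String, Int)] = context.makeRDD(List("hello", "scala")).map((_, 1))
// cache first: the separate job started by checkpointing can then read the cached data
// instead of recomputing the whole lineage a second time
words.cache()
words.checkpoint()
words.collect()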
IV. Custom Partitioner
val base: RDD[String] = context.makeRDD(List("hello world", "hello scala"), numSlices = 1)
// partitionBy only works on key-value RDDs, so turn the lines into (word, 1) pairs first
val pairs: RDD[(String, Int)] = base.flatMap(_.split(" ")).map((_, 1))
// Apply the custom partitioner
val partitionData: RDD[(String, Int)] = pairs.partitionBy(new MyPartitioner)
partitionData.saveAsTextFile("outPut")
context.stop()
class MyPartitioner extends Partitioner{
// Number of partitions
override def numPartitions: Int = 2
// Return the partition index (starting from 0) for the given key
override def getPartition(key: Any): Int = {
if(key == "hello"){
0
}else{
1
}
}
}
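To check which partition each key actually landed in, one option (an illustrative check, not part of the original notes) is to tag every element with its partition index:
partitionData
  .mapPartitionsWithIndex((index, iter) => iter.map(kv => (index, kv)))
  .collect()
  .foreach(println) // e.g. (0,(hello,1)) and (1,(world,1)), depending on MyPartitioner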
V. Accumulators: Distributed Shared Write-Only Variables
If we write the accumulation in the usual way, as below, the result will be 0: the addition happens on the executors, each of which updates its own copy of sum, while the driver's sum is never changed.
val data: RDD[Int] = context.makeRDD(List(1,2,3,4))
var sum = 0
data.foreach(num => sum += num)
So we need to use the accumulator that Spark provides:
val data: RDD[Int] = context.makeRDD(List(1,2,3,4))
// Declare an accumulator; each action operator triggers a job, so the accumulation is usually done inside an action operator
val counter: LongAccumulator = context.longAccumulator("counter")
data.foreach(num => counter.add(num))
println(counter.value)
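A brief sketch of why the accumulation is usually done in an action rather than a transformation (the lazyCounter name is made up): inside a transformation nothing is added until an action actually runs the job, and re-running actions would add the values again.
val lazyCounter: LongAccumulator = context.longAccumulator("lazyCounter")
val mapped: RDD[Int] = data.map { num =>
  lazyCounter.add(num) // transformations are lazy, so nothing is added yet
  num
}
println(lazyCounter.value) // still 0: no action has run
mapped.collect()
println(lazyCounter.value) // 10: the job has now run once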
Custom accumulator:
import org.apache.spark.rdd.RDD
import org.apache.spark.util.AccumulatorV2
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable

object Test02 {
def main(args: Array[String]): Unit = {
val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("RDD")
val context = new SparkContext(conf)
val data: RDD[String] = context.makeRDD(List("hello","world", "hello","scala"))
val counter = new MyCounter()
context.register(counter, "counter")
data.foreach(word => counter.add(word))
println(counter.value)
context.stop()
}
// A custom accumulator extends AccumulatorV2 with two type parameters: the input type (what is added in) and the output type (the data structure that accumulates the result)
class MyCounter extends AccumulatorV2[String, mutable.Map[String, Long]]{
private var wCount = mutable.Map[String, Long]()
// Whether the accumulator is in its initial state; empty means initial
override def isZero: Boolean = {
wCount.isEmpty
}
// Create a copy of this accumulator
override def copy(): AccumulatorV2[String, mutable.Map[String, Long]] = {
new MyCounter
}
// Reset the counter
override def reset(): Unit = {
wCount.clear()
}
// Add one input value
override def add(v: String): Unit = {
val newCnt: Long = wCount.getOrElse(v, 0L) + 1
wCount.update(v, newCnt)
}
// Merge another accumulator (from another task) into this one
override def merge(other: AccumulatorV2[String, mutable.Map[String, Long]]): Unit = {
val map1 = this.wCount
val map2 = other.value
map2.foreach(word=> {
val newCont: Long = map1.getOrElse(word._1, 0L) + word._2
wCount.update(word._1, newCont)
})
}
// Return the accumulated value
override def value: mutable.Map[String, Long] = {
wCount
}
}
}
VI. Broadcast Variables: Distributed Shared Read-Only Variables
val data: RDD[String] = context.makeRDD(List("hello","world", "hello","scala"))
// An RDD cannot be broadcast directly; broadcast a plain driver-side collection instead
val words: List[String] = List("hello", "scala")
// Declare a broadcast variable (one read-only copy per executor)
val value: Broadcast[List[String]] = context.broadcast(words)
println(value.value)
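A short sketch of the typical use, replacing a join with a broadcast lookup table (the scores map is made up for illustration):
val scores: Map[String, Int] = Map("hello" -> 1, "scala" -> 2)
val broadcastScores: Broadcast[Map[String, Int]] = context.broadcast(scores)
// every task reads the shared executor-side copy instead of shipping the map with each task
data.map(word => (word, broadcastScores.value.getOrElse(word, 0)))
  .collect()
  .foreach(println)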