With the file read/write process covered, we can now read a file and run a simple "hello Spark" program.
The wordcount execution process
val lines = sc.textFile("D:/resources/README.md")
// split(" ") never yields " " itself, only empty strings for repeated spaces, so filter on nonEmpty
val words = lines.flatMap(_.split(" ")).filter(word => word.nonEmpty)
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
counts.collect().foreach(wordNum => println(wordNum._1 + ":" + wordNum._2))
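textFile() reads the file through a HadoopRDD (a subclass of RDD) whose records are (key, value) pairs of byte offset and line text, then maps each pair to its value, so the RDD it returns is a MapPartitionsRDD that holds only the text lines, not (key, value) pairs.
flatMap() on that RDD returns another MapPartitionsRDD (also a subclass of RDD).
map() on an RDD likewise returns a MapPartitionsRDD, and RDD itself does not define a reduceByKey() method.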
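To see what the four lines compute without a cluster, the same pipeline can be sketched over plain Scala collections. This is only an analogy under stated assumptions: `WordCountSketch` and `wordCount` are names made up for this illustration, `groupBy` followed by a per-key `reduce` stands in for `reduceByKey`, and nothing here is lazy or distributed the way an RDD is.

```scala
object WordCountSketch {
  // Hypothetical helper, not part of Spark: counts words in in-memory lines.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))   // like lines.flatMap(_.split(" "))
      .filter(_.nonEmpty)      // drop the empty tokens that repeated spaces produce
      .map(word => (word, 1))  // like words.map(word => (word, 1))
      .groupBy(_._1)           // gather the pairs that share a key
      .map { case (word, pairs) => word -> pairs.map(_._2).reduce(_ + _) } // like reduceByKey(_ + _)

  def main(args: Array[String]): Unit =
    wordCount(Seq("hello spark", "hello  scala"))
      .foreach { case (word, n) => println(word + ":" + n) }
}
```

The per-key `reduce` is the part that Spark cannot get from `RDD` alone, which is why the question of where `reduceByKey()` comes from matters.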
reduceByKey()
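This relies on Scala's implicit conversions.
An implicit conversion from type S to type T is defined by an implicit value of function type S => T, or by an implicit method that can be converted to a value of that type.
Scala 2.10 introduced a new feature called implicit classes.
An implicit class is a class marked with the implicit keyword. Wherever the class is in scope, its primary constructor is available for implicit conversions.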
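A minimal sketch of an implicit class, assuming a made-up enrichment (`ImplicitClassSketch`, `RichWord` and `shout` are hypothetical names, not taken from Spark or from the cited post):

```scala
object ImplicitClassSketch {
  // Hypothetical enrichment: the primary constructor takes the value being wrapped.
  implicit class RichWord(val word: String) {
    def shout: String = word.toUpperCase + "!"
  }

  def main(args: Array[String]): Unit =
    // String has no shout method; the compiler rewrites this to new RichWord("spark").shout
    println("spark".shout)
}
```

An implicit class has to be declared inside an object, class or trait, which is why the sketch nests it in an object.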
An implicit conversion lets an object call methods that its own class does not define:
// https://www.cnblogs.com/MOBIN/p/5351900.html
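import scala.language.implicitConversions

class SwingType {
  def wantLearned(sw: String) = println("The rabbit has learned " + sw)
}
object swimming {
  implicit def learningType(s: AminalType): SwingType = new SwingType
}
class AminalType
object AminalType extends App {
  // This import assumes the file is declared in package com.mobin.scala.Scalaimplicit
  import com.mobin.scala.Scalaimplicit.swimming._
  val rabbit = new AminalType
  // The compiler sees that rabbit has no wantLearned method, so it searches the scope for an implicit view that makes the call compile.
  // It finds the implicit learningType method and uses it to convert the object into one that has the method (a SwingType),
  // and then calls wantLearned on the converted object.
  rabbit.wantLearned("breaststroke")
}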
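When reduceByKey() is called on an RDD, the compiler searches the scope for an implicit conversion.
It finds one in the companion object of RDD, the implicit method rddToPairRDDFunctions, and uses it to convert the RDD into a PairRDDFunctions object, which does have reduceByKey().
The call then goes to the reduceByKey() method of PairRDDFunctions.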
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
  reduceByKey(defaultPartitioner(self), func)
}
// ......
// reduceByKey is ultimately implemented by combineByKeyWithClassTag
def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
  require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
  if (keyClass.isArray) {
    if (mapSideCombine) {
      throw new SparkException("Cannot use map-side combining with array keys.")
    }
    if (partitioner.isInstanceOf[HashPartitioner]) {
      throw new SparkException("HashPartitioner cannot partition array keys.")
    }
  }
  val aggregator = new Aggregator[K, V, C](
    self.context.clean(createCombiner),
    self.context.clean(mergeValue),
    self.context.clean(mergeCombiners))
  if (self.partitioner == Some(partitioner)) {
    self.mapPartitions(iter => {
      val context = TaskContext.get()
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  } else {
    new ShuffledRDD[K, V, C](self, partitioner)
      .setSerializer(serializer)
      .setAggregator(aggregator)
      .setMapSideCombine(mapSideCombine)
  }
}
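The two ideas in this section, an implicit conversion found in a companion object and a reduceByKey() built from three combiner functions, can be sketched together in plain Scala. Everything below is an illustrative assumption rather than Spark code: `MiniRDD`, `MiniPairFunctions`, `toPairFunctions` and the `numPartitions` parameter are hypothetical names, slices of a local sequence stand in for partitions, and the Partitioner, Aggregator and ShuffledRDD machinery is left out entirely.

```scala
import scala.collection.mutable
import scala.language.implicitConversions

// Hypothetical stand-in for RDD: a local sequence with no reduceByKey of its own.
class MiniRDD[T](val data: Seq[T])

object MiniRDD {
  // Declared in the companion object, so the compiler finds it without an import,
  // in the same way that object RDD provides rddToPairRDDFunctions.
  implicit def toPairFunctions[K, V](rdd: MiniRDD[(K, V)]): MiniPairFunctions[K, V] =
    new MiniPairFunctions(rdd)
}

// Hypothetical stand-in for PairRDDFunctions.
class MiniPairFunctions[K, V](self: MiniRDD[(K, V)]) {
  def combineByKey[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      numPartitions: Int = 2): MiniRDD[(K, C)] = {
    // "Map side": combine the values of each key inside every slice of the data
    val sliceSize = math.max(1, math.ceil(self.data.size.toDouble / numPartitions).toInt)
    val partials = self.data.grouped(sliceSize).map { slice =>
      val acc = mutable.LinkedHashMap.empty[K, C]
      slice.foreach { case (k, v) =>
        acc(k) = acc.get(k).map(c => mergeValue(c, v)).getOrElse(createCombiner(v))
      }
      acc
    }.toSeq
    // "Reduce side": merge the per-slice combiners of each key
    val merged = mutable.LinkedHashMap.empty[K, C]
    for (part <- partials; (k, c) <- part)
      merged(k) = merged.get(k).map(mergeCombiners(_, c)).getOrElse(c)
    new MiniRDD(merged.toSeq)
  }

  // The combiner type is the value type, and one function serves as both merge steps
  def reduceByKey(func: (V, V) => V): MiniRDD[(K, V)] =
    combineByKey[V]((v: V) => v, func, func)
}

object MiniReduceByKeyDemo {
  def main(args: Array[String]): Unit = {
    val pairs = new MiniRDD(Seq(("a", 1), ("b", 1), ("a", 1)))
    // MiniRDD has no reduceByKey; the compiler inserts MiniRDD.toPairFunctions(pairs)
    println(pairs.reduceByKey(_ + _).data)
  }
}
```

Keeping the pair-only methods in a separate wrapper means that a `MiniRDD[String]` simply does not offer `reduceByKey`, since the conversion applies only when the element type is a pair.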
foreach()
foreach() on an RDD is an action. It calls runJob() on SparkContext, which in turn finishes the progressBar and calls doCheckpoint() on the RDD.
/**
 * Applies a function f to all elements of this RDD.
 */
def foreach(f: T => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}
runJob()
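runJob() on SparkContext calls runJob() on the DAGScheduler that was created when the SparkContext was initialized.
The parameter func: (TaskContext, Iterator[T]) => U
carries the function passed in by an action such as foreach(). It is applied to the iterator over each partition's records, which is how the action reaches every record of the RDD.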
/**
 * Run a function on a given set of partitions in an RDD and pass the results to the given
 * handler function. This is the main entry point for all actions in Spark.
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
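The relationship between an action, the per-partition func and the resultHandler can be sketched locally. The names below (`MiniRunJob` and its `runJob`, `foreach` and `count`) are hypothetical and exist only for this illustration: partitions are plain sequences, the partition index takes the place of TaskContext, and the "tasks" run one after another in the calling thread instead of being scheduled.

```scala
object MiniRunJob {
  // Hypothetical stand-in for SparkContext.runJob over local partitions.
  def runJob[T, U](
      partitions: Seq[Seq[T]],
      func: (Int, Iterator[T]) => U,
      resultHandler: (Int, U) => Unit): Unit =
    partitions.zipWithIndex.foreach { case (part, index) =>
      resultHandler(index, func(index, part.iterator)) // one "task" per partition
    }

  // In the style of foreach: the per-record function is wrapped in a per-partition one,
  // and the results (all Unit) are ignored.
  def foreach[T](partitions: Seq[Seq[T]])(f: T => Unit): Unit =
    runJob[T, Unit](partitions, (_, iter) => iter.foreach(f), (_, _) => ())

  // In the style of count: each task returns a partial result and the handler stores it
  // under the partition index.
  def count[T](partitions: Seq[Seq[T]]): Long = {
    val sizes = new Array[Long](partitions.size)
    runJob[T, Long](partitions, (_, iter) => iter.size.toLong, (index, n) => { sizes(index) = n })
    sizes.sum
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq("a", "b"), Seq("c"))
    foreach(parts)(println)
    println(count(parts))
  }
}
```

Because every action reduces to "a function per partition plus a handler for each partition's result", a single entry point can serve all of them.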
DAGScheduler.runJob()
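It runs an action job on the given RDD and passes every result to the resultHandler function.
Most of the work in DAGScheduler.runJob() is the call to submitJob(), which submits the job and returns a JobWaiter for runJob() to wait on.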
def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  // Check to make sure we are not launching a task on a partition that does not exist.
  val maxPartitions = rdd.partitions.length
  partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
    throw new IllegalArgumentException(
      "Attempting to access a non-existent partition: " + p + ". " +
        "Total number of partitions: " + maxPartitions)
  }
  val jobId = nextJobId.getAndIncrement()
  if (partitions.size == 0) {
    // Return immediately if the job is running 0 tasks
    return new JobWaiter[U](this, jobId, 0, resultHandler)
  }
  assert(partitions.size > 0)
  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
  val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, callSite, waiter,
    SerializationUtils.clone(properties)))
  waiter
}
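The shape of submitJob() is: validate the partitions, create a waiter, post an event, and return the waiter without running anything. A minimal sketch of that pattern follows, assuming hypothetical names throughout (`MiniScheduler`, `MiniJobWaiter`, `MiniJobSubmitted`, `processPending`). Unlike the real event loop, which runs on its own thread, the queue here is drained by an explicit call so that the sketch stays deterministic.

```scala
import scala.collection.mutable

// Hypothetical stand-in for JobWaiter: tracks how many tasks have finished.
final class MiniJobWaiter(val jobId: Int, totalTasks: Int) {
  private var finished = 0
  def taskSucceeded(): Unit = finished += 1
  def jobFinished: Boolean = finished == totalTasks
}

// Hypothetical stand-in for the JobSubmitted event.
final case class MiniJobSubmitted(
    jobId: Int, partitions: Seq[Int], task: Int => Unit, waiter: MiniJobWaiter)

// Hypothetical stand-in for DAGScheduler.
final class MiniScheduler(numPartitions: Int) {
  private var nextJobId = 0
  private val events = mutable.Queue.empty[MiniJobSubmitted]

  def submitJob(partitions: Seq[Int])(task: Int => Unit): MiniJobWaiter = {
    // Reject partitions that do not exist, as submitJob does
    partitions.find(p => p >= numPartitions || p < 0).foreach { p =>
      throw new IllegalArgumentException(
        s"Attempting to access a non-existent partition: $p. Total number of partitions: $numPartitions")
    }
    val jobId = nextJobId
    nextJobId += 1
    val waiter = new MiniJobWaiter(jobId, partitions.size)
    // A job with zero tasks is already finished and never reaches the queue
    if (partitions.nonEmpty) events.enqueue(MiniJobSubmitted(jobId, partitions, task, waiter))
    waiter // returned right away; nothing has run yet
  }

  // One pass of the "event loop": run every queued job, one task per partition
  def processPending(): Unit =
    while (events.nonEmpty) {
      val job = events.dequeue()
      job.partitions.foreach { p => job.task(p); job.waiter.taskSucceeded() }
    }
}

object MiniSubmitJobDemo {
  def main(args: Array[String]): Unit = {
    val scheduler = new MiniScheduler(numPartitions = 3)
    val waiter = scheduler.submitJob(Seq(0, 1, 2))(p => println("running partition " + p))
    println(waiter.jobFinished) // false: the job is only queued so far
    scheduler.processPending()
    println(waiter.jobFinished) // true
  }
}
```

Separating submission from execution is what lets the caller decide whether to block on the waiter or carry on with other work.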