A Driver Program is the user-written application submitted to a Spark cluster for execution. It consists of two parts:
- The driver role: the Driver cooperates with the Master and the Workers to start the application's processes, split the job into a DAG of stages, package the computation into tasks, distribute those tasks to the compute nodes (Workers), and allocate compute resources.
- The computation logic itself: when a task runs on a Worker, this logic executes to carry out the application's actual computation.
The question, then, is: given a driver program, which parts run as "driver code" inside the Driver process, and which parts are "task logic code" that gets packaged into tasks and shipped to the compute nodes for execution?
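Before walking through a full example, the shipping of task logic can be sketched in plain Scala. The snippet below is a minimal, hypothetical stand-in that uses ordinary JVM serialization and no Spark at all; Spark's real mechanism (its ClosureCleaner plus a configurable serializer) is more involved, but the round trip is the same idea: the Driver serializes the closure the program passed to an operator such as map(), the bytes travel to a Worker, and the Worker deserializes and applies the function.

```scala
import java.io._

// Hypothetical sketch: plain JVM serialization standing in for Spark's
// task-shipping machinery. Not Spark's actual implementation.
object ClosureShipping {
  // The "task logic code": the kind of function a driver program passes to map().
  val parse: String => Int = line => line.length

  // Driver side: package the closure into bytes.
  def serialize(f: String => Int): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(f)
    oos.close()
    bos.toByteArray
  }

  // Worker side: rebuild the closure from bytes so it can be applied to data.
  def deserialize(bytes: Array[Byte]): String => Int = {
    val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
    ois.readObject().asInstanceOf[String => Int]
  }

  // Full round trip: serialize on the "driver", deserialize and run on the "worker".
  def roundTrip(input: String): Int = deserialize(serialize(parse))(input)

  def main(args: Array[String]): Unit =
    println(roundTrip("hello"))
}
```

This is also why the functions passed to RDD operators must be serializable: whatever they capture travels with them to the Workers.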
1. A basic Spark driver application
package spark.examples.databricks.reference.apps.loganalysis

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object LogAnalyzer {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Log Analyzer in Scala").setMaster("local[3]")
    val sc = new SparkContext(sparkConf)

    // The logFile path is provided by the program args, with a local default
    val logFile = if (args != null && args.length == 1) args(0) else "E:\\softwareInstalled\\Apache2.2\\logs\\access.log"

    // Read the text file from the data source and transform each line into an
    // ApacheAccessLog object, yielding an RDD[ApacheAccessLog].
    // Because it will be used more than once, cache it.
    val accessLogs = sc.textFile(logFile).map(ApacheAccessLog.parseLogLine).cache()

    // Calculate statistics based on the content size:
    // retrieve the contentSize column and cache it.
    val contentSizes = accessLogs.map(log => log.contentSize).cache()

    // reduce, count, min and max are all actions
    println("Content Size Avg: %s, Min: %s, Max: %s".format(
      contentSizes.reduce(_ + _) / contentSizes.count,
      contentSizes.min,
      contentSizes.max))
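Each of those actions triggers a job over the cached RDD, but the arithmetic they perform is the same as over a plain collection. For intuition, the same average/min/max aggregation in plain Scala, with hypothetical content sizes standing in for the contentSize RDD:

```scala
// Plain-Scala analogy of the Spark aggregation above; no Spark involved,
// and the sample sizes are made up for illustration.
object ContentSizeStats {
  // Hypothetical content sizes, standing in for the contentSize RDD.
  val contentSizes = Seq(100L, 200L, 600L)

  // Same shape as the Spark code: reduce sums the sizes, and the count
  // (here: size) divides to get the average.
  val avg = contentSizes.reduce(_ + _) / contentSizes.size

  def main(args: Array[String]): Unit =
    println("Content Size Avg: %s, Min: %s, Max: %s".format(
      avg, contentSizes.min, contentSizes.max))
}
```

The difference in Spark is where the work happens: the lambdas given to reduce, min and max execute inside tasks on the Workers, and only the aggregated results come back to the Driver for the println.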