1. SparkContext: The Gateway to Spark
(1) A running Spark program consists of two parts: the Driver and the Executors.
(2) Spark programs are written against SparkContext, in two respects:
a) Spark's core abstraction is the RDD, which is initially created by SparkContext (the first RDD is always created by SparkContext).
b) The scheduling and optimization of a Spark program are also based on SparkContext.
(3) A Spark program is registered through objects produced when SparkContext is instantiated (concretely, it is the SchedulerBackend that registers the program).
(4) At runtime, a Spark program obtains its concrete computing resources through the Cluster Manager; these resources are likewise requested by objects created by SparkContext (concretely, it is the SchedulerBackend that acquires the computing resources).
(5) When SparkContext crashes or shuts down, the entire Spark program ends.
Summary:
(a) A Spark program is submitted to the Spark cluster through SparkContext;
(b) A Spark program runs under the direction of the scheduler, with SparkContext at its core;
(c) When SparkContext crashes or shuts down, the entire Spark program ends.
2. SparkContext Usage in Practice
3. The three top-level core objects built by SparkContext: DAGScheduler, TaskScheduler, SchedulerBackend
(1) The DAGScheduler is the high-level, Stage-oriented scheduler for Jobs.
(2) TaskScheduler is an interface with different implementations for different Cluster Managers; in Standalone mode the concrete implementation is TaskSchedulerImpl.
(3) SchedulerBackend is an interface with different implementations for different Cluster Managers; in Standalone mode the concrete implementation is SparkDeploySchedulerBackend.
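The pairing of scheduler implementations with Cluster Managers happens when SparkContext is instantiated, by matching on the master URL. The following is a simplified sketch loosely modeled on Spark 1.6's SparkContext.createTaskScheduler, not the literal source; only the "local" and Standalone ("spark://...") cases are shown:

```scala
// Simplified sketch (Spark 1.6-style internals; error handling and the
// other cluster modes are omitted). In Standalone mode the pairing is
// TaskSchedulerImpl + SparkDeploySchedulerBackend.
def createTaskScheduler(sc: SparkContext, master: String)
    : (SchedulerBackend, TaskScheduler) = master match {
  case "local" =>
    val scheduler = new TaskSchedulerImpl(sc)
    val backend   = new LocalBackend(sc.getConf, scheduler, totalCores = 1)
    scheduler.initialize(backend)
    (backend, scheduler)
  case masterUrl if masterUrl.startsWith("spark://") =>
    val scheduler  = new TaskSchedulerImpl(sc)
    val masterUrls = masterUrl.split(",")
    val backend    = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
    scheduler.initialize(backend)
    (backend, scheduler)
}
```

Note that in every branch the TaskSchedulerImpl is created first and then initialized with its backend, which is why the two always travel together.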
4. From the perspective of the whole program's execution, SparkContext has four core objects: DAGScheduler, TaskScheduler, SchedulerBackend, and MapOutputTrackerMaster.
5. SparkDeploySchedulerBackend has three core functions:
(1) It connects to the Master and registers the current program;
(2) It receives the registrations of the Executors that the cluster allocates to the current application, and manages those Executors;
(3) It sends Tasks to concrete Executors for execution.
Note: SparkDeploySchedulerBackend is managed by TaskSchedulerImpl; after initialization, the taskScheduler is started.
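The management relationship in the note above can be sketched as follows (a simplified view, loosely modeled on Spark 1.6 internals rather than the literal source):

```scala
// Simplified sketch: TaskSchedulerImpl is handed the backend at
// initialization time and manages it from then on. Starting the
// TaskScheduler starts the backend, which registers the application
// with the Master.
val scheduler = new TaskSchedulerImpl(sc)
val backend   = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
scheduler.initialize(backend) // backend is now managed by TaskSchedulerImpl
scheduler.start()             // internally calls backend.start()
```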
6. Observing the SparkContext instantiation process through a running code example
import org.apache.spark.SparkContext._
import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]) {
    // The input file can be a local Linux file or come from another
    // source, e.g. HDFS
    if (args.length == 0) {
      System.err.println("Usage: SparkWordCount <inputfile>")
      System.exit(1)
    }
    // Run with local threads; the thread count can be specified,
    // e.g. .setMaster("local[2]") for two threads.
    // Below, a single thread is used.
    val conf = new SparkConf().setAppName("SparkWordCount").setMaster("local")
    val sc = new SparkContext(conf)
    // Count the lines in the file that contain "hello"
    val count = sc.textFile(args(0)).filter(line => line.contains("hello")).count()
    // Print the result
    println("count=" + count)
    sc.stop()
  }
}
Input file contents
Run log:
/usr/local/jdk1.7.0_72/bin/java -Didea.launcher.port=7532 -Didea.launcher.bin.path=/usr/local/idea-IC-141.1532.4/bin -Dfile.encoding=UTF-8 -classpath /usr/local/jdk1.7.0_72/jre/lib/management-agent.jar:/usr/local/jdk1.7.0_72/jre/lib/jsse.jar:/usr/local/jdk1.7.0_72/jre/lib/deploy.jar:/usr/local/jdk1.7.0_72/jre/lib/jfxrt.jar:/usr/local/jdk1.7.0_72/jre/lib/resources.jar:/usr/local/jdk1.7.0_72/jre/lib/jce.jar:/usr/local/jdk1.7.0_72/jre/lib/javaws.jar:/usr/local/jdk1.7.0_72/jre/lib/jfr.jar:/usr/local/jdk1.7.0_72/jre/lib/charsets.jar:/usr/local/jdk1.7.0_72/jre/lib/rt.jar:/usr/local/jdk1.7.0_72/jre/lib/plugin.jar:/usr/local/jdk1.7.0_72/jre/lib/ext/sunpkcs11.jar:/usr/local/jdk1.7.0_72/jre/lib/ext/dnsns.jar:/usr/local/jdk1.7.0_72/jre/lib/ext/sunec.jar:/usr/local/jdk1.7.0_72/jre/lib/ext/sunjce_provider.jar:/usr/local/jdk1.7.0_72/jre/lib/ext/zipfs.jar:/usr/local/jdk1.7.0_72/jre/lib/ext/localedata.jar:/root/IdeaProjects/test/out/production/test:/hadoop/scala/lib/scala-actors.jar:/hadoop/scala/lib/scala-swing.jar:/hadoop/scala/lib/scala-library.jar:/hadoop/scala/lib/scala-actors-migration.jar:/hadoop/scala/lib/scala-reflect.jar:/hadoop/spark/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/usr/local/idea-IC-141.1532.4/lib/idea_rt.jar com.intellij.rt.execution.application.AppMain SparkWordCount /hadoop/mr/wordcount.txt
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/03/19 00:38:54 INFO SparkContext: Running Spark version 1.6.0
16/03/19 00:38:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/03/19 00:38:58 INFO SecurityManager: Changing view acls to: root
16/03/19 00:38:58 INFO SecurityManager: Changing modify acls to: root
16/03/19 00:38:58 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
16/03/19 00:38:59 INFO Utils: Successfully started service 'sparkDriver' on port 43290.
16/03/19 00:39:00 INFO Slf4jLogger: Slf4jLogger started
16/03/19 00:39:00 INFO Remoting: Starting remoting
16/03/19 00:39:01 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.6.135:57495]
16/03/19 00:39:01 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 57495.
16/03/19 00:39:01 INFO SparkEnv: Registering MapOutputTracker
16/03/19 00:39:01 INFO SparkEnv: Registering BlockManagerMaster
16/03/19 00:39:02 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-a99dec66-4d57-4213-bd3a-55cd4f7af0ac
16/03/19 00:39:02 INFO MemoryStore: MemoryStore started with capacity 1200.4 MB
16/03/19 00:39:02 INFO SparkEnv: Registering OutputCommitCoordinator
16/03/19 00:39:03 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/03/19 00:39:03 INFO SparkUI: Started SparkUI at http://192.168.6.135:4040
16/03/19 00:39:04 INFO Executor: Starting executor ID driver on host localhost
16/03/19 00:39:04 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 51867.
16/03/19 00:39:04 INFO NettyBlockTransferService: Server created on 51867
16/03/19 00:39:04 INFO BlockManagerMaster: Trying to register BlockManager
16/03/19 00:39:04 INFO BlockManagerMasterEndpoint: Registering block manager localhost:51867 with 1200.4 MB RAM, BlockManagerId(driver, localhost, 51867)
16/03/19 00:39:04 INFO BlockManagerMaster: Registered BlockManager
16/03/19 00:39:05 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 153.6 KB, free 153.6 KB)
16/03/19 00:39:06 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 167.5 KB)
16/03/19 00:39:06 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:51867 (size: 13.9 KB, free: 1200.4 MB)
16/03/19 00:39:06 INFO SparkContext: Created broadcast 0 from textFile at SparkWordCount.scala:21
16/03/19 00:39:07 INFO FileInputFormat: Total input paths to process : 1
16/03/19 00:39:07 INFO SparkContext: Starting job: count at SparkWordCount.scala:21
16/03/19 00:39:07 INFO DAGScheduler: Got job 0 (count at SparkWordCount.scala:21) with 1 output partitions
16/03/19 00:39:07 INFO DAGScheduler: Final stage: ResultStage 0 (count at SparkWordCount.scala:21)
16/03/19 00:39:07 INFO DAGScheduler: Parents of final stage: List()
16/03/19 00:39:07 INFO DAGScheduler: Missing parents: List()
16/03/19 00:39:07 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at filter at SparkWordCount.scala:21), which has no missing parents
16/03/19 00:39:07 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 170.6 KB)
16/03/19 00:39:07 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1858.0 B, free 172.4 KB)
16/03/1