Before I knew it, this is already the 19th post in my Spark series. The series is not very systematic — I basically write about whatever I happen to be learning at the moment, rather than writing from a well-formed overall picture. I will reorganize these posts once I have a deeper grasp of Spark; after all, I have only been working with Spark for ten days. Onward!
In the previous posts, Spark was always running in the default pseudo-distributed setup; I never looked at it from a system-deployment point of view. The current state is simply that Spark runs and the examples work. Up to this point, the Spark configuration file (conf/spark-env.sh) contained:
export SCALA_HOME=/home/hadoop/software/scala-2.11.4
export JAVA_HOME=/home/hadoop/software/jdk1.7.0_67
###does localhost mean the Master node runs on this machine?
export SPARK_MASTER=localhost
###what does this one mean?
export SPARK_LOCAL_IP=localhost
export HADOOP_HOME=/home/hadoop/software/hadoop-2.5.2
export SPARK_HOME=/home/hadoop/software/spark-1.2.0-bin-hadoop2.4
export SPARK_LIBRARY_PATH=.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib:$HADOOP_HOME/lib/native
###what is this for? If Spark runs standalone, the YARN-related options should not be needed
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
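To partly answer the questions in the comments above: as far as I can tell from the Spark 1.2 docs, SPARK_LOCAL_IP is the IP address Spark binds to on this node, and the standard variable for the standalone master host is SPARK_MASTER_IP rather than SPARK_MASTER. For a YARN deployment the essential part is pointing Spark at the Hadoop configuration. A minimal sketch, using the paths from this post:

```shell
# Minimal spark-env.sh sketch for running Spark 1.2 on YARN
# (paths are the ones used in this post; adjust to your layout).
export JAVA_HOME=/home/hadoop/software/jdk1.7.0_67
export HADOOP_HOME=/home/hadoop/software/hadoop-2.5.2
# Spark picks up the cluster information from the Hadoop/YARN config files:
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
# Only relevant in standalone mode, not on YARN:
# export SPARK_MASTER_IP=localhost   # host the standalone master binds to
# export SPARK_LOCAL_IP=localhost    # IP address Spark binds to on this node
```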
Two deployment modes for Spark on YARN
- yarn-client
- yarn-cluster
1. yarn-client
In this mode, the Spark driver runs on the client machine and requests executors from YARN to run the tasks. That is, the driver and YARN are separate: the driver program acts as a client of the YARN cluster, a classic client-server arrangement.
2. yarn-cluster
In this mode, the Spark driver is first started inside the YARN cluster as part of the ApplicationMaster; the ApplicationMaster then requests resources from the ResourceManager to start executors that run the tasks. In other words, with this deployment mode the driver program runs inside the YARN cluster.
To deploy a Spark application on YARN, submit it with Spark's bin/spark-submit. Unlike Standalone or Mesos mode, you do not need to pass a URL as the master value: Spark can obtain the relevant information from the Hadoop configuration files, so it is enough to specify yarn-cluster or yarn-client as the master. Because Spark reads this information from the Hadoop (specifically, YARN-related) configuration files, the environment variable HADOOP_CONF_DIR or YARN_CONF_DIR must be set.
So, on top of the configuration above, add one more entry to conf/spark-env.sh, and also add the same line to /etc/profile:
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
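Since a missing or wrong YARN_CONF_DIR is an easy mistake at this step, a quick sanity check before submitting can save a confusing failure. This is just a sketch; the default HADOOP_HOME below is the path used in this post:

```shell
# Fall back to this post's layout if HADOOP_HOME is not already set.
: "${HADOOP_HOME:=/home/hadoop/software/hadoop-2.5.2}"
export YARN_CONF_DIR="$HADOOP_HOME/etc/hadoop"

# yarn-site.xml is where spark-submit finds the ResourceManager address.
if [ -e "$YARN_CONF_DIR/yarn-site.xml" ]; then
    echo "YARN config found: $YARN_CONF_DIR"
else
    echo "WARNING: no yarn-site.xml under $YARN_CONF_DIR" >&2
fi
```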
Deploying in yarn-client mode
1. Submit command:
./spark-submit --name SparkWordCount --class spark.examples.SparkWordCount --master yarn-client --executor-memory 512M --total-executor-cores 1 SparkWordCount.jar README.md
Compare this with the earlier submission where Spark managed the compute resources itself (standalone mode):
./spark-submit --name SparkWordCount --class spark.examples.SparkWordCount --master spark://hadoop.master:7077 --executor-memory 512M --total-executor-cores 1 SparkWordCount.jar README.md
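One caveat about the yarn-client command above: --total-executor-cores is a Standalone/Mesos option; on YARN, resources are requested per executor via --num-executors and --executor-cores. A sketch of the equivalent yarn-client submission (same class and jar as above):

```shell
# On YARN, size the job per executor instead of with --total-executor-cores.
./spark-submit --name SparkWordCount \
  --class spark.examples.SparkWordCount \
  --master yarn-client \
  --num-executors 1 \
  --executor-cores 1 \
  --executor-memory 512M \
  SparkWordCount.jar README.md
```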
2. Notes:
2.1 With yarn-client, the driver runs on the client machine, so its state can be inspected through the web UI, by default at http://hadoop.master:4040, while the YARN UI is reached at http://hadoop.master:8088.
2.2 Submitting a job produces a lot of log output, and the whole process looks rather involved:
[hadoop@hadoop bin]$ sh submitSparkApplicationYarnClient.sh //submit the job in yarn-client mode
Delete the HDFS output directory //delete the HDFS output directory left over from the previous run
15/01/10 07:27:49 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/hadoop/SortedWordCountRDDInSparkApplication
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/01/10 07:27:52 INFO spark.SecurityManager: Changing view acls to: hadoop
15/01/10 07:27:52 INFO spark.SecurityManager: Changing modify acls to: hadoop
15/01/10 07:27:52 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/01/10 07:27:53 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/01/10 07:27:53 INFO Remoting: Starting remoting
15/01/10 07:27:54 INFO util.Utils: Successfully started service 'sparkDriver' on port 35401.
15/01/10 07:27:54 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@localhost:35401]
15/01/10 07:27:54 INFO spark.SparkEnv: Registering MapOutputTracker
15/01/10 07:27:54 INFO spark.SparkEnv: Registering BlockManagerMaster
15/01/10 07:27:54 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-local-20150110072754-dcdf
15/01/10 07:27:54 INFO storage.MemoryStore: MemoryStore started with capacity 267.3 MB
15/01/10 07:27:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/01/10 07:27:56 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-8f55f6ec-399b-4371-9ab4-d648047381c5
15/01/10 07:27:56 INFO spark.HttpServer: Starting HTTP Server
15/01/10 07:27:56 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/01/10 07:27:57 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:52196
15/01/10 07:27:57 INFO util.Utils: Successfully started service 'HTTP file server' on port 52196.
15/01/10 07:27:57 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/01/10 07:27:58 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/01/10 07:27:58 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
15/01/10 07:27:58 INFO ui.SparkUI: Started SparkUI at http://localhost:4040
15/01/10 07:27:58 INFO spark.SparkContext: Added JAR file:/home/hadoop/software/spark-1.2.0-bin-hadoop2.4/bin/SparkWordCount.jar at http://localhost:52196/jars/SparkWordCount.jar with timestamp 1420892878400
//At this point the driver-side Spark setup is done; next the application is submitted to YARN
15/01/10 07:28:00 INFO client.RMProxy: Connecting to ResourceManager at hadoop.master/192.168.26.136:8032 //connect to the ResourceManager
15/01/10 07:28:02 INFO yarn.Client: Requesting a new application from cluster with 1 NodeManagers //request resources
15/01/10 07:28:02 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
15/01/10 07:28:02 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead //allocate the ApplicationMaster container
15/01/10 07:28:02 INFO yarn.Client: Setting up container launch context for our AM //set up the container
15/01/10 07:28:02 INFO yarn.Client: Preparing resources for our AM container
15/01/10 07:28:03 INFO yarn.Client: Uploading resource file:/home/hadoop/software/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar -> hdfs://hadoop.master:9000/user/hadoop/.sparkStaging/application_1420859110621_0002/spark-assembly-1.2.0-hadoop2.4.0.jar
So spark-assembly-1.2.0-hadoop2.4.0.jar gets uploaded to HDFS???
15/01/10 07:28:22 INFO yarn.Client: Setting up the launch environment for our AM container
15/01/10 07:28:22 INFO spark.SecurityManager: Changing view acls to: hadoop
15/01/10 07:28:22 INFO spark.SecurityManager: Changing modify acls to: hadoop
15/01/10 07:28:22 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/01/10 07:28:22 INFO yarn.Client: Submitting application 2 to ResourceManager
The application is submitted:
15/01/10 07:28:22 INFO impl.YarnClientImpl: Submitted application application_1420859110621_0002
15/01/10 07:28:23 INFO yarn.Client: Application report for application_1420859110621_0002 (state: ACCEPTED)
15/01/10 07:28:23 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1420892902791
final status: UNDEFINED
tracking URL: http://hadoop.master:8088/proxy/application_1420859110621_0002/
user: hadoop
///What is this block below? A status report every second? That must generate a lot of junk log entries...
15/01/10 07:28:24 INFO yarn.Client: Application report for application_1420859110621_0002 (state: ACCEPTED)
15/01/10 07:28:26 INFO
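Two of the questions raised in the log annotations above seem to have configuration answers, as far as I can tell from the Spark 1.2 "Running on YARN" docs (hedged sketch; the HDFS path is my own assumption): the assembly jar is uploaded to .sparkStaging on every submit unless spark.yarn.jar points at a copy already on HDFS, and the per-second application report is controlled by spark.yarn.report.interval (in milliseconds).

```shell
# Upload the assembly once so spark-submit stops re-uploading it every run
# (the HDFS target path below is an assumption; pick any path you like).
hdfs dfs -mkdir -p /user/hadoop/spark-lib
hdfs dfs -put $SPARK_HOME/lib/spark-assembly-1.2.0-hadoop2.4.0.jar /user/hadoop/spark-lib/

# Then in conf/spark-defaults.conf:
# spark.yarn.jar              hdfs://hadoop.master:9000/user/hadoop/spark-lib/spark-assembly-1.2.0-hadoop2.4.0.jar
# spark.yarn.report.interval  5000
```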