Preface
In Day 10 we covered what Spark is and its features, the Spark architecture, how to install Spark, and WordCount examples implemented on Spark in both Scala and Java. Today we look at Spark's HA modes, plus demos of spark-submit and spark-shell.
Spark HA
Spark offers two high-availability approaches:
- Single-point recovery based on the file system.
- HA based on ZooKeeper.
Single-point recovery based on the file system
This approach guards against a single point of failure, but it is not high availability in the strict sense: the Master writes the registration state of Spark applications and workers to a designated directory, so it is only a cold standby. Add the following configuration to spark-env.sh (on every node), then restart Spark.
[root@bigdata121 ~]# vi /opt/module/spark-2.1.0-bin-hadoop2.7/conf/spark-env.sh
# Add the following configuration to the file
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM -Dspark.deploy.recoveryDirectory=/root/spark-2.1.0-bin-hadoop2.7"
# Restart Spark
[root@bigdata121 ~]# /opt/module/spark-2.1.0-bin-hadoop2.7/sbin/stop-all.sh
[root@bigdata121 ~]# /opt/module/spark-2.1.0-bin-hadoop2.7/sbin/start-all.sh
# Check the /root/spark-2.1.0-bin-hadoop2.7 directory and you will see entries like the following. This is the worker's registration info. Since this setup is pseudo-distributed there is only one worker; in a real cluster the directory would contain one entry per worker.
[root@bigdata121 ~]# ll /root/spark-2.1.0-bin-hadoop2.7
total 4
-rw-r--r-- 1 root root 1231 Jan 1 15:32 worker_worker-20200101153244-192.168.80.121-55020
[root@bigdata121 ~]#
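To see the recovery in action, you can restart just the Master: on startup it re-reads the saved application and worker state from the recovery directory instead of starting empty. A minimal check, assuming the same installation path as above:
[root@bigdata121 ~]# /opt/module/spark-2.1.0-bin-hadoop2.7/sbin/stop-master.sh
[root@bigdata121 ~]# /opt/module/spark-2.1.0-bin-hadoop2.7/sbin/start-master.sh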
HA based on ZooKeeper
This approach works much like Hadoop's HA: both rely on ZooKeeper, so the ZooKeeper cluster must be running before you configure Spark HA. Setting up a ZooKeeper cluster is not covered here.
[root@bigdata121 ~]# vi /opt/module/spark-2.1.0-bin-hadoop2.7/conf/spark-env.sh
# Add the following configuration to the file
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=bigdata121:2181,bigdata122:2181,bigdata123:2181 -Dspark.deploy.zookeeper.dir=/spark"
# Then restart Spark and the HA setup takes effect.
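For the failover to be useful you also need at least one standby Master. A sketch of the usual setup, assuming bigdata122 carries the same Spark installation and spark-env.sh: start a second Master there, and ZooKeeper elects one Master as ALIVE while the other stays STANDBY.
[root@bigdata122 ~]# /opt/module/spark-2.1.0-bin-hadoop2.7/sbin/start-master.sh
Clients can then list both Masters when connecting, e.g. --master spark://bigdata121:7077,bigdata122:7077, so they reach whichever Master is currently alive.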
Spark job submission
Spark offers two ways to submit jobs: spark-submit and spark-shell.
Spark submit
Spark ships with many demo programs. Here we use spark-submit to run one of the official demos: a Monte Carlo estimation of Pi.
[root@bigdata121 ~]# /opt/module/spark-2.1.0-bin-hadoop2.7/bin/spark-submit \
--master spark://bigdata121:7077 \
--class org.apache.spark.examples.SparkPi \
/opt/module/spark-2.1.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.1.0.jar \
100
Explanation of the parameters in the command above:
--master spark://bigdata121:7077 specifies the Spark cluster to submit to.
--class org.apache.spark.examples.SparkPi specifies the main class to run.
/opt/module/spark-2.1.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.1.0.jar is the jar that contains the demo program.
100 is the number of simulation rounds; the more rounds, the more accurate the result.
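Besides --master and --class, spark-submit also accepts resource options. The values below are only illustrative, not tuned; the same SparkPi submission with capped executor resources would look like:
[root@bigdata121 ~]# /opt/module/spark-2.1.0-bin-hadoop2.7/bin/spark-submit \
--master spark://bigdata121:7077 \
--class org.apache.spark.examples.SparkPi \
--executor-memory 512m \
--total-executor-cores 2 \
/opt/module/spark-2.1.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.1.0.jar \
100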
When the job finishes, the driver prints the estimate as a line of the form "Pi is roughly 3.14...".
Spark shell
spark-shell is Spark's built-in interactive shell. It makes interactive development convenient: you can write Spark programs in Scala directly at the prompt.
If you start spark-shell without specifying a master address, the shell still starts and runs programs normally; it simply runs in Spark's local mode.
# No cluster specified, so local mode is started.
[root@bigdata121 ~]# /opt/module/spark-2.1.0-bin-hadoop2.7/bin/spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/01/01 16:27:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/01/01 16:27:36 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://192.168.80.121:4040
Spark context available as 'sc' (master = local[*], app id = local-1577867248175).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.0
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
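Inside the shell you can confirm which master it is connected to; sc.master returns the master URL, which in this session is the local one shown in the banner above:
scala> sc.master
res0: String = local[*]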
# With a cluster specified, the shell connects to that cluster.
[root@bigdata121 ~]# /opt/module/spark-2.1.0-bin-hadoop2.7/bin/spark-shell --master spark://bigdata121:7077
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/01/01 16:25:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/01/01 16:25:37 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://192.168.80.121:4040
Spark context available as 'sc' (master = spark://bigdata121:7077, app id = app-20200101162518-0003).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.0
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
Now let's write WordCount in spark-shell. The shell has already initialized sc for us, so we can use it directly.
scala> sc.textFile("/root/one").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect
res0: Array[(String, Int)] = Array((hadop,1), (is,1), (hello,1), (java,2), (my,1), (spark,1), (you,2), (name,2), (do,1), (jinrui,1))
scala>
The data processed above was a local file. When running in Spark cluster mode, the input file /root/one would first have to be copied to every worker node, which is why data read by cluster jobs is usually stored in HDFS. Below we start an HDFS cluster and then upload the /root/one file to HDFS.
[root@bigdata121 ~]# /opt/module/hadoop-2.8.4/sbin/start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [bigdata121]
bigdata121: starting namenode, logging to /opt/module/hadoop-2.8.4/logs/hadoop-root-namenode-bigdata121.out
bigdata121: starting datanode, logging to /opt/module/hadoop-2.8.4/logs/hadoop-root-datanode-bigdata121.out
Starting secondary namenodes [bigdata121]
bigdata121: starting secondarynamenode, logging to /opt/module/hadoop-2.8.4/logs/hadoop-root-secondarynamenode-bigdata121.out
starting yarn daemons
starting resourcemanager, logging to /opt/module/hadoop-2.8.4/logs/yarn-root-resourcemanager-bigdata121.out
bigdata121: starting nodemanager, logging to /opt/module/hadoop-2.8.4/logs/yarn-root-nodemanager-bigdata121.out
[root@bigdata121 ~]# hdfs dfs -put /root/one /one
[root@bigdata121 ~]# hdfs dfs -cat /one
my name is jinrui
do you
hello hadop java spark java
you name
[root@bigdata121 ~]#
WordCount reading data from HDFS
scala> sc.textFile("hdfs://bigdata121:9000/one").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect
res1: Array[(String, Int)] = Array((hadop,1), (is,1), (hello,1), (java,2), (my,1), (spark,1), (you,2), (name,2), (do,1), (jinrui,1))
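The listing below inspects an /oneOut directory in HDFS, so the result first has to be written back with saveAsTextFile; a command like the following (using the same paths as above) produces it:
scala> sc.textFile("hdfs://bigdata121:9000/one").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).saveAsTextFile("hdfs://bigdata121:9000/oneOut")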
Check the results in HDFS
[root@bigdata121 ~]# hdfs dfs -ls /oneOut
Found 3 items
-rw-r--r-- 3 root supergroup 0 2020-01-01 16:48 /oneOut/_SUCCESS
-rw-r--r-- 3 root supergroup 43 2020-01-01 16:48 /oneOut/part-00000
-rw-r--r-- 3 root supergroup 45 2020-01-01 16:48 /oneOut/part-00001
[root@bigdata121 ~]# hdfs dfs -cat /oneOut/part-00000
(hadop,1)
(is,1)
(hello,1)
(java,2)
(my,1)
[root@bigdata121 ~]# hdfs dfs -cat /oneOut/part-00001
(spark,1)
(you,2)
(name,2)
(do,1)
(jinrui,1)
[root@bigdata121 ~]#
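The output is split into part-00000 and part-00001 because the result RDD has two partitions, and each partition is written by its own task. If you want a single output file for a small result, you can ask reduceByKey for one partition, as in the sketch below (/oneOut1 is a hypothetical output path):
scala> sc.textFile("hdfs://bigdata121:9000/one").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_,1).saveAsTextFile("hdfs://bigdata121:9000/oneOut1")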
The above is a single command. In practice, if you want to inspect the result of each step, you can split it into separate steps.
scala> val rdd1=sc.textFile("hdfs://bigdata121:9000/one")
rdd1: org.apache.spark.rdd.RDD[String] = hdfs://bigdata121:9000/one MapPartitionsRDD[26] at textFile at <console>:24
scala> rdd1.collect
res6: Array[String] = Array(my name is jinrui, do you, hello hadop java spark java, you name)
scala> val rdd2=rdd1.flatMap(_.split(" "))
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[27] at flatMap at <console>:26
scala> val rdd3=rdd2.map((_,1))
rdd3: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[28] at map at <console>:28
scala> val rdd4=rdd3.reduceByKey(_+_)
rdd4: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[29] at reduceByKey at <console>:30
scala> rdd4.collect
res7: Array[(String, Int)] = Array((hadop,1), (is,1), (hello,1), (java,2), (my,1), (spark,1), (you,2), (name,2), (do,1), (jinrui,1))
scala>
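Note that textFile, flatMap, map, and reduceByKey are all lazy transformations: rdd1 through rdd4 are built instantly, and nothing is actually read or computed until an action such as collect or saveAsTextFile runs. You can print the lineage that an action would execute with toDebugString:
scala> rdd4.toDebugString
This prints the chain of RDDs (the ShuffledRDD and the MapPartitionsRDDs behind it) that collect evaluated above.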