Spark Standalone | YARN
Base Environment
- Disable the firewall
[root@centos ~]# service iptables stop # stop the firewall
iptables: Setting chains to policy ACCEPT: filter [ OK ]
iptables: Flushing firewall rules: [ OK ]
iptables: Unloading modules: [ OK ]
[root@centos ~]# chkconfig iptables off # disable start on boot
- Set the hostname
[root@centos ~]# cat /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=centos
- Map the hostname to its IP address
[root@centos ~]# vi /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.58.24 centos
- Reboot the system
- Configure passwordless SSH login
[root@centos ~]# ssh-keygen -t rsa
[root@centos ~]# ssh-copy-id centos
- Install the JDK and set JAVA_HOME
[root@centos ~]# rpm -ivh jdk-8u191-linux-x64.rpm
[root@centos ~]# vi ~/.bashrc
JAVA_HOME=/usr/java/latest
PATH=$PATH:$JAVA_HOME/bin
CLASSPATH=.
export JAVA_HOME
export PATH
export CLASSPATH
[root@centos ~]# source .bashrc
Install HDFS
[root@centos ~]# tar -zxf hadoop-2.9.2.tar.gz -C /usr/
[root@centos ~]# vi ~/.bashrc
HADOOP_HOME=/usr/hadoop-2.9.2
JAVA_HOME=/usr/java/latest
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
CLASSPATH=.
export JAVA_HOME
export PATH
export CLASSPATH
export HADOOP_HOME
[root@centos ~]# source .bashrc
[root@centos ~]# vi /usr/hadoop-2.9.2/etc/hadoop/core-site.xml
<!-- NameNode access endpoint -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://centos:9000</value>
</property>
<!-- base directory for HDFS working data -->
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/hadoop-2.9.2/hadoop-${user.name}</value>
</property>
[root@centos ~]# vi /usr/hadoop-2.9.2/etc/hadoop/slaves
centos
[root@centos ~]# vi /usr/hadoop-2.9.2/etc/hadoop/hdfs-site.xml
<!-- block replication factor -->
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<!-- host running the Secondary NameNode -->
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>centos:50090</value>
</property>
<!-- maximum number of concurrent DataNode transfer threads -->
<property>
<name>dfs.datanode.max.xcievers</name>
<value>4096</value>
</property>
<!-- DataNode request handler thread count -->
<property>
<name>dfs.datanode.handler.count</name>
<value>6</value>
</property>
[root@centos ~]# hdfs namenode -format # create the fsimage file required to start the NameNode
[root@centos ~]# start-dfs.sh
Install Spark
[root@centos ~]# tar -zxf spark-2.4.3-bin-without-hadoop.tgz -C /usr/
[root@centos ~]# mv /usr/spark-2.4.3-bin-without-hadoop/ /usr/spark-2.4.3
[root@centos ~]# cd /usr/spark-2.4.3/
[root@centos spark-2.4.3]# mv conf/slaves.template conf/slaves
[root@centos spark-2.4.3]# vi conf/slaves
centos
[root@centos spark-2.4.3]# mv conf/spark-env.sh.template conf/spark-env.sh
[root@centos spark-2.4.3]# vi conf/spark-env.sh
SPARK_MASTER_HOST=centos
SPARK_MASTER_PORT=7077
SPARK_WORKER_CORES=4
SPARK_WORKER_MEMORY=2g
SPARK_WORKER_INSTANCES=2
LD_LIBRARY_PATH=/usr/hadoop-2.9.2/lib/native
SPARK_DIST_CLASSPATH=$(hadoop classpath)
export SPARK_MASTER_HOST
export SPARK_MASTER_PORT
export SPARK_WORKER_CORES
export SPARK_WORKER_MEMORY
export SPARK_WORKER_INSTANCES
export LD_LIBRARY_PATH
export SPARK_DIST_CLASSPATH
[root@centos spark-2.4.3]# ./sbin/start-all.sh # needed only in Standalone mode
[root@centos spark-2.4.3]# jps
8064 Jps
2066 NameNode
2323 SecondaryNameNode
7912 Worker
7801 Master
7981 Worker
2157 DataNode
The master web UI is available at http://centos:8080/
Test the cluster's compute capability:
[root@centos spark-2.4.3]# ./bin/spark-shell
--master spark://centos:7077 # connect to the cluster Master
--deploy-mode client # Driver mode: must be client
--total-executor-cores 4 # compute resources to allocate
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://centos:4040
Spark context available as 'sc' (master = spark://centos:7077, app id = app-20190924232452-0000).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.3
/_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.
scala> sc.textFile("hdfs:///demo/words")
.flatMap(_.split(" "))
.map((_,1))
.groupBy(t=>t._1)
.map(t=>(t._1,t._2.size))
.sortBy(t=>t._2,false,4)
.saveAsTextFile("hdfs:///demo/results")
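The pipeline above can be sanity-checked without a cluster by applying the same operators to a plain Scala collection. This is only a sketch with made-up sample data; note that on a real RDD, `reduceByKey(_ + _)` is the more idiomatic and shuffle-efficient replacement for the `groupBy` + `map` pair used here.

```scala
object WordCountSketch {
  // Same operator chain as the spark-shell example, on a local Seq:
  // flatMap -> map to (word, 1) -> groupBy word -> count -> sort descending
  def wordCount(lines: Seq[String]): Seq[(String, Int)] =
    lines
      .flatMap(_.split(" "))
      .map((_, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => (word, pairs.size) }
      .toSeq
      .sortBy(-_._2)

  def main(args: Array[String]): Unit = {
    // hypothetical stand-in for the contents of hdfs:///demo/words
    val lines = Seq("hello spark", "hello hadoop", "spark on yarn")
    wordCount(lines).foreach(println)
  }
}
```

The local version materializes every group in memory, while `reduceByKey` combines partial counts on each partition before shuffling, which matters once the input no longer fits on one machine.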
YARN
- Edit yarn-site.xml
[root@centos ~]# vi /usr/hadoop-2.9.2/etc/hadoop/yarn-site.xml
<!-- shuffle service required by the MapReduce framework -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- host running the ResourceManager -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>centos</value>
</property>
<!-- disable the physical memory check -->
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<!-- disable the virtual memory check -->
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
- Edit mapred-site.xml
[root@centos ~]# mv /usr/hadoop-2.9.2/etc/hadoop/mapred-site.xml.template /usr/hadoop-2.9.2/etc/hadoop/mapred-site.xml
[root@centos ~]# vi /usr/hadoop-2.9.2/etc/hadoop/mapred-site.xml
<!-- resource manager implementation used by the MapReduce framework -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
- Start the YARN services
[root@centos ~]# start-yarn.sh
Install Spark
[root@CentOS ~]# tar -zxf spark-2.4.3-bin-without-hadoop.tgz -C /usr/
[root@CentOS ~]# mv /usr/spark-2.4.3-bin-without-hadoop/ /usr/spark-2.4.3
[root@CentOS ~]# cd /usr/spark-2.4.3/
[root@CentOS spark-2.4.3]# mv conf/spark-env.sh.template conf/spark-env.sh
[root@CentOS spark-2.4.3]# vi conf/spark-env.sh
HADOOP_CONF_DIR=/usr/hadoop-2.9.2/etc/hadoop
YARN_CONF_DIR=/usr/hadoop-2.9.2/etc/hadoop
SPARK_EXECUTOR_CORES=4
SPARK_EXECUTOR_MEMORY=1g
SPARK_DRIVER_MEMORY=1g
LD_LIBRARY_PATH=/usr/hadoop-2.9.2/lib/native
SPARK_DIST_CLASSPATH=$(hadoop classpath)
export HADOOP_CONF_DIR
export YARN_CONF_DIR
export SPARK_EXECUTOR_CORES
export SPARK_DRIVER_MEMORY
export SPARK_EXECUTOR_MEMORY
export LD_LIBRARY_PATH
export SPARK_DIST_CLASSPATH
Note: unlike Standalone mode, there is no need to run start-all.sh here, because job execution is handed off to YARN.
[root@centos spark-2.4.3]# ./bin/spark-shell
--master yarn # run on the YARN cluster
--deploy-mode client # Driver mode: must be client
--executor-cores 4 # cores per Executor process
--num-executors 2 # allocate 2 Executor processes
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/09/25 00:14:40 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
19/09/25 00:14:43 WARN hdfs.DataStreamer: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1252)
at java.lang.Thread.join(Thread.java:1326)
at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:980)
at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:630)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:807)
Spark context Web UI available at http://centos:4040
Spark context available as 'sc' (master = yarn, app id = application_1569341195065_0001).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.3
/_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.
scala> sc.textFile("hdfs:///demo/words")
.flatMap(_.split(" "))
.map((_,1))
.groupBy(t=>t._1)
.map(t=>(t._1,t._2.size))
.sortBy(t=>t._2,false,4)
.saveAsTextFile("hdfs:///demo/results")
Packaging and Deployment
Remote testing
- Add the Spark development dependency
<properties>
<spark.version>2.4.3</spark.version>
<scala.version>2.11</scala.version>
</properties>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.version}</artifactId>
<version>${spark.version}</version>
<!-- tells Maven to exclude this dependency when packaging the jar -->
<scope>provided</scope>
</dependency>
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

//1. Create the SparkContext
val sparkConf = new SparkConf()
.setAppName("wordcount")
.setMaster("spark://centos:7077")
val sc = new SparkContext(sparkConf)
//2. Create the distributed RDD
val lines:RDD[String] = sc.textFile("hdfs:///demo/words")
//3. Transform the dataset
val transformRDD:RDD[(String,Int)] = lines.flatMap(_.split(" "))
.map((_, 1))
.groupBy(t => t._1)
.map(t => (t._1, t._2.size))
.sortBy(t => t._2, false, 4)
//4. Trigger the job with an action on the RDD
transformRDD.saveAsTextFile("hdfs:///demo/results")
//5. Release resources
sc.stop()
- Add the Maven plugins
<!-- compile the Scala sources into the jar during the package phase -->
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>4.0.1</version>
<executions>
<execution>
<id>scala-compile-first</id>
<phase>process-resources</phase>
<goals>
<goal>add-source</goal>
<goal>compile</goal>
</goals>
</execution>
</executions>
</plugin>
<!-- bundle the dependency jars into the final jar -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.4.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration>
</execution>
</executions>
</plugin>
- Run the Maven package command
mvn package
- Two jars appear under the project's target directory
original-xxx-1.0-SNAPSHOT.jar // without third-party dependencies
xxx-1.0-SNAPSHOT.jar // bundles every jar except provided scope ---> fat jar
- Submit the Spark job with spark-submit
[root@centos spark-2.4.3]# ./bin/spark-submit \
--master spark://centos:7077 \
--deploy-mode cluster \
--class com.baizhi.demo01.SparkWordCount \
--driver-cores 2 \
--total-executor-cores 4 \
/root/rdd-1.0-SNAPSHOT.jar
Local testing
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

//1. Create the SparkContext
val sparkConf = new SparkConf()
.setAppName("wordcount")
.setMaster("local[6]")
val sc = new SparkContext(sparkConf)
//2. Create the distributed RDD
val lines:RDD[String] = sc.textFile("file:///D:/demo/words")
//3. Transform the dataset
val transformRDD:RDD[(String,Int)] = lines.flatMap(_.split(" "))
.map((_, 1))
.groupBy(t => t._1)
.map(t => (t._1, t._2.size))
.sortBy(t => t._2, false, 4)
//4. Trigger the job with an action on the RDD
transformRDD.saveAsTextFile("file:///D:/demo/results")
//5. Release resources
sc.stop()
Note: running locally requires no Spark cluster; simply comment out
<scope>provided</scope>
in the dependency.
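With that scope commented out, the dependency from the remote-testing section would read as follows (same coordinates; only the scope line changes):

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.version}</artifactId>
    <version>${spark.version}</version>
    <!-- <scope>provided</scope> disabled so Spark classes are on the local run classpath -->
</dependency>
```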
./bin/spark-shell
--master local[6] # run locally with 6 worker threads
--deploy-mode client # Driver mode: must be client
--total-executor-cores 4 # compute resources
History Server
Records the historical state of applications as they run.
- Append to spark-env.sh
[root@centos spark-2.4.3]# vi conf/spark-env.sh
SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs:///spark-logs"
export SPARK_HISTORY_OPTS
- Edit spark-defaults.conf
[root@centos spark-2.4.3]# mv conf/spark-defaults.conf.template conf/spark-defaults.conf
[root@centos spark-2.4.3]# vi conf/spark-defaults.conf
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs:///spark-logs
- Create the spark-logs directory on HDFS (where the History Server stores its data)
[root@centos ~]# hdfs dfs -mkdir /spark-logs
- Start the History Server
[root@centos spark-2.4.3]# ./sbin/start-history-server.sh
Verify the service is up at http://centos:18080