Course recommendation
[Heima Programmer (黑马程序员) complete Spark video course: 4 days from Spark 3.2 beginner to advanced, a Python-based Spark tutorial] https://www.bilibili.com/video/BV1Jq4y1z7VP
Course materials can be obtained by following the official account mentioned in the comment section. Recommended.
Tools
- finalshell 3.9.4
- Anaconda3-2021.05-Linux-x86_64.sh
- spark-3.2.0-bin-hadoop3.2.tgz
- 3 virtual machines: Hadoop cluster nodes node1, node2, node3
conda commands
# List the virtual environments
conda env list
# Remove the virtual environment pyspark_env
conda remove -n pyspark_env --all
# Create the virtual environment pyspark
conda create -n pyspark python=3.8
# Activate the virtual environment pyspark
conda activate pyspark
# Switch back to base
conda deactivate
windows-hosts
cd C:\Windows\System32\drivers\etc
notepad hosts
192.168.88.161 node1 node1.itcast.cn
192.168.88.162 node2 node2.itcast.cn
192.168.88.163 node3 node3.itcast.cn
1.spark-local
- Only node1 is used
- Essentially, one JVM process is started (a single process containing multiple threads) to execute the Tasks
- Role distribution:
  - master: the local process itself
  - worker: the local process itself
  - driver: the local process itself
  - executor: does not exist
- While a Spark application is running, it binds port 4040 on the machine where the Driver lives and serves a monitoring page for that application
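The role layout above can also be seen from a tiny standalone program. The sketch below is illustrative only (the file name and contents are not from the course) and assumes the pyspark package is importable, for example when the script is launched through spark-submit as in section 1.4.2: with the master set to local[*], everything runs inside one JVM, and that same process serves the 4040 UI while the job is alive.

```python
# local_demo.py -- hypothetical example, not part of the course material
from pyspark import SparkConf, SparkContext

# local[*]: master, worker and driver are all this single JVM process;
# tasks run as threads inside it, so there are no separate executors.
conf = SparkConf().setMaster("local[*]").setAppName("local-demo")
sc = SparkContext(conf=conf)

# While this job runs, the driver serves the monitoring UI on port 4040.
print(sc.parallelize(range(5)).map(lambda x: x * x).collect())

sc.stop()
```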
1.1.Anaconda installation
- Put Anaconda3-2021.05-Linux-x86_64.sh into /export/software/
sh /export/software/Anaconda3-2021.05-Linux-x86_64.sh
1. Press Enter once
2. Press Space to page through the license (several times)
3. When "Do you accept the license terms? [yes|no][no] >>> " appears, type yes
4. Enter the installation path /export/server/anaconda3; pressing Enter instead installs to the default path, which is /root/anaconda3
------ wait a moment until it asks about initialization ------
5. When asked "Do you wish the installer to initialize Anaconda3 by running conda init? [yes|no] >>>", type yes
- Disconnect and reconnect; the command-line prompt should now start with (base):
exit
- Switch to a domestic (China) mirror source by creating ~/.condarc
vim ~/.condarc
channels:
- defaults
show_channel_urls: true
default_channels:
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
custom_channels:
conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
- Create the virtual environment
conda create -n pyspark python=3.8
y
- Switch to the virtual environment
conda activate pyspark
1.2.Spark installation
- Download URL
https://dlcdn.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
- Put spark-3.2.0-bin-hadoop3.2.tgz into /export/software
tar -zxvf /export/software/spark-3.2.0-bin-hadoop3.2.tgz -C /export/server/
- Since the Spark directory name is long, create a symlink for it
ln -s /export/server/spark-3.2.0-bin-hadoop3.2 /export/server/spark
1.3.Environment variables
- Configure /etc/profile
| Variable | Meaning |
|---|---|
| SPARK_HOME | Spark installation path |
| JAVA_HOME | Java installation path |
| HADOOP_CONF_DIR | Hadoop configuration file directory |
| HADOOP_HOME | Hadoop installation path |
| PYSPARK_PYTHON | Path to the Python interpreter |
vi /etc/profile
Append:
export JAVA_HOME=/export/server/jdk1.8.0_241
export HADOOP_HOME=/export/server/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_HOME=/export/server/spark
export PYSPARK_PYTHON=/export/server/anaconda3/envs/pyspark/bin/python3.8
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin
source /etc/profile
~/.bashrc
vi ~/.bashrc
Append:
export JAVA_HOME=/export/server/jdk1.8.0_241
export PYSPARK_PYTHON=/export/server/anaconda3/envs/pyspark/bin/python3.8
1.4.Testing
1.4.1.pyspark
- The bin/pyspark program provides an interactive Python interpreter environment
- parallelize and map are APIs provided by Spark
cd /export/server/spark/bin
./pyspark --master local[*]
sc.parallelize([1,2,3,4,5]).map(lambda x:x+10).collect()
- Expected output
[11, 12, 13, 14, 15]
- View in a browser: node1:4040
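To confirm that the PYSPARK_PYTHON interpreter configured in section 1.3 is actually being used, a quick sanity check (a sketch, run inside the same bin/pyspark shell, where sc is already created) is:

```python
# Run inside the bin/pyspark shell; `sc` already exists there.
import sys
print(sys.executable)  # interpreter running the driver side

# Interpreter reported from inside a task; should match PYSPARK_PYTHON.
print(sc.parallelize([0]).map(lambda _: __import__("sys").executable).collect())
```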
1.4.2.spark-submit
- A sample program that estimates pi
cd /export/server/spark/bin
./spark-submit --master local[*] /export/server/spark/examples/src/main/python/pi.py 10
- Scroll up through the output to find the computed result
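pi.py is just an ordinary Python script; any file written the same way can be submitted with the same command. Below is a minimal, purely illustrative sketch (the file name my_wordcount.py and its contents are not part of the course material):

```python
# my_wordcount.py -- hypothetical example; submit with:
#   ./spark-submit --master local[*] /path/to/my_wordcount.py
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="wordcount-demo")
    lines = sc.parallelize(["hello spark", "hello yarn", "spark on yarn"])
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b)
                   .collect())
    for word, n in counts:
        print(word, n)
    sc.stop()
```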
2.spark on yarn
- node1 here has already completed steps 1.1-1.3
2.1.Configuration files
- Configure spark-env.sh
cd /export/server/spark/conf
mv spark-env.sh.template spark-env.sh
vi spark-env.sh
## Hadoop configuration file directory, used to read files on HDFS and to run on the YARN cluster
HADOOP_CONF_DIR=/export/server/hadoop/etc/hadoop
YARN_CONF_DIR=/export/server/hadoop/etc/hadoop
2.2.Start Hadoop
- Prerequisites for starting the Hadoop cluster
  - Windows hosts file
  - Linux hosts file
  - passwordless SSH login
  - Java 1.8 installed
  - Hadoop 3.3.0 installed
  - Hadoop and YARN configured
/export/server/hadoop/sbin/start-all.sh
2.3.Python environment on the worker nodes
Complete step 1.1 on node2 and node3
2.4.Testing
2.4.1.What the official docs say about the two deploy modes
Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster. These configs are used to write to HDFS and connect to the YARN ResourceManager.
There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
Unlike other cluster managers supported by Spark in which the master's address is specified in the --master parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration. Thus, the --master parameter is yarn.
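The choice between the two modes only affects where the driver process runs; the application code itself is unchanged, and the mode is selected entirely on the spark-submit command line. As an illustrative sketch (hypothetical file check_mode.py, not from the course), an application can report which master and deploy mode it was launched with:

```python
# check_mode.py -- hypothetical example; submit with either
#   ./spark-submit --master yarn --deploy-mode client  /path/to/check_mode.py
#   ./spark-submit --master yarn --deploy-mode cluster /path/to/check_mode.py
from pyspark import SparkContext

sc = SparkContext(appName="check-mode")
print("master:", sc.master)  # "yarn" -- the RM address comes from the Hadoop config
print("deploy mode:", sc.getConf().get("spark.submit.deployMode", "client"))
sc.stop()
```

In client mode the two print lines show up directly in the submitting terminal; in cluster mode they end up in the driver container's stdout, which is what section 2.4.3.2 views through the stdout link of the driver row.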
2.4.2.pyspark
cd /export/server/spark/bin
./pyspark --master yarn
sc.parallelize([1,2,3,4,5]).map(lambda x:x+10).collect()
Note: the interactive environments pyspark and spark-shell cannot run in cluster mode
2.4.3.spark-submit
2.4.3.1.client mode
cd /export/server/spark/bin
./spark-submit --master yarn --deploy-mode client --driver-memory 512m --executor-memory 512m --num-executors 4 --executor-cores 1 /export/server/spark/examples/src/main/python/pi.py 100
--deploy-mode deploy mode (client or cluster)
--driver-memory driver memory
--executor-memory memory per executor
--num-executors number of executors
--executor-cores number of cores per executor
- View in a browser: node1:4040
- Scroll up through the logs to see the computed result
2.4.3.2.cluster mode
./spark-submit --master yarn --deploy-mode cluster --driver-memory 512m --executor-memory 512m --num-executors 4 --executor-cores 1 /export/server/spark/examples/src/main/python/pi.py 100
- View in a browser: node1:4040
- Click the stdout link on the right of the driver row on that page to see the result. It is only visible for a few seconds; once the application finishes, the 4040 page can no longer be opened (if YARN log aggregation is enabled, the driver output can usually still be retrieved afterwards with yarn logs -applicationId <application id>)