Quick Spark Deployment with Docker on Windows 11
Environment:
- OS: Windows 11
- Shell: PowerShell
- Docker Desktop for Windows (bundles the Docker CLI client and Docker Compose)
- Spark Docker image: bitnami-docker-spark, Spark version 3.3.0
- Hadoop version: hadoop-3.3.5
Part 1: Quickly deploying PySpark
docker pull jupyter/pyspark-notebook
docker run --name pyspark --rm -v D:\spark:/home/jovyan -p 8888:8888 jupyter/pyspark-notebook
Then open the URL printed in the container logs in a browser to reach the notebook.
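On startup, Jupyter prints a login URL (including an access token) to the container logs, viewable with `docker logs pyspark`. A short script can pull the URL out of such a line; the log line below is illustrative, with a made-up token, not literal output:

```python
import re

# Illustrative log line in the shape Jupyter prints on startup;
# the token value here is made up.
log = "    http://127.0.0.1:8888/lab?token=abc123def456"

# Pull out the first http(s) URL in the line.
match = re.search(r"https?://\S+", log)
url = match.group(0) if match else None
print(url)  # the address to open in the browser
```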
Related documentation:
User Guide — PySpark 3.2.1 documentation (apache.org)
Part 2: Deploying Spark with Docker Compose
References:
GitHub - sshaik/bitnami-docker-spark: Bitnami Docker Image for Apache Spark
Quickly deploy a Spark + Hadoop big-data cluster with Docker - Zhihu (zhihu.com)
Quickly deploy a Spark + Hadoop big-data cluster with Docker | 芥子屋 (s1mple.cc)
1. Pull the image
docker pull bitnami/spark:3
2. Check the Docker Compose version
PS C:\Windows\system32> docker-compose -v
Docker Compose version v2.17.2
3. Create docker-compose.yml
Create a docker-compose.yml in a new local directory (any path will do):
version: '2'
services:
  spark:
    image: docker.io/bitnami/spark:3
    hostname: master
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - D:/Docker/spark/share:/opt/share
    ports:
      - '8080:8080'
      - '4040:4040'
  spark-worker-1:
    image: docker.io/bitnami/spark:3
    hostname: worker1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - D:/Docker/spark/share:/opt/share
    ports:
      - '8081:8081'
  spark-worker-2:
    image: docker.io/bitnami/spark:3
    hostname: worker2
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - D:/Docker/spark/share:/opt/share
    ports:
      - '8082:8081'
Notes on the file:
- version: the Compose file format version;
- hostname: the hostname of the container instance;
- volumes: mounts the local directory D:/Docker/spark/share to /opt/share inside the container;
- ports: publishes port 4040 and the workers' Spark Web UI ports.
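Every worker serves its Web UI on container port 8081, but published host ports on a single Docker host must be unique, which is why worker2 maps 8082 to 8081. A minimal sketch of that numbering scheme (the helper name is mine, not part of Compose):

```python
# Each Spark worker serves its Web UI on container port 8081.
# On one Docker host the published (host) ports must be unique,
# so worker i gets host port 8081 + i -> container port 8081.
def worker_port_mappings(num_workers: int, base: int = 8081) -> list[str]:
    """Return Compose-style 'host:container' port strings for the workers."""
    return [f"{base + i}:8081" for i in range(num_workers)]

print(worker_port_mappings(2))  # ['8081:8081', '8082:8081']
```

This reproduces the two worker entries in the file above; a third worker would get '8083:8081'.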
Back in PowerShell, change to the directory containing docker-compose.yml:
cd D:\Docker\spark
4. Start the Spark Docker cluster
PS D:\Docker\spark> docker-compose up -d
[+] Running 3/3
✔ Container spark-spark-1 Started 1.9s
✔ Container spark-spark-worker-2-1 Started 2.4s
✔ Container spark-spark-worker-1-1 Started
Open http://localhost:8080/ in a browser to view the Spark master Web UI.
5. Determine the Hadoop version
PS D:\Docker\spark> docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b905de0c118b bitnami/spark:3 "/opt/bitnami/script…" 2 hours ago Up 12 minutes 0.0.0.0:8081->8081/tcp spark-spark-worker-1-1
33be43384bf3 bitnami/spark:3 "/opt/bitnami/script…" 2 hours ago Up 12 minutes 0.0.0.0:4040->4040/tcp, 0.0.0.0:8080->8080/tcp spark-spark-1
47c4bf45f4ca bitnami/spark:3 "/opt/bitnami/script…" 2 hours ago Up 12 minutes 0.0.0.0:8082->8081/tcp spark-spark-worker-2-1
Attach to the master container:
PS D:\Docker\spark> docker exec -it 33be43384bf3 /bin/bash
pyspark
Find the download URL for the corresponding Hadoop release, for example:
https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/stable/hadoop-3.3.5.tar.gz
6. Prepare the configuration files and startup script
GitHub - s1mplecc/spark-hadoop-docker
Create a config directory and add the required files (download them from the repository above).
Inspect config:
PS D:\Docker\spark> tree /F config
Write the Hadoop startup script. Because passwordless SSH is configured, the script starts the ssh service first and then brings up the HDFS and YARN clusters in turn:
#!/bin/bash
service ssh start
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
7. Build a new image based on bitnami/spark
In the working directory, create a Dockerfile for building the new image. It:
- sets the Hadoop environment variables;
- configures passwordless SSH between cluster nodes: the public key generated by ssh-keygen is appended to authorized_keys; since all containers are created from the same image, every node has the same key pair and its authorized_keys contains its own public key;
- downloads and unpacks the Hadoop 3.3.5 tarball;
- creates the HDFS NameNode and DataNode working directories;
- overwrites the Hadoop configuration files under $HADOOP_CONF_DIR;
- copies the Hadoop startup script and marks it executable;
- formats the HDFS filesystem;
- starts the ssh service in the entrypoint script.
Note the Hadoop tarball download URL below.
FROM docker.io/bitnami/spark:3
LABEL maintainer="s1mplecc <s1mple951205@gmail.com>"
LABEL description="Docker image with Spark (3.3.0) and Hadoop (3.3.5), based on bitnami/spark:3. \
For more information, please visit https://github.com/s1mplecc/spark-hadoop-docker."
USER root
ENV HADOOP_HOME="/opt/hadoop"
ENV HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"
ENV HADOOP_LOG_DIR="/var/log/hadoop"
ENV PATH="$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$PATH"
WORKDIR /opt
RUN apt-get update && apt-get install -y openssh-server
RUN ssh-keygen -t rsa -f /root/.ssh/id_rsa -P '' && \
cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
RUN curl -OL https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/stable/hadoop-3.3.5.tar.gz
RUN tar -xzvf hadoop-3.3.5.tar.gz && \
mv hadoop-3.3.5 hadoop && \
rm -rf hadoop-3.3.5.tar.gz && \
mkdir /var/log/hadoop
RUN mkdir -p /root/hdfs/namenode && \
mkdir -p /root/hdfs/datanode
COPY config/* /tmp/
RUN mv /tmp/ssh_config /root/.ssh/config && \
mv /tmp/hadoop-env.sh $HADOOP_CONF_DIR/hadoop-env.sh && \
mv /tmp/hdfs-site.xml $HADOOP_CONF_DIR/hdfs-site.xml && \
mv /tmp/core-site.xml $HADOOP_CONF_DIR/core-site.xml && \
mv /tmp/mapred-site.xml $HADOOP_CONF_DIR/mapred-site.xml && \
mv /tmp/yarn-site.xml $HADOOP_CONF_DIR/yarn-site.xml && \
mv /tmp/workers $HADOOP_CONF_DIR/workers
COPY start-hadoop.sh /opt/start-hadoop.sh
RUN chmod +x /opt/start-hadoop.sh && \
chmod +x $HADOOP_HOME/sbin/start-dfs.sh && \
chmod +x $HADOOP_HOME/sbin/start-yarn.sh
RUN hdfs namenode -format
RUN sed -i "1 a /etc/init.d/ssh start > /dev/null &" /opt/bitnami/scripts/spark/entrypoint.sh
ENTRYPOINT [ "/opt/bitnami/scripts/spark/entrypoint.sh" ]
CMD [ "/opt/bitnami/scripts/spark/run.sh" ]
From the working directory, run the following command to build the image:
PS D:\Docker\spark> docker build -t s1mplecc/spark-hadoop:3 .
8. Start the spark-hadoop cluster
After the image is built, modify docker-compose.yml so the cluster starts from the new image s1mplecc/spark-hadoop:3, and additionally publish the Hadoop Web UI ports.
If you prefer not to build locally, pull the prebuilt image instead:
docker pull s1mplecc/spark-hadoop:3
Then update docker-compose.yml:
version: '2'
services:
  spark:
    image: s1mplecc/spark-hadoop:3
    hostname: master
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - D:/Docker/spark/share:/opt/share
    ports:
      - '8080:8080'
      - '4040:4040'
      - '8088:8088'
      - '8042:8042'
      - '9870:9870'
      - '19888:19888'
  spark-worker-1:
    image: s1mplecc/spark-hadoop:3
    hostname: worker1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - D:/Docker/spark/share:/opt/share
    ports:
      - '8081:8081'
  spark-worker-2:
    image: s1mplecc/spark-hadoop:3
    hostname: worker2
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - D:/Docker/spark/share:/opt/share
    ports:
      - '8082:8081'
Run the docker-compose start command to recreate the cluster; there is no need to stop or remove the old one first.
docker-compose up -d
Attach to the master node and start Hadoop:
docker ps
docker exec -it 322da32f44cb bash
./start-hadoop.sh
Warnings during startup can be ignored.
9. Testing
Note: if problems come up during testing, consider copying the relevant files from GitHub - s1mplecc/spark-hadoop-docker into the corresponding local directories.
Write words.txt from the shared directory into HDFS:
hadoop fs -put /opt/share/words.txt /
hdfs dfs -ls /
Warnings here can also be ignored.
Browse the HDFS filesystem at http://localhost:9870
Accessing HDFS from Spark
pyspark
>>> lines = sc.textFile('hdfs://master:9000/words.txt')
>>> lines.collect()
>>> words = lines.flatMap(lambda x: x.split(' '))
>>> words.saveAsTextFile('hdfs://master:9000/split-words.txt')
>>> exit()
The file on HDFS is read as an RDD, transformed in memory, and the result is written back to HDFS, where it is stored across block partitions on the DataNodes.
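The flatMap step in the pyspark session above maps each line to a list of words and flattens the result one level. A local, non-distributed sketch of that transformation in plain Python (the sample lines here are illustrative, not the contents of words.txt):

```python
# Local equivalent of: lines.flatMap(lambda x: x.split(' '))
lines = ["hello spark", "hello hadoop"]  # stands in for the RDD of file lines

# flatMap = map each line to a list of words, then flatten one level
words = [word for line in lines for word in line.split(" ")]
print(words)  # ['hello', 'spark', 'hello', 'hadoop']
```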
hdfs dfs -ls /
hdfs dfs -ls /split-words.txt
hdfs dfs -cat /split-words.txt/part-00000
Submitting a Spark application to the YARN cluster
spark-submit --master yarn --deploy-mode cluster --name "Word Count" --executor-memory 1g --class org.apache.spark.examples.JavaWordCount /opt/bitnami/spark/examples/jars/spark-examples_2.12-3.2.0.jar /words.txt
(Adjust the examples jar filename to match the Spark version inside the image, and point the last argument at the input file uploaded earlier.)
The application shows up in the YARN ResourceManager UI: http://localhost:8088/cluster/apps/
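The JavaWordCount example tokenizes the input on spaces and counts occurrences of each word. The same computation in local Python, as a sketch of what the submitted job produces (the sample text stands in for the HDFS input file):

```python
from collections import Counter

text = "hello spark hello hadoop"  # stands in for the HDFS input file
words = text.split(" ")            # tokenize on spaces
counts = Counter(words)            # word -> occurrence count
print(counts)  # Counter({'hello': 2, 'spark': 1, 'hadoop': 1})
```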
Summary of Web UIs

| Web UI | Default URL | Notes |
|---|---|---|
| Spark Application * | http://localhost:4040 | Started by the SparkContext; shows Spark applications running in local or Standalone mode |
| Spark Standalone Master | http://localhost:8080 | Shows cluster status and Spark applications submitted in Standalone mode |
| HDFS NameNode * | http://localhost:9870 | Browse the HDFS filesystem |
| YARN ResourceManager * | http://localhost:8088 | Shows Spark applications submitted to YARN |
| YARN NodeManager | http://localhost:8042 | Shows worker-node configuration and runtime logs |
| MapReduce Job History | http://localhost:19888 | MapReduce job history |

Note: the UIs marked with an asterisk are the most commonly used.
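For scripting against the cluster, the table above can be captured as a small lookup helper (the dictionary and function names are mine; the ports assume the docker-compose.yml in this guide):

```python
# Web UI ports as published by the docker-compose.yml in this guide.
WEB_UIS = {
    "Spark Application": 4040,
    "Spark Standalone Master": 8080,
    "HDFS NameNode": 9870,
    "YARN ResourceManager": 8088,
    "YARN NodeManager": 8042,
    "MapReduce Job History": 19888,
}

def url_for(service: str) -> str:
    """Build the localhost URL for one of the published Web UIs."""
    return f"http://localhost:{WEB_UIS[service]}"

print(url_for("HDFS NameNode"))  # http://localhost:9870
```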