Spark on Kubernetes with External Filesystems (Storage Systems)
GlusterFS
Setting up the GlusterFS environment
My GlusterFS environment
Requirement | Value |
---|---|
Number of machines | 3 |
Operating system | CentOS 7.3 |
hostname | The three machines are named glusterfs-spark-1, glusterfs-spark-2, and glusterfs-spark-3 |
Installing GlusterFS
```
# run on all three machines
yum search gluster
# yum -y install centos-release-gluster${desired version}.noarch
yum -y install centos-release-gluster7.noarch
yum install -y glusterfs glusterfs-server glusterfs-fuse glusterfs-rdma
systemctl start glusterd.service
systemctl status glusterd.service
```
Connecting the GlusterFS nodes
```
# run on glusterfs-spark-1
gluster peer probe glusterfs-spark-2
gluster peer probe glusterfs-spark-3
gluster peer status
```
Creating a GlusterFS volume
```
# GlusterFS recommends not backing a volume with the system disk; if you insist, append force to the command as shown below
gluster volume create test replica 3 transport tcp glusterfs-spark-1:/data/gluster glusterfs-spark-2:/data/gluster glusterfs-spark-3:/data/gluster force
```
Starting the GlusterFS volume
```
gluster volume start test
```
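Before mounting, it is worth confirming the volume came up; `gluster volume info` and `gluster volume status` are the standard checks:

```shell
# show the volume's configuration (type, brick list, options)
gluster volume info test
# show the runtime status of the bricks and their processes
gluster volume status test
```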
Mounting GlusterFS on the Kubernetes nodes
Add the hostname of every machine in the GlusterFS cluster to /etc/hosts on each Kubernetes node:
```
${IP 1} glusterfs-spark-1
${IP 2} glusterfs-spark-2
${IP 3} glusterfs-spark-3
```
Install the GlusterFS client on all Kubernetes nodes
```
yum -y install centos-release-gluster7.noarch
yum -y install glusterfs-cli
yum -y install glusterfs-fuse.x86_64
```
Mount the GlusterFS volume test on the Kubernetes nodes
```
mkdir -p /data/glusterfs
glusterfs --volfile-server=10.80.0.133 --volfile-id=test /data/glusterfs
```
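Alternatively, the same mount can be done through `mount -t glusterfs`, and an fstab entry keeps it across reboots (server name and mount point taken from the setup above):

```shell
# equivalent mount via mount(8) and the glusterfs fuse helper
mount -t glusterfs glusterfs-spark-1:/test /data/glusterfs

# persist across reboots; _netdev delays mounting until the network is up
echo "glusterfs-spark-1:/test /data/glusterfs glusterfs defaults,_netdev 0 0" >> /etc/fstab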
Running the Spark word count program on GlusterFS
Upload the jar required by the word count program to GlusterFS.
Word count code
```
package spark.wordcount

import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Spark Word Count")
    val sc = new SparkContext(conf)
    val startTime: Long = sc.startTime
    println(startTime)
    // glusterfs part: split each line into words so we count words, not lines
    val words = sc.textFile("/mnt/glusterfs/data/wordcount/")
    words.flatMap(_.split("\\s+")).map(w => (w, 1)).reduceByKey(_ + _).collect().foreach(println)
    // without this, the Spark executors on Kubernetes never terminate
    sc.stop()
  }
}
```
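As a quick sanity check of what the job computes, the same per-word counting can be sketched locally with standard shell tools (the sample file path is just an example):

```shell
# create a tiny sample input
printf 'hello world\nhello spark\n' > /tmp/wc-demo.txt
# split on whitespace, then count occurrences of each word, most frequent first
tr -s '[:space:]' '\n' < /tmp/wc-demo.txt | sort | uniq -c | sort -rn
# the first output line is "2 hello"
```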
spark-submit shell script
```
$SPARK_HOME/bin/spark-submit \
--master k8s://http://127.0.0.1:8001 \
--deploy-mode cluster \
--name spark-word-count-on-kubernetes \
--class spark.wordcount.SparkWordCount \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=172.16.80.150:5000/apache/spark:2.4.5 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.driver.volumes.hostPath.glusterfs.mount.path=/mnt/glusterfs \
--conf spark.kubernetes.driver.volumes.hostPath.glusterfs.mount.readOnly=false \
--conf spark.kubernetes.driver.volumes.hostPath.glusterfs.options.path=/data/glusterfs \
--conf spark.kubernetes.executor.volumes.hostPath.glusterfs.mount.path=/mnt/glusterfs \
--conf spark.kubernetes.executor.volumes.hostPath.glusterfs.mount.readOnly=false \
--conf spark.kubernetes.executor.volumes.hostPath.glusterfs.options.path=/data/glusterfs \
local:///mnt/glusterfs/jars/spark-gluster.jar
```
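After submission, the driver pod's progress can be followed with kubectl (Spark on Kubernetes labels the driver pod with spark-role=driver; the pod name below is a placeholder):

```shell
# list the driver pod created by spark-submit
kubectl get pods -l spark-role=driver
# follow its logs; substitute the pod name listed above
kubectl logs -f <driver-pod-name>
```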
HDFS
Setting up the Hadoop environment
Download the Hadoop 2.7.7 offline tarball and JDK 1.8
Extract them and configure HADOOP_HOME and JAVA_HOME
Edit the Hadoop configuration files
- core-site.xml
```
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://${namenode IP}:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/tmp</value>
  </property>
</configuration>
```
- hadoop-env.sh
- Set JAVA_HOME to the path your local JAVA_HOME points to
- hdfs-site.xml
```
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/hadoop/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///data/hadoop/datanode</value>
  </property>
  <property>
    <name>dfs.http.address</name>
    <value>0.0.0.0:50070</value>
  </property>
  <property>
    <name>dfs.permissions.enabled</name>
    <value>true</value>
    <description>Controls whether Hadoop may create and modify directories, etc.</description>
  </property>
  <property>
    <name>dfs.datanode.max.transfer.threads</name>
    <value>8192</value>
    <description>Fixes "java.io.IOException: Premature EOF from inputStream"</description>
  </property>
</configuration>
```
Format the NameNode
hadoop namenode -format
Start HDFS
$HADOOP_HOME/sbin/start-dfs.sh
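Before running jobs it is worth confirming the daemons came up; `jps` and `hdfs dfsadmin -report` are the usual checks:

```shell
# NameNode/DataNode/SecondaryNameNode should appear in the JVM process list
jps
# cluster capacity and the number of live datanodes
hdfs dfsadmin -report
```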
Running the Spark word count program on HDFS
Word count code
```
package spark.wordcount

import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Spark Word Count")
    val sc = new SparkContext(conf)
    val startTime: Long = sc.startTime
    println(startTime)
    val words = sc.textFile("hdfs://namenode:9000/data/test/")
    val result = words.flatMap(line => line.split("\\s|,"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    // sortByKey returns a new RDD; chain it before collecting, or the sort is lost
    result.sortByKey().collect().foreach(println)
    sc.stop()
  }
}
```
Package the word count code into a jar and upload it to the designated HDFS path.
Jar path: hdfs://10.80.0.133:9000/jars/
Upload the data file to its designated path.
- File path
hdfs://namenode:9000/data/test/
- File contents
hello world we are using spark on kubernetes using hdfs as filesystem hope it will be successed
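The uploads themselves can be done with the hadoop fs CLI; the local file names here (spark-wordcount.jar, wordcount.txt) are examples:

```shell
# create the target directories
hadoop fs -mkdir -p /jars /data/test
# upload the application jar and the input data
hadoop fs -put spark-wordcount.jar /jars/
hadoop fs -put wordcount.txt /data/test/
# verify both uploads landed
hadoop fs -ls /jars /data/test
```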
spark-submit shell script
```
$SPARK_HOME/bin/spark-submit \
--master k8s://http://127.0.0.1:8001 \
--deploy-mode cluster \
--name spark-word-count-on-kubernetes \
--class spark.wordcount.SparkWordCount \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=172.16.80.150:5000/apache/spark:2.4.5 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
hdfs://${hdfs namenode ip}:${hdfs namenode port}/jars/${jar name} # e.g. hdfs://namenode:9000/jars/test.jar
```
Possible issues
- An "unknown hostname" error in executors on Kubernetes with GlusterFS
- Log symptom:
0-resolver: getaddrinfo failed (Name or service not known)
- Most likely the /etc/hosts file on the executor's node is missing the GlusterFS hostname entries
- Still unresolved: passing a file path as an argument in the submit script
- spark-submit is currently used in cluster mode; client mode was tested and did not work well
Summary
- GlusterFS is used exactly like a local filesystem
- HDFS works without any extra configuration
Extras
- Speeding up shuffle by setting SPARK_LOCAL_DIRS
How to set it (multiple pairs can be configured; hostPath volumes backed by local disks work best):
For example:
```
spark.kubernetes.driver.volumes.[VolumeType].spark-local-dir-[VolumeName].mount.path=<mount path>
spark.kubernetes.driver.volumes.[VolumeType].spark-local-dir-[VolumeName].options.path=<host path>
```
```
spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.path=/opt/cache/1
spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.path=/data/1
spark.kubernetes.driver.volumes.hostPath.spark-local-dir-2.mount.path=/opt/cache/2
spark.kubernetes.driver.volumes.hostPath.spark-local-dir-2.options.path=/data/2
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path=/opt/cache/1
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path=/data/1
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.path=/opt/cache/2
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.options.path=/data/2
```