Spark on Kubernetes with External Filesystems (Storage Systems)
GlusterFS
Setting up the GlusterFS environment
My GlusterFS environment
Requirement | Value |
---|---|
Number of machines | 3 |
Operating system | CentOS 7.3 |
hostname | The three machines are named glusterfs-spark-1, glusterfs-spark-2, and glusterfs-spark-3 |
Installing GlusterFS
```
# run on all three machines
yum search gluster
# yum -y install centos-release-gluster${desired version}.noarch
yum -y install centos-release-gluster7.noarch
yum install -y glusterfs glusterfs-server glusterfs-fuse glusterfs-rdma
systemctl start glusterd.service
systemctl status glusterd.service
```
Connecting the GlusterFS nodes
```
# run on glusterfs-spark-1
gluster peer probe glusterfs-spark-2
gluster peer probe glusterfs-spark-3
gluster peer status
```
Creating a GlusterFS volume
```
# GlusterFS recommends not backing a volume with the system disk; if you insist, append force to the command as shown below
gluster volume create test replica 3 transport tcp glusterfs-spark-1:/data/gluster glusterfs-spark-2:/data/gluster glusterfs-spark-3:/data/gluster force
```
Starting the GlusterFS volume
```
gluster volume start test
```
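Before mounting, it is worth confirming the volume came up; `gluster volume info` and `gluster volume status` are the standard checks:

```shell
# show the volume's configuration (type, brick list, options)
gluster volume info test
# show the runtime status of the bricks and their processes
gluster volume status test
```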
Mounting GlusterFS on the Kubernetes nodes
Add the hostname of every machine in the GlusterFS cluster to /etc/hosts on each Kubernetes node:
```
${IP 1} glusterfs-spark-1
${IP 2} glusterfs-spark-2
${IP 3} glusterfs-spark-3
```
Install the GlusterFS client on all Kubernetes nodes
```
yum -y install centos-release-gluster7.noarch
yum -y install glusterfs-cli
yum -y install glusterfs-fuse.x86_64
```
Mount the GlusterFS volume test on the Kubernetes nodes
```
mkdir -p /data/glusterfs
glusterfs --volfile-server=10.80.0.133 --volfile-id=test /data/glusterfs
```
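Alternatively, the same mount can be done through `mount -t glusterfs`, and an fstab entry keeps it across reboots (server name and mount point taken from the setup above):

```shell
# equivalent mount via mount(8) and the glusterfs fuse helper
mount -t glusterfs glusterfs-spark-1:/test /data/glusterfs

# persist across reboots; _netdev delays mounting until the network is up
echo "glusterfs-spark-1:/test /data/glusterfs glusterfs defaults,_netdev 0 0" >> /etc/fstab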
Running the Spark word count program on GlusterFS
Upload the jar required by the word count program to GlusterFS.
Word count code
```
package spark.wordcount

import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Spark Word Count")
    val sc = new SparkContext(conf)
    val startTime: Long = sc.startTime
    println(startTime)
    // glusterfs part: split each line into words so we count words, not lines
    val words = sc.textFile("/mnt/glusterfs/data/wordcount/")
    words.flatMap(_.split("\\s+")).map(w => (w, 1)).reduceByKey(_ + _).collect().foreach(println)
    // without this, the Spark executors on Kubernetes never terminate
    sc.stop()
  }
}
```
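As a quick sanity check of what the job computes, the same per-word counting can be sketched locally with standard shell tools (the sample file path is just an example):

```shell
# create a tiny sample input
printf 'hello world\nhello spark\n' > /tmp/wc-demo.txt
# split on whitespace, then count occurrences of each word, most frequent first
tr -s '[:space:]' '\n' < /tmp/wc-demo.txt | sort | uniq -c | sort -rn
# the first output line is "2 hello"
```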
spark-submit shell script
```
$SPARK_HOME/bin/spark-submit \
--master k8s://http://127.0.0.1:8001 \
--deploy-mode cluster \
--name spark-word-count-on-kubernetes \
--class spark.wordcount.SparkWordCount \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=172.16.80.150:5000/apache/spark:2.4.5 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.driver.volumes.hostPath.glusterfs.mount.path=/mnt/glusterfs \
--conf spark.kubernetes.driver.volumes.hostPath.glusterfs.mount.readOnly=false \
--conf spark.kubernetes.driver.volumes.hostPath.glusterfs.options.path=/data/glusterfs \
--conf spark.kubernetes.executor.volumes.hostPath.glusterfs.mount.path=/mnt/glusterfs \
--conf spark.kubernetes.executor.volumes.hostPath.glusterfs.mount.readOnly=false \
--conf spark.kubernetes.executor.volumes.hostPath.glusterfs.options.path=/data/glusterfs \
local:///mnt/glusterfs/jars/spark-gluster.jar
```
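After submission, the driver pod's progress can be followed with kubectl (Spark on Kubernetes labels the driver pod with spark-role=driver; the pod name below is a placeholder):

```shell
# list the driver pod created by spark-submit
kubectl get pods -l spark-role=driver
# follow its logs; substitute the pod name listed above
kubectl logs -f <driver-pod-name>
```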
HDFS
Setting up the Hadoop environment
Download the Hadoop 2.7.7 offline tarball and JDK 1.8
Extract them and configure HADOOP_HOME and JAVA_HOME
Edit the Hadoop configuration files
- core-site.xml
```
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://${namenode IP}:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/tmp</value>
  </property>
</configuration>
```
- hadoop-env.sh
- Set JAVA_HOME to the path your local JAVA_HOME points to
- hdfs-site.xml
```
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/hadoop/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///data/hadoop/datanode</value>
  </property>
  <property>
    <name>dfs.http.address</name>
    <value>0.0.0.0:50070</value>
  </property>
  <property>
    <name>dfs.permissions.enabled</name>
    <value>true</value>
    <description>Controls whether Hadoop may create and modify directories, etc.</description>
  </property>
  <property>
    <name>dfs.datanode.max.transfer.threads</name>
    <value>8192</value>
    <description>Fixes "java.io.IOException: Premature EOF from inputStream"</description>
  </property>
</configuration>
```
Format the NameNode
hadoop namenode -format
Start HDFS
$HADOOP_HOME/sbin/start-dfs.sh
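Before running jobs it is worth confirming the daemons came up; `jps` and `hdfs dfsadmin -report` are the usual checks:

```shell
# NameNode/DataNode/SecondaryNameNode should appear in the JVM process list
jps
# cluster capacity and the number of live datanodes
hdfs dfsadmin -report
```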
Running the Spark word count program on HDFS
Word count code
```
package spark.wordcount

import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Spark Word Count")
    val sc = new SparkContext(conf)
    val startTime: Long = sc.startTime
    println(startTime)
    val words = sc.textFile("hdfs://namenode:9000/data/test/")
    val result = words.flatMap(line => line.split("\\s|,"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    // sortByKey returns a new RDD; chain it before collecting, or the sort is lost
    result.sortByKey().collect().foreach(println)
    sc.stop()
  }
}
```
Package the word count code into a jar and upload it to the designated HDFS path.
Jar path: hdfs://10.80.0.133:9000/jars/
Upload the data file to its designated path.
- File path
hdfs://namenode:9000/data/test/
- File contents
hello world we are using spark on kubernetes using hdfs as filesystem hope it will be successed
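The uploads themselves can be done with the hadoop fs CLI; the local file names here (spark-wordcount.jar, wordcount.txt) are examples:

```shell
# create the target directories
hadoop fs -mkdir -p /jars /data/test
# upload the application jar and the input data
hadoop fs -put spark-wordcount.jar /jars/
hadoop fs -put wordcount.txt /data/test/
# verify both uploads landed
hadoop fs -ls /jars /data/test
```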
spark-submit shell script
```
$SPARK_HOME/bin/spark-submit \
--master k8s://http://127.0.0.1:8001 \
--deploy-mode cluster \
--name spark-word-count-on-kubernetes \
--class spark.wordcount.SparkWordCount \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=172.16.80.150:5000/apache/spark:2.4.5 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
hdfs://${hdfs namenode ip}:${hdfs namenode port}/jars/${jar name} # e.g. hdfs://namenode:9000/jars/test.jar
```
Possible issues
- An "unknown hostname" error in executors on Kubernetes with GlusterFS
- Log symptom:
0-resolver: getaddrinfo failed (Name or service not known)
- Most likely the /etc/hosts file on the executor's node is missing the GlusterFS hostname entries
- Still unresolved: passing a file path as an argument in the submit script
- spark-submit is currently used in cluster mode; client mode was tested and did not work well
Summary
- GlusterFS is used exactly like a local filesystem
- HDFS works without any extra configuration
Extras
- Speeding up shuffle by setting SPARK_LOCAL_DIRS
How to set it (multiple pairs can be configured; hostPath volumes backed by local disks work best):
For example:
```
spark.kubernetes.driver.volumes.[VolumeType].spark-local-dir-[VolumeName].mount.path=<mount path>
spark.kubernetes.driver.volumes.[VolumeType].spark-local-dir-[VolumeName].options.path=<host path>
```
```
spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.path=/opt/cache/1
spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.path=/data/1
spark.kubernetes.driver.volumes.hostPath.spark-local-dir-2.mount.path=/opt/cache/2
spark.kubernetes.driver.volumes.hostPath.spark-local-dir-2.options.path=/data/2
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path=/opt/cache/1
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path=/data/1
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.path=/opt/cache/2
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.options.path=/data/2
```