Spark on Kubernetes: External File Systems

This post walks through using external file systems (storage systems) with Spark on Kubernetes: GlusterFS and HDFS.

GlusterFS

Setting up the GlusterFS environment

My GlusterFS environment:

  • Number of machines: 3
  • Operating system: CentOS 7.3
  • hostnames: glusterfs-spark-1, glusterfs-spark-2, glusterfs-spark-3
Install GlusterFS
```
# Run on all three machines
yum search gluster
# yum -y install centos-release-gluster${version you want}.noarch
yum -y install centos-release-gluster7.noarch
yum install -y glusterfs glusterfs-server glusterfs-fuse glusterfs-rdma
systemctl start glusterd.service
systemctl status glusterd.service
```
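Optionally, enable the service as well so glusterd starts automatically after a reboot:
```
# run on all three machines
systemctl enable glusterd.service
```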
Peer the GlusterFS nodes
```
# Run from glusterfs-spark-1 to add the other two nodes to the trusted pool
gluster peer probe glusterfs-spark-2
gluster peer probe glusterfs-spark-3
gluster peer status
```
Create the GlusterFS volume
```
# GlusterFS recommends not using the system disk as the backing storage for a brick.
# If you have to use it anyway, append `force` to the command as shown below.
gluster volume create test replica 3 transport tcp glusterfs-spark-1:/data/gluster glusterfs-spark-2:/data/gluster glusterfs-spark-3:/data/gluster force
```
Start the GlusterFS volume
```
gluster volume start test
```
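Before mounting the volume anywhere, it is worth confirming it is actually up; the standard gluster CLI shows volume and brick status:
```
# check that the volume is started and all three bricks are online
gluster volume info test
gluster volume status test
```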

Mount GlusterFS on every Kubernetes node

Add the hostname of every machine in the GlusterFS cluster to /etc/hosts on each Kubernetes node:
```
${IP 1} glusterfs-spark-1
${IP 2} glusterfs-spark-2
${IP 3} glusterfs-spark-3
```
Install the GlusterFS client on all Kubernetes nodes:
```
yum -y install centos-release-gluster7.noarch
yum -y install glusterfs-cli
yum -y install glusterfs-fuse.x86_64
```
Mount the GlusterFS volume test on the Kubernetes nodes:
```
mkdir -p /data/glusterfs
glusterfs --volfile-server=10.80.0.133 --volfile-id=test /data/glusterfs
```
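The `glusterfs` command above mounts the volume for the current boot only. If you prefer the FUSE mount helper and want the mount to survive reboots, something like the following should work (server name and paths as configured above):
```
# equivalent mount via the glusterfs FUSE mount helper
mount -t glusterfs glusterfs-spark-1:/test /data/glusterfs

# optionally make the mount persistent across reboots
echo 'glusterfs-spark-1:/test /data/glusterfs glusterfs defaults,_netdev 0 0' >> /etc/fstab
```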

Running the Spark WordCount job on GlusterFS

Upload the jar needed by the WordCount job to GlusterFS.
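Since the volume is mounted on every Kubernetes node at /data/glusterfs, uploading simply means copying files into the mount. A minimal sketch (the local file names are placeholders; the directory layout matches the paths used by the code and the submit command below):
```
# copy the application jar and the input data into the mounted GlusterFS volume
mkdir -p /data/glusterfs/jars /data/glusterfs/data/wordcount
cp spark-gluster.jar /data/glusterfs/jars/
cp wordcount.txt /data/glusterfs/data/wordcount/   # placeholder input file
```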
WordCount code
```
package spark.wordcount

import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
	def main(args: Array[String]): Unit = {
		val conf = new SparkConf().setAppName("Spark Word Count")
		val sc = new SparkContext(conf)
		val startTime: Long = sc.startTime
		println(startTime)

		// glusterfs part: the path is the mount point inside the driver/executor pods
		val words = sc.textFile("/mnt/glusterfs/data/wordcount/")
		// split each line into words before counting; otherwise whole lines get counted
		words.flatMap(line => line.split("\\s|,"))
			.map(word => (word, 1))
			.reduceByKey(_ + _)
			.collect()
			.foreach(println(_))

		// Without this, the Spark executors on Kubernetes will not shut down
		sc.stop()
	}
}
```
spark-submit shell script
```
$SPARK_HOME/bin/spark-submit \
	--master k8s://http://127.0.0.1:8001 \
	--deploy-mode cluster \
	--name spark-word-count-on-kubernetes \
	--class spark.wordcount.SparkWordCount \
	--conf spark.executor.instances=2 \
	--conf spark.kubernetes.container.image=172.16.80.150:5000/apache/spark:2.4.5 \
	--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
	--conf spark.kubernetes.driver.volumes.hostPath.glusterfs.mount.path=/mnt/glusterfs \
	--conf spark.kubernetes.driver.volumes.hostPath.glusterfs.mount.readOnly=false \
	--conf spark.kubernetes.driver.volumes.hostPath.glusterfs.options.path=/data/glusterfs \
	--conf spark.kubernetes.executor.volumes.hostPath.glusterfs.mount.path=/mnt/glusterfs \
	--conf spark.kubernetes.executor.volumes.hostPath.glusterfs.mount.readOnly=false \
	--conf spark.kubernetes.executor.volumes.hostPath.glusterfs.options.path=/data/glusterfs \
	local:///mnt/glusterfs/jars/spark-gluster.jar
```
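In cluster mode the word counts are printed by the driver, so they end up in the driver pod's log rather than in your terminal. A quick way to check the result (the pod name is whatever Spark generated for this run; Spark labels driver pods with spark-role=driver):
```
# locate the driver pod for this application and read its output
kubectl get pods -l spark-role=driver
kubectl logs <driver-pod-name>
```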

HDFS

Setting up the Hadoop environment

Download the Hadoop 2.7.7 offline installation package and JDK 1.8.
Extract them and configure HADOOP_HOME and JAVA_HOME (see the sketch below).
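A minimal sketch of the environment variables (the install paths below are assumptions; point them at wherever you actually extracted the archives):
```
# assumed extraction locations
export JAVA_HOME=/opt/jdk1.8.0_201
export HADOOP_HOME=/opt/hadoop-2.7.7
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```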
Edit the Hadoop configuration files:
  • core-site.xml
    <configuration>
    	<property>
    		<name>fs.defaultFS</name>
    		<value>hdfs://${namenode IP}:9000</value>
    	</property>
    	
    	<property>
    		<name>hadoop.tmp.dir</name>
    		<value>/data/hadoop/tmp</value>
    	</property>
    </configuration>
    
  • hadoop-env.sh
    • Set JAVA_HOME to the path that your local JAVA_HOME points to
  • hdfs-site.xml
    <configuration>
    	<property>
    		<name>dfs.replication</name>
    		<value>2</value>
    	</property>
    	<property>
    		<name>dfs.namenode.name.dir</name>
    		<value>file:///data/hadoop/namenode</value>
    	</property>
    	<property>
    		<name>dfs.datanode.data.dir</name>
    		<value>file:///data/hadoop/datanode</value>
    	</property>
    	<property>
    		<name>dfs.http.address</name>
    		<value>0.0.0.0:50070</value>
    	</property>
    	<property>
    		<name>dfs.permissions.enabled</name>
    		<value>true</value>
    		<description>Allows Hadoop to create directories, modify directories, etc.</description>
    	</property>
    	<property>
    		<name>dfs.datanode.max.transfer.threads</name>
    		<value>8192</value>
    		<description>Works around java.io.IOException: Premature EOF from inputStream</description>
    	</property>
    </configuration>
    
Format the NameNode

```
hadoop namenode -format
```

Start HDFS

```
$HADOOP_HOME/sbin/start-dfs.sh
```
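A quick sanity check that the daemons came up, using standard Hadoop tooling:
```
# should list NameNode, SecondaryNameNode and DataNode processes
jps
# reports live DataNodes and remaining capacity
hdfs dfsadmin -report
```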

Running the Spark WordCount job on HDFS

WordCount code
```
package spark.wordcount

import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
	def main(args: Array[String]): Unit = {
		val conf = new SparkConf().setAppName("Spark Word Count")
		val sc = new SparkContext(conf)
		val startTime: Long = sc.startTime
		println(startTime)

		val words = sc.textFile("hdfs://namenode:9000/data/test/")
		val result = words.flatMap(line => line.split("\\s|,"))
			.map(word => (word, 1))
			.reduceByKey(_ + _)
		// sortByKey returns a new RDD, so sort and print in one chain
		result.sortByKey().collect().foreach(word => println(word))

		sc.stop()
	}
}
```
Package the WordCount code into a jar and upload it to the target path on HDFS (a sketch of the upload commands follows the file details below).

Jar path: hdfs://10.80.0.133:9000/jars/

Upload the data file to the target path:
  • File path
    hdfs://namenode:9000/data/test/
  • File contents
    hello world
    we are using spark on kubernetes
    using hdfs as filesystem
    hope it will be successed
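
A minimal sketch of both uploads with the standard HDFS CLI (the local file names are placeholders; the target paths match those used above and in the submit command below):
```
# create the target directories and upload the jar and the input file
hdfs dfs -mkdir -p /jars /data/test
hdfs dfs -put spark-wordcount.jar /jars/    # placeholder jar name
hdfs dfs -put wordcount.txt /data/test/     # placeholder input file
```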
    
spark-submit shell script
```
$SPARK_HOME/bin/spark-submit \
	--master k8s://http://127.0.0.1:8001 \
	--deploy-mode cluster \
	--name spark-word-count-on-kubernetes \
	--class spark.wordcount.SparkWordCount \
	--conf spark.executor.instances=2 \
	--conf spark.kubernetes.container.image=172.16.80.150:5000/apache/spark:2.4.5 \
	--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
	hdfs://${hdfs namenode ip}:${hdfs namenode port}/jars/${jar name} # e.g. hdfs://namenode:9000/jars/test.jar
```

Possible issues

  • Unknown hostname errors from executors when running on Kubernetes with GlusterFS
    • Log symptom: 0-resolver: getaddrinfo failed (Name or service not known)
    • Most likely the /etc/hosts file on the node where the executor runs is missing the GlusterFS hostname entries
  • Still unresolved: passing a file path as an argument in the submit script
  • The spark-submit commands above use cluster mode; client mode was tested and did not work well

Summary

  • GlusterFS is used in exactly the same way as a local filesystem
  • HDFS works without any extra configuration

Extras

  • Speed up shuffle by setting SPARK_LOCAL_DIRS
    How to configure it (multiple pairs can be set; hostPath volumes backed by local disks work best):
    spark.kubernetes.driver.volumes.[VolumeType].spark-local-dir-[VolumeName].mount.path=<mount path>
    spark.kubernetes.driver.volumes.[VolumeType].spark-local-dir-[VolumeName].options.path=<host path>
    
    For example:
    spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.path=/opt/cache/1
    spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.path=/data/1
    spark.kubernetes.driver.volumes.hostPath.spark-local-dir-2.mount.path=/opt/cache/2
    spark.kubernetes.driver.volumes.hostPath.spark-local-dir-2.options.path=/data/2
    spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path=/opt/cache/1
    spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path=/data/1
    spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.path=/opt/cache/2
    spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.options.path=/data/2
    