Spark with GlusterFS
Test Environment Architecture
Setting Up the Test Environment
- Set up GlusterFS. The test environment uses two nodes for GlusterFS, with a replica count of 2.
- Set up a Spark Standalone cluster on three machines. Each GlusterFS node must also be configured as a Spark Worker.
- Mount the GlusterFS volume at the same directory on all three Spark machines:
mount -t glusterfs slave1:/test /data/mnt/glusterfs/
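The setup steps above can be sketched as follows. The volume name `test` and the mount command come from this document; the peer host names and brick paths are assumptions for illustration:

```shell
# On slave1: join the two GlusterFS nodes into one trusted pool
# (slave2 is an assumed host name for the second node)
gluster peer probe slave2

# Create a 2-way replicated volume named "test"
# (brick paths /data/brick/test are assumed)
gluster volume create test replica 2 \
    slave1:/data/brick/test slave2:/data/brick/test
gluster volume start test

# On every Spark node, Workers and Master alike: mount the volume
# at the same path so the job's input path resolves identically
mkdir -p /data/mnt/glusterfs
mount -t glusterfs slave1:/test /data/mnt/glusterfs/
```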
Spark WordCount Code
Code:
package spark.wordcount

import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Spark Word Count")
    val sc = new SparkContext(conf)

    // Time the application started (epoch milliseconds)
    val startTime: Long = sc.startTime
    println(startTime)

    // Read input from the GlusterFS mount, split each line into words,
    // then count the occurrences of each word
    val lines = sc.textFile("/data/mnt/glusterfs/test/")
    lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collect()
      .foreach(println)
  }
}
Script to submit the jar:
$SPARK_HOME/bin/spark-submit \
--master spark://Master:7077 \
--num-executors 2 \
--class spark.wordcount.SparkWordCount \
--conf spark.dynamicAllocation.enabled=false \
ScalaWordCount.jar
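Before submitting, the input directory the job reads must exist on the shared mount. A quick way to seed it (the file name and contents are illustrative):

```shell
# Any file written here is visible to all Workers through the shared
# GlusterFS mount, and is replicated across the two GlusterFS nodes
mkdir -p /data/mnt/glusterfs/test
echo "hello spark hello glusterfs" > /data/mnt/glusterfs/test/sample.txt
```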
Tips:
- Make sure the GlusterFS directory is mounted on every Spark node; the Master must mount it as well.
- When submitting the job, add the parameter --conf spark.dynamicAllocation.enabled=false in order to solve the following problem.
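Rather than passing the flag on every submission, the same setting can be persisted in Spark's defaults file (a sketch; the path assumes a standard Spark layout under $SPARK_HOME):

```shell
# Entries in spark-defaults.conf apply to every spark-submit
# that does not explicitly override them
echo "spark.dynamicAllocation.enabled false" >> $SPARK_HOME/conf/spark-defaults.conf
```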