@羲凡 - just to live a better life
Calling Spark from Oozie with a Shell Script - Oozie 4.3.1
To call Spark from Oozie through a shell script, prepare three files: job.properties, testWordCount.sh, and workflow.xml. The Spark WordCount code is attached at the end of this post (it takes two arguments, an input file and an output directory) for reference.
1. job.properties
# If HDFS HA is enabled, this is the name set in the fs.defaultFS parameter;
# otherwise write hdfs://deptest1:8020 or hdfs://deptest1:9000
nameNode=hdfs://ns
# If YARN HA is enabled, this is the name set in the yarn.resourcemanager.cluster-id parameter;
# otherwise write deptest2:8032
jobTracker=rmcluster
# Queue the job runs in; set it to whatever your company uses. Here I pick the default.
queueName=default
# Best not to rename examplesRoot; renaming it sometimes causes errors. Keep that in mind!
examplesRoot=aarontest/oozie/sparkshell
# Let Oozie use the system lib directory (sharelib) on HDFS
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/${examplesRoot}/workflow.xml
EXEC=testWordCount.sh
shellpath=${nameNode}/${examplesRoot}/${EXEC}
inputfile=/aarontest/data/oozie/sparkshell/wc.txt
outputdir=/aarontest/data/oozie/sparkshell/wcoutpath
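Since oozie.use.system.libpath is turned on, it is worth confirming that the sharelib is actually installed on the Oozie server before submitting anything. A quick sanity check, assuming the same Oozie server address used in step 4:
# List the sharelib directories the Oozie server knows about
oozie admin -oozie http://deptest45:11000/oozie -shareliblist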
2. testWordCount.sh
#!/bin/sh
# Submit the WordCount jar to YARN in client mode.
# $1 = input file, $2 = output directory (both passed in from workflow.xml)
/usr/local/package/spark-2.3.2-bin-hadoop2.7/bin/spark-submit \
--master yarn \
--deploy-mode client \
--class sparktest.SparkWordCount \
hdfs://ns/aarontest/oozie/sparkshell/SparkWordCount.jar $1 $2
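Before handing the script to Oozie, a dry run from the command line catches most problems early. A smoke test using the paths from job.properties (this assumes wc.txt is already on HDFS):
sh testWordCount.sh /aarontest/data/oozie/sparkshell/wc.txt /aarontest/data/oozie/sparkshell/wcoutpath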
3. workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.4" name="sparkshell-wf">
    <start to="shell-node"/>
    <action name="shell-node">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <exec>${EXEC}</exec>
            <argument>${inputfile}</argument>
            <argument>${outputdir}</argument>
            <file>${shellpath}</file>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
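The two <argument> elements arrive in testWordCount.sh as $1 and $2, and <file> tells Oozie to ship the script from HDFS into the action's working directory. Before uploading, the XML can be checked against the client's bundled schemas; a minimal sketch, assuming the stock Oozie 4.x CLI behavior:
oozie validate workflow.xml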
4. Submit
-oozie specifies the Oozie server address; -config specifies the job's configuration file
# Upload the files to HDFS (the examplesRoot path from job.properties; the leading slash
# matters, since ${nameNode}/${examplesRoot} resolves to an absolute path)
hdfs dfs -put sparkshell/ /aarontest/oozie/
# Start the Oozie job
oozie job -oozie http://deptest45:11000/oozie -config job.properties -run
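The -run command prints a workflow ID; the job can then be followed from the same CLI (the <job-id> below is a placeholder for whatever -run printed):
# Check the workflow status (RUNNING / SUCCEEDED / KILLED)
oozie job -oozie http://deptest45:11000/oozie -info <job-id>
# Pull the job log when debugging a failure
oozie job -oozie http://deptest45:11000/oozie -log <job-id>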
5. Spark WordCount code
package sparktest

import org.apache.spark.sql.SparkSession

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkWordCount")
      .master("local[*]") // be sure to comment this line out before packaging the jar
      .enableHiveSupport()
      .getOrCreate()
    val sc = spark.sparkContext
    val inpath = args(0)
    // Append a timestamp so reruns never collide with an existing output dir
    val outpath = args(1) + System.currentTimeMillis()
    val rdd = sc.textFile(inpath)
      .filter(_.nonEmpty)
      .flatMap(_.split("\t"))
      .map((_, 1))
      .reduceByKey(_ + _)
      // Sort by count descending: swap to (count, word), sort, swap back
      .map(_.swap)
      .sortByKey(ascending = false, numPartitions = 1)
      .map(_.swap)
    rdd.foreach(println)
    rdd.saveAsTextFile(outpath)
    sc.stop()
    spark.stop()
  }
}
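Because the code appends System.currentTimeMillis() to args(1), every run writes a brand-new directory whose name starts with wcoutpath. One way to inspect the result (the glob is an assumption about the timestamped name):
hdfs dfs -ls /aarontest/data/oozie/sparkshell/ | grep wcoutpath
hdfs dfs -cat '/aarontest/data/oozie/sparkshell/wcoutpath*/part-00000'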
====================================================================
@羲凡 - just to live a better life
If you have any questions about this post, feel free to leave a comment.