Workflow scheduling framework: Oozie
* Workflow
    import -> hive -> export
    orchestrates different business steps into one pipeline
* Scheduling
    jobs/tasks run on a timer
    or are triggered by events:
        a point in time
        a dataset becoming available
Scheduling frameworks
Linux crontab
    rule: * * * * * cmd
    the first five fields set the schedule:
    minute  hour  day-of-month  month  day-of-week
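For example, two hypothetical crontab entries (the script paths are made up for illustration):

```
# minute hour day-of-month month day-of-week  command
30 2 * * * sh /opt/scripts/daily_import.sh    # 02:30 every day
0 */6 * * * sh /opt/scripts/sync.sh           # every 6 hours, on the hour
```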
* MapReduce
    yarn jar xxx.jar input output
* Hive
    hive -f xx.sql
* Sqoop
    sqoop --options-file xx.txt
* shell script
    sh xxx.sh
Azkaban
    open-source workflow manager with a visual UI
    https://azkaban.github.io
Oozie
    http://oozie.apache.org
Zeus
    an open-source framework from Alibaba; a complete Hadoop job platform
An Oozie job runs as a MapReduce program that has only a Map task (the launcher).
For each task type, Oozie provides a template to fill in.
--------------------------------------------------------Oozie安装----------------------------------------------------------------------------
1.The following two properties are required in Hadoop core-site.xml:
<!-- OOZIE -->
<property>
<name>hadoop.proxyuser.[OOZIE_SERVER_USER].hosts</name>   <!-- e.g. root -->
<value>[OOZIE_SERVER_HOSTNAME]</value>                    <!-- e.g. * -->
</property>
<property>
<name>hadoop.proxyuser.[OOZIE_SERVER_USER].groups</name>  <!-- e.g. root -->
<value>[USER_GROUPS_THAT_ALLOW_IMPERSONATION]</value>     <!-- e.g. * -->
</property>
vi oozie-site.xml
<property>
<name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
<value>*=/opt/hadoop-2.5.0-cdh5.3.6/etc/hadoop</value>
<description>
Comma separated AUTHORITY=HADOOP_CONF_DIR, where AUTHORITY is the HOST:PORT of
the Hadoop service (JobTracker, HDFS). The wildcard '*' configuration is
used when there is no exact match for an authority. The HADOOP_CONF_DIR contains
the relevant Hadoop *-site.xml files. If the path is relative, it is looked up within
the Oozie configuration directory; the path can also be absolute (i.e. pointing
to Hadoop client conf/ directories in the local filesystem).
</description>
</property>
A database can be configured as Oozie's metadata store.
2. Unpack the hadooplibs tarball
tar -zxf oozie-hadooplibs-4.0.0-cdh5.3.6.tar.gz
3. Create the libext directory
mkdir oozie-4.0.0-cdh5.3.6/libext
4. Copy the Hadoop jars into libext
mv oozie-4.0.0-cdh5.3.6/hadooplibs/hadooplib-2.5.0-cdh5.3.6.oozie-4.0.0-cdh5.3.6/* ./libext/
5.If using the ExtJS library copy the ZIP file to the libext/ directory.
http://extjs.com/deploy/ext-2.0.2.zip http://archive.cloudera.com/gplextras/misc/ext-2.2.zip
ext-2.2.zip
6.Run the oozie-setup.sh script to configure Oozie with all the components added to the libext/ directory.
bin/oozie-setup.sh prepare-war
bin/oozie-setup.sh sharelib create -fs hdfs://hadoop-senior01.zhangbk.com:8020 -locallib oozie-sharelib-4.0.0-cdh5.3.6-yarn.tar.gz
bin/ooziedb.sh create -sqlfile oozie.sql -run
7.Start Oozie as a daemon process run:
bin/oozied.sh start
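To confirm the daemon came up (the URL assumes this cluster's host naming), the admin subcommand reports the server mode; the web console is served at the same address:

```shell
# prints "System mode: NORMAL" when the server is healthy
bin/oozie admin -oozie http://hadoop-senior01.zhangbk.com:11000/oozie -status
```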
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Command Line Examples
The examples/ directory must be copied to the user HOME directory in HDFS:
hdfs dfs -put examples examples
vi /opt/oozie-4.0.0-cdh5.3.6/examples/apps/map-reduce/job.properties
nameNode=hdfs://ns1
jobTracker=hadoop-senior03.zhangbk.com:8032
How to run an example application:
bin/oozie job -oozie http://hadoop-senior01.zhangbk.com:11000/oozie -config examples/apps/map-reduce/job.properties -run
job: 14-20090525161321-oozie-tucu
Check the workflow job status:
oozie job -oozie http://localhost:11000/oozie -info 14-20090525161321-oozie-tucu
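Other `oozie job` subcommands that are handy against the same job id (standard Oozie CLI flags):

```shell
# stream the job's launcher/action log
oozie job -oozie http://localhost:11000/oozie -log 14-20090525161321-oozie-tucu
# kill a running job
oozie job -oozie http://localhost:11000/oozie -kill 14-20090525161321-oozie-tucu
```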
------------------------------1. MapReduce Action---------------------------------------------------------------------
vi /opt/oozie-4.0.0-cdh5.3.6/examples/apps/map-reduce/workflow.xml
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<workflow-app xmlns="uri:oozie:workflow:0.5" name="mr-wordcount-wf">
<start to="mr-node-wordcount"/>
<action name="mr-node-wordcount">
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${nameNode}/${oozieDataRoot}/${outputDir}"/>
</prepare>
<configuration>
<property>
<name>mapred.mapper.new-api</name>
<value>true</value>
</property>
<property>
<name>mapred.reducer.new-api</name>
<value>true</value>
</property>
<property>
<name>mapreduce.job.queuename</name>
<value>${queueName}</value>
</property>
<property>
<name>mapreduce.job.map.class</name>
<value>com.zhangbk.mapreduce.WordCount$WordCountMapper</value>
</property>
<property>
<name>mapreduce.job.reduce.class</name>
<value>com.zhangbk.mapreduce.WordCount$WordCountReducer</value>
</property>
<property>
<name>mapreduce.map.output.key.class</name>
<value>org.apache.hadoop.io.Text</value>
</property>
<property>
<name>mapreduce.map.output.value.class</name>
<value>org.apache.hadoop.io.IntWritable</value>
</property>
<property>
<name>mapreduce.job.output.key.class</name>
<value>org.apache.hadoop.io.Text</value>
</property>
<property>
<name>mapreduce.job.output.value.class</name>
<value>org.apache.hadoop.io.IntWritable</value>
</property>
<property>
<name>mapreduce.input.fileinputformat.inputdir</name>
<value>${nameNode}/${oozieDataRoot}/${inputDir}</value>
</property>
<property>
<name>mapreduce.output.fileoutputformat.outputdir</name>
<value>${nameNode}/${oozieDataRoot}/${outputDir}</value>
</property>
</configuration>
</map-reduce>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
vi job.properties
nameNode=hdfs://ns1
jobTracker=hadoop-senior03.zhangbk.com:8032
queueName=default
oozieAppsRoot=user/oozie-apps
oozieDataRoot=user/oozie/datas
oozie.wf.application.path=${nameNode}/${oozieAppsRoot}/mr-wordcount-wf/workflow.xml
inputDir=mr-wordcount-wf/input
outputDir=mr-wordcount-wf/output
------------------------------------------------------------------------------------------------------------------
How to define a workflow
job.properties
    key point: points to the HDFS location of the workflow.xml file
workflow.xml
    the definition file
    an XML file
lib directory
    dependency jars
Writing workflow.xml
    control-flow nodes
    action nodes
MapReduce action
    how to run a MapReduce program from Oozie
    key point:
        translate the Driver part of the old Java MapReduce program into XML configuration
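A sketch of that Driver-to-XML mapping, using the wordcount classes from the example above; each Driver call becomes a <property> in the action's <configuration> (the Java calls in the comments are the usual new-API equivalents):

```xml
<!-- job.setMapperClass(WordCountMapper.class) -->
<property>
    <name>mapreduce.job.map.class</name>
    <value>com.zhangbk.mapreduce.WordCount$WordCountMapper</value>
</property>
<!-- job.setOutputKeyClass(Text.class) -->
<property>
    <name>mapreduce.job.output.key.class</name>
    <value>org.apache.hadoop.io.Text</value>
</property>
<!-- FileInputFormat.addInputPath(job, new Path(args[0])) -->
<property>
    <name>mapreduce.input.fileinputformat.inputdir</name>
    <value>${nameNode}/${oozieDataRoot}/${inputDir}</value>
</property>
```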
----------------------2. Hive Action--------------------------------------------------------------------------------------------------------------------------------
vi job.properties
nameNode=hdfs://ns1
jobTracker=hadoop-senior03.zhangbk.com:8032
queueName=default
oozieAppsRoot=user/oozie-apps
oozieDataRoot=user/oozie/datas
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/${oozieAppsRoot}/hive-select
outputDir=hive-select/output
vi oozie-apps/hive-select/workflow.xml
<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.5" name="hive-select">
<start to="hive-node"/>
<action name="hive-node">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${nameNode}/${oozieDataRoot}/${outputDir}"/>
</prepare>
<job-xml>${nameNode}/${oozieAppsRoot}/hive-select/hive-site.xml</job-xml>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<script>select-tab.sql</script>
<param>OUTPUT=${nameNode}/${oozieDataRoot}/${outputDir}</param>
</hive>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Hive failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
# Run the Oozie job
export OOZIE_URL=http://hadoop-senior01.zhangbk.com:11000/oozie
bin/oozie job -config oozie-apps/hive-select/job.properties -run
Note: copy the MySQL JDBC jar into the app's lib/ directory, configure <job-xml>${nameNode}/${oozieAppsRoot}/hive-select/hive-site.xml</job-xml>, and write the SQL script.
--------------3.Sqoop Action--------------------------------------------------------------------------------------------------------------------------------------
vi job.properties
nameNode=hdfs://ns1
jobTracker=hadoop-senior03.zhangbk.com:8032
queueName=default
oozieAppsRoot=user/oozie-apps
oozieDataRoot=user/oozie/datas
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/${oozieAppsRoot}/sqoop-imp
outputDir=sqoop-imp/output
-----------------------------
vi workflow.xml
<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.5" name="sqoop-imp-wf">
<start to="sqoop-node"/>
<action name="sqoop-node">
<sqoop xmlns="uri:oozie:sqoop-action:0.3">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${nameNode}/${oozieDataRoot}/${outputDir}"/>
</prepare>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<command>import --connect jdbc:mysql://hadoop-senior01.zhangbk.com:3306/test --username root --password password01 --table dm_gy_xzqh --target-dir ${nameNode}/${oozieDataRoot}/${outputDir} --num-mappers 1</command>
</sqoop> <!-- alternatively use: import --options-file sqoop_import_hdfs.txt -->
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Sqoop failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
--------------------
The same Sqoop action using arg elements:
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
...
<action name="myfirsthivejob">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>foo:8021</job-tracker>
<name-node>bar:8020</name-node>
<prepare>
<delete path="${jobOutput}"/>
</prepare>
<configuration>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
</configuration>
<arg>import</arg>
<arg>--connect</arg>
<arg>jdbc:hsqldb:file:db.hsqldb</arg>
<arg>--table</arg>
<arg>TT</arg>
<arg>--target-dir</arg>
<arg>hdfs://localhost:8020/user/tucu/foo</arg>
<arg>-m</arg>
<arg>1</arg>
</sqoop>
<ok to="myotherjob"/>
<error to="errorcleanup"/>
</action>
...
</workflow-app>
-----------------4.Shell Action-----------------------------------------------------------------
vi job.properties
nameNode=hdfs://ns1
jobTracker=hadoop-senior03.zhangbk.com:8032
queueName=default
oozieAppsRoot=user/oozie-apps
oozieDataRoot=user/oozie/datas
oozie.wf.application.path=${nameNode}/${oozieAppsRoot}/shell-hive-select
output=shell-hive-select
exec=select-tab.sh
script=select-user.sql
vi workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.5" name="shell-wf">
<start to="shell-node"/>
<action name="shell-node">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>${exec}</exec>
<file>${nameNode}/${oozieAppsRoot}/shell-hive-select/${exec}#${exec}</file>
<file>${nameNode}/${oozieAppsRoot}/shell-hive-select/${script}#${script}</file>
<capture-output/>
</shell>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
---------------------------------------------------------------------------------
A workflow chains multiple actions into one complete pipeline.
Example:
    start node
    hive action
        query a table
        result -> hdfs
    sqoop action
        hdfs -> mysql
    end
    kill
--------------------------------------------------------------------------------
coordinator scheduling
Change the server time zone
    remove the old symlink
    rm -rf /etc/localtime
    create a new symlink
    ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
    check the time zone and current time
    date -R
    Sat, 06 Jul 2019 16:29:26 +0800
    set the date and time
    date -s 2019-7-6
    date -s 16:31:30
Configure the Oozie time zone
    Oozie defaults to UTC while the server is likely CST; it is best to use GMT+0800 consistently.
    edit oozie-site.xml
<property>
<name>oozie.processing.timezone</name>
<value>GMT+0800</value>
</property>
Clear the Tomcat cache
rm -rf /opt/oozie-4.0.0-cdh5.3.6/oozie-server/work/Catalina
rm -rf /opt/oozie-4.0.0-cdh5.3.6/oozie-server/conf/Catalina
--------------------------------------------------------------------------------------------------------------------------------------------------------------
vi /opt/oozie-4.0.0-cdh5.3.6/oozie-server/webapps/oozie/oozie-console.js
function getTimeZone() {
Ext.state.Manager.setProvider(new Ext.state.CookieProvider());
return Ext.state.Manager.get("TimezoneId","GMT+0800");
}
Coordinator frequency configuration (set this to false to allow frequencies faster than 5 minutes, e.g. for testing)
<property>
<name>oozie.service.coord.check.maximum.frequency</name>
<value>false</value>
<description>
When true, Oozie will reject any coordinators with a frequency faster than 5 minutes. It is not recommended to disable
this check or submit coordinators with frequencies faster than 5 minutes: doing so can cause unintended behavior and
additional system stress.
</description>
</property>
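For reference, frequency can be written either with cron-like syntax (as used later in these notes) or with Oozie coordinator EL functions; e.g.:

```
frequency="${coord:minutes(5)}"    every 5 minutes (EL form)
frequency="${coord:hours(1)}"      hourly
frequency="${coord:days(1)}"       daily (day-boundary aware)
frequency="0/5 * * * *"            every 5 minutes (cron form)
```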
--------------------------------------------------------------------------------------------------------------------------------------------------------
Coordinator example
vi job.properties
nameNode=hdfs://ns1
jobTracker=hadoop-senior03.zhangbk.com:8032
queueName=default
oozieAppsRoot=user/oozie-apps
oozieDataRoot=user/oozie/datas
oozie.coord.application.path=${nameNode}/${oozieAppsRoot}/cron-schedule
start=2019-07-06T23:15+0800
end=2019-07-06T23:25+0800
workflowAppUri=${nameNode}/${oozieAppsRoot}/cron-schedule
vi workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.5" name="no-op-wf">
<start to="end"/>
<end name="end"/>
</workflow-app>
vi coordinator.xml
<coordinator-app name="cron-coord" frequency="${coord:minutes(2)}"
start="${start}" end="${end}" timezone="GMT+0800"
xmlns="uri:oozie:coordinator:0.4">
<action>
<workflow>
<app-path>${workflowAppUri}</app-path>
<configuration>
<property>
<name>jobTracker</name>
<value>${jobTracker}</value>
</property>
<property>
<name>nameNode</name>
<value>${nameNode}</value>
</property>
<property>
<name>queueName</name>
<value>${queueName}</value>
</property>
</configuration>
</workflow>
</action>
</coordinator-app>
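Assuming the three files above are uploaded under oozie-apps/cron-schedule in HDFS, the coordinator is submitted the same way as a workflow and can then be inspected:

```shell
export OOZIE_URL=http://hadoop-senior01.zhangbk.com:11000/oozie
bin/oozie job -config oozie-apps/cron-schedule/job.properties -run
# list coordinator jobs and inspect the materialized actions of one of them
bin/oozie jobs -jobtype coordinator
bin/oozie job -info <coord-job-id>
```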
---------------------------------------------------------------------------------------------------------
Using a coordinator to schedule MapReduce
nameNode=hdfs://ns1
jobTracker=hadoop-senior03.zhangbk.com:8032
queueName=default
oozieAppsRoot=user/oozie-apps
oozieDataRoot=user/oozie/datas
oozie.coord.application.path=${nameNode}/${oozieAppsRoot}/cron
start=2019-07-06T23:50+0800
end=2019-07-06T23:59+0800
workflowAppUri=${nameNode}/${oozieAppsRoot}/cron
inputDir=mr-wordcount-wf/input
outputDir=mr-wordcount-wf/output
<workflow-app xmlns="uri:oozie:workflow:0.5" name="mr-wordcount-wf">
<start to="mr-node-wordcount"/>
<action name="mr-node-wordcount">
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${nameNode}/${oozieDataRoot}/${outputDir}"/>
</prepare>
<configuration>
<property>
<name>mapred.mapper.new-api</name>
<value>true</value>
</property>
<property>
<name>mapred.reducer.new-api</name>
<value>true</value>
</property>
<property>
<name>mapreduce.job.queuename</name>
<value>${queueName}</value>
</property>
<property>
<name>mapreduce.job.map.class</name>
<value>com.zhangbk.mapreduce.WordCount$WordCountMapper</value>
</property>
<property>
<name>mapreduce.job.reduce.class</name>
<value>com.zhangbk.mapreduce.WordCount$WordCountReducer</value>
</property>
<property>
<name>mapreduce.map.output.key.class</name>
<value>org.apache.hadoop.io.Text</value>
</property>
<property>
<name>mapreduce.map.output.value.class</name>
<value>org.apache.hadoop.io.IntWritable</value>
</property>
<property>
<name>mapreduce.job.output.key.class</name>
<value>org.apache.hadoop.io.Text</value>
</property>
<property>
<name>mapreduce.job.output.value.class</name>
<value>org.apache.hadoop.io.IntWritable</value>
</property>
<property>
<name>mapreduce.input.fileinputformat.inputdir</name>
<value>${nameNode}/${oozieDataRoot}/${inputDir}</value>
</property>
<property>
<name>mapreduce.output.fileoutputformat.outputdir</name>
<value>${nameNode}/${oozieDataRoot}/${outputDir}</value>
</property>
</configuration>
</map-reduce>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
<coordinator-app name="cron-coord" frequency="0/3 * * * *"
start="${start}" end="${end}" timezone="GMT+0800"
xmlns="uri:oozie:coordinator:0.4">
<action>
<workflow>
<app-path>${workflowAppUri}</app-path>
<configuration>
<property>
<name>jobTracker</name>
<value>${jobTracker}</value>
</property>
<property>
<name>nameNode</name>
<value>${nameNode}</value>
</property>
<property>
<name>queueName</name>
<value>${queueName}</value>
</property>
</configuration>
</workflow>
</action>
</coordinator-app>
Hive tables expose a series of virtual columns;
INPUT__FILE__NAME shows which HDFS file a row's data is stored in.
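For example (table and column names taken from the Sqoop import above), the virtual column can be selected like any other:

```sql
-- shows the HDFS file each row came from
select INPUT__FILE__NAME, xzqhsz_dm
from default.dm_gy_xzqh_hive
limit 10;
```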
-------------------------------------------------------------------------------------------
Scheduled Hive Action -> Sqoop Action pipeline
job.properties
nameNode=hdfs://ns1
jobTracker=hadoop-senior03.zhangbk.com:8032
queueName=default
oozieAppsRoot=user/oozie-apps
oozieDataRoot=user/oozie/datas
oozie.use.system.libpath=true
oozie.coord.application.path=${nameNode}/${oozieAppsRoot}/wf-hive-select
start=2019-07-07T21:50+0800
end=2019-07-07T21:59+0800
workflowAppUri=${nameNode}/${oozieAppsRoot}/wf-hive-select
outputDir=wf-hive-select/output
workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.5" name="wf-hive-select">
<start to="hive-node"/>
<action name="hive-node">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${nameNode}/${oozieDataRoot}/${outputDir}"/>
</prepare>
<job-xml>${nameNode}/${oozieAppsRoot}/hive-select/hive-site.xml</job-xml>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<script>select-tab.sql</script>
<param>OUTPUT=${nameNode}/${oozieDataRoot}/${outputDir}</param>
</hive>
<ok to="sqoop-node"/>
<error to="fail"/>
</action>
<action name="sqoop-node">
<sqoop xmlns="uri:oozie:sqoop-action:0.3">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<command>export --connect jdbc:mysql://hadoop-senior01.zhangbk.com:3306/test --username root --password password01 --table xzqh_hive --input-fields-terminated-by "\t" --export-dir hdfs://ns1/user/oozie/datas/wf-hive-select/output --num-mappers 1</command>
</sqoop>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Hive failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
select-tab.sql
drop table if exists xzqh_tmp ;
create table if not exists default.xzqh_tmp like default.dm_gy_xzqh_hive location '${OUTPUT}';
insert overwrite table default.xzqh_tmp
select *
from default.dm_gy_xzqh_hive
order by xzqhsz_dm
limit 100 ;
coordinator.xml
<coordinator-app name="cron-coord" frequency="0/10 * * * *"
start="${start}" end="${end}" timezone="GMT+0800"
xmlns="uri:oozie:coordinator:0.4">
<action>
<workflow>
<app-path>${workflowAppUri}</app-path>
<configuration>
<property>
<name>jobTracker</name>
<value>${jobTracker}</value>
</property>
<property>
<name>nameNode</name>
<value>${nameNode}</value>
</property>
<property>
<name>queueName</name>
<value>${queueName}</value>
</property>
</configuration>
</workflow>
</action>
</coordinator-app>
A problem you may hit:
java.lang.ClassNotFoundException: Class org.apache.oozie.action.hadoop.HiveMain not found
Fix:
    the sharelib jars are missing; add to job.properties:
    oozie.use.system.libpath=true
-------------------------------------------------------------------------------------------------------
Oozie
* workflow
* coordinator
    time-triggered
        ${coord:days(1)}
        cron
    data-triggered
* bundle
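A bundle packages one or more coordinators so they can be started, stopped, and resumed together. A minimal sketch (the app path reuses this document's variables; the bundle name and kick-off-time are assumptions):

```xml
<bundle-app name="my-bundle" xmlns="uri:oozie:bundle:0.2">
    <controls>
        <!-- when the bundle should start materializing its coordinators -->
        <kick-off-time>2019-07-08T00:00+0800</kick-off-time>
    </controls>
    <coordinator name="cron-coord">
        <app-path>${nameNode}/${oozieAppsRoot}/cron-schedule</app-path>
    </coordinator>
</bundle-app>
```

It is submitted like a coordinator, but with oozie.bundle.application.path in job.properties instead of oozie.coord.application.path.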