This article covers:
1. Using Oozie with Sqoop to import an Oracle table into HDFS
2. Serial (sequential) scheduled tasks in Oozie
3. Parallel scheduled tasks in Oozie
4. Problems encountered

1. Using Oozie with Sqoop to import an Oracle table into HDFS
The Oracle OJDBC driver JAR must be present under Oozie's lib directory (the Sqoop sharelib); otherwise the job fails with a missing-driver error.
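As a sketch, the driver can be uploaded into the Sqoop sharelib on HDFS and the server told to refresh it; the sharelib directory name (`lib_<timestamp>`), the JAR name `ojdbc6.jar`, and `OOZIE_HOST` are placeholders for your environment:

```shell
# Upload the Oracle JDBC driver into the Sqoop sharelib on HDFS.
# lib_20180101000000 and ojdbc6.jar are environment-specific placeholders.
hdfs dfs -put ojdbc6.jar /user/oozie/share/lib/lib_20180101000000/sqoop/

# Ask the Oozie server to pick up the new sharelib contents
# without a restart.
oozie admin -oozie http://OOZIE_HOST:11000/oozie -sharelibupdate
```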
workflow.xml:
<workflow-app xmlns="uri:oozie:workflow:0.4" name="sqoop-wmz">
    <start to="sqoop-node"/>
    <action name="sqoop-node">
        <sqoop xmlns="uri:oozie:sqoop-action:0.3">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
                <property>
                    <name>oozie.sqoop.log.level</name>
                    <value>WARN</value>
                </property>
            </configuration>
            <command>import --connect jdbc:oracle:thin:@***.***.**.***:1521:orcl --username ** --password ** --table ** --delete-target-dir --target-dir /yss/guzhi/**/** --m 1</command>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Sqoop failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
job.properties (note that the `<command>` of a Sqoop action starts directly with the tool name, e.g. `import`, without a leading `sqoop` word):
nameNode=hdfs://bj-rack001-hadoop002:8020
jobTracker=bj-rack001-hadoop003:8050
queueName=default
examplesRoot=wmz_test
oozie.libpath=hdfs://bj-rack001-hadoop002:8020/user/oozie/share/lib/sqoop
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/tmp/oracle2hdfs
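With both files in place, a typical deploy-and-run looks like the following; the HDFS application directory comes from `oozie.wf.application.path` above, while `OOZIE_HOST` is a placeholder for the Oozie server:

```shell
# Copy the workflow definition to the HDFS path referenced by
# oozie.wf.application.path; job.properties stays on local disk.
hdfs dfs -mkdir -p /tmp/oracle2hdfs
hdfs dfs -put -f workflow.xml /tmp/oracle2hdfs/

# Submit and start the workflow.
oozie job -oozie http://OOZIE_HOST:11000/oozie -config job.properties -run
```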
2. Serial scheduled tasks in Oozie
When a job needs to import or export several tables, or perform several operations, add one action per operation; putting multiple commands into a single `<command>`, or multiple `<command>` elements into a single action, both cause errors.
workflow.xml — a shell script first obtains the current date, which is then substituted into the Sqoop commands so each run writes into a directory named after that day:
<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.5" name="testshell-wmz">
    <start to="shell-node"/>
    <action name="shell-node">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <exec>${shell}</exec>
            <file>${nameNode}/tmp/oracle2hdfs/${shell}#${shell}</file>
            <capture-output/>
        </shell>
        <ok to="sqoop-node"/>
        <error to="fail"/>
    </action>
    <action name="sqoop-node">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <command>import --connect jdbc:oracle:thin:@***.***.***.**:1521:orcl --username ** --password ** --table *** --target-dir /yss/guzhi/**/${wf:actionData('shell-node')['day']}/LSETLIST/ --delete-target-dir --m 1</command>
        </sqoop>
        <ok to="sqoop-node2"/>
        <error to="fail"/>
    </action>
    <action name="sqoop-node2">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <command>import --connect jdbc:oracle:thin:@***.***.***.**:1521:orcl --username ** --password ** --table *** --target-dir /yss/**/**/${wf:actionData('shell-node')['day']}/CSGDZH --delete-target-dir --m 1</command>
        </sqoop>
        <ok to="sqoop-node3"/>
        <error to="fail"/>
    </action>
    <action name="sqoop-node3">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <command>import --connect jdbc:oracle:thin:@***.***.***.**:1521:orcl --username ** --password ** --table *** --target-dir /yss/**/**/${wf:actionData('shell-node')['day']}/CSQSXW --delete-target-dir --m 1</command>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
coordinator.xml — here the job is scheduled to run every 12 hours:
<coordinator-app name="oracleToHdfsBySqoop-wmz" frequency="${coord:hours(12)}" start="${start}" end="${end}" timezone="GMT+0800" xmlns="uri:oozie:coordinator:0.5">
    <action>
        <workflow>
            <app-path>${nameNode}/tmp/oracle2hdfs/workflow.xml</app-path>
            <configuration>
                <property>
                    <name>jobTracker</name>
                    <value>${jobTracker}</value>
                </property>
                <property>
                    <name>nameNode</name>
                    <value>${nameNode}</value>
                </property>
                <property>
                    <name>queueName</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
Shell script to get the current date:
#!/bin/sh
day=`date '+%Y%m%d'`
echo "day:$day"
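Because the action declares `<capture-output/>`, Oozie parses the script's stdout as Java properties, where both `=` and `:` separate key and value; that is why `wf:actionData('shell-node')['day']` can read the value printed above. A minimal sketch of that key/value split in plain shell:

```shell
#!/bin/sh
# Simulate the line emitted by the date script; Oozie's capture-output
# treats "key:value" (or "key=value") lines on stdout as properties.
line="day:$(date '+%Y%m%d')"

key=${line%%:*}     # text before the first ':' -> "day"
value=${line#*:}    # text after the first ':'  -> e.g. 20180911

echo "key=$key value=$value"
```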
job.properties
nameNode=hdfs://bj-rack001-hadoop002:8020
jobTracker=bj-rack001-hadoop003:8050
queueName=default
examplesRoot=examples
shell=getDate.sh
oozie.service.coord.check.maximum.frequency=false
oozie.coord.application.path=${nameNode}/tmp/oozietest/
start=2018-09-11T16:00Z
end=2018-09-11T16:00Z
workflowAppUri=${oozie.coord.application.path}
Because start and end are given in GMT (UTC), subtract 8 hours from Beijing time when filling them in.
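For example, to fire at 2018-09-12 00:00 Beijing time, the properties need the UTC equivalent; with GNU date the conversion can be sketched as:

```shell
# Convert a Beijing (UTC+8) wall-clock time into the UTC string
# format that start=/end= in job.properties expect (GNU date syntax).
utc=$(date -u -d '2018-09-12 00:00 +0800' '+%Y-%m-%dT%H:%MZ')
echo "$utc"    # 2018-09-11T16:00Z
```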
3. Parallel scheduled tasks in Oozie
When there are too many serial actions, the overall run becomes slow; in that case the actions can be executed in parallel.
Parallel execution here uses the Oozie bundle component.
workflow1.xml
<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.5" name="bundle-wmz">
    <start to="shell-node"/>
    <action name="shell-node">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <exec>${shell}</exec>
            <file>${nameNode}/tmp/oracle2hdfs/${shell}#${shell}</file>
            <capture-output/>
        </shell>
        <ok to="sqoop-node"/>
        <error to="fail"/>
    </action>
    <action name="sqoop-node">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <command>import --connect jdbc:oracle:thin:@***.***.***.**:1521:orcl --username *** --password *** --table LSETLIST --target-dir /yss/guzhi/***/${wf:actionData('shell-node')['day']}/LSETLIST/ --delete-target-dir --m 1</command>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
workflow2.xml
<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.5" name="bundle2-wmz">
    <start to="shell-node"/>
    <action name="shell-node">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <exec>${shell}</exec>
            <file>${nameNode}/tmp/oracle2hdfs/${shell}#${shell}</file>
            <capture-output/>
        </shell>
        <ok to="sqoop-node2"/>
        <error to="fail"/>
    </action>
    <action name="sqoop-node2">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <command>import --connect jdbc:oracle:thin:@***.***.***.**:1521:orcl --username *** --password *** --table CSGDZH --target-dir /yss/guzhi/***/${wf:actionData('shell-node')['day']}/CSGDZH --delete-target-dir --m 1</command>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
workflow3.xml and so on follow the same pattern.
coordinator1.xml:
<coordinator-app name="oracleToHdfsBySqoop-wmz" frequency="${coord:hours(12)}" start="${start}" end="${end}" timezone="GMT+0800" xmlns="uri:oozie:coordinator:0.5">
    <action>
        <workflow>
            <app-path>${workflowAppUri1}</app-path>
            <configuration>
                <property>
                    <name>jobTracker</name>
                    <value>${jobTracker}</value>
                </property>
                <property>
                    <name>nameNode</name>
                    <value>${nameNode}</value>
                </property>
                <property>
                    <name>queueName</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
coordinator2.xml:
<coordinator-app name="oracleToHdfsBySqoop-wmz" frequency="${coord:hours(12)}" start="${start}" end="${end}" timezone="GMT+0800" xmlns="uri:oozie:coordinator:0.5">
    <action>
        <workflow>
            <app-path>${workflowAppUri2}</app-path>
            <configuration>
                <property>
                    <name>jobTracker</name>
                    <value>${jobTracker}</value>
                </property>
                <property>
                    <name>nameNode</name>
                    <value>${nameNode}</value>
                </property>
                <property>
                    <name>queueName</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
coordinator3.xml and so on follow the same pattern.
bundle.xml
<bundle-app name='bundle-app' xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' xmlns='uri:oozie:bundle:0.1'>
    <coordinator name='cron-bundle1'>
        <app-path>${coordinator1}</app-path>
    </coordinator>
    <coordinator name='cron-bundle2'>
        <app-path>${coordinator2}</app-path>
    </coordinator>
</bundle-app>
job.properties
nameNode=hdfs://bj-rack001-hadoop002:8020
jobTracker=bj-rack001-hadoop003:8050
queueName=default
examplesRoot=wmz_test
oozie.libpath=hdfs://bj-rack001-hadoop002:8020/user/oozie/share/lib/sqoop
oozie.use.system.libpath=true
#oozie.wf.application.path=${nameNode}/tmp/oracle2hdfs
shell=getDate.sh
oozie.bundle.application.path=${nameNode}/tmp/oracle2hdfs/bundle.xml
oozie.service.coord.check.maximum.frequency=false
#oozie.coord.application.path=${nameNode}/tmp/bundleTest
start=2018-09-10T16:00Z
end=2028-09-10T16:00Z
workflowAppUri1=${nameNode}/tmp/oracle2hdfs/workflow1.xml
workflowAppUri2=${nameNode}/tmp/oracle2hdfs/workflow2.xml
coordinator1=${nameNode}/tmp/oracle2hdfs/coordinator1.xml
coordinator2=${nameNode}/tmp/oracle2hdfs/coordinator2.xml
Submit the job with:
oozie job -oozie http://***.***.***.***:11000/oozie -config /data/temp/wmz/shelltest/job.properties -run
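Once submitted, the bundle can be checked or stopped with the standard CLI; `OOZIE_HOST` and the job ID below are placeholders (the real ID is printed by the `-run` command):

```shell
# Show the status of a submitted job (workflow, coordinator, or bundle).
oozie job -oozie http://OOZIE_HOST:11000/oozie -info 0000001-180911000000000-oozie-oozi-B

# Kill it if needed.
oozie job -oozie http://OOZIE_HOST:11000/oozie -kill 0000001-180911000000000-oozie-oozi-B
```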
4. Problems encountered
1. If the script fails to execute with a #!/bin/bash shebang, writing it as #!/bin/sh can fix the error.
2. I also tried wrapping the sqoop commands in a shell script and running that script through Oozie's shell action. From inside the shell action, sqoop could only run list-tables and list-databases; every import command failed, and I still do not know why. The script works when executed directly, and Oozie runs a shell action alone or a sqoop import action alone without problems, but the combination fails. Very puzzling.