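I recently wrote a MapReduce program and needed to schedule it as a job. I wasn't sure which tool would fit best, and in the end I chose Oozie.
I had been writing web applications until my bosses suddenly asked me to work with Hadoop, so I cheerfully took the job on. As a newcomer, I ran into a lot of pitfalls along the way.
Environment: Hadoop 1.2.1, Sqoop 1.4.4, Oozie 3.3.2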
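1. For installing Oozie, see my earlier post: http://blog.csdn.net/jueshengtianya/article/details/25300761 It covers the pitfalls I ran into.
2. An Oozie workflow looks for the jar it runs in the lib directory that sits next to workflow.xml. Every jar the workflow depends on must be placed in that same lib directory.
3. My Oozie working directory: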
hadoop@steven:~/hadoop1.1.2/hadoop-1.2.1/iesRunShell/oozie/iesCron$ ../../../bin/hadoop fs -ls /ies/oozie/cron/
Found 4 items
-rw-r--r-- 3 hadoop supergroup 1591 2014-05-12 19:37 /ies/oozie/cron/coordinator.xml
-rw-r--r-- 3 hadoop supergroup 1032 2014-05-10 20:12 /ies/oozie/cron/job.properties
drwxr-xr-x - hadoop supergroup 0 2014-05-13 21:41 /ies/oozie/cron/lib
-rw-r--r-- 3 hadoop supergroup 5450 2014-05-13 20:13 /ies/oozie/cron/workflow.xml
4. My workflow file is configured as follows. There is not much to explain, so just read through it:
<workflow-app xmlns="uri:oozie:workflow:0.2" name="java-main-wf">
<start to="firstMid"/>
<!-- Generate the first intermediate result -->
<action name="firstMid">
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<main-class>com.miaozhen.ies.job.IesJob4MZSEQ</main-class>
<arg>/ies/output/mid</arg>
</java>
<ok to="joinLog"/>
<error to="fail"/>
</action>
<!-- Join the intermediate result with the current day's logs -->
<action name="joinLog">
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<main-class>com.miaozhen.ies.job.JoinJob</main-class>
<arg>/ies/join</arg>
</java>
<ok to="generateResult"/>
<error to="fail"/>
</action>
<fork name="generateResult">
<path start="iesResult"/>
<path start="spidResult"/>
</fork>
<!-- Generate the results -->
<action name="iesResult">
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<main-class>com.miaozhen.ies.job.IesResultJob</main-class>
<arg>/ies/join/joinResult/iesResult-r-00000</arg>
<arg>/ies/iesResult</arg>
</java>
<ok to="completed"/>
<error to="fail"/>
</action>
<action name="spidResult">
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<main-class>com.miaozhen.ies.job.ResultJob</main-class>
<arg>/ies/join/joinResult/iesResult-r-00000</arg>
<arg>/ies/spidResult</arg>
</java>
<ok to="completed"/>
<error to="fail"/>
</action>
<join name="completed" to="sqoopResult"/>
<fork name="sqoopResult">
<path start="sqoopIesResult"/>
<path start="sqoopSpidResult"/>
<path start="sqoopRelationResult"/>
</fork>
<action name="sqoopIesResult">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<command>
export --connect jdbc:mysql://127.0.0.1:3306/ies2 --username root --table ies_report --export-dir /ies/iesResult/data/iesRegionResult-r-00000 --columns iesId,caid,imp3rd,clk3rd,period,regionId,insertTime
</command>
</sqoop>
<ok to="sqoopCompleted"/>
<error to="fail"/>
</action>
<action name="sqoopSpidResult">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<command>
export --connect jdbc:mysql://127.0.0.1:3306/ies2 --username root --table spots_report --export-dir /ies/spidResult/spid/spidResult-r-00000 --columns spid,impIES,clkIES,insertTime
</command>
</sqoop>
<ok to="sqoopCompleted"/>
<error to="fail"/>
</action>
<action name="sqoopRelationResult">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<command>
export --connect jdbc:mysql://127.0.0.1:3306/ies2 --username root --table relation --export-dir /ies/spidResult/spid/relation-r-00000 --columns iesId,spid,insertTime
</command>
</sqoop>
<ok to="sqoopCompleted"/>
<error to="fail"/>
</action>
<join name="sqoopCompleted" to="end"/>
<kill name="fail">
<message>Java failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
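Each <java> action above runs the main method of the class named in <main-class> and passes the <arg> values, in order, as the String[] args. A minimal sketch of such an entry point follows. The class name ExampleOozieMain and the helper outputDir are hypothetical and are not taken from the original job code; the real classes (such as IesJob4MZSEQ) would configure and submit a MapReduce job where the sketch only prints.

```java
/**
 * Hypothetical entry point for an Oozie <java> action.
 * Oozie passes each <arg> element as one entry of args, in document order.
 */
final class ExampleOozieMain {

    /** Returns the first argument, which the workflow above uses as the output path. */
    static String outputDir(String[] args) {
        if (args == null || args.length < 1) {
            // Throwing from main makes the action take its <error to="..."/> transition.
            throw new IllegalArgumentException("usage: ExampleOozieMain <outputDir>");
        }
        return args[0];
    }

    public static void main(String[] args) {
        String out = outputDir(args);
        // Assumption: a real job would build and submit its MapReduce job here.
        System.out.println("Output directory: " + out);
    }
}
```

With <arg>/ies/output/mid</arg>, as in the firstMid action, args[0] is /ies/output/mid. Returning normally from main leads to the <ok> transition, while an uncaught exception leads to the <error> transition.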
5. One thing to watch out for when Oozie calls Sqoop:
export --connect jdbc:mysql://127.0.0.1:3306/ies2 --username root --table ies_report --export-dir /ies/iesResult/data/iesRegionResult-r-00000 --columns iesId,caid,imp3rd,clk3rd,period,regionId,insertTime
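When the insertTime column is exported, the value must be written in the format yyyy-MM-dd HH:mm:ss. Sqoop parses the field as a timestamp during the insert (the stack trace shows the call to java.sql.Timestamp.valueOf), so a value without the hours, minutes, and seconds fails with the following error: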
java.io.IOException: Can't export data, please check task tracker logs
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
at java.sql.Timestamp.valueOf(Timestamp.java:202)
at spots_report.__loadFromFields(spots_report.java:266)
at spots_report.parse(spots_report.java:203)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:83)
... 10 more
If your job writes the value as yyyy-MM-dd HH:mm:ss rather than yyyy-MM-dd, Sqoop's date conversion works without any problem.
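Here is a minimal sketch of producing that format on the Java side, assuming the MapReduce job writes the insertTime field itself. The class name SqoopTimestamps and its two methods are hypothetical and do not come from the original job code.

```java
import java.sql.Timestamp;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

/** Hypothetical helper that renders dates in the form java.sql.Timestamp.valueOf accepts. */
final class SqoopTimestamps {

    private static final String DATE_ONLY = "yyyy-MM-dd";
    private static final String FULL = "yyyy-MM-dd HH:mm:ss";

    /** Formats a Date with hours, minutes, and seconds in the JVM's default time zone. */
    static String format(Date date) {
        // SimpleDateFormat is not thread-safe, so a new instance is created per call.
        return new SimpleDateFormat(FULL).format(date);
    }

    /** Expands a date-only string such as 2014-05-13 to the full format at local midnight. */
    static String fromDateOnly(String day) {
        SimpleDateFormat in = new SimpleDateFormat(DATE_ONLY);
        in.setLenient(false); // reject impossible dates such as 2014-02-30
        try {
            return format(in.parse(day));
        } catch (ParseException e) {
            throw new IllegalArgumentException("expected " + DATE_ONLY + " but got: " + day, e);
        }
    }

    public static void main(String[] args) {
        String value = fromDateOnly("2014-05-13");
        System.out.println(value);
        // The same call that failed in the stack trace now succeeds.
        System.out.println(Timestamp.valueOf(value));
    }
}
```

Calling Timestamp.valueOf("2014-05-13") directly throws the IllegalArgumentException seen in the stack trace, while the expanded string is accepted.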