Apache Oozie - The Workflow Scheduler for Hadoop

One of Oozie’s strengths is that it was custom built from the ground up for Hadoop. This not only means that Oozie works well on Hadoop, but that the authors of Oozie had an opportunity to build a new system incorporating much of their knowledge about other legacy workflow systems. Although some users view Oozie as just a workflow system, it has evolved into something more than that. The ability to use data availability and time-based triggers to schedule workflows via the Oozie coordinator is as important to today’s users as the workflow. The higher-level concept of bundles, which enable users to package multiple coordinators into complex data pipelines, is also gaining a lot of traction as applications and pipelines moving to Hadoop are getting more complicated.

A Recurrent Problem

Hadoop, Pig, Hive, and many other projects provide the foundation for storing and processing large amounts of data in an efficient way. Most of the time, it is not possible to perform all required processing with a single MapReduce, Pig, or Hive job. Multiple MapReduce, Pig, or Hive jobs often need to be chained together, producing and consuming intermediate data and coordinating their flow of execution.

As developers started doing more complex processing using Hadoop, multistage Hadoop jobs became common. This led to several ad hoc solutions to manage the execution and interdependency of these multiple Hadoop jobs. Some developers wrote simple shell scripts to start one Hadoop job after the other. Others used Hadoop’s JobControl class, which executes multiple MapReduce jobs using topological sorting. One development team resorted to Ant with a custom Ant task to specify their MapReduce and Pig jobs as dependencies of each other—also a topological sorting mechanism. Another team implemented a server-based solution that ran multiple Hadoop jobs using one thread to execute each job.
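
As an illustration of the first approach, a hand-rolled chaining script might look like the sketch below. The jar name, class names, and paths are made up for the example; each stage consumes the output of the previous one, and the script simply stops at the first failure.

#!/bin/bash
# Run three MapReduce stages in order; abort the pipeline on the first failure.
set -e

hadoop jar pipeline.jar com.example.CleanseJob   /data/raw        /data/cleansed
hadoop jar pipeline.jar com.example.AggregateJob /data/cleansed   /data/aggregated
hadoop jar pipeline.jar com.example.ReportJob    /data/aggregated /data/reports

echo "pipeline finished"

Anything beyond a straight line of jobs (retries, notifications, rerunning only the failed stage) has to be bolted on by hand, which is where the problems described next come from.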

As these solutions started to be widely used, several issues emerged. It was hard to track errors and it was difficult to recover from failures. It was not easy to monitor progress. It complicated the lives of administrators, who had to monitor not only the health of the cluster but also the health of the various systems running multistage jobs from client machines. Developers moved from one project to another and had to learn the specifics of the custom framework used by the project they were joining. Different organizations were using significant resources to develop and support multiple frameworks for accomplishing basically the same task.

A Common Solution: Oozie

It was clear that there was a need for a general-purpose system to run multistage Hadoop jobs with the following requirements:

  • It should use an adequate and well-understood programming model to facilitate its adoption and to reduce developer ramp-up time.
  • It should be easy to troubleshoot and recover jobs when something goes wrong.
  • It should be extensible to support new types of jobs.
  • It should scale to support several thousand concurrent jobs.
  • Jobs should run in a server to increase reliability.
  • It should be a multitenant service to reduce the cost of operation.

A Simple Oozie Job

We’ll create an Oozie workflow application named identity-WF that runs an identity MapReduce job. The job simply echoes its input as its output and does nothing else.

identity-WF Oozie workflow example

$ git clone https://github.com/oozie-book/examples.git
$ cd examples/chapter-01/identity-wf
$ mvn clean assembly:single
...
[INFO] BUILD SUCCESS
...
$ tree target/example

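The tree output should show roughly the layout below. The file names under data are an assumption; what matters is that workflow.xml sits under an app directory and the sample input under data, matching the paths used later in this example.

target/example/
├── ch01-identity
│   ├── app
│   │   └── workflow.xml
│   └── data
│       └── (sample input file)
└── job.properties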

The workflow.xml file contains the workflow definition of application identity-WF.

Why XML?

By using XML, Oozie application developers can use any XML editor tool to author their Oozie application. The Oozie server uses XML libraries to parse and validate the correctness of an Oozie application before attempting to use it, significantly simplifying the logic that processes the Oozie application definition. The same holds true for systems creating Oozie applications on the fly.
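
Because the definition is plain XML, it can also be sanity-checked against the schema before submission, for example with the Oozie CLI's validate command (run from the directory containing the file):

$ oozie validate workflow.xml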

identity-WF Oozie workflow XML (workflow.xml)

<workflow-app xmlns="uri:oozie:workflow:0.4" name="identity-WF">

  <parameters>
    <property>
      <name>jobTracker</name>
    </property>
    <property>
      <name>nameNode</name>
    </property>
    <property>
      <name>exampleDir</name>
    </property>
  </parameters>

  <start to="identity-MR"/>

  <action name="identity-MR">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <prepare>
        <delete path="${exampleDir}/data/output"/>
      </prepare>
      <configuration>
        <property>
          <name>mapred.mapper.class</name>
          <value>org.apache.hadoop.mapred.lib.IdentityMapper</value>
        </property>
        <property>
          <name>mapred.reducer.class</name>
          <value>org.apache.hadoop.mapred.lib.IdentityReducer</value>
        </property>
        <property>
          <name>mapred.input.dir</name>
          <value>${exampleDir}/data/input</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${exampleDir}/data/output</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="success"/>
    <error to="fail"/>
  </action>

  <kill name="fail">
    <message>The Identity Map-Reduce job failed!</message>
  </kill>

  <end name="success"/>

</workflow-app>

The example application consists of a single file, workflow.xml. We need to package and deploy the application on HDFS before we can run a job.

hanying@master$ hdfs dfs -mkdir -p /user/hanying
hanying@master$ hdfs dfs -put target/example/ch01-identity ch01-identity  # this stores the package under /user/hanying
hanying@master$ hdfs dfs -ls -R /user/hanying/ch01-identity  # same as: hdfs dfs -ls -R ch01-identity


Before we can run the Oozie job, we need a job.properties file in our local filesystem that specifies the required parameters for the job and the location of the application package in HDFS:

job.properties
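
The exact values are cluster-specific. A minimal example covering the three parameters declared in workflow.xml plus the application path might look like the following; the host and port for jobTracker are assumptions (it should point at the JobTracker or, on YARN, the ResourceManager address), and the app subdirectory matches the package layout shown earlier:

nameNode=hdfs://master:8020
jobTracker=master:8032
exampleDir=${nameNode}/user/${user.name}/ch01-identity
oozie.wf.application.path=${exampleDir}/app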

$ export OOZIE_URL=http://localhost:11000/oozie
$ oozie job -run -config target/example/job.properties

Error:

SLF4J: Class path contains multiple SLF4J bindings.
...
Error: HTTP error code: 500 : Internal Server Error


Solution for the SLF4J warning (see http://www.slf4j.org/codes.html#multiple_bindings): remove the conflicting JAR.
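
To find the conflicting bindings, list the slf4j JARs on the Oozie and Hadoop classpaths (the paths below match the ones used elsewhere in this post; adjust them to your installation):

$ find /usr/local/src/oozie-4.2.0 -name 'slf4j-*.jar'
$ find /home/hanying/hadoop -name 'slf4j-*.jar'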

Solution for "HTTP error code: 500 : Internal Server Error":

Check oozie-error.log. It shows an error from V1JobsServlet, with authorization disabled.

Following the "To use a Self-Signed Certificate" section of the Oozie install guide
(http://oozie.apache.org/docs/4.2.0/AG_Install.html), generate and import a self-signed certificate:

$ sudo keytool -genkeypair -alias tomcat2 -keyalg RSA -dname "CN=localhost" -storepass password -keypass password
$ sudo keytool -exportcert -alias tomcat2 -file ./certificate.cert
# generates the certificate.cert file
$ sudo keytool -import -alias tomcat -file certificate.cert
# answer "yes" when keytool asks whether to trust the certificate; it then reports:
Certificate was added to keystore

Error:

2016-05-20 09:55:49,729 ERROR ShareLibService:517 - SERVER[master] org.apache.oozie.service.ServiceException: E0104: Could not fully initialize service [org.apache.oozie.service.ShareLibService], Not able to cache sharelib. An Admin needs to install the sharelib with oozie-setup.sh and issue the 'oozie admin' CLI command to update the sharelib
org.apache.oozie.service.ServiceException: E0104: Could not fully initialize service [org.apache.oozie.service.ShareLibService], Not able to cache sharelib. An Admin needs to install the sharelib with oozie-setup.sh and issue the 'oozie admin' CLI command to update the sharelib

Fix it:

# stop oozie
# edit oozie-site.xml and add:
<property>
    <name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
    <value>*=/home/hanying/hadoop/etc/hadoop</value>
</property>

hanying@master:/usr/local/src/oozie-4.2.0$ sudo -u oozie bin/oozie-setup.sh sharelib create -fs hdfs://master:8020 -locallib share/

Error:

hanying@master:/usr/local/src/oozie-4.2.0$ sudo -u oozie bin/oozie-setup.sh sharelib create -fs hdfs://master:8020 -locallib share/
  setting CATALINA_OPTS="$CATALINA_OPTS -Xmx1024m"
the destination path for sharelib is: /user/oozie/share/lib/lib_20160520173926

Error: User: oozie is not allowed to impersonate oozie

Stack trace for the error was (for debug purposes):

Fix it:

# 1. Add the following to Hadoop's core-site.xml:
<property>
  <name>hadoop.proxyuser.oozie.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.oozie.groups</name>
  <value>*</value>
</property>
# 2. Restart all services.

Make sure your Hadoop services are started.
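
With the proxyuser settings in place and everything restarted, the sharelib creation should go through. You can then refresh the running server's sharelib cache, confirm the server is healthy, and re-submit the workflow; the commands below reuse the same paths and OOZIE_URL as above, and the job ID is whatever the submit command prints:

$ sudo -u oozie bin/oozie-setup.sh sharelib create -fs hdfs://master:8020 -locallib share/
$ oozie admin -sharelibupdate
$ oozie admin -status
$ oozie job -run -config target/example/job.properties
$ oozie job -info <job-id>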

The default Hadoop HTTP ports (each daemon serves a web UI on its port) are as follows:

Daemon                   Default Port   Configuration Parameter
Namenode                 50070          dfs.http.address
Datanodes                50075          dfs.datanode.http.address
Secondarynamenode        50090          dfs.secondary.http.address
Backup/Checkpoint node   50105          dfs.backup.http.address
Jobtracker               50030          mapred.job.tracker.http.address
Tasktrackers             50060          mapred.task.tracker.http.address

Internally, Hadoop mostly uses Hadoop IPC (its inter-process communication layer) to communicate among servers. The following table lists those RPC ports and their configuration parameters; it does not include the HTTP ports mentioned above.

Daemon       Default Port   Configuration Parameter
Namenode     8020           fs.default.name
Datanode     50010          dfs.datanode.address
Datanode     50020          dfs.datanode.ipc.address
Backupnode   50100          dfs.backup.address
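
A quick way to confirm that the daemons (and the Oozie server on port 11000) are actually listening on these ports; ss -tlnp works the same way if netstat is not installed:

$ sudo netstat -tlnp | grep -E '8020|50010|50070|50075|11000'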