Azkaban Summary
Overview
Azkaban is a batch workflow job scheduler open-sourced by LinkedIn. It runs a group of jobs and processes in a specific order within a workflow.
Azkaban uses a key/value (properties) file format to define dependencies between jobs, and provides an easy-to-use web UI for maintaining and tracking your workflows.
Its main features:
1. Web user interface
2. Easy workflow uploads
3. Easy configuration of dependencies between jobs
4. Workflow scheduling
5. Authentication/authorization (permission management)
6. Ability to kill and restart workflows
7. Modular and pluggable plugin mechanism
8. Project workspaces
9. Logging and auditing of workflows and jobs
Comparison with Other Schedulers
Feature | Hamake | Oozie | Azkaban | Cascading
Workflow description language | XML | XML (xPDL based) | text file with key/value pairs | Java API
Dependency mechanism | data-driven | explicit | explicit | explicit
Requires a web container | No | Yes | Yes | No
Progress tracking | console/log messages | web page | web page | Java API
Hadoop job scheduling support | no | yes | yes | yes
Execution mode | command line utility | daemon | daemon | API
Pig support | yes | yes | yes | yes
Event notification | no | no | no | yes
Installation required | no | yes | yes | no
Supported Hadoop versions | 0.18+ | 0.20+ | currently unknown | 0.18+
Retry support | no | workflow node level | yes | yes
Run arbitrary commands | yes | yes | yes | yes
Amazon EMR support | yes | no | currently unknown | yes
Why a Workflow Scheduling System Is Needed
1. A complete data analysis system is usually made up of a large number of task units:
shell scripts, Java programs, MapReduce programs, Hive scripts, and so on.
2. These task units have ordering and dependency relationships among them.
3. To organize such a complex execution plan properly, a workflow scheduling system is needed to drive execution.
How this is done in practice:
Simple task scheduling: define jobs directly with Linux crontab.
Complex task scheduling: build a scheduling platform, or use an off-the-shelf open-source scheduler such as Oozie or Azkaban.
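The difference between the two cases can be sketched in plain shell. With cron, entries are independent; expressing "B runs only after A succeeds" has to be wired up by hand. The task names and marker files below are hypothetical stand-ins:

```shell
# Stand-in tasks that write marker files (hypothetical names):
run_a() { echo "A done" > a.out; }   # e.g. a data-ingest script
run_b() { echo "B done" > b.out; }   # a step that needs A's output

# Hand-rolled dependency chain: run_b only runs if run_a succeeded.
run_a && run_b
# In Azkaban the same relationship is declared with a "dependencies=a"
# line in b's .job file, and failures stop the chain automatically.
```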
Azkaban Installation
Download: http://pan.baidu.com/s/1b4mJWq (access code: jh75). Contact the author if the download is unavailable.
1-1) Installation
[root@hadoop1 azkaban]# ls
azkaban-2.5.0 azkaban-executor-2.5.0 azkaban-web-2.5.0
[root@hadoop1 azkaban]# mv azkaban-executor-2.5.0 executor
[root@hadoop1 azkaban]# mv azkaban-web-2.5.0 webserver
1-2) Create the Database
[root@hadoop1 ~]# mysql -uroot -p
mysql> create database azkaban;
Query OK, 1 row affected (0.00 sec)
mysql> use azkaban;
Database changed
mysql> source /usr/local/azkaban/azkaban-2.5.0/create-all-sql-2.5.0.sql;
Query OK, 0 rows affected (0.25 sec)
...
mysql> show tables;
+------------------------+
| Tables_in_azkaban |
+------------------------+
| active_executing_flows |
| active_sla |
| execution_flows |
| execution_jobs |
| execution_logs |
| project_events |
| project_files |
| project_flows |
| project_permissions |
| project_properties |
| project_versions |
| projects |
| properties |
| schedules |
| triggers |
+------------------------+
15 rows in set (0.00 sec)
1-3) Create the SSL Configuration
Reference: http://docs.codehaus.org/display/JETTY/How+to+configure+SSL
Command: keytool -keystore keystore -alias jetty -genkey -keyalg RSA
Running this command prompts for a password for the generated keystore and for some identifying information. Remember the password; the prompts look like this:
[root@hadoop1 azkaban]# keytool -keystore keystore -alias jetty -genkey -keyalg RSA
Enter keystore password:
Re-enter new password:
What is your first and last name?
[Unknown]:
What is the name of your organizational unit?
[Unknown]:
What is the name of your organization?
[Unknown]:
What is the name of your City or Locality?
[Unknown]:
What is the name of your State or Province?
[Unknown]:
What is the two-letter country code for this unit?
[Unknown]: CN
Is CN=Unknown, OU=Unknown, O=Unknown, L=Unknown, ST=Unknown, C=CN correct?
[no]: Y
Enter key password for <jetty>
(RETURN if same as keystore password):
Re-enter new password:
[root@hadoop1 azkaban]# ls
azkaban-2.5.0 executor webserver jobs keystore
The web server uses SSL, so the generated keystore is moved into the webserver directory:
[root@hadoop1 azkaban]# mv keystore webserver
1-4) Configure the Time Zone
[root@hadoop1 conf]# tzselect
Please identify a location so that time zone rules can be set correctly.
Please select a continent or ocean.
1) Africa
2) Americas
3) Antarctica
4) Arctic Ocean
5) Asia
6) Atlantic Ocean
7) Australia
8) Europe
9) Indian Ocean
10) Pacific Ocean
11) none - I want to specify the time zone using the Posix TZ format.
#? 5
Please select a country.
1) Afghanistan 18) Israel 35) Palestine
2) Armenia 19) Japan 36) Philippines
3) Azerbaijan 20) Jordan 37) Qatar
4) Bahrain 21) Kazakhstan 38) Russia
5) Bangladesh 22) Korea (North) 39) Saudi Arabia
6) Bhutan 23) Korea (South) 40) Singapore
7) Brunei 24) Kuwait 41) Sri Lanka
8) Cambodia 25) Kyrgyzstan 42) Syria
9) China 26) Laos 43) Taiwan
10) Cyprus 27) Lebanon 44) Tajikistan
11) East Timor 28) Macau 45) Thailand
12) Georgia 29) Malaysia 46) Turkmenistan
13) Hong Kong 30) Mongolia 47) United Arab Emirates
14) India 31) Myanmar (Burma) 48) Uzbekistan
15) Indonesia 32) Nepal 49) Vietnam
16) Iran 33) Oman 50) Yemen
17) Iraq 34) Pakistan
#? 9
Please select one of the following time zone regions.
1) Beijing Time
2) Xinjiang Time
#? 1
The following information has been given:
China
Beijing Time
Therefore TZ='Asia/Shanghai' will be used.
Local time is now: Tue Sep 27 10:13:25 CST 2016.
Universal Time is now: Tue Sep 27 02:13:25 UTC 2016.
Is the above information OK?
1) Yes
2) No
#? yes
Please enter 1 for Yes, or 2 for No.
#? 1
You can make this change permanent for yourself by appending the line
TZ='Asia/Shanghai'; export TZ
to the file '.profile' in your home directory; then log out and log in again.
Here is that TZ value again, this time on standard output so that you
can use the /usr/bin/tzselect command in shell scripts:
Asia/Shanghai
A rather unfriendly bit of interface design. To apply the time zone system-wide:
[root@hadoop1 conf]# cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
1-5) Edit the Configuration Files
A) The webserver configuration (conf/azkaban.properties)
[root@hadoop1 conf]# cat azkaban.properties
#Azkaban Personalization Settings
azkaban.name=Test
azkaban.label=My Local Azkaban
azkaban.color=#FF3601
azkaban.default.servlet.path=/index
web.resource.dir=web/
default.timezone.id=Asia/Shanghai
#Azkaban UserManager class
user.manager.class=azkaban.user.XmlUserManager
user.manager.xml.file=conf/azkaban-users.xml
#Loader for projects
executor.global.properties=conf/global.properties
azkaban.project.dir=projects
database.type=mysql
mysql.port=3306
mysql.host=localhost
mysql.database=azkaban
mysql.user=it
mysql.password=it
mysql.numconnections=100
# Velocity dev mode
velocity.dev.mode=false
# Azkaban Jetty server properties.
jetty.maxThreads=25
jetty.ssl.port=8443
jetty.port=8081
jetty.keystore=keystore
jetty.password=123456
jetty.keypassword=123456
jetty.truststore=keystore
jetty.trustpassword=123456
# Azkaban Executor settings
executor.port=12321
# mail settings
mail.sender=
mail.host=
job.failure.email=
job.success.email=
lockdown.create.projects=false
cache.directory=cache
The key settings are the MySQL connection entries and the Jetty SSL entries above; the Jetty passwords are the ones chosen when the keystore was generated.
B) The user configuration (conf/azkaban-users.xml)
[root@hadoop1 conf]# vi azkaban-users.xml
<azkaban-users>
<user username="azkaban" password="azkaban" roles="admin" groups="azkaban" />
<user username="metrics" password="metrics" roles="metrics"/>
<user username="admin" password="admin" roles="admin,metrics" />
<role name="admin" permissions="ADMIN" />
<role name="metrics" permissions="METRICS"/>
</azkaban-users>
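Additional accounts follow the same user/role pattern. A hypothetical read-only account is sketched below; Azkaban's permission names include ADMIN, READ, WRITE, and EXECUTE (check your version's documentation), and the web server must be restarted after editing this file:

```xml
<!-- Hypothetical read-only account added inside <azkaban-users>: -->
<user username="viewer" password="viewer" roles="readonly"/>
<role name="readonly" permissions="READ"/>
```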
The admin user and role entries above are the additions made here.
C) The executor configuration (conf/azkaban.properties under the executor directory)
[root@hadoop1 conf]# cat azkaban.properties
#Azkaban
default.timezone.id=Asia/Shanghai
# Azkaban JobTypes Plugins
azkaban.jobtype.plugin.dir=plugins/jobtypes
#Loader for projects
executor.global.properties=conf/global.properties
azkaban.project.dir=projects
database.type=mysql
mysql.port=3306
mysql.host=localhost
mysql.database=azkaban
mysql.user=it
mysql.password=it
mysql.numconnections=100
# Azkaban Executor settings
executor.maxThreads=50
executor.port=12321
executor.flow.threads=30
The settings to edit are mainly the MySQL connection entries, which must match the webserver configuration.
1-6) Startup
[root@hadoop1 executor]# ./bin/azkaban-executor-start.sh
Using Hadoop from /usr/local/hadoop-2.6.4
Using Hive from
./bin/..
[root@hadoop1 webserver]# ./bin/azkaban-web-start.sh
Using Hadoop from /usr/local/hadoop-2.6.4
Using Hive from
./bin/..
Start the executor first, then the web server.
[root@hadoop1 azkaban-web-2.5.0]# nohup bin/azkaban-web-start.sh 1>/tmp/azstd.out 2>/tmp/azerr.out &
[root@hadoop1 azkaban-web-2.5.0]# 2016/09/26 20:47:47.686 -0700 ERROR [AzkabanWebServer] [Azkaban] Starting Jetty Azkaban Executor...
org.apache.commons.dbcp.SQLNestedException: Cannot create PoolableConnectionFactory
(Access denied for user 'it'@'localhost' (using password: YES))
If Jetty fails to start with the access-denied error above, grant the configured MySQL user its privileges and try again:
mysql> GRANT ALL PRIVILEGES ON *.* TO 'it'@'localhost' IDENTIFIED BY 'it' WITH GRANT OPTION;
Query OK, 0 rows affected (0.00 sec)
mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)
Note that the web UI is served over HTTPS (port 8443 as configured above); log in with the admin/admin account defined in azkaban-users.xml.
Azkaban Examples
Azkaban's built-in job types include command and java.
1-1) Create a Job Description File
First, create a test.job file on the Windows machine:
#command.job
type=command
command=echo "hello world"
The file must be saved as plain UTF-8 (watch out for the BOM some Windows editors add), or Azkaban will fail to recognize it.
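A quick way to check for and strip a stray BOM, sketched below with a simulated file (GNU sed assumed):

```shell
# Simulate a job file saved with a UTF-8 BOM (the three bytes EF BB BF):
printf '\357\273\277type=command\ncommand=echo "hello"\n' > test.job
head -c 3 test.job | od -An -tx1      # shows "ef bb bf" when a BOM is present
# Strip the BOM in place (GNU sed):
sed -i '1s/^\xEF\xBB\xBF//' test.job
```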
Zip the job file and upload it through the project's web UI; the other operations there are worth exploring as well.
1-2) A Multi-Job Command Workflow
[root@hadoop1 azkaban]# mkdir test
[root@hadoop1 azkaban]# touch test.text
[root@hadoop1 azkaban]# cat test
cat: test: Is a directory
[root@hadoop1 azkaban]# cat test.text
[root@hadoop1 azkaban]#
A) test.sh
#!/bin/bash
echo "1234567890" > /usr/local/azkaban/test.text
B) command.job
#command.job
type=command
command=sh test.sh
C) Package into a Zip Archive
command.zip
Upload and run it following the same steps as above.
D) Check the Result
[root@hadoop1 azkaban]# cat test.text
1234567890
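The flow above still contains a single job. A genuinely multi-job flow links several .job files through the dependencies key; the two-step sketch below uses hypothetical job names and is zipped and uploaded the same way:

```shell
mkdir -p flowdemo

# First step of the flow:
cat > flowdemo/step1.job <<'EOF'
type=command
command=echo "step one"
EOF

# Second step; "dependencies" names the job(s) that must succeed first:
cat > flowdemo/step2.job <<'EOF'
type=command
dependencies=step1
command=echo "step two"
EOF

# zip -j flowdemo.zip flowdemo/*.job   # then upload the zip in the web UI
```

In the web UI the flow takes the name of the job that nothing depends on (step2 here), and the graph view shows step1 feeding into it.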
1-3) An HDFS Task
# fs.job
type=command
command=hadoop fs -mkdir /azkabanTest
Hdfs.zip
Upload and run it following the same steps as above.
[root@hadoop1 azkaban]# hadoop fs -ls /
Found 7 items
drwxr-xr-x - root supergroup 0 2016-09-28 10:59 /azkabanTest
drwxr-xr-x - root supergroup 0 2016-09-25 04:50 /data
drwxr-xr-x - root supergroup 0 2016-09-26 00:28 /flume
drwxr-xr-x - root supergroup 0 2016-09-28 10:53 /hadoopTest
drwxr-xr-x - root supergroup 0 2016-09-26 02:56 /home
drwx-wx-wx - root supergroup 0 2016-09-24 00:39 /tmp
drwxr-xr-x - root supergroup 0 2016-09-24 20:00 /user
1-4) A MapReduce Task
A) Upload the Input Files
[root@hadoop1 hadoop]# hadoop fs -put /usr/local/hadoop-2.6.4/etc/hadoop/*.xml /wordcount
# mapReduce.job
type=command
command=hadoop jar /usr/local/azkaban/hadoop-mapreduce-examples-2.6.4.jar wordcount /wordcount /wordcountOuput
mapReduce.zip
Upload and run it following the same steps as above.
[root@hadoop1 azkaban]# hadoop fs -ls /wordcountOuput
Found 2 items
-rw-r--r-- 3 root supergroup 0 2016-09-28 11:36 /wordcountOuput/_SUCCESS
-rw-r--r-- 3 root supergroup 10544 2016-09-28 11:36 /wordcountOuput/part-r-00000
[root@hadoop1 azkaban]# hadoop fs -cat /wordcountOuput/part-r-00000
"*" 18
"AS 8
"License"); 8
"alice,bob 18
"kerberos". 1
"simple" 1
'HTTP/' 1
'none' 1
'random' 1
'sasl' 1
'string' 1
'zookeeper' 2
...
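Note that the wordcount job fails if its output directory already exists, so a re-run needs a cleanup step. One way, sketched with hypothetical job names, is to put the cleanup in its own job and make the MapReduce job depend on it:

```
# clean.job -- removes the previous output first (hypothetical)
type=command
command=hadoop fs -rm -r -f /wordcountOuput

# mapReduce.job -- runs only after clean succeeds
type=command
dependencies=clean
command=hadoop jar /usr/local/azkaban/hadoop-mapreduce-examples-2.6.4.jar wordcount /wordcount /wordcountOuput
```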
1-5) Azkaban and Hive
1-1) Write the job file azkaban-hive.job
# azkaban-hive.job
type=command
command=/usr/local/hive/bin/hive -e "show databases"
1-2) Zip It on Windows
azkaban-hive.zip
Upload the zip file following the same steps as above.
1-3) Check the Result
A second, fuller example loads data into a Hive table:
1-1) Prepare the Data
[root@hadoop1 testData]# vi test.text
1,dsdefe
2,dfegf
3,edfgrgrg
4,fhthty
5,ghjyjyj
6,fhgththjy
1-2) Write the Azkaban Job
hive.job
# hive.job
type=command
command=/usr/local/hive/bin/hive -f 'test.sql'
1-3) The test.sql Script
create database azkabanHive2;
use azkabanHive2;
drop table hive2;
create table hive2(id int,name string) row format delimited fields terminated by ',';
load data inpath '/azkabanTest/test.text' into table hive2;
create table hive3 as select id from hive2;
1-4) Upload the Data to HDFS
[root@hadoop1 testData]# hadoop fs -put test.text /azkabanTest
[root@hadoop1 testData]# hadoop fs -cat /azkabanTest/test.text
1 dsdefe
2 dfegf
3 edfgrgrg
4 fhthty
5 ghjyjyj
6 fhgththjy
1-5) Check the Results
hive> show databases;
OK
azkabanhive2
Time taken: 0.212 seconds, Fetched: 5 row(s)
hive> use azkabanhive2;
OK
Time taken: 0.079 seconds
hive> show tables;
OK
hive2
hive3
Time taken: 0.084 seconds, Fetched: 2 row(s)
hive> select * from hive2;
OK
1 dsdefe
2 dfegf
3 edfgrgrg
4 fhthty
5 ghjyjyj
6 fhgththjy
Time taken: 0.413 seconds, Fetched: 6 row(s)
hive> select * from hive3;
OK
1 dsdefe
2 dfegf
3 edfgrgrg
4 fhthty
5 ghjyjyj
6 fhgththjy
Time taken: 0.167 seconds, Fetched: 6 row(s)