I. Running a shell script with Oozie (runs an MR job that merges incremental data)
Reference: http://gethue.com/use-the-shell-action-in-oozie/
1. In the Hue workflow editor, click Create and drag the Shell action onto the workflow.
2. Add the command: bash (any other executable Linux command works as well).
3. Add the argument. Note: the argument is the full name of the shell script (run-mr-compact.sh).
4. Add the xxx.sh, xxx.jar and xxx.properties files. (Note: these files must all sit in the same HDFS directory, otherwise the job fails!)
The recommended order for adding the files is: sh, jar, properties.
Contents of the files:
(1) /user/greatgas/oozie/everyday/shell/run-mr-compact.sh:
#!/bin/bash
hadoop jar mr-compact.jar <fully qualified main class> <path of the properties file, relative to the jar>
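For illustration, assuming a hypothetical driver class enn.mr.CompactDriver (the real class name ships inside mr-compact.jar and is not given in the original) and conf.properties sitting next to the jar, the script body would read:

#!/bin/bash
# launch the compaction MR job; the class name below is a made-up placeholder
hadoop jar mr-compact.jar enn.mr.CompactDriver conf.properties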
(2) /user/greatgas/oozie/everyday/MR/mr-compact.jar:
the MR program that merges the incremental data
(3) /user/greatgas/oozie/everyday/MR/conf.properties:
jobName=t_test
# baseDir: path of the Hive database on HDFS
baseDir=/user/hive/warehouse/origin_ennenergy_test.db/
# the Hive data paths on HDFS are located by table name
# tableName = original table, incremental table, staging table
tableName=s_t_test,incr_t_test,out_t_test
# index of the key field
keyIndex=0
# index of the timestamp field, used to decide which record is the newest
timeStampIndex=24
reduceNum=2
5. Click the area marked by the red box (in the original screenshot) to add properties.
6. Add the properties.
Note: you must add "HADOOP_USER_NAME=<your Hue user name>", otherwise the job has no permission on the HDFS directories!
7. Click the top-right corner shown in the step-6 screenshot to exit, then save.
8. Submit the job.
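Behind the scenes, Hue generates a workflow definition for this shell action. A minimal sketch, following the Oozie shell-action schema, with the paths from the steps above (the exact XML Hue emits may differ slightly, and the action name is arbitrary):

<workflow-app name="shell-mr-compact" xmlns="uri:oozie:workflow:0.4">
    <start to="shell-node"/>
    <action name="shell-node">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>bash</exec>
            <argument>run-mr-compact.sh</argument>
            <!-- without this, the action has no permission on the HDFS directories (step 6) -->
            <env-var>HADOOP_USER_NAME=${wf:user()}</env-var>
            <!-- the sh, jar and properties files added in step 4 -->
            <file>/user/greatgas/oozie/everyday/shell/run-mr-compact.sh#run-mr-compact.sh</file>
            <file>/user/greatgas/oozie/everyday/MR/mr-compact.jar#mr-compact.jar</file>
            <file>/user/greatgas/oozie/everyday/MR/conf.properties#conf.properties</file>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail"><message>Shell action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message></kill>
    <end name="end"/>
</workflow-app>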
II. Running a MapReduce job with Oozie (merges incremental data)
Two files need to be uploaded to HDFS (when running in cluster mode):
/user/e_test/workflow/mr/mr-compact.jar (path of the jar on HDFS)
/user/e_test/workflow/mr/conf.properties (path of the properties file on HDFS)
Then set the job properties:
mapred.output.dir = /user/hive/warehouse/origin_ennenergy_onecard.db/m (output path)
mapred.input.dir = /user/hive/warehouse/origin_ennenergy_onecard.db/t (input path)
delete /user/hive/warehouse/origin_ennenergy_onecard.db/m (the output path is deleted before the job runs)
After editing, just click Submit! The resulting action looks roughly like the sketch below.
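A minimal sketch of the corresponding map-reduce action. The prepare/delete step clears the output path first, since MapReduce fails if the output directory already exists; the mapper/reducer classes from mr-compact.jar are not named in the original, so they are only hinted at in a comment:

<action name="mr-node">
    <map-reduce>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <prepare>
            <delete path="${nameNode}/user/hive/warehouse/origin_ennenergy_onecard.db/m"/>
        </prepare>
        <configuration>
            <property><name>mapred.input.dir</name><value>/user/hive/warehouse/origin_ennenergy_onecard.db/t</value></property>
            <property><name>mapred.output.dir</name><value>/user/hive/warehouse/origin_ennenergy_onecard.db/m</value></property>
            <!-- the job's mapper/reducer would also be set here, e.g. via mapred.mapper.class / mapred.reducer.class -->
        </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
</action>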
III. Running a Spark job with Oozie (consumes data from a Kafka message queue in real time)
1. Create the Oozie project.
2. Add the parameters (they map onto the Oozie spark action as sketched after this list):
yarn-cluster (equivalent to setMaster("yarn-cluster"))
cluster (the deploy mode, client or cluster; cluster is what you normally use)
MySpark (the job name; any name works)
hdfs://master-28.dev.cluster.enn.cn:8020/user/e_liuy/testspark/spark-test.jar (HDFS path of the jar)
or
hdfs://nameservice/user/e_liuy/testspark/spark-test.jar (note: the HA nameservice name can replace host + port)
enn.action.ConsumerRealDataKafka (fully qualified name of the class to run)
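A minimal sketch of the corresponding spark action with the parameters above filled in (element names follow the Oozie spark-action schema; the action name is arbitrary):

<action name="spark-node">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <master>yarn-cluster</master>
        <mode>cluster</mode>
        <name>MySpark</name>
        <class>enn.action.ConsumerRealDataKafka</class>
        <jar>hdfs://nameservice/user/e_liuy/testspark/spark-test.jar</jar>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
</action>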
IV. Running a Hive job with Oozie (creates a table named M_BD_CD_CarInfo_H in the Hive database origin_ennenergy_onecard)
1. Create the Oozie Hive project.
2. Add the Hive script (.q) and the hive-site.xml file.
File contents:
(1) onecardtomodel.q:
use origin_ennenergy_onecard;
drop table if exists M_BD_CD_CarInfo_H;
create table M_BD_CD_CarInfo_H
(
FGUID STRING,
FStationNo STRING,
FCOMPANYID STRING,
FCARNO STRING,
FCARER STRING,
FTEL STRING,
time_stamp STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
(2) hive-site.xml
Contents:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property><name>hive.metastore.uris</name><value>thrift://yours_hosts:9083</value></property>
    <property><name>hive.metastore.client.socket.timeout</name><value>300</value></property>
    <!-- Hive's warehouse directory on HDFS -->
    <property><name>hive.metastore.warehouse.dir</name><value>/user/hive/warehouse</value></property>
    <property><name>hive.warehouse.subdir.inherit.perms</name><value>true</value></property>
    <property><name>hive.enable.spark.execution.engine</name><value>false</value></property>
    <property><name>hive.conf.restricted.list</name><value>hive.enable.spark.execution.engine</value></property>
    <property><name>hive.auto.convert.join</name><value>true</value></property>
    <property><name>hive.auto.convert.join.noconditionaltask.size</name><value>20971520</value></property>
    <property><name>hive.optimize.bucketmapjoin.sortedmerge</name><value>false</value></property>
    <property><name>hive.smbjoin.cache.rows</name><value>10000</value></property>
    <property><name>mapred.reduce.tasks</name><value>-1</value></property>
    <property><name>hive.exec.reducers.bytes.per.reducer</name><value>67108864</value></property>
    <property><name>hive.exec.copyfile.maxsize</name><value>33554432</value></property>
    <property><name>hive.vectorized.groupby.checkinterval</name><value>4096</value></property>
    <property><name>hive.vectorized.groupby.flush.percent</name><value>0.1</value></property>
    <property><name>hive.compute.query.using.stats</name><value>false</value></property>
    <property><name>hive.vectorized.execution.enabled</name><value>true</value></property>
    <property><name>hive.vectorized.execution.reduce.enabled</name><value>false</value></property>
    <property><name>hive.merge.mapfiles</name><value>true</value></property>
    <property><name>hive.merge.mapredfiles</name><value>false</value></property>
    <property><name>hive.cbo.enable</name><value>false</value></property>
    <property><name>hive.fetch.task.conversion</name><value>minimal</value></property>
    <property><name>hive.fetch.task.conversion.threshold</name><value>268435456</value></property>
    <property><name>hive.limit.pushdown.memory.usage</name><value>0.1</value></property>
    <property><name>hive.merge.sparkfiles</name><value>true</value></property>
    <property><name>hive.merge.smallfiles.avgsize</name><value>16777216</value></property>
    <property><name>hive.merge.size.per.task</name><value>268435456</value></property>
    <property><name>hive.optimize.reducededuplication</name><value>true</value></property>
    <property><name>hive.optimize.reducededuplication.min.reducer</name><value>4</value></property>
    <property><name>hive.map.aggr</name><value>true</value></property>
    <property><name>hive.map.aggr.hash.percentmemory</name><value>0.5</value></property>
    <property><name>hive.optimize.sort.dynamic.partition</name><value>false</value></property>
    <property><name>spark.executor.memory</name><value>268435456</value></property>
    <property><name>spark.driver.memory</name><value>268435456</value></property>
    <property><name>spark.executor.cores</name><value>1</value></property>
    <property><name>spark.yarn.driver.memoryOverhead</name><value>26</value></property>
    <property><name>spark.yarn.executor.memoryOverhead</name><value>26</value></property>
    <property><name>spark.dynamicAllocation.enabled</name><value>true</value></property>
    <property><name>spark.dynamicAllocation.initialExecutors</name><value>1</value></property>
    <property><name>spark.dynamicAllocation.minExecutors</name><value>1</value></property>
    <property><name>spark.dynamicAllocation.maxExecutors</name><value>2147483647</value></property>
    <property><name>hive.metastore.execute.setugi</name><value>true</value></property>
    <property><name>hive.support.concurrency</name><value>true</value></property>
    <property><name>hive.zookeeper.quorum</name><value>slave-29.dev.cluster.enn.cn,slave-30.dev.cluster.enn.cn,slave-31.dev.cluster.enn.cn</value></property>
    <property><name>hive.zookeeper.client.port</name><value>2181</value></property>
    <property><name>hive.zookeeper.namespace</name><value>hive_zookeeper_namespace_hive2</value></property>
    <!-- your list of ZooKeeper hosts (node1,node2,node3,...) -->
    <property><name>hbase.zookeeper.quorum</name><value>node1,node2,node3</value></property>
    <property><name>hbase.zookeeper.property.clientPort</name><value>2181</value></property>
    <property><name>hive.cluster.delegation.token.store.class</name><value>org.apache.hadoop.hive.thrift.MemoryTokenStore</value></property>
    <property><name>hive.server2.enable.doAs</name><value>true</value></property>
    <property><name>hive.metastore.sasl.enabled</name><value>true</value></property>
    <property><name>hive.server2.authentication</name><value>kerberos</value></property>
    <property><name>hive.metastore.kerberos.principal</name><value>hive/_HOST@ENN.CN</value></property>
    <property><name>hive.server2.authentication.kerberos.principal</name><value>hive/_HOST@ENN.CN</value></property>
    <property><name>spark.shuffle.service.enabled</name><value>true</value></property>
    <property><name>hive.cli.print.current.db</name><value>true</value></property>
    <property><name>hive.exec.reducers.max</name><value>32</value></property>
</configuration>
3. After saving the project, just submit the job.
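A minimal sketch of the hive action that ties the script and the config file together, assuming both files were added to the workflow (element names follow the Oozie hive-action schema):

<action name="hive-node">
    <hive xmlns="uri:oozie:hive-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- the hive-site.xml added in step 2 -->
        <job-xml>hive-site.xml</job-xml>
        <script>onecardtomodel.q</script>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
</action>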
V. Running a Sqoop job with Oozie (exports Hive data to MySQL)
1. Create the Oozie Sqoop project.
2. Enter the Sqoop command.
Command:
export --connect jdbc:mysql://10.37.149.183:3306/enn_application?characterEncoding=utf8 --username root --password secret_password --table total_info --export-dir /user/hive/warehouse/origin_ennenergy_onecard.db/total_info/* --input-fields-terminated-by "\t" --update-mode allowinsert --update-key fstationno,fusercardno
3. Save and run.
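A minimal sketch of the corresponding sqoop action. Note that in an Oozie sqoop action the <command> element holds the command without the leading "sqoop":

<action name="sqoop-node">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- if the quoted "\t" causes parsing trouble, each token can instead go into its own <arg> element -->
        <command>export --connect jdbc:mysql://10.37.149.183:3306/enn_application?characterEncoding=utf8 --username root --password secret_password --table total_info --export-dir /user/hive/warehouse/origin_ennenergy_onecard.db/total_info/* --input-fields-terminated-by "\t" --update-mode allowinsert --update-key fstationno,fusercardno</command>
    </sqoop>
    <ok to="end"/>
    <error to="fail"/>
</action>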
VI. Adding a Java program
1. Upload the jar to the corresponding Oozie directory path on Linux.
2. Then configure the action in Hue; a sketch of the resulting java action follows.
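A minimal sketch, with a made-up main class name since the original does not show one:

<action name="java-node">
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- placeholder: the fully qualified main class inside your jar -->
        <main-class>enn.action.YourMainClass</main-class>
    </java>
    <ok to="end"/>
    <error to="fail"/>
</action>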
All of the above has been tested and works; adjust it to your needs and use it directly!