Using the data-cleaning jar developed in the previous article, "Hadoop offline project: data cleaning," this post runs the cleaning job over the log files, moves the cleaned results into a Hive table's location, refreshes the partition metadata, and wraps the whole process in a shell script.
Path of the jar:
/home/hadoop/app/hadoop-2.6.0-cdh5.7.0/lib/g6-hadoop-1.0.jar
Main class of the jar:
com.ruozedata.hadoop.mapreduce.driver.LogETLDriver
HDFS path of the raw log files:
/g6/hadoop/accesslog/
HDFS output path for the cleaned logs:
/g6/hadoop/access/output/
Location of the Hive external table g6_access:
/g6/hadoop/access/clear/
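For reference, the external table behind that location can be sketched as below. This is a hedged reconstruction: the column names come from the query output at the end of the post, while the exact types, the tab delimiter, and the g6_hadoop database name are assumptions.

```shell
#!/bin/bash
# Hedged sketch of the g6_access external-table DDL. Column names are taken
# from the SELECT output shown later in this post; the types and the tab
# delimiter are assumptions. `time` is backquoted to avoid a keyword clash.
DDL=$(cat <<'SQL'
CREATE EXTERNAL TABLE IF NOT EXISTS g6_access (
  cdn     STRING,
  region  STRING,
  level   STRING,
  `time`  STRING,
  ip      STRING,
  domain  STRING,
  url     STRING,
  traffic BIGINT
)
PARTITIONED BY (day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/g6/hadoop/access/clear/';
SQL
)
echo "$DDL"
# On a live cluster this would be run as: hive -e "use g6_hadoop; $DDL"
```

Because the table is EXTERNAL and partitioned by day, the script below only has to drop cleaned files under day=yyyyMMdd directories and register each partition.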
The shell script:
#!/bin/bash
if [ $# -ne 1 ] ; then
    echo "USAGE: g6-train-hadoop.sh <dateString>"
    echo " e.g.: g6-train-hadoop.sh 20180717"
    exit 1
fi
process_date=$1

echo "--------------------step1: mapreduce etl--------------------"
hadoop jar /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/lib/g6-hadoop-1.0.jar com.ruozedata.hadoop.mapreduce.driver.LogETLDriver /g6/hadoop/accesslog/${process_date}.log /g6/hadoop/access/output/day=${process_date}

echo "--------------------step2: mv data to DW--------------------"
# remove any previous output for this day so the mv below cannot collide
hdfs dfs -rm -r /g6/hadoop/access/clear/day=${process_date}
hdfs dfs -mv /g6/hadoop/access/output/day=${process_date} /g6/hadoop/access/clear/

echo "--------------------step3: flush metadata--------------------"
hive -e "use g6_hadoop; alter table g6_access add if not exists partition(day='${process_date}');"
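As written, the script only checks the argument count, so a mistyped date such as 2018717 would still launch the job against a nonexistent log file. A stricter check could also validate the shape of the argument; this is a sketch, and the eight-digit yyyyMMdd format is an assumption based on the 20180717 example:

```shell
#!/bin/bash
# Sketch: stricter argument validation for g6-train-hadoop.sh.
# Assumes dates are passed as eight digits (yyyyMMdd), e.g. 20180717.
validate_date() {
  # succeed only if the argument is exactly eight digits
  [[ "$1" =~ ^[0-9]{8}$ ]]
}

if validate_date "20180717"; then
  echo "20180717 accepted"
fi
if ! validate_date "2018-07-17"; then
  echo "2018-07-17 rejected"
fi
```

Dropping this function into the script and calling `validate_date "$1" || exit 1` right after the argument-count check would catch such typos before any HDFS paths are touched.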
Then run the shell script:
[hadoop@10-9-140-90 shell]$ ./g6-train-hadoop.sh
USAGE: g6-train-hadoop.sh <dateString>
e.g.: g6-train-hadoop.sh 20180717
[hadoop@10-9-140-90 shell]$ ./g6-train-hadoop.sh 20180717
...... (output omitted here)
After it finishes successfully, check the result in Hive:
hive (g6_hadoop)> select * from g6_access where day=20180717 limit 1;
OK
cdn region level time ip domain url traffic day
baidu CN E 20180717042142 156.89.48.178 v2.go2yd.com http://v1.go2yd.com/user_upload/1531633977627104fdecdc68fe7a2c4b96b2226fd3f4c.mp4_bd.mp4 62109 20180717
Time taken: 0.42 seconds, Fetched: 1 row(s)
hive (g6_hadoop)>
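When backfilling many days at once, issuing one ALTER TABLE per partition gets tedious. Hive's MSCK REPAIR TABLE scans the table's HDFS location for day=... directories and registers any missing partitions in one shot; a sketch, assuming the day=yyyyMMdd directory layout produced by step 2 above:

```shell
#!/bin/bash
# Sketch: bulk partition discovery as an alternative to per-day ALTER TABLE.
# MSCK REPAIR TABLE walks the table location (/g6/hadoop/access/clear/) for
# day=... directories and adds any partitions missing from the metastore.
HQL="use g6_hadoop; msck repair table g6_access;"
echo "$HQL"
# On a live cluster this would be run as: hive -e "$HQL"
```

This only works because the directories follow Hive's partition-key=value naming convention (day=20180717), which the ETL script guarantees.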