I. Purpose
Log files must be uploaded to HDFS and Hive on schedule before the downstream ETL can run, so loading the log data at a fixed time every day is critical.
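As a sketch of the scheduling itself, a crontab entry can trigger the load once a day; the script path below is a placeholder, substitute whichever load script you end up using from the steps that follow:

```shell
# Hypothetical crontab entry (install with `crontab -e`):
# run the load script at 00:30 every day, appending output to a log file
30 0 * * * /bin/bash /opt/datas/hive_shell/load_to_hive_file.sh >> /tmp/load_to_hive.log 2>&1
```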
II. Implementation
1. Create the source table in Hive
create database load_hive;
create table load_hive.load_tb(
id string,
url string,
referer string,
keyword string,
type string,
guid string,
pageId string,
moduleId string,
linkId string,
attachedInfo string,
sessionId string,
trackerU string,
trackerType string,
ip string,
trackerSrc string,
cookie string,
orderCode string,
trackTime string,
endUserId string,
firstLink string,
sessionViewNo string,
productId string,
curMerchantId string,
provinceId string,
cityId string,
fee string,
edmActivity string,
edmEmail string,
edmJobId string,
ieVersion string,
platform string,
internalKeyword string,
resultSum string,
currentPage string,
linkPosition string,
buttonPosition string
)partitioned by(`date` string,hour string)
row format delimited fields terminated by "\t";
2. Loading via hive -e
(1) Create a script, load_to_hive.sql (despite the .sql extension this is a shell script, run with sh)
#!/bin/bash
# Yesterday's date, e.g. 20180526
YESTERDAY=`date -d '-1 days' +%Y%m%d`
ACCESS_LOG_DIR=/opt/access_logs/$YESTERDAY
#HIVE_HOME=/opt/modules/class22/apache-hive-1.2.1-bin
HIVE_HOME=/opt/modules/hive-1.2.1
for file in `ls $ACCESS_LOG_DIR`
do
    # File names look like 2018052601.log: chars 0-7 are the date, chars 8-9 the hour
    DAY=${file:0:8}
    HOUR=${file:8:2}
    echo "${DAY}${HOUR}"
    $HIVE_HOME/bin/hive -e "load data local inpath '$ACCESS_LOG_DIR/$file' into table load_hive.load_tb partition(date='${DAY}',hour='${HOUR}')"
done
$HIVE_HOME/bin/hive -e "show partitions load_hive.load_tb"
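The ${file:0:8} / ${file:8:2} slices in the loop are plain Bash substring expansion, ${var:offset:length}; they can be checked in isolation with one of the file names used later in this tutorial:

```shell
#!/bin/bash
# Bash substring expansion: ${var:offset:length}
file=2018052601.log
DAY=${file:0:8}    # first 8 characters -> 20180526
HOUR=${file:8:2}   # next 2 characters  -> 01
echo "$DAY $HOUR"  # prints: 20180526 01
```

Note that this is why the script's shebang needs bash: ${var:offset:length} is not guaranteed by POSIX sh.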
(2) Create the directory
/opt/access_logs/20180526
(3) Copy the log files over and rename them
[root@bigdata access_logs]# cd 20180526/
[root@bigdata 20180526]# cp /opt/datas/2015082818 ./
[root@bigdata 20180526]# cp /opt/datas/2015082819 ./
[root@bigdata 20180526]# mv 2015082818 2018052601.log
[root@bigdata 20180526]# mv 2015082819 2018052602.log
[root@bigdata 20180526]# cp /opt/datas/2015082819 ./
[root@bigdata 20180526]# mv 2015082819 2018052603.log
(4) Add the following to hive-site.xml to disable reserved-keyword checking (`date`, used as a partition column here, is a reserved keyword in Hive 1.2's SQL-2011 mode):
<property>
<name>hive.support.sql11.reserved.keywords</name>
<value>false</value>
</property>
(5) Run the script from the Hive directory:
sh -x load_to_hive.sql
(6) Restart hiveserver2 and beeline so the new setting takes effect
(7) Run the script again:
sh -x load_to_hive.sql
Result:
partition
date=20180526/hour=01.log
date=20180526/hour=02.log
date=20180526/hour=03.log
Time taken: 1.994 seconds, Fetched: 3 row(s)
(8) View the table's partitions:
show partitions load_hive.load_tb;
3. Loading via hive -f
Parameters can be passed to the script with --hiveconf.
(1) Log directory
The directory /opt/access_logs/20181110 contains three log files, renamed as follows:
mv 2017120901.log 2018052601.log
mv 2017120902.log 2018052602.log
mv 2017120903.log 2018052603.log
(2) Create a file: /opt/datas/hive_shell/load.sql
vi load.sql
Add the following line:
load data local inpath '${hiveconf:log_dir}/${hiveconf:file_path}' into table load_hive.load_tb partition (date='${hiveconf:DAY}',hour='${hiveconf:HOUR}');
(3) Write load_to_hive_file.sh
#! /bin/bash
# Yesterday's date
YESTERDAY=`date -d '-1 days' +%Y%m%d`
# Data directory
ACCESS_LOG_DIR=/opt/access_logs/$YESTERDAY
# HIVE_HOME
HIVE_HOME=/opt/modules/hive-1.2.1
# Iterate over the files in the directory; the date and hour parsed from
# each file name select the target partition
for FILE in `ls $ACCESS_LOG_DIR`
do
    DAY=${FILE:0:8}
    HOUR=${FILE:8:2}
    #echo "${DAY}${HOUR}"
    $HIVE_HOME/bin/hive --hiveconf log_dir=$ACCESS_LOG_DIR --hiveconf file_path=$FILE --hiveconf DAY=$DAY --hiveconf HOUR=$HOUR -f '/opt/datas/hive_shell/load.sql'
done
$HIVE_HOME/bin/hive -e "show partitions load_hive.load_tb"
(4) Truncate the table
truncate table load_hive.load_tb;
Truncating removes the data files but leaves the partition metadata in place, which is why the partitions are dropped separately in the next step.
(5) View the partitions and then drop them all
show partitions load_hive.load_tb;
+----------------------------+--+
| partition |
+----------------------------+--+
| date=20180526/hour=01 |
| date=20180526/hour=02 |
| date=20180526/hour=03 |
+----------------------------+--+
Drop each partition:
alter table load_hive.load_tb drop partition(date='20180526',hour='01');
alter table load_hive.load_tb drop partition(date='20180526',hour='02');
alter table load_hive.load_tb drop partition(date='20180526',hour='03');
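As an aside, Hive's ALTER TABLE ... DROP PARTITION also accepts a partial partition spec, so the three statements above could likely be collapsed into one; this is a sketch against the same table:

```sql
-- Drops every hour partition under date=20180526 in a single statement
alter table load_hive.load_tb drop partition(date='20180526');
```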
(6) Load the data
sh -x load_to_hive_file.sh
(7) Result (printed by the show partitions statement at the end of the script):
date=20180526/hour=01
date=20180526/hour=02
date=20180526/hour=03