Importing Log Data into the Hive ODS Layer


fqcbb

Tags: hive

Article link: https://blog.csdn.net/fqcbb/article/details/110843367


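I'm just getting started and haven't gone very deep yet, so there is no elaborate procedure here: load the data directly, then work on it with SQL afterwards.
The end result should be a fairly wide table.
First, copy the data onto the virtual machine (it contains the customer's device, timestamp, customer ID, the channel used to get online, and so on).
Then bring up the environment: run start-all.sh to start Hadoop. Once that has finished, start the metastore service with hive --service metastore, then start hiveserver2, and connect through its remote port.
Create a table. For this data, each row is a single field holding one JSON string.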

create table tb_log(
log String
)partitioned by(dt String);
load data local inpath "/home/event.log" into table tb_log partition(dt='202001007');
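-- Partition by the date derived from the data's timestamps (all timestamps in this data fall on the same day, so one static partition is enough for the demonstration)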
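
To confirm that the load landed where expected, a quick check along these lines can help. This is only a minimal sketch: it reuses the table name and partition value from the statements above, and the alias row_cnt is just for illustration.

```sql
-- List the partitions Hive has registered for the table
show partitions tb_log;

-- Count the raw rows in the partition that was just loaded
select count(1) as row_cnt
from tb_log
where dt = '202001007';
```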

select * from tb_log limit 10 ;
-- Take a look at the data; it should look roughly right

Raw JSON is a bit awkward to read. Based on the data format, parse it with json_tuple and give each field an alias, which is much nicer.

select
json_tuple(log,'account' ,'appId' ,'appVersion','carrier','deviceId','deviceType','eventId','ip','latitude','longitude','netType','osName','osVersion','properties','releaseChannel','resolution','sessionId' ,'timeStamp') 
as (account ,appId ,appVersion,carrier,deviceId,deviceType,eventId,ip,latitude,longitude,netType,osName,osVersion,properties,releaseChannel,resolution,sessionId ,`timeStamp`)
from
tb_log limit 10; 
-- The output is long and wide, so I won't paste it here
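
When json_tuple is used directly in the select list, no other expression can sit next to it, so the partition column dt is not available in the same query. If dt is wanted alongside the parsed fields, one common option is a lateral view. The following is a sketch, not the approach used in this post; the aliases t and j and the choice of three fields are only for illustration.

```sql
-- Parse a few JSON fields and keep the partition column in the same row
select
  t.dt,
  j.account,
  j.deviceId,
  j.eventId
from tb_log t
lateral view json_tuple(t.log, 'account', 'deviceId', 'eventId') j
  as account, deviceId, eventId
limit 10;
```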

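Next, clean the data: drop the rows where account and deviceId are both empty, keeping any row where at least one of them has a value.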

create table tb_ods_log as
select 
if(account = '', deviceId, account) as guid, -- if account is empty, fall back to deviceId
* 
from
(select
json_tuple(log,'account' ,'appId' ,'appVersion','carrier','deviceId','deviceType','eventId','ip','latitude','longitude','netType','osName','osVersion','properties','releaseChannel','resolution','sessionId' ,'timeStamp') 
as (account ,appId ,appVersion,carrier,deviceId,deviceType,eventId,ip,latitude,longitude,netType,osName,osVersion,properties,releaseChannel,resolution,sessionId ,`timeStamp`)
from
tb_log) t
where account != '' or deviceId != ''; -- rows where both are empty are of little use, so be decisive and drop them
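
One thing to watch: json_tuple returns NULL for a key that is missing from the JSON, and comparing NULL with '' is neither true nor false, so if(account = '', ...) would pass a NULL account straight through as the guid. If some log lines might lack the account key entirely (an assumption; the data here may always carry both keys), a NULL-aware variant of the guid logic could look like this sketch. Only three fields are parsed to keep it short.

```sql
select
  -- prefer account; fall back to deviceId when account is NULL or empty
  if(account is null or account = '', deviceId, account) as guid,
  t.*
from (
  select
    json_tuple(log, 'account', 'deviceId', 'eventId')
      as (account, deviceId, eventId)
  from tb_log
) t
-- keep a row only if at least one identifier is present and non-empty
-- (NULL checks are written out explicitly for readability)
where (account is not null and account != '')
   or (deviceId is not null and deviceId != '');
```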

Now count the records per guid and take a look.

select guid , count(1) from tb_ods_log group by guid;
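
A couple of further sanity checks on the new table, as a sketch: the number of distinct guid values, and the number of rows whose guid came out NULL or empty. The second count should be zero if the filter behaved as intended and every record carries both keys. The aliases guid_cnt and bad_guid_cnt are only for illustration.

```sql
-- How many distinct users/devices ended up in the ODS table
select count(distinct guid) as guid_cnt
from tb_ods_log;

-- Rows that still lack a usable identifier
select count(1) as bad_guid_cnt
from tb_ods_log
where guid is null or guid = '';
```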

This only gets the data loaded. When you create tables later for your own needs, take care not to carry along too much useless information.
