在实际开发当中json 很常见的类型,hive处理json字符串成为目标的宽表。
从json ===》 hive table ===》sql
一,解析json
先介绍下用到的hive 内置函数 json_tuple
执行:desc function [extended] json_tuple;
显示:
json_tuple(jsonStr, p1, p2, ..., pn) like get_json_object, but it takes multiple names and return a tuple. All the input parameters and output column types are string
说白了就是传入json的那一列,然后p*就是要查询的列名
源数据如下:
{"movie":"2791","rate":"4","time":"978302188","userid":"1"}
{"movie":"2687","rate":"3","time":"978824268","userid":"1"}
{"movie":"2018","rate":"4","time":"978301777","userid":"1"}
{"movie":"3105","rate":"5","time":"978301713","userid":"1"}
{"movie":"2797","rate":"4","time":"978302039","userid":"1"}
①建hive表,以及导入json数据
create table test_json(json string);
load data local inpath '/home/hadoop/data/jsontest.json' into test_json
②
select json_tuple(json,'movie','rate','time','userid') as (movie_id,rate,time,user_id) from test_json limit 10
解析ok,在对time字段进行进一步解析,现在是时间戳,需要解析出年月日,时分秒等。
③,用到的函数
int : year(string date)
string : from_unixtime(bigint unixtime[, string format])
select movie_id,rate,time,user_id,
year(from_unixtime(cast(time as bigint)))as year,
month(from_unixtime(cast(time as bigint)))as month,
day(from_unixtime(cast(time as bigint)))as day,
hour(from_unixtime(cast(time as bigint)))as hour,
minute(from_unixtime(cast(time as bigint)))as minute
from
(
select json_tuple(json,'movie','rate','time','userid') as (movie_id,rate,time,user_id)
from test_json
)tmp limit 10;
经过验证数据解析无误
④,将结果写入宽表
create table rating_width as
select movie_id,rate,time,user_id,
year(from_unixtime(cast(time as bigint)))as year,
month(from_unixtime(cast(time as bigint)))as month,
day(from_unixtime(cast(time as bigint)))as day,
hour(from_unixtime(cast(time as bigint)))as hour,
minute(from_unixtime(cast(time as bigint)))as minute
from
(
select json_tuple(json,'movie','rate','time','userid') as (movie_id,rate,time,user_id)
from rating_json
)tmp;
⑤,之后所有的操作都是基于宽表进行SQL操作了。
二,解析URL
①,用到的hive内置函数 parse_url_tuple 解析url
执行:desc function [extended] parse_url_tuple;
显示示例:
SELECT parse_url_tuple(a.fullurl, 'HOST', 'PATH', 'QUERY', 'REF', 'PROTOCOL', 'FILE', 'AUTHORITY', 'USERINFO', 'QUERY:k1') as (ho, pa, qu, re, pr, fi, au, us, qk1) from src a
②,将结果写入宽表即可,这里不再做示例。