Hive json_tuple与parse_url_tuple 函数

1.json_tuple函数

json_tnple函数用来处理json数据
使用方法:
json_tuple(jsonStr, p1, p2, …, pn) - like get_json_object, but it takes multiple names and return a tuple. All the input parameters and output column types are string.

创建一张表用来存储json数据
hive (ruozedata_d7)> create table  IF NOT EXISTS rating_json(json string);
OK
Time taken: 0.023 seconds
hive (ruozedata_d7)> load data local inpath '/home/hadoop/rating.json' overwrite into table rating_json;
hive (ruozedata_d7)> select * from rating_json limit 10;
OK
rating_json.json
{"movie":"1193","rate":"5","time":"978300760","userid":"1"}
{"movie":"661","rate":"3","time":"978302109","userid":"1"}
{"movie":"914","rate":"3","time":"978301968","userid":"1"}
{"movie":"3408","rate":"4","time":"978300275","userid":"1"}
{"movie":"2355","rate":"5","time":"978824291","userid":"1"}
{"movie":"1197","rate":"3","time":"978302268","userid":"1"}
{"movie":"1287","rate":"5","time":"978302039","userid":"1"}
{"movie":"2804","rate":"5","time":"978300719","userid":"1"}
{"movie":"594","rate":"4","time":"978302268","userid":"1"}
{"movie":"919","rate":"4","time":"978301368","userid":"1"}
Time taken: 0.056 seconds, Fetched: 10 row(s)
可以看到通过json_tuple函数,将json分割成了一个个字段
hive (ruozedata_d7)> select 
                   > json_tuple(json,'movie','rate','time','userid') as (movie_id,rate,time,user_id) 
                   > from rating_json limit 10;
OK
movie_id    rate    time    user_id
1193    5   978300760   1
661    3   978302109   1
914    3   978301968   1
3408    4   978300275   1
2355    5   978824291   1
1197    3   978302268   1
1287    5   978302039   1
2804    5   978300719   1
594    4   978302268   1
919    4   978301368   1
Time taken: 0.676 seconds, Fetched: 10 row(s)
然后我们通过json_tnple创建一张大宽表,你后续需要的所有的字段我全部给你准备完毕
hive (ruozedata_d7)> create table rating_width as
                   > select
                   > movie_id,rate,time,user_id,
                   > year(from_unixtime(cast(time as bigint))) as year,
                   > month(from_unixtime(cast(time as bigint))) as month,
                   > day(from_unixtime(cast(time as bigint))) as day,
                   > hour(from_unixtime(cast(time as bigint))) as hour,
                   > minute(from_unixtime(cast(time as bigint))) as minute,
                   > from_unixtime(cast(time as bigint)) as ts
                   > from
                   > (
                   > select 
                   > json_tuple(json,'movie','rate','time','userid') as (movie_id,rate,time,user_id) 
                   > from rating_json 
                   > ) tmp
                   > ;
可以看到我们创建好之后的字段
hive (ruozedata_d7)> select * from rating_width limit 10;
OK
rating_width.movie_id    rating_width.rate   rating_width.time   rating_width.user_id    rating_width.year   rating_width.month  rating_width.day    rating_width.hourrating_width.minute    rating_width.ts
1193    5   978300760   1   2001    1   1   6   12  2001-01-01 06:12:40
661    3   978302109   1   2001    1   1   6   35  2001-01-01 06:35:09
914    3   978301968   1   2001    1   1   6   32  2001-01-01 06:32:48
3408    4   978300275   1   2001    1   1   6   4   2001-01-01 06:04:35
2355    5   978824291   1   2001    1   7   7   38  2001-01-07 07:38:11
1197    3   978302268   1   2001    1   1   6   37  2001-01-01 06:37:48
1287    5   978302039   1   2001    1   1   6   33  2001-01-01 06:33:59
2804    5   978300719   1   2001    1   1   6   11  2001-01-01 06:11:59
594    4   978302268   1   2001    1   1   6   37  2001-01-01 06:37:48
919    4   978301368   1   2001    1   1   6   22  2001-01-01 06:22:48
Time taken: 0.068 seconds, Fetched: 10 row(s)

2.parse_url_tuple函数

使用方法:
hive (ruozedata_d7)> desc function extended parse_url_tuple;
OK
tab_name
parse_url_tuple(url, partname1, partname2, …, partnameN) - extracts N (N>=1) parts from a URL.
It takes a URL and one or multiple partnames, and returns a tuple. All the input parameters and output column types are string.
Partname: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO, QUERY:
Note: Partnames are case-sensitive, and should not contain unnecessary white spaces.
Example:

SELECT b.* FROM src LATERAL VIEW parse_url_tuple(fullurl, 'HOST', 'PATH', 'QUERY', 'QUERY:id') b as host, path, query, query_id LIMIT 1;
SELECT parse_url_tuple(a.fullurl, 'HOST', 'PATH', 'QUERY', 'REF', 'PROTOCOL', 'FILE', 'AUTHORITY', 'USERINFO', 'QUERY:k1') as (ho, pa, qu, re, pr, fi, au, us, qk1) from src a;
Time taken: 0.006 seconds, Fetched: 7 row(s)

hive (default)> select parse_url_tuple("http://www.ruozedata.com/d7/xxx.html?cookieid=1234567&a=b&c=d","HOST","PATH","QUERY","QUERY:cookieid") as (host,path,query ,cookieid) from dual;
OK
host    path    query   cookieid
www.ruozedata.com    /d7/xxx.html    cookieid=1234567&a=b&c=d    1234567

可以看出,将对应的URL解析成我们想要的字段

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值