Parsing JSON with Hive functions
Movie top-N
Upload rating.json to /home on node01.
Load the data:
hive> load data local inpath '/home/rating.json' into table t_rate_json;
Create a table to hold the parsed fields:
hive> create table t_rate(movie string,rate int,ts string,uid string) row format delimited fields terminated by '\001';
Use the get_json_object function to parse the JSON.
Test:
hive> select get_json_object(line,"$.movie") from t_rate_json limit 2;
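As a mental model: get_json_object takes a JSON string plus a JSONPath expression and returns the matching value as a string, or NULL if the path is missing or the JSON is malformed. A rough Python sketch of that behavior for the flat "$.key" paths used here (illustrative only, not Hive's actual implementation):

```python
import json

def get_json_object(line, path):
    # Supports only the flat "$.key" paths used in this section,
    # not full JSONPath.
    assert path.startswith("$.")
    try:
        return json.loads(line).get(path[2:])
    except (ValueError, AttributeError):
        return None  # Hive returns NULL on malformed JSON

line = '{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}'
print(get_json_object(line, "$.movie"))  # prints 1193
```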
hive> insert into table t_rate
select get_json_object(line,'$.movie'),
get_json_object(line,'$.rate'),
get_json_object(line,'$.timeStamp'),
get_json_object(line,'$.uid')
from t_rate_json;
hive> select * from t_rate limit 10;
Find all rating records with a rating greater than 3:
hive> select * from t_rate where rate > 3;
Compute each user's total rating score:
hive> select uid,sum(rate) from t_rate group by uid;
Compute each user's total rating score, sorted in descending order:
hive> select uid,sum(rate) rate_sum from t_rate group by uid order by rate_sum desc;
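The group-by/sum/order-by pipeline above can be sketched in plain Python (the sample rows are hypothetical, not from rating.json):

```python
from collections import defaultdict

# Hypothetical sample rows of t_rate: (uid, rate)
rows = [("1", 5), ("1", 3), ("2", 4), ("2", 5), ("2", 1)]

totals = defaultdict(int)
for uid, rate in rows:      # GROUP BY uid
    totals[uid] += rate     # SUM(rate)

# ORDER BY rate_sum DESC
for uid, rate_sum in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(uid, rate_sum)
```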
Another way to parse JSON: json_tuple.
Test:
hive> select
json_tuple(line,'movie','rate','timeStamp','uid')
as(movie,rate,ts,uid)
from t_rate_json
limit 10;
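Unlike four separate get_json_object calls, json_tuple parses each JSON string once and returns several top-level fields together. An illustrative Python sketch of that contract:

```python
import json

def json_tuple(line, *keys):
    # Parse once, emit one value per requested top-level key
    # (None for absent keys or malformed JSON, like Hive's NULL).
    try:
        obj = json.loads(line)
    except ValueError:
        return tuple(None for _ in keys)
    return tuple(obj.get(k) for k in keys)

line = '{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}'
print(json_tuple(line, "movie", "rate", "timeStamp", "uid"))
```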
hive> create table t_rate_a
as
select uid,movie,rate,
year(from_unixtime(cast(ts as bigint))) as year,
month(from_unixtime(cast(ts as bigint))) as month,
day(from_unixtime(cast(ts as bigint))) as day,
hour(from_unixtime(cast(ts as bigint))) as hour,
minute(from_unixtime(cast(ts as bigint))) as minute,
from_unixtime(cast(ts as bigint)) as ts
from
(select
json_tuple(line,'movie','rate','timeStamp','uid') as(movie,rate,ts,uid)
from t_rate_json) tmp;
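The cast-then-from_unixtime breakdown corresponds to parsing the epoch-seconds string into a datetime and reading its parts. A Python sketch (note: Hive's from_unixtime uses the session time zone; UTC is assumed here):

```python
from datetime import datetime, timezone

ts = "978300760"  # Unix timestamp stored as a string, as in t_rate

# Roughly: from_unixtime(cast(ts as bigint)), then year()/month()/... on it
dt = datetime.fromtimestamp(int(ts), tz=timezone.utc)
print(dt.year, dt.month, dt.day, dt.hour, dt.minute)
```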
Per-group top-N
hive> select *,row_number() over(partition by uid order by rate desc) as rank from t_rate;
hive> select uid,movie,rate,ts
from
(select uid,movie,rate,ts,row_number() over(partition by uid order by rate desc) as rank from t_rate) tmp
where rank<=3;
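row_number() over(partition by uid order by rate desc) numbers each user's rows starting from the highest rating, so keeping rank<=3 yields each user's top 3. The same logic in Python (sample rows are hypothetical):

```python
from collections import defaultdict

# Hypothetical rows of t_rate: (uid, movie, rate)
rows = [("1", "m1", 5), ("1", "m2", 4), ("1", "m3", 3), ("1", "m4", 2),
        ("2", "m1", 5), ("2", "m5", 1)]

by_uid = defaultdict(list)
for uid, movie, rate in rows:        # PARTITION BY uid
    by_uid[uid].append((movie, rate))

top3 = {}
for uid, movies in by_uid.items():
    # ORDER BY rate DESC within the partition, then keep row_number() <= 3
    top3[uid] = sorted(movies, key=lambda mr: -mr[1])[:3]
print(top3)
```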
URL parsing
For example, given the URL http://www.baidu.com/find?cookieid=4234234234
we want to parse it into: www.baidu.com /find cookieid 4234234234
Test:
hive> select parse_url_tuple("http://www.baidu.com/find?cookieid=4234234234",'HOST','PATH','QUERY','QUERY:cookieid')
from dual;
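parse_url_tuple extracts several parts of a URL in a single call. The same four pieces can be obtained with Python's standard-library urllib.parse (shown for comparison, not part of Hive):

```python
from urllib.parse import urlsplit, parse_qs

url = "http://www.baidu.com/find?cookieid=4234234234"
parts = urlsplit(url)
host = parts.netloc                        # HOST
path = parts.path                          # PATH
query = parts.query                        # QUERY
cookieid = parse_qs(query)["cookieid"][0]  # QUERY:cookieid
print(host, path, query, cookieid)
```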
explode and lateral view
hive> create table t_xuanxiu(uid string,name string,kc array<string>)
row format delimited
fields terminated by ','
collection items terminated by ':';
Load the data:
hive> load data local inpath "/home/sutdent.txt" into table t_xuanxiu;
hive> select uid,name,kc[0] from t_xuanxiu;
Desired output:
1,zhangsan,数学
1,zhangsan,语文
1,zhangsan,英语
1,zhangsan,生物
2,lisi,数学
2,lisi,语文
…
Test:
hive> select explode(kc) from t_xuanxiu where uid=1;
explode turns an array into one row per element.
Problem: the following query fails, because a UDTF such as explode cannot be selected alongside ordinary columns:
hive> select uid,name,explode(kc) from t_xuanxiu where uid=1;
lateral view: for use with table-generating functions
In practice you often need to explode one field and still return it together with the other columns. In the example above, each uid corresponds to the elements of its own array, so how do we keep them associated? As just shown, select uid,name,explode(kc) does not work; this is where lateral view comes in.
A lateral view exists to be used together with a UDTF: it splits one row into multiple rows. Without lateral view, a UDTF can only explode a single field on its own and its output cannot be joined back to the source table. With lateral view, the exploded values stay associated with the original row's other columns.
When using lateral view, you must give the generated view an alias and name its output column(s).
hive> select uid,name,tmp.course from t_xuanxiu
lateral view explode(kc) tmp as course;
Explanation: lateral view treats explode(kc) as a table named tmp with a single column aliased as course.
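The lateral-view semantics can be mimicked in Python: cross each original row with the elements of its own array (the sample rows are hypothetical, mirroring the desired output above):

```python
# Hypothetical rows of t_xuanxiu: (uid, name, kc)
rows = [("1", "zhangsan", ["数学", "语文", "英语", "生物"]),
        ("2", "lisi", ["数学", "语文"])]

# lateral view explode(kc) tmp as course:
# pair each row's scalar columns with every element of its own array
result = [(uid, name, course) for uid, name, kc in rows for course in kc]
for r in result:
    print(r)
```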
Exercise: how would you do wordcount in Hive?
[root@node01 home]# vi words.txt
a b c d e f g
a b c
e f g a
b c d b
hive> select explode(split(line,' ')) from t_juzi;
Treat the exploded result as a table:
hive> select tmp.* from t_juzi lateral view explode(split(line,' ')) tmp as word;
hive> select a.word,count(1) sum
from
(select tmp.* from t_juzi lateral view explode(split(line,' ')) tmp as word) a
group by a.word
order by sum desc;
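The whole wordcount pipeline (split, explode, group, count, sort descending) in a few lines of Python for comparison; the four sample lines mirror words.txt above:

```python
from collections import Counter

lines = ["a b c d e f g", "a b c", "e f g a", "b c d b"]

# explode(split(line,' ')): one word per output row
words = [w for line in lines for w in line.split(" ")]

# group by word, count(1), order by count desc
for word, n in Counter(words).most_common():
    print(word, n)
```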