一、数据准备
1.1.user_action.txt
11,101,2018-12-01 06:01:12
22,102,2018-12-03 06:01:13
33,103,2018-12-04 06:01:14
11,104,2018-12-06 06:01:15
22,103,2018-12-07 06:01:16
33,102,2018-12-12 06:01:17
11,101,2018-12-15 06:01:18
35,105,2018-12-17 06:01:19
22,104,2018-12-16 06:01:20
77,103,2018-12-13 06:01:21
99,102,2018-12-13 06:01:22
33,101,2018-12-01 06:01:23
11,101,2018-12-11 06:01:24
35,102,2018-12-16 06:01:25
22,103,2018-12-16 06:01:26
77,104,2018-12-14 06:01:27
99,105,2018-12-13 06:01:28
1.2.key_words.txt
101,http://abc.net/,kw8|kw1
102,http://abc.net/,kw6|kw3
103,http://abc.net/,kw7
104,http://abc.net/,kw5|kw1|kw4|kw9
105,http://abc.net/,
把上面的txt文件上传到HDFS里
二、Hive中建表
2.1.user_actions #用户行为表,共3个字段,分别是:用户ID,阅读文章ID,阅读日期
create external table user_actions(
> user_id string,
> article_id string,
> ts string)
> row format delimited fields terminated by ','
> location '/tmp/zxm/articles/user_action/';
2.2.articles #文章标签表,共3个字段,分别是文章ID,URL,标签
create external table articles(
> article_id string,
> url string,
> kws array<string> )
> row format delimited fields terminated by ','
> collection items terminated by '|'
> location '/tmp/zxm/articles/article_kws/';
2.3.查询articles表
hive> select * from articles;
101 http://abc.net/ ["kw8","kw1"]
102 http://abc.net/ ["kw6","kw3"]
103 http://abc.net/ ["kw7"]
104 http://abc.net/ ["kw5","kw1","kw4","kw9"]
105 http://abc.net/ []
三、Hive SQL
3.1把每个用户看了哪些文章,做个汇总
select user_id,collect_set(article_id) from user_actions group by user_id
结果
OK
11 ["101","104"]
22 ["102","103","104"]
33 ["103","102","101"]
3.2 对article_id进行从小到到大排序
select user_id,sort_array(collect_ list(article_id)) as contents from user_actions group by user_id;
结果
OK
11 ["101","101","101","104"]
22 ["102","103","103","104"]
33 ["101","102","103"]
35 ["102","105"]
77 ["103","104"]
99 ["102","105"]
3.3 Lateral view explode(kws) 把数据拆成一条条记录,会拆成一个视图,只有一列
select article_id,kw from articles lateral view explode(kws) to as kw;
结果
101 kw8
101 kw1
102 kw6
102 kw3
103 kw7
104 kw5
104 kw1
104 kw4
104 kw9
3.4 把两个表做Join,列出用户看过所有文章与其对应的标签
select a.user_id,b.kw from user_actions as a left outer join(select article_id,kw from articles lateral view explode(kws) to as kw) b on (a.article_id=b.article_id) order by a.user_id;
结果:
11 kw1
11 kw8
11 kw5
11 kw1
11 kw4
11 kw1
11 kw9
11 kw8
11 kw1
11 kw8
22 kw1
22 kw7
22 kw9
22 kw4
22 kw5
22 kw7
22 kw3
22 kw6
33 kw8
33 kw1
33 kw3
33 kw6
33 kw7
35 NULL
35 kw6
35 kw3
77 kw9
3.5 根据用户和kw进行分组,统计出用户对应的各标签数量
select a.user_id,b.kw ,count(1) as weight
from user_actions as a
left outer join(
select article_id,kw
from articles
lateral view explode(kws) to as kw
) b
on (a.article_id=b.article_id)
group by a.user_id,b.kw
order by a.user_id,weight desc;
结果
OK
11 kw1 4
11 kw8 3
11 kw5 1
11 kw9 1
11 kw4 1
22 kw7 2
22 kw9 1
22 kw1 1
22 kw3 1
22 kw4 1
待完结,看续篇..............