实战:Hive在内容推荐系统中的应用(一)

一、数据准备

1.1.user_action.txt

11,101,2018-12-01 06:01:12
22,102,2018-12-03 06:01:13
33,103,2018-12-04 06:01:14
11,104,2018-12-06 06:01:15
22,103,2018-12-07 06:01:16
33,102,2018-12-12 06:01:17
11,101,2018-12-15 06:01:18
35,105,2018-12-17 06:01:19
22,104,2018-12-16 06:01:20
77,103,2018-12-13 06:01:21
99,102,2018-12-13 06:01:22
33,101,2018-12-01 06:01:23
11,101,2018-12-11 06:01:24
35,102,2018-12-16 06:01:25
22,103,2018-12-16 06:01:26
77,104,2018-12-14 06:01:27
99,105,2018-12-13 06:01:28

1.2.key_words.txt

101,http://abc.net/,kw8|kw1
102,http://abc.net/,kw6|kw3
103,http://abc.net/,kw7
104,http://abc.net/,kw5|kw1|kw4|kw9
105,http://abc.net/,

把上面的txt文件上传到HDFS里

二、Hive中建表

2.1.user_actions #用户行为表,共3个字段,分别是:用户ID,阅读文章ID,阅读日期

create external table user_actions(

    > user_id  string,

    > article_id string,

    > ts string)

    > row format delimited fields terminated by ','

   > location '/tmp/zxm/articles/user_action/';

2.2.articles  #文章标签表,共3个字段,分别是文章ID,URL,标签

create external table articles(

    > article_id string,

    > url string,

    > kws array<string> ) 

    > row format delimited fields terminated by ','

    > collection items terminated by '|' 

  > location '/tmp/zxm/articles/article_kws/';

2.3.查询articles表

hive> select * from articles;

101     http://abc.net/ ["kw8","kw1"]

102     http://abc.net/ ["kw6","kw3"]

103     http://abc.net/ ["kw7"]

104     http://abc.net/ ["kw5","kw1","kw4","kw9"]

105     http://abc.net/ []
三、Hive SQL

3.1把每个用户看了哪些文章,做个汇总

select  user_id,collect_set(article_id)  from user_actions group by user_id
结果
OK
11      ["101","104"]
22      ["102","103","104"]
33      ["103","102","101"]

3.2 对article_id进行从小到到大排序

select user_id,sort_array(collect_ list(article_id)) as contents from user_actions group by user_id;

结果

OK

11      ["101","101","101","104"]

22      ["102","103","103","104"]

33      ["101","102","103"]

35      ["102","105"]

77      ["103","104"]

99      ["102","105"]

3.3 Lateral view explode(kws) 把数据拆成一条条记录,会拆成一个视图,只有一列

 select article_id,kw from articles lateral view explode(kws) to as kw;

结果

101     kw8

101     kw1

102     kw6

102     kw3

103     kw7

104     kw5

104     kw1

104     kw4

104     kw9

3.4 把两个表做Join,列出用户看过所有文章与其对应的标签

select a.user_id,b.kw from user_actions as a left outer join(select article_id,kw from articles lateral view explode(kws) to as kw) b on (a.article_id=b.article_id) order by a.user_id;

结果:

11      kw1

11      kw8

11      kw5

11      kw1

11      kw4

11      kw1

11      kw9

11      kw8

11      kw1

11      kw8

22      kw1

22      kw7

22      kw9

22      kw4

22      kw5

22      kw7

22      kw3

22      kw6

33      kw8

33      kw1

33      kw3

33      kw6

33      kw7

35      NULL

35      kw6

35      kw3

77      kw9

3.5 根据用户和kw进行分组,统计出用户对应的各标签数量

select a.user_id,b.kw ,count(1) as weight

      from user_actions as a

     left outer join(

     select article_id,kw

     from articles

      lateral view explode(kws) to as kw

     ) b

     on (a.article_id=b.article_id)

     group by a.user_id,b.kw

     order by a.user_id,weight desc;

结果

OK

11      kw1     4

11      kw8     3

11      kw5     1

11      kw9     1

11      kw4     1

22      kw7     2

22      kw9     1

22      kw1     1

22      kw3     1

22      kw4     1

待完结,看续篇..............

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值