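Hive Practical Examples

Hive in Practice

Case Study 1: Data ETL

- Run ETL on the base web clickstream log table (following the warehouse model design)

- Compute the top 10 referer domains for each time dimension

Existing table "t_orgin_weblog":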

+------------------+------------+----------+
|     col_name     | data_type  | comment  |
+------------------+------------+----------+
| valid            | string     |          |
| remote_addr      | string     |          |
| remote_user      | string     |          |
| time_local       | string     |          |
| request          | string     |          |
| status           | string     |          |
| body_bytes_sent  | string     |          |
| http_referer     | string     |          |
| http_user_agent  | string     |          |
+------------------+------------+----------+

 

Sample rows (the http_user_agent value is cut off in this listing):

| true|1.162.203.134| - | 18/Sep/2013:13:47:35| /images/my.jpg                        | 200| 19939 | "http://www.angularjs.cn/A0d9"                      | "Mozilla/5.0 (Windows   |
| true|1.202.186.37 | - | 18/Sep/2013:15:39:11| /wp-content/uploads/2013/08/windjs.png| 200| 34613 | "http://cnodejs.org/topic/521a30d4bee8d3cb1272ac0f" | "Mozilla/5.0 (Macintosh;|

 

1. Extract and transform the raw data

-- Split the referer URL into host, path, query, and the "id" query parameter

drop table if exists t_etl_referurl;

create table t_etl_referurl as
select a.*, b.*
from t_orgin_weblog a
lateral view parse_url_tuple(regexp_replace(http_referer, "\"", ""), 'HOST', 'PATH', 'QUERY', 'QUERY:id') b as host, path, query, query_id;
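To see what parse_url_tuple returns for each key, the following minimal sketch runs it on a single literal URL. The URL's "?id=7" query string, the column name "url", and the subquery alias "t" are made up for illustration and do not come from the log data; the FROM-less inner select assumes Hive 0.13 or later.

```sql
-- Hypothetical one-row input: a sample referer with an invented "?id=7" query string
select b.host, b.path, b.query, b.query_id
from (select 'http://cnodejs.org/topic/521a30d4bee8d3cb1272ac0f?id=7' as url) t
lateral view parse_url_tuple(t.url, 'HOST', 'PATH', 'QUERY', 'QUERY:id') b
  as host, path, query, query_id;
```

'QUERY:id' asks for the value of the "id" parameter only; for referers that carry no such parameter the query_id column should come back as NULL.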

 

 

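2. Split the date and time fields out of the previous result to build the ETL detail table "t_etl_detail"

drop table if exists t_etl_detail;

-- time_local looks like 18/Sep/2013:13:47:35; substring positions are 1-based
create table t_etl_detail as
select b.*,
substring(time_local, 1, 11) as daystr,  -- e.g. 18/Sep/2013
substring(time_local, 13)    as tmstr,   -- e.g. 13:47:35
substring(time_local, 4, 3)  as month,   -- e.g. Sep
substring(time_local, 1, 2)  as day,     -- e.g. 18
substring(time_local, 13, 2) as hour     -- e.g. 13
from t_etl_referurl b;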
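Fixed-position substrings only work while every time_local value has exactly the same layout. As an alternative sketch, the timestamp can be parsed with unix_timestamp and re-formatted with from_unixtime. The column aliases iso_day and hh are illustrative, and parsing the month abbreviation ("Sep") is assumed to work under an English JVM locale.

```sql
-- Parse 18/Sep/2013:13:47:35 and re-format it; unparseable values yield NULL
select time_local,
       from_unixtime(unix_timestamp(time_local, 'dd/MMM/yyyy:HH:mm:ss'), 'yyyy-MM-dd') as iso_day,
       from_unixtime(unix_timestamp(time_local, 'dd/MMM/yyyy:HH:mm:ss'), 'HH')         as hh
from t_etl_referurl
limit 10;
```

An ISO-style day string such as 2013-09-18 also sorts correctly as plain text, which the dd/MMM/yyyy form does not.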

 

 

3. Partition the ETL data (the partitioned table holds the structured fields of every record)

drop table if exists t_etl_detail_prt;

create table t_etl_detail_prt(
valid            string,
remote_addr      string,
remote_user      string,
time_local       string,
request          string,
status           string,
body_bytes_sent  string,
http_referer     string,
http_user_agent  string,
host             string,
path             string,
query            string,
query_id         string,
daystr           string,
tmstr            string,
month            string,
day              string,
hour             string)
partitioned by (mm string, dd string);

 

Load the data, one static partition per day:

insert into table t_etl_detail_prt partition(mm='Sep',dd='18')

select * from t_etl_detail where daystr='18/Sep/2013';

 

insert into table t_etl_detail_prt partition(mm='Sep',dd='19')

select * from t_etl_detail where daystr='19/Sep/2013';
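Writing one insert per day does not scale past a handful of days. A possible alternative, sketched here under the assumption that dynamic partitioning is permitted on the cluster, lets Hive derive the mm and dd partition values from the last two columns of the select list.

```sql
-- Enable dynamic partitioning for this session
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- The trailing two select columns feed the partition columns (mm, dd), in that order
insert into table t_etl_detail_prt partition(mm, dd)
select d.*, d.month as mm, d.day as dd
from t_etl_detail d;
```

Run this instead of the two static inserts, not in addition to them; otherwise the same rows are loaded twice.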

 

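Count the visits per referer_host for each time dimension and sort the result

-- t_etl_detail_prt stores the referer host as "host" and the hour as "hour"; alias them to referer_host and hh
create table t_refer_host_visit_top_tmp as
select host as referer_host, count(*) as counts, mm, dd, hour as hh
from t_etl_detail_prt
group by mm, dd, hour, host
order by hh asc, dd asc, mm asc, counts desc;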

 


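4. Top-N referer hosts by visit count for each time dimension

Take the top N referer_host values by visit count within each time window (N = 3 here; use od <= 10 for the top 10)

-- Rank by counts descending inside each hour/day window, then keep the first N rows
select * from (
select referer_host, counts, concat(hh, dd) as hh_dd,
row_number() over (partition by concat(hh, dd) order by counts desc) as od
from t_refer_host_visit_top_tmp
) t where od <= 3;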
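row_number() always returns exactly N rows per window, so hosts tied with the N-th count are cut off arbitrarily. If ties should be kept, rank() is one option. The sketch below also partitions by mm, dd, and hh separately rather than by a concatenated key; the alias rk is illustrative.

```sql
-- rank() gives tied hosts the same rank, so a window may return more than 3 rows
select * from (
  select referer_host, counts, mm, dd, hh,
         rank() over (partition by mm, dd, hh order by counts desc) as rk
  from t_refer_host_visit_top_tmp
) t
where rk <= 3;
```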

 

 

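Case Study 2: Visit Duration Statistics

Compute the average daily visitor stay time from the web logs

1. Telling a user's individual visits apart within a large volume of requests takes fairly involved logic that is hard to express directly in Hive, so an MR program is written to compute the per-visitor access information (see the code for details)

Run the MR program to produce the result: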

[hadoop@hdp-node-01 ~]$ hadoop jar weblog.jar cn.itcast.bigdata.hive.mr.UserStayTime /weblog/input /weblog/stayout


2. Import the MR output into Hive

drop table if exists t_display_access_info_tmp;

create table t_display_access_info_tmp(remote_addr string, firt_req_time string, last_req_time string, stay_long bigint)
row format delimited fields terminated by '\t';

-- Load from the output directory that the MR job above wrote to
load data inpath '/weblog/stayout' into table t_display_access_info_tmp;
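Note that "load data inpath" moves the files out of the HDFS source directory into the table's warehouse directory. If the MR output should stay where it is, an external table pointed at that directory is one alternative. This is a sketch: the table name t_display_access_info_ext is hypothetical, and the location must match whatever output directory the MR job actually used.

```sql
-- External table: Hive reads the files in place and leaves them there on drop
create external table t_display_access_info_ext(
  remote_addr   string,
  firt_req_time string,
  last_req_time string,
  stay_long     bigint)
row format delimited fields terminated by '\t'
location '/weblog/stayout';
```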

 

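3. Build the visitor access information table "t_display_access_info"

Some visits consist of a single request, and the MR program reports a stay time of 0 for them, so these single-request visits are given a default stay time of 30000 (30 seconds, assuming stay_long is in milliseconds)

drop table if exists t_display_access_info;

create table t_display_access_info as
select remote_addr, firt_req_time, last_req_time,
case stay_long
when 0 then 30000
else stay_long
end as stay_long
from t_display_access_info_tmp;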

 

 

4. Compute the average stay time across all users

select avg(stay_long) from t_display_access_info;
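The query above yields one overall average, while the stated goal is a per-day figure. A per-day breakdown could look like the sketch below. It assumes firt_req_time starts with a 10-character day such as yyyy-MM-dd; the article does not show the MR output format, so the substr arguments may need adjusting. The aliases visit_day and avg_stay_long are illustrative.

```sql
-- Average stay time per day, keyed on the day part of the first request time
select substr(firt_req_time, 1, 10) as visit_day,
       avg(stay_long)               as avg_stay_long
from t_display_access_info
group by substr(firt_req_time, 1, 10);
```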

Case Study 3: Cumulative Sum


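Given the following visitor access count table t_access_times:

+----------+-------------+---------+
| Visitor  | Month       | Visits  |
+----------+-------------+---------+
| A        | 2015-01-02  | 5       |
| A        | 2015-01-03  | 15      |
| B        | 2015-01-01  | 5       |
| A        | 2015-01-04  | 8       |
| B        | 2015-01-05  | 25      |
| A        | 2015-01-06  | 5       |
| A        | 2015-02-02  | 4       |
| A        | 2015-02-06  | 6       |
| B        | 2015-02-06  | 10      |
| B        | 2015-02-07  | 5       |
| ……       | ……          | ……      |
+----------+-------------+---------+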

 

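The required output report is t_access_times_accumulate:

+----------+----------+----------------+--------------------+
| Visitor  | Month    | Monthly total  | Cumulative total   |
+----------+----------+----------------+--------------------+
| A        | 2015-01  | 33             | 33                 |
| A        | 2015-02  | 10             | 43                 |
| …….      | …….      | …….            | …….                |
| B        | 2015-01  | 30             | 30                 |
| B        | 2015-02  | 15             | 45                 |
| …….      | …….      | …….            | …….                |
+----------+----------+----------------+--------------------+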

 

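This can be done with a single HQL statement:

-- The query assumes the columns are named username (visitor), month (the yyyy-MM-dd date), and salary (visit count)
-- substr(month, 1, 7) reduces each date to its yyyy-MM month before aggregating
select A.username, A.month, max(A.salary) as salary, sum(B.salary) as accumulate
from
(select username, substr(month, 1, 7) as month, sum(salary) as salary from t_access_times group by username, substr(month, 1, 7)) A
inner join
(select username, substr(month, 1, 7) as month, sum(salary) as salary from t_access_times group by username, substr(month, 1, 7)) B
on
A.username = B.username
where B.month <= A.month
group by A.username, A.month
order by A.username, A.month;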
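The self-join pairs every month with all earlier months of the same user, which grows quadratically with the number of months. On Hive versions with windowing support (0.11 and later), a running sum is a common alternative. The sketch below uses the same assumed column names (username, month, salary) and the same substr(month, 1, 7) month key.

```sql
-- Monthly totals first, then a running sum per user ordered by month
select username, month, salary,
       sum(salary) over (partition by username
                         order by month
                         rows between unbounded preceding and current row) as accumulate
from (
  select username, substr(month, 1, 7) as month, sum(salary) as salary
  from t_access_times
  group by username, substr(month, 1, 7)
) t
order by username, month;
```

For the sample data this should produce the same monthly and cumulative totals as the report, with a single aggregation of the base table.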
