每一种统计指标都可以跟各维度表进行叉乘,从而得出各个维度的统计结果 ,为了在前端展示时速度更快,每一个指标都事先算出各维度结果存入mysql。
提前准备好维表数据,在hive仓库中创建相应维表,如:
时间维表:
create table v_time(year string,month string,day string,hour string)
row format delimited
fields terminated by ',';
load data local inpath '/home/hadoop/v_time.txt' into table v_time;
v_time.txt
2013,09,18,00
2013,09,18,01
2013,09,18,02
2013,09,18,03
2013,09,18,04
2013,09,18,05
2013,09,18,06
。。。
在实际生产中,究竟需要哪些统计指标通常由相关数据需求部门人员提出,而且会不断有新的统计需求产生,以下为网站流量分析中的一些典型指标示例。
1、PV统计
1.1、多维度统计PV总量
1.1.1、时间维度
--计算指定的某个小时pvs
select count(*),month,day,hour from dw_click.ods_weblog_detail group by month,day,hour;
--计算该处理批次(一天)中的各小时pvs
drop table dw_pvs_hour;
create table dw_pvs_hour(month string,day string,hour string,pvs bigint) partitioned by(datestr string);
insert into table dw_pvs_hour partition(datestr='2016-03-18')
select a.month as month,a.day as day,a.hour as hour,count(1) as pvs from ods_weblog_detail a
where a.datestr='2016-03-18' group by a.month,a.day,a.hour;
或者用时间维表关联
维度:日
drop table dw_pvs_day;
create table dw_pvs_day(pvs bigint,month string,day string);
insert into table dw_pvs_day
select count(1) as pvs,a.month as month,a.day as day from dim_time a
join ods_weblog_detail b
on b.dd='18/Sep/2013' and a.month=b.month and a.day=b.day
group by a.month,a.day;
--或者,从之前算好的小时结果中统计
insert into table dw_pvs_day
select sum(pvs) as pvs,month,day from dw_pvs_hour group by month,day having day='18';
维度:月
drop table t_display_pv_month;
create table t_display_pv_month (pvs bigint,month string);
insert into table t_display_pv_month
select count(*) as pvs,a.month from t_dim_time a
join t_ods_detail_prt b on a.month=b.month group by a.month;
1.1.2、按终端维度统计pv总量
注:探索数据中的终端类型
select distinct(http_user_agent) from ods_weblog_detail where http_user_agent like '%Mozilla%' limit 200;
终端维度:uc
drop table t_display_pv_terminal_uc;
create table t_display_pv_ terminal_uc (pvs bigint,mm string,dd string,hh string);
终端维度:chrome
drop table t_display_pv_terminal_chrome;
create table t_display_pv_ terminal_ chrome (pvs bigint,mm string,dd string,hh string);
终端维度:safari
drop table t_display_pv_terminal_safari;
create table t_display_pv_ terminal_ safari (pvs bigint,mm string,dd string,hh string);
栏目维度:job
栏目维度:news
栏目维度:bargin
栏目维度:lane
1.2、人均浏览页数
需求描述:比如,今日所有来访者,平均请求的页面数。
总页面请求数/去重总人数
drop table dw_avgpv_user_d;
create table dw_avgpv_user_d(
day string,
avgpv string);
insert into table dw_avgpv_user_d
select '2013-09-18',sum(b.pvs)/count(b.remote_addr) from
(select remote_addr,count(1) as pvs from ods_weblog_detail where datestr='2013-09-18' group by remote_addr) b;
1.3、按referer维度统计pv总量
需求:按照来源及时间维度统计PVS,并按照PV大小倒序排序。
按照小时粒度统计,查询结果存入:( "dw_pvs_referer_h" )
drop table dw_pvs_referer_h;
create table dw_pvs_referer_h(referer_url string,referer_host string,month string,day string,hour string,pv_referer_cnt bigint) partitioned by(datestr string);
insert into table dw_pvs_referer_h partition(datestr='2016-03-18')
select http_referer,ref_host,month,day,hour,count(1) as pv_referer_cnt
from ods_weblog_detail
group by http_referer,ref_host,month,day,hour
having ref_host is not null
order by hour asc,day asc,month asc,pv_referer_cnt desc;
按小时粒度统计各来访域名的产生的pv数并排序
drop table dw_ref_host_visit_cnts_h;
create table dw_ref_host_visit_cnts_h(ref_host string,month string,day string,hour string,ref_host_cnts bigint) partitioned by(datestr string);
insert into table dw_ref_host_visit_cnts_h partition(datestr='2016-03-18')
select ref_host,month,day,hour,count(1) as ref_host_cnts
from ods_weblog_detail
group by ref_host,month,day,hour
having ref_host is not null
order by hour asc,day asc,month asc,ref_host_cnts desc;
注:还可以按来源地域维度、访客终端维度等计算。
1.4、统计pv总量最大的来源TOPN (取分组TOP)
需求描述:按照时间维度,比如,统计一天内各小时产生最多pvs的来源top N。
需要用到row_number函数
以下语句对每个小时内的来访host次数倒序排序标号:
select ref_host,ref_host_cnts,concat(month,hour,day),
row_number() over (partition by concat(month,hour,day) order by ref_host_cnts desc) as od
from dw_ref_host_visit_cnts_h;
效果如下:
2、受访分析
统计每日最热门的页面top10
drop table dw_pvs_d;
create table dw_pvs_d(day string,url string,pvs string);
insert into table dw_pvs_d
select '2013-09-18',a.request,a.request_counts from
(select request as request,count(request) as request_counts from ods_weblog_detail where datestr='2013-09-18' group by request having request is not null) a
order by a.request_counts desc limit 10;
结果如下:
注:还可继续得出各维度交叉结果
3、访客分析
3.1 独立访客
需求描述:按照时间维度比如小时来统计独立访客及其产生的pvCnts
对于独立访客的识别,如果在原始日志中有用户标识,则根据用户标识即很好实现;
此处,由于原始日志中并没有用户标识,以访客IP来模拟,技术上是一样的,只是精确度相对较低。
时间维度:时
drop table dw_user_dstc_ip_h;
create table dw_user_dstc_ip_h(
remote_addr string,
pvs bigint,
hour string);
insert into table dw_user_dstc_ip_h
select remote_addr,count(1) as pvs,concat(month,day,hour) as hour
from ods_weblog_detail
Where datestr='2013-09-18'
group by concat(month,day,hour),remote_addr;
在此结果表之上,可以进一步统计出,每小时独立访客总数,每小时请求次数topn访客等
如每小时独立访客总数:
select count(1) as dstc_ip_cnts,hour from dw_user_dstc_ip_h group by hour;
统计每小时请求次数topn的独立访客
时间维度:月
select remote_addr,count(1) as counts,month
from ods_weblog_detail
group by month,remote_addr;
时间维度:日
select remote_addr,count(1) as counts,concat(month,day) as day
from ods_weblog_detail
Where dd='18/Sep/2013'
group by concat(month,day),remote_addr;
注:还可以按来源地域维度、访客终端维度等计算
3.2、每日新访客
需求描述:将每天的新访客统计出来
实现思路:创建一个去重访客累积表,然后将每日访客对比累积表。
时间维度:日
--历日去重访客累积表
drop table dw_user_dsct_history;
create table dw_user_dsct_history(
day string,
ip string
)
partitioned by(datestr string);
--每日新用户追加到累计表
drop table dw_user_dsct_history;
create table dw_user_dsct_history(
day string,
ip string
)
partitioned by(datestr string);
--每日新用户追加到累计表
insert into table dw_user_dsct_history partition(datestr='2013-09-19')
select tmp.day as day,tmp.today_addr as new_ip from
(
select today.day as day,today.remote_addr as today_addr,old.ip as old_addr
from
(select distinct remote_addr as remote_addr,"2013-09-19" as day from ods_weblog_detail where datestr="2013-09-19") today
left outer join
dw_user_dsct_history old
on today.remote_addr=old.ip
) tmp
where tmp.old_addr is null;
验证:
select count(distinct remote_addr) from ods_weblog_detail;
-- 1005
select count(1) from dw_user_dsct_history where prtflag_day='18/Sep/2013';
--845
select count(1) from dw_user_dsct_history where prtflag_day='19/Sep/2013';
--160
时间维度:月
类似日粒度算法。
注:还可以按来源地域维度、访客终端维度等计算。
4、Visit分析(点击流模型)
4.1、回头/单次访客统计
需求描述:查询今日所有回头访客及其访问次数。
实现思路:上表中出现次数>1的访客,即回头访客;反之,则为单次访客
drop table dw_user_returning;
create table dw_user_returning(
day string,
remote_addr string,
acc_cnt string)
partitioned by (datestr string);
insert overwrite table dw_user_returning partition(datestr='2013-09-18')
select tmp.day,tmp.remote_addr,tmp.acc_cnt
from
(select '2013-09-18' as day,remote_addr,count(session) as acc_cnt from click_stream_visit group by remote_addr) tmp
where tmp.acc_cnt>1;
4.2、人均访问频次
需求:统计出每天所有用户访问网站的平均次数(visit)。
总visit数/去重总用户数
select sum(pagevisits)/count(distinct remote_addr) from click_stream_visit partition(datestr='2013-09-18');
6、关键路径转化率分析——漏斗模型
转化:在一条指定的业务流程中,各个步骤的完成人数及相对上一个步骤的百分比。
6.1 需求分析
6.2 模型设计
定义好业务流程中的页面标识,下例中的步骤为:
Step1、 /item%
Step2、 /category
Step3、 /index
Step4、 /order
6.3 开发实现
分步骤开发:
6.3.1、查询每一个步骤的总访问人数
create table dw_oute_numbs as
select 'step1' as step,count(distinct remote_addr) as numbs from ods_click_pageviews where request like '/item%'
union
select 'step2' as step,count(distinct remote_addr) as numbs from ods_click_pageviews where request like '/category%'
union
select 'step3' as step,count(distinct remote_addr) as numbs from ods_click_pageviews where request like '/order%'
union
select 'step4' as step,count(distinct remote_addr) as numbs from ods_click_pageviews where request like '/index%';
6.3.2、查询每一步骤相对于路径起点人数的比例
思路:利用join
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs from route_num rn
inner join
route_num rr
select tmp.rnstep,tmp.rnnumbs/tmp.rrnumbs as ratio
from
(
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs from dw_oute_numbs rn
inner join
dw_oute_numbs rr) tmp
where tmp.rrstep='step1';
6.3.3、查询每一步骤相对于上一步骤的漏出率
select tmp.rrstep as rrstep,tmp.rrnumbs/tmp.rnnumbs as ration
from
(
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs from route_num rn
inner join
route_num rr) tmp
where cast(substr(tmp.rnstep,5,1) as int)=cast(substr(tmp.rrstep,5,1) as int)-1
|
6.3.4、汇总以上两种指标
select abs.step,abs.numbs,abs.ratio as abs_ratio,rel.ratio as rel_ratio
from
(
select tmp.rnstep as step,tmp.rnnumbs as numbs,tmp.rnnumbs/tmp.rrnumbs as ratio
from
(
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs from route_num rn
inner join
route_num rr) tmp
where tmp.rrstep='step1'
) abs
left outer join
(
select tmp.rrstep as step,tmp.rrnumbs/tmp.rnnumbs as ratio
from
(
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs from route_num rn
inner join
route_num rr) tmp
where cast(substr(tmp.rnstep,5,1) as int)=cast(substr(tmp.rrstep,5,1) as int)-1
) rel
on abs.step=rel.step
|