一、项目需求
根据给出数据样例统计以下指标
1.统计各商品的浏览数量(PV
)
2.统计各商品的访客数(UV
)
3.统计商品分类的浏览数(PV
)
4.统计商品分类的访客数(UV
)
5.统计商品、分类页面的会话数
6.统计各个外部来源渠道的访客数(UV
)
二、数据介绍(access.log文件)
access.log
为某东网站一天的访问数据,通过制表符“\t”
分隔每个字段的数据
其中数据的含义如下:
第1列 时间:用户访问网站的时间,即当前日志的时间
第2列 IP
地址:IP
地址,即用户端的IP
地址
第3列 访问方法:只有GET/POST
第4列 URL
:用户当前访问网站的URL
第5列 HTTP
协议
第6列 用户ID
:唯一的标示了一个用户的身份
第7列 Referer URL
:用户上一个访问的URL地址
第8列 状态码:即访问状态,如200
表示访问成功
三、确定数据结构
启动hadoop
集群,分别执行 start-hdfs.sh
和 start-yarn.sh
命令
等待hadoop
集群启动后进入hive
,执行导入如下命令
1、根据已知日志数据建立日志表
create table log_data(
localtime string,
ip string,
method string,
uuid string,
url string,
http_protorol string,
referer_url string,
status_code string
)
PARTITIONED BY (dt int)
row format delimited fields terminated by '\t';
2、根据项目需求建立结果表
建立fact_goods表
create table fact_goods(
goods_id string,
pv int,
uv int,
session_num int
)
PARTITIONED BY (dt int);
建立fact_category表
create table fact_category(
category_id string,
pv int,
uv int,
session_num int
)
PARTITIONED BY (dt int);
建立fact_source表
create table fact_source(
source_name string,
UV int
)
PARTITIONED BY (dt int);
3、将用户行为日志数据导入日志表
1)将access.log日志上传到虚拟机的/usr/local/qst/logs
2)执行如下代码将日志导入日志表
load data local inpath '/usr/local/qst/logs/access.log' overwrite into table log_data partition(dt=20200604);
四、业务实现
1、数据分析
通过需求的1-4项我们可以知道主要统计商品和分类的指标,通过日志我们能看出来只有两种类型的URL
如下:
https://item.jd.com/数字.html
https://list.jd.com/list.html?cat=数字,数字,数字
通过在浏览其中打开URL,我们可以发现如下规律
https://item.jd.com/数字.html
类型的URL打开的都是商品页面
https://list.jd.com/list.html?cat=数字,数字,数字
类型的URL
打开的都是商品的分类页面
由此推断第一种类型中的数字为“商品ID
”,第二种类型的URL中的数字组合为“分类ID
”
2、商品的统计指标实现
1)计算商品指标
insert overwrite table fact_goods partition(dt=20200603)
select
regexp_replace(regexp_replace(parse_url(url , 'PATH'),'.html',''),'/','') as goods_id,
count(1) as pv,
count(distinct uuid) as uv,
0 as session_num
from log_data
where dt = 20200603
and status_code == 200
and length(uuid) >= 10
and url like 'https://item.jd.com/%'
group by regexp_replace(regexp_replace(parse_url(url , 'PATH'),'.html',''),'/','');
2)计算分类指标
insert overwrite table fact_category partition(dt=20200603)
select
parse_url(url,'QUERY','cat') as category_id,
count(1) as pv,
count(distinct uuid) as uv,
0 as session_num
from log_data
where dt = 20200603
and status_code == 200
and length(uuid) >= 10
and url like 'https://list.jd.com/list.html?cat=%'
group by parse_url(url,'QUERY','cat') ;
3)代码优化
经过优化后的写法,因为分别计算商品和分类需要执行两次读取操作(log_data
),所以这里用 from... insert...
方式只读取一次数据
from(
select
url,
uuid,
regexp_replace(regexp_replace(parse_url(url , 'PATH'),'.html',''),'/','') as goods_id,
parse_url(url,'QUERY','cat') as category_id,
sum(unix_timestamp(localtime)) over(partition by uuid order by unix_timestamp(localtime) rows between current row and 1 following) - unix_timestamp(localtime) * 2 as stay_time
from log_data
where dt = 20200603
and status_code == 200
and length(uuid) >= 10
) tempTable
insert overwrite table fact_goods partition(dt=20200603)
select
goods_id,
count(1) as pv,
count(distinct uuid) as uv,
sum(if(stay_time >= 1800 , 1 , 0 )) as session_num
where url like 'https://item.jd.com/%'
group by goods_id
insert overwrite table fact_category partition(dt=20200603)
select
category_id,
count(1) as pv,
count(distinct uuid) as uv,
sum(if(stay_time >= 1800 , 1 , 0 )) as session_num
where url like 'https://list.jd.com/list.html?cat=%'
group by category_id;
3.统计各个外部来源渠道的访客数(UV)
这里通过分析日志表中的“referer_url
”确定外部来源,首先判断“referer_url”
不为空并且为站外域名,其次取得站外来源的域名
insert overwrite table fact_source partition(dt=20200603)
select
parse_url('https://www.baidu.com/s?wd={query}','HOST') as source,
count(distinct uuid) as uv
where referer_url is not null and referer_url <> '-' and UPPER(referer_url) <> 'NULL' and length(referer_url) > 8
and referer_url not like '%jd.com%'
and status_code == 200
and length(uuid) >= 10
group by parse_url('https://www.baidu.com/s?wd={query}','HOST');
优化后我们结合上面的 from... insert...
语句
from(
select
url,
referer_url,
uuid,
regexp_replace(regexp_replace(parse_url(url , 'PATH'),'.html',''),'/','') as goods_id,
parse_url(url,'QUERY','cat') as category_id,
parse_url(url,'HOST') as source,
sum(unix_timestamp(localtime)) over(partition by uuid order by unix_timestamp(localtime) rows between current row and 1 following) - unix_timestamp(localtime) * 2 as stay_time
from log_data
where dt = 20200604
and status_code == 200
and length(uuid) >= 10
) tempTable
insert overwrite table fact_goods partition(dt=20200604)
select
goods_id,
count(1) as pv,
count(distinct uuid) as uv,
sum(if(stay_time >= 1800 , 1 , 0 )) as session_num
where url like 'https://item.jd.com/%'
group by goods_id
insert overwrite table fact_category partition(dt=20200604)
select
category_id,
count(1) as pv,
count(distinct uuid) as uv,
sum(if(stay_time >= 1800 , 1 , 0 )) as session_num
where url like 'https://list.jd.com/list.html?cat=%'
group by category_id
insert overwrite table fact_source partition(dt=20200604)
select
source,
count(distinct uuid) as uv
where referer_url is not null and referer_url <> '-' and referer_url <> 'null' and length(referer_url) > 8
and referer_url not like '%jd.com%'
group by source;
五、编写自动运行脚本
1、在/usr/local/qst
目录下创建项目目录data_helper
,并在data_helper
下创建如下目录
bin | 存放运行脚本及代码 |
---|---|
conf | 存放配置文件 |
data | 数据 |
lib | 类库 |
log | 日志文件 |
tmp | 临时文件 |
2、编写控制脚本main.sh
(bin
文件夹)
3、编写Hive
脚本user_action.sql
(文件夹)
4、设置自动运行任务(crontab
)
0 3 * * * /usr/local/qst/data_helper/bin/main.sh
参考资料:
hive函数
https://www.cnblogs.com/end/archive/2012/06/18/2553682.html
hive from... insert... 语句
http://blog.chinaunix.net/uid-30041424-id-5766715.html
Oracle开窗函数(Hive参照Oracle的开窗函数实现了相同功能)
https://blog.csdn.net/qq_41708308/article/details/89374701
https://blog.csdn.net/qq_33879355/article/details/78447430
UTM参数
https://blog.csdn.net/yiguanfangzhou/article/details/94559120
linux crontab用法
https://blog.csdn.net/yeguxin/article/details/79543514
linux crontab测试
https://tool.lu/crontab
#!/bin/bash和#!/bin/sh是什么意思以及区别
https://blog.csdn.net/u010486679/article/details/78534841
/bin/bash^M: 坏的解释器: 没有那个文件或目录
https://blog.csdn.net/ouyang_peng/article/details/86488451
sed -i 's/\r$//' /usr/local/qst/data_helper/bin/main.sh
生活坏到一定程度就会好起来,因为他无法更坏,努力过后,才知道许多事情,坚持坚持,就来了。 ——宫崎骏