ETL + Data Visualization + Job Scheduling

Loading Data into the Warehouse

The table DDL partitions by date (datestr) to improve query efficiency.
Because the tables are partitioned, data can't simply be pushed in with hdfs put; use Hive's LOAD DATA ... PARTITION to load it.
When inspecting table contents, remember to add a limit, e.g. select * from ... limit 1000, otherwise a very large table can exhaust memory.
The HDFS web UI listens on port 50070.
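
For example, using the statements that appear later in this section:

-- load a file into one date partition (LOAD moves the HDFS file and registers the partition)
load data inpath '/weblog/preprocessed/' overwrite into table ods_weblog_origin partition(datestr='20130918');
-- peek at the data with a limit instead of dumping the whole table
select * from ods_weblog_origin limit 1000;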

The essence of ETL is to extract data from the various sources, transform it, and finally load it into the dimensionally modeled tables of the data warehouse. ETL is complete only once these dimension/fact tables are populated.
This project's analysis runs on a Hadoop cluster using the Hive data warehouse tool, so the collected and preprocessed data must be loaded into Hive before any downstream analysis.

1. Create the ODS-Layer Tables

1.1. Raw log table
drop table if exists ods_weblog_origin;
create table ods_weblog_origin(
valid string,
remote_addr string,
remote_user string,
time_local string,
request string,
status string,
body_bytes_sent string,
http_referer string,
http_user_agent string)
partitioned by (datestr string)
row format delimited
fields terminated by '\001';
1.2. Clickstream pageviews model table
drop table if exists ods_click_pageviews;
create table ods_click_pageviews(
session string,
remote_addr string,
remote_user string,
time_local string,
request string,
visit_step string,
page_staylong string,
http_referer string,
http_user_agent string,
body_bytes_sent string,
status string)
partitioned by (datestr string)
row format delimited
fields terminated by '\001';
1.3. Clickstream visit model table
drop table if exists ods_click_stream_visit;
create table ods_click_stream_visit(
session string,
remote_addr string,
inTime string,
outTime string,
inPage string,
outPage string,
referal string,
pageVisits int)
partitioned by (datestr string)
row format delimited
fields terminated by '\001';

We also need to create a time dimension table; later analysis uses a star schema joined on the time dimension.

Dimension table example:
drop table if exists t_dim_time;
create table t_dim_time(date_key int,year string,month string,day string,hour string) row format delimited fields terminated by ',';

The dimension table data is generated by a script; in this case:

1,2013,09,17,22
2,2013,09,17,23
3,2013,09,18,00
4,2013,09,18,01
5,2013,09,18,02
6,2013,09,18,03
7,2013,09,18,04
8,2013,09,18,05
9,2013,09,18,06
10,2013,09,18,07
11,2013,09,18,08
12,2013,09,18,09
13,2013,09,18,10
14,2013,09,18,11
15,2013,09,18,12
16,2013,09,18,13
17,2013,09,18,14
18,2013,09,18,15
19,2013,09,18,16
20,2013,09,18,17
21,2013,09,18,18
22,2013,09,18,19
23,2013,09,18,20
24,2013,09,18,21
25,2013,09,18,22
26,2013,09,18,23
27,2013,09,19,00
28,2013,09,19,01
29,2013,09,19,02
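
The generator script itself isn't shown here; as a sketch, equivalent rows could be produced in pure HiveQL (assumes Hive 0.13+ for posexplode in a subquery; the start hour and 72-row count are chosen to match the sample above):

-- pos runs 0..71: one row per hour for three days starting 2013-09-17 22:00
insert into table t_dim_time
select pos + 1 as date_key,
       from_unixtime(unix_timestamp('2013-09-17 22','yyyy-MM-dd HH') + pos*3600, 'yyyy') as year,
       from_unixtime(unix_timestamp('2013-09-17 22','yyyy-MM-dd HH') + pos*3600, 'MM')   as month,
       from_unixtime(unix_timestamp('2013-09-17 22','yyyy-MM-dd HH') + pos*3600, 'dd')   as day,
       from_unixtime(unix_timestamp('2013-09-17 22','yyyy-MM-dd HH') + pos*3600, 'HH')   as hour
from (select posexplode(split(space(71), ' ')) as (pos, x)) t;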
2. Import the Data

Load the cleaned (preprocessed) data into the source-layer table ods_weblog_origin:

load data inpath '/weblog/preprocessed/' overwrite into table ods_weblog_origin partition(datestr='20130918');
show partitions ods_weblog_origin;
select count(*) from ods_weblog_origin;
---------------------------------------------------------------------------
Load the clickstream pageviews model data into ods_click_pageviews:

load data inpath '/weblog/clickstream/pageviews' overwrite into table ods_click_pageviews partition(datestr='20130918');
select count(*) from ods_click_pageviews;
-----------------------------------------------------------------------------
Load the clickstream visit model data into ods_click_stream_visit:

load data inpath '/weblog/clickstream/visits' overwrite into table ods_click_stream_visit partition(datestr='20130918');

----------------------------------------------------------------------------------------------------------------------
Import the time dimension table data
(reference data: 《dim_time_dat.txt》)
load data inpath '/weblog/dim_time' overwrite into table t_dim_time;

Note: in production, the LOAD commands should be put in a script and scheduled in azkaban to run periodically; mind the run time, which must be after the preprocessing step has finished.
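
A minimal sketch of such a script (the file name and the hivevar parameter are assumptions); azkaban would run something like `hive -f load_ods.sql --hivevar day=20130918`:

-- load_ods.sql: load the day's preprocessed outputs into their partitions
load data inpath '/weblog/preprocessed/' overwrite into table ods_weblog_origin partition(datestr='${hivevar:day}');
load data inpath '/weblog/clickstream/pageviews' overwrite into table ods_click_pageviews partition(datestr='${hivevar:day}');
load data inpath '/weblog/clickstream/visits' overwrite into table ods_click_stream_visit partition(datestr='${hivevar:day}');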

3. Build the ODS-Layer Detail (Wide) Table

Narrow table: several attributes are fused into one field (e.g. year, month, day, hour, minute, second fused into time_local), which is awkward for group-by analysis, so the fused fields need to be extracted into new fields:
time_local -> month, day, hour
A standard HTTP request:
http://host:port/resource?query=queryV&query1=queryV2
from which we extract host, resource, query, queryV.
Wide (detail) table: built from the narrow table by splitting the fused fields into multiple columns, which makes analysis easier.
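
For instance, Hive's substring (1-based; an index of 0 is treated as 1) can slice the fused time_local string apart; the literal below assumes the data's 'yyyy-MM-dd HH:mm:ss' layout:

select substring('2013-09-18 06:49:18', 0, 10);  -- '2013-09-18' (the date part)
select substring('2013-09-18 06:49:18', 12);     -- '06:49:18' (the time part)
select substring('2013-09-18 06:49:18', 6, 2);   -- '09' (the month)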

The whole analysis proceeds layer by layer, following the data warehouse's structure: from the ODS raw data we first derive intermediate tables (for example, extracting structure from unstructured fields such as time and url and refining the fields into a detail table for later convenience), then compute the various metrics on top of those intermediate tables.

3.2. ETL Implementation

 Create the detail table ods_weblog_detail:
drop table ods_weblog_detail;
create table ods_weblog_detail(
valid string, --valid flag
remote_addr string, --source IP
remote_user string, --user identifier
time_local string, --full access time
daystr string, --access date
timestr string, --access time
month string, --access month
day string, --access day
hour string, --access hour
request string, --requested url
status string, --response code
body_bytes_sent string, --bytes sent
http_referer string, --referer url
ref_host string, --referer host
ref_path string, --referer path
ref_query string, --referer query
ref_query_id string, --referer query value
http_user_agent string --client user agent
)
partitioned by(datestr string);

Populate the detail wide table ods_weblog_detail with an insert-select.

1. Extract refer_url into the intermediate table t_ods_tmp_referurl

That is, split the referer url into host, path, query, and query id:

drop table if exists t_ods_tmp_referurl;
create table t_ods_tmp_referurl as
SELECT a.*,b.*
FROM ods_weblog_origin a 
LATERAL VIEW parse_url_tuple(regexp_replace(http_referer, "\"", ""), 
'HOST', 'PATH','QUERY', 'QUERY:id') b as host, path, query, query_id;

Hive's built-in parse_url_tuple takes a standard URL and parses out the requested parts (host, resource path, query string, query values).
a lateral view b: glues b's generated columns onto each row of a, returning a new table.
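
A quick self-contained check of parse_url_tuple (the literal URL is made up):

select b.host, b.path, b.query, b.query_id
from (select 'http://www.example.com/a/b?id=7&x=1' as url) t
lateral view parse_url_tuple(t.url, 'HOST', 'PATH', 'QUERY', 'QUERY:id') b as host, path, query, query_id;
-- host=www.example.com  path=/a/b  query=id=7&x=1  query_id=7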

Note: lateral view is used together with UDTFs such as split and explode; it can expand one column into multiple rows.
A UDTF (User-Defined Table-Generating Function) covers the one-row-in, many-rows-out (one-to-many mapping) case. explode is such a column-expanding function: explode(ARRAY) emits one row per array element.
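
A minimal illustration of the lateral view + explode combination (values made up):

select t.id, v.item
from (select 1 as id, array('a','b','c') as items) t
lateral view explode(t.items) v as item;
-- yields three rows: (1,'a'), (1,'b'), (1,'c')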

2. Extract and transform the time_local field into the intermediate detail table t_ods_tmp_detail

drop table if exists t_ods_tmp_detail;
create table t_ods_tmp_detail as 
select b.*,substring(time_local,0,10) as daystr,
substring(time_local,12) as tmstr,
substring(time_local,6,2) as month,
substring(time_local,9,2) as day,
substring(time_local,11,3) as hour
from t_ods_tmp_referurl b;

The statements above can be rewritten as a single insert:
insert into table ods_weblog_detail partition(datestr='20130918')
select c.valid,c.remote_addr,c.remote_user,c.time_local,
substring(c.time_local,0,10) as daystr,
substring(c.time_local,12) as tmstr,
substring(c.time_local,6,2) as month,
substring(c.time_local,9,2) as day,
substring(c.time_local,11,3) as hour,
c.request,c.status,c.body_bytes_sent,c.http_referer,c.ref_host,c.ref_path,c.ref_query,c.ref_query_id,c.http_user_agent
from
(SELECT 
a.valid,a.remote_addr,a.remote_user,a.time_local,
a.request,a.status,a.body_bytes_sent,a.http_referer,a.http_user_agent,b.ref_host,b.ref_path,b.ref_query,b.ref_query_id 
FROM ods_weblog_origin a LATERAL VIEW parse_url_tuple(regexp_replace(http_referer, "\"", ""), 'HOST', 'PATH','QUERY', 'QUERY:id') b as ref_host, ref_path, ref_query, ref_query_id) c;


show partitions ods_weblog_detail;

4. Module Development: Statistical Analysis

Traffic Analysis

By time dimension

--Method 1: query the single table ods_weblog_detail directly
1.1.1 Compute the per-hour PVs within this processing batch (one day)
drop table dw_pvs_everyhour_oneday;
create table dw_pvs_everyhour_oneday(month string,day string,hour string,pvs bigint) partitioned by(datestr string);

insert into table dw_pvs_everyhour_oneday partition(datestr='20130918')
select a.month as month,a.day as day,a.hour as hour,count(*) as pvs from ods_weblog_detail a
where a.datestr='20130918' group by a.month,a.day,a.hour;

--Compute daily PVs
drop table dw_pvs_everyday;
create table dw_pvs_everyday(pvs bigint,month string,day string);

insert into table dw_pvs_everyday
select count(*) as pvs,a.month as month,a.day as day from ods_weblog_detail a
group by a.month,a.day;

1.1.2 Method 2: join against the time dimension table

--Dimension: day
drop table dw_pvs_everyday;
create table dw_pvs_everyday(pvs bigint,month string,day string);

insert into table dw_pvs_everyday
select count(*) as pvs,a.month as month,a.day as day from (select distinct month, day from t_dim_time) a
join ods_weblog_detail b 
on a.month=b.month and a.day=b.day
group by a.month,a.day;

--Dimension: month
drop table dw_pvs_everymonth;
create table dw_pvs_everymonth (pvs bigint,month string);

insert into table dw_pvs_everymonth
select count(*) as pvs,a.month from (select distinct month from t_dim_time) a
join ods_weblog_detail b on a.month=b.month group by a.month;


--Alternatively, reuse earlier results, e.g. compute each day's total from the per-hour results already computed:
insert into table dw_pvs_everyday
select sum(pvs) as pvs,month,day from dw_pvs_everyhour_oneday group by month,day having day='18';
By section dimension

1.2 PV statistics by referer dimension

--Count the PVs produced by each referer url per hour; store the result in dw_pvs_referer_everyhour

drop table dw_pvs_referer_everyhour;
create table dw_pvs_referer_everyhour(referer_url string,referer_host string,month string,day string,hour string,pv_referer_cnt bigint) partitioned by(datestr string);

insert into table dw_pvs_referer_everyhour partition(datestr='20130918')
select http_referer,ref_host,month,day,hour,count(1) as pv_referer_cnt
from ods_weblog_detail 
group by http_referer,ref_host,month,day,hour 
having ref_host is not null -- filter within each group
order by hour asc,day asc,month asc,pv_referer_cnt desc;
--Count and rank the PVs produced by each referer host per hour
drop table dw_pvs_refererhost_everyhour;
create table dw_pvs_refererhost_everyhour(ref_host string,month string,day string,hour string,ref_host_cnts bigint) partitioned by(datestr string);

insert into table dw_pvs_refererhost_everyhour partition(datestr='20130918')
select ref_host,month,day,hour,count(1) as ref_host_cnts
from ods_weblog_detail 
group by ref_host,month,day,hour 
having ref_host is not null
order by hour asc,day asc,month asc,ref_host_cnts desc;
1.3. Average Pageviews per Visitor

Requirement: compute the average number of pages requested per visitor today.
Average pageviews per visitor (pages per visitor) indicates the site's stickiness: the average number of pages a user views in a given period.
Formula: total page requests / distinct visitors
remote_addr distinguishes users: first count the PVs per remote_addr, then sum those PVs as the total page requests, and count the remote_addr values as the distinct-visitor total.

drop table dw_avgpv_user_everyday;
create table dw_avgpv_user_everyday(
day string,
avgpv string);

insert into table dw_avgpv_user_everyday
select '20130918',sum(b.pvs)/count(b.remote_addr) from
(select remote_addr,count(1) as pvs from ods_weblog_detail where datestr='20130918' group by remote_addr) b;

First compute each user's PV count for the day, then divide.

1.4. Top-N Referer Sources by Total PV (grouped top-N)

Requirement: by time dimension, find for each hour of the day the top N referer hosts producing the most PVs.

The row_number() function:
 Syntax: row_number() over (partition by xxx order by xxx) rank, where rank is an alias, effectively a new column.
 partition by defines the groups, e.g. partition by a sex field.
 order by sorts within each group, e.g. partition by sex and order by age within each group.
 Once sorted, every row in a group is numbered starting from 1.
 To pick certain rows out of a group, filter on the alias (e.g. where t.rank > x) from an outer query.

The following statement numbers each hour's referer hosts in descending order of PV count:

--the row_number function
select ref_host,ref_host_cnts,concat(month,day,hour),
row_number() over (partition by concat(month,day,hour) order by ref_host_cnts desc) as od 
from dw_pvs_refererhost_everyhour;

--Putting it together:
drop table dw_pvs_refhost_topn_everyhour;
create table dw_pvs_refhost_topn_everyhour(
hour string,
toporder string,
ref_host string,
ref_host_cnts string
)partitioned by(datestr string);

insert into table dw_pvs_refhost_topn_everyhour partition(datestr='20130918')
select t.hour,t.od,t.ref_host,t.ref_host_cnts from
 (select ref_host,ref_host_cnts,concat(month,day,hour) as hour,
row_number() over (partition by concat(month,day,hour) order by ref_host_cnts desc) as od 
from dw_pvs_refererhost_everyhour) t where od<=3;
-- the window alias od is not visible in the where clause at the same query level, so the numbering query is wrapped in a subquery aliased t

Grouped top-N: group by multiple keys, then within each group pick the shared extremes (the most frequent or largest few).
https://www.cnblogs.com/wujin/p/6051768.html
Window functions operate over a window (an interval of rows); row_number() and its siblings are the usual tools, as the sketch below shows.
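
For comparison, a quick sketch of row_number() next to its siblings rank() and dense_rank(), run against the table built above:

select ref_host, ref_host_cnts,
       row_number() over (order by ref_host_cnts desc) as rn,   -- 1,2,3,4,... no ties
       rank()       over (order by ref_host_cnts desc) as rk,   -- 1,2,2,4,... gaps after ties
       dense_rank() over (order by ref_host_cnts desc) as drk   -- 1,2,2,3,... no gaps
from dw_pvs_refererhost_everyhour
where datestr='20130918';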

Visited-Page Analysis

2.1. Per-Page Access Statistics

These statistics target the request field in the data: per-page PV, per-page UV, and so on.
Such metrics boil down to a group by on the page field. For example:
--per-page access statistics
Per-page PV:

select request as request,count(request) as request_counts from
ods_weblog_detail group by request having request is not null order by request_counts desc limit 20;

-----------------------------------------------
--hot page statistics
Top 10 hottest pages per day:

drop table dw_hotpages_everyday;
create table dw_hotpages_everyday(day string,url string,pvs string);

insert into table dw_hotpages_everyday
select '20130918',a.request,a.request_counts from
(select request as request,count(request) as request_counts from ods_weblog_detail where datestr='20130918' group by request having request is not null) a
order by a.request_counts desc limit 10;

Q: why not use row_number() here?
Because the data is partitioned by date and we only need each day's hottest pages; no within-group (per-hour) ranking is involved.

Visitor Analysis

Daily new visitors

Requirement: count each day's new visitors.
Approach: maintain a cumulative table of distinct historical visitors and compare each day's visitors against it.

----------------------------------------------------------------------------------------
-- Daily new visitors
-- Requirement: count each day's new visitors.

--cumulative table of distinct historical visitors
drop table dw_user_dsct_history;
create table dw_user_dsct_history(
day string,
ip string
) 
partitioned by(datestr string);

--daily new visitor table
drop table dw_user_new_d;
create table dw_user_new_d (
day string,
ip string
) 
partitioned by(datestr string);

--insert each day's new users into the new-visitor table
insert into table dw_user_new_d partition(datestr='20130918')
select tmp.day as day,tmp.today_addr as new_ip from
(
select today.day as day,today.remote_addr as today_addr,old.ip as old_addr 
from 
(select distinct remote_addr as remote_addr,"20130918" as day from ods_weblog_detail where datestr="20130918") today
left outer join 
dw_user_dsct_history old
on today.remote_addr=old.ip
) tmp
where tmp.old_addr is null;
--append the day's new users to the cumulative table
insert into table dw_user_dsct_history partition(datestr='20130918')
select day,ip from dw_user_new_d where datestr='20130918';


Verification:
select count(distinct remote_addr) from ods_weblog_detail;

select count(1) from dw_user_dsct_history where datestr='20130918';

select count(1) from dw_user_new_d where datestr='20130918';
Returning / One-Time Visitor Statistics

Approach: visitors appearing in the table above more than once are returning visitors; the rest are one-time visitors.

-- Returning / one-time visitor statistics
drop table dw_user_returning;
create table dw_user_returning(
day string,
remote_addr string,
acc_cnt string)
partitioned by (datestr string);
Requirement: list today's returning visitors (sessions >= 2) and their visit counts, based on ods_click_stream_visit.
--explain extended	  
select t.remote_addr,count (t.session) as visits
from ods_click_stream_visit t
where t.datestr="20130918"
group by t.remote_addr
having visits >1;


select tmp.ip,tmp.visits from
(select t.remote_addr as ip,count (t.session) as visits
from ods_click_stream_visit t
where t.datestr="20130918"
group by t.remote_addr) tmp
where tmp.visits >1;
-------------------------------------------------------------
Requirement: average visit frequency per visitor

Average pages viewed per visitor = total PVs / distinct visitors:
select sum(pagevisits)/count(distinct remote_addr) from ods_click_stream_visit where datestr='20130918';
Average sessions per visitor = total sessions / distinct visitors:
select count(session)/count(distinct remote_addr) from ods_click_stream_visit where datestr='20130918';

5. Key-Path Conversion Analysis (Funnel Model)

5.1. Requirements

Conversion: in a specified business flow, the number of users who complete each step, and that count as a percentage of the previous step.

Tip: in Hive, when the current row's computation depends on prior-state data, join the table with itself; the current-state and prior-state rows then appear together in one output row. See the cascading-sum (级联求和) pattern, sketched below, for details.
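
A hypothetical cascading-sum sketch of that self-join pattern (t_month_amount and its columns are made-up names); the running total for a month is the sum over all rows whose month is not later:

select a.month,
       max(a.amount) as month_amount,  -- this month's own amount
       sum(b.amount) as running_total  -- cumulative amount up to and including this month
from t_month_amount a
join t_month_amount b on b.month <= a.month
group by a.month;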

Load the funnel-analysis sample data into its own partition:

load data local inpath '/root/hivedata/click-part-r-00000' overwrite into table ods_click_pageviews partition(datestr='20130920');
----------------------------------------------------------
---1. Count the total visitors at each step
UNION merges the result sets of multiple SELECT statements into one result set (deduplicated; UNION ALL would keep duplicates):

create table dw_oute_numbs as 
select 'step1' as step,count(distinct remote_addr)  as numbs from ods_click_pageviews where datestr='20130920' and request like '/item%'
union
select 'step2' as step,count(distinct remote_addr)  as numbs from ods_click_pageviews where datestr='20130920' and request like '/category%'
union
select 'step3' as step,count(distinct remote_addr)  as numbs from ods_click_pageviews where datestr='20130920' and request like '/order%'
union
select 'step4' as step,count(distinct remote_addr)  as numbs from ods_click_pageviews where datestr='20130920' and request like '/index%';

+---------------------+----------------------+
| dw_oute_numbs.step  | dw_oute_numbs.numbs  |
+---------------------+----------------------+
| step1               | 1029                 |
| step2               | 1029                 |
| step3               | 1028                 |
| step4               | 1018                 |
+---------------------+----------------------+

--2. Compute each step's visitor count as a ratio of the path's starting step
--cascading query: join the table with itself

select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from dw_oute_numbs rn
inner join 
dw_oute_numbs rr;

The self-join produces the result below:
+---------+----------+---------+----------+
| rnstep  | rnnumbs  | rrstep  | rrnumbs  |
+---------+----------+---------+----------+
| step1   | 1029     | step1   | 1029     |
| step2   | 1029     | step1   | 1029     |
| step3   | 1028     | step1   | 1029     |
| step4   | 1018     | step1   | 1029     |
| step1   | 1029     | step2   | 1029     |
| step2   | 1029     | step2   | 1029     |
| step3   | 1028     | step2   | 1029     |
| step4   | 1018     | step2   | 1029     |
| step1   | 1029     | step3   | 1028     |
| step2   | 1029     | step3   | 1028     |
| step3   | 1028     | step3   | 1028     |
| step4   | 1018     | step3   | 1028     |
| step1   | 1029     | step4   | 1018     |
| step2   | 1029     | step4   | 1018     |
| step3   | 1028     | step4   | 1018     |
| step4   | 1018     | step4   | 1018     |
+---------+----------+---------+----------+

--each step's count / step 1's count == each step's ratio relative to the start
select tmp.rnstep,tmp.rnnumbs/tmp.rrnumbs as ratio
from
(
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from dw_oute_numbs rn
inner join 
dw_oute_numbs rr) tmp
where tmp.rrstep='step1';

tmp (filtered to rrstep='step1'):
+---------+----------+---------+----------+
| rnstep  | rnnumbs  | rrstep  | rrnumbs  |
+---------+----------+---------+----------+
| step1   | 1029     | step1   | 1029     |
| step2   | 1029     | step1   | 1029     |
| step3   | 1028     | step1   | 1029     |
| step4   | 1018     | step1   | 1029     |
+---------+----------+---------+----------+


--3. Compute each step's leakage rate relative to the previous step

--first, use the self-join to keep only the rows pairing each step with its predecessor

select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from dw_oute_numbs rn
inner join 
dw_oute_numbs rr
where cast(substr(rn.step,5,1) as int)=cast(substr(rr.step,5,1) as int)-1;
Note: cast is Hive's built-in type-conversion function:
select cast(1 as float); --1.0  
select cast('2016-05-22' as date); --2016-05-22 

+---------+----------+---------+----------+
| rnstep  | rnnumbs  | rrstep  | rrnumbs  |
+---------+----------+---------+----------+
| step1   | 1029     | step2   | 1029     |
| step2   | 1029     | step3   | 1028     |
| step3   | 1028     | step4   | 1018     |
+---------+----------+---------+----------+

--then each step's leakage rate relative to the previous one is straightforward to compute
select tmp.rrstep as step,tmp.rrnumbs/tmp.rnnumbs as leakage_rate
from
(
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from dw_oute_numbs rn
inner join 
dw_oute_numbs rr) tmp
where cast(substr(tmp.rnstep,5,1) as int)=cast(substr(tmp.rrstep,5,1) as int)-1;

--4. Combine the two metrics above

select abs.step,abs.numbs,abs.rate as abs_ratio,rel.rate as leakage_rate
from 
(
select tmp.rnstep as step,tmp.rnnumbs as numbs,tmp.rnnumbs/tmp.rrnumbs as rate
from
(
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from dw_oute_numbs rn
inner join 
dw_oute_numbs rr) tmp
where tmp.rrstep='step1'
) abs
left outer join
(
select tmp.rrstep as step,tmp.rrnumbs/tmp.rnnumbs as rate
from
(
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from dw_oute_numbs rn
inner join 
dw_oute_numbs rr) tmp
where cast(substr(tmp.rnstep,5,1) as int)=cast(substr(tmp.rrstep,5,1) as int)-1
) rel
on abs.step=rel.step;