Data Warehouse Construction (III)

1. Business Knowledge Preparation

1.1 Business Terminology

  1. User

Users are identified by device: in mobile analytics, each independent device counts as one independent user. Android identifies a device by its IMEI, iOS by its OpenUDID; one handset equals one user.

  2. New user

A user who uses the app for the first time after connecting to the network. A user's first launch of an app makes that user a new user; a device that uninstalls and reinstalls is not counted as new again. New users are tracked as daily, weekly, and monthly new users.

  3. Active user

A user who opens the app counts as active, regardless of how the app is used. A device that opens the app multiple times in one day still counts as one active user.

  4. Weekly (monthly) active user

A user who launched the app within a calendar week (month); multiple launches within that week (month) count as a single active user.

  5. Monthly active rate

The ratio of monthly active users to the cumulative total of users up to and including that month.

  6. Silent user

A user who launched the app only once, on the install day (or the day after), with no launches afterwards. This metric reflects the quality of newly acquired users and how well they match the app.

  7. Version distribution

New users, active users, and launch counts for each day of the week, broken down by app version. Useful for comparing versions and understanding user habits.

  8. Returning user (this week)

A user who did not launch the app last week but did launch it this week.

  9. User active for n consecutive weeks

A user who launched the app at least once a week for n consecutive weeks.

  10. Loyal user

A user active for 5 or more consecutive weeks.

  11. Continuously active user

A user active for 2 or more consecutive weeks.

  12. Recently churned user

A user who has not launched the app for n (2 <= n <= 4) consecutive weeks (and did not launch in week n+1).

  13. Retained user

Of the users newly added in a given period, those still using the app after some time are retained users; their share of the original new users is the retention rate.

For example, if 200 new users are added in May, of whom 100 launch the app in June, 80 in July, and 50 in August, then the one-month retention rate of May's new users is 50%, the two-month rate is 40%, and the three-month rate is 25%.

  14. User freshness

The daily ratio of new to old users among those launching the app, i.e., new users as a share of active users.

  15. Single-session length

The length of time the app is used per launch.

  16. Daily usage time

The cumulative usage time within one day.

  17. Launch counting standard

On iOS, sending the app to the background ends one independent launch. On Android, we treat two launches less than 30 seconds apart as one launch: if a user leaves the app for under 30 seconds (e.g., to answer a call or reply to a text) and then returns, the two uses are a continuation rather than independent behaviors, so they count as a single launch. 30 seconds is the common industry threshold, though the interval can be customized. A minimal sketch of this merging logic in HiveQL follows below.
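As an illustration only (not part of the pipeline built later in this article), the 30-second rule can be applied with a window function. This sketch assumes a hypothetical launch log table app_launch_log with a device id mid_id and an epoch-second timestamp ts; a launch opens a new session only when the gap to the device's previous launch exceeds 30 seconds:

hive (gmall)>
select
  mid_id,
  -- the first record per device has a NULL gap and also opens a session
  sum(if(gap is null or gap > 30, 1, 0)) launch_cnt
from
(
  select
    mid_id,
    ts - lag(ts) over (partition by mid_id order by ts) gap
  from app_launch_log
) t
group by mid_id;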

1.2 System Functions
1.2.1 The collect_set function

1. Create the source table

hive (gmall)>
drop table if exists stud;
create table stud (name string, area string, course string, score int);

2. Insert data into the source table

hive (gmall)>
insert into table stud values('zhang3','bj','math',88);
insert into table stud values('li4','bj','math',99);
insert into table stud values('wang5','sh','chinese',92);
insert into table stud values('zhao6','sh','chinese',54);
insert into table stud values('tian7','bj','chinese',91);

3. Query the table

hive (gmall)> select * from stud;
stud.name    stud.area    stud.course   stud.score
zhang3 bj   math  88
li4   bj   math  99
wang5  sh   chinese 92
zhao6  sh   chinese 54
tian7  bj   chinese 91

4. Aggregate values from different rows of the same group into a set

hive (gmall)> select course, collect_set(area), avg(score) from stud group by course;
chinese ["sh","bj"]   79.0
math  ["bj"] 93.5
5. An individual element can be selected by subscript:

hive (gmall)> select course, collect_set(area)[0], avg(score) from stud group by course;
chinese sh   79.0
math  bj   93.5
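Note: collect_set de-duplicates while aggregating. When duplicates should be preserved, Hive's collect_list can be used instead; with the sample data above it would return something like the following (element order within the array may vary):

hive (gmall)> select course, collect_list(area) from stud group by course;
chinese ["sh","sh","bj"]
math  ["bj","bj"]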
1.2.2 Date functions (date_format, date_add, next_day, last_day)

1. date_format (format a date)

hive (gmall)> select date_format('2019-12-14','yyyy-MM');
2019-12

2. date_add (add or subtract days)

hive (gmall)> select date_add('2019-12-14',-1);
2019-12-13
hive (gmall)> select date_add('2019-12-14',1);
2019-12-15

3. next_day

(1) Get the next Monday after a given date

hive (gmall)> select next_day('2019-02-12','MO');
2019-02-18

(2) Get the Monday of the week containing the given date

hive (gmall)> select date_add(next_day('2019-02-12','MO'),-7);
2019-02-11

4. last_day (last day of the month)

hive (gmall)> select last_day('2019-12-14');
2019-12-31
1.3 Requirement Implementation Workflow

The following is the overall development workflow for the active-user requirement.

Step 1: Pin down the business definition of the metric

The business definition should be driven by the product manager, who talks to the operations owner who requested the metric. First clarify how the metric is defined; for example, an active user is a user who has launched the app.

Step 2: Pin down the technical definition of the metric

The technical definition is driven by the modeling engineer. The product manager communicates the metric's business logic to the model designer and coordinates the business side's developers with our modeling engineers to work out which user actions need to be collected, or which tables and fields are needed from the business databases.

Step 3: Prototype design and review

The product manager leads the prototype design. For the activity theme, the final deliverable is the trend of active users over the last n days. Modeling engineers, data developers, backend and frontend engineers, and UI designers all take part, walking through the feature's value and detailed workflow to make sure everyone shares the same understanding.

Step 4: Model design

The model design engineer takes the lead, usually organizing data with layered modeling: ODS (operational data store), DWD (detail data), DWS (summary data), and ADS (application data) — the layering commonly used for business data. The model designer must know exactly where the data comes from and how it will be stored.

For the active-user requirement: the ODS layer stores start_log (launch logs), the DWD layer cleans and filters the data, the DWS layer performs light aggregation, and the ADS layer produces the final metric.

Step 5: Data development

The big-data developer takes the lead: first agree with the modeling engineer on the technical definition and which business systems the metric draws from, then sync the data with tools such as Flume and Sqoop into the ODS layer designed by the model engineer, compute layer by layer in SQL through DWD and DWS, and finally fill the ADS layer with data the application can consume directly.

Step 6: Backend development

Backend engineers take the lead, exposing APIs per the product manager's feature definitions for the frontend to call. Since the ADS-layer data has already been exported to a conventional relational database (e.g., MySQL), backend work is mostly about aligning with the product manager on functionality and performance so users get a good experience.

Step 7: Frontend development

Frontend engineers take the lead. Once the prototype is ready, the product manager has the UI designer design the UI around the feature's focus; after the UI is finalized through iterations, the designer hands slices to the frontend engineers, who build the pages from them.

Step 8: Integration

Data, frontend, and backend engineers all participate. The big-data developer runs the computation over historical data and is responsible for validating data accuracy; the front and back ends fix user-facing bugs and complete self-testing so that no low-level issues slip through.

Step 9: Testing

Test engineers start writing test cases right after the prototype review, spelling out what developers must self-test before handoff and what the testers will verify again. Experienced product managers ask operations for historical statistics to cross-check the numbers, although operations' data is not necessarily accurate and serves only as a reference. Once testing passes, the product manager arranges a trial with operations; issues found during the trial go back for rework, after which the development cycle is complete.

Step 10: Launch

Operations engineers work with the frontend and backend engineers to deploy the new version to the servers. The product manager then assigns a long-term owner for the metric's accuracy; important metrics are re-verified internally every cycle to keep the data trustworthy.

2. Requirements Analysis

2.1 Requirement 1: User Activity Theme
2.1.1 DWS layer

1. Weekly active-device detail

Weekly active-device detail is derived from the daily active-device detail (created in item 2 below).

1) Table DDL

hive (gmall)>
drop table if exists dws_uv_detail_wk;
create external table dws_uv_detail_wk( 
  `mid_id` string COMMENT '设备唯一标识',
  `user_id` string COMMENT '用户标识', 
  `version_code` string COMMENT '程序版本号', 
  `version_name` string COMMENT '程序版本名', 
  `lang` string COMMENT '系统语言', 
  `source` string COMMENT '渠道号', 
  `os` string COMMENT '安卓系统版本', 
  `area` string COMMENT '区域', 
  `model` string COMMENT '手机型号', 
  `brand` string COMMENT '手机品牌', 
  `sdk_version` string COMMENT 'sdkVersion', 
  `gmail` string COMMENT 'gmail', 
  `height_width` string COMMENT '屏幕宽高',
  `app_time` string COMMENT '客户端日志产生时的时间',
  `network` string COMMENT '网络模式',
  `lng` string COMMENT '经度',
  `lat` string COMMENT '纬度',
  `monday_date` string COMMENT '周一日期',
  `sunday_date` string COMMENT '周日日期' 
) COMMENT '活跃用户按周明细'
PARTITIONED BY (`wk_dt` string)
stored as parquet
location '/warehouse/gmall/dws/dws_uv_detail_wk/';

2) Load data

hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dws_uv_detail_wk partition(wk_dt)
select 
  mid_id,
  concat_ws('|', collect_set(user_id)) user_id,
  concat_ws('|', collect_set(version_code)) version_code,
  concat_ws('|', collect_set(version_name)) version_name,
  concat_ws('|', collect_set(lang)) lang,
  concat_ws('|', collect_set(source)) source,
  concat_ws('|', collect_set(os)) os,
  concat_ws('|', collect_set(area)) area, 
  concat_ws('|', collect_set(model)) model,
  concat_ws('|', collect_set(brand)) brand,
  concat_ws('|', collect_set(sdk_version)) sdk_version,
  concat_ws('|', collect_set(gmail)) gmail,
  concat_ws('|', collect_set(height_width)) height_width,
  concat_ws('|', collect_set(app_time)) app_time,
  concat_ws('|', collect_set(network)) network,
  concat_ws('|', collect_set(lng)) lng,
  concat_ws('|', collect_set(lat)) lat,
  date_add(next_day('2019-12-14','MO'),-7),
  date_add(next_day('2019-12-14','MO'),-1),
  concat(date_add( next_day('2019-12-14','MO'),-7), '_' , date_add(next_day('2019-12-14','MO'),-1)
)
from dws_uv_detail_day 
where dt>=date_add(next_day('2019-12-14','MO'),-7) and dt<=date_add(next_day('2019-12-14','MO'),-1) 
group by mid_id;

3) Verify the load

hive (gmall)> select * from dws_uv_detail_wk limit 1;
hive (gmall)> select count(*) from dws_uv_detail_wk;   

2. Daily active-device detail

1) Table DDL

hive (gmall)>
drop table if exists dws_uv_detail_day;
create external table dws_uv_detail_day
(
  `mid_id` string COMMENT '设备唯一标识',
  `user_id` string COMMENT '用户标识', 
  `version_code` string COMMENT '程序版本号', 
  `version_name` string COMMENT '程序版本名',
  `lang` string COMMENT '系统语言', 
  `source` string COMMENT '渠道号', 
  `os` string COMMENT '安卓系统版本', 
  `area` string COMMENT '区域', 
  `model` string COMMENT '手机型号', 
  `brand` string COMMENT '手机品牌', 
  `sdk_version` string COMMENT 'sdkVersion', 
  `gmail` string COMMENT 'gmail', 
  `height_width` string COMMENT '屏幕宽高',
  `app_time` string COMMENT '客户端日志产生时的时间',
  `network` string COMMENT '网络模式',
  `lng` string COMMENT '经度',
  `lat` string COMMENT '纬度'
)
partitioned by(dt string)
stored as parquet
location '/warehouse/gmall/dws/dws_uv_detail_day';

2) Load data

Aggregate with the device's daily visits as the key. If a device used two operating systems, two system versions, or several regions, or logged in with different accounts within one day, each field keeps the distinct values joined with '|'.

hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dws_uv_detail_day 
partition(dt='2019-12-14')
select
  mid_id,
  concat_ws('|', collect_set(user_id)) user_id,
  concat_ws('|', collect_set(version_code)) version_code,
  concat_ws('|', collect_set(version_name)) version_name,
  concat_ws('|', collect_set(lang))lang,
  concat_ws('|', collect_set(source)) source,
  concat_ws('|', collect_set(os)) os,
  concat_ws('|', collect_set(area)) area, 
  concat_ws('|', collect_set(model)) model,
  concat_ws('|', collect_set(brand)) brand,
  concat_ws('|', collect_set(sdk_version)) sdk_version,
  concat_ws('|', collect_set(gmail)) gmail,
  concat_ws('|', collect_set(height_width)) height_width,
  concat_ws('|', collect_set(app_time)) app_time,
  concat_ws('|', collect_set(network)) network,
  concat_ws('|', collect_set(lng)) lng,
  concat_ws('|', collect_set(lat)) lat
from dwd_start_log
where dt='2019-12-14'
group by mid_id;

3) Verify the load

hive (gmall)> select * from dws_uv_detail_day limit 1;
hive (gmall)> select count(*) from dws_uv_detail_day;

3. Monthly active-device detail

1) Table DDL

hive (gmall)>
drop table if exists dws_uv_detail_mn;
create external table dws_uv_detail_mn( 
  `mid_id` string COMMENT '设备唯一标识',
  `user_id` string COMMENT '用户标识', 
  `version_code` string COMMENT '程序版本号', 
  `version_name` string COMMENT '程序版本名', 
  `lang` string COMMENT '系统语言', 
  `source` string COMMENT '渠道号', 
  `os` string COMMENT '安卓系统版本', 
  `area` string COMMENT '区域', 
  `model` string COMMENT '手机型号', 
  `brand` string COMMENT '手机品牌', 
  `sdk_version` string COMMENT 'sdkVersion', 
  `gmail` string COMMENT 'gmail', 
  `height_width` string COMMENT '屏幕宽高',
  `app_time` string COMMENT '客户端日志产生时的时间',
  `network` string COMMENT '网络模式',
  `lng` string COMMENT '经度',
  `lat` string COMMENT '纬度'
) COMMENT '活跃用户按月明细'
PARTITIONED BY (`mn` string)
stored as parquet
location '/warehouse/gmall/dws/dws_uv_detail_mn/';

2) Load data

hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dws_uv_detail_mn partition(mn)
select 
  mid_id,
  concat_ws('|', collect_set(user_id)) user_id,
  concat_ws('|', collect_set(version_code)) version_code,
  concat_ws('|', collect_set(version_name)) version_name,
  concat_ws('|', collect_set(lang)) lang,
  concat_ws('|', collect_set(source)) source,
  concat_ws('|', collect_set(os)) os,
  concat_ws('|', collect_set(area)) area, 
  concat_ws('|', collect_set(model)) model,
  concat_ws('|', collect_set(brand)) brand,
  concat_ws('|', collect_set(sdk_version)) sdk_version,
  concat_ws('|', collect_set(gmail)) gmail,
  concat_ws('|', collect_set(height_width)) height_width,
  concat_ws('|', collect_set(app_time)) app_time,
  concat_ws('|', collect_set(network)) network,
  concat_ws('|', collect_set(lng)) lng,
  concat_ws('|', collect_set(lat)) lat,
  date_format('2019-12-14','yyyy-MM')
from dws_uv_detail_day
where date_format(dt,'yyyy-MM') = date_format('2019-12-14','yyyy-MM')
group by mid_id;

3) Verify the load

hive (gmall)> select * from dws_uv_detail_mn limit 1;
hive (gmall)> select count(*) from dws_uv_detail_mn ;

4. DWS-layer load script

1) Create the script in the /home/atguigu/bin directory on hadoop102

[atguigu@hadoop102 bin]$ vim dws_uv_log.sh

Write the following into the script:

#!/bin/bash
# Define variables for easy modification
APP=gmall
hive=/opt/module/hive/bin/hive
# Use the date argument if given; otherwise default to yesterday
if [ -n "$1" ] ;then
   do_date=$1
else 
   do_date=`date -d "-1 day" +%F` 
fi 
sql="
 set hive.exec.dynamic.partition.mode=nonstrict;
 insert overwrite table "$APP".dws_uv_detail_day partition(dt='$do_date')
 select 
  mid_id,
  concat_ws('|', collect_set(user_id)) user_id,
  concat_ws('|', collect_set(version_code)) version_code,
  concat_ws('|', collect_set(version_name)) version_name,
  concat_ws('|', collect_set(lang)) lang,
  concat_ws('|', collect_set(source)) source,
  concat_ws('|', collect_set(os)) os,
  concat_ws('|', collect_set(area)) area, 
  concat_ws('|', collect_set(model)) model,
  concat_ws('|', collect_set(brand)) brand,
  concat_ws('|', collect_set(sdk_version)) sdk_version,
  concat_ws('|', collect_set(gmail)) gmail,
  concat_ws('|', collect_set(height_width)) height_width,
  concat_ws('|', collect_set(app_time)) app_time,
  concat_ws('|', collect_set(network)) network,
  concat_ws('|', collect_set(lng)) lng,
  concat_ws('|', collect_set(lat)) lat
 from "$APP".dwd_start_log
 where dt='$do_date' 
 group by mid_id;
 insert overwrite table "$APP".dws_uv_detail_wk partition(wk_dt)
 select 
  mid_id,
  concat_ws('|', collect_set(user_id)) user_id,

  concat_ws('|', collect_set(version_code)) version_code,

  concat_ws('|', collect_set(version_name)) version_name,

  concat_ws('|', collect_set(lang)) lang,

  concat_ws('|', collect_set(source)) source,

  concat_ws('|', collect_set(os)) os,

  concat_ws('|', collect_set(area)) area, 

  concat_ws('|', collect_set(model)) model,

  concat_ws('|', collect_set(brand)) brand,

  concat_ws('|', collect_set(sdk_version)) sdk_version,

  concat_ws('|', collect_set(gmail)) gmail,

  concat_ws('|', collect_set(height_width)) height_width,

  concat_ws('|', collect_set(app_time)) app_time,

  concat_ws('|', collect_set(network)) network,

  concat_ws('|', collect_set(lng)) lng,

  concat_ws('|', collect_set(lat)) lat,

  date_add(next_day('$do_date','MO'),-7),

  date_add(next_day('$do_date','MO'),-1),

  concat(date_add( next_day('$do_date','MO'),-7), '_' , date_add(next_day('$do_date','MO'),-1) 

 )

 from "$APP".dws_uv_detail_day

 where dt>=date_add(next_day('$do_date','MO'),-7) and dt<=date_add(next_day('$do_date','MO'),-1) 

 group by mid_id; 
 insert overwrite table "$APP".dws_uv_detail_mn partition(mn)

 select

  mid_id,

  concat_ws('|', collect_set(user_id)) user_id,

  concat_ws('|', collect_set(version_code)) version_code,

  concat_ws('|', collect_set(version_name)) version_name,

  concat_ws('|', collect_set(lang))lang,

  concat_ws('|', collect_set(source)) source,

  concat_ws('|', collect_set(os)) os,

  concat_ws('|', collect_set(area)) area, 

  concat_ws('|', collect_set(model)) model,

  concat_ws('|', collect_set(brand)) brand,

  concat_ws('|', collect_set(sdk_version)) sdk_version,

  concat_ws('|', collect_set(gmail)) gmail,

  concat_ws('|', collect_set(height_width)) height_width,

  concat_ws('|', collect_set(app_time)) app_time,

  concat_ws('|', collect_set(network)) network,

  concat_ws('|', collect_set(lng)) lng,

  concat_ws('|', collect_set(lat)) lat,

  date_format('$do_date','yyyy-MM')

 from "$APP".dws_uv_detail_day
 where date_format(dt,'yyyy-MM') = date_format('$do_date','yyyy-MM')  
 group by mid_id;
"
$hive -e "$sql"

2) Make the script executable

[atguigu@hadoop102 bin]$ chmod 777 dws_uv_log.sh

3) Run the script

[atguigu@hadoop102 module]$ dws_uv_log.sh 2019-02-11

4) Check the results

hive (gmall)> select count(*) from dws_uv_detail_day where dt='2019-02-11';

hive (gmall)> select count(*) from dws_uv_detail_wk;

hive (gmall)> select count(*) from dws_uv_detail_mn ;

5) Schedule

In production this typically runs daily between 00:30 and 01:00.

2.1.2 ADS layer

Goal: active-device counts for the day, the week, and the month.

1. Active device counts

1) Table DDL

hive (gmall)>
drop table if exists ads_uv_count;
create external table ads_uv_count( 
  `dt` string COMMENT '统计日期',
  `day_count` bigint COMMENT '当日用户数量',
  `wk_count` bigint COMMENT '当周用户数量',
  `mn_count` bigint COMMENT '当月用户数量',
  `is_weekend` string COMMENT 'Y,N是否是周末,用于得到本周最终结果',
  `is_monthend` string COMMENT 'Y,N是否是月末,用于得到本月最终结果'
) COMMENT '活跃设备数'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_uv_count/';

2) Load data

hive (gmall)>
insert into table ads_uv_count 
select 
  '2019-12-14' dt,
  daycount.ct,
  wkcount.ct,
  mncount.ct,
  if(date_add(next_day('2019-12-14','MO'),-1)='2019-12-14','Y','N'),
  if(last_day('2019-12-14')='2019-12-14','Y','N')
from 
(
  select '2019-12-14' dt, count(*) ct
  from dws_uv_detail_day
  where dt='2019-12-14' 
) daycount join 
( 
  select '2019-12-14' dt, count(*) ct
  from dws_uv_detail_wk
  where wk_dt=concat(date_add(next_day('2019-12-14','MO'),-7),'_',date_add(next_day('2019-12-14','MO'),-1))
) wkcount on daycount.dt=wkcount.dt
join 
( 
  select '2019-12-14' dt, count(*) ct
  from dws_uv_detail_mn
  where mn=date_format('2019-12-14','yyyy-MM') 
) mncount on daycount.dt=mncount.dt;

3) Verify the load

hive (gmall)> select * from ads_uv_count;

2. ADS-layer load script

1) Create the script in the /home/atguigu/bin directory on hadoop102

[atguigu@hadoop102 bin]$ vim ads_uv_log.sh

Write the following into the script:

#!/bin/bash
# Define variables for easy modification
APP=gmall
hive=/opt/module/hive/bin/hive
# Use the date argument if given; otherwise default to yesterday
if [ -n "$1" ] ;then
   do_date=$1
else 
   do_date=`date -d "-1 day" +%F` 
fi 
sql="
 set hive.exec.dynamic.partition.mode=nonstrict;
insert into table "$APP".ads_uv_count
select 
 '$do_date' dt,
  daycount.ct,
  wkcount.ct,
  mncount.ct,
  if(date_add(next_day('$do_date','MO'),-1)='$do_date','Y','N'),
  if(last_day('$do_date')='$do_date','Y','N') 
from 
(
  select '$do_date' dt, count(*) ct
  from "$APP".dws_uv_detail_day
  where dt='$do_date' 
) daycount join 
( 
  select '$do_date' dt, count(*) ct
  from "$APP".dws_uv_detail_wk
  where wk_dt=concat(date_add(next_day('$do_date','MO'),-7),'_',date_add(next_day('$do_date','MO'),-1))
) wkcount on daycount.dt=wkcount.dt
join 
( 
  select '$do_date' dt, count(*) ct
  from "$APP".dws_uv_detail_mn
  where mn=date_format('$do_date','yyyy-MM') 
) mncount on daycount.dt=mncount.dt;
"
$hive -e "$sql"

2) Make the script executable

[atguigu@hadoop102 bin]$ chmod 777 ads_uv_log.sh

3) Run the script

[atguigu@hadoop102 module]$ ads_uv_log.sh 2019-02-11

4) Schedule

In production this typically runs daily between 00:30 and 01:00.

5) Verify the load

hive (gmall)> select * from ads_uv_count;
2.2 Requirement 2: New User Theme

A new user is one who uses the app for the first time after connecting to the network. A device that uninstalls and reinstalls is not counted as new again. New users are tracked as daily, weekly, and monthly new users.

2.2.1 DWS layer (daily new-device detail table)

1) Table DDL

hive (gmall)>
drop table if exists dws_new_mid_day;
create external table dws_new_mid_day
(
  `mid_id` string COMMENT '设备唯一标识',
  `user_id` string COMMENT '用户标识', 
  `version_code` string COMMENT '程序版本号', 
  `version_name` string COMMENT '程序版本名', 
  `lang` string COMMENT '系统语言', 
  `source` string COMMENT '渠道号', 
  `os` string COMMENT '安卓系统版本', 
  `area` string COMMENT '区域', 
  `model` string COMMENT '手机型号', 
  `brand` string COMMENT '手机品牌', 
  `sdk_version` string COMMENT 'sdkVersion', 
  `gmail` string COMMENT 'gmail', 
  `height_width` string COMMENT '屏幕宽高',
  `app_time` string COMMENT '客户端日志产生时的时间',
  `network` string COMMENT '网络模式',
  `lng` string COMMENT '经度',
  `lat` string COMMENT '纬度',
  `create_date` string comment '创建时间' 
) COMMENT '每日新增设备信息'
stored as parquet
location '/warehouse/gmall/dws/dws_new_mid_day/';

2) Load data

Left join the daily active-device table to the daily new-device table on mid_id; a device that is new today has no match in the new-device table, so its nm.mid_id is null.

hive (gmall)>
insert into table dws_new_mid_day
select 
  ud.mid_id,
  ud.user_id, 
  ud.version_code, 
  ud.version_name, 
  ud.lang, 
  ud.source, 
  ud.os, 
  ud.area, 
  ud.model, 
  ud.brand, 
  ud.sdk_version, 
  ud.gmail, 
  ud.height_width,
  ud.app_time,
  ud.network,
  ud.lng,
  ud.lat,
  '2019-12-14'
from dws_uv_detail_day ud left join dws_new_mid_day nm on ud.mid_id=nm.mid_id
where ud.dt='2019-12-14' and nm.mid_id is null;

3) Verify the load

hive (gmall)> select count(*) from dws_new_mid_day;

4) Load script

[atguigu@hadoop102 bin]$ vi dws_new_log.sh

#!/bin/bash
# Define variables for easy modification
APP=gmall
hive=/opt/module/hive/bin/hive
hadoop=/opt/module/hadoop-2.7.2/bin/hadoop
# Use the date argument if given; otherwise default to yesterday
if [ -n "$1" ] ;then
  do_date=$1
else 
  do_date=`date -d "-1 day" +%F`
fi
echo "===Log date: $do_date==="
sql="
insert into table "$APP".dws_new_mid_day
select 
  ud.mid_id,
  ud.user_id, 
  ud.version_code, 
  ud.version_name, 
  ud.lang, 
  ud.source, 
  ud.os, 
  ud.area, 
  ud.model, 
  ud.brand, 
  ud.sdk_version, 
  ud.gmail, 
  ud.height_width,
  ud.app_time,
  ud.network,
  ud.lng,
  ud.lat,
  '$do_date'
from "$APP".dws_uv_detail_day ud left join "$APP".dws_new_mid_day nm on ud.mid_id=nm.mid_id
where ud.dt='$do_date' and nm.mid_id is null;
"
$hive -e "$sql"
2.2.2 ADS layer (daily new-device count table)

1) Table DDL

hive (gmall)>
drop table if exists ads_new_mid_count;
create external table ads_new_mid_count
(
  `create_date`   string comment '创建时间' ,
  `new_mid_count`  BIGINT comment '新增设备数量' 
) COMMENT '每日新增设备信息数量'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_new_mid_count/';

2) Load data

hive (gmall)>
insert into table ads_new_mid_count 
select
  create_date,
  count(*)
from dws_new_mid_day
where create_date='2019-12-14'
group by create_date;

3) Verify the load

hive (gmall)> select * from ads_new_mid_count;

4) Load script

[atguigu@hadoop102 bin]$ vim ads_new_log.sh

#!/bin/bash
# Define variables for easy modification
APP=gmall
hive=/opt/module/hive/bin/hive
hadoop=/opt/module/hadoop-2.7.2/bin/hadoop
# Use the date argument if given; otherwise default to yesterday
if [ -n "$1" ] ;then
  do_date=$1
else 
  do_date=`date -d "-1 day" +%F`
fi
echo "===Log date: $do_date==="
sql="
insert into table "$APP".ads_new_mid_count 
select
  create_date,
  count(*)
from "$APP".dws_new_mid_day
where create_date='$do_date'
group by create_date;"
$hive -e "$sql"
2.3 Requirement 3: User Retention Theme
2.3.1 Goal

1. The concept of user retention

(Figure: user retention concept — image not available in this copy.)

2. Requirement description

(Figure: retention requirement description — image not available in this copy.)

2.3.2 DWS layer

1. DWS layer (daily retained-user detail table)

1) Table DDL

hive (gmall)>
drop table if exists dws_user_retention_day;
create external table dws_user_retention_day 
(
  `mid_id` string COMMENT '设备唯一标识',
  `user_id` string COMMENT '用户标识', 
  `version_code` string COMMENT '程序版本号', 
  `version_name` string COMMENT '程序版本名', 
  `lang` string COMMENT '系统语言', 
  `source` string COMMENT '渠道号', 
  `os` string COMMENT '安卓系统版本', 
  `area` string COMMENT '区域', 
  `model` string COMMENT '手机型号', 
  `brand` string COMMENT '手机品牌', 
  `sdk_version` string COMMENT 'sdkVersion', 
  `gmail` string COMMENT 'gmail', 
  `height_width` string COMMENT '屏幕宽高',
  `app_time` string COMMENT '客户端日志产生时的时间',
  `network` string COMMENT '网络模式',
  `lng` string COMMENT '经度',
  `lat` string COMMENT '纬度',
  `create_date`  string comment '设备新增时间',
  `retention_day` int comment '截止当前日期留存天数'
) COMMENT '每日用户留存情况'
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dws/dws_user_retention_day/';

2) Load data (each day, compute the retention detail of users who were new 1 day earlier)

hive (gmall)>
insert overwrite table dws_user_retention_day
partition(dt="2019-02-11")
select 
  nm.mid_id,
  nm.user_id, 
  nm.version_code, 
  nm.version_name, 
  nm.lang, 
  nm.source,
  nm.os,
  nm.area,
  nm.model,
  nm.brand,
  nm.sdk_version,
  nm.gmail,
  nm.height_width,
  nm.app_time,
  nm.network,
  nm.lng,
  nm.lat,
  nm.create_date,
  1 retention_day 
from dws_uv_detail_day ud join dws_new_mid_day nm on ud.mid_id=nm.mid_id 
where ud.dt='2019-02-11' and nm.create_date=date_add('2019-02-11',-1);

3) Verify the load

hive (gmall)> select count(*) from dws_user_retention_day;

2. DWS layer (retention detail for users new 1, 2, 3, ..., n days ago)

1) Load data (each day, compute retention detail for users who were new 1, 2, and 3 days earlier)

hive (gmall)>
insert overwrite table dws_user_retention_day
partition(dt="2019-02-11")
select
  nm.mid_id,
  nm.user_id,
  nm.version_code,
  nm.version_name,
  nm.lang,
  nm.source,
  nm.os,
  nm.area,
  nm.model,
  nm.brand,
  nm.sdk_version,
  nm.gmail,
  nm.height_width,
  nm.app_time,
  nm.network,
  nm.lng,
  nm.lat,
  nm.create_date,
  1 retention_day 
from dws_uv_detail_day ud join dws_new_mid_day nm on ud.mid_id=nm.mid_id 
where ud.dt='2019-02-11' and nm.create_date=date_add('2019-02-11',-1)
union all
select 
  nm.mid_id,
  nm.user_id, 
  nm.version_code, 
  nm.version_name, 
  nm.lang, 
  nm.source, 
  nm.os, 
  nm.area, 
  nm.model, 
  nm.brand, 
  nm.sdk_version, 
  nm.gmail, 
  nm.height_width,
  nm.app_time,
  nm.network,
  nm.lng,
  nm.lat,
  nm.create_date,
  2 retention_day 
from dws_uv_detail_day ud join dws_new_mid_day nm on ud.mid_id=nm.mid_id 
where ud.dt='2019-02-11' and nm.create_date=date_add('2019-02-11',-2)
union all
select 
  nm.mid_id,
  nm.user_id,
  nm.version_code,
  nm.version_name,
  nm.lang, 
  nm.source, 
  nm.os, 
  nm.area, 
  nm.model, 
  nm.brand, 
  nm.sdk_version, 
  nm.gmail, 
  nm.height_width,
  nm.app_time,
  nm.network,
  nm.lng,
  nm.lat,
  nm.create_date,
  3 retention_day 
from dws_uv_detail_day ud join dws_new_mid_day nm on ud.mid_id=nm.mid_id 
where ud.dt='2019-02-11' and nm.create_date=date_add('2019-02-11',-3);

2) Verify the load (retention detail for users new 1, 2, and 3 days earlier)

hive (gmall)> select retention_day , count(*) from dws_user_retention_day group by retention_day;

3. union vs. union all

1) Two sample tables

tableA                 tableB
id  name  score        id  name  score
1   a     80           1   d     48
2   b     79           2   e     23
3   c     68           3   c     86

2) Query with union

select name from tableA
union
select name from tableB

Result:

name
a
d
b
e
c

3) Query with union all

select name from tableA
union all
select name from tableB

Result:

name
a
b
c
d
e
c

4) Summary

(1) union de-duplicates the combined result set, so it is slower than union all.

(2) union all does not de-duplicate the result set, so it is faster.

2.3.3 ADS layer

1. Retained-user counts

1) Table DDL

hive (gmall)>
drop table if exists ads_user_retention_day_count;
create external table ads_user_retention_day_count
(
  `create_date`    string comment '设备新增日期',
  `retention_day`   int comment '截止当前日期留存天数',
  `retention_count`  bigint comment '留存数量'
) COMMENT '每日用户留存情况'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_user_retention_day_count/';

2) Load data

hive (gmall)>
insert into table ads_user_retention_day_count 
select
  create_date,
  retention_day,
  count(*) retention_count
from dws_user_retention_day
where dt='2019-02-11' 
group by create_date,retention_day;

3) Verify the load

hive (gmall)> select * from ads_user_retention_day_count;

2. Retention rate

1) Table DDL

hive (gmall)>
drop table if exists ads_user_retention_day_rate;
create external table ads_user_retention_day_rate
(
   `stat_date`     string comment '统计日期',
   `create_date`    string comment '设备新增日期',
   `retention_day`   int comment '截止当前日期留存天数',
   `retention_count`  bigint comment '留存数量',
   `new_mid_count`   bigint comment '当日设备新增数量',
   `retention_ratio`  decimal(10,2) comment '留存率'
) COMMENT '每日用户留存情况'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_user_retention_day_rate/';

2) Load data

hive (gmall)>
insert into table ads_user_retention_day_rate
select 
  '2019-02-11', 
  ur.create_date,
  ur.retention_day, 
  ur.retention_count, 
  nc.new_mid_count,
  ur.retention_count/nc.new_mid_count*100
from ads_user_retention_day_count ur join ads_new_mid_count nc
on nc.create_date=ur.create_date;

3) Verify the load

hive (gmall)> select * from ads_user_retention_day_rate;
2.4 Preparing Additional Data

To analyze silent users, this week's returning users, churned users, users active for the last 3 consecutive weeks, and users active on 3 consecutive days within the last 7 days, we need data for 2019-02-12 and 2019-02-20.

1) Prepare data for 2019-02-12

(1) Set the log date

[atguigu@hadoop102 ~]$ dt.sh 2019-02-12

(2) Start the cluster

[atguigu@hadoop102 ~]$ cluster.sh start

(3) Generate log data

[atguigu@hadoop102 ~]$ lg.sh

(4) Load the HDFS data into the ODS layer

[atguigu@hadoop102 ~]$ ods_log.sh 2019-02-12

(5) Load the ODS data into the DWD layer

[atguigu@hadoop102 ~]$ dwd_start_log.sh 2019-02-12

[atguigu@hadoop102 ~]$ dwd_base_log.sh 2019-02-12

[atguigu@hadoop102 ~]$ dwd_event_log.sh 2019-02-12

(6) Load the DWD data into the DWS layer

[atguigu@hadoop102 ~]$ dws_uv_log.sh 2019-02-12

(7) Verify

hive (gmall)> select * from dws_uv_detail_day where dt='2019-02-12' limit 2;

2) Prepare data for 2019-02-20

(1) Set the log date

[atguigu@hadoop102 ~]$ dt.sh 2019-02-20

(2) Start the cluster

[atguigu@hadoop102 ~]$ cluster.sh start

(3) Generate log data

[atguigu@hadoop102 ~]$ lg.sh

(4) Load the HDFS data into the ODS layer

[atguigu@hadoop102 ~]$ ods_log.sh 2019-02-20

(5) Load the ODS data into the DWD layer

[atguigu@hadoop102 ~]$ dwd_start_log.sh 2019-02-20

[atguigu@hadoop102 ~]$ dwd_base_log.sh 2019-02-20

[atguigu@hadoop102 ~]$ dwd_event_log.sh 2019-02-20

(6) Load the DWD data into the DWS layer

[atguigu@hadoop102 ~]$ dws_uv_log.sh 2019-02-20

(7) Verify

hive (gmall)> select * from dws_uv_detail_day where dt='2019-02-20' limit 2;
2.5 Requirement 4: Silent Users
2.5.1 DWS layer

Use the daily active-device detail table dws_uv_detail_day as the DWS-layer source.

2.5.2 ADS layer

1) Table DDL

hive (gmall)>
drop table if exists ads_silent_count;
create external table ads_silent_count( 
  `dt` string COMMENT '统计日期',
  `silent_count` bigint COMMENT '沉默设备数'
) 
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_silent_count';

2) Load data for 2019-02-20

hive (gmall)>
insert into table ads_silent_count
select 
  '2019-02-20' dt,
  count(*) silent_count
from 
(
  select mid_id
  from dws_uv_detail_day
  where dt<='2019-02-20'
  group by mid_id
  having count(*)=1 and max(dt)<date_add('2019-02-20',-7)
) t1;

3) Verify the load

hive (gmall)> select * from ads_silent_count;
2.5.3 Script

1) Create the script in the /home/atguigu/bin directory on hadoop102

[atguigu@hadoop102 bin]$ vim ads_silent_log.sh

Write the following into the script:

#!/bin/bash
hive=/opt/module/hive/bin/hive
APP=gmall
if [ -n "$1" ];then
  do_date=$1
else
  do_date=`date -d "-1 day" +%F`
fi
echo "-----------Loading date $do_date-----------"
sql="
insert into table "$APP".ads_silent_count
select 
  '$do_date' dt,
  count(*) silent_count
from 
(
  select 
    mid_id
  from "$APP".dws_uv_detail_day
  where dt<='$do_date'
  group by mid_id
  having count(*)=1 and max(dt)<date_add('$do_date',-7)
)t1;"
$hive -e "$sql"

2) Make the script executable

[atguigu@hadoop102 bin]$ chmod 777 ads_silent_log.sh

3) Run the script

[atguigu@hadoop102 module]$ ads_silent_log.sh 2019-02-20

4) Check the results

hive (gmall)> select * from ads_silent_count;

5) Schedule

In production this typically runs daily between 00:30 and 01:00.

2.6 Requirement 5: Returning Users This Week

Returning this week = active this week - new this week - active last week

2.6.1 DWS layer

Use the weekly active-device detail table dws_uv_detail_wk and the daily new-device table dws_new_mid_day as the DWS-layer sources.

2.6.2 ADS layer

1) Table DDL

hive (gmall)>
drop table if exists ads_back_count;
create external table ads_back_count( 
  `dt` string COMMENT '统计日期',
  `wk_dt` string COMMENT '统计日期所在周',
  `wastage_count` bigint COMMENT '回流设备数'
) 
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_back_count';

2) Load data:

hive (gmall)> 
insert into table ads_back_count
select 
  '2019-02-20' dt,
  concat(date_add(next_day('2019-02-20','MO'),-7),'_',date_add(next_day('2019-02-20','MO'),-1)) wk_dt,
  count(*)
from 
(
  select t1.mid_id
  from 
  (
    select mid_id
    from dws_uv_detail_wk
    where wk_dt=concat(date_add(next_day('2019-02-20','MO'),-7),'_',date_add(next_day('2019-02-20','MO'),-1))
  )t1
  left join
  (
    select mid_id
    from dws_new_mid_day
    where create_date<=date_add(next_day('2019-02-20','MO'),-1) and create_date>=date_add(next_day('2019-02-20','MO'),-7)
  )t2
  on t1.mid_id=t2.mid_id
  left join
  (
    select mid_id
    from dws_uv_detail_wk
    where wk_dt=concat(date_add(next_day('2019-02-20','MO'),-7*2),'_',date_add(next_day('2019-02-20','MO'),-7-1))
  )t3
  on t1.mid_id=t3.mid_id
  where t2.mid_id is null and t3.mid_id is null
)t4;

3) Check the results

hive (gmall)> select * from ads_back_count;
2.6.3 Script

1) Create the script in the /home/atguigu/bin directory on hadoop102

[atguigu@hadoop102 bin]$ vim ads_back_log.sh

Write the following into the script:

#!/bin/bash
if [ -n "$1" ];then
  do_date=$1
else
  do_date=`date -d "-1 day" +%F`
fi
hive=/opt/module/hive/bin/hive
APP=gmall
echo "-----------Loading date $do_date-----------"
sql="
insert into table "$APP".ads_back_count
select 
  '$do_date' dt,
  concat(date_add(next_day('$do_date','MO'),-7),'_',date_add(next_day('$do_date','MO'),-1)) wk_dt,
  count(*)
from 
(
  select t1.mid_id
  from 
  (
    select mid_id
    from "$APP".dws_uv_detail_wk
    where wk_dt=concat(date_add(next_day('$do_date','MO'),-7),'_',date_add(next_day('$do_date','MO'),-1))
  )t1
  left join
  (
    select mid_id
    from "$APP".dws_new_mid_day
    where create_date<=date_add(next_day('$do_date','MO'),-1) and create_date>=date_add(next_day('$do_date','MO'),-7)
  )t2
  on t1.mid_id=t2.mid_id
  left join
  (
    select mid_id
    from "$APP".dws_uv_detail_wk
    where wk_dt=concat(date_add(next_day('$do_date','MO'),-7*2),'_',date_add(next_day('$do_date','MO'),-7-1))
  )t3
  on t1.mid_id=t3.mid_id
  where t2.mid_id is null and t3.mid_id is null
)t4;"
$hive -e "$sql"

2) Make the script executable

[atguigu@hadoop102 bin]$ chmod 777 ads_back_log.sh

3) Run the script

[atguigu@hadoop102 module]$ ads_back_log.sh 2019-02-20

4) Check the results

hive (gmall)> select * from ads_back_count;

5) Schedule

In production this typically runs every Monday between 00:30 and 01:00.

2.7 Requirement 6: Churned Users

Churned user: a user who has not launched the app within the last 7 days.

2.7.1 DWS layer

Use the daily active-device detail table dws_uv_detail_day as the DWS-layer source.

2.7.2 ADS layer

1) Table DDL

hive (gmall)>
drop table if exists ads_wastage_count;
create external table ads_wastage_count(
  `dt` string COMMENT '统计日期',
  `wastage_count` bigint COMMENT '流失设备数'
)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_wastage_count';

2) Load data for 2019-02-20

hive (gmall)>
insert into table ads_wastage_count
select
  '2019-02-20',
  count(*)
from 
(
  select mid_id
  from dws_uv_detail_day
  group by mid_id
  having max(dt)<=date_add('2019-02-20',-7)
)t1;
2.7.3 Script

1) Create the script in the /home/atguigu/bin directory on hadoop102

[atguigu@hadoop102 bin]$ vim ads_wastage_log.sh

Write the following into the script:

#!/bin/bash
if [ -n "$1" ];then
  do_date=$1
else
  do_date=`date -d "-1 day" +%F`
fi
hive=/opt/module/hive/bin/hive
APP=gmall
echo "-----------Loading date $do_date-----------"
sql="
insert into table "$APP".ads_wastage_count
select
  '$do_date',
  count(*)
from 
(
  select mid_id
  from "$APP".dws_uv_detail_day
  group by mid_id
  having max(dt)<=date_add('$do_date',-7)
)t1;"
$hive -e "$sql"

2) Make the script executable

[atguigu@hadoop102 bin]$ chmod 777 ads_wastage_log.sh

3) Run the script

[atguigu@hadoop102 module]$ ads_wastage_log.sh 2019-02-20

4) Check the results

hive (gmall)> select * from ads_wastage_count;

5) Schedule

In production this typically runs daily between 00:30 and 01:00.

2.8 Requirement 7: Users Active for the Last Three Consecutive Weeks
2.8.1 DWS layer

Use the weekly active-device detail table dws_uv_detail_wk as the DWS-layer source.

2.8.2 ADS layer

1) Table DDL
hive (gmall)>
drop table if exists ads_continuity_wk_count;
create external table ads_continuity_wk_count( 
  `dt` string COMMENT '统计日期,一般用结束周周日日期,如果每天计算一次,可用当天日期',
  `wk_dt` string COMMENT '持续时间',
  `continuity_count` bigint
) 
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_continuity_wk_count';

2) Load data for the week containing 2019-02-20

hive (gmall)>
insert into table ads_continuity_wk_count
select
  '2019-02-20',
  concat(date_add(next_day('2019-02-20','MO'),-7*3),'_',date_add(next_day('2019-02-20','MO'),-1)),
  count(*)
from 
(
  select mid_id
  from dws_uv_detail_wk
  where wk_dt>=concat(date_add(next_day('2019-02-20','MO'),-7*3),'_',date_add(next_day('2019-02-20','MO'),-7*2-1))
  and wk_dt<=concat(date_add(next_day('2019-02-20','MO'),-7),'_',date_add(next_day('2019-02-20','MO'),-1))
  group by mid_id
  having count(*)=3
)t1;

3) Verify

hive (gmall)> select * from ads_continuity_wk_count;
2.8.3 Script

1) Create the script in the /home/atguigu/bin directory on hadoop102

[atguigu@hadoop102 bin]$ vim ads_continuity_wk_log.sh

Write the following into the script:

#!/bin/bash
if [ -n "$1" ];then
  do_date=$1
else
  do_date=`date -d "-1 day" +%F`
fi
hive=/opt/module/hive/bin/hive
APP=gmall
echo "-----------Loading date $do_date-----------"
sql="
insert into table "$APP".ads_continuity_wk_count
select 
  '$do_date',
  concat(date_add(next_day('$do_date','MO'),-7*3),'_',date_add(next_day('$do_date','MO'),-1)),
  count(*)
from 
(
  select mid_id
  from "$APP".dws_uv_detail_wk
  where wk_dt>=concat(date_add(next_day('$do_date','MO'),-7*3),'_',date_add(next_day('$do_date','MO'),-7*2-1))
  and wk_dt<=concat(date_add(next_day('$do_date','MO'),-7),'_',date_add(next_day('$do_date','MO'),-1))
  group by mid_id
  having count(*)=3
)t1;"
$hive -e "$sql"

2) Make the script executable

[atguigu@hadoop102 bin]$ chmod 777 ads_continuity_wk_log.sh

3) Run the script

[atguigu@hadoop102 module]$ ads_continuity_wk_log.sh 2019-02-20

4) Check the results

hive (gmall)> select * from ads_continuity_wk_count;

5) Schedule

In production this typically runs every Monday between 00:30 and 01:00.

2.9 Requirement 8: Users Active on Three Consecutive Days Within the Last Seven Days

Goal: the number of users active on 3 consecutive days within the last 7 days.

2.9.1 DWS layer

Use the daily active-device detail table dws_uv_detail_day as the DWS-layer source.

2.9.2 ADS layer

1) Table DDL

hive (gmall)>
drop table if exists ads_continuity_uv_count;
create external table ads_continuity_uv_count( 
  `dt` string COMMENT '统计日期',
  `wk_dt` string COMMENT '最近7天日期',
  `continuity_count` bigint
) COMMENT '连续活跃设备数'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_continuity_uv_count';

2) SQL to load the data (the date_sub(dt, rank) trick used here is explained after step 3)

hive (gmall)>
insert into table ads_continuity_uv_count
select
  '2019-02-12',
  concat(date_add('2019-02-12',-6),'_','2019-02-12'),
  count(*)
from
(
  select mid_id
  from
  (
    select mid_id
    from
    (
      select 
        mid_id,
        date_sub(dt,rank) date_dif
      from
      (
        select 
          mid_id,
          dt,
          rank() over(partition by mid_id order by dt) rank
        from dws_uv_detail_day
        where dt>=date_add('2019-02-12',-6) and dt<='2019-02-12'
      )t1
    )t2 
    group by mid_id,date_dif
    having count(*)>=3
  )t3
  group by mid_id
)t4;

3) Verify

hive (gmall)> select * from ads_continuity_uv_count;
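Why date_sub(dt, rank) finds consecutive runs: within one device, rank() increases by 1 per active day, so on consecutive dates both dt and rank advance in step and their difference stays constant. A hypothetical device active on 02-08, 02-09, and 02-11 illustrates this:

dt          rank  date_sub(dt,rank)
2019-02-08  1     2019-02-07
2019-02-09  2     2019-02-07
2019-02-11  3     2019-02-08

Grouping by mid_id and date_dif, a group with count(*)>=3 is a run of at least 3 consecutive active days. Since dws_uv_detail_day holds one row per device per day, rank() has no ties here and behaves the same as row_number().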
2.9.3 Script

1) Create the script in the /home/atguigu/bin directory on hadoop102

[atguigu@hadoop102 bin]$ vim ads_continuity_log.sh

Write the following into the script:

#!/bin/bash
if [ -n "$1" ];then
  do_date=$1
else
  do_date=`date -d "-1 day" +%F`
fi
hive=/opt/module/hive/bin/hive
APP=gmall
echo "-----------Loading date $do_date-----------"
sql="
insert into table "$APP".ads_continuity_uv_count
select 
  '$do_date',
  concat(date_add('$do_date',-6),'_','$do_date') dt,
  count(*) 
from 
(
  select mid_id
  from
  (
    select mid_id
    from 
    (
      select
        mid_id,
        date_sub(dt,rank) date_diff
      from 
      (
        select 
          mid_id,
          dt,
          rank() over(partition by mid_id order by dt) rank
        from "$APP".dws_uv_detail_day
        where dt>=date_add('$do_date',-6) and dt<='$do_date'
      )t1
    )t2
    group by mid_id,date_diff
    having count(*)>=3
  )t3
  group by mid_id
)t4;
"
$hive -e "$sql"

2) Make the script executable

[atguigu@hadoop102 bin]$ chmod 777 ads_continuity_log.sh

3) Run the script

[atguigu@hadoop102 module]$ ads_continuity_log.sh 2019-02-12

4) Check the results

hive (gmall)> select * from ads_continuity_uv_count;

5) Schedule

In production this typically runs daily between 00:30 and 01:00.

2.10 Requirement 9: Cumulative Visits per User
2.10.1 DWS layer

1. Table DDL

hive (gmall)>
drop table if exists dws_user_total_count_day;
create external table dws_user_total_count_day( 
  `mid_id` string COMMENT '设备id',
  `subtotal` bigint COMMENT '每日登录小计'
)
partitioned by(`dt` string)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/dws/dws_user_total_count_day';

2. Load data

1) Load data

insert overwrite table dws_user_total_count_day 
partition(dt='2019-12-14')
select
  mid_id,
  count(mid_id) cm
from dwd_start_log
where dt='2019-12-14'
group by mid_id;

2) Check the results

hive (gmall)> select * from dws_user_total_count_day;

3. Load script

1) Create the script dws_user_total_count_day.sh in the /home/atguigu/bin directory

[atguigu@hadoop102 bin]$ vim dws_user_total_count_day.sh

Write the following into the script:

#!/bin/bash
# Define variables for easy modification
APP=gmall
hive=/opt/module/hive/bin/hive
hadoop=/opt/module/hadoop-2.7.2/bin/hadoop
# Use the date argument if given; otherwise default to yesterday
if [ -n "$1" ] ;then
  do_date=$1
else 
  do_date=`date -d "-1 day" +%F`
fi
echo "===Log date: $do_date==="
sql="
insert overwrite table "$APP".dws_user_total_count_day partition(dt='$do_date')
select
  mid_id,
  count(mid_id) cm
from
  "$APP".dwd_start_log
where
  dt='$do_date'
group by
  mid_id,dt;
"
$hive -e "$sql"

2) Make the script executable

[atguigu@hadoop102 bin]$ chmod 777 dws_user_total_count_day.sh

3) Run the script

[atguigu@hadoop102 module]$ dws_user_total_count_day.sh 2019-02-20

4) Check the results

hive (gmall)> select * from dws_user_total_count_day;

5) Schedule

In production this typically runs daily between 00:30 and 01:00.

2.10.2 ADS layer

1. Table DDL

drop table if exists ads_user_total_count;
create external table ads_user_total_count( 
  `mid_id` string COMMENT '设备id',
  `subtotal` bigint COMMENT '每日登录小计',
  `total` bigint COMMENT '登录次数总计'
)
partitioned by(`dt` string)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_user_total_count';

2. Load data

The full join below is explained after the statement.

insert overwrite table ads_user_total_count partition(dt='2019-10-03')
select
  if(today.mid_id is null, yesterday.mid_id, today.mid_id) mid_id,
  today.subtotal,
  if(today.subtotal is null, 0, today.subtotal) + if(yesterday.total is null, 0, yesterday.total) total
from (
  select * from dws_user_total_count_day
  where dt='2019-10-03'
) today
full join (
  select * from ads_user_total_count
  where dt=date_add('2019-10-03', -1)
) yesterday
on today.mid_id=yesterday.mid_id;
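The full join keeps three kinds of rows: devices seen both today and before (today's subtotal is added to yesterday's running total), devices seen today for the first time (yesterday.total is null, so total = subtotal), and devices not seen today (today.subtotal is null, so yesterday's total is carried forward unchanged). This is what makes each day's partition a complete running total rather than just that day's activity.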
3. Load script

1) Create the script in the /home/atguigu/bin directory on hadoop102

[atguigu@hadoop102 bin]$ vim ads_user_total_count.sh

Write the following into the script:

#!/bin/bash
db=gmall
hive=/opt/module/hive-1.2.1/bin/hive
hadoop=/opt/module/hadoop-2.7.2/bin/hadoop
if [[ -n $1 ]]; then
  do_date=$1
else
  do_date=`date -d '-1 day' +%F`
fi
sql="
use gmall;
insert overwrite table ads_user_total_count partition(dt='$do_date')
select
  if(today.mid_id is null, yesterday.mid_id, today.mid_id) mid_id,
  today.subtotal,
  if(today.subtotal is null, 0, today.subtotal) + if(yesterday.total is null, 0, yesterday.total) total
from (
  select * from dws_user_total_count_day
  where dt='$do_date'
) today
full join (
  select * from ads_user_total_count
  where dt=date_add('$do_date', -1)
) yesterday
on today.mid_id=yesterday.mid_id;
"
$hive -e "$sql"

2) Make the script executable

[atguigu@hadoop102 bin]$ chmod 777 ads_user_total_count.sh

3) Run the script

[atguigu@hadoop102 module]$ ads_user_total_count.sh 2019-02-20

4) Check the results

hive (gmall)> select * from ads_user_total_count;

5) Schedule

In production this typically runs daily between 00:30 and 01:00.

2.11 Requirement 10: New Favoriting Users

New favoriting user: a user who adds a favorite for the first time on a given day.

2.11.1 DWS layer: a wide table of user behavior from the logs

Several later requirements draw on data from multiple tables; joining them for every query would hurt performance, so we build a wide table up front to speed up those queries.

The wide table holds, per user and per product: click count, like count, and favorite count.

1. Table DDL

drop table if exists dws_user_action_wide_log;
CREATE EXTERNAL TABLE dws_user_action_wide_log(
  `mid_id` string COMMENT '设备id',
  `goodsid` string COMMENT '商品id',
  `display_count` string COMMENT '点击次数',
  `praise_count` string COMMENT '点赞次数',
  `favorite_count` string COMMENT '收藏次数')
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dws/dws_user_action_wide_log/'
TBLPROPERTIES('parquet.compression'='lzo');


2. Load data

The pattern used here is explained after the statement.

insert overwrite table dws_user_action_wide_log partition(dt='2019-12-14')
select
  mid_id,
  goodsid,
  sum(display_count) display_count,
  sum(praise_count) praise_count,
  sum(favorite_count) favorite_count
from
( select
    mid_id,
    goodsid,
    count(*) display_count,
    0 praise_count,
    0 favorite_count
  from dwd_display_log
  where dt='2019-12-14' and action=2
  group by mid_id,goodsid

  union all

  select
    mid_id,
    target_id goodsid,
    0,
    count(*) praise_count,
    0
  from dwd_praise_log
  where dt='2019-12-14'
  group by mid_id,target_id

  union all

  select
    mid_id,
    course_id goodsid,
    0,
    0,
    count(*) favorite_count
  from dwd_favorites_log
  where dt='2019-12-14'
  group by mid_id,course_id
)user_action
group by mid_id,goodsid;
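This is a standard union-all pivot: each of the three logs contributes rows with its own count in one column and 0 placeholders in the other two, and the outer sum() ... group by mid_id, goodsid collapses them into a single row per device and product carrying all three counts, without ever joining the three logs against each other.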

3. Load script

[atguigu@hadoop102 bin]$ vi dws_user_action_wide_log.sh

[atguigu@hadoop102 bin]$ chmod 777 dws_user_action_wide_log.sh

 

#!/bin/bash
db=gmall
hive=/opt/module/hive-1.2.1/bin/hive
hadoop=/opt/module/hadoop-2.7.2/bin/hadoop
if [[ -n $1 ]]; then
  do_date=$1
else
  do_date=`date -d '-1 day' +%F`
fi
sql="
use gmall;
insert overwrite table dws_user_action_wide_log partition(dt='$do_date')
select
  mid_id,
  goodsid,
  sum(display_count) display_count,
  sum(praise_count) praise_count,
  sum(favorite_count) favorite_count
from
( select
    mid_id,
    goodsid,
    count(*) display_count,
    0 praise_count,
    0 favorite_count
  from dwd_display_log
  where dt='$do_date' and action=2
  group by mid_id,goodsid

  union all

  select
    mid_id,
    target_id goodsid,
    0,
    count(*) praise_count,
    0
  from dwd_praise_log
  where dt='$do_date'
  group by mid_id,target_id

  union all

  select
    mid_id,
    course_id goodsid,
    0,
    0,
    count(*) favorite_count
  from dwd_favorites_log
  where dt='$do_date'
  group by mid_id,course_id
)user_action
group by mid_id,goodsid;
"
$hive -e "$sql"
2.11.2 DWS layer

Use the user-behavior wide table built from the logs as the DWS-layer table.

2.11.3 ADS layer

1. Table DDL

drop table if exists ads_new_favorites_mid_day;
create external table ads_new_favorites_mid_day( 
  `dt` string COMMENT '日期',
  `favorites_users` bigint COMMENT '新收藏用户数'
) 
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_new_favorites_mid_day';

2. Load data

insert into table ads_new_favorites_mid_day
select
  '2019-12-14' dt,
  count(*) favorites_users
from
(
  select mid_id
  from dws_user_action_wide_log
  where favorite_count>0
  group by mid_id
  having min(dt)='2019-12-14'
)user_favorite;

3. Load script

1) Create the script ads_new_favorites_mid_day.sh in the /home/atguigu/bin directory

[atguigu@hadoop102 bin]$ vim ads_new_favorites_mid_day.sh

Write the following into the script:
#!/bin/bash
# Define variables for easy modification
APP=gmall
hive=/opt/module/hive/bin/hive
hadoop=/opt/module/hadoop-2.7.2/bin/hadoop
# Use the date argument if given; otherwise default to yesterday
if [ -n "$1" ] ;then
  do_date=$1
else 
  do_date=`date -d "-1 day" +%F`
fi
echo "===Log date: $do_date==="
sql="
insert into table "$APP".ads_new_favorites_mid_day
select
  '$do_date' dt,
  count(*) favorites_users
from
(
  select mid_id
  from "$APP".dws_user_action_wide_log
  where favorite_count>0
  group by mid_id
  having min(dt)='$do_date'
)user_favorite;
"
$hive -e "$sql"

2) Make the script executable

[atguigu@hadoop102 bin]$ chmod 777 ads_new_favorites_mid_day.sh

3) Run the script

[atguigu@hadoop102 module]$ ads_new_favorites_mid_day.sh 2019-02-20

4) Check the results

hive (gmall)> select * from ads_new_favorites_mid_day;

5) Schedule

In production this typically runs daily between 00:30 and 01:00.

2.12 Requirement 11: Top-3 Users by Clicks for Each Product
2.12.1 DWS layer

Use the user-behavior wide table built from the logs as the DWS-layer table.

2.12.2 ADS layer

1. Table DDL

drop table if exists ads_goods_count;
create external table ads_goods_count( 
  `dt` string COMMENT '统计日期',
  `goodsid` string COMMENT '商品',
  `user_id` string COMMENT '用户',
  `goodsid_user_count` bigint COMMENT '商品用户点击次数'
) 
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_goods_count';

2. Load data

insert into table ads_goods_count
select
  '2019-10-03',
  goodsid,
  mid_id,
  sum_display_count
from(
  select
    goodsid,
    mid_id,
    sum_display_count,
    row_number() over(partition by goodsid order by sum_display_count desc) rk
  from(
    select
      goodsid,
      mid_id,
      sum(display_count) sum_display_count
    from dws_user_action_wide_log
    where display_count>0
    group by goodsid, mid_id
  ) t1
) t2
where rk <= 3;

3. Load script

1) Create the script ads_goods_count.sh in the /home/atguigu/bin directory

[atguigu@hadoop102 bin]$ vim ads_goods_count.sh

Write the following into the script:

#!/bin/bash
db=gmall
hive=/opt/module/hive-1.2.1/bin/hive
hadoop=/opt/module/hadoop-2.7.2/bin/hadoop
if [[ -n $1 ]]; then
  do_date=$1
else
  do_date=`date -d '-1 day' +%F`
fi
sql="
use gmall;
insert into table ads_goods_count
select
  '$do_date',
  goodsid,
  mid_id,
  sum_display_count
from(
  select
    goodsid,
    mid_id,
    sum_display_count,
    row_number() over(partition by goodsid order by sum_display_count desc) rk
  from(
    select
      goodsid,
      mid_id,
      sum(display_count) sum_display_count
    from dws_user_action_wide_log
    where display_count>0
    group by goodsid, mid_id
  ) t1
) t2
where rk <= 3;
"
$hive -e "$sql"

2) Make the script executable

[atguigu@hadoop102 bin]$ chmod 777 ads_goods_count.sh

3) Run the script

[atguigu@hadoop102 module]$ ads_goods_count.sh 2019-02-20

4) Check the results

hive (gmall)> select * from ads_goods_count;
2.13 Requirement 12: Daily Top-10 Products by Clicks in Each Category
2.13.1 DWS layer

Use the click log table as the DWS-layer source.

2.13.2 ADS layer

1. Table DDL

drop table if exists ads_goods_display_top10;
create external table ads_goods_display_top10 ( 
  `dt` string COMMENT '日期',
  `category` string COMMENT '品类',
  `goodsid` string COMMENT '商品id',
  `goods_count` string COMMENT '商品点击次数'
) 
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_goods_display_top10';
2. Load data

insert into table ads_goods_display_top10
select
  '2019-10-03',
  category,
  goodsid,
  count
from(
  select
    category,
    goodsid,
    count,
    rank() over(partition by category order by count desc) rk
  from(
    select
      category,
      goodsid,
      count(*) count
    from dwd_display_log
    where dt='2019-10-03' and action=2
    group by category, goodsid
  )t1
)t2
where rk<=10;

3. Load script

1) Create the script ads_goods_display_top10.sh in the /home/atguigu/bin directory

[atguigu@hadoop102 bin]$ vim ads_goods_display_top10.sh

Write the following into the script:

#!/bin/bash
db=gmall
hive=/opt/module/hive-1.2.1/bin/hive
hadoop=/opt/module/hadoop-2.7.2/bin/hadoop
if [[ -n $1 ]]; then
  do_date=$1
else
  do_date=`date -d '-1 day' +%F`
fi
sql="
use gmall;
insert into table ads_goods_display_top10
select
  '$do_date',
  category,
  goodsid,
  count
from(
  select
    category,
    goodsid,
    count,
    rank() over(partition by category order by count desc) rk
  from(
    select
      category,
      goodsid,
      count(*) count
    from dwd_display_log
    where dt='$do_date' and action=2
    group by category, goodsid
  )t1
)t2
where rk<=10
"
$hive -e "$sql"

2) Make the script executable

[atguigu@hadoop102 bin]$ chmod 777 ads_goods_display_top10.sh

3) Run the script

[atguigu@hadoop102 module]$ ads_goods_display_top10.sh 2019-02-20

4) Check the results

hive (gmall)> select * from ads_goods_display_top10;
2.14 Requirement 13: Product Clicks of the 10 Users with the Most Clicks
2.14.1 DWS layer

Use the user-behavior wide table built from the logs as the DWS-layer table.

2.14.2 ADS layer

1. Table DDL
drop table if exists ads_goods_user_count;
create external table ads_goods_user_count( 
  `dt` string COMMENT '统计日期',
  `mid_id` string COMMENT '用户id',
  `u_ct` string COMMENT '用户总点击次数',
  `goodsid` string COMMENT '商品id',
  `d_ct` string COMMENT '各个商品点击次数'
) 
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_goods_user_count';

2. Load data

insert into table ads_goods_user_count
select
  '2019-10-03',
  mid_id,
  u_ct,
  goodsid,
  d_ct
from(
  select
    mid_id,
    u_ct,
    goodsid,
    d_ct,
    row_number() over(partition by mid_id order by d_ct desc) rn
  from(
    select
      dl.mid_id,
      u_ct,
      dl.goodsid,
      count(*) d_ct
    from dwd_display_log dl join (
      select
        mid_id,
        count(*) u_ct
      from dws_user_action_wide_log
      group by mid_id
      order by u_ct desc
      limit 10
    )t1
    on dl.mid_id=t1.mid_id
    group by dl.mid_id, u_ct, dl.goodsid
  ) t2
) t3
where rn<=10;

3. Load script

1) Create the script ads_goods_user_count.sh in the /home/atguigu/bin directory

[atguigu@hadoop102 bin]$ vim ads_goods_user_count.sh

Write the following into the script:

#!/bin/bash
db=gmall
hive=/opt/module/hive-1.2.1/bin/hive
hadoop=/opt/module/hadoop-2.7.2/bin/hadoop
if [[ -n $1 ]]; then
  do_date=$1
else
  do_date=`date -d '-1 day' +%F`
fi
sql="
use gmall;
insert into table ads_goods_user_count
select
  '$do_date',
  mid_id,
  u_ct,
  goodsid,
  d_ct
from(
  select
    mid_id,
    u_ct,
    goodsid,
    d_ct,
    row_number() over(partition by mid_id order by d_ct desc) rn
  from(
    select
      dl.mid_id,
      u_ct,
      dl.goodsid,
      count(*) d_ct
    from dwd_display_log dl join (
      select
        mid_id,
        count(*) u_ct
      from dws_user_action_wide_log
      group by mid_id
      order by u_ct desc
      limit 10
    )t1
    on dl.mid_id=t1.mid_id
    group by dl.mid_id, u_ct, dl.goodsid
  ) t2
) t3
where rn<=10
"
$hive -e "$sql"

2) Make the script executable

[atguigu@hadoop102 bin]$ chmod 777 ads_goods_user_count.sh

3) Run the script

[atguigu@hadoop102 module]$ ads_goods_user_count.sh 2019-02-20

4) Check the results

hive (gmall)> select * from ads_goods_user_count;
2.15 Requirement 14: Monthly Active Rate

The ratio of monthly active users to the cumulative total of users up to that month.

2.15.1 DWS layer

Use the DWS-layer monthly active table and the ADS-layer new-user table as sources.

2.15.2 ADS layer

1. Table DDL
drop table if exists ads_mn_ratio_count;
create external table ads_mn_ratio_count( 
  `dt` string COMMENT '统计日期',
  `mn` string COMMENT '统计月活跃率的月份',
  `ratio` string COMMENT '活跃率'
) 
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_mn_ratio_count';

2. Load data

insert into table ads_mn_ratio_count
select
  '2019-10-03',
  date_format('2019-10-03','yyyy-MM'),
  mn_count/sum_user*100 mn_percent
from
(select count(*) mn_count from dws_uv_detail_mn where mn=date_format('2019-10-03','yyyy-MM')) t1,
(select sum(new_mid_count) sum_user from ads_new_mid_count) t2;

3. Load script

1) Create the script ads_mn_ratio_count.sh in the /home/atguigu/bin directory

[atguigu@hadoop102 bin]$ vim ads_mn_ratio_count.sh

Write the following into the script:

#!/bin/bash
db=gmall
hive=/opt/module/hive-1.2.1/bin/hive
hadoop=/opt/module/hadoop-2.7.2/bin/hadoop
if [[ -n $1 ]]; then
  do_date=$1
else
  do_date=`date -d '-1 day' +%F`
fi
sql="
use gmall;
insert into table ads_mn_ratio_count
select
  '$do_date',
  date_format('$do_date','yyyy-MM'),
  mn_count/sum_user*100 mn_percent
from
(select count(*) mn_count from dws_uv_detail_mn where mn=date_format('$do_date','yyyy-MM')) t1,
(select sum(new_mid_count) sum_user from ads_new_mid_count) t2;
"
$hive -e "$sql"

2) Make the script executable

[atguigu@hadoop102 bin]$ chmod 777 ads_mn_ratio_count.sh

3) Run the script

[atguigu@hadoop102 module]$ ads_mn_ratio_count.sh 2019-02-20

4) Check the results

hive (gmall)> select * from ads_mn_ratio_count;
