3 - Hive Data Warehouse Project: Complete Workflow (Online Education)

Online Education (3)

1. Visit & Consultation Dashboard - Full-Volume Analysis

1.1 Requirements Analysis

  • Requirement 1: For a given time range, count the total number of visiting customers, with drill-down to hourly data.
涉及维度 (dimensions):	time dimension - year, quarter, month, day, hour
涉及指标 (metrics):	total visits
涉及表 (tables):	web_chat_ems_2019_12 (consultation table)
涉及字段 (fields):	sid, create_time
需清洗 (cleansing):	sid:	deduplicate
需转换 (transformation):  create_time:	year, quarter, month, day, hour
  • Requirement 2: For a given time range, build a heat map of visiting customers by region, with drill-down to hourly data.

    Dimensions:	time dimension, region dimension
    Metrics:	visits
    Tables:	web_chat_ems_2019_12 (consultation table)
    Fields:	sid, area, country, province, city, create_time
    Cleansing:	sid: deduplicate
    Transformation:	create_time: year, quarter, month, day, hour
    
  • Requirement 3: For a given time range, compute, for each region (province, city), the share of visiting customers who initiate a consultation; consultation rate = consulting customers / visiting customers; a visitor who exchanges at least one message with the online agent counts as a valid consultation. (A hedged SQL sketch of this metric follows the requirement list below.)

    Dimensions:	time dimension, region dimension
    Metrics:	consultations, visits
    		Filter: msg_count >= 1
    		Formula:	consultation rate = consultations / visits
    Tables:	web_chat_ems_2019_12 (consultation table)
    Fields:	sid, msg_count, create_time, area
    Cleansing:	sid: deduplicate
    
  • Requirement 4: For a given time range, plot a dual-axis trend chart of daily visits and consultation rate, with drill-down to hourly data.

    Dimensions:	time dimension: year, quarter, month, day, hour
    Metrics:	visits, consultation rate
    Tables:	web_chat_ems_2019_12 (consultation table)
    Fields:	sid (deduplicated), msg_count, create_time
  • Requirement 5: For a given time range, count the visiting customers falling into each of the 24 hours of the day.

    X axis: hours 1-24 at one-hour intervals; Y axis: total visiting customers for that hour across the chosen time range

Dimensions:	time dimension: year, quarter, month, day, hour
Metrics:	visits
Tables:	web_chat_ems_2019_12 (consultation table)
Fields:	sid (deduplicated), create_time
Transformation:  create_time:	year, quarter, month, day, hour
  • Requirement 6: For a given time range, compute each origin channel's share of visiting customers, with drill-down to hourly data.
Dimensions:	time dimension: year, quarter, month, day, hour	origin channel
Metrics:	visits, consultations
Tables:	web_chat_ems_2019_12 (consultation table)
Fields:	sid (deduplicated), create_time, origin_channel
Transformation:  create_time:	year, quarter, month, day, hour
  • Requirement 7: For a given time range, compute each search source's share of visiting customers, with drill-down to hourly data.
Dimensions:	time dimension,	search source
Metrics:	visits
Tables:	web_chat_ems_2019_12 (consultation table)
Fields:	sid (deduplicated), create_time, seo_source
Transformation:  create_time:	year, quarter, month, day, hour
  • Requirement 8: For a given time range, rank the top N pages by visiting customers, with drill-down to hourly data.
Dimensions:	time dimension, page
Metrics:	visits
Tables:	web_chat_text_ems_2019_11 (consultation detail table)
Fields:	from_url
Missing:	time dimension field
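
The consultation-rate definition in Requirement 3 can be illustrated with a minimal sketch. This is a hedged example, not part of the project pipeline: it queries the raw consultation table directly (the production flow goes through the DWD/DWS layers built later in this document), and the date range is only illustrative.

-- Hedged sketch of Requirement 3: consultation rate per area, straight from the raw table.
SELECT
    area,
    COUNT(DISTINCT sid)                                   AS visit_total,
    COUNT(DISTINCT CASE WHEN msg_count >= 1 THEN sid END) AS consult_total,
    COUNT(DISTINCT CASE WHEN msg_count >= 1 THEN sid END)
        / COUNT(DISTINCT sid)                             AS consult_rate
FROM web_chat_ems_2019_12
WHERE create_time BETWEEN '2019-12-01 00:00:00' AND '2019-12-31 23:59:59'
GROUP BY area;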

1.2 Requirements Summary

Dimensions:
	Inherent dimension:	time dimension (create_time) [year, quarter, month, day, hour]
	Product attribute dimensions: region (area), origin channel (origin_channel), search source (seo_source), page (from_url)
Metrics:	visits, consultations
Tables:
	Fact tables: web_chat_ems_2019_12 (consultation table), web_chat_text_ems_2019_11 (consultation detail table)
	Dimension tables:	none
Fields:	time dimension (create_time), region (area), origin channel (origin_channel), search source (seo_source), page (detail table: from_url), sid
Cleansing:	sid (deduplicate)
Transformation:	create_time

1.3 Business Data Preparation (not needed in a real work environment)

  • Create a MySQL database to hold the two business tables [web_chat_ems_2019_12 (consultation table), web_chat_text_ems_2019_11 (consultation detail table)]
create database nev default character set utf8mb4 collate utf8mb4_unicode_ci;
  • Then load the business data into this database

1.4 Modeling Analysis

  • ODS layer: source data layer
Purpose:	lands the source data, kept consistent with the business database
Tables:	the fields simply mirror the fields of the business tables
	Note:	build partitioned tables, with a time field as the partition key that records when the data was extracted into the ODS layer
	Two tables:	consultation table, consultation detail table
  • DIM layer: dimension layer
Purpose:	stores dimension tables
Not needed:	this project has no dimension tables
  • DWD layer: detail layer
Purpose:	cleanses and transforms the ODS data, plus a small amount of dimension degeneration
Cleansing:	sid (deduplicate); done at analysis time
Transformation:	convert create_time into yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo
Dimension degeneration:	merge the consultation table with the consultation detail table
Table fields:	original fields + transformed fields + cleansed fields
sid,session_id,ip,create_time,area,origin_channel,seo_source,
	from_url,msg_count,yearinfo,quarterinfo,monthinfo,dayinfo,
	hourinfo,referrer,landing_page_url,url_title,
	platform_description,other_params,history
Note:
When merging tables, there are three ways to decide which fields to extract:
	* If a table has many fields, extract only the fields that are needed
	* If it is unclear which fields will be needed, extract all of them
	* If a table has only a few fields, extract all of them whether they end up being used or not
  • DWM layer: middle layer (skipped)
Purpose:	dimension degeneration, pre-aggregation
Note:
		There is no DIM dimension layer, so no dimension degeneration is needed.
		The metric field sid contains duplicates, so pre-aggregating would make the later statistics inaccurate (see the sketch after this list).
  • DWS layer: business layer
Purpose:	refine the dimensions, run the statistical analysis
Each metric gets its own result table
Visits:
	Dimensions:
		Inherent dimension:
			time dimension: year, quarter, month, day, hour
		Product attribute dimensions:
			region
			origin channel
			search source
			page
	Table fields:	metric fields + dimension fields + convention fields (time_type, groupType, time_str)
	sid_total,sessionid_total,ip_total,yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo,
	area,origin_channel,seo_source,from_url, time_type,groupType,time_str

Consultations:
	Dimensions:
		Inherent dimension:
			time dimension: year, quarter, month, day, hour
		Product attribute dimensions:
			region
			origin channel

Table fields: metric fields + dimension fields + three convention fields (time_type, groupType, time_str)
	sid_total,sessionid_total,ip_total,yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo,
	area,origin_channel, time_type,groupType,time_str

Note: when a dimension field does not apply to a row, set its value to -1 to mark the dimension as absent.
  • DA layer: data application layer
Purpose:	stores the DWS analysis results for downstream applications
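
Why the DWM layer is skipped can be seen from a small illustrative query: distinct counts do not add up across partial aggregates, so pre-aggregating sid counts per day and summing them later would over-count visitors who appear on several days. This is a hedged sketch against the DWD table defined below; it is for illustration only.

-- Compare a sum of per-day distinct counts with a true distinct count over the
-- whole table; the first figure is generally larger, which is why sid totals
-- are always computed from the detail layer.
SELECT 'sum of daily distinct sids' AS method, SUM(daily_sid) AS sid_total
FROM (
    SELECT dayinfo, COUNT(DISTINCT sid) AS daily_sid
    FROM itcast_dwd.visit_consult_dwd
    GROUP BY dayinfo
) t
UNION ALL
SELECT 'distinct sids over the whole table' AS method, COUNT(DISTINCT sid) AS sid_total
FROM itcast_dwd.visit_consult_dwd;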

1.5 Modeling

  • When creating the Hive tables, three things need to be decided (storage format, compression format, table type)

    • Storage format: ORC
    • Compression format:
      • ODS layer: Zlib
      • Other layers: Snappy
    • Table type: internal tables (note that the ODS DDL below actually declares EXTERNAL tables)
  • Hive partitioning: a mix of static and dynamic partitions (see the sketch below)

  • Summary: partitioning combines static and dynamic partitions; ODS layer: ORC + Zlib | other layers: ORC + Snappy
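
A minimal sketch of what "static + dynamic" partitioning means in practice. The demo table below is hypothetical and not part of the project schema; the year partition value is supplied statically while the month partition is resolved dynamically from the last selected column.

-- Hedged illustration of mixed static/dynamic partitioning.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

create table if not exists demo_part (sid string)
partitioned by (yearinfo string, monthinfo string);

-- yearinfo is static ('2023'); monthinfo comes from the query result.
insert into table demo_part partition (yearinfo='2023', monthinfo)
select sid, monthinfo
from (select 'abc' as sid, '01' as monthinfo) t;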

  • ODS layer:

create database if not exists itcast_ods;
-- make compression take effect on write
set hive.exec.orc.compression.strategy=COMPRESSION;

-- consultation table
CREATE EXTERNAL TABLE IF NOT EXISTS itcast_ods.web_chat_ems (
  id INT comment 'primary key',
  create_date_time STRING comment 'record creation time',
  session_id STRING comment 'Qimo sessionId',
  sid STRING comment 'visitor id',
  create_time STRING comment 'session creation time',
  seo_source STRING comment 'search source',
  seo_keywords STRING comment 'search keywords',
  ip STRING comment 'IP address',
  area STRING comment 'area',
  country STRING comment 'country',
  province STRING comment 'province',
  city STRING comment 'city',
  origin_channel STRING comment 'origin channel',
  user_match STRING comment 'assigned agent',
  manual_time STRING comment 'manual handling start time',
  begin_time STRING comment 'agent pickup time',
  end_time STRING comment 'session end time',
  last_customer_msg_time_stamp STRING comment 'time of the last customer message',
  last_agent_msg_time_stamp STRING comment 'time of the last agent reply',
  reply_msg_count INT comment 'number of agent replies',
  msg_count INT comment 'number of customer messages',
  browser_name STRING comment 'browser name',
  os_info STRING comment 'operating system')
comment 'visit session information table'
PARTITIONED BY(starts_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
location '/user/hive/warehouse/itcast_ods.db/web_chat_ems_ods'
TBLPROPERTIES ('orc.compress'='ZLIB');

-- consultation detail table
CREATE EXTERNAL TABLE IF NOT EXISTS itcast_ods.web_chat_text_ems (
  id INT COMMENT 'primary key, from MySQL',
  referrer STRING comment 'referrer page',
  from_url STRING comment 'session source page',
  landing_page_url STRING comment 'visitor landing page',
  url_title STRING comment 'consultation page title',
  platform_description STRING comment 'client platform info',
  other_params STRING comment 'extended parameters',
  history STRING comment 'visit history'
) comment 'EMS-PV test table'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'
stored as orc
location '/user/hive/warehouse/itcast_ods.db/web_chat_text_ems_ods'
TBLPROPERTIES ('orc.compress'='ZLIB');
  • DWD layer:
CREATE DATABASE IF NOT EXISTS `itcast_dwd`;

create table if not exists itcast_dwd.visit_consult_dwd(
  session_id STRING comment 'Qimo sessionId',
  sid STRING comment 'visitor id',
  create_time bigint comment 'session creation time',
  seo_source STRING comment 'search source',
  ip STRING comment 'IP address',
  area STRING comment 'area',
  msg_count int comment 'number of customer messages',
  origin_channel STRING COMMENT 'origin channel',
  referrer STRING comment 'referrer page',
  from_url STRING comment 'session source page',
  landing_page_url STRING comment 'visitor landing page',
  url_title STRING comment 'consultation page title',
  platform_description STRING comment 'client platform info',
  other_params STRING comment 'extended parameters',
  history STRING comment 'visit history',
  hourinfo string comment 'hour'
)
comment 'visit and consultation DWD table'
partitioned by(yearinfo String,quarterinfo string, monthinfo String, dayinfo string)
row format delimited fields terminated by '\t'
stored as orc
location '/user/hive/warehouse/itcast_dwd.db/visit_consult_dwd'
tblproperties ('orc.compress'='SNAPPY');
  • DWS layer:
CREATE DATABASE IF NOT EXISTS `itcast_dws`;
-- visit statistics result table
CREATE TABLE IF NOT EXISTS itcast_dws.visit_dws (
  sid_total INT COMMENT 'distinct count of sid',
  sessionid_total INT COMMENT 'distinct count of session_id',
  ip_total INT COMMENT 'distinct count of IP',
  area STRING COMMENT 'area',
  seo_source STRING COMMENT 'search source',
  origin_channel STRING COMMENT 'origin channel',
  hourinfo STRING COMMENT 'creation time, down to the hour',
  time_str STRING COMMENT 'time detail',
  from_url STRING comment 'session source page',
  groupType STRING COMMENT 'product attribute type: 1 area; 2 search source; 3 origin channel; 4 session source page; 5 total visits',
  time_type STRING COMMENT 'time aggregation type: 1 by hour; 2 by day; 3 by month; 4 by quarter; 5 by year')
comment 'EMS visitor log DWS table'
PARTITIONED BY(yearinfo STRING,quarterinfo STRING,monthinfo STRING,dayinfo STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
location '/user/hive/warehouse/itcast_dws.db/visit_dws'
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- consultation statistics result table
CREATE TABLE IF NOT EXISTS itcast_dws.consult_dws
(
  sid_total INT COMMENT 'distinct count of sid',
  sessionid_total INT COMMENT 'distinct count of session_id',
  ip_total INT COMMENT 'distinct count of IP',
  area STRING COMMENT 'area',
  origin_channel STRING COMMENT 'origin channel',
  hourinfo STRING COMMENT 'creation time, down to the hour',
  time_str STRING COMMENT 'time detail',
  groupType STRING COMMENT 'product attribute type: 1 area; 2 origin channel',
  time_type STRING COMMENT 'time aggregation type: 1 by hour; 2 by day; 3 by month; 4 by quarter; 5 by year'
)
COMMENT 'consultation DWS wide table'
PARTITIONED BY (yearinfo string,quarterinfo STRING, monthinfo STRING, dayinfo string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS ORC
LOCATION '/user/hive/warehouse/itcast_dws.db/consult_dws'
TBLPROPERTIES ('orc.compress'='SNAPPY');

1.6 Data Collection

  • Use sqoop to import the MySQL data into the Hive ODS layer

SQL statements:

-- consultation table
SELECT
id,create_date_time,session_id,sid,create_time,seo_source,seo_keywords,ip,
AREA,country,province,city,origin_channel,USER AS user_match,manual_time,begin_time,end_time,
last_customer_msg_time_stamp,last_agent_msg_time_stamp,reply_msg_count,
msg_count,browser_name,os_info, '2023-01-11' AS starts_time
FROM web_chat_ems_2019_07
-- consultation detail table
SELECT 
  *, '2023-01-11' AS start_time
FROM web_chat_text_ems_2019_07

sqoop scripts:

-- consultation table
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/<db_name> \
--username root \
--password 123456 \
--query 'SELECT
id,create_date_time,session_id,sid,create_time,seo_source,seo_keywords,ip,
AREA,country,province,city,origin_channel,USER AS user_match,manual_time,begin_time,end_time,
last_customer_msg_time_stamp,last_agent_msg_time_stamp,reply_msg_count,
msg_count,browser_name,os_info, "2023-01-11" AS starts_time
FROM web_chat_ems_2019_07 where 1=1 AND $CONDITIONS' \
--hcatalog-database <hive_db> \
--hcatalog-table <hive_table> \
-m 1
-- consultation detail table
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/<db_name> \
--username root \
--password 123456 \
--query 'SELECT
  *, "2023-01-11" AS start_time
FROM web_chat_text_ems_2019_07 where 1=1 AND $CONDITIONS' \
--hcatalog-database <hive_db> \
--hcatalog-table <hive_table> \
-m 1

Verify that the import succeeded:

1) Check how many rows there are in MySQL
	SELECT COUNT(1) FROM web_chat_ems_2019_07;       -- 211197
	SELECT COUNT(1) FROM web_chat_text_ems_2019_07;  -- 105599
2) Query the Hive tables and compare the row counts
	SELECT COUNT(1) FROM itcast_ods.web_chat_ems;       -- 211197
	SELECT COUNT(1) FROM itcast_ods.web_chat_text_ems;  -- 105599
3) Look at a sample of rows and check that the column mapping is correct
	select * from itcast_ods.web_chat_ems limit 10;
	SELECT * FROM itcast_ods.web_chat_text_ems limit 10;
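
Beyond the row-count comparison, an optional extra check (hedged, assuming the extraction date '2023-01-11' used in the sqoop queries above) is to confirm that the extraction-date partition was created and populated:

-- List the ODS partitions and count the rows in the newest one.
show partitions itcast_ods.web_chat_ems;
select count(1) from itcast_ods.web_chat_ems where starts_time = '2023-01-11';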

1.7 Data Cleansing and Transformation

  • Move the data from the ODS layer to the DWD layer
Cleansing: none
Transformation: convert the create_time date into yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo
	1. Using substr(column, start position, length) to slice the string
	2. Using the built-in functions year() | quarter() | month() | day() | hour()
  • Transformation SQL (a short comparison of the two approaches follows the query below)
select
    wce.session_id,
    wce.sid,
    create_time,  
    wce.seo_source,
    wce.ip,
    wce.area,
    wce.msg_count,
    wce.origin_channel,
    wcte.referrer,
    wcte.from_url,
    wcte.landing_page_url,
    wcte.url_title,
    wcte.platform_description,
    wcte.other_params,
    wcte.history,
    substr(wce.create_time,12,2) as hourinfo,
    substr(wce.create_time,1,4) as yearinfo, 
    quarter(wce.create_time) as quarterinfo,
    substr(wce.create_time,6,2) as monthinfo,
    substr(wce.create_time,9,2) as dayinfo
from itcast_ods.web_chat_ems wce join itcast_ods.web_chat_text_ems wcte
    on wce.id = wcte.id;
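
A quick hedged illustration of the two transformation approaches on a literal timestamp (assuming a Hive version that supports SELECT without a FROM clause; the values are only for demonstration):

-- substr() slices the raw string; the built-in date functions parse it.
select
    substr('2019-07-02 14:35:02', 1, 4)  as year_by_substr,
    year('2019-07-02 14:35:02')          as year_by_func,
    substr('2019-07-02 14:35:02', 12, 2) as hour_by_substr,
    hour('2019-07-02 14:35:02')          as hour_by_func,
    quarter('2019-07-02 14:35:02')       as quarter_by_func;
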
  • Operations in Hive:
-- dynamic partition settings
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- Hive compression
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- make compression take effect on write
set hive.exec.orc.compression.strategy=COMPRESSION;


insert into table itcast_dwd.visit_consult_dwd partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
    wce.session_id,
    wce.sid,
    unix_timestamp(wce.create_time) as create_time,  
    wce.seo_source,
    wce.ip,
    wce.area,
    wce.msg_count,
    wce.origin_channel,
    wcte.referrer,
    wcte.from_url,
    wcte.landing_page_url,
    wcte.url_title,
    wcte.platform_description,
    wcte.other_params,
    wcte.history,
    substr(wce.create_time,12,2) as hourinfo,
    substr(wce.create_time,1,4) as yearinfo, 
    quarter(wce.create_time) as quarterinfo,
    substr(wce.create_time,6,2) as monthinfo,
    substr(wce.create_time,9,2) as dayinfo
from itcast_ods.web_chat_ems wce join itcast_ods.web_chat_text_ems wcte
    on wce.id = wcte.id;

1.8 Data Analysis

  • Load the data from the DWD layer into the DWS layer
  • Visits
Visits:
	Dimensions:
		Inherent dimension:
			time dimension: year, quarter, month, day, hour
		Product attribute dimensions:
			region
			origin channel
			search source
			page
			total visits

Using time as the baseline, compute the total number of visits

-- total visits per year
-- total visits per year and quarter
-- total visits per year, quarter, and month
-- total visits per year, quarter, month, and day
-- total visits per year, quarter, month, day, and hour

Statistics SQL:

-- total visits per year
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
  count(distinct sid) as  sid_total,
  count(distinct session_id) as sessionid_total,
  count(distinct ip) as ip_total,
  '-1' as area,
  '-1' as seo_source,
  '-1' as origin_channel,
  '-1' as hourinfo,
  yearinfo as time_str,
  '-1' as from_url,
  '5' as grouptype,
  '5' as time_type,
  yearinfo,
  '-1' as quarterinfo,
  '-1' as monthinfo,
  '-1' as dayinfo
from  itcast_dwd.visit_consult_dwd
group by yearinfo;
-- total visits per year and quarter
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
  count(distinct sid) as  sid_total,
  count(distinct session_id) as sessionid_total,
  count(distinct ip) as ip_total,
  '-1' as area,
  '-1' as seo_source,
  '-1' as origin_channel,
  '-1' as hourinfo,
  concat(yearinfo,'_',quarterinfo) as time_str,
  '-1' as from_url,
  '5' as grouptype,
  '4' as time_type,
  yearinfo,
  quarterinfo,
  '-1' as monthinfo,
  '-1' as dayinfo
from  itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo;
-- total visits per year, quarter, and month
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
  count(distinct sid) as  sid_total,
  count(distinct session_id) as sessionid_total,
  count(distinct ip) as ip_total,
  '-1' as area,
  '-1' as seo_source,
  '-1' as origin_channel,
  '-1' as hourinfo,
  concat(yearinfo,'-',monthinfo) as time_str,
  '-1' as from_url,
  '5' as grouptype,
  '3' as time_type,
  yearinfo,
  quarterinfo,
  monthinfo,
  '-1' as dayinfo
from  itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo;
-- total visits per year, quarter, month, and day
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
  count(distinct sid) as  sid_total,
  count(distinct session_id) as sessionid_total,
  count(distinct ip) as ip_total,
  '-1' as area,
  '-1' as seo_source,
  '-1' as origin_channel,
  '-1' as hourinfo,
  concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
  '-1' as from_url,
  '5' as grouptype,
  '2' as time_type,
  yearinfo,
  quarterinfo,
  monthinfo,
  dayinfo
from  itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo;
-- total visits per year, quarter, month, day, and hour
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
  count(distinct sid) as  sid_total,
  count(distinct session_id) as sessionid_total,
  count(distinct ip) as ip_total,
  '-1' as area,
  '-1' as seo_source,
  '-1' as origin_channel,
  hourinfo,
  concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
  '-1' as from_url,
  '5' as grouptype,
  '1' as time_type,
  yearinfo,
  quarterinfo,
  monthinfo,
  dayinfo
from  itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo;
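
How the convention fields are consumed downstream can be shown with a hedged read-back query against the rows written above (assuming the yearly total-visit rows with grouptype '5' and time_type '5'):

-- Fetch the yearly total-visit figures by filtering on the convention fields
-- instead of on the raw dimension columns.
select time_str, sid_total
from itcast_dws.visit_dws
where grouptype = '5'   -- total visits, no product attribute
  and time_type = '5'   -- aggregated by year
  and area = '-1' and seo_source = '-1' and origin_channel = '-1' and from_url = '-1';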

Using time as the baseline, compute the number of visits per page

-- visits per page per year
-- visits per page per year and quarter
-- visits per page per year, quarter, and month
-- visits per page per year, quarter, month, and day
-- visits per page per year, quarter, month, day, and hour

Statistics SQL:

-- visits per page per year
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
  count(distinct sid) as  sid_total,
  count(distinct session_id) as sessionid_total,
  count(distinct ip) as ip_total,
  '-1' as area,
  '-1' as seo_source,
  '-1' as origin_channel,
  '-1' as hourinfo,
  yearinfo as time_str,
  from_url,
  '4' as grouptype,
  '5' as time_type,
  yearinfo,
  '-1' as quarterinfo,
  '-1' as monthinfo,
  '-1' as dayinfo
from  itcast_dwd.visit_consult_dwd
group by yearinfo,from_url;
-- visits per page per year and quarter
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
  count(distinct sid) as  sid_total,
  count(distinct session_id) as sessionid_total,
  count(distinct ip) as ip_total,
  '-1' as area,
  '-1' as seo_source,
  '-1' as origin_channel,
  '-1' as hourinfo,
  concat(yearinfo,'_',quarterinfo) as time_str,
  from_url,
  '4' as grouptype,
  '4' as time_type,
  yearinfo,
  quarterinfo,
  '-1' as monthinfo,
  '-1' as dayinfo
from  itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,from_url;
-- visits per page per year, quarter, and month
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
  count(distinct sid) as  sid_total,
  count(distinct session_id) as sessionid_total,
  count(distinct ip) as ip_total,
  '-1' as area,
  '-1' as seo_source,
  '-1' as origin_channel,
  '-1' as hourinfo,
  concat(yearinfo,'-',monthinfo) as time_str,
  from_url,
  '4' as grouptype,
  '3' as time_type,
  yearinfo,
  quarterinfo,
  monthinfo,
  '-1' as dayinfo
from  itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,from_url;
-- visits per page per year, quarter, month, and day
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
  count(distinct sid) as  sid_total,
  count(distinct session_id) as sessionid_total,
  count(distinct ip) as ip_total,
  '-1' as area,
  '-1' as seo_source,
  '-1' as origin_channel,
  '-1' as hourinfo,
  concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
  from_url,
  '4' as grouptype,
  '2' as time_type,
  yearinfo,
  quarterinfo,
  monthinfo,
  dayinfo
from  itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo,from_url;
-- visits per page per year, quarter, month, day, and hour
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
  count(distinct sid) as  sid_total,
  count(distinct session_id) as sessionid_total,
  count(distinct ip) as ip_total,
  '-1' as area,
  '-1' as seo_source,
  '-1' as origin_channel,
  hourinfo,
  concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
  from_url,
  '4' as grouptype,
  '1' as time_type,
  yearinfo,
  quarterinfo,
  monthinfo,
  dayinfo
from  itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo,from_url;
  • Consultations
Consultations
	Dimensions:
		Inherent dimension:
			time: year, quarter, month, day, hour
		Product attribute dimensions:
			region
			origin channel
			total consultations

Difference between a consultation and a visit:
	a record with msg_count >= 1 counts as a consultation


Using time as the baseline, compute the total number of consultations

-- total consultations per year
-- total consultations per year and quarter
-- total consultations per year, quarter, and month
-- total consultations per year, quarter, month, and day
-- total consultations per year, quarter, month, day, and hour

Statistics SQL:

-- total consultations per year
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as origin_channel,
    '-1' as hourinfo,
    yearinfo as time_str,
    '3' as grouptype,
    '5' as time_type,
    yearinfo,
    '-1' as quarterinfo,
    '-1' as monthinfo,
    '-1' as dayinfo
from  itcast_dwd.visit_consult_dwd where msg_count >= 1
group by yearinfo;
-- total consultations per year and quarter
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'_',quarterinfo) as time_str,
    '3' as grouptype,
    '4' as time_type,
    yearinfo,
    quarterinfo,
    '-1' as monthinfo,
    '-1' as dayinfo
from  itcast_dwd.visit_consult_dwd where msg_count >= 1
group by yearinfo,quarterinfo;
-- total consultations per year, quarter, and month
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'-',monthinfo) as time_str,
    '3' as grouptype,
    '3' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    '-1' as dayinfo
from  itcast_dwd.visit_consult_dwd where msg_count >= 1
group by yearinfo,quarterinfo,monthinfo;
-- total consultations per year, quarter, month, and day
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
    '3' as grouptype,
    '2' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
from  itcast_dwd.visit_consult_dwd where msg_count >= 1
group by yearinfo,quarterinfo,monthinfo,dayinfo;
-- total consultations per year, quarter, month, day, and hour
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as origin_channel,
    hourinfo,
    concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
    '3' as grouptype,
    '1' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
from  itcast_dwd.visit_consult_dwd where msg_count >= 1
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo;
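
With both DWS tables populated, the consultation rate from Requirements 3 and 4 can be derived by joining them. This is a hedged sketch at the yearly grain, using only the total rows written by the statements above (grouptype '5' for visits, grouptype '3' for consultations):

-- Yearly consultation rate = yearly consultations / yearly visits.
select
    v.time_str,
    c.sid_total / v.sid_total as consult_rate
from (select time_str, sid_total from itcast_dws.visit_dws
      where grouptype = '5' and time_type = '5') v
join (select time_str, sid_total from itcast_dws.consult_dws
      where grouptype = '3' and time_type = '5') c
  on v.time_str = c.time_str;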

1.9 Data Export

  • Use sqoop to export the DWS analysis results into the corresponding MySQL target tables

    • Step 1: create the target tables in MySQL
    create database scrm_bi default character set utf8mb4 collate utf8mb4_general_ci;

    -- visit result table:
    CREATE TABLE IF NOT EXISTS scrm_bi.visit_dws (
      sid_total INT COMMENT 'distinct count of sid',
      sessionid_total INT COMMENT 'distinct count of session_id',
      ip_total INT COMMENT 'distinct count of IP',
      area varchar(32) COMMENT 'area',
      seo_source varchar(32) COMMENT 'search source',
      origin_channel varchar(32) COMMENT 'origin channel',
      hourinfo varchar(32) COMMENT 'creation time, down to the hour',
      time_str varchar(32) COMMENT 'time detail',
      from_url varchar(32) comment 'session source page',
      groupType varchar(32) COMMENT 'product attribute type: 1 area; 2 search source; 3 origin channel; 4 session source page; 5 total visits',
      time_type varchar(32) COMMENT 'time aggregation type: 1 by hour; 2 by day; 3 by month; 4 by quarter; 5 by year',
      yearinfo varchar(32) COMMENT 'year',
      quarterinfo varchar(32) COMMENT 'quarter',
      monthinfo varchar(32) COMMENT 'month',
      dayinfo varchar(32) COMMENT 'day'
    )comment 'EMS visitor log dws table';

    -- consultation result table:
    CREATE TABLE IF NOT EXISTS scrm_bi.consult_dws
    (
      sid_total INT COMMENT 'distinct count of sid',
      sessionid_total INT COMMENT 'distinct count of session_id',
      ip_total INT COMMENT 'distinct count of IP',
      area varchar(32) COMMENT 'area',
      origin_channel varchar(32) COMMENT 'origin channel',
      hourinfo varchar(32) COMMENT 'creation time, down to the hour',
      time_str varchar(32) COMMENT 'time detail',
      groupType varchar(32) COMMENT 'product attribute type: 1 area; 2 origin channel',
      time_type varchar(32) COMMENT 'time aggregation type: 1 by hour; 2 by day; 3 by month; 4 by quarter; 5 by year',
      yearinfo varchar(32) COMMENT 'year',
      quarterinfo varchar(32) COMMENT 'quarter',
      monthinfo varchar(32) COMMENT 'month',
      dayinfo varchar(32) COMMENT 'day'
    )COMMENT 'consultation DWS wide table';

    • Step 2: export the data to MySQL
    -- export the visit data (the options on the connect URL avoid character-encoding problems)
    sqoop export \
    --connect "jdbc:mysql://192.168.52.150:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
    --username root \
    --password 123456 \
    --table visit_dws \
    --hcatalog-database itcast_dws \
    --hcatalog-table visit_dws \
    -m 1
    
    -- export the consultation data
    sqoop export \
    --connect "jdbc:mysql://192.168.52.150:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
    --username root \
    --password 123456 \
    --table consult_dws \
    --hcatalog-database itcast_dws \
    --hcatalog-table consult_dws \
    -m 1
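
    A quick sanity check after the export is to compare row counts, mirroring the verification step in 1.6 (hedged; run each statement on the system named in its comment):

    -- In Hive:
    select count(1) from itcast_dws.visit_dws;
    -- In MySQL:
    SELECT COUNT(1) FROM scrm_bi.visit_dws;
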
    

2. Visit & Consultation Dashboard - Incremental Analysis

2.1 Simulating Data (not needed in a real work environment)

  • Modify the existing data so that it behaves like an incremental load
  • Rewrite the create_time field of the original rows (by string substitution) and insert them back into the consultation table and the consultation detail table, which simulates newly added data

2.2 Data Collection

  • Move the new data from the MySQL business database into the Hive ODS layer

  • Only the newly added rows need to be collected; add the extraction-time field during collection

    • Because the incremental analysis runs on a fixed schedule, the incremental collection SQL is wrapped in a shell script so it can run automatically
  • Taking a daily cycle as the example, the collection SQL is as follows:

-- consultation table
SELECT
id,create_date_time,session_id,sid,create_time,seo_source,seo_keywords,ip,
AREA,country,province,city,origin_channel,USER AS user_match,manual_time,begin_time,end_time,
last_customer_msg_time_stamp,last_agent_msg_time_stamp,reply_msg_count,
msg_count,browser_name,os_info, '2023-01-12' AS starts_time
FROM web_chat_ems_2023_01
WHERE create_time BETWEEN '2023-01-11 00:00:00' AND '2023-01-11 23:59:59';
-- consultation detail table
SELECT 
	temp2.*, '2023-01-12' AS start_time
FROM (SELECT id FROM web_chat_ems_2023_01 WHERE create_time BETWEEN '2023-01-11 00:00:00' AND '2023-01-11 23:59:59') temp1
	JOIN web_chat_text_ems_2023_01 temp2 ON temp1.id = temp2.id
  • Integrate the incremental collection SQL into sqoop
-- consultation table
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/<db_name> \
--username root \
--password 123456 \
--query 'SELECT
id,create_date_time,session_id,sid,create_time,seo_source,seo_keywords,ip,
AREA,country,province,city,origin_channel,USER AS user_match,manual_time,begin_time,end_time,
last_customer_msg_time_stamp,last_agent_msg_time_stamp,reply_msg_count,
msg_count,browser_name,os_info, "2023-01-12" AS starts_time
FROM web_chat_ems_2023_01
WHERE create_time BETWEEN "2023-01-11 00:00:00" AND "2023-01-11 23:59:59" AND $CONDITIONS' \
--hcatalog-database <hive_db> \
--hcatalog-table <hive_table> \
-m 1
-- consultation detail table
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/<db_name> \
--username root \
--password 123456 \
--query 'SELECT
	temp2.*, "2023-01-12" AS start_time
FROM (SELECT id FROM web_chat_ems_2023_01 WHERE create_time BETWEEN "2023-01-11 00:00:00" AND "2023-01-11 23:59:59") temp1
	JOIN web_chat_text_ems_2023_01 temp2 ON temp1.id = temp2.id
    where 1=1 AND $CONDITIONS' \
--hcatalog-database <hive_db> \
--hcatalog-table <hive_table> \
-m 1
  • Wrap the sqoop commands in a shell script suitable for scheduled runs; the script is as follows
#!/bin/bash

export SQOOP_HOME=/usr/bin/sqoop

if [ $# == 1 ]
	then
		dateStr=$1
	else
		dateStr=`date -d '-1 day' +'%Y-%m-%d'`
fi

dateNowStr=`date +'%Y-%m-%d'`
yearStr=`date -d ${dateStr} +'%Y'`
monthStr=`date -d ${dateStr} +'%m'`

jdbcUrl='jdbc:mysql://192.168.52.150:3306/<db_name>'
userName='root'
PassWord='123456'

${SQOOP_HOME} import \
--connect ${jdbcUrl} \
--username ${userName} \
--password ${PassWord} \
--query "SELECT
id,create_date_time,session_id,sid,create_time,seo_source,seo_keywords,ip,
AREA,country,province,city,origin_channel,USER AS user_match,manual_time,begin_time,end_time,
last_customer_msg_time_stamp,last_agent_msg_time_stamp,reply_msg_count,
msg_count,browser_name,os_info, '${dateNowStr}' AS starts_time
FROM web_chat_ems_${yearStr}_${monthStr}
WHERE create_time BETWEEN '${dateStr} 00:00:00' AND '${dateStr} 23:59:59' AND \$CONDITIONS" \
--hcatalog-database <hive_db> \
--hcatalog-table <hive_table> \
-m 1

${SQOOP_HOME} import \
--connect ${jdbcUrl} \
--username ${userName} \
--password ${PassWord} \
--query "SELECT
	temp2.*, '${dateNowStr}' AS start_time
FROM (SELECT id FROM web_chat_ems_${yearStr}_${monthStr} WHERE create_time BETWEEN '${dateStr} 00:00:00' AND '${dateStr} 23:59:59') temp1
	JOIN web_chat_text_ems_${yearStr}_${monthStr} temp2 ON temp1.id = temp2.id
    where 1=1 AND \$CONDITIONS" \
--hcatalog-database <hive_db> \
--hcatalog-table <hive_table> \
-m 1
  • Register the shell script in oozie to get automated scheduling

2.3 Data Cleansing and Transformation

  • Full-volume cleansing/transformation SQL:
select
    wce.session_id,
    wce.sid,
    create_time,  
    wce.seo_source,
    wce.ip,
    wce.area,
    wce.msg_count,
    wce.origin_channel,
    wcte.referrer,
    wcte.from_url,
    wcte.landing_page_url,
    wcte.url_title,
    wcte.platform_description,
    wcte.other_params,
    wcte.history,
    substr(wce.create_time,12,2) as hourinfo,
    substr(wce.create_time,1,4) as yearinfo, 
    quarter(wce.create_time) as quarterinfo,
    substr(wce.create_time,6,2) as monthinfo,
    substr(wce.create_time,9,2) as dayinfo
from itcast_ods.web_chat_ems wce join itcast_ods.web_chat_text_ems wcte
    on wce.id = wcte.id;
  • Incremental cleansing/transformation SQL:
-- dynamic partition settings
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- Hive compression
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- make compression take effect on write
set hive.exec.orc.compression.strategy=COMPRESSION;


insert into table itcast_dwd.visit_consult_dwd partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
    wce.session_id,
    wce.sid,
    unix_timestamp(wce.create_time) as create_time,
    wce.seo_source,
    wce.ip,
    wce.area,
    wce.msg_count,
    wce.origin_channel,
    wcte.referrer,
    wcte.from_url,
    wcte.landing_page_url,
    wcte.url_title,
    wcte.platform_description,
    wcte.other_params,
    wcte.history,
    substr(wce.create_time,12,2) as hourinfo,
    substr(wce.create_time,1,4) as yearinfo, 
    quarter(wce.create_time) as quarterinfo,
    substr(wce.create_time,6,2) as monthinfo,
    substr(wce.create_time,9,2) as dayinfo
from (select * from itcast_ods.web_chat_ems where starts_time='2023-01-12') wce join (select * from itcast_ods.web_chat_text_ems where start_time='2023-01-12') wcte
    on wce.id = wcte.id;
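
A hedged spot-check that the incremental rows landed in the expected DWD partitions (using the 2023-01-11 example date from above):

-- Count the rows written for the example increment date.
select count(1)
from itcast_dwd.visit_consult_dwd
where yearinfo = '2023' and quarterinfo = '1' and monthinfo = '01' and dayinfo = '11';
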
  • Wrap the SQL in a shell script; the script content is as follows:
#!/bin/bash

export HIVE_HOME=/usr/bin/hive

if [ $# == 1 ]
	then
		dateStr=$1
	else
		dateStr=`date -d '-1 day' +'%Y-%m-%d'`
fi

sqlStr="
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
set hive.exec.orc.compression.strategy=COMPRESSION;


insert into table itcast_dwd.visit_consult_dwd partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
    wce.session_id,
    wce.sid,
    unix_timestamp(wce.create_time) as create_time,
    wce.seo_source,
    wce.ip,
    wce.area,
    wce.msg_count,
    wce.origin_channel,
    wcte.referrer,
    wcte.from_url,
    wcte.landing_page_url,
    wcte.url_title,
    wcte.platform_description,
    wcte.other_params,
    wcte.history,
    substr(wce.create_time,12,2) as hourinfo,
    substr(wce.create_time,1,4) as yearinfo, 
    quarter(wce.create_time) as quarterinfo,
    substr(wce.create_time,6,2) as monthinfo,
    substr(wce.create_time,9,2) as dayinfo
from (select * from itcast_ods.web_chat_ems where starts_time='${dateStr}') wce join (select * from itcast_ods.web_chat_text_ems where start_time='${dateStr}') wcte
    on wce.id = wcte.id;"
    
${HIVE_HOME} -e "${sqlStr}" -S
  • Configure it in oozie to run after the previous shell script

2.4 Data Analysis

In the incremental analysis (taking a daily cycle as the example),
the yearly statistics affect the current year's result rows, the quarterly statistics affect the current quarter's rows, and the monthly statistics affect the current month's rows;
the daily and hourly statistics do not affect any historical data.
Solution: drop the affected result partitions before recomputing them, as sketched below.

For example, the results for 2023-01-11 live in the partition yearinfo='2023' and quarterinfo='1' and monthinfo='01' and dayinfo='11'

Drop a partition with: alter table <table_name> drop partition(<partition_spec>);
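
Concretely, before recomputing the year-, quarter-, and month-level results for the example date, the corresponding result partitions can be dropped like this (hedged sketch for the visit result table; the day- and hour-level partitions are new and need no cleanup). The same statements appear later in the scheduled shell script.

alter table itcast_dws.visit_dws drop partition(yearinfo='2023',quarterinfo='-1',monthinfo='-1',dayinfo='-1');
alter table itcast_dws.visit_dws drop partition(yearinfo='2023',quarterinfo='1',monthinfo='-1',dayinfo='-1');
alter table itcast_dws.visit_dws drop partition(yearinfo='2023',quarterinfo='1',monthinfo='01',dayinfo='-1');
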
  • Incremental analysis SQL:
-- total visits per year
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
  count(distinct sid) as  sid_total,
  count(distinct session_id) as sessionid_total,
  count(distinct ip) as ip_total,
  '-1' as area,
  '-1' as seo_source,
  '-1' as origin_channel,
  '-1' as hourinfo,
  yearinfo as time_str,
  '-1' as from_url,
  '5' as grouptype,
  '5' as time_type,
  yearinfo,
  '-1' as quarterinfo,
  '-1' as monthinfo,
  '-1' as dayinfo
from  itcast_dwd.visit_consult_dwd where yearinfo='2023'
group by yearinfo;
-- total visits per year and quarter
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
  count(distinct sid) as  sid_total,
  count(distinct session_id) as sessionid_total,
  count(distinct ip) as ip_total,
  '-1' as area,
  '-1' as seo_source,
  '-1' as origin_channel,
  '-1' as hourinfo,
  concat(yearinfo,'_',quarterinfo) as time_str,
  '-1' as from_url,
  '5' as grouptype,
  '4' as time_type,
  yearinfo,
  quarterinfo,
  '-1' as monthinfo,
  '-1' as dayinfo
from  itcast_dwd.visit_consult_dwd where yearinfo='2023' and quarterinfo='1'
group by yearinfo,quarterinfo;
-- total visits per year, quarter, and month
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
  count(distinct sid) as  sid_total,
  count(distinct session_id) as sessionid_total,
  count(distinct ip) as ip_total,
  '-1' as area,
  '-1' as seo_source,
  '-1' as origin_channel,
  '-1' as hourinfo,
  concat(yearinfo,'-',monthinfo) as time_str,
  '-1' as from_url,
  '5' as grouptype,
  '3' as time_type,
  yearinfo,
  quarterinfo,
  monthinfo,
  '-1' as dayinfo
from  itcast_dwd.visit_consult_dwd where yearinfo='2023' and quarterinfo='1' and monthinfo='01'
group by yearinfo,quarterinfo,monthinfo;
-- total visits per year, quarter, month, and day
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
  count(distinct sid) as  sid_total,
  count(distinct session_id) as sessionid_total,
  count(distinct ip) as ip_total,
  '-1' as area,
  '-1' as seo_source,
  '-1' as origin_channel,
  '-1' as hourinfo,
  concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
  '-1' as from_url,
  '5' as grouptype,
  '2' as time_type,
  yearinfo,
  quarterinfo,
  monthinfo,
  dayinfo
from  itcast_dwd.visit_consult_dwd where yearinfo='2023' and quarterinfo='1' and monthinfo='01' and dayinfo='11'
group by yearinfo,quarterinfo,monthinfo,dayinfo;
-- total visits per year, quarter, month, day, and hour
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
  count(distinct sid) as  sid_total,
  count(distinct session_id) as sessionid_total,
  count(distinct ip) as ip_total,
  '-1' as area,
  '-1' as seo_source,
  '-1' as origin_channel,
  hourinfo,
  concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
  '-1' as from_url,
  '5' as grouptype,
  '1' as time_type,
  yearinfo,
  quarterinfo,
  monthinfo,
  dayinfo
from  itcast_dwd.visit_consult_dwd where yearinfo='2023' and quarterinfo='1' and monthinfo='01' and dayinfo='11'
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo;


-- consultations
-- consultations per area per year
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    area,
    '-1' as origin_channel,
    '-1' as hourinfo,
    yearinfo as time_str,
    '1' as grouptype,
    '5' as time_type,
    yearinfo,
    '-1' as quarterinfo,
    '-1' as monthinfo,
    '-1' as dayinfo
from  itcast_dwd.visit_consult_dwd where msg_count >= 1 and yearinfo='2023'
group by yearinfo,area;
-- consultations per area per year and quarter
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    area,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'_',quarterinfo) as time_str,
    '1' as grouptype,
    '4' as time_type,
    yearinfo,
    quarterinfo,
    '-1' as monthinfo,
    '-1' as dayinfo
from  itcast_dwd.visit_consult_dwd where msg_count >= 1 and yearinfo='2023' and quarterinfo='1'
group by yearinfo,quarterinfo,area;
-- consultations per area per year, quarter, and month
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    area,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'-',monthinfo) as time_str,
    '1' as grouptype,
    '3' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    '-1' as dayinfo
from  itcast_dwd.visit_consult_dwd where msg_count >= 1 and yearinfo='2023' and quarterinfo='1' and monthinfo='01'
group by yearinfo,quarterinfo,monthinfo,area;
-- consultations per area per year, quarter, month, and day
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    area,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
    '1' as grouptype,
    '2' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
from  itcast_dwd.visit_consult_dwd where msg_count >= 1 and yearinfo='2023' and quarterinfo='1' and monthinfo='01' and dayinfo='11'
group by yearinfo,quarterinfo,monthinfo,dayinfo,area;
-- consultations per area per year, quarter, month, day, and hour
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    area,
    '-1' as origin_channel,
    hourinfo,
    concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
    '1' as grouptype,
    '1' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
from  itcast_dwd.visit_consult_dwd where msg_count >= 1 and yearinfo='2023' and quarterinfo='1' and monthinfo='01' and dayinfo='11'
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo,area;
  • Wrap the incremental analysis SQL in a shell script
#!/bin/bash

export HIVE_HOME=/usr/bin/hive

if [ $# == 1 ]
	then
		dateStr=$1
	else
		dateStr=`date -d '-1 day' +'%Y-%m-%d'`
fi

yearStr=`date -d ${dateStr} +'%Y'`
monthStr=`date -d ${dateStr} +'%m'`
month_for_quarter=`date -d ${dateStr} +'%-m'`
quarterStr=$((($month_for_quarter-1)/3+1))
dayStr=`date -d ${dateStr} +'%d'`

sqlStr="
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
set hive.exec.orc.compression.strategy=COMPRESSION;

alter table itcast_dws.visit_dws drop partition(yearinfo='${yearStr}',quarterinfo='-1',monthinfo='-1',dayinfo='-1');
alter table itcast_dws.visit_dws drop partition(yearinfo='${yearStr}',quarterinfo='${quarterStr}',monthinfo='-1',dayinfo='-1');
alter table itcast_dws.visit_dws drop partition(yearinfo='${yearStr}',quarterinfo='${quarterStr}',monthinfo='${monthStr}',dayinfo='-1');

alter table itcast_dws.consult_dws drop partition(yearinfo='${yearStr}',quarterinfo='-1',monthinfo='-1',dayinfo='-1');
alter table itcast_dws.consult_dws drop partition(yearinfo='${yearStr}',quarterinfo='${quarterStr}',monthinfo='-1',dayinfo='-1');
alter table itcast_dws.consult_dws drop partition(yearinfo='${yearStr}',quarterinfo='${quarterStr}',monthinfo='${monthStr}',dayinfo='-1');

insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
  count(distinct sid) as  sid_total,
  count(distinct session_id) as sessionid_total,
  count(distinct ip) as ip_total,
  '-1' as area,
  '-1' as seo_source,
  '-1' as origin_channel,
  '-1' as hourinfo,
  yearinfo as time_str,
  '-1' as from_url,
  '5' as grouptype,
  '5' as time_type,
  yearinfo,
  '-1' as quarterinfo,
  '-1' as monthinfo,
  '-1' as dayinfo
from  itcast_dwd.visit_consult_dwd where yearinfo='${yearStr}'
group by yearinfo;

insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
  count(distinct sid) as  sid_total,
  count(distinct session_id) as sessionid_total,
  count(distinct ip) as ip_total,
  '-1' as area,
  '-1' as seo_source,
  '-1' as origin_channel,
  '-1' as hourinfo,
  concat(yearinfo,'_',quarterinfo) as time_str,
  '-1' as from_url,
  '5' as grouptype,
  '4' as time_type,
  yearinfo,
  quarterinfo,
  '-1' as monthinfo,
  '-1' as dayinfo
from  itcast_dwd.visit_consult_dwd where yearinfo='${yearStr}' and quarterinfo='${quarterStr}'
group by yearinfo,quarterinfo;

insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
  count(distinct sid) as  sid_total,
  count(distinct session_id) as sessionid_total,
  count(distinct ip) as ip_total,
  '-1' as area,
  '-1' as seo_source,
  '-1' as origin_channel,
  '-1' as hourinfo,
  concat(yearinfo,'-',monthinfo) as time_str,
  '-1' as from_url,
  '5' as grouptype,
  '3' as time_type,
  yearinfo,
  quarterinfo,
  monthinfo,
  '-1' as dayinfo
from  itcast_dwd.visit_consult_dwd where  yearinfo='${yearStr}' and quarterinfo='${quarterStr}' and monthinfo='${monthStr}'
group by yearinfo,quarterinfo,monthinfo;

insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
  count(distinct sid) as  sid_total,
  count(distinct session_id) as sessionid_total,
  count(distinct ip) as ip_total,
  '-1' as area,
  '-1' as seo_source,
  '-1' as origin_channel,
  '-1' as hourinfo,
  concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
  '-1' as from_url,
  '5' as grouptype,
  '2' as time_type,
  yearinfo,
  quarterinfo,
  monthinfo,
  dayinfo
from  itcast_dwd.visit_consult_dwd where  yearinfo='${yearStr}' and quarterinfo='${quarterStr}' and monthinfo='${monthStr}' and dayinfo='${dayStr}'
group by yearinfo,quarterinfo,monthinfo,dayinfo;

insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
  count(distinct sid) as  sid_total,
  count(distinct session_id) as sessionid_total,
  count(distinct ip) as ip_total,
  '-1' as area,
  '-1' as seo_source,
  '-1' as origin_channel,
  hourinfo,
  concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
  '-1' as from_url,
  '5' as grouptype,
  '1' as time_type,
  yearinfo,
  quarterinfo,
  monthinfo,
  dayinfo
from  itcast_dwd.visit_consult_dwd where  yearinfo='${yearStr}' and quarterinfo='${quarterStr}' and monthinfo='${monthStr}' and dayinfo='${dayStr}'
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo;


insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    area,
    '-1' as origin_channel,
    '-1' as hourinfo,
    yearinfo as time_str,
    '1' as grouptype,
    '5' as time_type,
    yearinfo,
    '-1' as quarterinfo,
    '-1' as monthinfo,
    '-1' as dayinfo
from  itcast_dwd.visit_consult_dwd where msg_count >= 1 and yearinfo='${yearStr}'
group by yearinfo,area;

insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    area,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'_',quarterinfo) as time_str,
    '1' as grouptype,
    '4' as time_type,
    yearinfo,
    quarterinfo,
    '-1' as monthinfo,
    '-1' as dayinfo
from  itcast_dwd.visit_consult_dwd where msg_count >= 1 and yearinfo='${yearStr}' and quarterinfo='${quarterStr}'
group by yearinfo,quarterinfo,area;

insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    area,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'-',monthinfo) as time_str,
    '1' as grouptype,
    '3' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    '-1' as dayinfo
from  itcast_dwd.visit_consult_dwd where msg_count >= 1 and yearinfo='${yearStr}' and quarterinfo='${quarterStr}' and monthinfo='${monthStr}'
group by yearinfo,quarterinfo,monthinfo,area;

insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    area,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
    '1' as grouptype,
    '2' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
from  itcast_dwd.visit_consult_dwd where msg_count >= 1 and yearinfo='${yearStr}' and quarterinfo='${quarterStr}' and monthinfo='${monthStr}' and dayinfo='${dayStr}'
group by yearinfo,quarterinfo,monthinfo,dayinfo,area;

insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select  
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    area,
    '-1' as origin_channel,
    hourinfo,
    concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
    '1' as grouptype,
    '1' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
from  itcast_dwd.visit_consult_dwd where msg_count >= 1 and yearinfo='${yearStr}' and quarterinfo='${quarterStr}' and monthinfo='${monthStr}' and dayinfo='${dayStr}'
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo,area;

"
${HIVE_HOME} -e "${sqlStr}" -S
  • Finally, register the script in oozie

2.5 Data Export

  • During export, the current year's, quarter's, and month's result rows already in MySQL also have to be deleted before re-importing
  • A simple approach is to delete all of the current year's result rows and then re-export everything for that year
#!/bin/bash

export SQOOP_HOME=/usr/bin/sqoop

if [ $# == 1 ]
	then
		TD_DATE=$1
	else
		TD_DATE=`date -d '-1 day' +'%Y-%m-%d'`
fi
TD_YEAR=`date -d ${TD_DATE} +%Y`

mysql -uroot -p123456 -h192.168.52.150 -P3306 -e "delete from scrm_bi.visit_dws where yearinfo='$TD_YEAR'; delete from scrm_bi.consult_dws where yearinfo='$TD_YEAR';"

jdbcUrl='jdbc:mysql://192.168.52.150:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8'
username='root'
password='123456'

${SQOOP_HOME} export \
--connect ${jdbcUrl} \
--username ${username} \
--password ${password} \
--table <mysql_table> \
--hcatalog-database <hive_db> \
--hcatalog-table <hive_table> \
--hcatalog-partition-keys yearinfo \
--hcatalog-partition-values $TD_YEAR \
-m 1

${SQOOP_HOME} export \
--connect ${jdbcUrl} \
--username ${username} \
--password ${password} \
--table <mysql_table> \
--hcatalog-database <hive_db> \
--hcatalog-table <hive_table> \
--hcatalog-partition-keys yearinfo \
--hcatalog-partition-values $TD_YEAR \
-m 1

  • Register the script in oozie so the scheduled task runs automatically