四、访问咨询主题看板_全量流程
5.hive的基础优化
(1)HDFS的副本数量:
a.默认情况下HDFS的副本有三个;
b.实际生产环境中一般HDFS副本也是3个;
c.如果数据不是特别的重要,也可以设置为只有2个副本;
d.如果使用hadoop3.x以上的版本,支持设置副本数量为1.5,其中0.5不是指的存储了一半的数据,而是采用纠删码;来存储这一份的信息,而纠删码只占用数据的一半。
配置副本数量:可以直接在CM上HDFS的配置目录下来配置
(2)yarn的基础配置:
yarn:用于资源的分配(资源:内存,cpu);其中nodemanager用于出内存和CPU;datanode用于出磁盘
cpu的配置:
a. 注意:每一个nodemanager 会向resourcemanager报告自己当前节点有多少核心数;默认是8核;yarn不会自动校验每一个节点有多少核cpu;
b. 推荐调整配置:当前节点有多少核,就要向resourcemanager汇报多少核
c. 查看当前节点核数:
方式一:通过CM的主机目录来查看每一个节点有多少核;
方式二:通过命令的方式来查看--->grep 'processor' /proc/cpuinfo | sort -u | wc -l
d.在yarn中配置各个节点的核心数:直接在cm的yarn的配置目录下搜索:yarn.nodemanager.resource.cpu-vcores
内存配置:
a.注意:每一个nodemanager 会向resourcemanager报告自己当前节点有多少内存;默认是8GB;yarn不会自动校验每一个节点有多少内存;
b.推荐配置:剩余内存 * 80%
c.查看当前节点内存剩余量:
方式一:通过CM的主机目录来查看每一个节点有多少剩余内存
方式二:通过命令方式查看--->free -m
d.配置各个节点内存:直接在cm的yarn的配置目录下搜索:
yarn.nodemanager.resource.memory-mb
yarn.scheduler.maximum-allocation-mb:与第一个保持一致
yarn.app.mapreduce.am.command-opts:略小于第一个配置的值(0.9)
注意:要同时设置yarn.scheduler.maximum-allocation-mb为一样的值,yarn.app.mapreduce.am.command-opts(JVM内存)的值要同步修改为略小的值(-Xmx1024m)。
yarn本地目录的配置:
配置项:yarn.nodemanager.local-dirs
目的:yarn在运行过程中会产生一些临时文件,这些临时文件存储的位置有本地目录配置来决定
推荐配置:当前服务器挂载了几块磁盘,就需要配置几个目录
查看每一个磁盘挂载到了linux的什么目录下:df -h 查看对应磁盘挂载点即可
(3)MapReduce基础配置:
mapreduce.map.memory.mb:在运行MR时,一个mapTask需要占用多大的内存
mapreduce.map.java.opts:在运行MR时,一个mapTask对应jvm需要占用多大内存
mapreduce.reduce.memory.mb:在运行MR时,一个reduceTask需要占用多大的内存
mapreduce.reduce.java.opts:在运行MR时,一个reduceTask对应jvm需要占用多大内存
注意:jvm的内存配置要略小于对应内存;所有的内存配置大小不要超过nodemanager的内存大小;一般不做任何修改,默认即可
(4)hive的基础配置:
hiveserver2的内存大小配置:配置项:HiveServer2的Java堆栈大小(字节)
说明:如果这个配置比较少,在执行SQL的时候,可能会出现hiveserver2宕机的情况,此时需要条件hiveserver2的内存大小,调整后重启
动态生成分区的线程数:配置:hive.load.dynamic.partitions.thread;默认值:15
说明:在执行动态分区的时候,最多允许多少个线程来运行动态分区操作,线程越多,执行效率越高,但是占用资源也越大;
推荐:先采用默认,如果动态分区执行慢,而且还有剩余资源,可以尝试调大
调整位置:直接在cm的hive的配置目录下调整
监听输入文件的线程数量:配置项:hive.exec.input.listing.max.threads;默认值:15
说明:在运行SQL的时候,可以使用多少个线程来读取HDFS上数据,线程数量越多,读取效率越高,占用资源越大
推荐:先采用默认,如果读取数据执行慢,而且还有剩余资源,可以尝试调大
(5)hive压缩的配置:
map中间结果的压缩配置:
hive.exec.compress.intermediate:是否开启hive对中间结果的压缩--->建议在hive的会话窗口配置
以下两个建议直接在cm的yarn的配置目录下直接配置
mapreduce.map.output.compress:是否开启map阶段的压缩
mapreduce.map.output.compress.codec:选择什么压缩方案
推荐配置:org.apache.hadoop.io.compress.SnappyCodec
reduce最终结果压缩配置:
hive.exec.compress.output:是否开启对hive最终结果的压缩--->建议在hive的会话窗口配置
以下两个建议直接在cm的yarn的配置目录下直接配置
mapreduce.output.fileoutputformat.compress:是否开启reduce端的压缩配置
mapreduce.output.fileoutputformat.compress.codec:选择什么压缩方案
推荐配置:org.apache.hadoop.io.compress.SnappyCodec
mapreduce.output.fileoutputformat.compress.type:压缩方案
推荐配置:BLOCK
说明:如果hive上没有开启压缩,即使配置MR的压缩,那么也不会生效
(6)hive执行引擎的切换:配置:hive.execution.engine
6.数据采集
(1)目的:将业务端的数据导入到ODS层对应表中
业务端数据:mysql
ODS层表:hive
--->使用apache sqoop将mysql的数据灌入到hive中
(2)导入数据的SQL语句:
-- 访问咨询主表
select id,create_date_time,session_id,sid,create_time,seo_source,seo_keywords,ip,
area,country,province,city,origin_channel,user as user_match,manual_time,begin_time,
end_time,last_customer_msg_time_stamp,last_agent_msg_time_stamp,reply_msg_count,
msg_count,browser_name,os_info,'2023-12-17' as starts_time
from web_chat_ems_2019_07;
-- 访问咨询附属表
select
*,'2023-12-17' as start_time
from web_chat_text_ems_2019_07;
执行sqoop脚本,完成数据采集
-- 访问咨询主表
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/nev \
--username root \
--password 123456 \
--query 'select id,create_date_time,session_id,sid,create_time,seo_source,seo_keywords,ip,
area,country,province,city,origin_channel,user as user_match,manual_time,begin_time,
end_time,last_customer_msg_time_stamp,last_agent_msg_time_stamp,reply_msg_count,
msg_count,browser_name,os_info,"2023-12-17" as starts_time
from web_chat_ems_2019_07 where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table web_chat_ems \
-m 1
-- 访问咨询附属表
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/nev \
--username root \
--password 123456 \
--query 'select
*,"2023-12-17" as start_time
from web_chat_text_ems_2019_07 where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table web_chat_text_ems \
-m 1
校验数据是否导入成功:
(1)查看mysql共计多少条数据:
select count(1) from web_chat_ems_2019_07;211197
select count(1) from web_chat_text_ems_2019_07;105599
(2)到hive中对表查询一下一共多少条数据:
select count(1) from itcast_ods.web_chat_ems;211197
select count(1) from itcast_ods.web_chat_text_ems;105599
(3)查询其中一部分数据,观察数据映射是否可以
select * from itcast_ods.web_chat_ems limit 10;
select * from itcast_ods.web_chat_text_ems limit 10;
7.数据清洗转换
目的:将ODS层数据导入到DWD层
(1)DWD层作用:1)清洗转换操作 2)少量维度退化操作
清洗操作:不需要
转换操作:将create_time日期转化为yearinfo quarterinfo monthinfo dayinfo hourinfo
额外加一个转换:将create_time日期数据转换为时间戳
维度退化操作:需要,将两个事实表合并成为一个事实表
SQL实现:还未完成转化操作
select
wce.session_id,
wce.sid,
wce.create_time,-- 此处需要转换:将字符串日期转换成时间戳
wce.seo_source,
wce.ip,
wce.area,
wce.msg_count,
wce.origin_channel,
wcte.referrer,
wcte.from_url,
wcte.landing_page_url,
wcte.url_title,
wcte.platform_description,
wcte.other_params,
wcte.history,
wce.create_time as hourinfo, -- 此处需要转换
wce.create_time as yearinfo, -- 此处需要转换
wce.create_time as quarterinfo, -- 此处需要转换
wce.create_time as monthinfo, -- 此处需要转换
wce.create_time as dayinfo -- 此处需要转换
from itcast_ods.web_chat_ems wce join itcast_ods.web_chat_text_ems wcte
on wce.id=wcte.id;
思考:如何进行转换操作:
转化1:将create_time转换为int 类型的数据(将其转化为时间戳)
方案: 日期转时间戳的函数:unix_timestamp(string date,string pattern)
案例:select unix_timestamp('2019-07-01 23:45:00','yyyy-MM-dd HH:mm:ss');
转化2:将create_time转换为yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo:
方案一:通过year() quarter() month() day() hour()
-- 案例:
select year('2019-07-01 23:45:00');-- 2019
select quarter('2019-07-01 23:45:00');-- 3
select month('2019-07-01 23:45:00');-- 7
select day('2019-07-01 23:45:00');-- 1
select hour('2019-07-01 23:45:00');-- 23
方案二:通过字符串的截取操作substr('字符串',从第几个截取,截取多少个)
select substr('2019-07-01 23:45:00',1,4); -- 2019
select substr('2019-07-01 23:45:00',6,2);-- 07
select substr('2019-07-01 23:45:00',9,2);-- 01
select substr('2019-07-01 23:45:00',12,2);-- 23
实现最终转换的SQL:
--动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
insert into table itcast_dwd.visit_consult_dwd partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
wce.session_id,
wce.sid,
unix_timestamp(wce.create_time) as create_time,
wce.seo_source,
wce.ip,
wce.area,
wce.msg_count,
wce.origin_channel,
wcte.referrer,
wcte.from_url,
wcte.landing_page_url,
wcte.url_title,
wcte.platform_description,
wcte.other_params,
wcte.history,
substr(wce.create_time,12,2) as hourinfo,
substr(wce.create_time,1,4) as yearinfo,
quarter(wce.create_time) as quarterinfo,
substr(wce.create_time,6,2) as monthinfo,
substr(wce.create_time,9,2) as dayinfo
from itcast_ods.web_chat_ems wce join itcast_ods.web_chat_text_ems wcte
on wce.id=wcte.id;
可能出现以下错误:
注意:在执行转换操作的时候,由于需要进行两表联查操作,其中一个表数据量比较少,此时hive会对其优化,采用map join的方案进行处理,而map join需要将小表的数据加载到内存中,但是内存不足,导致出现内存溢出错误,此错误可能会出现两个信息:第一个信息:return code 1;第二个信息:return code -137 (等待一会会爆出来)
解决方案:关闭掉map join 让其采用reduce join即可。set hive.auto.convert.join=false;
8.数据分析
目的:将DWD层数据灌入到DWS层
DWS层作用:细化维度统计操作
8.1访问量:
固有维度:时间维度:年 季度 月 天 小时
产品属性维度:地区维度,来源渠道,搜索来源,受访页面,总访问量
8.1.1以时间为基准,统计总访问量
-- 统计每年的总访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
yearinfo as time_str,
'-1' as from_url,
'5' as grouptype,
'5' as time_type,
yearinfo,
'-1' as quarterinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo;
-- 统计每年每季度的总访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'_',quarterinfo) as time_str,
'-1' as from_url,
'5' as grouptype,
'4' as time_type,
yearinfo,
quarterinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo;
-- 统计每年每季度每月的总访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'-',monthinfo) as time_str,
'-1' as from_url,
'5' as grouptype,
'3' as time_type,
yearinfo,
quarterinfo,
monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo;
-- 统计每年每季度每月每天的总访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
'-1' as from_url,
'5' as grouptype,
'2' as time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo;
-- 统计每年每季度每月每天每小时的总访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
hourinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
'-1' as from_url,
'5' as grouptype,
'1' as time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo;
8.1.2基于时间统计各个受访页面的访问量
-- 统计每年各个受访页面的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
yearinfo as time_str,
from_url,
'4' as grouptype,
'5' as time_type,
yearinfo,
'-1' as quarterinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,from_url;
-- 统计每年每季度各个受访页面的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'_',quarterinfo) as time_str,
from_url,
'4' as grouptype,
'4' as time_type,
yearinfo,
quarterinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,from_url;
-- 统计每年每季度每月各个受访页面的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'-',monthinfo) as time_str,
from_url,
'4' as grouptype,
'3' as time_type,
yearinfo,
quarterinfo,
monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,from_url;
-- 统计每年每季度每月每天各个受访页面的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
from_url,
'4' as grouptype,
'2' as time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo,from_url;
-- 统计每年每季度每月每天每小时各个受访页面的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
hourinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
from_url,
'4' as grouptype,
'1' as time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo,from_url;
8.1.3基于时间统计各个来源渠道的访问量
-- 统计每年各个来源渠道的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
origin_channel,
'-1' as hourinfo,
yearinfo as time_str,
'-1' as from_url,
'3' as grouptype,
'5' as time_type,
yearinfo,
'-1' as quarterinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,origin_channel;
-- 统计每年每季度各个来源渠道的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
origin_channel,
'-1' as hourinfo,
concat(yearinfo,'_',quarterinfo) as time_str,
'-1' as from_url,
'3' as grouptype,
'4' as time_type,
yearinfo,
quarterinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,origin_channel;
-- 统计每年每季度每月各个来源渠道的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
origin_channel,
'-1' as hourinfo,
concat(yearinfo,'-',monthinfo) as time_str,
'-1' as from_url,
'3' as grouptype,
'3' as time_type,
yearinfo,
quarterinfo,
monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,origin_channel;
-- 统计每年每季度每月每天各个来源渠道的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
origin_channel,
'-1' as hourinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
'-1' as from_url,
'3' as grouptype,
'2' as time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo,origin_channel;
-- 统计每年每季度每月每天每小时各个来源渠道的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
origin_channel,
hourinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
'-1' as from_url,
'3' as grouptype,
'1' as time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo,origin_channel;
8.2咨询量:
固有维度:时间:年 季度 月 天 小时
产品属性维度:地区,来源渠道,总咨询量
8.2.1基于时间统计总咨询量
-- 统计每年的总咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as origin_channel,
'-1' as hourinfo,
yearinfo as time_str,
'3' as grouptype,
'5' as time_type,
yearinfo,
'-1' as quarterinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo;
-- 统计每年每季度的总咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'_',quarterinfo) as time_str,
'3' as grouptype,
'4' as time_type,
yearinfo,
quarterinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo,quarterinfo;
-- 统计每年每季度每月的总咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'-',monthinfo) as time_str,
'3' as grouptype,
'3' as time_type,
yearinfo,
quarterinfo,
monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo,quarterinfo,monthinfo;
-- 统计每年每季度每月每天的总咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
'3' as grouptype,
'2' as time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo,quarterinfo,monthinfo,dayinfo;
-- 统计每年每季度每月每天每小时的总咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as origin_channel,
hourinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
'3' as grouptype,
'1' as time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo;
8.2.2基于时间统计各个地区的咨询量
-- 统计每年各个地区的咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
area,
'-1' as origin_channel,
'-1' as hourinfo,
yearinfo as time_str,
'1' as grouptype,
'5' as time_type,
yearinfo,
'-1' as quarterinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo,area;
-- 统计每年每季度各个地区的咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
area,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'_',quarterinfo) as time_str,
'1' as grouptype,
'4' as time_type,
yearinfo,
quarterinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo,quarterinfo,area;
-- 统计每年每季度每月各个地区的咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
area,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'-',monthinfo) as time_str,
'1' as grouptype,
'3' as time_type,
yearinfo,
quarterinfo,
monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo,quarterinfo,monthinfo,area;
-- 统计每年每季度每月每天各个地区的咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
area,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
'1' as grouptype,
'2' as time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo,quarterinfo,monthinfo,dayinfo,area;
-- 统计每年每季度每月每天每小时各个地区的咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
area,
'-1' as origin_channel,
hourinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
'1' as grouptype,
'1' as time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo,area;
9.数据导出
目的:从hive的DWS层将数据导出到mysql中对应目标表中
技术:apache sqoop
第一步:在mysql中创建目标表
create database scrm_bi default character set utf8mb4 collate utf8mb4_general_ci;
-- 访问量的结果表
CREATE TABLE IF NOT EXISTS scrm_bi.visit_dws (
sid_total INT COMMENT '根据sid去重求count',
sessionid_total INT COMMENT '根据sessionid去重求count',
ip_total INT COMMENT '根据IP去重求count',
area varchar(32) COMMENT '区域信息',
seo_source varchar(32) COMMENT '搜索来源',
origin_channel varchar(32) COMMENT '来源渠道',
hourinfo varchar(32) COMMENT '创建时间,统计至小时',
time_str varchar(32) COMMENT '时间明细',
from_url varchar(32) comment '会话来源页面',
groupType varchar(32) COMMENT '产品属性类型:1.地区;2.搜索来源;3.来源渠道;4.会话来源页面;5.总访问量',
time_type varchar(32) COMMENT '时间聚合类型:1、按小时聚合;2、按天聚合;3、按月聚合;4、按季度聚合;5、按年聚合;',
yearinfo varchar(32) COMMENT '年',
quarterinfo varchar(32) COMMENT '季度',
monthinfo varchar(32) COMMENT '月',
dayinfo varchar(32) COMMENT '天'
) comment 'EMS访问日志dws表';
-- 咨询量的结果表
CREATE TABLE IF NOT EXISTS scrm_bi.consult_dws
(
sid_total INT COMMENT '根据sid去重求count',
sessionid_total INT COMMENT '根据sessionid去重求count',
ip_total INT COMMENT '根据IP去重求count',
area varchar(32) COMMENT '区域信息',
origin_channel varchar(32) COMMENT '来源渠道',
hourinfo varchar(32) COMMENT '创建时间,统计至小时',
time_str varchar(32) COMMENT '时间明细',
groupType varchar(32) COMMENT '产品属性类型:1.地区;2.来源渠道',
time_type varchar(32) COMMENT '时间聚合类型:1、按小时聚合;2、按天聚合;3、按月聚合;4、按季度聚合;5、按年聚合;',
yearinfo varchar(32) COMMENT '年',
quarterinfo varchar(32) COMMENT '季度',
monthinfo varchar(32) COMMENT '月',
dayinfo varchar(32) COMMENT '天'
)COMMENT '咨询量DWS宽表';
第二步:执行sqoop的数据导出
-- 先导出咨询量数据
sqoop export \
--connect "jdbc:mysql://192.168.52.150:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
--username root \
--password 123456 \
--table consult_dws \
--hcatalog-database itcast_dws \
--hcatalog-table consult_dws \
-m 1
-- 访问量数据导出
sqoop export \
--connect "jdbc:mysql://192.168.52.150:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
--username root \
--password 123456 \
--table visit_dws \
--hcatalog-database itcast_dws \
--hcatalog-table visit_dws \
-m 1
查看具体报了什么错误:查看MR的运行日志(jobHistory)