教育项目笔记(4)

四、访问咨询主题看板_全量流程

5.hive的基础优化

(1)HDFS的副本数量:

a.默认情况下HDFS的副本有三个;

b.实际生产环境中一般HDFS副本也是3个;

c.如果数据不是特别的重要,也可以设置为只有2个副本;

d.如果使用hadoop3.x以上的版本,支持设置副本数量为1.5,其中0.5不是指的存储了一半的数据,而是采用纠删码;来存储这一份的信息,而纠删码只占用数据的一半。

配置副本数量:可以直接在CM上HDFS的配置目录下来配置

(2)yarn的基础配置:

yarn:用于资源的分配(资源:内存,cpu);其中nodemanager用于出内存和CPU;datanode用于出磁盘

cpu的配置

a.  注意:每一个nodemanager 会向resourcemanager报告自己当前节点有多少核心数;默认是8核;yarn不会自动校验每一个节点有多少核cpu;

b.  推荐调整配置:当前节点有多少核,就要向resourcemanager汇报多少核

c.  查看当前节点核数:

方式一:通过CM的主机目录来查看每一个节点有多少核;

方式二:通过命令的方式来查看--->grep 'processor' /proc/cpuinfo | sort -u | wc -l

d.在yarn中配置各个节点的核心数:直接在cm的yarn的配置目录下搜索:yarn.nodemanager.resource.cpu-vcores

内存配置

a.注意:每一个nodemanager 会向resourcemanager报告自己当前节点有多少内存;默认是8GB;yarn不会自动校验每一个节点有多少内存;

b.推荐配置:剩余内存 * 80%

c.查看当前节点内存剩余量:

方式一:通过CM的主机目录来查看每一个节点有多少剩余内存

方式二:通过命令方式查看--->free -m

d.配置各个节点内存:直接在cm的yarn的配置目录下搜索:

yarn.nodemanager.resource.memory-mb

yarn.scheduler.maximum-allocation-mb:与第一个保持一致

yarn.app.mapreduce.am.command-opts:略小于第一个配置的值(0.9)

注意:要同时设置yarn.scheduler.maximum-allocation-mb为一样的值,yarn.app.mapreduce.am.command-opts(JVM内存)的值要同步修改为略小的值(-Xmx1024m)。

yarn本地目录的配置

配置项:yarn.nodemanager.local-dirs

目的:yarn在运行过程中会产生一些临时文件,这些临时文件存储的位置有本地目录配置来决定

推荐配置:当前服务器挂载了几块磁盘,就需要配置几个目录

查看每一个磁盘挂载到了linux的什么目录下:df -h 查看对应磁盘挂载点即可

(3)MapReduce基础配置:

mapreduce.map.memory.mb:在运行MR时,一个mapTask需要占用多大的内存

mapreduce.map.java.opts:在运行MR时,一个mapTask对应jvm需要占用多大内存

mapreduce.reduce.memory.mb:在运行MR时,一个reduceTask需要占用多大的内存

mapreduce.reduce.java.opts:在运行MR时,一个reduceTask对应jvm需要占用多大内存

注意:jvm的内存配置要略小于对应内存;所有的内存配置大小不要超过nodemanager的内存大小;一般不做任何修改,默认即可

(4)hive的基础配置:

hiveserver2的内存大小配置:配置项:HiveServer2的Java堆栈大小(字节)

说明:如果这个配置比较少,在执行SQL的时候,可能会出现hiveserver2宕机的情况,此时需要条件hiveserver2的内存大小,调整后重启

动态生成分区的线程数:配置:hive.load.dynamic.partitions.thread;默认值:15

说明:在执行动态分区的时候,最多允许多少个线程来运行动态分区操作,线程越多,执行效率越高,但是占用资源也越大;

推荐:先采用默认,如果动态分区执行慢,而且还有剩余资源,可以尝试调大

调整位置:直接在cm的hive的配置目录下调整

监听输入文件的线程数量:配置项:hive.exec.input.listing.max.threads;默认值:15

说明:在运行SQL的时候,可以使用多少个线程来读取HDFS上数据,线程数量越多,读取效率越高,占用资源越大

推荐:先采用默认,如果读取数据执行慢,而且还有剩余资源,可以尝试调大

(5)hive压缩的配置:

map中间结果的压缩配置

hive.exec.compress.intermediate:是否开启hive对中间结果的压缩--->建议在hive的会话窗口配置

以下两个建议直接在cm的yarn的配置目录下直接配置

mapreduce.map.output.compress:是否开启map阶段的压缩

mapreduce.map.output.compress.codec:选择什么压缩方案

推荐配置:org.apache.hadoop.io.compress.SnappyCodec

reduce最终结果压缩配置

hive.exec.compress.output:是否开启对hive最终结果的压缩--->建议在hive的会话窗口配置

以下两个建议直接在cm的yarn的配置目录下直接配置

mapreduce.output.fileoutputformat.compress:是否开启reduce端的压缩配置

mapreduce.output.fileoutputformat.compress.codec:选择什么压缩方案

        推荐配置:org.apache.hadoop.io.compress.SnappyCodec

mapreduce.output.fileoutputformat.compress.type:压缩方案

        推荐配置:BLOCK

说明:如果hive上没有开启压缩,即使配置MR的压缩,那么也不会生效

(6)hive执行引擎的切换:配置:hive.execution.engine

6.数据采集

(1)目的:将业务端的数据导入到ODS层对应表中

业务端数据:mysql

ODS层表:hive

--->使用apache sqoop将mysql的数据灌入到hive中

(2)导入数据的SQL语句:

-- 访问咨询主表
select id,create_date_time,session_id,sid,create_time,seo_source,seo_keywords,ip,
       area,country,province,city,origin_channel,user as user_match,manual_time,begin_time,
       end_time,last_customer_msg_time_stamp,last_agent_msg_time_stamp,reply_msg_count,
       msg_count,browser_name,os_info,'2023-12-17' as starts_time
from  web_chat_ems_2019_07;

-- 访问咨询附属表
select
*,'2023-12-17' as start_time
from web_chat_text_ems_2019_07;

执行sqoop脚本,完成数据采集

-- 访问咨询主表
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/nev \
--username root \
--password 123456 \
--query 'select id,create_date_time,session_id,sid,create_time,seo_source,seo_keywords,ip,
       area,country,province,city,origin_channel,user as user_match,manual_time,begin_time,
       end_time,last_customer_msg_time_stamp,last_agent_msg_time_stamp,reply_msg_count,
       msg_count,browser_name,os_info,"2023-12-17" as starts_time
from  web_chat_ems_2019_07 where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table web_chat_ems \
-m 1

-- 访问咨询附属表
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/nev \
--username root \
--password 123456 \
--query 'select
*,"2023-12-17" as start_time
from web_chat_text_ems_2019_07 where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table web_chat_text_ems \
-m 1

校验数据是否导入成功:

(1)查看mysql共计多少条数据:

select count(1) from web_chat_ems_2019_07;211197

select count(1) from web_chat_text_ems_2019_07;105599

(2)到hive中对表查询一下一共多少条数据:

select count(1) from itcast_ods.web_chat_ems;211197

select count(1) from itcast_ods.web_chat_text_ems;105599

(3)查询其中一部分数据,观察数据映射是否可以 

select * from itcast_ods.web_chat_ems limit 10;

select * from itcast_ods.web_chat_text_ems limit 10;

7.数据清洗转换

目的:将ODS层数据导入到DWD层

(1)DWD层作用:1)清洗转换操作 2)少量维度退化操作

清洗操作:不需要

转换操作:将create_time日期转化为yearinfo quarterinfo monthinfo dayinfo hourinfo

额外加一个转换:将create_time日期数据转换为时间戳

维度退化操作:需要,将两个事实表合并成为一个事实表

SQL实现:还未完成转化操作

select 
    wce.session_id,
    wce.sid,
    wce.create_time,-- 此处需要转换:将字符串日期转换成时间戳
    wce.seo_source,
    wce.ip,
    wce.area,
    wce.msg_count,
    wce.origin_channel,
    wcte.referrer,
    wcte.from_url,
    wcte.landing_page_url,
    wcte.url_title,
    wcte.platform_description,
    wcte.other_params,
    wcte.history,
    wce.create_time as hourinfo,    -- 此处需要转换
    wce.create_time as yearinfo,    -- 此处需要转换
    wce.create_time as quarterinfo, -- 此处需要转换
    wce.create_time as monthinfo,   -- 此处需要转换
    wce.create_time as dayinfo  -- 此处需要转换
from itcast_ods.web_chat_ems wce join itcast_ods.web_chat_text_ems wcte
    on wce.id=wcte.id;

思考:如何进行转换操作:

转化1:将create_time转换为int 类型的数据(将其转化为时间戳)

方案: 日期转时间戳的函数:unix_timestamp(string date,string pattern)

案例:select unix_timestamp('2019-07-01 23:45:00','yyyy-MM-dd HH:mm:ss');

转化2:将create_time转换为yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo:

方案一:通过year() quarter() month() day() hour()

-- 案例:
select year('2019-07-01 23:45:00');-- 2019
select quarter('2019-07-01 23:45:00');-- 3
select month('2019-07-01 23:45:00');-- 7
select day('2019-07-01 23:45:00');-- 1
select hour('2019-07-01 23:45:00');-- 23

方案二:通过字符串的截取操作substr('字符串',从第几个截取,截取多少个) 

select substr('2019-07-01 23:45:00',1,4); -- 2019
select substr('2019-07-01 23:45:00',6,2);-- 07
select substr('2019-07-01 23:45:00',9,2);-- 01
select substr('2019-07-01 23:45:00',12,2);-- 23

实现最终转换的SQL:

--动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;

insert into table itcast_dwd.visit_consult_dwd partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    wce.session_id,
    wce.sid,
    unix_timestamp(wce.create_time) as create_time,
    wce.seo_source,
    wce.ip,
    wce.area,
    wce.msg_count,
    wce.origin_channel,
    wcte.referrer,
    wcte.from_url,
    wcte.landing_page_url,
    wcte.url_title,
    wcte.platform_description,
    wcte.other_params,
    wcte.history,
    substr(wce.create_time,12,2) as hourinfo,
    substr(wce.create_time,1,4) as yearinfo,
    quarter(wce.create_time) as quarterinfo,
    substr(wce.create_time,6,2) as monthinfo,
    substr(wce.create_time,9,2) as dayinfo
from itcast_ods.web_chat_ems wce join itcast_ods.web_chat_text_ems wcte
    on wce.id=wcte.id;

可能出现以下错误: 

注意:在执行转换操作的时候,由于需要进行两表联查操作,其中一个表数据量比较少,此时hive会对其优化,采用map join的方案进行处理,而map join需要将小表的数据加载到内存中,但是内存不足,导致出现内存溢出错误,此错误可能会出现两个信息:第一个信息:return code 1;第二个信息:return code -137 (等待一会会爆出来)

解决方案:关闭掉map join 让其采用reduce join即可。set hive.auto.convert.join=false;

8.数据分析

目的:将DWD层数据灌入到DWS层

DWS层作用:细化维度统计操作

8.1访问量:

固有维度:时间维度:年 季度 月 天 小时

产品属性维度:地区维度,来源渠道,搜索来源,受访页面,总访问量

8.1.1以时间为基准,统计总访问量
-- 统计每年的总访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    '-1' as origin_channel,
    '-1' as hourinfo,
    yearinfo as time_str,
    '-1' as from_url,
    '5' as grouptype,
    '5' as time_type,
    yearinfo,
    '-1' as quarterinfo,
    '-1' as monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo;

-- 统计每年每季度的总访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'_',quarterinfo) as time_str,
    '-1' as from_url,
    '5' as grouptype,
    '4' as time_type,
    yearinfo,
    quarterinfo,
    '-1' as monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo;

-- 统计每年每季度每月的总访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'-',monthinfo) as time_str,
    '-1' as from_url,
    '5' as grouptype,
    '3' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo;

-- 统计每年每季度每月每天的总访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
    '-1' as from_url,
    '5' as grouptype,
    '2' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo;

-- 统计每年每季度每月每天每小时的总访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    '-1' as origin_channel,
    hourinfo,
    concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
    '-1' as from_url,
    '5' as grouptype,
    '1' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo;
8.1.2基于时间统计各个受访页面的访问量 
-- 统计每年各个受访页面的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    '-1' as origin_channel,
    '-1' as hourinfo,
    yearinfo as time_str,
    from_url,
    '4' as grouptype,
    '5' as time_type,
    yearinfo,
    '-1' as quarterinfo,
    '-1' as monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,from_url;

-- 统计每年每季度各个受访页面的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'_',quarterinfo) as time_str,
    from_url,
    '4' as grouptype,
    '4' as time_type,
    yearinfo,
    quarterinfo,
    '-1' as monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,from_url;

-- 统计每年每季度每月各个受访页面的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'-',monthinfo) as time_str,
    from_url,
    '4' as grouptype,
    '3' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,from_url;

-- 统计每年每季度每月每天各个受访页面的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
    from_url,
    '4' as grouptype,
    '2' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo,from_url;

-- 统计每年每季度每月每天每小时各个受访页面的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    '-1' as origin_channel,
    hourinfo,
    concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
    from_url,
    '4' as grouptype,
    '1' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo,from_url;
8.1.3基于时间统计各个来源渠道的访问量 
-- 统计每年各个来源渠道的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    origin_channel,
    '-1' as hourinfo,
    yearinfo as time_str,
    '-1' as from_url,
    '3' as grouptype,
    '5' as time_type,
    yearinfo,
    '-1' as quarterinfo,
    '-1' as monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,origin_channel;

-- 统计每年每季度各个来源渠道的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'_',quarterinfo) as time_str,
    '-1' as from_url,
    '3' as grouptype,
    '4' as time_type,
    yearinfo,
    quarterinfo,
    '-1' as monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,origin_channel;

-- 统计每年每季度每月各个来源渠道的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'-',monthinfo) as time_str,
    '-1' as from_url,
    '3' as grouptype,
    '3' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,origin_channel;

-- 统计每年每季度每月每天各个来源渠道的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
    '-1' as from_url,
    '3' as grouptype,
    '2' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo,origin_channel;

-- 统计每年每季度每月每天每小时各个来源渠道的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    origin_channel,
    hourinfo,
    concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
    '-1' as from_url,
    '3' as grouptype,
    '1' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo,origin_channel;
8.2咨询量:

固有维度:时间:年 季度 月 天 小时

产品属性维度:地区,来源渠道,总咨询量

8.2.1基于时间统计总咨询量
-- 统计每年的总咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as origin_channel,
    '-1' as hourinfo,
    yearinfo as time_str,
    '3' as grouptype,
    '5' as time_type,
    yearinfo,
    '-1' as quarterinfo,
    '-1' as monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo;

-- 统计每年每季度的总咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'_',quarterinfo) as time_str,
    '3' as grouptype,
    '4' as time_type,
    yearinfo,
    quarterinfo,
    '-1' as monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo,quarterinfo;

-- 统计每年每季度每月的总咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'-',monthinfo) as time_str,
    '3' as grouptype,
    '3' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo,quarterinfo,monthinfo;

-- 统计每年每季度每月每天的总咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
    '3' as grouptype,
    '2' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo,quarterinfo,monthinfo,dayinfo;

-- 统计每年每季度每月每天每小时的总咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as origin_channel,
    hourinfo,
    concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
    '3' as grouptype,
    '1' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo;
8.2.2基于时间统计各个地区的咨询量
-- 统计每年各个地区的咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    area,
    '-1' as origin_channel,
    '-1' as hourinfo,
    yearinfo as time_str,
    '1' as grouptype,
    '5' as time_type,
    yearinfo,
    '-1' as quarterinfo,
    '-1' as monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo,area;
-- 统计每年每季度各个地区的咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    area,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'_',quarterinfo) as time_str,
    '1' as grouptype,
    '4' as time_type,
    yearinfo,
    quarterinfo,
    '-1' as monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo,quarterinfo,area;

-- 统计每年每季度每月各个地区的咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    area,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'-',monthinfo) as time_str,
    '1' as grouptype,
    '3' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo,quarterinfo,monthinfo,area;

-- 统计每年每季度每月每天各个地区的咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    area,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
    '1' as grouptype,
    '2' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo,quarterinfo,monthinfo,dayinfo,area;

-- 统计每年每季度每月每天每小时各个地区的咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    area,
    '-1' as origin_channel,
    hourinfo,
    concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
    '1' as grouptype,
    '1' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo,area;
9.数据导出 

目的:从hive的DWS层将数据导出到mysql中对应目标表中

技术:apache sqoop

第一步:在mysql中创建目标表

create database scrm_bi default character set utf8mb4 collate utf8mb4_general_ci;

-- 访问量的结果表
CREATE TABLE IF NOT EXISTS scrm_bi.visit_dws (
  sid_total INT COMMENT '根据sid去重求count',
  sessionid_total INT COMMENT '根据sessionid去重求count',
  ip_total INT COMMENT '根据IP去重求count',
  area varchar(32) COMMENT '区域信息',
  seo_source varchar(32) COMMENT '搜索来源',
  origin_channel varchar(32) COMMENT '来源渠道',
  hourinfo varchar(32) COMMENT '创建时间,统计至小时',
  time_str varchar(32) COMMENT '时间明细',
  from_url varchar(32) comment '会话来源页面',
  groupType varchar(32) COMMENT '产品属性类型:1.地区;2.搜索来源;3.来源渠道;4.会话来源页面;5.总访问量',
  time_type varchar(32) COMMENT '时间聚合类型:1、按小时聚合;2、按天聚合;3、按月聚合;4、按季度聚合;5、按年聚合;',
  yearinfo varchar(32) COMMENT '年',
  quarterinfo varchar(32) COMMENT '季度',
  monthinfo varchar(32) COMMENT '月',
  dayinfo varchar(32) COMMENT '天'
) comment 'EMS访问日志dws表';


-- 咨询量的结果表
CREATE TABLE IF NOT EXISTS scrm_bi.consult_dws
(
  sid_total INT COMMENT '根据sid去重求count',
  sessionid_total INT COMMENT '根据sessionid去重求count',
  ip_total INT COMMENT '根据IP去重求count',
  area  varchar(32) COMMENT '区域信息',
  origin_channel varchar(32) COMMENT '来源渠道',
  hourinfo varchar(32) COMMENT '创建时间,统计至小时',
  time_str varchar(32) COMMENT '时间明细',
  groupType varchar(32) COMMENT '产品属性类型:1.地区;2.来源渠道',
  time_type varchar(32) COMMENT '时间聚合类型:1、按小时聚合;2、按天聚合;3、按月聚合;4、按季度聚合;5、按年聚合;',
   yearinfo varchar(32) COMMENT '年',
  quarterinfo varchar(32) COMMENT '季度',
  monthinfo varchar(32) COMMENT '月',
  dayinfo varchar(32) COMMENT '天'
)COMMENT '咨询量DWS宽表';


第二步:执行sqoop的数据导出

-- 先导出咨询量数据
sqoop export \
--connect "jdbc:mysql://192.168.52.150:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
--username root \
--password 123456 \
--table consult_dws \
--hcatalog-database itcast_dws \
--hcatalog-table consult_dws \
-m 1


-- 访问量数据导出
sqoop export \
--connect "jdbc:mysql://192.168.52.150:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
--username root \
--password 123456 \
--table visit_dws \
--hcatalog-database itcast_dws \
--hcatalog-table visit_dws \
-m 1

 查看具体报了什么错误:查看MR的运行日志(jobHistory)

  • 12
    点赞
  • 18
    收藏
    觉得还不错? 一键收藏
  • 1
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值