教育项目笔记(4)

四、访问咨询主题看板_全量流程

5.hive的基础优化

(1)HDFS的副本数量:

a.默认情况下HDFS的副本有三个;

b.实际生产环境中一般HDFS副本也是3个;

c.如果数据不是特别的重要,也可以设置为只有2个副本;

d.如果使用hadoop3.x以上的版本,支持设置副本数量为1.5,其中0.5不是指的存储了一半的数据,而是采用纠删码;来存储这一份的信息,而纠删码只占用数据的一半。

配置副本数量:可以直接在CM上HDFS的配置目录下来配置

(2)yarn的基础配置:

yarn:用于资源的分配(资源:内存,cpu);其中nodemanager用于出内存和CPU;datanode用于出磁盘

cpu的配置

a.  注意:每一个nodemanager 会向resourcemanager报告自己当前节点有多少核心数;默认是8核;yarn不会自动校验每一个节点有多少核cpu;

b.  推荐调整配置:当前节点有多少核,就要向resourcemanager汇报多少核

c.  查看当前节点核数:

方式一:通过CM的主机目录来查看每一个节点有多少核;

方式二:通过命令的方式来查看--->grep 'processor' /proc/cpuinfo | sort -u | wc -l

d.在yarn中配置各个节点的核心数:直接在cm的yarn的配置目录下搜索:yarn.nodemanager.resource.cpu-vcores

内存配置

a.注意:每一个nodemanager 会向resourcemanager报告自己当前节点有多少内存;默认是8GB;yarn不会自动校验每一个节点有多少内存;

b.推荐配置:剩余内存 * 80%

c.查看当前节点内存剩余量:

方式一:通过CM的主机目录来查看每一个节点有多少剩余内存

方式二:通过命令方式查看--->free -m

d.配置各个节点内存:直接在cm的yarn的配置目录下搜索:

yarn.nodemanager.resource.memory-mb

yarn.scheduler.maximum-allocation-mb:与第一个保持一致

yarn.app.mapreduce.am.command-opts:略小于第一个配置的值(0.9)

注意:要同时设置yarn.scheduler.maximum-allocation-mb为一样的值,yarn.app.mapreduce.am.command-opts(JVM内存)的值要同步修改为略小的值(-Xmx1024m)。

yarn本地目录的配置

配置项:yarn.nodemanager.local-dirs

目的:yarn在运行过程中会产生一些临时文件,这些临时文件存储的位置有本地目录配置来决定

推荐配置:当前服务器挂载了几块磁盘,就需要配置几个目录

查看每一个磁盘挂载到了linux的什么目录下:df -h 查看对应磁盘挂载点即可

(3)MapReduce基础配置:

mapreduce.map.memory.mb:在运行MR时,一个mapTask需要占用多大的内存

mapreduce.map.java.opts:在运行MR时,一个mapTask对应jvm需要占用多大内存

mapreduce.reduce.memory.mb:在运行MR时,一个reduceTask需要占用多大的内存

mapreduce.reduce.java.opts:在运行MR时,一个reduceTask对应jvm需要占用多大内存

注意:jvm的内存配置要略小于对应内存;所有的内存配置大小不要超过nodemanager的内存大小;一般不做任何修改,默认即可

(4)hive的基础配置:

hiveserver2的内存大小配置:配置项:HiveServer2的Java堆栈大小(字节)

说明:如果这个配置比较少,在执行SQL的时候,可能会出现hiveserver2宕机的情况,此时需要条件hiveserver2的内存大小,调整后重启

动态生成分区的线程数:配置:hive.load.dynamic.partitions.thread;默认值:15

说明:在执行动态分区的时候,最多允许多少个线程来运行动态分区操作,线程越多,执行效率越高,但是占用资源也越大;

推荐:先采用默认,如果动态分区执行慢,而且还有剩余资源,可以尝试调大

调整位置:直接在cm的hive的配置目录下调整

监听输入文件的线程数量:配置项:hive.exec.input.listing.max.threads;默认值:15

说明:在运行SQL的时候,可以使用多少个线程来读取HDFS上数据,线程数量越多,读取效率越高,占用资源越大

推荐:先采用默认,如果读取数据执行慢,而且还有剩余资源,可以尝试调大

(5)hive压缩的配置:

map中间结果的压缩配置

hive.exec.compress.intermediate:是否开启hive对中间结果的压缩--->建议在hive的会话窗口配置

以下两个建议直接在cm的yarn的配置目录下直接配置

mapreduce.map.output.compress:是否开启map阶段的压缩

mapreduce.map.output.compress.codec:选择什么压缩方案

推荐配置:org.apache.hadoop.io.compress.SnappyCodec

reduce最终结果压缩配置

hive.exec.compress.output:是否开启对hive最终结果的压缩--->建议在hive的会话窗口配置

以下两个建议直接在cm的yarn的配置目录下直接配置

mapreduce.output.fileoutputformat.compress:是否开启reduce端的压缩配置

mapreduce.output.fileoutputformat.compress.codec:选择什么压缩方案

        推荐配置:org.apache.hadoop.io.compress.SnappyCodec

mapreduce.output.fileoutputformat.compress.type:压缩方案

        推荐配置:BLOCK

说明:如果hive上没有开启压缩,即使配置MR的压缩,那么也不会生效

(6)hive执行引擎的切换:配置:hive.execution.engine

6.数据采集

(1)目的:将业务端的数据导入到ODS层对应表中

业务端数据:mysql

ODS层表:hive

--->使用apache sqoop将mysql的数据灌入到hive中

(2)导入数据的SQL语句:

-- 访问咨询主表
select id,create_date_time,session_id,sid,create_time,seo_source,seo_keywords,ip,
       area,country,province,city,origin_channel,user as user_match,manual_time,begin_time,
       end_time,last_customer_msg_time_stamp,last_agent_msg_time_stamp,reply_msg_count,
       msg_count,browser_name,os_info,'2023-12-17' as starts_time
from  web_chat_ems_2019_07;

-- 访问咨询附属表
select
*,'2023-12-17' as start_time
from web_chat_text_ems_2019_07;

执行sqoop脚本,完成数据采集

-- 访问咨询主表
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/nev \
--username root \
--password 123456 \
--query 'select id,create_date_time,session_id,sid,create_time,seo_source,seo_keywords,ip,
       area,country,province,city,origin_channel,user as user_match,manual_time,begin_time,
       end_time,last_customer_msg_time_stamp,last_agent_msg_time_stamp,reply_msg_count,
       msg_count,browser_name,os_info,"2023-12-17" as starts_time
from  web_chat_ems_2019_07 where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table web_chat_ems \
-m 1

-- 访问咨询附属表
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/nev \
--username root \
--password 123456 \
--query 'select
*,"2023-12-17" as start_time
from web_chat_text_ems_2019_07 where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table web_chat_text_ems \
-m 1

校验数据是否导入成功:

(1)查看mysql共计多少条数据:

select count(1) from web_chat_ems_2019_07;211197

select count(1) from web_chat_text_ems_2019_07;105599

(2)到hive中对表查询一下一共多少条数据:

select count(1) from itcast_ods.web_chat_ems;211197

select count(1) from itcast_ods.web_chat_text_ems;105599

(3)查询其中一部分数据,观察数据映射是否可以 

select * from itcast_ods.web_chat_ems limit 10;

select * from itcast_ods.web_chat_text_ems limit 10;

7.数据清洗转换

目的:将ODS层数据导入到DWD层

(1)DWD层作用:1)清洗转换操作 2)少量维度退化操作

清洗操作:不需要

转换操作:将create_time日期转化为yearinfo quarterinfo monthinfo dayinfo hourinfo

额外加一个转换:将create_time日期数据转换为时间戳

维度退化操作:需要,将两个事实表合并成为一个事实表

SQL实现:还未完成转化操作

select 
    wce.session_id,
    wce.sid,
    wce.create_time,-- 此处需要转换:将字符串日期转换成时间戳
    wce.seo_source,
    wce.ip,
    wce.area,
    wce.msg_count,
    wce.origin_channel,
    wcte.referrer,
    wcte.from_url,
    wcte.landing_page_url,
    wcte.url_title,
    wcte.platform_description,
    wcte.other_params,
    wcte.history,
    wce.create_time as hourinfo,    -- 此处需要转换
    wce.create_time as yearinfo,    -- 此处需要转换
    wce.create_time as quarterinfo, -- 此处需要转换
    wce.create_time as monthinfo,   -- 此处需要转换
    wce.create_time as dayinfo  -- 此处需要转换
from itcast_ods.web_chat_ems wce join itcast_ods.web_chat_text_ems wcte
    on wce.id=wcte.id;

思考:如何进行转换操作:

转化1:将create_time转换为int 类型的数据(将其转化为时间戳)

方案: 日期转时间戳的函数:unix_timestamp(string date,string pattern)

案例:select unix_timestamp('2019-07-01 23:45:00','yyyy-MM-dd HH:mm:ss');

转化2:将create_time转换为yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo:

方案一:通过year() quarter() month() day() hour()

-- 案例:
select year('2019-07-01 23:45:00');-- 2019
select quarter('2019-07-01 23:45:00');-- 3
select month('2019-07-01 23:45:00');-- 7
select day('2019-07-01 23:45:00');-- 1
select hour('2019-07-01 23:45:00');-- 23

方案二:通过字符串的截取操作substr('字符串',从第几个截取,截取多少个) 

select substr('2019-07-01 23:45:00',1,4); -- 2019
select substr('2019-07-01 23:45:00',6,2);-- 07
select substr('2019-07-01 23:45:00',9,2);-- 01
select substr('2019-07-01 23:45:00',12,2);-- 23

实现最终转换的SQL:

--动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;

insert into table itcast_dwd.visit_consult_dwd partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    wce.session_id,
    wce.sid,
    unix_timestamp(wce.create_time) as create_time,
    wce.seo_source,
    wce.ip,
    wce.area,
    wce.msg_count,
    wce.origin_channel,
    wcte.referrer,
    wcte.from_url,
    wcte.landing_page_url,
    wcte.url_title,
    wcte.platform_description,
    wcte.other_params,
    wcte.history,
    substr(wce.create_time,12,2) as hourinfo,
    substr(wce.create_time,1,4) as yearinfo,
    quarter(wce.create_time) as quarterinfo,
    substr(wce.create_time,6,2) as monthinfo,
    substr(wce.create_time,9,2) as dayinfo
from itcast_ods.web_chat_ems wce join itcast_ods.web_chat_text_ems wcte
    on wce.id=wcte.id;

可能出现以下错误: 

注意:在执行转换操作的时候,由于需要进行两表联查操作,其中一个表数据量比较少,此时hive会对其优化,采用map join的方案进行处理,而map join需要将小表的数据加载到内存中,但是内存不足,导致出现内存溢出错误,此错误可能会出现两个信息:第一个信息:return code 1;第二个信息:return code -137 (等待一会会爆出来)

解决方案:关闭掉map join 让其采用reduce join即可。set hive.auto.convert.join=false;

8.数据分析

目的:将DWD层数据灌入到DWS层

DWS层作用:细化维度统计操作

8.1访问量:

固有维度:时间维度:年 季度 月 天 小时

产品属性维度:地区维度,来源渠道,搜索来源,受访页面,总访问量

8.1.1以时间为基准,统计总访问量
-- 统计每年的总访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    '-1' as origin_channel,
    '-1' as hourinfo,
    yearinfo as time_str,
    '-1' as from_url,
    '5' as grouptype,
    '5' as time_type,
    yearinfo,
    '-1' as quarterinfo,
    '-1' as monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo;

-- 统计每年每季度的总访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'_',quarterinfo) as time_str,
    '-1' as from_url,
    '5' as grouptype,
    '4' as time_type,
    yearinfo,
    quarterinfo,
    '-1' as monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo;

-- 统计每年每季度每月的总访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'-',monthinfo) as time_str,
    '-1' as from_url,
    '5' as grouptype,
    '3' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo;

-- 统计每年每季度每月每天的总访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
    '-1' as from_url,
    '5' as grouptype,
    '2' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo;

-- 统计每年每季度每月每天每小时的总访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    '-1' as origin_channel,
    hourinfo,
    concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
    '-1' as from_url,
    '5' as grouptype,
    '1' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo;
8.1.2基于时间统计各个受访页面的访问量 
-- 统计每年各个受访页面的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    '-1' as origin_channel,
    '-1' as hourinfo,
    yearinfo as time_str,
    from_url,
    '4' as grouptype,
    '5' as time_type,
    yearinfo,
    '-1' as quarterinfo,
    '-1' as monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,from_url;

-- 统计每年每季度各个受访页面的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'_',quarterinfo) as time_str,
    from_url,
    '4' as grouptype,
    '4' as time_type,
    yearinfo,
    quarterinfo,
    '-1' as monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,from_url;

-- 统计每年每季度每月各个受访页面的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'-',monthinfo) as time_str,
    from_url,
    '4' as grouptype,
    '3' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,from_url;

-- 统计每年每季度每月每天各个受访页面的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
    from_url,
    '4' as grouptype,
    '2' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo,from_url;

-- 统计每年每季度每月每天每小时各个受访页面的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    '-1' as origin_channel,
    hourinfo,
    concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
    from_url,
    '4' as grouptype,
    '1' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo,from_url;
8.1.3基于时间统计各个来源渠道的访问量 
-- 统计每年各个来源渠道的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    origin_channel,
    '-1' as hourinfo,
    yearinfo as time_str,
    '-1' as from_url,
    '3' as grouptype,
    '5' as time_type,
    yearinfo,
    '-1' as quarterinfo,
    '-1' as monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,origin_channel;

-- 统计每年每季度各个来源渠道的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'_',quarterinfo) as time_str,
    '-1' as from_url,
    '3' as grouptype,
    '4' as time_type,
    yearinfo,
    quarterinfo,
    '-1' as monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,origin_channel;

-- 统计每年每季度每月各个来源渠道的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'-',monthinfo) as time_str,
    '-1' as from_url,
    '3' as grouptype,
    '3' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,origin_channel;

-- 统计每年每季度每月每天各个来源渠道的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
    '-1' as from_url,
    '3' as grouptype,
    '2' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo,origin_channel;

-- 统计每年每季度每月每天每小时各个来源渠道的访问量
insert into itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as seo_source,
    origin_channel,
    hourinfo,
    concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
    '-1' as from_url,
    '3' as grouptype,
    '1' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo,origin_channel;
8.2咨询量:

固有维度:时间:年 季度 月 天 小时

产品属性维度:地区,来源渠道,总咨询量

8.2.1基于时间统计总咨询量
-- 统计每年的总咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as origin_channel,
    '-1' as hourinfo,
    yearinfo as time_str,
    '3' as grouptype,
    '5' as time_type,
    yearinfo,
    '-1' as quarterinfo,
    '-1' as monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo;

-- 统计每年每季度的总咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'_',quarterinfo) as time_str,
    '3' as grouptype,
    '4' as time_type,
    yearinfo,
    quarterinfo,
    '-1' as monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo,quarterinfo;

-- 统计每年每季度每月的总咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'-',monthinfo) as time_str,
    '3' as grouptype,
    '3' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo,quarterinfo,monthinfo;

-- 统计每年每季度每月每天的总咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
    '3' as grouptype,
    '2' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo,quarterinfo,monthinfo,dayinfo;

-- 统计每年每季度每月每天每小时的总咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    '-1' as area,
    '-1' as origin_channel,
    hourinfo,
    concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
    '3' as grouptype,
    '1' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo;
8.2.2基于时间统计各个地区的咨询量
-- 统计每年各个地区的咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    area,
    '-1' as origin_channel,
    '-1' as hourinfo,
    yearinfo as time_str,
    '1' as grouptype,
    '5' as time_type,
    yearinfo,
    '-1' as quarterinfo,
    '-1' as monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo,area;
-- 统计每年每季度各个地区的咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    area,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'_',quarterinfo) as time_str,
    '1' as grouptype,
    '4' as time_type,
    yearinfo,
    quarterinfo,
    '-1' as monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo,quarterinfo,area;

-- 统计每年每季度每月各个地区的咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    area,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'-',monthinfo) as time_str,
    '1' as grouptype,
    '3' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    '-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo,quarterinfo,monthinfo,area;

-- 统计每年每季度每月每天各个地区的咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    area,
    '-1' as origin_channel,
    '-1' as hourinfo,
    concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
    '1' as grouptype,
    '2' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo,quarterinfo,monthinfo,dayinfo,area;

-- 统计每年每季度每月每天每小时各个地区的咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select 
    count(distinct sid) as sid_total,
    count(distinct session_id) as sessionid_total,
    count(distinct ip) as ip_total,
    area,
    '-1' as origin_channel,
    hourinfo,
    concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
    '1' as grouptype,
    '1' as time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count>=1
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo,area;
9.数据导出 

目的:从hive的DWS层将数据导出到mysql中对应目标表中

技术:apache sqoop

第一步:在mysql中创建目标表

create database scrm_bi default character set utf8mb4 collate utf8mb4_general_ci;

-- 访问量的结果表
CREATE TABLE IF NOT EXISTS scrm_bi.visit_dws (
  sid_total INT COMMENT '根据sid去重求count',
  sessionid_total INT COMMENT '根据sessionid去重求count',
  ip_total INT COMMENT '根据IP去重求count',
  area varchar(32) COMMENT '区域信息',
  seo_source varchar(32) COMMENT '搜索来源',
  origin_channel varchar(32) COMMENT '来源渠道',
  hourinfo varchar(32) COMMENT '创建时间,统计至小时',
  time_str varchar(32) COMMENT '时间明细',
  from_url varchar(32) comment '会话来源页面',
  groupType varchar(32) COMMENT '产品属性类型:1.地区;2.搜索来源;3.来源渠道;4.会话来源页面;5.总访问量',
  time_type varchar(32) COMMENT '时间聚合类型:1、按小时聚合;2、按天聚合;3、按月聚合;4、按季度聚合;5、按年聚合;',
  yearinfo varchar(32) COMMENT '年',
  quarterinfo varchar(32) COMMENT '季度',
  monthinfo varchar(32) COMMENT '月',
  dayinfo varchar(32) COMMENT '天'
) comment 'EMS访问日志dws表';


-- 咨询量的结果表
CREATE TABLE IF NOT EXISTS scrm_bi.consult_dws
(
  sid_total INT COMMENT '根据sid去重求count',
  sessionid_total INT COMMENT '根据sessionid去重求count',
  ip_total INT COMMENT '根据IP去重求count',
  area  varchar(32) COMMENT '区域信息',
  origin_channel varchar(32) COMMENT '来源渠道',
  hourinfo varchar(32) COMMENT '创建时间,统计至小时',
  time_str varchar(32) COMMENT '时间明细',
  groupType varchar(32) COMMENT '产品属性类型:1.地区;2.来源渠道',
  time_type varchar(32) COMMENT '时间聚合类型:1、按小时聚合;2、按天聚合;3、按月聚合;4、按季度聚合;5、按年聚合;',
   yearinfo varchar(32) COMMENT '年',
  quarterinfo varchar(32) COMMENT '季度',
  monthinfo varchar(32) COMMENT '月',
  dayinfo varchar(32) COMMENT '天'
)COMMENT '咨询量DWS宽表';


第二步:执行sqoop的数据导出

-- 先导出咨询量数据
sqoop export \
--connect "jdbc:mysql://192.168.52.150:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
--username root \
--password 123456 \
--table consult_dws \
--hcatalog-database itcast_dws \
--hcatalog-table consult_dws \
-m 1


-- 访问量数据导出
sqoop export \
--connect "jdbc:mysql://192.168.52.150:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
--username root \
--password 123456 \
--table visit_dws \
--hcatalog-database itcast_dws \
--hcatalog-table visit_dws \
-m 1

 查看具体报了什么错误:查看MR的运行日志(jobHistory)

  • 12
    点赞
  • 18
    收藏
    觉得还不错? 一键收藏
  • 1
    评论
### 回答1: 对于一个想要学习Python编程语言的初学者来说,千锋教育推出的Python全套视频教程800集(完整版)笔记是非常实用的学习资源。千锋教育是一家知名的IT培训机构,拥有多年的教育经验,其教学质量得到业内的认可。其Python全套视频教程涉及了Python语言的基本语法、面向对象编程、Web开发、数据分析等方面,内容涵盖面非常广泛。在学习过程中,可以根据自己的兴趣和需求选择对应的部分进行学习。 通过学习这套课程,初学者可以掌握Python语言的基本知识,并具备独立编写程序的能力。此外,这套课程还包含了丰富的实例演示,让学生了解如何将所学的知识应用到实际情况中。可帮助学生快速投身行业。这套课程最大的优势是实践讲解,将所学知识融入到实际项目中,让学生更容易理解和掌握。在学习过程中,千锋教育还准备了丰富的练习题和课程答疑,能够帮助学生更好地巩固所学的知识,最终能够应用到实际工作中。 不过需要指出的是,这套课程需要花费一定的时间和精力进行学习,需要学生具备一定的计算机基础,并具备良好的英语阅读能力。同时,由于Python语言更新速度较快,部分课程内容可能不太适用于新版Python,因此需要学生自行查找更加符合自己需求的学习材料进行学习。总之,如果有一定的计算机基础,想要系统地学习Python编程语言的初学者,这套Python全套视频教程800集(完整版)笔记是不错的学习资源。 ### 回答2: 千锋教育Python全套视频教程800集(完整版)笔记是一份非常全面的Python学习资料,能够帮助学员从零开始系统地学习Python编程,深入了解Python语言的各种特性和常见应用。 根据笔记的内容,首先是Python的基础语法,包括变量类型、运算符、控制流、函数等,这些基础知识是Python编程的基础,也是日常编程中必不可少的一部分。 接下来是Python面向对象编程方面的内容,包括类、继承、多态、封装等,这是Python的一大特色,也是Python最为优秀的一个特性。 在这份笔记中,还涉及到了Python的常用模块,如datetime、os、re等,以及Python的网络编程、图形界面设计、数据可视化等方面的应用。 总之,这份笔记涵盖了Python编程的方方面面,对于初学者来说是一份非常宝贵的学习资源,对于已经掌握Python语言的开发者来说,也可以作为一份全面的Python编程参考资料。此外,这份笔记还有助于学员加深对Python编程的了解和认知,提高自己的编程能力。 ### 回答3: 千锋教育Python全套视频教程800集的完整版笔记,是一份非常详细的学习资料,内容涵盖了Python语言的基础知识、编程思想和实践技能。这套教程是由千锋教育专家团队精心制作的,内容充实、简明易懂,适合初学者和有一定基础的学员。 笔记的主要内容包括:Python语言入门、数据类型与运算符、流程控制语句、函数与模块、面向对象编程、文件操作、网络编程、Web开发等。通过这些笔记,你可以快速地了解Python这门语言的特点和用途,掌握Python的核心语法,学会使用Python进行编程、数据处理和Web开发等工作。 此外,千锋教育Python全套视频教程800集的完整版笔记,也提供了大量的练习题和实践项目。通过这些练习题和项目,你可以加深对Python语言的理解和应用,更好地实践所学知识。 总之,千锋教育Python全套视频教程800集的完整版笔记是一份非常有价值的学习资料,无论你是初学者还是已经有一定基础的学员,都可以从中获得很多收获,快速地掌握Python编程和开发的技能。

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值