Intended-Customer Topic Dashboard (Bucketed Tables, Layered Modeling, Zipper Tables)


okbin1991


Article link: https://blog.csdn.net/okbin1991/article/details/129156727

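Today's content:
1) Optimizations related to bucketed tables -- understand
2) Layered modeling -- hands-on
3) Statistical analysis for the full-load flow -- hands-on (try to implement it yourself): data collection, data cleaning and transformation, dimension degeneration, statistical analysis
4) The incremental flow: how to process a zipper table incrementally -- understand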

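1. Intended-customer topic dashboard: requirements

Requirement 1: the total number of new intended customers within the period (including intended customers that staff entered themselves).
  Metric: number of intended customers
  Dimensions:
    Time: year, month, day, hour
    New vs. returning customers
    Online vs. offline
  Table involved: customer_relationship (intent table)
  Field involved: create_date_time
  Field used to count intended customers: customer_id (deduplicate first)

Requirement 2: a heat map of the number of new intended customers per city/area within a given period.
  Metric: number of intended customers
  Dimensions:
    Time: year, month, day, hour
    New vs. returning customers
    Online vs. offline
    Area
  Tables involved: customer_relationship (intent table), customer (customer table, i.e. the student table)
  Fields involved:
    Intent table: create_date_time
    Customer table: area
  Field used to count intended customers: customer_id (deduplicate first)
  Join condition: intent.customer_id = customer.id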

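Requirement 3: a ranking of intended subjects by number of new intended customers within a given period. The subject name must be fetched through a join.
  Metric: number of intended customers
  Dimensions:
    Time: year, month, day, hour
    New vs. returning customers
    Online vs. offline
    Subject
  Tables involved: customer_relationship (intent table), itcast_subject (subject table), customer_clue (clue table)
  Fields involved:
    Clue table:
      clue_state: helps tell new customers from returning ones
      deleted: tells whether the row has been deleted
      create_date_time
    Intent table:
      origin_type: tells online from offline. The values NETSERVICE and PRESIGNUP mean online; everything else is offline
      customer_id: the field used to count intended customers (deduplicate first)
    Subject table: name
  Join conditions:
    clue.customer_relationship_id = intent.id
    subject.id = intent.itcast_subject_id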

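Requirement 4: a ranking of intended schools (campuses) by number of new intended customers within a given period.
  Metric: number of intended customers
  Dimensions:
    Time: year, month, day, hour
    New vs. returning customers
    Online vs. offline
    School
  Note: when school ids are synchronized, 0 and null are unified into a single value, -1.
  Tables involved: customer_relationship (intent table), customer_clue (clue table), itcast_school (school table)
  Fields involved:
    Clue table:
      clue_state: helps tell new customers from returning ones
      deleted: tells whether the row has been deleted
      create_date_time
    Intent table:
      origin_type: tells online from offline. The values NETSERVICE and PRESIGNUP mean online; everything else is offline
      customer_id: the field used to count intended customers (deduplicate first)
    School table: name
  Join conditions:
    intent.itcast_school_id = school.id
    clue.customer_relationship_id = intent.id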

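Requirement 5: the share of new intended customers per source channel within a given period.
  Metric: number of intended customers
  Dimensions:
    Time: year, month, day, hour
    New vs. returning customers
    Online vs. offline
    Source channel
  Tables involved: customer_relationship (intent table), customer_clue (clue table)
  Fields involved:
    Clue table:
      clue_state: helps tell new customers from returning ones
      deleted: tells whether the row has been deleted
    Intent table:
      create_date_time
      origin_type: tells online from offline, and also represents the source channel. The values NETSERVICE and PRESIGNUP mean online; everything else is offline
      customer_id: the field used to count intended customers (deduplicate first)
  Join condition: clue.customer_relationship_id = intent.id

Requirement 6: the share of new intended customers produced by each consulting center within a given period.
  Metric: number of intended customers
  Dimensions:
    Time: year, month, day, hour
    New vs. returning customers
    Online vs. offline
    Consulting center
  Tables involved: customer_relationship (intent table), employee (employee table), scrm_department (department table), customer_clue (clue table)
  Fields involved:
    Clue table:
      clue_state: helps tell new customers from returning ones
    Intent table:
      create_date_time
      origin_type: tells online from offline, and also represents the source channel. The values NETSERVICE and PRESIGNUP mean online; everything else is offline
      customer_id: the field used to count intended customers (deduplicate first)
    Employee table: tdepart_id (department id)
    Department table: name
  Join conditions:
    clue.customer_relationship_id = intent.id
    employee.tdepart_id = department.id
    intent.creator = employee.id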

Summary:
  Metric: number of intended customers
  Dimensions:
    Time: year, month, day, hour
    New vs. returning customers
    Online vs. offline
    Product-attribute dimensions: area, source channel, subject, school, consulting center

  Tables involved: 7 tables
    customer_relationship (intent table): create_date_time, origin_type, customer_id
    employee (employee table): tdepart_id and id
    scrm_department (department table): name and id
    customer_clue (clue table): clue_state, deleted, create_date_time, customer_relationship_id
    itcast_school (school table): name and id
    itcast_subject (subject table): name and id
    customer (customer table): area and id
  Table joins:
    clue.customer_relationship_id = intent.id
    employee.tdepart_id = department.id
    intent.creator = employee.id
    intent.itcast_school_id = school.id
    subject.id = intent.itcast_subject_id
    intent.customer_id = customer.id

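Intent topic dashboard case: importing the raw business data (this step does not exist in real-world work)
create database scrm default character set utf8mb4 collate utf8mb4_unicode_ci;

From the Zhixing Education analytics platform materials handed out earlier, go to the original full data set --> scrm, and import the 7 tables into MySQL one by one.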

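Intent topic dashboard case: modeling analysis
ODS layer:
  Fact table: the intent table
  One extra table is placed here: the clue table (note: this table is the fact table of a later topic dashboard, so it is kept in the ODS layer to simplify later processing)
  Table type: internal (managed) table + bucketed table + partitioned table + zipper table
DIM layer (dimension layer):
  Employee table, school table, subject table, customer table, department table
  Table type: external table + partitioned table
For both layers, simply build each table one-to-one against the structure of the source table, and remember to add a start_time (extraction time) field.
Data format and compression: ORC + ZLIB (SNAPPY)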

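DW layer:
DWD: cleaning and transformation of the ODS tables; if a table has too many fields, keep only the relevant ones.
  Cleaning: remove the rows that are already marked as deleted
  Transformation:
    Convert the values of origin_type into 0 and 1 in a new field that flags online vs. offline
    Split create_date_time into year, month, day, and hour
    School id: 0 and null are unified into a single value, -1
  Fields:
    Regular fields: id, create_date_time, deleted, customer_id, origin_type, origin_type_stat, itcast_school_id, itcast_subject_id, creator, hourinfo
    Partitions: year (yearinfo), month (monthinfo), day (dayinfo)
DWM:
  Pre-aggregation by dimension: not possible here
  Dimension degeneration: combine the six other tables (the five DIM tables plus the clue table) with the DWD fact table into one table
    Idea: decide which fields are needed from each dimension table, then put all of them into a single wide table
  Fields:
    Regular fields: customer_id, create_date_time, clue_state_stat, origin_type_stat, area, origin_type, itcast_school_id, school_name, itcast_subject_id, itcast_subject_name, department_id, department_name, hourinfo
    Partitions: year (yearinfo), month (monthinfo), day (dayinfo)

Producing the data for this table requires a seven-table join across ODS + DIM.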

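DWS: there is only one metric, so there is only one table
  Fields: customerid_total, clue_state_stat, origin_type_stat, area, origin_type, itcast_school_id, school_name, itcast_subject_id, itcast_subject_name, department_id, department_name, time_type, group_type, hourinfo, time_str
  Partitions: year (yearinfo), month (monthinfo), day (dayinfo)
  time_type: 1 (year), 2 (month), 3 (day), 4 (hour)
  group_type: 1 area, 2 source channel, 3 subject, 4 school, 5 consulting center, 6 total intent count
  Note: the DWS DDL comments and the INSERT statements later in this article encode these two fields differently (for example, time_type '4' for month and grouptype '1' for the total intent count). The codes here are the design-stage draft.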

Sample result data:
1000  0  0  year  -1  -1  -1      -1
1000  0  1  year  -1  -1  -1      -1
1000  1  0  year  -1  -1  -1      -1
1000  1  1  year  -1  -1  -1      -1
1000  0  0  year  11  -1  -1      -1
1000  0  1  year  11  -1  -1      -1
1000  1  0  year  11  -1  -1      -1
1000  1  1  year  11  -1  -1      -1
1000  0  0  year  11  01  -1      -1
1000  0  1  year  11  01  -1      -1
1000  1  0  year  11  01  -1      -1
1000  1  1  year  11  01  -1      -1
1000  0  0  year  11  -1  Shanxi  -1
1000  0  1  year  11  -1  Shanxi  -1
1000  1  0  year  11  -1  Shanxi  -1
1000  1  1  year  11  -1  Shanxi  -1
1000  0  0  year  11  01  -1      weixin
1000  0  1  year  11  01  -1      weixin
1000  1  0  year  11  01  -1      weixin
1000  1  1  year  11  01  -1      weixin

APP layer: not needed. The DWS layer has already completed the analysis for every dimension.

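2. Optimizations related to bucketed tables
Bucketed table: splitting into files. One file is split into several files, and how many depends on the configured number of buckets.
How is the split implemented underneath? It relies on MR partitioning: a hash-modulo computation on the bucketing column scatters the data, while guaranteeing that rows with the same value are sent to the same reducer.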
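
The hash-modulo idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Hive's implementation: it assumes an integer bucketing column whose hash is the value itself, masked to stay non-negative, and the helper names `bucket_index` and `split_into_buckets` are hypothetical.

```python
from collections import defaultdict


def bucket_index(key: int, num_buckets: int) -> int:
    """Return the 0-based bucket for an integer key (hash modulo bucket count)."""
    # Assumption: the hash of an int is the int itself; the mask keeps it
    # non-negative so the modulo result is always a valid bucket index.
    return (key & 0x7FFFFFFF) % num_buckets


def split_into_buckets(ids, num_buckets):
    """Group ids by bucket, the way one file is split into several bucket files."""
    buckets = defaultdict(list)
    for value in ids:
        buckets[bucket_index(value, num_buckets)].append(value)
    return dict(buckets)


# Equal ids, and ids that differ by a multiple of the bucket count,
# always end up in the same bucket.
print(split_into_buckets([1, 7, 13, 2, 8, 6], 6))
```

Because the bucket depends only on the key, two tables bucketed on the same column with compatible bucket counts keep matching keys in matching buckets, which is what the join optimizations later in this section rely on.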

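Working with bucketed tables:
Create the table:
create table test_buck(id int, name string)
clustered by(id) sorted by (id asc) into 6 buckets -- this line is the key part
row format delimited fields terminated by '\t';

Insert data:
-- enable bucketing
set hive.enforce.bucketing=true;
insert into ...

Note: a bucketed table cannot be filled with load data.
set hive.strict.checks.bucketing = true; -- forbids load data on bucketed tables; default is true
How to get data into a bucketed table: create a staging table for it (the staging table must NOT be bucketed), load the data into the staging table with load data, then use insert into ... select to move the data from the staging table into the bucketed table.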

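Purposes:
Data sampling: once one file has been split into several, you can pick a few of those files and process only them. This is called data sampling.
  What sampling is good for:
  1. During testing, when the data set is too large, sample it and run the analysis on the sample. This speeds up development.
  2. When analyzing the full data set is impractical, analyze a sample instead. The result still reflects the overall data.
  How to sample:
  Format: select * from table tablesample(bucket x out of y on column) as a

  Placement: directly after the table name. If the table has an alias, put the sampling clause after the table name and before the alias.
  Clause: tablesample(bucket x out of y on column)
    x: the bucket to start sampling from; x must be less than or equal to y
    y: the sampling ratio; it must be a multiple or a factor of the number of buckets
    column: the column to bucket-sample on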

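Example: the table has 10 buckets and the bucketing column is id

tablesample(bucket 3 out of 5 on id):
How many buckets are sampled? 10 / 5 = 2
Which two? The 3rd and the 8th bucket (that is, x and x + y).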
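
The bucket arithmetic from this example can be checked with a small Python sketch. The function name `sampled_buckets` is hypothetical, and the sketch only covers the case where y is a factor of the bucket count, as in the example (when y is a multiple of the bucket count, only part of a single bucket is read, which is not modeled here).

```python
def sampled_buckets(num_buckets: int, x: int, y: int) -> list:
    """Return the 1-based buckets read by tablesample(bucket x out of y)."""
    if not 1 <= x <= y:
        raise ValueError("x must satisfy 1 <= x <= y")
    if num_buckets % y != 0:
        # Assumption of this sketch: y divides the bucket count evenly.
        raise ValueError("this sketch only handles y as a factor of num_buckets")
    # Start at bucket x and take every y-th bucket after it.
    return list(range(x, num_buckets + 1, y))


print(sampled_buckets(10, 3, 5))  # [3, 8]
```

With 10 buckets and y = 5, the call returns 10 / 5 = 2 buckets, matching the hand calculation.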

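Improving multi-table join performance: the main tool is map join
1. Map join: suited to joining a small table with a large table
Requirements:
set hive.auto.convert.join=true; -- the map join optimization must be enabled; default is true
set hive.auto.convert.join.noconditionaltask.size=512000000; -- small-table threshold; the default cited here is 20971520 (20 MB) and may differ between distributions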

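2. Joining a medium-sized table with a large table: to still get a map-side join, use Bucket-MapJoin
Requirements:
1) The join column of both tables must be the bucketing column
2) The medium table's bucket count must be less than or equal to the large table's, and the large table's bucket count must be an integer multiple of it
3) Enable bucket map join: set hive.optimize.bucketmapjoin = true
4) Both tables must be bucketed tables: set hive.enforce.bucketing=true;
Once all of these conditions are met, Hive uses Bucket-MapJoin automatically. If they are not met, Hive checks whether an ordinary map join applies, and if that is not possible either, it falls back to the plain reduce join.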

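3. Joining a large table with a large table: to still get a map-side join, use SMB (sort-merge-bucket) join. It builds on Bucket-MapJoin, so the Bucket-MapJoin conditions must be met first
Requirements:
1) The join column of both tables must be the bucketing column, and both tables must be sorted by the bucketing column
2) Both tables must have the same number of buckets
3) Enable bucket map join: set hive.optimize.bucketmapjoin = true
4) Both tables must be bucketed tables: set hive.enforce.bucketing=true;
5) Enable the settings required for SMB join:
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin.sortedmerge = true; -- enable SMB join
set hive.auto.convert.sortmerge.join.noconditionaltask=true;
set hive.enforce.sorting=true;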
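
The three checklists above can be condensed into one decision function. The Python sketch below is a simplification for study purposes and not how Hive's planner is implemented: `TableInfo`, `choose_join`, and the 20 MB default threshold are assumptions taken from the conditions listed in this section, and the required `set` switches are assumed to be enabled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableInfo:
    size_bytes: int
    bucket_column: Optional[str] = None  # None means the table is not bucketed
    num_buckets: int = 0
    sorted_by_bucket_column: bool = False


def choose_join(a: TableInfo, b: TableInfo, join_column: str,
                small_table_threshold: int = 20971520) -> str:
    """Pick a join strategy from the conditions listed in this section."""
    # 1. Small table + large table: plain map join.
    if min(a.size_bytes, b.size_bytes) <= small_table_threshold:
        return "map join"
    # Both bucket-based strategies need the join column to be the bucket column.
    if a.bucket_column == join_column and b.bucket_column == join_column:
        low = min(a.num_buckets, b.num_buckets)
        high = max(a.num_buckets, b.num_buckets)
        # 3. Equal bucket counts and both sides sorted: SMB join.
        if low == high and low > 0 and a.sorted_by_bucket_column and b.sorted_by_bucket_column:
            return "SMB join"
        # 2. One bucket count is an integer multiple of the other: bucket map join.
        if low > 0 and high % low == 0:
            return "bucket map join"
    # Otherwise fall back to the shuffle-based reduce join.
    return "reduce join"


medium = TableInfo(1_000_000_000, "id", 5)
large = TableInfo(50_000_000_000, "id", 10)
print(choose_join(medium, large, "id"))
```

The order of the checks mirrors the text: map join when one side is small enough, the bucket-based joins when the bucketing conditions hold, and the reduce join as the fallback.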

Table creation:
create table test_smb_2(mid string, age_id string)
CLUSTERED BY(mid) SORTED BY(mid) INTO 500 BUCKETS;

-- 3. Intended-customer topic dashboard: layered modeling
-- Preparation: enable compression on write
set hive.exec.orc.compression.strategy=COMPRESSION;

-- 3.1: Create the ODS-layer tables: 2 tables (intent table and clue table)
CREATE TABLE IF NOT EXISTS itcast_ods.`customer_relationship` (
  `id` int COMMENT 'customer relationship id',
  `create_date_time` STRING COMMENT 'creation time',
  `update_date_time` STRING COMMENT 'last update time',
  `deleted` int COMMENT 'deleted (disabled) flag',
  `customer_id` int COMMENT 'owning customer id',
  `first_id` int COMMENT 'first customer relationship id',
  `belonger` int COMMENT 'owner',
  `belonger_name` STRING COMMENT 'owner name',
  `initial_belonger` int COMMENT 'initial owner',
  `distribution_handler` int COMMENT 'assignment handler',
  `business_scrm_department_id` int COMMENT 'owning department',
  `last_visit_time` STRING COMMENT 'last follow-up time',
  `next_visit_time` STRING COMMENT 'next follow-up time',
  `origin_type` STRING COMMENT 'data source',
  `itcast_school_id` int COMMENT 'school id',
  `itcast_subject_id` int COMMENT 'subject id',
  `intention_study_type` STRING COMMENT 'intended study mode',
  `anticipat_signup_date` STRING COMMENT 'expected sign-up date',
  `level` STRING COMMENT 'customer level',
  `creator` int COMMENT 'creator',
  `current_creator` int COMMENT 'current creator: initially the creator; the retriever when pulled back from the public pool',
  `creator_name` STRING COMMENT 'creator name',
  `origin_channel` STRING COMMENT 'source channel',
  `comment` STRING COMMENT 'remarks',
  `first_customer_clue_id` int COMMENT 'first clue id',
  `last_customer_clue_id` int COMMENT 'last clue id',
  `process_state` STRING COMMENT 'processing state',
  `process_time` STRING COMMENT 'processing state change time',
  `payment_state` STRING COMMENT 'payment state',
  `payment_time` STRING COMMENT 'payment state change time',
  `signup_state` STRING COMMENT 'sign-up state',
  `signup_time` STRING COMMENT 'sign-up time',
  `notice_state` STRING COMMENT 'notification state',
  `notice_time` STRING COMMENT 'notification state change time',
  `lock_state` STRING COMMENT 'lock state',
  `lock_time` STRING COMMENT 'lock state change time',
  `itcast_clazz_id` int COMMENT 'ems class id',
  `itcast_clazz_time` STRING COMMENT 'class enrollment time',
  `payment_url` STRING COMMENT 'payment link',
  `payment_url_time` STRING COMMENT 'payment link creation time',
  `ems_student_id` int COMMENT 'ems student id',
  `delete_reason` STRING COMMENT 'deletion reason',
  `deleter` int COMMENT 'deleted by',
  `deleter_name` STRING COMMENT 'name of the deleter',
  `delete_time` STRING COMMENT 'deletion time',
  `course_id` int COMMENT 'course id',
  `course_name` STRING COMMENT 'course name',
  `delete_comment` STRING COMMENT 'deletion reason notes',
  `close_state` STRING COMMENT 'close state',
  `close_time` STRING COMMENT 'close state change time',
  `appeal_id` int COMMENT 'appeal id',
  `tenant` int COMMENT 'tenant',
  `total_fee` DECIMAL COMMENT 'total sign-up fee',
  `belonged` int COMMENT 'short-cycle owner',
  `belonged_time` STRING COMMENT 'ownership time',
  `belonger_time` STRING COMMENT 'ownership time',
  `transfer` int COMMENT 'transferred by',
  `transfer_time` STRING COMMENT 'transfer time',
  `follow_type` int COMMENT 'assignment type: 0 auto assign, 1 manual assign, 2 auto transfer, 3 manual single transfer, 4 manual batch transfer, 5 claimed from public pool',
  `transfer_bxg_oa_account` STRING COMMENT 'OA account of the Boxuegu owner transferred to',
  `transfer_bxg_belonger_name` STRING COMMENT 'OA name of the Boxuegu owner transferred to',
  `end_time` STRING COMMENT 'valid-until time')
comment 'customer relationship table'
PARTITIONED BY(start_time STRING)
clustered by(id) sorted by(id) into 10 buckets
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='ZLIB');

CREATE TABLE IF NOT EXISTS itcast_ods.customer_clue (
  id int COMMENT 'customer_clue_id',
  create_date_time STRING COMMENT 'creation time',
  update_date_time STRING COMMENT 'last update time',
  deleted STRING COMMENT 'deleted (disabled) flag',
  customer_id int COMMENT 'customer id',
  customer_relationship_id int COMMENT 'customer relationship id',
  session_id STRING COMMENT 'Qimo session id',
  sid STRING COMMENT 'visitor id',
  status STRING COMMENT 'status (undeal: pending, deal: claimed, finish: closed, changePeer: transferred)',
  users STRING COMMENT 'assigned agent',
  create_time STRING COMMENT 'Qimo creation time',
  platform STRING COMMENT 'platform source (pc: website, wap: wap, sdk: app, weixin: WeChat)',
  s_name STRING COMMENT 'user name',
  seo_source STRING COMMENT 'search source',
  seo_keywords STRING COMMENT 'keywords',
  ip STRING COMMENT 'IP address',
  referrer STRING COMMENT 'referring page',
  from_url STRING COMMENT 'session source page',
  landing_page_url STRING COMMENT 'visitor landing page',
  url_title STRING COMMENT 'consultation page title',
  to_peer STRING COMMENT 'skill group',
  manual_time STRING COMMENT 'manual service start time',
  begin_time STRING COMMENT 'agent claim time',
  reply_msg_count int COMMENT 'agent reply message count',
  total_msg_count int COMMENT 'total message count',
  msg_count int COMMENT 'customer message count',
  comment STRING COMMENT 'remarks',
  finish_reason STRING COMMENT 'end type',
  finish_user STRING COMMENT 'ending agent',
  end_time STRING COMMENT 'session end time',
  platform_description STRING COMMENT 'customer platform info',
  browser_name STRING COMMENT 'browser name',
  os_info STRING COMMENT 'operating system',
  area STRING COMMENT 'area',
  country STRING COMMENT 'country',
  province STRING COMMENT 'province',
  city STRING COMMENT 'city',
  creator int COMMENT 'creator',
  name STRING COMMENT 'customer name',
  idcard STRING COMMENT 'ID card number',
  phone STRING COMMENT 'mobile number',
  itcast_school_id int COMMENT 'school id',
  itcast_school STRING COMMENT 'school',
  itcast_subject_id int COMMENT 'subject id',
  itcast_subject STRING COMMENT 'subject',
  wechat STRING COMMENT 'WeChat',
  qq STRING COMMENT 'QQ number',
  email STRING COMMENT 'email',
  gender STRING COMMENT 'gender',
  level STRING COMMENT 'customer level',
  origin_type STRING COMMENT 'data source channel',
  information_way STRING COMMENT 'consultation method',
  working_years STRING COMMENT 'work start time',
  technical_directions STRING COMMENT 'technical direction',
  customer_state STRING COMMENT 'current customer state',
  valid STRING COMMENT 'whether the clue is a valid online-consultation clue',
  anticipat_signup_date STRING COMMENT 'expected sign-up date',
  clue_state STRING COMMENT 'clue state',
  scrm_department_id int COMMENT 'SCRM internal department id',
  superior_url STRING COMMENT 'referring page URL from Zhuge',
  superior_source STRING COMMENT 'referring page URL title from Zhuge',
  landing_url STRING COMMENT 'landing page URL from Zhuge',
  landing_source STRING COMMENT 'landing page URL source from Zhuge',
  info_url STRING COMMENT 'lead-form page URL from Zhuge',
  info_source STRING COMMENT 'lead-form page URL title from Zhuge',
  origin_channel STRING COMMENT 'delivery channel',
  course_id int COMMENT 'course id',
  course_name STRING COMMENT 'course name',
  zhuge_session_id STRING COMMENT 'zhuge session id',
  is_repeat int COMMENT 'duplicate clue by phone number: 0 normal, 1 duplicate',
  tenant int COMMENT 'tenant id',
  activity_id STRING COMMENT 'activity id',
  activity_name STRING COMMENT 'activity name',
  follow_type int COMMENT 'assignment type: 0 auto assign, 1 manual assign, 2 auto transfer, 3 manual single transfer, 4 manual batch transfer, 5 claimed from public pool',
  shunt_mode_id int COMMENT 'matched skill group id',
  shunt_employee_group_id int COMMENT 'traffic-split employee group',
  ends_time STRING COMMENT 'valid time')
comment 'customer clue table'
PARTITIONED BY(starts_time STRING)
clustered by(customer_relationship_id) sorted by(customer_relationship_id) into 10 buckets
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='ZLIB');

-- 3.2: Create the DIM-layer tables: 5 tables
CREATE DATABASE IF NOT EXISTS itcast_dimen;

CREATE TABLE IF NOT EXISTS itcast_dimen.`customer` (
  `id` int COMMENT 'key id',
  `customer_relationship_id` int COMMENT 'current intent id',
  `create_date_time` STRING COMMENT 'creation time',
  `update_date_time` STRING COMMENT 'last update time',
  `deleted` int COMMENT 'deleted (disabled) flag',
  `name` STRING COMMENT 'name',
  `idcard` STRING COMMENT 'ID card number',
  `birth_year` int COMMENT 'year of birth',
  `gender` STRING COMMENT 'gender',
  `phone` STRING COMMENT 'mobile number',
  `wechat` STRING COMMENT 'WeChat',
  `qq` STRING COMMENT 'QQ number',
  `email` STRING COMMENT 'email',
  `area` STRING COMMENT 'area',
  `leave_school_date` date COMMENT 'school leaving date',
  `graduation_date` date COMMENT 'graduation date',
  `bxg_student_id` STRING COMMENT 'Boxuegu student id; may be missing if not linked',
  `creator` int COMMENT 'creator id',
  `origin_type` STRING COMMENT 'data source',
  `origin_channel` STRING COMMENT 'source channel',
  `tenant` int,
  `md_id` int COMMENT 'middle-platform id')
comment 'customer table'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

CREATE TABLE IF NOT EXISTS itcast_dimen.employee (
  id int COMMENT 'employee id',
  email STRING COMMENT 'company email, OA login account',
  real_name STRING COMMENT 'real name of the employee',
  phone STRING COMMENT 'mobile number; not used yet, the OA interface does not expose it for privacy reasons',
  department_id STRING COMMENT 'department number in OA; can be negative',
  department_name STRING COMMENT 'department name in OA',
  remote_login STRING COMMENT 'whether the employee may log in remotely',
  job_number STRING COMMENT 'employee number',
  cross_school STRING COMMENT 'whether the employee has cross-school permission',
  last_login_date STRING COMMENT 'last login date',
  creator int COMMENT 'creator',
  create_date_time STRING COMMENT 'creation time',
  update_date_time STRING COMMENT 'last update time',
  deleted STRING COMMENT 'deleted (disabled) flag',
  scrm_department_id int COMMENT 'SCRM internal department id',
  leave_office STRING COMMENT 'resignation status',
  leave_office_time STRING COMMENT 'resignation time',
  reinstated_time STRING COMMENT 'reinstatement time',
  superior_leaders_id int COMMENT 'id of the direct superior',
  tdepart_id int COMMENT 'direct department',
  tenant int COMMENT 'tenant',
  ems_user_name STRING COMMENT 'ems user name')
comment 'employee table'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

CREATE TABLE IF NOT EXISTS itcast_dimen.`scrm_department` (
  `id` int COMMENT 'department id',
  `name` STRING COMMENT 'department name',
  `parent_id` int COMMENT 'parent department id',
  `create_date_time` STRING COMMENT 'creation time',
  `update_date_time` STRING COMMENT 'update time',
  `deleted` STRING COMMENT 'deleted flag',
  `id_path` STRING COMMENT 'full code path',
  `tdepart_code` int COMMENT 'direct department',
  `creator` STRING COMMENT 'creator',
  `depart_level` int COMMENT 'department level',
  `depart_sign` int COMMENT 'department flag, currently defaults to 1',
  `depart_line` int COMMENT 'business line; stores the business line code',
  `depart_sort` int COMMENT 'sort field',
  `disable_flag` int COMMENT 'disabled flag',
  `tenant` int COMMENT 'tenant')
comment 'scrm department table'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

CREATE TABLE IF NOT EXISTS itcast_dimen.`itcast_school` (
  `id` int COMMENT 'auto-increment primary key',
  `create_date_time` timestamp COMMENT 'creation time',
  `update_date_time` timestamp COMMENT 'last update time',
  `deleted` STRING COMMENT 'deleted (disabled) flag',
  `name` STRING COMMENT 'school name',
  `code` STRING COMMENT 'school code',
  `tenant` int COMMENT 'tenant')
comment 'school dictionary table'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

CREATE TABLE IF NOT EXISTS itcast_dimen.`itcast_subject` (
  `id` int COMMENT 'auto-increment primary key',
  `create_date_time` timestamp COMMENT 'creation time',
  `update_date_time` timestamp COMMENT 'last update time',
  `deleted` STRING COMMENT 'deleted (disabled) flag',
  `name` STRING COMMENT 'subject name',
  `code` STRING COMMENT 'subject code',
  `tenant` int COMMENT 'tenant')
comment 'subject dictionary table'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- 3.3 Build the DWD layer (used to demonstrate the join optimizations)
CREATE TABLE IF NOT EXISTS itcast_dwd.`itcast_intention_dwd` (
  `rid` int COMMENT 'id',
  `customer_id` STRING COMMENT 'customer id',
  `create_date_time` STRING COMMENT 'creation time',
  `itcast_school_id` STRING COMMENT 'school id',
  `deleted` STRING COMMENT 'deleted flag',
  `origin_type` STRING COMMENT 'source channel',
  `itcast_subject_id` STRING COMMENT 'subject id',
  `creator` int COMMENT 'creator',
  `hourinfo` STRING COMMENT 'hour',
  `origin_type_stat` STRING COMMENT 'data source: 0 offline, 1 online')
comment 'customer intent dwd table'
PARTITIONED BY(yearinfo STRING, monthinfo STRING, dayinfo STRING)
clustered by(rid) sorted by(rid) into 10 buckets
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- 3.4: Build the DWM layer
create database itcast_dwm;

CREATE TABLE IF NOT EXISTS itcast_dwm.`itcast_intention_dwm` (
  `customer_id` STRING COMMENT 'customer id',
  `create_date_time` STRING COMMENT 'creation time',
  `area` STRING COMMENT 'area',
  `itcast_school_id` STRING COMMENT 'school id',
  `itcast_school_name` STRING COMMENT 'school name',
  `deleted` STRING COMMENT 'deleted flag',
  `origin_type` STRING COMMENT 'source channel',
  `itcast_subject_id` STRING COMMENT 'subject id',
  `itcast_subject_name` STRING COMMENT 'subject name',
  `hourinfo` STRING COMMENT 'hour',
  `origin_type_stat` STRING COMMENT 'data source: 0 offline, 1 online',
  `clue_state_stat` STRING COMMENT 'new or returning customer: 0 returning, 1 new',
  `tdepart_id` STRING COMMENT 'department id of the creator',
  `tdepart_name` STRING COMMENT 'consulting center name')
comment 'customer intent dwm table'
PARTITIONED BY(yearinfo STRING, monthinfo STRING, dayinfo STRING)
clustered by(customer_id) sorted by(customer_id) into 10 buckets
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- 3.5 Build the DWS layer
CREATE TABLE IF NOT EXISTS itcast_dws.itcast_intention_dws (
  `customer_total` INT COMMENT 'aggregated number of intended customers',
  `area` STRING COMMENT 'area',
  `itcast_school_id` STRING COMMENT 'school id',
  `itcast_school_name` STRING COMMENT 'school name',
  `origin_type` STRING COMMENT 'source channel',
  `itcast_subject_id` STRING COMMENT 'subject id',
  `itcast_subject_name` STRING COMMENT 'subject name',
  `hourinfo` STRING COMMENT 'hour',
  `origin_type_stat` STRING COMMENT 'data source: 0 offline, 1 online',
  `clue_state_stat` STRING COMMENT 'customer attribute: 0 returning, 1 new',
  `tdepart_id` STRING COMMENT 'department id of the creator',
  `tdepart_name` STRING COMMENT 'consulting center name',
  `time_str` STRING COMMENT 'time detail',
  `groupType` STRING COMMENT 'product attribute category: 1 total intent count, 2 area, 3 school and subject combined, 4 source channel, 5 consulting center',
  `time_type` STRING COMMENT 'time dimension: 1 by hour, 2 by day, 3 by week, 4 by month, 5 by year')
comment 'customer intent dws table'
PARTITIONED BY(yearinfo STRING, monthinfo STRING, dayinfo STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

4. Intent topic dashboard case: data collection
4.1: Collect the data for the DIM layer:
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select id, customer_relationship_id, create_date_time, update_date_time, deleted, name, idcard, birth_year, gender, phone, wechat, qq, email, area, leave_school_date, graduation_date, bxg_student_id, creator, origin_type, origin_channel, tenant, md_id, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d") as start_time from customer where $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table customer \
-m 1 \
--split-by id

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select id,email,real_name,-1 as phone,department_id,department_name,remote_login,job_number,cross_school,last_login_date,creator,create_date_time,update_date_time,deleted,scrm_department_id,leave_office,leave_office_time,reinstated_time,superior_leaders_id,tdepart_id,tenant,ems_user_name,FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d") as start_time from employee where $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table employee \
-m 1 \
--split-by id

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select *, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d") as start_time from scrm_department where $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table scrm_department \
-m 1 \
--split-by id

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select *, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d") as start_time from itcast_school where $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table itcast_school \
-m 1 \
--split-by id

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select *, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d") as start_time from itcast_subject where $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table itcast_subject \
-m 1 \
--split-by id

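4.2: Collect the data for the ODS layer
Both ODS tables are bucketed, and sqoop cannot import into bucketed tables. The workaround: build a staging (temporary) table for each bucketed table, import the data into the staging table with sqoop, then move it from the staging table into the bucketed table with insert into ... select.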

4.2.1: Importing the intent table
Step 1: create the staging table for the intent table
CREATE TABLE IF NOT EXISTS itcast_ods.`customer_relationship_tmp` (
  `id` int COMMENT 'customer relationship id',
  `create_date_time` STRING COMMENT 'creation time',
  `update_date_time` STRING COMMENT 'last update time',
  `deleted` int COMMENT 'deleted (disabled) flag',
  `customer_id` int COMMENT 'owning customer id',
  `first_id` int COMMENT 'first customer relationship id',
  `belonger` int COMMENT 'owner',
  `belonger_name` STRING COMMENT 'owner name',
  `initial_belonger` int COMMENT 'initial owner',
  `distribution_handler` int COMMENT 'assignment handler',
  `business_scrm_department_id` int COMMENT 'owning department',
  `last_visit_time` STRING COMMENT 'last follow-up time',
  `next_visit_time` STRING COMMENT 'next follow-up time',
  `origin_type` STRING COMMENT 'data source',
  `itcast_school_id` int COMMENT 'school id',
  `itcast_subject_id` int COMMENT 'subject id',
  `intention_study_type` STRING COMMENT 'intended study mode',
  `anticipat_signup_date` STRING COMMENT 'expected sign-up date',
  `level` STRING COMMENT 'customer level',
  `creator` int COMMENT 'creator',
  `current_creator` int COMMENT 'current creator: initially the creator; the retriever when pulled back from the public pool',
  `creator_name` STRING COMMENT 'creator name',
  `origin_channel` STRING COMMENT 'source channel',
  `comment` STRING COMMENT 'remarks',
  `first_customer_clue_id` int COMMENT 'first clue id',
  `last_customer_clue_id` int COMMENT 'last clue id',
  `process_state` STRING COMMENT 'processing state',
  `process_time` STRING COMMENT 'processing state change time',
  `payment_state` STRING COMMENT 'payment state',
  `payment_time` STRING COMMENT 'payment state change time',
  `signup_state` STRING COMMENT 'sign-up state',
  `signup_time` STRING COMMENT 'sign-up time',
  `notice_state` STRING COMMENT 'notification state',
  `notice_time` STRING COMMENT 'notification state change time',
  `lock_state` STRING COMMENT 'lock state',
  `lock_time` STRING COMMENT 'lock state change time',
  `itcast_clazz_id` int COMMENT 'ems class id',
  `itcast_clazz_time` STRING COMMENT 'class enrollment time',
  `payment_url` STRING COMMENT 'payment link',
  `payment_url_time` STRING COMMENT 'payment link creation time',
  `ems_student_id` int COMMENT 'ems student id',
  `delete_reason` STRING COMMENT 'deletion reason',
  `deleter` int COMMENT 'deleted by',
  `deleter_name` STRING COMMENT 'name of the deleter',
  `delete_time` STRING COMMENT 'deletion time',
  `course_id` int COMMENT 'course id',
  `course_name` STRING COMMENT 'course name',
  `delete_comment` STRING COMMENT 'deletion reason notes',
  `close_state` STRING COMMENT 'close state',
  `close_time` STRING COMMENT 'close state change time',
  `appeal_id` int COMMENT 'appeal id',
  `tenant` int COMMENT 'tenant',
  `total_fee` DECIMAL COMMENT 'total sign-up fee',
  `belonged` int COMMENT 'short-cycle owner',
  `belonged_time` STRING COMMENT 'ownership time',
  `belonger_time` STRING COMMENT 'ownership time',
  `transfer` int COMMENT 'transferred by',
  `transfer_time` STRING COMMENT 'transfer time',
  `follow_type` int COMMENT 'assignment type: 0 auto assign, 1 manual assign, 2 auto transfer, 3 manual single transfer, 4 manual batch transfer, 5 claimed from public pool',
  `transfer_bxg_oa_account` STRING COMMENT 'OA account of the Boxuegu owner transferred to',
  `transfer_bxg_belonger_name` STRING COMMENT 'OA name of the Boxuegu owner transferred to',
  `end_time` STRING COMMENT 'valid-until time')
comment 'customer relationship table'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='ZLIB');

Step 2: import the data into the staging table with sqoop:
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select id, create_date_time, update_date_time, deleted, customer_id, first_id, belonger, belonger_name, initial_belonger, distribution_handler, business_scrm_department_id, last_visit_time, next_visit_time, origin_type, itcast_school_id, itcast_subject_id, intention_study_type, anticipat_signup_date, level, creator, current_creator, creator_name, origin_channel, comment, first_customer_clue_id, last_customer_clue_id, process_state, process_time, payment_state, payment_time, signup_state, signup_time, notice_state, notice_time, lock_state, lock_time, itcast_clazz_id, itcast_clazz_time, payment_url, payment_url_time, ems_student_id, delete_reason, deleter, deleter_name, delete_time, course_id, course_name, delete_comment, close_state, close_time, appeal_id, tenant, total_fee, belonged, belonged_time, belonger_time, transfer, transfer_time, follow_type, transfer_bxg_oa_account, transfer_bxg_belonger_name, date_format("9999-12-31","%Y-%m-%d") as end_time, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d") as start_time from customer_relationship where $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table customer_relationship_tmp \
-m 1 \
--split-by id

-- Step 3: load the staging table's data into the bucketed ODS intent table
-- partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
-- hive compression
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- make compression take effect on write
set hive.exec.orc.compression.strategy=COMPRESSION;
-- bucketing
set hive.optimize.bucketmapjoin = true;
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;

set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

-- the staging table is qualified with its database, matching where it was created
insert into table itcast_ods.customer_relationship partition(start_time)
select * from itcast_ods.customer_relationship_tmp;

4.2.2: Importing the clue table into the ODS layer
Step 1: create the staging table for the clue table:
CREATE TABLE IF NOT EXISTS itcast_ods.customer_clue_tmp (
  id int COMMENT 'customer_clue_id',
  create_date_time STRING COMMENT 'creation time',
  update_date_time STRING COMMENT 'last update time',
  deleted STRING COMMENT 'deleted (disabled) flag',
  customer_id int COMMENT 'customer id',
  customer_relationship_id int COMMENT 'customer relationship id',
  session_id STRING COMMENT 'Qimo session id',
  sid STRING COMMENT 'visitor id',
  status STRING COMMENT 'status (undeal: pending, deal: claimed, finish: closed, changePeer: transferred)',
  users STRING COMMENT 'assigned agent',
  create_time STRING COMMENT 'Qimo creation time',
  platform STRING COMMENT 'platform source (pc: website, wap: wap, sdk: app, weixin: WeChat)',
  s_name STRING COMMENT 'user name',
  seo_source STRING COMMENT 'search source',
  seo_keywords STRING COMMENT 'keywords',
  ip STRING COMMENT 'IP address',
  referrer STRING COMMENT 'referring page',
  from_url STRING COMMENT 'session source page',
  landing_page_url STRING COMMENT 'visitor landing page',
  url_title STRING COMMENT 'consultation page title',
  to_peer STRING COMMENT 'skill group',
  manual_time STRING COMMENT 'manual service start time',
  begin_time STRING COMMENT 'agent claim time',
  reply_msg_count int COMMENT 'agent reply message count',
  total_msg_count int COMMENT 'total message count',
  msg_count int COMMENT 'customer message count',
  comment STRING COMMENT 'remarks',
  finish_reason STRING COMMENT 'end type',
  finish_user STRING COMMENT 'ending agent',
  end_time STRING COMMENT 'session end time',
  platform_description STRING COMMENT 'customer platform info',
  browser_name STRING COMMENT 'browser name',
  os_info STRING COMMENT 'operating system',
  area STRING COMMENT 'area',
  country STRING COMMENT 'country',
  province STRING COMMENT 'province',
  city STRING COMMENT 'city',
  creator int COMMENT 'creator',
  name STRING COMMENT 'customer name',
  idcard STRING COMMENT 'ID card number',
  phone STRING COMMENT 'mobile number',
  itcast_school_id int COMMENT 'school id',
  itcast_school STRING COMMENT 'school',
  itcast_subject_id int COMMENT 'subject id',
  itcast_subject STRING COMMENT 'subject',
  wechat STRING COMMENT 'WeChat',
  qq STRING COMMENT 'QQ number',
  email STRING COMMENT 'email',
  gender STRING COMMENT 'gender',
  level STRING COMMENT 'customer level',
  origin_type STRING COMMENT 'data source channel',
  information_way STRING COMMENT 'consultation method',
  working_years STRING COMMENT 'work start time',
  technical_directions STRING COMMENT 'technical direction',
  customer_state STRING COMMENT 'current customer state',
  valid STRING COMMENT 'whether the clue is a valid online-consultation clue',
  anticipat_signup_date STRING COMMENT 'expected sign-up date',
  clue_state STRING COMMENT 'clue state',
  scrm_department_id int COMMENT 'SCRM internal department id',
  superior_url STRING COMMENT 'referring page URL from Zhuge',
  superior_source STRING COMMENT 'referring page URL title from Zhuge',
  landing_url STRING COMMENT 'landing page URL from Zhuge',
  landing_source STRING COMMENT 'landing page URL source from Zhuge',
  info_url STRING COMMENT 'lead-form page URL from Zhuge',
  info_source STRING COMMENT 'lead-form page URL title from Zhuge',
  origin_channel STRING COMMENT 'delivery channel',
  course_id int COMMENT 'course id',
  course_name STRING COMMENT 'course name',
  zhuge_session_id STRING COMMENT 'zhuge session id',
  is_repeat int COMMENT 'duplicate clue by phone number: 0 normal, 1 duplicate',
  tenant int COMMENT 'tenant id',
  activity_id STRING COMMENT 'activity id',
  activity_name STRING COMMENT 'activity name',
  follow_type int COMMENT 'assignment type: 0 auto assign, 1 manual assign, 2 auto transfer, 3 manual single transfer, 4 manual batch transfer, 5 claimed from public pool',
  shunt_mode_id int COMMENT 'matched skill group id',
  shunt_employee_group_id int COMMENT 'traffic-split employee group',
  ends_time STRING COMMENT 'valid time')
comment 'customer clue table'
PARTITIONED BY(starts_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='ZLIB');

Step 2: import the data into the clue staging table with sqoop

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select id,create_date_time,update_date_time,deleted,customer_id,customer_relationship_id,session_id,sid,status,user as users,create_time,platform,s_name,seo_source,seo_keywords,ip,referrer,from_url,landing_page_url,url_title,to_peer,manual_time,begin_time,reply_msg_count,total_msg_count,msg_count,comment,finish_reason,finish_user,end_time,platform_description,browser_name,os_info,area,country,province,city,creator,name,"-1" as idcard,"-1" as phone,itcast_school_id,itcast_school,itcast_subject_id,itcast_subject,"-1" as wechat,"-1" as qq,"-1" as email,gender,level,origin_type,information_way,working_years,technical_directions,customer_state,valid,anticipat_signup_date,clue_state,scrm_department_id,superior_url,superior_source,landing_url,landing_source,info_url,info_source,origin_channel,course_id,course_name,zhuge_session_id,is_repeat,tenant,activity_id,activity_name,follow_type,shunt_mode_id,shunt_employee_group_id,date_format("9999-12-31","%Y-%m-%d") as ends_time,FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d") as starts_time from customer_clue where $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table customer_clue_tmp \
-m 1 \
--split-by id

Step 3: load the staging table's data into the clue table:

insert into table itcast_ods.customer_clue partition(starts_time)
select * from itcast_ods.customer_clue_tmp;

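4.3: Data cleaning and transformation: ODS intent table --> cleaned intent table in the DWD layer
Which cleaning and transformation steps are needed?
  Cleaning: remove the rows marked with deleted = 1
  Transformation:
    create_date_time: derive year, month, day, and hour from it
    origin_type: derive a new field, origin_type_stat, that tells online from offline
    School id and subject id: 0 and null are unified into a single value, -1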

SQL for cleaning and transformation:
INSERT INTO TABLE itcast_dwd.itcast_intention_dwd partition(yearinfo,monthinfo,dayinfo)
select
  id as rid,
  customer_id,
  create_date_time,
  if(itcast_school_id is null or itcast_school_id = 0, '-1', itcast_school_id) as itcast_school_id,
  deleted,
  origin_type,
  if(itcast_subject_id is null or itcast_subject_id = 0, '-1', itcast_subject_id) as itcast_subject_id,
  creator,
  substr(create_date_time,12,2) as hourinfo,
  if(origin_type in('NETSERVICE','PRESIGNUP'),'1','0') as origin_type_stat,
  substr(create_date_time,1,4) as yearinfo,
  substr(create_date_time,6,2) as monthinfo,
  substr(create_date_time,9,2) as dayinfo
-- note: TABLESAMPLE reads only bucket 1 of the 10 buckets (a sampled run); remove it to process all rows
from itcast_ods.customer_relationship TABLESAMPLE(BUCKET 1 OUT OF 10 on id) as cr
where deleted = 0;
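
To make the cleaning rules easier to check on a single record, here is a minimal Python sketch of the same logic. It is an illustration only: the function names `normalize_id` and `clean_row` and the sample row are hypothetical, and create_date_time is assumed to have the form 'YYYY-MM-DD HH:MM:SS'.

```python
ONLINE_ORIGIN_TYPES = {"NETSERVICE", "PRESIGNUP"}


def normalize_id(value):
    """Unify 0 and null (None) into '-1'; keep any other id as a string."""
    return "-1" if value is None or value == 0 else str(value)


def clean_row(row):
    """Apply the DWD cleaning rules to one intent row; return None if it is dropped."""
    if row["deleted"] != 0:
        return None  # cleaning: rows marked as deleted are removed
    ts = row["create_date_time"]  # assumed format: 'YYYY-MM-DD HH:MM:SS'
    return {
        "rid": row["id"],
        "customer_id": row["customer_id"],
        "itcast_school_id": normalize_id(row["itcast_school_id"]),
        "itcast_subject_id": normalize_id(row["itcast_subject_id"]),
        "origin_type": row["origin_type"],
        "origin_type_stat": "1" if row["origin_type"] in ONLINE_ORIGIN_TYPES else "0",
        # SQL substr is 1-based, Python slices are 0-based
        "yearinfo": ts[0:4],
        "monthinfo": ts[5:7],
        "dayinfo": ts[8:10],
        "hourinfo": ts[11:13],
    }


sample = {
    "id": 1, "customer_id": 42, "deleted": 0,
    "create_date_time": "2021-09-24 10:05:33",
    "itcast_school_id": 0, "itcast_subject_id": None,
    "origin_type": "NETSERVICE",
}
print(clean_row(sample))
```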

-- 4.4: Data transformation: DWD --> DWM
-- partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
-- hive compression
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- make compression take effect on write
set hive.exec.orc.compression.strategy=COMPRESSION;
-- bucketing
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

insert into table itcast_dwm.itcast_intention_dwm partition(yearinfo,monthinfo,dayinfo)
select
    iid.customer_id,
    iid.create_date_time,
    dcu.area,
    iid.itcast_school_id,
    dis.name,
    iid.deleted,
    iid.origin_type,
    iid.itcast_subject_id,
    disub.name,
    iid.hourinfo,
    iid.origin_type_stat,
    -- identify new vs. old customers
    if(cc.clue_state ='VALID_NEW_CLUES' , '1', if(cc.clue_state ='VALID_PUBLIC_NEW_CLUE','0','-1') ) as clue_state_stat,
    demp.tdepart_id,
    dsd.name,
    iid.yearinfo,
    iid.monthinfo,
    iid.dayinfo
from itcast_dwd.itcast_intention_dwd as iid
left join itcast_ods.customer_clue as cc on iid.rid = cc.customer_relationship_id
left join itcast_dimen.itcast_school as dis on dis.id = iid.itcast_school_id
left join itcast_dimen.itcast_subject as disub on disub.id=iid.itcast_subject_id
left join itcast_dimen.customer as dcu on dcu.id = iid.customer_id
left join itcast_dimen.employee as demp on demp.id = iid.creator
left join itcast_dimen.scrm_department as dsd on dsd.id = demp.tdepart_id;

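Testing shows that the join between itcast_intention_dwd and customer_clue is optimized as an SMB (sort-merge-bucket) map join, while the joins with the other tables run as ordinary map joins.

4.5) Statistical analysis:
Metric: number of intention customers
Dimensions:
Time: year, month, day, hour
New vs. old customer
Online vs. offline
Product attributes: area, source channel, subject, campus, consulting center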

-- Requirement 1: count intention customers per month, split by new/old customer and online/offline
insert into table itcast_dws.itcast_intention_dws partition(yearinfo,monthinfo,dayinfo)
select
    count(distinct customer_id ) as customer_total,
    '-1' as area,
    '-1' as itcast_school_id,
    '-1' as itcast_school_name,
    '-1' as origin_type,
    '-1' as itcast_subject_id,
    '-1' as itcast_subject_name,
    '-1' as hourinfo,
    origin_type_stat,
    clue_state_stat,
    '-1' as tdepart_id,
    '-1' as tdepart_name,
    concat(yearinfo,'-',monthinfo) as time_str,
    '1' as grouptype,
    '4' as time_type,
    yearinfo,
    monthinfo,
    '-1' as dayinfo
from itcast_dwm.itcast_intention_dwm
group by yearinfo,monthinfo, clue_state_stat, origin_type_stat;

-- Requirement 2: count intention customers per day, split by new/old customer, online/offline and area
insert into table itcast_dws.itcast_intention_dws partition(yearinfo,monthinfo,dayinfo)
select
    count(distinct customer_id ) as customer_total,
    area,
    '-1' as itcast_school_id,
    '-1' as itcast_school_name,
    '-1' as origin_type,
    '-1' as itcast_subject_id,
    '-1' as itcast_subject_name,
    '-1' as hourinfo,
    origin_type_stat,
    clue_state_stat,
    '-1' as tdepart_id,
    '-1' as tdepart_name,
    concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
    '2' as grouptype,
    '2' as time_type,
    yearinfo,
    monthinfo,
    dayinfo
from itcast_dwm.itcast_intention_dwm
group by yearinfo,monthinfo,dayinfo, clue_state_stat, origin_type_stat,area;
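Both requirement queries share one pattern: group the DWM rows by a set of dimension columns and count the distinct customer_id values in each group. A minimal Python sketch of that pattern follows; the helper name `count_distinct_customers` and the sample rows are hypothetical and only illustrate what `count(distinct customer_id) ... group by ...` computes.

```python
from collections import defaultdict


def count_distinct_customers(rows, group_keys):
    """Group rows by the given columns and count distinct customer_id per group."""
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[k] for k in group_keys)
        seen[key].add(row["customer_id"])  # a set de-duplicates repeated customers
    return {key: len(ids) for key, ids in seen.items()}


# Illustrative rows: customer 1 appears twice in the same group.
sample = [
    {"customer_id": 1, "yearinfo": "2019", "monthinfo": "12", "clue_state_stat": "1", "origin_type_stat": "1"},
    {"customer_id": 1, "yearinfo": "2019", "monthinfo": "12", "clue_state_stat": "1", "origin_type_stat": "1"},
    {"customer_id": 2, "yearinfo": "2019", "monthinfo": "12", "clue_state_stat": "1", "origin_type_stat": "1"},
    {"customer_id": 3, "yearinfo": "2019", "monthinfo": "12", "clue_state_stat": "0", "origin_type_stat": "0"},
]
monthly = count_distinct_customers(
    sample, ["yearinfo", "monthinfo", "clue_state_stat", "origin_type_stat"]
)
```

Because the count is over distinct customers, a monthly total cannot be obtained by summing daily totals (the same customer may appear on several days), which is why each time granularity is computed from the DWM rows separately.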

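Today's content: day14
-------------------------------------------------
1) Visit-and-consultation board, incremental pipeline -- hands-on
2) Intention-customer board, requirement analysis -- ideally work it out yourself
3) Intention-customer board, modeling analysis -- understand it and try the analysis yourself
4) Bucket join optimization -- understand + take notes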

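1) Visit-and-consultation board, incremental pipeline

What is the incremental pipeline? Every day, the previous day's data goes through the following steps:
1. Data collection: business database --> ODS layer. Import the previous day's data from the business database into the ODS layer.
2. Data transformation: ODS layer --> DWD layer. Clean and transform the previous day's data in the ODS layer and load it into the DWD layer.
3. Data analysis: DWD layer --> DWS layer. Run the statistics on the previous day's data in the DWD layer and load the results into the DWS layer.
4. Data export: DWS layer --> business database (BI). A full export is acceptable here, because the volume of result data is roughly the same every day.

0. Preparation: fabricate one day of "previous day" data (this step does not exist in real production)

-- Create a table and copy the data into it
CREATE TABLE web_chat_ems_2020_11 AS SELECT * FROM web_chat_ems_2019_07 WHERE create_time BETWEEN '2019-07-01 00:00:00' AND '2019-07-01 23:59:59';
-- Change the time field of the main table to the previous day
UPDATE web_chat_ems_2020_11 SET create_time = CONCAT('2020-11-28',' ',SUBSTR(create_time,12));
-- Create the companion table; its rows already correspond to the main table's rows, so copy them straight into a new table
CREATE TABLE web_chat_text_ems_2020_11 AS SELECT * FROM web_chat_text_ems_2019_07;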

1. Incremental data collection:
1.1: How do we fetch the previous day's data from MySQL?

SELECT id,create_date_time,session_id,sid,create_time,seo_source, seo_keywords,ip,`area`,country,province,city,origin_channel, `user` AS user_match, manual_time,begin_time,end_time,last_customer_msg_time_stamp, last_agent_msg_time_stamp,reply_msg_count,msg_count,browser_name,os_info, '2020-11-28' AS starts_time FROM web_chat_ems_2020_11 WHERE create_time BETWEEN '2020-11-28 00:00:00' AND '2020-11-28 23:59:59';

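SELECT wcte.* , '2020-11-28' AS start_time
FROM (SELECT id FROM web_chat_ems_2020_11 WHERE create_time BETWEEN '2020-11-28 00:00:00' AND '2020-11-28 23:59:59') AS tmp1
JOIN web_chat_text_ems_2020_11 wcte ON tmp1.id = wcte.id;

1.2: Import the previous day's data into the ODS layer with sqoop

Question: both queries above have to run every day, and only the date changes. How do we handle that? With a shell script.
What the script does: if a date argument is passed in, import the data for that date; if no argument is passed, use the previous day's date.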

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/nev \
--username root --password 123456 \
--query 'SELECT id,create_date_time,session_id,sid,create_time,seo_source, seo_keywords,ip,`area`,country,province,city,origin_channel, `user` AS user_match, manual_time,begin_time,end_time,last_customer_msg_time_stamp, last_agent_msg_time_stamp,reply_msg_count,msg_count,browser_name,os_info, "2020-11-28" AS starts_time FROM web_chat_ems_2020_11 WHERE create_time BETWEEN "2020-11-28 00:00:00" AND "2020-11-28 23:59:59" and $CONDITIONS' \
--fields-terminated-by '\t' \
--hcatalog-database itcast_ods \
--hcatalog-table web_chat_ems \
-m 3 \
--split-by id

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/nev \
--username root --password 123456 \
--query 'SELECT wcte.* , "2020-11-28" AS start_time FROM (SELECT id FROM web_chat_ems_2020_11 WHERE create_time BETWEEN "2020-11-28 00:00:00" AND "2020-11-28 23:59:59") AS tmp1 JOIN web_chat_text_ems_2020_11 wcte ON tmp1.id = wcte.id WHERE $CONDITIONS' \
--fields-terminated-by '\t' \
--hcatalog-database itcast_ods \
--hcatalog-table web_chat_text_ems \
-m 3 \
--split-by id
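The date handling described for the script (use the date argument if one is given, otherwise fall back to the previous day) can be sketched as follows. The notes call for a shell script; this Python version only illustrates the logic, and the names `resolve_import_date` and `day_window` are hypothetical.

```python
from datetime import date, datetime, timedelta


def resolve_import_date(args, today=None):
    """Return the date to import: args[0] if given (YYYY-MM-DD), else yesterday."""
    if args:
        return datetime.strptime(args[0], "%Y-%m-%d").date()
    today = today or date.today()
    return today - timedelta(days=1)


def day_window(day):
    """Return the (start, end) timestamp strings used in the BETWEEN filter."""
    d = day.isoformat()
    return f"{d} 00:00:00", f"{d} 23:59:59"


# Example: no argument on 2020-11-29 resolves to the previous day.
start, end = day_window(resolve_import_date([], today=date(2020, 11, 29)))
```

In a real script the resolved date would replace the hard-coded "2020-11-28" literals in both sqoop queries; `sys.argv[1:]` would be passed as `args`.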

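-- Once the shell script is written, it has to collect the previous day's data every day; this scheduling can be handled by oozie

-- Cleaning and transformation: clean and transform the previous day's data in ODS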

--dynamic partition settings
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
--hive compression
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--compression takes effect on write
set hive.exec.orc.compression.strategy=COMPRESSION;

insert into table itcast_dwd.visit_consult_dwd partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
    wce.session_id,
    wce.sid,
    unix_timestamp(wce.create_time,"yyyy-MM-dd HH:mm:ss") AS create_time,
    wce.seo_source,
    wce.ip,
    wce.area,
    cast( if( wce.msg_count is null , 0 ,wce.msg_count ) as int ) as msg_count,
    wce.origin_channel,
    wcte.referrer,
    wcte.from_url,
    wcte.landing_page_url,
    wcte.url_title,
    wcte.platform_description,
    wcte.other_params,
    wcte.history,
    substr(wce.create_time,12,2) as hourinfo,
    substr(wce.create_time,1,4) as yearinfo,
    quarter(wce.create_time) as quarterinfo,
    substr(wce.create_time,6,2) as monthinfo,
    substr(wce.create_time,9,2) as dayinfo
from (select * from itcast_ods.web_chat_ems where starts_time = '2020-11-28') wce
join (select * from itcast_ods.web_chat_text_ems where start_time = '2020-11-28') wcte
on wce.id = wcte.id;

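--Incremental statistics: once new data arrives, some previously computed results may become invalid.
For example: the data from January 2020 through 27 November 2020 has already been aggregated by year, quarter, month, day and hour. After the data for 28 November is added, the statistics are recomputed as follows:
Daily results: only the results for the new day need to be added.
Hourly results: only the results for the new day's hours need to be added.
Monthly results: January through October are unaffected, but the November result changes, so the previous November result has to be deleted first.
Quarterly results: quarters 1, 2 and 3 are unaffected, but quarter 4 changes, so the previous quarter-4 result has to be deleted before the quarterly statistics are rerun.
Yearly results: the 2020 result changes as well, so the previous 2020 result has to be deleted before it is recomputed.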

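However, an ordinary hive table does not support deleting individual rows (no random deletes). How do we solve this? Hive does support dropping partitions. Syntax: alter table <table name> drop partition(<partition column>=<value>, ...)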

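For example, before recomputing the yearly statistics with the latest incremental data:
alter table visit_dws drop partition(yearinfo='2020',quarterinfo='-1',monthinfo='-1',dayinfo='-1');
For example, before recomputing the quarterly statistics with the incremental data:
alter table visit_dws drop partition(yearinfo='2020',quarterinfo='4',monthinfo='-1',dayinfo='-1');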
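As an illustration, the statements to run for a given day can be generated mechanically. The sketch below is not from the original notes: the helper names are hypothetical, and the month-level partition spec (quarter and month filled in, dayinfo='-1') is an assumption extrapolated from the year and quarter examples above.

```python
from datetime import date


def stale_partitions(day: date):
    """Partition specs of the year/quarter/month aggregates that a new day invalidates."""
    year = str(day.year)
    quarter = str((day.month - 1) // 3 + 1)
    month = f"{day.month:02d}"
    return [
        {"yearinfo": year, "quarterinfo": "-1", "monthinfo": "-1", "dayinfo": "-1"},
        {"yearinfo": year, "quarterinfo": quarter, "monthinfo": "-1", "dayinfo": "-1"},
        # Assumed layout for the monthly aggregate; adjust to the real DWS table.
        {"yearinfo": year, "quarterinfo": quarter, "monthinfo": month, "dayinfo": "-1"},
    ]


def drop_statements(table: str, day: date):
    """Build one 'alter table ... drop partition' statement per stale partition."""
    statements = []
    for spec in stale_partitions(day):
        cols = ",".join(f"{name}='{value}'" for name, value in spec.items())
        statements.append(f"alter table {table} drop partition({cols});")
    return statements


statements = drop_statements("visit_dws", date(2020, 11, 28))
```

The daily and hourly partitions are not listed because, as described above, the new day only adds results there and invalidates nothing.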

For the statistics that follow, select the rows to aggregate with a where condition. For example, daily visit counts per area:

select ..... from <dwd table> where yearinfo='2020' and quarterinfo='4' and monthinfo='11' and dayinfo='28' group by year, quarter, month, day, area;

Yearly visit counts per area:

select ..... from <dwd table> where yearinfo='2020' group by year, area;

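-- Exporting the data (simplified)
Approach: delete the current year's data from the existing MySQL results, because whatever else is affected, the current year's results always are.

Next: select all the 2020 results from the DWS table and export them in full.

Board 1 homework: be able to explain, in your own words, the metrics and dimensions of the first board, how the dimension analysis was carried out, and which optimizations were involved in the statistics.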


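Summary of the intention-customer board requirements:
Metric: number of intention customers
Dimensions:
Time: year, month, day, hour
New vs. old customer
Online vs. offline
Product attributes: area, source channel, subject, campus, consulting center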

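Intention board case: importing the raw business data --- this step does not exist in real work

create database scrm default character set utf8mb4 collate utf8mb4_unicode_ci;

From the 知行教育 analysis platform materials distributed earlier --> complete raw data set --> scrm --> import the 7 tables into MySQL one by one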

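Intention board case: modeling analysis

ODS layer:
Fact table: the intention table
One extra table is placed here: the clue table (note: this table is the fact table of a later board, so it is placed in the ODS layer to simplify later processing)
Table type: internal table + bucketed table + partitioned table + zipper table

DIM layer (dimension layer):
Employee table, campus table, subject table, customer table, department table
Table type: external table + partitioned table

For both layers: build each table to match the structure of the original source table one to one, and remember to add a start_time (extraction time) column.
Storage format and compression: ORC + ZLIB (SNAPPY)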

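DW layer:

DWD: cleaning and transformation of the ODS tables; if a table has too many fields, extract only the relevant ones
Cleaning: remove the rows already marked as deleted
Transformation: convert the values of origin_type into 0 and 1 in a new field that marks online vs. offline; split create_date_time into year, month, day and hour
Fields involved:
Regular fields: id, create_date_time, deleted, customer_id, origin_type, origin_type_stat, itcast_school_id, itcast_subject_id, creator, hourinfo
Partition fields: year (yearinfo), month (monthinfo), day (dayinfo)

DWM: pre-aggregation by dimension (cannot be done here); dimension degeneration
Combine the six dimension tables with the DWD fact table into a single table to achieve dimension degeneration
Approach: decide which fields are needed from each dimension table and put all of them into one table
Fields involved:
Regular fields: customer_id, create_date_time, clue_state_stat, origin_type_stat, area, origin_type, itcast_school_id, school_name, itcast_subject_id, itcast_subject_name, department_id, department_name, hourinfo
Partition fields: year (yearinfo), month (monthinfo), day (dayinfo)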

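Producing the data for this table requires a seven-table join across ODS + DIM.

APP layer: not needed, because DWS already holds the finished analysis for every dimension.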

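Intention-customer board

1. Learning objectives

Understand the requirements of the intention-customer board

Master Hive bucketing

Master Map Join

Master Bucket-Map Join

Master SMB Join

Be able to collect the full intention-customer data set

Be able to use Hive execution plans

Be able to write the DWD cleaning and transformation SQL for the intention-customer metrics

Be able to write the DWM middle-layer SQL for the intention-customer metrics

Be able to write the DWS business-layer SQL for the intention-customer metrics

Be able to export the analysis results to MySQL

Understand incremental collection and import for zipper tables

Master incremental cleaning of changed data

Master incremental analysis of changed data

Be able to export incremental data to MySQL with Sqoop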

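2. Board requirements

The metrics are: 1. total intention count; 2. heat map of intention students' locations; 3. intention subject ranking; 4. intention campus ranking; 5. source channel share; 6. share by contributing consulting center.

1.1 Total intention count

Description: the total number of new intention customers within the statistics period (including intention customers entered manually).

Display: line chart

Filters: year, month, online/offline

Dimensions: year, month, online/offline

Metric: total number of intention customers

Granularity: day, with drill-down to hourly data.

Data source: the customer_relationship intention table of the customer management system

SQL: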

SELECT
date_format(
cr.create_date_time,
'%Y-%m-%d'
),
count(DISTINCT cr.customer_id)
FROM
customer_relationship cr
WHERE
cr.create_date_time >= '2019-12-01'
AND cr.create_date_time <= '2019-12-31 23:59:59'
GROUP BY
date_format(
cr.create_date_time,
'%Y-%m-%d'
);

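1.2 Heat map of intention students' locations

Description: a heat map of the number of new intention customers per city/area within the given time range.

Display: map heat map

Dimensions: year, month, online/offline

Metric: number of intention customer ids aggregated by area

Granularity: day, with drill-down to hourly data.

Filters: year, month, online/offline

Data source: customer (static customer information table) and customer_relationship (customer intention table) of the customer management system

SQL: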

SELECT
c.area '区域',
count(DISTINCT cr.customer_id) '总数',
DATE_FORMAT(cr.create_date_time,'%Y-%m-%d') '客户创建时间'
FROM
customer c, customer_relationship cr
WHERE cr.customer_id = c.id
AND cr.create_date_time > '2019-11-01 00:00:00'
AND cr.create_date_time < '2019-11-30 23:59:59'
GROUP BY DATE_FORMAT(cr.create_date_time,'%Y-%m-%d'), c.area
ORDER BY DATE_FORMAT(cr.create_date_time,'%Y-%m-%d') ASC, count(1) DESC

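1.3 Intention subject ranking

Description: a ranking of subjects by the number of new intention customers within the given time range. The subject name has to be looked up through a join.

Display: bar chart

Filters: year, month, online/offline

Dimensions: year, month, online/offline, subject

Metric: number of intention customers per subject

Granularity: day, with drill-down to hourly data.

Data source: customer_clue (customer clue table), customer_relationship (customer intention table) and itcast_subject (subject table) of the customer management system

SQL:

The intended subject must be taken from the subject field of the intention table, not from the clue table.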

SELECT cr.itcast_subject_id,
sj.name,
count(DISTINCT cr.customer_id)
FROM customer_clue cc,
customer_relationship cr
left join itcast_subject sj on cr.itcast_subject_id = sj.id
WHERE cc.clue_state = 'VALID_NEW_CLUES' -- new customer, new clue
AND ! cc.deleted
AND cr.origin_type IN ('NETSERVICE', 'PRESIGNUP') # online (excludes mined/manually entered records)
AND cc.create_date_time > '2019-10-01 00:00:00'
AND cc.create_date_time < '2019-11-30 23:59:59'
AND cc.customer_relationship_id = cr.id
GROUP BY cr.itcast_subject_id, sj.name
ORDER BY count(1) DESC;

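1.4 Intention campus ranking

Description: a ranking of campuses by the number of new intention customers within the given time range.

Display: bar chart

Filters: year, month, online/offline

Dimensions: year, month, online/offline, campus

Metric: number of intention customers per campus

Granularity: day, with drill-down to hourly data.

Data source: customer_clue (customer clue table), customer_relationship (customer intention table) and itcast_school (campus table) of the customer management system

Note: for the school id, 0 and null are converted to the same value, -1, during synchronization

SQL: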

SELECT cr.itcast_school_id,
sc.name,
count(DISTINCT cr.customer_id)
FROM customer_clue cc,
customer_relationship cr
left join itcast_school sc on cr.itcast_school_id = sc.id
WHERE cc.clue_state = 'VALID_NEW_CLUES' -- new customer, new clue
AND ! cc.deleted
AND cr.origin_type IN ('NETSERVICE', 'PRESIGNUP') # online (excludes mined/manually entered records)
AND cc.create_date_time > '2019-10-01 00:00:00'
AND cc.create_date_time < '2019-11-30 23:59:59'
AND cc.customer_relationship_id = cr.id
GROUP BY cr.itcast_school_id, sc.name
ORDER BY count(1) DESC;

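1.5 Source channel share

Description: the share of new intention customers per source channel within the given time range.

Display: pie chart

Filters: year, month, online/offline

Dimensions: year, month, online/offline, source channel

Granularity: day, with drill-down to hourly data.

Metric: number of intention customers per source channel

Data source: customer_clue (customer clue table) and customer_relationship (customer intention table) of the customer management system

SQL: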

SELECT
cr.origin_type '来源渠道',
count(DISTINCT cr.customer_id) '总数'
FROM
customer_relationship cr
LEFT JOIN customer_clue cc ON cc.customer_relationship_id = cr.id
WHERE
cc.clue_state = 'VALID_NEW_CLUES'
AND cr.create_date_time >= '2019-10-01 00:00:00' # lower bound assumed to match the other queries; the original repeated the upper bound here
AND cr.create_date_time < '2019-11-30 23:59:59'
AND cr.origin_type IN ('NETSERVICE','PRESIGNUP') # online (excludes mined/manually entered records)
AND ! cc.deleted
GROUP BY
cr.origin_type;

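1.6 Share by contributing consulting center

Description: the share of new intention customers produced by each consulting center within the given time range.

Display: pie chart

Filters: year, month, online/offline

Dimensions: year, month, online/offline, consulting center

Metric: number of intention customers per consulting center

Granularity: day, with drill-down to hourly data.

Data source: customer_relationship (customer intention table), customer_clue (customer clue table), employee (employee table) and scrm_department (department table) of the customer management system

SQL: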

SELECT
e.tdepart_id,
sd.`name`,
count(DISTINCT cr.customer_id) '总数'
FROM
customer_relationship cr
LEFT JOIN customer_clue cc ON cc.customer_relationship_id = cr.id
LEFT JOIN employee e ON cr.creator = e.id
LEFT JOIN scrm_department sd ON e.tdepart_id = sd.id
WHERE
cc.clue_state = 'VALID_NEW_CLUES'
AND cr.create_date_time >= '2019-10-01 00:00:00'
AND cr.create_date_time <= '2019-11-30 23:59:59'
AND cr.origin_type IN ('NETSERVICE','PRESIGNUP') # online (excludes mined/manually entered records)
GROUP BY
e.tdepart_id, sd.`name`;

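1.7 Raw data structure

1.7.1 Creating the database

The intention-customer data comes from the database of the consultation management system: scrm.

create database scrm default character set utf8mb4 collate utf8mb4_unicode_ci;

Test data

The MySQL test data can be created by importing the prepared sql file: 【Home\讲义\完整原始数据\scrm.sql】. It can be imported from the mysql client:

mysql -h 192.168.52.150 -P 3306 -uroot -p

source G:\知行教育大数据平台\讲义\完整原始数据\scrm.sql

1.7.2 customer: static customer information table

Mainly used to look up a customer's static information through a join, such as the area.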

CREATE TABLE `customer` ( `id` int(11) NOT NULL AUTO_INCREMENT, `customer_relationship_id` int(11) DEFAULT NULL COMMENT '当前意向id', `create_date_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT '创建时间', `update_date_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '最后更新时间', `deleted` bit(1) NOT NULL DEFAULT b'0' COMMENT '是否被删除(禁用)', `name` varchar(128) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL DEFAULT '' COMMENT '姓名', `idcard` varchar(24) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT '' COMMENT '身份证号', `birth_year` int(5) DEFAULT NULL COMMENT '出生年份', `gender` varchar(8) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT 'MAN' COMMENT '性别', `phone` varchar(24) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL DEFAULT '' COMMENT '手机号', `wechat` varchar(32) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT '' COMMENT '微信', `qq` varchar(32) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT '' COMMENT 'qq号', `email` varchar(56) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT '' COMMENT '邮箱', `area` varchar(128) DEFAULT '' COMMENT '所在区域', `leave_school_date` date DEFAULT NULL COMMENT '离校时间', `graduation_date` date DEFAULT NULL COMMENT '毕业时间', `bxg_student_id` varchar(64) DEFAULT NULL COMMENT '博学谷学员ID，可能未关联到，不存在', `creator` int(11) DEFAULT NULL COMMENT '创建人ID', `origin_type` varchar(32) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT '数据来源', `origin_channel` varchar(32) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT '来源渠道', `tenant` int(11) NOT NULL DEFAULT '0', `md_id` int(11) DEFAULT '0' COMMENT '中台id', PRIMARY KEY (`id`), KEY `employee_id` (`creator`) USING BTREE, KEY `customer_relationship_id` (`customer_relationship_id`) USING BTREE, KEY `index_idcard` (`idcard`) USING BTREE, KEY `index_phone` (`phone`) USING BTREE, KEY `index_create_time` (`create_date_time`) USING BTREE, KEY `index_qq` (`qq`) USING BTREE, KEY `idx_update_time` (`update_date_time`) USING BTREE, CONSTRAINT `customer_ibfk_1` FOREIGN KEY (`creator`) REFERENCES 
`employee` (`id`)) ENGINE=InnoDB AUTO_INCREMENT=2061222 DEFAULT CHARSET=utf8;

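1.7.3 customer_relationship: customer intention table

The main intention-customer table, used to compute the fact data.

According to the requirements, a customer's intention data can be updated. Updated rows must be re-aggregated to get correct results, and it must remain possible to view historical snapshots of the data.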

CREATE TABLE `customer_relationship` ( `id` int(11) NOT NULL AUTO_INCREMENT, `create_date_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP, `update_date_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '最后更新时间', `deleted` bit(1) NOT NULL DEFAULT b'0' COMMENT '是否被删除(禁用)', `customer_id` int(11) NOT NULL DEFAULT '0' COMMENT '所属客户id', `first_id` int(11) DEFAULT NULL COMMENT '第一条客户关系id', `belonger` int(11) DEFAULT NULL COMMENT '归属人', `belonger_name` varchar(10) DEFAULT NULL COMMENT '归属人姓名', `initial_belonger` int(11) DEFAULT NULL COMMENT '初始归属人', `distribution_handler` int(11) DEFAULT NULL COMMENT '分配处理人', `business_scrm_department_id` int(11) DEFAULT '0' COMMENT '归属部门', `last_visit_time` datetime DEFAULT NULL COMMENT '最后回访时间', `next_visit_time` datetime DEFAULT NULL COMMENT '下次回访时间', `origin_type` varchar(32) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT '数据来源', `itcast_school_id` int(11) DEFAULT NULL COMMENT '校区Id', `itcast_subject_id` int(11) DEFAULT NULL COMMENT '学科Id', `intention_study_type` varchar(32) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT '意向学习方式', `anticipat_signup_date` date DEFAULT NULL COMMENT '预计报名时间', `level` varchar(8) DEFAULT NULL COMMENT '客户级别', `creator` int(11) DEFAULT NULL COMMENT '创建人', `current_creator` int(11) DEFAULT NULL COMMENT '当前创建人：初始==创建人，当在公海拉回时为拉回人', `creator_name` varchar(32) DEFAULT '' COMMENT '创建者姓名', `origin_channel` varchar(32) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT '来源渠道', `comment` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT '' COMMENT '备注', `first_customer_clue_id` int(11) DEFAULT '0' COMMENT '第一条线索id', `last_customer_clue_id` int(11) DEFAULT '0' COMMENT '最后一条线索id', `process_state` varchar(32) DEFAULT NULL COMMENT '处理状态', `process_time` datetime DEFAULT NULL COMMENT '处理状态变动时间', `payment_state` varchar(32) DEFAULT NULL COMMENT '支付状态', `payment_time` datetime DEFAULT NULL COMMENT '支付状态变动时间', `signup_state` varchar(32) CHARACTER SET 
utf8 COLLATE utf8_bin DEFAULT NULL COMMENT '报名状态', `signup_time` datetime DEFAULT NULL COMMENT '报名时间', `notice_state` varchar(32) DEFAULT NULL COMMENT '通知状态', `notice_time` datetime DEFAULT NULL COMMENT '通知状态变动时间', `lock_state` bit(1) DEFAULT b'0' COMMENT '锁定状态', `lock_time` datetime DEFAULT NULL COMMENT '锁定状态修改时间', `itcast_clazz_id` int(11) DEFAULT NULL COMMENT '所属ems班级id', `itcast_clazz_time` datetime DEFAULT NULL COMMENT '报班时间', `payment_url` varchar(1024) DEFAULT '' COMMENT '付款链接', `payment_url_time` datetime DEFAULT NULL COMMENT '支付链接生成时间', `ems_student_id` int(11) DEFAULT NULL COMMENT 'ems的学生id', `delete_reason` varchar(64) DEFAULT NULL COMMENT '删除原因', `deleter` int(11) DEFAULT NULL COMMENT '删除人', `deleter_name` varchar(32) DEFAULT NULL COMMENT '删除人姓名', `delete_time` datetime DEFAULT NULL COMMENT '删除时间', `course_id` int(11) DEFAULT NULL COMMENT '课程ID', `course_name` varchar(64) DEFAULT NULL COMMENT '课程名称', `delete_comment` varchar(255) DEFAULT '' COMMENT '删除原因说明', `close_state` varchar(32) DEFAULT NULL COMMENT '关闭装填', `close_time` datetime DEFAULT NULL COMMENT '关闭状态变动时间', `appeal_id` int(11) DEFAULT NULL COMMENT '申诉id', `tenant` int(11) NOT NULL DEFAULT '0' COMMENT '租户', `total_fee` decimal(19,0) DEFAULT NULL COMMENT '报名费总金额', `belonged` int(11) DEFAULT NULL COMMENT '小周期归属人', `belonged_time` datetime DEFAULT NULL COMMENT '归属时间', `belonger_time` datetime DEFAULT NULL COMMENT '归属时间', `transfer` int(11) DEFAULT NULL COMMENT '转移人', `transfer_time` datetime DEFAULT NULL COMMENT '转移时间', `follow_type` int(4) DEFAULT '0' COMMENT '分配类型，0-自动分配，1-手动分配，2-自动转移，3-手动单个转移，4-手动批量转移，5-公海领取', `transfer_bxg_oa_account` varchar(64) DEFAULT NULL COMMENT '转移到博学谷归属人OA账号', `transfer_bxg_belonger_name` varchar(64) DEFAULT NULL COMMENT '转移到博学谷归属人OA姓名', PRIMARY KEY (`id`), KEY `customer_id` (`customer_id`) USING BTREE, KEY `appeal_id` (`appeal_id`) USING BTREE, KEY `create_date_time` (`create_date_time`) USING BTREE, KEY `next_visit_time` (`next_visit_time`) USING BTREE, KEY 
`last_visit_time` (`last_visit_time`) USING BTREE, KEY `itcast_school_id` (`itcast_school_id`) USING BTREE, KEY `index_delete` (`delete_time`) USING BTREE, KEY `index_class_id` (`itcast_clazz_id`) USING BTREE, KEY `belonger` (`belonger`) USING BTREE, KEY `creator` (`creator`) USING BTREE, KEY `index_itcast_subject_id` (`itcast_subject_id`) USING BTREE, KEY `idex_distribution` (`distribution_handler`) USING BTREE, CONSTRAINT `customer_relationship_ibfk_1` FOREIGN KEY (`customer_id`) REFERENCES `customer` (`id`)) ENGINE=InnoDB AUTO_INCREMENT=2060127 DEFAULT CHARSET=utf8;

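1.7.4 customer_clue: customer clue table

The customer clue table mainly stores the contact clues a customer leaves during a consultation, such as a phone number or WeChat id. In the intention-customer statistics it is mainly used to tell new customers from old ones: the clue_state value 'VALID_NEW_CLUES' means a new customer, and 'VALID_PUBLIC_NEW_CLUE' means an old customer.

According to the requirements, a customer's clue data can also be updated. Updated rows must be re-aggregated to get correct results, and it must remain possible to view historical snapshots of the data.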

CREATE TABLE `customer_clue` ( `id` int(11) NOT NULL AUTO_INCREMENT, `create_date_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT '创建时间', `update_date_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '最后更新时间', `deleted` bit(1) NOT NULL DEFAULT b'0' COMMENT '是否被删除(禁用)', `customer_id` int(11) DEFAULT NULL COMMENT '客户id', `customer_relationship_id` int(11) DEFAULT NULL COMMENT '客户关系id', `session_id` varchar(48) COLLATE utf8_bin DEFAULT '' COMMENT '七陌会话id', `sid` varchar(48) COLLATE utf8_bin DEFAULT '' COMMENT '访客id', `status` varchar(16) COLLATE utf8_bin DEFAULT '' COMMENT '状态(undeal待领取 deal 已领取 finish 已关闭 changePeer 已流转)', `user` varchar(16) COLLATE utf8_bin DEFAULT '' COMMENT '所属坐席', `create_time` datetime DEFAULT NULL COMMENT '七陌创建时间', `platform` varchar(16) COLLATE utf8_bin DEFAULT '' COMMENT '平台来源 (pc-网站咨询|wap-wap咨询|sdk-app咨询|weixin-微信咨询)', `s_name` varchar(32) COLLATE utf8_bin DEFAULT '' COMMENT '用户名称', `seo_source` varchar(255) COLLATE utf8_bin DEFAULT '' COMMENT '搜索来源', `seo_keywords` varchar(255) COLLATE utf8_bin DEFAULT '' COMMENT '关键字', `ip` varchar(48) COLLATE utf8_bin DEFAULT '' COMMENT 'IP地址', `referrer` text COLLATE utf8_bin COMMENT '上级来源页面', `from_url` text COLLATE utf8_bin COMMENT '会话来源页面', `landing_page_url` text COLLATE utf8_bin COMMENT '访客着陆页面', `url_title` varchar(1024) COLLATE utf8_bin DEFAULT '' COMMENT '咨询页面title', `to_peer` varchar(255) COLLATE utf8_bin DEFAULT '' COMMENT '所属技能组', `manual_time` datetime DEFAULT NULL COMMENT '人工开始时间', `begin_time` datetime DEFAULT NULL COMMENT '坐席领取时间 ', `reply_msg_count` int(11) DEFAULT '0' COMMENT '客服回复消息数', `total_msg_count` int(11) DEFAULT '0' COMMENT '消息总数', `msg_count` int(11) DEFAULT '0' COMMENT '客户发送消息数', `comment` varchar(1024) COLLATE utf8_bin DEFAULT '' COMMENT '备注', `finish_reason` varchar(255) COLLATE utf8_bin DEFAULT '' COMMENT '结束类型', `finish_user` varchar(32) COLLATE utf8_bin DEFAULT '' COMMENT '结束坐席', `end_time` datetime DEFAULT NULL COMMENT 
'会话结束时间', `platform_description` varchar(255) COLLATE utf8_bin DEFAULT '' COMMENT '客户平台信息', `browser_name` varchar(255) COLLATE utf8_bin DEFAULT '' COMMENT '浏览器名称', `os_info` varchar(255) COLLATE utf8_bin DEFAULT '' COMMENT '系统名称', `area` varchar(255) COLLATE utf8_bin DEFAULT NULL COMMENT '区域', `country` varchar(16) COLLATE utf8_bin DEFAULT '' COMMENT '所在国家', `province` varchar(16) COLLATE utf8_bin DEFAULT '' COMMENT '省', `city` varchar(255) COLLATE utf8_bin DEFAULT '' COMMENT '城市', `creator` int(11) DEFAULT '0' COMMENT '创建人', `name` varchar(64) COLLATE utf8_bin DEFAULT '' COMMENT '客户姓名', `idcard` varchar(24) COLLATE utf8_bin DEFAULT '' COMMENT '身份证号', `phone` varchar(24) COLLATE utf8_bin DEFAULT '' COMMENT '手机号', `itcast_school_id` int(11) DEFAULT NULL COMMENT '校区Id', `itcast_school` varchar(128) COLLATE utf8_bin DEFAULT '' COMMENT '校区', `itcast_subject_id` int(11) DEFAULT NULL COMMENT '学科Id', `itcast_subject` varchar(128) COLLATE utf8_bin DEFAULT '' COMMENT '学科', `wechat` varchar(32) COLLATE utf8_bin DEFAULT '' COMMENT '微信', `qq` varchar(32) COLLATE utf8_bin DEFAULT '' COMMENT 'qq号', `email` varchar(56) COLLATE utf8_bin DEFAULT '' COMMENT '邮箱', `gender` varchar(8) COLLATE utf8_bin DEFAULT 'MAN' COMMENT '性别', `level` varchar(8) COLLATE utf8_bin DEFAULT NULL COMMENT '客户级别', `origin_type` varchar(32) COLLATE utf8_bin DEFAULT '' COMMENT '数据来源渠道', `information_way` varchar(32) COLLATE utf8_bin DEFAULT NULL COMMENT '资讯方式', `working_years` date DEFAULT NULL COMMENT '开始工作时间', `technical_directions` varchar(255) COLLATE utf8_bin DEFAULT '' COMMENT '技术方向', `customer_state` varchar(32) COLLATE utf8_bin DEFAULT '' COMMENT '当前客户状态', `valid` bit(1) DEFAULT b'0' COMMENT '该线索是否是网资有效线索', `anticipat_signup_date` date DEFAULT NULL COMMENT '预计报名时间', `clue_state` varchar(32) COLLATE utf8_bin DEFAULT 'NOT_SUBMIT' COMMENT '线索状态', `scrm_department_id` int(11) DEFAULT NULL COMMENT 'SCRM内部部门id', `superior_url` text COLLATE utf8_bin COMMENT '诸葛获取上级页面URL', `superior_source` varchar(1024) 
COLLATE utf8_bin DEFAULT NULL COMMENT '诸葛获取上级页面URL标题', `landing_url` text COLLATE utf8_bin COMMENT '诸葛获取着陆页面URL', `landing_source` varchar(1024) COLLATE utf8_bin DEFAULT NULL COMMENT '诸葛获取着陆页面URL来源', `info_url` text COLLATE utf8_bin COMMENT '诸葛获取留咨页URL', `info_source` varchar(255) COLLATE utf8_bin DEFAULT NULL COMMENT '诸葛获取留咨页URL标题', `origin_channel` varchar(32) COLLATE utf8_bin DEFAULT '' COMMENT '投放渠道', `course_id` int(32) DEFAULT NULL, `course_name` varchar(255) COLLATE utf8_bin DEFAULT NULL, `zhuge_session_id` varchar(500) COLLATE utf8_bin DEFAULT NULL, `is_repeat` int(4) NOT NULL DEFAULT '0' COMMENT '是否重复线索(手机号维度) 0:正常 1：重复', `tenant` int(11) NOT NULL DEFAULT '0' COMMENT '租户id', `activity_id` varchar(16) COLLATE utf8_bin DEFAULT NULL COMMENT '活动id', `activity_name` varchar(64) COLLATE utf8_bin DEFAULT NULL COMMENT '活动名称', `follow_type` int(4) DEFAULT '0' COMMENT '分配类型，0-自动分配，1-手动分配，2-自动转移，3-手动单个转移，4-手动批量转移，5-公海领取', `shunt_mode_id` int(11) DEFAULT NULL COMMENT '匹配到的技能组id', `shunt_employee_group_id` int(11) DEFAULT NULL COMMENT '所属分流员工组', PRIMARY KEY (`id`), KEY `customer_id` (`customer_id`) USING BTREE, KEY `customer_relationship_id` (`customer_relationship_id`) USING BTREE, KEY `phone` (`phone`) USING BTREE, KEY `idcard` (`idcard`) USING BTREE, KEY `session_id` (`session_id`) USING BTREE, KEY `index_date_time` (`create_date_time`) USING BTREE, KEY `index_creator` (`creator`) USING BTREE, CONSTRAINT `customer_clue_ibfk_1` FOREIGN KEY (`customer_id`) REFERENCES `customer` (`id`), CONSTRAINT `customer_clue_ibfk_2` FOREIGN KEY (`customer_relationship_id`) REFERENCES `customer_relationship` (`id`)) ENGINE=InnoDB AUTO_INCREMENT=2060711 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

1.7.5 employee: employee table

Mainly used to look up employee information through a join, such as the id of the employee's department.

create table employee( id int auto_increment primary key, email varchar(64) not null comment '公司邮箱，OA登录账号', real_name varchar(32) not null comment '员工的真实姓名', phone varchar(32) not null comment '手机号，目前还没有使用；隐私问题OA接口没有提供这个属性，', department_id varchar(64) default '0' null comment 'OA中的部门编号，有负值', department_name varchar(64) default '' null comment 'OA中的部门名', remote_login bit not null comment '员工是否可以远程登录', job_number varchar(64) null comment '员工工号', cross_school bit not null comment '是否有跨校区权限', last_login_date datetime not null comment '最后登录日期', creator int(32) null comment '创建人', create_date_time datetime default CURRENT_TIMESTAMP not null comment '创建时间', update_date_time timestamp default CURRENT_TIMESTAMP not null on update CURRENT_TIMESTAMP comment '最后更新时间', deleted bit default b'0' not null comment '是否被删除(禁用)', scrm_department_id int(32) null comment 'SCRM内部部门id', leave_office bit null comment '离职状态', leave_office_time datetime null comment '离职时间', reinstated_time datetime null comment '复职时间', superior_leaders_id int null comment '上级领导ID', tdepart_id int null comment '直属部门', tenant int default 0 not null, ems_user_name varchar(32) null) comment '员工信息表';

1.7.6 scrm_department: department table

Used to look up information such as the department name.

CREATE TABLE `scrm_department` ( `id` int(11) NOT NULL AUTO_INCREMENT COMMENT '部门id', `name` varchar(255) COLLATE utf8_bin DEFAULT NULL COMMENT '部门名称', `parent_id` int(11) DEFAULT NULL COMMENT '父部门id', `create_date_time` datetime DEFAULT CURRENT_TIMESTAMP COMMENT '创建时间', `update_date_time` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '更新时间', `deleted` bit(1) DEFAULT b'0' COMMENT '删除标志', `id_path` varchar(1000) COLLATE utf8_bin DEFAULT NULL COMMENT '编码全路径', `tdepart_code` int(11) DEFAULT NULL COMMENT '直属部门', `creator` varchar(32) COLLATE utf8_bin DEFAULT NULL COMMENT '创建者', `depart_level` int(4) DEFAULT NULL COMMENT '部门层级', `depart_sign` int(4) DEFAULT NULL COMMENT '部门标志，暂时默认1', `depart_line` int(11) DEFAULT NULL COMMENT '业务线，存储业务线编码', `depart_sort` int(5) DEFAULT NULL COMMENT '排序字段', `disable_flag` int(1) DEFAULT NULL COMMENT '禁用标志', `tenant` int(11) NOT NULL DEFAULT '0', PRIMARY KEY (`id`)) ENGINE=InnoDB AUTO_INCREMENT=149 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

1.7.7 itcast_school: school (campus) table

Used to look up information such as the campus name.

CREATE TABLE `itcast_school` ( `id` int(11) NOT NULL AUTO_INCREMENT, `create_date_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT '创建时间', `update_date_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '最后更新时间', `deleted` bit(1) NOT NULL DEFAULT b'0' COMMENT '是否被删除(禁用)', `name` varchar(32) COLLATE utf8_bin NOT NULL DEFAULT '' COMMENT '校区名称', `code` varchar(32) COLLATE utf8_bin NOT NULL, `tenant` int(11) NOT NULL DEFAULT '0', PRIMARY KEY (`id`)) ENGINE=InnoDB AUTO_INCREMENT=30 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

1.7.8 itcast_subject: subject table

Used to look up information such as the subject name.

CREATE TABLE `itcast_subject` ( `id` int(11) NOT NULL AUTO_INCREMENT, `create_date_time` datetime NOT NULL COMMENT '创建时间', `update_date_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '最后更新时间', `deleted` bit(1) NOT NULL DEFAULT b'0' COMMENT '是否被删除(禁用)', `name` varchar(32) COLLATE utf8_bin DEFAULT '' COMMENT '学科名称', `code` varchar(32) COLLATE utf8_bin DEFAULT NULL, `tenant` int(11) NOT NULL DEFAULT '0', PRIMARY KEY (`id`)) ENGINE=InnoDB AUTO_INCREMENT=22 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

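2. Modeling analysis

2.1 Metrics and dimensions

Based on the board requirements, we extract the metrics and dimensions:

Sections 1.1~1.6 count, respectively, total intention customers, intention customers by area, by subject, by campus, by source channel and by consulting center; every one of them includes the dimensions year, month and online/offline.

Every metric specifies that new customers are counted, so we can split the data into new and old customers for the statistics.

The common metric is therefore the intention customer count, with the dimensions year, month, online/offline and new/old customer.

Because the data is displayed at day granularity with drill-down to the hour, day and hour must also be added to the dimensions.

The product attributes of the individual metrics become dimensions too:

The heat map of intention students' locations counts intention customers per area;

The intention subject ranking ultimately shows a ranking of subjects, but the ranking is based on the number of intention students per subject;

The intention campus ranking shows a ranking of campuses, but it is likewise based on the number of intention students per campus;

The source channel share is the share of intention students per source channel in the total; the underlying figure is still the number of intention students;

The share by contributing consulting center is similar to the source channel share and is based on the number of intention students per consulting center;

So the dimensions are: year, month, day, hour, online/offline, new/old customer, area, subject, campus, source channel, consulting center.

2.2 Layer design

We can work backwards from the result:

The dimensions that ultimately have to be computed: year, month, day, hour, online/offline, new/old customer, area, subject, campus, source channel, consulting center;
In the requirements, the conditions of every metric include time, online/offline and new/old customer; in other words, whatever the business dimension, the data has to be split by these three, so each of them can be kept as a separate field;
We therefore group the dimensions into four classes: time (year, month, day), data source (online/offline), customer attribute (new/old customer) and product attribute (total intention count, area, subject, campus, source channel, consulting center);
First extract the data into the ODS source layer, then clean and transform the detail data and store it in the DWD layer;
In DWM, join the relevant dimension data and derive the required information;
The DWS layer aggregates the joined DWM data to produce the data mart;
Synchronize the data and fields needed for OLAP to MySQL;
ODS --> DWD --> DWM --> DWS.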

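3. Implementation

3.1 Modeling

3.1.1 Metrics and dimensions

Metric: the intention customer count is the number of new intention customers per unit of time (online and offline), displayed per day.

Dimensions:

• Time: year, month, day, hour

• Data source: online/offline

• Customer attribute: new customer, old customer

• Area, subject, campus, source channel, consulting center.

3.1.2 Fact tables and dimension tables

The customer_relationship intention table holds the intention customer information; it is clearly the base fact table for the intention-customer metrics.

The customer static information table is mainly used to look up a customer's static information, such as the area. It is dimension data.

The customer_clue clue table is mainly used to tell new customers from old ones, so it is also dimension information to be joined in. However, because it contains the fact data of later metrics, it is not placed in the DIM layer.

Similarly, the employee table, the scrm_department table, the itcast_school table and the itcast_subject table all hold dimension information, so they are placed in the dimension layer as dimension tables.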

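3.1.3 Hive bucketing

Bucketing is a technique for breaking a data set into parts that are easier to manage; it divides the data at a finer granularity than partitioning.

3.1.3.1 Why bucket?

3.1.3.1.1 More efficient query processing

When the number of partitions grows so large that it could overwhelm the file system, or when the data set has no reasonable partition column, bucketing is the way out.

The data in a partition can be split further into buckets. Unlike partitioning, which splits directly on column values, bucketing usually scatters the rows by the hash of a column and distributes them across the buckets.

Note: hive hashes the value of the bucketing column and takes the remainder of the hash divided by the number of buckets. Rows with the same key therefore always land in the same bucket, but the number of rows per bucket is not necessarily equal (a bucket can even be empty if no key hashes to it).

If another table is split into small files by the same rule, a join between the two tables does not need to scan the whole table; only the data in matching buckets has to be compared, which improves efficiency.

When the data volume is large enough, bucketing gives better query efficiency than partitioning.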
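The hash-and-remainder rule and the bucket-wise join can be sketched in a few lines of Python. This is a simplified model, not Hive's implementation: it assumes the legacy behavior in which an integer column hashes to its own value (newer Hive versions can use a different hash function), and the function names are hypothetical.

```python
def bucket_of(key: int, num_buckets: int) -> int:
    """Bucket index for an int key: non-negative hash modulo the bucket count.

    Assumption: the hash of an int is the int itself (legacy Hive bucketing).
    """
    return (key & 0x7FFFFFFF) % num_buckets


def bucketize(rows, key, num_buckets):
    """Distribute rows into num_buckets lists according to bucket_of."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row[key], num_buckets)].append(row)
    return buckets


def bucket_join(left, right, key, num_buckets):
    """Inner join that only compares rows living in the same bucket number."""
    pairs = []
    left_buckets = bucketize(left, key, num_buckets)
    right_buckets = bucketize(right, key, num_buckets)
    for l_bucket, r_bucket in zip(left_buckets, right_buckets):
        index = {}
        for r in r_bucket:
            index.setdefault(r[key], []).append(r)
        for l in l_bucket:
            for r in index.get(l[key], []):
                pairs.append((l, r))
    return pairs


# Keys 0, 6 and 12 all map to bucket 0 of 6, so the other buckets stay empty.
sizes = [len(b) for b in bucketize([{"id": 0}, {"id": 6}, {"id": 12}], "id", 6)]
```

The join works bucket by bucket because equal keys always share a bucket number when both tables use the same column and bucket count.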

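3.1.3.1.2 Data sampling

In real big-data analysis the data volume is large, so development and self-testing are slow, which seriously delays the project. Bucketing can be used to sample the data in this situation. Sampling works on a representative subset of the results instead of the full result set; analyzing the sample makes development and self-testing fast and saves a great deal of development cost.

3.1.3.2 Differences between bucketing and partitioning

Bucketing processes data at a finer granularity than partitioning: partitioning acts on the storage path of the data, bucketing acts on the data files;
Bucketing splits by a hash function of the column, so the result is fairly even; partitioning splits by the column value, which easily causes data skew;
Bucketing and partitioning do not interfere with each other; a partitioned table can be bucketed further.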

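3.1.3.3 Operations

Creating a bucketed table

create table test_buck(id int, name string)

clustered by(id) sorted by (id asc) into 6 buckets

row format delimited fields terminated by '\t';

CLUSTERED BY specifies the column used to assign rows to buckets;

SORTED BY sorts each bucket by one or more columns;

into 6 buckets specifies the number of buckets.

Bucketing rule: hive takes the remainder of the key's hash divided by the number of buckets, which spreads the data roughly evenly across all buckets.

Viewing the bucketed table's metadata

desc formatted test_buck;

Inserting data

--enable bucketing

set hive.enforce.bucketing=true;

insert into table test_buck select id, name from temp_buck;

hive.enforce.bucketing: enables bucketed tables, i.e. whether bucketing is enforced on write. The default is false; when it is on, rows written to the table are distributed into buckets.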

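3.1.3.4 Handling text data

Note: do not use load data to insert into a bucketed table, because data imported with load data has no bucket structure.

How can an accidental load data into a bucketed table be prevented?

--forbid load on bucketed tables

set hive.strict.checks.bucketing = true;

This setting can also be changed in the hive configuration in CM; a load data against a bucketed table then fails with an error.

So how should text data be handled?

(1) First create a temporary table and import the txt file into it with load data.

--create the temporary table

create table temp_buck(id int, name string)

row format delimited fields terminated by '\t';

--import the data

load data local inpath '/tools/test_buck.txt' into table temp_buck;

(2) Use an insert select statement to move the data from the temporary table into the bucketed table.

--enable bucketing

set hive.enforce.bucketing=true;

--forbid load on bucketed tables

set hive.strict.checks.bucketing = true;

--insert select

insert into table test_buck select id, name from temp_buck;

--bucketing succeeded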

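3.1.3.5 Data sampling

Tables are usually bucketed for two purposes: faster queries and sampling. With the steps above we can already create a bucketed table and load data into it. In production, on very large data sets, users sometimes need a representative result rather than the full result, for example during development and self-testing. Hive can meet this need by sampling the table.

Syntax

select * from table tablesample(bucket x out of y on column)

Hive determines the sampling ratio from y. For the sample to map onto whole buckets, y should be a multiple or a factor of the table's total bucket count.

For example, if the table has 10 buckets, y=2 samples (10/2=) 5 buckets of data, and y=10 samples (10/10=) 1 bucket.

x is the bucket to start from. When several buckets are sampled, each subsequent bucket number is the previous one plus y.

For example, with 6 buckets in total, tablesample(bucket 1 out of 2) samples (6/2=) 3 buckets, starting from bucket 1: bucket 1 (x), bucket 3 (x+y), and bucket 5 (x+2y).

Note: x must be less than or equal to y. Otherwise an exception is thrown: FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck.

Example

select * from test_buck tablesample(bucket 1 out of 10 on id);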
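The bucket arithmetic above can be tried on the 6-bucket test_buck table. The queries below are a small sketch; the choice of x and y is illustrative only:

```sql
-- test_buck has 6 buckets, so y = 2 and y = 6 both map onto whole buckets.
-- 6 / 2 = 3 buckets are read: buckets 1, 3 and 5.
select * from test_buck tablesample(bucket 1 out of 2 on id);

-- 6 / 6 = 1 bucket is read: bucket 2 only.
select * from test_buck tablesample(bucket 2 out of 6 on id);
```

Sampling on the same column the table is clustered by is what lets Hive pick bucket files instead of filtering every row.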

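Note: sqoop does not support bucketed tables. To import into a bucketed table with sqoop, go through an intermediate temporary table. Alternatively, leave ODS unbucketed and start bucketing at the DWD detail layer.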

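3.1.3.6 Map Join

As the name suggests, a MapJoin joins the tables in the Map phase instead of waiting for the Reduce phase. This avoids the large data transfer of the Shuffle phase and so speeds up the job.

For a MapJoin to work, one condition must hold: apart from one table whose data is distributed across the Maps, every other table in the join must be available as a complete copy in each Map.

MapJoin therefore does not fit every scenario. It is typically used when one of the two tables is very large and the other is small enough to be held in memory without hurting performance.

The small table's file is copied to each Map task's local storage, and the Map reads it into memory for use.

Before Hive v0.7, a MapJoin was only performed with the hint /*+ mapjoin(table) */. From Hive v0.7 on, the optimization no longer needs the hint and can be controlled with the following parameter:

set hive.auto.convert.join=true;

Hive also provides a table-size threshold that decides when MapJoin is used:

--in older versions: hive.mapjoin.smalltable.filesize

set hive.auto.convert.join.noconditionaltask.size=512000000;

Note: if hive.auto.convert.join is off, this parameter has no effect. Otherwise, if the combined size of N-1 of the N tables (or partitions) in the join is below the threshold (about 512MB with the value above), the join is converted directly to a map join. The default in the CM-managed environment used here is 20MB (stock Apache Hive ships with 10MB).

MapJoin use cases:

1. One table in the join is very small

2. Non-equi joins

3.1.3.6.1 Joining a large table with a small one

select f.a,f.b from A t join B f on ( f.a=t.a and f.ftime=20110802)

In this statement table B has 3 billion rows and table A only 100. B is also heavily skewed, with 1.5 billion rows on a single key. The query runs very slowly, and the reduce phase either takes too long or runs out of memory.

MapJoin reads the whole small table into memory and, in the map phase, matches the other table's rows directly against it. Because the join happens in the map phase and the reduce phase is skipped, it runs much faster.

It also avoids the failures caused by skew piling too many rows onto a single reducer. The original SQL can therefore use a hint to request a mapjoin (the hint names the small table by its alias, t):

select /*+ mapjoin(t)*/ f.a,f.b from A t join B f on ( f.a=t.a and f.ftime=20110802)

In practice it is enough to tune the small-table threshold to the workload; Hive then applies the mapjoin automatically and improves execution efficiency.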
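Whether the automatic conversion actually happened can be checked with EXPLAIN. The following is a sketch that reuses the A and B tables from the example above:

```sql
set hive.auto.convert.join=true;
-- threshold in bytes; 512000000 is the value used earlier in this section
set hive.auto.convert.join.noconditionaltask.size=512000000;

-- If the conversion applies, the plan contains a "Map Join Operator"
-- rather than a reduce-side "Join Operator".
explain
select f.a, f.b
from A t
join B f on (f.a = t.a and f.ftime = 20110802);
```

If the plan still shows a reduce-side join, the small table is probably larger than the threshold, or table statistics are missing.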

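3.1.3.6.2 Non-equi joins

Another major benefit of mapjoin is that it can handle non-equi joins. If the inequality is written in the where clause, the MapReduce job computes a Cartesian product, which is extremely slow; with mapjoin the non-equi join is completed in the map phase, which is far more efficient.

select A.a ,A.b from A join B where A.a>B.a

3.1.3.7 Bucket-MapJoin

3.1.3.7.1 Purpose

When two tables are joined and the smaller one does not fit in memory, but a map-side join is still wanted, use a bucket map join. Both tables are hash-bucketed on the join key, and the bucket count of the (relatively) small table you intend to replicate is set to a multiple of the large table's. The data is then bucketed by the join key. The small table is still copied to all nodes, but during the map join each group of small-table buckets is loaded as a hashtable and joined locally with the corresponding large-table bucket, so only part of the hashtable has to be loaded at a time.

3.1.3.7.2 Conditions

1) set hive.optimize.bucketmapjoin = true;
2) The bucket count of one table is an integer multiple of the other table's bucket count
3) bucket column == join column
4) It must be used in a map join scenario

Note: if the tables are not bucketed, an ordinary join is performed.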
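The conditions above can be put together in a small sketch. The tables big_demo and small_demo and their columns are hypothetical and serve only to illustrate the setup:

```sql
-- Hypothetical tables: both bucketed on the join key k; 8 is an integer multiple of 4.
create table if not exists big_demo (k int, v string)
clustered by (k) into 8 buckets stored as orc;

create table if not exists small_demo (k int, w string)
clustered by (k) into 4 buckets stored as orc;

set hive.auto.convert.join=true;
set hive.optimize.bucketmapjoin=true;

-- The join column is the bucket column on both sides.
select b.k, b.v, s.w
from big_demo b
join small_demo s on b.k = s.k;
```

Both tables have to be populated with insert ... select so that the files really follow the bucket layout the optimizer relies on.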

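3.1.3.8 SMB Join

Short for Sort Merge Bucket Join.

3.1.3.8.1 Purpose

A large table joined with a small table should be optimized with MapJoin. A large table joined with another large table is a different matter: a shuffle there is both slow and prone to failures, and this is where SMB Join improves performance. SMB Join builds on bucket mapjoin with sorted buckets, so the join can be completed on the map side, which reduces or avoids the shuffled data. Its conditions are similar to those of Map join, but not identical.

3.1.3.8.2 Conditions

bucket mapjoin	SMB join
set hive.optimize.bucketmapjoin = true;	set hive.optimize.bucketmapjoin = true; set hive.auto.convert.sortmerge.join=true; set hive.optimize.bucketmapjoin.sortedmerge = true; set hive.auto.convert.sortmerge.join.noconditionaltask=true;
The bucket count of one table is an integer multiple of the other's	Small-table bucket count = large-table bucket count
bucket column == join column	bucket column == join column == sort column
Must be used in a map join scenario	Must be used in a bucket mapjoin scenario

3.1.3.8.3 Making sure the bucket column is sorted

Hive does not check whether the two joined tables are actually bucketed and sorted. The user has to guarantee that the data is sorted; otherwise the result may be incorrect.

There are two ways to do this:

1) Set hive.enforce.sorting to true. With enforced sorting on, rows are sorted when inserted into the table. The default is false.

2) When inserting, use distribute by c1 sort by c1, or cluster by c1, in the SQL

In addition, the table must be created as both CLUSTERED and SORTED, as follows:

create table test_smb_2(mid string,age_id string)
CLUSTERED BY(mid) SORTED BY(mid) INTO 500 BUCKETS;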
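The second option can be sketched as follows. The staging table src_demo is hypothetical; it stands for any unbucketed table with the same two columns:

```sql
-- src_demo is a hypothetical unbucketed staging table with columns mid and age_id.
-- "cluster by mid" is shorthand for "distribute by mid sort by mid":
-- rows with the same mid go to the same reducer and arrive there sorted.
insert overwrite table test_smb_2
select mid, age_id
from src_demo
cluster by mid;
```

Use distribute by plus sort by instead when the sort column or sort direction differs from the distribution column.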

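To sum up, the full set of settings for working with bucketed tables is:

--enforce bucketing on write

set hive.enforce.bucketing=true;

--enforce sorting on write

set hive.enforce.sorting=true;

--enable bucketmapjoin

set hive.optimize.bucketmapjoin = true;

--enable SMB Join

set hive.auto.convert.sortmerge.join=true;

set hive.auto.convert.sortmerge.join.noconditionaltask=true;

The MapJoin settings (hive.auto.convert.join and hive.auto.convert.join.noconditionaltask.size) and the restriction on load for bucketed tables (hive.strict.checks.bucketing) can be set directly in the Hive configuration and need not be declared in the SQL.

The automatic SMB join attempt (hive.optimize.bucketmapjoin.sortedmerge) can likewise be configured in advance in the settings.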
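As a sketch of how these settings come together, the query below joins the two ODS tables defined later in this document, which are both declared with 10 buckets and are bucketed and sorted on the join key. Whether the optimizer really picks an SMB join depends on the Hive version and on the tables having been written with bucketing and sorting enforced, so treat the expected operator name as something to verify rather than a guarantee:

```sql
set hive.auto.convert.join=true;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

-- customer_relationship: clustered and sorted by id, 10 buckets
-- customer_clue: clustered and sorted by customer_relationship_id, 10 buckets
-- Look for a sorted-merge bucket map join operator in the resulting plan.
explain
select rs.id, clue.clue_state
from itcast_ods.customer_relationship rs
join itcast_ods.customer_clue clue
  on rs.id = clue.customer_relationship_id;
```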

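3.1.4 Layering

3.1.4.1 ODS

Make compression take effect on write

set hive.exec.orc.compression.strategy=COMPRESSION;

Zipper table: the intended-customer dashboard adds a new requirement on the intention data. When customer_relationship rows are updated, the affected dimensions must be re-aggregated with the latest values (for example, if data from July 2020 is modified, the July statistics must be recomputed), and historical snapshots must be kept as well.

This calls for a slowly changing dimension. An SCD2 zipper table is recommended, since it satisfies both the update requirement and the historical-snapshot requirement. On top of the start_time field, add a new end_time field to mark when a record version is closed.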
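How the two fields are used at query time can be sketched as below. The sentinel value 9999-12-31 for open records matches the extraction queries later in this document; the snapshot date and the convention that a closed record's end_time equals the day its successor starts are assumptions made for illustration:

```sql
-- Current snapshot: open record versions still carry the sentinel end date.
select *
from itcast_ods.customer_relationship
where end_time = '9999-12-31';

-- Snapshot as of a given day (hypothetical date): the version was opened
-- on or before that day and closed after it.
-- Assumes end_time of a closed version is the day the next version starts.
select *
from itcast_ods.customer_relationship
where start_time <= '2020-07-15'
  and end_time   >  '2020-07-15';
```

If the closing convention is "the day before the next version starts", the second condition becomes end_time >= the snapshot date.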

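Internal vs. external tables: the ODS layer holds raw data, which generally must not be modified, so external tables are used to keep the data safe and avoid accidental deletion. The customer_relationship and customer_clue tables in ODS are the exception: a zipper table requires overwrite operations, so they are not defined as external tables.

Bucketed ingestion: sqoop does not support bucketed tables. To import into a bucketed table with sqoop, go through an intermediate temporary table. Alternatively, leave ODS unbucketed and start bucketing at the DWD detail layer.

Bucketed joins and sampling: the ODS tables customer_relationship and customer_clue are related. customer_relationship joins through id to customer_clue's customer_relationship_id to obtain the new/returning-customer information. We therefore use these two fields as the bucketing columns, which serves both data sampling and map-side joins.

Partitioning: in the earlier visit-and-consultation dashboard, to make it easy for the T+1 extraction to pick up yesterday's data, the ODS model adds a start_time field on top of the original MySQL table. start_time can also be used as the partition column to improve query performance.

3.1.4.1.1 customer_relationship (customer intention table)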

DROP TABLE itcast_ods.`customer_relationship`;CREATE TABLE IF NOT EXISTS itcast_ods.`customer_relationship` ( `id` int COMMENT '客户关系id', `create_date_time` STRING COMMENT '创建时间', `update_date_time` STRING COMMENT '最后更新时间', `deleted` int COMMENT '是否被删除(禁用)', `customer_id` int COMMENT '所属客户id', `first_id` int COMMENT '第一条客户关系id', `belonger` int COMMENT '归属人', `belonger_name` STRING COMMENT '归属人姓名', `initial_belonger` int COMMENT '初始归属人', `distribution_handler` int COMMENT '分配处理人', `business_scrm_department_id` int COMMENT '归属部门', `last_visit_time` STRING COMMENT '最后回访时间', `next_visit_time` STRING COMMENT '下次回访时间', `origin_type` STRING COMMENT '数据来源', `itcast_school_id` int COMMENT '校区Id', `itcast_subject_id` int COMMENT '学科Id', `intention_study_type` STRING COMMENT '意向学习方式', `anticipat_signup_date` STRING COMMENT '预计报名时间', `level` STRING COMMENT '客户级别', `creator` int COMMENT '创建人', `current_creator` int COMMENT '当前创建人：初始==创建人，当在公海拉回时为拉回人', `creator_name` STRING COMMENT '创建者姓名', `origin_channel` STRING COMMENT '来源渠道', `comment` STRING COMMENT '备注', `first_customer_clue_id` int COMMENT '第一条线索id', `last_customer_clue_id` int COMMENT '最后一条线索id', `process_state` STRING COMMENT '处理状态', `process_time` STRING COMMENT '处理状态变动时间', `payment_state` STRING COMMENT '支付状态', `payment_time` STRING COMMENT '支付状态变动时间', `signup_state` STRING COMMENT '报名状态', `signup_time` STRING COMMENT '报名时间', `notice_state` STRING COMMENT '通知状态', `notice_time` STRING COMMENT '通知状态变动时间', `lock_state` STRING COMMENT '锁定状态', `lock_time` STRING COMMENT '锁定状态修改时间', `itcast_clazz_id` int COMMENT '所属ems班级id', `itcast_clazz_time` STRING COMMENT '报班时间', `payment_url` STRING COMMENT '付款链接', `payment_url_time` STRING COMMENT '支付链接生成时间', `ems_student_id` int COMMENT 'ems的学生id', `delete_reason` STRING COMMENT '删除原因', `deleter` int COMMENT '删除人', `deleter_name` STRING COMMENT '删除人姓名', `delete_time` STRING COMMENT '删除时间', `course_id` int COMMENT '课程ID', `course_name` STRING COMMENT '课程名称', `delete_comment` STRING 
COMMENT '删除原因说明', `close_state` STRING COMMENT '关闭装填', `close_time` STRING COMMENT '关闭状态变动时间', `appeal_id` int COMMENT '申诉id', `tenant` int COMMENT '租户', `total_fee` DECIMAL COMMENT '报名费总金额', `belonged` int COMMENT '小周期归属人', `belonged_time` STRING COMMENT '归属时间', `belonger_time` STRING COMMENT '归属时间', `transfer` int COMMENT '转移人', `transfer_time` STRING COMMENT '转移时间', `follow_type` int COMMENT '分配类型，0-自动分配，1-手动分配，2-自动转移，3-手动单个转移，4-手动批量转移，5-公海领取', `transfer_bxg_oa_account` STRING COMMENT '转移到博学谷归属人OA账号', `transfer_bxg_belonger_name` STRING COMMENT '转移到博学谷归属人OA姓名', `end_time` STRING COMMENT '有效截止时间')comment '客户关系表'PARTITIONED BY(start_time STRING)clustered by(id) sorted by(id) into 10 bucketsROW FORMAT DELIMITEDFIELDS TERMINATED BY '\t'stored as orcTBLPROPERTIES ('orc.compress'='ZLIB');

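3.1.4.1.2 customer_clue (customer clue table)

The table is partitioned on the start-time field (named starts_time here) to improve filtered-query performance. customer_clue is the fact table of the later valid-clue dashboard, and the requirements there also ask for re-aggregating the dimensions affected by updates with the latest values and for historical snapshots. It is therefore built as a zipper table (SCD2) with a new end-time field marking when a record version is closed (named ends_time here, because end_time already holds the session end time).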

DROP TABLE itcast_ods.customer_clue;CREATE TABLE IF NOT EXISTS itcast_ods.customer_clue ( id int COMMENT 'customer_clue_id', create_date_time STRING COMMENT '创建时间', update_date_time STRING COMMENT '最后更新时间', deleted STRING COMMENT '是否被删除(禁用)', customer_id int COMMENT '客户id', customer_relationship_id int COMMENT '客户关系id', session_id STRING COMMENT '七陌会话id', sid STRING COMMENT '访客id', status STRING COMMENT '状态(undeal待领取 deal 已领取 finish 已关闭 changePeer 已流转)', users STRING COMMENT '所属坐席', create_time STRING COMMENT '七陌创建时间', platform STRING COMMENT '平台来源 (pc-网站咨询|wap-wap咨询|sdk-app咨询|weixin-微信咨询)', s_name STRING COMMENT '用户名称', seo_source STRING COMMENT '搜索来源', seo_keywords STRING COMMENT '关键字', ip STRING COMMENT 'IP地址', referrer STRING COMMENT '上级来源页面', from_url STRING COMMENT '会话来源页面', landing_page_url STRING COMMENT '访客着陆页面', url_title STRING COMMENT '咨询页面title', to_peer STRING COMMENT '所属技能组', manual_time STRING COMMENT '人工开始时间', begin_time STRING COMMENT '坐席领取时间 ', reply_msg_count int COMMENT '客服回复消息数', total_msg_count int COMMENT '消息总数', msg_count int COMMENT '客户发送消息数', comment STRING COMMENT '备注', finish_reason STRING COMMENT '结束类型', finish_user STRING COMMENT '结束坐席', end_time STRING COMMENT '会话结束时间', platform_description STRING COMMENT '客户平台信息', browser_name STRING COMMENT '浏览器名称', os_info STRING COMMENT '系统名称', area STRING COMMENT '区域', country STRING COMMENT '所在国家', province STRING COMMENT '省', city STRING COMMENT '城市', creator int COMMENT '创建人', name STRING COMMENT '客户姓名', idcard STRING COMMENT '身份证号', phone STRING COMMENT '手机号', itcast_school_id int COMMENT '校区Id', itcast_school STRING COMMENT '校区', itcast_subject_id int COMMENT '学科Id', itcast_subject STRING COMMENT '学科', wechat STRING COMMENT '微信', qq STRING COMMENT 'qq号', email STRING COMMENT '邮箱', gender STRING COMMENT '性别', level STRING COMMENT '客户级别', origin_type STRING COMMENT '数据来源渠道', information_way STRING COMMENT '资讯方式', working_years STRING COMMENT '开始工作时间', technical_directions STRING COMMENT 
'技术方向', customer_state STRING COMMENT '当前客户状态', valid STRING COMMENT '该线索是否是网资有效线索', anticipat_signup_date STRING COMMENT '预计报名时间', clue_state STRING COMMENT '线索状态', scrm_department_id int COMMENT 'SCRM内部部门id', superior_url STRING COMMENT '诸葛获取上级页面URL', superior_source STRING COMMENT '诸葛获取上级页面URL标题', landing_url STRING COMMENT '诸葛获取着陆页面URL', landing_source STRING COMMENT '诸葛获取着陆页面URL来源', info_url STRING COMMENT '诸葛获取留咨页URL', info_source STRING COMMENT '诸葛获取留咨页URL标题', origin_channel STRING COMMENT '投放渠道', course_id int COMMENT '课程编号', course_name STRING COMMENT '课程名称', zhuge_session_id STRING COMMENT 'zhuge会话id', is_repeat int COMMENT '是否重复线索(手机号维度) 0:正常 1：重复', tenant int COMMENT '租户id', activity_id STRING COMMENT '活动id', activity_name STRING COMMENT '活动名称', follow_type int COMMENT '分配类型，0-自动分配，1-手动分配，2-自动转移，3-手动单个转移，4-手动批量转移，5-公海领取', shunt_mode_id int COMMENT '匹配到的技能组id', shunt_employee_group_id int COMMENT '所属分流员工组', ends_time STRING COMMENT '有效时间')comment '客户关系表'PARTITIONED BY(starts_time STRING)clustered by(customer_relationship_id) sorted by(customer_relationship_id) into 10 bucketsROW FORMAT DELIMITEDFIELDS TERMINATED BY '\t'stored as orcTBLPROPERTIES ('orc.compress'='ZLIB');

3.1.4.2 Dimen

For data safety, external tables are used.

Create the database

CREATE DATABASE IF NOT EXISTS itcast_dimen;

3.1.4.2.1 customer (static customer information table)

CREATE TABLE IF NOT EXISTS itcast_dimen.`customer` (
    `id` int COMMENT 'key id',
    `customer_relationship_id` int COMMENT 'current intention id',
    `create_date_time` STRING COMMENT 'creation time',
    `update_date_time` STRING COMMENT 'last update time',
    `deleted` int COMMENT 'deleted (disabled) flag',
    `name` STRING COMMENT 'name',
    `idcard` STRING COMMENT 'ID card number',
    `birth_year` int COMMENT 'year of birth',
    `gender` STRING COMMENT 'gender',
    `phone` STRING COMMENT 'mobile number',
    `wechat` STRING COMMENT 'WeChat',
    `qq` STRING COMMENT 'QQ number',
    `email` STRING COMMENT 'email',
    `area` STRING COMMENT 'region',
    `leave_school_date` date COMMENT 'school leaving date',
    `graduation_date` date COMMENT 'graduation date',
    `bxg_student_id` STRING COMMENT 'Boxuegu student ID, may be missing if not linked',
    `creator` int COMMENT 'creator ID',
    `origin_type` STRING COMMENT 'data source',
    `origin_channel` STRING COMMENT 'source channel',
    `tenant` int,
    `md_id` int COMMENT 'middle-platform id')
comment 'customer table'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

3.1.4.2.2 employee (employee table)

3.1.4.2.3 scrm_department (department table)

CREATE TABLE IF NOT EXISTS itcast_dimen.`scrm_department` (
    `id` int COMMENT 'department id',
    `name` STRING COMMENT 'department name',
    `parent_id` int COMMENT 'parent department id',
    `create_date_time` STRING COMMENT 'creation time',
    `update_date_time` STRING COMMENT 'update time',
    `deleted` STRING COMMENT 'deleted flag',
    `id_path` STRING COMMENT 'full id path',
    `tdepart_code` int COMMENT 'direct department',
    `creator` STRING COMMENT 'creator',
    `depart_level` int COMMENT 'department level',
    `depart_sign` int COMMENT 'department flag, defaults to 1 for now',
    `depart_line` int COMMENT 'business line, stores the business line code',
    `depart_sort` int COMMENT 'sort field',
    `disable_flag` int COMMENT 'disabled flag',
    `tenant` int COMMENT 'tenant')
comment 'scrm department table'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

3.1.4.2.4 itcast_school (campus table)

CREATE TABLE IF NOT EXISTS itcast_dimen.`itcast_school` (
    `id` int COMMENT 'auto-increment primary key',
    `create_date_time` timestamp COMMENT 'creation time',
    `update_date_time` timestamp COMMENT 'last update time',
    `deleted` STRING COMMENT 'deleted (disabled) flag',
    `name` STRING COMMENT 'campus name',
    `code` STRING COMMENT 'campus code',
    `tenant` int COMMENT 'tenant')
comment 'campus dictionary table'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

3.1.4.2.5 itcast_subject (subject table)

CREATE TABLE IF NOT EXISTS itcast_dimen.`itcast_subject` (
    `id` int COMMENT 'auto-increment primary key',
    `create_date_time` timestamp COMMENT 'creation time',
    `update_date_time` timestamp COMMENT 'last update time',
    `deleted` STRING COMMENT 'deleted (disabled) flag',
    `name` STRING COMMENT 'subject name',
    `code` STRING COMMENT 'subject code',
    `tenant` int COMMENT 'tenant')
comment 'subject dictionary table'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

3.1.4.3 DWD

The ODS fact data customer_relationship is cleaned and transformed, then stored in the DWD detail layer.

The DW and APP layers hold derived statistics. To make overwrite inserts and similar operations easier, and to speed up development and testing while still meeting the business requirements, internal tables are recommended.

drop table if exists itcast_dwd.`itcast_intention_dwd`;
CREATE TABLE IF NOT EXISTS itcast_dwd.`itcast_intention_dwd` (
    `rid` int COMMENT 'id',
    `customer_id` STRING COMMENT 'customer id',
    `create_date_time` STRING COMMENT 'creation time',
    `itcast_school_id` STRING COMMENT 'campus id',
    `deleted` STRING COMMENT 'deleted flag',
    `origin_type` STRING COMMENT 'source channel',
    `itcast_subject_id` STRING COMMENT 'subject id',
    `creator` int COMMENT 'creator',
    `hourinfo` STRING COMMENT 'hour',
    `origin_type_stat` STRING COMMENT 'data source: 0 = offline, 1 = online')
comment 'customer intention dwd table'
PARTITIONED BY(yearinfo STRING,monthinfo STRING,dayinfo STRING)
clustered by(rid) sorted by(rid) into 10 buckets
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');

3.1.4.4 DWM

Join all dimension tables and convert the retrieved fields so they can be used directly in the aggregation.

create database if not exists itcast_dwm;

drop table if exists itcast_dwm.`itcast_intention_dwm`;
CREATE TABLE IF NOT EXISTS itcast_dwm.`itcast_intention_dwm` (
    `customer_id` STRING COMMENT 'customer id',
    `create_date_time` STRING COMMENT 'creation time',
    `area` STRING COMMENT 'region',
    `itcast_school_id` STRING COMMENT 'campus id',
    `itcast_school_name` STRING COMMENT 'campus name',
    `deleted` STRING COMMENT 'deleted flag',
    `origin_type` STRING COMMENT 'source channel',
    `itcast_subject_id` STRING COMMENT 'subject id',
    `itcast_subject_name` STRING COMMENT 'subject name',
    `hourinfo` STRING COMMENT 'hour',
    `origin_type_stat` STRING COMMENT 'data source: 0 = offline, 1 = online',
    `clue_state_stat` STRING COMMENT 'customer type: 0 = returning customer, 1 = new customer',
    `tdepart_id` STRING COMMENT 'creator department id',
    `tdepart_name` STRING COMMENT 'consulting center name')
comment 'customer intention dwm table'
PARTITIONED BY(yearinfo STRING,monthinfo STRING,dayinfo STRING)
clustered by(customer_id) sorted by(customer_id) into 10 buckets
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');

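3.1.4.5 DWS

On top of the DWM layer, aggregate according to the business requirements. There are three standing dimensions (time, data source, customer attribute) plus the product attribute dimension, and each gets an identifying flag:

- Time: 1. year, 2. month, 3. day, 4. hour

- Data source: 0. offline; 1. online

- Customer attribute: 0. returning customer; 1. new customer

- Product attribute: 1. total intentions; 2. region; 3. campus and subject grouping; 4. source channel; 5. consulting center

drop table if exists itcast_dws.itcast_intention_dws;
CREATE TABLE IF NOT EXISTS itcast_dws.itcast_intention_dws (
    `customer_total` INT COMMENT 'aggregated intended-customer count',
    `area` STRING COMMENT 'region',
    `itcast_school_id` STRING COMMENT 'campus id',
    `itcast_school_name` STRING COMMENT 'campus name',
    `origin_type` STRING COMMENT 'source channel',
    `itcast_subject_id` STRING COMMENT 'subject id',
    `itcast_subject_name` STRING COMMENT 'subject name',
    `hourinfo` STRING COMMENT 'hour',
    `origin_type_stat` STRING COMMENT 'data source: 0 = offline, 1 = online',
    `clue_state_stat` STRING COMMENT 'customer attribute: 0 = returning customer, 1 = new customer',
    `tdepart_id` STRING COMMENT 'creator department id',
    `tdepart_name` STRING COMMENT 'consulting center name',
    `time_str` STRING COMMENT 'time detail',
    `groupType` STRING COMMENT 'product attribute category: 1 = total intentions, 2 = region, 3 = campus and subject grouping, 4 = source channel, 5 = consulting center',
    `time_type` STRING COMMENT 'time dimension: 1 = by hour, 2 = by day, 3 = by week, 4 = by month, 5 = by year')
comment 'customer intention dws table'
PARTITIONED BY(yearinfo STRING,monthinfo STRING,dayinfo STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');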

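3.1.4.6 APP

If users need specific report displays, the APP layer can be designed per report page and then exported to the MySQL database of the OLAP system. This system uses FineReport, which needs a wide table for flexible display, so the APP layer is not broken down further; the DWS layer is exported to MySQL directly.

3.2 Full-load workflow

3.2.1 Data ingestion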

--split-by id

2.1.1.2 ODS layer

Sqoop does not support bucketed tables, so the import goes through a temporary table.

2.1.1.2.1 customer_relationship (intention table)

SQL：

select id, create_date_time, update_date_time, deleted, customer_id, first_id, belonger, belonger_name, initial_belonger, distribution_handler, business_scrm_department_id, last_visit_time, next_visit_time, origin_type, itcast_school_id, itcast_subject_id, intention_study_type, anticipat_signup_date, level, creator, current_creator, creator_name, origin_channel, comment, first_customer_clue_id, last_customer_clue_id, process_state, process_time, payment_state, payment_time, signup_state, signup_time, notice_state, notice_time, lock_state, lock_time, itcast_clazz_id, itcast_clazz_time, payment_url, payment_url_time, ems_student_id, delete_reason, deleter, deleter_name, delete_time, course_id, course_name, delete_comment, close_state, close_time, appeal_id, tenant, total_fee, belonged, belonged_time, belonger_time, transfer, transfer_time, follow_type, transfer_bxg_oa_account, transfer_bxg_belonger_name, FROM_UNIXTIME(unix_timestamp(), "%Y-%m-%d") as start_time, date_format("9999-12-31", "%Y-%m-%d") as end_timefrom customer_relationship;

Sqoop：

sqoop import \

--connect jdbc:mysql://192.168.52.150:3306/scrm \

--username root \

--password 123456 \

--query 'select id, create_date_time, update_date_time, deleted, customer_id, first_id, belonger, belonger_name, initial_belonger, distribution_handler, business_scrm_department_id, last_visit_time, next_visit_time, origin_type, itcast_school_id, itcast_subject_id, intention_study_type, anticipat_signup_date, level, creator, current_creator, creator_name, origin_channel, comment, first_customer_clue_id, last_customer_clue_id, process_state, process_time, payment_state, payment_time, signup_state, signup_time, notice_state, notice_time, lock_state, lock_time, itcast_clazz_id, itcast_clazz_time, payment_url, payment_url_time, ems_student_id, delete_reason, deleter, deleter_name, delete_time, course_id, course_name, delete_comment, close_state, close_time, appeal_id, tenant, total_fee, belonged, belonged_time, belonger_time, transfer, transfer_time, follow_type, transfer_bxg_oa_account, transfer_bxg_belonger_name, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d")as start_time,date_format("9999-12-31","%Y-%m-%d") as end_time from customer_relationship where $CONDITIONS' \

--hcatalog-database itcast_ods \

--hcatalog-table customer_relationship \

-m 10 \

--split-by id

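Error:

common.HCatException : 2016 : Error operation not supported : Store into a partition with bucket definition from Pig/Mapreduce is not supported

The error occurs because sqoop cannot import data into a bucketed table. So what can we do if we still want bucketing in ODS?

We can extract the data into a temporary table first and then copy the temporary table's data into the bucketed ODS table.

2.1.1.2.1.1 Rebuild the ODS temporary table; note that it must not be bucketed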

DROP TABLE itcast_ods.`customer_relationship_tmp`;CREATE TABLE IF NOT EXISTS itcast_ods.`customer_relationship_tmp` ( `id` int COMMENT '客户关系id', `create_date_time` STRING COMMENT '创建时间', `update_date_time` STRING COMMENT '最后更新时间', `deleted` int COMMENT '是否被删除(禁用)', `customer_id` int COMMENT '所属客户id', `first_id` int COMMENT '第一条客户关系id', `belonger` int COMMENT '归属人', `belonger_name` STRING COMMENT '归属人姓名', `initial_belonger` int COMMENT '初始归属人', `distribution_handler` int COMMENT '分配处理人', `business_scrm_department_id` int COMMENT '归属部门', `last_visit_time` STRING COMMENT '最后回访时间', `next_visit_time` STRING COMMENT '下次回访时间', `origin_type` STRING COMMENT '数据来源', `itcast_school_id` int COMMENT '校区Id', `itcast_subject_id` int COMMENT '学科Id', `intention_study_type` STRING COMMENT '意向学习方式', `anticipat_signup_date` STRING COMMENT '预计报名时间', `level` STRING COMMENT '客户级别', `creator` int COMMENT '创建人', `current_creator` int COMMENT '当前创建人：初始==创建人，当在公海拉回时为拉回人', `creator_name` STRING COMMENT '创建者姓名', `origin_channel` STRING COMMENT '来源渠道', `comment` STRING COMMENT '备注', `first_customer_clue_id` int COMMENT '第一条线索id', `last_customer_clue_id` int COMMENT '最后一条线索id', `process_state` STRING COMMENT '处理状态', `process_time` STRING COMMENT '处理状态变动时间', `payment_state` STRING COMMENT '支付状态', `payment_time` STRING COMMENT '支付状态变动时间', `signup_state` STRING COMMENT '报名状态', `signup_time` STRING COMMENT '报名时间', `notice_state` STRING COMMENT '通知状态', `notice_time` STRING COMMENT '通知状态变动时间', `lock_state` STRING COMMENT '锁定状态', `lock_time` STRING COMMENT '锁定状态修改时间', `itcast_clazz_id` int COMMENT '所属ems班级id', `itcast_clazz_time` STRING COMMENT '报班时间', `payment_url` STRING COMMENT '付款链接', `payment_url_time` STRING COMMENT '支付链接生成时间', `ems_student_id` int COMMENT 'ems的学生id', `delete_reason` STRING COMMENT '删除原因', `deleter` int COMMENT '删除人', `deleter_name` STRING COMMENT '删除人姓名', `delete_time` STRING COMMENT '删除时间', `course_id` int COMMENT '课程ID', `course_name` STRING COMMENT '课程名称', `delete_comment` 
STRING COMMENT '删除原因说明', `close_state` STRING COMMENT '关闭装填', `close_time` STRING COMMENT '关闭状态变动时间', `appeal_id` int COMMENT '申诉id', `tenant` int COMMENT '租户', `total_fee` DECIMAL COMMENT '报名费总金额', `belonged` int COMMENT '小周期归属人', `belonged_time` STRING COMMENT '归属时间', `belonger_time` STRING COMMENT '归属时间', `transfer` int COMMENT '转移人', `transfer_time` STRING COMMENT '转移时间', `follow_type` int COMMENT '分配类型，0-自动分配，1-手动分配，2-自动转移，3-手动单个转移，4-手动批量转移，5-公海领取', `transfer_bxg_oa_account` STRING COMMENT '转移到博学谷归属人OA账号', `transfer_bxg_belonger_name` STRING COMMENT '转移到博学谷归属人OA姓名', `end_time` STRING COMMENT '有效截止时间')comment '客户关系表'PARTITIONED BY(start_time STRING)ROW FORMAT DELIMITEDFIELDS TERMINATED BY '\t'stored as orcTBLPROPERTIES ('orc.compress'='ZLIB');

2.1.1.2.1.2 Extract the data into the temporary table

SQL：

Sqoop：

sqoop import \

--connect jdbc:mysql://192.168.52.150:3306/scrm \

--username root \

--password 123456 \

--hcatalog-database itcast_ods \

--hcatalog-table customer_relationship_tmp \

-m 10 \

--split-by id

2.1.1.2.1.3 Overwrite-insert the data into ODS

insert overwrite table itcast_ods.customer_relationship partition(start_time)
select * from itcast_ods.customer_relationship_tmp;
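Because the partition value comes from the data (the last column, start_time) and the target table is bucketed and sorted, the insert usually needs a few session settings first. A sketch of the full sequence, assuming a Hive 1.x session (on Hive 2.x and later the two enforce settings no longer exist and should be omitted):

```sql
-- let the partition value come from the last selected column (start_time)
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- Hive 1.x only: write real buckets and keep them sorted
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;

insert overwrite table itcast_ods.customer_relationship partition(start_time)
select * from itcast_ods.customer_relationship_tmp;
```

select * works here only because the temporary table has exactly the same column order as the target, with the partition column last.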

2.1.1.2.2 customer_clue (clue table)

2.1.1.2.2.1 Rebuild the ODS temporary table; note that it must not be bucketed

DROP TABLE itcast_ods.customer_clue_tmp;CREATE TABLE IF NOT EXISTS itcast_ods.customer_clue_tmp ( id int COMMENT 'customer_clue_id', create_date_time STRING COMMENT '创建时间', update_date_time STRING COMMENT '最后更新时间', deleted STRING COMMENT '是否被删除(禁用)', customer_id int COMMENT '客户id', customer_relationship_id int COMMENT '客户关系id', session_id STRING COMMENT '七陌会话id', sid STRING COMMENT '访客id', status STRING COMMENT '状态(undeal待领取 deal 已领取 finish 已关闭 changePeer 已流转)', users STRING COMMENT '所属坐席', create_time STRING COMMENT '七陌创建时间', platform STRING COMMENT '平台来源 (pc-网站咨询|wap-wap咨询|sdk-app咨询|weixin-微信咨询)', s_name STRING COMMENT '用户名称', seo_source STRING COMMENT '搜索来源', seo_keywords STRING COMMENT '关键字', ip STRING COMMENT 'IP地址', referrer STRING COMMENT '上级来源页面', from_url STRING COMMENT '会话来源页面', landing_page_url STRING COMMENT '访客着陆页面', url_title STRING COMMENT '咨询页面title', to_peer STRING COMMENT '所属技能组', manual_time STRING COMMENT '人工开始时间', begin_time STRING COMMENT '坐席领取时间 ', reply_msg_count int COMMENT '客服回复消息数', total_msg_count int COMMENT '消息总数', msg_count int COMMENT '客户发送消息数', comment STRING COMMENT '备注', finish_reason STRING COMMENT '结束类型', finish_user STRING COMMENT '结束坐席', end_time STRING COMMENT '会话结束时间', platform_description STRING COMMENT '客户平台信息', browser_name STRING COMMENT '浏览器名称', os_info STRING COMMENT '系统名称', area STRING COMMENT '区域', country STRING COMMENT '所在国家', province STRING COMMENT '省', city STRING COMMENT '城市', creator int COMMENT '创建人', name STRING COMMENT '客户姓名', idcard STRING COMMENT '身份证号', phone STRING COMMENT '手机号', itcast_school_id int COMMENT '校区Id', itcast_school STRING COMMENT '校区', itcast_subject_id int COMMENT '学科Id', itcast_subject STRING COMMENT '学科', wechat STRING COMMENT '微信', qq STRING COMMENT 'qq号', email STRING COMMENT '邮箱', gender STRING COMMENT '性别', level STRING COMMENT '客户级别', origin_type STRING COMMENT '数据来源渠道', information_way STRING COMMENT '资讯方式', working_years STRING COMMENT '开始工作时间', technical_directions STRING 
COMMENT '技术方向', customer_state STRING COMMENT '当前客户状态', valid STRING COMMENT '该线索是否是网资有效线索', anticipat_signup_date STRING COMMENT '预计报名时间', clue_state STRING COMMENT '线索状态', scrm_department_id int COMMENT 'SCRM内部部门id', superior_url STRING COMMENT '诸葛获取上级页面URL', superior_source STRING COMMENT '诸葛获取上级页面URL标题', landing_url STRING COMMENT '诸葛获取着陆页面URL', landing_source STRING COMMENT '诸葛获取着陆页面URL来源', info_url STRING COMMENT '诸葛获取留咨页URL', info_source STRING COMMENT '诸葛获取留咨页URL标题', origin_channel STRING COMMENT '投放渠道', course_id int COMMENT '课程编号', course_name STRING COMMENT '课程名称', zhuge_session_id STRING COMMENT 'zhuge会话id', is_repeat int COMMENT '是否重复线索(手机号维度) 0:正常 1：重复', tenant int COMMENT '租户id', activity_id STRING COMMENT '活动id', activity_name STRING COMMENT '活动名称', follow_type int COMMENT '分配类型，0-自动分配，1-手动分配，2-自动转移，3-手动单个转移，4-手动批量转移，5-公海领取', shunt_mode_id int COMMENT '匹配到的技能组id', shunt_employee_group_id int COMMENT '所属分流员工组', ends_time STRING COMMENT '有效时间')comment '客户关系表'PARTITIONED BY(starts_time STRING)ROW FORMAT DELIMITEDFIELDS TERMINATED BY '\t'stored as orcTBLPROPERTIES ('orc.compress'='ZLIB');

2.1.1.2.2.2 Extract the data into the temporary table

SQL：

select id, create_date_time, update_date_time, deleted, customer_id, customer_relationship_id, session_id, sid, status, user, create_time, platform, s_name, seo_source, seo_keywords, ip, referrer, from_url, landing_page_url, url_title, to_peer, manual_time, begin_time, reply_msg_count, total_msg_count, msg_count, comment, finish_reason, finish_user, end_time, platform_description, browser_name, os_info, area, country, province, city, creator, name, idcard, phone, itcast_school_id, itcast_school, itcast_subject_id, itcast_subject, wechat, qq, email, gender, level, origin_type, information_way, working_years, technical_directions, customer_state, valid, anticipat_signup_date, clue_state, scrm_department_id, superior_url, superior_source, landing_url, landing_source, info_url, info_source, origin_channel, course_id, course_name, zhuge_session_id, is_repeat, tenant, activity_id, activity_name, follow_type, shunt_mode_id, shunt_employee_group_id, FROM_UNIXTIME(unix_timestamp(), "%Y-%m-%d") as starts_time, date_format("9999-12-31", "%Y-%m-%d") as ends_timefrom customer_clue;

Sqoop：

sqoop import \

--connect jdbc:mysql://192.168.52.150:3306/scrm \

--username root \

--password 123456 \

--query 'select id,create_date_time,update_date_time,deleted,customer_id,customer_relationship_id,session_id,sid,status,user as users,create_time,platform,s_name,seo_source,seo_keywords,ip,referrer,from_url,landing_page_url,url_title,to_peer,manual_time,begin_time,reply_msg_count,total_msg_count,msg_count,comment,finish_reason,finish_user,end_time,platform_description,browser_name,os_info,area,country,province,city,creator,name,"-1" as idcard,"-1" as phone,itcast_school_id,itcast_school,itcast_subject_id,itcast_subject,"-1" as wechat,"-1" as qq,"-1" as email,gender,level,origin_type,information_way,working_years,technical_directions,customer_state,valid,anticipat_signup_date,clue_state,scrm_department_id,superior_url,superior_source,landing_url,landing_source,info_url,info_source,origin_channel,course_id,course_name,zhuge_session_id,is_repeat,tenant,activity_id,activity_name,follow_type,shunt_mode_id,shunt_employee_group_id,FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d")as starts_time,date_format("9999-12-31","%Y-%m-%d") as ends_time from customer_clue where $CONDITIONS' \

--hcatalog-database itcast_ods \

--hcatalog-table customer_clue_tmp \

-m 10 \

--split-by id

2.1.1.2.2.3 Overwrite-insert the data into ODS

insert overwrite table itcast_ods.customer_clue partition(starts_time)
select * from itcast_ods.customer_clue_tmp;

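3.2.2 Data cleaning and transformation

3.2.2.1 Hive execution plans

2.1.1.2.3 Purpose

After a user submits a HiveQL query, Hive converts the statement into MapReduce jobs. Hive handles the whole execution automatically, and in most cases we do not need to know how it runs internally.

The execution plan exposes the key information about how a query runs and helps us judge whether an optimization has actually taken effect.

3.2.2.1.1 Basic syntax

EXPLAIN is simple to use: just put EXPLAIN in front of a normal HiveQL statement. The statement is not actually executed as a job; Hive only prints the execution path produced by the optimizer:

EXPLAIN [EXTENDED] query

extended prints more detailed information.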
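For instance, the plan of a simple per-year count over the ODS intention table can be inspected like this (a sketch; any query can take the place of the select):

```sql
-- Prints the stage dependencies and stage plans; no job is launched.
explain
select substr(create_date_time, 1, 4) as yearinfo,
       count(1)                       as cnt
from itcast_ods.customer_relationship
group by substr(create_date_time, 1, 4);
```

Replacing explain with explain extended adds details such as file paths and partition information to the same plan.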

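3.2.2.1.2 The two parts of an execution plan

Stage dependencies (STAGE DEPENDENCIES)

(1) In the sample below, this part shows that the query is split into two stages: Stage-1 and Stage-0.

(2) Stage-0 is generally the stage that presents the final data to the user; a LIMIT operation, for example, appears here.

(3) Stage-1 is the stage in which the MR job runs.

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

Detailed stage plans (STAGE PLANS)

(1) Contains most of the processing steps of every stage in the query.

(2) Whether a particular optimization took effect is checked mainly in this part.

Glossary

TableScan: the table scan operator

alias: emp: the table being read

Statistics: Num rows: 2 Data size: 820 Basic stats: COMPLETE Column stats: NONE: basic statistics of the table, such as row count and size;

expressions: empno (type: int), ename (type: string), job (type: string), mgr (type: int), hiredate (type: string), sal (type: double), comm (type: double), deptno (type: int): the columns to output and their types

outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7: the internal names of the output columns

compressed: true: whether the output is compressed;

input format: org.apache.hadoop.mapred.SequenceFileInputFormat: the Java class used to read the input, here the SequenceFile format;

output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat: the Java class used to write the output, here the SequenceFile format;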

3.2.2.1.3 Sample

Execution plan of the DWD stage:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: rs
            Statistics: Num rows: 1109147 Data size: 236547154 Basic stats: COMPLETE Column stats: COMPLETE
            Filter Operator
              predicate: (((hash(id) & 2147483647) % 10) = 0) (type: boolean)
              Statistics: Num rows: 554573 Data size: 118273474 Basic stats: COMPLETE Column stats: COMPLETE
              Select Operator
                expressions: id (type: int), customer_id (type: int), create_date_time (type: string), if((itcast_school_id is null or (itcast_school_id = 0)), -1, itcast_school_id) (type: int), deleted (type: int), origin_type (type: string), if((itcast_subject_id is null or (itcast_subject_id = 0)), -1, itcast_subject_id) (type: int), substr(create_date_time, 12, 2) (type: string), if((origin_type = 'NETSERVICE'), '1', if((origin_type = 'PRESIGNUP'), '1', '0')) (type: string), substr(create_date_time, 1, 4) (type: string), substr(create_date_time, 6, 2) (type: string), substr(create_date_time, 9, 2) (type: string)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11
                Statistics: Num rows: 554573 Data size: 631104074 Basic stats: COMPLETE Column stats: COMPLETE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 554573 Data size: 631104074 Basic stats: COMPLETE Column stats: COMPLETE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

3.2.2.2 DWD

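3.2.2.2.1 Analysis

In the DWD layer, clean and transform the customer_relationship intention fact table:

Filter out records that have been deleted;

Check the school id and the subject id, and convert null or 0 values to -1;

Convert the origin_type source channel field into an online/offline flag: if origin_type is NETSERVICE or PRESIGNUP the record is online (1), otherwise it is offline (0).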
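The three rules above can be checked on a single record before running the Hive job. The following is a minimal Python sketch, not part of the original pipeline: the function names and the sample record are made up for illustration, and only the field names and rule values come from the analysis above.

```python
# Origin types that count as "online" according to the rule above.
ONLINE_ORIGIN_TYPES = {"NETSERVICE", "PRESIGNUP"}


def normalize_id(value):
    """Map NULL (None) and 0 to -1; keep any other id unchanged."""
    return -1 if value is None or value == 0 else value


def origin_type_stat(origin_type):
    """Return '1' for online sources and '0' for everything else."""
    return "1" if origin_type in ONLINE_ORIGIN_TYPES else "0"


def clean_row(row):
    """Return the cleaned row, or None when the row is flagged as deleted."""
    if row["deleted"] != 0:
        return None
    ts = row["create_date_time"]  # expected shape: 'YYYY-MM-DD HH:MM:SS'
    return {
        "rid": row["id"],
        "customer_id": row["customer_id"],
        "itcast_school_id": normalize_id(row["itcast_school_id"]),
        "itcast_subject_id": normalize_id(row["itcast_subject_id"]),
        "origin_type_stat": origin_type_stat(row["origin_type"]),
        # Hive substr() is 1-based, so substr(ts, 12, 2) equals ts[11:13] here.
        "yearinfo": ts[0:4],
        "monthinfo": ts[5:7],
        "dayinfo": ts[8:10],
        "hourinfo": ts[11:13],
    }


# Hypothetical sample record, for illustration only.
sample = {"id": 7, "customer_id": 42, "create_date_time": "2019-11-02 09:15:00",
          "itcast_school_id": 0, "itcast_subject_id": None,
          "origin_type": "NETSERVICE", "deleted": 0}
print(clean_row(sample))
```

Keeping each rule in its own small function mirrors the separate if(...) expressions in the Hive statement, so a rule can be changed without touching the others.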

3.2.2.2.2 Code

insert into table itcast_dwd.itcast_intention_dwd partition (yearinfo, monthinfo, dayinfo)
select
    rs.id as rid,
    rs.customer_id,
    rs.create_date_time,
    if((rs.itcast_school_id is null) or (rs.itcast_school_id = 0), -1, rs.itcast_school_id) as itcast_school_id,
    rs.deleted,
    rs.origin_type,
    if((rs.itcast_subject_id is null) or (rs.itcast_subject_id = 0), -1, rs.itcast_subject_id) as itcast_subject_id,
    substr(rs.create_date_time, 12, 2) as hourinfo,
    if(rs.origin_type='NETSERVICE', '1', if(rs.origin_type='PRESIGNUP', '1', '0')) as origin_type_stat,
    substr(rs.create_date_time, 1, 4) as yearinfo,
    substr(rs.create_date_time, 6, 2) as monthinfo,
    substr(rs.create_date_time, 9, 2) as dayinfo
from itcast_ods.customer_relationship rs
where rs.deleted = 0;

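3.2.2.2.3 Testing

When testing, you can sample the data either by partition or by bucket.

A partition targets a fixed date, whereas bucket sampling takes a spot check across the whole table and is therefore more representative. Because the first run is a full extraction, the date partitions hold a very large amount of data, so sampling by bucket speeds up development and testing.

Note the position of the tablesample keyword: it goes after the table name and before the alias.

2.1.1.2.4 Verifying the execution plan

Add explain before select to view the query execution plan. The plan shows that bucket sampling has taken effect, which improves execution efficiency during development and testing.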
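To get a feel for what the sampling predicate in the plan, (((hash(id) & 2147483647) % 10) = 0), selects, it can be mimicked outside Hive. The sketch below is only an illustration under one assumption: that the hash of a single integer id is the integer itself. The function name is made up.

```python
def in_sample_bucket(row_id, bucket=1, total_buckets=10):
    """Mimic the predicate ((hash(id) & 2147483647) % total_buckets) = bucket - 1.

    Assumption: the hash of an integer id is the integer itself.
    The mask 2147483647 clears the sign bit so the remainder is never negative.
    """
    return (row_id & 2147483647) % total_buckets == bucket - 1


ids = range(1, 101)
sampled = [i for i in ids if in_sample_bucket(i)]
print(len(sampled), sampled[:3])
```

With evenly spread ids, bucket 1 out of 10 keeps roughly one tenth of the rows, in line with the filter in the plan above, which passes only ids whose hash leaves remainder 0 modulo 10.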

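2.1.1.2.5 Dynamic partition errors

Raise the limits on the number of dynamic partitions and created files by adding the following before the SQL:

set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;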

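2.1.1.2.6 Out-of-memory errors

If you run into out-of-memory errors caused by the hardware configuration, there are several ways to handle them:

2.1.1.2.6.1 Sufficient physical memory

Configure the memory settings as described for the visit and consultation dashboard:

Increase the YARN NodeManager memory

Modify the parameter yarn.nodemanager.resource.memory-mb.

Increase the MapReduce memory

Modify the parameters mapreduce.map.java.opts, mapreduce.reduce.java.opts, mapreduce.map.memory.mb and mapreduce.reduce.memory.mb.

2.1.1.2.6.2 Insufficient physical memory

Enable sorted dynamic partitioning (hive.optimize.sort.dynamic.partition=true) and disable Map Join (hive.auto.convert.join=false); the job will run more slowly.

Alternatively, use a where condition to run the cleaning and transformation in batches by date.

Check how the data is distributed across years: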

select count(1), substr(create_date_time, 1, 4) from itcast_ods.customer_relationship group by substr(create_date_time, 1, 4);

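The result shows that the data is distributed fairly evenly across years, so the job can be run in batches by year.

2.1.1.2.6.3 Local mode (virtual machine environment)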
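One way to drive the yearly batches is to generate one where predicate per year and append it to the DWD insert statement. This is a small sketch only: the helper name and the list of years are hypothetical, and the years should be taken from the distribution query above.

```python
def yearly_predicates(years, column="create_date_time"):
    """Build one Hive predicate per year, matching the substr() used for yearinfo."""
    return [f"substr({column}, 1, 4) = '{year}'" for year in years]


# Hypothetical years; use the ones returned by the distribution query.
for predicate in yearly_predicates([2018, 2019]):
    print(f"... where rs.deleted = 0 and {predicate};")
```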

set hive.exec.mode.local.auto=true;

3.2.2.3 DWM

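3.2.2.3.1 Analysis

The intention customer metric ultimately counts de-duplicated customers, so it cannot be computed by counting first and summing afterwards. In the DWM middle layer we therefore do no aggregation; we only join the relevant dimension data and derive the fields we need.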
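A tiny example shows why summing pre-aggregated counts gives the wrong answer for a distinct metric. The data below is made up for illustration: customer 1 shows up in two different hours of the same day.

```python
# (customer_id, hour) pairs for a single day; customer 1 appears in two hours.
events = [(1, "09"), (2, "09"), (1, "10"), (3, "10")]

# Distinct customers per hour.
hourly = {}
for customer_id, hour in events:
    hourly.setdefault(hour, set()).add(customer_id)

# "Count first, then sum": customer 1 is counted once per hour.
sum_of_hourly_counts = sum(len(ids) for ids in hourly.values())

# Distinct count over the whole day.
daily_distinct = len({customer_id for customer_id, _ in events})

print(sum_of_hourly_counts, daily_distinct)  # 4 3
```

The hourly counts add up to 4 while the day only has 3 distinct customers, which is why every granularity has to be computed from the detail rows.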

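Join id to customer_relationship_id in the customer_clue table and convert clue_state into a new/old customer flag: VALID_NEW_CLUES means a new customer, VALID_PUBLIC_NEW_CLUE means an old customer, and anything else is invalid data.

Join customer_id to id in the customer table to get the area information;

Join creator to the employee table to get tdepart_id, the id of the consulting center; then join employee.department_id to id in the scrm_department table to get the department name;

Join itcast_subject_id (subject id) to id in the itcast_subject table to get the subject name;

Join itcast_school_id (campus id) to id in the itcast_school table to get the campus name.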
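The joins listed above all behave like lookups that keep the fact row even when no match is found. The sketch below imitates that with dictionaries; the function names, the parameter names and the sample values are assumptions made for illustration, while the clue_state values and the output field names come from the text.

```python
def clue_state_stat(clue_state):
    """'1' = new customer, '0' = old customer, '-1' = invalid or missing."""
    if clue_state == "VALID_NEW_CLUES":
        return "1"
    if clue_state == "VALID_PUBLIC_NEW_CLUE":
        return "0"
    return "-1"


def enrich(fact, clue_by_rid, area_by_customer, subject_by_id, school_by_id):
    """Left-join style lookups: a missing key yields None instead of dropping the row."""
    return {
        **fact,
        "clue_state_stat": clue_state_stat(clue_by_rid.get(fact["rid"])),
        "area": area_by_customer.get(fact["customer_id"]),
        "itcast_subject_name": subject_by_id.get(fact["itcast_subject_id"]),
        "itcast_school_name": school_by_id.get(fact["itcast_school_id"]),
    }


# Made-up sample data.
fact = {"rid": 7, "customer_id": 42, "itcast_subject_id": 3, "itcast_school_id": -1}
row = enrich(fact, {7: "VALID_NEW_CLUES"}, {42: "Beijing"}, {3: "Java"}, {})
print(row)
```

Because no campus matches the id -1, itcast_school_name comes back as None, just as a left join would return NULL for that column.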

3.2.2.3.2 Code

insert into table itcast_dwm.itcast_intention_dwm partition (yearinfo, monthinfo, dayinfo)
select
    dwd.customer_id, dwd.create_date_time, cus.area,
    dwd.itcast_school_id, sch.name as itcast_school_name,
    dwd.deleted, dwd.origin_type,
    dwd.itcast_subject_id, sub.name as itcast_subject_name,
    dwd.hourinfo, dwd.origin_type_stat,
    if(clue.clue_state='VALID_NEW_CLUES', '1', if(clue.clue_state='VALID_PUBLIC_NEW_CLUE', '0', '-1')) as clue_state_stat,
    e.department_id as tdepart_id, dept.name as tdepart_name,
    dwd.yearinfo, dwd.monthinfo, dwd.dayinfo
from itcast_dwd.itcast_intention_dwd dwd
left join itcast_ods.customer_clue clue on clue.customer_relationship_id = dwd.rid
left join itcast_dimen.customer cus on dwd.customer_id = cus.id
left join itcast_dimen.employee e on dwd.creator = e.id
left join itcast_dimen.scrm_department dept on e.department_id = dept.id
left join itcast_dimen.itcast_subject sub on dwd.itcast_subject_id = sub.id and sub.name is not null
left join itcast_dimen.itcast_school sch on dwd.itcast_school_id = sch.id;

3.2.2.3.3 Testing

Bucket sampling can be used for testing. Because the DWD layer has already cut the data down by 9/10 through bucket sampling, sampling again here is optional.

3.2.2.3.3.1 Verifying the execution plan

The plan shows that both bucket sampling and SMB Join have taken effect. Removing the Reduce phase avoids data skew and improves execution efficiency.

explain
select
    dwd.customer_id, dwd.create_date_time, cus.area,
    dwd.itcast_school_id, sch.name as itcast_school_name,
    dwd.deleted, dwd.origin_type,
    dwd.itcast_subject_id, sub.name as itcast_subject_name,
    dwd.hourinfo, dwd.origin_type_stat,
    if(clue.clue_state='VALID_NEW_CLUES', '1', if(clue.clue_state='VALID_PUBLIC_NEW_CLUE', '0', '-1')) as clue_state_stat,
    e.department_id as tdepart_id, dept.name as tdepart_name,
    dwd.yearinfo, dwd.monthinfo, dwd.dayinfo
from itcast_dwd.itcast_intention_dwd tablesample(bucket 1 out of 10 on rid) dwd
left join itcast_ods.customer_clue clue on clue.customer_relationship_id = dwd.rid
left join itcast_dimen.customer cus on dwd.customer_id = cus.id
left join itcast_dimen.employee e on dwd.creator = e.id
left join itcast_dimen.scrm_department dept on e.department_id = dept.id
left join itcast_dimen.itcast_subject sub on dwd.itcast_subject_id = sub.id
left join itcast_dimen.itcast_school sch on dwd.itcast_school_id = sch.id;

3.2.2.3.3.2 Running the insert

insert into table itcast_dwm.itcast_intention_dwm partition (yearinfo, monthinfo, dayinfo)
select
    dwd.customer_id, dwd.create_date_time, cus.area,
    dwd.itcast_school_id, sch.name as itcast_school_name,
    dwd.deleted, dwd.origin_type,
    dwd.itcast_subject_id, sub.name as itcast_subject_name,
    dwd.hourinfo, dwd.origin_type_stat,
    if(clue.clue_state='VALID_NEW_CLUES', '1', if(clue.clue_state='VALID_PUBLIC_NEW_CLUE', '0', '-1')) as clue_state_stat,
    e.department_id as tdepart_id, dept.name as tdepart_name,
    dwd.yearinfo, dwd.monthinfo, dwd.dayinfo
from itcast_dwd.itcast_intention_dwd dwd
left join itcast_ods.customer_clue clue on clue.customer_relationship_id = dwd.rid
left join itcast_dimen.customer cus on dwd.customer_id = cus.id
left join itcast_dimen.employee e on dwd.creator = e.id
left join itcast_dimen.scrm_department dept on e.department_id = dept.id
left join itcast_dimen.itcast_subject sub on dwd.itcast_subject_id = sub.id
left join itcast_dimen.itcast_school sch on dwd.itcast_school_id = sch.id;

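3.2.3 Statistical analysis

3.2.3.1 DWS

3.2.3.1.1 Analysis

The DWS layer computes the metrics with count + distinct on the data that was cleaned, transformed and joined in DWM.

During modeling we identified the dimensions related to the metrics. They fall into four groups:

- Time dimension: 1. year, 2. month, 3. day, 4. hour

- Product attribute dimension: 1. total intention count; 2. area; 3. campus and subject combination; 4. source channel; 5. consulting center

- Data source: 0. offline; 1. online

- Customer attribute: 0. old customer; 1. new customer

The code runs one set of statistics per product attribute. The time attributes, the online/offline flag and the customer attribute are resident fields that every grouping must include.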
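The grouping scheme can be sketched as one function that takes the product and time dimensions of a grouping and always appends the resident fields. This is an illustrative Python model of the count(distinct customer_id) ... group by pattern, not the Hive implementation; the function name and the sample rows are made up.

```python
from collections import defaultdict

# Resident fields that every grouping includes.
RESIDENT = ("origin_type_stat", "clue_state_stat")


def count_distinct_customers(rows, dims):
    """Count distinct customer_id per combination of `dims` plus the resident fields."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[d] for d in tuple(dims) + RESIDENT)
        groups[key].add(row["customer_id"])
    return {key: len(ids) for key, ids in groups.items()}


# Made-up rows: customer 1 has two records in different hours.
rows = [
    {"customer_id": 1, "area": "Beijing", "hourinfo": "09", "origin_type_stat": "1", "clue_state_stat": "1"},
    {"customer_id": 1, "area": "Beijing", "hourinfo": "10", "origin_type_stat": "1", "clue_state_stat": "1"},
    {"customer_id": 2, "area": "Shanghai", "hourinfo": "10", "origin_type_stat": "1", "clue_state_stat": "1"},
]

print(count_distinct_customers(rows, ("area",)))
print(count_distinct_customers(rows, ("hourinfo",)))
print(count_distinct_customers(rows, ()))
```

Each call corresponds to one insert statement in the code section: the dimensions that are not part of a grouping are simply left out here, whereas the Hive statements fill them with '-1' so that all results fit one table.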

3.2.3.1.2 Code

3.2.3.1.2.1 New total intention count

-- Partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
-- Hive compression
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- Compression applied on write
set hive.exec.orc.compression.strategy=COMPRESSION;
-- Bucketing
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.optimize.bucketmapjoin = true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

-- Total intention count (grouped by time and the resident fields)
-- Hour (time_type 1)
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
       '-1' as area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
       '-1' as itcast_subject_id, '-1' as itcast_subject_name, hourinfo,
       origin_type_stat, clue_state_stat, '-1' as tdepart_id, '-1' as tdepart_name,
       concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
       '1' as grouptype, '1' as time_type, yearinfo, monthinfo, dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by yearinfo, monthinfo, dayinfo, hourinfo, origin_type_stat, clue_state_stat;
-- Day (time_type 2)
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
       '-1' as area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
       '-1' as itcast_subject_id, '-1' as itcast_subject_name, '-1' as hourinfo,
       origin_type_stat, clue_state_stat, '-1' as tdepart_id, '-1' as tdepart_name,
       concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
       '1' as grouptype, '2' as time_type, yearinfo, monthinfo, dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by yearinfo, monthinfo, dayinfo, origin_type_stat, clue_state_stat;
-- Month (time_type 4)
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
       '-1' as area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
       '-1' as itcast_subject_id, '-1' as itcast_subject_name, '-1' as hourinfo,
       origin_type_stat, clue_state_stat, '-1' as tdepart_id, '-1' as tdepart_name,
       concat(yearinfo,'-',monthinfo) as time_str,
       '1' as grouptype, '4' as time_type, yearinfo, monthinfo, '-1' as dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by yearinfo, monthinfo, origin_type_stat, clue_state_stat;
-- Year (time_type 5)
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
       '-1' as area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
       '-1' as itcast_subject_id, '-1' as itcast_subject_name, '-1' as hourinfo,
       origin_type_stat, clue_state_stat, '-1' as tdepart_id, '-1' as tdepart_name,
       concat(yearinfo) as time_str,
       '1' as grouptype, '5' as time_type, yearinfo, '-1' as monthinfo, '-1' as dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by yearinfo, origin_type_stat, clue_state_stat;

3.2.3.1.2.2 Heat map of intention student locations

-- Area grouping (grouped by area, time and the resident fields)
-- Hour (time_type 1)
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
       area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
       '-1' as itcast_subject_id, '-1' as itcast_subject_name, hourinfo,
       origin_type_stat, clue_state_stat, '-1' as tdepart_id, '-1' as tdepart_name,
       concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
       '2' as grouptype, '1' as time_type, yearinfo, monthinfo, dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by area, yearinfo, monthinfo, dayinfo, hourinfo, origin_type_stat, clue_state_stat;
-- Day (time_type 2)
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
       area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
       '-1' as itcast_subject_id, '-1' as itcast_subject_name, '-1' as hourinfo,
       origin_type_stat, clue_state_stat, '-1' as tdepart_id, '-1' as tdepart_name,
       concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
       '2' as grouptype, '2' as time_type, yearinfo, monthinfo, dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by area, yearinfo, monthinfo, dayinfo, origin_type_stat, clue_state_stat;
-- Month (time_type 4)
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
       area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
       '-1' as itcast_subject_id, '-1' as itcast_subject_name, '-1' as hourinfo,
       origin_type_stat, clue_state_stat, '-1' as tdepart_id, '-1' as tdepart_name,
       concat(yearinfo,'-',monthinfo) as time_str,
       '2' as grouptype, '4' as time_type, yearinfo, monthinfo, '-1' as dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by area, yearinfo, monthinfo, origin_type_stat, clue_state_stat;
-- Year (time_type 5)
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
       area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
       '-1' as itcast_subject_id, '-1' as itcast_subject_name, '-1' as hourinfo,
       origin_type_stat, clue_state_stat, '-1' as tdepart_id, '-1' as tdepart_name,
       concat(yearinfo) as time_str,
       '2' as grouptype, '5' as time_type, yearinfo, '-1' as monthinfo, '-1' as dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by area, yearinfo, origin_type_stat, clue_state_stat;

3.2.3.1.2.3 Subject and campus ranking

-- Subject and campus grouping (grouped by subject, campus, time and the resident fields)
-- Hour (time_type 1)
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
       '-1' as area, itcast_school_id, itcast_school_name, '-1' as origin_type,
       itcast_subject_id, itcast_subject_name, hourinfo,
       origin_type_stat, clue_state_stat, '-1' as tdepart_id, '-1' as tdepart_name,
       concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
       '3' as grouptype, '1' as time_type, yearinfo, monthinfo, dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by itcast_school_id, itcast_school_name, itcast_subject_id, itcast_subject_name, yearinfo, monthinfo, dayinfo, hourinfo, origin_type_stat, clue_state_stat;
-- Day (time_type 2)
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
       '-1' as area, itcast_school_id, itcast_school_name, '-1' as origin_type,
       itcast_subject_id, itcast_subject_name, '-1' as hourinfo,
       origin_type_stat, clue_state_stat, '-1' as tdepart_id, '-1' as tdepart_name,
       concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
       '3' as grouptype, '2' as time_type, yearinfo, monthinfo, dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by itcast_school_id, itcast_school_name, itcast_subject_id, itcast_subject_name, yearinfo, monthinfo, dayinfo, origin_type_stat, clue_state_stat;
-- Month (time_type 4)
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
       '-1' as area, itcast_school_id, itcast_school_name, '-1' as origin_type,
       itcast_subject_id, itcast_subject_name, '-1' as hourinfo,
       origin_type_stat, clue_state_stat, '-1' as tdepart_id, '-1' as tdepart_name,
       concat(yearinfo,'-',monthinfo) as time_str,
       '3' as grouptype, '4' as time_type, yearinfo, monthinfo, '-1' as dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by itcast_school_id, itcast_school_name, itcast_subject_id, itcast_subject_name, yearinfo, monthinfo, origin_type_stat, clue_state_stat;
-- Year (time_type 5)
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
       '-1' as area, itcast_school_id, itcast_school_name, '-1' as origin_type,
       itcast_subject_id, itcast_subject_name, '-1' as hourinfo,
       origin_type_stat, clue_state_stat, '-1' as tdepart_id, '-1' as tdepart_name,
       concat(yearinfo) as time_str,
       '3' as grouptype, '5' as time_type, yearinfo, '-1' as monthinfo, '-1' as dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by itcast_school_id, itcast_school_name, itcast_subject_id, itcast_subject_name, yearinfo, origin_type_stat, clue_state_stat;

3.2.3.1.2.4 Share by source channel

-- Source channel grouping (grouped by source channel, time and the resident fields)
-- Hour (time_type 1)
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
       '-1' as area, '-1' as itcast_school_id, '-1' as itcast_school_name, origin_type,
       '-1' as itcast_subject_id, '-1' as itcast_subject_name, hourinfo,
       origin_type_stat, clue_state_stat, '-1' as tdepart_id, '-1' as tdepart_name,
       concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
       '4' as grouptype, '1' as time_type, yearinfo, monthinfo, dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by origin_type, yearinfo, monthinfo, dayinfo, hourinfo, origin_type_stat, clue_state_stat;
-- Day (time_type 2)
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
       '-1' as area, '-1' as itcast_school_id, '-1' as itcast_school_name, origin_type,
       '-1' as itcast_subject_id, '-1' as itcast_subject_name, '-1' as hourinfo,
       origin_type_stat, clue_state_stat, '-1' as tdepart_id, '-1' as tdepart_name,
       concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
       '4' as grouptype, '2' as time_type, yearinfo, monthinfo, dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by origin_type, yearinfo, monthinfo, dayinfo, origin_type_stat, clue_state_stat;
-- Month (time_type 4)
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
       '-1' as area, '-1' as itcast_school_id, '-1' as itcast_school_name, origin_type,
       '-1' as itcast_subject_id, '-1' as itcast_subject_name, '-1' as hourinfo,
       origin_type_stat, clue_state_stat, '-1' as tdepart_id, '-1' as tdepart_name,
       concat(yearinfo,'-',monthinfo) as time_str,
       '4' as grouptype, '4' as time_type, yearinfo, monthinfo, '-1' as dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by origin_type, yearinfo, monthinfo, origin_type_stat, clue_state_stat;
-- Year (time_type 5)
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
       '-1' as area, '-1' as itcast_school_id, '-1' as itcast_school_name, origin_type,
       '-1' as itcast_subject_id, '-1' as itcast_subject_name, '-1' as hourinfo,
       origin_type_stat, clue_state_stat, '-1' as tdepart_id, '-1' as tdepart_name,
       concat(yearinfo) as time_str,
       '4' as grouptype, '5' as time_type, yearinfo, '-1' as monthinfo, '-1' as dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by origin_type, yearinfo, origin_type_stat, clue_state_stat;

3.2.3.1.2.5 Share by consulting center

-- Consulting center grouping (grouped by consulting center, time and the resident fields)
-- Hour (time_type 1)
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
       '-1' as area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
       '-1' as itcast_subject_id, '-1' as itcast_subject_name, hourinfo,
       origin_type_stat, clue_state_stat, tdepart_id, tdepart_name,
       concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
       '5' as grouptype, '1' as time_type, yearinfo, monthinfo, dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by tdepart_id, tdepart_name, yearinfo, monthinfo, dayinfo, hourinfo, origin_type_stat, clue_state_stat;
-- Day (time_type 2)
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
       '-1' as area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
       '-1' as itcast_subject_id, '-1' as itcast_subject_name, '-1' as hourinfo,
       origin_type_stat, clue_state_stat, tdepart_id, tdepart_name,
       concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
       '5' as grouptype, '2' as time_type, yearinfo, monthinfo, dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by tdepart_id, tdepart_name, yearinfo, monthinfo, dayinfo, origin_type_stat, clue_state_stat;
-- Month (time_type 4)
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
       '-1' as area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
       '-1' as itcast_subject_id, '-1' as itcast_subject_name, '-1' as hourinfo,
       origin_type_stat, clue_state_stat, tdepart_id, tdepart_name,
       concat(yearinfo,'-',monthinfo) as time_str,
       '5' as grouptype, '4' as time_type, yearinfo, monthinfo, '-1' as dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by tdepart_id, tdepart_name, yearinfo, monthinfo, origin_type_stat, clue_state_stat;
-- Year (time_type 5)
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select count(distinct customer_id) as customer_total,
       '-1' as area, '-1' as itcast_school_id, '-1' as itcast_school_name, '-1' as origin_type,
       '-1' as itcast_subject_id, '-1' as itcast_subject_name, '-1' as hourinfo,
       origin_type_stat, clue_state_stat, tdepart_id, tdepart_name,
       concat(yearinfo) as time_str,
       '5' as grouptype, '5' as time_type, yearinfo, '-1' as monthinfo, '-1' as dayinfo
from itcast_dwm.itcast_intention_dwm dwm
group by tdepart_id, tdepart_name, yearinfo, origin_type_stat, clue_state_stat;

3.2.3.2 Testing

Because the data was already reduced by bucket sampling on the way from ODS to DWD to DWM, there is no need to sample again in the DWS layer.

3.2.4 Exporting the data

3.2.4.1 Creating the MySQL table

CREATE TABLE itcast_intention_app (
  `customer_total` int(11) COMMENT 'aggregated number of intention customers',
  `area` varchar(32) COMMENT 'area information',
  `itcast_school_id` varchar(32) COMMENT 'campus id',
  `itcast_school_name` varchar(32) COMMENT 'campus name',
  `origin_type` varchar(32) COMMENT 'source channel',
  `itcast_subject_id` varchar(32) COMMENT 'subject id',
  `itcast_subject_name` varchar(32) COMMENT 'subject name',
  `hourinfo` varchar(32) COMMENT 'hour',
  `origin_type_stat` varchar(32) COMMENT 'data source: 0 offline, 1 online',
  `clue_state_stat` varchar(32) COMMENT 'customer attribute: 0 old customer, 1 new customer',
  `tdepart_id` varchar(32) COMMENT 'consulting center id',
  `tdepart_name` varchar(32) COMMENT 'consulting center name',
  `time_str` varchar(32) COMMENT 'time detail',
  `groupType` varchar(32) COMMENT 'product attribute category: 1 total intention count, 2 area, 3 campus and subject combination, 4 source channel, 5 consulting center',
  `time_type` varchar(32) COMMENT 'aggregation granularity: 1 by hour, 2 by day, 3 by week, 4 by month, 5 by year',
  `dayinfo` varchar(32) COMMENT 'day',
  `monthinfo` varchar(32) COMMENT 'month',
  `yearinfo` varchar(32) COMMENT 'year'
);

3.2.4.2 Sqoop export script

sqoop export \
--connect "jdbc:mysql://192.168.52.150:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
--username root \
--password '123456' \
--table itcast_intention_app \
--hcatalog-database itcast_dws \
--hcatalog-table itcast_intention_dws \
-m 100

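3.3 Incremental process

3.3.1 Data collection

2.1.1.3 Dimen layer

2.1.1.3.0.1 Zipper table review

A zipper table is the SCD2 approach covered earlier. Its advantage is that it preserves the historical states of the data while keeping storage to a minimum.

Implementing a zipper table requires two new fields in addition to the original ones:

- start_time (the start of the record's validity period, i.e. the state at the time of the periodic snapshot)

- end_time (the end of the record's validity period)

2.1.1.3.0.2 Collection steps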

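1. Create the temporary table update for the incremental data;
2. Extract yesterday's incremental data (inserts and updates) into the update table;
3. Create the temporary table tmp for the merged data;
4. Merge yesterday's incremental data (update table) with the historical data (zipper table):

(1) Set end_time of the new data to '9999-12-31', meaning currently valid;

(2) If the incremental data contains an id that already exists in the old data, set the old record's end_time to the day before yesterday (yesterday - 1), meaning it is no longer valid from yesterday on;

(3) Write the merged data into the tmp table;

5. Overwrite the zipper table with the data from the temporary table;
6. Rebuild the update and tmp tables before the next extraction.

When querying the zipper table, use start_time and end_time to retrieve the snapshot for a given date.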
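A snapshot query on a zipper table boils down to one range condition: a record belongs to the snapshot of a date when start_time <= date <= end_time. The Python sketch below illustrates this on made-up rows; the function name and the level values are assumptions, and the dates rely on the 'YYYY-MM-DD' text format sorting in date order.

```python
def snapshot(rows, as_of):
    """Return the rows that were valid on `as_of` ('YYYY-MM-DD').

    ISO-formatted date strings compare correctly as plain text,
    so no date parsing is needed for the range check.
    """
    return [r for r in rows if r["start_time"] <= as_of <= r["end_time"]]


# Made-up history of one record: changed on 2020-07-15.
rows = [
    {"id": 1, "level": "A", "start_time": "2020-07-01", "end_time": "2020-07-14"},
    {"id": 1, "level": "B", "start_time": "2020-07-15", "end_time": "9999-12-31"},
]

print(snapshot(rows, "2020-07-10"))
print(snapshot(rows, "2020-07-15"))
```

Because the closed record ends on the day before the new record starts, exactly one version of each id matches any given date.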

3.3.1.8 Customer_relationship

The requirement is to recompute the dimensions affected by updated customer_relationship data while keeping historical snapshots, so a zipper table (SCD2) is recommended. In addition to the start_time field, a new end_time field is needed to mark when a record's chain is closed.

3.3.1.8.1 Rebuilding the customer_relationship_update incremental table

The update table must be rebuilt every time it is used, to avoid problems caused by duplicate data.

DROP TABLE IF EXISTS itcast_ods.customer_relationship_update;
CREATE TABLE IF NOT EXISTS itcast_ods.customer_relationship_update (
    id int COMMENT 'customer relationship id',
    create_date_time STRING COMMENT 'creation time',
    update_date_time STRING COMMENT 'last update time',
    deleted int COMMENT 'whether deleted (disabled)',
    customer_id int COMMENT 'owning customer id',
    first_id int COMMENT 'first customer relationship id',
    belonger int COMMENT 'owner',
    belonger_name STRING COMMENT 'owner name',
    initial_belonger int COMMENT 'initial owner',
    distribution_handler int COMMENT 'assignment handler',
    business_scrm_department_id int COMMENT 'owning department',
    last_visit_time STRING COMMENT 'last follow-up visit time',
    next_visit_time STRING COMMENT 'next follow-up visit time',
    origin_type STRING COMMENT 'data source',
    itcast_school_id int COMMENT 'campus id',
    itcast_subject_id int COMMENT 'subject id',
    intention_study_type STRING COMMENT 'intended study mode',
    anticipat_signup_date STRING COMMENT 'expected enrollment date',
    level STRING COMMENT 'customer level',
    creator int COMMENT 'creator',
    current_creator int COMMENT 'current creator: initially equal to the creator, or the person who reclaimed the record from the public pool',
    creator_name STRING COMMENT 'creator name',
    origin_channel STRING COMMENT 'source channel',
    comment STRING COMMENT 'remarks',
    first_customer_clue_id int COMMENT 'first clue id',
    last_customer_clue_id int COMMENT 'last clue id',
    process_state STRING COMMENT 'processing state',
    process_time STRING COMMENT 'processing state change time',
    payment_state STRING COMMENT 'payment state',
    payment_time STRING COMMENT 'payment state change time',
    signup_state STRING COMMENT 'enrollment state',
    signup_time STRING COMMENT 'enrollment time',
    notice_state STRING COMMENT 'notification state',
    notice_time STRING COMMENT 'notification state change time',
    lock_state STRING COMMENT 'lock state',
    lock_time STRING COMMENT 'lock state change time',
    itcast_clazz_id int COMMENT 'ems class id',
    itcast_clazz_time STRING COMMENT 'class enrollment time',
    payment_url STRING COMMENT 'payment link',
    payment_url_time STRING COMMENT 'payment link creation time',
    ems_student_id int COMMENT 'ems student id',
    delete_reason STRING COMMENT 'deletion reason',
    deleter int COMMENT 'deleted by',
    deleter_name STRING COMMENT 'name of the person who deleted the record',
    delete_time STRING COMMENT 'deletion time',
    course_id int COMMENT 'course id',
    course_name STRING COMMENT 'course name',
    delete_comment STRING COMMENT 'deletion reason details',
    close_state STRING COMMENT 'close state',
    close_time STRING COMMENT 'close state change time',
    appeal_id int COMMENT 'appeal id',
    tenant int COMMENT 'tenant',
    total_fee DECIMAL COMMENT 'total enrollment fee',
    belonged int COMMENT 'short-cycle owner',
    belonged_time STRING COMMENT 'ownership time',
    belonger_time STRING COMMENT 'ownership time',
    transfer int COMMENT 'transferred by',
    transfer_time STRING COMMENT 'transfer time',
    follow_type int COMMENT 'assignment type: 0-automatic assignment, 1-manual assignment, 2-automatic transfer, 3-manual single transfer, 4-manual batch transfer, 5-claimed from the public pool',
    transfer_bxg_oa_account STRING COMMENT 'OA account of the 博学谷 owner after transfer',
    transfer_bxg_belonger_name STRING COMMENT 'OA name of the 博学谷 owner after transfer',
    end_time STRING COMMENT 'validity end time'
)
comment 'customer relationship table'
PARTITIONED BY (start_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='ZLIB');

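3.3.1.8.2 Extracting yesterday's inserts and updates (a logical delete also counts as an update)

Because incremental extraction is T+1, the SQL needs a where condition that selects only yesterday's data (inserts and updates) rather than the whole table.

For inserted data, create_time = yesterday; for updated data, update_time = yesterday.

Note that updated data may have been created earlier, so its creation date is not necessarily yesterday. The business side limits the update window to 30 days. In other words, for data changed yesterday, create_time is no earlier than the date 30 days ago, while update_time is yesterday's date.

The query condition must cover both the creation date and the update date, because both yesterday's inserts and yesterday's modifications need to be extracted into the warehouse.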
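The T+1 window can be derived from the run date instead of being typed by hand. The following Python sketch builds the half-open window [yesterday 00:00:00, today 00:00:00) and the matching where condition; the function names are made up for illustration, while the column names come from the text above.

```python
from datetime import date, timedelta


def yesterday_window(run_date):
    """Return (start, end) timestamps covering the day before `run_date`."""
    yesterday = run_date - timedelta(days=1)
    return (f"{yesterday.isoformat()} 00:00:00", f"{run_date.isoformat()} 00:00:00")


def incremental_predicate(run_date):
    """Build the condition that selects rows created or updated yesterday."""
    start, end = yesterday_window(run_date)
    return (
        f'((create_date_time >= "{start}" and create_date_time < "{end}") '
        f'or (update_date_time >= "{start}" and update_date_time < "{end}"))'
    )


print(yesterday_window(date(2011, 12, 5)))
print(incremental_predicate(date(2011, 12, 5)))
```

The whole condition is wrapped in one pair of parentheses so that any further condition appended with and applies to both the created and the updated branch.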

2.1.1.3.0.2.1 SQL:

select id, create_date_time, update_date_time, deleted, customer_id, first_id,
       belonger, belonger_name, initial_belonger, distribution_handler, business_scrm_department_id,
       last_visit_time, next_visit_time, origin_type, itcast_school_id, itcast_subject_id,
       intention_study_type, anticipat_signup_date, level, creator, current_creator, creator_name,
       origin_channel, comment, first_customer_clue_id, last_customer_clue_id,
       process_state, process_time, payment_state, payment_time, signup_state, signup_time,
       notice_state, notice_time, lock_state, lock_time, itcast_clazz_id, itcast_clazz_time,
       payment_url, payment_url_time, ems_student_id, delete_reason, deleter, deleter_name, delete_time,
       course_id, course_name, delete_comment, close_state, close_time, appeal_id, tenant, total_fee,
       belonged, belonged_time, belonger_time, transfer, transfer_time, follow_type,
       transfer_bxg_oa_account, transfer_bxg_belonger_name,
       FROM_UNIXTIME(unix_timestamp(), "%Y-%m-%d") as start_time,
       "9999-12-31" as end_time
from customer_relationship
where (create_date_time >= "2011-12-04 00:00:00" and create_date_time < "2011-12-05 00:00:00")
   or (update_date_time >= "2011-12-04 00:00:00" and update_date_time < "2011-12-05 00:00:00");

2.1.1.3.0.2.2 Sqoop script:

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query '
select id, create_date_time, update_date_time, deleted, customer_id, first_id,
       belonger, belonger_name, initial_belonger, distribution_handler, business_scrm_department_id,
       last_visit_time, next_visit_time, origin_type, itcast_school_id, itcast_subject_id,
       intention_study_type, anticipat_signup_date, level, creator, current_creator, creator_name,
       origin_channel, comment, first_customer_clue_id, last_customer_clue_id,
       process_state, process_time, payment_state, payment_time, signup_state, signup_time,
       notice_state, notice_time, lock_state, lock_time, itcast_clazz_id, itcast_clazz_time,
       payment_url, payment_url_time, ems_student_id, delete_reason, deleter, deleter_name, delete_time,
       course_id, course_name, delete_comment, close_state, close_time, appeal_id, tenant, total_fee,
       belonged, belonged_time, belonger_time, transfer, transfer_time, follow_type,
       transfer_bxg_oa_account, transfer_bxg_belonger_name,
       FROM_UNIXTIME(unix_timestamp(), "%Y-%m-%d") as start_time,
       date_format("9999-12-31", "%Y-%m-%d") as end_time
from customer_relationship
where (
        (create_date_time >= "2011-12-04 00:00:00" and create_date_time < "2011-12-05 00:00:00")
     or (update_date_time >= "2011-12-04 00:00:00" and update_date_time < "2011-12-05 00:00:00")
      )
  and $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table customer_relationship_update \
--hive-partition-key start_time \
--hive-partition-value 2020-07-15 \
-m 100 \
--split-by id

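3.3.1.8.3 Rebuilding the customer_relationship_tmp temporary table

The tmp table must be rebuilt every time it is used, to avoid problems caused by duplicate data.

3.3.1.8.4 Merging incremental and historical data (per the requirement, only data from the last 30 days is updated)

1. Get the updated data from the update table; the new data has end_time '9999-12-31' and start_time set to yesterday's date;
2. Get the historical data from the zipper table:

(1) Update end_time of the old data

① Join the history table customer_relationship (zipper table) to the insert/update table customer_relationship_update on id. If the update table contains an id that also exists in the history table, that record has a new change;

② end_time stays unchanged when:

1) the record has no update, so it keeps its original end_time;

2) the history record is already expired, so it keeps its original validity end date end_time;

③ Otherwise (the record has an update and the old version is currently valid), set end_time to the day before yesterday (the old version was valid until then);

(2) Because the business side limits the update window to 30 days (only data created within the last 30 days is modified), only the last 30 days of history need to be queried and updated;

3. Merge 1 (update table) with 2 (zipper table) and insert overwrite the result into the temporary table.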
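The merge rules above can be modeled in a few lines. This Python sketch is an illustration of the logic only, not the Hive implementation: the function name, the row layout and the sample values are made up, and the 30-day restriction is left out for brevity.

```python
from datetime import date, timedelta

OPEN_END = "9999-12-31"  # end_time of a currently valid record


def merge_zipper(history, updates, batch_date):
    """Merge one day's updates into zipper-table history.

    batch_date is the 'YYYY-MM-DD' day being loaded (yesterday in a T+1 run).
    """
    closed_at = (date.fromisoformat(batch_date) - timedelta(days=1)).isoformat()
    updated_ids = {row["id"] for row in updates}
    merged = []
    for row in history:
        # Close only records that have an update AND are still open;
        # unchanged or already expired records keep their end_time.
        if row["id"] in updated_ids and row["end_time"] == OPEN_END:
            row = {**row, "end_time": closed_at}
        merged.append(row)
    for row in updates:
        # New versions become valid on the batch date and stay open.
        merged.append({**row, "start_time": batch_date, "end_time": OPEN_END})
    return merged


# Made-up sample data.
history = [
    {"id": 1, "level": "A", "start_time": "2020-07-01", "end_time": OPEN_END},
    {"id": 2, "level": "A", "start_time": "2020-07-02", "end_time": OPEN_END},
]
updates = [{"id": 1, "level": "B"}]

for row in merge_zipper(history, updates, "2020-07-15"):
    print(row)
```

Record 1 ends up with two versions (closed on 2020-07-14, reopened on 2020-07-15), while record 2 is carried over untouched.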

Implementation:

insert overwrite table itcast_ods.customer_relationship_tmp partition (start_time)
select * from (
    -- 1. Updated data from the update table
    select id, create_date_time, update_date_time, deleted, customer_id, first_id,
           belonger, belonger_name, initial_belonger, distribution_handler, business_scrm_department_id,
           last_visit_time, next_visit_time, origin_type, itcast_school_id, itcast_subject_id,
           intention_study_type, anticipat_signup_date, level, creator, current_creator, creator_name,
           origin_channel, comment, first_customer_clue_id, last_customer_clue_id,
           process_state, process_time, payment_state, payment_time, signup_state, signup_time,
           notice_state, notice_time, lock_state, lock_time, itcast_clazz_id, itcast_clazz_time,
           payment_url, payment_url_time, ems_student_id, delete_reason, deleter, deleter_name, delete_time,
           course_id, course_name, delete_comment, close_state, close_time, appeal_id, tenant, total_fee,
           belonged, belonged_time, belonger_time, transfer, transfer_time, follow_type,
           transfer_bxg_oa_account, transfer_bxg_belonger_name,
           '9999-12-31' as end_time,
           '2020-07-15' as start_time
    from itcast_ods.customer_relationship_update
    where start_time = '2020-07-15'
    union all
    -- 2. Historical zipper table data, with end_time adjusted according to the update table
    select rs.id, rs.create_date_time, rs.update_date_time, rs.deleted, rs.customer_id, rs.first_id,
           rs.belonger, rs.belonger_name, rs.initial_belonger, rs.distribution_handler, rs.business_scrm_department_id,
           rs.last_visit_time, rs.next_visit_time, rs.origin_type, rs.itcast_school_id, rs.itcast_subject_id,
           rs.intention_study_type, rs.anticipat_signup_date, rs.level, rs.creator, rs.current_creator, rs.creator_name,
           rs.origin_channel, rs.comment, rs.first_customer_clue_id, rs.last_customer_clue_id,
           rs.process_state, rs.process_time, rs.payment_state, rs.payment_time, rs.signup_state, rs.signup_time,
           rs.notice_state, rs.notice_time, rs.lock_state, rs.lock_time, rs.itcast_clazz_id, rs.itcast_clazz_time,
           rs.payment_url, rs.payment_url_time, rs.ems_student_id, rs.delete_reason, rs.deleter, rs.deleter_name, rs.delete_time,
           rs.course_id, rs.course_name, rs.delete_comment, rs.close_state, rs.close_time, rs.appeal_id, rs.tenant, rs.total_fee,
           rs.belonged, rs.belonged_time, rs.belonger_time, rs.transfer, rs.transfer_time, rs.follow_type,
           rs.transfer_bxg_oa_account, rs.transfer_bxg_belonger_name,
           -- 3. Update end_time: keep the original end_time when no change was matched or the record
           --    is already expired; otherwise set it to the day before yesterday (valid until then)
           if(up.id is null or rs.end_time < '9999-12-31', rs.end_time, date_add(up.start_time, -1)) as end_time,
           rs.start_time
    from itcast_ods.customer_relationship rs
    left join (
        select * from itcast_ods.customer_relationship_update where start_time = '2020-07-15'
    ) up on rs.id = up.id
    -- 4. Time limit: only history from the last 30 days can change; the result overwrites the
    --    partitions it belongs to. The filter uses the literal batch date rather than up.start_time,
    --    which is NULL for unmatched rows and would drop unchanged records from those partitions.
    where rs.start_time >= date_add('2020-07-15', -30)
) his
order by his.id, start_time;

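3.3.1.8.5 Overwriting the zipper table from the temporary table

Note that with partitions, only the partitions present in the data are overwritten. The previous step therefore does not need to select the full history; selecting the last 30 days is enough, and data older than 30 days is not overwritten.

INSERT OVERWRITE TABLE itcast_ods.customer_relationship partition (start_time)
SELECT * from itcast_ods.customer_relationship_tmp;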

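3.3.1.8.6 Testing

After running the complete flow, check whether the matching data in the zipper table has changed:

SELECT * from itcast_ods.customer_relationship
WHERE create_date_time BETWEEN "2011-12-04 00:00:00" and "2011-12-05 00:00:00";

3.3.1.8.7 Oozie script

Write the complete zipper table process into a shell script.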

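#! /bin/bash
HIVE_HOME=/usr/bin/hive
# Use the date passed as the first argument, or default to yesterday
if [[ $1 == "" ]]; then
    TD_DATE=`date -d '1 days ago' "+%Y-%m-%d"`
else
    TD_DATE=$1
fi
# Step 1: rebuild the update table
output=$(${HIVE_HOME} -S -e "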

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
DROP TABLE IF EXISTS itcast_ods.customer_relationship_update;
CREATE TABLE IF NOT EXISTS itcast_ods.customer_relationship_update (
    id int COMMENT 'Customer relationship id',
    create_date_time STRING COMMENT 'Creation time',
    update_date_time STRING COMMENT 'Last update time',
    deleted int COMMENT 'Whether deleted (disabled)',
    customer_id int COMMENT 'Owning customer id',
    first_id int COMMENT 'First customer relationship id',
    belonger int COMMENT 'Owner',
    belonger_name STRING COMMENT 'Owner name',
    initial_belonger int COMMENT 'Initial owner',
    distribution_handler int COMMENT 'Assignment handler',
    business_scrm_department_id int COMMENT 'Owning department',
    last_visit_time STRING COMMENT 'Last follow-up visit time',
    next_visit_time STRING COMMENT 'Next follow-up visit time',
    origin_type STRING COMMENT 'Data source',
    itcast_school_id int COMMENT 'School id',
    itcast_subject_id int COMMENT 'Subject id',
    intention_study_type STRING COMMENT 'Intended study mode',
    anticipat_signup_date STRING COMMENT 'Expected signup date',
    level STRING COMMENT 'Customer level',
    creator int COMMENT 'Creator',
    current_creator int COMMENT 'Current creator: initially the creator; the person who pulled the record back when it is reclaimed from the public pool',
    creator_name STRING COMMENT 'Creator name',
    origin_channel STRING COMMENT 'Source channel',
    comment STRING COMMENT 'Remarks',
    first_customer_clue_id int COMMENT 'First clue id',
    last_customer_clue_id int COMMENT 'Last clue id',
    process_state STRING COMMENT 'Processing state',
    process_time STRING COMMENT 'Processing state change time',
    payment_state STRING COMMENT 'Payment state',
    payment_time STRING COMMENT 'Payment state change time',
    signup_state STRING COMMENT 'Signup state',
    signup_time STRING COMMENT 'Signup time',
    notice_state STRING COMMENT 'Notification state',
    notice_time STRING COMMENT 'Notification state change time',
    lock_state STRING COMMENT 'Lock state',
    lock_time STRING COMMENT 'Lock state change time',
    itcast_clazz_id int COMMENT 'EMS class id',
    itcast_clazz_time STRING COMMENT 'Class enrollment time',
    payment_url STRING COMMENT 'Payment link',
    payment_url_time STRING COMMENT 'Payment link generation time',
    ems_student_id int COMMENT 'EMS student id',
    delete_reason STRING COMMENT 'Deletion reason',
    deleter int COMMENT 'Deleted by',
    deleter_name STRING COMMENT 'Name of the person who deleted the record',
    delete_time STRING COMMENT 'Deletion time',
    course_id int COMMENT 'Course id',
    course_name STRING COMMENT 'Course name',
    delete_comment STRING COMMENT 'Deletion reason details',
    close_state STRING COMMENT 'Close state',
    close_time STRING COMMENT 'Close state change time',
    appeal_id int COMMENT 'Appeal id',
    tenant int COMMENT 'Tenant',
    total_fee DECIMAL COMMENT 'Total signup fee',
    belonged int COMMENT 'Short-cycle owner',
    belonged_time STRING COMMENT 'Ownership time',
    belonger_time STRING COMMENT 'Ownership time',
    transfer int COMMENT 'Transferred by',
    transfer_time STRING COMMENT 'Transfer time',
    follow_type int COMMENT 'Assignment type: 0 automatic assignment, 1 manual assignment, 2 automatic transfer, 3 manual single transfer, 4 manual batch transfer, 5 claimed from the public pool',
    transfer_bxg_oa_account STRING COMMENT 'OA account of the Boxuegu owner transferred to',
    transfer_bxg_belonger_name STRING COMMENT 'OA name of the Boxuegu owner transferred to',
    end_time STRING COMMENT 'Validity end date')
comment 'Customer relationship table'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='ZLIB');
")

# Step 2: extract the rows created or updated yesterday from MySQL
SQOOP_HOME=/usr/bin/sqoop
output=$(${SQOOP_HOME} import \
--connect jdbc:mysql://172.17.0.202:3306/scrm \
--username root \
--password 123456 \
--query 'select
id, create_date_time, update_date_time, deleted, customer_id, first_id, belonger, belonger_name,
initial_belonger, distribution_handler, business_scrm_department_id, last_visit_time, next_visit_time,
origin_type, itcast_school_id, itcast_subject_id, intention_study_type, anticipat_signup_date, level,
creator, current_creator, creator_name, origin_channel, comment, first_customer_clue_id, last_customer_clue_id,
process_state, process_time, payment_state, payment_time, signup_state, signup_time, notice_state, notice_time,
lock_state, lock_time, itcast_clazz_id, itcast_clazz_time, payment_url, payment_url_time, ems_student_id,
delete_reason, deleter, deleter_name, delete_time, course_id, course_name, delete_comment, close_state, close_time,
appeal_id, tenant, total_fee, belonged, belonged_time, belonger_time, transfer, transfer_time, follow_type,
transfer_bxg_oa_account, transfer_bxg_belonger_name,
FROM_UNIXTIME(unix_timestamp(), "%Y-%m-%d") as start_time,
date_format("9999-12-31", "%Y-%m-%d") as end_time
from customer_relationship
where
(
  (
    create_date_time >= FROM_UNIXTIME(UNIX_TIMESTAMP(CAST(SYSDATE() AS DATE) - INTERVAL 1 DAY),"%Y-%m-%d %H:%i:%s")
    and
    create_date_time < FROM_UNIXTIME(UNIX_TIMESTAMP(CAST(SYSDATE() AS DATE)),"%Y-%m-%d %H:%i:%s")
  )
  or
  (
    update_date_time >= FROM_UNIXTIME(UNIX_TIMESTAMP(CAST(SYSDATE() AS DATE) - INTERVAL 1 DAY),"%Y-%m-%d %H:%i:%s")
    and
    update_date_time < FROM_UNIXTIME(UNIX_TIMESTAMP(CAST(SYSDATE() AS DATE)),"%Y-%m-%d %H:%i:%s")
  )
) and $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table customer_relationship_update \
--hive-partition-key start_time \
--hive-partition-value ${TD_DATE} \
-m 100 \
--split-by id)

# Step 3: rebuild the temporary table, merge the data, and overwrite the zipper table
output=$(${HIVE_HOME} -S -e "
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- Rebuild the temporary table with the same schema as the update table
DROP TABLE IF EXISTS itcast_ods.customer_relationship_tmp;
CREATE TABLE IF NOT EXISTS itcast_ods.customer_relationship_tmp LIKE itcast_ods.customer_relationship_update;

insert overwrite table itcast_ods.customer_relationship_tmp partition (start_time)
select * from
(
    -- Rows from the update table for the processed day
    select
        id, create_date_time, update_date_time, deleted, customer_id, first_id, belonger, belonger_name,
        initial_belonger, distribution_handler, business_scrm_department_id, last_visit_time, next_visit_time,
        origin_type, itcast_school_id, itcast_subject_id, intention_study_type, anticipat_signup_date, level,
        creator, current_creator, creator_name, origin_channel, comment, first_customer_clue_id, last_customer_clue_id,
        process_state, process_time, payment_state, payment_time, signup_state, signup_time, notice_state, notice_time,
        lock_state, lock_time, itcast_clazz_id, itcast_clazz_time, payment_url, payment_url_time, ems_student_id,
        delete_reason, deleter, deleter_name, delete_time, course_id, course_name, delete_comment, close_state, close_time,
        appeal_id, tenant, total_fee, belonged, belonged_time, belonger_time, transfer, transfer_time, follow_type,
        transfer_bxg_oa_account, transfer_bxg_belonger_name,
        '9999-12-31' end_time,
        '${TD_DATE}' as start_time
    from itcast_ods.customer_relationship_update
    where start_time='${TD_DATE}'
    union all
    -- Historical zipper-table rows from the last 30 days
    select
        rs.id, rs.create_date_time, rs.update_date_time, rs.deleted, rs.customer_id, rs.first_id, rs.belonger,
        rs.belonger_name, rs.initial_belonger, rs.distribution_handler, rs.business_scrm_department_id,
        rs.last_visit_time, rs.next_visit_time, rs.origin_type, rs.itcast_school_id, rs.itcast_subject_id,
        rs.intention_study_type, rs.anticipat_signup_date, rs.level, rs.creator, rs.current_creator, rs.creator_name,
        rs.origin_channel, rs.comment, rs.first_customer_clue_id, rs.last_customer_clue_id, rs.process_state,
        rs.process_time, rs.payment_state, rs.payment_time, rs.signup_state, rs.signup_time, rs.notice_state,
        rs.notice_time, rs.lock_state, rs.lock_time, rs.itcast_clazz_id, rs.itcast_clazz_time, rs.payment_url,
        rs.payment_url_time, rs.ems_student_id, rs.delete_reason, rs.deleter, rs.deleter_name, rs.delete_time,
        rs.course_id, rs.course_name, rs.delete_comment, rs.close_state, rs.close_time, rs.appeal_id, rs.tenant,
        rs.total_fee, rs.belonged, rs.belonged_time, rs.belonger_time, rs.transfer, rs.transfer_time, rs.follow_type,
        rs.transfer_bxg_oa_account, rs.transfer_bxg_belonger_name,
        -- Keep end_time for rows without an update and for rows that have already expired
        if(up.id is null or rs.end_time < '9999-12-31', rs.end_time, date_add(up.start_time,-1)) end_time,
        rs.start_time
    from itcast_ods.customer_relationship rs
    left join
    (
        select * from itcast_ods.customer_relationship_update
        where start_time='${TD_DATE}'
    ) up
    on rs.id=up.id
    where rs.start_time >= date_sub('${TD_DATE}',30)
) his
order by his.id, start_time;
INSERT OVERWRITE TABLE itcast_ods.customer_relationship partition (start_time)
SELECT * from itcast_ods.customer_relationship_tmp;
")

3.3.1.9 Customer_clue clue table

3.3.1.9.1 Rebuild the customer_clue_update update table

DROP TABLE IF EXISTS itcast_ods.customer_clue_update;
CREATE TABLE IF NOT EXISTS itcast_ods.customer_clue_update (
    id int COMMENT 'customer_clue_id',
    create_date_time STRING COMMENT 'Creation time',
    update_date_time STRING COMMENT 'Last update time',
    deleted STRING COMMENT 'Whether deleted (disabled)',
    customer_id int COMMENT 'Customer id',
    customer_relationship_id int COMMENT 'Customer relationship id',
    session_id STRING COMMENT 'Qimo session id',
    sid STRING COMMENT 'Visitor id',
    status STRING COMMENT 'Status (undeal: unclaimed, deal: claimed, finish: closed, changePeer: transferred)',
    users STRING COMMENT 'Assigned agent',
    create_time STRING COMMENT 'Qimo creation time',
    platform STRING COMMENT 'Source platform (pc: website consultation | wap: WAP consultation | sdk: app consultation | weixin: WeChat consultation)',
    s_name STRING COMMENT 'User name',
    seo_source STRING COMMENT 'Search source',
    seo_keywords STRING COMMENT 'Keywords',
    ip STRING COMMENT 'IP address',
    referrer STRING COMMENT 'Referring page',
    from_url STRING COMMENT 'Session source page',
    landing_page_url STRING COMMENT 'Visitor landing page',
    url_title STRING COMMENT 'Consultation page title',
    to_peer STRING COMMENT 'Assigned skill group',
    manual_time STRING COMMENT 'Manual service start time',
    begin_time STRING COMMENT 'Time the agent claimed the session',
    reply_msg_count int COMMENT 'Number of agent replies',
    total_msg_count int COMMENT 'Total number of messages',
    msg_count int COMMENT 'Number of messages sent by the customer',
    comment STRING COMMENT 'Remarks',
    finish_reason STRING COMMENT 'End type',
    finish_user STRING COMMENT 'Agent who ended the session',
    end_time STRING COMMENT 'Session end time',
    platform_description STRING COMMENT 'Customer platform information',
    browser_name STRING COMMENT 'Browser name',
    os_info STRING COMMENT 'Operating system name',
    area STRING COMMENT 'Region',
    country STRING COMMENT 'Country',
    province STRING COMMENT 'Province',
    city STRING COMMENT 'City',
    creator int COMMENT 'Creator',
    name STRING COMMENT 'Customer name',
    idcard STRING COMMENT 'ID card number',
    phone STRING COMMENT 'Mobile phone number',
    itcast_school_id int COMMENT 'School id',
    itcast_school STRING COMMENT 'School',
    itcast_subject_id int COMMENT 'Subject id',
    itcast_subject STRING COMMENT 'Subject',
    wechat STRING COMMENT 'WeChat',
    qq STRING COMMENT 'QQ number',
    email STRING COMMENT 'Email',
    gender STRING COMMENT 'Gender',
    level STRING COMMENT 'Customer level',
    origin_type STRING COMMENT 'Data source channel',
    information_way STRING COMMENT 'Consultation method',
    working_years STRING COMMENT 'Work start time',
    technical_directions STRING COMMENT 'Technical direction',
    customer_state STRING COMMENT 'Current customer state',
    valid STRING COMMENT 'Whether the clue is a valid online-consultation clue',
    anticipat_signup_date STRING COMMENT 'Expected signup date',
    clue_state STRING COMMENT 'Clue state',
    scrm_department_id int COMMENT 'SCRM internal department id',
    superior_url STRING COMMENT 'Referring page URL obtained from Zhuge',
    superior_source STRING COMMENT 'Referring page URL title obtained from Zhuge',
    landing_url STRING COMMENT 'Landing page URL obtained from Zhuge',
    landing_source STRING COMMENT 'Landing page URL source obtained from Zhuge',
    info_url STRING COMMENT 'Lead form page URL obtained from Zhuge',
    info_source STRING COMMENT 'Lead form page URL title obtained from Zhuge',
    origin_channel STRING COMMENT 'Advertising channel',
    course_id int COMMENT 'Course id',
    course_name STRING COMMENT 'Course name',
    zhuge_session_id STRING COMMENT 'Zhuge session id',
    is_repeat int COMMENT 'Whether the clue is a duplicate (by phone number): 0 normal, 1 duplicate',
    tenant int COMMENT 'Tenant id',
    activity_id STRING COMMENT 'Activity id',
    activity_name STRING COMMENT 'Activity name',
    follow_type int COMMENT 'Assignment type: 0 automatic assignment, 1 manual assignment, 2 automatic transfer, 3 manual single transfer, 4 manual batch transfer, 5 claimed from the public pool',
    shunt_mode_id int COMMENT 'Matched skill group id',
    shunt_employee_group_id int COMMENT 'Assigned routing employee group',
    ends_time STRING COMMENT 'Validity end date')
comment 'Customer clue table'
PARTITIONED BY(starts_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='ZLIB');

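3.3.1.9.2 Extract yesterday's new and updated rows (a logical delete also counts as an update)

Because the incremental extraction is T+1, the SQL needs a where condition that selects only yesterday's rows instead of the whole table.

The condition has to cover both the creation time and the update time, because rows created yesterday and rows modified yesterday must both be loaded into the warehouse.

SQL: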

select
    id, create_date_time, update_date_time, deleted, customer_id, customer_relationship_id, session_id, sid, status,
    user as users, create_time, platform, s_name, seo_source, seo_keywords, ip, referrer, from_url, landing_page_url,
    url_title, to_peer, manual_time, begin_time, reply_msg_count, total_msg_count, msg_count, comment, finish_reason,
    finish_user, end_time, platform_description, browser_name, os_info, area, country, province, city, creator, name,
    idcard, phone, itcast_school_id, itcast_school, itcast_subject_id, itcast_subject, wechat, qq, email, gender, level,
    origin_type, information_way, working_years, technical_directions, customer_state, valid, anticipat_signup_date,
    clue_state, scrm_department_id, superior_url, superior_source, landing_url, landing_source, info_url, info_source,
    origin_channel, course_id, course_name, zhuge_session_id, is_repeat, tenant, activity_id, activity_name,
    follow_type, shunt_mode_id, shunt_employee_group_id,
    FROM_UNIXTIME(unix_timestamp(), "%Y-%m-%d") as starts_time,
    date_format("9999-12-31", "%Y-%m-%d") as ends_time
from customer_clue
where (
    create_date_time >= FROM_UNIXTIME(UNIX_TIMESTAMP("2019-12-04 00:00:00"),"%Y-%m-%d %H:%i:%s")
    and create_date_time < FROM_UNIXTIME(UNIX_TIMESTAMP("2019-12-05 00:00:00"),"%Y-%m-%d %H:%i:%s")
) or (
    update_date_time >= FROM_UNIXTIME(UNIX_TIMESTAMP("2019-12-04 00:00:00"),"%Y-%m-%d %H:%i:%s")
    and update_date_time < FROM_UNIXTIME(UNIX_TIMESTAMP("2019-12-05 00:00:00"),"%Y-%m-%d %H:%i:%s")
);

Sqoop script:

sqoop import \
--connect jdbc:mysql://172.17.0.202:3306/scrm \
--username root \
--password 123456 \
--query '
select
id, create_date_time, update_date_time, deleted, customer_id, customer_relationship_id, session_id, sid, status,
user as users, create_time, platform, s_name, seo_source, seo_keywords, ip, referrer, from_url, landing_page_url,
url_title, to_peer, manual_time, begin_time, reply_msg_count, total_msg_count, msg_count, comment, finish_reason,
finish_user, end_time, platform_description, browser_name, os_info, area, country, province, city, creator, name,
idcard, phone, itcast_school_id, itcast_school, itcast_subject_id, itcast_subject, wechat, qq, email, gender, level,
origin_type, information_way, working_years, technical_directions, customer_state, valid, anticipat_signup_date,
clue_state, scrm_department_id, superior_url, superior_source, landing_url, landing_source, info_url, info_source,
origin_channel, course_id, course_name, zhuge_session_id, is_repeat, tenant, activity_id, activity_name,
follow_type, shunt_mode_id, shunt_employee_group_id,
FROM_UNIXTIME(unix_timestamp(), "%Y-%m-%d") as starts_time,
date_format("9999-12-31", "%Y-%m-%d") as ends_time
from customer_clue
where (
  (
    create_date_time >= FROM_UNIXTIME(UNIX_TIMESTAMP("2019-12-04 00:00:00"),"%Y-%m-%d %H:%i:%s")
    and create_date_time < FROM_UNIXTIME(UNIX_TIMESTAMP("2019-12-05 00:00:00"),"%Y-%m-%d %H:%i:%s")
  ) or (
    update_date_time >= FROM_UNIXTIME(UNIX_TIMESTAMP("2019-12-04 00:00:00"),"%Y-%m-%d %H:%i:%s")
    and update_date_time < FROM_UNIXTIME(UNIX_TIMESTAMP("2019-12-05 00:00:00"),"%Y-%m-%d %H:%i:%s")
  )
) and $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table customer_clue_update \
--hive-partition-key starts_time \
--hive-partition-value 2019-12-04 \
-m 100 \
--split-by id
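The one-day window used by the T+1 extraction can be derived from a single date instead of being typed by hand. The following is a minimal sketch, assuming GNU date (the -d option); the helper names day_window and build_where are hypothetical and only illustrate a half-open window that starts at midnight of the day and ends just before midnight of the next day.

```shell
#!/bin/bash
# Sketch only: day_window and build_where are hypothetical helpers, not part of the original scripts.
# Assumes GNU date (the -d option).

# day_window DAY -> prints "<start>|<end>" for the yyyy-mm-dd day,
# where the window is [DAY 00:00:00, next day 00:00:00).
day_window() {
    local day="$1"
    local start end
    start="$(TZ=UTC date -d "$day" '+%Y-%m-%d 00:00:00')"
    end="$(TZ=UTC date -d "$day +1 day" '+%Y-%m-%d 00:00:00')"
    printf '%s|%s\n' "$start" "$end"
}

# build_where DAY -> prints a where fragment covering rows created or updated on DAY.
# The outer parentheses keep the OR together when "and \$CONDITIONS" is appended.
build_where() {
    local window start end
    window="$(day_window "$1")"
    start="${window%%|*}"
    end="${window##*|}"
    printf '((create_date_time >= "%s" and create_date_time < "%s") or (update_date_time >= "%s" and update_date_time < "%s"))\n' \
        "$start" "$end" "$start" "$end"
}

# Default to yesterday when no day is passed, as a T+1 job would.
build_where "${1:-$(date -d '1 day ago' +%Y-%m-%d)}"
```

Using the next day's midnight as an exclusive upper bound avoids missing rows stamped during the last second of the day.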

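3.3.1.9.3 Rebuild the customer_clue_tmp temporary table

3.3.1.9.4 Merge the incremental data with the historical data (only the last 30 days are updated, as required)

1. Take the changed rows from the update table; new rows get ends_time '9999-12-31' and starts_time set to yesterday's date.
2. Take the historical rows from the zipper table:

(1) Update ends_time

① Join the history table customer_clue (the main table) with the new/updated table customer_clue_update on id. If the update table contains an id that also exists in the history table, that row has a newer version;

② Rows without an update keep their original ends_time;

③ Rows in the history table that have already expired keep their original ends_time;

④ Rows that have an update and are currently valid get ends_time set to the day before yesterday (the day before the update's starts_time);

(2) Because the business side limits the update window to 30 days, only the last 30 days need to be queried and updated.

Merge 1 (the update rows) with 2 (the zipper-table rows) and insert the result into the temporary table with overwrite.

Implementation: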

insert overwrite table itcast_ods.customer_clue_tmp partition (starts_time)
select * from (
    -- 1. Rows from the update table
    select
        id, create_date_time, update_date_time, deleted, customer_id, customer_relationship_id, session_id, sid, status,
        users, create_time, platform, s_name, seo_source, seo_keywords, ip, referrer, from_url, landing_page_url,
        url_title, to_peer, manual_time, begin_time, reply_msg_count, total_msg_count, msg_count, comment, finish_reason,
        finish_user, end_time, platform_description, browser_name, os_info, area, country, province, city, creator, name,
        idcard, phone, itcast_school_id, itcast_school, itcast_subject_id, itcast_subject, wechat, qq, email, gender, level,
        origin_type, information_way, working_years, technical_directions, customer_state, valid, anticipat_signup_date,
        clue_state, scrm_department_id, superior_url, superior_source, landing_url, landing_source, info_url, info_source,
        origin_channel, course_id, course_name, zhuge_session_id, is_repeat, tenant, activity_id, activity_name,
        follow_type, shunt_mode_id, shunt_employee_group_id,
        '9999-12-31' ends_time,
        '2019-12-04' as starts_time
    from itcast_ods.customer_clue_update
    where starts_time='2019-12-04'
    union all
    -- 2. Historical zipper-table rows from the last 30 days
    select
        rs.id, rs.create_date_time, rs.update_date_time, rs.deleted, rs.customer_id, rs.customer_relationship_id,
        rs.session_id, rs.sid, rs.status, rs.users, rs.create_time, rs.platform, rs.s_name, rs.seo_source,
        rs.seo_keywords, rs.ip, rs.referrer, rs.from_url, rs.landing_page_url, rs.url_title, rs.to_peer,
        rs.manual_time, rs.begin_time, rs.reply_msg_count, rs.total_msg_count, rs.msg_count, rs.comment,
        rs.finish_reason, rs.finish_user, rs.end_time, rs.platform_description, rs.browser_name, rs.os_info, rs.area,
        rs.country, rs.province, rs.city, rs.creator, rs.name, rs.idcard, rs.phone, rs.itcast_school_id,
        rs.itcast_school, rs.itcast_subject_id, rs.itcast_subject, rs.wechat, rs.qq, rs.email, rs.gender, rs.level,
        rs.origin_type, rs.information_way, rs.working_years, rs.technical_directions, rs.customer_state, rs.valid,
        rs.anticipat_signup_date, rs.clue_state, rs.scrm_department_id, rs.superior_url, rs.superior_source,
        rs.landing_url, rs.landing_source, rs.info_url, rs.info_source, rs.origin_channel, rs.course_id,
        rs.course_name, rs.zhuge_session_id, rs.is_repeat, rs.tenant, rs.activity_id, rs.activity_name,
        rs.follow_type, rs.shunt_mode_id, rs.shunt_employee_group_id,
        -- ends_time is the validity end date; end_time is the session end time and is left untouched
        if(up.id is null or rs.ends_time < '9999-12-31', rs.ends_time, date_add(up.starts_time,-1)) ends_time,
        rs.starts_time
    from itcast_ods.customer_clue rs
    left join (
        select * from itcast_ods.customer_clue_update where starts_time='2019-12-04'
    ) up on rs.id=up.id
    where rs.starts_time >= date_add('2019-12-04',-30)
) his
order by his.id, starts_time;
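The validity-date rule applied to each historical row in the merge above can be sketched in isolation. This is an illustrative shell sketch only, assuming GNU date; the function new_end_time and its arguments are hypothetical and stand in for the if(...) expression of the Hive query.

```shell
#!/bin/bash
# Sketch only: new_end_time is a hypothetical helper mirroring the if(...) rule of the merge.
# Assumes GNU date (the -d option).
OPEN_END='9999-12-31'   # marker for a row that is still valid

# new_end_time ROW_END_TIME [UPDATE_START_TIME]
#   ROW_END_TIME       validity end date currently stored on the historical row
#   UPDATE_START_TIME  start date of the matching update row; empty when no update matched
new_end_time() {
    local row_end="$1" update_start="${2:-}"
    if [[ -z "$update_start" || "$row_end" < "$OPEN_END" ]]; then
        # No update, or the row already expired: keep the stored end date
        printf '%s\n' "$row_end"
    else
        # Still valid and updated: close the row on the day before the update starts
        TZ=UTC date -d "$update_start -1 day" +%Y-%m-%d
    fi
}

new_end_time 9999-12-31 2020-07-15   # valid row with an update -> 2020-07-14
new_end_time 9999-12-31              # valid row without an update -> 9999-12-31
new_end_time 2020-06-01 2020-07-15   # already expired row -> 2020-06-01
```

Keeping expired rows in the result matters because the later INSERT OVERWRITE replaces whole partitions: any row filtered out of a rewritten partition would be lost.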

3.3.1.9.5 Overwrite the zipper table from the temporary table

INSERT OVERWRITE TABLE itcast_ods.customer_clue partition (starts_time)

SELECT * from itcast_ods.customer_clue_tmp;

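3.3.1.9.6 Testing

Delete the test data in MySQL and HDFS (external table) to avoid duplicates and make the result easy to verify
Insert new rows into MySQL
Verify that the SQL used by sqoop returns the test rows when run in MySQL
Rebuild the update table
Run the sqoop script manually to extract the data
Rebuild the tmp temporary table
Merge the day's new and updated rows
Overwrite the zipper table from the temporary table

3.3.1.9.7 Oozie script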

#! /bin/bash
HIVE_HOME=/usr/bin/hive
# Use the date passed as the first argument, or default to yesterday
if [[ $1 == "" ]]; then
    TD_DATE=`date -d '1 days ago' "+%Y-%m-%d"`
else
    TD_DATE=$1
fi

# Step 1: rebuild the update table
output=$(${HIVE_HOME} -S -e "
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
DROP TABLE IF EXISTS itcast_ods.customer_clue_update;
CREATE TABLE IF NOT EXISTS itcast_ods.customer_clue_update (
    id int COMMENT 'customer_clue_id',
    create_date_time STRING COMMENT 'Creation time',
    update_date_time STRING COMMENT 'Last update time',
    deleted STRING COMMENT 'Whether deleted (disabled)',
    customer_id int COMMENT 'Customer id',
    customer_relationship_id int COMMENT 'Customer relationship id',
    session_id STRING COMMENT 'Qimo session id',
    sid STRING COMMENT 'Visitor id',
    status STRING COMMENT 'Status (undeal: unclaimed, deal: claimed, finish: closed, changePeer: transferred)',
    users STRING COMMENT 'Assigned agent',
    create_time STRING COMMENT 'Qimo creation time',
    platform STRING COMMENT 'Source platform (pc: website consultation | wap: WAP consultation | sdk: app consultation | weixin: WeChat consultation)',
    s_name STRING COMMENT 'User name',
    seo_source STRING COMMENT 'Search source',
    seo_keywords STRING COMMENT 'Keywords',
    ip STRING COMMENT 'IP address',
    referrer STRING COMMENT 'Referring page',
    from_url STRING COMMENT 'Session source page',
    landing_page_url STRING COMMENT 'Visitor landing page',
    url_title STRING COMMENT 'Consultation page title',
    to_peer STRING COMMENT 'Assigned skill group',
    manual_time STRING COMMENT 'Manual service start time',
    begin_time STRING COMMENT 'Time the agent claimed the session',
    reply_msg_count int COMMENT 'Number of agent replies',
    total_msg_count int COMMENT 'Total number of messages',
    msg_count int COMMENT 'Number of messages sent by the customer',
    comment STRING COMMENT 'Remarks',
    finish_reason STRING COMMENT 'End type',
    finish_user STRING COMMENT 'Agent who ended the session',
    end_time STRING COMMENT 'Session end time',
    platform_description STRING COMMENT 'Customer platform information',
    browser_name STRING COMMENT 'Browser name',
    os_info STRING COMMENT 'Operating system name',
    area STRING COMMENT 'Region',
    country STRING COMMENT 'Country',
    province STRING COMMENT 'Province',
    city STRING COMMENT 'City',
    creator int COMMENT 'Creator',
    name STRING COMMENT 'Customer name',
    idcard STRING COMMENT 'ID card number',
    phone STRING COMMENT 'Mobile phone number',
    itcast_school_id int COMMENT 'School id',
    itcast_school STRING COMMENT 'School',
    itcast_subject_id int COMMENT 'Subject id',
    itcast_subject STRING COMMENT 'Subject',
    wechat STRING COMMENT 'WeChat',
    qq STRING COMMENT 'QQ number',
    email STRING COMMENT 'Email',
    gender STRING COMMENT 'Gender',
    level STRING COMMENT 'Customer level',
    origin_type STRING COMMENT 'Data source channel',
    information_way STRING COMMENT 'Consultation method',
    working_years STRING COMMENT 'Work start time',
    technical_directions STRING COMMENT 'Technical direction',
    customer_state STRING COMMENT 'Current customer state',
    valid STRING COMMENT 'Whether the clue is a valid online-consultation clue',
    anticipat_signup_date STRING COMMENT 'Expected signup date',
    clue_state STRING COMMENT 'Clue state',
    scrm_department_id int COMMENT 'SCRM internal department id',
    superior_url STRING COMMENT 'Referring page URL obtained from Zhuge',
    superior_source STRING COMMENT 'Referring page URL title obtained from Zhuge',
    landing_url STRING COMMENT 'Landing page URL obtained from Zhuge',
    landing_source STRING COMMENT 'Landing page URL source obtained from Zhuge',
    info_url STRING COMMENT 'Lead form page URL obtained from Zhuge',
    info_source STRING COMMENT 'Lead form page URL title obtained from Zhuge',
    origin_channel STRING COMMENT 'Advertising channel',
    course_id int COMMENT 'Course id',
    course_name STRING COMMENT 'Course name',
    zhuge_session_id STRING COMMENT 'Zhuge session id',
    is_repeat int COMMENT 'Whether the clue is a duplicate (by phone number): 0 normal, 1 duplicate',
    tenant int COMMENT 'Tenant id',
    activity_id STRING COMMENT 'Activity id',
    activity_name STRING COMMENT 'Activity name',
    follow_type int COMMENT 'Assignment type: 0 automatic assignment, 1 manual assignment, 2 automatic transfer, 3 manual single transfer, 4 manual batch transfer, 5 claimed from the public pool',
    shunt_mode_id int COMMENT 'Matched skill group id',
    shunt_employee_group_id int COMMENT 'Assigned routing employee group',
    ends_time STRING COMMENT 'Validity end date')
comment 'Customer clue table'
PARTITIONED BY(starts_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='ZLIB');
")

# Step 2: extract the rows created or updated yesterday from MySQL
SQOOP_HOME=/usr/bin/sqoop
output=$(${SQOOP_HOME} import \
--connect jdbc:mysql://172.17.0.202:3306/scrm \
--username root \
--password 123456 \
--query 'select
id, create_date_time, update_date_time, deleted, customer_id, customer_relationship_id, session_id, sid, status,
user as users, create_time, platform, s_name, seo_source, seo_keywords, ip, referrer, from_url, landing_page_url,
url_title, to_peer, manual_time, begin_time, reply_msg_count, total_msg_count, msg_count, comment, finish_reason,
finish_user, end_time, platform_description, browser_name, os_info, area, country, province, city, creator, name,
idcard, phone, itcast_school_id, itcast_school, itcast_subject_id, itcast_subject, wechat, qq, email, gender, level,
origin_type, information_way, working_years, technical_directions, customer_state, valid, anticipat_signup_date,
clue_state, scrm_department_id, superior_url, superior_source, landing_url, landing_source, info_url, info_source,
origin_channel, course_id, course_name, zhuge_session_id, is_repeat, tenant, activity_id, activity_name,
follow_type, shunt_mode_id, shunt_employee_group_id,
FROM_UNIXTIME(unix_timestamp(), "%Y-%m-%d") as starts_time,
date_format("9999-12-31", "%Y-%m-%d") as ends_time
from customer_clue
where (
  (
    create_date_time >= FROM_UNIXTIME(UNIX_TIMESTAMP(CAST(SYSDATE() AS DATE) - INTERVAL 1 DAY),"%Y-%m-%d %H:%i:%s")
    and create_date_time < FROM_UNIXTIME(UNIX_TIMESTAMP(CAST(SYSDATE() AS DATE)),"%Y-%m-%d %H:%i:%s")
  ) or (
    update_date_time >= FROM_UNIXTIME(UNIX_TIMESTAMP(CAST(SYSDATE() AS DATE) - INTERVAL 1 DAY),"%Y-%m-%d %H:%i:%s")
    and update_date_time < FROM_UNIXTIME(UNIX_TIMESTAMP(CAST(SYSDATE() AS DATE)),"%Y-%m-%d %H:%i:%s")
  )
) and $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table customer_clue_update \
--hive-partition-key starts_time \
--hive-partition-value ${TD_DATE} \
-m 100 \
--split-by id)

# Step 3: rebuild the temporary table, merge the data, and overwrite the zipper table
output=$(${HIVE_HOME} -S -e "
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- Rebuild the temporary table with the same schema as the update table
DROP TABLE IF EXISTS itcast_ods.customer_clue_tmp;
CREATE TABLE IF NOT EXISTS itcast_ods.customer_clue_tmp LIKE itcast_ods.customer_clue_update;
insert overwrite table itcast_ods.customer_clue_tmp partition (starts_time)
select * from (
    select
        id, create_date_time, update_date_time, deleted, customer_id, customer_relationship_id, session_id, sid, status,
        users, create_time, platform, s_name, seo_source, seo_keywords, ip, referrer, from_url, landing_page_url,
        url_title, to_peer, manual_time, begin_time, reply_msg_count, total_msg_count, msg_count, comment, finish_reason,
        finish_user, end_time, platform_description, browser_name, os_info, area, country, province, city, creator, name,
        idcard, phone, itcast_school_id, itcast_school, itcast_subject_id, itcast_subject, wechat, qq, email, gender, level,
        origin_type, information_way, working_years, technical_directions, customer_state, valid, anticipat_signup_date,
        clue_state, scrm_department_id, superior_url, superior_source, landing_url, landing_source, info_url, info_source,
        origin_channel, course_id, course_name, zhuge_session_id, is_repeat, tenant, activity_id, activity_name,
        follow_type, shunt_mode_id, shunt_employee_group_id,
        '9999-12-31' ends_time,
        '${TD_DATE}' as starts_time
    from itcast_ods.customer_clue_update
    where starts_time='${TD_DATE}'
    union all
    select
        rs.id, rs.create_date_time, rs.update_date_time, rs.deleted, rs.customer_id, rs.customer_relationship_id,
        rs.session_id, rs.sid, rs.status, rs.users, rs.create_time, rs.platform, rs.s_name, rs.seo_source,
        rs.seo_keywords, rs.ip, rs.referrer, rs.from_url, rs.landing_page_url, rs.url_title, rs.to_peer,
        rs.manual_time, rs.begin_time, rs.reply_msg_count, rs.total_msg_count, rs.msg_count, rs.comment,
        rs.finish_reason, rs.finish_user, rs.end_time, rs.platform_description, rs.browser_name, rs.os_info, rs.area,
        rs.country, rs.province, rs.city, rs.creator, rs.name, rs.idcard, rs.phone, rs.itcast_school_id,
        rs.itcast_school, rs.itcast_subject_id, rs.itcast_subject, rs.wechat, rs.qq, rs.email, rs.gender, rs.level,
        rs.origin_type, rs.information_way, rs.working_years, rs.technical_directions, rs.customer_state, rs.valid,
        rs.anticipat_signup_date, rs.clue_state, rs.scrm_department_id, rs.superior_url, rs.superior_source,
        rs.landing_url, rs.landing_source, rs.info_url, rs.info_source, rs.origin_channel, rs.course_id,
        rs.course_name, rs.zhuge_session_id, rs.is_repeat, rs.tenant, rs.activity_id, rs.activity_name,
        rs.follow_type, rs.shunt_mode_id, rs.shunt_employee_group_id,
        -- Keep ends_time for rows without an update and for rows that have already expired
        if(up.id is null or rs.ends_time < '9999-12-31', rs.ends_time, date_add(up.starts_time,-1)) ends_time,
        rs.starts_time
    from itcast_ods.customer_clue rs
    left join (
        select * from itcast_ods.customer_clue_update where starts_time='${TD_DATE}'
    ) up on rs.id=up.id
    where rs.starts_time >= date_sub('${TD_DATE}',30)
) his
order by his.id, starts_time;
INSERT OVERWRITE TABLE itcast_ods.customer_clue partition (starts_time)
SELECT * from itcast_ods.customer_clue_tmp;
")

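3.3.2 Data cleaning and transformation

3.3.2.1 DWD

3.3.2.1.1 Analysis

The business side limits the update window to 30 days, and the detail layer performs no aggregation, only cleaning and transformation. An incremental run therefore only needs to recompute the data from the 1st of last month up to today.

Use start_time to select the time range of the data to clean (yesterday: newly added or updated rows);

Use end_time to select only the rows that are currently valid.

Remove rows that have been marked as deleted;

Check the school id and the subject id, and convert empty values to -1;

Convert the origin_type source-channel field into an online/offline flag: if origin_type is NETSERVICE or PRESIGNUP the row is online (1), otherwise it is offline (0).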
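
The two row-level conversion rules above can be sketched in bash as follows. This is only an illustration of the logic, not the DWD job itself; the function names are hypothetical, and treating 0 the same as an empty value follows the requirement note that 0 and null are both unified to -1.

```shell
#!/bin/bash
# Hypothetical helpers; the names are illustrative, not from the original job.

# Map a missing school/subject id to -1 (empty, NULL and 0 are all "missing").
normalize_id() {
    local id="$1"
    if [[ -z "$id" || "$id" == "NULL" || "$id" == "0" ]]; then
        echo "-1"
    else
        echo "$id"
    fi
}

# Map origin_type to the online/offline flag: 1 = online, 0 = offline.
origin_type_stat() {
    case "$1" in
        NETSERVICE|PRESIGNUP) echo "1" ;;
        *)                    echo "0" ;;
    esac
}

normalize_id ""               # prints -1
normalize_id 12               # prints 12
origin_type_stat NETSERVICE   # prints 1
origin_type_stat OTHER        # prints 0 (OTHER is an arbitrary example value)
```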

3.3.2.1.2 Code

3.3.2.1.2.1 SQL:

3.3.2.1.2.2 Shell script:

A shell script obtains the date of the 1st of last month and substitutes it into the query condition of the SQL.

#!/bin/bash
SQOOP_HOME=/usr/bin/sqoop
# Yesterday
Last_DATE=$(date -d "-1 day" +%Y-%m-%d)
${HIVE_HOME} -S -e "
-- Partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
-- Hive compression
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- Make compression take effect on write
set hive.exec.orc.compression.strategy=COMPRESSION;
-- Bucketing
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.optimize.bucketmapjoin = true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

3.3.2.2 DWM

The year/month/day partition filter restricts the join to data from the 1st of last month up to today.

3.3.2.2.1 SQL:

insert overwrite table itcast_dwm.itcast_intention_dwm partition (yearinfo, monthinfo, dayinfo)
select
    dwd.customer_id,
    dwd.create_date_time,
    cus.area,
    dwd.itcast_school_id,
    sch.name as itcast_school_name,
    dwd.deleted,
    dwd.origin_type,
    dwd.itcast_subject_id,
    sub.name as itcast_subject_name,
    dwd.hourinfo,
    dwd.origin_type_stat,
    -- 1 = new customer, 0 = old customer, -1 = neither state
    if(clue.clue_state='VALID_NEW_CLUES', '1', if(clue.clue_state='VALID_PUBLIC_NEW_CLUE', '0', '-1')) as clue_state_stat,
    e.department_id as tdepart_id,
    dept.name as tdepart_name,
    dwd.yearinfo,
    dwd.monthinfo,
    dwd.dayinfo
from itcast_dwd.itcast_intention_dwd dwd
left join itcast_ods.customer_clue clue on clue.customer_relationship_id = dwd.rid
left join itcast_dimen.customer cus on dwd.customer_id = cus.id
left join itcast_dimen.employee e on dwd.creator = e.id
left join itcast_dimen.scrm_department dept on e.department_id = dept.id
left join itcast_dimen.itcast_subject sub on dwd.itcast_subject_id = sub.id
left join itcast_dimen.itcast_school sch on dwd.itcast_school_id = sch.id
where concat_ws('-', dwd.yearinfo, dwd.monthinfo, dwd.dayinfo) >= '${Last_Month_DATE}'; -- e.g. 2019-11-01

3.3.2.2.2 Shell:

#!/bin/bash
SQOOP_HOME=/usr/bin/sqoop
# 1st of last month. Anchoring on day 01 of the current month first keeps the
# month arithmetic safe at month end (a bare "-1 month" on the 31st can overflow).
Last_Month_DATE=$(date -d "$(date +%Y%m)01 last month" +%Y-%m-01)
${HIVE_HOME} -S -e "
-- Partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
-- Hive compression
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- Make compression take effect on write
set hive.exec.orc.compression.strategy=COMPRESSION;
-- Bucketing
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.optimize.bucketmapjoin = true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

-- overwrite (as in the SQL of 3.3.2.2.1) so that a rerun replaces the recomputed partitions instead of appending duplicates
insert overwrite table itcast_dwm.itcast_intention_dwm partition (yearinfo, monthinfo, dayinfo)
select
    dwd.customer_id,
    dwd.create_date_time,
    cus.area,
    dwd.itcast_school_id,
    sch.name as itcast_school_name,
    dwd.deleted,
    dwd.origin_type,
    dwd.itcast_subject_id,
    sub.name as itcast_subject_name,
    dwd.hourinfo,
    dwd.origin_type_stat,
    if(clue.clue_state='VALID_NEW_CLUES', '1', if(clue.clue_state='VALID_PUBLIC_NEW_CLUE', '0', '-1')) as clue_state_stat,
    e.department_id as tdepart_id,
    dept.name as tdepart_name,
    dwd.yearinfo,
    dwd.monthinfo,
    dwd.dayinfo
from itcast_dwd.itcast_intention_dwd dwd
left join itcast_ods.customer_clue clue on clue.customer_relationship_id = dwd.rid
left join itcast_dimen.customer cus on dwd.customer_id = cus.id
left join itcast_dimen.employee e on dwd.creator = e.id
left join itcast_dimen.scrm_department dept on e.department_id = dept.id
left join itcast_dimen.itcast_subject sub on dwd.itcast_subject_id = sub.id
left join itcast_dimen.itcast_school sch on dwd.itcast_school_id = sch.id
where concat_ws('-', dwd.yearinfo, dwd.monthinfo, dwd.dayinfo) >= '${Last_Month_DATE}'; -- e.g. 2019-11-01
"

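3.3.3 Statistical analysis

3.3.3.1 Total new intention customers

Data from before 2016-10-12 can be queried for testing.

For the hour and day granularities, recompute the data from the 1st of last month onward; for the month granularity, recompute from last month onward; for the year granularity, recompute from the year that contains the 1st of last month onward.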
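
The recomputation windows just described can be sketched as two small bash functions, assuming GNU date. The function names and the optional reference-date argument are illustrative additions, not part of the original scripts.

```shell
#!/bin/bash
# Hypothetical helper. $1 is a reference date (YYYY-MM-DD), defaulting to today.
# Prints the first day of the previous month.
first_day_of_last_month() {
    local ref="${1:-$(date +%Y-%m-%d)}"
    # Anchor on day 01 first so the month arithmetic never overflows
    # (for example, "March 31 minus 1 month" would otherwise land in March).
    date -d "${ref:0:7}-01 1 month ago" +%Y-%m-01
}

# Print the lower bound of the recomputation window for one time granularity.
recompute_from() {
    local granularity="$1"
    local start
    start="$(first_day_of_last_month "$2")"
    case "$granularity" in
        hour|day) echo "$start" ;;          # from the 1st of last month
        month)    echo "${start:0:7}" ;;    # from last month
        year)     echo "${start:0:4}" ;;    # from the year of that date
        *)        echo "unknown granularity: $granularity" >&2; return 1 ;;
    esac
}

recompute_from day   2020-03-31   # prints 2020-02-01
recompute_from month 2020-03-31   # prints 2020-02
recompute_from year  2020-01-15   # prints 2019
```

Passing the reference date explicitly makes the window reproducible when a past run has to be repeated.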

#!/bin/bash
# 1st of last month
Last_Month_DATE=$(date -d "$(date +%Y%m)01 last month" +%Y-%m-01)
# Derive year, quarter, month and day from Last_Month_DATE
V_PARYEAR=$(date --date="$Last_Month_DATE" +%Y)
V_PARMONTH=$(date --date="$Last_Month_DATE" +%m)
V_PARDAY=$(date --date="$Last_Month_DATE" +%d)
# Quarter: %-m gives the month without a leading zero, e.g. 7 rather than 07
V_month_for_quarter=$(date --date="$Last_Month_DATE" +%-m)
V_PARQUARTER=$(((${V_month_for_quarter}-1)/3+1))
${HIVE_HOME} -S -e "
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.optimize.bucketmapjoin = true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;
insert into itcast_dws.itcast_intention_dws partition (yearinfo, monthinfo, dayinfo)
select
    count(distinct customer_id) as customer_total,
    '-1' as area,
    '-1' as itcast_school_id,
    '-1' as itcast_school_name,
    '-1' as origin_type,
    '-1' as itcast_subject_id,
    '-1' as itcast_subject_name,
    hourinfo,
    origin_type_stat,
    clue_state_stat,
    '-1' as tdepart_id,
    '-1' as tdepart_name,
    concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
    '1' as grouptype,
    '1' as time_type,
    yearinfo,
    monthinfo,
    dayinfo
from itcast_dwm.itcast_intention_dwm dwm
where concat_ws('-',dwm.yearinfo,dwm.monthinfo,dwm.dayinfo) >= '${Last_Month_DATE}'
group by yearinfo, monthinfo, dayinfo, hourinfo, origin_type_stat, clue_state_stat;
"

3.3.4 Exporting the data

Export by year: first delete the rows of the affected year from the target table, then export.
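
Deleting the year's rows before exporting is what makes the export safe to rerun. The sketch below, which assumes GNU date, shows how the year and the DELETE statement could be derived from an optional date argument and validated before anything touches MySQL. The function names export_year and build_delete_sql are hypothetical; only the table name itcast_intention_app and the yearinfo column come from the original script.

```shell
#!/bin/bash
# Hypothetical helper: resolve the export year from an optional start date
# ($1, YYYY-MM-DD). Defaults to the 1st of last month.
export_year() {
    local start="${1:-$(date -d "$(date +%Y%m)01 last month" +%Y-%m-01)}"
    # Reject anything GNU date cannot parse instead of deleting blindly.
    date -d "$start" +%Y 2>/dev/null || {
        echo "invalid date: $start" >&2
        return 1
    }
}

# Hypothetical helper: build the DELETE that clears that year in the target table.
build_delete_sql() {
    local year
    year="$(export_year "$1")" || return 1
    echo "delete from itcast_intention_app where yearinfo = '${year}'"
}

build_delete_sql 2019-11-01
```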

#!/bin/bash
SQOOP_HOME=/usr/bin/sqoop
HOST=172.17.0.202
USERNAME="root"
PASSWORD="123456"
PORT=3306
DBNAME="scrm_bi"
MYSQL=/usr/local/mysql_5723/bin/mysql
# 1st of last month, unless a date is passed as the first argument
if [[ $1 == "" ]]; then
    Last_Month_DATE=$(date -d "$(date +%Y%m)01 last month" +%Y-%m-01)
else
    Last_Month_DATE=$1
fi
TD_YEAR=$(date -d "$Last_Month_DATE" +%Y)
# Delete the affected year first so that the export can be rerun
${MYSQL} -h${HOST} -P${PORT} -u${USERNAME} -p${PASSWORD} -D${DBNAME} -e "delete from itcast_intention_app where yearinfo = '${Last_Month_DATE:0:4}'"
${SQOOP_HOME} export \
--connect "jdbc:mysql://${HOST}:${PORT}/${DBNAME}?useUnicode=true&characterEncoding=utf-8" \
--username ${USERNAME} \
--password ${PASSWORD} \
--table itcast_intention_app \
--hcatalog-database itcast_dws \
--hcatalog-table itcast_intention_dws \
--hcatalog-partition-keys yearinfo \
--hcatalog-partition-values ${TD_YEAR} \
-m 100

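Today's content:
1) Optimizations related to bucketed tables -- understand
2) Modeling and layering -- hands-on
3) Statistical analysis for the full-load flow -- hands-on (try to implement it yourself): data collection, data cleaning and transformation, dimension degeneration, statistical analysis
4) The incremental flow: how to process a zipper table incrementally -- understand
1. Intention customer dashboard: requirements
Requirement 1: Total number of newly added intention customers (including self-entered ones) within the period.
  Metric: intention count
  Dimensions: time (year, month, day, hour); new/old customer; online/offline
  Tables: customer_relationship (intention table)
  Fields: create_date_time; customer_id (the count is based on this field, deduplicated first)
Requirement 2: Heat map of the number of newly added intention customers per city/region within a given time range.
  Metric: intention count
  Dimensions: time (year, month, day, hour); new/old customer; online/offline; region
  Tables: customer_relationship (intention table), customer (customer/student table)
  Fields: intention table: create_date_time, customer_id (the count is based on this field, deduplicated first); customer table: area
  Join condition: intention.customer_id = customer.id
Requirement 3: Ranking of intended subjects among the intention customers added within a given time range. The subject name must be joined in.
  Metric: intention count
  Dimensions: time (year, month, day, hour); new/old customer; online/offline; subject
  Tables: customer_relationship (intention table), itcast_subject (subject table), customer_clue (clue table)
  Fields:
    clue table: clue_state (helps distinguish new from old customers), deleted (whether the row is deleted), create_date_time
    intention table: origin_type (decides online vs. offline: NETSERVICE or PRESIGNUP means online, anything else offline), customer_id (the count is based on this field, deduplicated first)
    subject table: name
  Join conditions: clue.customer_relationship_id = intention.id; subject.id = intention.itcast_subject_id
Requirement 4: Ranking of intended campuses among the intention customers added within a given time range.
  Metric: intention count
  Dimensions: time (year, month, day, hour); new/old customer; online/offline; campus
  Note: when the school id is synchronized, 0 and null are unified and both converted to -1
  Tables: customer_relationship (intention table), customer_clue (clue table), itcast_school (campus table)
  Fields:
    clue table: clue_state (helps distinguish new from old customers), deleted (whether the row is deleted), create_date_time
    intention table: origin_type (NETSERVICE or PRESIGNUP means online, anything else offline), customer_id (the count is based on this field, deduplicated first)
    campus table: name
  Join conditions: intention.itcast_school_id = campus.id; clue.customer_relationship_id = intention.id
Requirement 5: Share of intention customers per source channel among those added within a given time range.
  Metric: intention count
  Dimensions: time (year, month, day, hour); new/old customer; online/offline; source channel
  Tables: customer_relationship (intention table), customer_clue (clue table)
  Fields:
    clue table: clue_state (helps distinguish new from old customers), deleted (whether the row is deleted)
    intention table: create_date_time, origin_type (decides online vs. offline and also identifies the source channel: NETSERVICE or PRESIGNUP means online, anything else offline), customer_id (the count is based on this field, deduplicated first)
  Join condition: clue.customer_relationship_id = intention.id
Requirement 6: Share of intention customers produced by each consulting center among those added within a given time range.
  Metric: intention count
  Dimensions: time (year, month, day, hour); new/old customer; online/offline; consulting center
  Tables: customer_relationship (intention table), employee (employee table), scrm_department (department table), customer_clue (clue table)
  Fields:
    clue table: clue_state (helps distinguish new from old customers)
    intention table: create_date_time, origin_type (decides online vs. offline and also identifies the source channel: NETSERVICE or PRESIGNUP means online, anything else offline), customer_id (the count is based on this field, deduplicated first)
    employee table: tdepart_id (department id)
    department table: name
  Join conditions: clue.customer_relationship_id = intention.id; employee.tdepart_id = department.id; intention.creator = employee.id
Summary:
  Metric: intention count
  Dimensions: time (year, month, day, hour); new/old customer; online/offline; product-attribute dimensions: region, source channel, subject, campus, consulting center
  Tables: 7 tables
    customer_relationship (intention table), fields: create_date_time, origin_type, customer_id
    employee (employee table), fields: tdepart_id and id
    scrm_department (department table), fields: name and id
    customer_clue (clue table), fields: clue_state, deleted, create_date_time, customer_relationship_id
    itcast_school (campus table), fields: name and id
    itcast_subject (subject table), fields: name and id
    customer (customer table), fields: area and id
  Joins:
    clue.customer_relationship_id = intention.id
    employee.tdepart_id = department.id
    intention.creator = employee.id
    intention.itcast_school_id = campus.id
    subject.id = intention.itcast_subject_id
    intention.customer_id = customer.id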
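Intention dashboard case study: importing the raw business data --- this step does not exist in real-world work
create database scrm default character set utf8mb4 collate utf8mb4_unicode_ci;
From the 知行教育 analytics platform materials distributed earlier --> full raw data set --> scrm --> import the 7 tables into MySQL one by one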
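Intention dashboard case study: modeling analysis
ODS layer:
  Fact table: the intention table
  One extra table is placed here: the clue table (note: it is the fact table of a later dashboard topic and is kept in the ODS layer to simplify later processing)
  Table type: managed (internal) table + bucketed table + partitioned table + zipper table
DIM layer: the dimension layer
  Employee table, campus table, subject table, customer table, department table
  Table type: external table + partitioned table
For both layers: build the tables one-to-one from the structure of the source tables, adding a start_time (extraction time) field
Data format and compression: ORC + ZLIB (SNAPPY)
DW layer:
  DWD: cleaning and transformation of the ODS tables; if a table has too many fields, extract only the relevant ones
    Cleaning: remove rows that have already been flagged as deleted
    Transformation:
      convert the values of origin_type into 0 and 1 as a new field that marks online/offline
      split create_date_time into year, month, day and hour
      school id: 0 and null are unified and both converted to -1
    Fields:
      regular fields: id, create_date_time, deleted, customer_id, origin_type, origin_type_stat, itcast_school_id, itcast_subject_id, creator, hourinfo
      partitions: year (yearinfo), month (monthinfo), day (dayinfo)
  DWM: pre-aggregation by dimension (not possible here); dimension degeneration
    Combine the six dimension tables with the DWD fact table into a single table to degenerate the dimensions
    Idea: decide which fields are needed from each dimension table and put all of them into one table
    Fields:
      regular fields: customer_id, create_date_time, clue_state_stat, origin_type_stat, area, origin_type, itcast_school_id, school_name, itcast_subject_id, itcast_subject_name, department_id, department_name, hourinfo
      partition fields: year (yearinfo), month (monthinfo), day (dayinfo)
    Producing this table requires a seven-table join across ODS + DIM
  DWS: there is only one metric, so there is only one table
    customerid_total, clue_state_stat, origin_type_stat, area, origin_type, itcast_school_id, school_name, itcast_subject_id, itcast_subject_name, department_id, department_name, time_type, group_type, hourinfo, time_str
    partitions: year (yearinfo), month (monthinfo), day (dayinfo)
    time_type: 1 (year) 2 (month) 3 (day) 4 (hour)
    group_type: 1 region, 2 source channel, 3 subject, 4 campus, 5 consulting center, 6 total intentions
  Sample results:
    1000 0 0 year -1 -1 -1 -1
    1000 0 1 year -1 -1 -1 -1
    1000 1 0 year -1 -1 -1 -1
    1000 1 1 year -1 -1 -1 -1
    1000 0 0 year 11 -1 -1 -1
    1000 0 1 year 11 -1 -1 -1
    1000 1 0 year 11 -1 -1 -1
    1000 1 1 year 11 -1 -1 -1
    1000 0 0 year 11 01 -1 -1
    1000 0 1 year 11 01 -1 -1
    1000 1 0 year 11 01 -1 -1
    1000 1 1 year 11 01 -1 -1
    1000 0 0 year 11 -1 Shanxi -1
    1000 0 1 year 11 -1 Shanxi -1
    1000 1 0 year 11 -1 Shanxi -1
    1000 1 1 year 11 -1 Shanxi -1
    1000 0 0 year 11 01 -1 weixin
    1000 0 1 year 11 01 -1 weixin
    1000 1 0 year 11 01 -1 weixin
    1000 1 1 year 11 01 -1 weixin
APP layer: not needed; the DWS layer has already completed the analysis for every dimension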
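2. Optimizations related to bucketed tables:
Bucketed table: splits data into files. One file is split into several files, and how many depends on the configured number of buckets.
How is the split implemented underneath? It relies on MapReduce partitioning: a hash-modulo computation on the bucketing column scatters the rows, while guaranteeing that rows with the same value are sent to the same reducer.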
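
The hash-modulo assignment can be illustrated with a few lines of bash. This is a sketch under an assumption: for an integer bucketing key the hash is taken to be the value itself, and the bucket is (hash & 0x7fffffff) mod the bucket count. Other column types, and possibly other Hive versions, hash differently. The function name bucket_for is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: 0-based bucket index for an integer bucketing key.
# Assumption: hash(int) is the value itself; bucket = (hash & 0x7fffffff) % n.
bucket_for() {
    local id="$1" num_buckets="$2"
    echo $(( (id & 0x7fffffff) % num_buckets ))
}

# With 6 buckets, ids 1, 7 and 13 share one bucket file, while 6 goes elsewhere.
for id in 1 7 13 6; do
    echo "id=$id -> bucket $(bucket_for "$id" 6)"
done
```

Because equal keys always produce the same index, two tables bucketed on the join key with compatible bucket counts can be joined bucket by bucket.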
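Working with bucketed tables:
Create the table:
create table test_buck(id int, name string)
clustered by(id) sorted by (id asc) into 6 buckets -- this is the key clause
row format delimited fields terminated by '\t';
Insert data:
-- enable bucketing
set hive.enforce.bucketing=true;
insert into ...
Note: load data cannot be used to put data into a bucketed table.
set hive.strict.checks.bucketing = true; -- forbids load data on bucketed tables; default is true
How to get data into a bucketed table: create a staging table for it (the staging table must not be bucketed), load the data into the staging table with load data, and then copy it from the staging table into the bucketed table with insert into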
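Uses:
Data sampling: once a file's data has been split into several files, a few of those files can be picked and processed on their own; this is data sampling.
Why sample:
1. During testing the full data set may be too large. Sample it and run the analysis on the sample, which speeds up development.
2. When analyzing the whole data set is impractical, analyze a sample; the result can still reflect the data set as a whole.
How to sample:
Format: select * from table tablesample(bucket x out of y on column) as a
Placement: immediately after the table name. If the table has an alias, put the sampling clause after the table and before the alias.
Parameters of tablesample(bucket x out of y on column):
x: the bucket to start from; x must be less than or equal to y
y: the sampling ratio; it must be a multiple or a factor of the bucket count
column: the column to sample on
Example: the table has 10 buckets and the bucketing column is id
tablesample(bucket 3 out of 5 on id):
Question: how many buckets are sampled? 10/5 = 2. Which two? The 3rd bucket and the 8th bucket.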
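
The arithmetic of the example can be checked with a short bash sketch. It covers only the case described above, where y is a factor of the bucket count n: sampling starts at bucket x and then takes every y-th bucket. The function name sampled_buckets is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: list the 1-based bucket numbers read by
#   tablesample(bucket x out of y on col)
# on a table with n buckets, for the case where y is a factor of n.
sampled_buckets() {
    local x="$1" y="$2" n="$3"
    if (( x < 1 || x > y || n % y != 0 )); then
        echo "unsupported: need 1 <= x <= y and y a factor of n" >&2
        return 1
    fi
    local picked=() b
    # Start at bucket x and step by y until the last bucket is passed.
    for (( b = x; b <= n; b += y )); do
        picked+=("$b")
    done
    echo "${picked[*]}"
}

sampled_buckets 3 5 10   # prints: 3 8
```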
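Improving the performance of multi-table joins: the main tool is the map join
1. Map join: suited to joining a small table with a large table
Prerequisites:
set hive.auto.convert.join=true; -- the map join optimization must be enabled; default is true
set hive.auto.convert.join.noconditionaltask.size=512000000; -- small-table threshold; default is 20971520 (20M)
2. Medium-sized table joined with a large table: to get a map join, Bucket-MapJoin can be used
Prerequisites:
1) The join columns of both tables must be their bucketing columns
2) The medium table's bucket count must be less than or equal to the large table's, and the large table's bucket count must be an integer multiple of it
3) Enable bucket map join: set hive.optimize.bucketmapjoin = true
4) Both tables must be bucketed tables: enable set hive.enforce.bucketing=true;
Once all of these conditions hold, Hive uses Bucket-MapJoin automatically. If they do not hold, Hive checks whether a plain map join applies; if not, it falls back to the original reduce join.
3. Large table joined with a large table: to get a map join, SMB Join can be used. It builds on Bucket-MapJoin, so the Bucket-MapJoin conditions must be met first.
Prerequisites:
1) The join columns of both tables must be their bucketing columns, and the data must be sorted by the bucketing column
2) Both tables must have the same number of buckets
3) Enable bucket map join: set hive.optimize.bucketmapjoin = true
4) Both tables must be bucketed tables: enable set hive.enforce.bucketing=true;
5) The settings required to enable SMB join:
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin.sortedmerge = true; -- enable SMB join
set hive.auto.convert.sortmerge.join.noconditionaltask=true;
set hive.enforce.sorting=true;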
建表操作: create table test_smb_2(mid string,age_id string) CLUSTERED BY(mid) SORTED BY(mid) INTO 500 BUCKETS;--3. 意向用户主题看板: 建模分层操作准备工作: 开启写入压缩set hive.exec.orc.compression.strategy=COMPRESSION;--3.1: 创建 ODS层表: 2张表 (意向表和线索表)CREATE TABLE IF NOT EXISTS itcast_ods.`customer_relationship` ( `id` int COMMENT '客户关系id', `create_date_time` STRING COMMENT '创建时间', `update_date_time` STRING COMMENT '最后更新时间', `deleted` int COMMENT '是否被删除(禁用)', `customer_id` int COMMENT '所属客户id', `first_id` int COMMENT '第一条客户关系id', `belonger` int COMMENT '归属人', `belonger_name` STRING COMMENT '归属人姓名', `initial_belonger` int COMMENT '初始归属人', `distribution_handler` int COMMENT '分配处理人', `business_scrm_department_id` int COMMENT '归属部门', `last_visit_time` STRING COMMENT '最后回访时间', `next_visit_time` STRING COMMENT '下次回访时间', `origin_type` STRING COMMENT '数据来源', `itcast_school_id` int COMMENT '校区Id', `itcast_subject_id` int COMMENT '学科Id', `intention_study_type` STRING COMMENT '意向学习方式', `anticipat_signup_date` STRING COMMENT '预计报名时间', `level` STRING COMMENT '客户级别', `creator` int COMMENT '创建人', `current_creator` int COMMENT '当前创建人：初始==创建人，当在公海拉回时为拉回人', `creator_name` STRING COMMENT '创建者姓名', `origin_channel` STRING COMMENT '来源渠道', `comment` STRING COMMENT '备注', `first_customer_clue_id` int COMMENT '第一条线索id', `last_customer_clue_id` int COMMENT '最后一条线索id', `process_state` STRING COMMENT '处理状态', `process_time` STRING COMMENT '处理状态变动时间', `payment_state` STRING COMMENT '支付状态', `payment_time` STRING COMMENT '支付状态变动时间', `signup_state` STRING COMMENT '报名状态', `signup_time` STRING COMMENT '报名时间', `notice_state` STRING COMMENT '通知状态', `notice_time` STRING COMMENT '通知状态变动时间', `lock_state` STRING COMMENT '锁定状态', `lock_time` STRING COMMENT '锁定状态修改时间', `itcast_clazz_id` int COMMENT '所属ems班级id', `itcast_clazz_time` STRING COMMENT '报班时间', `payment_url` STRING COMMENT '付款链接', `payment_url_time` STRING COMMENT '支付链接生成时间', `ems_student_id` int COMMENT 'ems的学生id', `delete_reason` STRING COMMENT '删除原因', `deleter` int COMMENT 
'删除人', `deleter_name` STRING COMMENT '删除人姓名', `delete_time` STRING COMMENT '删除时间', `course_id` int COMMENT '课程ID', `course_name` STRING COMMENT '课程名称', `delete_comment` STRING COMMENT '删除原因说明', `close_state` STRING COMMENT '关闭装填', `close_time` STRING COMMENT '关闭状态变动时间', `appeal_id` int COMMENT '申诉id', `tenant` int COMMENT '租户', `total_fee` DECIMAL COMMENT '报名费总金额', `belonged` int COMMENT '小周期归属人', `belonged_time` STRING COMMENT '归属时间', `belonger_time` STRING COMMENT '归属时间', `transfer` int COMMENT '转移人', `transfer_time` STRING COMMENT '转移时间', `follow_type` int COMMENT '分配类型，0-自动分配，1-手动分配，2-自动转移，3-手动单个转移，4-手动批量转移，5-公海领取', `transfer_bxg_oa_account` STRING COMMENT '转移到博学谷归属人OA账号', `transfer_bxg_belonger_name` STRING COMMENT '转移到博学谷归属人OA姓名', `end_time` STRING COMMENT '有效截止时间')comment '客户关系表'PARTITIONED BY(start_time STRING)clustered by(id) sorted by(id) into 10 bucketsROW FORMAT DELIMITEDFIELDS TERMINATED BY '\t'stored as orcTBLPROPERTIES ('orc.compress'='ZLIB');
CREATE TABLE IF NOT EXISTS itcast_ods.customer_clue ( id int COMMENT 'customer_clue_id', create_date_time STRING COMMENT '创建时间', update_date_time STRING COMMENT '最后更新时间', deleted STRING COMMENT '是否被删除(禁用)', customer_id int COMMENT '客户id', customer_relationship_id int COMMENT '客户关系id', session_id STRING COMMENT '七陌会话id', sid STRING COMMENT '访客id', status STRING COMMENT '状态(undeal待领取 deal 已领取 finish 已关闭 changePeer 已流转)', users STRING COMMENT '所属坐席', create_time STRING COMMENT '七陌创建时间', platform STRING COMMENT '平台来源 (pc-网站咨询|wap-wap咨询|sdk-app咨询|weixin-微信咨询)', s_name STRING COMMENT '用户名称', seo_source STRING COMMENT '搜索来源', seo_keywords STRING COMMENT '关键字', ip STRING COMMENT 'IP地址', referrer STRING COMMENT '上级来源页面', from_url STRING COMMENT '会话来源页面', landing_page_url STRING COMMENT '访客着陆页面', url_title STRING COMMENT '咨询页面title', to_peer STRING COMMENT '所属技能组', manual_time STRING COMMENT '人工开始时间', begin_time STRING COMMENT '坐席领取时间 ', reply_msg_count int COMMENT '客服回复消息数', total_msg_count int COMMENT '消息总数', msg_count int COMMENT '客户发送消息数', comment STRING COMMENT '备注', finish_reason STRING COMMENT '结束类型', finish_user STRING COMMENT '结束坐席', end_time STRING COMMENT '会话结束时间', platform_description STRING COMMENT '客户平台信息', browser_name STRING COMMENT '浏览器名称', os_info STRING COMMENT '系统名称', area STRING COMMENT '区域', country STRING COMMENT '所在国家', province STRING COMMENT '省', city STRING COMMENT '城市', creator int COMMENT '创建人', name STRING COMMENT '客户姓名', idcard STRING COMMENT '身份证号', phone STRING COMMENT '手机号', itcast_school_id int COMMENT '校区Id', itcast_school STRING COMMENT '校区', itcast_subject_id int COMMENT '学科Id', itcast_subject STRING COMMENT '学科', wechat STRING COMMENT '微信', qq STRING COMMENT 'qq号', email STRING COMMENT '邮箱', gender STRING COMMENT '性别', level STRING COMMENT '客户级别', origin_type STRING COMMENT '数据来源渠道', information_way STRING COMMENT '资讯方式', working_years STRING COMMENT '开始工作时间', technical_directions STRING COMMENT '技术方向', customer_state STRING COMMENT 
'当前客户状态', valid STRING COMMENT '该线索是否是网资有效线索', anticipat_signup_date STRING COMMENT '预计报名时间', clue_state STRING COMMENT '线索状态', scrm_department_id int COMMENT 'SCRM内部部门id', superior_url STRING COMMENT '诸葛获取上级页面URL', superior_source STRING COMMENT '诸葛获取上级页面URL标题', landing_url STRING COMMENT '诸葛获取着陆页面URL', landing_source STRING COMMENT '诸葛获取着陆页面URL来源', info_url STRING COMMENT '诸葛获取留咨页URL', info_source STRING COMMENT '诸葛获取留咨页URL标题', origin_channel STRING COMMENT '投放渠道', course_id int COMMENT '课程编号', course_name STRING COMMENT '课程名称', zhuge_session_id STRING COMMENT 'zhuge会话id', is_repeat int COMMENT '是否重复线索(手机号维度) 0:正常 1：重复', tenant int COMMENT '租户id', activity_id STRING COMMENT '活动id', activity_name STRING COMMENT '活动名称', follow_type int COMMENT '分配类型，0-自动分配，1-手动分配，2-自动转移，3-手动单个转移，4-手动批量转移，5-公海领取', shunt_mode_id int COMMENT '匹配到的技能组id', shunt_employee_group_id int COMMENT '所属分流员工组', ends_time STRING COMMENT '有效时间')comment '客户关系表'PARTITIONED BY(starts_time STRING)clustered by(customer_relationship_id) sorted by(customer_relationship_id) into 10 bucketsROW FORMAT DELIMITEDFIELDS TERMINATED BY '\t'stored as orcTBLPROPERTIES ('orc.compress'='ZLIB');
--3.2: 创建 DIM层表: 5张表CREATE DATABASE IF NOT EXISTS itcast_dimen;CREATE TABLE IF NOT EXISTS itcast_dimen.`customer` ( `id` int COMMENT 'key id', `customer_relationship_id` int COMMENT '当前意向id', `create_date_time` STRING COMMENT '创建时间', `update_date_time` STRING COMMENT '最后更新时间', `deleted` int COMMENT '是否被删除(禁用)', `name` STRING COMMENT '姓名', `idcard` STRING COMMENT '身份证号', `birth_year` int COMMENT '出生年份', `gender` STRING COMMENT '性别', `phone` STRING COMMENT '手机号', `wechat` STRING COMMENT '微信', `qq` STRING COMMENT 'qq号', `email` STRING COMMENT '邮箱', `area` STRING COMMENT '所在区域', `leave_school_date` date COMMENT '离校时间', `graduation_date` date COMMENT '毕业时间', `bxg_student_id` STRING COMMENT '博学谷学员ID，可能未关联到，不存在', `creator` int COMMENT '创建人ID', `origin_type` STRING COMMENT '数据来源', `origin_channel` STRING COMMENT '来源渠道', `tenant` int, `md_id` int COMMENT '中台id')comment '客户表'PARTITIONED BY(start_time STRING)ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'stored as orcTBLPROPERTIES ('orc.compress'='SNAPPY');
CREATE TABLE IF NOT EXISTS itcast_dimen.employee ( id int COMMENT '员工id', email STRING COMMENT '公司邮箱，OA登录账号', real_name STRING COMMENT '员工的真实姓名', phone STRING COMMENT '手机号，目前还没有使用；隐私问题OA接口没有提供这个属性，', department_id STRING COMMENT 'OA中的部门编号，有负值', department_name STRING COMMENT 'OA中的部门名', remote_login STRING COMMENT '员工是否可以远程登录', job_number STRING COMMENT '员工工号', cross_school STRING COMMENT '是否有跨校区权限', last_login_date STRING COMMENT '最后登录日期', creator int COMMENT '创建人', create_date_time STRING COMMENT '创建时间', update_date_time STRING COMMENT '最后更新时间', deleted STRING COMMENT '是否被删除(禁用)', scrm_department_id int COMMENT 'SCRM内部部门id', leave_office STRING COMMENT '离职状态', leave_office_time STRING COMMENT '离职时间', reinstated_time STRING COMMENT '复职时间', superior_leaders_id int COMMENT '上级领导ID', tdepart_id int COMMENT '直属部门', tenant int COMMENT '租户', ems_user_name STRING COMMENT 'ems用户名称')comment '员工表'PARTITIONED BY(start_time STRING)ROW FORMAT DELIMITEDFIELDS TERMINATED BY '\t'stored as orcTBLPROPERTIES ('orc.compress'='SNAPPY');
CREATE TABLE IF NOT EXISTS itcast_dimen.`scrm_department` ( `id` int COMMENT '部门id', `name` STRING COMMENT '部门名称', `parent_id` int COMMENT '父部门id', `create_date_time` STRING COMMENT '创建时间', `update_date_time` STRING COMMENT '更新时间', `deleted` STRING COMMENT '删除标志', `id_path` STRING COMMENT '编码全路径', `tdepart_code` int COMMENT '直属部门', `creator` STRING COMMENT '创建者', `depart_level` int COMMENT '部门层级', `depart_sign` int COMMENT '部门标志，暂时默认1', `depart_line` int COMMENT '业务线，存储业务线编码', `depart_sort` int COMMENT '排序字段', `disable_flag` int COMMENT '禁用标志', `tenant` int COMMENT '租户')comment 'scrm部门表'PARTITIONED BY(start_time STRING)ROW FORMAT DELIMITEDFIELDS TERMINATED BY '\t'stored as orcTBLPROPERTIES ('orc.compress'='SNAPPY');
CREATE TABLE IF NOT EXISTS itcast_dimen.`itcast_school` (
    `id` int COMMENT 'auto-increment primary key',
    `create_date_time` timestamp COMMENT 'creation time',
    `update_date_time` timestamp COMMENT 'last update time',
    `deleted` STRING COMMENT 'whether deleted (disabled)',
    `name` STRING COMMENT 'campus name',
    `code` STRING COMMENT 'campus identifier',
    `tenant` int COMMENT 'tenant'
)
comment 'campus dictionary table'
PARTITIONED BY (start_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');
CREATE TABLE IF NOT EXISTS itcast_dimen.`itcast_subject` (
    `id` int COMMENT 'auto-increment primary key',
    `create_date_time` timestamp COMMENT 'creation time',
    `update_date_time` timestamp COMMENT 'last update time',
    `deleted` STRING COMMENT 'whether deleted (disabled)',
    `name` STRING COMMENT 'subject name',
    `code` STRING COMMENT 'subject code',
    `tenant` int COMMENT 'tenant'
)
comment 'subject dictionary table'
PARTITIONED BY (start_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');
--3.3 Build the DWD layer (demonstrates the join optimization)
CREATE TABLE IF NOT EXISTS itcast_dwd.`itcast_intention_dwd` (
    `rid` int COMMENT 'id',
    `customer_id` STRING COMMENT 'customer id',
    `create_date_time` STRING COMMENT 'creation time',
    `itcast_school_id` STRING COMMENT 'campus id',
    `deleted` STRING COMMENT 'whether deleted',
    `origin_type` STRING COMMENT 'source channel',
    `itcast_subject_id` STRING COMMENT 'subject id',
    `creator` int COMMENT 'creator',
    `hourinfo` STRING COMMENT 'hour',
    `origin_type_stat` STRING COMMENT 'data source: 0 = offline; 1 = online'
)
comment 'customer intention DWD table'
PARTITIONED BY (yearinfo STRING, monthinfo STRING, dayinfo STRING)
clustered by (rid) sorted by (rid) into 10 buckets
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');
-- 3.4: Build the DWM layer
create database itcast_dwm;
CREATE TABLE IF NOT EXISTS itcast_dwm.`itcast_intention_dwm` (
    `customer_id` STRING COMMENT 'customer id',
    `create_date_time` STRING COMMENT 'creation time',
    `area` STRING COMMENT 'region',
    `itcast_school_id` STRING COMMENT 'campus id',
    `itcast_school_name` STRING COMMENT 'campus name',
    `deleted` STRING COMMENT 'whether deleted',
    `origin_type` STRING COMMENT 'source channel',
    `itcast_subject_id` STRING COMMENT 'subject id',
    `itcast_subject_name` STRING COMMENT 'subject name',
    `hourinfo` STRING COMMENT 'hour',
    `origin_type_stat` STRING COMMENT 'data source: 0 = offline; 1 = online',
    `clue_state_stat` STRING COMMENT 'new/old customer: 0 = old customer; 1 = new customer',
    `tdepart_id` STRING COMMENT 'department id of the creator',
    `tdepart_name` STRING COMMENT 'consulting center name'
)
comment 'customer intention DWM table'
PARTITIONED BY (yearinfo STRING, monthinfo STRING, dayinfo STRING)
clustered by (customer_id) sorted by (customer_id) into 10 buckets
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');
-- 3.5 Build the DWS layer
CREATE TABLE IF NOT EXISTS itcast_dws.itcast_intention_dws (
    `customer_total` INT COMMENT 'aggregated number of intention customers',
    `area` STRING COMMENT 'region',
    `itcast_school_id` STRING COMMENT 'campus id',
    `itcast_school_name` STRING COMMENT 'campus name',
    `origin_type` STRING COMMENT 'source channel',
    `itcast_subject_id` STRING COMMENT 'subject id',
    `itcast_subject_name` STRING COMMENT 'subject name',
    `hourinfo` STRING COMMENT 'hour',
    `origin_type_stat` STRING COMMENT 'data source: 0 = offline; 1 = online',
    `clue_state_stat` STRING COMMENT 'customer attribute: 0 = old customer; 1 = new customer',
    `tdepart_id` STRING COMMENT 'department id of the creator',
    `tdepart_name` STRING COMMENT 'consulting center name',
    `time_str` STRING COMMENT 'time detail',
    `groupType` STRING COMMENT 'product attribute category: 1 = total intentions; 2 = region; 3 = campus and subject combined; 4 = source channel; 5 = consulting center',
    `time_type` STRING COMMENT 'time dimension: 1 = by hour; 2 = by day; 3 = by week; 4 = by month; 5 = by year'
)
comment 'customer intention DWS table'
PARTITIONED BY (yearinfo STRING, monthinfo STRING, dayinfo STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');
4. Intention dashboard case study: data collection
4.1: Collect the data for the DIM layer:
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select id, customer_relationship_id, create_date_time, update_date_time, deleted, name, idcard, birth_year, gender, phone, wechat, qq, email, area, leave_school_date, graduation_date, bxg_student_id, creator, origin_type, origin_channel, tenant, md_id, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d") as start_time from customer where $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table customer \
-m 1 \
--split-by id

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select id,email,real_name,-1 as phone,department_id,department_name,remote_login,job_number,cross_school,last_login_date,creator,create_date_time,update_date_time,deleted,scrm_department_id,leave_office,leave_office_time,reinstated_time,superior_leaders_id,tdepart_id,tenant,ems_user_name,FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d") as start_time from employee where $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table employee \
-m 1 \
--split-by id

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select *, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d") as start_time from scrm_department where $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table scrm_department \
-m 1 \
--split-by id

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select *, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d") as start_time from itcast_school where $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table itcast_school \
-m 1 \
--split-by id

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select *, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d") as start_time from itcast_subject where $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table itcast_subject \
-m 1 \
--split-by id
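4.2: Collect the data for the ODS layer
Both ODS tables are bucketed tables, and sqoop cannot import into bucketed tables. The solution: create a staging (temporary) table for each bucketed table, import the data into the staging table with sqoop, and then load it from the staging table into the bucketed table with insert into.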
4.2.1: 意向表的数据导入第一步: 创建意向表的临时表结构CREATE TABLE IF NOT EXISTS itcast_ods.`customer_relationship_tmp` ( `id` int COMMENT '客户关系id', `create_date_time` STRING COMMENT '创建时间', `update_date_time` STRING COMMENT '最后更新时间', `deleted` int COMMENT '是否被删除(禁用)', `customer_id` int COMMENT '所属客户id', `first_id` int COMMENT '第一条客户关系id', `belonger` int COMMENT '归属人', `belonger_name` STRING COMMENT '归属人姓名', `initial_belonger` int COMMENT '初始归属人', `distribution_handler` int COMMENT '分配处理人', `business_scrm_department_id` int COMMENT '归属部门', `last_visit_time` STRING COMMENT '最后回访时间', `next_visit_time` STRING COMMENT '下次回访时间', `origin_type` STRING COMMENT '数据来源', `itcast_school_id` int COMMENT '校区Id', `itcast_subject_id` int COMMENT '学科Id', `intention_study_type` STRING COMMENT '意向学习方式', `anticipat_signup_date` STRING COMMENT '预计报名时间', `level` STRING COMMENT '客户级别', `creator` int COMMENT '创建人', `current_creator` int COMMENT '当前创建人：初始==创建人，当在公海拉回时为拉回人', `creator_name` STRING COMMENT '创建者姓名', `origin_channel` STRING COMMENT '来源渠道', `comment` STRING COMMENT '备注', `first_customer_clue_id` int COMMENT '第一条线索id', `last_customer_clue_id` int COMMENT '最后一条线索id', `process_state` STRING COMMENT '处理状态', `process_time` STRING COMMENT '处理状态变动时间', `payment_state` STRING COMMENT '支付状态', `payment_time` STRING COMMENT '支付状态变动时间', `signup_state` STRING COMMENT '报名状态', `signup_time` STRING COMMENT '报名时间', `notice_state` STRING COMMENT '通知状态', `notice_time` STRING COMMENT '通知状态变动时间', `lock_state` STRING COMMENT '锁定状态', `lock_time` STRING COMMENT '锁定状态修改时间', `itcast_clazz_id` int COMMENT '所属ems班级id', `itcast_clazz_time` STRING COMMENT '报班时间', `payment_url` STRING COMMENT '付款链接', `payment_url_time` STRING COMMENT '支付链接生成时间', `ems_student_id` int COMMENT 'ems的学生id', `delete_reason` STRING COMMENT '删除原因', `deleter` int COMMENT '删除人', `deleter_name` STRING COMMENT '删除人姓名', `delete_time` STRING COMMENT '删除时间', `course_id` int COMMENT '课程ID', `course_name` STRING COMMENT '课程名称', `delete_comment` STRING COMMENT 
'删除原因说明', `close_state` STRING COMMENT '关闭装填', `close_time` STRING COMMENT '关闭状态变动时间', `appeal_id` int COMMENT '申诉id', `tenant` int COMMENT '租户', `total_fee` DECIMAL COMMENT '报名费总金额', `belonged` int COMMENT '小周期归属人', `belonged_time` STRING COMMENT '归属时间', `belonger_time` STRING COMMENT '归属时间', `transfer` int COMMENT '转移人', `transfer_time` STRING COMMENT '转移时间', `follow_type` int COMMENT '分配类型，0-自动分配，1-手动分配，2-自动转移，3-手动单个转移，4-手动批量转移，5-公海领取', `transfer_bxg_oa_account` STRING COMMENT '转移到博学谷归属人OA账号', `transfer_bxg_belonger_name` STRING COMMENT '转移到博学谷归属人OA姓名', `end_time` STRING COMMENT '有效截止时间')comment '客户关系表'PARTITIONED BY(start_time STRING)ROW FORMAT DELIMITEDFIELDS TERMINATED BY '\t'stored as orcTBLPROPERTIES ('orc.compress'='ZLIB');
Step 2: Use sqoop to import the data into the staging table:
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select id, create_date_time, update_date_time, deleted, customer_id, first_id, belonger, belonger_name, initial_belonger, distribution_handler, business_scrm_department_id, last_visit_time, next_visit_time, origin_type, itcast_school_id, itcast_subject_id, intention_study_type, anticipat_signup_date, level, creator, current_creator, creator_name, origin_channel, comment, first_customer_clue_id, last_customer_clue_id, process_state, process_time, payment_state, payment_time, signup_state, signup_time, notice_state, notice_time, lock_state, lock_time, itcast_clazz_id, itcast_clazz_time, payment_url, payment_url_time, ems_student_id, delete_reason, deleter, deleter_name, delete_time, course_id, course_name, delete_comment, close_state, close_time, appeal_id, tenant, total_fee, belonged, belonged_time, belonger_time, transfer, transfer_time, follow_type, transfer_bxg_oa_account, transfer_bxg_belonger_name, date_format("9999-12-31","%Y-%m-%d") as end_time, FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d") as start_time from customer_relationship where $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table customer_relationship_tmp \
-m 1 \
--split-by id
-- Step 3: Load the temporary table's data into the bucketed intention table in the ODS layer:
-- Partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
-- Hive compression
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- Apply compression when writing
set hive.exec.orc.compression.strategy=COMPRESSION;
-- Bucketing
set hive.optimize.bucketmapjoin = true;
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

insert into table itcast_ods.customer_relationship partition(start_time)
select * from itcast_ods.customer_relationship_tmp;
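The import above appends two derived columns, end_time (fixed at 9999-12-31) and start_time (the load date), and the dynamic-partition insert then routes each row by its last column, start_time. The following is only a plain-Python model of that idea, not how sqoop or Hive implement it; the function names and the sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import date

OPEN_END = "9999-12-31"  # sentinel the sqoop query uses for "still current"

def stamp_rows(rows, load_date):
    """Append end_time and start_time as the last two columns, like the sqoop query."""
    start = load_date.strftime("%Y-%m-%d")
    return [dict(row, end_time=OPEN_END, start_time=start) for row in rows]

def dynamic_partition(rows, partition_col="start_time"):
    """Group rows by the partition column; the column itself is not stored in the row."""
    partitions = defaultdict(list)
    for row in rows:
        data = {k: v for k, v in row.items() if k != partition_col}
        partitions[row[partition_col]].append(data)
    return dict(partitions)

# Hypothetical source rows, reduced to two columns for brevity.
source = [{"id": 1, "customer_id": 10}, {"id": 2, "customer_id": 11}]
stamped = stamp_rows(source, date(2021, 9, 20))
partitions = dynamic_partition(stamped)
```

In this model every row of a full load lands in a single partition named after the load date, which is what a first full import into a zipper table looks like.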
4.2.2: Import the clue table data into the ODS-layer table
Step 1: Create the temporary table for the clue table:
CREATE TABLE IF NOT EXISTS itcast_ods.customer_clue_tmp (
    id int COMMENT 'customer_clue_id',
    create_date_time STRING COMMENT 'creation time',
    update_date_time STRING COMMENT 'last update time',
    deleted STRING COMMENT 'whether deleted (disabled)',
    customer_id int COMMENT 'customer id',
    customer_relationship_id int COMMENT 'customer relationship id',
    session_id STRING COMMENT 'Qimo session id',
    sid STRING COMMENT 'visitor id',
    status STRING COMMENT 'status (undeal: pending, deal: claimed, finish: closed, changePeer: transferred)',
    users STRING COMMENT 'assigned agent',
    create_time STRING COMMENT 'Qimo creation time',
    platform STRING COMMENT 'source platform (pc: website consultation | wap: WAP consultation | sdk: app consultation | weixin: WeChat consultation)',
    s_name STRING COMMENT 'user name',
    seo_source STRING COMMENT 'search source',
    seo_keywords STRING COMMENT 'keywords',
    ip STRING COMMENT 'IP address',
    referrer STRING COMMENT 'referring page',
    from_url STRING COMMENT 'session source page',
    landing_page_url STRING COMMENT 'visitor landing page',
    url_title STRING COMMENT 'consultation page title',
    to_peer STRING COMMENT 'assigned skill group',
    manual_time STRING COMMENT 'manual service start time',
    begin_time STRING COMMENT 'time the agent claimed the session',
    reply_msg_count int COMMENT 'number of agent reply messages',
    total_msg_count int COMMENT 'total number of messages',
    msg_count int COMMENT 'number of messages sent by the customer',
    comment STRING COMMENT 'remarks',
    finish_reason STRING COMMENT 'end type',
    finish_user STRING COMMENT 'agent who ended the session',
    end_time STRING COMMENT 'session end time',
    platform_description STRING COMMENT 'customer platform information',
    browser_name STRING COMMENT 'browser name',
    os_info STRING COMMENT 'operating system name',
    area STRING COMMENT 'area',
    country STRING COMMENT 'country',
    province STRING COMMENT 'province',
    city STRING COMMENT 'city',
    creator int COMMENT 'creator',
    name STRING COMMENT 'customer name',
    idcard STRING COMMENT 'ID card number',
    phone STRING COMMENT 'mobile number',
    itcast_school_id int COMMENT 'campus id',
    itcast_school STRING COMMENT 'campus',
    itcast_subject_id int COMMENT 'subject id',
    itcast_subject STRING COMMENT 'subject',
    wechat STRING COMMENT 'WeChat',
    qq STRING COMMENT 'QQ number',
    email STRING COMMENT 'email',
    gender STRING COMMENT 'gender',
    level STRING COMMENT 'customer level',
    origin_type STRING COMMENT 'data source channel',
    information_way STRING COMMENT 'consultation method',
    working_years STRING COMMENT 'time the customer started working',
    technical_directions STRING COMMENT 'technical direction',
    customer_state STRING COMMENT 'current customer state',
    valid STRING COMMENT 'whether the clue is a valid online-consultation clue',
    anticipat_signup_date STRING COMMENT 'expected signup date',
    clue_state STRING COMMENT 'clue state',
    scrm_department_id int COMMENT 'SCRM internal department id',
    superior_url STRING COMMENT 'referring page URL obtained from Zhuge',
    superior_source STRING COMMENT 'referring page URL title obtained from Zhuge',
    landing_url STRING COMMENT 'landing page URL obtained from Zhuge',
    landing_source STRING COMMENT 'landing page URL source obtained from Zhuge',
    info_url STRING COMMENT 'inquiry page URL obtained from Zhuge',
    info_source STRING COMMENT 'inquiry page URL title obtained from Zhuge',
    origin_channel STRING COMMENT 'delivery channel',
    course_id int COMMENT 'course id',
    course_name STRING COMMENT 'course name',
    zhuge_session_id STRING COMMENT 'zhuge session id',
    is_repeat int COMMENT 'whether the clue is a duplicate (by mobile number) 0: normal, 1: duplicate',
    tenant int COMMENT 'tenant id',
    activity_id STRING COMMENT 'activity id',
    activity_name STRING COMMENT 'activity name',
    follow_type int COMMENT 'allocation type: 0 automatic allocation, 1 manual allocation, 2 automatic transfer, 3 manual single transfer, 4 manual batch transfer, 5 claimed from the public pool',
    shunt_mode_id int COMMENT 'id of the matched skill group',
    shunt_employee_group_id int COMMENT 'assigned shunt employee group',
    ends_time STRING COMMENT 'valid-until time')
comment 'customer clue table'
PARTITIONED BY(starts_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='ZLIB');
Step 2: Use sqoop to import the data into the clue table's temporary table:
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'select id,create_date_time,update_date_time,deleted,customer_id,customer_relationship_id,session_id,sid,status,
 user as users,create_time,platform,s_name,seo_source,seo_keywords,ip,referrer,from_url,landing_page_url,url_title,
 to_peer,manual_time,begin_time,reply_msg_count,total_msg_count,msg_count,comment,finish_reason,finish_user,end_time,
 platform_description,browser_name,os_info,area,country,province,city,creator,name,
 "-1" as idcard,"-1" as phone,itcast_school_id,itcast_school,itcast_subject_id,itcast_subject,
 "-1" as wechat,"-1" as qq,"-1" as email,gender,level,origin_type,information_way,working_years,
 technical_directions,customer_state,valid,anticipat_signup_date,clue_state,scrm_department_id,superior_url,
 superior_source,landing_url,landing_source,info_url,info_source,origin_channel,course_id,course_name,
 zhuge_session_id,is_repeat,tenant,activity_id,activity_name,follow_type,shunt_mode_id,shunt_employee_group_id,
 date_format("9999-12-31","%Y-%m-%d") as ends_time,
 FROM_UNIXTIME(unix_timestamp(),"%Y-%m-%d") as starts_time
 from customer_clue where $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table customer_clue_tmp \
-m 1 \
--split-by id
Step 3: Load the temporary table's data into the clue table:
insert into table itcast_ods.customer_clue partition(starts_time)
select * from itcast_ods.customer_clue_tmp;
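4.3: Data cleaning and transformation: ODS intention table --> cleaned DWD intention table
Which cleaning and transformation operations are required?
  Cleaning:
    Remove the rows flagged deleted = 1.
  Transformation:
    create_date_time: derive the year, month, day, and hour from this field.
    origin_type: generate a new field, origin_type_stat, that distinguishes online from offline.
    School id and subject id: during the sync, convert 0 and null to one unified value, -1.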
Cleaning and transformation SQL:
INSERT INTO TABLE itcast_dwd.itcast_intention_dwd partition(yearinfo,monthinfo,dayinfo)
select
    id as rid,
    customer_id,
    create_date_time,
    if(itcast_school_id is null or itcast_school_id = 0, '-1', itcast_school_id) as itcast_school_id,
    deleted,
    origin_type,
    if(itcast_subject_id is null or itcast_subject_id = 0, '-1', itcast_subject_id) as itcast_subject_id,
    creator,
    substr(create_date_time,12,2) as hourinfo,
    if(origin_type in('NETSERVICE','PRESIGNUP'),'1','0') as origin_type_stat,
    substr(create_date_time,1,4) as yearinfo,
    substr(create_date_time,6,2) as monthinfo,
    substr(create_date_time,9,2) as dayinfo
-- TABLESAMPLE reads only bucket 1 of 10 (bucketed on id), i.e. a sample of the table
from itcast_ods.customer_relationship TABLESAMPLE(BUCKET 1 OUT OF 10 on id) as cr
where deleted = 0;
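To make the cleaning rules easier to check by hand, here is a minimal plain-Python sketch that applies the same rules to a single row. It is only an illustration of the logic in the SQL above: the function names and the sample row are hypothetical, and it assumes create_date_time has the form 'YYYY-MM-DD HH:MM:SS' and that deleted arrives as an integer.

```python
ONLINE_TYPES = {"NETSERVICE", "PRESIGNUP"}  # origin_type values treated as online

def normalize_id(value):
    """Map NULL (None) and 0 to '-1'; keep any other id, as a string."""
    return "-1" if value is None or value == 0 else str(value)

def clean_row(row):
    """Return the DWD row for one ODS row, or None if the row is flagged deleted."""
    if row["deleted"] != 0:
        return None
    ts = row["create_date_time"]  # assumed form: 'YYYY-MM-DD HH:MM:SS'
    return {
        "rid": row["id"],
        "customer_id": row["customer_id"],
        "create_date_time": ts,
        "itcast_school_id": normalize_id(row["itcast_school_id"]),
        "deleted": row["deleted"],
        "origin_type": row["origin_type"],
        "itcast_subject_id": normalize_id(row["itcast_subject_id"]),
        "creator": row["creator"],
        "hourinfo": ts[11:13],   # same characters as substr(ts, 12, 2)
        "origin_type_stat": "1" if row["origin_type"] in ONLINE_TYPES else "0",
        "yearinfo": ts[0:4],     # substr(ts, 1, 4)
        "monthinfo": ts[5:7],    # substr(ts, 6, 2)
        "dayinfo": ts[8:10],     # substr(ts, 9, 2)
    }

# Hypothetical ODS row used only to exercise the rules.
sample = {
    "id": 7, "customer_id": 42, "create_date_time": "2021-09-20 14:05:33",
    "itcast_school_id": 0, "deleted": 0, "origin_type": "NETSERVICE",
    "itcast_subject_id": None, "creator": 3,
}
cleaned = clean_row(sample)
```

Note that Hive's substr is 1-based while Python slices are 0-based, which is why substr(ts, 12, 2) corresponds to ts[11:13].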
-- 4.4: Data transformation: DWD --> DWM
-- Partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
-- Hive compression
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- Apply compression when writing
set hive.exec.orc.compression.strategy=COMPRESSION;
-- Bucketing
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;
insert into table itcast_dwm.itcast_intention_dwm partition(yearinfo,monthinfo,dayinfo)
select
    iid.customer_id,
    iid.create_date_time,
    dcu.area,
    iid.itcast_school_id,
    dis.name,
    iid.deleted,
    iid.origin_type,
    iid.itcast_subject_id,
    disub.name,
    iid.hourinfo,
    iid.origin_type_stat,
    -- identify new vs. returning customers
    if(cc.clue_state ='VALID_NEW_CLUES', '1', if(cc.clue_state ='VALID_PUBLIC_NEW_CLUE','0','-1')) as clue_state_stat,
    demp.tdepart_id,
    dsd.name,
    iid.yearinfo,
    iid.monthinfo,
    iid.dayinfo
from itcast_dwd.itcast_intention_dwd as iid
    left join itcast_ods.customer_clue as cc on iid.rid = cc.customer_relationship_id
    left join itcast_dimen.itcast_school as dis on dis.id = iid.itcast_school_id
    left join itcast_dimen.itcast_subject as disub on disub.id = iid.itcast_subject_id
    left join itcast_dimen.customer as dcu on dcu.id = iid.customer_id
    left join itcast_dimen.employee as demp on demp.id = iid.creator
    left join itcast_dimen.scrm_department as dsd on dsd.id = demp.tdepart_id;
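Testing shows that the join between itcast_intention_dwd and customer_clue is optimized as an SMB (sort-merge-bucket) map join, while the joins to all the other tables are ordinary map joins.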
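For readers unfamiliar with SMB joins, the idea can be sketched in plain Python: both tables are bucketed on the join key with the same bucket count and sorted within each bucket, so each pair of matching buckets can be joined with one forward merge pass. This is a conceptual model only, not Hive's implementation; the function names, the modulo bucketing, and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict

def bucketize(rows, key, num_buckets):
    """Split rows into buckets by key % num_buckets, sorted by key inside each bucket."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[key] % num_buckets].append(row)
    return {b: sorted(rs, key=lambda r: r[key]) for b, rs in buckets.items()}

def merge_left_join(left, right, left_key, right_key):
    """Left join two lists that are already sorted by their join keys."""
    joined, j = [], 0
    for lrow in left:
        # Skip right rows whose key is smaller than the current left key.
        while j < len(right) and right[j][right_key] < lrow[left_key]:
            j += 1
        k = j
        while k < len(right) and right[k][right_key] == lrow[left_key]:
            joined.append((lrow, right[k]))
            k += 1
        if k == j:  # no right row matched: keep the left row with a NULL partner
            joined.append((lrow, None))
    return joined

def smb_left_join(left_rows, right_rows, left_key, right_key, num_buckets):
    """Join bucket i of the left side only against bucket i of the right side."""
    left_b = bucketize(left_rows, left_key, num_buckets)
    right_b = bucketize(right_rows, right_key, num_buckets)
    joined = []
    for b, lrows in left_b.items():
        joined.extend(merge_left_join(lrows, right_b.get(b, []), left_key, right_key))
    return joined

# Hypothetical rows: intention records (rid) and clue records.
intentions = [{"rid": r} for r in (5, 1, 12, 7, 3)]
clues = [
    {"customer_relationship_id": 1, "clue_state": "VALID_NEW_CLUES"},
    {"customer_relationship_id": 7, "clue_state": "VALID_NEW_CLUES"},
    {"customer_relationship_id": 7, "clue_state": "VALID_PUBLIC_NEW_CLUE"},
    {"customer_relationship_id": 20, "clue_state": "VALID_NEW_CLUES"},
]
result = smb_left_join(intentions, clues, "rid", "customer_relationship_id", 4)
pairs = sorted(
    ((l["rid"], r["clue_state"] if r else None) for l, r in result),
    key=lambda t: (t[0], t[1] or ""),
)
```

Because rows with equal keys always fall into the same bucket number on both sides, no bucket ever has to be compared with a bucket of a different number, which is what lets the join avoid a shuffle.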
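4.5) Statistical analysis:
  Metric: number of intention customers
  Dimensions:
    Time dimension: year, month, day, hour
    New vs. returning dimension
    Online vs. offline
    Product attribute dimensions: area, source channel, subject, campus, consultation center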
-- Requirement 1: count intention customers per month, by new/returning and online/offline
insert into table itcast_dws.itcast_intention_dws partition(yearinfo,monthinfo,dayinfo)
select
    count(distinct customer_id) as customer_total,
    '-1' as area,
    '-1' as itcast_school_id,
    '-1' as itcast_school_name,
    '-1' as origin_type,
    '-1' as itcast_subject_id,
    '-1' as itcast_subject_name,
    '-1' as hourinfo,
    origin_type_stat,
    clue_state_stat,
    '-1' as tdepart_id,
    '-1' as tdepart_name,
    concat(yearinfo,'-',monthinfo) as time_str,
    '1' as grouptype,
    '4' as time_type,
    yearinfo,
    monthinfo,
    '-1' as dayinfo
from itcast_dwm.itcast_intention_dwm
group by yearinfo, monthinfo, clue_state_stat, origin_type_stat;
-- Requirement 2: count intention customers per day, by new/returning, online/offline, and area
insert into table itcast_dws.itcast_intention_dws partition(yearinfo,monthinfo,dayinfo)
select
    count(distinct customer_id) as customer_total,
    area,
    '-1' as itcast_school_id,
    '-1' as itcast_school_name,
    '-1' as origin_type,
    '-1' as itcast_subject_id,
    '-1' as itcast_subject_name,
    '-1' as hourinfo,
    origin_type_stat,
    clue_state_stat,
    '-1' as tdepart_id,
    '-1' as tdepart_name,
    concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
    '2' as grouptype,
    '2' as time_type,
    yearinfo,
    monthinfo,
    dayinfo
from itcast_dwm.itcast_intention_dwm
group by yearinfo, monthinfo, dayinfo, clue_state_stat, origin_type_stat, area;
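Both queries follow one pattern: group by the time columns plus the dimensions of interest, count distinct customer_id per group, and write '-1' into every dimension that is not part of the grouping. A rough plain-Python model of that pattern is sketched below. The helper name, the reduced column set, and the sample rows are hypothetical; the real DWS table has more columns.

```python
from collections import defaultdict

# Reduced, hypothetical subset of the DWS columns that may hold the '-1' placeholder.
PLACEHOLDER_COLS = ["area", "hourinfo", "yearinfo", "monthinfo", "dayinfo"]
BASE_DIMS = ["clue_state_stat", "origin_type_stat"]  # present in every requirement

def aggregate(rows, time_cols, extra_dims, grouptype, time_type):
    """Count distinct customers per group; unused dimensions are set to '-1'."""
    group_cols = time_cols + BASE_DIMS + extra_dims
    customers_by_group = defaultdict(set)
    for row in rows:
        key = tuple(row[c] for c in group_cols)
        customers_by_group[key].add(row["customer_id"])  # the set gives count(distinct)
    result = []
    for key, customers in customers_by_group.items():
        out = {c: "-1" for c in PLACEHOLDER_COLS}
        out.update(zip(group_cols, key))
        out["customer_total"] = len(customers)
        out["time_str"] = "-".join(out[c] for c in time_cols)
        out["grouptype"] = grouptype
        out["time_type"] = time_type
        result.append(out)
    return result

def dwm_row(customer_id, day, area):
    """Build a hypothetical DWM row for September 2021."""
    return {"customer_id": customer_id, "yearinfo": "2021", "monthinfo": "09",
            "dayinfo": day, "hourinfo": "10", "area": area,
            "clue_state_stat": "1", "origin_type_stat": "1"}

rows = [dwm_row(1, "20", "Beijing"), dwm_row(1, "21", "Beijing"), dwm_row(2, "20", "Shanghai")]
monthly = aggregate(rows, ["yearinfo", "monthinfo"], [], "1", "4")
daily_by_area = aggregate(rows, ["yearinfo", "monthinfo", "dayinfo"], ["area"], "2", "2")
```

In the monthly result customer 1 is counted once even though it has two intention records, which is the reason the SQL uses count(distinct customer_id) rather than count(*).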