4——Hive数仓项目完整流程(在线教育)

在线教育(4)

1. 意向用户主题看板__全量流程

1.1 需求分析

  • 需求一:计期内,新增意向客户(包含自己录入的意向客户)总数
涉及维度:	时间维度:年、季度、月、天、小时、		线上线下
涉及指标:	意向量
涉及表:	customer_relationship(客户意向表)
涉及字段:	create_date_time、customer_id
缺失字段: 线上线下
  • 需求二:统计指定时间段内,新增的意向客户,所在城市区域人数热力图
涉及维度:	时间维度、地区维度、新增、线上线下
涉及指标:	意向量
涉及表:	customer_relationship(客户意向表)、customer(客户表)
	连接条件:	客户表.id = 客户意向表.customer_id
涉及字段:	create_date_time、customer_id、area

注意:区域热力图,对数据进行排序
  • 需求三:统计指定时间段内,新增的意向客户中,意向学科人数排行榜。学科名称要关联查询出来
涉及维度:	时间维度、学科维度、新增、线上线下
涉及指标:	意向量
涉及表:	customer_relationship(客户意向表)、customer_clue(客户线索表)、itcast_subject(学科表)
	连接条件:客户意向表.itcast_subject_id = 学科表.id
			客户线索表.customer_relationship_id = 客户意向表.id
涉及字段:	
	客户线索表.create_date_time、
	客户意向表.itcast_subject_id、
	客户意向表.customer_id、
	客户意向表.origin_type('NETSERVICE','PRESIGNUP')、
	学科表.name
	客户线索表.clue_state='VALID_NEW_CLUES'(新客户新线索)
清洗操作: 
	客户线索表.deleted = false 数据保留下, 为true清洗掉
  • 需求四:统计指定时间段内,新增的意向客户中,意向校区人数排行榜
涉及维度:	时间维度、线上线下、校区维度、线上线下
涉及指标:	意向量
涉及表:	customer_clue(客户线索表)、customer_relationship(客户意向表)、itcast_school(校区表)
	连接条件: 客户意向表.itcast_school_id=校区表.id
			 客户意向表.id = 客户线索表.customer_relationship_id
涉及字段:
		时间维度	客户线索表.create_date_time
		线上线下	客户意向表.origin_type('NETSERVICE','PRESIGNUP')(排除线下)
		校区维度	客户意向表.itcast_school_id	,校区表.name
		指标字段	客户意向表.customer_id
		客户线索表.clue_state('VALID_NEW_CLUES')新客户新线索
  • 需求五:统计指定时间段内,新增的意向客户中,不同来源渠道的意向客户占比
涉及维度:	时间维度、来源渠道、新增
涉及指标:	意向量
涉及表:	customer_relationship(客户意向表)、customer_clue(客户线索表)
	连接条件:客户意向表.id = 客户线索表.customer_relationship_id
涉及字段:
	时间维度	客户意向表.create_date_time	
	来源渠道	客户意向表.origin_type('NETSERVICE','PRESIGNUP')(排除线下)
	新增		客户线索表.clue_state('VALID_NEW_CLUES')
	指标字段	客户意向表.customer_id
清洗操作: 
	线索表.deleted = false
  • 需求六:统计指定时间段内,新增的意向客户中,各咨询中心产生的意向客户数占比情况
过滤条件:	新增
涉及维度:	时间维度、咨询中心、线上线下
涉及指标:	意向量
涉及表:	客户意向表、employee(员工表)、scrm_department(部门表)
	连接条件:客户意向表.creator = 员工表.id
			员工表.tdepart_id = 部门表.id
涉及字段:	时间维度	客户意向表.create_data_time
		   咨询中心		员工表.tdepart_id、部门表.name
		   新增		客户线索表.clue_state('VALID_NEW_CLUES')
		   排除线下		客户意向表.origin_type('NETSERVICE','PRESIGNUP')

1.2 需求分析总结

涉及维度:
	固有维度:	时间维度、线上线下
	产品属性维度:	地区维度、学科、校区、来源渠道、咨询中心
	过滤条件:	新增
涉及指标:	意向量
涉及表:
	事实表:	客户意向表
	维度表:	客户线索表、客户表、学科表、校区表、员工表、部门表
	连接条件:
		客户表.id = 客户意向表.customer_id
		客户意向表.itcast_subject_id = 学科表.id
		客户线索表.customer_relationship_id = 客户意向表.id
		客户意向表.itcast_school_id=校区表.id
		客户意向表.creator = 员工表.id
		员工表.tdepart_id = 部门表.id
涉及字段:
	意向量	customer_id
	时间维度	create_date_time
	线上线下	客户意向表.origin_type('NETSERVICE','PRESIGNUP')
	新增		客户线索表.clue_state('VALID_NEW_CLUES')
	地区维度	area
	学科			客户意向表.itcast_subject_id、学科表.name
	校区		客户意向表.itcast_school_id	,校区表.name
	来源渠道	客户意向表.origin_type('NETSERVICE','PRESIGNUP')
	咨询中心	员工表.tdepart_id、部门表.name
需清洗内容:
	过滤字段:客户意向表.deleted = false
需转换内容:
	日期字段:create_date_time转换为yearinfo......
	新增:客户线索表.clue_state值为VALID_NEW_CLUES为新用户
	线上线下:客户意向表.origin_type('NETSERVICE','PRESIGNUP')表示线上		转化为线下(0)、线上(1)
	校区和学科id转换:需要将客户意向表中, 校区id 和 学科id 如果为 0或者 null 转换为 -1

1.3 业务数据准备(工作环境无)

将数据导入MySQL中,首先建库之后导入数据

create database scrm default character set utf8mb4 collate utf8mb4_unicode_ci;

1.4 建模分析

  • ODS 层:源数据层
作用: 存储事实表和少量维度表
建表字段: 与源数据中一致即可
	注意: 分区字段为时间字段,用于标记抽取数据到 ODS层 的时间
	表: 客户意向表	外加	客户线索表
		注意: 意向表 和 线索表 存在数据变更操作,需采用缓慢渐变维方式解决
  • DIM 层:维度层
作用:	存储维度表
表 :  客户表、学科表、员工表、部门表、校区表
建表字段:  与表中字段一致 + 抽取时间
  • DW 层:数据仓库层
    • DWD 层:明细层
作用: 清洗转换、少量维度退化
	清洗操作:
		过滤字段:客户意向表.deleted = false
	转换操作:
		日期字段:create_date_time转换为yearinfo......
		新增:客户线索表.clue_state值为VALID_NEW_CLUES为新用户
		线上线下:客户意向表.origin_type('NETSERVICE','PRESIGNUP')表示线上		转化为线下(0)、线上(1)
		校区和学科id转换:需要将客户意向表中, 校区id 和 学科id 如果为 0或者 null 转换为 -1
	
建表字段: 必须字段(只可为事实表字段) + 清洗字段 + 转换字段 + join字段
(因为没有做维度退化,所以只有事实表字段)(join字段只写与事实表相关的)
	customer_relationship(意向表)字段:
		时间维度:	create_date_time
		线上线下:origin_type --> origin_type_stat线下(0)、线上(1)
		新增	 :	  origin_type
		校区维度:	itcast_school_id
		学科维度:	itcast_subject_id
		来源渠道:	origin_type
		join字段:	  customer_id、id
		
最终字段:	
id	customer_id,create_date_time,origin_type,itcast_subject_id,itcast_school_id,origin_type_stat,
	creator,deleted,yearinfo,monthinfo,dayinfo,hourinfo
  • DWM 层:中间层
作用: 维度退化,提前聚合
因为后续的意向量指标字段存在去重操作,所以无维度退化操作
join字段:	
		客户表.id = 客户意向表.customer_id
		客户意向表.itcast_subject_id = 学科表.id
		客户线索表.customer_relationship_id = 客户意向表.id
		客户意向表.itcast_school_id=校区表.id
		客户意向表.creator = 员工表.id
		员工表.tdepart_id = 部门表.id
建表字段: 指标字段 + 各表维度相关字段
id,customer_id,create_date_time,origin_type,
area,
itcast_subject_id,itcast_subject_name,
itcast_school_id,itcast_school_name,
origin_type_stat,
tdepart_id,tdepart_name
	creator,deleted,yearinfo,monthinfo,dayinfo,hourinfo
	需要7表关联操作
  • DWS 层:数据业务层
作用: 细化维度统计操作
建表字段:  统计字段 + 各维度字段 + 三个用于查询的字段
字段: 
	指标字段:	customerid_total
	时间维度:	yearinfo,monthinfo,dayinfo,hourinfo
	新	老:	  clue_state_stat
	线上线下:	origin_type_stat
	来源渠道:	origin_type
	地 	区:	  area
	学科维度:	itcast_subject_id,itcast_subject_name
	校区维度:	itcast_schoool_id,itcast_school_name
	咨询中心:	tdepart_id,tdepart_name
	经验字段:	group_type,time_type,time_str
	

1.5 建模操作

  • ODS 层:
set hive.exec.orc.compression.strategy=COMPRESSION;
-- 客户意向表(内部表 分区表 分桶表, 拉链表)
create table if not exists itcast_ods.'customer_relationship'(
`id` int COMMENT '客户关系id',
  `create_date_time` STRING COMMENT '创建时间',
  `update_date_time` STRING COMMENT '最后更新时间',
  `deleted` int COMMENT '是否被删除(禁用)',
  `customer_id` int COMMENT '所属客户id',
  `first_id` int COMMENT '第一条客户关系id',
  `belonger` int COMMENT '归属人',
  `belonger_name` STRING COMMENT '归属人姓名',
  `initial_belonger` int COMMENT '初始归属人',
  `distribution_handler` int COMMENT '分配处理人',
  `business_scrm_department_id` int COMMENT '归属部门',
  `last_visit_time` STRING COMMENT '最后回访时间',
  `next_visit_time` STRING COMMENT '下次回访时间',
  `origin_type` STRING COMMENT '数据来源',
  `itcast_school_id` int COMMENT '校区Id',
  `itcast_subject_id` int COMMENT '学科Id',
  `intention_study_type` STRING COMMENT '意向学习方式',
  `anticipat_signup_date` STRING COMMENT '预计报名时间',
  `level` STRING COMMENT '客户级别',
  `creator` int COMMENT '创建人',
  `current_creator` int COMMENT '当前创建人:初始==创建人,当在公海拉回时为 拉回人',
  `creator_name` STRING COMMENT '创建者姓名',
  `origin_channel` STRING COMMENT '来源渠道',
  `comment` STRING COMMENT '备注',
  `first_customer_clue_id` int COMMENT '第一条线索id',
  `last_customer_clue_id` int COMMENT '最后一条线索id',
  `process_state` STRING COMMENT '处理状态',
  `process_time` STRING COMMENT '处理状态变动时间',
  `payment_state` STRING COMMENT '支付状态',
  `payment_time` STRING COMMENT '支付状态变动时间',
  `signup_state` STRING COMMENT '报名状态',
  `signup_time` STRING COMMENT '报名时间',
  `notice_state` STRING COMMENT '通知状态',
  `notice_time` STRING COMMENT '通知状态变动时间',
  `lock_state` STRING COMMENT '锁定状态',
  `lock_time` STRING COMMENT '锁定状态修改时间',
  `itcast_clazz_id` int COMMENT '所属ems班级id',
  `itcast_clazz_time` STRING COMMENT '报班时间',
  `payment_url` STRING COMMENT '付款链接',
  `payment_url_time` STRING COMMENT '支付链接生成时间',
  `ems_student_id` int COMMENT 'ems的学生id',
  `delete_reason` STRING COMMENT '删除原因',
  `deleter` int COMMENT '删除人',
  `deleter_name` STRING COMMENT '删除人姓名',
  `delete_time` STRING COMMENT '删除时间',
  `course_id` int COMMENT '课程ID',
  `course_name` STRING COMMENT '课程名称',
  `delete_comment` STRING COMMENT '删除原因说明',
  `close_state` STRING COMMENT '关闭装填',
  `close_time` STRING COMMENT '关闭状态变动时间',
  `appeal_id` int COMMENT '申诉id',
  `tenant` int COMMENT '租户',
  `total_fee` DECIMAL COMMENT '报名费总金额',
  `belonged` int COMMENT '小周期归属人',
  `belonged_time` STRING COMMENT '归属时间',
  `belonger_time` STRING COMMENT '归属时间',
  `transfer` int COMMENT '转移人',
  `transfer_time` STRING COMMENT '转移时间',
  `follow_type` int COMMENT '分配类型,0-自动分配,1-手动分配,2-自动转移,3-手动单个转移,4-手动批量转移,5-公海领取',
  `transfer_bxg_oa_account` STRING COMMENT '转移到博学谷归属人OA账号',
  `transfer_bxg_belonger_name` STRING COMMENT '转移到博学谷归属人OA姓名',
  `end_time` STRING COMMENT '有效截止时间')
  comment '客户意向表'
  partitioned by (start_time string)
  clustered by (id) sorted by(id) into 10 buckets
  row format delimited
  field terminated by '\t'
  stored as orc
  tblproperties('orc.compress'='zlib');
  
-- 客户线索表
CREATE TABLE IF NOT EXISTS itcast_ods.customer_clue (
  id int COMMENT 'customer_clue_id',
  create_date_time STRING COMMENT '创建时间',
  update_date_time STRING COMMENT '最后更新时间',
  deleted STRING COMMENT '是否被删除(禁用)',
  customer_id int COMMENT '客户id',
  customer_relationship_id int COMMENT '客户关系id',
  session_id STRING COMMENT '七陌会话id',
  sid STRING COMMENT '访客id',
  status STRING COMMENT '状态(undeal待领取 deal 已领取 finish 已关闭 changePeer 已流转)',
  users STRING COMMENT '所属坐席',
  create_time STRING COMMENT '七陌创建时间',
  platform STRING COMMENT '平台来源 (pc-网站咨询|wap-wap咨询|sdk-app咨询|weixin-微信咨询)',
  s_name STRING COMMENT '用户名称',
  seo_source STRING COMMENT '搜索来源',
  seo_keywords STRING COMMENT '关键字',
  ip STRING COMMENT 'IP地址',
  referrer STRING COMMENT '上级来源页面',
  from_url STRING COMMENT '会话来源页面',
  landing_page_url STRING COMMENT '访客着陆页面',
  url_title STRING COMMENT '咨询页面title',
  to_peer STRING COMMENT '所属技能组',
  manual_time STRING COMMENT '人工开始时间',
  begin_time STRING COMMENT '坐席领取时间 ',
  reply_msg_count int COMMENT '客服回复消息数',
  total_msg_count int COMMENT '消息总数',
  msg_count int COMMENT '客户发送消息数',
  comment STRING COMMENT '备注',
  finish_reason STRING COMMENT '结束类型',
  finish_user STRING COMMENT '结束坐席',
  end_time STRING COMMENT '会话结束时间',
  platform_description STRING COMMENT '客户平台信息',
  browser_name STRING COMMENT '浏览器名称',
  os_info STRING COMMENT '系统名称',
  area STRING COMMENT '区域',
  country STRING COMMENT '所在国家',
  province STRING COMMENT '省',
  city STRING COMMENT '城市',
  creator int COMMENT '创建人',
  name STRING COMMENT '客户姓名',
  idcard STRING COMMENT '身份证号',
  phone STRING COMMENT '手机号',
  itcast_school_id int COMMENT '校区Id',
  itcast_school STRING COMMENT '校区',
  itcast_subject_id int COMMENT '学科Id',
  itcast_subject STRING COMMENT '学科',
  wechat STRING COMMENT '微信',
  qq STRING COMMENT 'qq号',
  email STRING COMMENT '邮箱',
  gender STRING COMMENT '性别',
  level STRING COMMENT '客户级别',
  origin_type STRING COMMENT '数据来源渠道',
  information_way STRING COMMENT '资讯方式',
  working_years STRING COMMENT '开始工作时间',
  technical_directions STRING COMMENT '技术方向',
  customer_state STRING COMMENT '当前客户状态',
  valid STRING COMMENT '该线索是否是网资有效线索',
  anticipat_signup_date STRING COMMENT '预计报名时间',
  clue_state STRING COMMENT '线索状态',
  scrm_department_id int COMMENT 'SCRM内部部门id',
  superior_url STRING COMMENT '诸葛获取上级页面URL',
  superior_source STRING COMMENT '诸葛获取上级页面URL标题',
  landing_url STRING COMMENT '诸葛获取着陆页面URL',
  landing_source STRING COMMENT '诸葛获取着陆页面URL来源',
  info_url STRING COMMENT '诸葛获取留咨页URL',
  info_source STRING COMMENT '诸葛获取留咨页URL标题',
  origin_channel STRING COMMENT '投放渠道',
  course_id int COMMENT '课程编号',
  course_name STRING COMMENT '课程名称',
  zhuge_session_id STRING COMMENT 'zhuge会话id',
  is_repeat int COMMENT '是否重复线索(手机号维度) 0:正常 1:重复',
  tenant int COMMENT '租户id',
  activity_id STRING COMMENT '活动id',
  activity_name STRING COMMENT '活动名称',
  follow_type int COMMENT '分配类型,0-自动分配,1-手动分配,2-自动转移,3-手动单个转移,4-手动批量转移,5-公海领取',
  shunt_mode_id int COMMENT '匹配到的技能组id',
  shunt_employee_group_id int COMMENT '所属分流员工组',
  ends_time STRING COMMENT '有效时间')
comment '客户关系表'
PARTITIONED BY(starts_time STRING)
clustered by(customer_relationship_id) sorted by(customer_relationship_id) into 10 buckets
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='ZLIB');
  • DIM 层:
CREATE DATABASE IF NOT EXISTS itcast_dimen;
-- 客户表
CREATE TABLE IF NOT EXISTS itcast_dimen.`customer` (
  `id` int COMMENT 'key id',
  `customer_relationship_id` int COMMENT '当前意向id',
  `create_date_time` STRING COMMENT '创建时间',
  `update_date_time` STRING COMMENT '最后更新时间',
  `deleted` int  COMMENT '是否被删除(禁用)',
  `name` STRING COMMENT '姓名',
  `idcard` STRING  COMMENT '身份证号',
  `birth_year` int COMMENT '出生年份',
  `gender` STRING COMMENT '性别',
  `phone` STRING COMMENT '手机号',
  `wechat` STRING COMMENT '微信',
  `qq` STRING COMMENT 'qq号',
  `email` STRING COMMENT '邮箱',
  `area` STRING COMMENT '所在区域',
  `leave_school_date` date COMMENT '离校时间',
  `graduation_date` date COMMENT '毕业时间',
  `bxg_student_id` STRING COMMENT '博学谷学员ID,可能未关联到,不存在',
  `creator` int COMMENT '创建人ID',
  `origin_type` STRING COMMENT '数据来源',
  `origin_channel` STRING COMMENT '来源渠道',
  `tenant` int,
  `md_id` int COMMENT '中台id')
comment '客户表'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- 学科表
CREATE TABLE IF NOT EXISTS itcast_dimen.`itcast_subject` (
  `id` int COMMENT '自增主键',
  `create_date_time` timestamp COMMENT '创建时间',
  `update_date_time` timestamp COMMENT '最后更新时间',
  `deleted` STRING COMMENT '是否被删除(禁用)',
  `name` STRING COMMENT '学科名称',
  `code` STRING COMMENT '学科编码',
  `tenant` int COMMENT '租户')
comment '学科字典表'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- 员工表
CREATE TABLE IF NOT EXISTS itcast_dimen.employee (
  id int COMMENT '员工id',
  email STRING COMMENT '公司邮箱,OA登录账号',
  real_name STRING COMMENT '员工的真实姓名',
  phone STRING COMMENT '手机号,目前还没有使用;隐私问题OA接口没有提供这个属性,',
  department_id STRING COMMENT 'OA中的部门编号,有负值',
  department_name STRING COMMENT 'OA中的部门名',
  remote_login STRING COMMENT '员工是否可以远程登录',
  job_number STRING COMMENT '员工工号',
  cross_school STRING COMMENT '是否有跨校区权限',
  last_login_date STRING COMMENT '最后登录日期',
  creator int COMMENT '创建人',
  create_date_time STRING COMMENT '创建时间',
  update_date_time STRING COMMENT '最后更新时间',
  deleted STRING COMMENT '是否被删除(禁用)',
  scrm_department_id int COMMENT 'SCRM内部部门id',
  leave_office STRING COMMENT '离职状态',
  leave_office_time STRING COMMENT '离职时间',
  reinstated_time STRING COMMENT '复职时间',
  superior_leaders_id int COMMENT '上级领导ID',
  tdepart_id int COMMENT '直属部门',
  tenant int COMMENT '租户',
  ems_user_name STRING COMMENT 'ems用户名称'
)
comment '员工表'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- 部门表
CREATE TABLE IF NOT EXISTS itcast_dimen.`scrm_department` (
  `id` int COMMENT '部门id',
  `name` STRING COMMENT '部门名称',
  `parent_id` int COMMENT '父部门id',
  `create_date_time` STRING COMMENT '创建时间',
  `update_date_time` STRING COMMENT '更新时间',
  `deleted` STRING COMMENT '删除标志',
  `id_path` STRING COMMENT '编码全路径',
  `tdepart_code` int COMMENT '直属部门',
  `creator` STRING COMMENT '创建者',
  `depart_level` int COMMENT '部门层级',
  `depart_sign` int COMMENT '部门标志,暂时默认1',
  `depart_line` int COMMENT '业务线,存储业务线编码',
  `depart_sort` int COMMENT '排序字段',
  `disable_flag` int COMMENT '禁用标志',
  `tenant` int COMMENT '租户')
comment 'scrm部门表'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- 校区表
CREATE TABLE IF NOT EXISTS itcast_dimen.`itcast_school` (
  `id` int COMMENT '自增主键',
  `create_date_time` timestamp COMMENT '创建时间',
  `update_date_time` timestamp  COMMENT '最后更新时间',
  `deleted` STRING COMMENT '是否被删除(禁用)',
  `name` STRING COMMENT '校区名称',
  `code` STRING COMMENT '校区标识',
  `tenant` int COMMENT '租户')
comment '校区字典表'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

  • DWD 层:
CREATE TABLE IF NOT EXISTS itcast_dwd.`itcast_intention_dwd` (
  `rid` int COMMENT 'id',
  `customer_id` STRING COMMENT '客户id',
  `create_date_time` STRING COMMENT '创建时间',
  `itcast_school_id` STRING COMMENT '校区id',
  `deleted` STRING COMMENT '是否被删除',
  `origin_type` STRING COMMENT '来源渠道',
  `itcast_subject_id` STRING COMMENT '学科id',
  `creator` int COMMENT '创建人',
  `hourinfo` STRING COMMENT '小时信息',
  `origin_type_stat` STRING COMMENT '数据来源:0.线下;1.线上'
)
comment '客户意向dwd表'
PARTITIONED BY(yearinfo STRING,monthinfo STRING,dayinfo STRING)
clustered by(rid) sorted by(rid) into 10 buckets
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'
stored as ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');
  • DWM 层:
create database itcast_dwm;

CREATE TABLE IF NOT EXISTS itcast_dwm.`itcast_intention_dwm` (
  `customer_id` STRING COMMENT 'id信息',
  `create_date_time` STRING COMMENT '创建时间',
  `area` STRING COMMENT '区域信息',
  `itcast_school_id` STRING COMMENT '校区id',
  `itcast_school_name` STRING COMMENT '校区名称',
  `deleted` STRING COMMENT '是否被删除',
  `origin_type` STRING COMMENT '来源渠道',
  `itcast_subject_id` STRING COMMENT '学科id',
  `itcast_subject_name` STRING COMMENT '学科名称',
  `hourinfo` STRING COMMENT '小时信息',
  `origin_type_stat` STRING COMMENT '数据来源:0.线下;1.线上',
  `clue_state_stat` STRING COMMENT '新老客户:0.老客户;1.新客户',
  `tdepart_id` STRING COMMENT '创建者部门id',
  `tdepart_name` STRING COMMENT '咨询中心名称'
)
comment '客户意向dwm表'
PARTITIONED BY(yearinfo STRING,monthinfo STRING,dayinfo STRING)
clustered by(customer_id) sorted by(customer_id) into 10 buckets
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');
  • DWS 层:
CREATE TABLE IF NOT EXISTS itcast_dws.itcast_intention_dws (
   `customer_total` INT COMMENT '聚合意向客户数',
   `area` STRING COMMENT '区域信息',
   `itcast_school_id` STRING COMMENT '校区id',
   `itcast_school_name` STRING COMMENT '校区名称',
   `origin_type` STRING COMMENT '来源渠道',
   `itcast_subject_id` STRING COMMENT '学科id',
   `itcast_subject_name` STRING COMMENT '学科名称',
   `hourinfo` STRING COMMENT '小时信息',
   `origin_type_stat` STRING COMMENT '数据来源:0.线下;1.线上',
   `clue_state_stat` STRING COMMENT '客户属性:0.老客户;1.新客户',
   `tdepart_id` STRING COMMENT '创建者部门id',
   `tdepart_name` STRING COMMENT '咨询中心名称',
   `time_str` STRING COMMENT '时间明细',
   `groupType` STRING COMMENT '产品属性类别:1.总意向量;2.区域信息;3.校区、学科组合分组;4.来源渠道;5.咨询中心;',
   `time_type` STRING COMMENT '时间维度:1、按小时聚合;2、按天聚合;3、按周聚合;4、按月聚合;5、按年聚合;'
)
comment '客户意向dws表'
PARTITIONED BY(yearinfo STRING,monthinfo STRING,dayinfo STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='SNAPPY');

1.6 数据采集

利用 Sqoop 将数据从MySQL中转移到hive中的ODS层和DIM层

  • ODS 层:
    • 由于意向表和线索表是分桶表,不能直接操作。所以要创建临时表
# 客户意向表
## 1.创建临时表(非分桶表)
## 2.编写sqoop 将数据导入临时表
## 3.利用 insert into + select 导入目标表
-- 第一步: 创建一张客户意向表的临时表
CREATE TABLE IF NOT EXISTS itcast_ods.`customer_relationship_temp` (
  `id` int COMMENT '客户关系id',
  `create_date_time` STRING COMMENT '创建时间',
  `update_date_time` STRING COMMENT '最后更新时间',
  `deleted` int COMMENT '是否被删除(禁用)',
  `customer_id` int COMMENT '所属客户id',
  `first_id` int COMMENT '第一条客户关系id',
  `belonger` int COMMENT '归属人',
  `belonger_name` STRING COMMENT '归属人姓名',
  `initial_belonger` int COMMENT '初始归属人',
  `distribution_handler` int COMMENT '分配处理人',
  `business_scrm_department_id` int COMMENT '归属部门',
  `last_visit_time` STRING COMMENT '最后回访时间',
  `next_visit_time` STRING COMMENT '下次回访时间',
  `origin_type` STRING COMMENT '数据来源',
  `itcast_school_id` int COMMENT '校区Id',
  `itcast_subject_id` int COMMENT '学科Id',
  `intention_study_type` STRING COMMENT '意向学习方式',
  `anticipat_signup_date` STRING COMMENT '预计报名时间',
  `level` STRING COMMENT '客户级别',
  `creator` int COMMENT '创建人',
  `current_creator` int COMMENT '当前创建人:初始==创建人,当在公海拉回时为 拉回人',
  `creator_name` STRING COMMENT '创建者姓名',
  `origin_channel` STRING COMMENT '来源渠道',
  `comment` STRING COMMENT '备注',
  `first_customer_clue_id` int COMMENT '第一条线索id',
  `last_customer_clue_id` int COMMENT '最后一条线索id',
  `process_state` STRING COMMENT '处理状态',
  `process_time` STRING COMMENT '处理状态变动时间',
  `payment_state` STRING COMMENT '支付状态',
  `payment_time` STRING COMMENT '支付状态变动时间',
  `signup_state` STRING COMMENT '报名状态',
  `signup_time` STRING COMMENT '报名时间',
  `notice_state` STRING COMMENT '通知状态',
  `notice_time` STRING COMMENT '通知状态变动时间',
  `lock_state` STRING COMMENT '锁定状态',
  `lock_time` STRING COMMENT '锁定状态修改时间',
  `itcast_clazz_id` int COMMENT '所属ems班级id',
  `itcast_clazz_time` STRING COMMENT '报班时间',
  `payment_url` STRING COMMENT '付款链接',
  `payment_url_time` STRING COMMENT '支付链接生成时间',
  `ems_student_id` int COMMENT 'ems的学生id',
  `delete_reason` STRING COMMENT '删除原因',
  `deleter` int COMMENT '删除人',
  `deleter_name` STRING COMMENT '删除人姓名',
  `delete_time` STRING COMMENT '删除时间',
  `course_id` int COMMENT '课程ID',
  `course_name` STRING COMMENT '课程名称',
  `delete_comment` STRING COMMENT '删除原因说明',
  `close_state` STRING COMMENT '关闭装填',
  `close_time` STRING COMMENT '关闭状态变动时间',
  `appeal_id` int COMMENT '申诉id',
  `tenant` int COMMENT '租户',
  `total_fee` DECIMAL COMMENT '报名费总金额',
  `belonged` int COMMENT '小周期归属人',
  `belonged_time` STRING COMMENT '归属时间',
  `belonger_time` STRING COMMENT '归属时间',
  `transfer` int COMMENT '转移人',
  `transfer_time` STRING COMMENT '转移时间',
  `follow_type` int COMMENT '分配类型,0-自动分配,1-手动分配,2-自动转移,3-手动单个转移,4-手动批量转移,5-公海领取',
  `transfer_bxg_oa_account` STRING COMMENT '转移到博学谷归属人OA账号',
  `transfer_bxg_belonger_name` STRING COMMENT '转移到博学谷归属人OA姓名',
  `end_time` STRING COMMENT '有效截止时间')
comment '客户关系表'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='ZLIB');

-- 第二步: 编写sqoop命令 将数据导入到临时表
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'SELECT 
  *, "9999-12-31" as end_time , "2021-09-27" AS start_time
FROM customer_relationship where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table customer_relationship_temp \
-m 1

-- 第三步: 执行 insert into + select 导入到目标表
-- 动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- 写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;

insert into table itcast_ods.customer_relationship partition(start_time)
select * from itcast_ods.customer_relationship_temp;
# 客户线索表
-- 第一步: 创建客户线索表的临时表
CREATE TABLE IF NOT EXISTS itcast_ods.customer_clue_temp (
  id int COMMENT 'customer_clue_id',
  create_date_time STRING COMMENT '创建时间',
  update_date_time STRING COMMENT '最后更新时间',
  deleted STRING COMMENT '是否被删除(禁用)',
  customer_id int COMMENT '客户id',
  customer_relationship_id int COMMENT '客户关系id',
  session_id STRING COMMENT '七陌会话id',
  sid STRING COMMENT '访客id',
  status STRING COMMENT '状态(undeal待领取 deal 已领取 finish 已关闭 changePeer 已流转)',
  users STRING COMMENT '所属坐席',
  create_time STRING COMMENT '七陌创建时间',
  platform STRING COMMENT '平台来源 (pc-网站咨询|wap-wap咨询|sdk-app咨询|weixin-微信咨询)',
  s_name STRING COMMENT '用户名称',
  seo_source STRING COMMENT '搜索来源',
  seo_keywords STRING COMMENT '关键字',
  ip STRING COMMENT 'IP地址',
  referrer STRING COMMENT '上级来源页面',
  from_url STRING COMMENT '会话来源页面',
  landing_page_url STRING COMMENT '访客着陆页面',
  url_title STRING COMMENT '咨询页面title',
  to_peer STRING COMMENT '所属技能组',
  manual_time STRING COMMENT '人工开始时间',
  begin_time STRING COMMENT '坐席领取时间 ',
  reply_msg_count int COMMENT '客服回复消息数',
  total_msg_count int COMMENT '消息总数',
  msg_count int COMMENT '客户发送消息数',
  comment STRING COMMENT '备注',
  finish_reason STRING COMMENT '结束类型',
  finish_user STRING COMMENT '结束坐席',
  end_time STRING COMMENT '会话结束时间',
  platform_description STRING COMMENT '客户平台信息',
  browser_name STRING COMMENT '浏览器名称',
  os_info STRING COMMENT '系统名称',
  area STRING COMMENT '区域',
  country STRING COMMENT '所在国家',
  province STRING COMMENT '',
  city STRING COMMENT '城市',
  creator int COMMENT '创建人',
  name STRING COMMENT '客户姓名',
  idcard STRING COMMENT '身份证号',
  phone STRING COMMENT '手机号',
  itcast_school_id int COMMENT '校区Id',
  itcast_school STRING COMMENT '校区',
  itcast_subject_id int COMMENT '学科Id',
  itcast_subject STRING COMMENT '学科',
  wechat STRING COMMENT '微信',
  qq STRING COMMENT 'qq号',
  email STRING COMMENT '邮箱',
  gender STRING COMMENT '性别',
  level STRING COMMENT '客户级别',
  origin_type STRING COMMENT '数据来源渠道',
  information_way STRING COMMENT '资讯方式',
  working_years STRING COMMENT '开始工作时间',
  technical_directions STRING COMMENT '技术方向',
  customer_state STRING COMMENT '当前客户状态',
  valid STRING COMMENT '该线索是否是网资有效线索',
  anticipat_signup_date STRING COMMENT '预计报名时间',
  clue_state STRING COMMENT '线索状态',
  scrm_department_id int COMMENT 'SCRM内部部门id',
  superior_url STRING COMMENT '诸葛获取上级页面URL',
  superior_source STRING COMMENT '诸葛获取上级页面URL标题',
  landing_url STRING COMMENT '诸葛获取着陆页面URL',
  landing_source STRING COMMENT '诸葛获取着陆页面URL来源',
  info_url STRING COMMENT '诸葛获取留咨页URL',
  info_source STRING COMMENT '诸葛获取留咨页URL标题',
  origin_channel STRING COMMENT '投放渠道',
  course_id int COMMENT '课程编号',
  course_name STRING COMMENT '课程名称',
  zhuge_session_id STRING COMMENT 'zhuge会话id',
  is_repeat int COMMENT '是否重复线索(手机号维度) 0:正常 1:重复',
  tenant int COMMENT '租户id',
  activity_id STRING COMMENT '活动id',
  activity_name STRING COMMENT '活动名称',
  follow_type int COMMENT '分配类型,0-自动分配,1-手动分配,2-自动转移,3-手动单个转移,4-手动批量转移,5-公海领取',
  shunt_mode_id int COMMENT '匹配到的技能组id',
  shunt_employee_group_id int COMMENT '所属分流员工组',
  ends_time STRING COMMENT '有效时间')
comment '客户关系表'
PARTITIONED BY(starts_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
TBLPROPERTIES ('orc.compress'='ZLIB');

-- 第二步: 编写sqoop命令 将数据导入到临时表
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'SELECT 
  id,create_date_time,update_date_time,deleted,customer_id,customer_relationship_id,session_id,sid,status,user as users,create_time,platform,s_name,seo_source,seo_keywords,ip,referrer,from_url,landing_page_url,url_title,to_peer,manual_time,begin_time,reply_msg_count,total_msg_count,msg_count,comment,finish_reason,finish_user,end_time,platform_description,browser_name,os_info,area,country,province,city,creator,name,"-1" as idcard,"-1" as phone,itcast_school_id,itcast_school,itcast_subject_id,itcast_subject,"-1" as wechat,"-1" as qq,"-1" as email,gender,level,origin_type,information_way,working_years,technical_directions,customer_state,valid,anticipat_signup_date,clue_state,scrm_department_id,superior_url,superior_source,landing_url,landing_source,info_url,info_source,origin_channel,course_id,course_name,zhuge_session_id,is_repeat,tenant,activity_id,activity_name,follow_type,shunt_mode_id,shunt_employee_group_id, "9999-12-31" as ends_time , "2021-09-27" AS starts_time
FROM customer_clue where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table customer_clue_temp \
-m 1

-- 第三步:  通过 insert into + select 导入到目标表
insert into table itcast_ods.customer_clue partition(starts_time)
select * from itcast_ods.customer_clue_temp;
  • DIM 层:
# 客户表
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'SELECT 
  *, "2021-09-27" AS start_time
FROM customer where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table customer \
-m 1 

# 学科表:
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'SELECT 
  *, "2021-09-27" AS start_time
FROM itcast_subject where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table itcast_subject \
-m 1 

# 校区表:
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'SELECT 
  *, "2021-09-27" AS start_time
FROM itcast_school where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table itcast_school \
-m 1 

# 员工表
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'SELECT 
  *, "2021-09-27" AS start_time
FROM employee where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table employee \
-m 1

# 部门表
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/scrm \
--username root \
--password 123456 \
--query 'SELECT 
  *, "2021-09-27" AS start_time
FROM scrm_department where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table scrm_department \
-m 1

1.7 数据清洗转换

  • DWD 层:清洗转换、少量维度退化(数据过多,可以利用采样进行抽取)
--分区
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
--分桶
set hive.enforce.bucketing=true; -- 开启分桶支持, 默认就是true
set hive.enforce.sorting=true; -- 开启强制排序

insert  into table itcast_dwd.itcast_intention_dwd partition(yearinfo,monthinfo,dayinfo)
select  
    id as rid,
    customer_id,
    create_date_time,
    if(itcast_school_id is null OR itcast_school_id = 0 , '-1',itcast_school_id) as itcast_school_id, 
    deleted,
    origin_type,
    if(itcast_subject_id is not null, if(itcast_subject_id != 0,itcast_subject_id,'-1'),'-1') as itcast_subject_id, 
    creator,
    substr(create_date_time,12,2) as hourinfo, 
    if(origin_type in('NETSERVICE','PRESIGNUP'),'1','0') as origin_type_stat,
    substr(create_date_time,1,4) as yearinfo,
    substr(create_date_time,6,2) as monthinfo,
    substr(create_date_time,9,2) as dayinfo 

from itcast_ods.customer_relationship tablesample(bucket 5 out of 10 on id) where deleted = 0 ;
  • DWM 层:维度退化
--分区
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
--分桶
set hive.enforce.bucketing=true; -- 开启分桶支持, 默认就是true
set hive.enforce.sorting=true; -- 开启强制排序

-- 优化: 
set hive.auto.convert.join=false;  -- map join
set hive.optimize.bucketmapjoin = false; -- 开启 bucket map join
-- 开启SMB map join
set hive.auto.convert.sortmerge.join=false;
set hive.auto.convert.sortmerge.join.noconditionaltask=false;
-- 写入数据强制排序
set hive.enforce.sorting=false;
-- 开启自动尝试SMB连接
set hive.optimize.bucketmapjoin.sortedmerge = false; 

insert into table itcast_dwm.itcast_intention_dwm partition(yearinfo,monthinfo,dayinfo)
select  
    iid.customer_id,
    iid.create_date_time,
    c.area,
    iid.itcast_school_id,
    sch.name as itcast_school_name,
    iid.deleted,
    iid.origin_type,
    iid.itcast_subject_id,
    sub.name as itcast_subject_name,
    iid.hourinfo,
    iid.origin_type_stat,
    -- if(cc.clue_state = 'VALID_NEW_CLUES',1,if(cc.clue_state = 'VALID_PUBLIC_NEW_CLUE','0','-1')) as clue_state_stat, -- 此处有转换
    case cc.clue_state 
        when 'VALID_NEW_CLUES' then '1'
        when 'VALID_PUBLIC_NEW_CLUE' then '0'
        else '-1' 
    end as clue_state_stat,
    emp.tdepart_id,
    dept.name as tdepart_name,
    iid.yearinfo,
    iid.monthinfo,
    iid.dayinfo
from itcast_dwd.itcast_intention_dwd  iid
    left join itcast_ods.customer_clue cc on cc.customer_relationship_id = iid.rid
    left join itcast_dimen.customer c on  iid.customer_id = c.id
    left join itcast_dimen.itcast_subject sub on  iid.itcast_subject_id = sub.id
    left join itcast_dimen.itcast_school sch  on iid.itcast_school_id = sch.id
    left join itcast_dimen.employee emp on iid.creator = emp.id
    left join itcast_dimen.scrm_department dept on emp.tdepart_id = dept.id;
    

1.8 数据分析

  • 利用DWM层数据进行分析,转至DWS层
指标:	意向量
维度:
	固有维度:	时间维度、线上线下、新老维度
	产品属性维度:	地区、学科、校区、咨询中心、来源渠道、总意向量
  • 统计总意向量
-- 统计每年 线上线下 新老用户的总意向量
insert into table itcast_dws.itcast_intention_dws partition(yearinfo,monthinfo,dayinfo)
select 
	count( distinct customer_id) as customer_total, 
	'-1' as area,
	'-1' as itcast_school_id,
	'-1' as itcast_school_name,
	'-1' as origin_type,
	'-1' as itcast_subject_id,
	'-1' as itcast_subject_name,
	'-1' as hourinfo,
	origin_type_stat,
	clue_state_stat,
	'-1' as tdepart_id,
	'-1' as tdepart_name,
	yearinfo as time_str,
	'1' as grouptype,
	'5' as time_type,
	yearinfo,
	'-1' as monthinfo,
	'-1' as dayinfo
from itcast_dwm.itcast_intention_dwm
group by yearinfo,origin_type_stat,clue_state_stat;

-- 统计每年每月 线上线下 新老用户的总意向量
insert into table itcast_dws.itcast_intention_dws partition(yearinfo,monthinfo,dayinfo)
select 
	count( distinct customer_id) as customer_total, 
	'-1' as area,
	'-1' as itcast_school_id,
	'-1' as itcast_school_name,
	'-1' as origin_type,
	'-1' as itcast_subject_id,
	'-1' as itcast_subject_name,
	'-1' as hourinfo,
	origin_type_stat,
	clue_state_stat,
	'-1' as tdepart_id,
	'-1' as tdepart_name,
	concat(yearinfo,'-',monthinfo) as time_str,
	'1' as grouptype,
	'4' as time_type,
	yearinfo,
	monthinfo,
	'-1' as dayinfo
from itcast_dwm.itcast_intention_dwm
group by yearinfo,monthinfo,origin_type_stat,clue_state_stat;
-- 统计每年每月每日 线上线下 新老用户的总意向量
insert into table itcast_dws.itcast_intention_dws partition(yearinfo,monthinfo,dayinfo)
select 
	count( distinct customer_id) as customer_total, 
	'-1' as area,
	'-1' as itcast_school_id,
	'-1' as itcast_school_name,
	'-1' as origin_type,
	'-1' as itcast_subject_id,
	'-1' as itcast_subject_name,
	'-1' as hourinfo,
	origin_type_stat,
	clue_state_stat,
	'-1' as tdepart_id,
	'-1' as tdepart_name,
	concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
	'1' as grouptype,
	'2' as time_type,
	yearinfo,
	monthinfo,
	dayinfo
from itcast_dwm.itcast_intention_dwm
group by yearinfo,monthinfo,dayinfo,origin_type_stat,clue_state_stat;
-- 统计每年每月每日每小时 线上线下 新老用户的总意向量
insert into table itcast_dws.itcast_intention_dws partition(yearinfo,monthinfo,dayinfo)
select 
	count( distinct customer_id) as customer_total, 
	'-1' as area,
	'-1' as itcast_school_id,
	'-1' as itcast_school_name,
	'-1' as origin_type,
	'-1' as itcast_subject_id,
	'-1' as itcast_subject_name,
	hourinfo,
	origin_type_stat,
	clue_state_stat,
	'-1' as tdepart_id,
	'-1' as tdepart_name,
	concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
	'1' as grouptype,
	'1' as time_type,
	yearinfo,
	monthinfo,
	dayinfo
from itcast_dwm.itcast_intention_dwm
group by yearinfo,monthinfo,dayinfo,hourinfo,origin_type_stat,clue_state_stat;
  • 统计咨询中心维度
-- 统计每年线上线下, 新老用户产生各个咨询中心的意向量
insert into table itcast_dws.itcast_intention_dws partition(yearinfo,monthinfo,dayinfo)
select 
	count( distinct customer_id) as customer_total, 
	'-1' as area,
	'-1' as itcast_school_id,
	'-1' as itcast_school_name,
	'-1' as origin_type,
	'-1' as itcast_subject_id,
	'-1' as itcast_subject_name,
	'-1' as hourinfo,
	origin_type_stat,
	clue_state_stat,
	tdepart_id,
	tdepart_name,
	yearinfo as time_str,
	'5' as grouptype,
	'5' as time_type,
	yearinfo,
	'-1' as monthinfo,
	'-1' as dayinfo
from itcast_dwm.itcast_intention_dwm
group by yearinfo,origin_type_stat,clue_state_stat,tdepart_id,tdepart_name;

1.9 数据导出

  • 利用sqoop将DWS层分析好的数据导入至MySQL中

    • 在MySQL中建表
    CREATE TABLE IF NOT EXISTS scrm_bi.itcast_intention (
       `customer_total` INT COMMENT '聚合意向客户数',
       `area` varchar(100) COMMENT '区域信息',
       `itcast_school_id` varchar(100) COMMENT '校区id',
       `itcast_school_name` varchar(100) COMMENT '校区名称',
       `origin_type` varchar(100) COMMENT '来源渠道',
       `itcast_subject_id` varchar(100) COMMENT '学科id',
       `itcast_subject_name` varchar(100) COMMENT '学科名称',
       `hourinfo` varchar(100) COMMENT '小时信息',
       `origin_type_stat` varchar(100) COMMENT '数据来源:0.线下;1.线上',
       `clue_state_stat` varchar(100) COMMENT '客户属性:0.老客户;1.新客户',
       `tdepart_id` varchar(100) COMMENT '创建者部门id',
       `tdepart_name` varchar(100) COMMENT '咨询中心名称',
       `time_str` varchar(100) COMMENT '时间明细',
       `groupType` varchar(100) COMMENT '产品属性类别:1.总意向量;2.区域信息;3.校区、学科组合分组;4.来源渠道;5.咨询中心;',
       `time_type` varchar(100) COMMENT '时间维度:1、按小时聚合;2、按天聚合;3、按周聚合;4、按月聚合;5、按年聚合;',
        yearinfo varchar(100) COMMENT '年' ,
        monthinfo varchar(100) COMMENT '月',
        dayinfo varchar(100) COMMENT '日'
    )
    comment '客户意向dws表';
    
    • 将数据导入MySQL
    sqoop export \
    --connect "jdbc:mysql://192.168.52.150:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
    --username root \
    --password 123456 \
    --table itcast_intention \
    --hcatalog-database itcast_dws \
    --hcatalog-table itcast_intention_dws \
    -m 1
    

2. 意向用户主题看板__增量流程

2.1 模拟数据(工作环境无)

  • 利用现有数据,通过修改模拟新增数据

2.2 数据采集

  • 利用sqoop将MySQL中的新增数据导入至hive中
    • 由于是增量数据,并且该主题下的表是拉链表,所以在抽取数据时需注意:1.start time(表示数据开始的时间) 2.end time(表示该条记录失效时间)
    • 建立update表对增量数据进行存储,之后再汇总至temp表
    • 每次使用update表均需重建,避免数据重复导致问题

2.3 数据清洗转换

  • 在对增量数据进行清洗转化时,与全量过程是一致的,只需要确保数据是有效的,即添加 where条件:end_time=‘9999-12-31’

2.4 数据分析

在增量分析中,以天为例,
对年统计当年结果数据;对季度统计当年当季度结果数据;对月统计当年当季度当月结果数据 存在影响;
对当天和小时的统计对历史数据无影响
解决方案:通过删除分区的方案来解决

例如:2023-01-11 的数据所在分区为:yearinfo=‘2023’ and quarterinfo=‘1’ and monthinfo=‘01’ and dayinfo=‘11’

执行删除:alter table 表名 drop partition("分区");

2.5 数据导出

  • 在数据导出操作中,也需要将MySQL中之前的当年、当季度、当月的结果数据删除,重新导入操作
  • 我们可以将当年的统计结果数据全部删除,然后全部重新导入所有数据
所有的增量流程均可写为shell脚本,实现自动化运行
  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值