在线教育(5)
在线教育(五)
1.学生出勤主题看板__全量流程
1.1 需求分析
- 需求一:统计指定时间段内,不同班级的出勤人数。打卡时间在上课前40分钟(否则认为无效)~上课时间点之内,且未早退,则为正常上课打卡。可以下钻到具体学生的出勤数据。跨天数据直接累加。
涉及维度: 时间维度、班级维度、学生维度
涉及指标: 出勤人数
课前40分钟——上课时间、无早退——>正常上课打卡
涉及 表:
涉及字段:
- 需求二:统计指定时间段内,不同班级的学生出勤率。可以下钻到具体学生的出勤数据。出勤率=出勤人数/当日在读学员人数
涉及维度: 时间维度、班级维度、学生维度
涉及指标: 当日在读人数、出勤人数
涉及 表:
涉及字段:
- 需求三:统计指定时间段内,不同班级的迟到人数。上课10分钟后视为迟到。可以下钻到具体学生的迟到数据。跨天数据直接累加
涉及维度: 时间维度、班级维度、学生维度
涉及指标: 迟到人数
上课后10分钟————>迟到
涉及 表:
涉及字段:
- 需求四:统计指定时间段内,不同班级的学生迟到率。上课10分钟后视为迟到。可以下钻到具体学生的迟到数据。迟到率=迟到人数/当日在读学员人数
涉及维度: 时间维度、班级维度、学生维度
涉及指标: 迟到人数、当日在读人数
涉及 表: course_table_uppload_detail(班级课表)、tbh_student_signin_record(学生打卡记录表)、tbh_class_time_table(班级作息时间表)
连接条件:
班级课表.class_id=学生打卡记录表.class_id
班级作息时间表.id=学生打卡记录表.time_table_id
涉及字段:
时间维度: 班级课表.class_date
班级维度: 班级课表.class_id
学生维度: 学生打卡记录表.student_id
指标字段: 学生打卡记录表.signin_time
计算方法:
判断打卡时间再 (课前40,上课10分钟) ===> 正常出勤--0
(上课10分钟,下课)====> 迟到--1
否则====> 缺勤--2
判断时间字段:
morning_begin_time
morning_end_time
afternoon_begin_time
afternoon_end_time
evening_begin_time
evening_end_time
过滤:
1. ifnull(ctud.content, '') != ''
==> 在hive中,nvl(ctud.content,'') != ''
2.ctud.content != '开班典礼'
3.确保打卡表中学生数据都是开启公屏
学生打卡表.share_state=1
- 需求五:统计指定时间段内,不同班级的请假人数。跨天数据直接累加
涉及维度: 时间维度、班级维度、学生维度
涉及指标: 请假人数
涉及 表: student_leave_apply(学生请假申请表)、tbh_class_time_table(班级作息时间表)、course_table_upload_detail(班级课表)
连接条件: 请假申请表.class_id=作息时间表.class_id=班级课表.id
涉及字段:
时间维度: 课程表.class_date
班级维度: 课程表.class_id
指标字段: 请假表.student_id
需要进行去重统计操作
需过滤:
1. ifnull(ctud.content, '') != ''
==> 在hive中,nvl(ctud.content,'') != ''
2.ctud.content != '开班典礼'
3.ctud.class_date 在 tctt.use_begin_date 和 tctt.use_end_date之间.
4. sla.audit_state=1 ,sla.cancel_state=0 , sla.valid_state=1
5. 请假日期+早上上课时间 >= sla.begin_time and 请假日期+早上下课时间 > sla.end_time
- 需求六:统计指定时间段内,不同班级的学生请假率。可以下钻到具体学生的请假数据。请假率=请假人数/当日在读学员人数
涉及维度: 时间维度、班级维度、学生维度
涉及指标: 请假人数、当日在读人数
涉及 表:
涉及字段:
- 需求七:统计指定时间段内,不同班级的旷课人数。跨天数据直接累加。旷课人数=当日在读学员人数-出勤人数-请假人数。
涉及维度: 时间维度、班级维度
涉及指标: 当日在读人数、出勤人数、请假人数
计算方法: 旷课人数=当日在读学员人数-出勤人数-请假人数。
涉及 表:
涉及字段:
- 需求八:统计指定时间段内,不同班级的学生旷课率。旷课率=旷课人数/当日在读学员人数
涉及维度: 时间维度、班级维度、学生维度
涉及指标: 当日在读人数、出勤人数、请假人数
计算方法: 旷课率=旷课人数/当日在读学员人数
涉及 表:
涉及字段:
1.2 分析总结
-- 第一类:出勤人数、出勤率、迟到人数、迟到率
涉及维度:
固有维度:
时间维度:年、月、日、上午、下午、晚自习
班级维度:
学生维度:
涉及 表:
course_table_uppload_detail(班级课表)
tbh_student_signin_record(学生打卡记录表)--事实表
tbh_class_time_table(班级作息时间表)
连接条件:
班级课表.class_id=学生打卡记录表.class_id
班级作息时间表.id=学生打卡记录表.time_table_id
涉及字段:
时间维度: 班级课表.class_date
班级维度: 班级课表.class_id
学生维度: 学生打卡记录表.student_id
指标字段: 学生打卡记录表.signin_time
计算方法:
判断打卡时间再 (课前40,上课10分钟) ===> 正常出勤--0
(上课10分钟,下课)====> 迟到--1
否则====> 缺勤--2
判断时间字段:
morning_begin_time
morning_end_time
afternoon_begin_time
afternoon_end_time
evening_begin_time
evening_end_time
过滤:
1. ifnull(ctud.content, '') != ''
==> 在hive中,nvl(ctud.content,'') != ''
2.ctud.content != '开班典礼'
3.确保打卡表中学生数据都是开启公屏
学生打卡表.share_state=1
-- 第二类:请假人数、请假率
涉及维度:
固有维度:
时间维度:年、月、日、上午、下午、晚自习
班级维度:
学生维度:
涉及 表:
student_leave_apply(学生请假申请表)、
tbh_class_time_table(班级作息时间表)、
course_table_upload_detail(班级课表)
连接条件: 请假申请表.class_id=作息时间表.class_id=班级课表.id
涉及字段:
需过滤:
1. ifnull(ctud.content, '') != ''
==> 在hive中,nvl(ctud.content,'') != ''
2.ctud.content != '开班典礼'
3.ctud.class_date 在 tctt.use_begin_date 和 tctt.use_end_date之间.
4. sla.audit_state=1 ,sla.cancel_state=0 , sla.valid_state=1
5. 请假日期+早上上课时间 >= sla.begin_time and 请假日期+早上下课时间 > sla.end_time
-- 第三类: 旷课人数、旷课率
涉及维度: 时间维度、班级维度
涉及指标: 当日在读人数、出勤人数、请假人数
计算方法: 旷课人数=当日在读学员人数-出勤人数-请假人数。
1.3 业务数据准备
-
在MySQL中创建业务数据库
-
将数据导入至新建业务库中
-
表: student_leave_apply(学生请假申请表)--事实表 tbh_class_time_table(班级作息时间表)、 course_table_uppload_detail(班级课表) tbh_student_signin_record(学生打卡记录表)--事实表 class_studying_student_count(当日再度人数表)
1.4 建模分析
- ODS 层
作用: 对接源数据,与数据源保持相同粒度。存储事实表和少量维度表;
存储表:
tbh_student_signin_record(学生打卡记录表)、
student_leave_apply(学生请假申请表)
建表方案:字段与源数据保持一致,+ 抽取时间字段(start_time)
- DIM 层
作用: 存储维度表,且与数据源保持相同粒度
存储表:
tbh_class_time_table(班级作息时间表)
course_table_uppload_detail(班级课表)
class_studying_student_count(当日再度人数表)
建表方案:字段与源数据保持一致,+ 抽取时间字段(start_time)
- DWD 层
作用: 清洗转换,少量维度退化
清洗转换操作:
可做
1) 日期转换为 年 月 日
2) 过滤无效的请假数据
3) 过滤没有开启公屏的数据
目前假设没有这些操作, 所以此时DWD层 是不需要存在的
- DWM 层
作用: 维度退化、提前聚合
多个事实表,多个维度表的处理逻辑: 先分 在和
建表方案:
学生出勤状态表
作用: 用于统计每天每个班级, 每个学生的出勤状态(0(正常出勤) 1(迟到出勤) 2(没来))
表字段 :
yearinfo, monthinfo,dayinfo, 班级id, 学生id, 上午出勤, 下午出勤, 晚自习出勤
班级出勤人数表
作用: 用于统计每天每个班级的出勤人数(出勤人数, 迟到出勤人数)
表字段:
yearinfo, monthinfo,dayinfo, 班级id,上午出勤人数, 上午迟到人数, 下午出勤人数, 下午迟到人数, 晚上出勤人数, 晚上迟到人数
班级请假人数表
作用: 用于统计每天每个班级的请假人数
表字段:
yearinfo, monthinfo,dayinfo, 班级id, 上午请假人数, 下午请假人数, 晚上请假人数
班级旷课人数表
作用: 用于统计每天每个班级的旷课人数
表字段:
yearinfo, monthinfo,dayinfo, 班级id, 上午旷课人数, 下午旷课人数, 晚上旷课人数
班级指标汇总表 (提前聚合表)
作用: 用于将前几个表相关的指标数据汇总起来, 同时计算出 相关比率操作 (统计每天的上午 下午 晚自习)
表字段:
yearinfo, monthinfo,dayinfo, 班级id, 班级当日在读人数,
上午出勤人数, 上午出勤率, 下午出勤人数 下午出勤率, 晚上出勤人数, 晚上出勤率
上午迟到人数, 上午迟到率, 下午迟到人数, 下午迟到率, 晚上迟到人数, 晚上迟到率
上午请假人数, 上午请假率, 下午请假人数, 下午请假率, 晚上请假人数, 晚上请假率
上午旷课人数, 上午旷课率, 下午旷课人数, 下午旷课率, 晚上旷课人数, 晚上旷课率
- DWS 层:
作用: 细化维度统计操作
建表方案:
yearinfo, monthinfo,dayinfo, 班级id, 班级当日在读人数, time_str,time_type
上午出勤人数, 上午出勤率, 下午出勤人数 下午出勤率, 晚上出勤人数, 晚上出勤率
上午迟到人数, 上午迟到率, 下午迟到人数, 下午迟到率, 晚上迟到人数, 晚上迟到率
上午请假人数, 上午请假率, 下午请假人数, 下午请假率, 晚上请假人数, 晚上请假率
上午旷课人数, 上午旷课率, 下午旷课人数, 下午旷课率, 晚上旷课人数, 晚上旷课率
1.5 建模操作
- ODS 层:两张表(学生打卡记录表、学生请假表)
-- 学生打卡记录表:
CREATE TABLE IF NOT EXISTS itcast_ods.student_signin_ods (
id int,
normal_class_flag int comment '是否正课 1 正课 2 自习 3 休息',
time_table_id int comment '作息时间id normal_class_flag=2 关联tbh_school_time_table 或者 normal_class_flag=1 关联 tbh_class_time_table',
class_id int comment '班级id',
student_id int comment '学员id',
signin_time String comment '签到时间',
signin_date String comment '签到日期',
inner_flag int comment '内外网标志 0 外网 1 内网',
signin_type int comment '签到类型 1 心跳打卡 2 老师补卡 3 直播打卡',
share_state int comment '共享屏幕状态 0 否 1是 在上午或下午段有共屏记录,则该段所有记录该字段为1,内网默认为1 外网默认为0 (暂不用)',
inner_ip String comment '内网ip地址',
create_time String comment '创建时间')
comment '学生打卡记录表'
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY','orc.bloom.filter.columns'='time_table_id,class_id,signin_date,share_state');
-- 学生请假信息表:
CREATE TABLE IF NOT EXISTS itcast_ods.student_leave_apply_ods (
id int,
class_id int comment '班级id',
student_id int comment '学员id',
audit_state int comment '审核状态 0 待审核 1 通过 2 不通过',
audit_person int comment '审核人',
audit_time String comment '审核时间',
audit_remark String comment '审核备注',
leave_type int comment '请假类型 1 请假 2 销假 (查询是否请假不用过滤此类型,通过有效状态来判断)',
leave_reason int comment '请假原因 1 事假 2 病假',
begin_time String comment '请假开始时间',
begin_time_type int comment '1:上午 2:下午 3:晚自习',
end_time String comment '请假结束时间',
end_time_type int comment '1:上午 2:下午 3:晚自习',
days float comment '请假/已休天数',
cancel_state int comment '撤销状态 0 未撤销 1 已撤销',
cancel_time String comment '撤销时间',
old_leave_id int comment '原请假id,只有leave_type =2 销假的时候才有',
leave_remark String comment '请假/销假说明',
valid_state int comment '是否有效(0:无效 1:有效)',
create_time String comment '创建时间')
comment '学生请假申请表'
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY','orc.bloom.filter.columns'='class_id,audit_state,cancel_state,valid_state');
- DIM 层:三张表(班级作息时间表、班级课表、当日在读人数表)
-- 日期课程表
CREATE TABLE IF NOT EXISTS itcast_dimen.course_table_upload_detail_dimen
(
id int comment 'id',
base_id int comment '课程主表id',
class_id int comment '班级id',
class_date STRING comment '上课日期',
content STRING comment '课程内容',
teacher_id int comment '老师id',
teacher_name STRING comment '老师名字',
job_number STRING comment '工号',
classroom_id int comment '教室id',
classroom_name STRING comment '教室名称',
is_outline int comment '是否大纲 0 否 1 是',
class_mode int comment '上课模式 0 传统全天 1 AB上午 2 AB下午 3 线上直播',
is_stage_exam int comment '是否阶段考试(0:否 1:是)',
is_pay int comment '代课费(0:无 1:有)',
tutor_teacher_id int comment '晚自习辅导老师id',
tutor_teacher_name STRING comment '辅导老师姓名',
tutor_job_number STRING comment '晚自习辅导老师工号',
is_subsidy int comment '晚自习补贴(0:无 1:有)',
answer_teacher_id int comment '答疑老师id',
answer_teacher_name STRING comment '答疑老师姓名',
answer_job_number STRING comment '答疑老师工号',
remark STRING comment '备注',
create_time STRING comment '创建时间')
comment '班级课表明细表'
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY','orc.bloom.filter.columns'='class_id,class_date');
-- 班级作息时间表
CREATE TABLE IF NOT EXISTS itcast_dimen.class_time_dimen (
id int,
class_id int comment '班级id',
morning_template_id int comment '上午出勤模板id',
morning_begin_time STRING comment '上午开始时间',
morning_end_time STRING comment '上午结束时间',
afternoon_template_id int comment '下午出勤模板id',
afternoon_begin_time STRING comment '下午开始时间',
afternoon_end_time STRING comment '下午结束时间',
evening_template_id int comment '晚上出勤模板id',
evening_begin_time STRING comment '晚上开始时间',
evening_end_time STRING comment '晚上结束时间',
use_begin_date STRING comment '使用开始日期',
use_end_date STRING comment '使用结束日期',
create_time STRING comment '创建时间',
create_person int comment '创建人',
remark STRING comment '备注')
comment '班级作息时间表'
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY','orc.bloom.filter.columns'='id,class_id');
-- 当日在读人数表:
CREATE TABLE IF NOT EXISTS itcast_dimen.class_studying_student_count_dimen (
id int,
school_id int comment '校区id',
subject_id int comment '学科id',
class_id int comment '班级id',
studying_student_count int comment '在读班级人数',
studying_date STRING comment '在读日期')
comment '在读班级的每天在读学员人数'
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY','orc.bloom.filter.columns'='studying_student_count,studying_date');
- DWM层:
-- 学生出勤状态信息表
CREATE TABLE IF NOT EXISTS itcast_dwm.student_attendance_dwm (
dateinfo String comment '日期',
class_id int comment '班级id',
student_id int comment '学员id',
morning_att String comment '上午出勤情况:0.正常出勤、1.迟到、2.其他(请假+旷课)',
afternoon_att String comment '下午出勤情况:0.正常出勤、1.迟到、2.其他(请假+旷课)',
evening_att String comment '晚自习出勤情况:0.正常出勤、1.迟到、2.其他(请假+旷课)')
comment '学生出勤(正常出勤和迟到)数据'
PARTITIONED BY (yearinfo STRING, monthinfo STRING, dayinfo STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY');
-- 班级出勤人数表:
CREATE TABLE IF NOT EXISTS itcast_dwm.class_attendance_dwm (
dateinfo String comment '日期',
class_id int comment '班级id',
morning_att_count String comment '上午出勤人数',
afternoon_att_count String comment '下午出勤人数',
evening_att_count String comment '晚自习出勤人数',
morning_late_count String comment '上午迟到人数',
afternoon_late_count String comment '下午迟到人数',
evening_late_count String comment '晚自习迟到人数')
comment '学生出勤(正常出勤和迟到)数据'
PARTITIONED BY (yearinfo STRING, monthinfo STRING, dayinfo STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY');
-- 班级请假人数表
CREATE TABLE IF NOT EXISTS itcast_dwm.class_leave_dwm (
dateinfo String comment '日期',
class_id int comment '班级id',
morning_leave_count String comment '上午请假人数',
afternoon_leave_count String comment '下午请假人数',
evening_leave_count String comment '晚自习请假人数')
comment '班级请假数据统计'
PARTITIONED BY (yearinfo STRING, monthinfo STRING, dayinfo STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY');
-- 班级旷课人数表
CREATE TABLE IF NOT EXISTS itcast_dwm.class_truant_dwm (
dateinfo String comment '日期',
class_id int comment '班级id',
morning_truant_count String comment '上午旷课人数',
afternoon_truant_count String comment '下午旷课人数',
evening_truant_count String comment '晚自习旷课人数')
comment '班级请假数据统计'
PARTITIONED BY (yearinfo STRING, monthinfo STRING, dayinfo STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY');
-- 汇总表
CREATE TABLE IF NOT EXISTS itcast_dwm.class_all_dwm (
dateinfo String comment '日期',
class_id int comment '班级id',
studying_student_count int comment '在读班级人数',
morning_att_count String comment '上午出勤人数',
morning_att_ratio String comment '上午出勤率',
afternoon_att_count String comment '下午出勤人数',
afternoon_att_ratio String comment '下午出勤率',
evening_att_count String comment '晚自习出勤人数',
evening_att_ratio String comment '晚自习出勤率',
morning_late_count String comment '上午迟到人数',
morning_late_ratio String comment '上午迟到率',
afternoon_late_count String comment '下午迟到人数',
afternoon_late_ratio String comment '下午迟到率',
evening_late_count String comment '晚自习迟到人数',
evening_late_ratio String comment '晚自习迟到率',
morning_leave_count String comment '上午请假人数',
morning_leave_ratio String comment '上午请假率',
afternoon_leave_count String comment '下午请假人数',
afternoon_leave_ratio String comment '下午请假率',
evening_leave_count String comment '晚自习请假人数',
evening_leave_ratio String comment '晚自习请假率',
morning_truant_count String comment '上午旷课人数',
morning_truant_ratio String comment '上午旷课率',
afternoon_truant_count String comment '下午旷课人数',
afternoon_truant_ratio String comment '下午旷课率',
evening_truant_count String comment '晚自习旷课人数',
evening_truant_ratio String comment '晚自习旷课率')
comment '班级请假数据统计'
PARTITIONED BY (yearinfo STRING, monthinfo STRING, dayinfo STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY');
- DWS层:
CREATE TABLE IF NOT EXISTS itcast_dws.class_attendance_dws (
dateinfo String comment '日期',
class_id int comment '班级id',
studying_student_count int comment '在读班级人数',
morning_att_count String comment '上午出勤人数',
morning_att_ratio String comment '上午出勤率',
afternoon_att_count String comment '下午出勤人数',
afternoon_att_ratio String comment '下午出勤率',
evening_att_count String comment '晚自习出勤人数',
evening_att_ratio String comment '晚自习出勤率',
morning_late_count String comment '上午迟到人数',
morning_late_ratio String comment '上午迟到率',
afternoon_late_count String comment '下午迟到人数',
afternoon_late_ratio String comment '下午迟到率',
evening_late_count String comment '晚自习迟到人数',
evening_late_ratio String comment '晚自习迟到率',
morning_leave_count String comment '上午请假人数',
morning_leave_ratio String comment '上午请假率',
afternoon_leave_count String comment '下午请假人数',
afternoon_leave_ratio String comment '下午请假率',
evening_leave_count String comment '晚自习请假人数',
evening_leave_ratio String comment '晚自习请假率',
morning_truant_count String comment '上午旷课人数',
morning_truant_ratio String comment '上午旷课率',
afternoon_truant_count String comment '下午旷课人数',
afternoon_truant_ratio String comment '下午旷课率',
evening_truant_count String comment '晚自习旷课人数',
evening_truant_ratio String comment '晚自习旷课率',
time_type STRING COMMENT '聚合时间类型:1、按小时聚合;2、按天聚合;3、按周聚合;4、按月聚合;5、按年聚合。')
comment '班级请假数据统计'
PARTITIONED BY (yearinfo STRING, monthinfo STRING, dayinfo STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY');
1.6 数据采集
- 利用sqoop 将MySQL中的数据导入hive 的ODS 层和 DIM层
- ODS 层:
# 学生打卡记录表
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/teach \
--username root \
--password 123456 \
--query 'select *,"2021-10-07" as dt from tbh_student_signin_record where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table student_signin_ods \
-m 1
# 学生请假信息表
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/teach \
--username root \
--password 123456 \
--query 'select *,"2021-10-07" as dt from student_leave_apply where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table student_leave_apply_ods \
-m 1
- DIM层:
# 当日在读人数表
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/teach \
--username root \
--password 123456 \
--query 'select *,"2021-10-07" as dt from class_studying_student_count where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table class_studying_student_count_dimen \
-m 1
# 日期课程表
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/teach \
--username root \
--password 123456 \
--query 'select *,"2021-10-07" as dt from course_table_upload_detail where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table course_table_upload_detail_dimen \
-m 1
# 作息时间表
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/teach \
--username root \
--password 123456 \
--query 'select *,"2021-10-07" as dt from tbh_class_time_table where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table class_time_dimen \
-m 1
1.7 数据清洗转换
- 生成学生出勤状态信息表
涉及表:
course_table_upload_detail: 日志课程明细表 (课表) (维度表)
tbh_student_signin_record: 学生打卡记录表 (事实表)
tbh_class_time_table: 班级作息时间表 (维度表)
关联条件:
学生打卡表.class_id = 课程表.class_id
班级作息时间表.id = 学生打卡表.time_table_id
涉及到字段:
时间维度: 课程表.class_date
班级维度: 课程表.class_id
学生维度: 学生打卡表.student_id
指标字段: 学生打卡表.signin_time(打卡时间)
计算方案:
先判断是否出勤
情况1: 出勤了, 再次判断是否是正常出勤和迟到出勤
情况2: 未出勤, 认为没有来
指标判断指标:
作息时间表:
morning_begin_time
morning_end_time
afternoon_begin_time
afternoon_end_time
evening_begin_time
evening_end_time
过滤操作
1) ifnull(ctud.content,'') != '' 相当于 ctud.content != null and ctud.content != ''
转换为hive的操作
nvl(ctud.content,'') != ''
2) 将content中为开班典礼数据过滤掉
ctud.content != '开班典礼'
3) 确保打卡表中学生数据都是开启公屏
学生打卡表.share_state=1
- 书写SQL步骤:
- 先写SQL的大体框架(select 。。。 和过滤条件等)
- 再把时间的计算逻辑写上,由于在hive中无对分钟加减的函数,故需要转化为时间戳,然后对时间戳进行操作
- 再将SQL完整的写出
select
ctudd.class_date,
ctudd.class_id,
student_id,
if(
sum(
if(
unix_timestamp(sso.signin_time,'yyyy-MM-dd HH:mm:ss')
between unix_timestamp(concat(ctudd.class_date,' ',ctd.morning_begin_time),'yyyy-MM-dd HH:mm:ss') - 40*60
and unix_timestamp(concat(ctudd.class_date,' ',ctd.morning_end_time),'yyyy-MM-dd HH:mm:ss')
,1,0)
) > 0, -- 如果大于0 认为当天的打卡记录中, 一定是有出勤的打卡记录, 如果小于等于0 认为没有出勤
if(
sum(
if(
unix_timestamp(sso.signin_time,'yyyy-MM-dd HH:mm:ss')
between unix_timestamp(concat(ctudd.class_date,' ',ctd.morning_begin_time),'yyyy-MM-dd HH:mm:ss') - 40*60
and unix_timestamp(concat(ctudd.class_date,' ',ctd.morning_begin_time),'yyyy-MM-dd HH:mm:ss') + 10*60
, 1 , 0)
) >0, -- 如果大于0, 认为当天打卡记录中, 一定是有正常出勤的记录, 否则认为迟到出勤
0,1),
2)as morning_att,
if(
sum(
if(
unix_timestamp(sso.signin_time,'yyyy-MM-dd HH:mm:ss')
between unix_timestamp(concat(ctudd.class_date,' ',ctd.afternoon_begin_time),'yyyy-MM-dd HH:mm:ss') - 40*60
and unix_timestamp(concat(ctudd.class_date,' ',ctd.afternoon_end_time),'yyyy-MM-dd HH:mm:ss')
,1,0)
) > 0, -- 如果大于0 认为当天的打卡记录中, 一定是有出勤的打卡记录, 如果小于等于0 认为没有出勤
if(
sum(
if(
unix_timestamp(sso.signin_time,'yyyy-MM-dd HH:mm:ss')
between unix_timestamp(concat(ctudd.class_date,' ',ctd.afternoon_begin_time),'yyyy-MM-dd HH:mm:ss') - 40*60
and unix_timestamp(concat(ctudd.class_date,' ',ctd.afternoon_begin_time),'yyyy-MM-dd HH:mm:ss') + 10*60
, 1 , 0)
) >0, -- 如果大于0, 认为当天打卡记录中, 一定是有正常出勤的记录, 否则认为迟到出勤
0,1),
2) as afternoon_att,
if(
sum(
if(
unix_timestamp(sso.signin_time,'yyyy-MM-dd HH:mm:ss')
between unix_timestamp(concat(ctudd.class_date,' ',ctd.evening_begin_time),'yyyy-MM-dd HH:mm:ss') - 40*60
and unix_timestamp(concat(ctudd.class_date,' ',ctd.evening_end_time),'yyyy-MM-dd HH:mm:ss')
,1,0)
) > 0, -- 如果大于0 认为当天的打卡记录中, 一定是有出勤的打卡记录, 如果小于等于0 认为没有出勤
if(
sum(
if(
unix_timestamp(sso.signin_time,'yyyy-MM-dd HH:mm:ss')
between unix_timestamp(concat(ctudd.class_date,' ',ctd.evening_begin_time),'yyyy-MM-dd HH:mm:ss') - 40*60
and unix_timestamp(concat(ctudd.class_date,' ',ctd.evening_begin_time),'yyyy-MM-dd HH:mm:ss') + 10*60
, 1 , 0)
) >0, -- 如果大于0, 认为当天打卡记录中, 一定是有正常出勤的记录, 否则认为迟到出勤
0,1),
2) as evening_att, --未实现
substr(ctudd.class_date,1,4) as yearinfo,
substr(ctudd.class_date,6,2) as monthinfo,
substr(ctudd.class_date,9,2) as dayinfo
from (select * from itcast_dimen.course_table_upload_detail_dimen where nvl(content,'')!='' and content !='开班典礼') ctudd
left join (select * from itcast_ods.student_signin_ods where share_state = 1) sso on sso.class_id = ctudd.class_id
left join itcast_dimen.class_time_dimen ctd on ctd.id = sso.time_table_id
group by ctudd.class_date , ctudd.class_id,sso.student_id;
- 将SQL结果灌入DWM层
set hive.auto.convert.join=false;
--分区
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
--分桶
--set hive.enforce.bucketing=true;
--set hive.enforce.sorting=true;
--set hive.optimize.bucketmapjoin = true;
--set hive.auto.convert.sortmerge.join=true;
--set hive.auto.convert.sortmerge.join.noconditionaltask=true;
--并行执行
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
--小文件合并
-- set mapred.max.split.size=2147483648;
-- set mapred.min.split.size.per.node=1000000000;
-- set mapred.min.split.size.per.rack=1000000000;
--矢量化查询
set hive.vectorized.execution.enabled=true;
--关联优化器
set hive.optimize.correlation=true;
--读取零拷贝
set hive.exec.orc.zerocopy=true;
--join数据倾斜
set hive.optimize.skewjoin=true;
-- set hive.skewjoin.key=100000;
set hive.optimize.skewjoin.compiletime=true;
set hive.optimize.union.remove=true;
-- group倾斜
set hive.groupby.skewindata=true;
insert into table itcast_dwm.student_attendance_dwm partition(yearinfo,monthinfo,dayinfo)
select ....
- 班级出勤人数表
insert into table itcast_dwm.class_attendance_dwm partition(yearinfo,monthinfo,dayinfo)
select
dateinfo,
class_id,
count(
case
when morning_att in ('0','1') then student_id
else null end
) as morning_att_count,
count(
case
when afternoon_att in ('0','1') then student_id
else null end
) as afternoon_att_count,
count(
case
when evening_att in ('0','1') then student_id
else null end
) as evening_att_count,
sum(
case
when morning_att ='1' then 1
else 0 end
) as morning_late_count,
sum(
case
when afternoon_att ='1' then 1
else 0 end
) as afternoon_late_count,
sum(
case
when evening_att ='1' then 1
else 0 end
) as evening_late_count,
yearinfo,
monthinfo,
dayinfo
from itcast_dwm.student_attendance_dwm
group by dateinfo,yearinfo,monthinfo,dayinfo,class_id;
- 班级请假人数表
涉及表:
student_leave_apply: 学生请假表 (事实表)
tbh_class_time_table: 班级作息时间表 (维度表)
course_table_upload_detail: 课程表 (维度表)
表关联条件:
学生请假表.class_id = 班级作息时间表.class_id
学生请假表.class_id = 课程表.class_id
涉及字段:
时间维度: 课程表.class_date
班级维度: 课程表.class_id
指标字段: 请假表.student_id
需要进行去重统计操作
过滤条件:
课程表:
content 不能为空 为null 以及不能为 开班典礼
获取有效的作息时间:
课程表.class_date between 作息表.use_begin_date and 作息表.use_end_date
学生请假表:保证请假数据有效的
audit_state =1 -- 审核通过
cancel_state = 0 -- 没有取消
valid_state = 1 -- 有效的
判断是否请假条件:
请假的开始时间(请假表.begin_time) <= 上课的开始时间 (morning_begin_time |afternon_begin_time | evening_begin_time)
请假的结束时间(请假表.end_time) >= 上课的开始时间(morning_begin_time |afternon_begin_time | evening_begin_time)
- 先添加一条请假数据
INSERT INTO itcast_ods.student_leave_apply_ods partition (dt)
values (125, 5032, 119142, 1, 3491, '2019-09-08 16:42:29', '狂犬疫苗最后一针', 1, 2, '2019-09-03 08:00:00', 2, '2019-09-03 23:30:00', 2, 1, 0, null, null, '医院打针,狂犬疫苗最后一针了', 1, '2019-09-02 08:56:54','2020-07-07');
- 上午每天各个班级请假人数
select
ctudd.class_date,
ctudd.class_id,
count(distinct sla.student_id) as morning_leave_count
from (select * from itcast_ods.student_leave_apply_ods where audit_state = 1 and cancel_state = 0 and valid_state = 1) sla
join itcast_dimen.class_time_dimen ctd on sla.class_id = ctd.class_id
join (select * from itcast_dimen.course_table_upload_detail_dimen where nvl(content,'')!='' and content != '开班典礼') ctudd on ctudd.class_id = sla.class_id
where ctudd.class_date between ctd.use_begin_date and ctd.use_end_date
and concat(ctudd.class_date,' ',ctd.morning_begin_time) >= sla.begin_time
and concat(ctudd.class_date,' ',ctd.morning_begin_time) <= sla.end_time
group by ctudd.class_date,ctudd.class_id;
- 下午每天各个班级请及人数
select
ctudd.class_date,
ctudd.class_id,
count(distinct sla.student_id) as afternoon_leave_count
from (select * from itcast_ods.student_leave_apply_ods where audit_state = 1 and cancel_state = 0 and valid_state = 1) sla
join itcast_dimen.class_time_dimen ctd on sla.class_id = ctd.class_id
join (select * from itcast_dimen.course_table_upload_detail_dimen where nvl(content,'')!='' and content != '开班典礼') ctudd on ctudd.class_id = sla.class_id
where ctudd.class_date between ctd.use_begin_date and ctd.use_end_date
and concat(ctudd.class_date,' ',ctd.afternoon_begin_time) >= sla.begin_time
and concat(ctudd.class_date,' ',ctd.afternoon_begin_time) <= sla.end_time
group by ctudd.class_date,ctudd.class_id;
- 晚自习每天各个班级请假人数
select
ctudd.class_date,
ctudd.class_id,
count(distinct sla.student_id) as evening_leave_count
from (select * from itcast_ods.student_leave_apply_ods where audit_state = 1 and cancel_state = 0 and valid_state = 1) sla
join itcast_dimen.class_time_dimen ctd on sla.class_id = ctd.class_id
join (select * from itcast_dimen.course_table_upload_detail_dimen where nvl(content,'')!='' and content != '开班典礼') ctudd on ctudd.class_id = sla.class_id
where ctudd.class_date between ctd.use_begin_date and ctd.use_end_date
and concat(ctudd.class_date,' ',ctd.evening_begin_time) >= sla.begin_time
and concat(ctudd.class_date,' ',ctd.evening_begin_time) <= sla.end_time
group by ctudd.class_date,ctudd.class_id;
- 将三张表合并
- SQL实现
with A as (select
ctudd.class_date,
ctudd.class_id,
count(distinct sla.student_id) as morning_leave_count
from (select * from itcast_ods.student_leave_apply_ods where audit_state = 1 and cancel_state = 0 and valid_state = 1) sla
join itcast_dimen.class_time_dimen ctd on sla.class_id = ctd.class_id
join (select * from itcast_dimen.course_table_upload_detail_dimen where nvl(content,'')!='' and content != '开班典礼') ctudd on ctudd.class_id = sla.class_id
where ctudd.class_date between ctd.use_begin_date and ctd.use_end_date
and concat(ctudd.class_date,' ',ctd.morning_begin_time) >= sla.begin_time
and concat(ctudd.class_date,' ',ctd.morning_begin_time) <= sla.end_time
group by ctudd.class_date,ctudd.class_id),
B AS (select
ctudd.class_date,
ctudd.class_id,
count(distinct sla.student_id) as afternoon_leave_count
from (select * from itcast_ods.student_leave_apply_ods where audit_state = 1 and cancel_state = 0 and valid_state = 1) sla
join itcast_dimen.class_time_dimen ctd on sla.class_id = ctd.class_id
join (select * from itcast_dimen.course_table_upload_detail_dimen where nvl(content,'')!='' and content != '开班典礼') ctudd on ctudd.class_id = sla.class_id
where ctudd.class_date between ctd.use_begin_date and ctd.use_end_date
and concat(ctudd.class_date,' ',ctd.afternoon_begin_time) >= sla.begin_time
and concat(ctudd.class_date,' ',ctd.afternoon_begin_time) <= sla.end_time
group by ctudd.class_date,ctudd.class_id),
C AS (select
ctudd.class_date,
ctudd.class_id,
count(distinct sla.student_id) as evening_leave_count
from (select * from itcast_ods.student_leave_apply_ods where audit_state = 1 and cancel_state = 0 and valid_state = 1) sla
join itcast_dimen.class_time_dimen ctd on sla.class_id = ctd.class_id
join (select * from itcast_dimen.course_table_upload_detail_dimen where nvl(content,'')!='' and content != '开班典礼') ctudd on ctudd.class_id = sla.class_id
where ctudd.class_date between ctd.use_begin_date and ctd.use_end_date
and concat(ctudd.class_date,' ',ctd.evening_begin_time) >= sla.begin_time
and concat(ctudd.class_date,' ',ctd.evening_begin_time) <= sla.end_time
group by ctudd.class_date,ctudd.class_id),
temp as (select
coalesce(A.class_date,B.class_date,C.class_date) AS class_date,
coalesce(A.class_id,B.class_id,C.class_id) AS class_id,
nvl(A.morning_leave_count,0) as morning_leave_count,
nvl(B.afternoon_leave_count,0) as afternoon_leave_count,
nvl(C.evening_leave_count,0) as evening_leave_count
from A
full join B on A.class_date = B.class_date and A.class_id = B.class_id
full join C on A.class_date = C.class_date and A.class_id = C.class_id)
select
class_date,
class_id,
sum(morning_leave_count) as morning_leave_count,
sum(afternoon_leave_count) as afternoon_leave_count,
sum(evening_leave_count) as evening_leave_count,
substr(class_date,1,4) as yearinfo,
substr(class_date,6,2) as monthinfo,
substr(class_date,9,2) as dayinfo
from temp group by class_date,class_id;
hive 运行:
set hive.auto.convert.join=false;
--分区
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
--分桶
--set hive.enforce.bucketing=true;
--set hive.enforce.sorting=true;
--set hive.optimize.bucketmapjoin = true;
--set hive.auto.convert.sortmerge.join=true;
--set hive.auto.convert.sortmerge.join.noconditionaltask=true;
--并行执行
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
--小文件合并
-- set mapred.max.split.size=2147483648;
-- set mapred.min.split.size.per.node=1000000000;
-- set mapred.min.split.size.per.rack=1000000000;
--矢量化查询
set hive.vectorized.execution.enabled=true;
--关联优化器
set hive.optimize.correlation=true;
--读取零拷贝
set hive.exec.orc.zerocopy=true;
--join数据倾斜
set hive.optimize.skewjoin=false;
-- set hive.skewjoin.key=100000;
set hive.optimize.skewjoin.compiletime=false;
set hive.optimize.union.remove=false;
-- group倾斜
set hive.groupby.skewindata=false;
with A as .........
- 旷课人数统计
计算规则:
旷课人数 = 当日在读人数 - 出勤人数 - 请假人数
- SQL实现
insert into table itcast_dwm.class_truant_dwm partition(yearinfo,monthinfo,dayinfo)
select
ctudd.class_date as dateinfo,
ctudd.class_id,
cssc.studying_student_count - nvl(cad.morning_att_count,0) - nvl(cld.morning_leave_count,0) as morning_truant_count,
cssc.studying_student_count - nvl(cad.afternoon_att_count,0) - nvl(cld.afternoon_leave_count,0) as afternoon_truant_count,
cssc.studying_student_count - nvl(cad.evening_att_count,0) - nvl(cld.evening_leave_count,0) as evening_truant_count,
substr(ctudd.class_date,1,4) as yearinfo,
substr(ctudd.class_date,6,2) as monthinfo,
substr(ctudd.class_date,9,2) as dayinfo
from (select * from itcast_dimen.course_table_upload_detail_dimen where nvl(content,'')!='' and content != '开班典礼') ctudd
left join itcast_dimen.class_studying_student_count_dimen cssc on ctudd.class_date = cssc.studying_date and ctudd.class_id = cssc.class_id
left join itcast_dwm.class_attendance_dwm cad on ctudd.class_id = cad.class_id and ctudd.class_date = cad.dateinfo
left join itcast_dwm.class_leave_dwm cld on ctudd.class_id = cld.class_id and ctudd.class_date = cld.dateinfo
where ctudd.class_date in('2019-09-03','2019-09-04','2019-09-05')
- 汇总表
insert into table itcast_dwm.class_all_dwm partition(yearinfo,monthinfo,dayinfo)
select
ctudd.class_date as dateinfo,
ctudd.class_id,
cssc.studying_student_count,
cad.morning_att_count,
concat(round(nvl(cad.morning_att_count,0) / cssc.studying_student_count * 100,2),'%') as morning_att_ratio,
cad.afternoon_att_count,
concat(round(nvl(cad.afternoon_att_count,0) / cssc.studying_student_count * 100,2),'%') as afternoon_att_ratio,
cad.evening_att_count,
concat(round(nvl(cad.evening_att_count,0) / cssc.studying_student_count * 100,2),'%') as evening_att_ratio,
cad.morning_late_count,
concat(round(nvl(cad.morning_late_count,0) / cssc.studying_student_count * 100,2),'%') as morning_late_ratio,
cad.afternoon_late_count,
concat(round(nvl(cad.afternoon_late_count,0) / cssc.studying_student_count * 100,2),'%') as afternoon_late_ratio,
cad.evening_late_count,
concat(round(nvl(cad.evening_late_count,0) / cssc.studying_student_count * 100,2),'%') as evening_late_ratio,
cld.morning_leave_count,
concat(round(nvl(cld.morning_leave_count,0) / cssc.studying_student_count * 100,2),'%') as morning_leave_ratio,
cld.afternoon_leave_count,
concat(round(nvl(cld.afternoon_leave_count,0) / cssc.studying_student_count * 100,2),'%') as afternoon_leave_ratio,
cld.evening_leave_count,
concat(round(nvl(cld.evening_leave_count,0) / cssc.studying_student_count * 100,2),'%') as evening_leave_ratio,
ctd.morning_truant_count,
concat(round(nvl(ctd.morning_truant_count,0) / cssc.studying_student_count * 100,2),'%') as morning_truant_ratio,
ctd.afternoon_truant_count,
concat(round(nvl(ctd.afternoon_truant_count,0) / cssc.studying_student_count * 100,2),'%') as afternoon_truant_ratio,
ctd.evening_truant_count,
concat(round(nvl(ctd.evening_truant_count,0) / cssc.studying_student_count * 100,2),'%') as evening_truant_ratio,
substr(ctudd.class_date,1,4) as yearinfo,
substr(ctudd.class_date,6,2) as monthinfo,
substr(ctudd.class_date,9,2) as dayinfo
from (select * from itcast_dimen.course_table_upload_detail_dimen where nvl(content,'')!='' and content != '开班典礼') ctudd
left join itcast_dimen.class_studying_student_count_dimen cssc on cssc.class_id = ctudd.class_id and cssc.studying_date = ctudd.class_date
left join itcast_dwm.class_attendance_dwm cad on ctudd.class_id = cad.class_id and ctudd.class_date = cad.dateinfo
left join itcast_dwm.class_leave_dwm cld on ctudd.class_id = cld.class_id and ctudd.class_date = cld.dateinfo
left join itcast_dwm.class_truant_dwm ctd on ctudd.class_id = ctd.class_id and ctudd.class_date = ctd.dateinfo
where ctudd.class_date in ('2019-09-03','2019-09-04','2019-09-05');
1.8 数据分析
- DWS层:SQL实现
insert into table itcast_dws.class_attendance_dws partition(yearinfo,monthinfo,dayinfo)
select
concat(yearinfo,'-',monthinfo) as dateinfo,
class_id,
sum(studying_student_count) as studying_student_count,
sum(morning_att_count) as morning_att_count,
concat(round(sum(morning_att_count) / sum(studying_student_count) *100,2),'%') as morning_att_ratio,
sum(afternoon_att_count) as afternoon_att_count,
concat(round(sum(afternoon_att_count) / sum(studying_student_count) *100,2),'%') as afternoon_att_ratio,
sum(evening_att_count) as evening_att_count,
concat(round(sum(evening_att_count) / sum(studying_student_count) *100,2),'%') as evening_att_ratio,
sum(morning_late_count) as morning_late_count,
concat(round(sum(morning_late_count) / sum(studying_student_count) *100,2),'%') as morning_late_ratio,
sum(afternoon_late_count) as afternoon_late_count,
concat(round(sum(afternoon_late_count) / sum(studying_student_count) *100,2),'%') as afternoon_late_ratio,
sum(evening_late_count) as evening_late_count,
concat(round(sum(evening_late_count) / sum(studying_student_count) *100,2),'%') as evening_late_ratio,
sum(morning_leave_count) as morning_leave_count,
concat(round(sum(morning_leave_count) / sum(studying_student_count) *100,2),'%') as morning_leave_ratio,
sum(afternoon_leave_count) as afternoon_leave_count,
concat(round(sum(afternoon_leave_count) / sum(studying_student_count) *100,2),'%') as afternoon_leave_ratio,
sum(evening_leave_count) as evening_leave_count,
concat(round(sum(evening_leave_count) / sum(studying_student_count) *100,2),'%') as evening_leave_ratio,
sum(morning_truant_count) as morning_truant_count,
concat(round(sum(morning_truant_count) / sum(studying_student_count) *100,2),'%') as morning_truant_ratio,
sum(afternoon_truant_count) as afternoon_truant_count,
concat(round(sum(afternoon_truant_count) / sum(studying_student_count) *100,2),'%') as afternoon_truant_ratio,
sum(evening_truant_count) as evening_truant_count,
concat(round(sum(evening_truant_count) / sum(studying_student_count) *100,2),'%') as evening_truant_ratio,
'4' as time_type,
yearinfo,
monthinfo,
'-1' as dayinfo
from itcast_dwm.class_all_dwm
group by yearinfo,monthinfo,class_id;
1.9 数据导出
-
利用Sqoop 将在hive中分析好的数据导出至MySQL中,便于后续的使用
- 首先在MySQL中构建表结构
CREATE TABLE IF NOT EXISTS scrm_bi.class_attendance_dws ( dateinfo varchar(100) comment '日期', class_id int comment '班级id', studying_student_count int comment '在读班级人数', morning_att_count varchar(100) comment '上午出勤人数', morning_att_ratio varchar(100) comment '上午出勤率', afternoon_att_count varchar(100) comment '下午出勤人数', afternoon_att_ratio varchar(100) comment '下午出勤率', evening_att_count varchar(100) comment '晚自习出勤人数', evening_att_ratio varchar(100) comment '晚自习出勤率', morning_late_count varchar(100) comment '上午迟到人数', morning_late_ratio varchar(100) comment '上午迟到率', afternoon_late_count varchar(100) comment '下午迟到人数', afternoon_late_ratio varchar(100) comment '下午迟到率', evening_late_count varchar(100) comment '晚自习迟到人数', evening_late_ratio varchar(100) comment '晚自习迟到率', morning_leave_count varchar(100) comment '上午请假人数', morning_leave_ratio varchar(100) comment '上午请假率', afternoon_leave_count varchar(100) comment '下午请假人数', afternoon_leave_ratio varchar(100) comment '下午请假率', evening_leave_count varchar(100) comment '晚自习请假人数', evening_leave_ratio varchar(100) comment '晚自习请假率', morning_truant_count varchar(100) comment '上午旷课人数', morning_truant_ratio varchar(100) comment '上午旷课率', afternoon_truant_count varchar(100) comment '下午旷课人数', afternoon_truant_ratio varchar(100) comment '下午旷课率', evening_truant_count varchar(100) comment '晚自习旷课人数', evening_truant_ratio varchar(100) comment '晚自习旷课率', time_type varchar(100) COMMENT '聚合时间类型:1、按小时聚合;2、按天聚合;3、按周聚合;4、按月聚合;5、按年聚合。') comment '班级请假数据统计';
- 其次编写sqoop语句导数据
sqoop export \
--connect "jdbc:mysql://192.168.52.150:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
--username root \
--password 123456 \
--table class_attendance_dws \
--hcatalog-database itcast_dws \
--hcatalog-table class_attendance_dws \
-m 1
2. 学生出勤主题看板__增量流程
2.1 模拟数据(工作环境无)
- 利用现有数据,通过修改模拟新增数据
2.2 数据采集
- 利用sqoop将MySQL中的新增数据导入至hive中
- 由于是增量数据,并且该主题下的表是拉链表,所以在抽取数据时需注意:1.start time(表示数据开始的时间) 2.end time(表示该条记录失效时间)
- 建立update表对增量数据进行存储,之后再汇总至temp表
- 每次使用update表均需重建,避免数据重复导致问题
2.3 数据清洗转换
- 在对增量数据进行清洗转化时,与全量过程是一致的
2.4 数据分析
在增量分析中,以天为例,
对年统计当年结果数据;对季度统计当年当季度结果数据;对月统计当年当季度当月结果数据 存在影响;
对当天和小时的统计对历史数据无影响
解决方案:通过删除分区的方案来解决
例如:2023-01-11 的数据所在分区为:yearinfo=‘2023’ and quarterinfo=‘1’ and monthinfo=‘01’ and dayinfo=‘11’
执行删除:alter table 表名 drop partition("分区");
2.5 数据导出
- 在数据导出操作中,也需要将MySQL中之前的当年、当季度、当月的结果数据删除,重新导入操作
- 我们可以将当年的统计结果数据全部删除,然后全部重新导入所有数据