目录
学生出勤主题看板
一:需求分析
需求一:
维度:时间维度:年,月,天,上午,下午,晚自习
班级维度
学生维度
指标:出勤人数(正常出勤+迟到出勤)
计算公式:打卡时间在上课前40分钟~上课时间点之内(假设上课时间是9-12点,打卡时间是8.40-12点都算),and 没有早退(离开时间为12点之后)(通过下面的需求得知不需要考虑早退问题,但要考虑是否开启共享屏幕问题(即按时签到+开启共享屏幕学习,这才算出勤))
表:course_table_upload_datail表,tbh_student_signin_record表,tbh_class_time_table表
关联条件:?
字段:打卡时间,离开时间,上课时间,下课时间(属于表?).....?
需求二:
维度:时间维度:年,月,天,上午,下午,晚自习
班级维度
学生维度
指标:出勤人数(需求一已统计)
当日在读学生数
针对比率的指标,我们最后在分析,因为可能后续的需求中会对其中的分子或分母进行计算。如果后续计算了,那么这里就不需要计算了。
表:course_table_upload_detail表,tbh_student_signin_record表,tbh_class_time_table表,class_studying_student_count表
关联条件:?
字段:?
需求三:
维度:时间维度:年,月,天,上午,下午,晚自习
班级维度
学生维度
指标:迟到人数
计算公式:打卡时间在上课10分钟以后到下课之前(假设上课时间是9-12点,打卡时间是9.10以后12点以前都算迟到)。通过这个可以看出打卡时间在上课前40分钟到上课后10分钟且离开时间在下课之后是正常出勤,打卡时间是在上课后10分钟到下课且离开时间在下课之后算迟到出勤。(通过下面的需求得知不需要考虑早退问题,但要考虑是否开启共享屏幕问题(即打卡迟到了+开启共享屏幕学习,这才算是迟到)
表:course_table_upload_detail表,tbh_student_signin_record表,tbh_class_time_table表
字段:?
需求四:
select dt.every_date,
ctud.class_id,
tssr.student_id,
if(
#上午正常打卡为0,迟到10分钟以上为1,其他(请假+旷课)为2
sum(
case
#上午打卡时间是否在上课前40分钟~下课时间段之内
when time(tssr.signin_time) between TIMESTAMPADD(minute, -40, tctt.morning_begin_time) and tctt.morning_end_time
then 1 #上午来了
else 0 end #上午没来
) > 0, #打卡多次,只要有一次正常打卡,就会>0,返回true;否则没来,返回false
if(
sum(
case
#上午打卡时间是否在上课前40分钟~上课后10分钟之内
when time(tssr.signin_time) between TIMESTAMPADD(minute, -40, tctt.morning_begin_time) and TIMESTAMPADD(minute, 10, tctt.morning_begin_time)
then 1 #正常出勤
else 0 end #迟到
) > 0, #有一次打卡是正常出勤,就会>0,返回true;否则迟到,返回false
0, #正常出勤
1 #迟到
),
2 #上午没来
) as morning_signin,
if(
#下午正常打卡为0,迟到10分钟以上为1,其他(请假+旷课)为2
sum(case
when time(tssr.signin_time) between TIMESTAMPADD(minute, -40, tctt.afternoon_begin_time) and tctt.afternoon_end_time
then 1
else 0 end) > 0,
if(sum(case
when time(tssr.signin_time) between TIMESTAMPADD(minute, -40, tctt.afternoon_begin_time) and TIMESTAMPADD(minute, 10, tctt.afternoon_begin_time)
then 1
else 0 end) > 0, 0, 1), 2) as afternoon_signin,
if(
#晚自习正常打卡为0,迟到10分钟以上为1,其他(请假+旷课)为2
sum(case
when time(tssr.signin_time) between TIMESTAMPADD(minute, -20, tctt.evening_begin_time) and tctt.evening_end_time
then 1
else 0 end) > 0,
if(sum(case
when time(tssr.signin_time) between TIMESTAMPADD(minute, -20, tctt.evening_begin_time) and TIMESTAMPADD(minute, 10, tctt.evening_begin_time)
then 1
else 0 end) > 0, 0, 1), 2) as evening_signin
from (
#获取今天之前一周内的日期
select datelist as every_date from calendar where datelist between '2019-09-01' and '2019-09-30'
) dt
#日期课表不为空且不是开班典礼
left join course_table_upload_detail ctud
on ctud.class_date = dt.every_date and ifnull(ctud.content, '') != '' and
ctud.content != '开班典礼'
#学生打卡记录日期和班级匹配,且开启共屏进入学习
left join tbh_student_signin_record tssr
on tssr.class_id = ctud.class_id and tssr.signin_date = dt.every_date and
tssr.share_state = 1
#获取班级作息时间以判断是否按时出勤
left join tbh_class_time_table tctt on tctt.id = tssr.time_table_id
#按照日期、班级、学生分组统计
group by dt.every_date, ctud.class_id, tssr.student_id;
维度:时间维度:年,月,天,(上午,下午,晚自习)
班级维度
学生维度
指标:迟到人数(需求三已统计)
当日在读学员数
表:course_table_upload_detail日期课程明细表(课程维度表),tbh_student_signin_record学生打卡记录(事实表)表,tbh_class_time_table班级作息时间表(班级作息维度表)
关联条件:course_table_upload_detail.class_id = tbh_student_signin_record.class_id
tbh_class_time_table.id = tbh_student_signin_record.time_table_id
字段:时间维度:course_table_upload_detail.class_date
班级维度:course_table_upload_detail.class_id
学生维度:tbh_student_signin_record.student_id
指标字段:tbh_student_signin_record.signin_time # 打卡时间
指标判断字段:tbh_class_time_table.morning_begin_time # 上午上课时间
tbh_class_time_table.morning_end_time # 上午下课时间
tbh_class_time_table.afternoon_begin_time
tbh_class_time_table.afternoon_end_time
tbh_class_time_table.evening_begin_time
tbh_class_time_table.evening_end_time
指标计算方法:先判断,上课时间-40 <= signin_time<= 下课时间
情形1:满足上述条件,出勤了,再次判断上课时间-40 <= signin_time<= 上课时间+10,如果满足则是正常出勤返回0,否则为迟到返回1。
情形2:不满足,缺勤了(旷课+请假),返回2。
注意:ifnull()是mysql中的函数与nvl()作用一样
是否需要清洗:course_table_upload_detail.content != '开班典礼' and != ' ' and != null # 目前是统计学生的出勤情况content为null和空说明没上课休息,因此不需要统计。开班典礼的时候也不需要统计出勤,因为还没有正式上课。
tbh_student_signin_record.share_state = 1 # share_state=1说明学生目前跟随着老师开启共享屏幕正在学习,即在统计出勤人数的时候,不仅仅需要在合理的时间范围内签到,必须开启共享屏幕,如果来共享屏幕都没开,说明即使是签到了但没有好好学习,这不算出勤(一般这个字段的值都是1,因为是否开启共享屏幕是由老师控制的,但不乏有那么几个特殊份子)。
是否需要转换:course_table_upload_detail.class_date可能需要转换为yearinfo,monthinfo,dayinfo。需要对数据调研。
需求五:
维度:时间维度:年,月,天,上午,下午,晚自习
班级维度
指标:请假人数
表:student_leave_apply学生请假申请表(事实表),tbh_class_time_table,course_table_upload_detail
关联条件:student_leave_apply.class_id = tbh_class_time_table.class_id
student_leave_apply.class_id = course_table_upload_detail.class_id
字段:时间维度:course_table_upload_detail.class_date
班级维度:course_table_upload_detail.class_id
指标字段:student_leave_apply.student_id
需要清洗的字段:course_table_upload_detail.content
course_table_upload_detail.class_date between tbh_class_time_table.use_begin_date and tbh_class_time_table.use_end_date
# 保证请假操作有效(即确定该学生是否有请假)
student_leave_apply.audit_state = 1
student_leave_apply.cancel_state = 0
student_leave_apply.valid_state = 1
# 判断该学生请假的时候是否合理(上课时间是9-11点,学生A请假时间为7-8点该学生没来上课,属于旷课)
真实情况下只有学生的请假开始时间<=上课时间 and 学生的请假结束时间>=下课时间,才算学生请假成功,由于一般请假我们都是请一上午或一下午,因此在判断的时候:请假开始时间<=上课开始时间 and 学生的请假结束时间>=上课开始时间,即:(这里只是以上午为例)
student_leave_apply.begin_time <= concat(course_table_upload_detail.class_date,' ',tbh_class_time_table.morning_begin_time)
student_leave_apply.end_time >= concat(course_table_upload_detail.class_date,' ',tbh_class_time_table.morning_begin_time)
需要转换的字段:course_table_upload_detail.class_date转化为yearinfo,monthinfo,dayinfo
需求六:
维度:时间维度:年,月,天,上午,下午,晚自习
班级维度
指标:请假人数(需求五已统计)
当日再度学员人数
需求七:
维度:时间维度:年,月,天,上午,下午,晚自习
班级维度
指标:旷课人数=当日再读学员人数-出勤人数(需求一已统计)-请假人数(需求五已统计)
需求八:
维度:时间维度:年,月,天,上午,下午,晚自习
班级维度
指标:旷课人数/当日在读学员人数
表:class_studying_student_count班级在读学生人数表(维度表)
通过查看数据发现在class_studying_student_count表中已经直接给出了当日在读学员人数
关联条件:class_studying_student_count.class_id = course_table_upload_detail.class_id
需求总结:
注意通过上述分析得知有两个事实表,一个事实表用于统计出勤和迟到,一个事实表用于统计请假,此时在需求总结的时候要分类总结
第一类:统计出勤和迟到人数
涉及维度:时间维度:年,月,天,上午,下午,晚自习
班级维度
学生维度
涉及指标:出勤人数
迟到人数
表:course_table_upload_detail(维度表),tbh_student_signin_record(事实表),tbh_class_time_table(维度表),class_studying_student_count(维度表)
关联条件:course_table_upload_detail.class_id = tbh_student_signin_record.class_id
tbh_class_time_table.id = tbh_student_signin_record.time_table_id
class_studying_student_count.class_id = course_table_upload_detail.class_id
字段:时间维度:course_table_upload_detail.class_date
班级维度:course_table_upload_detail.class_id
学生维度:tbh_student_signin_record.student_id
指标字段:tbh_student_signin_record.signin_time # 打卡时间
指标判断字段:tbh_class_time_table.morning_begin_time # 上午上课时间
tbh_class_time_table.morning_end_time # 上午下课时间
tbh_class_time_table.afternoon_begin_time
tbh_class_time_table.afternoon_end_time
tbh_class_time_table.evening_begin_time
tbh_class_time_table.evening_end_time
指标计算方法:先判断,上课时间-40 <= signin_time<= 下课时间
情形1:满足上述条件,出勤了,再次判断上课时间-40 <= signin_time<= 上课时间+10,如果满足则是正常出勤返回0,否则为迟到返回1。
情形2:不满足,缺勤了(旷课+请假),返回2。
是否需要清洗:course_table_upload_detail.content != '开班典礼' and != ' ' and != null
tbh_student_signin_record.share_state = 1
是否需要转换:course_table_upload_detail.class_date转换为yearinfo,monthinfo,dayinfo。
第二类:统计请假人数
涉及维度:时间维度:年,月,天,上午,下午,晚自习
班级维度
涉及指标:请假人数
表:student_leave_apply(事实表),tbh_class_time_table(维度表),course_table_upload_detail(维度表),class_studying_student_count(维度表)
关联条件:student_leave_apply.class_id = tbh_class_time_table.class_id
student_leave_apply.class_id = course_table_upload_detail.class_id
class_studying_student_count.class_id = course_table_upload_detail.class_id
字段:时间维度:course_table_upload_detail.class_date
班级维度:course_table_upload_detail.class_id
指标字段:student_leave_apply.student_id
需要清洗的字段:course_table_upload_detail.content != ' ' and != null and != '开班典礼'
course_table_upload_detail.class_date between tbh_class_time_table.use_begin_date and tbh_class_time_table.use_end_date
# 保证请假操作有效(即确定该学生是否有请假)
student_leave_apply.audit_state = 1
student_leave_apply.cancel_state = 0
student_leave_apply.valid_state = 1
student_leave_apply.begin_time <= concat(course_table_upload_detail.class_date,' ',tbh_class_time_table.morning_begin_time)
student_leave_apply.end_time >= concat(course_table_upload_detail.class_date,' ',tbh_class_time_table.morning_begin_time)
需要转换的字段:course_table_upload_detail.class_date转化为yearinfo,monthinfo,dayinfo。
第三类:统计旷课人数
涉及维度:时间维度:年,月,天,上午,下午,晚自习
班级维度
涉及指标: 旷课人数
指标计算方法:旷课人数 = 当日在读学员数 - 出勤人数 - 请假人数
二:业务数据准备
create database teach default character set utf8mb4 collate utf8mb4_unicode_ci;
在teach库中运行:
三:建模分析
ODS层:存放两张事实表分别是:tbh_student_signin_record和student_leave_apply,添加分区抽取时间start_time。(因该还要添加end_time字段)
DIM层:存放三张维度表分别是:tbh_class_time_table,course_table_upload_detail和class_studying_student_count,需要添加分区start_time。
DWD层:对ODS层表进行清洗转换工作,以及少量维度退化(这里不需要)
按理来说有两张表,分别是对两个事实表清洗转换后的表,考虑到这些清洗工作大多数都涉及到维度表,只有少数清洗工作是在事实表中的,在这一阶段维度表并没有退化到事实表中,因此大量的清洗转换工作在这里无法执行。为了方便起见所有的清洗转换操作都放到DWM层维度退化之后。
不需要DWD层
DWM层:维度退化+提前聚合操作
操作一:将course_table_upload_detail(维度表),tbh_class_time_table(维度表),class_studying_student_count(维度表)的数据退化到tbh_student_signin_record(事实表),并且完成清洗转换工作。
操作二:是否需要提前聚合?可以,因为跨天数据可以累加,提前按天聚合
操作三:将tbh_class_time_table(维度表),course_table_upload_detail(维度表),class_studying_student_count(维度表)的数据退化到student_leave_apply(事实表),并完成清洗转换工作。
操作四:是否需要提前聚合?可以,因为跨天数据可以累加,提前按天聚合
思考:如何提前聚合呢?先分后合
建表方案:
DWS层:统计分析操作
建表方案:
四:建模操作
ODS层:
CREATE TABLE IF NOT EXISTS itcast_ods.student_signin_ods (
id int,
normal_class_flag int comment '是否正课 1 正课 2 自习 3 休息',
time_table_id int comment '作息时间id normal_class_flag=2 关联tbh_school_time_table 或者 normal_class_flag=1 关联 tbh_class_time_table',
class_id int comment '班级id',
student_id int comment '学员id',
signin_time String comment '签到时间',
signin_date String comment '签到日期',
inner_flag int comment '内外网标志 0 外网 1 内网',
signin_type int comment '签到类型 1 心跳打卡 2 老师补卡 3 直播打卡',
share_state int comment '共享屏幕状态 0 否 1是 在上午或下午段有共屏记录,则该段所有记录该字段为1,内网默认为1 外网默认为0 (暂不用)',
inner_ip String comment '内网ip地址',
create_time String comment '创建时间')
comment '学生打卡记录表'
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY','orc.bloom.filter.columns'='time_table_id,class_id,signin_date,share_state');
-- 不需要建立拉链表,因为打卡记录表一般不允许修改,比如今天你手机打了卡,还能修改吗?显然不能,这个表每天只有新增的打卡记录。这个表可以建成外部表无所谓
-- 可以看出该表使用了布隆过滤索引。time_table_id,class_id都需要 = join操作;share_state需要在where条件中筛选出=1的情况;signin_date可以不用使用索引,因为没有等值连接。
CREATE TABLE IF NOT EXISTS itcast_ods.student_leave_apply_ods (
id int,
class_id int comment '班级id',
student_id int comment '学员id',
audit_state int comment '审核状态 0 待审核 1 通过 2 不通过',
audit_person int comment '审核人',
audit_time String comment '审核时间',
audit_remark String comment '审核备注',
leave_type int comment '请假类型 1 请假 2 销假 (查询是否请假不用过滤此类型,通过有效状态来判断)',
leave_reason int comment '请假原因 1 事假 2 病假',
begin_time String comment '请假开始时间',
begin_time_type int comment '1:上午 2:下午 3:晚自习',
end_time String comment '请假结束时间',
end_time_type int comment '1:上午 2:下午 3:晚自习',
days float comment '请假/已休天数',
cancel_state int comment '撤销状态 0 未撤销 1 已撤销',
cancel_time String comment '撤销时间',
old_leave_id int comment '原请假id,只有leave_type =2 销假的时候才有',
leave_remark String comment '请假/销假说明',
valid_state int comment '是否有效(0:无效 1:有效)',
create_time String comment '创建时间')
comment '学生请假申请表'
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY','orc.bloom.filter.columns'='class_id,audit_state,cancel_state,valid_state');
-- 请假表可以考虑设置为拉链表
-- class_id,audit_state,cancel_state,valid_state都存在等值连接操作可以设置布隆过滤索引,加快查询速度。
DIM层:
CREATE TABLE IF NOT EXISTS itcast_dimen.class_studying_student_count_dimen (
id int,
school_id int comment '校区id',
subject_id int comment '学科id',
class_id int comment '班级id',
studying_student_count int comment '在读班级人数',
studying_date STRING comment '在读日期')
comment '在读班级的每天在读学员人数'
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY','orc.bloom.filter.columns'='studying_student_count,studying_date');
CREATE TABLE IF NOT EXISTS itcast_dimen.course_table_upload_detail_dimen
(
id int comment 'id',
base_id int comment '课程主表id',
class_id int comment '班级id',
class_date STRING comment '上课日期',
content STRING comment '课程内容',
teacher_id int comment '老师id',
teacher_name STRING comment '老师名字',
job_number STRING comment '工号',
classroom_id int comment '教室id',
classroom_name STRING comment '教室名称',
is_outline int comment '是否大纲 0 否 1 是',
class_mode int comment '上课模式 0 传统全天 1 AB上午 2 AB下午 3 线上直播',
is_stage_exam int comment '是否阶段考试(0:否 1:是)',
is_pay int comment '代课费(0:无 1:有)',
tutor_teacher_id int comment '晚自习辅导老师id',
tutor_teacher_name STRING comment '辅导老师姓名',
tutor_job_number STRING comment '晚自习辅导老师工号',
is_subsidy int comment '晚自习补贴(0:无 1:有)',
answer_teacher_id int comment '答疑老师id',
answer_teacher_name STRING comment '答疑老师姓名',
answer_job_number STRING comment '答疑老师工号',
remark STRING comment '备注',
create_time STRING comment '创建时间')
comment '班级课表明细表'
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY','orc.bloom.filter.columns'='class_id,class_date');
CREATE TABLE IF NOT EXISTS itcast_dimen.class_time_dimen (
id int,
class_id int comment '班级id',
morning_template_id int comment '上午出勤模板id',
morning_begin_time STRING comment '上午开始时间',
morning_end_time STRING comment '上午结束时间',
afternoon_template_id int comment '下午出勤模板id',
afternoon_begin_time STRING comment '下午开始时间',
afternoon_end_time STRING comment '下午结束时间',
evening_template_id int comment '晚上出勤模板id',
evening_begin_time STRING comment '晚上开始时间',
evening_end_time STRING comment '晚上结束时间',
use_begin_date STRING comment '使用开始日期',
use_end_date STRING comment '使用结束日期',
create_time STRING comment '创建时间',
create_person int comment '创建人',
remark STRING comment '备注')
comment '班级作息时间表'
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY','orc.bloom.filter.columns'='id,class_id');
DWM层:
CREATE TABLE IF NOT EXISTS itcast_dwm.student_attendance_dwm (
dateinfo String comment '日期',
class_id int comment '班级id',
student_id int comment '学员id',
morning_att String comment '上午出勤情况:0.正常出勤、1.迟到、2.其他(请假+旷课)',
afternoon_att String comment '下午出勤情况:0.正常出勤、1.迟到、2.其他(请假+旷课)',
evening_att String comment '晚自习出勤情况:0.正常出勤、1.迟到、2.其他(请假+旷课)')
comment '学生出勤情况表'
PARTITIONED BY (yearinfo STRING, monthinfo STRING, dayinfo STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY');
CREATE TABLE IF NOT EXISTS itcast_dwm.class_attendance_dwm (
dateinfo String comment '日期',
class_id int comment '班级id',
morning_att_count String comment '上午出勤人数',
afternoon_att_count String comment '下午出勤人数',
evening_att_count String comment '晚自习出勤人数',
morning_late_count String comment '上午迟到人数',
afternoon_late_count String comment '下午迟到人数',
evening_late_count String comment '晚自习迟到人数')
comment '学生出勤(出勤和迟到)数据'
PARTITIONED BY (yearinfo STRING, monthinfo STRING, dayinfo STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY');
CREATE TABLE IF NOT EXISTS itcast_dwm.class_leave_dwm (
dateinfo String comment '日期',
class_id int comment '班级id',
morning_leave_count String comment '上午请假人数',
afternoon_leave_count String comment '下午请假人数',
evening_leave_count String comment '晚自习请假人数')
comment '班级请假数据表'
PARTITIONED BY (yearinfo STRING, monthinfo STRING, dayinfo STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY');
CREATE TABLE IF NOT EXISTS itcast_dwm.class_truant_dwm (
dateinfo String comment '日期',
class_id int comment '班级id',
morning_truant_count String comment '上午旷课人数',
afternoon_truant_count String comment '下午旷课人数',
evening_truant_count String comment '晚自习旷课人数')
comment '班级旷课数据表'
PARTITIONED BY (yearinfo STRING, monthinfo STRING, dayinfo STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY');
CREATE TABLE IF NOT EXISTS itcast_dwm.class_all_dwm (
dateinfo String comment '日期',
class_id int comment '班级id',
studying_student_count int comment '在读班级人数',
morning_att_count String comment '上午出勤人数',
morning_att_ratio String comment '上午出勤率',
afternoon_att_count String comment '下午出勤人数',
afternoon_att_ratio String comment '下午出勤率',
evening_att_count String comment '晚自习出勤人数',
evening_att_ratio String comment '晚自习出勤率',
morning_late_count String comment '上午迟到人数',
morning_late_ratio String comment '上午迟到率',
afternoon_late_count String comment '下午迟到人数',
afternoon_late_ratio String comment '下午迟到率',
evening_late_count String comment '晚自习迟到人数',
evening_late_ratio String comment '晚自习迟到率',
morning_leave_count String comment '上午请假人数',
morning_leave_ratio String comment '上午请假率',
afternoon_leave_count String comment '下午请假人数',
afternoon_leave_ratio String comment '下午请假率',
evening_leave_count String comment '晚自习请假人数',
evening_leave_ratio String comment '晚自习请假率',
morning_truant_count String comment '上午旷课人数',
morning_truant_ratio String comment '上午旷课率',
afternoon_truant_count String comment '下午旷课人数',
afternoon_truant_ratio String comment '下午旷课率',
evening_truant_count String comment '晚自习旷课人数',
evening_truant_ratio String comment '晚自习旷课率')
comment '班级各项数据按天统计汇总表'
PARTITIONED BY (yearinfo STRING, monthinfo STRING, dayinfo STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY');
DWS层:
CREATE TABLE IF NOT EXISTS itcast_dws.class_attendance_dws (
dateinfo String comment '日期',
class_id int comment '班级id',
studying_student_count int comment '在读班级人数',
morning_att_count String comment '上午出勤人数',
morning_att_ratio String comment '上午出勤率',
afternoon_att_count String comment '下午出勤人数',
afternoon_att_ratio String comment '下午出勤率',
evening_att_count String comment '晚自习出勤人数',
evening_att_ratio String comment '晚自习出勤率',
morning_late_count String comment '上午迟到人数',
morning_late_ratio String comment '上午迟到率',
afternoon_late_count String comment '下午迟到人数',
afternoon_late_ratio String comment '下午迟到率',
evening_late_count String comment '晚自习迟到人数',
evening_late_ratio String comment '晚自习迟到率',
morning_leave_count String comment '上午请假人数',
morning_leave_ratio String comment '上午请假率',
afternoon_leave_count String comment '下午请假人数',
afternoon_leave_ratio String comment '下午请假率',
evening_leave_count String comment '晚自习请假人数',
evening_leave_ratio String comment '晚自习请假率',
morning_truant_count String comment '上午旷课人数',
morning_truant_ratio String comment '上午旷课率',
afternoon_truant_count String comment '下午旷课人数',
afternoon_truant_ratio String comment '下午旷课率',
evening_truant_count String comment '晚自习旷课人数',
evening_truant_ratio String comment '晚自习旷课率',
time_type STRING COMMENT '聚合时间类型:1、按小时聚合;2、按天聚合;3、按季度聚合;4、按月聚合;5、按年聚合。')
comment '班级数据统计表'
PARTITIONED BY (yearinfo STRING, monthinfo STRING, dayinfo STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY');
五:数据抽取
ODS层:
sqoop import \
--connect jdbc:mysql://hadoop01:3306/teach \
--username root \
--password 123456 \
--query 'select *, "2021-10-07" as dt from tbh_student_signin_record where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table student_signin_ods \
-m 1
sqoop import \
--connect jdbc:mysql://hadoop01:3306/teach \
--username root \
--password 123456 \
--query 'select *, "2021-10-07" as dt from student_leave_apply where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table student_leave_apply_ods \
-m 1
DIM层:
sqoop import \
--connect jdbc:mysql://hadoop01:3306/teach \
--username root \
--password 123456 \
--query 'select *, "2021-10-07" as dt from class_studying_student_count where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table class_studying_student_count_dimen \
-m 1
sqoop import \
--connect jdbc:mysql://hadoop01:3306/teach \
--username root \
--password 123456 \
--query 'select *, "2021-10-07" as dt from course_table_upload_detail where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table course_table_upload_detail_dimen \
-m 1
sqoop import \
--connect jdbc:mysql://hadoop01:3306/teach \
--username root \
--password 123456 \
--query 'select *, "2021-10-07" as dt from tbh_class_time_table where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_dimen \
--hcatalog-table class_time_dimen \
-m 1
六:数据清洗转换,维度退化,提前聚合
将ODS层和DIM层的数据经过维度退化,清洗转换,提前聚合操作,并将结果保存至DWM层
# 学生出勤情况表
注意:学生打不打卡是建立在今天有课的基础上的,因此from字段课表需要放在最左边,如果将学生打卡记录表放在from字段最左边,学生是可以随意玩耍打卡软件的,显然可能存在不上课的时候也有学生在打卡,因此如果以学生打卡记录表为主可能存在统计结果偏大。而且如果课表说星期1有课,但是由于某些特殊原因所有学生没有打卡,此时以打卡表为主会导致星期1的统计没数据。
-- ctudd.class_date 的格式为 年-月-日
select ctudd.class_date as dateinfo, sso.class_id, sso.student_id,
xx as morning_att,
xx as afternoon_att,
xx as evening_att,
substr(ctudd.class_date,1,4) as yearinfo,
substr(ctudd.class_date,6,2) as monthinfo,
substr(ctudd.class_date,9,2) as dayinfo
from itcast_dimen.course_table_upload_detail_dimen ctudd
left join itcast_ods.student_signin_ods sso on sso.class_id = ctudd.class_id
left join itcast_dimen.class_time_dimen ctd on ctd.id = sso.time_table_id
and ctudd.content != '开班典礼' and ctudd.content != ' ' and ctudd.content != null and sso.share_state = 1
group by ctudd.class_date, sso.class_id, sso.student_id;
注意:select涉及join操作,针对join操作的常见优化手段就是,能提前过滤就提前过滤,过滤了在join,显然上面的select是join之后在过滤,效率低。
select
ctudd.class_date as dateinfo,
sso.class_id,
sso.student_id,
xx as morning_att, --需要使用hive中的函数实现分析
xx as afternoon_att, --需要使用hive中的函数实现分析
xx as evening_att, --需要使用hive中的函数实现分析
substr(ctudd.class_date,1,4) as yearinfo,
substr(ctudd.class_date,6,2) as monthinfo,
substr(ctudd.class_date,9,2) as dayinfo
from (select * from itcast_dimen.course_table_upload_detail_dimen where content != '开班典礼' and nvl(content, '') != '') ctudd
left join (select * from itcast_ods.student_signin_ods where share_state=1) sso on sso.class_id = ctudd.class_id
left join itcast_dimen.class_time_dimen ctd on ctd.id = sso.time_table_id
group by ctudd.class_date, ctudd.class_id, sso.student_id;
-- 通过查看原数据得知signin_time,morning_begin_time等等字段都是 时:分:秒格式
select
ctudd.class_date as dateinfo,
sso.class_id,
sso.student_id,
if(
sum(
if(sso.signin_time between ctd.morning_begin_time - 40 and ctd.morning_end_time,1,0) -- 需要将伪代码转化为真实代码
) > 0,
if(sum(
if(sso.signin_time between ctd.morning_begin_time - 40 and ctd.morning_begin_time + 10,1,0)
) > 0,
0,1),
2) as morning_att,
xx as afternoon_att,
xx as evening_att,
substr(ctudd.class_date,1,4) as yearinfo,
substr(ctudd.class_date,6,2) as monthinfo,
substr(ctudd.class_date,9,2) as dayinfo
from (select * from itcast_dimen.course_table_upload_detail_dimen where content != '开班典礼' and nvl(content, '') != '') ctudd
left join (select * from itcast_ods.student_signin_ods where share_state=1) sso on sso.class_id = ctudd.class_id
left join itcast_dimen.class_time_dimen ctd on ctd.id = sso.time_table_id
group by ctudd.class_date, ctudd.class_id, sso.student_id;
目前我们的任务是将格式为 时:分:秒的时间减去40分钟,通过查询hive的时间相关函数,发现我们只有针对日期的加减函数没有针对时间的加减函数。
思考:如何实现时间的加减呢?利用时间戳
首先将class_date与morning_begin_time进行concat拼接,其次利用unix_timestamp函数将日期转化为时间戳(单位为秒s),然后将时间戳减去2400s(40分钟),然后利用from_unixtime函数将时间戳转换为日期,最后获取时间字段。即ctd.morning_begin_time - 40等价于
substr(from_unixtime(unix_timestamp(concat(ctudd.class_date,' ',ctd.morning_begin_time), yyyy-MM-dd HH:mm:ss) - 2400),12,8)
当然可以直接以时间戳来比较,此时需要将sso.signin_time,ctd.morning_end_time同样转化为时间戳。
-- 通过查看原数据得知signin_time是 年-月-天 时:分:秒,morning_begin_time字段是 时:分:秒格式
--分区
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
--分桶
-- set hive.enforce.bucketing=true;
-- set hive.enforce.sorting=true;
-- set hive.optimize.bucketmapjoin = true;
-- set hive.auto.convert.sortmerge.join=true;
-- set hive.auto.convert.sortmerge.join.noconditionaltask=true;
--并行执行
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
--矢量化查询
set hive.vectorized.execution.enabled=true;
--关联优化器
set hive.optimize.correlation=true;
--读取零拷贝
set hive.exec.orc.zerocopy=true;
--join数据倾斜
set hive.optimize.skewjoin=true;
-- set hive.skewjoin.key=100000;
-- set hive.optimize.skewjoin.compiletime=true;
set hive.optimize.union.remove=true;
-- group倾斜
set hive.groupby.skewindata=true;
insert into table itcast_dwm.student_attendance_dwm partition(yearinfo,monthinfo,dayinfo)
select
ctudd.class_date as dateinfo,
sso.class_id,
sso.student_id,
if(
sum(
if(sso.signin_time
between from_unixtime(unix_timestamp(concat(ctudd.class_date,' ',ctd.morning_begin_time), yyyy-MM-dd HH:mm:ss) - 2400)
and concat(ctudd.class_date,' ',ctd.morning_end_time),1,0)
) > 0,
if(sum(
if(unix_timestamp(sso.signin_time, yyyy-MM-dd HH:mm:ss)
between unix_timestamp(concat(ctudd.class_date,' ',ctd.morning_begin_time), yyyy-MM-dd HH:mm:ss) - 2400
and unix_timestamp(concat(ctudd.class_date,' ',ctd.morning_begin_time), yyyy-MM-dd HH:mm:ss) + 600,1,0)
) > 0,
0,1),
2) as morning_att, -- 上午统计
if(
sum(
if(sso.signin_time
between from_unixtime(unix_timestamp(concat(ctudd.class_date,' ',ctd.afternoon_begin_time), yyyy-MM-dd HH:mm:ss) - 2400)
and concat(ctudd.class_date,' ',ctd.afternoon_end_time),1,0)
) > 0,
if(sum(
if(unix_timestamp(sso.signin_time, yyyy-MM-dd HH:mm:ss)
between unix_timestamp(concat(ctudd.class_date,' ',ctd.afternoon_begin_time), yyyy-MM-dd HH:mm:ss) - 2400
and unix_timestamp(concat(ctudd.class_date,' ',ctd.afternoon_begin_time), yyyy-MM-dd HH:mm:ss) + 600,1,0)
) > 0,
0,1),
2) as afternoon_att, -- 下午统计
if(
sum(
if(sso.signin_time
between from_unixtime(unix_timestamp(concat(ctudd.class_date,' ',ctd.evening_begin_time), yyyy-MM-dd HH:mm:ss) - 2400)
and concat(ctudd.class_date,' ',ctd.evening_end_time),1,0)
) > 0,
if(sum(
if(unix_timestamp(sso.signin_time, yyyy-MM-dd HH:mm:ss)
between unix_timestamp(concat(ctudd.class_date,' ',ctd.evening_begin_time), yyyy-MM-dd HH:mm:ss) - 2400
and unix_timestamp(concat(ctudd.class_date,' ',ctd.evening_begin_time), yyyy-MM-dd HH:mm:ss) + 600,1,0)
) > 0,
0,1),
2) as evening_att, -- 晚自习统计
substr(ctudd.class_date,1,4) as yearinfo,
substr(ctudd.class_date,6,2) as monthinfo,
substr(ctudd.class_date,9,2) as dayinfo
from (select * from itcast_dimen.course_table_upload_detail_dimen where content != '开班典礼' and nvl(content, '') != '') ctudd
left join (select * from itcast_ods.student_signin_ods where share_state=1) sso on sso.class_id = ctudd.class_id
left join itcast_dimen.class_time_dimen ctd on ctd.id = sso.time_table_id
where cutdd.class_date in ('2019-09-03','2019-09-04','2019-09-05') -- 我们使用的创造出来的数据只有这三天是有效的可以进行统计的,其他数据是通过随机函数随机生产的无用数据,因此实验过程中只考虑这三天即可,注意在实际生产环境中是没有这个where条件的。
group by ctudd.class_date, ctudd.class_id, sso.student_id;
# 学生出勤数据表
以学生出勤情况表为基础统计出勤人数和迟到人数
insert into table itcast_dwm.class_attendancce_dwm partition(yearinfo, monthinfo, dayinfo)
select
dateinfo,
class_id,
sum(if(morning_att = 0 or morning_att = 1,1,0)) as morning_att_count,
sum(if(afternoon_att = 0 or afternoon_att = 1,1,0)) as afternoon_att_count,
sum(if(evening_att = 0 or evening_att = 1,1,0)) as evening_att_count,
sum(if(morning_att = 1,1,0)) as morning_late_count,
sum(if(afternoon_att = 1,1,0)) as afternoon_late_count,
sum(if(evening_att = 1,1,0)) as evening_late_count,
yearinfo,
monthinfo,
dayinfo
from itcast_dwm.student_attendance_dwm
group by dateinfo,yearinfo,monthinfo,dayinfo,class_id;
当然可以使用count()统计
insert into table itcast_dwm.class_attendancce_dwm partition(yearinfo, monthinfo, dayinfo)
select
dateinfo,
class_id,
count(
case
when morning_att in ('0','1') then student_id
else null end
) as morning_att_count,
count(
case
when afternoon_att in ('0','1') then student_id
else null end
) as afternoon_att_count,
count(
case
when evening_att in ('0','1') then student_id
else null end
) as evening_att_count,
count(
case
when morning_att = '1' then student_id
else null end
) as morning_late_count,
count(
case
when afternoon_att = '1' then student_id
else null end
) as afternoon_late_count,
count(
case
when evening_att = '1' then student_id
else null end
) as evening_late_count,
yearinfo,
monthinfo,
dayinfo
from itcast_dwm.student_attendance_dwm
group by dateinfo,yearinfo,monthinfo,dayinfo,class_id;
-- count对null是不统计的
# 班级请假数据表
原始数据中上午没有请假的数据,这里我们模拟出一条请假数据
INSERT INTO itcast_ods.student_leave_apply_ods partition (dt)
values (125, 5032, 119142, 1, 3491, '2019-09-08 16:42:29', '狂犬疫苗最后一针', 1, 2, '2019-09-03 08:00:00', 2, '2019-09-03 23:30:00', 2, 1, 0, null, null, '医院打针,狂犬疫苗最后一针了', 1, '2019-09-02 08:56:54','2020-07-07');
select
ctudd.class_date as dateinfo,
ctudd.class_id,
count(distinct, slao.student_id) as morning_leave_count
from (select * from itcast_ods.student_leave_apply_ods where audit_state = 1 and cancel_state = 0 and valid_state = 1) slao
join itcast_dimen.class_time_dimen ctd on slao.class_id = ctd.class_id
join (select * from itcast_dimen.course_table_upload_detail_dimen where content != '开班典礼' and nvl(content, '') != '') ctudd on ctudd.class_id = slao.class_id
where ctudd.class_date between ctd.use_begin_date and ctd.use_end_date
and slao.begin_time <= concat(ctudd.class_date,' ',ctd.morning_begin_time) and slao.end_time >= concat(ctudd.class_date,' ',ctd.morning_begin_time)
group by ctudd.class_date, ctudd.class_id;
-- ctd.use_begin_date and ctd.use_end_date 都是年-月-天格式
select
ctudd.class_date as dateinfo,
ctudd.class_id,
count(distinct, slao.student_id) as afternoon_leave_count
from (select * from itcast_ods.student_leave_apply_ods where audit_state = 1 and cancel_state = 0 and valid_state = 1) slao
join itcast_dimen.class_time_dimen ctd on slao.class_id = ctd.class_id
join (select * from itcast_dimen.course_table_upload_detail_dimen where content != '开班典礼' and nvl(content, '') != '') ctudd on ctudd.class_id = slao.class_id
where ctudd.class_date between ctd.use_begin_date and ctd.use_end_date
and slao.begin_time <= concat(ctudd.class_date,' ',ctd.afternoon_begin_time) and slao.end_time >= concat(ctudd.class_date,' ',ctd.afternoon_begin_time)
group by ctudd.class_date, ctudd.class_id;
select
ctudd.class_date as dateinfo,
ctudd.class_id,
count(distinct, slao.student_id) as evening_leave_count
from (select * from itcast_ods.student_leave_apply_ods where audit_state = 1 and cancel_state = 0 and valid_state = 1) slao
join itcast_dimen.class_time_dimen ctd on slao.class_id = ctd.class_id
join (select * from itcast_dimen.course_table_upload_detail_dimen where content != '开班典礼' and nvl(content, '') != '') ctudd on ctudd.class_id = slao.class_id
where ctudd.class_date between ctd.use_begin_date and ctd.use_end_date
and slao.begin_time <= concat(ctudd.class_date,' ',ctd.evening_begin_time) and slao.end_time >= concat(ctudd.class_date,' ',ctd.evening_begin_time)
group by ctudd.class_date, ctudd.class_id;
# 上午下午晚自习统计结果:当然我们这里只有一条数据,在实际情况下肯定有多条:
最终结果:
2019-09-03 5032 1 1 1 2019 09 03
2019-09-03 1205 8 0 0 2019 09 03
2019-09-03 5412 0 3 2 2019 09 03
2019-09-04 7014 5 0 3 2019 09 04
2019-09-05 7014 0 3 0 2019 09 05
思考:如何得到最终结果?满外连接full join。注意当我们发现我们的结果既要左边的数据,又要右边的数据时,一定使用full join
与最终结果还有一定的差距:即存在(class_date,class_id)相同的数据。需要将其合并为一条数据。
此时与最终表一致了
最终sql:
insert into table itcast_dwm.class_leave_dwm partition(yearinfo,monthinfo,dayinfo)
select
temp.class_date,
temp.class_id,
sum(temp.morning_leave_count) as morning_leave_count,
sum(temp.afternoon_leave_count) as afternoon_leave_count,
sum(temp.evening_leave_count) as evening_leave_count,
substr(temp.class_date,1,4) as yearinfo,
substr(temp.class_date,6,2) as monthinfo,
substr(temp.class_date,9,2) as dayinfo
from (select
coalesce(a.class_date,b.class_date,c.class_date) as class_date,
coalesce(a.class_id,b.class_id,c.class_id) as class_id,
nvl(a.morning_leave_count,0) as morning_leave_count,
nvl(b.afternoon_leave_count,0) as afternoon_leave_count,
nvl(c.evening_leave_count,0) as evening_leave_count
from (select
ctudd.class_date as dateinfo,
ctudd.class_id,
count(distinct, slao.student_id) as morning_leave_count
from (select * from itcast_ods.student_leave_apply_ods where audit_state = 1 and cancel_state = 0 and valid_state = 1) slao
join itcast_dimen.class_time_dimen ctd on slao.class_id = ctd.class_id
join (select * from itcast_dimen.course_table_upload_detail_dimen where content != '开班典礼' and nvl(content, '') != '') ctudd on ctudd.class_id = slao.class_id
where ctudd.class_date between ctd.use_begin_date and ctd.use_end_date
and slao.begin_time <= concat(ctudd.class_date,' ',ctd.morning_begin_time) and slao.end_time >= concat(ctudd.class_date,' ',ctd.morning_begin_time)
group by ctudd.class_date, ctudd.class_id) a
full join (select
ctudd.class_date as dateinfo,
ctudd.class_id,
count(distinct, slao.student_id) as afternoon_leave_count
from (select * from itcast_ods.student_leave_apply_ods where audit_state = 1 and cancel_state = 0 and valid_state = 1) slao
join itcast_dimen.class_time_dimen ctd on slao.class_id = ctd.class_id
join (select * from itcast_dimen.course_table_upload_detail_dimen where content != '开班典礼' and nvl(content, '') != '') ctudd on ctudd.class_id = slao.class_id
where ctudd.class_date between ctd.use_begin_date and ctd.use_end_date
and slao.begin_time <= concat(ctudd.class_date,' ',ctd.afternoon_begin_time) and slao.end_time >= concat(ctudd.class_date,' ',ctd.afternoon_begin_time)
group by ctudd.class_date, ctudd.class_id) b on a.class_date = b.class_date and a.class_id = b.class_id
full join (select
ctudd.class_date as dateinfo,
ctudd.class_id,
count(distinct, slao.student_id) as evening_leave_count
from (select * from itcast_ods.student_leave_apply_ods where audit_state = 1 and cancel_state = 0 and valid_state = 1) slao
join itcast_dimen.class_time_dimen ctd on slao.class_id = ctd.class_id
join (select * from itcast_dimen.course_table_upload_detail_dimen where content != '开班典礼' and nvl(content, '') != '') ctudd on ctudd.class_id = slao.class_id
where ctudd.class_date between ctd.use_begin_date and ctd.use_end_date
and slao.begin_time <= concat(ctudd.class_date,' ',ctd.evening_begin_time) and slao.end_time >= concat(ctudd.class_date,' ',ctd.evening_begin_time)
group by ctudd.class_date, ctudd.class_id) c on a.class_date = c.class_date and a.class_id = c.class_id) temp
group by temp.class_date,temp.class_id;
可以利用with as语句提前将上述三个子查询起别名:
语法:with 别名 as (select ....from.....)
详细语法请看:sql语句中with as 的用法_sql as with_jia718的博客-CSDN博客
注意运行过程中可能会出现资源不足的问题,可以考虑将上述设置的数据倾斜等等优化策略关闭。
# 班级旷课数据表
基于一下三张表计算每天各个班级上午下午晚自习的旷课人数
itcast_dwm.class_leave_dwm(班级请假数据表)可以得到请假人数
itcast_dwm.class_attendancce_dwm(班级出勤数据表)可以得到出勤人数
itcast_dimen.class_studying_student_count_dimen(班级每日在读学员表)可以得到每日各个班级的在读学生数。
旷课人数=每日在读学生数-出勤人数-请假人数
思考:旷课数据的统计一定是建立在当天有课的情况下才统计。所以需要还需要关联ctudd课表。
insert into table itcast_dwm.class_truant_dwm partition(yearinfo,monthinfo,dayinfo)
select
ctudd.class_date as dateinfo,
ctudd.class_id,
cssc.studying_student_count - nvl(cad.morning_att_count,0) - nvl(cld.morning_leave_count,0) as morning_truant_count,
cssc.studying_student_count - nvl(cad.afternoon_att_count,0) - nvl(cld.afternoon_leave_count,0) as afternoon_truant_count,
cssc.studying_student_count - nvl(cad.evening_att_count,0) - nvl(cld.evening_leave_count,0) as evening_truant_count,
substr(ctudd.class_date,1,4) as yearinfo,
substr(ctudd.class_date,6,2) as monthinfo,
substr(ctudd.class_date,9,2) as dayinfo
from (select * from itcast_dimen.course_table_upload_detail_dimen where content != '开班典礼' and nvl(content, '') != '') ctudd
left join itcast_dimen.class_studying_student_count_dimen cssc on ctudd.class_date = cssc.studying_date and ctudd.class_id = cssc.class_id
left join itcast_dwm.class_attendance_dwm cad on ctudd.class_id = cad.class_id and ctudd.class_date = cad.dateinfo
left join itcast_dwm.class_leave_dwm cld on ctudd.class_id = cld.class_id and ctudd.class_date = cld.dateinfo
where ctudd.class_date in ('2019-09-03','2019-09-04','2019-09-05'); -- 只有这三天数据是源数据中有意义的,实际生产环境中不需加where
# 汇总表:
insert into table itcast_dwm.class_all_dwn partition(yearinfo,monthinfo,dayinfo)
select
ctudd.class_date as dateinfo,
ctudd.class_id,
cssc.studying_student_count,
cad.morning_att_count,
concat(round(nvl(cad.morning_att_count,0) / cssc.studying_student_count * 100,2),'%') as morning_att_ratio,
cad.afternoon_att_count,
concat(round(nvl(cad.afternoon_att_count,0) / cssc.studying_student_count * 100,2),'%') as afternoon_att_ratio,
cad.evening_att_count,
concat(round(nvl(cad.evening_att_count,0) / cssc.studying_student_count * 100,2),'%') as evening_att_ratio,
cad.morning_late_count,
concat(round(nvl(cad.morning_late_count,0) / cssc.studying_student_count * 100,2),'%') as morning_late_ratio,
cad.afternoon_late_count,
concat(round(nvl(cad.afternoon_late_count,0) / cssc.studying_student_count * 100,2),'%') as afternoon_late_ratio,
cad.evening_late_count,
concat(round(nvl(cad.evening_late_count,0) / cssc.studying_student_count * 100,2),'%') as evening_late_ratio,
cld.morning_leave_count,
concat(round(nvl(cld.morning_leave_count,0) / cssc.studying_student_count * 100,2),'%') as morning_leave_ratio,
cld.afternoon_leave_count,
concat(round(nvl(cld.afternoon_leave_count,0) / cssc.studying_student_count * 100,2),'%') as afternoon_leave_ratio,
cld.evening_leave_count,
concat(round(nvl(cld.evening_leave_count,0) / cssc.studying_student_count * 100,2),'%') as evening_leave_ratio,
ctd.morning_truant_count,
concat(round(nvl(ctd.morning_truant_count,0) / cssc.studying_student_count * 100,2),'%') as morning_truant_ratio,
ctd.afternoon_truant_count,
concat(round(nvl(ctd.afternoon_truant_count,0) / cssc.studying_student_count * 100,2),'%') as afternoon_truant_ratio,
ctd.evening_truant_count,
concat(round(nvl(ctd.evening_truant_count,0) / cssc.studying_student_count * 100,2),'%') as evening_truant_ratio,
substr(ctudd.class_date,1,4) as yearinfo,
substr(ctudd.class_date,6,2) as monthinfo,
substr(ctudd.class_date,9,2) as dayinfo
from (select * from itcast_dimen.course_table_upload_detail_dimen where content != '开班典礼' and nvl(content, '') != '') ctudd
left join itcast_dimen.class_studying_student_count_dimen cssc on ctudd.class_date = cssc.studying_date and ctudd.class_id = cssc.class_id
left join itcast_dwm.class_attendance_dwm cad on ctudd.class_id = cad.class_id and ctudd.class_date = cad.dateinfo
left join itcast_dwm.class_leave_dwm cld on ctudd.class_id = cld.class_id and ctudd.class_date = cld.dateinfo
left join itcast_dwm.class_truant_dwm ctd on ctudd.class_id = ctd.class_id and ctudd.class_date = ctd.dateinfo
where ctudd.class_date in ('2019-09-03','2019-09-04','2019-09-05'); -- 只有这三天数据是源数据中有意义的,实际生产环境中不需加where
-- 注意我们这里将比率都加了%但是在实际生产环境中一般不加%因为加了百分号后续就如法计算了。
七:数据分析
基于DWM层的按天提前聚合的汇总表,统计每月,每年的数据
需求一已经统计完成,只需要将结果直接导入到dws层对应表即可
insert into table itcast_dws.class_attendance_dws partition(yearinfo,monthinfo,dayinfo)
select
dateinfo,
class_id,
studying_student_count,
morning_att_count,
morning_att_ratio,
afternoon_att_count,
afternoon_att_ratio,
evening_att_count,
evening_att_ratio,
morning_late_count,
morning_late_ratio,
afternoon_late_count,
afternoon_late_ratio,
evening_late_count,
evening_late_ratio,
morning_leave_count,
morning_leave_ratio,
afternoon_leave_count,
afternoon_leave_ratio,
evening_leave_count,
evening_leave_ratio,
morning_truant_count,
morning_truant_ratio,
afternoon_truant_count,
afternoon_truant_ratio,
evening_truant_count,
evening_truant_ratio,
'2' as time_type
yearinfo,
monthinfo,
dayinfo
from itcast_dwm.class_all_dwn;
需求二:
insert into table itcast_dws.class_attendance_dws partition(yearinfo,monthinfo,dayinfo)
select
concat(yearinfo,'-',monthinfo) as dateinfo,
class_id,
sum(studying_student_count) as studyinf_student_count,
sum(nvl(morning_att_count,0)) as morning_att_count,
concat(round(sum(nvl(morning_att_count,0)) / sum(studying_student_count) * 100,2),'%') as morning_att_ratio,
sum(nvl(afternoon_att_count,0)) as afternoon_att_count,
concat(round(sum(nvl(afternoon_att_count,0)) / sum(studying_student_count) * 100,2),'%') as afternoon_att_ratio,
sum(nvl(evening_att_count,0)) as evening_att_count,
concat(round(sum(nvl(evening_att_count,0)) / sum(studying_student_count) * 100,2),'%') as evening_att_ratio,
sum(nvl(morning_later_count,0)) as morning_later_count,
concat(round(sum(nvl(morning_later_count,0)) / sum(studying_student_count) * 100,2),'%') as morning_later_ratio,
sum(nvl(afternoon_later_count,0)) as afternoon_later_count,
concat(round(sum(nvl(afternoon_later_count,0)) / sum(studying_student_count) * 100,2),'%') as afternoon_later_ratio,
sum(nvl(evening_later_count,0)) as evening_later_count,
concat(round(sum(nvl(evening_later_count,0)) / sum(studying_student_count) * 100,2),'%') as evening_later_ratio,
sum(nvl(morning_leave_count,0)) as morning_leave_count,
concat(round(sum(nvl(morning_leave_count,0)) / sum(studying_student_count) * 100,2),'%') as morning_leave_ratio,
sum(nvl(afternoon_leave_count,0)) as afternoon_leave_count,
concat(round(sum(nvl(afternoon_leave_count,0)) / sum(studying_student_count) * 100,2),'%') as afternoon_leave_ratio,
sum(nvl(evening_leave_count,0)) as evening_leave_count,
concat(round(sum(nvl(evening_leave_count,0)) / sum(studying_student_count) * 100,2),'%') as evening_leave_ratio,
sum(nvl(morning_truant_count,0)) as morning_truant_count,
concat(round(sum(nvl(morning_truant_count,0)) / sum(studying_student_count) * 100,2),'%') as morning_truant_ratio,
sum(nvl(afternoon_truant_count,0)) as afternoon_truant_count,
concat(round(sum(nvl(afternoon_truant_count,0)) / sum(studying_student_count) * 100,2),'%') as afternoon_truant_ratio,
sum(nvl(evening_truant_count,0)) as evening_truant_count,
concat(round(sum(nvl(evening_truant_count,0)) / sum(studying_student_count) * 100,2),'%') as evening_truant_ratio,
'4' as time_type
yearinfo,
monthinfo,
'-1' as dayinfo
from itcast_dwm.class_all_dwn
group by yearinfo,monthinfo,class_id;
需求三:
insert into table itcast_dws.class_attendance_dws partition(yearinfo,monthinfo,dayinfo)
select
yearinfo as dateinfo,
class_id,
sum(studying_student_count) as studyinf_student_count,
sum(nvl(morning_att_count,0)) as morning_att_count,
concat(round(sum(nvl(morning_att_count,0)) / sum(studying_student_count) * 100,2),'%') as morning_att_ratio,
sum(nvl(afternoon_att_count,0)) as afternoon_att_count,
concat(round(sum(nvl(afternoon_att_count,0)) / sum(studying_student_count) * 100,2),'%') as afternoon_att_ratio,
sum(nvl(evening_att_count,0)) as evening_att_count,
concat(round(sum(nvl(evening_att_count,0)) / sum(studying_student_count) * 100,2),'%') as evening_att_ratio,
sum(nvl(morning_later_count,0)) as morning_later_count,
concat(round(sum(nvl(morning_later_count,0)) / sum(studying_student_count) * 100,2),'%') as morning_later_ratio,
sum(nvl(afternoon_later_count,0)) as afternoon_later_count,
concat(round(sum(nvl(afternoon_later_count,0)) / sum(studying_student_count) * 100,2),'%') as afternoon_later_ratio,
sum(nvl(evening_later_count,0)) as evening_later_count,
concat(round(sum(nvl(evening_later_count,0)) / sum(studying_student_count) * 100,2),'%') as evening_later_ratio,
sum(nvl(morning_leave_count,0)) as morning_leave_count,
concat(round(sum(nvl(morning_leave_count,0)) / sum(studying_student_count) * 100,2),'%') as morning_leave_ratio,
sum(nvl(afternoon_leave_count,0)) as afternoon_leave_count,
concat(round(sum(nvl(afternoon_leave_count,0)) / sum(studying_student_count) * 100,2),'%') as afternoon_leave_ratio,
sum(nvl(evening_leave_count,0)) as evening_leave_count,
concat(round(sum(nvl(evening_leave_count,0)) / sum(studying_student_count) * 100,2),'%') as evening_leave_ratio,
sum(nvl(morning_truant_count,0)) as morning_truant_count,
concat(round(sum(nvl(morning_truant_count,0)) / sum(studying_student_count) * 100,2),'%') as morning_truant_ratio,
sum(nvl(afternoon_truant_count,0)) as afternoon_truant_count,
concat(round(sum(nvl(afternoon_truant_count,0)) / sum(studying_student_count) * 100,2),'%') as afternoon_truant_ratio,
sum(nvl(evening_truant_count,0)) as evening_truant_count,
concat(round(sum(nvl(evening_truant_count,0)) / sum(studying_student_count) * 100,2),'%') as evening_truant_ratio,
'5' as time_type
yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwm.class_all_dwn
group by yearinfo,class_id;
八:数据导出
CREATE TABLE IF NOT EXISTS scrm_id.class_attendance_dws (
dateinfo varchar(100) comment '日期',
class_id int comment '班级id',
studying_student_count int comment '在读班级人数',
morning_att_count varchar(100) comment '上午出勤人数',
morning_att_ratio varchar(100) comment '上午出勤率',
afternoon_att_count varchar(100) comment '下午出勤人数',
afternoon_att_ratio varchar(100) comment '下午出勤率',
evening_att_count varchar(100) comment '晚自习出勤人数',
evening_att_ratio varchar(100) comment '晚自习出勤率',
morning_late_count varchar(100) comment '上午迟到人数',
morning_late_ratio varchar(100) comment '上午迟到率',
afternoon_late_count varchar(100) comment '下午迟到人数',
afternoon_late_ratio varchar(100) comment '下午迟到率',
evening_late_count varchar(100) comment '晚自习迟到人数',
evening_late_ratio varchar(100) comment '晚自习迟到率',
morning_leave_count varchar(100) comment '上午请假人数',
morning_leave_ratio varchar(100) comment '上午请假率',
afternoon_leave_count varchar(100) comment '下午请假人数',
afternoon_leave_ratio varchar(100) comment '下午请假率',
evening_leave_count varchar(100) comment '晚自习请假人数',
evening_leave_ratio varchar(100) comment '晚自习请假率',
morning_truant_count varchar(100) comment '上午旷课人数',
morning_truant_ratio varchar(100) comment '上午旷课率',
afternoon_truant_count varchar(100) comment '下午旷课人数',
afternoon_truant_ratio varchar(100) comment '下午旷课率',
evening_truant_count varchar(100) comment '晚自习旷课人数',
evening_truant_ratio varchar(100) comment '晚自习旷课率',
time_type varchar(100) COMMENT '聚合时间类型:1、按小时聚合;2、按天聚合;3、按季度聚合;4、按月聚合;5、按年聚合。',
yearinfo varchar(100) comment '年',
monthinfo varchar(100) comment '月',
dayinfo varchar(100) comment '天'
)
comment '班级数据统计表';
sqoop export \
--connect "jdbc:mysql://hadoop01:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
--username root \
--password 123456 \
--table class_attendance_dws \
--hcatalog-database itcast_dws \
--hcatalog-table class_attendance_dws \
-m 1