项目实战-知行教育大数据分析平台-06

目录

学生出勤主题看板

一:需求分析

二:业务数据准备

三:建模分析

四:建模操作

五:数据抽取

六:数据清洗转换,维度退化,提前聚合

七:数据分析

八:数据导出


学生出勤主题看板

一:需求分析

需求一:

        维度:时间维度:年,月,天,上午,下午,晚自习

                   班级维度

                   学生维度

        指标:出勤人数(正常出勤+迟到出勤)

        计算公式:打卡时间在上课前40分钟~上课时间点之内(假设上课时间是9-12点,打卡时间是8.40-12点都算),and 没有早退(离开时间为12点之后)(通过下面的需求得知不需要考虑早退问题,但要考虑是否开启共享屏幕问题(即按时签到+开启共享屏幕学习,这才算出勤))

        表:course_table_upload_datail表,tbh_student_signin_record表,tbh_class_time_table表

        关联条件:?

        字段:打卡时间,离开时间,上课时间,下课时间(属于表?).....?

需求二:

        维度:时间维度:年,月,天,上午,下午,晚自习

                   班级维度

                   学生维度

        指标:出勤人数(需求一已统计)

                   当日在读学生数

        针对比率的指标,我们最后在分析,因为可能后续的需求中会对其中的分子或分母进行计算。如果后续计算了,那么这里就不需要计算了。

        表:course_table_upload_detail表,tbh_student_signin_record表,tbh_class_time_table表,class_studying_student_count表

        关联条件:?

        字段:?

需求三:

        维度:时间维度:年,月,天,上午,下午,晚自习

                   班级维度

                   学生维度

        指标:迟到人数

        计算公式:打卡时间在上课10分钟以后到下课之前(假设上课时间是9-12点,打卡时间是9.10以后12点以前都算迟到)。通过这个可以看出打卡时间在上课前40分钟到上课后10分钟且离开时间在下课之后是正常出勤,打卡时间是在上课后10分钟到下课且离开时间在下课之后算迟到出勤。(通过下面的需求得知不需要考虑早退问题,但要考虑是否开启共享屏幕问题(即打卡迟到了+开启共享屏幕学习,这才算是迟到)

        表:course_table_upload_detail表,tbh_student_signin_record表,tbh_class_time_table表

        字段:?

需求四:

select dt.every_date,
       ctud.class_id,
       tssr.student_id,
       if(
           #上午正常打卡为0,迟到10分钟以上为1,其他(请假+旷课)为2
           sum(
               case
               #上午打卡时间是否在上课前40分钟~下课时间段之内
                   when time(tssr.signin_time) between TIMESTAMPADD(minute, -40, tctt.morning_begin_time) and tctt.morning_end_time
                       then 1   #上午来了
                   else 0 end   #上午没来
            ) > 0,  #打卡多次,只要有一次正常打卡,就会>0,返回true;否则没来,返回false
           if(
               sum(
                   case
               #上午打卡时间是否在上课前40分钟~上课后10分钟之内
                      when time(tssr.signin_time) between TIMESTAMPADD(minute, -40, tctt.morning_begin_time) and TIMESTAMPADD(minute, 10, tctt.morning_begin_time)
                          then 1    #正常出勤
                      else 0 end    #迟到
                ) > 0,              #有一次打卡是正常出勤,就会>0,返回true;否则迟到,返回false
               0,   #正常出勤
               1    #迟到
            ),
           2    #上午没来
        ) as morning_signin,
       if(
           #下午正常打卡为0,迟到10分钟以上为1,其他(请假+旷课)为2
                   sum(case
                           when time(tssr.signin_time) between TIMESTAMPADD(minute, -40, tctt.afternoon_begin_time) and tctt.afternoon_end_time
                               then 1
                           else 0 end) > 0,
                   if(sum(case
                              when time(tssr.signin_time) between TIMESTAMPADD(minute, -40, tctt.afternoon_begin_time) and TIMESTAMPADD(minute, 10, tctt.afternoon_begin_time)
                                  then 1
                              else 0 end) > 0, 0, 1), 2) as afternoon_signin,
       if(
           #晚自习正常打卡为0,迟到10分钟以上为1,其他(请假+旷课)为2
                   sum(case
                           when time(tssr.signin_time) between TIMESTAMPADD(minute, -20, tctt.evening_begin_time) and tctt.evening_end_time
                               then 1
                           else 0 end) > 0,
                   if(sum(case
                              when time(tssr.signin_time) between TIMESTAMPADD(minute, -20, tctt.evening_begin_time) and TIMESTAMPADD(minute, 10, tctt.evening_begin_time)
                                  then 1
                              else 0 end) > 0, 0, 1), 2) as evening_signin
from (
         #获取今天之前一周内的日期
         select datelist as every_date from calendar where datelist between '2019-09-01' and '2019-09-30'
     ) dt
         #日期课表不为空且不是开班典礼
         left join course_table_upload_detail ctud
                   on ctud.class_date = dt.every_date and ifnull(ctud.content, '') != '' and
                      ctud.content != '开班典礼'
    #学生打卡记录日期和班级匹配,且开启共屏进入学习
         left join tbh_student_signin_record tssr
                   on tssr.class_id = ctud.class_id and tssr.signin_date = dt.every_date and
                      tssr.share_state = 1
    #获取班级作息时间以判断是否按时出勤
         left join tbh_class_time_table tctt on tctt.id = tssr.time_table_id
     #按照日期、班级、学生分组统计
group by dt.every_date, ctud.class_id, tssr.student_id;

        维度:时间维度:年,月,天,(上午,下午,晚自习)

                   班级维度

                   学生维度

        指标:迟到人数(需求三已统计)

                   当日在读学员数

        表:course_table_upload_detail日期课程明细表(课程维度表),tbh_student_signin_record学生打卡记录(事实表)表,tbh_class_time_table班级作息时间表(班级作息维度表)

        关联条件:course_table_upload_detail.class_id = tbh_student_signin_record.class_id

                           tbh_class_time_table.id = tbh_student_signin_record.time_table_id

        字段:时间维度:course_table_upload_detail.class_date

                   班级维度:course_table_upload_detail.class_id

                   学生维度:tbh_student_signin_record.student_id

                   指标字段:tbh_student_signin_record.signin_time   # 打卡时间

                   指标判断字段:tbh_class_time_table.morning_begin_time    #  上午上课时间

                                            tbh_class_time_table.morning_end_time    #  上午下课时间

                                            tbh_class_time_table.afternoon_begin_time

                                            tbh_class_time_table.afternoon_end_time

                                            tbh_class_time_table.evening_begin_time

                                            tbh_class_time_table.evening_end_time

                   指标计算方法:先判断,上课时间-40 <= signin_time<= 下课时间

                                            情形1:满足上述条件,出勤了,再次判断上课时间-40 <= signin_time<= 上课时间+10,如果满足则是正常出勤返回0,否则为迟到返回1。

                                            情形2:不满足,缺勤了(旷课+请假),返回2。

        注意:ifnull()是mysql中的函数与nvl()作用一样

        是否需要清洗:course_table_upload_detail.content != '开班典礼' and != ' ' and != null   # 目前是统计学生的出勤情况content为null和空说明没上课休息,因此不需要统计。开班典礼的时候也不需要统计出勤,因为还没有正式上课。

        tbh_student_signin_record.share_state = 1   # share_state=1说明学生目前跟随着老师开启共享屏幕正在学习,即在统计出勤人数的时候,不仅仅需要在合理的时间范围内签到,必须开启共享屏幕,如果来共享屏幕都没开,说明即使是签到了但没有好好学习,这不算出勤(一般这个字段的值都是1,因为是否开启共享屏幕是由老师控制的,但不乏有那么几个特殊份子)。

        是否需要转换:course_table_upload_detail.class_date可能需要转换为yearinfo,monthinfo,dayinfo。需要对数据调研。

需求五:

        维度:时间维度:年,月,天,上午,下午,晚自习

                   班级维度

        指标:请假人数

        表:student_leave_apply学生请假申请表(事实表),tbh_class_time_table,course_table_upload_detail

        关联条件:student_leave_apply.class_id = tbh_class_time_table.class_id

                          student_leave_apply.class_id = course_table_upload_detail.class_id

        字段:时间维度:course_table_upload_detail.class_date

                   班级维度:course_table_upload_detail.class_id

                   指标字段:student_leave_apply.student_id

        需要清洗的字段:course_table_upload_detail.content

                course_table_upload_detail.class_date between tbh_class_time_table.use_begin_date  and tbh_class_time_table.use_end_date

                # 保证请假操作有效(即确定该学生是否有请假)

                student_leave_apply.audit_state = 1

                student_leave_apply.cancel_state = 0

                student_leave_apply.valid_state = 1

                # 判断该学生请假的时候是否合理(上课时间是9-11点,学生A请假时间为7-8点该学生没来上课,属于旷课)

                真实情况下只有学生的请假开始时间<=上课时间 and 学生的请假结束时间>=下课时间,才算学生请假成功,由于一般请假我们都是请一上午或一下午,因此在判断的时候:请假开始时间<=上课开始时间 and 学生的请假结束时间>=上课开始时间,即:(这里只是以上午为例)

                 student_leave_apply.begin_time <= concat(course_table_upload_detail.class_date,' ',tbh_class_time_table.morning_begin_time)

                student_leave_apply.end_time >= concat(course_table_upload_detail.class_date,' ',tbh_class_time_table.morning_begin_time)

        需要转换的字段:course_table_upload_detail.class_date转化为yearinfo,monthinfo,dayinfo

需求六:

        维度:时间维度:年,月,天,上午,下午,晚自习

                   班级维度

        指标:请假人数(需求五已统计)

                   当日再度学员人数

需求七:

        维度:时间维度:年,月,天,上午,下午,晚自习

                   班级维度

        指标:旷课人数=当日再读学员人数-出勤人数(需求一已统计)-请假人数(需求五已统计)

需求八:

        维度:时间维度:年,月,天,上午,下午,晚自习

                   班级维度

        指标:旷课人数/当日在读学员人数

        表:class_studying_student_count班级在读学生人数表(维度表)

        通过查看数据发现在class_studying_student_count表中已经直接给出了当日在读学员人数

        关联条件:class_studying_student_count.class_id = course_table_upload_detail.class_id

        需求总结:

        注意通过上述分析得知有两个事实表,一个事实表用于统计出勤和迟到,一个事实表用于统计请假,此时在需求总结的时候要分类总结

        第一类:统计出勤和迟到人数

        涉及维度:时间维度:年,月,天,上午,下午,晚自习

                          班级维度

                          学生维度

        涉及指标:出勤人数

                          迟到人数

        表:course_table_upload_detail(维度表),tbh_student_signin_record(事实表),tbh_class_time_table(维度表),class_studying_student_count(维度表)

        关联条件:course_table_upload_detail.class_id = tbh_student_signin_record.class_id

                           tbh_class_time_table.id = tbh_student_signin_record.time_table_id

                           class_studying_student_count.class_id = course_table_upload_detail.class_id

        字段:时间维度:course_table_upload_detail.class_date

                   班级维度:course_table_upload_detail.class_id

                   学生维度:tbh_student_signin_record.student_id

                   指标字段:tbh_student_signin_record.signin_time   # 打卡时间

                   指标判断字段:tbh_class_time_table.morning_begin_time    #  上午上课时间

                                            tbh_class_time_table.morning_end_time    #  上午下课时间

                                            tbh_class_time_table.afternoon_begin_time

                                            tbh_class_time_table.afternoon_end_time

                                            tbh_class_time_table.evening_begin_time

                                            tbh_class_time_table.evening_end_time

                   指标计算方法:先判断,上课时间-40 <= signin_time<= 下课时间

                                            情形1:满足上述条件,出勤了,再次判断上课时间-40 <= signin_time<= 上课时间+10,如果满足则是正常出勤返回0,否则为迟到返回1。

                                            情形2:不满足,缺勤了(旷课+请假),返回2。

        是否需要清洗:course_table_upload_detail.content != '开班典礼' and != ' ' and != null

                                  tbh_student_signin_record.share_state = 1

        是否需要转换:course_table_upload_detail.class_date转换为yearinfo,monthinfo,dayinfo。

        第二类:统计请假人数

        涉及维度:时间维度:年,月,天,上午,下午,晚自习

                          班级维度

        涉及指标:请假人数

        表:student_leave_apply(事实表),tbh_class_time_table(维度表),course_table_upload_detail(维度表),class_studying_student_count(维度表)

        关联条件:student_leave_apply.class_id = tbh_class_time_table.class_id

                          student_leave_apply.class_id = course_table_upload_detail.class_id

                          class_studying_student_count.class_id = course_table_upload_detail.class_id

        字段:时间维度:course_table_upload_detail.class_date

                   班级维度:course_table_upload_detail.class_id

                   指标字段:student_leave_apply.student_id

        需要清洗的字段:course_table_upload_detail.content != ' ' and != null and != '开班典礼'

                course_table_upload_detail.class_date between tbh_class_time_table.use_begin_date  and tbh_class_time_table.use_end_date

                # 保证请假操作有效(即确定该学生是否有请假)

                student_leave_apply.audit_state = 1

                student_leave_apply.cancel_state = 0

                student_leave_apply.valid_state = 1

                 student_leave_apply.begin_time <= concat(course_table_upload_detail.class_date,' ',tbh_class_time_table.morning_begin_time)

                student_leave_apply.end_time >= concat(course_table_upload_detail.class_date,' ',tbh_class_time_table.morning_begin_time)

        需要转换的字段:course_table_upload_detail.class_date转化为yearinfo,monthinfo,dayinfo。

        第三类:统计旷课人数

        涉及维度:时间维度:年,月,天,上午,下午,晚自习

                          班级维度

        涉及指标: 旷课人数

        指标计算方法:旷课人数 = 当日在读学员数 - 出勤人数 - 请假人数

二:业务数据准备

        create database teach default character set utf8mb4 collate utf8mb4_unicode_ci;

        在teach库中运行:

三:建模分析

        ODS层:存放两张事实表分别是:tbh_student_signin_record和student_leave_apply,添加分区抽取时间start_time。(因该还要添加end_time字段)

        DIM层:存放三张维度表分别是:tbh_class_time_table,course_table_upload_detail和class_studying_student_count,需要添加分区start_time。

        DWD层:对ODS层表进行清洗转换工作,以及少量维度退化(这里不需要)

        按理来说有两张表,分别是对两个事实表清洗转换后的表,考虑到这些清洗工作大多数都涉及到维度表,只有少数清洗工作是在事实表中的,在这一阶段维度表并没有退化到事实表中,因此大量的清洗转换工作在这里无法执行。为了方便起见所有的清洗转换操作都放到DWM层维度退化之后。

        不需要DWD层

        DWM层:维度退化+提前聚合操作

        操作一:将course_table_upload_detail(维度表),tbh_class_time_table(维度表),class_studying_student_count(维度表)的数据退化到tbh_student_signin_record(事实表),并且完成清洗转换工作。

        操作二:是否需要提前聚合?可以,因为跨天数据可以累加,提前按天聚合

        操作三:将tbh_class_time_table(维度表),course_table_upload_detail(维度表),class_studying_student_count(维度表)的数据退化到student_leave_apply(事实表),并完成清洗转换工作。

        操作四:是否需要提前聚合?可以,因为跨天数据可以累加,提前按天聚合

        思考:如何提前聚合呢?先分后合

        建表方案:

        DWS层:统计分析操作

        建表方案:

四:建模操作

        ODS层:

CREATE TABLE IF NOT EXISTS itcast_ods.student_signin_ods (
    id                int,
    normal_class_flag int           comment '是否正课 1 正课 2 自习 3 休息',
    time_table_id     int           comment '作息时间id normal_class_flag=2 关联tbh_school_time_table 或者 normal_class_flag=1 关联 tbh_class_time_table',
    class_id          int           comment '班级id',
    student_id        int           comment '学员id',
    signin_time       String       comment '签到时间',
    signin_date       String          comment '签到日期',
    inner_flag        int           comment '内外网标志  0 外网 1 内网',
    signin_type       int           comment '签到类型 1 心跳打卡 2 老师补卡 3 直播打卡',
    share_state       int            comment '共享屏幕状态 0 否 1是  在上午或下午段有共屏记录,则该段所有记录该字段为1,内网默认为1 外网默认为0   (暂不用)',
    inner_ip          String          comment '内网ip地址',
    create_time       String       comment '创建时间')
comment '学生打卡记录表'
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY','orc.bloom.filter.columns'='time_table_id,class_id,signin_date,share_state');

-- 不需要建立拉链表,因为打卡记录表一般不允许修改,比如今天你手机打了卡,还能修改吗?显然不能,这个表每天只有新增的打卡记录。这个表可以建成外部表无所谓
-- 可以看出该表使用了布隆过滤索引。time_table_id,class_id都需要 = join操作;share_state需要在where条件中筛选出=1的情况;signin_date可以不用使用索引,因为没有等值连接。

CREATE TABLE IF NOT EXISTS itcast_ods.student_leave_apply_ods (
    id              int,
    class_id        int           comment '班级id',
    student_id      int           comment '学员id',
    audit_state     int        comment '审核状态 0 待审核 1 通过 2 不通过',
    audit_person    int           comment '审核人',
    audit_time      String      comment '审核时间',
    audit_remark    String     comment '审核备注',
    leave_type      int           comment '请假类型  1 请假 2 销假 (查询是否请假不用过滤此类型,通过有效状态来判断)',
    leave_reason    int           comment '请假原因  1 事假 2 病假',
    begin_time      String      comment '请假开始时间',
    begin_time_type int           comment '1:上午 2:下午 3:晚自习',
    end_time        String      comment '请假结束时间',
    end_time_type   int           comment '1:上午 2:下午 3:晚自习',
    days            float         comment '请假/已休天数',
    cancel_state    int         comment '撤销状态  0 未撤销 1 已撤销',
    cancel_time     String      comment '撤销时间',
    old_leave_id    int           comment '原请假id,只有leave_type =2 销假的时候才有',
    leave_remark    String     comment '请假/销假说明',
    valid_state     int        comment '是否有效(0:无效 1:有效)',
    create_time     String      comment '创建时间')
comment '学生请假申请表'
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY','orc.bloom.filter.columns'='class_id,audit_state,cancel_state,valid_state');

-- 请假表可以考虑设置为拉链表
-- class_id,audit_state,cancel_state,valid_state都存在等值连接操作可以设置布隆过滤索引,加快查询速度。

        DIM层:

CREATE TABLE IF NOT EXISTS itcast_dimen.class_studying_student_count_dimen (
    id                     int,
    school_id              int  comment '校区id',
    subject_id             int  comment '学科id',
    class_id               int  comment '班级id',
    studying_student_count int  comment '在读班级人数',
    studying_date          STRING comment '在读日期')
comment '在读班级的每天在读学员人数'
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY','orc.bloom.filter.columns'='studying_student_count,studying_date');

CREATE TABLE IF NOT EXISTS itcast_dimen.course_table_upload_detail_dimen
(
    id                  int       comment 'id',
    base_id             int         comment '课程主表id',
    class_id            int         comment '班级id',
    class_date          STRING      comment '上课日期',
    content             STRING     comment '课程内容',
    teacher_id          int         comment '老师id',
    teacher_name        STRING     comment '老师名字',
    job_number          STRING     comment '工号',
    classroom_id        int         comment '教室id',
    classroom_name      STRING     comment '教室名称',
    is_outline          int         comment '是否大纲 0 否 1 是',
    class_mode          int         comment '上课模式 0 传统全天 1 AB上午 2 AB下午 3 线上直播',
    is_stage_exam       int         comment '是否阶段考试(0:否 1:是)',
    is_pay              int         comment '代课费(0:无 1:有)',
    tutor_teacher_id    int         comment '晚自习辅导老师id',
    tutor_teacher_name  STRING     comment '辅导老师姓名',
    tutor_job_number    STRING     comment '晚自习辅导老师工号',
    is_subsidy          int         comment '晚自习补贴(0:无 1:有)',
    answer_teacher_id   int         comment '答疑老师id',
    answer_teacher_name STRING     comment '答疑老师姓名',
    answer_job_number   STRING     comment '答疑老师工号',
    remark              STRING        comment '备注',
    create_time         STRING      comment '创建时间')
comment '班级课表明细表'
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY','orc.bloom.filter.columns'='class_id,class_date');

CREATE TABLE IF NOT EXISTS itcast_dimen.class_time_dimen (
    id                     int,
    class_id              int              comment '班级id',
    morning_template_id   int              comment '上午出勤模板id',
    morning_begin_time    STRING         comment '上午开始时间',
    morning_end_time      STRING         comment '上午结束时间',
    afternoon_template_id int              comment '下午出勤模板id',
    afternoon_begin_time  STRING         comment '下午开始时间',
    afternoon_end_time    STRING         comment '下午结束时间',
    evening_template_id   int              comment '晚上出勤模板id',
    evening_begin_time    STRING         comment '晚上开始时间',
    evening_end_time      STRING         comment '晚上结束时间',
    use_begin_date        STRING         comment '使用开始日期',
    use_end_date          STRING         comment '使用结束日期',
    create_time           STRING       comment '创建时间',
    create_person         int              comment '创建人',
    remark                STRING      comment '备注')
comment '班级作息时间表'
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY','orc.bloom.filter.columns'='id,class_id');

        DWM层:

CREATE TABLE IF NOT EXISTS itcast_dwm.student_attendance_dwm (
    dateinfo        String        comment '日期',
    class_id          int           comment '班级id',
    student_id        int           comment '学员id',
    morning_att       String       comment '上午出勤情况:0.正常出勤、1.迟到、2.其他(请假+旷课)',
    afternoon_att    String        comment '下午出勤情况:0.正常出勤、1.迟到、2.其他(请假+旷课)',
    evening_att       String        comment '晚自习出勤情况:0.正常出勤、1.迟到、2.其他(请假+旷课)')
comment '学生出勤情况表'
PARTITIONED BY (yearinfo STRING, monthinfo STRING, dayinfo STRING)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY');

CREATE TABLE IF NOT EXISTS itcast_dwm.class_attendance_dwm (
    dateinfo        String        comment '日期',
    class_id          int           comment '班级id',
    morning_att_count       String         comment '上午出勤人数',
    afternoon_att_count      String        comment '下午出勤人数',
    evening_att_count       String        comment '晚自习出勤人数',
    morning_late_count       String         comment '上午迟到人数',
    afternoon_late_count      String        comment '下午迟到人数',
    evening_late_count       String        comment '晚自习迟到人数')
comment '学生出勤(出勤和迟到)数据'
PARTITIONED BY (yearinfo STRING, monthinfo STRING, dayinfo STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY');

CREATE TABLE IF NOT EXISTS itcast_dwm.class_leave_dwm (
    dateinfo      String        comment '日期',
    class_id        int           comment '班级id',
    morning_leave_count       String       comment '上午请假人数',
    afternoon_leave_count    String        comment '下午请假人数',
    evening_leave_count       String        comment '晚自习请假人数')
comment '班级请假数据表'
PARTITIONED BY (yearinfo STRING, monthinfo STRING, dayinfo STRING)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY');

CREATE TABLE IF NOT EXISTS itcast_dwm.class_truant_dwm (
    dateinfo      String        comment '日期',
    class_id        int           comment '班级id',
    morning_truant_count       String          comment '上午旷课人数',
    afternoon_truant_count   String        comment '下午旷课人数',
    evening_truant_count       String        comment '晚自习旷课人数')
comment '班级旷课数据表'
PARTITIONED BY (yearinfo STRING, monthinfo STRING, dayinfo STRING)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY');

CREATE TABLE IF NOT EXISTS itcast_dwm.class_all_dwm (
    dateinfo      String        comment '日期',
    class_id        int           comment '班级id',
   studying_student_count int  comment '在读班级人数',
    morning_att_count       String         comment '上午出勤人数',
   morning_att_ratio       String         comment '上午出勤率',
    afternoon_att_count      String        comment '下午出勤人数',
   afternoon_att_ratio      String        comment '下午出勤率',
    evening_att_count       String        comment '晚自习出勤人数',
   evening_att_ratio       String        comment '晚自习出勤率',
   morning_late_count       String        comment '上午迟到人数',
   morning_late_ratio       String        comment '上午迟到率',
    afternoon_late_count     String        comment '下午迟到人数',
   afternoon_late_ratio     String        comment '下午迟到率',
    evening_late_count       String        comment '晚自习迟到人数',
   evening_late_ratio       String        comment '晚自习迟到率',
   morning_leave_count       String       comment '上午请假人数',
   morning_leave_ratio       String       comment '上午请假率',
    afternoon_leave_count    String        comment '下午请假人数',
   afternoon_leave_ratio    String        comment '下午请假率',
    evening_leave_count       String        comment '晚自习请假人数',
   evening_leave_ratio       String        comment '晚自习请假率',
    morning_truant_count       String          comment '上午旷课人数',
   morning_truant_ratio       String          comment '上午旷课率',
    afternoon_truant_count   String        comment '下午旷课人数',
   afternoon_truant_ratio   String        comment '下午旷课率',
    evening_truant_count       String        comment '晚自习旷课人数',
   evening_truant_ratio       String        comment '晚自习旷课率')
comment '班级各项数据按天统计汇总表'
PARTITIONED BY (yearinfo STRING, monthinfo STRING, dayinfo STRING)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY');

        DWS层:

CREATE TABLE IF NOT EXISTS itcast_dws.class_attendance_dws (
    dateinfo      String        comment '日期',
    class_id        int           comment '班级id',
   studying_student_count int  comment '在读班级人数',
    morning_att_count       String         comment '上午出勤人数',
   morning_att_ratio       String         comment '上午出勤率',
    afternoon_att_count      String        comment '下午出勤人数',
   afternoon_att_ratio      String        comment '下午出勤率',
    evening_att_count       String        comment '晚自习出勤人数',
   evening_att_ratio       String        comment '晚自习出勤率',
   morning_late_count       String        comment '上午迟到人数',
   morning_late_ratio       String        comment '上午迟到率',
    afternoon_late_count     String        comment '下午迟到人数',
   afternoon_late_ratio     String        comment '下午迟到率',
    evening_late_count       String        comment '晚自习迟到人数',
   evening_late_ratio       String        comment '晚自习迟到率',
   morning_leave_count       String       comment '上午请假人数',
   morning_leave_ratio       String       comment '上午请假率',
    afternoon_leave_count    String        comment '下午请假人数',
   afternoon_leave_ratio    String        comment '下午请假率',
    evening_leave_count       String        comment '晚自习请假人数',
   evening_leave_ratio       String        comment '晚自习请假率',
    morning_truant_count       String          comment '上午旷课人数',
   morning_truant_ratio       String          comment '上午旷课率',
    afternoon_truant_count   String        comment '下午旷课人数',
   afternoon_truant_ratio   String        comment '下午旷课率',
    evening_truant_count       String        comment '晚自习旷课人数',
   evening_truant_ratio       String        comment '晚自习旷课率',
   time_type            STRING       COMMENT '聚合时间类型:1、按小时聚合;2、按天聚合;3、按季度聚合;4、按月聚合;5、按年聚合。')
comment '班级数据统计表'
PARTITIONED BY (yearinfo STRING, monthinfo STRING, dayinfo STRING)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'
stored as orcfile
TBLPROPERTIES ('orc.compress'='SNAPPY');

五:数据抽取

        ODS层:

sqoop import \
--connect jdbc:mysql://hadoop01:3306/teach \
--username root \
--password 123456 \
--query 'select *, "2021-10-07" as dt from tbh_student_signin_record where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table student_signin_ods \
-m 1

sqoop import \
--connect jdbc:mysql://hadoop01:3306/teach \
--username root \
--password 123456 \
--query 'select *, "2021-10-07" as dt from student_leave_apply where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table student_leave_apply_ods \
-m 1

        DIM层:

sqoop import \
--connect jdbc:mysql://hadoop01:3306/teach \
--username root \
--password 123456 \
--query 'select *, "2021-10-07" as dt from class_studying_student_count where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_dimen  \
--hcatalog-table class_studying_student_count_dimen \
-m 1

sqoop import \
--connect jdbc:mysql://hadoop01:3306/teach \
--username root \
--password 123456 \
--query 'select *, "2021-10-07" as dt from course_table_upload_detail where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_dimen  \
--hcatalog-table course_table_upload_detail_dimen \
-m 1

sqoop import \
--connect jdbc:mysql://hadoop01:3306/teach \
--username root \
--password 123456 \
--query 'select *, "2021-10-07" as dt from tbh_class_time_table where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_dimen  \
--hcatalog-table class_time_dimen \
-m 1

六:数据清洗转换,维度退化,提前聚合

        将ODS层和DIM层的数据经过维度退化,清洗转换,提前聚合操作,并将结果保存至DWM层

        # 学生出勤情况表

        注意:学生打不打卡是建立在今天有课的基础上的,因此from字段课表需要放在最左边,如果将学生打卡记录表放在from字段最左边,学生是可以随意玩耍打卡软件的,显然可能存在不上课的时候也有学生在打卡,因此如果以学生打卡记录表为主可能存在统计结果偏大。而且如果课表说星期1有课,但是由于某些特殊原因所有学生没有打卡,此时以打卡表为主会导致星期1的统计没数据。

-- ctudd.class_date 的格式为 年-月-日
select ctudd.class_date as dateinfo, sso.class_id, sso.student_id,
    xx as morning_att,
    xx as afternoon_att,
    xx as evening_att,
    substr(ctudd.class_date,1,4) as yearinfo,
    substr(ctudd.class_date,6,2) as monthinfo,
    substr(ctudd.class_date,9,2) as dayinfo
from itcast_dimen.course_table_upload_detail_dimen ctudd
    left join itcast_ods.student_signin_ods sso on sso.class_id = ctudd.class_id
    left join itcast_dimen.class_time_dimen ctd on ctd.id = sso.time_table_id 
and ctudd.content != '开班典礼' and ctudd.content != ' ' and ctudd.content != null and sso.share_state = 1
group by ctudd.class_date, sso.class_id, sso.student_id;

        注意:select涉及join操作,针对join操作的常见优化手段就是,能提前过滤就提前过滤,过滤了在join,显然上面的select是join之后在过滤,效率低。

select 
    ctudd.class_date as dateinfo, 
    sso.class_id, 
    sso.student_id,
    xx as morning_att,    --需要使用hive中的函数实现分析
    xx as afternoon_att,    --需要使用hive中的函数实现分析
    xx as evening_att,    --需要使用hive中的函数实现分析
    substr(ctudd.class_date,1,4) as yearinfo,
    substr(ctudd.class_date,6,2) as monthinfo,
    substr(ctudd.class_date,9,2) as dayinfo
from (select * from itcast_dimen.course_table_upload_detail_dimen where content != '开班典礼' and nvl(content, '') != '') ctudd
    left join (select * from itcast_ods.student_signin_ods where share_state=1) sso on sso.class_id = ctudd.class_id
    left join itcast_dimen.class_time_dimen ctd on ctd.id = sso.time_table_id 
group by ctudd.class_date, ctudd.class_id, sso.student_id;
-- 通过查看原数据得知signin_time,morning_begin_time等等字段都是  时:分:秒格式
select 
    ctudd.class_date as dateinfo, 
    sso.class_id, 
    sso.student_id,
    if(
        sum(
            if(sso.signin_time between ctd.morning_begin_time - 40 and ctd.morning_end_time,1,0)  -- 需要将伪代码转化为真实代码
        ) > 0,
        if(sum(
               if(sso.signin_time between ctd.morning_begin_time - 40 and ctd.morning_begin_time + 10,1,0)
               ) > 0,
        0,1),
    2) as morning_att,
    xx as afternoon_att,
    xx as evening_att,
    substr(ctudd.class_date,1,4) as yearinfo,
    substr(ctudd.class_date,6,2) as monthinfo,
    substr(ctudd.class_date,9,2) as dayinfo
from (select * from itcast_dimen.course_table_upload_detail_dimen where content != '开班典礼' and nvl(content, '') != '') ctudd
    left join (select * from itcast_ods.student_signin_ods where share_state=1) sso on sso.class_id = ctudd.class_id
    left join itcast_dimen.class_time_dimen ctd on ctd.id = sso.time_table_id 
group by ctudd.class_date, ctudd.class_id, sso.student_id;

        目前我们的任务是将格式为 时:分:秒的时间减去40分钟,通过查询hive的时间相关函数,发现我们只有针对日期的加减函数没有针对时间的加减函数。

        思考:如何实现时间的加减呢?利用时间戳

        首先将class_date与morning_begin_time进行concat拼接,其次利用unix_timestamp函数将日期转化为时间戳(单位为秒s),然后将时间戳减去2400s(40分钟),然后利用from_unixtime函数将时间戳转换为日期,最后获取时间字段。即ctd.morning_begin_time - 40等价于

        substr(from_unixtime(unix_timestamp(concat(ctudd.class_date,' ',ctd.morning_begin_time), yyyy-MM-dd HH:mm:ss) - 2400),12,8)

        当然可以直接以时间戳来比较,此时需要将sso.signin_time,ctd.morning_end_time同样转化为时间戳。

-- 通过查看原数据得知signin_time是  年-月-天 时:分:秒,morning_begin_time字段是  时:分:秒格式

--分区
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
--分桶
-- set hive.enforce.bucketing=true;
-- set hive.enforce.sorting=true;
-- set hive.optimize.bucketmapjoin = true;
-- set hive.auto.convert.sortmerge.join=true;
-- set hive.auto.convert.sortmerge.join.noconditionaltask=true;
--并行执行
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
--矢量化查询
set hive.vectorized.execution.enabled=true;
--关联优化器
set hive.optimize.correlation=true;
--读取零拷贝
set hive.exec.orc.zerocopy=true;
--join数据倾斜
set hive.optimize.skewjoin=true;
-- set hive.skewjoin.key=100000;
-- set hive.optimize.skewjoin.compiletime=true;
set hive.optimize.union.remove=true;
-- group倾斜
set hive.groupby.skewindata=true;

insert into table itcast_dwm.student_attendance_dwm partition(yearinfo,monthinfo,dayinfo)
select 
    ctudd.class_date as dateinfo, 
    sso.class_id, 
    sso.student_id,
    if(
        sum(
            if(sso.signin_time 
                between from_unixtime(unix_timestamp(concat(ctudd.class_date,' ',ctd.morning_begin_time), yyyy-MM-dd HH:mm:ss) - 2400) 
                and concat(ctudd.class_date,' ',ctd.morning_end_time),1,0)
        ) > 0,
        if(sum(
               if(unix_timestamp(sso.signin_time, yyyy-MM-dd HH:mm:ss) 
                    between unix_timestamp(concat(ctudd.class_date,' ',ctd.morning_begin_time), yyyy-MM-dd HH:mm:ss) - 2400 
                    and unix_timestamp(concat(ctudd.class_date,' ',ctd.morning_begin_time), yyyy-MM-dd HH:mm:ss) + 600,1,0)
               ) > 0,
        0,1),
    2) as morning_att, -- 上午统计
    if(
        sum(
            if(sso.signin_time 
                between from_unixtime(unix_timestamp(concat(ctudd.class_date,' ',ctd.afternoon_begin_time), yyyy-MM-dd HH:mm:ss) - 2400) 
                and concat(ctudd.class_date,' ',ctd.afternoon_end_time),1,0)
        ) > 0,
        if(sum(
               if(unix_timestamp(sso.signin_time, yyyy-MM-dd HH:mm:ss) 
                    between unix_timestamp(concat(ctudd.class_date,' ',ctd.afternoon_begin_time), yyyy-MM-dd HH:mm:ss) - 2400 
                    and unix_timestamp(concat(ctudd.class_date,' ',ctd.afternoon_begin_time), yyyy-MM-dd HH:mm:ss) + 600,1,0)
               ) > 0,
        0,1),
    2) as afternoon_att,  -- 下午统计
    if(
        sum(
            if(sso.signin_time 
                between from_unixtime(unix_timestamp(concat(ctudd.class_date,' ',ctd.evening_begin_time), yyyy-MM-dd HH:mm:ss) - 2400) 
                and concat(ctudd.class_date,' ',ctd.evening_end_time),1,0)
        ) > 0,
        if(sum(
               if(unix_timestamp(sso.signin_time, yyyy-MM-dd HH:mm:ss) 
                    between unix_timestamp(concat(ctudd.class_date,' ',ctd.evening_begin_time), yyyy-MM-dd HH:mm:ss) - 2400 
                    and unix_timestamp(concat(ctudd.class_date,' ',ctd.evening_begin_time), yyyy-MM-dd HH:mm:ss) + 600,1,0)
               ) > 0,
        0,1),
    2) as evening_att,  -- 晚自习统计
    substr(ctudd.class_date,1,4) as yearinfo,
    substr(ctudd.class_date,6,2) as monthinfo,
    substr(ctudd.class_date,9,2) as dayinfo
from (select * from itcast_dimen.course_table_upload_detail_dimen where content != '开班典礼' and nvl(content, '') != '') ctudd
    left join (select * from itcast_ods.student_signin_ods where share_state=1) sso on sso.class_id = ctudd.class_id
    left join itcast_dimen.class_time_dimen ctd on ctd.id = sso.time_table_id
where cutdd.class_date in ('2019-09-03','2019-09-04','2019-09-05')     -- 我们使用的创造出来的数据只有这三天是有效的可以进行统计的,其他数据是通过随机函数随机生产的无用数据,因此实验过程中只考虑这三天即可,注意在实际生产环境中是没有这个where条件的。
group by ctudd.class_date, ctudd.class_id, sso.student_id;

         # 学生出勤数据表

        以学生出勤情况表为基础统计出勤人数和迟到人数

insert into table itcast_dwm.class_attendancce_dwm partition(yearinfo, monthinfo, dayinfo)
select
    dateinfo,
    class_id,
    sum(if(morning_att = 0 or morning_att = 1,1,0)) as morning_att_count,
    sum(if(afternoon_att = 0 or afternoon_att = 1,1,0)) as afternoon_att_count,
    sum(if(evening_att = 0 or evening_att = 1,1,0)) as evening_att_count,
    sum(if(morning_att = 1,1,0)) as morning_late_count,
    sum(if(afternoon_att = 1,1,0)) as afternoon_late_count,
    sum(if(evening_att = 1,1,0)) as evening_late_count,
    yearinfo,
    monthinfo,
    dayinfo
from itcast_dwm.student_attendance_dwm
group by dateinfo,yearinfo,monthinfo,dayinfo,class_id;

       当然可以使用count()统计

insert into table itcast_dwm.class_attendancce_dwm partition(yearinfo, monthinfo, dayinfo)
select
    dateinfo,
    class_id,
    count(
            case 
                when morning_att in ('0','1') then student_id
                else null end
    ) as morning_att_count,
    count(
            case 
                when afternoon_att in ('0','1') then student_id
                else null end
    ) as afternoon_att_count,
    count(
            case 
                when evening_att in ('0','1') then student_id
                else null end
    ) as evening_att_count,
    count(
            case 
                when morning_att = '1' then student_id
                else null end
    ) as morning_late_count,
    count(
            case 
                when afternoon_att = '1' then student_id
                else null end
    ) as afternoon_late_count,
    count(
            case 
                when evening_att = '1' then student_id
                else null end
    ) as evening_late_count,
    yearinfo,
    monthinfo,
    dayinfo
from itcast_dwm.student_attendance_dwm
group by dateinfo,yearinfo,monthinfo,dayinfo,class_id;
-- count对null是不统计的

        # 班级请假数据表

        原始数据中上午没有请假的数据,这里我们模拟出一条请假数据

INSERT INTO itcast_ods.student_leave_apply_ods partition (dt)
values (125, 5032, 119142, 1, 3491, '2019-09-08 16:42:29', '狂犬疫苗最后一针', 1, 2, '2019-09-03 08:00:00', 2, '2019-09-03 23:30:00', 2, 1, 0, null, null, '医院打针,狂犬疫苗最后一针了', 1, '2019-09-02 08:56:54','2020-07-07');

select 
    ctudd.class_date as dateinfo, 
    ctudd.class_id,
    count(distinct, slao.student_id) as morning_leave_count
from (select * from itcast_ods.student_leave_apply_ods where audit_state = 1 and cancel_state = 0 and valid_state = 1) slao
    join itcast_dimen.class_time_dimen ctd on slao.class_id = ctd.class_id
    join (select * from itcast_dimen.course_table_upload_detail_dimen where content != '开班典礼' and nvl(content, '') != '') ctudd on ctudd.class_id = slao.class_id
where ctudd.class_date between ctd.use_begin_date and ctd.use_end_date
    and slao.begin_time <= concat(ctudd.class_date,' ',ctd.morning_begin_time) and slao.end_time >= concat(ctudd.class_date,' ',ctd.morning_begin_time)
group by ctudd.class_date, ctudd.class_id;

-- ctd.use_begin_date and ctd.use_end_date 都是年-月-天格式

select 
    ctudd.class_date as dateinfo, 
    ctudd.class_id,
    count(distinct, slao.student_id) as afternoon_leave_count
from (select * from itcast_ods.student_leave_apply_ods where audit_state = 1 and cancel_state = 0 and valid_state = 1) slao
    join itcast_dimen.class_time_dimen ctd on slao.class_id = ctd.class_id
    join (select * from itcast_dimen.course_table_upload_detail_dimen where content != '开班典礼' and nvl(content, '') != '') ctudd on ctudd.class_id = slao.class_id
where ctudd.class_date between ctd.use_begin_date and ctd.use_end_date
    and slao.begin_time <= concat(ctudd.class_date,' ',ctd.afternoon_begin_time) and slao.end_time >= concat(ctudd.class_date,' ',ctd.afternoon_begin_time)
group by ctudd.class_date, ctudd.class_id;

select 
    ctudd.class_date as dateinfo, 
    ctudd.class_id,
    count(distinct, slao.student_id) as evening_leave_count
from (select * from itcast_ods.student_leave_apply_ods where audit_state = 1 and cancel_state = 0 and valid_state = 1) slao
    join itcast_dimen.class_time_dimen ctd on slao.class_id = ctd.class_id
    join (select * from itcast_dimen.course_table_upload_detail_dimen where content != '开班典礼' and nvl(content, '') != '') ctudd on ctudd.class_id = slao.class_id
where ctudd.class_date between ctd.use_begin_date and ctd.use_end_date
    and slao.begin_time <= concat(ctudd.class_date,' ',ctd.evening_begin_time) and slao.end_time >= concat(ctudd.class_date,' ',ctd.evening_begin_time)
group by ctudd.class_date, ctudd.class_id;

# 上午下午晚自习统计结果:当然我们这里只有一条数据,在实际情况下肯定有多条:

最终结果:

2019-09-03   5032   1   1   1   2019   09   03

2019-09-03   1205   8   0   0   2019   09   03

2019-09-03   5412   0   3   2   2019   09   03

2019-09-04   7014   5   0   3   2019   09   04

2019-09-05   7014   0   3   0   2019   09   05

思考:如何得到最终结果?满外连接full join。注意当我们发现我们的结果既要左边的数据,又要右边的数据时,一定使用full join

         与最终结果还有一定的差距:即存在(class_date,class_id)相同的数据。需要将其合并为一条数据。

        此时与最终表一致了

        最终sql:

insert into table itcast_dwm.class_leave_dwm partition(yearinfo,monthinfo,dayinfo)
select
    temp.class_date,
    temp.class_id,
    sum(temp.morning_leave_count) as morning_leave_count,
    sum(temp.afternoon_leave_count) as afternoon_leave_count,
    sum(temp.evening_leave_count) as evening_leave_count,
    substr(temp.class_date,1,4) as yearinfo,
    substr(temp.class_date,6,2) as monthinfo,
    substr(temp.class_date,9,2) as dayinfo
from (select
    coalesce(a.class_date,b.class_date,c.class_date) as class_date,
    coalesce(a.class_id,b.class_id,c.class_id) as class_id,
    nvl(a.morning_leave_count,0) as morning_leave_count,
    nvl(b.afternoon_leave_count,0) as afternoon_leave_count,
    nvl(c.evening_leave_count,0) as evening_leave_count
from (select 
    ctudd.class_date as dateinfo, 
    ctudd.class_id,
    count(distinct, slao.student_id) as morning_leave_count
from (select * from itcast_ods.student_leave_apply_ods where audit_state = 1 and cancel_state = 0 and valid_state = 1) slao
    join itcast_dimen.class_time_dimen ctd on slao.class_id = ctd.class_id
    join (select * from itcast_dimen.course_table_upload_detail_dimen where content != '开班典礼' and nvl(content, '') != '') ctudd on ctudd.class_id = slao.class_id
where ctudd.class_date between ctd.use_begin_date and ctd.use_end_date
    and slao.begin_time <= concat(ctudd.class_date,' ',ctd.morning_begin_time) and slao.end_time >= concat(ctudd.class_date,' ',ctd.morning_begin_time)
group by ctudd.class_date, ctudd.class_id) a
    full join (select 
    ctudd.class_date as dateinfo, 
    ctudd.class_id,
    count(distinct, slao.student_id) as afternoon_leave_count
from (select * from itcast_ods.student_leave_apply_ods where audit_state = 1 and cancel_state = 0 and valid_state = 1) slao
    join itcast_dimen.class_time_dimen ctd on slao.class_id = ctd.class_id
    join (select * from itcast_dimen.course_table_upload_detail_dimen where content != '开班典礼' and nvl(content, '') != '') ctudd on ctudd.class_id = slao.class_id
where ctudd.class_date between ctd.use_begin_date and ctd.use_end_date
    and slao.begin_time <= concat(ctudd.class_date,' ',ctd.afternoon_begin_time) and slao.end_time >= concat(ctudd.class_date,' ',ctd.afternoon_begin_time)
group by ctudd.class_date, ctudd.class_id) b on a.class_date = b.class_date and a.class_id = b.class_id
    full join (select 
    ctudd.class_date as dateinfo, 
    ctudd.class_id,
    count(distinct, slao.student_id) as evening_leave_count
from (select * from itcast_ods.student_leave_apply_ods where audit_state = 1 and cancel_state = 0 and valid_state = 1) slao
    join itcast_dimen.class_time_dimen ctd on slao.class_id = ctd.class_id
    join (select * from itcast_dimen.course_table_upload_detail_dimen where content != '开班典礼' and nvl(content, '') != '') ctudd on ctudd.class_id = slao.class_id
where ctudd.class_date between ctd.use_begin_date and ctd.use_end_date
    and slao.begin_time <= concat(ctudd.class_date,' ',ctd.evening_begin_time) and slao.end_time >= concat(ctudd.class_date,' ',ctd.evening_begin_time)
group by ctudd.class_date, ctudd.class_id) c on a.class_date = c.class_date and a.class_id = c.class_id) temp
group by temp.class_date,temp.class_id;

        可以利用with as语句提前将上述三个子查询起别名:

        语法:with 别名 as (select ....from.....)

        详细语法请看:sql语句中with as 的用法_sql as with_jia718的博客-CSDN博客

        注意运行过程中可能会出现资源不足的问题,可以考虑将上述设置的数据倾斜等等优化策略关闭。

        # 班级旷课数据表

        基于一下三张表计算每天各个班级上午下午晚自习的旷课人数

        itcast_dwm.class_leave_dwm(班级请假数据表)可以得到请假人数

        itcast_dwm.class_attendancce_dwm(班级出勤数据表)可以得到出勤人数

        itcast_dimen.class_studying_student_count_dimen(班级每日在读学员表)可以得到每日各个班级的在读学生数。

        旷课人数=每日在读学生数-出勤人数-请假人数

        思考:旷课数据的统计一定是建立在当天有课的情况下才统计。所以需要还需要关联ctudd课表。

insert into table itcast_dwm.class_truant_dwm partition(yearinfo,monthinfo,dayinfo)
select
    ctudd.class_date as dateinfo,
    ctudd.class_id,
    cssc.studying_student_count - nvl(cad.morning_att_count,0) - nvl(cld.morning_leave_count,0) as morning_truant_count,
    cssc.studying_student_count - nvl(cad.afternoon_att_count,0) - nvl(cld.afternoon_leave_count,0) as afternoon_truant_count,
    cssc.studying_student_count - nvl(cad.evening_att_count,0) - nvl(cld.evening_leave_count,0) as evening_truant_count,
    substr(ctudd.class_date,1,4) as yearinfo,
    substr(ctudd.class_date,6,2) as monthinfo,
    substr(ctudd.class_date,9,2) as dayinfo
from (select * from itcast_dimen.course_table_upload_detail_dimen where content != '开班典礼' and nvl(content, '') != '') ctudd
    left join itcast_dimen.class_studying_student_count_dimen cssc on ctudd.class_date = cssc.studying_date and ctudd.class_id = cssc.class_id
    left join itcast_dwm.class_attendance_dwm cad on ctudd.class_id = cad.class_id and ctudd.class_date = cad.dateinfo
    left join itcast_dwm.class_leave_dwm cld on ctudd.class_id = cld.class_id and ctudd.class_date = cld.dateinfo
where ctudd.class_date in ('2019-09-03','2019-09-04','2019-09-05');  -- 只有这三天数据是源数据中有意义的,实际生产环境中不需加where

        # 汇总表:

insert into table itcast_dwm.class_all_dwn partition(yearinfo,monthinfo,dayinfo)
select 
    ctudd.class_date as dateinfo,
    ctudd.class_id,
    cssc.studying_student_count,
    cad.morning_att_count,
    concat(round(nvl(cad.morning_att_count,0) / cssc.studying_student_count * 100,2),'%') as morning_att_ratio,
    cad.afternoon_att_count,
    concat(round(nvl(cad.afternoon_att_count,0) / cssc.studying_student_count * 100,2),'%') as afternoon_att_ratio,
    cad.evening_att_count,
    concat(round(nvl(cad.evening_att_count,0) / cssc.studying_student_count * 100,2),'%') as evening_att_ratio,  
    cad.morning_late_count,
    concat(round(nvl(cad.morning_late_count,0) / cssc.studying_student_count * 100,2),'%') as morning_late_ratio,
    cad.afternoon_late_count,
    concat(round(nvl(cad.afternoon_late_count,0) / cssc.studying_student_count * 100,2),'%') as afternoon_late_ratio,
    cad.evening_late_count,
    concat(round(nvl(cad.evening_late_count,0) / cssc.studying_student_count * 100,2),'%') as evening_late_ratio,
    cld.morning_leave_count,
    concat(round(nvl(cld.morning_leave_count,0) / cssc.studying_student_count * 100,2),'%') as morning_leave_ratio,
    cld.afternoon_leave_count,
    concat(round(nvl(cld.afternoon_leave_count,0) / cssc.studying_student_count * 100,2),'%') as afternoon_leave_ratio,
    cld.evening_leave_count,
    concat(round(nvl(cld.evening_leave_count,0) / cssc.studying_student_count * 100,2),'%') as evening_leave_ratio,
    ctd.morning_truant_count,
    concat(round(nvl(ctd.morning_truant_count,0) / cssc.studying_student_count * 100,2),'%') as morning_truant_ratio,
    ctd.afternoon_truant_count,
    concat(round(nvl(ctd.afternoon_truant_count,0) / cssc.studying_student_count * 100,2),'%') as afternoon_truant_ratio,
    ctd.evening_truant_count,
    concat(round(nvl(ctd.evening_truant_count,0) / cssc.studying_student_count * 100,2),'%') as evening_truant_ratio,
    substr(ctudd.class_date,1,4) as yearinfo,
    substr(ctudd.class_date,6,2) as monthinfo,
    substr(ctudd.class_date,9,2) as dayinfo
from (select * from itcast_dimen.course_table_upload_detail_dimen where content != '开班典礼' and nvl(content, '') != '') ctudd
    left join itcast_dimen.class_studying_student_count_dimen cssc on ctudd.class_date = cssc.studying_date and ctudd.class_id = cssc.class_id
    left join itcast_dwm.class_attendance_dwm cad on ctudd.class_id = cad.class_id and ctudd.class_date = cad.dateinfo
    left join itcast_dwm.class_leave_dwm cld on ctudd.class_id = cld.class_id and ctudd.class_date = cld.dateinfo
    left join itcast_dwm.class_truant_dwm ctd on ctudd.class_id = ctd.class_id and ctudd.class_date = ctd.dateinfo
where ctudd.class_date in ('2019-09-03','2019-09-04','2019-09-05');  -- 只有这三天数据是源数据中有意义的,实际生产环境中不需加where

-- 注意我们这里将比率都加了%但是在实际生产环境中一般不加%因为加了百分号后续就如法计算了。

七:数据分析

        基于DWM层的按天提前聚合的汇总表,统计每月,每年的数据

         需求一已经统计完成,只需要将结果直接导入到dws层对应表即可

insert into table itcast_dws.class_attendance_dws partition(yearinfo,monthinfo,dayinfo)
select 
    dateinfo,
    class_id,
    studying_student_count,
    morning_att_count,
    morning_att_ratio,
    afternoon_att_count,
    afternoon_att_ratio,
    evening_att_count,
    evening_att_ratio,
    morning_late_count,
    morning_late_ratio,
    afternoon_late_count,
    afternoon_late_ratio,
    evening_late_count,
    evening_late_ratio,
    morning_leave_count,
    morning_leave_ratio,
    afternoon_leave_count,
    afternoon_leave_ratio,
    evening_leave_count,
    evening_leave_ratio,
    morning_truant_count,
    morning_truant_ratio,
    afternoon_truant_count,
    afternoon_truant_ratio,
    evening_truant_count,
    evening_truant_ratio,
    '2' as time_type
    yearinfo,
    monthinfo,
    dayinfo
from itcast_dwm.class_all_dwn;

        需求二:

insert into table itcast_dws.class_attendance_dws partition(yearinfo,monthinfo,dayinfo)
select 
    concat(yearinfo,'-',monthinfo) as dateinfo,
    class_id,
    sum(studying_student_count) as studyinf_student_count,
    sum(nvl(morning_att_count,0)) as morning_att_count,
    concat(round(sum(nvl(morning_att_count,0)) / sum(studying_student_count) * 100,2),'%') as morning_att_ratio,
    sum(nvl(afternoon_att_count,0)) as afternoon_att_count,
    concat(round(sum(nvl(afternoon_att_count,0)) / sum(studying_student_count) * 100,2),'%') as afternoon_att_ratio,
    sum(nvl(evening_att_count,0)) as evening_att_count,
    concat(round(sum(nvl(evening_att_count,0)) / sum(studying_student_count) * 100,2),'%') as evening_att_ratio,
    sum(nvl(morning_later_count,0)) as morning_later_count,
    concat(round(sum(nvl(morning_later_count,0)) / sum(studying_student_count) * 100,2),'%') as morning_later_ratio,
    sum(nvl(afternoon_later_count,0)) as afternoon_later_count,
    concat(round(sum(nvl(afternoon_later_count,0)) / sum(studying_student_count) * 100,2),'%') as afternoon_later_ratio,
    sum(nvl(evening_later_count,0)) as evening_later_count,
    concat(round(sum(nvl(evening_later_count,0)) / sum(studying_student_count) * 100,2),'%') as evening_later_ratio,
    sum(nvl(morning_leave_count,0)) as morning_leave_count,
    concat(round(sum(nvl(morning_leave_count,0)) / sum(studying_student_count) * 100,2),'%') as morning_leave_ratio,
    sum(nvl(afternoon_leave_count,0)) as afternoon_leave_count,
    concat(round(sum(nvl(afternoon_leave_count,0)) / sum(studying_student_count) * 100,2),'%') as afternoon_leave_ratio,
    sum(nvl(evening_leave_count,0)) as evening_leave_count,
    concat(round(sum(nvl(evening_leave_count,0)) / sum(studying_student_count) * 100,2),'%') as evening_leave_ratio,
    sum(nvl(morning_truant_count,0)) as morning_truant_count,
    concat(round(sum(nvl(morning_truant_count,0)) / sum(studying_student_count) * 100,2),'%') as morning_truant_ratio,
    sum(nvl(afternoon_truant_count,0)) as afternoon_truant_count,
    concat(round(sum(nvl(afternoon_truant_count,0)) / sum(studying_student_count) * 100,2),'%') as afternoon_truant_ratio,
    sum(nvl(evening_truant_count,0)) as evening_truant_count,
    concat(round(sum(nvl(evening_truant_count,0)) / sum(studying_student_count) * 100,2),'%') as evening_truant_ratio,
    '4' as time_type
    yearinfo,
    monthinfo,
    '-1' as dayinfo
from itcast_dwm.class_all_dwn
group by yearinfo,monthinfo,class_id;

        需求三:

insert into table itcast_dws.class_attendance_dws partition(yearinfo,monthinfo,dayinfo)
select 
    yearinfo as dateinfo,
    class_id,
    sum(studying_student_count) as studyinf_student_count,
    sum(nvl(morning_att_count,0)) as morning_att_count,
    concat(round(sum(nvl(morning_att_count,0)) / sum(studying_student_count) * 100,2),'%') as morning_att_ratio,
    sum(nvl(afternoon_att_count,0)) as afternoon_att_count,
    concat(round(sum(nvl(afternoon_att_count,0)) / sum(studying_student_count) * 100,2),'%') as afternoon_att_ratio,
    sum(nvl(evening_att_count,0)) as evening_att_count,
    concat(round(sum(nvl(evening_att_count,0)) / sum(studying_student_count) * 100,2),'%') as evening_att_ratio,
    sum(nvl(morning_later_count,0)) as morning_later_count,
    concat(round(sum(nvl(morning_later_count,0)) / sum(studying_student_count) * 100,2),'%') as morning_later_ratio,
    sum(nvl(afternoon_later_count,0)) as afternoon_later_count,
    concat(round(sum(nvl(afternoon_later_count,0)) / sum(studying_student_count) * 100,2),'%') as afternoon_later_ratio,
    sum(nvl(evening_later_count,0)) as evening_later_count,
    concat(round(sum(nvl(evening_later_count,0)) / sum(studying_student_count) * 100,2),'%') as evening_later_ratio,
    sum(nvl(morning_leave_count,0)) as morning_leave_count,
    concat(round(sum(nvl(morning_leave_count,0)) / sum(studying_student_count) * 100,2),'%') as morning_leave_ratio,
    sum(nvl(afternoon_leave_count,0)) as afternoon_leave_count,
    concat(round(sum(nvl(afternoon_leave_count,0)) / sum(studying_student_count) * 100,2),'%') as afternoon_leave_ratio,
    sum(nvl(evening_leave_count,0)) as evening_leave_count,
    concat(round(sum(nvl(evening_leave_count,0)) / sum(studying_student_count) * 100,2),'%') as evening_leave_ratio,
    sum(nvl(morning_truant_count,0)) as morning_truant_count,
    concat(round(sum(nvl(morning_truant_count,0)) / sum(studying_student_count) * 100,2),'%') as morning_truant_ratio,
    sum(nvl(afternoon_truant_count,0)) as afternoon_truant_count,
    concat(round(sum(nvl(afternoon_truant_count,0)) / sum(studying_student_count) * 100,2),'%') as afternoon_truant_ratio,
    sum(nvl(evening_truant_count,0)) as evening_truant_count,
    concat(round(sum(nvl(evening_truant_count,0)) / sum(studying_student_count) * 100,2),'%') as evening_truant_ratio,
    '5' as time_type
    yearinfo,
    '-1' as monthinfo,
    '-1' as dayinfo
from itcast_dwm.class_all_dwn
group by yearinfo,class_id;

八:数据导出

CREATE TABLE IF NOT EXISTS scrm_id.class_attendance_dws (
    dateinfo      varchar(100)        comment '日期',
    class_id        int           comment '班级id',
   studying_student_count int  comment '在读班级人数',
    morning_att_count       varchar(100)         comment '上午出勤人数',
   morning_att_ratio       varchar(100)         comment '上午出勤率',
    afternoon_att_count      varchar(100)        comment '下午出勤人数',
   afternoon_att_ratio      varchar(100)        comment '下午出勤率',
    evening_att_count       varchar(100)        comment '晚自习出勤人数',
   evening_att_ratio       varchar(100)        comment '晚自习出勤率',
   morning_late_count       varchar(100)        comment '上午迟到人数',
   morning_late_ratio       varchar(100)        comment '上午迟到率',
    afternoon_late_count     varchar(100)        comment '下午迟到人数',
   afternoon_late_ratio     varchar(100)        comment '下午迟到率',
    evening_late_count       varchar(100)        comment '晚自习迟到人数',
   evening_late_ratio       varchar(100)      comment '晚自习迟到率',
   morning_leave_count      varchar(100)      comment '上午请假人数',
   morning_leave_ratio      varchar(100)      comment '上午请假率',
    afternoon_leave_count   varchar(100)      comment '下午请假人数',
   afternoon_leave_ratio    varchar(100)     comment '下午请假率',
    evening_leave_count     varchar(100)       comment '晚自习请假人数',
   evening_leave_ratio       varchar(100)    comment '晚自习请假率',
    morning_truant_count      varchar(100)          comment '上午旷课人数',
   morning_truant_ratio      varchar(100)        comment '上午旷课率',
    afternoon_truant_count  varchar(100)     comment '下午旷课人数',
   afternoon_truant_ratio   varchar(100)      comment '下午旷课率',
    evening_truant_count      varchar(100)       comment '晚自习旷课人数',
   evening_truant_ratio      varchar(100)     comment '晚自习旷课率',
   time_type           varchar(100)   COMMENT '聚合时间类型:1、按小时聚合;2、按天聚合;3、按季度聚合;4、按月聚合;5、按年聚合。',
    yearinfo varchar(100) comment '年',
    monthinfo varchar(100) comment '月',
    dayinfo varchar(100)  comment '天'
)
comment '班级数据统计表';
sqoop export \
--connect "jdbc:mysql://hadoop01:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
--username root \
--password 123456 \
--table class_attendance_dws \
--hcatalog-database itcast_dws \
--hcatalog-table class_attendance_dws \
-m 1
  • 2
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值