spark分析某答题系统日志
一、数据描述
这是一份来自于某在线考试系统的学员答题批改日志,日志中记录了日志生成时间,题目难度系数,题目所属的知识点ID,做题的学生ID,题目ID以及作答批改结果。日志的结构如下:
二、数据下载
链接:answer_question.log
提取码:47k1
三、需求
1. 数据准备
请在HDFS 中创建目录/app/data/exam,并将answer_question.log 传到该目录
hdfs dfs -mkdir -p /app/data/exam/
hdfs dfs -put answer_question.log /app/data/exam/
2. 在Spark-Shell 中,加载HDFS 文件系统answer_question.log 文件,并使用RDD 完成以下分析,也可使用Spark 的其他方法完成数据分析
①提取日志中的知识点ID,学生ID,题目ID,作答结果4 个字段的值
详细步骤如下:
- 创建RDD,加载数据
val logRDD = sc.textFile("hdfs://hadoop-single:9000/app/data/exam/answer_question.log")
- 提取字段 知识点ID,学生ID,题目ID,作答结果
val log_etl = logRDD.map(x => x.split(" ")).map(x => (x(9).split("_"),x(10).split(","))).map(x => (x._1(1),x._1(2),x._1(3).substring(0,x._1(3).length-1),x._2(0)))
结果如下:
注: 要去除 题目字符串后缀中的 r ,除了用 substring 算子修剪掉,也可以用 replace(" r " ,"") ,将 r 替换为空字符
②将提取后的 知识点ID ,学生ID,题目ID,作答结果 ,字段的值以文件的形式保存到HDFS的 /app/data/result 目录下。一行保留一条数据,字段间以“\t”分割,文件格式如下所示。(提示:元组可使用tuple.productIterator.mkString("\t")组合字符串)
34434481 | 8195023659599 | 1018 | 0 |
---|---|---|---|
34434425 | 8195023659599 | 7385 | 1 |
34434457 | 8195023659596 | 7346 | 1 |
34434498 | 8195023659597 | 6672 | 0 |
34434449 | 8195023659594 | 4809 | 1 |
34434489 | 8195023659596 | 7998 | 0.5 |
34434492 | 8195023659595 | 9406 | 0 |
34434485 | 8195023659597 | 8710 | 1 |
log_etl.map(x => x.productIterator.mkString("\t")).saveAsTextFile("hdfs://hadoop-single:9000/app/data/result")
3. 创建HBase 数据表
在HBase中创建命名空间(namespace)exam,在该命名空间下创建analysis表,使用学生ID作为RowKey,该表下有2 个列族accuracy、question
- 创建命名空间
create_namespace 'exam'
- 创建表 analysis
create 'exam:analysis','accuracy','question'
- 查看命名空间 exam 中的表
list_namespace_tables 'exam'
4. 创建Hive数据表
请在Hive中创建数据库 exam ,在该数据库中创建外部表 ex_exam_record 指向 /app/data/result 下Spark处理后的日志数据;创建外部表 ex_exam_anlysis 映射至HBase中的 analysis 表的 accuracy 列族;创建外部表 ex_exam_question 映射至HBase中的 analysis 表的 question 列族
ex_exam_record 表结构如下:
字段名称 | 字段类型 | 字段含义 |
---|---|---|
topic_id | string | 知识点ID |
student_id | string | 学生ID |
question_id | string | 题目ID |
score | float | 作答结果 |
ex_exam_anlysis表结构如下:
字段名称 | 字段类型 | 字段含义 |
---|---|---|
student_id | string | 学生ID |
total_score | float | 总分 |
question_count | int | 答题的试题数 |
accuracy | float | 正确率 |
ex_exam_question 表结构如下:
字段名称 | 字段类型 | 字段含义 |
---|---|---|
student_id | string | 学生ID |
right | string | 所有作对的题目的ID 列表 |
half | string | 所有半对的题目的ID 列表 |
error | string | 所有做错的题目的ID 列表 |
- 创建数据库 exam
create databases if not esists exam;
- 建表 ex_exam_record
create external table if not exists ex_exam_record(
topic_id string,
student_id string,
question_id string,
score float
)
row format delimited
fields terminated by '\t'
stored as textFile
location '/app/data/result/';
- 建表 ex_exam_anlysis
create external table if not exists ex_exam_anlysis(
student_id string,
total_score float,
question_count int,
accuracy float
)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping"=":key,accuracy:total_score,accuracy:question_count,accuracy:accuracy")
tblproperties("hbase.table.name"="exam:analysis");
- 建表 ex_exam_question
create external table if not exists ex_exam_question(
student_id string,
right string,
half string,
error string
)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping"=":key,question:right,question:half,question:error")
tblproperties("hbase.table.name"="exam:analysis");
5. 使用ex_exam_record表中的数据统计每个学员总分、答题的试题数和正确率,并保存到ex_exam_anlysis 表中,其中正确率的计算方法如下:正确率=总分/答题的试题数
- 查询并插入数据至表 ex_exam_anlysis
insert into ex_exam_anlysis
select student_id, sum(score) as total_score, count(1) as question_count,
sum(score)/count(1) as accuracy
from ex_exam_record group by student_id;
- 查询 ex_exam_anlysis 表中是否有数据
select * from ex_exam_anlysis;
- 在 hbase 中查询数据(列簇 accuracy )
scan 'exam:analysis',{COLUMNS=>'accuracy'}
6. 使用ex_exam_record 表中的数据统计每个做对,做错,半对的题目列表
①题目id 以逗号分割,并保存到ex_exam_question 表中
步骤:
- 1.查询一个学生做 对 的所有题目:
select student_id,score,concat_ws(",",collect_set(question_id)) right
from ex_exam_record
where score = 1
group by student_id,score;
部分截图如下:
- 2.查询一个学生做 半对 的所有题目:
select student_id,score,concat_ws(",",collect_set(question_id)) half
from ex_exam_record
where score = 0.5
group by student_id,score;
- 3.查询一个学生做 错 的所有题目:
select student_id,score,concat_ws(",",collect_set(question_id)) error
from ex_exam_record
where score = 0
group by student_id,score;
- 4.联表查询
select t1.student_id,t1.right,t2.half,t3.error
from
(select student_id,score,concat_ws(",",collect_set(question_id)) right
from ex_exam_record
where score = 1
group by student_id,score) t1
join
(select student_id,score,concat_ws(",",collect_set(question_id)) half
from ex_exam_record
where score = 0.5
group by student_id,score) t2
on t1.student_id = t2.student_id
join
(select student_id,score,concat_ws(",",collect_set(question_id)) error
from ex_exam_record
where score = 0
group by student_id,score) t3
on t1.student_id = t3.student_id;
部分截图如下:
- 5.将查询结果插入到表 ex_exam_question 中
insert into ex_exam_question
select t1.student_id,t1.right,t2.half,t3.error
from
(select student_id,score,concat_ws(",",collect_set(question_id)) right
from ex_exam_record
where score = 1
group by student_id,score) t1
join
(select student_id,score,concat_ws(",",collect_set(question_id)) half
from ex_exam_record
where score = 0.5
group by student_id,score) t2
on t1.student_id = t2.student_id
join
(select student_id,score,concat_ws(",",collect_set(question_id)) error
from ex_exam_record
where score = 0
group by student_id,score) t3
on t1.student_id = t3.student_id;
注: 上述代码也可以简写成这样:
with t as
(select student_id, score,
concat_ws(",",collect_set(question_id)) as question_id
from ex_exam_record
group by student_id,score),
a1 as (select student_id, question_id from t where score=1),
a2 as (select student_id, question_id from t where score=0.5),
a3 as (select student_id, question_id from t where score=0)
insert into ex_exam_question
select a1.student_id, a1.question_id right, a2.question_id half, a3.question_id error
from a1
join a2 on a1.student_id = a2.student_id
join a3 on a1.student_id = a3.student_id
;
也可以换一种思路,用 case when 简写成这样:
insert into ex_exam_question
select
student_id,
concat_ws('',collect_list(t.right)) right,
concat_ws('',collect_list(t.half)) half,
concat_ws('',collect_list(t.error)) error
from
(
select
student_id,
case when score=1.0 then concat_ws(",",collect_list(question_id)) else null end right,
case when score=0.5 then concat_ws(",",collect_list(question_id)) else null end half,
case when score=0.0 then concat_ws(",",collect_list(question_id)) else null end error
from ex_exam_record
group by student_id,score) t
group by student_id;
②完成统计后,在HBase Shell 中遍历 exam:analysis 表并只显示 question 列族中的数据,如下图所示:
- 由于 hive中外部表 ex_exam_question 已经映射至 hbase:exam:analysis 表 question 列簇,这里直接在habse shell 命令行查询即可
scan 'exam:analysis',{COLUMNS=>'question'}