spark日志分析（二）—— 某答题系统日志

我玩的很开心

已于 2024-02-10 20:51:59 修改

阅读量757

点赞数 1

文章标签： spark日志分析

于 2020-10-20 15:25:48 首次发布

本文链接：https://blog.csdn.net/and52696686/article/details/109181822

版权

spark分析某答题系统日志

一、数据描述
二、数据下载
三、需求

一、数据描述

这是一份来自于某在线考试系统的学员答题批改日志，日志中记录了日志生成时间,题目难度系数，题目所属的知识点ID,做题的学生ID，题目ID以及作答批改结果。日志的结构如下：

在这里插入图片描述

二、数据下载

链接：answer_question.log
提取码：47k1

三、需求

1. 数据准备

请在HDFS 中创建目录/app/data/exam，并将answer_question.log 传到该目录

hdfs dfs -mkdir -p /app/data/exam/
hdfs dfs -put answer_question.log /app/data/exam/

2. 在Spark-Shell 中，加载HDFS 文件系统answer_question.log 文件，并使用RDD 完成以下分析，也可使用Spark 的其他方法完成数据分析

①提取日志中的知识点ID，学生ID，题目ID，作答结果4 个字段的值

详细步骤如下：

创建RDD，加载数据

val logRDD = sc.textFile("hdfs://hadoop-single:9000/app/data/exam/answer_question.log")

提取字段 知识点ID，学生ID，题目ID，作答结果

val log_etl = logRDD.map(x => x.split(" ")).map(x => (x(9).split("_"),x(10).split(","))).map(x => (x._1(1),x._1(2),x._1(3).substring(0,x._1(3).length-1),x._2(0)))

结果如下：
在这里插入图片描述
注：要去除题目字符串后缀中的 r ，除了用 substring 算子修剪掉，也可以用 replace(" r " ,"") ，将 r 替换为空字符

②将提取后的 知识点ID ，学生ID，题目ID，作答结果 ，字段的值以文件的形式保存到HDFS的 /app/data/result 目录下。一行保留一条数据，字段间以“\t”分割，文件格式如下所示。（提示：元组可使用tuple.productIterator.mkString("\t")组合字符串）

34434481	8195023659599	1018	0
34434425	8195023659599	7385	1
34434457	8195023659596	7346	1
34434498	8195023659597	6672	0
34434449	8195023659594	4809	1
34434489	8195023659596	7998	0.5
34434492	8195023659595	9406	0
34434485	8195023659597	8710	1

log_etl.map(x => x.productIterator.mkString("\t")).saveAsTextFile("hdfs://hadoop-single:9000/app/data/result")

在这里插入图片描述

3. 创建HBase 数据表

在HBase中创建命名空间（namespace）exam，在该命名空间下创建analysis表，使用学生ID作为RowKey，该表下有2 个列族accuracy、question

创建命名空间

create_namespace 'exam'

创建表 analysis

create 'exam:analysis','accuracy','question'

查看命名空间 exam 中的表

list_namespace_tables 'exam'

在这里插入图片描述

4. 创建Hive数据表

请在Hive中创建数据库 exam ，在该数据库中创建外部表 ex_exam_record 指向 /app/data/result 下Spark处理后的日志数据;创建外部表 ex_exam_anlysis 映射至HBase中的 analysis 表的 accuracy 列族;创建外部表 ex_exam_question 映射至HBase中的 analysis 表的 question 列族

ex_exam_record 表结构如下：

字段名称	字段类型	字段含义
topic_id	string	知识点ID
student_id	string	学生ID
question_id	string	题目ID
score	float	作答结果

ex_exam_anlysis表结构如下：

字段名称	字段类型	字段含义
student_id	string	学生ID
total_score	float	总分
question_count	int	答题的试题数
accuracy	float	正确率

ex_exam_question 表结构如下：

字段名称	字段类型	字段含义
student_id	string	学生ID
right	string	所有作对的题目的ID 列表
half	string	所有半对的题目的ID 列表
error	string	所有做错的题目的ID 列表

创建数据库 exam

create databases if not esists exam;

建表 ex_exam_record

create external table if not exists ex_exam_record(
topic_id string,
student_id string,
question_id string,
score float
)
row format delimited
fields terminated by '\t'
stored as textFile
location '/app/data/result/';

建表 ex_exam_anlysis

create external table if not exists ex_exam_anlysis(
student_id string,
total_score float,
question_count int,
accuracy float
)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
with serdeproperties ("hbase.columns.mapping"=":key,accuracy:total_score,accuracy:question_count,accuracy:accuracy")
tblproperties("hbase.table.name"="exam:analysis");

建表 ex_exam_question

create external table if not exists ex_exam_question(
student_id string,
right string,
half string,
error string
)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
with serdeproperties  ("hbase.columns.mapping"=":key,question:right,question:half,question:error")
tblproperties("hbase.table.name"="exam:analysis");

在这里插入图片描述

5. 使用ex_exam_record表中的数据统计每个学员总分、答题的试题数和正确率，并保存到ex_exam_anlysis 表中，其中正确率的计算方法如下：正确率=总分/答题的试题数

查询并插入数据至表 ex_exam_anlysis

insert into ex_exam_anlysis 
select student_id, sum(score) as total_score, count(1) as question_count, 
sum(score)/count(1) as accuracy 
from ex_exam_record group by student_id;

查询 ex_exam_anlysis 表中是否有数据

select * from ex_exam_anlysis;

在这里插入图片描述

在 hbase 中查询数据(列簇 accuracy )

scan 'exam:analysis',{COLUMNS=>'accuracy'}

在这里插入图片描述

6. 使用ex_exam_record 表中的数据统计每个做对，做错，半对的题目列表

①题目id 以逗号分割，并保存到ex_exam_question 表中

步骤：

1.查询一个学生做对的所有题目：

select student_id,score,concat_ws(",",collect_set(question_id)) right
from ex_exam_record 
where score = 1 
group by student_id,score;

部分截图如下：
在这里插入图片描述

2.查询一个学生做半对的所有题目：

select student_id,score,concat_ws(",",collect_set(question_id)) half
from ex_exam_record 
where score = 0.5
group by student_id,score;

3.查询一个学生做错的所有题目：

select student_id,score,concat_ws(",",collect_set(question_id)) error
from ex_exam_record 
where score = 0
group by student_id,score;

4.联表查询

select t1.student_id,t1.right,t2.half,t3.error
from
(select student_id,score,concat_ws(",",collect_set(question_id)) right 
from ex_exam_record 
where score = 1 
group by student_id,score) t1
join
(select student_id,score,concat_ws(",",collect_set(question_id)) half
from ex_exam_record 
where score = 0.5
group by student_id,score) t2
on t1.student_id = t2.student_id
join
(select student_id,score,concat_ws(",",collect_set(question_id)) error
from ex_exam_record 
where score = 0
group by student_id,score) t3
on t1.student_id = t3.student_id;

部分截图如下：
在这里插入图片描述

5.将查询结果插入到表 ex_exam_question 中

insert into ex_exam_question
select t1.student_id,t1.right,t2.half,t3.error
from
(select student_id,score,concat_ws(",",collect_set(question_id)) right 
from ex_exam_record 
where score = 1 
group by student_id,score) t1
join
(select student_id,score,concat_ws(",",collect_set(question_id)) half
from ex_exam_record 
where score = 0.5
group by student_id,score) t2
on t1.student_id = t2.student_id
join
(select student_id,score,concat_ws(",",collect_set(question_id)) error
from ex_exam_record 
where score = 0
group by student_id,score) t3
on t1.student_id = t3.student_id;

注：上述代码也可以简写成这样：

with t as 
(select student_id, score, 
concat_ws(",",collect_set(question_id)) as question_id 
from ex_exam_record 
group by student_id,score),
a1 as (select student_id, question_id from t where score=1),
a2 as (select student_id, question_id from t where score=0.5),
a3 as (select student_id, question_id from t where score=0)
insert into ex_exam_question 
select a1.student_id, a1.question_id right, a2.question_id half, a3.question_id error 
from a1 
join a2 on a1.student_id = a2.student_id 
join a3 on a1.student_id = a3.student_id
;

也可以换一种思路，用 case when 简写成这样：

insert into ex_exam_question 
select
student_id,
concat_ws('',collect_list(t.right)) right,
concat_ws('',collect_list(t.half)) half,
concat_ws('',collect_list(t.error)) error
from 
(
select
student_id,
case when score=1.0 then concat_ws(",",collect_list(question_id)) else null end right,
case when score=0.5 then concat_ws(",",collect_list(question_id)) else null end half,
case when score=0.0 then concat_ws(",",collect_list(question_id)) else null end error
from ex_exam_record
group by student_id,score) t
group by student_id;

②完成统计后，在HBase Shell 中遍历 exam:analysis 表并只显示 question 列族中的数据，如下图所示：

在这里插入图片描述

由于 hive中外部表 ex_exam_question 已经映射至 hbase:exam:analysis 表 question 列簇，这里直接在habse shell 命令行查询即可

scan 'exam:analysis',{COLUMNS=>'question'}

在这里插入图片描述

我玩的很开心

关注

1
点赞
踩
3

收藏

觉得还不错? 一键收藏
1
评论
spark日志分析（二）—— 某答题系统日志

spark分析某答题系统日志一、数据描述二、数据下载三、需求1. 数据准备2. 在Spark-Shell 中，加载HDFS 文件系统answer_question.log 文件，并使用RDD 完成以下分析，也可使用Spark 的其他方法完成数据分析3. 创建HBase 数据表4. 创建Hive数据表5. 使用ex_exam_record表中的数据统计每个学员总分、答题的试题数和正确率，并保存到ex_exam_anlysis 表中，其中正确率的计算方法如下：正确率=总分/答题的试题数6. 使用ex_exam_
复制链接

扫一扫