[Past Exam Paper] July 2020 Hands-on Coding Exam

Helltaker

Article link: https://blog.csdn.net/Helltaker/article/details/112991140

July 2020 Hands-on Coding Exam

I. Environment Requirements
II. Submission Requirements
III. Data Description
IV. Functional Requirements

I. Environment Requirements

sandbox-hdp 2.6.4, or a self-built Hadoop + Hive + Spark + HBase development environment of an equivalent version.

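II. Submission Requirements

1. Source code or the corresponding analysis statements must be submitted; otherwise no points are awarded.
2. For tasks that produce analysis results, submit a screenshot of the results together with the code.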

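III. Data Description

This is a grading log of student answers from an online examination system. Each entry records the log generation time, the question difficulty coefficient, the ID of the knowledge topic the question belongs to, the student ID, the question ID, and the graded result. The log structure is as follows:
[Image: log structure]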

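IV. Functional Requirements

1. Data preparation (10 points)

Create the directory /app/data/exam in HDFS and upload answer_question.log to it. (The commands below use /app/data/exam202007 rather than /app/data/exam; the same 202007 suffix is applied to the HBase namespace and the Hive database later on.)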

hdfs dfs -mkdir -p /app/data/exam202007
hdfs dfs -put answer_question.log /app/data/exam202007/

2. In Spark-Shell, load the answer_question.log file from HDFS and complete the following analysis with RDDs; other Spark approaches are also acceptable. (20 points)

val fileRDD=sc.textFile("hdfs://HadoopY:9000/app/data/exam202007")

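① Extract the values of four fields from the log: topic ID, student ID, question ID, and answer result.

// Normalize the separators "r " and "," to "_", then split each line into fields
fileRDD.map(_.replace("r ", "_").replace(",", "_").split("_"))
	.map(x => (x(1), x(2), x(3), x(4)))   // (topic ID, student ID, question ID, result)
	.foreach(println)                     // on a cluster this prints on the executors; use take(n) to preview on the driver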
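The extraction step above indexes straight into the split array, so a malformed line would throw an ArrayIndexOutOfBoundsException. A minimal sketch of a more defensive variant follows. The figure showing the real log structure is not available here, so `sampleLine` is a hypothetical layout chosen only to match the replace-and-split logic, and `parseLine` is an illustrative helper name, not part of the original solution.

```scala
// Hypothetical sample line: the real log layout is not shown in this article,
// so this format is an assumption that merely fits the replace/split logic.
val sampleLine = "2019-06-16 14:24:34 INFO - 3r 34434412_8195023659599_1018,1"

/** Returns (topicId, studentId, questionId, result), or None for malformed lines. */
def parseLine(line: String): Option[(String, String, String, String)] = {
  // Same normalization as the RDD pipeline: turn "r " and "," into "_", then split
  val parts = line.replace("r ", "_").replace(",", "_").split("_")
  if (parts.length >= 5) Some((parts(1), parts(2), parts(3), parts(4)))
  else None
}

val parsed = parseLine(sampleLine)
val rejected = parseLine("garbage")
```

In Spark-Shell this could be applied as `fileRDD.flatMap(line => parseLine(line))`, which silently drops lines that do not split into at least five parts.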

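② Save the extracted topic ID, student ID, question ID, and answer result values as files under the HDFS directory /app/data/result, one record per line, with fields separated by "\t". The file format is shown below. (Hint: a tuple can be joined into a string with tuple.productIterator.mkString("\t"); any other approach is also accepted as long as the result is correct.)

[Image: sample of the output file format]

fileRDD.map(_.replace("r ", "_").replace(",", "_").split("_"))
	.map(x => (x(1), x(2), x(3), x(4)).productIterator.mkString("\t"))   // one tab-separated line per record
	.saveAsTextFile("hdfs://HadoopY:9000/app/data/result")
	// saveAsTextFile fails if the output path already exists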
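The hint in the task relies on the fact that every Scala tuple is a `Product`, so `productIterator` walks its fields in order. A small plain-Scala sketch of that formatting step, using a made-up record (the values are illustrative only):

```scala
// Hypothetical record: (topicId, studentId, questionId, result)
val record = ("34434412", "8195023659599", "1018", "1")

// productIterator yields the tuple's fields in declaration order;
// mkString joins them with a tab, giving one output line per record
val tsvLine: String = record.productIterator.mkString("\t")

// Splitting on the same delimiter recovers the four fields
val fields: Array[String] = tsvLine.split("\t")
```

Because the output is plain tab-separated text, a Hive table declared with `fields terminated by '\t'` can read the saved files directly.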

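3. Create the HBase table (10 points)

In HBase, create the namespace exam and, within it, the table analysis, using the student ID as the RowKey. The table has two column families, accuracy and question. The accuracy column family stores each student's accuracy statistics (total score accuracy:total_score, number of questions answered accuracy:question_count, accuracy accuracy:accuracy); the question column family stores the IDs of the questions each student answered correctly, incorrectly, and half-correctly (correct question:right, incorrect question:error, half-correct question:half).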

create_namespace 'exam202007'
create 'exam202007:analysis','accuracy','question'

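4. In Hive, create the database exam. In that database, create the external table ex_exam_record pointing to the Spark-processed log data under /app/data/result; create the external table ex_exam_analysis (spelled ex_exam_anlysis in the exam paper) mapped to the accuracy column family of the HBase analysis table; and create the external table ex_exam_question mapped to the question column family of the same table. (20 points)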

create database exam202007;
use exam202007;

The structure of the ex_exam_record table is as follows:
[Image: ex_exam_record table structure]

create external table ex_exam_record(
topic_id string,
student_id string,
question_id string,
score float)
row format delimited
fields terminated by '\t'
stored as textfile
location '/app/data/result';

The structure of the ex_exam_analysis table is as follows:
[Image: ex_exam_analysis table structure]

create external table if not exists ex_exam_analysis(
student_id string,
total_score float,
question_count int,
accuracy float)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping"=":key,accuracy:total_score,accuracy:question_count,accuracy:accuracy")
tblproperties("hbase.table.name"="exam202007:analysis");

The structure of the ex_exam_question table is as follows:

[Image: ex_exam_question table structure]

-- right is a reserved keyword in recent Hive versions, so the column name is quoted with backticks
create external table if not exists ex_exam_question(
student_id string,
`right` string,
half string,
error string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping"=":key,question:right,question:half,question:error")
tblproperties("hbase.table.name"="exam202007:analysis");

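5. Using the data in the ex_exam_record table, compute each student's total score, number of questions answered, and accuracy, and save the results to the ex_exam_analysis table. Accuracy is computed as: accuracy = total score / number of questions answered. (20 points)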

insert into ex_exam_analysis
select student_id,
sum(score),
count(question_id),
sum(score)/count(question_id)
from ex_exam_record group by student_id;
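
To cross-check the Hive aggregation on a small sample, the same per-student statistics can be computed with plain Scala collections. This is only a sketch: the `records` below are invented test data, not rows from the real log, and `stats` is an illustrative name.

```scala
// Hypothetical records: (topicId, studentId, questionId, score)
val records = Seq(
  ("t1", "s1", "q1", 1.0f),
  ("t1", "s1", "q2", 0.5f),
  ("t2", "s1", "q3", 0.0f),
  ("t1", "s2", "q1", 1.0f)
)

// studentId -> (totalScore, questionCount, accuracy),
// mirroring sum(score), count(question_id), sum(score)/count(question_id)
val stats: Map[String, (Float, Int, Float)] =
  records.groupBy(_._2).map { case (student, rows) =>
    val total = rows.map(_._4).sum
    val count = rows.size
    student -> (total, count, total / count)
  }
```

Here student s1 answers three questions for a total of 1.5 points, giving an accuracy of 0.5, which is the value the Hive query should produce for the same rows.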

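6. Using the data in the ex_exam_record table, build each student's lists of questions answered correctly, incorrectly, and half-correctly.
① Join the question IDs with commas and save them to the ex_exam_question table. (10 points)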

insert into ex_exam_question
select student_id,
concat_ws(",",collect_set(if(score=1,question_id,NULL))),
concat_ws(",",collect_set(if(score=0.5,question_id,NULL))),
concat_ws(",",collect_set(if(score=0,question_id,NULL)))
from ex_exam_record
group by student_id;
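
The query above combines three steps per student: keep only the question IDs with a given score, deduplicate them (`collect_set`), and join them with commas (`concat_ws`). A plain-Scala sketch of the same logic on invented data may make that easier to follow; `answers`, `idsWithScore`, and `questionLists` are illustrative names, and unlike `collect_set`, Scala's `distinct` keeps first-occurrence order.

```scala
// Hypothetical records: (topicId, studentId, questionId, score)
val answers = Seq(
  ("t1", "s1", "q1", 1.0f),
  ("t1", "s1", "q1", 1.0f), // duplicate answer, removed by distinct
  ("t1", "s1", "q2", 0.5f),
  ("t2", "s1", "q3", 0.0f),
  ("t1", "s2", "q1", 1.0f)
)

// Comma-joined, deduplicated question IDs whose score equals the given value
def idsWithScore(rows: Seq[(String, String, String, Float)], score: Float): String =
  rows.filter(_._4 == score).map(_._3).distinct.mkString(",")

// studentId -> (right, half, error), matching the column order of ex_exam_question
val questionLists: Map[String, (String, String, String)] =
  answers.groupBy(_._2).map { case (student, rows) =>
    student -> (idsWithScore(rows, 1.0f), idsWithScore(rows, 0.5f), idsWithScore(rows, 0.0f))
  }
```

A student with no questions in a category gets an empty string for that list in this sketch.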

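② After the statistics are complete, scan the exam:analysis table in HBase Shell and display only the data in the question column family, as shown in the figure below. (10 points)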

scan 'exam202007:analysis',{COLUMNS=>"question"}

[Image: expected scan output]
