practice2

buerba

Tags: hadoop spark hive hbase

Article link: https://blog.csdn.net/buerba/article/details/109199624

1. Data preparation (10 points): Create the directory /app/data/exam in HDFS and upload answer_question.log to that directory.

hdfs dfs -mkdir -p /app/data/exam
hdfs dfs -put /opt/baos/answer_question.log /app/data/exam

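2. In Spark-Shell, load the answer_question.log file from HDFS and complete the following analysis using RDDs; other Spark methods may also be used. (20 points)
① Extract the values of four fields from the log: knowledge point (topic) ID, student ID, question ID, and answer result.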

val logRDD = sc.textFile("hdfs://zhang:9000/app/data/exam/answer_question.log")
// Field 9 is an underscore-separated ID string whose question ID is followed by "r";
// field 10 is a comma-separated list whose first value is the answer result.
val logsRDD = logRDD.map(_.split(" ")).map { f =>
  val ids = f(9).split("_")
  (ids(1), ids(2), ids(3).split("r")(0), f(10).split(",")(0))
}
logsRDD.collect.foreach(println)
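The field positions used above can be tried out without a cluster. The following is a minimal sketch in plain Scala, assuming the layout implied by the code above: field 9 is an underscore-separated ID string whose question ID is followed by `r`, and field 10 is a comma-separated list that starts with the result. The helper `parseLine` and the sample line are hypothetical and are not taken from the real log.

```scala
// Hypothetical helper: returns None for lines that do not match the assumed layout
def parseLine(line: String): Option[(String, String, String, String)] = {
  val fields = line.split(" ")
  if (fields.length < 11) None
  else {
    val ids = fields(9).split("_")
    if (ids.length < 4) None
    else {
      val questionId = ids(3).takeWhile(_ != 'r')  // keep the part before the first 'r'
      val result = fields(10).takeWhile(_ != ',')  // keep the first comma-separated value
      Some((ids(1), ids(2), questionId, result))
    }
  }
}

// Made-up line with 11 space-separated fields (f0..f8 are placeholders)
val sample = "f0 f1 f2 f3 f4 f5 f6 f7 f8 req_34434412_8195023659599_1018r 1,t1,,"
val parsed = parseLine(sample)
println(parsed)
```

In Spark-Shell, something like `logRDD.flatMap(line => parseLine(line))` should then skip malformed lines instead of failing with an index error, though that has not been checked against the real log file.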

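② Save the extracted knowledge point ID, student ID, question ID, and answer result values as files in the HDFS directory /app/data/result, one record per line, with fields separated by "\t". (Hint: a tuple can be joined into a string with tuple.productIterator.mkString("\t"); other approaches also earn credit as long as the result is correct.)
[Image: expected output file format]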

logsRDD.map(x=>x.productIterator.mkString("\t")).saveAsTextFile("hdfs://zhang:9000/app/data/result")
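The hint about productIterator can be checked on a single local tuple. This is a small sketch with made-up values; `record` and `line` are illustrative names only.

```scala
// A tuple in the same shape as the extracted records (values are made up)
val record = ("34434412", "8195023659599", "1018", "1")

// productIterator walks the tuple's elements; mkString joins them with tabs
val line = record.productIterator.mkString("\t")
println(line)
```

Note that saveAsTextFile writes a directory of part files and fails if the target directory already exists, so /app/data/result has to be removed before the job is rerun.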


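3. Create the HBase table (10 points): In HBase, create the namespace exam and, within it, the table analysis, using the student ID as the RowKey. The table has two column families, accuracy and question. The accuracy column family stores each student's answer accuracy statistics (total score accuracy:total_score, number of questions answered accuracy:question_count, accuracy rate accuracy:accuracy); the question column family stores the IDs of the questions each student got right, wrong, or half right (right question:right, wrong question:error, half right question:half).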

create_namespace 'exam'
create 'exam:analysis','accuracy','question'

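4. In Hive, create the database exam. In that database, create the external table ex_exam_record pointing to the Spark-processed log data under /app/data/result; create the external table ex_exam_anlysis mapped to the accuracy column family of the HBase analysis table; and create the external table ex_exam_question mapped to the question column family of the HBase analysis table. (20 points)
[Images: table structures of ex_exam_record, ex_exam_anlysis, and ex_exam_question]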

create database if not exists exam;
use exam;

create external table if not exists ex_exam_record(
topic_id string,
student_id string,
question_id string,
score string
)
row format delimited
fields terminated by '\t'
stored as textfile
location '/app/data/result'
;


create external table if not exists ex_exam_anlysis(
student_id string,
total_score float,
question_count int,
accuracy float
)
stored by  'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with serdeproperties
("hbase.columns.mapping"=":key,accuracy:total_score,accuracy:question_count,accuracy:accuracy")
tblproperties("hbase.table.name"="exam:analysis");

create external table if not exists ex_exam_question(
student_id string,
-- right is a reserved keyword in Hive 1.2+, so it is quoted with backticks
`right` string,
half string,
error string
)
stored by  'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with serdeproperties
("hbase.columns.mapping"=":key,question:right,question:half,question:error")
tblproperties("hbase.table.name"="exam:analysis");
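The entries of hbase.columns.mapping pair up with the Hive columns by position, so the two lists must have the same length and order. The following sketch in plain Scala lines them up for ex_exam_anlysis using the DDL above; `hiveColumns`, `hbaseColumns`, and `pairs` are illustrative names only.

```scala
// Hive columns of ex_exam_anlysis, in declaration order
val hiveColumns = Seq("student_id", "total_score", "question_count", "accuracy")

// The hbase.columns.mapping value from the DDL above
val mapping = ":key,accuracy:total_score,accuracy:question_count,accuracy:accuracy"
val hbaseColumns = mapping.split(",").toSeq

// Entries pair up by position, so both sides must have the same length
require(hiveColumns.length == hbaseColumns.length, "column count mismatch")
val pairs = hiveColumns.zip(hbaseColumns)
pairs.foreach { case (hive, hbase) => println(s"$hive -> $hbase") }
```

The first entry, :key, binds student_id to the HBase RowKey; every other entry names a column family and qualifier.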

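5. Using the data in ex_exam_record, compute each student's total score, number of questions answered, and accuracy rate, and save the results to ex_exam_anlysis. Accuracy is calculated as: accuracy = total score / number of questions answered. (20 points)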

insert into table ex_exam_anlysis select student_id,sum(score) as total_score,count(question_id),sum(score)/count(question_id) as accuracy from ex_exam_record group by student_id;
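To sanity-check the formula on a small sample, the same aggregation can be sketched in plain Scala. The rows below are made up, and `records` and `stats` are illustrative names only.

```scala
// Made-up (student_id, question_id, score) rows; scores are 1, 0.5, or 0
val records = Seq(
  ("s1", "q1", 1.0), ("s1", "q2", 0.5), ("s1", "q3", 0.0),
  ("s2", "q1", 1.0), ("s2", "q2", 1.0)
)

// Per student: (total score, number of questions, total / count)
val stats: Map[String, (Double, Int, Double)] =
  records.groupBy(_._1).map { case (student, rows) =>
    val total = rows.map(_._3).sum
    val count = rows.size
    (student, (total, count, total / count))
  }

stats.foreach { case (student, (total, count, accuracy)) =>
  println(s"$student\t$total\t$count\t$accuracy")
}
```

Here s1 scores 1.5 over 3 questions, giving an accuracy of 0.5, and s2 scores 2.0 over 2 questions, giving 1.0.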

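6. Using the data in ex_exam_record, compile each student's lists of questions answered correctly, incorrectly, and half correctly.
① Separate the question IDs with commas and save them to ex_exam_question. (10 points)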

-- One pass with conditional aggregation: inner-joining three per-category subqueries
-- would drop any student who has no question in one of the categories.
insert into table ex_exam_question
select
  student_id,
  concat_ws(',', collect_list(if(score = 1, question_id, null))) as `right`,
  concat_ws(',', collect_list(if(score = 0.5, question_id, null))) as half,
  concat_ws(',', collect_list(if(score = 0, question_id, null))) as error
from ex_exam_record
group by student_id;
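The expected per-student lists can be worked out on a small sample in plain Scala. This is only a sketch with made-up rows; `answers`, `idsWithScore`, and `lists` are hypothetical names.

```scala
// Made-up (student_id, question_id, score) rows; score is kept as a string, as in the Hive table
val answers = Seq(
  ("s1", "q1", "1"), ("s1", "q2", "0.5"), ("s1", "q3", "0"), ("s1", "q4", "1"),
  ("s2", "q1", "1"), ("s2", "q3", "0")
)

// Comma-joined question IDs among the given rows that have the given score
def idsWithScore(rows: Seq[(String, String, String)], score: String): String =
  rows.filter(_._3 == score).map(_._2).mkString(",")

// Per student: (right, half, error) lists
val lists: Map[String, (String, String, String)] =
  answers.groupBy(_._1).map { case (student, rows) =>
    (student, (idsWithScore(rows, "1"), idsWithScore(rows, "0.5"), idsWithScore(rows, "0")))
  }

lists.foreach { case (student, (right, half, error)) =>
  println(s"$student\tright=$right\thalf=$half\terror=$error")
}
```

Student s2 has no half-right answers and still appears, with an empty half list. This is a useful case to compare against the rows that end up in ex_exam_question. The order of IDs within a list is stable in this local sketch, but Hive's collect_list does not guarantee any particular order.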

② After the statistics are complete, scan the exam:analysis table in the HBase shell and display only the data in the question column family. (10 points)

scan 'exam:analysis',{COLUMNS=>'question'}
