Hadoop/Hive/HBase/Data Integration Stage Test: Reference Answers

没钳蟹蟹

Last modified: 2023-03-08 17:11:31


First published: 2022-07-26 16:55:01

Article link: https://blog.csdn.net/holiday0520/article/details/125998446


I. Multiple-choice questions (10 questions, 1 point each)

1-5: B B D C B

6-10: C D B D B

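II. Fill-in-the-blank questions (10 points total, 1 point per blank)

1. start-dfs.sh

2. 128

3. InputFormat

4. YARN, MapReduce, HDFS

5. 10000

6. Length principle, hashing principle, uniqueness principle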

III. True/false questions (10 questions, 1 point each)

1-10: X X X X X X X X X X

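IV. Short-answer questions (5 questions, 4 points each)

1. Briefly describe the role of the SecondaryNameNode.

It periodically merges the NameNode's edit logs into the fsimage file (a checkpoint). This keeps the edit log from growing without bound and shortens NameNode restart time. It is not a hot standby for the NameNode.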
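The merge step can be pictured with a toy model. The sketch below is plain Java, not Hadoop code: the class `CheckpointSketch`, the `Edit` type, and the two operation names are all hypothetical, and the namespace is reduced to a set of paths. It only illustrates the idea of replaying logged edits on top of the previous image to produce a new one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Toy model of an HDFS checkpoint; hypothetical names, not Hadoop code.
class CheckpointSketch {
    // One logged namespace change: "CREATE" or "DELETE" of a path.
    static final class Edit {
        final String op;
        final String path;

        Edit(String op, String path) {
            this.op = op;
            this.path = path;
        }
    }

    // Replays the edit log on top of the old image and returns the new image.
    static TreeSet<String> checkpoint(TreeSet<String> fsimage, List<Edit> edits) {
        TreeSet<String> merged = new TreeSet<>(fsimage);
        for (Edit e : edits) {
            if (e.op.equals("CREATE")) {
                merged.add(e.path);
            } else if (e.op.equals("DELETE")) {
                merged.remove(e.path);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        TreeSet<String> fsimage = new TreeSet<>(Arrays.asList("/a", "/b"));
        List<Edit> edits = new ArrayList<>(Arrays.asList(
                new Edit("CREATE", "/c"),
                new Edit("DELETE", "/a")));
        TreeSet<String> newImage = checkpoint(fsimage, edits);
        edits.clear(); // once merged, the replayed edits no longer need to be kept
        System.out.println(newImage); // prints [/b, /c]
    }
}
```

After the merge the edit log can start again from empty, which is why a restart only has to replay a short log.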

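2. Briefly describe the HBase write path.

The client first issues a write request. It connects to ZooKeeper and reads the meta-region-server entry to find the location of the meta table (an HBase system table that records the region information of every user table). Suppose the meta table is on node2: the client connects to node2 and reads the meta table, then uses the rowkey of the data being written (usually a Put object) to determine which region the Put belongs to. Suppose that region is on node1: the client then connects to the RegionServer on node1 and writes the data, first to the HLog and then to the MemStore of the target region. When the MemStore reaches 128 MB it is flushed to disk as a StoreFile, and StoreFiles are ultimately stored on HDFS in HFile format.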
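The ordering on the RegionServer side (log first, memory second, flush when full) can be sketched as follows. This is a toy model in plain Java, not the HBase API: `WritePathSketch` and its fields are hypothetical names, and the flush threshold is counted in cells rather than bytes to keep the example small.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model of the RegionServer side of a write; hypothetical names, not the HBase API.
class WritePathSketch {
    final List<String> wal = new ArrayList<>();                 // stands in for the HLog
    final TreeMap<String, String> memStore = new TreeMap<>();   // kept sorted by rowkey
    final List<TreeMap<String, String>> storeFiles = new ArrayList<>();
    final int flushThreshold; // assumption: counted in cells, not bytes

    WritePathSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void put(String rowKey, String value) {
        wal.add(rowKey + "=" + value); // 1. append to the log first, so the write can be recovered
        memStore.put(rowKey, value);   // 2. then update the in-memory store
        if (memStore.size() >= flushThreshold) {
            flush();                   // 3. a full MemStore is flushed to a new store file
        }
    }

    void flush() {
        storeFiles.add(new TreeMap<>(memStore)); // sorted snapshot of the MemStore
        memStore.clear();
    }

    public static void main(String[] args) {
        WritePathSketch region = new WritePathSketch(2);
        region.put("row3", "c");
        region.put("row1", "a"); // the second cell reaches the threshold and triggers a flush
        region.put("row2", "b");
        System.out.println(region.storeFiles); // prints [{row1=a, row3=c}]
        System.out.println(region.memStore);   // prints {row2=b}
    }
}
```

Note that the flushed file is sorted by rowkey even though the rows arrived out of order, because the in-memory store keeps its entries sorted.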

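3. Briefly describe the MapReduce flow.

Before data reaches the Map tasks, the input is divided into splits and parsed into key-value pairs; each split produces one Map task. The Map side preprocesses the data, applies simple filtering, and emits key-value pairs in the form the Reduce side needs. Once map processing finishes, the shuffle write phase begins: output is first written to a circular buffer, 100 MB by default, and when the buffer is 80% full its contents are spilled to disk. During a spill the data is partitioned and quick-sorted, with the number of partitions equal to the number of Reduce tasks. All spill files produced by one Map task are then merged with a merge sort, so each Map task ends up with a single partitioned output file. Next comes the shuffle read phase, in which identical keys go to the same Reduce task: each Reduce task pulls its own partition from every Map task's output file, then sorts and merges the data. Finally the Reduce side performs aggregation or other computation. Each Reduce task writes one output file, and the results are stored in HDFS.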
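The partition, sort, and reduce steps can be illustrated with a single-process toy model. The sketch below is plain Java rather than Hadoop code; `ShuffleSketch` and its methods are hypothetical names, the circular buffer and spill files are left out, and a sorted map per partition stands in for the sort and merge work. The partition function mirrors the common hash-modulo idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy, single-process model of map -> shuffle -> reduce; hypothetical names, not Hadoop code.
class ShuffleSketch {
    // Hash partitioning: equal keys always land in the same partition.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Returns one sorted word -> count map per reducer.
    static List<TreeMap<String, Integer>> run(List<String> splits, int numReducers) {
        // Shuffle state: one sorted key -> values map per partition.
        List<TreeMap<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new TreeMap<>());
        }
        // Map phase: each split emits (word, 1) pairs, routed to a partition by key.
        for (String split : splits) {
            for (String word : split.split(",")) {
                partitions.get(partition(word, numReducers))
                        .computeIfAbsent(word, k -> new ArrayList<>())
                        .add(1);
            }
        }
        // Reduce phase: each reducer sums the grouped values of its own partition.
        List<TreeMap<String, Integer>> outputs = new ArrayList<>();
        for (TreeMap<String, List<Integer>> p : partitions) {
            TreeMap<String, Integer> out = new TreeMap<>();
            for (Map.Entry<String, List<Integer>> e : p.entrySet()) {
                int sum = 0;
                for (int v : e.getValue()) {
                    sum += v;
                }
                out.put(e.getKey(), sum);
            }
            outputs.add(out);
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("hadoop,hive,hadoop", "hbase,hive,hadoop");
        // One output map per reducer; which words land in which map depends on their hash.
        System.out.println(run(splits, 2));
    }
}
```

Each word appears in exactly one reducer's output, which is the property the shuffle exists to guarantee.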

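4. Briefly describe Hive optimization.

Use MapJoin to load the small table entirely into memory and perform the join on the map side, avoiding the reduce stage.

Use partitioning to avoid full-table scans and improve query efficiency.

Use external tables to avoid accidental data deletion (dropping an external table removes only the metadata).

Choose an appropriate file storage format and compression format.

In select statements, fetch only the columns you need; if the table is partitioned, filter on the partition columns where possible and avoid select *.

Set the number of Map and Reduce tasks sensibly.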
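The first point, the map-side join, comes down to a hash lookup against a small in-memory table while the large table is streamed. The sketch below is a toy model in plain Java, not Hive code; `MapJoinSketch`, the column layout, and the sample rows are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of a map-side (broadcast) join; hypothetical names and data, not Hive code.
class MapJoinSketch {
    // smallTable: cityId -> cityName, held fully in memory.
    // bigRows: each row is {cityId, confirmedCount}, processed one at a time.
    static List<String> join(Map<String, String> smallTable, List<String[]> bigRows) {
        List<String> joined = new ArrayList<>();
        for (String[] row : bigRows) {
            String name = smallTable.get(row[0]); // hash lookup, so no shuffle is needed
            if (name != null) {                   // inner join: unmatched rows are dropped
                joined.add(name + ":" + row[1]);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> cities = new HashMap<>();
        cities.put("1", "Hefei");
        cities.put("2", "Wuhan");
        List<String[]> facts = Arrays.asList(
                new String[]{"1", "5"},
                new String[]{"2", "7"},
                new String[]{"3", "9"}); // city id 3 has no match in the small table
        System.out.println(join(cities, facts)); // prints [Hefei:5, Wuhan:7]
    }
}
```

Because every mapper holds its own copy of the small table, this only pays off when that table comfortably fits in memory.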

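5. Briefly describe any two data collection tools and the scenarios they suit.

Sqoop: offline (batch); transfers data between relational databases (Oracle, MySQL, PostgreSQL, etc.) and Hadoop.

DataX: offline (batch); provides efficient data synchronization between heterogeneous data sources, including MySQL, Oracle, SqlServer, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), and DRDS.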

V. Coding questions (50 points)

1. Complete the MapReduce WordCount code (20 points)

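import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;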

public class Demo01WordCount {
    // Map task (2 points)
    public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Map task logic (5 points): split each line on commas and emit (word, 1)
            String[] splits = value.toString().split(",");
            for (String word : splits) {
                Text outputKey = new Text(word);
                IntWritable outputValue = new IntWritable(1);
                context.write(outputKey,outputValue);
            }
        }
    }

    // Reduce task
    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // Reduce logic (5 points): sum the counts for each word
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
    
    // Driver: configures and submits the job
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();

        // Default HDFS entry point (2 points)
        conf.set("fs.defaultFS", "hdfs://master:9000");

        // Create a MapReduce job
        Job job = Job.getInstance(conf);
        // Configure the job
        // Set the job name
        job.setJobName("Demo01WordCount");
        // Set the class used to locate the job jar
        job.setJarByClass(Demo01WordCount.class);

        // Configure the Map side (3 points)
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Configure the Reduce side (3 points)
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Configure the input and output paths
        // Wait for the job to finish
        /*
         * Remaining code omitted
         */

    }
}

2. Hive data analysis (30 points)

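1. Create a directory under the HDFS root, named with **your initials** followed by "**/data**" (for example, for 张三 (Zhang San) the directory is **/zs/data**), and upload the provided epidemic data (**covid19.csv**) to that directory. List the commands used. (4 points)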
hdfs dfs -mkdir -p /xqb/data
hdfs dfs -put covid19.csv /xqb/data


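2. In Hive, create an external table named ods_yiqing_data with the same field names as above, "," as the column delimiter, and the directory created in **question 1** as the storage location. (5 points)
-- Date is a reserved keyword in Hive, so the column name is quoted with backticks
create external table if not exists ods_yiqing_data(
 `Date` String comment 'date'
,Province String comment 'province'
,City String comment 'city'
,Confirm BigInt comment 'newly confirmed cases'
,Heal BigInt comment 'newly discharged cases'
,Dead BigInt comment 'new deaths'
,Source String comment 'information source')
row format delimited fields terminated by ','
location '/xqb/data';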


3. Compute the total number of newly confirmed cases in Hefei (合肥市) for each month, sorted by total in descending order. Provide the SQL statement. (6 points)
-- The first two characters of Date are taken as the month, e.g. '3月'
select substring(`Date`,1,2) as yue
    ,City
    ,sum(Confirm) as sum_confirm
from ods_yiqing_data
where City='合肥市'
group by substring(`Date`,1,2),City
order by sum_confirm desc;


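4. Compute the total number of newly confirmed cases in March for each city in Anhui province (安徽), sorted by total in descending order. Provide the SQL statement. (7 points)
select substring(`Date`,1,2) as sub_date
    ,sum(Confirm) as sum_confirm
    ,City
from ods_yiqing_data
where Province='安徽' and substring(`Date`,1,2)='3月'
group by substring(`Date`,1,2),City
order by sum_confirm desc;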


5. For each month, find the 3 cities in Hubei province (湖北) with the highest total of newly discharged cases. Provide the SQL statement. (8 points)
select t1.sub_date
    ,t1.sum_heal
    ,t1.City
    ,t1.rn
from(
   select substring(`Date`,1,2) as sub_date
       ,sum(Heal) as sum_heal
       ,City
       ,row_number() over(partition by substring(`Date`,1,2) order by sum(Heal) desc) as rn
   from ods_yiqing_data
   where Province='湖北'
   group by substring(`Date`,1,2),City
)t1
where t1.rn<=3;
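
To check the logic of the last query outside Hive, the same "aggregate, rank within each group, keep the top N" pattern can be written in plain Java. This is only a sketch: `TopNPerGroupSketch`, the `Row` type, and the sample figures are made up, and ties are broken by city name rather than by whatever order Hive happens to produce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy equivalent of "group by + row_number() <= n"; hypothetical names and sample data.
class TopNPerGroupSketch {
    static final class Row {
        final String month;
        final String city;
        final long heal;

        Row(String month, String city, long heal) {
            this.month = month;
            this.city = city;
            this.heal = heal;
        }
    }

    // Sums heal per (month, city), then keeps the n cities with the largest sum in each month.
    static Map<String, List<String>> topCities(List<Row> rows, int n) {
        // Equivalent of: group by month, city with sum(heal)
        Map<String, Map<String, Long>> sums = new TreeMap<>();
        for (Row r : rows) {
            sums.computeIfAbsent(r.month, k -> new TreeMap<>())
                    .merge(r.city, r.heal, Long::sum);
        }
        // Equivalent of: row_number() over (partition by month order by sum desc) <= n
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, Long>> e : sums.entrySet()) {
            List<Map.Entry<String, Long>> cities = new ArrayList<>(e.getValue().entrySet());
            cities.sort(Map.Entry.<String, Long>comparingByValue().reversed());
            List<String> top = new ArrayList<>();
            for (int i = 0; i < Math.min(n, cities.size()); i++) {
                top.add(cities.get(i).getKey());
            }
            result.put(e.getKey(), top);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("02", "Wuhan", 10),
                new Row("02", "Wuhan", 5),
                new Row("02", "Yichang", 7),
                new Row("02", "Xiaogan", 3),
                new Row("02", "Jingzhou", 1),
                new Row("03", "Wuhan", 2));
        System.out.println(topCities(rows, 3)); // prints {02=[Wuhan, Yichang, Xiaogan], 03=[Wuhan]}
    }
}
```

A month with fewer than three cities simply returns all of them, which matches the behaviour of the row_number filter.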