Big Data Cleaning and Statistics Case Study (Part 2)

Latest recommended article published on 2024-07-12 17:53:13

西北峰转东风


Category: Big Data. Tags: Big Data.

Article link: https://blog.csdn.net/weixin_42634814/article/details/132124286


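I. Data requirements

1. Data cleaning
A record does not meet the requirements if:
1) it has 3 or more NULL fields;
2) any of the four fields "star rating (field 6), review count (field 11), rating (field 10), room count (field 8)" is NULL;
3) it duplicates another record (the repeats are dropped).
Delete every record that meets any of these three conditions, and print the number of rejected records for each condition.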
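
The three rules can be checked locally before they are moved into a job. The following is a minimal sketch in plain Java, not the author's cleaning code. It assumes comma-separated records (as in the mapper later in this post), 0-based field indexes (room count at index 8 matches `arr[8]` in that mapper), and that a missing value appears as the literal `NULL` or as an empty string. The class `CleaningSketch` and its members are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of the original post.
class CleaningSketch {
    // Assumed 0-based indexes: star rating 6, room count 8, rating 10, review count 11.
    static final int[] KEY_FIELDS = {6, 8, 10, 11};

    int tooManyNulls = 0; // records rejected by rule 1
    int keyFieldNull = 0; // records rejected by rule 2
    int duplicates = 0;   // records rejected by rule 3
    private final Set<String> seen = new HashSet<>();

    // Assumption: a missing value is the literal "NULL" or an empty string.
    static boolean isNull(String field) {
        String f = field.trim();
        return f.isEmpty() || f.equalsIgnoreCase("NULL");
    }

    // Returns true if the line is kept; otherwise increments the counter
    // of the first rule the line breaks, so each record is counted once.
    boolean keep(String line) {
        String[] arr = line.split(",", -1); // -1 keeps trailing empty fields
        int nulls = 0;
        for (String f : arr) {
            if (isNull(f)) {
                nulls++;
            }
        }
        if (nulls >= 3) {
            tooManyNulls++;
            return false;
        }
        for (int i : KEY_FIELDS) {
            if (i >= arr.length || isNull(arr[i])) {
                keyFieldNull++;
                return false;
            }
        }
        if (!seen.add(line)) { // add() returns false when the line was seen before
            duplicates++;
            return false;
        }
        return true;
    }

    // Applies keep() to every line and returns the surviving records in order.
    List<String> clean(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (keep(line)) {
                kept.add(line);
            }
        }
        return kept;
    }
}
```

A record that breaks several rules is counted only under the first one it fails, which is one possible reading of "the number of rejected records for each condition". The in-memory `HashSet` only works on a single machine; in a MapReduce job, duplicates are commonly removed by emitting the whole line as the map output key so that identical records meet in one reduce call.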

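2. Based on the output dataset of the data cleaning step, write a MapReduce program that
1) counts the hotels and rooms in each province;
2) sorts the provinces by room count in descending order and outputs the first 10 results.
The data is defined as follows:
province city hotel_num room_num
贵州 贵阳 1234 123456.0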

II. Code (implements the statistics output of item 2 above)

package com.mhys;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

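/***
 * 2. Based on the output dataset of the data cleaning step, write a MapReduce program that
 * 	1) counts the hotels and rooms in each province;
 * 	2) sorts the provinces by room count in descending order and outputs the first 10 results.
 * The data is defined as follows:
 * province	city	hotel_num	room_num
 * 贵州		贵阳	1234		123456.0
 */

// This class produces the per-province totals for item 2 above;
// it does not sort by room count or limit the output to 10 rows.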
public class MyOperator1 {
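    static class MyOperatorMapper1 extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Each cleaned record is one comma-separated line.
            String[] arr = value.toString().split(",");
            String province = arr[3]; // province
            // Room count. Parsed as a double first so that values written like
            // "123.0" (the sample row shows 123456.0) do not throw.
            int rooms = (int) Double.parseDouble(arr[8].trim());
            // Emit (province, rooms): one pair per hotel record.
            context.write(new Text(province), new IntWritable(rooms));
        }
    }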

    static class MyOperatorReducer1 extends Reducer<Text, IntWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int hotel_num = 0; // hotels in this province (one value per hotel record)
            int room_num = 0;  // total rooms in this province
            for (IntWritable iw : values) {
                hotel_num += 1;
                room_num += iw.get();
            }
            // Output line: province <TAB> hotel_num <TAB> room_num
            String result = key.toString() + "\t" + hotel_num + "\t" + room_num;
            context.write(new Text(result), NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // default configuration for running MapReduce
        Job job = Job.getInstance(conf);          // create a job from the default configuration
        job.setJarByClass(MyOperator1.class);     // the class used to locate the job's jar

        job.setMapperClass(MyOperatorMapper1.class);
        job.setReducerClass(MyOperatorReducer1.class);
        // key/value types emitted by the map side
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // key/value types emitted by the reduce side
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        // input path; /out2 is presumably the output of the data cleaning step
        FileInputFormat.addInputPath(job, new Path("/out2"));
        // output directory for the results (it must not exist before the job runs)
        FileOutputFormat.setOutputPath(job, new Path("/out3"));
        // submit the job and wait for it to finish
        boolean flag = job.waitForCompletion(true);
        System.exit(flag ? 0 : 1); // exit code 0 on success, 1 on failure
    }


}
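
The job above writes one line per province in the form `province<TAB>hotel_num<TAB>room_num`, but it does not order the lines by room count or keep only the first 10. Because there is at most one line per province, one option is to sort that small result in memory after the job finishes. The following is a minimal sketch of that step in plain Java, not the author's solution; the class `TopProvincesSketch` and the method `topByRooms` are hypothetical names, and reading the lines from the `/out3` output is left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration; not part of the original post.
class TopProvincesSketch {
    // Sorts "province\thotel_num\troom_num" lines by room_num in descending
    // order and returns at most the first n of them. The input list is not modified.
    static List<String> topByRooms(List<String> lines, int n) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort(
            Comparator.comparingLong((String l) -> Long.parseLong(l.split("\t")[2]))
                      .reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }
}
```

Calling `TopProvincesSketch.topByRooms(lines, 10)` on the reducer output lines would give the rows asked for in requirement 2). An alternative that stays inside Hadoop is a second job that uses the room count as the map output key with a descending sort comparator and a single reducer.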
