Big Data Course: MapReduce Programming Basics

冰冷灬泡面

Article link: https://blog.csdn.net/weixin_43334251/article/details/118302721


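Experiment content and requirements

When a document is small, a traditional single-machine program can count the words in it without trouble. Once the data grows very large (for example, into the GB to PB range), a single machine can no longer process it in reasonable time, and a distributed framework such as MapReduce is needed.
Using the MapReduce programming framework, write a WordCount program that counts how many times each word appears in a text, and provide detailed steps together with the test results of the experiment.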
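
For comparison, the "traditional" single-machine approach mentioned above can be sketched roughly as follows. This is only an illustration in plain Java with no Hadoop dependency; the class name LocalWordCount is made up for this example. Everything lives in one in-memory map, which is exactly what stops working once the data no longer fits on one machine.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class name; plain Java, no Hadoop involved.
final class LocalWordCount {
    // Counts whitespace-separated words in memory on a single machine.
    static Map<String, Integer> count(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    // Add 1 to the existing count, or start at 1 for a new word.
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello world", "hello hadoop");
        System.out.println(count(lines));
    }
}
```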

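A few words of my own

This experiment is mainly about learning the concepts behind MapReduce. The key point is to understand what the Map and Reduce stages each do, because all MapReduce programming is organized around the characteristics of these two stages. (Strictly speaking, there are four stages: Map, Partition, Shuffle, and Reduce.)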

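Map stage: receives the raw input data, processes it as needed, and sets the key and value to emit.
Partition & Shuffle stages: these two stages work closely together. Partition runs on the local node where the map task ran and decides which reducer each key is sent to; Shuffle then moves the data across the network from the map nodes to the reduce nodes. Together they take the key-value pairs emitted by the Map stage, group them by key so that all pairs with the same key land in the same group, and deliver each group to a Reduce node.
Reduce stage: receives data in Key-ValueList form, performs the final processing, and emits key-value pairs.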
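
To make the Partition and Shuffle description more concrete, here is a small single-process simulation in plain Java. It is a sketch, not Hadoop code: the class and method names are invented for illustration, and the partition formula is assumed to mirror hash partitioning (hash of the key modulo the number of reducers). It shows how map output pairs are routed to a reducer and grouped into Key-ValueList form.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation; none of these names are Hadoop APIs.
final class PartitionShuffleSketch {
    // Partition: pick a reducer index for a key (assumed hash-based scheme).
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Shuffle: route each pair to its reducer and group values by key.
    // Element i of the result is the Key-ValueList input of reducer i.
    static List<Map<String, List<Integer>>> shuffle(
            List<Map.Entry<String, Integer>> mapOutput, int numReducers) {
        List<Map<String, List<Integer>>> reducerInputs = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducerInputs.add(new TreeMap<>()); // TreeMap keeps keys sorted
        }
        for (Map.Entry<String, Integer> pair : mapOutput) {
            int target = partition(pair.getKey(), numReducers);
            reducerInputs.get(target)
                    .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                    .add(pair.getValue());
        }
        return reducerInputs;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        for (String key : new String[] {"a", "b", "a", "c"}) {
            mapOutput.add(new SimpleEntry<>(key, 1));
        }
        List<Map<String, List<Integer>>> inputs = shuffle(mapOutput, 2);
        for (int i = 0; i < inputs.size(); i++) {
            System.out.println("reducer " + i + " receives " + inputs.get(i));
        }
    }
}
```

The important property is that every occurrence of a given key ends up at the same reducer, so each reducer can finish its keys without talking to the others.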


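This experiment uses word counting as its scenario. In the Map stage, the program receives the raw text (each line consists of several words separated by spaces) and splits it on whitespace to obtain the individual words. Because the goal is to count occurrences, each word is emitted as the key with the integer 1 as the value (meaning "this word appeared once").
In the Reduce stage, the input is in Key-ValueList form: the key is a word and the ValueList contains a series of 1s. A loop sums all the values in the ValueList, which gives the total number of occurrences of that word, and the result is emitted.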
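
The word-count logic just described can be tried out without a cluster. The following is a minimal sketch in plain Java: the class name WordCountLogic and the in-memory grouping in main are stand-ins I made up for illustration, and the real job uses Hadoop's Mapper and Reducer classes instead.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical stand-in for the Map and Reduce logic; not Hadoop code.
final class WordCountLogic {
    // Map: split one line on whitespace and emit (word, 1) for each word.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Reduce: sum the list of 1s collected for a single word.
    static int reduce(String word, List<Integer> ones) {
        int sum = 0;
        for (int value : ones) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Group map output by key in memory (the framework does this in the real job).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : map("hello world hello")) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                    .add(pair.getValue());
        }
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            System.out.println(entry.getKey() + "\t" + reduce(entry.getKey(), entry.getValue()));
        }
    }
}
```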

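Experiment steps

There is not much to say about the steps; the code speaks for itself. Finally, package the code into a jar and run it on the virtual machine.
Run command: java -jar WordCount.jar [text file path] [result output path]
Note that java -jar only works if the jar bundles its Hadoop dependencies. On a machine with Hadoop installed, the usual form is hadoop jar WordCount.jar [text file path] [result output path]. The output path must not already exist, otherwise the job fails on startup.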

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * @author: 冰冷灬泡面
 * @date: 2021/3/28 15:38
 * @description: MapReduce program that implements word counting
 * @modifiedBy:
 */
public class WordCount {
    public static void main(String[] args) throws Exception {
        // Hadoop configuration object
        Configuration conf = new Configuration();
        // GenericOptionsParser consumes generic Hadoop options and returns the remaining program arguments
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: wordcount <in> [<in>...] <out>");
            System.exit(2);
        }
        // Build the job object
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // The reducer doubles as a combiner to pre-aggregate counts on the map side
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);

        // Set the data types of the output key and value
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Every argument except the last is an input path of a file to count
        for (int i = 0; i < otherArgs.length - 1; i++) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        // The last argument is the output path for the results
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
        // Submit the job to Hadoop and wait; exit with 0 on success, 1 on failure
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    // Custom Reducer inner class containing the Reduce-stage code
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum all counts received for this word
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            this.result.set(sum);
            context.write(key, this.result);
        }
    }

    // Custom Mapper inner class containing the Map-stage code
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // By default, StringTokenizer splits on space, tab \t, newline \n, carriage return \r, and form feed \f
            StringTokenizer itr = new StringTokenizer(value.toString());
            // Emit (word, 1) for every token in the line
            while (itr.hasMoreTokens()) {
                this.word.set(itr.nextToken());
                context.write(this.word, one);
            }
        }
    }
}
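
One detail in the driver is worth a closer look: IntSumReducer is registered both as the reducer and as the combiner. A combiner pre-aggregates the output of a single map task before it is sent over the network. This is safe here because addition is associative and commutative, so summing partial sums gives the same total as summing every individual 1. The sketch below demonstrates that equivalence in plain Java; the class and method names are hypothetical and no Hadoop API is involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of why a sum reducer can also serve as a combiner.
final class CombinerSketch {
    static int sum(List<Integer> values) {
        int total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    // No combiner: every 1 emitted by every map task reaches the reducer.
    static int reduceWithoutCombiner(List<List<Integer>> perMapTask) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> taskOutput : perMapTask) {
            all.addAll(taskOutput);
        }
        return sum(all);
    }

    // With combiner: each map task sums locally, the reducer sums the partial sums.
    static int reduceWithCombiner(List<List<Integer>> perMapTask) {
        List<Integer> partialSums = new ArrayList<>();
        for (List<Integer> taskOutput : perMapTask) {
            partialSums.add(sum(taskOutput));
        }
        return sum(partialSums);
    }

    public static void main(String[] args) {
        // Values for one word: three occurrences in map task 1, two in map task 2.
        List<List<Integer>> perMapTask =
                Arrays.asList(Arrays.asList(1, 1, 1), Arrays.asList(1, 1));
        System.out.println(reduceWithoutCombiner(perMapTask));
        System.out.println(reduceWithCombiner(perMapTask));
    }
}
```

Both calls return the same total, but in the combiner case only two values cross the network instead of five. Hadoop treats the combiner as an optional optimization and may run it zero or more times, so the job has to be correct either way.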

[Screenshot: running the jar package]
[Screenshot: run result]
