WordCount: An Introductory MapReduce Example

五道口纳什

Article link: https://blog.csdn.net/lanchunhui/article/details/50894233

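In this article we start from a simple example, counting how many times each distinct word appears in a text, to walk through the execution flow of MapReduce.

Consider the following text (stored in a file named hello):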

hello you
hello me

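MapReduce workflow

(1) [K1, V1]: each line of the input text is parsed into one key/value pair

Key: the byte offset at which the line starts. The first line starts at offset 0 and the second at offset 10 (the newline that ends the first line also occupies one byte).

Value: the text content of the line.

The first step therefore produces: [0, hello you], [10, hello me]

The Map function is called once per key/value pair, so here it is called twice.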
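
To make the offsets concrete, here is a minimal sketch in plain Java (no Hadoop involved) that reproduces the keys 0 and 10 for the sample file. `LineOffsets` and `parse` are hypothetical names chosen for illustration, and the sketch assumes UTF-8 text with `\n` line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: mimics how line-oriented input is keyed.
class LineOffsets {

    // Returns byte offset -> line text, assuming '\n' line endings and UTF-8 encoding.
    static Map<Long, String> parse(String content) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            records.put(offset, line);
            // Advance past the bytes of the line plus one byte for the newline.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(parse("hello you\nhello me\n"));
        // prints {0=hello you, 10=hello me}
    }
}
```

"hello you" is 9 bytes long, and the newline adds one more, which is why the second record is keyed by 10.
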
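(2) Map: [K1, V1] ⇒ [K2, V2]

The Map function receives one line of text at a time and cannot see the other lines, so it can only process a single line. In this example, that means emitting a <word, 1> pair for every word in the current line: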
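```
// Pseudocode: called once per input line; k is the line's byte offset, v is the line's text
public void map(K k, V v, Context ctx) {
    String[] words = v.split(" ");   // the words in the sample file are separated by spaces
    for (String word : words) {
        ctx.write(word, 1);          // write <word, 1> to the context
    }
}
```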
For this example, the Map output is: <hello, 1>, <you, 1>, <hello, 1>, <me, 1>
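
The same Map step can be tried out without a cluster. The following is only a sketch in plain Java: `MapStep` is a hypothetical class, and the returned list stands in for what a real Mapper would write to its context.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Mapper: one call handles exactly one input line.
class MapStep {

    // Emits a <word, 1> pair for every space-separated word in the line.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split(" ")) {
            out.add(new SimpleEntry<>(word, 1));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        all.addAll(map(0, "hello you"));   // first call:  [0, hello you]
        all.addAll(map(10, "hello me"));   // second call: [10, hello me]
        System.out.println(all);
        // prints [hello=1, you=1, hello=1, me=1]
    }
}
```

Note that the offset key is not used at all; for word counting only the line content matters.
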
(3) Partition: by default there is one partition
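
As an illustration of how a key is assigned to a partition, the sketch below applies the formula used by Hadoop's HashPartitioner, `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`. `PartitionStep` is a hypothetical class, and it uses `String` keys, whose hash codes differ from those of Hadoop's `Text`, so the partition numbers for more than one reducer are illustrative only.

```java
// Hypothetical sketch of hash partitioning, written in plain Java.
class PartitionStep {

    // Masks off the sign bit so the result is never negative,
    // then maps the hash onto [0, numReduceTasks).
    static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"hello", "you", "me"}) {
            // With a single reduce task every key lands in partition 0.
            System.out.println(word + " -> " + partition(word, 1));
        }
    }
}
```

With one reduce task, the remainder is always 0, which is why the single default partition receives every key.
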
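(4) Sort, then group: the values that belong to the same key are collected into one set (keys are ordered by the compareTo method of the Comparable interface, which the key type implements)

After sorting: <hello, 1>, <hello, 1>, <me, 1>, <you, 1>
After grouping: <hello, {1, 1}>, <me, {1}>, <you, {1}>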
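
Sorting and grouping can be imitated in memory with a `TreeMap`, which orders keys through `compareTo`. This is a sketch only: Hadoop itself sorts and merges serialized records rather than building a map, `SortAndGroup` is a hypothetical class, and the code needs Java 9 or later for `List.of` and `Map.entry`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical in-memory imitation of the sort and group steps.
class SortAndGroup {

    // Keys come out in compareTo order; values keep their arrival order.
    static SortedMap<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("hello", 1), Map.entry("you", 1),
                Map.entry("hello", 1), Map.entry("me", 1));
        System.out.println(group(mapOutput));
        // prints {hello=[1, 1], me=[1], you=[1]}
    }
}
```
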
(5) Combine (optional local aggregation of the Map output)
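
The combine step is not spelled out here. As a rough illustration, the sketch below assumes the combiner applies the same summation as the reducer to the output of one map task. `CombineStep` and `sumByKey` are hypothetical names, written in plain Java rather than against the Hadoop API, and the code needs Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch: the same summing logic used first as combiner, then as reducer.
class CombineStep {

    // Adds up the values of each key. Addition is associative and commutative,
    // so applying it to partial results gives the same final totals.
    static SortedMap<String, Long> sumByKey(List<Map.Entry<String, Long>> pairs) {
        SortedMap<String, Long> sums = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Long::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> mapOutput = List.of(
                Map.entry("hello", 1L), Map.entry("you", 1L),
                Map.entry("hello", 1L), Map.entry("me", 1L));

        // Combine on the map side: four pairs shrink to three before the copy.
        SortedMap<String, Long> combined = sumByKey(mapOutput);
        System.out.println(combined);   // prints {hello=2, me=1, you=1}

        // Reduce side: summing the partial sums leaves the totals unchanged.
        SortedMap<String, Long> reduced = sumByKey(new ArrayList<>(combined.entrySet()));
        System.out.println(reduced);    // prints {hello=2, me=1, you=1}
    }
}
```

In a real job the combiner would be enabled with `job.setCombinerClass(...)`. Once a combiner runs, a reducer may receive partial sums such as <hello, {2}>, so it has to add the values up rather than count how many there are.
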
(6) Copy to the node where the Reducer runs

This is done by the framework.

(7) Reduce

In this example, reduce is called three times, once per group.

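// Pseudocode: called once per group <k, {v, v, ...}>
public void reduce(K k, Iterable<V> vs, Context ctx) {
    long sum = 0;
    for (V v : vs) sum += v;    // every value is 1 here, so the sum is the number of occurrences
    ctx.write(k, sum);
}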

(8) Save the result

<hello, 2>, <me, 1>, <you, 1>

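Thinking about a business problem in MapReduce terms

(1) What flows through MapReduce is key/value pairs

input ⇒ <K1, V1> ⇒ <K2, V2> ⇒ <K2, V2s> ⇒ <K3, V3>
(2) The text supplied by the client provides <K1, V1>, and the client's requirement fixes <K3, V3> (<K2, V2> is produced in the Map step and consumed in the Reduce step)
(3) The core of using MapReduce is choosing <K2, V2>. Two things determine that choice:
- (1) Grouping: values with the same key are put together
- (2) The Reduce function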
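
The type flow <K1, V1> ⇒ <K2, V2> ⇒ <K2, V2s> ⇒ <K3, V3> can be written down as a small generic driver. Everything in this sketch (`MiniMapReduce`, `MapFn`, `ReduceFn`, `run`, `wordCount`) is a hypothetical name made up for illustration. It runs in a single process and ignores partitioning, combining, and copying, and it needs Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical single-process imitation of the key/value flow.
class MiniMapReduce {

    // <K1, V1> => zero or more <K2, V2>
    interface MapFn<K1, V1, K2, V2> {
        List<Map.Entry<K2, V2>> map(K1 k1, V1 v1);
    }

    // <K2, V2s> => <K3, V3>
    interface ReduceFn<K2, V2, K3, V3> {
        Map.Entry<K3, V3> reduce(K2 k2, List<V2> v2s);
    }

    static <K1, V1, K2 extends Comparable<K2>, V2, K3, V3> List<Map.Entry<K3, V3>> run(
            List<Map.Entry<K1, V1>> input,
            MapFn<K1, V1, K2, V2> mapper,
            ReduceFn<K2, V2, K3, V3> reducer) {
        // Map every record, then sort and group the emitted pairs by K2.
        SortedMap<K2, List<V2>> groups = new TreeMap<>();
        for (Map.Entry<K1, V1> record : input) {
            for (Map.Entry<K2, V2> kv : mapper.map(record.getKey(), record.getValue())) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        // Call reduce once per group.
        List<Map.Entry<K3, V3>> output = new ArrayList<>();
        for (Map.Entry<K2, List<V2>> group : groups.entrySet()) {
            output.add(reducer.reduce(group.getKey(), group.getValue()));
        }
        return output;
    }

    // Word count: K1 = byte offset, V1 = line, K2 = K3 = word, V2 = V3 = count.
    static List<Map.Entry<String, Long>> wordCount(List<Map.Entry<Long, String>> input) {
        return run(input,
                (Long offset, String line) -> {
                    List<Map.Entry<String, Long>> kvs = new ArrayList<>();
                    for (String word : line.split(" ")) {
                        kvs.add(Map.entry(word, 1L));
                    }
                    return kvs;
                },
                (String word, List<Long> counts) ->
                        Map.entry(word, counts.stream().mapToLong(Long::longValue).sum()));
    }

    public static void main(String[] args) {
        List<Map.Entry<Long, String>> input =
                List.of(Map.entry(0L, "hello you"), Map.entry(10L, "hello me"));
        System.out.println(wordCount(input));
        // prints [hello=2, me=1, you=1]
    }
}
```

Only the two lambdas in `wordCount` are specific to the problem. Choosing the word as K2 is what makes grouping bring all occurrences of a word together, so that reduce only has to add them up.
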

Code implementation

package mr;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class WordCount {

    static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable>{
        @Override
        protected void map(LongWritable k1, Text v1,
                Mapper<LongWritable, Text, Text, LongWritable>.Context ctx)
                throws IOException, InterruptedException {
            String[] splits = v1.toString().split(" ");
            for (String word: splits)
                ctx.write(new Text(word),  new LongWritable(1L));
        }
    } 

    static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable>{
        @Override
        protected void reduce(Text k2, Iterable<LongWritable> v2s,
                Reducer<Text, LongWritable, Text, LongWritable>.Context ctx)
                throws IOException, InterruptedException {
            long cnt = 0L;
            for (LongWritable v2 : v2s) {
                cnt += v2.get();
            }
            ctx.write(k2, new LongWritable(cnt));
        }
    }

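    private static final String INPUT_PATH = "hdfs://hadoop0:9000/hello";
    private static final String OUTPUT_PATH = "hdfs://hadoop0:9000/hello_res";

    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();
        // On Hadoop 2.x and later, Job.getInstance(conf, name) replaces this deprecated constructor
        Job job = new Job(conf, WordCount.class.getSimpleName());
        // Lets Hadoop locate the jar that contains this class when the job runs on a cluster
        job.setJarByClass(WordCount.class);

        // step 1: input ==> <K1, V1>
        // Pass the input file to the job
        FileInputFormat.setInputPaths(job, INPUT_PATH);
        // Class that parses the input file into records
        job.setInputFormatClass(TextInputFormat.class);
                                    // Note: import FileInputFormat and TextInputFormat from
                                    // org.apache.hadoop.mapreduce.lib.input, not from the old mapred package
                                    // This call can be omitted
        // step 2: <K1, V1> ==> <K2, V2>
        job.setMapperClass(MyMapper.class);
        // Types of the map output <K, V>; if <K3, V3> has the same types as <K2, V2>,
        // the following two calls can be omitted as well
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);


        // step 3: partition
        job.setPartitionerClass(HashPartitioner.class);
        job.setNumReduceTasks(1);
                                    // Both calls above can be omitted,
                                    // because they only restate the defaults
        // step 4: sort and group

        // step 5: combine

        // step 2.1
        // step 2.2: specify the custom Reducer class
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);


        // step 2.3: specify the output path
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
        // Class that formats the output
        job.setOutputFormatClass(TextOutputFormat.class);
                                    // This call can be omitted
        // Submit the job to the JobTracker, wait for it, and report success or failure as the exit code
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}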

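After Hadoop has been started, running this program may raise a permission exception. Unpacking FileUtil, copying it into the current directory, and running the program again lets it complete successfully.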
