Implementing WordCount with MapReduce

GodXuzzZ

Category: Big Data

Article link: https://blog.csdn.net/GodXuzzZ/article/details/106551821


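File splitting: splitting
Tokenizing each split: map
Aggregating within each split: combine
Moving data (reducers pull the map output): shuffle
Final totals: reduce
[Figure: WordCount data flow]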

MapReduce execution flow:
Data format definitions:
map: (K1, V1) -> list(K2, V2)
reduce: (K2, list(V2)) -> list(K3, V3)
Stages in execution order:
Mapper
Combiner
Partitioner
Shuffle and Sort
Reducer
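
The Partitioner stage decides which reducer receives each key. As a minimal sketch, the class below applies the same formula as Hadoop's default HashPartitioner (hash code with the sign bit cleared, modulo the number of reduce tasks). The class name PartitionSketch and the sample words are hypothetical, and it hashes a plain String rather than a Text key, so the concrete partition numbers will not match a real job.

```java
// Hypothetical helper, not part of Hadoop: it only mirrors the formula
// used by the default hash partitioner to choose a reducer for a key.
final class PartitionSketch {
    // Clear the sign bit so the result is never negative, then take the remainder.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] words = {"hello", "world", "hello", "hadoop"};
        int reducers = 2; // assumed number of reduce tasks
        for (String w : words) {
            // Both occurrences of "hello" print the same reducer index.
            System.out.println(w + " -> reducer " + partitionFor(w, reducers));
        }
    }
}
```

Because the partition depends only on the key, every occurrence of a word ends up at the same reducer, which is what lets reduce compute a complete total per word.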

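Steps

First, the article is divided into several splits, numbered 1, 2, and 3. Each of the three splits is tokenized by map: the text is broken into individual English words, and every occurrence is tagged with a 1, whether or not the word has appeared before. combine then aggregates within each split, summing the occurrences of each repeated word so that no word appears twice inside a split, although the same word can still appear in splits 1, 2, and 3. shuffle gathers the words from all splits and groups identical words together: the key is the word and the value is the list of its per-split counts. Finally, reduce sums the counts from the three splits for each word, and the result is the final word count.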

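For example:

First, splitting divides the article in the figure above into three splits: 1, 2, and 3.
Next, map produces one record for every occurrence of every word.
Then combine produces the count of each word within a split; at this point the same word can still appear in more than one split.
After that, shuffle groups the per-split counts by word; each word now appears only once as a key, but its counts are still per-split subtotals.
Finally, reduce adds up all the per-split counts of each word, giving the number of times each word occurs in the article.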
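
The five stages can be traced in plain Java without any Hadoop classes. The following is only an illustrative sketch: the class name WordCountStages and the three sample splits are made up, and everything runs in one process, whereas a real job distributes the splits across machines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java walk-through of the stages; no Hadoop classes are used.
final class WordCountStages {
    // map + combine for one split: record a 1 per occurrence and sum inside the split.
    static Map<String, Long> mapAndCombine(String split) {
        Map<String, Long> local = new TreeMap<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                local.merge(word, 1L, Long::sum);
            }
        }
        return local;
    }

    // shuffle: group the per-split counts by word -> (word, list of partial counts).
    static Map<String, List<Long>> shuffle(List<Map<String, Long>> combined) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (Map<String, Long> local : combined) {
            local.forEach((word, n) ->
                    grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(n));
        }
        return grouped;
    }

    // reduce: add up the partial counts of each word.
    static Map<String, Long> reduce(Map<String, List<Long>> grouped) {
        Map<String, Long> totals = new TreeMap<>();
        grouped.forEach((word, counts) -> {
            long sum = 0;
            for (long n : counts) {
                sum += n;
            }
            totals.put(word, sum);
        });
        return totals;
    }

    // Run all stages over input that has already been divided into splits.
    static Map<String, Long> count(List<String> splits) {
        List<Map<String, Long>> combined = new ArrayList<>();
        for (String split : splits) {
            combined.add(mapAndCombine(split));
        }
        return reduce(shuffle(combined));
    }

    public static void main(String[] args) {
        // Made-up input standing in for splits 1, 2, and 3.
        List<String> splits = Arrays.asList("deer bear river", "car car river", "deer car bear");
        System.out.println(count(splits)); // {bear=2, car=3, deer=2, river=2}
    }
}
```

In the second split, combine turns the two `car` records into a single `(car, 2)`, so shuffle hands reduce the list `[2, 1]` for `car` instead of three separate ones.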

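Code

Only the map and reduce steps need to be implemented; the framework takes care of splitting, shuffling, and sorting. The combiner is optional, and the driver below does not configure one.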
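The MyMapper class:

package org.example.services.wc;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private final LongWritable one = new LongWritable(1);
    // Reused for every record to avoid allocating a new Text per word
    private final Text wd = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Split the line into words on runs of whitespace
        String[] wds = value.toString().trim().split("\\s+");
        // Emit (word, 1) for every occurrence, skipping the empty token of a blank line
        for (String word : wds) {
            if (word.isEmpty()) {
                continue;
            }
            wd.set(word);
            context.write(wd, one);
        }
    }
}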
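
How the line is tokenized affects the counts. Splitting on a single space treats every space as a separator, so two consecutive spaces yield an empty "word" and a tab does not separate words at all. The sketch below compares that with splitting on runs of whitespace; TokenizeSketch and the sample line are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenizeSketch {
    // Split on runs of whitespace and drop the empty token that
    // leading whitespace (or a blank line) would otherwise produce.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        for (String t : line.split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "hello  world\thello"; // two spaces, then a tab
        // Single-space split: "hello", "", "world\thello"
        System.out.println(line.split(" ").length); // 3
        // Whitespace-run split
        System.out.println(tokens(line)); // [hello, world, hello]
    }
}
```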

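The MyReduce class:

package org.example.services.wc;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReduce extends Reducer<Text, LongWritable, Text, LongWritable> {
    private final LongWritable res = new LongWritable();

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        long ressum = 0;
        // Add up all the counts recorded for this word to get its total, ressum
        for (LongWritable one : values) {
            ressum += one.get();
        }
        res.set(ressum);
        // Emit the result: the key is the word, the value is its total count
        context.write(key, res);
    }
}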

Testing: the MyDriver class:

package org.example.services.wc;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Create an empty job
        Job job = Job.getInstance(conf, "wc");
        // Set the job's main class
        job.setJarByClass(MyDriver.class);
        // Set the job's input data source
        FileInputFormat.addInputPath(job, new Path("d://abc.txt"));
        // Set the mapper class
        job.setMapperClass(MyMapper.class);
        // Set the mapper's output key and value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // Set the reducer class
        job.setReducerClass(MyReduce.class);
        // Set the reducer's output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // Set the job's output location; this directory must not already exist
        FileOutputFormat.setOutputPath(job, new Path("d://eee"));
        // Submit the job, wait for it to finish, and exit non-zero on failure
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
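
With the default text output format, each reducer writes its results to a file such as part-r-00000 in the output directory, one `word<TAB>count` pair per line. As a rough sketch of reading such output back, the class below parses that line format from a string; OutputLineSketch and the sample content are hypothetical, and reading the actual file from disk is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class OutputLineSketch {
    // Parse "word<TAB>count" lines into a map, keeping the file order.
    static Map<String, Long> parse(String content) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String line : content.split("\n")) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                continue; // skip blank or malformed lines
            }
            String word = line.substring(0, tab);
            long count = Long.parseLong(line.substring(tab + 1).trim());
            counts.put(word, count);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample of what a part-r-00000 file could contain.
        String sample = "bear\t2\ncar\t3\ndeer\t2\n";
        System.out.println(parse(sample)); // {bear=2, car=3, deer=2}
    }
}
```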