MapReduce Processing Flow & WordCount Source Walkthrough and Run Procedure

艾艾猫dori

Category: big data. Tags: big data, hadoop, mapreduce

Article link: https://blog.csdn.net/m0_45899013/article/details/108092299

Reference documentation:
http://hadoop.apache.org/docs/r2.6.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html

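MapReduce processing flow:

Input: a series of key/value pairs <k1,v1>
map: transforms <k1,v1> into <k2,v2>
reduce: transforms <k2,v2> (all values grouped per key) into <k3,v3>
Output: a series of key/value pairs <k3,v3>

Flow breakdown: read the file, split it into input splits (splitting), run the map computation (mapping), shuffle and sort the map output by key (shuffling), then aggregate the results in the reduce step.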
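To make the phases concrete, here is a minimal plain-Java sketch that walks the three sample lines of hello.txt (used later in this post) through mapping, shuffling and reducing in a single process. It uses no Hadoop classes; the class name MapReduceFlowSketch and its method names are made up for illustration, and a TreeMap stands in for the framework's sort-by-key step.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: simulates the MapReduce phases in memory, without Hadoop.
class MapReduceFlowSketch {

    // Mapping: one line becomes a list of <word, 1> pairs.
    static List<Map.Entry<String, Long>> map(String line) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String word : line.split(" ")) {
            pairs.add(new AbstractMap.SimpleEntry<>(word, 1L));
        }
        return pairs;
    }

    // Shuffling: group all values by key; the TreeMap keeps the keys sorted.
    static Map<String, List<Long>> shuffle(List<Map.Entry<String, Long>> pairs) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reducing: sum the values collected for one key.
    static long reduce(List<Long> values) {
        long sum = 0;
        for (long value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the whole flow; each input line plays the role of one record.
    static Map<String, Long> run(List<String> lines) {
        List<Map.Entry<String, Long>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, List<Long>> entry : shuffle(mapped).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("deer bear river", "car car river", "Deer car bear");
        // Print one "word<TAB>count" line per key, in sorted key order.
        run(lines).forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

Because the comparison is case-sensitive, "Deer" and "deer" are counted as two different words, and the uppercase key sorts first.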

WordCount source walkthrough:

1. The WordCount code is as follows:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCount {
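    // Custom mapper: input is <byte offset, line>, output is <word, 1>
    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        // Reusable writables; context.write serializes them immediately, so reuse is safe
        private static final LongWritable ONE = new LongWritable(1);
        private final Text word = new Text();

        // map: the "mapping" step of the programming model, called once per input line
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // The incoming line of text
            String line = value.toString();
            // Split on the chosen delimiter (a single space)
            String[] words = line.split(" ");
            // Emit <word, 1> for every token through the context
            for (String token : words) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }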

    // Custom reducer (the map output is the reduce input)
    public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            // Start the count for this key at 0
            long sum = 0;
            for (LongWritable value : values) {
                sum += value.get();
            }
            // Emit the final count for this key
            context.write(key, new LongWritable(sum));
        }
    }

    // Configure and run the job
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Create the Configuration object
        Configuration cfg = new Configuration();

        // Create the job
        Job job = Job.getInstance(cfg, "wordcount");

        // Set the class used to locate the job's jar
        job.setJarByClass(WordCount.class);

        // Map-related settings
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

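        // Reduce-related settings
        /*
         * Combiners in more depth (a combiner acts like a reduce on the map side).
         * Each map task may produce a large amount of output; the combiner merges that output
         * on the map side first, reducing the amount of data transferred to the reducers.
         * At its most basic a combiner merges values by key locally, like a local reduce.
         * Without a combiner all aggregation is done by the reducers, which is comparatively slow;
         * with one, map tasks that finish first aggregate locally, which speeds the job up.
         * Note: the combiner's output is the reducer's input, and a combiner must never change
         * the final result. Use one only when the reduce input and output key/value types are
         * identical and the operation does not affect the final result, e.g. sums or maximums.
         */
        job.setCombinerClass(MyReducer.class);

        job.setReducerClass(MyReducer.class);
        // Key/value types of the final (reducer) output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);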

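        // Stop early if the input path does not exist
        FileSystem fs = FileSystem.get(cfg);
        if (!fs.exists(new Path(args[0]))) {
            System.out.println("The input directory does not exist!");
            System.exit(1);
        }

        // The job fails if the output path already exists, so delete it (recursively) up front
        Path outputPath = new Path(args[1]);
        if (fs.exists(outputPath)) {
            fs.delete(outputPath, true);
        }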

        // Set the job's input path
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Set the job's output path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Run the job and wait for it to finish
        boolean result = job.waitForCompletion(true);

        System.exit(result ? 0 : 1);

    }
}
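The combiner comment in the code says a combiner must not change the final result. As a rough illustration, the plain-Java sketch below compares two aggregations over the outputs of two hypothetical map tasks: summing survives a per-map pre-aggregation, while naively averaging the per-map averages does not. The class name CombinerSafetySketch, its methods and the sample numbers are invented for this example and are not part of Hadoop.

```java
// Illustration only: why a sum is combiner-safe and a naive average is not.
class CombinerSafetySketch {

    static long sum(long[] values) {
        long total = 0;
        for (long value : values) {
            total += value;
        }
        return total;
    }

    static double mean(double[] values) {
        double total = 0;
        for (double value : values) {
            total += value;
        }
        return total / values.length;
    }

    // Reducer sees every raw value from every map task.
    static long sumWithoutCombiner(long[][] mapOutputs) {
        long total = 0;
        for (long[] output : mapOutputs) {
            total += sum(output);
        }
        return total;
    }

    // Each map task pre-sums locally; the reducer sums the partial sums.
    static long sumWithCombiner(long[][] mapOutputs) {
        long[] partials = new long[mapOutputs.length];
        for (int i = 0; i < mapOutputs.length; i++) {
            partials[i] = sum(mapOutputs[i]);
        }
        return sum(partials);
    }

    // Reducer averages every raw value.
    static double meanWithoutCombiner(double[][] mapOutputs) {
        double total = 0;
        int count = 0;
        for (double[] output : mapOutputs) {
            for (double value : output) {
                total += value;
                count++;
            }
        }
        return total / count;
    }

    // Each map task pre-averages locally; the reducer averages the averages (wrong in general).
    static double meanWithNaiveCombiner(double[][] mapOutputs) {
        double[] partials = new double[mapOutputs.length];
        for (int i = 0; i < mapOutputs.length; i++) {
            partials[i] = mean(mapOutputs[i]);
        }
        return mean(partials);
    }

    public static void main(String[] args) {
        long[][] counts = {{1, 1, 1}, {1}};
        System.out.println(sumWithoutCombiner(counts) + " vs " + sumWithCombiner(counts));
        double[][] samples = {{2, 4, 6}, {10}};
        System.out.println(meanWithoutCombiner(samples) + " vs " + meanWithNaiveCombiner(samples));
    }
}
```

This is why reusing MyReducer as the combiner is fine for WordCount: adding partial sums gives the same total as adding the raw ones. An averaging job would instead need to carry sums and counts through the combiner.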

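2. Once the WordCount code is finished, set the project's main class under File > Project Structure.

3. Run package to build the jar, take the jar from the target directory, and copy it to the Ubuntu machine.
4. Start the Ubuntu virtual machine. Start DFS and check the running processes with jps: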

start-dfs.sh
jps

Once DFS has started, its web UI is available at http://<IP>:50070.

5. Start YARN:

start-yarn.sh

Once YARN has started, its web UI is available at http://<IP>:8088.

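6. Copy the prepared file hello.txt (mock data) to /opt on the virtual machine, then upload it to /input on HDFS and view it:

# create the target directory first if it does not exist yet
hdfs dfs -mkdir -p /input
hdfs dfs -put /opt/hello.txt /input
hdfs dfs -text /input/hello.txt
# The output looks like this: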
deer bear river
car car river
Deer car bear
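The sample file uses exactly one space between words, which is what the mapper's split(" ") expects. As a hedged aside, the plain-Java sketch below shows what would happen with repeated spaces and one possible alternative that splits on runs of whitespace. The class name TokenizeSketch and its methods are made up for this illustration; the mapper in this post uses only the single-space split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: compares two ways of tokenizing an input line.
class TokenizeSketch {

    // Same call the mapper uses: consecutive spaces produce empty tokens.
    static List<String> splitOnSingleSpace(String line) {
        return Arrays.asList(line.split(" "));
    }

    // Alternative: split on runs of whitespace and drop empty tokens.
    static List<String> splitOnWhitespace(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String line = "car  car river"; // note the double space
        System.out.println(splitOnSingleSpace(line).size());
        System.out.println(splitOnWhitespace(line).size());
    }
}
```

With the single-space split, an empty token would be emitted as a key and counted like any other word, so input with irregular spacing is worth cleaning or tokenizing more defensively.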

7. Put the packaged jar in the temp directory under app and run it:

hadoop jar test-hdfs-1.0-SNAPSHOT.jar com.kgc.WordCount /input /output

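8. When the job finishes, the 50070 web page shows a new output folder containing two files (_SUCCESS and part-r-00000); the second holds the word counts for hello.txt. To view it:

hdfs dfs -cat /output/part-r-00000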

While the job is running, its status can be followed at http://192.168.1.14:8088.
