1. MapReduce Example
This article implements the classic word-count example with the MapReduce Java API, covers a small code optimization, and walks through the standard structure of a MapReduce program; it is also a classic interview coding question. The code is essentially MapReduce "boilerplate": only a few parts ever need to change.
See also in this series: Hadoop之旅(4)— MapReduce 与 YARN 原理讲解, and Hadoop之旅(1)—单机与伪集群安装、简单经典案例.
Prepare the environment: IDEA, a new project, and the dependency below.
A: Dependency
```xml
<!-- Hadoop Client dependency -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
</dependency>
```
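The `${hadoop.version}` placeholder assumes a matching Maven property in the pom. A sketch, pinning it to the cluster version used later in this article (2.5.0); adjust it to whatever your cluster actually runs:

```xml
<properties>
    <!-- assumption: keep this in sync with the cluster's Hadoop version -->
    <hadoop.version>2.5.0</hadoop.version>
</properties>
```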
B: Start MapReduce (make sure the Hadoop services are running)
2. Code: WordCountMapReduce.class
2.1 Map phase: the code follows a fixed template, only a few parts need changing
```java
/**
 * Step 1: Map class.
 *
 * map input --> map output
 * public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
 */
public static class WordCountMapper extends
        Mapper<LongWritable, Text, Text, IntWritable> {

    private Text mapOutputKey = new Text();
    private final static IntWritable mapOutputValue = new IntWritable(1);

    /**
     * Called once for each key/value pair in the input split.
     */
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // get the line content
        String lineValue = value.toString();

        // StringTokenizer is cheaper than split(" "), which allocates an array
        // String[] strs = lineValue.split(" ");
        StringTokenizer stringTokenizer = new StringTokenizer(lineValue);

        // iterate over the tokens
        while (stringTokenizer.hasMoreTokens()) {
            // get the word
            String wordValue = stringTokenizer.nextToken();
            // set the output key
            mapOutputKey.set(wordValue);
            // write (word, 1) to the context
            context.write(mapOutputKey, mapOutputValue);
        }
    }
}
```
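The map logic above can be checked outside Hadoop. The sketch below (plain Java, hypothetical class name `MapSketch`) reproduces the tokenize-and-emit loop, representing each emitted (word, 1) pair as a tab-separated string:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class MapSketch {
    // Simulates the map() body: emit one (word, 1) pair per token.
    static List<String> mapLine(String lineValue) {
        List<String> pairs = new ArrayList<>();
        StringTokenizer stringTokenizer = new StringTokenizer(lineValue);
        while (stringTokenizer.hasMoreTokens()) {
            // each pair is rendered as "word\t1"
            pairs.add(stringTokenizer.nextToken() + "\t1");
        }
        return pairs;
    }

    public static void main(String[] args) {
        // one pair per token, duplicates included
        System.out.println(mapLine("hadoop spark hadoop"));
    }
}
```

Duplicated words each produce their own (word, 1) pair; summing them is the reducer's job.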
2.2 Reduce phase
```java
/**
 * Step 2: Reduce class.
 *
 * reduce input --> reduce output
 * public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
 */
public static class WordCountReducer extends
        Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable outputValue = new IntWritable();

    /**
     * Called once for each key; sums the counts for that word.
     */
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // running total for this key
        int sum = 0;
        // iterate over all counts for this word
        for (IntWritable value : values) {
            sum += value.get();
        }
        // set and write the result
        outputValue.set(sum);
        context.write(key, outputValue);
    }
}
```
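The reduce step is just a sum over the values grouped under one key. A minimal standalone sketch (hypothetical class name `ReduceSketch`, plain `Integer`s standing in for `IntWritable`):

```java
import java.util.Arrays;

public class ReduceSketch {
    // Simulates reduce(): sum the counts grouped under one key.
    static int reduceCounts(Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        // for key "hadoop" with grouped values [1, 1, 1] the reducer emits (hadoop, 3)
        System.out.println(reduceCounts(Arrays.asList(1, 1, 1)));
    }
}
```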
2.3 Assemble the map and reduce (the driver)
```java
// Step 3: Driver, assembles the job from the map and reduce classes
public int run(String[] args) throws Exception {
    // 1: get configuration --> extends Configured --> new Configuration()
    Configuration configuration = getConf();

    // 2: create the Job
    Job job = Job.getInstance(configuration, this.getClass().getSimpleName());
    // the jar to run
    job.setJarByClass(this.getClass());

    // 3: set up the job
    // input -> map -> reduce -> output
    // 3.1: input --> args[0] (data source)
    Path inPath = new Path(args[0]);
    FileInputFormat.addInputPath(job, inPath);

    // 3.2: map
    job.setMapperClass(WordCountMapper.class);
    // map output types
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);

    // 3.3: reduce
    job.setReducerClass(WordCountReducer.class);
    // job output types
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // 3.4: output --> args[1] (output path)
    Path outPath = new Path(args[1]);
    FileOutputFormat.setOutputPath(job, outPath);

    // 4: submit the job and wait for completion
    boolean isSuccess = job.waitForCompletion(true);
    return isSuccess ? 0 : 1;
}
```
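On the optimization side, the classic WordCount improvement is a map-side Combiner: because WordCountReducer consumes and produces the same (Text, IntWritable) pair, it can double as the combiner, pre-summing counts on each map task before the shuffle and reducing network traffic. A sketch of the one extra driver line (a job-configuration fragment, added in run() before submitting):

```java
// Optional optimization: pre-sum counts on the map side before the shuffle.
// Valid here only because WordCountReducer's input and output types match.
job.setCombinerClass(WordCountReducer.class);
```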
2.4 Run
```java
// Step 4: run the program
public static void main(String[] args) throws Exception {
    // 1: get configuration
    Configuration configuration = new Configuration();

    // int status = new WordCountMapReduce().run(args);
    // --> extends Configured --> implements Tool
    int status = ToolRunner.run(configuration, new WordCountMapReduce(), args);
    System.exit(status);
}
```
Note that because the ToolRunner class is used, the WordCountMapReduce class must be declared like this:

```java
public class WordCountMapReduce extends Configured implements Tool {
```
2.5 Package the program (build the jar)
2.6 Run the jar
Input path: /chenzhengyou/mapreduce/wordcount/input/idea.input
Output path: /chenzhengyou/mapreduce/wordcount/output/test01
Run the MapReduce job:

```shell
[root@czy-1 hadoop-2.5.0]# bin/hadoop jar /usr/local/chenzhengyou/hadoop/standalone/hadoop-2.5.0/jars/hadoop-mapreduce.jar /chenzhengyou/mapreduce/wordcount/input/idea.input /chenzhengyou/mapreduce/wordcount/output/test01
```
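One pitfall when re-running the job: FileOutputFormat refuses to write into an existing output directory, so a second run with the same args fails. Delete the previous output first (the path below is the output path from this example):

```shell
# Remove the old output directory before re-running the job;
# MapReduce fails if the output path already exists.
bin/hdfs dfs -rm -r /chenzhengyou/mapreduce/wordcount/output/test01
```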
View the results:

```shell
bin/hdfs dfs -text /chenzhengyou/mapreduce/wordcount/output/test01/par*
```