MapReduce Introductory Example: WordCount

u013063153

Category: Hadoop

Article link: https://blog.csdn.net/u013063153/article/details/62884322

package org.myorg;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

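/**
 * An application typically supplies a Map class and a Reduce class that implement
 * the Mapper and Reducer interfaces; together they form the core of the job.
 */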
public class WordCount {

    // Map implements the Mapper interface.
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

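        // The mapper's output value type is IntWritable; every word is emitted with a count of 1.
        private final static IntWritable one = new IntWritable(1);
        // The mapper's output key type is Text; this object is reused for every token.
        private Text word = new Text();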

        // Implements the map() method declared by the Mapper interface.
        // key: byte offset of the line within the input file; value: the text of the line.
        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            // 1. TextInputFormat feeds the mapper one line at a time.
            // 2. StringTokenizer splits the line into tokens on whitespace.
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one); // emit a <word, 1> key-value pair
            }
        }
    }

    // Reduce implements the Reducer interface.
    // The reduce() method sums the occurrences of each key (in this example, each word).
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

        // Implements the reduce() method declared by the Reducer interface.
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException {

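        // JobConf holds the configuration of one Map/Reduce job.
        JobConf conf = new JobConf(WordCount.class);
        // Set the job name.
        conf.setJobName("wordcount");

        // Set the key and value types of the job (reducer) output.
        // The map output uses the same types unless they are set separately.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // Specify the mapper.
        conf.setMapperClass(Map.class);
        // Specify a combiner. After each map task runs, its output is sorted by key and
        // passed to the local combiner (configured here as the same class as the reducer)
        // for local aggregation.
        conf.setCombinerClass(Reduce.class);
        // Specify the reducer.
        conf.setReducerClass(Reduce.class);

        // Set the input format for the map phase. TextInputFormat is the default:
        // the key is a LongWritable (byte offset of the line) and the value is a Text (the line).
        conf.setInputFormat(TextInputFormat.class);
        // Set the output format for the reduce phase. TextOutputFormat is the default.
        conf.setOutputFormat(TextOutputFormat.class);

        // Set the input path.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        // Set the output path.
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // main() has now stored every aspect of the job in the JobConf: the input/output paths
        // passed on the command line, the key/value types, the input/output formats, and so on.
        // JobClient.runJob(conf) then submits the job and monitors its progress.
        JobClient.runJob(conf);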
    }

}
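
To make the data flow of the job above easier to follow, here is a minimal sketch that mimics the same map, group, and reduce steps in plain Java, with no Hadoop dependency. The class `LocalWordCount` and its methods are hypothetical names made up for this illustration; they are not part of the Hadoop API, and the sketch ignores everything a real cluster does (input splits, combiners, partitioning, I/O).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: mimics the WordCount data flow in memory.
class LocalWordCount {

    // "Map" step: emit one (word, 1) pair per whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(Map.entry(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // "Shuffle" and "reduce" steps: group the pairs by key (sorted) and sum the values per key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    // Runs the map step on every line, then reduces all emitted pairs.
    static Map<String, Integer> countWords(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        List<String> lines = List.of("Hello World Bye World", "Hello Hadoop Goodbye Hadoop");
        // prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
        System.out.println(countWords(lines));
    }
}
```

The `TreeMap` stands in for the sort-by-key behaviour of the shuffle phase, which is why the words come out in sorted order.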

Notes on two Java basics:

1. StringTokenizer

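Java provides a dedicated class for parsing strings, StringTokenizer (in the java.util package). It breaks a string into individual words, which are called tokens. Tokens are separated by delimiters (delim). By default these are the usual whitespace characters such as space, tab, and newline, and other characters can be specified as delimiters as well. The constructors of the StringTokenizer class are listed in Table 15-6.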

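Table 15-6: Constructors of the StringTokenizer class

Constructor	Description
StringTokenizer(String str)	Constructs a tokenizer for the string str using the default delimiters: space, tab, newline, carriage return, and form feed. A run of consecutive delimiters is treated as a single separator.
StringTokenizer(String str, String delim)	Constructs a tokenizer for the string str in which every character of the string delim acts as a delimiter.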
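
As a small illustration of the two constructors, the sketch below tokenizes one string with the default whitespace delimiters and another with a custom delimiter set. The class `TokenizerConstructors` and its helper methods are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class: compares the one-argument and two-argument constructors.
class TokenizerConstructors {

    // Collects every remaining token of a tokenizer into a list.
    static List<String> drain(StringTokenizer tokenizer) {
        List<String> tokens = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    // Uses StringTokenizer(String): default whitespace delimiters.
    static List<String> splitOnWhitespace(String text) {
        return drain(new StringTokenizer(text));
    }

    // Uses StringTokenizer(String, String): each character of delimiters is a delimiter.
    static List<String> splitOn(String text, String delimiters) {
        return drain(new StringTokenizer(text, delimiters));
    }

    public static void main(String[] args) {
        // prints [hello, hadoop, map, reduce]
        System.out.println(splitOnWhitespace("hello   hadoop\tmap\nreduce"));
        // prints [a, b, c, d]
        System.out.println(splitOn("a,b;c,,d", ",;"));
    }
}
```

Note that consecutive delimiters never produce empty tokens, so `"c,,d"` yields `c` and `d` with nothing in between.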

The main methods of the StringTokenizer class are listed in Table 15-7.

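Table 15-7: Main methods of the StringTokenizer class

Method	Function
String nextToken()	Returns the next token (word) in the string; call it repeatedly to retrieve the tokens one by one.
boolean hasMoreTokens()	Reports whether the string being parsed still has tokens left: true if so, false otherwise.
int countTokens()	Returns how many tokens remain, that is, how many more times nextToken() can be called. Before the first nextToken() call this equals the total number of tokens in the string.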
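
The three methods can be seen working together in the following sketch, which prints how many tokens remain before each call to `nextToken()`. The class name `TokenizerMethods` and the helper `remainingCounts` are hypothetical and exist only for this illustration.

```java
import java.util.StringTokenizer;

// Hypothetical demo class: exercises hasMoreTokens(), nextToken(), and countTokens().
class TokenizerMethods {

    // Records the value of countTokens() observed just before each nextToken() call.
    static int[] remainingCounts(String text) {
        StringTokenizer tokenizer = new StringTokenizer(text);
        int[] counts = new int[tokenizer.countTokens()];
        int i = 0;
        while (tokenizer.hasMoreTokens()) {
            counts[i++] = tokenizer.countTokens();
            tokenizer.nextToken();
        }
        return counts;
    }

    public static void main(String[] args) {
        StringTokenizer tokenizer = new StringTokenizer("map shuffle reduce");
        while (tokenizer.hasMoreTokens()) {
            int remaining = tokenizer.countTokens();
            System.out.println(remaining + " left, next: " + tokenizer.nextToken());
        }
        // prints:
        // 3 left, next: map
        // 2 left, next: shuffle
        // 1 left, next: reduce
    }
}
```

Because `countTokens()` counts only what is left, it shrinks by one after every `nextToken()` call.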