(1) Understanding WordCount, the First Hadoop Program

Latest recommended article published 2021-06-05 15:31:30

王珂_wangke

Category: Big Data. Tags: Big Data

Article link: https://blog.csdn.net/weixin_42209307/article/details/114023284

Map and Reduce concepts

Map

Map splits the input data into individual key-value pairs. Reduce gathers those key-value pairs by key and computes the final result.

machine1:

# The data below is the content of the HDFS block on machine1
hello hello hello

// This is the context on machine 1
[
   {"hello" : 1},
   {"hello" : 1},
   {"hello" : 1}
]

machine2:

# The data below is the content of the HDFS block on machine2
hello abc abc abc

// This is the context on machine 2
[
   {"hello" : 1},
   {"abc" : 1},
   {"abc" : 1},
   {"abc" : 1}
]
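
To make the two contexts above concrete, here is a minimal plain-Java sketch that imitates what a map task emits for each sample line. It does not use Hadoop at all: `MapPhaseSketch` and `mapLine` are hypothetical names made up for this illustration, and an ordinary list of pairs stands in for the real `Context`.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: imitates the output of one map task.
class MapPhaseSketch {

    // Split one line on whitespace and emit a (word, 1) pair per word.
    // Duplicates are kept on purpose; nothing is merged at this stage.
    static List<Map.Entry<String, Integer>> mapLine(String line) {
        List<Map.Entry<String, Integer>> context = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) { // a blank line produces one empty token
                context.add(new SimpleEntry<>(word, 1));
            }
        }
        return context;
    }

    public static void main(String[] args) {
        System.out.println("machine1: " + mapLine("hello hello hello"));
        System.out.println("machine2: " + mapLine("hello abc abc abc"));
    }
}
```

Each machine only looks at its own block, which is why "hello" shows up in both outputs without being combined yet.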

The Hadoop WordCount program

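import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordcountMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        String line = value.toString(); // read one line of input

        // English words are separated by whitespace, so split on it to get the words of this line
        String[] words = line.trim().split("\\s+");

        /*
         *  key is the offset of this line within the input
         *  value is the text of the line
         *  context can be thought of as a mapList: a map that does not deduplicate keys
         *
         *  [
         *      {hello: 1}
         *      {hello: 1}
         *      {abc: 1}....
         *  ]
         */
        for (String s : words) { // turn each word into a <key,value> pair, e.g. <"hello",1>
            if (s.isEmpty()) {
                continue; // a blank line yields a single empty token; skip it
            }
            context.write(new Text(s), new IntWritable(1));
        }
    }
}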

Reduce

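Reduce can be understood as aggregation. Each machine above has already done a simple first pass that turns its data into key-value pairs; reduce gathers those pairs by key to produce the data we actually want. In our case the requirement is to count how many times every word occurs. Suppose this reduce operation is assigned to machine 3.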

At this point, the gathered data on machine3 looks like this:

{
    "abc" : [{int:1},{int:1},{int:1}],
    "hello" : [{int:1},{int:1},{int:1},{int:1}]
}
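
The grouped structure above is produced between map and reduce, when all values for the same key are brought together. As a rough plain-Java sketch of that grouping (no Hadoop involved; `ShuffleSketch` and `group` are hypothetical names used only here), a `TreeMap` can collect the 1s emitted for each word. The `TreeMap` also keeps the keys sorted, which matches the order in which a reducer receives its keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: imitates grouping map output by key.
class ShuffleSketch {

    // For every word in every input line, append a 1 to that word's list.
    static Map<String, List<Integer>> group(String... lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // ignore blank lines
                }
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        // The two lines are the sample blocks from machine1 and machine2.
        System.out.println(group("hello hello hello", "hello abc abc abc"));
        // prints {abc=[1, 1, 1], hello=[1, 1, 1, 1]}
    }
}
```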

We then loop over the keys one by one, count how many times each word appears, and write out the result.

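import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordcountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        /*
         *  map has already split the text into words; reduce does the counting
         *  key is a word: hello OR abc....
         *  values holds the data emitted for hello OR abc
         *  total the data and write it out
         */
        int count = 0;
        for (IntWritable value : values) {
            // Sum the values instead of counting the elements, so the total stays
            // correct even if a combiner has already merged some of the 1s.
            count += value.get();
        }
        context.write(key, new IntWritable(count));
    }
}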
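
To check the reduce step by hand, here is a small plain-Java sketch that applies the same per-key loop to the grouped data from machine3. It runs without Hadoop; `ReducePhaseSketch`, `reduce` and `reduceAll` are hypothetical names for this illustration, and plain `Integer` values stand in for `IntWritable`. Since every value is 1 here, adding the values and counting them give the same totals.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: imitates the reducer on grouped data.
class ReducePhaseSketch {

    // Mirrors the reducer body: add up every value received for one key.
    static int reduce(Iterable<Integer> values) {
        int count = 0;
        for (int v : values) {
            count += v;
        }
        return count;
    }

    // Call reduce once per key, as the framework would, and collect the totals.
    static Map<String, Integer> reduceAll(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            totals.put(e.getKey(), reduce(e.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        grouped.put("abc", Arrays.asList(1, 1, 1));
        grouped.put("hello", Arrays.asList(1, 1, 1, 1));
        System.out.println(reduceAll(grouped)); // prints {abc=3, hello=4}
    }
}
```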