Big Data Learning (2): Running Hadoop's WordCount Program Step by Step

Latest recommended article published on 2020-07-02 09:55:13

6点A君

Category: Hadoop. Tags: Hadoop, WordCount, HelloWorld

Article link: https://blog.csdn.net/anLA_/article/details/88737182

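The previous article covered installing Hadoop and its basic configuration. I installed it in pseudo-distributed mode, so a single machine acts as both the master and the cluster's worker node.
This article shows how to run the classic WordCount program.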

Source code

The example source code is as follows:

package com.anla.chapter1;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.StringTokenizer;


/**
 * @user anLA7856
 * @time 19-3-21 10:30 PM
 * @description Word-count example: counts how often each word occurs in the input files.
 */
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable>{

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

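        /**
         * Reads one line of input and emits a <word, 1> pair for every token.
         * @param key     input key (with the default text input format, the line's byte offset)
         * @param value   the text of one line
         * @param context used to emit the output pairs
         * @throws IOException
         * @throws InterruptedException
         */
        @Override
        public void map(Object key, Text value, Context context
        ) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());   // split the line into whitespace-separated tokens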
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }


    public static class IntSumReducer
            extends Reducer<Text,IntWritable,Text,IntWritable> {
        private IntWritable result = new IntWritable();

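        /**
         * The framework groups the values by key (the word), so each call receives
         * <word, list of 1>; summing that list gives the word's count.
         * The resulting <key, value> pair is then written to HDFS in TextOutputFormat.
         * When this class runs as the combiner, the sums are partial (one map task only).
         * @param key     the word
         * @param values  the counts collected for this word
         * @param context used to emit the output pair
         * @throws IOException
         * @throws InterruptedException
         */
        @Override
        public void reduce(Text key, Iterable<IntWritable> values,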
                           Context context
        ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
//        String file0 = "/input";
//        String file1 = "/output";
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");    // create the job
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);    // set the mapper class
        job.setCombinerClass(IntSumReducer.class);    // set the combiner (reuses the reducer for local aggregation)
        job.setReducerClass(IntSumReducer.class);     // set the reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // first argument: input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));    // second argument: output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Running

How to run

First make sure the input directory exists in HDFS; check with hadoop fs -ls /.
If it does not, create the HDFS directory and upload the sample files file01 and file02:

hadoop fs -mkdir /input
hadoop fs -put file0* /input

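Compile the WordCount class. Because the class declares a package, pass -d . so that javac creates the matching com/anla/chapter1 directory for the class files. (Invoking javac through hadoop like this assumes HADOOP_CLASSPATH includes ${JAVA_HOME}/lib/tools.jar, as in the official Hadoop MapReduce tutorial.)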

hadoop com.sun.tools.javac.Main -d . WordCount.java

Package the class files, including those of the nested classes, into a single jar:

jar cf wc.jar com/anla/chapter1/WordCount*.class

Finally, run the jar. The /output directory must not exist yet, otherwise the job fails at startup:

hadoop jar wc.jar com.anla.chapter1.WordCount /input /output

Done. HDFS now contains a new /output directory; use cat to view the result:

hadoop fs -cat /output/part-r-00000

The rest of this article briefly walks through the example program.

Analysis

InputFormat

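When data is fed to a map task, the input split is handed to the InputFormat, which calls getRecordReader to create a RecordReader;
the RecordReader then uses createKey and createValue to build the <key,value> pairs, i.e. <k1,v1>, that Map can process.
(These method names belong to the older org.apache.hadoop.mapred API. In the org.apache.hadoop.mapreduce API used by the code above, the counterparts are createRecordReader and nextKeyValue/getCurrentKey/getCurrentValue.)
In short, the InputFormat produces the <key,value> pairs that Map consumes.
Hadoop ships with many predefined InputFormat classes that turn different kinds of input data into <key,value> pairs for Map.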

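Take TextInputFormat, the default FileInputFormat subclass for plain text: every line becomes one record, and each record is represented as a <key,value> pair.

The key is the byte offset at which the record starts in the file; its type is LongWritable.
The value is the content of the line; its type is Text.
So the data reaches Map in the following form: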
file01
0 hello world bye world
file02
0 hello hadoop bye hadoop
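
To make the <byte offset, line> records more concrete, here is a minimal plain-Java sketch that does not use Hadoop at all. The class LineRecords and its LineRecord type are hypothetical names invented for this illustration; they only imitate how a line-oriented input format pairs each line with its starting offset. Both lines are placed in one string here so that a non-zero offset shows up, whereas in the article each file holds a single line at offset 0.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LineRecords {
    /** One record: the byte offset where the line starts, plus the line text. */
    static final class LineRecord {
        final long offset;
        final String line;

        LineRecord(long offset, String line) {
            this.offset = offset;
            this.line = line;
        }

        @Override
        public String toString() {
            return "<" + offset + ", " + line + ">";
        }
    }

    /** Splits content on '\n' and pairs every line with its starting byte offset (UTF-8). */
    static List<LineRecord> records(String content) {
        List<LineRecord> out = new ArrayList<>();
        if (content.isEmpty()) {
            return out;
        }
        long offset = 0;
        for (String line : content.split("\n")) {
            out.add(new LineRecord(offset, line));
            // +1 accounts for the newline byte that terminated this line
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        String content = "hello world bye world\nhello hadoop bye hadoop\n";
        for (LineRecord r : records(content)) {
            System.out.println(r);
        }
        // prints:
        // <0, hello world bye world>
        // <22, hello hadoop bye hadoop>
    }
}
```

The mapper in WordCount ignores the key entirely, which is why it can declare the key type as Object.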

OutputFormat

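Every input format has a corresponding output format. The default output format is TextOutputFormat, which writes each record as one line of a text file. Its keys and values may be of any type, because toString is called on them when writing. The final output looks like this: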
bye 2
hadoop 2
hello 2
world 2
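
The formatting rule can be imitated in a few lines of plain Java. This is only a sketch of the idea, not Hadoop's implementation: TextOutputSketch is a hypothetical class, and it assumes the default separator of TextOutputFormat, which is a tab character between key and value.

```java
import java.util.Map;
import java.util.TreeMap;

class TextOutputSketch {
    /** Renders each entry as key.toString() + TAB + value.toString(), one record per line. */
    static String render(Map<?, ?> records) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<?, ?> e : records.entrySet()) {
            // StringBuilder.append(Object) calls toString() on the key and the value
            sb.append(e.getKey()).append('\t').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // TreeMap keeps the keys sorted, similar to the sorted keys a reducer receives
        Map<String, Integer> counts = new TreeMap<>();
        counts.put("world", 2);
        counts.put("hello", 2);
        counts.put("hadoop", 2);
        counts.put("bye", 2);
        System.out.print(render(counts));
    }
}
```

Because only toString is involved, any key or value type with a sensible string form can be written this way.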

Map

Mapper is a generic class with four type parameters, which specify:

the input key type
the input value type
the output key type
the output value type

Reduce

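Reducer likewise takes four type parameters.
The reduce() method takes the output of map() as its input, so the Reducer's input types are <Text, IntWritable>; reduce() outputs each word and its count, so its output types are also <Text, IntWritable>.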
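
The way the four type parameters of the mapper and the reducer chain together can be sketched without Hadoop. MiniMapper, MiniReducer and Emitter below are hypothetical stand-ins written for this illustration, not Hadoop types, and plain Java types replace LongWritable, Text and IntWritable. The point is the signature of run: it only compiles when the mapper's output types equal the reducer's input types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TypeParamSketch {
    interface Emitter<K, V> { void emit(K key, V value); }

    /** Stand-in for Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>. */
    interface MiniMapper<KI, VI, KO, VO> { void map(KI key, VI value, Emitter<KO, VO> out); }

    /** Stand-in for Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>. */
    interface MiniReducer<KI, VI, KO, VO> { void reduce(KI key, Iterable<VI> values, Emitter<KO, VO> out); }

    /** K2 and V2 appear in both parameters: map output types must match reduce input types. */
    static <K1, V1, K2 extends Comparable<K2>, V2, K3, V3> List<String> run(
            MiniMapper<K1, V1, K2, V2> mapper,
            MiniReducer<K2, V2, K3, V3> reducer,
            K1 key, V1 value) {
        // group the mapper's output by key, in sorted key order
        Map<K2, List<V2>> grouped = new TreeMap<>();
        mapper.map(key, value, (k, v) -> grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        List<String> out = new ArrayList<>();
        for (Map.Entry<K2, List<V2>> e : grouped.entrySet()) {
            reducer.reduce(e.getKey(), e.getValue(), (k, v) -> out.add(k + "=" + v));
        }
        return out;
    }

    /** Word count with types <Long, String> -> <String, Integer> -> <String, Integer>. */
    static List<String> demo() {
        MiniMapper<Long, String, String, Integer> mapper = (offset, line, out) -> {
            for (String w : line.split("\\s+")) out.emit(w, 1);
        };
        MiniReducer<String, Integer, String, Integer> reducer = (word, ones, out) -> {
            int sum = 0;
            for (int one : ones) sum += one;
            out.emit(word, sum);
        };
        return run(mapper, reducer, 0L, "hello world bye world");
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [bye=1, hello=1, world=2]
    }
}
```

In a real job the same constraint applies to the Writable types. When the map output types differ from the job's final output types, they are declared separately with job.setMapOutputKeyClass and job.setMapOutputValueClass; WordCount does not need this because both stages use <Text, IntWritable>.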

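Code walkthrough

The Mapper processes one line at a time. For the sample lines Hello World Bye World and Hello Hadoop Goodbye Hadoop (a slightly different sample from the files shown earlier), the first map call emits: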

< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>

The second map call emits:

< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>
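
The two map outputs above can be reproduced with the same StringTokenizer call that TokenizerMapper uses. The sketch below is plain Java with no Hadoop dependency; MapStepSketch is a hypothetical class name, and the pairs are rendered as strings purely for display.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class MapStepSketch {
    /** Emits one "<word, 1>" pair per whitespace-separated token, in input order. */
    static List<String> map(String line) {
        List<String> pairs = new ArrayList<>();
        // default delimiters: space, tab, newline, carriage return, form feed
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add("<" + itr.nextToken() + ", 1>");
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(map("Hello World Bye World"));
        // prints [<Hello, 1>, <World, 1>, <Bye, 1>, <World, 1>]
        System.out.println(map("Hello Hadoop Goodbye Hadoop"));
        // prints [<Hello, 1>, <Hadoop, 1>, <Goodbye, 1>, <Hadoop, 1>]
    }
}
```

Note that this tokenization is case-sensitive and does not strip punctuation, so Hello and hello would be counted as different words.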

job.setCombinerClass(IntSumReducer.class) makes each map task aggregate its own output locally: IntSumReducer sums the values per key, so the two map outputs become:

< Bye, 1>
< Hello, 1>
< World, 2>

and

< Goodbye, 1>
< Hadoop, 2>
< Hello, 1>

Finally the Reducer merges the outputs of both maps by summing the values per key, so the job's final result is:

< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
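
The combine and reduce steps can be traced end to end with a small simulation. This is a sketch in plain Java under simplifying assumptions: CombineReduceSketch is a hypothetical class, each input line stands for the output of one map task, and sorted maps stand in for the framework's shuffle and sort.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombineReduceSketch {
    /** Combiner step: sums, per word, the 1s emitted for a single input line. */
    static Map<String, Integer> combine(String line) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String w : line.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                partial.merge(w, 1, Integer::sum);
            }
        }
        return partial;
    }

    /** Reduce step: merges the partial sums from all map outputs into final counts. */
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> p : partials) {
            p.forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> first = combine("Hello World Bye World");
        Map<String, Integer> second = combine("Hello Hadoop Goodbye Hadoop");
        System.out.println(first);   // prints {Bye=1, Hello=1, World=2}
        System.out.println(second);  // prints {Goodbye=1, Hadoop=2, Hello=1}
        System.out.println(reduce(Arrays.asList(first, second)));
        // prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```

Reusing the reducer as the combiner works here because addition is associative and commutative: summing partial sums gives the same totals as summing all the 1s at once. An operation without that property, such as computing an average directly, could not be reused as a combiner this way.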
