A Hadoop Starter Example

UpCoderXH

Category: Data Mining. Tags: hadoop

Article link: https://blog.csdn.net/liangdong2014/article/details/62428936

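"Hello World": WordCount

Here we use two input files, both in the input directory, and write the results to the output directory.
Both directories are passed to main as arguments: args[0] is the input directory and args[1] is the output directory.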

The Map program

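// Needs: java.io.IOException, java.util.StringTokenizer,
// org.apache.hadoop.io.{IntWritable, LongWritable, Text}, org.apache.hadoop.mapreduce.Mapper
/**
 * This mapper counts word occurrences by emitting a <word, 1> pair per word.
 * LongWritable is the input key type.
 * Text is the input value type.
 * Text is the output key type.
 * IntWritable is the output value type.
 */
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable>
{
    /**
     * IntWritable wraps an int; the 1 here means every occurrence of a word counts once.
     */
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    /**
     * Receives the <key, value> pairs produced by the InputFormat
     * and emits the <key, value> pairs produced by our own processing.
     */
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException
    {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        //System.out.println("key is " + key.toString());

        while (tokenizer.hasMoreTokens())
        {
            String token = tokenizer.nextToken();
            word.set(token);
            // Let write failures propagate so the framework can fail the task,
            // instead of printing the stack trace and carrying on.
            context.write(word, one);
        }
    }
}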
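To see what this mapper emits without a cluster, the tokenizing step can be tried in plain Java. The sketch below is not Hadoop code: the names `MapSketch` and `mapLine` are made up for illustration, and a list of entries stands in for the calls to `context.write`.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;
import java.util.StringTokenizer;

// Plain-Java stand-in for the map step (hypothetical names, no Hadoop dependency).
class MapSketch {
    // Splits one input line on whitespace and emits a (word, 1) pair per token,
    // mirroring what Map.map() hands to context.write().
    static List<Entry<String, Integer>> mapLine(String line) {
        List<Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(new SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return out;
    }

    public static void main(String[] args) {
        // A repeated word is emitted once per occurrence; the summing happens later, in reduce.
        System.out.println(mapLine("hello hadoop hello"));
    }
}
```

Running it prints `[hello=1, hadoop=1, hello=1]`: the mapper does no counting of its own, it only labels each occurrence with a 1.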

The Reduce program

// Needs: java.io.IOException, org.apache.hadoop.io.{IntWritable, Text},
// org.apache.hadoop.mapreduce.Reducer
/**
 * This reducer adds up the counts that Map emitted for each word.
 * It is a sketch of the usual WordCount reducer, assumed to be what the
 * Reduce class referenced in main looks like.
 * Text is the input key type.
 * IntWritable is the input value type.
 * Text is the output key type.
 * IntWritable is the output value type.
 */
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable>
{
    private IntWritable result = new IntWritable();

    /**
     * Receives one word together with every count emitted for it
     * and writes out the <word, total> pair.
     */
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException
    {
        int sum = 0;
        for (IntWritable value : values)
        {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
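Between map and reduce, the framework gathers every value emitted for the same key, so a reducer sees each word once, along with all of its 1s. A minimal plain-Java sketch of that grouping and of the summing a WordCount reducer performs; `ReduceSketch`, `group`, and `sum` are hypothetical names, and a `TreeMap` stands in for the framework's sorted shuffle.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for shuffle + reduce (hypothetical names, no Hadoop dependency).
class ReduceSketch {
    // Stand-in for the shuffle: collects all values emitted for the same key, keys sorted.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Stand-in for reduce(): adds up the counts that arrived for one word.
    static int sum(Iterable<Integer> counts) {
        int total = 0;
        for (int count : counts) {
            total += count;
        }
        return total;
    }

    public static void main(String[] args) {
        // The pairs a mapper would emit for the line "hello hadoop hello".
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("hello", 1));
        pairs.add(new SimpleEntry<>("hadoop", 1));
        pairs.add(new SimpleEntry<>("hello", 1));
        // One output line per word: key, a tab, then the total.
        group(pairs).forEach((word, counts) -> System.out.println(word + "\t" + sum(counts)));
    }
}
```

The output is one line for `hadoop` with 1 and one for `hello` with 2, in key order.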

The main function

public static void main(String[] args)
        throws IOException, ClassNotFoundException, InterruptedException
{
    // Remove a leftover output directory first: the job refuses to start if it already exists.
    // deleteDirectory is a helper in WordCount that is not shown here.
    // Note that java.io.File only sees the local file system.
    File dir = new File(args[1]);
    if(dir.exists())
    {
        if(WordCount.deleteDirectory(args[1]))
        {
            System.out.println("delete success");
        }else{
            System.out.println("delete failed");
        }
    } else{
        System.out.println(args[1]+" not exists!");
    }
    // Practice reading a configuration file
    Configuration conf = new Configuration();
    conf.addResource("configuration/configuration_default.xml");
    // Let IOException propagate; swallowing it would leave job null and fail later with an NPE
    Job job = Job.getInstance(conf);

    job.setJarByClass(WordCount.class);
    //FindVarMap.var = "Hadoop";
    //Class<FindVarMap> map = FindVarMap.class;
    //Class<FindVarReduce> reduce = FindVarReduce.class;
    Class<Map> map = Map.class;
    Class<Reduce> reduce = Reduce.class;
    job.setMapperClass(map);
    job.setReducerClass(reduce);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    /*
     * An InputSplit is what Hadoop hands to each individual map task.
     * It does not hold the data itself, only the length of the split and
     * where the data is located.
     * How InputSplits are generated is determined by the InputFormat.
     * In short, the InputFormat produces the <key, value> pairs given to map.
     * Each line yields one <key, value> pair whose key is the byte offset of
     * the line within its file, so the keys of both input files start at 0.
     */
    FileInputFormat.addInputPath(job, new Path(args[0]));
    TextOutputFormat.setOutputPath(job, new Path(args[1]));

    // submit() returns as soon as the job is handed over, so "finish" would print too early.
    // waitForCompletion(true) blocks until the job ends and reports progress.
    boolean succeeded = job.waitForCompletion(true);
    System.out.println(succeeded ? "finish" : "job failed");
}
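The comment in main about InputFormat can be made concrete. With the default text input format, the key of each record is the byte offset at which the line starts in its file and the value is the line itself, which is why the first key of every file is 0. The sketch below reproduces that numbering in plain Java; `OffsetSketch` and `records` are hypothetical names, and it assumes UTF-8 text with `\n` line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java illustration of <byte offset, line> records (hypothetical names, no Hadoop dependency).
class OffsetSketch {
    // Returns (byte offset of the line start -> line text) for the content of one file.
    static Map<Long, String> records(String fileContent) {
        Map<Long, String> out = new LinkedHashMap<>();
        long offset = 0;
        for (String line : fileContent.split("\n")) {
            out.put(offset, line);
            // Advance past the line's bytes plus the one-byte '\n' terminator.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        // Two independent "files": the numbering restarts at 0 for each one.
        System.out.println(records("hello world\nhello hadoop\n"));
        System.out.println(records("bye world\n"));
    }
}
```

This prints `{0=hello world, 12=hello hadoop}` and then `{0=bye world}`. The keys are not unique across files, which is harmless for WordCount because the mapper ignores them.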

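Code explanation
- A Job represents one task; through it we specify properties such as the Map and Reduce classes.
- To follow what happens after that, refer to the Job's workflow diagram.
- The Map we wrote is the Mapper in the diagram, and the Reduce we wrote is the Reducer. The remaining steps in the diagram are carried out for us by the MapReduce framework.
- The meaning of Mapper's four type parameters is given in the code comments; the same applies to Reducer.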
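How the four type parameters of Mapper and Reducer fit together, and what the framework does between the two, can be sketched as a toy single-process runner. Everything here (`MiniJob`, `MapFn`, `ReduceFn`, `run`, `wordCount`) is a hypothetical stand-in, not the Hadoop API; the point is only that the mapper's output types must equal the reducer's input types, and that the grouping step belongs to the framework.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.function.BiConsumer;

// Toy single-process stand-in for the MapReduce flow (hypothetical names, no Hadoop dependency).
class MiniJob {
    // Mirrors Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>: one record in, any number of pairs out.
    interface MapFn<KI, VI, KO, VO> {
        void map(KI key, VI value, BiConsumer<KO, VO> emit);
    }

    // Mirrors Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>: its input types are the mapper's output types.
    interface ReduceFn<KI, VI, KO, VO> {
        void reduce(KI key, Iterable<VI> values, BiConsumer<KO, VO> emit);
    }

    // Keys must be Comparable at runtime because TreeMap sorts them.
    static <KI, VI, K, V, KO, VO> Map<KO, VO> run(
            Map<KI, VI> input, MapFn<KI, VI, K, V> mapper, ReduceFn<K, V, KO, VO> reducer) {
        // Map phase plus shuffle: group every emitted value under its key, keys sorted.
        Map<K, List<V>> grouped = new TreeMap<>();
        for (Map.Entry<KI, VI> record : input.entrySet()) {
            mapper.map(record.getKey(), record.getValue(),
                    (k, v) -> grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        }
        // Reduce phase: one call per distinct key.
        Map<KO, VO> output = new TreeMap<>();
        for (Map.Entry<K, List<V>> group : grouped.entrySet()) {
            reducer.reduce(group.getKey(), group.getValue(), output::put);
        }
        return output;
    }

    // WordCount expressed against the toy runner: <Long, String> in, <String, Integer> out.
    static Map<String, Integer> wordCount(Map<Long, String> lines) {
        return run(lines,
                (Long offset, String line, BiConsumer<String, Integer> emit) -> {
                    StringTokenizer tokenizer = new StringTokenizer(line);
                    while (tokenizer.hasMoreTokens()) {
                        emit.accept(tokenizer.nextToken(), 1);
                    }
                },
                (String word, Iterable<Integer> ones, BiConsumer<String, Integer> emit) -> {
                    int sum = 0;
                    for (int one : ones) {
                        sum += one;
                    }
                    emit.accept(word, sum);
                });
    }

    public static void main(String[] args) {
        Map<Long, String> lines = new LinkedHashMap<>();
        lines.put(0L, "hello world");
        lines.put(12L, "hello hadoop");
        System.out.println(wordCount(lines));
    }
}
```

Running it prints `{hadoop=1, hello=2, world=1}`. Only the two lambdas correspond to code we write for a real job; the body of `run` is the part that Hadoop supplies, distributed across machines.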
Full code links:
- wordcount
- Hadoop 实战, Chapter 3
- Hadoop 实战, Chapter 5
- Hadoop 实战, Chapter 6