Hadoop Word Count: A Detailed Walkthrough of Each Stage

Phoenixul

Article link: https://blog.csdn.net/lovefef4/article/details/53706611

package hadoop01;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

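    /**
     * 1. The Mapper is invoked
     * Map stage
     * Input: the key is the byte offset of the line within the file (not a line number); the value is the content of the whole line
     * Inside the map function, the input value is split into words
     * Output: data is emitted as <key,value> pairs, for example <"hello",1>
     */
    public static class WordCountmapper extends Mapper<Object, Text, Text, IntWritable>{
        // Every word is emitted with a count of 1; new IntWritable() would default to 0
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        @Override
        public void map(Object key,Text value,Context context) throws IOException, InterruptedException{
            // Split on runs of whitespace so repeated spaces and tabs do not yield empty tokens
            String [] words = value.toString().split("\\s+");
            for(String str : words){
                if(str.isEmpty()){
                    continue; // a line that starts with whitespace yields one empty first token
                }
                word.set(str);
                context.write(word, one);
            }
        }
    }

    /**
     * 2. While the map function is running
     * Map output is first stored in an in-memory buffer; once the buffer passes the spill threshold, its contents are spilled to the local file system
     * 1. In memory, a partition step runs first. It assigns each key to a reduce task to balance the load; the default partitioner takes the key's hash modulo the number of reduce tasks
     * 2. When a spill happens, the data is first sorted so that equal keys end up together, conceptually in the form <"hello",{1,1,1,1,1}>
     * 3. If a Combiner has been set, it runs at this point, on the map side
     * 4. The result is spilled to the local file system
     */
    /**
     * 3. After the map function has finished
     *  A map task may have produced several spill files; these are now merged into one
     *  The Combiner may run again while the files are being merged
     */
    /**
     * 4. The Reduce stage
     * Reduce merges the output files from the map side one last time and produces the final result
     */
    public static class WorldCountReducer extends Reducer<Text, IntWritable, Text, IntWritable>{

        @Override
        public void reduce(Text key,Iterable<IntWritable> value,Context context) throws IOException, InterruptedException{
            int total = 0;
            for(IntWritable val : value){
                // Add the value itself instead of counting elements, so the total
                // stays correct even if a Combiner has already pre-summed the counts
                total += val.get();
            }
            context.write(key, new IntWritable(total));
        }
    }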

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration configuration = new Configuration();
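        // Job.getInstance replaces the Job(Configuration, String) constructor, which is deprecated in Hadoop 2.x
        Job job = Job.getInstance(configuration,"word count");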
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountmapper.class);
        job.setReducerClass(WorldCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/root/books"));
        FileOutputFormat.setOutputPath(job, new Path("/user/root/bookout"));
        System.exit(job.waitForCompletion(true)?0:1);
    }

}
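
The comments above say the default partitioner is a hash-modulo operation. Here is a minimal plain-Java sketch of that idea, not Hadoop's own class. It assumes String.hashCode() as a stand-in for the key's hash (Hadoop's Text computes its hash differently), and the class name PartitionSketch and the reduce task count of 3 are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of hash-modulo partitioning; no Hadoop classes are used. */
final class PartitionSketch {

    /**
     * Picks a reduce task for a key: clear the sign bit of the hash code so the
     * result is never negative, then take the remainder.
     */
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3; // hypothetical number of reduce tasks
        List<String> keys = Arrays.asList("hello", "world", "hadoop", "hello");
        for (String key : keys) {
            // The same key always lands on the same reduce task
            System.out.println(key + " -> reduce task " + partition(key, numReduceTasks));
        }
    }
}
```

Because the result depends only on the key, every occurrence of "hello" goes to the same reduce task, no matter which map task emitted it. That is what lets a single reducer see all the counts for a word.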

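The sort and Combiner steps described in the comments can also be tried out without a cluster. The sketch below is an illustration in plain Java, not how Hadoop implements the spill. A TreeMap stands in for the sort, and the class name CombineSketch and the sample words are assumptions of mine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of the sort/group and combine steps; no Hadoop classes are used. */
final class CombineSketch {

    /** Sorts by key and groups the 1s emitted for each word, e.g. hello -> [1, 1, 1]. */
    static TreeMap<String, List<Integer>> sortAndGroup(List<String> words) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : words) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    /** Combiner step: sums each group so only one (word, partialCount) pair remains. */
    static TreeMap<String, Integer> combine(Map<String, List<Integer>> grouped) {
        TreeMap<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int count : entry.getValue()) {
                sum += count;
            }
            combined.put(entry.getKey(), sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // Hypothetical output of one map task, in emission order
        List<String> mapOutput = Arrays.asList("hello", "world", "hello", "hadoop", "hello");
        TreeMap<String, List<Integer>> grouped = sortAndGroup(mapOutput);
        System.out.println(grouped);          // {hadoop=[1], hello=[1, 1, 1], world=[1]}
        System.out.println(combine(grouped)); // {hadoop=1, hello=3, world=1}
    }
}
```

Note that after the combine step "hello" arrives as a single value of 3 rather than three 1s. A reducer that receives Combiner output therefore has to add up the values instead of counting how many there are.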