Hadoop Beginner Class (Part 2)

大兔齐齐

Article link: https://blog.csdn.net/Datuqiqi/article/details/45918343

A Detailed Walkthrough of the WordCount Program

This post walks through the WordCount program line by line.

WordCount source code:

<span style="font-size:18px;">package ustc.hilab.wordcount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // Mapper: emits <word, 1> for every token in an input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums all the 1s collected for one word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }

        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}</span>

The standard Java imports need no explanation.

<span style="font-size:18px;">import org.apache.hadoop.conf.Configuration;</span>

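This imports the Configuration class, which Hadoop uses to read, write, and configure its various resources (for example, the settings loaded from core-site.xml).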

<span style="font-size:18px;">import org.apache.hadoop.fs.Path;</span>

Imports the Path class, which holds the path string of a file or directory.

<span style="font-size:18px;">import org.apache.hadoop.io.IntWritable;</span>

Imports the IntWritable class, which wraps an int in an object that Hadoop can serialize.
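
To make "an integer represented as a class" more concrete, here is a minimal plain-Java sketch with no Hadoop dependency. The class `IntBox` and its helper methods `toBytes` and `fromBytes` are hypothetical names invented for this illustration. They only mimic the idea behind IntWritable (a mutable int holder that writes itself to a binary stream and reads itself back) and are not Hadoop's actual implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for IntWritable: a reusable, serializable int holder.
class IntBox {
    private int value;

    public IntBox(int value) { this.value = value; }

    public int get() { return value; }

    public void set(int value) { this.value = value; }

    // Serialize the wrapped int as 4 big-endian bytes.
    public byte[] toBytes() {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeInt(value);
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Overwrite the wrapped int with the value read from the given bytes.
    public void fromBytes(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            value = in.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        IntBox original = new IntBox(42);
        byte[] bytes = original.toBytes();

        IntBox copy = new IntBox(0);   // the same object can be reused for many values
        copy.fromBytes(bytes);
        System.out.println(bytes.length + " bytes, value " + copy.get());
    }
}
```

Because the holder is mutable, one instance can be reused for many records, which is why the WordCount code keeps a single `one` and a single `result` object rather than allocating a new one per record.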

<span style="font-size:18px;">import org.apache.hadoop.io.Text;</span>

Imports the Text class from Hadoop's io package. Text is a comparable, serializable class for storing strings.

<span style="font-size:18px;">import org.apache.hadoop.mapreduce.Job;</span>

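This imports the Job class. In Hadoop, every unit of work to be run is a job. The Job object is responsible for many things: configuring parameters, setting the MapReduce details, submitting the work to the Hadoop cluster, controlling execution, querying execution status, and so on.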

<span style="font-size:18px;">import org.apache.hadoop.mapreduce.Mapper;</span>

The Mapper class is important: it maps input key-value pairs to output key-value pairs, which is the map phase.

<span style="font-size:18px;">import org.apache.hadoop.mapreduce.Reducer;</span>

Handles the reduce phase.

<span style="font-size:18px;">import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;</span>

This class divides the input files into splits, because only split input can be processed in parallel.

<span style="font-size:18px;">import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;</span>

Writes the processing results to output files.

<span style="font-size:18px;">import org.apache.hadoop.util.GenericOptionsParser;</span>

This class parses Hadoop's command-line arguments.

The processing logic:

<span style="font-size:18px;">public static class TokenizerMapper
    extends Mapper<Object,Text,Text,IntWritable></span>

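This defines our own map step by extending Hadoop's Mapper class. The four type parameters are, in order: Object, the type of the input key; Text, the type of the input value; Text, the type of the output key; and IntWritable, the type of the output value. The input key is generated by Hadoop itself: with the default TextInputFormat it is the byte offset of the line within the file. That offset is unimportant here and is normally not used. The input value is the line of text to process. After map, each word becomes a key-value pair such as <"ss", 1>, where "ss" has the third parameter type (Text) and 1 has the fourth (IntWritable).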
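
To see what the input key looks like, here is a small plain-Java sketch with no Hadoop dependency. It computes the byte offset at which each line starts, which is the kind of key the default TextInputFormat hands to map. The class `LineOffsetSketch` and the method `lineOffsets` are hypothetical names for this illustration, and the sketch assumes UTF-8 text with `\n` line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the <offset, line> pairs a mapper receives.
class LineOffsetSketch {

    // Maps the starting byte offset of each line to the line's text.
    public static Map<Long, String> lineOffsets(String text) {
        Map<Long, String> result = new LinkedHashMap<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            result.put(offset, line);
            // Advance past the line's bytes plus the one-byte '\n' terminator.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {0=hello world, 12=foo bar}
        System.out.println(lineOffsets("hello world\nfoo bar\n"));
    }
}
```

Each entry corresponds to one call of map: the offset is the key that WordCount ignores, and the line is the value it tokenizes.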

<span style="font-size:18px;">private final static IntWritable one=new IntWritable(1);</span>

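Defines the constant output value 1. Every occurrence of a word emits this 1, and the reducer later adds the 1s together.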

<span style="font-size:18px;">private Text word=new Text();</span>

Defines the output key.

<span style="font-size:18px;">public void map(Object key, Text value, Context context) throws IOException, InterruptedException</span>

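Defines the map function, which takes three parameters. key is the input key, which is not actually used here; value is the input value. context can be thought of as the channel for passing data and other runtime state: map writes its key-value pairs to context, and Hadoop delivers them to the Reducer. After reduce has processed them, the results are again written to context, and Hadoop writes them to HDFS.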

<span style="font-size:18px;">StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }</span>

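The input is read one line at a time. Each line is split into words on whitespace, the loop runs for as long as tokens remain, and every word found is emitted as a <word, 1> pair.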
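
The tokenizing loop can be tried outside Hadoop. The following plain-Java sketch collects the pairs in a list instead of writing them to a Context. The class `MapStepSketch` and its `map` method are hypothetical names for this illustration. Plain `String` and `Integer` stand in for Text and IntWritable.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical stand-in for the map step: one input line in, <word, 1> pairs out.
class MapStepSketch {

    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> output = new ArrayList<>();
        // StringTokenizer splits on whitespace by default.
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            // Stand-in for context.write(word, one).
            output.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return output;
    }

    public static void main(String[] args) {
        // prints [sss=1, is=1, dd=1, sss=1]
        System.out.println(map("sss is dd sss"));
    }
}
```

Note that the map step does no counting: a word that appears twice simply produces two separate pairs.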

<span style="font-size:18px;">public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException</span>

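Defines the reduce function. key is the input key. values implements the Iterable interface, so you can think of it as holding several IntWritable integers that can be visited one by one. The Context type plays the same role as the one in the Mapper, but it is defined inside the Reducer class. For example, if the input is the string "sss is dd sss", then once the map output has been grouped by key, reduce is called with key="sss", values=<1,1>; key="is", values=<1>; and so on.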
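
The grouping that happens between map and reduce can be imitated in plain Java. In this sketch a `TreeMap` stands in for Hadoop's sort-and-group step. That is an assumption made for illustration, not how Hadoop implements its shuffle. The class `ShuffleSketch` and the method `group` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical illustration of grouping map output by key before reduce.
class ShuffleSketch {

    public static SortedMap<String, List<Integer>> group(String line) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            // Each token contributes one 1 to the list kept for its key.
            grouped.computeIfAbsent(itr.nextToken(), k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // prints {dd=[1], is=[1], sss=[1, 1]}
        System.out.println(group("sss is dd sss"));
    }
}
```

Each map entry corresponds to one call of reduce: the key is the word, and the list plays the role of the `values` iterable.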

<span style="font-size:18px;">{
        int sum=0;
        for(IntWritable val:values){
            sum+=val.get();
        }
        result.set(sum);
        context.write(key,result);
        }
    }</span>

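This is the reduce step. A loop walks through values and adds them up, which amounts to summing a series of 1s. The total is stored in result, and the key-value pair is then written to context, so the output is <sss, 2> and so on.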
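
The summing loop can also be run on its own. In this plain-Java sketch the grouped input is written out by hand, and `Integer` stands in for IntWritable. The class `ReduceStepSketch` and its `reduce` method are hypothetical names for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the reduce step: sum the values collected for one key.
class ReduceStepSketch {

    public static int reduce(Iterable<Integer> values) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Grouped map output for the line "sss is dd sss", written by hand.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        grouped.put("sss", Arrays.asList(1, 1));
        grouped.put("is", Arrays.asList(1));
        grouped.put("dd", Arrays.asList(1));

        // Print one "word<TAB>count" line per key.
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            System.out.println(entry.getKey() + "\t" + reduce(entry.getValue()));
        }
    }
}
```

Running it prints one line per word, with the count 2 for sss and 1 for the other two words.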

<span style="font-size:18px;">public static void main(String[] args)throws Exception{
    Configuration conf=new Configuration();</span>

This line reads parameters from Hadoop's configuration files.

<span style="font-size:18px;">String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: wordcount <in> <out>");
        System.exit(2);
    }</span>

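Reads the command-line arguments and exits unless exactly two remain after Hadoop's own options have been parsed. The first is the input path and the second is the output path.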

<span style="font-size:18px;">Job job = new Job(conf, "word count");</span>

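Every processing task that runs is a job, and "word count" is the name of this one. Newer Hadoop releases deprecate this constructor in favor of Job.getInstance(conf, "word count").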

<span style="font-size:18px;">job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);</span>

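setJarByClass tells Hadoop which jar to ship to the cluster: it locates the jar that contains the given class.

The remaining calls set the mapper class, set the reducer class, set the output key type to Text, and set the output value type to IntWritable.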

<span style="font-size:18px;">FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));</span>

The two command-line arguments are the input path and the output path, respectively.

<span style="font-size:18px;">System.exit(job.waitForCompletion(true)?0:1);</span>

Waits for the job to finish, then exits with status 0 on success or 1 on failure.

That concludes the walkthrough.
