Hadoop WordCount Example

Latest recommended article published 2024-04-19 23:58:36

wister136

Category: hadoop. Tags: hadoop, mapreduce, wordcount

Article link: https://blog.csdn.net/u010599953/article/details/76022073

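Step 1: Create a mapper class
Note: WordCountMapper extends the Mapper class and overrides the map() method. Its type parameters declare the input key/value types (LongWritable, Text) and the output key/value types (Text, IntWritable). With the default TextInputFormat, the LongWritable key is the byte offset of a line in the input file and the Text value is the line itself.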

package cn.lsm.bigdata.WordCount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
/**
 * KEYIN, VALUEIN: the types of the input key and value
 * context: the job context, used here to emit output pairs
 * @author lsm
 *
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
      // Override the map method
      @Override
      protected void map(LongWritable key, Text value, Context context)throws IOException, InterruptedException {
           // Convert the incoming line to a String
           String line = value.toString();
           // Split the line into words on single spaces
           String[] words = line.split(" ");
           // Iterate over the array and emit <word, 1> for each word
           // Output is written with context.write(key, value)
           for(String word: words) {
                 context.write(new Text(word), new IntWritable(1));
           }
      }
}
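
To make the map step concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The class name MapStepSketch and the method mapLine are hypothetical stand-ins for WordCountMapper.map(); a list of String/Integer pairs stands in for the Text/IntWritable pairs written to the context.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for WordCountMapper.map(); no Hadoop types are used.
class MapStepSketch {

    // Emits one (word, 1) pair per token, mirroring
    // context.write(new Text(word), new IntWritable(1)).
    static List<Map.Entry<String, Integer>> mapLine(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split(" ")) {
            out.add(new SimpleEntry<>(word, 1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Print each emitted pair as key<TAB>value.
        for (Map.Entry<String, Integer> pair : mapLine("hello world hello")) {
            System.out.println(pair.getKey() + "\t" + pair.getValue());
        }
    }
}
```

Note that the mapper does no counting: a repeated word simply produces repeated (word, 1) pairs. Because the line is split on a single space, two consecutive spaces produce an empty token; splitting on "\\s+" is a common alternative if that matters for the input.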

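Step 2: Define a reducer class

Note: WordCountReducer extends the Reducer class and overrides the reduce() method. Its input types are the map output types (Text, IntWritable), and its output types are also Text, IntWritable.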

package cn.lsm.bigdata.WordCount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * Lifecycle of reduce: the framework calls reduce() once for each
 * key group (one key together with all of its values)
 *
 * @author lsm
 *
 */
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Start with a counter
        int count = 0;

        // Iterate over all values in this key group and add them to count
        for (IntWritable value : values) {
            count += value.get();
        }
        // Emit the resulting <word, count> pair
        context.write(key, new IntWritable(count));
    }
}
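
The reduce step can be sketched the same way in plain Java. This is an illustration under assumptions, not Hadoop code: ReduceStepSketch is a hypothetical name, Integer stands in for IntWritable, and the grouped input is built by hand to show what the framework passes to reduce() after grouping the map output by key.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for WordCountReducer.reduce(); no Hadoop types are used.
class ReduceStepSketch {

    // Sums all values that belong to one key, as the reducer's loop does.
    static int reduce(String key, Iterable<Integer> values) {
        int count = 0;
        for (int value : values) {
            count += value;
        }
        return count;
    }

    public static void main(String[] args) {
        // Hand-built key groups: each key with every value emitted for it.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        grouped.put("hello", Arrays.asList(1, 1));
        grouped.put("world", Arrays.asList(1));

        // reduce() is invoked once per key group.
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            System.out.println(group.getKey() + "\t" + reduce(group.getKey(), group.getValue()));
        }
    }
}
```

The key is passed through unchanged; only the values are combined.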

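Step 3: Create the job

Note: The purpose of this class is to submit our mapper and reducer to the cluster so that it can process the input files. The steps are:

1. Create a Configuration

2. Create a Job

3. Specify the location of the jar

4. Specify the mapper and reducer classes

5. Specify the output types of map and reduce

6. Specify the input and output paths

7. Submit the job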

package cn.lsm.bigdata.WordCount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * The job that submits the task: it declares which class is the mapper, which is
 * the reducer, where the input data is, and where the output should be written
 *
 * @author lsm
 *
 */
public class WordCountRunner {
      public static void main(String[] args) throws Exception {
           // Create a configuration
           Configuration conf = new Configuration();
           Job wcjob = Job.getInstance(conf);
           // Locate the jar for this job from the jar that contains this class
           wcjob.setJarByClass(WordCountRunner.class);

           // Set the mapper and reducer classes
           wcjob.setMapperClass(WordCountMapper.class);
           wcjob.setReducerClass(WordCountReducer.class);

           // Set the key and value types of the map output
           wcjob.setMapOutputKeyClass(Text.class);
           wcjob.setMapOutputValueClass(IntWritable.class);

           // Set the key and value types of the reduce output
           wcjob.setOutputKeyClass(Text.class);
           wcjob.setOutputValueClass(IntWritable.class);

           // Specify where the input data is located
           FileInputFormat.setInputPaths(wcjob, new Path(args[0]));
           // Specify where the output should be written
           FileOutputFormat.setOutputPath(wcjob, new Path(args[1]));

           // Submit the job to the YARN cluster and wait for it to finish
           boolean res = wcjob.waitForCompletion(true);
           System.exit(res ? 0 : 1);

      }
}
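
For intuition about what the configured job computes end to end, the following sketch runs the whole pipeline in memory in plain Java. It is only an approximation under stated assumptions: LocalWordCountSketch and run are hypothetical names, the input is a list of lines rather than files on HDFS, and a TreeMap stands in for the framework's grouping and sorting of map output by key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of the WordCount job; no Hadoop types are used.
class LocalWordCountSketch {

    static Map<String, Integer> run(List<String> lines) {
        // Map and group: emit a 1 for every word and collect the 1s per word.
        // TreeMap keeps the keys sorted, as reducer input keys are.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }

        // Reduce: sum the values of each key group.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            int count = 0;
            for (int one : group.getValue()) {
                count += one;
            }
            counts.put(group.getKey(), count);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Print each result as key<TAB>value.
        run(Arrays.asList("hello world", "hello hadoop"))
            .forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

The intermediate map of word to list of integers corresponds to the map output types set on the job (Text, IntWritable), and the final map of word to count corresponds to the job output types.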

Step 4: Run the job

  [root@mini1 hadoop]# hadoop jar mywordcount1.jar cn.lsm.bigdata.WordCount.WordCountRunner /wordcount/input/ /wordcount/output2/ 

Result of the run:
