Getting Started with MapReduce: Several Ways of Counting with wordcount (Continued)

Deniece_X

Category: mapreduce. Tags: counting, counting at multiple positions, Combiner, Partition

Article link: https://blog.csdn.net/deniece_x/article/details/78444461

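This article looks at further uses of MapReduce for counting tasks: counting only the word in the second position of each line, using a Partitioner to count words at several positions, using a Combiner to reduce the amount of data transferred, and writing a custom InputFormat to change how the input is read. These techniques can make MapReduce jobs more efficient and support more complex computations.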

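This post follows on from the MapReduce introduction. Last time we implemented a simple word count that read and split the entire file. This time we look at several other ways of reading and partitioning the data.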

Counting only the word in the second position of each line

Only the MyMapper class from last time needs to change, as follows:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable,Text,Text,IntWritable>{
    @Override
    protected void map(LongWritable key,Text value,Context context) throws IOException, InterruptedException
    {
        String[] str = value.toString().split(" ");
        // Skip lines with no second field, to avoid an ArrayIndexOutOfBoundsException
        if (str.length < 2) {
            return;
        }
        // Now count only the second word
        context.write(new Text(str[1]),new IntWritable(1));
        // Previously every word was counted:
        //context.write(new Text(ss),new IntWritable(1));
    }
}
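To see what this change does to the result, the following is a minimal plain-Java sketch that mimics the map and reduce steps without Hadoop. The class name SecondFieldCount and the sample log lines are assumptions made up for illustration; they are not from the original job.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SecondFieldCount {
    // Mirrors the mapper (emit only the second space-separated token)
    // followed by the reducer (sum the 1s per key).
    public static Map<String, Integer> countSecondField(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split(" ");
            if (fields.length < 2) {
                continue; // no second field on this line
            }
            counts.merge(fields[1], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical input lines: date, level, source, message
        List<String> lines = List.of(
            "2017-11-04 INFO hdfs started",
            "2017-11-04 WARN yarn slow",
            "2017-11-04 INFO yarn started");
        System.out.println(countSecondField(lines)); // prints {INFO=2, WARN=1}
    }
}
```

Only the values in the second column end up as keys; every other word on the line is ignored.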

Counting words at two positions: using the Partitioner class

First, the Partitioner settings need to be added to the wordcount class.

job.setNumReduceTasks(2);// two reduce tasks, one for each of the two word positions being counted
job.setPartitionerClass(MyPartitioner.class);

The MyMapper class:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable,Text,Text,IntWritable>{
    @Override
    protected void map(LongWritable key,Text value,Context context) throws IOException, InterruptedException
    {
            String[] str = value.toString().split(" ");
            // Skip lines that do not have both a second and a third field
            if (str.length < 3) {
                return;
            }
            Text ss = new Text();
            // Mark each position with a different prefix
            ss.set(new StringBuffer("loglevel::").append(str[1]).toString());
            context.write(ss, new IntWritable(1));
            // The same Text object is reused, so clear it before setting the next key.
            ss.clear();
            // "logresource::" is the prefix that MyPartitioner routes to the second reducer
            ss.set(new StringBuffer("logresource::").append(str[2]).toString());
            context.write(ss, new IntWritable(1));
    }
}

The MyReduce class does not need to change. Create a new MyPartitioner class:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartitioner extends Partitioner<Text, IntWritable>{
    @Override
    public int getPartition(Text key, IntWritable value, int arg2) {
        // The third parameter, arg2, is the value set in the wordcount class with job.setNumReduceTasks(2)
        if(key.toString().startsWith("loglevel::"))
            return 0;// send to the first reducer
        else if(key.toString().startsWith("logresource::"))
            return 1;// send to the second reducer; the partition numbers 0, 1 must be used in order
        return 0;// any other key falls back to the first reducer
    }
}

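In this way two positions of each line in a document are read and their counts are written to two separate files, so the output directory for this run will contain three files (presumably the two part-r-0000N result files plus the _SUCCESS marker).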
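The routing rule can be tried out without a cluster. Below is a minimal plain-Java sketch that applies the same prefix test to ordinary strings; the class name PrefixPartitionDemo, the method partitionFor, and the sample keys are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

public class PrefixPartitionDemo {
    // Same decision as MyPartitioner.getPartition, applied to a plain String key.
    public static int partitionFor(String key) {
        if (key.startsWith("logresource::")) {
            return 1; // second reducer
        }
        return 0; // "loglevel::" keys and anything unrecognized go to the first reducer
    }

    public static void main(String[] args) {
        // One list per reducer, standing in for the two output files.
        List<List<String>> reducers = List.of(new ArrayList<>(), new ArrayList<>());
        // Hypothetical prefixed keys as the mapper would emit them
        for (String key : List.of("loglevel::INFO", "logresource::hdfs",
                                  "loglevel::WARN", "logresource::yarn")) {
            reducers.get(partitionFor(key)).add(key);
        }
        System.out.println("reducer 0: " + reducers.get(0));
        System.out.println("reducer 1: " + reducers.get(1));
    }
}
```

Because each prefix always maps to the same partition number, all counts for one position end up in the same output file.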

Using a Combiner

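A Combiner is essentially a data-processing class that runs before the Reduce phase.
① Each Map task can produce a large amount of output. The Combiner merges that output once on the Map side, which reduces the amount of data transferred to the reducers.
② At its most basic, a Combiner merges values for the same key locally, so it behaves like a local Reduce.
Without a Combiner, all of the aggregation is done by the reducers, which is comparatively inefficient.
With a Combiner, map tasks that finish first aggregate their output locally, which speeds up the job.
Note: the Combiner's output is the Reducer's input. Because the Combiner is pluggable (optional), adding one must never change the final result. A Combiner should therefore only be used when the Reduce input key/value types are exactly the same as its output key/value types and the final result is unaffected, for example summation or taking the maximum.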
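The following plain-Java sketch illustrates why summation is safe to combine while something like an average is not. The class name CombinerSafetyDemo and the sample values are assumptions for illustration. In the word count job itself, a reducer that only sums its values could presumably be registered as the combiner with job.setCombinerClass(MyReduce.class).

```java
import java.util.List;

public class CombinerSafetyDemo {
    public static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static double mean(List<Integer> values) {
        return (double) sum(values) / values.size();
    }

    public static void main(String[] args) {
        // Hypothetical values for one key, produced by two different map tasks
        List<Integer> map1 = List.of(1, 2, 3);
        List<Integer> map2 = List.of(10);
        List<Integer> all = List.of(1, 2, 3, 10);

        // Sum: combining per map task first gives the same total as reducing everything at once.
        System.out.println(sum(all));                              // prints 16
        System.out.println(sum(List.of(sum(map1), sum(map2))));    // prints 16

        // Mean: averaging the per-task averages changes the answer.
        System.out.println(mean(all));                             // prints 4.0
        System.out.println((mean(map1) + mean(map2)) / 2);         // prints 6.0
    }
}
```

The sum survives local pre-aggregation unchanged, whereas the average of averages differs from the true average, so an averaging reducer cannot be reused directly as a Combiner.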

Implementing your own InputFormat

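The overall procedure is as follows:
(1) In wordcount, add job.setInputFormatClass(MyInputFormat.class)
(2) Create a new MyInputFormat class (it extends FileInputFormat, which in turn extends InputFormat) and override two methods: getSplits needs no changes, and createRecordReader is changed to return the custom reader (return new Myrecordreader())
(3) Create a new Myrecordreader class that extends RecordReader, with the types changed to this: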