The purpose and usage of custom partitioning (Partitioner) in Hadoop MapReduce

历史五千年

Article link: https://blog.csdn.net/wo198711203217/article/details/80621738

Background

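In a Hadoop MapReduce job, every record emitted by a map task is handed to the Partitioner, which decides which reduce task the record is sent to. If a Combiner class is configured, it runs a local reduce on the map side over the records of each partition before they are shipped to the reduce tasks. The default partitioner is HashPartitioner, whose core code is: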

public class HashPartitioner<K, V> extends Partitioner<K, V> {

  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K key, V value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

}

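The getPartition method above works as follows:
1. It takes the hash code of the key.
2. It clears the sign bit with & Integer.MAX_VALUE so that the value is non-negative, then takes the remainder modulo the number of reduce tasks.
3. The result is a partition number from 0 to numReduceTasks - 1. As long as the keys hash evenly, the (key, value) pairs are spread roughly evenly across the reduce tasks, which balances their load.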
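
The arithmetic in the steps above can be tried without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: the class name HashPartitionDemo and the method naivePartition are made up for illustration, and java.lang.Integer stands in for a Hadoop key type. It shows why the sign bit is masked before the modulo.

```java
// Plain-Java illustration of the HashPartitioner formula; no Hadoop classes are used.
class HashPartitionDemo {

    // Same arithmetic as HashPartitioner.getPartition: clear the sign bit, then take the remainder.
    static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Hypothetical variant without the mask: a negative hash code gives a negative remainder.
    static int naivePartition(Object key, int numReduceTasks) {
        return key.hashCode() % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 4;
        Integer negativeKey = -7; // Integer.hashCode() returns the int value itself

        System.out.println(partition(negativeKey, numReduceTasks));      // prints 1
        System.out.println(naivePartition(negativeKey, numReduceTasks)); // prints -3

        // String keys: the index depends on String.hashCode(), but always lies in 0..3.
        for (String key : new String[] {"alpha", "beta", "gamma"}) {
            System.out.println(key + " -> " + partition(key, numReduceTasks));
        }
    }
}
```

A negative index such as -3 is not a valid partition number, which is the reason for the mask.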

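The code above only spreads the (key, value) pairs evenly. It cannot meet a requirement such as the following:
1. There are four input files containing the sales of each department for each quarter.
2. A MapReduce program must compute each department's sales for the whole year and write one output file per department.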

Because the output files are separated by data category (here, by department), we need a custom partitioner that sends each department's records to its own reduce task.

Custom partitioning

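Custom partitioning is simple: extend the abstract class Partitioner, override the getPartition method, and register the class on the job with job.setPartitionerClass().
Note:
The number of custom partitions must match the number of reduce tasks. More precisely, every value returned by getPartition must lie between 0 and numReduceTasks - 1. An out-of-range value makes the job fail with an "Illegal partition" error, and a reduce task that receives no keys simply produces an empty output file.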
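
The rule in the note can be checked before a job is submitted. The sketch below is plain Java and does not call Hadoop; the class PartitionRangeCheck and both of its methods are hypothetical helpers written for this illustration. It tests whether an index is legal for a given number of reduce tasks and lists the partitions that no key maps to.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Hadoop: sanity checks for custom partition indices.
class PartitionRangeCheck {

    // True when the index is a legal partition for the given number of reduce tasks.
    static boolean isLegal(int partition, int numReduceTasks) {
        return partition >= 0 && partition < numReduceTasks;
    }

    // Partitions that no key is assigned to; with the default output format
    // each of them ends up as an empty part-r-NNNNN file.
    static Set<Integer> unusedPartitions(int[] assigned, int numReduceTasks) {
        Set<Integer> unused = new TreeSet<>();
        for (int p = 0; p < numReduceTasks; p++) {
            unused.add(p);
        }
        for (int p : assigned) {
            unused.remove(p);
        }
        return unused;
    }

    public static void main(String[] args) {
        int numReduceTasks = 4;
        System.out.println(isLegal(3, numReduceTasks)); // prints true
        System.out.println(isLegal(4, numReduceTasks)); // prints false
        // Only indices 0, 1 and 2 are used, so partition 3 receives nothing.
        System.out.println(unusedPartitions(new int[] {0, 1, 2}, numReduceTasks)); // prints [3]
    }
}
```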

Code demonstration

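1. Prepare the data

Each line holds a department name and its sales figure for the quarter, separated by a tab. The departments are 研发部门 (R&D), 测试部门 (Testing), 硬件部门 (Hardware) and 销售部门 (Sales).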

[hadoop@hadoop1 ~]$ cat jidu1.txt 
研发部门        100
测试部门        90
硬件部门        92
销售部门        200
[hadoop@hadoop1 ~]$ cat jidu2.txt 
研发部门        200
测试部门        93
硬件部门        95
销售部门        230
[hadoop@hadoop1 ~]$ cat jidu3.txt 
研发部门        202
测试部门        92
硬件部门        94
销售部门        231
[hadoop@hadoop1 ~]$ cat jidu4.txt 
研发部门        209
测试部门        98
硬件部门        99
销售部门        251
[hadoop@hadoop1 ~]$

2. Upload the files to HDFS (the /jidu/input directory is assumed to exist already)

[hadoop@hadoop1 ~]$ hdfs dfs -put jidu1.txt /jidu/input
18/06/08 19:45:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@hadoop1 ~]$ hdfs dfs -put jidu2.txt /jidu/input
18/06/08 19:45:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@hadoop1 ~]$ hdfs dfs -put jidu3.txt /jidu/input
18/06/08 19:45:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@hadoop1 ~]$ hdfs dfs -put jidu4.txt /jidu/input
18/06/08 19:46:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

[hadoop@hadoop1 ~]$ hdfs dfs -ls /jidu/input
18/06/08 19:46:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 4 items
-rw-r--r--   1 hadoop supergroup         66 2018-06-08 19:45 /jidu/input/jidu1.txt
-rw-r--r--   1 hadoop supergroup         66 2018-06-08 19:45 /jidu/input/jidu2.txt
-rw-r--r--   1 hadoop supergroup         66 2018-06-08 19:45 /jidu/input/jidu3.txt
-rw-r--r--   1 hadoop supergroup         66 2018-06-08 19:46 /jidu/input/jidu4.txt

3. Write the MapReduce program
JiduMapper.java:

package com.demo;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class JiduMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        // Each input line has the form "<department>\t<sales>"; emit (department, sales).
        String line=value.toString();
        String[] ss=line.split("\t");

        context.write(new Text(ss[0]), new IntWritable(Integer.parseInt(ss[1])));
    }
}

JiduReducer.java:

package com.demo;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class JiduReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
        // Add up all sales figures received for one department.
        int sum=0;
        for(IntWritable value:values)
        {
            sum=sum+value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

JiduPartitioner.java:

package com.demo;

import org.apache.hadoop.mapreduce.Partitioner;

public class JiduPartitioner<K, V> extends Partitioner<K, V>{

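    // Every value returned here must lie in [0, numPartitions), where
    // numPartitions is the number of reduce tasks configured for the job.
    @Override
    public int getPartition(K key, V value, int numPartitions) {
        String dname=key.toString();
        switch(dname)
        {
        case "研发部门":return 0; // R&D
        case "测试部门":return 1; // Testing
        case "硬件部门":return 2; // Hardware
        case "销售部门":return 3; // Sales
        }
        // Any other key: hash it so the index always stays below numPartitions.
        // A fixed index of 4 would be out of range when only 4 reduce tasks are configured.
        // Such a key shares an output file with one of the departments above.
        return (dname.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }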

}

JiduRunner.java:

package com.demo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JiduRunner {
    public static void main(String[] args) throws Exception{
        Configuration conf=new Configuration();
        Job job=Job.getInstance(conf);
        job.setJarByClass(JiduRunner.class);
        job.setMapperClass(JiduMapper.class);
        job.setReducerClass(JiduReducer.class);
        // Summation is associative and commutative, so the reducer can double as the combiner.
        job.setCombinerClass(JiduReducer.class);
        job.setPartitionerClass(JiduPartitioner.class);

        // One reduce task per department, matching the indices 0..3 returned by JiduPartitioner.
        job.setNumReduceTasks(4);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

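        // Input directory on HDFS that holds the four quarterly files.
        FileInputFormat.setInputPaths(job, new Path("hdfs://192.168.16.2:9000/jidu/input"));

        // Output directory; it must not exist before the job runs.
        FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.16.2:9000/jidu/output"));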

        System.exit(job.waitForCompletion(true)?0:1);
    }
}

Output:

[hadoop@hadoop1 ~]$ hdfs dfs -ls /jidu/output
18/06/08 20:59:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 5 items
-rw-r--r--   3 hadoop supergroup          0 2018-06-08 20:56 /jidu/output/_SUCCESS
-rw-r--r--   3 hadoop supergroup         17 2018-06-08 20:56 /jidu/output/part-r-00000
-rw-r--r--   3 hadoop supergroup         17 2018-06-08 20:56 /jidu/output/part-r-00001
-rw-r--r--   3 hadoop supergroup         17 2018-06-08 20:56 /jidu/output/part-r-00002
-rw-r--r--   3 hadoop supergroup         17 2018-06-08 20:56 /jidu/output/part-r-00003
[hadoop@hadoop1 ~]$ hdfs dfs -cat /jidu/output/part-r-00000
18/06/08 20:59:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
研发部门        711
[hadoop@hadoop1 ~]$ hdfs dfs -cat /jidu/output/part-r-00002
18/06/08 20:59:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
硬件部门        380
[hadoop@hadoop1 ~]$ hdfs dfs -cat /jidu/output/part-r-00001
18/06/08 21:00:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
测试部门        373
[hadoop@hadoop1 ~]$ hdfs dfs -cat /jidu/output/part-r-00003
18/06/08 21:00:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
销售部门        912
[hadoop@hadoop1 ~]$
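
The four totals can be cross-checked locally. The sketch below is plain Java rather than MapReduce, and the class JiduLocalCheck is a hypothetical name. It hard-codes the figures from jidu1.txt to jidu4.txt, with the columns ordered by partition index (0 = 研发部门, 1 = 测试部门, 2 = 硬件部门, 3 = 销售部门), and sums each column the way one reduce task sums one department.

```java
// Local cross-check of the job output; plain Java, no Hadoop involved.
class JiduLocalCheck {

    // Rows are jidu1.txt .. jidu4.txt; columns follow the partition order 0..3.
    static final int[][] QUARTERLY_SALES = {
        {100, 90, 92, 200},
        {200, 93, 95, 230},
        {202, 92, 94, 231},
        {209, 98, 99, 251},
    };

    // Sum each column, mimicking one reduce task per partition.
    static int[] totalsPerPartition(int[][] sales, int numPartitions) {
        int[] totals = new int[numPartitions];
        for (int[] quarter : sales) {
            for (int p = 0; p < numPartitions; p++) {
                totals[p] += quarter[p];
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        int[] totals = totalsPerPartition(QUARTERLY_SALES, 4);
        for (int p = 0; p < totals.length; p++) {
            // Prints 711, 373, 380 and 912 for part-r-00000 .. part-r-00003.
            System.out.println("part-r-0000" + p + " -> " + totals[p]);
        }
    }
}
```

The sums agree with the contents of part-r-00000 to part-r-00003 shown above.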