The Hadoop Partitioner Component

1. The Partitioner component lets the map side partition its output by key, so that records with different keys can be routed to different reduce tasks for processing.
2. You can define your own distribution rule for keys. For example, if the input data contains records for several provinces and the requirement is one output file per province, a custom partitioner can send each province's records to a dedicated reducer.
3. Hadoop ships a default implementation, HashPartitioner, found in org.apache.hadoop.mapreduce.lib.partition.HashPartitioner.java:

package org.apache.hadoop.mapreduce.lib.partition;

import org.apache.hadoop.mapreduce.Partitioner;

/** Partition keys by their {@link Object#hashCode()}. */
public class HashPartitioner<K, V> extends Partitioner<K, V> {

  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K key, V value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

}
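Note the & Integer.MAX_VALUE mask in getPartition(): key.hashCode() may be negative, and Java's % operator would then return a negative remainder, which is not a valid partition number. Masking off the sign bit keeps the result in the range [0, numReduceTasks). A minimal standalone sketch of the arithmetic (the value -7 simply stands in for a negative hash code):

public class MaskDemo {
    public static void main(String[] args) {
        int hash = -7;            // stands in for a negative key.hashCode()
        int numReduceTasks = 4;
        // Plain modulo can go negative, which is not a legal partition number:
        System.out.println(hash % numReduceTasks);                       // prints -3
        // Clearing the sign bit first always yields 0..numReduceTasks-1:
        System.out.println((hash & Integer.MAX_VALUE) % numReduceTasks); // prints 1
    }
}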

4. Writing a custom Partitioner
1) Extend the abstract class Partitioner and implement your own getPartition() method (a minimal sketch combining both steps follows the listing below).
2) Register the custom class on the job via job.setPartitionerClass().
The abstract class is defined in org.apache.hadoop.mapreduce.Partitioner.java:

package org.apache.hadoop.mapreduce;

/** 
 * Partitions the key space.
 * 
 * <p><code>Partitioner</code> controls the partitioning of the keys of the 
 * intermediate map-outputs. The key (or a subset of the key) is used to derive
 * the partition, typically by a hash function. The total number of partitions
 * is the same as the number of reduce tasks for the job. Hence this controls
 * which of the <code>m</code> reduce tasks the intermediate key (and hence the 
 * record) is sent for reduction.</p>
 * 
 * @see Reducer
 */
public abstract class Partitioner<KEY, VALUE> {

  /** 
   * Get the partition number for a given key (hence record) given the total 
   * number of partitions i.e. number of reduce-tasks for the job.
   *   
   * <p>Typically a hash function on all or a subset of the key.</p>
   *
   * @param key the key to be partitioned.
   * @param value the entry value.
   * @param numPartitions the total number of partitions.
   * @return the partition number for the <code>key</code>.
   */
  public abstract int getPartition(KEY key, VALUE value, int numPartitions);

}
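As a minimal sketch of the two steps above, here is a hypothetical partitioner for the province scenario mentioned earlier (the class name, the PROVINCES array, and the Text/IntWritable types are all assumptions for illustration):

package com.partitioner;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Step 1: extend Partitioner and implement getPartition().
// Hypothetical sketch for the "one output file per province" requirement above;
// assumes the map output key is the province name as a Text.
public class ProvincePartitioner extends Partitioner<Text, IntWritable> {

    private static final String[] PROVINCES = { "beijing", "shanghai", "guangdong" };

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String province = key.toString();
        for (int i = 0; i < PROVINCES.length; i++) {
            if (PROVINCES[i].equals(province)) {
                return i;          // a dedicated partition per known province
            }
        }
        return PROVINCES.length;   // all other provinces share the last partition
    }
}

// Step 2: register it in the driver:
//   job.setPartitionerClass(ProvincePartitioner.class);
//   job.setNumReduceTasks(PROVINCES.length + 1); // one reducer per partition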

A Partitioner Example
Application scenario:
Requirement: compute the weekly sales total of each product separately.
Weekly sales list from site1:
shoes 20
hat 10
stockings 30
clothes 40

Weekly sales list from site2:
shoes 15
hat 1
stockings 90
clothes 80

Aggregated result:
shoes 35
hat 11
stockings 120
clothes 120

The code is as follows:
MyMapper.java

package com.partitioner;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line looks like "shoes 20": a product name, whitespace, a count.
        String[] s = value.toString().split("\\s+");
        // Emit (product, count), e.g. ("shoes", 20).
        context.write(new Text(s[0]), new IntWritable(Integer.parseInt(s[1])));
    }

}

MyPartitioner.java

package com.partitioner;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartitioner extends Partitioner<Text,IntWritable>{

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Route each product to its own fixed partition (and hence its own reducer).
        if (key.toString().equals("shoes")) {
            return 0;
        }

        if (key.toString().equals("hat")) {
            return 1;
        }

        if (key.toString().equals("stockings")) {
            return 2;
        }

        // clothes, and any product not matched above
        return 3;
    }

}
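Note the contract this creates with the driver: getPartition() can return 0 through 3, so the job must run with at least four reduce tasks (see setNumReduceTasks(4) in TestPartitioner below); with fewer, the map task fails with an "Illegal partition" error when it encounters an out-of-range partition number.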

MyReducer.java

package com.partitioner;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> value, Context context)
            throws IOException, InterruptedException {
        // Sum the counts for one product across all sites.
        int sum = 0;
        for (IntWritable val : value) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }

}

TestPartitioner.java

package com.partitioner;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class TestPartitioner {
    public static void main(String args[])throws Exception{
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
          System.err.println("Usage: TestPartitioner <in> <out>");
          System.exit(2);
        }

        Job job = Job.getInstance(conf, "weekly sales");
        job.setJarByClass(TestPartitioner.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setPartitionerClass(MyPartitioner.class);
        job.setNumReduceTasks(4); // one reduce task for each partition returned by MyPartitioner

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);


        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
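With four reduce tasks and MyPartitioner in place, each product is handled by its own reducer, so the job writes one output file per product. Assuming the two site listings above as input, the output directory would be expected to look like this:

part-r-00000: shoes 35
part-r-00001: hat 11
part-r-00002: stockings 120
part-r-00003: clothes 120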