3. MapReduce Advanced Interface Programming (partitioner, sort, combiner)

dream0352

Article link: https://blog.csdn.net/dream0352/article/details/59021274

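Partitioner--Partitioning

The partitioner's main job is to send each map result to the appropriate reduce task.

The Partitioner component lets the map side partition its output by key, so that different keys can be dispatched to different reduce tasks for processing.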

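A custom partitioner must extend org.apache.hadoop.mapreduce.Partitioner. HashPartitioner is the default partitioner in MapReduce. It picks the target reducer as (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.

HashPartitioner works on the output of the Mapper tasks. Its getPartition() method takes three parameters: key and value are the Mapper output, and numReduceTasks is the configured number of Reducer tasks, which defaults to 1. The remainder of any integer divided by 1 is 0, so with the default setting getPartition(…) always returns 0. All Mapper output then goes to a single Reducer task and ends up in a single output file.

To produce several output files, the Mapper output has to be split into several partitions: raise the number of Reducer tasks and have getPartition(…) return 0, 1, 2, 3… according to some rule.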
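The formula above can be tried out without a cluster. The following is a minimal plain-Java sketch, not Hadoop's own class: HashPartitionSketch and partitionFor are names made up for illustration, and the phone numbers are sample data. It shows that one reduce task always yields partition 0, and that the bit mask keeps the result non-negative even when hashCode() is negative.

```java
// Plain-Java sketch of hash partitioning; no Hadoop classes are used.
// HashPartitionSketch and partitionFor are hypothetical names.
class HashPartitionSketch {

    // Same arithmetic as the formula in the text. The mask clears the sign
    // bit, so a negative hashCode() can never produce a negative partition.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = {"13912345678", "15312345678", "18212345678"};
        for (String key : keys) {
            // With a single reduce task every key maps to partition 0.
            System.out.println(key + " -> " + partitionFor(key, 1)
                    + " (1 reducer), " + partitionFor(key, 4) + " (4 reducers)");
        }
    }
}
```

Note that a hash spreads keys more or less evenly but says nothing about which carrier a number belongs to, which is why the case below needs a custom rule.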

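Usually the default partition function is good enough, but some business requirements call for a custom Partitioner. An example:
split records by phone number into China Mobile, China Unicom, and China Telecom, and save them in three separate files.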

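  Steps to implement partitioning: 

  1.1 Analyze the business logic first and decide roughly how many partitions are needed 

  1.2 Write a class that extends org.apache.hadoop.mapreduce.Partitioner 

  1.3 Override public int getPartition; based on the business logic (for example a lookup in a database or in configuration), return the same number for all keys that belong to the same partition 

  1.4 Set the Partitioner class in the main method, job.setPartitionerClass(DataCountPartitioner.class); 

  1.5 Set the number of Reducers, for example job.setNumReduceTasks(6); it must be at least the number of partitions 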

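import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// DataCountMap, DataCountReduce and DataBean are defined elsewhere in the project.
public class DataCount {

	// args[0]: input path, args[1]: output path, args[2]: number of reduce tasks
	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);

		job.setJarByClass(DataCount.class);

		job.setMapperClass(DataCountMap.class);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(DataBean.class);
		FileInputFormat.setInputPaths(job, new Path(args[0]));

		job.setReducerClass(DataCountReduce.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(DataBean.class);
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		job.setNumReduceTasks(Integer.parseInt(args[2]));

		job.setPartitionerClass(DataCountPartitioner.class);

		// Exit with a non-zero status if the job fails.
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}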

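	public static class DataCountPartitioner extends Partitioner<Text, DataBean> {

		// Phone-number prefix -> partition number; unknown prefixes go to partition 0.
		private static final Map<String, Integer> map = new HashMap<String, Integer>();

		static {
			map.put("139", 1);
			map.put("153", 2);
			map.put("182", 3);
		}

		/**
		 * numPartitions is the number of partitions, which equals the number of
		 * reduce tasks started. This partitioner returns values 0 to 3, so run the
		 * job with at least 4 reduce tasks; with 2 or 3 the job fails with an
		 * illegal-partition error.
		 */
		@Override
		public int getPartition(Text key, DataBean bean, int numPartitions) {
			String account = key.toString();
			// Keys shorter than three characters cannot match any prefix.
			if (account.length() < 3) {
				return 0;
			}
			String telSub = account.substring(0, 3);
			Integer partition = map.get(telSub);
			if (partition == null) {
				partition = 0;
			}
			return partition;
		}
	}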
}
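The lookup rule inside getPartition can be checked on its own, without submitting a job. The sketch below is plain Java with no Hadoop dependency; PrefixPartitionSketch and partitionFor are hypothetical names, and the final modulo is an added safeguard that is not part of the code above. It reuses the same three prefixes.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of the prefix lookup; names are hypothetical.
class PrefixPartitionSketch {

    private static final Map<String, Integer> PREFIX_TO_PARTITION = new HashMap<>();

    static {
        PREFIX_TO_PARTITION.put("139", 1);
        PREFIX_TO_PARTITION.put("153", 2);
        PREFIX_TO_PARTITION.put("182", 3);
    }

    static int partitionFor(String account, int numPartitions) {
        int partition = 0; // unknown or too-short keys fall into partition 0
        if (account != null && account.length() >= 3) {
            partition = PREFIX_TO_PARTITION.getOrDefault(account.substring(0, 3), 0);
        }
        // Assumed safeguard (not in the original code): fold the result into
        // the valid range when fewer than four partitions are configured.
        return partition % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("13912345678", 4));
        System.out.println(partitionFor("17012345678", 4));
    }
}
```

Folding with a modulo keeps the job from failing when too few reducers are configured, at the cost of mixing carriers in one output file, so setting enough reduce tasks is still the cleaner choice.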

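sort--Sorting

  MapReduce jobs can be chained, and the framework sorts map output by its key (k2) by default. So once the data has been processed, you can add one more MapReduce job whose only purpose is sorting. 
 The object to be sorted must implement the WritableComparable interface and define the ordering in its compareTo method; using that object as k2 is enough to get the records sorted. 

  For example: 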

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SortStep {

	public static class SortMapper extends Mapper<LongWritable, Text, InfoBean, NullWritable>{

		// Reused across calls to avoid allocating one bean per record.
		private InfoBean k = new InfoBean();
		@Override
		protected void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {
			// Input line: account \t income \t expenses
			String line = value.toString();
			String[] fields = line.split("\t");
			k.set(fields[0], Double.parseDouble(fields[1]), Double.parseDouble(fields[2]));
			
			// The bean is the key (k2), so the framework sorts by its compareTo.
			context.write(k, NullWritable.get());			
		}		
	}
	public static class SortReducer extends Reducer<InfoBean, NullWritable, Text, InfoBean>{

		private Text k = new Text();
		@Override
		protected void reduce(InfoBean key, Iterable<NullWritable> values, Context context)
				throws IOException, InterruptedException {
			k.set(key.getAccount());
			// Write once per value so records that compare equal are not dropped.
			for (NullWritable ignored : values) {
				context.write(k, key);
			}
		}		
	}
}

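import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class InfoBean implements WritableComparable<InfoBean>{

	private String account;
	private double income;
	private double expenses;
	private double surplus;

	// Not shown in the original post; assumes surplus = income - expenses.
	public void set(String account, double income, double expenses) {
		this.account = account;
		this.income = income;
		this.expenses = expenses;
		this.surplus = income - expenses;
	}

	public String getAccount() { return account; }
	public double getIncome() { return income; }
	public double getExpenses() { return expenses; }
	
	@Override
	public void write(DataOutput out) throws IOException {
		out.writeUTF(account);
		out.writeDouble(income);
		out.writeDouble(expenses);
		out.writeDouble(surplus);		
	}

	// Fields must be read in exactly the order they were written.
	@Override
	public void readFields(DataInput in) throws IOException {
		this.account = in.readUTF();
		this.income = in.readDouble();
		this.expenses = in.readDouble();
		this.surplus = in.readDouble();
	}

	// Ascending by income, then by expenses; account breaks remaining ties so
	// that compareTo returns 0 only for equal beans, as its contract requires.
	@Override
	public int compareTo(InfoBean o) {
		int c = Double.compare(this.income, o.getIncome());
		if (c != 0) {
			return c;
		}
		c = Double.compare(this.expenses, o.getExpenses());
		if (c != 0) {
			return c;
		}
		return this.account.compareTo(o.getAccount());
	}

	@Override
	public String toString() {
		return  income + "\t" +	expenses + "\t" + surplus;
	}
}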
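To see what order such a compareTo produces, the same rule (income ascending, then expenses ascending, then account as a tie-breaker) can be applied to an in-memory list. This is a plain-Java sketch with no Hadoop classes; Row, SortOrderSketch and sortedAccounts are hypothetical names, Row is only a stand-in for InfoBean, and the rows are invented sample data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java sketch of the sort order; all names are hypothetical.
class SortOrderSketch {

    // Stand-in for InfoBean holding only the fields used for ordering.
    static final class Row {
        final String account;
        final double income;
        final double expenses;

        Row(String account, double income, double expenses) {
            this.account = account;
            this.income = income;
            this.expenses = expenses;
        }
    }

    // Income ascending, then expenses ascending, then account.
    static final Comparator<Row> ORDER =
            Comparator.comparingDouble((Row r) -> r.income)
                    .thenComparingDouble(r -> r.expenses)
                    .thenComparing(r -> r.account);

    // Returns the accounts in the order the rows sort into.
    static List<String> sortedAccounts(List<Row> rows) {
        List<Row> copy = new ArrayList<>(rows);
        copy.sort(ORDER);
        List<String> accounts = new ArrayList<>();
        for (Row r : copy) {
            accounts.add(r.account);
        }
        return accounts;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("b", 200.0, 50.0),
                new Row("a", 100.0, 80.0),
                new Row("c", 100.0, 20.0));
        System.out.println(sortedAccounts(rows));
    }
}
```

In the real job this comparison runs during the shuffle, so the reducer receives the beans already in order.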

combiner

 
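 Each map task may produce a large amount of output. The combiner merges that output on the map side first, which reduces the amount of data transferred to the reducers. Its main purpose is to trim the Mapper output and so cut network bandwidth and the load on the Reducers. 
 At its most basic, a combiner merges values for the same key locally; it acts like a local reduce. 
 Without a combiner, all aggregation is done by the reducers, which is comparatively inefficient. With a combiner, map tasks that finish first aggregate their output locally, which speeds the job up. Unlike the mapper and reducer, the combiner has no default implementation; it only takes effect when set explicitly on the job, for example with job.setCombinerClass(...). 

 Note: the combiner's output is the reducer's input, and the combiner is pluggable: the framework may run it zero, one, or several times. Adding a combiner must therefore never change the final result. Use a combiner only when the reduce input key/value types are exactly the same as its output key/value types and local merging does not affect the final result, such as sums or maximums. An average is a counterexample, because the average of partial averages is generally not the overall average. 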
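The rule that a combiner must not change the final result can be checked with a small experiment. The sketch below is plain Java and only simulates the idea: each inner list stands for the values one map task emits for a single key, and all class and method names are hypothetical. Summing survives local pre-aggregation, while averaging does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation of combining; all names are hypothetical.
class CombinerSafetySketch {

    static long sum(List<Long> values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    static double mean(List<Double> values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total / values.size();
    }

    // No combiner: the reducer sees every value from every map task.
    static long sumDirect(List<List<Long>> mapOutputs) {
        List<Long> all = new ArrayList<>();
        for (List<Long> out : mapOutputs) {
            all.addAll(out);
        }
        return sum(all);
    }

    // With combiner: each map task sends one partial sum to the reducer.
    static long sumCombined(List<List<Long>> mapOutputs) {
        List<Long> partials = new ArrayList<>();
        for (List<Long> out : mapOutputs) {
            partials.add(sum(out));
        }
        return sum(partials);
    }

    static double meanDirect(List<List<Double>> mapOutputs) {
        List<Double> all = new ArrayList<>();
        for (List<Double> out : mapOutputs) {
            all.addAll(out);
        }
        return mean(all);
    }

    // Reusing the reduce logic as a combiner gives a mean of partial means.
    static double meanCombined(List<List<Double>> mapOutputs) {
        List<Double> partials = new ArrayList<>();
        for (List<Double> out : mapOutputs) {
            partials.add(mean(out));
        }
        return mean(partials);
    }

    public static void main(String[] args) {
        List<List<Long>> longs = Arrays.asList(Arrays.asList(1L, 2L, 3L), Arrays.asList(10L));
        List<List<Double>> doubles = Arrays.asList(Arrays.asList(1.0, 2.0, 3.0), Arrays.asList(10.0));
        System.out.println(sumDirect(longs) + " vs " + sumCombined(longs));
        System.out.println(meanDirect(doubles) + " vs " + meanCombined(doubles));
    }
}
```

A common way around the averaging problem is to have the combiner pass along a (sum, count) pair and leave the division to the reducer.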

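Data format transformation (in Hadoop the combiner must emit the same key/value types as the map output, since its output is fed to the reducer):

map: (K1, V1) → list(K2, V2)
combine: (K2, list(V2)) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)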
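The flow of pairs through map, combine and reduce can be traced with a tiny in-memory word count. This is a plain-Java sketch rather than Hadoop code: each input line stands for one map task, the combiner emits the same (word, count) pair type that map emits, and every name here is hypothetical. It also counts how many pairs would cross the network with and without the combiner.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory word count tracing map -> combine -> reduce; names are hypothetical.
class WordCountFlowSketch {

    // map: one line in, a list of (word, 1) pairs out.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Summing per key serves as both the combine and the reduce step here.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Collects the pairs that would be sent from the map side to the reducers.
    static List<Map.Entry<String, Integer>> shuffle(List<String> lines, boolean useCombiner) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (String line : lines) { // each line stands for one map task
            List<Map.Entry<String, Integer>> mapOut = map(line);
            if (useCombiner) {
                shuffled.addAll(sumByKey(mapOut).entrySet());
            } else {
                shuffled.addAll(mapOut);
            }
        }
        return shuffled;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a a b", "a b b");
        System.out.println("pairs without combiner: " + shuffle(lines, false).size());
        System.out.println("pairs with combiner: " + shuffle(lines, true).size());
        System.out.println("result: " + sumByKey(shuffle(lines, true)));
    }
}
```

The saving grows with the number of repeated keys per map task; when keys rarely repeat within a task, a combiner adds work without reducing traffic much.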
