Hadoop: Learning the Combiner and the Partitioner

Latest recommended article published 2022-10-07 08:33:38

gongrui_59

Category: Hadoop learning. Tags: hadoop, combiner

Article link: https://blog.csdn.net/gongrui_59/article/details/75567518

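1. What are Combiners?
The combine step is optional; it only runs if we configure it ourselves.
Each map task can produce a large amount of output. The combiner merges that output once on the map side, which reduces the amount of data transferred to the reducers. At its simplest, a combiner merges values that share a key locally, so it behaves like a local reduce. Without a combiner, all of the aggregation is done by the reducers, which is comparatively inefficient. With a combiner, map tasks that finish first aggregate their output locally, which speeds up the job.
A single line is enough to enable the combiner:
job.setCombinerClass(CiteReduce.class);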

The combine phase runs between the end of the Mapper and the start of the Reducer.
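To make the local merge concrete, here is a minimal plain-Java sketch of what a combiner does to the output of one map task. It deliberately uses no Hadoop classes; the class name `CombinerSketch` and its helper methods are hypothetical and exist only for this illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: simulates map output and a combiner with plain collections.
class CombinerSketch {

    // Simulates a map task that emits one (key, 1) pair per input key.
    static List<Map.Entry<String, Integer>> mapOutput(List<String> keys) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String k : keys) {
            out.add(new AbstractMap.SimpleEntry<>(k, 1));
        }
        return out;
    }

    // Simulates the combiner: sums the values of identical keys
    // within the output of ONE map task.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> merged = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            merged.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs =
                mapOutput(Arrays.asList("a", "b", "a", "a", "b"));
        Map<String, Integer> combined = combine(pairs);
        System.out.println("records before combine: " + pairs.size());   // 5
        System.out.println("records after combine:  " + combined.size()); // 2
        System.out.println(combined);                                     // {a=3, b=2}
    }
}
```

Five map output records shrink to two before anything would have to cross the network, which is the saving the combiner is meant to provide.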

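Let's look at the combiner through a deduplication example. For the first run, without the combiner, comment out job.setCombinerClass(CiteReduce.class); (the combiner here reuses the same class as the reducer).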

package com.yc.hadoop.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueLineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;


public class CiteDemo{
	public static void main(String[] args) throws Exception {
		if(args.length < 2){
			throw new RuntimeException("Wrong number of arguments: at least two are required");
		}
		Configuration conf = new Configuration();
		
		// Split each KeyValue input line into key and value at ","
		conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, ",");
		Job job = Job.getInstance(conf,"CiteDemo");
		job.setJarByClass(CiteDemo.class);
		
		job.setInputFormatClass(KeyValueTextInputFormat.class);
				
		job.setMapperClass(CiteMapper.class);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
		
		//job.setCombinerClass(CiteReduce.class); // local merge on the map side
		job.setReducerClass(CiteReduce.class);  // cluster-wide merge on the reduce side
		
		// Input paths: every argument except the last one
		Path[] inPaths = new Path[args.length-1];
		for (int i = 0; i < inPaths.length; i++) {
			inPaths[i] = new Path(args[i]);
		}
		
		// Output path: the last argument; delete it first if it already exists
		Path outPath = new Path(args[args.length -1]);
		FileSystem fs = outPath.getFileSystem(conf);
		if(fs.exists(outPath)){
			fs.delete(outPath, true);
		}
		
		FileInputFormat.setInputPaths(job, inPaths);
		FileOutputFormat.setOutputPath(job, outPath);
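		// Stop here if the first job fails, because the second job reads its output
		if (!job.waitForCompletion(true)) {
			System.exit(1);
		}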
		
		
		conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, "\t");
		job = Job.getInstance(conf,"CiteDemo");
		job.setJarByClass(CiteDemo.class);
		job.setInputFormatClass(KeyValueTextInputFormat.class);
		
		job.setMapperClass(CiteMapper01.class);
		job.setMapOutputKeyClass(IntWritable.class);
		job.setMapOutputValueClass(Text.class);
		
		job.setPartitionerClass(MyPartitioner.class);
		job.setNumReduceTasks(2);
		
		FileInputFormat.setInputPaths(job, outPath);
		
		outPath = new Path(args[args.length -1] + "02");
		if(fs.exists(outPath)){
			fs.delete(outPath, true);
		}
		FileOutputFormat.setOutputPath(job, outPath);
		
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
	
}

package com.yc.hadoop.mapreduce;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CiteMapper extends Mapper<Text, Text, Text, IntWritable> {
	
	public static final IntWritable ONE =new IntWritable(1);
	
	@Override
	protected void map(Text key, Text value, Mapper<Text, Text, Text, IntWritable>.Context context)
			throws IOException, InterruptedException {
		
		System.out.println("key:" + key + "<==>" + "value:" + value);
		context.write(key, ONE);
		
	}

}

package com.yc.hadoop.mapreduce;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CiteReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
	
	@Override
	protected void reduce(Text key, Iterable<IntWritable> values,
			Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
		
		int count = 0;
		StringBuilder countStr = new StringBuilder("{");
		for (IntWritable v : values) {
			count += v.get();
			countStr.append(v.get() + ",");
		}
		countStr.replace(countStr.lastIndexOf(","), countStr.lastIndexOf(",") + 1, "}");
		System.out.println("-------------> key:" + key + ", value:" + countStr);
		
		context.write(key, new IntWritable(count));
	}

}

package com.yc.hadoop.mapreduce;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartitioner extends Partitioner<IntWritable ,Text > {
	@Override
	public int getPartition(IntWritable key, Text value, int numPartitions) {
		System.out.println("***************>  key:" + key + ", value:" + value);
		if(key.get() < 5 ){
			return 0;
		}else{
			return 1;
		}
	}
}

After running the job, the counters show that Combine input records and Combine output records are both 0, while Reduce input records = 25 and Reduce output records = 25: all of the data is handed to the reducer. Now we enable the combiner.

Uncomment job.setCombinerClass(CiteReduce.class); and run the job again.

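The result this time:
Combine input records=58, Combine output records=26, Reduce input records=26. Because the combine phase sits between the end of the Mapper and the start of the Reducer, the data the combiner processes is exactly the data the reducer would have received had no combiner been set, hence 58. The combiner's output then becomes the reducer's input, so Reduce input records drops from 58 to 26.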

This confirms the earlier point: the combiner merges each map task's output locally by key, like a local reduce, so less data has to be transferred to the reducers and the job runs faster.

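Why use a Combiner?

Answer: The Combiner runs on the map side and pre-aggregates the data there. Less data is sent to the reduce side, transfer time drops, and the job as a whole finishes sooner.
Why is the Combiner an optional step rather than a standard part of every MR job?
Answer: Because not every algorithm is suited to a Combiner. Computing an average is one example: the average of per-map averages is generally not the overall average.
The Combiner already performs a reduce operation, so why is a reduce still needed in the Reducer phase?
Answer: The combiner runs on the map side and can only process the data of a single map task; it cannot work across map tasks. Only the reducer receives the data produced by multiple map tasks.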
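The average example from the answers above can be checked with a few lines of plain Java. This is only a sketch under stated assumptions: the class `AverageCombinerSketch`, its methods, and the sample numbers are hypothetical, and each inner list stands in for the output of one map task. It contrasts averaging locally (wrong) with the common workaround of having the map side pass along partial (sum, count) pairs and dividing once at the end.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: each inner list plays the role of one map task's output.
class AverageCombinerSketch {

    static double average(List<Integer> values) {
        long sum = 0;
        for (int v : values) {
            sum += v;
        }
        return (double) sum / values.size();
    }

    // Unsafe "combiner": averages each map task locally, then averages the averages.
    static double averageOfAverages(List<List<Integer>> mapOutputs) {
        double total = 0;
        for (List<Integer> one : mapOutputs) {
            total += average(one);
        }
        return total / mapOutputs.size();
    }

    // Safe variant: each map task contributes a partial (sum, count);
    // the final step adds the partials and divides exactly once.
    static double averageFromPartials(List<List<Integer>> mapOutputs) {
        long sum = 0;
        long count = 0;
        for (List<Integer> one : mapOutputs) {
            long partialSum = 0;
            for (int v : one) {
                partialSum += v;
            }
            sum += partialSum;
            count += one.size();
        }
        return (double) sum / count;
    }

    public static void main(String[] args) {
        List<List<Integer>> mapOutputs = Arrays.asList(
                Arrays.asList(1, 2, 3),   // map task 1
                Arrays.asList(10));       // map task 2
        System.out.println(averageOfAverages(mapOutputs));   // 6.0 (wrong)
        System.out.println(averageFromPartials(mapOutputs)); // 4.0 (correct: 16 / 4)
    }
}
```

Sums and counts can be merged in any grouping without changing the result, which is why a counting job like the one in this article can reuse its reducer as the combiner while an averaging job cannot.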

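1. What is partitioning?
As we saw earlier, the key/value pairs <key, value> that the Mapper finally emits have to be sent to the Reducers to be merged, and pairs with the same key are sent to the same Reducer. Which key goes to which Reducer is decided by the Partitioner. Some clustered applications, distributed cache clusters for example, mostly rely on a hash function to spread the cached data evenly, and Hadoop does the same.
Partitioning also helps us make better use of the data: records are split into partitions according to their attributes, and each partition can then be handled by a different task.

First, Hadoop's built-in partitioner:
A MapReduce user usually specifies the number of reduce tasks, which is also the number of reduce output files (R). A partition function is applied to the intermediate keys to split the data before it is handed to the following tasks. The default partition function uses hashing (the common form is hash(key) mod R). Hashing usually produces well-balanced partitions, so Hadoop ships with a default partitioner class, HashPartitioner. It extends the Partitioner class and provides a getPartition method, defined as follows: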

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 * http://www.apache.org/licenses/LICENSE-2.0
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hadoop.mapreduce.lib.partition;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.mapreduce.Partitioner;

/** Partition keys by their {@link Object#hashCode()}. */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class HashPartitioner<K, V> extends Partitioner<K, V> {

	/** Use {@link Object#hashCode()} to partition. */
	public int getPartition(K key, V value,
			int numReduceTasks) {
		return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
	}

}

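The key line here is (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
Its purpose is to spread the keys roughly evenly across the reduce tasks. For example, if the key is a Text, its hashCode method works much like String's: both use Horner's rule to compute an int. For a long string that int can overflow and become negative, so it is ANDed with the largest int value, Integer.MAX_VALUE (0x7FFFFFFF, a 0 sign bit followed by 31 one bits), which clears the sign bit. The result is then taken modulo the number of reduce tasks, which always yields a valid partition number between 0 and numReduceTasks - 1.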
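The effect of the mask is easy to see in isolation. The sketch below is plain Java with no Hadoop dependency; the class `HashPartitionSketch` and its two methods are hypothetical, and Integer keys are used only because their hashCode equals their value, which makes the results easy to predict.

```java
// Illustration only: compares a naive modulo with the masked form used by HashPartitioner.
class HashPartitionSketch {

    // Naive version: a negative hashCode produces a negative (invalid) partition number.
    static int naivePartition(Object key, int numReduceTasks) {
        return key.hashCode() % numReduceTasks;
    }

    // Same expression as HashPartitioner.getPartition: clear the sign bit, then take the remainder.
    static int maskedPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Integer.hashCode() returns the int value itself, so this key hashes to -7.
        System.out.println(naivePartition(-7, 3));                 // -1
        System.out.println(maskedPartition(-7, 3));                // 1
        // Integer.MIN_VALUE is the case where Math.abs would still return a negative number.
        System.out.println(maskedPartition(Integer.MIN_VALUE, 3)); // 0
    }
}
```

Masking rather than calling Math.abs matters for that last case: Math.abs(Integer.MIN_VALUE) is still Integer.MIN_VALUE, whereas the AND always leaves a non-negative value.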

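In most cases the default partition function, HashPartitioner, is what we use. Sometimes, though, an application has special requirements, and we need a custom Partitioner to do the job.
We implement one by overriding the getPartition() method.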

package com.yc.hadoop.mapreduce;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartitioner extends Partitioner<IntWritable ,Text > {
	@Override
	public int getPartition(IntWritable key, Text value, int numPartitions) {
		System.out.println("***************> key:" + key + ", value:" + value);
		if(key.get() < 5 ){
			return 0;
		}else{
			return 1;