Hadoop Basics Review (Implementing a Combiner and a Partitioner)


TJU_ZH


Article link: https://blog.csdn.net/beautiful_girl_love/article/details/88133964


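First, the Combiner. Take word count as the MapReduce example: we are given a document and want to count how often each word occurs. A Hadoop block is 128 MB; suppose an entire block is filled with words. The map task for that block then emits a very large number of key-value pairs, and sending all of them to the reducers puts heavy pressure on network bandwidth. Hadoop therefore lets the output of a map task go through a local reduce step before it is sent on, as in the figure below (hand-drawn, so rather ugly):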

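So to implement a Combiner, you only need to specify the reduce class to use as the combiner. After this local reduce, far less data has to be transferred, which makes it one of Hadoop's strategies for reducing load. It does not suit every computation, though. Take mean temperature, with records such as <1950,15> <1950,25> <1962,10> <1962,30> <1962,20>: if the values 15, 25 and 10, 20, 30 arrive from two different map tasks, a mean combiner produces mean(mean(15, 25), mean(10, 20, 30)) rather than mean(10, 20, 30, 15, 25). These particular values happen to give 20 either way, but in general a mean of means is not the overall mean: mean(mean(0, 20, 10), mean(25, 15)) = 15, whereas mean(0, 20, 10, 25, 15) = 14 (see Hadoop: The Definitive Guide). I won't post the code here; just add one call to setCombinerClass() in the driver class.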
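To make the "not every computation" point concrete, here is a minimal plain-Java sketch (no Hadoop dependency) that simulates two map tasks holding values for the same key. The class name `CombinerSafetyDemo`, the helper methods, and the way the values are split across the two tasks are assumptions made for illustration; the numbers are the ones from the Definitive Guide example above.

```java
import java.util.Arrays;
import java.util.List;

public class CombinerSafetyDemo {
    // What a sum-style reducer (e.g. word count) computes for one key.
    static double sum(List<Double> values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    // What a mean-style reducer computes for one key.
    static double mean(List<Double> values) {
        return sum(values) / values.size();
    }

    public static void main(String[] args) {
        // Hypothetical split of one key's values across two map tasks.
        List<Double> map1 = Arrays.asList(0.0, 20.0, 10.0);
        List<Double> map2 = Arrays.asList(25.0, 15.0);
        List<Double> all = Arrays.asList(0.0, 20.0, 10.0, 25.0, 15.0);

        // Sum: combining per map task first gives the same final answer.
        double sumDirect = sum(all);
        double sumCombined = sum(Arrays.asList(sum(map1), sum(map2)));
        System.out.println("sum  direct=" + sumDirect + " combined=" + sumCombined);

        // Mean: combining per map task first changes the final answer.
        double meanDirect = mean(all);
        double meanCombined = mean(Arrays.asList(mean(map1), mean(map2)));
        System.out.println("mean direct=" + meanDirect + " combined=" + meanCombined);
    }
}
```

Summation survives the extra local pass because it is associative and commutative, which is why a word-count reducer can usually double as the combiner. A mean would need to carry a (sum, count) pair through the combiner instead.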

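Now the Partitioner. It exists for requirements like this one: once all the statistics have been computed, say what each province spends its money on, the results need to be split by province, with each province written to its own file. So during the analysis, records for the same province (that is, records with the same key) are sent to the same partition, and each partition is then processed by a separate reduce task. The implementation is as follows: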

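import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

class MyMapper1 extends Mapper<LongWritable, Text, Text,IntWritable>{
	/**
	 * map() is called once per input line, so it only has to handle a single record.
	 * Each line is expected to be "key<TAB>integer".
	 * */
	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		String[] strs = value.toString().split("\t");
		context.write(new Text(strs[0]), new IntWritable(Integer.parseInt(strs[1])));
	}
}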
/**
 * Adds up all the values that share the same key.
 * */
class MyReducer1 extends Reducer<Text, IntWritable, Text, IntWritable>{
	@Override
	protected void reduce(Text key, Iterable<IntWritable> value,Context context) throws IOException, InterruptedException {
		int sum = 0;
		for(IntWritable val : value) {
			sum += val.get();
		}
		context.write(key, new IntWritable(sum));
	}
}
/**
 * Sends keys of the same category to the same partition, and so to the same output file.
 * */
class MyPartitioner extends Partitioner<Text, IntWritable>{
	/**
	 * reducercount is the number of reduce tasks the map output is distributed to.
	 * The fixed indices 0-3 below assume the job runs with 4 reduce tasks.
	 * */
	@Override
	public int getPartition(Text key, IntWritable value, int reducercount) {
		if(key.toString().equals("xiaomi"))
			return 0;
		if(key.toString().equals("huawei"))
			return 1;
		if(key.toString().equals("iphone"))
			return 2;
		return 3;
	}
}

/**
 * Driver for the custom Partitioner.
 * The idea is to send records with the same key to the same output file.
 * As before: map first, then reduce, plus the partitioner.
 * */
public class MyPartionier {
	public static void main(String[]args) throws IOException, ClassNotFoundException, InterruptedException {
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);
		// Set the class used to locate the job jar
		job.setJarByClass(MyPartionier.class);
		
		// Set the Mapper and Reducer
		job.setMapperClass(MyMapper1.class);
		job.setReducerClass(MyReducer1.class);
		
		// Set the key and value types of the map output
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
		
		// Set the key and value types of the reduce output
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		
		// Set the partitioner and the number of reduce tasks
		// (4, matching the partition indices 0-3 returned by MyPartitioner)
		job.setPartitionerClass(MyPartitioner.class);
		job.setNumReduceTasks(4);
		
		// Set the input and output paths
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		
		System.exit(job.waitForCompletion(true)?0:1);
	}
}
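For comparison, the sketch below puts the brand-based rule from MyPartitioner next to the hash-based rule that Hadoop's default HashPartitioner applies, namely (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. It is plain Java with no Hadoop dependency, so it is only an approximation: it uses String keys, whose hashCode() differs from that of Text, and the class name `PartitionSketch`, both method names, and the extra key "oppo" are assumptions made for illustration.

```java
public class PartitionSketch {
    // Mirrors the formula of Hadoop's default HashPartitioner.
    // Assumption: String.hashCode() stands in for Text.hashCode(),
    // so the concrete partition numbers will differ from a real job.
    static int hashPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Same brand-to-partition rule as MyPartitioner in the post.
    static int brandPartition(String key) {
        switch (key) {
            case "xiaomi": return 0;
            case "huawei": return 1;
            case "iphone": return 2;
            default:       return 3; // every other key shares the last partition
        }
    }

    public static void main(String[] args) {
        int numReduceTasks = 4; // must cover every index brandPartition can return
        String[] keys = {"xiaomi", "huawei", "iphone", "oppo"}; // "oppo" is a made-up extra key
        for (String key : keys) {
            System.out.println(key
                    + " -> custom partition " + brandPartition(key)
                    + ", hash partition " + hashPartition(key, numReduceTasks));
        }
    }
}
```

With the custom rule the placement is predictable: in a job like the one above, each brand should end up in its own output file (part-r-00000 through part-r-00003), whereas the hash rule spreads keys evenly but with no control over which key lands where. Whichever rule is used, the returned index has to stay below the number of reduce tasks.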
