Analysis: how to implement a Top N problem, similar to WordCount, with MapReduce
Data source:
1 A 10
2 A 40
3 B 30
4 C 20
5 B 10
6 D 40
7 A 30
8 C 20
9 B 10
10 D 40
11 C 30
12 D 20
Key difficulties:
(1) Advanced use of the TreeSet approach on the Reduce side
(2) Iterating over the Iterable of values inside reduce
Extension: the values on the Reduce side can only be traversed once
A relatively simple approach is to use the built-in TreeMap or TreeSet. Both are backed by red-black trees and keep their keys in sorted order internally, but every insertion of a new element costs more than a heap adjustment would. To find the Top N largest elements with a heap, you build a min-heap: its root is always the smallest element, so the heap never needs to be fully sorted; whenever the root is replaced by a new element, the heap is re-heapified to restore the min-heap property.
If no comparator is supplied, TreeMap sorts by key in ascending order by default.
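For contrast with the TreeSet approach used in the job below, the min-heap idea can be sketched with java.util.PriorityQueue, which is a min-heap under natural ordering. This is a minimal standalone sketch, not part of the original code; the class and method names are illustrative.

import java.util.PriorityQueue;

public class TopNHeapSketch {
    // Keep the N largest values seen so far in a min-heap: the root is always
    // the smallest current candidate, so a new value only has to beat the root.
    static PriorityQueue<Long> topN(long[] values, int n) {
        PriorityQueue<Long> heap = new PriorityQueue<Long>(); // natural order = min-heap
        for (long v : values) {
            if (heap.size() < n) {
                heap.offer(v);
            } else if (v > heap.peek()) {
                heap.poll();   // drop the current minimum
                heap.offer(v); // sift the new candidate into place
            }
        }
        return heap;
    }

    public static void main(String[] args) {
        // Values of key D from the sample data: all three survive,
        // including the duplicate 40s.
        System.out.println(topN(new long[]{40, 40, 20}, 3));
    }
}

Unlike a TreeSet, the PriorityQueue keeps duplicate values, which matters for the sample data above (B and D both contain repeated numbers that a TreeSet would collapse into one). The full MapReduce implementation using a TreeSet on the Reduce side follows.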
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.TreeSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TopN implements Tool {

    public static class mapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line is tab-separated: <id> <letter> <number>; emit (letter, number).
            String[] strings = value.toString().split("\t");
            context.write(new Text(strings[1].trim()),
                    new LongWritable(Long.parseLong(strings[2].trim())));
        }
    }

    public static class reduce extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        public void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            // TreeSet keeps its elements in ascending order (and drops duplicates).
            // Trim the smallest element whenever the set grows past 3, so that only
            // the top 3 values for this key survive.
            TreeSet<Long> tSet = new TreeSet<Long>();
            for (LongWritable value : values) {
                tSet.add(value.get());
                if (tSet.size() > 3) {
                    tSet.remove(tSet.first());
                }
            }
            for (Long num : tSet) {
                context.write(key, new LongWritable(num));
            }
        }
    }

    static String input = "";
    static String output = "";

    public int run(String[] str) throws IOException, URISyntaxException,
            ClassNotFoundException, InterruptedException {
        input = str[0];
        output = str[1];

        Configuration conf = new Configuration();
        // Delete the output directory if it already exists, otherwise the job fails.
        FileSystem file = FileSystem.get(new URI(input), conf);
        Path outPath = new Path(output);
        if (file.exists(outPath)) {
            file.delete(outPath, true);
        }

        Job job = Job.getInstance(conf);
        job.setJarByClass(TopN.class);

        FileInputFormat.setInputPaths(job, input);
        job.setInputFormatClass(TextInputFormat.class);

        job.setMapperClass(mapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        job.setPartitionerClass(HashPartitioner.class);
        job.setNumReduceTasks(4);

        job.setReducerClass(reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        FileOutputFormat.setOutputPath(job, outPath);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Submits the job (if it has not been submitted yet) and waits for it to finish.
        job.waitForCompletion(true);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new TopN(), args);
    }

    public Configuration getConf() {
        return null;
    }

    public void setConf(Configuration arg0) {
    }
}
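To run the job, package the class into a jar (the jar name below is only an example) and pass the input and output paths on the command line, e.g. hadoop jar topn.jar TopN /topn/input /topn/output. ToolRunner strips any generic Hadoop options and hands the remaining two arguments to run() as str[0] and str[1].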
Extension: the Reduce-side Iterable can only be iterated once (a single forward pass)
Although the loop in reduce runs over many values, only two objects are involved for the key and the value: the framework reuses these same two objects and refills them on each iteration (a rationale somewhat like the one behind String being an immutable object). So if you need to keep a key or a value, you must either copy its contents out and store them elsewhere, or clone it into a new object, as in the following example:
// Old-API reducer (org.apache.hadoop.mapred): values is an Iterator that can
// only be traversed once, so copies must be cached for a second pass.
public void reduce(Text host, Iterator<CrawlDatum> values,
        OutputCollector<Text, CrawlDatum> output, Reporter reporter) throws IOException {
    List<CrawlDatum> cache = new LinkedList<CrawlDatum>();
    // first loop: process each value and cache a copy of it
    while (values.hasNext()) {
        CrawlDatum datum = values.next();
        doSomethingWithValue(datum);
        // datum is reused by the framework, so store a copy, not the reference
        CrawlDatum copy = new CrawlDatum();
        copy.set(datum);
        cache.add(copy);
    }
    // second loop over the cached copies
    for (CrawlDatum value : cache) {
        doSomethingElseThatCantBeDoneInFirstLoop(value);
    }
}
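The example above uses the old org.apache.hadoop.mapred API. With the newer mapreduce API used by the TopN job above, the same copy-before-cache pattern might look like the following sketch; the class name CachingReducer is illustrative, and the key/value types simply mirror the TopN reducer.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CachingReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        List<LongWritable> cache = new ArrayList<LongWritable>();
        for (LongWritable value : values) {
            // 'value' is the same object on every iteration, refilled by the
            // framework, so copy its contents into a fresh Writable before caching.
            cache.add(new LongWritable(value.get()));
        }
        // The cached copies can now be traversed as many times as needed.
        for (LongWritable copy : cache) {
            context.write(key, copy);
        }
    }
}

The important detail is new LongWritable(value.get()): caching value itself would leave the list full of references to one reused object that holds only the last value seen.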
Reference blogs:
http://blog.csdn.net/zeb_perfect/article/details/53335207
On the single-pass (one-way) iteration of the Reduce-side Iterable:
http://www.wangzhe.tech/MapReduce/MapReduce%E4%B8%ADreduce%E9%98%B6%E6%AE%B5iterator%E5%A6%82%E4%BD%95%E9%81%8D%E5%8E%86%E4%B8%A4%E9%81%8D%E5%92%8C%E6%89%80%E9%81%87%E5%88%B0%E7%9A%84%E9%97%AE%E9%A2%98/2016/07/13/