MapReduce: An Efficient TopN Example That Leverages the Built-in Merge Sort
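1. MapReduce Sorting
- MapReduce's internal mechanics mean that partitioning and sorting run through the entire job.
- Each map task writes its key/value output into a circular buffer,
- each map task spills the data from the circular buffer into temporary files,
- the temporary files of the same map task are merged into a single file,
- and the reduce side pulls records from the files produced by the map tasks and groups them by key.
By default, all of these sorts order records by key in ascending order, using the key type's own comparison. For string (Text) keys this is lexicographic order, which for ASCII text is the order of the ASCII table, not numeric order.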
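The difference between lexicographic and numeric key order is easy to overlook when the keys are movie ids stored as strings. The following is a minimal plain-Java sketch, not Hadoop code: the class `KeyOrderDemo` and its methods are hypothetical names, and `String.compareTo` is used as a stand-in for the comparison applied to text keys (the two agree for ASCII input such as these ids).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo class: compares text ordering with numeric ordering of the same keys.
class KeyOrderDemo {

    // Sort keys as text: character by character, so "1193" comes before "48".
    static List<String> sortAsText(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        return copy;
    }

    // Sort the same keys by their numeric value instead.
    static List<String> sortAsNumbers(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        copy.sort(Comparator.comparingInt(Integer::parseInt));
        return copy;
    }

    public static void main(String[] args) {
        List<String> movieIds = Arrays.asList("1193", "661", "914", "48", "1");
        System.out.println(sortAsText(movieIds));    // [1, 1193, 48, 661, 914]
        System.out.println(sortAsNumbers(movieIds)); // [1, 48, 661, 914, 1193]
    }
}
```

If numeric order matters for the output, the key would have to be a numeric type or carry its own comparison logic rather than be emitted as text.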
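2. Questions
- How should the data in a very large distributed file be sorted?
- What is wrong with the conventional MapReduce approach?
- Which sorting approaches are available?
- How can MapReduce's internal sorting mechanism be used to solve this sorting problem, and thereby improve processing efficiency and speed?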
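One common way to lean on the framework's sort, sketched here only as an illustration of the idea behind the last question, is to put every field that matters for ordering into the key itself. The plain-Java sketch below uses a hypothetical `MovieRateKey` that sorts by movie id ascending and then by rate descending; `Collections.sort` stands in for the sort the framework would perform. In a real job the key would presumably implement Hadoop's `WritableComparable`, and records would need to be partitioned and grouped by movie id; none of that is shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical demo: a composite key whose ordering does the TopN work.
class CompositeKeyDemo {

    static final class MovieRateKey implements Comparable<MovieRateKey> {
        final String movie;
        final double rate;

        MovieRateKey(String movie, double rate) {
            this.movie = movie;
            this.rate = rate;
        }

        @Override
        public int compareTo(MovieRateKey other) {
            int byMovie = movie.compareTo(other.movie); // movie id ascending (as text)
            if (byMovie != 0) {
                return byMovie;
            }
            return Double.compare(other.rate, rate);    // rate descending
        }

        @Override
        public String toString() {
            return movie + ":" + rate;
        }
    }

    // After sorting, the first n keys of each movie are its top n; nothing is buffered per movie.
    static List<String> topNPerMovie(List<MovieRateKey> keys, int n) {
        List<MovieRateKey> sorted = new ArrayList<>(keys);
        Collections.sort(sorted); // stand-in for the framework's sort
        List<String> result = new ArrayList<>();
        String currentMovie = null;
        int taken = 0;
        for (MovieRateKey key : sorted) {
            if (!key.movie.equals(currentMovie)) {
                currentMovie = key.movie;
                taken = 0;
            }
            if (taken < n) {
                result.add(key.toString());
                taken++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<MovieRateKey> keys = Arrays.asList(
                new MovieRateKey("1193", 5), new MovieRateKey("1193", 3),
                new MovieRateKey("1193", 4), new MovieRateKey("1193", 2),
                new MovieRateKey("661", 3), new MovieRateKey("661", 4));
        System.out.println(topNPerMovie(keys, 3));
        // [1193:5.0, 1193:4.0, 1193:3.0, 661:4.0, 661:3.0]
    }
}
```

The point of the design is that the consumer only counts records as they stream past in sorted order, so its memory use does not depend on how many ratings a movie has.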
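3. Environment Setup
- Prepare a raw movie-rating data set. In practice, you can use a for loop to create many JavaBean objects, assign different values to their properties, convert each bean to a string with fastjson, and write the strings to a text file line by line. The real file is a dozen or so megabytes; only a small excerpt is reproduced here.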
{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}
{"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}
{"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}
{"movie":"2804","rate":"5","timeStamp":"978300719","uid":"1"}
{"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"}
{"movie":"919","rate":"4","timeStamp":"978301368","uid":"1"}
{"movie":"595","rate":"5","timeStamp":"978824268","uid":"1"}
{"movie":"938","rate":"4","timeStamp":"978301752","uid":"1"}
{"movie":"2398","rate":"4","timeStamp":"978302281","uid":"1"}
{"movie":"2918","rate":"4","timeStamp":"978302124","uid":"1"}
{"movie":"1035","rate":"5","timeStamp":"978301753","uid":"1"}
{"movie":"2791","rate":"4","timeStamp":"978302188","uid":"1"}
{"movie":"2687","rate":"3","timeStamp":"978824268","uid":"1"}
{"movie":"2018","rate":"4","timeStamp":"978301777","uid":"1"}
{"movie":"3105","rate":"5","timeStamp":"978301713","uid":"1"}
{"movie":"2797","rate":"4","timeStamp":"978302039","uid":"1"}
{"movie":"2321","rate":"","timeStamp":"978302205","uid":"1"}
{"movie":"720","rate":"3","timeStamp":"978300760","uid":"1"}
{"movie":"1270","rate":"5","timeStamp":"978300055","uid":"1"}
{"movie":"527","rate":"5","timeStamp":"978824195","uid":"1"}
{"movie":"2340","rate":"3","timeStamp":"978300103","uid":"1"}
{"movie":"48","rate":"5","timeStamp":"978824351","uid":"1"}
{"movie":"1097","rate":"4","timeStamp":"978301953","uid":"1"}
{"movie":"1721","rate":"4","timeStamp":"978300055","uid":"1"}
{"movie":"2321","rate":"3","timeStamp":"978302205","uid":"1"}
{"movie":"1545","raete":"4","timeStamp":"978824139","uid":"1"}
{"movie":"745","rate":"3","timeStamp":"978824268","uid":"1"}
{"movie":"2294","rate":"4","timeStamp":"978824291","uid":"1"}
{"movie":"3186","rate":"4","timeStamp":"978300019","uid":"1"}
{"movie":"1566","rate":"a","timeStamp":"978824330","uid":"1"}
{"movie":"588","rate":"4","timeStamp":"978824268","uid":"1"}
{"movie":"1907","rate":"4","timeStamp":"978824330","uid":"1"}
{"movie":"783","rate":"4","timeStamp":"978824291","uid":"1"}
{"movie":"1836","rate":"5","timeStamp":"978300172","uid":"1"}
{"movie":"1022","rate":"5","timeStamp":"978300055","uid":"1"}
{"movie":"2762","rate":"4","timeStamp":"978302091","uid":"1"}
{"movie":"150","rate":"5","timeStamp":"978301777","uid":"1"}
{"movie":"1","rate":"5","timeStamp":"978824268","uid":"1"}
{"movie":"1961","rate":"5","timeStamp":"978301590","uid":"1"}
{"movie":"1962","rate":"4","timeStamp":"978301753","uid":"1"}
{"movie":"2692","rate":"4","timeStamp":"978301570","uid":"1"}
{"movie":"260","rate":"4","timeStamp":"978300760","uid":"1"}
{"movie":"1028","rate":"5","timeStamp":"978301777","uid":"1"}
{"movie":"1029","rate":"5","timeStamp":"978302205","uid":"1"}
{"movie":"1207","rate":"4","timeStamp":"978300719","uid":"1"}
{"movie":"2028","rate":"5","timeStamp":"978301619","uid":"1"}
{"movie":"531","rate":"4","timeStamp":"978302149","uid":"1"}
{"movie":"3114","rate":"4","timeStamp":"978302174","uid":"1"}
{"movie":"608","rate":"4","timeStamp":"978301398","uid":"1"}
{"movie":"1246","rate":"4","timeStamp":"978302091","uid":"1"}
{"movie":"1357","rate":"5","timeStamp":"978298709","uid":"2"}
{"movie":"3068","rate":"4","timeStamp":"978299000","uid":"2"}
{"movie":"1537","rate":"4","timeStamp":"978299620","uid":"2"}
{"movie":"647","rate":"3","timeStamp":"978299351","uid":"2"}
{"movie":"2194","rate":"4","timeStamp":"978299297","uid":"2"}
{"movie":"648","rate":"4","timeStamp":"978299913","uid":"2"}
{"movie":"2268","rate":"5","timeStamp":"978299297","uid":"2"}
{"movie":"2628","rate":"3","timeStamp":"978300051","uid":"2"}
{"movie":"1103","rate":"3","timeStamp":"978298905","uid":"2"}
{"movie":"2916","rate":"3","timeStamp":"978299809","uid":"2"}
{"movie":"3468","rate":"5","timeStamp":"978298542","uid":"2"}
{"movie":"1210","rate":"4","timeStamp":"978298151","uid":"2"}
{"movie":"1792","rate":"3","timeStamp":"978299941","uid":"2"}
{"movie":"1687","rate":"3","timeStamp":"978300174","uid":"2"}
{"movie":"1213","rate":"2","timeStamp":"978298458","uid":"2"}
{"movie":"3578","rate":"5","timeStamp":"978298958","uid":"2"}
{"movie":"2881","rate":"3","timeStamp":"978300002","uid":"2"}
{"movie":"3030","rate":"4","timeStamp":"978298434","uid":"2"}
{"movie":"1217","rate":"3","timeStamp":"978298151","uid":"2"}
{"movie":"3105","rate":"4","timeStamp":"978298673","uid":"2"}
{"movie":"434","rate":"2","timeStamp":"978300174","uid":"2"}
{"movie":"2126","rate":"3","timeStamp":"978300123","uid":"2"}
{"movie":"3107","rate":"2","timeStamp":"978300002","uid":"2"}
{"movie":"3108","rate":"3","timeStamp":"978299712","uid":"2"}
{"movie":"3035","rate":"4","timeStamp":"978298625","uid":"2"}
{"movie":"1253","rate":"3","timeStamp":"978299120","uid":"2"}
{"movie":"1610","rate":"5","timeStamp":"978299809","uid":"2"}
{"movie":"292","rate":"3","timeStamp":"978300123","uid":"2"}
{"movie":"2236","rate":"5","timeStamp":"978299220","uid":"2"}
{"movie":"3071","rate":"4","timeStamp":"978299120","uid":"2"}
{"movie":"902","rate":"2","timeStamp":"978298905","uid":"2"}
{"movie":"368","rate":"4","timeStamp":"978300002","uid":"2"}
{"movie":"1259","rate":"5","timeStamp":"978298841","uid":"2"}
{"movie":"3147","rate":"5","timeStamp":"978298652","uid":"2"}
{"movie":"1544","rate":"4","timeStamp":"978300174","uid":"2"}
{"movie":"1293","rate":"5","timeStamp":"978298261","uid":"2"}
{"movie":"1188","rate":"4","timeStamp":"978299620","uid":"2"}
{"movie":"3255","rate":"4","timeStamp":"978299321","uid":"2"}
{"movie":"3256","rate":"2","timeStamp":"978299839","uid":"2"}
{"movie":"3257","rate":"3","timeStamp":"978300073","uid":"2"}
{"movie":"110","rate":"5","timeStamp":"978298625","uid":"2"}
{"movie":"2278","rate":"3","timeStamp":"978299889","uid":"2"}
{"movie":"2490","rate":"3","timeStamp":"978299966","uid":"2"}
{"movie":"1834","rate":"4","timeStamp":"978298813","uid":"2"}
{"movie":"3471","rate":"5","timeStamp":"978298814","uid":"2"}
- Install IntelliJ IDEA 2020, create a Maven project, and add the fastjson dependency.
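The data-generation step described above (a loop that fills beans and writes one JSON string per line) could look roughly like the sketch below. Everything here is an assumption made for illustration: the class `RatingDataGenerator`, the nested `Rating` bean, and the value ranges are hypothetical, and the JSON is assembled by hand with `String.format` so the example runs without dependencies. With fastjson on the classpath, the hand-written `toJsonLine` would be replaced by a call to `JSON.toJSONString(bean)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical generator for lines shaped like the sample data above.
class RatingDataGenerator {

    // Minimal bean with the four fields seen in the sample records.
    static final class Rating {
        final String movie;
        final String rate;
        final String timeStamp;
        final String uid;

        Rating(String movie, String rate, String timeStamp, String uid) {
            this.movie = movie;
            this.rate = rate;
            this.timeStamp = timeStamp;
            this.uid = uid;
        }
    }

    // Hand-rolled serialization; assumes the values contain no characters that need JSON escaping.
    static String toJsonLine(Rating r) {
        return String.format(
                "{\"movie\":\"%s\",\"rate\":\"%s\",\"timeStamp\":\"%s\",\"uid\":\"%s\"}",
                r.movie, r.rate, r.timeStamp, r.uid);
    }

    // Builds `count` records with pseudo-random values; the ranges are arbitrary choices.
    static List<String> generate(int count, long seed) {
        Random random = new Random(seed);
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            Rating r = new Rating(
                    String.valueOf(1 + random.nextInt(4000)),               // movie id
                    String.valueOf(1 + random.nextInt(5)),                  // rate 1..5
                    String.valueOf(978300000L + random.nextInt(600000)),    // timestamp
                    String.valueOf(1 + i / 50));                            // 50 ratings per user
            lines.add(toJsonLine(r));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Print a few lines; redirect stdout to a file to build a larger data set.
        for (String line : generate(5, 42L)) {
            System.out.println(line);
        }
    }
}
```

A generator like this produces only clean records, whereas the excerpt above also contains a few malformed ones (an empty rate, a misspelled field name, a non-numeric rate), so the parsing code in the job still has to tolerate bad input.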
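4. TopN Sorting with Conventional MapReduce
- Every review of every movie carries a rate (rating) field. The task is to select the 3 highest-rated records for each movie and write them to a file.
- Code. As the code below shows, when the data volume is large, sorting the list in the reducer may exhaust memory.
Java bean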
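To make the memory concern concrete, the reduce-side pattern in question can be boiled down to a few lines of plain Java. This is a simplified sketch rather than the article's reducer: `ListSortTopN` is a hypothetical name, and the `Iterable<Double>` parameter merely imitates the stream of values a reducer receives for one movie.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo of the "buffer everything, then sort" TopN pattern.
class ListSortTopN {

    // Imitates one reduce call: all rates of a single movie arrive as an Iterable.
    static List<Double> topN(Iterable<Double> ratesForOneMovie, int n) {
        List<Double> buffer = new ArrayList<>();
        for (Double rate : ratesForOneMovie) {
            buffer.add(rate); // the whole group is held in memory at once
        }
        buffer.sort(Comparator.reverseOrder()); // highest rate first
        return new ArrayList<>(buffer.subList(0, Math.min(n, buffer.size())));
    }

    public static void main(String[] args) {
        System.out.println(topN(Arrays.asList(3.0, 5.0, 4.0, 2.0, 5.0), 3)); // [5.0, 5.0, 4.0]
    }
}
```

The buffer grows with the number of ratings for the movie, not with `n`, which is why a very popular movie can exhaust the reducer's heap even though only 3 records are kept.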
package com.doit.hadoop.movie;
import org.apache.hadoop.io.Writable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
public class Movie implements Writable {
/*
* {"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
* */
private String movie;
private double rate;
private String timeStamp;
private String uid;
public String getMovie()