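Efficient TopN by Exploiting MapReduce's Merge Sort: A Case Study

This article looks at how to use MapReduce's internal sorting to implement an efficient TopN. It covers how MapReduce sorts, the problem with the conventional MapReduce TopN implementation, and a merge-sort approach that reuses the sort already performed at each stage. It then explains in detail how a custom partitioner and grouping comparator avoid the memory problem of sorting large volumes of data, which improves processing efficiency and speed.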

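1. MapReduce sorting

  1. MapReduce's internal mechanics show that partitioning and sorting run through the entire process.
  2. Each map task writes its key/value output into a circular buffer.
  3. Each map task spills the data from the circular buffer to temporary files, sorted by partition and key.
  4. The temporary files of the same map task are merged into a single file.
  5. The reducer pulls its data from the files produced by the map tasks and groups it by key.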
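
The spill and merge steps above amount to an external merge sort. The following is a minimal plain-Java sketch, with no Hadoop dependency, of how several already-sorted spill runs can be merged into one sorted sequence. The class SpillMergeSketch and the method mergeSortedRuns are hypothetical names used only for illustration; this is not Hadoop's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration: each inner list stands for one sorted spill file.
class SpillMergeSketch {

    // Cursor into one sorted run: the run itself and the current position in it.
    private static final class Cursor {
        final List<String> run;
        int pos;
        Cursor(List<String> run) { this.run = run; }
        String current() { return run.get(pos); }
    }

    // K-way merge: repeatedly take the smallest head element among all runs.
    static List<String> mergeSortedRuns(List<List<String>> runs) {
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>((a, b) -> a.current().compareTo(b.current()));
        for (List<String> run : runs) {
            if (!run.isEmpty()) {
                heap.add(new Cursor(run));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            merged.add(c.current());
            c.pos++;
            if (c.pos < c.run.size()) {
                heap.add(c); // re-insert with its next element as the new head
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Keys are compared as strings, so "661" sorts after "2355".
        List<List<String>> runs = Arrays.asList(
                Arrays.asList("1193", "2355", "661"),
                Arrays.asList("1197", "914"),
                Arrays.asList("3408"));
        System.out.println(mergeSortedRuns(runs));
    }
}
```

The heap only ever holds one element per run, which is why a merge of sorted runs can handle far more data than fits in memory at once.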

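All of these sorts are ascending by key and use the key type's own comparison. For Text (string) keys that means byte-wise lexicographic order, which for ASCII strings is the order of the ASCII table.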
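
A small plain-Java sketch of why string ordering of keys can be surprising when the keys hold numbers. It uses String comparison as a stand-in for the ordering of string keys in Hadoop, an assumption that holds for ASCII digits. The class KeyOrderSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class KeyOrderSketch {

    // Sort as text: compares character by character, so "1193" comes before "661".
    static List<String> sortAsText(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    // Sort as numbers: parses each key first, so 661 comes before 1193.
    static List<String> sortAsNumber(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        copy.sort(Comparator.comparingInt((String s) -> Integer.parseInt(s)));
        return copy;
    }

    public static void main(String[] args) {
        List<String> movieIds = Arrays.asList("661", "1193", "914", "48");
        System.out.println(sortAsText(movieIds));
        System.out.println(sortAsNumber(movieIds));
    }
}
```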

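2. Problem

  1. How do you sort the data in a very large distributed file?
  2. What is wrong with the conventional MapReduce approach?
  3. What sorting methods are available?
  4. How can MapReduce's internal sorting be used to solve this sorting problem and so improve processing efficiency and speed?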
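
One way to picture the answer to the last question: let the key carry both the movie id and the rating, have the framework's sort order the records by movie and then by rating descending, and have the reducer read only the first N records of each movie group. The sketch below simulates that idea in plain Java without Hadoop. RatingKey, SORT_ORDER, partitionFor, and topNPerMovie are hypothetical names, the ratings are made up, and the sorted list stands in for what the shuffle would deliver to one reducer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SecondarySortSketch {

    // Composite key: the movie id plus the rating to sort by.
    static final class RatingKey {
        final String movie;
        final double rate;
        RatingKey(String movie, double rate) { this.movie = movie; this.rate = rate; }
        @Override public String toString() { return movie + ":" + rate; }
    }

    // Sort rule: movie ascending, then rate descending within one movie.
    static final Comparator<RatingKey> SORT_ORDER = (a, b) -> {
        int byMovie = a.movie.compareTo(b.movie);
        return byMovie != 0 ? byMovie : Double.compare(b.rate, a.rate);
    };

    // Partition rule: depends on the movie only, so all records of one
    // movie reach the same reducer.
    static int partitionFor(RatingKey key, int numPartitions) {
        return (key.movie.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Reducer side: the input is already sorted, so keep the first n records
    // of each run of equal movie ids and skip the rest without buffering.
    static List<RatingKey> topNPerMovie(List<RatingKey> sorted, int n) {
        List<RatingKey> out = new ArrayList<>();
        String currentMovie = null;
        int taken = 0;
        for (RatingKey k : sorted) {
            if (!k.movie.equals(currentMovie)) { // grouping rule: movie only
                currentMovie = k.movie;
                taken = 0;
            }
            if (taken < n) {
                out.add(k);
                taken++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample ratings for illustration.
        List<RatingKey> records = new ArrayList<>(Arrays.asList(
                new RatingKey("3105", 5), new RatingKey("2321", 3),
                new RatingKey("3105", 4), new RatingKey("2321", 4),
                new RatingKey("3105", 3), new RatingKey("3105", 2)));
        records.sort(SORT_ORDER);
        System.out.println(topNPerMovie(records, 3));
    }
}
```

The point of the design is that the reducer keeps only a counter per group, not a list of all values, so its memory use does not grow with the number of ratings per movie.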

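3. Environment setup

  1. Prepare a raw movie-rating dataset. In practice you can create many JavaBean objects in a for loop, give each property different values, convert each bean to a string with fastjson, and write the strings line by line to a text file. The real file is a dozen or so megabytes; only an excerpt is reproduced here.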
{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}
{"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}
{"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}
{"movie":"2804","rate":"5","timeStamp":"978300719","uid":"1"}
{"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"}
{"movie":"919","rate":"4","timeStamp":"978301368","uid":"1"}
{"movie":"595","rate":"5","timeStamp":"978824268","uid":"1"}
{"movie":"938","rate":"4","timeStamp":"978301752","uid":"1"}
{"movie":"2398","rate":"4","timeStamp":"978302281","uid":"1"}
{"movie":"2918","rate":"4","timeStamp":"978302124","uid":"1"}
{"movie":"1035","rate":"5","timeStamp":"978301753","uid":"1"}
{"movie":"2791","rate":"4","timeStamp":"978302188","uid":"1"}
{"movie":"2687","rate":"3","timeStamp":"978824268","uid":"1"}
{"movie":"2018","rate":"4","timeStamp":"978301777","uid":"1"}
{"movie":"3105","rate":"5","timeStamp":"978301713","uid":"1"}
{"movie":"2797","rate":"4","timeStamp":"978302039","uid":"1"}
{"movie":"2321","rate":"","timeStamp":"978302205","uid":"1"}
{"movie":"720","rate":"3","timeStamp":"978300760","uid":"1"}
{"movie":"1270","rate":"5","timeStamp":"978300055","uid":"1"}
{"movie":"527","rate":"5","timeStamp":"978824195","uid":"1"}
{"movie":"2340","rate":"3","timeStamp":"978300103","uid":"1"}
{"movie":"48","rate":"5","timeStamp":"978824351","uid":"1"}
{"movie":"1097","rate":"4","timeStamp":"978301953","uid":"1"}
{"movie":"1721","rate":"4","timeStamp":"978300055","uid":"1"}
{"movie":"2321","rate":"3","timeStamp":"978302205","uid":"1"}
{"movie":"1545","raete":"4","timeStamp":"978824139","uid":"1"}
{"movie":"745","rate":"3","timeStamp":"978824268","uid":"1"}
{"movie":"2294","rate":"4","timeStamp":"978824291","uid":"1"}
{"movie":"3186","rate":"4","timeStamp":"978300019","uid":"1"}
{"movie":"1566","rate":"a","timeStamp":"978824330","uid":"1"}
{"movie":"588","rate":"4","timeStamp":"978824268","uid":"1"}
{"movie":"1907","rate":"4","timeStamp":"978824330","uid":"1"}
{"movie":"783","rate":"4","timeStamp":"978824291","uid":"1"}
{"movie":"1836","rate":"5","timeStamp":"978300172","uid":"1"}
{"movie":"1022","rate":"5","timeStamp":"978300055","uid":"1"}
{"movie":"2762","rate":"4","timeStamp":"978302091","uid":"1"}
{"movie":"150","rate":"5","timeStamp":"978301777","uid":"1"}
{"movie":"1","rate":"5","timeStamp":"978824268","uid":"1"}
{"movie":"1961","rate":"5","timeStamp":"978301590","uid":"1"}
{"movie":"1962","rate":"4","timeStamp":"978301753","uid":"1"}
{"movie":"2692","rate":"4","timeStamp":"978301570","uid":"1"}
{"movie":"260","rate":"4","timeStamp":"978300760","uid":"1"}
{"movie":"1028","rate":"5","timeStamp":"978301777","uid":"1"}
{"movie":"1029","rate":"5","timeStamp":"978302205","uid":"1"}
{"movie":"1207","rate":"4","timeStamp":"978300719","uid":"1"}
{"movie":"2028","rate":"5","timeStamp":"978301619","uid":"1"}
{"movie":"531","rate":"4","timeStamp":"978302149","uid":"1"}
{"movie":"3114","rate":"4","timeStamp":"978302174","uid":"1"}
{"movie":"608","rate":"4","timeStamp":"978301398","uid":"1"}
{"movie":"1246","rate":"4","timeStamp":"978302091","uid":"1"}
{"movie":"1357","rate":"5","timeStamp":"978298709","uid":"2"}
{"movie":"3068","rate":"4","timeStamp":"978299000","uid":"2"}
{"movie":"1537","rate":"4","timeStamp":"978299620","uid":"2"}
{"movie":"647","rate":"3","timeStamp":"978299351","uid":"2"}
{"movie":"2194","rate":"4","timeStamp":"978299297","uid":"2"}
{"movie":"648","rate":"4","timeStamp":"978299913","uid":"2"}
{"movie":"2268","rate":"5","timeStamp":"978299297","uid":"2"}
{"movie":"2628","rate":"3","timeStamp":"978300051","uid":"2"}
{"movie":"1103","rate":"3","timeStamp":"978298905","uid":"2"}
{"movie":"2916","rate":"3","timeStamp":"978299809","uid":"2"}
{"movie":"3468","rate":"5","timeStamp":"978298542","uid":"2"}
{"movie":"1210","rate":"4","timeStamp":"978298151","uid":"2"}
{"movie":"1792","rate":"3","timeStamp":"978299941","uid":"2"}
{"movie":"1687","rate":"3","timeStamp":"978300174","uid":"2"}
{"movie":"1213","rate":"2","timeStamp":"978298458","uid":"2"}
{"movie":"3578","rate":"5","timeStamp":"978298958","uid":"2"}
{"movie":"2881","rate":"3","timeStamp":"978300002","uid":"2"}
{"movie":"3030","rate":"4","timeStamp":"978298434","uid":"2"}
{"movie":"1217","rate":"3","timeStamp":"978298151","uid":"2"}
{"movie":"3105","rate":"4","timeStamp":"978298673","uid":"2"}
{"movie":"434","rate":"2","timeStamp":"978300174","uid":"2"}
{"movie":"2126","rate":"3","timeStamp":"978300123","uid":"2"}
{"movie":"3107","rate":"2","timeStamp":"978300002","uid":"2"}
{"movie":"3108","rate":"3","timeStamp":"978299712","uid":"2"}
{"movie":"3035","rate":"4","timeStamp":"978298625","uid":"2"}
{"movie":"1253","rate":"3","timeStamp":"978299120","uid":"2"}
{"movie":"1610","rate":"5","timeStamp":"978299809","uid":"2"}
{"movie":"292","rate":"3","timeStamp":"978300123","uid":"2"}
{"movie":"2236","rate":"5","timeStamp":"978299220","uid":"2"}
{"movie":"3071","rate":"4","timeStamp":"978299120","uid":"2"}
{"movie":"902","rate":"2","timeStamp":"978298905","uid":"2"}
{"movie":"368","rate":"4","timeStamp":"978300002","uid":"2"}
{"movie":"1259","rate":"5","timeStamp":"978298841","uid":"2"}
{"movie":"3147","rate":"5","timeStamp":"978298652","uid":"2"}
{"movie":"1544","rate":"4","timeStamp":"978300174","uid":"2"}
{"movie":"1293","rate":"5","timeStamp":"978298261","uid":"2"}
{"movie":"1188","rate":"4","timeStamp":"978299620","uid":"2"}
{"movie":"3255","rate":"4","timeStamp":"978299321","uid":"2"}
{"movie":"3256","rate":"2","timeStamp":"978299839","uid":"2"}
{"movie":"3257","rate":"3","timeStamp":"978300073","uid":"2"}
{"movie":"110","rate":"5","timeStamp":"978298625","uid":"2"}
{"movie":"2278","rate":"3","timeStamp":"978299889","uid":"2"}
{"movie":"2490","rate":"3","timeStamp":"978299966","uid":"2"}
{"movie":"1834","rate":"4","timeStamp":"978298813","uid":"2"}
{"movie":"3471","rate":"5","timeStamp":"978298814","uid":"2"}
  2. Install IntelliJ IDEA 2020, create a Maven project, and add the fastjson dependency.
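
A minimal sketch of the data-generation step described above. It assumes plain string formatting instead of fastjson so that it runs without extra dependencies. The class RatingFileGenerator, its method names, the value ranges, and the output file name are all hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Random;

class RatingFileGenerator {

    // Builds one JSON line with the same four fields as the sample data.
    static String toJsonLine(int movie, int rate, long timeStamp, int uid) {
        return String.format(Locale.ROOT,
                "{\"movie\":\"%d\",\"rate\":\"%d\",\"timeStamp\":\"%d\",\"uid\":\"%d\"}",
                movie, rate, timeStamp, uid);
    }

    // Generates count pseudo-random rating lines (the value ranges are assumptions).
    static List<String> generate(int count, long seed) {
        Random random = new Random(seed);
        List<String> lines = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            int movie = 1 + random.nextInt(3900);
            int rate = 1 + random.nextInt(5);
            long timeStamp = 978_000_000L + random.nextInt(1_000_000);
            int uid = 1 + random.nextInt(6000);
            lines.add(toJsonLine(movie, rate, timeStamp, uid));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path out = Paths.get("rating.json"); // hypothetical file name
        Files.write(out, generate(1000, 42L), StandardCharsets.UTF_8);
    }
}
```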

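4. Conventional MapReduce TopN

  1. Every review of every movie has a rate attribute. The task is to select the 3 highest-rated records for each movie and write them to a file.
  2. Code. As the code below shows, when the data volume is large, sorting the list in the reducer may exhaust memory.
    Java bean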
package com.doit.hadoop.movie;

import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

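public class Movie implements Writable {
    /*
     * {"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
     * */
    private String movie;
    private double rate;
    private String timeStamp;
    private String uid;

    public String getMovie() { return movie; }

    // The source listing is cut off after getMovie(). The two methods below are a
    // minimal reconstruction of the Writable contract, not the author's original code.
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(movie);
        out.writeDouble(rate);
        out.writeUTF(timeStamp);
        out.writeUTF(uid);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        movie = in.readUTF();
        rate = in.readDouble();
        timeStamp = in.readUTF();
        uid = in.readUTF();
    }
}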
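
To make the memory concern concrete, here is a plain-Java sketch of what a conventional reducer typically does for one movie: buffer every rating in a list, sort the list, and keep the first three. It is a simulation under stated assumptions and not the article's original reducer. ListSortTopN and topN are hypothetical names, and the ratings are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class ListSortTopN {

    // Buffers ALL ratings of one movie before sorting: memory grows with the
    // size of the group, which is the weak point when one key has many values.
    static List<Double> topN(Iterable<Double> ratesOfOneMovie, int n) {
        List<Double> buffer = new ArrayList<>();
        for (Double rate : ratesOfOneMovie) {
            buffer.add(rate);
        }
        buffer.sort(Collections.reverseOrder()); // highest rating first
        return new ArrayList<>(buffer.subList(0, Math.min(n, buffer.size())));
    }

    public static void main(String[] args) {
        // Made-up ratings for a single movie.
        System.out.println(topN(Arrays.asList(3.0, 5.0, 4.0, 2.0, 5.0), 3));
    }
}
```

The list holds every value of the group even though only three are kept, so a movie with a very large number of ratings can exhaust the reducer's heap.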