Distributed Sort via MapReduce vs. K-Way Merge + Quicksort

TigerYang414

Categories: Hadoop, Algorithms. Tags: mapreduce, function, input, io, output

Article link: https://blog.csdn.net/tigeryang414/article/details/7285874


Distributed Sort via MapReduce
- The Map function simply emits key + record for each input record.
- The intermediate keys are partitioned into R pieces by key range, so the R partitions are ordered across the key domain: every key in partition i sorts before every key in partition i+1. This step works like a bucket sort.
- Each Reduce function runs quicksort on its input keys (assuming all keys of one partition fit in memory, so no external sort is needed).
- The computational complexity (for N keys in total) is then:
  - Map phase: N
  - Reduce phase: R * (N/R * log(N/R)) = NlogN - NlogR
  - I/O: two rounds of reading and writing the input
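The three steps above can be sketched in a single process. This is only an illustration, not Hadoop code: the function names, the sampling-based choice of splitters, and the use of Python's built-in `sorted` in place of quicksort are all assumptions made for the example.

```python
import bisect
import random


def map_phase(records, key_fn):
    """Map: emit (key, record) pairs without any other processing."""
    return [(key_fn(r), r) for r in records]


def choose_splitters(pairs, num_reducers, sample_size=1000, seed=0):
    """Pick R-1 splitter keys from a random sample (an assumed strategy)."""
    rng = random.Random(seed)
    sample = sorted(k for k, _ in rng.sample(pairs, min(sample_size, len(pairs))))
    step = len(sample) / num_reducers
    return [sample[int(i * step)] for i in range(1, num_reducers)]


def partition(pairs, splitters):
    """Range-partition: bucket i only holds keys that sort before bucket i+1."""
    buckets = [[] for _ in range(len(splitters) + 1)]
    for key, rec in pairs:
        buckets[bisect.bisect_right(splitters, key)].append((key, rec))
    return buckets


def reduce_phase(bucket):
    """Reduce: in-memory sort of one partition (sorted() stands in for quicksort)."""
    return sorted(bucket, key=lambda kv: kv[0])


def mapreduce_sort(records, num_reducers, key_fn=lambda r: r):
    pairs = map_phase(records, key_fn)
    if not pairs:
        return []
    buckets = partition(pairs, choose_splitters(pairs, num_reducers))
    # Buckets are already ordered by key range, so concatenation is globally sorted.
    result = []
    for bucket in buckets:  # each iteration could run on a separate reducer
        result.extend(rec for _, rec in reduce_phase(bucket))
    return result
```

Because the partitions are ordered by key range, no merge step is needed after the reducers finish; their outputs are simply concatenated.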
K-Way Merge + Quicksort
- Quicksort complexity: K * (N/K * log(N/K)) = NlogN - NlogK
- K-way merge complexity: NlogK
- I/O: two rounds of reading and writing the input
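A minimal in-memory sketch of this approach: split the input into K runs, quicksort each run, then merge the runs through a K-entry heap. The function names are hypothetical, and a real external sort would write each sorted run to disk rather than keep it in a list.

```python
import heapq


def quicksort(items):
    """Simple out-of-place quicksort, O(n log n) on average."""
    if len(items) <= 1:
        return list(items)
    pivot = items[len(items) // 2]
    less = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    greater = [x for x in items if x > pivot]
    return quicksort(less) + equal + quicksort(greater)


def kway_merge_sort(items, k):
    """Sort by quicksorting at most k runs and merging them with a heap."""
    if k < 1:
        raise ValueError("k must be at least 1")
    run_len = max(1, -(-len(items) // k))  # ceil(len(items) / k)
    runs = [quicksort(items[i:i + run_len])
            for i in range(0, len(items), run_len)]
    # heapq.merge keeps one entry per run in a heap: O(log k) per output element.
    return list(heapq.merge(*runs))
```

The merge is a single sequential pass over all N elements, which is why this stage cannot be spread across machines the way the Reduce phase can.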
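Summary
- If R == K, the two approaches have comparable computation and I/O complexity. However, the Reduce phase can run distributed and in parallel, whereas the K-way merge can only run serially, so overall MapReduce performs better in practice.
- Also note that in both approaches the I/O time and the CPU time are of the same order. Assume 1 TB of data (2^40 B), an I/O rate of 100 MB/s, a 2 GHz CPU, and K = R = 1000. For serial processing the rough estimate is as follows (logs are base 2; the parallel case is similar):
  - Compute time: 2^40 * (1 + log(2^40) - log1000) / (2 * 2^30) ≈ 2^9 * (1 + 40 - 10) = 512 * 31 ≈ 15,900 s
  - I/O time: 1 TB / (100 MB/s) * 2 = 2^21 / 100 ≈ 21,000 s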
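The estimate above can be reproduced with a few lines of Python. The sketch keeps the simplifications implied by the formulas, which are assumptions rather than measurements: one key per byte, one operation per CPU cycle, 2 GHz taken as 2 * 2^30 cycles per second, and 100 MB/s taken as 100 * 2^20 bytes per second. The function name and parameters are hypothetical.

```python
import math


def estimate_seconds(data_bytes=2**40, io_bytes_per_s=100 * 2**20,
                     cpu_hz=2 * 2**30, r=1000, io_rounds=2):
    """Return (cpu_seconds, io_seconds) for the serial MapReduce-style sort."""
    n = data_bytes  # assumption: one key per byte
    # Map phase (N) plus Reduce phase (N log N - N log R), one op per cycle.
    ops = n * (1 + math.log2(n) - math.log2(r))
    cpu_seconds = ops / cpu_hz
    io_seconds = io_rounds * data_bytes / io_bytes_per_s
    return cpu_seconds, io_seconds


cpu_s, io_s = estimate_seconds()
print(f"compute: {cpu_s:.0f} s, I/O: {io_s:.0f} s")
```

With the default parameters this gives roughly 15,900 s of compute and 21,000 s of I/O, so neither side dominates.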