- Distributed Sort via MapReduce
- The map function simply emits each record as a (key, record) pair
- The intermediate keys are partitioned into R pieces, where each piece covers one contiguous, ordered range of the key domain. This works like a bucket sort
- Each reduce task runs quicksort on its input keys (assuming all keys of one partition fit in memory, so no external sort is needed)
- The computational cost is then (assuming N keys in total):
- Map phase: N
- Reduce phase: R * (N/R * log(N/R)) = NlogN - NlogR
- Two rounds of reads and writes over the input
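The map, partition, and reduce steps above can be sketched in a single process as follows. This is only an illustration under stated assumptions: the record doubles as its own key, the splitters come from a random sample (the names `choose_splitters`, `partition`, and `sample_size` are hypothetical, not part of any MapReduce API), and Python's built-in sort stands in for quicksort.

```python
import bisect
import random


def map_fn(record):
    # Identity map: emit (key, record). Assumption: the record is its own key.
    return (record, record)


def choose_splitters(keys, r, sample_size=100):
    # Pick r-1 splitters from a sorted random sample so that the r
    # partitions cover contiguous, ordered ranges of the key domain.
    if not keys:
        return []
    rng = random.Random(0)  # fixed seed keeps the sketch reproducible
    sample = sorted(rng.sample(keys, min(len(keys), sample_size)))
    return [sample[len(sample) * i // r] for i in range(1, r)]


def partition(key, splitters):
    # Index of the key range (bucket) that this key falls into.
    return bisect.bisect_right(splitters, key)


def mapreduce_sort(records, r):
    pairs = [map_fn(rec) for rec in records]
    splitters = choose_splitters([k for k, _ in pairs], r)
    buckets = [[] for _ in range(r)]
    for key, value in pairs:
        buckets[partition(key, splitters)].append((key, value))
    # Reduce: sort each bucket in memory. In a real cluster the buckets
    # would be sorted by different reduce tasks in parallel.
    output = []
    for bucket in buckets:
        bucket.sort(key=lambda kv: kv[0])
        output.extend(value for _, value in bucket)
    return output
```

Because every key in bucket i is smaller than every key in bucket i+1, concatenating the sorted buckets already yields a globally sorted result, with no merge step.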
- K-way merge + quicksort
- Quicksort cost: K * (N/K * log(N/K)) = NlogN - NlogK
- K-way merge cost: NlogK
- Two rounds of reads and writes over the input
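For comparison, a minimal in-memory sketch of the K-way merge approach: split the input into K runs, sort each run, then merge the runs with a heap. The function name `kway_merge_sort` is hypothetical, Python's built-in sort again stands in for quicksort, and a real external sort would write each run to disk rather than keep it in a list.

```python
import heapq


def kway_merge_sort(records, k):
    n = len(records)
    # Ceiling division so that at most k runs are produced.
    run_len = max(1, -(-n // k))
    # Phase 1: sort each run independently, about (N/K) * log(N/K) per run.
    runs = [sorted(records[i:i + run_len]) for i in range(0, n, run_len)]
    # Phase 2: heap-based K-way merge, about log K per element emitted.
    return list(heapq.merge(*runs))
```

The merge phase pulls every element through a single heap, which is why it is hard to parallelize in the way the reduce phase can be.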
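- Summary
- If R == K, the two approaches have comparable compute cost (N + NlogN - NlogR versus NlogN) and the same IO cost, but the reduce phase can run distributed and in parallel, while the K-way merge can only run serially. Overall, MapReduce is the better choice in practice.
- Also note that in both approaches the IO time and the CPU time are of the same order. Assume 1TB of data (2^40 B), an IO speed of 100MB/s, a 2GHz CPU, and K = R = 1000. For serial processing the rough estimate is as follows, and the parallel case is similar: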
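- Compute time: 2^40 * (1 + log(2^40) - log(1000)) / (2 * 2^30) = 2^9 * (1 + 40 - 10) ≈ 16000s (log is base 2, log(1000) ≈ 10, and 2GHz is taken as 2 * 2^30 operations per second)
- IO time: 1TB / (100MB/s) * 2 = 2^21 / 100 ≈ 21000s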
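The estimate above can be reproduced with a few lines of arithmetic. This sketch makes the same simplifying assumptions as the text: N = 2^40 keys (one per byte), one operation per clock cycle, 2GHz taken as 2 * 2^30 operations per second, and MB taken as 2^20 bytes. The function names are hypothetical.

```python
import math


def compute_seconds(n, r, ops_per_sec):
    # Map phase (N) plus reduce phase (N log N - N log R), in operations.
    ops = n * (1 + math.log2(n) - math.log2(r))
    return ops / ops_per_sec


def io_seconds(total_bytes, bytes_per_sec, rounds=2):
    # Each round streams the whole data set once at the given IO speed.
    return rounds * total_bytes / bytes_per_sec


cpu = compute_seconds(2**40, 1000, 2 * 2**30)
io = io_seconds(2**40, 100 * 2**20)
print(f"compute: {cpu:.0f} s, IO: {io:.0f} s")
```

Both results land in the range of 1.6e4 to 2.1e4 seconds, which supports the point that neither CPU nor IO clearly dominates under these assumptions.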
Distributed Sort via MapReduce vs. K-way Merge + Quicksort