External Merge Sort, time complexity analysis

The time complexity of in-memory merge sort is O(N log N).  What about external merge sort?


1. One pass external merge sort

step 1. Split the N items into k groups of N/k items each and sort each group in memory.  Sorting one group costs (N/k) log(N/k), so all k groups cost k * (N/k) log(N/k) = N log(N/k).

step 2. Do a k-way merge of the sorted groups using a min-heap of size k.  Each push or pop on the heap takes O(log k) time, and each of the N items passes through the heap once, so the merge costs N log k.

So the total time complexity is N log(N/k) + N log k = N (log N - log k + log k) = N log N, the same as in-memory merge sort.
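The two steps above can be sketched in Python as follows. This is only an illustration: the groups are kept as in-memory lists instead of files on disk, and the function names (`sort_runs`, `k_way_merge`, `one_pass_sort`) are made up for this example.

```python
import heapq

def sort_runs(data, k):
    """Step 1: split data into at most k groups and sort each group."""
    size = max(1, -(-len(data) // k))  # ceil(N / k) items per group, at least 1
    return [sorted(data[i:i + size]) for i in range(0, len(data), size)]

def k_way_merge(runs):
    """Step 2: merge sorted runs with a min-heap holding one item per run."""
    # Each heap entry is (value, run index, position inside that run).
    heap = [(run[0], idx, 0) for idx, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        value, idx, pos = heapq.heappop(heap)  # O(log k)
        out.append(value)
        if pos + 1 < len(runs[idx]):
            # Refill the heap from the run the popped item came from: O(log k)
            heapq.heappush(heap, (runs[idx][pos + 1], idx, pos + 1))
    return out

def one_pass_sort(data, k):
    return k_way_merge(sort_runs(data, k))

print(one_pass_sort([5, 3, 8, 1, 9, 2, 7], 3))
```

In a real external sort each run would live in its own file and the merge would read it through a buffer, but the heap logic, which is where the N log k term comes from, stays the same.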


2. Two pass external merge sort

If k is too large, each input buffer used during the merge becomes very small.  For example, to sort 500 GB of data with only 1 GB of memory, k = 500, and during the merge each of the 500 input buffers holds only about 2 MB.  That may look acceptable, but if the data is in the terabyte range, or the memory is much smaller, each read may bring only a few KB into main memory, and the disk is accessed far too frequently.  In that case we can sort in two passes.
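The buffer arithmetic can be checked with a few lines of Python. This is a rough sketch that assumes all memory is divided evenly among the k input buffers (no space reserved for an output buffer) and uses decimal units (1 GB = 1000 MB); `input_buffer_mb` is a name invented for this example.

```python
def input_buffer_mb(memory_mb, k):
    """Approximate size of each input buffer when memory is shared by k runs."""
    return memory_mb / k

# 1 GB of memory, 500 runs: the example from the text.
print(input_buffer_mb(1000, 500))

# Ten times more data (5000 runs) or ten times less memory (100 MB)
# shrinks each buffer well below 1 MB.
print(input_buffer_mb(1000, 5000))
print(input_buffer_mb(100, 500))
```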

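First pass: split the N items into k2 groups of N/k2 items each, and sort each group with the one-pass external merge sort described above, which costs N log(N/k2) in total.  (For example, with k2 = 20 each group holds 25 GB; each 25 GB group is further split into 25 files of 1 GB, which are sorted in memory and then merged 25 ways.)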
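A minimal sketch of this first pass, again with in-memory lists standing in for files. The helper names `split` and `first_pass` and the parameter `pieces_per_group` are assumptions made for this example; `heapq.merge` is the standard-library k-way merge.

```python
import heapq

def split(items, parts):
    """Cut items into at most `parts` contiguous slices of near-equal size."""
    size = max(1, -(-len(items) // parts))  # ceil(len / parts), at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]

def first_pass(data, k2, pieces_per_group):
    """Return up to k2 sorted groups, each built by sorting pieces and merging them."""
    groups = []
    for group in split(data, k2):
        # Sort each small piece "in memory", then merge the pieces of this group.
        sorted_pieces = [sorted(p) for p in split(group, pieces_per_group)]
        groups.append(list(heapq.merge(*sorted_pieces)))
    return groups

groups = first_pass([9, 4, 7, 1, 8, 2, 6, 3, 5, 0], k2=2, pieces_per_group=3)
print(groups)
```

Each returned group is sorted on its own; merging the groups with one more k2-way merge, for example `list(heapq.merge(*groups))`, yields the fully sorted data.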

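Second pass: do a k2-way merge of the sorted groups, which takes N log k2 time.  So the total time for the two-pass merge sort is N log(N/k2) + N log k2 = N log N, the same as for one pass.  What changes is the buffer size: with 1 GB of memory, a 25-way or 20-way merge leaves roughly 40 MB or 50 MB per input buffer instead of 2 MB.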
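To make the disk-access argument concrete, the following sketch counts how many buffer-sized reads each scheme needs during its merge phases for the 500 GB / 1 GB example. It is a simplified model, not a measurement: it assumes memory is split evenly among the input buffers, ignores the output buffer and the initial in-memory sorting, and uses decimal units; `refills` is a name invented here.

```python
def refills(data_mb, memory_mb, fan_in):
    """Number of buffer-sized reads needed to stream data_mb through a fan_in-way merge."""
    buffer_mb = memory_mb / fan_in
    return data_mb / buffer_mb

total_mb, memory_mb = 500_000, 1_000

# One pass: a single 500-way merge over all the data (2 MB buffers).
one_pass = refills(total_mb, memory_mb, 500)

# Two passes: 20 groups of 25 GB, each merged 25 ways (40 MB buffers),
# followed by one 20-way merge over all the data (50 MB buffers).
two_pass = 20 * refills(25_000, memory_mb, 25) + refills(total_mb, memory_mb, 20)

print(one_pass, two_pass)
```

Under these assumptions the two-pass scheme reads every item twice during merging, yet issues far fewer and larger reads, which is the motivation given in the text.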


Reference:

http://en.wikipedia.org/wiki/External_sorting

http://www.mitbbs.com/article_t/JobHunting/32257009.html

