置换-选择排序

eplacement Selection

Overview

Perform the in-memory sort by passing the records through a large priority queue.

Detailed Strategy

  1. Choose as large a priority queue as possible, say of M elements
  2. Sort Step
    • Initialization Step: Read M records into the priority queue
    • Replacement Step (creating a single run):
      1. Delete the smallest record from the priority queue and write it out
      2. Read a record from the input file. If the new element is smaller than the last one output, it cannot become part of the current run. Mark it as belonging to the next run and treat it as greater than all the unmarked elements in the queue.
      3. Terminate the run when a marked element reaches the top of the queue
  3. Merge Step: Same as before

Pseudo-Code

The pseudo-code in this section shows how the heap code from earlier in the course must be modified to handle external sorting. It also shows how the create_initial_runs code must be modified from the balanced, multi-way sorting scheme. The other procedures from the balanced, multi-way sorting scheme remain the same.

Code for Create_Initial_Runs

create_initial_runs(input_file_name, run_size, num_ways) {

    allocate a dynamic array, a, large enough to accommodate runs of 
        size run_size
    open the input file using the fields package
    for i = 0 to NUM_WAYS-1 {
        open output_scratch_file i using fopen
    }
    end_of_input = false
    next_output_file = 0
    num_runs_per_output_file = 0

  /* instantiate the heap */
  for i = 1 to run_size {
     read a record into a[i]
     if (end of input)
         end_of_input = true
         break

     /* indicate the record belongs in the current run and insert the
        record into the heap */
     a[i]->mark = current_mark;
     upheap(i);
  }
  /* initialize the heap size */
  N = i - 1;

  /* create the initial runs--get_item sets end_of_input to true only
      when there is no more input and the heap is exhausted */

  while (end_of_input == false) {
    /* keep outputting records to the current run until get_item returns
       a NULL record */
    for (record = get_item(input_file, &end_of_input); record != NULL; 
	 record = get_item(input_file, &end_of_input)) {
      write record to output_scratch_files[next_output_file]
    }
    output the sentinel value to scratch_output_file[next_output_file]

    /* everytime we get back to the first output file, increment the
	number of runs per output file by 1 */
    if (next_output_file == 0)
	 num_runs_per_output_file = num_runs_per_output_file + 1
     next_output_file = (next_output_file + 1) % num_ways
  }

  /* make sure the same number of runs are assigned to each scratch 
     output file */
  if (next_output_file != 0) {
      for i = next_output_file to (num_ways - 1) {
          output the sentinel value to scratch_output_file[i]
      }     
  }
  for i = 0 to (num_ways -1)
      close scratch_output_file[i] using fclose
  close the input file using the fields package
 
  return num_runs_per_output_file
}

Heap Code

This code assumes there are two values in a record:
  • mark: A mark that toggles between two arbitrary values (e.g., 0 and 1). This field is used to determine whether a record belongs in the current run or the next run.

  • key: The record's key

There is also a global variable:

current_mark: Indicates which of the two values a record's mark field must
              be set to in order to be included in this run.

Here is the pseudo-code:

/* if the two records both belong in the run or both don't belong in the
   run, compare the two values and return true if the first value is less
   than the first. Otherwise, return true if the first record belongs in
   the run, and false otherwise (in the latter case, the second record
   belongs in the run and hence it is "less than" the first record). */

int less_than_or_equal(record1, record2) {
  if (record1->mark == record2->mark)
    return (record1->key <= record2->key)
  else 
    return (record1->mark == current_mark)
}

downheap (int k) {
   int j;
   Record v;

   v = a[k];
   while (k <= N/2) {
       j = 2 * k;
       if (j < N && less_than_or_equal(a[j+1], a[j])) j++;
       if (less_than_or_equal(v, a[j])) break;
       a[k] = a[j];
       k = j;
   }
   a[k] = v;
}

int heap_empty () {
  return (N == 0);
}

/* get_item returns the next item in the run. It returns
   NULL if the end of a run or end of the input is reached. The
   flag end_of_input allows get_item to indicate whether the reason for
   the NULL is due to the end of a run or the end of the input. 
   end_of_input must be a pointer to allow a value to be passed back
   through the end_of_input parameter. */

PERSONNEL get_item(input_file, *end_of_input) {
  static int more_input = true;

  /* determine if the heap is empty. If it is, then the end of the
     input has been reached, so set end_of_input to true */
  if (heap_empty()) {
    *end_of_input = true;
    return NULL;
  }
  /* determine if the root element belongs in this run */
  if (a[1]->mark != current_mark) {
    /* reverse the mark so that all the elements in the heap will be
       available for the next run */
    current_mark = !(current_mark);
    *end_of_input = false;
    return NULL;
  }
     
  if (more_input) {
    if (not end of file) {
      read a record into a[0]
      /* determine if the newly read record belongs in the current run
         or the next run. */
      if (a[0]->key >= a[1]->key)
          a[0]->mark = current_mark
      else
          a[0]->mark = !(current_mark)
      /* insert the newly read record into the heap by executing a
         replace operation (i.e., the root of the heap will be returned
	 and the new value will be pushed onto the heap)
      downheap(0);
      return a[0];
    }
    else { /* once the input is exhausted, start returning items from the
                heap */
      more_input = false;
      return remove_item();
    }
  }
  else {
    /* if the input has been exhausted, remove the top item on the heap
       and return it */
    return remove_item();
  }
}

Performance

  1. For random keys, the runs produced by replacement selecton are about twice the size of the heap used
  2. Consequently, priority queues save approximately one merge pass. The number of passes is approximately 1 + log[P](N / 2M)

Practical Considerations

Systems Issues

The cost of I/O dominates the cost of computation so overlapping reading, writing, and computation is critical. A file is divided into blocks, each with a fixed number of bytes. When a file is read into memory or written out, the system maintains buffers that are equal to the block size of a file. Files are transferred to and from memory in chunks equal to the block size of a file, or equivalently, equal to the size of a buffer.
  1. Double Buffering: For each input file maintain two buffers--one that is used by the CPU and one that is used by the I/O processor. As the CPU uses its buffers the I/O processor reads its buffers.
  2. Forecasting: Double buffering's drawback is that it uses only half the available memory space. If there are P input units being used, 2P buffers are required. Forecasting can reduce this number to P+1. The idea is as follows:
    • Find the input buffer that will be exhausted next. The buffer that will be exhausted next is the one which has the smallest item at the end.
    • Allocate the extra buffer to this input file and start filling this extra buffer.

Selecting P

  1. If N scratch tapes are available, use a N/2 - way sort (i.e. P = N/2)
  2. If disks are available, a good strategy is to choose P so that only 2 passes are required. This requires that:
    		log[P](N / 2M) < 2   
    	   ==>  N / 2M < 2**P
    	   ==>  sqrt(N / 2M) < P
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值