Sorting in Perl

John Klassa / Raleigh Perl Mongers / June, 2000

 

Introduction

  • Perl has a built-in "sort" function.
  • It uses a quicksort algorithm, which has good (on average, O(n log n)) performance. (Perl 5.8 and later use a mergesort by default, which is O(n log n) in the worst case as well.) Note, however, that a simple bubble sort can be faster than most other sorts for very short lists.
  • It's easy to use. You say:
    	@s = sort @a;
    
    and you've got a sorted array...

 

So Why Do We Need a Talk?

  • In order to do useful sorts, you need to add a bit of code to your "sort" calls.
  • If you haven't had the need to do a complex sort, you will.
  • The built-in "sort" compares the sortkeys O(n log n) times. How (and how often) you generate the sortkeys, and how you compare them, make a huge difference.

 

The Basics

  • Ascending lexicographic sort:
    	@s = sort @a;  # does {$a cmp $b} implicitly
    
  • Ascending numeric sort:
    	@s = sort {$a <=> $b} @a;
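
The difference between the two shows up as soon as the list holds numbers of different widths. A minimal sketch, with sample values made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample data, chosen so the two orderings differ.
my @a = (10, 9, 2, 100);

my @lex = sort @a;                  # default: compares elements as strings
my @num = sort { $a <=> $b } @a;    # compares elements as numbers

print "lexicographic: @lex\n";      # 10 100 2 9
print "numeric:       @num\n";      # 2 9 10 100
```
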
    

 

More Basics

  • Descending lexicographic sort:
    	@s = sort {$b cmp $a} @a;
    	@s = reverse sort @a;  # faster
    Comment from Uri Guttman: "The reverse is only faster in the real world of your data. With a long enough list and perl now (or soon) recognizing and optimizing simple compare blocks, the reverse would be extra work."
  • Descending numeric sort:
    	@s = sort {$b <=> $a} @a;
    

 

Variations

  • Case-insensitive sort:
    	@s = sort {lc $a cmp lc $b} @a;
    
  • Element-length sort:
    	@s = sort {length $a <=> length $b} @a;
    
  • Any function will do...

 

Combination Sorts

  • Length first, then lexicographic:
    	@s = sort {length $a <=> length $b ||
    			$a cmp $b} @a;
    
  • Size of file, then age of file:
    	@s = sort {-s $a <=> -s $b ||
    			-M $b <=> -M $a} @a;
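
The "||" trick works because a comparison returns 0 on a tie, and 0 is false, so the next comparison gets a chance. A small sketch of the length-then-lexicographic case, using a made-up word list:

```perl
use strict;
use warnings;

# Hypothetical word list with several ties on length.
my @a = qw(pear fig apple kiwi date plum);

# <=> returns 0 when the lengths tie, so || falls through to cmp.
my @s = sort { length($a) <=> length($b) || $a cmp $b } @a;

print "@s\n";    # fig date kiwi pear plum apple
```
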
    

 

Sort Subroutines

  • Useful when your sort criteria get a bit involved. $a and $b are set automatically.
    	@s = sort mycriteria @a;
    	sub mycriteria {
    		my($aa) = $a =~ /(\d+)/;   # first number in each element
    		my($bb) = $b =~ /(\d+)/;
    		sin($aa) <=> sin($bb) ||   # primary key
    		$aa*$aa <=> $bb*$bb;       # tie-breaker
    	}
    

 

Advanced Sorting: Motivation

  • Everything that happens in your sort subroutine/clause happens O(n log n) times.
  • If you do something expensive (like an extraction via a regexp, or perhaps a "stat" on a file), this is a Bad Thing.
  • The Goal: Extract the sortkeys just once.
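
To see the cost, one can count how often the sortkey is extracted. The sketch below is only an illustration: extract_key and the sample data are hypothetical, and the exact count depends on the Perl version and the input order, so no figure is claimed here.

```perl
use strict;
use warnings;

my $calls = 0;

# Hypothetical "expensive" key extractor: pulls the first number out of a string.
sub extract_key {
    my ($s) = @_;
    $calls++;
    my ($n) = $s =~ /(\d+)/;
    return $n;
}

# 100 distinct values, item1 .. item100, in a scrambled but reproducible order
# (37 and 101 are coprime, so the expression visits each of 1..100 once).
my @a;
push @a, "item" . (($_ * 37) % 101) for 1 .. 100;

# Two extractions per comparison, and there are many comparisons.
my @s = sort { extract_key($a) <=> extract_key($b) } @a;

printf "%d elements, %d key extractions\n", scalar @a, $calls;
```

Every comparison calls the extractor twice, so the count comes out well above the number of elements.
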

 

Solution #1: "Orcish" Maneuver

  • Cache the computed values in a hash (so that you're only computing them once). Use an "or" to set missing values. Hence the name: an "or-cache".
    	@s = sort {
    		($hash{$a} ||= fn($a)) cmp
    		($hash{$b} ||= fn($b))
    	     } @a;
    

 

Problems with the OM

  • Performs an extra test after each sortkey lookup.
  • False sortkeys (0, "0", the empty string, undef) fail the "||=" test, so they are recomputed on every comparison.
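
The false-value problem can be made visible with a counter. In this sketch, fn, key_for and the two caches are hypothetical names, and the key function is deliberately chosen to return 0 for half of its inputs. The second variant tests for existence rather than truth; on Perl 5.10 and later the defined-or operator "//=" is another option when undef is the only key that should be recomputed.

```perl
use strict;
use warnings;

my $calls = 0;

# Hypothetical key function: even inputs map to 0, which is a false value.
sub fn {
    my ($x) = @_;
    $calls++;
    return $x % 2;
}

my @a = (1 .. 20);

# Plain Orcish Maneuver: ||= calls fn again whenever the cached key is false.
my %or_cache;
my @s1 = sort { ($or_cache{$a} ||= fn($a)) <=> ($or_cache{$b} ||= fn($b)) } @a;
my $or_calls = $calls;

# Variant: test for existence instead of truth, so 0 is cached like any other key.
$calls = 0;
my %ex_cache;
sub key_for {
    my ($x) = @_;
    $ex_cache{$x} = fn($x) unless exists $ex_cache{$x};
    return $ex_cache{$x};
}
my @s2 = sort { key_for($a) <=> key_for($b) } @a;
my $ex_calls = $calls;

print "||= version:    $or_calls calls to fn\n";
print "exists version: $ex_calls calls to fn\n";    # 20, one per element
```
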

 

A Better Way: The Schwartzian Transform

  • The Schwartzian Transform creates a sorted list by transforming the original list into an intermediate form, where the sortkeys are cached, and then pulling the original list back out.

 

But First, A Digression...

  • Just as @a = (1, 2, 3) creates an array, $aref = [1, 2, 3] creates a reference to an anonymous array.
  • Just as $a[0] is the first element of @a, $aref->[0] is the first element of the anonymous array to which $aref refers.
  • Understanding this is central to understanding the ST.
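
A short sketch of the two forms side by side, followed by the kind of list of anonymous arrays that the ST builds. The word list and the use of length as a sortkey are made up for illustration:

```perl
use strict;
use warnings;

my @a    = (1, 2, 3);     # a named array
my $aref = [1, 2, 3];     # a reference to an anonymous array

print $a[0], "\n";        # first element of @a: 1
print $aref->[0], "\n";   # first element via the reference: 1

# A list of references, one [value, sortkey] pair per element.
my @pairs = map { [ $_, length $_ ] } qw(pear fig apple);
print "$_->[0] has key $_->[1]\n" for @pairs;
```
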

 

The Schwartzian Transform

  • Goal: Sort a list of filenames by age (oldest last), efficiently.
  • The naive approach does O(n log n) "stat" operations, so it's inefficient:
    	@s = sort {-M $a <=> -M $b} @a;
    

 

ST: Mechanics

  • Map the list into a new one that contains the extracted sortkeys and the original values.
  • Sort on the sortkeys.
  • Map the resulting list into a new one that contains the original values in the sorted order.

 

ST: Verbose Approach

  • Verbosely, in code:
    	# @a exists, and contains filenames
    	@x = map { [ $_, -M ] } @a;  # transform: value, sortkey
    	@sx = sort { $a->[1] <=> $b->[1] } @x;	# sort
    	@s = map { $_->[0] } @sx;  # restore original values
    

 

ST: Final

  • Put it all together? The key is to read it backwards.
    	# @a exists, and contains filenames
    	@s = map { $_->[0] }  # restore original values
    	     sort { $a->[1] <=> $b->[1] }  # sort
    	     map { [$_, -M] } @a;  # transform: value, sortkey
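
The same shape can be tried without touching the filesystem. In this sketch extract_key is a hypothetical stand-in for an expensive operation such as -M, and the counter confirms that each key is extracted once:

```perl
use strict;
use warnings;

my $calls = 0;

# Hypothetical key extractor standing in for an expensive operation like -M.
sub extract_key {
    my ($s) = @_;
    $calls++;
    my ($n) = $s =~ /(\d+)/;
    return $n;
}

my @a = qw(file10 file9 file2 file100);

my @s = map  { $_->[0] }                       # restore original values
        sort { $a->[1] <=> $b->[1] }           # sort on the cached key
        map  { [ $_, extract_key($_) ] } @a;   # transform: value, sortkey

print "@s\n";                      # file2 file9 file10 file100
print "$calls key extractions\n";  # 4, one per element
```
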
    

 

Can We Do Better?

  • I didn't think so until I read the Guttman-Rosler paper.
  • Turns out, yes. Use packed sortkeys and the default sort.

 

The "Packed Default" Sort

  • So named because it uses packed sortkeys, then sorts them with the default "sort" (i.e., no sort subroutine or sort clause, just the native, all-in-C comparison routine).
  • The benefits: fast, one-time sortkey generation; fast comparison; fast extraction.

 

PD: The Mechanics

  • Pack the sortkeys into a single string (tack on subkeys, if any).
  • Tack on the original values (or an index, if the original values are complex data structures).
  • Sort.
  • Retrieve original values via "substr", "split" or whatever.
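
As a rough sketch of the index variant: the records below are hypothetical hash references, so an index into the array is packed after the key instead of the value itself. The "N" format (32-bit big-endian unsigned) is used because its byte order matches numeric order under string comparison; it assumes non-negative integers below 2**32.

```perl
use strict;
use warnings;

# Hypothetical records: a hash reference cannot be appended to a string,
# so an index into @recs is packed after the key instead.
my @recs = (
    { name => 'carol', age => 41 },
    { name => 'alice', age => 7 },
    { name => 'bob',   age => 103 },
);

my @sorted =
    map  { $recs[ unpack('N', substr($_, 4)) ] }        # index back to record
    sort                                                # default string comparison
    map  { pack('NN', $recs[$_]{age}, $_) } 0 .. $#recs; # 4-byte key, 4-byte index

print join(' ', map { $_->{name} } @sorted), "\n";    # alice carol bob
```
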

 

PD: Example

  • Sort "dotted-quad" values:
    	@out =
    	    map substr($_, 4) =>
    	    sort
    	    map pack('C4', /(\d+)\.(\d+)\.(\d+)\.(\d+)/)
    		    . $_ => @a;
    
  • Again, read it in reverse...
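
A runnable version of the dotted-quad sort, written with block-style map for readability. The addresses are made up for illustration, and the sketch assumes every input is a well-formed dotted quad:

```perl
use strict;
use warnings;

# Hypothetical addresses for illustration.
my @a = qw(10.0.0.10 192.168.1.1 10.0.0.9 9.255.255.255);

my @out =
    map  { substr($_, 4) }                                       # drop the 4-byte key
    sort                                                         # default string comparison
    map  { pack('C4', /(\d+)\.(\d+)\.(\d+)\.(\d+)/) . $_ } @a;   # 4 packed bytes + original

print "$_\n" for @out;
# 9.255.255.255
# 10.0.0.9
# 10.0.0.10
# 192.168.1.1
```

A plain "sort @a" would put 10.0.0.10 before 10.0.0.9, since it compares the text character by character.
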

 

Conclusion

  • Using "sort" is always O(n log n).
  • For complicated sorts, how you pull out the sortkeys and how you compare them is what matters.
  • The ST is my personal favorite. It's easy to remember, and it's fast.
  • The PD sort is faster, but it's also a bit more cryptic (unless you're a natural with "pack", and have a desire to really understand your data). By the way, Uri Guttman started on a Sort::Records module that does a PD sort under the covers, but did not finish it or publish it to CPAN. He has, however, offered to give us the current source, design ideas, help, etc. if anyone in raleigh.pm would like to pick it up. 

References


Revisions

  1. November 20, 2002: Rob West: Update based on feedback from Uri Guttman.
  2. August 22, 2003: Rob West: Updated References links.