Hash Functions for Hash Table Lookup

http://burtleburtle.net/bob/hash/evahash.html

Robert J. Jenkins Jr., 1995-1997

Abstract

This paper presents new hash functions for table lookup using 32-bit or 64-bit arithmetic. These hashes are fast and reliable. A framework is also given for evaluating hash functions.

Introduction

Hash tables ^[Knuth6] are a common data structure. They consist of an array (the hash table) and a mapping (the hash function). The hash function maps keys into hash values. Items stored in a hash table must have keys. The hash function maps the key of an item to a hash value, and that hash value is used as an index into the hash table for that item. This allows items to be inserted and located quickly.

What if an item hashes to a value that some other item has already hashed to? This is a collision. There are several strategies for dealing with collisions ^[Knuth6], but the strategies all make the hash tables slower than if no collisions occurred.

If the actual keys to be used are known before the hash function is chosen, it is possible to choose a hash function that causes no collisions. This is known as a perfect hash function ^[Fox]. This paper will deal with the other case, where the actual keys are a small subset of all possible keys.

For example, if a hash function maps 30-byte keys into a 32-bit output, it maps 2²⁴⁰ possible keys into 2³² possible hash values. Fewer than 2³² actual keys will be used. With a ratio of 2²⁰⁸ possible keys per hash value, it is impossible to guarantee that the actual keys will have no collisions.
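The counting behind this example can be written out explicitly. This is only a restatement of the figures above, with no new assumptions:

```latex
\[
  \underbrace{2^{8 \cdot 30}}_{\text{possible 30-byte keys}} = 2^{240},
  \qquad
  \frac{2^{240}}{2^{32}} = 2^{208}
  \ \text{possible keys per hash value on average.}
\]
```

Since some hash value must receive at least 2^208 possible keys, every hash function has sets of keys that all collide. The hash function can only try to make such sets unlikely to appear in practice.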

If the actual keys being hashed were uniformly distributed, selecting the first v bits of the input to be the v-bit hash value would make a wonderful hash function. It is fast and it hashes an equal number of possible keys to each hash value. Unfortunately, the actual keys supplied by humans and computers are seldom uniformly distributed. Hash functions must be more clever than that.
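As a minimal sketch of why the "first v bits" hash breaks down on real keys, the helper below (prefix_hash is a hypothetical name, not part of the paper) takes the first v bits of a key as its hash value. Any two keys sharing a prefix, such as two names starting with "SM", collide.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: use the first v bits of the key as the hash value.
   Requires 1 <= v <= 32. Missing bytes of short keys count as zero. */
uint32_t prefix_hash(const unsigned char *key, size_t len, unsigned v)
{
    uint32_t h = 0;
    for (size_t i = 0; i < 4; ++i)           /* first 32 bits, first byte highest */
        h = (h << 8) | (i < len ? key[i] : 0u);
    return v >= 32 ? h : h >> (32 - v);     /* keep only the first v bits */
}
```

With v = 16, "SMITH" and "SMYTHE" both hash to the two bytes "SM", no matter what follows.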

This paper is organized as follows. Hash Functions for Table Lookup presents the new 32-bit and 64-bit hashes. Patterns lists some patterns common in human-selected and computer-generated keys. A Hash Model names common pieces of hash functions. Funneling describes a flaw in hash functions and how to detect that flaw. Characteristics describes a more subtle flaw. The last section shows that the new hashes have no funnels.

Hash Functions for Table Lookup

Code for the new hash function is given in figure Newhash. ^ means exclusive-or, and << and >> are left and right shift respectively (neither is a barrel shift).

Newhash: C code for the new 32-bit hash

#include <stdint.h>

typedef  uint32_t  u4;   /* unsigned 4-byte type (unsigned long is 8 bytes on many 64-bit systems) */
typedef  uint8_t   u1;   /* unsigned 1-byte type */

/* The mixing step */
#define mix(a,b,c) \
{ \
  a=a-b;  a=a-c;  a=a^(c>>13); \
  b=b-c;  b=b-a;  b=b^(a<<8);  \
  c=c-a;  c=c-b;  c=c^(b>>13); \
  a=a-b;  a=a-c;  a=a^(c>>12); \
  b=b-c;  b=b-a;  b=b^(a<<16); \
  c=c-a;  c=c-b;  c=c^(b>>5);  \
  a=a-b;  a=a-c;  a=a^(c>>3);  \
  b=b-c;  b=b-a;  b=b^(a<<10); \
  c=c-a;  c=c-b;  c=c^(b>>15); \
}

/* The whole new hash function */
u4 hash( const u1 *k,        /* the key */
         u4        length,   /* the length of the key in bytes */
         u4        initval)  /* the previous hash, or an arbitrary value */
{
   register u4 a,b,c;  /* the internal state */
   u4          len;    /* how many key bytes still need mixing */

   /* Set up the internal state */
   len = length;
   a = b = 0x9e3779b9;  /* the golden ratio; an arbitrary value */
   c = initval;         /* variable initialization of internal state */

   /*---------------------------------------- handle most of the key */
   while (len >= 12)
   {
      a=a+(k[0]+((u4)k[1]<<8)+((u4)k[2]<<16) +((u4)k[3]<<24));
      b=b+(k[4]+((u4)k[5]<<8)+((u4)k[6]<<16) +((u4)k[7]<<24));
      c=c+(k[8]+((u4)k[9]<<8)+((u4)k[10]<<16)+((u4)k[11]<<24));
      mix(a,b,c);
      k = k+12; len = len-12;
   }

   /*------------------------------------- handle the last 11 bytes */
   c = c+length;
   switch(len)              /* all the case statements fall through */
   {
   case 11: c=c+((u4)k[10]<<24);
   case 10: c=c+((u4)k[9]<<16);
   case 9 : c=c+((u4)k[8]<<8);
      /* the first byte of c is reserved for the length */
   case 8 : b=b+((u4)k[7]<<24);
   case 7 : b=b+((u4)k[6]<<16);
   case 6 : b=b+((u4)k[5]<<8);
   case 5 : b=b+k[4];
   case 4 : a=a+((u4)k[3]<<24);
   case 3 : a=a+((u4)k[2]<<16);
   case 2 : a=a+((u4)k[1]<<8);
   case 1 : a=a+k[0];
     /* case 0: nothing left to add */
   }
   mix(a,b,c);
   /*-------------------------------------------- report the result */
   return c;
}

Fitting bytes into registers

The new hash deals with blocks of 12 bytes, rather than 1 byte at a time like most hashes. The mix uses 36 instructions, which is 3 instructions per byte. Mix() allows 2:1 parallelism, so ideally it would run twice as fast on superscalar CPUs.

The new hash fits the bytes into the registers a, b, and c as efficiently as possible in a machine-independent way. Fitting bytes into registers consumes 3m instructions. (If the key were known to be an array of words, this 3m could be reduced to .75m.) The whole hash, including the mix, takes about 6m+35 instructions to hash m bytes.
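The two ways of filling a register can be sketched side by side. load_bytes and load_word are hypothetical names used only for illustration. The first mirrors the shift-and-add pattern in the main loop of hash(); the second assumes the key is already an aligned array of 32-bit words, which is the case the parenthetical remark refers to.

```c
#include <stdint.h>

/* Machine-independent: assemble 4 key bytes into a word, low byte first,
   the same way hash() fills a, b and c. */
uint32_t load_bytes(const unsigned char *k)
{
    return k[0] + ((uint32_t)k[1] << 8) + ((uint32_t)k[2] << 16)
                + ((uint32_t)k[3] << 24);
}

/* Assumes the key is already an array of 32-bit words. One load replaces
   the shifts and adds, but the value agrees with load_bytes() only on
   little-endian machines. */
uint32_t load_word(const uint32_t *w)
{
    return *w;
}
```

The byte-wise form gives the same hash value on every machine, which is why the portable routine pays the extra instructions.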

The switch statement is an interesting optimization. It contains a single piece of code for handling 11-byte strings, but the suffixes of this code can handle shorter strings. The switch statement causes the program control to jump to the correct suffix, determined by the actual number of bytes remaining.
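The same fall-through idiom can be seen in isolation on a smaller scale. The sketch below is an illustration only (load_tail is a hypothetical helper): it folds the last 0 to 3 bytes of a key into one word, and control enters the case chain at the entry that matches the number of bytes left.

```c
#include <stdint.h>

/* Hypothetical helper: fold the last 0..3 bytes of a key into one word,
   low byte first, using the fall-through idiom from hash(). */
uint32_t load_tail(const unsigned char *k, unsigned len)
{
    uint32_t a = 0;
    switch (len)            /* every case falls through to the next */
    {
    case 3: a += (uint32_t)k[2] << 16;  /* fall through */
    case 2: a += (uint32_t)k[1] << 8;   /* fall through */
    case 1: a += k[0];                  /* fall through */
    case 0: break;                      /* nothing left to add */
    }
    return a;
}
```

Entering at case 2, for example, executes only the last two additions, so no loop counter or per-byte comparison is needed.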

A hash for 64-bit machines

64-bit machines can hash faster and better with 64-bit arithmetic. Code for the mixing step for a hash for 64-bit machines is given in figure Hash64. The modifications needed to hash() are straightforward. It should put 24-byte blocks into three 8-byte registers and return an 8-byte result. The 64-bit golden ratio is 0x9e3779b97f4a7c13LL.

Hash64: C code for a mixing step for 64-bit machines

#define mix64(a,b,c) \
{ \
  a=a-b;  a=a-c;  a=a^(c>>43); \
  b=b-c;  b=b-a;  b=b^(a<<9); \
  c=c-a;  c=c-b;  c=c^(b>>8); \
  a=a-b;  a=a-c;  a=a^(c>>38); \
  b=b-c;  b=b-a;  b=b^(a<<23); \
  c=c-a;  c=c-b;  c=c^(b>>5); \
  a=a-b;  a=a-c;  a=a^(c>>35); \
  b=b-c;  b=b-a;  b=b^(a<<49); \
  c=c-a;  c=c-b;  c=c^(b>>11); \
  a=a-b;  a=a-c;  a=a^(c>>12); \
  b=b-c;  b=b-a;  b=b^(a<<18); \
  c=c-a;  c=c-b;  c=c^(b>>22); \
}

The whole 64-bit hash takes about 5m+41 instructions to hash m bytes.
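One way to apply the modifications described above is sketched here. It is an assumption-laden illustration, not the author's published 64-bit routine: the names u8, load64 and hash64 are hypothetical, the initialization simply copies the 32-bit version, and the tail is read with a small loop helper instead of a 23-case switch to keep the listing short.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t u8;   /* unsigned 8-byte type */

#define mix64(a,b,c) \
{ \
  a=a-b;  a=a-c;  a=a^(c>>43); \
  b=b-c;  b=b-a;  b=b^(a<<9); \
  c=c-a;  c=c-b;  c=c^(b>>8); \
  a=a-b;  a=a-c;  a=a^(c>>38); \
  b=b-c;  b=b-a;  b=b^(a<<23); \
  c=c-a;  c=c-b;  c=c^(b>>5); \
  a=a-b;  a=a-c;  a=a^(c>>35); \
  b=b-c;  b=b-a;  b=b^(a<<49); \
  c=c-a;  c=c-b;  c=c^(b>>11); \
  a=a-b;  a=a-c;  a=a^(c>>12); \
  b=b-c;  b=b-a;  b=b^(a<<18); \
  c=c-a;  c=c-b;  c=c^(b>>22); \
}

/* Read n (0..8) key bytes into one 64-bit word, low byte first */
u8 load64(const unsigned char *k, size_t n)
{
    u8 w = 0;
    for (size_t i = 0; i < n; ++i)
        w += (u8)k[i] << (8 * i);
    return w;
}

/* Sketch of a 64-bit hash built on mix64, following the 32-bit hash() */
u8 hash64(const unsigned char *k, size_t length, u8 initval)
{
    u8 a, b, c;
    size_t len = length;

    a = b = 0x9e3779b97f4a7c13ULL;   /* the 64-bit golden ratio */
    c = initval;

    while (len >= 24)                /* handle most of the key */
    {
        a += load64(k, 8);
        b += load64(k + 8, 8);
        c += load64(k + 16, 8);
        mix64(a, b, c);
        k += 24;  len -= 24;
    }

    /* last 0..23 bytes; the low byte of c is reserved for the length */
    c += (u8)length;
    if (len > 16) { c += load64(k + 16, len - 16) << 8;  len = 16; }
    if (len > 8)  { b += load64(k + 8, len - 8);         len = 8;  }
    a += load64(k, len);
    mix64(a, b, c);
    return c;
}
```

The loop helper costs more instructions per byte than the unrolled switch, so this sketch will not reach the 5m+41 figure; it trades speed for brevity.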

How these functions are used

These hashes work equally well on all types of input, including text, numbers, compressed data, counting sequences, and sparse bit arrays. No final mod, multiply, or divide is needed to further mix the result. If the hash value needs to be smaller than 32 (64) bits, this can be done by masking out the high bits, for example (hash&0x0000000f). The hash functions work best if the size of the hash table is a power of 2. If the hash table has more than 2³² (2⁶⁴) entries, this can be handled by calling the hash function twice with different initial initvals then concatenating the results. If the key consists of multiple strings, the strings can be hashed sequentially, passing in the hash value from the previous string as the initval for the next. Hashing a key with different initial initvals produces independent hash values.
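The three usage patterns in this paragraph (masking, concatenating two calls, chaining strings) might look as follows. To keep the block self-contained, hash() is restated in compact form with a loop helper; it is intended to compute the same values as figure Newhash. The helpers table_index, hash_wide and hash_strings, and the initvals 0 and 1, are assumptions made for illustration.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef uint32_t u4;

#define mix(a,b,c) \
{ \
  a=a-b;  a=a-c;  a=a^(c>>13); \
  b=b-c;  b=b-a;  b=b^(a<<8); \
  c=c-a;  c=c-b;  c=c^(b>>13); \
  a=a-b;  a=a-c;  a=a^(c>>12); \
  b=b-c;  b=b-a;  b=b^(a<<16); \
  c=c-a;  c=c-b;  c=c^(b>>5); \
  a=a-b;  a=a-c;  a=a^(c>>3); \
  b=b-c;  b=b-a;  b=b^(a<<10); \
  c=c-a;  c=c-b;  c=c^(b>>15); \
}

/* Read n (0..4) key bytes into one word, low byte first */
u4 load32(const unsigned char *k, u4 n)
{
    u4 w = 0;
    for (u4 i = 0; i < n; ++i) w += (u4)k[i] << (8 * i);
    return w;
}

/* Compact restatement of hash() from figure Newhash */
u4 hash(const unsigned char *k, u4 length, u4 initval)
{
    u4 a, b, c, len = length;
    a = b = 0x9e3779b9;
    c = initval;
    while (len >= 12)
    {
        a += load32(k, 4);  b += load32(k + 4, 4);  c += load32(k + 8, 4);
        mix(a, b, c);
        k += 12;  len -= 12;
    }
    c += length;                     /* low byte of c holds the length */
    if (len > 8) { c += load32(k + 8, len - 8) << 8;  len = 8; }
    if (len > 4) { b += load32(k + 4, len - 4);       len = 4; }
    a += load32(k, len);
    mix(a, b, c);
    return c;
}

/* Slot in a table of 2^bits entries: mask out the high bits */
u4 table_index(u4 h, unsigned bits)
{
    return bits >= 32 ? h : h & (((u4)1 << bits) - 1);
}

/* 64-bit value built from two calls with different initvals */
uint64_t hash_wide(const unsigned char *k, u4 length)
{
    return ((uint64_t)hash(k, length, 0) << 32) | hash(k, length, 1);
}

/* Key made of several strings: chain the previous hash in as initval */
u4 hash_strings(const char *const *s, size_t n, u4 initval)
{
    u4 h = initval;
    for (size_t i = 0; i < n; ++i)
        h = hash((const unsigned char *)s[i], (u4)strlen(s[i]), h);
    return h;
}
```

For a 16-entry table, table_index(h, 4) is the same operation as the (hash&0x0000000f) example in the text.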

The rest of this paper explains the design criteria for these hashfunctions.

Patterns

Table EMP: Is this data uniformly distributed?

EMPNO	ENAME	JOB	MGR	HIREDATE	SAL	COMM	DEPTNO
7369	SMITH	CLERK	7902	17-DEC-80	800		20
7499	ALLEN	SALESMAN	7698	20-FEB-81	1600	300	30
7521	WARD	SALESMAN	7698	22-FEB-81	1250	500	30
7566	JONES	MANAGER	7839	02-APR-81	2975		20
7654	MARTIN	SALESMAN	7898	28-SEP-81	1250	1400	30
7698	BLAKE	MANAGER	7539	01-MAY-81	2850		30
7782	CLARK	MANAGER	7566	09-JUN-81	2450		10
7788	SCOTT	ANALYST	7698	19-APR-87	3000		20
7839	KING	PRESIDENT		17-NOV-81	5000		10
7844	TURNER	SALESMAN	7698	08-SEP-81	1500		30
7876	ADAMS	CLERK	7788	23-MAY-87	1100	0	20
7900	JAMES	CLERK	7698	03-DEC-81	950		30
7902	FORD	ANALYST	7566	03-DEC-81	3000		20
7934	MILLER	CLERK	7782	23-JAN-82	1300		10

Table EMP is the standard toy database table, and a typical set of data ^[Oracle].

A few patterns stand out.

ENAME and JOB are the 26 uppercase ASCII letters arranged in different orders. The numbers in EMPNO can also appear in MGR. Values consist of common substrings arranged in different orders.
All the characters are ASCII, with the high bit of every byte set to 0. The EMPNO field is all numbers, while the ENAME field is all uppercase letters and spaces. Some rows even have identical values in some columns. Values often differ in only a few bits.
The only difference between zero and the null value may be that the lengths are different. Also, consider the two keys "a" "aaa" "a" and "aa" "a" "aa". Lengths must be considered part of the data, otherwise such keys are indistinguishable.
Another common pattern (not present in this example) is for keys to be nearly all zero, with only a few bits set.
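The point about lengths can be made concrete with two encodings of a multi-field key. Both helpers below are hypothetical sketches: encode_concat simply joins the fields, while encode_prefixed puts a one-byte length in front of each field (one possible way to make lengths part of the data, limited to fields of at most 255 bytes).

```c
#include <stddef.h>
#include <string.h>

/* Naive encoding: concatenate the fields. Returns the number of bytes
   written, or 0 if the output buffer is too small. */
size_t encode_concat(unsigned char *out, size_t cap,
                     const char *const *fields, size_t n)
{
    size_t used = 0;
    for (size_t i = 0; i < n; ++i) {
        size_t len = strlen(fields[i]);
        if (len > cap - used) return 0;
        memcpy(out + used, fields[i], len);
        used += len;
    }
    return used;
}

/* Length-prefixed encoding: one length byte, then the field bytes.
   Returns the number of bytes written, or 0 on overflow. */
size_t encode_prefixed(unsigned char *out, size_t cap,
                       const char *const *fields, size_t n)
{
    size_t used = 0;
    for (size_t i = 0; i < n; ++i) {
        size_t len = strlen(fields[i]);
        if (len > 255 || len + 1 > cap - used) return 0;
        out[used++] = (unsigned char)len;
        memcpy(out + used, fields[i], len);
        used += len;
    }
    return used;
}
```

For the keys "a" "aaa" "a" and "aa" "a" "aa", encode_concat produces the same five bytes "aaaaa" in both cases, so no hash function could tell them apart. encode_prefixed produces two different 8-byte strings.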

Human-selected and computer-generated sets of keys almost always match at least one of these patterns. Most mappings of keys to hash values map these sets of keys quite uniformly. Unfortunately, the hash functions that are fastest and easiest to write tend to be among the rare functions that do poorly on these sets of keys.

Hash Model

To aid in analysis, this paper will assume that hash functions are constructed using a hash model. Although most hashes fit this model, some (for example MD4 ^[MD4] and Rogaway's bucket hash ^[Rogaway]) do not. Hash functions have some internal state, and a permutation mix is used to mix this internal state. Another function, combine, is used to combine input blocks with the current internal state.

Model for hash functions

  initialize the internal state;
  for (each block of the input)
  {
    combine (the internal state, the current input block);
    mix( the internal state);
  }
  value = postprocess( the internal state );
  return (value);

Consider the hash function XORhash:

  unsigned char XORhash( const unsigned char *key, int len)
  {
    unsigned char hash;          /* unsigned, so that % never sees a negative value */
    int  i;
    for (hash=0, i=0; i<len; ++i) hash=hash^key[i];
    return (hash%101);           /* 101 is prime */
  }

(XORhash requires 5m+3 instructions to hash m bytes. Compare that to 6m+35 and 5m+41 for the two new hash functions.) The internal state of XORhash is the 1-byte value hash. Each key[i] is an input block. The combining step is ^. There is no mixing step (or, the mixing step is the identity function). The postprocessing step is (hash%101).

XORhash hashes the keys "bd" and "ce" to the same value ('b'^'d' and 'c'^'e' are both 6). The only difference between the two keys is the lowest-order bit of each byte. What is the problem here?

Funneling

A hash function is bad if it causes collisions when keys differ in only a few bits. This always happens when a number of input bits can change only a smaller number of bits in the internal state. Funneling formalizes this concept.

Let K (keys) be the set of all k-bit values and V (hash values) be the set of all v-bit values. Let a hash function h : K -> V be given. Bit i in 1..k affects bit j in 1..v if two keys differing only at input bit i will differ at output bit j about half the time.

Define h to be a funneling hash if there is some subset t of the input bits which can only affect bits u in the internal state, and |t| > |u| and v > |u|. h has a funnel of those t input bits into those u bits of the internal state. If a hash has a funnel of t bits into u, then u of those t bits can cancel out the effects of the other |t|-|u|. The set of keys differing only in the input bits of the funnel can produce no more than half that number of hash values. (Those 2^|t| keys can produce no more than 2^|u| out of 2^v hash values.) Differing in only a few bits is a common pattern in human and computer keys, so a funneling hash is seriously flawed.

For example, consider XORhash and 30-byte keys. The 30 lowest-order key bits, one from each byte, funnel into only the lowest-order bit of the internal state. Every set of a billion (2³⁰) keys which differ only in the lowest-order key bits maps into just 2 hash values, even though 101 hash values are available.
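This funnel is easy to observe by brute force. The sketch below restates XORhash with unsigned bytes and counts the distinct hash values over every key that differs from a base key only in the lowest-order bit of each byte. count_funnel_values is a hypothetical helper, and it is limited to short keys because enumerating all 2³⁰ variants of a 30-byte key would be slow.

```c
#include <stddef.h>

/* XORhash from the text, using unsigned bytes */
unsigned char XORhash(const unsigned char *key, size_t len)
{
    unsigned char hash = 0;
    for (size_t i = 0; i < len; ++i) hash ^= key[i];
    return hash % 101;               /* 101 is prime */
}

/* Count distinct XORhash values over all 2^len keys that differ from
   base only in the lowest-order bit of each byte (len <= 16 here). */
unsigned count_funnel_values(const unsigned char *base, size_t len)
{
    unsigned char seen[101] = {0};
    unsigned char key[16];
    unsigned distinct = 0;
    if (len > 16) return 0;
    for (unsigned long m = 0; m < (1UL << len); ++m) {
        for (size_t i = 0; i < len; ++i)      /* bit i of m flips byte i */
            key[i] = base[i] ^ (unsigned char)((m >> i) & 1);
        unsigned char h = XORhash(key, len);
        if (!seen[h]) { seen[h] = 1; ++distinct; }
    }
    return distinct;
}
```

For a 10-byte base key the loop visits 1024 keys, yet only 2 hash values are ever produced, because the flipped bits can only toggle the lowest-order bit of the state.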

Theorem Nofunnel:

Unless the mixing step compresses already-mixed data, a hash matching the hash model has no funnels if these conditions all hold:

The postprocessing step just selects v bits of the internal state to be the v-bit result,
When the input block is fixed, the combining step is a reversible mapping of the internal state to the internal state,
When the original internal state is fixed, the combining step is a one-to-one mapping (every input block value is mapped to a distinct internal state value),
The mixing function is reversible,
The mixing step causes every bit of the internal state to affect every bit of the result, and
The mixing step, when run either forwards or in reverse, causes every bit of the internal state to affect at least v bits of the internal state.
For every intermediate point in the mixing step, consider running the mixing step forward to that point from the previous combine, and backward to that point from the next combine. For every set Y of bits in the internal state in the middle, there is some set X of bits in the previous input block and Z in the next input block that affect those Y bits. For every Y with fewer than v bits, |X|+|Z| must be less than or equal to |Y|. (Note that if every bit in the previous and next block affects at least v bits in the center of the mixing step, this requirement is satisfied.)

Almost all nonlinear mixing steps do not compress already-mixed data.

The proof can be found at http://burtleburtle.net/bob/hash/funnels.html.

There is an efficient way of testing which input bits affect which output bits. For an (input bit, output bit) pair, the test is to find a key pair differing in only that input bit that changes that output bit, and another such key pair that does not change that output bit. That is really two tests, one to test that the output bit changes sometimes, and the other to test that the output bit sometimes stays the same.

How many key pairs need to be hashed? If the input bit changes the output bit with probability 1/2, the chance of the output bit not changing at all is 1/2 for 1 pair, 1/4 for 2 pairs, 1/8 for 3 pairs, and 2^-x for x pairs. If n tests are being checked, and each test passes with probability 1/2 on each pair, then after log(n) pairs (logarithms are base 2) each test has still not passed with probability 1/n, so the chance that every test has passed is only about (1 - 1/n)ⁿ = 1/e. However, if after 2 log(n) key pairs a test has still not passed, it is safe to say that the hash fails that test. Key pairs differing in a given input bit can be used to check all output bits simultaneously. Altogether, it is possible to show that every input bit affects every output bit by hashing about k log(2kv) key pairs, and it is possible to show that a particular (input bit, output bit) pair fails by hashing about 2 log(2kv) key pairs.
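A minimal sketch of this test is given below, under several assumptions: keys are fixed at 4 bytes (k = 32), only the low 8 output bits are examined (v = 8), random keys come from rand(), and the hash under test is a toy XOR of the key bytes. The names hash_fn, xor_bytes and count_passing_pairs are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define KEY_BYTES 4    /* k = 32 input bits */
#define OUT_BITS  8    /* v = 8 output bits examined */

typedef uint32_t (*hash_fn)(const unsigned char *key, size_t len);

/* Toy hash under test: XOR of all key bytes (XORhash without the final %) */
uint32_t xor_bytes(const unsigned char *key, size_t len)
{
    uint32_t h = 0;
    for (size_t i = 0; i < len; ++i) h ^= key[i];
    return h;
}

/* For every (input bit, output bit) pair, record whether flipping the
   input bit ever changed the output bit and whether it ever left it
   unchanged. Returns how many pairs did both, that is, passed the test. */
unsigned count_passing_pairs(hash_fn h, unsigned pairs, unsigned seed)
{
    unsigned passing = 0;
    srand(seed);
    for (unsigned i = 0; i < 8 * KEY_BYTES; ++i) {
        uint32_t changed = 0, unchanged = 0;  /* one flag per output bit */
        for (unsigned p = 0; p < pairs; ++p) {
            unsigned char k1[KEY_BYTES], k2[KEY_BYTES];
            for (unsigned b = 0; b < KEY_BYTES; ++b)
                k1[b] = (unsigned char)(rand() & 0xff);
            memcpy(k2, k1, KEY_BYTES);
            k2[i / 8] ^= (unsigned char)(1u << (i % 8));  /* flip input bit i */
            uint32_t delta = h(k1, KEY_BYTES) ^ h(k2, KEY_BYTES);
            changed   |= delta;     /* output bits that differed */
            unchanged |= ~delta;    /* output bits that stayed the same */
        }
        for (unsigned j = 0; j < OUT_BITS; ++j)
            if (((changed >> j) & 1) && ((unchanged >> j) & 1))
                ++passing;
    }
    return passing;
}
```

With k = 32 and v = 8, 2 log(2kv) = 18 pairs per input bit are enough to declare a failure. For xor_bytes, each input bit always changes exactly one output bit and never changes the others, so count_passing_pairs(xor_bytes, 18, 1) returns 0 out of 256 pairs. A hash in which every input bit affects every output bit should return 256.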

Characteristics

Characteristics are another flaw that causes hash functions to behave poorly when keys differ in only a few bits.

A delta is the difference (usually by XOR or subtraction) between two values. An input delta is the delta of two input keys, and an output delta is the delta of two hash values. A characteristic ^[Biham] is an input delta that produces a predictable output delta.

Suppose that a mixing function has a characteristic that occurs with probability 1, and it has an input delta with only t bits set and an output delta with only u bits set. If two keys differ in all t bits in block k₁ and all u bits in block k₂, they will produce the same internal state. That means any set of 2^(t+u) keys differing in only those bits will produce at most 2^(t+u-1) distinct internal states. This is not a funnel because each of those bits alone might affect all output bits. But it has the same effect as a funnel of t+u bits into t+u-1.
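The cancellation can be watched in a deliberately weak toy hash. Everything in this sketch is an assumption made for illustration: the state is one byte, combine is XOR, and mix is a rotate left by one bit, which turns an input delta of 0x01 into an output delta of 0x02 with probability 1 (so t = u = 1). Because this toy mix is linear, the funnel test would also catch it. A real nonlinear mix could hide such a characteristic from that test.

```c
#include <stdint.h>
#include <stddef.h>

/* Toy mixing step: rotate left by one bit. It is reversible, but a delta
   of 0x01 going in always comes out as a delta of 0x02. */
uint8_t rotl1(uint8_t x) { return (uint8_t)((x << 1) | (x >> 7)); }

/* Toy hash following the hash model: combine is XOR, mix is rotl1 */
uint8_t toy_hash(const uint8_t *key, size_t len)
{
    uint8_t state = 0;
    for (size_t i = 0; i < len; ++i) {
        state ^= key[i];        /* combine */
        state = rotl1(state);   /* mix */
    }
    return state;
}

/* Count distinct final states over the 4 two-byte keys that differ from
   base only in bit 0 of block 0 (t = 1) and bit 1 of block 1 (u = 1). */
unsigned count_states(const uint8_t base[2])
{
    uint8_t seen[256] = {0};
    unsigned distinct = 0;
    for (unsigned m = 0; m < 4; ++m) {
        uint8_t key[2];
        key[0] = base[0] ^ (uint8_t)(m & 1);          /* delta 0x01 in block 0 */
        key[1] = base[1] ^ (uint8_t)((m >> 1) << 1);  /* delta 0x02 in block 1 */
        uint8_t s = toy_hash(key, 2);
        if (!seen[s]) { seen[s] = 1; ++distinct; }
    }
    return distinct;
}
```

The 2^(t+u) = 4 keys produce only 2^(t+u-1) = 2 distinct states, because flipping both bits together leaves the state unchanged.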

Unlike funneling, there are no efficient tests known for checking for all characteristics. The test for funneling is actually a test for all k 1-bit deltas. There are (k choose x) x-bit deltas, and it quickly becomes impractical to test all of them.

Consider two keys, each of which is all zero except for one bit. They can be viewed as the same key with two substrings swapped, where the set bit was in one of the substrings. The boundaries of the substrings could be just about anything. By checking the behavior of all such pairs of keys, we can check if any two substrings are treated commutatively. This test is actually equivalent to checking for all characteristics with 2-bit input deltas.

The New Hashes Have No Funnels

The funneling test was used to choose the structure and constants for the mixing function of the 32-bit (64-bit) hash. By theorem Nofunnel, the new hashes have no funnels:

The postprocessing step just selects the 32 (64) bits of c to be the result.
The combining step (addition of the internal state and an input block) is reversible when the input block is fixed.
The combining step is reversible (which implies one-to-one) when the internal state is fixed.
Mix() is reversible.
Mix() causes every bit of a, b, and c to affect every bit of the result (c) with probability 1/2 ± 1/6.
Mix(), when run either forwards or in reverse, causes every bit of a, b, and c to affect at least 32 (80) bits of a, b, and c at least 1/4 (1/2) of the time.

This last point, actually, didn't hold for lookup2.c. There is a funnel of 32 bits to 31 bits, with those 32 bits distributed across two blocks. I backed up my computer, wrote a program that found this, then changed computers. So I don't have the code and don't remember where the funnel was. A funnel of 32 bits to 31 is awfully nonserious, though, so I let things be.

The nonlinear permutation mix() was also run forwards and backwards for many iterations. Two or more iterations always caused every bit of the internal state to affect every other bit of the internal state, so it appears that mix() does not unmix already-mixed data.

Before the final mixing step, the length is added to c. Nothing else is added to the bottom byte of c. The upper length bytes may overlap the final key block, but the upper length bytes cannot be changed without changing at least 256 bytes of the key, so this does not introduce a funnel.

Every 2-bit characteristic changes every bit as well. The least-affected output bit changes with probability 1/2 ± 28/100 (1/2 ± 1/6 for the 64-bit hash) for one 2-bit delta. The 3-bit deltas consisting of the high bits of a, b, c and the low bits of a, b, c were also tested; they changed every output bit with probability 1/2 ± 1/6. Tests were run both with random and with almost-all-zero keys.

The tests for funneling and simple characteristics show that the new hashes perform well when keys differ in only a few bits. They perform well on almost-all-zero keys. The tests for 2-bit characteristics also show that they do not treat any substrings commutatively. That covers all the common patterns, so these hashes should work well on all classes of keys.

Summary

If a hash function funnels a number of input bits into fewer bits in its internal state, fewer than the number of output bits, then keys differing in only those input bits will produce at most half of all possible hash values. Human-selected and computer-generated sets of keys often have keys differing in only a few bits, so hashes with funnels should be avoided. There is an efficient test for funnels.

Two new hash functions are given for hash table lookup. They produce full 32-bit or 64-bit results and allow hash table sizes to be a power of two. The new hashes are fast and reliable. They have no funnels, so they should work equally well on all types of keys.

Further code and analysis can be found on the web at http://burtleburtle.net/bob/hash/index.html.

Biham

Biham, E. and Shamir, A. Differential Cryptanalysis of Snefru, Khafre, REDOC-II, LOKI, and Lucifer (extended abstract). In Advances in Cryptology -- CRYPTO '91 Proceedings, pp 156-171

Fox

Fox, E., Heath, L., Chen, Q., and Daoud, A. Practical Minimal Perfect Hash Functions for Large Databases. Communications of the ACM 35,1 (January 1992) 105-121

Knuth6

Knuth, D. The Art of Computer Programming, Volume 3: Sorting and Searching, Chapter 6.4. Addison Wesley, 1973

Schneier

Schneier, B. Applied Cryptography. John Wiley & Sons, 1993

MD4

Rivest, R. The MD4 Message Digest Algorithm. In Advances in Cryptology -- CRYPTO '90 Proceedings. (1991) 303-311

Oracle

Linden, B. SQL Language Reference Manual, Version 7.0. Oracle Corporation, 1992. 1-15

Rogaway

Rogaway, P. Bucket Hashing and its Application to Fast Message Authentication. Proceedings of CRYPTO '95 (1995)
