Hash Functions for Hash Table Lookup

logicouter

Hash Functions for Hash Table Lookup

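A few days ago I was reading the memcached source code analysis written by the Sina technical team, which mentions that memcached hashes its key-value keys with the algorithm from this paper. I wanted to read the paper closely, so I translated it for later reference.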

This paper presents new hash functions for table lookup using 32-bit or 64-bit arithmetic. These hashes are fast and reliable. A framework is also given for evaluating hash functions.

Introduction

Hash tables [Knuth6] are a common data structure. They consist of an array (the hash table) and a mapping (the hash function). The hash function maps keys into hash values. Items stored in a hash table must have keys. The hash function maps the key of an item to a hash value, and that hash value is used as an index into the hash table for that item. This allows items to be inserted and located quickly.

What if an item hashes to a value that some other item has already hashed to? This is a collision. There are several strategies for dealing with collisions [Knuth6], but the strategies all make the hash tables slower than if no collisions occurred.
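To make the last two paragraphs concrete, here is a minimal Python sketch of a hash table that handles collisions with separate chaining, one common strategy. The names `toy_hash` and `ChainedTable` are hypothetical, and the hash (a plain sum of byte values) is deliberately weak so that a collision is easy to provoke. It is not one of the paper's hashes.

```python
def toy_hash(key: bytes) -> int:
    """Deliberately weak hash: sum of the byte values (illustration only)."""
    return sum(key)


class ChainedTable:
    """Hash table that keeps colliding items in a per-slot list (chaining)."""

    def __init__(self, size: int = 8):
        self.slots = [[] for _ in range(size)]

    def _index(self, key: bytes) -> int:
        # The hash value, reduced to the table size, is the slot index.
        return toy_hash(key) % len(self.slots)

    def insert(self, key: bytes, value) -> None:
        chain = self.slots[self._index(key)]
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:
                chain[i] = (key, value)  # overwrite an existing key
                return
        chain.append((key, value))

    def lookup(self, key: bytes):
        # Every collision adds an entry to walk past, which is the slowdown.
        for existing_key, value in self.slots[self._index(key)]:
            if existing_key == key:
                return value
        return None


table = ChainedTable()
table.insert(b"ab", 1)
table.insert(b"ba", 2)  # same bytes in another order: same sum, same slot
```

Both keys land in the same slot, so a lookup for either one has to compare against more than one stored key. That extra work is the cost of a collision.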

If the actual keys to be used are known before the hash function is chosen, it is possible to choose a hash function that causes no collisions. This is known as a perfect hash function [Fox]. This paper will deal with the other case, where the actual keys are a small subset of all possible keys.

For example, if a hash function maps 30-byte keys into a 32-bit output, it maps 2²⁴⁰ possible keys into 2³² possible hash values. Less than 2³² actual keys will be used. With a ratio of 2²⁰⁸ possible keys per hash value, it is impossible to guarantee that the actual keys will have no collisions.
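The arithmetic in this example can be checked directly, since Python integers have arbitrary precision. The helper name `keys_per_hash_value_log2` is hypothetical, and the sketch assumes fixed-length keys and a hash no wider than the key.

```python
def keys_per_hash_value_log2(key_bytes: int, hash_bits: int) -> int:
    """log2 of (possible keys / possible hash values) for fixed-length keys."""
    possible_keys = 2 ** (8 * key_bytes)    # 30 bytes -> 2**240 keys
    possible_hashes = 2 ** hash_bits        # 32 bits  -> 2**32 values
    ratio = possible_keys // possible_hashes
    return ratio.bit_length() - 1           # exact: ratio is a power of two


print(keys_per_hash_value_log2(30, 32))  # prints 208
```

Because 2²⁰⁸ possible keys share each hash value, any two actual keys may come from the same group, no matter how the hash function is chosen.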

If the actual keys being hashed were uniformly distributed, selecting the first v bits of the input to be the v-bit hash value would make a wonderful hash function. It is fast and it hashes an equal number of possible keys to each hash value. Unfortunately, the actual keys supplied by humans and computers are seldom uniformly distributed. Hash functions must be more clever than that.
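A short Python sketch shows why the "first v bits" hash works on uniform keys but not on patterned ones. The function name `first_bits_hash` and both key sets are assumptions made for illustration: the uniform keys come from a seeded pseudo-random generator, and the patterned keys imitate identifiers that share a common prefix.

```python
import random


def first_bits_hash(key: bytes, v: int) -> int:
    """Return the first v bits of key as an integer (needs v <= 8 * len(key))."""
    nbytes = (v + 7) // 8                       # bytes needed to cover v bits
    prefix = int.from_bytes(key[:nbytes], "big")
    return prefix >> (8 * nbytes - v)           # drop the surplus low bits


rng = random.Random(0)
uniform_keys = [bytes(rng.getrandbits(8) for _ in range(8)) for _ in range(1000)]
patterned_keys = [b"user%04d" % i for i in range(1000)]  # all start with "user"

# Count how many of the 256 possible 8-bit hash values each key set reaches.
distinct_uniform = len({first_bits_hash(k, 8) for k in uniform_keys})
distinct_patterned = len({first_bits_hash(k, 8) for k in patterned_keys})
```

The uniform keys spread over most of the 256 possible values, while every patterned key hashes to the byte value of `u`, so all 1000 of them collide.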

This paper is organized as follows. Hash Functions for Table Lookup presents the new 32-bit and 64-bit hashes. Patterns lists some patterns common in human-selected and computer-generated keys. A Hash Model names common pieces of hash functions. Funneling describes a flaw in hash functions and how to detect that flaw. Characteristics are a more subtle flaw. The last section shows that the new hashes have no funnels.