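A recent project of mine involved large-scale data processing and relied on hash functions, so I spent some time testing and summarizing the performance and collision probability of several hash functions. This post reports the results and recommends a few string hash functions that did well.
The purpose of hashing here is to compress a long string into a 32-bit, 64-bit, or 128-bit hash code in order to save storage space. The hash function plays the key role in that process.
In this experiment I compared a number of common hash functions and tested them on fairly large data sets of more than 10 million strings each.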
In the table, the last row is time; every other entry is the collision probability of the hash function in that column.
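SET 1: strings of upper- and lower-case letters and digits, lengths uniformly distributed over 3-12, 15 million samples.
SET 2: strings of lower-case letters only, lengths uniformly distributed over 3-12, 15 million samples.
SET 3: strings of common ASCII characters (codes 32-127), lengths uniformly distributed over 10-30, 11 million samples.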
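The post does not say exactly how the collision probability was computed. One plausible definition, sketched below, is the fraction of hash values that repeat an earlier value, that is, (n - number of distinct hash values) / n. The helper name `collision_rate` is hypothetical, and the sketch assumes the input strings were all distinct, so that every repeated hash value is a real collision.

```c
#include <stdlib.h>
#include <stddef.h>

/* Comparison callback for qsort over unsigned int hash values. */
static int cmp_uint(const void *a, const void *b)
{
    unsigned int x = *(const unsigned int *)a;
    unsigned int y = *(const unsigned int *)b;
    return (x > y) - (x < y);
}

/* Hypothetical helper (not from the original post).
   Sorts `hashes` in place, then counts entries equal to their
   predecessor. The result is (n - distinct values) / n.
   Assumes the hashed strings were all distinct. */
double collision_rate(unsigned int *hashes, size_t n)
{
    size_t i;
    size_t repeats = 0;

    if (n == 0)
        return 0.0;

    qsort(hashes, n, sizeof hashes[0], cmp_uint);

    for (i = 1; i < n; i++)
    {
        if (hashes[i] == hashes[i - 1])
            repeats++;
    }

    return (double)repeats / (double)n;
}
```

Sorting keeps memory use at one `unsigned int` per string, which for 15 million samples is about 60 MB; a hash set would also work but needs more memory per entry.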
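Finally, I measured timing in a Release build; this is the last row of the table above, recorded as the average time per hash call. Across the functions the speed differences are small, so when choosing a hash function, a low collision rate should be the first criterion.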
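The timing harness itself is not shown in the post. A minimal sketch of how an average time per hash call could be measured with the standard `clock()` function follows. The names `hash_fn` and `avg_seconds_per_hash` are hypothetical, and `BKDRHash` is repeated here only so the block compiles on its own.

```c
#include <stddef.h>
#include <time.h>

/* Signature shared by the 32-bit hash functions in this post. */
typedef unsigned int (*hash_fn)(const char *);

/* BKDR hash, repeated here so the sketch is self-contained. */
unsigned int BKDRHash(const char *str)
{
    unsigned int seed = 131;
    unsigned int hash = 0;

    while (*str)
        hash = hash * seed + (*str++);

    return (hash & 0x7FFFFFFF);
}

/* Hypothetical helper (not from the original post).
   Hashes every key once and returns the average CPU time per call
   in seconds. The XOR of all results is written to *sink so the
   compiler cannot discard the loop in an optimized build. */
double avg_seconds_per_hash(hash_fn fn, const char *const *keys,
                            size_t n, unsigned int *sink)
{
    clock_t start, end;
    unsigned int acc = 0;
    size_t i;

    if (n == 0)
        return 0.0;

    start = clock();
    for (i = 0; i < n; i++)
        acc ^= fn(keys[i]);
    end = clock();

    *sink = acc;
    return (double)(end - start) / CLOCKS_PER_SEC / (double)n;
}
```

`clock()` has coarse resolution on many systems, so a measurement like this is only meaningful over millions of keys, which matches the data set sizes used here.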
Test system configuration:
CPU: AMD 945 X4  MEMORY: 4 GB  SYSTEM: Windows Vista Ultimate (32-bit)
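Recommendation: the algorithms marked in red in the table are the ones that performed well. Judging by these statistics and by published commentary, BKDR, SDBM, and FNV_1 perform well on large-scale string hashing and are the ones I recommend. In addition, if the data set is at the million scale or larger, I suggest using a 64-bit hash function, which effectively avoids an excessively high collision probability. (In my tests, the collision rate of a 64-bit hash on 10 million strings was below 10e-6.)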
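The advantage of 64 bits can be estimated with the standard birthday approximation: if n keys are hashed uniformly into 2^bits values, the expected number of colliding pairs is roughly n(n - 1) / 2 / 2^bits. The sketch below computes that estimate. The function name `expected_colliding_pairs` is hypothetical, and the formula assumes an idealized uniform hash, so it is a lower bound on what a real function achieves rather than a reproduction of the measured figures.

```c
/* Hypothetical helper (not from the original post).
   Birthday approximation: expected number of colliding pairs when
   n keys are hashed uniformly at random into 2^bits values. */
double expected_colliding_pairs(double n, int bits)
{
    double space = 1.0;
    int i;

    /* space = 2^bits, computed without <math.h> */
    for (i = 0; i < bits; i++)
        space *= 2.0;

    return n * (n - 1.0) / 2.0 / space;
}
```

For 10 million keys, this gives about 2.3 × 10^4 expected colliding pairs at 31 bits (the functions listed in this post mask their result with 0x7FFFFFFF) and about 2.7 × 10^-6 at 64 bits.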
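The full code is below; the comments give a brief summary of each algorithm, and interested readers can study them in detail. This is my first blog post, so please point out anything I got wrong.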
/* A simple hash function from Robert Sedgewick's "Algorithms in C" book. */
unsigned int RSHash(const char *str)
{
    unsigned int b = 378551;
    unsigned int a = 63689;
    unsigned int hash = 0;

    while (*str)
    {
        hash = hash * a + (*str++);
        a *= b;   /* the multiplier changes on every character */
    }

    return (hash & 0x7FFFFFFF);
}

/*
This hash algorithm is based on work by Peter J. Weinberger of AT&T Bell Labs.
The book "Compilers: Principles, Techniques, and Tools" by Aho, Sethi and Ullman
recommends the use of hash functions that employ the hashing methodology found
in this particular algorithm.
*/
unsigned int PJWHash(const char *str)
{
    unsigned int BitsInUnsignedInt = (unsigned int)(sizeof(unsigned int) * 8);
    unsigned int ThreeQuarters = (unsigned int)((BitsInUnsignedInt * 3) / 4);
    unsigned int OneEighth = (unsigned int)(BitsInUnsignedInt / 8);
    unsigned int HighBits = (unsigned int)(0xFFFFFFFF) << (BitsInUnsignedInt - OneEighth);
    unsigned int hash = 0;
    unsigned int test = 0;

    while (*str)
    {
        hash = (hash << OneEighth) + (*str++);
        if ((test = hash & HighBits) != 0)
        {
            /* fold the top bits back into the low part, then clear them */
            hash = ((hash ^ (test >> ThreeQuarters)) & (~HighBits));
        }
    }

    return (hash & 0x7FFFFFFF);
}

/* A bitwise hash function written by Justin Sobel. */
unsigned int JSHash(const char *str)
{
    unsigned int hash = 1315423911;

    while (*str)
    {
        hash ^= ((hash << 5) + (*str++) + (hash >> 2));
    }

    return (hash & 0x7FFFFFFF);
}

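/*
BKDR hash, named after Brian Kernighan and Dennis Ritchie, whose book
"The C Programming Language" presents a hash of this multiply-and-add form.
*/
unsigned int BKDRHash(const char *str)
{
    unsigned int seed = 131; /* 31 131 1313 13131 131313 etc. */
    unsigned int hash = 0;

    while (*str)
    {
        hash = hash * seed + (*str++);
    }

    return (hash & 0x7FFFFFFF);
}
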
/*
A well-known hash algorithm on Unix systems, also used by Microsoft in the
hash_map implementation of VC++ 2005.
Details: http://www.isthe.com/chongo/tech/comp/fnv/#FNV-param
FNV-1 multiplies first, then XORs in the next character.
*/
unsigned int FNV_1_Hash(const char *str)
{
    unsigned int hash = 2166136261u;  /* offset_basis */
    unsigned int prime = 16777619u;   /* FNV_PRIME_32 */

    while (*str != '\0')
    {
        hash *= prime;
        hash ^= *str++;
    }

    return (hash & 0x7FFFFFFF);
}

/*
FNV-1a variant of the function above: same constants, but the character is
XORed in first and the multiplication follows.
Details: http://www.isthe.com/chongo/tech/comp/fnv/#FNV-param
*/
unsigned int FNV_1a_Hash(const char *str)
{
    unsigned int hash = 2166136261u;  /* offset_basis */
    unsigned int prime = 16777619u;   /* FNV_PRIME_32 */

    while (*str != '\0')
    {
        hash ^= *str++;
        hash *= prime;
    }

    return (hash & 0x7FFFFFFF);
}

/*
An algorithm produced by Professor Daniel J. Bernstein and shown first to the
world on the usenet newsgroup comp.lang.c. It is one of the most efficient
hash functions ever published.
*/
unsigned int DJBHash(const char *str)
{
    unsigned int hash = 5381;

    while (*str)
    {
        /* equivalent to: hash = hash * 33 + (*str++); */
        hash += (hash << 5) + (*str++);
    }

    return (hash & 0x7FFFFFFF);
}

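/* Variant of DJBHash that XORs the character in instead of adding it. */
unsigned int DJB_2_Hash(const char *s)
{
    unsigned int hashvalue = 5381;

    while (*s != '\0')
    {
        /* '*' binds tighter than '^'; the parentheses only make that explicit */
        hashvalue = (hashvalue * 33) ^ (*s);
        s++;
    }

    return (hashvalue & 0x7FFFFFFF);
}
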
/*
This is the algorithm of choice which is used in the open source SDBM project.
The hash function seems to have a good overall distribution for many different
data sets. It seems to work well in situations where there is a high variance
in the MSBs of the elements in a data set.
*/
unsigned int SDBM_Hash(const char *str)
{
    unsigned int hash = 0;

    while (*str)
    {
        /* equivalent to: hash = 65599 * hash + (*str++); */
        hash = (*str++) + (hash << 6) + (hash << 16) - hash;
    }

    return (hash & 0x7FFFFFFF);
}

/* An algorithm produced by Arash Partow. */
unsigned int APHash(const char *str)
{
    unsigned int hash = 0;
    int i;

    /* alternate between two mixing steps on even and odd positions */
    for (i = 0; *str; i++)
    {
        if ((i & 1) == 0)
        {
            hash ^= ((hash << 7) ^ (*str++) ^ (hash >> 3));
        }
        else
        {
            hash ^= (~((hash << 11) ^ (*str++) ^ (hash >> 5)));
        }
    }

    return (hash & 0x7FFFFFFF);
}
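
The recommendation earlier in this post is to move to 64-bit hashes once the data set reaches the million scale, but the listing contains only 32-bit functions. Below is a rough sketch of what 64-bit variants might look like; it is not the code used in the tests, which the post does not show. `BKDRHash64` and `FNV_1_Hash64` are hypothetical names. The FNV constants are the standard 64-bit offset basis and prime from the FNV parameter page linked in the comments above. The full 64 bits are returned without masking, and each character is read as `unsigned char`, an assumption that only matters for bytes above 127.

```c
#include <stdint.h>

/* Hypothetical 64-bit variant of BKDRHash: same seed, wider state. */
uint64_t BKDRHash64(const char *str)
{
    const uint64_t seed = 131;
    uint64_t hash = 0;

    while (*str)
        hash = hash * seed + (unsigned char)(*str++);

    return hash;
}

/* Hypothetical 64-bit variant of FNV_1_Hash, using the standard
   64-bit FNV offset basis and prime. */
uint64_t FNV_1_Hash64(const char *str)
{
    uint64_t hash = UINT64_C(14695981039346656037); /* 64-bit offset_basis */
    const uint64_t prime = UINT64_C(1099511628211); /* FNV_PRIME_64 */

    while (*str)
    {
        hash *= prime;
        hash ^= (unsigned char)(*str++);
    }

    return hash;
}
```

Storing a 64-bit code doubles the space per key compared with the 32-bit functions, which is the trade-off against the lower collision rate.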