数据结构-- 哈希表

最新推荐文章于 2021-07-17 11:29:37 发布

chaooooooo

最新推荐文章于 2021-07-17 11:29:37 发布

阅读量725

点赞数

(2010-07-16 23:05)

分类： C/C++编程

Andrew Huang bluedrum@163.com

一.哈希表

------------------------------------------------------------------------------

哈希是hash的音译，有时也意译成散列。它定义了一种将字符组成的字符串转换为固定长度(一般是更短长度)的数值或索引值的方法，称为散列法，也叫哈希法。由于通过更短的哈希值比用原始值进行数据库搜索更快，这种方法一般用来在数据库中建立索引并进行搜索，同时还用在各种解密算法中。

正因为哈希有快速检索和加密两大功能。因此在编程中大量应用哈希算法。

哈希的原理比较容易懂。我们看一个Hash表例子：

有一组数据包括用户名字、电话、住址等，为了快速的检索，我们可以利用名字作为关键码，hash规则就是把名字中每一个字的拼音的第一个字母拿出来，把该字母在26个字母中的顺序值取出来加在一块作为改记录的地址。比如张三，就是z+s＝26+19＝45。就是把张三存在地址为45处。

但是这样存在一个问题，比如假如有个用户名字叫做：周四，那么计算它的地址时也是z+s＝45，这样它与张三就有相同的地址，这就是冲突，也叫作碰撞！

二.不同系统哈希算法

------------------------------------------------------

Hash函数是算法任意的，但是一个好的Hash函数有两个特点。运算快，冲突值少。

因此很多系统内置哈希转换算法。我们看一下各个系统自带的哈希函数。

2.1 PHP中出现的字符串Hash函数

static unsigned long hashpjw(char *arKey, unsigned int nKeyLength) { unsigned long h = 0, g; char *arEnd=arKey+nKeyLength; while (arKey < arEnd) { h = (h << 4) + *arKey++; if ((g = (h & 0xF0000000))) { h = h ^ (g >> 24); h = h ^ g; } } return h; }

2.2 OpenSSL中出现的字符串 Hash函数

unsigned long OpenSSL_Hash1(char *str) { int i,l; unsigned long ret=0; unsigned short *s; if (str == NULL) return(0); l=(strlen(str)+1)/2; s=(unsigned short *)str; for (i=0; i ret^=(s[i]<<(i&0x0f)); return(ret); }

/* The following hash seems to work very well on normal text strings * no collisions on /usr/dict/words and it distributes on %2^n quite * well, not as good as MD5, but still good. */ unsigned long OpenSSL_Hash2(const char *c) { unsigned long ret=0; long n; unsigned long v; int r; if ((c == NULL) || (*c == '\0')) return(ret); /* unsigned char b[16]; MD5(c,strlen(c),b); return(b[0]|(b[1]<<8)|(b[2]<<16)|(b[3]<<24)); */ n=0x100; while (*c) { v=n|(*c); n+=0x100; r= (int)((v>>2)^v)&0x0f; ret=(ret(32-r)); ret&=0xFFFFFFFFL; ret^=v*v; c++; } return((ret>>16)^ret); }

2.3 MySql中出现的字符串Hash函数

/* Calc hash for a key */ static uint calc_hashnr(const byte *key,uint length) { register uint nr=1, nr2=4; while (length--) { nr^= (((nr & 63)+nr2)*((uint) (uchar) *key++))+ (nr << 8); nr2+=3; } return((uint) nr); } /* Calc hash for a key, case indepenently */ static uint calc_hashnr_caseup(const byte *key,uint length) { register uint nr=1, nr2=4; while (length--) { nr^= (((nr & 63)+nr2)*((uint) (uchar) toupper(*key++)))+ (nr << 8); nr2+=3; } return((uint) nr); }

Mysql中对字符串 Hash函数还区分了大小写，我们的测试中使用不区分大小写的字符串 Hash函数，另外我们将上面的两个函数分别记为MYSQL_ Hash1和MYSQL_ Hash2。

/* * Fowler/Noll/Vo hash * * The basis of the hash algorithm was taken from an idea sent by email to the * IEEE Posix P1003.2 mailing list from Phong Vo (kpv@research.att.com) and * Glenn Fowler (gsf@research.att.com). Landon Curt Noll (chongo@toad.com) * later improved on their algorithm. * * The magic is in the interesting relationship between the special prime * 16777619 (2^24 + 403) and 2^32 and 2^8. * * This hash produces the fewest collisions of any that we've seen so * far, and works well on both numbers and strings. */ uint calc_hashnr(const byte *key, uint len) { const byte *end=key+len; uint hash; for (hash = 0; key < end; key++) { hash *= 16777619; hash ^= (uint) *(uchar*) key; } return (hash); } uint calc_hashnr_caseup(const byte *key, uint len) { const byte *end=key+len; uint hash; for (hash = 0; key < end; key++) { hash *= 16777619; hash ^= (uint) (uchar) toupper(*key); } return (hash); }

2.4 另一个经验字符串Hash函数

unsigned int hash(char *str) { register unsigned int h; register unsigned char *p; for(h=0, p = (unsigned char *)str; *p ; p++) h = 31 * h + *p; return h; }

2.5 各个HASH算法测试比较

以下分别是对两个不同文本文件中的全部不重复单词构造 Hash链表的测试结果，测试结果中函数调用次数放大了100倍，相应的函数调用时间也放大了100倍。

从上表可以看出，这些经典软件虽然构造字符串Hash函数的方法不同，但是它们的效率都是不错的，相互之间差距很小，读者可以参考实际情况从其中借鉴使用。

三.哈希表冲突解决办法

--------------------------------------------------------------------------

哈希算法不可避免要碰到两组值得到HASH值一样的情况,如何解决冲突，有常见几种算法

１、开放定址法.

即按如下公式重新把HASH值再次做运算

hi=(h(key)+di) mod m i=1,2,...,k(k<=m-1).

其中di值取 1,2,3....

2、二次探查法(quadratic probing)

上述公式的值中di可以取1,-1,2,-2,4,-4...

3.双重散列法。

即做两个HASH函数，求值后再次运算.两次Hash重复值可能性大大降低

hi=(h(key)+i*h1(key)) %m 其中 0≤i≤m-1

4.、拉链法
　　　拉链法解决冲突的做法是：将所有关键字为同义词的结点链接在同一个单链表中。若选定的散列表长度为 m，则可将散列表定义为一个由m个头指针组成的指针数组t[0..m-1]。凡是散列地址为i的结点，均插入到以t为头指针的单链表中。

四，暴雪游戏的hash算法

-----------------------------------------------------------------------

魔兽( Blizzard)文件打包管理器的相关算法，即MPQ文件索引表的生成和查找采用哈希表进行，在冲突方面的处理方面，采用线性探测再散列。在添加和查找过程中进行了三次哈希，第一个哈希值用来查找，后两个哈希值用来校验，这样可以大大减少冲突的几率.

哈希算法

unsigned long HashString(char *lpszFileName, unsigned long dwHashType) { unsigned char *key = (unsigned char *)lpszFileName; unsigned long seed1 = 0x7FED7FED, seed2 = 0xEEEEEEEE; int ch; while(*key != 0) { ch = toupper(*key++); seed1 = cryptTable[(dwHashType << 8) + ch] ^ (seed1 + seed2); seed2 = ch + seed1 + seed2 + (seed2 << 5) + 3; } return seed1; }

其冲突查找算法：这个方法非常快，没有使用链表，而是用顺延的方式

1. 计算出字符串的三个哈希值（一个用来确定位置，另外两个用来校验)
2. 察看哈希表中的这个位置
3. 哈希表中这个位置为空吗？如果为空，则肯定该字符串不存在，返回
4. 如果存在，则检查其他两个哈希值是否也匹配，如果匹配，则表示找到了该字符串，返回
5. 移到下一个位置，如果已经越界，则表示没有找到，返回
6. 看看是不是又回到了原来的位置，如果是，则返回没找到
7. 回到3

int GetHashTablePos(char *lpszString, MPQHASHTABLE *lpTable, int nTableSize) { const int HASH_OFFSET = 0, HASH_A = 1, HASH_B = 2; int nHash = HashString(lpszString, HASH_OFFSET); int nHashA = HashString(lpszString, HASH_A); int nHashB = HashString(lpszString, HASH_B); int nHashStart = nHash % nTableSize, nHashPos = nHashStart; while (lpTable[nHashPos].bExists) { //比较的是Table中存储的另外两个Hash函数的值，Table中不存储字符串？ if (lpTable[nHashPos].nHashA == nHashA && lpTable[nHashPos].nHashB == nHashB) return nHashPos; else //冲突处理 nHashPos = (nHashPos + 1) % nTableSize; if (nHashPos == nHashStart) break; } return -1; //Error value }

五。一些标准哈希算法

参考资料

---------------------------------

关于暴雪中Hash引自 http://blog.csdn.net/eaglewood2005/archive/2009/07/30/4394583.aspx
http://blog.csdn.net/goodluckyxl/archive/2005/10/10/498558.aspx
关于经典的字符串Hash算法 http://blog.csai.cn/user3/50125/archives/2009/35639.html
常见的字符串Hash 算法 http://www.cmykrgb123.cn/blog/string-hash-compare/

chaooooooo

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
数据结构-- 哈希表

(2010-07-16 23:05) 分类： C/C++编程 Andrew Huang bluedrum@163.com一.哈希表------------------------------------------------------------------------------哈希是hash的音译，有时也意译成散列。它定义了一种将字符组成
复制链接

扫一扫