Multi-level hash table

juary_01

Category: Network Programming

For the concrete code of a multi-level hash table, see "An implementation of a multi-level hash table using shared memory": http://webcache.googleusercontent.com/search?q=cache:GEiOeyiYdXEJ:www.cppblog.com/lmlf001/archive/2007/09/08/31858.html+&cd=2&hl=zh-CN&ct=clnk

This post mainly covers the structure of the multi-level hash table.

1. A multi-level hash table is really a jagged array. It looks like this:
■■■■■■■■■■■■■■■
■■■■■■■■■■■■■
■■■■■■■■■■
■■■■■■
■■■

Each row is one level. The upper rows hold more elements, and each row below holds fewer.
The number of elements in every row is a prime.

2. Each node of the array stores the data. The first four bytes of a node hold the int key, or the hash_code.

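3. When the table is created, the caller passes parameters that say how many levels there are and the maximum number of elements per level.
So how many elements should each level actually get? Judging from the code comments, the sizes are found with an algorithm based on the 6n+1 / 6n-1 property of primes (every prime above 3 has one of these two forms).
For example, with at most 1000 elements per level and 10 levels, the algorithm picks the ten largest primes below 1000, in descending order, as the level sizes. The ten primes it produces are: 997 991 983 977 971 967 953 947 941 937.
So although the array is jagged, the levels do not differ much in size.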
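
To make the size selection concrete, here is a minimal sketch that picks the largest primes below a cap. It is not the getMode routine from the original code: the names isPrime and levelSizes are made up for this example, and it uses plain trial division by 2, 3 and numbers of the form 6k-1 / 6k+1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Trial division by 2, 3 and then divisors of the form 6k-1 and 6k+1.
// (Hypothetical helper, not part of the original hash_shm class.)
bool isPrime(unsigned long n) {
    if (n < 2) return false;
    if (n < 4) return true;                       // 2 and 3
    if (n % 2 == 0 || n % 3 == 0) return false;
    for (unsigned long d = 5; d * d <= n; d += 6) {
        if (n % d == 0 || n % (d + 2) == 0) return false;
    }
    return true;
}

// Returns up to `levels` primes strictly below `maxPerLevel`, largest first.
std::vector<unsigned long> levelSizes(unsigned long maxPerLevel, std::size_t levels) {
    std::vector<unsigned long> sizes;
    // n-- > 2 tests the old value, so the body sees maxPerLevel-1 down to 2.
    for (unsigned long n = maxPerLevel; n-- > 2 && sizes.size() < levels; ) {
        if (isPrime(n)) sizes.push_back(n);
    }
    assert(sizes.size() <= levels);               // never more than requested
    return sizes;
}
```

Calling levelSizes(1000, 10) returns the same ten primes listed above, starting at 997 and ending at 937. If fewer primes exist below the cap than levels requested, the vector simply comes back shorter, so a caller should check its size.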

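4. Lookup:

Take the key modulo the size of the first level and look at that slot. If the slot is empty, return "not found" right away; if it holds this key, return the slot.
If the slot holds a different key, that is a hash collision, so move on to the second level.
Repeat until the key is found or the levels run out.
Note that find() in the code further down does not stop at an empty slot. It checks every level, which keeps a key reachable after remove() has emptied a slot on an earlier level.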
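
The lookup and the matching insert can be sketched in ordinary heap memory, without the shared-memory part. This is only an illustration under a few assumptions: the class name MultiLevelTable and its members are invented here, values are plain int, and key 0 marks an empty slot as in the code further down. Like that code, find checks every level instead of stopping at the first empty slot.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-memory sketch of a multi-level hash table (hypothetical class).
class MultiLevelTable {
public:
    explicit MultiLevelTable(const std::vector<unsigned long>& levelSizes) {
        assert(!levelSizes.empty());
        for (unsigned long size : levelSizes) {
            assert(size > 0);
            levels_.push_back(std::vector<Slot>(size, Slot{0, 0}));
        }
    }

    // Returns a pointer to the stored value, or nullptr if the key is absent.
    int* find(unsigned long key) {
        if (key == 0) return nullptr;             // 0 is reserved for "empty"
        for (auto& level : levels_) {
            Slot& slot = level[key % level.size()];
            if (slot.key == key) return &slot.value;
            // Another key, or a hole left by remove(): try the next level.
        }
        return nullptr;
    }

    // Returns 1 if the key exists, 0 on success, -1 if all candidate slots are taken.
    int insert(unsigned long key, int value) {
        if (key == 0) return -1;
        if (find(key) != nullptr) return 1;
        for (auto& level : levels_) {
            Slot& slot = level[key % level.size()];
            if (slot.key == 0) { slot.key = key; slot.value = value; return 0; }
        }
        return -1;
    }

    // Empties the slot holding the key; returns false if the key is absent.
    bool remove(unsigned long key) {
        if (key == 0) return false;
        for (auto& level : levels_) {
            Slot& slot = level[key % level.size()];
            if (slot.key == key) { slot.key = 0; return true; }
        }
        return false;
    }

private:
    struct Slot { unsigned long key; int value; };
    std::vector<std::vector<Slot>> levels_;
};
```

With level sizes 3 and 2, the keys 1, 7 and 13 all map to slot 1 on both levels, so the first two inserts succeed and the third returns -1. After removing key 1, key 7 on the second level is still found, which is why the search must not stop at the now empty first-level slot.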

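5. Benefits:
1. Hash collisions are handled very simply.
2. There are several buckets, so space utilization is high; you do not need one very large bucket to keep collisions down.
3. The space can grow dynamically by appending a new level, without affecting the existing data.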

The cleaned-up source is here: https://docs.google.com/open?id=0B2ZH5H4iY-oLSXdPSDVoVFBoSVU

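An implementation of a multi-level hash table using shared memory

A running service often writes its data into shared memory so that, when the process needs to restart, it can read the data straight back from shared memory. Also, if the service process dies for some reason, the data in shared memory is still there, which reduces the damage. For background on shared memory, please Google it. What follows is a hash table that stores and retrieves data in shared memory. It uses several levels of storage, each indexed by taking the key modulo that level's size. See the code below for details: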
http://lmlf001.blog.sohu.com/

// hash_shm.h
#ifndef _STORMLI_HASH_SHM_H_
#define _STORMLI_HASH_SHM_H_

#include<iostream>
#include<cstdlib>
#include<cstring>     // memset
#include<cmath>
#include<sys/ipc.h>
#include<sys/shm.h>
using namespace std;

template<typename valueType,unsigned long maxLine, int lines>
class hash_shm
{
public:
     int find(unsigned long _key);     // if _key in the table,return 0,and set lastFound the position,otherwise return -1
     int remove(unsigned long _key);     // if _key not in the table,return-1,else remove the node,set the node key 0 and return 0

     // insert node into the table,if the _key exists,return 1,if insert success,return 0;and if fail return -1
     int insert(unsigned long _key, const valueType &_value);
     void clear();         // remove all the data

public:     // some statistic function
     double getFullRate() const;         // the rate of the space used

public:
     // constructor,with the share memory start position and the space size,if the space is not enough,the program will exit
    hash_shm( void *startShm,unsigned long shmSize= sizeof(hash_node)*maxLine*lines);

     // constructor,with the share memory key,it will get share memory,if fail,exit
    hash_shm(key_t shm_key);
    ~hash_shm(){}     // destroy the class
private:
     void *mem;         // the start position of the share memory; the space at mem+memSize stores the runtime data: currentSize
    unsigned long memSize;     // the size of the share memory
    unsigned long modTable[lines];     // modtable,the largest primes
    unsigned long maxSize;         // the size of the table
    unsigned long *currentSize;     // current size of the table, points into the shm at mem+memSize
     void *lastFound;         // written by the find function, records the last found place

     struct hash_node{         // the node of the hash table
        unsigned long key;     // when key==0,the node is empty
        valueType value;     // name-value pair
    };
private:
     bool getShm(key_t shm_key);     // get share memory,used by the constructor
     void getMode();         // get the largest primes below maxLine,used by the constructor
     void *getPos(unsigned int _row,unsigned long _col); // get the position at (row,col)
};

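template<typename vT,unsigned long maxLine, int lines>
hash_shm<vT,maxLine,lines>::hash_shm( void *startShm,unsigned long shmSize)
{
     if(startShm==NULL){     // a null start address cannot be used
        cerr<<"Argument error\n Please check the shm address\n";
        exit(-1);
    }
    mem=startShm;     // remember where the table starts
    getMode();
    maxSize=0;
     int i;
     for(i=0;i<lines;i++)     // count the maxSize
        maxSize+=modTable[i];
     if(shmSize< sizeof(hash_node)*(maxSize+1)){     // the table plus one spare node that holds the counter
        cerr<<"Not enough share memory space\n";
        exit(-1);
    }
    memSize= sizeof(hash_node)*maxSize;     // table bytes only, so the counter at mem+memSize stays inside the region
    currentSize=(unsigned long *)(( char *)mem+memSize);
     if(*currentSize>maxSize)     // the counter is unsigned, so test for an out-of-range value rather than <0
        *currentSize=0;
}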

template<typename vT,unsigned long maxLine, int lines>
hash_shm<vT,maxLine,lines>::hash_shm(key_t shm_key)
{     // constructor with get share memory
    getMode();
    maxSize=0;
     for( int i=0;i<lines;i++)
        maxSize+=modTable[i];
    memSize= sizeof(hash_node)*maxSize;
     if(!getShm(shm_key)){
        exit(-1);
    }
//     memset(mem,0,memSize);
    currentSize=(unsigned long *)(( char *)mem+memSize);     // the counter sits right after the table
     if(*currentSize>maxSize)     // the counter is unsigned, so test for an out-of-range value rather than <0
        *currentSize=0;
}

template<typename vT,unsigned long maxLine, int lines>
int hash_shm<vT,maxLine,lines>::find(unsigned long _key)
{
    unsigned long hash;
    hash_node *pH=NULL;
     for( int i=0;i<lines;i++)
    {
        hash=(_key+maxLine)%modTable[i];     // calculate the col position
        pH=(hash_node *)getPos(i,hash);
//         if(pH==NULL)return -2;     // almost not need
         if(pH->key==_key){
            lastFound=pH;
             return 0;
        }
    }
     return -1;
}

template<typename vT,unsigned long maxLine, int lines>
int hash_shm<vT,maxLine,lines>::remove(unsigned long _key)
{
     if(find(_key)==-1) return -1;     // not found
    hash_node *pH=(hash_node *)lastFound;
    pH->key=0;         // only set the key 0
    (*currentSize)--;
     return 0;
}

template<typename vT,unsigned long maxLine, int lines>
int hash_shm<vT,maxLine,lines>::insert(unsigned long _key, const vT &_value)
{
     if(find(_key)==0) return 1;     // if the key exists
    unsigned long hash;
    hash_node *pH=NULL;
     for( int i=0;i<lines;i++){
        hash=(_key+maxLine)%modTable[i];
        pH=(hash_node *)getPos(i,hash);
         if(pH->key==0){         // find the insert position,insert the value
            pH->key=_key;
            pH->value=_value;
            (*currentSize)++;
             return 0;
        }
    }
     return -1;     // all the appropriate position filled
}

template<typename vT,unsigned long maxLine, int lines>
void hash_shm<vT,maxLine,lines>::clear()
{
    memset(mem,0,memSize);
    *currentSize=0;
}

template<typename vT,unsigned long maxLine, int lines>
bool hash_shm<vT,maxLine,lines>::getShm(key_t shm_key)
{
     // memSize covers the table only; the currentSize counter is stored right after it
    unsigned long shmBytes=memSize+ sizeof(unsigned long);
     int shm_id=shmget(shm_key,shmBytes,0666);
     if(shm_id==-1)     // check if the shm exists
    {
        shm_id=shmget(shm_key,shmBytes,0666|IPC_CREAT); // create the shm
         if(shm_id==-1){
            cerr<<"Share memory get failed\n";
             return false;
        }
    }
    mem=shmat(shm_id,NULL,0);     // attach the shm
     if(mem==( void *)-1){     // shmat returns (void *)-1 on failure
        cerr<<"shmat system call failed\n";
         return false;
    }
     return true;
}

template<typename vT,unsigned long maxLine, int lines>
void hash_shm<vT,maxLine,lines>::getMode()
{         // relies on every prime above 3 having the form 6n+1 or 6n-1
     if(maxLine<5){exit(-1);}

    unsigned long t,m,n,p;
     int i,j,a,b,k;
     int z=0;

     for(t=maxLine/6;t>=1&&z<lines;t--)     // t is unsigned, so the loop has to stop at 1
    {
        i=1;j=1; k=t%10;
        m=6*t;                                         /* i and j say whether 6t-1 and 6t+1 still need checking; they end up as the primality flags */
         if(((k-4)==0)||((k-9)==0)||((m+1)%3==0))j=0; /* quick pre-check of 6t-1 and 6t+1 before the trial division */
         if(((k-6)==0)||((m-1)%3==0))i=0;             /* first drop numbers that end in 5 or are divisible by 3 */
         for(p=1;p*6<=sqrt(( double)(m+1))+2;p++ )
        {
            n=p*6;                                     /* try 6p-1 and 6p+1 as candidate prime divisors */
            k=p%10;
            a=1;b=1;                                 /* a and b likewise say whether n+1 and n-1 are worth using as divisors */
             if(((k-4)==0)||((k-9)==0))a=0;
             if(((k-6)==0))b=0;
             if(i){                             /* if i is nonzero, test m-1 (that is, 6t-1) against the divisors n+1 and n-1 */
                 if(a){ if((m-1)%(n+1)==0)i=0;}         /* divisible means not prime, so clear i */
                 if(b){ if((m-1)%(n-1)==0)i=0;}
            }
             if(j){                            /* if j is nonzero, test m+1 (that is, 6t+1) the same way */
                 if(a){ if((m+1)%(n+1)==0)j=0;}          /* divisible means not prime, so clear j */
                 if(b){ if((m+1)%(n-1)==0)j=0;}
            }
             if((i+j)==0) break;                      /* stop the trial division once both 6t-1 and 6t+1 are known to be composite */
        }
         if(j){modTable[z++]=m+1; if(z>= lines) return;}
         if(i){modTable[z++]=m-1; if(z>= lines) return;}
    }
    cerr<<"Not enough primes below maxLine\n";     // reached only if fewer than lines primes were found
    exit(-1);
}

template<typename vT,unsigned long maxLine, int lines>
void *hash_shm<vT,maxLine,lines>::getPos(unsigned int _row,unsigned long _col)
{
    unsigned long pos=0UL;
     for(unsigned int i=0;i<_row;i++)     // calculate the position from the start
        pos+=modTable[i];
    pos+=_col;
     if(pos>=maxSize) return NULL;
     return ( void *)(( char *)mem+pos* sizeof(hash_node));
}

template<typename vT,unsigned long maxLine, int lines>
double hash_shm<vT,maxLine,lines>::getFullRate() const
{
     return double(*currentSize)/maxSize;
}

#endif

// test.cpp

#include"hash_shm.h"
#include<cstdlib>
#include<ctime>     // time()
using namespace std;
int main()
{
    hash_shm< int,1000,100> ht(key_t(999));
     double rate=0.0;
//     ht.clear();
     for( int i=0;i<100;i++){
        srand(time(NULL)+i);
         while( true){
             if(ht.insert(rand(),0)==-1) break;
        }
        cout<<ht.getFullRate()<<endl;
        rate+=ht.getFullRate();
        ht.clear();
    }
    cout<<"\n\n\n";
    cout<<rate/100<<endl;
}

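Testing this code turned up a problem: profiling with gprof showed that getPos took most of the execution time and was the main performance bottleneck. I then added an array that records where each row starts, and performance improved a lot. The changed code is: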

template<typename valueType,unsigned long maxLine, int lines>
class hash_shm
{
private:
     void *mem;         // the start position of the share memory; the space at mem+memSize stores the runtime data: currentSize
    unsigned long memSize;     // the size of the share memory
    unsigned long modTable[lines];     // modtable,the largest primes
    unsigned long modTotal[lines];     // modTotal[i] is the sum of modTable[x] for x<i,i.e. the offset of row i
                     // used by getPos to improve the performance
    ...
};

template<typename vT,unsigned long maxLine, int lines>
hash_shm<vT,maxLine,lines>::hash_shm( void *startShm,unsigned long shmSize)
{
     ...

     int i;
     for(i=0;i<lines;i++){     // count the maxSize
        maxSize+=modTable[i];
         if(i!=0)modTotal[i]=modTotal[i-1]+modTable[i-1];
         else modTotal[i]=0;     // calculate the modTotal
    }
     ...
}

template<typename vT,unsigned long maxLine, int lines>
hash_shm<vT,maxLine,lines>::hash_shm(key_t shm_key)
{     // constructor with get share memory
    getMode();
    maxSize=0;
     for( int i=0;i<lines;i++){
        maxSize+=modTable[i];
         if(i!=0)modTotal[i]=modTotal[i-1]+modTable[i-1];
         else modTotal[i]=0;
    }
     ...
}

template<typename vT,unsigned long maxLine, int lines>
void *hash_shm<vT,maxLine,lines>::getPos(unsigned int _row,unsigned long _col)
{
    unsigned long pos=_col+modTotal[_row];
     // for(int i=0;i<_row;i++)     // calculate the position from the start
     //     pos+=modTable[i];
     if(pos<maxSize)
         return ( void *)(( char *)mem+pos*sizeof(hash_node));
     return NULL;
}
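
The idea behind modTotal is an exclusive prefix sum over the row sizes: the start of row i is the total size of all rows before it, computed once, so each position lookup becomes one addition instead of a loop. A small standalone sketch, where the names rowOffsets and flatIndex are made up for this example:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// offsets[i] = number of slots in all rows before row i (exclusive prefix sum).
std::vector<std::size_t> rowOffsets(const std::vector<std::size_t>& rowSizes) {
    std::vector<std::size_t> offsets(rowSizes.size());
    std::size_t running = 0;
    for (std::size_t i = 0; i < rowSizes.size(); ++i) {
        offsets[i] = running;          // start of row i in the flat array
        running += rowSizes[i];
    }
    return offsets;
}

// Flat index of (row, col): one addition, no loop over the earlier rows.
std::size_t flatIndex(const std::vector<std::size_t>& offsets,
                      std::size_t row, std::size_t col) {
    assert(row < offsets.size());
    return offsets[row] + col;
}
```

For row sizes 997, 991 and 983 the offsets are 0, 997 and 1988, so column 5 of the third row sits at flat index 1993.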

A new function, foreach, was added for traversal:

template<typename vT,unsigned long maxLine, int lines>
void hash_shm<vT,maxLine,lines>:: foreach( void (*fn)(unsigned long _key,vT &_value))
{
    typedef  unsigned long u_long;
    u_long beg=(u_long)mem;
    u_long end=(u_long)mem+ sizeof(hash_node)*(modTable[lines-1]+modTotal[lines-1]);
    hash_node *p=NULL;
     for(u_long pos=beg;pos<end;pos+= sizeof(hash_node))
    {
        p=(hash_node *)pos;
         if(p->key!=0)fn(p->key,p->value);
    }
}

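For convenience, a second lookup function find was added. It works like find(_key), but when it finds the node for _key it copies the node's value into _value to return it: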

int find(unsigned long _key,vT &_value);