Hash表的理论基础与具体实现（详细教程）

最新推荐文章于 2024-08-10 22:25:36 发布

置顶 coordinate_blog

最新推荐文章于 2024-08-10 22:25:36 发布

阅读量2k

点赞数 2

分类专栏： Data Structures and Algorithms 文章标签： hash

本文链接：https://blog.csdn.net/qq_17550379/article/details/80302217

版权

Data Structures and Algorithms 专栏收录该内容

19 篇文章 11 订阅

订阅专栏

字典通常可以用三种数据类型表示：线性表，跳表，Hash表。

Hash表又称为散列表，使用一个散列函数把字典的数对映射到一个散列表的具体位置。如果数对p的关键字是k，散列函数为f，那么在理想情况下，p在散列表中的位置为f(k)。暂时假定散列表的每一个位置最多能够存储一个记录。为了搜索关键字为k的数对，先要计算f(k)，然后查看在散列表的*f(k)处是否已有一个数对。如果有，便找到该数对。如果没有，字典就不包含该数对。在前一种情况下，可以删除该数对，为此只需使散列表的f(k)位置为空。在后一种情况下，可以把该数对插入f(k)*的位置。

0x01 bucket 和 home bucket

当关键字的范围太大，不能用理想方法表示时，可以采用并不理想的散列表和散列函数。散列表位置的数量比关键字的个数少，散列函数把若干不同的关键字映射到散列表的同一位置。散列表的每一个位置称为一个bucket；对关键字为k 的数对，f(k) 是home bucket；bucket的数量等同于散列表的长度。因为散列函数可以把若干个关键字映射到同一个bucket，所以bucket要能够容纳多个数对。

0x02 除法散列函数

在多种散列函数中，最常用的就是除法散列函数，它的形式如下：

$=k\%D$

其中k是关键字，D是散列表长度。

0x03 冲突与溢出

我们先考虑一种情况，每个bucket只能存储一个数对。现在有这样一个散列表，它有11个bucket，序号从0到10。

   0   1   2   3   4   5   6   7   8   9   10
               80              40          65

D为11。我们很容易就可以计算出 $80\%11=3$ ， $40\%11=7$ ， $65\%11=10$ 。其余的bucket为空。

现在我们要插入58，那么我们通过计算 $58\%11=3$ ，但是这个bucket已经有一个数了。当两个不同的关键字所对应的home bucket相同，这个时候就发生了冲突。因为我们这个例子中的bucket只能存放一个数对，所以这个时候同时发生了溢出，但是如果我们的bucket可以存放多个数对的话，就不会发生溢出。

我们可以通过线性探查（找到下一个可用的bucket）的方法解决这个问题。我们将58加入到4号bucket。通过这种方法，我们将散列表看成是一个环形表。例如，对于 $98\%11=10$ ，我们将98加入到0号bucket。

0x04 一个好的散列函数

我们很容易就能想明白的一点是：冲突不可怕，可怕的是溢出。除非每个bucket可以存放无限个数对，否则插入时发生溢出就是一个很难解决的问题。当映射到散列表中任何一个bucket里的关键字数量大致相等时，冲突和溢出发生的平均次数最少。基于这一点，我们就有了均匀散列函数。

假定散列有b个bucket，且 $b > 1$ ，bucket的序号从 0 到 $b - 1$ 。如果对所有的k，散列函数 $f (k) = 0$ ，那么f(k)就不是一个均匀散列函数，因为它把所有的关键字都映射到一个0号桶里。这样的散列函数使冲突和溢出的数量最大。假设 $b = 11$ ，关键字范围为[0, 999]，那么它应该大约把每99个关键字映射到同一个桶bucket中。
理想的D应该是一个素数。当不能找到一个接近散列表长度的素数时，你应该选择不能被2和19之间的数整除的D。

0x05 除法和非整型关键字

为了使用除法散列函数，在计算*f(k)*之前，需要把关键字转换为非负整数。因为所有散列函数都把若干个关键字分不到相同的home bucket，所以没有必要把关键字转化为统一的非负整数。

int stringToInt(string s)
{
    int length = (int)s.length();
    int answer = 0;
    if (length % 2 == 1)
    {
        answer = s.at(length - 1);
        --length;
    }
    for (int i = 0; i < length; ++i)
    {
        answer += s.at(i);
        answer += ((int)s.at(i + 1)) << 8;
    }
    return (answer < 0) ? -answer : answer;
}

我们通过上述代码将字符逐个转换为一个唯一整数，并累积求和。这个函数中有一个有意思的操作：左移24位。我们为什么不直接将所有字符加起来，而要左移后相加呢？我们希望充分的利用数据空间，原先如果没有左移操作，那么对于长度为8的字符串，直接相加转化为整数的话，只是用了11位字节，而int有32位（win32下）。

参考STL，我们给出令一个版本的hash函数

template<>
class hash<string>
{
public:
    size_t operator()(const string theKey) const
    {
        unsigned long hashValue = 0;
        int length = (int) theKey.length();
        for (int i = 0; i < length; ++i)
        {
            hashValue = 5 * hashValue + theKey.at(i);
        }
        return size_t(hashValue)
    }
}

0x06 hashTable类的设计

我们使用前面设计的hash函数设计这样的一个hashTable类

template<class K, class E>
class hashTable
{
public:
    hashTable(int theDivisor);
    ~hashTable();
    int search(const K& theKey) const;
    pair<const K, E>* find(const K& theKey) const;
    void insert(const pair<const K, E>& thePair);
private:
    pair<const K, E>** table;//散列表
    hash<K> hash;			//hash函数
    int dSize;				//散列表中的数对个数
    int divisor;			//D
}

同时这里我们给出构造函数

template<class K, class E>
hashTable<K, E>::hashTable(int theDivisor)
{
    divisor = theDivisor;
    dSize = 0;
    table = new pair<const K, E>*[divisor];
    for (int i = 0; i < divisor; ++i)
    {
        table[i] = nullptr;
    }
}

0x07 查找记录

如果你明白了之前的线性探查法进行插入过程，那么就可以轻松的设计出散列表的查找方法。假设要查找的关键字为k的数对，首先搜索f(k)，然后把散列表当作环形表继续搜索下一个bucket，直到下面的情况发生：

关键字k的bucket找到
到达一个空的bucket
回到f(k)

后两者说明关键字k的数对不存在。

template<class K, class E>
int hashTable<K, E>::search(const K& theKey) const
{
    int i = (int) hash(theKey) % divisor;
    int j = i; 
    do
    {
        if (table[j] == nullptr || table[j]->first == theKey) return j;
        j = (j + 1) % divisor;
    }while(j != i);
    return j;
}
template<class K, class E>
pair<const K, E>* hashTable<K, E>::find(const K& theKey) const
{
    int b = search(theKey);
    if (table[b] == nullptr || table[b]->first != theKey) return nullptr;
    return table[b];
}

我们这里给出的search函数中，函数返回值有三种情况：

table[b]是一个指针，指向关键字theKey的数对
散列表没有关键字theKey的数对，并且table[b]=nullptr
散列表没有关键字theKey的数对，但是table[b]!=nullptr，这表示表满。

0x08 删除记录

删除一个记录要保证查找过程可以正常进行。例如，我们添加记录 $35\%11=2$ ，根据线性探查法，插入到5号bucket。

   0   1   2   3   4   5   6   7   8   9   10
               80  58  35      40          65

删除58，我们不能仅仅将4号bucket置空，这样我们就无法找到关键字35的数对。从删除位置的下一个bucket开始，逐个检查每个bucket，以确定要移动的元素，直至到达一个空的bucket或回到删除位置为止。

实现删除的另一个策略是为每个bucket增加一个域neverUsed。在散列表初始化时，这个域被设置为true。当一个数对存入一个bucket中时，neverUsed域被设置为false。现在搜索结束条件：到达一个空bucket变为bucket的neverUsed域为true。不过在删除时，只是把表的相应位置置为空。一个新元素被插入在其对应的home bucket之后所找到的第一个空的bucket中。

template<class K, class E>
void hashTable<K, E>::erase(const K& theKey)
{
    int b = search(theKey);
    if (table[b] == nullptr || table[b]->first != theKey) return;
    int i = theKey % divisor;
    int j = i;
    delete table[j];
    table[j] = nullptr;
    do
    {      
        int pre = j;
        j = (j + 1) % divisor; 
        if (table[j] != nullptr && (table[j]->first) % divisor == j)
        {
            table[pre] = nullptr;
            break;
        }
        table[pre] = table[j];     
    } while (i != j || table[j] != nullptr);
}

0x09 插入记录

我们首先调用search方法，根据search方法，如果返回的b号bucket为空，则白哦是没有关键字为thePair.first的数对，该数对可以插入bucket中。若返回的值非空，则表示在表中已经存在thePair.first的数对了或者表满。如果是前面一种情况，把该bucket的数对值改为thePair.second。如果是后一种情况，那么抛出异常。

class hashTableFull
{
public:
	hashTableFull(string theMessage =
		"The hash table is full")
	{
		message = theMessage;
	}
	void outputMessage() { cout << message << endl; }
private:
	string message;
};
template<class K, class E>
void hashTable<K, E>::insert(const pair<const K, E>& thePair)
{
    int b = search(thePair.first);
    if (table[b] == nullptr)
    {
        table[b] = new pair<const K, E>(thePair);
        ++dSize;
    }
    else
    {
        if (table[b]->first == thePair.first)
        {
            table[b]->second == thePair.second;
        }
        else
        {
            throw hashTableFull();
        }
    }
}