Answer Me, Data Structures: Hash Buckets and Simulating the unordered Containers

言之命至9012

Last modified: 2022-07-03 16:54:24


Category: Advanced Data Structures. Tags: hash algorithms, data structures, hash tables

First published: 2022-06-30 10:17:10

Article link: https://blog.csdn.net/Allen9012/article/details/125535260


1. Implementing closed hashing

1.0 Basic structure

template <class K,class V>
struct HashData
{
    pair<K, V> _kv;
};

template <class K, class V>
class HashTable
{
public:
private:
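    vector<HashData<K, V>> _table;
    size_t _n=0;  // number of valid elements stored
};

When closed hashing (open addressing) is used to resolve collisions, an element cannot simply be physically removed from the table, because doing so would break the search for other elements that probed past it. Linear probing therefore deletes an element by marking it, a pseudo-deletion.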

enum State
{
    EMPTY,
    EXIST,
    DELETE,
};

template <class K,class V>
struct HashData
{
    pair<K, V> _kv;
    State _state = EMPTY; // slots start out empty
};

template <class K, class V>
class HashTable
{
public:
private:
    vector<HashData<K, V>> _table;
    size_t _n=0;  // number of valid elements stored
};

1.1 Insert

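Use the hash function to compute the position of the element to insert.

First, a question: which of the following should index be? (The underlying storage is a vector.)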

size_t index = kv.first % _table.size();
size_t index = kv.first % _table.capacity();

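A vector only lets you access elements up to size, not the whole capacity: operator[] is valid only for indexes in [0, size()). If we take the modulus by capacity, the result can land beyond size, and then the following access cannot be used:

_table[index]= ...

So we use size.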
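To make the size-versus-capacity point concrete, here is a small sketch that uses only standard `std::vector` calls. The helper name `IndexableSlots` is made up for this illustration and is not part of the table.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: reports how many slots may legally be indexed
// after reserve() versus after resize().
inline std::size_t IndexableSlots(bool use_resize, std::size_t n)
{
    std::vector<int> table;
    if (use_resize)
        table.resize(n);   // size() == n, elements are value-initialized
    else
        table.reserve(n);  // capacity() >= n, but size() is still 0
    return table.size();   // operator[] is only valid for [0, size())
}
```

This is why the table code calls `resize`, not `reserve`, whenever it grows: every slot that the modulus can produce has to be a real element.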

bool Insert(const pair<K, V>& kv)
{	
    size_t start = kv.first % _table.size();
    size_t index = start;

    // Probe the following slots: linear probing or quadratic probing
    size_t i = 1;
    while (_table[index]._state == EXIST)
    {
        index = start + i;
        index %= _table.size();
        ++i;
    }

    _table[index]._kv = kv;
    _table[index]._state = EXIST;
    ++_n;

    return true;
}

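We still need to handle growth and duplicate keys.

// Reject duplicates
HashData<K, V>* ret = Find(kv.first);
if (ret)
{
    return false;
}
// Empty table
if (_table.size()==0)
{
    _table.resize(10);
}
// Load factor has reached 0.7
else if ((double)_n / (double)_table.size() >=0.7)
{
    // Grow
    vector<HashData<K, V>> newtable;
    newtable.resize(_table.size() * 2);
    for (auto& e:_table)
    {
        if (e._state==EXIST)
        {
            // Recompute the position and place the element into newtable
            // (same logic as insertion)
        }
    }
    _table.swap(newtable);
}

Notice that growing the table means re-placing the data into newtable, and that logic is almost identical to insertion, so the code would be duplicated. There is a better way: construct a whole new HashTable and reuse its Insert. With a bare vector there was nothing to call Insert on; with a table object there is. This is the modern way to write the growth step.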

// Empty table
if (_table.size()==0)
{
    _table.resize(10);
}
// Load factor has reached 0.7
else if ((double)_n / (double)_table.size() >=0.7)
{
    // Grow: rebuild through a new table and reuse Insert
    HashTable<K, V> newHT;
    newHT._table.resize(_table.size() * 2);
    for (auto& e:  _table)
    {
        if (e._state == EXIST)
        {
            newHT.Insert(e._kv);
        }
    }
    _table.swap(newHT._table);
}

1.1.1 The complete Insert

bool Insert(const pair<K, V>& kv)
{
    // Reject duplicates
    auto ret = Find(kv.first);
    if (ret)
    {
        return false;
    }

    // Empty table
    if (_table.size() == 0)
    {
        _table.resize(10);
    }
    // Load factor has reached 0.7
    else if ((double)_n / (double)_table.size() >= 0.7)
    {
        // Grow: rebuild through a new table and reuse Insert
        HashTable<K, V,HashFunc> newHT;
        newHT._table.resize(_table.size() * 2);
        for (auto& e : _table)
        {
            if (e._state == EXIST)
            {
                newHT.Insert(e._kv);
            }
        }
        _table.swap(newHT._table);
    }

    HashFunc hf;
    size_t start = hf(kv.first) % _table.size();
    size_t index = start;

    // Probe the following slots: linear probing or quadratic probing
    size_t i = 1;
    while (_table[index]._state == EXIST)
    {
        index = start + i;
        index %= _table.size();
        ++i;
    }

    _table[index]._kv = kv;
    _table[index]._state = EXIST;
    ++_n;

    return true;
}

1.2 Find

The size == 0 case can be handled either with an explicit check here or by giving the table an initial size in the constructor.

HashData<K,V>* Find(const K& key)
{
    if (_table.size() == 0)
    {
        return nullptr;
    }

    HashFunc hf;
    size_t start = hf(key) % _table.size();
    size_t index = start;
    size_t i = 1;
    while (_table[index]._state != EMPTY)
    {
        if ( _table[index]._kv.first == key)
        {
            return &_table[index];
        }

        index = start + i;
        index %= _table.size();
        ++i;
    }
    return nullptr; // reached an EMPTY slot: the key is not present
}

This still has a problem. Suppose I erase 100 and then search for it again. Erase only changes the state flag, so the slot is marked DELETE, yet Find still locates the key. The comparison therefore needs one more condition:

if (_table[index]._state == EXIST
    && _table[index]._kv.first == key)
{
    return &_table[index];
}

1.3 Erase

bool Erase(const K& key)
{
    HashData<K, V>* ret = Find(key);
    if (ret == nullptr)
    {
        return false;
    }
    else
    {
        ret->_state = DELETE;
        --_n; // keep the element count in sync with the table
        return true;
    }
}
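The interplay of Insert, Find, and Erase with the three states can be sketched as a tiny self-contained class. This is a simplified illustration, not the author's HashTable: it stores only `int` keys, never grows, and the names `TinyProbeSet`, `SlotState`, and `kNotFound` are assumptions made for the example. The probe loop is also bounded by the slot count, so a table with no EMPTY slot left cannot spin forever.

```cpp
#include <cstddef>
#include <vector>

enum class SlotState { Empty, Exist, Deleted };

const std::size_t kNotFound = static_cast<std::size_t>(-1);

// Minimal linear-probing set of ints with tombstones (fixed capacity, no rehash).
class TinyProbeSet
{
    struct Slot
    {
        int key = 0;
        SlotState state = SlotState::Empty;
    };
    std::vector<Slot> _slots;
    std::size_t _n = 0;

    // Index of the slot holding `key`, or kNotFound.
    std::size_t FindSlot(int key) const
    {
        if (_slots.empty()) return kNotFound;
        std::size_t index = static_cast<std::size_t>(key) % _slots.size();
        for (std::size_t i = 0; i < _slots.size(); ++i)
        {
            const Slot& s = _slots[index];
            if (s.state == SlotState::Empty) return kNotFound;       // probe chain ends
            if (s.state == SlotState::Exist && s.key == key) return index;
            index = (index + 1) % _slots.size();                     // skip tombstones too
        }
        return kNotFound;
    }

public:
    explicit TinyProbeSet(std::size_t capacity) : _slots(capacity) {}

    bool Contains(int key) const { return FindSlot(key) != kNotFound; }

    bool Insert(int key)
    {
        if (_slots.empty() || Contains(key) || _n + 1 >= _slots.size())
            return false; // duplicate, or the fixed table is considered full
        std::size_t index = static_cast<std::size_t>(key) % _slots.size();
        while (_slots[index].state == SlotState::Exist)
            index = (index + 1) % _slots.size();
        _slots[index].key = key;            // Empty and Deleted slots are both reusable
        _slots[index].state = SlotState::Exist;
        ++_n;
        return true;
    }

    bool Erase(int key)
    {
        std::size_t index = FindSlot(key);
        if (index == kNotFound) return false;
        _slots[index].state = SlotState::Deleted; // tombstone, not Empty
        --_n;
        return true;
    }
};
```

With capacity 10, the keys 1, 11, and 21 all hash to slot 1 and occupy slots 1, 2, and 3. After erasing 11, a lookup of 21 still succeeds because the probe walks past the tombstone in slot 2; had the slot been reset to Empty, the search would stop there and miss 21.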

1.4 Handling string keys

The modulo operation used so far breaks when the key template argument is string, because a string has no modulo operation. A functor solves this:

template<class K>
struct int_HashFunc
{
    int operator()(int i)
    {
        return i;
    }
};

template<class K>
struct string_HashFunc
{
    size_t operator()(const string& s)
    {
        return s[0];
    }
};

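But this is not great either: many strings share the same first character, so string keys would collide heavily. A better mapping converts the whole string to an integer, for example by summing the ASCII codes of all its characters. (This is not the only option, and the sum can still overflow: a string can be arbitrarily long while an integer has a fixed range.)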

template<class K>
struct string_HashFunc
{
    size_t operator()(const string& s)
    {
        size_t value = 0;
        for (auto ch : s)
        {
            value += ch;
        }
        return value;
    }
};

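1.4.1 BKDR

That is still not good enough: "abcd", "cdba", and "adad" all have the same ASCII sum, so they land in the same slot. People have therefore designed dedicated string hash algorithms. The best known is BKDR hash, which multiplies the running value by a constant before adding each character.

template<class T>
size_t BKDRHash(const T *str)
{
    size_t hash = 0; // `register` dropped: the keyword was removed in C++17
    while (size_t ch = (size_t)*str++)
    {
        hash = hash * 131 + ch;   // the multiplier can also be 31, 131, 1313, 13131, 131313, ...
        // Some say that decomposing the multiplication into shifts and adds is faster,
        // e.g. hash = (hash << 7) + (hash << 1) + hash + ch;
        // On Intel platforms, however, the CPU handles both forms about equally well.
        // I ran each form 10 billion times and the time difference was essentially zero
        // (in a Debug build the shift version actually took about 1/3 longer).
        // I have not tested RISC systems such as ARM. ARM uses Booth's Algorithm for
        // 32-bit integer multiplication, so its cost depends on the multiplier:
        //   bits 8-31 of the multiplier all 1 or all 0:  1 clock cycle
        //   bits 16-31 of the multiplier all 1 or all 0: 2 clock cycles
        //   bits 24-31 of the multiplier all 1 or all 0: 3 clock cycles
        //   otherwise:                                   4 clock cycles
        // So, although I have not measured it, I still believe the two forms perform about the same.
    }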
    return hash;  
}
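The difference between summing characters and BKDR can be checked directly on the three strings mentioned above. The function names `SumHash` and `BkdrHash` are chosen for this sketch only; both take a `std::string` rather than a raw pointer.

```cpp
#include <cstddef>
#include <string>

// Naive hash: sum of the bytes. Any rearrangement of the same characters collides.
inline std::size_t SumHash(const std::string& s)
{
    std::size_t value = 0;
    for (unsigned char ch : s)
        value += ch;
    return value;
}

// BKDR-style hash: multiply the running value by 131 before adding each byte,
// so the position of every character influences the result.
inline std::size_t BkdrHash(const std::string& s)
{
    std::size_t value = 0;
    for (unsigned char ch : s)
        value = value * 131 + ch;
    return value;
}
```

For "abcd", "cdba", and "adad" the byte sum is 394 in every case, while the three BKDR values are all different.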

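Here is a comparison of different string hashes: https://blog.csdn.net/icefireelf/article/details/5796529. BKDR comes out best in those tests, so we simply adopt BKDR hash. The functor then only needs to be written once as a generic template, with specializations for the other key types.

1.4.2 Implementing the Hash functor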

template<class K>
struct Hash
{
    size_t operator()(const K& key)
    {
        return key;
    }
};

// Specialization
template<>
struct Hash<string>
{
    size_t operator()(const string& s)
    {
        // BKDR Hash: value = value * 131 + ch, matching BKDRHash above
        size_t value = 0;
        for (unsigned char ch : s) // unsigned, so non-ASCII bytes do not go negative
        {
            value *= 131;
            value += ch;
        }

        return value;
    }
};

template <class K, class V,class HashFunc=Hash<K>>
class HashTable
{
	...
};

void TestHashTable()
{
    string a[] = { "皮卡丘", "喷火龙", "皮卡丘", "喷火龙", "皮卡丘", "路卡利欧", "皮卡丘" };
    HashTable<string, int,Hash<string>> ht;
    for (auto str : a)
    {
        auto ret = ht.Find(str);
        if (ret)
        {
            ret->_kv.second++;
        }
        else
        {
            ht.Insert(make_pair(str, 1));
        }
    }
}

This part is a lot like overriding hashCode in Java: what counts as equal depends on which fields matter, and you hash according to those fields.

struct pokemon
{
    // ...
};

struct PokemonHashFunc
{
    size_t operator()(const pokemon& kv)
    {
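        // For a struct key:
        // 1. if it has an integer field that is essentially unique, e.g. the pokemon number, hash that
        // 2. if it has a string field that is essentially unique, e.g. the pokemon name, hash that
        // 3. if no single field is unique, consider combining several fields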
        size_t value = 0;
        // ...
        return value;
    }
};

The unordered containers likewise accept a Hash functor as a template argument.
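As a point of comparison, here is how a custom hash functor is passed to the standard `std::unordered_set`. The `Pokemon` fields, the way the two hashes are combined, and the names `PokemonHash`, `PokemonEqual`, and `PokemonSet` are assumptions for this sketch; only the `std::unordered_set` and `std::hash` usage comes from the standard library.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

struct Pokemon // illustrative key type
{
    int id;
    std::string name;
};

// Hash functor: combines both fields (option 3 in the comment above).
struct PokemonHash
{
    std::size_t operator()(const Pokemon& p) const
    {
        return std::hash<int>()(p.id) * 131 + std::hash<std::string>()(p.name);
    }
};

// Equality functor: the standard container needs this as well as the hash.
struct PokemonEqual
{
    bool operator()(const Pokemon& a, const Pokemon& b) const
    {
        return a.id == b.id && a.name == b.name;
    }
};

using PokemonSet = std::unordered_set<Pokemon, PokemonHash, PokemonEqual>;
```

Two objects that compare equal must produce the same hash, which is why the hash and the equality functor look at the same fields.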

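2. Implementing open hashing

Open hashing is essentially an array of pointers combined with linked lists. A question arises: when simulating open hashing, can we use the library's list, or should we write the linked list ourselves? It is better to write our own, because list's iterator would only add trouble.

2.0 HashNode

Since this is an array of pointers, HashTable's private member would have to be written as a double pointer, which looks clumsy. So we store the pointers in a vector instead, which is a bit nicer.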

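template<class K,class V>
struct HashNode
{
    HashNode<K, V>* _next;
    pair<K, V> _kv;

    // Required by `new Node(kv)` in Insert
    HashNode(const pair<K, V>& kv)
        :_next(nullptr)
        ,_kv(kv)
    {}
};

template<class K, class V>
class HashTable 
{
    typedef HashNode<K, V> Node;
public:
private:
    vector<Node*> _table; // stores pointers to the bucket heads
    size_t _n = 0;        // number of valid elements
};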

2.1 Insert

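How do we insert? It is actually simpler than with closed hashing.

In a hash table of size 7, the keys 42 and 38 get hash indexes 0 and 3 respectively.

If we insert a new element 52, it also goes to the fourth slot, index 3, because 52 % 7 is 3.

In terms of efficiency, head insertion is the better choice: tail insertion has to traverse the list to find the tail, which is clearly slower.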

bool Insert(const pair<K, V>& kv)
{
    if (Find(kv.first))
    {
        return false;
    }
    size_t index = kv.first % _table.size();
    Node* newnode = new Node(kv);
    // Head insertion; an empty bucket needs no special case
    newnode->_next = _table[index];
    _table[index] = newnode;
    ++_n;
    return true;
}
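The 42 / 38 / 52 example can be reproduced with a few lines. To keep the sketch short it uses `std::forward_list` as a stand-in for the hand-written node list, which is a substitution made here and not what the article's table does; `BuildBuckets` is a made-up name.

```cpp
#include <cstddef>
#include <forward_list>
#include <vector>

// Distributes keys over `bucket_count` chains, inserting each key at the head
// of its chain (the same O(1) head insertion the Insert above performs).
inline std::vector<std::forward_list<int>>
BuildBuckets(const std::vector<int>& keys, std::size_t bucket_count)
{
    std::vector<std::forward_list<int>> buckets(bucket_count);
    for (int key : keys)
    {
        std::size_t index = static_cast<std::size_t>(key) % bucket_count;
        buckets[index].push_front(key);
    }
    return buckets;
}
```

Calling `BuildBuckets({42, 38, 52}, 7)` leaves 42 alone in bucket 0, while bucket 3 holds 52 followed by 38: the later key sits in front because of head insertion.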

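Next comes growth. When the load factor reaches 1, the table grows to get more slots. We cannot just carry each slot's whole list over unchanged; every node has to be re-modded and re-inserted. Should we reuse Insert the way the closed-hashing version did? That reuse has a downside here: Insert would allocate new nodes, and the old nodes would then have to be deleted, which is wasteful. Instead we relink the existing nodes into the new table.

bool Insert(const pair<K, V>& kv)
{
    // The key already exists: return false
    if (Find(kv.first))
        return false;

    // Grow when the load factor reaches 1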
    if (_n == _table.size())
    {
        vector<Node*> newtable;
        size_t new_size = _table.size() == 0 ? 10 : _table.size() * 2;
        newtable.resize(new_size);
        // Recompute each old node's position and move it into the new table
        for (size_t i=0;i<_table.size();++i)
        {
            if (_table[i])
            {
                Node* cur = _table[i];
                while (cur)
                {
                    Node* next = cur->_next;
                    size_t index = cur->_kv.first % newtable.size();
                    // Head insertion
                    cur->_next = newtable[index];
                    newtable[index] = cur;
                    // Advance in the old bucket
                    cur = next;
                }
                _table[i] = nullptr;
            }
        }
        _table.swap(newtable);
    }

    // Load factor below 1 (or growth done): link the new node in
    size_t index = kv.first % _table.size();
    Node* newnode = new Node(kv);
    // Head insertion; an empty bucket needs no special case
    newnode->_next = _table[index];
    _table[index] = newnode;
    ++_n;
    return true;
}

Finally, a prime-number table for the bucket counts could be added as well; it is not written out here.
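Since section 2.6.1 later calls a `GetNextPrime` that the article never shows, here is one possible sketch of such a helper. Its exact behavior (first table entry strictly greater than `n`, last entry as a cap) and the abridged list of primes, each roughly double the previous one in the style of the classic SGI STL table, are assumptions made here.

```cpp
#include <cstddef>

// Hypothetical helper: returns the first prime in the table that is strictly
// greater than n, or the largest table entry if n is already beyond the table.
inline std::size_t GetNextPrime(std::size_t n)
{
    static const std::size_t primes[] = {
        53, 97, 193, 389, 769, 1543, 3079, 6151, 12289, 24593,
        49157, 98317, 196613, 393241, 786433, 1572869, 3145739, 6291469
    };
    const std::size_t count = sizeof(primes) / sizeof(primes[0]);
    for (std::size_t i = 0; i < count; ++i)
    {
        if (primes[i] > n)
            return primes[i];
    }
    return primes[count - 1];
}
```

With this version an empty table grows to 53 buckets, then 97, then 193, and so on. A prime bucket count is commonly chosen so that keys sharing a common factor with the table size do not pile into a few buckets.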

2.2 Find

Lookup is simple:

Node* Find(const K& key)
{
	if (_table.size() == 0)
	{
		return nullptr;
	}
    size_t index = key % _table.size();
    Node* cur = _table[index];
    while (cur)
    {
        if (cur->_kv.first == key)
        {
            return cur;
        }
        else
        {
            cur = cur->_next;
        }
    }
    return nullptr;
}

2.3 Erase

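In separate chaining, deleting by state flag is not the best choice (though it is not impossible). Here is how to delete a node.

The usual way is to keep a prev pointer to the preceding node; this is the classic linked-list deletion.

Some people propose deletion by replacement instead: copy the next node's data into the current node and delete the next node. That cannot directly delete the tail node, although it can be worked around.

Here we stick with the classic method.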

bool Erase(const K& key)
{
    if (_table.size() == 0)
    {
        return false; // empty table: nothing to erase, and this avoids % 0
    }
    size_t index = key % _table.size();
    Node* cur = _table[index];
    Node* prev=nullptr;
    while (cur)
    {
        if (cur->_kv.first==key)
        {
            if (_table[index]==cur)
            {
                _table[index] = cur->_next;
            }
            else
            {
                prev->_next = cur->_next;
            }
            delete cur;
            cur = nullptr;
            --_n; // one fewer valid element
            return true;
        }
        prev = cur;
        cur = cur->_next;
    }
    return false;
}
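For comparison, the replacement-style deletion mentioned above could look roughly like this on a plain singly linked list. `ListNode` and `EraseByReplacement` are names invented for the sketch, and the tail case falls back to the classic walk from the head, which is one possible workaround rather than the only one.

```cpp
#include <cstddef>

struct ListNode
{
    int value;
    ListNode* next;
};

// Removes `target` from the list that starts at `head`.
// Non-tail node: copy the successor's value into target, then unlink and
// delete the successor (O(1), no prev pointer needed).
// Tail node: there is no successor to copy from, so walk from head instead.
inline void EraseByReplacement(ListNode*& head, ListNode* target)
{
    if (target->next)
    {
        ListNode* victim = target->next;
        target->value = victim->value;
        target->next = victim->next;
        delete victim;
        return;
    }
    if (head == target) // single-node list
    {
        head = nullptr;
        delete target;
        return;
    }
    ListNode* prev = head;
    while (prev->next != target)
        prev = prev->next;
    prev->next = nullptr;
    delete target;
}
```

One caveat of this trick is that the node physically freed is the successor, so any pointer or iterator that referred to the successor becomes dangling. That is a good reason for the table above to stay with the prev-pointer method.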

2.4 Hash functor

As before, a functor is needed here too.

template<class K>
struct Hash
{
    size_t operator()(const K& key)
    {
        return key;
    }
};
// Specialization
template<>
struct Hash<string>
{
    size_t operator()(const string& s)
    {
        // BKDR Hash: value = value * 131 + ch
        size_t value = 0;
        for (unsigned char ch : s) // unsigned, so non-ASCII bytes do not go negative
        {
            value *= 131;
            value += ch;
        }

        return value;
    }
};

template<class K, class V,class HashFunc=Hash<K>>
class HashTable 
{
	...
};

2.5 iterator

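The real difficulty in implementing unordered_map is the iterator, and the iterator it uses is HashTable's iterator, so let's implement that here. From this point on the code uses the generalized node HashNode<T>, which stores a single _data member of type T, and the four-parameter HashTable with a Key_Of_T functor (see section 3.1).

2.5.1 Basic structure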

template<class K, class T, class Key_Of_T, class HashFunc = Hash<K>>
struct __HTIterator
{
    typedef HashNode<T> Node;
    typedef __HTIterator<K, T, Key_Of_T, HashFunc> Self;
    typedef HashTable<K, T, Key_Of_T, HashFunc> HT;
    Node* _node;
    HT* _pht;
    __HTIterator(Node* node, HT* pht)
        :_node(node)
        ,_pht(pht)
    {}
};

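A special situation arises here: __HTIterator refers to HashTable, but HashTable also refers to __HTIterator. To resolve this circular dependency, HashTable needs a forward declaration before the iterator.

// Forward declaration
template<class K, class T, class Key_Of_T, class HashFunc>
class HashTable;
// Iterator class
template<class K, class T, class Key_Of_T, class HashFunc = Hash<K>>
struct __HTIterator
{
    typedef HashNode<T> Node;
    typedef __HTIterator<K, T, Key_Of_T, HashFunc> Self;
    typedef HashTable<K, T, Key_Of_T, HashFunc>   HT;
    Node* _node;
    HT* _pht;
    __HTIterator(Node* node, HT* pht)
        :_node(node)
        , _pht(pht)
    {}
  ...
};
// HashTable class
template<class K, class T, class Key_Of_T,class HashFunc =Hash<K>>
class HashTable 
{
    typedef HashNode<T> Node;
    // Friend declaration: the parameter names must not reuse those of the
    // enclosing template, and no default argument may appear here
    template<class K2, class T2, class Key_Of_T2, class HashFunc2>
    friend struct __HTIterator;
public:
    typedef __HTIterator<K, T, Key_Of_T, HashFunc> iterator;
	//...
};

Why a friend class template is needed: see operator++, which reads the table's private _table. Because the friend is a class template, the friend declaration has to carry its own template parameter list. That list cannot say class HashFunc = Hash<K>, because a friend class template declaration is not allowed to specify default template arguments.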
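The forward-declaration-plus-friend pattern is easier to see in a stripped-down form. The sketch below uses invented names (`Table`, `TableIterator`, `BucketCount`) and a single template parameter; it only demonstrates the two language rules involved and is not the article's iterator.

```cpp
#include <cstddef>
#include <vector>

// Forward declaration: the iterator mentions Table<T> before Table is defined.
template<class T>
class Table;

template<class T>
struct TableIterator
{
    const Table<T>* _pt;
    explicit TableIterator(const Table<T>* pt) : _pt(pt) {}

    // Reads a private member of Table<T>; this compiles only because
    // Table declares TableIterator as a friend.
    std::size_t BucketCount() const { return _pt->_buckets.size(); }
};

template<class T>
class Table
{
    // The friend template has its own parameter name (U) and no default argument.
    template<class U>
    friend struct TableIterator;
public:
    explicit Table(std::size_t n) : _buckets(n) {}
    TableIterator<T> Begin() const { return TableIterator<T>(this); }
private:
    std::vector<T> _buckets;
};
```

Removing the friend declaration makes `BucketCount` fail to compile when it is used, which is exactly the access problem operator++ would run into.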

2.5.2 begin() and end()

typedef __HTIterator<K, T, Key_Of_T, HashFunc> iterator;

iterator begin()
{
    size_t i = 0;
    while (i<_table.size())
    {
        if(_table[i])
        {
            return iterator(_table[i], this);
        }
        ++i;
    }
    return end();
}

iterator end()
{
    return iterator(nullptr, this);
}

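2.5.3 operator++ and operator--

The hard part of the iterator is operator++ and operator--. After ++, if the current bucket is exhausted, how do we get to the next bucket?

For operator++ there is one case to handle: when a bucket is finished, we have to move on to the next one. To locate the next bucket we need the current _table[index], which requires the current HashTable object. So the iterator uses friendship to read that object's private _table and its size, and its constructor takes a pointer to the table to identify the object.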

Self& operator++()
{
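    // 1. The current bucket still has nodes: just step forward
    if (_node->_next)
    {
        _node = _node->_next;
    }
    // 2. The current bucket is exhausted
    else
    {
        // Move on to the next bucket
        Key_Of_T kot;
        HashFunc hf;
        size_t index = hf(kot(_node->_data)) % _pht->_table.size();
        ++index;
        // Look for the next bucket that actually holds data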
        while (index < _pht->_table.size())
        {
            if (_pht->_table[index])
            {
                _node = _pht->_table[index];
                return *this;
            }
            else
            {
                ++index;
            }
        }
        _node = nullptr;
    }
    return *this;
}

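Do we need operator--? The standard library does not provide operator-- for these containers either, and there is no rbegin or rend, which means there is no reverse iterator. So -- is generally not supported.

Supporting operator-- would probably require doubly linked lists.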
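Putting begin(), end(), and operator++ together, a compact non-template version of the bucket walk might look like this. It stores plain `int` keys, skips duplicate checks and growth, and all names (`IntBucketSet`, `IntBucketIterator`, `IntNode`) are made up for this sketch.

```cpp
#include <cstddef>
#include <vector>

class IntBucketSet; // forward declaration for the iterator

struct IntNode
{
    int _data;
    IntNode* _next;
};

struct IntBucketIterator
{
    IntNode* _node;
    const IntBucketSet* _pht;

    int operator*() const { return _node->_data; }
    bool operator!=(const IntBucketIterator& other) const { return _node != other._node; }
    IntBucketIterator& operator++(); // defined after IntBucketSet
};

class IntBucketSet
{
    friend struct IntBucketIterator;
public:
    explicit IntBucketSet(std::size_t buckets) : _table(buckets, nullptr) {}
    IntBucketSet(const IntBucketSet&) = delete;
    IntBucketSet& operator=(const IntBucketSet&) = delete;
    ~IntBucketSet()
    {
        for (IntNode* head : _table)
            while (head) { IntNode* next = head->_next; delete head; head = next; }
    }

    void Insert(int key) // no duplicate check, no rehash: kept short on purpose
    {
        std::size_t index = Index(key);
        _table[index] = new IntNode{ key, _table[index] }; // head insertion
    }

    IntBucketIterator begin() const
    {
        for (IntNode* head : _table)
            if (head) return IntBucketIterator{ head, this };
        return end();
    }
    IntBucketIterator end() const { return IntBucketIterator{ nullptr, this }; }

private:
    std::size_t Index(int key) const { return static_cast<std::size_t>(key) % _table.size(); }
    std::vector<IntNode*> _table;
};

inline IntBucketIterator& IntBucketIterator::operator++()
{
    // 1. More nodes in the current bucket
    if (_node->_next) { _node = _node->_next; return *this; }
    // 2. Bucket exhausted: scan forward for the next non-empty bucket
    std::size_t index = _pht->Index(_node->_data) + 1;
    while (index < _pht->_table.size() && !_pht->_table[index]) ++index;
    _node = index < _pht->_table.size() ? _pht->_table[index] : nullptr;
    return *this;
}
```

Because `begin`, `end`, `operator!=`, `operator++`, and `operator*` are all present, a range-based for loop works on the set. With 7 buckets and the keys 42, 38, 52, 5 inserted in that order, the walk visits 42 (bucket 0), then 52 and 38 (bucket 3, newest first), then 5 (bucket 5).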

2.5.4 Other operators

T& operator*()
{
    return _node->_data;
}

T* operator->()
{
    return &_node->_data;
}

bool operator != (const Self& s) const
{
    return _node != s._node;
}

bool operator == (const Self& s) const
{
    return _node == s._node;
}

2.6 Iterator-based CRUD operations

2.6.1 Insert

pair<iterator,bool> Insert(const T& data)
{
    Key_Of_T kot;
    // The key already exists: return its iterator together with false
    auto ret = Find(kot(data));
    if (ret != end())
    {
        return make_pair(ret, false);
    }

    HashFunc hf;
    // Grow when the load factor reaches 1
    if (_n == _table.size())
    {
        vector<Node*> newtable;
        newtable.resize(GetNextPrime(_table.size()));
        // Recompute each old node's position and move it into the new table
        for (size_t i=0;i<_table.size();++i)
        {
            if (_table[i])
            {
                Node* cur = _table[i];
                while (cur)
                {
                    Node* next = cur->_next;
                    size_t index = hf(kot(cur->_data)) % newtable.size();
                    // Head insertion
                    cur->_next = newtable[index];
                    newtable[index] = cur;
                    // Advance in the old bucket
                    cur = next;
                }
                _table[i] = nullptr;
            }
        }
        _table.swap(newtable);
    }

    // Load factor below 1 (or growth done): link the new node in
    size_t index = hf(kot(data)) % _table.size(); 
    Node* newnode = new Node(data);
    // Head insertion; an empty bucket needs no special case
    newnode->_next = _table[index];
    _table[index] = newnode;
    ++_n;
    return make_pair(iterator(newnode,this), true);
}

2.6.2 Find

iterator Find(const K& key)
{
    if (_table.size() ==0)
    {
        return end();
    }

    Key_Of_T kot;
    HashFunc hf;
    size_t index = hf(key) % _table.size();
    Node* cur = _table[index];
    while (cur)
    {
        if (kot(cur->_data) == key)
        {
            return iterator(cur,this);
        }
        else
        {
            cur = cur->_next;
        }
    }
    return end();
}

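2.7 Default constructor

Normally this does not have to be written by hand. But because we have written a copy constructor, the compiler no longer generates a default constructor, so we at least have to declare it explicitly:

HashTable()=default; // explicitly ask the compiler to generate it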
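The rule behind this can be verified at compile time with a standard type trait. The two structs below are throwaway examples, not part of the table.

```cpp
#include <type_traits>

struct WithoutDefault
{
    WithoutDefault(const WithoutDefault&) {} // user-declared copy constructor only
};

struct WithDefault
{
    WithDefault() = default;                 // explicitly request the default constructor
    WithDefault(const WithDefault&) {}
};

// Declaring any constructor, including a copy constructor, suppresses the
// implicitly declared default constructor.
static_assert(!std::is_default_constructible<WithoutDefault>::value,
              "copy constructor suppresses the implicit default constructor");
static_assert(std::is_default_constructible<WithDefault>::value,
              "= default brings the default constructor back");
```

Without the `= default` line, a declaration such as `HashTable<K, V> ht;` inside unordered_map would no longer compile.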

2.8 Destructor

// Destructor
~HashTable()
{
    for (size_t i = 0; i < _table.size(); ++i)
    {
        Node* cur = _table[i];
        while (cur)
        {
            Node* next = cur->_next;
            delete cur;
            cur = next;
        }
        _table[i] = nullptr;
    }
}

2.9 Copy constructor

// Copy constructor
HashTable(const HashTable& ht)	// inside the class, the template arguments may be omitted
{
    _n = ht._n;
    _table.resize(ht._table.size());
    for (size_t i = 0; i < ht._table.size(); i++)
    {
        Node* cur = ht._table[i];
        while (cur)
        {
            Node* copy = new Node(cur->_data);
            // Head-insert into the new table
            copy->_next = _table[i];
            _table[i] = copy;

            cur = cur->_next;
        }
    }
}

2.10 Assignment operator overload

// Assignment overload (copy-and-swap)
HashTable& operator=(HashTable ht)
{
    _table.swap(ht._table);
    swap(_n, ht._n);

    return *this;
}

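With these in place, the map and set wrappers do not need to write their own: the compiler-generated versions work, and they call the ones defined here.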
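The assignment operator above relies on the copy-and-swap idiom: the parameter is taken by value, so the copy constructor does the deep copy, and the swap hands the old resources to the temporary. A minimal sketch of the same idea on a simpler owning class follows; `OwnedList` and its members are invented for the example.

```cpp
#include <cstddef>
#include <vector>

class OwnedList
{
public:
    OwnedList() = default;
    OwnedList(const OwnedList& other)
    {
        for (int* p : other._items)
            _items.push_back(new int(*p)); // deep copy of every element
    }
    // Parameter taken by value: the copy constructor has already run.
    OwnedList& operator=(OwnedList other)
    {
        _items.swap(other._items);
        return *this;
    } // `other` now holds the old elements and frees them in its destructor
    ~OwnedList()
    {
        for (int* p : _items)
            delete p;
    }

    void Push(int v) { _items.push_back(new int(v)); }
    int At(std::size_t i) const { return *_items[i]; }
    std::size_t Size() const { return _items.size(); }
private:
    std::vector<int*> _items;
};
```

After `b = a`, the two objects own separate copies, so changing or destroying one does not affect the other.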

3. Wrapping HashTable into the unordered containers

3.1 Modifying HashTable

The wrapping here is very similar to what was done for map and set.

template<class K, class T, class Key_Of_T,class HashFunc =Hash<K>>

3.2 unordered_map

3.2.1 Basic structure

The functor here is again much like the one used for map.

template<class K,class V>
class unordered_map
{
    struct Map_Key_Of_T 
    {
        const K& operator()(const pair<K, V>& kv)
        {
            return kv.first;
        }
    };
public:
 
private:
    Open_Hash::HashTable<K, pair<K, V>, Map_Key_Of_T> _ht;
};

3.2.2 insert

pair<iterator,bool> insert(const pair<K, V>& kv)
{
    return _ht.Insert(kv);
}

3.2.3 iterator

typedef typename Open_Hash::HashTable<K, pair<K,V>, Map_Key_Of_T>::iterator iterator;
iterator begin()
{
    return _ht.begin();
}

iterator end()
{
    return _ht.end();
}

3.2.4 operator[]

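map has a dedicated operator[]. To implement it here, Insert (and the related functions) first has to be changed to return pair<iterator, bool>, as in section 2.6.1.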

V& operator[](const K& key)
{
    pair<iterator, bool> ret = _ht.Insert(make_pair(key, V()));
    return ret.first->second;
}
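The same insert-then-return-reference trick can be tried against the standard `std::unordered_map`, whose `insert` also returns `pair<iterator, bool>`. The helper names `Subscript` and `CountWords` are made up for this sketch.

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Mimics operator[] using only insert(): try to insert (key, int()), then
// return a reference to the mapped value whether or not the insert happened.
inline int& Subscript(std::unordered_map<std::string, int>& m, const std::string& key)
{
    std::pair<std::unordered_map<std::string, int>::iterator, bool> ret =
        m.insert(std::make_pair(key, int()));
    return ret.first->second;
}

// Word counting, the classic use of operator[].
inline std::unordered_map<std::string, int> CountWords(const std::vector<std::string>& words)
{
    std::unordered_map<std::string, int> counts;
    for (const std::string& w : words)
        ++Subscript(counts, w);
    return counts;
}
```

If the key is new, the value starts at `int()`, which is 0, and is then incremented to 1; if the key already exists, `insert` leaves the map unchanged and the existing count is incremented. This replaces the Find-then-Insert branching used in TestHashTable earlier.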

3.3 unordered_set

3.3.1 Basic structure

template<class K>
class unordered_set
{
    struct Set_Key_Of_T
    {
        const K& operator()(const K& key)
        {
            return key;
        }
    };
public:
    bool insert(const K& key)
    {
        // Insert returns pair<iterator, bool>; report whether the key was new
        return _ht.Insert(key).second;
    }
private:
    Open_Hash::HashTable<K, K, Set_Key_Of_T> _ht;
};

3.3.2 insert

bool insert(const K& key)
{
    // Insert returns pair<iterator, bool>; report whether the key was new
    return _ht.Insert(key).second;
}

3.3.3 iterator

typedef typename Open_Hash::HashTable<K, K, Set_Key_Of_T>::iterator iterator;

iterator begin()
{
    return _ht.begin();
}

iterator end()
{
    return _ht.end();
}
