Hashing and Wrapping the unordered Containers (C++)

Latest recommended article published on 2024-11-05 17:16:24

kpl_20

Category: C++. Tags: c++, hash table, unordered containers

Article link: https://blog.csdn.net/kpl_20/article/details/134607028

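I. Hashing

1. Concept

A function is used to establish a mapping between an element's key and its storage location.

On insertion, the value computed by this function is the element's storage location.
On search, the same function gives the location to check; if the key stored there equals the key being searched for, the search succeeds.

This approach is called hashing, the function involved is called the hash function, and the resulting structure is called a hash table.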

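2. Hash functions and hash collisions

Hash functions (two common ones)

Direct addressing

Function
The hash address is a linear function of the key: Hash(Key) = A * Key + B
Pros and cons
Pro: simple, and keys are distributed uniformly
Con: the keys must fall within a small, dense range
Use case
Counting how often each character occurs in a string, since character codes are densely packed.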
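
As a small illustration of direct addressing, the sketch below counts lowercase letters with Hash(Key) = Key - 'a' (that is, A = 1 and B = -'a'). The function name `CountLowercase` and the restriction to 'a' through 'z' are assumptions made for this example.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Direct addressing with Hash(Key) = 1 * Key + (-'a'):
// each lowercase letter maps to its own slot, so no collisions can occur.
std::array<std::size_t, 26> CountLowercase(const std::string& s)
{
    std::array<std::size_t, 26> counts{};   // zero-initialized table
    for (char ch : s)
    {
        if (ch >= 'a' && ch <= 'z')          // ignore characters outside the table's range
        {
            ++counts[static_cast<std::size_t>(ch - 'a')];
        }
    }
    return counts;
}
```

For instance, `CountLowercase("hello")['l' - 'a']` is 2. The table needs exactly as many slots as the key range, which is why the method only suits dense key ranges.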

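Division (remainder) method

Function
Hash(Key) = Key % m, where m is any value no greater than the number of addresses available in the table (a prime is recommended)
Use case
Suitable when the keys are spread over a wide range

e.g.:
[Figure: division method]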

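Notes:

The division method requires the key being reduced with % to be an integer. How can a string key be turned into an integer?
Answer: with a string hash function. The main measure of a hash function's quality is its collision rate: within the available resources, the fewer collisions it produces, the better it performs.
Common string hash algorithms include BKDRHash, APHash, and DJBHash.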

eg:

// BKDR Hash Function
unsigned int BKDRHash(const char *str)
{
    unsigned int seed = 131; // 31 131 1313 13131 131313 etc..
    unsigned int hash = 0;
 
    while (*str)
    {
        hash = hash * seed + (*str++);
    }
 
    return (hash & 0x7FFFFFFF);
}
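
To show how a string hash combines with the division method, here is a minimal sketch. `BucketIndex` is a hypothetical helper, not part of the original code; it applies the same multiply-by-131 recurrence and then reduces the result modulo the bucket count. Characters are read as `unsigned char`, which is an assumption of this sketch.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: BKDR-style string hash followed by the division method.
std::size_t BucketIndex(const std::string& key, std::size_t bucketCount)
{
    unsigned int hash = 0;
    for (unsigned char ch : key)
    {
        hash = hash * 131u + ch;   // same recurrence as BKDRHash
    }
    return (hash & 0x7FFFFFFFu) % bucketCount;
}
```

With 53 buckets, "ab" hashes to 97 * 131 + 98 = 12805, so its index is 12805 % 53 = 32, while "ba" lands at index 3. Character order matters, which a plain sum of character codes would not capture.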

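With the division method it is best to take the remainder by a prime. How can we quickly obtain primes that roughly double in size?
Answer: use a predefined list of primes. The list may differ between STL implementations. In the list below each prime is roughly twice the previous one, so moving to the next entry roughly doubles the table while keeping its size prime. This helps keep the load factor (the average number of elements per bucket) within a small range and therefore gives better performance.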

// Prime list
size_t GetNextPrime(size_t prime)
{
	const int PRIMECOUNT = 28;
	static const size_t primeList[PRIMECOUNT] =
	{
		53ul, 97ul, 193ul, 389ul, 769ul,
		1543ul, 3079ul, 6151ul, 12289ul, 24593ul,
		49157ul, 98317ul, 196613ul, 393241ul, 786433ul,
		1572869ul, 3145739ul, 6291469ul, 12582917ul,
		25165843ul,
		50331653ul, 100663319ul, 201326611ul, 402653189ul,
		805306457ul,
		1610612741ul, 3221225473ul, 4294967291ul
	};
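	size_t i = 0;
	for (; i < PRIMECOUNT; ++i)
	{
		if (primeList[i] > prime)
			return primeList[i];
	}
	// No larger prime in the list: return the largest one rather than reading past the end
	return primeList[PRIMECOUNT - 1];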
}
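
The same lookup can be written with a binary search. The sketch below is a hypothetical variant (`NextPrimeSketch`, with a shortened prime list chosen only for illustration) that uses `std::upper_bound` and clamps to the largest entry when no larger prime is available.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>

// Hypothetical variant of GetNextPrime using a shortened prime list.
std::size_t NextPrimeSketch(std::size_t current)
{
    static const std::size_t primes[] = { 53, 97, 193, 389, 769, 1543, 3079, 6151 };
    const std::size_t* last = std::end(primes);
    // First prime strictly greater than `current`.
    const std::size_t* it = std::upper_bound(std::begin(primes), last, current);
    // Past the end of the list: clamp to the largest prime instead of reading out of bounds.
    return it == last ? *(last - 1) : *it;
}
```

A table that currently has 53 buckets would grow to `NextPrimeSketch(53)`, which is 97, instead of to 106.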

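Hash collisions

Continuing the example above: if 25 is added to the data set, the address computed by the hash function is already occupied by another key.

Definition: when different keys are mapped by the same hash function to the same hash address, this is called a hash collision.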
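
A collision is easy to reproduce with the division method. The sketch below assumes a table of 10 addresses (the size is an assumption; the original figure is not available) and uses two hypothetical helpers, `HashMod` and `Collides`.

```cpp
#include <cstddef>

// Division method: Hash(Key) = Key % m.
std::size_t HashMod(std::size_t key, std::size_t m)
{
    return key % m;
}

// Two distinct keys collide when they map to the same address.
bool Collides(std::size_t a, std::size_t b, std::size_t m)
{
    return a != b && HashMod(a, m) == HashMod(b, m);
}
```

With m = 10, keys 5 and 25 both map to address 5, so `Collides(5, 25, 10)` is true; with m = 11 they map to 5 and 3 and no longer collide.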

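Summary

The design of the hash function directly affects how often collisions occur.
A hash function should satisfy the following:

Its domain must include every key to be stored; if the table has m addresses, its range must be 0 to m-1
The addresses it produces should be evenly distributed across the table
It should be simple to compute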

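3. Resolving hash collisions

There are two ways to resolve hash collisions: closed hashing and open hashing.

Closed hashing

Closed hashing, also called open addressing: when a collision occurs and the table is not full, there must still be a free slot, so we search for the next free slot starting from the colliding position.

Linear probing

Idea: starting from the colliding position, probe successive slots one at a time until a free slot is found.

Pros and cons

Pro: simple to implement
Con: once collisions occur next to each other, entries tend to cluster, and search efficiency drops

Insertion

Use the hash function to compute the element's target position in the table
If that position is empty, insert there directly; if it is occupied, a collision has occurred, so use linear probing to find the next free slot and insert there.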

eg:
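
As a rough illustration of the insertion steps, the sketch below inserts non-negative keys into a fixed-size table using linear probing. The function `LinearProbeInsert`, the use of -1 as the "empty" marker, and the table size of 10 in the usage note are all assumptions of this example.

```cpp
#include <cstddef>
#include <vector>

// Sketch: insert non-negative keys into a non-empty, fixed-size table with linear probing.
// A slot holding -1 is treated as empty. Returns the index used, or -1 if the table is full.
int LinearProbeInsert(std::vector<int>& table, int key)
{
    const std::size_t m = table.size();
    const std::size_t start = static_cast<std::size_t>(key) % m;
    for (std::size_t step = 0; step < m; ++step)
    {
        const std::size_t idx = (start + step) % m;   // wrap around at the end
        if (table[idx] == -1)
        {
            table[idx] = key;
            return static_cast<int>(idx);
        }
    }
    return -1;   // every slot is occupied
}
```

In a table of 10 slots that already holds 5 and 6 at indices 5 and 6, inserting 25 probes index 5, then 6, and finally stores the key at index 7.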

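Deletion

Because of collisions, an element cannot simply be removed: doing so would break searches for elements stored after it. For example, physically removing 6 from the table in the previous example would affect the search for 25.
So we use lazy deletion and give every slot in the table a state.
State: EMPTY means the slot is empty, EXIST means the slot holds an element, DELETE means the element in this slot has been deleted.

enum STATE
{
	EXIST, 
	EMPTY,
	DELETE
};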
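
To make the effect of the slot state concrete, here is a minimal lookup sketch. The types `SlotState` and `Slot` and the function `Contains` are hypothetical names for this example; the lookup stops only at an empty slot and steps over deleted ones.

```cpp
#include <cstddef>
#include <vector>

enum class SlotState { Empty, Exist, Deleted };

struct Slot
{
    int key = 0;
    SlotState state = SlotState::Empty;
};

// Linear-probing lookup: stop only at an Empty slot, step over Deleted ones.
bool Contains(const std::vector<Slot>& table, int key)
{
    const std::size_t m = table.size();
    const std::size_t start = static_cast<std::size_t>(key) % m;
    for (std::size_t step = 0; step < m; ++step)
    {
        const Slot& s = table[(start + step) % m];
        if (s.state == SlotState::Empty)
            return false;                      // the probe chain ends here
        if (s.state == SlotState::Exist && s.key == key)
            return true;
    }
    return false;                              // scanned the whole table
}
```

Suppose 5, 6 and 25 sit at indices 5, 6 and 7 of a 10-slot table. Marking slot 6 as Deleted keeps 25 reachable, whereas resetting it to Empty would cut the probe chain and make the lookup for 25 fail.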

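Quadratic probing

Unlike linear probing, which checks slots one after another, quadratic probing jumps to candidate slots using a formula.
Hash(i) = (Hash(x) + i^2) % m
Hash(x): the position computed from the key by the hash function, which is already occupied
Hash(i): the candidate position for storing the element
m: the size of the hash table
i = 1, 2, 3, 4, ...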
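
The probe sequence given by this formula can be computed directly. The sketch below uses a hypothetical function `QuadraticProbeSequence` that returns the first few candidate slots for a given home position.

```cpp
#include <cstddef>
#include <vector>

// Returns the first `count` candidate slots for a home position `home`:
// (home + i^2) % m for i = 1, 2, 3, ...
std::vector<std::size_t> QuadraticProbeSequence(std::size_t home, std::size_t m, std::size_t count)
{
    std::vector<std::size_t> seq;
    for (std::size_t i = 1; i <= count; ++i)
    {
        seq.push_back((home + i * i) % m);
    }
    return seq;
}
```

For home position 5 in a table of size 10, the first four candidates are 6, 9, 4 and 1, so colliding keys are scattered rather than packed into adjacent slots.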

Note: besides linear probing and quadratic probing there is also double hashing, among others.

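Implementation

#include <string>
#include <utility>
#include <vector>
using namespace std;

// Open addressing
namespace open_address 
{
	// Hash function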
	template<class K>
	struct DefaultHashFunc
	{
		size_t operator()(const K& key)
		{
			return size_t(key);     // convert to an unsigned integer type
		}
	};
	
	// Template specialization for strings (BKDRHash algorithm)
	template<>
	struct DefaultHashFunc<string>
	{
		size_t operator()(const string& s)
		{
			size_t hash = 0;
			for (auto ch : s)
			{
				hash *= 131;
				hash += ch;
			}
			return hash;
		}
	};


	// Slot state
	enum STATE
	{
		EXIST,
		EMPTY,
		DELETE
	};

	// Stored entry
	template<class K, class V>
	struct HashData
	{
		pair<K, V> _kv;
		STATE _state = EMPTY;
	};

	template<class K, class V, class HashFunc = DefaultHashFunc<K>>
	class HashTable
	{
	public:
		HashTable()
		{
			_table.resize(10);     // start the table with ten slots
		}

		bool Insert(const pair<K, V>& kv)
		{
			if (Find(kv.first))
				return false;

			// Grow the table based on the load factor
			//if ((double)_n / (double)_table.size() >= 0.7)
			if (10 * _n / _table.size() >= 7)
			{
				size_t newSize = _table.size() * 2;
				
				// Build a new table
				HashTable<K, V, HashFunc> newHT;
				newHT._table.resize(newSize);

				// Walk the old table and re-map every entry into the new one
				for (size_t i = 0; i < _table.size(); i++)
				{
					if (_table[i]._state == EXIST)
					{
						newHT.Insert(_table[i]._kv);
					}
				}

				// Swap the tables; the old storage is destroyed when newHT goes out of scope
				_table.swap(newHT._table);
			}

			// Linear probing
			HashFunc hf;
			size_t hashi = hf(kv.first) % _table.size();
			while (_table[hashi]._state == EXIST)
			{
				++hashi;
				hashi %= _table.size();
			}
			_table[hashi]._kv = kv;
			_table[hashi]._state = EXIST;
			_n++;
			return true;
		}

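		HashData<const K, V>* Find(const K& key)
		{
			HashFunc hf;
			size_t hashi = hf(key) % _table.size();
			// Probe at most _table.size() slots so the loop also ends when deleted slots leave no EMPTY slot
			for (size_t probed = 0; probed < _table.size() && _table[hashi]._state != EMPTY; ++probed)
			{
				if (_table[hashi]._state == EXIST
					&& _table[hashi]._kv.first == key)
				{
					// &_table[hashi] has type HashData<K, V>*
					return (HashData<const K, V>*)&_table[hashi];    
				}

				++hashi;
				// Wrap around to the front when the end of _table is reached
				hashi %= _table.size();
			}
			return nullptr;
		}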

		bool Erase(const K& key)
		{
			HashData<const K, V>* ret = Find(key);
			if (ret)
			{
				ret->_state = DELETE;
				--_n;
				return true;
			}
			return false;
		}

	private:
		vector<HashData<K, V>> _table;
		size_t _n = 0;           // number of stored elements
	};

}

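Load factor (growing the table)

The load factor is computed as α = number of elements stored in the table / length of the table.
For open addressing the load factor is a critical parameter; empirically it should be kept below roughly 0.7 to 0.8. Since the table length is fixed, α is proportional to the number of stored elements, so the larger α is, the more likely collisions become; beyond about 0.8, probe sequences get long and lookups suffer more CPU cache misses.

On every insertion, check the load factor to decide whether the table needs to grow, trading space for time.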
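
The load-factor check can be done with integer arithmetic only. The sketch below uses a hypothetical helper `NeedsGrow` with a threshold of 0.7; comparing `10 * n` against `7 * tableSize` is equivalent to testing n / tableSize >= 0.7 without converting to floating point.

```cpp
#include <cstddef>

// Integer-only test for n / tableSize >= 0.7, avoiding floating point.
bool NeedsGrow(std::size_t n, std::size_t tableSize)
{
    return 10 * n >= 7 * tableSize;
}
```

For a table of 10 slots, `NeedsGrow(6, 10)` is false and `NeedsGrow(7, 10)` is true, so the table grows before the eighth element is stored.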

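Open hashing

Open hashing, also called separate chaining: the hash function is applied to every key to compute its hash address, keys with the same address are grouped into the same subset, and each subset is called a bucket. The elements in a bucket are linked together in a singly linked list, and the head of each list is stored in the hash table.

[Figure: hash buckets]

In the figure, the buckets at indices 5 and 8 both contain collisions.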
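
Before the hand-written version, the idea can be sketched with standard containers. `ChainedSet` below is a hypothetical class for non-negative `int` keys that stores each bucket as a `std::forward_list`; it is an illustration under those assumptions, not the implementation used in this article.

```cpp
#include <algorithm>
#include <cstddef>
#include <forward_list>
#include <iterator>
#include <vector>

// Minimal chaining sketch: each bucket is a singly linked list of keys.
class ChainedSet
{
public:
    explicit ChainedSet(std::size_t bucketCount) : _buckets(bucketCount) {}

    bool Insert(int key)
    {
        auto& bucket = _buckets[Index(key)];
        if (std::find(bucket.begin(), bucket.end(), key) != bucket.end())
            return false;            // already present
        bucket.push_front(key);      // head insertion
        return true;
    }

    bool Contains(int key) const
    {
        const auto& bucket = _buckets[Index(key)];
        return std::find(bucket.begin(), bucket.end(), key) != bucket.end();
    }

    std::size_t BucketSize(std::size_t i) const
    {
        return static_cast<std::size_t>(std::distance(_buckets[i].begin(), _buckets[i].end()));
    }

private:
    std::size_t Index(int key) const
    {
        return static_cast<std::size_t>(key) % _buckets.size();
    }

    std::vector<std::forward_list<int>> _buckets;
};
```

With 10 buckets, inserting 5, 15 and 25 puts three nodes in bucket 5, and inserting 8 and 18 puts two nodes in bucket 8; colliding keys simply share a list instead of displacing each other.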

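Implementation

#include <cstdio>
#include <iostream>
#include <string>
#include <utility>
#include <vector>
using namespace std;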

namespace hash_bucket
{
	template<class K>
	struct DefaultHashFunc
	{
		size_t operator()(const K& key)
		{
			return size_t(key);
		}
	};
	// Template specialization for strings
	template<>
	struct DefaultHashFunc<string>
	{
		size_t operator()(const string& s)
		{
			size_t hash = 0;
			for (auto ch : s)
			{
				hash *= 131;
				hash += ch;
			}

			return hash;
		}
	};


	template<class K, class V>
	struct HashNode
	{
		pair<K, V> _kv;
		HashNode<K, V>* _next;

		// Constructor
		HashNode(const pair<K, V>& kv)
			:_kv(kv)
			,_next(nullptr)
		{}
	};


	template<class K, class V, class HashFunc = DefaultHashFunc<K>>
	class HashTable
	{
		typedef HashNode<const K, V> Node;
	public:
		HashTable()
		{
			// Start with ten buckets, all set to nullptr
			_table.resize(10, nullptr);
		}

		~HashTable()
		{
			for (size_t i = 0; i < _table.size(); i++)
			{
				Node* cur = _table[i];
				while (cur)
				{
					Node* next = cur->_next;
					delete cur;
					cur = next;
				}
				_table[i] = nullptr;
			}
		}

		bool Insert(const pair<K, V>& kv)
		{
			if (Find(kv.first))
			{
				return false;
			}
			HashFunc hf;

			// Grow when the load factor reaches 1
			if (_n == _table.size())
			{
				size_t newSize = _table.size() * 2;
				vector<Node*> newTable;
				newTable.resize(newSize, nullptr);

				// Walk the old table and move each node into the new one
				for (size_t i = 0; i < _table.size(); i++)
				{
					Node* cur = _table[i];
					while (cur)
					{
						Node* next = cur->_next;
						size_t hashi = hf(cur->_kv.first) % newSize;
						cur->_next = newTable[hashi];
						newTable[hashi] = cur;

						cur = next;
					}
					_table[i] = nullptr;
				}
				_table.swap(newTable);
			}

			size_t hashi = hf(kv.first) % _table.size();
			Node* newnode = new Node(kv);
			newnode->_next = _table[hashi];
			_table[hashi] = newnode;
			_n++;
			return true;
		}

		Node* Find(const K& key)
		{
			HashFunc hf;
			size_t hashi = hf(key) % _table.size();
			Node* cur = _table[hashi];
			while (cur)
			{
				if (cur->_kv.first == key)
				{
					return cur;
				}
				cur = cur->_next;
			}
			return nullptr;
		}

		bool Erase(const K& key)
		{
			HashFunc hf;
			size_t hashi = hf(key) % _table.size();
			Node* cur = _table[hashi];
			Node* prev = nullptr;
			while (cur)
			{
				if (cur->_kv.first == key)
				{
					if (prev == nullptr)
					{
						_table[hashi] = cur->_next;
					}
					else
					{
						prev->_next = cur->_next;
					}

					delete cur;
					--_n;   // one fewer stored element
					return true;
				}
				prev = cur;
				cur = cur->_next;
			}
			return false;
		}

		void Print()
		{
			for (size_t i = 0; i < _table.size(); i++)
			{
				printf("[%zu]->", i);
				Node* cur = _table[i];
				while (cur)
				{
					cout << cur->_kv.first << ":" << cur->_kv.second << "->";
					cur = cur->_next;
				}
				printf("nullptr\n");
			}
			cout << endl;
		}

	private:
		vector<Node*> _table;
		size_t _n = 0;
	};
}

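Growing the table

The number of buckets is fixed (number of buckets == size of the table). Without growing, a single bucket may end up holding many elements, which hurts the hash table's performance. The ideal case for open hashing is exactly one node per bucket; once every bucket holds a node, any further insertion must collide. The growth condition can therefore be: number of elements == number of buckets.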
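
The standard containers follow the same idea and expose it through `load_factor()`, `max_load_factor()` and `bucket_count()`. The sketch below (the function name `LoadFactorStaysBounded` is an assumption for this example) inserts keys into a `std::unordered_set` and checks that the load factor never exceeds the maximum, which defaults to 1.0.

```cpp
#include <cstddef>
#include <unordered_set>

// Inserts `count` distinct keys and reports whether the standard container
// kept its load factor within max_load_factor() the whole time.
bool LoadFactorStaysBounded(std::size_t count)
{
    std::unordered_set<int> s;
    for (std::size_t i = 0; i < count; ++i)
    {
        s.insert(static_cast<int>(i));
        // load_factor() is size() / bucket_count(); the container rehashes
        // automatically so that it does not exceed max_load_factor().
        if (s.load_factor() > s.max_load_factor())
            return false;
    }
    return true;
}
```

How many buckets the container allocates at each rehash is implementation-defined, so only the bound is checked here, not specific bucket counts.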

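II. Wrapping the unordered containers

The interfaces of the unordered set and map containers are similar to those of the red-black-tree-based set and map and are used in much the same way, so they are not introduced here.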
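
For readers who want a quick reminder of the standard interface anyway, here is a minimal sketch using `std::unordered_map` and `std::unordered_set`. The helper names `CountWords` and `CountDistinct` are made up for this example.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Counts word occurrences with std::unordered_map; operator[] inserts a
// value-initialized count (0) the first time a key is seen.
std::unordered_map<std::string, int> CountWords(const std::vector<std::string>& words)
{
    std::unordered_map<std::string, int> counts;
    for (const auto& w : words)
    {
        ++counts[w];
    }
    return counts;
}

// Number of distinct values, using std::unordered_set to drop duplicates.
std::size_t CountDistinct(const std::vector<int>& values)
{
    std::unordered_set<int> seen(values.begin(), values.end());
    return seen.size();
}
```

Unlike `std::map` and `std::set`, iteration order over these containers is unspecified, because elements are arranged by hash bucket rather than by key order.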

hash_table

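How the iterator works (forward iterator)

Iterator ++

If the current bucket has not been fully traversed, follow the linked list to the next node
If the current bucket has been fully traversed:
a. Use the hash function to compute the current bucket index, then add 1
b. Loop while the index is less than the size of the hash table
   I. If the bucket at that index is not empty, the next node has been found; return
   II. If the bucket at that index is empty, add 1 and keep looping
c. If the loop ends without finding a node, set the node pointer to nullptr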

Self& operator++()
{
	if (_node->_next)  // current bucket not finished
	{
		_node = _node->_next;
	}
	else               // current bucket finished
	{
		HashFunc hf;
		KeyOfT kot;

		size_t hashi = hf(kot(_node->_data)) % _pht->_table.size();
		++hashi;
		while (hashi < _pht->_table.size())
		{
			if (_pht->_table[hashi])
			{
				_node = _pht->_table[hashi];
				return *this;
			}
			else
			{
				hashi++;
			}
		}
		_node = nullptr;
	}
	return *this;
}
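
The bucket-scanning part of operator++ can be isolated into a small free function. In the sketch below, `DemoNode` is a hypothetical stand-in for the node type and `NextBucketHead` is a made-up helper; it returns the head of the first non-empty bucket after a given index.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical node type standing in for HashNode<T>.
struct DemoNode
{
    int value;
    DemoNode* next;
};

// Given the bucket index of the current node, returns the head of the next
// non-empty bucket, or nullptr when no bucket after `bucketIndex` holds a node.
const DemoNode* NextBucketHead(const std::vector<DemoNode*>& table, std::size_t bucketIndex)
{
    for (std::size_t i = bucketIndex + 1; i < table.size(); ++i)
    {
        if (table[i] != nullptr)
            return table[i];
    }
    return nullptr;
}
```

A null result plays the role of the end iterator: once the last non-empty bucket has been traversed, the scan runs off the table and returns nullptr.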

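hash_table implementation

#include <string>
#include <utility>
#include <vector>
using namespace std;

// 1. Hash table
// 2. Wrapping map and set
// 3. Plain iterator
// 4. const iterator
// 5. Return value of insert, operator[]
// 6. Keeping the key immutable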

namespace hash_bucket
{
	template<class K>
	struct DefaultHashFunc
	{
		size_t operator()(const K& key)
		{
			return (size_t)key;
		}
	};

	template<>   // specialization
	struct DefaultHashFunc<string>
	{
		size_t operator()(const string& str)
		{
			size_t hash = 0;
			for (auto ch : str)
			{
				hash *= 131;
				hash += ch;
			}
			return hash;
		}
	};


	template<class T>
	struct HashNode
	{
		T _data;
		HashNode<T>* _next;

		HashNode(const T& data)
			:_data(data)
			, _next(nullptr)
		{}
	};

	// Forward declaration, because the iterator holds a pointer to the hash table
	template<class K, class T, class KeyOfT, class HashFunc>
	class HashTable;


	// Iterator
	template<class K, class T, class Ptr, class Ref, class KeyOfT, class HashFunc>
	struct HTIterator
	{
		typedef HashNode<T> Node;
		typedef HTIterator<K, T, Ptr, Ref, KeyOfT, HashFunc> Self;

		// Plain (non-const) iterator
		typedef HTIterator<K, T, T*, T&, KeyOfT, HashFunc> Iterator;

		Node* _node;
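		// Pointer to the hash table. It must point to const: const member functions such as begin() const pass a const this, and a non-const pointer here would discard that qualifier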
		const HashTable<K, T, KeyOfT, HashFunc>* _pht;
		
		HTIterator(Node* node, const HashTable<K, T, KeyOfT, HashFunc>* pht)
			:_node(node)
			,_pht(pht)
		{}

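		// For a plain iterator this is the copy constructor
		// For a const iterator this is a converting constructor that builds a const iterator from a plain one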
		HTIterator(const Iterator& it)
			:_node(it._node)
			, _pht(it._pht)
		{}

		Ref operator*()
		{
			return _node->_data;
		}

		Ptr operator->()
		{
			return &_node->_data;
		}

		bool operator!=(const Self& s) const
		{
			return _node != s._node;
		}

		bool operator==(const Self& s) const
		{
			return _node == s._node;
		}

		Self& operator++()
		{
			if (_node->_next)  // current bucket not finished
			{
				_node = _node->_next;
			}
			else               // current bucket finished
			{
				HashFunc hf;
				KeyOfT kot;

				size_t hashi = hf(kot(_node->_data)) % _pht->_table.size();
				++hashi;
				while (hashi < _pht->_table.size())
				{
					if (_pht->_table[hashi])
					{
						_node = _pht->_table[hashi];
						return *this;
					}
					else
					{
						hashi++;
					}
				}
				_node = nullptr;
			}
			return *this;
		}
	};


	//set -> hash_bucket::HashTable<K, K> _ht
	//map -> hash_bucket::HashTable<K, pair<K, V>> _ht
	template<class K, class T, class KeyOfT, class HashFunc = DefaultHashFunc<K>>
	class HashTable
	{
		typedef HashNode<T> Node;

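		// Friend declaration: the iterator reads the table's private members through its pointer.
		// The parameter names must not reuse the enclosing class's template parameter names.
		template<class K1, class T1, class Ptr, class Ref, class KeyOfT1, class HashFunc1>
		friend struct HTIterator;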

	public:

		typedef HTIterator<K, T, T*, T&, KeyOfT, HashFunc> iterator;
		typedef HTIterator<K, T, const T*, const T&, KeyOfT, HashFunc> const_iterator;

		iterator begin()
		{
			for (size_t i = 0; i < _table.size(); i++)
			{
				Node* cur = _table[i];
				if (cur)
				{
					return iterator(cur, this);
				}
			}

			return iterator(nullptr, this);
		}

		iterator end()
		{
			return iterator(nullptr, this);
		}


		const_iterator begin()  const
		{
			for (size_t i = 0; i < _table.size(); i++)
			{
				Node* cur = _table[i];
				if (cur)
				{
					return const_iterator(cur, this);
				}
			}

			return const_iterator(nullptr, this);
		}

		const_iterator end() const
		{
			return const_iterator(nullptr, this);
		}



		HashTable()
		{
			_table.resize(10, nullptr);
		}

		~HashTable()
		{
			for (size_t i = 0; i < _table.size(); i++)
			{
				Node* cur = _table[i];
				while (cur)
				{
					Node* next = cur->_next;
					delete cur;
					cur = next;
				}

				_table[i] = nullptr;
			}
		}

		pair<iterator, bool> Insert(const T& data)
		{
			HashFunc hf;
			KeyOfT kot;

			iterator it = Find(kot(data));


			if (it != end())
			{
				return make_pair(it, false);
			}


			// Grow when the load factor reaches 1
			if (_n == _table.size())
			{
				size_t newSize = _table.size() * 2;
				vector<Node*> newTable;
				newTable.resize(newSize, nullptr);

				// Walk the old table and move each node over to the new table
				for (size_t i = 0; i < _table.size(); i++)
				{
					Node* cur = _table[i];
					while (cur)
					{
						Node* next = cur->_next;

						size_t hashi = hf(kot(cur->_data)) % newSize;   // rehash the node being moved, not the new data
						cur->_next = newTable[hashi];
						newTable[hashi] = cur;

						cur = next;
					}

					_table[i] = nullptr;
				}

				_table.swap(newTable);
			}

			size_t hashi = hf(kot(data)) % _table.size();
			// Head insertion
			Node* newnode = new Node(data);
			newnode->_next = _table[hashi];
			_table[hashi] = newnode;
			++_n;
			return make_pair(iterator(newnode, this), true);
		}

		iterator Find(const K& key)
		{
			HashFunc hf;
			KeyOfT kot;

			size_t hashi = hf(key) % _table.size();
			Node* cur = _table[hashi];
			while (cur)
			{
				if (kot(cur->_data) == key)
				{
					return iterator(cur, this);
				}

				cur = cur->_next;
			}

			return iterator(nullptr, this);
		}

		bool Erase(const K& key)
		{
			HashFunc hf;
			KeyOfT kot;

			size_t hashi = hf(key) % _table.size();
			Node* prev = nullptr;
			Node* cur = _table[hashi];
			while (cur)
			{
				if (kot(cur->_data) == key)
				{
					if (prev == nullptr)
					{
						_table[hashi] = cur->_next;
					}
					else
					{
						prev->_next = cur->_next;
					}

					delete cur;
					--_n;   // decrement only when an element was actually removed
					return true;
				}

				prev = cur;
				cur = cur->_next;
			}

			return false;
		}

	private:
		vector<Node*> _table; // array of bucket head pointers
		size_t _n = 0;        // number of stored elements
	};
}

Wrapping unordered_set

namespace kpl
{
	template<class K>
	class unordered_set
	{
		// This functor exists only to mirror what map needs; for set the key is the data itself
		struct SetKeyOfT
		{
			const K& operator()(const K& key)
			{
				return key;
			}
		};
	public:
		typedef typename hash_bucket::HashTable<K, K, SetKeyOfT>::const_iterator iterator;
		typedef typename hash_bucket::HashTable<K, K, SetKeyOfT>::const_iterator const_iterator;


		const_iterator begin() const
		{
			return _ht.begin();
		}

		const_iterator end() const
		{
			return _ht.end();
		}


		pair<iterator, bool> insert(const K& key)
		{
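			// The first member of the returned pair is a plain iterator, so receive it as one
			pair<typename hash_bucket::HashTable<K, K, SetKeyOfT>::iterator, bool> ret = _ht.Insert(key);
			// Build a const iterator from the plain iterator; this is where the iterator's converting constructor is used
			return pair<iterator, bool>(ret.first, ret.second);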
		}

	private:
		hash_bucket::HashTable<K, K, SetKeyOfT> _ht;
	};
}

Wrapping unordered_map

namespace kpl
{
	template<class K, class V>
	class unordered_map
	{
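		// The functor does its real work here: it extracts the key from the key-value pair (for set it is a pass-through).
		// The parameter type matches the stored pair<const K, V>, so no temporary pair is created and the returned reference stays valid.
		struct MapKeyOfT
		{
			const K& operator()(const pair<const K, V>& kv)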
			{
				return kv.first;
			}
		};

	public:
		typedef typename hash_bucket::HashTable<K, pair<const K, V>, MapKeyOfT>::iterator iterator;
		typedef typename hash_bucket::HashTable<K, pair<const K, V>, MapKeyOfT>::const_iterator const_iterator;

		iterator begin()
		{
			return _ht.begin();
		}

		iterator end()
		{
			return _ht.end();
		}

		const_iterator begin() const
		{
			return _ht.begin();
		}

		const_iterator end() const
		{
			return _ht.end();
		}

		pair<iterator, bool> insert(const pair<K, V>& kv)
		{
			return _ht.Insert(kv);
		}
	
		// Returns a reference to the value mapped to key, inserting a default-constructed value if key is absent.
		V& operator[](const K& key)
		{
			pair<iterator, bool> ret = _ht.Insert(make_pair(key, V()));
			return ret.first->second;
		}

	private:
		hash_bucket::HashTable<K, pair<const K, V>, MapKeyOfT> _ht;
	};

}