Learn C++ Hashing in One Article

Original article, last modified 2025-09-24 02:41:11

License: CC 4.0 BY-SA

First published 2025-09-21 23:55:00

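The unordered family of associative containers

❤ Must-read first: the red-black tree article

The STL provides a family of containers built on red-black trees. A lookup in them costs O(log2 N): in the worst case it takes as many comparisons as the tree is tall, which is still not ideal when there is a lot of data. C++11 therefore added the unordered family of containers, whose lookups are O(1) on average. They are used in much the same way as the red-black-tree containers; only the underlying structure differs. There are four of them as well:
unordered_map, unordered_set, unordered_multimap, unordered_multiset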

unordered_map

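unordered_map documentation
Properties:
1. unordered_map stores <key, value> pairs, and a key leads directly to its value.
2. Keys in an unordered_map are unique; the value is the mapped value, and the two may have different types.
3. unordered_map does not keep its <key, value> pairs in any particular order.
4. unordered_map places elements whose keys hash to the same bucket index in the same hash bucket.
5. Looking up a single key in an unordered_map is generally faster than in a map, but iterating over the elements is generally slower (in a simple bucket-array implementation the iteration has to step over many empty buckets).
6. unordered_map implements operator[], so a value can be accessed directly through its key.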
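Property 3 can be seen in a small sketch. The helper names `map_keys` and `unordered_map_keys` are made up for this illustration; the only claim is that std::map iterates in ascending key order while the order of std::unordered_map is unspecified.

```cpp
#include <map>
#include <unordered_map>
#include <vector>

// Keys of a std::map in iteration order: always ascending, duplicates dropped.
std::vector<int> map_keys(const std::vector<int>& input)
{
    std::map<int, int> m;
    for (int x : input) m[x] = x;

    std::vector<int> keys;
    for (const auto& kv : m) keys.push_back(kv.first);
    return keys;
}

// Keys of a std::unordered_map in iteration order: the same set of keys,
// but the order is unspecified and may differ between standard libraries.
std::vector<int> unordered_map_keys(const std::vector<int>& input)
{
    std::unordered_map<int, int> m;
    for (int x : input) m[x] = x;

    std::vector<int> keys;
    for (const auto& kv : m) keys.push_back(kv.first);
    return keys;
}
```

Code that needs sorted output should therefore use map, or sort the keys after collecting them.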

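Interface:

1 Construction:


unordered_map	Constructs unordered_map objects of various kinds

2 Capacity:


bool empty() const	Checks whether the unordered_map is empty
size_t size() const	Returns the number of elements in the unordered_map

3 Iterators:


begin()	Returns an iterator to the first element
end()	Returns an iterator to the position after the last element
cbegin()	Returns a const iterator to the first element
cend()	Returns a const iterator to the position after the last element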

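4 💪 Element access []


operator[]	Accesses the value through its key

❤ Note: this function actually performs an insertion of <key, V()>.
⭐ If the key is not in the hash table, the pair is inserted and a reference to the new V() is returned.
⭐ If the key is already in the hash table, the insertion fails, and a reference to the existing value is returned.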
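The insert-on-access behavior of operator[] can be sketched as follows. The function names `count_words` and `size_after_lookup` are hypothetical helpers for this illustration; only std::unordered_map itself comes from the standard library.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Word counting relies on operator[]: a missing key is first inserted
// with int() == 0, and the increment is then applied to that new value.
std::unordered_map<std::string, int> count_words(const std::vector<std::string>& words)
{
    std::unordered_map<std::string, int> counts;
    for (const auto& w : words)
    {
        ++counts[w];
    }
    return counts;
}

// Merely reading through operator[] also inserts the key.
std::size_t size_after_lookup()
{
    std::unordered_map<std::string, int> m;
    int v = m["missing"]; // inserts {"missing", 0}
    (void)v;              // the value read is 0
    return m.size();      // the map now holds one element
}
```

When a lookup must not modify the container, find or count is the safer choice.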

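5 Lookup


iterator find(const K& key)	Returns an iterator to the element with that key, or end() if there is none
size_t count(const K& key)	Returns the number of elements whose key equals key

Note: for unordered_map, count returns at most 1, because keys are unique.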
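A minimal sketch of the two lookup calls, assuming string keys and int values. `lookup` and `copies_in_multimap` are illustrative names, not part of the standard library.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Looks up `key` with find(). Writes the value to `out` and returns true
// only if the key exists. Unlike operator[], find() never inserts.
bool lookup(const std::unordered_map<std::string, int>& m,
            const std::string& key, int& out)
{
    auto it = m.find(key);
    if (it == m.end())
    {
        return false;
    }
    out = it->second;
    return true;
}

// count() is 0 or 1 for unordered_map; for unordered_multimap,
// which allows duplicate keys, it can be larger.
std::size_t copies_in_multimap(const std::unordered_multimap<std::string, int>& m,
                               const std::string& key)
{
    return m.count(key);
}
```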

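6 Modification


insert(make_pair())	Inserts a key-value pair
erase(const K& key)	Removes the element with that key
void clear()	Empties the container
void swap(unordered_map&)	Swaps the contents of two containers

7 unordered_map bucket operations


size_t bucket_count() const	Returns the total number of buckets
size_t bucket(const K& key)	Returns the index of the bucket that key belongs to
size_t bucket_size(size_t n)	Returns the number of elements in bucket n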

#include <iostream>
#include <unordered_map>
using namespace std;
int main()
{
	int a[] = { 5,8,7,6,89,5,3,5,5,4,11};
	unordered_map<int, int> mp;
	for (auto& e : a)
	{
		mp.insert(make_pair(e, e));
	}
	cout << mp.bucket_count()<<endl;
	for (auto& e:a)
	{
		cout<<mp.bucket(e)<<" ";
	}
	return 0;
}

There is little to add about unordered_set: it differs from set only in that its elements are not kept in order.
unordered_set documentation

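❤ Underlying structure

The hash containers are efficient because they are built on a hash structure.

The concept of hashing

In sequential structures and balanced trees, a key has no relationship to the position where it is stored, so finding an element takes repeated key comparisons: O(N) for a sequential structure, O(logN) for a tree. The cost is determined by the number of comparisons.

The ideal search would find the element in one step with no comparisons at all, by establishing a one-to-one mapping between each key and its storage position so that a search goes straight to it.

With such a structure:

⭐ Inserting an element: pass the key to the hash function, compute the storage position, and store the element there.
⭐ Searching for an element: pass the key to the hash function and treat the result as the storage position; if the key stored there is equal, the search succeeds.

Take the set {1, 7, 6, 4, 5, 9} as an example.
Let the hash function be hash(key) = key % capacity, where capacity is the total size of the underlying storage.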

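[Figure: the set {1, 7, 6, 4, 5, 9} stored in the table at positions key % capacity]
🚩 This approach is the hash method; the conversion function it uses is called a hash function, and the resulting structure is called a hash table.
Because it avoids repeated comparisons and goes straight to the element, it is fast.

Question: what happens if 44 is inserted?
44 % 10 = 4, but position 4 already holds 4, so there is a hash collision.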
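The example can be replayed in a few lines. This is only a sketch under the assumptions of the text (non-negative keys, capacity 10); `slot_of` and `collides` are names invented here.

```cpp
#include <cstddef>
#include <vector>

// hash(key) = key % capacity, as in the example above.
std::size_t slot_of(unsigned key, std::size_t capacity)
{
    return key % capacity;
}

// Returns true if `key` maps to a slot that one of the stored keys
// already occupies, i.e. inserting it would cause a hash collision.
bool collides(const std::vector<unsigned>& stored, unsigned key, std::size_t capacity)
{
    for (unsigned s : stored)
    {
        if (slot_of(s, capacity) == slot_of(key, capacity))
        {
            return true;
        }
    }
    return false;
}
```

With the stored set {1, 7, 6, 4, 5, 9} and capacity 10, `collides` reports a collision for 44 (slot 4) but not for 8 (slot 8 is still free).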

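Hash collisions

When different keys are mapped by the hash function to the same hash address, this is called a hash collision.
Elements with different keys but the same hash address are called synonyms.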

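Hash functions

One cause of hash collisions is a poorly designed hash function.

Design principles for a hash function:

The domain of the hash function must include every key that needs to be stored, and if the hash table has m addresses, its results must lie between 0 and m-1.
The addresses it computes should be spread evenly across the hash table.
The hash function should be fairly simple.

Common hash functions
1 🚩 Direct addressing
Take a linear function of the key as the hash address: hash(key) = A*key + B
Advantage: simple and uniform
Disadvantage: the distribution of the keys must be known in advance
Best suited to a small, contiguous range of keys
2 🚩 Division method (remainder method)
If the hash table allows m addresses, pick as the divisor a prime p that is no greater than m and as close to m as possible (or equal to it). The hash function hash(key) = key % p turns a key into an address.

Mid-square method (for reference)
Suppose the key is 1234; its square is 1522756, and the middle 3 digits, 227, are taken as the hash address.
Likewise, if the key is 4321, its square is 18671041, and the middle 3 digits, 671 (or 710), are taken as the hash address.
The mid-square method suits cases where the key distribution is unknown and the keys do not have many digits.
Folding method (for reference)
The folding method splits the key from left to right into parts with an equal number of digits (the last part may be shorter), adds the parts together, and, according to the table length, takes the last few digits of the sum as the hash address.
The folding method suits cases where the key distribution need not be known in advance and the keys have many digits.
Random number method (for reference)
Choose a random function and take its value for the key as the hash address, i.e. H(key) = random(key), where random is a random number function.
This method is usually used when the keys differ in length.
Digit analysis method (for reference)
Suppose there are n keys of d digits each, and each digit can take r different symbols. The r symbols do not necessarily occur with the same frequency in every digit position: in some positions they are spread evenly, with every symbol equally likely, while in others only a few symbols occur often. Depending on the size of the hash table, choose several positions in which the symbols are evenly distributed and use them as the hash address.

[Figure: a company's employee register]
In a company's employee register, the first 7 digits of the keys tend to be the same, so the last 4 digits are taken as the hash address.
The digit analysis method suits keys with many digits whose distribution is known in advance and in which several digit positions are evenly distributed.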
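Three of the methods above can be sketched in code. The function names and the choice to work on decimal digit strings are assumptions of this illustration; the mid-square sketch picks the left-biased middle (671 rather than 710 for 4321), and `group` must be at least 1.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Division method: hash(key) = key % p, with p a prime close to the table size.
std::size_t division_hash(std::uint64_t key, std::size_t p)
{
    return static_cast<std::size_t>(key % p);
}

// Mid-square method: square the key and keep `digits` decimal digits
// from the middle. Assumes key * key fits in 64 bits.
std::uint64_t mid_square_hash(std::uint64_t key, std::size_t digits)
{
    std::string sq = std::to_string(key * key);
    if (digits >= sq.size())
    {
        return std::stoull(sq);
    }
    std::size_t start = (sq.size() - digits) / 2; // left-biased middle
    return std::stoull(sq.substr(start, digits));
}

// Folding method: split the decimal digits left to right into groups of
// `group` digits (the last group may be shorter), add the groups, and
// reduce the sum to the table size.
std::size_t folding_hash(std::uint64_t key, std::size_t group, std::size_t table_size)
{
    std::string s = std::to_string(key);
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < s.size(); i += group)
    {
        sum += std::stoull(s.substr(i, group));
    }
    return static_cast<std::size_t>(sum % table_size);
}
```

For example, `folding_hash(123456789, 3, 1000)` adds 123 + 456 + 789 = 1368 and keeps the last three digits, 368.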

Note: a well-chosen hash function reduces collisions, but it cannot eliminate them.

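Resolving hash collisions

There are two approaches: closed hashing and open hashing.

Closed hashing

Also called open addressing: as long as the hash table is not full, a colliding element can be placed in an empty slot.
So how is an empty slot found?

1. Linear probing:
Starting from the position of the collision, search forward one slot at a time (wrapping around at the end of the table) and place the element in the first empty slot found.
Take the 44 from the example above:
[Figure: 44 inserted by linear probing]
It ends up at position 8.

Deletion:
An element cannot simply be erased. If position 4 were cleared, a later search for 44 would start at position 4, find it empty, and report failure,
even though 44 is actually stored at position 8. Deletion therefore only marks the slot as deleted (a pseudo-deletion).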
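The insertion of 44 and the deletion problem can be reproduced with a deliberately small sketch. `ProbeTable` and `Slot` are hypothetical names, keys are assumed to be unsigned, the capacity is assumed to be greater than 0, and there is no resizing or duplicate check; it is separate from the article's own implementation.

```cpp
#include <cstddef>
#include <vector>

enum class Slot { Empty, Full, Deleted };

struct ProbeTable
{
    std::vector<unsigned> keys;
    std::vector<Slot> state;

    explicit ProbeTable(std::size_t capacity)
        : keys(capacity, 0), state(capacity, Slot::Empty) {}

    // Linear probing insert. Returns the slot used, or capacity if the table is full.
    std::size_t insert(unsigned key)
    {
        const std::size_t cap = keys.size();
        std::size_t i = key % cap;
        for (std::size_t step = 0; step < cap; ++step)
        {
            if (state[i] != Slot::Full)
            {
                keys[i] = key;
                state[i] = Slot::Full;
                return i;
            }
            i = (i + 1) % cap; // wrap around at the end of the table
        }
        return cap;
    }

    // Returns the slot holding `key`, or capacity if it is not found.
    // The search stops at an Empty slot but walks past Deleted ones.
    std::size_t find(unsigned key) const
    {
        const std::size_t cap = keys.size();
        std::size_t i = key % cap;
        for (std::size_t step = 0; step < cap; ++step)
        {
            if (state[i] == Slot::Empty) break;
            if (state[i] == Slot::Full && keys[i] == key) return i;
            i = (i + 1) % cap;
        }
        return cap;
    }

    // tombstone == false simulates the naive deletion that clears the slot.
    void erase(unsigned key, bool tombstone)
    {
        std::size_t i = find(key);
        if (i != keys.size())
        {
            state[i] = tombstone ? Slot::Deleted : Slot::Empty;
        }
    }
};
```

After inserting 1, 7, 6, 4, 5, 9 and then 44 into a table of 10 slots, erasing 4 without a tombstone makes `find(44)` fail, while erasing it with a tombstone keeps 44 reachable at slot 8.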

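#include <string>
#include <utility>
#include <vector>
using namespace std;

// ❤ Each slot of the hash table carries a state flag:
// EMPTY: the slot is empty, EXIST: the slot holds an element, DELETE: the element was deleted
enum STATE
{
    EMPTY,
    EXIST,
    DELETE
};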
template<class K, class V>
struct HashNode
{
    pair<K, V> _kv;
    STATE _state = EMPTY;
};

template<class K>
struct DefaultHashFunc
{
    size_t operator()(const K& key)
    {
        return (size_t)key;
    }
};
template<>
struct DefaultHashFunc<string>
{
    size_t operator()(const string& str)
    {
        size_t hash = 0;
        //BKDR
        for (auto& ch : str)
        {
            hash *= 131;
            hash += ch;
        }
        return hash;
    }
};
namespace jib {
    template<class K, class V, class HashFunc = DefaultHashFunc<K>>
    class HashTable
    {
    public:
        typedef HashNode<K, V> Node;
        HashTable()
        {
            _table.resize(10);
        }
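        bool Insert(const pair<K, V>& kv)
        {
            // Reject duplicate keys
            if (Find(kv.first))
            {
                return false;
            }
            // Grow once the load factor reaches 0.7
            if (_n * 10 / _table.size() >= 7)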
            {
                size_t newsize = _table.size() * 2;
                HashTable<K, V, HashFunc> newHT;
                newHT._table.resize(newsize);

                for (size_t i = 0;i < _table.size();i++)
                {
                    // Re-insert only live elements; skip EMPTY and DELETE slots
                    if (_table[i]._state == EXIST)
                    {
                        newHT.Insert(_table[i]._kv);
                    }
                }
                _table.swap(newHT._table);
            }
            HashFunc hf;
            size_t hashi = hf(kv.first) % _table.size();
            while (_table[hashi]._state == EXIST)
            {
                ++hashi;
                hashi %= _table.size();
            }
            _table[hashi]._kv = kv;
            _table[hashi]._state = EXIST;
            ++_n;
            return true;
        }
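        HashNode<const K, V>* Find(const K& key)
        {
            HashFunc hf;
            size_t hashi = hf(key) % _table.size();
            // Probe at most one full cycle, so a table with no EMPTY slot cannot loop forever
            for (size_t probed = 0; probed < _table.size() && _table[hashi]._state != EMPTY; ++probed)
            {
                // Compare the stored key with the key itself, not with its hash value
                if (_table[hashi]._state == EXIST && _table[hashi]._kv.first == key)
                {
                    return (HashNode<const K, V>*) & _table[hashi];
                }
                ++hashi;
                hashi %= _table.size();
            }
            return nullptr;
        }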
        bool Erase(const K& key)
        {

            HashNode<const K, V>* ret = Find(key);
            if (ret)
            {
                ret->_state = DELETE; // mark the slot as deleted instead of clearing it
                --_n;
                return true;
            }
            return false;
        }
    private:
        vector<HashNode<K, V>> _table;
        size_t _n = 0; // number of live elements
    };
}

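HashFunc is a functor that converts a key into an integer that can be used in the modulo operation: an int key is returned as-is, and a string key is converted with the BKDR algorithm.

Question: when does the hash table grow, and how?
Answer: when the load factor reaches 0.7. A new table of twice the size is created, every live element is re-inserted into it (the positions change because the table size changed), and the two tables are then swapped.
Load factor = number of live elements / length of the hash table. The larger the load factor, the fuller the table and the more likely collisions become.
Once the load factor is >= 0.7, the table grows.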
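The growth rule can be written as a small integer-only check. `needs_growth` and `grown_capacity` are illustrative helpers, and the capacity is assumed to be greater than 0.

```cpp
#include <cstddef>

// True once size / capacity >= 0.7, computed without floating point:
// size / capacity >= 7 / 10  <=>  size * 10 >= capacity * 7.
bool needs_growth(std::size_t size, std::size_t capacity)
{
    return size * 10 >= capacity * 7;
}

// Capacity after doubling until the load factor drops below 0.7.
// Assumes capacity > 0.
std::size_t grown_capacity(std::size_t size, std::size_t capacity)
{
    while (needs_growth(size, capacity))
    {
        capacity *= 2;
    }
    return capacity;
}
```

With 10 slots, the seventh element triggers growth to 20 slots, which brings the load factor down to 0.35.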

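Advantage of linear probing: simple to implement.
Disadvantage of linear probing: collisions tend to pile up (clustering). Elements with different keys fill a contiguous stretch of the table, so a later collision needs many comparisons to find an empty slot, which hurts efficiency.

This is what quadratic probing addresses.
Starting from the address computed by the hash function, linear probing can only step to the next slot, which encourages clustering. Quadratic probing instead tries offsets of 1², 2², 3², ... from that address (some variants alternate between the right and the left side), so colliding elements are spread further apart.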
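The difference between the two probe orders can be made concrete. This sketch assumes the common one-sided variant of quadratic probing, (home + i²) % capacity; the function names are invented for the illustration.

```cpp
#include <cstddef>
#include <vector>

// First `count` slots visited by linear probing: home, home+1, home+2, ...
std::vector<std::size_t> linear_probe_sequence(std::size_t home, std::size_t capacity,
                                               std::size_t count)
{
    std::vector<std::size_t> slots;
    for (std::size_t i = 0; i < count; ++i)
    {
        slots.push_back((home + i) % capacity);
    }
    return slots;
}

// First `count` slots visited by quadratic probing: home, home+1, home+4, home+9, ...
std::vector<std::size_t> quadratic_probe_sequence(std::size_t home, std::size_t capacity,
                                                  std::size_t count)
{
    std::vector<std::size_t> slots;
    for (std::size_t i = 0; i < count; ++i)
    {
        slots.push_back((home + i * i) % capacity);
    }
    return slots;
}
```

For home slot 4 in a table of 10, linear probing visits 4, 5, 6, 7, 8, while quadratic probing visits 4, 5, 8, 3, 0.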

The biggest drawback of closed hashing is its low space utilization, which is also a drawback of hashing in general.

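Open hashing

Open hashing, also called separate chaining: after the hash function computes the addresses, elements with the same hash address are placed in the same subset, and each subset is called a bucket. The elements in a bucket are linked together as a singly linked list, and the head of each list is stored in the hash table.
[Figure: a hash table whose buckets are singly linked lists]
Open hashing implementation: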

namespace hash_bucket
{
	template<class K, class V>
	struct HashNode
	{
		pair<K, V> _kv;
		HashNode<K, V>* _next;

		HashNode(const pair<K, V>& kv)
			:_kv(kv)
			,_next(nullptr)
		{}
	};

	template<class K, class V, class HashFunc = DefaultHashFunc<K>>
	class HashTable
	{
		typedef HashNode<K, V> Node;
	public:
		HashTable()
		{
			_table.resize(10, nullptr);
		}

		~HashTable()
		{
			for (size_t i = 0; i < _table.size(); i++)
			{
				Node* cur = _table[i];
				while (cur)
				{
					Node* next = cur->_next;
					delete cur;
					cur = next;
				}

				_table[i] = nullptr;
			}
		}

		bool Insert(const pair<K, V>& kv)
		{
			if(Find(kv.first))
			{
				return false;
			}

			HashFunc hf;

			// Grow once the load factor reaches 1
			if (_n == _table.size())
			{
				size_t newSize = _table.size()*2;
				vector<Node*> newTable;
				newTable.resize(newSize, nullptr);

				// Walk the old table and move each existing node to the new table (no reallocation)
				for (size_t i = 0; i < _table.size(); i++)
				{
					Node* cur = _table[i];
					while (cur)
					{
						Node* next = cur->_next;

						// Push onto the front of its bucket in the new table
						size_t hashi = hf(cur->_kv.first) % newSize;
						cur->_next = newTable[hashi];
						newTable[hashi] = cur;

						cur = next;
					}

					_table[i] = nullptr;
				}

				_table.swap(newTable);
			}

			size_t hashi = hf(kv.first) % _table.size();
			// Push onto the front of the bucket
			Node* newnode = new Node(kv);
			newnode->_next = _table[hashi];
			_table[hashi] = newnode;
			++_n;
			return true;
		}

		Node* Find(const K& key)
		{
			HashFunc hf;
			size_t hashi = hf(key) % _table.size();
			Node* cur = _table[hashi];
			while (cur)
			{
				if (cur->_kv.first == key)
				{
					return cur;
				}

				cur = cur->_next;
			}

			return nullptr;
		}

		bool Erase(const K& key)
		{
			HashFunc hf;
			size_t hashi = hf(key) % _table.size();
			Node* prev = nullptr;
			Node* cur = _table[hashi];
			while (cur)
			{
				if (cur->_kv.first == key)
				{
					if (prev == nullptr)
					{
						_table[hashi] = cur->_next;
					}
					else
					{
						prev->_next = cur->_next;
					}

					delete cur;
					--_n; // keep the element count in sync
					return true;
				}

				prev = cur;
				cur = cur->_next;
			}

			return false;
		}

		void Print()
		{
			for (size_t i = 0; i < _table.size(); i++)
			{
				printf("[%zu]->", i); // i is a size_t, so use %zu
				Node* cur = _table[i];
				while (cur)
				{
					cout << cur->_kv.first <<":"<< cur->_kv.second<< "->";
					cur = cur->_next;
				}
				printf("NULL\n");
			}
			cout << endl;
		}

	private:
		vector<Node*> _table; // array of bucket head pointers
		size_t _n = 0; // number of elements stored
	};
}

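Separate chaining generally uses space more economically than closed hashing (closed hashing has to reserve a lot of spare slots to keep searches efficient).

❤ Reworking the hash table

Adding iterators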

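namespace Hash_Bucket
{
	// The node now stores a generic payload T (K for a set, pair<K, V> for a map)
	template<class T>
	struct HashNode
	{
		T _data;
		HashNode<T>* _next;

		HashNode(const T& data)
			:_data(data)
			,_next(nullptr)
		{}
	};

	// Forward declaration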
	template<class K, class T, class KeyOfT, class HashFunc>
	class HashTable;

	template<class K, class T, class Ptr, class Ref, class KeyOfT, class HashFunc>
	struct HTIterator
	{
		typedef HashNode<T> Node;
		typedef HTIterator<K, T, Ptr, Ref, KeyOfT, HashFunc> Self;
		typedef HTIterator<K, T, T*, T&, KeyOfT, HashFunc> Iterator;

		Node* _node;
		const HashTable<K, T, KeyOfT, HashFunc>* _pht;

		HTIterator(Node* node, const HashTable<K, T, KeyOfT, HashFunc>* pht)
			:_node(node)
			, _pht(pht)
		{}

		// For a plain iterator this is the copy constructor;
		// for a const iterator it is a converting constructor from a plain iterator
		HTIterator(const Iterator& it)
			:_node(it._node)
			, _pht(it._pht)
		{}

		Ref operator*()
		{
			return _node->_data;
		}

		Ptr operator->()
		{
			return &_node->_data;
		}

		Self& operator++()
		{
			if (_node->_next)
			{
				// The current bucket still has nodes left
				_node = _node->_next;
			}
			else
			{
				KeyOfT kot;
				HashFunc hf;
				size_t hashi = hf(kot(_node->_data)) % _pht->_table.size();
				// Starting from the next position, look for the next non-empty bucket
				++hashi;
				while (hashi < _pht->_table.size())
				{
					if (_pht->_table[hashi])
					{
						_node = _pht->_table[hashi];
						return *this;
					}
					else
					{
						++hashi;
					}
				}

				_node = nullptr;
			}

			return *this;
		}

		bool operator!=(const Self& s)
		{
			return _node != s._node;
		}

		bool operator==(const Self& s)
		{
			return _node == s._node;
		}
	};

template<class K, class T, class KeyOfT, class HashFunc = DefaultHashFunc<K>>
	class HashTable
	{
		typedef HashNode<T> Node;

		// Friend declaration (its template parameter names must not shadow the enclosing class's)
		template<class K1, class T1, class Ptr, class Ref, class KeyOfT1, class HashFunc1>
		friend struct HTIterator;
	public:
		typedef HTIterator<K, T, T*, T&, KeyOfT, HashFunc> iterator;
		typedef HTIterator<K, T, const T*, const T&, KeyOfT, HashFunc> const_iterator;

		iterator begin()
		{
			// Find the first non-empty bucket
			for (size_t i = 0; i < _table.size(); i++)
			{
				Node* cur = _table[i];
				if (cur)
				{
					return iterator(cur, this);
				}
			}

			return iterator(nullptr, this);
		}

		iterator end()
		{
			return iterator(nullptr, this);
		}

		const_iterator begin() const
		{
			// Find the first non-empty bucket
			for (size_t i = 0; i < _table.size(); i++)
			{
				Node* cur = _table[i];
				if (cur)
				{
					return const_iterator(cur, this);
				}
			}

			return const_iterator(nullptr, this);
		}

		const_iterator end() const
		{
			return const_iterator(nullptr, this);
		}
		// Division method: table sizes are taken from a fixed list of primes
		size_t GetNextPrime(size_t prime)
		{
			static const size_t PRIMECOUNT = 28;
			static const unsigned long primeList[PRIMECOUNT] =
			{
			  53,         97,         193,       389,       769,
			  1543,       3079,       6151,      12289,     24593,
			  49157,      98317,      196613,    393241,    786433,
			  1572869,    3145739,    6291469,   12582917,  25165843,
			  50331653,   100663319,  201326611, 402653189, 805306457,
			  1610612741, 3221225473, 4294967291
			};

			for (size_t i = 0; i < PRIMECOUNT; ++i)
			{
				if (primeList[i] > prime)
					return primeList[i];
			}

			// No larger prime in the list: stay at the largest one instead of reading past the end
			return primeList[PRIMECOUNT - 1];
		}

		HashTable()
		{
			_table.resize(GetNextPrime(1), nullptr);
		}

		~HashTable()
		{
			for (size_t i = 0; i < _table.size(); i++)
			{
				Node* cur = _table[i];
				while (cur)
				{
					Node* next = cur->_next;
					delete cur;
					cur = next;
				}

				_table[i] = nullptr;
			}
		}

		pair<iterator, bool> Insert(const T& data)
		{
			KeyOfT kot;

			iterator it = Find(kot(data));
			if(it != end())
			{
				return make_pair(it, false);
			}

			HashFunc hf;

			// Grow once the load factor reaches 1
			if (_n == _table.size())
			{
				//size_t newSize = _table.size() * 2;
				size_t newSize = GetNextPrime(_table.size());
				vector<Node*> newTable;
				newTable.resize(newSize, nullptr);

				// Walk the old table and move each existing node to the new table (no reallocation)
				for (size_t i = 0; i < _table.size(); i++)
				{
					Node* cur = _table[i];
					while (cur)
					{
						Node* next = cur->_next;

						// Push onto the front of its bucket in the new table
						size_t hashi = hf(kot(cur->_data)) % newSize;
						cur->_next = newTable[hashi];
						newTable[hashi] = cur;

						cur = next;
					}

					_table[i] = nullptr;
				}

				_table.swap(newTable);
			}

			size_t hashi = hf(kot(data)) % _table.size();
			// Push onto the front of the bucket
			Node* newnode = new Node(data);
			newnode->_next = _table[hashi];
			_table[hashi] = newnode;
			++_n;
			return make_pair(iterator(newnode, this), true);
		}

		iterator Find(const K& key)
		{
			HashFunc hf;
			KeyOfT kot;
			size_t hashi = hf(key) % _table.size();
			Node* cur = _table[hashi];
			while (cur)
			{
				if (kot(cur->_data) == key)
				{
					return iterator(cur, this);
				}

				cur = cur->_next;
			}

			return end();
		}

		bool Erase(const K& key)
		{
			HashFunc hf;
			KeyOfT kot;
			size_t hashi = hf(key) % _table.size();
			Node* prev = nullptr;
			Node* cur = _table[hashi];
			while (cur)
			{
				if (kot(cur->_data) == key)
				{
					if (prev == nullptr)
					{
						_table[hashi] = cur->_next;
					}
					else
					{
						prev->_next = cur->_next;
					}

					--_n;
					delete cur;	
					return true;
				}

				prev = cur;
				cur = cur->_next;
			}

			return false;
		}

		void Print()
		{
			for (size_t i = 0; i < _table.size(); i++)
			{
				printf("[%zu]->", i);
				Node* cur = _table[i];
				while (cur)
				{
					// T may be K (set) or pair<K, V> (map), so print only the key
					cout << KeyOfT()(cur->_data) << "->";
					cur = cur->_next;
				}
				printf("NULL\n");
			}
			cout << endl;
		}

	private:
		vector<Node*> _table; // array of bucket head pointers
		size_t _n = 0; // number of elements stored
	};
}

Implementing a simplified set

namespace jib
{ 
	template<class K>
	class unorderedset
	{
		struct SetKeyOfT
		{
			const K& operator()(const K& key)
			{
				return key;
			}
		};
	public:
		typedef typename Hash_Bucket::HashTable<K, K, SetKeyOfT>::const_iterator const_iterator;
		// Keys must not be modified through an iterator, so iterator is the const iterator too
		typedef typename Hash_Bucket::HashTable<K, K, SetKeyOfT>::const_iterator iterator;
		const_iterator begin() const
		{
			return _hb.begin();
		}
		const_iterator end() const
		{
			return _hb.end();
		}
		pair<const_iterator,bool> Insert(const K& key)
		{
			// The table returns a plain iterator; convert it to the const iterator explicitly
			pair<typename Hash_Bucket::HashTable<K, K, SetKeyOfT>::iterator, bool> ret = _hb.Insert(key);
			return pair<const_iterator, bool>(ret.first, ret.second);
		}
	private:
		Hash_Bucket::HashTable<K, K, SetKeyOfT> _hb;
	};
	
}

Implementing a simplified map

namespace jib {
	template<class K,class V>
	class unorderedmap
	{
		struct MapKeyOfT
		{
			const K& operator()(const pair<K, V>& kv)
			{
				return kv.first;
			}
		};
	public:
		typedef typename Hash_Bucket::HashTable<K, pair<K, V>, MapKeyOfT>::iterator iterator;
		typedef typename Hash_Bucket::HashTable<K, pair<K, V>, MapKeyOfT>::const_iterator const_iterator;
		iterator begin()
		{
			return _hb.begin();
		}
		const_iterator begin() const
		{
			return _hb.begin();
		}
		iterator end()
		{
			return _hb.end();
		}
		const_iterator end() const
		{
			return _hb.end();
		}
		pair<iterator,bool> Insert(const pair<K,V>& kv)
		{
			return _hb.Insert(kv);
		}
		V& operator [](const K& key)
			{
			pair<iterator, bool> ret = _hb.Insert(make_pair((key), V()) );
			return ret.first->second;
			}
	private:
		Hash_Bucket::HashTable<K, pair<K, V>, MapKeyOfT> _hb;
	};
}

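❤ Applications of hashing

Bitmap

Interview question: given 4 billion distinct unsigned integers in no particular order, and one more unsigned integer, how can you quickly decide whether that integer is among the 4 billion?
❤ Bitmap: whether a value is present has exactly two states, present or absent, which is exactly what a single binary bit can record.

One 32-bit integer can therefore record the presence of 32 different values.

For example:
[Figure: values mapped to individual bits of the bitmap]
A bit is cleared to 0 with bitwise AND and set to 1 with bitwise OR.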
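The index arithmetic behind this can be sketched on a single 32-bit word. The helper names are invented here; `bitmap_bytes` simply computes how much memory one bit per possible value requires.

```cpp
#include <cstddef>
#include <cstdint>

// Value x lives in word x / 32, at bit x % 32 inside that word.
std::size_t word_index(std::uint32_t x) { return x / 32; }
unsigned bit_offset(std::uint32_t x) { return x % 32; }

// Set bit j to 1 with bitwise OR.
std::uint32_t set_bit(std::uint32_t word, unsigned j)
{
    return word | (std::uint32_t{1} << j);
}

// Clear bit j to 0 with bitwise AND against the inverted mask.
std::uint32_t clear_bit(std::uint32_t word, unsigned j)
{
    return word & ~(std::uint32_t{1} << j);
}

// Check whether bit j is 1.
bool test_bit(std::uint32_t word, unsigned j)
{
    return ((word >> j) & 1u) != 0;
}

// Bytes needed to keep one bit for each of `value_count` possible values.
std::uint64_t bitmap_bytes(std::uint64_t value_count)
{
    return (value_count + 7) / 8;
}
```

For the interview question, all 2^32 possible unsigned values need 2^32 bits, which is 512 MB, far less than storing the 4 billion integers themselves (about 16 GB).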
Bitmap implementation:

#include <iostream>
#include <vector>
using namespace std;
namespace jib {
	template <size_t N>
	class bitset
	{
	public:
		bitset()
		{
			_a.resize(N / 32 + 1);
		}
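	    void set(size_t x)
		{
			// Set the bit to 1 with bitwise OR
			size_t i = x / 32;
			size_t j = x % 32;
			_a[i] |= (1u << j);
		}
		void reset(size_t x)
		{
			// Clear the bit to 0 with bitwise AND
			size_t i = x / 32;
			size_t j = x % 32;
			_a[i] &= (~(1u << j));
		}
		bool test(size_t x) const
		{
			size_t i = x / 32;
			size_t j = x % 32;
			// Check whether the bit is 1
			return (_a[i] & (1u << j)) != 0;
		}
	private:
		// Unsigned words, so that shifting into bit 31 is well defined
		vector<unsigned int> _a;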
	};

	template <size_t N>
	class twobitset
	{
	public:
		void set(size_t x)
		{
			// 00 -> 01: first occurrence
			if (!a1.test(x) && !a2.test(x))
			{
				a2.set(x);
			}
			// 01 -> 10: 10 means two or more occurrences
			else
			{
				if (!a1.test(x) && a2.test(x))
				{
					a1.set(x);
					a2.reset(x);
				}
			}
		}
		bool is_once(size_t x)
		{
			return !a1.test(x) && a2.test(x);
		}
	private:
		bitset<N> a1;
		bitset<N> a2;
	};
}

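Bitmaps suit massive sets of integers with no duplicates, and are used to decide whether a given value exists.
They can also be used to compute the intersection and union of two sets.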
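Intersection and union fall out of bitwise AND and OR on two bitmaps. The sketch below uses std::bitset over a small, hypothetical universe of 64 values (`kUniverse`); the function names are illustrative.

```cpp
#include <bitset>
#include <cstddef>
#include <vector>

// Hypothetical universe size for this illustration: values 0..63.
constexpr std::size_t kUniverse = 64;

// Marks every value of the set in a bitmap; values outside the universe are ignored.
std::bitset<kUniverse> to_bitmap(const std::vector<std::size_t>& values)
{
    std::bitset<kUniverse> bits;
    for (std::size_t v : values)
    {
        if (v < kUniverse) bits.set(v);
    }
    return bits;
}

// A value is in the intersection if its bit is set in both bitmaps.
std::bitset<kUniverse> bitmap_intersection(const std::bitset<kUniverse>& a,
                                           const std::bitset<kUniverse>& b)
{
    return a & b;
}

// A value is in the union if its bit is set in either bitmap.
std::bitset<kUniverse> bitmap_union(const std::bitset<kUniverse>& a,
                                    const std::bitset<kUniverse>& b)
{
    return a | b;
}
```

For {1, 3, 5} and {3, 5, 7}, the intersection has bits 3 and 5 set and the union has bits 1, 3, 5 and 7 set.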

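Bloom filter

When a media platform recommends videos, it needs to know our viewing history so that videos we have already watched are not pushed again. How is this de-duplication done? On every refresh it checks the history for each candidate video, so how can that lookup be made fast?
A hash-bucket table would waste too much space.
A bitmap only works for integers, so it cannot be used directly when items are identified by strings.

Combining hashing with a bitmap gives the Bloom filter.
Concept:
The Bloom filter is a compact data structure proposed by Bloom. 💪 Its strengths are fast insertion and lookup. It tells you that an item is either definitely absent or possibly present: several hash functions map each item onto a bitmap, which improves speed and saves space.
[Figure: the strings 百度 and 腾讯 mapped into a bitmap by 3 hash functions]
As the figure shows:

The strings 百度 (Baidu) and 腾讯 (Tencent) are each mapped into the bitmap by 3 hash functions, and those bits are used to decide whether a string is present.

⭐ Notice, however, that both strings hit positions 1 and 4. With only one hash function, suppose 百度 is mapped to position 4 and 腾讯 has not been inserted yet: a query for 腾讯 finds a 1 at position 4, and we would wrongly conclude that 腾讯 is present.

⭐ That is why 3 hash functions are used, to improve accuracy. False positives are still possible and cannot be avoided entirely. On the other hand, the 3 hash functions compute 3 positions, and if any one of them holds 0, the string is definitely absent.
Code implementation: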

#include <iostream>
#include <vector>
#include <string>
#include <bitset>
using namespace std;
struct BKDRHash
{
    size_t operator()(const string& str)
    {
        size_t hash=0;
        for (auto ch : str)
        {
            hash *= 31;
            hash += ch;
        }
        return hash;
    }
};
struct APHash
{
    size_t operator()(const string& str)
    {
        size_t hash=0;
        for (size_t i = 0; i < str.size(); i++)
        {
            size_t ch = str[i];
            if ((i & 1) == 0)
            {
                hash ^= ((hash << 7) ^ ch ^ (hash >> 3));
            }
            else
            {
                hash ^= (~((hash << 11) ^ ch ^ (hash >> 5)));
            }
        }
        return hash;
    }
};
struct DJBHash
{
    size_t operator()(const string& str)
    {
        size_t hash = 5381;
        for (auto ch : str)
        {
            hash += (hash << 5) + ch;
        }

        //cout << "DJBHash:" << hash << endl;
        return hash;
    }
};
template <size_t N,class K=string,class Hash1=BKDRHash ,class Hash2=APHash,class Hash3=DJBHash>
class BloomFilter
{
public:
    void set(const K& key)
    {
        size_t hash1 = Hash1()(key) % N;
        _bs.set(hash1);

        size_t hash2 = Hash2()(key) % N;
        _bs.set(hash2);

        size_t hash3 = Hash3()(key) % N;
        _bs.set(hash3);
    }
    bool Test(const K& key)
    {
    // A single 0 bit means the key is definitely absent
        size_t hash1 = Hash1()(key) % N;
        if (_bs.test(hash1)==false)
        {
            return false;
        }
        size_t hash2 = Hash2()(key) % N;
        if (_bs.test(hash2) == false)
        {
            return false;
        }
        size_t hash3 = Hash3()(key) % N;
        if (_bs.test(hash3) == false)
            return false;

        return true;
    }
private:
    bitset<N> _bs;
};
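The guarantee a Bloom filter gives can be checked with a self-contained miniature. `TinyBloom` is a hypothetical stand-in written for this illustration, with three BKDR-style hashes that differ only in their multiplier (31, 131, 1313, chosen arbitrarily); it is not the class above, but it follows the same idea.

```cpp
#include <bitset>
#include <cstddef>
#include <string>

template <std::size_t Bits>
class TinyBloom
{
public:
    // Sets the three bits chosen by the three hash functions.
    void add(const std::string& key)
    {
        for (int i = 0; i < 3; ++i)
        {
            bits_.set(slot(key, i));
        }
    }

    // false: the key was definitely never added.
    // true:  the key was possibly added (false positives can occur).
    bool maybe_contains(const std::string& key) const
    {
        for (int i = 0; i < 3; ++i)
        {
            if (!bits_.test(slot(key, i)))
            {
                return false;
            }
        }
        return true;
    }

private:
    // i-th hash function: a BKDR-style polynomial hash with its own multiplier.
    static std::size_t slot(const std::string& key, int i)
    {
        static const std::size_t seeds[3] = {31, 131, 1313};
        std::size_t h = 0;
        for (unsigned char ch : key)
        {
            h = h * seeds[i] + ch;
        }
        return h % Bits;
    }

    std::bitset<Bits> bits_;
};
```

An empty filter answers false for every key, and a key that has been added always answers true; only the "possibly present" answer can be wrong.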

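💪 Interview questions on massive data

Hash partitioning

Given a log file larger than 100 GB that stores IP addresses, design an algorithm to find the IP address that occurs most often.

Solution: sorting 100 GB directly would certainly exhaust memory, so we split it. Hash each IP address and use the result to distribute the 100 GB file across 1000 small files; identical addresses are then guaranteed to land in the same file. For each small file, count the addresses with a map and find the most frequent one; finally compare the winners of all the small files.
But there is a catch: will the small files be equal in size? What if many addresses collide after hashing? The split cannot be assumed to be even.

A small file that is too large falls into one of two cases:
1. The same IP address is repeated many times, with few hash collisions.
2. The addresses are mostly different, but they collide after hashing.

How to tell them apart:
a. Read the small file into a set. If this throws an exception (out of memory), it is the second case: a set is built on a red-black tree and takes a lot of space, the small file is only about 100 MB, and inserting many distinct addresses easily exhausts memory. In this case, partition that file again with a different hash function.
b. If the read succeeds, the file is dominated by repeated IP addresses (a set removes duplicates, so repeated addresses are not inserted again).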
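The partition-then-count idea can be simulated in memory. In this sketch the "small files" are plain vectors, std::hash stands in for the partitioning hash function, `parts` is assumed to be greater than 0, and `most_frequent_ip` is a name invented here; a real solution would stream the log from disk and write actual files.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Returns the most frequent address in `log`, or an empty string for an empty log.
std::string most_frequent_ip(const std::vector<std::string>& log, std::size_t parts)
{
    // Step 1: partition by hash. Equal addresses always land in the same part.
    std::vector<std::vector<std::string>> files(parts);
    std::hash<std::string> hasher;
    for (const auto& ip : log)
    {
        files[hasher(ip) % parts].push_back(ip);
    }

    // Step 2: count each part on its own and keep the overall winner.
    std::string best;
    std::size_t best_count = 0;
    for (const auto& file : files)
    {
        std::unordered_map<std::string, std::size_t> counts;
        for (const auto& ip : file)
        {
            ++counts[ip];
        }
        for (const auto& kv : counts)
        {
            if (kv.second > best_count)
            {
                best = kv.first;
                best_count = kv.second;
            }
        }
    }
    return best;
}
```

Because all occurrences of one address sit in a single part, the count found inside that part is already the global count, so comparing the per-part winners is enough.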

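Bitmap variations

Given 10 billion integers, design an algorithm to find the integers that occur exactly once.
This was covered above (twobitset), so it is not repeated here.
Given two files with 10 billion integers each and only 1 GB of memory, how do you find the intersection of the two files?
Store one file in a bitmap (for 32-bit integers the bitmap needs 2^32 bits, i.e. 512 MB), then scan the other file and check whether each value is present.
A file contains 10 billion ints and there is 1 GB of memory: design an algorithm to find all integers that occur no more than 2 times.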

template <size_t N>
class twobitset
{
public:
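	void set(size_t x)
	{
		// 00 -> 01: first occurrence
		if (!a1.test(x) && !a2.test(x))
		{
			a2.set(x);
		}
		// 01 -> 10: second occurrence
		else if (!a1.test(x) && a2.test(x))
		{
			a1.set(x);
			a2.reset(x);
		}
		// 10 -> 11: third occurrence; 11 means three or more and stays 11
		else if (a1.test(x) && !a2.test(x))
		{
			a2.set(x);
		}
	}
	bool is_once(size_t x)
	{
		return !a1.test(x) && a2.test(x);
	}
	// True if x occurred once or twice (state 01 or 10)
	bool at_most_twice(size_t x)
	{
		return a1.test(x) != a2.test(x);
	}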
private:
	bitset<N> a1;
	bitset<N> a2;
};