_Hashtable
The STL does not expose hashtable directly; instead it is wrapped as unordered_map and unordered_set, which is to say both are implemented on top of _Hashtable.
The structure of _Hashtable is shown below. It shares some traits with a traditional hash list, but also differs in several respects.
Similarities:
- _M_buckets is an array of node pointers holding the addresses of nodes. Each node stores two main members: a next pointer and a value.
- Insert, erase, update, and lookup follow broadly the same two-step path: first compute hash(key) to obtain bucket_index, i.e. which chain to operate on; then perform the list insertion (or removal).
- Both face the rehash problem.
Differences:
- Each bucket's chain includes a head node, i.e. the before-begin node of a forward list. With this node in place, insertion and erasure are handled uniformly.
- In a traditional hash list, the last node of each chain points to nullptr. In the STL it instead points to the first node of the next bucket's chain (the red lines in the figure above), so all nodes together form a single forward list. This makes iterator traversal very convenient (forward-only, which is why unordered_set/unordered_map return a LegacyForwardIterator). The trade-off is extra work when testing for the end of a bucket's chain: we can no longer use node->next == nullptr to detect the last node.
- Each chain's head node, the one the corresponding buckets[] entry points to (the node at the green arrow in the figure), is actually the last node of the previous bucket. This trick provides a before-begin node while saving one node's worth of space, but it also means that modifying a bucket's last node requires updating the pointer stored in buckets[].
- _M_before_begin is a special node that solves the problem of the very first node having no predecessor.
Note:
In the figure the nodes are linked in bucket order only to keep the drawing simple. In reality the chained buckets have no particular order; their order is usually determined by construction order.
_Hashtable declaration
In the source, the hashtable carries a complete doc comment, reproduced below.
/**
* Primary class template _Hashtable.
*
* @ingroup hashtable-detail
*
* @tparam _Value CopyConstructible type.
*
* @tparam _Key CopyConstructible type.
*
* @tparam _Alloc An allocator type
* ([lib.allocator.requirements]) whose _Alloc::value_type is
* _Value. As a conforming extension, we allow for
* _Alloc::value_type != _Value.
*
* @tparam _ExtractKey Function object that takes an object of type
* _Value and returns a value of type _Key.
*
* @tparam _Equal Function object that takes two objects of type k
* and returns a bool-like value that is true if the two objects
* are considered equal.
*
* @tparam _H1 The hash function. A unary function object with
* argument type _Key and result type size_t. Return values should
* be distributed over the entire range [0, numeric_limits<size_t>::max()].
*
* @tparam _H2 The range-hashing function (in the terminology of
* Tavori and Dreizin). A binary function object whose argument
* types and result type are all size_t. Given arguments r and N,
* the return value is in the range [0, N).
*
* @tparam _Hash The ranged hash function (Tavori and Dreizin). A
* binary function whose argument types are _Key and size_t and
* whose result type is size_t. Given arguments k and N, the
* return value is in the range [0, N). Default: hash(k, N) =
* h2(h1(k), N). If _Hash is anything other than the default, _H1
* and _H2 are ignored.
*
* @tparam _RehashPolicy Policy class with three members, all of
* which govern the bucket count. _M_next_bkt(n) returns a bucket
* count no smaller than n. _M_bkt_for_elements(n) returns a
* bucket count appropriate for an element count of n.
* _M_need_rehash(n_bkt, n_elt, n_ins) determines whether, if the
* current bucket count is n_bkt and the current element count is
* n_elt, we need to increase the bucket count. If so, returns
* make_pair(true, n), where n is the new bucket count. If not,
* returns make_pair(false, <anything>)
*
* @tparam _Traits Compile-time class with three boolean
* std::integral_constant members: __cache_hash_code, __constant_iterators,
* __unique_keys.
*
* Each _Hashtable data structure has:
*
* - _Bucket[] _M_buckets
* - _Hash_node_base _M_before_begin
* - size_type _M_bucket_count
* - size_type _M_element_count
*
* with _Bucket being _Hash_node* and _Hash_node containing:
*
* - _Hash_node* _M_next
* - Tp _M_value
* - size_t _M_hash_code if cache_hash_code is true
*
* In terms of Standard containers the hashtable is like the aggregation of:
*
* - std::forward_list<_Node> containing the elements
* - std::vector<std::forward_list<_Node>::iterator> representing the buckets
*
* The non-empty buckets contain the node before the first node in the
* bucket. This design makes it possible to implement something like a
* std::forward_list::insert_after on container insertion and
* std::forward_list::erase_after on container erase
* calls. _M_before_begin is equivalent to
* std::forward_list::before_begin. Empty buckets contain
* nullptr. Note that one of the non-empty buckets contains
* &_M_before_begin which is not a dereferenceable node so the
* node pointer in a bucket shall never be dereferenced, only its
* next node can be.
*
* Walking through a bucket's nodes requires a check on the hash code to
* see if each node is still in the bucket. Such a design assumes a
* quite efficient hash functor and is one of the reasons it is
* highly advisable to set __cache_hash_code to true.
*
* The container iterators are simply built from nodes. This way
* incrementing the iterator is perfectly efficient independent of
* how many empty buckets there are in the container.
*
* On insert we compute the element's hash code and use it to find the
* bucket index. If the element must be inserted in an empty bucket
* we add it at the beginning of the singly linked list and make the
* bucket point to _M_before_begin. The bucket that used to point to
* _M_before_begin, if any, is updated to point to its new before
* begin node.
*
* On erase, the simple iterator design requires using the hash
* functor to get the index of the bucket to update. For this
* reason, when __cache_hash_code is set to false the hash functor must
* not throw and this is enforced by a static assertion.
*
* Functionality is implemented by decomposition into base classes,
* where the derived _Hashtable class is used in _Map_base,
* _Insert, _Rehash_base, and _Equality base classes to access the
* "this" pointer. _Hashtable_base is used in the base classes as a
* non-recursive, fully-completed-type so that detailed nested type
* information, such as iterator type and node type, can be
* used. This is similar to the "Curiously Recurring Template
* Pattern" (CRTP) technique, but uses a reconstructed, not
* explicitly passed, template pattern.
*
* Base class templates are:
* - __detail::_Hashtable_base
* - __detail::_Map_base
* - __detail::_Insert
* - __detail::_Rehash_base
* - __detail::_Equality
*/
template<typename _Key, typename _Value, typename _Alloc,
typename _ExtractKey, typename _Equal,
typename _H1, typename _H2, typename _Hash,
typename _RehashPolicy, typename _Traits>
class _Hashtable
: public __detail::_Hashtable_base<_Key, _Value, _ExtractKey, _Equal,
_H1, _H2, _Hash, _Traits>,
public __detail::_Map_base<_Key, _Value, _Alloc, _ExtractKey, _Equal,
_H1, _H2, _Hash, _RehashPolicy, _Traits>,
public __detail::_Insert<_Key, _Value, _Alloc, _ExtractKey, _Equal,
_H1, _H2, _Hash, _RehashPolicy, _Traits>,
public __detail::_Rehash_base<_Key, _Value, _Alloc, _ExtractKey, _Equal,
_H1, _H2, _Hash, _RehashPolicy, _Traits>,
public __detail::_Equality<_Key, _Value, _Alloc, _ExtractKey, _Equal,
_H1, _H2, _Hash, _RehashPolicy, _Traits>,
private __detail::_Hashtable_alloc<
__alloc_rebind<_Alloc,
__detail::_Hash_node<_Value,
_Traits::__hash_cached::value>>>
Data members
__bucket_type* _M_buckets = &_M_single_bucket; // __bucket_type is node*, so _M_buckets is effectively a node**
size_type _M_bucket_count = 1;
__node_base _M_before_begin;
size_type _M_element_count = 0;
_RehashPolicy _M_rehash_policy;
// A single bucket used when only need for 1 bucket. Especially
// interesting in move semantic to leave hashtable with only 1 buckets
// which is not allocated so that we can have those operations noexcept
// qualified.
// Note that we can't leave hashtable with 0 bucket without adding
// numerous checks in the code to avoid 0 modulus.
__bucket_type _M_single_bucket = nullptr;
The familiar members are all here: the bucket array _M_buckets indexes the chains; _M_bucket_count records the number of buckets; _M_before_begin is the before-begin node of the first bucket; _M_element_count records the number of stored elements; and _RehashPolicy governs rehash behavior. A hashtable has a notion of load: load = _M_element_count / _M_bucket_count, i.e. the average chain length per bucket. When the load grows large, performance degrades noticeably; at that point we must enlarge the bucket array and reinsert every element, which is the rehash operation.
There is one special member, _M_single_bucket, used only when bucket_count == 1. It lets move operations get by without allocating a bucket array.
A hashtable node resembles a forward_list node: it stores only a next pointer and no previous pointer. Its definition follows.
/* @chapter _Hash_node */
struct _Hash_node_base // node base class: holds _M_nxt, the next pointer
{
  _Hash_node_base* _M_nxt;

  _Hash_node_base() noexcept : _M_nxt() { }

  _Hash_node_base(_Hash_node_base* __next) noexcept : _M_nxt(__next) { }
};
/**
* struct _Hash_node_value_base
*
* Node type with the value to store.
*/
template<typename _Value>
struct _Hash_node_value_base : _Hash_node_base // adds the stored value to the node
{
typedef _Value value_type;
__gnu_cxx::__aligned_buffer<_Value> _M_storage;
_Value*
_M_valptr() noexcept
{ return _M_storage._M_ptr(); }

const _Value*
_M_valptr() const noexcept
{ return _M_storage._M_ptr(); }

_Value&
_M_v() noexcept
{ return *_M_valptr(); }

const _Value&
_M_v() const noexcept
{ return *_M_valptr(); }
};
/**
* Primary template struct _Hash_node.
*/
template<typename _Value, bool _Cache_hash_code>
struct _Hash_node;
/*
 * Partial specialization is used here to create two kinds of hash node:
 * one that caches the hash code (_Cache_hash_code == true) and one that does not.
 */
/**
* Specialization for nodes with caches, struct _Hash_node.
*
* Base class is __detail::_Hash_node_value_base.
*/
template<typename _Value>
struct _Hash_node<_Value, true> : _Hash_node_value_base<_Value>
{
std::size_t _M_hash_code; // caches the hash code; note that hash codes have type size_t

_Hash_node*
_M_next() const noexcept
{ return static_cast<_Hash_node*>(this->_M_nxt); }
};
/**
* Specialization for nodes without caches, struct _Hash_node.
*
* Base class is __detail::_Hash_node_value_base.
*/
template<typename _Value>
struct _Hash_node<_Value, false> : _Hash_node_value_base<_Value>
{
// no cache: this specialization defines no extra data members
_Hash_node*
_M_next() const noexcept
{ return static_cast<_Hash_node*>(this->_M_nxt); }
};
Whether to cache the hash code is a trade-off. Caching costs an extra size_t per node, lowering storage density, but it spares us from recomputing any hash values during rehash. Not caching keeps nodes dense at the price of wasted computation. For certain types, such as string, the hash computation is comparatively expensive, and not storing the hash value wastes a great deal of work.
Let's look at how a string's hash is computed.
template<>
struct hash<string> // basic_string.h specializes hash<T>; this functor performs the hashing
: public __hash_base<size_t, string>
{
size_t
operator()(const string& __s) const noexcept
{
return std::_Hash_impl::hash(__s.data(), __s.length()); } // passes the character data and its length
};
struct _Hash_impl
{
static size_t // the hash entry point; it forwards the arguments below
hash(const void* __ptr, size_t __clength,
size_t __seed = static_cast<size_t>(0xc70f6907UL)) // a seed is added here, but it is hard-coded
{
return _Hash_bytes(__ptr, __clength, __seed); }
template<typename _Tp>
static size_t
hash(const _Tp& __val)
{
return hash(&__val, sizeof(__val)); }
template<typename _Tp>
static size_t
__hash_combine(const _Tp& __val, size_t __hash)
{
return hash(&__val, sizeof(__val), __hash); }
};
// Hash function implementation for the nontrivial specialization.
// All of them are based on a primitive that hashes a pointer to a
// byte array. The actual hash algorithm is not guaranteed to stay
// the same from release to release -- it may be updated or tuned to
// improve hash quality or speed.
/*
 * Every hash operation bottoms out in one primitive: hashing a byte array.
 * The definition of _Hash_bytes is not in the headers; it lives in the
 * compiled library (libstdc++'s libsupc++/hash_bytes.cc, based on MurmurHash).
 */
size_t
_Hash_bytes(const void* __ptr, size_t __len, size_t __seed);
From the code above it is clear that hashing a string means hashing all the data it stores; if the string is large, this is a substantial amount of computation.
construct
The default constructor is as follows.
_Hashtable() = default;
The next constructor takes a __bucket_hint, which influences the final bucket count.
template<