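A trie, also known as a dictionary tree, word-lookup tree, or key tree, is a tree-shaped data structure, sometimes described as a variant of the hash tree. It is typically used to count and sort large numbers of strings (though it is not limited to strings), which is why search engine systems often use it for word-frequency statistics over text. Its advantage is that it minimizes needless string comparisons: a lookup takes time proportional to the length of the key, regardless of how many keys are stored, so it can outperform a hash table, particularly for prefix queries.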
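The core idea of a trie is to trade space for time: strings that share a common prefix share the same path from the root, which reduces the cost of a query.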
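The prefix-sharing idea can be sketched in C as follows. This is a minimal illustration, not taken from any particular library: it assumes keys consist only of the lowercase ASCII letters 'a' to 'z', and the names TrieNode, trie_new, trie_insert, trie_count_word, trie_count_prefix and trie_free are hypothetical. Each node keeps two counters so that the trie can answer both word-frequency and prefix-count queries.

```c
#include <assert.h>
#include <stdlib.h>

#define ALPHABET 26  /* assumption: keys contain only 'a'..'z' */

typedef struct TrieNode {
    struct TrieNode *child[ALPHABET];
    int pass_count;  /* inserted words whose path goes through this node */
    int end_count;   /* inserted words that end exactly at this node */
} TrieNode;

TrieNode *trie_new(void) {
    TrieNode *n = calloc(1, sizeof *n);  /* all children NULL, counters 0 */
    assert(n != NULL);
    return n;
}

/* Insert one word; nodes for an already-seen prefix are reused. */
void trie_insert(TrieNode *root, const char *word) {
    TrieNode *n = root;
    n->pass_count++;
    for (; *word; word++) {
        int i = *word - 'a';
        assert(i >= 0 && i < ALPHABET);
        if (n->child[i] == NULL)
            n->child[i] = trie_new();
        n = n->child[i];
        n->pass_count++;
    }
    n->end_count++;
}

/* Follow `s` from the root; returns NULL if the path does not exist. */
const TrieNode *trie_walk(const TrieNode *root, const char *s) {
    const TrieNode *n = root;
    for (; *s && n; s++) {
        int i = *s - 'a';
        if (i < 0 || i >= ALPHABET)
            return NULL;
        n = n->child[i];
    }
    return n;
}

/* How many times `w` was inserted as a whole word. */
int trie_count_word(const TrieNode *root, const char *w) {
    const TrieNode *n = trie_walk(root, w);
    return n ? n->end_count : 0;
}

/* How many inserted words start with prefix `p`. */
int trie_count_prefix(const TrieNode *root, const char *p) {
    const TrieNode *n = trie_walk(root, p);
    return n ? n->pass_count : 0;
}

void trie_free(TrieNode *n) {
    if (n == NULL)
        return;
    for (int i = 0; i < ALPHABET; i++)
        trie_free(n->child[i]);
    free(n);
}
```

After inserting "tea", "ten", "tea" and "inn", trie_count_word(root, "tea") yields 2 and trie_count_prefix(root, "te") yields 3. Each query visits at most one node per character of the query string, no matter how many words are stored; the price is an array of 26 child pointers in every node.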
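A hash array mapped trie (HAMT) achieves speed close to that of a hash table while using memory more economically. In addition, a hash table may have to be resized periodically, which is an expensive operation, whereas a HAMT grows dynamically. In general, HAMT performance is improved by a larger root table with some multiple of N slots; some HAMT variants allow the root to grow lazily, with negligible impact on performance.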
Python\Python\hamt.c
/*
This file provides an implementation of an immutable mapping using the
Hash Array Mapped Trie (or HAMT) data structure.
This design provides:
1. Efficient copy: immutable mappings can be copied by reference,
making it an O(1) operation.
2. Efficient mutations: due to structural sharing, only a portion of
the trie needs to be copied when the collection is mutated. The
cost of set/delete operations is O(log N).
3. Efficient lookups: O(log N).
(where N is number of key/value items in the immutable mapping.)
HAMT
====
The core idea of HAMT is that the shape of the trie is encoded into the
hashes of keys.
Say we want to store a K/V pair in our mapping. First, we calculate the
hash of K, let's say it's 19830128, or in binary:
0b1001011101001010101110000 = 19830128
Now let's partition this bit representation of the hash into blocks of
5 bits each:
0b00_00000_10010_11101_00101_01011_10000 = 19830128
      (6)   (5)   (4)   (3)   (2)   (1)
Each block of 5 bits represents a number between 0 and 31. So if we have
a tree that consists of nodes, each of which is an array of 32 pointers,
those 5-bit blocks will encode a position on a single tree level.
For example, storing the key K with hash 19830128, results in the following
tree structure:
                   (array of 32 pointers)
                   +---+ -- +----+----+----+ -- +----+
root node          | 0 | .. | 15 | 16 | 17 | .. | 31 |   0b10000 = 16 (1)
(level 1)          +---+ -- +----+----+----+ -- +----+
                                    |
                   +---+ -- +----+----+----+ -- +----+
a 2nd level node   | 0 | .. | 10 | 11 | 12 | .. | 31 |   0b01011 = 11 (2)
                   +---+ -- +----+----+----+ -- +----+
                                    |
                   +---+ -- +----+----+----+ -- +----+
a 3rd level node   | 0 | .. | 04 | 05 | 06 | .. | 31 |   0b00101 = 5  (3)
                   +---+ -- +----+----+----+ -- +----+
                                    |
                   +---+ -- +----+----+----+----+
a 4th level node   | 0 | .. | 28 | 29 | 30 | 31 |        0b11101 = 29 (4)
                   +---+ -- +----+----+----+----+
                                    |
                   +---+ -- +----+----+----+ -- +----+
a 5th level node   | 0 | .. | 17 | 18 | 19 | .. | 31 |   0b10010 = 18 (5)
                   +---+ -- +----+----+----+ -- +----+
                                    |
                     +--------------+
                     |
                   +---+ -- +----+----+----+ -- +----+
a 6th level node   | 0 | .. | 15 | 16 | 17 | .. | 31 |   0b00000 = 0  (6)
                   +---+ -- +----+----+----+ -- +----+
                     |
                     V -- our value (or collision)
To recap: for a K/V pair, the hash of K encodes where in the tree V will
be stored.
To optimize memory footprint and handle hash collisions, our implementation
uses three different types of nodes:
* A Bitmap node;
* An Array node;
* A Collision node.
Because we implement an immutable dictionary, our nodes are also
immutable. Therefore, when we need to modify a node, we copy it, and
do that modification to the copy.
*/
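The bit arithmetic described in the comment above can be sketched as follows. This is an illustration under stated assumptions, not the code in hamt.c: it assumes 32-bit hashes, and the names hamt_slot, popcount32, BitmapNode, bitmap_empty, bitmap_pos, bitmap_get and bitmap_with are hypothetical. It shows how a 5-bit slot is extracted for each level, one common way to keep a sparse 32-slot node compact (a bitmap plus a population count), and a copy-on-write update that leaves the old node untouched.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define HAMT_BITS 5                        /* bits consumed per tree level */
#define HAMT_MASK ((1u << HAMT_BITS) - 1)  /* 0b11111 */

/* Slot (0..31) that `hash` selects on `level`, where level 0 is the root. */
uint32_t hamt_slot(uint32_t hash, unsigned level) {
    assert(level * HAMT_BITS < 32);
    return (hash >> (level * HAMT_BITS)) & HAMT_MASK;
}

/* Portable population count: number of set bits in x. */
unsigned popcount32(uint32_t x) {
    unsigned n = 0;
    for (; x; x &= x - 1)  /* clears the lowest set bit each round */
        n++;
    return n;
}

/* Hypothetical bitmap node: bit i of `bitmap` is set when logical slot i
   is occupied; `items` holds only the occupied slots, in slot order. */
typedef struct {
    uint32_t bitmap;
    size_t len;            /* always equals popcount32(bitmap) */
    const void *items[];   /* flexible array member */
} BitmapNode;

BitmapNode *bitmap_empty(void) {
    BitmapNode *n = malloc(sizeof *n);
    assert(n != NULL);
    n->bitmap = 0;
    n->len = 0;
    return n;
}

/* Index inside the compact `items` array for logical slot `slot`:
   the number of occupied slots below it. */
unsigned bitmap_pos(uint32_t bitmap, uint32_t slot) {
    return popcount32(bitmap & ((1u << slot) - 1));
}

const void *bitmap_get(const BitmapNode *n, uint32_t slot) {
    if (!(n->bitmap & (1u << slot)))
        return NULL;
    return n->items[bitmap_pos(n->bitmap, slot)];
}

/* Return a NEW node equal to `old` with `slot` set to `item`.
   `old` is never modified, so other mappings can keep sharing it. */
BitmapNode *bitmap_with(const BitmapNode *old, uint32_t slot,
                        const void *item) {
    uint32_t bit = 1u << slot;
    unsigned pos = bitmap_pos(old->bitmap, slot);
    size_t present = (old->bitmap & bit) ? 1 : 0;
    size_t len = old->len + 1 - present;
    BitmapNode *n = malloc(sizeof *n + len * sizeof n->items[0]);
    assert(n != NULL);
    n->bitmap = old->bitmap | bit;
    n->len = len;
    /* copy items before the slot, place the new item, copy the rest */
    memcpy(n->items, old->items, pos * sizeof n->items[0]);
    n->items[pos] = item;
    memcpy(n->items + pos + 1, old->items + pos + present,
           (old->len - pos - present) * sizeof n->items[0]);
    return n;
}
```

With the hash 19830128 from the example, hamt_slot returns 16, 11, 5, 29, 18 and 0 for levels 0 through 5, matching the diagram. In a full trie, a set operation would apply a step like bitmap_with to each node on the path from the root to the changed leaf and share every other node with the previous version, which is where the O(log N) mutation cost comes from.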