Generic Mapping Types
The dict type lives in __builtins__.__dict__.
Because of dict's crucial role, Python dicts are highly optimized. Hash tables are the engines behind Python's high-performance dicts. set is also implemented with hash tables.
The collections.abc module provides the Mapping and MutableMapping ABCs to formalize the interfaces of dict and similar types.
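A small sketch of how these ABCs are typically used: checking whether an object is a mapping with isinstance instead of testing for dict directly. The variable names are only illustrative.

```python
from collections import abc, OrderedDict

my_dict = {}

# isinstance against the ABC accepts dict and any other mapping type
is_mapping = isinstance(my_dict, abc.Mapping)
is_mutable_mapping = isinstance(my_dict, abc.MutableMapping)
ordered_is_mapping = isinstance(OrderedDict(), abc.Mapping)

# a list is not a mapping
list_is_mapping = isinstance([], abc.Mapping)

print(is_mapping, is_mutable_mapping, ordered_is_mapping, list_is_mapping)
```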
Implementations of specialized mappings often extend dict or collections.UserDict.
All mapping types in the standard library use the basic dict in their implementation, so they share the limitation that the keys must be hashable.
What is Hashable?
An object is hashable if it has a hash value which never changes during its lifetime (it needs a __hash__() method), and can be compared to other objects (it needs an __eq__() method). Hashable objects that compare equal must have the same hash value.
Hashable types include:
- The atomic immutable types: str, bytes, numeric types
- frozenset
- tuple, if all its items are hashable
Use hash(x) to see the hash value of x.
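To make the tuple rule concrete, here is a short sketch: a tuple is hashable only when everything inside it is hashable. The variable names are just for illustration.

```python
tt = (1, 2, (30, 40))                # tuple of hashable items
tf = (1, 2, frozenset([30, 40]))     # frozenset is hashable too
tl = (1, 2, [30, 40])                # contains a list, which is not hashable

hash_tt = hash(tt)   # works
hash_tf = hash(tf)   # works

try:
    hash(tl)
    tl_hashable = True
except TypeError:
    # hashing the tuple fails because of the list inside it
    tl_hashable = False

print(tl_hashable)
```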
User-defined types
User-defined types are hashable by default because their hash value is derived from their id() and they all compare not equal.
If an object implements a custom __eq__ that takes into account its internal state, it may be hashable only if all its attributes are immutable.
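A minimal sketch of these rules, using three hypothetical classes (Plain, Point and EqOnly are made up for this example). Note one detail of the Python data model: a class that defines __eq__ without defining __hash__ gets __hash__ set to None, so its instances stop being hashable.

```python
class Plain:
    """No __eq__ or __hash__: hashable by default, instances compare unequal."""


class Point:
    """Custom __eq__ plus a matching __hash__; x and y are treated as read-only."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # equal points must produce equal hashes
        return hash((self.x, self.y))


class EqOnly:
    """Custom __eq__ without __hash__: Python sets __hash__ to None."""

    def __eq__(self, other):
        return isinstance(other, EqOnly)


lookup = {Point(1, 2): "found"}
value = lookup[Point(1, 2)]   # a different but equal instance finds the same entry
```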
dict Comprehensions
A dictcomp builds a dict instance by producing key:value pairs from any iterable.
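For instance, a short sketch with made-up sample data (the dial_codes list is only an illustration):

```python
dial_codes = [(86, "China"), (91, "India"), (1, "United States")]

# build a dict from an iterable of pairs, swapping each pair into country: code
country_code = {country: code for code, country in dial_codes}

# a dictcomp can also filter and transform while building
small_codes = {code: country.upper()
               for country, code in country_code.items()
               if code < 90}

print(country_code)
print(small_codes)
```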
Overview of Common Mapping Methods
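collections.defaultdict()
– an old friend from LeetCode: pass the type you want for the values in the parentheses, e.g. list, set
collections.OrderedDict
– I end up looking this one up on the spot every time it shows up in a weekly contest 😂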
Mappings with Flexible Key Lookup
Sometimes it is convenient to have mappings that return some made-up value when a missing key is searched.
Two main approaches:
- use a defaultdict instead of a plain dict
- subclass dict or any other mapping type and add a __missing__ method.
defaultdict: Another Take on Missing Keys
How it works:
When instantiating a defaultdict, you provide a callable that is used to produce a default value whenever __getitem__ is passed a nonexistent key argument.
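A small sketch of that behavior, grouping words by first letter (the word list is invented for the example). It also shows that only __getitem__ triggers the factory; get() and the in operator do not.

```python
from collections import defaultdict

words = ["apple", "avocado", "banana", "blueberry", "cherry"]

index = defaultdict(list)          # list is the callable that builds defaults
for word in words:
    index[word[0]].append(word)    # a missing key gets a fresh empty list first

missing_via_get = index.get("z")   # get() does not call the factory
still_absent = "z" not in index    # so the key was not created
created = index["z"]               # __getitem__ does call it and stores the result
```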
The __missing__ Method
Underlying the way mappings deal with missing keys is the aptly named __missing__ method.
If you subclass dict and provide a __missing__ method, the standard dict.__getitem__ will call it whenever a key is not found, instead of raising KeyError.
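As a sketch of the idea, here is a hypothetical dict subclass (named StrKeyDict0 for this example) that falls back to the str form of a key when a lookup fails. The get and __contains__ overrides are added so that those operations stay consistent with d[key].

```python
class StrKeyDict0(dict):
    """Sketch: on a failed lookup, retry with the key converted to str."""

    def __missing__(self, key):
        if isinstance(key, str):
            # already a str and still missing: give up, avoid infinite recursion
            raise KeyError(key)
        return self[str(key)]

    def get(self, key, default=None):
        # delegate to __getitem__ so that __missing__ gets a chance to run
        try:
            return self[key]
        except KeyError:
            return default

    def __contains__(self, key):
        return key in self.keys() or str(key) in self.keys()


d = StrKeyDict0([("2", "two"), ("4", "four")])
print(d["2"], d[4], d.get(1, "N/A"))
```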
Variations of dict
collections.OrderedDict
- maintains keys in insertion order (ah, and here I was re-sorting every single time)
- allows iteration over items in a predictable order
- popitem() pops the last item by default, but you can use popitem(last=False) to pop the first item added
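A quick sketch of popitem on an OrderedDict, with throwaway sample data:

```python
from collections import OrderedDict

od = OrderedDict([("a", 1), ("b", 2), ("c", 3)])

last_item = od.popitem()             # default last=True: removes the newest item
first_item = od.popitem(last=False)  # last=False: removes the oldest item
remaining = list(od)

print(last_item, first_item, remaining)
```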
collections.Counter
- the same as writing your own freq table by hand
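A sketch comparing a hand-written frequency dict with Counter, on an arbitrary sample string:

```python
from collections import Counter

text = "abracadabra"

# the hand-written version
freq = {}
for ch in text:
    freq[ch] = freq.get(ch, 0) + 1

# the Counter version
ct = Counter(text)
top = ct.most_common(1)      # the single most frequent element with its count
missing_count = ct["z"]      # missing keys count as 0 instead of raising KeyError

print(top, missing_count)
```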
collections.ChainMap
- holds a list of mappings that can be searched as one
- the lookup is performed on each mapping in order, and succeeds if the key is found in any of them.
- useful to interpreters for languages with nested scopes, where each mapping represents a scope context.
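A sketch of the nested-scope idea, with three made-up dicts standing in for local, global and built-in scopes:

```python
from collections import ChainMap

builtin_scope = {"len": "builtin len", "print": "builtin print"}
global_scope = {"x": 1, "len": "shadowed len"}
local_scope = {"x": 100}

# lookups search local first, then global, then builtin
scopes = ChainMap(local_scope, global_scope, builtin_scope)

x_value = scopes["x"]          # found in local_scope
len_value = scopes["len"]      # found in global_scope, which shadows builtin_scope
print_value = scopes["print"]  # found only in builtin_scope

scopes["y"] = 5                # writes go to the first mapping only
```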
collections.UserDict
- a pure Python implementation of a mapping that works like a standard dict
- huh??? wait, what??
Subclassing UserDict
It is almost always easier to create a new mapping type by extending UserDict rather than dict.
Its value can be appreciated as we extend the earlier strKeyDict to make sure that any keys added to the mapping are stored as str.
strKeyDict always converts non-string keys to str on insertion, update and lookup.
Why prefer to subclass UserDict rather than dict?
- The built-in has some implementation shortcuts that end up forcing us to override methods that we can just inherit from UserDict with no problem.
- UserDict does not inherit from dict, but has an internal dict instance, called data, which holds the actual items.
- This avoids undesired recursion when coding special methods like __setitem__, and simplifies the coding of __contains__.
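A minimal sketch of such a class built on UserDict. The class name StrKeyDict and the sample data are assumptions for this example. Because the items live in self.data, __setitem__ and __contains__ can work on that plain dict directly.

```python
from collections import UserDict


class StrKeyDict(UserDict):
    """Sketch: every key is stored as str; lookups accept non-str keys too."""

    def __missing__(self, key):
        if isinstance(key, str):
            raise KeyError(key)
        return self[str(key)]

    def __contains__(self, key):
        # all stored keys are str, so a single check on self.data is enough
        return str(key) in self.data

    def __setitem__(self, key, item):
        # writing to self.data avoids recursing back into __setitem__
        self.data[str(key)] = item


d = StrKeyDict({1: "one"})
d[2] = "two"
print(d[1], d["2"], 2 in d, list(d.data))
```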
Because UserDict subclasses MutableMapping, the remaining methods that make strKeyDict a full-fledged mapping are inherited from UserDict, MutableMapping, or Mapping.
MutableMapping.update
- This powerful method can be called directly but is also used by __init__ to load the instance from other mappings, from iterables of (key, value) pairs, and from keyword arguments.
- Because it uses self[key] = value to add items, it ends up calling our implementation of __setitem__.
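To see the difference in practice, here is a sketch with two hypothetical classes that record every key passing through __setitem__. With UserDict, the constructor and update go through __setitem__; with a dict subclass in CPython, they bypass it.

```python
from collections import UserDict


class RecordingUserDict(UserDict):
    """Hypothetical helper: remembers keys that pass through __setitem__."""

    def __init__(self, *args, **kwargs):
        self.seen = []                      # must exist before __init__ calls update()
        super().__init__(*args, **kwargs)

    def __setitem__(self, key, value):
        self.seen.append(key)
        super().__setitem__(key, value)


class RecordingDict(dict):
    """Same idea on top of the built-in dict."""

    def __init__(self, *args, **kwargs):
        self.seen = []
        super().__init__(*args, **kwargs)

    def __setitem__(self, key, value):
        self.seen.append(key)
        super().__setitem__(key, value)


ud = RecordingUserDict({"a": 1}, b=2)   # __init__ -> update -> __setitem__
ud.update([("c", 3)], d=4)

bd = RecordingDict({"a": 1}, b=2)       # the built-in skips our __setitem__
bd.update([("c", 3)], d=4)

print(ud.seen, bd.seen)
```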
Mapping.get
Immutable Mappings
The types module provides a wrapper class called MappingProxyType, which, given a mapping, returns a mappingproxy instance that is a read-only but dynamic view of the original mapping (i.e. updates to the original mapping can be seen in the mappingproxy, but changes cannot be made through it).
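A short sketch of both properties, read-only and dynamic, using a throwaway dict:

```python
from types import MappingProxyType

d = {1: "A"}
d_proxy = MappingProxyType(d)

read_value = d_proxy[1]        # reading through the proxy works

try:
    d_proxy[2] = "x"           # writing through the proxy does not
    write_allowed = True
except TypeError:
    write_allowed = False

d[2] = "B"                     # change the original mapping...
seen_in_proxy = d_proxy[2]     # ...and the proxy reflects it

print(read_value, write_allowed, seen_in_proxy)
```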
Set Theory
A set is a collection of unique objects.
A basic use case is removing duplicates.
- set elements must be hashable
- set itself is not hashable
- frozenset is hashable
- so the elements of a set can be frozensets
| infix operators | meaning |
|---|---|
| a \| b | union |
| a & b | intersection |
| a - b | difference |
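The three operators in action, on two small sample sets, plus the duplicate-removal use case:

```python
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}

union = a | b            # elements in a or b
intersection = a & b     # elements in both
difference = a - b       # elements in a but not in b

# removing duplicates from a list (order is not preserved)
unique = set(["spam", "spam", "eggs", "spam"])

print(union, intersection, difference, unique)
```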
set Literals
There is no literal notation for the empty set, so we must remember to write set().
Writing s = {1, 2, 3} directly is faster than s = set([1, 2, 3]) because, well, obviously: the second form has to build a list first and then call the constructor.
There is no special syntax to represent frozenset literals: they must be created by calling the constructor.
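A quick sketch of the pitfall: empty braces create a dict, not a set.

```python
empty_braces = {}            # this is an empty dict
empty_set = set()            # this is the empty set

literal = {1, 2, 3}          # set literal
built = set([1, 2, 3])       # same result, but builds a list first

frozen = frozenset(range(3)) # no literal syntax: call the constructor

print(type(empty_braces).__name__, type(empty_set).__name__, frozen)
```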
set Comprehensions
Same as a list comprehension, just replace [] with {}.
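For example, side by side with a list comprehension (the word list is invented):

```python
words = ["Apple", "avocado", "Banana", "blueberry"]

initials_list = [w[0].lower() for w in words]   # keeps duplicates
initials_set = {w[0].lower() for w in words}    # duplicates collapse

print(initials_list, initials_set)
```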
dict and set Under the Hood
Python dict and set are implemented using hash tables.
- How efficient are Python dict and set?
  The larger the collection, the bigger the performance gap.
- Why are they unordered?
- Why does the order of the dict keys or set elements depend on insertion order, and may change during the lifetime of the structure?
  Because, first, when a key is hashed and stored into a bucket, the storage is sparse. Second, an insert may make Python decide that the hash table is too crowded and rebuild it as a bigger table. Then the bucket each key lands in changes, so there is no order to rely on.
- Why can't we use any Python object as a dict key or set element?
  dict and set are faster than arrays, but they have drawbacks too. Space efficiency is an important point to consider: it comes down to whether you want to trade space for time.
- Why is it bad to add items to a dict or set while iterating through it?
  If you are iterating over the dictionary keys and changing them at the same time, your loop may not scan all the items expected, not even the items that were already in the dictionary before you added to it.
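A sketch of the last point. In current CPython, adding a key during iteration is detected and raises RuntimeError; a common workaround is to iterate over a snapshot of the keys. The sample dicts are throwaway data.

```python
broken = {"a": 1, "b": 2}
error = None
try:
    for key in broken:
        broken[key.upper()] = 0      # changes the dict size mid-iteration
except RuntimeError as exc:
    error = exc

safe = {"a": 1, "b": 2}
for key in list(safe):               # list(safe) is a snapshot of the keys
    safe[key.upper()] = safe[key]

print(type(error).__name__, safe)
```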
A Performance Experiment
- an array of 10 million distinct double-precision floats: the haystack
- an array of needles: 1,000 floats, with 500 picked from haystack and 500 verified not to be in it.
a dict with 1,000 floats
Timed with the timeit module.
If your program does any kind of I/O, the lookup time for keys in dict or set is negligible, regardless of the dict or set size (as long as it does fit in RAM).
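A scaled-down sketch of this kind of experiment, not the original benchmark: the haystack here has only 10,000 floats so that the list search finishes quickly, and all names and sizes are assumptions. Timings depend on the machine, so none are quoted.

```python
import random
import timeit

SIZE = 10_000                                        # scaled down from 10 million
haystack_list = [float(i) for i in range(SIZE)]      # distinct floats
haystack_set = set(haystack_list)
haystack_dict = dict.fromkeys(haystack_list)

# 500 needles taken from the haystack, 500 guaranteed to be absent
needles = random.sample(haystack_list, 500)
needles += [float(SIZE + i) for i in range(500)]


def count_found(haystack):
    """Count how many needles are in the given container."""
    return sum(1 for n in needles if n in haystack)


for name, haystack in [("list", haystack_list),
                       ("set", haystack_set),
                       ("dict", haystack_dict)]:
    seconds = timeit.timeit(lambda: count_found(haystack), number=1)
    print(f"{name:5} {seconds:.6f} s")
```

On a typical run the list line should be far slower than the set and dict lines, and the gap widens as SIZE grows.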
Hash Tables in Dictionaries
A hash table is a sparse array.
In standard data structure texts, the cells in a hash table are often called "buckets".
In a dict hash table, there is a bucket for each item, and it contains two fields:
- a reference to the key
- a reference to the value of the item
Because all buckets have the same size, access to an individual bucket is done by offset.
Python tries to keep at least 1/3 of the buckets empty; if the hash table becomes too crowded, it is copied to a new location with room for more buckets.
Hash collisions
To put an item in a hash table:
- step 1: calculate the hash value of the item key, done with the hash() built-in function.
- step 2: use part of the hash to locate a bucket in the hash table
- step 3 (insert): when an empty bucket is located, the new item is put there
- step 3 (update): when a bucket with a matching key is found, the value in that bucket is overwritten with the new value
- Python may determine that the hash table is too crowded and rebuild it to a new location with more room. As the hash table grows, so does the number of hash bits used as bucket offsets, and this keeps the rate of collisions low.
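The steps above can be sketched as a toy hash table. This is not how CPython implements dict: the class name ToyHashTable, the linear probing on collision and the doubling growth policy are all simplifying assumptions made for illustration.

```python
class ToyHashTable:
    """Toy open-addressing table (linear probing is an assumption, not CPython's scheme)."""

    def __init__(self, size=8):
        self.buckets = [None] * size      # each bucket is None or a (key, value) pair
        self.used = 0

    def _find(self, key):
        mask = len(self.buckets) - 1      # size is always a power of two
        index = hash(key) & mask          # steps 1 and 2: hash, then use its low bits
        while True:
            entry = self.buckets[index]
            if entry is None or entry[0] == key:
                return index              # empty bucket or matching key
            index = (index + 1) & mask    # collision: try the next bucket

    def __setitem__(self, key, value):
        index = self._find(key)
        if self.buckets[index] is None:
            self.used += 1                # step 3 (insert)
        self.buckets[index] = (key, value)  # insert or overwrite (update)
        if self.used * 3 > len(self.buckets) * 2:
            self._grow()                  # keep at least 1/3 of the buckets empty

    def _grow(self):
        entries = [e for e in self.buckets if e is not None]
        self.buckets = [None] * (len(self.buckets) * 2)
        self.used = 0
        for key, value in entries:        # re-insert: buckets change, more hash bits used
            self[key] = value

    def __getitem__(self, key):
        entry = self.buckets[self._find(key)]
        if entry is None:
            raise KeyError(key)
        return entry[1]


table = ToyHashTable()
for n in range(20):
    table[n] = n * n
print(table[7], len(table.buckets))
```

Because the table never fills up completely, the probing loop always reaches either the matching key or an empty bucket.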
Practical Consequences of How dict Works
- keys must be hashable objects
- dicts have significant memory overhead
  - because a dict uses a hash table internally, and hash tables must be sparse to work, they are not space efficient.
- key search is very fast
  - we could search more than 2 million keys per second in a dict with 10 million items
- key ordering depends on insertion order
- adding items to a dict may change the order of existing keys
  - Python decides whether the hash table needs to grow
  - if it does grow, where the keys end up will very likely change
  - note: this applies to older Python versions; since Python 3.7, dict is guaranteed to preserve insertion order, so adding items no longer reorders existing keys
How sets Work - Practical Consequences
- set elements must be hashable objects
- sets have a significant memory overhead
- membership testing is very efficient
- element ordering depends on insertion order
- adding elements to a set may change the order of other elements