Understanding Python's defaultdict()

bandaye3

Article link: https://blog.csdn.net/bandaye3/article/details/83479771

First, let's look at the formal definition:

class collections.defaultdict([default_factory[, ...]])
'''
Returns a new dictionary-like object. defaultdict is a subclass of the built-in dict class. It overrides one method and adds one writable instance variable. The remaining functionality is the same as for the dict class and is not documented here.
'''

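defaultdict() returns a dictionary-like object whose distinguishing feature is that it automatically creates an entry for any key that is looked up, even if that key is not yet in the dictionary. With an ordinary dict, reading d[key] before the key-value pair has been created raises a KeyError. A defaultdict instead builds the missing pair on the spot and returns its value; how the value is built is determined by the default_factory argument.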

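default_factory can be, among other things, a built-in constructor such as int or list; the only requirement is that the **“first argument must be callable or None”**. The factory is called with no arguments, and its return value becomes the default value for the new key: int() gives 0 and list() gives an empty list.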
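
To make the contrast with a plain dict concrete, here is a minimal sketch. The variable names (`plain`, `plain_raised`, `rejected`) are illustrative only and are not from the original post.

```python
from collections import defaultdict

# A plain dict raises KeyError when a missing key is read.
plain = {}
try:
    plain["missing"]
    plain_raised = False
except KeyError:
    plain_raised = True

# defaultdict calls default_factory() with no arguments to build the value.
d = defaultdict(list)
value = d["missing"]   # creates the entry and returns a new empty list

# .get() does not call the factory, so no entry is created for "other".
looked_up = d.get("other")

# A first argument that is neither callable nor None is rejected.
try:
    defaultdict(0)
    rejected = False
except TypeError:
    rejected = True

print(plain_raised, value, looked_up, rejected)
```

Note that only the d[key] lookup triggers the factory; membership tests and .get() leave the dictionary unchanged.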

# example 1
>>> from collections import defaultdict
>>> s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
>>> d = defaultdict(list)
>>> for k, v in s:
...     d[k].append(v)
...
>>> d.items()
dict_items([('yellow', [1, 3]), ('blue', [2, 4]), ('red', [1])])

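In the code above, the default value type is list. Each time the loop meets a key that already exists, it appends the element to that key's list; each time it meets a new key, the value is first initialized to an empty list and the element is then appended.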
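
For comparison, the same grouping can be written with a plain dict and setdefault(). This is only a sketch to show what defaultdict(list) saves you from writing; the names `pairs`, `grouped` and `manual` are made up for the illustration.

```python
from collections import defaultdict

pairs = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]

# Grouping with defaultdict: missing keys start as an empty list.
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

# The same grouping with a plain dict: the empty list is supplied by hand.
manual = {}
for key, value in pairs:
    manual.setdefault(key, []).append(value)

print(dict(grouped))
print(grouped == manual)
```

Both loops produce the same mapping; the defaultdict version simply moves the "create an empty list first" step into the container.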

# example 2
>>> from collections import defaultdict
>>> d = defaultdict(int)
>>> s = 'mississippi'
>>> for k in s:
...     d[k] += 1
...
>>> d.items()
dict_items([('m', 1), ('i', 4), ('s', 4), ('p', 2)])

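In the code above, the default value type is int. Each time the loop meets a key that already exists, it adds 1 to that key's value; each time it meets a new key, the value is first initialized to the integer 0 and then incremented by 1.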
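
The increment can be unpacked into its two steps to see where the 0 comes from. The sketch below is illustrative: `counts`, `first_read` and `manual` are hypothetical names, and the plain-dict loop is included only as a reference point.

```python
from collections import defaultdict

counts = defaultdict(int)

# d[k] += 1 reads d[k] first; for a missing key that read calls int(), giving 0.
first_read = counts["m"]
counts["m"] = first_read + 1

# Count the remaining characters of 'mississippi' the usual way.
for ch in "ississippi":
    counts[ch] += 1

# Equivalent counting with a plain dict and an explicit default of 0.
manual = {}
for ch in "mississippi":
    manual[ch] = manual.get(ch, 0) + 1

print(dict(counts))
print(counts == manual)
```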

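In natural language processing tasks, it is often necessary to build a vocabulary from the training corpus and assign an index to every word. defaultdict() is handy for this: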

# example 3
from collections import defaultdict

w2i = defaultdict(lambda: len(w2i))
S = w2i["<s>"]
UNK = w2i["<unk>"]
# read the training data...
w2i = defaultdict(lambda: UNK, w2i)
# read the test data...

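Here default_factory is a lambda. Each time a new key is seen, its value is initialized to the current number of entries in the dictionary, and the resulting pair is then added. The subtle point is that the factory runs before the new pair is inserted, so len(w2i) is still the size after the previous insertion. Thus, when the statement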

S = w2i["<s>"]

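executes, the vocabulary is empty and holds 0 entries, so "<s>" gets index 0. Once the training corpus has been read, the vocabulary is complete. For convenience, the training and test corpora are usually read by the same function, which means the vocabulary-building logic sits inside that function. But words that appear only in the test corpus should not receive indices of their own in the vocabulary: the test corpus is there to measure how well the model generalizes, which is also why out-of-vocabulary words arise in the test corpus at all.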
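
The index assignment can be checked on a toy input. This is a sketch only; the sentence used here is a made-up stand-in for a training corpus.

```python
from collections import defaultdict

# Each new key receives the number of entries present before it is inserted.
w2i = defaultdict(lambda: len(w2i))
S = w2i["<s>"]        # dictionary is empty, so the index is 0
UNK = w2i["<unk>"]    # one entry so far, so the index is 1

# Hypothetical toy corpus; the repeated word "the" keeps its first index.
for word in "the cat sat on the mat".split():
    w2i[word]

print(dict(w2i))
```

Indices come out consecutive and start at 0 because the factory is evaluated before the new key is stored.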

w2i = defaultdict(lambda: UNK, w2i)

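This builds a new dictionary from the vocabulary constructed on the training corpus. default_factory is still a lambda, but now every new key is given the value UNK. The test corpus is read after this step. Out-of-vocabulary words in the test corpus are therefore all mapped to the UNK index instead of receiving fresh indices, so they stay unknown to the model and can be used to measure its generalization ability.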
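
Putting the two stages together, a minimal end-to-end sketch might look like the following. The helper `read_corpus` and the toy sentences are assumptions made for illustration; the original post only indicates the reading steps with comments.

```python
from collections import defaultdict

def read_corpus(sentences, w2i):
    """Hypothetical reader: convert each sentence to a list of word indices."""
    return [[w2i[word] for word in sentence.split()] for sentence in sentences]

# Stage 1: building the vocabulary while reading the training data.
w2i = defaultdict(lambda: len(w2i))
S = w2i["<s>"]
UNK = w2i["<unk>"]
train_ids = read_corpus(["the cat sat", "the dog sat"], w2i)  # toy data
vocab_size = len(w2i)

# Stage 2: same reader, but unseen words now fall back to UNK.
w2i = defaultdict(lambda: UNK, w2i)
test_ids = read_corpus(["the bird sat"], w2i)                 # toy data

print(train_ids, test_ids, vocab_size)
```

One detail worth knowing: in stage 2 the unseen word is still stored as a key (with the value UNK), so len(w2i) grows even though no new index is created. If that matters, recording the vocabulary size before reading the test data, as `vocab_size` does here, avoids the ambiguity.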
