Does num_words in the Keras Tokenizer fail to limit the vocabulary size?

火目小码农

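Copyright notice: This is an original article by the author, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.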

Article link: https://blog.csdn.net/cecurio/article/details/116710612

Reproducing the problem:

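from keras.preprocessing.text import Tokenizer

# num_words=5 is documented as keeping only the 4 most common words
tokenizer = Tokenizer(num_words=5)
text = ["今天 武汉 下 雨 了", "我 今天 上课", "我 明天 不 上课", "武汉 明天 不 下 雨"]
tokenizer.fit_on_texts(text)  # build the word index from word frequencies
word_index = tokenizer.word_index
print(word_index)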

The official documentation describes the num_words parameter as follows:

num_words: the maximum number of words to keep, based on word frequency. Only the most common num_words-1 words will be kept.

Inspecting word_index:

{'今天': 1, '武汉': 2, '下': 3, '雨': 4, '我': 5, '上课': 6, '明天': 7, '不': 8, '了': 9}

With num_words set to 5, the vocabulary should in principle contain only 4 words. So why does word_index contain every word?

Explanation:

Test code:

tokenizer.texts_to_sequences(['今天 武汉 下 雨 我 上课 明天 不 了'])

Output:

[[1, 2, 3, 4]]

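As the output shows, only the four words 今天, 武汉, 下 and 雨 are mapped to indices; the other five are silently dropped.

If num_words is not specified, every word that appears in the corpus is kept, and the same test code produces: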

[[1, 2, 3, 4, 5, 6, 7, 8, 9]]
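The two outputs above can be imitated in plain Python. The following is a simplified stand-in, not the Keras source: the helper names build_word_index and to_sequences are made up for illustration, and the sketch assumes whitespace-separated input and skips the lowercasing and punctuation filtering that the real Tokenizer applies. It shows the key point: the index is built over all words, and the num_words limit is applied only at conversion time.

```python
from collections import Counter


def build_word_index(texts):
    """Rank every whitespace-separated word by frequency (rank 1 = most common)."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts, key=counts.get, reverse=True)
    return {word: rank for rank, word in enumerate(ranked, start=1)}


def to_sequences(texts, word_index, num_words=None):
    """Convert texts to index lists, dropping words ranked num_words or later."""
    sequences = []
    for text in texts:
        seq = []
        for word in text.split():
            idx = word_index.get(word)
            if idx is None:
                continue  # word never seen while building the index
            if num_words is not None and idx >= num_words:
                continue  # word is indexed but falls outside the limit
            seq.append(idx)
        sequences.append(seq)
    return sequences


corpus = ["今天 武汉 下 雨 了", "我 今天 上课", "我 明天 不 上课", "武汉 明天 不 下 雨"]
probe = ["今天 武汉 下 雨 我 上课 明天 不 了"]

full_index = build_word_index(corpus)  # holds all 9 words, whatever the limit
limited = to_sequences(probe, full_index, num_words=5)
unlimited = to_sequences(probe, full_index)
print(len(full_index), limited, unlimited)
```

Because the comparison is idx >= num_words, a limit of 5 lets through indices 1 to 4 only, which is where the "num_words-1 words" in the documentation comes from.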

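If the oov_token parameter is also passed when instantiating the Tokenizer, out-of-vocabulary words are replaced by the OOV index instead of being dropped: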

from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=5, oov_token='<OOV>')
text = ["今天 武汉 下 雨 了", "我 今天 上课", "我 明天 不 上课", "武汉 明天 不 下 雨"]
tokenizer.fit_on_texts(text)
word_index = tokenizer.word_index
print(word_index)
# {'<OOV>': 1,
#  '今天': 2,
#  '武汉': 3,
#  '下': 4,
#  '雨': 5,
#  '我': 6,
#  '上课': 7,
#  '明天': 8,
#  '不': 9,
#  '了': 10}

sequences = tokenizer.texts_to_sequences(['今天 武汉 下 雨 我 上课 明天 不 了'])
print(sequences)
#[[2, 3, 4, 1, 1, 1, 1, 1, 1]]
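To make the OOV result easier to follow, here is a small plain-Python sketch of the mapping rule. It is an illustration under assumptions rather than the library's implementation: encode_with_oov is a hypothetical helper, and word_index is simply copied from the printed output above. Note that the OOV token itself occupies index 1, so with num_words=5 only three real words (indices 2 to 4) survive.

```python
# word_index copied from the printed output above.
word_index = {
    '<OOV>': 1, '今天': 2, '武汉': 3, '下': 4, '雨': 5,
    '我': 6, '上课': 7, '明天': 8, '不': 9, '了': 10,
}


def encode_with_oov(text, word_index, num_words, oov_token="<OOV>"):
    """Map each word to its index, or to the OOV index if it is outside the limit."""
    oov_id = word_index[oov_token]
    encoded = []
    for word in text.split():
        idx = word_index.get(word)
        if idx is None or idx >= num_words:
            encoded.append(oov_id)  # unknown word, or ranked beyond num_words
        else:
            encoded.append(idx)
    return encoded


encoded = encode_with_oov("今天 武汉 下 雨 我 上课 明天 不 了", word_index, num_words=5)
print(encoded)
```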

The behavior of num_words is also explained in the first reference link.

Someone asked the following question there:

Hi,

I am currently working with the Tokenizer class and I have a question about the relevance of num_words. Reading the documentation suggests that when .fit_on_texts is run the Tokenizer will only take the most common num_words amounts. I currently have a dataset consisting of 10358 uniques words. When I run Tokenizer specifying num_words = 1000 I then call the word index which has a length of 10358. Does the Tokenizer generate an index of the top 1000 then add any others on after this when running fit_on_texts?

Thanks

The author replied:

The Tokenizer stores everything in the word_index during fit_on_texts. Then, when calling the texts_to_sequences method, only the top num_words are considered.

In [1]: from keras.preprocessing.text import Tokenizer
In [2]: texts = ['a a a', 'b b', 'c']
In [3]: tokenizer = Tokenizer(num_words=2)
In [4]: tokenizer.fit_on_texts(texts)
In [5]: tokenizer.word_index
Out[5]: {'a': 1, 'b': 2, 'c': 3}
In [6]: tokenizer.texts_to_sequences(texts)
Out[6]: [[1, 1, 1], [], []]

There’s actually an off-by-one error as you can see; the output should be [[1, 1, 1], [2, 2], []]. I am fixing, but in the meantime you can set your num_words to be one more than you intended.

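Summary:

Although the word_index attribute keeps an entry for every word seen during fit_on_texts, only the num_words-1 most frequent words are actually used when the tokenizer is applied (for example, in tokenizer.texts_to_sequences()). When oov_token is set, the OOV token takes index 1, so one fewer real word is kept: in the example above, num_words=5 leaves only three.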
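If the vocabulary that is really in use is needed (for instance, to report its size), it can be derived from word_index by applying the same cutoff. The sketch below is one possible way to do that; effective_vocab is a hypothetical helper, not part of Keras, and word_index is hardcoded from the first example in this article.

```python
# word_index as printed in the first example of this article.
word_index = {'今天': 1, '武汉': 2, '下': 3, '雨': 4, '我': 5,
              '上课': 6, '明天': 7, '不': 8, '了': 9}


def effective_vocab(word_index, num_words=None):
    """Return only the entries that survive the num_words cutoff."""
    if num_words is None:
        return dict(word_index)  # no limit: every indexed word is usable
    return {word: idx for word, idx in word_index.items() if idx < num_words}


vocab = effective_vocab(word_index, num_words=5)
print(len(word_index), len(vocab), vocab)
```

The largest index that survives is num_words - 1, and index 0 is never assigned to a word, so a downstream lookup table sized to num_words rows should be enough to cover every index the tokenizer can emit.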

Reference links:

https://github.com/keras-team/keras/issues/7836
https://github.com/keras-team/keras/issues/7551
