Keras.Tokenizer: Text and Sequence Preprocessing

Latest recommended article published on 2024-08-13 13:11:20

qq_39594141

Article link: https://blog.csdn.net/qq_39594141/article/details/88869489

The Tokenizer class

Member variables

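document_count: the number of documents processed
word_index: a dict mapping each word to its integer id; ids start at 1
word_counts: a dict mapping each word to the total number of times it occurs across all documents
word_docs: a dict mapping each word to the number of documents in which it appears
index_docs: a dict mapping each word's id to the number of documents in which that word appears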
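
To make the difference between these attributes concrete, here is a minimal pure-Python sketch of the bookkeeping they describe. It is not the Keras implementation: `fit_sketch` is a hypothetical helper, the three sample texts are made up, and tokenization is reduced to lowercasing and splitting on whitespace. It assumes ids are assigned by descending word frequency, with ties kept in order of first appearance, which is how the Keras Tokenizer is generally understood to rank words.

```python
from collections import Counter


def fit_sketch(texts):
    """Hypothetical helper mimicking the Tokenizer attributes (not Keras code)."""
    word_counts = Counter()  # word -> total occurrences across all documents
    word_docs = Counter()    # word -> number of documents containing the word
    for text in texts:
        # Simplified tokenization: lowercase and split on whitespace only.
        words = text.lower().split()
        word_counts.update(words)
        # A set counts each word at most once per document.
        word_docs.update(set(words))

    # Most frequent word gets id 1; sorted() is stable, so ties keep
    # their order of first appearance.
    ranked = sorted(word_counts, key=lambda w: -word_counts[w])
    word_index = {w: i for i, w in enumerate(ranked, start=1)}
    # Same information as word_docs, keyed by id instead of by word.
    index_docs = {word_index[w]: n for w, n in word_docs.items()}

    return {
        "document_count": len(texts),
        "word_index": word_index,
        "word_counts": dict(word_counts),
        "word_docs": dict(word_docs),
        "index_docs": index_docs,
    }


# Made-up sample documents, for illustration only.
texts = ["the cat sat", "the cat ran", "a dog ran fast"]
stats = fit_sketch(texts)
for name, value in stats.items():
    print(name, value)
```

With Keras installed, the real attributes are filled in by calling `fit_on_texts(texts)` on a `Tokenizer` instance; the sketch only illustrates what each one holds.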

Example:

In this example, text is a list of length 16104 in which each element is one IMDB movie review, for example:

Passport to Pimlico is a real treat for all fans of British cinema. Not only is it an enjoyable and thoroughly entertaining comedy, but it is a cinematic flashback to a bygone age, with attitudes and scenarios sadly now only a memory in British life.<br /><br />Stanley Holloway plays Pimlico resident Arthur Pemberton, who after the accidental detonation of an unexploded bomb, discovers a wealth of medieval treasure belonging to the 14th Century Duke of Burgundy that has been buried deep underneath their little suburban street these last 600 years.<br /><br />Accompanying