Basic usage of tf.contrib.learn.preprocessing.VocabularyProcessor()

Latest recommended article published on 2020-07-02 18:16:15

菜小白—NLP

Article link: https://blog.csdn.net/ACM_hades/article/details/87446181

Function signature:

tf.contrib.learn.preprocessing.VocabularyProcessor(max_document_length, min_frequency=0, vocabulary=None, tokenizer_fn=None)

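Purpose:

Builds a vocabulary from all of the already tokenized texts, then maps each word to its index in that vocabulary. Texts shorter than the maximum length are padded with 0, and words that are not in the vocabulary are also mapped to 0.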
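The behaviour described above can be imitated in a few lines of plain Python. This is only a rough sketch, not the TensorFlow implementation: the helper names build_vocab and to_ids are hypothetical, and splitting on whitespace is an assumption made for simplicity.

```python
# Illustrative sketch only: a plain-Python imitation of the described behaviour.
# build_vocab and to_ids are hypothetical helpers, not part of TensorFlow.

def build_vocab(documents):
    """Assign ids 1, 2, 3, ... to words in order of first appearance; 0 is reserved."""
    vocab = {}
    for doc in documents:
        for word in doc.split():  # assumption: whitespace tokenization
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def to_ids(doc, vocab, max_document_length):
    """Map words to ids, cut off at max_document_length, pad the rest with 0."""
    ids = [vocab.get(word, 0) for word in doc.split()[:max_document_length]]
    return ids + [0] * (max_document_length - len(ids))


vocab = build_vocab(["i love you", "me too"])
print(to_ids("i me too", vocab, 4))           # [1, 4, 5, 0]  (padded)
print(to_ids("i hate you", vocab, 4))         # [1, 0, 3, 0]  (unknown word -> 0)
print(to_ids("i love you me too", vocab, 4))  # [1, 2, 3, 4]  (truncated)
```

Note that 0 plays two roles here, padding and unknown word, which matches the description above.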

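Parameters:

max_document_length: the maximum document length. A text longer than this is truncated; a shorter one is padded with 0.
min_frequency: the minimum word frequency. Words that occur fewer times than this are not added to the vocabulary.
vocabulary: a CategoricalVocabulary object.
tokenizer_fn: the tokenizer function.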
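To make the effect of min_frequency concrete, here is a small plain-Python sketch that follows the description in this article. The function build_vocab_min_freq is hypothetical, and the exact boundary condition and id ordering used inside the library may differ, so treat it as an illustration of the idea rather than a copy of the real behaviour.

```python
from collections import Counter

# Illustrative sketch only: build_vocab_min_freq is a hypothetical helper that
# follows the article's description (drop words seen fewer than min_frequency times).

def build_vocab_min_freq(documents, min_frequency=0):
    counts = Counter(word for doc in documents for word in doc.split())
    vocab = {}
    for doc in documents:
        for word in doc.split():
            if counts[word] >= min_frequency and word not in vocab:
                vocab[word] = len(vocab) + 1  # id 0 stays reserved for padding/unknown
    return vocab


docs = ["i love you", "i love tea", "me too"]
print(build_vocab_min_freq(docs))                   # all six words get an id
print(build_vocab_min_freq(docs, min_frequency=2))  # {'i': 1, 'love': 2}
```

Words filtered out this way would later be treated like any other unknown word and mapped to 0.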

Methods:

fit (raw_documents, unused_y=None)
Purpose: learns a vocabulary from the raw documents raw_documents.
Parameter: raw_documents: an iterable that yields str or unicode.
fit_transform (raw_documents, unused_y=None)
Same as fit, but also returns the word-id matrix of the raw documents, of shape [n_samples, max_document_length].
transform (raw_documents)
Converts the words in raw_documents to ids.
save (filename)
Saves the vocabulary processor to the given file.
restore (cls, filename)
Restores a vocabulary processor from the given file.
Returns: a VocabularyProcessor object.
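How these methods fit together can be sketched with a tiny stand-in class. Everything below is hypothetical: TinyVocabProcessor is not the TensorFlow class, and storing the state as JSON is an assumption of this sketch, not the library's file format. It only illustrates the fit / transform / save / restore round trip.

```python
import json
import os
import tempfile


class TinyVocabProcessor:
    """Hypothetical stand-in that mimics the method set listed above."""

    def __init__(self, max_document_length, vocab=None):
        self.max_document_length = max_document_length
        self.vocab = dict(vocab or {})

    def fit(self, raw_documents):
        # Give each new word the next free id; 0 is reserved.
        for doc in raw_documents:
            for word in doc.split():
                self.vocab.setdefault(word, len(self.vocab) + 1)
        return self

    def transform(self, raw_documents):
        # Yield one fixed-length id list per document.
        for doc in raw_documents:
            words = doc.split()[: self.max_document_length]
            ids = [self.vocab.get(word, 0) for word in words]
            yield ids + [0] * (self.max_document_length - len(ids))

    def fit_transform(self, raw_documents):
        raw_documents = list(raw_documents)
        return self.fit(raw_documents).transform(raw_documents)

    def save(self, filename):
        state = {"max_document_length": self.max_document_length, "vocab": self.vocab}
        with open(filename, "w", encoding="utf-8") as f:
            json.dump(state, f)

    @classmethod
    def restore(cls, filename):
        with open(filename, encoding="utf-8") as f:
            state = json.load(f)
        return cls(state["max_document_length"], state["vocab"])


processor = TinyVocabProcessor(max_document_length=4)
rows = list(processor.fit_transform(["i love you", "me too"]))
print(rows)  # [[1, 2, 3, 0], [4, 5, 0, 0]]

# Save the fitted processor, load it back, and reuse the same vocabulary.
path = os.path.join(tempfile.mkdtemp(), "vocab.json")
processor.save(path)
restored = TinyVocabProcessor.restore(path)
print(next(restored.transform(["i me too"])))  # [1, 4, 5, 0]
```

The point of save and restore is that the ids used at training time and at prediction time stay identical, since the restored object carries the same word-to-id mapping.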

Example:

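from tensorflow.contrib import learn  # requires TensorFlow 1.x; tf.contrib was removed in 2.x
import numpy as np

max_document_length = 4
x_text = ['i love you', 'me too']

# Build the vocabulary from the training texts.
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
vocab_processor.fit(x_text)

# transform() yields one id array per document; the trailing 0 is padding.
print(next(vocab_processor.transform(['i me too'])).tolist())
# [1, 4, 5, 0]

# fit_transform() learns the vocabulary and returns the id rows in one step.
x = np.array(list(vocab_processor.fit_transform(x_text)))
print(x)
# [[1 2 3 0]
#  [4 5 0 0]]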