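In text processing we usually have to write our own code to build a vocabulary together with a word-to-idx mapping and an idx-to-word mapping. This means collecting every word that appears in the corpus and building the corresponding dictionaries.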
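A hand-written version of this step might look like the sketch below. The function name `build_vocab`, the `<UNK>` token at index 0, and whitespace tokenization are all assumptions made for illustration; real pipelines often need a more careful tokenizer.

```python
from collections import Counter


def build_vocab(documents):
    """Build word->idx and idx->word mappings from whitespace-tokenized text."""
    # Count how often each word occurs across all documents.
    counts = Counter(word for doc in documents for word in doc.split())
    # Index 0 is reserved for unknown words (an assumption of this sketch).
    # Remaining words are ordered by descending frequency, then alphabetically,
    # so the resulting indices are deterministic.
    idx_to_word = ["<UNK>"] + sorted(counts, key=lambda w: (-counts[w], w))
    word_to_idx = {word: idx for idx, word in enumerate(idx_to_word)}
    return word_to_idx, idx_to_word


docs = ["i like tea", "i like coffee"]
word_to_idx, idx_to_word = build_vocab(docs)

# Encode a document as a list of ids; unseen words fall back to index 0.
encoded = [word_to_idx.get(word, 0) for word in "i like milk".split()]
```

Here `idx_to_word` is a plain list, so looking up a word by its index is just `idx_to_word[i]`.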
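When mapping documents to idx sequences, we can instead use TensorFlow's built-in preprocessing.VocabularyProcessor (in TensorFlow 1.x it lives under tf.contrib.learn) to build the word-to-idx mapping.
VocabularyProcessor: maps documents to sequences of word ids. Its definition begins as follows: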
class VocabularyProcessor(object):
    """Maps documents to sequences of word ids."""

    def __init__(self,
                 max_document_length,
                 min_frequency=0,
                 vocabulary=None,
                 tokenizer_fn=None):
        """Initializes a VocabularyProcessor instance.

        Args:
          max_document_length: Maximum length of documents.
            if documents are longer, they will be trimmed, if shorter - padded.
          min_frequency: Minimum frequency of words in the vocabulary.
          vocabulary: CategoricalVocabulary object.

        Attributes:
          vocabulary_: CategoricalVocabulary object.
        """
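To make the roles of `max_document_length` and `min_frequency` concrete, here is a small pure-Python stand-in. It is not the TensorFlow implementation: the class name `SimpleVocabProcessor`, the use of index 0 for both padding and unknown words, and the rule "keep words that occur at least `min_frequency` times" are assumptions of this sketch, and the library's exact cutoff may differ.

```python
from collections import Counter


class SimpleVocabProcessor:
    """Simplified illustration only; NOT tf.contrib's VocabularyProcessor."""

    def __init__(self, max_document_length, min_frequency=0):
        self.max_document_length = max_document_length
        self.min_frequency = min_frequency
        # Index 0 doubles as the padding id and the unknown-word id (assumption).
        self.word_to_idx = {"<UNK>": 0}

    def fit(self, documents):
        """Learn the vocabulary from whitespace-tokenized documents."""
        counts = Counter(w for doc in documents for w in doc.split())
        for word in sorted(counts, key=lambda w: (-counts[w], w)):
            # Words rarer than min_frequency are left out and map to 0 later.
            if counts[word] >= self.min_frequency:
                self.word_to_idx[word] = len(self.word_to_idx)
        return self

    def transform(self, documents):
        """Map each document to a fixed-length list of word ids."""
        rows = []
        for doc in documents:
            ids = [self.word_to_idx.get(w, 0) for w in doc.split()]
            ids = ids[:self.max_document_length]                  # trim long documents
            ids += [0] * (self.max_document_length - len(ids))    # pad short ones
            rows.append(ids)
        return rows


docs = ["i like tea", "i like coffee very much indeed"]

processor = SimpleVocabProcessor(max_document_length=4).fit(docs)
ids = processor.transform(docs)

# With min_frequency=2, words seen only once are treated as unknown.
strict = SimpleVocabProcessor(max_document_length=4, min_frequency=2).fit(docs)
strict_ids = strict.transform(docs)
```

Every row returned by `transform` has exactly `max_document_length` entries, which is what lets the result be stacked into a single rectangular array for a model.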