Processing words with TensorFlow's built-in VocabularyProcessor

Latest recommended article published on 2024-05-14 03:00:57

暴躁的猴子

Article link: https://blog.csdn.net/orangefly0214/article/details/83215986

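In text processing, we usually have to write our own code to build a vocabulary together with the word-to-idx and idx-to-word mappings, which means collecting every word in the corpus and building the corresponding dictionaries.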
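As a rough illustration of the manual approach described above, the following sketch builds a vocabulary and both lookup tables in plain Python. The function name `build_vocab`, the whitespace tokenization, and the `<UNK>` placeholder at index 0 are assumptions made for this example; none of them is prescribed by TensorFlow.

```python
from collections import Counter


def build_vocab(documents, unk_token="<UNK>"):
    """Build word->idx and idx->word mappings from an iterable of strings.

    Hypothetical helper: splits on whitespace and reserves index 0
    for unknown words.
    """
    counts = Counter(word for doc in documents for word in doc.split())
    # Most frequent words first; ties are broken alphabetically so the
    # resulting ids are deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    idx2word = [unk_token] + ordered
    word2idx = {word: idx for idx, word in enumerate(idx2word)}
    return word2idx, idx2word


docs = ["i like tensorflow", "i like python"]
word2idx, idx2word = build_vocab(docs)

# Convert each document to a list of ids; unseen words fall back to 0.
ids = [[word2idx.get(w, 0) for w in doc.split()] for doc in docs]
```

Here `idx2word` is a plain list, so `idx2word[i]` recovers the word for id `i`, while `word2idx` goes the other way.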

When mapping documents to indices, we can instead use TensorFlow's built-in preprocessing.VocabularyProcessor to build the word-to-idx mapping.

VocabularyProcessor: Maps documents to sequences of word ids
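To make "maps documents to sequences of word ids" concrete, here is a pure-Python sketch of that kind of mapping: each document becomes a fixed-length list of ids, trimmed if it is too long and padded with 0 if it is too short. The class `SimpleVocabProcessor` is a hypothetical stand-in written for illustration. It is not TensorFlow's implementation, and details such as whitespace tokenization and assigning ids in first-seen order are assumptions.

```python
class SimpleVocabProcessor:
    """Hypothetical stand-in: maps documents to fixed-length id lists."""

    def __init__(self, max_document_length):
        self.max_document_length = max_document_length
        # Id 0 is reserved for padding and for words never seen in fit().
        self.word2idx = {}

    def fit(self, documents):
        """Assign the next free id to every new word, in first-seen order."""
        for doc in documents:
            for word in doc.split():
                if word not in self.word2idx:
                    self.word2idx[word] = len(self.word2idx) + 1
        return self

    def transform(self, documents):
        """Return one id list per document, trimmed or zero-padded."""
        result = []
        for doc in documents:
            ids = [self.word2idx.get(w, 0) for w in doc.split()]
            ids = ids[: self.max_document_length]  # trim long documents
            ids += [0] * (self.max_document_length - len(ids))  # pad short ones
            result.append(ids)
        return result


docs = ["i like tensorflow", "tensorflow maps words to ids"]
processor = SimpleVocabProcessor(max_document_length=4).fit(docs)
encoded = processor.transform(docs)
```

With `max_document_length=4`, the three-word document gets one trailing 0 and the five-word document loses its last word, so every row has the same length and the result can be stacked into a single array.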

class VocabularyProcessor(object):
  """Maps documents to sequences of word ids."""

  def __init__(self,
               max_document_length,
               min_frequency=0,
               vocabulary=None,
               tokenizer_fn=None):
    """Initializes a VocabularyProcessor instance.

    Args:
      max_document_length: Maximum length of documents.
        if documents are longer, they will be trimmed, if shorter - padded.
      min_frequency: Minimum frequency of words in the vocabulary.
      vocabulary: CategoricalVocabulary object.

    Attributes:
      vocabulary_: CategoricalVocabulary object.
    "