NLP: Building a Vocabulary and Padding Raw Data with VocabularyProcessor

coder_Gray

Article link: https://blog.csdn.net/coder_Gray/article/details/86478718

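When writing NLP code, converting text into integer ids (word -> id) is a necessary step. I used to hand-write the code that builds the vocabulary and pads the sentences, until I came across a wonderfully handy helper online that does both in two statements: VocabularyProcessor, the subject of this post. Strictly speaking it is a class rather than a function, and it lives in tensorflow.contrib, so it is only available in TensorFlow 1.x. Its constructor parameters are as follows: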

from tensorflow.contrib import learn

vocab = learn.preprocessing.VocabularyProcessor(max_document_length, min_frequency=0, vocabulary=None, tokenizer_fn=None)

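max_document_length is the maximum document length, counted in tokens. Documents longer than this are truncated automatically, and shorter ones are padded automatically with 0.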
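The truncate-or-pad behaviour can be illustrated without TensorFlow. The following is a minimal sketch, not the library's own code: `pad_or_truncate` is a hypothetical helper name, and it assumes the padding id is 0, which matches the output shown in the example further down.

```python
def pad_or_truncate(ids, max_document_length, pad_id=0):
    """Return exactly max_document_length ids: cut long inputs, right-pad short ones."""
    clipped = list(ids)[:max_document_length]
    return clipped + [pad_id] * (max_document_length - len(clipped))


short_doc = pad_or_truncate([1, 2, 3], 5)
long_doc = pad_or_truncate([1, 2, 3, 4, 5, 6], 5)
print(short_doc)  # [1, 2, 3, 0, 0]
print(long_doc)   # [1, 2, 3, 4, 5]
```

Every row comes out with the same length, which is what lets the results be stacked into a single 2-D array.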

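min_frequency is the minimum word-frequency threshold; only words that occur often enough are added to the vocabulary. As far as I can tell from the TensorFlow 1.x source, the comparison is strict, meaning a word is kept only if its count is greater than min_frequency, so it is worth checking this against your installed version.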

vocabulary is a CategoricalVocabulary object; it is usually left as None.

tokenizer_fn is the function used to split the text into tokens; for a Chinese corpus you can pass in a word-segmentation function.
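As far as I know, tokenizer_fn receives an iterable of raw documents and should yield one list of tokens per document. Here is a minimal sketch under that assumption. `char_tokenizer` is a hypothetical stand-in that simply splits text into single characters; for real Chinese text you would call a segmentation library inside the loop instead.

```python
def char_tokenizer(documents):
    """Yield one token list per document, treating each non-space character as a token."""
    for text in documents:
        yield [ch for ch in text if not ch.isspace()]


tokens = list(char_tokenizer(['我爱 自然语言', '词表']))
print(tokens)  # [['我', '爱', '自', '然', '语', '言'], ['词', '表']]
```

A function of this shape would then be passed as `tokenizer_fn=char_tokenizer` when constructing the VocabularyProcessor.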

Here is an example:

from tensorflow.contrib import learn
import numpy as np

documents = [
    'this is the first test',
    'this is the second test',
    'this is not a test'
]
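# Cap every document at 10 tokens
vocab = learn.preprocessing.VocabularyProcessor(10)
# fit_transform builds the vocabulary and yields one padded id array per document
x = np.array(list(vocab.fit_transform(documents)))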

print(x)

The output is:

[[1 2 3 4 5 0 0 0 0 0]
 [1 2 3 6 5 0 0 0 0 0]
 [1 2 7 8 5 0 0 0 0 0]]
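To make the id assignment explicit, here is a rough pure-Python emulation of what the example does. It is a sketch under stated assumptions, not the TensorFlow implementation: `fit_transform_sketch` is a hypothetical name, whitespace splitting stands in for the library's own tokenizer, and id 0 is assumed to be reserved for padding.

```python
def fit_transform_sketch(documents, max_document_length):
    """Assign ids in order of first appearance (starting at 1), then pad or truncate with 0."""
    word_to_id = {}
    rows = []
    for text in documents:
        ids = []
        for word in text.split():
            if word not in word_to_id:
                # New word: give it the next free id; 0 stays reserved for padding
                word_to_id[word] = len(word_to_id) + 1
            ids.append(word_to_id[word])
        ids = ids[:max_document_length]
        rows.append(ids + [0] * (max_document_length - len(ids)))
    return rows, word_to_id


documents = [
    'this is the first test',
    'this is the second test',
    'this is not a test',
]
rows, word_to_id = fit_transform_sketch(documents, 10)
for row in rows:
    print(row)
print(word_to_id)
```

The rows match the matrix above: `this`, `is`, `the`, `first`, `test` receive ids 1 to 5 from the first sentence, `second` becomes 6, and `not` and `a` become 7 and 8.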
