A quick way to build a corpus vocabulary index with TensorFlow

szZack

Article link: https://blog.csdn.net/zengNLP/article/details/95041026

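Preface

While writing a post on quickly building a garbage-sorting question-answering bot, I found that a model trained on word vectors reached an accuracy of only about 70%. I considered two likely causes: first, numeric and English tokens have no corresponding word vectors; second, the training corpus is too small (on the order of hundreds of samples), so the advantage of word vectors does not show. I therefore added a word-index representation as an alternative.

The rest of this post describes a quick way to build a corpus vocabulary index with TensorFlow.

Quickly building a corpus vocabulary index with TensorFlow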

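Purpose
Build the vocabulary together with a word-to-index map and an index-to-word map. This requires collecting every distinct word in the corpus and building the corresponding dictionaries.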
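
To make the goal concrete, here is a minimal pure-Python sketch of the two maps, independent of TensorFlow. The function name `build_vocab` and the choice to reserve index 0 for `<UNK>` are assumptions made for illustration (the latter mirrors the output of the first example in this post); this is not how `VocabularyProcessor` is implemented internally.

```python
def build_vocab(docs, unk_token="<UNK>"):
    """Build word->index and index->word maps from whitespace-tokenized docs."""
    word_to_index = {unk_token: 0}   # index 0 is reserved for unknown words
    index_to_word = [unk_token]      # position in the list is the word's index
    for doc in docs:
        for word in doc.split():
            if word not in word_to_index:
                # Assign the next free index in order of first appearance.
                word_to_index[word] = len(index_to_word)
                index_to_word.append(word)
    return word_to_index, index_to_word


docs = ["苹果 是 什么 垃圾", "塑料瓶 是 那种 垃圾"]
word_to_index, index_to_word = build_vocab(docs)
print(word_to_index["苹果"], index_to_word[1])
```

A list serves as the index-to-word map here because the indices are dense and start at 0, so a lookup is just a list access.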
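API
tensorflow.contrib.learn.preprocessing.VocabularyProcessor(
    max_document_length,
    min_frequency=0,
    vocabulary=None,
    tokenizer_fn=None)
The constructor takes 4 parameters:
max_document_length: the maximum document length. A text longer than this is truncated; a shorter one is padded with 0.
min_frequency: the minimum word frequency. Only words that occur more than min_frequency times are added to the vocabulary.
vocabulary: a vocabulary object; if None, a new one is created.
tokenizer_fn: the tokenizer function, e.g. list (which splits each text into characters) or a wrapper around jieba.

(VocabularyProcessor: Maps documents to sequences of word ids)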
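
The effect of `max_document_length` and `min_frequency` described above can be imitated in a few lines of plain Python. This is only a sketch of the documented behaviour, not the library's code; the helper names `filter_by_frequency` and `pad_or_truncate` are hypothetical.

```python
from collections import Counter


def filter_by_frequency(docs, min_frequency=0):
    """Keep only words whose count is strictly greater than min_frequency."""
    counts = Counter(word for doc in docs for word in doc.split())
    return {word for word, count in counts.items() if count > min_frequency}


def pad_or_truncate(ids, max_document_length):
    """Cut a sequence of ids to max_document_length, or pad it with 0."""
    ids = list(ids)[:max_document_length]
    return ids + [0] * (max_document_length - len(ids))


docs = ["苹果 是 什么 垃圾", "塑料瓶 是 那种 垃圾"]
kept = filter_by_frequency(docs, min_frequency=1)  # words seen more than once
print(sorted(kept))
print(pad_or_truncate([1, 2, 3, 4], 10))  # shorter than 10: padded with 0
print(pad_or_truncate([1, 2, 3, 4], 3))   # longer than 3: cut off
```

With `min_frequency=1`, only the words that appear in both sample sentences survive, which shows why a non-zero threshold on a very small corpus can shrink the vocabulary sharply.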
Building the vocabulary mapping from a corpus

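Code example:

# Requires TensorFlow 1.x: tf.contrib was removed in TensorFlow 2.x.
from tensorflow.contrib.learn import preprocessing
import numpy as np

def test():
    # Texts already segmented with jieba; tokens are separated by spaces.
    text_list = ['苹果 是 什么 垃圾', '塑料瓶 是 那种 垃圾']
    max_words_length = 10
    vocab_processor = preprocessing.VocabularyProcessor(max_document_length=max_words_length)
    # fit_transform learns the vocabulary and yields one padded id array per text.
    x = np.array(list(vocab_processor.fit_transform(text_list)))

    print('x:\n', x)

    # _mapping and _reverse_mapping are private attributes of the vocabulary object.
    print('Word-to-index map:\n', vocab_processor.vocabulary_._mapping)

    print('Vocabulary:\n', vocab_processor.vocabulary_._reverse_mapping)

    # Save the fitted processor, vocabulary included.
    vocab_processor.save('vocab.pkl')

test()

Output:

x:
 [[1 2 3 4 0 0 0 0 0 0]
 [5 2 6 4 0 0 0 0 0 0]]
Word-to-index map:
 {'垃圾': 4, '那种': 6, '塑料瓶': 5, '苹果': 1, '什么': 3, '是': 2, '<UNK>': 0}
Vocabulary:
 ['<UNK>', '苹果', '是', '什么', '垃圾', '塑料瓶', '那种']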

Using the vocabulary mapping

1. Load vocab.pkl first.
2. Then convert the text to its index representation.

Example code:

# Requires TensorFlow 1.x: tf.contrib was removed in TensorFlow 2.x.
from tensorflow.contrib.learn import preprocessing
import numpy as np

def get_text_idx(text_list, vocab, max_words_length):
    # Rows are zero-initialized, so positions past the end of a text stay 0.
    text_array = np.zeros([len(text_list), max_words_length], dtype=np.int32)

    for i, x in enumerate(text_list):
        # Truncate to max_words_length so long texts cannot index out of bounds.
        words = x.split(" ")[:max_words_length]
        for j, w in enumerate(words):
            # Words missing from the vocabulary map to the '<UNK>' index.
            text_array[i, j] = vocab.get(w, vocab['<UNK>'])

    return text_array

def test2():
    # Load the saved vocabulary processor.
    vocabulary = preprocessing.VocabularyProcessor.restore('vocab.pkl')

    max_words_length = 10
    text_list2 = ['苹果 属于 那种 ？', '塑料瓶 是 ？']
    x2 = get_text_idx(text_list2, vocabulary.vocabulary_._mapping, max_words_length)
    print('x2:\n', x2)

test2()

Output:

x2:
 [[1 0 6 0 0 0 0 0 0 0]
 [5 2 0 0 0 0 0 0 0 0]]
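
The index-to-word direction can be checked the same way. The sketch below decodes the two rows of x2 back into words using the vocabulary list printed by the first example; the helper `decode` is hypothetical and uses plain Python only. Because padding and `<UNK>` share index 0, trailing zeros are treated as padding, so an unknown word at the very end of a text (such as the final ？) cannot be recovered.

```python
# Vocabulary list copied from the output of the first example in this post.
index_to_word = ["<UNK>", "苹果", "是", "什么", "垃圾", "塑料瓶", "那种"]


def decode(ids, index_to_word, pad_id=0):
    """Turn a row of word ids back into words, dropping trailing padding."""
    ids = list(ids)
    # Padding and <UNK> share id 0, so only trailing zeros are stripped.
    while ids and ids[-1] == pad_id:
        ids.pop()
    return [index_to_word[i] for i in ids]


print(decode([1, 0, 6, 0, 0, 0, 0, 0, 0, 0], index_to_word))
print(decode([5, 2, 0, 0, 0, 0, 0, 0, 0, 0], index_to_word))
```

If distinguishing padding from unknown words matters for the model, one option is to reserve separate indices for the two, which would require building the mapping yourself rather than relying on the defaults shown here.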
