- Keywords: data dictionary, character encoding
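- Problem description: A recurrent neural network was trained on the IMDB dataset. When the resulting model is used to predict the sentiment of sentences, it returns almost the same result whether the sentence is positive or negative.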
- Error output:
[[5146, 5146, 5146, 5146, 5146, 5146], [5146, 5146, 5146, 5146, 5146], [5146, 5146, 5146, 5146]]
Predict probability of 0.54538333 to be positive and 0.45461673 to be negative for review ' read the book forget the movie '
Predict probability of 0.54523355 to be positive and 0.45476642 to be negative for review ' this is a great movie '
Predict probability of 0.54504114 to be positive and 0.45495886 to be negative for review ' this is very bad '
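- Reproducing the problem: At prediction time, create a predictor with the Inferencer interface, split each sentence into a list of words, and use word_dict.get(words, UNK) to convert each word to its id according to the dataset dictionary. These ids are then fed to the model, and every prediction comes out wrong. The faulty code is as follows: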
inferencer = Inferencer(
    infer_func=partial(inference_program, word_dict),
    param_path=params_dirname,
    place=place)
reviews_str = ['read the book forget the movie', 'this is a great movie', 'this is very bad']
reviews = [c.split() for c in reviews_str]
UNK = word_dict['<unk>']
lod = []
for c in reviews:
    # Bug: the un-encoded word is never found, so every lookup falls back to UNK
    lod.append([word_dict.get(words, UNK) for words in c])
print(lod)
# One sequence length per review, as required by the LoD tensor
base_shape = [[len(c) for c in lod]]
tensor_words = fluid.create_lod_tensor(lod, base_shape, place)
results = inferencer.infer({'words': tensor_words})
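A quick sanity check can help tell a lookup problem from a model problem. The sketch below assumes the id lists have the same shape as the printed lod above and that 5146 is the id of <unk>; the helper name unk_ratio is hypothetical and is not part of PaddlePaddle.

```python
def unk_ratio(id_sequences, unk_id):
    """Return the fraction of ids equal to unk_id across all sequences."""
    total = sum(len(seq) for seq in id_sequences)
    if total == 0:
        return 0.0
    unknown = sum(1 for seq in id_sequences for i in seq if i == unk_id)
    return float(unknown) / total


# The ids printed in the error output: every word was mapped to 5146.
lod = [[5146] * 6, [5146] * 5, [5146] * 4]
ratio = unk_ratio(lod, 5146)
print(ratio)
```

A ratio at or near 1.0 for ordinary English sentences suggests that the dictionary lookup is failing, rather than that the model has learned nothing.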
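- Solution: The words are looked up with the wrong encoding. The word keys in the dictionary are stored in UTF-8 encoded form, so word_dict.get(words, UNK) with an un-encoded string finds no match, every word is treated as <unk>, and each sentence becomes a sequence of the <unk> id (5146 in the output above). Encode each word as UTF-8 before the lookup, for example word_dict.get(words.encode('utf-8'), UNK). The correct code is as follows: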
inferencer = Inferencer(
    infer_func=partial(inference_program, word_dict),
    param_path=params_dirname,
    place=place)
reviews_str = ['read the book forget the movie', 'this is a great movie', 'this is very bad']
reviews = [c.split() for c in reviews_str]
UNK = word_dict['<unk>']
lod = []
for c in reviews:
    # Encode each word as UTF-8 so it matches the dictionary keys
    lod.append([word_dict.get(words.encode('utf-8'), UNK) for words in c])
print(lod)
# One sequence length per review, as required by the LoD tensor
base_shape = [[len(c) for c in lod]]
tensor_words = fluid.create_lod_tensor(lod, base_shape, place)
results = inferencer.infer({'words': tensor_words})
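The effect of the encoding can be reproduced without PaddlePaddle. The following is a minimal sketch under Python 3, assuming a dictionary whose word keys are UTF-8 bytes while the <unk> key is a plain string, as the behaviour above suggests; toy_dict, TOY_UNK and to_ids are hypothetical names used only for illustration.

```python
# Hypothetical miniature vocabulary: word keys are UTF-8 bytes (assumption),
# while the '<unk>' key is a plain string, as in the code above.
toy_dict = {b'this': 0, b'is': 1, b'great': 2, '<unk>': 3}
TOY_UNK = toy_dict['<unk>']


def to_ids(sentence, vocab, unk_id, encode):
    """Map each whitespace-separated word to its id.

    When encode is True the word is converted to UTF-8 bytes before the lookup.
    """
    ids = []
    for word in sentence.split():
        key = word.encode('utf-8') if encode else word
        ids.append(vocab.get(key, unk_id))
    return ids


wrong_ids = to_ids('this is great', toy_dict, TOY_UNK, encode=False)
right_ids = to_ids('this is great', toy_dict, TOY_UNK, encode=True)
print(wrong_ids)  # [3, 3, 3] under Python 3: str keys never equal bytes keys
print(right_ids)  # [0, 1, 2]
```

In Python 3 a str and a bytes object never compare equal, so the un-encoded lookup silently falls back to the default instead of raising an error, which is why the failure only shows up as identical predictions.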