- Keywords: data dictionary
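- Problem description: Three sentences are passed to the model to predict the probability that each one is positive or negative. Most of the predictions are wrong, and the encoded ID sequence for each sentence is far longer than expected.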
- Error output:
['read the book forget the movie', 'this is a great movie', 'this is very bad']
[[2237, 4008, 2, 2062, 5146, 3602, 3752, 4008, 5146, 951, 2903, 2903, 5146, 5146, 2414, 2903, 2237, 3316, 4008, 3602, 5146, 3602, 3752, 4008, 5146, 4136, 2903, 5146, 8, 4008], [3602, 3752, 8, 2551, 5146, 8, 2551, 5146, 2, 5146, 3316, 2237, 4008, 2, 3602, 5146, 4136, 2903, 5146, 8, 4008], [3602, 3752, 8, 2551, 5146, 8, 2551, 5146, 5146, 4008, 2237, 5146, 5146, 951, 2, 2062]]
Predict probability of 0.59143597 to be positive and 0.4085641 to be negative for review ' read the book forget the movie '
Predict probability of 0.73750913 to be positive and 0.26249087 to be negative for review ' this is a great movie '
Predict probability of 0.55495805 to be positive and 0.445042 to be negative for review ' this is very bad '
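- Reproducing the problem: Before prediction, each sentence has to be converted into a list of words, and the words then have to be converted into IDs. Here the sentences were converted with
reviews = [c for c in reviews_str]
and the result was mapped to IDs through the dataset's word dictionary and used for prediction. Almost all of the predictions were wrong. The faulty code: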
inferencer = Inferencer(
    infer_func=partial(inference_program, word_dict),
    param_path=params_dirname,
    place=place)
reviews_str = ['read the book forget the movie', 'this is a great movie', 'this is very bad']
# BUG: this only copies the list; each element is still a whole sentence string
reviews = [c for c in reviews_str]
print(reviews)
UNK = word_dict['<unk>']
lod = []
for c in reviews:
    # c is a str here, so this loop looks up single characters, not words
    lod.append([word_dict.get(words.encode('utf-8'), UNK) for words in c])
print(lod)
# one sequence length per sentence (here: the number of characters)
base_shape = [[len(c) for c in lod]]
tensor_words = fluid.create_lod_tensor(lod, base_shape, place)
results = inferencer.infer({'words': tensor_words})
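The effect of iterating over a sentence string can be reproduced without PaddlePaddle. The sketch below is only an illustration: the vocabulary and its IDs are made up (they are not the real dataset dictionary), and it uses plain `str` keys instead of the UTF-8 encoded keys used in the code above.

```python
# Toy vocabulary with hypothetical IDs; NOT the real dataset dictionary.
word_dict = {'this': 0, 'is': 1, 'very': 2, 'bad': 3, 'a': 4, '<unk>': 5}
UNK = word_dict['<unk>']

sentence = 'this is very bad'

# Iterating over a str yields single characters, spaces included.
char_ids = [word_dict.get(ch, UNK) for ch in sentence]

# split() yields the whitespace-separated words instead.
word_ids = [word_dict.get(w, UNK) for w in sentence.split()]

print(len(char_ids), char_ids)
print(len(word_ids), word_ids)
```

With character iteration the sequence has one ID per character (16 for this sentence, which matches the length of the third list in the output above), and almost every ID is either the unknown-word ID or the ID of an unrelated one-letter word.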
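- Solution: The error comes from the preprocessing step, which never splits the sentences into words. As a result, the dictionary lookup is applied to the individual characters of each sentence instead of its words, which produces the long encodings and the wrong predictions. The sentences should be split with
reviews = [c.split() for c in reviews_str]
The correct code: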
inferencer = Inferencer(
    infer_func=partial(inference_program, word_dict),
    param_path=params_dirname,
    place=place)
reviews_str = ['read the book forget the movie', 'this is a great movie', 'this is very bad']
# split each sentence on whitespace so that every element is a list of words
reviews = [c.split() for c in reviews_str]
print(reviews)
UNK = word_dict['<unk>']
lod = []
for c in reviews:
    # look up each word; words missing from the dictionary map to UNK
    lod.append([word_dict.get(words.encode('utf-8'), UNK) for words in c])
print(lod)
# one sequence length per sentence (the number of words)
base_shape = [[len(c) for c in lod]]
tensor_words = fluid.create_lod_tensor(lod, base_shape, place)
results = inferencer.infer({'words': tensor_words})
Correct output:
[['read', 'the', 'book', 'forget', 'the', 'movie'], ['this', 'is', 'a', 'great', 'movie'], ['this', 'is', 'very', 'bad']]
[[325, 0, 276, 818, 0, 16], [9, 5, 2, 78, 16], [9, 5, 51, 81]]
Predict probability of 0.44390476 to be positive and 0.55609524 to be negative for review ' read the book forget the movie '
Predict probability of 0.83933955 to be positive and 0.16066049 to be negative for review ' this is a great movie '
Predict probability of 0.35688713 to be positive and 0.64311296 to be negative for review ' this is very bad '
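The preprocessing step can also be wrapped in a small helper so that splitting and ID lookup always happen together. This is a minimal sketch under stated assumptions: `encode_reviews` is a hypothetical helper name, the toy vocabulary and its IDs are made up, and plain `str` keys are used, whereas the code above encodes each word to UTF-8 before the lookup.

```python
def encode_reviews(reviews_str, word_dict, unk_token='<unk>'):
    """Split each sentence into words and map the words to IDs.

    Returns the ID sequences plus the per-sentence lengths in the nested
    form used for base_shape above. Hypothetical helper for illustration.
    """
    unk = word_dict[unk_token]
    ids = [[word_dict.get(w, unk) for w in s.split()] for s in reviews_str]
    lengths = [[len(seq) for seq in ids]]
    return ids, lengths


# Toy vocabulary with hypothetical IDs; NOT the real dataset dictionary.
toy_dict = {'this': 0, 'is': 1, 'a': 2, 'great': 3, 'movie': 4, '<unk>': 5}

ids, lengths = encode_reviews(['this is a great movie', 'this is very bad'],
                              toy_dict)
print(ids)
print(lengths)
```

Each length now equals the number of words in the sentence, so a sequence that is much longer than the word count is a quick sign that characters were encoded instead of words.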