训练word2vec模型时碰到的两个问题：AttributeError和 UnicodeDecodeError，即属性错误和编码问题

_小鲤鱼_

已于 2022-08-15 14:59:04 修改

阅读量3.8k

点赞数 2

文章标签： pycharm python

于 2022-04-22 23:05:38 首次发布

本文链接：https://blog.csdn.net/qq_43160348/article/details/124355245

版权

一、编码问题
1.报错：UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0xd7 in position 1
2. 错误的源码

在这里插# -*- coding: utf8 -*-

def write_review(review, stop_words_address, stop_address, no_stop_address):
    """写出处理好的数据"""
    no_stop_review = remove_stop_words(review, stop_words_address)
    with open(no_stop_address, 'a') as f:
        for row in no_stop_review:
            f.write(row[-1])
            f.write("\n")
    f.close()


if __name__ == '__main__':
    segment_text = cut_words('audito_whole.csv')
    review_text = remove_punctuation(segment_text)
    write_review(review_text, 'stop_words.txt', 'no_stop.txt', 'stop.txt')

    # 模型训练主程序
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    sentences_1 = word2vec.LineSentence('no_stop.txt')
    model_1 = word2vec.Word2Vec(sentences_1)

    # model.wv.save_word2vec_format('test_01.model.txt', 'test_01.vocab.txt', binary=False)  # 保存模型，后面可直接调用
    # model = word2vec.Word2Vec.load("test_01.model")  # 调用模型

    # 计算某个词的相关词列表
    a_1 = model_1.wv.most_similar(u"空间", topn=20)
    print(a_1)

    # 计算两个词的相关度
    b_1 = model_1.wv.similarity(u"空间", u"后座")
    print(b_1)

入代码片

3.报错解析：这个错误的意思是，在进行解码时，无法解析编码为0xd7的字符。这是由于在保存文本时没有指明编码方式，导致文本保存时可能使用的时gbk编码。
4.解决办法：
①在写出数据时，指出编码格式为utf-8，如下图

def write_review(review, stop_words_address, stop_address, no_stop_address):
    """写出处理好的数据"""
    with open(stop_address, 'a', encoding='utf-8') as f:
        for row in review:
            f.write(row[-1])
            f.write("\n")
    f.close()

    no_stop_review = remove_stop_words(review, stop_words_address)
    with open(no_stop_address, 'a', encoding='utf-8') as f:
        for row in no_stop_review:
            f.write(row[-1])
            f.write("\n")
    f.close()

②对gbk编码的文件做一次转换，如下：

a = []
with open('../new_seg_data/test.txt', encoding='gbk') as f:
    for line in f:
        a.append(line)
f.close()

with open('../new_seg_data/test.txt', 'a', encoding='utf8') as f:
    for line in a:
        f.write(line)
        f.write('\n')
f.close()

5.拓展——为什么会发生这种错误？
首先，我们点击报错的源码文件，如下图：请添加图片描述
打开utils，我们找到如下代码
在这里插入图片描述
可以看出，word2vec设置的编码格式为utf-8，如果解码时不是utf-8将会触发errors=‘strict’，进而报错UnicodeDecodeError。
另外，如果不懂python的解码和编码，可以查看这篇文章，写的挺清楚的：https://finthon.com/python-encode-decode/

二、属性错误
1.报错：AttributeError: ‘Word2Vec’ object has no attribute ‘most_similar’
2. 错误的源码：

 # 模型训练主程序
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    sentences_1 = word2vec.LineSentence('no_stop.txt')
    model_1 = word2vec.Word2Vec(sentences_1)

    # model.wv.save_word2vec_format('test_01.model.txt', 'test_01.vocab.txt', binary=False)  # 保存模型，后面可直接调用
    # model = word2vec.Word2Vec.load("test_01.model")  # 调用模型

    # 计算某个词的相关词列表
    a_1 = model_1.most_similar(u"空间", topn=20)
    print(a_1)

****3.报错解析：****报错说的是属性错误，这是因为在新的word2vec中更新了源代码结构，见官网说明如下：
在这里插入图片描述
意思是训练好的词向量存储在一个KeyedVectors实例中，这个实例可以是model.wv，方法most_simlar时对象model.wv的方法，而不是以前直接是对象model的方法。

**4.解决办法：**更换调用的对象，如下：

 # 模型训练主程序
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    sentences_1 = word2vec.LineSentence('no_stop.txt')
    model_1 = word2vec.Word2Vec(sentences_1)

    # model.wv.save_word2vec_format('test_01.model.txt', 'test_01.vocab.txt', binary=False)  # 保存模型，后面可直接调用
    # model = word2vec.Word2Vec.load("test_01.model")  # 调用模型

    # 计算某个词的相关词列表
    a_1 = model_1.wv.most_similar(u"空间", topn=20)
    print(a_1)

5.拓展具体为什么是 model_1.wv.most_similar()，我没有在word2vec中找到关于wv定义的语句，有哪位大佬知道吗？

_小鲤鱼_

关注

2
点赞
踩
8

收藏

觉得还不错? 一键收藏
0
评论
训练word2vec模型时碰到的两个问题：AttributeError和 UnicodeDecodeError，即属性错误和编码问题

一、属性错误1.报错：UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0xd7 in position 12.报错解析：3.解决办法：二、编码问题1.报错：UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0xd7 in position 12.报错解析：3.解决办法：......
复制链接

扫一扫