LineSentence: fixing "utf-8 codec can't decode byte 0xbe in position xx"

退堂鼓一级演员

Category: jieba. Tags: python, machine learning

Article link: https://blog.csdn.net/yeweij226/article/details/104385190


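This one took me nearly a whole afternoon to fix, which was genuinely demoralizing (being a beginner really is the original sin).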

This post is based on the article
"Fixing the error UnicodeDecodeError: 'utf-8' codec can't decode byte… when loading a corpus with gensim.models.word2vec.LineSentence"
with a few changes.

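That article's approach is to copy the LineSentence source out of word2vec.py and modify it. Its modified code still has one or two places that need correcting (marked with a comment in the code below).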

import logging
import itertools
import gensim
from gensim.models import word2vec
from gensim import utils

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

class LineSentence(object):
    """Iterate over a file that contains sentences: one line = one sentence.
    Words must be already preprocessed and separated by whitespace.

    """
    def __init__(self, source, max_sentence_length=10000, limit=None):
        """

        Parameters
        ----------
        source : string or a file-like object
            Path to the file on disk, or an already-open file object (must support `seek(0)`).
        limit : int or None
            Clip the file to the first `limit` lines. Do no clipping if `limit is None` (the default).

        Examples
        --------
        .. sourcecode:: pycon

            >>> from gensim.test.utils import datapath
            >>> sentences = LineSentence(datapath('lee_background.cor'))
            >>> for sentence in sentences:
            ...     pass

        """
        self.source = source
        self.max_sentence_length = max_sentence_length
        self.limit = limit

    def __iter__(self):
        """Iterate through the lines in the source."""
        try:
            # Assume it is a file-like object and try treating it as such
            # Things that don't have seek will trigger an exception
            self.source.seek(0)
            for line in itertools.islice(self.source, self.limit):
                line = utils.to_unicode(line).split()
                i = 0
                while i < len(line):
                    yield line[i: i + self.max_sentence_length]
                    i += self.max_sentence_length
        except AttributeError:
            # If it didn't work like a file, use it as a string filename
            # This is the line that needs to be changed
            with utils.smart_open(self.source, mode="r") as fin:
                for line in itertools.islice(fin, self.limit):
                    line = utils.to_unicode(line).split()
                    i = 0
                    while i < len(line):
                        yield line[i: i + self.max_sentence_length]
                        i += self.max_sentence_length

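utils.smart_open is not supported in my environment, so I replaced it with utils.open, which is. Because the error is a UTF-8 decoding failure, the call also needs encoding="utf-8" and errors="ignore": the first sets the encoding explicitly, and the second relaxes error handling so that undecodable bytes are silently dropped instead of raising an exception.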
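To see what the errors parameter does in isolation, here is a minimal sketch using only built-in bytes decoding. The sample bytes and the helper name decode_strict are made up for illustration and do not come from the original article.

```python
# Sample bytes that are not valid UTF-8: 0xbe appears without a lead byte.
raw = b"hello \xbe world"


def decode_strict(data):
    """Hypothetical helper: decode strictly, or return the failure reason."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return "error: " + exc.reason


strict_result = decode_strict(raw)                # strict decoding fails
ignored = raw.decode("utf-8", errors="ignore")    # the bad byte is dropped
replaced = raw.decode("utf-8", errors="replace")  # the bad byte becomes U+FFFD

print(strict_result)
print(ignored.split())
print(replaced.split())
```

Note that errors="ignore" loses data without any warning. If the corpus was actually saved in another encoding (GBK, for example), passing that encoding instead is likely the better fix, since ignoring errors would then discard or mangle much of the text.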

with utils.smart_open(self.source, mode="r") as fin:
is changed to
with utils.open(self.source, encoding="utf-8", errors="ignore") as fin:

With the encoding and error handling set this way, the code runs!
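Putting the change together, the following is a rough self-contained sketch of the patched iterator, not the exact code from the article. It makes several assumptions: the class name TolerantLineSentence is hypothetical, Python's built-in open stands in for utils.open so the example runs without gensim, and only a file path (not an already-open file object) is handled.

```python
import itertools
import os
import tempfile


class TolerantLineSentence:
    """Hypothetical LineSentence-style iterator that skips undecodable bytes."""

    def __init__(self, source, max_sentence_length=10000, limit=None):
        self.source = source  # path to a whitespace-tokenized text file
        self.max_sentence_length = max_sentence_length
        self.limit = limit

    def __iter__(self):
        # Assumption: built-in open() stands in for gensim's utils.open.
        with open(self.source, encoding="utf-8", errors="ignore") as fin:
            for line in itertools.islice(fin, self.limit):
                words = line.split()
                # Yield the line in chunks of at most max_sentence_length words.
                for i in range(0, len(words), self.max_sentence_length):
                    yield words[i:i + self.max_sentence_length]


# Write a tiny corpus containing the invalid byte 0xbe, then read it back.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"first sentence here\nbad \xbe byte line\n")
    corpus_path = tmp.name

sentences = list(TolerantLineSentence(corpus_path))
os.remove(corpus_path)
print(sentences)
```

Because an instance is a restartable iterable of token lists, it should be usable wherever the original LineSentence object was passed, for example as the sentences argument of Word2Vec.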
