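This one took me almost a whole afternoon to fix and left me thoroughly defeated (being a beginner really is the original sin).
This post is based on "Fixing the UnicodeDecodeError: 'utf-8' codec can't decode byte… error when loading a corpus with gensim.models.word2vec.LineSentence", with a few changes.
That post works by copying the LineSentence source code out of word2vec.py and modifying it. Its modified code has one or two places that still need correcting (marked with a comment in the code below).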
import logging
import itertools
import gensim
from gensim.models import word2vec
from gensim import utils

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


class LineSentence(object):
    """Iterate over a file that contains sentences: one line = one sentence.
    Words must be already preprocessed and separated by whitespace.
    """

    def __init__(self, source, max_sentence_length=10000, limit=None):
        """
        Parameters
        ----------
        source : string or a file-like object
            Path to the file on disk, or an already-open file object (must support `seek(0)`).
        limit : int or None
            Clip the file to the first `limit` lines. Do no clipping if `limit is None` (the default).

        Examples
        --------
        .. sourcecode:: pycon

            >>> from gensim.test.utils import datapath
            >>> sentences = LineSentence(datapath('lee_background.cor'))
            >>> for sentence in sentences:
            ...     pass

        """
        self.source = source
        self.max_sentence_length = max_sentence_length
        self.limit = limit

    def __iter__(self):
        """Iterate through the lines in the source."""
        try:
            # Assume it is a file-like object and try treating it as such
            # Things that don't have seek will trigger an exception
            self.source.seek(0)
            for line in itertools.islice(self.source, self.limit):
                line = utils.to_unicode(line).split()
                i = 0
                while i < len(line):
                    yield line[i: i + self.max_sentence_length]
                    i += self.max_sentence_length
        except AttributeError:
            # If it didn't work like a file, use it as a string filename
            # This is the line that needs to be changed
            with utils.smart_open(self.source, mode="r") as fin:
                for line in itertools.islice(fin, self.limit):
                    line = utils.to_unicode(line).split()
                    i = 0
                    while i < len(line):
                        yield line[i: i + self.max_sentence_length]
                        i += self.max_sentence_length
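In my environment utils.smart_open is not supported, so I replaced it with utils.open, which is. Since the error is a UTF-8 decoding failure, the call also needs encoding="utf-8" and errors="ignore": the first sets the encoding explicitly, and the second relaxes error handling so that bytes which cannot be decoded are skipped instead of raising an exception.
    with utils.smart_open(self.source, mode="r") as fin:
becomes
    with utils.open(self.source, encoding="utf-8", errors="ignore") as fin:
With the file opened this way, the code runs!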
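To see what errors="ignore" actually does, here is a minimal sketch that does not need gensim at all. It is an assumption-laden stand-in, not the library's code: the class name TolerantLineSentence is made up for this example, and it uses Python's built-in open rather than utils.open. It writes a tiny corpus containing one byte that is invalid in UTF-8, then reads it back the same way the patched line does.

```python
import itertools
import os
import tempfile


class TolerantLineSentence:
    """Hypothetical stand-in for the patched LineSentence (built-in open, no gensim)."""

    def __init__(self, source, max_sentence_length=10000, limit=None):
        self.source = source
        self.max_sentence_length = max_sentence_length
        self.limit = limit

    def __iter__(self):
        # errors="ignore" silently drops bytes that are not valid UTF-8
        # instead of raising UnicodeDecodeError.
        with open(self.source, encoding="utf-8", errors="ignore") as fin:
            for line in itertools.islice(fin, self.limit):
                words = line.split()
                # Long lines are cut into chunks of at most max_sentence_length words.
                for i in range(0, len(words), self.max_sentence_length):
                    yield words[i:i + self.max_sentence_length]


# Build a small corpus whose second line contains a stray byte (0xff),
# which can never appear in valid UTF-8.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("hello world\n".encode("utf-8"))
    tmp.write(b"bad \xff byte here\n")
    corpus_path = tmp.name

try:
    sentences = list(TolerantLineSentence(corpus_path))
finally:
    os.remove(corpus_path)

print(sentences)
```

The bad byte simply disappears from the second sentence. That is convenient for getting training to run, but it also means damaged text is dropped without any warning, so if a lot of the corpus is affected it is probably worth checking whether the file is really UTF-8 in the first place.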