Reading a file in Python: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7014: invalid conti

凝眸伏笔

Article link: https://blog.csdn.net/pearl8899/article/details/114645011

1. Problem:

When a file opened with open() is read in Python, an encoding error is raised: the 'utf-8' codec cannot decode byte 0xed.

feed LM input feed file: ./data/raw/21000101.204243.txt
Traceback (most recent call last):
  File "run.py", line 9, in <module>
    traindata = load_data_in_cache()
  File "/data/deploy/wang/bertt/bigdata/feedrec/LM_embedding/gen_sample.py", line 20, in load_data_in_cache
    for line in input:
  File "/home/op_dev/wang/py3.6.12/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7014: invalid continuation byte

A second run failed with the same error at a different position:

feed LM input feed file: ./data/raw/21000101.210302.txt
Traceback (most recent call last):
  File "run.py", line 9, in <module>
    traindata = load_data_in_cache()
  File "/data/deploy/wang/bertt/bigdata/feedrec/LM_embedding/gen_sample.py", line 20, in load_data_in_cache
    for line in input:
  File "/home/op_dev/wang/py3.6.12/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 2824: invalid continuation byte

The code that raises the error:

    for input_feeds_file in file_path:
        with open(input_feeds_file) as input:
            for line in input:
                line = line.strip()
                # ...... (rest of the loop body omitted)
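
To see what actually sits at the reported position, it can help to read the file as raw bytes and attempt the decode by hand. The sketch below is not part of the original script; the helper name `find_bad_byte` and its `context` parameter are made up for illustration.

```python
def find_bad_byte(path, encoding="utf-8", context=10):
    """Return (offset, surrounding bytes) of the first undecodable byte, or None.

    Hypothetical helper for illustration; not part of the original script.
    """
    with open(path, "rb") as f:       # binary mode: reading does no decoding
        data = f.read()
    try:
        data.decode(encoding)
    except UnicodeDecodeError as e:
        # e.start is the offset of the first byte the codec rejected
        lo = max(0, e.start - context)
        return e.start, data[lo:e.start + context]
    return None                       # the whole file decoded cleanly
```

The offset returned here is relative to the whole file. The position in the traceback is relative to the chunk of bytes being decoded at that moment, so the two numbers may differ.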

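2. Cause:

This is a decoding problem. open() was called without an encoding argument, so Python used the platform default, here UTF-8, and the 'utf-8' codec could not decode the byte 0xed at position 2824 (position 7014 in the first run). In UTF-8, 0xed starts a three-byte sequence, and "invalid continuation byte" means that the byte following it is not a valid continuation byte.
In other words, the file contains bytes that are not valid UTF-8, most likely because it was written in a different encoding or contains corrupted data, so it has to be read with an encoding that matches its contents.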
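
A minimal reproduction of the error, assuming for illustration that the text was saved in a single-byte encoding such as ISO-8859-1 (the post does not say which encoding the data files actually use). In that encoding the character í is stored as the single byte 0xed:

```python
# Assumption for illustration: the text was saved as ISO-8859-1, not UTF-8.
raw = "caf\u00ed bar".encode("iso-8859-1")   # U+00ED (í) becomes the single byte 0xed

position = None
reason = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    # 0xed announces a three-byte UTF-8 sequence, but the next byte
    # (0x20, a space) is not in the continuation range 0x80-0xBF.
    position = e.start
    reason = e.reason

print(position, reason)

# Decoding with the encoding the bytes were written in succeeds.
text = raw.decode("iso-8859-1")
```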

3. Solution:

First attempt: adding encoding='unicode_escape' to the open() call gets past the error above.

    for input_feeds_file in file_path:
        with open(input_feeds_file, encoding='unicode_escape') as input:
            for line in input:
                line = line.strip()
                # ...... (rest of the loop body omitted)

But it then raised a different error:

feed LM input feed file: ./data/raw/21000101.210302.txt
Traceback (most recent call last):
  File "run.py", line 9, in <module>
    traindata = load_data_in_cache()
  File "/data/deploy/wang/bertt/bigdata/feedrec/LM_embedding/gen_sample.py", line 20, in load_data_in_cache
    for line in input:
  File "/home/op_dev/wang/py3.6.12/lib/python3.6/encodings/unicode_escape.py", line 26, in decode
    return codecs.unicode_escape_decode(input, self.errors)[0]
UnicodeDecodeError: 'unicodeescape' codec can't decode byte 0x5c in position 8191: \ at end of string

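Cause: the 'unicodeescape' codec cannot decode byte 0x5c (a backslash) at position 8191. unicode_escape is meant for Python string-literal escape sequences, not as a general text encoding, so it treats every backslash as the start of an escape, and a backslash with nothing after it is an incomplete one.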
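
The same failure can be reproduced on a plain bytes object. The sample data below is made up; it only shows how the codec reacts to backslashes:

```python
# Made-up sample data: five bytes ending in a literal backslash (p, a, t, h, \).
data = b"path\\"

error = None
try:
    data.decode("unicode_escape")
except UnicodeDecodeError as e:
    error = e.reason        # the trailing backslash starts an escape that never finishes

# Even when no exception is raised, the codec changes the content:
# the two bytes backslash + "t" come back as a single tab character.
rewritten = b"a\\tb".decode("unicode_escape")
```

Position 8191 in the traceback would be consistent with a backslash landing on the last byte of an 8192-byte read chunk, although the post does not confirm this.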

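After some searching: passing encoding='ISO-8859-1' directly avoids the decoding error, and so far it has not raised one. ISO-8859-1 maps every byte value to a character, so decoding can never fail; if the file is actually in another encoding, though, its non-ASCII characters will come out wrong.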
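
A short sketch of why ISO-8859-1 stops the errors and what it costs. The helper `read_lines`, which keeps UTF-8 and passes `errors="replace"` instead, is an illustrative alternative and not something the original post uses:

```python
# Every byte value 0x00-0xFF decodes under ISO-8859-1, so no UnicodeDecodeError can occur.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("iso-8859-1")

# The cost: bytes that were really UTF-8 decode without error, but into the wrong characters.
utf8_bytes = "中".encode("utf-8")             # three bytes: e4 b8 ad
garbled = utf8_bytes.decode("iso-8859-1")     # three unrelated characters, not "中"

# The mapping is lossless, so the original bytes can still be recovered.
recovered = garbled.encode("iso-8859-1")


def read_lines(path, encoding="utf-8"):
    """Hypothetical alternative: keep UTF-8, substitute U+FFFD for undecodable bytes."""
    with open(path, encoding=encoding, errors="replace") as f:
        return [line.strip() for line in f]
```

Which option fits depends on the data: ISO-8859-1 keeps every byte recoverable, while `errors="replace"` keeps valid UTF-8 text readable but discards the bytes it could not decode.
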
References:

1. What exactly is the difference between Unicode, UTF-8 and ISO8859-1: https://blog.csdn.net/robertcpp/article/details/7837712
