Reading a file in Python: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7014: invalid continuation byte

1. Problem:

When a file is opened with Python's open() and its contents are read, decoding fails: the 'utf-8' codec cannot decode the byte 0xed.

feed LM input feed file: ./data/raw/21000101.204243.txt
Traceback (most recent call last):
  File "run.py", line 9, in <module>
    traindata = load_data_in_cache()
  File "/data/deploy/wang/bertt/bigdata/feedrec/LM_embedding/gen_sample.py", line 20, in load_data_in_cache
    for line in input:
  File "/home/op_dev/wang/py3.6.12/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7014: invalid continuation byte

The same error on another run:

feed LM input feed file: ./data/raw/21000101.210302.txt
Traceback (most recent call last):
  File "run.py", line 9, in <module>
    traindata = load_data_in_cache()
  File "/data/deploy/wang/bertt/bigdata/feedrec/LM_embedding/gen_sample.py", line 20, in load_data_in_cache
    for line in input:
  File "/home/op_dev/wang/py3.6.12/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 2824: invalid continuation byte

The code that raises the error:

    for input_feeds_file in file_path:
        with open(input_feeds_file) as input:
            for line in input:
                line = line.strip()
                ......

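2. Cause:

This is a decoding problem. The error says that the 'utf-8' codec cannot decode the byte 0xed at position 2824. In UTF-8, 0xed is the lead byte of a three-byte sequence, and the bytes that follow it in the file do not form a valid continuation of that sequence.
In other words, the file contains bytes that are not valid UTF-8, while open() without an encoding argument falls back to the platform default (UTF-8 here). So an encoding has to be passed explicitly for the file to be read.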
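To see which bytes actually trip the decoder, the data can be read in binary mode and decoded by hand. The following is a minimal sketch, not part of the original script: the helper name show_bad_bytes and the sample bytes are made up for illustration.

```python
def show_bad_bytes(raw: bytes, context: int = 8):
    """Return (position, surrounding bytes) for the first invalid UTF-8 byte, or None."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the offending byte
        start = max(exc.start - context, 0)
        return exc.start, raw[start:exc.start + context]
    return None


# 0xed opens a three-byte sequence, but the next byte (0x20, a space)
# is not a continuation byte, so decoding fails at offset 3.
sample = b"caf\xed latte"
print(show_bad_bytes(sample))
```

For a real file, the same helper could be called on the result of open(path, 'rb').read(). Looking at the bytes around the reported position often hints at the real encoding of the file.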

3. Solution:

First attempt: add encoding='unicode_escape' to the open() call. This gets past the error above:

    for input_feeds_file in file_path:
        with open(input_feeds_file, encoding='unicode_escape') as input:
            for line in input:
                line = line.strip()
                ......

But it then fails with a different error:

feed LM input feed file: ./data/raw/21000101.210302.txt
Traceback (most recent call last):
  File "run.py", line 9, in <module>
    traindata = load_data_in_cache()
  File "/data/deploy/wang/bertt/bigdata/feedrec/LM_embedding/gen_sample.py", line 20, in load_data_in_cache
    for line in input:
  File "/home/op_dev/wang/py3.6.12/lib/python3.6/encodings/unicode_escape.py", line 26, in decode
    return codecs.unicode_escape_decode(input, self.errors)[0]
UnicodeDecodeError: 'unicodeescape' codec can't decode byte 0x5c in position 8191: \ at end of string

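Cause: the 'unicodeescape' codec cannot decode the byte 0x5c (a backslash) at position 8191. This codec treats a backslash as the start of an escape sequence such as \n, so a backslash with nothing after it is an error. Position 8191 is the last byte of an 8192-byte block, which suggests the data was cut at a read-buffer boundary right after a backslash, though I did not verify this. Either way, unicode_escape is meant for Python string-literal escapes, not for reading arbitrary text files.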
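The behaviour can be reproduced without any file. This is a small sketch using only the standard codecs module; the helper name try_unicode_escape and the sample bytes are invented for the demonstration.

```python
import codecs


def try_unicode_escape(chunk: bytes) -> str:
    """Decode chunk with unicode_escape, returning the error reason on failure."""
    try:
        return codecs.decode(chunk, "unicode_escape")
    except UnicodeDecodeError as exc:
        return f"error: {exc.reason}"


# A backslash followed by "t" is interpreted as a tab character.
print(repr(try_unicode_escape(b"a\\tb")))

# A backslash as the very last byte has nothing to escape, so decoding fails.
print(try_unicode_escape(b"path\\"))
```

Note that even when it succeeds, this codec silently rewrites any backslash sequences in the data, which is rarely what is wanted for ordinary text.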

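After some searching: to make the decoding errors go away, use encoding='ISO-8859-1' directly. It has not raised an error so far. ISO-8859-1 maps each of the 256 possible byte values to a character, so decoding can never fail. The trade-off is that any text in the file that really is multi-byte UTF-8 (Chinese, for example) comes out as garbled characters instead of raising an error.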
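The trade-off can be seen on a few bytes. The sketch below compares ISO-8859-1 with one alternative, decoding as UTF-8 with errors='replace'; the sample data is made up, and which option fits depends on what the files really contain.

```python
# Valid UTF-8 text followed by one stray 0xed byte (hypothetical sample data).
raw = "中文".encode("utf-8") + b" caf\xed"

# ISO-8859-1 never fails: every byte becomes exactly one character,
# so the two Chinese characters turn into six unrelated characters.
as_latin1 = raw.decode("iso-8859-1")

# UTF-8 with errors="replace" keeps the valid text and substitutes
# U+FFFD for the bytes that cannot be decoded.
as_utf8 = raw.decode("utf-8", errors="replace")

print(len(raw), len(as_latin1))
print(as_utf8)
```

The same errors argument is accepted by open(), for example open(input_feeds_file, encoding='utf-8', errors='replace'), if losing only the bad bytes is preferable to garbling all non-ASCII text.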
Reference:

1. What exactly is the difference between Unicode, UTF-8, and ISO8859-1: https://blog.csdn.net/robertcpp/article/details/7837712
