Handling Chinese corpora: the "codec can't decode byte 0xaa in position 134: illegal multibyte sequence" error

Latest recommended article published on 2021-12-16 14:29:49

weixin_46319026

Tags: python, programming languages, backend

Article link: https://blog.csdn.net/weixin_46319026/article/details/121495573

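Processing a Chinese corpus raised a 'gb2312' decoding error. Switching to the gb18030 encoding resolved most files, but a few still failed. Passing errors='ignore' to skip the remaining undecodable bytes finally allowed every file to be processed.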

import os
import jieba
import chardet

path = "D:/python/复旦中文文本训练集/train/"
path1 = "D:/python/复旦中文文本训练集/"

def check(path: str):
    # Read the raw bytes and print chardet's guessed encoding next to the file path
    with open(path, 'rb') as f:
        print(chardet.detect(f.read())['encoding'], ': ', path)

chardet reported every file as GB2312-encoded.

Opening the files with that encoding, however, ran into trouble.

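# Collect the paths of all files under the corpus folder
filelist = []
for i in os.listdir(path):
    for j in os.listdir(path+i):
        filelist.append(path+i+"/"+j)

# Concatenate every article into one string
# (named corpus rather than str, to avoid shadowing the built-in)
corpus = ""
for file in filelist:
    with open(file, encoding='GB2312') as f:
        article = f.read()
        corpus += article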

Error: 'gb2312' codec can't decode byte 0xaa in position 134: illegal multibyte sequence
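This kind of error generally means the file contains a byte sequence that the GB2312 codec has no mapping for. The snippet below is a minimal reproduction, assuming the offending text is a character that exists in GB18030 but not in GB2312. The sample character and the helper name `try_decode` are chosen for illustration and do not come from the corpus.

```python
# Illustrative sample: a Traditional Chinese character that GB18030 can
# encode but that is not part of the GB2312 character set.
sample = "體"
raw = sample.encode("gb18030")


def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        # Print the codec's own message, e.g. "illegal multibyte sequence"
        print(f"{encoding}: {exc}")
        return None


gb2312_result = try_decode(raw, "gb2312")    # rejected by the stricter codec
gb18030_result = try_decode(raw, "gb18030")  # decodes back to the sample
```

If a file behaves like `raw` here, chardet's "GB2312" guess is probably close but too narrow, and a superset encoding is worth trying.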

A quick search online suggested that GB2312 raises errors on Traditional Chinese characters, so
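The summary at the top describes the eventual fix: decode with gb18030, and pass errors='ignore' for the few files that still fail. A minimal sketch of that idea follows. The helper name `read_chinese_text` is hypothetical, and the fallback assumes that silently dropping a handful of corrupted bytes is acceptable for this corpus.

```python
from pathlib import Path


def read_chinese_text(file_path, encoding="gb18030"):
    """Read a text file as GB18030, dropping bytes that cannot be decoded.

    Hypothetical helper: strict decoding is tried first so that clean
    files are read unchanged; only files that fail fall back to
    errors="ignore".
    """
    raw = Path(file_path).read_bytes()
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # Assumption: losing the undecodable bytes is acceptable here.
        return raw.decode(encoding, errors="ignore")
```

In the concatenation loop above, the same effect can be had in one line with `open(file, encoding='gb18030', errors='ignore')`. The two-step version only makes it easier to log which files actually needed the fallback.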
