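Background
I collected a batch of annotated data, and the Chinese text in it appears to use several different encodings. A rough check with chardet
turned up ascii, utf-8, gbk, gb2312, and gb18030.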
chardet
Detecting the encoding
import chardet

# Read the first line as raw bytes and let chardet guess its encoding.
with open('test.txt', 'rb') as f:
    data = f.readline()
result = chardet.detect(data)
print(result)
Result: {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
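One caveat about the result above: only the first line is passed to chardet, and a first line made up entirely of ASCII bytes is reported as ascii even if later lines contain Chinese text. The sketch below is one way to check whether the first line is representative. The helper name `ascii_check` is made up for illustration, and it assumes the file is small enough to read into memory.

```python
def ascii_check(path):
    """Return (first_line_is_ascii, whole_file_is_ascii) for the file at path."""
    with open(path, "rb") as f:
        data = f.read()  # raw bytes, no decoding yet
    first_line = data.split(b"\n", 1)[0]
    # bytes.isascii() is True when every byte is below 0x80 (Python 3.7+).
    return first_line.isascii(), data.isascii()
```

If the two values differ, the first line is not a reliable sample, and it is probably safer to pass the whole `data` to `chardet.detect` instead.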
Code for handling files whose encoding is uncertain.
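import os
import json

# Candidate encodings, tried in order.
encodings = ['ascii', 'utf-8', 'gbk', 'gb2312', 'gb18030']

def json_open_encoding(json_dir, json_name, encoding_json):
    # Try to parse the file with one encoding; return None if that fails.
    # Only decoding and JSON errors are swallowed, so a missing file still raises.
    try:
        with open(os.path.join(json_dir, json_name), encoding=encoding_json) as f:
            sjson = json.load(f)
    except (UnicodeDecodeError, json.JSONDecodeError):
        sjson = None
    return sjson

def json_open(json_dir, json_name):
    # Return the first successful parse, or None if every encoding fails.
    sjson = None
    for encoding_j in encodings:
        sjson = json_open_encoding(json_dir, json_name, encoding_j)
        if sjson is not None:
            break
    return sjson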
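The loop above reopens the file once per candidate encoding and does not report which encoding finally worked. A possible variant, sketched here under my own assumptions, reads the bytes once, decodes them with each candidate in turn, and returns the encoding alongside the parsed data. The names `CANDIDATE_ENCODINGS` and `load_json_any_encoding` are hypothetical. The candidates are reordered from narrowest to widest (gb18030 is designed to be backward compatible with gbk, which in turn extends gb2312); that ordering is my choice, not something the original list requires.

```python
import json

# Candidate encodings, narrowest first (ordering is an assumption of this sketch).
CANDIDATE_ENCODINGS = ["ascii", "utf-8", "gb2312", "gbk", "gb18030"]


def load_json_any_encoding(path, encodings=CANDIDATE_ENCODINGS):
    """Return (parsed_json, encoding) using the first encoding that decodes the file."""
    with open(path, "rb") as f:
        raw = f.read()  # read the bytes once, then try decoding them several ways
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # wrong encoding, try the next candidate
        # Decoding succeeded. A json.JSONDecodeError raised here propagates to
        # the caller, so malformed JSON is not mistaken for a wrong encoding.
        return json.loads(text), encoding
    raise ValueError(f"none of {encodings} could decode {path!r}")
```

Unlike the original, this version raises instead of returning None, so a file whose content is the JSON value `null` is not confused with a failure. A successful decode does not prove the encoding is the true one, only that the bytes are valid under it.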