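Web projects sometimes need JavaScript to load plain-text files as a data source. Text files are not all saved in the same encoding, though. Files that contain Chinese characters are often GBK-encoded and usually show up as garbled text once loaded, so they need to be converted to UTF-8, the internationally compatible encoding, before loading. Garbled text is an annoying problem, and after a long search I finally found a solution: the Python program below converts a single file, or every matching file under a folder, in one batch. I have tested it on real files and it works. The file types to convert are set in the code. This post uses the "cfg" extension, and you can change it for your own project by editing the 'exts' entry of DEFAULT_CONF. The full script is listed below.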
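Before the full listing, the core idea fits in a few lines: decode the bytes as GBK, then write the same characters back out as UTF-8. This is only a minimal sketch using the standard library. The helper name `gbk_to_utf8` is hypothetical and not part of gbk2utf.py, and it assumes the input really is GBK, whereas the script below imports chardet and applies a confidence threshold instead of assuming.

```python
def gbk_to_utf8(src_path, dst_path):
    """Re-encode one text file from GBK to UTF-8.

    Hypothetical helper for illustration; it is not part of gbk2utf.py.
    """
    # Decode the bytes as GBK. A UnicodeDecodeError here means the file
    # was not valid GBK. newline="" keeps the line endings untouched.
    with open(src_path, "r", encoding="gbk", newline="") as src:
        text = src.read()

    # Write the same characters back as UTF-8 (no BOM is added).
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

Calling `gbk_to_utf8("settings.cfg", "settings.utf8.cfg")` (both file names are made up) would leave the original file untouched and write the converted copy next to it. The script that follows applies the same conversion to whole directories and filters files by extension.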
gbk2utf.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
__author__ = ''
import logging, os, argparse, textwrap
import time
import chardet
# Default configuration will take effect when corresponding input args are missing.
# Feel free to change this for your convenience.
DEFAULT_CONF = {
    # Only those files ending with extensions in this list will be scanned or converted.
    'exts': ['cfg'],
    'overwrite': False,
    'add_BOM': False,
    'convert_UTF': False,
    'confi_thres': 0.8,
}
# We have to set a minimum threshold ('confi_thres' above). Only encoding guesses returned by chardet
# with a confidence above that threshold are accepted.
# See https://github.com/x1angli/convert2utf/issues/4 for further details
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.INFO)
log = logging.getLogger(__name__)
class Convert2Utf8:
    def __init__(self, args):
        self.args = args

    def walk_dir(self, dirname):
        for root, dirs, files in os.walk(dirname):
            for name in files:
                extension = os.path.splitext(name)[1][1:].strip().lower()
                # On linux there is a newline at the end which will cause the match to fail, so we just 'strip()' the '\n'
                # Also, add 'lower()' to ensure matching
                if extension in self.args.exts:
                    fullname = os.path.join(root, name)
                    try:
                        self.convert_file(fullname)