Batch-converting file encodings with Python

Latest recommended article published 2024-06-15 00:33:51

Staney.Chan

Article link: https://blog.csdn.net/awisc/article/details/120901910

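Encoding problems caused by multilingual text

Many source files fail to compile after being moved to a different development environment or IDE, purely because of their encoding.

A typical example: .java files created in Eclipse that are not saved as UTF-8 show garbled Chinese in Android Studio (AS). Likewise, when .c, .cpp, .h, or .hpp files contain text in languages other than Chinese or English and are not UTF-8, Visual Studio displays garbled characters when opening them. This corrupts the code layout and makes compilation fail.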
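To make the failure mode concrete, here is a small sketch using only the standard library. The Russian sample string is an illustrative assumption, not taken from any real source file. It shows the same bytes being rejected by UTF-8, garbled by a wrong single-byte code page, and restored by the right encoding.

```python
# The same bytes give different results under different encodings.
text = "Привет"                      # illustrative Russian sample text
raw = text.encode("windows-1251")    # bytes as a legacy editor might save them

# Strict UTF-8 decoding rejects these bytes outright.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the wrong single-byte code page "succeeds" but yields garbage.
garbled = raw.decode("latin-1")

# Decoding with the encoding the file was really saved in restores the text.
restored = raw.decode("windows-1251")

print(utf8_ok, repr(garbled), repr(restored))
```

An editor that silently picks the wrong code page behaves like the `latin-1` case: no error, just unreadable text.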

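Solution

Use Python to convert all files to UTF-8.

Runtime environment: Python 3.7.4, plus the third-party chardet package that the script imports.

Without further ado, here is the complete code. First, how to use it: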
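The core of the conversion is just a decode followed by an encode. The sketch below assumes the source encoding is already known; `transcode_to_utf8` and the sample text are hypothetical names made up for illustration and are not part of the script in this article.

```python
def transcode_to_utf8(raw: bytes, src_encoding: str) -> bytes:
    """Decode raw bytes from src_encoding and re-encode them as UTF-8."""
    return raw.decode(src_encoding).encode("utf-8")


# Illustrative input: a C snippet with a Russian comment and CRLF line endings.
sample = "// комментарий\r\nint x = 1;\r\n".encode("windows-1251")
converted = transcode_to_utf8(sample, "windows-1251")

print(converted.decode("utf-8"))
```

Working on bytes rather than text-mode file objects leaves the line endings exactly as they were in the input.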

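dump_file_encode(source_dir)
Only reports the detected encoding of every file under source_dir, without modifying anything. This helps work out which language's encoding the source files use.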

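convert(path)

Converts every .c, .cpp, .h, and .hpp file under path to UTF-8; see the extension variable in the code for the exact filter. Some files have an encoding that cannot be detected. In that case, the branch

    elif src_file_encode is None:
        src_file_encode = 'windows-1251'

forces windows-1251, because some of the source files I compile contain Russian. Change this as needed. The complete code follows: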

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# author:Staney.Chan [staney_chan@126.com]
# datetime:2021/10/22 10:59
# description: batch-convert file encodings, e.g. from ANSI to UTF-8


import os
import sys
import codecs
import chardet


def get_file_extension(file):
    (filepath, filename) = os.path.split(file)
    (shortname, extension) = os.path.splitext(filename)
    return extension


def get_file_encode(filename):
    with open(filename, 'rb') as f:
        data = f.read()
        encoding_type = chardet.detect(data)
        # print(encoding_type)

    return encoding_type


def process_dir(root_path):
    for path, dirs, files in os.walk(root_path):
        for file in files:
            file_path = os.path.join(path, file)
            process_file(file_path, file_path)


def process_file(filename_in, filename_out):
    """
    filename_in:  input file (full path + file name)
    filename_out: output file (full path + file name)
    Example source encodings: 'windows-1251', 'UTF-8-SIG'
    """
    extension = get_file_extension(filename_in).lower()
    if extension not in ('.c', '.h', '.cpp', '.hpp'):
        return

    # Encoding of the output file
    dest_file_encode = 'utf-8'
    encoding_type = get_file_encode(filename_in)
    src_file_encode = encoding_type['encoding']
    # Pure ASCII is already valid UTF-8, so it is skipped as well
    if src_file_encode is not None and src_file_encode.lower() in ('utf-8', 'ascii'):
        return
    elif src_file_encode is None:
        src_file_encode = 'windows-1251'

    # Use src_file_encode here: encoding_type['encoding'] may be None
    print("[Convert]File:" + filename_in + " from:" + src_file_encode + " to:UTF-8")

    try:
        # Read everything first so the input is closed before it is overwritten
        with codecs.open(filename=filename_in, mode='r', encoding=src_file_encode) as fi:
            data = fi.read()
        # codecs.open does not translate newlines on read, so newline=''
        # writes the original line endings back unchanged
        with open(filename_out, mode='w', encoding=dest_file_encode, newline='') as fo:
            fo.write(data)

        # Re-detect the output file as a sanity check
        with open(filename_out, 'rb') as f:
            print(chardet.detect(f.read()))
    except Exception as e:
        print(e)


def dump_file_encode(root_path):
    for path, dirs, files in os.walk(root_path):
        for file in files:
            filename = os.path.join(path, file)
            with open(filename, 'rb') as f:
                data = f.read()
                encoding_type = chardet.detect(data)
                print("FILE:" + file + " ENCODE:" + str(encoding_type))


def convert(path):
    """
    Batch-convert file encodings
    path: input file or directory
    """
    # sys.argv[1], sys.argv[2]
    if os.path.isfile(path):
        process_file(path, path)
    elif os.path.isdir(path):
        process_dir(path)


if __name__ == '__main__':
    # convert(r'F:\OpenPapyrus-11.1.12\Src')
    dump_file_encode(r'C:\Users\Administrator\Desktop\cc')
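
The script above falls back to a single hard-coded encoding when detection fails. One possible alternative, sketched here with the standard library only, is to try a short list of candidate encodings in order. `decode_with_fallback` and its default candidate list are hypothetical and not part of the original script. Order matters: single-byte code pages such as windows-1251 accept almost any byte sequence, so they belong last, and a successful decode does not prove the guess is right.

```python
def decode_with_fallback(raw, candidates=("utf-8", "windows-1251")):
    """Return (text, encoding) for the first candidate that decodes raw strictly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Illustrative inputs: one legacy Russian byte string, one UTF-8 byte string.
legacy_text, legacy_enc = decode_with_fallback("Привет".encode("windows-1251"))
utf8_text, utf8_enc = decode_with_fallback("中文".encode("utf-8"))

print(legacy_enc, legacy_text)
print(utf8_enc, utf8_text)
```

The returned text can then be written out as UTF-8 in the same way process_file does.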