Batch-converting file encodings with Python

Encoding problems caused by multilingual text

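When source files are moved to a different development environment or IDE, compilation often fails because of encoding problems.

Typical examples: .java files from Eclipse that are not saved as UTF-8 show garbled Chinese text in Android Studio (AS). When .c, .cpp, .h or .hpp files contain text in languages other than Chinese or English and are not UTF-8, Visual Studio displays that text garbled, which breaks the code layout and makes compilation fail.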
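To make the failure mode concrete, here is a minimal sketch using only the standard library. The sample string and the choice of latin-1 as the "wrong" codec are assumptions for illustration, not taken from the files discussed in this post.

```python
# Hypothetical sample text; any non-ASCII string would do.
text = "Привет"

# The bytes a legacy editor configured for windows-1251 would write to disk.
raw = text.encode("windows-1251")

# Reading those bytes with the wrong single-byte codec succeeds,
# but yields different characters than the ones that were written.
garbled = raw.decode("latin-1")

# Reading them as strict UTF-8 fails outright.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# The fix: decode with the codec the file was really saved in,
# then re-encode as UTF-8.
fixed = raw.decode("windows-1251").encode("utf-8")

print(garbled == text)               # False
print(utf8_ok)                       # False
print(fixed.decode("utf-8") == text) # True
```

The conversion script in this post does the same thing per file, with chardet guessing the source codec.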

Solution

Use Python to convert all files to UTF-8

Environment: Python 3.7.4, plus the third-party chardet package that the script imports

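Without further ado, here is the full code. First, how to use it:

dump_file_encode(source_dir)
Only reports the detected encoding of every file under source_dir, which helps you work out which language's encoding the source files use.
convert(path)

Converts every .c, .cpp, .h and .hpp file under path to UTF-8 (path may also be a single file); see the extension variable in the code for details. Some files cannot be identified by the detector. In that case, handled by this branch:

    elif src_file_encode is None:
        src_file_encode = 'windows-1251'

I force the encoding to windows-1251 (because some of the source files I compile contain Russian). Change it as needed. The complete code follows: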

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# author:Staney.Chan [staney_chan@126.com]
# datetime:2021/10/22 10:59
# description: batch-convert file encodings, e.g. from ANSI to UTF-8


import os
import sys
import codecs
import chardet


def get_file_extension(file):
    (filepath, filename) = os.path.split(file)
    (shortname, extension) = os.path.splitext(filename)
    return extension


def get_file_encode(filename):
    with open(filename, 'rb') as f:
        data = f.read()
        encoding_type = chardet.detect(data)
        # print(encoding_type)

    return encoding_type


def process_dir(root_path):
    for path, dirs, files in os.walk(root_path):
        for file in files:
            file_path = os.path.join(path, file)
            process_file(file_path, file_path)


def process_file(filename_in, filename_out):
    """
    filename_in:  input file (full path + file name)
    filename_out: output file (full path + file name)
    Example encoding names: 'windows-1251', 'UTF-8-SIG'
    """
    extension = get_file_extension(filename_in).lower()
    if extension not in ('.c', '.h', '.cpp', '.hpp'):
        return

    # Encoding of the output file
    dest_file_encode = 'utf-8'
    encoding_type = get_file_encode(filename_in)
    src_file_encode = encoding_type['encoding']
    if src_file_encode == 'utf-8':
        return
    elif src_file_encode is None:
        # Detection failed; fall back to windows-1251 (change as needed)
        src_file_encode = 'windows-1251'

    # Print src_file_encode: encoding_type['encoding'] may be None here,
    # which would raise a TypeError in the string concatenation
    print("[Convert]File:" + filename_in + " from:" + src_file_encode + " to:UTF-8")

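    try:
        # newline='' leaves line endings untouched on read and write, so
        # '\r\n' is not turned into '\r\r\n' when running on Windows
        with open(filename_in, mode='r', encoding=src_file_encode, newline='') as fi:
            data = fi.read()
        with open(filename_out, mode='w', encoding=dest_file_encode, newline='') as fo:
            fo.write(data)

        # Re-detect the written file to check the result
        with open(filename_out, 'rb') as f:
            print(chardet.detect(f.read()))
    except Exception as e:
        print(e)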


def dump_file_encode(root_path):
    for path, dirs, files in os.walk(root_path):
        for file in files:
            filename = os.path.join(path, file)
            with open(filename, 'rb') as f:
                data = f.read()
                encoding_type = chardet.detect(data)
                print("FILE:" + file + " ENCODE:" + str(encoding_type))


def convert(path):
    """
    Batch-convert file encodings.
    path: input file or directory
    """
    # sys.argv[1], sys.argv[2]
    if os.path.isfile(path):
        process_file(path, path)
    elif os.path.isdir(path):
        process_dir(path)


if __name__ == '__main__':
    # convert(r'F:\OpenPapyrus-11.1.12\Src')
    dump_file_encode(r'C:\Users\Administrator\Desktop\cc')
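
The windows-1251 fallback in process_file only applies when detection returns None. As a rough sketch of an alternative that does not rely on detection at all, the bytes can be tried against an ordered list of candidate encodings. The helper names decode_with_fallback and to_utf8 and the default candidate list are hypothetical, introduced here for illustration; they are not part of the script above.

```python
def decode_with_fallback(raw, candidates=("utf-8", "windows-1251")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    The candidate list is an assumption; put the strictest codec first.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


def to_utf8(raw, candidates=("utf-8", "windows-1251")):
    """Convert raw bytes to UTF-8 bytes; also report the source encoding used."""
    text, source_encoding = decode_with_fallback(raw, candidates)
    return text.encode("utf-8"), source_encoding


# Usage with a hypothetical sample: Russian text saved as windows-1251.
sample = "Привет".encode("windows-1251")
converted, used = to_utf8(sample)
print(used)
```

Order matters here: single-byte codecs such as windows-1251 accept almost any byte sequence, so a successful decode is only a guess, and strict UTF-8 should be tried before them.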

