Batch-converting file encodings with Python 3

Mingyueyixi

Article link: https://blog.csdn.net/Mingyueyixi/article/details/104401607

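Background: I am a rookie programmer, and one day I suddenly discovered that in one of my equally rookie projects the file encodings were a complete mess. What now? Urgent, waiting online for an answer.

Sadly, no expert ever turned up to recommend a handy tool, so I started wondering whether I could batch-fix the problem myself.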

Preparation

python3
pip install chardet (for detecting encodings)

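Detecting file encodings

"Preparation breeds success; lack of it breeds failure." There are far too many files with mixed-up encodings, so it pays to plan first: detect the encoding of each file, and only then start fixing them.
To detect a file's encoding we can use the open-source chardet library. Usage is simple: just pass it the bytes: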

import chardet

# path is the file to inspect
with open(path, "rb") as f_file:
    content = f_file.read()
# The result is a dict holding the guessed encoding and its confidence
guess_encode = chardet.detect(content)
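
Since the result is only a guess, it can help to look at the confidence before trusting it. Below is a minimal sketch of that idea, working on a dict shaped like the result of chardet.detect (keys "encoding" and "confidence"). The helper name pick_encoding, the 0.8 threshold and the fallback parameter are my own assumptions for illustration, not part of chardet.

```python
def pick_encoding(guess, min_confidence=0.8, fallback=None):
    """Return the guessed encoding only if the detector was confident enough.

    `guess` is a dict shaped like the result of chardet.detect(), for example
    {"encoding": "utf-8", "confidence": 0.99}.
    `min_confidence` and `fallback` are illustrative parameters (assumptions).
    """
    encoding = guess.get("encoding")
    confidence = guess.get("confidence") or 0.0
    # No guess at all, or a weak guess: hand back the fallback instead
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding
```

With real data you would call it as pick_encoding(chardet.detect(content)), and treat a None result as "check this file by hand".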

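Collecting all the files to check

"I will have sons, and my sons will have grandsons, and those grandsons will have sons of their own; generation after generation without end." Some folders really are that deep: their directory structure is nested many levels down.

Whether we are detecting encodings or fixing them, we first have to find all of these files. How?

Recursion is the usual first thought, but for files Python's os module already has this covered: just use os.walk: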

import os
import re


# Recursively collect the files under every subfolder
def walk_files(path, regex=r"."):
    # A path that is not a directory is returned as a single-item list
    if not os.path.isdir(path):
        return [path]
    file_list = []
    for root, dirs, files in os.walk(path):
        for file in files:
            file_path = os.path.join(root, file)
            # Keep only the paths that the regex accepts
            if re.match(regex, file_path):
                file_list.append(file_path)

    return file_list

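The regular expression (re module) makes filtering easy, since there are always some files that do not need to be checked or modified. Note that re.match anchors at the start of the path, so matching by extension needs a pattern such as r".*\.java".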
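
To see how the filter behaves, here is a small sketch that applies the same re.match test to a hand-written list of paths. The helper filter_paths and the sample paths are made up for illustration.

```python
import re


def filter_paths(paths, regex=r"."):
    """Keep only the paths that re.match() accepts, the same test walk_files uses."""
    return [p for p in paths if re.match(regex, p)]


# Hypothetical sample paths, only for the demonstration
paths = ["./src/Main.java", "./src/notes.txt", "./lib/Util.java"]

# re.match starts at the beginning of the string, hence the leading ".*"
java_only = filter_paths(paths, r".*\.java$")
```

A pattern without the leading ".*", such as r"\.java$", matches none of these paths, because re.match only looks at the start of each string.
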
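With the file list in hand, reading each file and detecting its encoding is not hard: just add a loop. Inside the loop we record each guess, and either print it or hold it until the end and write it to a report file. I won't go into detail.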
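
The loop described above could look roughly like this. It is a sketch, not the final code: the function name build_report is my own, and the detector is passed in as a parameter (an assumption made so the snippet runs without chardet installed); in practice you would pass chardet.detect.

```python
def build_report(paths, detect):
    """Read each file as bytes, run `detect` on it, and tally the guesses.

    `detect` is any callable that maps bytes to a dict with an "encoding"
    key, such as chardet.detect.
    """
    report = {"stat": {}, "reports": []}
    for path in paths:
        with open(path, "rb") as f:
            guess = detect(f.read())
        encoding = guess.get("encoding")
        # Keep the raw guess per file, plus a count per encoding
        report["reports"].append([path, guess])
        report["stat"][encoding] = report["stat"].get(encoding, 0) + 1
    return report
```

Usage would be along the lines of build_report(walk_files("./test"), chardet.detect), after which the returned dict can be printed or written to a report file.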

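Changing the file encoding

Python 2's strings were arguably rather poorly designed: binary byte data also counted as a string, which led to all sorts of confusion.

Python 3 improved on this. Converting bytes from one encoding to another only takes the following: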

# Decode the bytes into a string
contentStr = content.decode(original)
# Encode the string into bytes in the target encoding
targetBytes = bytes(contentStr, target)

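Of course, remember to wrap this in try. Bytes must be decoded with the correct encoding, otherwise an exception may be raised. It is like decryption: the wrong key will not open the door (for example, content that is actually utf-8 but is mistakenly decoded as gbk). Be aware that a wrong encoding does not always raise; sometimes it decodes without complaint and simply yields garbled text.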
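
A minimal sketch of the try-wrapped conversion follows. The function name convert_bytes and the choice to return None on failure are assumptions for illustration.

```python
def convert_bytes(content, original, target):
    """Decode `content` as `original` and re-encode it as `target`.

    Returns the converted bytes, or None if the bytes cannot be decoded with
    `original`, cannot be represented in `target`, or an encoding name is
    unknown.
    """
    try:
        text = content.decode(original)
        return text.encode(target)
    except (UnicodeDecodeError, UnicodeEncodeError, LookupError):
        return None
```

For instance, "你好" stored as gbk converts cleanly with convert_bytes(data, "gbk", "utf-8"), while decoding the same bytes as utf-8 fails and gives None.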

Once we have the bytes in the new encoding, we still need to save the file:

# f_file must be opened in "rb+" mode (binary, read and write)
f_file.seek(0)
f_file.truncate()
f_file.write(targetBytes)

First move the file pointer to the beginning, then call f_file.truncate() to discard everything after the pointer, and finally write the new bytes.
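
Put together, the read, convert, seek, truncate and write steps might be sketched like this. The function name rewrite_in_place is my own; the conversion is done before the file is touched, so a decoding error leaves the original content intact.

```python
def rewrite_in_place(path, original, target):
    """Re-encode one file in place, from `original` to `target`."""
    with open(path, "rb+") as f:
        content = f.read()
        # Convert first: if this raises, the file has not been modified yet
        new_bytes = content.decode(original).encode(target)
        f.seek(0)       # back to the start of the file
        f.truncate()    # drop the old content after the pointer
        f.write(new_bytes)
```

The truncate step matters when the new content is shorter than the old, as with Chinese text going from utf-8 to gbk; without it, leftover bytes from the old content would remain at the end of the file.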

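Final chapter (example code and screenshot)

Most of the above describes the approach, and the code is incomplete. Most importantly, though: back up before running any batch operation. I did not implement this, but you could use shutil.copytree(source_folder, new_folder) to make the backup.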
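
A backup step along those lines could be as small as the sketch below. The function name backup_tree and the folder names in the usage note are hypothetical.

```python
import shutil


def backup_tree(src_dir, backup_dir):
    """Copy the whole directory tree to `backup_dir` and return that path.

    By default shutil.copytree raises FileExistsError if `backup_dir`
    already exists, which prevents overwriting an earlier backup by accident.
    """
    return shutil.copytree(src_dir, backup_dir)
```

For example, backup_tree("./test", "./test_backup") would be called once before change_encoding is run on ./test.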

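[Screenshot]
As the screenshot above shows, chardet's guess is not always correct. That is why a backup is needed, and why some files need individual adjustments until the IDE can display or run them properly.

Here is the complete test code: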

# -*- coding: utf-8 -*-
# @Date:2020/1/12 19:04
# @Author: Lu
# @Description

import os
import copy
import re
import chardet


class FileUtil():

    # Recursively collect the files under every subfolder
    @staticmethod
    def walk_files(path, regex=r"."):
        if not os.path.isdir(path):
            return [path]
        file_list = []
        for root, dirs, files in os.walk(path):
            for file in files:
                file_path = os.path.join(root, file)
                # Keep only the paths that the regex accepts
                if re.match(regex, file_path):
                    file_list.append(file_path)

        return file_list


class EncodeTask():

    def __init__(self):
        self.default_config = {
            "workpaths": [u"./"],
            "filefilter": r"."
        }
        self.config = copy.deepcopy(self.default_config)
        self.work_files = []
        self.workpaths = []

    def update(self, config, fill_default_value=False):
        cache = copy.deepcopy(config)
        for k in self.default_config.keys():
            if cache.get(k):
                self.config[k] = cache[k]
            elif fill_default_value:
                self.config[k] = self.default_config[k]
        self.__gen_files(self.config["workpaths"])
        return self

    def __gen_files(self, workpaths):
        self.work_files.clear()
        for workpath in workpaths:
            self.work_files += FileUtil.walk_files(workpath, self.config["filefilter"])

    def check_encoding(self):
        encoding_report = {"stat": {}, "reports": []}
        for path in self.work_files:
            with open(path, "rb") as f_file:
                content = f_file.read()
            guess_encode = chardet.detect(content)

            encoding = guess_encode.get("encoding")
            encoding_report["reports"].append([path, guess_encode])
            # Count how many files were guessed as each encoding
            stat = encoding_report["stat"]
            stat[encoding] = stat.get(encoding, 0) + 1

        reportContent = u"{}\n".format(encoding_report["stat"])

        for item in encoding_report["reports"]:
            reportContent += u"\n{}    {}".format(item[0], item[1])

        with open(u"./encoding_report.txt", "w", encoding="utf-8") as reportfile:
            reportfile.write(reportContent)
        print(encoding_report)

    def change_encoding(self, original, target):
        for path in self.work_files:
            print(u"\n{}\nchange {} to {}".format(path, original, target))
            with open(path, "rb+") as f_file:
                content = f_file.read()
                try:
                    # Decode the bytes into a string
                    contentStr = content.decode(original)
                    # Encode the string as unicode-escaped bytes
                    # unicodeBytes = contentStr.encode("unicode_escape")

                    # Encode the string into bytes in the target encoding
                    targetBytes = bytes(contentStr, target)

                    # print(targetBytes)

                    # Overwrite the file from the beginning
                    f_file.seek(0)
                    f_file.truncate()
                    f_file.write(targetBytes)

                except Exception as e:
                    print(u"Error: the encoding may be wrong\n{}".format(e))


def task():
    print("""You can use it like this code:
# -*- coding: utf-8 -*-

    from conver_encode import EncodeTask

    EncodeTask().update({
        "workpaths": [u"./test"],
        "filefilter": r".*\.(?:java)"
    }).check_encoding()

    EncodeTask().update({
        "workpaths": [u"./test"],
        "filefilter": r".*\.(?:java)"
    }).change_encoding("gb18030", "utf-8")

    # }).change_encoding("utf-8", "gb18030")
    # }).change_encoding("Windows-1252", "utf-8")
    """)


if __name__ == '__main__':
    task()
