Unifying on UTF-8: An Efficient Transcoding Guide That Puts an End to Garbled File Encodings

Latest recommended article published 2024-08-11 08:34:55

Lemo`s Studio

Categories: python, Trusted Programming. Tags: UTF-8, garbled text, C++, python

Article link: https://blog.csdn.net/qq_36631379/article/details/139242717

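1 Background

Have you run into this? During a project, team members finish coding and save their files in all sorts of encodings, and the integration build then fails with assorted errors that, once tracked down, turn out to be encoding problems. Or code is moved to another platform and the compiler again reports all kinds of errors because the encodings are inconsistent. Or you compare code in BeyondCompare and, because the files are encoded differently, it either reports a display error or shows garbled text.

All of this is maddening. After a great deal of effort spent on diagnosis, most cases point back to inconsistent file encodings. You are left deflated that such a small mistake keeps being made, and yet, lesson supposedly learned, the same thing happens again the next time.

How to unify file encodings is the subject of this article. It covers both files added from now on and files that already exist in the code base.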

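2 Benefits of unification

Let us first look at which problems inconsistent file encodings actually cause during development.

From the material the author consulted, the effects fall into these categories:

Characters are displayed incorrectly
Strings are processed incorrectly
Cross-platform compatibility problems
Compatibility problems with development tools and version control systems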
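
To make the first item concrete, here is a minimal sketch of how one string turns into garbled text when bytes written in one encoding are read in another. The sample string and the variable names are illustrative assumptions, not taken from any real project.

```python
text = "编码"  # sample string containing non-ASCII characters (illustrative)

gbk_bytes = text.encode("gbk")      # what an editor configured for GBK would save
utf8_bytes = text.encode("utf-8")   # what an editor configured for UTF-8 would save

# The same text is stored as different bytes under the two encodings.
same_bytes = gbk_bytes == utf8_bytes

# Reading GBK bytes as UTF-8 with a strict decoder fails outright ...
try:
    gbk_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# ... while reading UTF-8 bytes as GBK with a lenient decoder yields garbled text.
garbled = utf8_bytes.decode("gbk", errors="replace")

# Only the matching encoding round-trips cleanly.
roundtrip = utf8_bytes.decode("utf-8")

print(same_bytes, strict_ok, garbled == text, roundtrip == text)
```

Only the last comparison is true: the bytes differ, the strict decode fails, and the lenient decode produces different characters from the ones that were written.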

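So what are the benefits of standardizing files on UTF-8?

Cross-platform consistency: UTF-8 is widely supported across operating systems and development environments. With a single encoding, files are parsed and displayed correctly on Windows, Linux, and macOS alike.
Internationalization support: UTF-8 can encode every Unicode code point, which covers the vast majority of the world's writing systems. This is convenient for multilingual development and internationalized projects, because no conversion between encodings is needed and the garbling such conversions cause is avoided.
Simpler workflow: in a team, developers may use different operating systems and tools. Without a common encoding, incompatibilities surface during collaboration, for example as garbled text in code review. Standardizing on UTF-8 reduces these problems and simplifies version control and collaboration.
Fewer encoding errors: with mixed encodings, a compiler or interpreter may misread the source, particularly when a file contains non-ASCII characters. This can cause compile-time or run-time errors, most visibly in string handling and character-set conversion.
Recommended by standards: many modern programming languages and standards bodies recommend or require UTF-8. For example, Python 3 uses UTF-8 as the default source file encoding, and the HTML standard requires documents to be encoded in UTF-8. Following these standards and best practices helps keep code maintainable and readable.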
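
In scripts, one way to get this consistency is to pass the encoding explicitly on every file operation instead of relying on the platform default. The sketch below assumes a hypothetical helper named save_and_load and a made-up sample string; it only illustrates the idea.

```python
import os
import tempfile


def save_and_load(text, encoding="utf-8"):
    """Write text to a temporary file and read it back with the same explicit encoding."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # Naming the encoding makes the result independent of the OS locale.
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
        with open(path, "r", encoding=encoding) as f:
            return f.read()
    finally:
        os.remove(path)


sample = "多语言 text: héllo, привет"  # illustrative mixed-language string
print(save_and_load(sample) == sample)
```

Because both the write and the read name UTF-8, the round trip does not depend on which operating system or locale runs the script.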

With that background, we can at least agree in principle that standardizing files on UTF-8 brings clear benefits to a project.

So what should we do in practice?

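3 Unifying newly added code

For new files, the remedy is a uniform IDE configuration. When a team develops together, agreeing on the IDE and its settings is essential.

VS2017 serves as the example here.

For new files, the encoding can be set through template files:

Place two files, hfile.h and newc++file.cpp, in C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\Common7\IDE\VC\vcprojectitems. On some installations the directory is C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\Common7\IDE\VC\vcprojectitems instead.

hfile.h and newc++file.cpp are template files that you prepare yourself and save as UTF-8.

With these in place, when you add a file to a project ("Solution -> Add -> New Item"), VS saves it in the encoding of the corresponding template.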
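
Before a template is copied into vcprojectitems, it can be worth checking that it really is UTF-8. A rough sketch follows; describe_utf8 and describe_file are hypothetical helper names, and the three labels they return are an assumption of this example. Pure ASCII content is reported as utf-8, since ASCII is a subset of UTF-8.

```python
import codecs


def describe_utf8(data):
    """Classify raw bytes as 'utf-8-bom', 'utf-8', or 'not-utf-8'."""
    if data.startswith(codecs.BOM_UTF8):
        kind = "utf-8-bom"
        data = data[len(codecs.BOM_UTF8):]  # check the payload after the BOM
    else:
        kind = "utf-8"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf-8"
    return kind


def describe_file(path):
    """Apply describe_utf8 to the raw bytes of the file at path."""
    with open(path, "rb") as f:
        return describe_utf8(f.read())


print(describe_utf8("// 编码".encode("utf-8")))
print(describe_utf8(codecs.BOM_UTF8 + b"#pragma once"))
print(describe_utf8("// 编码".encode("gbk")))
```

Calling describe_file on each of the two template files would then show whether they were saved as intended, with or without a byte order mark.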

4 Unifying existing code

For existing files there are four scenarios. Let us go through the solution for each.

4.1 A single folder: convert every .h and .cpp file in it

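import os

import chardet  # third-party package: pip install chardet


def get_encoding(file_name):
    """Guess the encoding of a file from its raw bytes (may return None)."""
    with open(file_name, 'rb') as f:
        return chardet.detect(f.read())['encoding']


def write(file_name, content):
    # newline='' leaves the original line endings untouched
    with open(file_name, 'w', encoding='utf-8', newline='') as f:
        f.write(content)


def read(file_name, encoding):
    with open(file_name, 'r', encoding=encoding, newline='') as f:
        return f.read()


def convert(src_file, dst_file):
    """Re-encode src_file as UTF-8 into dst_file; skip the file if it cannot be read."""
    try:
        encoding = get_encoding(src_file)
        if encoding is None:
            raise ValueError("encoding could not be detected")
        content = read(src_file, encoding)
    except (OSError, LookupError, ValueError) as e:
        # UnicodeDecodeError is a ValueError; LookupError covers unknown codec names
        print("something wrong: %s (%s)" % (src_file, e))
        return
    write(dst_file, content)


def convert_dir(src_dir, dst_dir, sub_dir, file_prefixes=None):
    """Convert the .h/.cpp files under src_dir/sub_dir, mirroring the tree into dst_dir."""
    src_prj_dir = os.path.join(src_dir, sub_dir)
    for root, dirs, files in os.walk(src_prj_dir):
        # Map the current source directory to the same relative path under dst_dir
        dst_root = os.path.join(dst_dir, os.path.relpath(root, src_dir))
        for f in files:
            if not f.endswith(('.h', '.cpp')):
                continue
            # When prefixes are given, convert only the files that start with one of them
            if file_prefixes and not f.startswith(tuple(file_prefixes)):
                continue
            os.makedirs(dst_root, exist_ok=True)
            convert(os.path.join(root, f), os.path.join(dst_root, f))


if __name__ == '__main__':
    src_dir = 'D:/ProjectRootDir/AModule/source'  # root of the source tree
    dst_dir = 'F:/Python/Src/result'  # destination for the UTF-8 copies

    convert_dir(src_dir, dst_dir, 'include')  # scan the include directory under the source root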
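
After a run, it may be worth confirming that everything in the destination tree really decodes as UTF-8. The following is only a sketch, under the assumption that .h and .cpp files are the ones that matter; find_non_utf8 is a hypothetical helper and not part of the conversion script.

```python
import os


def find_non_utf8(root_dir, suffixes=(".h", ".cpp")):
    """Return the sorted paths under root_dir whose bytes do not decode as UTF-8."""
    bad = []
    for root, _dirs, files in os.walk(root_dir):
        for name in files:
            if not name.endswith(suffixes):
                continue
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                data = f.read()
            try:
                data.decode("utf-8")
            except UnicodeDecodeError:
                bad.append(path)
    return sorted(bad)
```

Calling find_non_utf8('F:/Python/Src/result') after the conversion should return an empty list; any path it does return points at a file that was skipped or mis-detected.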

4.2 A single folder: convert selected .h and .cpp files

import os

import chardet  # third-party package: pip install chardet


def get_encoding(file_name):
    """Guess the encoding of a file from its raw bytes (may return None)."""
    with open(file_name, 'rb') as f:
        return chardet.detect(f.read())['encoding']


def write(file_name, content):
    # newline='' leaves the original line endings untouched
    with open(file_name, 'w', encoding='utf-8', newline='') as f:
        f.write(content)


def read(file_name, encoding):
    with open(file_name, 'r', encoding=encoding, newline='') as f:
        return f.read()


def convert(src_file, dst_file):
    """Re-encode src_file as UTF-8 into dst_file; skip the file if it cannot be read."""
    try:
        encoding = get_encoding(src_file)
        if encoding is None:
            raise ValueError("encoding could not be detected")
        content = read(src_file, encoding)
    except (OSError, LookupError, ValueError) as e:
        # UnicodeDecodeError is a ValueError; LookupError covers unknown codec names
        print("something wrong: %s (%s)" % (src_file, e))
        return
    write(dst_file, content)


def convert_dir(src_dir, dst_dir, sub_dir, file_prefixes=None):
    """Convert the .h/.cpp files under src_dir/sub_dir, mirroring the tree into dst_dir."""
    src_prj_dir = os.path.join(src_dir, sub_dir)
    for root, dirs, files in os.walk(src_prj_dir):
        # Map the current source directory to the same relative path under dst_dir
        dst_root = os.path.join(dst_dir, os.path.relpath(root, src_dir))
        for f in files:
            if not f.endswith(('.h', '.cpp')):
                continue
            # When prefixes are given, convert only the files that start with one of them
            if file_prefixes and not f.startswith(tuple(file_prefixes)):
                continue
            os.makedirs(dst_root, exist_ok=True)
            convert(os.path.join(root, f), os.path.join(dst_root, f))


if __name__ == '__main__':
    src_dir = 'D:/ProjectRootDir/AModule/source'  # root of the source tree
    dst_dir = 'F:/Python/Src/result'  # destination for the UTF-8 copies
    prefixes = ['prop_', 'resource_', 'widgets_']  # only files with these prefixes are converted
    convert_dir(src_dir, dst_dir, 'include', prefixes)  # convert the matching files in the include directory
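
The selection rule used by convert_dir can also be read as one predicate: a file qualifies if it has a wanted extension and, when prefixes are given, starts with one of them. A small sketch, with should_convert as a hypothetical name that does not appear in the script:

```python
def should_convert(file_name, prefixes=None, suffixes=(".h", ".cpp")):
    """Return True if file_name has a wanted suffix and, if prefixes are given, a wanted prefix."""
    if not file_name.endswith(suffixes):
        return False
    if not prefixes:
        return True  # no prefix filter: every .h/.cpp file qualifies
    return file_name.startswith(tuple(prefixes))


prefixes = ["prop_", "resource_", "widgets_"]
for name in ["prop_color.h", "main.cpp", "resource_map.cpp", "prop_notes.txt"]:
    print(name, should_convert(name, prefixes))
```

The file names in the loop are made up for illustration. With the prefix list from this section, only prop_color.h and resource_map.cpp pass the filter.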

4.3 Multiple folders: convert every .h and .cpp file in them

import os

import chardet  # third-party package: pip install chardet


def get_encoding(file_name):
    """Guess the encoding of a file from its raw bytes (may return None)."""
    with open(file_name, 'rb') as f:
        return chardet.detect(f.read())['encoding']


def write(file_name, content):
    # newline='' leaves the original line endings untouched
    with open(file_name, 'w', encoding='utf-8', newline='') as f:
        f.write(content)


def read(file_name, encoding):
    with open(file_name, 'r', encoding=encoding, newline='') as f:
        return f.read()


def convert(src_file, dst_file):
    """Re-encode src_file as UTF-8 into dst_file; skip the file if it cannot be read."""
    try:
        encoding = get_encoding(src_file)
        if encoding is None:
            raise ValueError("encoding could not be detected")
        content = read(src_file, encoding)
    except (OSError, LookupError, ValueError) as e:
        # UnicodeDecodeError is a ValueError; LookupError covers unknown codec names
        print("something wrong: %s (%s)" % (src_file, e))
        return
    write(dst_file, content)


def convert_dir(src_dir, dst_dir, sub_dir, file_prefixes=None):
    """Convert the .h/.cpp files under src_dir/sub_dir, mirroring the tree into dst_dir."""
    src_prj_dir = os.path.join(src_dir, sub_dir)
    for root, dirs, files in os.walk(src_prj_dir):
        # Map the current source directory to the same relative path under dst_dir
        dst_root = os.path.join(dst_dir, os.path.relpath(root, src_dir))
        for f in files:
            if not f.endswith(('.h', '.cpp')):
                continue
            # When prefixes are given, convert only the files that start with one of them
            if file_prefixes and not f.startswith(tuple(file_prefixes)):
                continue
            os.makedirs(dst_root, exist_ok=True)
            convert(os.path.join(root, f), os.path.join(dst_root, f))


if __name__ == '__main__':
    src_dir = 'D:/ProjectRootDir/AModule/source'  # root of the source tree
    dst_dir = 'F:/Python/Src/result'  # destination for the UTF-8 copies
    prjs = ['config', 'gui', 'rs/include', 'rs/ResourceSystem']
    for p in prjs:
        convert_dir(src_dir, dst_dir, p)  # convert the files in config, gui and the other subfolders

4.4 Multiple folders: convert selected .h and .cpp files

import os

import chardet  # third-party package: pip install chardet


def get_encoding(file_name):
    """Guess the encoding of a file from its raw bytes (may return None)."""
    with open(file_name, 'rb') as f:
        return chardet.detect(f.read())['encoding']


def write(file_name, content):
    # newline='' leaves the original line endings untouched
    with open(file_name, 'w', encoding='utf-8', newline='') as f:
        f.write(content)


def read(file_name, encoding):
    with open(file_name, 'r', encoding=encoding, newline='') as f:
        return f.read()


def convert(src_file, dst_file):
    """Re-encode src_file as UTF-8 into dst_file; skip the file if it cannot be read."""
    try:
        encoding = get_encoding(src_file)
        if encoding is None:
            raise ValueError("encoding could not be detected")
        content = read(src_file, encoding)
    except (OSError, LookupError, ValueError) as e:
        # UnicodeDecodeError is a ValueError; LookupError covers unknown codec names
        print("something wrong: %s (%s)" % (src_file, e))
        return
    write(dst_file, content)


def convert_dir(src_dir, dst_dir, sub_dir, file_prefixes=None):
    """Convert the .h/.cpp files under src_dir/sub_dir, mirroring the tree into dst_dir."""
    src_prj_dir = os.path.join(src_dir, sub_dir)
    for root, dirs, files in os.walk(src_prj_dir):
        # Map the current source directory to the same relative path under dst_dir
        dst_root = os.path.join(dst_dir, os.path.relpath(root, src_dir))
        for f in files:
            if not f.endswith(('.h', '.cpp')):
                continue
            # When prefixes are given, convert only the files that start with one of them
            if file_prefixes and not f.startswith(tuple(file_prefixes)):
                continue
            os.makedirs(dst_root, exist_ok=True)
            convert(os.path.join(root, f), os.path.join(dst_root, f))


if __name__ == '__main__':
    src_dir = 'D:/ProjectRootDir/AModule/source'  # root of the source tree
    dst_dir = 'F:/Python/Src/result'  # destination for the UTF-8 copies
    # parse A include
    prefixes = ['gui_', 'prop_', 'resource_', 'widgets_']
    convert_dir(src_dir, dst_dir, 'A', prefixes)  # convert the files with these prefixes in directory A

    # parse project group or project
    prjs = ['config', 'gui', 'rs/include', 'rs/ResourceSystem']
    for p in prjs:
        convert_dir(src_dir, dst_dir, p)  # convert the files in config, gui and the other subfolders

    # parse some include file
    convert_dir(src_dir, dst_dir, 'C/include', ['pm_property', 'prop_'])  # convert the files with these prefixes in the given directory
    convert_dir(src_dir, dst_dir, 'D/include', ['ply_property', 'db_property'])
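
The scripts in this section rely on chardet, whose result is a statistical guess and can be wrong, especially for short files. If the legacy files are known to use only a handful of encodings, a different technique is to try those encodings in a fixed order. The sketch below assumes the candidates are UTF-8 (with or without a BOM) and GBK, which is an assumption of this example and not something stated about the project; decode_with_fallback is a hypothetical helper.

```python
def decode_with_fallback(data, candidates=("utf-8-sig", "gbk")):
    """Decode bytes with the first candidate encoding that accepts them.

    Returns a (text, encoding_name) pair and raises ValueError if none fits.
    """
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings matched")


text, used = decode_with_fallback("// 编码".encode("gbk"))
print(used, text == "// 编码")
```

UTF-8 is tried first because it is strict: bytes written in another encoding rarely happen to be valid UTF-8, whereas GBK accepts many byte sequences that were never meant as GBK.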