Regex replacement for Chinese text with capture-group references in Python

LeoHsiao1

Anyone may freely copy, modify, and use this blog and its attachments, at their own risk.

Article link: https://blog.csdn.net/qq_35952638/article/details/104038447

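I needed to batch-edit the Chinese characters in a number of files, so I looked into regex replacement tools. After trying several, I found the following: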

VS Code can do regex replacement, reference capture groups, and handle Chinese characters.

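However, when it batch-processes multiple files it relies on a different regex engine, which could not handle Chinese characters in my tests.
Notepad++ is similar, but processing multiple files also requires opening them by hand, which is tedious.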

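The sed command can do regex replacement, reference capture groups, and batch-process multiple files, but it could not match Chinese characters by code-point range, because sed does not interpret \uXXXX escapes in a pattern: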

[root@Centos ~]# echo Hello World | sed 's/Hello/hi/g'
hi World
[root@Centos ~]# echo Hello World | sed 's/Hello \(\w*\)/\1/g'
World
[root@Centos ~]# echo 你好World | sed 's/[\u4e00-\u9fa5]/ /g' 
sed: -e expression #1, char 21: Invalid range end

The perl command is similar: Chinese characters came out garbled when I tried it.

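Python's re.sub() function can do regex replacement and handle Chinese characters, but it does not recognize the $1 style of group reference (its own syntax is \1 or \g<1>):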

>>> import re
>>> re.sub('(Hello).', 'hi', 'Hello World') 
'hiWorld'
>>> re.sub('(Hello).', '$1', 'Hello World') 
'$1World'
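
For comparison, here is a minimal sketch of the group-reference syntax that re.sub() does accept, together with a \uXXXX range matching Chinese characters. The variable names are only for illustration.

```python
import re

# Python's own group references in the replacement string: \1 or \g<1>
with_backslash = re.sub(r'(Hello).', r'\1,', 'Hello World')
with_g_syntax = re.sub(r'(Hello).', r'\g<1>,', 'Hello World')
print(with_backslash)   # Hello,World
print(with_g_syntax)    # Hello,World

# The re module understands \uXXXX escapes, so a code-point range
# can match Chinese characters directly
blanked = re.sub(r'[\u4e00-\u9fa5]', ' ', '你好World')
print(repr(blanked))    # '  World'
```

The \g<1> form is the safer of the two when the reference is followed by a digit, since \10 would be read as group 10.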

In the end I decided to write my own replace function on top of Python's re module, one that accepts $1-style group references:

import re


def replace(string: str, src: str, dst: str) -> str:
    r"""
    Replace `src` with `dst` in `string`, based on regular expressions.
    `dst` may refer to the capture groups of `src` as $1 ... $9 ($0 is the whole match).

    Sample:
    >>> replace('Hello World', 'Hello', 'hi')
    'hi World'
    >>> replace('Hello World', '(Hello).', 'hi')
    'hiWorld'
    >>> replace('Hello World', '(Hello).', '$1,')
    'Hello,World'
    >>> replace('Hello World', 'Hello', '$1')
    Traceback (most recent call last):
        ...
    ValueError: group id out of range : $1
    >>> replace('你好World', r'([\u4e00-\u9fa5])(\w)', '$1 $2')
    '你好 World'
    """
    # re.A keeps \w, \d, etc. ASCII-only, so \w does not match Chinese characters
    pattern = re.compile(src, re.A)

    # Check the element group: pattern.groups is the real number of capture
    # groups, so escaped parentheses and (?:...) are not miscounted
    dst_group_ids = [int(i) for i in re.findall(r'\$(\d)', dst, re.A)]
    if dst_group_ids and max(dst_group_ids) > pattern.groups:
        raise ValueError('group id out of range : ${}'.format(max(dst_group_ids)))

    # replace
    if not dst_group_ids:
        return pattern.sub(dst, string)

    def expand(match):
        # Build the replacement for this one match: every $N in dst becomes the
        # text captured by group N (an unmatched optional group becomes '')
        return re.sub(r'\$(\d)', lambda m: match.group(int(m.group(1))) or '', dst, flags=re.A)

    # Substituting match by match avoids touching identical text elsewhere
    return pattern.sub(expand, string)
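
An alternative way to get the same effect, sketched here under the assumption that only $0 to $9 need to be supported, is to translate the $N references into Python's \g<N> form and let re.sub() do the rest. The helper names to_template and dollar_sub are hypothetical and not part of the original post.

```python
import re


def to_template(dst):
    """Convert a '$N'-style replacement into an re.sub() template (hypothetical helper)."""
    # Double existing backslashes first so they stay literal in the template
    escaped = dst.replace('\\', '\\\\')
    # Then rewrite each $N as \g<N>
    return re.sub(r'\$(\d)', r'\\g<\1>', escaped, flags=re.A)


def dollar_sub(src, dst, string, flags=re.A):
    """Regex-replace src in string, with dst using $N group references (hypothetical helper)."""
    return re.sub(src, to_template(dst), string, flags=flags)


print(dollar_sub('(Hello).', '$1,', 'Hello World'))                         # Hello,World
# With re.A, \w is ASCII-only, so only the pair 好W matches
print(dollar_sub(r'([\u4e00-\u9fa5])(\w)', '$1 $2', '你好World'))           # 你好 World
# Without re.A, \w also matches 好, so the pair 你好 matches first
print(dollar_sub(r'([\u4e00-\u9fa5])(\w)', '$1 $2', '你好World', flags=0))  # 你 好World
```

The last two calls show why the ASCII flag matters for this use case: it is what keeps the space from being inserted between two Chinese characters.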

Turned into a script (saved as replace.py, with the replace() function above in the same file):

import argparse

parser = argparse.ArgumentParser(description=r"""This script is used to replace strings in a file. Sample: python replace.py --file 1.py --src '([\u4e00-\u9fa5])(\w)' --dst '$1 $2' """)
parser.add_argument('--file', help='a valid file path', type=str, required=True)
parser.add_argument('--src', help='the source string, which is a regular expression.', type=str, required=True)
parser.add_argument('--dst', help='the destination string', type=str, required=True)
parser.add_argument('--encoding', help='the encoding of the original file, which is utf-8 by default.', type=str, default='utf-8')
args = parser.parse_args()


try:
    # read the file
    with open(args.file, 'r', encoding=args.encoding) as f:
        text = f.read()
        print('Handling file: {} ...'.format(args.file), end='\t\t')

    # handling
    result = replace(text, args.src, args.dst)

    # save the result
    with open(args.file, 'w', encoding=args.encoding) as f:
        f.write(result)
        print('done')

except Exception as e:
    print('Error: {}'.format(str(e)))

In use, it behaves like a sed command that can handle Chinese characters:

find . -name "*.md" -print0 | while IFS= read -r -d '' file
do
    python3 replace.py --file "$file" --src '([\u4e00-\u9fa5])(\w)' --dst '$1 $2'
done
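
The same batch run could also be done entirely in Python. The following is a rough sketch, assuming the goal is the specific replacement shown above (a space between a Chinese character and a following ASCII word character). The function name space_out_markdown is hypothetical, and the sketch uses re.sub() with \1 references directly rather than the replace() function.

```python
import re
from pathlib import Path

# Same pattern as in the shell example; re.A keeps \w ASCII-only
CJK_THEN_WORD = re.compile(r'([\u4e00-\u9fa5])(\w)', re.A)


def space_out_markdown(root, encoding='utf-8'):
    """Rewrite every *.md file under root in place; return the files that changed."""
    changed = []
    for path in sorted(Path(root).rglob('*.md')):
        text = path.read_text(encoding=encoding)
        result = CJK_THEN_WORD.sub(r'\1 \2', text)
        if result != text:
            # Only write files whose content actually changed
            path.write_text(result, encoding=encoding)
            changed.append(path)
    return changed
```

Calling space_out_markdown('.') would cover the same files as the find command, including file names that contain spaces.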
