Regular Expressions for Text Cleaning (continuously updated)

小基基o_O

Category: Data processing

Article link: https://blog.csdn.net/Yellow_python/article/details/99084214

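Contents

Handy patterns that are hard to remember
- Positive lookahead
- \w
Cleaning special characters
Removing consecutive whitespace
Cleaning HTML tags
Normalizing punctuation
Splitting text
Is it a word?
Times & dates
URLs, emails, phone numbers…
Numerals and measure words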

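Handy patterns that are hard to remember

pattern	description
[\u4e00-\u9fa5]	Chinese characters (the basic CJK Unified Ideographs range)
\s	any whitespace character
\S	any non-whitespace character
(?=pattern)	positive lookahead
(?<=pattern)	positive lookbehind
(?!pattern)	negative lookahead
(?<!pattern)	negative lookbehind
\w	Chinese characters, Latin letters, digits, and underscore (in Python 3, any Unicode word character)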
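
As a quick check of the first three rows of the table, here is a minimal sketch. The sample string and the variable names are made up for illustration only.

```python
import re

# Hypothetical sample: ASCII letters, Chinese, a tab, digits, a newline
sample = 'abc 中文\t123\n混合text'

# [\u4e00-\u9fa5]+ picks out runs of characters in the basic CJK range
cjk_runs = re.findall(r'[\u4e00-\u9fa5]+', sample)
print(cjk_runs)  # ['中文', '混合']

# \s matches each whitespace character (space, tab, newline, ...)
blanks = re.findall(r'\s', sample)
print(blanks)  # [' ', '\t', '\n']

# \S+ matches runs of non-whitespace characters
tokens = re.findall(r'\S+', sample)
print(tokens)  # ['abc', '中文', '123', '混合text']
```

Note that the range [\u4e00-\u9fa5] covers only the basic block, so rarer ideographs outside it are not matched.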

Positive lookahead

import re
rc = re.compile('小米(?=手机)')
print(rc.fullmatch('小米手机'))  # None
print(rc.fullmatch('小米'))  # None
print(rc.findall('小米手机'))  # ['小米']
print(rc.findall('小米粥'))  # []
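
The other three assertions from the pattern table work the same way: they test the surrounding text without consuming it. The sketch below is an illustration on a made-up sample string, and it records match positions so that it is clear which occurrence was hit.

```python
import re

# Hypothetical sample; character offsets: 小0 米1 手2 机3 ' '4 红5 米6 手7 机8 ' '9 小10 米11 粥12
text = '小米手机 红米手机 小米粥'

# Positive lookbehind: "手机" only when preceded by "小米"
lookbehind_pos = [m.start() for m in re.finditer(r'(?<=小米)手机', text)]
print(lookbehind_pos)  # [2]

# Negative lookahead: "小米" only when NOT followed by "手机"
neg_lookahead_pos = [m.start() for m in re.finditer(r'小米(?!手机)', text)]
print(neg_lookahead_pos)  # [10]

# Negative lookbehind: "手机" only when NOT preceded by "小米"
neg_lookbehind_pos = [m.start() for m in re.finditer(r'(?<!小米)手机', text)]
print(neg_lookbehind_pos)  # [7]
```

Keep in mind that Python's re module requires lookbehind patterns to have a fixed width, which the two-character "小米" satisfies.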

\w

import re
rec = re.compile(r'\w')  # raw string avoids the invalid-escape warning for '\w'
a = 'aA1啊の.\n,。_+-='
print(rec.findall(a))  # Japanese kana also count as \w
# ['a', 'A', '1', '啊', 'の', '_']
print(list(rec.sub('', a)))
# ['.', '\n', ',', '。', '+', '-', '=']
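
If only ASCII letters, digits, and the underscore should count as word characters, the standard re.ASCII flag narrows \w accordingly. A small sketch on the same sample string (the variable names are illustrative):

```python
import re

a = 'aA1啊の.\n,。_+-='

unicode_word = re.compile(r'\w')          # default for str patterns: Unicode-aware
ascii_word = re.compile(r'\w', re.ASCII)  # \w is limited to [a-zA-Z0-9_]

unicode_hits = unicode_word.findall(a)
ascii_hits = ascii_word.findall(a)

print(unicode_hits)  # ['a', 'A', '1', '啊', 'の', '_']
print(ascii_hits)    # ['a', 'A', '1', '_']
```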

Cleaning special characters

import re

a = '''𠙶山aA1１,./<>?";':[]{}\\|`~!@#$%^&*()_+-=《》？，。/：；’‘”“【】、·！@=#=￥%=…（）—❤'''

rc = re.compile(r'[^-_a-zA-Z\d\u4e00-\u9fa5\s,<.>/?;:"\[{\]}|`~!@#$%^&*()=+，《。》？；：‘’“”【】、·！￥…（）—]')  # this class omits ' and \, so both are reported below
print(rc.findall(a))
# ['𠙶', "'", '\\', '❤']

rc = re.compile(r'[^-\w\s,<.>/?;:\'"\[{\]}\\|`~!@#$%^&*()=+，《。》？；：‘’“”【】、·！￥…（）—]')
print(rc.findall(a))
# ['❤']
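
The snippets above only list the characters that fall outside the whitelist. To actually clean text, the same kind of negated character class can be passed to sub. The following is a minimal sketch: the whitelist is deliberately short, and the names NOT_ALLOWED and strip_special are hypothetical, not part of the original article.

```python
import re

# Hypothetical, deliberately short whitelist: word characters, whitespace,
# and a few ASCII and Chinese punctuation marks. Extend it to suit your data.
NOT_ALLOWED = re.compile(r'[^\w\s,.!?，。！？]')

def strip_special(text, replacement=''):
    """Replace every character outside the whitelist with `replacement`."""
    return NOT_ALLOWED.sub(replacement, text)

cleaned = strip_special('你好，world❤！★ ok?')
print(cleaned)  # 你好，world！ ok?
```

Passing a visible replacement such as '?' instead of the empty string can help when auditing what was removed.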

Removing consecutive whitespace

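import re

def replace_continuous_blank_lines(text):
    """Collapse runs of blank lines into a single line break."""
    return re.sub(r'\n\s*\n', '\n', text.strip())

def replace_space(text):
    """Collapse consecutive whitespace while keeping line breaks."""
    # Trim whitespace around each line break (this also drops blank lines)
    text = re.sub(r'\s*\n\s*', '\n', text.strip())
    # Collapse each run of non-newline whitespace into one space
    text = re.sub(r'[^\S\n]+', ' ', text)
    # Drop a space that sits between a Chinese and a non-Chinese character
    text = re.sub(r'(?<![\u4e00-\u9fa5]) (?=[\u4e00-\u9fa5])|(?<=[\u4e00-\u9fa5]) (?![\u4e00-\u9fa5])', '', text)
    return text

def replace_space_resolutely(text, substitution=''):
    """Replace every run of whitespace, line breaks included, with `substitution`."""
    return re.sub(r'\s+', substitution, text.strip())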
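
The double-negated class [^\S\n] used in replace_space is worth a closer look: it reads "neither non-whitespace nor a newline", that is, any whitespace character except \n. The sketch below shows it in isolation on a made-up string, here with a + quantifier so that each run becomes a single space.

```python
import re

# Hypothetical sample: tab, ideographic space (U+3000), space, newline, space
text = 'a\tb\u3000c \n d'

# Every whitespace character except the newline
horizontal = re.findall(r'[^\S\n]', text)
print(horizontal)  # ['\t', '\u3000', ' ', ' ']

# Collapse each run of such characters into one ASCII space; newlines survive
normalized = re.sub(r'[^\S\n]+', ' ', text)
print(repr(normalized))  # 'a b c \n d'
```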

Cleaning HTML tags

import re

def replace_tag(html, completely=True):
    """Replace HTML tags."""
    # Standalone elements
    html = re.sub('<img[^>]*>', '', html)  # images
    html = re.sub('<br/?>|<br [^<>]*>|<hr/?>|<hr [^<>]*>', '\n'