Matching Chinese Text Inside Tags with Python Regular Expressions

好逸爱劳

Categories: Re, Python. Tags: regular expressions

Article link: https://blog.csdn.net/weixin_44685869/article/details/109179899

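Suppose we have the following content:

import re

text = '<div class="comment-content comment-content_new">测试</div> <div class="comment-content comment-content_new">学习正则</div>'

The goal is to extract all of the Chinese text with a regular expression.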

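Method 1

# Capture a run of characters above U+00FF that sits directly before a closing </div>.
p = re.compile(r'([^\x00-\xff]+)</div>')

for m in p.finditer(text):
    print(m.group(1))

# Output:

测试
学习正则

This approach is simple: it matches any character whose code point is greater than 255, which covers Chinese punctuation as well as Chinese characters. The escapes must be written as \x00 and \xff. Without the backslashes, the class excludes the literal characters x and f and the range from 0 to x instead.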
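To make the "greater than 255" behavior concrete, here is a minimal sketch that writes the class with explicit hex escapes and runs it on a short string. The sample string and the helper name `non_latin1_runs` are assumptions for illustration and do not come from the original post.

```python
import re

# Any character outside U+0000..U+00FF; note the backslashes in \x00 and \xff.
NON_LATIN1 = re.compile(r'[^\x00-\xff]+')

def non_latin1_runs(s):
    """Return every maximal run of characters with a code point above 255."""
    return NON_LATIN1.findall(s)

# Hypothetical sample: Chinese with a full-width comma, ASCII text, and Japanese kana.
sample = '<div>测试，ok</div><div>かな</div>'
print(non_latin1_runs(sample))  # ['测试，', 'かな']
```

Because the class is defined by exclusion, it also picks up Japanese kana, full-width punctuation, and anything else above U+00FF, so it is broader than "Chinese only".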

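Method 2

res = re.findall(r"[\u4e00-\u9fa5]+", text)
print(res)

# Output:

['测试', '学习正则']

\u4e00-\u9fa5 is the range of commonly used Chinese characters within the Unicode CJK Unified Ideographs block, so it is a good fit for matching Chinese text. It does not include punctuation.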
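As a quick check of what this range does and does not cover, the sketch below applies it to a string that mixes Chinese characters with full-width punctuation. The sample string and the name `HAN` are assumptions for illustration.

```python
import re

# Ideographs from U+4E00 to U+9FA5 only; punctuation lies outside this range.
HAN = re.compile('[\u4e00-\u9fa5]+')

# Hypothetical sample containing a full-width comma and an ideographic full stop.
sample = '<p>你好，世界。</p>'
runs = HAN.findall(sample)
print(runs)  # ['你好', '世界']
```

The comma and full stop are not matched, so they split the text into two separate runs.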

The pattern can also be extended so that it matches Chinese punctuation as well.

text = '<div class="comment-content comment-content_new">测试，。、【】、</div> <div class="comment-content comment-content_new">学习正则</div>'

res = re.findall(r"[\u2000-\u206f\u3000-\u303f\u4e00-\u9fef\uff00-\uffef]+", text)

print(res)

# Output:

['测试，。、【】、', '学习正则']
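One way to keep a multi-range character class readable is to build it from named ranges. The following is only a sketch of that idea: the names `RANGES`, `build_class`, and `extract_cjk` are hypothetical, the `class="a"` attribute is a shortened stand-in, and the code points are the four ranges used in the pattern above.

```python
import re

# (start, end) code points for the four ranges used in the pattern above.
RANGES = {
    'general punctuation': (0x2000, 0x206F),
    'cjk symbols and punctuation': (0x3000, 0x303F),
    'cjk unified ideographs': (0x4E00, 0x9FEF),
    'halfwidth and fullwidth forms': (0xFF00, 0xFFEF),
}

def build_class(ranges):
    """Build a regex character class from (start, end) code point pairs."""
    parts = ''.join(f'\\u{lo:04x}-\\u{hi:04x}' for lo, hi in ranges.values())
    return f'[{parts}]+'

CJK_TEXT = re.compile(build_class(RANGES))

def extract_cjk(s):
    """Return runs of Chinese characters together with adjacent CJK punctuation."""
    return CJK_TEXT.findall(s)

text = '<div class="a">测试，。、【】、</div> <div class="a">学习正则</div>'
print(extract_cjk(text))  # ['测试，。、【】、', '学习正则']
```

Naming each range makes it easier to drop one, for example general punctuation, if it turns out to match more than intended.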

# http://www.unicode.org/charts/PDF/U2000.pdf General Punctuation
# http://www.unicode.org/charts/PDF/U3000.pdf CJK Symbols and Punctuation
# http://www.unicode.org/charts/PDF/U4E00.pdf CJK Unified Ideographs
# http://www.unicode.org/charts/PDF/UFF00.pdf Halfwidth and Fullwidth Forms

The combined character class:

[\u2000-\u206f\u3000-\u303f\u4e00-\u9fef\uff00-\uffef]
