Python Chinese text search: finding all Chinese text in a string using Python and regex

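A short but relatively comprehensive answer for narrow Unicode builds of Python (it excludes code points above 65535, which a narrow build can only represent through surrogate pairs):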

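RE = re.compile(u'[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303A\u303B\u3400-\u4DB5\u4E00-\u9FC3\uF900-\uFA2D\uFA30-\uFA6A\uFA70-\uFAD9]', re.UNICODE)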

nochinese = RE.sub('', mystring)
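The snippet above removes Chinese characters, while the title asks how to find them. A minimal sketch of the extraction direction, assuming Python 3 (where `str` is always Unicode) and the same BMP ranges; the names `HAN_BMP`, `find_chinese` and `strip_chinese` are illustrative and not part of the original answer:

```python
import re

# Same BMP ranges as the pattern above, written as \u escapes so the
# source file encoding does not matter. The trailing '+' groups
# consecutive Han characters into one match.
HAN_BMP = re.compile(
    '[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007'
    '\u3021-\u3029\u3038-\u303A\u303B\u3400-\u4DB5\u4E00-\u9FC3'
    '\uF900-\uFA2D\uFA30-\uFA6A\uFA70-\uFAD9]+'
)

def find_chinese(text):
    """Return every maximal run of Han characters found in text."""
    return HAN_BMP.findall(text)

def strip_chinese(text):
    """Return text with all matched Han characters removed."""
    return HAN_BMP.sub('', text)

print(find_chinese('abc\u7f8e\u56fddef\u4e2d\u6587'))
print(strip_chinese('a\u7f8e\u56fdb'))
```

`findall` returns an empty list when the input contains no characters from these ranges, so the result can be used directly as a truth test.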

The code that generates the RE is below. On a wide build it also covers the Chinese characters in the supplementary plane:

# -*- coding: utf-8 -*-
# Python 2 code (uses unichr and the print statement).

import re

# Each entry is either a [first, last] range or a single code point.
LHan = [[0x2E80, 0x2E99],    # Han # So  [26] CJK RADICAL REPEAT, CJK RADICAL RAP
        [0x2E9B, 0x2EF3],    # Han # So  [89] CJK RADICAL CHOKE, CJK RADICAL C-SIMPLIFIED TURTLE
        [0x2F00, 0x2FD5],    # Han # So [214] KANGXI RADICAL ONE, KANGXI RADICAL FLUTE
        0x3005,              # Han # Lm       IDEOGRAPHIC ITERATION MARK
        0x3007,              # Han # Nl       IDEOGRAPHIC NUMBER ZERO
        [0x3021, 0x3029],    # Han # Nl   [9] HANGZHOU NUMERAL ONE, HANGZHOU NUMERAL NINE
        [0x3038, 0x303A],    # Han # Nl   [3] HANGZHOU NUMERAL TEN, HANGZHOU NUMERAL THIRTY
        0x303B,              # Han # Lm       VERTICAL IDEOGRAPHIC ITERATION MARK
        [0x3400, 0x4DB5],    # Han # Lo [6582] CJK UNIFIED IDEOGRAPH-3400, CJK UNIFIED IDEOGRAPH-4DB5
        [0x4E00, 0x9FC3],    # Han # Lo [20932] CJK UNIFIED IDEOGRAPH-4E00, CJK UNIFIED IDEOGRAPH-9FC3
        [0xF900, 0xFA2D],    # Han # Lo [302] CJK COMPATIBILITY IDEOGRAPH-F900, CJK COMPATIBILITY IDEOGRAPH-FA2D
        [0xFA30, 0xFA6A],    # Han # Lo  [59] CJK COMPATIBILITY IDEOGRAPH-FA30, CJK COMPATIBILITY IDEOGRAPH-FA6A
        [0xFA70, 0xFAD9],    # Han # Lo [106] CJK COMPATIBILITY IDEOGRAPH-FA70, CJK COMPATIBILITY IDEOGRAPH-FAD9
        [0x20000, 0x2A6D6],  # Han # Lo [42711] CJK UNIFIED IDEOGRAPH-20000, CJK UNIFIED IDEOGRAPH-2A6D6
        [0x2F800, 0x2FA1D]]  # Han # Lo [542] CJK COMPATIBILITY IDEOGRAPH-2F800, CJK COMPATIBILITY IDEOGRAPH-2FA1D

def build_re():
    L = []
    for i in LHan:
        if isinstance(i, list):
            f, t = i
            try:
                f = unichr(f)
                t = unichr(t)
                L.append('%s-%s' % (f, t))
            except ValueError:
                pass  # A narrow python build, so can't use chars > 65535 without surrogate pairs!
        else:
            try:
                L.append(unichr(i))
            except ValueError:
                pass  # Same narrow-build limitation for a single code point.

    RE = '[%s]' % ''.join(L)
    print 'RE:', RE.encode('utf-8')
    return re.compile(RE, re.UNICODE)

RE = build_re()
print RE.sub('', u'美国').encode('utf-8')
print RE.sub('', u'blah').encode('utf-8')
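
The builder above targets Python 2, where `unichr` fails for code points above 65535 on a narrow build. Python 3.3 and later no longer distinguish narrow from wide builds, so `chr` accepts the supplementary-plane ranges directly. Here is a rough Python 3 sketch of the same idea. `HAN_RANGES`, `build_han_re` and `HAN_RE` are names assumed for this example, and the ranges are copied from the table above without being updated to newer Unicode versions:

```python
import re

# (first, last) code point pairs copied from the LHan table above;
# single code points are written as (cp, cp).
HAN_RANGES = [
    (0x2E80, 0x2E99), (0x2E9B, 0x2EF3), (0x2F00, 0x2FD5),
    (0x3005, 0x3005), (0x3007, 0x3007),
    (0x3021, 0x3029), (0x3038, 0x303A), (0x303B, 0x303B),
    (0x3400, 0x4DB5), (0x4E00, 0x9FC3),
    (0xF900, 0xFA2D), (0xFA30, 0xFA6A), (0xFA70, 0xFAD9),
    (0x20000, 0x2A6D6), (0x2F800, 0x2FA1D),
]

def build_han_re(ranges=HAN_RANGES):
    """Compile a character class covering every range in `ranges`."""
    parts = []
    for first, last in ranges:
        if first == last:
            parts.append(chr(first))
        else:
            # None of these characters is special inside [...], so no
            # escaping is needed for this particular table.
            parts.append('%s-%s' % (chr(first), chr(last)))
    return re.compile('[%s]' % ''.join(parts))

HAN_RE = build_han_re()

print(repr(HAN_RE.sub('', '\u7f8e\u56fd')))   # both characters are in the table
print(repr(HAN_RE.sub('', 'blah')))           # no Han characters to remove
print(repr(HAN_RE.sub('', 'a\U00020000b')))   # supplementary-plane character
```

Keeping the ranges as data instead of a hand-written pattern makes it easy to regenerate the expression if the table is extended.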
