Fastest way to strip punctuation from a unicode string in Python

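I am trying to efficiently strip punctuation from a unicode string. For a regular string, using mystring.translate(None, string.punctuation) is clearly the fastest approach. However, in Python 2.7 this code breaks on a unicode string. As explained in the comments on this answer, the translate method can still be used, but it has to be given a dictionary. When I use this implementation, though, I find that the performance of translate drops dramatically. Here is my timing code (mostly copied from this answer):

# -*- coding: utf-8 -*-
import re, string, timeit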

import unicodedata

import sys

# String from this article: www.wired.com/design/2013/12/find-the-best-of-reddit-with-this-interactive-map/

s = "For me, Reddit brings to mind Obi Wan’s enduring description of the Mos Eisley cantina: a wretched hive of scum and villainy. But, you know, one you still kinda want to hang out in occasionally. The thing is, though, Reddit isn’t some obscure dive bar in a remote corner of the universe—it’s a huge watering hole at the very center of it. The site had some 400 million unique visitors in 2012. They can’t all be Greedos. So maybe my problem is just that I’ve never been able to find the places where the decent people hang out."

su = u"For me, Reddit brings to mind Obi Wan’s enduring description of the Mos Eisley cantina: a wretched hive of scum and villainy. But, you know, one you still kinda want to hang out in occasionally. The thing is, though, Reddit isn’t some obscure dive bar in a remote corner of the universe—it’s a huge watering hole at the very center of it. The site had some 400 million unique visitors in 2012. They can’t all be Greedos. So maybe my problem is just that I’ve never been able to find the places where the decent people hang out."

exclude = set(string.punctuation)

regex = re.compile('[%s]' % re.escape(string.punctuation))

def test_set(s):
    # Keep only the characters that are not in the ASCII punctuation set.
    return ''.join(ch for ch in s if ch not in exclude)

def test_re(s):  # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_trans(s):
    # Two-argument form; works only on byte strings (str) in Python 2.
    return s.translate(None, string.punctuation)

# Deletion table for unicode.translate: every punctuation code point maps to None.
tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
                    if unicodedata.category(unichr(i)).startswith('P'))

def test_trans_unicode(su):
    return su.translate(tbl)

def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s = s.replace(c, "")
    return s

print "sets :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000)

print "regex :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000)

print "translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000)

print "replace :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000)

print "sets (unicode) :",timeit.Timer('f(su)', 'from __main__ import su,test_set as f').timeit(1000000)

print "regex (unicode) :",timeit.Timer('f(su)', 'from __main__ import su,test_re as f').timeit(1000000)

print "translate (unicode) :",timeit.Timer('f(su)', 'from __main__ import su,test_trans_unicode as f').timeit(1000000)

print "replace (unicode) :",timeit.Timer('f(su)', 'from __main__ import su,test_repl as f').timeit(1000000)

As my results show, the unicode implementation of translate performs horribly:

[timing output missing from the source]
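For readers unfamiliar with the dictionary form of translate used in the timing code, here is a minimal sketch of how it works. It assumes Python 3 spelling (range and chr, where the Python 2.7 code in the question uses xrange and unichr). The names punct_table, strip_punct and sample are made up for illustration.

```python
import sys
import unicodedata

# Assumption: Python 3. In Python 2.7, use xrange/unichr as in the question.
# Map every code point whose Unicode category starts with 'P' (punctuation)
# to None; str.translate deletes characters that map to None.
punct_table = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P')
)

def strip_punct(text):
    """Return text with all Unicode punctuation characters removed."""
    return text.translate(punct_table)

# Hypothetical sample containing ASCII and non-ASCII punctuation.
sample = u"Obi Wan\u2019s cantina: scum, villainy\u2014and more."
print(strip_punct(sample))
```

Note that this table is not equivalent to string.punctuation. Characters such as $, +, < and ~ belong to the Unicode symbol categories (S*) rather than punctuation (P*), so the sketch leaves them in place.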

My question is whether there is a faster way to implement translate for unicode (or any other method) that outperforms regex.
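One thing worth checking before comparing numbers is that the candidates do the same work. In the timing code, the regex and the set are built from the ASCII-only string.punctuation, while the translate table covers every Unicode punctuation code point. The sketch below is one possible like-for-like setup, not a definitive benchmark. It assumes Python 3, and the names punct_chars, via_translate, via_regex and via_set are hypothetical. Timings will vary with interpreter version and input, so none are quoted here.

```python
import re
import sys
import timeit
import unicodedata

# Assumption: Python 3. Collect every Unicode punctuation character once,
# then build all three strippers from that same set.
punct_chars = [chr(i) for i in range(sys.maxunicode + 1)
               if unicodedata.category(chr(i)).startswith('P')]

table = dict.fromkeys(map(ord, punct_chars))                     # for str.translate
pattern = re.compile('[%s]' % re.escape(''.join(punct_chars)))   # one big character class
exclude = frozenset(punct_chars)                                 # for the set-based filter

def via_translate(text):
    return text.translate(table)

def via_regex(text):
    return pattern.sub('', text)

def via_set(text):
    return ''.join(ch for ch in text if ch not in exclude)

# Hypothetical test input with both ASCII and non-ASCII punctuation.
text = u"Reddit isn\u2019t some obscure dive bar\u2014it\u2019s a huge watering hole." * 10

for name, func in [('translate', via_translate),
                   ('regex', via_regex),
                   ('set', via_set)]:
    seconds = timeit.timeit(lambda: func(text), number=1000)
    print('%-10s %.4f s' % (name, seconds))
```

All three functions return the same string for the same input, so any difference in the printed timings reflects the method rather than a difference in which characters are removed.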
