Python unicode punctuation range: the fastest way to strip punctuation from a unicode string in Python

Latest recommended article published on 2023-12-09 10:39:43

weixin_39762478

Tags: python unicode punctuation range

Article link: https://blog.csdn.net/weixin_39762478/article/details/111519348


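I am trying to strip punctuation from a unicode string efficiently. For a regular string, mystring.translate(None, string.punctuation) is clearly the fastest approach. In Python 2.7, however, this code breaks on a unicode string. As a comment on this answer explains, the translate method can still be used, but it has to be given a dictionary. When I use this implementation, though, I find that translate's performance drops significantly. Here is my timing code (mostly copied from this answer):

import re, string, timeit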

import unicodedata
import sys

# String from this article: www.wired.com/design/2013/12/find-the-best-of-reddit-with-this-interactive-map/

s = "For me, Reddit brings to mind Obi Wan’s enduring description of the Mos Eisley cantina: a wretched hive of scum and villainy. But, you know, one you still kinda want to hang out in occasionally. The thing is, though, Reddit isn’t some obscure dive bar in a remote corner of the universe—it’s a huge watering hole at the very center of it. The site had some 400 million unique visitors in 2012. They can’t all be Greedos. So maybe my problem is just that I’ve never been able to find the places where the decent people hang out."

su = u"For me, Reddit brings to mind Obi Wan’s enduring description of the Mos Eisley cantina: a wretched hive of scum and villainy. But, you know, one you still kinda want to hang out in occasionally. The thing is, though, Reddit isn’t some obscure dive bar in a remote corner of the universe—it’s a huge watering hole at the very center of it. The site had some 400 million unique visitors in 2012. They can’t all be Greedos. So maybe my problem is just that I’ve never been able to find the places where the decent people hang out."

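# ASCII punctuation only (string.punctuation), as a set and as a regex character class.
exclude = set(string.punctuation)
regex = re.compile('[%s]' % re.escape(string.punctuation))

def test_set(s):
    return ''.join(ch for ch in s if ch not in exclude)

def test_re(s):  # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_trans(s):
    # Python 2 str.translate(table, deletechars); this form does not work on unicode objects.
    return s.translate(None, string.punctuation)

# Deletion table for unicode.translate: every code point whose Unicode
# category starts with 'P' (punctuation) maps to None.
tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
                    if unicodedata.category(unichr(i)).startswith('P'))

def test_trans_unicode(su):
    return su.translate(tbl)

def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s = s.replace(c, "")
    return s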

print "sets :", timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000)
print "regex :", timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000)
print "translate :", timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000)
print "replace :", timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000)

print "sets (unicode) :", timeit.Timer('f(su)', 'from __main__ import su,test_set as f').timeit(1000000)
print "regex (unicode) :", timeit.Timer('f(su)', 'from __main__ import su,test_re as f').timeit(1000000)
print "translate (unicode) :", timeit.Timer('f(su)', 'from __main__ import su,test_trans_unicode as f').timeit(1000000)
print "replace (unicode) :", timeit.Timer('f(su)', 'from __main__ import su,test_repl as f').timeit(1000000)

As my results show, the unicode implementation of translate performs very poorly:

[timing output not preserved in the source]

My question is whether there is a faster way to implement translate for unicode (or any other method) that outperforms regex.
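One detail in the timing code may matter when reading the numbers: test_re, test_set and test_repl only remove the ASCII characters in string.punctuation, while tbl covers every code point in a Unicode 'P' category, so the unicode translate test is not doing the same work as the others. Below is a minimal sketch of a like-for-like setup. It assumes Python 3 (where str is unicode and chr/range replace unichr/xrange), and the names PUNCT_CHARS, PUNCT_TABLE, PUNCT_RE, strip_translate and strip_regex are made up for this illustration. It does not claim which variant is faster; that has to be measured.

```python
import re
import sys
import unicodedata

# Every character whose Unicode general category starts with "P" (punctuation).
PUNCT_CHARS = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P")
]

# Table for str.translate: a code point mapped to None is deleted.
PUNCT_TABLE = dict.fromkeys(map(ord, PUNCT_CHARS))

# Regex character class covering exactly the same characters as the table.
PUNCT_RE = re.compile("[%s]" % re.escape("".join(PUNCT_CHARS)))


def strip_translate(text):
    """Remove Unicode punctuation using str.translate."""
    return text.translate(PUNCT_TABLE)


def strip_regex(text):
    """Remove Unicode punctuation using a precompiled regex."""
    return PUNCT_RE.sub("", text)


if __name__ == "__main__":
    sample = "Obi Wan\u2019s cantina: scum\u2014and villainy."
    print(strip_translate(sample))
    print(strip_regex(sample))
```

Because both functions now delete the same character set, timing them with timeit.Timer in the same way as the script above would compare the two techniques rather than two different workloads.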
