Fastest way to strip punctuation from a unicode string in Python

The current test script is flawed, because it does not compare like with like.

For a fairer comparison, all functions must be run against the same set of punctuation characters (i.e. all ASCII or all Unicode).

Once that is done, the regex and replace methods fare much worse against the full set of Unicode punctuation.
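For reference, the regex approach boils down to compiling one character class out of the whole punctuation set. A minimal Python 3 sketch (the function name `strip_punct_re` is mine, not from the script below):

```python
import re
import string

# Build a single character class from the punctuation set; re.escape
# keeps metacharacters such as '[', ']' and '\\' literal.
punct_re = re.compile('[%s]' % re.escape(string.punctuation))

def strip_punct_re(s):
    # One pass over the string, but the character class grows with the
    # punctuation set, which is part of why this slows down so much on
    # the full Unicode punctuation set.
    return punct_re.sub('', s)

print(strip_punct_re("Hello, world! (really)"))
```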

For the full Unicode set, the "set" method looks best. However, if you only want to remove ASCII punctuation from a unicode string, encode/translate/decode is probably fastest (depending on the length of the input string).
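The script below is Python 2; a rough Python 3 sketch of the encode/translate/decode trick (the function name `strip_ascii_punct` is mine) looks like this:

```python
import string

def strip_ascii_punct(s):
    # Encode to UTF-8 bytes, delete the ASCII punctuation bytes with
    # bytes.translate, decode back. This is safe for non-ASCII text:
    # UTF-8 continuation bytes are all >= 0x80, so deleting ASCII
    # punctuation bytes cannot corrupt a multi-byte character.
    return (s.encode('utf-8')
             .translate(None, string.punctuation.encode('ascii'))
             .decode('utf-8'))

# Note that the non-ASCII apostrophe survives; only ASCII punctuation goes.
print(strip_ascii_punct("Obi Wan’s cantina: scum, villainy!"))
```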

The "replace" method can also be improved considerably by doing a containment test before each replacement (depending on the exact makeup of the string).
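The containment-test optimisation is the difference between `test_repl` and `test_in_repl` in the script below; as a standalone sketch (the function name is mine):

```python
def strip_punct_repl(s, punc):
    # Membership test first: skip the O(len(s)) replace() for punctuation
    # characters that never occur in s. With a large punctuation set
    # (e.g. all of Unicode's) and a mostly clean string, most iterations
    # become a cheap 'in' check instead of a full scan-and-copy.
    for c in punc:
        if c in s:
            s = s.replace(c, "")
    return s

print(strip_punct_repl("a, b... c!", ",.!"))
```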

Here are some sample results from the reworked test script:

$ python2 test.py
running ascii punctuation test...
using byte strings...
set: 0.862006902695
re: 0.17484498024
trans: 0.0207080841064
enc_trans: 0.0206489562988
repl: 0.157525062561
in_repl: 0.213351011276

$ python2 test.py a
running ascii punctuation test...
using unicode strings...
set: 0.927773952484
re: 0.18892288208
trans: 1.58275294304
enc_trans: 0.0794939994812
repl: 0.413739919662
in_repl: 0.249747991562

$ python2 test.py u
running unicode punctuation test...
using unicode strings...
set: 0.978360176086
re: 7.97941994667
trans: 1.72471117973
enc_trans: 0.0784001350403
repl: 7.05612301826
in_repl: 3.66821289062

And here is the reworked script (Python 2):

# -*- coding: utf-8 -*-
import re, string, timeit
import unicodedata
import sys

# String from this article www.wired.com/design/2013/12/find-the-best-of-reddit-with-this-interactive-map/
s = """For me, Reddit brings to mind Obi Wan’s enduring description of the Mos
Eisley cantina: a wretched hive of scum and villainy. But, you know, one you
still kinda want to hang out in occasionally. The thing is, though, Reddit
isn’t some obscure dive bar in a remote corner of the universe—it’s a huge
watering hole at the very center of it. The site had some 400 million unique
visitors in 2012. They can’t all be Greedos. So maybe my problem is just that
I’ve never been able to find the places where the decent people hang out."""

su = u"""For me, Reddit brings to mind Obi Wan’s enduring description of the
Mos Eisley cantina: a wretched hive of scum and villainy. But, you know, one
you still kinda want to hang out in occasionally. The thing is, though,
Reddit isn’t some obscure dive bar in a remote corner of the universe—it’s a
huge watering hole at the very center of it. The site had some 400 million
unique visitors in 2012. They can’t all be Greedos. So maybe my problem is
just that I’ve never been able to find the places where the decent people
hang out."""

def test_trans(s):
    return s.translate(tbl)

def test_enc_trans(s):
    s = s.encode('utf-8').translate(None, string.punctuation)
    return s.decode('utf-8')

def test_set(s):  # with list comprehension fix
    return ''.join([ch for ch in s if ch not in exclude])

def test_re(s):  # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_repl(s):  # From S.Lott's solution
    for c in punc:
        s = s.replace(c, "")
    return s

def test_in_repl(s):  # From S.Lott's solution, with fix
    for c in punc:
        if c in s:
            s = s.replace(c, "")
    return s

txt = 'su'
ptn = u'[%s]'

if 'u' in sys.argv[1:]:
    print 'running unicode punctuation test...'
    print 'using unicode strings...'
    punc = u''
    tbl = {}
    for i in xrange(sys.maxunicode):
        char = unichr(i)
        if unicodedata.category(char).startswith('P'):
            tbl[i] = None
            punc += char
else:
    print 'running ascii punctuation test...'
    punc = string.punctuation
    if 'a' in sys.argv[1:]:
        print 'using unicode strings...'
        punc = punc.decode()
        tbl = {ord(ch): None for ch in punc}
    else:
        print 'using byte strings...'
        txt = 's'
        ptn = '[%s]'
        def test_trans(s):
            return s.translate(None, punc)
        test_enc_trans = test_trans

exclude = set(punc)
regex = re.compile(ptn % re.escape(punc))

def time_func(func, n=10000):
    timer = timeit.Timer(
        'func(%s)' % txt,
        'from __main__ import %s, test_%s as func' % (txt, func))
    print '%s: %s' % (func, timer.timeit(n))

print
time_func('set')
time_func('re')
time_func('trans')
time_func('enc_trans')
time_func('repl')
time_func('in_repl')
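The script above is Python 2 (`print` statement, `unichr`, `xrange`, and `str.translate(None, punc)` for byte strings). A minimal Python 3 sketch of the winning full-Unicode approach, str.translate with a code-point table (the function name `strip_unicode_punct` is mine), would be:

```python
import sys
import unicodedata

# Map every code point whose Unicode category starts with 'P'
# (all punctuation) to None, as the Python 2 script does with
# unichr/xrange; building the table scans ~1.1M code points once.
tbl = {i: None for i in range(sys.maxunicode)
       if unicodedata.category(chr(i)).startswith('P')}

def strip_unicode_punct(s):
    return s.translate(tbl)

# Curly apostrophe, em dash, and ellipsis are all removed too.
print(strip_unicode_punct("it’s a huge—watering hole…"))
```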
