Fastest way to strip punctuation from a unicode string in Python

The current test script is flawed, because it does not compare like with like.

For a fairer comparison, all functions must be run against the same set of punctuation characters (i.e. all ASCII or all Unicode).

Once that is done, the regex and replace methods fare much worse against the full set of Unicode punctuation.
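For reference, the regex approach boils down to compiling one character class out of the whole punctuation set. A minimal Python 3 sketch (the function name `strip_punct_re` is mine, not from the script below):

```python
import re
import string

# Build a single character class from the punctuation set; re.escape
# keeps metacharacters such as '[', ']' and '\\' literal.
punct_re = re.compile('[%s]' % re.escape(string.punctuation))

def strip_punct_re(s):
    # One pass over the string, but the character class grows with the
    # punctuation set, which is part of why this slows down so much on
    # the full Unicode punctuation set.
    return punct_re.sub('', s)

print(strip_punct_re("Hello, world! (really)"))
```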

For the full Unicode set, the "set" method looks best. However, if you only want to remove ASCII punctuation from a unicode string, encode/translate/decode is probably fastest (depending on the length of the input string).
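The script below is Python 2; a rough Python 3 sketch of the encode/translate/decode trick (the function name `strip_ascii_punct` is mine) looks like this:

```python
import string

def strip_ascii_punct(s):
    # Encode to UTF-8 bytes, delete the ASCII punctuation bytes with
    # bytes.translate, decode back. This is safe for non-ASCII text:
    # UTF-8 continuation bytes are all >= 0x80, so deleting ASCII
    # punctuation bytes cannot corrupt a multi-byte character.
    return (s.encode('utf-8')
             .translate(None, string.punctuation.encode('ascii'))
             .decode('utf-8'))

# Note that the non-ASCII apostrophe survives; only ASCII punctuation goes.
print(strip_ascii_punct("Obi Wan’s cantina: scum, villainy!"))
```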

The "replace" method can also be improved considerably by doing a containment test before each replacement (depending on the exact makeup of the string).
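The containment-test optimisation is the difference between `test_repl` and `test_in_repl` in the script below; as a standalone sketch (the function name is mine):

```python
def strip_punct_repl(s, punc):
    # Membership test first: skip the O(len(s)) replace() for punctuation
    # characters that never occur in s. With a large punctuation set
    # (e.g. all of Unicode's) and a mostly clean string, most iterations
    # become a cheap 'in' check instead of a full scan-and-copy.
    for c in punc:
        if c in s:
            s = s.replace(c, "")
    return s

print(strip_punct_repl("a, b... c!", ",.!"))
```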

Here are some sample results from the reworked test script:

$ python2 test.py
running ascii punctuation test...
using byte strings...
set: 0.862006902695
re: 0.17484498024
trans: 0.0207080841064
enc_trans: 0.0206489562988
repl: 0.157525062561
in_repl: 0.213351011276

$ python2 test.py a
running ascii punctuation test...
using unicode strings...
set: 0.927773952484
re: 0.18892288208
trans: 1.58275294304
enc_trans: 0.0794939994812
repl: 0.413739919662
in_repl: 0.249747991562

$ python2 test.py u
running unicode punctuation test...
using unicode strings...
set: 0.978360176086
re: 7.97941994667
trans: 1.72471117973
enc_trans: 0.0784001350403
repl: 7.05612301826
in_repl: 3.66821289062

And here is the reworked script (Python 2):

# -*- coding: utf-8 -*-
import re, string, timeit
import unicodedata
import sys

# String from this article www.wired.com/design/2013/12/find-the-best-of-reddit-with-this-interactive-map/
s = """For me, Reddit brings to mind Obi Wan’s enduring description of the Mos
Eisley cantina: a wretched hive of scum and villainy. But, you know, one you
still kinda want to hang out in occasionally. The thing is, though, Reddit
isn’t some obscure dive bar in a remote corner of the universe—it’s a huge
watering hole at the very center of it. The site had some 400 million unique
visitors in 2012. They can’t all be Greedos. So maybe my problem is just that
I’ve never been able to find the places where the decent people hang out."""

su = u"""For me, Reddit brings to mind Obi Wan’s enduring description of the
Mos Eisley cantina: a wretched hive of scum and villainy. But, you know, one
you still kinda want to hang out in occasionally. The thing is, though,
Reddit isn’t some obscure dive bar in a remote corner of the universe—it’s a
huge watering hole at the very center of it. The site had some 400 million
unique visitors in 2012. They can’t all be Greedos. So maybe my problem is
just that I’ve never been able to find the places where the decent people
hang out."""

def test_trans(s):
    return s.translate(tbl)

def test_enc_trans(s):
    s = s.encode('utf-8').translate(None, string.punctuation)
    return s.decode('utf-8')

def test_set(s):  # with list comprehension fix
    return ''.join([ch for ch in s if ch not in exclude])

def test_re(s):  # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_repl(s):  # From S.Lott's solution
    for c in punc:
        s = s.replace(c, "")
    return s

def test_in_repl(s):  # From S.Lott's solution, with fix
    for c in punc:
        if c in s:
            s = s.replace(c, "")
    return s

txt = 'su'
ptn = u'[%s]'

if 'u' in sys.argv[1:]:
    print 'running unicode punctuation test...'
    print 'using unicode strings...'
    punc = u''
    tbl = {}
    for i in xrange(sys.maxunicode):
        char = unichr(i)
        if unicodedata.category(char).startswith('P'):
            tbl[i] = None
            punc += char
else:
    print 'running ascii punctuation test...'
    punc = string.punctuation
    if 'a' in sys.argv[1:]:
        print 'using unicode strings...'
        punc = punc.decode()
        tbl = {ord(ch): None for ch in punc}
    else:
        print 'using byte strings...'
        txt = 's'
        ptn = '[%s]'
        def test_trans(s):
            return s.translate(None, punc)
        test_enc_trans = test_trans

exclude = set(punc)
regex = re.compile(ptn % re.escape(punc))

def time_func(func, n=10000):
    timer = timeit.Timer(
        'func(%s)' % txt,
        'from __main__ import %s, test_%s as func' % (txt, func))
    print '%s: %s' % (func, timer.timeit(n))

print
time_func('set')
time_func('re')
time_func('trans')
time_func('enc_trans')
time_func('repl')
time_func('in_repl')
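The script above is Python 2 (`print` statement, `unichr`, `xrange`, and `str.translate(None, punc)` for byte strings). A minimal Python 3 sketch of the winning full-Unicode approach, str.translate with a code-point table (the function name `strip_unicode_punct` is mine), would be:

```python
import sys
import unicodedata

# Map every code point whose Unicode category starts with 'P'
# (all punctuation) to None, as the Python 2 script does with
# unichr/xrange; building the table scans ~1.1M code points once.
tbl = {i: None for i in range(sys.maxunicode)
       if unicodedata.category(chr(i)).startswith('P')}

def strip_unicode_punct(s):
    return s.translate(tbl)

# Curly apostrophe, em dash, and ellipsis are all removed too.
print(strip_unicode_punct("it’s a huge—watering hole…"))
```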
