Removing repeated characters from a string in Python: the fastest way to deduplicate characters in a string

We can deduplicate the contiguous characters in a string with:

def deduplicate(string, char):
    return char.join([substring for substring in string.strip().split(char) if substring])

E.g.

>>> s = 'this is an irritating string with random spacing .'

>>> deduplicate(s, ' ')

'this is an irritating string with random spacing .'

On the command line there is a squeeze option for tr:

$ tr -s " " < file

Is there a squeeze function in Python's string?

What is the fastest way to deduplicate contiguous characters in string in Python?

Please note that the character to be deduplicated can be any ASCII/Unicode character, not just \s / whitespace. (It's fine to have two sub-answers, one for ASCII and one for Unicode.)

Solution

First of all, your deduplicate function is actually really fast, but some improvements can make it even faster. I have rewritten your function as a lambda and called it org_deduplicate (below). Now for some timing tests (using IPython's %timeit):

s = 'this is an irritating string with random spacing .'

org_deduplicate = lambda s,c: c.join([substring for substring in s.strip().split(c) if substring])

%timeit org_deduplicate(s,' ')

100000 loops, best of 3: 3.59 µs per loop

But the strip really isn't necessary and may even give you unexpected results (if you are not deduplicating whitespace), so we can try:

org_deduplicate2 = lambda s,c: c.join(substring for substring in s.split(c) if substring)

%timeit org_deduplicate2(s,' ')

100000 loops, best of 3: 3.4 µs per loop
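To make the "unexpected results" from strip concrete, here is a minimal sketch. The names with_strip and without_strip and the sample strings are illustrative assumptions, not part of the original question. It also shows that both split-based variants drop leading and trailing occurrences of the character entirely, which differs from tr -s, which would keep one.

```python
def with_strip(s, c):
    # Original approach: strip() removes leading/trailing whitespace first,
    # even when the character being squeezed is not whitespace.
    return c.join([part for part in s.strip().split(c) if part])


def without_strip(s, c):
    # Same idea without strip(): only occurrences of c are touched.
    return c.join(part for part in s.split(c) if part)


padded = "  a--b---c  "
print(repr(with_strip(padded, "-")))     # 'a-b-c' (surrounding spaces lost)
print(repr(without_strip(padded, "-")))  # '  a-b-c  '

# Leading/trailing runs of c disappear completely, because the empty
# substrings produced by split() are filtered out.
print(repr(without_strip("--a--b--", "-")))  # 'a-b'
```

If the surrounding occurrences of the character matter, the str.replace approach later in this answer preserves one of them.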

Dropping strip speeds things up a tiny bit, but it's not all that impressive. Let's try a different approach: regular expressions. These are also nice because they give you the flexibility to choose any regular expression as your "character" to deduplicate (not just a single char):

import re

re_deduplicate = lambda s,c: re.sub(r'(%s)(?:\1)+' %c, r'\g<1>', s)

re_deduplicate2 = lambda s,c: c.join(re.split('%s+'%c,s))

%timeit re_deduplicate(s,' ')

100000 loops, best of 3: 13.8 µs per loop

%timeit re_deduplicate2(s,' ')

100000 loops, best of 3: 6.47 µs per loop
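One caveat with the regex versions: because the character is interpolated into the pattern as-is, a literal metacharacter such as '.' is treated as a regular expression. The sketch below is an assumption-labeled variant (re_squeeze is a hypothetical name) that uses re.escape when a literal character is intended, giving up the "any regex" flexibility in exchange.

```python
import re


def re_squeeze(s, c):
    # re.escape makes metacharacters such as '.' or '*' match literally.
    pattern = "(?:%s)+" % re.escape(c)
    # A function replacement avoids backslash processing of c itself.
    return re.sub(pattern, lambda match: c, s)


text = "a..b...c"

# Unescaped, '.+' matches the whole string, so the split yields ['', ''].
print(repr(".".join(re.split("%s+" % ".", text))))  # '.'

# Escaped, only literal dots are squeezed.
print(repr(re_squeeze(text, ".")))  # 'a.b.c'
```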

re_deduplicate2 is the faster of the two, but neither is even close to your original function. It looks like regular string operations are quicker than re functions. What if we try zipping instead (use itertools.izip if working with Python 2):

zip_deduplicate = lambda s,c: s[:1] + ''.join(s2 for s1,s2 in zip(s,s[1:]) if s2!=c or s1!=s2)

%timeit zip_deduplicate(s,' ')

100000 loops, best of 3: 12.9 µs per loop

Still no improvement. The zip method produces too many tiny substrings, which makes ''.join slow. OK, one more try: what about str.replace called recursively?

def rec_deduplicate(s,c):
    if s.find(c*2) != -1:
        return rec_deduplicate(s.replace(c*2, c),c)
    return s

%timeit rec_deduplicate(s,' ')

100000 loops, best of 3: 2.83 µs per loop

Not bad, that seems to be our winner. But just to be sure, let's try it against our original function with a really long input string:

s2 = s*100000

%timeit rec_deduplicate(s2,' ')

10 loops, best of 3: 64.6 ms per loop

%timeit org_deduplicate(s2,' ')

1 loop, best of 3: 209 ms per loop

Yup, it looks like it scales nicely. But let's try one more test. Each call of the recursive deduplicator only collapses pairs of the character, so a long run is merely halved per call. Does it still do better when the runs of duplicate characters are long?

s3 = 'this is an irritating string with random spacing .'

%timeit rec_deduplicate(s3,' ')

100000 loops, best of 3: 9.93 µs per loop

%timeit org_deduplicate(s3,' ')

100000 loops, best of 3: 8.99 µs per loop

It does lose some of its advantage when there are long strings of repeated characters to remove.
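The reason is that each str.replace pass only halves a run, so a run of n repeated characters needs roughly log2(n) passes over the whole string. The sketch below is an iterative equivalent of the recursive function that also counts passes; loop_deduplicate is a hypothetical name, and it assumes c is a non-empty string (an empty c would never terminate, as with the recursive version).

```python
def loop_deduplicate(s, c):
    """Iterative form of the replace approach; returns (result, passes)."""
    double = c * 2
    passes = 0
    # Each pass replaces non-overlapping pairs, halving every run of c.
    while double in s:
        s = s.replace(double, c)
        passes += 1
    return s, passes


print(loop_deduplicate("a  b", " "))               # ('a b', 1)
print(loop_deduplicate("a" + " " * 16 + "b", " "))  # ('a b', 4)
print(loop_deduplicate("a" + " " * 17 + "b", " "))  # ('a b', 5)
```

The split-based function makes a single pass regardless of run length, which is why it catches up on inputs like s3.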

In summary, use your original function (with a few tweaks maybe) if your strings will have long substrings of repeating characters. Otherwise, the recursive version is fastest.
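Since the balance depends on your data, it may be worth measuring on your own strings. Outside IPython, a rough equivalent of %timeit can be sketched with the standard timeit module. The helper seconds_per_call and the sample inputs are assumptions made for illustration, and absolute numbers will differ by machine and Python version.

```python
import timeit


# Same logic as in the answer, written as plain functions.
def org_deduplicate2(s, c):
    return c.join(substring for substring in s.split(c) if substring)


def rec_deduplicate(s, c):
    if s.find(c * 2) != -1:
        return rec_deduplicate(s.replace(c * 2, c), c)
    return s


def seconds_per_call(func, s, c, number=200, repeat=3):
    # Hypothetical helper: best of `repeat` runs, divided by calls per run.
    timer = timeit.Timer(lambda: func(s, c))
    return min(timer.repeat(repeat=repeat, number=number)) / number


# Illustrative inputs (assumptions, not the strings timed above).
samples = {
    "short runs": ("this  is  an  irritating  string " * 100).strip(),
    "long runs": (("word" + " " * 50) * 100).strip(),
}

for label, text in samples.items():
    # The two functions agree here because the inputs have no
    # leading or trailing spaces.
    assert org_deduplicate2(text, " ") == rec_deduplicate(text, " ")
    for func in (org_deduplicate2, rec_deduplicate):
        print(label, func.__name__, seconds_per_call(func, text, " "))
```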
