What is the fastest way to check whether a string contains repeated characters in Python 3?

I need to filter strings by the criterion that they contain no character twice.

The strings are many (say 1.4 trillion).

The strings are short (around 8 characters).

The strings are unique (caching won't work).

The strings have a big character set (say any Unicode character).

The strings usually meet the criterion (say 2/3 have no repeating characters).

The using code would look like this:

>>> candidate_strings = ["foobnehg", "barfnehg", "bazfnehg"]
>>> result_strings = [s for s in candidate_strings if unique_chars(s)]
>>> print(result_strings)
['barfnehg', 'bazfnehg']

I implemented a naive version, simply iterating the string:

def unique_chars_naive(string_given):
    """
    Checks if a given string contains only unique characters.

    This version iterates the given string, saving all occurred characters.
    """
    chars_seen = []
    for char in string_given:
        if char in chars_seen:
            return False
        chars_seen.append(char)
    return True

My next-best idea was to use a set, so I implemented that:

def unique_chars_set(string_given):
    """
    Checks if a given string contains only unique characters.

    This version exploits that a set contains only unique entries.
    """
    return len(string_given) == len(set(string_given))

I saved the functions to a file UniqueCharacters.py and timed them:

$ python3 -m timeit -n 100000 --setup='import UniqueCharacters; candidate_strings = ["foobnehg", "barfnehg", "bazfnehg"]' '[UniqueCharacters.unique_chars_naive(s) for s in candidate_strings]'

100000 loops, best of 3: 20.3 usec per loop

$ python3 -m timeit -n 100000 --setup='import UniqueCharacters; candidate_strings = ["foobnehg", "barfnehg", "bazfnehg"]' '[UniqueCharacters.unique_chars_set(s) for s in candidate_strings]'

100000 loops, best of 3: 17.7 usec per loop

This shows that unique_chars_set is about 15% faster on this dataset.
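The same measurement can also be taken from inside Python with the standard `timeit` module, which makes it easier to add further candidates later. This is only a sketch: the helper name `best_time` and the loop counts are assumptions for illustration, and the two functions are repeated without docstrings so the block runs on its own.

```python
import timeit

def unique_chars_naive(string_given):
    chars_seen = []
    for char in string_given:
        if char in chars_seen:
            return False
        chars_seen.append(char)
    return True

def unique_chars_set(string_given):
    return len(string_given) == len(set(string_given))

candidate_strings = ["foobnehg", "barfnehg", "bazfnehg"]

def best_time(func, number=10000, repeat=3):
    """Best total time in seconds over `repeat` runs of `number` loops each."""
    def stmt():
        return [func(s) for s in candidate_strings]
    # timeit.repeat accepts a callable and returns one total time per run.
    return min(timeit.repeat(stmt, number=number, repeat=repeat))

for func in (unique_chars_naive, unique_chars_set):
    print(func.__name__, best_time(func))
```

Absolute numbers will differ from the command-line figures above, since they depend on the machine, the Python version, and the loop count.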

Is there a faster way to do this? With regular expressions maybe? Is there some method in the standard library that does this?

Solution

Let me start off by saying that I suspect that you are optimizing when you don't need to. Python is a high-level language that supports thinking about computation in a high-level manner. A solution that is readable, elegant, and reusable is often going to be better than one that is blazingly fast, but hard to understand.

When, and only when, you determine that speed is an issue, then you should proceed with the optimizations. Perhaps even write a C extension for the computationally intense parts.

That being said, here's a comparison of a few techniques:

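from collections import Counter

def unique_chars_set(s):
    return len(s) == len(set(s))

def unique_chars_frozenset(s):
    return len(s) == len(frozenset(s))

def unique_chars_counter(s):
    # Unique if the most common character occurs exactly once.
    # The empty string has no characters, so it counts as unique.
    return not s or Counter(s).most_common(1)[0][1] == 1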

def unique_chars_sort(s):
    # After sorting, any repeated characters are adjacent.
    ss = ''.join(sorted(s))
    prev = ''
    for c in ss:
        if c == prev:
            return False
        prev = c
    return True

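def unique_chars_bucket(s):
    # One flag per code point 0..255. This only covers Latin-1:
    # any character with a higher code point raises IndexError.
    buckets = 256 * [False]
    for c in s:
        o = ord(c)
        if buckets[o]:
            return False
        buckets[o] = True
    return True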
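Before comparing speed, it is worth checking that the variants agree with each other, including on edge cases such as the empty string and non-ASCII input. The sketch below is an assumption-laden illustration: the helper `disagreements` and the compact `by_set`, `by_counter`, and `by_sort` functions are hypothetical stand-ins written for this check, not the functions above verbatim.

```python
from collections import Counter

def by_set(s):
    return len(s) == len(set(s))

def by_counter(s):
    # Every character occurs exactly once (vacuously true for "").
    return all(n == 1 for n in Counter(s).values())

def by_sort(s):
    # After sorting, a repeated character would sit next to its twin.
    ss = sorted(s)
    return all(a != b for a, b in zip(ss, ss[1:]))

def disagreements(funcs, samples):
    """Return (sample, results) pairs on which the functions differ."""
    bad = []
    for s in samples:
        results = [f(s) for f in funcs]
        if len(set(results)) > 1:
            bad.append((s, results))
    return bad

samples = ["foobnehg", "barfnehg", "bazfnehg", "", "a", "aa", "日本語"]
print(disagreements([by_set, by_counter, by_sort], samples))
```

An empty list means every function returned the same answer for every sample. The same helper can be called with the functions defined above in place of the stand-ins.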

And here are the performance comparisons (in IPython):

In [0]: %timeit -r10 [unique_chars_set(s) for s in candidate_strings]

100000 loops, best of 10: 6.63 us per loop

In [1]: %timeit -r10 [unique_chars_frozenset(s) for s in candidate_strings]

100000 loops, best of 10: 6.81 us per loop

In [2]: %timeit -r10 [unique_chars_counter(s) for s in candidate_strings]

10000 loops, best of 10: 83.1 us per loop

In [3]: %timeit -r10 [unique_chars_sort(s) for s in candidate_strings]

100000 loops, best of 10: 13.1 us per loop

In [4]: %timeit -r10 [unique_chars_bucket(s) for s in candidate_strings]

100000 loops, best of 10: 15 us per loop

Conclusion: set is elegant and faster than the other obvious methods tested here. Apart from Counter, which is roughly an order of magnitude slower, the differences are small enough that they rarely matter.

For more benchmarks, see @FrancisAvila's answer.
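With as many strings as the question describes, the result cannot be built as one in-memory list, so the set-based check would probably be applied lazily. A minimal sketch, assuming the candidates arrive as any iterable; the name `filter_unique` is hypothetical:

```python
def unique_chars_set(s):
    return len(s) == len(set(s))

def filter_unique(strings):
    """Return a lazy iterator over the strings with no repeated character."""
    # filter() pulls one string at a time, so nothing is stored up front.
    return filter(unique_chars_set, strings)

candidate_strings = ["foobnehg", "barfnehg", "bazfnehg"]
print(list(filter_unique(candidate_strings)))  # ['barfnehg', 'bazfnehg']
```

Calling `list()` here is only for display; in a real pipeline the iterator would be consumed one item at a time.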
