python字符串重复符是什么,在Python 3中检查字符串是否包含重复字符的最快方法是什么？...

最新推荐文章于 2021-12-21 11:46:43 发布

Blair Long

最新推荐文章于 2021-12-21 11:46:43 发布

阅读量236

点赞数

文章标签： python字符串重复符是什么

I need to filter strings by the criterion that they contain no character twice.

The strings are many (say 1.4 trillion).

The strings are short (around 8 characters).

The strings are unique (caching won't work).

The strings have a big character set (say any Unicode character).

The strings usually meet the criterion (say 2/3 have no repeating characters).

The using code would look like this:

>>> candidate_strings = ["foobnehg", "barfnehg", "bazfnehg"]

>>> result_strings = [s if unique_chars(s) for s in candidate_strings]

>>> print(result_strings)

["barfnehg", "bazfnehg"]

I implemented a naive version, simply iterating the string:

def unique_chars_naive(string_given):

"""

Checks if a given string contains only unique characters.

This version iterates the given string, saving all occurred characters.

"""

chars_seen = []

for char in string_given:

if char in chars_seen:

return False

chars_seen.append(char)

return True

My next-best idea was to use a set, so I implemented that:

def unique_chars_set(string_given):

"""

Checks if a given string contains only unique characters.

This version exploits that a set contains only unique entries.

"""

return len(string_given) == len(set(string_given))

Saving the functions to a file UniqueCharacters.py, timed them:

$ python3 -m timeit -n 100000 --setup='import UniqueCharacters; candidate_strings = ["foobnehg", "barfnehg", "bazfnehg"]' '[UniqueCharacters.unique_chars_naive(s) for s in candidate_strings]'

100000 loops, best of 3: 20.3 usec per loop

$ python3 -m timeit -n 100000 --setup='import UniqueCharacters; candidate_strings = ["foobnehg", "barfnehg", "bazfnehg"]' '[UniqueCharacters.unique_chars_set(s) for s in candidate_strings]'

100000 loops, best of 3: 17.7 usec per loop

This shows that the unique_chars_set is faster by about 15 % for this dataset.

Is there a faster way to do this? With regular expressions maybe? Is there some method in the standard library that does this?

解决方案

Let me start off by saying that I suspect that you are optimizing when you don't need to. Python is a high-level language that supports thinking about computation in a high-level manner. A solution that is readable, elegant, and reusable is often going to be better than one that is blazingly fast, but hard to understand.

When, and only when, you determine that speed is an issue, then you should proceed with the optimizations. Perhaps even write a C extension for the computationally intense parts.

That being said, here's a comparison of a few techniques:

def unique_chars_set(s):

return len(s) == len(set(s))

def unique_chars_frozenset(s):

return len(s) == len(frozenset(s))

def unique_chars_counter(s):

return Counter(s).most_common(1)[0][1] > 1

def unique_chars_sort(s):

ss = ''.join(sorted(s))

prev = ''

for c in ss:

if c == prev:

return False

prev = c

return True

def unique_chars_bucket(s):

buckets = 255 * [False]

for c in s:

o = ord(c)

if buckets[o]:

return False

buckets[o] = True

return True

And here is the performance comparisons (in IPython):

In [0]: %timeit -r10 [unique_chars_set(s) for s in candidate_strings]

100000 loops, best of 10: 6.63 us per loop

In [1]: %timeit -r10 [unique_chars_frozenset(s) for s in candidate_strings]

100000 loops, best of 10: 6.81 us per loop

In [2]: %timeit -r10 [unique_chars_counter(s) for s in candidate_strings]

10000 loops, best of 10: 83.1 us per loop

In [3]: %timeit -r10 [unique_chars_sort(s) for s in candidate_strings]

100000 loops, best of 10: 13.1 us per loop

In [4]: %timeit -r10 [unique_chars_bucket(s) for s in candidate_strings]

100000 loops, best of 10: 15 us per loop

Conclusion: set is elegant and faster than many other obvious methods. But the differences are so small, it doesn't matter anyway.

For more benchmarks, see @FrancisAvila's answer.

Blair Long

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
python字符串重复符是什么,在Python 3中检查字符串是否包含重复字符的最快方法是什么？...

I need to filter strings by the criterion that they contain no character twice.The strings are many (say 1.4 trillion).The strings are short (around 8 characters).The strings are unique (caching won't...
复制链接

扫一扫