How can I tell if a string repeats itself in Python?

Translated from: How can I tell if a string repeats itself in Python?

I'm looking for a way to test whether a given string repeats itself over its entire length.

Examples:

[
    '0045662100456621004566210045662100456621',             # '00456621'
    '0072992700729927007299270072992700729927',             # '00729927'
    '001443001443001443001443001443001443001443',           # '001443'
    '037037037037037037037037037037037037037037037',        # '037'
    '047619047619047619047619047619047619047619',           # '047619'
    '002457002457002457002457002457002457002457',           # '002457'
    '001221001221001221001221001221001221001221',           # '001221'
    '001230012300123001230012300123001230012300123',        # '00123'
    '0013947001394700139470013947001394700139470013947',    # '0013947'
    '001001001001001001001001001001001001001001001001001',  # '001'
    '001406469760900140646976090014064697609',              # '0014064697609'
]

are strings which repeat themselves, and

[
    '004608294930875576036866359447',
    '00469483568075117370892018779342723',
    '004739336492890995260663507109',
    '001508295625942684766214177978883861236802413273',
    '007518796992481203',
    '0071942446043165467625899280575539568345323741',
    '0434782608695652173913',
    '0344827586206896551724137931',
    '002481389578163771712158808933',
    '002932551319648093841642228739',
    '0035587188612099644128113879',
    '003484320557491289198606271777',
    '00115074798619102416570771',
]

are examples of ones that do not.

The repeating sections of the strings I'm given can be quite long, and the strings themselves can be 500 or more characters, so looping through each character trying to build a pattern and then checking the pattern against the rest of the string seems awfully slow. Multiply that by potentially hundreds of strings and I can't see any intuitive solution.

I've looked into regexes a bit and they seem good for when you know what you're looking for, or at least the length of the pattern you're looking for. Unfortunately, I know neither.

How can I tell if a string is repeating itself and, if it is, what the shortest repeating subsequence is?


#1

Reference: https://stackoom.com/question/1zhNQ/如何判断字符串是否在Python中重复


#2

Here's a solution using regular expressions.

import re

REPEATER = re.compile(r"(.+?)\1+$")

def repeated(s):
    match = REPEATER.match(s)
    return match.group(1) if match else None

Iterating over the examples in the question:

examples = [
    '0045662100456621004566210045662100456621',
    '0072992700729927007299270072992700729927',
    '001443001443001443001443001443001443001443',
    '037037037037037037037037037037037037037037037',
    '047619047619047619047619047619047619047619',
    '002457002457002457002457002457002457002457',
    '001221001221001221001221001221001221001221',
    '001230012300123001230012300123001230012300123',
    '0013947001394700139470013947001394700139470013947',
    '001001001001001001001001001001001001001001001001001',
    '001406469760900140646976090014064697609',
    '004608294930875576036866359447',
    '00469483568075117370892018779342723',
    '004739336492890995260663507109',
    '001508295625942684766214177978883861236802413273',
    '007518796992481203',
    '0071942446043165467625899280575539568345323741',
    '0434782608695652173913',
    '0344827586206896551724137931',
    '002481389578163771712158808933',
    '002932551319648093841642228739',
    '0035587188612099644128113879',
    '003484320557491289198606271777',
    '00115074798619102416570771',
]

for e in examples:
    sub = repeated(e)
    if sub:
        print("%r: %r" % (e, sub))
    else:
        print("%r does not repeat." % e)

... produces this output:

'0045662100456621004566210045662100456621': '00456621'
'0072992700729927007299270072992700729927': '00729927'
'001443001443001443001443001443001443001443': '001443'
'037037037037037037037037037037037037037037037': '037'
'047619047619047619047619047619047619047619': '047619'
'002457002457002457002457002457002457002457': '002457'
'001221001221001221001221001221001221001221': '001221'
'001230012300123001230012300123001230012300123': '00123'
'0013947001394700139470013947001394700139470013947': '0013947'
'001001001001001001001001001001001001001001001001001': '001'
'001406469760900140646976090014064697609': '0014064697609'
'004608294930875576036866359447' does not repeat.
'00469483568075117370892018779342723' does not repeat.
'004739336492890995260663507109' does not repeat.
'001508295625942684766214177978883861236802413273' does not repeat.
'007518796992481203' does not repeat.
'0071942446043165467625899280575539568345323741' does not repeat.
'0434782608695652173913' does not repeat.
'0344827586206896551724137931' does not repeat.
'002481389578163771712158808933' does not repeat.
'002932551319648093841642228739' does not repeat.
'0035587188612099644128113879' does not repeat.
'003484320557491289198606271777' does not repeat.
'00115074798619102416570771' does not repeat.

The regular expression (.+?)\1+$ is divided into three parts:

  1. (.+?) is a matching group containing at least one (but as few as possible) of any character (because +? is non-greedy).

  2. \1+ checks for at least one repetition of the matching group in the first part.

  3. $ checks for the end of the string, to ensure that there's no extra, non-repeating content after the repeated substrings (and using re.match() ensures that there's no non-repeating text before the repeated substrings).
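To make the role of the two anchors concrete, here is a small sketch. The names LOOSE, STRICT and first_group are made up for this illustration and are not part of the answer; only the STRICT pattern comes from it.

```python
import re

LOOSE = re.compile(r"(.+?)\1+")    # hypothetical variant without the end anchor
STRICT = re.compile(r"(.+?)\1+$")  # the pattern from the answer

def first_group(match):
    """Return group 1 of a match object, or None if there was no match."""
    return match.group(1) if match else None

# Trailing junk: only the anchored pattern rejects it.
print(first_group(LOOSE.match("ababX")))    # ab
print(first_group(STRICT.match("ababX")))   # None

# Leading junk: match() rejects it, search() does not.
print(first_group(STRICT.match("Xabab")))   # None
print(first_group(STRICT.search("Xabab")))  # ab
```

Dropping either anchor lets a string with non-repeating text at one end slip through, which is why the answer pairs $ with re.match().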

In Python 3.4 and later, you could drop the $ and use re.fullmatch() instead, or (in any Python at least as far back as 2.3) go the other way and use re.search() with the regex ^(.+?)\1+$; all of these come down to personal taste more than anything else.
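The two alternatives might look like the following sketch. The function names repeated_fullmatch and repeated_search are assumptions chosen for this example; re.fullmatch() requires Python 3.4 or later.

```python
import re

UNANCHORED = re.compile(r"(.+?)\1+")   # relies on fullmatch() to cover the whole string
ANCHORED = re.compile(r"^(.+?)\1+$")   # anchors both ends, so search() is safe

def repeated_fullmatch(s):
    """Shortest repeating unit of s via fullmatch(), or None."""
    match = UNANCHORED.fullmatch(s)
    return match.group(1) if match else None

def repeated_search(s):
    """Shortest repeating unit of s via search() with explicit anchors, or None."""
    match = ANCHORED.search(s)
    return match.group(1) if match else None

print(repeated_fullmatch("037037037"))       # 037
print(repeated_search("0045662100456621"))   # 00456621
print(repeated_fullmatch("abcdefg"))         # None
```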


#3

Non-regex solution:

def repeat(string):
    for i in range(1, len(string)//2+1):
        if not len(string)%len(string[0:i]) and string[0:i]*(len(string)//len(string[0:i])) == string:
            return string[0:i]

Faster non-regex solution, thanks to @ThatWeirdo (see comments):

def repeat(string):
    l = len(string)
    for i in range(1, len(string)//2+1):
        if l%i: continue
        s = string[0:i]
        if s*(l//i) == string:
            return s

The above solution is occasionally a few percent slower than the original, but it's usually a good bit faster, and sometimes a whole lot faster. It's still not faster than davidism's for longer strings, and zero's regex solution is superior for short strings. It comes out fastest (according to davidism's test on GitHub; see his answer) with strings of about 1000-1500 characters. Regardless, it's reliably second-fastest (or better) in all cases I tested. Thanks, ThatWeirdo.

Test:
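If you want to check the speed claim on your own data, a rough timing harness could look like this. It is only a sketch: the names repeat_original and repeat_faster, the sample string, and the call count are assumptions for illustration, and the numbers will vary by machine and input.

```python
import timeit

def repeat_original(string):
    """First version: recomputes the slice and its length on every pass."""
    for i in range(1, len(string) // 2 + 1):
        prefix = string[:i]
        if not len(string) % len(prefix) and prefix * (len(string) // len(prefix)) == string:
            return prefix

def repeat_faster(string):
    """Second version: skips lengths that do not divide len(string) before slicing."""
    length = len(string)
    for i in range(1, length // 2 + 1):
        if length % i:
            continue
        prefix = string[:i]
        if prefix * (length // i) == string:
            return prefix

sample = "0013947" * 150  # hypothetical 1050-character input

# Both versions must agree before their timings are worth comparing.
assert repeat_original(sample) == repeat_faster(sample) == "0013947"

for func in (repeat_original, repeat_faster):
    seconds = timeit.timeit(lambda: func(sample), number=1000)
    print("%s: %.4f s per 1000 calls" % (func.__name__, seconds))
```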

print(repeat('009009009'))
print(repeat('254725472547'))
print(repeat('abcdeabcdeabcdeabcde'))
print(repeat('abcdefg'))
print(repeat('09099099909999'))
print(repeat('02589675192'))

Results:

009
2547
abcde
None
None
None

#4

You can make the observation that for a string to be considered repeating, its length must be divisible by the length of its repeated sequence. Given that, here is a solution that generates divisors of the length from 1 to n / 2 inclusive, divides the original string into substrings with the length of the divisors, and tests the equality of the result set:

from math import sqrt, floor

def divquot(n):
    if n > 1:
        yield 1, n
    swapped = []
    for d in range(2, int(floor(sqrt(n))) + 1):
        q, r = divmod(n, d)
        if r == 0:
            yield d, q
            swapped.append((q, d))
    while swapped:
        yield swapped.pop()

def repeats(s):
    n = len(s)
    for d, q in divquot(n):
        sl = s[0:d]
        if sl * q == s:
            return sl
    return None

EDIT: In Python 3, the / operator has changed to do float division by default. To get the int division from Python 2, you can use the // operator instead. Thank you to @TigerhawkT3 for bringing this to my attention.

The // operator performs integer division in both Python 2 and Python 3, so I've updated the answer to support both versions. The part where we test to see if all the substrings are equal is now a short-circuiting operation using all and a generator expression.

UPDATE: In response to a change in the original question, the code has now been updated to return the smallest repeating substring if it exists and None if it does not. @godlygeek has suggested using divmod to reduce the number of iterations on the divisors generator, and the code has been updated to match that as well. It now returns all positive divisors of n in ascending order, exclusive of n itself.
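The intermediate version that used all and a generator expression is not shown in the answer. As a guess at its general shape, and to make the divisor ordering concrete, here is a sketch; proper_divisors and repeats_with_all are hypothetical names, and the simple divisor loop stands in for the answer's divquot.

```python
def proper_divisors(n):
    """Yield the positive divisors of n in ascending order, excluding n itself."""
    for d in range(1, n // 2 + 1):
        if n % d == 0:
            yield d

def repeats_with_all(s):
    """Return the shortest repeating prefix of s, or None if s does not repeat."""
    n = len(s)
    for d in proper_divisors(n):
        head = s[:d]
        # all() stops at the first chunk that differs from the prefix.
        if all(s[i:i + d] == head for i in range(d, n, d)):
            return head
    return None

print(list(proper_divisors(12)))       # [1, 2, 3, 4, 6]
print(repeats_with_all("001443" * 7))  # 001443
print(repeats_with_all("abcdefg"))     # None
```

Because the divisors come out in ascending order, the first prefix that passes is automatically the shortest one.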

Further update for high performance: After multiple tests, I've come to the conclusion that simply testing for string equality has the best performance out of any slicing or iterator solution in Python. Thus, I've taken a leaf out of @TigerhawkT3's book and updated my solution. It's now over 6x as fast as before, noticeably faster than Tigerhawk's solution but slower than David's.


#5

Here's a straightforward solution, without regexes.

For substrings of s starting from the zeroth index, of lengths 1 through len(s) - 1, check if that substring, substr, is the repeated pattern. This check can be performed by concatenating substr with itself ratio times, such that the length of the string thus formed is equal to the length of s. Hence ratio = len(s) // len(substr).

Return when the first such substring is found. This would provide the smallest possible substring, if one exists.

def check_repeat(s):
    for i in range(1, len(s)):
        substr = s[:i]
        # Integer division: a float ratio cannot multiply a string in Python 3.
        ratio = len(s) // len(substr)
        if substr * ratio == s:
            print('Repeating on "%s"' % substr)
            return
    print('Non repeating')

>>> check_repeat('254725472547')
Repeating on "2547"
>>> check_repeat('abcdeabcdeabcdeabcde')
Repeating on "abcde"

#6

First, halve the string as long as it's a "2 part" duplicate. This reduces the search space if there is an even number of repeats. Then, working forwards to find the smallest repeating string, check whether splitting the full string by increasingly larger sub-strings results in only empty values. Only sub-strings up to length // 2 need to be tested since anything over that would have no repeats.

def shortest_repeat(orig_value):
    if not orig_value:
        return None

    value = orig_value

    while True:
        len_half = len(value) // 2
        first_half = value[:len_half]

        if first_half != value[len_half:]:
            break

        value = first_half

    len_value = len(value)
    split = value.split

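    # The upper bound is inclusive; without the + 1, a value of length 3 such as 'aaa' is never tested.
    for i in (i for i in range(1, len_value // 2 + 1) if len_value % i == 0):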
        if not any(split(value[:i])):
            return value[:i]

    return value if value != orig_value else None

This returns the shortest match or None if there is no match.
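The splitting check in the loop relies on how str.split() behaves when the separator tiles the whole string: every piece between separators is empty. A tiny illustration, using made-up sample strings:

```python
value = "abcabcabc"

# When the separator covers the string exactly, only empty pieces remain.
print(value.split("abc"))        # ['', '', '', '']
print(any(value.split("abc")))   # False

# Any leftover text shows up as a non-empty piece, so any() is True.
print("abcabx".split("abc"))       # ['', 'abx']
print(any("abcabx".split("abc")))  # True
```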
