Python string repetition: identifying runs of adjacent repeated characters in a string (Python)

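A bit of a wildcard entry:

Pro: quite fast; for example, a string of 4 million characters is analyzed almost instantly.

Con: relies on numpy; convoluted.

Here goes: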

import numpy as np

a = "etr" + 1_000_000 * "zr" + "hhh" + 1_000_000 * "Ar"

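# Length of the longest run of identical adjacent characters (evaluated in an interactive session)
np.max(np.diff(np.r_[-1, np.where(np.diff(np.frombuffer(a.encode('utf16'), dtype=np.uint16)[1:]))[0], len(a) - 1]))

# -> 3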
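For cross-checking the one-liner on small inputs, a NumPy-free reference can be handy. The following is a minimal sketch using `itertools.groupby` from the standard library; `longest_run_reference` is a name introduced here for illustration and is not part of the original answer. It is likely slower on multi-million-character strings, so it is meant as a correctness baseline only.

```python
from itertools import groupby


def longest_run_reference(s):
    """Length of the longest run of identical adjacent characters (0 for an empty string)."""
    # groupby yields one group per run of equal neighbouring characters;
    # counting the members of each group gives the run lengths.
    return max((sum(1 for _ in group) for _, group in groupby(s)), default=0)


# A shortened version of the test string used above
print(longest_run_reference("etr" + 5 * "zr" + "hhh" + 5 * "Ar"))  # 3
```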

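How it works:

Encode the string to a byte string with a fixed number of bytes per character

Interpret the buffer as a numpy array

Compute the "derivative" (the difference between neighbouring elements)

Find the nonzero positions, i.e. the places where the character changes

The distances between those positions are the repeat counts

Take the maximum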
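The steps above can be sketched one at a time on a short string, so that each intermediate array is visible. `longest_run_steps` is a hypothetical helper written for this illustration; it performs the same calls as the one-liner but keeps the intermediate results.

```python
import numpy as np


def longest_run_steps(s):
    """Return the intermediate arrays of the one-liner for a string s."""
    # 1. Encode to UTF-16: a byte-order mark followed by two bytes per character
    #    (assumes all characters are in the Basic Multilingual Plane).
    raw = s.encode('utf16')
    # 2. View the bytes as 16-bit integers and drop the byte-order mark.
    codes = np.frombuffer(raw, dtype=np.uint16)[1:]
    # 3. "Derivative": nonzero wherever two neighbouring characters differ.
    delta = np.diff(codes)
    # 4. Indices i with s[i] != s[i + 1], i.e. the last index of each run but the final one.
    change = np.where(delta)[0]
    # 5. Add sentinels so the first and last runs are counted, then take the distances.
    runs = np.diff(np.r_[-1, change, len(s) - 1])
    # 6. The longest run.
    return codes, change, runs, int(np.max(runs))


codes, change, runs, longest = longest_run_steps("aabbbc")
print(change.tolist())  # [1, 4]
print(runs.tolist())    # [2, 3, 1]
print(longest)          # 3
```

Note that UTF-16 is fixed-width only for characters in the Basic Multilingual Plane. For text containing other characters (emoji, for instance), the array length no longer matches `len(s)`, so this sketch does not apply as written.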

UPDATE:

Here is a hybrid version that does some crude short-circuiting, plus some basic benchmarking to find the best parameters:

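import numpy as np
from timeit import timeit

# Run length to search for
occ = 4

# Number of 'gr' pairs placed before the run; the run starts at index 8 + 2 * o
loc = (10, 20, 40, 80, 160, 320, 1000, 2000, 4000, 8000, 16000, 32000, 64000,
       125000, 250000, 500000, 1_000_000, 2_000_000)

# One test string of about 4 million characters per entry in loc
a = ['pafoe<03' + o * 'gr' + occ * 'x' + (2_000_000 - o) * 'u1'
     + 'leto50d-fjeoa'[occ:] for o in loc]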

def brute_force(a):
    # Scan the whole string and return the length of its longest run
    return np.max(np.diff(np.r_[-1, np.where(np.diff(np.frombuffer(
        a.encode('utf16'), dtype=np.uint16)[1:]))[0], len(a) - 1]))

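def reverse_bisect(a, chunk, encode_all=True):
    j = 0
    i = chunk  # not used below
    n = len(a)
    if encode_all:
        # Encode the whole string once up front
        av = np.frombuffer(a.encode('utf16'), dtype=np.uint16)[1:]
    while j < n:
        if encode_all:
            s = av[j : j + chunk]
        else:
            # Encode only the current chunk
            s = np.frombuffer(a[j:j+chunk].encode('utf16'), dtype=np.uint16)[1:]
        # Short-circuit as soon as a run of at least occ characters is found
        if np.max(np.diff(np.r_[-1, np.where(np.diff(s))[0], len(s)-1])) >= occ:
            return True
        # Overlap consecutive chunks by occ - 1 characters, then double the chunk size
        j += chunk - occ + 1
        chunk *= 2
    return False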

leave_out = 2
out = []

print('first repeat at', loc[:-leave_out])
print('brute force {}'.format(
    (timeit('[f(a) for a in A]', number=100, globals={
        'f': brute_force, 'A': a[:-leave_out]}))))
print('hybrid (reverse bisect)')
for chunk in 2**np.arange(2, 18):
    out.append(timeit('[f(a,c,e) for a in A]', number=100, globals={
        'f': reverse_bisect, 'A': a[:-leave_out], 'c': chunk, 'e': True}))
    out.append(timeit('[f(a,c,e) for a in A]', number=100, globals={
        'f': reverse_bisect, 'A': a[:-leave_out], 'c': chunk, 'e': False}))
    print('chunk: {}, timings: encode all {} -- encode chunks {}'.format(
        chunk, out[-2], out[-1]))
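The stepping rule in `reverse_bisect` (advance by `chunk - occ + 1`, then double `chunk`) can be made visible with a small standalone sketch. `chunk_windows` is a hypothetical helper, not part of the original code; it only lists the slices that would be inspected, assuming `chunk >= occ`.

```python
def chunk_windows(n, chunk, occ):
    """Yield the (start, stop) slices inspected for a string of length n."""
    j = 0
    while j < n:
        yield j, min(j + chunk, n)
        # Step back by occ - 1 so that a run of length occ cut by the end
        # of this window lies completely inside the next one.
        j += chunk - occ + 1
        # Grow the window so that late hits still need few iterations.
        chunk *= 2


print(list(chunk_windows(30, 4, 3)))
# [(0, 4), (2, 10), (8, 24), (22, 30)]
```

Each window overlaps the previous one by `occ - 1` characters, which is the reason a run straddling a window boundary is not missed. The doubling keeps the number of windows roughly logarithmic in the string length while keeping early windows small, so an early repeat is found after inspecting only a few characters.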

Sample runs (the second lists only the first three positions, which would correspond to leave_out = 15):

first repeat at (10, 20, 40, 80, 160, 320, 1000, 2000, 4000, 8000, 16000, 32000, 64000, 125000, 250000, 500000)

brute force 90.26514193788171

hybrid (reverse bisect)

chunk: 4, timings: encode all 5.257935176836327 -- encode chunks 2.3392367498017848

chunk: 8, timings: encode all 5.210895746946335 -- encode chunks 2.288218504982069

chunk: 16, timings: encode all 5.268893962958828 -- encode chunks 2.2223802611697465

chunk: 32, timings: encode all 5.109196993988007 -- encode chunks 2.1715646600350738

chunk: 64, timings: encode all 5.05742059298791 -- encode chunks 2.1255820950027555

chunk: 128, timings: encode all 5.110778157133609 -- encode chunks 2.100305920932442

chunk: 256, timings: encode all 5.058305847924203 -- encode chunks 2.153960411902517

chunk: 512, timings: encode all 5.108077083015814 -- encode chunks 2.056686638854444

chunk: 1024, timings: encode all 4.969490061048418 -- encode chunks 2.0368234540801495

chunk: 2048, timings: encode all 5.153041162993759 -- encode chunks 2.465495347045362

chunk: 4096, timings: encode all 5.28073402796872 -- encode chunks 2.173405918991193

chunk: 8192, timings: encode all 5.044360157102346 -- encode chunks 2.1234876308590174

chunk: 16384, timings: encode all 5.294338152976707 -- encode chunks 2.334656815044582

chunk: 32768, timings: encode all 5.7856643970590085 -- encode chunks 2.877617093967274

chunk: 65536, timings: encode all 7.04935942706652 -- encode chunks 4.1559580829925835

chunk: 131072, timings: encode all 7.516369879012927 -- encode chunks 4.553452031919733

first repeat at (10, 20, 40)

brute force 16.363576064119115

hybrid (reverse bisect)

chunk: 4, timings: encode all 0.6122389689553529 -- encode chunks 0.045893668895587325

chunk: 8, timings: encode all 0.5982049370650202 -- encode chunks 0.03538667503744364

chunk: 16, timings: encode all 0.5907809699419886 -- encode chunks 0.025738760828971863

chunk: 32, timings: encode all 0.5741697370540351 -- encode chunks 0.01634934707544744

chunk: 64, timings: encode all 0.5719085780438036 -- encode chunks 0.013115004170686007

chunk: 128, timings: encode all 0.5666680270805955 -- encode chunks 0.011037093820050359

chunk: 256, timings: encode all 0.5664500128477812 -- encode chunks 0.010536623885855079

chunk: 512, timings: encode all 0.5695593091659248 -- encode chunks 0.01133729494176805

chunk: 1024, timings: encode all 0.5688401609659195 -- encode chunks 0.012476094998419285

chunk: 2048, timings: encode all 0.5702746720053256 -- encode chunks 0.014690137933939695

chunk: 4096, timings: encode all 0.5782928131520748 -- encode chunks 0.01891179382801056

chunk: 8192, timings: encode all 0.5943365979474038 -- encode chunks 0.0272749038413167

chunk: 16384, timings: encode all 0.609349318081513 -- encode chunks 0.04354232898913324

chunk: 32768, timings: encode all 0.6489383969455957 -- encode chunks 0.07695812894962728

chunk: 65536, timings: encode all 0.7388215309474617 -- encode chunks 0.14061277196742594

chunk: 131072, timings: encode all 0.8899400909431279 -- encode chunks 0.2977339250501245
