C语言研讨字符串查找算法,c – 哪种字符串查找算法适用于此?

大多数字符串搜索算法的问题是它们将至少花费时间O(k)来返回k个匹配,所以如果你有一个字符串说100万“a”s,以及100万个小字符串查询“a”,那么它将花费大约100万次,数百万次迭代来计算所有比赛!

另一种线性时间方法是:

>构造一个大字符串的后缀树:O(n)其中n是len(大字符串)

>预先计算后缀树中每个节点下面的后缀数:O(n)

>对于每个小字符串,在后缀树中找到具有小字符串作为后缀的节点:O(m)其中m是len(小字符串)

>将总节点数添加到该节点下方的后缀数. (每个后缀对应于大字符串中小字符串的不同匹配)

这将花费时间O(n p),其中n是大字符串的长度,p是所有小字符串的总长度.

示例代码

根据要求,这里有一些Python中使用这种方法的小(ish)示例代码:

from collections import defaultdict

class SuffixTree:

def __init__(self):

"""Returns an empty suffix tree"""

self.T=''

self.E={}

self.nodes=[-1] # 0th node is empty string

def add(self,s):

"""Adds the input string to the suffix tree.

This inserts all substrings into the tree.

End the string with a unique character if you want a leaf-node for every suffix.

Produces an edge graph keyed by (node,character) that gives (first,last,end)

This means that the edge has characters from T[first:last+1] and goes to node end."""

origin,first,last = 0,len(self.T),len(self.T)-1

self.T+=s

nc = len(self.nodes)

self.nodes += [-1]*(2*len(s))

T=self.T

E=self.E

nodes=self.nodes

Lm1=len(T)-1

for last_char_index in xrange(first,len(T)):

c=T[last_char_index]

last_parent_node = -1

while 1:

parent_node = origin

if first>last:

if (origin,c) in E:

break

else:

key = origin,T[first]

edge_first,edge_last,edge_end = E[key]

span = last - first

A = edge_first+span

m = T[A+1]

if m==c:

break

E[key] = (edge_first,A,nc)

nodes[nc] = origin

E[nc,m] = (A+1,edge_end)

parent_node = nc

nc+=1

E[parent_node,c] = (last_char_index,Lm1,nc)

nc+=1

if last_parent_node>0:

nodes[last_parent_node] = parent_node

last_parent_node = parent_node

if origin==0:

first+=1

else:

origin = nodes[origin]

if first <= last:

edge_first,edge_end=E[origin,T[first]]

span = edge_last-edge_first

while span <= last - first:

first+=span+1

origin = edge_end

if first <= last:

edge_first,edge_end = E[origin,T[first]]

span = edge_last - edge_first

if last_parent_node>0:

nodes[last_parent_node] = parent_node

last+=1

if first <= last:

edge_first,T[first]]

span = edge_last - edge_first

return self

def make_choices(self):

"""Construct a sorted list for each node of the possible continuing characters"""

choices = [list() for n in xrange(len(self.nodes))] # Contains set of choices for each node

for (origin,c),edge in self.E.items():

choices[origin].append(c)

choices=[sorted(s) for s in choices] # should not have any repeats by construction

self.choices=choices

return choices

def count_suffixes(self,term):

"""Recurses through the tree finding how many suffixes are based at each node.

Strings assumed to use term as the terminating character"""

C = self.suffix_counts = [0]*len(self.nodes)

choices = self.make_choices()

def f(node=0):

t=0

X=choices[node]

if len(X)==0:

t+=1 # this node is a leaf node

else:

for c in X:

if c==term:

t+=1

continue

first,end = self.E[node,c]

t+=f(end)

C[node]=t

return t

return f()

def count_matches(self,needle):

"""Return the count of matches for this needle in the suffix tree"""

i=0

node=0

E=self.E

T=self.T

while i

c=needle[i]

key=node,c

if key not in E:

return 0

first,node = E[key]

while i

if needle[i]!=T[first]:

return 0

i+=1

first+=1

return self.suffix_counts[node]

big="aaaaaaaaaaabbbbbbbbbcccccccccccddddddddddd"

small_strings=["a","ab","abc"]

s=SuffixTree()

term=chr(0)

s.add(big+term)

s.count_suffixes(term)

for needle in small_strings:

x=s.count_matches(needle)

print needle,'has',x,'matches'

它打印:

a has 11 matches

ab has 1 matches

abc has 0 matches

但是,在实践中,我建议您只使用预先存在的Aho-Corasick实现,因为我希望在您的特定情况下这会更快.

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值