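Notes: 1. While studying this topic I did not find a good Python implementation on this forum, so I wrote this post to fill the gap.
2. I will not repeat the concepts of regular expressions and NFAs; the focus is on the Python implementation and on the data structures it uses (which are quite elegant).
The first step in building an NFA is parsing the regular expression. The Dragon Book does this with a parse tree, and I was stuck on that for a while. Later I found that Thompson's original paper instead converts the regular expression from infix form to postfix form, which I find just as clear and easier to implement; the algorithm runs in O(n) time. For example, the regular expressions a.b, a|b, (a|b).c and (a|b)* become ab., ab|, ab|c. and ab|* respectively (here "." denotes concatenation, "|" denotes alternation, and "*" denotes zero or more repetitions). The conversion algorithm is called the Shunting Yard algorithm: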
def shunt(infix):
    # Operator precedence table: a higher number binds more tightly.
    # * = zero or more (postfix repetition)
    # . = concatenation
    # | = alternation
    specials = {'*': 50, '.': 40, '|': 30}
    pofix = ""  # postfix output
    stack = ""  # operator stack (a string used as a stack)
    # Loop through the string one character at a time
    for c in infix:
        if c == '(':
            stack = stack + c
        elif c == ')':
            # Pop operators to the output until the matching '('
            while stack[-1] != '(':
                pofix, stack = pofix + stack[-1], stack[:-1]
            # Remove '(' from stack
            stack = stack[:-1]
        elif c in specials:
            # Pop operators of higher or equal precedence first;
            # '(' is not in the table, so it counts as 0 and stays put
            while stack and specials.get(c, 0) <= specials.get(stack[-1], 0):
                pofix, stack = pofix + stack[-1], stack[:-1]
            stack = stack + c
        else:
            # Ordinary character: goes straight to the output
            pofix = pofix + c
    # Flush the remaining operators
    while stack:
        pofix, stack = pofix + stack[-1], stack[:-1]
    return pofix
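As a quick sanity check, the four example conversions can be reproduced with a small variant that keeps the operator stack in a list rather than a string. This is only a sketch: `to_postfix` and `PRECEDENCE` are names assumed here, not part of the code above, and the input is assumed to use an explicit "." for concatenation.

```python
# Hypothetical variant of shunt(): same precedence table, list-based stack.
PRECEDENCE = {'*': 50, '.': 40, '|': 30}

def to_postfix(infix):
    output = []  # postfix symbols, in order
    stack = []   # pending operators and '('
    for c in infix:
        if c == '(':
            stack.append(c)
        elif c == ')':
            # Pop back to the matching '(' and discard it
            while stack[-1] != '(':
                output.append(stack.pop())
            stack.pop()
        elif c in PRECEDENCE:
            # '(' counts as precedence 0, so an operator never pops it
            while stack and PRECEDENCE[c] <= PRECEDENCE.get(stack[-1], 0):
                output.append(stack.pop())
            stack.append(c)
        else:
            output.append(c)
    # Flush whatever operators are left
    while stack:
        output.append(stack.pop())
    return ''.join(output)

examples = ['a.b', 'a|b', '(a|b).c', '(a|b)*']
results = [to_postfix(e) for e in examples]
print(results)  # ['ab.', 'ab|', 'ab|c.', 'ab|*']
```

Note that (a|b).c yields ab|c. with a trailing ".", since the concatenation operator is emitted after both of its operands.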
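The second step is building the NFA. What data structure should represent an NFA? It turns out to be very compact: the start state and the accepting state are enough.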
class NFA:
    def __init__(self, start, end):
        # both start and end are States
        self.start = start
        self.end = end
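How is a State represented? Thompson's NFA construction allows a state two kinds of transitions: on a symbol and on the empty string (ε). A symbol transition leads to exactly one other state, an ε-transition leads to at most two other states, and a state never has both kinds at once. The state structure is: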
class State:
    def __init__(self, isEnd):
        self.isEnd = isEnd  # isEnd is bool
        self.transition = {}  # symbol -> target State
        self.epsilonTransitions = []  # target States reached on ε
Adding transitions to a state:
def addEpsilonTransition(come, to):
    come.epsilonTransitions.append(to)

def addTransiton(come, to, symbol):
    come.transition[symbol] = to
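To make the two kinds of transitions concrete, here is a minimal sketch that wires a few states by hand. The class and helpers are repeated so that the snippet runs on its own; the variable names (`s0`, `fork`, and so on) are only illustrative.

```python
class State:
    def __init__(self, isEnd):
        self.isEnd = isEnd
        self.transition = {}          # symbol -> single target State
        self.epsilonTransitions = []  # list of target States

def addEpsilonTransition(come, to):
    come.epsilonTransitions.append(to)

def addTransiton(come, to, symbol):
    come.transition[symbol] = to

# A symbol state: exactly one outgoing edge, labelled 'a'
s0, s1 = State(False), State(True)
addTransiton(s0, s1, 'a')

# A branching state: two outgoing ε-edges and no symbol edge
fork, left, right = State(False), State(False), State(False)
addEpsilonTransition(fork, left)
addEpsilonTransition(fork, right)

print(s0.transition['a'] is s1)      # True
print(len(fork.epsilonTransitions))  # 2
```

Because `transition` is a dict keyed by symbol, a state holds at most one target per symbol, while the list lets a state fan out to several ε-targets.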
Creating the two basic NFA fragments (each is returned as a (start, end) pair of states):
def fromEpsilon():
    start = State(False)
    end = State(True)
    addEpsilonTransition(start, end)
    return start, end

def fromSymbol(symbol):
    start = State(False)
    end = State(True)
    addTransiton(start, end, symbol)
    return start, end
With these pieces in place, we can build the NFA for a regular expression.
Create the NFA for the single symbol "a", written N(a):
def createNFA(symbol):
    start, end = fromSymbol(symbol)
    return NFA(start, end)
Create the NFA for "a|b" by taking the union of N(a) and N(b) → N(a|b):
def union(first, second):
    # New start state branches to both operands on ε
    start = State(False)
    addEpsilonTransition(start, first.start)
    addEpsilonTransition(start, second.start)
    # Both old accepting states feed a single new accepting state
    end = State(True)
    addEpsilonTransition(first.end, end)
    first.end.isEnd = False
    addEpsilonTransition(second.end, end)
    second.end.isEnd = False
    return NFA(start, end)
Create the NFA for the closure (Kleene star) "(a|b)*", N(a|b) → N((a|b)∗):
def closure(nfa):
    start = State(False)
    end = State(True)
    # Zero repetitions: skip the inner NFA entirely
    addEpsilonTransition(start, end)
    addEpsilonTransition(start, nfa.start)
    addEpsilonTransition(nfa.end, end)
    # Loop back for another repetition
    addEpsilonTransition(nfa.end, nfa.start)
    nfa.end.isEnd = False
    return NFA(start, end)
Finally, the NFA for concatenation, N(a.b):
def concat(first, second):
    addEpsilonTransition(first.end, second.start)
    first.end.isEnd = False
    return NFA(first.start, second.end)
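One way to see what these constructions produce is to build N((a|b)*.c) by hand and count the states reachable from its start state. The sketch below repeats the constructions in compact form so that it runs on its own; `reachable_states` is a hypothetical helper introduced only for this check.

```python
class State:
    def __init__(self, isEnd):
        self.isEnd = isEnd
        self.transition = {}
        self.epsilonTransitions = []

class NFA:
    def __init__(self, start, end):
        self.start, self.end = start, end

def createNFA(symbol):
    start, end = State(False), State(True)
    start.transition[symbol] = end
    return NFA(start, end)

def union(first, second):
    start, end = State(False), State(True)
    start.epsilonTransitions += [first.start, second.start]
    for old_end in (first.end, second.end):
        old_end.epsilonTransitions.append(end)
        old_end.isEnd = False
    return NFA(start, end)

def closure(nfa):
    start, end = State(False), State(True)
    start.epsilonTransitions += [end, nfa.start]
    nfa.end.epsilonTransitions += [end, nfa.start]
    nfa.end.isEnd = False
    return NFA(start, end)

def concat(first, second):
    first.end.epsilonTransitions.append(second.start)
    first.end.isEnd = False
    return NFA(first.start, second.end)

def reachable_states(nfa):
    """Collect every state reachable from nfa.start (hypothetical helper)."""
    seen, todo = set(), [nfa.start]
    while todo:
        state = todo.pop()
        if state in seen:
            continue
        seen.add(state)
        todo.extend(state.epsilonTransitions)
        todo.extend(state.transition.values())
    return seen

# Build N((a|b)*.c) bottom-up
nfa = concat(closure(union(createNFA('a'), createNFA('b'))), createNFA('c'))
states = reachable_states(nfa)
accepting = [s for s in states if s.isEnd]
print(len(states), len(accepting))  # 10 1
```

Each symbol contributes two states, union and closure each add two more, and this version of concat adds none because it only links the two fragments with an ε-transition. Exactly one accepting state remains, and it is `nfa.end`.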
Now let us put the pieces together and process "(a|b)*.c".
First convert the infix form to postfix form:
regex = '(a|b)*.c'
pofix = shunt(regex)
print(pofix)
Output:
ab|*c.
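Create a stack and read pofix from left to right, one character at a time.
1. If the character is a symbol, create the NFA for that symbol and push it onto the stack.
2. If the character is an operator, pop its operand(s) from the stack, build the NFA for that operator, and push the result back.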
def toNFA(postfix):
    if postfix == '':
        # Empty expression: an NFA that accepts only the empty string
        return NFA(*fromEpsilon())
    stack = []
    for c in postfix:
        if c == '.':
            # The second operand was pushed last, so it is popped first
            nfa2 = stack.pop()
            nfa1 = stack.pop()
            new_nfa = concat(nfa1, nfa2)
            stack.append(new_nfa)
        elif c == '|':
            nfa2 = stack.pop()
            nfa1 = stack.pop()
            new_nfa = union(nfa1, nfa2)
            stack.append(new_nfa)
        elif c == '*':
            nfa = stack.pop()
            new_nfa = closure(nfa)
            stack.append(new_nfa)
        else:
            nfa = createNFA(c)
            stack.append(nfa)
    return stack.pop()
postfix1 = 'ab|*c.'
nfa = toNFA(postfix1)
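The stack discipline used here can be traced with plain strings standing in for NFAs, which makes each push and pop visible. This is only an illustration: `trace_postfix` is a hypothetical helper, and each stack entry is the sub-expression that the corresponding NFA would represent.

```python
def trace_postfix(postfix):
    """Replay the stack steps with strings instead of NFAs (illustrative)."""
    stack = []
    snapshots = []
    for c in postfix:
        if c in ('.', '|'):
            right = stack.pop()  # second operand was pushed last
            left = stack.pop()
            stack.append('(' + left + c + right + ')')
        elif c == '*':
            stack.append(stack.pop() + '*')
        else:
            stack.append(c)
        snapshots.append(list(stack))  # copy of the stack after this step
    return snapshots

for step in trace_postfix('ab|*c.'):
    print(step)
# ['a']
# ['a', 'b']
# ['(a|b)']
# ['(a|b)*']
# ['(a|b)*', 'c']
# ['((a|b)*.c)']
```

A well-formed postfix expression leaves exactly one entry on the stack at the end, which is why the function above can finish with a single pop.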
This completes the algorithm.
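The post stops at construction. One way to check the resulting NFA is to simulate it on a few input strings by tracking the set of states it could be in. The sketch below is an assumption-laden illustration rather than part of the original code: `epsilon_closure` and `match` are hypothetical helpers, and the constructions are repeated in compact form so that the snippet runs on its own.

```python
class State:
    def __init__(self, isEnd):
        self.isEnd = isEnd
        self.transition = {}          # symbol -> State
        self.epsilonTransitions = []  # list of States

class NFA:
    def __init__(self, start, end):
        self.start, self.end = start, end

def createNFA(symbol):
    start, end = State(False), State(True)
    start.transition[symbol] = end
    return NFA(start, end)

def union(first, second):
    start, end = State(False), State(True)
    start.epsilonTransitions += [first.start, second.start]
    for old_end in (first.end, second.end):
        old_end.epsilonTransitions.append(end)
        old_end.isEnd = False
    return NFA(start, end)

def closure(nfa):
    start, end = State(False), State(True)
    start.epsilonTransitions += [end, nfa.start]
    nfa.end.epsilonTransitions += [end, nfa.start]
    nfa.end.isEnd = False
    return NFA(start, end)

def concat(first, second):
    first.end.epsilonTransitions.append(second.start)
    first.end.isEnd = False
    return NFA(first.start, second.end)

def toNFA(postfix):
    # Compact form of the stack-based construction (non-empty input assumed)
    stack = []
    for c in postfix:
        if c == '*':
            stack.append(closure(stack.pop()))
        elif c in ('.', '|'):
            nfa2 = stack.pop()
            nfa1 = stack.pop()
            stack.append(concat(nfa1, nfa2) if c == '.' else union(nfa1, nfa2))
        else:
            stack.append(createNFA(c))
    return stack.pop()

def epsilon_closure(states):
    """All states reachable from `states` using only ε-transitions."""
    seen, todo = set(states), list(states)
    while todo:
        for nxt in todo.pop().epsilonTransitions:
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen

def match(nfa, text):
    """Hypothetical helper: True if the NFA accepts the whole of `text`."""
    current = epsilon_closure({nfa.start})
    for ch in text:
        # Follow the symbol edge from every current state that has one
        moved = {s.transition[ch] for s in current if ch in s.transition}
        current = epsilon_closure(moved)
    return any(s.isEnd for s in current)

nfa = toNFA('ab|*c.')
for word in ['c', 'abbac', 'ab', '']:
    print(repr(word), match(nfa, word))
# 'c' True
# 'abbac' True
# 'ab' False
# '' False
```

Tracking a set of states avoids backtracking: each input character is processed once, and the set can never grow beyond the number of states in the NFA.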