Understanding DFA and NFA Regular Expression Engines


nathena


The task of a scanner generator, such as JLex, is to generate the transition tables or to synthesize the scanner program given a scanner specification (in the form of a set of REs). To do so, it must convert the REs into a single DFA. This is accomplished in two steps: first it converts the REs into a non-deterministic finite automaton (NFA), and then it converts the NFA into a DFA.

An NFA is similar to a DFA but it also permits multiple transitions over the same character and transitions over $\varepsilon$. In the case of multiple transitions from a state over the same character, when we are at this state and we read this character, we have more than one choice; the NFA succeeds if at least one of these choices succeeds. The $\varepsilon$ transition doesn't consume any input characters, so you may jump to another state for free.

Clearly DFAs are a subset of NFAs. But it turns out that DFAs and NFAs have the same expressive power. The problem is that when converting an NFA to a DFA we may get an exponential blowup in the number of states: in the worst case, an NFA with n states yields a DFA with up to 2^n states.

We will first learn how to convert an RE into an NFA. This is the easy part. There are only 5 rules, one for each type of RE:

As can be shown inductively, the above rules construct NFAs with only one final state. For example, the third rule indicates that, to construct the NFA for the RE AB, we construct the NFAs for A and B, which are represented as two boxes with one start state and one final state for each box. Then the NFA for AB is constructed by connecting the final state of A to the start state of B using an empty transition.

For example, the RE (a|b)c is mapped to the following NFA:
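The construction for (a|b)c can be sketched in C. This is only a minimal sketch under assumptions: the figure with the five rules is not reproduced here, so the code assumes the usual rules for a single character, concatenation, and alternation (the three that (a|b)c needs). The names `Frag`, `nfa_char`, `nfa_concat`, `nfa_alt`, `nfa_accepts`, and `build_example` are hypothetical, and state sets are stored as bit masks, which limits the sketch to 32 states.

```c
#include <stdint.h>

#define MAX_EDGES 64
#define EPS 0            /* label 0 marks an empty (epsilon) transition */

typedef struct { int from, to; char label; } Edge;
typedef struct { int start, final; } Frag;   /* one start, one final state */

Edge edges[MAX_EDGES];
int n_edges = 0, n_states = 0;

int new_state(void) { return n_states++; }

void add_edge(int from, char label, int to) {
    edges[n_edges].from = from;
    edges[n_edges].label = label;
    edges[n_edges].to = to;
    n_edges++;
}

/* Rule for a single character c: start --c--> final. */
Frag nfa_char(char c) {
    Frag f;
    f.start = new_state();
    f.final = new_state();
    add_edge(f.start, c, f.final);
    return f;
}

/* Rule for AB: connect the final state of A to the start state of B. */
Frag nfa_concat(Frag a, Frag b) {
    Frag f;
    add_edge(a.final, EPS, b.start);
    f.start = a.start;
    f.final = b.final;
    return f;
}

/* Rule for A|B: a new start and a new final state, joined to both
   branches by empty transitions. */
Frag nfa_alt(Frag a, Frag b) {
    Frag f;
    f.start = new_state();
    f.final = new_state();
    add_edge(f.start, EPS, a.start);
    add_edge(f.start, EPS, b.start);
    add_edge(a.final, EPS, f.final);
    add_edge(b.final, EPS, f.final);
    return f;
}

/* Add every state reachable through empty transitions. */
uint32_t closure(uint32_t set) {
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int i = 0; i < n_edges; i++) {
            if (edges[i].label == EPS
                && ((set >> edges[i].from) & 1u)
                && !((set >> edges[i].to) & 1u)) {
                set |= 1u << edges[i].to;
                changed = 1;
            }
        }
    }
    return set;
}

/* Simulate the NFA on s by tracking the set of states it may be in. */
int nfa_accepts(Frag nfa, const char *s) {
    uint32_t cur = closure(1u << nfa.start);
    for (; *s != '\0'; s++) {
        uint32_t next = 0;
        for (int i = 0; i < n_edges; i++)
            if (edges[i].label == *s && ((cur >> edges[i].from) & 1u))
                next |= 1u << edges[i].to;
        cur = closure(next);
    }
    return ((cur >> nfa.final) & 1u) != 0;
}

/* Build the NFA for (a|b)c from the three rules above. */
Frag build_example(void) {
    n_edges = n_states = 0;
    return nfa_concat(nfa_alt(nfa_char('a'), nfa_char('b')), nfa_char('c'));
}
```

Calling `nfa_accepts(build_example(), "ac")` follows the a branch and then the c edge. Because every rule returns a fragment with exactly one start and one final state, the rules compose without special cases.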

The next step is to convert an NFA to a DFA (called subset construction). Suppose that you assign a number to each NFA state. The DFA states generated by subset construction are labeled with sets of numbers, instead of just one number. For example, a DFA state may have been assigned the set {5, 6, 8}. This indicates that arriving at the state labeled {5, 6, 8} in the DFA is the same as arriving at state 5, state 6, or state 8 in the NFA when parsing the same input. (Recall that a particular input sequence, when parsed by a DFA, leads to a unique state, while when parsed by an NFA it may lead to multiple states.)

First we need to handle transitions that lead to other states for free (without consuming any input). These are the $\varepsilon$ transitions. We define the closure of an NFA node as the set of all the nodes reachable from this node using zero, one, or more $\varepsilon$ transitions. For example, the closure of node 1 in the left figure below

is the set {1, 2}. The start state of the constructed DFA is labeled by the closure of the NFA start state. For every DFA state labeled by some set {s1,..., sn} and for every character c in the language alphabet, you find all the states reachable from s1, s2, ..., or sn using c arrows and you take the union of the closures of these nodes. If this set is not the label of any other node in the DFA constructed so far, you create a new DFA node with this label. For example, node {1, 2} in the DFA above has an arrow to {3, 4, 5} for the character a since the NFA node 3 can be reached from 1 on a and nodes 4 and 5 can be reached from 2. The b arrow for node {1, 2} goes to the error node, which is associated with an empty set of NFA nodes.
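The closure and a single subset-construction step can be sketched as follows. The figure is not reproduced here, so the NFA in the code is an assumption chosen to match the description: node 1 has an empty transition to node 2, node 1 reaches node 3 on a, and node 2 reaches nodes 4 and 5 on a. The names `closure` and `dfa_move` are hypothetical, and sets of NFA nodes are stored as bit masks.

```c
#include <stdint.h>

#define N_STATES 6                 /* NFA nodes 1..5; index 0 is unused */

/* eps[s], on_a[s] and on_b[s] are bit masks of the nodes reachable from s
   by one empty arrow, one 'a' arrow and one 'b' arrow (bit k = node k).
   This NFA is assumed, not taken from the missing figure. */
const uint32_t eps[N_STATES]  = { 0, 1u << 2, 0, 0, 0, 0 };
const uint32_t on_a[N_STATES] = { 0, 1u << 3, (1u << 4) | (1u << 5), 0, 0, 0 };
const uint32_t on_b[N_STATES] = { 0, 0, 0, 0, 0, 0 };

/* Closure: keep adding the targets of empty arrows until nothing changes. */
uint32_t closure(uint32_t set) {
    uint32_t prev;
    do {
        prev = set;
        for (int s = 1; s < N_STATES; s++)
            if ((set >> s) & 1u)
                set |= eps[s];
    } while (set != prev);
    return set;
}

/* One subset-construction step: from the DFA state labeled `set`, follow
   every arrow labeled c, then take the closure of the union of the targets.
   The empty set (0) plays the role of the error node. */
uint32_t dfa_move(uint32_t set, char c) {
    const uint32_t *table = (c == 'a') ? on_a : on_b;
    uint32_t targets = 0;
    for (int s = 1; s < N_STATES; s++)
        if ((set >> s) & 1u)
            targets |= table[s];
    return closure(targets);
}
```

Here `closure(1u << 1)` has bits 1 and 2 set, which is the label {1, 2} of the DFA start state, and `dfa_move` applied to it on a has bits 3, 4, and 5 set. A full subset construction would repeat `dfa_move` for every newly discovered set and every character until no new set appears.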

The following NFA recognizes (a| b)*(abb | a+b), even though it wasn't constructed with the above RE-to-NFA rules. It has the following DFA:

Deterministic Finite Automata (DFAs)

A DFA represents a finite state machine that recognizes an RE. For example, the following DFA:

recognizes (abc+)+. A finite automaton consists of a finite set of states, a set of transitions (moves), one start state, and a set of final states (accepting states). In addition, a DFA has a unique transition for every state-character combination. For example, the previous figure has 4 states, state 1 is the start state, and state 4 is the only final state.

A DFA accepts a string if, starting from the start state and moving from state to state, each time following the arrow that corresponds to the current input character, it reaches a final state when the entire input string is consumed. Otherwise, it rejects the string.

The previous figure represents a DFA even though it is not complete (i.e., not all state-character transitions have been drawn). The complete DFA is:

but it is very common to ignore state 0 (called the error state) since it is implied. (The arrows with two or more characters indicate transitions in case of any of these characters.) The error state serves as a black hole, which doesn't let you escape.

A DFA is represented by a transition table T, which gives the next state T[s, c] for a state s and a character c. For example, the T for the DFA above is:
state  a  b  c
0      0  0  0
1      2  0  0
2      0  3  0
3      0  0  4
4      2  0  4

Suppose that we want to build a scanner for the REs:

for-keyword = for
identifier = [a-z][a-z0-9]*
The corresponding DFA has 4 final states: one to accept the for-keyword and 3 to accept an identifier:

(the error state is omitted again). Notice that for each state and for each character, there is a single transition.

A scanner based on a DFA uses the DFA's transition table as follows:

state = initial_state;
current_character = get_next_character();
while ( current_character != EOF )
{   state = T[state][current_character];
    if ( state == ERROR )
        break;    /* the error state cannot be left, so stop early */
    current_character = get_next_character();
};
if ( is_final_state(state) )
    `we have a valid token'
else `report an error'
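For a concrete version of this loop, here is a minimal C sketch that runs the transition table of the (abc+)+ DFA over a whole string. It makes a few assumptions that are not in the text: the end of a C string plays the role of EOF, a hypothetical helper `column` maps characters to table columns, and any character outside {a, b, c} leads to the error state.

```c
enum { ERROR = 0, START = 1, N_STATES = 5 };

/* Transition table of the DFA for (abc+)+; row = state, columns = a, b, c. */
const int T[N_STATES][3] = {
    {0, 0, 0},   /* 0: error state */
    {2, 0, 0},   /* 1: start state */
    {0, 3, 0},
    {0, 0, 4},
    {2, 0, 4},   /* 4: the only final state */
};

/* Map an input character to a column of T; -1 means "not in the alphabet". */
int column(char c) {
    switch (c) {
    case 'a': return 0;
    case 'b': return 1;
    case 'c': return 2;
    default:  return -1;
    }
}

int is_final_state(int state) { return state == 4; }

/* Return 1 if the whole string is accepted by the DFA, 0 otherwise. */
int accepts(const char *input) {
    int state = START;
    for (; *input != '\0'; input++) {
        int col = column(*input);
        state = (col < 0) ? ERROR : T[state][col];
        if (state == ERROR)
            break;            /* the error state is a trap */
    }
    return is_final_state(state);
}
```

With this table, `accepts("abccabc")` ends in state 4 and succeeds, while `accepts("abca")` ends in state 2 and fails. The work per character is one table lookup, regardless of how complicated the RE was.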

This program does not need to apply the longest-match disambiguation rule explicitly, since it always runs to EOF. The following program is more general: it does not expect EOF at the end of a token, and it applies the longest-match rule by remembering the last final state it passed through.

state = initial_state;
final_state = ERROR;
current_character = get_next_character();
while ( current_character != EOF )
{   next_state = T[state][current_character];
    if ( next_state == ERROR )
        break;
    state = next_state;
    if ( is_final_state(state) )
        final_state = state;    /* remember the last final state seen */
    current_character = get_next_character();
};
if ( final_state == ERROR )
    `report an error'
else if ( state != final_state )
    `we have a valid token but we need to backtrack
     (to put characters back into the input stream)'
else `we have a valid token'
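The longest-match idea can be sketched in C on the (abc+)+ table. This sketch departs from the program above in one respect, which is an assumption of the example: instead of pushing characters back into an input stream, it scans a string in memory and records the input position at the last final state, so "backtracking" is simply returning that position. The names `longest_match` and `column` are hypothetical.

```c
enum { ERROR = 0, START = 1, FINAL = 4 };

/* Transition table of the DFA for (abc+)+; row = state, columns = a, b, c. */
const int T[5][3] = {
    {0, 0, 0},   /* 0: error state */
    {2, 0, 0},   /* 1: start state */
    {0, 3, 0},
    {0, 0, 4},
    {2, 0, 4},   /* 4: the only final state */
};

/* Map an input character to a column of T; -1 means "not in the alphabet". */
int column(char c) {
    switch (c) {
    case 'a': return 0;
    case 'b': return 1;
    case 'c': return 2;
    default:  return -1;
    }
}

/* Return the length of the longest prefix of input that is a token,
   or 0 if no prefix is accepted. */
int longest_match(const char *input) {
    int state = START;
    int len = 0;              /* characters consumed so far */
    int last_final_len = 0;   /* value of len at the last final state seen */
    while (input[len] != '\0') {
        int col = column(input[len]);
        int next_state = (col < 0) ? ERROR : T[state][col];
        if (next_state == ERROR)
            break;
        state = next_state;
        len++;
        if (state == FINAL)
            last_final_len = len;
    }
    return last_final_len;
}
```

On the input "abcab" the scanner consumes all five characters but returns 3, because the last final state was reached after "abc". The difference between the characters consumed and the returned length is the number of characters a stream-based scanner would have to put back.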

Is there any better (more efficient) way to build a scanner out of a DFA? Yes! We can hardwire the state transition table into a program (with lots of gotos). You've learned in your programming language course never to use gotos. But here we are talking about a program generated automatically, which no one needs to look at. The idea is the following. Suppose that you have a transition from state s1 to s2 when the current character is c. Then you generate the program:

s1: current_character = get_next_character();
    ...
    if ( current_character == 'c' )
        goto s2;
    ...
s2: current_character = get_next_character();
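As an illustration of what such generated code might look like, here is a hand-written C sketch of the (abc+)+ DFA with one label per state and one goto per transition. It is an assumption of the example, not the output of any particular generator: the function name `accepts_goto` is hypothetical, reading `*p++` stands in for get_next_character(), and the string terminator stands in for EOF.

```c
/* Hard-wired DFA for (abc+)+: each label is a state, each goto a transition.
   Falling through to "return 0" corresponds to entering the error state. */
int accepts_goto(const char *p) {
    char c;

    /* state 1 (start state); it needs no label because nothing jumps to it */
    c = *p++;
    if (c == 'a') goto s2;
    return 0;

s2: c = *p++;
    if (c == 'b') goto s3;
    return 0;

s3: c = *p++;
    if (c == 'c') goto s4;
    return 0;

s4: c = *p++;                 /* state 4 is the only final state */
    if (c == 'c') goto s4;
    if (c == 'a') goto s2;
    return c == '\0';         /* accept only if the input ended here */
}
```

The current state is no longer a variable; it is the position in the program. That removes the table lookup from the inner loop, at the cost of code whose size grows with the number of states and transitions.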

(Figure: lexical analysis diagram for regular expressions)

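To deepen our understanding of regular expressions, let us look at how regular expression engines work internally.
There are two kinds of regular expression engines: text-directed engines and regex-directed engines.
They are known as DFA engines and NFA engines, respectively.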
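I. Deterministic finite automata
Definition (deterministic finite automaton): A deterministic finite automaton is a five-tuple DFA A = (Q, M, f, q_0, F), where
1. Q is a finite set of states
2. M is a finite set of input symbols
3. f: Q x M --> Q is the state transition function
4. q_0 in Q is the initial state
5. F, a subset of Q, is the set of accepting states
Definition (extended transition function): The extended transition function f' is defined as follows, where e denotes the empty string.
1. f'(q, e) = q
2. If w = xa with a in M, then f'(q, w) = f(f'(q, x), a)
f' is defined on the empty string e. This is not strictly necessary; the empty string is introduced for formal convenience.
Definition (language accepted by a DFA): Let A be a DFA, and let
L(A) = {w : f'(q_0, w) in F}
L(A) is called the language accepted by A.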
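The two clauses of the extended transition function translate directly into a recursive C function. The DFA used here is an assumption made for the example: Q = {0, 1}, M = {a, b}, q_0 = 0, F = {0}, and the state records whether an odd number of a's has been read. The names `f_ext` and `in_language` are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* f: Q x M -> Q for the assumed DFA: reading 'a' flips the state,
   reading anything else leaves it unchanged. */
int f(int q, char a) {
    return (a == 'a') ? 1 - q : q;
}

/* f'(q, w) restricted to the first n symbols of w, following the definition:
   f'(q, e) = q and f'(q, xa) = f(f'(q, x), a). */
int f_ext(int q, const char *w, size_t n) {
    if (n == 0)
        return q;                            /* w is the empty string e */
    return f(f_ext(q, w, n - 1), w[n - 1]);  /* w = xa with a = w[n-1] */
}

/* w is in L(A) iff f'(q_0, w) is in F; here q_0 = 0 and F = {0}. */
int in_language(const char *w) {
    return f_ext(0, w, strlen(w)) == 0;
}
```

For example, `in_language("abab")` unfolds to f(f(f(f(0, a), b), a), b), which evaluates to 0, so the string is accepted. A scanner would use the equivalent loop rather than recursion; the recursive form is shown only because it mirrors the definition.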
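II. Non-deterministic finite automata
Definition (non-deterministic finite automaton): A non-deterministic finite automaton is a five-tuple NFA A = (Q, M, f, q_0, F), where
1. Q is a finite set of states
2. M is a finite set of input symbols
3. f: Q x M --> Power(Q) is the state transition function
4. q_0 in Q is the initial state
5. F, a subset of Q, is the set of accepting states
Definition (extended transition function): f' is defined as follows.
1. f'(q, e) = {q}
2. If w = xa with a in M, then f'(q, xa) = Union (f(r, a)), taken over all r in f'(q, x)
Definition (language accepted by an NFA): Let A be an NFA, and let
L(A) = {w : f'(q_0, w) intersect F is non-empty}
L(A) is called the language accepted by A.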
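The NFA version of the extended transition function can be sketched the same way, with subsets of Q stored as bit masks. The NFA here is an assumption made for the example: Q = {0, 1, 2}, M = {a, b}, q_0 = 0, F = {2}, and it accepts the strings that end in "ab". The names `f_ext` and `in_language` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define N 3   /* Q = {0, 1, 2}; bit r of a mask means "state r is in the set" */

/* f: Q x M -> Power(Q) for the assumed NFA. */
uint32_t f(int q, char a) {
    if (q == 0 && a == 'a') return (1u << 0) | (1u << 1);
    if (q == 0 && a == 'b') return 1u << 0;
    if (q == 1 && a == 'b') return 1u << 2;
    return 0;                      /* no transition: the empty set */
}

/* f'(q, e) = {q};  f'(q, xa) = union of f(r, a) over all r in f'(q, x). */
uint32_t f_ext(int q, const char *w, size_t n) {
    if (n == 0)
        return 1u << q;
    uint32_t reached = f_ext(q, w, n - 1);
    uint32_t result = 0;
    for (int r = 0; r < N; r++)
        if ((reached >> r) & 1u)
            result |= f(r, w[n - 1]);
    return result;
}

/* w is in L(A) iff f'(q_0, w) intersects F; here q_0 = 0 and F = {2}. */
int in_language(const char *w) {
    return (f_ext(0, w, strlen(w)) & (1u << 2)) != 0;
}
```

On the input "ab" the reachable sets are {0}, then {0, 1}, then {0, 2}; the last one contains the accepting state 2, so the string is accepted.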
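Conclusion: DFAs and NFAs are equivalent, in the sense that they have the same power to define languages. Given a DFA, there is always an NFA that accepts the same language, and vice versa.
Given an NFA, an equivalent DFA can be built by the "subset construction" method, which uses Power(Q) as the state set of the DFA. The proof by induction shows that, for any given input, the label of the state the DFA reaches is exactly the set of states the NFA reaches.
(Reference: textbooks on discrete mathematics and compiler principles, if I remember correctly.)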
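The subset construction itself can be sketched in a few lines of C. The NFA is an assumption made for the example (Q = {0, 1, 2}, M = {a, b}, q_0 = 0, accepting the strings that end in "ab"), it has no empty transitions, and the names `nfa_f`, `step`, `lookup`, and `subset_construction` are hypothetical. Only the subsets that are actually reachable from {q_0} are created, which is why the result is usually much smaller than Power(Q).

```c
#include <stdint.h>

#define N 3                      /* NFA states 0..2 */
#define MAX_DFA (1 << N)         /* at most |Power(Q)| DFA states */

const char ALPHABET[] = "ab";

/* Transition function of the assumed NFA; the result is a bit mask. */
uint32_t nfa_f(int q, char a) {
    if (q == 0 && a == 'a') return (1u << 0) | (1u << 1);
    if (q == 0 && a == 'b') return 1u << 0;
    if (q == 1 && a == 'b') return 1u << 2;
    return 0;
}

/* Union of nfa_f(r, a) over every NFA state r in `set`. */
uint32_t step(uint32_t set, char a) {
    uint32_t result = 0;
    for (int r = 0; r < N; r++)
        if ((set >> r) & 1u)
            result |= nfa_f(r, a);
    return result;
}

uint32_t dfa_states[MAX_DFA];    /* label (a subset of Q) of each DFA state */
int dfa_next[MAX_DFA][2];        /* DFA transition table, columns a and b */

/* Return the index of the DFA state labeled `set`, adding it if it is new. */
int lookup(uint32_t set, int *count) {
    for (int i = 0; i < *count; i++)
        if (dfa_states[i] == set)
            return i;
    dfa_states[*count] = set;
    return (*count)++;
}

/* Subset construction; returns the number of DFA states created. */
int subset_construction(void) {
    int count = 0;
    lookup(1u << 0, &count);               /* DFA start state = {q_0} */
    for (int i = 0; i < count; i++)        /* count grows as states are added */
        for (int k = 0; k < 2; k++)
            dfa_next[i][k] = lookup(step(dfa_states[i], ALPHABET[k]), &count);
    return count;
}
```

For this NFA the construction creates three DFA states, labeled {0}, {0, 1}, and {0, 2}, out of the eight subsets in Power(Q). A DFA state is accepting when its label contains an accepting NFA state, so here only {0, 2} is accepting.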

/* Lexical analysis: fetch the next symbol of the regular expression.
   The g_* globals and the SymbolType constants are not shown in this excerpt. */
SymbolType getSymbol(void) {
    char *cc = &g_symbol_charvalue;
    char tmp; /* temporary variable */
    *cc = g_strRegExp[++g_scan_pos];
    if ((*cc>='0' && *cc<='9') || (*cc>='a' && *cc<='z') || (*cc>='A' && *cc<='Z')) {
        return g_symbol_type=INPUT_ELE;
    }
    /* end of the regular expression */
    if (*cc == '\0') {return g_symbol_type=END_REGEXP;}
    if (*cc == '(') {return g_symbol_type=AND_MACHINE_BEGIN;}
    if (*cc == ')') {return g_symbol_type=AND_MACHINE_END;}
    if (*cc == '[') {return g_symbol_type=OR_MACHINE_BEGIN;}
    if (*cc == ']') {return g_symbol_type=OR_MACHINE_END;}
    if (*cc == '|') {return g_symbol_type=BACKTRACE;}
    if (*cc == '^') {return g_symbol_type=NOT_OP;}
    if (*cc == '.') {return g_symbol_type=DOT;}
    /* escape sequences */
    if (*cc == '\\') {
        *cc = g_strRegExp[++g_scan_pos];
        if (*cc == 'd') {return g_symbol_type=NUMBER;}
        if (*cc == 'D') {return g_symbol_type=NOT_NUMBER;}
        if (*cc == 'f') {*cc = '\f'; return g_symbol_type=INPUT_ELE;}
        if (*cc == 'n') {*cc = '\n'; return g_symbol_type=INPUT_ELE;}
        if (*cc == 'r') {*cc = '\r'; return g_symbol_type=INPUT_ELE;}
        if (*cc == 't') {*cc = '\t'; return g_symbol_type=INPUT_ELE;}
        if (*cc == 'v') {*cc = '\v'; return g_symbol_type=INPUT_ELE;}
        if (*cc == 's') {return g_symbol_type=ALL_SPACE;}
        if (*cc == 'S') {return g_symbol_type=NOT_ALL_SPACE;}
        if (*cc == 'w') {return g_symbol_type=AZaz09_;}
        if (*cc == 'W') {return g_symbol_type=NOT_AZaz09_;}
        /* hexadecimal: \xH or \xHH */
        if (*cc == 'x') {
            tmp = g_strRegExp[++g_scan_pos];
            if (tmp>='0' && tmp<='9') {
                *cc = tmp - '0';
            } else if (tmp>='a' && tmp<='f') {
                *cc = tmp - 'a' + 10;
            } else if (tmp>='A' && tmp<='F') {
                *cc = tmp - 'A' + 10;
            } else {
                /* not a hex digit: return the character 'x' itself */
                g_scan_pos--;
                return g_symbol_type=INPUT_ELE;
            }
            /* second hex digit: test tmp, not *cc (which already holds a value) */
            tmp = g_strRegExp[++g_scan_pos];
            if (tmp>='0' && tmp<='9') {
                *cc *= 16;
                *cc += tmp - '0';
            } else if (tmp>='a' && tmp<='f') {
                *cc *= 16;
                *cc += tmp - 'a' + 10;
            } else if (tmp>='A' && tmp<='F') {
                *cc *= 16;
                *cc += tmp - 'A' + 10;
            } else {
                /* not a hex digit, so there is only one digit: \xF */
                g_scan_pos--;
                return g_symbol_type=INPUT_ELE;
            }

            return g_symbol_type=INPUT_ELE;
        }
        /* octal: up to three digits when the first digit is 0-3 */
        if (*cc>='0' && *cc<='3') {
            *cc -= '0';
            tmp = g_strRegExp[++g_scan_pos];
            if (tmp>='0' && tmp<='7') {
                *cc *= 8;
                *cc += tmp - '0';
            } else {
                /* only one octal digit, e.g. \3 */
                g_scan_pos--;
                return g_symbol_type=INPUT_ELE;
            }
            tmp = g_strRegExp[++g_scan_pos];
            if (tmp>='0' && tmp<='7') {
                *cc *= 8;
                *cc += tmp - '0';
            } else {
                /* only two octal digits, e.g. \37 */
                g_scan_pos--;
                return g_symbol_type=INPUT_ELE;
            }
            return g_symbol_type=INPUT_ELE;
        }
        /* octal: at most two digits when the first digit is 4-7 */
        if (*cc>='4' && *cc<='7') {
            *cc -= '0';
            tmp = g_strRegExp[++g_scan_pos];
            if (tmp>='0' && tmp<='7') {
                *cc *= 8;
                *cc += tmp - '0';
            } else {
                /* only one octal digit: \7 */
                g_scan_pos--;
                return g_symbol_type=INPUT_ELE;
            }
            return g_symbol_type=INPUT_ELE;
        } else {
            /* [^x0-7dDfnrtvsSwW]: any other escaped character stands for itself */
            return g_symbol_type=INPUT_ELE;
        }
    }
    /* repetition operators */
    if (*cc == '*') {return g_symbol_type=REPEAT_ZERO_MORE;}
    if (*cc == '+') {return g_symbol_type=REPEAT_ONCE_MORE;}
    if (*cc == '?') {return g_symbol_type=REPEAT_ZERO_ONCE;}
    if (*cc == '{') {
        *cc = g_strRegExp[++g_scan_pos];
        if (*cc>'9' || *cc<'0') {return g_symbol_type=UNKNOWN;}
        g_repeat_m = 0; /* read the m of {m,n} */
        while (*cc>='0' && *cc<='9') {
            g_repeat_m *= 10;
            g_repeat_m += *cc-'0';
            *cc = g_strRegExp[++g_scan_pos];
        }
        /* {m} */
        if (*cc == '}') {return g_symbol_type=REPEAT_RANGE_M;}
        if (*cc != ',') {return g_symbol_type=UNKNOWN;}

        *cc = g_strRegExp[++g_scan_pos];
        /* {m,} */
        if (*cc == '}') {return g_symbol_type=REPEAT_RANGE_M_MORE;}
        /* {m,n} */
        if (*cc>'9' || *cc<'0') {return g_symbol_type=UNKNOWN;}
        g_repeat_n = 0; /* read the n of {m,n} */
        while (*cc>='0' && *cc<='9') {
            g_repeat_n *= 10;
            g_repeat_n += *cc-'0';
            *cc = g_strRegExp[++g_scan_pos];
        }
        if (*cc == '}') {
            return g_symbol_type=REPEAT_RANGE_MN;
        } else {
            return g_symbol_type=UNKNOWN;
        }
    }

    return g_symbol_type=UNKNOWN;
}
