Finite automata and regular expression algorithms

西面来风

Article link: https://blog.csdn.net/weixin_42993054/article/details/82860939

C++ class library implementing finite automata and regular expression algorithms

Each algorithm is classified into one of two families: those based upon the structure of regular expressions, and those based upon the automata-theoretic work of Myhill and Nerode.

Regular expressions are defined as a $\Sigma$-term algebra. The FIRE engine only implements $\Sigma$-algebras for three carrier sets (regular expressions, finite automata, and reduced finite automata). Future versions of the toolkit will include more $\Sigma$-algebras.
The use of $\Sigma$-algebras in [Wat93a] can provide great computational efficiency in practice. For example, from regular expressions $E_0$ and $E_1$ we can construct finite automata $M_0$ and $M_1$ (accepting the languages denoted by $E_0$ and $E_1$, respectively). Assume that we now require a finite automaton accepting the language denoted by $E_0 \cdot E_1$ (their concatenation). With some of the existing toolkits, the new finite automaton would be constructed from scratch. The FIRE engine implements a concatenation operator on finite automata (for two of the different varieties of finite automata), enabling us to compute $M_0 \cdot M_1$ (a finite automaton accepting the desired language). Reusing intermediate results in this way can save considerable computation.
A future version of the toolkit will include support for extended regular expressions, i.e. regular expressions containing intersection or complementation operators.
Basic regular expressions and automata transition labels are represented by character ranges. A future version of the FIRE engine will permit basic regular expressions and transition labels to be built from more complex data structures. For example, it will be possible to process a string (vector) of structures.

Definition (Finite automaton):

A finite automaton (an FA; the plural of automaton is automata) is a 6-tuple $(Q, V, T, E, S, F)$ where

$Q$ is a finite set of states,
$V$ is an alphabet,
$T \in \mathcal P (Q \times V \times Q)$ is a transition relation,
$E \in \mathcal P (Q \times Q)$ is an $\epsilon$-transition relation,
$S \subseteq Q$ is a set of start states, and
$F \subseteq Q$ is a set of final states.
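
As a rough illustration, the 6-tuple can be mirrored by a plain C++ structure. This is only a minimal sketch and not the FIRE engine's actual class design; the names `FA`, `State`, `Symbol`, and `well_formed` are assumptions made for this example.

```cpp
#include <cassert>
#include <set>
#include <tuple>
#include <utility>

// Hypothetical types for illustration; the FIRE engine's own classes differ.
using State = int;
using Symbol = char;

struct FA {
    std::set<State> Q;                             // finite set of states
    std::set<Symbol> V;                            // alphabet
    std::set<std::tuple<State, Symbol, State>> T;  // transition relation
    std::set<std::pair<State, State>> E;           // epsilon-transition relation
    std::set<State> S;                             // start states
    std::set<State> F;                             // final states
};

// True if T, E, S and F only mention states in Q and symbols in V.
bool well_formed(const FA& m) {
    auto in_q = [&m](State q) { return m.Q.count(q) > 0; };
    for (const auto& t : m.T) {
        if (!in_q(std::get<0>(t)) || m.V.count(std::get<1>(t)) == 0 ||
            !in_q(std::get<2>(t)))
            return false;
    }
    for (const auto& e : m.E)
        if (!in_q(e.first) || !in_q(e.second)) return false;
    for (State s : m.S)
        if (!in_q(s)) return false;
    for (State f : m.F)
        if (!in_q(f)) return false;
    return true;
}
```

Using `std::set` for every component keeps the sketch close to the set-theoretic definition; a practical implementation would likely choose denser representations.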

Remark (Signatures of the transition relations):

The transition relations are also used under the following signatures:

$T \in V \rightarrow \mathcal P (Q \times Q)$
$T \in Q \times Q \rightarrow \mathcal P (V)$
$T \in Q \times V \rightarrow \mathcal P (Q)$
$T \in Q \rightarrow \mathcal P (V \times Q)$
$E \in Q \rightarrow \mathcal P (Q)$
In each case, the order of the $Q$s from left to right is preserved; for example, the function $T \in Q \rightarrow \mathcal P (V \times Q)$ is defined as $T(p) = \{(a,q) \mid (p,a,q) \in T\}$. The signature in use will be clear from the context.
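
The alternative signatures are just different ways of reading the same set of triples. The following sketch, with hypothetical helper names (`out_edges`, `targets`, `on_symbol`), shows three of them computed from a transition relation stored as a set of $(p, a, q)$ tuples.

```cpp
#include <cassert>
#include <set>
#include <tuple>
#include <utility>

using State = int;
using Symbol = char;
using Transitions = std::set<std::tuple<State, Symbol, State>>;

// T read as Q -> P(V x Q): T(p) = {(a, q) | (p, a, q) in T}.
std::set<std::pair<Symbol, State>> out_edges(const Transitions& t, State p) {
    std::set<std::pair<Symbol, State>> result;
    for (const auto& tr : t)
        if (std::get<0>(tr) == p)
            result.insert(std::make_pair(std::get<1>(tr), std::get<2>(tr)));
    return result;
}

// T read as Q x V -> P(Q): T(p, a) = {q | (p, a, q) in T}.
std::set<State> targets(const Transitions& t, State p, Symbol a) {
    std::set<State> result;
    for (const auto& tr : t)
        if (std::get<0>(tr) == p && std::get<1>(tr) == a)
            result.insert(std::get<2>(tr));
    return result;
}

// T read as V -> P(Q x Q): T(a) = {(p, q) | (p, a, q) in T}.
std::set<std::pair<State, State>> on_symbol(const Transitions& t, Symbol a) {
    std::set<std::pair<State, State>> result;
    for (const auto& tr : t)
        if (std::get<1>(tr) == a)
            result.insert(std::make_pair(std::get<0>(tr), std::get<2>(tr)));
    return result;
}
```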

Convention (Sets of functions):

For sets $A$ and $B$, $A \rightarrow B$ denotes the set of all total functions from $A$ to $B$, while $A \nrightarrow B$ denotes the set of all partial functions from $A$ to $B$.

Properties of finite automata

To make these definitions more concise, we introduce particular finite automata $M = (Q,V,T,E,S,F), M_0 = (Q_0,V_0,T_0,E_0,S_0,F_0), M_1 = (Q_1,V_1,T_1,E_1,S_1,F_1)$

Definition (Size of an FA):

Define the size of an FA as $|M| = |Q|$.

Definition (Isomorphism ( $\cong$ ) of FA’s):

We define isomorphism ($\cong$) as an equivalence relation on FA’s. $M_0$ and $M_1$ are isomorphic (written $M_0 \cong M_1$) if and only if $V_0 = V_1$ and there exists a bijection $g \in Q_0 \to Q_1$ such that

$T_1 = \{(g(p),a,g(q)) \mid (p,a,q) \in T_0\}$
$E_1 = \{(g(p),g(q)) \mid (p,q) \in E_0\}$
$S_1 = \{g(s) \mid s \in S_0\}$
$F_1 = \{g(f) \mid f \in F_0\}$
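
The four conditions can be checked mechanically once a candidate bijection $g$ is given. The sketch below assumes a hypothetical `FA` structure and a function `is_isomorphism` (neither taken from the FIRE engine), and it only verifies a supplied $g$; it does not search for one.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <tuple>
#include <utility>

using State = int;
using Symbol = char;

// Hypothetical representation of (Q, V, T, E, S, F).
struct FA {
    std::set<State> Q;
    std::set<Symbol> V;
    std::set<std::tuple<State, Symbol, State>> T;
    std::set<std::pair<State, State>> E;
    std::set<State> S, F;
};

// Checks whether g witnesses M0 ≅ M1. Assumes M0 is well formed, i.e. its
// T, E, S and F only mention states in Q0 (otherwise g.at() throws).
bool is_isomorphism(const FA& m0, const FA& m1,
                    const std::map<State, State>& g) {
    if (m0.V != m1.V || g.size() != m0.Q.size()) return false;
    std::set<State> image;
    for (State q : m0.Q) {
        const auto it = g.find(q);
        if (it == g.end()) return false;  // g must be defined on all of Q0
        image.insert(it->second);
    }
    // Equal sizes and image == Q1 together make g a bijection Q0 -> Q1.
    if (image.size() != m0.Q.size() || image != m1.Q) return false;

    std::set<std::tuple<State, Symbol, State>> t;
    for (const auto& tr : m0.T)
        t.insert(std::make_tuple(g.at(std::get<0>(tr)), std::get<1>(tr),
                                 g.at(std::get<2>(tr))));
    std::set<std::pair<State, State>> e;
    for (const auto& p : m0.E)
        e.insert(std::make_pair(g.at(p.first), g.at(p.second)));
    std::set<State> s, f;
    for (State q : m0.S) s.insert(g.at(q));
    for (State q : m0.F) f.insert(g.at(q));
    return t == m1.T && e == m1.E && s == m1.S && f == m1.F;
}
```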

Definition (Extending the transition relation T):

We extend the transition relation $T \in V \to \mathcal P(Q \times Q)$ to $T^\ast \in V^\ast \to \mathcal P(Q \times Q)$ as follows:
$\qquad T^\ast(\epsilon) = E^\ast$
and (for $a \in V$, $w \in V^\ast$)
$\qquad T^\ast(aw) = E^\ast \circ T(a) \circ T^\ast(w)$

Convention (Relation composition):

Given sets $A$, $B$, $C$ (not necessarily different) and two relations $E \subseteq A \times B$ and $F \subseteq B \times C$, we define relation composition (infix operator $\circ$) as:
$\qquad E \circ F = \{(a,c) \mid (\exists b \in B : (a,b) \in E \land (b,c) \in F)\}$
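
Relation composition and the extended transition relation translate fairly directly into code. The following is a minimal sketch under stated assumptions: states are integers, `Rel` is a set of state pairs, and the names `compose`, `star`, and `t_star` are invented for this example. Because composition is associative, $T^\ast(w)$ is computed left to right here instead of by the right-recursive definition.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>

using State = int;
using Rel = std::set<std::pair<State, State>>;

// E ∘ F = {(a, c) | there is a b with (a, b) in E and (b, c) in F}.
Rel compose(const Rel& e, const Rel& f) {
    Rel out;
    for (const auto& x : e)
        for (const auto& y : f)
            if (x.second == y.first)
                out.insert(std::make_pair(x.first, y.second));
    return out;
}

// E*: reflexive-transitive closure of E over the state set Q.
Rel star(const std::set<State>& q, const Rel& e) {
    Rel out;
    for (State s : q) out.insert(std::make_pair(s, s));
    for (;;) {
        Rel next = out;
        for (const auto& p : compose(out, e)) next.insert(p);
        if (next == out) return out;  // fixed point reached
        out = next;
    }
}

// T*(w), with T given under the signature V -> P(Q x Q).
Rel t_star(const std::set<State>& q, const std::map<char, Rel>& t,
           const Rel& e, const std::string& w) {
    const Rel e_star = star(q, e);
    Rel out = e_star;  // T*(epsilon) = E*
    for (char a : w) {
        const auto it = t.find(a);
        const Rel ta = (it == t.end()) ? Rel() : it->second;
        // Unfolding T*(aw) = E* ∘ T(a) ∘ T*(w) gives E* ∘ T(a1) ∘ E* ∘ ... ∘ E*.
        out = compose(compose(out, ta), e_star);
    }
    return out;
}
```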

Definition (Left and right languages):

The left language of a state (in $M$) is given by the function $\overleftarrow{\mathcal L}_M \in Q \to \mathcal P(V^\ast)$, where
$\qquad \overleftarrow{\mathcal L}_M(q) = (\cup\, s : s \in S : T^\ast(s,q))$
The right language of a state (in $M$) is given by the function $\overrightarrow{\mathcal L}_M \in Q \to \mathcal P(V^\ast)$, where
$\qquad \overrightarrow{\mathcal L}_M(q) = (\cup\, f : f \in F : T^\ast(q,f))$
Here $T^\ast$ is read under the signature $T^\ast \in Q \times Q \to \mathcal P(V^\ast)$.
The subscript $M$ is usually dropped when no ambiguity can arise.

Property 2.13 (Language of an FA):

From the definitions of left and right languages (of a state), we can also write:
$\qquad \mathcal L_{FA}(M) = (\cup\, f : f \in F : \overleftarrow{\mathcal L}(f))$
and
$\qquad \mathcal L_{FA}(M) = (\cup\, s : s \in S : \overrightarrow{\mathcal L}(s))$

Definition 2.15 (Complete):

A Complete finite automaton is one satisfying the following:
$\qquad Complete(M) \equiv (\forall q,a : q\in Q \land a \in V : T(q,a) \neq \emptyset)$

Property 2.16 (Complete):

For all Complete FA’s $(Q, V, T, E, S, F)$ :
$\qquad (\cup\, q : q\in Q : \overleftarrow{\mathcal L}(q)) = V^\ast$

Definition 2.17 (e-free):

Automaton $M$ is $\epsilon$-free if and only if $E = \emptyset$.

Remark 2.18:

Even if M is $\epsilon$ -free it is still possible that $\epsilon \in \mathcal L_{FA}(M)$ ; in this case $S\cap F \neq \empty$

Convention A.7 (Equivalence classes of an equivalence relation):

For any equivalence relation $E$ on set $A$ we denote the set of equivalence classes of $E$ by $[A]_E$; that is
$\qquad [A]_E = \{[a]_E \mid a\in A\}$
Set $[A]_E$ is also called the partition of $A$ induced by $E$.

Definition A.8 (Index of an equivalence relation):

For equivalence relation $E$ on set $A$, define $\#E = |[A]_E|$. $\#E$ is called the index of $E$.
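
For a finite set, the partition and the index can be computed by comparing each element against one representative per class. This is a small sketch with hypothetical names (`partition`, `index_of`), and it assumes the supplied predicate really is an equivalence relation.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>
#include <vector>

// Computes [A]_E for a finite set A of ints. `equiv` is assumed to be
// reflexive, symmetric and transitive; this is not checked.
std::vector<std::set<int>> partition(
    const std::set<int>& a, const std::function<bool(int, int)>& equiv) {
    std::vector<std::set<int>> classes;
    for (int x : a) {
        bool placed = false;
        for (auto& cls : classes) {
            // Comparing against one representative suffices for an equivalence.
            if (equiv(x, *cls.begin())) {
                cls.insert(x);
                placed = true;
                break;
            }
        }
        if (!placed) classes.push_back(std::set<int>{x});
    }
    return classes;
}

// #E = |[A]_E|, the index of E.
std::size_t index_of(const std::set<int>& a,
                     const std::function<bool(int, int)>& equiv) {
    return partition(a, equiv).size();
}
```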

Definition (Language of an FA):

The language of a finite automaton (with alphabet $V$) is given by the function $\mathcal L_{FA} \in FA \to \mathcal P(V^\ast)$ defined as:
$\qquad \mathcal L_{FA}(M) = (\cup\, s,f : s\in S \land f\in F : T^\ast(s,f))$
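
Membership of a word $w$ in $\mathcal L_{FA}(M)$ amounts to asking whether some start state and some final state are related by $T^\ast(w)$. One common way to test this, sketched below, is to track the set of states reachable so far. The `FA` structure and the functions `eps_closure` and `accepts` are assumptions for illustration, not part of the FIRE engine.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <tuple>
#include <utility>

using State = int;
using Symbol = char;

// Hypothetical representation of (Q, V, T, E, S, F).
struct FA {
    std::set<State> Q;
    std::set<Symbol> V;
    std::set<std::tuple<State, Symbol, State>> T;
    std::set<std::pair<State, State>> E;
    std::set<State> S, F;
};

// Image of `states` under E*: everything reachable by epsilon-transitions.
std::set<State> eps_closure(const FA& m, std::set<State> states) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (const auto& e : m.E)
            if (states.count(e.first) && states.insert(e.second).second)
                changed = true;
    }
    return states;
}

// True iff some s in S and f in F satisfy (s, f) in T*(w).
bool accepts(const FA& m, const std::string& w) {
    std::set<State> current = eps_closure(m, m.S);
    for (Symbol a : w) {
        std::set<State> next;
        for (const auto& t : m.T)
            if (std::get<1>(t) == a && current.count(std::get<0>(t)))
                next.insert(std::get<2>(t));
        current = eps_closure(m, next);
    }
    for (State f : m.F)
        if (current.count(f)) return true;
    return false;
}
```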

Property 2.25 (Deterministic finite automaton):

A finite automaton M is deterministic if and only if

it does not have multiple start states,
it is $\epsilon$ -free, and
transition function $T\in Q\times V \to \mathcal P(Q)$ does not map pairs in $Q\times V$ to multiple states.
Formally,
$\qquad Det(M) \equiv (|S| \leq 1 \land \epsilon\text{-free}(M) \land (\forall q,a : q\in Q \land a \in V : |T(q,a)| \leq 1))$
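
The predicates $Complete$, $\epsilon$-free and $Det$ are all simple checks over the components of $M$. A minimal sketch follows, assuming a hypothetical `FA` structure and function names chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <tuple>
#include <utility>

using State = int;
using Symbol = char;

// Hypothetical representation of (Q, V, T, E, S, F).
struct FA {
    std::set<State> Q;
    std::set<Symbol> V;
    std::set<std::tuple<State, Symbol, State>> T;
    std::set<std::pair<State, State>> E;
    std::set<State> S, F;
};

// |T(q, a)| under the signature T in Q x V -> P(Q).
std::size_t fanout(const FA& m, State q, Symbol a) {
    std::size_t n = 0;
    for (const auto& t : m.T)
        if (std::get<0>(t) == q && std::get<1>(t) == a) ++n;
    return n;
}

// Complete(M): every (state, symbol) pair has at least one transition.
bool is_complete(const FA& m) {
    for (State q : m.Q)
        for (Symbol a : m.V)
            if (fanout(m, q, a) == 0) return false;
    return true;
}

// M is epsilon-free iff E is empty.
bool is_epsilon_free(const FA& m) { return m.E.empty(); }

// Det(M): at most one start state, epsilon-free, and |T(q, a)| <= 1.
bool is_deterministic(const FA& m) {
    if (m.S.size() > 1 || !is_epsilon_free(m)) return false;
    for (State q : m.Q)
        for (Symbol a : m.V)
            if (fanout(m, q, a) > 1) return false;
    return true;
}
```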

Definition 2.30 (Minimality of a DFA):

A DFA $M$ is minimal if and only if no DFA accepting the same language has fewer states:
$\qquad Min(M) \equiv (\forall M' : M' \in DFA \land \mathcal L_{FA}(M) = \mathcal L_{FA}(M'):|M| \leq |M'|)$

Constructions based on regular expression structure

A finite automaton construction is any function $f$ , such that the following diagram commutes:

[Figure: commuting diagram; image not available]

In this section, we will be defining some $\Sigma$-algebras with $[FA]_\cong$ as the carrier set; the idea behind the above commuting diagram still holds in this case, as all isomorphic FA’s accept the same language.
The isomorphism class of an FA corresponding to a given regular expression is the image of the regular expression under the (unique) homomorphism from RE to the other $\Sigma$-algebras. Such a homomorphism is an FA construction.
Thompson’s construction is considered first, followed by a derivation of Berry and Sethi’s, McNaughton, Yamada and Glushkov’s, and Aho, Sethi, and Ullman’s constructions.

Thompson’s construction

Definition (Thompson’s $\Sigma$ -algebra of FA’s):

The carrier set is $[FA]_\cong$. The operator requirement is:

For the binary operators, the representatives of the arguments must have disjoint state sets. For any two equivalence classes (under $\cong$ ) we can always choose a representative of each such that they satisfy this requirement.

The operators (with subscript $Th$, for Thompson) are:

$\begin{array}{l} \text{$\mathcal C_{\epsilon,Th} = let\ q_0,q_1$ be new states}\\ \text{$in$}\\ \text{\qquad $[(\{q_0,q_1\},V,\emptyset,\{(q_0,q_1)\},\{q_0\},\{q_1\})]_\cong$}\\ \text{$end$} \end{array}$
That is, $Q = \{q_0,q_1\}$, $T = \emptyset$, $E = \{(q_0,q_1)\}$, $S = \{q_0\}$, $F = \{q_1\}$.
[Figure: Thompson automaton for $\epsilon$; image not available]

$\begin{array}{l} \text{$\mathcal C_{\emptyset,Th} = let\ q_0,q_1$ be new states}\\ \text{$in$}\\ \text{\qquad $[(\{q_0,q_1\},V,\emptyset,\emptyset,\{q_0\},\{q_1\})]_\cong$}\\ \text{$end$} \end{array}$
That is, $Q = \{q_0,q_1\}$, $T = \emptyset$, $E = \emptyset$, $S = \{q_0\}$, $F = \{q_1\}$.
[Figure: Thompson automaton for $\emptyset$; image not available]

$\begin{array}{l} \text{$\mathcal C_{a,Th} = let\ q_0,q_1$ be new states}\\ \text{$in$}\\ \text{\qquad $[(\{q_0,q_1\},V,\{(q_0,a,q_1)\},\emptyset,\{q_0\},\{q_1\})]_\cong$}\\ \text{$end$} \end{array}$
That is, $Q = \{q_0,q_1\}$, $T = \{(q_0,a,q_1)\}$, $E = \emptyset$, $S = \{q_0\}$, $F = \{q_1\}$.
[Figure: Thompson automaton for a symbol $a$; image not available]

$\begin{array}{l} \text{$\mathcal C_{\cdot,Th}([M_0]_\cong,[M_1]_\cong) = let\ (Q_0,V,T_0,E_0,S_0,F_0) = M_0$ }\\ \text{ \qquad \qquad \qquad \qquad \qquad \quad$ (Q_1,V,T_1,E_1,S_1,F_1) = M_1$ }\\ \text{$in$}\\ \text{$\qquad let\ E' = E_0 \cup E_1 \cup (F_0 \times S_1)$}\\ \text{$\qquad in$}\\ \text{$\qquad \qquad [(Q_0 \cup Q_1,V,T_0 \cup T_1,E',S_0,F_1)]_\cong$}\\ \text{$\qquad end$}\\ \text{$end$} \end{array}$
Recall that an FA is a 6-tuple $(Q, V, T, E, S, F)$.
[Figure: Thompson automaton for concatenation; image not available]

$\begin{array}{l} \text{$\mathcal C_{\cup,Th}([M_0]_\cong,[M_1]_\cong) = let\ (Q_0,V,T_0,E_0,S_0,F_0) = M_0$ }\\ \text{ \qquad \qquad \qquad \qquad \qquad \quad$ (Q_1,V,T_1,E_1,S_1,F_1) = M_1$ }\\ \text{ \qquad \qquad \qquad \qquad \qquad \quad$q_0,q_1$ be new states}\\ \text{$in$}\\ \text{$\qquad let\ Q' = Q_0 \cup Q_1 \cup \{q_0,q_1\}$}\\ \text{$\qquad \quad E' = E_0 \cup E_1 \cup (\{q_0\} \times (S_0 \cup S_1)) \cup ((F_0 \cup F_1) \times \{q_1\})$}\\ \text{$\qquad in$}\\ \text{$\qquad \qquad [(Q',V,T_0 \cup T_1,E',\{q_0\},\{q_1\})]_\cong$}\\ \text{$\qquad end$}\\ \text{$end$} \end{array}$
[Figure: Thompson automaton for union; image not available]

$\begin{array}{l} \text{$\mathcal C_{\ast,Th}([M]_\cong) = let\ (Q,V,T,E,S,F) = M$ }\\ \text{ \qquad \qquad \qquad \qquad \qquad$q_0,q_1$ be new states}\\ \text{$in$}\\ \text{$\qquad let\ Q' = Q \cup \{q_0,q_1\}$}\\ \text{$\qquad \quad E' = E \cup (\{q_0\} \times S) \cup (F \times S) \cup (F \times \{q_1\}) \cup \{(q_0,q_1)\}$}\\ \text{$\qquad in$}\\ \text{$\qquad \qquad [(Q',V,T,E',\{q_0\},\{q_1\})]_\cong$}\\ \text{$\qquad end$}\\ \text{$end$} \end{array}$
[Figure: Thompson automaton for the star operator; image not available]

$\begin{array}{l} \text{$\mathcal C_{+,Th}([M]_\cong) = let\ (Q,V,T,E,S,F) = M$ }\\ \text{ \qquad \qquad \qquad \qquad \qquad$q_0,q_1$ be new states}\\ \text{$in$}\\ \text{$\qquad let\ Q' = Q \cup \{q_0,q_1\}$}\\ \text{$\qquad \quad E' = E \cup (\{q_0\} \times S) \cup (F \times S) \cup (F \times \{q_1\})$}\\ \text{$\qquad in$}\\ \text{$\qquad \qquad [(Q',V,T,E',\{q_0\},\{q_1\})]_\cong$}\\ \text{$\qquad end$}\\ \text{$end$} \end{array}$
[Figure: Thompson automaton for the plus operator; image not available]

$\begin{array}{l} \text{$\mathcal C_{?,Th}([M]_\cong) = let\ (Q,V,T,E,S,F) = M$ }\\ \text{ \qquad \qquad \qquad \qquad \qquad$q_0,q_1$ be new states}\\ \text{$in$}\\ \text{$\qquad let\ Q' = Q \cup \{q_0,q_1\}$}\\ \text{$\qquad \quad E' = E \cup (\{q_0\} \times S) \cup (F \times \{q_1\}) \cup \{(q_0,q_1)\}$}\\ \text{$\qquad in$}\\ \text{$\qquad \qquad [(Q',V,T,E',\{q_0\},\{q_1\})]_\cong$}\\ \text{$\qquad end$}\\ \text{$end$} \end{array}$
[Figure: Thompson automaton for the question-mark (optional) operator; image not available]

These operators are symmetrical. Furthermore, they do not depend upon the choice of representative of the equivalence classes (under $\cong$ ). An automaton in Thompson’s $\Sigma$ -algebra (here we speak of a representative FA, instead of the isomorphism class) has the following properties:

It has a single start state with no in-transitions.
It has a single final state with no out-transitions.
Every state has either a single in-transition on a symbol (in V), or at most two $\epsilon$ in-transitions.
Every state has either a single out-transition on a symbol (in V), or at most two $\epsilon$ out-transitions.

These properties are symmetrical because the operators are symmetrical. Hopcroft and Ullman have shown [HU79] that in practice these properties facilitate the quick simulation of $M$.

Example (Thompson’s construction):

$\begin{array}{ll} Th((a \cup \epsilon)b^\ast) & = \mathcal C_{\cdot,Th}(Th(a \cup \epsilon),Th(b^\ast))\\ & = \mathcal C_{\cdot,Th}(\mathcal C_{\cup,Th}(Th(a),Th(\epsilon)),\mathcal C_{\ast,Th}(Th(b)))\\ & = \mathcal C_{\cdot,Th}(\mathcal C_{\cup,Th}(\mathcal C_{a,Th},\mathcal C_{\epsilon,Th}),\mathcal C_{\ast,Th}(\mathcal C_{b,Th})) \end{array}$
Figure (image not available): A representative FA of the isomorphism class $Th((a\cup\epsilon)b^\ast)$
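
Thompson's operators and the example can be sketched in C++ as functions that build one representative FA each. This is a hedged illustration, not the FIRE engine's implementation: the `FA` structure, the `th_*` function names, the helpers, and the implicit alphabet are all assumptions made here. A global counter hands out fresh states, so any two automata built this way have disjoint state sets, as the operator requirement demands.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <tuple>
#include <utility>

// Hypothetical representation; the alphabet V is left implicit (all of char).
using State = int;
struct FA {
    std::set<State> Q;
    std::set<std::tuple<State, char, State>> T;
    std::set<std::pair<State, State>> E;
    std::set<State> S, F;
};

// Fresh states keep the state sets of all constructed automata disjoint.
State fresh() { static State counter = 0; return counter++; }

// Adds the epsilon-transitions from × to.
void add_eps(FA& m, const std::set<State>& from, const std::set<State>& to) {
    for (State p : from)
        for (State q : to) m.E.insert(std::make_pair(p, q));
}

// Unions the states and transitions of `other` into m (S and F unchanged).
void absorb(FA& m, const FA& other) {
    m.Q.insert(other.Q.begin(), other.Q.end());
    m.T.insert(other.T.begin(), other.T.end());
    m.E.insert(other.E.begin(), other.E.end());
}

// New start q0 and new final q1 with no transitions: the operator for ∅.
FA th_empty() {
    FA m;
    const State q0 = fresh(), q1 = fresh();
    m.Q = {q0, q1};
    m.S = {q0};
    m.F = {q1};
    return m;
}
FA th_epsilon() { FA m = th_empty(); add_eps(m, m.S, m.F); return m; }
FA th_symbol(char a) {
    FA m = th_empty();
    m.T.insert(std::make_tuple(*m.S.begin(), a, *m.F.begin()));
    return m;
}
FA th_concat(const FA& m0, const FA& m1) {
    FA m;
    absorb(m, m0);
    absorb(m, m1);
    add_eps(m, m0.F, m1.S);  // E' = E0 ∪ E1 ∪ (F0 × S1)
    m.S = m0.S;
    m.F = m1.F;
    return m;
}
FA th_union(const FA& m0, const FA& m1) {
    FA m = th_empty();  // supplies the new states q0 and q1
    absorb(m, m0);
    absorb(m, m1);
    add_eps(m, m.S, m0.S);
    add_eps(m, m.S, m1.S);
    add_eps(m, m0.F, m.F);
    add_eps(m, m1.F, m.F);
    return m;
}
// Shared shape of the unary operators: E ∪ ({q0} × S) ∪ (F × {q1}),
// plus the loop F × S and/or the bypass (q0, q1).
FA wrap(const FA& inner, bool loop, bool bypass) {
    FA m = th_empty();
    absorb(m, inner);
    add_eps(m, m.S, inner.S);
    add_eps(m, inner.F, m.F);
    if (loop) add_eps(m, inner.F, inner.S);
    if (bypass) add_eps(m, m.S, m.F);
    return m;
}
FA th_star(const FA& m) { return wrap(m, true, true); }
FA th_plus(const FA& m) { return wrap(m, true, false); }
FA th_question(const FA& m) { return wrap(m, false, true); }

// Everything reachable from `states` by epsilon-transitions.
std::set<State> closure(const FA& m, std::set<State> states) {
    for (bool changed = true; changed;) {
        changed = false;
        for (const auto& e : m.E)
            if (states.count(e.first) && states.insert(e.second).second)
                changed = true;
    }
    return states;
}
// Simulates m on w by tracking the set of reachable states.
bool accepts(const FA& m, const std::string& w) {
    std::set<State> cur = closure(m, m.S);
    for (char a : w) {
        std::set<State> next;
        for (const auto& t : m.T)
            if (std::get<1>(t) == a && cur.count(std::get<0>(t)))
                next.insert(std::get<2>(t));
        cur = closure(m, next);
    }
    for (State f : m.F)
        if (cur.count(f)) return true;
    return false;
}
```

With these definitions, `th_concat(th_union(th_symbol('a'), th_epsilon()), th_star(th_symbol('b')))` builds a representative of $Th((a \cup \epsilon)b^\ast)$ with ten states, and `accepts` can be used to spot-check words such as `"abb"` (in the language) and `"ba"` (not in the language).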

Character ranges

In the FIRE engine, atomic regular expressions and finite automata transitions can be labeled by sets of characters. The sets of characters allowed are restricted to non-empty ranges of characters in the execution character set (which is usually ASCII or EBCDIC).
Class CharRange is used to represent such ranges of characters, which are specified by their upper and lower (inclusive) bounds [lo,hi].
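
To make the idea of an inclusive range $[lo, hi]$ concrete, here is a small sketch. It is deliberately named `CharRangeSketch` because it is not the FIRE engine's `CharRange` class; its members are assumptions chosen for illustration only.

```cpp
#include <cassert>

// Hypothetical sketch of an inclusive character range [lo, hi];
// this is not the interface of the FIRE engine's CharRange.
struct CharRangeSketch {
    char lo;  // inclusive lower bound
    char hi;  // inclusive upper bound

    CharRangeSketch(char l, char h) : lo(l), hi(h) {
        assert(lo <= hi);  // only non-empty ranges are allowed
    }

    // True if c lies within [lo, hi].
    bool contains(char c) const { return lo <= c && c <= hi; }

    // True if the two ranges share at least one character.
    bool overlaps(const CharRangeSketch& r) const {
        return lo <= r.hi && r.lo <= hi;
    }
};
```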

[Wat93a] WATSON, B. W. “A taxonomy of finite automata construction algorithms,” Computing Science Note 93/43, Eindhoven University of Technology, The Netherlands, 1993. Available by ftp from ftp.win.tue.nl in pub/techreports/pi.

[Wat93b] WATSON, B. W. “A taxonomy of finite automata minimization algorithms,” Computing Science Note 93/44, Eindhoven University of Technology, The Netherlands, 1993. Available by ftp from ftp.win.tue.nl in pub/techreports/pi.

[Wat94a] WATSON, B. W. “An introduction to the FIRE engine: A C++ toolkit for FInite automata and Regular Expressions,” Computing Science Note 94/21, Eindhoven University of Technology, The Netherlands, 1994. Available by ftp from ftp.win.tue.nl in pub/techreports/pi.

[Wat94b] WATSON, B. W. “The design and implementation of the FIRE engine: A C++ toolkit for FInite automata and Regular Expressions,” Computing Science Note 94/22, Eindhoven University of Technology, The Netherlands, 1994. Available by ftp from ftp.win.tue.nl in pub/techreports/pi.
