Operations : union , concatenation , Kleene closure, positive closure
Alphabet
Non-empty finite set of symbols
Σ
\Sigma
Σ
Strings
finite sequence of symbols of
Σ
\Sigma
Σ
(word)
empty string : ε \varepsilon ε
Σ ∗ \Sigma^* Σ∗ : the set of all strings over alphabet Σ \Sigma Σ (including ε \varepsilon ε)
Σ ∗ \Sigma^* Σ∗ is monoid with respect to concatenation
Languages
A language over alphabet
Σ
\Sigma
Σ is a subset (which can be infinite) of
Σ
∗
\Sigma^*
Σ∗
(personally, similar to vocabulary)
Kleene star or Kleene closure of a language
L
L
L
L
∗
=
∪
n
≥
0
L
n
L^* = \cup_{n\geq 0}L^n
L∗=∪n≥0Ln
which is consistent with the notation of Σ ∗ \Sigma^* Σ∗ by considering Σ \Sigma Σ as a language over alphabet Σ \Sigma Σ
Positive closure of a language
L
L
L
L
+
=
∪
n
≥
1
L
n
L^+ = \cup_{n\geq 1} L^n
L+=∪n≥1Ln
Some Properties
P1 Concatenation is associative
P2 Concatenation is not commutative
P3 L { ε } = { ε } L = L L\{\varepsilon\}=\{\varepsilon\}L=L L{ε}={ε}L=L
P4 L ∅ = ∅ L = ∅ L\varnothing=\varnothing L= \varnothing L∅=∅L=∅
P5 Distributive properties
- ( L 1 ∪ L 2 ) L 3 = L 1 L 3 ∪ L 2 L 3 (L_1\cup L_2)L_3 = L_1L_3\cup L_2L_3 (L1∪L2)L3=L1L3∪L2L3
- L 1 ( L 2 ∪ L 3 ) = L 1 L 2 ∪ L 1 L 3 L_1(L_2\cup L_3) = L_1L_2\cup L_1L_3 L1(L2∪L3)=L1L2∪L1L3
P6 if L 1 ⊆ L 2 L_1\subseteq L_2 L1⊆L2 and L 3 ⊆ L 4 L_3\subseteq L_4 L3⊆L4, then L 1 L 3 ⊆ L 2 L 4 L_1L_3 \subseteq L_2L_4 L1L3⊆L2L4
P7 ∅ ∗ = { ε } \varnothing^*=\{\varepsilon\} ∅∗={ε}
P8 { ε } ∗ = { ε } \{\varepsilon\}^*=\{\varepsilon\} {ε}∗={ε}
P9 if ε ∈ L \varepsilon \in L ε∈L, then L ∗ = L + L^*=L^+ L∗=L+
P10 L L ∗ = L ∗ L = L + LL^*=L^*L=L^+ LL∗=L∗L=L+
P11 ( L ∗ ) ∗ = L ∗ (L^*)^*=L^* (L∗)∗=L∗
P12 L ∗ L ∗ = L ∗ L^*L^*=L^* L∗L∗=L∗
P13 ( L 1 L 2 ) ∗ L 1 = L 1 ( L 2 L 1 ) ∗ (L_1L_2)^*L_1=L_1(L_2L_1)^* (L1L2)∗L1=L1(L2L1)∗
P14 ( L 1 ∪ L 2 ) ∗ = ( L 1 ∗ L 2 ∗ ) ∗ (L_1\cup L_2)^*=(L_1^*L_2^*)^* (L1∪L2)∗=(L1∗L2∗)∗
Regular Expressions
- ∅ \varnothing ∅, ε \varepsilon ε, and a a a for each a ∈ Σ a \in \Sigma a∈Σ are regular expressions representing the language for ∅ \varnothing ∅, { ε } \{\varepsilon\} {ε}, and { a } \{a\} {a} respectively
- a.
(
r
+
s
)
(r+s)
(r+s) representing the language
R
∪
S
R\cup S
R∪S
b. ( r s ) (rs) (rs) representing the language R S RS RS
c. ( r ∗ ) (r^*) (r∗) representing the language R ∗ R^* R∗
(keep minimum possible number of parenthesis in writing)
regular language
L
(
r
)
L(r)
L(r) represented by regular expression
r
r
r
sometimes just use
r
r
r instead of
L
(
r
)
L(r)
L(r) to indicate the regular language
L
L
L
Equivalent
expressions
r
1
r_1
r1 and
r
2
r_2
r2 representing the same language
we write
r
1
≈
r
2
r_1 \approx r_2
r1≈r2 (
r
1
=
r
2
r_1=r_2
r1=r2 is also reasonable)
Some identities
- r ε ≈ ε r ≈ r r\varepsilon\approx \varepsilon r\approx r rε≈εr≈r
- r 1 r 2 ≉ r 2 r 1 r_1r_2\not\approx r_2r_1 r1r2≈r2r1, in general
- ( r 1 r 2 ) r 3 ≈ r 1 ( r 2 r 3 ) (r_1r_2)r_3\approx r_1(r_2r_3) (r1r2)r3≈r1(r2r3)
- r ∅ ≈ ∅ r ≈ ∅ r\varnothing\approx \varnothing r \approx \varnothing r∅≈∅r≈∅
- ∅ ∗ ≈ ε \varnothing^*\approx \varepsilon ∅∗≈ε
- ε ∗ ≈ ε \varepsilon^*\approx \varepsilon ε∗≈ε
- if { ε } ∈ L ( r ) \{\varepsilon\} \in L(r) {ε}∈L(r) then r ∗ ≈ r + r^*\approx r^+ r∗≈r+
- r r ∗ ≈ r ∗ r ≈ r + rr^*\approx r^*r \approx r^+ rr∗≈r∗r≈r+
- Distributive identity.
( r 1 + r 2 ) r 3 ≈ r 1 r 3 + r 2 r 3 (r_1+r_2)r_3\approx r_1r_3+r_2r_3 (r1+r2)r3≈r1r3+r2r3
r 1 ( r 2 + r 3 ) ≈ r 1 r 2 + r 1 r 3 r_1(r_2+r_3)\approx r_1r_2+r_1r_3 r1(r2+r3)≈r1r2+r1r3 - ( r ∗ ) ∗ ≈ r ∗ (r^*)^*\approx r^* (r∗)∗≈r∗
- ( r 1 r 2 ) ∗ r 1 ≈ r 1 ( r 2 r 1 ) ∗ (r_1r_2)^*r_1\approx r_1(r_2r_1)^* (r1r2)∗r1≈r1(r2r1)∗
- ( r 1 + r 2 ) ∗ ≈ ( r 1 ∗ r 2 ∗ ) ∗ (r_1+r_2)^*\approx (r_1^*r_2^*)^* (r1+r2)∗≈(r1∗r2∗)∗
Grammars
Context-Free Grammars (CFG)
A grammar is a quadruple
G
=
(
V
,
T
,
S
,
P
)
\mathcal{G}=(V, T, S, P)
G=(V,T,S,P)
where
- V = N ∪ T V=N \cup T V=N∪T, where N N N is a finite set of non-terminals
- T T T is a finite set of terminals
- S ∈ N S \in N S∈N is the start symbol
- P P P is a finite set of N × V ∗ N \times V^* N×V∗, called the set of production rules.
Some definitions
- Consider a binary relation ⇒ G \Rightarrow_{\mathcal{G}} ⇒G on V ∗ V^* V∗, let’s say, one step relation.
- α ⇒ G β \alpha \Rightarrow_{\mathcal{G}} \beta α⇒Gβ iff. α = α 1 A α 2 , β = α 1 γ α 2 \alpha=\alpha_1A\alpha_2, \beta=\alpha_1\gamma\alpha_2 α=α1Aα2,β=α1γα2 and A → γ ∈ P A\rightarrow \gamma \in P A→γ∈P.
- if α ⇒ G β \alpha \Rightarrow_{\mathcal{G}} \beta α⇒Gβ, then we call α \alpha α yields β \beta β in one step in G \mathcal{G} G.
- The reflexive-transitive closure(自反传递闭包)of ⇒ G \Rightarrow_{\mathcal{G}} ⇒G is denoted by ⇒ G ∗ \Rightarrow_{\mathcal{G}}^* ⇒G∗. That is, for α , β ∈ V ∗ \alpha,\beta \in V^* α,β∈V∗, α ⇒ G ∗ β \alpha \Rightarrow_{\mathcal{G}}^* \beta α⇒G∗β iff. ∃ n ≥ 0 \exist n\geq 0 ∃n≥0 and α 1 , α 2 , . . . , α n ∈ V ∗ \alpha_1,\alpha_2,...,\alpha_n \in V^* α1,α2,...,αn∈V∗ such that α = α 0 ⇒ G α 1 ⇒ G α 2 ⇒ G . . . ⇒ G α n − 1 ⇒ G α n = β \alpha=\alpha_0\Rightarrow_{\mathcal{G}}\alpha_1\Rightarrow_{\mathcal{G}}\alpha_2\Rightarrow_{\mathcal{G}}...\Rightarrow_{\mathcal{G}}\alpha_{n-1}\Rightarrow_{\mathcal{G}}\alpha_n=\beta α=α0⇒Gα1⇒Gα2⇒G...⇒Gαn−1⇒Gαn=β
- if α ⇒ G ∗ β \alpha\Rightarrow_{\mathcal{G}}^*\beta α⇒G∗β, then we say α \alpha α derives β \beta β. Further, α ⇒ G ∗ β \alpha\Rightarrow_{\mathcal{G}}^*\beta α⇒G∗β is called as a derivation in G \mathcal{G} G, and β \beta β is the yield of the derivation.
- In a given context, if we’re dealing with only one grammar G \mathcal{G} G, then simply write ⇒ \Rightarrow ⇒ instead of ⇒ G \Rightarrow_{\mathcal{G}} ⇒G
- if α = α 0 ⇒ G α 1 ⇒ G α 2 ⇒ G . . . ⇒ G α n − 1 ⇒ G α n = β \alpha=\alpha_0\Rightarrow_{\mathcal{G}}\alpha_1\Rightarrow_{\mathcal{G}}\alpha_2\Rightarrow_{\mathcal{G}}...\Rightarrow_{\mathcal{G}}\alpha_{n-1}\Rightarrow_{\mathcal{G}}\alpha_n=\beta α=α0⇒Gα1⇒Gα2⇒G...⇒Gαn−1⇒Gαn=β is a derivation, then the length of the derivation is n n n and it maybe written as α ⇒ G n β \alpha\Rightarrow_{\mathcal{G}}^n\beta α⇒Gnβ
- A string α ∈ V ∗ \alpha \in V^* α∈V∗ is said to be a sentential form in G \mathcal{G} G, if α \alpha α can be derived from the start symbol S S S of G \mathcal{G} G, viz. S ⇒ ∗ α S\Rightarrow^*\alpha S⇒∗α
- In particular, if α ∈ Σ ∗ \alpha \in \Sigma^* α∈Σ∗, then the sentential form α \alpha α is known as a sentence. In which case, we say α \alpha α is generated by G \mathcal{G} G
- The language generated by G \mathcal{G} G, denoted by L ( G ) L(\mathcal{G}) L(G), is the set of all sentences generated by G \mathcal{G} G. That is, L ( G ) = { x ∈ Σ ∗ ∣ S ⇒ ∗ x } L(\mathcal{G})=\{x\in \Sigma^*|S\Rightarrow^*x\} L(G)={x∈Σ∗∣S⇒∗x}
Production rules A → α A\rightarrow \alpha A→α is independent of neighbouring symbols, i.e. context free.
Thus the grammar defined here is known as context free grammar, simply CFG.
Derivation Trees
also known as parse trees
the yield
α
\alpha
α of derivation
A
⇒
∗
α
A\Rightarrow^*\alpha
A⇒∗α can be identified in the derivation tree by juxtaposing the lables of leaf nodes from left to right.
Ambiguity
ambiguous grammar/unambiguous grammar
inherently ambiguous language
Regular Grammars
linear grammar : every production rule has at most one nonterminal symbol
left-linear (right-linear) grammar : the nonterminal symbol is at the left end (right end)
The language generated by a right linear grammar is regular.
For every regular language
L
L
L, there exists a right linear grammar that generates
L
L
L
Right linear grammars are also called regular grammars
Things also hold for left linear grammar.
Directed Graph Representation
Directed graph (short for digraph)
Chomsky Normal Form (CNF)
All productions are of the form A → B C A\rightarrow BC A→BC or A → a A\rightarrow a A→a, where A , B , C A,B,C A,B,C are variables, and a a a is terminal symbol.
Greibach Normal Form (GNF)
All productions are shown as the following form :
A
→
a
X
,
w
h
e
r
e
a
∈
T
,
X
∈
V
∗
A \rightarrow aX,\ where\ a\in T, X\in V^*
A→aX, where a∈T,X∈V∗
Simplification of CFG
“3R”
- Reduction (two phases, 1.generating+2.reachable=useful) (Removal of useless production)
- Removal of unit production
- Removal of ϵ \epsilon ϵ-production
Finite Automata (FA)
Deterministic Finite Automata (DFA)
A (deterministic) finite automaton
A
=
(
Q
,
Σ
,
q
0
,
δ
,
F
)
\mathscr{A}=(Q,\Sigma,q_0,\delta,F)
A=(Q,Σ,q0,δ,F)
Q
Q
Q is a finite set called the set of states
Σ
\Sigma
Σ is a finite set called the input alphabet
q
0
∈
Q
q_0 \in Q
q0∈Q, called the initial/start state
δ
:
Q
×
Σ
→
Q
\delta : Q \times \Sigma \rightarrow Q
δ:Q×Σ→Q is a function called transition function or next-state function (which can be extended, say
δ
^
:
Q
×
Σ
∗
→
Q
\hat\delta : Q \times \Sigma^*\rightarrow Q
δ^:Q×Σ∗→Q)
F
⊆
Q
F \subseteq Q
F⊆Q, called the set of final/accept states
Useful graphical representation : Transition Diagram
which can be constructed as follows:
- Every state in Q Q Q is represented by a node
- If δ ( p , a ) = q \delta(p,a)=q δ(p,a)=q, then there’s an arc from p p p to q q q labeled a a a
- There’s an arrow with no source into the initial state q 0 q_0 q0
- Final states are indicated by double circle
“accepted by”
Language of a DFA : L ( A ) L(\mathscr{A}) L(A)
Configurations
C
∈
Q
×
Σ
∗
C \in Q\times \Sigma^*
C∈Q×Σ∗
the computation of A \mathscr{A} A on the input x x x
Nondeterministic Finite Automata (NFA)
NFA
ε \varepsilon ε-NFA
A nondeterministic finite automaton
N
=
(
Q
,
Σ
,
q
0
,
δ
,
F
)
\mathscr{N}=(Q,\Sigma,q_0,\delta,F)
N=(Q,Σ,q0,δ,F), where
Q
,
Σ
,
q
0
,
F
Q,\Sigma,q_0,F
Q,Σ,q0,F are as in a DFA; whereas, the transition function
δ
\delta
δ is as below:
δ
:
Q
×
(
Σ
∪
{
ε
}
)
→
℘
(
Q
)
\delta : Q \times (\Sigma \cup \{\varepsilon\}) \rightarrow \wp(Q)
δ:Q×(Σ∪{ε})→℘(Q)
is a function so that, for a given state an a input symbol (possibly
ε
\varepsilon
ε),
δ
\delta
δ assigns a set of next states, possibly empty set.
δ \delta δ can also be extended to δ ^ \hat\delta δ^
Every DFA can be treated as an NFA.
L ( N ) = { x ∈ Σ ∗ ∣ δ ^ ( q 0 , x ) ∩ F ≠ ∅ } L(\mathscr{N})=\{x \in \Sigma^* | \hat\delta(q_0,x) \cap F \neq \varnothing \} L(N)={x∈Σ∗∣δ^(q0,x)∩F=∅}
Equivalence of RE & FA
Pushdown Automata (PDA)
P
=
(
Q
,
Σ
,
Γ
,
δ
,
q
0
,
Z
0
,
F
)
P = (Q,\Sigma,\Gamma,\delta,q_0,Z_0,F)
P=(Q,Σ,Γ,δ,q0,Z0,F)
Γ
\Gamma
Γ : A finite stack alphabet.
Z
0
Z_0
Z0 : The start symbol in the stack.
Equivalence of CFG & PDA
Turing Machine
e.g.
L
=
{
a
n
b
n
c
n
∣
n
≥
0
}
L = \{ a^nb^nc^n | n\geq 0\}
L={anbncn∣n≥0}
not a CFL
Can be recognized by “A modified PDA with 2 stacks” - an variation of TM
TM is a seven-tuple P = ( Q , Σ , Γ , δ , q 0 , B , F ) P=(Q, \Sigma, \Gamma, \delta, q_0, B, F) P=(Q,Σ,Γ,δ,q0,B,F)
- Q Q Q is finite set of states
- Σ \Sigma Σ is finite set of input symbols
- Γ \Gamma Γ is finite set of tape symbols
- d e l t a delta delta is transition function : Q × Γ ⇒ Q × Γ × { R , L } Q\times \Gamma \Rightarrow Q\times \Gamma \times \{R,L\} Q×Γ⇒Q×Γ×{R,L}, δ ( q , X ) = ( p , Y , → / ← ) \delta(q, X)=(p, Y, \rightarrow/\leftarrow) δ(q,X)=(p,Y,→/←)
- q 0 q_0 q0 is start state
- B B B is blank symbol
- F F F is finite set of final states
Instantaneous
How to describe the configuration of TM
- sequence of symbols in tape
- state of TM
- read/write head of TM
i.e. X 1 . . . X i − 1 q X i . . . X n X_1...X_{i-1}qX_i...X_n X1...Xi−1qXi...Xn, where q ∈ Q q\in Q q∈Q
Language of TM
{ w ∣ q 0 w − > α p β , p ∈ F , α , β ∈ Γ ∗ } \{w|q_0w -> \alpha p\beta , p\in F, \alpha,\beta \in \Gamma^*\} {w∣q0w−>αpβ,p∈F,α,β∈Γ∗}
The kind of languages accepted by TM is called recursively enumerable(RE) language. (递归可枚举语言)