Formal Languages and Automata: Study Notes (Part 2): Context-Free Grammars and Pushdown Automata

Context-Free Grammars

CFG Formalism

Terminals = symbols of the alphabet of the language being defined.

Variables = Nonterminals = a finite set of other symbols, each of which represents a language.

Start Symbol = the variable whose language is the one being defined.

Production
variable (head) -> string of variables and terminals (body)

Iterated Derivation
=>* means "zero or more derivation steps."

Sentential Form
Any string of variables and/or terminals derived from the start symbol.
Formally, α is a sentential form iff S =>* α.

Context-Free Language
If G is a CFG, then the language of G, i.e., L(G), is { w | S =>* w }, where w ranges over strings of terminals.
A language that is defined by some CFG is called a context-free language.

BNF Notation

  • Variables are words in <…>;
  • Terminals are often multicharacter strings indicated by boldface or underline;
  • Symbol ::= is often used for ->.
  • Symbol | is used for "or."
  • Symbol … is used for "one or more."
  • Surround one or more symbols by […] to make them optional.
  • Use {…} to surround a sequence of symbols that need to be treated as a unit.

Leftmost and Rightmost Derivations

Derivations allow us to replace any of the variables in a string.
Leads to many different derivations of the same string.
By forcing the leftmost variable (or alternatively, the rightmost variable) to be replaced, we avoid these "distinctions without a difference".

Leftmost Derivations

Say wAα =>lm wβα if w is a string of terminals only and A -> β is a production.
Also, α =>*lm β if α becomes β by a sequence of 0 or more =>lm steps.
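
A single =>lm step can be sketched in Python. This is only an illustration under stated assumptions: variables are single uppercase letters, every other character is a terminal, and the function name leftmost_step and the sample grammar S -> SS | (S) | () are hypothetical, not part of the notes.

```python
def leftmost_step(form, head, body):
    """Apply the production head -> body to the leftmost variable of `form`.

    Assumption: variables are single uppercase letters; all other
    characters are terminals.
    """
    for i, sym in enumerate(form):
        if sym.isupper():
            if sym != head:
                raise ValueError(f"leftmost variable is {sym}, not {head}")
            # Everything before position i is terminals only (the w in wAα).
            return form[:i] + body + form[i + 1:]
    raise ValueError("no variable left to replace")


# Leftmost derivation of "()()" with the sample grammar S -> SS | (S) | ():
form = "S"
for head, body in [("S", "SS"), ("S", "()"), ("S", "()")]:
    form = leftmost_step(form, head, body)
# The sentential forms visited are S, SS, ()S, ()()
```

Because the function refuses to touch anything but the leftmost variable, each terminal string reached has exactly one sequence of calls per parse tree.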

Rightmost Derivations

Say αAw =>rm αβw if w is a string of terminals only and A -> β is a production.
Also, α =>*rm β if α becomes β by a sequence of 0 or more =>rm steps.

Parse Trees

  • Parse trees are trees labeled by symbols of a particular CFG.
  • Leaves: labeled by a terminal or ε.
  • Interior nodes: labeled by a variable.
  • Children are labeled by the body of a production for the parent.
  • Root: must be labeled by the start symbol.

Yield of a Parse Tree

The concatenation of the labels of the leaves in left-to-right order (that is, in the order of a preorder traversal) is called the yield of the parse tree.
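
Computing a yield is a short recursion. The sketch below assumes a tree is a pair (label, children), a leaf has an empty children list, and an ε leaf is labeled with the empty string; the name tree_yield and this encoding are illustrative choices, not from the notes.

```python
def tree_yield(node):
    """Concatenate the leaf labels of a parse tree from left to right.

    Assumption: node = (label, children); a leaf has children == [],
    and a leaf standing for ε carries the label "".
    """
    label, children = node
    if not children:
        return label
    return "".join(tree_yield(child) for child in children)


# A parse tree for "()()" under the sample grammar S -> SS | (S) | ().
pair = ("S", [("(", []), (")", [])])
tree = ("S", [pair, pair])
```

Here tree_yield(tree) gives "()()", the string the tree derives.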

Generalization of Parse Trees

We sometimes talk about trees that are not exactly parse trees, but only because the root is labeled by some variable A that is not the start symbol.
Call these parse trees with root A.

Trees, leftmost, and rightmost derivations correspond

  • If there is a parse tree with root labeled A and yield w, then A =>*lm w.
  • If A =>*lm w, then there is a parse tree with root A and yield w.

Ambiguous Grammar

  • A CFG is ambiguous if there is a string in the language that is the yield of two or more parse trees.

  • Equivalent definitions of "ambiguous grammar"

    • There is a string in the language that has two different leftmost derivations.
    • There is a string in the language that has two different rightmost derivations.
  • Ambiguity is a Property of Grammars, not Languages.

    For example, the balanced-parentheses language has ambiguous grammars, but it also has unambiguous ones.

LL(1) Grammars

  • As an aside, a grammar where you can always figure out the production to use in a leftmost derivation by scanning the given string left-to-right and looking only at the next one symbol is called LL(1).
    • Leftmost derivation, left-to-right scan, one symbol of lookahead.
  • Most programming languages have LL(1) grammars.
    • LL(1) grammars are never ambiguous.

Inherent Ambiguity

  • Certain CFL's are inherently ambiguous, meaning that every grammar for the language is ambiguous.

Normal Forms for CFG's

Eliminating Useless Symbols

  • A symbol is useful if it appears in some derivation of some terminal string from the start symbol.

  • Otherwise, it is useless.

  • Eliminate symbols that derive no terminal string.
  • Eliminate unreachable symbols.

Eliminate Variables That Derive Nothing

  • Discover all variables that derive terminal strings.
  • For all other variables, remove all productions in which they appear in either the head or body.

S -> AB | C, A -> aA | a, B -> bB, C -> c
Basis: A and C are discovered because of A -> a and C -> c.
Induction: S is discovered because of S -> C.
Nothing else can be discovered.
Result: S -> C, A -> aA | a, C -> c
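
The basis/induction discovery above is a fixed-point computation. A minimal Python sketch, assuming productions are stored as a dict from each variable to a list of body strings with one character per symbol (the function names are hypothetical):

```python
def generating_variables(productions, terminals):
    """Return the variables that derive some terminal string."""
    generating = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in productions.items():
            if head in generating:
                continue
            # A head is discovered once some body consists only of
            # terminals and already-discovered variables.
            if any(all(s in terminals or s in generating for s in body)
                   for body in bodies):
                generating.add(head)
                changed = True
    return generating


def drop_nongenerating(productions, terminals):
    """Remove every production that mentions a non-generating variable."""
    gen = generating_variables(productions, terminals)
    return {
        head: [b for b in bodies
               if all(s in terminals or s in gen for s in b)]
        for head, bodies in productions.items() if head in gen
    }


grammar = {"S": ["AB", "C"], "A": ["aA", "a"], "B": ["bB"], "C": ["c"]}
cleaned = drop_nongenerating(grammar, {"a", "b", "c"})
# cleaned == {"S": ["C"], "A": ["aA", "a"], "C": ["c"]}
```

On the example grammar this discovers A and C first, then S, and never B, matching the result worked out by hand.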

Eliminate unreachable symbols

  • Remove from the grammar all symbols not discovered to be reachable from S, along with all productions that involve these symbols.
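
Reachability can be sketched as a worklist search from the start symbol. The encoding (dict of variable to list of body strings, one character per symbol) and the function names are assumptions made for illustration.

```python
def reachable_symbols(productions, start):
    """Return all symbols (variables and terminals) reachable from start."""
    reachable = {start}
    worklist = [start]
    while worklist:
        head = worklist.pop()
        for body in productions.get(head, []):
            for sym in body:
                if sym not in reachable:
                    reachable.add(sym)
                    if sym in productions:  # only variables are expanded
                        worklist.append(sym)
    return reachable


def drop_unreachable(productions, start):
    """Keep only the productions whose head is reachable from start."""
    keep = reachable_symbols(productions, start)
    return {h: bodies for h, bodies in productions.items() if h in keep}


# Continuing from the grammar S -> C, A -> aA | a, C -> c:
grammar = {"S": ["C"], "A": ["aA", "a"], "C": ["c"]}
trimmed = drop_unreachable(grammar, "S")
# trimmed == {"S": ["C"], "C": ["c"]}
```

A is generating but never reached from S, so it and its productions disappear.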

Eliminate ε-Productions

Epsilon Productions

Theorem: If L is a CFL, then L-{ε} has a CFG with no ε-productions.
Note: ε cannot be in the language of any grammar that has no ε-productions.

Nullable Symbols

nullable symbols = variables A such that A =>* ε.

S -> AB, A -> aA | ε, B -> bB | A

Basis: A is nullable because of A -> ε.
Induction: B is nullable because of B -> A.
Then, S is nullable because of S -> AB.
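
The same basis/induction pattern gives a short fixed-point routine. In this sketch an ε-body is written as the empty string and bodies are strings with one character per symbol; both conventions and the function name are assumptions.

```python
def nullable_variables(productions):
    """Return the variables A with A =>* ε."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in productions.items():
            if head in nullable:
                continue
            # An empty body is trivially all-nullable (basis); a body of
            # nullable variables makes its head nullable (induction).
            if any(all(s in nullable for s in body) for body in bodies):
                nullable.add(head)
                changed = True
    return nullable


grammar = {"S": ["AB"], "A": ["aA", ""], "B": ["bB", "A"]}
result = nullable_variables(grammar)
# result == {"S", "A", "B"}
```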

Key idea: turn each production A -> X1...Xn into a family of productions.
For each subset of nullable X's, there is one production with those eliminated from the right side "in advance".

  • Except, if all X's are nullable (or the body was empty to begin with), do not make a production with ε as the right side.
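
The family of productions can be generated by choosing, for each nullable symbol, whether to keep or drop it. A sketch under the same assumed encoding (body strings, one character per symbol, ε as the empty string), with the nullable set supplied by the caller; the function names are hypothetical.

```python
from itertools import product


def expand_body(body, nullable):
    """All non-empty bodies obtained by deleting a subset of nullable symbols."""
    # Each nullable symbol may be kept or dropped; others must be kept.
    choices = [(s, "") if s in nullable else (s,) for s in body]
    return {"".join(pick) for pick in product(*choices)} - {""}


def eliminate_epsilon(productions, nullable):
    """Replace each production by its family; never emit an ε-body."""
    result = {}
    for head, bodies in productions.items():
        family = set()
        for body in bodies:
            family |= expand_body(body, nullable)
        result[head] = sorted(family)
    return result


grammar = {"S": ["AB"], "A": ["aA", ""], "B": ["bB", "A"]}
new_grammar = eliminate_epsilon(grammar, {"S", "A", "B"})
# new_grammar == {"S": ["A", "AB", "B"], "A": ["a", "aA"], "B": ["A", "b", "bB"]}
```

A variable whose only production was an ε-production ends up with an empty list here; the later useless-symbol pass would remove it.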

Eliminate Unit Productions

Unit Productions


  • A unit production is one whose body consists of exactly one variable.
  • These productions can be eliminated.
  • Key idea: If A =>* B by a series of unit productions, and B -> α is a non-unit production, then add production A -> α.
  • Then, drop all unit productions.

  • Find all pairs (A, B) such that A =>* B by a sequence of unit productions only.
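
Both parts, finding the pairs (A, B) and then adding A -> α, can be sketched as follows. The encoding (single-character variables that are the dict keys, body strings) and the function names are assumptions; the sample grammar is only an illustration.

```python
def unit_pairs(productions):
    """All (A, B) with A =>* B using unit productions only."""
    variables = set(productions)
    pairs = {(v, v) for v in variables}  # zero steps
    changed = True
    while changed:
        changed = False
        for a, b in list(pairs):
            for body in productions[b]:
                # A body that is exactly one variable is a unit production.
                if body in variables and (a, body) not in pairs:
                    pairs.add((a, body))
                    changed = True
    return pairs


def eliminate_units(productions):
    """For each pair (A, B) and non-unit B -> α, add A -> α; drop unit bodies."""
    variables = set(productions)
    result = {v: [] for v in productions}
    for a, b in sorted(unit_pairs(productions)):
        for body in productions[b]:
            if body not in variables and body not in result[a]:
                result[a].append(body)
    return result


grammar = {"S": ["A", "AB", "B"], "A": ["a", "aA"], "B": ["A", "b", "bB"]}
no_units = eliminate_units(grammar)
# no_units["A"] == ["a", "aA"]
```

Including the pairs (A, A) means the non-unit productions a variable already had are carried over by the same loop.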

Cleaning Up a Grammar

Theorem: if L is a CFL, then there is a CFG for L – {ε} that has:

  • No useless symbols.
  • No ε-productions.
  • No unit productions.

i.e., every body is either a single terminal or has length at least 2.

Perform the following steps in order:

  • Eliminate ε-productions.
  • Eliminate unit productions.
  • Eliminate variables that derive no terminal string.
  • Eliminate variables not reached from the start symbol.

Chomsky Normal Form

A CFG is said to be in Chomsky Normal Form if every production is of one of these two forms:

  • A -> BC (body is two variables).
  • A -> a (body is a single terminal).

Theorem: If L is a CFL, then L – {ε} has a CFG in CNF.

Step 1: Clean the grammar, so every body is either a single terminal or of length at least 2.
Step 2: For each body that is not a single terminal, make the right side all variables.

  • For each terminal a, create a new variable A_a and the production A_a -> a.
  • Replace a by A_a in bodies of length at least 2.

Consider the production A -> BcDe.
We need variables A_c and A_e with productions A_c -> c and A_e -> e.

  • Note: you create at most one variable for each terminal, and use it everywhere it is needed.

Replace A -> BcDe by A -> B A_c D A_e.
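
Step 2 can be sketched in Python. Since the new variables have multi-character names, this sketch stores each body as a list of symbols; that encoding, the name lift_terminals, and the assumption that no existing variable is already called "A_" plus a terminal are all illustrative.

```python
def lift_terminals(productions, terminals):
    """In bodies of length >= 2, replace each terminal a by a variable A_a."""
    result = {}
    new_rules = {}
    for head, bodies in productions.items():
        result[head] = []
        for body in bodies:
            if len(body) >= 2:
                new_body = []
                for sym in body:
                    if sym in terminals:
                        var = "A_" + sym          # one variable per terminal
                        new_rules[var] = [[sym]]  # production A_a -> a
                        new_body.append(var)
                    else:
                        new_body.append(sym)
                body = new_body
            result[head].append(list(body))
    result.update(new_rules)
    return result


grammar = {"A": [["B", "c", "D", "e"]]}
lifted = lift_terminals(grammar, {"c", "e"})
# lifted == {"A": [["B", "A_c", "D", "A_e"]], "A_c": [["c"]], "A_e": [["e"]]}
```

Bodies that are a single terminal are left alone, since they are already in an allowed form.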

Step 3: Break right sides longer than 2 into a chain of productions with right sides of two variables.

Example: A -> BCDE is replaced by A -> BF, F -> CG, and G -> DE.

  • F and G must be used nowhere else.

Recall A -> BCDE is replaced by A -> BF, F -> CG, and G -> DE.
In the new grammar, A => BF => BCG => BCDE.

More importantly: Once we choose to replace A by BF, we must continue to BCG and BCDE.

  • Because F and G have only one production.
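
Step 3 can be sketched the same way, with bodies as lists of symbols. The fresh names X1, X2, ... play the role of F and G above; the naming scheme, the function name, and the assumption that no existing variable uses those names are illustrative.

```python
def break_long_bodies(productions):
    """Split every body longer than 2 into a chain of two-symbol bodies."""
    result = {}
    counter = 0
    for head, bodies in productions.items():
        result.setdefault(head, [])
        for body in bodies:
            current = head
            rest = list(body)
            while len(rest) > 2:
                counter += 1
                fresh = f"X{counter}"  # used nowhere else in the grammar
                result.setdefault(current, []).append([rest[0], fresh])
                current = fresh
                rest = rest[1:]
            result.setdefault(current, []).append(rest)
    return result


grammar = {"A": [["B", "C", "D", "E"]]}
chained = break_long_bodies(grammar)
# chained == {"A": [["B", "X1"]], "X1": [["C", "X2"]], "X2": [["D", "E"]]}
```

Each fresh variable gets exactly one production, which is why choosing A -> B X1 forces the rest of the chain.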

Pushdown Automata

  • The PDA is an automaton equivalent to the CFG in language-defining power.
  • Only the nondeterministic PDA defines all the CFL's.
  • But the deterministic version models parsers.
    • Most programming languages have deterministic PDA's.

PDA Formalism

A PDA is described by:

  • A finite set of states (Q, typically).
  • An input alphabet (Σ, typically).
  • A stack alphabet (Γ, typically).
  • A transition function (δ, typically).
    • Takes three arguments:
      • A state, in Q.
      • An input, which is either a symbol in Σ or ε