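I have only recently started with NLP and am not yet familiar with many of its concepts. The short summer term has just begun, so I spent this afternoon in the library going over some of the basics. My study notes follow.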
1. Kleene star operation
Given a set $V$, define
- $V^0 = \{\varepsilon\}$ (the language consisting only of the empty string),
- $V^1 = V$
and define recursively the set
- $V^{i+1} = \{ wv : w \in V^i \text{ and } v \in V \}$ for each $i > 0$.
If $V$ is a formal language, then $V^i$, the $i$-th power of the set $V$, is shorthand for the concatenation of $V$ with itself $i$ times. That is, $V^i$ is the set of all strings that can be written as the concatenation of $i$ strings in $V$.
The definition of Kleene star on $V$ is[2]
$V^{*} = \bigcup_{i \ge 0} V^i = V^0 \cup V^1 \cup V^2 \cup \cdots$
2. Kleene plus operation
In some formal language studies (e.g. AFL Theory), a variation on the Kleene star operation called the Kleene plus is used. The Kleene plus omits the $V^0$ term in the above union. In other words, the Kleene plus on $V$ is
$V^{+} = \bigcup_{i \ge 1} V^i = V^1 \cup V^2 \cup V^3 \cup \cdots$
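The powers of V and the two closures are easy to try out on a small example. The sketch below is only an illustration: it assumes strings are ordinary Python strings, uses "" for the empty string ε, and truncates the infinite unions at a chosen power. The function names are my own.

```python
def power(V, i):
    """Return V^i: all concatenations of i strings drawn from V."""
    result = {""}  # V^0 = {ε}; the empty Python string plays the role of ε
    for _ in range(i):
        result = {w + v for w in result for v in V}
    return result


def kleene_star(V, max_power):
    """Union of V^0 .. V^max_power, a finite truncation of V*."""
    out = set()
    for i in range(max_power + 1):
        out |= power(V, i)
    return out


def kleene_plus(V, max_power):
    """Union of V^1 .. V^max_power, a finite truncation of V+."""
    out = set()
    for i in range(1, max_power + 1):
        out |= power(V, i)
    return out


V = {"a", "b"}
print(sorted(power(V, 2)))        # ['aa', 'ab', 'ba', 'bb']
print("" in kleene_star(V, 2))    # True
print("" in kleene_plus(V, 2))    # False
```

For this V the only difference between the two results is the empty string, which is exactly the omitted V^0 term.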
3. Production rule
A grammar is defined by production rules (or just 'productions') that specify which symbols may replace which other symbols; these rules may be used to generate strings, or to parse them. Each such rule has a head, or left-hand side, which consists of the string that may be replaced, and a body, or right-hand side, which consists of a string that may replace it. Rules are often written in the form head → body; e.g., the rule z0 → z1 specifies that z0 can be replaced by z1.
In the classic formalization of generative grammars first proposed by Noam Chomsky in the 1950s,[1][2] a grammar G consists of the following components:
- A finite set $N$ of nonterminal symbols.
- A finite set $\Sigma$ of terminal symbols that is disjoint from $N$.
- A finite set $P$ of production rules, each rule of the form $(\Sigma \cup N)^{*} N (\Sigma \cup N)^{*} \rightarrow (\Sigma \cup N)^{*}$, where $*$ is the Kleene star operator and $\cup$ denotes set union, so $(\Sigma \cup N)^{*}$ represents zero or more symbols, and $N$ means one nonterminal symbol. That is, each production rule maps from one string of symbols to another, where the first string contains at least one nonterminal symbol. In the case that the body consists solely of the empty string, i.e., that it contains no symbols at all, it may be denoted with a special notation (often $\Lambda$, $e$, or $\varepsilon$) in order to avoid confusion.
- A distinguished symbol $S \in N$ that is the start symbol.
A grammar is formally defined as the ordered quadruple $(N, \Sigma, P, S)$. Such a formal grammar is often called a rewriting system or a phrase structure grammar in the literature.[3][4]
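To make the four components concrete, here is a minimal sketch of a grammar as a data structure, with a check of the conditions listed above. The class, the field names, and the choice to store each string as a tuple of symbols are my own assumptions; the sample rules are only meant to exercise the check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    nonterminals: frozenset  # N
    terminals: frozenset     # Σ
    productions: tuple       # P: (head, body) pairs, each a tuple of symbols
    start: str               # S


def is_well_formed(g):
    """Check the conditions stated for the quadruple (N, Σ, P, S)."""
    symbols = g.nonterminals | g.terminals
    if g.nonterminals & g.terminals:   # Σ must be disjoint from N
        return False
    if g.start not in g.nonterminals:  # S must belong to N
        return False
    for head, body in g.productions:
        if not all(s in symbols for s in head + body):
            return False               # unknown symbol in a rule
        if not any(s in g.nonterminals for s in head):
            return False               # head needs at least one nonterminal
    return True


g = Grammar(
    nonterminals=frozenset({"S", "B"}),
    terminals=frozenset({"a", "b", "c"}),
    productions=(
        (("S",), ("a", "B", "S", "c")),
        (("S",), ("a", "b", "c")),
        (("B", "a"), ("a", "B")),   # the head may be longer than one symbol
        (("B", "b"), ("b", "b")),
    ),
    start="S",
)
print(is_well_formed(g))  # True
```

An empty body is written as the empty tuple `()`, which corresponds to a rule whose right-hand side is the empty string.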
4. Terminal symbols and Nonterminal Symbols
Terminal symbols are literal symbols which may appear in the inputs to or outputs from the production rules of a formal grammar and which cannot be changed using the rules of the grammar.
Nonterminal symbols are those symbols which can be replaced. They may also be called simply syntactic variables.
5. Context-Free Grammar
Definition:
In formal language theory, a context-free grammar (CFG) is a formal grammar in which every production rule is of the form
- V → w
where V is a single nonterminal symbol, and w is a string of terminals and/or nonterminals (w can be empty). A formal grammar is considered "context free" when its production rules can be applied regardless of the context of a nonterminal. No matter which symbols surround it, the single nonterminal on the left hand side can always be replaced by the right hand side.
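The "single nonterminal on the left" condition can be checked mechanically. The following is a small sketch, assuming each rule is stored as a (head, body) pair of symbol tuples; that representation and the function name are my own choices.

```python
def is_context_free(rules, nonterminals):
    """True if every rule head is exactly one nonterminal symbol."""
    return all(
        len(head) == 1 and head[0] in nonterminals
        for head, _body in rules
    )


# S → a S b and S → ε: every head is the single nonterminal S.
cf_rules = [(("S",), ("a", "S", "b")), (("S",), ())]

# B a → a B rewrites B only when it is followed by a, so it depends on context.
other_rules = [(("S",), ("a", "B")), (("B", "a"), ("a", "B"))]

print(is_context_free(cf_rules, {"S"}))          # True
print(is_context_free(other_rules, {"S", "B"}))  # False
```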
Formal definitions:
A context-free grammar $G$ is defined by the 4-tuple:[3]
$G = (V, \Sigma, R, S)$
where
- $V$ is a finite set; each element $v \in V$ is called a non-terminal character or a variable. Each variable represents a different type of phrase or clause in the sentence. Variables are also sometimes called syntactic categories. Each variable defines a sub-language of the language defined by $G$.
- $\Sigma$ is a finite set of terminals, disjoint from $V$, which make up the actual content of the sentence. The set of terminals is the alphabet of the language defined by the grammar $G$.
- $R$ is a finite relation from $V$ to $(V \cup \Sigma)^{*}$, where the asterisk represents the Kleene star operation. The members of $R$ are called the (rewrite) rules or productions of the grammar. ($R$ is also commonly symbolized by a $P$.)
- $S$ is the start variable (or start symbol), used to represent the whole sentence (or program). It must be an element of $V$.
Production rule notation
A production rule in $R$ is formalized mathematically as a pair $(\alpha, \beta) \in R$, where $\alpha \in V$ is a non-terminal and $\beta \in (V \cup \Sigma)^{*}$ is a string of variables and/or terminals; rather than using ordered pair notation, production rules are usually written using an arrow operator with $\alpha$ as its left hand side and $\beta$ as its right hand side: $\alpha \rightarrow \beta$.
It is allowed for $\beta$ to be the empty string, and in this case it is customary to denote it by ε. The form $\alpha \rightarrow \varepsilon$ is called an ε-production.[4]
It is common to list all right-hand sides for the same left-hand side on the same line, using | (the pipe symbol) to separate them. Rules $\alpha \rightarrow \beta_1$ and $\alpha \rightarrow \beta_2$ can hence be written as $\alpha \rightarrow \beta_1 \mid \beta_2$.
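The pipe notation maps directly onto a list of (head, body) pairs. Here is a rough sketch of a reader for it. The textual format is an assumption of mine: an ASCII `->` for the arrow, symbols separated by spaces, and a literal ε for the empty body.

```python
def parse_rules(text):
    """Parse lines such as 'S -> a S b | ε' into (head, body) pairs."""
    rules = []
    for line in text.strip().splitlines():
        head, _, alternatives = line.partition("->")
        for alt in alternatives.split("|"):
            # An alternative consisting only of ε becomes the empty tuple.
            body = tuple(s for s in alt.split() if s != "ε")
            rules.append((head.strip(), body))
    return rules


print(parse_rules("S -> a S b | ε"))
# [('S', ('a', 'S', 'b')), ('S', ())]
```

Each alternative on a line becomes its own rule, so one line with n pipe-separated alternatives yields n pairs with the same head.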
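Rule application
For any strings $u, v \in (V \cup \Sigma)^{*}$, we say $u$ directly yields $v$, written as $u \Rightarrow v$, if $\exists (\alpha, \beta) \in R$ with $\alpha \in V$ and $u_1, u_2 \in (V \cup \Sigma)^{*}$ such that $u = u_1 \alpha u_2$ and $v = u_1 \beta u_2$. Thus, $v$ is the result of applying the rule $(\alpha, \beta)$ to $u$.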
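A single rule application can be written out in a few lines. The sketch below assumes a sentential form is a tuple of symbols and a rule is a (head, body) pair whose head is a single symbol; both are my own representation choices.

```python
def direct_yields(u, rules):
    """Return every v reachable from u by applying one rule at one position."""
    results = set()
    for head, body in rules:
        for i, symbol in enumerate(u):
            if symbol == head:
                # u = u1 head u2 becomes v = u1 body u2
                results.add(u[:i] + body + u[i + 1:])
    return results


rules = [("S", ("a", "S", "b")), ("S", ())]
print(sorted(direct_yields(("a", "S", "b"), rules)))
# [('a', 'a', 'S', 'b', 'b'), ('a', 'b')]
```

The two results correspond to the two rules for S: one grows the string, the other removes S through the ε-production.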
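Repetitive rule application
For any $u, v \in (V \cup \Sigma)^{*}$, we say $u$ yields $v$, written as $u \Rightarrow^{*} v$ (or $u \Rightarrow\Rightarrow v$ in some textbooks), if $\exists k \ge 1\ \exists u_1, \ldots, u_k \in (V \cup \Sigma)^{*}$ such that $u = u_1 \Rightarrow u_2 \Rightarrow \cdots \Rightarrow u_k = v$.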
Context-free language
The language of a grammar $G = (V, \Sigma, R, S)$ is the set
$L(G) = \{ w \in \Sigma^{*} : S \Rightarrow^{*} w \}$
A language $L$ is said to be a context-free language (CFL), if there exists a CFG $G$, such that $L = L(G)$.
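For a small grammar, the short members of its language can be listed by searching over sentential forms. This is only a sketch under my own assumptions: forms are tuples of symbols, rule heads are single symbols, and a form is dropped once it carries more terminals than the length bound. That pruning makes the search stop for the example grammar, but it does not guarantee termination for every grammar.

```python
from collections import deque


def language_up_to(rules, start, nonterminals, max_len):
    """Collect terminal strings w derivable from start with len(w) <= max_len."""
    seen = {(start,)}
    queue = deque(seen)
    words = set()
    while queue:
        u = queue.popleft()
        if all(s not in nonterminals for s in u):
            words.add("".join(u))  # only terminals left: u is a word
            continue
        for head, body in rules:
            for i, symbol in enumerate(u):
                if symbol == head:
                    v = u[:i] + body + u[i + 1:]
                    terminal_count = sum(s not in nonterminals for s in v)
                    if terminal_count <= max_len and v not in seen:
                        seen.add(v)
                        queue.append(v)
    return words


rules = [("S", ("a", "S", "b")), ("S", ())]
words = language_up_to(rules, "S", {"S"}, max_len=4)
print(sorted(words, key=len))  # ['', 'ab', 'aabb']
```

The grammar S → a S b | ε generates strings of n a's followed by n b's, and the output lists the members of length at most 4, with '' standing for the empty string.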
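6. PCFG
A probabilistic context-free grammar (PCFG) is a 5-tuple $(N, \Sigma, S, R, P)$:
(1) a set $N$ of nonterminals;
(2) a set $\Sigma$ of terminals;
(3) a start nonterminal $S \in N$;
(4) a set $R$ of productions;
(5) a probability $P(r)$ for every production $r \in R$.
A PCFG is an extension of a CFG. A PCFG rule is written $A \rightarrow \alpha \;\; p$, where $A$ is a nonterminal and $p$ is the probability that $A$ derives $\alpha$, i.e. $p = P(A \rightarrow \alpha)$. The probabilities must satisfy, for every nonterminal $A$,
$\sum_{\alpha} P(A \rightarrow \alpha) = 1$
In other words, the probabilities of the productions that share the same left-hand side are normalized. The probability of a parse tree is the product of the probabilities of all the rules used in it.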
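Both conditions, normalization per left-hand side and the product rule for trees, can be checked on a toy example. Everything in the sketch below is hypothetical: the grammar, its probabilities, and the encoding of a tree as a nested tuple `(label, child, ...)` with plain strings as leaves are made up for illustration.

```python
from collections import defaultdict
from math import isclose, prod

# Hypothetical toy PCFG: (head, body) -> probability
pcfg = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("she",)): 0.6,
    ("NP", ("fish",)): 0.4,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("V",)): 0.3,
    ("V", ("eats",)): 1.0,
}


def is_normalized(pcfg):
    """Check that rule probabilities sum to 1 for each left-hand side."""
    totals = defaultdict(float)
    for (head, _body), p in pcfg.items():
        totals[head] += p
    return all(isclose(t, 1.0) for t in totals.values())


def tree_probability(tree, pcfg):
    """Multiply the probabilities of all rules used in the tree."""
    label, *children = tree
    # The rule body is the sequence of child labels (or the leaf words).
    body = tuple(c if isinstance(c, str) else c[0] for c in children)
    subtrees = [c for c in children if not isinstance(c, str)]
    return pcfg[(label, body)] * prod(tree_probability(c, pcfg) for c in subtrees)


tree = ("S", ("NP", "she"), ("VP", ("V", "eats"), ("NP", "fish")))
print(is_normalized(pcfg))           # True
print(tree_probability(tree, pcfg))  # approximately 0.168
```

The tree uses five rules with probabilities 1.0, 0.6, 0.7, 1.0 and 0.4, so its probability is about 0.168 up to floating-point rounding.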
7. Chomsky normal form
In formal language theory, a context-free grammar is said to be in Chomsky normal form (invented by Noam Chomsky)[1][2] if all of its production rules are of the form:
- $A \rightarrow BC$ or
- $A \rightarrow a$ or
- $S \rightarrow \varepsilon$,
where $A$, $B$ and $C$ are nonterminal symbols, $a$ is a terminal symbol (a symbol that represents a constant value), $S$ is the start symbol, and $\varepsilon$ is the empty string. Also, neither $B$ nor $C$ may be the start symbol, and the third production rule can only appear if $\varepsilon$ is in $L(G)$, namely, the language produced by the context-free grammar $G$.
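The three permitted rule shapes translate into a short structural check. The sketch assumes rules are (head, body) pairs with the body stored as a tuple of symbols; the function name and the sample grammars are my own. It tests only the shape of each rule, including the restriction that the start symbol may not appear on a right-hand side.

```python
def is_cnf(rules, nonterminals, terminals, start):
    """Check every rule against the three Chomsky normal form shapes."""
    for head, body in rules:
        if head not in nonterminals:
            return False
        if len(body) == 2:
            # Two nonterminals, neither of which is the start symbol.
            if not all(s in nonterminals and s != start for s in body):
                return False
        elif len(body) == 1:
            # A single terminal.
            if body[0] not in terminals:
                return False
        elif len(body) == 0:
            # The empty body is allowed only for the start symbol.
            if head != start:
                return False
        else:
            return False
    return True


cnf_rules = [("S", ()), ("S", ("A", "B")), ("A", ("a",)), ("B", ("b",))]
non_cnf_rules = [("S", ("a", "S", "b")), ("S", ())]

print(is_cnf(cnf_rules, {"S", "A", "B"}, {"a", "b"}, "S"))  # True
print(is_cnf(non_cnf_rules, {"S"}, {"a", "b"}, "S"))        # False
```

The second grammar fails because a S b has three symbols and mixes terminals with a nonterminal.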