NLP Study Notes

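I have only just started with NLP and am not yet familiar with many of its concepts. The short summer term has just begun, so I spent this afternoon in the library going over some basic concepts. My study notes follow.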

1. Kleene star operation

Given a set V, define

V^0 = { ε } (the language consisting only of the empty string),
V^1 = V

and define recursively the set

V^{i+1} = { wv : w ∈ V^i and v ∈ V } for each i > 0.

If V is a formal language, then V^i, the i-th power of the set V, is a shorthand for the concatenation of set V with itself i times. That is, V^i can be understood to be the set of all strings that can be represented as the concatenation of i strings in V.

The definition of Kleene star on V is[2]

 V^*=\bigcup_{i \in \mathbb{N}} V^i = \{\varepsilon\} \cup V \cup V^2 \cup V^3 \cup V^4 \cup \ldots.
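
Since V^* is infinite whenever V contains a non-empty string, it can only be enumerated up to some bound. The following is a minimal sketch of the definitions above; the function names `power` and `kleene_star_up_to` and the cut-off `max_power` are my own choices for illustration, not standard names.

```python
from itertools import product

def power(V, i):
    """Return V^i: all concatenations of exactly i strings drawn from V."""
    if i == 0:
        return {""}  # V^0 contains only the empty string ε
    return {"".join(parts) for parts in product(V, repeat=i)}

def kleene_star_up_to(V, max_power):
    """Union of V^0 .. V^max_power, a finite approximation of V*."""
    result = set()
    for i in range(max_power + 1):
        result |= power(V, i)
    return result

V = {"a", "b"}
print(sorted(power(V, 2)))
print(sorted(kleene_star_up_to(V, 2), key=lambda s: (len(s), s)))
```

With V = {a, b}, the second call collects ε, the two strings of V^1 and the four strings of V^2.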

2. Kleene plus operation

In some formal language studies (e.g. AFL theory), a variation on the Kleene star operation called the Kleene plus is used. The Kleene plus omits the V^0 term in the above union. In other words, the Kleene plus on V is

V^+=\bigcup_{i \in \mathbb{N} \setminus \{0\}} V^i = V^1 \cup V^2 \cup V^3 \cup \ldots.
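
A bounded sketch of the Kleene plus, under the same caveat that the full set is usually infinite. The name `kleene_plus_up_to` and the cut-off `max_power` are hypothetical. Note that ε ends up in V^+ only if ε is already in V.

```python
from itertools import product

def kleene_plus_up_to(V, max_power):
    """Union of V^1 .. V^max_power, a finite approximation of V+."""
    result = set()
    for i in range(1, max_power + 1):  # starts at 1, so V^0 is left out
        result |= {"".join(parts) for parts in product(V, repeat=i)}
    return result

V = {"ab", "c"}
plus = kleene_plus_up_to(V, 2)
print(sorted(plus))
print("" in plus)  # ε is absent here because it is not a member of V
```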

3. Production rule

A grammar is defined by production rules (or just 'productions') that specify which symbols may replace which other symbols; these rules may be used to generate strings, or to parse them. Each such rule has a head, or left-hand side, which consists of the string that may be replaced, and a body, or right-hand side, which consists of a string that may replace it. Rules are often written in the form head → body; e.g., the rule z0 → z1 specifies that z0 can be replaced by z1.

In the classic formalization of generative grammars first proposed by Noam Chomsky in the 1950s,[1][2] a grammar G consists of the following components:

  • A finite set N of nonterminal symbols.
  • A finite set \Sigma of terminal symbols that is disjoint from N.
  • A finite set P of production rules, each rule of the form
(\Sigma \cup N)^{*} N (\Sigma \cup N)^{*} \rightarrow (\Sigma \cup N)^{*}
where {}^{*} is the Kleene star operator and \cup denotes set union, so (\Sigma \cup N)^{*} represents zero or more symbols, and N means one nonterminal symbol. That is, each production rule maps from one string of symbols to another, where the first string contains at least one nonterminal symbol. In the case that the body consists solely of the empty string—i.e., that it contains no symbols at all—it may be denoted with a special notation (often \Lambda, e or \epsilon) in order to avoid confusion.
  • A distinguished symbol S \in N that is the start symbol.

A grammar is formally defined as the ordered quadruple <N, \Sigma, P, S>. Such a formal grammar is often called a rewriting system or a phrase structure grammar in the literature.[3][4]
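
The quadruple can be written down directly as a small data structure. This is only a sketch: the class name `Grammar`, its field names and the `validate` method are assumptions of mine, and symbols are represented as Python strings with each head and body stored as a tuple of symbols.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grammar:
    nonterminals: frozenset  # N
    terminals: frozenset     # Sigma
    productions: frozenset   # P, a set of (head, body) pairs of symbol tuples
    start: str               # S

    def validate(self):
        """Check the conditions listed in the definition above."""
        if self.nonterminals & self.terminals:
            raise ValueError("N and Sigma must be disjoint")
        if self.start not in self.nonterminals:
            raise ValueError("the start symbol must belong to N")
        symbols = self.nonterminals | self.terminals
        for head, body in self.productions:
            if not all(s in symbols for s in head + body):
                raise ValueError("unknown symbol in a production")
            # The head must contain at least one nonterminal symbol.
            if not any(s in self.nonterminals for s in head):
                raise ValueError("head without a nonterminal")
        return True

G = Grammar(
    nonterminals=frozenset({"S", "B"}),
    terminals=frozenset({"a", "b"}),
    productions=frozenset({
        (("S",), ("a", "S", "B")),
        (("a", "B"), ("a", "b")),  # a head may be longer than one symbol
        (("S",), ()),              # empty body, i.e. S → ε
    }),
    start="S",
)
print(G.validate())
```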

4. Terminal symbols and nonterminal symbols

Terminal symbols are literal symbols which may appear in the inputs to or outputs from the production rules of a formal grammar and which cannot be changed using the rules of the grammar.

Nonterminal symbols are those symbols which can be replaced. They may also be called simply syntactic variables.

5. Context-Free Grammar

    Definition:

In formal language theory, a context-free grammar (CFG) is a formal grammar in which every production rule is of the form

V → w

where V is a single nonterminal symbol, and w is a string of terminals and/or nonterminals (w can be empty). A formal grammar is considered "context free" when its production rules can be applied regardless of the context of a nonterminal. No matter which symbols surround it, the single nonterminal on the left hand side can always be replaced by the right hand side.

    Formal definitions:

A context-free grammar G is defined by the 4-tuple:[3]

G = (V\,, \Sigma\,, R\,, S\,) where

  1. V\, is a finite set; each element  v\in V is called a non-terminal character or a variable. Each variable represents a different type of phrase or clause in the sentence. Variables are also sometimes called syntactic categories. Each variable defines a sub-language of the language defined by G\, .
  2. \Sigma\, is a finite set of terminals, disjoint from V\,, which make up the actual content of the sentence. The set of terminals is the alphabet of the language defined by the grammar G\, .
  3. R\, is a finite relation from V\, to (V\cup\Sigma)^{*}, where the asterisk represents the Kleene star operation. The members of R\, are called the (rewrite) rules or productions of the grammar. (also commonly symbolized by a P\,)
  4. S\, is the start variable (or start symbol), used to represent the whole sentence (or program). It must be an element of V\,.

Production rule notation

A production rule in R\, is formalized mathematically as a pair (\alpha, \beta)\in R, where \alpha \in V is a non-terminal and \beta \in (V\cup\Sigma)^{*} is a string of variables and/or terminals; rather than using ordered pair notation, production rules are usually written using an arrow operator with \alpha as its left hand side and \beta as its right hand side: \alpha\rightarrow\beta.

It is allowed for \beta to be the empty string, and in this case it is customary to denote it by ε. The form \alpha\rightarrow\varepsilon is called an ε-production.[4]

It is common to list all right-hand sides for the same left-hand side on the same line, using | (the pipe symbol) to separate them. Rules \alpha\rightarrow \beta_1 and \alpha\rightarrow\beta_2 can hence be written as \alpha\rightarrow\beta_1\mid\beta_2.

Rule application

For any strings u, v\in (V\cup\Sigma)^{*}, we say u\, directly yields v\,, written as u\Rightarrow v\,, if \exists (\alpha, \beta)\in R with \alpha \in V and u_{1}, u_{2}\in (V\cup\Sigma)^{*} such that u\,=u_{1}\alpha u_{2} and v\,=u_{1}\beta u_{2}. Thus, \! v is the result of applying the rule \! (\alpha, \beta) to \! u.
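
The "directly yields" relation can be sketched in a few lines. The assumptions here are mine: every symbol is a single character, so a string over V ∪ Σ is an ordinary Python string, and the function name `directly_yields` is hypothetical.

```python
def directly_yields(u, rules):
    """Return every v with u => v, i.e. one rule applied at one position.

    `rules` is a set of (alpha, beta) pairs; alpha is a single nonterminal
    character and beta is a string over terminals and nonterminals.
    """
    results = set()
    for pos, symbol in enumerate(u):
        for alpha, beta in rules:
            if symbol == alpha:
                # u = u1 alpha u2 becomes v = u1 beta u2
                results.add(u[:pos] + beta + u[pos + 1:])
    return results

# Toy grammar S → aSb | ε, with ε written as the empty string
R = {("S", "aSb"), ("S", "")}
print(sorted(directly_yields("aSb", R)))
```

For u = aSb the only nonterminal is the middle S, so the two rules give aaSbb and ab.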

Repetitive rule application

For any u, v\in (V\cup\Sigma)^{*}, we say u yields v, written as u\stackrel{*}{\Rightarrow} v (or u\Rightarrow\Rightarrow v\, in some textbooks), if \exists k\geq 1 and u_{1}, u_{2}, \cdots, u_{k}\in (V\cup\Sigma)^{*} such that u = u_{1}\Rightarrow u_{2}\Rightarrow\cdots\Rightarrow u_{k} = v. In particular, k = 1 gives u\stackrel{*}{\Rightarrow} u, so the relation covers zero or more rule applications.

Context-free language

The language of a grammar G = (V\,, \Sigma\,, R\,, S\,) is the set

L(G) = \{ w\in\Sigma^{*} : S\stackrel{*}{\Rightarrow} w\}

A language L\, is said to be a context-free language (CFL), if there exists a CFG G\,, such that L\,=\,L(G).
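
To get a feel for L(G), one can search breadth-first over sentential forms and keep those that contain only terminals. This is a bounded sketch, not a general algorithm: symbols are assumed to be single characters, the names `language_up_to`, `max_len` and `max_form_len` are mine, and the cap on the length of intermediate forms is a heuristic that may miss words for grammars whose derivations need longer intermediate strings.

```python
from collections import deque

def language_up_to(rules, start, nonterminals, max_len, max_form_len=None):
    """Collect words w in L(G) with len(w) <= max_len (within the caps)."""
    if max_form_len is None:
        max_form_len = max_len + 2  # arbitrary slack for intermediate forms
    seen = {start}
    queue = deque([start])
    words = set()
    while queue:
        u = queue.popleft()
        if not any(c in nonterminals for c in u):
            words.add(u)  # u consists of terminals only, so S =>* u
            continue
        for pos, symbol in enumerate(u):
            for alpha, beta in rules:
                if symbol != alpha:
                    continue
                v = u[:pos] + beta + u[pos + 1:]
                n_terminals = sum(1 for c in v if c not in nonterminals)
                # Terminals are never erased, so too many of them is a safe
                # reason to prune; the length cap is only a heuristic.
                if n_terminals > max_len or len(v) > max_form_len:
                    continue
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
    return words

R = {("S", "aSb"), ("S", "")}  # S → aSb | ε
print(sorted(language_up_to(R, "S", {"S"}, 4), key=len))
```

For this toy grammar the words found are those of the form a^n b^n with n ≤ 2.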


6. PCFG

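A probabilistic context-free grammar (PCFG) is a 5-tuple (N, Σ, S, R, P):
(1) a set N of nonterminal symbols
(2) a set Σ of terminal symbols
(3) a start nonterminal S ∈ N
(4) a set R of production rules
(5) a probability P(r) for every production r ∈ R
A PCFG is an extension of a CFG. Its rules are written in the form A → α  p, where A is a nonterminal and p is the probability that A derives α, i.e. p = P(A → α). For each nonterminal A, this probability distribution must satisfy
∑_α P(A → α) = 1
In other words, the probabilities of the productions that share the same left-hand side are normalized.
The probability of a parse tree is the product of the probabilities of all the rules used in it.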
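
Both conditions are easy to check on a toy example. The grammar, its probabilities, and the names `is_normalized` and `tree_probability` below are made up for illustration; a parse tree is represented as a nested tuple (label, child, ...) whose leaves are plain strings.

```python
import math
from collections import defaultdict

# Hypothetical toy PCFG: maps (head, body) to the rule probability
rules = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("she",)): 0.6,
    ("NP", ("fish",)): 0.4,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("eats",)): 0.3,
    ("V", ("eats",)): 1.0,
}

def is_normalized(rules, tol=1e-9):
    """Check that the probabilities for each left-hand side sum to 1."""
    totals = defaultdict(float)
    for (head, _body), p in rules.items():
        totals[head] += p
    return all(math.isclose(t, 1.0, abs_tol=tol) for t in totals.values())

def tree_probability(tree, rules):
    """Multiply the probabilities of all rules used in the tree."""
    if isinstance(tree, str):
        return 1.0  # a leaf (terminal) applies no rule
    label, *children = tree
    body = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rules[(label, body)]
    for child in children:
        p *= tree_probability(child, rules)
    return p

tree = ("S", ("NP", "she"), ("VP", ("V", "eats"), ("NP", "fish")))
print(is_normalized(rules))
print(tree_probability(tree, rules))
```

The tree uses the rules S → NP VP, NP → she, VP → V NP, V → eats and NP → fish, so its probability is 1.0 × 0.6 × 0.7 × 1.0 × 0.4 = 0.168, up to floating-point rounding.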

7. CNF

In formal language theory, a context-free grammar is said to be in Chomsky normal form (invented by Noam Chomsky)[1][2] if all of its production rules are of the form:

A \rightarrow BC or
A \rightarrow \alpha or
S \rightarrow \varepsilon,

where A, B and C are nonterminal symbols, \alpha is a terminal symbol (a symbol that represents a constant value), S is the start symbol, and \varepsilon is the empty string. Also, neither B nor C may be the start symbol, and the third production rule can only appear if \varepsilon is in L(G), namely, the language produced by the context-free grammar G.
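
The three rule shapes can be checked mechanically. In this sketch each production is a (head, body) pair whose body is a tuple of symbols; the function name `is_cnf` and the example grammars are assumptions of mine. The condition that ε is in L(G) needs no separate check here, since a rule S → ε puts ε in the language by itself.

```python
def is_cnf(rules, nonterminals, terminals, start):
    """Return True if every rule has one of the three CNF shapes."""
    for head, body in rules:
        if head not in nonterminals:
            return False
        if len(body) == 2:
            # A → BC, where neither B nor C may be the start symbol
            if not all(s in nonterminals and s != start for s in body):
                return False
        elif len(body) == 1:
            # A → α, where α is a terminal symbol
            if body[0] not in terminals:
                return False
        elif len(body) == 0:
            # Only S → ε is allowed to have an empty body
            if head != start:
                return False
        else:
            return False
    return True

N = {"S", "A", "B"}
Sigma = {"a", "b"}
cnf_rules = [("S", ("A", "B")), ("A", ("a",)), ("B", ("b",)), ("S", ())]
not_cnf_rules = [("S", ("a", "S", "b")), ("S", ())]

print(is_cnf(cnf_rules, N, Sigma, "S"))
print(is_cnf(not_cnf_rules, N, Sigma, "S"))
```

The second grammar fails because the body of S → aSb has three symbols.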

