NLP Study Notes

Latest recommended article published 2022-10-16 21:32:01

Coog

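I have only recently started with NLP, and many of the concepts are still unfamiliar. The short summer term has just begun, so I spent this afternoon in the library going over some basic concepts. My study notes follow.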

1. Kleene star operation

Given a set V define

$V_0 = \{\varepsilon\}$ (the language consisting only of the empty string),

$V_1 = V$

and define recursively the set

$V_{i+1} = \{ wv : w \in V_i \text{ and } v \in V \}$ for each $i > 0$.

If V is a formal language, then V_i, the i-th power of the set V, is a shorthand for the concatenation of set V with itself i times. That is, V_i can be understood to be the set of all strings that can be represented as the concatenation of i strings in V.

The definition of Kleene star on V is^[2]

$V^*=\bigcup_{i \in \mathbb{N}} V_i = \{\varepsilon\} \cup V \cup V_2 \cup V_3 \cup V_4 \cup \ldots$
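The full set $V^*$ is infinite whenever V contains a non-empty string, so it cannot be listed in full. The following is only a minimal Python sketch that builds the powers $V_i$ and their union up to a chosen bound; the function names `power` and `kleene_star_upto` are made up for these notes.

```python
def power(V, i):
    """Return V_i: every concatenation of exactly i strings taken from V."""
    result = {""}                      # V_0 = {ε}
    for _ in range(i):
        # V_{k+1} = { wv : w in V_k and v in V }
        result = {w + v for w in result for v in V}
    return result


def kleene_star_upto(V, n):
    """Union of V_0 .. V_n, a finite slice of the (usually infinite) V*."""
    out = set()
    for i in range(n + 1):
        out |= power(V, i)
    return out


V = {"a", "b"}
print(sorted(kleene_star_upto(V, 2), key=lambda s: (len(s), s)))
# -> ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
```

Raising the bound `n` only adds longer strings; the empty string is always present because of $V_0$.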

2. Kleene plus operation

In some formal language studies (e.g. AFL theory), a variation on the Kleene star operation called the Kleene plus is used. The Kleene plus omits the $V_0$ term from the above union. In other words, the Kleene plus on V is

$V^+=\bigcup_{i \in \mathbb{N} \setminus \{0\}} V_i = V_1 \cup V_2 \cup V_3 \cup \ldots$
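To see the difference from the Kleene star concretely, here is a small Python sketch that collects $V_1$ through $V_n$ and leaves $V_0$ out. The name `kleene_plus_upto` is invented for illustration, and the bound `n` is needed because the real $V^+$ is usually infinite.

```python
def kleene_plus_upto(V, n):
    """Union of V_1 .. V_n (V_0 is left out), a finite slice of V+."""
    out = set()
    layer = {""}                       # V_0 = {ε}, used only as the seed
    for _ in range(n):
        # Build V_{i+1} from V_i and add it to the union.
        layer = {w + v for w in layer for v in V}
        out |= layer
    return out


print("" in kleene_plus_upto({"a", "b"}, 3))   # -> False
print("" in kleene_plus_upto({"", "a"}, 2))    # -> True
```

The second call shows that the empty string belongs to $V^+$ only when it is already an element of V.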

3. Production rule

A grammar is defined by production rules (or just 'productions') that specify which symbols may replace which other symbols; these rules may be used to generate strings, or to parse them. Each such rule has a head, or left-hand side, which consists of the string that may be replaced, and a body, or right-hand side, which consists of a string that may replace it. Rules are often written in the form head → body; e.g., the rule z0 → z1 specifies that z0 can be replaced by z1.

In the classic formalization of generative grammars first proposed by Noam Chomsky in the 1950s,^[1]^[2] a grammar G consists of the following components:

A finite set $N$ of nonterminal symbols.
A finite set $\Sigma$ of terminal symbols that is disjoint from $N$ .
A finite set $P$ of production rules, each rule of the form

$(\Sigma \cup N)^{*} N (\Sigma \cup N)^{*} \rightarrow (\Sigma \cup N)^{*}$

where ${}^{*}$ is the Kleene star operator and $\cup$ denotes set union, so $(\Sigma \cup N)^{*}$ represents zero or more symbols, and $N$ means one nonterminal symbol. That is, each production rule maps from one string of symbols to another, where the first string contains at least one nonterminal symbol. In the case that the body consists solely of the empty string—i.e., that it contains no symbols at all—it may be denoted with a special notation (often $\Lambda$ , $e$ or $\epsilon$ ) in order to avoid confusion.

A distinguished symbol $S \in N$ that is the start symbol.

A grammar is formally defined as the ordered quadruple $\langle N, \Sigma, P, S \rangle$. Such a formal grammar is often called a rewriting system or a phrase structure grammar in the literature.^[3]^[4]
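One way to make the quadruple concrete is to store it as a small record and check the constraints listed above. This is only a sketch: the `Grammar` class, the `is_well_formed` function, and the choice of representing strings as tuples of symbols are assumptions made for these notes, not a standard API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    nonterminals: frozenset   # N
    terminals: frozenset      # Σ, must be disjoint from N
    productions: tuple        # P, as (head, body) pairs of symbol tuples
    start: str                # S, must be an element of N


def is_well_formed(g):
    """Check the constraints from the definition of the quadruple."""
    if g.nonterminals & g.terminals:
        return False                   # N and Σ must be disjoint
    if g.start not in g.nonterminals:
        return False                   # S must be a nonterminal
    symbols = g.nonterminals | g.terminals
    for head, body in g.productions:
        if not all(s in symbols for s in head + body):
            return False               # unknown symbol in a rule
        if not any(s in g.nonterminals for s in head):
            return False               # head needs at least one nonterminal
    return True


g = Grammar(
    nonterminals=frozenset({"S", "B"}),
    terminals=frozenset({"a", "b", "c"}),
    productions=(
        (("S",), ("a", "B", "S", "c")),
        (("S",), ("a", "b", "c")),
        (("B", "a"), ("a", "B")),
        (("B", "b"), ("b", "b")),
    ),
    start="S",
)
print(is_well_formed(g))   # -> True
```

Note that the heads `("B", "a")` and `("B", "b")` contain more than one symbol, which the general definition allows as long as a nonterminal is present.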

4. Terminal symbols and Nonterminal Symbols

Terminal symbols are literal symbols which may appear in the inputs to or outputs from the production rules of a formal grammar and which cannot be changed using the rules of the grammar.

Nonterminal symbols are those symbols which can be replaced. They may also be called simply syntactic variables.

5. Context-Free Grammar

Definition:

In formal language theory, a context-free grammar (CFG) is a formal grammar in which every production rule is of the form

V → w

where V is a single nonterminal symbol, and w is a string of terminals and/or nonterminals (w can be empty). A formal grammar is considered "context free" when its production rules can be applied regardless of the context of a nonterminal. No matter which symbols surround it, the single nonterminal on the left hand side can always be replaced by the right hand side.

Formal definitions:

A context-free grammar G is defined by the 4-tuple:^[3]

$G = (V, \Sigma, R, S)$ where

$V$ is a finite set; each element $v\in V$ is called a non-terminal character or a variable. Each variable represents a different type of phrase or clause in the sentence. Variables are also sometimes called syntactic categories. Each variable defines a sub-language of the language defined by $G$.
$\Sigma$ is a finite set of terminals, disjoint from $V$, which make up the actual content of the sentence. The set of terminals is the alphabet of the language defined by the grammar $G$.
$R$ is a finite relation from $V$ to $(V\cup\Sigma)^{*}$, where the asterisk represents the Kleene star operation. The members of $R$ are called the (rewrite) rules or productions of the grammar; $R$ is also commonly written $P$.
$S$ is the start variable (or start symbol), used to represent the whole sentence (or program). It must be an element of $V$.

Production rule notation

A production rule in $R\,$ is formalized mathematically as a pair $(\alpha, \beta)\in R$ , where $\alpha \in V$ is a non-terminal and $\beta \in (V\cup\Sigma)^{*}$ is a string of variables and/or terminals; rather than using ordered pair notation, production rules are usually written using an arrow operator with $\alpha$ as its left hand side and $\beta$ as its right hand side: $\alpha\rightarrow\beta$ .

It is allowed for $\beta$ to be the empty string, and in this case it is customary to denote it by ε. The form $\alpha\rightarrow\varepsilon$ is called an ε-production.^[4]

It is common to list all right-hand sides for the same left-hand side on the same line, using | (the pipe symbol) to separate them. Rules $\alpha\rightarrow \beta_1$ and $\alpha\rightarrow\beta_2$ can hence be written as $\alpha\rightarrow\beta_1\mid\beta_2$ .
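The pipe shorthand can be unpacked mechanically back into one rule per alternative. The sketch below assumes a made-up plain-text convention for these notes: an ASCII `->` arrow, symbols separated by spaces, and `ε` (or nothing at all) standing for the empty body. The function name `expand_alternatives` is hypothetical.

```python
def expand_alternatives(line):
    """Split 'A -> x y | z' into [(A, (x, y)), (A, (z,))]."""
    head, _, rhs = line.partition("->")
    head = head.strip()
    rules = []
    for alternative in rhs.split("|"):
        # An alternative written as ε (or left blank) is the empty body.
        body = tuple(s for s in alternative.split() if s != "ε")
        rules.append((head, body))
    return rules


print(expand_alternatives("S -> a S b | ε"))
# -> [('S', ('a', 'S', 'b')), ('S', ())]
```

Each returned pair corresponds to one $(\alpha, \beta)\in R$, with the body stored as a tuple so that the empty string is simply `()`.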

Rule application

For any strings $u, v\in (V\cup\Sigma)^{*}$, we say $u$ directly yields $v$, written as $u\Rightarrow v$, if $\exists (\alpha, \beta)\in R$ with $\alpha \in V$ and $u_{1}, u_{2}\in (V\cup\Sigma)^{*}$ such that $u=u_{1}\alpha u_{2}$ and $v=u_{1}\beta u_{2}$. Thus, $v$ is the result of applying the rule $(\alpha, \beta)$ to $u$.
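The "directly yields" relation can be sketched in a few lines of Python: for each position in $u$ that holds a nonterminal $\alpha$, replace it by the body $\beta$ of a matching rule. The function name `direct_successors` and the tuple representation of strings are assumptions made for this illustration.

```python
def direct_successors(u, rules):
    """Return every v with u => v; u and v are tuples of symbols."""
    result = set()
    for i, symbol in enumerate(u):
        for head, body in rules:
            if symbol == head:
                # u = u1 + (head,) + u2  becomes  v = u1 + body + u2
                result.add(u[:i] + body + u[i + 1:])
    return result


# Toy grammar: S -> a S b | ε
rules = [("S", ("a", "S", "b")), ("S", ())]
print(direct_successors(("a", "S", "b"), rules) ==
      {("a", "a", "S", "b", "b"), ("a", "b")})   # -> True
```

A string made only of terminals has no successors, since no rule head can match any of its symbols.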

Repetitive rule application

For any strings $u, v\in (V\cup\Sigma)^{*}$, we say $u$ yields $v$, written as $u\stackrel{*}{\Rightarrow} v$ (or $u\Rightarrow\Rightarrow v$ in some textbooks), if there exist $k\geq 1$ and strings $u_{1}, \ldots, u_{k}\in (V\cup\Sigma)^{*}$ such that $u = u_{1}\Rightarrow u_{2}\Rightarrow\cdots\Rightarrow u_{k} = v$. Taking $k = 1$ gives $u\stackrel{*}{\Rightarrow} u$, so the relation covers zero or more rule applications.

Context-free language

The language of a grammar $G = (V, \Sigma, R, S)$ is the set

$L(G) = \{ w\in\Sigma^{*} : S\stackrel{*}{\Rightarrow} w\}$

A language $L$ is said to be a context-free language (CFL) if there exists a CFG $G$ such that $L = L(G)$.
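For a small grammar, the short words of $L(G)$ can be found by a breadth-first search over everything derivable from $S$. The sketch below is only an illustration under stated assumptions: the function name `language_upto` is made up, strings are tuples of single-character symbols, and intermediate strings longer than `max_len + 1` symbols are thrown away. That cutoff is enough for the toy grammar used here but could miss words in grammars that rely heavily on ε-productions.

```python
from collections import deque


def language_upto(rules, start, terminals, max_len):
    """Collect the words of L(G) with at most max_len symbols (bounded search)."""
    seen = {(start,)}
    queue = deque([(start,)])
    words = set()
    while queue:
        u = queue.popleft()
        if all(s in terminals for s in u):
            # u contains no nonterminal, so S =>* u and u is in Σ*.
            if len(u) <= max_len:
                words.add("".join(u))
            continue
        for i, symbol in enumerate(u):
            for head, body in rules:
                if symbol != head:
                    continue
                v = u[:i] + body + u[i + 1:]          # one step u => v
                if len(v) <= max_len + 1 and v not in seen:
                    seen.add(v)
                    queue.append(v)
    return words


# Toy grammar: S -> a S b | ε, which generates a^n b^n
rules = [("S", ("a", "S", "b")), ("S", ())]
print(sorted(language_upto(rules, "S", {"a", "b"}, 4), key=len))
# -> ['', 'ab', 'aabb']
```

The `seen` set guarantees termination, because only finitely many symbol tuples fit under the length cutoff.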

6. PCFG

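A probabilistic context-free grammar (PCFG) is a five-tuple $(N, \Sigma, S, R, P)$:

(1) a set $N$ of nonterminal symbols

(2) a set $\Sigma$ of terminal symbols

(3) a start nonterminal $S \in N$

(4) a set $R$ of production rules

(5) a probability $P(r)$ for every production rule $r \in R$

A PCFG is an extension of a CFG. Its rules are written in the form $A \rightarrow \alpha$ followed by $p$, where $A$ is a nonterminal and $p$ is the probability that $A$ derives $\alpha$, i.e. $p = P(A \rightarrow \alpha)$. For every nonterminal $A$, this probability distribution must satisfy the following condition:

$\sum_{\alpha} P(A \rightarrow \alpha) = 1$

In other words, the probabilities of the productions that share the same left-hand side are normalized.

The probability of a parse tree is the product of the probabilities of all the rules used in it.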
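Both conditions are easy to check on a toy example. The grammar, its probabilities, and the helper names `is_normalized` and `tree_probability` below are all invented for illustration; a parse tree is represented as a nested tuple `(label, child, ...)` whose leaves are plain strings.

```python
import math
from collections import defaultdict

# Hypothetical toy PCFG: (head, body) -> probability
pcfg = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("she",)): 0.6,
    ("NP", ("fish",)): 0.4,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("V",)): 0.3,
    ("V", ("eats",)): 1.0,
}


def is_normalized(pcfg, tol=1e-9):
    """Check that the rule probabilities sum to 1 for every left-hand side."""
    totals = defaultdict(float)
    for (head, _body), p in pcfg.items():
        totals[head] += p
    return all(math.isclose(t, 1.0, abs_tol=tol) for t in totals.values())


def tree_probability(tree, pcfg):
    """Multiply the probabilities of all rules used in the parse tree."""
    if isinstance(tree, str):
        return 1.0                     # a leaf (terminal) uses no rule
    label, *children = tree
    body = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = pcfg[(label, body)]
    for child in children:
        p *= tree_probability(child, pcfg)
    return p


tree = ("S", ("NP", "she"), ("VP", ("V", "eats"), ("NP", "fish")))
print(is_normalized(pcfg))             # -> True
print(tree_probability(tree, pcfg))    # 1.0 * 0.6 * 0.7 * 1.0 * 0.4, about 0.168
```

In practice the product is usually computed as a sum of log-probabilities to avoid underflow on large trees; the plain product is kept here to match the definition.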

7. CNF

In formal language theory, a context-free grammar is said to be in Chomsky normal form (invented by Noam Chomsky)^[1]^[2] if all of its production rules are of the form:

$A \rightarrow BC$ or

$A \rightarrow \alpha$ or

$S \rightarrow \varepsilon$,

where $A$, $B$ and $C$ are nonterminal symbols, $\alpha$ is a terminal symbol (a symbol that represents a constant value), $S$ is the start symbol, and $\varepsilon$ is the empty string. Also, neither $B$ nor $C$ may be the start symbol, and the third production rule can only appear if $\varepsilon$ is in $L(G)$, namely, the language produced by the context-free grammar $G$.
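The three allowed rule shapes translate directly into a checker. The following is a minimal sketch, with the function name `is_cnf` and the `(head, body)` tuple representation assumed for these notes. It does not test the condition on $\varepsilon \in L(G)$ separately, because a grammar that contains the rule $S \rightarrow \varepsilon$ derives the empty string by that very rule.

```python
def is_cnf(rules, nonterminals, terminals, start):
    """Return True if every rule has one of the three Chomsky normal form shapes."""
    for head, body in rules:
        if head not in nonterminals:
            return False
        if len(body) == 2:
            # A -> B C: both nonterminals, neither may be the start symbol
            if not all(s in nonterminals and s != start for s in body):
                return False
        elif len(body) == 1:
            # A -> a: a single terminal
            if body[0] not in terminals:
                return False
        elif len(body) == 0:
            # Only the start symbol may produce the empty string
            if head != start:
                return False
        else:
            return False               # bodies longer than two symbols
    return True


N, T = {"S", "A", "B"}, {"a", "b"}
cnf_rules = [("S", ("A", "B")), ("S", ()), ("A", ("a",)), ("B", ("b",))]
print(is_cnf(cnf_rules, N, T, "S"))                    # -> True
print(is_cnf([("S", ("a", "S", "b"))], N, T, "S"))     # -> False
```

The second call fails because the body `a S b` has three symbols and mixes terminals with a nonterminal.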