N-gram

__meteor

Tags: NLP

Article link: https://blog.csdn.net/du_lun/article/details/106660607

N-Gram

Definition
BiGram
Language model toolkits
Language model evaluation

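Definition

An n-gram is a sequence of n consecutive words in a text. An n-gram model is a probabilistic language model based on an (n-1)-order Markov chain; it uses the probabilities of n-word sequences to infer the structure of a sentence.
For example, for "to be or not to be":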

n-gram	sequence
unigram	to, be, or, not, to, be
bigram	to be, be or, or not, not to, to be
trigram	to be or, be or not, or not to, not to be
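
The table above can be reproduced with a few lines of Python. This is a minimal sketch that assumes whitespace tokenization; the helper name `ngrams` is chosen here for illustration and is not from any particular library.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # A sequence of length L contains L - n + 1 n-grams (none if n > L).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "to be or not to be".split()

unigrams = ngrams(tokens, 1)  # 6 items
bigrams = ngrams(tokens, 2)   # 5 items
trigrams = ngrams(tokens, 3)  # 4 items

for name, grams in [("unigram", unigrams), ("bigram", bigrams), ("trigram", trigrams)]:
    print(name, ", ".join(" ".join(g) for g in grams))
```

Note that "to be" appears twice among the bigrams; repeated n-grams like this are what the counts in an n-gram model are built from.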

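These probabilities can be used to predict the next word:
Please turn your homework ___
Likely completions are "in" or "over"; "refrigerator" or "the" are very unlikely.

So an n-gram model assigns a suitable probability to the last word of an n-gram, given the words that precede it.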

BiGram

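A bigram model approximates $p(w_n|w_1^{n-1})$ by $p(w_n|w_{n-1})$, for example P(the|that) $\approx$ P(the|Walden Pond's water is so transparent that).
The assumption that a word depends only on the word before it is the Markov assumption, so this model is also called a Markov model.
It follows that a trigram model looks back 2 words and an n-gram model looks back n-1 words, which is why the model is described as a probabilistic language model based on an (n-1)-order Markov chain.
To compute $p(w_n|w_{n-1})$, we count the bigram $C(w_{n-1}w_n)$ and the total $\sum_w{C(w_{n-1}w)}$ in a corpus, so $p(w_n|w_{n-1})=\frac{C(w_{n-1}w_n)}{\sum_w{C(w_{n-1}w)}}$, where $w$ ranges over every word in the vocabulary. By the same reasoning, the general formula for an N-gram is:
$p(w_n|w_{n-N+1}^{n-1})=\frac{C(w_{n-N+1}^{n-1}w_n)}{C(w_{n-N+1}^{n-1})}$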
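
As a rough illustration of the counting estimate above, the sketch below trains bigram counts on a three-sentence toy corpus. The corpus, the `<s>` and `</s>` sentence-boundary markers, and the function names `train_bigram` and `bigram_prob` are all assumptions made for this example and do not come from the original post.

```python
from collections import Counter


def train_bigram(sentences):
    """Count bigrams C(prev cur) and context totals sum_w C(prev w)."""
    bigram_counts = Counter()
    context_counts = Counter()
    for tokens in sentences:
        for prev, cur in zip(tokens, tokens[1:]):
            bigram_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    return bigram_counts, context_counts


def bigram_prob(prev, cur, bigram_counts, context_counts):
    """Relative-frequency estimate of p(cur | prev); 0.0 for an unseen context."""
    total = context_counts[prev]
    return bigram_counts[(prev, cur)] / total if total else 0.0


# Toy corpus (hypothetical); <s> and </s> mark sentence boundaries.
corpus = [
    "<s> i am sam </s>".split(),
    "<s> sam i am </s>".split(),
    "<s> i do not like green eggs and ham </s>".split(),
]
bigram_counts, context_counts = train_bigram(corpus)

p_i_given_start = bigram_prob("<s>", "i", bigram_counts, context_counts)  # 2 of 3 sentences start with "i"
p_am_given_i = bigram_prob("i", "am", bigram_counts, context_counts)      # "i" is followed by "am" 2 times out of 3
print(p_i_given_start, p_am_given_i)
```

With this estimate any bigram that never occurs in the corpus gets probability 0, which is one reason the choice of training corpus matters so much for n-gram models.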

In practice, trigrams and 4-grams are the most commonly used.
Probabilities are usually computed in log space to prevent numerical underflow.
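
The underflow problem is easy to see with ordinary double-precision floats. The following sketch uses 100 made-up conditional probabilities of 1e-5 each (purely illustrative values): multiplying them collapses to zero, while summing their logarithms stays well within range.

```python
import math

# 100 hypothetical conditional probabilities, each 1e-5.
probs = [1e-5] * 100

# Direct product: the true value is 1e-500, far below the smallest
# positive double (about 4.9e-324), so the result underflows to 0.0.
product = 1.0
for p in probs:
    product *= p

# Log space: add log-probabilities instead of multiplying probabilities.
log_prob = sum(math.log(p) for p in probs)

print(product)   # 0.0
print(log_prob)  # roughly -1151.29
```

Comparing two sentences by `log_prob` gives the same ranking as comparing their probabilities, because the logarithm is monotonic.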

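Language model toolkits

Language model evaluation

Extrinsic evaluation

Embed the language model in an application and measure how much the application's performance improves. This is expensive.

Intrinsic evaluation

Evaluation that does not require embedding the model in any application: train the model on a training set and measure its performance on a test set.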

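Perplexity

When evaluating a language model we usually do not use the raw joint probability but perplexity (pp):
$pp(w)=p(w_1w_2...w_n)^{-\frac{1}{n}}$

By the chain rule this is equivalent to $pp(w)=\sqrt[n]{\prod_{i=1}^{n}\frac{1}{p(w_i|w_1...w_{i-1})}}$
With a bigram model it is approximated by $pp(w)\approx\sqrt[n]{\prod_{i=1}^{n}\frac{1}{p(w_i|w_{i-1})}}$

The lower the perplexity, the better.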
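
The perplexity definition above can be evaluated in log space so that long test sequences do not underflow. This is a minimal sketch: the function name `perplexity` and the input probabilities are illustrative, and it assumes every conditional probability is strictly positive (a zero would make the logarithm undefined).

```python
import math


def perplexity(cond_probs):
    """Perplexity from the per-word conditional probabilities of a sequence.

    Computes (prod p_i) ** (-1/n) as exp(-(1/n) * sum(log p_i)).
    Assumes every p_i > 0.
    """
    n = len(cond_probs)
    log_sum = sum(math.log(p) for p in cond_probs)
    return math.exp(-log_sum / n)


# A model that gives every word probability 0.1 has perplexity 10.
uniform = perplexity([0.1] * 10)

# Higher per-word probabilities give lower perplexity: here 64 ** (1/4).
sharper = perplexity([0.5, 0.25, 0.5, 0.25])

print(uniform, sharper)
```

The first case shows a common reading of perplexity: a model that is equally unsure among 10 words at every step has perplexity 10.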
