Lecture 3 N-gram Language Models

Language Models

  • One application of NLP is explaining language: why are some sentences more fluent than others?
  • E.g. in speech recognition: “recognize speech” > “wreck a nice beach”
  • “Goodness” is measured using probabilities estimated by language models
  • A language model can also be used for generation
  • Language models are useful for:
    • Query completion
    • Optical character recognition
    • And other generation tasks:
      • Machine translation
      • Summarization
      • Dialogue systems
  • Nowadays pretrained language models are the backbone of modern NLP systems

N-gram Language Model

Probabilities: Joint to Conditional

  • The goal of a language model is to assign a probability to an arbitrary sequence of m words:

    P(w1, w2, …, wm)

  • The first step is to apply the chain rule to convert the joint probability into conditional ones (a worked example follows):

    P(w1, w2, …, wm) = P(w1) P(w2|w1) P(w3|w1, w2) … P(wm|w1, …, wm-1)
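As a small worked example (the four-word sentence is invented here for illustration, it is not from the lecture), the chain rule expands the joint probability as:

```latex
P(\text{a}, \text{cow}, \text{eats}, \text{grass})
  = P(\text{a}) \cdot P(\text{cow} \mid \text{a}) \cdot
    P(\text{eats} \mid \text{a}, \text{cow}) \cdot
    P(\text{grass} \mid \text{a}, \text{cow}, \text{eats})
```

The last factor still conditions on the entire history, which is what the Markov assumption below simplifies.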

The Markov Assumption

  • The full history P(wm|w1, …, wm-1) is still intractable, so we make a simplifying assumption:

  • For some small n, condition only on the previous n-1 words:

    P(wi|w1, …, wi-1) ≈ P(wi|wi-n+1, …, wi-1)

    • When n = 1, it is a unigram model:

      P(w1, w2, …, wm) ≈ ∏ P(wi)

    • When n = 2, it is a bigram model (illustrated below):

      P(w1, w2, …, wm) ≈ ∏ P(wi|wi-1)

    • When n = 3, it is a trigram model:

      P(w1, w2, …, wm) ≈ ∏ P(wi|wi-2, wi-1)
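Continuing the same invented sentence from above, the bigram (n = 2) approximation conditions each word only on its immediate predecessor:

```latex
P(\text{a}, \text{cow}, \text{eats}, \text{grass})
  \approx P(\text{a}) \cdot P(\text{cow} \mid \text{a}) \cdot
          P(\text{eats} \mid \text{cow}) \cdot P(\text{grass} \mid \text{eats})
```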

Maximum Likelihood Estimation

  • Estimate the probabilities based on counts in the corpus (a counting sketch follows this list):
    • For unigram models:

      P(wi) = C(wi) / M, where C(·) is the count in the corpus and M is the total number of word tokens

    • For bigram models:

      P(wi|wi-1) = C(wi-1, wi) / C(wi-1)

    • For n-gram models generally:

      P(wi|wi-n+1, …, wi-1) = C(wi-n+1, …, wi) / C(wi-n+1, …, wi-1)
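A minimal sketch of these estimates in Python, assuming a tiny invented corpus and whitespace tokenization (sentence-boundary tags are ignored here; they are covered in the next subsection):

```python
# Maximum likelihood estimation for unigram and bigram probabilities.
from collections import Counter

corpus = ["a cow eats grass", "a dog eats meat"]     # invented toy corpus
sentences = [s.split() for s in corpus]

unigram_counts = Counter(w for sent in sentences for w in sent)
bigram_counts = Counter(
    (prev, curr) for sent in sentences for prev, curr in zip(sent, sent[1:])
)
M = sum(unigram_counts.values())                     # total number of word tokens

def p_unigram(w):
    return unigram_counts[w] / M                     # P(wi) = C(wi) / M

def p_bigram(prev, w):
    # P(wi | wi-1) = C(wi-1, wi) / C(wi-1)
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_unigram("eats"))     # 2 / 8 = 0.25
print(p_bigram("a", "cow"))  # 1 / 2 = 0.5
```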

Book-ending Sequences

  • Special tags are used to denote the start and end of a sequence (see the padding sketch below):
    • <s> = sentence start
    • </s> = sentence end
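A minimal sketch of the padding step (the helper name pad is invented here): an n-gram model gets n-1 start tags so that the first real word has a full-length context, plus one end tag:

```python
# Pad a tokenized sentence with boundary tags for an n-gram model.
def pad(words, n):
    return ["<s>"] * (n - 1) + words + ["</s>"]

print(pad(["a", "cow", "eats", "grass"], n=3))
# ['<s>', '<s>', 'a', 'cow', 'eats', 'grass', '</s>']
```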

Problems with N-gram models

  • Language has long-distance effects, therefore a large n is required.

    The lecture/s that took place last week was/were on preprocessing

    • The choice of “was/were” depends on “lecture/s”, which occurs six words earlier, so the model would need a context of at least six previous words (a 7-gram model) to capture it
  • The resulting probabilities are often very small

    • Possible solution: use log probabilities to avoid numerical underflow (a short sketch follows this list)
  • Unseen words:

    • Represent them with a special symbol, e.g. <UNK>
  • Unseen n-grams: because the probabilities are multiplied together, a single zero term makes the whole sentence probability zero

    • Need to smooth the n-gram language model
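A minimal sketch of the log-probability trick mentioned above (the per-word probabilities are invented values): summing log probabilities replaces multiplying many small numbers, so long sentences no longer underflow to zero:

```python
import math

word_probs = [0.01, 0.002, 0.005, 0.0001]        # invented per-word probabilities

log_prob = sum(math.log(p) for p in word_probs)  # sum of logs instead of product
print(log_prob)                                  # ≈ -25.33, well within float range
print(math.exp(log_prob))                        # recover the raw probability if needed
```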

Smoothing


Laplacian (add-one) smoothing

  • Simple idea: pretend we have seen each n-gram once more than we actually did (a sketch follows below).

  • For unigram models:

    Padd1(wi) = (C(wi) + 1) / (M + |V|), where |V| is the vocabulary size

  • For bigram models:

    Padd1(wi|wi-1) = (C(wi-1, wi) + 1) / (C(wi-1) + |V|)
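A minimal sketch of add-one smoothing for a bigram model (the counts and vocabulary are invented toy values):

```python
from collections import Counter

bigram_counts = Counter({("a", "cow"): 1, ("a", "dog"): 1, ("cow", "eats"): 1})
unigram_counts = Counter({"a": 2, "cow": 1, "dog": 1, "eats": 1})
V = len(unigram_counts)                          # vocabulary size |V|

def p_add1(prev, w):
    # Padd1(wi | wi-1) = (C(wi-1, wi) + 1) / (C(wi-1) + |V|)
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + V)

print(p_add1("a", "cow"))    # seen bigram:   (1 + 1) / (2 + 4) ≈ 0.333
print(p_add1("a", "eats"))   # unseen bigram: (0 + 1) / (2 + 4) ≈ 0.167
```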

Add-k smoothing

  • Adding one is often too much. Instead, add a fraction k:

    Paddk(wi|wi-1) = (C(wi-1, wi) + k) / (C(wi-1) + k|V|)

  • Also called Lidstone smoothing

  • The value of k has to be chosen (see the usage sketch below)
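The same idea with a fractional k instead of 1 (all values again invented); note how much less probability mass an unseen bigram receives with a small k:

```python
def p_addk(prev, w, k, bigram_counts, unigram_counts, V):
    # Paddk(wi | wi-1) = (C(wi-1, wi) + k) / (C(wi-1) + k * |V|)
    return (bigram_counts.get((prev, w), 0) + k) / (unigram_counts[prev] + k * V)

# unseen bigram with k = 0.05: 0.05 / (2 + 0.05 * 4) ≈ 0.023 (vs 0.167 with add-one)
print(p_addk("a", "eats", 0.05, {("a", "cow"): 1}, {"a": 2}, V=4))
```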

Absolute Discounting

  • Borrows a fixed probability mass (the discount) from observed n-gram counts
  • Redistributes it equally among unseen n-grams, as sketched below
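A minimal sketch of absolute discounting over a single bigram context (all counts and the discount value are invented): each observed count gives up a fixed amount D, and the borrowed mass is split evenly among the words never seen after that context:

```python
D = 0.1                                               # fixed discount
vocab = ["cow", "dog", "eats", "grass", "meat"]
bigram_counts = {("eats", "grass"): 3, ("eats", "meat"): 1}
context_count = 4                                     # C(eats)

seen = {w for (_, w) in bigram_counts}
unseen = [w for w in vocab if w not in seen]
borrowed = D * len(seen)                              # total mass taken from seen bigrams

def p_abs_discount(w):
    if ("eats", w) in bigram_counts:
        return (bigram_counts[("eats", w)] - D) / context_count
    return (borrowed / len(unseen)) / context_count   # shared equally among unseen words

print(p_abs_discount("grass"))                 # (3 - 0.1) / 4 = 0.725
print(p_abs_discount("cow"))                   # (0.2 / 3) / 4 ≈ 0.0167
print(sum(p_abs_discount(w) for w in vocab))   # 1.0
```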

Katz Backoff

  • Absolute discounting redistributes the probability mass equally among all unseen n-grams

  • Katz Backoff: redistributes the mass based on a lower-order model (e.g. unigram):

    PKatz(wi|wi-1) = (C(wi-1, wi) - D) / C(wi-1),                        if C(wi-1, wi) > 0
                   = α(wi-1) × P(wi) / Σ{wj : C(wi-1, wj) = 0} P(wj),    otherwise

    where α(wi-1) is the probability mass discounted from context wi-1

  • Problem: it prefers high-frequency words over genuinely related words (a small sketch follows the example below):

    • E.g. I can’t see without my reading _
      • C(reading, glasses) = C(reading, Francisco) = 0
      • C(Francisco) > C(glasses)
      • Katz Backoff will give higher probability to Francisco
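A minimal sketch of Katz backoff for a toy setup (all numbers invented): seen bigrams are discounted as before, but the leftover mass α is shared among unseen continuations in proportion to their unigram probabilities, which is exactly why a frequent word like “Francisco” can win over a related word like “glasses”:

```python
D = 0.1
vocab = ["cow", "dog", "eats", "grass", "meat"]
unigram_probs = {"cow": 0.1, "dog": 0.1, "eats": 0.3, "grass": 0.2, "meat": 0.3}
bigram_counts = {("eats", "grass"): 3, ("eats", "meat"): 1}
context_count = 4                                    # C(eats)

seen = {w for (_, w) in bigram_counts}
alpha = D * len(seen) / context_count                # mass left for unseen continuations
unseen_mass = sum(unigram_probs[w] for w in vocab if w not in seen)

def p_katz(w):
    if ("eats", w) in bigram_counts:
        return (bigram_counts[("eats", w)] - D) / context_count
    return alpha * unigram_probs[w] / unseen_mass    # proportional to the unigram model

print(p_katz("grass"))                    # 0.725
print(p_katz("eats"), p_katz("cow"))      # 0.03 vs 0.01: higher unigram prob wins
print(sum(p_katz(w) for w in vocab))      # 1.0
```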

Kneser-Ney Smoothing

  • Redistributes probability mass based on the versatility of the lower-order n-gram
  • The resulting lower-order probability is also called the continuation probability
  • Versatility:
    • High versatility: co-occurs with many unique words
      • E.g. glasses: men’s glasses, black glasses, buy glasses
    • Low versatility: co-occurs with few unique words
      • E.g. Francisco: San Francisco

  • For bigrams:

    PKN(wi|wi-1) = (C(wi-1, wi) - D) / C(wi-1),   if C(wi-1, wi) > 0
                 = α(wi-1) Pcont(wi),             otherwise

    Pcont(wi) = |{wi-1 : C(wi-1, wi) > 0}| / Σwj |{wj-1 : C(wj-1, wj) > 0}|

  • Intuitively, the numerator of Pcont counts the number of unique wi-1 that co-occur with wi (a counting sketch follows below)
  • This gives high continuation counts for glasses and low continuation counts for Francisco
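A minimal sketch of the continuation count behind Pcont, using the glasses/Francisco example from above (the set of bigram types is invented for illustration):

```python
# Distinct bigram types observed in some corpus (invented here).
bigram_types = {
    ("men's", "glasses"), ("black", "glasses"), ("buy", "glasses"),
    ("San", "Francisco"),
}

def p_cont(w):
    # numerator: number of unique wi-1 that co-occur with wi
    unique_contexts = {prev for (prev, curr) in bigram_types if curr == w}
    # denominator: total number of distinct bigram types
    return len(unique_contexts) / len(bigram_types)

print(p_cont("glasses"))    # 3/4: versatile, appears after many different words
print(p_cont("Francisco"))  # 1/4: essentially only ever follows "San"
```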

Interpolation

  • A better way to combine different orders of n-gram models
  • Weighted sum of probabilities across progressively shorter contexts
  • E.g. an interpolated trigram model (a small sketch follows below):

    PIN(wi|wi-2, wi-1) = λ3 P3(wi|wi-2, wi-1) + λ2 P2(wi|wi-1) + λ1 P1(wi),
    where λ3 + λ2 + λ1 = 1
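A minimal sketch of the interpolated trigram formula above (the component probabilities and lambda weights are invented values):

```python
def p_interpolated(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    l3, l2, l1 = lambdas                 # weights must sum to 1
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

# Even when the trigram is unseen (P3 = 0), the bigram and unigram terms
# keep the interpolated probability non-zero:
print(p_interpolated(p_tri=0.0, p_bi=0.2, p_uni=0.05))   # 0.065
```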

Interpolated Kneser-Ney Smoothing

  • Interpolation is used instead of back-off:

    PIKN(wi|wi-1) = max(C(wi-1, wi) - D, 0) / C(wi-1) + β(wi-1) Pcont(wi)

    where β(wi-1) is a normalising constant chosen so the probabilities sum to 1
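A minimal sketch mirroring the interpolated Kneser-Ney formula above; the counts, discount, beta and continuation probability are invented values rather than estimates from a corpus:

```python
def p_ikn(count_bigram, count_context, D, beta, p_cont):
    # max(C(wi-1, wi) - D, 0) / C(wi-1) + beta(wi-1) * Pcont(wi)
    return max(count_bigram - D, 0) / count_context + beta * p_cont

print(p_ikn(count_bigram=3, count_context=4, D=0.75, beta=0.375, p_cont=0.75))
# 2.25 / 4 + 0.375 * 0.75 = 0.84375
```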

In Practice

  • Commonly used Kneser-Ney language models use 5-grams as the maximum order
  • They use a different discount value for each n-gram order

Generating Language

Generation

  • Given an initial word, draw the next word according to the probability distribution produced by the language model.

  • Include n-1 <s> tokens for an n-gram model to provide the context needed to generate the first word

    • Never generate <s>
    • Generating </s> terminates the sequence
  • E.g. see the step-by-step sampling sketch below
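A minimal sketch of the generation loop for a bigram model (the conditional distributions are invented): start from <s>, repeatedly sample the next word from P(· | previous word), and stop when </s> is drawn:

```python
import random

bigram_dist = {                     # P(next | previous), invented values
    "<s>":   {"a": 1.0},
    "a":     {"cow": 0.6, "dog": 0.4},
    "cow":   {"eats": 1.0},
    "dog":   {"eats": 1.0},
    "eats":  {"grass": 0.5, "meat": 0.3, "</s>": 0.2},
    "grass": {"</s>": 1.0},
    "meat":  {"</s>": 1.0},
}

def generate(dist, max_len=20):
    words, prev = [], "<s>"         # <s> provides context but is never emitted
    while len(words) < max_len:
        candidates, weights = zip(*dist[prev].items())
        nxt = random.choices(candidates, weights=weights)[0]
        if nxt == "</s>":           # end-of-sequence tag terminates generation
            break
        words.append(nxt)
        prev = nxt
    return " ".join(words)

print(generate(bigram_dist))        # e.g. "a cow eats grass"
```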

How to select the next word

  • Argmax: take the highest-probability word at each turn

    • This is greedy search
  • Beam search decoding:

    • Keeps track of the top-N highest-probability partial sequences at each turn
    • Selects the sequence of words with the best overall sentence probability
  • Sampling: randomly draw the next word from the distribution (both argmax and sampling are sketched below)
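A minimal sketch contrasting argmax and random sampling on a single decoding step (the next-word distribution is invented); beam search is omitted since it needs to track whole partial sequences:

```python
import random

next_word_dist = {"grass": 0.5, "meat": 0.3, "</s>": 0.2}   # invented P(next | context)

# Greedy search (argmax): always take the highest-probability word.
print(max(next_word_dist, key=next_word_dist.get))          # "grass" every time

# Sampling: draw the next word in proportion to its probability.
words, weights = zip(*next_word_dist.items())
print(random.choices(words, weights=weights)[0])            # "grass" only ~50% of the time
```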
