Language Models

1. Probabilistic language model

Goal: compute the probability of a sentence or sequence of words:

            P(W)=P(w1,w2,w3,...,wn)

Probability of the next word:

            P(wn|w1,w2,w3,...,wn-1)

2. Applications

  • Machine translation

P(Students from my class are the best | 我班上的学生是最棒的) > P(Students from Stanford are the best | 我班上的学生是最棒的), where the condition is the Chinese source sentence ("The students in my class are the best").

  • Natural language generation

P(best|Students from my class are the) > P(average|Students from my class are the)

  • Speech recognition

P(Three students) > P(Tree students)

3. Language models in daily life

  • Search engines
  • Grammar checking

4. Probability of the next word

1) P(best|Students from my class are the) = C(Students from my class are the best) / C(Students from my class are the)   (a conditional probability)

C(...) denotes the count of the phrase "..." in the corpus.

2) An easier way to estimate this probability is to decompose it with the chain rule and then approximate each factor (the bigram approximation introduced below):

P(Students from my class are the best) ≈ P(Students)P(from|Students)P(my|from)P(class|my)P(are|class)P(the|are)P(best|the)

Chain rule: P(w1:n) = P(w1) P(w2|w1) P(w3|w1:2) ... P(wn|w1:n−1)
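
To make the count-based estimate concrete, here is a minimal Python sketch of estimating P(best | Students from my class are the) from raw phrase counts. The toy corpus and the count_phrase helper are hypothetical, chosen only for illustration:

```python
# Toy corpus; in practice this would be a very large collection of text.
corpus = [
    "students from my class are the best",
    "students from my class are the smartest",
    "students from stanford are the best",
]
tokens = [sentence.split() for sentence in corpus]

def count_phrase(phrase):
    """Count how many times a contiguous phrase occurs in the corpus."""
    words = phrase.split()
    n = len(words)
    return sum(
        1
        for sent in tokens
        for i in range(len(sent) - n + 1)
        if sent[i:i + n] == words
    )

# P(best | students from my class are the)
#   = C(students from my class are the best) / C(students from my class are the)
numerator = count_phrase("students from my class are the best")
denominator = count_phrase("students from my class are the")
print(numerator / denominator)  # 1/2 = 0.5 on this toy corpus
```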

5. N-grams

  • Bigram model: approximates the probability of a word given all the previous words by using only the conditional probability of the preceding word.

       e.g.: P(best|Students from my class are the) ≈ P(best|the)

  • Markov assumption: the probability of a word depends only on the previous word   (P(wn|w1:n−1) ≈ P(wn|wn−1))
  • Markov model: assumes we can predict the probability of some future unit without looking too far into the past

  • Generalizing bigram to n-gram: P(wn|w1:n−1) ≈ P(wn|wn−N+1:n−1)

‣ N = 2: bigram ‣ N = 3: trigram ‣ N = 4: 4-gram ‣ N = 5: 5-gram

Special cases:

1) unigram: P(w1:n) = P(w1)P(w2)P(w3) ... P(wn)

2) bigram: condition on the previous word, P(wi|w1:i−1) ≈ P(wi|wi−1)
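
A minimal sketch of estimating bigram probabilities by maximum likelihood, P(wi|wi−1) = C(wi−1 wi) / C(wi−1). The toy corpus (the classic "I am Sam" example) and the <s>/</s> sentence-boundary markers are assumptions made for illustration:

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

def bigram_prob(prev, word):
    """MLE estimate: P(word | prev) = C(prev word) / C(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("<s>", "I"))   # 2/3
print(bigram_prob("I", "am"))    # 2/3
print(bigram_prob("am", "Sam"))  # 1/2
```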

6. Perplexity

  • Definition: the inverse probability of the test set, normalized by the number of words
  • Formula: PP(W) = P(w1 w2 ... wN)^(−1/N)

        Apply the chain rule: PP(W) = ( ∏_{i=1..N} 1 / P(wi|w1:i−1) )^(1/N)

        For a bigram model: PP(W) = ( ∏_{i=1..N} 1 / P(wi|wi−1) )^(1/N)

  • Intuition of perplexity

       Intuitively, perplexity can be understood as a measure of uncertainty.

       Understanding: what is the level of uncertainty when predicting the next word? e.g.:

               - ChatGPT is built on top of OpenAI's GPT-3 family of large language _____ ?

       Uncertainty level

          - Unigram: highest

          - Bigram: high

          - 5-gram: low

       Note: lower perplexity = better model
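
As a sketch (reusing the hypothetical bigram_prob function from the earlier example), perplexity over a test sentence could be computed in log space to avoid numerical underflow; this is just one way to realize the formula above, not a prescribed implementation:

```python
import math

def perplexity(test_words, prob_fn):
    """Perplexity = ( prod over i of 1 / P(wi | wi-1) )^(1/N),
    where N is the number of predicted tokens (bigram transitions here)."""
    log_prob_sum = 0.0
    n = 0
    for prev, word in zip(test_words, test_words[1:]):
        p = prob_fn(prev, word)
        if p == 0.0:
            return float("inf")  # an unseen bigram makes perplexity blow up
        log_prob_sum += math.log(p)
        n += 1
    return math.exp(-log_prob_sum / n)

# Example usage with the bigram_prob sketch defined earlier:
# perplexity("<s> I am Sam </s>".split(), bigram_prob)
```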

7. The perils of overfitting

  •  N-gram models only work well for word prediction if the test corpus looks like the training corpus

             - In the real world, the inference corpus often doesn’t look like the training corpus
             - We need robust models that generalize
             - One kind of generalization problem: zeros

  •  Zeros: things that never occur in the training set but do occur in the test set

8. Zeros

  • Training set: "… denied the allegations", "… denied the reports", "… denied the claims", "… denied the request"
  • Test set: "… denied the offer", "… denied the loan"
  • For the bigram model:

           Zero probability: P(offer | denied the) = 0, since "denied the offer" never appears in the training set

           Perplexity: cannot be computed, because it would require dividing by zero (1/0)
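
To see the zero-probability problem concretely, here is a tiny hypothetical continuation of the earlier sketch; the counts below are made up to mirror the "denied the …" example:

```python
from collections import Counter

# Hypothetical bigram counts from a training set that never contains "the offer".
bigram_counts = Counter({
    ("the", "allegations"): 5, ("the", "reports"): 3,
    ("the", "claims"): 2, ("the", "request"): 1,
})
unigram_counts = Counter({"the": 11})

# MLE probability of a bigram never seen in training:
p = bigram_counts[("the", "offer")] / unigram_counts["the"]
print(p)  # 0.0 -> the whole test-sentence probability is 0, so perplexity is undefined
```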

 

  • For unknown (out-of-vocabulary) words, create an unknown word token <UNK>
  • Training of <UNK> probabilities:

         - Create a fixed lexicon L of size V

         - At the text normalization phase, any training word not in L is changed to <UNK>

  • During inference: use the <UNK> probability for any word not seen in training
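
A minimal sketch of this <UNK> pipeline; the lexicon size V, the toy corpus, and the helper names (build_lexicon, replace_oov) are assumptions made for illustration only:

```python
from collections import Counter

def build_lexicon(training_sentences, V):
    """Create a fixed lexicon L: keep the V most frequent training words."""
    counts = Counter(w for sent in training_sentences for w in sent.split())
    return {word for word, _ in counts.most_common(V)}

def replace_oov(sentence, lexicon):
    """At the text normalization phase, map any word outside L to <UNK>."""
    return [w if w in lexicon else "<UNK>" for w in sentence.split()]

lexicon = build_lexicon(["the cat sat", "the dog sat on the mat"], V=4)
print(replace_oov("the bird sat", lexicon))  # ['the', '<UNK>', 'sat']
# The same replacement is applied at inference time, so unseen words
# simply reuse the probabilities learned for <UNK>.
```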

9. Smoothing

  • To improve the accuracy of our model
  • To handle data sparsity: out-of-vocabulary words and n-grams that are absent from the training set
  • Smoothing techniques

     - Laplace smoothing (also known as add-1 smoothing)

     - Additive (add-k) smoothing

     - Good-Turing smoothing

     - Kneser-Ney smoothing

     - Katz smoothing

     - Church and Gale smoothing

  • Laplace smoothing: assumes every (seen or unseen) event occurred once more than it did in the training data (see the sketch after this list).

  • Intuition of smoothing

                ‣ When we have sparse statistics:

                         - P(w | denied the)

                         • 3 allegations 

                         • 2 reports

                         • 1 claims

                         • 1 request

                 ‣ Steal probability mass to generalize better: P(w | denied the)

                          • 2.5 allegations

                          • 1.5 reports

                          • 0.5 claims

                          • 0.5 request

                          • 2 other
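
The sketch below shows how Laplace (add-1) smoothing could be applied to the bigram estimates from the earlier examples; the function name and the vocabulary-size argument V are my own, and this is only one simple variant among the techniques listed:

```python
def laplace_bigram_prob(prev, word, unigram_counts, bigram_counts, V):
    """Add-1 smoothed estimate: (C(prev word) + 1) / (C(prev) + V).
    Every bigram, seen or unseen, behaves as if it occurred once more."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

# Example usage with the Counter objects from the earlier sketch
# (Counter returns 0 for missing keys, so unseen bigrams still get mass):
# V = len(unigram_counts)
# laplace_bigram_prob("denied", "offer", unigram_counts, bigram_counts, V)
```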

10. Backoff and interpolation

‣ Use less context

        - Backoff

             • use trigram if you have good evidence,

             • otherwise bigram, otherwise unigram

        - Interpolation

             • Mix unigram, bigram, and trigram estimates (see the sketch below)
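
A minimal sketch of linear interpolation; the lambda weights are illustrative values I picked (they must sum to 1 and are normally tuned on held-out data), and the estimator callables are assumed to come from earlier sketches:

```python
def interpolated_prob(w1, w2, w3, p_unigram, p_bigram, p_trigram,
                      lambdas=(0.1, 0.3, 0.6)):
    """Linearly interpolate unigram, bigram, and trigram estimates:
    P_hat(w3 | w1 w2) = l1*P(w3) + l2*P(w3 | w2) + l3*P(w3 | w1 w2)."""
    l1, l2, l3 = lambdas
    return (l1 * p_unigram(w3)
            + l2 * p_bigram(w2, w3)
            + l3 * p_trigram(w1, w2, w3))
```

Backoff, by contrast, would use the trigram estimate alone when there is good evidence for it and fall back to the bigram or unigram estimate only when there is not.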
