C2W3.LAB.N-grams+Language Model+OOV

Lecture: C2W3.Auto-complete and Language Models

N-grams Corpus preprocessing

Text preprocessing was covered earlier; the preprocessing used for language models needs to be distinguished from the preprocessing we have seen before.
Some common preprocessing steps for language models include:

  • lowercasing the text
  • removing special characters
  • splitting the text into a list of sentences
  • splitting each sentence into a list of words

Import the packages:

import nltk               # NLP toolkit
import re                 # Library for Regular expression operations

Lowercase

Words at the beginning of a sentence, names, and proper nouns start with a capital letter. However, when counting words, we want to treat them the same as words that appear in the middle of a sentence. The conversion is done with str.lower.

# change the corpus to lowercase
corpus = "Learning% makes 'me' happy. I am happy be-cause I am learning! :)"
corpus = corpus.lower()

# note that word "learning" will now be the same regardless of its position in the sentence
print(corpus)

Output:
learning% makes 'me' happy. i am happy be-cause i am learning! :)

Remove special characters

Before building n-grams, some characters need to be removed from the corpus. Typically, special characters such as double quotes or dashes "-" are removed, while the period "." or question mark "?" are kept as part of the sentence.

# remove special characters
corpus = "learning% makes 'me' happy. i am happy be-cause i am learning! :)"
corpus = re.sub(r"[^a-zA-Z0-9.?! ]+", "", corpus)
print(corpus)

Output:
learning makes me happy. i am happy because i am learning!
Note that the smiley emoticon, which is very important for sentiment analysis, does not need to be kept here.

Text splitting

Sentences in the corpus are separated by a special delimiter \n. This delimiter is used to split the corpus into an array of sentences; one way to do it is the str.split method.

The following example illustrates how to use this method. The code shows:

  • how to split a string containing a date into an array of its date parts
  • how to split a string containing a time into an array of hours, minutes and seconds

Also, note what happens with the delimiter between "May" and "9": the double space produces an empty string in the resulting array.
# split text by a delimiter to array
input_date="Sat May  9 07:33:35 CEST 2020"

# get the date parts in array
date_parts = input_date.split(" ")
print(f"date parts = {date_parts}")

#get the time parts in array
time_parts = date_parts[4].split(":")
print(f"time parts = {time_parts}")

Output:
date parts = ['Sat', 'May', '', '9', '07:33:35', 'CEST', '2020']
time parts = ['07', '33', '35']

Sentence tokenizing

Once there is a list of sentences, the next step is to split each sentence into a list of words.
This can be done in several ways, even with the str.split method introduced above; here the NLTK library nltk will do the job.

# tokenize the sentence into an array of words

sentence = 'i am happy because i am learning.'
tokenized_sentence = nltk.word_tokenize(sentence)
print(f'{sentence} -> {tokenized_sentence}')

Output:
i am happy because i am learning. -> ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']
After tokenization, other operations become easy, for example computing the length of each word:

# find length of each word in the tokenized sentence
sentence = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']
word_lengths = [(word, len(word)) for word in sentence] # Create a list with the word lengths using a list comprehension
print(f' Lengths of the words: \n{word_lengths}')

Output:
Lengths of the words:
[('i', 1), ('am', 2), ('happy', 5), ('because', 7), ('i', 1), ('am', 2), ('learning', 8), ('.', 1)]

N-grams

Sentence to n-gram

The next step is to build n-grams from the tokenized sentence.
A sliding window of n words generates the n-grams: the window starts at the beginning of the word list and moves one word at a time until the end of the sentence.
Below is an example function that prints all trigrams of a given sentence.

def sentence_to_trigram(tokenized_sentence):
    """
    Prints all trigrams in the given tokenized sentence.
    
    Args:
        tokenized_sentence: The words list.
    
    Returns:
        No output
    """
    # note that the last position of i is 3rd to the end
    for i in range(len(tokenized_sentence) - 3 + 1):
        # the sliding window starts at position i and contains 3 words
        trigram = tokenized_sentence[i : i + 3]
        print(trigram)

tokenized_sentence = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']

print(f'List all trigrams of sentence: {tokenized_sentence}\n')
sentence_to_trigram(tokenized_sentence)

Output:
List all trigrams of sentence: ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']

['i', 'am', 'happy']
['am', 'happy', 'because']
['happy', 'because', 'i']
['because', 'i', 'am']
['i', 'am', 'learning']
['am', 'learning', '.']

Prefix of an n-gram

n-gram probabilities are usually computed from (n-1)-gram counts; the prefix therefore appears in the formula for the n-gram probability:
\begin{equation*} P(w_n|w_1^{n-1})=\frac{C(w_1^n)}{C(w_1^{n-1})} \end{equation*}

# get trigram prefix from a 4-gram
fourgram = ['i', 'am', 'happy','because']
trigram = fourgram[0:-1] # Get the elements from 0, included, up to the last element, not included.
print(trigram)

Output:
['i', 'am', 'happy']
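
To connect the formula with the prefix, here is a minimal sketch that estimates a trigram probability from raw counts (the count values below are made up for illustration):

# P(w_n | prefix) = C(prefix + (w_n,)) / C(prefix)
# illustrative counts, not taken from a real corpus
trigram_counts = {('i', 'am', 'happy'): 2}
bigram_counts = {('i', 'am'): 4}

trigram = ('i', 'am', 'happy')
prefix = trigram[:-1]  # the bigram prefix ('i', 'am')

trigram_probability = trigram_counts[trigram] / bigram_counts[prefix]
print(f"P({trigram[-1]} | {prefix}) = {trigram_probability}")  # 0.5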

Start and end of sentence word <s> and <e>

For consistency of the formulas, start and end tokens are added to each sentence; for an n-gram model we must prepend n-1 start tokens to the beginning of the sentence (and append one end token).

# when working with trigrams, you need to prepend 2 <s> and append one <e>
n = 3
tokenized_sentence = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']
tokenized_sentence = ["<s>"] * (n - 1) + tokenized_sentence + ["<e>"]
print(tokenized_sentence)

Output:

['<s>', '<s>', 'i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.', '<e>']

Building the language model

Count matrix

To compute n-gram probabilities, we need the counts of the n-grams and of the n-gram prefixes in the training dataset. Here the n-gram counts are stored in a dictionary, and then a count matrix is created in which each row is an (n-1)-gram prefix and each column counts a possible last word from the vocabulary.
The code below shows how to check, retrieve, and update n-gram counts in the word count dictionary.

# manipulate n_gram count dictionary
n_gram_counts = {
    ('i', 'am', 'happy'): 2,
    ('am', 'happy', 'because'): 1}

# get count for an n-gram tuple
print(f"count of n-gram {('i', 'am', 'happy')}: {n_gram_counts[('i', 'am', 'happy')]}")

# check if n-gram is present in the dictionary
if ('i', 'am', 'learning') in n_gram_counts:
    print(f"n-gram {('i', 'am', 'learning')} found")
else:
    print(f"n-gram {('i', 'am', 'learning')} missing")

# update the count in the word count dictionary
n_gram_counts[('i', 'am', 'learning')] = 1
if ('i', 'am', 'learning') in n_gram_counts:
    print(f"n-gram {('i', 'am', 'learning')} found")
else:
    print(f"n-gram {('i', 'am', 'learning')} missing")

Output:
count of n-gram ('i', 'am', 'happy'): 2
n-gram ('i', 'am', 'learning') missing
n-gram ('i', 'am', 'learning') found

The following code shows how a prefix tuple and a last word are concatenated into an n-gram tuple:

# concatenate tuple for prefix and tuple with the last word to create the n_gram
prefix = ('i', 'am', 'happy')
word = 'because'

# note here the syntax for creating a tuple for a single word
n_gram = prefix + (word,)
print(n_gram)

Output:
('i', 'am', 'happy', 'because')

import numpy as np
import pandas as pd
from collections import defaultdict
def single_pass_trigram_count_matrix(corpus):
    """
    Creates the trigram count matrix from the input corpus in a single pass through the corpus.
    
    Args:
        corpus: Pre-processed and tokenized corpus. 
    
    Returns:
        bigrams: list of all bigram prefixes, row index
        vocabulary: list of all found words, the column index
        count_matrix: pandas dataframe with bigram prefixes as rows, 
                      vocabulary words as columns 
                      and the counts of the bigram/word combinations (i.e. trigrams) as values
    """
    bigrams = []
    vocabulary = []
    count_matrix_dict = defaultdict(dict)
    
    # go through the corpus once with a sliding window
    for i in range(len(corpus) - 3 + 1):
        # the sliding window starts at position i and contains 3 words
        trigram = tuple(corpus[i : i + 3])
        
        bigram = trigram[0 : -1]
        if bigram not in bigrams:
            bigrams.append(bigram)

        last_word = trigram[-1]
        if last_word not in vocabulary:
            vocabulary.append(last_word)
        
        if (bigram,last_word) not in count_matrix_dict:
            count_matrix_dict[bigram,last_word] = 0
            
        count_matrix_dict[bigram,last_word] += 1
    
    # convert the count_matrix to np.array to fill in the blanks
    count_matrix = np.zeros((len(bigrams), len(vocabulary)))
    for trigram_key, trigram_count in count_matrix_dict.items():
        count_matrix[bigrams.index(trigram_key[0]),
                     vocabulary.index(trigram_key[1])] = trigram_count
    
    # np.array to pandas dataframe conversion
    count_matrix = pd.DataFrame(count_matrix, index=bigrams, columns=vocabulary)
    return bigrams, vocabulary, count_matrix

corpus = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning', '.']

bigrams, vocabulary, count_matrix = single_pass_trigram_count_matrix(corpus)

print(count_matrix)

Output:

                  happy  because    i   am  learning    .
(i, am)             1.0      0.0  0.0  0.0       1.0  0.0
(am, happy)         0.0      1.0  0.0  0.0       0.0  0.0
(happy, because)    0.0      0.0  1.0  0.0       0.0  0.0
(because, i)        0.0      0.0  0.0  1.0       0.0  0.0
(am, learning)      0.0      0.0  0.0  0.0       0.0  1.0

Probability matrix

The next step is to build the probability matrix from the count matrix. We use a pandas DataFrame and its sum and div methods to normalize the cell counts by the sum of each row.

# create the probability matrix from the count matrix
row_sums = count_matrix.sum(axis=1)
# divide each row by its sum
prob_matrix = count_matrix.div(row_sums, axis=0)

print(prob_matrix)

Output:

                  happy  because    i   am  learning    .
(i, am)             0.5      0.0  0.0  0.0       0.5  0.0
(am, happy)         0.0      1.0  0.0  0.0       0.0  0.0
(happy, because)    0.0      0.0  1.0  0.0       0.0  0.0
(because, i)        0.0      0.0  0.0  1.0       0.0  0.0
(am, learning)      0.0      0.0  0.0  0.0       0.0  1.0

Look up the probability of a 3-gram in the probability matrix:

# find the probability of a trigram in the probability matrix
trigram = ('i', 'am', 'happy')

# find the prefix bigram 
bigram = trigram[:-1]
print(f'bigram: {bigram}')

# find the last word of the trigram
word = trigram[-1]
print(f'word: {word}')

# we are using the pandas dataframes here, column with vocabulary word comes first, row with the prefix bigram second
trigram_probability = prob_matrix[word][bigram]
print(f'trigram_probability: {trigram_probability}')

Output:
bigram: ('i', 'am')
word: happy
trigram_probability: 0.5
In the assignment you will need to look up words that start with a given prefix; str.startswith does the job:

# lists all words in vocabulary starting with a given prefix
vocabulary = ['i', 'am', 'happy', 'because', 'learning', '.', 'have', 'you', 'seen','it', '?']
starts_with = 'ha'

print(f'words in vocabulary starting with prefix: {starts_with}\n')
for word in vocabulary:
    if word.startswith(starts_with):
        print(word)

Output:
words in vocabulary starting with prefix: ha

happy
have

Language model evaluation

Train/validation/test split

To evaluate a language model, part of the corpus is held out for validation and testing.
The test and validation data should match the data distribution of the real application as closely as possible. If only the input corpus is available, the test and validation subsets can be defined by random sampling from it.
The function below randomly shuffles the input data and returns the train/validation/test subsets according to the split given by the method's parameters.

# we only need train and validation %, test is the remainder
import random
def train_validation_test_split(data, train_percent, validation_percent):
    """
    Splits the input data into train/validation/test according to the percentages provided
    
    Args:
        data: Pre-processed and tokenized corpus, i.e. list of sentences.
        train_percent: integer 0-100, defines the portion of input corpus allocated for training
        validation_percent: integer 0-100, defines the portion of input corpus allocated for validation
        
        Note: train_percent + validation_percent need to be <= 100
              the remainder to 100 is allocated for the test set
    
    Returns:
        train_data: list of sentences, the training part of the corpus
        validation_data: list of sentences, the validation part of the corpus
        test_data: list of sentences, the test part of the corpus
    """
    # fixed seed here for reproducibility
    random.seed(87)
    
    # reshuffle all input sentences
    random.shuffle(data)

    train_size = int(len(data) * train_percent / 100)
    train_data = data[0:train_size]
    
    validation_size = int(len(data) * validation_percent / 100)
    validation_data = data[train_size:train_size + validation_size]
    
    test_data = data[train_size + validation_size:]
    
    return train_data, validation_data, test_data

data = [x for x in range (0, 100)]

train_data, validation_data, test_data = train_validation_test_split(data, 80, 10)
print("split 80/10/10:\n",f"train data:{train_data}\n", f"validation data:{validation_data}\n", 
      f"test data:{test_data}\n")

train_data, validation_data, test_data = train_validation_test_split(data, 98, 1)
print("split 98/1/1:\n",f"train data:{train_data}\n", f"validation data:{validation_data}\n", 
      f"test data:{test_data}\n")

Output:

split 80/10/10:
 train data:[28, 76, 5, 0, 62, 29, 54, 95, 88, 58, 4, 22, 92, 14, 50, 77, 47, 33, 75, 68, 56, 74, 43, 80, 83, 84, 73, 93, 66, 87, 9, 91, 64, 79, 20, 51, 17, 27, 12, 31, 67, 81, 7, 34, 45, 72, 38, 30, 16, 60, 40, 86, 48, 21, 70, 59, 6, 19, 2, 99, 37, 36, 52, 61, 97, 44, 26, 57, 89, 55, 53, 85, 3, 39, 10, 71, 23, 32, 25, 8]
 validation data:[78, 65, 63, 11, 49, 98, 1, 46, 15, 41]
 test data:[90, 96, 82, 42, 35, 13, 69, 24, 94, 18]

split 98/1/1:
 train data:[66, 23, 29, 28, 52, 87, 70, 13, 15, 2, 62, 43, 82, 50, 40, 32, 30, 79, 71, 89, 6, 10, 34, 78, 11, 49, 39, 42, 26, 46, 58, 96, 97, 8, 56, 86, 33, 93, 92, 91, 57, 65, 95, 20, 72, 3, 12, 9, 47, 37, 67, 1, 16, 74, 53, 99, 54, 68, 5, 18, 27, 17, 48, 36, 24, 45, 73, 19, 41, 59, 21, 98, 0, 31, 4, 85, 80, 64, 84, 88, 25, 44, 61, 22, 60, 94, 76, 38, 77, 81, 90, 69, 63, 7, 51, 14, 55, 83]
 validation data:[35]
 test data:[75]

Perplexity

The perplexity is computed as:
\begin{equation*} PP(W)=\sqrt[M]{\prod_{i=1}^{M}{\frac{1}{P(w_i|w_{i-1})}}} \end{equation*}
Taking the M-th root can be rewritten as a power:
\begin{equation*} \sqrt[M]{\frac{1}{x}} = x^{-\frac{1}{M}} \end{equation*}

# to calculate the exponent, use the following syntax
p = 10 ** (-250)
M = 100
perplexity = p ** (-1 / M)
print(perplexity)

Output:
316.22776601683796
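
Putting the formula and the exponent trick together, here is a minimal sketch that computes the perplexity of a short test sentence under a bigram model (the bigram probabilities are made up for illustration):

# compute the perplexity of a test sentence under a bigram model
# PP(W) = (product of 1/P(w_i | w_{i-1})) ** (1/M), written here as product ** (-1/M)
# illustrative probabilities only
bigram_probabilities = {('i', 'am'): 1.0, ('am', 'learning'): 0.5, ('learning', '.'): 1.0}

test_sentence = ['i', 'am', 'learning', '.']
M = len(test_sentence)

product = 1
for i in range(len(test_sentence) - 1):
    bigram = tuple(test_sentence[i: i + 2])
    product *= bigram_probabilities[bigram]

perplexity = product ** (-1 / M)
print(f"perplexity: {perplexity}")  # 0.5 ** (-1/4) ≈ 1.19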

Out of vocabulary words (OOV)

Vocabulary

The first step in handling unknown words is deciding which words belong to the vocabulary and which do not. For example:
The minimum-frequency approach: every word that appears in the training set with a frequency greater than or equal to a minimum frequency is added to the vocabulary.
Below is the code for another approach: the target size of the vocabulary is known in advance, and the vocabulary is filled with the most frequent words from the training set, in sorted order, until it reaches that size. (A sketch of the minimum-frequency approach follows the output below.)

# build the vocabulary from M most frequent words
# use Counter object from the collections library to find M most common words
from collections import Counter

# the target size of the vocabulary
M = 3

# pre-calculated word counts
# Counter could be used to build this dictionary from the source corpus
word_counts = {'happy': 5, 'because': 3, 'i': 2, 'am': 2, 'learning': 3, '.': 1}

vocabulary = Counter(word_counts).most_common(M)

# remove the frequencies and leave just the words
vocabulary = [w[0] for w in vocabulary]

print(f"the new vocabulary containing {M} most frequent words: {vocabulary}\n") 
    

Output:
the new vocabulary containing 3 most frequent words: ['happy', 'because', 'learning']
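
For comparison, here is a minimal sketch of the minimum-frequency approach mentioned above (the threshold of 2 is an arbitrary choice for illustration):

# build the vocabulary from all words with count >= minimum_freq
minimum_freq = 2  # illustrative threshold

word_counts = {'happy': 5, 'because': 3, 'i': 2, 'am': 2, 'learning': 3, '.': 1}

vocabulary_min_freq = [word for word, count in word_counts.items() if count >= minimum_freq]
print(f"vocabulary with minimum frequency {minimum_freq}: {vocabulary_min_freq}")
# ['happy', 'because', 'i', 'am', 'learning']

The list is kept in a separate variable (vocabulary_min_freq) so it does not overwrite the top-M vocabulary used below.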
With the vocabulary ready, we can use it to replace OOV words with <UNK>:

# test if words in the input sentences are in the vocabulary, if OOV, print <UNK>
sentence = ['am', 'i', 'learning']
output_sentence = []
print(f"input sentence: {sentence}")

for w in sentence:
    # test if word w is in vocabulary
    if w in vocabulary:
        output_sentence.append(w)
    else:
        output_sentence.append('<UNK>')
        
print(f"output sentence: {output_sentence}")

Output:
input sentence: ['am', 'i', 'learning']
output sentence: ['<UNK>', '<UNK>', 'learning']

Next, iterate over all words and their counts, and print only the words whose frequency equals f.

# iterate through all word counts and print words with given frequency f
f = 3

word_counts = {'happy': 5, 'because': 3, 'i': 2, 'am': 2, 'learning':3, '.': 1}

for word, freq in word_counts.items():
    if freq == f:
        print(word)

Output:
because
learning
Note: do not overuse the <UNK> tag, otherwise it will degrade model performance; try to work out why.

# many <unk> low perplexity 
training_set = ['i', 'am', 'happy', 'because','i', 'am', 'learning', '.']
training_set_unk = ['i', 'am', '<UNK>', '<UNK>','i', 'am', '<UNK>', '<UNK>']

test_set = ['i', 'am', 'learning']
test_set_unk = ['i', 'am', '<UNK>']

M = len(test_set)
probability = 1
probability_unk = 1

# pre-calculated probabilities
bigram_probabilities = {('i', 'am'): 1.0, ('am', 'happy'): 0.5, ('happy', 'because'): 1.0, ('because', 'i'): 1.0, ('am', 'learning'): 0.5, ('learning', '.'): 1.0}
bigram_probabilities_unk = {('i', 'am'): 1.0, ('am', '<UNK>'): 1.0, ('<UNK>', '<UNK>'): 0.5, ('<UNK>', 'i'): 0.25}

# go through the test set and calculate its bigram probability
for i in range(len(test_set) - 2 + 1):
    bigram = tuple(test_set[i: i + 2])
    probability = probability * bigram_probabilities[bigram]
        
    bigram_unk = tuple(test_set_unk[i: i + 2])
    probability_unk = probability_unk * bigram_probabilities_unk[bigram_unk]

# calculate perplexity for both original test set and test set with <UNK>
perplexity = probability ** (-1 / M)
perplexity_unk = probability_unk ** (-1 / M)

print(f"perplexity for the training set: {perplexity}")
print(f"perplexity for the training set with <UNK>: {perplexity_unk}")

Output:

perplexity for the training set: 1.2599210498948732
perplexity for the training set with <UNK>: 1.0

Smoothing

Here we use add-k smoothing, a smoothing technique for language models in NLP that avoids zero probabilities by adding a constant k to every n-gram count in the probability estimate. The drawback of this method is that n-grams unseen in the training data receive an inflated probability.
In the output of the code below, a phrase from the training set gets the same probability as an unknown phrase.
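
For reference, the add-k estimate implemented below can be written as (with |V| the vocabulary size):
\begin{equation*} \hat{P}(w_n|w_1^{n-1})=\frac{C(w_1^{n})+k}{C(w_1^{n-1})+k|V|} \end{equation*}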

def add_k_smoothing_probability(k, vocabulary_size, n_gram_count, n_gram_prefix_count):
    numerator = n_gram_count + k
    denominator = n_gram_prefix_count + k * vocabulary_size
    return numerator / denominator

trigram_probabilities = {('i', 'am', 'happy') : 2}
bigram_probabilities = {( 'am', 'happy') : 10}
vocabulary_size = 5
k = 1

probability_known_trigram = add_k_smoothing_probability(k, vocabulary_size, trigram_probabilities[('i', 'am', 'happy')], 
                           bigram_probabilities[( 'am', 'happy')])

probability_unknown_trigram = add_k_smoothing_probability(k, vocabulary_size, 0, 0)

print(f"probability_known_trigram: {probability_known_trigram}")
print(f"probability_unknown_trigram: {probability_unknown_trigram}")

Output:
probability_known_trigram: 0.2
probability_unknown_trigram: 0.2

Back-off

Back-off is a model generalization method that uses lower-order n-gram information when the higher-order n-gram information is missing. For example, if the trigram probability is missing, bigram information is used, and so on.
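
As a sketch of what the code below does, the "stupid" back-off estimate with a constant factor λ (0.4 here) can be written as:
\begin{equation*} \hat{P}(w_n|w_{n-2}w_{n-1})=\begin{cases} P(w_n|w_{n-2}w_{n-1}) & \text{if } P(w_n|w_{n-2}w_{n-1})>0 \\ \lambda\, P(w_n|w_{n-1}) & \text{else if } P(w_n|w_{n-1})>0 \\ \lambda^2\, P(w_n) & \text{otherwise} \end{cases} \end{equation*}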

# pre-calculated probabilities of all types of n-grams
trigram_probabilities = {('i', 'am', 'happy'): 0}
bigram_probabilities = {( 'am', 'happy'): 0.3}
unigram_probabilities = {'happy': 0.4}

# this is the input trigram we need to estimate
trigram = ('are', 'you', 'happy')

# find the last bigram and unigram of the input
bigram = trigram[1: 3]
unigram = trigram[2]
print(f"besides the trigram {trigram} we also use bigram {bigram} and unigram ({unigram})\n")

# 0.4 is used as an example, experimentally found for web-scale corpuses when using the "stupid" back-off
lambda_factor = 0.4
probability_hat_trigram = 0

# search for first non-zero probability starting with trigram
# to generalize this for any order of n-gram hierarchy, 
# you could loop through the probability dictionaries instead of if/else cascade
if trigram not in trigram_probabilities or trigram_probabilities[trigram] == 0:
    print(f"probability for trigram {trigram} not found")
    
    if bigram not in bigram_probabilities or bigram_probabilities[bigram] == 0:
        print(f"probability for bigram {bigram} not found")
        
        if unigram in unigram_probabilities:
            print(f"probability for unigram {unigram} found\n")
            probability_hat_trigram = lambda_factor * lambda_factor * unigram_probabilities[unigram]
        else:
            probability_hat_trigram = 0
    else:
        probability_hat_trigram = lambda_factor * bigram_probabilities[bigram]
else:
    probability_hat_trigram = trigram_probabilities[trigram]

print(f"probability for trigram {trigram} estimated as {probability_hat_trigram}")

Output:
besides the trigram ('are', 'you', 'happy') we also use bigram ('you', 'happy') and unigram (happy)

probability for trigram ('are', 'you', 'happy') not found
probability for bigram ('you', 'happy') not found
probability for unigram happy found

probability for trigram ('are', 'you', 'happy') estimated as 0.06400000000000002

Interpolation

Another way to use lower-order n-gram probabilities is interpolation, a technique for smoothing language models. The core idea is to combine different language models or probability distributions with certain weights, reducing the model's bias when predicting rare events.
Interpolation typically involves the following steps:
1. Define the base models: define one or more base language models, possibly built from n-grams of different sizes, e.g. unigram (1-gram), bigram (2-gram), and so on.
2. Compute the interpolation weights: assign a weight to each base model; the weights usually sum to 1. The weights can be chosen based on model performance or determined by methods such as cross-validation.
3. Interpolate: for a given n-gram, its interpolated probability is the weighted sum of the probabilities from all base models.
4. Handle zero probabilities: because interpolation combines the predictions of several models, it usually avoids the zero-probability problem; even if one model assigns zero probability, the non-zero contributions from the other models keep the interpolated probability above zero.
5. Optimize and tune: in practice, the interpolation weights may need to be adjusted to optimize model performance, typically using held-out data so that the interpolated model performs best on the test set.
The advantage of interpolation is that it combines the strengths of several models, improving robustness and generalization. Its limitations are that the weights have to be chosen and tuned sensibly, and the computation cost may increase.
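
In the trigram case used in the code below, the interpolated estimate (the weights λ1, λ2, λ3 sum to 1) is:
\begin{equation*} \hat{P}(w_n|w_{n-2}w_{n-1})=\lambda_1 P(w_n|w_{n-2}w_{n-1})+\lambda_2 P(w_n|w_{n-1})+\lambda_3 P(w_n) \end{equation*}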

# pre-calculated probabilities of all types of n-grams
trigram_probabilities = {('i', 'am', 'happy'): 0.15}
bigram_probabilities = {( 'am', 'happy'): 0.3}
unigram_probabilities = {'happy': 0.4}

# the weights come from optimization on a validation set
lambda_1 = 0.8
lambda_2 = 0.15
lambda_3 = 0.05

# this is the input trigram we need to estimate
trigram = ('i', 'am', 'happy')

# find the last bigram and unigram of the input
bigram = trigram[1: 3]
unigram = trigram[2]
print(f"besides the trigram {trigram} we also use bigram {bigram} and unigram ({unigram})\n")

# in the production code, you would need to check if the probability n-gram dictionary contains the n-gram
probability_hat_trigram = (lambda_1 * trigram_probabilities[trigram]
                           + lambda_2 * bigram_probabilities[bigram]
                           + lambda_3 * unigram_probabilities[unigram])

print(f"estimated probability of the input trigram {trigram} is {probability_hat_trigram}")

Output:

besides the trigram ('i', 'am', 'happy') we also use bigram ('am', 'happy') and unigram (happy)

estimated probability of the input trigram ('i', 'am', 'happy') is 0.185