Building an Autocomplete System with N-grams

This article describes how to build an autocomplete system with n-grams: preprocess the data (tokenization and handling 'Out of Vocabulary' words), build an n-gram language model, compute sentence probabilities, and finally evaluate the model with perplexity. Each step is explained in detail, with accompanying code exercises.

An n-gram language model computes the probability of a sentence; informally, it estimates how likely a string of words is to be real, natural language. The n refers to how the sentence is split up: into groups of n consecutive words.

How do we compute the probability of a sentence? With conditional probability and the chain rule:

P(B|A) = P(A,B) / P(A)  ==>  P(A,B) = P(A) P(B|A)

So P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)

If the sentence is long, this expression gets long and the computation becomes expensive. To simplify it, we introduce the Markov assumption: the probability of any word depends only on a limited number (one or a few) of the words immediately before it.

If, in the example above, a word's probability depends only on the single preceding word, then

P(A,B,C,D) ≈ P(A) P(B|A) P(C|B) P(D|C)

In general, if a word's probability depends only on the preceding n words (the n in n-gram), the probability of a sentence is computed as

P(w_1, w_2, ..., w_T) ≈ ∏_t P(w_t | w_{t-n}, ..., w_{t-1})
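
As a concrete illustration, here is a minimal sketch of this idea in Python for the bigram case, using small made-up counts; the counts and the helper bigram_prob are purely illustrative and not part of the assignment:

# Toy counts, made up for illustration only
bigram_counts = {("i", "eat"): 3, ("eat", "scrambled"): 2, ("scrambled", "eggs"): 2}
unigram_counts = {"i": 5, "eat": 3, "scrambled": 2, "eggs": 2}

def bigram_prob(prev_word, word):
    # P(word | prev_word) = C(prev_word, word) / C(prev_word)
    return bigram_counts.get((prev_word, word), 0) / unigram_counts[prev_word]

sentence = ["i", "eat", "scrambled", "eggs"]
prob = 1.0
for prev_word, word in zip(sentence, sentence[1:]):
    prob *= bigram_prob(prev_word, word)

# 3/5 * 2/3 * 2/2 = 0.4
# Note: P(w_1) itself is not modeled here; the assignment instead marks
# sentence starts by prepending <s> tokens (see Part 2).
print(prob)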

What follows is a translation; the original notebook is at

https://github.com/tsuirak/deeplearning.ai/blob/master/Natural%20Language%20Processing/Course%202%20-%20Probabilistic%20Models/Labs/Week%203/C2-W3-assginment-Auto%20Complete.ipynb

In this article, you will build an autocomplete system. Autocomplete systems are something you see every day:

  • When you search on Google, you often get suggestions that help you complete your search.
  • When you write an email, you get suggestions for how to finish your sentence.

By the end of this assignment, you will have developed a prototype of such a system.


Outline

1. Load and preprocess data

1.1 Load the data

1.2 Preprocess the data

2. Develop an n-gram based language model

3. Perplexity

4. Build an autocomplete system

A key building block of an autocomplete system is a language model. A language model assigns a probability to a sequence of words; in other words, more likely sentences receive higher scores.

"I have a pen" 跟"I am a pen" 相比,我们希望它有更高的概率,因为它在我们的实际中,更加符合自然的句子

You can use these probability estimates to build an autocomplete system. For example, if the user types:

"I eat scrambled" ,那么你可以找到一个单词 x 使 "I eat scrambled x" 拥有最高的概率. 如果x = "eggs", 那么句子将会是 "I eat scrambled eggs"

Many kinds of language models have been developed. In this assignment we use N-grams, a simple but powerful method for building one.

  • N-grams are also used in machine translation and speech recognition.

These are the steps of this assignment:

  1. Load and preprocess the data
  • Load the data and tokenize it.
  • Split the sentences into a training set and a test set.
  • Replace low-frequency words with <unk>.

2. Develop an n-gram based language model

  • Count the n-grams in a given data set
  • Estimate the conditional probability of the next word using k-smoothing

3. Evaluate the N-gram model by computing its perplexity

4. Use your model to suggest the next word of a sentence

First, import the required libraries:

import math
import random
import numpy as np
import nltk
import pandas as pd
nltk.data.path.append('.')

Part 1: Load and preprocess the data

Part 1.1: Load the data

You will use Twitter data. Run the code below to load it and look at the beginning and end of the data.

Note that the data is one very long string containing many tweets, with newline characters "\n" separating them.

with open("en_US.twitter.txt","r") as f:
    data = f.read()
    
print("Data type:", type(data))
print("Number of letters:", len(data))
print("First 300 letters of the data")
print("-------")
display(data[0:300])
print("-------")

print("Last 300 letters of the data")
print("-------")
display(data[-300:])
print("-------")

Output:

Data type: <class 'str'>
Number of letters: 3335477
First 300 letters of the data
-------
"How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.\nWhen you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason.\nthey've decided its more fun if I don't.\nSo Tired D; Played Lazer Tag & Ran A "
-------

Last 300 letters of the data
-------
"ust had one a few weeks back....hopefully we will be back soon! wish you the best yo\nColombia is with an 'o'...“: We now ship to 4 countries in South America (fist pump). Please welcome Columbia to the Stunner Family”\n#GutsiestMovesYouCanMake Giving a cat a bath.\nCoffee after 5 was a TERRIBLE idea.\n"
-------

Part 1.2: Preprocess the data

Preprocess the data with the following steps:

  1. Split the data into sentences, using "\n" as the delimiter.
  2. Split each sentence into tokens. Note that in this article, "token" and "word" are used interchangeably.
  3. Assign the sentences to a training set and a test set.
  4. In the training set, find the words that appear at least N times.
  5. Replace words that appear fewer than N times with <unk>.

Note: in this exercise we drop the validation set.

  • In a real application, we would hold out part of the data as a validation set and use it to tune the training.
  • For simplicity, we skip that step here.

Exercise 01

Split the data into sentences.

def split_to_sentences(data):
    """
    Split data by linebreak "\n"
    
    Args:
        data: str
    
    Returns:
        A list of sentences
    """
    
    sentences = data.split('\n')
    
    # Additional cleaning (this part is already implemented)
    # - Remove leading and trailing spaces from each sentence
    # - Drop sentences if they are empty strings.
    
    sentences = [s.strip() for s in sentences]
    sentences = [s for s in sentences if len(s) > 0]
    
    return sentences
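
A quick check of split_to_sentences on a tiny made-up string:

x = "I have a pen.\nI have an apple. \nAb\nAb cd\n"
print(split_to_sentences(x))
# ['I have a pen.', 'I have an apple.', 'Ab', 'Ab cd']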

Exercise 02

The next step is to tokenize the sentences (split each sentence into a list of words).

  • Convert all words to lowercase, so that capitalized and lowercase forms (e.g. Big and big) are treated as the same word.
  • Append each sentence's list of tokens to one overall list (a quick check follows the function below).

def tokenize_sentences(sentences):
    """
    Tokenize sentences into tokens (words)
    
    Args:
        sentences: List of strings
    
    Returns:
        List of lists of tokens
    """
    
    # Initialize the list of lists of tokenized sentences
    tokenized_sentences = []
    
    # go through each sentence 
    for sentence in sentences:
        
        # convert to lowercase letters
        sentence = sentence.lower()
        
        # convert to a list of words
        tokenized = nltk.word_tokenize(sentence)
        
        # append the list of words to the list of lists
        tokenized_sentences.append(tokenized)
        
    return tokenized_sentences
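
A quick check of tokenize_sentences (this assumes the NLTK 'punkt' tokenizer data is available, e.g. after nltk.download('punkt')):

sentences = ["Sky is blue.", "Leaves are green.", "Roses are red."]
print(tokenize_sentences(sentences))
# [['sky', 'is', 'blue', '.'], ['leaves', 'are', 'green', '.'], ['roses', 'are', 'red', '.']]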

Exercise 03

Use the two functions defined above to get the tokenized data:

  • Split the data into sentences
  • Tokenize those sentences

def get_tokenized_data(data):
    """
    Make a list of tokenized sentences
    
    Args:
        data: String
    
    Returns:
        List of lists of tokens
    """
    # get the sentences by splitting up the data
    sentences = split_to_sentences(data)
    
    # get the list of lists of tokens by tokenizing the sentences
    tokenized_sentences = tokenize_sentences(sentences)
    
    return tokenized_sentences

Split the data into training and test sets:

tokenized_data = get_tokenized_data(data)
random.seed(87)
random.shuffle(tokenized_data)

train_size = int(len(tokenized_data) * 0.8)
train_data = tokenized_data[0:train_size]
test_data = tokenized_data[train_size:]

Exercise 04

Not all words in the training data will be used; you will only work with the words that occur often enough.

  • You will focus on the words that appear at least N times in the data.
  • First, count how many times each word appears.

You will need two nested loops: one over the sentences, and one over the words within each sentence.

def count_words(tokenized_sentences):
    """
    Count the number of times each word appears in the tokenized sentences
    
    Args:
        tokenized_sentences: List of lists of strings
    
    Returns:
        dict that maps word (str) to the frequency (int)
    """
    word_counts = {}
    
    # Loop through the sentences
    for sentence in tokenized_sentences:
        
        for token in sentence:
            
            # If the token is not in the dictionary yet, set the count to 1
            if token not in word_counts.keys():
                word_counts[token] = 1
            
            # If the token is already in the dictionary, increment the count by 1
            else:
                word_counts[token] += 1
                
    return word_counts
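
A quick check of count_words (collections.Counter would give the same result in one line, but the explicit loop mirrors the exercise):

tokenized_sentences = [['sky', 'is', 'blue', '.'],
                       ['leaves', 'are', 'green', '.'],
                       ['roses', 'are', 'red', '.']]
print(count_words(tokenized_sentences))
# {'sky': 1, 'is': 1, 'blue': 1, '.': 3, 'leaves': 1, 'are': 2, 'green': 1, 'roses': 1, 'red': 1}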

Handling 'Out of Vocabulary' words

If your autocomplete model encounters a word it never saw during training, it cannot suggest what comes next: the model has no counts for that unseen word, so it cannot predict the next word.

  • Such new words are called 'unknown words', or out of vocabulary (OOV) words.
  • The percentage of unknown words in the test set is called the OOV rate.

To handle unknown words at prediction time, we represent them with a special token '<unk>'.

Modify the training data so that the model also gets to train on some 'unknown' words:

  • Low-frequency words in the training and test data will become "unknown" words.
  • Create a list of the most frequent words in the training set, called the closed vocabulary.
  • Convert every word that is not part of the closed vocabulary into the token '<unk>'.

Exercise 05

Write a function that takes the documents and a count threshold 'count_threshold'.

  • Any word whose frequency is at least 'count_threshold' is kept in the closed vocabulary (a quick check follows the function below).

def get_words_with_nplus_frequency(tokenized_sentences, count_threshold):
    """
    Find the words that appear N times or more
    
    Args:
        tokenized_sentences: List of lists of strings (tokenized sentences)
        count_threshold: minimum number of occurrences for a word to be in the closed vocabulary.
    
    Returns:
        List of words that appear N times or more
    """
    # Initialize an empty list to contain the words that
    # appear at least N times
    closed_vocab = []
    
    # Get the word counts of the tokenized sentences
    # Use the function that you defined earlier to count the words
    word_counts = count_words(tokenized_sentences)
    
    for word,cnt in word_counts.items():
        
        # Check that the word's count
        # is at least the minimum count threshold
        
        if cnt >= count_threshold:
            closed_vocab.append(word)
            
    return closed_vocab
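
A quick check with a count threshold of 2:

tokenized_sentences = [['sky', 'is', 'blue', '.'],
                       ['leaves', 'are', 'green', '.'],
                       ['roses', 'are', 'red', '.']]
print(get_words_with_nplus_frequency(tokenized_sentences, count_threshold=2))
# ['.', 'are']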

Exercise 06

  • Every word that is not in the closed vocabulary is 'unknown'.
  • Replace each 'unknown' word with the token "<unk>" (a quick check follows the function below).

def replace_oov_words_by_unk(tokenized_sentences, vocabulary, unknown_token="<unk>"):
    """
    Replace words not in the given vocabulary with '<unk>' token.
    
    Args:
        tokenized_sentences: List of lists of strings
        vocabulary: List of strings that we will use
        unknown_token: A string representing unknown (out-of-vocabulary) words
    
    Returns:
        List of lists of strings, with words not in the vocabulary replaced
    """
    # Place vocabulary into a set for faster search
    vocabulary = set(vocabulary)
    
    # Initialize a list that will hold the sentences 
    # after less frequent words are replaced by the unk
    replaced_tokenized_sentences = []
    
    for sentence in tokenized_sentences:
        
        # Initialize the list that will contain
        # a single sentence with unk replacements
        replaced_sentence = []
        
        for token in sentence:
            
            if token in vocabulary:
                replaced_sentence.append(token)
                
            else:
                replaced_sentence.append(unknown_token)
                
        replaced_tokenized_sentences.append(replaced_sentence)
        
    return replaced_tokenized_sentences
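
A quick check of replace_oov_words_by_unk with a tiny vocabulary:

tokenized_sentences = [["dogs", "run"], ["cats", "sleep"]]
vocabulary = ["dogs", "sleep"]
print(replace_oov_words_by_unk(tokenized_sentences, vocabulary))
# [['dogs', '<unk>'], ['<unk>', 'sleep']]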

Exercise 07

Now we are ready to preprocess the data by combining the functions implemented above:

  1. In the training set, find the words that appear at least count_threshold times.
  2. In both the training and test sets, replace words that appear fewer than count_threshold times with "<unk>" (a usage sketch follows the function).

def preprocess_data(train_data, test_data, count_threshold):
    """
    Preprocess data, i.e.,
        - Find tokens that appear at least N times in the training data.
        - Replace tokens that appear less than N times by "<unk>" both for training and test data.        
    Args:
        train_data, test_data: List of lists of strings.
        count_threshold: Words whose count is less than this are 
                      treated as unknown.
    
    Returns:
        Tuple of
        - training data with low frequent words replaced by "<unk>"
        - test data with low frequent words replaced by "<unk>"
        - vocabulary of words that appear n times or more in the training data
    """
    
    # Get the closed vocabulary using the train data
    vocabulary = get_words_with_nplus_frequency(train_data,count_threshold)
    
    # For the train data, replace less common words with "<unk>"
    train_data_replaced = replace_oov_words_by_unk(train_data,vocabulary)
    
    # For the test data, replace less common words with "<unk>"
    test_data_replaced = replace_oov_words_by_unk(test_data,vocabulary)
    
    # Return the replaced training data, replaced test data, and the vocabulary
    return train_data_replaced, test_data_replaced, vocabulary
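
With preprocess_data in place, the train and test splits can now be processed. A sketch of that call with a minimum frequency of 2 is shown below; the resulting train_data_processed and vocabulary are the variables used again in Part 4:

minimum_freq = 2
train_data_processed, test_data_processed, vocabulary = preprocess_data(train_data, test_data, minimum_freq)

print("Size of the vocabulary:", len(vocabulary))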

Part 2: Develop an n-gram based language model

In this section, you will develop the n-gram language model.

  • Assume the probability of the current word depends only on the previous n-gram.
  • The previous n-gram is the sequence of the previous n words.

The conditional probability of the word at position t in the sentence, given the preceding words w_{t-n}, ..., w_{t-1}, is:

P(w_t | w_{t-n}, ..., w_{t-1})

You can estimate this probability by counting how often these word sequences occur in the training data.

In this estimate, the numerator is the number of times the word at position t (w_t) appears right after the sequence w_{t-n}, ..., w_{t-1}; the denominator is the number of times the sequence w_{t-n}, ..., w_{t-1} appears at all.

P(w_t | w_{t-n}, ..., w_{t-1}) ≈ C(w_{t-n}, ..., w_{t-1}, w_t) / C(w_{t-n}, ..., w_{t-1})

If a count is zero (in either the numerator or the denominator), this estimate can be fixed by adding k-smoothing to the formula.

From the formula above: for the denominator we need the counts of sequences of n words, and for the numerator the counts of sequences of n+1 words.

Exercise 08

Next, you will write a function that counts n-grams for any value of n.

Before counting, prepend n '<s>' tokens to each sentence to mark its beginning; for example, with n=2 the sentence "I like food" becomes "<s> <s> I like food". Also append an '<e>' token to mark the end of the sentence.

Technical note: in this function, you will use a dictionary to store the counts.

  • The dictionary keys are tuples of n words (not lists).
  • The dictionary values are the occurrence counts.
  • Tuples are used as keys instead of lists because lists are mutable in Python (they can be modified), while tuples are immutable: once created they cannot be changed, so they can serve as dictionary keys.

def count_n_grams(data, n, start_token='<s>', end_token = '<e>'):
    """
    Count all n-grams in the data
    
    Args:
        data: List of lists of words
        n: number of words in a sequence
    
    Returns:
        A dictionary that maps a tuple of n-words to its frequency
    """
    # Initialize dictionary of n-grams and their counts
    n_grams = {}
    
    for sentence in data:
        
        # prepend start token n times, and append <e> one time
        sentence = [start_token] * n + sentence + [end_token]
        
        # convert list to tuple 
        # So that the sequence of words can be used as
        # a key in the dictionary
        sentence = tuple(sentence)
        
        # Use i to indicate the start of the n-gram
        # from index 0
        # to the last index where the end of the n-gram
        # is within the sentence
        
        m = len(sentence) - n + 1
        
        for i in range(m):
            
            # get the n-gram starting at position i
            n_gram = sentence[i: i + n]
            
            # check if the n-gram is in the dictionary
            if n_gram in n_grams.keys():
                n_grams[n_gram] += 1
            else:
                n_grams[n_gram] = 1
                
    return n_grams
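
A quick check of count_n_grams on two toy sentences (the same sentences are reused in the matrix examples below):

sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
print("Uni-gram counts:")
print(count_n_grams(sentences, 1))
# {('<s>',): 2, ('i',): 1, ('like',): 2, ('a',): 2, ('cat',): 2, ('<e>',): 2, ('this',): 1, ('dog',): 1, ('is',): 1}
print("Bi-gram counts:")
print(count_n_grams(sentences, 2))
# {('<s>', '<s>'): 2, ('<s>', 'i'): 1, ('i', 'like'): 1, ('like', 'a'): 2, ('a', 'cat'): 2, ('cat', '<e>'): 2, ('<s>', 'this'): 1, ('this', 'dog'): 1, ('dog', 'is'): 1, ('is', 'like'): 1}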

Exercise 09

Next, estimate the probability of a word given the n words that precede it.

P(w_t | w_{t-n}, ..., w_{t-1}) ≈ C(w_{t-n}, ..., w_{t-1}, w_t) / C(w_{t-n}, ..., w_{t-1})

With this formula, if an n-gram never occurs in the training data, the denominator is zero and the expression breaks down. To handle such zero counts, we add k-smoothing.

P(w_t | w_{t-n}, ..., w_{t-1}) ≈ (C(w_{t-n}, ..., w_{t-1}, w_t) + k) / (C(w_{t-n}, ..., w_{t-1}) + k|V|)

A constant k is added to the numerator and k|V| to the denominator, so any n-gram with a zero count gets probability 1/|V|, where |V| is the vocabulary size.

def estimate_probability(word, previous_n_gram, 
                         n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=1.0):
    """
    Estimate the probabilities of a next word using the n-gram counts with k-smoothing
    
    Args:
        word: next word
        previous_n_gram: A sequence of words of length n
        n_gram_counts: Dictionary of counts of n-grams
        n_plus1_gram_counts: Dictionary of counts of (n+1)-grams
        vocabulary_size: number of words in the vocabulary
        k: positive constant, smoothing parameter
    
    Returns:
        A probability
    """
    # convert list to tuple to use it as a dictionary key
    previous_n_gram = tuple(previous_n_gram)
    
    # Set the denominator
    # If the previous n-gram exists in the dictionary of n-gram counts,
    # Get its count.  Otherwise set the count to zero
    # Use the dictionary that has counts for n-grams
    previous_n_gram_count = n_gram_counts[previous_n_gram] if previous_n_gram in n_gram_counts else 0
    
    # Calculate the denominator using the count of the previous n-gram
    # and apply k-smoothing
    denominator = previous_n_gram_count + k*vocabulary_size
    
    # Define n plus 1 gram as the previous n-gram plus the current word as a tuple
    n_plus1_gram = previous_n_gram + (word,)
    
    # Set the count to the count in the dictionary,
    # otherwise 0 if not in the dictionary
    # use the dictionary that has counts for the n-gram plus current word
    n_plus1_gram_count = n_plus1_gram_counts[n_plus1_gram] if n_plus1_gram in n_plus1_gram_counts else 0
    
    # Define the numerator use the count of the n-gram plus current word,
    # and apply smoothing
    numerator = n_plus1_gram_count + k
    
    # Calculate the probability as the numerator divided by the denominator
    probability = numerator / denominator
    
    return probability
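
A quick check, estimating P("cat" | "a") from the same two toy sentences:

sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)

# (C(a, cat) + k) / (C(a) + k * |V|) = (2 + 1) / (2 + 1 * 7) = 3 / 9 ≈ 0.3333
print(estimate_probability("cat", ["a"], unigram_counts, bigram_counts, len(unique_words), k=1))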

Estimating the probabilities of all words

The function below loops over the whole vocabulary and estimates, for each word, the probability that it is the next word.

def estimate_probabilities(previous_n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary, k=1.0):
    """
    Estimate the probabilities of next words using the n-gram counts with k-smoothing
    
    Args:
        previous_n_gram: A sequence of words of length n
        n_gram_counts: Dictionary of counts of n-grams
        n_plus1_gram_counts: Dictionary of counts of (n+1)-grams
        vocabulary: List of words
        k: positive constant, smoothing parameter
    
    Returns:
        A dictionary mapping from next words to the probability.
    """
    # Convert list to tuple to use it as dictionary key
    previous_n_gram = tuple(previous_n_gram)
    
    # add <e> <unk> to the vocabulary
    # <s> is not needed since it should not appear as the next word
    vocabulary = vocabulary + ['<e>','<unk>']
    vocabulary_size = len(vocabulary)
    
    probabilities = {}
    for word in vocabulary:
        probability = estimate_probability(word,previous_n_gram,n_gram_counts,n_plus1_gram_counts,
                                           vocabulary_size,k=k)
        
        probabilities[word] = probability
        
    return probabilities
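
Reusing the toy counts from the previous check, the probabilities of every word following "a" look like this (the vocabulary is extended with <e> and <unk>, so |V| = 9):

print(estimate_probabilities(["a"], unigram_counts, bigram_counts, unique_words, k=1))
# 'cat' gets (2 + 1) / (2 + 9) ≈ 0.2727; every word with a zero count gets 1 / 11 ≈ 0.0909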

Count and probability matrices

With the functions defined above, we can build a count matrix and a probability matrix.

def make_count_matrix(n_plus1_gram_counts, vocabulary):
    # add <e> <unk> to the vocabulary
    # <s> is omitted since it should not appear as the next word
    vocabulary = vocabulary + ["<e>", "<unk>"]
    
    # obtain unique n-grams
    n_grams = []
    for n_plus1_gram in n_plus1_gram_counts.keys():
        n_gram = n_plus1_gram[0:-1]
        n_grams.append(n_gram)
    n_grams = list(set(n_grams))
    
    # mapping from n-gram to row
    row_index = {n_gram:i for i, n_gram in enumerate(n_grams)}
    # mapping from next word to column
    col_index = {word:j for j, word in enumerate(vocabulary)}
    
    nrow = len(n_grams)
    ncol = len(vocabulary)
    count_matrix = np.zeros((nrow, ncol))
    for n_plus1_gram, count in n_plus1_gram_counts.items():
        n_gram = n_plus1_gram[0:-1]
        word = n_plus1_gram[-1]
        if word not in vocabulary:
            continue
        i = row_index[n_gram]
        j = col_index[word]
        count_matrix[i, j] = count
    
    count_matrix = pd.DataFrame(count_matrix, index=n_grams, columns=vocabulary)
    return count_matrix


sentences = [['i', 'like', 'a', 'cat'],
                 ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
print('\ntrigram counts')
trigram_counts = count_n_grams(sentences, 3)
display(make_count_matrix(trigram_counts, unique_words))

The result is the trigram count matrix: one row per preceding bigram, one column per possible next word.

Computing the probability matrix

def make_probability_matrix(n_plus1_gram_counts, vocabulary, k):
    count_matrix = make_count_matrix(n_plus1_gram_counts, vocabulary)
    count_matrix += k
    prob_matrix = count_matrix.div(count_matrix.sum(axis=1), axis=0)
    return prob_matrix


sentences = [['i', 'like', 'a', 'cat'],
                 ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
print("trigram probabilities")
trigram_counts = count_n_grams(sentences, 3)
display(make_probability_matrix(trigram_counts, unique_words, k=1))

The result is the corresponding trigram probability matrix, in which each row of smoothed probabilities sums to 1.

Part 3: Perplexity

Perplexity measures how good a model is: the lower the perplexity, the better the model.

In this section, we compute the perplexity on the test set.

PP(W) = ( ∏_{t=n+1}^{N+1} 1 / P(w_t | w_{t-n}, ..., w_{t-1}) )^(1/N)

  • N is the length of the sentence.
  • n is the number of words in the n-gram (e.g. 2 for a bigram).
  • In the formula above, words are indexed starting from 1, not 0.

In code, array indexing starts at 0 and the sentence has the <s> and <e> tokens added, so the loop index t runs from n to N-1, where N is the length of the augmented sentence:

PP(W) = ( ∏_{t=n}^{N-1} 1 / P(w_t | w_{t-n}, ..., w_{t-1}) )^(1/N)

The better the probability estimates, the lower the perplexity.

The more information the n-grams give us about the sentence, the lower the perplexity.

Exercise 10

def calculate_perplexity(sentence, n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=1.0):
    """
    Calculate perplexity for a single sentence
    
    Args:
        sentence: List of strings
        n_gram_counts: Dictionary of counts of n-grams
        n_plus1_gram_counts: Dictionary of counts of (n+1)-grams
        vocabulary_size: number of unique words in the vocabulary
        k: Positive smoothing constant
    
    Returns:
        Perplexity score
    """
    # length of previous words
    n = len(list(n_gram_counts.keys())[0]) 
    
    # prepend <s> and append <e>
    sentence = ["<s>"] * n + sentence + ["<e>"]
    
    # Cast the sentence from a list to a tuple
    sentence = tuple(sentence)
    
    # length of sentence (after adding <s> and <e> tokens)
    N = len(sentence)
    
    # The variable p will hold the product
    # that is calculated inside the n-root
    # Update this in the code below
    product_pi = 1.0
    
    ### START CODE HERE (Replace instances of 'None' with your code) ###
    # Index t ranges from n to N - 1
    for t in range(n, N): # complete this line

        # get the n-gram preceding the word at position t
        n_gram = sentence[t-n:t]
        
        # get the word at position t
        word = sentence[t]
        
        # Estimate the probability of the word given the n-gram
        # using the n-gram counts, n-plus1-gram counts,
        # vocabulary size, and smoothing constant
        probability = estimate_probability(word, n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=k)
        
        # Update the product of the probabilities
        # This 'product_pi' is a cumulative product 
        # of the (1/P) factors that are calculated in the loop
        product_pi *= 1 / probability

    # Take the Nth root of the product
    perplexity = product_pi**(1/float(N))
    
    ### END CODE HERE ### 
    return perplexity
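
A quick check with the toy bigram model: perplexity should be lower for a sentence the model has seen than for an unseen one.

sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)

perplexity_train = calculate_perplexity(sentences[0], unigram_counts, bigram_counts, len(unique_words), k=1.0)
perplexity_test = calculate_perplexity(['i', 'like', 'a', 'dog'], unigram_counts, bigram_counts, len(unique_words), k=1.0)

print(f"Perplexity for the seen sentence:   {perplexity_train:.4f}")  # about 2.8040
print(f"Perplexity for the unseen sentence: {perplexity_test:.4f}")   # about 3.9654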

Part 4: Build an autocomplete system

In this section, you will use the functions built above to create an autocomplete system.

The function below takes an optional start_with argument that specifies the first few letters of the next word.

def suggest_a_word(previous_tokens, n_gram_counts, n_plus1_gram_counts, vocabulary, k=1.0, start_with=None):
    """
    Get suggestion for the next word
    
    Args:
        previous_tokens: The sentence you input where each token is a word. Must have length > n 
        n_gram_counts: Dictionary of counts of n-grams
        n_plus1_gram_counts: Dictionary of counts of (n+1)-grams
        vocabulary: List of words
        k: positive constant, smoothing parameter
        start_with: If not None, specifies the first few letters of the next word
        
    Returns:
        A tuple of 
          - string of the most likely next word
          - corresponding probability
    """
    
    # length of previous words
    n = len(list(n_gram_counts.keys())[0]) 
    
    # From the words that the user already typed
    # get the most recent 'n' words as the previous n-gram
    previous_n_gram = previous_tokens[-n:]

    # Estimate the probabilities that each word in the vocabulary
    # is the next word,
    # given the previous n-gram, the dictionary of n-gram counts,
    # the dictionary of n plus 1 gram counts, and the smoothing constant
    probabilities = estimate_probabilities(previous_n_gram,
                                           n_gram_counts, n_plus1_gram_counts,
                                           vocabulary, k=k)
    
    # Initialize suggested word to None
    # This will be set to the word with highest probability
    suggestion = None
    
    # Initialize the highest word probability to 0
    # this will be set to the highest probability 
    # of all words to be suggested
    max_prob = 0
    
    ### START CODE HERE (Replace instances of 'None' with your code) ###
    
    # For each word and its probability in the probabilities dictionary:
    for word, prob in probabilities.items(): # complete this line
        
        # If the optional start_with string is set
        if start_with is not None: # complete this line

            # Check if the word starts with the letters in 'start_with'
            if not word.startswith(start_with): # complete this line

                # If it does not, skip this word (move on to the next word)
                continue # complete this line
        
        # Check if this word's probability
        # is greater than the current maximum probability
        if prob > max_prob: # complete this line
            
            # If so, save this word as the best suggestion (so far)
            suggestion = word
            
            # Save the new maximum probability
            max_prob = prob

    ### END CODE HERE
    
    return suggestion, max_prob

A quick test:

sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)

previous_tokens = ["i", "like"]
tmp_suggest1 = suggest_a_word(previous_tokens, unigram_counts, bigram_counts, unique_words, k=1.0)
print(f"The previous words are 'i like',ntand the suggested word is `{tmp_suggest1[0]}` with a probability of {tmp_suggest1[1]:.4f}")

print()
# test your code when setting the starts_with
tmp_starts_with = 'c'
tmp_suggest2 = suggest_a_word(previous_tokens, unigram_counts, bigram_counts, unique_words, k=1.0, start_with=tmp_starts_with)
print(f"The previous words are 'i like', the suggestion must start with `{tmp_starts_with}`ntand the suggested word is `{tmp_suggest2[0]}` with a probability of {tmp_suggest2[1]:.4f}")

The output is:

The previous words are 'i like',
and the suggested word is `a` with a probability of 0.2727
The previous words are 'i like', the suggestion must start with `c`
and the suggested word is `cat` with a probability of 0.0909

Getting multiple suggestions

def get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0, start_with=None):
    model_counts = len(n_gram_counts_list)
    suggestions = []
    for i in range(model_counts-1):
        n_gram_counts = n_gram_counts_list[i]
        n_plus1_gram_counts = n_gram_counts_list[i+1]
        
        suggestion = suggest_a_word(previous_tokens, n_gram_counts,
                                    n_plus1_gram_counts, vocabulary,
                                    k=k, start_with=start_with)
        suggestions.append(suggestion)
    return suggestions

Get multiple suggestions using n-grams of varying length

Congratulations! You have developed all the building blocks needed for your own autocomplete system.

Let's look at suggestions from models built on n-grams of different lengths (unigrams, bigrams, trigrams, 4-grams and 5-grams).

n_gram_counts_list = []
for n in range(1, 6):
    print("Computing n-gram counts with n =", n, "...")
    n_model_counts = count_n_grams(train_data_processed, n)
    n_gram_counts_list.append(n_model_counts)
previous_tokens = ["i", "am", "to"]
tmp_suggest4 = get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0)

print(f"The previous words are {previous_tokens}, the suggestions are:")
display(tmp_suggest4)

Output:

The previous words are ['i', 'am', 'to'], the suggestions are:
[('be', 0.027665685098338604), ('have', 0.00013487086115044844), ('have', 0.00013490725126475548), ('i', 6.746272684341901e-05)]

previous_tokens = ["hey", "how", "are", "you"]
tmp_suggest7 = get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0)

print(f"The previous words are {previous_tokens}, the suggestions are:")
display(tmp_suggest7)

Output:

The previous words are ['hey', 'how', 'are', 'you'], the suggestions are:
[("'re", 0.023973994311255586), ('?', 0.002888465830762161), ('?', 0.0016134453781512605), ('<e>', 0.00013491635186184566)]
