N-gram Language Models

知源书院

Published 2024-04-18 09:44:33


Article link: https://blog.csdn.net/cfy2401926342/article/details/137906454


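Level 1: Predicting Sentence Probability

Task description

Task for this level: use a bigram language model to compute the probability of a sentence.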
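For reference, the standard bigram model with maximum-likelihood estimates can be written as follows. The boundary symbols BOS and EOS follow the code in this post; the notation $c(\cdot)$ for corpus counts is introduced here for illustration.

```latex
P(w_1 \dots w_n) \approx \prod_{i=1}^{n+1} P(w_i \mid w_{i-1}),
\qquad
P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}\, w_i)}{c(w_{i-1})},
\qquad
w_0 = \text{BOS},\; w_{n+1} = \text{EOS}
```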

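Programming requirements

Following the hints, complete the code in the editor on the right so that it computes and prints the probability of the test sentence.

Test description

The platform tests the code you write.

Corpus:

研究生物很有意思。他大学时代是研究生物的。生物专业是他的首选目标。他是研究生。

Test input:

研究生物专业是他的首选目标

Expected output:

0.004629629629629629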
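The expected value can be checked by hand. The arithmetic below assumes that the test sentence is segmented as BOS / 研究 / 生物 / 专业 / 是 / 他 / 的 / 首选 / 目标 / EOS and that the corpus yields the counts c(BOS) = 4, c(研究) = 2, c(生物) = 3, c(是) = 3, c(他) = 3, c(的) = 2. The actual segmentation depends on jieba's dictionary, so treat this as a consistency check rather than a guarantee.

```latex
P = \underbrace{\tfrac{1}{4}}_{\text{研究}\mid\text{BOS}}
\cdot \underbrace{\tfrac{2}{2}}_{\text{生物}\mid\text{研究}}
\cdot \underbrace{\tfrac{1}{3}}_{\text{专业}\mid\text{生物}}
\cdot \underbrace{\tfrac{1}{1}}_{\text{是}\mid\text{专业}}
\cdot \underbrace{\tfrac{1}{3}}_{\text{他}\mid\text{是}}
\cdot \underbrace{\tfrac{1}{3}}_{\text{的}\mid\text{他}}
\cdot \underbrace{\tfrac{1}{2}}_{\text{首选}\mid\text{的}}
\cdot \underbrace{\tfrac{1}{1}}_{\text{目标}\mid\text{首选}}
\cdot \underbrace{\tfrac{1}{1}}_{\text{EOS}\mid\text{目标}}
= \frac{1}{216} \approx 0.00463
```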

import jieba

jieba.setLogLevel(jieba.logging.INFO)  # suppress jieba's start-up debug messages


# Wrap every sentence in boundary markers: "BOS" + sentence + "EOS"
def reform(sentence):
    if sentence.endswith("。"):
        sentence = sentence[:-1]
    sentence = sentence.replace("。", "EOSBOS")
    sentence = "BOS" + sentence + "EOS"
    return sentence


# Segment the text and, if a dictionary is passed in, count word frequencies in it
def segmentation(sentence, dic):
    # Tune jieba's word frequencies so that "BOS" and "EOS" are kept as single tokens
    jieba.suggest_freq("BOS", True)
    jieba.suggest_freq("EOS", True)
    # HMM=False disables HMM-based new-word discovery, so only dictionary words are produced
    lists = jieba.lcut(sentence, HMM=False)
    if dic is not None:
        for word in lists:
            dic[word] = dic.get(word, 0) + 1
    return lists


# Bigram counts: count_list[t] is how often the pair
# (tes_list[t], tes_list[t+1]) occurs in the corpus token list
def compareList(ori_list, tes_list):
    count_list = [0] * len(tes_list)
    for t in range(len(tes_list) - 1):
        for n in range(len(ori_list) - 1):
            if tes_list[t] == ori_list[n] and tes_list[t + 1] == ori_list[n + 1]:
                count_list[t] += 1
    return count_list


# Multiply the estimates count(w_t, w_t+1) / count(w_t) over the test sentence
def probability(tes_list, ori_dic, count_list):
    p = 1
    # The final token (EOS) starts no bigram, so it is skipped;
    # slicing avoids modifying the caller's list
    for flag, key in enumerate(tes_list[:-1]):
        if key not in ori_dic:
            return 0.0  # a word that never occurs in the corpus gives probability 0
        p *= float(count_list[flag]) / float(ori_dic[key])
    return p


if __name__ == "__main__":
    # Corpus
    sentence_ori = "研究生物很有意思。他大学时代是研究生物的。生物专业是他的首选目标。他是研究生。"
    ori_dict = {}

    # Test sentence
    sentence_test = input()

    sentence_ori_temp = reform(sentence_ori)
    ori_list = segmentation(sentence_ori_temp, ori_dict)

    sentence_tes_temp = reform(sentence_test)
    tes_list = segmentation(sentence_tes_temp, None)

    count_list = compareList(ori_list, tes_list)

    p = probability(tes_list, ori_dict, count_list)
    print(p)
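The nested loops in compareList rescan the whole corpus for every test bigram. As a minimal sketch of an alternative, the same unsmoothed estimate can be computed with collections.Counter. The function name bigram_probability and the hard-coded token lists are illustrative assumptions: the tokens stand in for jieba's output, so the block runs without jieba, and the real segmentation may differ.

```python
from collections import Counter


def bigram_probability(corpus_tokens, test_tokens):
    """Unsmoothed bigram estimate: product of count(prev, word) / count(prev)."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    p = 1.0
    for prev, word in zip(test_tokens, test_tokens[1:]):
        if unigrams[prev] == 0:
            return 0.0  # context word never seen in the corpus
        p *= bigrams[(prev, word)] / unigrams[prev]
    return p


# Assumed segmentation of the corpus and the test sentence (not produced by jieba here)
corpus_tokens = (
    "BOS 研究 生物 很 有意思 EOS "
    "BOS 他 大学 时代 是 研究 生物 的 EOS "
    "BOS 生物 专业 是 他 的 首选 目标 EOS "
    "BOS 他 是 研究生 EOS"
).split()
test_tokens = "BOS 研究 生物 专业 是 他 的 首选 目标 EOS".split()

print(bigram_probability(corpus_tokens, test_tokens))
```

Under the assumed segmentation the result is approximately 1/216, in line with the expected output of this level. Counting each bigram once makes the work linear in the corpus length instead of proportional to corpus length times sentence length.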

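Level 2: Data Smoothing

Task description

Task for this level: apply data smoothing to the bigram language model and use the smoothed counts to compute the probability of a sentence.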
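The task statement does not name a smoothing method. The function gt in the solution for this level matches the form of Good-Turing discounting, so the formula is given here on that assumption: $N_c$ is the number of distinct bigrams observed exactly $c$ times, and $c^{*}$ is the adjusted count.

```latex
c^{*} = (c + 1)\,\frac{N_{c+1}}{N_{c}}
```

With this adjustment an unseen bigram ($c = 0$) receives the non-zero count $N_1 / N_0$, so a single unseen pair no longer forces the whole sentence probability to zero.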

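Programming requirements

Following the hints, complete the code in the editor on the right: write a smoothing function and compute the probability of the sentence.

Test description

The platform tests the code you write.

Corpus:

研究生物很有意思。他大学时代是研究生物的。生物专业是他的首选目标。他是研究生。

Test input:

他是研究物理的

Expected output:

5.6888888888888895e-05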

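import jieba

# Corpus
sentence_ori = "研究生物很有意思。他大学时代是研究生物的。生物专业是他的首选目标。他是研究生。"
# Test sentence
sentence_test = input()
# Task: write a smoothing function, build a 2-gram model on the smoothed counts,
# then compute and print the probability of the test sentence
# ********** Begin *********#


def gt(N, c):
    # Good-Turing style adjusted count: c* = (c + 1) * N[c+1] / N[c].
    # When no bigram has count c + 1, fall back to c + 1.
    if c + 1 not in N:
        cx = c + 1
    else:
        cx = (c + 1) * N[c + 1] / N[c]
    return cx


jieba.setLogLevel(jieba.logging.INFO)  # suppress jieba's start-up debug messages

# Segment the corpus (without its final full stop) and add sentence boundaries
sentence_ori = sentence_ori[:-1]
words = jieba.lcut(sentence_ori)
words.insert(0, "BOS")
words.append("EOS")
# Replace every remaining "。" with the pair "EOS", "BOS"
i = 0
length = len(words)
while i < length:
    if words[i] == "。":
        words[i] = "BOS"
        words.insert(i, "EOS")
        i += 1
        length += 1
    i += 1

# Corpus bigrams, keyed by the concatenation of the two words
phrases = []
for i in range(len(words) - 1):
    phrases.append(words[i] + words[i + 1])
phrasedict = {}
for phrase in phrases:
    phrasedict[phrase] = phrasedict.get(phrase, 0) + 1

# Bigrams of the test sentence
words_test = jieba.lcut(sentence_test)
words_test.insert(0, "BOS")
words_test.append("EOS")
phrases_test = []
for i in range(len(words_test) - 1):
    phrases_test.append(words_test[i] + words_test[i + 1])

# pdict: corpus count of each test bigram (0 if it never occurs in the corpus)
pdict = {}
for phrase in phrases_test:
    pdict[phrase] = phrasedict.get(phrase, 0)

# N[c]: number of distinct test bigrams whose corpus count is c
N = {}
for phrase in pdict:
    N[pdict[phrase]] = N.get(pdict[phrase], 0) + 1
# Increment the number of unseen bigram types by one;
# .get avoids a KeyError when every test bigram occurs in the corpus
N[0] = N.get(0, 0) + 1

# Normalizer: sum of c * N[c]
Nnum = 0
for c in N:
    Nnum += c * N[c]

# Multiply the smoothed estimates c* / Nnum over the test bigrams
p = 1
for phrase in phrases_test:
    c = pdict[phrase]
    cx = gt(N, c)
    p *= cx / Nnum
print(p)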
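To make the counting scheme of this solution easier to follow, here is a minimal jieba-free sketch of the same computation. The names adjusted_count and smoothed_score, the tuple keys for bigrams, and the hard-coded token lists are illustrative assumptions; the tokens stand in for jieba's output and the real segmentation may differ.

```python
from collections import Counter


def adjusted_count(freq_of_freq, c):
    """Good-Turing style count; falls back to c + 1 when no bigram has count c + 1."""
    if c + 1 not in freq_of_freq:
        return c + 1
    return (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]


def smoothed_score(corpus_tokens, test_tokens):
    corpus_bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    test_bigrams = list(zip(test_tokens, test_tokens[1:]))
    # Corpus count of each distinct test bigram (0 if unseen)
    counts = {bg: corpus_bigrams[bg] for bg in test_bigrams}
    # freq_of_freq[c]: number of distinct test bigrams with corpus count c
    freq_of_freq = Counter(counts.values())
    freq_of_freq[0] += 1  # one extra unseen bigram type, as in the solution above
    total = sum(c * n for c, n in freq_of_freq.items())
    if total == 0:
        raise ValueError("no bigram of the test sentence occurs in the corpus")
    p = 1.0
    for bg in test_bigrams:
        p *= adjusted_count(freq_of_freq, counts[bg]) / total
    return p


# Assumed segmentation of the corpus and the test sentence (not produced by jieba here)
corpus_tokens = (
    "BOS 研究 生物 很 有意思 EOS "
    "BOS 他 大学 时代 是 研究 生物 的 EOS "
    "BOS 生物 专业 是 他 的 首选 目标 EOS "
    "BOS 他 是 研究生 EOS"
).split()
test_tokens = "BOS 他 是 研究 物理 的 EOS".split()

print(smoothed_score(corpus_tokens, test_tokens))
```

Under the assumed segmentation the six test bigrams have corpus counts 2, 1, 1, 0, 0, 1, which gives N[2] = 1, N[1] = 3, N[0] = 3 and a normalizer of 5. The product is (3/5) · (2/15)³ · (1/5)² = 24/421875 ≈ 5.69e-05, in line with the expected output. Note that the frequency-of-frequency table is built only from the test sentence's bigrams and each factor is divided by a shared normalizer, so the result is a score specific to this exercise rather than a normalized conditional bigram probability.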