Getting started with KenLM: checking a/an usage in sentences

Article link: https://blog.csdn.net/qq_16829085/article/details/102518798

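1 Basic requirements

Train your own language model with KenLM and use it to decide whether the sentences in a test set misuse a/an. For example, "in an sense , he is the antithesis of dagny" is wrong, while "in a sense , he is the antithesis of dagny" is correct.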

2 Preparation

Environment: Python 3.7.3, Ubuntu 16.04, nltk, kenlm
Training corpus: https://pan.baidu.com/s/1On1CoL96qIxTXctBB94NuA (extraction code: 9pu3). Test set: https://pan.baidu.com/s/1qpD9nWnRPqpH8bbfkNt8KA (extraction code: n6vw)

/kenlm/build  # current working directory

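3 Training a language model on the corpus with KenLM

For downloading and installing KenLM, see the official documentation: https://kheafield.com/code/kenlm/. You can also refer to my previous post on configuring a Python environment in the subsystem. The lmplz binary in kenlm/build/bin produces an .arpa language model. For convenience, create a data folder under build and put the corpus in it, then run the following command from the build directory.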

 bin/lmplz -o 5 -S 50% --verbose_header --text data/train_set --arpa data/res.arpa

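-o: order of the model, i.e. the highest n used for the n-grams
-S: amount of memory to reserve up front (here 50% of RAM)
--verbose_header: write statistics at the top of the generated file
--text: path to the training file
--arpa: path of the output .arpa file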

After training finishes, run head -n 20 data/res.arpa in the terminal to inspect the result; the header shows some training statistics.

# Input file: ###############
# Token count: 78536442
# Smoothing: Modified Kneser-Ney
\data\
ngram 1=863701
ngram 2=10683699
ngram 3=33500658
ngram 4=53398390
ngram 5=62155319

\1-grams:
-7.065953       <unk>   0
0       <s>     -1.3640472
-5.621103       </s>    0
-3.0624595      before  -0.7323587
-2.48831        the     -0.79645663
-5.2839327      autopsy -0.36377412
-2.2108881      was     -1.2027053
-4.1015754      complete        -0.3351077
-1.9286767      and     -0.8878988

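Each line has three fields: the probability, the n-gram, and the backoff weight. The probability is the conditional probability of the last word given the words before it (for 1-grams, simply the probability of the word). Both numbers are base-10 logarithms.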
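To make the three fields concrete, here is a minimal sketch that parses one 1-gram line from the listing above and converts the log10 probability back to a plain probability. The helper name parse_arpa_line is made up for illustration; it is not part of KenLM.

```python
def parse_arpa_line(line, order):
    """Split one ARPA n-gram line into (log10_prob, ngram_words, log10_backoff).

    Hypothetical helper for illustration; not a KenLM function.
    The backoff field is optional in ARPA files, so it defaults to 0.0 here.
    """
    fields = line.split()
    log10_prob = float(fields[0])
    words = tuple(fields[1:1 + order])
    backoff = float(fields[1 + order]) if len(fields) > 1 + order else 0.0
    return log10_prob, words, backoff


# One 1-gram line copied from the head of res.arpa shown above.
log10_prob, words, backoff = parse_arpa_line("-2.48831        the     -0.79645663", order=1)
prob = 10 ** log10_prob  # undo the base-10 logarithm
print(words, prob, backoff)
```

The same function handles higher orders by passing order=2, order=3 and so on, since the n-gram simply occupies more whitespace-separated fields.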
Converting the .arpa file to a binary file compresses it and speeds up loading in Python later.

 bin/build_binary -s data/res.arpa data/res.bin

4 Basic use of the kenlm Python module

A quick test of the model in a Python shell:

import kenlm

# Load the model
model = kenlm.LanguageModel("./res.bin")

sentence='before the autopsy was complete and toxicology results known , medical examiner dr. jerry francisco declared the cause of death as cardiac arrhythmia , a condition that can be determined only in someone who is still alive .'

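# Score the sentence with the language model; bos=True, eos=True add the begin/end-of-sentence markers
# score returns the log10 probability of the input string; the higher the score, the more plausible the word sequence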
model.score(sentence, bos=True, eos=True) # -35.82015609741211

4.1 The model.score() function

Return the log10 probability of a string.  By default, the string is treated as a sentence.
                return log10 p(sentence </s> | <s>)

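bos and eos control whether begin-of-sentence and end-of-sentence markers are added. score returns a log probability, so the value is negative, and the closer it is to 0 the better. It can be used to gauge how fluent a phrase or sentence is. model.score("this is a sentence .") is equivalent to model.score("this is a sentence .", bos = True, eos = True); the latter is just more explicit. Both return log10 p(this is a sentence . </s> | <s>).
Note: never write the marker into the string yourself, as in "<s> this is a sentence", even when bos=False.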

4.2 The model.full_scores() function

full_scores(sentence, bos = True, eos = True) -> generate full scores (prob, ngram length, oov)
@param sentence is a string (do not use boundary symbols)
@param bos should kenlm add a bos state
@param eos should kenlm add an eos state
Type:      builtin_function_or_method

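score is a condensed version of full_scores. For each token, full_scores yields a tuple (prob, ngram length, oov): the log10 probability, the length of the n-gram that was matched, and whether the word is out of vocabulary (OOV).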
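As a sketch of how these per-token tuples might be consumed, the helper below sums the log10 probabilities and collects the OOV words. summarize_full_scores is a hypothetical name, and the tuples in the example are made-up values standing in for the output of model.full_scores(sentence), so the block runs without a trained model.

```python
def summarize_full_scores(sentence, full_scores):
    """Return (total_log10_prob, oov_words) from full_scores-style tuples.

    Hypothetical helper, not part of kenlm. `full_scores` is an iterable of
    (log10_prob, ngram_length, oov) tuples: one per word, plus a final one
    for </s> (assuming eos=True).
    """
    tokens = sentence.split() + ["</s>"]
    total = 0.0
    oov_words = []
    for token, (log10_prob, _ngram_length, oov) in zip(tokens, full_scores):
        total += log10_prob
        if oov:
            oov_words.append(token)
    return total, oov_words


# Made-up scores for illustration only; with a real model you would pass
# model.full_scores("love is zzyzx") instead.
fake_scores = [(-3.0, 2, False), (-1.5, 3, False), (-7.0, 1, True), (-0.5, 1, False)]
total, oov_words = summarize_full_scores("love is zzyzx", fake_scores)
print(total, oov_words)
```

With a real model, the total computed this way should match what model.score returns for the same sentence and the same bos/eos settings.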

4.3 The model.perplexity() function

This function returns the perplexity of the whole sentence.

import kenlm

model = kenlm.LanguageModel("./res.bin")
s1 = 'love is now or never'
perplexity = model.perplexity(s1)
print(perplexity)

The same value can also be computed from model.full_scores():

import math

import kenlm
import numpy as np

model = kenlm.LanguageModel("./res.bin")
s1 = 'love is now or never'
# Each score is a log10 probability; full_scores also yields one entry for </s>
scores = [score for score, _, _ in model.full_scores(s1)]
n = len(scores)
prob = np.prod([math.pow(10.0, score) for score in scores])
perplexity = math.pow(prob, -1.0 / n)

# Alternative: sum the log10 scores first and exponentiate once at the end
# sum_inv_logs = -1 * sum(scores)
# perplexity = math.pow(10.0, sum_inv_logs / n)

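In natural language processing, a language model (LM) takes the first k-1 words of a sentence and predicts the k-th word, that is, it gives a probability distribution over the possible k-th words: $p(w_k|w_1,w_2,\dots w_{k-1})$
Perplexity (PPL) is a metric for judging how good a language model is. It estimates the probability of a sentence from the probability of each of its words:
$\begin{aligned} PP(S) &= p(w_1 w_2 \dots w_N)^{-\frac{1}{N}}\\ &=\sqrt[N]{\frac{1}{p(w_1 w_2 \dots w_N)}}\\ &=\sqrt[N]{\prod_{i=1}^N \frac{1}{p(w_i|w_1 w_2 \dots w_{i-1})}} \end{aligned}$
$PP(S) = 2^{-\frac{1}{N}\sum_{i=1}^N \log_2 p(w_i|w_1 w_2 \dots w_{i-1})}$
The two formulas are equivalent: taking the logarithm of both sides of the first one and rearranging gives the second. S is the sentence and N is its length; for Chinese, N is the number of words after segmentation. N acts as a normalizer, so that the perplexities of sentences of different lengths can be compared on the same scale. The lower the PPL, the larger the per-word probabilities $p(w_i|w_1 w_2 \dots w_{i-1})$, and the more probable the sentence.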
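A small worked example may help check that the forms agree. The sketch below uses three made-up per-token log10 probabilities (not real model output) and computes the perplexity from the product form, from the base-2 exponent form, and from the equivalent base-10 form that matches KenLM's log10 scores. The function name perplexity_from_log10 is hypothetical.

```python
import math


def perplexity_from_log10(log10_probs):
    """Perplexity from per-token log10 probabilities: 10 ** (-(1/N) * sum)."""
    n = len(log10_probs)
    return 10.0 ** (-sum(log10_probs) / n)


# Made-up per-token log10 probabilities, i.e. p = 0.1, 0.01, 0.001.
log10_probs = [-1.0, -2.0, -3.0]
n = len(log10_probs)

# Base-10 exponent form.
ppl_base10 = perplexity_from_log10(log10_probs)

# Base-2 exponent form: convert each log10 p to log2 p first.
log2_probs = [lp / math.log10(2.0) for lp in log10_probs]
ppl_base2 = 2.0 ** (-sum(log2_probs) / n)

# Product form: p(S) ** (-1/N).
sentence_prob = 1.0
for lp in log10_probs:
    sentence_prob *= 10.0 ** lp
ppl_product = sentence_prob ** (-1.0 / n)

print(ppl_base10, ppl_base2, ppl_product)
```

All three values come out at about 100 (up to floating-point rounding): the sentence probability is 10^-6, and its inverse cube root is 10^2. Summing logarithms is usually preferable to multiplying probabilities, because the product underflows for long sentences.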
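Factors that affect perplexity include:

The larger the training set, the lower the PPL goes; a model trained on a 1-billion dataset behaves very differently from one trained on a 100,000 dataset
Punctuation in the data has a large effect on PPL: a single period can move PPL by several tens, and punctuation is always predicted unstably
Function words such as "的" and "了" in the sentence also affect PPL strongly: "我借你的书" may score several tens lower than "我借你书", yet semantically the presence or absence of these stop words says little about how good the generated sentence is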

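5 Evaluating the test set

The rough idea is as follows:

Read the test set line by line
    ## Check whether the line contains the word a or an
        ## Count the a/an words in the string; call the count N
        ## Enumerate every length-N sequence over the two choices a/an
        ## Substitute each sequence for the a/an words of the original sentence to get the candidate sentences
        ## Score every candidate with the language model and take the highest score
        ## Check whether the highest-scoring sentence is the original one
            ## If it is not, output the suggested correction

The complete code: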

# encoding=utf-8
import kenlm
from collections import Counter
import nltk
import re
from itertools import product

input_file_name = "./test_set"
output_file_name = "./output.txt"


# Read the file line by line, stripping the trailing newline from each line
def read_file(file_name):
    with open(file_name, "r") as fp:
        return [line.rstrip('\n') for line in fp]


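# Enumerate every a/an assignment for the sentence and return the list of resulting sentences
def change_a_an(line):
    new_lines = []
    # Cheap substring pre-filter; the exact count comes from the tokenizer below
    if "a" in line or "an" in line:
        # Count the a/an tokens
        a_an_counter = Counter(nltk.word_tokenize(line))
        a_an_num = a_an_counter["a"] + a_an_counter["an"]
        # Escape '%' so that it survives the %-formatting below
        percentage_regex = re.compile(r"%")
        new_line = percentage_regex.sub(r"%%", line)
        # Two consecutive a/an tokens: the single-token pattern below would miss
        # the second one, because the shared space is consumed by the first match
        a_and_an_regex = re.compile(
            r"""
            \sa\sa\s | 
            \sa\san\s |
            \san\sa\s |
            \san\san\s
            """, re.VERBOSE)  # re.VERBOSE ignores whitespace so the pattern can span several lines
        new_line = a_and_an_regex.sub(r" %s %s ", new_line)
        # A single a/an surrounded by whitespace
        a_an_regex = re.compile(r"\sa\s|\san\s")
        new_line = a_an_regex.sub(r" %s ", new_line)
        # a/an at the start of the line
        a_an_regex_front = re.compile(r"^a\s|^an\s")
        new_line = a_an_regex_front.sub(r"%s ", new_line)
        # a/an right after an opening quote; keep the character before the quote
        a_an_regex_quotation = re.compile(r"([^a-zA-Z]')an?\s")
        new_line = a_an_regex_quotation.sub(r"\g<1>%s ", new_line)
        # Enumerate all length-a_an_num sequences over ("a", "an")
        a_an_form = list(product(("a", "an"), repeat=a_an_num))
        # Fill the placeholders with each sequence to build the candidate sentences.
        # If the token count and the number of %s placeholders disagree, the %
        # operator raises TypeError, which the caller catches.
        for form in a_an_form:
            new_lines += [new_line % form]
    return new_lines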


# Main program
if __name__ == "__main__":
    lines = read_file(input_file_name)
    wrong_line_num = 0
    # Load the model
    model = kenlm.LanguageModel("./res.bin")

    with open(output_file_name, "w") as output_file:
        for i, line in enumerate(lines):
            try:
                new_lines = change_a_an(line)
            except TypeError:
                # Placeholder count did not match the a/an count; skip this line
                continue

            # Compare scores: keep the highest-scoring candidate
            line_best = line
            line_best_score = model.score(line, bos=True, eos=True)
            for new_line in new_lines:
                new_line_score = model.score(new_line, bos=True, eos=True)
                if new_line_score > line_best_score:
                    line_best = new_line
                    line_best_score = new_line_score
            output_file.write("%s. " % (i + 1) + line + "\n")
            if line_best != line:
                output_file.write("###Wrong###")
                output_file.write("=> " + line_best + "\n\n")
                wrong_line_num += 1
            else:
                output_file.write("---Correct---\n\n")

    print("Number of sentences flagged as wrong: " + str(wrong_line_num))
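
The regex templating above depends on the tokenizer count and the number of placeholders agreeing. As an alternative sketch, not the method used above, the candidates can be generated directly on whitespace-separated tokens, which avoids the '%' escaping and the TypeError case. The names a_an_candidates, best_candidate and toy_score are hypothetical; toy_score is only a stand-in so the block runs without a model, and in practice the scoring callable would be something like lambda s: model.score(s, bos=True, eos=True).

```python
from itertools import product


def a_an_candidates(line):
    """Return every variant of `line` with each a/an token set to 'a' or 'an'.

    Assumes the text is already tokenized with single spaces, as in the example
    sentences; tokens such as "'a" are left alone in this sketch.
    """
    tokens = line.split(" ")
    slots = [i for i, tok in enumerate(tokens) if tok in ("a", "an")]
    candidates = []
    for choice in product(("a", "an"), repeat=len(slots)):
        new_tokens = list(tokens)
        for i, article in zip(slots, choice):
            new_tokens[i] = article
        candidates.append(" ".join(new_tokens))
    return candidates


def best_candidate(line, score):
    """Pick the highest-scoring variant; `score` is any callable str -> float."""
    return max(a_an_candidates(line), key=score)


def toy_score(sentence):
    """Toy stand-in for model.score: penalize 'an' before a consonant letter
    and 'a' before a vowel letter. Purely illustrative, not a language model."""
    tokens = sentence.split(" ")
    penalty = 0
    for cur, nxt in zip(tokens, tokens[1:]):
        starts_with_vowel = nxt.startswith(("a", "e", "i", "o", "u"))
        if (cur == "an" and not starts_with_vowel) or (cur == "a" and starts_with_vowel):
            penalty += 1
    return -penalty


print(best_candidate("in an sense , he is the antithesis of dagny", toy_score))
```

Because the original sentence is always among the candidates, comparing the returned sentence with the input is enough to decide whether to report a correction.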