KenLM: An Efficient n-gram Language Model Library, Introduction and Usage

Originally published on 2025-07-11 12:39:15

Licensed under CC 4.0 BY-SA

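KenLM is an efficient open-source n-gram language model library. Its Python interface, kenlm, is widely used in natural language processing tasks such as text correction, machine translation scoring, and speech recognition. This article walks through installation, the core API, and practical examples.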

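🔧 I. Installation and Build

1. Building from source (recommended on Linux)

# Install dependencies
sudo apt install libboost-all-dev libbz2-dev cmake
# Download and build KenLM
wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz
mkdir kenlm/build && cd kenlm/build
cmake .. && make -j8

Key point: a recent g++ (9.0 or newer is a safe choice) and the Boost libraries are required.

2. Installing the Python module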

pip install https://github.com/kpu/kenlm/archive/master.zip

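🧩 II. Core API in Detail

1. Loading a model and basic attributes

import kenlm
model = kenlm.Model("lm.bin")  # load a binary model
print(f"{model.order}-gram model")  # show the n-gram order

Supported formats: .bin (binary) or .arpa (text).

2. Sentence scoring

Whole-sentence probability (log domain):

sentence = "how are you"
score = model.score(sentence, bos=True, eos=True)  # include the <s> and </s> markers
print(score)  # log10 probability; the closer to 0, the more probable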
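
A raw log10 score is hard to compare across sentences of different lengths, so it is often converted into a per-token perplexity. The following is a minimal pure-Python sketch; the helper name `perplexity_from_score` is made up for illustration, and it assumes the score is a base-10 log probability that already includes the `</s>` token, as is the case with `eos=True`.

```python
def perplexity_from_score(log10_score: float, sentence: str) -> float:
    """Convert a base-10 log probability into per-token perplexity.

    The token count is the number of whitespace-separated words plus one
    for the end-of-sentence marker </s>.
    """
    num_tokens = len(sentence.split()) + 1
    return 10.0 ** (-log10_score / num_tokens)


# Example with a made-up score of -4.0 for a three-word sentence:
# 4 tokens in total, so perplexity = 10 ** (4.0 / 4) = 10.0
print(perplexity_from_score(-4.0, "how are you"))
```

The Python module also exposes `model.perplexity(sentence)`, which performs the same kind of conversion on a loaded model.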

How the score is computed:
$\log_{10} P(S) = \sum_{i} \log_{10} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})$, where $w_i$ is the $i$-th token and $n$ is the model order.
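
To make the formula concrete, here is a toy bigram scorer in pure Python. The probability table is invented for illustration and is not taken from any real KenLM model; unseen bigrams get a fixed floor value, which is a crude stand-in for KenLM's actual backoff.

```python
import math

# Hypothetical bigram probabilities P(word | previous word); values are made up.
BIGRAM_PROB = {
    ("<s>", "how"): 0.1,
    ("how", "are"): 0.5,
    ("are", "you"): 0.2,
    ("you", "</s>"): 0.1,
}
UNSEEN_PROB = 1e-6  # crude floor for unseen bigrams (not real backoff)


def toy_score(sentence: str) -> float:
    """Sum log10 P(w_i | w_{i-1}) over the sentence, with <s> and </s> added."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log10(BIGRAM_PROB.get((prev, word), UNSEEN_PROB))
    return total


print(toy_score("how are you"))  # log10(0.1 * 0.5 * 0.2 * 0.1), about -3
print(toy_score("you are how"))  # all four bigrams unseen, about -24
```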

3. Fine-grained score analysis

Per-token scores:

words = sentence.split() + ["</s>"]  # full_scores also yields an entry for </s>
for i, (prob, ngram_len, oov) in enumerate(model.full_scores(sentence)):
    print(f"Token: {words[i]}, Prob: {prob}, N-gram: {ngram_len}, OOV: {oov}")
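
One way to use these per-token tuples is to flag the tokens that look suspicious. The sketch below works on plain tuples of the same shape, (log10 probability, matched n-gram length, OOV flag), so it runs without a model; the function name, the threshold, and the sample scores are all assumptions for illustration, not real model output.

```python
from typing import Iterable, List, Tuple

Score = Tuple[float, int, bool]  # (log10 prob, matched n-gram length, is OOV)


def flag_weak_tokens(words: List[str], scores: Iterable[Score],
                     threshold: float = -4.0) -> List[str]:
    """Return the tokens that are OOV or score below the threshold."""
    flagged = []
    for word, (log_prob, _ngram_len, oov) in zip(words, scores):
        if oov or log_prob < threshold:
            flagged.append(word)
    return flagged


# Made-up scores in the shape yielded by full_scores.
sample_words = ["how", "are", "yuo", "</s>"]
sample_scores = [(-1.2, 2, False), (-0.8, 3, False), (-6.5, 1, True), (-2.0, 1, False)]
print(flag_weak_tokens(sample_words, sample_scores))  # ['yuo']
```

A sensible threshold depends on the corpus and model order, so it is best tuned on held-out text.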

OOV detection:

if "unknown_word" not in model:
    print("Out-of-Vocabulary!")
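
The same membership test can be applied to a whole sentence to list its out-of-vocabulary words. In this sketch the helper `find_oov` is a hypothetical name, and a plain set stands in for the model so that the example runs on its own; since the check only relies on the `in` operator, a loaded `kenlm.Model` should be usable in its place.

```python
from typing import Container, List


def find_oov(sentence: str, vocab: Container[str]) -> List[str]:
    """Return the words of `sentence` that are not in `vocab`.

    `vocab` only needs to support the `in` operator.
    """
    return [word for word in sentence.split() if word not in vocab]


toy_vocab = {"how", "are", "you"}  # stand-in for a real model
print(find_oov("how are yuo", toy_vocab))  # ['yuo']
```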

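4. Stateful scoring (real-time decoding)

state_in = kenlm.State()
state_out = kenlm.State()
model.BeginSentenceWrite(state_in)  # initialize the sentence-start state

total_score = 0.0
for word in sentence.split():
    total_score += model.BaseScore(state_in, word, state_out)
    # Swap the two objects; a plain `state_in = state_out` would make both
    # names point at the same State and corrupt the next call
    state_in, state_out = state_out, state_in

Use cases: streaming tasks such as speech recognition and real-time text generation.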
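
State passing pays off when several continuations share the same prefix, as in beam search: the prefix is scored once and each candidate is scored from the saved state. The sketch below imitates that pattern with a toy bigram table in pure Python. The table, the `base_score` function, and the use of the last word as the state are all assumptions for illustration; KenLM's own State is an opaque object.

```python
import math
from typing import Dict, Tuple

# Hypothetical bigram table; the state of a bigram model is just the last word.
TOY_BIGRAMS: Dict[Tuple[str, str], float] = {
    ("<s>", "how"): 0.1,
    ("how", "are"): 0.5,
    ("are", "you"): 0.2,
    ("are", "we"): 0.05,
}
FLOOR = 1e-6  # probability assigned to unseen bigrams


def base_score(state: str, word: str) -> Tuple[float, str]:
    """Score `word` given `state`; return (log10 prob, next state)."""
    prob = TOY_BIGRAMS.get((state, word), FLOOR)
    return math.log10(prob), word


# Score the shared prefix "how are" once.
state, prefix_score = "<s>", 0.0
for word in ["how", "are"]:
    step, state = base_score(state, word)
    prefix_score += step

# Branch from the saved state to compare candidate next words.
candidates = {}
for word in ["you", "we", "banana"]:
    step, _ = base_score(state, word)
    candidates[word] = prefix_score + step

best = max(candidates, key=candidates.get)
print(best)  # you
```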

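⚡ III. Typical Use Cases

1. Text correction (smart substitution)

def correct_a_an(sentence):
    # Compare whole tokens so that a sentence-initial "a" or "an" is also caught
    tokens = sentence.split()
    if "a" not in tokens and "an" not in tokens:
        return sentence

    # Generate every a/an combination as a candidate
    # (generate_candidates is a helper that this snippet does not define)
    candidates = generate_candidates(sentence)
    best_sentence = sentence
    best_score = model.score(sentence)

    for cand in candidates:
        cand_score = model.score(cand)
        if cand_score > best_score:  # a higher score means a more plausible sentence
            best_sentence, best_score = cand, cand_score
    return best_sentence

Example: correcting "a apple" to "an apple".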
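
The snippet above calls `generate_candidates` without defining it. One possible implementation is sketched below: it swaps every "a"/"an" token independently, which yields 2^k candidates for k articles. The toy scorer only stands in for `model.score` so that the example runs without a trained model; its vowel rule is an assumption and not how a language model decides.

```python
from itertools import product
from typing import Callable, List


def generate_candidates(sentence: str) -> List[str]:
    """Return every variant of `sentence` with each a/an swapped independently."""
    tokens = sentence.split()
    slots = [i for i, tok in enumerate(tokens) if tok in ("a", "an")]
    candidates = []
    for choice in product(("a", "an"), repeat=len(slots)):
        variant = list(tokens)
        for index, article in zip(slots, choice):
            variant[index] = article
        candidates.append(" ".join(variant))
    return candidates


def pick_best(sentence: str, score: Callable[[str], float]) -> str:
    """Return the highest-scoring candidate according to `score`."""
    return max(generate_candidates(sentence), key=score)


def toy_score(text: str) -> float:
    """Stand-in scorer: one penalty point per article that clashes with the next word."""
    tokens = text.split()
    penalty = 0
    for article, nxt in zip(tokens, tokens[1:]):
        starts_with_vowel = nxt[0] in "aeiou"
        if article == "a" and starts_with_vowel:
            penalty += 1
        if article == "an" and not starts_with_vowel:
            penalty += 1
    return -float(penalty)


print(generate_candidates("a apple"))   # ['a apple', 'an apple']
print(pick_best("a apple", toy_score))  # an apple
```

With a real model, `pick_best(sentence, model.score)` would replace the toy scorer.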

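2. Language model training workflow

Data preprocessing: segment the text into tokens separated by spaces, one sentence per line (e.g. "我 去 了 学校").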
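
Word segmentation needs a segmenter, which is not named here. As a simpler alternative, the sketch below prepares a character-level corpus, where every non-space character becomes one token. The function names are hypothetical, and character-level tokens are an assumption: the n-grams then span characters rather than words, and any Latin-script words are split into letters as well.

```python
from typing import Iterable, List


def to_char_tokens(line: str) -> str:
    """Split a line into space-separated characters, dropping existing whitespace."""
    return " ".join(ch for ch in line if not ch.isspace())


def prepare_corpus(lines: Iterable[str]) -> List[str]:
    """Tokenize each non-empty line; one sentence per output line."""
    return [to_char_tokens(line) for line in lines if line.strip()]


print(to_char_tokens("我去了学校"))  # 我 去 了 学 校
```

The resulting lines can be written to a text file and passed to the training step.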

Train an ARPA model:

bin/lmplz -o 3 --text corpus.txt --arpa model.arpa  # 3-gram

Convert it to binary:

bin/build_binary model.arpa model.bin  # faster loading
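
The two commands can also be driven from Python, which is convenient in a training script. This is a minimal sketch: the function names and the default `bin_dir` are assumptions, and only the flags shown above (`-o`, `--text`, `--arpa`) are used.

```python
import subprocess
from typing import List


def build_commands(corpus: str, arpa: str, binary: str,
                   order: int = 3, bin_dir: str = "bin") -> List[List[str]]:
    """Return the lmplz and build_binary command lines as argument lists."""
    return [
        [f"{bin_dir}/lmplz", "-o", str(order), "--text", corpus, "--arpa", arpa],
        [f"{bin_dir}/build_binary", arpa, binary],
    ]


def train(corpus: str, arpa: str, binary: str, order: int = 3) -> None:
    """Run both steps in order, raising CalledProcessError on the first failure."""
    for command in build_commands(corpus, arpa, binary, order):
        subprocess.run(command, check=True)


# Print the commands instead of running them, so this works without KenLM built.
for command in build_commands("corpus.txt", "model.arpa", "model.bin"):
    print(" ".join(command))
```

Passing argument lists rather than a shell string avoids quoting problems with file names that contain spaces.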

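🚀 IV. Advanced Features

Custom n-gram order
The -o option sets the order (for example, -o 5 builds a 5-gram model); higher orders need more memory.

Reusing state across steps
Passing context through State objects avoids rescoring the prefix and speeds up decoding.

Integration with PyCorrector
A KenLM model can be loaded directly for Chinese text correction:

from pycorrector import Corrector
cor = Corrector(language_model_path="lm.bin")
print(cor.correct("消读的步骤"))  # output: ("消毒的步骤", [('消读', '消毒')]); the exact return format depends on the pycorrector version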

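💎 V. Best Practices and Pitfalls

Problem	Solution
`No CMAKE_CXX_COMPILER` error during build	Install `g++`: `sudo apt install g++`
Build fails on Windows	Use WSL or a Linux environment
Scoring without sentence-start/end markers	`model.score(sentence, bos=False, eos=False)`
Out of memory	Lower the n-gram order (`-o 2`) or use the compressed binary format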