pycorrector Source Code Walkthrough and Some Thoughts on Error Correction

Latest recommended article published on 2023-08-29 14:01:00

ox180x

Article link: https://blog.csdn.net/ox180x/article/details/124095194

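Introduction

This post steps through pycorrector's default rule-based correction code in a debugger and explains what it does. Whatever your goal, read the author's README first so that you have a solid overall picture.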

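Initialization

Initialization does three things:

Initialize several dictionaries that are used later during correction.
Load the kenlm model.
Initialize the jieba tokenizer.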

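1. Initializing the dictionaries

This step loads the list of common Chinese characters, the same-pinyin (homophone) table, and the similar-shape character table.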

check_corrector_initialized()

def _initialize_corrector(self):
    # common Chinese characters: a set of roughly 3,000+ frequently used characters
    self.cn_char_set = self.load_set_file(self.common_char_path)
    # same pinyin: homophone table (customizable)
    self.same_pinyin = self.load_same_pinyin(self.same_pinyin_text_path)
    # same stroke: similar-shape characters, e.g. {坐: [座, ...]} (customizable)
    self.same_stroke = self.load_same_stroke(self.same_stroke_text_path)
    self.initialized_corrector = True

Converting to Unicode

# normalize the encoding: utf-8 to unicode
text = convert_to_unicode(text)

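Splitting long sentences into short ones

An aside: re_han alone is enough to show that the author knows jieba well, since the same name appears in jieba's own source.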

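def split_2_short_text(text, include_symbol=False):
    """
    Split a long sentence into short ones
    :param text: str
    :param include_symbol: bool, whether to keep the punctuation blocks as well
    :return: list of (sentence, idx), where idx is the start offset in text
    """
    result = []
    # re_han is a module-level compiled regex with one capturing group (not shown here)
    blocks = re_han.split(text)
    start_idx = 0
    for blk in blocks:
        if not blk:
            continue
        if include_symbol:
            result.append((blk, start_idx))
        else:
            if re_han.match(blk):
                result.append((blk, start_idx))
        start_idx += len(blk)
    return result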
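
To see what the split produces, here is a small self-contained sketch. The pattern assigned to re_han below is an assumption written in the spirit of the original (runs of Chinese characters, ASCII letters and digits); pycorrector's real pattern may differ.

```python
import re

# Assumption: a stand-in for pycorrector's re_han. The capturing group makes
# re.split() return the matched runs as well as the separators between them.
re_han = re.compile(r"([\u4E00-\u9FA5a-zA-Z0-9]+)")


def split_2_short_text(text, include_symbol=False):
    """Return (block, start_offset) pairs; punctuation blocks are optional."""
    result = []
    start_idx = 0
    for blk in re_han.split(text):
        if not blk:
            continue
        if include_symbol or re_han.match(blk):
            result.append((blk, start_idx))
        start_idx += len(blk)
    return result


text = "少先队员因该为老人让坐，你好吗？"
print(split_2_short_text(text))
# [('少先队员因该为老人让坐', 0), ('你好吗', 12)]
print(split_2_short_text(text, include_symbol=True))
# [('少先队员因该为老人让坐', 0), ('，', 11), ('你好吗', 12), ('？', 15)]
```

Keeping the start offset with each block is what later allows an error position inside a short sentence to be mapped back to the original text.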

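2. Loading the kenlm model

As for kenlm, a quick web search suggests that almost nobody uses it outside of error correction (and even then mostly in connection with pycorrector). Only one article had anything interesting to say about it. Judging by the kenlm introduction on GitHub, its author is rather single-minded too: it stresses speed and says little about what the tool is for...

Put simply, kenlm is a toolkit for training and querying n-gram language models, and what pycorrector loads is a pretrained n-gram model built with it. For more on its usage, see kenlm's Example.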

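3. Initializing jieba

Load the word frequencies
(I checked this: it has basically nothing to do with the dict.txt bundled with jieba. The author has effectively trained his own word-frequency dictionary.)
Custom confusion set (empty by default, so this step can be ignored)
Custom segmentation dictionary
(Empty by default. Personally I think jieba's dict.txt could be added here.)
Some specific dictionaries
The person-name, place-name and stopword frequency dictionaries, all merged into one

For person and place names, an off-the-shelf named-entity recognition model would be a better fit, because a dictionary can never enumerate every name.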

# word -> frequency dict. TODO: it is loaded here, but apparently never used
self.word_freq = self.load_word_freq_dict(self.word_freq_path)
# custom confusion set
self.custom_confusion = self._get_custom_confusion_dict(self.custom_confusion_path)
# custom segmentation dictionary
self.custom_word_freq = self.load_word_freq_dict(self.custom_word_freq_path)
self.person_names = self.load_word_freq_dict(self.person_name_path)
self.place_names = self.load_word_freq_dict(self.place_name_path)
self.stopwords = self.load_word_freq_dict(self.stopwords_path)
# merge the segmentation dictionary with the custom dictionaries
self.custom_word_freq.update(self.person_names)
self.custom_word_freq.update(self.place_names)
self.custom_word_freq.update(self.stopwords)
self.word_freq.update(self.custom_word_freq)  # TODO: merged here (see the TODO above)
self.tokenizer = Tokenizer(dict_path=self.word_freq_path, custom_word_freq_dict=self.custom_word_freq,
                           custom_confusion_dict=self.custom_confusion)

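Error detection

1. Word-level error detection

This part segments the text with jieba's search mode.

How it works: jieba first segments the sentence normally (dictionary lookup, with an HMM for unknown words). For 少先队员因该为老人让坐 this gives ["少先队员", "因该", "为", "老人", "让", "坐"]. Each word is then cut into 2-grams and 3-grams, and every n-gram that is found in self.FREQ is emitted as well. The result is: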

('队员', 2, 4)
('少先队', 0, 3)
('少先队员', 0, 4)
('因该', 4, 6)
('为', 6, 7)
('老人', 7, 9)
('让', 9, 10)
('坐', 10, 11)

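After segmentation, every token is checked against the dictionary. Punctuation and English tokens are skipped; any other token that is not in the dictionary is treated as a possible error.

At this point 因该 has been flagged as a possible error.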
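
The dictionary check can be sketched as follows. This is a simplified illustration, not pycorrector's code: the function name, the toy vocabulary and the "Chinese characters only" filter are assumptions, and the real detector applies more skip rules.

```python
import re

# Only tokens made purely of Chinese characters are candidates for an error;
# punctuation, English and digits are skipped (assumed, simplified filter).
CHINESE_ONLY = re.compile(r"[\u4E00-\u9FA5]+")


def detect_unknown_words(tokens, vocab):
    """Flag (word, begin, end) tokens that are absent from the vocabulary."""
    maybe_errors = []
    for word, begin, end in tokens:
        if word in vocab:
            continue
        if not CHINESE_ONLY.fullmatch(word):
            continue
        maybe_errors.append([word, begin, end, "word"])
    return maybe_errors


# Tokens copied from the search-mode output listed above.
tokens = [("队员", 2, 4), ("少先队", 0, 3), ("少先队员", 0, 4), ("因该", 4, 6),
          ("为", 6, 7), ("老人", 7, 9), ("让", 9, 10), ("坐", 10, 11)]
# Toy vocabulary standing in for the real word-frequency dictionary.
vocab = {"队员", "少先队", "少先队员", "为", "老人", "让", "坐"}

print(detect_unknown_words(tokens, vocab))
# [['因该', 4, 6, 'word']]
```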

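2. Error detection with the kenlm language model

Take the bigrams and trigrams of the sentence, score each one with kenlm, then average the scores so that the result has one score per character, i.e. the same length as the sentence.

For example: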

sent_scores = [-5.629326581954956, -6.566553155581156, -6.908517241477966, -7.255491574605306, -7.401519060134888, -7.489806890487671, -7.1438290278116865, -6.559153278668722, -6.858733296394348, -7.7903218269348145, -8.28114366531372]
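
One way such per-character scores could be produced is sketched below. The edge padding scheme and the function name are assumptions on my part rather than a copy of pycorrector, and toy_score is a hypothetical stand-in for the kenlm n-gram score.

```python
def sliding_ngram_scores(sentence, score_fn, orders=(2, 3)):
    """Average n-gram scores into one score per character (assumed scheme)."""
    per_order = []
    for n in orders:
        scores = [score_fn(sentence[i:i + n]) for i in range(len(sentence) - n + 1)]
        if not scores:
            continue
        # Repeat the edge scores so that every character is covered by n windows.
        padded = [scores[0]] * (n - 1) + scores + [scores[-1]] * (n - 1)
        per_order.append([sum(padded[i:i + n]) / n for i in range(len(sentence))])
    if not per_order:
        return []
    # Average the bigram-based and trigram-based scores position by position.
    return [sum(col) / len(col) for col in zip(*per_order)]


def toy_score(ngram):
    """Hypothetical scorer: n-grams containing 坐 are deemed unlikely."""
    return -5.0 if "坐" in ngram else -1.0


sentence = "少先队员因该为老人让坐"
scores = sliding_ngram_scores(sentence, toy_score)
print(len(scores) == len(sentence))
print(scores.index(min(scores)))
```

With this toy scorer the lowest score lands on the last character, which is the kind of dip the next step looks for.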

These sent_scores are then used to decide which indices are wrong.

So how does the author make that decision?

import numpy as np

def _get_maybe_error_index(scores, ratio=0.6745, threshold=2):
    """
    Find the positions of suspected wrong characters via the mean absolute deviation (MAD)
    :param scores: np.array
    :param ratio: constant taken from the normal distribution table
    :param threshold: the smaller the threshold, the more suspected errors are returned
    :return: list, indices of all suspected wrong characters
    """
    result = []
    scores = np.array(scores)
    if len(scores.shape) == 1:
        scores = scores[:, None]
    median = np.median(scores, axis=0)  # get median of all scores
    margin_median = np.abs(scores - median).flatten()  # deviation from the median
    # MAD value
    med_abs_deviation = np.median(margin_median)
    if med_abs_deviation == 0:
        return result
    y_score = ratio * margin_median / med_abs_deviation
    # flatten
    scores = scores.flatten()
    maybe_error_indices = np.where((y_score > threshold) & (scores < median))
    # collect the indices of all suspected wrong characters
    result = [int(i) for i in maybe_error_indices[0]]
    return result

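Baidu Baike defines the mean absolute deviation (平均绝对离差) as the mean of the absolute deviations of each value from the mean, so the author's computation does not match the name used in the docstring.
What the code actually computes is the median absolute deviation: subtract the median from every value, take absolute values, then take the median of those. Using medians instead of means makes the statistic robust to outliers, the classic example being how a few very high earners pull up the "average salary".
The author then applies two conditions: (1) ratio * np.abs(score - median) / MAD must exceed threshold, and (2) the score must be below the median.
On first reading this felt like rule-of-thumb tuning to me, but (1) is the standard modified z-score: 0.6745 is the 0.75 quantile of the standard normal distribution, which rescales the MAD so that it is comparable to a standard deviation for normally distributed data. Condition (2) keeps only the outliers on the low side, i.e. positions the language model finds unusually unlikely.
The indices that satisfy both conditions are returned as the suspected error positions.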
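
To check this reading, the same rule can be replayed on the sent_scores shown earlier. The snippet is a pure-Python re-implementation for illustration (the function name is made up, and statistics.median replaces numpy).

```python
from statistics import median


def maybe_error_indices(scores, ratio=0.6745, threshold=2):
    """Indices whose modified z-score exceeds threshold on the low side."""
    med = median(scores)
    deviations = [abs(s - med) for s in scores]
    mad = median(deviations)  # median absolute deviation
    if mad == 0:
        return []
    return [
        i
        for i, (s, d) in enumerate(zip(scores, deviations))
        if ratio * d / mad > threshold and s < med
    ]


sent_scores = [-5.629326581954956, -6.566553155581156, -6.908517241477966,
               -7.255491574605306, -7.401519060134888, -7.489806890487671,
               -7.1438290278116865, -6.559153278668722, -6.858733296394348,
               -7.7903218269348145, -8.28114366531372]

print(maybe_error_indices(sent_scores))
# [10]
```

Index 0 also has a large modified z-score (about 2.95), but its score lies above the median, so condition (2) drops it. Only index 10, the character 坐, is reported.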

At this point the list of possible errors is:

[['因该', 4, 6, 'word'], ['坐', 10, 11, 'char']]

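Correction

1. Generating correction candidates

Suppose the current input word is 因该:

I. Word-level candidates

Words with the same pinyin (tones ignored): _confusion_word_set
Custom confusion set: _confusion_custom_set

The way he collects the same-pinyin words frustrated me a little. Why not simply look in self.known (the custom dictionary) for words of the same length and check whether their pinyin matches?

The custom confusion set encodes hand-written corrections, for example {"因该": "应该"}, which enlarge the candidate set.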

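II. Character-level candidates

This falls into three cases:

If the word has length 1: look up the characters with the same pinyin (same_pinyin) and the characters with a similar shape (same_stroke).
If the word has length 2: take the first character, e.g. 因, look up its same-pinyin and similar-shape characters, and join each of them with 该 to form new candidates. Then do the same for the second character, 该.
If the word is longer than 2: the same idea applies, only at a different granularity (omitted here).

III. Sort the candidates by word_freq and keep only the top K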
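
A simplified sketch of the character-level step and the final ranking follows. The helper names and the toy tables are hypothetical, and the real code treats word lengths 1, 2 and longer separately, whereas this sketch uses one loop for all lengths.

```python
def char_candidates(ch, same_pinyin, same_stroke):
    """Characters that sound like ch or look like ch."""
    return same_pinyin.get(ch, set()) | same_stroke.get(ch, set())


def word_candidates(word, same_pinyin, same_stroke, word_freq, top_k=3):
    """Replace one character at a time, then rank candidates by frequency."""
    candidates = set()
    for i, ch in enumerate(word):
        for repl in char_candidates(ch, same_pinyin, same_stroke):
            candidates.add(word[:i] + repl + word[i + 1:])
    candidates.discard(word)
    # Highest frequency first; ties are broken alphabetically for determinism.
    ranked = sorted(candidates, key=lambda w: (-word_freq.get(w, 0), w))
    return ranked[:top_k]


# Toy tables; the real ones are loaded from pycorrector's data files.
same_pinyin = {"因": {"应", "音"}, "该": {"改", "盖"}}
same_stroke = {"因": {"困"}}
word_freq = {"应该": 1000, "因改": 2}

print(word_candidates("因该", same_pinyin, same_stroke, word_freq))
```

Ranking by frequency before truncating to K keeps plausible words such as 应该 in the list while dropping most of the meaningless combinations.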

2. Selecting from the candidates

This is where it gets interesting: how is the most plausible candidate chosen? See the code below.

def get_lm_correct_item(self, cur_item, candidates, before_sent, after_sent, threshold=57, cut_type='char'):
    """
    Correct a wrong character or word with the language model
    :param cur_item: the current word
    :param candidates: the candidate words
    :param before_sent: the part of the sentence before the word
    :param after_sent: the part of the sentence after the word
    :param threshold: ppl threshold; the original word counts as an error if its ppl exceeds the best candidate's ppl by this much or more
    :param cut_type: segmentation granularity ('char' = character level)
    :return: str, correct item, the corrected character or word
    """
    result = cur_item
    if cur_item not in candidates:
        candidates.append(cur_item)
    # build a new sentence from each candidate and compute its ppl_score
    ppl_scores = {i: self.ppl_score(segment(before_sent + i + after_sent, cut_type=cut_type)) for i in candidates}
    sorted_ppl_scores = sorted(ppl_scores.items(), key=lambda d: d[1])

    # widen the range in which the original word is accepted, to reduce false corrections
    top_items = []
    top_score = 0.0
    for i, v in enumerate(sorted_ppl_scores):
        v_word = v[0]
        v_score = v[1]
        if i == 0:
            top_score = v_score
            top_items.append(v_word)
        # accept every candidate within threshold of the best score
        elif v_score < top_score + threshold:
            top_items.append(v_word)
        else:
            break
    if cur_item not in top_items:
        result = top_items[0]
    return result

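The core is self.ppl_score, shown below:

def ppl_score(self, words):
    """
    Perplexity score from the language model; the lower the score, the more fluent the sentence
    Example words: ['少', '先', '队', '员', '应', '该', '为', '老', '人', '让', '坐']
    :param words: list, the sentence split into words or characters
    :return: float, perplexity
    """
    self.check_detector_initialized()
    return self.lm.perplexity(' '.join(words))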

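The author's docstring says it clearly: the more fluent the sentence, the lower its score. That is why 应该, with the lowest perplexity, ranks first in the sorted output: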
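
For intuition about why lower means more fluent: as far as I understand kenlm's Python wrapper, perplexity is derived from the sentence's total log10 probability, normalized by the number of tokens plus one for the end-of-sentence marker. Treat the formula below as an assumption, and the function name as made up for this sketch.

```python
def perplexity_from_log10(total_log10_prob, n_tokens):
    """Perplexity of a sentence given its total log10 probability.

    Assumed normalization: n_tokens + 1, the extra one being the
    end-of-sentence marker.
    """
    return 10.0 ** (-total_log10_prob / (n_tokens + 1))


# A more probable sentence (log probability closer to 0) gets a lower perplexity.
print(perplexity_from_log10(-12.0, 11))  # 10.0
print(perplexity_from_log10(-24.0, 11))  # 100.0
```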

pprint(sorted_ppl_scores)
[('应该', 144.39704182754554),
 ('因改', 236.80615502078768),
 ('因该', 284.14769660593794),
 ('听该', 357.8835799332408),
 ('因盖', 360.68106481988417),
 ('因核', 365.9438178618582),
 # only part of the list is shown here
]
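
To make the threshold logic concrete, here is a small standalone sketch that replays the selection on the first few scores listed above (rounded). The function name pick_by_ppl is hypothetical; the loop mirrors the one in get_lm_correct_item with threshold=57.

```python
def pick_by_ppl(cur_item, ppl_scores, threshold=57):
    """Keep cur_item unless its perplexity is too far above the best one."""
    ranked = sorted(ppl_scores.items(), key=lambda kv: kv[1])
    top_score = ranked[0][1]
    top_items = [ranked[0][0]]
    for word, score in ranked[1:]:
        if score < top_score + threshold:
            top_items.append(word)
        else:
            break
    return cur_item if cur_item in top_items else top_items[0]


scores = {"应该": 144.397, "因改": 236.806, "因该": 284.148, "听该": 357.884}
print(pick_by_ppl("因该", scores))
# 应该

# If the original word were within 57 of the best score, it would be kept.
print(pick_by_ppl("因该", {"应该": 144.397, "因该": 190.0}))
# 因该
```

The best score is about 144.4, so only candidates below 201.4 are treated as equally acceptable. 因该 scores 284.1, falls outside that window, and is therefore replaced by 应该.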