Detecting OCR Typos with pycorrector in Practice

飞锡2024

Category: NLP Algorithms. Tags: text correction

Article link: https://blog.csdn.net/weixin_38235865/article/details/135616913

Reference: https://github.com/shibing624/pycorrector/tree/master/examples/macbert

Add domain-specific stopwords to stopwords.txt so that they are not reported as errors.
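As a minimal sketch of this step, the helper below appends domain terms to a stopword file, one per line, and skips terms that are already present. The name `add_stopwords` and the sample terms are hypothetical, and the one-term-per-line UTF-8 layout is an assumption about stopwords.txt.

```python
from pathlib import Path


def add_stopwords(path, words):
    """Append words to a stopword file, skipping blanks and duplicates.

    Hypothetical helper. Assumes a UTF-8 file with one term per line.
    Returns the list of words that were actually appended.
    """
    file = Path(path)
    text = file.read_text(encoding="utf-8") if file.exists() else ""
    existing = {line.strip() for line in text.splitlines() if line.strip()}

    added = []
    for word in words:
        word = word.strip()
        if word and word not in existing:
            existing.add(word)
            added.append(word)

    if added:
        # Start on a fresh line if the file does not end with a newline.
        prefix = "" if not text or text.endswith("\n") else "\n"
        with file.open("a", encoding="utf-8") as handle:
            handle.write(prefix + "\n".join(added) + "\n")
    return added
```

Calling `add_stopwords("stopwords.txt", ["校区"])` twice appends the term only once, so the helper is safe to re-run whenever the term list grows.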

Set a custom dictionary so that correct words are not mistakenly flagged as errors.

from pycorrector import Corrector
m = Corrector()
m.set_custom_word_freq(path='./dictionary/dict.txt')
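The dictionary file itself can be generated from a list of known-good terms. The sketch below assumes one entry per line, written as a word followed by a space and an integer frequency; verify this layout against the pycorrector version you have installed. `write_word_freq` and `read_word_freq` are hypothetical helper names, and the frequencies in the usage note are arbitrary.

```python
from pathlib import Path


def write_word_freq(path, word_freq):
    """Write a mapping of word -> frequency as 'word freq' lines (assumed format).

    Words are assumed to contain no whitespace.
    """
    lines = [f"{word} {int(freq)}" for word, freq in word_freq.items() if word.strip()]
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")


def read_word_freq(path):
    """Read the file back into a dict; a missing frequency defaults to 1."""
    result = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and comments
        result[parts[0]] = int(parts[1]) if len(parts) > 1 else 1
    return result
```

For example, `write_word_freq("./dictionary/dict.txt", {"通州校区": 1000, "平乐园校区": 1000})` produces a file that can then be passed to `set_custom_word_freq`.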

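Disable pinyin-based correction. OCR does not produce homophone errors, and after this change the number of detected errors dropped by 30%.
Edit corrector.py; its path looks like miniconda/envs/env_name/lib/python3.x/site-packages/pycorrector/corrector.py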
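Rather than typing the site-packages path by hand, the location of an installed module file can be looked up from Python. The sketch below relies only on the standard library's `importlib.util.find_spec`; `package_file` is a hypothetical helper name.

```python
import importlib.util
from pathlib import Path


def package_file(package, relative):
    """Return the path of `relative` inside an installed top-level package.

    Returns None when the package is not installed or is not a regular package.
    The file itself is not checked for existence.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    # A regular package has a single directory in its search locations.
    return Path(list(spec.submodule_search_locations)[0]) / relative
```

Running `print(package_file("pycorrector", "corrector.py"))` inside the target conda environment shows which file that environment will actually import, which helps when several environments are installed.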

Edit the file miniconda/envs/env_name/lib/python3.x/site-packages/pycorrector/proper_corrector.py
Comment out the pinyin similarity comparison self.get_word_pinyin_similarity_score(word1, word2); it is too slow, so skip it.
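If editing the installed library is undesirable, a similar effect can be approximated outside pycorrector by discarding suggested corrections whose replacement sounds the same as the original. This is a different technique from commenting out the library code, and it is only a sketch. `drop_homophone_edits` is a hypothetical helper, the `(wrong, right)` pair format is an assumption rather than pycorrector's documented output, and the small toneless pinyin table is hand-written for illustration (a library such as pypinyin would normally supply the readings).

```python
# Toy toneless pinyin table, for illustration only.
TOY_PINYIN = {"因": "yin", "音": "yin", "末": "mo", "未": "wei", "己": "ji", "已": "yi"}


def to_pinyin(text, table=TOY_PINYIN):
    """Return the syllables of `text`, or None if any character is unknown."""
    syllables = []
    for char in text:
        if char not in table:
            return None
        syllables.append(table[char])
    return syllables


def drop_homophone_edits(edits, table=TOY_PINYIN):
    """Keep only (wrong, right) pairs that do not share a pronunciation.

    Pairs with characters missing from the table are kept, since nothing
    can be said about how they sound.
    """
    kept = []
    for wrong, right in edits:
        a, b = to_pinyin(wrong, table), to_pinyin(right, table)
        if a is not None and a == b:
            continue  # same sound: unlikely to come from an OCR misread
        kept.append((wrong, right))
    return kept
```

With this table, a pair such as ("因", "音") is dropped as a homophone, while the shape-similar pair ("末", "未") is kept.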

vim xxxx/lib/python3.9/site-packages/pycorrector/data/proper_name.txt
Use this file as the dictionary of domain-specific proper nouns.

Custom word frequencies in the detector
self.word_freq = {}

get_wor_simi

Some place names are easily flagged as typos. The following code extracts place names:

from pprint import pprint
from paddlenlp import Taskflow
schema = ['校区名称']   # Define the schema for entity extraction

ie = Taskflow('information_extraction', schema=schema)

pprint(ie("实验班，第一年在通州校区，第二至四年在平乐园校区"))
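The extracted names can then be collected, for example to add them to the custom dictionary. The sketch below assumes the extraction result is a list with one dict per input text, mapping each schema label to a list of items that carry `text` and `probability` keys. `collect_entities`, the threshold, and the sample data are hypothetical, and the sample is hand-written in the assumed shape rather than real model output.

```python
def collect_entities(results, label, min_probability=0.5):
    """Collect unique entity strings for `label`, preserving first-seen order."""
    names = []
    seen = set()
    for result in results:
        for item in result.get(label, []):
            text = item.get("text", "").strip()
            confident = item.get("probability", 0.0) >= min_probability
            if text and confident and text not in seen:
                seen.add(text)
                names.append(text)
    return names


# Hand-written sample in the assumed result shape, not actual model output.
sample = [
    {
        "校区名称": [
            {"text": "通州校区", "probability": 0.98},
            {"text": "平乐园校区", "probability": 0.95},
            {"text": "通州校区", "probability": 0.40},
        ]
    }
]

campus_names = collect_entities(sample, "校区名称")
```

Filtering on probability keeps low-confidence extractions out of the dictionary, where they could otherwise hide real OCR errors.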