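Reference: https://github.com/shibing624/pycorrector/tree/master/examples/macbert
Add domain-specific stopwords to stopwords.txt to avoid false corrections.
Set a custom dictionary so that correct words are not flagged as errors: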
from pycorrector import Corrector
m = Corrector()
m.set_custom_word_freq(path='./dictionary/dict.txt')
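A minimal sketch of preparing the dictionary file passed to set_custom_word_freq. It assumes one entry per line in the form `word frequency`, separated by whitespace; that format is an assumption and should be checked against the installed pycorrector version. The words, frequencies, and helper names are made up for illustration, and the sketch writes to a temporary directory instead of ./dictionary/dict.txt.

```python
import os
import tempfile

# Hypothetical domain terms and frequencies, for illustration only.
custom_words = {"平乐园校区": 1000, "通州校区": 1000}


def write_word_freq(path, word_freq):
    """Write one 'word freq' pair per line (assumed file format)."""
    with open(path, "w", encoding="utf-8") as f:
        for word, freq in word_freq.items():
            f.write(f"{word} {freq}\n")


def read_word_freq(path):
    """Read the file back, skipping blank lines and '#' comment lines."""
    result = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            parts = line.split()
            # A missing frequency falls back to 1 in this sketch.
            result[parts[0]] = int(parts[1]) if len(parts) > 1 else 1
    return result


dict_path = os.path.join(tempfile.mkdtemp(), "dict.txt")
write_word_freq(dict_path, custom_words)
loaded = read_word_freq(dict_path)
```

Reading the file back before handing it to the corrector is a cheap way to catch encoding problems or malformed lines.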
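Remove pinyin-based correction (OCR output does not contain homophone errors; after this change the number of detected errors dropped by 30%).
Edit corrector.py; the path looks like miniconda/envs/env_name/lib/python3.x/site-packages/pycorrector/corrector.py
Edit miniconda/envs/env_name/lib/python3.x/site-packages/pycorrector/proper_corrector.py
Comment out the pinyin similarity comparison, self.get_word_pinyin_similarity_score(word1, word2); it is too slow, so drop it.
vim xxxx/lib/python3.9/site-packages/pycorrector/data/proper_name.txt
Set up the proper-noun dictionary in the file above.
Custom word frequencies in the detector: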
self.word_freq = {}
get_wor_simi
Some place names tend to be flagged as typos. Code for extracting place names:
from pprint import pprint
from paddlenlp import Taskflow
schema = ['校区名称'] # Define the schema for entity extraction
ie = Taskflow('information_extraction', schema=schema)
pprint(ie("实验班,第一年在通州校区,第二至四年在平乐园校区"))
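A minimal sketch of how extracted place names could be used to suppress false positives: any detected error whose character range overlaps an extracted span is dropped. Two things here are assumptions. The extraction result is taken to be a list of dicts mapping the schema key to entities with `text`, `start`, and `end` fields, and each detected error is taken to be a `(wrong, right, begin, end)` tuple with an exclusive end index. The sample data is hand-built for illustration and is not real model output.

```python
def spans_from_result(result, key):
    """Collect (start, end) spans for one schema key from an extraction result."""
    spans = []
    for item in result:
        for ent in item.get(key, []):
            spans.append((ent["start"], ent["end"]))
    return spans


def drop_errors_in_spans(errors, spans):
    """Keep only the errors that do not overlap any protected span."""
    kept = []
    for wrong, right, begin, end in errors:
        overlaps = any(begin < s_end and s_start < end for s_start, s_end in spans)
        if not overlaps:
            kept.append((wrong, right, begin, end))
    return kept


def locate(text, word):
    """Return (start, end) of the first occurrence of word in text."""
    start = text.index(word)
    return start, start + len(word)


text = "实验班,第一年在通州校区,第二至四年在平乐园校区"

# Hand-built stand-in for the extraction result (assumed shape).
sample_result = [{"校区名称": [
    {"text": w, "start": locate(text, w)[0], "end": locate(text, w)[1]}
    for w in ("通州校区", "平乐园校区")
]}]

# Hypothetical detections: the first lies inside a place name, the second does not.
errors = [
    ("平乐", "平安", *locate(text, "平乐")),
    ("实验", "试验", *locate(text, "实验")),
]

protected = spans_from_result(sample_result, "校区名称")
kept = drop_errors_in_spans(errors, protected)
```

Filtering by span overlap instead of by string match keeps the check correct when the same characters appear both inside and outside a place name.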
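from paddlenlp import Taskflow fails with ModuleNotFoundError: No module named 'paddle.nn.layer.layers'
This error is caused by a version mismatch between paddlepaddle and paddlenlp.
With paddlepaddle 2.4.2, the latest paddlenlp is installed automatically. At the time of writing that is 2.6.0, which is incompatible with paddlepaddle 2.4.2 and raises this error.
Fix: install paddlenlp 2.5.2 manually with pip (pip install paddlenlp==2.5.2).
Reference: https://blog.csdn.net/qq_56942824/article/details/133776987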
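A minimal sketch for checking the installed pairing before importing Taskflow. The only pairing it knows is the one recorded in these notes (paddlepaddle 2.4.2 with paddlenlp 2.5.2); it says nothing about other combinations. The helper names are hypothetical, and a GPU build may be registered under a different distribution name than `paddlepaddle`.

```python
from importlib import metadata

# Pairing recorded in these notes; other combinations are not covered.
KNOWN_GOOD = {"2.4.2": "2.5.2"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_known_good(paddle_version, paddlenlp_version):
    """True only for the paddlepaddle/paddlenlp pairing listed in KNOWN_GOOD."""
    return KNOWN_GOOD.get(paddle_version) == paddlenlp_version


paddle_v = installed_version("paddlepaddle")
nlp_v = installed_version("paddlenlp")
if paddle_v and nlp_v and not is_known_good(paddle_v, nlp_v):
    print(f"paddlepaddle {paddle_v} with paddlenlp {nlp_v} "
          "is not a pairing recorded as working in these notes")
```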