jcorrector 中文文本纠错工具

jcorrector

项目地址:https://github.com/jiangnanboy/jcorrector

中文文本纠错工具。音似、形似错字(或变体字)纠正,可用于中文拼音、笔画输入法的错误纠正。项目为java开发,此项目参考了pycorrector,在此对作者表示感谢。

jcorrector依据语言模型检测错别字位置,通过拼音音似特征、笔画五笔编辑距离特征及语言模型句子概率值特征纠正错别字。

1.利用n-gram语言模型检测错别字位置,通过拼音音似特征、笔画五笔编辑距离特征及语言模型句子概率值特征纠正错别字。

2.利用深度学习模型(如macbert等)进行中文拼写纠错。

Guide

Question

中文文本纠错任务,常见错误类型包括:

  • 谐音字词,如 配副眼睛-配副眼镜
  • 混淆音字词,如 流浪织女-牛郎织女
  • 字词顺序颠倒,如 伍迪艾伦-艾伦伍迪
  • 字词补全,如 爱有天意-假如爱有天意
  • 形似字错误,如 高梁-高粱
  • 中文拼音全拼,如 xingfu-幸福
  • 中文拼音缩写,如 sz-深圳
  • 语法错误,如 想象难以-难以想象

当然,针对不同业务场景,这些问题并不一定全部存在,比如输入法中需要处理前四种,搜索引擎需要处理所有类型,语音识别后文本纠错只需要处理前两种, 其中'形似字错误'主要针对五笔或者笔画手写输入等。本项目重点解决其中的谐音、混淆音、形似字错误、中文拼音全拼、语法错误带来的纠错任务。

Solution

规则的解决思路

  1. 中文纠错分为两步走,第一步是错误检测,第二步是错误纠正;
  2. 错误检测部分先通过hanlp分词器切词,由于句子中含有错别字,所以切词结果往往会有切分错误的情况,这样从字粒度和词粒度两方面检测错误, 整合这两种粒度的疑似错误结果,形成疑似错误位置候选集;
  3. 错误纠正部分,是遍历所有的疑似错误位置,并使用音似、形似词典替换错误位置的词,然后通过语言模型计算句子概率值,对所有候选集结果比较并排序,得到最优纠正词。

Feature

模型

  • berkeleylm:berkeleylm统计语言模型工具,规则方法,语言模型纠错,利用混淆集,扩展性强

错误检测

  • 字粒度:语言模型句子概率值检测某字的似然概率值低于句子文本平均值,则判定该字是疑似错别字的概率大。
  • 词粒度:切词后不在词典中的词是疑似错词的概率大。

错误纠正

  • 通过错误检测定位所有疑似错误后,取所有疑似错字的音似、形似候选词,
  • 使用候选词替换,基于语言模型得到类似翻译模型的候选排序结果,得到最优纠正词。

安装依赖

  • berkeleylm源码已集成到项目中
  • 其他jar包安装 其它通过maven pom下载

Usage

见【examples/】

  • 文本纠错
import sy.core.spelling.Corrector;

Corrector corrector = new Corrector();
String result = corrector.correct("少先队员因该为老人让坐");
System.out.println(result);

output:

[{"endIdx":6,"correct":"应该","startIdx":4,"type":"拼写错误","error":"因该"},{"endIdx":11,"correct":"让座","startIdx":9,"type":"拼写错误","error":"让坐"}]
  • 错误检测
import sy.core.spelling.Detector;

Detector detector = new Detector();
String result = detector.detect(sentence);
System.out.println(result);

output:

[[因该, 4, 6, confusion], [让坐, 9, 11, confusion]]

默认字粒度、词粒度的纠错都打开,可以通过enableCharError()和enableWordError()进行设置。

训练

语言模型的训练及加载见sy/core/ngram/NGramModel

Test

以下是部分测试结果:

detector:

少先队员因该为老人让坐
[[因该, 4, 6, confusion], [让坐, 9, 11, confusion]]

机七学习是人工智能领遇最能体现智能的一个分知
[[机, 0, 1, char], [领, 9, 10, char], [遇, 10, 11, char], [分, 20, 21, char], [知, 21, 22, char]]

一只小鱼船浮在平净的河面上
[[平净, 7, 9, word]]

我的家乡是有明的渔米之乡
[[渔米之乡, 8, 12, confusion]]

corrector:

少先队员因该为老人让坐
[{"endIdx":6,"correct":"应该","startIdx":4,"type":"拼写错误","error":"因该"},{"endIdx":11,"correct":"让座","startIdx":9,"type":"拼写错误","error":"让坐"}]

真麻烦你了。希望你们好好的跳无
[{"endIdx":14,"correct":["条","桃","跳","调","挑","逃","眺","兆","窕","佻","笤"],"startIdx":13,"type":"拼写错误","error":"跳"},{"endIdx":15,"correct":["舞","物","污","武","务","屋","无","乌","于","恶","伍","误","午","雾","吴","悟","亡","勿","巫","五","坞","梧","芜","捂","侮","呜","诬","晤","蜈","鹉"],"startIdx":14,"type":"拼写错误","error":"无"}]

机七学习是人工智能领遇最能体现智能的一个分知
[{"endIdx":1,"correct":["其","继","既","冀","吉","奇","鸡","急","几","给","即","基","骑","祭","挤","借","激","寄","吃","疾","际","击","积","棘","己","忌","籍","寂","集","系","机","计","济","剂","季","迹","稽","脊","记","辑","荠","肌","极","饥","级","及","箕","期","居","嫉","畸","绩","讥","叽","唧","鲫","技","妓","圾","纪"],"startIdx":0,"type":"拼写错误","error":"机"},{"endIdx":11,"correct":["域","语","与","于","羽","愈","郁","育","偶","浴","淤","喻","芋","雨","尉","宇","愉","愚","吁","遇","狱","逾","欲","寓","隅","榆","裕","娱","御","鱼","余","藕","豫","迂","誉","玉","屿","蔚","予","谷","渔","预","舆"],"startIdx":10,"type":"拼写错误","error":"遇"},{"endIdx":21,"correct":["粪","纷","芬","愤","坟","粉","分","奋","氛","焚","忿","吩","份"],"startIdx":20,"type":"拼写错误","error":"分"},{"endIdx":22,"correct":["支","值","指","质","枝","芝","氏","旨","殖","治","织","汁","肢","植","掷","制","址","挚","直","至","只","知","滞","侄","秩","致","趾","脂","置","吱","帜","智","纸","职","识","窒","执","志","稚","征","之","止","蜘"],"startIdx":21,"type":"拼写错误","error":"知"}]

一只小鱼船浮在平净的河面上
[{"endIdx":9,"correct":["平静","平经","冯净","屏净","坪净","评净","平精","凭净","平净","萍净","瓶净","平京","乒净","苹净","平景","平晶","平靖","平井","平挣","平争","平峥","平铮","平劲","平径","平惊","平鲸","平睁","平镜","平颈","平敬","平茎","平睛","平诤","平警","平兢","平荆","平筝","平境","平竟","平狰","平阱","平更","平竞"],"startIdx":7,"type":"拼写错误","error":"平净"}]

我的家乡是有明的渔米之乡
[{"endIdx":12,"correct":"鱼米之乡","startIdx":8,"type":"拼写错误","error":"渔米之乡"}]

遇到逆竟时,我们必须勇于面对,而且要愈挫愈勇,这样我们才能朝著成功之路前进。
[{"endIdx":31,"correct":"朝着","startIdx":29,"type":"拼写错误","error":"朝著"}]

人生就是如此,经过磨练才能让自己更加拙壮,才能使自己更加乐观。
[{"endIdx":11,"correct":"磨炼","startIdx":9,"type":"拼写错误","error":"磨练"},{"endIdx":20,"correct":["茁壮","拙装","拙庄","旺壮","卓壮","捉壮","出壮","着壮","咄壮","屈壮","汪壮","酌壮","浊壮","拙状","桌壮","拙壮","拙桩","啄壮","琢壮","灼壮","拙妆","倔壮","掘壮","拙撞"],"startIdx":18,"type":"拼写错误","error":"拙壮"}]

Dataset

数据集来自人民日报2014版,将每句进行分字处理。

Neural-Net

利用深度学习进行端到端中文拼写纠错。

Macbert

 这里利用java加载macbert模型,并进行中文拼写纠错,具体见【https://github.com/jiangnanboy/macbert-java-onnx】
 

usage

1.sy.dl/MacBert

 String text = "今天新情很好。";
 Pair<BertTokenizer, Map<String, OnnxTensor>> pair = null;
 try {
     pair = parseInputText(text);
 } catch (Exception e) {
     e.printStackTrace();
 }
 var predString = predCSC(pair);
 List<Pair<String, String>> resultList = getErrors(predString, text);
 for(Pair<String, String> result : resultList) {
     System.out.println(text + " => " + result.getLeft() + " " + result.getRight());
 }

2.result

 String text = "今天新情很好。";
 
 tokens -> [[CLS], 今, 天, 新, 情, 很, 好, 。, [SEP]]
 今天新情很好。 => 今天心情很好。 新,心,2,3
 
 String text = "你找到你最喜欢的工作,我也很高心。";
 
 tokens -> [[CLS], 你, 找, 到, 你, 最, 喜, 欢, 的, 工, 作, ,, 我, 也, 很, 高, 心, 。, [SEP]]
 你找到你最喜欢的工作,我也很高心。 => 你找到你最喜欢的工作,我也很高兴。 心,兴,15,16
  • 3
    点赞
  • 8
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值