中文文本纠错
@(自然语言处理)[纠错]
常见错误类型
在中文中,常见的错误类型大概有如下几类:
由于字音字形相似导致的错字形式:体脂称—>体脂秤 多字错误:iphonee —> iphone 少字错误:爱有天意 --> 假如爱有天意 顺序错误: 表达难以 --> 难以表达
纠错组成模块
纠错一般分两大模块:
错误检测:识别错误发生的位置 错误纠正:对疑似的错误词,根据字音字形等对错词进行候选词召回,并且根据语言模型等对纠错后的结果进行排序,选择最优结果。
赛事
几届中文纠错评测,例如CGED与NLPCC
- Chinese Spelling Check Evaluation at SIGHAN Bake-off 2013 [Wu et al., 2013][^1]
- CLP-2014 Chinese Spelling Check Evaluation (Yu et al., 2014)
数据集
1、Academia Sinica Balanced Corpus (ASBC for short hereafter, cf. Chen et al., 1996). 2、混淆词数据集[^A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check]
[^A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check]: Wang, D. , Song, Y. , Li, J. , Han, J. , & Zhang, H. . (2018). A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. https://aclanthology.org/D18-1273.pdf
3、Chinese Grammatical Error Diagnosis NLPTEA 2016 Shared Task: http://ir.itc.ntnu.edu.tw/lre/nlptea16cged.htm NLPTEA 2015 Shared Task: http://ir.itc.ntnu.edu.tw/lre/nlptea15cged.htm NLPTEA 2014 Shared Task: http://ir.itc.ntnu.edu.tw/lre/nlptea14cfl.htm
4、Chinese Spelling Check SIGHAN 2015 Bake-off: http://ir.itc.ntnu.edu.tw/lre/sighan8csc.html CLP 2014 Bake-off: http://ir.itc.ntnu.edu.tw/lre/clp14csc.html SIGHAN 2013 Bake-off: http://ir.itc.ntnu.edu.tw/lre/sighan7csc.html