Text Correction: Dictionary-Based Text Correction

jajakala

Category: Text correction

Article link: https://blog.csdn.net/jajakala/article/details/119822603

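1. Mapping data

Write one entry per line in the form "incorrect text, Tab, correct text", with a single Tab character between the two fields. For example: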

缸肠        肛肠
人名医院        人民医院
门珍        门诊
闪分泌        内分泌
爱尔眠科        爱尔眼科
笫一        第一
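
To make the file format concrete, here is a minimal sketch of reading such a mapping into memory. The class name MappingLoader and its parse method are hypothetical helpers introduced only for illustration; they are not part of ansj, and the article itself lets the library read the file directly. The sketch assumes each line contains exactly one Tab between the two fields and silently skips blank or malformed lines.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of ansj): parses "wrong<Tab>right" lines into a map.
final class MappingLoader {

    static Map<String, String> parse(BufferedReader reader) throws IOException {
        // LinkedHashMap keeps the entries in file order.
        Map<String, String> mapping = new LinkedHashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            // Split on the first Tab only: left is the wrong form, right is the fix.
            String[] parts = line.split("\t", 2);
            if (parts.length < 2) {
                continue; // blank line or line without a Tab
            }
            String wrong = parts[0].trim();
            String right = parts[1].trim();
            if (wrong.isEmpty() || right.isEmpty()) {
                continue; // one of the two fields is missing
            }
            mapping.put(wrong, right);
        }
        return mapping;
    }

    public static void main(String[] args) throws IOException {
        // In-memory sample data; a real run would read the mapping file instead.
        String data = "缸肠\t肛肠\n人名医院\t人民医院\n\n门珍\t门诊\n";
        Map<String, String> mapping = parse(new BufferedReader(new StringReader(data)));
        System.out.println(mapping);
    }
}
```

Skipping malformed lines rather than failing is a design choice for this sketch; a stricter loader could report the offending line number instead.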

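2. Algorithm

When the mapping table is very large, a plain loop that calls replace once per entry cannot meet performance requirements, because every entry triggers another full scan of the text: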

// Naive approach: one full pass over the text for every dictionary entry.
String[] errors = new String[]{"人名医院",...};
String[] corrections = new String[]{"人民医院",...};

String text = "浙江省人名医院贤内科";

for (int i = 0; i < errors.length; i++) {
    text = text.replace(errors[i], corrections[i]);
}

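A better option is to borrow the approach used in word segmentation and match the erroneous words directly out of the text. This relies on a trie (prefix tree); plenty of introductory articles on tries are available for background.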
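
To illustrate the idea, here is a minimal hand-rolled sketch of trie-based matching. It is not the ansj implementation: the class TrieCorrector, its add and correct methods, and the longest-match policy are all assumptions made for this example. The wrong forms are inserted as trie keys, the text is scanned once from left to right, and at each position the longest wrong form that starts there is replaced by its correction.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: a hand-rolled trie, not the ansj/nlp-lang implementation.
final class TrieCorrector {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String correction; // non-null when a wrong form ends at this node
    }

    private final Node root = new Node();

    // Register one "wrong form -> correct form" entry.
    void add(String wrong, String right) {
        if (wrong.isEmpty()) {
            return; // an empty key would match at every position
        }
        Node node = root;
        for (int i = 0; i < wrong.length(); i++) {
            node = node.children.computeIfAbsent(wrong.charAt(i), c -> new Node());
        }
        node.correction = right;
    }

    // Scan the text once, replacing the longest dictionary match at each position.
    String correct(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            Node node = root;
            int matchEnd = -1;
            String matchFix = null;
            // Walk down the trie for as long as the text keeps matching.
            for (int j = i; j < text.length(); j++) {
                node = node.children.get(text.charAt(j));
                if (node == null) {
                    break;
                }
                if (node.correction != null) {
                    matchEnd = j + 1;          // remember the longest match so far
                    matchFix = node.correction;
                }
            }
            if (matchFix != null) {
                out.append(matchFix);          // emit the correction
                i = matchEnd;                  // continue after the matched span
            } else {
                out.append(text.charAt(i));    // no entry starts here, copy as-is
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        TrieCorrector corrector = new TrieCorrector();
        corrector.add("人名医院", "人民医院");
        corrector.add("门珍", "门诊");
        System.out.println(corrector.correct("浙江省人名医院门珍部")); // prints 浙江省人民医院门诊部
    }
}
```

With this layout the lookup cost at each position is bounded by the length of the longest dictionary entry rather than by the number of entries, which is why the approach scales to large mapping tables. Because the output is built from the original text, corrections are also never re-scanned and corrected a second time.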

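ansj already provides an easy-to-use trie implementation (the org.nlpcn.commons.lang.tire package imported below), so it can be reused directly: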

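import org.nlpcn.commons.lang.tire.GetWord;
import org.nlpcn.commons.lang.tire.domain.Forest;
import org.nlpcn.commons.lang.tire.library.Library;

import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class RuleBasedCorrection {

    // Trie built from the mapping file: the wrong form is the key, the correct form is parameter 0.
    private Forest forest;

    public RuleBasedCorrection(String correctionDic) throws Exception {
        // Assumes the mapping file is UTF-8; the reader is closed once the trie is built.
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(correctionDic), StandardCharsets.UTF_8)) {
            forest = Library.makeForest(reader);
        }
    }

    public String correct(String text) {
        // The matcher walks the original text; each hit is a wrong form found in the dictionary.
        GetWord udg = forest.getWord(text);
        String temp;
        while ((temp = udg.getFrontWords()) != null) {
            String correction = udg.getParam(0);
            if (correction != null) { // skip entries that carry no correct form
                text = text.replace(temp, correction);
            }
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        // "tmp" is the path to the mapping file.
        RuleBasedCorrection correction = new RuleBasedCorrection("tmp");
        System.out.println(correction.correct("浙江省人名医院余杭区贤内科"));
    }
}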
