Splitting an English paragraph into sentences in Java

This post walks through splitting English paragraphs into sentences in Java with the LingPipe library, including the pitfalls encountered along the way (abbreviations, punctuation) and how they were resolved. After downloading the LingPipe jar and referencing it, the whole implementation took a single afternoon.


    I've recently been working on a translation system that has to automatically split a passage uploaded by a teacher into sentences. At first it sounds trivial: just split on the period and you're done! I was far too naive. There are also exclamation marks, question marks, and quotation marks to worry about, so the next thought was "fine, a regular expression will handle it." Still too naive: English has abbreviations, and the period in an abbreviation looks exactly like a sentence boundary. After a lot of digging I found that Chinese-language articles on the topic are limited and none handled abbreviations well, and when I asked around in English-speaking communities it turned out native speakers are just as annoyed that English, unlike Chinese, has no unambiguous sentence-boundary characters. Just as I was running out of ideas, I stumbled into NLP tooling and found LingPipe. It's remarkably simple to use: from first contact to a finished implementation took one afternoon. Enough rambling; on to the actual content.

    First, LingPipe offers a jar for direct download from the official site (you may need a proxy if the site is blocked where you are). Drop the jar into your project. The site has plenty more material if you're interested, but here I'll only cover English sentence-boundary detection. With the jar on the classpath, create a utility class to test with:

import java.util.ArrayList;
import java.util.List;

import com.aliasi.sentences.IndoEuropeanSentenceModel;
import com.aliasi.sentences.SentenceModel;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;

public class SpliteTextInSentence {
    static final TokenizerFactory TOKENIZER_FACTORY = IndoEuropeanTokenizerFactory.INSTANCE;
    static final SentenceModel SENTENCE_MODEL = new IndoEuropeanSentenceModel();

    // The test strings below are typical cases that break regex-based splitters
    // (abbreviations, decimals, initialisms, quoted speech). If you have a regex
    // that handles all of them, please leave a comment; I'd love to see it.
    public static void main(String[] args) {
        SpliteTextInSentence s = new SpliteTextInSentence();
        String str1 = "Water-splashing Festival is one of the most important festivals in the world, which is popular among Dai people of China and the southeast Asia. It has been celebrated by people for more than 700 years and now this festival is a necessary way for people to promote the cooperation and communication among countries.";
        String str2 = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S and numbers like 2.2. They all got split by the above code.";
        String str3 = "My friend holds a Msc. in Computer Science.";
        String str4 = "This is a test? This is a T.L.A. test!";
        String text = "50 Cent XYZ120 DVD Player 50 Cent lawyer. Person is john, he is a lawyer.";
        String str5 = "\"I do not ask for your forgiveness,\" he said, in a tone that became more firm and forceful. \"I have no illusions, and I am convinced that death is waiting for me: it is just.\"";
        String str6 = "\"The Times have had too much influence on me.\" He laughed bitterly and said to himself, \"it is only two steps away from death. Alone with me, I am still hypocritical... Ah, the 19th century!\"";
        String str7 = "泼水节是世界上最重要节日之一,深受中国傣族和东南亚人民的喜爱。七百多年来,人们一直在庆祝这个节日,现在这个节日是促进国家间合作和交流的必要方式。";
        System.out.println(s.splitfuhao(str7));
        List<String> sl = testChunkSentences(s.splitfuhao(str7));
        if (sl.isEmpty()) {
            System.out.println("No sentences detected");
        }
        for (String row : sl) {
            System.out.println(row);
        }
    }
    // Sentence-detection method, adapted from a text-analysis example found here:
    // https://blog.csdn.net/textboy/article/details/45580009
    private static List<String> testChunkSentences(String text) {
        List<String> result = new ArrayList<String>();
        List<String> tokenList = new ArrayList<String>();
        List<String> whiteList = new ArrayList<String>();
        // Tokenize the text: tokenize() fills tokenList with the n tokens and
        // whiteList with the n+1 whitespace runs surrounding them.
        Tokenizer tokenizer = TOKENIZER_FACTORY.tokenizer(text.toCharArray(), 0, text.length());
        tokenizer.tokenize(tokenList, whiteList);
        String[] tokens = tokenList.toArray(new String[tokenList.size()]);
        String[] whites = whiteList.toArray(new String[whiteList.size()]);
        // Each boundary index is the position of a sentence-final token.
        int[] sentenceBoundaries = SENTENCE_MODEL.boundaryIndices(tokens, whites);
        int sentStartTok = 0;
        for (int i = 0; i < sentenceBoundaries.length; ++i) {
            System.out.println("Sentence " + (i + 1) + ", last token index: " + sentenceBoundaries[i]);
            StringBuilder sb = new StringBuilder();
            int sentEndTok = sentenceBoundaries[i];
            // Rebuild the sentence from its tokens and the whitespace after each token.
            for (int j = sentStartTok; j <= sentEndTok; j++) {
                sb.append(tokens[j]).append(whites[j + 1]);
            }
            sentStartTok = sentEndTok + 1;
            result.add(sb.toString());
        }
        return result;
    }
    // Replace Chinese (full-width) punctuation with its English (half-width)
    // equivalent plus a trailing space, to check whether the model can then
    // also segment Chinese text.
    public String splitfuhao(String str) {
        String[] ChineseInterpunction = { "“", "”", "‘", "’", "。", ",", ";", ":", "?", "!", "……", "—", "~", "(", ")", "《", "》" };
        String[] EnglishInterpunction = { "\"", "\"", "'", "'", ".", ",", ";", ":", "?", "!", "…", "-", "~", "(", ")", "<", ">" };
        for (int j = 0; j < ChineseInterpunction.length; j++) {
            str = str.replace(ChineseInterpunction[j], EnglishInterpunction[j] + " ");
        }
        return str;
    }
	
}
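If you'd rather not pull in an external jar at all, the JDK itself ships a locale-aware sentence iterator in `java.text.BreakIterator`. It is generally less robust around abbreviations than LingPipe's model, but it needs no dependencies; a minimal sketch (the class name is mine, not from the original post):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BreakIteratorSplitter {
    // Split text into sentences using the JDK's built-in sentence iterator.
    // Less abbreviation-aware than LingPipe, but has zero dependencies.
    public static List<String> sentences(String text) {
        List<String> result = new ArrayList<String>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
            start = end;
        }
        return result;
    }

    public static void main(String[] args) {
        for (String row : sentences("This is a test? This is another test. And a third.")) {
            System.out.println(row);
        }
    }
}
```

For plain, well-punctuated prose the two approaches give similar results; the difference shows up on inputs like str2 and str4 above.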

I won't try to explain the underlying theory, because honestly I don't know it either. The code above works as pasted. Oh, and mind your JDK version: anything below 1.5 is out, since the code relies on generics and the enhanced for loop.
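If you compile from the command line rather than an IDE, putting the jar on the classpath looks roughly like this (the jar file name is an assumption; substitute whatever version you actually downloaded):

```shell
# Assuming the downloaded jar is named lingpipe-4.1.0.jar and sits next to
# the source file (substitute your actual file name and path):
javac -cp lingpipe-4.1.0.jar SpliteTextInSentence.java
# On Windows use ';' instead of ':' as the classpath separator.
java -cp .:lingpipe-4.1.0.jar SpliteTextInSentence
```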

    Also note that the English input must be properly punctuated: without terminal punctuation, no sentence is detected at all (this tripped me up for a long time; I thought the NLP library was broken).
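Since unpunctuated input yields an empty list, it may be worth guarding against that case instead of silently returning nothing. A minimal sketch of such a guard (a hypothetical helper of my own, not part of LingPipe):

```java
import java.util.Collections;
import java.util.List;

public class SentenceFallback {
    // Hypothetical helper (not part of LingPipe): if the sentence model found
    // no boundaries, e.g. because the input has no terminal punctuation, fall
    // back to treating the whole input as a single sentence.
    public static List<String> withFallback(List<String> detected, String originalText) {
        if (detected.isEmpty() && !originalText.trim().isEmpty()) {
            return Collections.singletonList(originalText.trim());
        }
        return detected;
    }
}
```

You would wrap the call site, e.g. `withFallback(testChunkSentences(text), text)`, so downstream code never has to special-case the empty result.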

    See? Super simple! I hope this helps.

    
