I've recently been working on a translation system that has to split a teacher-uploaded text into sentences automatically. Sounds simple, right? Just split on the period and you're done! ...I was far too naive: there are also exclamation marks, question marks, and quotation marks to deal with~! Fine, grab a regular expression and it's solved ^_^! Still too naive!!! Then I discovered English has abbreviations!!! How on earth do you handle those? After a lot of digging I found that the Chinese-language articles on this topic are limited and none of them handle abbreviations well, so I went off to ask people abroad. It turns out native English speakers are just as annoyed that their own language is so fussy: why can't it have clear sentence boundaries like Chinese? Just as I was at a dead end, help dropped out of the sky ヽ(=^・ω・^=)丿: I stumbled onto NLP and found LingPipe. It is remarkably simple to use; one afternoon took me from first contact to a working implementation. Enough rambling, on to the main content!
First, LingPipe offers a jar for direct download from its official site (you may need a proxy to reach it). Download it and add it to your project. If you're interested, read up on LingPipe itself; my English is too weak to introduce it properly, so I'll focus only on English sentence boundary detection. With the jar on the classpath, create a utility class to test it:
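As an aside: the post adds the jar by hand, but if your build happens to use Maven, LingPipe has historically been published under the `com.aliasi:lingpipe` coordinates. Treat this snippet as an assumption and verify the group/artifact/version against Maven Central or the official site before relying on it:

```xml
<!-- Assumed coordinates; verify against Maven Central before use -->
<dependency>
    <groupId>com.aliasi</groupId>
    <artifactId>lingpipe</artifactId>
    <version>4.1.0</version>
</dependency>
```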
import java.util.ArrayList;
import java.util.List;
import com.aliasi.sentences.IndoEuropeanSentenceModel;
import com.aliasi.sentences.SentenceModel;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;
public class SpliteTextInSentence {
    static final TokenizerFactory TOKENIZER_FACTORY = IndoEuropeanTokenizerFactory.INSTANCE;
    static final SentenceModel SENTENCE_MODEL = new IndoEuropeanSentenceModel();
    // The strings below are typical cases that trip up regex-based splitters.
    // If your regex handles all of them correctly, you're amazing -- please leave a comment, I want it too.
    public static void main(String[] args) {
        SpliteTextInSentence s = new SpliteTextInSentence();
        String str1 = "Water-splashing Festival is one of the most important festivals in the world, which is popular among Dai people of China and the southeast Asia. It has been celebrated by people for more than 700 years and now this festival is an necessary way for people to promote the cooperation and communication among countries.";
        String str2 = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S and numbers like 2.2. They all got split by the above code.";
        String str3 = "My friend holds a Msc. in Computer Science.";
        String str4 = "This is a test? This is a T.L.A. test!";
        String text = "50 Cent XYZ120 DVD Player 50 Cent lawyer. Person is john, he is a lawyer.";
        String str5 = "\"I do not ask for your forgiveness,\" he said, in a tone that became more firm and forceful. \"I have no illusions, and I am convinced that death is waiting for me: it is just.\"";
        String str6 = "\"The Times have had too much influence on me.\" He laughed bitterly and said to himself, \"it is only two steps away from death. Alone with me, I am still hypocritical... Ah, the 19th century!\"";
        String str7 = "泼水节是世界上最重要节日之一,深受中国傣族和东南亚人民的喜爱。七百多年来,人们一直在庆祝这个节日,现在这个节日是促进国家间合作和交流的必要方式。";
        System.out.println(s.splitfuhao(str7));
        List<String> sl = testChunkSentences(s.splitfuhao(str7));
        if (sl.isEmpty()) {
            System.out.println("No sentences recognized");
        }
        for (String row : sl) {
            System.out.println(row);
        }
    }
    // The sentence-detection method below was adapted, after much searching, from a text-analysis article:
    // https://blog.csdn.net/textboy/article/details/45580009
    private static List<String> testChunkSentences(String text) {
        List<String> result = new ArrayList<String>();
        List<String> tokenList = new ArrayList<String>();
        List<String> whiteList = new ArrayList<String>();
        // tokenize() fills tokenList with the tokens and whiteList with the
        // whitespace around them (whiteList holds one more entry than tokenList).
        Tokenizer tokenizer = TOKENIZER_FACTORY.tokenizer(text.toCharArray(), 0, text.length());
        tokenizer.tokenize(tokenList, whiteList);
        String[] tokens = tokenList.toArray(new String[tokenList.size()]);
        String[] whites = whiteList.toArray(new String[whiteList.size()]);
        // boundaryIndices returns, for each detected sentence, the index of its final token.
        int[] sentenceBoundaries = SENTENCE_MODEL.boundaryIndices(tokens, whites);
        int sentStartTok = 0;
        int sentEndTok = 0;
        for (int i = 0; i < sentenceBoundaries.length; ++i) {
            System.out.println("Sentence " + (i + 1) + ", index of its last token (from 0): " + sentenceBoundaries[i]);
            StringBuilder sb = new StringBuilder();
            sentEndTok = sentenceBoundaries[i];
            for (int j = sentStartTok; j <= sentEndTok; j++) {
                // whites[j + 1] is the whitespace that follows tokens[j]
                sb.append(tokens[j]).append(whites[j + 1]);
            }
            sentStartTok = sentEndTok + 1;
            result.add(sb.toString());
        }
        return result;
    }
    // Replace each Chinese punctuation mark with its English equivalent (plus a
    // trailing space), to test whether Chinese sentence boundaries get recognized.
    public String splitfuhao(String str) {
        String[] ChineseInterpunction = { "“", "”", "‘", "’", "。", ",", ";", ":", "?", "!", "……", "—", "~", "(", ")", "《", "》" };
        String[] EnglishInterpunction = { "\"", "\"", "'", "'", ".", ",", ";", ":", "?", "!", "…", "-", "~", "(", ")", "<", ">" };
        for (int j = 0; j < ChineseInterpunction.length; j++) {
            str = str.replace(ChineseInterpunction[j], EnglishInterpunction[j] + " ");
        }
        return str;
    }
}
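For comparison, the JDK itself ships a locale-aware sentence splitter, `java.text.BreakIterator`, which needs no external jar. Its sentence rules follow Unicode UAX #29, so a period followed by a lowercase word (as in "T.L.A. test" or "Msc. in") does not trigger a break. It is less sophisticated than LingPipe's model, but worth knowing about. A minimal sketch (the class name `BreakIteratorDemo` is my own):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BreakIteratorDemo {

    // Split text into sentences using the JDK's built-in sentence iterator.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        // Walk successive boundaries; each [start, end) span is one sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            sentences.add(text.substring(start, end).trim());
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String sentence : split("This is a test? This is a T.L.A. test!")) {
            System.out.println(sentence);
        }
    }
}
```

Note that, unlike LingPipe's `IndoEuropeanSentenceModel`, `BreakIterator` has no trained abbreviation list, so it leans entirely on the lowercase-after-period heuristic; for messy real-world text the LingPipe approach above is the safer bet.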
I won't explain the underlying theory, because honestly I don't know it either... The code above can be copied and pasted as-is. Oh, and mind your JDK version: anything below 1.5, don't bother...
Also note that the English text must be properly punctuated: without terminal punctuation, no sentence is detected at all (this tripped me up for ages; I thought the whole NLP thing was a scam).
See? Super simple! I hope this helps.