At the moment, the better-known Chinese word segmenters include IKAnalyzer and Paoding, both of which are open source. Below is sample IKAnalyzer code.
package com.datamine.WordSegmenter;

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class IKAnalyzerTest {

    public static String str = "基于java语言开发的轻量级的中文分词工具包";

    public static void main(String[] args) throws Exception {
        // Via the Lucene Analyzer interface
        // true = smart segmentation, false = fine-grained segmentation
        Analyzer analyzer = new IKAnalyzer(true);
        StringReader reader = new StringReader(str);
        TokenStream ts = analyzer.tokenStream("", reader);
        CharTermAttribute term = ts.getAttribute(CharTermAttribute.class);
        ts.reset(); // required before incrementToken() in newer Lucene versions
        while (ts.incrementToken()) {
            System.out.print(term.toString() + "|");
        }
        ts.end();
        ts.close();
        reader.close();
        System.out.println();

        // Standalone use of the IK core API, without Lucene
        StringReader re = new StringReader(str);
        IKSegmenter ik = new IKSegmenter(re, false); // false = fine-grained
        Lexeme lex = null;
        while ((lex = ik.next()) != null) {
            System.out.print(lex.getLexemeText() + "|");
        }
        re.close();
    }
}
Result:
基于|java|语言|开发|的|轻量级|的|中文|分词|工具包|
基于|java|语言|开发|的|轻量级|量级|的|中文|分词|工具包|工具|包|
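The two output lines above show the difference between the modes: smart segmentation picks a single best split, while fine-grained segmentation also emits overlapping dictionary matches (轻量级 plus 量级; 工具包 plus 工具 and 包). The core idea behind the smart-style pass can be sketched as forward maximum matching. The following is a toy illustration only, using a small hypothetical dictionary, not IK's actual algorithm or dictionary:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MaxMatchSketch {

    // Tiny hypothetical dictionary, for illustration only
    static final Set<String> DICT = new HashSet<>(Arrays.asList(
            "基于", "语言", "开发", "轻量级", "量级", "中文", "分词", "工具包", "工具"));
    static final int MAX_LEN = 3; // length of the longest dictionary entry

    // Forward maximum matching: at each position, greedily take the longest
    // dictionary match; fall back to a single character when nothing matches.
    static List<String> forwardMaxMatch(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(i + MAX_LEN, text.length());
            String word = null;
            for (int j = end; j > i; j--) { // try the longest candidate first
                String cand = text.substring(i, j);
                if (DICT.contains(cand)) {
                    word = cand;
                    break;
                }
            }
            if (word == null) {
                word = text.substring(i, i + 1); // unknown character stands alone
            }
            tokens.add(word);
            i += word.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", forwardMaxMatch("轻量级的中文分词工具包")));
        // prints 轻量级|的|中文|分词|工具包
    }
}
```

A fine-grained pass would instead collect every dictionary entry starting at each position rather than only the longest one, which is how the second output line gains 量级 and 工具.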
StandardAnalyzer is the analyzer that ships with Lucene; for Chinese text it performs unigram segmentation (splitting character by character).
package com.datamine.WordSegmenter;

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class StandardAnalyzerTest {

    public static String str = "基于java语言开发的轻量级的中文分词工具包";

    public static void main(String[] args) throws Exception {
        // Lucene's unigram (character-by-character) segmentation
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
        StringReader reader = new StringReader(str);
        TokenStream ts = analyzer.tokenStream("", reader);
        CharTermAttribute term = ts.getAttribute(CharTermAttribute.class);
        ts.reset(); // required before incrementToken() in newer Lucene versions
        while (ts.incrementToken()) {
            System.out.print(term.toString() + "|");
        }
        ts.end();
        ts.close();
        reader.close();
        System.out.println();
    }
}
Result:
基|于|java|语|言|开|发|的|轻|量|级|的|中|文|分|词|工|具|包|
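Note in the output above that each Chinese character becomes its own token while the ASCII run "java" stays together. That behavior can be mimicked with a short self-contained sketch (this is not Lucene's actual implementation, just an illustration of the tokenization rule):

```java
import java.util.ArrayList;
import java.util.List;

public class UnigramSketch {

    // Mimic StandardAnalyzer's behavior on mixed Chinese/ASCII text:
    // each CJK character becomes its own token, while runs of ASCII
    // letters and digits are kept together as a single token.
    static List<String> unigram(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder ascii = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128 && Character.isLetterOrDigit(c)) {
                ascii.append(c); // accumulate an ASCII run
            } else {
                if (ascii.length() > 0) { // flush the pending ASCII token
                    tokens.add(ascii.toString());
                    ascii.setLength(0);
                }
                if (Character.isLetter(c)) {
                    tokens.add(String.valueOf(c)); // one CJK character = one token
                }
            }
        }
        if (ascii.length() > 0) {
            tokens.add(ascii.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", unigram("基于java语言开发的轻量级的中文分词工具包")));
        // prints 基|于|java|语|言|开|发|的|轻|量|级|的|中|文|分|词|工|具|包
    }
}
```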
For Paoding, the material available online all targets Lucene 3.x, and there are already plenty of write-ups, so no standalone example is given here. Note, however, that when using Paoding you must configure the dictionary location yourself.
For reference, see: http://blog.sina.com.cn/s/blog_68b606350101rdil.html