The Paoding Analyzer for Lucene and a Performance Comparison

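First, a brief overview of Chinese analyzers. The analyzers Lucene itself provides for Chinese are StandardAnalyzer, which splits text into single characters, and CJKAnalyzer, which emits bigrams. Beyond those there are external dictionary-based analyzers; the simplest to use are MMAnalyzer (极易分词, the jeasy.analysis package) and PaodingAnalyzer (庖丁分词).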
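Single-character segmentation splits a Chinese sentence one character at a time; bigram segmentation treats every pair of adjacent characters as a term. Both methods are rarely used nowadays, and both are trivial to use.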
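The two strategies above can be sketched in a few lines. This is only an illustration of the idea on a run of Chinese characters, not the code of StandardAnalyzer or CJKAnalyzer; the class name NaiveCjkSplit and its methods are made up for this example, and it assumes every character fits in a single Java char.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: mimics the two strategies on a run of CJK characters.
// It is not the actual StandardAnalyzer or CJKAnalyzer implementation.
final class NaiveCjkSplit {

    // Single-character segmentation: every character becomes its own token.
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(text.substring(i, i + 1));
        }
        return tokens;
    }

    // Bigram segmentation: every pair of adjacent characters becomes a token.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        if (text.length() == 1) {
            tokens.add(text); // a lone character is kept as-is
        }
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "我们是中国人";
        System.out.println(unigrams(text)); // [我, 们, 是, 中, 国, 人]
        System.out.println(bigrams(text));  // [我们, 们是, 是中, 中国, 国人]
    }
}
```

Neither approach knows anything about words, which is why bigrams such as 们是 and 是中 end up in the index.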
MMAnalyzer needs essentially no configuration: put the jar on the classpath and call new MMAnalyzer().
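For intuition about what a dictionary buys you, here is a minimal sketch of forward maximum matching, a common technique behind dictionary-based segmenters. It is a generic illustration with a toy four-word dictionary; it is not claimed to be how MMAnalyzer or PaodingAnalyzer work internally, and the class name ForwardMaxMatch is invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Generic forward maximum matching over a toy dictionary (illustrative only).
final class ForwardMaxMatch {

    // At each position take the longest dictionary word that starts there;
    // if nothing matches, emit the single character and move on.
    static List<String> segment(String text, Set<String> dict, int maxWordLen) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLen);
            // Shrink the candidate until it is a dictionary word or one character long.
            while (end > pos + 1 && !dict.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical dictionary, chosen only to fit the test sentence.
        Set<String> dict = new HashSet<>(Arrays.asList("我们", "中国", "国人", "中国人"));
        System.out.println(segment("我们是中国人", dict, 3)); // [我们, 是, 中国人]
    }
}
```

Real analyzers ship their own dictionaries and apply further filtering, so their tokens will not match this sketch exactly.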
Now for Paoding in detail. First download the package from http://code.google.com/p/paoding/downloads/list
to get paoding-analysis-2.0.4-beta.zip. Unpack it, add paoding-analysis.jar to the classpath, and create the following test:

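import java.io.IOException;
import java.io.StringReader;

import jeasy.analysis.MMAnalyzer;
import net.paoding.analysis.analyzer.PaodingAnalyzer;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.junit.Test;

public class AnalyzerTest {

    String enStr = "What are you believe.txt";
    String chStr = "我们是中国人!";

    Analyzer en1 = new StandardAnalyzer();
    Analyzer en2 = new SimpleAnalyzer();
    Analyzer ch1 = new CJKAnalyzer();
    Analyzer ch2 = new MMAnalyzer();
    Analyzer ch3 = new PaodingAnalyzer();

    @Test
    public void test() throws Exception {
        // analyzer(en1, enStr);
        // analyzer(en2, enStr);
        // analyzer(en1, chStr);
        // analyzer(ch1, chStr);

        // Time one pass of each dictionary-based analyzer, in milliseconds.
        long time1 = System.currentTimeMillis();
        analyzer(ch2, chStr);
        long time2 = System.currentTimeMillis();
        System.out.println("MMAnalyzer用时: " + (time2 - time1));
        analyzer(ch3, chStr);
        long time3 = System.currentTimeMillis();
        System.out.println("PaodingAnalyzer用时: " + (time3 - time2));
    }

    // Prints the analyzer and every token it produces for the given text.
    private void analyzer(Analyzer analyzer, String text) throws IOException {
        System.out.println("分词器---> " + analyzer);
        TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));
        for (Token token = new Token(); (token = tokenStream.next(token)) != null;) {
            System.out.println(token);
        }
    }
}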

Running it fails with:

net.paoding.analysis.exception.PaodingAnalysisException: please set a system env PAODING_DIC_HOME or Config paoding.dic.home in paoding-dic-home.properties point to the dictionaries!
    at net.paoding.analysis.knife.PaodingMaker.setDicHomeProperties(PaodingMaker.java:320)
    at net.paoding.analysis.knife.PaodingMaker.getDicHome(PaodingMaker.java:261)
    at net.paoding.analysis.knife.PaodingMaker.loadProperties(PaodingMaker.java:189)
    at net.paoding.analysis.knife.PaodingMaker.loadProperties(PaodingMaker.java:228)
    at net.paoding.analysis.knife.PaodingMaker.loadProperties(PaodingMaker.java:228)
    at net.paoding.analysis.knife.PaodingMaker.getProperties(PaodingMaker.java:130)
    at net.paoding.analysis.analyzer.PaodingAnalyzer.init(PaodingAnalyzer.java:70)
    at net.paoding.analysis.analyzer.PaodingAnalyzer.<init>(PaodingAnalyzer.java:59)
    at net.paoding.analysis.analyzer.PaodingAnalyzer.<init>(PaodingAnalyzer.java:52)
    at com.itcast.lucene.AnalyzerTest.<init>(AnalyzerTest.java:27)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.junit.runners.BlockJUnit4ClassRunner.createTest(BlockJUnit4ClassRunner.java:202)
    at org.junit.runners.BlockJUnit4ClassRunner$1.runReflectiveCall(BlockJUnit4ClassRunner.java:251)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
    at org.junit.runners.BlockJUnit4ClassRunner.methodBlock(BlockJUnit4ClassRunner.java:248)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
    at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:38)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)

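Roughly, this says the dictionary location is not configured: either set the system environment variable PAODING_DIC_HOME or configure it in paoding-dic-home.properties.
So create a directory, D:\java\paoding\dic. An ASCII-only path is best here; otherwise encoding problems may keep the directory from being found. Open the jar, find paoding-dic-home.properties, uncomment paoding.dic.home, and set it to your dictionary path: paoding.dic.home=D:\\java\\paoding\\dic (note the escaped double backslashes).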
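Why the double backslashes matter can be checked with plain java.util.Properties, whose file format treats a backslash as an escape character. The sketch below assumes paoding-dic-home.properties is parsed by those standard rules; the class name DicHomeEscaping is invented for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Shows how java.util.Properties interprets backslashes in a Windows path.
final class DicHomeEscaping {

    // Parses one "key=value" line as a .properties file and returns paoding.dic.home.
    static String parse(String fileContent) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(fileContent));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("paoding.dic.home");
    }

    public static void main(String[] args) {
        // File content: paoding.dic.home=D:\\java\\paoding\\dic
        System.out.println(parse("paoding.dic.home=D:\\\\java\\\\paoding\\\\dic")); // D:\java\paoding\dic
        // File content: paoding.dic.home=D:\java\paoding\dic (single backslashes are silently dropped)
        System.out.println(parse("paoding.dic.home=D:\\java\\paoding\\dic"));       // D:javapaodingdic
        // File content: paoding.dic.home=D:/java/paoding/dic
        System.out.println(parse("paoding.dic.home=D:/java/paoding/dic"));          // D:/java/paoding/dic
    }
}
```

Since Java's file APIs accept forward slashes on Windows, writing the path as D:/java/paoding/dic is likely a simpler way to avoid the escaping issue altogether.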
Run it again, and a different error appears:

org.apache.jasper.JasperException: not found the dic home dirctory! D:\java\paoding\dic
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:346)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
javax.servlet.http.HttpServlet.service(HttpServlet.java:810)

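The dictionaries could not be found in D:\java\paoding\dic.
Fix: copy the contents of the dic folder from the unpacked zip into that directory. After that everything runs normally.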
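Both errors above come down to the dictionary directory being unset, missing, or empty, so it can help to check it before constructing the analyzer. The following is a small sketch of such a check; DicHomeCheck and problemWith are hypothetical names, and it only inspects the PAODING_DIC_HOME environment variable, not the properties file.

```java
import java.io.File;

// Pre-flight check for the dictionary directory (hypothetical helper).
final class DicHomeCheck {

    // Returns null when the directory looks usable, otherwise a description of the problem.
    static String problemWith(String dicHome) {
        if (dicHome == null || dicHome.isEmpty()) {
            return "dictionary home is not set";
        }
        File dir = new File(dicHome);
        if (!dir.isDirectory()) {
            return "not a directory: " + dir.getAbsolutePath();
        }
        String[] entries = dir.list();
        if (entries == null || entries.length == 0) {
            return "directory is empty: " + dir.getAbsolutePath();
        }
        return null;
    }

    public static void main(String[] args) {
        String dicHome = System.getenv("PAODING_DIC_HOME");
        String problem = problemWith(dicHome);
        System.out.println(problem == null ? "OK: " + dicHome : problem);
    }
}
```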
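Now the performance analysis. The output is shown below (分词器 means "analyzer" and 用时 means "elapsed time", in milliseconds):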

分词器---> jeasy.analysis.MMAnalyzer@1372a1a
(我们,0,2)
(中国人,3,6)
MMAnalyzer用时: 929
分词器---> net.paoding.analysis.analyzer.PaodingAnalyzer@15ff48b
(我们,0,2)
(中国,3,5)
(国人,4,6)
PaodingAnalyzer用时: 7

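MMAnalyzer took more than 900 ms, while PaodingAnalyzer took 7 ms, so on this run the gap is obvious. Keep in mind that each analyzer was timed only once, on one short sentence, so the MMAnalyzer figure may include one-time startup costs such as dictionary loading.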
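A single cold run is a rough measurement. A somewhat fairer comparison would warm each analyzer up first and average over many runs. The harness below is a minimal sketch of that idea using only the JDK; WarmTiming and averageNanos are invented names, and the workload in main is a stand-in rather than a real analyzer call.

```java
import java.util.function.Consumer;

// Minimal warm-up-then-average timing harness (illustrative sketch).
final class WarmTiming {

    // Runs the task untimed for the warm-up rounds, then returns the
    // average elapsed nanoseconds per call over the timed rounds.
    static long averageNanos(Consumer<String> task, String input, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.accept(input);
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            task.accept(input);
        }
        return (System.nanoTime() - start) / runs;
    }

    public static void main(String[] args) {
        // Stand-in workload. In the test above this would wrap the tokenizing
        // loop for ch2 or ch3 (catching IOException) instead.
        Consumer<String> standIn = text -> text.chars().count();
        System.out.println(averageNanos(standIn, "我们是中国人!", 100, 1000) + " ns per call");
    }
}
```

To compare tokenizing speed alone, the timed task should avoid System.out.println, since console output can easily dominate the measurement.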
