Using a Chinese word segmenter in PyLucene (referencing a JAR from PyLucene)

Subject: Using a Chinese word segmenter in PyLucene (referencing a JAR from PyLucene)
From: Cloud Zhang (zhon@gmail.com)
Date: Jun 1, 2008 11:14:12 pm
List: com.googlegroups.python-cn

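(I just solved this problem. I could not find anything relevant in Chinese (nor, in fact, in English...), so I am posting it here for others to find. Keywords: PyLucene, JCC, Lucene, Importing JAR)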

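In Lucene, using a Chinese word segmenter that someone else has written is simple: just add it to the CLASSPATH. In PyLucene (the JCC version), however, the only jars Python can reach are those for which JCC, a compiler of sorts, has generated Python wrappers ahead of time. (Put the other way round: a jar that has not been processed by JCC cannot be accessed directly from Python.)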

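So to use a Chinese word segmenter that ships as a jar, PyLucene has to be rebuilt. Below the divider is a reply from a helpful person at OSAF (osafoundation.org) explaining how to modify the Makefile so that the jar is packaged together with PyLucene.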

------------------------------------- Reply from the helpful person -------------------------------------

Andi Vajda:

To access your class(es) by name from Python, you must have JCC generate wrappers for it (them). This is what is done line 177 and on in PyLucene's Makefile. The easiest way for you to add your own Java classes to PyLucene is to create another jar file with your own analyzer classes and code and add it to the JCC invocation there.

For example, the Makefile snippet in question currently says:

GENERATE=$(JCC) $(foreach jar,$(JARS),--jar $(jar)) \
    --package java.lang java.lang.System \
                        java.lang.Runtime \
    --package java.util \
              java.text.SimpleDateFormat \
    --package java.io java.io.StringReader \
                      java.io.InputStreamReader \
                      java.io.FileInputStream \
    --exclude org.apache.lucene.queryParser.Token \
    --exclude org.apache.lucene.queryParser.TokenMgrError \
    --exclude org.apache.lucene.queryParser.QueryParserTokenManager \
    --exclude org.apache.lucene.queryParser.ParseException \
    --python lucene \
    --mapping org.apache.lucene.document.Document 'get:(Ljava/lang/String;)Ljava/lang/String;' \
    --mapping java.util.Properties 'getProperty:(Ljava/lang/String;)Ljava/lang/String;' \
    --sequence org.apache.lucene.search.Hits 'length:()I' 'doc:(I)Lorg/apache/lucene/document/Document;' \
    --version $(LUCENE_VER) \
    --files $(NUM_FILES)

change the first line to say:

GENERATE=$(JCC) $(foreach jar,$(JARS),--jar $(jar)) --jar myjar.jar \ ...

and rebuild PyLucene. That should be all you need to do. Your jar file is going to be installed along with lucene's in the lucene egg and it is going to be put on lucene.CLASSPATH which you use with lucene.initVM().
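
Once the rebuilt egg is installed, loading the analyzer from Python might look roughly like the sketch below. It is only an illustration: MyChineseAnalyzer is a hypothetical class name standing in for whatever analyzer lives in myjar.jar, it is assumed to have a no-argument constructor, and the exact arguments accepted by lucene.initVM() may differ between PyLucene versions.

```python
def load_analyzer(class_name="MyChineseAnalyzer"):
    """Start the JVM and return an instance of a JCC-wrapped analyzer.

    `class_name` is a hypothetical name; substitute the simple
    (unqualified) name of the analyzer class in your own jar.
    """
    import lucene  # requires the rebuilt PyLucene egg to be installed

    # lucene.CLASSPATH lists the jars bundled in the egg, including the
    # one added with --jar, so it is passed straight to initVM.
    lucene.initVM(lucene.CLASSPATH)

    # Wrapped classes are looked up by simple name on the lucene module.
    analyzer_class = getattr(lucene, class_name)
    return analyzer_class()
```

If getattr raises AttributeError here, the most likely explanation is that the jar was not part of the JCC invocation when PyLucene was built.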

Your classes can be declared in any Java package you want. Just make sure that their names don't clash with other Lucene class names that you also need to use as the class namespace is flattened in PyLucene.
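
Because the namespace is flattened, two classes in different packages collide if they share a simple name. A quick way to check a list of class names before rebuilding is sketched below. This is a standalone helper, not part of JCC or PyLucene, and the com.example names are hypothetical.

```python
from collections import defaultdict


def find_name_clashes(qualified_names):
    """Return simple class names that are claimed by more than one
    fully qualified Java class name."""
    by_simple = defaultdict(set)
    for qualified in qualified_names:
        # The simple name is whatever follows the last dot.
        simple = qualified.rsplit(".", 1)[-1]
        by_simple[simple].add(qualified)
    return {s: sorted(q) for s, q in by_simple.items() if len(q) > 1}


clashes = find_name_clashes([
    "org.apache.lucene.analysis.standard.StandardAnalyzer",
    "org.apache.lucene.analysis.Token",
    "com.example.analysis.Token",            # hypothetical custom class
    "com.example.analysis.ChineseAnalyzer",  # hypothetical custom class
])
print(clashes)
# {'Token': ['com.example.analysis.Token', 'org.apache.lucene.analysis.Token']}
```

Here the custom Token would need renaming, while ChineseAnalyzer is safe to keep.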

For more information about JCC and its command line args see JCC's README file at [1].

Andi..

[1] http://svn.osafoundation.org/pylucene/trunk/jcc/jcc/README
