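I have recently been working on a search engine project that uses the Chinese word segmentation tool Paoding ("庖丁解牛").
The setup steps are as follows: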
Download: http://code.google.com/p/paoding/downloads/list
SVN: http://paoding.googlecode.com/svn/trunk/paoding-analysis/
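1. Add paoding-analysis.jar to the project's classpath.
2. Copy the dic folder from the downloaded Paoding zip archive into the project's src folder, or put it somewhere on disk, e.g. F:/paoding-analysis/dic.
3. The awkward part of Paoding is that it has to be told where the dictionary lives.
* The usual approach is to create an environment variable PAODING_DIC_HOME and set it to the dictionary path (e.g. F:\paoding-analysis\dic).
This has to be configured again whenever the project is moved, which is tedious. Alternatively:
* Open paoding-analysis.jar with an archive tool and edit the configuration file paoding-dic-home.properties inside it.
The edited file looks like this: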
#values are "system-env" or "this";
#if value is "this" , using the paoding.dic.home as dicHome if configed!
paoding.dic.home.config-fisrt=system-env
#dictionary home (directory)
#"classpath:xxx" means dictionary home is in classpath.
#e.g "classpath:dic" means dictionaries are in "classes/dic" directory or any other classpath directory
paoding.dic.home=D:/bookSearch/dic
#seconds for dic modification detection
#paoding.dic.detector.interval=60
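The configuration above is for the case where the dic dictionary folder sits with the project. If the dictionary is stored elsewhere, change the default line
#paoding.dic.home=dic
to
paoding.dic.home=F:/paoding-analysis/dic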
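Before running the analyzer it can help to check which dictionary directory would actually be picked up. The following is a minimal sketch in plain Java, not Paoding's own code: the class DicHomeCheck and its resolve method are hypothetical names, and the lookup order (properties file first when the first-choice key is "this", otherwise the PAODING_DIC_HOME environment variable, with the property as an assumed fallback) is inferred from the comments in the properties file.

```java
import java.io.File;
import java.util.Properties;

// Hypothetical helper, not part of Paoding: mimics the lookup order
// described in the comments of paoding-dic-home.properties.
class DicHomeCheck {

    // Returns the dictionary home that would be chosen, or null if none is configured.
    static String resolve(Properties props, String envValue) {
        // The key is spelled "config-fisrt" in the properties file, so it is kept as-is.
        String first = props.getProperty("paoding.dic.home.config-fisrt", "system-env");
        String fromProps = props.getProperty("paoding.dic.home");
        if ("this".equals(first) && fromProps != null) {
            return fromProps;   // the properties file takes priority
        }
        if (envValue != null && envValue.length() > 0) {
            return envValue;    // the environment variable takes priority
        }
        return fromProps;       // assumed fallback when the variable is not set
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("paoding.dic.home.config-fisrt", "system-env");
        props.setProperty("paoding.dic.home", "F:/paoding-analysis/dic");

        String home = resolve(props, System.getenv("PAODING_DIC_HOME"));
        System.out.println("dictionary home: " + home);
        System.out.println("is a directory: " + (home != null && new File(home).isDirectory()));
    }
}
```

If the second line prints false, the path is wrong or the dic folder was not copied, and that is worth fixing before looking for problems in the indexing code.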
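Test program:
import net.paoding.analysis.analyzer.PaodingAnalyzer;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.highlight.Formatter;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.TokenGroup;
import org.apache.lucene.search.highlight.TokenSources;

public class PaodingTest {
    public static void main(String[] args) throws Exception {
        String INDEX_PATH = "E:/paoding_test_index";
        // Get the Paoding Chinese analyzer
        Analyzer analyzer = new PaodingAnalyzer();
        // Build the index (true = create a new index, overwriting any existing one)
        IndexWriter writer = new IndexWriter(INDEX_PATH, analyzer, true);
        Document doc = new Document();
        // Store term vectors with positions and offsets; the highlighter needs them
        Field field = new Field("content", "你好,世界!", Field.Store.YES,
                Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
        doc.add(field);
        writer.addDocument(doc);
        writer.close();
        System.out.println("Indexed successfully!");
        // Search
        IndexReader reader = IndexReader.open(INDEX_PATH);
        QueryParser parser = new QueryParser("content", analyzer);
        Query query = parser.parse("你好");
        Searcher searcher = new IndexSearcher(reader);
        Hits hits = searcher.search(query);
        if (hits.length() == 0) {
            // Nothing to highlight; stop here instead of calling hits.doc(0)
            System.out.println("hits.length=0");
            reader.close();
            return;
        }
        Document doc2 = hits.doc(0);
        // Highlighting
        String text = doc2.get("content");
        // Use the id of the matched document rather than a hard-coded 0
        TermPositionVector tpv = (TermPositionVector) reader.getTermFreqVector(
                hits.id(0), "content");
        TokenStream ts = TokenSources.getTokenStream(tpv);
        // Wrap every token that contributes to the score in <b>...</b>
        Formatter formatter = new Formatter() {
            public String highlightTerm(String srcText, TokenGroup g) {
                if (g.getTotalScore() <= 0) {
                    return srcText;
                }
                return "<b>" + srcText + "</b>";
            }
        };
        Highlighter highlighter = new Highlighter(formatter, new QueryScorer(
                query));
        String result = highlighter.getBestFragments(ts, text, 5, "…");
        System.out.println("result:\n\t" + result);
        reader.close();
    }
}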
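When I ran this in my own project, the following error occurred:
java.lang.VerifyError: Cannot inherit from final class
After some searching I found that the Lucene jar needs to be upgraded to version 2.2 or later. Replacing the jar solved the problem.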
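To confirm which Lucene jar the project really loads, something like the following can be run on the same classpath. It is only a sketch using standard Java reflection: JarLocator and describe are hypothetical names, and the reported version comes from the jar manifest, so it may print null if the jar does not declare one.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper: reports where a class is loaded from.
class JarLocator {

    // Returns a one-line description of the jar (or directory) that provides the class.
    static String describe(String className) {
        try {
            // initialize=false: only load the class, do not run its static initializers
            Class<?> c = Class.forName(className, false, JarLocator.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            Package pkg = c.getPackage();
            String version = (pkg == null) ? null : pkg.getImplementationVersion();
            String location = (src == null || src.getLocation() == null)
                    ? "(bootstrap or unknown)"
                    : src.getLocation().toString();
            return className + " <- " + location + ", version " + version;
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " could not be loaded: " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("org.apache.lucene.index.IndexWriter"));
        System.out.println(describe("net.paoding.analysis.analyzer.PaodingAnalyzer"));
    }
}
```

If the first line points at an older Lucene jar than expected, a stale copy is probably still on the classpath ahead of the upgraded one.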
How to add a custom dictionary:
http://hi.baidu.com/xwx520/blog/item/c288ee3eb0f5b9f0838b137f.html