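Abstract
Lucene is a subproject of the Apache Software Foundation's Jakarta project. It is an open-source full-text search toolkit: not a complete full-text search engine, but a framework for building one. It provides a complete query engine and indexing engine, plus a partial text-analysis engine (covering two Western languages, English and German). Lucene's goal is to give software developers a simple, easy-to-use toolkit with which they can add full-text search to a target system, or build a complete full-text search engine on top of it.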
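Introduction
Lucene is one of the best-known full-text search systems in the open-source community. Material about it is everywhere online, and many other full-text search systems borrow from its design to a greater or lesser degree, for example the FirTex project that recently appeared in China. Because I have always trusted Apache and follow open-source projects fairly closely, I must have known about Lucene for several years, yet it always felt distant from me. I tried to study it in depth once, but nothing came of it.
That changed recently, when I had to design a system that uses full-text search. I picked Lucene up again and decided to study it properly. I searched all over the Internet, but most of what I found explains how to use Lucene; very little explains how it works inside. Even the popular book "Lucene in Action" left me disappointed. What I want is to understand Lucene's design ideas, the data structures of its index files, the indexing and search processes, and why it manages to be so efficient.
So it seems I will have to work through it slowly on my own. I knew the system was complex, but I was still startled when I saw the source code: more than 40,000 lines. I have studied the ICTCLAS source code before, but that was only about 10,000 lines, and it already took me a long time to understand every detail. This looks like another tough campaign!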
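First Look
Although Lucene's source code is large and complex, using it is very simple: a few lines of code are enough to implement the whole process of creating an index and searching it. Let's look at two examples: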
// Search the index
package org.apache.lucene.demo2;

import java.io.File;
import java.util.Date;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class Searcher {

    /**
     * @param args command-line arguments (unused)
     */
    public static void main(String[] args) {
        try {
            File indexDir = new File("test2/index");
            String q = "mscomctl.ocx";
            if (!indexDir.exists() || !indexDir.isDirectory()) {
                throw new Exception(indexDir + " does not exist or is not a directory.");
            }
            search(indexDir, q);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void search(File indexDir, String q) throws Exception {
        // open the existing index (false = do not create a new one)
        Directory fsDir = FSDirectory.getDirectory(indexDir, false);
        IndexSearcher is = new IndexSearcher(fsDir);

        // parse the query string against the "contents" field
        QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
        Query query = parser.parse(q);

        // run the search and time it
        long start = new Date().getTime();
        Hits hits = is.search(query);
        long end = new Date().getTime();

        System.err.println("Found " + hits.length() + " document(s) (in "
                + (end - start) + " milliseconds) that matched query '" + q + "':");

        // print the stored file name of every matching document
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            System.out.println(doc.get("filename"));
        }

        // release the index files once the hits have been read
        is.close();
    }
}
// Create the index
package org.apache.lucene.demo2;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Date;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class Indexer {

    public static void main(String[] args) {
        try {
            File dataDir = new File("test2/data");
            File indexDir = new File("test2/index");
            long start = new Date().getTime();
            int numIndexed = index(indexDir, dataDir);
            long end = new Date().getTime();
            System.out.println("Indexing " + numIndexed + " files took "
                    + (end - start) + " milliseconds");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // open an index and start file directory traversal
    public static int index(File indexDir, File dataDir) throws IOException {
        if (!dataDir.exists() || !dataDir.isDirectory()) {
            throw new IOException(dataDir + " does not exist or is not a directory");
        }
        // true = create a new index, overwriting any existing one
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
        // keep the individual index files instead of packing them into one compound file
        writer.setUseCompoundFile(false);
        indexDirectory(writer, dataDir);
        int numIndexed = writer.docCount();
        writer.optimize();
        writer.close();
        return numIndexed;
    }

    // recursive method that calls itself when it finds a directory
    private static void indexDirectory(IndexWriter writer, File dir) throws IOException {
        File[] files = dir.listFiles();
        for (int i = 0; i < files.length; i++) {
            File f = files[i];
            if (f.isDirectory()) {
                indexDirectory(writer, f);
            } else if (f.getName().endsWith(".txt")) {
                indexFile(writer, f);
            }
        }
    }

    // method to actually index a file using Lucene
    private static void indexFile(IndexWriter writer, File f) throws IOException {
        if (f.isHidden() || !f.exists() || !f.canRead()) {
            return;
        }
        System.out.println("Indexing " + f.getCanonicalPath());
        Document doc = new Document();
        // file contents: tokenized and indexed, but not stored
        doc.add(new Field("contents", new FileReader(f)));
        // file path: stored so it can be shown in results, but not searchable
        doc.add(new Field("filename", f.getCanonicalPath(), Field.Store.YES, Field.Index.NO));
        // System.out.println(doc);
        writer.addDocument(doc);
    }
}
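Before digging into the source, it may help to picture what "create an index, then search it" means at the smallest possible scale. The sketch below is a toy in-memory inverted index written in plain Java. It is not Lucene's implementation and does not use the Lucene API: the class name ToyInvertedIndex, its methods, and the simplistic tokenizer are all hypothetical, and the AND-only matching is a simplification chosen for brevity. It only illustrates the general idea of mapping each term to the documents that contain it, so that a query becomes a lookup rather than a scan of every file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Toy in-memory inverted index; illustrative only, NOT how Lucene stores its index. */
public class ToyInvertedIndex {

    // term -> sorted set of names of the documents containing that term
    private final Map<String, TreeSet<String>> postings = new HashMap<>();

    /** "Analysis" step (assumed, simplistic): lowercase and split on non-alphanumerics. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** "Indexing" step: record, for every term, that it occurs in this document. */
    public void addDocument(String name, String contents) {
        for (String term : tokenize(contents)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(name);
        }
    }

    /** "Search" step: look up each query term and intersect the results (AND semantics). */
    public List<String> search(String query) {
        TreeSet<String> result = null;
        for (String term : tokenize(query)) {
            TreeSet<String> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);   // copy, so the index itself is not modified
            } else {
                result.retainAll(docs);
            }
        }
        if (result == null) {
            return new ArrayList<>();           // empty query matches nothing
        }
        return new ArrayList<>(result);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("a.txt", "Register mscomctl.ocx before launching the program");
        index.addDocument("b.txt", "Lucene is a full-text search library");
        System.out.println(index.search("mscomctl.ocx")); // prints [a.txt]
    }
}
```

The two Lucene programs above follow the same two phases, with IndexWriter playing the role of addDocument and IndexSearcher the role of search. Everything the toy leaves out (on-disk file formats, scoring, richer query syntax) is exactly what the rest of the source code has to explain.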
References