A Lucene 2.2.0 Learning Example

hongjue

Original post: https://blog.csdn.net/hongjue/article/details/1795729

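The org.apache.lucene package is a full-text indexing and search toolkit written in pure Java.

Lucene's author is a veteran full-text indexing and retrieval specialist. He first published Lucene on his personal home page and contributed it to Apache in 2001, where it became a subproject of the Apache Jakarta project. Lucene is widely used in full-text indexing and search projects, and many applications already build their search features on it, for example the search in the Eclipse help system.

Lucene builds indexes over textual data, so as long as you can convert your data to text, Lucene can index and search it. To index HTML or PDF documents, for example, you first convert them to plain text, hand the converted content to Lucene for indexing, store the resulting index on disk or in memory, and finally run the user's queries against that index. Because Lucene does not prescribe the format of the documents to be indexed, it suits almost any search application.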
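To make "index the text first, then query the index" concrete, here is a minimal sketch of an inverted index in plain Java. It illustrates the general idea only and is not how Lucene is implemented: the class name `TinyInvertedIndex` and its methods are hypothetical, tokenization is assumed to be nothing more than lowercasing and splitting on non-letter, non-digit characters, and everything is held in memory.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy class for illustration only; not part of Lucene.
final class TinyInvertedIndex {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // document id -> path of the document
    private final List<String> paths = new ArrayList<>();

    // Adds one document (already converted to plain text) and returns its id.
    public int add(String path, String text) {
        int docId = paths.size();
        paths.add(path);
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Returns the paths of all documents that contain the given single term.
    public List<String> search(String term) {
        Set<Integer> ids = postings.getOrDefault(
                term.toLowerCase(Locale.ROOT), Collections.<Integer>emptySet());
        List<String> result = new ArrayList<>();
        for (int docId : ids) {
            result.add(paths.get(docId));
        }
        return result;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("a.txt", "Lucene is a full-text search library");
        index.add("b.txt", "Eclipse help uses Lucene for search");
        index.add("c.txt", "Plain text only");
        System.out.println(index.search("lucene")); // [a.txt, b.txt]
        System.out.println(index.search("text"));   // [a.txt, c.txt]
    }
}
```

The lookup never rescans the documents: it only reads the term-to-document map built at indexing time, which is the property that makes an index worth building.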

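While studying earlier sample code, I found that Lucene 2.2.0 has changed slightly, so I am sharing my own code. Below is my example, which uses the Lucene 2.2.0 package to index text files and search them by keyword: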

1. txtFileIndex.java
Main program

/**
 * Builds an index over .txt files with Lucene 2.2.0.
 */
package luceneTest;

import java.io.*;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

/**
 * @author 红角（somesongs） 2007-9-20
 * mail:hongjuesir@gmail.com
 * msn:hongjue@live.com
 */
public class txtFileIndex {

    /**
     * @param args args[0] is the directory to index; with no arguments
     *             the program prompts for a keyword and searches instead
     */
    public static void main(String[] args) {
        // Directory where the index is stored
        File indexDir = new File("E:/JavaGo/luceneTestData/index");
        try {
            if (args.length > 0) {
                // Directory holding the data to index
                File dataDir = new File(args[0]);
                Analyzer luceneAnalyzer = new StandardAnalyzer();
                // true = create a new index, replacing any existing one
                IndexWriter writer = new IndexWriter(indexDir, luceneAnalyzer, true);
                long startTime = new Date().getTime();
                IndexDir theIndexDir = new IndexDir();
                theIndexDir.indexDocs(writer, dataDir);
                writer.optimize();
                writer.close();
                long endTime = new Date().getTime();
                // Date.getTime() returns milliseconds, so report ms, not seconds
                System.out.println("Took " + (endTime - startTime) + " ms; index stored in: " + indexDir.getCanonicalPath());
            } else {
                System.out.println("Enter the keyword to search for:");
                BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in));
                String keyWords = stdin.readLine();
                SearchIndex searchIndex = new SearchIndex();
                searchIndex.search(keyWords, indexDir);
            }
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}

2. IndexDir.java
Class that indexes the text files under a given directory

/**
 * Builds an index over the files in a directory.
 */
package luceneTest;

import java.io.*;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

/**
 * @author 红角（somesongs） 2007-9-20
 * mail:hongjuesir@gmail.com
 * msn:hongjue@live.com
 */
public class IndexDir {

    IndexDir() {}

    // Walks dataDir recursively and indexes every .txt file it finds.
    public void indexDocs(IndexWriter writer, File dataDir) {
        if (dataDir.isDirectory()) {
            File[] dataFiles = dataDir.listFiles();
            if (dataFiles != null) {
                for (int i = 0; i < dataFiles.length; i++) {
                    indexDocs(writer, dataFiles[i]); // recurse
                }
            }
        } else if (dataDir.getName().endsWith(".txt")) {
            indexFile(dataDir, writer);
        }
    }

    private void indexFile(File dataFile, IndexWriter writer) {
        try {
            System.out.println("Indexing file: " + dataFile.getCanonicalPath());
            Document doc = new Document();
            Reader txtReader = new FileReader(dataFile);
            // Path: stored, and indexed as a single untokenized term
            doc.add(new Field("path", dataFile.getCanonicalPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
            // Contents: read from the Reader, tokenized and indexed but not stored
            doc.add(new Field("contents", txtReader));
            writer.addDocument(doc);
        } catch (IOException e) {
            System.out.println(e);
        }
    }
}
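
The recursive walk in indexDocs can be tried on its own, without Lucene. The sketch below uses the same `isDirectory()` / `listFiles()` recursion but collects the `.txt` files into a list instead of indexing them. `TxtFileCollector` and `collect` are hypothetical names introduced for this illustration, and the temporary directory tree exists only to give the demo something to scan.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; it mirrors the recursion in IndexDir.indexDocs.
final class TxtFileCollector {

    // Appends every .txt file found under 'entry' (a file or a directory) to 'found'.
    static void collect(File entry, List<File> found) {
        if (entry.isDirectory()) {
            File[] children = entry.listFiles(); // null if the directory cannot be read
            if (children != null) {
                for (File child : children) {
                    collect(child, found); // recurse into subdirectories
                }
            }
        } else if (entry.getName().endsWith(".txt")) {
            found.add(entry);
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a small temporary tree: root/a.txt, root/skip.pdf, root/sub/b.txt
        File root = Files.createTempDirectory("txtdemo").toFile();
        File sub = new File(root, "sub");
        sub.mkdir();
        new File(root, "a.txt").createNewFile();
        new File(root, "skip.pdf").createNewFile();
        new File(sub, "b.txt").createNewFile();

        List<File> found = new ArrayList<>();
        collect(root, found);
        System.out.println(found.size() + " .txt files found"); // 2 .txt files found
    }
}
```

The null check on `listFiles()` matters: the method returns null rather than an empty array when a directory cannot be read, which is why the original code guards against it too.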

3. SearchIndex.java
Class that searches the index for a keyword

/**
 * Searches the index for a keyword.
 */
package luceneTest;

import java.io.File;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;

/**
 * @author 红角（somesongs） 2007-9-20
 * mail:hongjuesir@gmail.com
 * msn:hongjue@live.com
 */
public class SearchIndex {

    SearchIndex() {}

    public void search(String keyWords, File indexDir) {
        if (!indexDir.exists()) {
            System.out.println("The index directory does not exist!");
            return;
        }
        try {
            FSDirectory indexDirectory = FSDirectory.getDirectory(indexDir);
            IndexSearcher searcher = new IndexSearcher(indexDirectory);
            // StandardAnalyzer lowercased the terms at index time and TermQuery
            // does no analysis, so the keyword must be lowercased to match.
            Term term = new Term("contents", keyWords.toLowerCase());
            TermQuery termQuery = new TermQuery(term);
            Hits hits = searcher.search(termQuery);
            System.out.println(searcher.maxDoc() + " documents indexed, " + hits.length() + " hits");
            for (int i = 0; i < hits.length(); i++) {
                int docId = hits.id(i);
                String docPath = hits.doc(i).get("path");
                System.out.println(docId + ":" + docPath);
            }
            // Release the index files once all hits have been read
            searcher.close();
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}
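
The search method lowercases the keyword because a TermQuery looks up the term exactly as given, with no analysis, while StandardAnalyzer lowercases tokens when the index is built. The stdlib-only sketch below imitates that mismatch. `LowercaseTermDemo`, `indexTerms`, and `exactMatch` are hypothetical stand-ins rather than Lucene calls, and the tokenizer is only a crude approximation of what an analyzer does.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for index-time analysis and exact term lookup; not Lucene code.
final class LowercaseTermDemo {

    // Index-time step: split on non-letter, non-digit characters and lowercase each token.
    static Set<String> indexTerms(String text) {
        Set<String> terms = new HashSet<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Query-time step: an exact lookup with no analysis, like a raw term query.
    static boolean exactMatch(Set<String> terms, String queryTerm) {
        return terms.contains(queryTerm);
    }

    public static void main(String[] args) {
        Set<String> terms = indexTerms("Apache Lucene indexes text");
        // false: the stored term is "lucene"
        System.out.println(exactMatch(terms, "Lucene"));
        // true: lowercasing the query restores the match
        System.out.println(exactMatch(terms, "Lucene".toLowerCase(Locale.ROOT)));
        // false: a two-word input is never a single indexed term
        System.out.println(exactMatch(terms, "apache lucene"));
    }
}
```

The last case shows a limit of the example program: it only finds single-word keywords. Running user input through the same analyzer used for indexing, for instance via Lucene's QueryParser, is the usual way around that, though it is outside the scope of this example.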

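The biggest difference between the code above and code written for earlier versions is:

A. In "Lucene in Practice, Part 1: Getting to Know Lucene" and in idior's sample code, fields are added like this: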

document.add(Field.Text("path",dataFiles[i].getCanonicalPath()));
document.add(Field.Text("contents",txtReader));

In Lucene 2.2.0 this has changed to:
doc.add(new Field("path",dataFile.getCanonicalPath(),Field.Store.YES,Field.Index.UN_TOKENIZED));
doc.add(new Field("contents",txtReader));

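A Field must now be created as a new instance instead of through a static factory method. Note that the old Field.Text(name, String) tokenized its value, whereas Field.Index.UN_TOKENIZED indexes the path as a single term, which corresponds to the old Field.Keyword.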

Note: this is arguably my first Java program. I started learning Java because of Lucene, so parts of the code may be less than ideal; please bear with me.

References:
http://lucene.apache.org/
http://www.chedong.com/tech/lucene.html
http://www.lucene.com.cn/rm.htm
http://www.ibm.com/developerworks/cn/java/j-lo-lucene1/ (main code reference)
http://www.ibm.com/developerworks/cn/java/wa-lucene/
http://blog.csdn.net/duketang/archive/2006/01/10/575215.aspx (the original source could not be found, so this links to a repost)
http://www.cnblogs.com/idior/articles/120301.html (main code reference)
http://www.rainsts.net/article.asp?id=313 (main code reference)