Lucene Tutorial for Beginners (Part 1)

Latest recommended article published on 2022-05-07 23:10:14

Destiny_bobo

Tags: lucene, open source, web applications, full-text search, Doug Cutting

Article link: https://blog.csdn.net/Destiny_bobo/article/details/52635247

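What is Lucene?

Lucene began as a subproject of the Apache Software Foundation's Jakarta project (it is now a top-level Apache project). It is an open-source full-text search toolkit: not a complete search application, but a search engine architecture that provides a full indexing engine and query engine. In the Java world, Lucene is a mature, free, open-source tool, and for years it has been the most popular free information retrieval library for Java.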

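History of Lucene

Lucene was originally written by Doug Cutting and made available for download on SourceForge. In September 2001 it joined the Apache Software Foundation's Jakarta family of high-quality open-source Java products. The project improved noticeably with each release and attracted more users and developers. Lucene 1.4 was released in July 2004, followed in October by the 1.4.2 bug-fix release. Table 1.1 shows Lucene's release history.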

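Features and Advantages of Lucene

As an open-source project, Lucene drew a strong response from the open-source community from the moment it appeared. Programmers use it to build full-text search applications, integrate it into system software, and build web applications on it; some commercial software also uses Lucene as the core of its internal full-text search subsystem. It is a high-performance, scalable information retrieval (IR) library that adds indexing and search capabilities to your application. Lucene is a mature open-source project implemented in Java, a member of the Apache Jakarta family, and licensed under the Apache Software License [ASF, License].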

Getting Started with Lucene: Writing an Index and Reading from It

First create a Maven project, then add the following dependencies to pom.xml:

         <!-- JUnit -->
         <dependency>
             <groupId>junit</groupId>
             <artifactId>junit</artifactId>
             <version>3.8.1</version>
             <scope>test</scope>
         </dependency>

         <!-- Lucene core -->
         <dependency>
             <groupId>org.apache.lucene</groupId>
             <artifactId>lucene-core</artifactId>
             <version>5.3.1</version>
         </dependency>

         <!-- Query parser -->
         <dependency>
             <groupId>org.apache.lucene</groupId>
             <artifactId>lucene-queryparser</artifactId>
             <version>5.3.1</version>
         </dependency>

         <!-- Common analyzers -->
         <dependency>
             <groupId>org.apache.lucene</groupId>
             <artifactId>lucene-analyzers-common</artifactId>
             <version>5.3.1</version>
         </dependency>

         <!-- smartcn analyzer for Chinese word segmentation -->
         <dependency>
             <groupId>org.apache.lucene</groupId>
             <artifactId>lucene-analyzers-smartcn</artifactId>
             <version>5.3.1</version>
         </dependency>

         <!-- Highlighter -->
         <dependency>
             <groupId>org.apache.lucene</groupId>
             <artifactId>lucene-highlighter</artifactId>
             <version>5.3.1</version>
         </dependency>

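Maven now downloads the JARs, which can take a few minutes.

Once the download finishes, create a package (everyone knows how, so that step is skipped here) and write the code: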

import java.io.File;
import java.io.FileReader;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

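/**
 * Overview:
 *   1. Writes a simple index for a set of files.
 *   2. The index can then be used to look those files up.
 *   3. Each file's path is stored so that search results can report it.
 */
public class Indexer {

	// Writes the index into the specified directory
	private IndexWriter writer;

	/**
	 * Constructor: creates the IndexWriter.
	 */
	public Indexer(String indexDir) throws Exception {

		// Open the directory that will hold the index
		Directory dir = FSDirectory.open(Paths.get(indexDir));

		// Create the analyzer
		Analyzer analyzer = new StandardAnalyzer();

		// Create the IndexWriterConfig
		IndexWriterConfig con = new IndexWriterConfig(analyzer);

		// Create the IndexWriter
		writer = new IndexWriter(dir, con);
	}

	/**
	 * Closes the index writer, committing any pending changes.
	 * @throws Exception
	 */
	public void close() throws Exception {
		writer.close();
	}

	/**
	 * Indexes every file in the given directory.
	 * @throws Exception
	 */
	public int index(String dataDir) throws Exception {

		// List the files to index; listFiles() returns null
		// if dataDir is not a readable directory
		File[] files = new File(dataDir).listFiles();
		if (files == null) {
			throw new IllegalArgumentException("Not a readable directory: " + dataDir);
		}

		for (File file : files) {
			// Index regular files only; skip subdirectories
			if (file.isFile()) {
				indexFile(file);
			}
		}

		// Number of documents now in the index (one per indexed file)
		return writer.numDocs();
	}

	/**
	 * Indexes a single file.
	 * @throws Exception
	 */
	private void indexFile(File file) throws Exception {

		System.out.println("Indexing file: " + file.getCanonicalPath());

		// Each file becomes one Document in the index
		Document document = getDocument(file);

		// Write the document into the index
		writer.addDocument(document);
	}

	/**
	 * Builds the Document for a file and sets three fields on it.
	 *
	 * A Document is comparable to one row in a database table.
	 * @throws Exception
	 */
	private Document getDocument(File file) throws Exception {

		Document doc = new Document();

		// File contents: indexed for searching, but not stored
		doc.add(new TextField("contents", new FileReader(file)));

		// Field.Store.YES stores the file name in the index so it can be
		// read back from search results; with NO the value is not stored
		doc.add(new TextField("FileName", file.getName(), Field.Store.YES));

		// Store the full path in the index
		doc.add(new TextField("fullPath", file.getCanonicalPath(), Field.Store.YES));

		return doc;
	}

	// Test: write the index
	public static void main(String[] args) {

		// Directory the index is written to
		String indexDir = "E:\\luceneDemo";

		// Directory containing the files to index
		String dataDir = "E:\\luceneDemo\\data";

		Indexer indexer = null;
		int numIndex = 0;

		// Indexing start time
		long start = System.currentTimeMillis();

		try {
			// Create the indexer for the index directory
			indexer = new Indexer(indexDir);

			// Index the data directory; the result is the number of indexed files
			numIndex = indexer.index(dataDir);

		} catch (Exception e) {
			e.printStackTrace();
		} finally {
			// Close the writer so that the index is committed to disk
			try {
				if (indexer != null) {
					indexer.close();
				}
			} catch (Exception e) {
				e.printStackTrace();
			}
		}

		// Indexing end time
		long end = System.currentTimeMillis();

		// Show the result
		System.out.println("Indexed " + numIndex + " files in " + (end - start) + " ms");
	}

}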

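Running the indexer prints the path of each indexed file, followed by the number of files indexed and the time taken.

With the index written, the code that reads from it is as follows: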

import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/**
 * Reads documents back by searching an indexed field.
 * @author Admin
 */
public class ReaderByIndexerTest {

	public static void search(String indexDir, String q) throws Exception {

		// Open the directory that holds the index
		Directory dir = FSDirectory.open(Paths.get(indexDir));

		// Open a reader over the index files in that directory
		IndexReader reader = DirectoryReader.open(dir);

		// Create the index searcher
		IndexSearcher is = new IndexSearcher(reader);

		// Create the analyzer (the same kind that was used for indexing)
		Analyzer analyzer = new StandardAnalyzer();

		// Create the query parser
		/**
		 * First argument: the field to search.
		 * Second argument: the Analyzer.
		 */
		QueryParser parser = new QueryParser("contents", analyzer);

		// Parse the query string q that was passed in
		Query query = parser.parse(q);

		// Search start time
		long start = System.currentTimeMillis();

		// Run the search
		/**
		 * First argument: the Query parsed from the query string.
		 * Second argument: the maximum number of hits to return.
		 */
		TopDocs hits = is.search(query, 10);

		// Search end time
		long end = System.currentTimeMillis();

		System.out.println("Query \"" + q + "\" took " + (end - start) + " ms and matched " + hits.totalHits + " documents");

		// Iterate over hits.scoreDocs
		/**
		 * ScoreDoc: one scored hit, holding the id of the matching document.
		 * scoreDocs: the array of top hits held by TopDocs.
		 */
		for (ScoreDoc scoreDoc : hits.scoreDocs) {
			Document doc = is.doc(scoreDoc.doc);
			System.out.println(doc.get("fullPath"));
		}

		// Close the reader
		reader.close();
	}

	// Test
	public static void main(String[] args) {

		String indexDir = "E:\\luceneDemo";
		String q = "Zygmunt Saloni";

		try {
			search(indexDir, q);
		} catch (Exception e) {
			e.printStackTrace();
		}
	}

}

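Running the search prints the query, the time taken, the number of matching documents, and the full path of each match.

Summary:

That is the basic Lucene workflow: write an index for a set of documents, then run a full-text search against that index. In short, it comes down to writing and reading. Writing means building an index for the documents you want to search; reading means using that index to find what you are looking for. When you have a very large body of text, Lucene lets you search it and get results quickly.

Make sure you have mastered the core code for writing an index and for reading documents back through it.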
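To make the write-then-read idea concrete, here is a minimal sketch of a toy inverted index in plain Java. It is not how Lucene is implemented. The class `ToyInvertedIndex`, its `tokenize` method (a crude stand-in for an analyzer), and the sample paths are all made up for illustration. It only shows the principle: at write time each term is mapped to the documents that contain it, and at read time the query terms are looked up in that map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index; an illustration only, NOT Lucene's implementation. */
final class ToyInvertedIndex {

    // term -> ids of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // document id -> stored path (similar in spirit to a stored field)
    private final List<String> storedPaths = new ArrayList<>();

    /** Crude stand-in for an analyzer: lowercase, split on non-alphanumerics. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** "Write" step: record which terms occur in the document. */
    int addDocument(String fullPath, String contents) {
        int docId = storedPaths.size();
        storedPaths.add(fullPath);
        for (String term : tokenize(contents)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    /** "Read" step: paths of the documents that contain any query term. */
    List<String> search(String query) {
        Set<Integer> hits = new TreeSet<>();
        for (String term : tokenize(query)) {
            Set<Integer> ids = postings.get(term);
            if (ids != null) {
                hits.addAll(ids);
            }
        }
        List<String> paths = new ArrayList<>();
        for (int docId : hits) {
            paths.add(storedPaths.get(docId));
        }
        return paths;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("data/a.txt", "Copyright Zygmunt Saloni");
        index.addDocument("data/b.txt", "Apache License, Version 2.0");
        System.out.println(index.search("zygmunt saloni")); // [data/a.txt]
        System.out.println(index.search("license"));        // [data/b.txt]
        System.out.println(index.search("lucene"));         // []
    }
}
```

Because the lookup goes through the term map instead of scanning every file, search time depends on the query terms rather than on the total amount of text, which is the reason an index pays off for large collections.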

1. The index directory (E:\\luceneDemo in the code) must be the directory the index was actually written to. Otherwise the search fails with the following exception:

org.apache.lucene.index.IndexNotFoundException: no segments* file found in SimpleFSDirectory@D:\luceneDemo lockFactory=org.apache.lucene.store.NativeFSLockFactory@eb724: files: []
	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:726)
	at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:50)
	at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:63)
	at readerByIndexer.ReaderByIndexerTest.search(ReaderByIndexerTest.java:32)
	at readerByIndexer.ReaderByIndexerTest.main(ReaderByIndexerTest.java:87)

This means that no index files were found in the index directory. A NullPointerException can also occur in some cases.

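2. If you want to run the test again after the index has been written, first delete the index files that were generated in the index directory. If you do not, the same files are indexed a second time and searches return duplicate documents.

3. The test query and the paths must be exactly right, especially the query string; otherwise the search finds 0 documents.

4. For this example, the documents to be searched should be entirely in English, since the code uses StandardAnalyzer rather than a Chinese analyzer.

5. Check that the imports are correct.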
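Notes 1 and 2 can be checked with plain file operations before running the examples. The following is a minimal sketch under stated assumptions: `IndexDirHelper` and its two methods are hypothetical helpers, not part of Lucene. `looksLikeIndex` relies only on the fact, visible in the exception message above, that Lucene looks for a file whose name starts with `segments`. `clearIndexFiles` assumes the index directory holds nothing but index files and subdirectories (such as `data`), because it deletes every regular file directly inside it.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper, not part of Lucene: file-system checks on an index directory. */
final class IndexDirHelper {

    /** True if the directory contains at least one file named segments*. */
    static boolean looksLikeIndex(Path indexDir) throws IOException {
        if (!Files.isDirectory(indexDir)) {
            return false;
        }
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir, "segments*")) {
            return files.iterator().hasNext();
        }
    }

    /**
     * Deletes the regular files directly inside indexDir and returns how many
     * were deleted. Subdirectories (for example data/) are left untouched.
     * Assumption: the directory contains no other files worth keeping.
     */
    static int clearIndexFiles(Path indexDir) throws IOException {
        int deleted = 0;
        if (!Files.isDirectory(indexDir)) {
            return deleted;
        }
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(indexDir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    Files.delete(entry);
                    deleted++;
                }
            }
        }
        return deleted;
    }

    public static void main(String[] args) throws IOException {
        // Pass the index directory as the first argument; add "clear" to delete old index files
        Path indexDir = Paths.get(args.length > 0 ? args[0] : "E:\\luceneDemo");
        System.out.println("Index present: " + looksLikeIndex(indexDir));
        if (args.length > 1 && "clear".equals(args[1])) {
            System.out.println("Deleted files: " + clearIndexFiles(indexDir));
        }
    }
}
```

As an alternative to deleting files by hand, Lucene's `IndexWriterConfig` has a `setOpenMode` method; setting it to `IndexWriterConfig.OpenMode.CREATE` makes the writer overwrite an existing index instead of appending to it.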

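A friendly reminder:

Writing a simple index for a set of documents is something you should be able to do fluently, with a clear understanding of what each core class does. That wraps up this introduction to Lucene; the next step is CRUD operations on the indexed documents.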