Lucene Principles Explained, with Examples

Latest recommended article published on 2022-08-19 09:44:28

洛神夫

Article link: https://blog.csdn.net/luoshenfu001/article/details/16362501

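Introduction to Lucene: Lucene began as a subproject of the Apache Software Foundation's Jakarta project. It is an open-source full-text search toolkit. In other words, it is not a complete full-text search application but a framework for building one: it provides a complete query engine and indexing engine, plus a partial text-analysis engine (for two Western languages, English and German). Lucene's goal is to give developers a simple, easy-to-use toolkit for adding full-text search to a target system, or for building a complete full-text search engine on top of it. A good companion article: "Lucene 基础理论" (Lucene Basic Theory).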

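File formats Lucene can index: many applications already build their search features on Lucene, for example the search in the Eclipse help system. Lucene indexes text data, so as long as you can convert your data to text, Lucene can index and search it. To index HTML or PDF documents, for instance, you first convert them to plain text, hand the converted content to Lucene for indexing, save the resulting index files to disk or memory, and finally run the user's queries against the index. Because Lucene does not prescribe the format of the documents to be indexed, it fits almost any search application.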

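Lucene uses a mechanism called an inverted index. An inverted index maintains a table of terms (words or phrases); for each term in the table, a list records which documents contain it. When a user enters a query, the matching documents can therefore be found very quickly.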
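The idea of an inverted index can be sketched in a few lines of plain Java. This is only a toy illustration that uses the JDK and nothing from Lucene: the class name ToyInvertedIndex and its methods are made up for this example, and Lucene's real postings format is far more elaborate.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy inverted index (hypothetical class, not part of Lucene):
// maps each term to the sorted set of ids of the documents that contain it.
class ToyInvertedIndex {
    private final Map<String, TreeSet<Integer>> postings = new LinkedHashMap<>();
    private int nextDocId = 0;

    // Lower-cases the text, splits it on anything that is not a letter or digit,
    // and records the new document id under every term. Returns the document id.
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Looks up one term; an unknown term yields an empty list.
    List<Integer> search(String term) {
        TreeSet<Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? new ArrayList<>() : new ArrayList<>(docs);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("Lucene is a search library");                // doc 0
        index.addDocument("An inverted index maps terms to documents"); // doc 1
        index.addDocument("Lucene builds an inverted index");           // doc 2
        System.out.println(index.search("lucene"));   // [0, 2]
        System.out.println(index.search("inverted")); // [1, 2]
        System.out.println(index.search("missing"));  // []
    }
}
```

A query costs a single map lookup, however many documents have been added, which is why this structure answers searches quickly.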

Core classes in Lucene:

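Document: describes a document, which here can be an HTML page, an email, or a text file. A Document object is made up of multiple Field objects. You can think of a Document as a database record and each Field as one column of that record. After indexing, a record is stored in the index as a Document, and search results are likewise returned as a list of Documents.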
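Field: a Document can contain several pieces of information. An article, for example, can have a "title", a "body", and a "last modified time"; each of these is stored in the Document as a Field. A Field has two optional attributes, store and index: the store attribute controls whether the Field's value is stored, and the index attribute controls whether the Field is indexed. Taking the article example again: we want full-text search over the title and the body, so we set the index attribute to true for both. We also want to read the title straight from the search results, so we set the title's store attribute to true. The body is too large, so to keep the index small we set its store attribute to false and read the file directly when needed. We only want to read the last modified time from the search results and never search on it, so we set its store attribute to true and its index attribute to false. These three fields cover three of the four combinations of the two attributes. The fourth, with both set to false, is not used; in fact Field does not allow it, because a field that is neither stored nor indexed is meaningless. Two constructors deserve a note:
// Uses a Reader instead of a String to represent the value. In this case the value cannot be stored (hardwired to Store.NO) and is always analyzed and indexed (Index.ANALYZED).
Field(String name, Reader reader)
// Never indexed, no term vectors, must be Store.YES.
Field(String name, byte[] value, Store store)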
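Analyzer: before a document is indexed, its content must first be tokenized, and that is the Analyzer's job. Analyzer is an abstract class with several implementations; choose one that suits the language and the application. The Analyzer hands the tokenized content to IndexWriter to build the index. It splits a string into terms according to some rule and removes stop words, such as "of" and "the" in English or "的" and "地" in Chinese. These words occur in great numbers but carry little key information; removing them shrinks the index, improves efficiency, and improves hit quality. Tokenization rules vary widely, but the goal is always the same: split by meaning. This is fairly easy in English, where words are already separated by spaces; Chinese text, by contrast, must be segmented into words by some method, because sentences are written without separators.
Directory: represents the location where a Lucene index is stored. It is an abstract class with two main implementations: FSDirectory, for an index stored in the file system, and RAMDirectory, for an index stored in memory.
IndexWriter and IndexReader: IndexWriter creates an index and adds documents to it. IndexReader reads an index and, in the Lucene version used here, can also delete documents from it.
IndexSearcher and Hits: IndexSearcher defines the methods for searching a given index, and Hits holds the search results. (Hits was removed in Lucene 3.0; the examples below use TopDocs and ScoreDoc instead.)
Term, Token, and Segment: a term is the basic, smallest unit of search. It represents one word of a document and has two parts: the word itself and the field in which the word occurs. A Term object can be created with a single statement: Term term = new Term("fieldName", "queryWord"); The first argument names the Field to search in, and the second is the keyword to search for. A token is one occurrence of a term; it holds the term text, the start and end offsets, and a type string. The same word can occur several times in a sentence: all occurrences share one term, but each has its own token marking where that occurrence appears. When documents are added, they are not all written to a single index file immediately. They are first written to separate small files, which are later merged into one large index file; each small file is a segment. After adding all documents we optimize the index, which mainly merges the segments into one and helps search speed: indexWriter.optimize().
Query: an abstract class with several implementations, such as TermQuery, BooleanQuery, and PrefixQuery. Its purpose is to wrap the user's query string in a form Lucene understands. TermQuery is a subclass of Query and the most basic query type Lucene supports. A TermQuery is created as follows: TermQuery termQuery = new TermQuery(new Term("fieldName", "queryWord")); Its constructor takes a single argument, a Term object.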
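To make the Analyzer and Token descriptions above concrete, here is a minimal sketch in plain Java. It does not use Lucene: the class ToyAnalyzer, its nested Token class, and the four-word stop list are assumptions made for this illustration, and real analyzers are considerably more careful.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analyzer (hypothetical, not Lucene's Analyzer): splits text into
// lower-cased tokens with character offsets and drops stop words.
class ToyAnalyzer {
    // One occurrence of a term: the term text plus its start (inclusive)
    // and end (exclusive) character offsets in the original text.
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return term + "[" + start + "," + end + ")";
        }
    }

    // Illustrative stop list only; a real one would be much longer.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("of", "the", "a", "is"));

    static List<Token> analyze(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++; // skip separators
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            String term = text.substring(start, i).toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(term)) {
                tokens.add(new Token(term, start, i));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The term "index" occurs twice, so it yields two tokens with different offsets.
        System.out.println(analyze("The index of the index")); // [index[4,9), index[17,22)]
    }
}
```

Both tokens carry the same term text, which matches the description above: one term, several tokens, each marking a different position.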

With these core classes understood, it is easier to see how Lucene works.

How Lucene works (1): building the index:

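import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Properties;

import org.apache.log4j.Logger;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;

// Written against the Lucene 3.0 API. VEachFile and PropertiesFile are the
// author's own classes and are not shown in this article.
public class IndexComingFile {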

	private IndexWriter indexWriter;
	private final Logger logger = Logger.getLogger(IndexComingFile.class);

	public IndexComingFile() {
		try {
			Properties prop = new PropertiesFile().getPropertiesFile();
			// Directory that holds the index, e.g. "E:/workspace/index"
			String indexDirPath = prop.getProperty("indexDir");

			File indexDirFile = new File(indexDirPath);
			if (!indexDirFile.exists()) {
				indexDirFile.mkdirs();
			}

			Directory dir = new SimpleFSDirectory(indexDirFile);
			indexWriter = new IndexWriter(dir, new StandardAnalyzer(
					Version.LUCENE_30), true,
					IndexWriter.MaxFieldLength.LIMITED);
			indexWriter.setUseCompoundFile(false);
		} catch (Exception e) {
			logger.error("failed to open the index writer", e);
		}
	}

	public void begin(VEachFile ef) throws IOException {

		File f = new File(ef.getFilePath());

		if (f.isHidden() || !f.exists() || !f.canRead()) {
			return;
		}

		logger.debug("Indexing a file:" + ef.getFileName());

		Document doc = new Document();
		Reader txtReader = new FileReader(f);

		doc.add(new Field(VEachFile.FilePath_T, ef.getFilePath(),
				Field.Store.YES, Field.Index.NO));
		doc.add(new Field(VEachFile.Contents_T, txtReader));
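		// A field must be stored or indexed (Lucene rejects Store.NO with Index.NO);
		// the file name is stored so that search results can return it.
		doc.add(new Field(VEachFile.FileName_T, ef.getFileName(),
				Field.Store.YES, Field.Index.ANALYZED));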
		doc.add(new Field(VEachFile.FileSize_T,
				String.valueOf(ef.getFileSize()), Field.Store.YES,
				Field.Index.NO));
		doc.add(new Field(VEachFile.FileDate_T,
				String.valueOf(ef.getFileDate()), Field.Store.YES,
				Field.Index.NO));
		doc.add(new Field(VEachFile.FileAuthor_T, String.valueOf(ef
				.getFileAuthor()), Field.Store.YES, Field.Index.NO));
		doc.add(new Field(VEachFile.FileDirId_T, String.valueOf(ef.getDirId()),
				Field.Store.YES, Field.Index.NO));
		doc.add(new Field(VEachFile.FileUpdator_T, String.valueOf(ef
				.getUpdator()), Field.Store.YES, Field.Index.NO));

		indexWriter.addDocument(doc);

	}

	public void closeIndexer() {

		try {
			indexWriter.optimize();
			indexWriter.close();
		} catch (CorruptIndexException e) {
			logger.error("error while closing the index", e);
		} catch (IOException e) {
			logger.error("error while closing the index", e);
		}

	}
}

How Lucene works (2): searching for a keyword:

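import java.io.File;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Properties;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Written against the Lucene 3.0 API. VEachFile and PropertiesFile are the
// author's own classes and are not shown in this article.
public class Searcher {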

	private static int TOP_NUM = 10;
	private static String indexDirPath;
	static {
		Properties prop = new PropertiesFile().getPropertiesFile();
		// Directory that holds the index, e.g. "E:/workspace/index"
		indexDirPath = prop.getProperty("indexDir");
	}

	public static List<VEachFile> begin(String key) throws Exception {

		File indexDir = new File(indexDirPath);
		if (!indexDir.exists()) {
			return null;
		}

		Directory path = FSDirectory.open(indexDir);
		IndexSearcher is = new IndexSearcher(path);

		QueryParser parser = new QueryParser(Version.LUCENE_30,
				VEachFile.Contents_T, new StandardAnalyzer(Version.LUCENE_30));
		Query query = parser.parse(key);

		TopScoreDocCollector collector = TopScoreDocCollector.create(TOP_NUM,
				true);
		is.search(query, collector);
		ScoreDoc[] hits = collector.topDocs().scoreDocs;
		// System.out.println(hits.length);
		ArrayList<VEachFile> docs = new ArrayList<VEachFile>();

		for (int i = 0; i < hits.length; i++) {
			Document doc = is.doc(hits[i].doc);
			// Document.get(name) returns the stored value of a field;
			// Field.toString() would return a debug description instead.
			VEachFile ef = new VEachFile(
					Integer.parseInt(doc.get(VEachFile.FileDirId_T)),
					doc.get(VEachFile.FileName_T),
					Long.valueOf(doc.get(VEachFile.FileSize_T)),
					new Date(doc.get(VEachFile.FileDate_T)),
					doc.get(VEachFile.FileAuthor_T),
					doc.get(VEachFile.FilePath_T), 0,
					doc.get(VEachFile.FileUpdator_T), null);
			docs.add(ef);
		}

		is.close(); // release the index files held by the searcher
		return docs;
	}

}

Chinese word segmentation tools

http://www.oschina.net/project/tag/264/segment?sort=view&lang=19&os=0

An example from the official documentation:

    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

    // Store the index in memory:
    Directory directory = new RAMDirectory();
    // To store an index on disk, use this instead:
    //Directory directory = FSDirectory.open(new File("/tmp/testindex"));
    IndexWriter iwriter = new IndexWriter(directory, analyzer, true,new IndexWriter.MaxFieldLength(25000));
    Document doc = new Document();
    String text = "This is the text to be indexed.";
    doc.add(new Field("fieldname", text, Field.Store.YES,Field.Index.ANALYZED));
    iwriter.addDocument(doc);
    iwriter.close();
    
    // Now search the index:
    IndexReader ireader = IndexReader.open(directory); // read-only=true
    IndexSearcher isearcher = new IndexSearcher(ireader);
    // Parse a simple query that searches for "text":
    QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "fieldname", analyzer);
    Query query = parser.parse("text");
    ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
    assertEquals(1, hits.length);
    // Iterate through the results:
    for (int i = 0; i < hits.length; i++) {
      Document hitDoc = isearcher.doc(hits[i].doc);
      assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
    }
    isearcher.close();
    ireader.close();
    directory.close();

Converting Word, PDF, and Excel files to text: http://javahy.iteye.com/blog/1050402

The Lucene API is divided into several packages:

org.apache.lucene.analysis defines an abstract Analyzer API for converting text from a java.io.Reader into a TokenStream, an enumeration of token Attributes. A TokenStream can be composed by applying TokenFilters to the output of a Tokenizer. Tokenizers and TokenFilters are strung together and applied with an Analyzer. A handful of Analyzer implementations are provided, including StopAnalyzer and the grammar-based StandardAnalyzer.
org.apache.lucene.document provides a simple Document class. A Document is simply a set of named Fields, whose values may be strings or instances of java.io.Reader.
org.apache.lucene.index provides two primary classes: IndexWriter, which creates and adds documents to indices; and IndexReader, which accesses the data in the index.
org.apache.lucene.search provides data structures to represent queries (e.g. TermQuery for individual words, PhraseQuery for phrases, and BooleanQuery for boolean combinations of queries) and the abstract Searcher which turns queries into TopDocs. IndexSearcher implements search over a single IndexReader.
org.apache.lucene.queryParser uses JavaCC to implement a QueryParser.
org.apache.lucene.store defines an abstract class for storing persistent data, the Directory, which is a collection of named files written by an IndexOutput and read by an IndexInput. Multiple implementations are provided, including FSDirectory, which uses a file system directory to store files, and RAMDirectory which implements files as memory-resident data structures.
org.apache.lucene.util contains a few handy data structures and util classes, e.g. BitVector and PriorityQueue.

To use Lucene, an application should:

Create Documents by adding Fields;
Create an IndexWriter and add documents to it with addDocument();
Call QueryParser.parse() to build a query from a string; and
Create an IndexSearcher and pass the query to its search() method.

Some simple examples of code which does this are:

IndexFiles.java creates an index for all the files contained in a directory.
SearchFiles.java prompts for queries and searches an index.

To demonstrate these, try something like:

> java -cp lucene.jar:lucene-demo.jar:lucene-analyzers-common.jar org.apache.lucene.demo.IndexFiles rec.food.recipes/soups
adding rec.food.recipes/soups/abalone-chowder
  [ ... ]
> java -cp lucene.jar:lucene-demo.jar:lucene-analyzers-common.jar org.apache.lucene.demo.SearchFiles
Query: chowder
Searching for: chowder
34 total matching documents
1. rec.food.recipes/soups/spam-chowder
  [ ... thirty-four documents contain the word "chowder" ... ]

Query: "clam chowder" AND Manhattan
Searching for: +"clam chowder" +manhattan
2 total matching documents
1. rec.food.recipes/soups/clam-chowder
  [ ... two documents contain the phrase "clam chowder" and the word "manhattan" ... ]
    [ Note: "+" and "-" are canonical, but "AND", "OR" and "NOT" may be used. ]
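The "+" and "-" prefixes in the transcript above mark required and prohibited terms. The sketch below imitates that matching rule in plain Java for single words only. It is a toy and not Lucene's QueryParser: the class ToyBooleanQuery is made up for this example, and quoted phrases and the AND, OR, and NOT keywords are not handled.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Toy matcher (hypothetical, not Lucene's BooleanQuery) for queries made of
// "+term" (required), "-term" (prohibited), and bare "term" (optional) clauses.
class ToyBooleanQuery {

    // Lower-cases the text and collects its words into a set.
    private static Set<String> terms(String text) {
        Set<String> result = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    // A document matches when it contains every required term, contains no
    // prohibited term, and, if the query has no required terms, contains at
    // least one optional term. A query of only prohibited terms matches nothing.
    static boolean matches(String query, String document) {
        Set<String> docTerms = terms(document);
        boolean hasRequired = false;
        boolean optionalHit = false;
        for (String clause : query.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (clause.isEmpty()) {
                continue;
            }
            if (clause.startsWith("+")) {
                hasRequired = true;
                if (!docTerms.contains(clause.substring(1))) {
                    return false;
                }
            } else if (clause.startsWith("-")) {
                if (docTerms.contains(clause.substring(1))) {
                    return false;
                }
            } else if (docTerms.contains(clause)) {
                optionalHit = true;
            }
        }
        return hasRequired || optionalHit;
    }

    public static void main(String[] args) {
        System.out.println(matches("+clam +chowder +manhattan", "Manhattan clam chowder"));   // true
        System.out.println(matches("+clam +chowder +manhattan", "New England clam chowder")); // false
        System.out.println(matches("+chowder -clam", "Spam chowder"));                        // true
        System.out.println(matches("-clam", "Spam chowder"));                                 // false
    }
}
```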

Deleting from the index with IndexReader

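To delete one document or a class of documents from the index, IndexReader offers two methods. (The capitalized method names in this section follow Lucene.Net; the Java equivalents are deleteDocument and deleteDocuments.)
reader.DeleteDocument(int docNum): deletes a single Document by its document number
reader.DeleteDocuments(Term term): deletes one or more Documents that match a Term

The first deletes a document by its number. docNum is the number Lucene assigned when the document entered the index, in sequential order; it can be obtained, for example, from a search hit (the doc field of a ScoreDoc, as in the search examples above). The second deletes every document that satisfies a condition.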

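After DeleteDocument or DeleteDocuments is called, Lucene generates a *.del file that records the deleted documents, but the documents are not yet physically removed. They are now protected: accessing one with Document doc = reader.Document(i) makes Lucene throw an "Attempt to access a deleted document" exception. When several documents need to be deleted at once, there are two ways to handle this:

1. After deleting one document, call IndexWriter's Optimize method to optimize the index; you can then go on to delete the next document.

2. Scan the whole index first and record the numbers of the documents to delete. Then call DeleteDocument for all of them in one pass, and finally call IndexWriter's Optimize method to optimize the index.

When documents are deleted through IndexReader, they are not removed immediately. The delete operations are buffered and only actually carried out when IndexReader.Close() is called.

Note: before deleting with IndexReader, every open IndexWriter must be closed. When the same IndexReader is used for searching, deleted Documents no longer appear in the results, even if the IndexReader has not been closed.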
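The "mark as deleted now, physically remove later" behaviour described above can be illustrated with a small plain-Java model. This is a sketch under simplifying assumptions, not Lucene code: ToySegment is a made-up class, a BitSet stands in for the *.del file, and the method names maxDoc, numDocs, and isDeleted merely echo those of IndexReader.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy model (hypothetical, not Lucene) of deferred deletion in one segment.
class ToySegment {
    private final List<String> docs = new ArrayList<>();
    private final BitSet deleted = new BitSet(); // plays the role of the *.del file

    // Appends a document and returns its document number.
    int add(String doc) {
        docs.add(doc);
        return docs.size() - 1;
    }

    // Only marks the document; its data stays in the segment.
    void delete(int docNum) {
        deleted.set(docNum);
    }

    boolean isDeleted(int docNum) {
        return deleted.get(docNum);
    }

    // Refuses to return a document that has been marked as deleted.
    String document(int docNum) {
        if (deleted.get(docNum)) {
            throw new IllegalArgumentException("attempt to access a deleted document");
        }
        return docs.get(docNum);
    }

    // Number of document slots, including deleted ones.
    int maxDoc() {
        return docs.size();
    }

    // Number of live documents.
    int numDocs() {
        return docs.size() - deleted.cardinality();
    }

    // Rewrites the segment without the deleted documents, as an optimize
    // (merge) would. Surviving documents are renumbered from 0.
    ToySegment optimize() {
        ToySegment merged = new ToySegment();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i)) {
                merged.add(docs.get(i));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        ToySegment segment = new ToySegment();
        segment.add("a");
        segment.add("b");
        segment.add("c");
        segment.delete(1);
        System.out.println(segment.numDocs() + " of " + segment.maxDoc()); // 2 of 3
        ToySegment optimized = segment.optimize();
        System.out.println(optimized.maxDoc());    // 2
        System.out.println(optimized.document(1)); // c
    }
}
```

Note how the document numbers shift after optimize: "c" moves from number 2 to number 1. This renumbering is why numbers collected before an optimize should not be reused after it.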

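Deleting from the index with IndexWriter

IndexWriter.DeleteDocuments(Query query): deletes one or more Documents that match a Query
IndexWriter.DeleteDocuments(Query[] queries): deletes one or more Documents that match any of the Queries
IndexWriter.DeleteDocuments(Term term): deletes one or more Documents that match a Term
IndexWriter.DeleteDocuments(Term[] terms): deletes one or more Documents that match any of the Terms
IndexWriter.DeleteAll(): deletes all Documents

When documents are deleted through IndexWriter, they are not removed immediately. The delete operations are buffered and only actually carried out when IndexWriter.Commit() or IndexWriter.Close() is called.

The difference between deleting with IndexWriter and with IndexReader is that IndexReader.DeleteDocuments() can return the number of documents deleted.

Sample code for deleting from the index with IndexWriter (C#, Lucene.Net):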

Term term = new Term("title", "update");
MultiFieldQueryParser parser = new MultiFieldQueryParser(version, new string[] { "title", "content" }, analyzer);
Query query = parser.Parse("update");

IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29), false, IndexWriter.MaxFieldLength.LIMITED);

writer.DeleteDocuments(term); // or: writer.DeleteDocuments(query);
writer.Commit();

// HasDeletions() reports whether the index contains deletions, not how many.
txtMessage.Text += "Has deletions: " + writer.HasDeletions().ToString() + Environment.NewLine;
writer.Close();