A First Look at Lucene

weixin_34019929

Original article: https://my.oschina.net/propagator/blog/956631

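I recently had to study the core of the Lucene search engine for work and got a first understanding of how it operates. Lucene works in two stages: it first builds an index over the content to be searched, and then runs searches against that index.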
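The two stages can be illustrated with a toy inverted index in plain Java. This is only a sketch of the general idea, not Lucene's actual data structures: the class name ToyInvertedIndex and its methods are made up for illustration, and tokenization is simply assumed to be lowercase plus splitting on whitespace.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy class, not part of Lucene.
public class ToyInvertedIndex {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private final List<String> documents = new ArrayList<>();

    // Stage 1: index a document and return the id assigned to it.
    public int add(String text) {
        int docId = documents.size();
        documents.add(text);
        // Assumption: lowercase + whitespace split stands in for a real analyzer.
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Stage 2: look the term up in the index instead of scanning every document.
    public Set<Integer> search(String term) {
        TreeSet<Integer> ids = postings.get(term.toLowerCase());
        return ids == null
                ? Collections.<Integer>emptySet()
                : Collections.unmodifiableSet(ids);
    }

    // Resolve a document id back to the original text.
    public String get(int docId) {
        return documents.get(docId);
    }
}
```

The point of the split is that the cost of search depends on the number of matching documents rather than on the total amount of text, because the expensive work was done once at indexing time.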

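Lucene can be downloaded from the following mirror:

http://mirror.bit.edu.cn/apache/lucene/

Downloading Lucene-x.x.x.zip is enough. Unpacking it yields precompiled jar files; reference the ones you need from your own project.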

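Lucene's official documentation is on its website:

https://lucene.apache.org/core/

It offers few examples or explanations, though: mostly API documentation plus two sample source files, one for indexing and one for searching.

To get a quick, simple feel for Lucene I wrote an example of my own. The code comes first: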

import java.nio.file.Paths;
import java.nio.file.Path;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class SimpleDemo {
	public static void main(String[] args) throws Exception {
	    Analyzer analyzer = new SmartChineseAnalyzer();
	    //Analyzer analyzer = new StandardAnalyzer();

	    // Store the index in memory:
	    Directory directory = new RAMDirectory();
	    // To store an index on disk, use this instead:
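	    // (FSDirectory.open takes a java.nio.file.Path; also import org.apache.lucene.store.FSDirectory)
	    //Directory directory = FSDirectory.open(Paths.get("/tmp/testindex"));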
	    IndexWriterConfig config = new IndexWriterConfig(analyzer);
	    IndexWriter iwriter = new IndexWriter(directory, config);

	    Document doc = new Document();
	    String queryString = "因为";
	    String text = "请把悲伤留下，带着欢乐离去，因为我爱你。";

	    doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
	    iwriter.addDocument(doc);
	    
	    Path filePath = Paths.get("F:\\Temp\\Test.txt");
	    InputStream stream = Files.newInputStream(filePath);
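        // For demonstration, split the file manually into one document per line
        // (the commented-out line below would index the whole file as a single document instead):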
	    //doc.add(new TextField("fieldname", new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8))));
	    InputStreamReader m_isr = new InputStreamReader(stream, StandardCharsets.UTF_8);
	    BufferedReader m_br = new BufferedReader(m_isr);
	    String data = null;
	    int idx = 0;
	    while((data=m_br.readLine())!=null){
	    	idx += 1;
	    	doc.clear();
	    	doc.add(new TextField("fieldname", data, Field.Store.YES));
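	    	// No indexed field is named "key<idx>", so this Term matches nothing:
	    	// updateDocument deletes no documents and simply adds doc, like addDocument.
	    	iwriter.updateDocument(new Term("key" + idx, doc.toString()), doc);
	    }
	    m_br.close(); // also closes the underlying InputStream
	    iwriter.close();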
	    
	    
	    // Now search the index:
	    DirectoryReader ireader = DirectoryReader.open(directory);
	    IndexSearcher isearcher = new IndexSearcher(ireader);
	    // Parse a simple query that searches the "fieldname" field for queryString:
	    QueryParser parser = new QueryParser("fieldname", analyzer);
	    Query query = parser.parse(queryString);
	    TopDocs result = isearcher.search(query, 200);
	    ScoreDoc[] hits = result.scoreDocs;
	    System.out.println("matches: " + result.totalHits);
	    // Iterate through the results:
	    for (int i = 0; i < hits.length; i++) {
	      Document hitDoc = isearcher.doc(hits[i].doc);
	      System.out.println("score="+hits[i].score+", doc["+i+"]: "+hitDoc.get("fieldname"));
	    }
	    ireader.close();
	    directory.close();
	    
	    System.out.println("searching complete successfully!");
	}
}

Let's go through it section by section.

	    Analyzer analyzer = new SmartChineseAnalyzer();
	    //Analyzer analyzer = new StandardAnalyzer();

	    // Store the index in memory:
	    Directory directory = new RAMDirectory();
	    // To store an index on disk, use this instead:
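	    // (FSDirectory.open takes a java.nio.file.Path; also import org.apache.lucene.store.FSDirectory)
	    //Directory directory = FSDirectory.open(Paths.get("/tmp/testindex"));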
	    IndexWriterConfig config = new IndexWriterConfig(analyzer);
	    IndexWriter iwriter = new IndexWriter(directory, config);

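This section declares and initializes the basic objects. The Analyzer is the text analyzer: it splits text into words and produces the tokens that get indexed. A Directory, as the name suggests, is a storage location for files; here it holds the index being built. As the comments note, this sample keeps the index in memory because it is only a demo; for large amounts of data, store the index on disk instead. IndexWriterConfig and IndexWriter set up the index writer, which writes the index into the directory.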
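To make the analyzer's job concrete, here is a toy tokenizer in plain Java that emits overlapping two-character tokens. It is a deliberately naive stand-in and not how SmartChineseAnalyzer segments text; the class name BigramTokenizer is hypothetical, and the sketch assumes all characters are in the Basic Multilingual Plane.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy class, not part of Lucene.
public class BigramTokenizer {
    // Split text into runs of letters/ideographs (punctuation and whitespace
    // end a run), then emit every pair of adjacent characters in each run.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i <= text.length(); i++) {
            boolean inRun = i < text.length() && Character.isLetter(text.charAt(i));
            if (inRun) {
                run.append(text.charAt(i));
            } else {
                emit(run, tokens);
                run.setLength(0);
            }
        }
        return tokens;
    }

    // A single-character run becomes one token; longer runs become bigrams.
    private static void emit(CharSequence run, List<String> out) {
        if (run.length() == 1) {
            out.add(run.toString());
        }
        for (int i = 0; i + 1 < run.length(); i++) {
            out.add(run.subSequence(i, i + 2).toString());
        }
    }
}
```

With this sketch, tokenize("因为我爱你。") yields the tokens 因为, 为我, 我爱 and 爱你, so a query for 因为 would match. Which tokens an analyzer produces decides which queries can match, which is why the choice of analyzer matters so much for Chinese text.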

	    Document doc = new Document();
	    String queryString = "因为";
	    String text = "请把悲伤留下，带着欢乐离去，因为我爱你。";

	    doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
	    iwriter.addDocument(doc);
	    
	    Path filePath = Paths.get("F:\\Temp\\Test.txt");
	    InputStream stream = Files.newInputStream(filePath);
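        // For demonstration, split the file manually into one document per line
        // (the commented-out line below would index the whole file as a single document instead):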
	    //doc.add(new TextField("fieldname", new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8))));
	    InputStreamReader m_isr = new InputStreamReader(stream, StandardCharsets.UTF_8);
	    BufferedReader m_br = new BufferedReader(m_isr);
	    String data = null;
	    int idx = 0;
	    while((data=m_br.readLine())!=null){
	    	idx += 1;
	    	doc.clear();
	    	doc.add(new TextField("fieldname", data, Field.Store.YES));
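	    	// No indexed field is named "key<idx>", so this Term matches nothing:
	    	// updateDocument deletes no documents and simply adds doc, like addDocument.
	    	iwriter.updateDocument(new Term("key" + idx, doc.toString()), doc);
	    }
	    m_br.close(); // also closes the underlying InputStream
	    iwriter.close();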

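This section does two things: (1) it puts a string into a document and adds it to the index in the directory; (2) it adds each line of a text file to the index as a separate document. Each document here has a single field, fieldname. In principle a document can have many fields, and their names are up to you; this example simply uses fieldname. If each file to be searched were a document, then the file name, path, creation time, modification time, file size and so on could each be a field, so that later searches can target specific fields. Note that every added or changed document is written to the directory through iwriter.addDocument and iwriter.updateDocument, so the directory always holds the latest index data. updateDocument first deletes any documents containing the given term and then adds the new one; the term used in this example matches no indexed field, so each call just adds a document.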
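The idea of one document per file with several named fields can be sketched in plain Java as follows. The class FileFields and the field names (filename, path, size, created, modified, contents) are assumptions chosen for illustration; the sketch only gathers the values and does not call Lucene.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
public class FileFields {
    // Collect the values that would each become one named field of a document.
    public static Map<String, String> describe(Path file) throws IOException {
        BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("filename", file.getFileName().toString());
        fields.put("path", file.toAbsolutePath().toString());
        fields.put("size", Long.toString(attrs.size()));
        fields.put("created", attrs.creationTime().toInstant().toString());
        fields.put("modified", attrs.lastModifiedTime().toInstant().toString());
        // Assumption: the file is UTF-8 text small enough to read into memory.
        fields.put("contents", new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
        return fields;
    }
}
```

In a Lucene program each entry would then be added to the Document as its own field, for example the contents as a TextField as in the example above, so that a query can be restricted to one field by name.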

	    // Now search the index:
	    DirectoryReader ireader = DirectoryReader.open(directory);
	    IndexSearcher isearcher = new IndexSearcher(ireader);
	    // Parse a simple query that searches the "fieldname" field for queryString:
	    QueryParser parser = new QueryParser("fieldname", analyzer);
	    Query query = parser.parse(queryString);
	    TopDocs result = isearcher.search(query, 200);
	    ScoreDoc[] hits = result.scoreDocs;
	    System.out.println("matches: " + result.totalHits);
	    // Iterate through the results:
	    for (int i = 0; i < hits.length; i++) {
	      Document hitDoc = isearcher.doc(hits[i].doc);
	      System.out.println("score="+hits[i].score+", doc["+i+"]: "+hitDoc.get("fieldname"));
	    }
	    ireader.close();
	    directory.close();

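This section searches what was just indexed. It opens the directory that holds the index, creates an IndexSearcher, and then defines the query: here, the string queryString ("因为") in the fieldname field. isearcher.search(query, 200) returns at most the 200 best-matching results, which end up in the ScoreDoc array hits. Note that hits contains only document numbers; to get the actual content, pass each number to isearcher.doc to obtain the corresponding document. Each document's relevance to queryString is stored in hits[i].score, and hits is sorted by that score from highest to lowest.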
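The shape of the result, (document number, score) pairs sorted by score and cut off at n, can be mimicked in plain Java. This is a rough sketch under loud assumptions: the class TopHits and its Hit type are hypothetical, and the raw occurrence count is only a stand-in for Lucene's relevance score, which is computed differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical toy class, not part of Lucene.
public class TopHits {
    // A (document number, score) pair, playing the role ScoreDoc plays above.
    public static final class Hit {
        public final int doc;
        public final float score;

        public Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Score each document by how often it contains the query string,
    // then return at most n hits, highest score first.
    public static List<Hit> search(List<String> documents, String query, int n) {
        List<Hit> hits = new ArrayList<>();
        for (int doc = 0; doc < documents.size(); doc++) {
            int count = countOccurrences(documents.get(doc), query);
            if (count > 0) {
                hits.add(new Hit(doc, count));
            }
        }
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(hits.subList(0, Math.min(n, hits.size())));
    }

    // Count non-overlapping occurrences of query in text.
    private static int countOccurrences(String text, String query) {
        if (query.isEmpty()) {
            return 0;
        }
        int count = 0;
        for (int at = text.indexOf(query); at >= 0; at = text.indexOf(query, at + query.length())) {
            count++;
        }
        return count;
    }
}
```

As in the Lucene example, a hit carries only the document number; the caller resolves it with documents.get(hit.doc), just as isearcher.doc(hits[i].doc) does above.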

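Although the example above can already search Chinese text (thanks to SmartChineseAnalyzer), Lucene's support for Chinese is at present still not very good. Improving Chinese search requires some modifications to Lucene, which I will cover in a later post.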
