Lucene 3.0 Study Notes (1): Building an Index


liliugen


Category: lucene. Tags: lucene, Eclipse, Java, Apache, search engine

Article link: https://blog.csdn.net/liliugen/article/details/83723293


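I am learning Lucene, and the latest version I downloaded is 3.0. These are the notes I put together while studying, kept on this blog as a backup.

Using Lucene as a search engine comes down to two tasks: 1. building an index; 2. querying with that index.

That is, Lucene first breaks the content to be searched into individual words, then indexes those words together with their relationship to the content. A query takes your input, finds the matching words in the index, and from them locates the corresponding content.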
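To make the idea concrete, here is a minimal sketch of a word-to-content index in plain Java. It is not how Lucene is implemented: the class `ToyInvertedIndex` and its methods are hypothetical names invented for this illustration, and the tokenization is a crude split on non-alphanumeric characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy class for illustration only; not part of Lucene.
class ToyInvertedIndex {
    // word -> sorted ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new TreeMap<String, Set<Integer>>();
    // document id -> original text
    private final List<String> docs = new ArrayList<String>();

    /** Stores the text, indexes each of its words, and returns the new document id. */
    int add(String text) {
        int id = docs.size();
        docs.add(text);
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (word.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            Set<Integer> ids = postings.get(word);
            if (ids == null) {
                ids = new TreeSet<Integer>();
                postings.put(word, ids);
            }
            ids.add(id);
        }
        return id;
    }

    /** Looks the word up in the index and returns the texts that contain it. */
    List<String> search(String word) {
        List<String> hits = new ArrayList<String>();
        Set<Integer> ids = postings.get(word.toLowerCase(Locale.ROOT));
        if (ids != null) {
            for (int id : ids) {
                hits.add(docs.get(id));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("Lucene builds an index");
        index.add("Search uses the index");
        System.out.println(index.search("index"));
        System.out.println(index.search("lucene"));
    }
}
```

The lookup never scans the stored texts; it goes from the word to the document ids and only then to the content, which is the same direction a query takes through a real index.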

We start with index creation. Here is a code sample:

/**
 * Copyright (c) 2010 TeleNav, Inc
 * All rights reserved
 * 
 * Created on Sep 25, 2010
 * Filename is CreateIndex.java
 * Packagename:example
 */
package example;

import java.io.File;
import java.io.FileReader;
import java.util.Date;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * @author lgli
 *
 * Created on  Sep 25, 2010
 */
public class CreateIndex
{
    // Directory where the index files are stored
    final File INDEX_DIR = new File("index");
    // Directory holding the test data
    final File docDir = new File("E:\\workspace_eclipse\\DOCTEST\\txt\\");

    public boolean createIndex()
    {
        if (!docDir.exists() || !docDir.canRead())
        {
            System.out.println("Document directory '" + docDir.getAbsolutePath() + "' does not exist or is not readable, please check the path");
            return false;
        }

        Date start = new Date();
        try
        {
            IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR), 
                new StandardAnalyzer(Version.LUCENE_30), 
                    true, // true overwrites the existing index; false appends to it
                    IndexWriter.MaxFieldLength.LIMITED);
            System.out.println("Indexing to directory '" + INDEX_DIR + "'...");
            indexDocs(writer, docDir);  // create the Documents and add them to the IndexWriter
            System.out.println("Optimizing...");
            writer.optimize();          // optimize the index
            writer.close();

            Date end = new Date();
            System.out.println(end.getTime() - start.getTime() + " total milliseconds");

        }
        catch (Exception e)
        {
            System.out.println(e.getMessage());
            return false;
        }
        return true;
    }
    // Add Documents to the IndexWriter
    private void indexDocs(IndexWriter writer, File file) throws Exception
    {
        if (file.canRead())
        {
            if (file.isDirectory())
            {
                // Recursively walk every file under the directory
                String[] files = file.list();
                if (files != null)
                {
                    for (int i = 0; i < files.length; i++)
                    {
                        indexDocs(writer, new File(file, files[i]));
                    }
                }
            }
            else
            {
                // Add the Document
                System.out.println("indexing: " + file);
                writer.addDocument(getDocument(file));
            }
        }
    }
    public Document getDocument(File f) throws java.io.FileNotFoundException
    {
        // Create the Document
        Document doc = new Document();
        doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("modified", DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("contents", new FileReader(f)));
        return doc;
    }
}

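Main steps for creating an index:

1. Specify the files to be searched (the docDir directory holds some text files as test data) and the directory where the index files are stored (INDEX_DIR).

2. Create an IndexWriter. This requires the index directory, the Analyzer to use, and so on. The Analyzer parses the text content and splits it into words; here I use the StandardAnalyzer that ships with Lucene.

3. Recursively walk all the files and build Documents, one Document per file.

4. Add the Documents to the index one by one.

5. Close the IndexWriter, which saves the index data.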
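Step 2 is easier to picture with a toy version of what an Analyzer does. The sketch below, in plain Java, lowercases the text, splits it on non-alphanumeric characters, and drops a few common English words. `ToyAnalyzer` and its stop-word list are assumptions made for this illustration; Lucene's StandardAnalyzer uses a more elaborate tokenizer and its own stop-word set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical toy class for illustration only; not part of Lucene.
class ToyAnalyzer {
    // Illustrative stop-word subset chosen for this sketch, not Lucene's actual list.
    private static final Set<String> STOP_WORDS =
        new HashSet<String>(Arrays.asList("a", "an", "and", "the", "of", "to", "is"));

    /** Turns raw text into the list of terms that would be indexed. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<String>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            // Skip empty tokens produced by split() and very common words
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [analyzer, splits, text, into, list, words]
        System.out.println(analyze("The Analyzer splits text into a list of words"));
    }
}
```

Because the same normalization has to be applied to query text later, the choice of Analyzer at indexing time also determines which query words can match.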

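Notes:

1. Index files can be kept in a directory on disk or in memory, using FSDirectory and RAMDirectory respectively.

2. Each Document is like one row in the index, and its individual fields are represented by Field objects. Here I add three Fields: path, last-modified time, and contents.

3. Field options

Field.Store.YES means the field's value is stored in the index.

Field.Index.NOT_ANALYZED means the field's value is not tokenized.

new Field("contents", new FileReader(f)) means the field's value is read from the Reader; it is tokenized but not stored.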
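The notes above can be checked with a small experiment against the Lucene 3.0 API. The sketch below indexes one in-memory document into a RAMDirectory and reads it back, so no files are touched. The class name `FieldOptionsDemo`, the path `notes/a.txt`, the sample text, and the fixed timestamp `0L` are all made up for illustration, and the Lucene 3.0 jar is assumed to be on the classpath.

```java
import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Hypothetical demo class; assumes Lucene 3.0 on the classpath.
class FieldOptionsDemo {
    /** Indexes one in-memory document and returns it as read back from the index. */
    static Document indexAndReadBack() throws Exception {
        Directory dir = new RAMDirectory(); // in-memory index instead of FSDirectory
        IndexWriter writer = new IndexWriter(dir,
            new StandardAnalyzer(Version.LUCENE_30),
            true,
            IndexWriter.MaxFieldLength.LIMITED);

        Document doc = new Document();
        // Stored, indexed as a single untokenized term (made-up path)
        doc.add(new Field("path", "notes/a.txt", Field.Store.YES, Field.Index.NOT_ANALYZED));
        // Stored; timestamp 0L is an arbitrary fixed value for the demo
        doc.add(new Field("modified", DateTools.timeToString(0L, DateTools.Resolution.MINUTE),
            Field.Store.YES, Field.Index.NOT_ANALYZED));
        // Read from a Reader: tokenized and indexed, but not stored
        doc.add(new Field("contents", new StringReader("hello lucene")));
        writer.addDocument(doc);
        writer.close();

        IndexReader reader = IndexReader.open(dir, true); // true = read-only
        try {
            return reader.document(0);
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        Document stored = indexAndReadBack();
        System.out.println("path     = " + stored.get("path"));
        System.out.println("modified = " + stored.get("modified"));
        System.out.println("contents = " + stored.get("contents"));
    }
}
```

The two Field.Store.YES fields come back with their values, with the modified time rendered by DateTools as a yyyyMMddHHmm string in GMT. The contents field comes back as null because a Reader-based field is indexed but never stored.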

The relationships among these basic objects are shown in the figure below:

[img]http://dl.iteye.com/upload/attachment/316065/aae740b9-7957-3dc2-af26-a0a5e0d67808.jpg[/img]

Finally, the code above produces the following output:
Indexing to directory 'index'...
indexing: E:\workspace_eclipse\DOCTEST\txt\DeleteFiles.java
indexing: E:\workspace_eclipse\DOCTEST\txt\FileDocument.java
indexing: E:\workspace_eclipse\DOCTEST\txt\html\Entities.java
indexing: E:\workspace_eclipse\DOCTEST\txt\html\HTMLParser.java
indexing: E:\workspace_eclipse\DOCTEST\txt\html\HTMLParser.jj
indexing: E:\workspace_eclipse\DOCTEST\txt\html\HTMLParserConstants.java
indexing: E:\workspace_eclipse\DOCTEST\txt\html\HTMLParserTokenManager.java
indexing: E:\workspace_eclipse\DOCTEST\txt\html\ParseException.java
indexing: E:\workspace_eclipse\DOCTEST\txt\html\ParserThread.java
indexing: E:\workspace_eclipse\DOCTEST\txt\html\SimpleCharStream.java
indexing: E:\workspace_eclipse\DOCTEST\txt\html\Tags.java
indexing: E:\workspace_eclipse\DOCTEST\txt\html\Test.java
indexing: E:\workspace_eclipse\DOCTEST\txt\html\Token.java
indexing: E:\workspace_eclipse\DOCTEST\txt\html\TokenMgrError.java
indexing: E:\workspace_eclipse\DOCTEST\txt\HTMLDocument.java
indexing: E:\workspace_eclipse\DOCTEST\txt\IndexFiles.java
indexing: E:\workspace_eclipse\DOCTEST\txt\IndexHTML.java
indexing: E:\workspace_eclipse\DOCTEST\txt\SearchFiles.java
Optimizing...
734 total milliseconds
