Creating an Index with the Lucene Search Engine

sm983728902

Article link: https://blog.csdn.net/sm983728902/article/details/8557683

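1. Create the IndexWriter

IndexWriter is the class that modifies the index: it adds, updates, and deletes documents. (Searching is not done through IndexWriter; that is the job of IndexSearcher.) Its constructor is declared as follows:

public IndexWriter(Directory d, Analyzer a, boolean create, MaxFieldLength mfl)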

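Its parameters are:

(1) Directory d: where the index is stored. Directory is an abstract class, so an instance of one of its subclasses must be passed in. The commonly used subclasses are FSDirectory (stores the index on disk) and RAMDirectory (stores the index in memory).

FSDirectory stores the index on disk. It has no public constructor, so an instance can only be obtained through its static factory method:

public static FSDirectory open(File path)

RAMDirectory stores the index in memory:

public RAMDirectory()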

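(2) Analyzer a: the analyzer, which splits field text into terms while the index is being built. Commonly used analyzers are StandardAnalyzer, which segments Chinese text character by character (each character becomes one term); CJKAnalyzer, which uses bigram segmentation (every two adjacent characters form one term); and IKAnalyzer, which uses dictionary-based and grammar-based segmentation.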

(3) boolean create: true creates a new index or overwrites an existing one; false appends to an existing index.

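(4) MaxFieldLength mfl: the maximum number of terms that are indexed for each Field. MaxFieldLength.LIMITED indexes only the first 10,000 terms of a field, so text beyond that point cannot be found by a search; it is sufficient in most cases and is the usual choice. MaxFieldLength.UNLIMITED sets no limit, so every term of the field is indexed and searchable.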
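The difference between single-character and bigram segmentation, and the effect of a per-field term limit, can be sketched in plain Java. This is only a toy illustration under simplifying assumptions: the class and method names are hypothetical, no Lucene classes are used, and the real analyzers handle punctuation, Latin text, and other cases that this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration only; this is not how Lucene's analyzers are implemented. */
final class SegmentationSketch {

    /** Single-character segmentation: every character becomes its own term. */
    static List<String> singleChars(String text) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            terms.add(String.valueOf(text.charAt(i)));
        }
        return terms;
    }

    /** Bigram segmentation: every pair of adjacent characters becomes one term. */
    static List<String> bigrams(String text) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            terms.add(text.substring(i, i + 2));
        }
        return terms;
    }

    /** Keeps only the first maxTerms terms, as a per-field term limit would. */
    static List<String> limit(List<String> terms, int maxTerms) {
        return new ArrayList<>(terms.subList(0, Math.min(maxTerms, terms.size())));
    }

    public static void main(String[] args) {
        String text = "搜索引擎";
        System.out.println(singleChars(text));           // [搜, 索, 引, 擎]
        System.out.println(bigrams(text));               // [搜索, 索引, 引擎]
        System.out.println(limit(singleChars(text), 2)); // [搜, 索]
    }
}
```

With single-character terms a query for any one character matches, while bigram terms only match queries that contain the same adjacent pair; terms cut off by the limit are simply absent from the index.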

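2. Create a Document and add it to the IndexWriter. A Document usually represents one file, one page, or one row of a database table, and is the smallest unit returned by a search.

Its constructor and the method for adding fields are declared as follows:

public Document()

public final void add(Fieldable field)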

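3. Create Fields (each column of a record is called a field) and add them to the Document. For every field you need to decide how it is stored and how it is indexed.

Its constructors are declared as follows:

public Field(String name, String value, Store store, Index index)

public Field(String name, Reader reader)

public Field(String name, byte[] value, Store store)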

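name: the name of the field (you may choose any name).

value: the value of the field, either text or a byte array (audio, images, and so on); long text can be supplied through a Reader.

store: the storage mode, that is, whether the original value is kept in the index. Fields that must be displayed in search results need to be stored.

Store.NO: not stored

Store.YES: stored

Store.COMPRESS: stored in compressed form (suited to binary values). This constant exists only in Lucene 2.x; it was removed in Lucene 3.0.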

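index: the indexing mode, that is, whether the field can be searched. A field that is used as a search condition must be indexed.

Index.NO: not indexed

Index.ANALYZED: indexed after being split into terms by the analyzer

Index.NOT_ANALYZED: indexed as one single term, without analysis. The choice depends on how the field will be queried: use NOT_ANALYZED when a query must match the whole value, and ANALYZED when it should be able to match part of the value.

Index.NO_NORMS: indexed without analysis and without norms, which reduces memory consumption but means index-time boosts and field-length normalization no longer contribute to scoring. In newer Lucene versions, including 3.0, this constant is named Index.NOT_ANALYZED_NO_NORMS.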
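The two decisions above (store or not, index or not, analyze or not) are independent of each other. The following is a minimal sketch of that idea using a tiny in-memory model; the class, its methods, and the whitespace "analyzer" are all hypothetical stand-ins and have nothing to do with Lucene's real data structures.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of "stored" versus "indexed"; names are hypothetical, not Lucene API. */
final class StoreVsIndexSketch {
    // "field:term" -> ids of the documents containing that term (built by indexing)
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // document id -> field name -> original value (kept by storing)
    private final List<Map<String, String>> stored = new ArrayList<>();

    int newDocument() {
        stored.add(new HashMap<>());
        return stored.size() - 1;
    }

    void addField(int docId, String name, String value,
                  boolean store, boolean index, boolean analyze) {
        if (store) {
            stored.get(docId).put(name, value);
        }
        if (index) {
            // Analyzed: lowercase words split on whitespace. Not analyzed: the whole value is one term.
            String[] terms = analyze
                    ? value.trim().toLowerCase(Locale.ROOT).split("\\s+")
                    : new String[] {value};
            for (String term : terms) {
                postings.computeIfAbsent(name + ":" + term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term, Collections.emptySet());
    }

    String storedValue(int docId, String field) {
        return stored.get(docId).get(field); // null when the field was not stored
    }

    public static void main(String[] args) {
        StoreVsIndexSketch idx = new StoreVsIndexSketch();
        int doc = idx.newDocument();
        idx.addField(doc, "fileName", "Release Notes.txt", true, true, true);          // stored, analyzed
        idx.addField(doc, "filePath", "/data/Release Notes.txt", true, false, false);  // stored only
        idx.addField(doc, "fileContent", "Lucene builds an index", false, true, true); // indexed only
        System.out.println(idx.search("fileName", "release"));                 // [0]
        System.out.println(idx.search("filePath", "/data/Release Notes.txt")); // []
        System.out.println(idx.storedValue(doc, "fileContent"));               // null
    }
}
```

In this model a field that is stored but not indexed can be displayed but never found, and a field that is indexed but not stored can be found but not displayed, which matches the roles of store and index described above.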

4. Optimize the index

5. Close the IndexWriter

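Note: the following utility classes are useful when building an index.

Numeric helper class NumericUtils

Converting an int to a string, used when creating the index:

public static String intToPrefixCoded(int val)

Converting a string back to an int, used when querying the index:

public static int prefixCodedToInt(String prefixCoded)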
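The reason numbers are encoded rather than indexed as plain decimal text is that index terms are compared as strings, and "10" sorts before "9". The sketch below shows one simple order-preserving encoding (flip the sign bit, then write fixed-width hexadecimal). It is a simplified stand-in chosen for illustration, not the format that NumericUtils actually produces, and the class and method names are hypothetical.

```java
/** Simplified order-preserving int encoding; NOT the format used by NumericUtils. */
final class SortableIntSketch {

    /** Flips the sign bit so negatives sort first, then pads to 8 hex digits. */
    static String encode(int val) {
        return String.format("%08x", val ^ 0x80000000);
    }

    /** Reverses encode(): parses the hex digits and flips the sign bit back. */
    static int decode(String encoded) {
        return Integer.parseUnsignedInt(encoded, 16) ^ 0x80000000;
    }

    public static void main(String[] args) {
        System.out.println("9".compareTo("10") < 0);             // false: plain decimal strings sort wrongly
        System.out.println(encode(9).compareTo(encode(10)) < 0); // true
        System.out.println(encode(-5).compareTo(encode(3)) < 0); // true
        System.out.println(decode(encode(-5)));                  // -5
    }
}
```

Because every encoded value has the same length and the same digit alphabet, string comparison of the encoded terms agrees with numeric comparison of the original values, which is what range queries and sorting on a numeric field rely on.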

Date helper class DateTools

Converting a time to a string, used when creating the index:

public static String dateToString(Date date, DateTools.Resolution resolution)

public static String timeToString(long time, DateTools.Resolution resolution)

(Resolution is the time precision: Resolution.DAY is precise to the day, Resolution.SECOND is precise to the second.)

Converting a string back to a time, used when querying the index; the methods are declared as follows:

public static Date stringToDate(String dateString)

public static long stringToTime(String dateString)
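To make the idea of a resolution concrete, here is a small sketch that formats a timestamp as a fixed-width GMT string, truncated to the chosen precision, using only the JDK. As far as I know this mirrors the kind of string DateTools produces (for example yyyyMMdd for DAY), but the class, constants, and methods here are hypothetical stand-ins and not the Lucene API.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** JDK-only sketch of resolution-based date strings; names are hypothetical. */
final class DateResolutionSketch {
    static final String DAY = "yyyyMMdd";
    static final String MINUTE = "yyyyMMddHHmm";
    static final String SECOND = "yyyyMMddHHmmss";

    /** Builds a formatter fixed to GMT so the result does not depend on the local time zone. */
    private static SimpleDateFormat gmt(String pattern) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("GMT"));
        return format;
    }

    /** Formats epoch milliseconds, dropping everything finer than the pattern. */
    static String timeToString(long millis, String pattern) {
        return gmt(pattern).format(new Date(millis));
    }

    /** Parses a string produced by timeToString back to epoch milliseconds. */
    static long stringToTime(String text, String pattern) {
        try {
            return gmt(pattern).parse(text).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a valid date string: " + text, e);
        }
    }

    public static void main(String[] args) {
        long time = 90_061_000L; // 1970-01-02 01:01:01 GMT
        System.out.println(timeToString(time, DAY));       // 19700102
        System.out.println(timeToString(time, SECOND));    // 19700102010101
        System.out.println(stringToTime("19700102", DAY)); // 86400000
    }
}
```

A coarser resolution produces fewer distinct terms in the index, at the cost of losing the finer part of the timestamp: converting the DAY string back yields midnight, not the original time.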

File helper class FileUtils (from Apache Commons IO)

public static String readFileToString(File file, String encoding)

A complete example:

package com.puckasoft.lucene;

import java.io.File;

import org.apache.commons.io.FileUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.DateTools.Resolution;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.NumericUtils;
import org.apache.lucene.util.Version;
import org.junit.Test;

public class TestLucene {

    @Test
    public void testWriter() throws Exception {

        // Path of the folder that will hold the index, e.g. D:\my_workspaces\1224\Lucence_01_FS\index.
        // System.getProperty("user.dir") returns the working directory (the project root here);
        // File.separator yields "\" or "/" depending on the operating system.
        String indexPath = System.getProperty("user.dir") + File.separator + "index";

        // Create the index folder if it does not exist yet.
        File indexDir = new File(indexPath);
        if (!indexDir.exists()) {
            indexDir.mkdirs();
        }

        // Declared outside the try block so that it can be closed in the finally block.
        IndexWriter indexWriter = null;
        try {
            // The four constructor arguments of IndexWriter.
            Directory directory = FSDirectory.open(indexDir);
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
            boolean create = true;
            MaxFieldLength mfl = MaxFieldLength.LIMITED;

            // 1. Create the IndexWriter.
            indexWriter = new IndexWriter(directory, analyzer, create, mfl);

            // 2. Create one Document per file and add it to the IndexWriter.
            String dataPath = System.getProperty("user.dir") + File.separator + "text_ds"; // root folder of the data

            // List all files in the data folder.
            File dataDir = new File(dataPath);
            File[] files = dataDir.listFiles();
            if (files == null) {
                throw new IllegalStateException("Data folder not found: " + dataPath);
            }
            for (File file : files) {
                if (!file.isFile()) {
                    continue; // skip subfolders
                }
                Document document = new Document();
                // 3. Create the Fields and add them to the Document:
                // file name, file path, file size, last-modified time, file content.
                document.add(new Field("fileName", file.getName(), Store.YES, Index.ANALYZED));
                document.add(new Field("filePath", file.getAbsolutePath(), Store.YES, Index.NO));
                document.add(new Field("fileSize", NumericUtils.longToPrefixCoded(file.length()),
                        Store.YES, Index.NOT_ANALYZED));
                document.add(new Field("lastModified",
                        DateTools.timeToString(file.lastModified(), Resolution.MINUTE),
                        Store.YES, Index.NOT_ANALYZED));
                // Assumption: the data files are UTF-8 text; change the encoding if they are not.
                document.add(new Field("fileContent", FileUtils.readFileToString(file, "UTF-8"),
                        Store.YES, Index.ANALYZED));

                indexWriter.addDocument(document);
            }

            // 4. Optimize the index.
            indexWriter.optimize();
        } finally {
            // 5. Close the IndexWriter.
            if (indexWriter != null) {
                indexWriter.close();
            }
        }
    }

}