Lucene: Indexing in Detail (3)

无怨_无悔

Article link: https://blog.csdn.net/sdmxdzb/article/details/104968178

IndexWriter in Detail

Index creation API (diagram)

Code example

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import java.io.File;
import java.io.IOException;

/***
 *@author dongsheng
 *@date 2020/3/19 15:10
 *@version 1.0.0
 *@Description
 */
public class CreateIndexTest {

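    public static void main(String[] args) throws IOException {
        // Analyzer used to tokenize text fields
        Analyzer analyzer = new SimpleAnalyzer();
        // Index configuration object
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        // Open mode of the index: CREATE, APPEND, or CREATE_OR_APPEND
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        // Where the index is stored: here, a directory on the file system.
        // In-memory alternative: Directory directory = new RAMDirectory();
        // (newer Lucene versions provide ByteBuffersDirectory instead)
        try (Directory directory = FSDirectory
                .open((new File("f:/test/indextest")).toPath());
             // Create the index writer
             IndexWriter writer = new IndexWriter(directory, config)) {
            // Create a document
            Document doc = new Document();
            // Add the product id field (stored only, not indexed)
            doc.add(new StoredField("productId", "00001"));
            // Add the product name field (analyzed, indexed, and stored)
            String name = "ThinkPad X1 Carbon 20KH0009CD/25CD 超极本轻薄笔记本电脑联想";
            doc.add(new TextField("name", name, Field.Store.YES));
            writer.addDocument(doc);
            // Commit so the document is durable and visible to new readers;
            // leaving the try block closes the writer and releases the write lock
            writer.commit();
        }
    }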
}

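Class diagram

IndexWriterConfig: index configuration

Sets the analyzer to use
Sets how the index is opened (create, append, or create-or-append)
Sets the RAM buffer size, or the number of buffered documents, that triggers a flush
Sets the merge policy, the deletion policy, and similar strategies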
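
The "flush after a given number of buffered documents" setting can be pictured with a small model. This is a plain-Java sketch, not Lucene code: `ToyBufferedWriter` and all of its methods are hypothetical names introduced for illustration. In Lucene itself the corresponding knobs are `IndexWriterConfig.setMaxBufferedDocs` and `IndexWriterConfig.setRAMBufferSizeMB`.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (not Lucene): documents are buffered in memory and written
// out as one batch once the configured document count is reached.
class ToyBufferedWriter {
    private final int maxBufferedDocs;
    private final List<String> buffer = new ArrayList<>();
    private final List<List<String>> flushedBatches = new ArrayList<>();

    ToyBufferedWriter(int maxBufferedDocs) {
        if (maxBufferedDocs < 1) {
            throw new IllegalArgumentException("maxBufferedDocs must be >= 1");
        }
        this.maxBufferedDocs = maxBufferedDocs;
    }

    // Buffers a document and flushes automatically when the buffer is full.
    void addDocument(String doc) {
        buffer.add(doc);
        if (buffer.size() >= maxBufferedDocs) {
            flush();
        }
    }

    // Writes out whatever is currently buffered as one batch.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        flushedBatches.add(new ArrayList<>(buffer));
        buffer.clear();
    }

    int flushedBatchCount() {
        return flushedBatches.size();
    }

    int bufferedDocCount() {
        return buffer.size();
    }
}
```

With a limit of 2, adding three documents produces one automatic flush and leaves one document in the buffer until the next explicit `flush()`.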

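Directory: where the index is stored

Judging from the class hierarchy, an index can be stored in memory, on the file system, or in a database.

Directory directory = FSDirectory.open(path); // path: java.nio.file.Path of the index directory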

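API flow for creating and maintaining an index

// Create the index writer
IndexWriter writer = new IndexWriter(directory, config);
// Create a Document (doc), then add it to the index
writer.addDocument(doc);
// Delete documents
//writer.deleteDocuments(terms);
// Update a document
//writer.updateDocument(term, doc);
// Flush buffered changes
writer.flush();
// Commit
writer.commit();


// IndexWriter is thread-safe. If you need additional external synchronization,
// take care to avoid deadlock: do not synchronize on the IndexWriter instance itself.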

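Document in Detail

A Document is how a data record is represented in Lucene; it is the basic unit of indexing and searching. A Document consists of multiple Fields, much like a database record and its columns. IndexWriter assigns each Document an increasing id (starting from 0) in the order documents are added; this is called the document id. The inverted index stores this id, and the forward index in the document store is keyed by this id as well. The primary key of your business data is just another field of the document.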
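
The relationship between the internal document id and a business key can be sketched with a small model. This is plain Java rather than Lucene code, and `ToyIndex` with its methods is a hypothetical name introduced only for illustration; it shows ids being handed out in insertion order while `productId` stays an ordinary field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model (not Lucene): documents get ids 0, 1, 2, ... in insertion order.
class ToyIndex {
    // The position in this list is the document id.
    private final List<Map<String, String>> docs = new ArrayList<>();

    // Adds a document (field name -> value) and returns its assigned id.
    int addDocument(Map<String, String> fields) {
        docs.add(new LinkedHashMap<>(fields));
        return docs.size() - 1;
    }

    // Looks up a field of a document by its document id.
    String field(int docId, String name) {
        return docs.get(docId).get(name);
    }
}
```

Here the business key "00002" is only reachable as a field value of document 1; the document id itself carries no business meaning.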

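Field
A field consists of three parts: the field name (name), the field value (fieldsData), and the field type (type).
The field value can be text (a String, a Reader, or a pre-analyzed TokenStream), a binary value (byte[]), or a numeric value.

IndexableFieldType

The field type describes how the field is indexed and stored.
Note: a field that is not stored will not be present in a Document retrieved from the index.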

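IndexOptions: index options (omitNorms is the separate setting for whether normalization factors are omitted)

NONE: not indexed
DOCS: the inverted index stores only the ids of the documents containing the term, without term frequencies or positions
DOCS_AND_FREQS: the inverted index stores document ids and term frequencies
DOCS_AND_FREQS_AND_POSITIONS: the inverted index stores document ids, term frequencies, and positions
DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS: the inverted index stores document ids, term frequencies, positions, and offsets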
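
To make the difference between these levels concrete, the sketch below prints what a single posting (one term in one document) would carry at each level. It is a plain-Java toy, not Lucene's implementation: `ToyPosting`, its `Level` enum, and the whitespace tokenization are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (not Lucene): what one posting for a term in one document
// would contain at each level of detail.
class ToyPosting {
    enum Level { DOCS, FREQS, POSITIONS, OFFSETS }

    // Describes the occurrences of `term` in `text` (split on single spaces)
    // for document `docId`, keeping only what `level` asks for.
    // Returns an empty string if the term does not occur.
    static String describe(int docId, String text, String term, Level level) {
        List<Integer> positions = new ArrayList<>();
        List<String> offsets = new ArrayList<>();
        int pos = 0;
        int i = 0;
        while (i < text.length()) {
            if (text.charAt(i) == ' ') {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && text.charAt(i) != ' ') {
                i++;
            }
            if (text.substring(start, i).equals(term)) {
                positions.add(pos);            // token position
                offsets.add(start + "-" + i);  // start-end character offsets
            }
            pos++;
        }
        if (positions.isEmpty()) {
            return "";
        }
        StringBuilder sb = new StringBuilder("doc=" + docId);
        if (level.ordinal() >= Level.FREQS.ordinal()) {
            sb.append(" freq=").append(positions.size());
        }
        if (level.ordinal() >= Level.POSITIONS.ordinal()) {
            sb.append(" pos=").append(positions);
        }
        if (level.ordinal() >= Level.OFFSETS.ordinal()) {
            sb.append(" offsets=").append(offsets);
        }
        return sb.toString();
    }
}
```

For the text "red apple red pear" and the term "red", the richest level yields `doc=7 freq=2 pos=[0, 2] offsets=[0-3, 10-13]`, while the DOCS level keeps only `doc=7`. Phrase queries need positions, which is why a field indexed with fewer details cannot serve them.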

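storeTermVectors

Some fields do not need positions, offsets, or payloads while the inverted index is being searched, but do need them when the search results are processed. For such a field you can store a separate forward index (document id -> term vector).

boolean storeTermVectors(): whether to store term vectors
boolean storeTermVectorPositions(): whether to store positions in the term vectors
boolean storeTermVectorOffsets(): whether to store offsets in the term vectors
boolean storeTermVectorPayloads(): whether to store payloads in the term vectors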
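
A term vector can be thought of as a per-document map from each term to where it occurs. The following plain-Java sketch builds such a map for one field value; `ToyTermVector` is a hypothetical name, the whitespace tokenization is an assumption, and none of this reflects how Lucene encodes term vectors on disk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model (not Lucene): a term vector is a per-document map from each
// term to the token positions where it occurs in that document's field.
class ToyTermVector {
    // Builds term -> positions for one non-empty field value,
    // tokenized on whitespace. Terms are kept in sorted order.
    static Map<String, List<Integer>> build(String text) {
        Map<String, List<Integer>> vector = new TreeMap<>();
        String[] tokens = text.trim().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            vector.computeIfAbsent(tokens[pos], t -> new ArrayList<>()).add(pos);
        }
        return vector;
    }
}
```

Unlike the inverted index, which goes from a term to the documents containing it, this structure is looked up by document, so the terms of one hit can be read without scanning the whole index.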

Payloads (additional per-position information)

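docValuesType
The docValuesType method of IndexableFieldType lets you specify, for fields that need sorting, grouping, or aggregation, how the forward index from document to field value is built.

Trading space for time
For fields that need sorting, grouping, or aggregation, a separate forward index from document to field value is built and stored column-wise. Loading this field's data for the matching documents is then much faster and uses less memory.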
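
The benefit of a column-wise forward index can be sketched in a few lines. This is a plain-Java toy under stated assumptions, not Lucene code: `ToyNumericColumn` is a hypothetical class holding one long per document, indexed by document id, and the sort reads only that column.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model (not Lucene): a column holding one long per document,
// indexed directly by document id.
class ToyNumericColumn {
    private final long[] values;

    ToyNumericColumn(long[] valuesByDocId) {
        this.values = valuesByDocId.clone();
    }

    // Sorts the matched document ids by their column value, ascending.
    // Only the column is read; no stored document has to be loaded.
    List<Integer> sortByValue(List<Integer> matchedDocIds) {
        List<Integer> sorted = new ArrayList<>(matchedDocIds);
        sorted.sort(Comparator.comparingLong((Integer docId) -> values[docId]));
        return sorted;
    }
}
```

With prices 300, 100, and 200 for documents 0, 1, and 2, sorting the hits [0, 1, 2] by price gives [1, 2, 0]. Fetching each value is a direct array lookup by document id, which is the access pattern sorting and aggregation need.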

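DocValuesType options

NONE: doc values are not enabled
NUMERIC: for single-valued numeric fields
BINARY: for single-valued byte-array fields
SORTED: for single-valued string fields; the value bytes are sorted and deduplicated before being stored
SORTED_NUMERIC: for multi-valued numeric fields (an array of numbers per document); each document's numbers are stored in sorted order
SORTED_SET: for multi-valued string fields; the value bytes are sorted and deduplicated before being stored

Note: DocValuesType is strongly typed; all values of a given field must be of the same type.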

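Which type to choose

String, single-valued: SORTED
String, multi-valued: SORTED_SET
Numeric, date, or enum field, single-valued: NUMERIC
Numeric, date, or enum field, multi-valued: SORTED_NUMERIC when the values are stored as numbers (SORTED_SET only if they are encoded as bytes)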

Lucene's predefined Field subclasses, to be chosen as needed

TextField: Reader or String indexed for full-text search
StringField: String indexed verbatim as a single token
IntPoint: int indexed for exact/range queries.
LongPoint: long indexed for exact/range queries.
FloatPoint: float indexed for exact/range queries.
DoublePoint: double indexed for exact/range queries.
SortedDocValuesField: byte[] indexed column-wise for sorting/faceting
SortedSetDocValuesField: SortedSet<byte[]> indexed column-wise for sorting/faceting
NumericDocValuesField: long indexed column-wise for sorting/faceting
SortedNumericDocValuesField: SortedSet<long> indexed column-wise for sorting/faceting
StoredField: Stored-only value for retrieving in summary results

Installing Luke, the index inspection tool

Download: https://github.com/DmitryKey/luke/releases

It works out of the box.