Lucene's Underlying Architecture and Optimization

寒暄

Category: Elastic Stack. Tags: big data, java, indexing, lucene

Article link: https://blog.csdn.net/qq_41106844/article/details/106500998

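Lucene's Underlying Storage Structure

[Figure: the physical index directory]

The figure above shows the index as it exists physically on disk.

[Figure: the logical index structure]

The figure above shows the same index as a logical structure.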

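The Segment_1 file in the physical index directory (Lucene names these commit files segments_N) corresponds to the Segment level of the logical index: it records which segments the index currently consists of.

A segment does not grow without bound. Once the configured limit is reached, Lucene automatically starts a new segment.

The exact limits are listed in the documentation of the version you use, and the defaults differ from version to version.

write.lock in the physical index directory is the lock file. It ensures that only one IndexWriter at a time modifies the segment files.

The dictionary in the logical index has three parts: the term, the document IDs, and the positions of occurrence. The number of distinct terms is bounded in practice: for use in China it is at most roughly the Xinhua dictionary plus the Oxford dictionary plus a classical Chinese dictionary plus the Arabic numerals.

A query first looks up the term to get the document IDs, and then uses the positions stored for each document ID to locate the term inside the document.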
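The term, document ID, and position layout described above can be sketched in a few lines of plain Java. This is only a toy model for illustration: the class name TinyInvertedIndex and the whitespace tokenization are assumptions of this sketch, and Lucene's real on-disk format is far more compact.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index (hypothetical class, not part of Lucene).
class TinyInvertedIndex {
    // term -> (document ID -> positions of the term inside that document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Splits the text on whitespace and records every term occurrence. */
    void add(int docId, String text) {
        String[] tokens = text.toLowerCase().trim().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    /** Returns document ID -> positions for the term, or an empty map. */
    Map<Integer, List<Integer>> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), Map.of());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "spring boot and spring cloud");
        index.add(2, "lucene in action");
        index.add(3, "learning spring");
        // Step 1: term -> document IDs; step 2: document ID -> positions.
        System.out.println(index.lookup("spring"));
    }
}
```

Looking up "spring" returns documents 1 and 3 together with the token positions inside each, which is exactly the two-step lookup the paragraph describes.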

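Index File Extension Reference

Name	Extension	Brief description
Segments File	segments_N	Stores information about a commit point
Lock File	write.lock	Prevents multiple IndexWriters from writing to the same index
Segment Info	.si	Stores metadata about a segment
Compound File	.cfs .cfe	An optional "virtual" file that packs all the index files of a segment into one compound file
Fields	.fnm	Stores information about the fields
Field Index	.fdx	Contains pointers to the field data
Field Data	.fdt	The stored field values of the documents
Term Dictionary	.tim	The term dictionary; stores term information
Term Index	.tip	The index into the Term Dictionary
Frequencies	.doc	Contains the list of docs that contain each term, along with the term frequency
Positions	.pos	Stores the positions at which each term occurs in the index
Payloads	.pay	Stores additional per-position metadata
Norms	.nvd .nvm	.nvm holds the metadata of the field weighting factors; .nvd holds the weighting factors themselves
Per-Document Values	.dvd .dvm	.dvm holds the metadata of the per-document scoring factors; .dvd holds the per-document scoring data
Term Vector Index	.tvx	Stores offsets into the document data file
Term Vector Documents	.tvd	Contains information about each document that has term vectors
Term Vector Fields	.tvf	Field-level information about term vectors
Live Documents	.liv	Records which documents are live
Point values	.dii .dim	Holds indexed points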

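Lucene's Dictionary Storage Structure

The dictionary of the inverted index lives in memory, so its data structure matters a great deal. Some common choices:

Data structure	Pros and cons
Skip list	Small memory footprint and tunable, but poor support for fuzzy queries
Sorted list	Looked up by binary search; unbalanced
Trie	Lookup cost depends on the string length, but only suits English dictionaries
Hash table	Fast, but memory-hungry: almost three times the size of the raw data
Double-array trie	Suits Chinese dictionaries and has a small memory footprint; many word segmentation tools use this algorithm
FST	A finite state transducer; Lucene 4 ships an open-source implementation and uses it heavily
B-tree	A disk-based index; easy to update but slow to search; commonly used in databases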
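To make the "sorted list" row concrete, here is a minimal sketch of a dictionary kept as a sorted array and searched by binary search. The class name SortedTermList is made up for this example, and the terms are assumed to be distinct. A prefix query stays cheap because matching terms sit next to each other, which a hash table cannot offer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sorted-array dictionary; assumes the terms are distinct.
class SortedTermList {
    private final String[] terms;

    SortedTermList(String... terms) {
        this.terms = terms.clone();
        Arrays.sort(this.terms);
    }

    /** Exact lookup by binary search: O(log n) string comparisons. */
    boolean contains(String term) {
        return Arrays.binarySearch(terms, term) >= 0;
    }

    /** Prefix lookup: binary-search the first candidate, then scan forward. */
    List<String> withPrefix(String prefix) {
        int i = Arrays.binarySearch(terms, prefix);
        if (i < 0) {
            i = -i - 1; // insertion point, i.e. the first term >= prefix
        }
        List<String> result = new ArrayList<>();
        while (i < terms.length && terms[i].startsWith(prefix)) {
            result.add(terms[i++]);
        }
        return result;
    }

    public static void main(String[] args) {
        SortedTermList dict = new SortedTermList("lucene", "solr", "luke", "elastic", "lucid");
        System.out.println(dict.contains("solr"));
        System.out.println(dict.withPrefix("luc"));
    }
}
```

The weakness is updates: inserting a term means shifting the array, which is one reason the table lists other structures.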

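Skip List

Up to Lucene 3.x the dictionary relied on a skip-list mechanism; starting with Lucene 4 it was replaced by the FST structure.

Pros: a simple structure whose skip interval and number of levels are controllable. Although the dictionary no longer uses it, Lucene still applies skip lists elsewhere, for example when merging posting lists and in the document ID index.
Cons: poor support for fuzzy queries.

Before looking at how a skip list is searched, consider how a singly linked list is searched.

Suppose we have this singly linked list:

7 -> 14 -> 21 -> 32 -> 37 -> 71 -> 85 -> 117

How many lookups does it take to find node 85?

Although the data is sorted, a singly linked list offers no random access, so binary search cannot be used to cut the search time. We have to visit the nodes one by one, which takes 7 lookups.

What if we use a skip list instead?

A skip list randomly promotes about half of the nodes of the linked list into a second level, then randomly promotes about half of that level into a third, and can keep adding a fourth or fifth level if the top one is still relatively large. After promotion the structure looks like this: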

start			21		37				end
start	7		21		37	71			end
start	7	14	21	32	37	71	85	117	end

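Level 1 (top row): 3 lookups, namely 21, 37, and end. Since 85 is larger than 37, everything up to 37 is ruled out and only the elements after 37 still need to be searched.

Level 2: 2 lookups, namely 71 and end. Since 85 is larger than 71, the search continues after 71.

Level 3: 1 lookup, which lands directly on 85.

That makes 6 lookups in total, compared with 7 for the plain linked list.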
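The walk above can be replayed in code. The sketch below hard-codes the three levels from the table instead of promoting nodes at random, so that the comparison count is reproducible. SkipListDemo and its method names are invented for this example, and "end" is modelled as a sentinel holding Integer.MAX_VALUE.

```java
import java.util.HashMap;
import java.util.Map;

// Fixed-level skip list for illustration (hypothetical class).
class SkipListDemo {
    static final class Node {
        final int value;
        Node next; // next node on the same level
        Node down; // node with the same value one level below
        Node(int value) { this.value = value; }
    }

    /** Builds the levels (given top level first) and returns the top-level head. */
    static Node build(int[][] levelsTopDown) {
        Map<Integer, Node> below = new HashMap<>();
        Node head = null;
        for (int i = levelsTopDown.length - 1; i >= 0; i--) {
            Map<Integer, Node> current = new HashMap<>();
            Node levelHead = new Node(Integer.MIN_VALUE); // "start"
            levelHead.down = head;
            Node prev = levelHead;
            for (int v : levelsTopDown[i]) {
                Node n = new Node(v);
                n.down = below.get(v); // null on the bottom level
                current.put(v, n);
                prev.next = n;
                prev = n;
            }
            prev.next = new Node(Integer.MAX_VALUE); // "end"
            head = levelHead;
            below = current;
        }
        return head;
    }

    /** Returns the number of comparisons needed to find target, or -1 if absent. */
    static int comparisons(Node head, int target) {
        int count = 0;
        Node cur = head;
        while (cur != null) {
            while (true) {
                count++;
                int next = cur.next.value;
                if (next == target) return count;
                if (next > target) break; // overshoot: drop one level
                cur = cur.next;
            }
            cur = cur.down;
        }
        return -1;
    }

    public static void main(String[] args) {
        Node skipList = build(new int[][] {
            {21, 37},
            {7, 21, 37, 71},
            {7, 14, 21, 32, 37, 71, 85, 117}
        });
        Node plainList = build(new int[][] {{7, 14, 21, 32, 37, 71, 85, 117}});
        System.out.println("skip list:   " + comparisons(skipList, 85));
        System.out.println("linked list: " + comparisons(plainList, 85));
    }
}
```

With only eight elements the saving is small (6 versus 7 comparisons), but because each level roughly halves the candidates, the gap grows quickly as the list gets longer.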

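FST

The data structure Lucene uses today is the FST. Its characteristics:

Pros

Low memory usage (the compression ratio is typically between 3x and 20x), good support for fuzzy queries, and fast lookups.
Cons

A complex structure that requires sorted input and is hard to update.

An FST requires its input to be sorted, so Lucene sorts the terms parsed from the documents before building the FST. Suppose the input is abd, abe, acf, acg; the construction then proceeds as follows: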

		a
	   /
	  /
	 b
	/
   /
  d
  
----------

		a
	   / 
	  /
	 b
	/ \
   /   \
  d     e
  
----------

		 a
	   /   \
	  /     \
	 b       c
	/ \     /
   /   \   /
  d     e f
  
----------
  
		 a
	   /   \
	  /     \
	 b       c
	/ \     / \
   /   \   /   \
  d     e f     g

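Structurally, an FST is a graph of states connected by labeled transitions (drawn above as a tree that shares common prefixes), whereas a skip list is built from linked lists.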
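The construction steps drawn above can be approximated with a small prefix tree that, like an FST builder, only accepts sorted input. This is a deliberate simplification and not Lucene's FST: a real FST also shares common suffixes and attaches outputs to its transitions. The class name PrefixTree is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified prefix-sharing tree (hypothetical class, not Lucene's FST).
class PrefixTree {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        boolean isFinal; // true if a term ends at this node
    }

    private final Node root = new Node();
    private String previous = null;
    private int nodeCount = 1; // counts the root

    /** Adds a term; input must arrive in strictly ascending order. */
    void add(String term) {
        if (previous != null && previous.compareTo(term) >= 0) {
            throw new IllegalArgumentException("input must be sorted and unique: " + term);
        }
        previous = term;
        Node node = root;
        for (char c : term.toCharArray()) {
            Node next = node.children.get(c);
            if (next == null) { // only the non-shared suffix creates new nodes
                next = new Node();
                node.children.put(c, next);
                nodeCount++;
            }
            node = next;
        }
        node.isFinal = true;
    }

    boolean contains(String term) {
        Node node = find(term);
        return node != null && node.isFinal;
    }

    /** Returns all terms starting with the prefix, in sorted order. */
    List<String> withPrefix(String prefix) {
        List<String> out = new ArrayList<>();
        Node node = find(prefix);
        if (node != null) collect(node, new StringBuilder(prefix), out);
        return out;
    }

    int nodeCount() { return nodeCount; }

    private Node find(String s) {
        Node node = root;
        for (char c : s.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return null;
        }
        return node;
    }

    private void collect(Node node, StringBuilder path, List<String> out) {
        if (node.isFinal) out.add(path.toString());
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            path.append(e.getKey().charValue());
            collect(e.getValue(), path, out);
            path.setLength(path.length() - 1);
        }
    }

    public static void main(String[] args) {
        PrefixTree tree = new PrefixTree();
        for (String term : new String[] {"abd", "abe", "acf", "acg"}) tree.add(term);
        System.out.println(tree.withPrefix("ab") + ", nodes: " + tree.nodeCount());
    }
}
```

The four terms hold 12 characters but need only 7 letter nodes plus the root, which hints at where the compression comes from. Prefix lookups such as withPrefix("ab") are also the reason this family of structures handles prefix and fuzzy queries well.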

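Lucene Optimization

Reducing heavy disk I/O

Set the buffer size to speed up indexing

config.setMaxBufferedDocs(100000); controls how many documents are held in memory before a new segment is written. A larger number speeds up indexing.

The larger the value, the faster the indexing, but the more memory is consumed.
Merge the index into fewer segments

IndexWriter.forceMerge(maxNumSegments): merges the index until it contains at most maxNumSegments segments. Note that the argument is a segment count, not a document count.

The larger the value, the faster the indexing and the slower the search; the smaller the value, the slower the indexing and the faster the search.

A higher value means lower merge overhead during indexing, but also slower searches. If the value is not set too high, searches stay fast, because the program carries out the merging concurrently during indexing.

Tune the program over several runs and let the machine's measured performance tell you the best value.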
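"Let the machine tell you" amounts to a small measurement loop: run the same job once per candidate value and keep the fastest. The helper below is a sketch with invented names (SettingTuner, pickFastest, timeMillis). In main it replays the two timings reported later in this article instead of rebuilding an index, so it runs without Lucene.

```java
import java.util.function.IntToLongFunction;

// Hypothetical tuning helper; not part of Lucene.
class SettingTuner {
    /** Returns the candidate with the lowest measured time (first one wins ties). */
    static int pickFastest(int[] candidates, IntToLongFunction measureMillis) {
        if (candidates.length == 0) {
            throw new IllegalArgumentException("no candidates");
        }
        int best = candidates[0];
        long bestMillis = Long.MAX_VALUE;
        for (int candidate : candidates) {
            long millis = measureMillis.applyAsLong(candidate);
            if (millis < bestMillis) {
                bestMillis = millis;
                best = candidate;
            }
        }
        return best;
    }

    /** Times a single run of the job in milliseconds. */
    static long timeMillis(Runnable job) {
        long start = System.nanoTime();
        job.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Replayed measurements from the article. In a real run the lambda would
        // be something like n -> timeMillis(() -> rebuildIndexWith(n)), where
        // rebuildIndexWith is your own indexing routine.
        int best = pickFastest(new int[] {100_000, 500_000},
                n -> n == 100_000 ? 6880L : 7105L);
        System.out.println("best setting: " + best);
    }
}
```

A single run per candidate is noisy. Repeating each measurement a few times and comparing medians gives a more trustworthy answer.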

Baseline (no optimization)

The test material contains roughly 1.2 million records.

package com.itheima.lucene;

import java.io.File;

import org.apache.commons.io.FileUtils;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Build the index
public class LuceneFirst {
    public static void main(String[] args) throws Exception {
        // Record the start time
        long start = System.currentTimeMillis();

        createIndex();

        // Record the end time
        long end = System.currentTimeMillis();
        System.out.println("Elapsed time (ms): " + (end - start));
    }

    public static void createIndex() throws Exception {
        // 1. Create a Directory object that points at the index location
        Directory directory = FSDirectory.open(new File("E:\\syk").toPath());

        // 2. Create an IndexWriter on top of the Directory
        IndexWriter indexwriter = new IndexWriter(directory, new IndexWriterConfig());

        // 3. Read the files on disk and create one Document per file

        // - The directory that holds the files to be indexed
        File dir = new File("E:\\searchsource");
        System.out.println(dir);
        // - List the files in that directory
        File[] files = dir.listFiles();

        // - Iterate over the files
        for (File f : files) {
            // - File name
            String fileName = f.getName();

            // - File path
            String filePath = f.getPath();

            // - File content
            String fileContent = FileUtils.readFileToString(f, "utf-8");

            // - File size
            long fileSize = FileUtils.sizeOf(f);

            // - Create the fields. Arg 1: field name; arg 2: field value; arg 3: whether to store it
            Field fieldName = new TextField("name", fileName, Field.Store.YES);
            Field fieldPath = new TextField("Path", filePath, Field.Store.YES);
            Field fieldContent = new TextField("Content", fileContent, Field.Store.YES);
            Field fieldSize = new TextField("Size", fileSize + "", Field.Store.YES);

            // - Create the document
            Document document = new Document();

            // 4. Add the fields to the document
            document.add(fieldName);
            document.add(fieldPath);
            document.add(fieldContent);
            document.add(fieldSize);

            // 5. Write the document to the index
            indexwriter.addDocument(document);
        }

        // 6. Close the IndexWriter
        indexwriter.close();
    }
}

Elapsed time (ms): 7725

Buffer optimization

package com.itheima.lucene;

import java.io.File;

import org.apache.commons.io.FileUtils;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Build the index
public class LuceneFirst {
    public static void main(String[] args) throws Exception {
        // Record the start time
        long start = System.currentTimeMillis();

        createIndex();

        // Record the end time
        long end = System.currentTimeMillis();
        System.out.println("Elapsed time (ms): " + (end - start));
    }

    public static void createIndex() throws Exception {
        // 1. Create a Directory object that points at the index location
        Directory directory = FSDirectory.open(new File("E:\\syk").toPath());

        IndexWriterConfig config = new IndexWriterConfig();

        // - Buffer optimization: keep up to 100000 documents in memory before flushing a segment
        config.setMaxBufferedDocs(100000);

        // 2. Create an IndexWriter on top of the Directory
        IndexWriter indexwriter = new IndexWriter(directory, config);

        // 3. Read the files on disk and create one Document per file

        // - The directory that holds the files to be indexed
        File dir = new File("E:\\searchsource");
        System.out.println(dir);
        // - List the files in that directory
        File[] files = dir.listFiles();

        // - Iterate over the files
        for (File f : files) {
            // - File name
            String fileName = f.getName();

            // - File path
            String filePath = f.getPath();

            // - File content
            String fileContent = FileUtils.readFileToString(f, "utf-8");

            // - File size
            long fileSize = FileUtils.sizeOf(f);

            // - Create the fields. Arg 1: field name; arg 2: field value; arg 3: whether to store it
            Field fieldName = new TextField("name", fileName, Field.Store.YES);
            Field fieldPath = new TextField("Path", filePath, Field.Store.YES);
            Field fieldContent = new TextField("Content", fileContent, Field.Store.YES);
            Field fieldSize = new TextField("Size", fileSize + "", Field.Store.YES);

            // - Create the document
            Document document = new Document();

            // 4. Add the fields to the document
            document.add(fieldName);
            document.add(fieldPath);
            document.add(fieldContent);
            document.add(fieldSize);

            // 5. Write the document to the index
            indexwriter.addDocument(document);
        }

        // 6. Close the IndexWriter
        indexwriter.close();
    }
}

Elapsed time (ms): 6880

The value was later changed to 500000, which gave 7105 ms. So 100000 suits the performance of this machine better.

Segment merge optimization

package com.itheima.lucene;

import java.io.File;

import org.apache.commons.io.FileUtils;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Build the index
public class LuceneFirst {
    public static void main(String[] args) throws Exception {
        // Record the start time
        long start = System.currentTimeMillis();

        createIndex();

        // Record the end time
        long end = System.currentTimeMillis();
        System.out.println("Elapsed time (ms): " + (end - start));
    }

    public static void createIndex() throws Exception {
        // 1. Create a Directory object that points at the index location
        Directory directory = FSDirectory.open(new File("E:\\syk").toPath());

        // 2. Create an IndexWriter on top of the Directory
        IndexWriter indexwriter = new IndexWriter(directory, new IndexWriterConfig());

        // 3. Read the files on disk and create one Document per file

        // - The directory that holds the files to be indexed
        File dir = new File("E:\\searchsource");
        System.out.println(dir);
        // - List the files in that directory
        File[] files = dir.listFiles();

        // - Iterate over the files
        for (File f : files) {
            // - File name
            String fileName = f.getName();

            // - File path
            String filePath = f.getPath();

            // - File content
            String fileContent = FileUtils.readFileToString(f, "utf-8");

            // - File size
            long fileSize = FileUtils.sizeOf(f);

            // - Create the fields. Arg 1: field name; arg 2: field value; arg 3: whether to store it
            Field fieldName = new TextField("name", fileName, Field.Store.YES);
            Field fieldPath = new TextField("Path", filePath, Field.Store.YES);
            Field fieldContent = new TextField("Content", fileContent, Field.Store.YES);
            Field fieldSize = new TextField("Size", fileSize + "", Field.Store.YES);

            // - Create the document
            Document document = new Document();

            // 4. Add the fields to the document
            document.add(fieldName);
            document.add(fieldPath);
            document.add(fieldContent);
            document.add(fieldSize);

            // 5. Write the document to the index
            indexwriter.addDocument(document);
        }

        // - Merge the index down to at most the given number of segments.
        //   Called after the documents are added so that there is something to merge.
        indexwriter.forceMerge(100000);

        // 6. Close the IndexWriter
        indexwriter.close();
    }
}

Elapsed time (ms): 6890

Choosing an analyzer

Different analyzers tokenize differently and take different amounts of time.

Although the default analyzer tokenizes faster than IKAnalyzer, it handles Chinese poorly, so for Chinese text IKAnalyzer is in practice the only option. You can still improve query matching by tuning IKAnalyzer's extension dictionary and stop-word dictionary.

Baseline (StandardAnalyzer)

package com.itheima.lucene;

import java.io.File;

import org.apache.commons.io.FileUtils;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Build the index
public class LuceneFirst {
    public static void main(String[] args) throws Exception {
        // Record the start time
        long start = System.currentTimeMillis();

        createIndex();

        // Record the end time
        long end = System.currentTimeMillis();
        System.out.println("Elapsed time (ms): " + (end - start));
    }

    public static void createIndex() throws Exception {
        // 1. Create a Directory object that points at the index location
        Directory directory = FSDirectory.open(new File("E:\\syk").toPath());

        // 2. Create an IndexWriter on top of the Directory
        //    (the no-arg IndexWriterConfig uses StandardAnalyzer)
        IndexWriter indexwriter = new IndexWriter(directory, new IndexWriterConfig());

        // 3. Read the files on disk and create one Document per file

        // - The directory that holds the files to be indexed
        File dir = new File("E:\\searchsource");
        System.out.println(dir);
        // - List the files in that directory
        File[] files = dir.listFiles();

        // - Iterate over the files
        for (File f : files) {
            // - File name
            String fileName = f.getName();

            // - File path
            String filePath = f.getPath();

            // - File content
            String fileContent = FileUtils.readFileToString(f, "utf-8");

            // - File size
            long fileSize = FileUtils.sizeOf(f);

            // - Create the fields. Arg 1: field name; arg 2: field value; arg 3: whether to store it
            Field fieldName = new TextField("name", fileName, Field.Store.YES);
            Field fieldPath = new TextField("Path", filePath, Field.Store.YES);
            Field fieldContent = new TextField("Content", fileContent, Field.Store.YES);
            Field fieldSize = new TextField("Size", fileSize + "", Field.Store.YES);

            // - Create the document
            Document document = new Document();

            // 4. Add the fields to the document
            document.add(fieldName);
            document.add(fieldPath);
            document.add(fieldContent);
            document.add(fieldSize);

            // 5. Write the document to the index
            indexwriter.addDocument(document);
        }

        // 6. Close the IndexWriter
        indexwriter.close();
    }
}

Elapsed time (ms): 7725

IKAnalyzer

package com.itheima.lucene;

import java.io.File;

import org.apache.commons.io.FileUtils;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.wltea.analyzer.lucene.IKAnalyzer;

// Build the index
public class LuceneFirst {
    public static void main(String[] args) throws Exception {
        // Record the start time
        long start = System.currentTimeMillis();

        createIndex();

        // Record the end time
        long end = System.currentTimeMillis();
        System.out.println("Elapsed time (ms): " + (end - start));
    }

    public static void createIndex() throws Exception {
        // 1. Create a Directory object that points at the index location
        Directory directory = FSDirectory.open(new File("E:\\syk").toPath());

        // 2. Create an IndexWriter on top of the Directory, this time with IKAnalyzer
        IndexWriter indexwriter = new IndexWriter(directory, new IndexWriterConfig(new IKAnalyzer()));

        // 3. Read the files on disk and create one Document per file

        // - The directory that holds the files to be indexed
        File dir = new File("E:\\searchsource");
        System.out.println(dir);
        // - List the files in that directory
        File[] files = dir.listFiles();

        // - Iterate over the files
        for (File f : files) {
            // - File name
            String fileName = f.getName();

            // - File path
            String filePath = f.getPath();

            // - File content
            String fileContent = FileUtils.readFileToString(f, "utf-8");

            // - File size
            long fileSize = FileUtils.sizeOf(f);

            // - Create the fields. Arg 1: field name; arg 2: field value; arg 3: whether to store it
            Field fieldName = new TextField("name", fileName, Field.Store.YES);
            Field fieldPath = new TextField("Path", filePath, Field.Store.YES);
            Field fieldContent = new TextField("Content", fileContent, Field.Store.YES);
            Field fieldSize = new TextField("Size", fileSize + "", Field.Store.YES);

            // - Create the document
            Document document = new Document();

            // 4. Add the fields to the document
            document.add(fieldName);
            document.add(fieldPath);
            document.add(fieldContent);
            document.add(fieldSize);

            // 5. Write the document to the index
            indexwriter.addDocument(document);
        }

        // 6. Close the IndexWriter
        indexwriter.close();
    }
}

Elapsed time (ms): 19215

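Choosing where to store the index

Class	Read	Write	Characteristics
SimpleFSDirectory	java.io.RandomAccessFile	java.io.RandomAccessFile	Simple implementation; poor concurrency
NIOFSDirectory	java.nio.FileChannel	FSDirectory.FSIndexOutput	Good concurrency (avoid it on Windows)
MMapDirectory	Memory mapping	FSDirectory.FSIndexOutput	Reads are served from memory

SimpleFSDirectory differs little from the FSDirectory we use by default. NIOFSDirectory has excellent concurrency but is not recommended on Windows. The one to focus on is MMapDirectory: it memory-maps the index files, so the first read brings the data into memory and later reads are served straight from memory, which effectively gives the index an in-memory cache.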

package com.itheima.lucene;

import java.io.File;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;

// Search the index
public class LuceneFirst {
    public static void main(String[] args) throws Exception {
        // Record the start time
        long start = System.currentTimeMillis();

        searchIndex();

        // Record the end time
        long end = System.currentTimeMillis();
        System.out.println("Elapsed time (ms): " + (end - start));
    }

    public static void searchIndex() throws Exception {
        // 1. Create a Directory object that points at the index location.
        //    The constructor is used so that MMapDirectory is really the implementation in use.
        Directory directory = new MMapDirectory(new File("E:\\syk").toPath());

        // 2. Create an IndexReader on top of the Directory
        IndexReader indexReader = DirectoryReader.open(directory);

        // 3. Create an IndexSearcher; its constructor takes the IndexReader
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);

        // 4. Create a TermQuery. Arg 1: the field to search (document content); arg 2: the text to search for
        Query query = new TermQuery(new Term("Content", "spring"));

        // 5. Run the query and get a TopDocs object. Arg 1: the query; arg 2: number of hits to return
        TopDocs topDocs = indexSearcher.search(query, 10);

        // 6. Print the total number of hits
        System.out.println("Total hits: " + topDocs.totalHits);

        // 7. Get the list of matching documents
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;

        // 8. Print the content of each document
        for (ScoreDoc doc : scoreDocs) {
            // - Document ID
            int docId = doc.doc;
            // - Fetch the document by its ID
            Document document = indexSearcher.doc(docId);
            // - Document name
            System.out.println("Document name: " + document.get("name"));

            // - Document path
            System.out.println("Document path: " + document.get("Path"));

            // - Document size
            System.out.println("Document size: " + document.get("Size"));

            // - Document content
            System.out.println("Document content: " + document.get("Content"));
        }

        // 9. Close the IndexReader
        indexReader.close();
    }
}

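With MMapDirectory, the first query takes about as long as with FSDirectory (300+ ms for both), but the second query takes only 15 ms (FSDirectory also benefits from caching, yet still needs 20 ms).

Once the data volume reaches tens or hundreds of millions of records, the difference becomes pronounced.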
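For readers unfamiliar with memory mapping, the following sketch shows the underlying JDK facility, FileChannel.map, on a throwaway temp file. It only illustrates the general mechanism and is not MMapDirectory's actual implementation. MappedReadDemo and its methods are names made up for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class; illustrates plain JDK memory mapping.
class MappedReadDemo {
    /** Maps the whole file read-only and decodes its bytes as UTF-8. */
    static String readViaMmap(Path file) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The file content becomes addressable like memory; the OS pages it in on demand.
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            byte[] bytes = new byte[buffer.remaining()];
            buffer.get(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the text to a temp file and reads it back through a mapping. */
    static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("mmap-demo", ".txt");
            file.toFile().deleteOnExit();
            Files.write(file, text.getBytes(StandardCharsets.UTF_8));
            return readViaMmap(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("spring in action"));
    }
}
```

Because mapped pages live in the operating system's page cache rather than on the Java heap, a repeated read of the same region usually avoids disk access altogether, which fits the faster second query reported above.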

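Notes on Using Lucene

Keywords are case-sensitive

Query keywords such as OR, AND, and TO are case-sensitive. Lucene only recognizes them in upper case; in lower case they are treated as ordinary words.
Read/write exclusivity

Only one write operation on an index can be in progress at any moment, but searches can run while the write is going on.
File lock

If the program is forcibly terminated while writing the index, a lock file can be left behind (write.lock in the index directory; very old Lucene versions kept it in the TMP directory) and block later write operations. Once you are sure no writer is still running, it can be deleted by hand.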
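The single-writer rule and the lock file can be illustrated with the JDK's own file locking. This is a sketch of OS-level file locks in general, not Lucene's lock implementation. WriteLockDemo and its methods are hypothetical names, and the lock file is a temp file created only for the demo.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class; shows exclusive file locking with java.nio.
class WriteLockDemo {
    /** Tries to take an exclusive lock; returns null if it is already held. */
    static FileLock tryLock(Path lockFile) {
        try {
            FileChannel channel = FileChannel.open(lockFile,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE);
            try {
                FileLock lock = channel.tryLock();
                if (lock == null) {
                    channel.close(); // held by another process
                }
                return lock;
            } catch (OverlappingFileLockException e) {
                channel.close(); // held elsewhere in this JVM
                return null;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Releases the lock and closes the channel it was acquired on. */
    static void release(FileLock lock) {
        try {
            lock.release();
            lock.channel().close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns {first attempt ok, second attempt ok, attempt after release ok}. */
    static boolean[] demo() {
        try {
            Path lockFile = Files.createTempFile("write", ".lock");
            lockFile.toFile().deleteOnExit();
            FileLock first = tryLock(lockFile);
            FileLock second = tryLock(lockFile); // a second "writer" while the first is active
            if (first != null) release(first);
            FileLock third = tryLock(lockFile);  // after the first writer has finished
            boolean[] result = {first != null, second != null, third != null};
            if (second != null) release(second);
            if (third != null) release(third);
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(demo()));
    }
}
```

The second attempt fails while the first lock is held and succeeds again once it has been released, which is the behavior a second IndexWriter runs into on a locked index.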
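Time format

Lucene only handles timestamps in one compact form, yyyyMMddHHmmss (the form produced by its DateTools helper). If you pass it a time written as yyyy-MM-dd HH:mm:ss, it will not treat it as a time.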
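A human-readable timestamp can be converted to the compact form before indexing. The sketch below uses only java.time and assumes the compact form meant here is the four-digit-year pattern yyyyMMddHHmmss. CompactTimestamp is a made-up helper name, not a Lucene class.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper; converts display timestamps to a compact, sortable form.
class CompactTimestamp {
    private static final DateTimeFormatter INPUT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
    private static final DateTimeFormatter COMPACT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    /** Converts e.g. "2020-06-02 10:15:30" into "20200602101530". */
    static String toCompact(String humanReadable) {
        return LocalDateTime.parse(humanReadable, INPUT).format(COMPACT);
    }

    public static void main(String[] args) {
        System.out.println(toCompact("2020-06-02 10:15:30"));
    }
}
```

A useful property of the compact form is that plain string order equals chronological order, so range queries and sorting work on the indexed text as-is.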
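Setting boost

Sometimes a field should carry more weight in a search, for example the title should count for more than the body. In that case give the title a larger boost; documents whose title contains the keyword will then be ranked first in the results.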
