Indexing and Searching Data with Apache Lucene


Sections

1. Maven dependencies for a Lucene development environment
2. Lucene package structure
3. IndexOptions in Lucene
4. Creating an index
5. Updating an index
6. Deleting from an index
7. Exact (term) queries
8. Fuzzy queries
9. Combined (boolean) queries

Lucene

Lucene is a mature, free and open-source full-text indexing and search toolkit written in Java.

Full-text search

Full-text search means that an indexing program scans every word in a document and builds an index entry for each word, recording how often and where it occurs. When a user issues a query, the search program looks the terms up in this pre-built index and returns the matching results.
Lucene's index has a layered structure, with the following levels:

Index:
In Lucene, one index lives in one directory; all the files in that directory together form a single Lucene index.
Segment:
An index can contain multiple segments. Segments are independent of one another: adding new documents can create new segments, and separate segments can be merged. Files sharing the same prefix belong to the same segment, for example the two segments "_0" and "_1".
segments.gen and segments_5 are the segments' metadata files; they store the segments' attribute information.
Document:
A document is the basic unit of indexing. Different documents are stored in different segments, and one segment can hold many documents.
Newly added documents go into a freshly created segment; as segments are merged, documents from different segments end up in the same segment.
Field:
A document carries different kinds of information that can be indexed separately, such as title, date, body, and author, each kept in its own field. Different fields can be indexed in different ways.
Term:
A term is the smallest unit of indexing: a string produced by lexical analysis and language processing.
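
To make the notion of a Term concrete, here is a minimal sketch (the class name and sample text are our own) that prints the terms an analyzer produces from a piece of text; each printed token is one Term that would go into the index:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TermDemo {
    public static void main(String[] args) throws Exception {
        try (Analyzer analyzer = new StandardAnalyzer();
             TokenStream ts = analyzer.tokenStream("content", "Lucene is a full-text search library")) {
            CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                      // must be called before incrementToken()
            while (ts.incrementToken()) {
                System.out.println(termAtt); // each token printed here is one Term
            }
            ts.end();
        }
    }
}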

Maven dependencies for a Lucene development environment
<dependencies>

<!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-core -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>8.11.1</version>
</dependency>

<!-- General-purpose analyzer, suitable for English tokenization -->
<!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-analyzers-common -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>8.11.1</version>
</dependency>
<!-- Chinese analyzer -->
<!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-analyzers-smartcn -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-smartcn</artifactId>
    <version>8.11.1</version>
</dependency>

<!-- Query parsing against the tokenized index -->
<!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-queryparser -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>8.11.1</version>
</dependency>

<!-- Highlighting of matched keywords in results -->
<!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-highlighter -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-highlighter</artifactId>
    <version>8.11.1</version>
</dependency>

<!-- Chinese tokenization toolkit (IK Analyzer) -->
<!-- https://mvnrepository.com/artifact/com.jianggujin/IKAnalyzer-lucene -->
<dependency>
    <groupId>com.jianggujin</groupId>
    <artifactId>IKAnalyzer-lucene</artifactId>
    <version>8.0.0</version>
</dependency>
</dependencies>
Note: the IKAnalyzer-lucene version must match the Lucene version in use.
Lucene package structure

Package                            Function
org.apache.lucene.analysis         Language analyzers, mainly for tokenization; Chinese support is added by extending this package
org.apache.lucene.document         Document structure for index storage, similar to a table schema in a relational database
org.apache.lucene.index            Index management, including creating and deleting index entries
org.apache.lucene.queryparser      Query parser; implements operators between query keywords such as AND, OR, NOT
org.apache.lucene.search           Search management; retrieves results for a given query
org.apache.lucene.store            Data storage management, mostly low-level I/O operations
org.apache.lucene.util             Shared utility classes
Creating an index
Example:
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.MMapDirectory;
import org.junit.jupiter.api.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

import java.io.File;
import java.io.FileReader;

@Test
public void buildIndex() throws Exception
{
    String indexDir = "path/to/lucene/index"; // Lucene index directory
    File luceneIndexDirectory = new File(indexDir);
    /**
     * Directory has several implementations:
     * FSDirectory works on the disk directly (recommended on SSDs)
     * MMapDirectory memory-maps the index (recommended when RAM is plentiful)
     */
    FSDirectory directory = new MMapDirectory(luceneIndexDirectory.toPath()); // open the index directory
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new IKAnalyzer());
    indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
    indexWriterConfig.setMaxBufferedDocs(1000); // buffer up to n documents in RAM before writing a segment
    IndexWriter writer = new IndexWriter(directory, indexWriterConfig);
    writer.forceMerge(100); // merge down to at most 100 segments; a larger value means less merge work but slower searches

    /**
     * Define the fields
     */

    /**
     * id: stored, not tokenized, indexed
     */
    FieldType idFieldType = new FieldType();
    idFieldType.setStored(true);
    idFieldType.setTokenized(false);
    idFieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);

    /**
     * title: stored, tokenized, indexed
     */
    FieldType titleFieldType = new FieldType();
    titleFieldType.setStored(true);
    titleFieldType.setTokenized(true);
    titleFieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);

    /**
     * content: stored, tokenized, indexed
     */
    FieldType contentFieldType = new FieldType();
    contentFieldType.setStored(true);
    contentFieldType.setTokenized(true);
    contentFieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);

    Document document = new Document();
    document.add(new Field("id", "1", idFieldType));
    document.add(new Field("title", "一往无前", titleFieldType));
    StringBuilder content = new StringBuilder();
    try (FileReader fileReader = new FileReader("D:\\SourceCodes\\Temp\\example\\一往无前.txt"))
    {
        char[] chbuf = new char[1024 * 8];
        int len;
        while ((len = fileReader.read(chbuf)) != -1)
        {
            content.append(chbuf, 0, len); // append each chunk; reassigning here would keep only the last chunk
        }
    }
    document.add(new Field("content", content.toString(), contentFieldType));

    writer.addDocument(document);
    writer.flush();
    writer.commit();
    writer.close();
}

Directory has several implementations:
  • FSDirectory works on the disk directly (recommended on SSDs)
  • MMapDirectory memory-maps the index (recommended when RAM is plentiful)
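
As a small sketch (the path is a placeholder): FSDirectory.open lets Lucene choose a concrete implementation for the current platform, typically MMapDirectory on 64-bit JVMs, while the constructors force a specific one:

Path indexPath = Paths.get("path/to/lucene/index"); // placeholder location
Directory auto = FSDirectory.open(indexPath);       // Lucene picks the implementation
Directory mmap = new MMapDirectory(indexPath);      // force memory-mapping explicitly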
IndexOptions in Lucene:

Option                                      What gets indexed
NONE                                        Not indexed
DOCS                                        Documents only
DOCS_AND_FREQS                              Documents and term frequencies
DOCS_AND_FREQS_AND_POSITIONS                Documents, term frequencies, and term positions
DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS    Documents, term frequencies, term positions, and offsets
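
A brief sketch of the trade-off (the field type name is our own): positions are required for phrase queries, so a field indexed with DOCS alone can answer plain TermQuery lookups but not a PhraseQuery, in exchange for a smaller index:

FieldType docsOnly = new FieldType();
docsOnly.setStored(true);
docsOnly.setTokenized(true);
docsOnly.setIndexOptions(IndexOptions.DOCS); // no freqs, positions, or offsets
docsOnly.freeze();                           // make the type immutable before reuse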
Updating an index
Example:
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.junit.jupiter.api.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

import java.io.File;
import java.io.FileReader;

@Test
public void updateIndex()
{
    String indexDir = "path/to/lucene/index";
    File luceneIndexDirectory = new File(indexDir);
    try (FSDirectory fsd = FSDirectory.open(luceneIndexDirectory.toPath()))
    {
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new IKAnalyzer());
        indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        IndexWriter indexWriter = new IndexWriter(fsd, indexWriterConfig);

        /**
         * Define the fields
         */
        // id: stored, not tokenized, indexed
        FieldType idFieldType = new FieldType();
        idFieldType.setStored(true);
        idFieldType.setTokenized(false);
        idFieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);

        // title: stored, tokenized, indexed
        FieldType titleFieldType = new FieldType();
        titleFieldType.setStored(true);
        titleFieldType.setTokenized(true);
        titleFieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);

        // content: stored, tokenized, indexed
        FieldType contentFieldType = new FieldType();
        contentFieldType.setStored(true);
        contentFieldType.setTokenized(true);
        contentFieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);

        Document document = new Document();
        document.add(new Field("id", "1", idFieldType));
        document.add(new Field("title", "一往无前", titleFieldType));
        StringBuilder content = new StringBuilder();
        try (FileReader fileReader = new FileReader("D:\\SourceCodes\\Temp\\example\\一往无前.txt"))
        {
            char[] chbuf = new char[1024 * 8];
            int len;
            while ((len = fileReader.read(chbuf)) != -1)
            {
                content.append(chbuf, 0, len);
            }
        }
        document.add(new Field("content", content.toString(), contentFieldType));

        // updateDocument atomically deletes any document matching the term and adds
        // the new one; if nothing matches, it behaves like a plain addDocument
        indexWriter.updateDocument(new Term("id", "1"), document);
        indexWriter.flush();
        indexWriter.commit();
        indexWriter.close();
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
Deleting from an index
Example:
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;
import org.junit.jupiter.api.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

import java.io.File;

@Test
public void deleteIndex()
{
    String indexDir = "path/to/lucene/index";
    File luceneIndexDirectory = new File(indexDir);
    try (Directory directory = new MMapDirectory(luceneIndexDirectory.toPath()))
    {
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new IKAnalyzer());
        indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        indexWriterConfig.setMaxBufferedDocs(1000);
        IndexWriter writer = new IndexWriter(directory, indexWriterConfig);
        writer.deleteDocuments(new Term("id", "1")); // mark every document whose id term is "1" as deleted
        writer.flush();
        writer.commit();
        writer.close();
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
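
Documents can also be deleted by query rather than by an exact term; a brief sketch, reusing the writer from above (TermQuery comes from org.apache.lucene.search):

writer.deleteDocuments(new TermQuery(new Term("title", "一往无前"))); // delete everything the query matches
// writer.deleteAll();                                               // or wipe the entire index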
Exact (term) queries
Example:
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.junit.jupiter.api.Test;

import java.io.File;

@Test
public void termQuery()
{
    String indexDir = "path/to/lucene/index";
    File luceneIndexDirectory = new File(indexDir);
    String content = "雷军";
    try (FSDirectory fsd = FSDirectory.open(luceneIndexDirectory.toPath()))
    {
        DirectoryReader reader = DirectoryReader.open(fsd);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs docs = searcher.search(new TermQuery(new Term("content", content)), 10); // top 10 hits

        for (ScoreDoc doc : docs.scoreDocs)
        {
            Document document = searcher.doc(doc.doc); // load the stored fields of the hit
            System.out.println(document.get("content"));
        }
        reader.close();
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
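
Note that TermQuery does not run the analyzer on its input, so the supplied string must exactly match a term produced at index time. A minimal sketch of applying analysis at query time instead, by parsing the text through QueryParser with the same analyzer used for indexing:

QueryParser parser = new QueryParser("content", new IKAnalyzer());
Query analyzed = parser.parse("雷军"); // tokenized with the same analyzer before matching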
Fuzzy queries
Example:
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.junit.jupiter.api.Test;

import java.io.File;

@Test
public void fuzzyQuery()
{
    String indexDir = "path/to/lucene/index";
    File luceneIndexDirectory = new File(indexDir);
    try (FSDirectory fsd = FSDirectory.open(luceneIndexDirectory.toPath()))
    {
        DirectoryReader reader = DirectoryReader.open(fsd);
        IndexSearcher searcher = new IndexSearcher(reader);
        FuzzyQuery query = new FuzzyQuery(new Term("content", "10年")); // matches terms within edit distance 2
        TopDocs docs1 = searcher.search(query, 100);
        for (ScoreDoc doc : docs1.scoreDocs)
        {
            Document document = searcher.doc(doc.doc);
            System.out.println(document.get("title"));
        }
        reader.close();
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
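
The allowed edit distance can also be set explicitly; a one-line sketch (FuzzyQuery accepts 0, 1, or 2 edits, defaulting to 2):

FuzzyQuery oneEdit = new FuzzyQuery(new Term("content", "10年"), 1); // stricter: at most one edit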
Combined (boolean) queries
Boolean rules

Option      Meaning
MUST        The clause must match, and it contributes to the score
MUST_NOT    The clause must not match
SHOULD      The clause may match; matching improves the score
FILTER      The clause must match, but it does not contribute to the score

Boolean combined query 1
Example:
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;
import org.junit.jupiter.api.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

import java.io.File;

@Test
public void screeningQuery()
{
    String indexDir = "path/to/lucene/index";
    File luceneIndexDirectory = new File(indexDir);
    try (FSDirectory fsd = FSDirectory.open(luceneIndexDirectory.toPath()))
    {
        DirectoryReader reader = DirectoryReader.open(fsd);
        IndexSearcher searcher = new IndexSearcher(reader);

        QueryParser titleQueryParser = new QueryParser("title", new IKAnalyzer());
        Query titleQuery = titleQueryParser.parse("一往无前");
        QueryParser contentQueryParser = new QueryParser("content", new IKAnalyzer());
        Query contentQuery = contentQueryParser.parse("雷军");
        // boolean combination: both clauses must match
        Query query = new BooleanQuery.Builder()
                .add(titleQuery, BooleanClause.Occur.MUST)
                .add(contentQuery, BooleanClause.Occur.MUST)
                .build();

        TopDocs docs = searcher.search(query, 10);
        for (ScoreDoc doc : docs.scoreDocs)
        {
            Document document = searcher.doc(doc.doc);
            System.out.println(document);
        }
        reader.close();
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
Boolean combined query 2
Example:
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;
import org.junit.jupiter.api.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

import java.io.File;

@Test
public void booleanScreenQuery()
{
    String indexDir = "path/to/lucene/index";
    File luceneIndexDirectory = new File(indexDir);
    try (FSDirectory fsd = FSDirectory.open(luceneIndexDirectory.toPath()))
    {
        DirectoryReader reader = DirectoryReader.open(fsd);
        IndexSearcher searcher = new IndexSearcher(reader);
        String[] fields = {"title", "content"};
        String[] stringQuery = {"一往无前", "雷军"};
        BooleanClause.Occur[] flags = {BooleanClause.Occur.MUST, BooleanClause.Occur.MUST};
        // parse one query string per field and combine the clauses with the matching flags
        Query query = MultiFieldQueryParser.parse(stringQuery, fields, flags, new IKAnalyzer());

        TopDocs docs = searcher.search(query, 10);

        for (ScoreDoc doc : docs.scoreDocs)
        {
            Document document = searcher.doc(doc.doc);
            System.out.println(document);
        }
        reader.close();
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
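
Neither example above uses FILTER; as a sketch, replacing one MUST clause with FILTER (reusing the parsed queries from boolean combined query 1) keeps the matching requirement but removes that clause's influence on scoring:

Query query = new BooleanQuery.Builder()
        .add(contentQuery, BooleanClause.Occur.MUST)  // must match, contributes to the score
        .add(titleQuery, BooleanClause.Occur.FILTER)  // must match, ignored for scoring
        .build();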
Note: compilation problems caused by the IKAnalyzer Chinese analyzer being built against an older Lucene can be worked around as follows:
Create the class MyIKTokenizer
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

import java.io.IOException;

public class MyIKTokenizer extends Tokenizer {
    // the IK segmenter that performs the actual Chinese word segmentation
    private IKSegmenter _IKImplement;
    private final CharTermAttribute termAtt = this.addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = this.addAttribute(OffsetAttribute.class);
    private final TypeAttribute typeAtt = this.addAttribute(TypeAttribute.class);
    private int endPosition;

    public MyIKTokenizer(boolean useSmart) {
        this._IKImplement = new IKSegmenter(this.input, useSmart);
    }

    @Override
    public boolean incrementToken() throws IOException {
        this.clearAttributes();
        Lexeme nextLexeme = this._IKImplement.next();
        if (nextLexeme != null) {
            // copy the lexeme text and its metadata into the token attributes
            this.termAtt.append(nextLexeme.getLexemeText());
            this.termAtt.setLength(nextLexeme.getLength());
            this.offsetAtt.setOffset(nextLexeme.getBeginPosition(), nextLexeme.getEndPosition());
            this.endPosition = nextLexeme.getEndPosition();
            this.typeAtt.setType(nextLexeme.getLexemeTypeString());
            return true;
        } else {
            return false;
        }
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        this._IKImplement.reset(this.input);
    }

    @Override
    public final void end() throws IOException {
        super.end(); // required by the TokenStream contract
        int finalOffset = this.correctOffset(this.endPosition);
        this.offsetAtt.setOffset(finalOffset, finalOffset);
    }
}

Create the class MyIKAnalyzer
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;

public final class MyIKAnalyzer extends Analyzer {
    private boolean useSmart;

    public MyIKAnalyzer() {
        this(false);
    }

    public MyIKAnalyzer(boolean useSmart) {
        this.useSmart = useSmart;
    }

    public boolean useSmart() {
        return this.useSmart;
    }

    public void setUseSmart(boolean useSmart) {
        this.useSmart = useSmart;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // plug the IK-based tokenizer into Lucene's analysis chain
        Tokenizer tokenizer = new MyIKTokenizer(this.useSmart());
        return new TokenStreamComponents(tokenizer);
    }
}

Wherever the examples above construct an IKAnalyzer, use MyIKAnalyzer instead.
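
For example, a one-line sketch based on the indexing example above:

IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new MyIKAnalyzer(true)); // true = smart segmentation mode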
