Entertainment Headlines - 04 Lucene

Article link: https://blog.csdn.net/WYJ____/article/details/103871161

1. Search engines

2. Lucene

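I. Search engines

1. What is a search engine

A search engine is a system that collects information from the Internet according to some strategy using dedicated programs, organizes and processes that information, and then provides a retrieval service that shows users the results relevant to their queries. Examples: Baidu and Google.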

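2. How a search engine works

3. Shortcomings of plain database queries

Slow: when the database holds a very large amount of data, queries become very inefficient and cannot return results in time.
Poor search quality: the query can only do a fuzzy match with wildcards before and after the complete keyword the user typed (for example a SQL LIKE).
If the keyword contains a typo or extra text, the results can drift far from what the user expected.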

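4. Inverted index

An inverted index (also called a reverse index) uses a character, a word, or even a whole sentence or paragraph as a key. Each key maps to an entry that records which documents the key appears in and at which positions within each document.

Why does an inverted index improve both query speed and accuracy?

The data is tokenized and indexed ahead of time. When a user searches, the query text is tokenized as well, each resulting token is looked up in the index to find the matching term, the term gives the documents and positions where it occurs, and those documents are fetched directly instead of scanning every record.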
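The lookup described above can be sketched in plain Java. This is a minimal illustration, not Lucene's implementation: the class name `InvertedIndexSketch`, the whitespace tokenizer, and the sample sentences are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy inverted index: term -> (document id -> positions of the term in that document). */
class InvertedIndexSketch {
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Lower-cases and splits on whitespace; a real analyzer does much more. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Indexing step: record, for every token, the document and the token position. */
    void add(int docId, String text) {
        List<String> tokens = tokenize(text);
        for (int pos = 0; pos < tokens.size(); pos++) {
            postings.computeIfAbsent(tokens.get(pos), k -> new TreeMap<>())
                    .computeIfAbsent(docId, k -> new ArrayList<>())
                    .add(pos);
        }
    }

    /** Query step: one map lookup returns docId -> positions, or an empty map. */
    Map<Integer, List<Integer>> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), new TreeMap<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "lucene is a search toolkit");
        index.add(2, "solr is built on lucene");
        System.out.println(index.lookup("lucene")); // prints {1=[0], 2=[4]}
    }
}
```

The cost of a lookup depends on the number of distinct terms, not on the total amount of text, which is why the query does not slow down the way a full scan does.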

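II. Lucene

Lucene is an open-source full-text search toolkit from Apache. It is a library rather than a complete search engine, but it can be used to build one.

How Lucene builds an index

How Lucene queries an index

1. Lucene and Solr

Lucene: the low-level API and toolkit
Solr: an enterprise search product built on top of Lucene

2. Building an index with Lucene

1) Add the dependencies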

        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>4.10.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queries</artifactId>
            <version>4.10.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-test-framework</artifactId>
            <version>4.10.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-common</artifactId>
            <version>4.10.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queryparser</artifactId>
            <version>4.10.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-highlighter</artifactId>
            <version>4.10.2</version>
        </dependency>

Add the compiler plugin (the default compiler level is too low and causes errors)

        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.7.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
        </plugins>

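2) Write the code that builds the index

import java.io.File;

import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test; // JUnit 4 must be on the classpath; it is not among the dependencies listed above

public class Writer {

    @Test
    public void testWriterIndex() throws Exception{
        // Directory that holds the index files
        Directory directory = new SimpleFSDirectory(new File("D:\\index"));
        // Index writer configuration: Lucene version and analyzer
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_2,new SimpleAnalyzer());
        // How to open the index: CREATE replaces any existing index
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);

        // Create the index writer
        IndexWriter indexWriter = new IndexWriter(directory,config);

        // Field type shared by both fields
        FieldType fieldType = new FieldType();
        fieldType.setStored(true);    // keep the original value in the index
        fieldType.setTokenized(true); // run the value through the analyzer
        fieldType.setIndexed(true);   // make the field searchable

        // Build the document
        Document document = new Document();
        document.add(new Field("title","lucene简介",fieldType));
        document.add(new Field("content","Lucene是Apache提供的一个开源的全文检索引擎工具包, 其本质就是一个工具包",fieldType));

        // Add the document to the writer
        indexWriter.addDocument(document);

        // Flush the changes to the index
        indexWriter.commit();

        // Release resources
        indexWriter.close();
    }
}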

3. Index viewer tool

Link: https://pan.baidu.com/s/1jXCbufIE2AwYGbusZ_ugWA
Extraction code: i82i

Run run.bat

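4. API details

IndexWriter: the index writer

Its main job is to add, update, and delete index entries.

Creating it requires a Directory and an IndexWriterConfig.
Directory: specifies where the index is stored
Common implementations:
- FSDirectory: a file-system directory; the index is saved on disk
  - Pros: the index persists and is safer
  - Cons: slightly slower reads
- RAMDirectory: an in-memory directory; the index is held in memory
  - Pros: fast reads
  - Cons: not durable; the index is lost when the process exits or the machine shuts down
IndexWriterConfig: configuration for the index writer
Creating it requires the Lucene version and an analyzer.
Purpose:
- Purpose 1: specify the Lucene version and the analyzer to use
- Purpose 2: set how Lucene opens the index: setOpenMode();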

        // Possible values: APPEND, CREATE, CREATE_OR_APPEND
        /*
         * APPEND: append to an existing index; throws an exception if the index does not exist
         *
         * CREATE: always create a new index, replacing any existing one
         *
         * CREATE_OR_APPEND: append if the index exists, otherwise create it
         *                   (this is also the default)
         */
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);

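Document: a document

In Lucene every record is stored as a document, and a document has fields with values. A Lucene document is comparable to one row of a database table: the table's columns correspond to the document's fields, and one document holds exactly one record.

You can also think of a Document as a file: the file's attributes are the document's fields, and the attribute values (for example the content) are the field values.

A document can have multiple fields, each one a Field object, and different documents can have different fields.
Fields also have data types, so Field has subclasses for the various types.

Field class	Data type	Analyzed	Indexed	Stored	Notes
StringField(FieldName, FieldValue, Store.YES)	String	N	Y	Y or N	Builds a string field that is not analyzed; the whole string is indexed as a single term (for example an order number or a name). Store.YES or Store.NO decides whether the value is stored in the document.
LongField(FieldName, FieldValue, Store.YES)	Long	Y	Y	Y or N	Builds a numeric long field that is analyzed and indexed (for example a price). Store.YES or Store.NO decides whether the value is stored in the document.
StoredField(FieldName, FieldValue)	Overloaded; supports several types	N	N	Y	Builds a field of one of several types that is neither analyzed nor indexed, only stored in the document.
TextField(FieldName, FieldValue, Store.NO) or TextField(FieldName, reader)	String or stream	Y	Y	Y or N	When given a Reader, Lucene assumes the content is large and does not store it.

Terminology:

Analyzed: whether the field value is tokenized

Indexed: whether the field can be searched

Stored: whether the original value is kept so it can be returned with search results

A field whose value is tokenized must also be indexed, because tokenizing only serves searching.

Analyzer: the tokenizer

It tokenizes the text of a document, and the quality of the result depends on which analyzer is chosen. Lucene ships analyzers for many languages. For Chinese there is ChineseAnalyzer, but it simply splits the text into individual characters.

For Chinese, a third-party analyzer is normally used instead.

The usual choice is the IK analyzer.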

5. Integrating the IK analyzer

1) Add the dependency

        <dependency>
            <groupId>com.janeluo</groupId>
            <artifactId>ikanalyzer</artifactId>
            <version>2012_u6</version>
        </dependency>

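2) Replace the simple analyzer above with the IK analyzer

        // Create the index writer configuration
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_2,new IKAnalyzer());

New words that are not in IK's dictionary, such as the Internet slang below, may not be segmented as whole words:

        document.add(new Field("title","网络词",fieldType));
        document.add(new Field("content","吊炸天不明觉厉",fieldType));

Advanced usage:

The IK analyzer has not been updated since 2012. It remains usable largely because of its extension mechanism.

Link: https://pan.baidu.com/s/1SXy2dcYTz-vp4MAQ0D6nmw
Extraction code: ir4f

IK supports custom dictionaries. Two kinds can be defined:
- Extension dictionary (adds new words): words the IK analyzer does not recognize, for example "传智播客" or "碉堡了"
- Stop-word dictionary (disables words): words that should not be indexed, for example "哦", "啊", "的"

How to use it:

Copy the three files into the project.

Then list the words that should be recognized as terms in ext.dic, and the words that should be excluded in stopword.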
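To see why the two dictionaries change the output, here is a minimal sketch of dictionary-based segmentation using greedy forward maximum matching. It is not IK's actual algorithm; the class `DictionarySegmenterSketch`, the method `segment`, and the sample word 课程 are assumptions made for illustration. The string literals use `\u` escapes so the source stays ASCII-only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DictionarySegmenterSketch {
    /**
     * At each position take the longest word found in the dictionary,
     * falling back to a single character; tokens in stopWords are dropped.
     */
    static List<String> segment(String text, Set<String> dictionary, Set<String> stopWords) {
        int maxLen = 1;
        for (String w : dictionary) maxLen = Math.max(maxLen, w.length());
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            String match = text.substring(i, i + 1); // fallback: one character
            for (int len = Math.min(maxLen, text.length() - i); len > 1; len--) {
                String candidate = text.substring(i, i + len);
                if (dictionary.contains(candidate)) {
                    match = candidate;
                    break;
                }
            }
            if (!stopWords.contains(match)) tokens.add(match);
            i += match.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Escaped form of the sample sentence described in the note after this block.
        String text = "\u4f20\u667a\u64ad\u5ba2\u7684\u8bfe\u7a0b";
        Set<String> stopWords = new HashSet<>(Arrays.asList("\u7684"));
        Set<String> base = new HashSet<>(Arrays.asList("\u8bfe\u7a0b"));
        Set<String> extended = new HashSet<>(base);
        extended.add("\u4f20\u667a\u64ad\u5ba2"); // the word an extension dictionary would contribute
        System.out.println(segment(text, base, stopWords));
        System.out.println(segment(text, extended, stopWords));
    }
}
```

The sample sentence is 传智播客的课程, with 的 as the only stop word. With a dictionary that knows only 课程, the result is 传, 智, 播, 客, 课程. Once 传智播客 is added, as an entry in ext.dic would do, the result becomes 传智播客, 课程.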

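6. Querying the index

import java.io.File;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.SimpleFSDirectory;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class Search {

    @Test
    public void testIndexSearch() throws Exception{

        // Open the index for reading
        IndexReader reader = DirectoryReader.open(new SimpleFSDirectory(new File("d:\\index")));

        // Create the searcher
        IndexSearcher searcher = new IndexSearcher(reader);

        // Parse the search text into a query against the "content" field
        QueryParser parser = new QueryParser("content",new IKAnalyzer());
        Query query = parser.parse("不明觉厉");

        // Run the search
        // 1st argument: the query
        // 2nd argument: the maximum number of hits to return
        TopDocs topDocs = searcher.search(query, Integer.MAX_VALUE);

        //System.out.println(topDocs.totalHits); // number of matching documents
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        for(ScoreDoc scoreDoc:scoreDocs){
            System.out.println("score: "+scoreDoc.score);
            System.out.println("document id: "+scoreDoc.doc);

            // Load the stored fields of the hit and print each value
            Document document = searcher.doc(scoreDoc.doc);
            List<IndexableField> fields = document.getFields();
            for(IndexableField field:fields){
                System.out.println(field.stringValue());
            }
        }

        // Release the reader
        reader.close();
    }
}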

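7. Query API details

IndexSearcher: Lucene's index search object, used to run queries and sort results
Common methods:
- search(Query query, int n); // run a query
  - Argument 1: the query
  - Argument 2: the maximum number of hits to return
- search(Query query, int n, Sort sort);
  - Argument 1: the query
  - Argument 2: the maximum number of hits to return
  - Argument 3: the sort order
- doc(int id); // load a document by its document id
IndexReader: reads the index
Use DirectoryReader to open the index.
Query: the query object
Ways to obtain one:
- Through a query parser
  - Single-field parser: QueryParser
  - Multi-field parser: MultiFieldQueryParser
- Using Lucene's own Query implementations
  - Lucene provides five commonly used query types

TopDocs: the search result
Part 1: the total number of hits
- int topDocs.totalHits
Part 2: the array of scored documents
- ScoreDoc[] topDocs.scoreDocs;
ScoreDoc: a scored document
Part 1: the document id
- scoreDoc.doc
Part 2: the document's score
- scoreDoc.score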
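The relationship between the total hit count and the array of at most n scored documents can be sketched without Lucene. The classes `Hit` and `TopHits` below are hypothetical stand-ins for ScoreDoc and TopDocs, and the bounded heap is only an illustration of how the n best hits can be kept while every match is still counted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class TopHitsSketch {
    /** Stand-in for ScoreDoc: a document id plus its score. */
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    /** Stand-in for TopDocs: the total number of matches plus the best n hits. */
    static final class TopHits {
        final int totalHits;
        final List<Hit> hits;
        TopHits(int totalHits, List<Hit> hits) { this.totalHits = totalHits; this.hits = hits; }
    }

    // Best hit first: higher score wins, ties go to the lower document id.
    static final Comparator<Hit> BEST_FIRST =
            Comparator.comparingDouble((Hit h) -> -h.score).thenComparingInt(h -> h.doc);

    static TopHits top(List<Hit> matches, int n) {
        // The heap keeps its worst element at the head so it can be evicted cheaply.
        PriorityQueue<Hit> heap = new PriorityQueue<>(BEST_FIRST.reversed());
        for (Hit h : matches) {
            heap.offer(h);
            if (heap.size() > n) heap.poll();
        }
        List<Hit> best = new ArrayList<>(heap);
        best.sort(BEST_FIRST);
        return new TopHits(matches.size(), best);
    }

    public static void main(String[] args) {
        List<Hit> matches = Arrays.asList(
                new Hit(0, 0.2f), new Hit(1, 0.9f), new Hit(2, 0.5f), new Hit(3, 0.7f));
        TopHits top = top(matches, 2);
        System.out.println("totalHits=" + top.totalHits); // 4 matches in total
        for (Hit h : top.hits) {
            System.out.println("doc=" + h.doc + " score=" + h.score); // docs 1 and 3
        }
    }
}
```

This is why totalHits can be larger than scoreDocs.length when n is smaller than the number of matching documents.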

8. Query types

Preparation: (1) add a few more documents to the index

(2) extract a helper method that runs whatever query it is given

    public void mulipartQuery(Query query) throws Exception{

        // Open the index and create the searcher
        IndexReader reader = DirectoryReader.open(new SimpleFSDirectory(new File("D:\\index")));
        IndexSearcher searcher = new IndexSearcher(reader);

        // Run the query
        TopDocs topDocs = searcher.search(query, Integer.MAX_VALUE);

        // Print score, document id, title, and content of every hit
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        for (ScoreDoc scoreDoc : scoreDocs) {

            System.out.println("score: "+scoreDoc.score);
            System.out.println("document id: "+scoreDoc.doc);

            Document document = searcher.doc(scoreDoc.doc);

            System.out.println("title: "+document.getField("title").stringValue());
            System.out.println("content: "+document.getField("content").stringValue());
        }

        // Release the reader
        reader.close();
    }

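1) Term query

    // 1. Term query
    @Test
    public void testTermQuery() throws Exception{
        TermQuery query = new TermQuery(new Term("title","简介"));
        mulipartQuery(query);
    }

A term is an indivisible unit; think of it as a single token produced by the analyzer.

A term query does no analysis and no error correction, so the term text must match an indexed token exactly.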

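2) Fuzzy query

    // 2. Fuzzy query
    @Test
    public void testFuzzyQuery() throws Exception{
        // The term here is spelled correctly; a misspelling within the edit limit should match as well
        FuzzyQuery query = new FuzzyQuery(new Term("title","lucene"));
        mulipartQuery(query);
    }

Maximum number of edits: 2 (the allowed range is 0 to 2)
With 2 edits, a document is found as long as the query term can be turned into an indexed term by at most two single-character operations: replacing, deleting, or inserting one character.
Short terms tolerate fewer edits. The rule of thumb given here is a "more than half" rule: a term of four characters or fewer gets at most 1 edit, and a term of only two characters gets none.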
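The notion of "number of edits" can be made concrete with a small edit-distance function. This is a sketch of plain Levenshtein distance, not Lucene's FuzzyQuery code (which works differently internally and can also count swapping two adjacent characters as one edit); the class and method names are assumptions for illustration.

```java
class EditDistanceSketch {
    /** Minimum number of single-character substitutions, deletions, and insertions turning a into b. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j; // build b's prefix from nothing
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // delete all of a's prefix
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** True if the query term is within maxEdits of the indexed term. */
    static boolean withinEdits(String query, String indexed, int maxEdits) {
        return distance(query, indexed) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(distance("lucenx", "lucene"));       // 1: one substitution
        System.out.println(distance("lucen", "lucene"));        // 1: one insertion
        System.out.println(withinEdits("lxcxne", "lucene", 2)); // true: two substitutions
        System.out.println(withinEdits("kitten", "sitting", 2)); // false: needs 3 edits
    }
}
```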

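3) Wildcard query

    // 3. Wildcard query
    @Test
    public void testWildCardQuery() throws Exception{
        // This pattern contains no wildcard, so it only matches the exact term;
        // patterns such as "luc*" or "lucen?" exercise the wildcards described below
        WildcardQuery query = new WildcardQuery(new Term("title","lucene"));
        mulipartQuery(query);
    }

* : matches zero or more characters
? : matches exactly one character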
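The two wildcard rules can be sketched as a small matcher that tests a pattern against one indexed term. This illustrates the matching semantics only, under the assumption that no escape characters are involved; it is not how Lucene evaluates WildcardQuery, and the class name is hypothetical.

```java
class WildcardSketch {
    /** True if the whole term matches the pattern, where * is any run of characters and ? is one character. */
    static boolean matches(String pattern, String term) {
        // dp[i][j]: the first i pattern characters match the first j term characters
        boolean[][] dp = new boolean[pattern.length() + 1][term.length() + 1];
        dp[0][0] = true;
        for (int i = 1; i <= pattern.length(); i++) {
            char p = pattern.charAt(i - 1);
            if (p == '*') dp[i][0] = dp[i - 1][0]; // * may match the empty string
            for (int j = 1; j <= term.length(); j++) {
                if (p == '*') {
                    // either * matches nothing, or it absorbs one more term character
                    dp[i][j] = dp[i - 1][j] || dp[i][j - 1];
                } else if (p == '?' || p == term.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1];
                }
            }
        }
        return dp[pattern.length()][term.length()];
    }

    public static void main(String[] args) {
        System.out.println(matches("luc*", "lucene"));   // true
        System.out.println(matches("lucen?", "lucene")); // true
        System.out.println(matches("lucen?", "lucen"));  // false: ? needs exactly one character
    }
}
```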

4) Numeric range query

Preparation: index some numeric data in the Writer class

        // First document: numeric title 10
        Document document1 = new Document();
        document1.add(new IntField("title",10, Field.Store.YES));
        document1.add(new Field("content","10",fieldType));
        // Second document: numeric title 20
        Document document2 = new Document();
        document2.add(new IntField("title",20, Field.Store.YES));
        document2.add(new Field("content","20",fieldType));
        // Third document: numeric title 30
        Document document3 = new Document();
        document3.add(new IntField("title",30, Field.Store.YES));
        document3.add(new Field("content","30",fieldType));

        // Add the documents to the writer
        indexWriter.addDocument(document1);
        indexWriter.addDocument(document2);
        indexWriter.addDocument(document3);

    // 4. Numeric range query
    @Test
    public void testRangeQuery() throws Exception{
        // Argument 1: the field to query
        // Argument 2: the minimum value
        // Argument 3: the maximum value
        // Argument 4: whether the minimum value is included
        // Argument 5: whether the maximum value is included
        NumericRangeQuery<Integer> query = NumericRangeQuery.newIntRange(
                "title",10,30,true,true
        );
        mulipartQuery(query);
    }
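The effect of the two inclusive flags can be shown with a sorted map from value to document ids. This is only an illustration of the range semantics using the values 10, 20, and 30 from the sample data; Lucene indexes numeric fields with its own trie-based encoding, and the class `RangeLookupSketch` is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

class RangeLookupSketch {
    // value -> ids of the documents holding that value, kept in sorted order
    private final NavigableMap<Integer, List<Integer>> docsByValue = new TreeMap<>();

    void add(int docId, int value) {
        docsByValue.computeIfAbsent(value, k -> new ArrayList<>()).add(docId);
    }

    /** Document ids whose value lies between min and max, honoring the two inclusive flags. */
    List<Integer> range(int min, int max, boolean minInclusive, boolean maxInclusive) {
        List<Integer> hits = new ArrayList<>();
        if (min > max) return hits; // an inverted range matches nothing
        for (List<Integer> docs : docsByValue.subMap(min, minInclusive, max, maxInclusive).values()) {
            hits.addAll(docs);
        }
        return hits;
    }

    public static void main(String[] args) {
        RangeLookupSketch index = new RangeLookupSketch();
        index.add(0, 10);
        index.add(1, 20);
        index.add(2, 30);
        System.out.println(index.range(10, 30, true, true));   // [0, 1, 2]
        System.out.println(index.range(10, 30, false, false)); // [1]
    }
}
```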

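5) Boolean (combined) query

    // 5. Boolean query
    @Test
    public void testBooleanQuery() throws Exception{
        BooleanQuery query = new BooleanQuery();
        NumericRangeQuery<Integer> rangeQuery = NumericRangeQuery.newIntRange(
                "title",10,20,true,true
        );
        // Exclude documents whose content contains the term "solr"
        query.add(new TermQuery(new Term("content","solr")), BooleanClause.Occur.MUST_NOT);
        // Match documents whose numeric title is between 10 and 20
        query.add(rangeQuery, BooleanClause.Occur.SHOULD);
        mulipartQuery(query);
    }

A boolean query has no condition of its own. Its purpose is to combine other queries as clauses so that several conditions apply at once.
MUST: required. Every result must match this clause.
MUST_NOT: prohibited. No result may match this clause. A boolean query made up only of MUST_NOT clauses returns nothing.
SHOULD: optional. Documents that match the clause are included and the clause does not restrict the other clauses. When the query has no MUST clause, a document must match at least one SHOULD clause.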
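The way the three clause types combine can be sketched with sets of document ids. This models only which documents match, not scoring, and it is not Lucene's implementation; the class `BooleanClauseSketch` and the document ids in the example are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class BooleanClauseSketch {
    /** Each set holds the ids of the documents that match one clause. */
    static Set<Integer> combine(List<Set<Integer>> must,
                                List<Set<Integer>> should,
                                List<Set<Integer>> mustNot) {
        Set<Integer> result = new TreeSet<>();
        if (!must.isEmpty()) {
            // With MUST clauses, a document has to match all of them; SHOULD is optional.
            result.addAll(must.get(0));
            for (Set<Integer> m : must) result.retainAll(m);
        } else {
            // Without MUST clauses, a document has to match at least one SHOULD clause.
            for (Set<Integer> s : should) result.addAll(s);
        }
        // MUST_NOT only removes documents; it never adds any.
        for (Set<Integer> n : mustNot) result.removeAll(n);
        return result;
    }

    public static void main(String[] args) {
        // Mirrors the query above: SHOULD title in [10, 20], MUST_NOT content contains "solr".
        Set<Integer> inRange = new TreeSet<>(Arrays.asList(1, 2));
        Set<Integer> containsSolr = new TreeSet<>(Arrays.asList(2, 3));
        List<Set<Integer>> none = Collections.emptyList();
        System.out.println(combine(none, Arrays.asList(inRange), Arrays.asList(containsSolr))); // [1]
        System.out.println(combine(none, none, Arrays.asList(containsSolr)));                   // []
    }
}
```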

9. Updating the index

    @Test
    public void updateIndex() throws Exception{

        // Open the index with the IK analyzer (default open mode: CREATE_OR_APPEND)
        Directory directory = new SimpleFSDirectory(new File("D:\\index"));
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_2,new IKAnalyzer());

        IndexWriter writer = new IndexWriter(directory, config);

        FieldType fieldType = new FieldType();
        fieldType.setStored(true);    // keep the original value in the index
        fieldType.setTokenized(true); // run the value through the analyzer
        fieldType.setIndexed(true);   // make the field searchable

        // The replacement document
        Document document = new Document();
        document.add(new Field("title","lucene简介",fieldType));
        document.add(new Field("content","lucene是一个全文检索的工具包,是solr的底层原理",fieldType));
        // Delete every document whose title contains the term "lucene", then add the new document
        writer.updateDocument(new Term("title","lucene"),document);

        writer.commit();
        writer.close();

    }

10. Deleting from the index

    @Test
    public void delIndex() throws Exception{

        Directory directory = new SimpleFSDirectory(new File("D:\\index"));
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_2,new IKAnalyzer());

        IndexWriter writer = new IndexWriter(directory, config);

        // Delete only the documents whose title contains the term "lucene":
        //writer.deleteDocuments(new Term("title","lucene"));
        // Delete every document in the index
        writer.deleteAll();

        writer.commit();
        writer.close();

    }