Full-Text Search with Lucene

OceanStar's Study Notes

Original source: https://www.bilibili.com/video/av67100294?from=search&seid=15179619786696932625

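What is full-text search

Types of data

1) Structured data

Fixed format, fixed length, fixed data types.
Example: data stored in a database.

2) Unstructured data

No fixed format, no fixed length, no fixed data types.
Examples: Word documents, PDF documents, HTML.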

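Querying data

1) Querying structured data

SQL statements: simple and fast.

2) Querying unstructured data

Sequential scan: a program reads each document into memory and then matches strings against its content.
Full-text search: turn the unstructured data into structured data. For example, split the text on spaces to get a list of words, build an index from that word list, then query the index and use the word-to-document mapping to find the list of matching documents.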

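Full-text search

Building an index first and then searching that index is called full-text search.
An index is created once and can be used many times.

Use cases

Search engines such as Baidu and Google, forum site search, and e-commerce site search.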

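How Lucene implements full-text search

What is Lucene

Lucene is an open-source full-text search engine toolkit from Apache. It provides a complete query engine and indexing engine, plus part of a text analysis engine.

It is a full-text search toolkit written in Java.

Indexing and search flow diagram

[Figure: indexing and search flow diagram]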

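Creating an index

Indexing processes the document content that users will search and stores the resulting index in an index repository.

Obtaining the original documents

Original documents: whatever data the search has to run over is the set of original documents.

Search engines: use a crawler to fetch the original documents.
Site search: the data in a database.
This example: read documents from disk directly with I/O streams.

Creating document objects

The original content is obtained in order to index it. Before indexing, the content must be turned into documents (Document).

One Document object is created for each original document.
Each Document object contains multiple fields (Field).
The fields hold the data of the original document.
Each document has a unique number, the document id.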

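Analyzing documents

1. Split the string on spaces to get a list of words
2. Convert every word to lowercase
3. Remove punctuation
4. Remove stop words (words that carry no meaning)

Each keyword is wrapped in a Term object.
A Term has two parts:

The field the keyword belongs to
The keyword itself

The same keyword extracted from different fields is a different Term.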
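
The four analysis steps and the rule that a term is a field plus a keyword can be sketched in plain Java. This is only a rough illustration, not how Lucene's analyzers are implemented: the class name AnalysisSketch, the tiny stop-word list, and the "field:text" string that stands in for a Term are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: a hand-rolled analysis step, not Lucene's Analyzer API.
final class AnalysisSketch {

    // Hypothetical stop-word list; real analyzers ship their own.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "is"));

    // A term is the pair (field, text): equal text in different fields is a different term.
    static String term(String field, String text) {
        return field + ":" + text;
    }

    static List<String> analyze(String field, String content) {
        List<String> terms = new ArrayList<>();
        // Step 1: split on whitespace.
        for (String raw : content.split("\\s+")) {
            // Step 2: lowercase. Step 3: strip ASCII punctuation.
            String word = raw.toLowerCase(Locale.ROOT).replaceAll("\\p{Punct}", "");
            // Step 4: drop stop words (and tokens that were only punctuation).
            if (word.isEmpty() || STOP_WORDS.contains(word)) {
                continue;
            }
            terms.add(term(field, word));
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [content:spring, content:framework, content:model]
        System.out.println(analyze("content", "The Spring Framework, and a Model!"));
        // prints [filename:spring], a different term from content:spring
        System.out.println(analyze("filename", "Spring"));
    }
}
```

Keeping the field name inside the term is what lets a later query restrict itself to one field.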

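Creating the index

Build an index from the keyword list and save it to the index repository.

The index repository contains:

The index
The Document objects
The mapping between keywords and documents

Note: the index is built over individual words, and documents are found through words. This index structure is called an inverted index.

The traditional approach is to open each file, read its content, and match the keyword against that content. This is a sequential scan, and with a large amount of data the search is slow.

An inverted index works the other way round: it goes from content to documents.

An inverted index, also called a reverse index, has two parts: the index and the documents. The index is the vocabulary, which is relatively small, while the document collection is large.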
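
To make the word-to-documents direction concrete, here is a minimal in-memory sketch in plain Java. It only illustrates the idea and says nothing about Lucene's actual index format: the class InvertedIndexSketch, its method names, and the sample sentences are assumptions made for this example.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only: a toy in-memory inverted index, not Lucene's on-disk format.
final class InvertedIndexSketch {

    // Vocabulary: each word maps to the sorted ids of the documents that contain it.
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    void addDocument(int docId, String content) {
        for (String word : content.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Lookup goes from word to documents; no document text is scanned at query time.
    SortedSet<Integer> search(String word) {
        SortedSet<Integer> hits = postings.get(word.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.<Integer>emptySortedSet() : hits;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(1, "lucene is a search library");
        index.addDocument(2, "spring is a framework");
        index.addDocument(3, "lucene and spring");
        System.out.println(index.search("lucene")); // [1, 3]
        System.out.println(index.search("spring")); // [2, 3]
        System.out.println(index.search("solr"));   // []
    }
}
```

The vocabulary grows with the number of distinct words rather than with the total amount of text, which is why it stays small relative to the document collection.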

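Querying the index

User query interface

This is where the user enters the search criteria.
Lucene does not provide a way to build the search UI; you develop the search interface to fit your own requirements.

Creating the query

Wrap the keyword in a query object, which specifies:

The field to search
The keyword to search for

For example:
The syntax "fileName:lucene" searches for documents whose fileName field contains "lucene".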
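
As a small illustration of the "field:keyword" form, the sketch below splits such an expression into its two parts. It is a simplification under stated assumptions, not Lucene's query parser: the class FieldQuerySketch and the defaultField fallback are hypothetical, and the real query syntax supports far more than a single field and keyword.

```java
// Illustrative only: splits a "field:keyword" expression into its two parts.
final class FieldQuerySketch {
    final String field;
    final String keyword;

    FieldQuerySketch(String field, String keyword) {
        this.field = field;
        this.keyword = keyword;
    }

    // defaultField is a hypothetical fallback used when the expression names no field.
    static FieldQuerySketch parse(String expression, String defaultField) {
        int colon = expression.indexOf(':');
        if (colon < 0) {
            return new FieldQuerySketch(defaultField, expression.trim());
        }
        return new FieldQuerySketch(
                expression.substring(0, colon).trim(),
                expression.substring(colon + 1).trim());
    }

    public static void main(String[] args) {
        FieldQuerySketch q = parse("fileName:lucene", "content");
        System.out.println(q.field + " -> " + q.keyword); // fileName -> lucene
    }
}
```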

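Executing the query

Use the keyword to find the matching index entry, and from it the list of documents linked to that entry.

Rendering the results

Look up the Document objects by document id, then apply keyword highlighting, pagination, and so on.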

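Setting up the development environment

Download

Official site: http://lucene.apache.org/
Version: lucene-7.4.0
JDK requirement: 1.8 or later

Creating the project

1. Create an empty Java project

Choose a Project SDK of version 1.8 or later
Create a module

2. Add the jar files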

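Creating the index

    import java.io.File;
    import org.apache.commons.io.FileUtils;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.junit.Test;

    @Test
    public void createIndex() throws Exception {
        // Directory where the index is stored
        Directory directory = FSDirectory.open(new File("E:\\workspace\\index").toPath());

        // The no-argument IndexWriterConfig uses StandardAnalyzer
        IndexWriterConfig config = new IndexWriterConfig();

        // Directory that holds the original documents
        File dir = new File("F:\\学习资料备份\\java\\黑马(1)\\讲义+笔记+资料\\流行框架\\61.会员版(2.0)-就业课(2.0)-Lucene\\lucene\\02.参考资料\\searchsource");
        File[] files = dir.listFiles();
        if (files == null) {
            throw new IllegalStateException("Not a readable directory: " + dir);
        }

        // try-with-resources closes the IndexWriter even if indexing fails
        try (IndexWriter indexWriter = new IndexWriter(directory, config)) {
            for (File f : files) {
                String fileName = f.getName();
                String fileContent = FileUtils.readFileToString(f, "UTF-8");
                String filePath = f.getPath();
                long fileSize = FileUtils.sizeOf(f);

                // Create one field per piece of file data
                Field fileNameField = new TextField("filename", fileName, Field.Store.YES);
                Field fileContentField = new TextField("content", fileContent, Field.Store.YES);
                Field filePathField = new TextField("path", filePath, Field.Store.YES);
                Field fileSizeField = new TextField("size", fileSize + "", Field.Store.YES);

                // Create the Document object and add the fields to it
                Document document = new Document();
                document.add(fileNameField);
                document.add(fileContentField);
                document.add(filePathField);
                document.add(fileSizeField);

                // Analyze the document and write it to the index
                indexWriter.addDocument(document);
            }
        }
    }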

Inspecting the index files with Luke

[Figure: the index opened in Luke]

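Querying the index

    import java.io.File;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.junit.Test;

    @Test
    public void searchIndex() throws Exception {
        // Open the same index directory that createIndex wrote to
        Directory directory = FSDirectory.open(new File("E:\\workspace\\index").toPath());

        IndexReader indexReader = DirectoryReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);

        // Search the filename field for the term "apache"
        Query query = new TermQuery(new Term("filename", "apache"));

        // First argument: the query; second argument: the maximum number of hits to return
        TopDocs topDocs = indexSearcher.search(query, 10);
        System.out.println("Total hits: " + topDocs.totalHits);

        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            // scoreDoc.doc is the document id; use it to load the Document object
            Document document = indexSearcher.doc(scoreDoc.doc);
            System.out.println(document.get("filename"));
            // System.out.println(document.get("content"));
            System.out.println(document.get("path"));
            System.out.println(document.get("size"));
            System.out.println("-------------------------");
        }

        indexReader.close();
    }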

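Analyzers

Standard analyzer

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.junit.Test;

    @Test
    public void testTokenStream() throws Exception {
        // Create a standard analyzer
        Analyzer analyzer = new StandardAnalyzer();

        // First argument: the field name (an arbitrary placeholder here); second: the text to analyze
        TokenStream tokenStream = analyzer.tokenStream("aaabbb",
                "The Spring Framework provides a comprehensive programming and configuration model。我们要学好spring全家桶");
        // Attribute that exposes the text of the current token
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        // Attribute that records the start and end offset of the current token
        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
        // reset() must be called before the first incrementToken()
        tokenStream.reset();
        // incrementToken() returns false once the stream has no more tokens
        while (tokenStream.incrementToken()) {
            // Start offset of the token
            // System.out.println("start->" + offsetAttribute.startOffset());
            // The token text
            System.out.println(charTermAttribute);
            // End offset of the token
            // System.out.println("end->" + offsetAttribute.endOffset());
        }
        // end() finishes the stream before it is closed
        tokenStream.end();
        tokenStream.close();
    }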

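Chinese analyzers

Chinese analyzers bundled with Lucene

StandardAnalyzer:
Single-character tokenization: Chinese text is split one character at a time. For example, "我爱中国"
becomes "我", "爱", "中", "国".
SmartChineseAnalyzer:
Handles Chinese fairly well but is hard to extend; extension dictionaries, stop-word dictionaries, and synonym dictionaries are awkward to add.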
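
The single-character effect described for StandardAnalyzer can be imitated in a few lines of plain Java. This sketch only reproduces the visible result for the sample phrase 我爱中国 and is not how StandardAnalyzer is implemented; the class name SingleCharTokenizerSketch is an assumption, and the phrase is written with Unicode escapes so the file compiles under any source encoding.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: mimics the one-token-per-character result, not Lucene's tokenizer.
final class SingleCharTokenizerSketch {

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Iterate by code point so characters outside the BMP stay intact.
        text.codePoints()
            .filter(cp -> !Character.isWhitespace(cp))
            .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    public static void main(String[] args) {
        // The escapes spell the four-character sample phrase from the text.
        List<String> tokens = tokenize("\u6211\u7231\u4E2D\u56FD");
        System.out.println(tokens.size()); // 4
        System.out.println(tokens);
    }
}
```

Because every character becomes its own token, a search for a two-character word has to be matched as two separate single-character terms, which is why a dictionary-based analyzer is usually preferred for Chinese.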

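IKAnalyzer

Add the jar file
Add the configuration file and the extension dictionaries to the classpath

Note: hotword.dic and ext_stopword.dic must be encoded as UTF-8 without a BOM.
In other words, do not edit the extension dictionary files with Windows Notepad.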

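    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.junit.Test;
    import org.wltea.analyzer.lucene.IKAnalyzer;

    @Test
    public void testTokenStream() throws Exception {
        // Create an IKAnalyzer instead of the standard analyzer
        Analyzer analyzer = new IKAnalyzer();

        // First argument: the field name (an arbitrary placeholder here); second: the text to analyze
        TokenStream tokenStream = analyzer.tokenStream("aaabbb",
                "Lucene是apache软件基金会4 jakarta项目组的一个子项目，是一个开放源代码的全文检索引擎工具包，但它不是一个完整的全文检索引擎，而是一个全文检索引擎的架构");
        // Attribute that exposes the text of the current token
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        // Attribute that records the start and end offset of the current token
        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
        // reset() must be called before the first incrementToken()
        tokenStream.reset();
        // incrementToken() returns false once the stream has no more tokens
        while (tokenStream.incrementToken()) {
            // Start offset of the token
            // System.out.println("start->" + offsetAttribute.startOffset());
            // The token text
            System.out.println(charTermAttribute);
            // End offset of the token
            // System.out.println("end->" + offsetAttribute.endOffset());
        }
        // end() finishes the stream before it is closed
        tokenStream.end();
        tokenStream.close();
    }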

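To use IKAnalyzer when indexing with Lucene, pass an instance of it to IndexWriterConfig, as in new IndexWriterConfig(new IKAnalyzer()).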

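Maintaining the index

Field attributes

Analyzed: whether the field's content is tokenized. This only matters if the field's content is going to be searched.
Indexed: whether the analyzed tokens, or the whole field value, are put into the index. Only indexed content can be found by a search.
For example, product name and product description are analyzed and then indexed; order number and ID card number are not analyzed but are still indexed, because all of these will be used as search criteria.
Stored: whether the field value is stored in the document. Only stored fields can be read back from a Document.
For example, product name and order number: any field you will later read from the Document must be stored.

Rule of thumb for storing: store a field if its content has to be shown to the user.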

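Field class	Data type	Analyzed	Indexed	Stored	Description
StringField(FieldName, FieldValue, Store.YES)	String	N	Y	Y/N	Builds a string field that is not analyzed; the whole string is indexed as one value (for example an order number or a name). Store.YES or Store.NO decides whether the value is stored in the document.
LongPoint(String name, long... point)	long	Y	Y	N	LongPoint, IntPoint, and similar classes index numeric values so that they can be searched. They do not store the value; to store it as well, add a StoredField.
StoredField(FieldName, FieldValue)	Overloaded, supports several types	N	N	Y	Builds fields of various types that are not analyzed and not indexed, but whose value is stored in the document.
TextField(FieldName, FieldValue, Store.NO) or TextField(FieldName, reader)	String or stream	Y	Y	Y/N	If a Reader is passed, Lucene assumes the content is large and does not store it.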
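
The interplay of the three flags can be modelled with a toy class in plain Java. This is only a sketch of the behaviour summarised in the table, not Lucene's Field API: FieldFlagsSketch, its methods, and the sample field names and values are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: a toy model of the analyzed/indexed/stored flags.
final class FieldFlagsSketch {
    // "field:token" -> ids of documents containing that token in that field.
    private final Map<String, Set<Integer>> index = new HashMap<>();
    // Per document: only the values of fields that were added with stored = true.
    private final List<Map<String, String>> storedValues = new ArrayList<>();

    int newDocument() {
        storedValues.add(new HashMap<>());
        return storedValues.size() - 1;
    }

    void addField(int docId, String name, String value,
                  boolean analyzed, boolean indexed, boolean stored) {
        if (indexed) {
            // Analyzed fields are split into tokens; others are indexed as one whole value.
            String[] tokens = analyzed
                    ? value.toLowerCase(Locale.ROOT).split("\\s+")
                    : new String[] {value};
            for (String token : tokens) {
                index.computeIfAbsent(name + ":" + token, k -> new TreeSet<>()).add(docId);
            }
        }
        if (stored) {
            storedValues.get(docId).put(name, value);
        }
    }

    boolean matches(int docId, String name, String token) {
        Set<Integer> ids = index.get(name + ":" + token);
        return ids != null && ids.contains(docId);
    }

    // Returns null when the field was not stored, even if it is searchable.
    String storedValue(int docId, String name) {
        return storedValues.get(docId).get(name);
    }

    public static void main(String[] args) {
        FieldFlagsSketch sketch = new FieldFlagsSketch();
        int doc = sketch.newDocument();
        sketch.addField(doc, "name", "Lucene in Action", true, true, true);
        sketch.addField(doc, "orderNo", "A-1001", false, true, true);
        sketch.addField(doc, "content", "full text search", true, true, false);
        sketch.addField(doc, "path", "/docs/index.html", false, false, true);

        System.out.println(sketch.matches(doc, "name", "lucene"));           // true
        System.out.println(sketch.matches(doc, "orderNo", "A-1001"));        // true
        System.out.println(sketch.storedValue(doc, "content"));              // null
        System.out.println(sketch.matches(doc, "path", "/docs/index.html")); // false
    }
}
```

The content field is searchable but cannot be displayed, while the path field can be displayed but not searched; these correspond to the Store.NO and StoredField rows of the table.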


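import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;
import org.junit.Before;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class lucene_manager {
    private IndexWriter indexWriter;

    // Runs before each test: open an IndexWriter that analyzes text with IKAnalyzer
    @Before
    public void init() throws Exception {
        indexWriter = new IndexWriter(
                FSDirectory.open(new File("E:\\workspace\\index").toPath()),
                new IndexWriterConfig(new IKAnalyzer()));
    }

    /*
     * Add a document
     */
    @Test
    public void addDocument() throws Exception {
        Document document = new Document();
        document.add(new TextField("filename", "新添加的文档", Field.Store.YES));
        document.add(new TextField("content", "新添加的文档的内容", Field.Store.NO));
        // LongPoint makes the value searchable; StoredField makes it retrievable
        document.add(new LongPoint("size", 1000L));
        document.add(new StoredField("size", 1000L));
        document.add(new StoredField("path", "E:\\docs\\index.html"));

        // Add the document to the index
        indexWriter.addDocument(document);

        indexWriter.close();
    }

    /*
     * Delete all documents: removes all index data in the index directory
     * permanently; it cannot be recovered. Use with great care!
     */
    @Test
    public void deleteAllDocument() throws Exception {
        indexWriter.deleteAll();
        indexWriter.close();
    }

    /*
     * Delete the documents matched by a query
     */
    @Test
    public void deleteIndexByQuery() throws Exception {
        indexWriter.deleteDocuments(
                new TermQuery(new Term("filename", "apache"))
        );
        indexWriter.close();
    }

    /*
     * Update the index: delete first, then add
     */
    @Test
    public void updateIndex() throws Exception {
        Document document = new Document();

        document.add(new TextField("filename", "要更新的文档", Field.Store.YES));
        document.add(new TextField("content", " Lucene 简介 Lucene 是一个基于 Java 的全文信息检索工具包," +
                "它不是一个完整的搜索应用程序,而是为你的应用程序提供索引和搜索功能。",
                Field.Store.YES));

        // Deletes every document whose content field contains the term "java",
        // then adds the new document
        indexWriter.updateDocument(new Term("content", "java"), document);
        indexWriter.close();
    }
}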

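Querying the Lucene index

Build a Query object for what you want to search, and Lucene generates the final query syntax from it. Just as relational databases have SQL, Lucene has its own query syntax. For example, "name:lucene" searches for documents whose name field contains "lucene".

There are two ways to create a query object:

Use one of the Query subclasses that Lucene provides, such as TermQuery, which queries by term. TermQuery does not use an analyzer, so it is best suited to fields that are not tokenized, such as an order number or a category ID.
You specify the field to search and the keyword to search for.
Use QueryParser to parse a query expression: the input is tokenized first, and the search then runs on the resulting tokens. Note that this requires the jar that provides QueryParser.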
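
The practical difference between the two approaches shows up when the query text differs from what the analyzer produced at index time. The plain-Java sketch below imitates that behaviour; it is not Lucene code, and the class TermVsParsedQuerySketch, its stand-in analyzer (lowercase plus whitespace split), and the rule that a parsed query matches when any token matches are assumptions made for this example.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative only: why an unanalyzed term can miss text that was analyzed at index time.
final class TermVsParsedQuerySketch {

    private final Set<String> indexedTerms = new HashSet<>();

    // Stand-in for an analyzer: lowercase and split on whitespace.
    static String[] analyze(String text) {
        return text.trim().toLowerCase(Locale.ROOT).split("\\s+");
    }

    void index(String text) {
        for (String token : analyze(text)) {
            indexedTerms.add(token);
        }
    }

    // Like a term query: the keyword is looked up exactly as given.
    boolean termQuery(String keyword) {
        return indexedTerms.contains(keyword);
    }

    // Like a parsed query: the input is analyzed first, then each token is looked up.
    boolean parsedQuery(String input) {
        for (String token : analyze(input)) {
            if (indexedTerms.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        TermVsParsedQuerySketch sketch = new TermVsParsedQuerySketch();
        sketch.index("Apache Lucene Tutorial");
        System.out.println(sketch.termQuery("Apache"));        // false: the index holds "apache"
        System.out.println(sketch.termQuery("apache"));        // true
        System.out.println(sketch.parsedQuery("Apache Solr")); // true: "apache" matches
    }
}
```

This is why a term query suits values that are indexed whole, while free-text input from a user is better passed through a parser that uses the same analyzer as indexing.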

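import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class LuceneSearch {
    private IndexReader indexReader;
    private IndexSearcher indexSearcher;

    // Runs before each test: open the index for reading
    @Before
    public void init() throws Exception {
        indexReader = DirectoryReader.open(
                FSDirectory.open(new File("E:\\workspace\\index").toPath()));
        indexSearcher = new IndexSearcher(indexReader);
    }

    // Runs after each test: close the IndexReader
    @After
    public void close() throws Exception {
        indexReader.close();
    }

    /*
     * TermQuery: queries by term. TermQuery does not use an analyzer, so it is
     * best suited to fields that are not tokenized, such as an order number or
     * a category ID. You specify the field and the keyword to search for.
     */
    @Test
    public void testTermQuery() throws Exception {
        Query query = new TermQuery(new Term("filename", "apache"));
        printResult(query, indexSearcher);
    }

    /*
     * Numeric range query (both bounds are inclusive)
     */
    @Test
    public void testRangeQuery() throws Exception {
        Query query = LongPoint.newRangeQuery("size", 0L, 1000L);
        printResult(query, indexSearcher);
    }

    /*
     * Query with tokenization
     */
    @Test
    public void testQueryParser() throws Exception {
        // First argument: the default field to search; second argument: the analyzer
        QueryParser queryParser = new QueryParser("filename", new IKAnalyzer());
        Query query = queryParser.parse("全文检索");
        // Run the query
        printResult(query, indexSearcher);
    }

    private void printResult(Query query, IndexSearcher indexSearcher) throws Exception {
        // Run the query and keep at most 10 hits
        TopDocs topDocs = indexSearcher.search(query, 10);
        // Total number of matching documents
        System.out.println("Total hits: " + topDocs.totalHits);
        // Iterate over the hits
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            Document document = indexSearcher.doc(scoreDoc.doc);
            System.out.println(document.get("filename"));
            // System.out.println(document.get("content"));
            System.out.println(document.get("path"));
            System.out.println(document.get("size"));
        }
    }
}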