Full-Text Search with Lucene

龙源lll

Category: Elasticsearch  Tags: lucene

Article link: https://blog.csdn.net/Lzy410992/article/details/109787425

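Articles in this ElasticSearch series:

Full-Text Search with Lucene
ElasticSearch Overview
ElasticSearch Client Operations
Integrating the IK Analyzer with ElasticSearch
ElasticSearch Clusters: Concepts and Setup
Working with the Index from Java
Spring Data ElasticSearch Basics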

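What is full-text search

The data we deal with falls into two kinds, structured and unstructured:

Structured data: data with a fixed format or bounded length, such as database records and metadata.
Unstructured data: data with no fixed length or format, such as office documents, plain text, images, XML, HTML, reports, and audio/video.

Searching structured data is straightforward: rows in a database follow a regular layout with fixed formats and lengths, so an SQL statement returns results quickly. Searching unstructured data is harder, for example looking for files on Windows or finding pages by keyword in a browser. Because the content has no structure, the files themselves have to be scanned, and with a large number of files this can become very slow. How can this be solved?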

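Full-text search

Extract part of the information from the unstructured data and reorganize it so that it has some structure, then search that structured data instead, which is considerably faster. The extracted and reorganized information is called an index. Building an index first and then searching the index is called full-text search.

Building the index takes time, so for a single search, indexing first and then querying is not noticeably faster overall. An index can be reused once it is built, however, so the one-time cost of creating it is worth the speedup on every later query.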

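Where full-text search is used
Full-text search suits large volumes of data with no fixed structure, for example search engines such as Baidu and Google, forum site search, and e-commerce site search.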

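Implementing full-text search with Lucene

Lucene is an open-source full-text search toolkit from Apache. It is high-performance, scalable, and full-featured. Lucene provides a complete query engine and indexing engine, plus part of a text analysis engine, and we can use it to implement full-text search.

Full-text search has two broad phases. Indexing: extract information from structured and unstructured data and build an index from it. Search: run the user's query against the index and return the results.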

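Lucene follows the same two phases. In detail:

Creating the index:

Acquire the documents: decide which documents or data should be searchable.
Build document objects: create a Document object for each source document. Each Document holds several fields, and the fields hold the original document data. A field has a name and a value, and every document has a unique id.
Analyze the documents: give the document data structure, for example by splitting strings on whitespace, lowercasing words, and removing stop words, to obtain a list of terms.
Create the index: build an index from the term list and save it to the index directory. Because documents are looked up by term, this structure is called an inverted index.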
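
The analysis step can be sketched in plain Java, without Lucene. This is only an illustration of splitting on whitespace, lowercasing, and dropping stop words. The class name SimpleAnalysisSketch, the tiny stop-word list, and the punctuation stripping are assumptions made for the example, not Lucene's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class SimpleAnalysisSketch {
    // Hypothetical, deliberately tiny stop-word list for the example.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "and"));

    // Splits on whitespace, strips leading/trailing punctuation,
    // lowercases each word, and drops stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String term = raw.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "")
                             .toLowerCase(Locale.ROOT);
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [lucene, full-text, search, library]
        System.out.println(analyze("Lucene is a Full-Text search Library."));
    }
}
```

The resulting term list is what the indexing step consumes. A real analyzer also has to handle languages without spaces between words, which whitespace splitting cannot do.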

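Inverted index: the most common data structure in search engines. Words in the documents become keys, and each word is mapped to the documents that contain it, so looking up a word quickly returns the list of matching documents.
Tokenization: cutting a sentence or paragraph into words that each carry a definite meaning.
Stop word: a word with no specific meaning and little discriminating power.
Ranking: when searching for a keyword, placing the more relevant content first.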
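
To make the inverted index concrete, here is a minimal in-memory sketch in plain Java. It is not how Lucene stores its index. The class InvertedIndexSketch and its methods are hypothetical names for this example, and the tokenization (lowercase, split on non-word characters) assumes English text.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

class InvertedIndexSketch {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    // Adds a document and returns the id assigned to it.
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Looks up one term; no document text is scanned at query time.
    List<Integer> search(String term) {
        TreeSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        if (ids == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument("Lucene is a search library");       // doc 0
        index.addDocument("Spring is a Java framework");       // doc 1
        index.addDocument("Elasticsearch is built on Lucene"); // doc 2
        System.out.println(index.search("lucene")); // prints [0, 2]
        System.out.println(index.search("java"));   // prints [1]
    }
}
```

The cost of building the map is paid once when documents are added. Each search is then a single map lookup, which is the trade-off described earlier for full-text search.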

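Querying the index:

Wrap the keywords in a query object that names the field to search and the terms to look for.
Execute the query: look the terms up in the given field, then use the matching terms to find the documents.
Render the results: fetch the document objects by document id, highlight the keywords, paginate, and show the results to the user.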
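
The rendering step (highlighting and pagination) can also be sketched without Lucene. The following plain-Java example assumes the hits are already available as a ranked list of strings. ResultRenderingSketch, highlight, and page are hypothetical names, and wrapping matches in <em> tags is just one possible way to mark them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class ResultRenderingSketch {
    // Wraps every case-insensitive occurrence of the keyword in <em> tags.
    static String highlight(String text, String keyword) {
        Pattern pattern = Pattern.compile(Pattern.quote(keyword), Pattern.CASE_INSENSITIVE);
        return pattern.matcher(text).replaceAll("<em>$0</em>");
    }

    // Returns the hits on a 1-based page; an out-of-range page yields an empty list.
    static List<String> page(List<String> hits, int pageNumber, int pageSize) {
        if (pageNumber < 1 || pageSize < 1) {
            return new ArrayList<>();
        }
        int from = (pageNumber - 1) * pageSize;
        if (from >= hits.size()) {
            return new ArrayList<>();
        }
        int to = Math.min(from + pageSize, hits.size());
        return new ArrayList<>(hits.subList(from, to));
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList(
                "Lucene in Action", "Apache Lucene core", "lucene query syntax");
        // Second page with two hits per page, keyword highlighted.
        for (String hit : page(hits, 2, 2)) {
            System.out.println(highlight(hit, "lucene")); // prints <em>lucene</em> query syntax
        }
    }
}
```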

Using Lucene

Download Lucene: https://lucene.apache.org/

Project setup: create a Java project, add the jars, and create a test class.

Find the files that contain a given word among a set of text files

1. Create the index

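import java.io.File;

import org.apache.commons.io.FileUtils;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;

@Test
public void createIndex() throws Exception {
    // 1. Create a Directory that points at the index location (on disk here).
    // In-memory alternative (Lucene 7.x): Directory directory = new RAMDirectory();
    Directory directory = FSDirectory.open(new File("F:\\Intellij idea\\logs").toPath());

    // 2. Create an IndexWriter on the Directory; the default config uses StandardAnalyzer.
    //    try-with-resources closes the writer even if indexing fails.
    try (IndexWriter indexWriter = new IndexWriter(directory, new IndexWriterConfig())) {
        // 3. Read the files on disk; one Document is created per file
        File dir = new File("F:\\Intellij idea\\Lucene\\source");
        File[] files = dir.listFiles();
        if (files == null) {
            throw new IllegalStateException("Not a readable directory: " + dir);
        }
        for (File f : files) {
            if (!f.isFile()) {
                continue; // skip subdirectories
            }
            String fileName = f.getName();
            String filePath = f.getPath();
            String fileContent = FileUtils.readFileToString(f, "utf-8");
            long fileSize = FileUtils.sizeOf(f);

            // 4. Create the fields.
            // TextField(name, value, store): analyzed and indexed; stored if Store.YES
            Field fieldName = new TextField("name", fileName, Field.Store.YES);
            Field fieldContent = new TextField("content", fileContent, Field.Store.YES);
            // StoredField: stored only, neither analyzed nor indexed
            Field fieldPath = new StoredField("path", filePath);
            // LongPoint: indexed as a number for range queries; the value cannot be read back,
            // so a StoredField with the same name keeps it for display
            Field fieldSizeValue = new LongPoint("size", fileSize);
            Field fieldSizeStore = new StoredField("size", fileSize);

            // Create the document and add the fields to it
            Document document = new Document();
            document.add(fieldName);
            document.add(fieldPath);
            document.add(fieldContent);
            document.add(fieldSizeValue);
            document.add(fieldSizeStore);

            // 5. Write the document to the index
            indexWriter.addDocument(document);
        }
    } // 6. The IndexWriter is closed here
}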

2. Search the index

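import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

@Test
public void searchIndex() throws Exception {
    // 1. Create a Directory that points at the index location
    Directory directory = FSDirectory.open(new File("F:\\Intellij idea\\logs").toPath());
    // 2. Create an IndexReader
    IndexReader indexReader = DirectoryReader.open(directory);
    // 3. Create an IndexSearcher; its constructor takes the IndexReader
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // 4. Create a Query. A TermQuery is not analyzed, so the term must match
    //    an indexed token exactly (StandardAnalyzer lowercases tokens).
    Query query = new TermQuery(new Term("content", "void"));
    // 5. Run the query to get a TopDocs object
    //    Argument 1: the query; argument 2: the maximum number of hits to return
    TopDocs topDocs = indexSearcher.search(query, 10);
    // 6. Total number of matching documents
    System.out.println("Total hits: " + topDocs.totalHits);
    // 7. The list of returned hits
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;
    // 8. Print the stored fields of each hit
    for (ScoreDoc doc : scoreDocs) {
        // Document id
        int docId = doc.doc;
        // Fetch the document by id
        Document document = indexSearcher.doc(docId);
        System.out.println(document.get("name"));
        System.out.println(document.get("path"));
        System.out.println(document.get("size"));
        //System.out.println(document.get("content"));
        System.out.println("-----------------");
    }
    // 9. Close the IndexReader
    indexReader.close();
}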

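Analyzers

When indexing, Lucene uses the standard analyzer, StandardAnalyzer, by default. Each Document is passed to the Analyzer, which tokenizes its fields. IndexWriter then writes the resulting terms to the index, and the stored fields of the Document are written as well. Different analyzers produce different tokens, so the index they build differs too. For Chinese text, IKAnalyzer is a commonly used analyzer.

Testing an analyzer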

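import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

@Test
public void testTokenStream() throws Exception {
    // Analyzer.tokenStream returns a TokenStream that yields the tokens produced by analysis.

    // 1. Create an Analyzer, here a StandardAnalyzer
    Analyzer analyzer = new StandardAnalyzer();

    // 2. Get a TokenStream from the analyzer's tokenStream method
    TokenStream tokenStream = analyzer.tokenStream("", "Lucene是apache下的一个开放源代码的全文检索引擎工具包。");
    // 3. Register an attribute on the TokenStream; it acts like a cursor onto the current token
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    // 4. Call reset() before reading; skipping it throws an exception
    tokenStream.reset();
    // 5. Iterate over the tokens
    while (tokenStream.incrementToken()) {
        System.out.println(charTermAttribute.toString());
    }
    // 6. Signal the end of the stream, then close it
    tokenStream.end();
    tokenStream.close();
}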

Output (StandardAnalyzer emits each Chinese character as a separate token):

lucene
是
apache
下
的
一
个
开
放
源
代
码
的
全
文
检
索
引
擎
工
具
包

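Using IKAnalyzer

Add the IKAnalyzer jar to the project, and put its configuration file and extension dictionaries on the project classpath.
Note: never edit the extension dictionaries with Windows Notepad, and make sure they are encoded as UTF-8.
Extension dictionary: adds new words.
Stop-word dictionary: lists meaningless or sensitive words to drop.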
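
To show why a dictionary changes the tokens, here is a plain-Java sketch of forward maximum matching, a simple textbook segmentation technique. It is chosen purely for illustration and is not claimed to be IKAnalyzer's actual algorithm. MaxMatchSketch and its parameters are hypothetical, and unspaced English stands in for Chinese text to keep the example ASCII-only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MaxMatchSketch {
    // Forward maximum matching: at each position take the longest dictionary word
    // (up to maxWordLength characters); fall back to a single character if none matches.
    static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(pos + maxWordLength, text.length());
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("full", "text", "search"));
        // prints [full, text, search, e, n, g, i, n, e]
        System.out.println(segment("fulltextsearchengine", dictionary, 8));

        // Comparable to adding entries to an extension dictionary.
        dictionary.add("fulltext");
        dictionary.add("engine");
        // prints [fulltext, search, engine]
        System.out.println(segment("fulltextsearchengine", dictionary, 8));
    }
}
```

Text not covered by the dictionary falls back to single characters, much like the per-character tokens StandardAnalyzer produced for Chinese. Adding words to the dictionary yields longer, more meaningful tokens.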

Change the test code:

//Analyzer analyzer = new StandardAnalyzer();
Analyzer analyzer = new IKAnalyzer();

To use it when indexing, just swap the analyzer:

//IndexWriter indexWriter = new IndexWriter(directory, new IndexWriterConfig());
IndexWriterConfig config = new IndexWriterConfig(new IKAnalyzer());
IndexWriter indexWriter = new IndexWriter(directory, config);

Maintaining the index

1. Add documents
2. Delete documents (delete all, or delete by term)
3. Update documents (delete first, then add)
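
The "delete first, then add" behaviour of an update can be modelled in a few lines of plain Java. This is a toy model of the semantics only, not Lucene code. UpdateSemanticsSketch and its methods are hypothetical names, and documents are represented as simple field-name-to-text maps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class UpdateSemanticsSketch {
    // Each document is a map from field name to field text.
    private final List<Map<String, String>> documents = new ArrayList<>();

    void add(Map<String, String> document) {
        documents.add(document);
    }

    // Removes every document whose field contains the term as a whole word;
    // returns how many documents were removed.
    int delete(String field, String term) {
        int before = documents.size();
        documents.removeIf(doc -> containsTerm(doc.get(field), term));
        return before - documents.size();
    }

    // An update is modelled as delete-by-term followed by add.
    void update(String field, String term, Map<String, String> replacement) {
        delete(field, term);
        add(replacement);
    }

    int size() {
        return documents.size();
    }

    private static boolean containsTerm(String text, String term) {
        if (text == null) {
            return false;
        }
        return Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\W+")).contains(term);
    }

    // Convenience helper that builds a document with only a "name" field.
    static Map<String, String> doc(String name) {
        Map<String, String> d = new HashMap<>();
        d.put("name", name);
        return d;
    }

    public static void main(String[] args) {
        UpdateSemanticsSketch index = new UpdateSemanticsSketch();
        index.add(doc("spring mvc notes"));
        index.add(doc("spring boot notes"));
        index.add(doc("apache lucene notes"));
        index.update("name", "spring", doc("updated document"));
        System.out.println(index.size()); // prints 2: two "spring" documents removed, one added
    }
}
```

Two consequences follow from this model. An update whose term matches several documents replaces all of them with a single new document, and an update whose term matches nothing simply adds the new document.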

import java.io.File;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class IndexManager {

    private IndexWriter indexWriter;

    @Before
    public void init() throws Exception {
        // Create one IndexWriter per test, using IKAnalyzer as the analyzer
        indexWriter =
                new IndexWriter(FSDirectory.open(new File("F:\\Intellij idea\\logs").toPath()),
                        new IndexWriterConfig(new IKAnalyzer()));
    }

    @After
    public void close() throws Exception {
        // Close the writer after each test; this also commits pending changes
        indexWriter.close();
    }

    @Test
    public void addDocument() throws Exception {
        // Use the writer from init(); opening a second IndexWriter on the same
        // directory would fail because the index is already locked.
        Document document = new Document();
        // Add fields to the document
        document.add(new TextField("name", "新添加的文件", Field.Store.YES));
        document.add(new TextField("content", "新添加的文件内容", Field.Store.NO));
        document.add(new StoredField("path", "F:\\Intellij idea\\logs"));
        // Write the document to the index
        indexWriter.addDocument(document);
    }

    @Test
    public void deleteAllDocument() throws Exception {
        // Delete every document in the index
        indexWriter.deleteAll();
    }

    @Test
    public void deleteDocumentByQuery() throws Exception {
        // Delete every document whose "name" field contains the term "apache"
        indexWriter.deleteDocuments(new Term("name", "apache"));
    }

    @Test
    public void updateDocument() throws Exception {
        // Create the replacement document
        Document document = new Document();
        // Add fields to the document
        document.add(new TextField("name", "更新之后的文档", Field.Store.YES));
        document.add(new TextField("name1", "更新之后的文档2", Field.Store.YES));
        document.add(new TextField("name2", "更新之后的文档3", Field.Store.YES));
        // Update: delete the documents matching the term, then add the new document
        indexWriter.updateDocument(new Term("name", "spring"), document);
    }

}

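Searching the index

1. Using subclasses of Query
(1) TermQuery
Searches by term.
You specify the field to search and the term to look for.
(2) Range query
Searches a numeric range, created here with LongPoint.newRangeQuery.

2. Using QueryParser
QueryParser analyzes the query text first and then searches using the resulting terms.
It requires an additional jar:
lucene-queryparser-7.4.0.jar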

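import java.io.File;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class SearchIndex {
    private IndexReader indexReader;
    private IndexSearcher indexSearcher;

    @Before
    public void init() throws Exception {
        indexReader = DirectoryReader.open(FSDirectory.open(new File("F:\\Intellij idea\\logs").toPath()));
        indexSearcher = new IndexSearcher(indexReader);
    }

    @After
    public void close() throws Exception {
        // Close the reader once per test instead of inside printResult
        indexReader.close();
    }

    @Test
    public void testRangeQuery() throws Exception {
        // Match documents whose size is between 0 and 100 (inclusive).
        // This only finds documents whose "size" field was indexed as a LongPoint.
        Query query = LongPoint.newRangeQuery("size", 0L, 100L);
        printResult(query);
    }

    private void printResult(Query query) throws Exception {
        // Run the query
        TopDocs topDocs = indexSearcher.search(query, 10);
        System.out.println("Total hits: " + topDocs.totalHits);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        for (ScoreDoc doc : scoreDocs) {
            // Document id
            int docId = doc.doc;
            // Fetch the document by id
            Document document = indexSearcher.doc(docId);
            System.out.println(document.get("name"));
            System.out.println(document.get("path"));
            System.out.println(document.get("size"));
            //System.out.println(document.get("content"));
            System.out.println("-----------------");
        }
    }

    @Test
    public void testQueryParser() throws Exception {
        // Create a QueryParser
        // Argument 1: the default field to search; argument 2: the analyzer
        QueryParser queryParser = new QueryParser("name", new IKAnalyzer());
        // Use the QueryParser to build a Query from the query text
        Query query = queryParser.parse("lucene是一个Java开发的全文检索工具包");
        // Run the query
        printResult(query);
    }
}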
