Getting Started with Lucene

小煦仔

Category: lucene. Tags: lucene, elasticsearch, java, search engine, solr

Article link: https://blog.csdn.net/qq_39151461/article/details/111351662

Contents

I. Introduction to Lucene
II. Overview
- 1. Lucene components
- 2. Indexing and searching through Lucene API calls
III. Code walkthrough

I. Introduction to Lucene

Official Lucene website: https://lucene.apache.org/

Download archive for all versions:
http://archive.apache.org/dist/lucene/java/

System requirements for a given Lucene version (8.7.0 here):
https://lucene.apache.org/core/8_7_0/SYSTEM_REQUIREMENTS.html

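II. Overview

Lucene works in two phases, indexing and searching. The three things to understand are how the index is created, what the index itself is, and how it is searched.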

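1. Lucene components:

[Figure: Lucene components]

A document to be indexed is represented by a Document object.
IndexWriter adds documents to the index through its addDocument method; this is how the index gets built.
Lucene's index is an inverted index: it maps each term to the documents that contain it.
When a user makes a request, a Query object represents the user's query.
IndexSearcher searches the Lucene index through its search method.
IndexSearcher computes term weights and scores and returns the results to the user.
The documents returned to the user are represented by TopDocs, which a TopDocsCollector gathers during the search.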
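
To make the idea of an inverted index concrete, here is a minimal sketch in plain Java. It is a toy model only and not how Lucene stores its index: the class name InvertedIndexSketch and its methods are made up for illustration, and the "analysis" step is just lowercasing and splitting on non-alphanumeric characters.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy inverted index: term -> sorted set of ids of the documents containing it.
// Hypothetical illustration; Lucene's real index format is far more elaborate.
class InvertedIndexSketch {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private final List<String> documents = new ArrayList<>();

    // "Indexing": split the text into lowercase terms and record the document id under each term.
    int addDocument(String text) {
        int docId = documents.size();
        documents.add(text);
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // "Searching": a single-term lookup, roughly the question a TermQuery asks.
    List<Integer> search(String term) {
        TreeSet<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument("Lucene is a search library");
        index.addDocument("Spring Boot starter");
        index.addDocument("Search with Lucene and Luke");
        System.out.println(index.search("lucene")); // ids of the documents containing "lucene"
        System.out.println(index.search("solr"));   // no document contains "solr"
    }
}
```

The lookup never scans the document texts; it goes straight from the term to the list of matching document ids, which is what makes searching fast once the index has been built.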

2. Indexing and searching through Lucene API calls

[Figure: Lucene API calls during indexing and searching]

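III. Code walkthrough

Releases after Lucene 8.0 bundle Luke. While learning, you can download Lucene, unpack it, and find the launch scripts in the Lucene-8.**/luke directory; Luke makes it much easier to see what an index contains. Note that this Luke requires JDK 9 ("jdk1.9").

Project source code: lucene-feature-1218 branch

Runtime environment:
JDK version 1.8
Lucene version 8.7.0

1. Creating the Spring Boot project

Add the relevant configuration

Add the pom file: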

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<parent>
		<groupId>org.springframework.boot</groupId>
		<artifactId>spring-boot-starter-parent</artifactId>
		<version>2.4.1</version>
		<relativePath/> <!-- lookup parent from repository -->
	</parent>
	<groupId>com._520siyuan</groupId>
	<artifactId>lucene</artifactId>
	<version>0.0.1-SNAPSHOT</version>
	<name>lucene</name>
	<description>Demo project for Spring Boot</description>
	<packaging>jar</packaging>

	<properties>
		<java.version>1.8</java.version>
		<lucene.version>8.7.0</lucene.version>
	</properties>

	<dependencies>

		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot-starter-web</artifactId>
		</dependency>
		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot-starter</artifactId>
		</dependency>
		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot-starter-test</artifactId>
			<scope>test</scope>
			<exclusions>
				<exclusion>
					<groupId>org.junit.jupiter</groupId>
					<artifactId>junit-jupiter-api</artifactId>
				</exclusion>
				<exclusion>
					<groupId>org.junit.vintage</groupId>
					<artifactId>junit-vintage-engine</artifactId>
				</exclusion>
			</exclusions>
		</dependency>
		<dependency>
			<groupId>junit</groupId>
			<artifactId>junit</artifactId>
			<version>4.12</version>
			<scope>test</scope>
		</dependency>

		<!-- Query parsing against analyzed (tokenized) indexes -->
		<dependency>
			<groupId>org.apache.lucene</groupId>
			<artifactId>lucene-queryparser</artifactId>
			<version>${lucene.version}</version>
		</dependency>

		<!-- Highlighting -->
		<dependency>
			<groupId>org.apache.lucene</groupId>
			<artifactId>lucene-highlighter</artifactId>
			<version>${lucene.version}</version>
		</dependency>

		<!-- smartcn Chinese analyzer (SmartChineseAnalyzer); depends on Lucene and must match the Lucene version -->
		<dependency>
			<groupId>org.apache.lucene</groupId>
			<artifactId>lucene-analyzers-smartcn</artifactId>
			<version>${lucene.version}</version>
		</dependency>

		<!-- ik-analyzer Chinese analyzer -->
		<dependency>
			<groupId>cn.bestwu</groupId>
			<artifactId>ik-analyzers</artifactId>
			<version>5.1.0</version>
		</dependency>

		<!-- MMSeg4j analyzer -->
		<dependency>
			<groupId>com.chenlb.mmseg4j</groupId>
			<artifactId>mmseg4j-solr</artifactId>
			<version>2.4.0</version>
			<exclusions>
				<exclusion>
					<groupId>org.apache.solr</groupId>
					<artifactId>solr-core</artifactId>
				</exclusion>
			</exclusions>
		</dependency>
		<dependency>
			<groupId>commons-io</groupId>
			<artifactId>commons-io</artifactId>
			<version>2.0.1</version>
			<scope>test</scope>
		</dependency>

	</dependencies>


	<profiles>
		<profile>
			<id>dev</id>
			<properties>
				<env>dev</env>
			</properties>
			<activation>
				<activeByDefault>true</activeByDefault>
			</activation>
		</profile>

		<profile>
			<id>test</id>
			<properties>
				<env>test</env>
			</properties>
		</profile>

		<profile>
			<id>pro</id>
			<properties>
				<env>pro</env>
			</properties>
		</profile>
	</profiles>
	<build>
		<finalName>lucene</finalName>
		<resources>
			<resource>
				<directory>src/main/resources</directory>
				<filtering>true</filtering>
				<excludes>
					<exclude>pro/**</exclude>
					<exclude>dev/**</exclude>
					<exclude>test/**</exclude>
				</excludes>
			</resource>
			<resource>
				<directory>src/main/resources/${env}</directory>
				<filtering>true</filtering>
				<includes>
					<include>**/*.*</include>
				</includes>
			</resource>
		</resources>
		<plugins>
			<plugin>
				<groupId>org.springframework.boot</groupId>
				<artifactId>spring-boot-maven-plugin</artifactId>
				<executions>
					<execution>
						<id>repackage</id>
						<goals>
							<goal>repackage</goal>
						</goals>
					</execution>
				</executions>
				<configuration>
					<mainClass>com._520siyuan.lucene.LuceneApplication</mainClass>
				</configuration>
			</plugin>
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-compiler-plugin</artifactId>
				<!--<version>${maven.plugins.version}</version>-->
				<version>3.8.1</version>
				<configuration>
					<source>1.8</source>
					<target>1.8</target>
					<annotationProcessorPaths>
						<path>
							<groupId>org.mapstruct</groupId>
							<artifactId>mapstruct-processor</artifactId>
							<version>1.3.1.Final</version>
						</path>
					</annotationProcessorPaths>
				</configuration>
			</plugin>
		</plugins>
	</build>


</project>

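2. Creating the index

The indexing process is as follows:

Create an IndexWriter to write the index files. It takes several parameters: INDEX_DIR is the location where the index files are stored (hard-coded as F:\temp\index in the code below), and the Analyzer performs lexical analysis and language processing on the documents.
Create a Document to represent the document to be indexed.
Add the different Fields to the document. A document carries several kinds of information, such as title, author, modification time, and content, and each kind is represented by its own Field. The classic example this description follows indexes two kinds of information, the file path and the file content, and the SRC_FILE passed to FileReader is the source file to be indexed. The code below adds four fields: filename, content, path, and size.
IndexWriter calls addDocument to write the document into the index directory.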

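import java.io.File;
import org.apache.commons.io.FileUtils;
import org.apache.lucene.document.*;   // Document, Field, TextField, StoredField, LongPoint
import org.apache.lucene.index.*;      // IndexWriter, IndexWriterConfig, Term, ...
import org.apache.lucene.store.*;      // Directory, FSDirectory
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer; // package assumed from the cn.bestwu:ik-analyzers artifact

    // Helper shared by the tests below; like them, it lives inside a JUnit 4 test class.
    IndexWriter getIndexWriter() throws Exception {
        // Where the index is stored on disk: F:\temp\index
        Directory directory = FSDirectory.open(new File("F:\\temp\\index").toPath());
        // The index can also be kept in memory (RAMDirectory is deprecated in Lucene 8.x):
        // Directory directory = new ByteBuffersDirectory();
        // TODO: use a custom analyzer
        IndexWriterConfig config = new IndexWriterConfig(new IKAnalyzer());
        // The no-argument constructor would use StandardAnalyzer instead:
        // IndexWriterConfig config = new IndexWriterConfig();
        return new IndexWriter(directory, config);
    }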

    // Build the index
    @Test
    public void createIndex() throws Exception {
        // try-with-resources closes the IndexWriter (committing its changes) even on the early return
        try (IndexWriter indexWriter = getIndexWriter()) {
            // Directory holding the source documents
            File dir = new File("F:\\temp\\searchsource");
            File[] files = dir.listFiles();
            if (files == null) return;
            for (File f : files) {
                if (!f.isFile()) continue; // skip sub-directories
                // File name
                String fileName = f.getName();
                // File content, read with an explicit charset (UTF-8 is an assumption)
                String fileContent = FileUtils.readFileToString(f, "UTF-8");
                // File path
                String filePath = f.getPath();
                // File size
                long fileSize = FileUtils.sizeOf(f);

                Document document = new Document();
                // TextField arguments: field name, field value, whether to store the value.
                // A TextField is analyzed and indexed.
                document.add(new TextField("filename", fileName, Field.Store.YES));
                document.add(new TextField("content", fileContent, Field.Store.YES));
                // Path: not analyzed, not indexed, only stored
                document.add(new StoredField("path", filePath));
                // Size: LongPoint makes it searchable by range, StoredField makes it retrievable
                document.add(new LongPoint("size", fileSize));
                document.add(new StoredField("size", fileSize));
                // Add the document to the index
                indexWriter.addDocument(document);
            }
        }
    }

Inspecting the index with Luke:
[Figure: the index as shown in Luke]

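3. Searching the index

The search process is as follows:

IndexReader opens the index stored on disk; INDEX_DIR is the location where the index files are stored.
Create an IndexSearcher to run searches.
Create an Analyzer to perform lexical analysis and language processing on the query string.
Create a QueryParser to perform syntactic analysis on the query string.
QueryParser calls parse to analyze the syntax and build a query syntax tree, which is held in a Query.
IndexSearcher calls search on the Query and returns the results as TopDocs, which a TopScoreDocCollector gathers internally.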



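// Additional imports for the search methods:
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;     // IndexSearcher, Query, TermQuery, TopDocs, ScoreDoc

    // Obtain an IndexSearcher
    IndexSearcher getIndexSearcher() throws Exception {
        // Where the index is stored on disk: F:\temp\index
        Directory directory = FSDirectory.open(new File("F:\\temp\\index").toPath());
        // Open an IndexReader on the index
        IndexReader indexReader = DirectoryReader.open(directory);
        // Wrap it in an IndexSearcher
        return new IndexSearcher(indexReader);
    }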

    // Run the query and print the results
    private void printResult(Query query, IndexSearcher indexSearcher) throws Exception {
        // First argument: the query; second argument: the maximum number of results to return
        TopDocs topDocs = indexSearcher.search(query, 10);
        // Number of matching documents. In Lucene 8.x totalHits is a TotalHits object;
        // totalHits.relation tells whether value is exact or a lower bound.
        System.out.println("Total hits: " + topDocs.totalHits.value);
        // topDocs.scoreDocs holds the ids of the matching documents
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            // scoreDoc.doc is the document id; use it to load the stored fields of the document
            Document document = indexSearcher.doc(scoreDoc.doc);
            System.out.println(document.get("filename"));
            //System.out.println(document.get("content"));
            System.out.println(document.get("path"));
            System.out.println(document.get("size"));
            System.out.println("-------------------------");
        }
        // Close the IndexReader
        indexSearcher.getIndexReader().close();
    }
    
    // Search with a TermQuery
    @Test
    public void testTermQuery() throws Exception {
        IndexSearcher indexSearcher = getIndexSearcher();
        // A TermQuery is not analyzed, so the term must match an indexed token exactly
        Query query = new TermQuery(new Term("content", "lucene"));
        printResult(query, indexSearcher);
    }


    // Range search on a numeric field
    @Test
    public void testRangeQuery() throws Exception {
        IndexSearcher indexSearcher = getIndexSearcher();
        // Matches only documents whose "size" was indexed as a LongPoint
        Query query = LongPoint.newRangeQuery("size", 0L, 10000L);
        printResult(query, indexSearcher);
    }


    // Search with a QueryParser
    @Test
    public void testQueryParser() throws Exception {
        IndexSearcher indexSearcher = getIndexSearcher();
        // First argument: the default field to search
        // Second argument: the analyzer, which should match the one used at index time
        QueryParser queryParser = new QueryParser("content", new IKAnalyzer());
        // Query text: "Lucene is developed in Java"
        Query query = queryParser.parse("Lucene是java开发的");
        // Run the query
        printResult(query, indexSearcher);
    }

4. Tokenization with the standard analyzer

Related reading: Overview of Lucene Chinese analyzers and an Ik-Analyzer tutorial

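// Additional imports for this test:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

    @Test
    public void testTokenStream() throws Exception {
        // Create a standard analyzer
        Analyzer analyzer = new StandardAnalyzer();
        // Obtain a TokenStream
        // First argument: the field name, which can be anything here
        // Second argument: the text to analyze
        TokenStream tokenStream = analyzer.tokenStream("test", "The Spring Framework provides a comprehensive programming and configuration model.");
        // Attribute that exposes the text of the current token
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        // Attribute that exposes the start and end offsets of the current token
        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
        // reset() must be called before the first incrementToken()
        tokenStream.reset();
        // incrementToken() advances to the next token and returns false when the stream is exhausted
        while (tokenStream.incrementToken()) {
            // Start offset of the token
            System.out.println("start->" + offsetAttribute.startOffset());
            // Token text
            System.out.println(charTermAttribute);
            // End offset of the token
            System.out.println("end->" + offsetAttribute.endOffset());
        }
        // The TokenStream contract requires end() before close()
        tokenStream.end();
        tokenStream.close();
    }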
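
To show what "a term plus its start and end offsets" means without needing Lucene on the classpath, here is a small plain-Java sketch. It is a deliberately simplified stand-in, not StandardAnalyzer: it only lowercases and splits on characters that are not letters or digits. The names OffsetTokenizerSketch and Token are hypothetical. As with Lucene's OffsetAttribute, the end offset is one past the last character of the token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified stand-in for an analyzer; this is NOT Lucene's StandardAnalyzer.
class OffsetTokenizerSketch {

    // One token plus the character offsets it was cut from (end is exclusive).
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return term + "[" + start + "," + end + ")";
        }
    }

    // Walk the text once, emitting a lowercased token for every run of letters or digits.
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            tokens.add(new Token(text.substring(start, i).toLowerCase(Locale.ROOT), start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token token : tokenize("The Spring Framework provides")) {
            System.out.println(token);
        }
    }
}
```

The offsets point back into the original text, which is what lets a highlighter mark the matched words in the unmodified source string.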

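5. Maintaining the index

1. Adding documents to the index

Field attributes
Analyzed: whether the field's content is tokenized. This only matters for fields we want to search on.
Indexed: whether the analyzed terms, or the whole field value, are put into the index. Only indexed fields can be searched.
For example, product names and product descriptions are analyzed and then indexed; order numbers and ID-card numbers are not analyzed but are still indexed, because all of them will be used as search criteria.
Stored: whether the field value is stored with the document. Only stored fields can be read back from a Document.
For example, product names and order numbers: any field you will need to read from the Document later must be stored.
Rule of thumb for storing: store the field if its content has to be shown to the user.

[Figure: Field attributes]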

    // Add a document to the index
    @Test
    public void addDocument() throws Exception {
        // Index location
        Directory directory = FSDirectory.open(new File("F:\\temp\\index").toPath());
        File file = new File("F:\\temp\\1.txt");
        // File content, read with an explicit charset (UTF-8 is an assumption)
        String fileContent = FileUtils.readFileToString(file, "UTF-8");
        // File path
        String filePath = file.getPath();
        // File size
        long fileSize = FileUtils.sizeOf(file);
        IndexWriterConfig config = new IndexWriterConfig(new IKAnalyzer());
        // try-with-resources closes the IndexWriter even if adding the document fails
        try (IndexWriter indexWriter = new IndexWriter(directory, config)) {
            // Create a Document and add fields to it.
            // Different documents may have different fields, and one document may hold several fields with the same name.
            Document document = new Document();
            // filename value: "newly added document"
            document.add(new TextField("filename", "新添加的文档", Field.Store.YES));
            document.add(new TextField("content", fileContent, Field.Store.NO));
            // LongPoint indexes the value so it can be searched by range
            document.add(new LongPoint("size", fileSize));
            // StoredField stores the value so it can be read back
            document.add(new StoredField("size", fileSize));
            // Values that need no index are kept in a StoredField only
            document.add(new StoredField("path", filePath));
            // Add the document to the index
            indexWriter.addDocument(document);
        }
    }

2. Deleting from the index

    // Delete every document in the index
    @Test
    public void deleteAllIndex() throws Exception {
        IndexWriter indexWriter = getIndexWriter();
        // Delete all documents
        indexWriter.deleteAll();
        // Close the IndexWriter
        indexWriter.close();
    }

    // Delete the documents that match a query
    @Test
    public void deleteIndexByQuery() throws Exception {
        IndexWriter indexWriter = getIndexWriter();
        // Build the query that selects the documents to delete
        Query query = new TermQuery(new Term("filename", "apache"));
        // Delete every document matching the query
        indexWriter.deleteDocuments(query);
        // Close the IndexWriter
        indexWriter.close();
    }

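3. Updating the index

    // Update the index: under the hood this is a delete followed by an add
    @Test
    public void updateIndex() throws Exception {
        IndexWriter indexWriter = getIndexWriter();
        // Create a Document and add fields to it.
        // Different documents may have different fields, and one document may hold several fields with the same name.
        Document document = new Document();
        // filename value: "document to be updated"
        document.add(new TextField("filename", "要更新的文档", Field.Store.YES));
        // content value: "Introduction to Lucene. Lucene is a Java-based full-text retrieval toolkit; it is not a
        // complete search application but gives your application indexing and search capabilities."
        document.add(new TextField("content", " Lucene 简介 Lucene 是一个基于 Java 的全文信息检索工具包," +
                "它不是一个完整的搜索应用程序,而是为你的应用程序提供索引和搜索功能。",
                Field.Store.YES));
        // Deletes every document whose content field contains the term "java", then adds the new document
        indexWriter.updateDocument(new Term("content", "java"), document);
        // Close the IndexWriter
        indexWriter.close();
    }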
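
The "delete, then add" behaviour has a consequence that is easy to miss: every document containing the term is removed, but only one replacement is added. The sketch below models just that semantics in plain Java. It is not Lucene code; the class UpdateSemanticsSketch and its whitespace-based term matching are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Toy model of "update = delete by term, then add". Hypothetical; not Lucene code.
class UpdateSemanticsSketch {
    private final List<String> docs = new ArrayList<>();

    void addDocument(String content) {
        docs.add(content);
    }

    // Removes EVERY document containing the term, adds the replacement once,
    // and returns how many documents were deleted.
    int updateDocument(String term, String replacement) {
        int before = docs.size();
        docs.removeIf(doc -> containsTerm(doc, term));
        int deleted = before - docs.size();
        docs.add(replacement);
        return deleted;
    }

    // Very rough "analysis": lowercase the text and split it on whitespace.
    private static boolean containsTerm(String doc, String term) {
        return Arrays.asList(doc.toLowerCase(Locale.ROOT).split("\\s+")).contains(term);
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        UpdateSemanticsSketch index = new UpdateSemanticsSketch();
        index.addDocument("java search library");
        index.addDocument("learning java");
        index.addDocument("python notes");
        int deleted = index.updateDocument("java", "lucene introduction");
        System.out.println("deleted=" + deleted + ", size=" + index.size());
    }
}
```

For this reason updates are usually keyed on a term that identifies exactly one document, such as a unique id field, instead of a word from the content.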
