Building a local text search engine with Lucene

Braylon1002

Category: Data Mining. Tags: lucene, text retrieval, search engine, lucene custom scoring

Article link: https://blog.csdn.net/qq_40742298/article/details/103267266

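This post documents a small search engine I built with Lucene. It builds an index over text files, ranks the results with a custom scoring hook, and runs searches. I hope it is useful to others.

Text data:
Link: https://pan.baidu.com/s/1vmAyg1iw8D32MqgkYRBbJg
Extraction code: 4znr

Source code:
https://gitee.com/S_Braylon/lucene_searchEngine

I. Preparation
1. Environment
Windows 10 + IntelliJ IDEA
2. Maven dependencies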

    <dependencies>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>4.6.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queryparser</artifactId>
            <version>4.6.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-common</artifactId>
            <version>4.6.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-codecs</artifactId>
            <version>4.6.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queries</artifactId>
            <version>4.6.1</version>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.6</version>
        </dependency>
        <dependency>
            <groupId>commons-cli</groupId>
            <artifactId>commons-cli</artifactId>
            <version>1.2</version>
        </dependency>
        <!-- testing -->
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
                <version>2.1</version>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass>runPackage.main</mainClass>
                        </manifest>
                    </archive>
                </configuration>
            </plugin>
        </plugins>
    </build>

3. Jar packages
[Image: jar packages]
4. Lucene utility classes

5. How Lucene works
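Building the index:

Step 1: Start with the original documents (Document) to be indexed.
Step 2: Pass the original documents to the tokenizer (Tokenizer).
Step 3: Pass the resulting tokens (Token) to the linguistic processor (Linguistic Processor).
Step 4: Pass the resulting terms (Term) to the indexer (Indexer), which does three things. First, it builds a dictionary from the terms. Here is a screenshot of an example found online:
[Image: example term dictionary]
Second, it sorts the dictionary alphabetically.

Third, it merges identical terms into document posting lists (the inverted index).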
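
The four indexing steps can be sketched in plain Java. This is only a toy illustration, not Lucene's internals: the class name InvertedIndexSketch, the stop-word list, and the sample sentences are all made up for the example, and the "linguistic processing" is reduced to lowercasing and stop-word removal (no stemming).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index; illustrates the four steps, not Lucene's internals. */
final class InvertedIndexSketch {
    // Hypothetical stop-word list, chosen only for this example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "to", "of", "and"));

    // TreeMap keeps the dictionary sorted alphabetically by term;
    // each TreeSet is the posting list (sorted doc ids) of one term.
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();

    /** Steps 2 and 3: split into tokens, lowercase them, drop stop words. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^A-Za-z0-9]+")) {
            String term = token.toLowerCase(Locale.ROOT);
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    /** Step 4: add each term to the dictionary; identical terms share one posting list. */
    void addDocument(int docId, String text) {
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    Map<String, TreeSet<Integer>> postings() {
        return postings;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(1, "The hurricane moved to the coast");
        index.addDocument(2, "Hurricane Mitch and the coast guard");
        // Prints one line per term, in alphabetical order, with its doc ids.
        index.postings().forEach((term, docs) -> System.out.println(term + " -> " + docs));
    }
}
```

Using a sorted map gives the alphabetical dictionary for free, and merging identical terms is simply adding another doc id to the same posting list.
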
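Searching:

Step 1: The user enters a query.
Step 2: The query goes through lexical analysis, syntax analysis, and linguistic processing.
This includes normalizing case and removing high-frequency words that carry little meaning, such as "to" and "the".
Step 3: Search the index for the documents that match the syntax tree.
Step 4: Sort the results by how relevant each document is to the query.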
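
The search steps can be sketched the same way in plain Java. Everything here is an assumption made for illustration: the class name SearchSketch, the stop-word list, the "match any query term" rule, and above all the score, which is just the total number of query-term occurrences in a document. Lucene's real relevance scoring is considerably more elaborate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy search over a term-frequency index; a stand-in for Lucene's query and scoring. */
final class SearchSketch {
    // Hypothetical stop-word list, chosen only for this example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "to", "of", "and"));

    // term -> (docId -> number of occurrences of the term in that document)
    private final Map<String, Map<Integer, Integer>> index = new HashMap<>();

    /** Step 2: the same analysis is applied to documents and to queries. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^A-Za-z0-9]+")) {
            String term = token.toLowerCase(Locale.ROOT);
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    void addDocument(int docId, String text) {
        for (String term : analyze(text)) {
            index.computeIfAbsent(term, t -> new HashMap<>()).merge(docId, 1, Integer::sum);
        }
    }

    /** Steps 3 and 4: collect documents matching any query term, then sort by score. */
    List<Integer> search(String query) {
        Map<Integer, Integer> scores = new HashMap<>();
        for (String term : analyze(query)) {
            Map<Integer, Integer> docs = index.get(term);
            if (docs == null) {
                continue; // term does not occur in any document
            }
            for (Map.Entry<Integer, Integer> e : docs.entrySet()) {
                scores.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        List<Integer> ranked = new ArrayList<>(scores.keySet());
        // Highest score first; ties broken by ascending doc id.
        ranked.sort((a, b) -> {
            int byScore = Integer.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        return ranked;
    }

    public static void main(String[] args) {
        SearchSketch engine = new SearchSketch();
        engine.addDocument(1, "Hurricane Mitch hit the coast");
        engine.addDocument(2, "George met Mitch; Mitch thanked George");
        engine.addDocument(3, "Quiet weather on the coast");
        System.out.println(engine.search("mitch george"));
    }
}
```

Running the same analyze step on both documents and queries is what lets a query like "The HURRICANE" find a document that contains "Hurricane".
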
II. Implementation
1. Code structure
[Image: code structure]

2. FilterUtil
Filtering utility class.
Purpose: extract fileNo and fileContent with regular expressions.

/*
 * Filters documents.
 * Outline only: method bodies are omitted here.
 * @Author tianyu
 * @Version 1.0
 */
public class txtFilter {
    // Strips \n line breaks
    public String trimStr(String text)
    // Extracts fileNo with a regular expression
    public String getFileNo(String text)
    // Extracts fileContent with a regular expression
    public String getFileContent(String text)
}
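
As a rough idea of what such a filter could look like, here is a minimal sketch. The method names follow the outline above, but the bodies are not the code from the repository, and the input format is an assumption: the sketch pretends each document carries its number in a <DOCNO> element and its body in a <TEXT> element. The real data files may use different markup, in which case the two patterns would need to change.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of a regex-based filter; the <DOCNO>/<TEXT> markup is an assumption. */
final class TxtFilterSketch {
    // Lazy groups capture the text between the assumed tags, without surrounding whitespace.
    private static final Pattern FILE_NO =
            Pattern.compile("<DOCNO>\\s*(.*?)\\s*</DOCNO>", Pattern.DOTALL);
    private static final Pattern FILE_CONTENT =
            Pattern.compile("<TEXT>\\s*(.*?)\\s*</TEXT>", Pattern.DOTALL);

    /** Replaces runs of line breaks with a single space and trims the ends. */
    static String trimStr(String text) {
        return text.replaceAll("[\\r\\n]+", " ").trim();
    }

    /** Returns the document number, or an empty string if none is found. */
    static String getFileNo(String text) {
        return firstGroup(FILE_NO, text);
    }

    /** Returns the document body on one line, or an empty string if none is found. */
    static String getFileContent(String text) {
        return trimStr(firstGroup(FILE_CONTENT, text));
    }

    private static String firstGroup(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? matcher.group(1) : "";
    }

    public static void main(String[] args) {
        String raw = "<DOC>\n<DOCNO> AP-0001 </DOCNO>\n<TEXT>\nHurricane Mitch\nhit the coast.\n</TEXT>\n</DOC>";
        System.out.println(getFileNo(raw));
        System.out.println(getFileContent(raw));
    }
}
```

Pattern.DOTALL matters here because a document body usually spans several lines, and without it "." would stop at the first line break.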

3. Index
Index-building utility class.
Purpose: build the index.

/*
 * Builds the index.
 * Outline only: the method body is omitted here.
 * @Author tianyu
 * @Version 1.0
 */
public class makeIndex {
    // Reads the directory at path and builds the index from it
    public void index(String path)
}

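4. MyScoreUtil
Utility classes for custom scoring.
Purpose: implement custom scoring, so that the score weights can be changed as needed and the final match score of each document adjusted.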

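import java.io.IOException;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.queries.CustomScoreProvider;
import org.apache.lucene.queries.CustomScoreQuery;
import org.apache.lucene.queries.function.FunctionQuery;
import org.apache.lucene.search.Query;

/*
 * Extends CustomScoreQuery and overrides getCustomScoreProvider
 * to return a CustomScoreProvider, which is where the score is modified.
 * @Author tianyu
 * @Version 1.0
 */
public class MyScoreQueries extends CustomScoreQuery {
    public MyScoreQueries(Query query, FunctionQuery functionQuery) {
        super(query, functionQuery);
    }

    @Override
    protected CustomScoreProvider getCustomScoreProvider(AtomicReaderContext context) throws IOException {
        // Return our own provider instead of super.getCustomScoreProvider(context)
        return new myGetCustomScoreProvider(context);
    }
}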

import java.io.IOException;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.queries.CustomScoreProvider;

/*
 * Overrides customScore. subQueryScore is the default score;
 * valSrcScore is the custom score passed in from the makeIndex class.
 * @Author tianyu
 * @Version 1.0
 */
public class myGetCustomScoreProvider extends CustomScoreProvider {
    public myGetCustomScoreProvider(AtomicReaderContext context) {
        super(context);
    }

    @Override
    public float customScore(int doc, float subQueryScore, float valSrcScore) throws IOException {
        // The "* valSrcScore" factor is currently disabled, so only the default score is used
        return subQueryScore;
    }
}
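
To show what the disabled factor would change, here is a small plain-Java sketch that ranks the same hits both ways: by subQueryScore alone, as in the override above, and by subQueryScore * valSrcScore, which is the commented-out variant. The Hit class, the document ids, and the score values are invented for the example; nothing here calls Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Compares ranking by subQueryScore alone with ranking by subQueryScore * valSrcScore. */
final class CustomScoreSketch {
    /** One search hit with hypothetical scores. */
    static final class Hit {
        final String docId;
        final float subQueryScore; // relevance score of the text query
        final float valSrcScore;   // per-document value, e.g. a weight stored at index time

        Hit(String docId, float subQueryScore, float valSrcScore) {
            this.docId = docId;
            this.subQueryScore = subQueryScore;
            this.valSrcScore = valSrcScore;
        }
    }

    /** Same arithmetic as the two variants of customScore discussed above. */
    static float score(Hit hit, boolean useWeight) {
        return useWeight ? hit.subQueryScore * hit.valSrcScore : hit.subQueryScore;
    }

    /** Returns the doc ids ordered from highest to lowest score. */
    static List<String> rank(List<Hit> hits, boolean useWeight) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort((a, b) -> Float.compare(score(b, useWeight), score(a, useWeight)));
        List<String> ids = new ArrayList<>();
        for (Hit hit : sorted) {
            ids.add(hit.docId);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit("A", 2.0f, 0.5f),   // strong text match, low weight
                new Hit("B", 1.5f, 1.0f));  // weaker text match, full weight
        System.out.println("relevance only: " + rank(hits, false));
        System.out.println("weighted:       " + rank(hits, true));
    }
}
```

With these made-up numbers, document A wins on relevance alone (2.0 against 1.5) but drops behind B once the weight is applied (1.0 against 1.5), which is exactly the kind of adjustment the custom scoring hook exists for.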

5. SearchUtil
Search utility class.
Purpose: search the index that makeIndex wrote to the index directory and print the results.

/*
 * Search class.
 * @Author tianyu
 * @Version 1.0
 */
public class searchUtil {
    public void search(String input){}
}

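6. Flowchart
I drew this flowchart myself with process; it marks all the main classes.

[Image: flowchart of the main classes]

III. Results
• Q1 – hurricane
[Image: search results for Q1]
• Q2 – mitch george

Note: I originally built a Swing UI as well, but I am not showing it here. The point is that text search already works correctly; presenting the results through a web page or a GUI would only make them look nicer, so I will leave that out.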
