Full-Text Search

Latest recommended article published on 2024-07-15 10:53:04

TWOFOUR_


Article link: https://blog.csdn.net/qq_43868329/article/details/104472110


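Full-text indexing is an efficient way to retrieve unstructured data: unstructured data such as text documents and HTML is converted into structured data, and an index is built to speed up queries. Common use cases include search engines and site search. Lucene serves as the introductory example, used here to set up an environment, create an index, and view its contents.

(Summary generated automatically by CSDN.)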

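I. What is full-text indexing

Full-text indexing is about retrieving data, so we start with how data is classified.
On a computer, data includes things like text documents stored on disk, HTML pages, and Word documents.
1. Structured data
1.1.1: Data with a fixed format, fixed length, fixed data types, and so on is called structured data; the data in a database is an example.

2. Unstructured data
1.2.1: Word documents, HTML files, PDF documents, text files, and so on have no fixed format, length, or data type, and are called unstructured data.
3. Semi-structured data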

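II. Querying data

1. Querying structured data
2.1.1: Structured data is queried with a structured query language, SQL, for example: select * from user where userid=1

2. Querying unstructured data
Querying unstructured data is harder. Suppose we want to find the keyword spring in a set of text files:
2.2.1: Open the files one by one and look for it by eye.
2.2.2: Use a program to read each file into memory and match the string spring. This approach is called a sequential scan.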
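
The sequential scan in 2.2.2 can be sketched roughly as follows. This is only an illustration, not code from the original article: the class name SequentialScan, the method names, and the case-insensitive matching are all assumptions made for the example. Every query re-reads every file, which is why this approach slows down as the number of files grows.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Stream;

// Hypothetical class name; not part of the original article.
class SequentialScan {

    // Returns true when the text contains the keyword.
    // Ignoring case is an assumption made for this sketch.
    static boolean containsKeyword(String text, String keyword) {
        return text.toLowerCase(Locale.ROOT).contains(keyword.toLowerCase(Locale.ROOT));
    }

    // Reads every regular file directly under dir into memory and
    // collects the files whose content contains the keyword.
    static List<Path> scan(Path dir, String keyword) throws IOException {
        List<Path> hits = new ArrayList<>();
        try (Stream<Path> entries = Files.list(dir)) {
            for (Path p : (Iterable<Path>) entries::iterator) {
                if (!Files.isRegularFile(p)) {
                    continue; // skip sub-directories
                }
                String text = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
                if (containsKeyword(text, keyword)) {
                    hits.add(p);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Assumed usage: first argument is the directory, second is the keyword.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        String keyword = args.length > 1 ? args[1] : "spring";
        for (Path hit : scan(dir, keyword)) {
            System.out.println(hit);
        }
    }
}
```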

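3. Converting unstructured data into structured data

2.3.1: Take the file Spring.txt as an example. In an English file every word is separated by a space, so we can split the text on spaces and save the result to a database, which gives us a table. We then create an index on the word column to speed up queries, and use the mapping between words and documents to find the list of matching documents. This process is called full-text search.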
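
The idea in 2.3.1 can be sketched with an in-memory map standing in for the database table and its column index. This is a simplified illustration under stated assumptions, not how Lucene stores its index: the class name TinyInvertedIndex, the lower-casing of words, and the sample document texts are all made up for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical class name; a map replaces the database table described above.
class TinyInvertedIndex {

    // word -> names of the documents that contain the word
    private final Map<String, Set<String>> postings = new TreeMap<>();

    // Splits the text on whitespace and records each lower-cased word
    // together with the document it came from.
    void add(String docName, String text) {
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only for blank input
            }
            postings.computeIfAbsent(word.toLowerCase(Locale.ROOT), k -> new TreeSet<>())
                    .add(docName);
        }
    }

    // Looks the word up in the map; no document text is read again.
    List<String> search(String word) {
        Set<String> docs = postings.get(word.toLowerCase(Locale.ROOT));
        return docs == null ? Collections.<String>emptyList() : new ArrayList<>(docs);
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("Spring.txt", "spring is a java framework");
        index.add("Lucene.txt", "lucene is a java search library");
        System.out.println(index.search("spring")); // [Spring.txt]
        System.out.println(index.search("java"));   // [Lucene.txt, Spring.txt]
    }
}
```

The splitting work happens once in add; every later call to search is a single map lookup, which is the trade-off full-text search relies on.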

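III. The concept of full-text indexing

3.1: Creating an index and then querying that index is what we call full-text search. An index is created once and can be used many times, so the file data does not have to be split again for every query, which makes searching fast.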

IV. Use cases for full-text indexing

1. Search engines
Baidu, Google, Bing
2. Site search, and so on

Lucene getting-started program

Set up the environment by creating a Maven project with the following dependencies:

<!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-core -->
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-core</artifactId>
  <version>7.4.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-analyzers-common -->
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analyzers-common</artifactId>
  <version>7.4.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
<dependency>
  <groupId>commons-io</groupId>
  <artifactId>commons-io</artifactId>
  <version>2.4</version>
</dependency>

2.1 Creating the index

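import java.io.File;
import java.io.IOException;
import org.apache.commons.io.FileUtils;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public static void main(String[] args) throws IOException {
    // Step 3: the directory on disk whose files will be indexed, one document per file
    File file = new File("C:\\Users\\FLC\\Desktop\\授课内容\\授课资料\\Y2170\\Luncen\\资料\\searchsource");
    // Step 4: get the list of files (listFiles returns null if the path is not a readable directory)
    File[] files = file.listFiles();
    if (files == null) {
        throw new IOException("Not a readable directory: " + file);
    }
    // Step 1: create a Directory object that specifies where the index is stored
    // (RAMDirectory would keep the index in memory instead)
    // Step 2: create an IndexWriter object, used to write the index
    // Step 8: try-with-resources closes both resources, even if indexing fails
    try (Directory directory = FSDirectory.open(new File("C:\\Users\\FLC\\Desktop\\授课内容\\授课资料\\Y2170\\Luncen\\Index").toPath());
         IndexWriter indexWriter = new IndexWriter(directory, new IndexWriterConfig())) {
        for (File item : files) {
            // Step 5: read the file data and wrap it in fields; the third argument says whether to store the value
            Field fieldName = new TextField("fieldName", item.getName(), Field.Store.YES);
            Field fieldPath = new TextField("fieldPath", item.getPath(), Field.Store.YES);
            Field fieldSize = new TextField("fieldSize", FileUtils.sizeOf(item) + "", Field.Store.YES);
            Field fieldContent = new TextField("fieldContent", FileUtils.readFileToString(item, "UTF-8"), Field.Store.YES);
            // Step 6: create a document object and add the fields to it
            Document document = new Document();
            document.add(fieldName);
            document.add(fieldPath);
            document.add(fieldSize);
            document.add(fieldContent);

            // Step 7: create the index entry by writing the document to the index
            indexWriter.addDocument(document);
        }
    }
}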

2.2 Viewing the index contents with the Luke tool
2.2.1 Specify the location of the index
2.2.2 View the current contents of the index

2.3 Querying the index

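import java.io.File;
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public static void main(String[] args) throws IOException {
    // 1. Create a Directory object that points at the index location
    Directory directory = FSDirectory.open(new File("C:\\Users\\FLC\\Desktop\\授课内容\\授课资料\\Y2170\\Luncen\\Index").toPath());
    // 2. Create an IndexReader object to read the index contents
    IndexReader indexReader = DirectoryReader.open(directory);
    // 3. Create an IndexSearcher object
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // 4. Create the Query object: documents whose fieldContent contains the term "spring"
    Query query = new TermQuery(new Term("fieldContent", "spring"));
    // 5. Run the query and fetch at most the top 10 matches
    TopDocs topDocs = indexSearcher.search(query, 10);
    System.out.println("Total hits: " + topDocs.totalHits + " document(s)");
    // 6. Get the list of matching documents
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;
    for (ScoreDoc item : scoreDocs) {
        // Get the document ID
        int docId = item.doc;
        // Fetch the document
        Document doc = indexSearcher.doc(docId);
        // Print the data stored in the document's fields
        System.out.println("fieldName:" + doc.get("fieldName"));
        System.out.println("fieldPath:" + doc.get("fieldPath"));
        System.out.println("fieldSize:" + doc.get("fieldSize"));
        System.out.println("fieldContent:" + doc.get("fieldContent"));
        System.out.println("==============================================================");
    }
    // 7. Release resources
    indexReader.close();
}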