Getting Started with Lucene (1) | Creating the Index Directory

秃头崽崽

Category: Hadoop. Tags: lucene, java, index, eclipse, big data

Article link: https://blog.csdn.net/SartinL/article/details/106244148

Lucene

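Creating Document Objects

The raw content is collected so that it can be indexed. Before indexing, each piece of raw content is turned into a document (Document). A Document consists of fields (Field), and the fields hold the content.

Here a file on disk can be treated as one Document whose Fields are file_name (the file name), file_path (the file path), file_size (the file size), and file_content (the file content), as shown in the figure below:

[Figure: a Document and its Fields (image unavailable)]

Note: a Document can have multiple Fields, different Documents can have different Fields, and a single Document can contain the same Field more than once (same field name and same field value).

Every document has a unique number, the document id.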
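To make the Document/Field relationship concrete, here is a minimal sketch in plain Java. The classes `SimpleDocument` and `SimpleField` and the helper `describeFile` are hypothetical names made up for this illustration; they are not Lucene's classes, and the sketch only assumes that a document is an ordered list of name/value pairs.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the concepts above. These are NOT Lucene's classes.
final class DocumentModelSketch {

    // A field is a name plus a value.
    static final class SimpleField {
        final String name;
        final String value;

        SimpleField(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    // A document is an ordered list of fields. A list (not a map) is used
    // so that the same field name can appear more than once.
    static final class SimpleDocument {
        final List<SimpleField> fields = new ArrayList<>();

        void add(String name, String value) {
            fields.add(new SimpleField(name, value));
        }

        // Returns every value held under the given field name.
        List<String> getValues(String name) {
            List<String> values = new ArrayList<>();
            for (SimpleField f : fields) {
                if (f.name.equals(name)) {
                    values.add(f.value);
                }
            }
            return values;
        }
    }

    // Maps one file's metadata and content onto the four fields named in the text.
    static SimpleDocument describeFile(String fileName, String path, long size, String content) {
        SimpleDocument doc = new SimpleDocument();
        doc.add("file_name", fileName);
        doc.add("file_path", path);
        doc.add("file_size", String.valueOf(size));
        doc.add("file_content", content);
        return doc;
    }

    public static void main(String[] args) {
        SimpleDocument doc = describeFile("1.txt", "docu/1.txt", 11L, "Lucene Core");
        // The same field (same name and same value) may be added twice.
        doc.add("tag", "lucene");
        doc.add("tag", "lucene");
        System.out.println(doc.getValues("file_name")); // prints [1.txt]
        System.out.println(doc.getValues("tag"));       // prints [lucene, lucene]
    }
}
```

Using a list rather than a map is the design choice that allows repeated fields, which matches the note above.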

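Analyzing Documents

Once the raw content has been turned into a Document made of Fields, the field content still has to be analyzed. Analysis extracts words from the raw text, converts letters to lowercase, strips punctuation, removes stop words, and so on, producing the final tokens. A token can be thought of as a single word.

For example, the document below is analyzed as follows.

Original document content:

Lucene is a Java full-text search engine. Lucene is not a complete
application, but rather a code library and API that can easily be used
to add search capabilities to applications.

Tokens after analysis:

lucene, java, full, text, search, engine, ...

Each word is called a Term. The same word extracted from different fields is a different Term, because a Term has two parts: the name of the field it came from and the text of the word.

For example, "apache" in the file name and "apache" in the file content are different Terms.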
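The analysis steps described above can be sketched in a few lines of plain Java. This is only an approximation for illustration, not the algorithm Lucene's analyzers actually use: the stop-word list is an assumption, the class name `AnalysisSketch` is hypothetical, and a Term is modeled simply as the string "field:text".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Rough imitation of analysis: split into words, lowercase, drop stop words.
final class AnalysisSketch {

    // A small stop-word list assumed for this example only.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "be", "but", "is", "not", "that", "to"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Splitting on anything that is not a letter or digit also removes punctuation.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String token = raw.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // A term pairs the field name with the token text, so the same word
    // in two different fields yields two different terms.
    static String term(String field, String text) {
        return field + ":" + text;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Lucene is a Java full-text search engine."));
        // prints [lucene, java, full, text, search, engine]
        System.out.println(term("filename", "apache").equals(term("content", "apache")));
        // prints false
    }
}
```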

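Creating the Index

The tokens produced by analyzing all documents are then indexed. The purpose of the index is search: a query looks up only the indexed tokens and, through them, finds the matching Documents.

Note: the index is built over tokens, so documents are found by way of words. This structure is called an inverted index.

The traditional approach starts from the files: open each file and scan its content for the search keyword. This is sequential scanning, and it becomes slow when the data volume is large.

An inverted index goes the other way, from content (words) to documents, as shown in the figure below:

[Figure: inverted index structure (image unavailable)]

An inverted index (also called a reverse index) has two parts, the index and the documents. The index is the vocabulary; it is relatively small, while the document collection is large.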
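As a rough illustration of the idea, the sketch below builds a tiny inverted index in plain Java: a map from each word to the sorted set of ids of the documents that contain it. The class name `InvertedIndexSketch` and its methods are hypothetical, and the structure is far simpler than what Lucene stores on disk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Minimal in-memory inverted index: word -> ids of documents containing it.
final class InvertedIndexSketch {

    // The "index" part: the vocabulary with its posting lists.
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();
    // The "documents" part: original texts, addressed by document id.
    private final List<String> documents = new ArrayList<>();

    // Adds a document and returns the id assigned to it.
    int add(String text) {
        int docId = documents.size();
        documents.add(text);
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // Looks the word up in the vocabulary instead of scanning every document.
    SortedSet<Integer> search(String word) {
        SortedSet<Integer> hits = postings.get(word.toLowerCase(Locale.ROOT));
        return hits == null ? new TreeSet<>() : hits;
    }

    String document(int docId) {
        return documents.get(docId);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("Lucene Core is a Java library");
        index.add("Solr is built using Lucene Core");
        index.add("The Apache Software Foundation");
        System.out.println(index.search("lucene")); // prints [0, 1]
        System.out.println(index.search("apache")); // prints [2]
    }
}
```

A search touches only the posting list of the queried word, which is why it avoids the sequential scan described above.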

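Creating the Index Directory

Implementation Steps

Step 1: Create a Java project and import the jar packages.

Step 2: Create an IndexWriter object.

1) Specify where the index is stored, using a Directory object.

2) Specify an analyzer to analyze the document content.

Step 3: Create a Document object.

Step 4: Create Field objects and add them to the Document.

Step 5: Use the IndexWriter to write the Document to the index directory. This step builds the index and writes both the index and the Document to the index directory.

Step 6: Close the IndexWriter.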

Hands-on Code

1. Create a Maven project named Lucene and add the following dependencies to pom.xml

	<!-- Core library -->
	<dependency>
		<groupId>org.apache.lucene</groupId>
		<artifactId>lucene-core</artifactId>
		<version>5.3.1</version>
	</dependency>
	<!-- General-purpose analyzers, suitable for English text -->
	<dependency>
		<groupId>org.apache.lucene</groupId>
		<artifactId>lucene-analyzers-common</artifactId>
		<version>5.3.1</version>
	</dependency>
	<!-- Chinese analyzer -->
	<dependency>
		<groupId>org.apache.lucene</groupId>
		<artifactId>lucene-analyzers-smartcn</artifactId>
		<version>5.3.1</version>
	</dependency>

	<!-- Query parsing against the analyzed index -->
	<dependency>
		<groupId>org.apache.lucene</groupId>
		<artifactId>lucene-queryparser</artifactId>
		<version>5.3.1</version>
	</dependency>
	<!-- Highlighting of search keywords in results -->
	<dependency>
		<groupId>org.apache.lucene</groupId>
		<artifactId>lucene-highlighter</artifactId>
		<version>5.3.1</version>
	</dependency>

	<!-- Provides the FileUtils class, used here to read file contents -->
	<dependency>
		<groupId>commons-io</groupId>
		<artifactId>commons-io</artifactId>
		<version>2.4</version>
	</dependency>

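2. Create the Document objects, reading from File objects

① Create a folder named docu containing three text files: 1.txt, 2.txt, and 3.txt
[Figure: the docu folder (image unavailable)]
Their contents are: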

1.txt

Lucene Core is a Java library providing powerful 
indexing and search features, 
as well as spellchecking, hit highlighting and 
advanced analysis/tokenization capabilities.
 The PyLucene sub project provides 
Python bindings for Lucene Core.

2.txt

SolrTM is a high performance search server 
built using Lucene Core. 
Solr is highly scalable, providing fully fault 
tolerant distributed indexing, 
search and analytics. It exposes Lucene's features 
through easy to use JSON/HTTP interfaces or native clients 
for Java and other languages.

3.txt

The Apache Software Foundation provides support for
 the Apache community of open-source software projects. 
The Apache projects are defined by collaborative consensus
 based processes, an open, pragmatic software license
 and a desire to create high quality software that leads
 the way in its field.
 Apache Lucene, Apache Solr, Apache PyLucene, Apache Open
 Relevance Project and their respective logos are trademarks of 
The Apache Software Foundation. 
All other marks mentioned may be trademarks or registered trademarks
 of their respective owners.

② In the same parent directory, create the index directory: an empty folder named index

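Field Properties

Analyzed: whether the field content is tokenized. This only matters if the field content is going to be searched.

Indexed: whether the analyzed words, or the whole Field value, are put into the index. Only indexed content can be found by a search.

For example, product names and product descriptions are analyzed and then indexed. Order numbers and ID card numbers are not analyzed but still need to be indexed, because all of these will be used as query conditions.

Stored: whether the Field value is stored in the document. Only stored Fields can be read back from a Document.

For example, product names and order numbers: any Field that will later be read from the Document must be stored.

Rule of thumb for storing: does the content need to be shown to the user?

Field class	Data type	Analyzed	Indexed	Stored	Notes
StringField(FieldName, FieldValue, Store.YES)	String	N	Y	Y or N	Builds a string Field that is not analyzed; the whole string is indexed as a single value (for example an order number or a name). Store.YES or Store.NO decides whether the value is stored in the document.
LongField(FieldName, FieldValue, Store.YES)	long	Y	Y	Y or N	Builds a long numeric Field that is analyzed and indexed (for example a price). Store.YES or Store.NO decides whether the value is stored in the document.
StoredField(FieldName, FieldValue)	Overloaded, supports several types	N	N	Y	Builds a Field of one of several types that is neither analyzed nor indexed but is stored in the document (for example an image, where the image path is what gets stored).
TextField(FieldName, FieldValue, Store.NO) or TextField(FieldName, reader)	String or Reader	Y	Y	Y or N	When a Reader is passed, Lucene assumes the content is large and does not store it.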
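The effect of the three properties can be shown with a toy in-memory model in plain Java. Everything here is hypothetical and made up for illustration (the class `FieldOptionsSketch`, its methods, and the sample field names and values); it does not use or reproduce Lucene's API. It only assumes the behavior described above: indexed content is searchable, stored content is retrievable, and analysis decides whether the value is split into words.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model of the analyzed / indexed / stored flags. Not Lucene's API.
final class FieldOptionsSketch {

    // Stored values per document, keyed by field name.
    private final List<Map<String, String>> storedFields = new ArrayList<>();
    // Searchable terms ("field:text") mapped to the ids of matching documents.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    int newDocument() {
        storedFields.add(new HashMap<>());
        return storedFields.size() - 1;
    }

    void addField(int docId, String name, String value,
                  boolean analyzed, boolean indexed, boolean stored) {
        if (indexed) {
            // Analyzed: split into lowercase words. Not analyzed: keep the whole value.
            String[] terms = analyzed
                    ? value.toLowerCase(Locale.ROOT).split("\\s+")
                    : new String[] { value };
            for (String t : terms) {
                if (t.isEmpty()) {
                    continue;
                }
                List<Integer> ids = postings.computeIfAbsent(name + ":" + t, k -> new ArrayList<>());
                if (!ids.contains(docId)) {
                    ids.add(docId);
                }
            }
        }
        if (stored) {
            storedFields.get(docId).put(name, value);
        }
    }

    // Only indexed content can be found.
    List<Integer> search(String name, String text) {
        return postings.getOrDefault(name + ":" + text, new ArrayList<>());
    }

    // Only stored content can be read back; otherwise null.
    String get(int docId, String name) {
        return storedFields.get(docId).get(name);
    }

    public static void main(String[] args) {
        FieldOptionsSketch index = new FieldOptionsSketch();
        int doc = index.newDocument();
        index.addField(doc, "name", "Red Apple", true, true, true);      // analyzed, indexed, stored
        index.addField(doc, "orderNo", "A-1001", false, true, false);    // indexed only
        index.addField(doc, "image", "/img/1.jpg", false, false, true);  // stored only
        System.out.println(index.search("name", "apple"));        // prints [0]
        System.out.println(index.get(doc, "orderNo"));            // prints null
        System.out.println(index.search("image", "/img/1.jpg"));  // prints []
    }
}
```

The three `addField` calls correspond roughly to the TextField, StringField (with Store.NO), and StoredField rows of the table.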

3. Create a new class in a file named testLucene.java

Code

package com.hadoop.Lucene;

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;

import org.apache.commons.io.FileUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
// JUnit 4 must also be on the classpath; it is not among the pom.xml dependencies listed above
import org.junit.Test;

public class testLucene {

	@Test
	public void createIndex() throws IOException
	{
		// Where the index is stored on disk
		String path = "D:\\Program\\Java\\index\\docuindex";
		Directory directory = FSDirectory.open(Paths.get(path));
		// The index can also be kept in memory
		//Directory directory = new RAMDirectory();

		// Create a standard analyzer
		Analyzer analyzer = new StandardAnalyzer();

		// Create the IndexWriterConfig object
		// In Lucene 5.x the constructor takes a single argument: the analyzer
		IndexWriterConfig config = new IndexWriterConfig(analyzer);

		// Folder that holds the original documents
		File dirs = new File("D:\\Program\\Java\\docu");
		File[] files = dirs.listFiles();
		if (files == null)
		{
			// listFiles() returns null when the path is not a readable directory
			throw new IOException("Not a readable directory: " + dirs);
		}

		// Create the IndexWriter; try-with-resources closes it even if indexing fails
		try (IndexWriter indexWriter = new IndexWriter(directory, config))
		{
			for (File f : files)
			{
				// File name
				String fileName = f.getName();
				// File content, read as UTF-8
				String fileContent = FileUtils.readFileToString(f, StandardCharsets.UTF_8);
				// File path
				String filePath = f.getPath();
				// File size in bytes (getTotalSpace() would return the size of the whole partition)
				long fileSize = f.length();

				// File name field
				// First argument: field name
				// Second argument: field value
				// Third argument: whether to store the value
				Field fileNameField = new TextField("filename", fileName, Store.YES);

				// File content field
				Field fileContentField = new TextField("content", fileContent, Store.YES);
				// File path field (not analyzed, not indexed, but stored)
				Field filePathField = new StoredField("path", filePath);
				// File size field
				Field fileSizeField = new LongField("size", fileSize, Store.YES);

				// Create the Document object and add the fields to it
				Document document = new Document();
				document.add(fileNameField);
				document.add(fileContentField);
				document.add(filePathField);
				document.add(fileSizeField);

				// Build the index entries and write the document to the index directory
				indexWriter.addDocument(document);
			}
		}
		System.out.println("Index created successfully");
	}

}
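A note on the file size field: `java.io.File.length()` returns the size of the file itself in bytes, while `java.io.File.getTotalSpace()` returns the size of the partition that holds the file, so the two are easy to confuse. The small standalone sketch below writes a temporary file and prints both values. The class name `FileSizeSketch` and the helper `writeTemp` are hypothetical names used only for this demonstration.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Compares File.length() with File.getTotalSpace() on a small temporary file.
final class FileSizeSketch {

    // Writes the given text to a temporary file as UTF-8 and returns the file.
    static File writeTemp(String content) {
        try {
            File f = File.createTempFile("size-demo", ".txt");
            f.deleteOnExit();
            Files.write(f.toPath(), content.getBytes(StandardCharsets.UTF_8));
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File f = writeTemp("Lucene Core");
        // Size of the file in bytes: 11 for this content.
        System.out.println("length()        = " + f.length());
        // Size of the partition holding the file; depends on the machine.
        System.out.println("getTotalSpace() = " + f.getTotalSpace());
    }
}
```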

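4. Check the result in the index directory

Creation succeeded:
[Figure: console output after the index is created (image unavailable)]
Contents of the index directory: (image unavailable)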
