Full-Text Search ☞ Lucene

Latest recommended article published on 2021-01-26 21:34:08

如何删库不跑路

Column: New Technology
Tags: technical study notes

Article link: https://blog.csdn.net/lis345310731/article/details/83755144

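Full-Text Search

What is full-text search:
Full-text search is retrieval performed over a full-text database. It is a text-based search, also called full-text retrieval.
Full-text database:
A data collection formed by converting the entire content of a complete information source into units that a computer can recognize and process. It is the main component of a full-text search system; the term generally refers to databases that store massive amounts of information.
Data can be divided into:
Structured data: data with a fixed format or limited length, such as databases and metadata;
Unstructured data: data with no fixed format or length, such as email and Word documents;
Semi-structured data: data that can be processed as structured data when needed, or reduced to plain text and processed as unstructured data, such as XML and HTML.
Characteristics of full-text search:
Summaries are extracted from the text, which improves query efficiency;
Only the text is matched; semantics are not considered;
Keywords can be highlighted;
Results can be sorted by relevance (popularity).
Diagram:

[Figure: full-text search diagram]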

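Core of Full-Text Search

Index creation:
Extract information from all structured and unstructured data, run lexical analysis (tokenization) on it, and create an index entry for each resulting term.
Index search:
Enter the keywords to search for; they are matched automatically against the corresponding index entries (see ⑤ in the figure).

[Figure]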
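
The two steps above can be illustrated without Lucene at all. The following is only a minimal sketch of an inverted index in plain Java, under simplifying assumptions: the class name InvertedIndexSketch and its methods are made up for illustration, terms are split on non-word characters, and there is no scoring. Lucene's real index is far more elaborate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class, not part of Lucene
class InvertedIndexSketch {

    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();

    // Index creation: split the text into lowercase terms and
    // record the document id under every term
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Index search: look the keyword up directly instead of scanning every document
    List<Integer> search(String keyword) {
        Set<Integer> ids = postings.getOrDefault(keyword.toLowerCase(Locale.ROOT), new TreeSet<>());
        return new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "hello java world");
        index.add(2, "java proxy has a method");
        index.add(3, "hello lucene world");
        System.out.println(index.search("hello")); // [1, 3]
        System.out.println(index.search("java"));  // [1, 2]
    }
}
```

The point of the design is that the cost of a search depends on the number of distinct terms rather than on the total amount of text, which is the space-for-time trade that full-text indexes make.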

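Getting to Know Lucene

What is Lucene:
Apache Lucene is a high-performance (it trades space for time), scalable, open-source full-text search toolkit written in Java. It is not just a full-text search engine but a full-text search engine architecture. It can be embedded easily into all kinds of applications to give them full-text indexing and search, and it is mainly used in small and medium-sized applications.
Hello Lucene:
In Lucene, IndexWriter provides the add, delete, and update operations, while IndexSearcher provides index search.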

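import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;

// Assumes Lucene 5.x or 6.x and JUnit 4, judging by the API calls used in this post
public class HelloLucene {

	// Sample sentences to index
	private String doc1 = "hello java world";
	private String doc2 = "java proxy has a method called newProxyInstance";
	private String doc3 = "hello lucene world";
	// Where the index files are stored
	private String path = "E:\\Code\\lucene\\index";

	// Build the index
	@Test
	public void testCreateLucene() throws Exception {
		// English-oriented analyzer
		Analyzer analyzer = new StandardAnalyzer();
		// Writer configuration
		IndexWriterConfig conf = new IndexWriterConfig(analyzer);
		// CREATE replaces any existing index at this path, so re-running the test overwrites the earlier content
		conf.setOpenMode(OpenMode.CREATE);
		// 1. Open the index directory and the IndexWriter; try-with-resources closes both even on failure
		try (Directory d = FSDirectory.open(Paths.get(path));
				IndexWriter indexWriter = new IndexWriter(d, conf)) {
			// 2. Prepare the data to index
			// Row 1
			Document document1 = new Document();
			// Column: name, value, whether the value is stored
			document1.add(new TextField("id", "1", Store.YES));
			document1.add(new TextField("content", doc1, Store.YES));
			// Row 2
			Document document2 = new Document();
			document2.add(new TextField("id", "2", Store.YES));
			document2.add(new TextField("content", doc2, Store.YES));
			// Row 3
			Document document3 = new Document();
			document3.add(new TextField("id", "3", Store.YES));
			document3.add(new TextField("content", doc3, Store.YES));
			// Add the documents to the writer's buffer
			indexWriter.addDocument(document1);
			indexWriter.addDocument(document2);
			indexWriter.addDocument(document3);
			// Commit the data
			indexWriter.commit();
		}
	}
}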
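After the test passes, files like the following are generated; their contents are stored in Lucene's own encoded format:

[Figure: generated index files]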

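// Search test: add this method to the HelloLucene class above. It additionally needs imports for
// DirectoryReader and IndexReader (org.apache.lucene.index), IndexSearcher, Query, ScoreDoc and TopDocs
// (org.apache.lucene.search), and QueryParser (org.apache.lucene.queryparser.classic).
	@Test
	public void testSearch() throws Exception {
		// Keyword to search for
		String keyword = "hello";
		// Open the index directory and a reader on it; both are closed automatically
		try (Directory directory = FSDirectory.open(Paths.get(path));
				IndexReader r = DirectoryReader.open(directory)) {
			// 1. Prepare the index searcher
			IndexSearcher indexSearcher = new IndexSearcher(r);
			// English-oriented analyzer
			Analyzer analyzer = new StandardAnalyzer();
			// Query parser; "content" is the default field
			QueryParser queryParser = new QueryParser("content", analyzer);
			// Build the query object
			Query query = queryParser.parse(keyword);
			// Search: top 3 hits
			TopDocs docs = indexSearcher.search(query, 3);
			// Total number of matches (an int in the Lucene versions assumed here; later versions changed this field's type)
			int totalHits = docs.totalHits;
			System.out.println(totalHits);//2
			// The matching documents
			ScoreDoc[] scoreDocs = docs.scoreDocs;
			for (ScoreDoc scoreDoc : scoreDocs) {
				// Internal document id
				int docId = scoreDoc.doc;
				// Load the stored document
				Document doc = indexSearcher.doc(docId);
				System.out.println(doc.get("id")+":"+doc.get("content"));
				//1:hello java world
				//3:hello lucene world
			}
		}
	}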

Lucene API

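Index directory: Directory
Directory is the abstraction of an index directory and holds the Lucene index files. Common implementations:
MMapDirectory: intended for 64-bit systems; it reads the on-disk index through memory-mapped files, so index maintenance combines memory with the disk.
SimpleFSDirectory: a traditional file-system index store.
RAMDirectory: an in-memory index store.
Document (row) and IndexableField (column):
When content is added to the index, each piece of information is recorded as a Document, similar to a row in a relational database.
Each Document consists of a series of IndexableField objects, which represent fields, similar to columns in a relational database.
Document mainly provides the following methods:
Add: add(IndexableField field);
Remove: removeField, removeFields;
Get a field or its value: get, getBinaryValue, getField, getFields, and so on.
IndexableField and Field
A Field represents one column of a Document, like one column of a table record.
Lucene provides the IndexableField interface, and most of the other APIs are programmed against it, so a column object in Lucene is really defined by IndexableField. In practice you mostly use the subclasses of Field, such as the TextField used above.
Field properties:
store: whether the field value is stored, Store.YES or Store.NO.
index: how the field is indexed.
tokenized: whether the field value is run through the configured analyzer when it is indexed; true means tokenized, false means not tokenized.
Index store:
The index store has two parts. The one that takes relatively more space is the data area; it is controlled by the Store property and holds the field contents. The one that takes relatively less space is the directory (index) area; it is controlled by the Index property and is what makes a field searchable.
Analyzer
An analyzer, also called a language analyzer, determines how content is broken into terms when the index is built, for example by English words or by Chinese word meaning.
Chinese analyzers include SmartChineseAnalyzer (shipped in Lucene's own smartcn module), MMAnalyzer (the JE "jeasy" analyzer), the Paoding analyzer, and IKAnalyzer (the IK analyzer); IK is the recommended one. To use IK, add its jar (IKAnalyzer.jar) and three configuration files: IKAnalyzer.cfg.xml (the main configuration that references the other two), ext.dic (extension words), and stopword.dic (stop words).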

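import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class AnalyzerTest {

	// Sample text to analyze
	private String en = "oh my lady gaga";
	private String cn = "迅雷不及掩耳盗铃儿响叮当仁不让";
	private String str = "小白初识FullText Search Lucene框架的学习，TMD今晚吃鸡";

	// The original post does not show testAnalyzer; this is an assumed minimal version
	// that prints every term the analyzer produces for the given text
	private void testAnalyzer(Analyzer analyzer, String text) throws Exception {
		try (TokenStream tokenStream = analyzer.tokenStream("content", text)) {
			CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
			tokenStream.reset();
			while (tokenStream.incrementToken()) {
				System.out.println(term.toString());
			}
			tokenStream.end();
		}
	}

	// Standard analyzer: no Chinese word segmentation
	@Test
	public void testStandardAnalyzer() throws Exception {
		testAnalyzer(new StandardAnalyzer(), cn);
	}

	// Simple analyzer: no Chinese word segmentation
	@Test
	public void testSimpleAnalyzer() throws Exception {
		testAnalyzer(new SimpleAnalyzer(), cn);
	}

	// Bigram analyzer: every two adjacent characters form one term
	@Test
	public void testCJKAnalyzer() throws Exception {
		testAnalyzer(new CJKAnalyzer(), cn);
	}

	// Dictionary-based analyzer: words are looked up in a dictionary
	@Test
	public void testSmartChineseAnalyzer() throws Exception {
		testAnalyzer(new SmartChineseAnalyzer(), cn);
	}

	/* IK analyzer: words are looked up in a dictionary.
	Setup: copy the three configuration files, IKAnalyzer.cfg.xml (main configuration), ext.dic (extension words)
	and stopword.dic (stop words), and add the IK jar.
	Note: be careful about which program you use to open these files. */
	@Test
	public void testIKAnalyzer() throws Exception {
		// true: coarse-grained (smart) segmentation; false: fine-grained segmentation
		//testAnalyzer(new IKAnalyzer(true), cn);
		testAnalyzer(new IKAnalyzer(false), str);
	}
}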
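
To make the bigram idea behind CJKAnalyzer concrete, here is a small plain-Java sketch that emits every pair of adjacent characters as a term. It is only an approximation under stated assumptions: the class BigramSketch is hypothetical, the input is assumed to be a run of CJK characters with no Latin text, punctuation, or supplementary code points, and the real analyzer does additional work that this ignores.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class, not part of Lucene
class BigramSketch {

    // Emit overlapping two-character terms: positions (0,1), (1,2), (2,3), ...
    static List<String> bigrams(String text) {
        List<String> terms = new ArrayList<>();
        if (text.length() == 1) {
            // A lone character cannot form a pair, so keep it as its own term
            terms.add(text);
        }
        for (int i = 0; i + 1 < text.length(); i++) {
            terms.add(text.substring(i, i + 2));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("全文检索")); // [全文, 文检, 检索]
    }
}
```

Bigrams need no dictionary, which is why this approach works on any Chinese text, but they also produce terms such as 文检 that are not real words; the dictionary-based analyzers avoid that at the cost of maintaining word lists.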

Simple create, delete, update, and query:

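import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;

public class Lucene_CRUD {

	// Sample sentences to index
	private String doc1 = "hello java world";
	private String doc2 = "java proxy has a method called newProxyInstance";
	private String doc3 = "hello lucene world";
	private String path = "E:\\Code\\lucene\\index";

	// Build the index
	@Test
	public void testCreateLucene() throws Exception {
		// English-oriented analyzer
		Analyzer analyzer = new StandardAnalyzer();
		// Writer configuration
		IndexWriterConfig conf = new IndexWriterConfig(analyzer);
		// CREATE replaces any existing index at this path, so re-running the test overwrites the earlier content
		conf.setOpenMode(OpenMode.CREATE);
		// 1. Open the index directory and the IndexWriter; try-with-resources closes both even on failure
		try (Directory d = FSDirectory.open(Paths.get(path));
				IndexWriter indexWriter = new IndexWriter(d, conf)) {
			// 2. Prepare the data to index
			// Row 1
			Document document1 = new Document();
			// Column: name, value, whether the value is stored
			document1.add(new TextField("id", "1", Store.YES));
			document1.add(new TextField("content", doc1, Store.YES));
			// Row 2
			Document document2 = new Document();
			document2.add(new TextField("id", "2", Store.YES));
			document2.add(new TextField("content", doc2, Store.YES));
			// Row 3
			Document document3 = new Document();
			document3.add(new TextField("id", "3", Store.YES));
			document3.add(new TextField("content", doc3, Store.YES));
			// Add the documents to the writer's buffer
			indexWriter.addDocument(document1);
			indexWriter.addDocument(document2);
			indexWriter.addDocument(document3);
			// Commit the data
			indexWriter.commit();
		}
		// Query all documents
		testSearch("*:*");
	}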

	// Update test: replaces the document if it exists, otherwise adds it
	@Test
	public void testUpdate() throws Exception {
		// English-oriented analyzer
		Analyzer analyzer = new StandardAnalyzer();
		// Writer configuration (default open mode, so the existing index is kept)
		IndexWriterConfig conf = new IndexWriterConfig(analyzer);
		// 1. Open the index directory and the IndexWriter; try-with-resources closes both even on failure
		try (Directory d = FSDirectory.open(Paths.get(path));
				IndexWriter indexWriter = new IndexWriter(d, conf)) {
			// 2. Prepare the replacement document
			Document document = new Document();
			// Column: name, value, whether the value is stored
			document.add(new TextField("id", "1", Store.YES));
			// The appended text "修改过" means "modified"
			document.add(new TextField("content", doc1 + "修改过", Store.YES));
			// Delete every document whose id field contains the term "1", then add the new document
			indexWriter.updateDocument(new Term("id", "1"), document);
			// Commit the data
			indexWriter.commit();
		}
		// Query all documents
		testSearch("*:*");
	}

	// Delete test
	@Test
	public void testDelete() throws Exception {
		// English-oriented analyzer
		Analyzer analyzer = new StandardAnalyzer();
		// Writer configuration (default open mode, so the existing index is kept)
		IndexWriterConfig conf = new IndexWriterConfig(analyzer);
		// 1. Open the index directory and the IndexWriter; try-with-resources closes both even on failure
		try (Directory d = FSDirectory.open(Paths.get(path));
				IndexWriter indexWriter = new IndexWriter(d, conf)) {
			// Delete every document whose id field contains the term "1"
			indexWriter.deleteDocuments(new Term("id", "1"));
			// Commit the data
			indexWriter.commit();
		}
		// Query all documents
		testSearch("*:*");
	}

	// Search helper
	public void testSearch(String keyword) throws Exception {
		// Open the index directory and a reader on it; both are closed automatically
		try (Directory directory = FSDirectory.open(Paths.get(path));
				IndexReader r = DirectoryReader.open(directory)) {
			// 1. Prepare the index searcher
			IndexSearcher indexSearcher = new IndexSearcher(r);
			// English-oriented analyzer
			Analyzer analyzer = new StandardAnalyzer();
			// Query parser; "content" is the default field
			QueryParser queryParser = new QueryParser("content", analyzer);
			// Build the query object
			Query query = queryParser.parse(keyword);
			// Search: top 5 hits
			TopDocs docs = indexSearcher.search(query, 5);
			// Total number of matches
			int totalHits = docs.totalHits;
			System.out.println(totalHits);
			// The matching documents
			ScoreDoc[] scoreDocs = docs.scoreDocs;
			for (ScoreDoc scoreDoc : scoreDocs) {
				// Internal document id
				int docId = scoreDoc.doc;
				// Load the stored document
				Document doc = indexSearcher.doc(docId);
				System.out.println(doc.get("id") + ":" + doc.get("content"));
			}
		}
	}
}

Advanced queries:

import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;

public class LuceneQuery {

	private String path = "E:\\Code\\lucene\\index";

	// Search with a query string
	public void testSearch(String keyword) throws Exception {
		// English-oriented analyzer
		Analyzer analyzer = new StandardAnalyzer();
		// Query parser; "content" is the default field
		QueryParser queryParser = new QueryParser("content", analyzer);
		// Parse the string into a query object and reuse the Query-based search below
		testSearch(queryParser.parse(keyword));
	}

	// Search with a Query object
	public void testSearch(Query query) throws Exception {
		// Open the index directory and a reader on it; both are closed automatically
		try (Directory directory = FSDirectory.open(Paths.get(path));
				IndexReader r = DirectoryReader.open(directory)) {
			// 1. Prepare the index searcher
			IndexSearcher indexSearcher = new IndexSearcher(r);
			// Search: top 5 hits
			TopDocs docs = indexSearcher.search(query, 5);
			// Total number of matches
			int totalHits = docs.totalHits;
			System.out.println(totalHits);
			// The matching documents
			ScoreDoc[] scoreDocs = docs.scoreDocs;
			for (ScoreDoc scoreDoc : scoreDocs) {
				// Internal document id
				int docId = scoreDoc.doc;
				// Load the stored document
				Document doc = indexSearcher.doc(docId);
				System.out.println(doc.get("id") + ":" + doc.get("content"));
			}
		}
	}
	
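	// Term query (common)
	@Test
	public void testTerm() throws Exception {
		// Query-string form (common)
		String keyword = "content:lucene";
		testSearch(keyword);//3:hello lucene world
		
		// Query-object form
		Query query = new TermQuery(new Term("content", "lucene"));
		testSearch(query);//3:hello lucene world
	}
	
	// Match all documents
	@Test
	public void testAll() throws Exception {
		// Query-string form (common)
		String keyword = "*:*";
		testSearch(keyword);
	}
	
	// Phrase query: matches the words as one unit, useful for personal names, place names, and the like
	@Test
	public void testPhs() throws Exception {
		// Query-string form (common)
		String keyword = "\"hello java world\"";
		testSearch(keyword);
	}
	
	// Wildcard query: * matches any run of characters
	@Test
	public void testTpf() throws Exception {
		// Query-string form (common)
		String keyword = "ja*a";
		testSearch(keyword);
	}
	
	// Fuzzy query: term~N tolerates up to N single-character edits; Lucene supports at most 2
	@Test
	public void testRc() throws Exception {
		// Query-string form (common)
		String keyword = "hexlo~2";
		testSearch(keyword);
	}
	
	// Proximity query: a phrase query followed by "~" and a positive integer,
	// which is the maximum distance allowed between the words of the phrase.
	// The exact phrase itself is also matched.
	@Test
	public void testLj() throws Exception {
		// Query-string form (common)
		String keyword = "\"hello world\"~1";
		testSearch(keyword);
	}
	
	/*
	 * Combined (boolean) query:
	 * 	+ (must): the term must appear
	 * 	- (must_not): the term must not appear
	 * 	no prefix (should): the term may appear
	 * 	+ and - clauses are combined with AND: all of them have to hold.
	 * 	Terms without a prefix are optional (the parser's default operator is OR).
	 */
	@Test
	public void testZh() throws Exception {
		// Query-string form (common)
		//String keyword = "+hello -java";
		//String keyword = "hello +java";
		// Note: a query made up only of "-" clauses matches no documents in Lucene
		String keyword = "-hello -java";
		testSearch(keyword);
	}

}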
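
The number after "~" in a fuzzy query such as hexlo~2 is an edit distance. As a rough illustration of what that number measures, the sketch below computes the classic Levenshtein distance in plain Java. The class EditDistanceSketch is hypothetical, and this is not how Lucene evaluates fuzzy queries internally; it only shows why "hexlo" is within range of "hello".

```java
// Hypothetical illustration class, not part of Lucene
class EditDistanceSketch {

    // Levenshtein distance: the minimum number of single-character
    // insertions, deletions, and substitutions that turn a into b
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into the first j characters of b takes j insertions
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            // Turning the first i characters of a into the empty string takes i deletions
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            // Reuse the two rows instead of allocating a full matrix
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("hexlo", "hello"));    // 1
        System.out.println(distance("kitten", "sitting")); // 3
    }
}
```

With a distance of 1, "hexlo" matches "hello" under both ~1 and ~2, whereas a pair like "kitten" and "sitting" is three edits apart and falls outside the supported range.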
