Lucene

酒巷

Category: JAVA    Tags: Lucene

Article link: https://blog.csdn.net/weixin_43844237/article/details/84862773

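1. What is full-text search?

Retrieving information from full-text data is called full-text retrieval (full-text search). It is a text-based search.
Structured data: data with a fixed format or limited length, such as database records and metadata.
Unstructured data: data of variable length or without a fixed format, such as emails and Word documents.
Semi-structured data: data such as XML and HTML, which can be processed as structured data when needed, or reduced to plain text and processed as unstructured data.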

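2. Full-text retrieval

Scanning unstructured data sequentially is slow, whereas searching structured data is comparatively fast, because its structure lets a search algorithm speed things up. So why not give unstructured data some structure? Relational databases store structured data, which is why lookups in them are relatively fast.
The information that is extracted from unstructured data and then reorganized is called an index. In other words, an index is a dictionary-style table of contents built for the text data in order to speed up retrieval.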
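
To make the idea of "a dictionary-style table of contents" concrete, here is a minimal sketch of an inverted index in plain Java. It is only an illustration of the principle, not Lucene's actual index format; the class name InvertedIndexSketch and its methods are made up for this example, and tokenization is assumed to be a simple split on whitespace.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Toy inverted index: maps each lowercase word to the sorted ids of the
// documents that contain it. Illustration only, not Lucene's index format.
class InvertedIndexSketch {
    private final Map<String, TreeSet<Integer>> postings = new LinkedHashMap<>();
    private final List<String> documents = new ArrayList<>();

    // Stores the text and records every word it contains; returns the new doc id.
    int add(String text) {
        int docId = documents.size();
        documents.add(text);
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Looks the word up in the index instead of scanning every document.
    List<Integer> search(String word) {
        TreeSet<Integer> ids = postings.get(word.toLowerCase());
        return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
    }

    // Returns the original text of a document.
    String get(int docId) {
        return documents.get(docId);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("hello world");
        index.add("hello java world");
        index.add("hello lucene world");
        for (int id : index.search("lucene")) {
            System.out.println(id + ": " + index.get(id));
        }
    }
}
```

The lookup touches only the entry for the searched word, so its cost does not grow with the number of documents that do not contain the word. That is the speed-up the index buys over a sequential scan.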

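3. Getting started with Lucene

3.1. What is Lucene

Apache Lucene is a high-performance, scalable full-text search engine toolkit written in Java. It can be embedded easily in applications to provide application-specific full-text indexing and retrieval. Lucene's goal is to add full-text search to all kinds of small and medium-sized applications.
Lucene's original author, Doug Cutting, is a veteran expert in full-text indexing and retrieval.
Release history: March 2000, first release; September 2001, joined Apache; July 2004, version 1.4 final; November 2009, versions 2.9.1 (JDK 1.4) and 3.0 (JDK 1.5); March 2015, version 4.10.4; February 2016, version 5.5.0.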

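3.2. Hello World

Like a database, a Lucene index comes with an API for convenient access.

Index maintenance goes through IndexWriter, which provides the add, delete, and update operations; searching goes through IndexSearcher. The Hello World code follows.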

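3.2.1. Creating the index

Steps:
	 1. Convert the text content into a Document object
	    The text is held as a field of the Document
	 2. Prepare an IndexWriter (index writer)
	 3. Use the IndexWriter to add the Document to the buffer and commit
	      addDocument
	      commit
	      close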
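	// Imports needed by the test class (package names assume Lucene 5.x and JUnit 4)
	import java.nio.file.Paths;
	import org.apache.lucene.analysis.Analyzer;
	import org.apache.lucene.analysis.standard.StandardAnalyzer;
	import org.apache.lucene.document.Document;
	import org.apache.lucene.document.Field.Store;
	import org.apache.lucene.document.TextField;
	import org.apache.lucene.index.IndexWriter;
	import org.apache.lucene.index.IndexWriterConfig;
	import org.apache.lucene.index.IndexWriterConfig.OpenMode;
	import org.apache.lucene.store.Directory;
	import org.apache.lucene.store.FSDirectory;
	import org.junit.Test;

	// Data to index; hard-coded for now, taken from the real application later
	String doc1 = "hello world";
	String doc2 = "hello java world";
	String doc3 = "hello lucene world";
	private String path = "F:/eclipse/workspace/lucene/index/hello";

	@Test
	public void testCreate() {
		try {
			// 2. Prepare the IndexWriter (index writer)
			// Location of the index; FS stands for file system
			Directory d = FSDirectory.open(Paths.get(path));
			// Analyzer
			Analyzer analyzer = new StandardAnalyzer();
			// Configuration object for the index writer
			IndexWriterConfig conf = new IndexWriterConfig(analyzer);
			IndexWriter indexWriter = new IndexWriter(d, conf);
			System.out.println(indexWriter);

			// 1. Convert the text content into Document objects
			Document document1 = new Document();
			// Title field
			document1.add(new TextField("title", "doc1", Store.YES));
			document1.add(new TextField("content", doc1, Store.YES));
			// Add the document to the buffer
			indexWriter.addDocument(document1);
			Document document2 = new Document();
			// Title field
			document2.add(new TextField("title", "doc2", Store.YES));
			document2.add(new TextField("content", doc2, Store.YES));
			// Add the document to the buffer
			indexWriter.addDocument(document2);
			Document document3 = new Document();
			// Title field
			document3.add(new TextField("title", "doc3", Store.YES));
			document3.add(new TextField("content", doc3, Store.YES));

			// 3. Use the IndexWriter to add the Document to the buffer and commit
			indexWriter.addDocument(document3);
			indexWriter.commit();
			indexWriter.close();

		} catch (Exception e) {
			e.printStackTrace();
		}
	}

	// Optional: call this on conf before creating the IndexWriter.
	// OpenMode.CREATE resets the index on every run and then re-adds the documents,
	// so a later run overwrites an earlier one (the default mode appends instead).
	conf.setOpenMode(OpenMode.CREATE);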

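Using the graphical client
3.2.2. Searching the index
1. Wrap the submitted search input in a query object
2. Prepare an IndexSearcher
3. Pass the query object to the IndexSearcher; the search returns only document numbers (DocIDs)
4. Pass each DocID to the IndexSearcher to fetch the document
5. Convert the document into the object the front end needs: Document ----> Article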

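	// Additional imports (package names assume Lucene 5.x):
	// org.apache.lucene.index.DirectoryReader, org.apache.lucene.index.IndexReader,
	// org.apache.lucene.queryparser.classic.QueryParser,
	// org.apache.lucene.search.IndexSearcher, Query, ScoreDoc, TopDocs
	@Test
	public void testSearch() {
		String keyWord = "lucene";
		try {
			// 1. Wrap the search input in a query object
			// The query parser turns a string into a query object
			String f = "content"; // default field to search
			Analyzer a = new StandardAnalyzer(); // the keyword must be tokenized too, so an analyzer is needed
			QueryParser parser = new QueryParser(f, a);
			Query query = parser.parse("content:" + keyWord);
			// 2. Prepare the IndexSearcher
			Directory d = FSDirectory.open(Paths.get(path));
			IndexReader r = DirectoryReader.open(d);
			IndexSearcher searcher = new IndexSearcher(r);
			// 3. Pass the query object to the IndexSearcher; the result holds only DocIDs
			TopDocs topDocs = searcher.search(query, 1000); // fetch the top n hits
			System.out.println("Total hits: " + topDocs.totalHits);
			ScoreDoc[] scoreDocs = topDocs.scoreDocs; // wrappers (with docId) for the returned hits
			// 4. Pass each DocID to the IndexSearcher to fetch the document
			for (ScoreDoc scoreDoc : scoreDocs) {
				int docId = scoreDoc.doc;
				Document document = searcher.doc(docId);
				// 5. Convert the document into the object the front end needs: Document ----> Article
				System.out.println("=======================================");
				System.out.println("title:" + document.get("title")
								+ ",content:" + document.get("content"));
			}
			r.close();
		} catch (Exception e) {
			e.printStackTrace();
		}
	}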

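4. Lucene API in detail

The sections above covered the core of Lucene, but many details remain, namely how to use the individual Lucene APIs. The following sections go through them one by one.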

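4.1. The index directory: Directory

Directory is an abstraction over the index directory, which holds the Lucene index files. To create an index directory directly from a folder path, use SimpleFSDirectory.
MMapDirectory: intended for 64-bit systems. It works on the index by combining memory with the disk, reading the index files through memory mapping.
SimpleFSDirectory: the traditional file-system index directory.
RAMDirectory: an in-memory index.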
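
The "memory combined with disk" behaviour of MMapDirectory rests on memory-mapped files. The sketch below shows that mechanism with the JDK alone; it does not use Lucene, and the class MappedReadSketch, its methods, and the temporary file (a stand-in for an index file) are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedReadSketch {
    // Maps the whole file into memory read-only and decodes it as UTF-8.
    static String readMapped(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            return StandardCharsets.UTF_8.decode(buffer).toString();
        }
    }

    // Writes the text to a temporary file (a stand-in for an index file)
    // and reads it back through the memory mapping.
    static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("segment", ".bin");
            file.toFile().deleteOnExit();
            Files.write(file, text.getBytes(StandardCharsets.UTF_8));
            return readMapped(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello lucene world"));
    }
}
```

The operating system pages the file contents in on demand, so reads look like memory accesses while the data still lives on disk. A large mapping needs a large virtual address space, which is why this style of directory is aimed at 64-bit systems.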

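4.2. Document (row) and IndexableField (column)

When content is added to the index, each piece of information is represented by a Document. A Document can be thought of as a record, similar to a row in a relational table.
An IndexableField represents a field, similar to a column in a relational table, except that the number of columns is not fixed. Each Document consists of a series of IndexableFields, which can be thought of as dynamic database columns.
The main methods of Document are:
Adding fields: add(IndexableField field)
Removing fields: removeField, removeFields
Getting fields or values: get, getBinaryValue, getField, getFields, and so on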
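IndexableField and Field
A Field represents one column of data in a Document, comparable to one column of a table record.
Lucene provides the IndexableField interface, and most of the other APIs are written against it, so a column in Lucene is really defined by IndexableField. In practice you mostly use subclasses of the Field class.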

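Store and Index settings of a Field
When creating a Field in Lucene, you can specify its store and index attributes:
 store: whether the field value is stored. Store.YES stores it; Store.NO does not.
 index: how the field is indexed.
 tokenized: whether the field's index is built with the configured analyzer. false means the value is not tokenized; true means it is.

An index actually has two parts. The larger part is the data area, governed by the Store attribute, which decides whether the field's content is saved there. The smaller part is the directory area, governed by Index, which decides whether the field can be searched.
Guidelines for combining Store and Index:
Index or not: does the field need to be searchable?
Tokenize or not: is the value a proper noun, that is, an exact term that should stay whole?
Store or not: does the value need to be shown on the results page? Large fields whose content can be reached through a link do not need to be stored.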
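
The split between the data area and the directory area can be modelled in a few lines of plain Java. This is a toy model under stated assumptions, not Lucene's API: the class StoreIndexSketch and its store, index, and tokenize flags are invented for the illustration. In Lucene itself, these combinations roughly correspond to field classes such as StringField (indexed, not tokenized), TextField (indexed and tokenized), and StoredField (stored only).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class StoreIndexSketch {
    // Data area: docId -> (field name -> original value), filled only for stored fields.
    private final List<Map<String, String>> stored = new ArrayList<>();
    // Directory area: "field:term" -> ids of the documents containing that term.
    private final Map<String, TreeSet<Integer>> terms = new HashMap<>();

    // Creates an empty document and returns its id.
    int newDocument() {
        stored.add(new HashMap<>());
        return stored.size() - 1;
    }

    // Adds a field; each flag decides independently what is kept.
    void addField(int docId, String name, String value,
                  boolean store, boolean index, boolean tokenize) {
        if (store) {
            stored.get(docId).put(name, value);
        }
        if (index) {
            String[] tokens = tokenize
                    ? value.toLowerCase().split("\\s+")
                    : new String[] { value };
            for (String token : tokens) {
                terms.computeIfAbsent(name + ":" + token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // True if the document is findable by this exact term in this field.
    boolean matches(String field, String term, int docId) {
        TreeSet<Integer> ids = terms.get(field + ":" + term);
        return ids != null && ids.contains(docId);
    }

    // Returns the stored value, or null if the field was not stored.
    String storedValue(int docId, String field) {
        return stored.get(docId).get(field);
    }

    public static void main(String[] args) {
        StoreIndexSketch s = new StoreIndexSketch();
        int doc = s.newDocument();
        s.addField(doc, "id", "A-42", true, true, false);               // exact key, displayed
        s.addField(doc, "title", "Hello Lucene", true, true, true);     // searchable by word, displayed
        s.addField(doc, "body", "a very long text", false, true, true); // searchable, not stored
        System.out.println(s.matches("title", "lucene", doc)); // true
        System.out.println(s.matches("id", "A-42", doc));      // true
        System.out.println(s.storedValue(doc, "body"));        // null
    }
}
```

The body field shows the "large field" case from the guidelines: it can be searched, but its text cannot be read back from the index and has to be fetched from wherever the original lives.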

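4.3. Tokenization: Analyzer (lexical analyzer)

The analyzer is a very important topic in Lucene. If you say in an interview that you have used Lucene, expect to be asked which analyzer you used.
Tokenization, also called lexical analysis (or language analysis), determines how content is broken up when the index is built. This is critical in full-text retrieval: should the index be built on English words, or on Chinese words by meaning? That choice is made through the Analyzer.
Chinese requires dictionary-based segmentation: the words of the language are placed in a dictionary that is maintained by some algorithm, and any dictionary word found in the text is cut out as a token. Dictionary segmentation is generally regarded as the best approach for Chinese. For example, "我们是中国人" ("We are Chinese") is segmented into "我们" and "中国人". Available analyzers include SmartChineseAnalyzer, MMAnalyzer ("极易分词"), the "庖丁分词" (Paoding) analyzer, and IKAnalyzer.
We recommend IKAnalyzer here. To use it, add IKAnalyzer.jar and copy the IKAnalyzer.cfg.xml and ext_stopword.dic files into the project. The analyzer test code follows: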
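// Package names below assume Lucene 5.x, JUnit 4, and the IKAnalyzer jar
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class AnalyzerTest {
	// Test data; hard-coded for now, taken from the real application later
	private String en = "oh my lady gaga"; // oh my god
	private String cn = "迅雷不及掩耳盗铃儿响叮当仁不让";
	private String str = "源代码教育FullText Search Lucene框架的学习";

	/**
	 * Tokenizes the given string with the given analyzer and prints each token.
	 * @param analyzer
	 * @param str
	 * @throws Exception
	 */
	public void testAnalyzer(Analyzer analyzer, String str) throws Exception {
		TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(str));
		// The token stream must be reset once before it is consumed
		tokenStream.reset();
		while (tokenStream.incrementToken()) {
			System.out.println(tokenStream);
		}
		// Finish and release the stream
		tokenStream.end();
		tokenStream.close();
	}

	// Standard analyzer: no Chinese word segmentation (Chinese text is split into single characters)
	@Test
	public void testStandardAnalyzer() throws Exception {
		testAnalyzer(new StandardAnalyzer(), cn);
	}

	// Simple analyzer: no Chinese word segmentation
	@Test
	public void testSimpleAnalyzer() throws Exception {
		testAnalyzer(new SimpleAnalyzer(), cn);
	}

	// Bigram segmentation: every two characters form one token
	@Test
	public void testCJKAnalyzer() throws Exception {
		testAnalyzer(new CJKAnalyzer(), cn);
	}

	// Dictionary segmentation: tokens are looked up in a dictionary
	@Test
	public void testSmartChineseAnalyzer() throws Exception {
		testAnalyzer(new SmartChineseAnalyzer(), str);
	}

	// IK segmentation: tokens are looked up in a dictionary
	// Setup: copy the two configuration files IKAnalyzer.cfg.xml and stopword.dic,
	// and add the jar IKAnalyzer2012_V5.jar
	// The dictionaries hold extension words and stop words
	// Note: edit the dictionary files only with Eclipse's Text Editor, not with other tools.
	// After editing, refresh the project so that it recompiles (sometimes needed, sometimes not)
	@Test
	public void testIKAnalyzer() throws Exception {
		// true: coarse-grained (smart) segmentation; false: fine-grained segmentation
		testAnalyzer(new IKAnalyzer(true), str);
	}

}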
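
To see why bigram and dictionary segmentation produce different tokens, the two ideas can be sketched without Lucene. This is a minimal illustration, assuming a tiny hand-made dictionary and forward maximum matching as the lookup strategy. The real analyzers above use their own, more elaborate algorithms, and the class SegmentationSketch is hypothetical. The Chinese sample text is written with Unicode escapes so that the file compiles under any source encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SegmentationSketch {
    // Bigram segmentation: every two adjacent characters form one token.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    // Forward maximum matching: at each position take the longest dictionary
    // word (up to maxLen characters); fall back to one character if none matches.
    static List<String> dictionary(String text, Set<String> words, int maxLen) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = i + 1;
            for (int len = Math.min(maxLen, text.length() - i); len > 1; len--) {
                if (words.contains(text.substring(i, i + len))) {
                    end = i + len;
                    break;
                }
            }
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The example sentence from the text above ("We are Chinese")
        String text = "\u6211\u4EEC\u662F\u4E2D\u56FD\u4EBA";
        // Hand-made dictionary: "we", "China", "Chinese (person)"
        Set<String> words = new HashSet<>(Arrays.asList(
                "\u6211\u4EEC", "\u4E2D\u56FD", "\u4E2D\u56FD\u4EBA"));
        System.out.println(bigrams(text));              // five overlapping two-character tokens
        System.out.println(dictionary(text, words, 3)); // three tokens: we / are / Chinese
    }
}
```

The dictionary version keeps the single character for "are" as its own token, whereas a real analyzer would typically drop it as a stop word. The bigram version needs no dictionary but also emits pairs that are not words.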

4.4. Adding, deleting, and updating index entries

As the analysis so far shows, all changes to the index go through IndexWriter. The test code follows:

// Data source
	private String doc1 = "hello world";
	private String doc2 = "hello java world";
	private String doc3 = "hello lucene world";

// Index directory
private String indexPath = "F:\\ecworkspace\\lucene\\indexCRUD";

@Test
public void createIndex() throws IOException, ParseException {
	/**
	 * Preparation
	 */
	// Index directory
	Directory d = FSDirectory.open(Paths.get(indexPath));
	// Analyzer
	Analyzer analyzer = new StandardAnalyzer();
	// Configuration object for write operations
	IndexWriterConfig conf = new IndexWriterConfig(analyzer);
	// Reset the index on every run
	conf.setOpenMode(OpenMode.CREATE);
	// Core object for write operations
	IndexWriter indexWriter = new IndexWriter(d, conf);
	System.out.println(indexWriter);

	/**
	 * Operations
	 */
	Document document1 = new Document();
	document1.add(new TextField("id", "1", Store.YES));
	document1.add(new TextField("name", "doc1", Store.YES));
	document1.add(new TextField("content", doc1, Store.YES));
	indexWriter.addDocument(document1);

	Document document2 = new Document();
	document2.add(new TextField("id", "2", Store.YES));
	document2.add(new TextField("name", "doc2", Store.YES));
	document2.add(new TextField("content", doc2, Store.YES));
	indexWriter.addDocument(document2);

	Document document3 = new Document();
	document3.add(new TextField("id", "3", Store.YES));
	document3.add(new TextField("name", "doc3", Store.YES));
	document3.add(new TextField("content", doc3, Store.YES));
	indexWriter.addDocument(document3);

	/**
	 * Wrap-up
	 */
	indexWriter.commit();
	indexWriter.close();

	// Print the index contents to verify the result
	searchIndex();
}

@Test
public void del() throws IOException, ParseException {
	/**
	 * Preparation
	 */
	// Index directory
	Directory d = FSDirectory.open(Paths.get(indexPath));
	// Analyzer
	Analyzer analyzer = new StandardAnalyzer();
	// Configuration object for write operations
	IndexWriterConfig conf = new IndexWriterConfig(analyzer);
	// Core object for write operations
	IndexWriter indexWriter = new IndexWriter(d, conf);
	System.out.println(indexWriter);

	// Delete everything
	//indexWriter.deleteAll();

	// Option 1: delete the documents matched by a query
//		QueryParser qpParser = new QueryParser("id", analyzer);
//		Query query = qpParser.parse("1");
//		indexWriter.deleteDocuments(query);

	// Option 2: delete the documents containing a term
	indexWriter.deleteDocuments(new Term("id", "1"));

	indexWriter.commit();
	indexWriter.close();

	// Print the index contents to verify the result
	searchIndex();
}

@Test
public void update() throws IOException, ParseException {
	/**
	 * Preparation
	 */
	// Index directory
	Directory d = FSDirectory.open(Paths.get(indexPath));
	// Analyzer
	Analyzer analyzer = new StandardAnalyzer();
	// Configuration object for write operations
	IndexWriterConfig conf = new IndexWriterConfig(analyzer);
	// Core object for write operations
	IndexWriter indexWriter = new IndexWriter(d, conf);
	System.out.println(indexWriter);

	Document doc = new Document();
	doc.add(new TextField("id", "2", Store.YES));
	doc.add(new TextField("name", "doc2", Store.YES));
	// The new content reads "doc2 after the update"
	doc.add(new TextField("content", "修改后 -的doc2", Store.YES));

	indexWriter.updateDocument(new Term("id", "2"), doc);
	/* Equivalent to:
	 indexWriter.deleteDocuments(new Term("id", "2"));
	 indexWriter.addDocument(doc);
	 */
	indexWriter.commit();
	indexWriter.close();

	// Print the index contents to verify the result
	searchIndex();
}


@Test
public void searchIndex() throws IOException, ParseException {
	// Index directory
	Directory d = FSDirectory.open(Paths.get(indexPath));
	// Analyzer
	Analyzer analyzer = new StandardAnalyzer();
	// Index reader
	IndexReader r = DirectoryReader.open(d);
	// Core search object
	IndexSearcher indexSearcher = new IndexSearcher(r);

	// Query parser
	// Argument 1: default field to search
	// Argument 2: analyzer
	QueryParser queryParser = new QueryParser("content", analyzer);
	// Matches every document in the index
	String queryString = "*:*";

	Query query = queryParser.parse(queryString);
	// Call the search method of the core object
	// Argument query: the query object
	// Argument n: return the top n hits
	TopDocs topDocs = indexSearcher.search(query, 50);
	System.out.println("Total number of hits: " + topDocs.totalHits);

	// Get the result set
	ScoreDoc[] scoreDocs = topDocs.scoreDocs;
	for (ScoreDoc scoreDoc : scoreDocs) {
		// Get the document id
		int docId = scoreDoc.doc;
		// Fetch the Document by its docId
		Document doc = indexSearcher.doc(docId);

		System.out.println("id=" + doc.get("id") + ",name=" + doc.get("name") + ",content=" + doc.get("content"));
	}
	r.close();
}

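4.5. Query and Searcher

Searching is the most important part of full-text retrieval. As the Hello World example showed, Query is only an abstract type with many concrete subclasses. Earlier we created the Query instance with QueryParser's parse method, which analyzes the search string we pass in and builds the appropriate query type. So here we can also construct a Query instance directly with new to get different search behavior.
Extracted helper methods: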

// As preparation, provide two search methods
// The first takes a search string
public void search(String keyword) throws Exception {
		Directory directory = FSDirectory.open(Paths.get("E:\\tools\\eclipse\\workspace\\lucene\\helloIndex"));
		// Index reader
		IndexReader reader = DirectoryReader.open(directory);
		// Documents are searched through the core search class IndexSearcher
		IndexSearcher indexSearcher = new IndexSearcher(reader);

		// Create a QueryParser first
		QueryParser queryParser = new QueryParser("content", new StandardAnalyzer());
		// Let the QueryParser parse the keyword and build the matching query object
		Query query = queryParser.parse(keyword);

		// search returns a wrapper around the top n documents
		TopDocs topDocs = indexSearcher.search(query, 5);
		// Total number of matching documents (an int in Lucene 5.x)
		int totalHits = topDocs.totalHits;
		System.out.println("Total hits: " + totalHits);
		// Get the search results (they do not contain the documents themselves)
		ScoreDoc[] scoreDocs = topDocs.scoreDocs;
		for (ScoreDoc scoreDoc : scoreDocs) {
			int documentId = scoreDoc.doc;
			Document document = indexSearcher.doc(documentId);
			float score = scoreDoc.score;

			// Read the field values of the document
			String docId = document.get("docId");
			String content = document.get("content");

			System.out.println("ID:" + documentId + ",score:" + score + ",docId:" + docId + ",content:" + content);
		}
		reader.close();
	}



// The second takes a query object
	public static void testSearch(Query q) throws Exception {
		// Location of the index
		String path = "E:\\work\\eclipse4.7_project\\Luncene-demo\\index";
		System.out.println("Query being executed: " + q);
		// Open the index directory
		Directory d = FSDirectory.open(Paths.get(path));
		// Open the index reader
		IndexReader reader = DirectoryReader.open(d);
		// Create the index searcher
		IndexSearcher searcher = new IndexSearcher(reader);
		// Run the query
		TopDocs td = searcher.search(q, 10);
		// Walk through the results
		for (int i = 0; i < td.scoreDocs.length; i++) {
			// The hit, which carries the internal document id
			ScoreDoc doc = td.scoreDocs[i];
			// Fetch the document itself
			Document d1 = searcher.doc(doc.doc);
			System.out.println("title: " + d1.get("title") + "     content:" + d1.get("content"));
		}
		reader.close();
	}

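1) Term query
2) Phrase search: to search for several words as one unit, wrap them in double quotes

3) Wildcard search

4) Fuzzy search: at most 2 errors (edits) are allowed

5) Proximity query: a phrase query followed by "~" and a positive integer (1 or greater), which gives the maximum distance allowed between the words of the phrase
6) Combined query
// + (must): the word must appear
// - (must_not): the word must not appear
// no prefix (should): the word may appear
// Clauses marked + and - combine with AND logic; unmarked terms use the parser's default operator, which is OR in the classic QueryParser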
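
For reference, here is one example string for each of the six query types, written in the syntax of the classic QueryParser. This is a sketch under assumptions: the sample words come from the Hello World documents, the class QuerySyntaxSketch is hypothetical, and the code only collects and prints the strings. To run them against an index, each string could be passed to the search(String keyword) helper shown earlier.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class QuerySyntaxSketch {
    // Query type -> example string in the classic QueryParser syntax.
    static Map<String, String> examples() {
        Map<String, String> q = new LinkedHashMap<>();
        q.put("term", "lucene");                  // 1) a single word
        q.put("phrase", "\"hello world\"");       // 2) words as one unit, in double quotes
        q.put("wildcard", "luc*");                // 3) * matches any characters, ? exactly one
        q.put("fuzzy", "lucene~2");               // 4) up to 2 edits away from "lucene"
        q.put("proximity", "\"hello world\"~1");  // 5) phrase with a maximum distance of 1
        q.put("combined", "+hello -java lucene"); // 6) must, must_not, should
        return q;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : examples().entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

The proximity example is meant to match "hello java world" as well, because one word between "hello" and "world" is within the allowed distance, while the plain phrase example is not.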
