Lucene学习笔记之（三）文档加权

最新推荐文章于 2022-08-30 14:43:27 发布

孤师

最新推荐文章于 2022-08-30 14:43:27 发布

阅读量2.9k

点赞数 1

分类专栏： Lucene 文章标签： lucene 全文检索文档索引搜索

本文链接：https://blog.csdn.net/yang307511977/article/details/52071285

版权

Lucene 专栏收录该内容

5 篇文章 0 订阅

订阅专栏

在进入学习之前。先给大家介绍一下，什么是加权？

加权就是有时在搜索的时候，会根据需要的不同，对不同的关键值或者不同的关键索引分配不同的权值，让权值高的内容更容易被用户搜索出来，而且排在前面。

为索引域添加权是再创建索引之前，把索引域的权值设置好，这样，在进行搜索时，lucene会对文档进行评分，这个评分机制是跟权值有关的，而且其它情况相同时，权值跟评分是成正相关的。

也就是说，给谁加权，就会给谁评分，评分越高，就越排在最前面。

那么问题又来了，评分是啥呢？评分就是信息过滤，让用户快速，准确的找到其想要的结果，丰富用户体验。下图为计算公式：

q为查询语句，t是q分词后的每一项，d为去匹配的文档。

打分流程：http://www.360doc.com/content/13/0426/20/891660_281142154.shtml

步骤一：创建maven现目

步骤二：配置pom.xml文件

   <pre name="code" class="java">                <!-- junit包
			因为是java程序，，需要用到@Test，这就是他的jar包下载。-->
		<dependency>
			<groupId>junit</groupId>
			<artifactId>junit</artifactId>
			<version>4.12</version>
			<scope>test</scope>
		</dependency>

		<!-- lucene核心包 
			以下这三个是用在lucene的全部jar包，core是核心包，queryparser是查询jar包。
			查询被索引文件如果是全英文的情况下，pom.xml文件写这三个，就欧了!-->
		<dependency>
			<groupId>org.apache.lucene</groupId>
			<artifactId>lucene-core</artifactId>
			<version>5.3.1</version>
		</dependency>

		<!-- 查询解析器 -->
		<dependency>
			<groupId>org.apache.lucene</groupId>
			<artifactId>lucene-queryparser</artifactId>
			<version>5.3.1</version>
		</dependency>

		<!-- 分析器 -->
		<dependency>
			<groupId>org.apache.lucene</groupId>
			<artifactId>lucene-analyzers-common</artifactId>
			<version>5.3.1</version>
		</dependency>
		<!-- 很明显，这个是查询被检索文件为全中文的情况下，
			加上以上的三个，再加上这两个就行了。
			值得提一下，“高亮显示”的jar包可加可不加，
			在这里面加上，是因为这个在后面会用到。
			但是还是建议大家把这个加上，懂得多也不是个错。
		 	-->
		<!-- 中文分词查询器smartcn -->
		<dependency>
			<groupId>org.apache.lucene</groupId>
			<artifactId>lucene-analyzers-smartcn</artifactId>
			<version>5.3.1</version>
		</dependency>

		<!-- 高亮显示 -->
		<dependency>
			<groupId>org.apache.lucene</groupId>
			<artifactId>lucene-highlighter</artifactId>
			<version>5.3.1</version>
		</dependency>

步骤三：实现代码

import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;


/**
 * 文档加权
 * 	
 * 	1、加权操作：给要迅速查找的东西加上权，可提升查找速度！
 * 
 */

public class AddDocumentReg {

	//写测试数据，这些数据是写到索引文档里去的。
		private String ids[] = {"1","2","3","4"}; //标示文档
		private String author[] = {"Jack","Mary","Jerry","Machech"}; 
		private String title[] = {"java of china","Apple of china","Androw of apple the USA","People of Apple java"}; //
		private String contents[] = {
				"java  of China!the world the why what",
				"why a dertity compante is my hometown!",
				"Jdekia ssde hhh is a beautiful city!",
				"Jdekia ssde hhh is a beautiful java!"
		};
		
		private Directory dir;
		
		/**
		 * 获取IndexWriter实例
		 * @return
		 * @throws Exception
		 */
		private IndexWriter getWriter()throws Exception{
			
			//实例化分析器
			Analyzer analyzer = new StandardAnalyzer();
					
			//实例化IndexWriterConfig
			IndexWriterConfig con = new IndexWriterConfig(analyzer);
					
			//实例化IndexWriter
			IndexWriter writer = new IndexWriter(dir, con);
			
			return writer;
		}
		
		/**
		 * 生成索引（对应图一）
		 * @throws Exception
		 */
		@Test 
		public void index()throws Exception{
			
			dir=FSDirectory.open(Paths.get("E:\\luceneDemo3"));
			
			IndexWriter writer=getWriter();
			
			for(int i=0;i<ids.length;i++){
				
				Document doc=new Document();
				
				doc.add(new StringField("id", ids[i], Field.Store.YES));
				doc.add(new StringField("author", author[i], Field.Store.YES));
				// 加权操作
				TextField field=new TextField("title", title[i], Field.Store.YES);
				
				if("Mary".equals(author[i])){
					
					//设权 默认为1
					field.setBoost(1.5f);
				}
				
				doc.add(field);
				doc.add(new StringField("contents",contents[i],Field.Store.NO));
				
				// 添加文档
				writer.addDocument(doc); 
			}
			
			//关闭writer
			writer.close();
		}

}

图一：见下图说明成功！

/**
		 * 查询（对应图二）
		 * @throws Exception
		 */
		@Test
		public void search()throws Exception{
			
			//得到读取索引文件的路径
			dir = FSDirectory.open(Paths.get("E:\\luceneDemo3"));
			
			//通过dir得到的路径下的所有的文件
			IndexReader reader = DirectoryReader.open(dir);

			//建立索引查询器
			IndexSearcher searcher = new IndexSearcher(reader);
			
			//查找的范围
			String searchField = "title";
			
			//查找的字段
			String q = "apple";
			
			//运用term来查找
			Term t = new Term(searchField,q);
			
			//通过term得到query对象
			Query query = new TermQuery(t);
			
			//获得查询的hits
			TopDocs hits = searcher.search(query, 10);
			
			//显示结果
			System.out.println("匹配 '"+q+"'，总共查询到"+hits.totalHits+"个文档");
			
			//循环得到文档，得到文档就可以得到数据
			for(ScoreDoc scoreDoc:hits.scoreDocs){
				
				Document doc=searcher.doc(scoreDoc.doc);
				
				System.out.println(doc.get("author"));
			}
			
			//关闭reader
			reader.close();
		}

图二：

代码中匹配的是：apple，4个文档里3个都有，验证一下到底是不是呢？看下图：

大家看 author的这个数组，Mary、Jerry、Machech 对应的title 是不是有apple这个关键字。而且是不区分大小写的。哪有同学要问了，问什么它就是把索引为title的查出来了，为什么不差别的呢？这是因为在代码中，有注释为 “//查找的范围” 的代码：String searchField = "title";因为它规定了只在索引为title范围里找符合条件的元素。当然换成别的也是可以实现的。