Lucene Full-Text Indexing Study Notes (Part 1)

大雄号

Article link: https://blog.csdn.net/qq_27339781/article/details/82814360

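I. Full-Text Search vs. Database Search

1.1 Database search

Typical form: select * from table_name where column_name like '%keyword%'

Example: select * from article where content like '%here%'

Drawbacks:

Search quality is poor.
The result set contains a large number of rows, many of which are irrelevant.
Queries are hard to keep fast on large data sets, because a pattern with a leading wildcard usually cannot use an ordinary index and every row has to be scanned.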

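1.2 Full-text search

Results are ranked by relevance: only the first few pages tend to be useful to the user, and the remaining results are likely to be far from what the user wants. A database LIKE search cannot rank by relevance.
Because full-text search works from an index, it is generally much faster than a database LIKE query.
For these reasons, a database search cannot replace full-text search.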
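
To make the difference concrete, here is a minimal plain-Java sketch (not Lucene code) that contrasts a LIKE-style scan with a lookup in a toy inverted index. The class name InvertedIndexSketch and its methods are made up for illustration, and a real Lucene index stores far more (positions, frequencies, scoring data).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class InvertedIndexSketch {
    // term -> sorted set of document numbers that contain the term
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final List<String> docs = new ArrayList<>();

    /** Stores a document and indexes each lowercase, whitespace-separated word. */
    public int add(String text) {
        int docNumber = docs.size();
        docs.add(text);
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docNumber);
            }
        }
        return docNumber;
    }

    /** Index lookup: a single map access, whatever the number of documents. */
    public Set<Integer> searchIndex(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    /** Equivalent of LIKE '%term%': every document is scanned for the substring. */
    public Set<Integer> searchByScan(String term) {
        Set<Integer> hits = new TreeSet<>();
        for (int i = 0; i < docs.size(); i++) {
            if (docs.get(i).toLowerCase().contains(term.toLowerCase())) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("Lucene is a search library"); // document 0
        index.add("here is a database");         // document 1
        index.add("nowhere to search");          // document 2
        System.out.println(index.searchByScan("here")); // [1, 2]: "nowhere" matches as a substring
        System.out.println(index.searchIndex("here"));  // [1]: only the exact term matches
    }
}
```

The scan also returns document 2 because "nowhere" contains "here", which is the kind of irrelevant hit described in 1.1.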

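II. Getting Started with Lucene

2.1 What is Lucene?

Lucene is a high-performance, scalable full-text search engine toolkit written in Java. It can be embedded easily in all kinds of applications to give them full-text indexing and search. Lucene's goal is to add full-text search to small and medium-sized applications. Home page: http://lucene.apache.org/. These notes are based on version 3.0.1.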

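2.2 Lucene's structure

The analysis module handles lexical analysis and language processing, producing Terms.
The index module handles index creation; it contains IndexWriter.
The store module handles reading and writing the index.
QueryParser handles query syntax parsing.
The search module handles searching the index.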

2.3 Development environment

To set up a Lucene development environment, add at least the following JARs:

lucene-core-3.1.0.jar (core)
lucene-analyzers-3.1.0.jar (analyzers)
lucene-highlighter-3.1.0.jar (highlighter)
lucene-memory-3.1.0.jar (used by the highlighter)

2.4.1 Index creation diagram

2.4.2 Code

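import java.io.File;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

	/**
	 * Create the index.
	 */
	@Test
	public void testCreateIndex() throws Exception{
		/**
		 * 1. Create an Article object
		 * 2. Create an IndexWriter object
		 * 3. Convert the Article object into a Document object
		 * 4. Add the Document object to the index
		 * 5. Release resources
		 */
		Article article = new Article();
		article.setAid(1L);
		article.setTitle("lucene是一个全文检索引擎");
		article.setContent("baidu,google都是很好的全文检索引擎");
		// Create the IndexWriter
		// First argument: location of the index
		Directory directory = FSDirectory.open(new File("./Dirindex"));
		// Second argument: the analyzer
		Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
		// Third argument: limits how many terms are indexed per field
		IndexWriter indexWriter = new IndexWriter(directory, analyzer, MaxFieldLength.LIMITED);
		// Convert the article into a document
		Document document = new Document();
		/**
		 * Field parameters
		 *     name   field name in the index
		 *     value  field value
		 *     store  whether the value is stored in the index
		 *            YES stored    NO not stored
		 *     index  whether and how the value is added to the term directory
		 *        NO            nothing is added to the term directory (the field is not searchable)
		 *        ANALYZED      the value is run through the analyzer and each resulting term is added
		 *        NOT_ANALYZED  the whole string is added as a single term
		 */
		Field idField =  new Field("aid", article.getAid().toString(), Store.YES, Index.NOT_ANALYZED);
		Field titleField =  new Field("title", article.getTitle(), Store.YES, Index.NO);
		Field contentField =  new Field("content", article.getContent(), Store.YES, Index.ANALYZED);
		// Add the fields to the document
		document.add(idField);
		document.add(titleField);
		document.add(contentField);
		
		indexWriter.addDocument(document);
		
		indexWriter.close();
	}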

After the test runs, the index files appear in the ./Dirindex directory.

2.5.1 Search diagram

2.5.2 Code

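import java.io.File;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

	/**
	 * Retrieve information from the index by keyword.
	 */
	@Test
	public void testSearchIndex() throws Exception{
		// Create an IndexSearcher object
		Directory directory = FSDirectory.open(new File("./Dirindex"));
		IndexSearcher indexSearcher = new IndexSearcher(directory);
		Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
		/**
		 * First argument: the version
		 * Second argument: the field (or fields) to search in
		 */
		//QueryParser queryParser = new QueryParser(Version.LUCENE_30,"content",analyzer);
		QueryParser queryParser = new MultiFieldQueryParser(Version.LUCENE_30,new String[]{"title","content"},analyzer);
		// The keywords to search for
		Query query = queryParser.parse("全文检索引擎");
		/**
		 * Second argument: how many of the top hits to return
		 *  TopDocs-->Top Documents
		 */
		TopDocs topDocs = indexSearcher.search(query, 1);
		int count = topDocs.totalHits; // total number of documents that match the query
		ScoreDoc[] scoreDocs = topDocs.scoreDocs;
		List<Article> articles = new ArrayList<Article>();
		for (ScoreDoc scoreDoc : scoreDocs) {
			// Internal document ID of the hit
			int index = scoreDoc.doc;
			// Load the document by its internal ID
			Document document = indexSearcher.doc(index);
			// Convert the document into an article
			Article article = new Article();
			article.setAid(Long.parseLong(document.get("aid")));
			article.setTitle(document.get("title"));
			article.setContent(document.get("content"));
			articles.add(article);
		}
		// Release the searcher once the stored fields have been read
		indexSearcher.close();
		
		for (Article article : articles) {
			System.out.println(article.getAid());
			System.out.println(article.getTitle());
			System.out.println(article.getContent());
		}
	}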

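Create an IndexSearcher.
Create a Query object.
Run the search.
Get the total hit count and the document IDs of the top N hits.
Use the document IDs to load each Document, convert it into a JavaBean, and add it to a collection.
Loop over the collection and print the retrieved content.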

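2.6 Lucene key points

1. Running index creation twice: adding the same JavaBean data twice succeeds both times, which shows that the ID in the JavaBean is not what uniquely identifies an index entry. In Lucene, the unique identifier (the document ID) is generated internally.
2. When searching, try both "Lucene" and "lucene": the results are the same, because the analyzer lowercases the keywords you enter.
3. The analyzer is used both when building the index and when searching it.
4. The index holds two broad kinds of data: the term directory and the stored content.
5. The Store parameter controls whether the content is kept in the stored-content part of the index.
6. The Index parameter controls whether keywords are added to the term directory.
7. While an IndexWriter is operating on the index, Lucene locks the index so that no other IndexWriter can access it and cause inconsistent data; the lock is held until the IndexWriter is closed. Conclusion: only one IndexWriter can operate on a given index at a time, so managing it as a singleton works well.
8. When Lucene deletes documents, it first records the documents to delete in a .del file, and that file is later merged with the .cfs file to produce the final result.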
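
The locking behaviour in the IndexWriter point can be pictured with a small plain-Java sketch. ToyWriter is a hypothetical stand-in for IndexWriter, and the lock is modelled as a file that is created atomically; write.lock is the name Lucene uses for its lock file, but Lucene's actual lock implementation differs from this sketch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

public class IndexLockSketch {

    /** Hypothetical stand-in for IndexWriter: holds a lock file while it is open. */
    public static final class ToyWriter implements AutoCloseable {
        private final Path lockFile;

        public ToyWriter(Path indexDir) {
            lockFile = indexDir.resolve("write.lock");
            try {
                // createFile is atomic and fails if the file already exists.
                Files.createFile(lockFile);
            } catch (FileAlreadyExistsException e) {
                throw new IllegalStateException("index is locked by another writer", e);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        @Override
        public void close() {
            try {
                Files.deleteIfExists(lockFile); // release the lock
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Creates an empty temporary directory that plays the role of the index directory. */
    public static Path newTempIndexDir() {
        try {
            return Files.createTempDirectory("toy-index");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns true if a writer can currently be opened (and closes it again). */
    public static boolean canOpenWriter(Path indexDir) {
        try (ToyWriter writer = new ToyWriter(indexDir)) {
            return true;
        } catch (IllegalStateException locked) {
            return false;
        }
    }

    public static void main(String[] args) {
        Path dir = newTempIndexDir();
        try (ToyWriter first = new ToyWriter(dir)) {
            System.out.println(canOpenWriter(dir)); // false: the first writer holds the lock
        }
        System.out.println(canOpenWriter(dir));     // true: the lock was released on close
    }
}
```

Sharing one writer per index across the application avoids running into the locked state in the first place.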

III. Analyzers

3.1 English analyzer

Steps, using this sentence as the example to split: Creates a searcher searching the index in the named directory

(1) Split the text into keywords

Creates, searcher, searching, the, index, the, named
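
As a rough illustration of what an English analyzer does with that sentence, the plain-Java sketch below splits the text, lowercases each token (the behaviour noted in 2.6) and drops stop words. The class name and the tiny stop-word list are assumptions made for this example; Lucene's StandardAnalyzer has its own tokenizer and stop-word set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class EnglishAnalysisSketch {
    // Illustrative stop-word list (an assumption, not Lucene's built-in list).
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "in", "of"));

    /** Splits on non-alphanumeric characters, lowercases, and removes stop words. */
    public static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^A-Za-z0-9]+")) {   // step 1: split into tokens
            if (token.isEmpty()) {
                continue;
            }
            String lower = token.toLowerCase(Locale.ROOT);    // step 2: lowercase
            if (!STOP_WORDS.contains(lower)) {                // step 3: drop stop words
                terms.add(lower);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Creates a searcher searching the index in the named directory"));
        // prints [creates, searcher, searching, index, named, directory]
    }
}
```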

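3.2 Chinese analyzers

3.2.1 Single-character tokenization

Analyzer analyzer2 = new ChineseAnalyzer();

Splits the text into individual Chinese characters, one term per character. Relatively inefficient.

3.2.2 Bigram tokenization

Analyzer analyzer3 = new CJKAnalyzer(Version.LUCENE_30);

Takes every pair of adjacent characters as a term. Also relatively inefficient, and in many cases the resulting terms are not real words.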
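
The two strategies are easy to picture with a plain-Java sketch. This is not how ChineseAnalyzer or CJKAnalyzer are implemented; the class and method names are made up, and the code assumes every character fits in a single Java char (no surrogate pairs).

```java
import java.util.ArrayList;
import java.util.List;

public class CjkTokenizeSketch {

    /** Single-character tokenization: one term per character. */
    public static List<String> unigrams(String text) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            terms.add(text.substring(i, i + 1));
        }
        return terms;
    }

    /** Bigram tokenization: one term per pair of adjacent characters. */
    public static List<String> bigrams(String text) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            terms.add(text.substring(i, i + 2));
        }
        return terms;
    }

    public static void main(String[] args) {
        String text = "全文检索引擎";
        System.out.println(unigrams(text)); // six single-character terms
        System.out.println(bigrams(text));  // five overlapping two-character terms
    }
}
```

For the six-character phrase in main, the bigram output includes pairs such as 文检 that are not words, which is the weakness described above.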

3.2.3 Dictionary-based tokenization (IKAnalyzer)

Analyzer analyzer4 = new IKAnalyzer();

Splits the text into actual words in most cases. This is the analyzer most commonly used.

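3.3 Extending the IKAnalyzer dictionaries

The dictionary files must be encoded in UTF-8.

ext_stopword.dic is the stop-word dictionary; every word in it is treated as a stop word.

IKAnalyzer.cfg.xml is IKAnalyzer's configuration file.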

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
   <comment>IK Analyzer extension configuration</comment>
   
   <entry key="ext_dict">/mydict.dic; </entry>

   
</properties>

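IV. Querying the Index

MultiFieldQueryParser: the advantage of this class is that it can search several fields at once, whereas QueryParser searches only one default field.

1. Query methods

1.1 Query strings

QueryParser -> Query object

Query strings can combine conditions:

"lucene AND 互联网": both terms must appear for a document to match

"lucene OR 互联网": a document matches if either term appears

1.2 Creating and configuring Query objects yourself

1.2.1 Term query (TermQuery)

Note: terms are written to the index through the analyzer, so all English terms in the index are lowercase. A TermQuery is not passed through the analyzer, so the term must be written in lowercase as lucene; otherwise nothing is found.

The equivalent query-string syntax is: title:lucene

1.2.2 Matching all documents

The equivalent query-string syntax is: *:*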

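1.2.3 Range query

The equivalent query strings are:

First form (bounds included): id:[5 TO 15]

Second form (bounds excluded): id:{5 TO 15}

Note: numbers cannot be written into a Lucene index as they are; they have to be converted first. The NumberStringTools helper class provides the conversion, and the DocumentUtils utility class applies the same conversion.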
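
The need for a conversion follows from terms being compared as strings. The sketch below is not the NumberStringTools class mentioned above (its code is not shown in these notes); it is a hypothetical zero-padding scheme for non-negative ints that makes string order agree with numeric order.

```java
import java.util.Locale;

public class NumberPaddingSketch {
    private static final int WIDTH = 10; // Integer.MAX_VALUE has 10 digits

    /** Converts a non-negative int into a fixed-width, zero-padded string. */
    public static String toSortable(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("negative numbers are not handled in this sketch");
        }
        return String.format(Locale.ROOT, "%0" + WIDTH + "d", n);
    }

    /** Converts a padded string back into the original number. */
    public static int fromSortable(String s) {
        return Integer.parseInt(s);
    }

    /** Inclusive range check on strings, the way terms are compared in a term range. */
    public static boolean inRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        // Raw strings: "9" sorts after "15", so 9 falls outside [5 TO 15].
        System.out.println(inRange("9", "5", "15")); // false
        // Padded strings: string order now matches numeric order.
        System.out.println(inRange(toSortable(9), toSortable(5), toSortable(15))); // true
        System.out.println(fromSortable(toSortable(15))); // 15
    }
}
```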

1.2.4 Wildcard query

The equivalent query-string syntax is: title:lucen?
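
In Lucene's wildcard syntax, ? stands for exactly one character and * for any sequence of characters. The plain-Java sketch below mimics that matching rule by translating a pattern into a regular expression; it is only an illustration with made-up names, not how WildcardQuery works internally.

```java
import java.util.regex.Pattern;

public class WildcardSketch {

    /** Translates a wildcard pattern into an equivalent regular expression. */
    public static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');                                // exactly one character
            } else if (c == '*') {
                regex.append(".*");                               // any sequence, including empty
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));   // literal character
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** Returns true if the whole term matches the wildcard pattern. */
    public static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("lucen?", "lucene")); // true
        System.out.println(matches("lucen?", "lucen"));  // false: ? needs one character
        System.out.println(matches("luc*", "lucene"));   // true
    }
}
```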

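1.2.5 Phrase query

In the phrase query, the first term is placed at position 0 and the second term at position 3, so two other terms sit between them.

The equivalent query string is: title:"lucene ? ? 互联网"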
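
The role of the positions can be sketched in plain Java: each query term must appear at its given offset relative to some starting token. The names below are hypothetical, and the word internet stands in for the second term of the query string; Lucene's PhraseQuery does this against the positions stored in the index.

```java
import java.util.Arrays;
import java.util.List;

public class PhrasePositionSketch {

    /**
     * Returns true if there is a start index such that, for every i,
     * terms[i] occurs at token position start + positions[i].
     * Positions are assumed to be non-negative.
     */
    public static boolean matches(List<String> tokens, String[] terms, int[] positions) {
        for (int start = 0; start < tokens.size(); start++) {
            boolean allFound = true;
            for (int i = 0; i < terms.length; i++) {
                int p = start + positions[i];
                if (p >= tokens.size() || !tokens.get(p).equals(terms[i])) {
                    allFound = false;
                    break;
                }
            }
            if (allFound) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] terms = {"lucene", "internet"};
        int[] positions = {0, 3};
        List<String> spaced = Arrays.asList("lucene", "powers", "many", "internet", "sites");
        List<String> adjacent = Arrays.asList("lucene", "internet");
        System.out.println(matches(spaced, terms, positions));   // true: two tokens in between
        System.out.println(matches(adjacent, terms, positions)); // false: the gap is missing
    }
}
```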

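1.2.6 Boolean query

Several query conditions can be combined into one.

The example here matches all indexed documents whose title contains the keyword lucene and whose ID is between 5 and 15, excluding 5 and 15.

The equivalent query string is: +id:{5 TO 15} +title:lucene

Notes:

Using MUST_NOT on its own is meaningless.
MUST_NOT combined with MUST_NOT is meaningless and returns no results.
Using SHOULD on its own gives the same result as MUST.
SHOULD with MUST_NOT: SHOULD behaves like MUST here, so the result is the same as MUST with MUST_NOT.
MUST with SHOULD: SHOULD has no effect on which documents match; the result set is that of the MUST clause.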
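
These combination rules can be written down as a small matching function. The plain-Java sketch below only decides whether a document matches and ignores scoring; Doc, Clause and the helper predicates are hypothetical names for this example, not Lucene classes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class BooleanClauseSketch {
    public enum Occur { MUST, SHOULD, MUST_NOT }

    public static final class Doc {
        public final int id;
        public final String title;
        public Doc(int id, String title) { this.id = id; this.title = title; }
    }

    public static final class Clause {
        final Predicate<Doc> query;
        final Occur occur;
        public Clause(Predicate<Doc> query, Occur occur) { this.query = query; this.occur = occur; }
    }

    /** Sub-query for id:{lower TO upper}, bounds excluded. */
    public static Predicate<Doc> idBetweenExclusive(int lower, int upper) {
        return d -> d.id > lower && d.id < upper;
    }

    /** Sub-query for title:term. */
    public static Predicate<Doc> titleIs(String term) {
        return d -> d.title.equals(term);
    }

    public static boolean matches(Doc doc, List<Clause> clauses) {
        boolean hasMust = false;
        boolean shouldHit = false;
        for (Clause c : clauses) {
            boolean hit = c.query.test(doc);
            if (c.occur == Occur.MUST) {
                hasMust = true;
                if (!hit) return false;   // every MUST clause has to match
            } else if (c.occur == Occur.MUST_NOT) {
                if (hit) return false;    // any MUST_NOT hit excludes the document
            } else {
                shouldHit |= hit;
            }
        }
        // With a MUST clause, SHOULD no longer decides matching.
        // Without one, at least one SHOULD has to hit; MUST_NOT alone matches nothing.
        return hasMust || shouldHit;
    }

    public static void main(String[] args) {
        List<Clause> query = Arrays.asList(
                new Clause(idBetweenExclusive(5, 15), Occur.MUST),
                new Clause(titleIs("lucene"), Occur.MUST));
        System.out.println(matches(new Doc(5, "lucene"), query)); // false: 5 is excluded
        System.out.println(matches(new Doc(7, "lucene"), query)); // true
        List<Clause> onlyNot = Arrays.asList(new Clause(titleIs("solr"), Occur.MUST_NOT));
        System.out.println(matches(new Doc(7, "lucene"), onlyNot)); // false: MUST_NOT alone
    }
}
```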

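1.3 Relevance score

Document.setBoost controls the score. The default boost is 1.

Conclusion: Document.setBoost lets you adjust the relevance score by hand and so push a particular document to the top of the results.

1.4 Sorting by a field

Sorting uses an overload of indexSearcher.search that takes a Sort object; in this example the Sort orders the results by id ascending. SortField.INT specifies the type of the id field, and the type determines how values are compared. The last parameter is reverse: false sorts ascending (the default) and true sorts descending.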
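
Why the field type matters can be shown in plain Java: the same ids sort differently when compared as strings and when compared as integers. The class and method names are made up for this sketch, and the reverse flag mirrors the parameter described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortTypeSketch {

    /** Sorts the ids by string comparison. */
    public static List<String> sortAsStrings(List<String> ids, boolean reverse) {
        Comparator<String> cmp = Comparator.naturalOrder();
        return sorted(ids, reverse ? cmp.reversed() : cmp);
    }

    /** Sorts the ids by their integer value. */
    public static List<String> sortAsInts(List<String> ids, boolean reverse) {
        Comparator<String> cmp = Comparator.comparingInt(Integer::parseInt);
        return sorted(ids, reverse ? cmp.reversed() : cmp);
    }

    private static List<String> sorted(List<String> ids, Comparator<String> cmp) {
        List<String> copy = new ArrayList<>(ids); // leave the input untouched
        copy.sort(cmp);
        return copy;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("10", "9", "2");
        System.out.println(sortAsStrings(ids, false)); // [10, 2, 9]
        System.out.println(sortAsInts(ids, false));    // [2, 9, 10]
        System.out.println(sortAsInts(ids, true));     // [10, 9, 2]
    }
}
```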

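1.5 Highlighting

(1) Creating and configuring the highlighter

Creating a highlighter requires two things:

Formatter: how the matched keyword should be displayed

Scorer: the query condition

Fragmenter sets the length of the text fragment. The default is 100.

If the text is longer than the configured size, the part beyond it is not shown.

(2) Using the highlighter

getBestFragment returns the highlighted text.

Parameters:

1. The analyzer.

For English: StandardAnalyzer. For Chinese: IKAnalyzer.

2. The field to highlight

3. The text to highlight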
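
The three roles (Formatter, Scorer, Fragmenter) can be imitated in a few lines of plain Java. This is only a sketch with hypothetical names: the pre and post strings play the Formatter's role, the keyword stands in for the query, and fragmentSize plays the Fragmenter's role. Lucene's Highlighter picks and scores fragments in a more elaborate way.

```java
public class HighlightSketch {

    /**
     * Returns a fragment of at most fragmentSize characters that starts at the first
     * case-insensitive occurrence of keyword, with the keyword wrapped in pre/post.
     * The keyword itself is never cut. Returns null when the keyword does not occur.
     */
    public static String bestFragment(String text, String keyword,
                                      String pre, String post, int fragmentSize) {
        int hit = indexOfIgnoreCase(text, keyword);
        if (hit < 0) {
            return null; // nothing to highlight
        }
        int keywordEnd = hit + keyword.length();
        int end = Math.min(text.length(), Math.max(keywordEnd, hit + fragmentSize));
        return pre + text.substring(hit, keywordEnd) + post + text.substring(keywordEnd, end);
    }

    private static int indexOfIgnoreCase(String text, String keyword) {
        for (int i = 0; i + keyword.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, keyword, 0, keyword.length())) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "Lucene is a full-text search toolkit";
        // Text beyond the 12-character fragment is cut off.
        System.out.println(bestFragment(text, "lucene", "<b>", "</b>", 12));
        System.out.println(bestFragment(text, "solr", "<b>", "</b>", 12)); // null
    }
}
```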

V. Update and Delete

5.1 Delete

	/**
	 * Delete.
	 *   The original .cfs file is not removed; an additional .del file is created instead.
	 */
	public boolean delete(String id){
		IndexWriter indexWriter = null;
		try {
			// First argument: location of the index
			Directory directory = FSDirectory.open(new File(dirPath));
			// Second argument: the analyzer
			Analyzer analyzer =  new IKAnalyzer(); 
			// Third argument: limits how many terms are indexed per field
			indexWriter = new IndexWriter(directory, analyzer, MaxFieldLength.LIMITED);
			Term term = new Term("id",id);
			indexWriter.deleteDocuments(term);
			return true;
		} catch (IOException e) {
			e.printStackTrace();
			return false;
		}finally{
			// The writer is null if opening the index failed
			if (indexWriter != null) {
				try {
					indexWriter.close();
				} catch (Exception e) {
					e.printStackTrace();
				}
			}
		}
	}

Note: indexWriter.deleteDocuments takes a Term, that is, a keyword. Because the ID field is indexed as Index.NOT_ANALYZED, the ID can be passed as-is.

5.2 Update

	/**
	 * Update.
	 *    Delete first, then add.
	 */
	public boolean update(LuceneBean luceneBean){
		IndexWriter indexWriter = null;
		try {
			// First argument: location of the index
			Directory directory = FSDirectory.open(new File(dirPath));
			// Second argument: the analyzer
			Analyzer analyzer =  new IKAnalyzer(); 
			// Third argument: limits how many terms are indexed per field
			indexWriter = new IndexWriter(directory, analyzer, MaxFieldLength.LIMITED);
			Term term = new Term("id",luceneBean.getId());
			// Convert the bean into a document
			Document document = DocumentUtils.bean2Document(luceneBean);
			indexWriter.updateDocument(term, document);
			return true;
		} catch (IOException e) {
			e.printStackTrace();
			return false;
		}finally{
			// The writer is null if opening the index failed
			if (indexWriter != null) {
				try {
					indexWriter.close();
				} catch (Exception e) {
					e.printStackTrace();
				}
			}
		}
	}

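Note: a Lucene update is not like a database update. An update may change where the keywords occur, so the analyzer would have to locate the keywords again and the entries would have to be replaced in both the term directory and the stored content, which is inefficient. Lucene therefore carries out an update in two steps: delete, then add.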
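
The delete-then-add behaviour can be modelled with a toy in-memory store. This plain-Java sketch uses hypothetical names and is not Lucene code: the list position plays the role of the internal document number, and the set of delete markers plays the role of the .del file.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UpdateAsDeleteAddSketch {
    private final List<String[]> docs = new ArrayList<>(); // list position = internal document number
    private final Set<Integer> deleted = new HashSet<>();  // delete markers

    /** Appends a document and returns its internal document number. */
    public int add(String id, String content) {
        docs.add(new String[] {id, content});
        return docs.size() - 1;
    }

    /** Marks every live document with the given id as deleted; returns how many were marked. */
    public int delete(String id) {
        int count = 0;
        for (int n = 0; n < docs.size(); n++) {
            if (docs.get(n)[0].equals(id) && deleted.add(n)) {
                count++;
            }
        }
        return count;
    }

    /** Update = delete the old versions, then add the new one. */
    public int update(String id, String content) {
        delete(id);
        return add(id, content);
    }

    /** Returns the contents of all live documents with the given id. */
    public List<String> liveContents(String id) {
        List<String> contents = new ArrayList<>();
        for (int n = 0; n < docs.size(); n++) {
            if (!deleted.contains(n) && docs.get(n)[0].equals(id)) {
                contents.add(docs.get(n)[1]);
            }
        }
        return contents;
    }

    public static void main(String[] args) {
        UpdateAsDeleteAddSketch index = new UpdateAsDeleteAddSketch();
        index.add("1", "old text");
        index.add("1", "old text");                        // same bean id, new internal number
        System.out.println(index.liveContents("1"));       // [old text, old text]
        System.out.println(index.update("1", "new text")); // 2: the new document's internal number
        System.out.println(index.liveContents("1"));       // [new text]
    }
}
```

Adding the same id twice leaves two live documents, which matches the observation in 2.6 that the bean's ID does not uniquely identify an index entry.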