Solr and Lucene (1): Overview and Getting Started

CoffeeAndIce

Category: solr

Article link: https://blog.csdn.net/CoffeeAndIce/article/details/77335948

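Introduction

Lucene 4 is an excellent open-source full-text search framework, but it is not a complete search engine: a search engine needs, at minimum, a crawler and storage management for the collected data on top of it. Lucene mainly addresses queries that SQL cannot express, or can express only with many LIKE and OR clauses. That is full-text search: the terms in the target documents are extracted and assembled into an index, and the target documents are then found by querying that index. This process of first building an index and then searching the index is called full-text search (Full-text Search).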
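To make the "build an index first, then search the index" idea concrete, here is a minimal sketch of an inverted index in plain Java. It is not how Lucene is implemented; the class name InvertedIndexSketch, its methods, and the sample sentences are assumptions made up for illustration, and the splitting rule (lowercase, break on anything that is not a letter or digit) is a deliberately naive stand-in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Toy inverted index (hypothetical): term -> sorted ids of the documents containing it. */
final class InvertedIndexSketch {

    private final Map<String, TreeSet<Integer>> postings = new HashMap<String, TreeSet<Integer>>();

    /** Splits the text into lowercase terms and records docId under each term. */
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (term.isEmpty()) {
                continue; // split() yields an empty first element when the text starts with a separator
            }
            TreeSet<Integer> ids = postings.get(term);
            if (ids == null) {
                ids = new TreeSet<Integer>();
                postings.put(term, ids);
            }
            ids.add(docId);
        }
    }

    /** Looks the term up in the index; no document text is scanned at query time. */
    List<Integer> search(String term) {
        TreeSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? Collections.<Integer>emptyList() : new ArrayList<Integer>(ids);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Lucene is a full-text search library");
        index.add(2, "Solr is built on Lucene");
        index.add(3, "SQL LIKE scans every row");
        System.out.println(index.search("lucene")); // prints [1, 2]
        System.out.println(index.search("like"));   // prints [3]
    }
}
```

The point of the sketch is the lookup cost: a query touches one map entry instead of scanning every document, which is what a chain of LIKE conditions would effectively do.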

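The data that full-text search works on is called unstructured data, as opposed to structured data:

Structured data: data with a fixed format or a bounded length, such as databases and metadata.

Unstructured data: data of variable length or without a fixed format, such as emails and Word documents.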

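Main application scenarios for the Lucene 4 framework:

1. Online stores: searching product information.
2. Forum systems: searching posts.
3. News systems: searching news content.
4. Search engines: searching information (e.g., Baidu, Google).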

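Environment Setup (Maven)

Lucene official download site: http://lucene.apache.org

To keep everything on one version, the examples in this article use lucene-4.10.3.zip (the version studied here), dated 2014.12.10.

Luke, a tool for inspecting the index, can be downloaded from: https://github.com/DmitryKey/luke/releases

The project uses jar packaging.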

pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.linge</groupId>
<artifactId>lucene_test</artifactId>
<version>0.0.1-SNAPSHOT</version>
<!-- Dependencies -->
<dependencies>
<!-- junit4 -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
<!-- lucene-core -->
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-core</artifactId>
<version>4.10.3</version>
</dependency>
<!-- lucene-analyzers-common -->
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-analyzers-common</artifactId>
<version>4.10.3</version>
</dependency>
<!-- lucene-queryparser -->
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-queryparser</artifactId>
<version>4.10.3</version>
<exclusions>
<exclusion>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-sandbox</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-queries</artifactId>
</exclusion>
</exclusions>
</dependency>
<!-- lucene-analyzers-smartcn -->
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-analyzers-smartcn</artifactId>
<version>4.10.3</version>
</dependency>
<!-- lucene-highlighter -->
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-highlighter</artifactId>
<version>4.10.3</version>
</dependency>
<!-- IKAnalyzer Chinese tokenizer -->
<dependency>
<groupId>org.wltea</groupId>
<artifactId>IKAnalyzer</artifactId>
<version>2012FF</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.1</version>
<configuration>
<source>1.7</source>
<target>1.7</target>
<encoding>utf-8</encoding>
</configuration>
</plugin>
</plugins>
</build>
</project>

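Getting-Started Demo: Creating an Index

Step analysis:

Basic idea:

Collect data → build document objects → analyze the documents (tokenize) → create the index

How is the data collected? (three ways)

1. For web pages on the Internet, fetch the pages over HTTP and save them locally as HTML files.

2. If the data is in a database, connect to the database and read the rows from the tables.

3. If the data is a file in the file system, read the file's contents through the file system.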
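The third collection route (files on disk) can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class FileCollectorSketch and its collect method are hypothetical names, the demo uses UTF-8 and a temporary directory, and real input may need a different charset (the indexing code in this article reads its files as "gbk").

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: gathers file name -> decoded text for one directory. */
final class FileCollectorSketch {

    /** Reads every regular file directly under dir (sub-directories are skipped). */
    static Map<String, String> collect(Path dir, Charset charset) throws IOException {
        Map<String, String> contents = new TreeMap<String, String>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    byte[] bytes = Files.readAllBytes(entry);
                    contents.put(entry.getFileName().toString(), new String(bytes, charset));
                }
            }
        }
        return contents;
    }

    public static void main(String[] args) throws IOException {
        // Demo data in a temporary directory, so the sketch runs anywhere.
        Path dir = Files.createTempDirectory("collect-demo");
        Files.write(dir.resolve("a.txt"), "full-text search".getBytes(StandardCharsets.UTF_8));
        System.out.println(collect(dir, StandardCharsets.UTF_8));
    }
}
```

Each map entry corresponds to what later becomes one document: the file name and its text are the raw material for the fields added to the index.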

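Implementation flow

The four basic steps can be broken down into seven:

1. Add the dependency jars (lucene-core-4.10.3.jar, lucene-analyzers-common-4.10.3.jar)
2. Create the IndexWriter object (it performs create, update, and delete operations on the index)
3. Create a Document object
4. Create Field objects
5. Add the fields to the document in a loop
6. Have the IndexWriter add each document to the index in a loop
7. Have the IndexWriter commit the data and close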

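import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.DoubleField;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.FloatField;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

/**
 * Writes documents into the index.
 * @author LinGe
 * @email lg625740749@outlook.com
 * @version 1.0
 */
public class IndexWriterTest {

	@Test
	public void test() throws Exception {
		/** Directory where the index is stored */
		Directory directory = FSDirectory.open(new File("D:\\Lucene4\\lucene_index"));
		/** Create the analyzer (StandardAnalyzer splits Chinese text into single characters) */
		Analyzer analyzer = new StandardAnalyzer();
		/** Configuration object needed for writing the index */
		IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_4_10_3, analyzer);
		/**
		 * Set the mode used to open the index:
		 * OpenMode.CREATE: recreate the index every time
		 * OpenMode.APPEND: append mode (never creates the index)
		 * OpenMode.CREATE_OR_APPEND: create the index on the first run, append afterwards
		 */
		indexWriterConfig.setOpenMode(OpenMode.CREATE);
		/**
		 * Create the IndexWriter (create, update, and delete operations on the index).
		 * First argument: the index directory
		 * Second argument: the writer configuration
		 */
		IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
		try {
			/** Directory holding the source files */
			File dir = new File("D:\\Lucene4\\file");
			File[] files = dir.listFiles();
			if (files == null) {
				// listFiles() returns null when the path is missing or is not a directory
				throw new IllegalStateException("Not a readable directory: " + dir);
			}
			int cursor = 1;
			/** Iterate over all files and write them into the index */
			for (File file : files) {
				if (!file.isFile()) {
					continue; // skip sub-directories
				}
				/** One file corresponds to one document */
				Document doc = new Document();
				/** Add the fields */
				doc.add(new StringField("id", String.valueOf(cursor++), Store.YES));
				doc.add(new TextField("name", file.getName(), Store.YES));
				doc.add(new TextField("filePath", file.getPath(), Store.YES));
				// A Reader-based TextField is tokenized and indexed but not stored
				doc.add(new TextField("fileContent", new InputStreamReader(new FileInputStream(file), "gbk")));
				doc.add(new TextField("content", readFile(file), Store.YES));
				doc.add(new IntField("int", cursor * 10, Store.YES));
				doc.add(new FloatField("float", cursor * 50.5f, Store.YES));
				doc.add(new DoubleField("double", cursor * 100.5d, Store.YES));
				doc.add(new LongField("long", cursor * 1000L, Store.YES));
				doc.add(new StoredField("stored", "stored only: " + cursor));

				/** Define a custom field type */
				FieldType ft = new FieldType();
				ft.setStored(true); // store the value
				ft.setIndexed(true); // index the value
				ft.setTokenized(true); // tokenize the value
				Field field = new Field("field", "传智" + cursor, ft);
				doc.add(field);

				indexWriter.addDocument(doc);
			}
			// Commit once after all documents are added instead of once per document
			indexWriter.commit();
		} finally {
			// Close even on failure so the index write lock is released
			indexWriter.close();
		}
	}

	/** Reads the whole file as GBK text, keeping line breaks so words on adjacent lines do not merge. */
	private String readFile(File file) throws Exception {
		BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(file), "gbk"));
		try {
			StringBuilder res = new StringBuilder();
			String line;
			while ((line = br.readLine()) != null) {
				res.append(line).append('\n');
			}
			return res.toString();
		} finally {
			br.close();
		}
	}
}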
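The indexing code describes StandardAnalyzer as a single-character tokenizer for Chinese text. The plain-Java sketch below imitates only that visible effect: each Han character becomes its own token, while runs of other letters and digits become lowercase words. It is not Lucene's implementation, and the class name UnigramTokenizerSketch is a hypothetical name chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical imitation of per-character tokenization for Han text. */
final class UnigramTokenizerSketch {

    /** Emits each Han character as its own token and groups other letters/digits into lowercase words. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean han = Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
            if (!han && Character.isLetterOrDigit(cp)) {
                word.appendCodePoint(cp); // still inside a non-Han word
                continue;
            }
            flush(word, tokens); // a Han character or a separator ends the current word
            if (han) {
                tokens.add(new String(Character.toChars(cp)));
            }
        }
        flush(word, tokens);
        return tokens;
    }

    /** Moves the pending word, if any, into the token list. */
    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString().toLowerCase(Locale.ROOT));
            word.setLength(0);
        }
    }

    public static void main(String[] args) {
        // \u5168\u6587\u68c0\u7d22 is the four-character Chinese phrase for "full-text search"
        System.out.println(tokenize("Lucene \u5168\u6587\u68c0\u7d22"));
    }
}
```

For the input in main, the sketch yields five tokens: lucene followed by the four Chinese characters one by one. Per-character splitting loses word boundaries in Chinese, which is presumably why the pom.xml also pulls in lucene-analyzers-smartcn and IKAnalyzer, both of which segment Chinese into words.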

Luke Tool

Ways to open Luke:

1. From the command line: open cmd and run java -jar lukeall-4.10.3.jar
2. Manually: double-click lukeall-4.10.3.jar.

Luke then makes it easy to inspect the index.
