A First Look at Lucene

xyang_1128

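I have read a lot of material about Lucene online, but I never felt I really understood it, because the concepts are scattered all over the place. This article therefore does not go deep; it is only meant to help complete beginners get started.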

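Its characteristics in brief: written entirely in Java, open source, high performance, feature complete, and easy to extend. "Feature complete" here means support for tokenization, many kinds of queries (prefix, fuzzy, regular expression, and so on), scoring and highlighting, column-oriented storage (DocValues), and more.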

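1. What is Lucene?

Lucene is an open-source full-text search engine toolkit from Apache, distributed as jar files. Its purpose is to give developers a simple, easy-to-use library for adding full-text search to their own systems. Its characteristics: pure Java implementation, open source, high performance, fairly complete functionality, easy to extend, support for many kinds of queries, scoring and highlighting, column-oriented storage, and so on.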

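2. What full-text search means

Full-text search first extracts the content of the data and builds an index from it, then finds documents by querying that index rather than scanning the data itself. This whole process is called full-text search. I found a diagram online that makes it easier to understand.

Full-text search has two parts:

1. The indexing flow, the left side of the diagram: collect data -> create documents (tokenize with an analyzer) -> create the index

2. The search flow, the right side of the diagram: user enters input in the UI -> create a query object -> query the index -> render the search results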
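To make the two flows concrete, here is a minimal sketch in plain Java with no Lucene dependency. The class `TinyInvertedIndex` and its methods are hypothetical names invented for this illustration; a real Lucene index also stores positions, scores, and much more. It only shows the core idea: indexing builds a term -> document-id map once, and searching is a lookup in that map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical toy class for illustration; not part of Lucene.
class TinyInvertedIndex {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    // document id -> original text
    private final List<String> documents = new ArrayList<>();

    // Indexing flow: keep the text, split it into terms, record term -> document id.
    public int addDocument(String text) {
        int docId = documents.size();
        documents.add(text);
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Search flow: look the term up in the index instead of scanning every document.
    public List<Integer> search(String term) {
        TreeSet<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
    }

    public String getDocument(int docId) {
        return documents.get(docId);
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.addDocument("Lucene is a Java full-text search engine");
        index.addDocument("MyBatis is a persistence framework");
        index.addDocument("Spring and Lucene work well together");
        // Render the results: print every document that contains "lucene"
        for (int docId : index.search("lucene")) {
            System.out.println(docId + ": " + index.getDocument(docId));
        }
    }
}
```

Running `main` prints documents 0 and 2, the two texts that contain "lucene".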

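Tokenization (Analyzer)

Tokenization exists to serve index lookup: the text goes through a series of processing steps so that the resulting terms match what users actually type as closely as possible. Tokenization rules vary endlessly, but they share one goal: split the text by meaning. This is fairly easy in English, where words are already separated by spaces. Chinese has no such separators, so a running sentence has to be split into individual words by some other method.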
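One classic way to split text that has no separators is dictionary-based forward maximum matching: at each position, take the longest dictionary word that starts there. The sketch below is only an illustration of that idea, not necessarily what IKAnalyzer or any Lucene analyzer does internally. The class name `ForwardMaxMatch`, the tiny dictionary, and the ASCII sample string (standing in for an unspaced Chinese sentence) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of Lucene or IK Analyzer.
class ForwardMaxMatch {
    public static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        int window = Math.max(1, maxWordLength); // guard against a zero-length window
        List<String> words = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            // Start with the longest candidate allowed at this position
            int end = Math.min(text.length(), start + window);
            // Shrink it until it is a dictionary word or a single character
            while (end > start + 1 && !dictionary.contains(text.substring(start, end))) {
                end--;
            }
            words.add(text.substring(start, end));
            start = end;
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> dictionary =
                new HashSet<>(Arrays.asList("full", "text", "fulltext", "search"));
        // Prints [fulltext, search]: the longest match wins over "full" + "text"
        System.out.println(segment("fulltextsearch", dictionary, 8));
    }
}
```

Characters that appear in no dictionary word fall through as single-character tokens, which is why the quality of this approach depends heavily on the dictionary.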

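1. Standard tokenizer: splits English text on spaces into terms. For example, this is a girl becomes the four terms this, is, a, girl.

2. Standard filter: strips punctuation and normalizes letter case. For example, this is a Angel becomes the four terms this, is, a, angel.

3. Stop-word filter: removes words that carry little meaning on their own, such as is, a, the, Chinese interjections like 啊, 哦, 额, and any other words the application wants filtered out. For example, this is a man leaves the single term man.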
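The three steps above can be sketched as a small pipeline in plain Java. This is a simplified stand-in rather than Lucene's actual analyzer code: the class name `SimpleAnalysisChain` and the five-word stop list are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; Lucene's analyzers do considerably more.
class SimpleAnalysisChain {
    // A tiny stop list picked for this example only
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("this", "is", "a", "an", "the"));

    public static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Step 1: tokenize on whitespace
        for (String token : text.trim().split("\\s+")) {
            // Step 2: strip punctuation and lowercase
            String normalized = token.replaceAll("\\p{Punct}", "").toLowerCase(Locale.ROOT);
            // Step 3: drop empty tokens and stop words
            if (!normalized.isEmpty() && !STOP_WORDS.contains(normalized)) {
                terms.add(normalized);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("This is a Girl.")); // prints [girl]
        System.out.println(analyze("this is a man"));   // prints [man]
    }
}
```

The same chain has to run on both the indexed text and the user's query; otherwise a query for "Girl" would never match the indexed term "girl".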

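Collecting data:

The data behind full-text search comes in all shapes: novels, videos, pictures, and so on. Full-text search calls this kind of data unstructured data.

1. What is unstructured data? Structured data has a fixed length or a specified format. Unstructured data is the opposite: it has no fixed length and no fixed structure.

2. How do we turn unstructured data into structured data? We gather all the data into one place using some technique, then find a way to organize the unstructured data into a structured form.

Collecting data here means wrapping it, according to certain rules, into Lucene's own objects.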

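Collecting data from a database (demo)

The following jars need to be imported:

- MySQL 5.1 driver: mysql-connector-java-5.1.7-bin.jar

- Core package: lucene-core-4.10.3.jar

- Common analyzers package: lucene-analyzers-common-4.10.3.jar

The code below also uses IKAnalyzer and JUnit's @Test, so those jars need to be on the classpath as well.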

The database script is below. The data is a bit messy; just copy and paste it and run it as-is.

/*
Navicat MySQL Data Transfer

Source Server         : connection
Source Server Version : 50715
Source Host           : localhost:3306
Source Database       : book

Target Server Type    : MYSQL
Target Server Version : 50715
File Encoding         : 65001

Date: 2017-06-01 11:34:10
*/

SET FOREIGN_KEY_CHECKS=0;

-- ----------------------------
-- Table structure for book
-- ----------------------------
DROP TABLE IF EXISTS `book`;
CREATE TABLE `book` (
  `id` int(11) DEFAULT NULL,
  `name` varchar(192) DEFAULT NULL,
  `price` float DEFAULT NULL,
  `pic` varchar(96) DEFAULT NULL,
  `description` text
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- ----------------------------
-- Records of book
-- ----------------------------
INSERT INTO `book` VALUES ('1', 'apache lucene', '66', '77373773737.jpg', 'lucene是apache的开源项目，是一个全文检索的工具包。\r\n# Apache Lucene README file\r\n\r\n## Introduction\r\n\r\nLucene is a Java full-text search engine.  Lucene is not a complete\r\napplication, but rather a code library and API that can easily be used\r\nto add search capabilities to applications.\r\n\r\n * The Lucene web site is at: http://lucene.apache.org/\r\n * Please join the Lucene-User mailing list by sending a message to:\r\n   java-user-subscribe@lucene.apache.org\r\n\r\n## Files in a binary distribution\r\n\r\nFiles are organized by module, for example in core/:\r\n\r\n* `core/lucene-core-XX.jar`:\r\n  The compiled core Lucene library.\r\n\r\nTo review the documentation, read the main documentation page, located at:\r\n`docs/index.html`\r\n\r\nTo build Lucene or its documentation for a source distribution, see BUILD.txt');
INSERT INTO `book` VALUES ('2', 'mybatis', '55', '88272828282.jpg', 'MyBatis介绍\r\n\r\nMyBatis 本是apache的一个开源项目iBatis, 2010年这个项目由apache software foundation 迁移到了google code，并且改名为MyBatis。 \r\nMyBatis是一个优秀的持久层框架，它对jdbc的操作数据库的过程进行封装，使开发者只需要关注 SQL 本身，而不需要花费精力去处理例如注册驱动、创建connection、创建statement、手动设置参数、结果集检索等jdbc繁杂的过程代码。\r\nMybatis通过xml或注解的方式将要执行的statement配置起来，并通过java对象和statement中的sql进行映射生成最终执行的sql语句，最后由mybatis框架执行sql并将结果映射成java对象并返回。\r\n');
INSERT INTO `book` VALUES ('3', 'spring', '56', '83938383222.jpg', '## Spring Framework\r\nspringmvc.txt\r\nThe Spring Framework provides a comprehensive programming and configuration model for modern\r\nJava-based enterprise applications - on any kind of deployment platform. A key element of Spring is\r\ninfrastructural support at the application level: Spring focuses on the \"plumbing\" of enterprise\r\napplications so that teams can focus on application-level business logic, without unnecessary ties\r\nto specific deployment environments.\r\n\r\nThe framework also serves as the foundation for\r\n[Spring Integration](https://github.com/SpringSource/spring-integration),\r\n[Spring Batch](https://github.com/SpringSource/spring-batch) and the rest of the Spring\r\n[family of projects](http://springsource.org/projects). Browse the repositories under the\r\n[SpringSource organization](https://github.com/SpringSource) on GitHub for a full list.\r\n\r\n[.NET](https://github.com/SpringSource/spring-net) and\r\n[Python](https://github.com/SpringSource/spring-python) variants are available as well.\r\n\r\n## Downloading artifacts\r\nInstructions on\r\n[downloading Spring artifacts](https://github.com/SpringSource/spring-framework/wiki/Downloading-Spring-artifacts)\r\nvia Maven and other build systems are available via the project wiki.\r\n\r\n## Documentation\r\nSee the current [Javadoc](http://static.springsource.org/spring-framework/docs/current/api)\r\nand [Reference docs](http://static.springsource.org/spring-framework/docs/current/reference).\r\n\r\n## Getting support\r\nCheck out the [Spring forums](http://forum.springsource.org) and the\r\n[Spring tag](http://stackoverflow.com/questions/tagged/spring) on StackOverflow.\r\n[Commercial support](http://springsource.com/support/springsupport) is available too.\r\n\r\n## Issue Tracking\r\nSpring\'s JIRA issue tracker can be found [here](http://jira.springsource.org/browse/SPR). Think\r\nyou\'ve found a bug? 
Please consider submitting a reproduction project via the\r\n[spring-framework-issues](https://github.com/springsource/spring-framework-issues) repository. The\r\n[readme](https://github.com/springsource/spring-framework-issues#readme) provides simple\r\nstep-by-step instructions.\r\n\r\n## Building from source\r\nInstructions on\r\n[building Spring from source](https://github.com/SpringSource/spring-framework/wiki/Building-from-source)\r\nare available via the project wiki.\r\n\r\n## Contributing\r\n[Pull requests](http://help.github.com/send-pull-requests) are welcome; you\'ll be asked to sign our\r\ncontributor license agreement ([CLA](https://support.springsource.com/spring_committer_signup)).\r\nTrivial changes like typo fixes are especially appreciated (just\r\n[fork and edit!](https://github.com/blog/844-forking-with-the-edit-button)). For larger changes,\r\nplease search through JIRA for similiar issues, creating a new one if necessary, and discuss your\r\nideas with the Spring team.\r\n\r\n## Staying in touch\r\nFollow [@springframework](http://twitter.com/springframework) and its\r\n[team members](http://twitter.com/springframework/team/members) on Twitter. In-depth articles can be\r\nfound at the SpringSource [team blog](http://blog.springsource.org), and releases are announced via\r\nour [news feed](http://www.springsource.org/news-events).\r\n\r\n## License\r\nThe Spring Framework is released under version 2.0 of the\r\n[Apache License](http://www.apache.org/licenses/LICENSE-2.0).\r\n');

The POJO class

public class Book {
	// Book ID
	private Integer id;
	// Book name
	private String name;
	// Book price
	private Float price;
	// Book picture
	private String pic;
	// Book description
	private String description;

	// Getters and setters

	public Integer getId() {
		return id;
	}

	public void setId(Integer id) {
		this.id = id;
	}

	public String getName() {
		return name;
	}

	public void setName(String name) {
		this.name = name;
	}

	public Float getPrice() {
		return price;
	}

	public void setPrice(Float price) {
		this.price = price;
	}

	public String getPic() {
		return pic;
	}

	public void setPic(String pic) {
		this.pic = pic;
	}

	public String getDescription() {
		return description;
	}

	public void setDescription(String description) {
		this.description = description;
	}
}

Since this is only a demo, I connect to the database with plain JDBC, which means the database driver jar has to be imported.

import java.sql.*;
import java.util.ArrayList;
import java.util.List;

public class BookDaoImpl {

	// Reads every row of the book table into a list of Book objects
	public List<Book> queryBookList() {
		// Book list returned to the caller (empty if anything goes wrong)
		List<Book> list = new ArrayList<Book>();

		try {
			// Load the database driver
			Class.forName("com.mysql.jdbc.Driver");
		} catch (ClassNotFoundException e) {
			e.printStackTrace();
			return list;
		}

		// SQL statement
		String sql = "SELECT * FROM book";

		// try-with-resources closes the result set, statement and connection automatically.
		// Replace the user name and password with your own.
		try (Connection connection = DriverManager.getConnection(
					"jdbc:mysql://localhost:3306/book", "root", "1039191520");
				PreparedStatement preparedStatement = connection.prepareStatement(sql);
				ResultSet resultSet = preparedStatement.executeQuery()) {

			// Map each row of the result set to a Book
			while (resultSet.next()) {
				Book book = new Book();
				book.setId(resultSet.getInt("id"));
				book.setName(resultSet.getString("name"));
				book.setPrice(resultSet.getFloat("price"));
				book.setPic(resultSet.getString("pic"));
				book.setDescription(resultSet.getString("description"));
				list.add(book);
			}
		} catch (SQLException e) {
			e.printStackTrace();
		}
		return list;
	}
}

Next, write a class that collects the data and builds the index:

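import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FloatField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer; // third-party IK Analyzer jar (assumed: a build for Lucene 4.x)

public class ClassA {
    @Test
    public void fun01() throws IOException {
        // Get the source data
        BookDaoImpl bookDao = new BookDaoImpl();
        List<Book> books = bookDao.queryBookList();

        // Create one Document per book
        List<Document> arrays = new ArrayList<>();
        for (Book book : books) {
            Document document = new Document();
            // id: indexed as one untokenized term and stored
            document.add(new StringField("id", book.getId().toString(), Field.Store.YES));
            // name: tokenized and stored; description: tokenized but not stored
            document.add(new TextField("name", book.getName(), Field.Store.YES));
            document.add(new TextField("description", book.getDescription(), Field.Store.NO));
            // price: numeric field, stored; pic: stored only, not searchable
            document.add(new FloatField("price", book.getPrice(), Field.Store.YES));
            document.add(new StoredField("pic", book.getPic()));
            arrays.add(document);
        }

        // FSDirectory writes the index to disk: slower, but it survives a restart.
        // RAMDirectory keeps it in memory: faster, but lost when the JVM exits.
        FSDirectory direct = FSDirectory.open(new File("D:\\csdn\\index"));

        Analyzer aZer = new IKAnalyzer(); // use the IK analyzer
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_3, aZer);

        // The IndexWriter creates the index
        IndexWriter writer = new IndexWriter(direct, config);
        try {
            for (Document document : arrays) {
                // Write the document into the index on disk
                writer.addDocument(document);
            }
        } finally {
            // Closing commits the changes; exceptions are no longer swallowed
            writer.close();
        }
    }
}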

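After it runs, take a look at the index directory.

That is the whole indexing flow.

Now let's see how to search the index. Below is a search flow diagram I found online.

You can write the code by following the flow diagram, which makes things much simpler: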

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;

import java.io.File;
import java.io.IOException;

public class ClassB {

    @Test
    public void fun01() throws IOException {
        // Step 1: point at the index directory
        FSDirectory directory = FSDirectory.open(new File("D:\\csdn\\index"));

        // Step 2: open a reader on the index
        IndexReader reader = DirectoryReader.open(directory);

        try {
            // Step 3: create the searcher
            IndexSearcher searcher = new IndexSearcher(reader);

            // Step 4: create the query object (exact match on the id field)
            TermQuery query = new TermQuery(new Term("id", "1"));

            // Step 5: run the query, asking for at most one hit
            TopDocs docs = searcher.search(query, 1);

            // Step 6: load each matching document and print its fields
            ScoreDoc[] scoreDocs = docs.scoreDocs;
            for (ScoreDoc scoreDoc : scoreDocs) {
                int id = scoreDoc.doc; // Lucene's internal document number
                Document doc = searcher.doc(id); // fetch the stored document

                // doc.get returns null for any field that was not stored at index time
                System.out.println(doc.get("id"));
                System.out.println(doc.get("name"));
                System.out.println(doc.get("price"));
                System.out.println(doc.get("pic"));
            }
        } finally {
            // Step 7: release resources
            reader.close();
        }
    }
}

The query result is as follows.

That covers the basic usage of Lucene.
