Lucene Study Notes - 20181130

weixin_30478757

Original link: http://www.cnblogs.com/gospurs/p/10460703.html


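I. Introduction

Overview
Lucene is an open-source Apache project: a toolkit for full-text search. It is not a complete, ready-to-run search engine but a library for building one, providing a full indexing engine and query engine plus part of a text-analysis engine.
Lucene is a mature, free, open-source tool for Java development, and for years it has been the most popular free information-retrieval library for Java. An information-retrieval library is related to a search engine, but the two should not be confused.
Difference from Solr
Solr is a standalone enterprise search server that exposes a web-service-style API. Clients can submit XML documents in a prescribed format over HTTP to build the index, and can issue queries with HTTP GET and receive the results as XML.
In short: Lucene for search embedded inside an application, Solr for search exposed as a service over the network.
What is full-text retrieval
In full-text retrieval, a program scans every word in a document and builds an index entry for each word, recording how many times and where that word occurs. A user's query is then answered from the index, much like looking up a character through the index table of a dictionary.
Full-text retrieval takes text as the object of the search and finds the texts that contain the given terms. Completeness, accuracy, and speed are the key measures of a full-text retrieval system.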
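The idea of indexing every word together with its positions can be sketched with a toy inverted index. This is only an illustration in plain Java, not how Lucene stores its index; the class name `InvertedIndexSketch` and the whitespace tokenizing are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy inverted index: term -> (document id -> positions of the term in that document). */
final class InvertedIndexSketch {
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Splits on whitespace, lowercases each token, and records its position. */
    void add(int docId, String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    /** Returns docId -> positions for the term, or an empty map when the term is unknown. */
    Map<Integer, List<Integer>> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeMap<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Lucene is a search library");
        index.add(2, "Solr is a search server built on Lucene");
        System.out.println(index.lookup("lucene")); // {1=[0], 2=[7]}
        System.out.println(index.lookup("search")); // {1=[3], 2=[3]}
    }
}
```

A query never rescans the documents; it only reads the entry for the term, which is why lookups stay fast as the collection grows.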

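Four key points about full-text retrieval:
①. It handles text only
②. It does not interpret meaning (semantics)
③. It is case-insensitive
④. The result list is ranked by relevance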

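Full-text retrieval versus database queries
Full-text retrieval is not the same thing as a SQL query against a database. (The two solve different problems in different ways, so they should not really be compared head to head.) Searching inside a database means using SQL, typically LIKE, which has the following problems:
①. Match quality: LIKE matches raw substrings rather than words, so unrelated text can match.
②. Relevance ranking: the rows come back with no relevance ordering.
③. Speed: full-text retrieval is much faster than a SQL LIKE search.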
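The difference in match quality can be made concrete with a small sketch. It compares a raw substring test, roughly what `LIKE '%keyword%'` does, with a whole-word test over tokens. The class and method names are hypothetical, and the tokenizing is a simple split on non-word characters rather than a real analyzer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class LikeVersusTokenMatch {
    /** Roughly what column LIKE '%keyword%' does: a raw substring test. */
    static boolean likeMatch(String text, String keyword) {
        return text.toLowerCase(Locale.ROOT).contains(keyword.toLowerCase(Locale.ROOT));
    }

    /** Word-level match: the keyword must equal one whole token of the text. */
    static boolean tokenMatch(String text, String keyword) {
        List<String> tokens = Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\W+"));
        return tokens.contains(keyword.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String text = "Planting trees in spring";
        System.out.println(likeMatch(text, "ant"));    // true: "ant" is inside "Planting"
        System.out.println(tokenMatch(text, "ant"));   // false: no token equals "ant"
        System.out.println(tokenMatch(text, "trees")); // true
    }
}
```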
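Where full-text retrieval is used
Lucene is mainly used for site search, that is, searching the resources inside a system: article search in a BBS (forum) or blog, or product and information search on an e-commerce site.
Points to cover in an interview answer
① How the index is built and which analyzer does the tokenizing; ② how the index is used and how it is optimized; ③ improvements to development efficiency and to runtime performance; ④ a summary.
A rough example:
I used Lucene to build an index for our product data. The index was built with the XX analyzer. On top of the index I wrote a few utility classes and used them to search the product data, which gave us full-text site search.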

II. A First Lucene Program

Dependencies

        <!-- Lucene core jar -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>4.4.0</version>
        </dependency>

        <!-- Lucene analyzer implementations jar -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-common</artifactId>
            <version>4.4.0</version>
        </dependency>

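Writer example

    import java.io.File;
    import java.io.IOException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.IndexableField;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;
    import org.junit.Test;

    /**
     * A first index writer.
     * Objects involved:
     * ① IndexWriter (the writer), ② Directory (the index store: on disk or in memory, pick one),
     * ③ IndexWriterConfig (writer configuration), ④ Analyzer (the tokenizer)
     */
    @Test
    public void testWriter() throws IOException {

        // Analyzer is an abstract class; implementations come from lucene-analyzers-common.
        // It takes one argument: the Lucene Version.
        // StandardAnalyzer splits Chinese text into single characters and English text into words.
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_44);

        // IndexWriterConfig is a final class taking two arguments: the Lucene Version and the Analyzer.
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_44, analyzer);

        // In-memory index:
        // Directory directory = new RAMDirectory();
        // On-disk index. Directory is abstract, so it is obtained via FSDirectory.open rather than new.
        Directory directory = FSDirectory.open(new File("E:\\index"));

        // IndexWriter takes two arguments: the Directory (index location) and the IndexWriterConfig.
        IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

        // Build the document model: 1. id field, 2. title field, 3. content field.
        Document document = new Document();
        // StringField indexes the whole value as one term (no tokenizing); TextField is tokenized by the analyzer.
        // Constructor arguments read like a key-value pair plus a store flag: name, value, Field.Store.
        // Field.Store.YES keeps the original value in the index so it can be read back from search results;
        // Field.Store.NO still indexes the field for searching but does not keep the value for display.
        IndexableField idField = new StringField("id", "1", Field.Store.YES);
        IndexableField titleField = new TextField("title", "我是标题", Field.Store.YES);
        // Body text is meant to be searched by its words, so it is a TextField as well.
        IndexableField contentField = new TextField("content", "我是内容", Field.Store.YES);

        document.add(idField); // add accepts any IndexableField
        document.add(titleField);
        document.add(contentField);

        // Write the document.
        indexWriter.addDocument(document);
        indexWriter.commit();
        indexWriter.close();
    }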

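Searcher example

    import java.io.File;
    import java.io.IOException;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.junit.Test;

    /**
     * A first searcher.
     * Objects involved:
     * ① IndexSearcher (the searcher), ② IndexReader (reads the index),
     * ③ Directory (location of the on-disk index),
     * ④ the searcher's search method, which runs the query
     */
    @Test
    public void testSearcher() throws IOException {

        // Open the on-disk index.
        Directory directory = FSDirectory.open(new File("E:\\index"));

        // IndexReader is abstract; DirectoryReader.open takes the Directory to read from.
        IndexReader indexReader = DirectoryReader.open(directory);

        // IndexSearcher takes one argument: the IndexReader.
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);

        // TermQuery is one Query implementation; its Term names the field and the exact value to match.
        Query query = new TermQuery(new Term("id", "1"));

        // search takes the Query and the maximum number of top hits to return; TopDocs wraps the result.
        TopDocs topDocs = indexSearcher.search(query, 10);
        System.out.println("Total hits: " + topDocs.totalHits);

        // The hits, ordered by relevance score.
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;

        for (ScoreDoc scoreDoc : scoreDocs) {
            // Internal number that identifies the document within the index.
            System.out.println(scoreDoc.doc);

            // Use that number to load the stored document through the searcher.
            Document document = indexSearcher.doc(scoreDoc.doc);

            System.out.println(document.get("id"));
            System.out.println(document.get("title"));
            System.out.println(document.get("content"));
        }

        indexReader.close();
    }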

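Deleting

	indexWriter.deleteAll();	// delete every document; rarely used
	indexWriter.deleteDocuments(new Term("id", "1"));	// delete by Term (an overload taking a Query also exists)

Updating

	// Two arguments: a Term that locates the existing document, and the replacement Document.
	// Note: the old document is replaced as a whole, so fields that did not change
	// must be supplied again with their current values.
	indexWriter.updateDocument(new Term("id", "1"), document);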

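III. Lucene Analyzers

What an analyzer does: it splits a piece of text into terms according to a set of rules.

How tokenizing works
Split out the keywords, drop stop words (for example 的, 是, 在; stop words are not indexed), keep the remaining keywords, and ignore case.
Note: data in the same language must always go through the same analyzer, at index time and at query time alike; otherwise searches may return nothing.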
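The steps above (split, drop stop words, ignore case) can be sketched in plain Java. This is a simplified stand-in for an analyzer, not Lucene code; the class name and the tiny English stop-word list are assumptions chosen for the demo.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class SimpleAnalysisSketch {
    // Hypothetical stop-word list for the demo, not Lucene's built-in set.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "is", "a", "of"));

    /** Splits on anything that is not a letter or digit, lowercases, and drops stop words. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            String term = token.toLowerCase(Locale.ROOT);
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Index-time analysis of a document.
        System.out.println(analyze("The Index of a Search Engine is FAST")); // [index, search, engine, fast]
        // Query-time analysis with the same rules yields a term that exists in the index.
        System.out.println(analyze("Fast")); // [fast]
    }
}
```

Running the query text through the same `analyze` method is what makes "Fast" find the indexed term `fast`; a different set of rules on the query side could produce a term the index never saw.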
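Common analyzers
StandardAnalyzer: single-character tokenizing for Chinese (one character per term); English is split into words
CJKAnalyzer: bigram tokenizing; Chinese text is split into overlapping two-character pairs
ChineseAnalyzer
SmartChineseAnalyzer: smart Chinese analyzer

IKAnalyzer: a Chinese analyzer for Java that supports custom dictionaries of words to keep and stop words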
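To see how single-character and bigram tokenizing differ, here is a small sketch in plain Java. It only imitates the splitting idea; it is not the code of StandardAnalyzer or CJKAnalyzer, and it assumes every character fits in one Java `char`. The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class ChineseNgramSketch {
    /** One term per character, in the spirit of single-character tokenizing. */
    static List<String> unigrams(String text) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            terms.add(text.substring(i, i + 1));
        }
        return terms;
    }

    /** Overlapping two-character terms, in the spirit of bigram tokenizing. */
    static List<String> bigrams(String text) {
        List<String> terms = new ArrayList<>();
        if (text.length() == 1) {
            terms.add(text); // a lone character is kept as its own term
        }
        for (int i = 0; i + 1 < text.length(); i++) {
            terms.add(text.substring(i, i + 2));
        }
        return terms;
    }

    public static void main(String[] args) {
        // Unicode escapes for the four-character title string used in the writer example.
        String title = "\u6211\u662F\u6807\u9898";
        System.out.println(unigrams(title)); // 4 single-character terms
        System.out.println(bigrams(title));  // 3 overlapping two-character terms
    }
}
```

For the title 我是标题 the first method yields 我, 是, 标, 题 and the second yields 我是, 是标, 标题. Bigrams give more precise matches for two-character words at the cost of a larger index.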
Trying out some analyzers

Listing the stop words:

        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_44);
        CharArraySet stopwordSet = analyzer.getStopwordSet();
        System.out.println(stopwordSet);

Utility method:

    public static void testAnalyzer(Analyzer analyzer, String text) throws IOException {
        System.out.println("Analyzer in use: ---> " + analyzer.getClass().getName());

        TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));

        // Register the term attribute once; it is updated in place on every incrementToken().
        CharTermAttribute attribute = tokenStream.addAttribute(CharTermAttribute.class);

        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            System.out.println(attribute.toString());
        }
        tokenStream.end();
        tokenStream.close();
    }

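pom dependencies

        <!-- Extended queries -->
        <!-- Multi-field queries -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queryparser</artifactId>
            <version>4.4.0</version>
        </dependency>
        <!-- Smart Chinese analyzer -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-smartcn</artifactId>
            <version>4.4.0</version>
        </dependency>

        <!-- mvn install:install-file -DgroupId=org.apache.lucene -DartifactId=ikanalyzer -Dversion=2012FF_u1 -Dpackaging=jar -Dfile=F:\IKAnalyzer2012FF_u1.jar -->
        <!-- Custom jar installed into the local repository with the command above -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>ikanalyzer</artifactId>
            <version>2012FF_u1</version>
        </dependency>

If the 4.4.0 artifacts cannot be downloaded, try 5.5.0 instead. Note that the 5.x API differs in places (for example, analyzer constructors no longer take a Version argument), so the code in these notes may need small changes.
The extension dictionary and the stop-word dictionary must be UTF-8 encoded, otherwise they have no effect.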

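IV. Query Types (a Set of API Classes)

    @Test
    public void testQuery() throws ParseException {
        // ① TermQuery: exact term lookup
        Query query1 = new TermQuery(new Term("id", "1"));

        // ② MultiFieldQueryParser: search several fields at once
        String[] fields = {"id", "title"};
        MultiFieldQueryParser multiFieldQueryParser =
                new MultiFieldQueryParser(Version.LUCENE_44, fields, new SmartChineseAnalyzer(Version.LUCENE_44));
        Query query2 = multiFieldQueryParser.parse("1");

        // ③ MatchAllDocsQuery: matches every document in the index
        Query query3 = new MatchAllDocsQuery();

        // ④ NumericRangeQuery: numeric range; pick the factory method for the field's numeric type.
        // Arguments: field name, precisionStep, min, max, include min, include max.
        // precisionStep should match the value used at index time (4 is the default), and the field
        // must have been indexed as a numeric field (e.g. IntField), not as a StringField.
        Query query4 = NumericRangeQuery.newIntRange("id", 4, 1, 5, true, true);

        // ⑤ WildcardQuery: ? stands for exactly one character, * for any number of characters
        Query query5 = new WildcardQuery(new Term("title", "凉?"));

        // ⑥ FuzzyQuery: the second argument is the maximum number of character edits allowed (at most 2)
        Query query6 = new FuzzyQuery(new Term("title", "小小"), 2);

        // ⑦ PhraseQuery: phrase search
        PhraseQuery query7 = new PhraseQuery();
        query7.add(new Term("title", "小小"));
        query7.add(new Term("title", "大大"));
        // Slop is the allowed distance between the terms: a larger slop matches more documents
        // but is slower; the closer the terms are to each other, the higher the score.
        query7.setSlop(7);

        // ⑧ BooleanQuery: combine several conditions
        BooleanQuery query8 = new BooleanQuery();
        Query q1 = new TermQuery(new Term("id", "1"));
        Query q2 = NumericRangeQuery.newDoubleRange("id", 4, 1.0, 5.0, true, true);
        // MUST: has to match; SHOULD: optional, raises the score when it matches; MUST_NOT: must not match
        query8.add(q1, BooleanClause.Occur.MUST);
        query8.add(q2, BooleanClause.Occur.SHOULD);
    }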
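The "number of wrong characters allowed" in a fuzzy query is an edit distance. The sketch below computes plain Levenshtein distance in Java to show what such a limit means. It is an illustration only: Lucene's FuzzyQuery is implemented differently (automaton-based, and by default it also treats a swap of two adjacent characters as a single edit). The class and method names are hypothetical.

```java
final class EditDistanceSketch {
    /** Classic Levenshtein distance: minimum number of insertions, deletions, and substitutions. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    /** True when the indexed term is within maxEdits of the queried term. */
    static boolean fuzzyMatch(String indexedTerm, String queriedTerm, int maxEdits) {
        return levenshtein(indexedTerm, queriedTerm) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("lucene", "lucine"));  // 1
        System.out.println(fuzzyMatch("lucene", "lusen", 2)); // true
        System.out.println(fuzzyMatch("lucene", "solr", 2));  // false
    }
}
```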

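V. Highlighting (full-text search results should highlight the matched terms)

Dependencies

        <!-- Highlighter -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-highlighter</artifactId>
            <version>4.4.0</version>
        </dependency>

Additional processing

    /**
     *  Highlighting steps:
     *  ① Create a Highlighter; it needs two arguments: a Formatter and a Scorer
     *  ② Create the Formatter: how the highlight is rendered
     *  ③ Create the Scorer: which terms get highlighted
     *  ④ Call getBestFragment with three arguments (analyzer, field name, the text to highlight)
     */
        // QueryTermScorer or QueryScorer; query is the Query that was used for the search
        Scorer scorer = new QueryTermScorer(query);

        // SimpleHTMLFormatter takes two arguments (opening tag, closing tag)
        Formatter formatter = new SimpleHTMLFormatter("<font color='red'>", "</font>");

        // Arguments: 1. the Formatter  2. the Scorer
        Highlighter highlighter = new Highlighter(formatter, scorer);

        // Apply the highlighter to a stored field value of a hit
        // (document is a Document loaded with indexSearcher.doc(...))
        // Note: bestFragment is null when the text contains none of the query terms
        String bestFragment = highlighter.getBestFragment(new IKAnalyzer(), "title", document.get("title"));
        if (bestFragment == null) {
              System.out.println("No matching content found");
        } else {
              System.out.println("After highlighting ---- " + bestFragment);
        }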
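What the formatter and the null result amount to can be shown without Lucene. The sketch below wraps every occurrence of a term in a pair of tags and returns null when the term does not occur, mirroring the behavior described above. It is a plain-Java illustration with hypothetical names, not the Highlighter's algorithm, which works on analyzed tokens and picks the best-scoring fragment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HighlightSketch {
    /** Wraps each case-insensitive occurrence of term in preTag/postTag; null when nothing matches. */
    static String highlight(String text, String term, String preTag, String postTag) {
        if (term.isEmpty()) {
            return null;
        }
        Matcher matcher = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE).matcher(text);
        if (!matcher.find()) {
            return null;
        }
        // $0 is the matched text itself, so the original casing is preserved.
        return matcher.replaceAll(
                Matcher.quoteReplacement(preTag) + "$0" + Matcher.quoteReplacement(postTag));
    }

    public static void main(String[] args) {
        String pre = "<font color='red'>";
        String post = "</font>";
        System.out.println(highlight("Lucene search with lucene", "lucene", pre, post));
        System.out.println(highlight("Solr server", "lucene", pre, post)); // null
    }
}
```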

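VI. Boosting (the relevance score / popularity of results in Lucene full-text search)

The relevance score is computed at query time from the query.
If neither the indexed data nor the query changes, the score of each matching document stays the same.
The score can also be adjusted by hand:

        TextField textField = new TextField("title", "小小", Field.Store.YES);
        // Set the boost: the field's weight is multiplied by this factor
        textField.setBoost(10F);

        // The resulting score of each hit can be read from the score field of the entries in ScoreDoc[]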
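The effect of a boost on ranking can be illustrated with a deliberately simplified scoring rule: the number of times the term occurs, multiplied by the boost. This rule is an assumption made for the demo; Lucene's real scoring formula has more factors. The class name is hypothetical.

```java
import java.util.Locale;

final class BoostSketch {
    /** Demo scoring rule: occurrences of the term in the text, multiplied by the boost. */
    static float score(String text, String term, float boost) {
        String wanted = term.toLowerCase(Locale.ROOT);
        int occurrences = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.equals(wanted)) {
                occurrences++;
            }
        }
        return occurrences * boost;
    }

    public static void main(String[] args) {
        // Without a boost, the document that mentions the term more often scores higher.
        System.out.println(score("lucene lucene search", "lucene", 1F)); // 2.0
        // A boost of 10 lifts a document with a single occurrence above it.
        System.out.println(score("lucene search", "lucene", 10F));       // 10.0
    }
}
```

With the same data and the same query, both numbers come out identical on every run, which matches the point that scores only change when the index, the query, or the boost changes.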

Reposted from: https://www.cnblogs.com/gospurs/p/10460703.html
