2014-08-20 Lucene(0)--------Full-text search with Lucene

weixin_34245749

Original link: https://my.oschina.net/codeWatching/blog/399299

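1. The concept of full-text search

In full-text search, a program scans every word in a document and builds an index entry for each word, recording how many times and at which positions the word occurs. When a user runs a query, the program finds the matching content through that index, much like looking up a character through the lookup table of a dictionary. In short: index the words so that they are easy to find. The index is the key.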
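
To make the idea of "an index entry per word, with count and positions" concrete, here is a minimal sketch in plain Java. It is not how Lucene stores its index; the class name `InvertedIndexSketch` and the whitespace splitting are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy index for a single text: word -> positions at which the word occurs. */
final class InvertedIndexSketch {

    /** Splits on whitespace, lower-cases each word, and records its positions. */
    static Map<String, List<Integer>> build(String text) {
        Map<String, List<Integer>> index = new TreeMap<>();
        String[] words = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
        for (int pos = 0; pos < words.length; pos++) {
            index.computeIfAbsent(words[pos], w -> new ArrayList<>()).add(pos);
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index =
                build("Lucene is a search library and Lucene is fast");
        // The occurrence count of a word is the size of its position list.
        System.out.println("lucene -> " + index.get("lucene"));
    }
}
```

A lookup is then a single map access instead of a scan over the whole text, which is the point of building the index first.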

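2. Where full-text search is used

The open-source framework Apache Lucene is typically used for site search, that is, searching the resources inside one system, such as a BBS forum or the product catalog of an online shop. Because massive collections of resources are hard to acquire and manage, it is generally not used to search the Internet at large.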

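3. Basic characteristics of full-text search

a. It handles text only; multimedia such as images cannot be searched yet.

b. It does not interpret the meaning (semantics) of the text.

c. English search keywords are case-insensitive.

d. The result list is ordered by relevance (computed by an algorithm inside Lucene).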

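4. Getting started with Lucene

Lucene in brief (from the docs): "Lucene is a Java full-text search engine. Lucene is not a complete application, but rather a code library and API that can easily be used to add search capabilities to applications."

In other words, Lucene is a full-text search engine implemented in Java. It is not a complete application; instead, you use its libraries (JARs) and API to add full-text search to your own application.

4.1 Getting-started example 1 (the demo bundled with Lucene needs Ant to build and indexes files, so we start with a simple demo of our own)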

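package com.codewatching.lucene.mydemo;

import java.io.File;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

/**
 * lucene-core-4.4.0.jar (core library)
 * lucene-analyzers-common-4.4.0.jar (analyzers)
 * @author LISAI
 */
public class LuceneDemo1 {

     /**
      * Creates a Lucene index.
      */
     @Test
     public void createIndex() throws Exception {
          // The analyzer used to split the article into terms when indexing
          Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_44);
          // Index configuration:
          // 1. the Lucene version in use
          // 2. the analyzer used when creating the index
          IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_44, analyzer);
          // Directory in which the index is stored
          File path = new File("src/index");
          Directory directory = new SimpleFSDirectory(path);
          // The object used to modify the index
          IndexWriter indexWriter = new IndexWriter(directory, config);
          // Create a field of the type that suits the content:
          // 1. field name, 2. field value, 3. whether to store the value in the index
          IndexableField intField = new IntField("id", 1, Store.YES);
          IndexableField stringField = new StringField("title", "体育新闻NBA赛事", Store.YES);
          IndexableField textField = new TextField("content", "这项体育NBA职业联赛已经进行好多年啦", Store.YES);
          // Create a Document, because the index stores Documents
          Document document = new Document();
          document.add(intField);
          document.add(stringField);
          document.add(textField);
          // Adding the document is what builds the index
          indexWriter.addDocument(document);
          indexWriter.close();
     }

     /**
      * Queries the Lucene index.
      */
     @Test
     public void searchIndex() throws Exception {
          // Directory in which the index is stored
          File path = new File("src/index");
          Directory directory = new SimpleFSDirectory(path);
          // Reader over the index
          IndexReader indexReader = DirectoryReader.open(directory);
          // The object used to query the index
          IndexSearcher indexSearcher = new IndexSearcher(indexReader);
          // Build the query condition: a single-field term query
          Term term = new Term("content", "育");
          Query query = new TermQuery(term);
          // Run the query and keep at most 5 results
          TopDocs topDocs = indexSearcher.search(query, 5);
          // The array of results
          ScoreDoc[] scoreDocs = topDocs.scoreDocs;
          // Total number of hits
          System.out.println(topDocs.totalHits);
          System.out.println(scoreDocs.length);
          for (ScoreDoc scoreDoc : scoreDocs) {
               // Internal ID of the document
               int docID = scoreDoc.doc;
               // Load the document from the index
               Document document = indexSearcher.doc(docID);
               System.out.println("id:" + document.get("id"));
               System.out.println("title:" + document.get("title"));
               System.out.println("content:" + document.get("content"));
          }
          indexReader.close();
     }
}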

4.2 Getting-started example 2 (/docs/demo/overview-summary.html: walk through all files under a directory and index them)

       1. CreateIndexFiles.java
       2. SearchFiles.java
       See the source files for the details.

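4.3 Analysis of index creation and querying in Lucene.

a. Creating the index

b. Query analysis

4.4 Extracting Lucene utility classes: LuceneUtil.java, ArticleToDocument.java

4.5 Create, delete, update, query and paging on the Lucene index: IndexDao.java (how to keep the index library consistent with the relational database)

Keeping the index library and the database in a consistent state
The problem: all data (objects) must be saved in the database. Data that should be searchable must also be saved in the index library. The same data therefore lives in both the database and the index library (in different formats), so we have to keep the two in a consistent state. Otherwise the search results will be wrong.
The approach: to keep the index library consistent with the database (only for the data that must be searchable), every operation performed on the database is followed by the corresponding operation on the index library. The index operations themselves are carried out by calling the matching IndexDao methods. IndexDao plays the same role as a Dao in the database layer.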
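
The approach can be sketched as follows. This is only an illustration under assumptions: both stores are replaced by in-memory maps, and the class and method names (`ArticleServiceSketch`, `save`, `update`, `delete`) are hypothetical. In the real project the index-side calls would go to IndexDao.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical service: every write to the "database" is mirrored to the "index". */
final class ArticleServiceSketch {
    // In-memory stand-ins for the relational database and the index library.
    final Map<Long, String> database = new HashMap<>();
    final Map<Long, String> index = new HashMap<>();

    void save(long id, String content) {
        database.put(id, content);   // database insert
        index.put(id, content);      // where the IndexDao "add" call would go
    }

    void update(long id, String content) {
        // Only touch the index when the database row actually existed.
        if (database.replace(id, content) != null) {
            index.put(id, content);  // where the IndexDao "update" call would go
        }
    }

    void delete(long id) {
        database.remove(id);         // database delete
        index.remove(id);            // where the IndexDao "delete" call would go
    }

    /** True when both stores hold exactly the same entries. */
    boolean inSync() {
        return database.equals(index);
    }
}
```

The design point is that callers never write to one store alone; they go through a single method that performs both operations.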

Code attachment:

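5. Analyzers (tokenizers) in Lucene

1. What an analyzer does

    An analyzer is used when the index is created and again when a query string is searched. Both places must use the same analyzer, otherwise the search may return no results.
    An Analyzer extracts all the words contained in a piece of text according to a set of rules. It corresponds to the Analyzer class, which is abstract; the actual splitting rules are implemented by its subclasses, so different languages (rules) need different analyzers. [Figure from the original post]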

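2. How an analyzer works

   The usual workflow of an (English) analyzer:
       1. Split the text into keywords.
       2. Remove stop words.
       3. For English words, convert all letters to lower case (so that search is case-insensitive).
          Note: some analyzers also reduce English words to their base form, stripping inflectional endings. This produces more useful results. For example, a search for "student" can then also find "students", which is very handy.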
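
The three steps, plus a very crude form of base-form reduction, can be sketched in plain Java. This is not Lucene's implementation: the stop-word list, the letter-only splitting and the "strip a trailing s" rule are assumptions chosen to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy English analysis chain: split, lower-case, drop stop words, strip plural "s". */
final class EnglishAnalyzerSketch {
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "is"));

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Step 1: split on anything that is not an ASCII letter.
        for (String token : text.split("[^A-Za-z]+")) {
            // Step 3 is applied first so that the stop-word check is case-insensitive.
            String term = token.toLowerCase(Locale.ROOT);
            // Step 2: skip empty tokens and stop words.
            if (term.isEmpty() || STOP_WORDS.contains(term)) {
                continue;
            }
            // Naive base-form reduction: drop one trailing "s" from longer words.
            if (term.length() > 3 && term.endsWith("s")) {
                term = term.substring(0, term.length() - 1);
            }
            terms.add(term);
        }
        return terms;
    }
}
```

Because the same `analyze` method would be applied to both the indexed text and the query string, "Students" in a document and "student" in a query end up as the same term.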

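3. Stop words

Some words occur very frequently in text but contribute almost nothing to the information it carries, for example "a, an, the, of" in English or "的、了、着、是" in Chinese, as well as punctuation marks. Such words are called stop words. After tokenization, stop words are normally filtered out and are not indexed. At query time, stop words in the user's query are filtered out as well (because the query string also goes through the analyzer). Excluding stop words speeds up indexing and reduces the size of the index files.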

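4. Common Chinese analyzers

    Chinese tokenization is more complex, because a single character is not necessarily a word, and a sequence that is a word in one place may not be a word in another. For example, in "帽子和服装" the sequence "和服" is not a word. There are three common approaches to Chinese tokenization: single-character, bigram, and dictionary-based.
     1. Single-character tokenization: split the text one Chinese character at a time. For example, "我们是中国人" becomes "我", "们", "是", "中", "国", "人". (StandardAnalyzer and ChineseAnalyzer work this way.)
     2. Bigram tokenization: split the text into overlapping pairs of characters. For example, "我们是中国人" becomes "我们", "们是", "是中", "中国", "国人". (CJKAnalyzer works this way.)
     3. Dictionary-based tokenization: build candidate words with some algorithm and match them against a prepared dictionary; whatever matches is split off as a word. Dictionary-based tokenization is usually considered the best approach for Chinese. For example, "我们是中国人" becomes "我们", "中国人". (Examples are MMAnalyzer from the "极易分词" analyzer, the "庖丁分词" (Paoding) analyzer, and IKAnalyzer.)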
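
The three approaches can be compared with a small plain-Java sketch. It is not the code of any of the analyzers named above: the dictionary approach is shown with forward maximum matching, which is only one possible algorithm, and the method names and the tiny dictionaries are assumptions. The sketch also assumes the text contains no surrogate pairs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Toy versions of single-character, bigram and dictionary-based tokenization. */
final class ChineseSegmentationSketch {

    /** One token per character. */
    static List<String> unigrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            out.add(text.substring(i, i + 1));
        }
        return out;
    }

    /** Overlapping pairs of adjacent characters. */
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }

    /**
     * Forward maximum matching: at each position take the longest dictionary
     * word of at most maxLen characters, or a single character if none matches.
     */
    static List<String> dictionary(String text, Set<String> words, int maxLen) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            while (end > i + 1 && !words.contains(text.substring(i, end))) {
                end--;
            }
            out.add(text.substring(i, end));
            i = end;
        }
        return out;
    }
}
```

With the dictionary {我们, 中国, 中国人}, `dictionary("我们是中国人", words, 3)` yields 我们 / 是 / 中国人, and a stop-word filter would then drop 是. With the dictionary {帽子, 和服, 服装}, the same method splits "帽子和服装" into 帽子 / 和服 / 装, which is exactly the ambiguity described above and shows why a simple greedy match is not enough on its own.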

5. Using the Paoding (庖丁解牛) analyzer in practice

package com.codewatching.lucene.mydemo;

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;
import org.junit.Test;
public class LuceneDemo2 {
     /**
      * Prints the terms an analyzer produces for a sample string.
      * CJKAnalyzer is used here; pass a different Analyzer
      * (for example the Paoding analyzer) to compare the output.
      */
     @Test
     public void testAnalyzer() throws Exception{
          String keywords="lucene 黑马程序员是全文检索的框架";
          Analyzer analyzer = new CJKAnalyzer(Version.LUCENE_44);
          testAnalyzer(analyzer, keywords);
     }
     public void testAnalyzer(Analyzer analyzer, String text) throws Exception {
          System.out.println("Analyzer in use: " + analyzer.getClass().getSimpleName());
          // Split the text into terms; the terms are delivered through the TokenStream
          TokenStream tokenStream = analyzer.tokenStream("content",
                    new StringReader(text));
          // The attribute instance is reused for every token
          CharTermAttribute charTermAttribute = tokenStream
                    .addAttribute(CharTermAttribute.class);
          tokenStream.reset();
          while (tokenStream.incrementToken()) {
               System.out.println(charTermAttribute.toString());
          }
          tokenStream.end();
          tokenStream.close();
     }
}

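6. Relevance ranking in Lucene (VSM, the vector space model).

Lucene assigns a score to every result (document) of a query; the higher the score, the earlier the document appears in the list. We can influence this ranking ourselves:
1: We can set a weight (boost) to influence a document's score.
2: The score depends on the number of hits.
3: The score depends on how often, or where, the search keywords occur in the document.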
API: Field.java
public class Field implements IndexableField {

     // ... (excerpt)

     /**
      * Sets the boost factor on this field.
      * @throws IllegalArgumentException if this field is not indexed,
      *         or if it omits norms.
      * @see #boost()
      */
     public void setBoost(float boost) {
          if (boost != 1.0f) {
               if (type.indexed() == false || type.omitNorms()) {
                    throw new IllegalArgumentException("You cannot set an index-time boost on an unindexed field, or one that omits norms");
               }
          }
          this.boost = boost;
     }

     // ...
}
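
To see how term frequency and boost pull a score up, here is a deliberately simplified sketch. It is not Lucene's full scoring formula, which also includes length normalization and query-level factors; the square-root term frequency and the logarithmic inverse document frequency follow the classic TF-IDF shape, and the class and parameter names are assumptions.

```java
/** Simplified TF-IDF style score with a multiplicative boost. */
final class BoostScoringSketch {

    /**
     * @param termFreq how often the term occurs in the document
     * @param docFreq  how many documents contain the term
     * @param numDocs  total number of documents in the index
     * @param boost    weight set on the field, 1.0f meaning "no change"
     */
    static double score(int termFreq, int docFreq, int numDocs, float boost) {
        double tf = Math.sqrt(termFreq);
        double idf = 1.0 + Math.log((double) numDocs / (docFreq + 1));
        return tf * idf * boost;
    }

    public static void main(String[] args) {
        System.out.println("one hit, no boost:   " + score(1, 1, 2, 1.0f));
        System.out.println("four hits, no boost: " + score(4, 1, 2, 1.0f));
        System.out.println("one hit, boost 3:    " + score(1, 1, 2, 3.0f));
    }
}
```

More occurrences of the keyword raise the score, and the boost multiplies whatever score the document already earned, which is why setting it is an effective way to push selected documents up the list.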

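7. A brief introduction to SEO

1: Keywords: optimize the keywords in the page (title).
2: Build a forum presence, for example by staying active on Baidu Tieba.
3: Click-through rate.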

8. Optimizing the Lucene index library

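9. How Lucene search differs from database search, and keeping the data of the two in sync.

For example: full-text search over the data is handled by the index library.

The detailed product information is then fetched with a database query.

Create, update and delete operations must be performed in the same transaction so that the two stay consistent.

How is the data synchronized? [Figure from the original post]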
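
The division of labor described above (the index finds the matches, the database supplies the details) can be sketched like this. Both stores are in-memory stand-ins and all names are hypothetical; a real application would query Lucene for the IDs and then load the rows through its Dao.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Two-step lookup: the index returns IDs, the database returns the details. */
final class SearchThenLoadSketch {
    // Stand-in for the index library: term -> IDs of matching products.
    final Map<String, List<Long>> index = new HashMap<>();
    // Stand-in for the database: ID -> full product record.
    final Map<Long, String> database = new HashMap<>();

    void add(long id, String details, String... terms) {
        database.put(id, details);
        for (String term : terms) {
            index.computeIfAbsent(term, k -> new ArrayList<>()).add(id);
        }
    }

    List<String> search(String term) {
        List<String> results = new ArrayList<>();
        // Step 1: ask the index which IDs match the term.
        for (Long id : index.getOrDefault(term, Collections.<Long>emptyList())) {
            // Step 2: load the details from the database by ID.
            String details = database.get(id);
            if (details != null) {   // skip IDs whose row no longer exists
                results.add(details);
            }
        }
        return results;
    }
}
```

The null check in step 2 only hides an inconsistency from the user; it does not repair it, which is why the writes need to happen together in the first place.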

10. Building an index from RSS

RSS: information aggregation (information gathering)

<rss version="2.0">
  <channel>
    <title>Site title</title>
    <link>URL of the site's home page</link>
    <description>Description</description>
    <copyright>Copyright information</copyright>
    <language>Language used (zh-cn means Simplified Chinese)</language>
    <pubDate>Publication time</pubDate>
    <lastBuildDate>Time of the last update</lastBuildDate>
    <generator>Generator</generator>
    <item>
      <title>Title</title>
      <link>Link URL</link>
      <description>Short description of the content</description>
      <pubDate>Publication time</pubDate>
      <category>Category</category>
      <author>Author</author>
    </item>
  </channel>
</rss>
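
Before an RSS feed can be indexed, its items have to be pulled out of the XML. The sketch below does that with the DOM parser that ships with the JDK and returns one map of field name to value per item; turning each map into a Lucene Document is left out. The class and method names are assumptions, and a feed taken from an untrusted source would need a hardened parser configuration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Extracts the child elements of every item in an RSS 2.0 document. */
final class RssItemsSketch {

    /** Returns one map per item, keyed by child element name (title, link, ...). */
    static List<Map<String, String>> parseItems(String rssXml) {
        try {
            org.w3c.dom.Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rssXml)));
            List<Map<String, String>> items = new ArrayList<>();
            NodeList nodes = doc.getElementsByTagName("item");
            for (int i = 0; i < nodes.getLength(); i++) {
                Map<String, String> fields = new LinkedHashMap<>();
                NodeList children = nodes.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    // Skip whitespace text nodes between the elements.
                    if (child instanceof Element) {
                        fields.put(child.getNodeName(), child.getTextContent().trim());
                    }
                }
                items.add(fields);
            }
            return items;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a parsable RSS document", e);
        }
    }
}
```

Each returned map corresponds to one document to be indexed, with fields such as title, link and description.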

11. More on using the Lucene API

1. Sorting the query results

SortField field = new SortField("id", SortField.Type.LONG); // the key API

// Third argument true = reverse order:
// SortField field = new SortField("id", SortField.Type.LONG, true);

Sort sort = new Sort(field);

TopDocs topDocs = indexSearcher.search(query, num, sort);

2. Filtering the query results

/** Keep only ids from 505 to 513, both ends inclusive */

Filter filter = NumericRangeFilter.newLongRange("id", 505L, 513L, true, true);

TopDocs topDocs = indexSearcher.search(query, filter, num);

3. Highlighting the keywords in the query results

/** Highlight the keywords */

Scorer scorer = new QueryScorer(query); // bound to the query object

Formatter formatter = new SimpleHTMLFormatter("<font color='red'>", "</font>");

Highlighter highlighter = new Highlighter(formatter, scorer);

// ------------------------ Result ------------------------

// highContent is null when the text does not contain the keyword
String highContent = highlighter.getBestFragment(analyzer, "contents", article.getContents());

// highTitle is not defined in this excerpt
System.out.println("highTitle:" + highTitle);

System.out.println("highContent:" + highContent);

/* Add the required JARs */
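
Because the highlighted fragment can be null when the text does not contain the keyword, callers usually need a fallback. The plain-Java sketch below imitates that behavior with simple string replacement; it is not the Highlighter implementation, and the class and method names are assumptions.

```java
/** Toy highlighter: wraps keyword occurrences in tags, or reports that there were none. */
final class HighlightSketch {

    /** Returns the text with every occurrence wrapped, or null when the keyword is absent. */
    static String highlight(String text, String keyword, String pre, String post) {
        if (keyword.isEmpty() || !text.contains(keyword)) {
            return null;
        }
        return text.replace(keyword, pre + keyword + post);
    }

    /** Falls back to the unmodified text when nothing could be highlighted. */
    static String highlightOrOriginal(String text, String keyword, String pre, String post) {
        String highlighted = highlight(text, keyword, pre, post);
        return highlighted != null ? highlighted : text;
    }
}
```

The same fallback pattern applies to the value returned by `getBestFragment`: when it is null, display the original field value instead.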

4. What the Store enum does. [Figure from the original post]
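
In short, Store.YES keeps the original field value in the index so it can be read back from a search hit, while with Store.NO the field can still be searched (if it is indexed) but its value cannot be retrieved. The toy model below mimics that distinction for a single document in plain Java; it is not Lucene code, and the class and its methods are assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy single-document model of "indexed" versus "stored" field data. */
final class StoreSketch {
    private final Set<String> indexedTerms = new HashSet<>();        // searchable
    private final Map<String, String> storedValues = new HashMap<>(); // retrievable

    /** Indexes the value under the field; keeps the original only when store is true. */
    void addField(String field, String value, boolean store) {
        for (String term : value.toLowerCase(Locale.ROOT).split("\\s+")) {
            indexedTerms.add(field + ":" + term);
        }
        if (store) {
            storedValues.put(field, value);
        }
    }

    /** True when the term was indexed for the field, stored or not. */
    boolean matches(String field, String term) {
        return indexedTerms.contains(field + ":" + term);
    }

    /** Returns the stored value, or null for a field that was not stored. */
    String get(String field) {
        return storedValues.get(field);
    }
}
```

A common choice is to store short fields that are shown in the result list, such as an ID or a title, and to leave large bodies unstored and load them from the database by ID.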

5. The various kinds of query conditions.
