Lucene Study Notes

xiewenbo

Category: Search Engines

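I have recently been studying Lucene, mostly while using it in a project. Below are some notes. They are fairly fragmentary and draw mainly on articles from JavaEye.

I downloaded Lucene 2.4.1. Before starting, let's look at the differences between versions:

Versions 1.x, 2.0, and 2.4 differ in several ways.
For example:
1. The IndexWriter constructor

Java code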

//1.x
IndexWriter writer = new IndexWriter(indexPath, getAnalyzer(), true);
// 2.4 (the MaxFieldLength argument was added in this release)
IndexWriter writer = new IndexWriter(indexPath, getAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);

2. Field.Index.TOKENIZED is replaced by Field.Index.ANALYZED.
Nothing special here; only the name changed.

3. IndexWriter.flush() is replaced by IndexWriter.commit() (flush() is deprecated in 2.4).

4. org.apache.lucene.search.Hits
This class will be removed in 3.0.
The new usage is shown in the example below.

---- example ----

// collect the top 4 hits
TopDocCollector collector = new TopDocCollector(4);
// run the search
indexSearch.search(query, collector);
// I don't fully understand ScoreDoc yet, but there is one per returned result
ScoreDoc[] hits = collector.topDocs().scoreDocs;
for (int i = 0; i < hits.length; ++i) {
    // internal document id of this hit
    int docId = hits[i].doc;
    System.out.println(docId);
    // fetch the corresponding Document by that id
    Document d = indexSearch.doc(docId);
    System.out.println((i + 1) + ". " + d.get("title"));
}

---- end of example ----

5. Creating a Field

Java code

In 2.0+
Field no longer has the static factory methods Keyword, UnIndexed, UnStored, and Text; use the constructor
Field(String, String, Store, Index) instead.
Keyword corresponds to Field.Store.YES, Field.Index.UN_TOKENIZED,
UnIndexed corresponds to Field.Store.YES, Field.Index.NO,
UnStored corresponds to Field.Store.NO, Field.Index.TOKENIZED,
Text corresponds to Field.Store.YES, Field.Index.TOKENIZED.
(In 2.4, UN_TOKENIZED and TOKENIZED are renamed NOT_ANALYZED and ANALYZED.)
// 2.0 and later
Field(String name, byte[] value, Field.Store store)
// Create a stored field with binary value.
Field(String name, Reader reader)
// Create a tokenized and indexed field that is not stored.
Field(String name, Reader reader, Field.TermVector termVector)
// Create a tokenized and indexed field that is not stored, optionally with storing term vectors.
Field(String name, String value, Field.Store store, Field.Index index)
// Create a field by specifying its name, value and how it will be saved in the index.
Field(String name, String value, Field.Store store, Field.Index index, Field.TermVector termVector)
// Create a field by specifying its name, value and how it will be saved in the index.

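Field.Store controls whether the field is stored, that is, whether its original value is kept verbatim in the index so that it can be read back from search results.

Field.Index controls whether the field is indexed, that is, whether its content can be searched later, and if so whether it is run through the analyzer first (ANALYZED) or indexed as a single term (NOT_ANALYZED). A field that is not indexed usually just stores auxiliary information.

Field.TermVector controls whether term vectors are stored for the field (the per-document list of terms, optionally with positions and offsets). It does not control tokenization; that is decided by Field.Index.

A Reader argument means the data comes from a text stream, which usually means a large amount of data. Values such as URLs, file system paths, dates and times, personal names, ID card numbers, and phone numbers are usually indexed and stored in full but not tokenized. For these, use the fourth constructor above with Field.Store.YES and Field.Index.NOT_ANALYZED (UN_TOKENIZED before 2.4) as the third and fourth arguments. Long text can usually use the third constructor.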
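To make the store and index options concrete, here is a toy model that uses only the Java standard library. It is not Lucene code: `ToyIndex`, `addField`, and the whitespace "analyzer" are made up for illustration. The sketch only tries to show that a stored field can be read back verbatim, an analyzed field is searchable by individual tokens, and a not-analyzed field is searchable only by its exact value.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model of the Store / Index options. This is NOT Lucene's implementation;
// every name here is hypothetical and exists only for illustration.
final class ToyIndex {
    enum Store { YES, NO }
    enum Index { NO, ANALYZED, NOT_ANALYZED }

    // "field:term" -> ids of the documents containing that term
    private final Map<String, List<Integer>> postings = new HashMap<String, List<Integer>>();
    // "docId:field" -> original value, kept only when Store.YES
    private final Map<String, String> stored = new HashMap<String, String>();

    void addField(int docId, String name, String value, Store store, Index index) {
        if (store == Store.YES) {
            stored.put(docId + ":" + name, value);
        }
        if (index == Index.ANALYZED) {
            // Stand-in for an Analyzer: lower-case and split on whitespace.
            for (String token : value.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!token.isEmpty()) {
                    post(name, token, docId);
                }
            }
        } else if (index == Index.NOT_ANALYZED) {
            // The whole value becomes a single term.
            post(name, value, docId);
        }
    }

    private void post(String field, String term, int docId) {
        String key = field + ":" + term;
        List<Integer> ids = postings.get(key);
        if (ids == null) {
            ids = new ArrayList<Integer>();
            postings.put(key, ids);
        }
        if (!ids.contains(docId)) {
            ids.add(docId);
        }
    }

    // Exact-term lookup; returns an empty list when nothing matches.
    List<Integer> search(String field, String term) {
        List<Integer> ids = postings.get(field + ":" + term);
        return ids == null ? new ArrayList<Integer>() : new ArrayList<Integer>(ids);
    }

    // Returns the stored value, or null if the field was not stored.
    String get(int docId, String field) {
        return stored.get(docId + ":" + field);
    }

    public static void main(String[] args) {
        ToyIndex idx = new ToyIndex();
        idx.addField(0, "path", "/tmp/My File.txt", Store.YES, Index.NOT_ANALYZED);
        idx.addField(0, "body", "Hello Lucene World", Store.NO, Index.ANALYZED);
        System.out.println(idx.search("body", "lucene"));   // found via a token
        System.out.println(idx.search("path", "File.txt")); // not found: path is one term
        System.out.println(idx.get(0, "body"));             // not stored
    }
}
```

In this model the path field behaves like a Keyword field (stored, indexed as one term) and the body field like an UnStored field (searchable but not retrievable).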

------ (After a day or two, the part that puzzled me most finally started to make sense)

Hits are now obtained like this:

TopDocCollector collector = new TopDocCollector(4);
// run the search
indexSearch.search(query, collector);
// I don't fully understand ScoreDoc yet, but there is one per returned result
ScoreDoc[] hits = collector.topDocs().scoreDocs;

In Lucene 2.0 it was done like this:

IndexSearcher indexSearch = new IndexSearcher("c:\\index");
// new StandardAnalyzer() is the analyzer (tokenizer)
QueryParser queryParser = new QueryParser("file", new StandardAnalyzer());
String queryString = "中华";
Query query = queryParser.parse(queryString);
Hits hits = indexSearch.search(query);

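This left me confused when I thought about paging. In the new approach, hits comes from a TopDocCollector, and a TopDocCollector must be constructed with an argument: the number of records to fetch. What I could not figure out was how to learn how many records the query matched in total. In earlier versions, with Hits hits = indexSearch.search(query);, hits.length() gave the total number of matching records. I kept searching for a solution until I came across this article:

http://xuganggogo.javaeye.com/blog/323886 (Lucene example)

Below I list only the useful code: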

int CACHE_PAGE = 3; // number of pages to cache
BooleanClause.Occur[] clauses = { BooleanClause.Occur.SHOULD,
        BooleanClause.Occur.SHOULD };
Query query = MultiFieldQueryParser.parse(key, new String[] {
        "filename", "contents" }, clauses, analyzer);
// QueryParser parser = new QueryParser("contents", analyzer);
// Query query = parser.parse(key);
TopDocCollector collector = new TopDocCollector(perPage * CACHE_PAGE); // perPage
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
int numTotalHits = collector.getTotalHits();
System.out.println("Number of files matching the query: " + numTotalHits);
// compute the total number of pages
if (numTotalHits % perPage != 0) {
    total_Page = numTotalHits / perPage + 1;
} else {
    total_Page = numTotalHits / perPage;
}
if (begin > total_Page) {
    System.err.println("Out of range");
} else {
    // if the requested page is beyond the cached pages, we need to search again for more results
    if (begin > CACHE_PAGE) {
        // in that case fetch everything: cached pages = total pages
        CACHE_PAGE = total_Page;
        // call search again
        search(key, perPage, begin);
        // collector = new TopDocCollector( numTotalHits ); // cache too small, search again
        // searcher.search(query, collector);
        // hits = collector.topDocs().scoreDocs;
    } else {
        int temp = (begin - 1) * perPage + perPage;
        if ((begin - 1) * perPage + perPage > numTotalHits) {
            temp = numTotalHits;
        }

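After reading this I got the general idea. In fact, int CACHE_PAGE = 3; // number of pages to cache

should refer to the number of pages we want available by default, say 10 pages. A query then only needs to fetch CACHE_PAGE * perPage records (perPage being the number of records shown per page).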
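The page arithmetic described here can be sketched on its own, without any Lucene classes. The helper below is only an illustration: `PageMath` and its method names are hypothetical, and pages are assumed to be 1-based as in the quoted code. `hitsToCollect` is one possible way to size the collector so that the requested page is always covered by a single search.

```java
// Page arithmetic for result paging. All names are hypothetical; pages are 1-based.
final class PageMath {
    private PageMath() {}

    // Ceiling division: number of pages needed to show totalHits records.
    static int totalPages(int totalHits, int perPage) {
        if (perPage <= 0) {
            throw new IllegalArgumentException("perPage must be positive");
        }
        return (totalHits + perPage - 1) / perPage;
    }

    // Number of hits to request so that `page` is covered,
    // never fewer than cachePages * perPage.
    static int hitsToCollect(int page, int perPage, int cachePages) {
        return Math.max(page, cachePages) * perPage;
    }

    // Index of the first hit on `page` (inclusive).
    static int fromIndex(int page, int perPage) {
        return (page - 1) * perPage;
    }

    // Index one past the last hit on `page`, clamped to the hits available.
    static int toIndex(int page, int perPage, int available) {
        return Math.min(page * perPage, available);
    }

    public static void main(String[] args) {
        int totalHits = 23, perPage = 5, page = 5;
        System.out.println("pages: " + totalPages(totalHits, perPage));
        System.out.println("collect: " + hitsToCollect(page, perPage, 3));
        System.out.println("range: [" + fromIndex(page, perPage) + ", "
                + toIndex(page, perPage, totalHits) + ")");
    }
}
```

With 23 hits and 5 per page there are 5 pages, and page 5 covers indices 20 to 22. When wiring this to a collector, `available` would be the length of the returned hits array rather than the total hit count, since the array never holds more than was requested.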

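Google and Baidu illustrate this well.

Take Google as an example. When we search:

(screenshot not preserved)

This is roughly equivalent to Google using a CACHE_PAGE of 10 with 10 records per page (perPage). After you click the Next button, the pagination bar looks like this:

(screenshot not preserved)

Think about it: a search engine like Google may have an enormous number of records for a single keyword. Is it really necessary to fetch all matching records at once? Usually we find what we want on the first page of results, and we rarely click Next.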
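This also answers the earlier question of how a collector can report the total number of matches while keeping only a fixed number of hits: it can count every match it sees but retain only the best N in a small heap. The sketch below is conceptual and uses only the Java standard library. It is not Lucene's source code, and `TopNSketch`, `Hit`, and `collect` are names invented for the illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Conceptual sketch of a "top N" collector. Not Lucene's code; names are hypothetical.
final class TopNSketch {
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private static final Comparator<Hit> BY_SCORE_ASC = new Comparator<Hit>() {
        public int compare(Hit a, Hit b) { return Float.compare(a.score, b.score); }
    };

    private final int n;
    private final PriorityQueue<Hit> heap; // min-heap: weakest kept hit on top
    private int totalHits;

    TopNSketch(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        this.n = n;
        this.heap = new PriorityQueue<Hit>(n, BY_SCORE_ASC);
    }

    // Called once for every matching document.
    void collect(int doc, float score) {
        totalHits++;                       // every match is counted...
        if (heap.size() < n) {
            heap.add(new Hit(doc, score)); // ...but at most n are kept
        } else if (score > heap.peek().score) {
            heap.poll();                   // evict the weakest kept hit
            heap.add(new Hit(doc, score));
        }
    }

    int getTotalHits() { return totalHits; }

    // Kept hits, best score first.
    List<Hit> topDocs() {
        List<Hit> out = new ArrayList<Hit>(heap);
        Collections.sort(out, Collections.reverseOrder(BY_SCORE_ASC));
        return out;
    }

    public static void main(String[] args) {
        TopNSketch c = new TopNSketch(2);
        float[] scores = {0.3f, 0.9f, 0.1f, 0.7f};
        for (int doc = 0; doc < scores.length; doc++) {
            c.collect(doc, scores[doc]);
        }
        System.out.println("total=" + c.getTotalHits() + " kept=" + c.topDocs().size());
    }
}
```

Memory stays proportional to N rather than to the number of matches, which is why asking for only CACHE_PAGE * perPage hits is cheap even when the total is large.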

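Also, while using this in a real project, I found that the total number of matching records can still be obtained, as shown below:

Building the index:

    Java code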
    
  
 package paoding;  
   
 import java.io.BufferedReader;  
 import java.io.File;  
 import java.io.FileInputStream;  
 import java.io.FileNotFoundException;  
 import java.io.IOException;  
 import java.io.InputStreamReader;  
 import net.paoding.analysis.analyzer.PaodingAnalyzer;  
 import org.apache.lucene.analysis.Analyzer;  
 import org.apache.lucene.analysis.standard.StandardAnalyzer;  
 import org.apache.lucene.document.Document;  
 import org.apache.lucene.document.Field;  
 import org.apache.lucene.index.IndexWriter;  
   
 public class IndexFiles {  
   
     public static void main(String[] args) {  
         long start = System.currentTimeMillis();  
         try {  
             // Paoding Chinese analyzer
             Analyzer analyzer = new PaodingAnalyzer();
             // Analyzer analyzer = new StandardAnalyzer();
             // create the index with IndexWriter (true = overwrite any existing index)
             IndexWriter writer = new IndexWriter("f:\\indexpaoding", analyzer, true,
                     IndexWriter.MaxFieldLength.UNLIMITED);
             indexDocs(writer, new File("F:\\徐剛：28tel(繁firfox)"));
             writer.optimize();
             writer.close();
             System.out.println("Elapsed: " + (System.currentTimeMillis() - start)
                     + " ms");
         } catch (IOException e) {  
             e.printStackTrace();  
         }  
     }  
     // Recursively walk the directory and index the files we care about
     static void indexDocs(IndexWriter writer, File file) throws IOException {  
         if (file.canRead()) {  
             if (file.isDirectory()) {  
                 String[] files = file.list();  
                 if (files != null) {  
                     for (int i = 0; i < files.length; i++) {  
                         indexDocs(writer, new File(file, files[i]));  
                     }  
                 }  
             } else {  
                 if (file.getName().endsWith(".htm")  
                         || file.getName().endsWith(".html")  
                         || file.getName().endsWith(".jsp")  
                         || file.getName().endsWith(".php")  
                         || file.getName().endsWith(".txt")) {  
                     System.out.println("Adding " + file);
                     try {
                         // Build one index Document per file; a Document is like one record
                         Document doc = new Document();
                         // Field.Index.ANALYZED: the file name is tokenized and indexed
                         doc.add(new Field("filename", file.getCanonicalPath(),
                                 Field.Store.YES, Field.Index.ANALYZED,
                                 Field.TermVector.WITH_POSITIONS_OFFSETS));
                         doc.add(new Field("contents", ReadFile(file),
                                 Field.Store.YES, Field.Index.ANALYZED,
                                 Field.TermVector.WITH_POSITIONS_OFFSETS));
                         // new InputStreamReader(new
                         // FileInputStream(file.getCanonicalPath()), "utf-8")));
                         writer.addDocument(doc);
                     } catch (FileNotFoundException fnfe) {
                         // the file disappeared or cannot be opened; report it and move on
                         System.err.println("Skipping " + file + ": " + fnfe.getMessage());
                     }
                 }  
             }  
         }  
     }  
   
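     // Read the whole content of a file into a String (assumes UTF-8)
     public static String ReadFile(File f) {
         StringBuilder temp = new StringBuilder();
         BufferedReader br = null;
         try {
             br = new BufferedReader(new InputStreamReader(
                     new FileInputStream(f), "utf-8"));
             String line;
             while ((line = br.readLine()) != null) {
                 // readLine() strips the line terminator, so put one back
                 temp.append(line).append('\n');
             }
         } catch (IOException e) {
             // FileNotFoundException is a subclass of IOException
             e.printStackTrace();
         } finally {
             if (br != null) {
                 try {
                     br.close();
                 } catch (IOException ignored) {
                     // nothing useful to do if close fails
                 }
             }
         }
         return temp.toString();
     }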
   
 }  

Searching, with simple paging:

    Java code
    
  
 package paoding;  
   
 import java.io.BufferedReader;  
 import java.io.IOException;  
 import java.io.InputStreamReader;  
 import net.paoding.analysis.analyzer.PaodingAnalyzer;  
 import org.apache.lucene.analysis.Analyzer;  
 import org.apache.lucene.analysis.TokenStream;  
 import org.apache.lucene.analysis.standard.StandardAnalyzer;  
 import org.apache.lucene.document.Document;  
 import org.apache.lucene.index.CorruptIndexException;  
 import org.apache.lucene.index.IndexReader;  
 import org.apache.lucene.index.TermPositionVector;  
 import org.apache.lucene.queryParser.MultiFieldQueryParser;  
 import org.apache.lucene.queryParser.ParseException;  
 import org.apache.lucene.queryParser.QueryParser;  
 import org.apache.lucene.search.BooleanClause;  
 import org.apache.lucene.search.IndexSearcher;  
 import org.apache.lucene.search.Query;  
 import org.apache.lucene.search.ScoreDoc;  
 import org.apache.lucene.search.Searcher;  
 import org.apache.lucene.search.TopDocCollector;  
 import org.apache.lucene.search.highlight.Formatter;  
 import org.apache.lucene.search.highlight.Highlighter;  
 import org.apache.lucene.search.highlight.QueryScorer;  
 import org.apache.lucene.search.highlight.TokenGroup;  
 import org.apache.lucene.search.highlight.TokenSources;  
   
 public class SearchFiles {  
     int CACHE_PAGE = 3; // number of pages to cache

     /**
      * @param key
      *            the search keyword
      * @param perPage
      *            number of records per page
      * @param begin
      *            page to display (1-based)
      * @throws CorruptIndexException
      * @throws IOException
      * @throws ParseException
      */
     public void search(String key, int perPage, int begin)
             throws CorruptIndexException, IOException, ParseException {
         String INDEX_PATH = "f:\\indexpaoding"; // index directory

         int total_Page = 0; // total number of pages

         // Paoding Chinese analyzer
         Analyzer analyzer = new PaodingAnalyzer();
         // Analyzer analyzer = new StandardAnalyzer();
         // open the index for searching
         IndexReader reader = IndexReader.open(INDEX_PATH);
         Searcher searcher = new IndexSearcher(reader);

         /* Search both fields; a match in either one is enough */
         BooleanClause.Occur[] clauses = { BooleanClause.Occur.SHOULD,
                 BooleanClause.Occur.SHOULD };
         Query query = MultiFieldQueryParser.parse(key, new String[] {
                 "filename", "contents" }, clauses, analyzer);
         // QueryParser parser = new QueryParser("contents", analyzer);
         // Query query = parser.parse(key);

         // keep only the first perPage * CACHE_PAGE hits
         TopDocCollector collector = new TopDocCollector(perPage * CACHE_PAGE);

         searcher.search(query, collector);
         ScoreDoc[] hits = collector.topDocs().scoreDocs;

         // total number of matches, independent of how many hits were kept
         int numTotalHits = collector.getTotalHits();
         System.out.println("Number of files matching the query: " + numTotalHits);

         // compute the total number of pages
         if (numTotalHits % perPage != 0) {
             total_Page = numTotalHits / perPage + 1;
         } else {
             total_Page = numTotalHits / perPage;
         }

         if (begin > total_Page) {
             System.err.println("Out of range");
         } else {
             // if the requested page is beyond the cached pages, search again for more results
             if (begin > CACHE_PAGE) {
                 // in that case fetch everything: cached pages = total pages
                 CACHE_PAGE = total_Page;
                 // call search again
                 search(key, perPage, begin);
                 // collector = new TopDocCollector( numTotalHits ); // cache too small, search again
                 // searcher.search(query, collector);
                 // hits = collector.topDocs().scoreDocs;
             } else {
                 int temp = (begin - 1) * perPage + perPage;
                 if ((begin - 1) * perPage + perPage > numTotalHits) {
                     temp = numTotalHits;
                 }
                 // fetch the records for the requested page (used for paging)
                 for (int i = (begin - 1) * perPage; i < temp; i++) {
                     System.out.println(i);
                     int docId = hits[i].doc;
                     Document doc3 = searcher.doc(docId);
                     String filename = doc3.get("filename");
                     System.out.println("filename=" + filename);
                     // highlighting
                     String text = doc3.get("contents");
                     TermPositionVector tpv = (TermPositionVector) reader
                             .getTermFreqVector(hits[i].doc, "contents");
                     TokenStream ts = TokenSources.getTokenStream(tpv);
                     Formatter formatter = new Formatter() {
                         public String highlightTerm(String srcText, TokenGroup g) {
                             if (g.getTotalScore() <= 0) {
                                 return srcText;
                             }
                             return "<b>" + srcText + "</b>";
                         }
                     };
                     Highlighter highlighter = new Highlighter(formatter,
                             new QueryScorer(query));
                     String result = highlighter.getBestFragments(ts, text, 5,
                             "…");
                     System.out.println("result:\n\t" + result);
                 }
                 System.out.println("Loop finished");
             }
         }
         reader.close();
         System.out.println("Reader closed");

     }
   
     public static void main(String[] args) throws Exception {  
         SearchFiles sf = new SearchFiles();  
         sf.search("vvczvxcxz", 5, 1);  
     }  
 }  
