Getting Started with Lucene: Choosing Learning Material and Version Differences

Article link: https://blog.csdn.net/JieJiuXunHuan/article/details/8530393

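I recently started learning Lucene.

At the beginning I ran into a lot of confusion. I downloaded countless examples from the web, written in many different ways, but when I tried them, many had missing arguments or the wrong argument types, and some even referred to classes that could not be found.

Later a friend pointed out that the Lucene API changes a lot from version to version. Some of the early APIs were not efficient and were later reworked beyond recognition.

In the end I settled on Lucene 3.0.3 as the version to learn.

Here is an example: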

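import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class TextFileIndexer {
    public static void main(String[] args) throws Exception {
        // Folder that contains the files to index
        File fileDir = new File("D://search//data");

        // Folder where the index files are written
        File indexDir = new File("D://search");
        Directory dir = FSDirectory.open(indexDir);
//        IndexWriter writer = new IndexWriter(dir,new StandardAnalyzer(Version.LUCENE_29),true, IndexWriter.MaxFieldLength.UNLIMITED);
//        Document doc = new Document();
//        doc.add(new Field("id", "101", Field.Store.YES, Field.Index.NO));
//        doc.add(new Field("name", "kobe bryant", Field.Store.YES, Field.Index.NO));
//        writer.addDocument(doc);
//        writer.optimize();
//        writer.close();

        // Create a standard analyzer
        Analyzer luceneAnalyzer = new StandardAnalyzer(Version.LUCENE_29);
        // Create the index writer; true means a new index is created
        IndexWriter indexWriter = new IndexWriter(dir, luceneAnalyzer, true,
                IndexWriter.MaxFieldLength.UNLIMITED);
        File[] textFiles = fileDir.listFiles();
        if (textFiles == null) {
            // listFiles() returns null when the folder is missing or unreadable
            throw new IOException("Cannot list files in " + fileDir);
        }
        long startTime = System.currentTimeMillis();

        // Add one Document per .txt file to the index
        for (int i = 0; i < textFiles.length; i++) {
            if (textFiles[i].isFile() && textFiles[i].getName().endsWith(".txt")) {
                System.out.println("File " + textFiles[i].getCanonicalPath()
                        + " is being indexed.");
                String temp = FileReaderAll(textFiles[i].getCanonicalPath(), "UTF-8");
                System.out.println(temp);
                // A Document is one record, i.e. one entry in the inverted index
                // built for searching. To search the files on your own computer,
                // for example, you create Fields and combine them into a Document,
                // which ends up in the index files. A Lucene Document is not the
                // same thing as a document in the file system.
                Field FieldPath = new Field("path", textFiles[i].getPath(),
                        Field.Store.YES, Field.Index.NO); // stored, not searchable
                Field FieldBody = new Field("body", temp,
                        Field.Store.YES,
                        Field.Index.ANALYZED,
                        Field.TermVector.WITH_POSITIONS_OFFSETS);
                document(indexWriter, FieldPath, FieldBody);
            }
        }
        // optimize() optimizes the index
        indexWriter.optimize();
        indexWriter.close();

        // Report how long indexing took
        long endTime = System.currentTimeMillis();
        System.out.println("It took " + (endTime - startTime)
                + " ms to add the documents to the index! " + fileDir.getPath());
    }

    // Builds a Document from the two fields and adds it to the index
    private static void document(IndexWriter indexWriter, Field FieldPath, Field FieldBody)
            throws IOException {
        Document document = new Document();
        document.add(FieldPath);
        document.add(FieldBody);
        indexWriter.addDocument(document);
    }

    // Reads a whole text file into a String using the given charset
    public static String FileReaderAll(String FileName, String charset)
            throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream(FileName), charset));
        StringBuilder temp = new StringBuilder();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // Keep the line break; otherwise the last word of one line and
                // the first word of the next are glued into a single token
                temp.append(line).append('\n');
            }
        } finally {
            reader.close();
        }
        return temp.toString();
    }
}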
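
The Document comment in the listing mentions the inverted index. As a rough illustration of that idea, here is a toy sketch in plain Java. It is not how Lucene is implemented, and the names InvertedIndexSketch, addDocument, and search are made up for this example. It mirrors the listing in one respect: the path is only kept so it can be returned with a hit, while the body is what gets searched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy inverted index. Hypothetical names; not Lucene's implementation.
class InvertedIndexSketch {
    // Stored values: docId -> path (kept for display, never searched).
    private final List<String> paths = new ArrayList<String>();
    // Inverted index: term -> ids of the documents whose body contains it.
    private final Map<String, List<Integer>> postings = new HashMap<String, List<Integer>>();

    void addDocument(String path, String body) {
        int docId = paths.size();
        paths.add(path);
        String[] words = body.toLowerCase(Locale.ROOT).split("\\s+");
        // The set drops repeated words, so each docId is listed once per term.
        for (String term : new LinkedHashSet<String>(Arrays.asList(words))) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new ArrayList<Integer>()).add(docId);
            }
        }
    }

    // Returns the stored paths of all documents whose body contains the term.
    List<String> search(String term) {
        List<String> hits = new ArrayList<String>();
        List<Integer> docIds = postings.get(term.toLowerCase(Locale.ROOT));
        if (docIds != null) {
            for (int docId : docIds) {
                hits.add(paths.get(docId));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument("a.txt", "Lucene is a search library");
        index.addDocument("b.txt", "search engines use an inverted index");
        System.out.println(index.search("search")); // both files
        System.out.println(index.search("a.txt"));  // empty: the path is not indexed
    }
}
```

The point of the structure is that a lookup goes from a term straight to the documents that contain it, instead of scanning every file.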

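In the 2.x versions, the standard analyzer and the index writer were created like this:

Analyzer luceneAnalyzer = new StandardAnalyzer(); // create a standard analyzer
IndexWriter indexWriter = new IndexWriter(indexDir, luceneAnalyzer, true); // create an index writer

Back then, fields were also indexed with the Field.Index.TOKENIZED constant.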

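From 3.0 onwards, the code is written differently:

Analyzer luceneAnalyzer = new StandardAnalyzer(Version.LUCENE_29); // create a standard analyzer
IndexWriter indexWriter = new IndexWriter(dir, luceneAnalyzer, true, IndexWriter.MaxFieldLength.UNLIMITED); // create an index writer

The Version argument passed to the Analyzer is there for compatibility with older releases: Version.LUCENE_29 asks the analyzer to behave as it did in 2.9, and 3.0.3 also defines Version.LUCENE_30. From 3.0 on, the constructor can also take a stop-word parameter.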
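
To make the stop-word idea concrete, here is a toy sketch in plain Java of what an analyzer with a stop-word set does to a piece of text: split it into tokens, lowercase them, and drop the stop words. This is not Lucene code and it does not reproduce StandardAnalyzer's real tokenization rules. The class name StopWordAnalyzerSketch and the method analyze are invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy stand-in for an analyzer with a stop-word set. Hypothetical names; not Lucene code.
class StopWordAnalyzerSketch {

    // Splits on anything that is not a letter or digit, lowercases each token,
    // and drops the tokens found in stopWords.
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<String>();
        for (String raw : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (raw.isEmpty()) {
                continue; // split() can yield an empty first element
            }
            String token = raw.toLowerCase(Locale.ROOT);
            if (!stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<String>(Arrays.asList("the", "a", "is"));
        // prints [kobe, bryant, player]
        System.out.println(analyze("Kobe Bryant is a player", stopWords));
    }
}
```

Words in the stop-word set never reach the index, so searching for them finds nothing.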

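When constructing the index writer, dir (the location where the index is stored) changed from a File to a Directory. You can convert one to the other with Directory dir = FSDirectory.open(indexDir);.

A Field has two attributes you can choose: store and index.

The store attribute controls whether the Field's value is stored;

the index attribute controls whether the Field is indexed.

Getting the combination of these two attributes right is important.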

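Field.Index  | Field.Store | Meaning
TOKENIZED    | YES         | Tokenized, indexed, and stored
TOKENIZED    | NO          | Tokenized and indexed, but not stored
NO           | YES         | Cannot be searched; it is only data attached to the searchable content, e.g. a URL
UN_TOKENIZED | YES/NO      | Not tokenized; it is searched as a whole, so searching for part of it finds nothing
NO           | NO          | Not a valid combination

Version 3.0 no longer has TOKENIZED; Field.Index.ANALYZED takes its place (likewise, UN_TOKENIZED became NOT_ANALYZED):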

	Field FieldBody = new Field("body", temp, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
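
The store/index combinations are easier to remember with a small model. The sketch below is plain Java, not Lucene: the enum IndexMode and the methods isSearchable and isValid are hypothetical names, and the whitespace tokenizer is a deliberate simplification. It only captures the rules from the table: an analyzed field matches on individual words, a non-analyzed field matches only on the whole value, an unindexed field never matches, and a field that is neither indexed nor stored makes no sense.

```java
import java.util.Arrays;
import java.util.Locale;

// Toy model of the index/store combinations. Hypothetical names; not the Lucene API.
class FieldOptionsSketch {
    enum IndexMode { NO, ANALYZED, NOT_ANALYZED }

    // Would a single-term query find a field with this value and index mode?
    static boolean isSearchable(String value, IndexMode mode, String query) {
        switch (mode) {
            case ANALYZED:
                // Tokenized: any individual word can match.
                return Arrays.asList(value.toLowerCase(Locale.ROOT).split("\\s+"))
                        .contains(query.toLowerCase(Locale.ROOT));
            case NOT_ANALYZED:
                // One single term: only the exact whole value matches.
                return value.equals(query);
            default:
                // Not indexed at all.
                return false;
        }
    }

    // A field that is neither indexed nor stored would leave no trace in the index.
    static boolean isValid(IndexMode mode, boolean stored) {
        return stored || mode != IndexMode.NO;
    }

    public static void main(String[] args) {
        String name = "kobe bryant";
        System.out.println(isSearchable(name, IndexMode.ANALYZED, "kobe"));            // true
        System.out.println(isSearchable(name, IndexMode.NOT_ANALYZED, "kobe"));        // false
        System.out.println(isSearchable(name, IndexMode.NOT_ANALYZED, "kobe bryant")); // true
        System.out.println(isSearchable(name, IndexMode.NO, "kobe"));                  // false
        System.out.println(isValid(IndexMode.NO, false));                              // false
    }
}
```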

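The Field constructor used here also takes a fifth argument, the term vector option. (This overload is present in the 2.x API as well, so it is not specific to 3.0.)

	Field.TermVector.WITH_POSITIONS_OFFSETS		// stores the field's term vector (each term of this field in the current document and how many times it occurs), together with the token position and character offsets of every occurrence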
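
To show what kind of information a term vector with positions and offsets holds, here is a simplified sketch in plain Java. It is an illustration only, not Lucene's data structure: the class TermVectorSketch, the nested class Occurrence, and the method termVector are hypothetical, and the text is split on whitespace rather than run through a real analyzer. For every term it records each occurrence's token position and its start and end character offsets; the number of occurrences is the term's frequency.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy term vector with positions and offsets. Hypothetical names; not Lucene code.
class TermVectorSketch {

    // One occurrence of a term: token position and character offsets [start, end).
    static final class Occurrence {
        final int position;
        final int startOffset;
        final int endOffset;

        Occurrence(int position, int startOffset, int endOffset) {
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        @Override
        public String toString() {
            return "pos=" + position + " [" + startOffset + "," + endOffset + ")";
        }
    }

    // Maps each lowercased term to all of its occurrences, in order of appearance.
    static Map<String, List<Occurrence>> termVector(String text) {
        Map<String, List<Occurrence>> vector = new LinkedHashMap<String, List<Occurrence>>();
        Matcher matcher = Pattern.compile("\\S+").matcher(text);
        int position = 0;
        while (matcher.find()) {
            String term = matcher.group().toLowerCase(Locale.ROOT);
            List<Occurrence> occurrences = vector.get(term);
            if (occurrences == null) {
                occurrences = new ArrayList<Occurrence>();
                vector.put(term, occurrences);
            }
            occurrences.add(new Occurrence(position++, matcher.start(), matcher.end()));
        }
        return vector;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, List<Occurrence>> e : termVector("to be or not to be").entrySet()) {
            System.out.println(e.getKey() + " freq=" + e.getValue().size() + " " + e.getValue());
        }
    }
}
```

Offsets like these are what make it possible to locate the matched words in the original text, for example to highlight them in search results.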

Well, that is all for now. If you want to learn Lucene, I suggest studying a version from 3.0 onwards: it is more powerful and more efficient.