Lucene: a getting-started example

Apache Lucene is a high-performance, full-featured text search engine library. Here's a simple example of how to use Lucene for indexing and searching (using JUnit to check that the results are what we expect):

    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

    // Store the index in memory:
    Directory directory = new RAMDirectory();
    // To store an index on disk, use this instead:
    //Directory directory = FSDirectory.open(new File("/tmp/testindex"));
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
    IndexWriter iwriter = new IndexWriter(directory, config);
    Document doc = new Document();
    String text = "This is the text to be indexed.";
    doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
    iwriter.addDocument(doc);
    iwriter.close();
    
    // Now search the index:
    DirectoryReader ireader = DirectoryReader.open(directory);
    IndexSearcher isearcher = new IndexSearcher(ireader);
    // Parse a simple query that searches for "text":
    QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "fieldname", analyzer);
    Query query = parser.parse("text");
    ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
    assertEquals(1, hits.length);
    // Iterate through the results:
    for (int i = 0; i < hits.length; i++) {
      Document hitDoc = isearcher.doc(hits[i].doc);
      assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
    }
    ireader.close();
    directory.close();
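
For reference, the snippet above is a fragment; to compile it you need roughly the following imports (Lucene 4.x package layout) plus the lucene-core, lucene-analyzers-common and lucene-queryparser jars that the demo commands below also use:

    import static org.junit.Assert.assertEquals;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;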


The Lucene API is divided into several packages.

To use Lucene, an application should:
  1. Create Documents by adding Fields;
  2. Create an IndexWriter and add documents to it with addDocument();
  3. Call QueryParser.parse() to build a query from a string; and
  4. Create an IndexSearcher and pass the query to its search() method.
Some simple examples of code that does this are:
  •  IndexFiles.java creates an index for all the files contained in a directory.
  •  SearchFiles.java prompts for queries and searches an index.
To demonstrate these, try something like:
> java -cp lucene-core.jar:lucene-demo.jar:lucene-analyzers-common.jar org.apache.lucene.demo.IndexFiles -index index -docs rec.food.recipes/soups
adding rec.food.recipes/soups/abalone-chowder
  [ ... ]

> java -cp lucene-core.jar:lucene-demo.jar:lucene-queryparser.jar:lucene-analyzers-common.jar org.apache.lucene.demo.SearchFiles
Query: chowder
Searching for: chowder
34 total matching documents
1. rec.food.recipes/soups/spam-chowder
  [ ... thirty-four documents contain the word "chowder" ... ]

Query: "clam chowder" AND Manhattan
Searching for: +"clam chowder" +manhattan
2 total matching documents
1. rec.food.recipes/soups/clam-chowder
  [ ... two documents contain the phrase "clam chowder" and the word "manhattan" ... ]
    [ Note: "+" and "-" are canonical, but "AND", "OR" and "NOT" may be used. ]




public final class Document
extends Object
implements Iterable<IndexableField>

Documents are the unit of indexing and search. A Document is a set of fields. Each field has a name and a textual value. A field may be stored with the document, in which case it is returned with search hits on the document. Thus each document should typically contain one or more stored fields which uniquely identify it.
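
A minimal sketch of that idea (the field names "path" and "contents" are only illustrative): a stored, non-tokenized identifier field next to an indexed full-text field.

    Document doc = new Document();
    // Stored and indexed as a single token: suitable as a unique identifier.
    doc.add(new StringField("path", "/docs/readme.txt", Field.Store.YES));
    // Tokenized, indexed and stored: the full-text content to search.
    doc.add(new Field("contents", "This is the text to be indexed.", TextField.TYPE_STORED));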

public final void add(IndexableField field)

Adds a field to a document. Several fields may be added with the same name. In this case, if the fields are indexed, their text is treated as though appended for the purposes of search.

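A small sketch of the same-name case (the field name "contents" is again just an example):

    Document doc = new Document();
    doc.add(new TextField("contents", "first chapter of the book", Field.Store.YES));
    doc.add(new TextField("contents", "second chapter of the book", Field.Store.YES));
    // For search purposes the two values behave as if they had been appended,
    // so a query such as contents:chapter matches this single document.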

Note that add, like the removeField(s) methods, only makes sense prior to adding a document to an index. These methods cannot be used to change the content of an existing index! In order to achieve this, a document has to be deleted from the index and a new, changed version of that document has to be added.
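
As a sketch, IndexWriter.updateDocument() wraps that delete-then-add cycle in one call; here iwriter is assumed to be the IndexWriter from the first example and the hypothetical "path" field is assumed to uniquely identify the document:

    Document newVersion = new Document();
    newVersion.add(new StringField("path", "/docs/readme.txt", Field.Store.YES));
    newVersion.add(new Field("contents", "This is the corrected text.", TextField.TYPE_STORED));
    // Deletes every document containing the term path:/docs/readme.txt,
    // then adds the new version.
    iwriter.updateDocument(new Term("path", "/docs/readme.txt"), newVersion);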



public class IndexWriter
extends Object
implements Closeable, TwoPhaseCommit
An IndexWriter creates and maintains an index.

The IndexWriterConfig.OpenMode option on IndexWriterConfig.setOpenMode(OpenMode) determines whether a new index is created, or whether an existing index is opened. Note that you can open an index with IndexWriterConfig.OpenMode.CREATE even while readers are using the index. The old readers will continue to search the "point in time" snapshot they had opened, and won't see the newly created index until they re-open. If IndexWriterConfig.OpenMode.CREATE_OR_APPEND is used, IndexWriter will create a new index if there is not already an index at the provided path and otherwise open the existing index.
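
A small configuration sketch, reusing the analyzer and directory from the first example (the variable name writer is new here):

    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
    // CREATE replaces any existing index at this location;
    // CREATE_OR_APPEND (the default) appends if an index exists, otherwise creates one.
    config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
    IndexWriter writer = new IndexWriter(directory, config);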

In either case, documents are added with addDocument and removed with deleteDocuments(Term) or deleteDocuments(Query). A document can be updated with updateDocument (which just deletes and then adds the entire document). When finished adding, deleting and updating documents, close should be called.
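
A fragment-style sketch of the deletion calls, assuming writer is an open IndexWriter and the field names are again only illustrative:

    // Delete every document whose "path" field contains this exact term ...
    writer.deleteDocuments(new Term("path", "/docs/obsolete.txt"));
    // ... or everything that matches an arbitrary query.
    writer.deleteDocuments(new TermQuery(new Term("contents", "outdated")));
    // Make the changes durable and release the write lock.
    writer.close();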

These changes are buffered in memory and periodically flushed to the Directory (during the above method calls). A flush is triggered when there are enough added documents since the last flush. Flushing is triggered either by RAM usage of the documents (see IndexWriterConfig.setRAMBufferSizeMB(double)) or the number of added documents (see IndexWriterConfig.setMaxBufferedDocs(int)). The default is to flush when RAM usage hits IndexWriterConfig.DEFAULT_RAM_BUFFER_SIZE_MB MB. For best indexing speed you should flush by RAM usage with a large RAM buffer. Additionally, if IndexWriter reaches the configured number of buffered deletes (see IndexWriterConfig.setMaxBufferedDeleteTerms(int)), the deleted terms and queries are flushed and applied to existing segments. In contrast to the other flush options IndexWriterConfig.setRAMBufferSizeMB(double) and IndexWriterConfig.setMaxBufferedDocs(int), deleted terms won't trigger a segment flush. Note that flushing just moves the internal buffered state in IndexWriter into the index, but these changes are not visible to IndexReader until either commit() or close() is called. A flush may also trigger one or more segment merges, which by default run with a background thread so as not to block the addDocument calls (see below for changing the MergeScheduler).
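
A sketch of tuning those flush triggers on IndexWriterConfig (64 MB is an arbitrary example value):

    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
    // Flush when the buffered documents use roughly 64 MB of RAM ...
    config.setRAMBufferSizeMB(64.0);
    // ... and turn off the document-count trigger so only RAM usage decides.
    config.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);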

Opening an IndexWriter creates a lock file for the directory in use. Trying to open another IndexWriter on the same directory will lead to a LockObtainFailedException. The LockObtainFailedException is also thrown if an IndexReader on the same directory is used to delete documents from the index.
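
A sketch of what that failure looks like in code, assuming directory is already held open by another writer:

    try {
      IndexWriter second = new IndexWriter(directory,
          new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer));
      second.close();
    } catch (LockObtainFailedException e) {
      // Another IndexWriter (or an IndexReader deleting documents)
      // already holds the write lock for this directory.
    } catch (IOException e) {
      // Any other I/O problem while opening the index.
    }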

Expert: IndexWriter allows an optional IndexDeletionPolicy implementation to be specified. You can use this to control when prior commits are deleted from the index. The default policy is KeepOnlyLastCommitDeletionPolicy, which removes all prior commits as soon as a new commit is done (this matches behavior before 2.2). Creating your own policy can allow you to explicitly keep previous "point in time" commits alive in the index for some time, to allow readers to refresh to the new commit without having the old commit deleted out from under them. This is necessary on filesystems like NFS that do not support "delete on last close" semantics, which Lucene's "point in time" search normally relies on.
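
A sketch of where such a policy is plugged in; the default is spelled out here only to show the hook (config is assumed to be an IndexWriterConfig as above):

    // Keep only the most recent commit (the default behaviour); a custom
    // IndexDeletionPolicy supplied here could instead keep older
    // "point in time" commits alive for readers on NFS-like filesystems.
    config.setIndexDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());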

Expert: IndexWriter allows you to separately change the MergePolicy and the MergeScheduler. The MergePolicy is invoked whenever there are changes to the segments in the index. Its role is to select which merges to do, if any, and return a MergePolicy.MergeSpecification describing the merges. The default is LogByteSizeMergePolicy. Then, the MergeScheduler is invoked with the requested merges and it decides when and how to run the merges. The default is ConcurrentMergeScheduler.
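
A sketch of setting both explicitly, using the default implementations named above (config as before; the merge factor value is arbitrary):

    LogByteSizeMergePolicy mergePolicy = new LogByteSizeMergePolicy();
    mergePolicy.setMergeFactor(10);                 // how many segments are merged at a time
    config.setMergePolicy(mergePolicy);
    config.setMergeScheduler(new ConcurrentMergeScheduler());  // run merges on background threads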

NOTE: if you hit an OutOfMemoryError then IndexWriter will quietly record this fact and block all future segment commits. This is a defensive measure in case any internal state (buffered documents and deletions) were corrupted. Any subsequent calls to commit() will throw an IllegalStateException. The only course of action is to call close(), which internally will call rollback(), to undo any changes to the index since the last commit. You can also just call rollback() directly.

NOTE: IndexWriter instances are completely thread safe, meaning multiple threads can call any of its methods, concurrently. If your application requires external synchronization, you should not synchronize on the IndexWriter instance as this may cause deadlock; use your own (non-Lucene) objects instead.

NOTE: If you call Thread.interrupt() on a thread that's within IndexWriter, IndexWriter will try to catch this (eg, if it's in a wait() or Thread.sleep()), and will then throw the unchecked exception ThreadInterruptedException and clear the interrupt status on the thread.



// Document.add(): the parameter type IndexableField is an interface;
// in practice a concrete class such as Field (which implements all of its methods) is passed in.
 public final void add(IndexableField field) {
    fields.add(field);
  }
 // Private member of Document: the fields are kept in an ArrayList
 private final List<IndexableField> fields = new ArrayList<IndexableField>();
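
 Because Document implements Iterable<IndexableField> (see the class declaration above), the fields in that list can be walked directly; a tiny sketch, assuming doc is a Document:

 for (IndexableField f : doc) {
   // stringValue() is null for binary or numeric fields
   System.out.println(f.name() + " = " + f.stringValue());
 }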
 
 // The IndexableField interface
 public interface IndexableField {

  /** Field name */
  public String name();

  /** {@link IndexableFieldType} describing the properties
   * of this field. */
  public IndexableFieldType fieldType();
  
  /** 
   * Returns the field's index-time boost.
   * <p>
   * Only fields can have an index-time boost, if you want to simulate
   * a "document boost", then you must pre-multiply it across all the
   * relevant fields yourself. 
   * <p>The boost is used to compute the norm factor for the field.  By
   * default, in the {@link Similarity#computeNorm(FieldInvertState)} method, 
   * the boost value is multiplied by the length normalization factor and then
   * rounded by {@link DefaultSimilarity#encodeNormValue(float)} before it is stored in the
   * index.  One should attempt to ensure that this product does not overflow
   * the range of that encoding.
   * <p>
   * It is illegal to return a boost other than 1.0f for a field that is not
   * indexed ({@link IndexableFieldType#indexed()} is false) or omits normalization values
   * ({@link IndexableFieldType#omitNorms()} returns true).
   *
   * @see Similarity#computeNorm(FieldInvertState)
   * @see DefaultSimilarity#encodeNormValue(float)
   */
  public float boost();

  /** Non-null if this field has a binary value */
  public BytesRef binaryValue();

  /** Non-null if this field has a string value */
  public String stringValue();

  /** Non-null if this field has a Reader value */
  public Reader readerValue();

  /** Non-null if this field has a numeric value */
  public Number numericValue();

  /**
   * Creates the TokenStream used for indexing this field.  If appropriate,
   * implementations should use the given Analyzer to create the TokenStreams.
   *
   * @param analyzer Analyzer that should be used to create the TokenStreams from
   * @return TokenStream value for indexing the document.  Should always return
   *         a non-null value if the field is to be indexed
   * @throws IOException Can be thrown while creating the TokenStream
   */
  public TokenStream tokenStream(Analyzer analyzer) throws IOException;
}

// IndexWriter: addDocument(Iterable, Analyzer) simply delegates to updateDocument with a null delete term
public void addDocument(Iterable<? extends IndexableField> doc, Analyzer analyzer) throws IOException {
    updateDocument(null, doc, analyzer);
  } 


 public TextField(String name, Reader reader) {
    super(name, reader, TYPE_NOT_STORED);
  }


 // DocumentsWriterPerThread constructor

public DocumentsWriterPerThread(Directory directory, DocumentsWriter parent,
      FieldInfos.Builder fieldInfos, IndexingChain indexingChain) {
    this.directoryOrig = directory;
    this.directory = new TrackingDirectoryWrapper(directory);
    this.parent = parent;
    this.fieldInfos = fieldInfos;
    this.writer = parent.indexWriter;
    this.infoStream = parent.infoStream;
    this.codec = parent.codec;
    this.docState = new DocState(this, infoStream);
    this.docState.similarity = parent.indexWriter.getConfig().getSimilarity();
    bytesUsed = Counter.newCounter();
    byteBlockAllocator = new DirectTrackingAllocator(bytesUsed);
    pendingDeletes = new BufferedDeletes();
    intBlockAllocator = new IntBlockAllocator(bytesUsed);
    initialize();
    // this should be the last call in the ctor 
    // it really sucks that we need to pull this within the ctor and pass this ref to the chain!
    consumer = indexingChain.getChain(this);
  }


// The default indexing chain, a static field inside DocumentsWriterPerThread
static final IndexingChain defaultIndexingChain = new IndexingChain() {

    @Override
    DocConsumer getChain(DocumentsWriterPerThread documentsWriterPerThread) {
      /*
      This is the current indexing chain:

      DocConsumer / DocConsumerPerThread
        --> code: DocFieldProcessor
          --> DocFieldConsumer / DocFieldConsumerPerField
            --> code: DocFieldConsumers / DocFieldConsumersPerField
              --> code: DocInverter / DocInverterPerField
                --> InvertedDocConsumer / InvertedDocConsumerPerField
                  --> code: TermsHash / TermsHashPerField
                    --> TermsHashConsumer / TermsHashConsumerPerField
                      --> code: FreqProxTermsWriter / FreqProxTermsWriterPerField
                      --> code: TermVectorsTermsWriter / TermVectorsTermsWriterPerField
                --> InvertedDocEndConsumer / InvertedDocConsumerPerField
                  --> code: NormsConsumer / NormsConsumerPerField
          --> StoredFieldsConsumer
            --> TwoStoredFieldConsumers
              -> code: StoredFieldsProcessor
              -> code: DocValuesProcessor
    */

    // Build up indexing chain:

      final TermsHashConsumer termVectorsWriter = new TermVectorsConsumer(documentsWriterPerThread);
      final TermsHashConsumer freqProxWriter = new FreqProxTermsWriter();

      final InvertedDocConsumer termsHash = new TermsHash(documentsWriterPerThread, freqProxWriter, true,
                                                          new TermsHash(documentsWriterPerThread, termVectorsWriter, false, null));
      final NormsConsumer normsWriter = new NormsConsumer();
      final DocInverter docInverter = new DocInverter(documentsWriterPerThread.docState, termsHash, normsWriter);
      final StoredFieldsConsumer storedFields = new TwoStoredFieldsConsumers(
                                                      new StoredFieldsProcessor(documentsWriterPerThread),
                                                      new DocValuesProcessor(documentsWriterPerThread.bytesUsed));
      return new DocFieldProcessor(documentsWriterPerThread, docInverter, storedFields);
    }
  };

