Lucene: a getting-started example

Apache Lucene is a high-performance, full-featured text search engine library. Here's a simple example of how to use Lucene for indexing and searching (using JUnit to check that the results are what we expect):

    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

    // Store the index in memory:
    Directory directory = new RAMDirectory();
    // To store an index on disk, use this instead:
    //Directory directory = FSDirectory.open(new File("/tmp/testindex"));
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
    IndexWriter iwriter = new IndexWriter(directory, config);
    Document doc = new Document();
    String text = "This is the text to be indexed.";
    doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
    iwriter.addDocument(doc);
    iwriter.close();
    
    // Now search the index:
    DirectoryReader ireader = DirectoryReader.open(directory);
    IndexSearcher isearcher = new IndexSearcher(ireader);
    // Parse a simple query that searches for "text":
    QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "fieldname", analyzer);
    Query query = parser.parse("text");
    ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
    assertEquals(1, hits.length);
    // Iterate through the results:
    for (int i = 0; i < hits.length; i++) {
      Document hitDoc = isearcher.doc(hits[i].doc);
      assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
    }
    ireader.close();
    directory.close();
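
For reference, the snippet above is a fragment; to compile it you need roughly the following imports (Lucene 4.x package layout) plus the lucene-core, lucene-analyzers-common and lucene-queryparser jars that the demo commands below also use:

    import static org.junit.Assert.assertEquals;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;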


The Lucene API is divided into several packages.

To use Lucene, an application should:
  1. Create Documents by adding Fields;
  2. Create an IndexWriter and add documents to it with addDocument();
  3. Call QueryParser.parse() to build a query from a string; and
  4. Create an IndexSearcher and pass the query to its search() method.
Some simple examples of code that does this are:
  •  IndexFiles.java creates an index for all the files contained in a directory.
  •  SearchFiles.java prompts for queries and searches an index.
To demonstrate these, try something like:
> java -cp lucene-core.jar:lucene-demo.jar:lucene-analyzers-common.jar org.apache.lucene.demo.IndexFiles -index index -docs rec.food.recipes/soups
adding rec.food.recipes/soups/abalone-chowder
  [ ... ]

> java -cp lucene-core.jar:lucene-demo.jar:lucene-queryparser.jar:lucene-analyzers-common.jar org.apache.lucene.demo.SearchFiles
Query: chowder
Searching for: chowder
34 total matching documents
1. rec.food.recipes/soups/spam-chowder
  [ ... thirty-four documents contain the word "chowder" ... ]

Query: "clam chowder" AND Manhattan
Searching for: +"clam chowder" +manhattan
2 total matching documents
1. rec.food.recipes/soups/clam-chowder
  [ ... two documents contain the phrase "clam chowder" and the word "manhattan" ... ]
    [ Note: "+" and "-" are canonical, but "AND", "OR" and "NOT" may be used. ]




public final class Document
extends Object
implements Iterable<IndexableField>

Documents are the unit of indexing and search. A Document is a set of fields. Each field has a name and a textual value. A field may be stored with the document, in which case it is returned with search hits on the document. Thus each document should typically contain one or more stored fields which uniquely identify it.
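
A minimal sketch of that idea (the field names "path" and "contents" are only illustrative): a stored, non-tokenized identifier field next to an indexed full-text field.

    Document doc = new Document();
    // Stored and indexed as a single token: suitable as a unique identifier.
    doc.add(new StringField("path", "/docs/readme.txt", Field.Store.YES));
    // Tokenized, indexed and stored: the full-text content to search.
    doc.add(new Field("contents", "This is the text to be indexed.", TextField.TYPE_STORED));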

public final void add(IndexableField field)

Adds a field to a document. Several fields may be added with the same name. In this case, if the fields are indexed, their text is treated as though appended for the purposes of search.

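A small sketch of the same-name case (the field name "contents" is again just an example):

    Document doc = new Document();
    doc.add(new TextField("contents", "first chapter of the book", Field.Store.YES));
    doc.add(new TextField("contents", "second chapter of the book", Field.Store.YES));
    // For search purposes the two values behave as if they had been appended,
    // so a query such as contents:chapter matches this single document.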

Note that add, like the removeField(s) methods, only makes sense prior to adding a document to an index. These methods cannot be used to change the content of an existing index! In order to achieve this, a document has to be deleted from the index and a new, changed version of that document has to be added.
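
As a sketch, IndexWriter.updateDocument() wraps that delete-then-add cycle in one call; here iwriter is assumed to be the IndexWriter from the first example and the hypothetical "path" field is assumed to uniquely identify the document:

    Document newVersion = new Document();
    newVersion.add(new StringField("path", "/docs/readme.txt", Field.Store.YES));
    newVersion.add(new Field("contents", "This is the corrected text.", TextField.TYPE_STORED));
    // Deletes every document containing the term path:/docs/readme.txt,
    // then adds the new version.
    iwriter.updateDocument(new Term("path", "/docs/readme.txt"), newVersion);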



public class IndexWriter
extends Object
implements Closeable, TwoPhaseCommit
An IndexWriter creates and maintains an index.

The IndexWriterConfig.OpenMode option on IndexWriterConfig.setOpenMode(OpenMode) determines whether a new index is created, or whether an existing index is opened. Note that you can open an index with IndexWriterConfig.OpenMode.CREATE even while readers are using the index. The old readers will continue to search the "point in time" snapshot they had opened, and won't see the newly created index until they re-open. If IndexWriterConfig.OpenMode.CREATE_OR_APPEND is used, IndexWriter will create a new index if there is not already an index at the provided path and otherwise open the existing index.
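
A small configuration sketch, reusing the analyzer and directory from the first example (the variable name writer is new here):

    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
    // CREATE replaces any existing index at this location;
    // CREATE_OR_APPEND (the default) appends if an index exists, otherwise creates one.
    config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
    IndexWriter writer = new IndexWriter(directory, config);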

In either case, documents are added with addDocument and removed with deleteDocuments(Term) or deleteDocuments(Query). A document can be updated with updateDocument (which just deletes and then adds the entire document). When finished adding, deleting and updating documents, close should be called.
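
A fragment-style sketch of the deletion calls, assuming writer is an open IndexWriter and the field names are again only illustrative:

    // Delete every document whose "path" field contains this exact term ...
    writer.deleteDocuments(new Term("path", "/docs/obsolete.txt"));
    // ... or everything that matches an arbitrary query.
    writer.deleteDocuments(new TermQuery(new Term("contents", "outdated")));
    // Make the changes durable and release the write lock.
    writer.close();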

These changes are buffered in memory and periodically flushed to the Directory (during the above method calls). A flush is triggered when there are enough added documents since the last flush. Flushing is triggered either by RAM usage of the documents (see IndexWriterConfig.setRAMBufferSizeMB(double)) or the number of added documents (see IndexWriterConfig.setMaxBufferedDocs(int)). The default is to flush when RAM usage hits IndexWriterConfig.DEFAULT_RAM_BUFFER_SIZE_MB MB. For best indexing speed you should flush by RAM usage with a large RAM buffer. Additionally, if IndexWriter reaches the configured number of buffered deletes (see IndexWriterConfig.setMaxBufferedDeleteTerms(int)), the deleted terms and queries are flushed and applied to existing segments. In contrast to the other flush options IndexWriterConfig.setRAMBufferSizeMB(double) and IndexWriterConfig.setMaxBufferedDocs(int), deleted terms won't trigger a segment flush. Note that flushing just moves the internal buffered state in IndexWriter into the index, but these changes are not visible to IndexReader until either commit() or close() is called. A flush may also trigger one or more segment merges, which by default run with a background thread so as not to block the addDocument calls (see below for changing the MergeScheduler).
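
A sketch of tuning those flush triggers on IndexWriterConfig (64 MB is an arbitrary example value):

    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
    // Flush when the buffered documents use roughly 64 MB of RAM ...
    config.setRAMBufferSizeMB(64.0);
    // ... and turn off the document-count trigger so only RAM usage decides.
    config.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);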

Opening an IndexWriter creates a lock file for the directory in use. Trying to open another IndexWriter on the same directory will lead to a LockObtainFailedException. The LockObtainFailedException is also thrown if an IndexReader on the same directory is used to delete documents from the index.
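
A sketch of what that failure looks like in code, assuming directory is already held open by another writer:

    try {
      IndexWriter second = new IndexWriter(directory,
          new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer));
      second.close();
    } catch (LockObtainFailedException e) {
      // Another IndexWriter (or an IndexReader deleting documents)
      // already holds the write lock for this directory.
    } catch (IOException e) {
      // Any other I/O problem while opening the index.
    }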

Expert: IndexWriter allows an optional IndexDeletionPolicy implementation to be specified. You can use this to control when prior commits are deleted from the index. The default policy is KeepOnlyLastCommitDeletionPolicy, which removes all prior commits as soon as a new commit is done (this matches behavior before 2.2). Creating your own policy can allow you to explicitly keep previous "point in time" commits alive in the index for some time, to allow readers to refresh to the new commit without having the old commit deleted out from under them. This is necessary on filesystems like NFS that do not support "delete on last close" semantics, which Lucene's "point in time" search normally relies on.
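
A sketch of where such a policy is plugged in; the default is spelled out here only to show the hook (config is assumed to be an IndexWriterConfig as above):

    // Keep only the most recent commit (the default behaviour); a custom
    // IndexDeletionPolicy supplied here could instead keep older
    // "point in time" commits alive for readers on NFS-like filesystems.
    config.setIndexDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());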

Expert: IndexWriter allows you to separately change the MergePolicy and the MergeScheduler. The MergePolicy is invoked whenever there are changes to the segments in the index. Its role is to select which merges to do, if any, and return a MergePolicy.MergeSpecification describing the merges. The default is LogByteSizeMergePolicy. Then, the MergeScheduler is invoked with the requested merges and it decides when and how to run the merges. The default is ConcurrentMergeScheduler.
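
A sketch of setting both explicitly, using the default implementations named above (config as before; the merge factor value is arbitrary):

    LogByteSizeMergePolicy mergePolicy = new LogByteSizeMergePolicy();
    mergePolicy.setMergeFactor(10);                 // how many segments are merged at a time
    config.setMergePolicy(mergePolicy);
    config.setMergeScheduler(new ConcurrentMergeScheduler());  // run merges on background threads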

NOTE: if you hit an OutOfMemoryError then IndexWriter will quietly record this fact and block all future segment commits. This is a defensive measure in case any internal state (buffered documents and deletions) were corrupted. Any subsequent calls to commit() will throw an IllegalStateException. The only course of action is to call close(), which internally will call rollback(), to undo any changes to the index since the last commit. You can also just call rollback() directly.

NOTE: IndexWriter instances are completely thread safe, meaning multiple threads can call any of its methods, concurrently. If your application requires external synchronization, you should not synchronize on the IndexWriter instance as this may cause deadlock; use your own (non-Lucene) objects instead.

NOTE: If you call Thread.interrupt() on a thread that's within IndexWriter, IndexWriter will try to catch this (eg, if it's in a wait() or Thread.sleep()), and will then throw the unchecked exception ThreadInterruptedException and clear the interrupt status on the thread.



// Document.add(): the parameter type IndexableField is an interface;
// in practice a concrete class such as Field (which implements all of its methods) is passed in.
 public final void add(IndexableField field) {
    fields.add(field);
  }
 // Private member of Document: the fields are kept in an ArrayList
 private final List<IndexableField> fields = new ArrayList<IndexableField>();
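
 Because Document implements Iterable<IndexableField> (see the class declaration above), the fields in that list can be walked directly; a tiny sketch, assuming doc is a Document:

 for (IndexableField f : doc) {
   // stringValue() is null for binary or numeric fields
   System.out.println(f.name() + " = " + f.stringValue());
 }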
 
 // The IndexableField interface
 public interface IndexableField {

  /** Field name */
  public String name();

  /** {@link IndexableFieldType} describing the properties
   * of this field. */
  public IndexableFieldType fieldType();
  
  /** 
   * Returns the field's index-time boost.
   * <p>
   * Only fields can have an index-time boost, if you want to simulate
   * a "document boost", then you must pre-multiply it across all the
   * relevant fields yourself. 
   * <p>The boost is used to compute the norm factor for the field.  By
   * default, in the {@link Similarity#computeNorm(FieldInvertState)} method, 
   * the boost value is multiplied by the length normalization factor and then
   * rounded by {@link DefaultSimilarity#encodeNormValue(float)} before it is stored in the
   * index.  One should attempt to ensure that this product does not overflow
   * the range of that encoding.
   * <p>
   * It is illegal to return a boost other than 1.0f for a field that is not
   * indexed ({@link IndexableFieldType#indexed()} is false) or omits normalization values
   * ({@link IndexableFieldType#omitNorms()} returns true).
   *
   * @see Similarity#computeNorm(FieldInvertState)
   * @see DefaultSimilarity#encodeNormValue(float)
   */
  public float boost();

  /** Non-null if this field has a binary value */
  public BytesRef binaryValue();

  /** Non-null if this field has a string value */
  public String stringValue();

  /** Non-null if this field has a Reader value */
  public Reader readerValue();

  /** Non-null if this field has a numeric value */
  public Number numericValue();

  /**
   * Creates the TokenStream used for indexing this field.  If appropriate,
   * implementations should use the given Analyzer to create the TokenStreams.
   *
   * @param analyzer Analyzer that should be used to create the TokenStreams from
   * @return TokenStream value for indexing the document.  Should always return
   *         a non-null value if the field is to be indexed
   * @throws IOException Can be thrown while creating the TokenStream
   */
  public TokenStream tokenStream(Analyzer analyzer) throws IOException;
}

// IndexWriter: addDocument(Iterable, Analyzer) simply delegates to updateDocument with a null delete term
public void addDocument(Iterable<? extends IndexableField> doc, Analyzer analyzer) throws IOException {
    updateDocument(null, doc, analyzer);
  } 


 public TextField(String name, Reader reader) {
    super(name, reader, TYPE_NOT_STORED);
  }


 // DocumentsWriterPerThread constructor

public DocumentsWriterPerThread(Directory directory, DocumentsWriter parent,
      FieldInfos.Builder fieldInfos, IndexingChain indexingChain) {
    this.directoryOrig = directory;
    this.directory = new TrackingDirectoryWrapper(directory);
    this.parent = parent;
    this.fieldInfos = fieldInfos;
    this.writer = parent.indexWriter;
    this.infoStream = parent.infoStream;
    this.codec = parent.codec;
    this.docState = new DocState(this, infoStream);
    this.docState.similarity = parent.indexWriter.getConfig().getSimilarity();
    bytesUsed = Counter.newCounter();
    byteBlockAllocator = new DirectTrackingAllocator(bytesUsed);
    pendingDeletes = new BufferedDeletes();
    intBlockAllocator = new IntBlockAllocator(bytesUsed);
    initialize();
    // this should be the last call in the ctor 
    // it really sucks that we need to pull this within the ctor and pass this ref to the chain!
    consumer = indexingChain.getChain(this);
  }


// The default indexing chain, a static field inside DocumentsWriterPerThread
static final IndexingChain defaultIndexingChain = new IndexingChain() {

    @Override
    DocConsumer getChain(DocumentsWriterPerThread documentsWriterPerThread) {
      /*
      This is the current indexing chain:

      DocConsumer / DocConsumerPerThread
        --> code: DocFieldProcessor
          --> DocFieldConsumer / DocFieldConsumerPerField
            --> code: DocFieldConsumers / DocFieldConsumersPerField
              --> code: DocInverter / DocInverterPerField
                --> InvertedDocConsumer / InvertedDocConsumerPerField
                  --> code: TermsHash / TermsHashPerField
                    --> TermsHashConsumer / TermsHashConsumerPerField
                      --> code: FreqProxTermsWriter / FreqProxTermsWriterPerField
                      --> code: TermVectorsTermsWriter / TermVectorsTermsWriterPerField
                --> InvertedDocEndConsumer / InvertedDocConsumerPerField
                  --> code: NormsConsumer / NormsConsumerPerField
          --> StoredFieldsConsumer
            --> TwoStoredFieldConsumers
              -> code: StoredFieldsProcessor
              -> code: DocValuesProcessor
    */

    // Build up indexing chain:

      final TermsHashConsumer termVectorsWriter = new TermVectorsConsumer(documentsWriterPerThread);
      final TermsHashConsumer freqProxWriter = new FreqProxTermsWriter();

      final InvertedDocConsumer termsHash = new TermsHash(documentsWriterPerThread, freqProxWriter, true,
                                                          new TermsHash(documentsWriterPerThread, termVectorsWriter, false, null));
      final NormsConsumer normsWriter = new NormsConsumer();
      final DocInverter docInverter = new DocInverter(documentsWriterPerThread.docState, termsHash, normsWriter);
      final StoredFieldsConsumer storedFields = new TwoStoredFieldsConsumers(
                                                      new StoredFieldsProcessor(documentsWriterPerThread),
                                                      new DocValuesProcessor(documentsWriterPerThread.bytesUsed));
      return new DocFieldProcessor(documentsWriterPerThread, docInverter, storedFields);
    }
  };

