Lucene Study Notes

[b]Apache Lucene is a high-performance, full-featured text search engine library. [/b]

[b]1. Here's a simple example of how to use Lucene for indexing and searching[/b] (using JUnit to check that the results are what we expect):


import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

/**
 * A minimal indexing and search example.
 *
 * @since V2.0
 * @author David.Wei
 * @date 2008-4-16
 */
public class Test {

    public static void main(String[] args) throws Exception {

        Analyzer analyzer = new StandardAnalyzer();

        // Store the index in memory:
        Directory directory = new RAMDirectory();
        // To store an index on disk, use this instead:
        // Directory directory = FSDirectory.getDirectory("/tmp/testindex");
        IndexWriter iwriter = new IndexWriter(directory, analyzer, true);
        iwriter.setMaxFieldLength(25000);

        Document doc = new Document();
        String text = "This is the text to be indexed.";
        doc.add(new Field("fieldname", text, Field.Store.YES,
                Field.Index.TOKENIZED));
        iwriter.addDocument(doc);
        iwriter.optimize();
        iwriter.close();

        // Now search the index:
        IndexSearcher isearcher = new IndexSearcher(directory);
        // Parse a simple query that searches for "text":
        QueryParser parser = new QueryParser("fieldname", analyzer);
        Query query = parser.parse("text");
        Hits hits = isearcher.search(query);
        // assertEquals(1, hits.length());
        // Iterate through the results; each stored "fieldname" value
        // should be "This is the text to be indexed.":
        for (int i = 0; i < hits.length(); i++) {
            Document hitDoc = hits.doc(i);
            System.out.println(hitDoc.get("fieldname"));
        }
        isearcher.close();
        directory.close();
    }
}
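
The parenthetical note above mentions JUnit: in an actual test, the println loop becomes assertions. Here is a minimal sketch, assuming JUnit 3 on the classpath; the class name LuceneSmokeTest is made up, but the Lucene calls are exactly those of the example above:

import junit.framework.TestCase;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

// Hypothetical test class; same indexing and searching steps as above.
public class LuceneSmokeTest extends TestCase {

    public void testIndexAndSearch() throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        Directory directory = new RAMDirectory();

        IndexWriter iwriter = new IndexWriter(directory, analyzer, true);
        Document doc = new Document();
        doc.add(new Field("fieldname", "This is the text to be indexed.",
                Field.Store.YES, Field.Index.TOKENIZED));
        iwriter.addDocument(doc);
        iwriter.close();

        IndexSearcher isearcher = new IndexSearcher(directory);
        Query query = new QueryParser("fieldname", analyzer).parse("text");
        Hits hits = isearcher.search(query);

        // Exactly one document matches, and its stored field round-trips:
        assertEquals(1, hits.length());
        assertEquals("This is the text to be indexed.",
                hits.doc(0).get("fieldname"));

        isearcher.close();
        directory.close();
    }
}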


[b]2. The Lucene API is divided into several packages:[/b]
[list]
[*][u]org.apache.lucene.analysis[/u] defines an abstract Analyzer API for converting text from a java.io.Reader into a TokenStream, an enumeration of Tokens. A TokenStream is composed by applying TokenFilters to the output of a Tokenizer. A few simple implementations are provided, including StopAnalyzer and the grammar-based StandardAnalyzer (see the token-stream sketch after this list).
[*][u]org.apache.lucene.document[/u] provides a simple Document class. A document is simply a set of named Fields, whose values may be strings or instances of java.io.Reader.
[*][u]org.apache.lucene.index[/u] provides two primary classes: IndexWriter, which creates and adds documents to indices; and IndexReader, which accesses the data in the index.
[*][u]org.apache.lucene.search[/u] provides data structures to represent queries (TermQuery for individual words, PhraseQuery for phrases, and BooleanQuery for boolean combinations of queries) and the abstract Searcher which turns queries into Hits. IndexSearcher implements search over a single IndexReader.
[*][u]org.apache.lucene.queryParser[/u] uses JavaCC to implement a QueryParser.
[*][u]org.apache.lucene.store[/u] defines an abstract class for storing persistent data, the Directory, a collection of named files written by an IndexOutput and read by an IndexInput. Two implementations are provided, FSDirectory, which uses a file system directory to store files, and RAMDirectory which implements files as memory-resident data structures.
[*][u]org.apache.lucene.util[/u] contains a few handy data structures, e.g., BitVector and PriorityQueue.
[/list]
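
To make the analysis package concrete, here is a small sketch that prints the tokens StandardAnalyzer produces. It assumes the same Lucene 2.x-era API as the listings in these notes (TokenStream.next() returning a Token, and Token.termText(); later versions replaced this with an attribute-based API):

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Prints the tokens StandardAnalyzer produces for a sample string.
public class TokenDemo {

    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        TokenStream stream = analyzer.tokenStream("fieldname",
                new StringReader("This is the text to be indexed."));

        // In this era of the API, next() returns null when exhausted.
        for (Token token = stream.next(); token != null; token = stream.next()) {
            // StandardAnalyzer lower-cases tokens and drops English stop
            // words such as "this", "is", "the", "to", "be", so only
            // "text" and "indexed" are printed here.
            System.out.println(token.termText());
        }
        stream.close();
    }
}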

[b]3. To use Lucene, an application should:[/b]
[list]
[*]Create Documents by adding Fields;
[*]Create an IndexWriter and add documents to it with addDocument();
[*]Call QueryParser.parse() to build a query from a string (queries can also be built programmatically; see the sketch after this list); and
[*]Create an IndexSearcher and pass the query to its search() method.
[/list]
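
Step three can also skip the parser: the query classes from org.apache.lucene.search compose directly. A minimal sketch against the same 2.x API; note that TermQuery bypasses the analyzer, so the term text must already match what indexing produced (e.g. lower-cased by StandardAnalyzer):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Builds the query <+fieldname:text -fieldname:binary> without a parser.
public class QueryDemo {

    public static void main(String[] args) {
        TermQuery must = new TermQuery(new Term("fieldname", "text"));
        TermQuery mustNot = new TermQuery(new Term("fieldname", "binary"));

        BooleanQuery query = new BooleanQuery();
        query.add(must, BooleanClause.Occur.MUST);
        query.add(mustNot, BooleanClause.Occur.MUST_NOT);

        // toString() renders the standard query syntax.
        System.out.println(query.toString()); // +fieldname:text -fieldname:binary
    }
}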

[b]4. Some simple examples of code that does this:[/b]
[list]
[*][u]FileDocument.java[/u] contains code to create a Document for a file.
[*][u]IndexFiles.java[/u] creates an index for all the files contained in a directory.
[*][u]DeleteFiles.java[/u] deletes some of these files from the index.
[*][u]SearchFiles.java[/u] prompts for queries and searches an index.
[/list]

[b]The code in detail:[/b]

(1) FileDocument.java

/**
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

import java.io.File;
import java.io.FileReader;

import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

/** A utility for making Lucene Documents from a File. */

public class FileDocument {
/**
* Makes a document for a File.
* <p>
* The document has three fields:
* <ul>
* <li><code>path</code>--containing the pathname of the file, as a
* stored, untokenized field;
* <li><code>modified</code>--containing the last modified date of the
* file as a field as created by <a
* href="lucene.document.DateTools.html">DateTools</a>; and
* <li><code>contents</code>--containing the full contents of the file,
* as a Reader field.
* </ul>
*/
public static Document Document(File f)
throws java.io.FileNotFoundException {

// make a new, empty document
Document doc = new Document();

// Add the path of the file as a field named "path". Use a field that is
// indexed (i.e. searchable), but don't tokenize the field into words.
doc.add(new Field("path", f.getPath(), Field.Store.YES,
Field.Index.UN_TOKENIZED));

// Add the last modified date of the file as a field named "modified".
// Use a field that is indexed (i.e. searchable), but don't tokenize
// the field into words.
doc.add(new Field("modified", DateTools.timeToString(f.lastModified(),
DateTools.Resolution.MINUTE), Field.Store.YES,
Field.Index.UN_TOKENIZED));

// Add the contents of the file to a field named "contents". Specify a
// Reader, so that the text of the file is tokenized and indexed, but
// not stored. Note that FileReader expects the file to be in the
// system's default encoding. If that's not the case, searching for
// special characters will fail.
doc.add(new Field("contents", new FileReader(f)));

// return the document
return doc;
}

private FileDocument() {
}
}
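
Because the "modified" field stores DateTools strings, which sort lexicographically in time order, date filtering reduces to an ordinary range query. A sketch using the 2.x-era RangeQuery class (later versions renamed it TermRangeQuery); the one-day window is illustrative:

import org.apache.lucene.document.DateTools;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.RangeQuery;

// Matches documents whose "modified" field falls inside the last day.
public class ModifiedRangeDemo {

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        String from = DateTools.timeToString(now - 24L * 60 * 60 * 1000,
                DateTools.Resolution.MINUTE);
        String to = DateTools.timeToString(now, DateTools.Resolution.MINUTE);

        // DateTools strings sort by time, so an inclusive range query on
        // the untokenized "modified" field selects the desired window.
        RangeQuery query = new RangeQuery(new Term("modified", from),
                new Term("modified", to), true);
        System.out.println(query);
    }
}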



(2) IndexFiles.java

/**
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Date;

/** Index all text files under a directory. */
public class IndexFiles {

private IndexFiles() {
}

static final File INDEX_DIR = new File("index");

/** Index all text files under a directory. */
public static void main(String[] args) {
String usage = "java org.apache.lucene.demo.IndexFiles <root_directory>";
if (args.length == 0) {
System.err.println("Usage: " + usage);
System.exit(1);
}

if (INDEX_DIR.exists()) {
System.out.println("Cannot save index to '" + INDEX_DIR
+ "' directory, please delete it first");
System.exit(1);
}

final File docDir = new File(args[0]);
if (!docDir.exists() || !docDir.canRead()) {
System.out
.println("Document directory '"
+ docDir.getAbsolutePath()
+ "' does not exist or is not readable, please check the path");
System.exit(1);
}

Date start = new Date();
try {
IndexWriter writer = new IndexWriter(INDEX_DIR,
new StandardAnalyzer(), true);
System.out.println("Indexing to directory '" + INDEX_DIR + "'...");
indexDocs(writer, docDir);
System.out.println("Optimizing...");
writer.optimize();
writer.close();

Date end = new Date();
System.out.println(end.getTime() - start.getTime()
+ " total milliseconds");

} catch (IOException e) {
System.out.println(" caught a " + e.getClass()
+ "\n with message: " + e.getMessage());
}
}

static void indexDocs(IndexWriter writer, File file) throws IOException {
// do not try to index files that cannot be read
if (file.canRead()) {
if (file.isDirectory()) {
String[] files = file.list();
// an IO error could occur
if (files != null) {
for (int i = 0; i < files.length; i++) {
indexDocs(writer, new File(file, files[i]));
}
}
} else {
System.out.println("adding " + file);
try {
writer.addDocument(FileDocument.Document(file));
}
// at least on windows, some temporary files raise this
// exception with an "access denied" message
// checking if the file can be read doesn't help
catch (FileNotFoundException fnfe) {
;
}
}
}
}

}
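
IndexFiles deliberately refuses to overwrite an existing index. To add documents to an index already on disk, the IndexWriter constructor's third argument (create) can be set to false. A minimal sketch, reusing FileDocument from listing (1) and assuming both classes sit in the same package:

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

// Appends one file's document to an existing index instead of bailing out.
public class AppendToIndex {

    public static void main(String[] args) throws Exception {
        File indexDir = new File("index");
        // The third constructor argument is "create": false opens the
        // existing index and appends, rather than overwriting it.
        boolean create = !indexDir.exists();
        IndexWriter writer = new IndexWriter(indexDir,
                new StandardAnalyzer(), create);

        writer.addDocument(FileDocument.Document(new File(args[0])));
        writer.optimize();
        writer.close();
    }
}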



(3) DeleteFiles.java

/**
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

/** Deletes from an index all documents containing the given term. */
public class DeleteFiles {

private DeleteFiles() {
} // prevent instantiation

/** Deletes from an index all documents containing the given term. */
public static void main(String[] args) {
String usage = "java org.apache.lucene.demo.DeleteFiles <unique_term>";
if (args.length == 0) {
System.err.println("Usage: " + usage);
System.exit(1);
}
try {
Directory directory = FSDirectory.getDirectory("index");
IndexReader reader = IndexReader.open(directory);

Term term = new Term("path", args[0]);
int deleted = reader.deleteDocuments(term);

System.out.println("deleted " + deleted + " documents containing "
+ term);

// one can also delete documents by their internal id:

// for (int i = 0; i < reader.maxDoc(); i++) {
// System.out.println("Deleting document with id " + i);
// reader.delete(i);
// }

reader.close();
directory.close();

} catch (Exception e) {
System.out.println(" caught a " + e.getClass()
+ "\n with message: " + e.getMessage());
}
}
}
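
A deletion can be double-checked by re-opening the index and walking the term's postings: unlike docFreq(), which may still count deleted documents until segments are merged, TermDocs iteration skips documents marked deleted. A small sketch under the same 2.x API (the class name is made up):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Re-opens the index and counts live (non-deleted) documents that still
// contain the term, confirming the deletion took effect.
public class VerifyDelete {

    public static void main(String[] args) throws Exception {
        Directory directory = FSDirectory.getDirectory("index");
        IndexReader reader = IndexReader.open(directory);

        // TermDocs iteration skips deleted documents, unlike docFreq().
        TermDocs termDocs = reader.termDocs(new Term("path", args[0]));
        int remaining = 0;
        while (termDocs.next()) {
            remaining++;
        }
        termDocs.close();
        System.out.println(remaining + " live documents still contain the term");

        reader.close();
        directory.close();
    }
}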



(4) SearchFiles.java

/**
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;

/** Simple command-line based search demo. */
public class SearchFiles {

/**
* Use the norms from one field for all fields. Norms are read into memory,
* using a byte of memory per document per searched field. This can cause
* search of large collections with a large number of fields to run out of
* memory. If all of the fields contain only a single token, then the norms
* are all identical, and a single norm vector may be shared.
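* For example (illustrative arithmetic, not from the original source):
* searching 20 fields over 10 million documents costs roughly
* 10,000,000 x 20 bytes = 200 MB for norms alone, while sharing one
* field's norm vector cuts that to about 10 MB.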
*/
private static class OneNormsReader extends FilterIndexReader {
private String field;

public OneNormsReader(IndexReader in, String field) {
super(in);
this.field = field;
}

public byte[] norms(String field) throws IOException {
return in.norms(this.field);
}
}

private SearchFiles() {
}

/** Simple command-line based search demo. */
public static void main(String[] args) throws Exception {
String usage = "Usage: java org.apache.lucene.demo.SearchFiles [-index dir] [-field f] [-repeat n] [-queries file] [-raw] [-norms field]";
if (args.length > 0
&& ("-h".equals(args[0]) || "-help".equals(args[0]))) {
System.out.println(usage);
System.exit(0);
}

String index = "index";
String field = "contents";
String queries = null;
int repeat = 0;
boolean raw = false;
String normsField = null;

for (int i = 0; i < args.length; i++) {
if ("-index".equals(args[i])) {
index = args[i + 1];
i++;
} else if ("-field".equals(args[i])) {
field = args[i + 1];
i++;
} else if ("-queries".equals(args[i])) {
queries = args[i + 1];
i++;
} else if ("-repeat".equals(args[i])) {
repeat = Integer.parseInt(args[i + 1]);
i++;
} else if ("-raw".equals(args[i])) {
raw = true;
} else if ("-norms".equals(args[i])) {
normsField = args[i + 1];
i++;
}
}

IndexReader reader = IndexReader.open(index);

if (normsField != null)
reader = new OneNormsReader(reader, normsField);

Searcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer();

BufferedReader in = null;
if (queries != null) {
in = new BufferedReader(new FileReader(queries));
} else {
in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
}
QueryParser parser = new QueryParser(field, analyzer);
while (true) {
if (queries == null) // prompt the user
System.out.println("Enter query: ");

String line = in.readLine();

if (line == null)
break;

line = line.trim();
if (line.length() == 0)
break;

Query query = parser.parse(line);
System.out.println("Searching for: " + query.toString(field));

Hits hits = searcher.search(query);

if (repeat > 0) { // repeat & time as benchmark
Date start = new Date();
for (int i = 0; i < repeat; i++) {
hits = searcher.search(query);
}
Date end = new Date();
System.out.println("Time: " + (end.getTime() - start.getTime())
+ "ms");
}

System.out.println(hits.length() + " total matching documents");

final int HITS_PER_PAGE = 10;
for (int start = 0; start < hits.length(); start += HITS_PER_PAGE) {
int end = Math.min(hits.length(), start + HITS_PER_PAGE);
for (int i = start; i < end; i++) {

if (raw) { // output raw format
System.out.println("doc=" + hits.id(i) + " score="
+ hits.score(i));
continue;
}

Document doc = hits.doc(i);
String path = doc.get("path");
if (path != null) {
System.out.println((i + 1) + ". " + path);
String title = doc.get("title");
if (title != null) {
System.out.println(" Title: " + doc.get("title"));
}
} else {
System.out.println((i + 1) + ". "
+ "No path for this document");
}
}

if (queries != null) // non-interactive
break;

if (hits.length() > end) {
System.out.println("more (y/n) ? ");
line = in.readLine();
if (line == null || line.length() == 0 || line.charAt(0) == 'n')
break;
}
}
}
reader.close();
}
}
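
Hits pages results lazily and was deprecated in later 2.x releases. The same search can be written against TopDocs, which returns the top n results directly; a sketch assuming the search(Query, Filter, int) signature, where the filter may be null:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

// Top-10 search without the Hits class.
public class TopDocsDemo {

    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("index");
        IndexSearcher searcher = new IndexSearcher(reader);
        Query query = new QueryParser("contents",
                new StandardAnalyzer()).parse(args[0]);

        // null filter; 10 is the maximum number of hits to collect.
        TopDocs topDocs = searcher.search(query, null, 10);
        System.out.println(topDocs.totalHits + " total matching documents");

        for (int i = 0; i < topDocs.scoreDocs.length; i++) {
            ScoreDoc sd = topDocs.scoreDocs[i];
            Document doc = searcher.doc(sd.doc);
            System.out.println((i + 1) + ". " + doc.get("path")
                    + " (score=" + sd.score + ")");
        }

        searcher.close();
        reader.close();
    }
}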
