A Short Introduction to Lucene

Dreamer who

Lucene is an extremely rich and powerful full-text search library written in Java. You can use Lucene to provide full-text indexing across both database objects and documents in various formats (Microsoft Office documents, PDF, HTML, text, and so on). In this tutorial, we'll go through the basics of using Lucene to add full-text search functionality to a fairly typical J2EE application: an online accommodation database. The main business object is the Hotel class. In this tutorial, a Hotel has a unique identifier, a name, a city, and a description.

Roughly, supporting full-text search using Lucene requires two steps: (1) creating a Lucene index on the documents and/or database objects and (2) parsing the user query and looking up the prebuilt index to answer the query. In the first part of this tutorial, we learn how to create a Lucene index. In the second part, we learn how to use the prebuilt index to answer user queries.

For your convenience, all of the code for this article's Lucene demo is included in the lucene-tutorial.zip file. In this demo, the class Indexer in src/lucene/demo/search/Indexer.java is responsible for creating the index. The class SearchEngine in src/lucene/demo/search/SearchEngine.java is responsible for supporting user queries. The class Main in src/lucene/demo/Main.java contains test code that builds a Lucene index using a small dataset (the actual data is provided by the HotelDatabase class stored in src/lucene/demo/business/HotelDatabase.java) and performs a simple keyword query on the data using the index. Briefly go over the two Java source files, Indexer.java and SearchEngine.java, to familiarize yourself with the overall structure of the code.

1. Creating an Index

The first step in implementing full-text searching with Lucene is to build an index. Here's a simple attempt to diagram how the Lucene classes go together when you create an index:

Index
    Document 1
        Field A (name/value)
        Field B (name/value)
    Document 2
        Field A (name/value)
        Field B (name/value)

At the heart of Lucene is an Index. You pump your data into the Index, then do searches on the Index to get results out. Document objects are stored in the Index, and it is your job to "convert" your data into Document objects and store them in the Index. That is, you read in each data file (or Web document, database tuple, or whatever), instantiate a Document for it, break down the data into chunks, and store the chunks in the Document as Field objects (name/value pairs). When you're done building a Document, you write it to the Index using the IndexWriter. Now let us get into the details of how this is done.
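To make the Index / Document / Field relationship concrete, here is a minimal sketch in plain Java that uses no Lucene classes at all. The class ToyIndex and its methods are hypothetical names invented for this illustration, and the structure is a deliberately simplified assumption, not Lucene's actual on-disk format. It only shows the general idea: documents are sets of name/value fields, and the index maps each term back to the documents that contain it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Toy model of an index: NOT Lucene's implementation, only the idea behind it. */
public class ToyIndex {
    // Each stored document is a set of name/value fields.
    private final List<Map<String, String>> documents = new ArrayList<>();
    // Inverted index: "field:term" -> sorted IDs of the documents containing that term.
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    /** Adds a document and returns its ID (its position in the document list). */
    public int addDocument(Map<String, String> fields) {
        int docId = documents.size();
        documents.add(fields);
        for (Map.Entry<String, String> field : fields.entrySet()) {
            // Break the field value into lowercase word tokens.
            for (String term : field.getValue().toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(field.getKey() + ":" + term,
                            key -> new TreeSet<>()).add(docId);
                }
            }
        }
        return docId;
    }

    /** Returns the IDs of the documents whose given field contains the term. */
    public List<Integer> search(String field, String term) {
        TreeSet<Integer> ids = postings.get(field + ":" + term.toLowerCase(Locale.ROOT));
        return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
    }

    /** Returns the stored fields of a document. */
    public Map<String, String> getDocument(int docId) {
        return documents.get(docId);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        Map<String, String> doc1 = new HashMap<>();
        doc1.put("name", "Grand Hotel");
        doc1.put("description", "A beautiful hotel");
        Map<String, String> doc2 = new HashMap<>();
        doc2.put("name", "Sea View Inn");
        doc2.put("description", "A quiet inn near the beach");
        index.addDocument(doc1);
        index.addDocument(doc2);
        System.out.println(index.search("description", "hotel")); // prints [0]
        System.out.println(index.search("description", "a"));     // prints [0, 1]
    }
}
```

Looking up a term is a single map access, which is why searching a prebuilt index is much cheaper than scanning every document at query time.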

1.1 IndexWriter Class: Creating Index

To create an index, the first thing you need to do is create an IndexWriter object. The IndexWriter object is used to create the index and to add new index entries (i.e., Documents) to this index. You can create an IndexWriter as follows:

Directory indexDir = FSDirectory.open(new File("index-directory"));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_2, new StandardAnalyzer());
IndexWriter indexWriter = new IndexWriter(indexDir, config);

Note that the IndexWriter constructor takes two parameters, indexDir and config, which are Directory and IndexWriterConfig objects, respectively. The first parameter, indexDir, specifies the directory in which the Lucene index will be created, which is index-directory in this case. The second parameter specifies the "configuration" of our index: the version of our Lucene library (4.10.2) and the "document analyzer" to be used when Lucene indexes your data. Here, we are using the StandardAnalyzer for this purpose. More details on Lucene analyzers follow shortly.

1.2 Analyzer Class: Parsing the Documents

Most likely, the data that you want to index with Lucene is plain-text English. The job of the Analyzer is to "parse" each field of your data into indexable "tokens" or keywords. Several types of analyzers are provided out of the box. Table 1 shows some of the more interesting ones.

Table 1 Lucene analyzers.

Analyzer	Description
`StandardAnalyzer`	A sophisticated general-purpose analyzer.
`WhitespaceAnalyzer`	A very simple analyzer that just separates tokens using white space.
`StopAnalyzer`	Removes common English words that are not usually useful for indexing.
`SnowballAnalyzer`	An interesting experimental analyzer that works on word roots (a search on rain should also return entries with raining, rained, and so on).

There are even a number of language-specific analyzers, including analyzers for German, Russian, French, Dutch, and others.

It isn't difficult to implement your own analyzer, though the standard ones often do the job well enough. When you create an IndexWriter, you have to specify which Analyzer you will use for the index, as we did before. In our previous example, we used the StandardAnalyzer as the document analyzer.
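To give a feel for what different analyzers do to the same text, the following sketch imitates two of the styles from Table 1 in plain Java. It does not use Lucene's analyzer classes; ToyAnalyzers, its two methods, and the tiny stop-word list are all assumptions made up for this illustration, and real analyzers handle many more cases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy tokenizers that loosely mimic two analyzer styles (not Lucene code). */
public class ToyAnalyzers {
    // A tiny, made-up stop-word list; real lists are longer.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "in", "of", "and"));

    /** Whitespace style: split on white space and keep the tokens unchanged. */
    public static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Stop-word style: split on non-letters, lowercase, and drop stop words. */
    public static List<String> stopWordTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "A beautiful hotel in the heart of Paris.";
        System.out.println(whitespaceTokens(text));
        // prints [A, beautiful, hotel, in, the, heart, of, Paris.]
        System.out.println(stopWordTokens(text));
        // prints [beautiful, hotel, heart, paris]
    }
}
```

Note how the whitespace style keeps the capital letters and the trailing period, so a search for paris would not match the token Paris. under that scheme. The choice of analyzer directly determines which query terms can match.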

1.3 Adding a Document/object to Index

Now you need to index your documents or business objects. To index an object, you use the Lucene Document class, to which you add the fields that you want indexed. As we briefly mentioned before, a Lucene Document is basically a container for a set of indexed fields. This is best illustrated by an example:

Document doc = new Document();
doc.add(new StringField("id", "Hotel-1345", Field.Store.YES));
doc.add(new TextField("description", "A beautiful hotel", Field.Store.YES));

In the above example, we add two fields, "id" and "description", with the respective values "Hotel-1345" and "A beautiful hotel" to the document.

More precisely, to add a field to a document, you create a new instance of the Field class, which can be either a StringField or a TextField (the difference between the two will be explained shortly). A field object takes the following three parameters:

Field name: This is the name of the field. In the above example, they are "id" and "description".
Field value: This is the value of the field. In the above example, they are "Hotel-1345" and "A beautiful hotel". A value can be a String like our example or a Reader if the object to be indexed is a file.
Storage flag: The third parameter specifies whether the actual value of the field needs to be stored in the Lucene index or can be discarded after it is indexed. Storing the value is useful if you need the value later, for example to display it in the search result list or to look up a tuple in a database table. If the value must be stored, use Field.Store.YES. If you don't need to store the value, use Field.Store.NO. (Older Lucene releases also offered Field.Store.COMPRESS for large documents or binary value fields; it is not part of the 4.x API used in this tutorial.)

StringField vs TextField: In the above example, the "id" field contains the ID of the hotel, which is a single atomic value. In contrast, the "description" field contains an English text, which should be parsed (or "tokenized") into a set of words for indexing. Use StringField for a field with an atomic value that should not be tokenized. Use TextField for a field that needs to be tokenized into a set of words.
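The difference between an atomic field and a tokenized field can be sketched without Lucene as follows. The class ToyFieldTypes and its methods are hypothetical, and the word splitting is a crude stand-in for a real analyzer; the point is only to show which terms end up searchable in each case.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Toy contrast between an atomic field and a tokenized field (not Lucene code). */
public class ToyFieldTypes {
    /** Atomic, StringField-like: the whole value is kept as one term. */
    public static Set<String> atomicTerms(String value) {
        return Collections.singleton(value);
    }

    /** Tokenized, TextField-like: the value is split into lowercase words. */
    public static Set<String> tokenizedTerms(String value) {
        Set<String> terms = new LinkedHashSet<>();
        for (String term : value.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(atomicTerms("Hotel-1345"));           // prints [Hotel-1345]
        System.out.println(tokenizedTerms("Hotel-1345"));        // prints [hotel, 1345]
        System.out.println(tokenizedTerms("A beautiful hotel")); // prints [a, beautiful, hotel]
    }
}
```

An identifier that was tokenized could no longer be looked up by its exact value, which is why keys such as "id" are better kept atomic, while free text such as "description" benefits from being split into words.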

For our hotel example, we just want some fairly simple full-text searching. So we add the following fields:

The hotel identifier (or the key to the hotel tuple), so we can retrieve the corresponding hotel object from the database later once we obtain the query result list from the Lucene index.
The hotel name, which we need to display in the query result lists.
The hotel city, if we need to display this information in the query result lists.
Composite text containing the important fields of the Hotel object:
- Hotel name
- Hotel city
- Hotel description
We want full-text indexing on this field. We don't need to display the indexed text in the query results, so we use Field.Store.NO to save index space.

Here's the method in the Indexer class in our demo that indexes a given hotel:

public void indexHotel(Hotel hotel) throws IOException {
    IndexWriter writer = getIndexWriter(false);
    Document doc = new Document();
    doc.add(new StringField("id", hotel.getId(), Field.Store.YES));
    doc.add(new StringField("name", hotel.getName(), Field.Store.YES));
    doc.add(new StringField("city", hotel.getCity(), Field.Store.YES));
    String fullSearchableText = hotel.getName() + " " + hotel.getCity() + " " + hotel.getDescription();
    doc.add(new TextField("content", fullSearchableText, Field.Store.NO));
    writer.addDocument(doc);
}

Once the indexing is finished, you have to close the index writer, which updates and closes the associated files on the disk. Opening and closing the index writer is time-consuming, so it's not a good idea to do it systematically for each operation in the case of batch updates. For example, here's a method in the Indexer class in our demo that rebuilds the whole index:

public void rebuildIndexes() throws IOException {
   //
   // Erase existing index
   //
   getIndexWriter(true);
   //
   // Index all hotel entries
   //
   Hotel[] hotels = HotelDatabase.getHotels();
   for(Hotel hotel: hotels) {
     indexHotel(hotel);
   }
   //
   // Don't forget to close the index writer when done
   //
   closeIndexWriter();
 }

For your reference, here is the complete source code of src/lucene/demo/search/Indexer.java.

package lucene.demo.search;

import java.io.IOException;
import java.io.StringReader;
import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import lucene.demo.business.Hotel;
import lucene.demo.business.HotelDatabase;

public class Indexer {

    /** Creates a new instance of Indexer */
    public Indexer() {
    }

    private IndexWriter indexWriter = null;

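    public IndexWriter getIndexWriter(boolean create) throws IOException {
        if (indexWriter == null) {
            Directory indexDir = FSDirectory.open(new File("index-directory"));
            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_2, new StandardAnalyzer());
            // create == true starts from an empty index; otherwise append to an existing one.
            // The flag only takes effect when the writer is opened here.
            config.setOpenMode(create ? IndexWriterConfig.OpenMode.CREATE
                                      : IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
            indexWriter = new IndexWriter(indexDir, config);
        }
        return indexWriter;
    }

    public void closeIndexWriter() throws IOException {
        if (indexWriter != null) {
            indexWriter.close();
            // Allow a later call to getIndexWriter() to reopen the index.
            indexWriter = null;
        }
    }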

    public void indexHotel(Hotel hotel) throws IOException {

        System.out.println("Indexing hotel: " + hotel);
        IndexWriter writer = getIndexWriter(false);
        Document doc = new Document();
        doc.add(new StringField("id", hotel.getId(), Field.Store.YES));
        doc.add(new StringField("name", hotel.getName(), Field.Store.YES));
        doc.add(new StringField("city", hotel.getCity(), Field.Store.YES));
        String fullSearchableText = hotel.getName() + " " + hotel.getCity() + " " + hotel.getDescription();
        doc.add(new TextField("content", fullSearchableText, Field.Store.NO));
        writer.addDocument(doc);
    }

    public void rebuildIndexes() throws IOException {
          //
          // Erase existing index
          //
          getIndexWriter(true);
          //
          // Index all Accommodation entries
          //
          Hotel[] hotels = HotelDatabase.getHotels();
          for(Hotel hotel : hotels) {
              indexHotel(hotel);
          }
          //
          // Don't forget to close the index writer when done
          //
          closeIndexWriter();
     }
}

2. Text Search Using Lucene Index

Now that we've indexed our data, we can do some searching. In our demo, this part is implemented by the SearchEngine class in src/lucene/demo/search/SearchEngine.java.

In most cases, you need to use two classes to support full-text searching: QueryParser and IndexSearcher. QueryParser parses the user query string and constructs a Lucene Query object, which is passed on to IndexSearcher.search() as the input. Based on this Query object and the prebuilt Lucene index, IndexSearcher.search() identifies the matching documents and returns them as a TopDocs object. To get started, look at the following example code.

package lucene.demo.search;

import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.File;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import lucene.demo.business.Hotel;
import lucene.demo.business.HotelDatabase;

public class SearchEngine {
    private IndexSearcher searcher = null;
    private QueryParser parser = null;

    /** Creates a new instance of SearchEngine */
    public SearchEngine() throws IOException {
        searcher = new IndexSearcher(DirectoryReader.open(FSDirectory.open(new File("index-directory"))));
        parser = new QueryParser("content", new StandardAnalyzer());
    }

    public TopDocs performSearch(String queryString, int n)
    throws IOException, ParseException {
        Query query = parser.parse(queryString);
        return searcher.search(query, n);
    }

    public Document getDocument(int docId)
    throws IOException {
        return searcher.doc(docId);
    }
}

Inside the constructor of SearchEngine, we first create an IndexSearcher object using the index in index-directory that we created before. We then create a QueryParser. The first parameter to the QueryParser constructor specifies the default search field, which is the content field in this case. This default field is used if the query string does not specify the search field. The second parameter specifies the Analyzer to be used when the QueryParser parses the user query string.

The class SearchEngine provides a method called performSearch, which takes a query string and the maximum number of matching documents that should be returned as the input parameters and returns the list of matching documents as a Lucene TopDocs object. The method takes the query string, parses it using QueryParser, and performs search() using IndexSearcher.

Important Note: There is a very common mistake that people often make, so I have to mention it here. When you use Lucene, you have to specify the Analyzer twice: once when you create an IndexWriter object (for index construction) and once more when you create a QueryParser (for query parsing). It is extremely important that you use the same analyzer for both; otherwise, the query terms may not line up with the indexed terms, and you will run into all sorts of problems that you do not expect. In our example, since we created the IndexWriter using StandardAnalyzer, we also pass StandardAnalyzer to QueryParser.
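The following plain-Java sketch illustrates why mismatched analysis causes silent misses. It is not Lucene code: AnalyzerMismatchDemo and the two analysis functions are hypothetical stand-ins, one that lowercases the text and one that preserves case. The text is indexed with the lowercasing analysis, then the same query word is checked under each analysis.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

/** Toy demonstration of index-time vs query-time analysis (not Lucene code). */
public class AnalyzerMismatchDemo {
    /** Lowercasing analysis: split on non-word characters and lowercase. */
    public static final Function<String, String[]> LOWERCASING =
            text -> text.toLowerCase(Locale.ROOT).split("\\W+");

    /** Case-preserving analysis: split on white space only. */
    public static final Function<String, String[]> CASE_PRESERVING =
            text -> text.trim().split("\\s+");

    /** Builds the set of indexed terms for one text with the given analysis. */
    public static Set<String> indexTerms(String text, Function<String, String[]> analysis) {
        Set<String> terms = new HashSet<>();
        for (String term : analysis.apply(text)) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    /** True if every query term (after analysis) is among the indexed terms. */
    public static boolean matches(Set<String> indexed, String query,
                                  Function<String, String[]> analysis) {
        for (String term : analysis.apply(query)) {
            if (!term.isEmpty() && !indexed.contains(term)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Indexed terms are "mariott" and "hotel" (lowercased at index time).
        Set<String> indexed = indexTerms("Mariott Hotel", LOWERCASING);
        System.out.println(matches(indexed, "Mariott", LOWERCASING));     // prints true
        System.out.println(matches(indexed, "Mariott", CASE_PRESERVING)); // prints false
    }
}
```

No error is raised in the mismatched case; the query simply returns nothing, which is what makes this mistake hard to spot.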

The last method of the SearchEngine class, getDocument, takes the document ID that Lucene assigned to a document (the doc value of a search hit, not the hotel's own "id" field) and returns the corresponding Document object from the index. This method is used to retrieve a particular matching document from the index.

Now we briefly explain the syntax of the user's query string.

2.1 Query Syntax

In its simplest form, the query string can be a simple list of keywords like Mariott Hotel. This query will return the documents that contain either Mariott or Hotel in the default field (i.e., the content field in our example). If you want to search for documents that contain both keywords, the query should be Mariott AND Hotel. Note that the AND boolean operator must be in ALL CAPS.

The general syntax for a query string is as follows: A query is a series of clauses. A clause may be prefixed by:

a plus (+) or a minus (-) sign, indicating that the clause is required or prohibited respectively; or
a field name followed by a colon, indicating the search field. This enables one to construct a query on multiple search fields.

A clause may be either:

a keyword, indicating all the documents that contain this keyword; or
a nested query, enclosed in parentheses.

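For example, the following query string will search for "Mariott" in the name field or "Comfortable" in the description field (this assumes both fields were indexed as tokenized TextFields, because the query terms are passed through the analyzer before matching):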

  name:Mariott OR description:Comfortable

The following query will search for a hotel that contains both the words "Mariott" and "Resort" in the name field:

name:(+Mariott +Resort)

More examples of query strings can be found in the query syntax documentation.
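The required (+), prohibited (-), and plain keyword clauses described above can be sketched as a small matching function in plain Java. This is a simplified model under stated assumptions, not Lucene's query parser or scorer: ToyBooleanQuery is a hypothetical class, it works on a single document's set of terms, and it ignores field prefixes and nested queries.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Toy evaluation of +required, -prohibited, and optional clauses (not Lucene code). */
public class ToyBooleanQuery {
    /**
     * Simplified rules assumed here: every "+" term must be present, no "-" term
     * may be present, and if there is no "+" term at all, at least one plain
     * (optional) term must be present.
     */
    public static boolean matches(Set<String> docTerms, String... clauses) {
        boolean hasRequired = false;
        boolean anyOptionalHit = false;
        for (String clause : clauses) {
            if (clause.startsWith("+")) {
                hasRequired = true;
                if (!docTerms.contains(clause.substring(1))) {
                    return false; // a required term is missing
                }
            } else if (clause.startsWith("-")) {
                if (docTerms.contains(clause.substring(1))) {
                    return false; // a prohibited term is present
                }
            } else if (docTerms.contains(clause)) {
                anyOptionalHit = true;
            }
        }
        return hasRequired || anyOptionalHit;
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("mariott", "resort", "paris"));
        System.out.println(matches(doc, "mariott", "hotel"));    // prints true  (either keyword)
        System.out.println(matches(doc, "+mariott", "+hotel"));  // prints false (both required)
        System.out.println(matches(doc, "+mariott", "+resort")); // prints true
        System.out.println(matches(doc, "+mariott", "-paris"));  // prints false (paris prohibited)
    }
}
```

The first call corresponds to a plain keyword list such as Mariott Hotel, and the second to Mariott AND Hotel, which is equivalent to marking both keywords as required.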

2.2 Retrieving Matching Documents

The search() method of the Lucene IndexSearcher object returns the matching document information as a Lucene TopDocs object. This object contains an array of ScoreDoc objects in its scoreDocs field, each of which has a doc field (the unique document ID of the matching document) and a score field (the document's relevance score). More precisely, from the TopDocs object you can obtain the matching Document objects as follows:

// instantiate the search engine
SearchEngine se = new SearchEngine();

// retrieve the top 100 matching documents for the query "Notre Dame museum"
TopDocs topDocs = se.performSearch("Notre Dame museum", 100);

// obtain the ScoreDoc (= documentID, relevanceScore) array from topDocs
ScoreDoc[] hits = topDocs.scoreDocs;

// retrieve each matching document from the ScoreDoc array
for (int i = 0; i < hits.length; i++) {
    Document doc = se.getDocument(hits[i].doc);
    String hotelName = doc.get("name");
   ...
}

As in this example, once you obtain the Document object from the index, you can use the get() method to fetch field values that have been stored during indexing.
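The shape of the result (a ranked list of document ID and score pairs, cut off at n entries) can be imitated in plain Java as follows. This is only a sketch: ToyTopDocs and Hit are hypothetical names, and the score here is simply the number of query words found in a document, which is an assumption for illustration and not Lucene's relevance formula.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Toy top-n search returning (document ID, score) pairs (not Lucene code). */
public class ToyTopDocs {
    /** One hit: a document ID plus its score, loosely like a ScoreDoc. */
    public static final class Hit {
        public final int doc;
        public final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /** Returns at most n hits, best first; score = number of query words found. */
    public static List<Hit> search(String[] documents, String query, int n) {
        String[] queryTerms = query.toLowerCase(Locale.ROOT).split("\\W+");
        List<Hit> hits = new ArrayList<>();
        for (int docId = 0; docId < documents.length; docId++) {
            List<String> docTerms =
                    Arrays.asList(documents[docId].toLowerCase(Locale.ROOT).split("\\W+"));
            float score = 0;
            for (String term : queryTerms) {
                if (!term.isEmpty() && docTerms.contains(term)) {
                    score++;
                }
            }
            if (score > 0) {
                hits.add(new Hit(docId, score));
            }
        }
        // Highest score first; ties keep document order because List.sort is stable.
        hits.sort((a, b) -> Float.compare(b.score, a.score));
        return hits.subList(0, Math.min(n, hits.size()));
    }

    public static void main(String[] args) {
        String[] docs = {
            "Hotel near the Louvre museum",
            "Notre Dame view hotel",
            "Museum quarter inn by Notre Dame"
        };
        for (Hit hit : search(docs, "Notre Dame museum", 2)) {
            System.out.println(hit.doc + " " + hit.score);
        }
        // prints "2 3.0" then "1 2.0"
    }
}
```

As with the real API, the caller receives only IDs and scores and then fetches the documents it actually wants to display.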

Now read the src/lucene/demo/Main.java file to see how it builds, searches, and retrieves documents from a Lucene index.

Notes on CLASSPATH

In order to use Lucene, you need the lucene-*.jar library files available in the /usr/share/java directory of our VM. Since these are third-party jar library files that are not part of the standard Java Runtime Environment, the Java compiler and runtime engine are NOT aware of them and may generate a "class not found" error when you try to compile and run your code. To avoid this error, make sure that one of the following holds:

Your ant script must pass the jar file as the classpath parameter during compilation and runtime. The included build.xml file in lucene-tutorial.zip does this automatically for the two targets "compile" and "run".
If you run javac and java commands directly from a shell, pass the locations of the libraries (separated by :) using the -classpath option. A quoted directory/* entry is expanded by Java itself to every jar file in that directory:
```
javac -classpath ".:/usr/share/java/*" YourClass.java
```
and
```
java -classpath ".:/usr/share/java/*" YourClass
```
You can set your CLASSPATH environment variable to include the library files. This method is strongly discouraged, but it still works.

Summary and References

There is much more to Lucene than is described here. In fact, we barely scratched the surface. However, this example does show how easy it is to implement full-text search functions in a Java database application. Try it out, and add some powerful full-text search functions to your web site today!

Lucene web site
Lucene in Action (Manning, 2004), by Erik Hatcher and Otis Gospodnetic

This article was originally written by John Ferguson Smart on Apr 14, 2006. It was then modified by Junghoo "John" Cho for the CS144 class at UCLA.
