[Repost] Creating a fulltext search engine with PHP 5 and Zend_Search_Lucene

Creating a fulltext search engine in PHP 5 with the Zend Framework's Zend_Search_Lucene --http://www.phpriot.com/d/articles/php/search/zend-search-lucene/index.html

By Quentin Zervaas, 27 April 2006

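Reposter's note: when I have time I will translate this into Chinese. The article itself is not difficult, so it can be read at a leisurely pace.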

This article covers the implementation of a fulltext search engine using PHP 5 and the Zend Framework. We will be using the Zend_Search_Lucene component to create and search our fulltext index.

There are several other libraries we could use instead of this one, but Zend_Search_Lucene is completely native to PHP, whereas others such as Xapian or Tsearch2 rely on third party modules (for instance, the Tsearch2 module must be compiled into your PostgreSQL installation).

It must be noted at this point though that we require at least PHP 5 for Zend_Search_Lucene – PHP 4 will not work.

In this article we will be covering the following:

- How to index a document or series of documents
- The different types of fields that can be indexed
- Searching the index

To demonstrate this functionality, we will cover the implementation of a search engine into phpRiot. We previously used the Tsearch2 module but had some problems that we were unable to overcome.

How fulltext indexing and querying works

Just in case you’ve never dealt with the subject before, I will briefly cover how the whole thing works.

The index essentially stores a special copy of all your data, but instead of storing the actual content of your documents, it stores a list of keywords found in each document.

So if I were to create an index of all the documents on phpRiot, this would be done by looping over every document, finding the keywords and then adding those words to the index.

Additionally, since we can’t just list a bunch of keywords to a user when they perform a search, we also store some other data in the index along with the keywords. In the case of phpRiot, we will store the document title, the author, the document URL and a brief summary of the article.

Alternatively, we could store just the document ID and then look up those other details from the database using the ID. However, it is quicker to keep them in the index (fewer queries), and since the title and author are only a few words for each document, there is little overhead in duplicating this data by storing it in our index.
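To make the idea concrete, here is a minimal sketch of a keyword index in plain PHP. It is not how Zend_Search_Lucene stores its data internally; the function name buildToyIndex() and the array layout are made up purely for illustration.

```php
<?php
// Toy illustration only: Zend_Search_Lucene uses its own on-disk format.
// buildToyIndex() and the array layout are hypothetical.
function buildToyIndex(array $documents)
{
    $index = array('keywords' => array(), 'stored' => array());

    foreach ($documents as $id => $doc) {
        // split the body into lowercase words
        $words = preg_split('/[^a-z0-9]+/', strtolower($doc['body']), -1, PREG_SPLIT_NO_EMPTY);

        // map each keyword to the ids of the documents that contain it
        foreach (array_unique($words) as $word) {
            $index['keywords'][$word][] = $id;
        }

        // keep a few small fields for display; the body itself is not stored
        $index['stored'][$id] = array('title' => $doc['title'], 'author' => $doc['author']);
    }

    return $index;
}

$index = buildToyIndex(array(
    1 => array('title' => 'Intro to PHP', 'author' => 'Quentin', 'body' => 'PHP is a scripting language'),
    2 => array('title' => 'Using MySQL',  'author' => 'Quentin', 'body' => 'MySQL works well with PHP'),
));

print_r($index['keywords']['php']); // ids of the documents containing "php"
```

Note that the word "intro" appears only in a title, so it never becomes a keyword here: only the body is split into keywords, while the title and author are simply carried along for display.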

Querying the data

Once we have created the index, we can then allow people to search it. When somebody types in a search term, we look for those terms in our index, which then points back to 0 or more different documents. Zend_Search_Lucene will also score the results for us, meaning the most relevant matches are returned first.
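As a rough sketch of that lookup, the snippet below searches a hand-written keyword map and orders the matches. The score used here (the number of query terms a document matched) is a deliberately crude stand-in and not the formula Zend_Search_Lucene uses; toySearch() and the sample data are assumptions made for illustration.

```php
<?php
// Toy lookup: $keywords maps a keyword to the ids of documents containing it.
// The "score" is just the number of query terms a document matched, which is
// a crude stand-in for the real scoring done by Zend_Search_Lucene.
function toySearch(array $keywords, $query)
{
    $scores = array();
    $terms  = preg_split('/\s+/', strtolower(trim($query)), -1, PREG_SPLIT_NO_EMPTY);

    foreach (array_unique($terms) as $term) {
        if (!isset($keywords[$term])) {
            continue; // term does not occur in any document
        }
        foreach ($keywords[$term] as $id) {
            $scores[$id] = isset($scores[$id]) ? $scores[$id] + 1 : 1;
        }
    }

    arsort($scores); // highest score first, keys (document ids) preserved
    return $scores;
}

$keywords = array(
    'php'   => array(1, 2, 3),
    'mysql' => array(2),
    'ajax'  => array(3, 4),
);

$results = toySearch($keywords, 'php mysql');
print_r($results); // document 2 matched both terms, so it is listed first
```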

Keeping the index up-to-date

If a document is ever updated in the database, this likely means the index will be out-of-date. This means we have to make an extra consideration in our general content management. That is, we need to re-index that document.

There are several ways of approaching this. For instance, you could update it in real-time when the document is updated, or you could run an index update at a certain time every day. Each method has its own advantages. For something like phpRiot, updating the index in real time is probably the best way, as the data is not updated very frequently. On the other hand, if you were indexing user submitted comments, it could create a huge load to recreate the index for every comment, as there could be hundreds of comments per day.
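For the scheduled approach, one possible sketch is to compare each document's last-modified time with the time of the previous index run and re-index only what changed. The helper name findStaleDocuments() and the timestamps are hypothetical; phpRiot's actual bookkeeping is not shown in this article.

```php
<?php
// Hypothetical helper for the scheduled (batch) approach: given each document's
// last-modified timestamp, return the ids that changed since the last index run.
function findStaleDocuments(array $modifiedTimes, $lastIndexedAt)
{
    $stale = array();
    foreach ($modifiedTimes as $id => $modifiedAt) {
        if ($modifiedAt > $lastIndexedAt) {
            $stale[] = $id;
        }
    }
    return $stale;
}

$lastIndexedAt = mktime(0, 0, 0, 4, 26, 2006);   // example: last nightly run
$modifiedTimes = array(
    10 => mktime(12, 0, 0, 4, 20, 2006),
    11 => mktime(9, 30, 0, 4, 26, 2006),         // edited after the last run
    12 => mktime(23, 0, 0, 4, 25, 2006),
);

$toReindex = findStaleDocuments($modifiedTimes, $lastIndexedAt);
print_r($toReindex); // only the ids in this list need to be re-indexed
```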

Getting started

The first thing we must do is install the Zend Framework if you have not already done so. Its files are organised in much the same way as PEAR's. At this stage, the Zend Framework is only in a “preview” phase. At the time of writing, the current version was Preview 0.1.3.

You can download this from http://framework.zend.com/download. If you use Subversion, you can also check out the trunk version, which may have newer code in it.

I’m not exactly sure where the developers intended the framework to be stored, but just as PEAR is stored in /usr/local/lib/php, I chose to store it in /usr/local/lib/zend.

Highlight: Plain

$ cd /usr/local/src

$ wget http://framework.zend.com/download/tgz

$ tar -zxf ZendFramework-0.1.3.tar.gz

$ mv ZendFramework-0.1.3/library /usr/local/lib/zend

So now all that is required is that we add /usr/local/lib/zend to our include path. For instance, my include path directive in httpd.conf for phpRiot looks something like:

Highlight: Plain

php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php

Which now becomes:

Highlight: Plain

php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php:/usr/local/lib/zend
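If editing httpd.conf is not an option, the same effect can usually be had at runtime. The sketch below uses PHP's set_include_path() and assumes the framework lives in /usr/local/lib/zend, as chosen above.

```php
<?php
// Append the framework directory to the include path at runtime.
// '/usr/local/lib/zend' is the location chosen earlier in this article.
$zendPath = '/usr/local/lib/zend';

set_include_path(get_include_path() . PATH_SEPARATOR . $zendPath);
```

This has to run before the first require_once of a framework file.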

Creating our first index

The basic process for creating an index is:

- Open the index
- Add each document
- Commit (save) the index

The index is stored on the filesystem, so when you open an index, you specify the path where it is to be stored. This is achieved by instantiating the Zend_Search_Lucene class.

Highlight: PHP

<?php

    require_once('Zend/Search/Lucene.php');

    $indexPath = '/var/www/phpriot.com/data/docindex';

    $index = new Zend_Search_Lucene($indexPath, true);

?>

You’ll also notice the second parameter in the constructor call. This means the index is created from scratch. If you set this to false (or omit the argument), an existing index is opened. This is done when updating or querying the index. Since we’re creating the index at this stage, we include that parameter.
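Because passing true starts the index from scratch, a script that is run more than once may want to work out the flag rather than hard-code it. The following is only a sketch: indexNeedsCreating() is a made-up helper that checks whether the directory exists and contains any files, which is an assumption about what counts as an existing index.

```php
<?php
// Hypothetical helper: decide whether the index has to be created from scratch.
// It only checks whether the directory exists and contains any files.
function indexNeedsCreating($indexPath)
{
    if (!is_dir($indexPath)) {
        return true;
    }
    // scandir() always lists "." and "..", so an empty directory has 2 entries
    $entries = scandir($indexPath);
    return $entries === false || count($entries) <= 2;
}

$indexPath = '/var/www/phpriot.com/data/docindex';
$create    = indexNeedsCreating($indexPath);

// With the framework on the include path, this flag would be passed as the
// second constructor argument:
//     $index = new Zend_Search_Lucene($indexPath, $create);
```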

Adding a document to our index

Now that we have opened our index, we must add our documents to it. We create a document to be added to the index using the following:

Highlight: PHP

<?php

    $doc = new Zend_Search_Lucene_Document();

?>

The next thing we must do is determine which fields we need to add to our index.

There are several different field types that can be stored for a single document. As we mentioned earlier, we want to store the keywords for our document, and also store some extra data that we can use when fetching the data. If we wanted to, we could duplicate the entire document content in our search index, however, this would be overkill and is unnecessary. It’s not as bad to duplicate smaller items such as the author name or document title as it’s only a small amount of data.

As you can tell from the previous paragraph, we’re starting to determine which fields we want to store in our index:

- Document content – but we don’t want to duplicate the entire article, as we’re only going to show users a preview of the article in the search results
- Title – we’re definitely going to include the title in our results
- Article teaser – even though we’re not storing the content, we do want to store a short summary of the document to display in our search results
- Author – we will list the author of each document in our search results, plus we want to give users the ability to search by author
- Created – we’ll also store a timestamp of when the article was created

This brings us to the different types of fields we can use for indexing in Zend_Search_Lucene:

- UnIndexed – Data that isn’t available for searching, but is stored with our document (article teaser, article URL and timestamp of creation)
- UnStored – Data that is available for search, but isn’t stored in the index in full (the document content)
- Text – Data that is available for search and is stored in full (title and author)

There are also Keyword and Binary fields available, but we won’t be using them in this example. Binary data isn’t searchable, but could be used to store an image with an indexed document.
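The choice between the three field types used in this article comes down to two questions: should the data be searchable, and should it be shown in the results? The helper below, chooseFieldType(), is purely illustrative and not part of the framework; it simply encodes the list above.

```php
<?php
// Illustrative helper (not part of the framework): pick one of the three field
// types used in this article from two yes/no questions about a piece of data.
function chooseFieldType($searchable, $displayed)
{
    if ($searchable && $displayed) {
        return 'Text';      // indexed and stored, e.g. title, author
    }
    if ($searchable) {
        return 'UnStored';  // indexed only, e.g. the article body
    }
    if ($displayed) {
        return 'UnIndexed'; // stored only, e.g. url, teaser, created
    }
    return null;            // neither searched nor shown: leave it out
}

echo chooseFieldType(true, true), "\n";   // Text
echo chooseFieldType(true, false), "\n";  // UnStored
echo chooseFieldType(false, true), "\n";  // UnIndexed
```

The returned names match the static method names on Zend_Search_Lucene_Field used in the rest of this article.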

To add a field to our indexed document, we use the addField() method. When we call this method, we must pass data that has been prepared according to the necessary field type. This is achieved using the Zend_Search_Lucene_Field class, with the field type as the static method name.

In other words, to create the title field data, we use:

Highlight: PHP

<?php

    $data = Zend_Search_Lucene_Field::Text('title', $docTitle);

?>

Note that we specify the name of the field here. We can later reference this data by this field name, but we will cover this in the section about querying the index.

So to add all the data with the field types we just worked out, we would use this:

Highlight: PHP

<?php

    $doc = new Zend_Search_Lucene_Document();

    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $docUrl));

    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('created', $docCreated));

    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $docTeaser));

    $doc->addField(Zend_Search_Lucene_Field::Text('title', $docTitle));

    $doc->addField(Zend_Search_Lucene_Field::Text('author', $docAuthor));

    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docBody));

?>

Important note: We added the main search content with a field name of contents. This is a special name to Zend_Search_Lucene, which means by default, all queries will search this field. If users instead want to search the title field for ‘foo’, their search query would look like title:foo. This will be covered further in the section about querying the index.

Finally, we add the document to the index using addDocument():

Highlight: PHP

<?php

    $index->addDocument($doc);

?>

We will cover this code in full in the next chapter (including looping over several documents and adding them all at once).

Committing / saving the index

Once all documents have been added, the index must be saved.

Highlight: PHP

<?php

    $index->commit();

?>

You can call commit() after adding each document, but this operation is expensive and each call results in a new index segment (in layman’s terms, this means it’s an extra file to search when querying the index). Note that better management of the segments in the index is a planned future improvement, but for the time being it’s best to be careful and call it once after you’ve done what you need to do with the index.

If you were to look at your filesystem now that you have committed the index, you would notice a directory at the location you specified when opening the index. This directory contains a number of files which are then used when querying the index.

Next up we will piece all the above code into a single script which indexes all the articles on phpRiot.

Indexing all the articles on phpRiot

Now that you’ve seen how to create a basic index, we will extend this script slightly so it can index all the documents in phpRiot.

Additionally, we will be extending the base Zend_Search_Lucene_Document class to simplify our code slightly. This will also demonstrate ways you can take advantage of the OOP style of programming that Zend_Search_Lucene uses.

Since we are demonstrating the indexing of phpRiot articles, we will also use the DatabaseObject class here to fetch article data, just as phpRiot does. You don’t really need to know how this class works to understand this example (as it is fairly self-explanatory in the function calls), but if you are interested, you can read our article Managing your data with DatabaseObject.

Extending Zend_Search_Lucene_Document

On the previous page, after we opened the index, we created a new instance of Zend_Search_Lucene_Document to hold the index data for a single document. Instead of using this class directly, we’re going to extend it to encapsulate all of the field-adding code we wrote.

In other words, we’re going to move the calls to addField into our class, rather than calling it for each field after we create our Zend_Search_Lucene_Document item.

Highlight: PHP

<?php

    class PhpRiotIndexedDocument extends Zend_Search_Lucene_Document
    {
        /**
         * Constructor. Creates our indexable document and adds all
         * necessary fields to it using the passed in DatabaseObject
         * holding the article data.
         */
        public function __construct(&$document)
        {
            // stored for display only
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('url',     $document->generateUrl()));
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('created', $document->getProperty('created')));
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('teaser',  $document->getProperty('teaser')));

            // searchable and stored
            $this->addField(Zend_Search_Lucene_Field::Text('title',  $document->getProperty('title')));
            $this->addField(Zend_Search_Lucene_Field::Text('author', $document->getProperty('author')));

            // searchable but not stored
            $this->addField(Zend_Search_Lucene_Field::UnStored('contents', $document->getProperty('body')));
        }
    }

?>

As you can see, we’ve made a simple wrapper class which fetches data from the passed in document (which uses DatabaseObject). The generateUrl() function is just a special internal method which determines a document’s full URL. We are storing this when we build the index so we don’t have to generate it each time a search is run (especially since this will never change, and if it does we can just rebuild the index).

Building the full index

Now that we have our class, we can create our index, loop over the documents, and then save our index:

Highlight: PHP

<?php

    require_once('Zend/Search/Lucene.php');

    require_once('DatabaseObject/PhpriotDocument.class.php');

    // where to save our index

    $indexPath = '/var/www/phpriot.com/data/docindex';

    // get a list of all database ids for documents

    $doc_ids = PhpriotDocument::GetDocIds($db);

    // create our index

    $index = new Zend_Search_Lucene($indexPath, true);

    foreach ($doc_ids as $doc_id) {
        // load our databaseobject
        $document = new PhpriotDocument($db);
        $document->loadRecord($doc_id);

        // create our indexed document and add it to the index
        $index->addDocument(new PhpRiotIndexedDocument($document));
    }

    // write the index to disk

    $index->commit();

?>

The index has now been created! This can take some time if you have many documents or if each document has a large amount of content. We generate this by using PHP on the command line, which allows us to see its progress in real-time if we need to (we also output the title and a status message as each document is indexed).

How querying works in Zend_Search_Lucene

Now comes the most important part—actually finding stuff! There are quite a lot of options when it comes to querying your index, allowing you and your users to have a lot of control in the returned results.

When we created the index, we added six fields, but only three of those were actually searchable: document title, document content, and the author.

Each of these items is stored separately for each indexed document, meaning you can search on them separately. The syntax used to search each section is somewhat like Google’s, in that you specify the field, followed by a colon, followed by the term (with no spaces).

So to search the author field for ‘Quentin’, the search query would be author:Quentin. (Note that the search is case-insensitive. To make it case-sensitive we would need to change some options when creating our index. For full details on this, please read the Zend_Search_Lucene manual section on Extensibility.)

Likewise, to search in the title field for ‘php’, we would use title:php.

As we briefly mentioned earlier in this article, the field that is searched by default is the one called contents. So if you wanted to search the document body for the word ‘google’, you could use contents:google or just google.

Including and excluding terms

By default, all specified terms are combined using a boolean ‘or’. This means a document is returned if it contains any of the terms. To force results to contain a particular term, prefix it with the plus symbol. To force results to not contain a particular term, prefix it with the minus symbol. If you’re searching in a different field, you can put the plus or minus either before the field name or before the term. In other words, +author:Quentin and author:+Quentin are identical.
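The syntax described so far can be illustrated with a simplified term splitter. This is not the parser used by Zend_Search_Lucene: parseToyQuery() is a made-up name, and phrases, wildcards and grouping are ignored. It only shows how a field prefix and a plus or minus sign could be read from each token, with contents as the default field.

```php
<?php
// Simplified sketch of how a raw query could be split into terms. This is NOT
// the parser used by Zend_Search_Lucene; parseToyQuery() is a made-up name and
// phrases, wildcards and grouping are ignored.
function parseToyQuery($query, $defaultField = 'contents')
{
    $terms = array();

    foreach (preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY) as $token) {
        // optional sign, optional "field:", optional sign, then the term itself
        if (!preg_match('/^([+-]?)(?:(\w+):)?([+-]?)(.+)$/', $token, $m)) {
            continue;
        }
        $sign = $m[1] !== '' ? $m[1] : $m[3];

        $terms[] = array(
            'field'    => $m[2] !== '' ? $m[2] : $defaultField,
            'term'     => strtolower($m[4]), // searches are case-insensitive
            // true = required (+), false = prohibited (-), null = neither
            'required' => $sign === '+' ? true : ($sign === '-' ? false : null),
        );
    }

    return $terms;
}

print_r(parseToyQuery('php -title:ajax'));
```

Both +author:Quentin and author:+Quentin produce the same array with this sketch, which mirrors the equivalence described above.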

Searching for phrases

It is possible to search for exact phrases with Zend_Search_Lucene, so if you wanted to search for the exact phrase “PHP Articles” you could. Because this is somewhat complicated to achieve, we will not be including this in our examples or implementation; however, there is a lot of information on this in the Zend_Search_Lucene manual section on query types.

Sample queries

Here are some queries you can pass to Zend_Search_Lucene and their meanings.

Highlight: Plain

php

    // search the index for any article with the word php

php -author:quentin

    // find any article with the word php not written by me

author:quentin

    // find all the articles by me

php -ajax

    // find all articles with the word php that don't have the word ajax

title:mysql

    // find all articles with MySQL in the title

title:mysql -author:quentin

    // find all articles with MySQL in the title not by me

And so on. Hopefully you get the idea.
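The include/exclude rules behind these queries can be sketched with a toy matcher that tests a single document against a list of terms. It is illustrative only: toyMatches() is a made-up function, and the way it treats terms without a sign once a required term is present (they no longer decide whether a document matches) is an assumption based on Lucene's usual boolean query rules.

```php
<?php
// Toy matcher for the include/exclude rules. Each term is array(field, word, sign)
// where sign is '+', '-' or ''. Illustrative only; not Zend_Search_Lucene code.
function toyMatches(array $doc, array $terms)
{
    $hasRequired = false;
    $optionalHit = false;

    foreach ($terms as $t) {
        list($field, $word, $sign) = $t;
        $words = preg_split('/\W+/', strtolower($doc[$field]), -1, PREG_SPLIT_NO_EMPTY);
        $found = in_array($word, $words);

        if ($sign === '+') {
            if (!$found) {
                return false;       // required term missing
            }
            $hasRequired = true;
        } elseif ($sign === '-') {
            if ($found) {
                return false;       // prohibited term present
            }
        } else {
            $optionalHit = $optionalHit || $found;
        }
    }

    // without required terms, at least one unsigned term has to match
    return $hasRequired || $optionalHit;
}

$doc = array('title' => 'MySQL tips', 'author' => 'Quentin', 'contents' => 'Using PHP with MySQL');

// "php -ajax": contains php and does not contain ajax
var_dump(toyMatches($doc, array(array('contents', 'php', ''), array('contents', 'ajax', '-'))));
```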

Scoring of results

All results returned from a search are assigned a score. This is a measure of how well the document matched the search term.

The results are ordered by their score, from highest to lowest.

I’m not exactly sure how the score is calculated or what it represents exactly, but it looks pretty on the search results.

You can customize the scoring algorithm (and hence the ordering of results). Please see the section later in the article on extending Zend_Search_Lucene.

Querying our index

On the previous page we looked at how to write queries to search the index. We learned how to include and exclude terms, and also how to search different fields in our indexed data.

Now we will look at actually pulling documents from our index using that term.

There are essentially two ways to query the index: passing the raw query in and letting Zend_Search_Lucene parse the query (ideal when you’re writing a search engine where you’re not sure what the user will enter), or by manually building up the query with API function calls.

In either case, you use the find() method on the index. The find() method returns a list of matches from your index.

Highlight: PHP

<?php

    require_once('Zend/Search/Lucene.php');

    $indexPath = '/var/www/phpriot.com/data/docindex';

    $index = new Zend_Search_Lucene($indexPath);

    $hits = $index->find('php +author:Quentin');

?>

This sample code searches our index for articles containing ‘php’ that were written by me. Note that when we opened our index, we did not pass the second parameter as we did when we created the index. This is because we are not creating the index, we are querying it.

We could also manually build this same query with function calls like so:

Highlight: PHP

<?php

    require_once('Zend/Search/Lucene.php');

    $indexPath = '/var/www/phpriot.com/data/docindex';

    $index = new Zend_Search_Lucene($indexPath);

    $query = new Zend_Search_Lucene_Search_Query_MultiTerm();

    $query->addTerm(new Zend_Search_Lucene_Index_Term('php'), null);

    $query->addTerm(new Zend_Search_Lucene_Index_Term('Quentin', 'author'), true);

    $hits = $index->find($query);

?>

The second parameter for addTerm() determines whether or not a term is required. True means it is required (like putting a plus sign before the term), false means it is prohibited (like putting a minus sign before the term), and null means it is neither required nor prohibited.

The second parameter for Zend_Search_Lucene_Index_Term specifies the field to search. By default this is contents.

On the whole, it is easier to simply allow Zend_Search_Lucene to parse the query.

Dealing with returned results

The results found from your query are returned in an array, meaning you can simply use count() on the array to determine the number of hits.

Each stored field is available as a property of the hit object. Fields added as UnStored, such as contents, are searchable but cannot be read back.

So to loop over the results as we indexed them previously (with a title, author and teaser), we would do the following:

Highlight: PHP

<?php

    require_once('Zend/Search/Lucene.php');

    $query = 'php +author:Quentin';

    $indexPath = '/var/www/phpriot.com/data/docindex';

    $index = new Zend_Search_Lucene($indexPath);

    $hits = $index->find($query);

    $numHits = count($hits);

?>

<p>

    Found <?=$numHits?> result(s) for query <?=$query?>.

</p>

<?php foreach ($hits as $hit) { ?>

    <h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>

<p>

        By <?=$hit->author?>

    </p>

<p>

        <?=$hit->teaser?><br />

        <a href="<?=$hit->url?>">Read more...</a>

    </p>

<?php } ?>

Here we also used an extra field called ‘score’. As mentioned previously, this is used as an indicator as to how well a document matched the query. Results with the highest score are listed first.
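Since titles, teasers and (later) the query itself may contain characters that are special in HTML, it is usually worth escaping them on output. The sketch below fakes a hit with stdClass so that it runs on its own; h() and renderHit() are made-up helper names, and only the field names come from the index built earlier.

```php
<?php
// Sketch: escape stored field values before printing them into HTML.
// The hit is faked with stdClass so the snippet runs without the framework;
// h() and renderHit() are made-up helper names.
function h($value)
{
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

function renderHit($hit)
{
    return '<h3>' . h($hit->title) . ' (score: ' . round($hit->score, 2) . ")</h3>\n"
         . '<p>By ' . h($hit->author) . "</p>\n"
         . '<p>' . h($hit->teaser) . '<br /><a href="' . h($hit->url) . '">Read more...</a></p>' . "\n";
}

$hit = new stdClass();
$hit->title  = 'PHP & <b>Lucene</b>';
$hit->author = 'Quentin Zervaas';
$hit->teaser = 'Building a search engine';
$hit->url    = '/d/articles/php/search/zend-search-lucene/index.html';
$hit->score  = 0.8734;

$html = renderHit($hit);
echo $html;
```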

Creating a simple search engine

Using our code above, we can easily transform this into a simple site search engine. All we need to do is add a form and plug in the submitted query. Let’s assume this script is called search.php:

Highlight: PHP

<?php

    require_once('Zend/Search/Lucene.php');

    $query = isset(
  
  
   
   Creating a fulltext search engine in PHP 5 with the Zend Framework's Zend_Search_Lucene --http://www.phpriot.com/d/articles/php/search/zend-search-lucene/index.html
  
  

  
  
   
   By Quentin Zervaas, 27 April 2006 
  
  

  
  
   
   This article covers the implementation of a fulltext search engine using PHP 5 and the Zend Framework. We will be using the Zend_Search_Lucene component to create and search our fulltext index.
  
  

  
  
   
   There are several other libraries we could use instead of this one, but Zend_Search_Lucene is completely native to PHP, whereas others such as Xapian or Tsearch2 rely on third party modules (for instance, the Tsearch2 module must be compiled into your PostgreSQL installation).
  
  

  
  
   
   It must be noted at this point though that we require at least PHP 5 for Zend_Search_Lucene – PHP 4 will not work.
  
  

  
  
   
   In this article we will be covering the following:
  
  

  
  How to index a document or series of documents 
    
The different types of fields that can be indexed 


  
  Searching the index 


  
  
   
   To demonstrate this functionality, we will cover the implementation of a search engine into phpRiot. We previously used the Tsearch2 module but had some problems that we were unable to overcome.
  
  

  
  
   
   How fulltext indexing and querying works
  
  

  
  
   
   Just in case you’ve never dealt with the subject before, I will briefly cover how the whole thing works.
  
  

  
  
   
   The index essentially stores a special copy of all your data, but instead of storing the actual content of your documents, it stores a list of keywords found in each document.
  
  

  
  
   
   So if I were to create an index of all the documents on phpRiot, this would be done by looping over every document, finding the keywords and then adding those words to the document.
  
  

  
  
   
   Additionally, since we can’t just list a bunch of keywords to a user when they perform a search, we also store some other data in the index along with the keywords. In the case of phpRiot, we will store the document title, the author, the document URL and a brief summary of the article.
  
  

  
  
   
   Alternatively we just store the document ID, and then proceed to look up those other details from the database using the ID, however it is quicker to do it this way (less queries), and since the title and author are only a few words for each document, there is little overhead in duplicating this data by storing it in our index.
  
  

  
  
   
   Querying the data
  
  

  
  
   
   Once we have created the index, we can then allow people to search it. When somebody types in a search term, we look for those terms in our index, which then points back to 0 or more different documents. Zend_Search_Lucene will also score the results for us, meaning the most relevant matches are returned first.
  
  

  
  
   
   Keeping the index up-to-date
  
  

  
  
   
   If a document is ever updated in the database, this likely means the index will be out-of-date. This means we have to make an extra consideration in our general content management. That is, we need to re-index that document.
  
  

  
  
   
   There are several ways of approaching this. For instance, you could update it in real-time when the document is updated, or you could run an index update at a certain time every day. Each method has its own advantages. For something like phpRiot, updating the index in real time is probably the best way, as the data is not updated very frequently. On the other hand, if you were indexing user submitted comments, it could create a huge load to recreate the index for every comment, as there could be hundreds of comments per day.
  
  

  
  
   
   Getting started
  
  

  
  
   
   The first thing we must do is install the Zend Framework if you have not already done so. It is structured in a similar way to how the Pear file structure is organised. At this stage, the Zend Framework is only in a “preview” phase. At time of writing the current version was Preview 0.1.3.
  
  

  
  
   
   You can download this from 
   
   http://framework.zend.com/download. If you use Subversion, you can also checkout the trunk version which may have newer code in it.
  
  

  
  
   
   I’m not exactly sure where the developers intended the framework to be stored, but like Pear is stored in /usr/local/lib/php, I chose to store it in /usr/local/lib/zend.
  
  

  
  
   
   Highlight: Plain
  
  
$ cd /usr/local/src
$ wget http://framework.zend.com/download/tgz
$ tar -zxf ZendFramework-0.1.3.tar.gz
$ mv ZendFramework-0.1.3/library /usr/local/lib/zend

  
  
   
   So now all that is required is that we add /usr/local/lib/zend to our include path. For instance, my include path directive in httpd.conf for phpRiot looks something like:
  
  

  
  
   
   Highlight: Plain
  
  
php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php

  
  
   
   Which now becomes:
  
  

  
  
   
   Highlight: Plain
  
  
php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php:/usr/local/lib/zend

  
  
   
   Creating our first index
  
  

  
  
   
   The basic process for creating an index is:
  
  

  
  Open the index
    
Add each document
    

  
  Commit (save) the index
    

  
  
   
   The index is stored on the filesystem, so when you open an index, you specify the path where the it is to be stored. This is achieved by instantiating the 
   
   Zend_Search_Lucene class.
  
  

  
  
   
   Highlight: PHP
  
  
<?php
    require_once('Zend/Search/Lucene.php');
    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = newZend_Search_Lucene($indexPath, true);
?>

  
  
   
   You’ll also notice the second parameter in the constructor call. This means the index is created from scratch. If you set this to false (or omit the argument), an existing index is opened. This is done when updating or querying the index. Since we’re creating the index at this stage, we include that parameter.
  
  

  
  
   
   Adding a document to our index
  
  

  
  
   
   Now that we have opened our index, we must add our documents to it. We create a document to be added to the index using the following:
  
  

  
  
   
   Highlight: PHP
  
  
<?php
    $doc = newZend_Search_Lucene_Document();
?>

  
  
   
   The next thing we must do is determine which fields we need to add to our index.
  
  

  
  
   
   There are several different field types that can be stored for a single document. As we mentioned earlier, we want to store the keywords for our document, and also store some extra data that we can use when fetching the data. If we wanted to, we could duplicate the entire document content in our search index, however, this would be overkill and is unnecessary. It’s not as bad to duplicate smaller items such as the author name or document title as it’s only a small amount of data.
  
  

  
  
   
   As you can tell from the previous paragraph, we’re starting to determine which fields we want to store in our index:
  
  

  
  Document content – but we don’t want to duplicate the entire article, as we’re only going to show users a preview of the article in the search results
        
Title – we’re definitely going to include the title in our results
        
Article teaser – even though we’re not storing the content, we do want to store a short summary of the document to display in our search results
        
Author – we will list the author of each document in our search results, plus we want to give users the ability to search by author
        

  
  

  
  Created – We’ll also store a timestamp of when the article was created.
    

  
  
   
   This brings us to the different types of fields we can use for indexing in Zend_Search_Lucene:
  
  

  
  UnIndexed – Data that isn’t available for searching, but is stored with our document (article teaser, article URL and timestamp of creation)
    
UnStored – Data that is available for search, but isn’t stored in the index in full (the document content)
    

  
  Text – Data that is available for search and is stored in full (title and author)
    

  
  
   
   There is also the 
   
   Keyword and 
   
   Binary fields available, but we won’t be using them in this example. Binary data isn’t searchable, but could be used to store an image with an indexed document.
  
  

  
  
   
   To add a field to our indexed document, we use the 
   
   addField() method. When we call this method, we must pass data that has been prepared according to the necessary field type. This is achieved used the Zend_Search_Lucene_Field class with the field type as the static method name.
  
  

  
  
   
   In other words, to create the 
   
   title field data, we use:
  
  

  
  
   
   Highlight: PHP
  
  
<?php
    $data = Zend_Search_Lucene_Field::Text('title', $docTitle);
?>

  
  
   
   Note that we specify the name of the field here. We can later reference this data by this field name, but we will cover this in the section about querying the index.
  
  

  
  
   
   So to add all the data with the field types we just worked out, we would use this:
  
  

  
  
   
   Highlight: PHP
  
  
<?php
    $doc = newZend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $docUrl));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('created', $docCreated));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $docTeaser));
    $doc->addField(Zend_Search_Lucene_Field::Text('title', $docTitle));
    $doc->addField(Zend_Search_Lucene_Field::Text('author', $docAuthor));
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docBody));
?>

  
  
   
   Important note: We added the main search content with a field name of 
   
   contents. This is a special name to Zend_Search_Lucene, which means by default, all queries will search this field. If users instead want to search the title field for ‘foo’, their search query would look like 
   
   title:foo. This will be covered further in the section about querying the index.
  
  

  
  
   
   Finally, we add the document to the index using 
   
   addDocument():
  
  

  
  
   
   Highlight: PHP
  
  
<?php
    $index->addDocument($doc);
?>

  
  
   
   We will cover this code in full in the next chapter (including looping over several documents and adding them all at once).
  
  

  
  
   
   Committing / saving the index
  
  

  
  
   
   Once all documents have been added, the index must be saved.
  
  

  
  
   
   Highlight: PHP
  
  
<?php
    $index->commit();
?>

  
  
   
   You can call commit after adding each document, but this operation is expensive and each call results in a new index segment (in lamen’s terms, this means it’s an extra file to search when querying the index). Note that better management of the segments in the index is a planned future improvement, but for the time being it’s best to be careful and call it once after you’ve done what you need to do with the index.
  
  

  
  
   
   If you were to look at your filesystem now that you have committed the index, you would notice a directory at the location you specified when opening the index. This directory contains a number of files which are then used when querying the index.
  
  

  
  
   
   Next up we will piece all the above code into a single script which indexes all the articles on phpRiot.
  
  

  
  
   
   Indexing all the articles on phpRiot
  
  

  
  
   
   Now that you’ve seen how to create a basic index, we will extend this script slightly so it can index all the documents in phpRiot.
  
  

  
  
   
   Additionally, we will be extending the base Zend_Search_Lucene_Document class to simplify our code slightly. This will also demonstrate ways you can take advantage of the OOP style of programming that Zend_Search_Lucene uses.
  
  

  
  
   
   Since we are demonstrating the indexing of phpRiot articles, we will also use the DatabaseObject class here to fetch article data, just as phpRiot does. You don’t really need to know how this class works to understand this example (as it is fairly self-explanatory in the function calls), but if you are interested, you can read our article Managing your data with DatabaseObject.
  
  

  
  
   
   Extending Zend_Search_Lucene_Document
  
  

  
  
   
   On the previous page, after we opened the index, we created a new instance of Zend_Search_Lucene_Document to hold the index data for a single document. Instead of using this class directly, we’re going to extend it so that it also encapsulates all of the field-adding code we wrote.
  
  

  
  
   
   In other words, we’re going to move the calls to addField() into our class, rather than calling it for each field after we create our Zend_Search_Lucene_Document item.
  
  

  
  
   
   Highlight: PHP
  
  
<?php
    class PhpRiotIndexedDocument extends Zend_Search_Lucene_Document
    {
        /**
         * Constructor. Creates our indexable document and adds all
         * necessary fields to it using the passed in DatabaseObject
         * holding the article data.
         */
        public function __construct(&$document)
        {
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('url',     $document->generateUrl()));
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('created', $document->getProperty('created')));
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $document->getProperty('teaser')));
            $this->addField(Zend_Search_Lucene_Field::Text('title', $document->getProperty('title')));
            $this->addField(Zend_Search_Lucene_Field::Text('author', $document->getProperty('author')));
            $this->addField(Zend_Search_Lucene_Field::UnStored('contents', $document->getProperty('body')));
        }
    }
?>

  
  
   
   As you can see, we’ve made a simple wrapper class which fetches data from the passed in document (which uses DatabaseObject). The generateUrl() function is just a special internal method which determines a document’s full URL. We are storing this when we build the index so we don’t have to generate it each time a search is run (especially since this will never change, and if it does we can just rebuild the index).
  
  

  
  
   
   Building the full index
  
  

  
  
   
   Now that we have our class, we can create our index, loop over the documents, and then save our index:
  
  

  
  
   
   Highlight: PHP
  
  
<?php
    require_once('Zend/Search/Lucene.php');
    require_once('DatabaseObject/PhpriotDocument.class.php');
    // where to save our index
    $indexPath = '/var/www/phpriot.com/data/docindex';
    // get a list of all database ids for documents
    // ($db is assumed to be an already-open database connection)
    $doc_ids = PhpriotDocument::GetDocIds($db);
    // create our index
    $index = new Zend_Search_Lucene($indexPath, true);
    foreach ($doc_ids as $doc_id) {
        // load our databaseobject
        $document = new PhpriotDocument($db);
        $document->loadRecord($doc_id);
        // create our indexed document and add it to the index
        $index->addDocument(new PhpRiotIndexedDocument($document));
    }
    // write the index to disk
    $index->commit();
?>

  
  
   
   The index has now been created! This can take some time if you have many documents or if each document has a large amount of content. We generate this by using PHP on the command line, which allows us to see its progress in real-time if we need to (we also output the title and a status message as each document is indexed).
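
   The status output mentioned above is not shown in the listing. A minimal sketch of what it could look like follows; formatProgress() is a hypothetical helper (not part of phpRiot or the Zend Framework), and the titles are stand-in data rather than real database records.

```php
<?php
// Hypothetical helper: builds one status line per indexed document.
function formatProgress($position, $total, $title)
{
    return sprintf("[%d/%d] indexed: %s", $position, $total, $title);
}

// Stand-in data; in the real script the titles come from the database.
$titles = array('Zend_Search_Lucene basics', 'Managing your data');
$total = count($titles);

foreach ($titles as $i => $title) {
    // ... $index->addDocument(...) would be called here ...
    echo formatProgress($i + 1, $total, $title) . "\n";
}
```

   Run from the command line, each line appears as soon as its document has been handled, so a stalled or slow document is easy to spot.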
  
  

  
  
   
   How querying works in Zend_Search_Lucene
  
  

  
  
   
   Now comes the most important part: actually finding stuff! There are quite a lot of options when it comes to querying your index, giving you and your users a lot of control over the returned results.
  
  

  
  
   
   When we created the index, we added six fields, but only three of those were actually searchable: document title, document content, and the author.
  
  

  
  
   
   Each of these items is stored separately for each indexed document, meaning you can search on them separately. The syntax used to search each field is somewhat like Google’s, in that you specify the field, followed by a colon, followed by the term (with no spaces).
  
  

  
  
   
   So to search the author field for ‘Quentin’, the search query would be author:Quentin. (Note that the search is case-insensitive. To make it case-sensitive we would need to change some options when creating our index. For full details on this, please read the Zend_Search_Lucene manual section on Extensibility.)
  
  

  
  
   
   Likewise, to search in the title field for ‘php’, we would use title:php.
  
  

  
  
   
   As we briefly mentioned earlier in this article, the field that is searched by default is the one called contents. So if you wanted to search the document body for the word ‘google’, you could use contents:google or just google.
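
   The field:term convention can be pictured with a few lines of plain PHP. This is only an illustration of the syntax, not the parser Zend_Search_Lucene uses; splitClause() is a made-up name, and the default of contents is taken from the text above.

```php
<?php
// Illustrative only: splits a single query clause such as "title:php"
// into a field name and a term, falling back to the default field.
function splitClause($clause, $defaultField = 'contents')
{
    $pos = strpos($clause, ':');
    if ($pos === false) {
        // No colon: the whole clause is a term for the default field.
        return array($defaultField, $clause);
    }
    return array(substr($clause, 0, $pos), substr($clause, $pos + 1));
}

print_r(splitClause('author:Quentin')); // field "author", term "Quentin"
print_r(splitClause('google'));         // field "contents", term "google"
```

   Under this reading, contents:google and google end up as the same field and term pair, which is why the two queries behave identically.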
  
  

  
  
   
   Including and excluding terms
  
  

  
  
   
   By default, all specified terms are searched on using a boolean ‘or’. This means that a document is returned if it contains any of the terms. To force results to have a particular term, the plus symbol is used. To force results to not have a particular term, the minus symbol is used. If you’re searching in a different field, you can put the plus or minus either before the field name or before the term. In other words, +author:Quentin and author:+Quentin are identical.
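
   The rules for plus, minus and unmarked terms can be sketched as a small matcher over a document's word list. This is a simplified model written for this article, not the library's implementation; documentMatches() and its parameter names are assumptions, and fields are ignored to keep it short.

```php
<?php
// Illustrative matcher, not the library's implementation.
// $docTerms:   the words found in one document
// $required:   terms prefixed with "+"
// $prohibited: terms prefixed with "-"
// $optional:   terms with no prefix
function documentMatches(array $docTerms, array $required, array $prohibited, array $optional)
{
    foreach ($required as $term) {
        if (!in_array($term, $docTerms)) {
            return false; // a required term is missing
        }
    }
    foreach ($prohibited as $term) {
        if (in_array($term, $docTerms)) {
            return false; // a prohibited term is present
        }
    }
    if (count($required) > 0) {
        return true; // every required term was found
    }
    // No required terms: boolean 'or' over the optional terms.
    return count(array_intersect($optional, $docTerms)) > 0;
}

$doc = array('php', 'ajax', 'tutorial');

// "php mysql": either term is enough, so this document matches.
var_dump(documentMatches($doc, array(), array(), array('php', 'mysql')));

// "+php -ajax": php is present, but so is the prohibited ajax.
var_dump(documentMatches($doc, array('php'), array('ajax'), array()));
```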
  
  

  
  
   
   Searching for phrases
  
  

  
  
   
   It is possible to search for exact phrases with Zend_Search_Lucene, so if you wanted to search for the exact phrase “PHP Articles” you could. Because this is somewhat complicated to achieve, we will not be including it in our examples or implementation. However, there is a lot of information on this in the Zend_Search_Lucene manual section on query types.
  
  

  
  
   
   Sample queries
  
  

  
  
   
   Here are some queries you can pass to Zend_Search_Lucene and their meanings.
  
  

  
  
   
   Highlight: Plain
  
  
php
    // search the index for any article with the word php
 
php -author:quentin
    // find any article with the word php not written by me
 
author:quentin
    // find all the articles by me
 
php -ajax
    // find all articles with the word php that don't have the word ajax
 
title:mysql
    // find all articles with MySQL in the title
 
title:mysql -author:quentin
    // find all articles with MySQL in the title not by me

  
  
   
   And so on. Hopefully you get the idea.
  
  

  
  
   
   Scoring of results
  
  

  
  
   
   All results returned from a search are assigned a score. This is a measure of how well the document matched the search term.
  
  

  
  
   
   The results are ordered by their score, from highest to lowest.
  
  

  
  
   
   I’m not exactly sure how the score is calculated or what it represents exactly, but it looks pretty on the search results.
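
   As a rough intuition only: engines in the Lucene family generally rank by how often a term occurs in a document (term frequency) and how rare the term is across all documents (inverse document frequency). The sketch below is a simplified teaching version of that idea and should not be read as the exact formula Zend_Search_Lucene applies; toyScore() and the sample documents are invented for the example.

```php
<?php
// Simplified TF-IDF style score for a single term. This is a teaching
// sketch, not the formula Zend_Search_Lucene actually uses.
function toyScore($term, array $docTerms, array $allDocs)
{
    // Term frequency: how often the term appears in this document.
    $tf = 0;
    foreach ($docTerms as $word) {
        if ($word === $term) {
            $tf++;
        }
    }

    // Document frequency: how many documents contain the term at all.
    $df = 0;
    foreach ($allDocs as $terms) {
        if (in_array($term, $terms)) {
            $df++;
        }
    }

    // Rare terms get a larger weight than common ones.
    $idf = log(count($allDocs) / ($df + 1)) + 1;

    return sqrt($tf) * $idf;
}

$docs = array(
    'a' => array('php', 'php', 'lucene'),
    'b' => array('php', 'mysql'),
    'c' => array('ajax'),
);

$scores = array();
foreach ($docs as $id => $terms) {
    $scores[$id] = toyScore('php', $terms, $docs);
}
arsort($scores); // highest score first
print_r($scores);
```

   Document a mentions the term twice and therefore ranks above b, while c does not mention it at all and scores zero.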
  
  

  
  
   
   You can customize the scoring algorithm (and hence the ordering of results). Please see the section later in the article on extending Zend_Search_Lucene.
  
  

  
  
   
   Querying our index
  
  

  
  
   
   On the previous page we looked at how to write queries to search the index. We learned how to include and exclude terms, and also how to search different fields in our indexed data.
  
  

  
  
   
   Now we will look at actually pulling documents from our index using such queries.
  
  

  
  
   
   There are essentially two ways to query the index: passing the raw query in and letting Zend_Search_Lucene parse it (ideal when you’re writing a search engine where you’re not sure what the user will enter), or manually building up the query with API function calls.
  
  

  
  
   
   In either case, you use the find() method on the index. The find() method returns a list of matches from your index.
  
  

  
  
   
   Highlight: PHP
  
  
<?php
    require_once('Zend/Search/Lucene.php');
    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);
    $hits = $index->find('php +author:Quentin');
?>

  
  
   
   This sample code searches our index for articles containing ‘php’, written by me. Note that when we opened our index, we did not pass the second parameter as we did when we created the index. This is because we are not writing the index, we are querying it.
  
  

  
  
   
   We could also manually build this same query with function calls like so:
  
  

  
  
   
   Highlight: PHP
  
  
<?php
    require_once('Zend/Search/Lucene.php');
    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);
    $query = new Zend_Search_Lucene_Search_Query_MultiTerm();
    $query->addTerm(new Zend_Search_Lucene_Index_Term('php'), null);
    $query->addTerm(new Zend_Search_Lucene_Index_Term('Quentin', 'author'), true);
    $hits = $index->find($query);
?>

  
  
   
   The second parameter of addTerm() determines whether or not a term is required. True means it is required (like putting a plus sign before the term), false means it is prohibited (like putting a minus sign before the term), and null means it is neither required nor prohibited.
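
   The mapping between the query prefixes and these three values can be written down directly. signFlag() below is a hypothetical helper used only to illustrate the correspondence; it is not part of the Zend Framework.

```php
<?php
// Hypothetical helper: maps the prefix of a query clause to the value
// described above for the second argument of addTerm().
function signFlag($clause)
{
    if (strlen($clause) > 0 && $clause[0] === '+') {
        return true;   // required
    }
    if (strlen($clause) > 0 && $clause[0] === '-') {
        return false;  // prohibited
    }
    return null;       // neither required nor prohibited
}

foreach (array('php', '+author:Quentin', '-ajax') as $clause) {
    var_dump(signFlag($clause));
}
```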
  
  

  
  
   
   The second parameter of Zend_Search_Lucene_Index_Term specifies the field to search. By default this is contents.
  
  

  
  
   
   On the whole, it is easier to simply allow Zend_Search_Lucene to parse the query.
  
  

  
  
   
   Dealing with returned results
  
  

  
  
   
   The results found from your query are returned in an array, meaning you can simply use count() on the array to determine the number of hits.
  
  

  
  
   
   Each stored field is available as a property of the hit object.
  
  

  
  
   
   So to loop over the results as we indexed them previously (with a title, author and teaser), we would do the following:
  
  

  
  
   
   Highlight: PHP
  
  
<?php
    require_once('Zend/Search/Lucene.php');
    $query = 'php +author:Quentin';
    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);
    $hits = $index->find($query);
    $numHits = count($hits);
?>
<p>
    Found <?=$numHits?> result(s) for query <?=$query?>.
</p>
<?php foreach ($hits as $hit) { ?>
    <h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
    <p>
        By <?=$hit->author?>
    </p>
    <p>
        <?=$hit->teaser?><br />
        <a href="<?=$hit->url?>">Read more...</a>
    </p>
<?php } ?>

  
  
   
   Here we also used an extra field called ‘score’. As mentioned previously, this is used as an indicator as to how well a document matched the query. Results with the highest score are listed first.
  
  

  
  
   
   Creating a simple search engine
  
  

  
  
   
   Using our code above, we can easily transform this into a simple site search engine. All we need to do is add a form and plug in the submitted query. Let’s assume this script is called search.php:
  
  

  
  
   
   Highlight: PHP
  
  
<?php
    require_once('Zend/Search/Lucene.php');
    // read the submitted search term from the form below
    $query = isset($_GET['query']) ? $_GET['query'] : '';
    $query = trim($query);
    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);
    if (strlen($query) > 0) {
        $hits = $index->find($query);
        $numHits = count($hits);
    }
?>
<form method="get" action="search.php">
    <input type="text" name="query" value="<?=htmlspecialchars($query)?>" />
    <input type="submit" value="Search" />
</form>
<?php if (strlen($query) > 0) { ?>
    <p>
        Found <?=$numHits?> result(s) for query <?=htmlspecialchars($query)?>.
    </p>
    <?php foreach ($hits as $hit) { ?>
        <h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
        <p>
            By <?=$hit->author?>
        </p>
        <p>
            <?=$hit->teaser?><br />
            <a href="<?=$hit->url?>">Read more...</a>
        </p>
    <?php } ?>
<?php } ?>

  
  
   
   Error handling
  
  

  
  
   
   The one thing we haven’t dealt with yet is errors in the search. For instance, if we were to type in ‘title:’ with no term after it, an error would occur. We handle this by catching the Zend_Search_Lucene_Exception exception.
  
  

  
  
   
   Highlight: PHP
  
  
<?php
    require_once('Zend/Search/Lucene.php');
    $query = isset($_GET['query']) ? $_GET['query'] : '';
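
   A minimal sketch of the catching step, under stated assumptions: safeFind() is a hypothetical wrapper, $index is assumed to be any object with a find() method (such as an opened Zend_Search_Lucene index), and AlwaysFailingIndex is a stand-in that exists only so the example can run without the library. The sketch catches the base Exception class, which also covers Zend_Search_Lucene_Exception since every PHP exception class derives from it.

```php
<?php
// Hedged sketch: runs a search and falls back to an empty result list
// if the query cannot be processed.
function safeFind($index, $query)
{
    try {
        return $index->find($query);
    } catch (Exception $ex) {
        // Zend_Search_Lucene_Exception is a subclass of Exception,
        // so it is caught here as well.
        return array();
    }
}

// Stand-in index used only to demonstrate the helper without the library.
class AlwaysFailingIndex
{
    public function find($query)
    {
        throw new Exception('could not process query: ' . $query);
    }
}

$hits = safeFind(new AlwaysFailingIndex(), 'title:');
echo count($hits) . " result(s)\n";
```

   Returning an empty array keeps the rest of the page unchanged: count() and the foreach loop over the hits work exactly as they do for a successful search with no matches.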
    
    
     
     Creating a fulltext search engine in PHP 5 with the Zend Framework's Zend_Search_Lucene --http://www.phpriot.com/d/articles/php/search/zend-search-lucene/index.html
    
    

    
    
     
     By Quentin Zervaas, 27 April 2006 
    
    

    
    
     
     This article covers the implementation of a fulltext search engine using PHP 5 and the Zend Framework. We will be using the Zend_Search_Lucene component to create and search our fulltext index.
    
    

    
    
     
     There are several other libraries we could use instead of this one, but Zend_Search_Lucene is completely native to PHP, whereas others such as Xapian or Tsearch2 rely on third party modules (for instance, the Tsearch2 module must be compiled into your PostgreSQL installation).
    
    

    
    
     
     It must be noted at this point though that we require at least PHP 5 for Zend_Search_Lucene – PHP 4 will not work.
    
    

    
    
     
     In this article we will be covering the following:
    
    

    
    How to index a document or series of documents 
    
The different types of fields that can be indexed 


    
    Searching the index 


    
    
     
     To demonstrate this functionality, we will cover the implementation of a search engine into phpRiot. We previously used the Tsearch2 module but had some problems that we were unable to overcome.
    
    

    
    
     
     How fulltext indexing and querying works
    
    

    
    
     
     Just in case you’ve never dealt with the subject before, I will briefly cover how the whole thing works.
    
    

    
    
     
     The index essentially stores a special copy of all your data, but instead of storing the actual content of your documents, it stores a list of keywords found in each document.
    
    

    
    
     
     So if I were to create an index of all the documents on phpRiot, this would be done by looping over every document, finding the keywords and then adding those words to the document.
    
    

    
    
     
     Additionally, since we can’t just list a bunch of keywords to a user when they perform a search, we also store some other data in the index along with the keywords. In the case of phpRiot, we will store the document title, the author, the document URL and a brief summary of the article.
    
    

    
    
     
     Alternatively we just store the document ID, and then proceed to look up those other details from the database using the ID, however it is quicker to do it this way (less queries), and since the title and author are only a few words for each document, there is little overhead in duplicating this data by storing it in our index.
    
    

    
    
     
     Querying the data
    
    

    
    
     
     Once we have created the index, we can then allow people to search it. When somebody types in a search term, we look for those terms in our index, which then points back to 0 or more different documents. Zend_Search_Lucene will also score the results for us, meaning the most relevant matches are returned first.
    
    

    
    
     
     Keeping the index up-to-date
    
    

    
    
     
     If a document is ever updated in the database, this likely means the index will be out-of-date. This means we have to make an extra consideration in our general content management. That is, we need to re-index that document.
    
    

    
    
     
     There are several ways of approaching this. For instance, you could update it in real-time when the document is updated, or you could run an index update at a certain time every day. Each method has its own advantages. For something like phpRiot, updating the index in real time is probably the best way, as the data is not updated very frequently. On the other hand, if you were indexing user submitted comments, it could create a huge load to recreate the index for every comment, as there could be hundreds of comments per day.
    
    

    
    
     
     Getting started
    
    

    
    
     
     The first thing we must do is install the Zend Framework if you have not already done so. It is structured in a similar way to how the Pear file structure is organised. At this stage, the Zend Framework is only in a “preview” phase. At time of writing the current version was Preview 0.1.3.
    
    

    
    
     
     You can download this from 
     
     http://framework.zend.com/download. If you use Subversion, you can also checkout the trunk version which may have newer code in it.
    
    

    
    
     
     I’m not exactly sure where the developers intended the framework to be stored, but like Pear is stored in /usr/local/lib/php, I chose to store it in /usr/local/lib/zend.
    
    

    
    
     
     Highlight: Plain
    
    
$ cd /usr/local/src
$ wget http://framework.zend.com/download/tgz
$ tar -zxf ZendFramework-0.1.3.tar.gz
$ mv ZendFramework-0.1.3/library /usr/local/lib/zend

    
    
     
     So now all that is required is that we add /usr/local/lib/zend to our include path. For instance, my include path directive in httpd.conf for phpRiot looks something like:
    
    

    
    
     
     Highlight: Plain
    
    
php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php

    
    
     
     Which now becomes:
    
    

    
    
     
     Highlight: Plain
    
    
php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php:/usr/local/lib/zend

    
    
     
     Creating our first index
    
    

    
    
     
     The basic process for creating an index is:
    
    

    
    Open the index
    
Add each document
    

    
    Commit (save) the index
    

    
    
     
     The index is stored on the filesystem, so when you open an index, you specify the path where the it is to be stored. This is achieved by instantiating the 
     
     Zend_Search_Lucene class.
    
    

    
    
     
     Highlight: PHP
    
    
<?php
    require_once('Zend/Search/Lucene.php');
    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = newZend_Search_Lucene($indexPath, true);
?>

    
    
     
     You’ll also notice the second parameter in the constructor call. This means the index is created from scratch. If you set this to false (or omit the argument), an existing index is opened. This is done when updating or querying the index. Since we’re creating the index at this stage, we include that parameter.
    
    

    
    
     
     Adding a document to our index
    
    

    
    
     
     Now that we have opened our index, we must add our documents to it. We create a document to be added to the index using the following:
    
    

    
    
     
     Highlight: PHP
    
    
<?php
    $doc = newZend_Search_Lucene_Document();
?>

    
    
     
     The next thing we must do is determine which fields we need to add to our index.
    
    

    
    
     
     There are several different field types that can be stored for a single document. As we mentioned earlier, we want to store the keywords for our document, and also store some extra data that we can use when fetching the data. If we wanted to, we could duplicate the entire document content in our search index, however, this would be overkill and is unnecessary. It’s not as bad to duplicate smaller items such as the author name or document title as it’s only a small amount of data.
    
    

    
    
     
     As you can tell from the previous paragraph, we’re starting to determine which fields we want to store in our index:
    
    

    
    Document content – but we don’t want to duplicate the entire article, as we’re only going to show users a preview of the article in the search results
        
Title – we’re definitely going to include the title in our results
        
Article teaser – even though we’re not storing the content, we do want to store a short summary of the document to display in our search results
        
Author – we will list the author of each document in our search results, plus we want to give users the ability to search by author
        

    
    

    
    Created – We’ll also store a timestamp of when the article was created.
    

    
    
     
     This brings us to the different types of fields we can use for indexing in Zend_Search_Lucene:
    
    

    
    UnIndexed – Data that isn’t available for searching, but is stored with our document (article teaser, article URL and timestamp of creation)
    
UnStored – Data that is available for search, but isn’t stored in the index in full (the document content)
    

    
    Text – Data that is available for search and is stored in full (title and author)
    

    
    
     
     There is also the 
     
     Keyword and 
     
     Binary fields available, but we won’t be using them in this example. Binary data isn’t searchable, but could be used to store an image with an indexed document.
    
    

    
    
     
     To add a field to our indexed document, we use the 
     
     addField() method. When we call this method, we must pass data that has been prepared according to the necessary field type. This is achieved used the Zend_Search_Lucene_Field class with the field type as the static method name.
    
    

    
    
     
     In other words, to create the 
     
     title field data, we use:
    
    

    
    
     
     Highlight: PHP
    
    
<?php
    $data = Zend_Search_Lucene_Field::Text('title', $docTitle);
?>

    
    
     
     Note that we specify the name of the field here. We can later reference this data by this field name, but we will cover this in the section about querying the index.
    
    

    
    
     
     So to add all the data with the field types we just worked out, we would use this:
    
    

    
    
     
     Highlight: PHP
    
    
<?php
    $doc = newZend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $docUrl));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('created', $docCreated));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $docTeaser));
    $doc->addField(Zend_Search_Lucene_Field::Text('title', $docTitle));
    $doc->addField(Zend_Search_Lucene_Field::Text('author', $docAuthor));
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docBody));
?>

    
    
     
     Important note: We added the main search content with a field name of 
     
     contents. This is a special name to Zend_Search_Lucene, which means by default, all queries will search this field. If users instead want to search the title field for ‘foo’, their search query would look like 
     
     title:foo. This will be covered further in the section about querying the index.
    
    

    
    
     
     Finally, we add the document to the index using 
     
     addDocument():
    
    

    
    
     
     Highlight: PHP
    
    
<?php
    $index->addDocument($doc);
?>

    
    
     
     We will cover this code in full in the next chapter (including looping over several documents and adding them all at once).
    
    

    
    
     
     Committing / saving the index
    
    

    
    
     
     Once all documents have been added, the index must be saved.
    
    

    
    
     
     Highlight: PHP
    
    
<?php
    $index->commit();
?>

    
    
     
     You can call commit after adding each document, but this operation is expensive and each call results in a new index segment (in lamen’s terms, this means it’s an extra file to search when querying the index). Note that better management of the segments in the index is a planned future improvement, but for the time being it’s best to be careful and call it once after you’ve done what you need to do with the index.
    
    

    
    
     
     If you were to look at your filesystem now that you have committed the index, you would notice a directory at the location you specified when opening the index. This directory contains a number of files which are then used when querying the index.
    
    

    
    
     
     Next up we will piece all the above code into a single script which indexes all the articles on phpRiot.
    
    

    
    
     
     Indexing all the articles on phpRiot
    
    

    
    
     
     Now that you’ve seen how to create a basic index, we will extend this script slightly so it can index all the documents in phpRiot.
    
    

    
    
     
     Additionally, we will be extending the base Zend_Search_Lucene_Document class to simplify our code slightly. This will also demonstrate ways you can take advantage of the OOP style of programming that Zend_Search_Lucene uses.
    
    

    
    
     
     Since we are demonstrating the indexing of phpRiot articles, we will also use the DatabaseObject class here to fetch article data, just as phpRiot does. You don’t really need to know how this class works to understand this example (as it is fairly self-explanatory in the function calls), but if you are interested, you can read our article 
     
     Managing your data with DatabaseObject.
    
    

    
    
     
     Extending Zend_Search_Lucene_Document
    
    

    
    
     
     On the previous page, after we opened the index, we created a new instance of Zend_Search_Lucene_Document to hold the index data for a single document. Instead of calling this class directly, we’re going to extend this class to encapsulate all of the adding of data we also did.
    
    

    
    
     
     In other words, we’re going to move the calls to 
     
     addField into our class, rather than calling it for each field after we create our Zend_Search_Lucene_Document item.
    
    

    
    
     
     Highlight: PHP
    
    
<?php
    classPhpRiotIndexedDocumentextendsZend_Search_Lucene_Document
    {
        /**
         * Constructor. Creates our indexable document and adds all
         * necessary fields to it using the passed in DatabaseObject
         * holding the article data.
         */
        publicfunction__construct(&$document)
        {
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('url',     $document->generateUrl()));
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('created', $document->getProperty('created')));
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $document->getProperty('teaser')));
            $this->addField(Zend_Search_Lucene_Field::Text('title', $document->getProperty('title')));
            $this->addField(Zend_Search_Lucene_Field::Text('author', $document->getProperty('author')));
            $this->addField(Zend_Search_Lucene_Field::UnStored('contents', $document->getProperty('body')));
        }
    }
?>

    
    
     
     As you can see, we’ve made a simple wrapper class which fetches data from the passed in document (which uses DatabaseObject). The 
     
     generateUrl() function is just a special internal method which determines a document’s full URL. We are storing this when we build the index so we don’t have to generate it each time a search is run (especially since this will never change, and if it does we can just rebuild the index).
    
    

    
    
     
     Building the full index
    
    

    
    
     
     Now that we have our class, we can create our index, loop over the documents, and then save our index:
    
    

    
    
     
     Highlight: PHP
    
    
<?php
    require_once('Zend/Search/Lucene.php');
    require_once('DatabaseObject/PhpriotDocument.class.php');
    // where to save our index
    $indexPath = '/var/www/phpriot.com/data/docindex';
    // get a list of all database ids for documents
    $doc_ids = PhpriotDocument::GetDocIds($db);
    // create our index
    $index = newZend_Search_Lucene($indexPath, true);
    foreach($doc_idsas$doc_id){
        // load our databaseobject
        $document = newPhpriotDocument($db);
        $document->loadRecord($doc_id);
        // create our indexed document and add it to the index
        $index->addDocument(newPhpRiotIndexedDocument($document));
    }
    // write the index to disk
    $index->commit();
?>

    
    
     
     The index has now been created! This can take some time if you have many documents or if each document has a large amount of content. We generate this by using PHP on the command line, which allows us to see its progress in real-time if we need to (we also output the title and a status message as each document is indexed).
    
    

    
    
     
     How querying works in Zend_Search_Lucene
    
    

    
    
     
     Now comes the most important part—actually finding stuff! There are quite a lot of options when it comes to querying your index, allowing you and your users to have a lot of control in the returned results.
    
    

    
    
     
     When we created the indexed, we added six fields, but only three of those were actually searchable: document title, document content, and the author.
    
    

    
    
     
     Each of these items are stored separately for each indexed documents, meaning you can search on them separately. The syntax used to search each section is somewhat like Google’s, in that you specify the field, followed by a colon, followed by the term (with no spaces).
    
    

    
    
     
     So to search the author field for ‘Quentin’, the search query would be 
     
     author:Quentin. (Note that the search is case-insensitive. To make it case-sensitive we would need to change some options when creating our index. For full details on this, please read the 
     
     Zend_Search_Lucene manual section on Extensibility)
    
    

    
    
     
     Likewise, to search in the title field for ‘php’, we would use 
     
     title:php.
    
    

    
    
     
     As we briefly mentioned earlier in this article, the default section that is searched in is the field called 
     
     contents. So if you wanted to search the document body for the world ‘google’, you could use 
     
     contents:google or just 
     
     google.
    
    

    
    
     
     Including and excluding terms
    
    

    
    
     
     By default, all specified terms are searched on using a boolean ‘or’. This means that any of the terms can exist for a document to be returned. To force results to have a particular term, the plus symbol is used. This force results to not have a particular term, the minus symbol is used. If you’re searching in a different field, you can put the plus or minus either before the field name or the term name. In other words, +author:Quentin and author:+Quentin are identical.
    
    

    
    
     
     Searching for phrases
    
    

    
    
     
     It is possible to search for exact phrases with Zend_Search_Lucene, so if you wanted to search for the exact phrase “PHP Articles” you could. Because this is somewhat complicated to achieve, we will not be including it in our examples or implementation. However, there is a lot of information on this in the Zend_Search_Lucene manual section on query types.
    
    

    
    
     
     Sample queries
    
    

    
    
     
     Here are some queries you can pass to Zend_Search_Lucene and their meanings.
    
    

    
    
     
     Highlight: Plain
    
    
php
    // search the index for any article with the word php
 
php -author:quentin
    // find any article with the word php not written by me
 
author:quentin
    // find all the articles by me
 
php -ajax
    // find all articles with the word php that don't have the word ajax
 
title:mysql
    // find all articles with MySQL in the title
 
title:mysql -author:quentin
    // find all articles with MySQL in the title not by me

    
    
     
     And so on. Hopefully you get the idea.
    
    

    
    
     
     Scoring of results
    
    

    
    
     
     All results returned from a search are assigned a score. This is a measure of how well the document matched the search term.
    
    

    
    
     
     The results are ordered by their score, from highest to lowest.
    
    

    
    
     
     I’m not exactly sure how the score is calculated or what it represents exactly, but it looks pretty on the search results.
    
    

    
    
     
     You can customize the scoring algorithm (and hence the ordering of results). Please see the section later in the article on extending Zend_Search_Lucene.
    
    

    
    
     
     Querying our index
    
    

    
    
     
     On the previous page we looked at how to write queries to search the index. We learned how to include and exclude terms, and also how to search different fields in our indexed data.
    
    

    
    
     
     Now we will look at actually pulling documents from our index using such a query.
    
    

    
    
     
     There are essentially two ways to query the index: passing the raw query in and letting Zend_Search_Lucene parse it (ideal when you’re writing a search engine where you’re not sure what the user will enter), or manually building up the query with API function calls.
    
    

    
    
     
     In either case, you use the find() method on the index. The find() method returns a list of matches from your index.
    
    

    
    
     
     Highlight: PHP
    
    
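<?php
    require_once('Zend/Search/Lucene.php');
    // path to the index we built earlier
    $indexPath = '/var/www/phpriot.com/data/docindex';
    // open the existing index (no second argument, so nothing is recreated)
    $index = new Zend_Search_Lucene($indexPath);
    // let Zend_Search_Lucene parse the raw query string
    $hits = $index->find('php +author:Quentin');
?>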

    
    
     
     This sample code searches our index for articles written by me that mention ‘php’. Strictly speaking, only the author term is mandatory here, since ‘php’ has no plus sign in front of it. Note that when we opened our index, we did not pass the second parameter as we did when we created the index. This is because we are not writing the index, we are querying it.
    
    

    
    
     
     We could also manually build this same query with function calls like so:
    
    

    
    
     
     Highlight: PHP
    
    
<?php
    require_once('Zend/Search/Lucene.php');
    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);
    $query = new Zend_Search_Lucene_Search_Query_MultiTerm();
    // 'php' in the default field, neither required nor prohibited
    $query->addTerm(new Zend_Search_Lucene_Index_Term('php'), null);
    // 'Quentin' in the author field, required
    $query->addTerm(new Zend_Search_Lucene_Index_Term('Quentin', 'author'), true);
    $hits = $index->find($query);
?>

    
    
     
     The second parameter of addTerm() determines whether or not a term is required. true means it is required (like putting a plus sign before the term), false means it is prohibited (like putting a minus sign before the term), and null means it is neither required nor prohibited.
    
    

    
    
     
     The second parameter of Zend_Search_Lucene_Index_Term specifies the field to search. By default this is contents.
    
    

    
    
     
     On the whole, it is easier to simply allow Zend_Search_Lucene to parse the query.
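To show how the query syntax lines up with the arguments used when building a query by hand, here is a small sketch in plain PHP. The helper describeWord() is hypothetical and is not part of Zend_Search_Lucene. It only understands a leading + or - and a single field prefix, and it returns the text, field and sign that one would then hand to Zend_Search_Lucene_Index_Term and addTerm().

```php
<?php
// Hypothetical helper, for illustration only.
// Converts a word such as "+author:Quentin" into
// array('text' => ..., 'field' => ..., 'sign' => ...), where sign follows
// the addTerm() convention: true (required), false (prohibited), null (neither).
function describeWord($word)
{
    $sign = null;
    if ($word !== '' && ($word[0] === '+' || $word[0] === '-')) {
        $sign = ($word[0] === '+');
        $word = substr($word, 1);
    }
    $field = 'contents'; // the default field
    $pos = strpos($word, ':');
    if ($pos !== false) {
        $field = substr($word, 0, $pos);
        $word  = substr($word, $pos + 1);
    }
    return array('text' => $word, 'field' => $field, 'sign' => $sign);
}

foreach (explode(' ', 'php +author:Quentin -ajax') as $word) {
    print_r(describeWord($word));
}
```

The built-in parser copes with much more than this (phrases, for example), which is one reason letting it do the work is usually the simpler choice.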
    
    

    
    
     
     Dealing with returned results
    
    

    
    
     
     The results found from your query are returned in an array, meaning you can simply use count() on the array to determine the number of hits.
    
    

    
    
     
     Each of the stored fields is available as a property of the hit.
    
    

    
    
     
     So to loop over the results as we indexed them previously (with a title, author and teaser), we would do the following:
    
    

    
    
     
     Highlight: PHP
    
    
<?php
    require_once('Zend/Search/Lucene.php');
    $query = 'php +author:Quentin';
    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);
    $hits = $index->find($query);
    $numHits = count($hits);
?>
<p>
    Found <?=$numHits?> result(s) for query <?=htmlspecialchars($query)?>.
</p>
<?php foreach ($hits as $hit) { ?>
    <h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
    <p>
        By <?=$hit->author?>
    </p>
    <p>
        <?=$hit->teaser?><br />
        <a href="<?=$hit->url?>">Read more...</a>
    </p>
<?php } ?>

    
    
     
     Here we also used an extra field called ‘score’. As mentioned previously, this is used as an indicator as to how well a document matched the query. Results with the highest score are listed first.
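Since the stored values end up inside HTML, it can be worth escaping them on the way out. The following is a sketch only: it uses a stdClass object with made-up sample values as a stand-in for a real hit, and renderHit() is a hypothetical helper name. It treats every value as plain text; if your teaser is meant to contain markup, you would handle that field differently.

```php
<?php
// Stand-in for one hit; real hits expose the stored fields and the
// score as properties in the same way. The values are made-up samples.
$hit = new stdClass();
$hit->title  = 'PHP & MySQL';
$hit->author = 'Quentin';
$hit->url    = '/articles/php-mysql';
$hit->score  = 0.8731;

// Hypothetical helper: renders one hit as HTML, escaping every text value
// and rounding the score to two decimals for display.
function renderHit($hit)
{
    return sprintf(
        '<h3><a href="%s">%s</a> (score: %.2f)</h3><p>By %s</p>',
        htmlspecialchars($hit->url),
        htmlspecialchars($hit->title),
        $hit->score,
        htmlspecialchars($hit->author)
    );
}

echo renderHit($hit), "\n";
```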
    
    

    
    
     
     Creating a simple search engine
    
    

    
    
     
     Using our code above, we can easily transform this into a simple site search engine. All we need to do is add a form and plug in the submitted query. Let’s assume this script is called search.php:
    
    

    
    
     
     Highlight: PHP
    
    
<?php
    require_once('Zend/Search/Lucene.php');
    // read the submitted query from the form (empty if nothing was submitted)
    $query = isset($_GET['query']) ? $_GET['query'] : '';
    $query = trim($query);
    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);
    if (strlen($query) > 0) {
        $hits = $index->find($query);
        $numHits = count($hits);
    }
?>
<form method="get" action="search.php">
    <input type="text" name="query" value="<?=htmlSpecialChars($query)?>" />
    <input type="submit" value="Search" />
</form>
<?php if (strlen($query) > 0) { ?>
    <p>
        Found <?=$numHits?> result(s) for query <?=htmlSpecialChars($query)?>.
    </p>
    <?php foreach ($hits as $hit) { ?>
        <h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
        <p>
            By <?=$hit->author?>
        </p>
        <p>
            <?=$hit->teaser?><br />
            <a href="<?=$hit->url?>">Read more...</a>
        </p>
    <?php } ?>
<?php } ?>

      
      
       
       Error handling
      
      

      
      
       
The one thing we haven’t dealt with yet is errors in the search. For instance, if we were to type in ‘title:’ with no term after it, an error would occur. We handle this by catching the Zend_Search_Lucene_Exception exception.
      
      

      
      
       
       Highlight: PHP
      
      
<?php
    require_once('Zend/Search/Lucene.php');
    $query = isset($_GET['query']) ? $_GET['query'] : '';
    $query = trim($query);
    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);
    try {
        $hits = $index->find($query);
    }
    catch (Zend_Search_Lucene_Exception $ex) {
        // treat a query that cannot be parsed as a search with no results
        $hits = array();
    }
    $numHits = count($hits);
?>

      
      
       
       This means now that if an error occurs in the search, we simply assume zero hits were returned, thereby handling the error without indicating to the user that anything went wrong.
      
      

      
      
       
       Of course, you could also choose to get the error message from the exception and output that instead ($ex->getMessage()).
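To make that second option concrete, here is a rough sketch of a wrapper that returns the hits together with the error message, so the page can decide whether to show it. The names runSearch and $fakeFind are invented for this example, and the closure only stands in for the real search call so the snippet runs without the framework. In a real script the catch would name Zend_Search_Lucene_Exception, as in the listing above:

```php
<?php
// Hypothetical wrapper: runs a search callback and returns the hits together
// with an error message (null on success) instead of letting the exception escape.
function runSearch($search, $query)
{
    try {
        return array('hits' => call_user_func($search, $query), 'error' => null);
    } catch (Exception $ex) {
        return array('hits' => array(), 'error' => $ex->getMessage());
    }
}

// Stand-in for the real index search: rejects a field name with no term after it.
$fakeFind = function ($query) {
    if (substr($query, -1) === ':') {
        throw new Exception('Missing term after field name');
    }
    return array('hit for ' . $query);
};

$result = runSearch($fakeFind, 'title:');
if ($result['error'] !== null) {
    echo 'Sorry, your query could not be understood: '
       . htmlspecialchars($result['error'], ENT_QUOTES, 'UTF-8');
}
```

Whether to show the raw message is a judgment call: it helps users fix a mistyped query, but library messages are not always written with end users in mind.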
      
      

      
      
       
       Keeping the index up-to-date
      
      

      
      
       
The other thing we haven’t yet dealt with is what to do when any of our documents are updated. There are several ways to handle this:
      
      

      
Update just the entry for the updated document straight away
Rebuild the entire index straight away whenever a document is updated
Rebuild the entire index at a certain time each day (or several times per day)
    

      
      
       
The ideal method really depends on the kind of data you have, how often it is updated, and how important it is for the search index to be completely up-to-date.
      
      

      
      
       
       To be honest, I haven’t figured out a way to update a single document in the index yet, but I may just be missing something simple. If you open the index (without the second parameter), and then index a document that is already in the index, then it will be duplicated (and hence returned twice in any matching search).
      
      

      
      
       
       There isn’t too much documentation on this library yet, at least nothing about this specifically. If anybody knows how, please drop me an email or submit a comment with this article.
      
      

      
      
       
       So at this point, the way to keep the index updated is to rebuild it from scratch when a document is updated.
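If you would rather not rebuild on every single save, one possible approach is to rebuild on a schedule, and only when something has changed since the last build. The sketch below assumes you can get the time of the last rebuild and each document's last-modified time as Unix timestamps. The function name indexNeedsRebuild and the sample dates are made up for illustration:

```php
<?php
// Hypothetical check: the index is stale when any document was modified
// after the index was last rebuilt. Both arguments use Unix timestamps.
function indexNeedsRebuild($lastBuilt, array $documentUpdateTimes)
{
    foreach ($documentUpdateTimes as $updated) {
        if ($updated > $lastBuilt) {
            return true;
        }
    }
    return false;
}

$lastBuilt = strtotime('2006-04-27 03:00:00');
$updates   = array(
    strtotime('2006-04-20 10:15:00'),
    strtotime('2006-04-27 09:30:00'), // edited after the last rebuild
);

if (indexNeedsRebuild($lastBuilt, $updates)) {
    // This is where the full indexing script would be run again,
    // recreating the index from scratch.
    echo "Index is out of date, rebuild required\n";
}
```

Run from cron, a check like this keeps the expensive rebuild away from page requests while still limiting how long the index can be out of date.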
      
      

      
      
       
       Extending Zend_Search_Lucene
      
      

      
      
       
       There are several aspects of Zend_Search_Lucene that can be extended, allowing a fully customized search solution. These include:
      
      

      
A custom tokenizer for determining keywords in a document
Custom scoring algorithms to determine how well a document matches a search query
A custom storage method, so your index is stored however and wherever you please
    

      
      
       
       A custom tokenizer
      
      

      
      
       
       There are many reasons why a custom tokenizer can be useful. Here are some ideas:
      
      

      
PDF tokenizer – a tokenizer that can parse a PDF file and find all the keywords
Image tokenizer – a tokenizer that can perform Optical Character Recognition (OCR), thereby allowing you to index words in an image (and you could store the image also, using the Binary field type)
HTML tokenizer – a tokenizer that can read HTML data, thereby knowing not to index HTML keywords but only the actual content. You could make further improvements on this also, such as finding all headings and giving them higher preference than the rest of the content.
    

      
      
       
More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.analysis.
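To give a feel for what the HTML idea involves, here is a minimal sketch of the text-extraction step in plain PHP. It is not written against the Zend_Search_Lucene analysis classes. The function name htmlToKeywords is made up, and the splitting rule (ASCII letters and digits only) is a simplifying assumption:

```php
<?php
// Hypothetical pre-processing step: reduce an HTML fragment to lowercase
// keywords so that tag names and attributes never reach the index.
// Only ASCII letters and digits are treated as word characters here.
function htmlToKeywords($html)
{
    // Drop script and style blocks entirely; their contents are not prose.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);
    // Put a space before every tag so neighbouring words stay separate
    // once the tags themselves are removed.
    $text = strip_tags(str_replace('<', ' <', $html));
    // Turn entities such as &amp; back into the characters they represent.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // Split on anything that is not a letter or digit, discarding empty pieces.
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

print_r(htmlToKeywords('<h1>Zend Search</h1><p class="intro">Fulltext &amp; PHP 5</p>'));
```

The simplest way to use something like this is to run it over the article body before passing the text to the contents field, which avoids writing a custom analyzer at all.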
      
      

      
      
       
       Custom scoring algorithms
      
      

      
      
       
       Using a custom scoring algorithm, you can determine how favourably different fields in a document are looked upon. For example, you might want to treat matches in the ‘title’ field (if you have one) much more favourably than matches in the ‘contents’ field.
      
      

      
      
       
More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.scoring.
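The idea of favouring one field over another can be illustrated without the framework. The following is only a toy sketch of field weighting, not the scoring formula Zend_Search_Lucene uses and not its extension API. The function name weightedScore and the weights themselves are invented for the example:

```php
<?php
// Hypothetical scoring rule: count how often the term occurs in each field
// and multiply by a per-field weight, so title matches outrank body matches.
function weightedScore(array $fields, $term, array $weights)
{
    $score = 0;
    foreach ($fields as $name => $text) {
        // Fields without an explicit weight count once per occurrence.
        $weight = isset($weights[$name]) ? $weights[$name] : 1;
        $score += $weight * substr_count(strtolower($text), strtolower($term));
    }
    return $score;
}

$weights = array('title' => 5, 'contents' => 1); // illustrative values only

$a = array('title' => 'Using MySQL with PHP', 'contents' => 'An introduction.');
$b = array('title' => 'Sessions in PHP', 'contents' => 'Store sessions in MySQL or MySQL Cluster.');

echo weightedScore($a, 'mysql', $weights), "\n"; // one match in the title
echo weightedScore($b, 'mysql', $weights), "\n"; // two matches in the body
```

With these weights the first document ranks higher even though the second mentions the term more often, which is the effect described above.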
      
      

      
      
       
       Custom storage method
      
      

      
      
       
       You can change how indexes are written to disk by extending the Zend_Search_Lucene_Storage_Directory and Zend_Search_Lucene_Storage_File classes.
      
      

      
      
       
It may or may not be possible to change this to store all indexed data in a database, but I haven’t actually tried this so I’m not sure.
      
      

      
      
       
More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.storage.
      
      

      
      
       
       Conclusion
      
      

      
      
       
       In this article we looked closely at Zend_Search_Lucene, part of the new Zend Framework. While it is still in development, it works sufficiently well to implement fulltext indexing and searching on any PHP 5 website.
      
      

      
      
       
       We also looked briefly at some ways of extending the search capabilities.
      
      

      
      
       
       Hopefully this has given you some ideas for your own website. Feel free to add comments and ideas to this article.
      
      

      
      

      
      
 
       The second parameter for Zend_Search_Lucene_Index_Term specifies the field to search index. By default this is 
       
       contents.
      
      

      
      
       
       On the whole, it is easier to simply allow Zend_Search_Lucene to parse the query.
      
      

      
      
       
       Dealing with returned results
      
      

      
      
       
       The results found from your query are returned in an array, meaning you can simply use count() on the array to determine the number of hits.
      
      

      
      
       
       Each of the indexed fields are available as a class property.
      
      

      
      
       
       So to loop over the results as we indexed them previously (with a title, author and teaser), we would do the following:
      
      

      
      
       
       Highlight: PHP
      
      
<?php
    require_once('Zend/Search/Lucene.php');
    $query = 'php +author:Quentin';
    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);
    $hits = $index->query($query);
    // the hits come back as an array, so count() gives the number of results
    $numHits = count($hits);
?>
<p>
    Found <?=$numHits?> result(s) for query <?=$query?>.
</p>
<?php foreach ($hits as $hit) { ?>
    <h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
    <p>
        By <?=$hit->author?>
    </p>
    <p>
        <?=$hit->teaser?><br />
        <a href="<?=$hit->url?>">Read more...</a>
    </p>
<?php } ?>

      
      
       
       Here we also used an extra field called ‘score’. As mentioned previously, this is used as an indicator as to how well a document matched the query. Results with the highest score are listed first.
      
      

      
      
       
       Creating a simple search engine
      
      

      
      
       
       Using our code above, we can easily transform this into a simple site search engine. All we need to do is add a form and plug in the submitted query. Let’s assume this script is called search.php:
      
      

      
      
       
       Highlight: PHP
      
      
<?php
    require_once('Zend/Search/Lucene.php');
    // (the rest of the search.php listing is missing from this copy)
?>
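
A page along those lines might look like the following sketch. It is my own reconstruction under stated assumptions, not the original listing: renderSearchPage() is a made-up helper, the hits are plain objects standing in for real search hits, and the way the query is read from $_GET is only a guess based on the description above. In the real script $hits would come from the index, as in the previous example.

```php
<?php
// Hypothetical sketch of search.php: a form plus the result loop.
// $hits is any list of objects with title, author, teaser and url properties.
function renderSearchPage($query, array $hits)
{
    $q = htmlspecialchars($query, ENT_QUOTES);
    $html  = '<form method="get" action="search.php">';
    $html .= '<input type="text" name="query" value="' . $q . '" />';
    $html .= '<input type="submit" value="Search" />';
    $html .= '</form>';
    if ($query === '') {
        return $html; // nothing submitted yet, so show only the form
    }
    $html .= '<p>Found ' . count($hits) . ' result(s) for query ' . $q . '.</p>';
    foreach ($hits as $hit) {
        $html .= '<h3>' . htmlspecialchars($hit->title, ENT_QUOTES) . '</h3>';
        $html .= '<p>By ' . htmlspecialchars($hit->author, ENT_QUOTES) . '</p>';
        $html .= '<p>' . htmlspecialchars($hit->teaser, ENT_QUOTES) . '<br />';
        $html .= '<a href="' . htmlspecialchars($hit->url, ENT_QUOTES) . '">Read more...</a></p>';
    }
    return $html;
}

// Assumed way of reading the submitted query (empty when nothing was sent).
$query = isset($_GET['query']) ? trim($_GET['query']) : '';
```

Escaping the query and the stored fields before output is my addition; the submitted query is user input and should not be echoed back as raw HTML.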

      
      
       
       Error handling
      
      

      
      
       
       The one thing we haven’t dealt with yet is errors in the search. For instance, if we were to type in ‘title:’ with no term after it, an error would occur. We handle this by catching the Zend_Search_Lucene_Exception exception.
      
      

      
      
       
       Highlight: PHP
      
      
[the error-handling listing is missing from this copy]

      
      
       
       This means now that if an error occurs in the search, we simply assume zero hits were returned, thereby handling the error without indicating to the user that anything went wrong.
      
      

      
      
       
       Of course, you could also choose to get the error message from the exception and output that instead ($ex->getMessage()).
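
The pattern can be sketched in plain PHP as follows. This is only an illustration: searchOrEmpty() and fakeSearch() are made-up names, and the sketch catches the generic Exception class so it runs without the library. With the real index you would wrap the query call and catch Zend_Search_Lucene_Exception, as described above.

```php
<?php
// Hypothetical wrapper: runs a search callback and treats any
// exception as "zero hits", keeping the message in $error.
function searchOrEmpty($search, $query, &$error = null)
{
    try {
        return call_user_func($search, $query);
    } catch (Exception $ex) {
        $error = $ex->getMessage(); // available if you want to display it
        return array();             // a failed search counts as no results
    }
}

// Stand-in for the real search: rejects a field name with no term.
function fakeSearch($query)
{
    if (substr($query, -1) === ':') {
        throw new Exception('Term expected after field name');
    }
    return array('hit for ' . $query);
}

$hits = searchOrEmpty('fakeSearch', 'title:', $error);
// $hits is an empty array and $error holds the exception message
```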
      
      

      
      
       
       Keeping the index up-to-date
      
      

      
      
       
       The other thing we haven’t yet dealt with is what to do when any of our documents is updated. There are several ways to handle this:
      
      

      
      Update just the entry for the updated document straight away
    
      Rebuild the entire index straight away whenever a document is updated
    
      Rebuild the entire index at a certain time each day (or several times per day)
    

      
      
       
       The ideal method really depends on the kind of data you have, how often it is updated, and how important it is for the search index to be completely up-to-date.
      
      

      
      
       
       To be honest, I haven’t figured out a way to update a single document in the index yet, but I may just be missing something simple. If you open the index (without the second parameter), and then index a document that is already in the index, then it will be duplicated (and hence returned twice in any matching search).
      
      

      
      
       
       There isn’t too much documentation on this library yet, at least nothing about this specifically. If anybody knows how, please drop me an email or submit a comment with this article.
      
      

      
      
       
       So at this point, the way to keep the index updated is to rebuild it from scratch when a document is updated.
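
If you rebuild on a schedule rather than on every update, it helps to skip the rebuild when nothing has changed. Here is a minimal sketch of that check, assuming you can get a build timestamp for the index and a last-modified timestamp for each document. indexIsStale() is a made-up name, and where the timestamps come from is up to you.

```php
<?php
// Hypothetical check: true when any document was modified after the
// index was last built. All values are Unix timestamps.
function indexIsStale($indexBuiltAt, array $documentModifiedTimes)
{
    foreach ($documentModifiedTimes as $modifiedAt) {
        if ($modifiedAt > $indexBuiltAt) {
            return true;
        }
    }
    return false;
}

$builtAt = 1146096000;                          // when the index was built
$modified = array(1146000000, 1146100000);      // one document changed later
$needsRebuild = indexIsStale($builtAt, $modified);
// $needsRebuild is true, so the scheduled job should rebuild the index
```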
      
      

      
      
       
       Extending Zend_Search_Lucene
      
      

      
      
       
       There are several aspects of Zend_Search_Lucene that can be extended, allowing a fully customized search solution. These include:
      
      

      
      A custom tokenizer for determining keywords in a document
    
      Custom scoring algorithms to determine how well a document matches a search query
    
      A custom storage method, so your index is stored however and wherever you please
    

      
      
       
       A custom tokenizer
      
      

      
      
       
       There are many reasons why a custom tokenizer can be useful. Here are some ideas:
      
      

      
      PDF tokenizer – a tokenizer that can parse a PDF file and find all the keywords
    
      Image tokenizer – a tokenizer that can perform Optical Character Recognition (OCR), thereby allowing you to index words in an image (and you could store the image also, using the Binary field type)
    
      HTML tokenizer – a tokenizer that can read HTML data, thereby knowing not to index HTML markup but only the actual content. You could improve on this further, for example by finding all headings and giving them more weight than the rest of the content.
    

      
      
       
       More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.analysis.
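
To give a feel for the HTML idea, here is a plain-PHP sketch that pulls keywords out of an HTML fragment while ignoring the markup. It only illustrates the principle; htmlToKeywords() is a made-up function and does not implement the analyzer interface described in the manual.

```php
<?php
// Hypothetical keyword extractor for HTML content.
function htmlToKeywords($html)
{
    // Drop script and style blocks completely.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);
    // Add a space before each tag so neighbouring words do not merge,
    // then remove the tags themselves.
    $text = strip_tags(str_replace('<', ' <', $html));
    // Turn entities such as &amp; back into characters.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // Lowercase and split on anything that is not a letter or digit.
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

$keywords = htmlToKeywords('<h1>PHP Search</h1><p>Fulltext <b>index</b> &amp; query</p>');
// $keywords contains: php, search, fulltext, index, query
```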
      
      

      
      
       
       Custom scoring algorithms
      
      

      
      
       
       Using a custom scoring algorithm, you can determine how favourably different fields in a document are looked upon. For example, you might want to treat matches in the ‘title’ field (if you have one) much more favourably than matches in the ‘contents’ field.
      
      

      
      
       
       More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.scoring.
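
As a rough illustration of field weighting, the sketch below multiplies the number of matches in each field by a weight. This is a toy formula of my own, not the one Zend_Search_Lucene uses, and weightedScore() is a made-up name.

```php
<?php
// Toy scoring: sum of (matches in field) x (weight of field).
// Fields without an explicit weight count as 1.0.
function weightedScore(array $matchCounts, array $fieldWeights)
{
    $score = 0.0;
    foreach ($matchCounts as $field => $count) {
        $weight = isset($fieldWeights[$field]) ? $fieldWeights[$field] : 1.0;
        $score += $weight * $count;
    }
    return $score;
}

$weights = array('title' => 3.0, 'contents' => 1.0);
// One match in the title outranks two matches in the body.
$titleMatch = weightedScore(array('title' => 1, 'contents' => 0), $weights);
$bodyMatch  = weightedScore(array('title' => 0, 'contents' => 2), $weights);
```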
      
      

      
      
       
       Custom storage method
      
      

      
      
       
       You can change how indexes are written to disk by extending the Zend_Search_Lucene_Storage_Directory and Zend_Search_Lucene_Storage_File classes.
      
      

      
      
       
       It may or may not be possible to change this to store all indexed data in a database, but I haven’t actually tried this so I’m not sure.
      
      

      
      
       
       More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.storage.
      
      

      
      
       
       Conclusion
      
      

      
      
       
       In this article we looked closely at Zend_Search_Lucene, part of the new Zend Framework. While it is still in development, it works sufficiently well to implement fulltext indexing and searching on any PHP 5 website.
      
      

      
      
       
       We also looked briefly at some ways of extending the search capabilities.
      
      

      
      
       
       Hopefully this has given you some ideas for your own website. Feel free to add comments and ideas to this article.
      
      

      
      

      
      
 

    
    
     
     Creating a fulltext search engine in PHP 5 with the Zend Framework's Zend_Search_Lucene --http://www.phpriot.com/d/articles/php/search/zend-search-lucene/index.html
    
    

    
    
     
     By Quentin Zervaas, 27 April 2006 
    
    

    
    
     
     This article covers the implementation of a fulltext search engine using PHP 5 and the Zend Framework. We will be using the Zend_Search_Lucene component to create and search our fulltext index.
    
    

    
    
     
     There are several other libraries we could use instead of this one, but Zend_Search_Lucene is completely native to PHP, whereas others such as Xapian or Tsearch2 rely on third party modules (for instance, the Tsearch2 module must be compiled into your PostgreSQL installation).
    
    

    
    
     
     It must be noted at this point though that we require at least PHP 5 for Zend_Search_Lucene – PHP 4 will not work.
    
    

    
    
     
     In this article we will be covering the following:
    
    

    
    How to index a document or series of documents 
    
The different types of fields that can be indexed 
    

    
    Searching the index 
    

    
    
     
     To demonstrate this functionality, we will cover the implementation of a search engine into phpRiot. We previously used the Tsearch2 module but had some problems that we were unable to overcome.
    
    

    
    
     
     How fulltext indexing and querying works
    
    

    
    
     
     Just in case you’ve never dealt with the subject before, I will briefly cover how the whole thing works.
    
    

    
    
     
     The index essentially stores a special copy of all your data, but instead of storing the actual content of your documents, it stores a list of keywords found in each document.
    
    

    
    
     
     So if I were to create an index of all the documents on phpRiot, this would be done by looping over every document, finding the keywords and then adding those words to the index.
    
    

    
    
     
     Additionally, since we can’t just list a bunch of keywords to a user when they perform a search, we also store some other data in the index along with the keywords. In the case of phpRiot, we will store the document title, the author, the document URL and a brief summary of the article.
    
    

    
    
     
     Alternatively, we could store just the document ID and then look up those other details from the database using the ID. However, it is quicker to do it this way (fewer queries), and since the title and author are only a few words for each document, there is little overhead in duplicating this data by storing it in our index.
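
To make the idea concrete, here is a toy index in plain PHP. It has nothing to do with how Zend_Search_Lucene stores its data on disk; toyTokenize() and buildToyIndex() are made-up names, and the two sample documents are invented.

```php
<?php
// Split text into unique lowercase keywords.
function toyTokenize($text)
{
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique($words));
}

// Build a keyword => document ids map, plus the small fields we want
// to show in search results without going back to the database.
function buildToyIndex(array $documents)
{
    $index = array('keywords' => array(), 'stored' => array());
    foreach ($documents as $id => $doc) {
        foreach (toyTokenize($doc['body']) as $word) {
            $index['keywords'][$word][] = $id;
        }
        // Only the display fields are copied, never the full body.
        $index['stored'][$id] = array(
            'title'  => $doc['title'],
            'author' => $doc['author'],
        );
    }
    return $index;
}

$documents = array(
    1 => array('title' => 'PHP sessions', 'author' => 'Quentin', 'body' => 'Sessions in PHP explained'),
    2 => array('title' => 'MySQL joins',  'author' => 'Quentin', 'body' => 'Joins in MySQL and PHP'),
);
$index = buildToyIndex($documents);
// $index['keywords']['php'] lists documents 1 and 2; 'mysql' lists only 2
```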
    
    

    
    
     
     Querying the data
    
    

    
    
     
     Once we have created the index, we can then allow people to search it. When somebody types in a search term, we look for those terms in our index, which then points back to 0 or more different documents. Zend_Search_Lucene will also score the results for us, meaning the most relevant matches are returned first.
    
    

    
    
     
     Keeping the index up-to-date
    
    

    
    
     
     If a document is ever updated in the database, this likely means the index will be out-of-date. This means we have to make an extra consideration in our general content management. That is, we need to re-index that document.
    
    

    
    
     
     There are several ways of approaching this. For instance, you could update it in real-time when the document is updated, or you could run an index update at a certain time every day. Each method has its own advantages. For something like phpRiot, updating the index in real time is probably the best way, as the data is not updated very frequently. On the other hand, if you were indexing user submitted comments, it could create a huge load to recreate the index for every comment, as there could be hundreds of comments per day.
    
    

    
    
     
     Getting started
    
    

    
    
     
     The first thing we must do is install the Zend Framework, if you have not already done so. It is organised in a similar way to the PEAR file structure. At this stage, the Zend Framework is only in a “preview” phase. At the time of writing, the current version was Preview 0.1.3.
    
    

    
    
     
     You can download this from http://framework.zend.com/download. If you use Subversion, you can also check out the trunk version, which may have newer code in it.
    
    

    
    
     
     I’m not exactly sure where the developers intended the framework to be stored, but just as PEAR is stored in /usr/local/lib/php, I chose to store it in /usr/local/lib/zend.
    
    

    
    
     
     Highlight: Plain
    
    
$ cd /usr/local/src
$ wget http://framework.zend.com/download/tgz
$ tar -zxf ZendFramework-0.1.3.tar.gz
$ mv ZendFramework-0.1.3/library /usr/local/lib/zend

    
    
     
     So now all that is required is that we add /usr/local/lib/zend to our include path. For instance, my include path directive in httpd.conf for phpRiot looks something like:
    
    

    
    
     
     Highlight: Plain
    
    
php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php

    
    
     
     Which now becomes:
    
    

    
    
     
     Highlight: Plain
    
    
php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php:/usr/local/lib/zend
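
If you can’t change httpd.conf, the same thing can be done at the top of a script with PHP’s own set_include_path() function. This is a minimal sketch, assuming the framework lives in /usr/local/lib/zend as above.

```php
<?php
// Assumed install location, taken from the steps above.
$zendPath = '/usr/local/lib/zend';

// Append the directory only if it is not already on the include path.
$paths = explode(PATH_SEPARATOR, get_include_path());
if (!in_array($zendPath, $paths)) {
    $paths[] = $zendPath;
    set_include_path(implode(PATH_SEPARATOR, $paths));
}
```

This needs to run before the first require_once of a Zend file.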

    
    
     
     Creating our first index
    
    

    
    
     
     The basic process for creating an index is:
    
    

    
    Open the index
    
Add each document
    

    
    Commit (save) the index
    

    
    
     
     The index is stored on the filesystem, so when you open an index, you specify the path where it is to be stored. This is achieved by instantiating the Zend_Search_Lucene class.
    
    

    
    
     
     Highlight: PHP
    
    
<?php
    require_once('Zend/Search/Lucene.php');
    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath, true);
?>

    
    
     
     You’ll also notice the second parameter in the constructor call. This means the index is created from scratch. If you set this to false (or omit the argument), an existing index is opened. This is done when updating or querying the index. Since we’re creating the index at this stage, we include that parameter.
    
    

    
    
     
     Adding a document to our index
    
    

    
    
     
     Now that we have opened our index, we must add our documents to it. We create a document to be added to the index using the following:
    
    

    
    
     
     Highlight: PHP
    
    
<?php
    $doc = new Zend_Search_Lucene_Document();
?>

    
    
     
     The next thing we must do is determine which fields we need to add to our index.
    
    

    
    
     
     There are several different field types that can be stored for a single document. As we mentioned earlier, we want to store the keywords for our document, and also store some extra data that we can use when fetching the data. If we wanted to, we could duplicate the entire document content in our search index, however, this would be overkill and is unnecessary. It’s not as bad to duplicate smaller items such as the author name or document title as it’s only a small amount of data.
    
    

    
    
     
     As you can tell from the previous paragraph, we’re starting to determine which fields we want to store in our index:
    
    

    
    Document content – but we don’t want to duplicate the entire article, as we’re only going to show users a preview of the article in the search results
        
Title – we’re definitely going to include the title in our results
        
Article teaser – even though we’re not storing the content, we do want to store a short summary of the document to display in our search results
        
Author – we will list the author of each document in our search results, plus we want to give users the ability to search by author
        

    
    

    
    Created – We’ll also store a timestamp of when the article was created.
    

    
    
     
     This brings us to the different types of fields we can use for indexing in Zend_Search_Lucene:
    
    

    
    UnIndexed – Data that isn’t available for searching, but is stored with our document (article teaser, article URL and timestamp of creation)
    
UnStored – Data that is available for search, but isn’t stored in the index in full (the document content)
    

    
    Text – Data that is available for search and is stored in full (title and author)
    

    
    
     
     There are also Keyword and Binary fields available, but we won’t be using them in this example. Binary data isn’t searchable, but could be used to store an image with an indexed document.
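
The three field types we will use can be summarised as a small lookup table. This is just my own way of writing down the list above in plain PHP; the arrays and the searchableFields() helper are not part of the library.

```php
<?php
// What each field type used in this article means.
$fieldTypes = array(
    'UnIndexed' => array('searchable' => false, 'stored' => true),
    'UnStored'  => array('searchable' => true,  'stored' => false),
    'Text'      => array('searchable' => true,  'stored' => true),
);

// Which type each phpRiot field will use.
$fieldPlan = array(
    'url'      => 'UnIndexed',
    'created'  => 'UnIndexed',
    'teaser'   => 'UnIndexed',
    'title'    => 'Text',
    'author'   => 'Text',
    'contents' => 'UnStored',
);

// Hypothetical helper: names of the fields a user can search on.
function searchableFields(array $fieldPlan, array $fieldTypes)
{
    $names = array();
    foreach ($fieldPlan as $name => $type) {
        if ($fieldTypes[$type]['searchable']) {
            $names[] = $name;
        }
    }
    return $names;
}

$searchable = searchableFields($fieldPlan, $fieldTypes);
// $searchable contains: title, author, contents
```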
    
    

    
    
     
     To add a field to our indexed document, we use the addField() method. When we call this method, we must pass data that has been prepared according to the necessary field type. This is achieved using the Zend_Search_Lucene_Field class with the field type as the static method name.
    
    

    
    
     
     In other words, to create the title field data, we use:
    
    

    
    
     
     Highlight: PHP
    
    
<?php
    $data = Zend_Search_Lucene_Field::Text('title', $docTitle);
?>

    
    
     
     Note that we specify the name of the field here. We can later reference this data by this field name, but we will cover this in the section about querying the index.
    
    

    
    
     
     So to add all the data with the field types we just worked out, we would use this:
    
    

    
    
     
     Highlight: PHP
    
    
<?php
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $docUrl));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('created', $docCreated));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $docTeaser));
    $doc->addField(Zend_Search_Lucene_Field::Text('title', $docTitle));
    $doc->addField(Zend_Search_Lucene_Field::Text('author', $docAuthor));
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docBody));
?>

    
    
     
     Important note: We added the main search content with a field name of contents. This is a special name to Zend_Search_Lucene, which means by default, all queries will search this field. If users instead want to search the title field for ‘foo’, their search query would look like title:foo. This will be covered further in the section about querying the index.
    
    

    
    
     
     Finally, we add the document to the index using addDocument():
    
    

    
    
     
     Highlight: PHP
    
    
<?php
    $index->addDocument($doc);
?>

    
    
     
     We will cover this code in full in the next chapter (including looping over several documents and adding them all at once).
    
    

    
    
     
     Committing / saving the index
    
    

    
    
     
     Once all documents have been added, the index must be saved.
    
    

    
    
     
     Highlight: PHP
    
    
<?php
    $index->commit();
?>

    
    
     
     You can call commit after adding each document, but this operation is expensive and each call results in a new index segment (in layman’s terms, this means it’s an extra file to search when querying the index). Note that better management of the segments in the index is a planned future improvement, but for the time being it’s best to be careful and call it once after you’ve done what you need to do with the index.
    
    

    
    
     
     If you were to look at your filesystem now that you have committed the index, you would notice a directory at the location you specified when opening the index. This directory contains a number of files which are then used when querying the index.
    
    

    
    
     
     Next up we will piece all the above code into a single script which indexes all the articles on phpRiot.
    
    

    
    
     
     Indexing all the articles on phpRiot
    
    

    
    
     
     Now that you’ve seen how to create a basic index, we will extend this script slightly so it can index all the documents in phpRiot.
    
    

    
    
     
     Additionally, we will be extending the base Zend_Search_Lucene_Document class to simplify our code slightly. This will also demonstrate ways you can take advantage of the OOP style of programming that Zend_Search_Lucene uses.
    
    

    
    
     
     Since we are demonstrating the indexing of phpRiot articles, we will also use the DatabaseObject class here to fetch article data, just as phpRiot does. You don’t really need to know how this class works to understand this example (as it is fairly self-explanatory in the function calls), but if you are interested, you can read our article 
     
     Managing your data with DatabaseObject.
    
    

    
    
     
     Extending Zend_Search_Lucene_Document
    
    

    
    
     
     On the previous page, after we opened the index, we created a new instance of Zend_Search_Lucene_Document to hold the index data for a single document. Instead of using this class directly, we’re going to extend it so that it also encapsulates the code that adds the fields.
    
    

    
    
     
     In other words, we’re going to move the calls to addField into our class, rather than calling it for each field after we create our Zend_Search_Lucene_Document item.
    
    

    
    
     
     Highlight: PHP
    
    
<?php
    class PhpRiotIndexedDocument extends Zend_Search_Lucene_Document
    {
        /**
         * Constructor. Creates our indexable document and adds all
         * necessary fields to it using the passed in DatabaseObject
         * holding the article data.
         */
        public function __construct(&$document)
        {
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('url',     $document->generateUrl()));
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('created', $document->getProperty('created')));
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $document->getProperty('teaser')));
            $this->addField(Zend_Search_Lucene_Field::Text('title', $document->getProperty('title')));
            $this->addField(Zend_Search_Lucene_Field::Text('author', $document->getProperty('author')));
            $this->addField(Zend_Search_Lucene_Field::UnStored('contents', $document->getProperty('body')));
        }
    }
?>

    
    
     
     As you can see, we’ve made a simple wrapper class which fetches data from the passed in document (which uses DatabaseObject). The generateUrl() function is just a special internal method which determines a document’s full URL. We are storing this when we build the index so we don’t have to generate it each time a search is run (especially since this will never change, and if it does we can just rebuild the index).
    
    

    
    
     
     Building the full index
    
    

    
    
     
     Now that we have our class, we can create our index, loop over the documents, and then save our index:
    
    

    
    
     
     Highlight: PHP
    
    
<?php
    require_once('Zend/Search/Lucene.php');
    require_once('DatabaseObject/PhpriotDocument.class.php');
    // where to save our index
    $indexPath = '/var/www/phpriot.com/data/docindex';
    // get a list of all database ids for documents
    $doc_ids = PhpriotDocument::GetDocIds($db);
    // create our index
    $index = new Zend_Search_Lucene($indexPath, true);
    foreach ($doc_ids as $doc_id) {
        // load our databaseobject
        $document = new PhpriotDocument($db);
        $document->loadRecord($doc_id);
        // create our indexed document and add it to the index
        $index->addDocument(new PhpRiotIndexedDocument($document));
    }
    // write the index to disk
    $index->commit();
?>

    
    
     
     The index has now been created! This can take some time if you have many documents or if each document has a large amount of content. We generate this by using PHP on the command line, which allows us to see its progress in real-time if we need to (we also output the title and a status message as each document is indexed).
    
    

    
    
     
     How querying works in Zend_Search_Lucene
    
    

    
    
     
     Now comes the most important part—actually finding stuff! There are quite a lot of options when it comes to querying your index, allowing you and your users to have a lot of control in the returned results.
    
    

    
    
     
     When we created the index, we added six fields, but only three of those were actually searchable: document title, document content, and the author.
    
    

    
    
     
     Each of these items is stored separately for each indexed document, meaning you can search on them separately. The syntax used to search each field is somewhat like Google’s, in that you specify the field, followed by a colon, followed by the term (with no spaces).
    
    

    
    
     
     So to search the author field for ‘Quentin’, the search query would be author:Quentin. (Note that the search is case-insensitive. To make it case-sensitive we would need to change some options when creating our index. For full details on this, please read the Zend_Search_Lucene manual section on Extensibility.)
    
    

    
    
     
     Likewise, to search in the title field for ‘php’, we would use title:php.
    
    

    
    
     
     As we briefly mentioned earlier in this article, the default field that is searched is the one called contents. So if you wanted to search the document body for the word ‘google’, you could use contents:google or just google.
    
    

    
    
     
     Including and excluding terms
    
    

    
    
     
     By default, all specified terms are searched on using a boolean ‘or’. This means that a document is returned if it contains any of the terms. To force results to have a particular term, the plus symbol is used. To force results not to have a particular term, the minus symbol is used. If you’re searching in a different field, you can put the plus or minus either before the field name or before the term. In other words, +author:Quentin and author:+Quentin are identical.
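
To show how these rules fit together, here is a small plain-PHP sketch that splits a simple query into its terms, fields and signs. It is my own illustration and not the parser Zend_Search_Lucene uses; parseSimpleQuery() is a made-up name, and phrases are not handled.

```php
<?php
// Hypothetical parser for simple queries such as 'php +author:Quentin'.
// Each clause has a field, a lowercased term and a sign:
// true = required (+), false = prohibited (-), null = optional.
function parseSimpleQuery($query, $defaultField = 'contents')
{
    $clauses = array();
    $tokens = preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($tokens as $token) {
        $sign = null;
        $field = $defaultField;
        // The sign may come before the field name ...
        if ($token[0] === '+' || $token[0] === '-') {
            $sign = ($token[0] === '+');
            $token = (string) substr($token, 1);
        }
        // ... followed by an optional "field:" prefix ...
        $pos = strpos($token, ':');
        if ($pos !== false) {
            $field = (string) substr($token, 0, $pos);
            $token = (string) substr($token, $pos + 1);
        }
        // ... or the sign may sit directly before the term.
        if ($sign === null && $token !== '' && ($token[0] === '+' || $token[0] === '-')) {
            $sign = ($token[0] === '+');
            $token = (string) substr($token, 1);
        }
        if ($token !== '') {
            $clauses[] = array('field' => $field, 'text' => strtolower($token), 'sign' => $sign);
        }
    }
    return $clauses;
}

$clauses = parseSimpleQuery('php +author:Quentin');
// two clauses: optional 'php' in contents, required 'quentin' in author
```

Both +author:Quentin and author:+Quentin come out as the same clause, which matches the rule above.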
    
    

    
    
     
     Searching for phrases
    
    

    
    
     
     It is possible to search for exact phrases with Zend_Search_Lucene, so if you wanted to search for the exact phrase “PHP Articles” you could. Because this is somewhat complicated to achieve, we will not be including this in our examples or implementation. However, there is a lot of information on this in the Zend_Search_Lucene manual section on query types.
    
    

    
    
     
     Sample queries
    
    

    
    
     
     Here are some queries you can pass to Zend_Search_Lucene and their meanings.
    
    

    
    
     
     Highlight: Plain
    
    
php
    // search the index for any article with the word php
 
php -author:quentin
    // find any article with the word php not written by me
 
author:quentin
    // find all the articles by me
 
php -ajax
    // find all articles with the word php that don't have the word ajax
 
title:mysql
    // find all articles with MySQL in the title
 
title:mysql -author:quentin
    // find all articles with MySQL in the title not by me

    
    
     
     And so on. Hopefully you get the idea.
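     If it helps, the include/exclude rules behind these sample queries can be sketched in plain PHP. This is only an illustration under the assumption of Lucene-style boolean semantics (a prohibited term rejects the document, a required term must be present, and when nothing is required at least one optional term must match). documentMatches() is a hypothetical helper, not part of the library.

```php
<?php
// Hypothetical helper for illustration; not part of Zend_Search_Lucene.
// $doc maps a field name to the list of lower-cased words in that field.
// Each term is array('field' => ..., 'term' => ..., 'sign' => true|false|null),
// where true = required (+), false = prohibited (-), null = optional.
function documentMatches(array $doc, array $terms)
{
    $hasRequired = false;
    $optionalHit = false;
    foreach ($terms as $t) {
        $words = isset($doc[$t['field']]) ? $doc[$t['field']] : array();
        $found = in_array($t['term'], $words, true);
        if ($t['sign'] === false && $found) {
            return false;               // a prohibited term is present
        }
        if ($t['sign'] === true) {
            if (!$found) {
                return false;           // a required term is missing
            }
            $hasRequired = true;
        }
        if ($t['sign'] === null && $found) {
            $optionalHit = true;
        }
    }
    // with no required terms, at least one optional term has to match
    return $hasRequired || $optionalHit;
}

$doc = array(
    'title'    => array('mysql', 'tips'),
    'author'   => array('quentin'),
    'contents' => array('php', 'mysql'),
);
// equivalent of the query "php -author:quentin"
var_dump(documentMatches($doc, array(
    array('field' => 'contents', 'term' => 'php', 'sign' => null),
    array('field' => 'author', 'term' => 'quentin', 'sign' => false),
)));
```

     The sample document is written by quentin, so the query "php -author:quentin" rejects it even though it contains the word php.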
    
    

    
    
     
     Scoring of results
    
    

    
    
     
     All results returned from a search are assigned a score. This is a measure of how well the document matched the search term.
    
    

    
    
     
     The results are ordered by their score, from highest to lowest.
    
    

    
    
     
     I’m not exactly sure how the score is calculated or what it represents exactly, but it looks pretty on the search results.
    
    

    
    
     
     You can customize the scoring algorithm (and hence the ordering of results). Please see the section later in the article on extending Zend_Search_Lucene.
    
    

    
    
     
     Querying our index
    
    

    
    
     
     On the previous page we looked at how to write queries to search the index. We learned how to include and exclude terms, and also how to search different fields in our indexed data.
    
    

    
    
     
     Now we will look at actually pulling documents from our index using those queries.
    
    

    
    
     
     There are essentially two ways to query the index: passing the raw query in and letting Zend_Search_Lucene parse it (ideal when you’re writing a search engine where you’re not sure what the user will enter), or manually building up the query with API function calls.
    
    

    
    
     
     In either case, you use the find() method on the index. The find() method returns a list of matches from your index.
    
    

    
    
     
     Highlight: PHP
    
    
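<?php
    require_once('Zend/Search/Lucene.php');
    $indexPath = '/var/www/phpriot.com/data/docindex';
    // open the existing index (no second argument, so nothing is recreated)
    $index = new Zend_Search_Lucene($indexPath);
    // let Zend_Search_Lucene parse the raw query string
    $hits = $index->find('php +author:Quentin');
?>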

    
    
     
     This sample code searches our index for articles written by me, preferring those that contain ‘php’ (the plus sign makes the author term required, while ‘php’ remains optional). Note that when we opened our index, we did not pass the second parameter as we did when we created the index. This is because we are not writing the index, we are querying it.
    
    

    
    
     
     We could also manually build this same query with function calls like so:
    
    

    
    
     
     Highlight: PHP
    
    
<?php
    require_once('Zend/Search/Lucene.php');
    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);
    $query = new Zend_Search_Lucene_Search_Query_MultiTerm();
    // optional term in the default field
    $query->addTerm(new Zend_Search_Lucene_Index_Term('php'), null);
    // required term in the author field
    $query->addTerm(new Zend_Search_Lucene_Index_Term('Quentin', 'author'), true);
    $hits = $index->find($query);
?>

    
    
     
     The second parameter to addTerm() determines whether or not a term is required. True means it is required (like putting a plus sign before the term), false means it is prohibited (like putting a minus sign before the term), and null means it is neither required nor prohibited.
    
    

    
    
     
     The second parameter for Zend_Search_Lucene_Index_Term specifies the field to search. By default this is contents.
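     As a quick way to check your understanding of the two parameters, the following sketch maps a term, an optional field and the true/false/null sign back to the text you would type in a raw query. termToQueryString() is a hypothetical helper written for this illustration; it does not exist in Zend_Search_Lucene.

```php
<?php
// Hypothetical helper for illustration; not part of Zend_Search_Lucene.
// Renders a term the way it would be typed in a raw query string.
function termToQueryString($text, $field = null, $sign = null)
{
    // true = required (+), false = prohibited (-), null = neither
    $prefix = $sign === true ? '+' : ($sign === false ? '-' : '');
    // no field given means the default field, so no "field:" prefix is needed
    $name = $field === null ? '' : $field . ':';
    return $prefix . $name . $text;
}

// prints: php +author:Quentin
echo termToQueryString('php'), ' ', termToQueryString('Quentin', 'author', true), "\n";
```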
    
    

    
    
     
     On the whole, it is easier to simply allow Zend_Search_Lucene to parse the query.
    
    

    
    
     
     Dealing with returned results
    
    

    
    
     
     The results found from your query are returned in an array, meaning you can simply use count() on the array to determine the number of hits.
    
    

    
    
     
     Each of the stored fields is available as a property of the hit object.
    
    

    
    
     
     So to loop over the results as we indexed them previously (with a title, author and teaser), we would do the following:
    
    

    
    
     
     Highlight: PHP
    
    
<?php
    require_once('Zend/Search/Lucene.php');
    $query = 'php +author:Quentin';
    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);
    $hits = $index->find($query);
    // $hits is an array, so count() gives the number of matches
    $numHits = count($hits);
?>
<p>
    Found <?=$numHits?> result(s) for query <?=$query?>.
</p>
<?php foreach ($hits as $hit) { ?>
    <h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
    <p>
        By <?=$hit->author?>
    </p>
    <p>
        <?=$hit->teaser?><br />
        <a href="<?=$hit->url?>">Read more...</a>
    </p>
<?php } ?>

    
    
     
     Here we also used an extra field called ‘score’. As mentioned previously, this is used as an indicator as to how well a document matched the query. Results with the highest score are listed first.
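     One thing worth considering when printing results is escaping. The sketch below assumes the stored fields are plain text rather than HTML, and uses a stand-in object in place of a real search hit so that it runs without an index. escapeHtml() and renderHit() are hypothetical helpers, not library functions.

```php
<?php
// Hypothetical helpers for illustration; not part of Zend_Search_Lucene.
function escapeHtml($value)
{
    return htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8');
}

// $hit is any object exposing title, author, teaser, url and score properties
function renderHit($hit)
{
    return '<h3>' . escapeHtml($hit->title)
         . ' (score: ' . sprintf('%.2f', $hit->score) . ")</h3>\n"
         . '<p>By ' . escapeHtml($hit->author) . "</p>\n"
         . '<p>' . escapeHtml($hit->teaser) . "<br />\n"
         . '<a href="' . escapeHtml($hit->url) . "\">Read more...</a></p>\n";
}

// stand-in for a search hit (assumed field values, not real data)
$hit = (object) array(
    'title'  => 'PHP & MySQL',
    'author' => 'Quentin',
    'teaser' => 'A <b>short</b> intro',
    'url'    => '/articles/php-mysql',
    'score'  => 0.5,
);
echo renderHit($hit);
```

     If your teasers are stored as HTML on purpose, you would leave that one field unescaped and escape the rest.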
    
    

    
    
     
     Creating a simple search engine
    
    

    
    
     
     Using our code above, we can easily transform this into a simple site search engine. All we need to do is add a form and plug in the submitted query. Let’s assume this script is called search.php:
    
    

    
    
     
     Highlight: PHP
    
    
<?php
    require_once('Zend/Search/Lucene.php');
    $query = isset($_GET['query']) ? $_GET['query'] : '';
      
      
       
    $query = trim($query);
    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);
    // only run the search if something was submitted
    if (strlen($query) > 0) {
        $hits = $index->find($query);
        $numHits = count($hits);
    }
?>
<form method="get" action="search.php">
    <input type="text" name="query" value="<?=htmlspecialchars($query)?>" />
    <input type="submit" value="Search" />
</form>
<?php if (strlen($query) > 0) { ?>
    <p>
        Found <?=$numHits?> result(s) for query <?=htmlspecialchars($query)?>.
    </p>
    <?php foreach ($hits as $hit) { ?>
        <h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
        <p>
            By <?=$hit->author?>
        </p>
        <p>
            <?=$hit->teaser?><br />
            <a href="<?=$hit->url?>">Read more...</a>
        </p>
    <?php } ?>
<?php } ?>

      
      
       
       Error handling
      
      

      
      
       
       The one thing we haven’t dealt with yet is errors in the search. For instance, if we were to type in ‘title:’ with no term after it, an error would occur. We handle this by catching the Zend_Search_Lucene_Exception exception.
      
      

      
      
       
       Highlight: PHP
      
      
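<?php
    require_once('Zend/Search/Lucene.php');
    $query = isset($_GET['query']) ? $_GET['query'] : '';
    $query = trim($query);
    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);
    try {
        $hits = $index->find($query);
    }
    catch (Zend_Search_Lucene_Exception $ex) {
        $hits = array();
    }
    $numHits = count($hits);
?>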

      
      
       
       This means now that if an error occurs in the search, we simply assume zero hits were returned, thereby handling the error without indicating to the user that anything went wrong.
      
      

      
      
       
       Of course, you could also choose to get the error message from the exception and output that instead ($ex->getMessage()).
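       As a rough sketch of that second option, the snippet below returns the error message alongside the hits so the page can display it. To keep it runnable on its own, StubIndex is a made-up stand-in for a real index and runSearch() is a hypothetical helper; with the real library you would catch Zend_Search_Lucene_Exception rather than the base Exception class.

```php
<?php
// StubIndex stands in for a real Zend_Search_Lucene index so this sketch runs on its own.
// It throws for a query ending in a colon, mimicking the 'title:' error described above.
class StubIndex
{
    public function find($query)
    {
        if (substr($query, -1) == ':') {
            throw new Exception('Syntax error in query: ' . $query);
        }
        return array();
    }
}

// runSearch() is a hypothetical helper: it returns the hits plus an error message (or null)
function runSearch($index, $query)
{
    try {
        return array('hits' => $index->find($query), 'error' => null);
    }
    catch (Exception $ex) {
        // with the real library you would catch Zend_Search_Lucene_Exception here
        return array('hits' => array(), 'error' => $ex->getMessage());
    }
}

$result = runSearch(new StubIndex(), 'title:');
if ($result['error'] !== null) {
    // escape the message, since it may contain the user's own input
    echo 'Your search could not be run: ' . htmlspecialchars($result['error']);
}
```

       Whether you show the message or silently report zero hits is a design choice; showing it helps users correct a malformed query.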
      
      

      
      
       
       Keeping the index up-to-date
      
      

      
      
       
       The other thing we haven’t yet dealt with is what to do when any of our documents are updated. There are several ways to handle this:
      
      

      
      Update just the entry for the updated document straight away
    
Rebuild the entire index straight away when a document is updated
    

      
      Rebuild the entire index at a certain time each day (or several times per day)
    

      
      
       
       The ideal method really depends on the kind of data you have, how often it is updated, and how important it is for the search index to be completely up-to-date.
      
      

      
      
       
       To be honest, I haven’t figured out a way to update a single document in the index yet, but I may just be missing something simple. If you open the index (without the second parameter), and then index a document that is already in the index, then it will be duplicated (and hence returned twice in any matching search).
      
      

      
      
       
       There isn’t too much documentation on this library yet, at least nothing about this specifically. If anybody knows how, please drop me an email or submit a comment with this article.
      
      

      
      
       
       So at this point, the way to keep the index updated is to rebuild it from scratch when a document is updated.
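       If you rebuild on a schedule rather than on every update, one possible way to avoid needless rebuilds is to compare the age of the index directory with the time of the latest document change. This is only a sketch: indexNeedsRebuild() is a hypothetical helper, and it assumes that rebuilding writes new files into the index directory (which updates the directory's modification time) and that you can fetch the latest update timestamp from your own database.

```php
<?php
// indexNeedsRebuild() is a hypothetical helper, not part of Zend_Search_Lucene.
// $lastUpdated is the Unix timestamp of the most recently changed document,
// which you would fetch from your own database.
function indexNeedsRebuild($indexPath, $lastUpdated)
{
    clearstatcache();
    if (!is_dir($indexPath)) {
        return true;    // no index yet, so it must be built
    }
    // assumption: the directory's modification time reflects the last rebuild
    return filemtime($indexPath) < $lastUpdated;
}

// example: a path that does not exist yet always needs building
$demoPath = sys_get_temp_dir() . '/docindex_demo_' . uniqid();
$needsRebuild = indexNeedsRebuild($demoPath, time());
```

       A scheduled script could call this first and only run the full indexing script from earlier when it returns true.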
      
      

      
      
       
       Extending Zend_Search_Lucene
      
      

      
      
       
       There are several aspects of Zend_Search_Lucene that can be extended, allowing a fully customized search solution. These include:
      
      

      
      A custom tokenizer for determining keywords in a document
    
Custom scoring algorithms to determine how well a document matches a search query
    

      
      A custom storage method, so your index is stored however and wherever you please
    

      
      
       
       A custom tokenizer
      
      

      
      
       
       There are many reasons why a custom tokenizer can be useful. Here are some ideas:
      
      

      
      PDF tokenizer – a tokenizer that can parse a PDF file and find all the keywords
    
Image tokenizer – a tokenizer that can perform Optical Character Recognition (OCR), thereby allowing you to index words in an image (and you could store the image also, using the Binary field type)
    

      
      HTML tokenizer – a tokenizer that can read HTML data, thereby knowing not to index HTML tags but only the actual content. You could make further improvements on this also, such as finding all headings and giving them higher preference than the rest of the content.
    

      
      
       
       More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.analysis.
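       To give a feel for the HTML idea, here is a minimal sketch of the text-extraction step only. htmlToKeywords() is a hypothetical helper and does not plug into Zend_Search_Lucene; a real custom analyzer would use the extension points described in the manual.

```php
<?php
// htmlToKeywords() is a hypothetical helper showing the idea only.
function htmlToKeywords($html)
{
    // put a space before each tag so words in adjacent elements do not run together
    $text = strip_tags(str_replace('<', ' <', $html));
    // turn entities such as &amp; back into characters
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    $text = strtolower($text);
    // split on anything that is not a letter or digit
    return preg_split('/[^a-z0-9]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

$keywords = htmlToKeywords('<h1>Zend Search</h1><p>Indexing <b>HTML</b> &amp; text</p>');
```

       The tag names (h1, p, b) never reach the keyword list, which is the point of an HTML-aware tokenizer.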
      
      

      
      
       
       Custom scoring algorithms
      
      

      
      
       
       Using a custom scoring algorithm, you can determine how favourably different fields in a document are looked upon. For example, you might want to treat matches in the ‘title’ field (if you have one) much more favourably than matches in the ‘contents’ field.
      
      

      
      
       
       More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.scoring.
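       The idea of favouring one field over another can be illustrated with a toy calculation. This is not the formula Zend_Search_Lucene uses; weightedScore() is a hypothetical helper and the weights are arbitrary.

```php
<?php
// weightedScore() is a hypothetical helper; the weights are arbitrary and this is
// not the formula Zend_Search_Lucene uses.
function weightedScore($fields, $term, $weights)
{
    $score = 0;
    foreach ($fields as $name => $text) {
        $weight = isset($weights[$name]) ? $weights[$name] : 1;
        // count whole-word, case-insensitive occurrences of the term in this field
        $count = preg_match_all('/\b' . preg_quote($term, '/') . '\b/i', $text, $matches);
        $score += $weight * $count;
    }
    return $score;
}

// a match in the title counts five times as much as a match in the contents
$weights = array('title' => 5, 'contents' => 1);
$docA = array('title' => 'MySQL tips', 'contents' => 'Some notes on indexing.');
$docB = array('title' => 'PHP tips', 'contents' => 'Using MySQL with PHP. MySQL is popular.');
$scoreA = weightedScore($docA, 'mysql', $weights);
$scoreB = weightedScore($docB, 'mysql', $weights);
```

       Here the single title match in the first document outweighs the two body matches in the second.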
      
      

      
      
       
       Custom storage method
      
      

      
      
       
       You can change how indexes are written to disk by extending the Zend_Search_Lucene_Storage_Directory and Zend_Search_Lucene_Storage_File classes.
      
      

      
      
       
       It may or may not be possible to change this to store all indexed data in a database, but I haven’t actually tried this so I’m not sure.
      
      

      
      
       
       More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.storage.
      
      

      
      
       
       Conclusion
      
      

      
      
       
       In this article we looked closely at Zend_Search_Lucene, part of the new Zend Framework. While it is still in development, it works sufficiently well to implement fulltext indexing and searching on any PHP 5 website.
      
      

      
      
       
       We also looked briefly at some ways of extending the search capabilities.
      
      

      
      
       
       Hopefully this has given you some ideas for your own website. Feel free to add comments and ideas to this article.
      
      

      
      

      
      
 

      
      
       
       Creating a fulltext search engine in PHP 5 with the Zend Framework's Zend_Search_Lucene --http://www.phpriot.com/d/articles/php/search/zend-search-lucene/index.html
      
      

      
      
       
       By Quentin Zervaas, 27 April 2006 
      
      

      
      
       
       This article covers the implementation of a fulltext search engine using PHP 5 and the Zend Framework. We will be using the Zend_Search_Lucene component to create and search our fulltext index.
      
      

      
      
       
       There are several other libraries we could use instead of this one, but Zend_Search_Lucene is completely native to PHP, whereas others such as Xapian or Tsearch2 rely on third party modules (for instance, the Tsearch2 module must be compiled into your PostgreSQL installation).
      
      

      
      
       
       It must be noted at this point though that we require at least PHP 5 for Zend_Search_Lucene – PHP 4 will not work.
      
      

      
      
       
       In this article we will be covering the following:
      
      

      
      How to index a document or series of documents 
    
The different types of fields that can be indexed 
    

      
      Searching the index 
    

      
      
       
       To demonstrate this functionality, we will cover the implementation of a search engine into phpRiot. We previously used the Tsearch2 module but had some problems that we were unable to overcome.
      
      

      
      
       
       How fulltext indexing and querying works
      
      

      
      
       
       Just in case you’ve never dealt with the subject before, I will briefly cover how the whole thing works.
      
      

      
      
       
       The index essentially stores a special copy of all your data, but instead of storing the actual content of your documents, it stores a list of keywords found in each document.
      
      

      
      
       
       So if I were to create an index of all the documents on phpRiot, this would be done by looping over every document, finding the keywords and then adding those words to the index.
      
      

      
      
       
       Additionally, since we can’t just list a bunch of keywords to a user when they perform a search, we also store some other data in the index along with the keywords. In the case of phpRiot, we will store the document title, the author, the document URL and a brief summary of the article.
      
      

      
      
       
       Alternatively, we could store just the document ID and then look up those other details from the database using the ID. However, it is quicker to do it this way (fewer queries), and since the title and author are only a few words for each document, there is little overhead in duplicating this data by storing it in our index.
      
      

      
      
       
       Querying the data
      
      

      
      
       
       Once we have created the index, we can then allow people to search it. When somebody types in a search term, we look for those terms in our index, which then points back to 0 or more different documents. Zend_Search_Lucene will also score the results for us, meaning the most relevant matches are returned first.
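       To make the idea concrete, here is a toy keyword index in plain PHP. It is only a sketch of the general technique, not how Zend_Search_Lucene works internally; toyTokenize(), buildToyIndex() and searchToyIndex() are hypothetical helpers, and the "score" is simply the number of query terms a document contains.

```php
<?php
// Toy inverted index: maps each keyword to the IDs of the documents containing it.
// These helpers are hypothetical and are not part of Zend_Search_Lucene.

function toyTokenize($text)
{
    // lower-case the text and split it on anything that is not a letter or digit
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

function buildToyIndex($documents)
{
    $index = array();
    foreach ($documents as $id => $text) {
        foreach (array_unique(toyTokenize($text)) as $word) {
            $index[$word][] = $id;
        }
    }
    return $index;
}

function searchToyIndex($index, $query)
{
    // 'or' search: a document matches if it contains any of the terms;
    // its score is the number of distinct query terms it contains
    $scores = array();
    foreach (array_unique(toyTokenize($query)) as $word) {
        if (!isset($index[$word])) {
            continue;
        }
        foreach ($index[$word] as $id) {
            $scores[$id] = isset($scores[$id]) ? $scores[$id] + 1 : 1;
        }
    }
    arsort($scores);    // highest score first, document IDs kept as keys
    return $scores;
}

$documents = array(
    1 => 'Creating a search engine in PHP',
    2 => 'Managing your data with PHP and MySQL',
    3 => 'An introduction to Ajax',
);
$index = buildToyIndex($documents);
$results = searchToyIndex($index, 'php mysql');
```

       Document 2 contains both terms and so comes first, document 1 contains one term, and document 3 is not returned at all.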
      
      

      
      
       
       Keeping the index up-to-date
      
      

      
      
       
       If a document is ever updated in the database, this likely means the index will be out-of-date. This means we have to make an extra consideration in our general content management. That is, we need to re-index that document.
      
      

      
      
       
       There are several ways of approaching this. For instance, you could update it in real-time when the document is updated, or you could run an index update at a certain time every day. Each method has its own advantages. For something like phpRiot, updating the index in real time is probably the best way, as the data is not updated very frequently. On the other hand, if you were indexing user submitted comments, it could create a huge load to recreate the index for every comment, as there could be hundreds of comments per day.
      
      

      
      
       
       Getting started
      
      

      
      
       
       The first thing we must do is install the Zend Framework if you have not already done so. It is structured in a similar way to how the Pear file structure is organised. At this stage, the Zend Framework is only in a “preview” phase. At time of writing the current version was Preview 0.1.3.
      
      

      
      
       
       You can download this from http://framework.zend.com/download. If you use Subversion, you can also check out the trunk version, which may have newer code in it.
      
      

      
      
       
       I’m not exactly sure where the developers intended the framework to be stored, but like Pear is stored in /usr/local/lib/php, I chose to store it in /usr/local/lib/zend.
      
      

      
      
       
       Highlight: Plain
      
      
$ cd /usr/local/src
$ wget http://framework.zend.com/download/tgz
$ tar -zxf ZendFramework-0.1.3.tar.gz
$ mv ZendFramework-0.1.3/library /usr/local/lib/zend

      
      
       
       So now all that is required is that we add /usr/local/lib/zend to our include path. For instance, my include path directive in httpd.conf for phpRiot looks something like:
      
      

      
      
       
       Highlight: Plain
      
      
php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php

      
      
       
       Which now becomes:
      
      

      
      
       
       Highlight: Plain
      
      
php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php:/usr/local/lib/zend
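       If you would rather not edit httpd.conf, the same effect can usually be achieved at runtime near the top of your script. This is a small sketch using PHP's own set_include_path(); the directory is the one chosen above.

```php
<?php
// append the Zend Framework directory to the current include path
// (/usr/local/lib/zend is the location chosen earlier in this article)
$zendPath = '/usr/local/lib/zend';
set_include_path(get_include_path() . PATH_SEPARATOR . $zendPath);
```

       Using PATH_SEPARATOR rather than a hard-coded colon keeps the line portable between Unix and Windows.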

      
      
       
       Creating our first index
      
      

      
      
       
       The basic process for creating an index is:
      
      

      
      Open the index
    
Add each document
    

      
      Commit (save) the index
    

      
      
       
       The index is stored on the filesystem, so when you open an index, you specify the path where it is to be stored. This is achieved by instantiating the Zend_Search_Lucene class.
      
      

      
      
       
       Highlight: PHP
      
      
<?php
    require_once('Zend/Search/Lucene.php');
    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath, true);
?>

      
      
       
       You’ll also notice the second parameter in the constructor call. This means the index is created from scratch. If you set this to false (or omit the argument), an existing index is opened. This is done when updating or querying the index. Since we’re creating the index at this stage, we include that parameter.
      
      

      
      
       
       Adding a document to our index
      
      

      
      
       
       Now that we have opened our index, we must add our documents to it. We create a document to be added to the index using the following:
      
      

      
      
       
       Highlight: PHP
      
      
<?php
    $doc = new Zend_Search_Lucene_Document();
?>

      
      
       
       The next thing we must do is determine which fields we need to add to our index.
      
      

      
      
       
       There are several different field types that can be stored for a single document. As we mentioned earlier, we want to store the keywords for our document, and also store some extra data that we can use when fetching the data. If we wanted to, we could duplicate the entire document content in our search index, however, this would be overkill and is unnecessary. It’s not as bad to duplicate smaller items such as the author name or document title as it’s only a small amount of data.
      
      

      
      
       
       As you can tell from the previous paragraph, we’re starting to determine which fields we want to store in our index:
      
      

      
      Document content – but we don’t want to duplicate the entire article, as we’re only going to show users a preview of the article in the search results
        
Title – we’re definitely going to include the title in our results
        
Article teaser – even though we’re not storing the content, we do want to store a short summary of the document to display in our search results
        
Author – we will list the author of each document in our search results, plus we want to give users the ability to search by author
        

      
      

      
      Created – We’ll also store a timestamp of when the article was created.
    

      
      
       
       This brings us to the different types of fields we can use for indexing in Zend_Search_Lucene:
      
      

      
      UnIndexed – Data that isn’t available for searching, but is stored with our document (article teaser, article URL and timestamp of creation)
    
UnStored – Data that is available for search, but isn’t stored in the index in full (the document content)
    

      
      Text – Data that is available for search and is stored in full (title and author)
    

      
      
       
       There are also the Keyword and Binary field types available, but we won’t be using them in this example. Binary data isn’t searchable, but could be used to store an image with an indexed document.
      
      

      
      
       
       To add a field to our indexed document, we use the addField() method. When we call this method, we must pass data that has been prepared according to the necessary field type. This is achieved using the Zend_Search_Lucene_Field class with the field type as the static method name.
      
      

      
      
       
       In other words, to create the title field data, we use:
      
      

      
      
       
       Highlight: PHP
      
      
<?php
    $data = Zend_Search_Lucene_Field::Text('title', $docTitle);
?>

      
      
       
       Note that we specify the name of the field here. We can later reference this data by this field name, but we will cover this in the section about querying the index.
      
      

      
      
       
       So to add all the data with the field types we just worked out, we would use this:
      
      

      
      
       
       Highlight: PHP
      
      
<?php
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $docUrl));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('created', $docCreated));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $docTeaser));
    $doc->addField(Zend_Search_Lucene_Field::Text('title', $docTitle));
    $doc->addField(Zend_Search_Lucene_Field::Text('author', $docAuthor));
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docBody));
?>

      
      
       
       Important note: We added the main search content with a field name of contents. This is a special name to Zend_Search_Lucene, which means by default, all queries will search this field. If users instead want to search the title field for ‘foo’, their search query would look like title:foo. This will be covered further in the section about querying the index.
      
      

      
      
       
       Finally, we add the document to the index using addDocument():
      
      

      
      
       
       Highlight: PHP
      
      
<?php
    $index->addDocument($doc);
?>

      
      
       
       We will cover this code in full in the next chapter (including looping over several documents and adding them all at once).
      
      

      
      
       
       Committing / saving the index
      
      

      
      
       
       Once all documents have been added, the index must be saved.
      
      

      
      
       
       Highlight: PHP
      
      
<?php
    $index->commit();
?>

      
      
       
       You can call commit() after adding each document, but this operation is expensive and each call results in a new index segment (in layman’s terms, this means it’s an extra file to search when querying the index). Note that better management of the segments in the index is a planned future improvement, but for the time being it’s best to be careful and call it once after you’ve done what you need to do with the index.
      
      

      
      
       
       If you were to look at your filesystem now that you have committed the index, you would notice a directory at the location you specified when opening the index. This directory contains a number of files which are then used when querying the index.
      
      

      
      
       
       Next up we will piece all the above code into a single script which indexes all the articles on phpRiot.
      
      

      
      
       
       Indexing all the articles on phpRiot
      
      

      
      
       
       Now that you’ve seen how to create a basic index, we will extend this script slightly so it can index all the documents in phpRiot.
      
      

      
      
       
       Additionally, we will be extending the base Zend_Search_Lucene_Document class to simplify our code slightly. This will also demonstrate ways you can take advantage of the OOP style of programming that Zend_Search_Lucene uses.
      
      

      
      
       
       Since we are demonstrating the indexing of phpRiot articles, we will also use the DatabaseObject class here to fetch article data, just as phpRiot does. You don’t really need to know how this class works to understand this example (as it is fairly self-explanatory in the function calls), but if you are interested, you can read our article Managing your data with DatabaseObject.
      
      

      
      
       
       Extending Zend_Search_Lucene_Document
      
      

      
      
       
       On the previous page, after we opened the index, we created a new instance of Zend_Search_Lucene_Document to hold the index data for a single document. Instead of calling this class directly, we’re going to extend this class to encapsulate all of the adding of data we also did.
      
      

      
      
       
       In other words, we’re going to move the calls to addField() into our class, rather than calling it for each field after we create our Zend_Search_Lucene_Document item.
      
      

      
      
       
       Highlight: PHP
      
      
<?php
    class PhpRiotIndexedDocument extends Zend_Search_Lucene_Document
    {
        /**
         * Constructor. Creates our indexable document and adds all
         * necessary fields to it using the passed in DatabaseObject
         * holding the article data.
         */
        public function __construct(&$document)
        {
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('url',     $document->generateUrl()));
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('created', $document->getProperty('created')));
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $document->getProperty('teaser')));
            $this->addField(Zend_Search_Lucene_Field::Text('title', $document->getProperty('title')));
            $this->addField(Zend_Search_Lucene_Field::Text('author', $document->getProperty('author')));
            $this->addField(Zend_Search_Lucene_Field::UnStored('contents', $document->getProperty('body')));
        }
    }
?>

      
      
       
       As you can see, we’ve made a simple wrapper class which fetches data from the passed in document (which uses DatabaseObject). The generateUrl() function is just a special internal method which determines a document’s full URL. We are storing this when we build the index so we don’t have to generate it each time a search is run (especially since this will never change, and if it does we can just rebuild the index).
      
      

      
      
       
       Building the full index
      
      

      
      
       
       Now that we have our class, we can create our index, loop over the documents, and then save our index:
      
      

      
      
       
       Highlight: PHP
      
      
<?php
    require_once('Zend/Search/Lucene.php');
    require_once('DatabaseObject/PhpriotDocument.class.php');
    // where to save our index
    $indexPath = '/var/www/phpriot.com/data/docindex';
    // get a list of all database ids for documents
    // ($db is assumed to be an existing database connection)
    $doc_ids = PhpriotDocument::GetDocIds($db);
    // create our index
    $index = new Zend_Search_Lucene($indexPath, true);
    foreach ($doc_ids as $doc_id) {
        // load our databaseobject
        $document = new PhpriotDocument($db);
        $document->loadRecord($doc_id);
        // create our indexed document and add it to the index
        $index->addDocument(new PhpRiotIndexedDocument($document));
    }
    // write the index to disk
    $index->commit();
?>

      
      
       
       The index has now been created! This can take some time if you have many documents or if each document has a large amount of content. We generate this by using PHP on the command line, which allows us to see its progress in real-time if we need to (we also output the title and a status message as each document is indexed).
      
      

      
      
       
       How querying works in Zend_Search_Lucene
      
      

      
      
       
       Now comes the most important part—actually finding stuff! There are quite a lot of options when it comes to querying your index, allowing you and your users to have a lot of control in the returned results.
      
      

      
      
       
       When we created the index, we added six fields, but only three of those were actually searchable: document title, document content, and the author.
      
      

      
      
       
       Each of these items is stored separately for each indexed document, meaning you can search on them separately. The syntax used to search each field is somewhat like Google’s, in that you specify the field, followed by a colon, followed by the term (with no spaces).
      
      

      
      
       
       So to search the author field for ‘Quentin’, the search query would be author:Quentin. (Note that the search is case-insensitive. To make it case-sensitive we would need to change some options when creating our index. For full details on this, please read the Zend_Search_Lucene manual section on Extensibility.)
      
      

      
      
       
       Likewise, to search in the title field for ‘php’, we would use title:php.
      
      

      
      
       
       As we briefly mentioned earlier in this article, the field that is searched by default is the one called contents. So if you wanted to search the document body for the word ‘google’, you could use contents:google or just google.
      
      

      
      
       
       Including and excluding terms
      
      

      
      
       
       By default, all specified terms are searched on using a boolean ‘or’. This means that any of the terms can exist for a document to be returned. To force results to have a particular term, the plus symbol is used. To force results to not have a particular term, the minus symbol is used. If you’re searching in a different field, you can put the plus or minus either before the field name or before the term. In other words, +author:Quentin and author:+Quentin are identical.
      
      

      
      
       
       Searching for phrases
      
      

      
      
       
       It is possible to search for exact phrases with Zend_Search_Lucene, so if you wanted to search for the exact phrase “PHP Articles” you could. Because this is somewhat complicated to achieve, we will not be including this in our examples or implementation. However, there is a lot of information on this in the Zend_Search_Lucene manual section on query types.
      
      

      
      
       
       Sample queries
      
      

      
      
       
       Here are some queries you can pass to Zend_Search_Lucene and their meanings.
      
      

      
      
       
       Highlight: Plain
      
      
php
    // search the index for any article with the word php
 
php -author:quentin
    // find any article with the word php not written by me
 
author:quentin
    // find all the articles by me
 
php -ajax
    // find all articles with the word php that don't have the word ajax
 
title:mysql
    // find all articles with MySQL in the title
 
title:mysql -author:quentin
    // find all articles with MySQL in the title not by me

      
      
       
       And so on. Hopefully you get the idea.
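       If it helps to see the syntax taken apart, the sketch below splits a raw query into its field, term and plus/minus sign. parseSimpleQuery() is a hypothetical helper written for this illustration only; it is not how Zend_Search_Lucene parses queries, and it only understands a sign placed before the field name.

```php
<?php
// parseSimpleQuery() is a hypothetical helper for illustration only;
// it is not how Zend_Search_Lucene parses queries internally.
function parseSimpleQuery($query, $defaultField = 'contents')
{
    $terms = array();
    foreach (preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY) as $part) {
        // a leading + means required (true), a leading - means prohibited (false),
        // no sign means neither (null)
        $sign = null;
        if ($part[0] == '+' || $part[0] == '-') {
            $sign = ($part[0] == '+');
            $part = substr($part, 1);
        }
        // field:term selects a field; otherwise the default field is used
        $field = $defaultField;
        $pos = strpos((string) $part, ':');
        if ($pos !== false) {
            $field = substr($part, 0, $pos);
            $part = substr($part, $pos + 1);
        }
        if ($part === '' || $part === false) {
            continue;    // nothing left to search for, e.g. a bare 'title:'
        }
        $terms[] = array('field' => $field, 'term' => $part, 'required' => $sign);
    }
    return $terms;
}

$terms = parseSimpleQuery('php -author:quentin');
```

       For the query above this yields one optional term in the contents field and one prohibited term in the author field.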
      
      

      
      
       
       Scoring of results
      
      

      
      
       
       All results returned from a search are assigned a score. This is a measure of how well the document matched the search term.
      
      

      
      
       
       The results are ordered by their score, from highest to lowest.
      
      

      
      
       
       I’m not exactly sure how the score is calculated or what it represents exactly, but it looks pretty on the search results.
      
      

      
      
       
       You can customize the scoring algorithm (and hence the ordering of results). Please see the section later in the article on extending Zend_Search_Lucene.
      
      

      
      
       
       Querying our index
      
      

      
      
       
       On the previous page we looked at how to write queries to search the index. We learned how to include and exclude terms, and also how to search different fields in our indexed data.
      
      

      
      
       
       Now we will look at actually pulling documents from our index using that term.
      
      

      
      
       
       There are essentially two ways to query the index: passing the raw query in and letting Zend_Search_Lucene parse the query (ideal when you’re writing a search engine where you’re not sure what the user will enter), or by manually building up the query with API function calls.
      
      

      
      
       
       In either case, you use the find() method on the index. The find() method returns a list of matches from your index.
      
      

      
      
       
       Highlight: PHP
      
      
<?php
    require_once('Zend/Search/Lucene.php');
    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);
    $hits = $index->find('php +author:Quentin');
?>

      
      
       
       This sample code searches our index for articles containing ‘php’ that were written by me. Note that when we opened our index, we did not pass the second parameter as we did when we created the index. This is because we are not writing the index, we are querying it.
      
      

      
      
       
       We could also manually build this same query with function calls like so:
      
      

      
      
       
       Highlight: PHP
      
      
<?php
    require_once('Zend/Search/Lucene.php');
    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);
    $query = new Zend_Search_Lucene_Search_Query_MultiTerm();
    // null: the term is neither required nor prohibited
    $query->addTerm(new Zend_Search_Lucene_Index_Term('php'), null);
    // true: the term is required
    $query->addTerm(new Zend_Search_Lucene_Index_Term('Quentin', 'author'), true);
    $hits = $index->find($query);
?>

      
      
       
       The second parameter of addTerm() determines whether or not a term is required. True means it is required (like putting a plus sign before the term), false means it is prohibited (like putting a minus sign before the term), and null means it is neither required nor prohibited.
      
      

      
      
       
       The second parameter of Zend_Search_Lucene_Index_Term specifies the field to search. By default this is contents.
      
      

      
      
       
       On the whole, it is easier to simply allow Zend_Search_Lucene to parse the query.
      
      

      
      
       
       Dealing with returned results
      
      

      
      
       
       The results found from your query are returned in an array, meaning you can simply use count() on the array to determine the number of hits.
      
      

      
      
       
       Each of the indexed fields is available as a property of each hit.
      
      

      
      
       
       So to loop over the results as we indexed them previously (with a title, author and teaser), we would do the following:
      
      

      
      
       
       Highlight: PHP
      
      
<?php
    require_once('Zend/Search/Lucene.php');
    $query = 'php +author:Quentin';
    $indexPath = '/var/www/phpriot.com/data/docindex';
    // open the existing index and run the query
    $index = new Zend_Search_Lucene($indexPath);
    $hits = $index->find($query);
    $numHits = count($hits);
?>
<p>
    Found <?= $numHits ?> result(s) for query <?= $query ?>.
</p>
<?php foreach ($hits as $hit) { ?>
    <h3><?= $hit->title ?> (score: <?= $hit->score ?>)</h3>
    <p>
        By <?= $hit->author ?>
    </p>
    <p>
        <?= $hit->teaser ?><br />
        <a href="<?= $hit->url ?>">Read more...</a>
    </p>
<?php } ?>

      
      
       
       Here we also used an extra field called ‘score’. As mentioned previously, this is used as an indicator as to how well a document matched the query. Results with the highest score are listed first.
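
       One detail worth adding when the query comes from a visitor rather than from a hard-coded string is output escaping. The sketch below shows one way to render a hit with every value passed through htmlspecialchars(). escapeHtml() and renderHit() are hypothetical helpers, and the stdClass object only stands in for a real hit so the example runs without an index.

```php
<?php
/**
 * Hypothetical helper: escapes a value for use in HTML text or attributes.
 */
function escapeHtml($value)
{
    return htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8');
}

/**
 * Hypothetical helper: renders one hit. $hit is any object with title,
 * author, teaser, url and score properties, like the hits used above.
 */
function renderHit($hit)
{
    return '<h3>' . escapeHtml($hit->title)
         . ' (score: ' . escapeHtml($hit->score) . ")</h3>\n"
         . '<p>By ' . escapeHtml($hit->author) . "</p>\n"
         . '<p>' . escapeHtml($hit->teaser) . '<br />'
         . '<a href="' . escapeHtml($hit->url) . '">Read more...</a></p>' . "\n";
}

// stand-in for a real hit, so the example runs without an index
$hit = new stdClass();
$hit->title  = 'Arrays & <strings>';
$hit->author = 'Quentin';
$hit->teaser = 'A short summary.';
$hit->url    = '/articles/example';
$hit->score  = 0.5;

echo renderHit($hit);
```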
      
      

      
      
       
       Creating a simple search engine
      
      

      
      
       
       Using our code above, we can easily transform this into a simple site search engine. All we need to do is add a form and plug in the submitted query. Let’s assume this script is called search.php:
      
      

      
      
       
       Highlight: PHP
      
      
<?php
    require_once('Zend/Search/Lucene.php');
    $query = isset($_GET['query']) ? $_GET['query'] : '';
    // ...
?>
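
       As a rough sketch of how such a script could be put together (this is an illustration, not the original listing), the request handling can be split into three small steps: read the query, run the search only if there is something to search for, and print the form. readQuery(), runSearch(), renderSearchForm() and fakeSearch() are all hypothetical names, and fakeSearch() merely stands in for the index lookup so the example runs on its own.

```php
<?php
/**
 * Hypothetical helper: reads and tidies the submitted query.
 * $params is the request array, normally $_GET.
 */
function readQuery(array $params)
{
    if (!isset($params['query']) || !is_string($params['query'])) {
        return '';
    }
    return trim($params['query']);
}

/**
 * Hypothetical helper: runs the search unless the query is empty.
 * $search is any callable that takes the query string and returns an
 * array of hits.
 */
function runSearch($query, $search)
{
    return ($query === '') ? array() : call_user_func($search, $query);
}

/**
 * Hypothetical helper: the search form, pre-filled with the last query.
 */
function renderSearchForm($query)
{
    $value = htmlspecialchars($query, ENT_QUOTES, 'UTF-8');
    return '<form method="get" action="search.php">'
         . '<input type="text" name="query" value="' . $value . '" />'
         . '<input type="submit" value="Search" />'
         . "</form>\n";
}

// stand-in search so the example runs without an index
function fakeSearch($query)
{
    return array('first hit for ' . $query, 'second hit for ' . $query);
}

$query = readQuery(array('query' => '  php +author:Quentin '));
$hits  = runSearch($query, 'fakeSearch');

echo renderSearchForm($query);
echo 'Found ', count($hits), ' result(s) for query ',
     htmlspecialchars($query, ENT_QUOTES, 'UTF-8'), ".\n";
```

       In search.php itself, $_GET would be passed to readQuery(), the callable would wrap the index lookup from the previous listing, and the results loop would follow the form.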

      
      
       
       Error handling
      
      

      
      
       
       The one thing we haven’t dealt with yet is errors in the search. For instance, if we were to type in ‘title:’ with no term after it, an error would occur. We handle this by catching the Zend_Search_Lucene_Exception exception.
      
      

      
      
       
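
       A minimal sketch of that pattern, assuming the search itself is wrapped in a callable: safeSearch() and brokenSearch() are hypothetical names, and the class definition at the top is only a stand-in so the example runs without the framework installed. With the Zend Framework on the include path, the real exception class would be used instead.

```php
<?php
// Stand-in so the sketch runs without the framework; skipped when the
// real class is already loaded.
if (!class_exists('Zend_Search_Lucene_Exception')) {
    class Zend_Search_Lucene_Exception extends Exception {}
}

/**
 * Hypothetical helper: runs $search (a callable taking the query string)
 * and turns a search exception into "zero hits" plus the error message.
 */
function safeSearch($search, $query)
{
    try {
        return array(
            'hits'  => call_user_func($search, $query),
            'error' => null,
        );
    } catch (Zend_Search_Lucene_Exception $ex) {
        return array(
            'hits'  => array(),
            'error' => $ex->getMessage(),
        );
    }
}

// stand-in for a search that fails on a malformed query
function brokenSearch($query)
{
    throw new Zend_Search_Lucene_Exception('Syntax error near "' . $query . '"');
}

$result = safeSearch('brokenSearch', 'title:');
echo count($result['hits']), " hit(s)\n"; // prints: 0 hit(s)
```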

      
      
       
       This means now that if an error occurs in the search, we simply assume zero hits were returned, thereby handling the error without indicating to the user that anything went wrong.
      
      

      
      
       
       Of course, you could also choose to get the error message from the exception and output that instead ($ex->getMessage()).
      
      

      
      
       
       Keeping the index up-to-date
      
      

      
      
       
       The other thing we haven’t yet dealt with is what to do when any of our documents are updated. There are several ways to handle this:
      
      

      
      Update just the entry for the updated document straight away
      Rebuild the entire index straight away when a document is updated
      Rebuild the entire index at a certain time each day (or several times per day)
    

      
      
       
       The ideal method really depends on the kind of data you have, how often it is updated, and how important it is for the search index to be completely up-to-date.
      
      

      
      
       
       To be honest, I haven’t figured out a way to update a single document in the index yet, but I may just be missing something simple. If you open the index (without the second parameter), and then index a document that is already in the index, then it will be duplicated (and hence returned twice in any matching search).
      
      

      
      
       
       There isn’t too much documentation on this library yet, at least nothing about this specifically. If anybody knows how, please drop me an email or submit a comment with this article.
      
      

      
      
       
       So at this point, the way to keep the index updated is to rebuild it from scratch when a document is updated.
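
       If the rebuild runs on a schedule rather than on every edit, it only needs to happen when something has actually changed. The sketch below shows one way to decide that, assuming you record when the index was last built and can list the last-modified time of each document. indexIsStale() and both of its inputs are hypothetical and not something Zend_Search_Lucene provides.

```php
<?php
/**
 * Hypothetical helper: decides whether the index must be rebuilt.
 * $indexBuiltAt is the Unix timestamp of the last rebuild and
 * $documentTimestamps holds the last-modified time of every document.
 */
function indexIsStale($indexBuiltAt, array $documentTimestamps)
{
    foreach ($documentTimestamps as $modifiedAt) {
        if ($modifiedAt > $indexBuiltAt) {
            // at least one document changed after the last rebuild
            return true;
        }
    }
    return false;
}

// made-up timestamps for illustration
var_export(indexIsStale(1000, array(900, 950)));   // false
echo "\n";
var_export(indexIsStale(1000, array(900, 1200)));  // true
echo "\n";
```

       A scheduled script could call this first and run the full indexing script only when it returns true.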
      
      

      
      
       
       Extending Zend_Search_Lucene
      
      

      
      
       
       There are several aspects of Zend_Search_Lucene that can be extended, allowing a fully customized search solution. These include:
      
      

      
      A custom tokenizer for determining keywords in a document
      Custom scoring algorithms to determine how well a document matches a search query
      A custom storage method, so your index is stored however and wherever you please
    

      
      
       
       A custom tokenizer
      
      

      
      
       
       There are many reasons why a custom tokenizer can be useful. Here are some ideas:
      
      

      
      PDF tokenizer – a tokenizer that can parse a PDF file and find all the keywords
      Image tokenizer – a tokenizer that can perform Optical Character Recognition (OCR), thereby allowing you to index words in an image (and you could store the image also, using the Binary field type)
      HTML tokenizer – a tokenizer that can read HTML data, thereby knowing not to index HTML tags but only the actual content. You could make further improvements on this also, such as finding all headings and giving them more weight than the rest of the content.
    

      
      
       
       More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.analysis.
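
       To give a feel for what an HTML tokenizer has to do, here is a framework-independent sketch: replace the markup with spaces, decode entities, and split what is left into lower-cased keywords. htmlToKeywords() is a hypothetical function and does not use the extension classes described in the manual. It also treats anything outside a-z and 0-9 as a separator and does not skip script or style contents, so it is a starting point only.

```php
<?php
/**
 * Hypothetical, framework-independent illustration of HTML-aware
 * tokenizing: drop the markup, then split the text into keywords.
 */
function htmlToKeywords($html)
{
    // replace each tag with a space so words in adjacent elements
    // do not run together
    $text = preg_replace('/<[^>]*>/', ' ', $html);

    // turn entities such as &amp; back into plain characters
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // split on anything that is not a lower-case letter or digit
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

print_r(htmlToKeywords('<h1>Zend &amp; Lucene</h1><p class="intro">Fulltext search</p>'));
```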
      
      

      
      
       
       Custom scoring algorithms
      
      

      
      
       
       Using a custom scoring algorithm, you can determine how favourably different fields in a document are looked upon. For example, you might want to treat matches in the ‘title’ field (if you have one) much more favourably than matches in the ‘contents’ field.
      
      

      
      
       
       More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.scoring.
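
       The idea of favouring one field over another can be illustrated without the framework. In the sketch below each field has a weight, and a document scores the weighted number of times the term appears in it. This is a deliberately simplified stand-in, not Lucene’s scoring formula and not the extension point described in the manual. weightedScore() and the weights are made up for the example.

```php
<?php
/**
 * Hypothetical scoring function: counts whole-word, case-insensitive
 * occurrences of $term in each field and multiplies each count by the
 * weight of that field (1.0 when no weight is given).
 */
function weightedScore(array $fields, $term, array $weights)
{
    $score = 0.0;
    $pattern = '/\b' . preg_quote($term, '/') . '\b/i';

    foreach ($fields as $name => $text) {
        $weight = isset($weights[$name]) ? $weights[$name] : 1.0;
        $count  = preg_match_all($pattern, $text, $matches);
        $score += $weight * $count;
    }
    return $score;
}

$fields = array(
    'title'    => 'PHP search',
    'contents' => 'Search in PHP with php tools',
);

// one title match weighted 3.0 plus two contents matches weighted 1.0
echo weightedScore($fields, 'php', array('title' => 3.0)), "\n"; // prints: 5
```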
      
      

      
      
       
       Custom storage method
      
      

      
      
       
       You can change how indexes are written to disk by extending the Zend_Search_Lucene_Storage_Directory and Zend_Search_Lucene_Storage_File classes.
      
      

      
      
       
       It may be possible to use this to store all indexed data in a database, but I haven’t actually tried this, so I’m not sure.
      
      

      
      
       
       More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.storage.
      
      

      
      
       
       Conclusion
      
      

      
      
       
       In this article we looked closely at Zend_Search_Lucene, part of the new Zend Framework. While it is still in development, it works sufficiently well to implement fulltext indexing and searching on any PHP 5 website.
      
      

      
      
       
       We also looked briefly at some ways of extending the search capabilities.
      
      

      
      
       
       Hopefully this has given you some ideas for your own website. Feel free to add comments and ideas to this article.
      
      

      
      

      
      
 
GET['query']) ? 

  
  
   
   Creating a fulltext search engine in PHP 5 with the Zend Framework's Zend_Search_Lucene --http://www.phpriot.com/d/articles/php/search/zend-search-lucene/index.html
  
  

  
  
   
   By Quentin Zervaas, 27 April 2006 
  
  

  
  
   
   This article covers the implementation of a fulltext search engine using PHP 5 and the Zend Framework. We will be using the Zend_Search_Lucene component to create and search our fulltext index.
  
  

  
  
   
   There are several other libraries we could use instead of this one, but Zend_Search_Lucene is completely native to PHP, whereas others such as Xapian or Tsearch2 rely on third party modules (for instance, the Tsearch2 module must be compiled into your PostgreSQL installation).
  
  

  
  
   
   It must be noted at this point though that we require at least PHP 5 for Zend_Search_Lucene – PHP 4 will not work.
  
  

  
  
   
   In this article we will be covering the following:
  
  

  
  How to index a document or series of documents 
    
The different types of fields that can be indexed 
    

  
  Searching the index 
    

  
  
   
   To demonstrate this functionality, we will cover the implementation of a search engine into phpRiot. We previously used the Tsearch2 module but had some problems that we were unable to overcome.
  
  

  
  
   
   How fulltext indexing and querying works
  
  

  
  
   
   Just in case you’ve never dealt with the subject before, I will briefly cover how the whole thing works.
  
  

  
  
   
   The index essentially stores a special copy of all your data, but instead of storing the actual content of your documents, it stores a list of keywords found in each document.
  
  

  
  
   
   So if I were to create an index of all the documents on phpRiot, this would be done by looping over every document, finding the keywords and then adding those words to the document.
  
  

  
  
   
   Additionally, since we can’t just list a bunch of keywords to a user when they perform a search, we also store some other data in the index along with the keywords. In the case of phpRiot, we will store the document title, the author, the document URL and a brief summary of the article.
  
  

  
  
   
   Alternatively we just store the document ID, and then proceed to look up those other details from the database using the ID, however it is quicker to do it this way (less queries), and since the title and author are only a few words for each document, there is little overhead in duplicating this data by storing it in our index.
  
  

  
  
   
   Querying the data
  
  

  
  
   
   Once we have created the index, we can then allow people to search it. When somebody types in a search term, we look for those terms in our index, which then points back to 0 or more different documents. Zend_Search_Lucene will also score the results for us, meaning the most relevant matches are returned first.
  
  

  
  
   
   Keeping the index up-to-date
  
  

  
  
   
   If a document is ever updated in the database, this likely means the index will be out-of-date. This means we have to make an extra consideration in our general content management. That is, we need to re-index that document.
  
  

  
  
   
   There are several ways of approaching this. For instance, you could update it in real-time when the document is updated, or you could run an index update at a certain time every day. Each method has its own advantages. For something like phpRiot, updating the index in real time is probably the best way, as the data is not updated very frequently. On the other hand, if you were indexing user submitted comments, it could create a huge load to recreate the index for every comment, as there could be hundreds of comments per day.
  
  

  
  
   
   Getting started
  
  

  
  
   
   The first thing we must do is install the Zend Framework if you have not already done so. It is structured in a similar way to how the Pear file structure is organised. At this stage, the Zend Framework is only in a “preview” phase. At time of writing the current version was Preview 0.1.3.
  
  

  
  
   
   You can download this from 
   
   http://framework.zend.com/download. If you use Subversion, you can also checkout the trunk version which may have newer code in it.
  
  

  
  
   
   I’m not exactly sure where the developers intended the framework to be stored, but like Pear is stored in /usr/local/lib/php, I chose to store it in /usr/local/lib/zend.
  
  

  
  
   
   Highlight: Plain
  
  
$ cd /usr/local/src
$ wget http://framework.zend.com/download/tgz
$ tar -zxf ZendFramework-0.1.3.tar.gz
$ mv ZendFramework-0.1.3/library /usr/local/lib/zend

  
  
   
   So now all that is required is that we add /usr/local/lib/zend to our include path. For instance, my include path directive in httpd.conf for phpRiot looks something like:
  
  

  
  
   
   Highlight: Plain
  
  
php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php

  
  
   
   Which now becomes:
  
  

  
  
   
   Highlight: Plain
  
  
php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php:/usr/local/lib/zend

  
  
   
   Creating our first index
  
  

  
  
   
   The basic process for creating an index is:
  
  

  
  Open the index
    
Add each document
    

  
  Commit (save) the index
    

  
  
   
   The index is stored on the filesystem, so when you open an index, you specify the path where the it is to be stored. This is achieved by instantiating the 
   
   Zend_Search_Lucene class.
  
  

  
  
   
   Highlight: PHP
  
  
<?php
    require_once('Zend/Search/Lucene.php');
    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = newZend_Search_Lucene($indexPath, true);
?>

  
  
   
   You’ll also notice the second parameter in the constructor call. This means the index is created from scratch. If you set this to false (or omit the argument), an existing index is opened. This is done when updating or querying the index. Since we’re creating the index at this stage, we include that parameter.
  
  

  
  
   
   Adding a document to our index
  
  

  
  
   
   Now that we have opened our index, we must add our documents to it. We create a document to be added to the index using the following:
  
  

  
  
   
   Highlight: PHP
  
  
<?php
    $doc = newZend_Search_Lucene_Document();
?>

  
  
   
   The next thing we must do is determine which fields we need to add to our index.
  
  

  
  
   
   There are several different field types that can be stored for a single document. As we mentioned earlier, we want to store the keywords for our document, and also store some extra data that we can use when fetching the data. If we wanted to, we could duplicate the entire document content in our search index, however, this would be overkill and is unnecessary. It’s not as bad to duplicate smaller items such as the author name or document title as it’s only a small amount of data.
  
  

  
  
   
   As you can tell from the previous paragraph, we’re starting to determine which fields we want to store in our index:
  
  

  
  Document content – but we don’t want to duplicate the entire article, as we’re only going to show users a preview of the article in the search results
        
Title – we’re definitely going to include the title in our results
        
Article teaser – even though we’re not storing the content, we do want to store a short summary of the document to display in our search results
        
Author – we will list the author of each document in our search results, plus we want to give users the ability to search by author
        

  
  

  
  Created – We’ll also store a timestamp of when the article was created.
    

  
  
   
   This brings us to the different types of fields we can use for indexing in Zend_Search_Lucene:
  
  

  
  UnIndexed – Data that isn’t available for searching, but is stored with our document (article teaser, article URL and timestamp of creation)
    
UnStored – Data that is available for search, but isn’t stored in the index in full (the document content)
    

  
  Text – Data that is available for search and is stored in full (title and author)
    

  
  
   
   There is also the 
   
   Keyword and 
   
   Binary fields available, but we won’t be using them in this example. Binary data isn’t searchable, but could be used to store an image with an indexed document.
  
  

  
  
   
   To add a field to our indexed document, we use the 
   
   addField() method. When we call this method, we must pass data that has been prepared according to the necessary field type. This is achieved used the Zend_Search_Lucene_Field class with the field type as the static method name.
  
  

  
  
   
   In other words, to create the 
   
   title field data, we use:
  
  

  
  
   
   Highlight: PHP
  
  
<?php
    $data = Zend_Search_Lucene_Field::Text('title', $docTitle);
?>

  
  
   
   Note that we specify the name of the field here. We can later reference this data by this field name, but we will cover this in the section about querying the index.
  
  

  
  
   
   So to add all the data with the field types we just worked out, we would use this:
  
  

  
  
   
   Highlight: PHP
  
  
<?php
    $doc = newZend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $docUrl));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('created', $docCreated));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $docTeaser));
    $doc->addField(Zend_Search_Lucene_Field::Text('title', $docTitle));
    $doc->addField(Zend_Search_Lucene_Field::Text('author', $docAuthor));
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docBody));
?>

  
  
   
   Important note: We added the main search content with a field name of 
   
   contents. This is a special name to Zend_Search_Lucene, which means by default, all queries will search this field. If users instead want to search the title field for ‘foo’, their search query would look like 
   
   title:foo. This will be covered further in the section about querying the index.
  
  

  
  
   
   Finally, we add the document to the index using 
   
   addDocument():
  
  

  
  
   
   Highlight: PHP
  
  
<?php
    $index->addDocument($doc);
?>

  
  
   
   We will cover this code in full in the next chapter (including looping over several documents and adding them all at once).
  
  

  
  
   
   Committing / saving the index
  
  

  
  
   
   Once all documents have been added, the index must be saved.
  
  

  
  
   
   Highlight: PHP
  
  
<?php
    $index->commit();
?>

  
  
   
   You can call commit after adding each document, but this operation is expensive and each call results in a new index segment (in lamen’s terms, this means it’s an extra file to search when querying the index). Note that better management of the segments in the index is a planned future improvement, but for the time being it’s best to be careful and call it once after you’ve done what you need to do with the index.
  
  

  
  
   
   If you were to look at your filesystem now that you have committed the index, you would notice a directory at the location you specified when opening the index. This directory contains a number of files which are then used when querying the index.
  
  

  
  
   
   Next up we will piece all the above code into a single script which indexes all the articles on phpRiot.
  
  

  
  
   
   Indexing all the articles on phpRiot
  
  

  
  
   
   Now that you’ve seen how to create a basic index, we will extend this script slightly so it can index all the documents in phpRiot.
  
  

  
  
   
   Additionally, we will be extending the base Zend_Search_Lucene_Document class to simplify our code slightly. This will also demonstrate ways you can take advantage of the OOP style of programming that Zend_Search_Lucene uses.
  
  

  
  
   
   Since we are demonstrating the indexing of phpRiot articles, we will also use the DatabaseObject class here to fetch article data, just as phpRiot does. You don’t really need to know how this class works to understand this example (as it is fairly self-explanatory in the function calls), but if you are interested, you can read our article 
   
   Managing your data with DatabaseObject.
  
  

  
  
   
   Extending Zend_Search_Lucene_Document
  
  

  
  
   
   On the previous page, after we opened the index, we created a new instance of Zend_Search_Lucene_Document to hold the index data for a single document. Instead of calling this class directly, we’re going to extend this class to encapsulate all of the adding of data we also did.
  
  

  
  
   
   In other words, we’re going to move the calls to 
   
   addField into our class, rather than calling it for each field after we create our Zend_Search_Lucene_Document item.
  
  

  
  
   
   Highlight: PHP
  
  
<?php
    classPhpRiotIndexedDocumentextendsZend_Search_Lucene_Document
    {
        /**
         * Constructor. Creates our indexable document and adds all
         * necessary fields to it using the passed in DatabaseObject
         * holding the article data.
         */
        publicfunction__construct(&$document)
        {
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('url',     $document->generateUrl()));
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('created', $document->getProperty('created')));
            $this->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $document->getProperty('teaser')));
            $this->addField(Zend_Search_Lucene_Field::Text('title', $document->getProperty('title')));
            $this->addField(Zend_Search_Lucene_Field::Text('author', $document->getProperty('author')));
            $this->addField(Zend_Search_Lucene_Field::UnStored('contents', $document->getProperty('body')));
        }
    }
?>

  
  
   
   As you can see, we’ve made a simple wrapper class which fetches data from the passed in document (which uses DatabaseObject). The 
   
   generateUrl() function is just a special internal method which determines a document’s full URL. We are storing this when we build the index so we don’t have to generate it each time a search is run (especially since this will never change, and if it does we can just rebuild the index).
  
  

  
  
   
   Building the full index
  
  

  
  
   
   Now that we have our class, we can create our index, loop over the documents, and then save our index:
  
  

  
  
   
   Highlight: PHP
  
  
<?php
    require_once('Zend/Search/Lucene.php');
    require_once('DatabaseObject/PhpriotDocument.class.php');
    // where to save our index
    $indexPath = '/var/www/phpriot.com/data/docindex';
    // get a list of all database ids for documents
    $doc_ids = PhpriotDocument::GetDocIds($db);
    // create our index
    $index = new Zend_Search_Lucene($indexPath, true);
    foreach ($doc_ids as $doc_id) {
        // load our databaseobject
        $document = new PhpriotDocument($db);
        $document->loadRecord($doc_id);
        // create our indexed document and add it to the index
        $index->addDocument(new PhpRiotIndexedDocument($document));
    }
    // write the index to disk
    $index->commit();
?>

  
  
   
   The index has now been created! This can take some time if you have many documents or if each document has a large amount of content. We generate this by using PHP on the command line, which allows us to see its progress in real-time if we need to (we also output the title and a status message as each document is indexed).
  
  

  
  
   
   How querying works in Zend_Search_Lucene
  
  

  
  
   
   Now comes the most important part: actually finding stuff! There are quite a lot of options when it comes to querying your index, giving you and your users a lot of control over the returned results.
  
  

  
  
   
   When we created the index, we added six fields, but only three of those are actually searchable: the document title, the document content, and the author.
  
  

  
  
   
   Each of these fields is stored separately for each indexed document, meaning you can search on them separately. The syntax used to search each field is somewhat like Google’s, in that you specify the field, followed by a colon, followed by the term (with no spaces).
  
  

  
  
   
   So to search the author field for ‘Quentin’, the search query would be author:Quentin. (Note that the search is case-insensitive. To make it case-sensitive we would need to change some options when creating our index. For full details on this, please read the Zend_Search_Lucene manual section on Extensibility.)
  
  

  
  
   
   Likewise, to search in the title field for ‘php’, we would use title:php.
  
  

  
  
   
   As we briefly mentioned earlier in this article, the default field that is searched is the one called contents. So if you wanted to search the document body for the word ‘google’, you could use contents:google or just google.
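To make the field syntax concrete, here is a minimal sketch of how a single query word could be split into a field and a term. It is not how Zend_Search_Lucene parses queries internally: parseQueryWord() is a hypothetical helper written for this illustration, and it only shows the rule that a word without a field prefix falls back to contents.

```php
<?php
// Hypothetical helper, not part of Zend_Search_Lucene: splits one query
// word such as "author:Quentin" into its field and term.
function parseQueryWord($word, $defaultField = 'contents')
{
    $pos = strpos($word, ':');
    if ($pos === false) {
        // no field given, so the default field is searched
        return array('field' => $defaultField, 'term' => strtolower($word));
    }
    return array(
        'field' => substr($word, 0, $pos),
        // lower-cased because the search is case-insensitive
        'term'  => strtolower(substr($word, $pos + 1)),
    );
}
```

With this helper, google and contents:google both come out as the term google in the contents field.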
  
  

  
  
   
   Including and excluding terms
  
  

  
  
   
   By default, all specified terms are combined using a boolean ‘or’. This means a document is returned if it contains any of the terms. To force results to have a particular term, prefix it with the plus symbol. To force results not to have a particular term, prefix it with the minus symbol. If you’re searching in a different field, you can put the plus or minus either before the field name or before the term. In other words, +author:Quentin and author:+Quentin are identical.
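The placement rule for the plus and minus signs can be sketched in a few lines. The helpers below are hypothetical and are not the library's parser. They read the sign from either position and report it as true (required), false (prohibited) or null (optional), an encoding chosen here purely for illustration.

```php
<?php
// Hypothetical helper: removes a leading "+" or "-" from $text and
// returns it, or returns null when there is no sign.
function takeSign(&$text)
{
    if ($text !== '' && ($text[0] === '+' || $text[0] === '-')) {
        $sign = $text[0];
        $text = (string) substr($text, 1);
        return $sign;
    }
    return null;
}

// Hypothetical helper: reads sign, field and term from one query word.
function parseSignedWord($word, $defaultField = 'contents')
{
    // the sign may come before the field name ...
    $sign  = takeSign($word);
    $field = $defaultField;
    $pos   = strpos($word, ':');
    if ($pos !== false) {
        $field = substr($word, 0, $pos);
        $word  = (string) substr($word, $pos + 1);
        // ... or directly before the term
        $inner = takeSign($word);
        if ($inner !== null) {
            $sign = $inner;
        }
    }
    $required = null;              // optional
    if ($sign === '+') {
        $required = true;          // must be present
    } elseif ($sign === '-') {
        $required = false;         // must not be present
    }
    return array('field' => $field, 'term' => strtolower($word), 'required' => $required);
}
```

Both +author:Quentin and author:+Quentin produce the same result here: the field author, the term quentin, and a required value of true.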
  
  

  
  
   
   Searching for phrases
  
  

  
  
   
   It is possible to search for exact phrases with Zend_Search_Lucene, so if you wanted to search for the exact phrase “PHP Articles” you could. Because this is somewhat complicated to achieve, we will not include it in our examples or implementation. However, there is a lot of information on this in the Zend_Search_Lucene manual section on query types.
  
  

  
  
   
   Sample queries
  
  

  
  
   
   Here are some queries you can pass to Zend_Search_Lucene and their meanings.
  
  

  
  
   
   Highlight: Plain
  
  
php
    // search the index for any article with the word php
 
php -author:quentin
    // find any article with the word php not written by me
 
author:quentin
    // find all the articles by me
 
php -ajax
    // find all articles with the word php that don't have the word ajax
 
title:mysql
    // find all articles with MySQL in the title
 
title:mysql -author:quentin
    // find all articles with MySQL in the title not by me

  
  
   
   And so on. Hopefully you get the idea.
  
  

  
  
   
   Scoring of results
  
  

  
  
   
   All results returned from a search are assigned a score. This is a measure of how well the document matched the search term.
  
  

  
  
   
   The results are ordered by their score, from highest to lowest.
  
  

  
  
   
   I’m not exactly sure how the score is calculated or what it represents exactly, but it looks pretty on the search results.
  
  

  
  
   
   You can customize the scoring algorithm (and hence the ordering of results). Please see the section later in the article on extending Zend_Search_Lucene.
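For a rough intuition of what such a score can capture, the following is a simplified tf-idf sketch: a document scores higher the more often it contains the term relative to its length, and a term counts for more the rarer it is across all documents. This is an assumption for illustration only and not the formula Zend_Search_Lucene uses. scoreDocuments() and its input format (document id mapped to a list of lower-cased words) are hypothetical.

```php
<?php
// Simplified tf-idf scoring over already tokenized documents.
// $docs maps a document id to its list of lower-cased words.
function scoreDocuments(array $docs, $term)
{
    // count how many documents contain the term at all
    $containing = 0;
    foreach ($docs as $words) {
        if (in_array($term, $words, true)) {
            $containing++;
        }
    }
    $scores = array();
    if ($containing === 0) {
        return $scores;
    }
    // rarer terms get a higher weight
    $idf = 1.0 + log(count($docs) / $containing);
    foreach ($docs as $id => $words) {
        $freq = count(array_keys($words, $term, true));
        if ($freq > 0) {
            // term frequency relative to document length
            $scores[$id] = ($freq / count($words)) * $idf;
        }
    }
    arsort($scores); // highest score first, keys preserved
    return $scores;
}
```

Documents that do not contain the term are left out of the result, which mirrors the way only matching documents are returned from a search.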
  
  

  
  
   
   Querying our index
  
  

  
  
   
   On the previous page we looked at how to write queries to search the index. We learned how to include and exclude terms, and also how to search different fields in our indexed data.
  
  

  
  
   
   Now we will look at actually pulling documents from our index using those queries.
  
  

  
  
   
   There are essentially two ways to query the index: passing the raw query in and letting Zend_Search_Lucene parse the query (ideal when you’re writing a search engine where you’re not sure what the user will enter), or by manually building up the query with API function calls.
  
  

  
  
   
   In either case, you use the find() method on the index. The find() method returns a list of matches from your index.
  
  

  
  
   
   Highlight: PHP
  
  
<?php
    require_once('Zend/Search/Lucene.php');
    $indexPath = '/var/www/phpriot.com/data/docindex';
    // open the existing index (no second parameter, so nothing is created)
    $index = new Zend_Search_Lucene($indexPath);
    $hits = $index->find('php +author:Quentin');
?>

  
  
   
   This sample code searches our index for articles containing ‘php’ that were written by me. Note that when we opened our index, we did not pass the second parameter as we did when we created the index. This is because we are not creating a new index, only querying an existing one.
  
  

  
  
   
   We could also manually build this same query with function calls like so:
  
  

  
  
   
   Highlight: PHP
  
  
<?php
    require_once('Zend/Search/Lucene.php');
    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);
    $query = new Zend_Search_Lucene_Search_Query_MultiTerm();
    // optional term in the default field
    $query->addTerm(new Zend_Search_Lucene_Index_Term('php'), null);
    // required term in the author field
    $query->addTerm(new Zend_Search_Lucene_Index_Term('Quentin', 'author'), true);
    $hits = $index->find($query);
?>

  
  
   
   The second parameter of addTerm() determines whether or not a term is required. True means it is required (like putting a plus sign before the term), false means it is prohibited (like putting a minus sign before the term), and null means it is neither required nor prohibited.
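The effect of these three values on matching can be sketched without the library. The function below is a hypothetical illustration of the usual Lucene-style boolean rules, simplified to a single field: every required term must be present, no prohibited term may be present, and when there are no required terms at least one optional term has to match.

```php
<?php
// $terms is a list of array(term, sign) pairs, where sign is true
// (required), false (prohibited) or null (optional).
// $docWords is the list of words found in one document.
function matchesDocument(array $terms, array $docWords)
{
    $hasRequired = false;
    $optionalHit = false;
    foreach ($terms as $pair) {
        list($term, $sign) = $pair;
        $present = in_array($term, $docWords, true);
        if ($sign === true) {
            $hasRequired = true;
            if (!$present) {
                return false; // a required term is missing
            }
        } elseif ($sign === false) {
            if ($present) {
                return false; // a prohibited term was found
            }
        } elseif ($present) {
            $optionalHit = true;
        }
    }
    // with no required terms, at least one optional term must match
    return $hasRequired || $optionalHit;
}
```

For the query built above (php optional, Quentin required), a document containing only the author term still matches, while one containing only php does not.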
  
  

  
  
   
   The second parameter for Zend_Search_Lucene_Index_Term specifies the field to search. By default this is contents.
  
  

  
  
   
   On the whole, it is easier to simply allow Zend_Search_Lucene to parse the query.
  
  

  
  
   
   Dealing with returned results
  
  

  
  
   
   The results found from your query are returned in an array, meaning you can simply use count() on the array to determine the number of hits.
  
  

  
  
   
   Each of the stored fields is available as a property on each hit.
  
  

  
  
   
   So to loop over the results as we indexed them previously (with a title, author and teaser), we would do the following:
  
  

  
  
   
   Highlight: PHP
  
  
<?php
    require_once('Zend/Search/Lucene.php');
    $query = 'php +author:Quentin';
    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);
    $hits = $index->find($query);
    $numHits = count($hits);
?>
<p>
    Found <?=$numHits?> result(s) for query <?=$query?>.
</p>
<?php foreach ($hits as $hit) { ?>
    <h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
    <p>
        By <?=$hit->author?>
    </p>
    <p>
        <?=$hit->teaser?><br />
        <a href="<?=$hit->url?>">Read more...</a>
    </p>
<?php } ?>

  
  
   
   Here we also used an extra field called ‘score’. As mentioned previously, this is used as an indicator as to how well a document matched the query. Results with the highest score are listed first.
  
  

  
  
   
   Creating a simple search engine
  
  

  
  
   
   Using our code above, we can easily transform this into a simple site search engine. All we need to do is add a form and plug in the submitted query. Let’s assume this script is called search.php:
  
  

  
  
   
   Highlight: PHP
  
  
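<?php
    require_once('Zend/Search/Lucene.php');
    // same search as before, but the query now comes from the form
    $query = isset($_GET['query']) ? $_GET['query'] : '';
    $query = trim($query);
    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);
    // only search if something was actually entered
    $hits = array();
    if (strlen($query) > 0) {
        $hits = $index->find($query);
    }
    $numHits = count($hits);
?>
<form method="get" action="search.php">
    <input type="text" name="query" value="<?=htmlspecialchars($query)?>" />
    <input type="submit" value="Search" />
</form>
<?php if (strlen($query) > 0) { ?>
    <p>
        Found <?=$numHits?> result(s) for query <?=htmlspecialchars($query)?>.
    </p>
    <?php foreach ($hits as $hit) { ?>
        <h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
        <p>
            By <?=$hit->author?>
        </p>
        <p>
            <?=$hit->teaser?><br />
            <a href="<?=$hit->url?>">Read more...</a>
        </p>
    <?php } ?>
<?php } ?>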

  
  
   
   Error handling
  
  

  
  
   
   The one thing we haven’t dealt with yet is errors in the search. For instance, if we were to type in ‘title:’ with no term after it, an error would occur. We handle this by catching the Zend_Search_Lucene_Exception exception.
  
  

  
  
   
   Highlight: PHP
  
  
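<?php
    require_once('Zend/Search/Lucene.php');
    $query = isset($_GET['query']) ? $_GET['query'] : '';
    $query = trim($query);
    $indexPath = '/var/www/phpriot.com/data/docindex';
    $index = new Zend_Search_Lucene($indexPath);
    try {
        $hits = $index->find($query);
    }
    catch (Zend_Search_Lucene_Exception $ex) {
        // treat a failed search as a search with no results
        $hits = array();
    }
    $numHits = count($hits);
?>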

  
  
   
   This means now that if an error occurs in the search, we simply assume zero hits were returned, thereby handling the error without indicating to the user that anything went wrong.
  
  

  
  
   
   Of course, you could also choose to get the error message from the exception and output that instead ($ex->getMessage()).
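Both options (silently returning zero hits, or showing the message) can be combined in one small wrapper. The sketch below is an assumption-laden illustration: runSearch() is a hypothetical helper, $search stands in for a call such as $index->find($query), and the base Exception class is caught instead of Zend_Search_Lucene_Exception so that the example runs without the framework.

```php
<?php
// Runs any search callback and converts a failure into "no hits" plus
// the error message, so the caller can decide whether to display it.
function runSearch($search, $query)
{
    try {
        return array(
            'hits'  => call_user_func($search, $query),
            'error' => null,
        );
    } catch (Exception $ex) {
        return array(
            'hits'  => array(),
            'error' => $ex->getMessage(),
        );
    }
}
```

A page could then always loop over the hits entry and print the error entry only when it is not null.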
  
  

  
  
   
   Keeping the index up-to-date
  
  

  
  
   
   The other thing we haven’t yet dealt with is what happens when any of our documents are updated. There are several ways to handle this:
  
  

  
  Update just the entry for the updated document straight away
  Rebuild the entire index straight away when a document is updated
  Rebuild the entire index at a certain time each day (or several times per day)
    

  
  
   
   The ideal method really depends on the kind of data you have, how often it is updated, and how important it is for the search index to be completely up-to-date.
  
  

  
  
   
   To be honest, I haven’t figured out a way to update a single document in the index yet, but I may just be missing something simple. If you open the index (without the second parameter), and then index a document that is already in the index, then it will be duplicated (and hence returned twice in any matching search).
  
  

  
  
   
   There isn’t too much documentation on this library yet, at least nothing about this specifically. If anybody knows how, please drop me an email or submit a comment with this article.
  
  

  
  
   
   So at this point, the way to keep the index updated is to rebuild it from scratch when a document is updated.
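The duplication problem, and the general delete-then-add idea behind a single-document update, can be shown with a toy in-memory model. This is not the Zend_Search_Lucene API, and whether the library offers a comparable delete operation is left open here. ToyIndex and its methods are hypothetical, and the url value is assumed to identify each document uniquely.

```php
<?php
// Toy in-memory index used only to illustrate the update strategy.
class ToyIndex
{
    private $docs = array();

    // appending never replaces, so adding the same document twice
    // leaves two copies in the index
    public function addDocument(array $doc)
    {
        $this->docs[] = $doc;
    }

    // remove every stored copy with the given url
    public function deleteByUrl($url)
    {
        foreach ($this->docs as $i => $doc) {
            if ($doc['url'] === $url) {
                unset($this->docs[$i]);
            }
        }
    }

    // delete any old copy first, then add the new version
    public function updateDocument(array $doc)
    {
        $this->deleteByUrl($doc['url']);
        $this->addDocument($doc);
    }

    public function count()
    {
        return count($this->docs);
    }
}
```

Calling addDocument() twice for the same url leaves two entries, whereas updateDocument() keeps exactly one.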
  
  

  
  
   
   Extending Zend_Search_Lucene
  
  

  
  
   
   There are several aspects of Zend_Search_Lucene that can be extended, allowing a fully customized search solution. These include:
  
  

  
  A custom tokenizer for determining keywords in a document
  Custom scoring algorithms to determine how well a document matches a search query
  A custom storage method, so your index is stored however and wherever you please
    

  
  
   
   A custom tokenizer
  
  

  
  
   
   There are many reasons why a custom tokenizer can be useful. Here are some ideas:
  
  

  
  PDF tokenizer – a tokenizer that can parse a PDF file and find all the keywords
  Image tokenizer – a tokenizer that can perform Optical Character Recognition (OCR), thereby allowing you to index words in an image (and you could store the image also, using the Binary field type)
  HTML tokenizer – a tokenizer that can read HTML data, thereby knowing not to index HTML keywords but only the actual content. You could make further improvements on this also, such as finding all headings and giving them higher preference than the rest of the content.
    

  
  
   
   More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.analysis.
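As a rough idea of what the HTML tokenizer mentioned above would have to do, here is a standalone sketch. It is not written against the Zend_Search_Lucene analysis classes. tokenizeHtml() is a hypothetical function, and it only handles ASCII letters and digits.

```php
<?php
// Standalone sketch: turns HTML into a list of lower-cased keywords,
// ignoring the markup itself.
function tokenizeHtml($html)
{
    // drop script and style bodies, which strip_tags() would otherwise keep
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);
    // put a space in front of every tag so adjacent words do not merge
    $text = strip_tags(str_replace('<', ' <', $html));
    // turn entities such as &amp; back into plain characters
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // split on anything that is not an ASCII letter or digit
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}
```

A real tokenizer would plug the same idea into the library's extension point described at the URL above, and could additionally record which words came from headings.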
  
  

  
  
   
   Custom scoring algorithms
  
  

  
  
   
   Using a custom scoring algorithm, you can determine how favourably different fields in a document are looked upon. For example, you might want to treat matches in the ‘title’ field (if you have one) much more favourably than matches in the ‘contents’ field.
  
  

  
  
   
   More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.scoring.
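The idea of favouring one field over another can be sketched as a weighted count of matches. This is only an illustration of the principle and not the library's scoring interface: weightedScore() is hypothetical, and the weights used in the example (5 for title, 1 for contents) are arbitrary.

```php
<?php
// $fields maps a field name to the list of lower-cased words in that
// field; $boosts maps a field name to its weight (default 1.0).
function weightedScore(array $fields, $term, array $boosts)
{
    $score = 0.0;
    foreach ($fields as $name => $words) {
        $boost = isset($boosts[$name]) ? $boosts[$name] : 1.0;
        // each occurrence of the term counts once, scaled by the field weight
        $score += $boost * count(array_keys($words, $term, true));
    }
    return $score;
}
```

With a title weight of 5, an article with the term once in its title scores 5.0, ahead of an article that only mentions it twice in the body (2.0).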
  
  

  
  
   
   Custom storage method
  
  

  
  
   
   You can change how indexes are written to disk by extending the Zend_Search_Lucene_Storage_Directory and Zend_Search_Lucene_Storage_File classes.
  
  

  
  
   
   It may be possible to change this to store all indexed data in a database, but I haven’t actually tried this so I’m not sure.
  
  

  
  
   
   More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.storage.
  
  

  
  
   
   Conclusion
  
  

  
  
   
   In this article we looked closely at Zend_Search_Lucene, part of the new Zend Framework. While it is still in development, it works sufficiently well to implement fulltext indexing and searching on any PHP 5 website.
  
  

  
  
   
   We also looked briefly at some ways of extending the search capabilities.
  
  

  
  
   
   Hopefully this has given you some ideas for your own website. Feel free to add comments and ideas to this article.
  
  

  
  

  
  
 