Creating a fulltext search engine in PHP 5 with the Zend Framework's Zend_Search_Lucene --http://www.phpriot.com/d/articles/php/search/zend-search-lucene/index.html
By Quentin Zervaas, 27 April 2006
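(Reposter's note: when I have time I will translate this into Chinese. It is not difficult in itself and can be read at a leisurely pace.)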
This article covers the implementation of a fulltext search engine using PHP 5 and the Zend Framework. We will be using the Zend_Search_Lucene component to create and search our fulltext index.
There are several other libraries we could use instead of this one, but Zend_Search_Lucene is completely native to PHP, whereas others such as Xapian or Tsearch2 rely on third party modules (for instance, the Tsearch2 module must be compiled into your PostgreSQL installation).
It must be noted at this point though that we require at least PHP 5 for Zend_Search_Lucene – PHP 4 will not work.
In this article we will be covering the following:
- How to index a document or series of documents
- The different types of fields that can be indexed
- Searching the index
To demonstrate this functionality, we will cover the implementation of a search engine into phpRiot. We previously used the Tsearch2 module but had some problems that we were unable to overcome.
How fulltext indexing and querying works
Just in case you’ve never dealt with the subject before, I will briefly cover how the whole thing works.
The index essentially stores a special copy of all your data, but instead of storing the actual content of your documents, it stores a list of keywords found in each document.
So if I were to create an index of all the documents on phpRiot, this would be done by looping over every document, finding the keywords and then adding those words to the index.
Additionally, since we can’t just list a bunch of keywords to a user when they perform a search, we also store some other data in the index along with the keywords. In the case of phpRiot, we will store the document title, the author, the document URL and a brief summary of the article.
Alternatively, we could store just the document ID and then look up the other details from the database using that ID. However, storing them in the index is quicker (fewer queries), and since the title and author are only a few words for each document, there is little overhead in duplicating this data.
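To make the idea concrete, here is a minimal sketch of a keyword index in plain PHP. It only illustrates the concept and is not how Zend_Search_Lucene stores its data; the tokenize() and buildIndex() functions and the sample documents are made up for this example.

```php
<?php
// Toy keyword index: maps each keyword to the IDs of the documents containing it.
// Illustration only; Zend_Search_Lucene uses its own index format.

/**
 * Split text into unique lowercase keywords (a very naive tokenizer).
 */
function tokenize($text)
{
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique($words));
}

/**
 * Build the index from an array of documentId => body.
 */
function buildIndex(array $documents)
{
    $index = array();
    foreach ($documents as $id => $body) {
        foreach (tokenize($body) as $word) {
            $index[$word][] = $id;
        }
    }
    return $index;
}

// hypothetical sample data
$documents = array(
    1 => 'Searching with PHP and Lucene',
    2 => 'Managing your data with PHP',
);

$index = buildIndex($documents);

print_r($index['php']);    // lists documents 1 and 2
print_r($index['lucene']); // lists document 1 only
```

Note that the index holds only the keywords and the document IDs, not the original text, which is why anything we want to show in the results has to be stored separately.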
Querying the data
Once we have created the index, we can then allow people to search it. When somebody types in a search term, we look for those terms in our index, which then points back to 0 or more different documents. Zend_Search_Lucene will also score the results for us, meaning the most relevant matches are returned first.
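As a rough sketch of what a lookup with scoring might look like, the following plain PHP ranks documents by how many of the query words they contain. This is a deliberately simple stand-in, not the scoring Zend_Search_Lucene uses; the search() function and the sample index are assumptions for this example.

```php
<?php
// Toy search: look up each query word in a keyword index and rank the matching
// documents by how many query words they contain.

// hypothetical keyword index: word => list of document IDs
$index = array(
    'php'    => array(1, 2, 3),
    'lucene' => array(1),
    'search' => array(1, 3),
);

/**
 * Return an array of documentId => score, best match first.
 */
function search(array $index, $query)
{
    $scores = array();
    $words = preg_split('/\s+/', strtolower(trim($query)), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        if (!isset($index[$word])) {
            continue; // this term points back to zero documents
        }
        foreach ($index[$word] as $id) {
            $scores[$id] = isset($scores[$id]) ? $scores[$id] + 1 : 1;
        }
    }
    arsort($scores); // highest score first, document IDs kept as keys
    return $scores;
}

$hits = search($index, 'php lucene search');
print_r($hits); // document 1 first, then 3, then 2
```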
Keeping the index up-to-date
If a document is ever updated in the database, this likely means the index will be out-of-date. This means we have to make an extra consideration in our general content management. That is, we need to re-index that document.
There are several ways of approaching this. For instance, you could update it in real-time when the document is updated, or you could run an index update at a certain time every day. Each method has its own advantages. For something like phpRiot, updating the index in real time is probably the best way, as the data is not updated very frequently. On the other hand, if you were indexing user submitted comments, it could create a huge load to recreate the index for every comment, as there could be hundreds of comments per day.
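If you go for the scheduled approach, one way to avoid doing the same work twice is to remember which documents have changed and re-index each of them once when the batch runs. The sketch below shows that idea in plain PHP; the ReindexQueue class and the callback are hypothetical and not part of the Zend Framework.

```php
<?php
// Sketch of deferring index updates: remember which documents changed and
// re-index them in one batch. Class and callback names are made up.

class ReindexQueue
{
    private $dirty = array();

    // Record that a document changed; repeated changes are stored once.
    public function markDirty($docId)
    {
        $this->dirty[$docId] = true;
    }

    // Call $reindex once per changed document, then empty the queue.
    // Returns the number of documents processed.
    public function flush($reindex)
    {
        $ids = array_keys($this->dirty);
        foreach ($ids as $docId) {
            call_user_func($reindex, $docId);
        }
        $this->dirty = array();
        return count($ids);
    }
}

$queue = new ReindexQueue();
$queue->markDirty(12); // article 12 was edited
$queue->markDirty(15);
$queue->markDirty(12); // edited again before the batch ran

$reindexed = array();
$count = $queue->flush(function ($docId) use (&$reindexed) {
    // a real callback would update this document's entry in the search index
    $reindexed[] = $docId;
});

echo "Re-indexed $count document(s)\n";
```

For real-time updates you would simply call the same callback straight away instead of queueing the ID.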
Getting started
The first thing we must do is install the Zend Framework if you have not already done so. It is structured in a similar way to how the Pear file structure is organised. At this stage, the Zend Framework is only in a “preview” phase. At time of writing the current version was Preview 0.1.3.
You can download this from
http://framework.zend.com/download. If you use Subversion, you can also checkout the trunk version which may have newer code in it.
I’m not exactly sure where the developers intended the framework to be stored, but just as PEAR is stored in /usr/local/lib/php, I chose to store it in /usr/local/lib/zend.
Highlight: Plain
$ cd /usr/local/src
$ wget http://framework.zend.com/download/tgz
$ tar -zxf ZendFramework-0.1.3.tar.gz
$ mv ZendFramework-0.1.3/library /usr/local/lib/zend
So now all that is required is that we add /usr/local/lib/zend to our include path. For instance, my include path directive in httpd.conf for phpRiot looks something like:
Highlight: Plain
php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php
Which now becomes:
Highlight: Plain
php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php:/usr/local/lib/zend
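If you can't edit httpd.conf, a possible alternative is to extend the include path at runtime from your own script using PHP's set_include_path() function. This is only a sketch, and it assumes the same /usr/local/lib/zend location chosen above.

```php
<?php
// Append the framework directory to the include path at runtime.
// The path is the location chosen in this article; adjust it to your install.
$zendPath = '/usr/local/lib/zend';

set_include_path(get_include_path() . PATH_SEPARATOR . $zendPath);

echo get_include_path(), "\n";
```

This needs to run before the first require_once of a framework file.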
Creating our first index
The basic process for creating an index is:
- Open the index
- Add each document
- Commit (save) the index
The index is stored on the filesystem, so when you open an index, you specify the path where it is to be stored. This is achieved by instantiating the Zend_Search_Lucene class.
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = new Zend_Search_Lucene($indexPath, true);
?>
You’ll also notice the second parameter in the constructor call. Passing true means the index is created from scratch. If you set this to false (or omit the argument), an existing index is opened instead, which is what you do when updating or querying the index. Since we’re creating the index at this stage, we include that parameter.
Adding a document to our index
Now that we have opened our index, we must add our documents to it. We create a document to be added to the index using the following:
Highlight: PHP
<?php
$doc = new Zend_Search_Lucene_Document();
?>
The next thing we must do is determine which fields we need to add to our index.
There are several different field types that can be stored for a single document. As we mentioned earlier, we want to store the keywords for our document, and also store some extra data that we can use when fetching the data. If we wanted to, we could duplicate the entire document content in our search index, however, this would be overkill and is unnecessary. It’s not as bad to duplicate smaller items such as the author name or document title as it’s only a small amount of data.
As you can tell from the previous paragraph, we’re starting to determine which fields we want to store in our index:
- Document content – but we don’t want to duplicate the entire article, as we’re only going to show users a preview of the article in the search results
- Title – we’re definitely going to include the title in our results
- Article teaser – even though we’re not storing the content, we do want to store a short summary of the document to display in our search results
- Author – we will list the author of each document in our search results, plus we want to give users the ability to search by author
- Created – We’ll also store a timestamp of when the article was created.
This brings us to the different types of fields we can use for indexing in Zend_Search_Lucene:
- UnIndexed – Data that isn’t available for searching, but is stored with our document (article teaser, article URL and timestamp of creation)
- UnStored – Data that is available for search, but isn’t stored in the index in full (the document content)
- Text – Data that is available for search and is stored in full (title and author)
There are also Keyword and Binary field types available, but we won’t be using them in this example. Binary data isn’t searchable, but could be used to store an image with an indexed document.
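The field types described above can be summarised as a small lookup table. The following plain PHP sketch is just a way of checking our plan, and the $fieldTypes array, $plan array and fieldsWhere() helper are made up for this purpose (Keyword is left out because we don't describe it here).

```php
<?php
// Summary of the field types described above as a lookup table:
// 'stored' = kept in the index in full, 'indexed' = available for searching.
$fieldTypes = array(
    'Text'      => array('stored' => true,  'indexed' => true),
    'UnStored'  => array('stored' => false, 'indexed' => true),
    'UnIndexed' => array('stored' => true,  'indexed' => false),
    'Binary'    => array('stored' => true,  'indexed' => false),
);

// our plan for phpRiot: field name => field type
$plan = array(
    'url'      => 'UnIndexed',
    'created'  => 'UnIndexed',
    'teaser'   => 'UnIndexed',
    'title'    => 'Text',
    'author'   => 'Text',
    'contents' => 'UnStored',
);

/**
 * Return the names of the planned fields that have the given flag set.
 */
function fieldsWhere(array $plan, array $fieldTypes, $flag)
{
    $names = array();
    foreach ($plan as $name => $type) {
        if ($fieldTypes[$type][$flag]) {
            $names[] = $name;
        }
    }
    return $names;
}

$searchable  = fieldsWhere($plan, $fieldTypes, 'indexed');
$displayable = fieldsWhere($plan, $fieldTypes, 'stored');

echo 'Searchable: ', implode(', ', $searchable), "\n";
echo 'Available in results: ', implode(', ', $displayable), "\n";
```

In other words, only the title, author and contents can be searched, and everything except the contents can be shown in the search results.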
To add a field to our indexed document, we use the addField() method. When we call this method, we must pass data that has been prepared according to the necessary field type. This is achieved using the Zend_Search_Lucene_Field class with the field type as the static method name.
In other words, to create the title field data, we use:
Highlight: PHP
<?php
$data = Zend_Search_Lucene_Field::Text('title', $docTitle);
?>
Note that we specify the name of the field here. We can later reference this data by this field name, but we will cover this in the section about querying the index.
So to add all the data with the field types we just worked out, we would use this:
Highlight: PHP
<?php
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $docUrl));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('created', $docCreated));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $docTeaser));
$doc->addField(Zend_Search_Lucene_Field::Text('title', $docTitle));
$doc->addField(Zend_Search_Lucene_Field::Text('author', $docAuthor));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docBody));
?>
Important note: We added the main search content with a field name of contents. This is a special name to Zend_Search_Lucene, which means by default, all queries will search this field. If users instead want to search the title field for ‘foo’, their search query would look like title:foo. This will be covered further in the section about querying the index.
Finally, we add the document to the index using addDocument():
Highlight: PHP
<?php
$index->addDocument($doc);
?>
We will cover this code in full in the next chapter (including looping over several documents and adding them all at once).
Committing / saving the index
Once all documents have been added, the index must be saved.
Highlight: PHP
<?php
$index->commit();
?>
You can call commit() after adding each document, but this operation is expensive and each call results in a new index segment (in layman’s terms, this means it’s an extra file to search when querying the index). Note that better management of the segments in the index is a planned future improvement, but for the time being it’s best to be careful and call it once after you’ve done what you need to do with the index.
If you were to look at your filesystem now that you have committed the index, you would notice a directory at the location you specified when opening the index. This directory contains a number of files which are then used when querying the index.
Next up we will piece all the above code into a single script which indexes all the articles on phpRiot.
Indexing all the articles on phpRiot
Now that you’ve seen how to create a basic index, we will extend this script slightly so it can index all the documents in phpRiot.
Additionally, we will be extending the base Zend_Search_Lucene_Document class to simplify our code slightly. This will also demonstrate ways you can take advantage of the OOP style of programming that Zend_Search_Lucene uses.
Since we are demonstrating the indexing of phpRiot articles, we will also use the DatabaseObject class here to fetch article data, just as phpRiot does. You don’t really need to know how this class works to understand this example (as it is fairly self-explanatory in the function calls), but if you are interested, you can read our article Managing your data with DatabaseObject.
Extending Zend_Search_Lucene_Document
On the previous page, after we opened the index, we created a new instance of Zend_Search_Lucene_Document to hold the index data for a single document. Instead of using this class directly, we’re going to extend it to encapsulate the field-adding code we wrote there.
In other words, we’re going to move the calls to addField() into our class, rather than calling it for each field after we create our Zend_Search_Lucene_Document item.
Highlight: PHP
<?php
class PhpRiotIndexedDocument extends Zend_Search_Lucene_Document
{
    /**
     * Constructor. Creates our indexable document and adds all
     * necessary fields to it using the passed in DatabaseObject
     * holding the article data.
     */
    public function __construct(&$document)
    {
        $this->addField(Zend_Search_Lucene_Field::UnIndexed('url', $document->generateUrl()));
        $this->addField(Zend_Search_Lucene_Field::UnIndexed('created', $document->getProperty('created')));
        $this->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $document->getProperty('teaser')));
        $this->addField(Zend_Search_Lucene_Field::Text('title', $document->getProperty('title')));
        $this->addField(Zend_Search_Lucene_Field::Text('author', $document->getProperty('author')));
        $this->addField(Zend_Search_Lucene_Field::UnStored('contents', $document->getProperty('body')));
    }
}
?>
As you can see, we’ve made a simple wrapper class which fetches data from the passed-in document (which uses DatabaseObject). The generateUrl() function is just a special internal method which determines a document’s full URL. We are storing this when we build the index so we don’t have to generate it each time a search is run (especially since this will never change, and if it does we can just rebuild the index).
Building the full index
Now that we have our class, we can create our index, loop over the documents, and then save our index:
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
require_once('DatabaseObject/PhpriotDocument.class.php');

// where to save our index
$indexPath = '/var/www/phpriot.com/data/docindex';

// get a list of all database ids for documents
$doc_ids = PhpriotDocument::GetDocIds($db);

// create our index
$index = new Zend_Search_Lucene($indexPath, true);

foreach ($doc_ids as $doc_id) {
    // load our databaseobject
    $document = new PhpriotDocument($db);
    $document->loadRecord($doc_id);

    // create our indexed document and add it to the index
    $index->addDocument(new PhpRiotIndexedDocument($document));
}

// write the index to disk
$index->commit();
?>
The index has now been created! This can take some time if you have many documents or if each document has a large amount of content. We generate this by using PHP on the command line, which allows us to see its progress in real-time if we need to (we also output the title and a status message as each document is indexed).
How querying works in Zend_Search_Lucene
Now comes the most important part—actually finding stuff! There are quite a lot of options when it comes to querying your index, allowing you and your users to have a lot of control in the returned results.
When we created the index, we added six fields, but only three of those were actually searchable: document title, document content, and the author.
Each of these items is stored separately for each indexed document, meaning you can search on them separately. The syntax used to search each field is somewhat like Google’s, in that you specify the field, followed by a colon, followed by the term (with no spaces).
So to search the author field for ‘Quentin’, the search query would be author:Quentin. (Note that the search is case-insensitive. To make it case-sensitive we would need to change some options when creating our index. For full details on this, please read the Zend_Search_Lucene manual section on Extensibility.)
Likewise, to search in the title field for ‘php’, we would use title:php.
As we briefly mentioned earlier in this article, the default field that is searched is the one called contents. So if you wanted to search the document body for the word ‘google’, you could use contents:google or just google.
Including and excluding terms
By default, all specified terms are searched on using a boolean ‘or’. This means that any of the terms can exist for a document to be returned. To force results to have a particular term, the plus symbol is used. To force results to not have a particular term, the minus symbol is used. If you’re searching in a different field, you can put the plus or minus either before the field name or before the term. In other words, +author:Quentin and author:+Quentin are equivalent.
Searching for phrases
It is possible to search for exact phrases with Zend_Search_Lucene, so if you wanted to search for the exact phrase “PHP Articles” you could. Because this is somewhat complicated to achieve, we will not be including this in our examples or implementation. However, there is a lot of information on this in the Zend_Search_Lucene manual section on query types.
Sample queries
Here are some queries you can pass to Zend_Search_Lucene and their meanings.
Highlight: Plain
php
// search the index for any article with the word php
php -author:quentin
// find any article with the word php not written by me
author:quentin
// find all the articles by me
php -ajax
// find all articles with the word php that don't have the word ajax
title:mysql
// find all articles with MySQL in the title
title:mysql -author:quentin
// find all articles with MySQL in the title not by me
And so on. Hopefully you get the idea.
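To show how queries like these break down, here is a toy parser in plain PHP. It is only a sketch of the syntax covered in this article (an optional plus or minus, an optional field name, then the term) and is not the parser Zend_Search_Lucene uses; parseQuery() is a hypothetical helper and it does not handle phrases. Each term is marked with true (required), false (prohibited) or null (optional).

```php
<?php
// Toy parser for queries made of [+|-][field:]term tokens separated by spaces.
// Illustration only; this is not Zend_Search_Lucene's query parser.

function parseQuery($query, $defaultField = 'contents')
{
    $terms = array();
    $tokens = preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY);

    foreach ($tokens as $token) {
        $field = $defaultField;
        $sign = null; // null = optional, true = required, false = prohibited

        // the sign may come before the field name ...
        if ($token[0] === '+' || $token[0] === '-') {
            $sign = ($token[0] === '+');
            $token = (string) substr($token, 1);
        }

        // ... followed by an optional field name and a colon ...
        $pos = strpos($token, ':');
        if ($pos !== false) {
            $field = substr($token, 0, $pos);
            $token = (string) substr($token, $pos + 1);
        }

        // ... or the sign may come directly before the term
        if ($token !== '' && ($token[0] === '+' || $token[0] === '-')) {
            $sign = ($token[0] === '+');
            $token = (string) substr($token, 1);
        }

        if ($token === '') {
            continue; // nothing left to search for
        }

        // searching is case-insensitive by default, so normalise the term
        $terms[] = array('field' => $field, 'term' => strtolower($token), 'sign' => $sign);
    }

    return $terms;
}

print_r(parseQuery('title:mysql -author:quentin'));
```

With this sketch, +author:Quentin and author:+Quentin produce the same result, matching the rule described earlier.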
Scoring of results
All results returned from a search are assigned a score. This is a measure of how well the document matched the search term.
The results are ordered by their score, from highest to lowest.
I’m not exactly sure how the score is calculated or what it represents exactly, but it looks pretty on the search results.
You can customize the scoring algorithm (and hence the ordering of results). Please see the section later in the article on extending Zend_Search_Lucene.
Querying our index
On the previous page we looked at how to write queries to search the index. We learned how to include and exclude terms, and also how to search different fields in our indexed data.
Now we will look at actually pulling documents from our index using those queries.
There are essentially two ways to query the index: passing the raw query in and letting Zend_Search_Lucene parse the query (ideal when you’re writing a search engine where you’re not sure what the user will enter), or by manually building up the query with API function calls.
In either case, you use the find() method on the index. The find() method returns a list of matches from your index.
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = new Zend_Search_Lucene($indexPath);
$hits = $index->find('php +author:Quentin');
?>
This sample code searches our index for articles containing ‘php’ that were written by me. Note that when we opened our index, we did not pass the second parameter as we did when we created the index. This is because we are not writing the index, we are querying it.
We could also manually build this same query with function calls like so:
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = new Zend_Search_Lucene($indexPath);
$query = new Zend_Search_Lucene_Search_Query_MultiTerm();
$query->addTerm(new Zend_Search_Lucene_Index_Term('php'), null);
$query->addTerm(new Zend_Search_Lucene_Index_Term('Quentin', 'author'), true);
$hits = $index->find($query);
?>
The second parameter passed to addTerm() determines whether or not a term is required. True means it is required (like putting a plus sign before the term), false means it is prohibited (like putting a minus sign before the term), and null means it is neither required nor prohibited.
The second parameter for Zend_Search_Lucene_Index_Term specifies the field to search. By default this is contents.
On the whole, it is easier to simply allow Zend_Search_Lucene to parse the query.
Dealing with returned results
The results found from your query are returned in an array, meaning you can simply use count() on the array to determine the number of hits.
Each of the stored fields is available as a property of the hit object.
So to loop over the results as we indexed them previously (with a title, author and teaser), we would do the following:
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$query = 'php +author:Quentin';
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = new Zend_Search_Lucene($indexPath);
$hits = $index->find($query);
$numHits = count($hits);
?>
<p>
Found <?=$numHits?> result(s) for query <?=$query?>.
</p>
<?php foreach ($hits as $hit) { ?>
<h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
<p>
By <?=$hit->author?>
</p>
<p>
<?=$hit->teaser?><br />
<a href="<?=$hit->url?>">Read more...</a>
</p>
<?php } ?>
Here we also used an extra field called ‘score’. As mentioned previously, this is used as an indicator as to how well a document matched the query. Results with the highest score are listed first.
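One thing to watch when the query comes from a user rather than being hard-coded: anything printed into the page should be escaped first. The following is a minimal sketch of the same output built as a string with escaping applied. The h() and renderHits() helpers are hypothetical, and the stdClass object simply stands in for a real search hit with the fields we stored.

```php
<?php
// Sketch: build the result list as a string, escaping every value taken from
// the index or from the user before it goes into the HTML.

function h($value)
{
    return htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8');
}

function renderHits(array $hits, $query)
{
    $html = '<p>Found ' . count($hits) . ' result(s) for query ' . h($query) . ".</p>\n";
    foreach ($hits as $hit) {
        $html .= '<h3>' . h($hit->title) . ' (score: ' . h($hit->score) . ")</h3>\n"
               . '<p>By ' . h($hit->author) . "</p>\n"
               . '<p>' . h($hit->teaser) . '<br />'
               . '<a href="' . h($hit->url) . '">Read more...</a></p>' . "\n";
    }
    return $html;
}

// stand-in for a search hit, using the field names from this article
$hit = new stdClass();
$hit->title  = 'PHP & Lucene';
$hit->score  = 0.5;
$hit->author = 'Quentin Zervaas';
$hit->teaser = 'A short summary of the article.';
$hit->url    = '/d/articles/example';

$html = renderHits(array($hit), '<b>php</b>');
echo $html;
```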
Creating a simple search engine
Using our code above, we can easily transform this into a simple site search engine. All we need to do is add a form and plug in the submitted query. Let’s assume this script is called search.php:
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$query = isset(Creating a fulltext search engine in PHP 5 with the Zend Framework's Zend_Search_Lucene --http://www.phpriot.com/d/articles/php/search/zend-search-lucene/index.htmlBy Quentin Zervaas, 27 April 2006This article covers the implementation of a fulltext search engine using PHP 5 and the Zend Framework. We will be using the Zend_Search_Lucene component to create and search our fulltext index.There are several other libraries we could use instead of this one, but Zend_Search_Lucene is completely native to PHP, whereas others such as Xapian or Tsearch2 rely on third party modules (for instance, the Tsearch2 module must be compiled into your PostgreSQL installation).It must be noted at this point though that we require at least PHP 5 for Zend_Search_Lucene – PHP 4 will not work.In this article we will be covering the following:
- How to index a document or series of documents
- The different types of fields that can be indexed
- Searching the index
To demonstrate this functionality, we will cover the implementation of a search engine into phpRiot. We previously used the Tsearch2 module but had some problems that we were unable to overcome.
How fulltext indexing and querying works
Just in case you’ve never dealt with the subject before, I will briefly cover how the whole thing works.
The index essentially stores a special copy of all your data, but instead of storing the actual content of your documents, it stores a list of keywords found in each document.
So if I were to create an index of all the documents on phpRiot, this would be done by looping over every document, finding the keywords and then adding those words to the document.
Additionally, since we can’t just list a bunch of keywords to a user when they perform a search, we also store some other data in the index along with the keywords. In the case of phpRiot, we will store the document title, the author, the document URL and a brief summary of the article.
Alternatively we just store the document ID, and then proceed to look up those other details from the database using the ID, however it is quicker to do it this way (less queries), and since the title and author are only a few words for each document, there is little overhead in duplicating this data by storing it in our index.
Querying the data
Once we have created the index, we can then allow people to search it. When somebody types in a search term, we look for those terms in our index, which then points back to 0 or more different documents. Zend_Search_Lucene will also score the results for us, meaning the most relevant matches are returned first.
Keeping the index up-to-date
If a document is ever updated in the database, this likely means the index will be out-of-date. This means we have to make an extra consideration in our general content management. That is, we need to re-index that document.
There are several ways of approaching this. For instance, you could update it in real-time when the document is updated, or you could run an index update at a certain time every day. Each method has its own advantages. For something like phpRiot, updating the index in real time is probably the best way, as the data is not updated very frequently. On the other hand, if you were indexing user submitted comments, it could create a huge load to recreate the index for every comment, as there could be hundreds of comments per day.
Getting started
The first thing we must do is install the Zend Framework if you have not already done so. It is structured in a similar way to how the Pear file structure is organised. At this stage, the Zend Framework is only in a “preview” phase. At time of writing the current version was Preview 0.1.3.
You can download this from
http://framework.zend.com/download. If you use Subversion, you can also checkout the trunk version which may have newer code in it.
I’m not exactly sure where the developers intended the framework to be stored, but like Pear is stored in /usr/local/lib/php, I chose to store it in /usr/local/lib/zend.
Highlight: Plain
$ cd /usr/local/src
$ wget http://framework.zend.com/download/tgz
$ tar -zxf ZendFramework-0.1.3.tar.gz
$ mv ZendFramework-0.1.3/library /usr/local/lib/zend
So now all that is required is that we add /usr/local/lib/zend to our include path. For instance, my include path directive in httpd.conf for phpRiot looks something like:
Highlight: Plain
php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php
Which now becomes:
Highlight: Plain
php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php:/usr/local/lib/zend
Creating our first index
The basic process for creating an index is:
- Open the index
- Add each document
- Commit (save) the index
The index is stored on the filesystem, so when you open an index, you specify the path where the it is to be stored. This is achieved by instantiating the
Zend_Search_Lucene class.
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = newZend_Search_Lucene($indexPath, true);
?>
You’ll also notice the second parameter in the constructor call. This means the index is created from scratch. If you set this to false (or omit the argument), an existing index is opened. This is done when updating or querying the index. Since we’re creating the index at this stage, we include that parameter.
Adding a document to our index
Now that we have opened our index, we must add our documents to it. We create a document to be added to the index using the following:
Highlight: PHP
<?php
$doc = newZend_Search_Lucene_Document();
?>
The next thing we must do is determine which fields we need to add to our index.
There are several different field types that can be stored for a single document. As we mentioned earlier, we want to store the keywords for our document, and also store some extra data that we can use when fetching the data. If we wanted to, we could duplicate the entire document content in our search index, however, this would be overkill and is unnecessary. It’s not as bad to duplicate smaller items such as the author name or document title as it’s only a small amount of data.
As you can tell from the previous paragraph, we’re starting to determine which fields we want to store in our index:
- Document content – but we don’t want to duplicate the entire article, as we’re only going to show users a preview of the article in the search results
- Title – we’re definitely going to include the title in our results
- Article teaser – even though we’re not storing the content, we do want to store a short summary of the document to display in our search results
- Author – we will list the author of each document in our search results, plus we want to give users the ability to search by author
- Created – We’ll also store a timestamp of when the article was created.
This brings us to the different types of fields we can use for indexing in Zend_Search_Lucene:
- UnIndexed – Data that isn’t available for searching, but is stored with our document (article teaser, article URL and timestamp of creation)
- UnStored – Data that is available for search, but isn’t stored in the index in full (the document content)
- Text – Data that is available for search and is stored in full (title and author)
There is also the
Keyword and
Binary fields available, but we won’t be using them in this example. Binary data isn’t searchable, but could be used to store an image with an indexed document.
To add a field to our indexed document, we use the
addField() method. When we call this method, we must pass data that has been prepared according to the necessary field type. This is achieved used the Zend_Search_Lucene_Field class with the field type as the static method name.
In other words, to create the
title field data, we use:
Highlight: PHP
<?php
$data = Zend_Search_Lucene_Field::Text('title', $docTitle);
?>
Note that we specify the name of the field here. We can later reference this data by this field name, but we will cover this in the section about querying the index.
So to add all the data with the field types we just worked out, we would use this:
Highlight: PHP
<?php
$doc = newZend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $docUrl));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('created', $docCreated));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $docTeaser));
$doc->addField(Zend_Search_Lucene_Field::Text('title', $docTitle));
$doc->addField(Zend_Search_Lucene_Field::Text('author', $docAuthor));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docBody));
?>
Important note: We added the main search content with a field name of
contents. This is a special name to Zend_Search_Lucene, which means by default, all queries will search this field. If users instead want to search the title field for ‘foo’, their search query would look like
title:foo. This will be covered further in the section about querying the index.
Finally, we add the document to the index using
addDocument():
Highlight: PHP
<?php
$index->addDocument($doc);
?>
We will cover this code in full in the next chapter (including looping over several documents and adding them all at once).
Committing / saving the index
Once all documents have been added, the index must be saved.
Highlight: PHP
<?php
$index->commit();
?>
You can call commit after adding each document, but this operation is expensive and each call results in a new index segment (in lamen’s terms, this means it’s an extra file to search when querying the index). Note that better management of the segments in the index is a planned future improvement, but for the time being it’s best to be careful and call it once after you’ve done what you need to do with the index.
If you were to look at your filesystem now that you have committed the index, you would notice a directory at the location you specified when opening the index. This directory contains a number of files which are then used when querying the index.
Next up we will piece all the above code into a single script which indexes all the articles on phpRiot.
Indexing all the articles on phpRiot
Now that you’ve seen how to create a basic index, we will extend this script slightly so it can index all the documents in phpRiot.
Additionally, we will be extending the base Zend_Search_Lucene_Document class to simplify our code slightly. This will also demonstrate ways you can take advantage of the OOP style of programming that Zend_Search_Lucene uses.
Since we are demonstrating the indexing of phpRiot articles, we will also use the DatabaseObject class here to fetch article data, just as phpRiot does. You don’t really need to know how this class works to understand this example (as it is fairly self-explanatory in the function calls), but if you are interested, you can read our article
Managing your data with DatabaseObject.
Extending Zend_Search_Lucene_Document
On the previous page, after we opened the index, we created a new instance of Zend_Search_Lucene_Document to hold the index data for a single document. Instead of calling this class directly, we’re going to extend this class to encapsulate all of the adding of data we also did.
In other words, we’re going to move the calls to
addField into our class, rather than calling it for each field after we create our Zend_Search_Lucene_Document item.
Highlight: PHP
<?php
class PhpRiotIndexedDocument extends Zend_Search_Lucene_Document
{
    /**
     * Constructor. Creates our indexable document and adds all
     * necessary fields to it using the passed in DatabaseObject
     * holding the article data.
     */
    public function __construct(&$document)
    {
        $this->addField(Zend_Search_Lucene_Field::UnIndexed('url', $document->generateUrl()));
        $this->addField(Zend_Search_Lucene_Field::UnIndexed('created', $document->getProperty('created')));
        $this->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $document->getProperty('teaser')));
        $this->addField(Zend_Search_Lucene_Field::Text('title', $document->getProperty('title')));
        $this->addField(Zend_Search_Lucene_Field::Text('author', $document->getProperty('author')));
        $this->addField(Zend_Search_Lucene_Field::UnStored('contents', $document->getProperty('body')));
    }
}
?>
As you can see, we’ve made a simple wrapper class which fetches data from the passed in document (which uses DatabaseObject). The generateUrl() function is just a special internal method which determines a document’s full URL. We are storing this when we build the index so we don’t have to generate it each time a search is run (especially since this will never change, and if it does we can just rebuild the index).
Building the full index
Now that we have our class, we can create our index, loop over the documents, and then save our index:
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
require_once('DatabaseObject/PhpriotDocument.class.php');
// where to save our index
$indexPath = '/var/www/phpriot.com/data/docindex';
// get a list of all database ids for documents
$doc_ids = PhpriotDocument::GetDocIds($db);
// create our index
$index = new Zend_Search_Lucene($indexPath, true);
foreach ($doc_ids as $doc_id) {
    // load our databaseobject
    $document = new PhpriotDocument($db);
    $document->loadRecord($doc_id);
    // create our indexed document and add it to the index
    $index->addDocument(new PhpRiotIndexedDocument($document));
}
// write the index to disk
$index->commit();
?>
The index has now been created! This can take some time if you have many documents or if each document has a large amount of content. We generate this by using PHP on the command line, which allows us to see its progress in real-time if we need to (we also output the title and a status message as each document is indexed).
How querying works in Zend_Search_Lucene
Now comes the most important part: actually finding stuff! There are quite a lot of options when it comes to querying your index, giving you and your users a lot of control over the returned results.
When we created the index, we added six fields, but only three of those were actually searchable: document title, document content, and the author.
Each of these fields is stored separately for each indexed document, meaning you can search on them separately. The syntax used to search each field is somewhat like Google’s, in that you specify the field, followed by a colon, followed by the term (with no spaces).
So to search the author field for ‘Quentin’, the search query would be author:Quentin. (Note that the search is case-insensitive. To make it case-sensitive we would need to change some options when creating our index. For full details on this, please read the Zend_Search_Lucene manual section on Extensibility.)
Likewise, to search in the title field for ‘php’, we would use title:php.
As we briefly mentioned earlier in this article, the default field that is searched is the one called contents. So if you wanted to search the document body for the word ‘google’, you could use contents:google or just google.
Including and excluding terms
By default, all specified terms are combined using a boolean ‘or’. This means a document is returned if it contains any of the terms. To force results to have a particular term, the plus symbol is used. To force results to not have a particular term, the minus symbol is used. If you’re searching in a different field, you can put the plus or minus either before the field name or before the term. In other words, +author:Quentin and author:+Quentin are identical.
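To make the plus and minus rules concrete, here is a minimal sketch that sorts the terms of a simple query into required, prohibited and optional groups. The function name classifyQueryTerms() is hypothetical and this is not how Zend_Search_Lucene parses queries internally; it only handles the +, - and field:term forms described above and ignores phrases and everything else.

```php
<?php
/**
 * Toy illustration only (hypothetical helper, not part of Zend_Search_Lucene).
 * Splits a simple query into required, prohibited and optional terms,
 * each written as "field:term". Terms without a field use "contents".
 */
function classifyQueryTerms($query)
{
    $result = array(
        'required'   => array(),
        'prohibited' => array(),
        'optional'   => array(),
    );

    $tokens = preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY);

    foreach ($tokens as $token) {
        $field = 'contents'; // default field when none is given
        $sign  = '';

        // the sign may come before the field name ...
        if ($token[0] == '+' || $token[0] == '-') {
            $sign  = $token[0];
            $token = (string) substr($token, 1);
        }

        // split off an optional "field:" prefix
        $pos = strpos($token, ':');
        if ($pos !== false) {
            $field = substr($token, 0, $pos);
            $token = (string) substr($token, $pos + 1);

            // ... or directly before the term
            if ($sign == '' && $token !== '' && ($token[0] == '+' || $token[0] == '-')) {
                $sign  = $token[0];
                $token = (string) substr($token, 1);
            }
        }

        if ($token === '') {
            continue; // nothing left to search for, e.g. "title:"
        }

        if ($sign == '+') {
            $result['required'][] = $field . ':' . $token;
        } else if ($sign == '-') {
            $result['prohibited'][] = $field . ':' . $token;
        } else {
            $result['optional'][] = $field . ':' . $token;
        }
    }

    return $result;
}
```

Calling classifyQueryTerms('php +author:Quentin -ajax') puts author:Quentin in the required group, contents:ajax in the prohibited group and contents:php in the optional group.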
Searching for phrases
It is possible to search for exact phrases with Zend_Search_Lucene, so if you wanted to search for the exact phrase “PHP Articles” you could. Because this is somewhat complicated to achieve, we will not be including it in our examples or implementation. However, there is a lot of information about it in the Zend_Search_Lucene manual section on query types.
Sample queries
Here are some queries you can pass to Zend_Search_Lucene and their meanings.
Highlight: Plain
php
// search the index for any article with the word php
php -author:quentin
// find any article with the word php not written by me
author:quentin
// find all the articles by me
php -ajax
// find all articles with the word php that don't have the word ajax
title:mysql
// find all articles with MySQL in the title
title:mysql -author:quentin
// find all articles with MySQL in the title not by me
And so on. Hopefully you get the idea.
Scoring of results
All results returned from a search are assigned a score. This is a measure of how well the document matched the search term.
The results are ordered by their score, from highest to lowest.
I’m not exactly sure how the score is calculated or what it represents exactly, but it looks pretty on the search results.
You can customize the scoring algorithm (and hence the ordering of results). Please see the section later in the article on extending Zend_Search_Lucene.
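To give a rough feel for what a score means, the following sketch ranks documents by how many times the query words occur in them. This is a deliberately simplified stand-in and not the formula Zend_Search_Lucene uses; the function name scoreDocuments() and the sample data are made up for illustration.

```php
<?php
/**
 * Toy scoring sketch (hypothetical, NOT Zend_Search_Lucene's algorithm).
 * The score of a document is simply the number of times any query word
 * occurs in its text. Documents that do not match at all are left out.
 *
 * @param array  $documents document id => document text
 * @param string $query     one or more words
 * @return array document id => score, highest score first
 */
function scoreDocuments(array $documents, $query)
{
    $terms  = preg_split('/\W+/', strtolower($query), -1, PREG_SPLIT_NO_EMPTY);
    $scores = array();

    foreach ($documents as $id => $text) {
        // count how often each word appears in this document
        $words  = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        $counts = array_count_values($words);

        $score = 0;
        foreach ($terms as $term) {
            if (isset($counts[$term])) {
                $score += $counts[$term];
            }
        }

        if ($score > 0) {
            $scores[$id] = $score;
        }
    }

    // sort by score, highest first, keeping the document ids as keys
    arsort($scores);

    return $scores;
}
```

A real engine weighs more than raw counts, but the principle is the same: every matching document gets a number, and the results are sorted by that number.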
Querying our index
On the previous page we looked at how to write queries to search the index. We learned how to include and exclude terms, and also how to search different fields in our indexed data.
Now we will look at actually pulling documents from our index using those queries.
There are essentially two ways to query the index: passing the raw query in and letting Zend_Search_Lucene parse the query (ideal when you’re writing a search engine where you’re not sure what the user will enter), or by manually building up the query with API function calls.
In either case, you use the find() method on the index. The find() method returns a list of matches from your index.
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = new Zend_Search_Lucene($indexPath);
$hits = $index->find('php +author:Quentin');
?>
This sample code searches our index for articles containing ‘php’ that were written by me. Note that when we opened our index, we did not pass the second parameter as we did when we created the index. This is because we are not writing the index, we are querying it.
We could also manually build this same query with function calls like so:
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = new Zend_Search_Lucene($indexPath);
$query = new Zend_Search_Lucene_Search_Query_MultiTerm();
$query->addTerm(new Zend_Search_Lucene_Index_Term('php'), null);
$query->addTerm(new Zend_Search_Lucene_Index_Term('Quentin', 'author'), true);
$hits = $index->find($query);
?>
The second parameter to addTerm() determines whether or not a term is required. True means it is required (like putting a plus sign before the term), false means it is prohibited (like putting a minus sign before the term), and null means it is neither required nor prohibited.
The second parameter to the Zend_Search_Lucene_Index_Term constructor specifies the field to search. By default this is contents.
On the whole, it is easier to simply allow Zend_Search_Lucene to parse the query.
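If you let the parser do the work but still assemble the query in code, the true/false/null convention above maps directly onto the plus and minus prefixes. The helper below is a hypothetical sketch (buildQueryString() is not part of Zend_Search_Lucene) and it does no escaping of special characters, so it is only suitable for simple single-word terms.

```php
<?php
/**
 * Hypothetical helper, not part of Zend_Search_Lucene.
 * Builds a raw query string from a list of terms. Each entry is
 * array(text, field, sign), where sign follows the addTerm() convention:
 * true = required (+), false = prohibited (-), null = neither.
 * A field of null (or 'contents') is left out, since it is the default.
 */
function buildQueryString(array $terms)
{
    $parts = array();

    foreach ($terms as $term) {
        list($text, $field, $sign) = $term;

        $prefix = '';
        if ($sign === true) {
            $prefix = '+';
        } else if ($sign === false) {
            $prefix = '-';
        }

        if ($field !== null && $field !== 'contents') {
            $text = $field . ':' . $text;
        }

        $parts[] = $prefix . $text;
    }

    return implode(' ', $parts);
}
```

For example, passing array(array('php', null, null), array('Quentin', 'author', true)) produces the same query string used earlier: php +author:Quentin.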
Dealing with returned results
The results of your query are returned in an array, meaning you can simply use count() on the array to determine the number of hits.
Each of the stored fields is available as a property of each hit. (The contents field was added as UnStored, so it can be searched but not read back.)
So to loop over the results as we indexed them previously (with a title, author and teaser), we would do the following:
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$query = 'php +author:Quentin';
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = new Zend_Search_Lucene($indexPath);
$hits = $index->find($query);
$numHits = count($hits);
?>
<p>
Found <?=$numHits?> result(s) for query <?=$query?>.
</p>
<?php foreach ($hits as $hit) { ?>
<h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
<p>
By <?=$hit->author?>
</p>
<p>
<?=$hit->teaser?><br />
<a href="<?=$hit->url?>">Read more...</a>
</p>
<?php } ?>
Here we also used an extra field called ‘score’. As mentioned previously, this is used as an indicator as to how well a document matched the query. Results with the highest score are listed first.
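Since titles, teasers and the query itself can contain characters such as < or &, it is worth escaping them before output. The sketch below builds the same markup as a string; renderHits() is a hypothetical helper, and the only thing it assumes about a hit is that it exposes title, author, teaser, url and score as properties, as described above.

```php
<?php
/**
 * Hypothetical helper: renders search hits as HTML, escaping every
 * value that came from the index or from the user.
 *
 * @param array  $hits  objects with title, author, teaser, url and score properties
 * @param string $query the query as typed by the user
 * @return string HTML fragment
 */
function renderHits(array $hits, $query)
{
    $html = '<p>Found ' . count($hits) . ' result(s) for query '
          . htmlspecialchars($query, ENT_QUOTES, 'UTF-8') . ".</p>\n";

    foreach ($hits as $hit) {
        $html .= '<h3>' . htmlspecialchars($hit->title, ENT_QUOTES, 'UTF-8')
               . ' (score: ' . number_format($hit->score, 2) . ")</h3>\n"
               . '<p>By ' . htmlspecialchars($hit->author, ENT_QUOTES, 'UTF-8') . "</p>\n"
               . '<p>' . htmlspecialchars($hit->teaser, ENT_QUOTES, 'UTF-8') . "<br />\n"
               . '<a href="' . htmlspecialchars($hit->url, ENT_QUOTES, 'UTF-8')
               . "\">Read more...</a></p>\n";
    }

    return $html;
}
```

In a page you would simply echo renderHits($hits, $query) after running the search.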
Creating a simple search engine
Using our code above, we can easily transform this into a simple site search engine. All we need to do is add a form and plug in the submitted query. Let’s assume this script is called search.php:
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$query = isset($_GET['query']) ? $_GET['query'] : '';
$query = trim($query);
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = new Zend_Search_Lucene($indexPath);
if (strlen($query) > 0) {
    $hits = $index->find($query);
    $numHits = count($hits);
}
?>
<form method="get" action="search.php">
<input type="text" name="query" value="<?=htmlspecialchars($query)?>" />
<input type="submit" value="Search" />
</form>
<?php if (strlen($query) > 0) { ?>
<p>
Found <?=$numHits?> result(s) for query <?=htmlspecialchars($query)?>.
</p>
<?php foreach ($hits as $hit) { ?>
<h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
<p>
By <?=$hit->author?>
</p>
<p>
<?=$hit->teaser?><br />
<a href="<?=$hit->url?>">Read more...</a>
</p>
<?php } ?>
<?php } ?>
Error handling
The one thing we haven’t dealt with yet is errors in the search. For instance, if we were to type in ‘title:’ with no term after it, an error would occur. We handle this by catching the Zend_Search_Lucene_Exception exception.
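The general shape of that handling can be sketched as a small wrapper. safeFind() is a hypothetical name, and so that the sketch runs without the framework installed it catches the base Exception class (which every PHP exception class extends); in the real script you would catch Zend_Search_Lucene_Exception as described above. The $index argument is assumed to be anything with a find() method, such as the index object opened earlier.

```php
<?php
/**
 * Hypothetical wrapper: runs a query and turns a failure into an
 * error message instead of letting the exception reach the user.
 *
 * @param object $index any object with a find($query) method
 * @param string $query the raw query string
 * @return array 'hits' => list of hits, 'error' => message or null
 */
function safeFind($index, $query)
{
    $result = array('hits' => array(), 'error' => null);

    $query = trim($query);
    if (strlen($query) == 0) {
        return $result; // nothing to search for
    }

    try {
        $result['hits'] = $index->find($query);
    } catch (Exception $ex) {
        // assumption: the base class is caught here to keep the sketch
        // self-contained; narrow this to the search exception in practice
        $result['error'] = $ex->getMessage();
    }

    return $result;
}
```

The calling page can then show the error message next to the search form when it is not null, and list the hits otherwise.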
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$query = isset($_GET['query']) ? $_GET['query'] : '';
- How to index a document or series of documents
- The different types of fields that can be indexed
- Searching the index
To demonstrate this functionality, we will cover the implementation of a search engine into phpRiot. We previously used the Tsearch2 module but had some problems that we were unable to overcome.
How fulltext indexing and querying works
Just in case you’ve never dealt with the subject before, I will briefly cover how the whole thing works.
The index essentially stores a special copy of all your data, but instead of storing the actual content of your documents, it stores a list of keywords found in each document.
So if I were to create an index of all the documents on phpRiot, this would be done by looping over every document, finding the keywords and then adding those words to the document.
Additionally, since we can’t just list a bunch of keywords to a user when they perform a search, we also store some other data in the index along with the keywords. In the case of phpRiot, we will store the document title, the author, the document URL and a brief summary of the article.
Alternatively we just store the document ID, and then proceed to look up those other details from the database using the ID, however it is quicker to do it this way (less queries), and since the title and author are only a few words for each document, there is little overhead in duplicating this data by storing it in our index.
Querying the data
Once we have created the index, we can then allow people to search it. When somebody types in a search term, we look for those terms in our index, which then points back to 0 or more different documents. Zend_Search_Lucene will also score the results for us, meaning the most relevant matches are returned first.
Keeping the index up-to-date
If a document is ever updated in the database, this likely means the index will be out-of-date. This means we have to make an extra consideration in our general content management. That is, we need to re-index that document.
There are several ways of approaching this. For instance, you could update it in real-time when the document is updated, or you could run an index update at a certain time every day. Each method has its own advantages. For something like phpRiot, updating the index in real time is probably the best way, as the data is not updated very frequently. On the other hand, if you were indexing user submitted comments, it could create a huge load to recreate the index for every comment, as there could be hundreds of comments per day.
Getting started
The first thing we must do is install the Zend Framework if you have not already done so. It is structured in a similar way to how the Pear file structure is organised. At this stage, the Zend Framework is only in a “preview” phase. At time of writing the current version was Preview 0.1.3.
You can download this from
http://framework.zend.com/download. If you use Subversion, you can also checkout the trunk version which may have newer code in it.
I’m not exactly sure where the developers intended the framework to be stored, but like Pear is stored in /usr/local/lib/php, I chose to store it in /usr/local/lib/zend.
Highlight: Plain
$ cd /usr/local/src
$ wget http://framework.zend.com/download/tgz
$ tar -zxf ZendFramework-0.1.3.tar.gz
$ mv ZendFramework-0.1.3/library /usr/local/lib/zend
So now all that is required is that we add /usr/local/lib/zend to our include path. For instance, my include path directive in httpd.conf for phpRiot looks something like:
Highlight: Plain
php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php
Which now becomes:
Highlight: Plain
php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php:/usr/local/lib/zend
Creating our first index
The basic process for creating an index is:
- Open the index
- Add each document
- Commit (save) the index
The index is stored on the filesystem, so when you open an index, you specify the path where the it is to be stored. This is achieved by instantiating the
Zend_Search_Lucene class.
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = newZend_Search_Lucene($indexPath, true);
?>
You’ll also notice the second parameter in the constructor call. This means the index is created from scratch. If you set this to false (or omit the argument), an existing index is opened. This is done when updating or querying the index. Since we’re creating the index at this stage, we include that parameter.
Adding a document to our index
Now that we have opened our index, we must add our documents to it. We create a document to be added to the index using the following:
Highlight: PHP
<?php
$doc = newZend_Search_Lucene_Document();
?>
The next thing we must do is determine which fields we need to add to our index.
There are several different field types that can be stored for a single document. As we mentioned earlier, we want to store the keywords for our document, and also store some extra data that we can use when fetching the data. If we wanted to, we could duplicate the entire document content in our search index, however, this would be overkill and is unnecessary. It’s not as bad to duplicate smaller items such as the author name or document title as it’s only a small amount of data.
As you can tell from the previous paragraph, we’re starting to determine which fields we want to store in our index:
- Document content – but we don’t want to duplicate the entire article, as we’re only going to show users a preview of the article in the search results
- Title – we’re definitely going to include the title in our results
- Article teaser – even though we’re not storing the content, we do want to store a short summary of the document to display in our search results
- Author – we will list the author of each document in our search results, plus we want to give users the ability to search by author
- Created – We’ll also store a timestamp of when the article was created.
This brings us to the different types of fields we can use for indexing in Zend_Search_Lucene:
- UnIndexed – Data that isn’t available for searching, but is stored with our document (article teaser, article URL and timestamp of creation)
- UnStored – Data that is available for search, but isn’t stored in the index in full (the document content)
- Text – Data that is available for search and is stored in full (title and author)
There is also the
Keyword and
Binary fields available, but we won’t be using them in this example. Binary data isn’t searchable, but could be used to store an image with an indexed document.
To add a field to our indexed document, we use the
addField() method. When we call this method, we must pass data that has been prepared according to the necessary field type. This is achieved used the Zend_Search_Lucene_Field class with the field type as the static method name.
In other words, to create the
title field data, we use:
Highlight: PHP
<?php
$data = Zend_Search_Lucene_Field::Text('title', $docTitle);
?>
Note that we specify the name of the field here. We can later reference this data by this field name, but we will cover this in the section about querying the index.
So to add all the data with the field types we just worked out, we would use this:
Highlight: PHP
<?php
$doc = newZend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $docUrl));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('created', $docCreated));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $docTeaser));
$doc->addField(Zend_Search_Lucene_Field::Text('title', $docTitle));
$doc->addField(Zend_Search_Lucene_Field::Text('author', $docAuthor));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docBody));
?>
Important note: We added the main search content with a field name of
contents. This is a special name to Zend_Search_Lucene, which means by default, all queries will search this field. If users instead want to search the title field for ‘foo’, their search query would look like
title:foo. This will be covered further in the section about querying the index.
Finally, we add the document to the index using
addDocument():
Highlight: PHP
<?php
$index->addDocument($doc);
?>
We will cover this code in full in the next chapter (including looping over several documents and adding them all at once).
Committing / saving the index
Once all documents have been added, the index must be saved.
Highlight: PHP
<?php
$index->commit();
?>
You can call commit after adding each document, but this operation is expensive and each call results in a new index segment (in lamen’s terms, this means it’s an extra file to search when querying the index). Note that better management of the segments in the index is a planned future improvement, but for the time being it’s best to be careful and call it once after you’ve done what you need to do with the index.
If you were to look at your filesystem now that you have committed the index, you would notice a directory at the location you specified when opening the index. This directory contains a number of files which are then used when querying the index.
Next up we will piece all the above code into a single script which indexes all the articles on phpRiot.
Indexing all the articles on phpRiot
Now that you’ve seen how to create a basic index, we will extend this script slightly so it can index all the documents in phpRiot.
Additionally, we will be extending the base Zend_Search_Lucene_Document class to simplify our code slightly. This will also demonstrate ways you can take advantage of the OOP style of programming that Zend_Search_Lucene uses.
Since we are demonstrating the indexing of phpRiot articles, we will also use the DatabaseObject class here to fetch article data, just as phpRiot does. You don’t really need to know how this class works to understand this example (as it is fairly self-explanatory in the function calls), but if you are interested, you can read our article
Managing your data with DatabaseObject.
Extending Zend_Search_Lucene_Document
On the previous page, after we opened the index, we created a new instance of Zend_Search_Lucene_Document to hold the index data for a single document. Instead of calling this class directly, we’re going to extend this class to encapsulate all of the adding of data we also did.
In other words, we’re going to move the calls to
addField into our class, rather than calling it for each field after we create our Zend_Search_Lucene_Document item.
Highlight: PHP
<?php
classPhpRiotIndexedDocumentextendsZend_Search_Lucene_Document
{
/**
* Constructor. Creates our indexable document and adds all
* necessary fields to it using the passed in DatabaseObject
* holding the article data.
*/
publicfunction__construct(&$document)
{
$this->addField(Zend_Search_Lucene_Field::UnIndexed('url', $document->generateUrl()));
$this->addField(Zend_Search_Lucene_Field::UnIndexed('created', $document->getProperty('created')));
$this->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $document->getProperty('teaser')));
$this->addField(Zend_Search_Lucene_Field::Text('title', $document->getProperty('title')));
$this->addField(Zend_Search_Lucene_Field::Text('author', $document->getProperty('author')));
$this->addField(Zend_Search_Lucene_Field::UnStored('contents', $document->getProperty('body')));
}
}
?>
As you can see, we’ve made a simple wrapper class which fetches data from the passed in document (which uses DatabaseObject). The
generateUrl() function is just a special internal method which determines a document’s full URL. We are storing this when we build the index so we don’t have to generate it each time a search is run (especially since this will never change, and if it does we can just rebuild the index).
Building the full index
Now that we have our class, we can create our index, loop over the documents, and then save our index:
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
require_once('DatabaseObject/PhpriotDocument.class.php');
// where to save our index
$indexPath = '/var/www/phpriot.com/data/docindex';
// get a list of all database ids for documents
$doc_ids = PhpriotDocument::GetDocIds($db);
// create our index
$index = newZend_Search_Lucene($indexPath, true);
foreach($doc_idsas$doc_id){
// load our databaseobject
$document = newPhpriotDocument($db);
$document->loadRecord($doc_id);
// create our indexed document and add it to the index
$index->addDocument(newPhpRiotIndexedDocument($document));
}
// write the index to disk
$index->commit();
?>
The index has now been created! This can take some time if you have many documents or if each document has a large amount of content. We generate this by using PHP on the command line, which allows us to see its progress in real-time if we need to (we also output the title and a status message as each document is indexed).
How querying works in Zend_Search_Lucene
Now comes the most important part—actually finding stuff! There are quite a lot of options when it comes to querying your index, allowing you and your users to have a lot of control in the returned results.
When we created the indexed, we added six fields, but only three of those were actually searchable: document title, document content, and the author.
Each of these items are stored separately for each indexed documents, meaning you can search on them separately. The syntax used to search each section is somewhat like Google’s, in that you specify the field, followed by a colon, followed by the term (with no spaces).
So to search the author field for ‘Quentin’, the search query would be
author:Quentin. (Note that the search is case-insensitive. To make it case-sensitive we would need to change some options when creating our index. For full details on this, please read the
Zend_Search_Lucene manual section on Extensibility)
Likewise, to search in the title field for ‘php’, we would use
title:php.
As we briefly mentioned earlier in this article, the default section that is searched in is the field called
contents. So if you wanted to search the document body for the world ‘google’, you could use
contents:google or just
google.
Including and excluding terms
By default, all specified terms are searched on using a boolean ‘or’. This means that any of the terms can exist for a document to be returned. To force results to have a particular term, the plus symbol is used. This force results to not have a particular term, the minus symbol is used. If you’re searching in a different field, you can put the plus or minus either before the field name or the term name. In other words, +author:Quentin and author:+Quentin are identical.
Searching for phrases
It is possible to search for exact phrases with Zend_Search_Lucene, so if you wanted to search for the exact phrase “PHP Articles” you could. Because this is somewhat complicated to achieve, we will not be including this in our examples or implementation, however, there is alot of information on this on the
Zend_Search_Lucene manual section on query types.
Sample queries
Here are some queries you can pass to Zend_Search_Lucene and their meanings.
Highlight: Plain
php
// search the index for any article with the word php
php -author:quentin
// find any article with the word php not written by me
author:quentin
// find all the articles by me
php -ajax
// find all articles with the word php that don't have the word ajax
title:mysql
// find all articles with MySQL in the title
title:mysql -author:quentin
// find all articles with MySQL in the title not by me
And so on. Hopefully you get the idea.
Scoring of results
All results returned from a search are assigned a score. This is a measure of how well the document matched the search term.
The results are ordered by their score, from highest to lowest.
I’m not exactly sure how the score is calculated or what it represents exactly, but it looks pretty on the search results.
You can customize the scoring algorithm (and hence the ordering of results). Please see the section later in the article on extending Zend_Search_Lucene.
Querying our index
On the previous page we looked at how to write queries to search the index. We learned how to include and exclude terms, and also how to search different fields in our indexed data.
Now we will look at actually pulling documents from our index using that term.
There are essentially two ways to query the index: passing the raw query in and letting Zend_Search_Lucene parse the query (ideal when you’re writing a search engine where you’re not sure what the user will enter), or by manually building up the query with API function calls.
In either case, you use the
find() method on the index. The find() method returns a list of matches from your index.
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = newZend_Search_Lucene($indexPath);
$hits = $index->query('php +author:Quentin');
?>
This sample code searches our index by also articles containing ‘php’, written by me. Note that when we opened our index, we did not pass the second parameter as we did when we created the index. This is because we are not writing the index, we are querying it.
We could also manually build this same query with function calls like so:
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = newZend_Search_Lucene($indexPath);
$query = newZend_Search_Lucene_Search_Query_MultiTerm();
$query->addTerm(newZend_Search_Lucene_Index_Term('php'), null);
$query->addTerm(newZend_Search_Lucene_Index_Term('Quentin', 'author'), true);
$hits = $index->query($query);
?>
The second parameter for addTerm used determines whether or not a field is required. True means it is required (like putting a plus sign before the term), false means it is prohibited (like putting a minus sign before the term), null means it isn’t required or prohibited.
The second parameter for Zend_Search_Lucene_Index_Term specifies the field to search index. By default this is
contents.
On the whole, it is easier to simply allow Zend_Search_Lucene to parse the query.
Dealing with returned results
The results found from your query are returned in an array, meaning you can simply use count() on the array to determine the number of hits.
Each of the indexed fields are available as a class property.
So to loop over the results as we indexed them previously (with a title, author and teaser), we would do the following:
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$query = 'php +author:Quentin';
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = newZend_Search_Lucene($indexPath);
$hits = $index->query($query);
$numHits = count($hits);
?>
<p>
Found <?=$hits?> result(s) for query <?=$query?>.
</p>
<?phpforeach($hitsas$hit){?>
<h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
<p>
By <?=$hit->author?>
</p>
<p>
<?=$hit->teaser?><br />
<a href="<?=$hit->url?>">Read more...</a>
</p>
<?php}?>
Here we also used an extra field called ‘score’. As mentioned previously, this is used as an indicator as to how well a document matched the query. Results with the highest score are listed first.
Creating a simple search engine
Using our code above, we can easily transform this into a simple site search engine. All we need to do is add a form and plug in the submitted query. Let’s assume this script is called search.php:
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$query = isset($_GET['query']) ?
Creating a fulltext search engine in PHP 5 with the Zend Framework's Zend_Search_Lucene --http://www.phpriot.com/d/articles/php/search/zend-search-lucene/index.html
By Quentin Zervaas, 27 April 2006
This article covers the implementation of a fulltext search engine using PHP 5 and the Zend Framework. We will be using the Zend_Search_Lucene component to create and search our fulltext index.
There are several other libraries we could use instead of this one, but Zend_Search_Lucene is completely native to PHP, whereas others such as Xapian or Tsearch2 rely on third party modules (for instance, the Tsearch2 module must be compiled into your PostgreSQL installation).
It must be noted at this point though that we require at least PHP 5 for Zend_Search_Lucene – PHP 4 will not work.
In this article we will be covering the following:
- How to index a document or series of documents
- The different types of fields that can be indexed
- Searching the index
To demonstrate this functionality, we will cover the implementation of a search engine into phpRiot. We previously used the Tsearch2 module but had some problems that we were unable to overcome.
How fulltext indexing and querying works
Just in case you’ve never dealt with the subject before, I will briefly cover how the whole thing works.
The index essentially stores a special copy of all your data, but instead of storing the actual content of your documents, it stores a list of keywords found in each document.
So if I were to create an index of all the documents on phpRiot, this would be done by looping over every document, finding the keywords and then adding those words to the document.
Additionally, since we can’t just list a bunch of keywords to a user when they perform a search, we also store some other data in the index along with the keywords. In the case of phpRiot, we will store the document title, the author, the document URL and a brief summary of the article.
Alternatively we just store the document ID, and then proceed to look up those other details from the database using the ID, however it is quicker to do it this way (less queries), and since the title and author are only a few words for each document, there is little overhead in duplicating this data by storing it in our index.
Querying the data
Once we have created the index, we can then allow people to search it. When somebody types in a search term, we look for those terms in our index, which then points back to 0 or more different documents. Zend_Search_Lucene will also score the results for us, meaning the most relevant matches are returned first.
Keeping the index up-to-date
If a document is ever updated in the database, this likely means the index will be out-of-date. This means we have to make an extra consideration in our general content management. That is, we need to re-index that document.
There are several ways of approaching this. For instance, you could update it in real-time when the document is updated, or you could run an index update at a certain time every day. Each method has its own advantages. For something like phpRiot, updating the index in real time is probably the best way, as the data is not updated very frequently. On the other hand, if you were indexing user submitted comments, it could create a huge load to recreate the index for every comment, as there could be hundreds of comments per day.
Getting started
The first thing we must do is install the Zend Framework if you have not already done so. It is structured in a similar way to how the Pear file structure is organised. At this stage, the Zend Framework is only in a “preview” phase. At time of writing the current version was Preview 0.1.3.
You can download this from
http://framework.zend.com/download. If you use Subversion, you can also checkout the trunk version which may have newer code in it.
I’m not exactly sure where the developers intended the framework to be stored, but like Pear is stored in /usr/local/lib/php, I chose to store it in /usr/local/lib/zend.
Highlight: Plain
$ cd /usr/local/src
$ wget http://framework.zend.com/download/tgz
$ tar -zxf ZendFramework-0.1.3.tar.gz
$ mv ZendFramework-0.1.3/library /usr/local/lib/zend
So now all that is required is that we add /usr/local/lib/zend to our include path. For instance, my include path directive in httpd.conf for phpRiot looks something like:
Highlight: Plain
php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php
Which now becomes:
Highlight: Plain
php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php:/usr/local/lib/zend
Creating our first index
The basic process for creating an index is:
- Open the index
- Add each document
- Commit (save) the index
The index is stored on the filesystem, so when you open an index, you specify the path where the it is to be stored. This is achieved by instantiating the
Zend_Search_Lucene class.
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = newZend_Search_Lucene($indexPath, true);
?>
You’ll also notice the second parameter in the constructor call. This means the index is created from scratch. If you set this to false (or omit the argument), an existing index is opened. This is done when updating or querying the index. Since we’re creating the index at this stage, we include that parameter.
Adding a document to our index
Now that we have opened our index, we must add our documents to it. We create a document to be added to the index using the following:
Highlight: PHP
<?php
$doc = newZend_Search_Lucene_Document();
?>
The next thing we must do is determine which fields we need to add to our index.
There are several different field types that can be stored for a single document. As we mentioned earlier, we want to store the keywords for our document, and also store some extra data that we can use when fetching the data. If we wanted to, we could duplicate the entire document content in our search index, however, this would be overkill and is unnecessary. It’s not as bad to duplicate smaller items such as the author name or document title as it’s only a small amount of data.
As you can tell from the previous paragraph, we’re starting to determine which fields we want to store in our index:
- Document content – but we don’t want to duplicate the entire article, as we’re only going to show users a preview of the article in the search results
- Title – we’re definitely going to include the title in our results
- Article teaser – even though we’re not storing the content, we do want to store a short summary of the document to display in our search results
- Author – we will list the author of each document in our search results, plus we want to give users the ability to search by author
- Created – We’ll also store a timestamp of when the article was created.
This brings us to the different types of fields we can use for indexing in Zend_Search_Lucene:
- UnIndexed – Data that isn’t available for searching, but is stored with our document (article teaser, article URL and timestamp of creation)
- UnStored – Data that is available for search, but isn’t stored in the index in full (the document content)
- Text – Data that is available for search and is stored in full (title and author)
There is also the
Keyword and
Binary fields available, but we won’t be using them in this example. Binary data isn’t searchable, but could be used to store an image with an indexed document.
To add a field to our indexed document, we use the
addField() method. When we call this method, we must pass data that has been prepared according to the necessary field type. This is achieved used the Zend_Search_Lucene_Field class with the field type as the static method name.
In other words, to create the
title field data, we use:
Highlight: PHP
<?php
$data = Zend_Search_Lucene_Field::Text('title', $docTitle);
?>
Note that we specify the name of the field here. We can later reference this data by this field name, but we will cover this in the section about querying the index.
So to add all the data with the field types we just worked out, we would use this:
Highlight: PHP
<?php
$doc = newZend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $docUrl));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('created', $docCreated));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $docTeaser));
$doc->addField(Zend_Search_Lucene_Field::Text('title', $docTitle));
$doc->addField(Zend_Search_Lucene_Field::Text('author', $docAuthor));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docBody));
?>
Important note: We added the main search content with a field name of contents. This is a special name to Zend_Search_Lucene, which means by default, all queries will search this field. If users instead want to search the title field for ‘foo’, their search query would look like title:foo. This will be covered further in the section about querying the index.
Finally, we add the document to the index using addDocument():
Highlight: PHP
<?php
$index->addDocument($doc);
?>
We will cover this code in full in the next chapter (including looping over several documents and adding them all at once).
Committing / saving the index
Once all documents have been added, the index must be saved.
Highlight: PHP
<?php
$index->commit();
?>
You can call commit after adding each document, but this operation is expensive and each call results in a new index segment (in layman’s terms, this means it’s an extra file to search when querying the index). Note that better management of the segments in the index is a planned future improvement, but for the time being it’s best to be careful and call it once after you’ve done what you need to do with the index.
If you were to look at your filesystem now that you have committed the index, you would notice a directory at the location you specified when opening the index. This directory contains a number of files which are then used when querying the index.
Next up we will piece all the above code into a single script which indexes all the articles on phpRiot.
Indexing all the articles on phpRiot
Now that you’ve seen how to create a basic index, we will extend this script slightly so it can index all the documents in phpRiot.
Additionally, we will be extending the base Zend_Search_Lucene_Document class to simplify our code slightly. This will also demonstrate ways you can take advantage of the OOP style of programming that Zend_Search_Lucene uses.
Since we are demonstrating the indexing of phpRiot articles, we will also use the DatabaseObject class here to fetch article data, just as phpRiot does. You don’t really need to know how this class works to understand this example (as it is fairly self-explanatory in the function calls), but if you are interested, you can read our article Managing your data with DatabaseObject.
Extending Zend_Search_Lucene_Document
On the previous page, after we opened the index, we created a new instance of Zend_Search_Lucene_Document to hold the index data for a single document. Instead of using this class directly, we’re going to extend it so that it encapsulates all of the field-adding code we wrote.
In other words, we’re going to move the calls to addField() into our class, rather than calling it for each field after we create our Zend_Search_Lucene_Document item.
Highlight: PHP
<?php
class PhpRiotIndexedDocument extends Zend_Search_Lucene_Document
{
    /**
     * Constructor. Creates our indexable document and adds all
     * necessary fields to it using the passed in DatabaseObject
     * holding the article data.
     */
    public function __construct($document)
    {
        $this->addField(Zend_Search_Lucene_Field::UnIndexed('url', $document->generateUrl()));
        $this->addField(Zend_Search_Lucene_Field::UnIndexed('created', $document->getProperty('created')));
        $this->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $document->getProperty('teaser')));
        $this->addField(Zend_Search_Lucene_Field::Text('title', $document->getProperty('title')));
        $this->addField(Zend_Search_Lucene_Field::Text('author', $document->getProperty('author')));
        $this->addField(Zend_Search_Lucene_Field::UnStored('contents', $document->getProperty('body')));
    }
}
?>
As you can see, we’ve made a simple wrapper class which fetches data from the passed in document (which uses DatabaseObject). The generateUrl() function is just a special internal method which determines a document’s full URL. We are storing this when we build the index so we don’t have to generate it each time a search is run (especially since this will never change, and if it does we can just rebuild the index).
Building the full index
Now that we have our class, we can create our index, loop over the documents, and then save our index:
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
require_once('DatabaseObject/PhpriotDocument.class.php');
// where to save our index
$indexPath = '/var/www/phpriot.com/data/docindex';
// get a list of all database ids for documents
// ($db is assumed to be an existing database connection, set up elsewhere)
$doc_ids = PhpriotDocument::GetDocIds($db);
// create our index
$index = new Zend_Search_Lucene($indexPath, true);
foreach ($doc_ids as $doc_id) {
    // load our databaseobject
    $document = new PhpriotDocument($db);
    $document->loadRecord($doc_id);
    // create our indexed document and add it to the index
    $index->addDocument(new PhpRiotIndexedDocument($document));
}
// write the index to disk
$index->commit();
?>
The index has now been created! This can take some time if you have many documents or if each document has a large amount of content. We generate this by using PHP on the command line, which allows us to see its progress in real-time if we need to (we also output the title and a status message as each document is indexed).
How querying works in Zend_Search_Lucene
Now comes the most important part: actually finding stuff! There are quite a lot of options when it comes to querying your index, allowing you and your users to have a lot of control over the returned results.
When we created the index, we added six fields, but only three of those were actually searchable: document title, document content, and the author.
Each of these fields is stored separately for each indexed document, meaning you can search on them separately. The syntax used to search each field is somewhat like Google’s, in that you specify the field, followed by a colon, followed by the term (with no spaces).
So to search the author field for ‘Quentin’, the search query would be author:Quentin. (Note that the search is case-insensitive. To make it case-sensitive we would need to change some options when creating our index. For full details on this, please read the Zend_Search_Lucene manual section on Extensibility.)
Likewise, to search in the title field for ‘php’, we would use title:php.
As we briefly mentioned earlier in this article, the default field that is searched is the one called contents. So if you wanted to search the document body for the word ‘google’, you could use contents:google or just google.
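To make the field:term convention concrete, here is a small plain-PHP sketch that splits a single term into its field and its word, falling back to contents when no field is given. This is only an illustration of the syntax; splitTerm() is a made-up name and this is not the parser Zend_Search_Lucene uses internally.

```php
<?php
// Split one query term into its field and its word. Terms without a
// "field:" prefix fall back to the default field, 'contents'.
// Words are lowercased because the search is case-insensitive.
function splitTerm($term, $defaultField = 'contents')
{
    if (strpos($term, ':') === false) {
        return array('field' => $defaultField, 'word' => strtolower($term));
    }
    list($field, $word) = explode(':', $term, 2);
    if ($field === '' || $word === '') {
        // e.g. "title:" with nothing after the colon
        throw new InvalidArgumentException('Incomplete term: ' . $term);
    }
    return array('field' => $field, 'word' => strtolower($word));
}

print_r(splitTerm('author:Quentin')); // field: author, word: quentin
print_r(splitTerm('google'));         // field: contents, word: google
```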
Including and excluding terms
By default, all specified terms are searched on using a boolean ‘or’. This means that any of the terms can exist for a document to be returned. To force results to have a particular term, the plus symbol is used. To force results not to have a particular term, the minus symbol is used. If you’re searching in a different field, you can put the plus or minus either before the field name or before the term. In other words, +author:Quentin and author:+Quentin are identical.
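The rules above can be modelled in a few lines of plain PHP. The sketch below treats a document as a list of words and decides whether it matches a query. It is a toy model, not the library's matching code: matchesQuery() is a hypothetical name, field prefixes are ignored, and the handling of unsigned terms alongside signed ones is my assumption based on the descriptions in this article.

```php
<?php
// Toy model of the include/exclude rules: does a document, given as
// a list of its words, match a query of plain, +required and
// -prohibited terms? Field prefixes are not handled here.
function matchesQuery(array $docWords, $query)
{
    $docWords = array_map('strtolower', $docWords);
    $terms = preg_split('/\s+/', strtolower(trim($query)), -1, PREG_SPLIT_NO_EMPTY);
    if (count($terms) === 0) {
        return false; // nothing to search for
    }

    $plainSeen = false; // the query has at least one unsigned term
    $plainHit = false;  // at least one unsigned term is in the document

    foreach ($terms as $term) {
        $sign = $term[0];
        if ($sign === '+' || $sign === '-') {
            $present = in_array(substr($term, 1), $docWords, true);
            if ($sign === '+' && !$present) {
                return false; // a required term is missing
            }
            if ($sign === '-' && $present) {
                return false; // a prohibited term is present
            }
        } else {
            $plainSeen = true;
            if (in_array($term, $docWords, true)) {
                $plainHit = true;
            }
        }
    }

    // Assumption: unsigned terms are OR-ed, so one hit among them is enough.
    return !$plainSeen || $plainHit;
}

$doc = array('php', 'ajax', 'tutorial');
var_dump(matchesQuery($doc, 'php mysql'));  // bool(true): one plain term matches
var_dump(matchesQuery($doc, 'php -ajax'));  // bool(false): ajax is prohibited
var_dump(matchesQuery($doc, '+mysql php')); // bool(false): mysql is required
```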
Searching for phrases
It is possible to search for exact phrases with Zend_Search_Lucene, so if you wanted to search for the exact phrase “PHP Articles” you could. Because this is somewhat complicated to achieve, we will not be including this in our examples or implementation; however, there is a lot of information on this in the Zend_Search_Lucene manual section on query types.
Sample queries
Here are some queries you can pass to Zend_Search_Lucene and their meanings.
Highlight: Plain
php
// search the index for any article with the word php
php -author:quentin
// find any article with the word php not written by me
author:quentin
// find all the articles by me
php -ajax
// find all articles with the word php that don't have the word ajax
title:mysql
// find all articles with MySQL in the title
title:mysql -author:quentin
// find all articles with MySQL in the title not by me
And so on. Hopefully you get the idea.
Scoring of results
All results returned from a search are assigned a score. This is a measure of how well the document matched the search term.
The results are ordered by their score, from highest to lowest.
I’m not exactly sure how the score is calculated or what it represents exactly, but it looks pretty on the search results.
You can customize the scoring algorithm (and hence the ordering of results). Please see the section later in the article on extending Zend_Search_Lucene.
Querying our index
On the previous page we looked at how to write queries to search the index. We learned how to include and exclude terms, and also how to search different fields in our indexed data.
Now we will look at actually pulling documents from our index using those queries.
There are essentially two ways to query the index: passing the raw query in and letting Zend_Search_Lucene parse it (ideal when you’re writing a search engine where you’re not sure what the user will enter), or manually building up the query with API function calls.
In either case, you use the find() method on the index. The find() method returns a list of matches from your index.
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = new Zend_Search_Lucene($indexPath);
$hits = $index->find('php +author:Quentin');
?>
This sample code searches our index for articles containing ‘php’, written by me. Note that when we opened our index, we did not pass the second parameter as we did when we created the index. This is because we are not writing the index, we are querying it.
We could also manually build this same query with function calls like so:
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = new Zend_Search_Lucene($indexPath);
$query = new Zend_Search_Lucene_Search_Query_MultiTerm();
// 'php' in the default field: neither required nor prohibited
$query->addTerm(new Zend_Search_Lucene_Index_Term('php'), null);
// 'Quentin' in the author field: required
$query->addTerm(new Zend_Search_Lucene_Index_Term('Quentin', 'author'), true);
$hits = $index->find($query);
?>
The second parameter of addTerm() determines whether or not a term is required. True means it is required (like putting a plus sign before the term), false means it is prohibited (like putting a minus sign before the term), and null means it is neither required nor prohibited.
The second parameter of Zend_Search_Lucene_Index_Term specifies the field to search. By default this is contents.
On the whole, it is easier to simply allow Zend_Search_Lucene to parse the query.
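To show how the two approaches line up, the following plain-PHP sketch converts a list of terms, written with the same true/false/null convention as addTerm(), into the equivalent query string. The buildQueryString() function is a hypothetical helper written for this article, not something the library provides.

```php
<?php
// Turn a list of terms into the query-string syntax. Each term is an
// array of (word, field, sign), where sign follows the addTerm()
// convention: true = required, false = prohibited, null = neither.
// A field of null means the default field, so no prefix is written.
function buildQueryString(array $terms)
{
    $parts = array();
    foreach ($terms as $term) {
        list($word, $field, $sign) = $term;
        $prefix = '';
        if ($sign === true) {
            $prefix = '+';
        } elseif ($sign === false) {
            $prefix = '-';
        }
        $fieldPart = ($field === null) ? '' : $field . ':';
        $parts[] = $prefix . $fieldPart . $word;
    }
    return implode(' ', $parts);
}

echo buildQueryString(array(
    array('php', null, null),
    array('Quentin', 'author', true),
)), "\n"; // php +author:Quentin
```

The resulting string can then be handed to the query parser in the usual way, which is handy if part of a query comes from your own code (say, a fixed author filter) and part from the user.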
Dealing with returned results
The results found from your query are returned in an array, meaning you can simply use count() on the array to determine the number of hits.
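Each of the stored fields is available as a property of the hit object. (Fields added as UnStored, such as our contents field, are searchable but cannot be read back.)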
So to loop over the results as we indexed them previously (with a title, author and teaser), we would do the following:
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$query = 'php +author:Quentin';
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = new Zend_Search_Lucene($indexPath);
$hits = $index->find($query);
$numHits = count($hits);
?>
<p>
Found <?=$numHits?> result(s) for query <?=$query?>.
</p>
<?php foreach ($hits as $hit) { ?>
<h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
<p>
By <?=$hit->author?>
</p>
<p>
<?=$hit->teaser?><br />
<a href="<?=$hit->url?>">Read more...</a>
</p>
<?php } ?>
Here we also used an extra field called ‘score’. As mentioned previously, this is used as an indicator as to how well a document matched the query. Results with the highest score are listed first.
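One thing the listing above glosses over is escaping. Titles, teasers and especially user-submitted queries can contain characters such as < or &, so it is safer to pass them through htmlspecialchars() before output. Here is a minimal sketch of that idea; esc() and renderHit() are made-up helper names, and a stdClass object stands in for a real hit so the example runs on its own.

```php
<?php
// Escape a value for safe inclusion in HTML.
function esc($value)
{
    return htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8');
}

// Render one search hit as HTML, escaping every value first.
// $hit is any object with title, author, teaser, url and score
// properties, which is how the hits are used in this article.
function renderHit($hit)
{
    return '<h3>' . esc($hit->title) . ' (score: ' . esc(round($hit->score, 2)) . ')</h3>' . "\n"
         . '<p>By ' . esc($hit->author) . '</p>' . "\n"
         . '<p>' . esc($hit->teaser) . '<br />' . "\n"
         . '<a href="' . esc($hit->url) . '">Read more...</a></p>' . "\n";
}

// Stand-in for a hit returned by the index (sample data only).
$hit = new stdClass();
$hit->title  = 'PHP & <b>Lucene</b>';
$hit->author = 'Quentin';
$hit->teaser = 'Searching with "quotes" and <tags>.';
$hit->url    = '/articles/search?id=1&page=2';
$hit->score  = 0.8765;

echo renderHit($hit);
```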
Creating a simple search engine
Using our code above, we can easily transform this into a simple site search engine. All we need to do is add a form and plug in the submitted query. Let’s assume this script is called search.php:
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
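// a minimal version, assuming the search form submits its terms
// as a GET parameter named 'query'
$query = isset($_GET['query']) ? $_GET['query'] : '';
$query = trim($query);
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = new Zend_Search_Lucene($indexPath);
// only search if a query was entered
$hits = strlen($query) > 0 ? $index->find($query) : array();
$numHits = count($hits);
?>
<form method="get" action="search.php">
<input type="text" name="query" value="<?=htmlspecialchars($query)?>" />
<input type="submit" value="Search" />
</form>
<?php if (strlen($query) > 0) { ?>
<p>
Found <?=$numHits?> result(s) for query <?=htmlspecialchars($query)?>.
</p>
<?php foreach ($hits as $hit) { ?>
<h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
<p>
By <?=$hit->author?>
</p>
<p>
<?=$hit->teaser?><br />
<a href="<?=$hit->url?>">Read more...</a>
</p>
<?php } ?>
<?php } ?>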
Error handling
The one thing we haven’t dealt with yet is errors in the search. For instance, if we were to type in ‘title:’ with no term after it, an error would occur. We handle this by catching the Zend_Search_Lucene_Exception exception.
Highlight: PHP
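<?php
require_once('Zend/Search/Lucene.php');
// as before, assumes the query arrives as a GET parameter named 'query'
$query = isset($_GET['query']) ? $_GET['query'] : '';
$query = trim($query);
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = new Zend_Search_Lucene($indexPath);
try {
    $hits = $index->find($query);
}
catch (Zend_Search_Lucene_Exception $ex) {
    // treat a failed search as a search with no results
    $hits = array();
}
$numHits = count($hits);
?>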
This means now that if an error occurs in the search, we simply assume zero hits were returned, thereby handling the error without indicating to the user that anything went wrong.
Of course, you could also choose to get the error message from the exception and output that instead ($ex->getMessage()).
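If you want both behaviours, zero hits for the results loop and a message you can optionally display, one way is to wrap the search in a small function that returns the two together. The sketch below is plain PHP so it runs without the framework: safeSearch() and fakeSearch() are hypothetical names, fakeSearch() merely stands in for the real call to the index, and the wrapper catches the base Exception class, which is broader than Zend_Search_Lucene_Exception.

```php
<?php
// Run a search and never let an exception escape. Returns an array
// with the hits (empty on failure) and an error message (null on
// success). $search is any callable that takes the query string.
function safeSearch($search, $query)
{
    try {
        return array('hits' => call_user_func($search, $query), 'error' => null);
    } catch (Exception $ex) {
        return array('hits' => array(), 'error' => $ex->getMessage());
    }
}

// Stand-in for the real search: rejects a field with nothing after it.
function fakeSearch($query)
{
    if (substr($query, -1) === ':') {
        throw new Exception('Term expected after "' . $query . '"');
    }
    return array('hit for ' . $query);
}

$result = safeSearch('fakeSearch', 'title:');
echo count($result['hits']), ' hit(s), error: ', $result['error'], "\n";
```

Whether to show the message is a design choice: hiding it keeps the page simple, while showing it helps users who mistyped a field name.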
Keeping the index up-to-date
The other thing we haven’t yet dealt with is what to do when any of our documents are updated. There are several ways to handle this:
- Update just the entry for the updated document straight away
- Rebuild the entire index straight away when a document is updated
- Rebuild the entire index at a certain time each day (or several times per day)
The ideal method really depends on the kind of data you have, how often it is updated, and how important it is for the search index to be completely up-to-date.
To be honest, I haven’t figured out a way to update a single document in the index yet, but I may just be missing something simple. If you open the index (without the second parameter), and then index a document that is already in the index, then it will be duplicated (and hence returned twice in any matching search).
There isn’t too much documentation on this library yet, at least nothing about this specifically. If anybody knows how, please drop me an email or submit a comment with this article.
So at this point, the way to keep the index updated is to rebuild it from scratch when a document is updated.
Extending Zend_Search_Lucene
There are several aspects of Zend_Search_Lucene that can be extended, allowing a fully customized search solution. These include:
- A custom tokenizer for determining keywords in a document
- Custom scoring algorithms to determine how well a document matches a search query
- A custom storage method, so your index is stored however and wherever you please
A custom tokenizer
There are many reasons why a custom tokenizer can be useful. Here are some ideas:
- PDF tokenizer – a tokenizer that can parse a PDF file and find all the keywords
- Image tokenizer – a tokenizer that can perform Optical Character Recognition (OCR), thereby allowing you to index words in an image (and you could store the image also, using the Binary field type)
- HTML tokenizer – a tokenizer that can read HTML data, thereby knowing not to index HTML keywords but only the actual content. You could make further improvements on this also, such as finding all headings and treating them with higher preference to the rest of the content.
More information on this can be found at
http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.analysis.
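To give a feel for what the HTML idea involves, here is a rough plain-PHP sketch that reduces an HTML fragment to the words worth indexing. It is a preprocessing step you could run before passing text to the index, not an implementation of the library's analyzer interface, and htmlToWords() is a made-up name. It only lowercases ASCII letters.

```php
<?php
// Reduce an HTML fragment to a list of lowercase words, dropping tags
// so that element and attribute names are not indexed.
function htmlToWords($html)
{
    // Remove script and style blocks entirely, including their bodies.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);
    // Put a space before each tag so adjacent words are not glued together.
    $text = strip_tags(str_replace('<', ' <', $html));
    // Turn entities such as &amp; or &nbsp; back into characters.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // Split on anything that is not a letter or a digit.
    return preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

print_r(htmlToWords('<h1>Zend Search</h1><p>Fulltext <a href="/x">search</a> in PHP&nbsp;5</p>'));
// zend, search, fulltext, search, in, php, 5
```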
Custom scoring algorithms
Using a custom scoring algorithm, you can determine how favourably different fields in a document are looked upon. For example, you might want to treat matches in the ‘title’ field (if you have one) much more favourably than matches in the ‘contents’ field.
More information can be found on this at
http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.scoring.
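As a simple illustration of the idea of favouring one field over another, the plain-PHP sketch below counts how often a term appears in each field and multiplies the count by a per-field weight. The weights are arbitrary, weightedScore() is a hypothetical name, and this is not the formula Zend_Search_Lucene uses; it only shows why a title match can outrank several body matches.

```php
<?php
// Toy relevance score: occurrences of the term in each field,
// multiplied by that field's weight (fields without a weight count 1).
function weightedScore(array $fields, $term, array $weights)
{
    $term = strtolower($term);
    $score = 0;
    foreach ($fields as $name => $text) {
        $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        $counts = array_count_values($words);
        $hits = isset($counts[$term]) ? $counts[$term] : 0;
        $weight = isset($weights[$name]) ? $weights[$name] : 1;
        $score += $hits * $weight;
    }
    return $score;
}

$weights = array('title' => 3, 'contents' => 1);
$a = array('title' => 'MySQL tips', 'contents' => 'Some notes on indexes.');
$b = array('title' => 'Database notes', 'contents' => 'MySQL is mentioned here, and MySQL again.');

echo weightedScore($a, 'mysql', $weights), "\n"; // 3 (one title match)
echo weightedScore($b, 'mysql', $weights), "\n"; // 2 (two body matches)
```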
Custom storage method
You can change how indexes are written to disk by extending the Zend_Search_Lucene_Storage_Directory and Zend_Search_Lucene_Storage_File classes.
It may or may not be possible to change this to store all indexed data in a database, but I haven’t actually tried this so I’m not sure.
More information on this can be found at
http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.storage.
Conclusion
In this article we looked closely at Zend_Search_Lucene, part of the new Zend Framework. While it is still in development, it works sufficiently well to implement fulltext indexing and searching on any PHP 5 website.
We also looked briefly at some ways of extending the search capabilities.
Hopefully this has given you some ideas for your own website. Feel free to add comments and ideas to this article.
Creating a fulltext search engine in PHP 5 with the Zend Framework's Zend_Search_Lucene --http://www.phpriot.com/d/articles/php/search/zend-search-lucene/index.html
By Quentin Zervaas, 27 April 2006
This article covers the implementation of a fulltext search engine using PHP 5 and the Zend Framework. We will be using the Zend_Search_Lucene component to create and search our fulltext index.
There are several other libraries we could use instead of this one, but Zend_Search_Lucene is completely native to PHP, whereas others such as Xapian or Tsearch2 rely on third party modules (for instance, the Tsearch2 module must be compiled into your PostgreSQL installation).
It must be noted at this point though that we require at least PHP 5 for Zend_Search_Lucene – PHP 4 will not work.
In this article we will be covering the following:
- How to index a document or series of documents
- The different types of fields that can be indexed
- Searching the index
To demonstrate this functionality, we will cover the implementation of a search engine into phpRiot. We previously used the Tsearch2 module but had some problems that we were unable to overcome.
How fulltext indexing and querying works
Just in case you’ve never dealt with the subject before, I will briefly cover how the whole thing works.
The index essentially stores a special copy of all your data, but instead of storing the actual content of your documents, it stores a list of keywords found in each document.
So if I were to create an index of all the documents on phpRiot, this would be done by looping over every document, finding the keywords and then adding those words to the document.
Additionally, since we can’t just list a bunch of keywords to a user when they perform a search, we also store some other data in the index along with the keywords. In the case of phpRiot, we will store the document title, the author, the document URL and a brief summary of the article.
Alternatively we just store the document ID, and then proceed to look up those other details from the database using the ID, however it is quicker to do it this way (less queries), and since the title and author are only a few words for each document, there is little overhead in duplicating this data by storing it in our index.
Querying the data
Once we have created the index, we can then allow people to search it. When somebody types in a search term, we look for those terms in our index, which then points back to 0 or more different documents. Zend_Search_Lucene will also score the results for us, meaning the most relevant matches are returned first.
Keeping the index up-to-date
If a document is ever updated in the database, this likely means the index will be out-of-date. This means we have to make an extra consideration in our general content management. That is, we need to re-index that document.
There are several ways of approaching this. For instance, you could update it in real-time when the document is updated, or you could run an index update at a certain time every day. Each method has its own advantages. For something like phpRiot, updating the index in real time is probably the best way, as the data is not updated very frequently. On the other hand, if you were indexing user submitted comments, it could create a huge load to recreate the index for every comment, as there could be hundreds of comments per day.
Getting started
The first thing we must do is install the Zend Framework if you have not already done so. It is structured in a similar way to how the Pear file structure is organised. At this stage, the Zend Framework is only in a “preview” phase. At time of writing the current version was Preview 0.1.3.
You can download this from
http://framework.zend.com/download. If you use Subversion, you can also checkout the trunk version which may have newer code in it.
I’m not exactly sure where the developers intended the framework to be stored, but like Pear is stored in /usr/local/lib/php, I chose to store it in /usr/local/lib/zend.
Highlight: Plain
$ cd /usr/local/src
$ wget http://framework.zend.com/download/tgz
$ tar -zxf ZendFramework-0.1.3.tar.gz
$ mv ZendFramework-0.1.3/library /usr/local/lib/zend
So now all that is required is that we add /usr/local/lib/zend to our include path. For instance, my include path directive in httpd.conf for phpRiot looks something like:
Highlight: Plain
php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php
Which now becomes:
Highlight: Plain
php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php:/usr/local/lib/zend
Creating our first index
The basic process for creating an index is:
- Open the index
- Add each document
- Commit (save) the index
The index is stored on the filesystem, so when you open an index, you specify the path where the it is to be stored. This is achieved by instantiating the
Zend_Search_Lucene class.
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = newZend_Search_Lucene($indexPath, true);
?>
You’ll also notice the second parameter in the constructor call. This means the index is created from scratch. If you set this to false (or omit the argument), an existing index is opened. This is done when updating or querying the index. Since we’re creating the index at this stage, we include that parameter.
Adding a document to our index
Now that we have opened our index, we must add our documents to it. We create a document to be added to the index using the following:
Highlight: PHP
<?php
$doc = newZend_Search_Lucene_Document();
?>
The next thing we must do is determine which fields we need to add to our index.
There are several different field types that can be stored for a single document. As we mentioned earlier, we want to store the keywords for our document, and also store some extra data that we can use when fetching the data. If we wanted to, we could duplicate the entire document content in our search index, however, this would be overkill and is unnecessary. It’s not as bad to duplicate smaller items such as the author name or document title as it’s only a small amount of data.
As you can tell from the previous paragraph, we’re starting to determine which fields we want to store in our index:
- Document content – but we don’t want to duplicate the entire article, as we’re only going to show users a preview of the article in the search results
- Title – we’re definitely going to include the title in our results
- Article teaser – even though we’re not storing the content, we do want to store a short summary of the document to display in our search results
- Author – we will list the author of each document in our search results, plus we want to give users the ability to search by author
- Created – We’ll also store a timestamp of when the article was created.
This brings us to the different types of fields we can use for indexing in Zend_Search_Lucene:
- UnIndexed – Data that isn’t available for searching, but is stored with our document (article teaser, article URL and timestamp of creation)
- UnStored – Data that is available for search, but isn’t stored in the index in full (the document content)
- Text – Data that is available for search and is stored in full (title and author)
There is also the
Keyword and
Binary fields available, but we won’t be using them in this example. Binary data isn’t searchable, but could be used to store an image with an indexed document.
To add a field to our indexed document, we use the
addField() method. When we call this method, we must pass data that has been prepared according to the necessary field type. This is achieved used the Zend_Search_Lucene_Field class with the field type as the static method name.
In other words, to create the
title field data, we use:
Highlight: PHP
<?php
$data = Zend_Search_Lucene_Field::Text('title', $docTitle);
?>
Note that we specify the name of the field here. We can later reference this data by this field name, but we will cover this in the section about querying the index.
So to add all the data with the field types we just worked out, we would use this:
Highlight: PHP
<?php
$doc = newZend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $docUrl));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('created', $docCreated));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $docTeaser));
$doc->addField(Zend_Search_Lucene_Field::Text('title', $docTitle));
$doc->addField(Zend_Search_Lucene_Field::Text('author', $docAuthor));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docBody));
?>
Important note: We added the main search content with a field name of
contents. This is a special name to Zend_Search_Lucene, which means by default, all queries will search this field. If users instead want to search the title field for ‘foo’, their search query would look like
title:foo. This will be covered further in the section about querying the index.
Finally, we add the document to the index using
addDocument():
Highlight: PHP
<?php
$index->addDocument($doc);
?>
We will cover this code in full in the next chapter (including looping over several documents and adding them all at once).
Committing / saving the index
Once all documents have been added, the index must be saved.
Highlight: PHP
<?php
$index->commit();
?>
You can call commit after adding each document, but this operation is expensive and each call results in a new index segment (in lamen’s terms, this means it’s an extra file to search when querying the index). Note that better management of the segments in the index is a planned future improvement, but for the time being it’s best to be careful and call it once after you’ve done what you need to do with the index.
If you were to look at your filesystem now that you have committed the index, you would notice a directory at the location you specified when opening the index. This directory contains a number of files which are then used when querying the index.
Next up we will piece all the above code into a single script which indexes all the articles on phpRiot.
Indexing all the articles on phpRiot
Now that you’ve seen how to create a basic index, we will extend this script slightly so it can index all the documents in phpRiot.
Additionally, we will be extending the base Zend_Search_Lucene_Document class to simplify our code slightly. This will also demonstrate ways you can take advantage of the OOP style of programming that Zend_Search_Lucene uses.
Since we are demonstrating the indexing of phpRiot articles, we will also use the DatabaseObject class here to fetch article data, just as phpRiot does. You don’t really need to know how this class works to understand this example (as it is fairly self-explanatory in the function calls), but if you are interested, you can read our article Managing your data with DatabaseObject.
Extending Zend_Search_Lucene_Document
On the previous page, after we opened the index, we created a new instance of Zend_Search_Lucene_Document to hold the index data for a single document. Instead of using this class directly, we’re going to extend it to encapsulate all of the field-adding code we wrote there.
In other words, we’re going to move the calls to addField() into our class, rather than calling it for each field after we create our Zend_Search_Lucene_Document item.
Highlight: PHP
<?php
class PhpRiotIndexedDocument extends Zend_Search_Lucene_Document
{
    /**
     * Constructor. Creates our indexable document and adds all
     * necessary fields to it using the passed in DatabaseObject
     * holding the article data.
     */
    public function __construct(&$document)
    {
        $this->addField(Zend_Search_Lucene_Field::UnIndexed('url', $document->generateUrl()));
        $this->addField(Zend_Search_Lucene_Field::UnIndexed('created', $document->getProperty('created')));
        $this->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $document->getProperty('teaser')));
        $this->addField(Zend_Search_Lucene_Field::Text('title', $document->getProperty('title')));
        $this->addField(Zend_Search_Lucene_Field::Text('author', $document->getProperty('author')));
        $this->addField(Zend_Search_Lucene_Field::UnStored('contents', $document->getProperty('body')));
    }
}
?>
As you can see, we’ve made a simple wrapper class which fetches data from the passed-in document (which uses DatabaseObject). The generateUrl() function is just a special internal method which determines a document’s full URL. We are storing this when we build the index so we don’t have to generate it each time a search is run (especially since this will never change, and if it does we can just rebuild the index).
Building the full index
Now that we have our class, we can create our index, loop over the documents, and then save our index:
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
require_once('DatabaseObject/PhpriotDocument.class.php');
// where to save our index
$indexPath = '/var/www/phpriot.com/data/docindex';
// get a list of all database ids for documents
$doc_ids = PhpriotDocument::GetDocIds($db);
// create our index
$index = new Zend_Search_Lucene($indexPath, true);
foreach ($doc_ids as $doc_id) {
    // load our databaseobject
    $document = new PhpriotDocument($db);
    $document->loadRecord($doc_id);
    // create our indexed document and add it to the index
    $index->addDocument(new PhpRiotIndexedDocument($document));
}
// write the index to disk
$index->commit();
?>
The index has now been created! This can take some time if you have many documents or if each document has a large amount of content. We generate this by using PHP on the command line, which allows us to see its progress in real-time if we need to (we also output the title and a status message as each document is indexed).
How querying works in Zend_Search_Lucene
Now comes the most important part: actually finding stuff! There are quite a lot of options when it comes to querying your index, giving you and your users a lot of control over the returned results.
When we created the index, we added six fields, but only three of those were actually searchable: document title, document content, and the author.
Each of these items is stored separately for each indexed document, meaning you can search on them separately. The syntax used to search each field is somewhat like Google’s, in that you specify the field, followed by a colon, followed by the term (with no spaces).
So to search the author field for ‘Quentin’, the search query would be author:Quentin. (Note that the search is case-insensitive. To make it case-sensitive we would need to change some options when creating our index. For full details on this, please read the Zend_Search_Lucene manual section on Extensibility.)
Likewise, to search in the title field for ‘php’, we would use title:php.
As we briefly mentioned earlier in this article, the default field that is searched is the one called contents. So if you wanted to search the document body for the word ‘google’, you could use contents:google or just google.
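To make the field:term syntax concrete, here is a toy sketch of how one query token could be split into a field name and a term. This is only an illustration: splitFieldTerm() is a hypothetical helper, not part of Zend_Search_Lucene, whose real query parser handles far more. Lowercasing the term stands in for the case-insensitive matching mentioned above.

```php
<?php
/**
 * Splits one query token such as "author:Quentin" into a field and a term.
 * Tokens without a "field:" prefix fall back to $defaultField.
 * Illustration only; this is not the library's parser.
 */
function splitFieldTerm($token, $defaultField = 'contents')
{
    $pos = strpos($token, ':');

    if ($pos === false) {
        // no field given, so the default field is searched
        return array('field' => $defaultField, 'term' => strtolower($token));
    }

    return array(
        'field' => substr($token, 0, $pos),
        'term'  => strtolower(substr($token, $pos + 1)),
    );
}
```

Under this model, contents:google and google end up as the same field and term, which is why the two queries are interchangeable.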
Including and excluding terms
By default, all specified terms are searched on using a boolean ‘or’. This means that any of the terms can exist for a document to be returned. To force results to have a particular term, the plus symbol is used. To force results to not have a particular term, the minus symbol is used. If you’re searching in a different field, you can put the plus or minus either before the field name or before the term. In other words, +author:Quentin and author:+Quentin are identical.
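The plus and minus rules can be sketched in plain PHP. This is a toy model based on the description above and on how Lucene-style boolean queries usually behave, not the library’s actual matching code; matchesTerms() and its arguments are invented for the example.

```php
<?php
/**
 * Toy model of the +/- rules: decides whether a document matches.
 * $docTerms is the list of lowercase words found in the document; the
 * other three arguments are lists of lowercase query terms.
 */
function matchesTerms(array $docTerms, array $required, array $prohibited, array $optional)
{
    foreach ($required as $term) {
        if (!in_array($term, $docTerms)) {
            return false; // a "+term" is missing
        }
    }

    foreach ($prohibited as $term) {
        if (in_array($term, $docTerms)) {
            return false; // a "-term" is present
        }
    }

    if (count($required) > 0) {
        // every required term is present; in this model the optional
        // terms are not needed for a match
        return true;
    }

    // no required terms: at least one optional term must be present ("or")
    foreach ($optional as $term) {
        if (in_array($term, $docTerms)) {
            return true;
        }
    }

    return false;
}
```

For the query php -ajax, a document containing php and mysql matches, while one containing both php and ajax does not.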
Searching for phrases
It is possible to search for exact phrases with Zend_Search_Lucene, so if you wanted to search for the exact phrase “PHP Articles” you could. Because this is somewhat complicated to achieve, we will not be including it in our examples or implementation. However, there is a lot of information on this in the Zend_Search_Lucene manual section on query types.
Sample queries
Here are some queries you can pass to Zend_Search_Lucene and their meanings.
Highlight: Plain
php
// search the index for any article with the word php
php -author:quentin
// find any article with the word php not written by me
author:quentin
// find all the articles by me
php -ajax
// find all articles with the word php that don't have the word ajax
title:mysql
// find all articles with MySQL in the title
title:mysql -author:quentin
// find all articles with MySQL in the title not by me
And so on. Hopefully you get the idea.
Scoring of results
All results returned from a search are assigned a score. This is a measure of how well the document matched the search term.
The results are ordered by their score, from highest to lowest.
I’m not exactly sure how the score is calculated or what it represents exactly, but it looks pretty on the search results.
You can customize the scoring algorithm (and hence the ordering of results). Please see the section later in the article on extending Zend_Search_Lucene.
Querying our index
On the previous page we looked at how to write queries to search the index. We learned how to include and exclude terms, and also how to search different fields in our indexed data.
Now we will look at actually pulling documents from our index using such a query.
There are essentially two ways to query the index: passing the raw query in and letting Zend_Search_Lucene parse it (ideal when you’re writing a search engine where you’re not sure what the user will enter), or manually building up the query with API function calls.
In either case, you use the find() method on the index. The find() method returns a list of matches from your index.
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = new Zend_Search_Lucene($indexPath);
$hits = $index->find('php +author:Quentin');
?>
This sample code searches our index for articles containing ‘php’ that were written by me. Note that when we opened our index, we did not pass the second parameter as we did when we created the index. This is because we are not creating the index; we are only querying it.
We could also manually build this same query with function calls like so:
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = new Zend_Search_Lucene($indexPath);
$query = new Zend_Search_Lucene_Search_Query_MultiTerm();
$query->addTerm(new Zend_Search_Lucene_Index_Term('php'), null);
$query->addTerm(new Zend_Search_Lucene_Index_Term('Quentin', 'author'), true);
$hits = $index->find($query);
?>
The second parameter to addTerm() determines whether or not a term is required. True means it is required (like putting a plus sign before the term), false means it is prohibited (like putting a minus sign before the term), and null means it is neither required nor prohibited.
The second parameter to Zend_Search_Lucene_Index_Term specifies the field to search. By default this is contents.
On the whole, it is easier to simply allow Zend_Search_Lucene to parse the query.
Dealing with returned results
The results found from your query are returned in an array, meaning you can simply use count() on the array to determine the number of hits.
Each of the stored fields is available as a property of each hit.
So to loop over the results as we indexed them previously (with a title, author and teaser), we would do the following:
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$query = 'php +author:Quentin';
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = new Zend_Search_Lucene($indexPath);
$hits = $index->find($query);
$numHits = count($hits);
?>
<p>
Found <?=$numHits?> result(s) for query <?=$query?>.
</p>
<?php foreach ($hits as $hit) { ?>
<h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
<p>
By <?=$hit->author?>
</p>
<p>
<?=$hit->teaser?><br />
<a href="<?=$hit->url?>">Read more...</a>
</p>
<?php } ?>
Here we also used an extra field called ‘score’. As mentioned previously, this is used as an indicator as to how well a document matched the query. Results with the highest score are listed first.
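One thing the listing above leaves out is escaping. Values coming back from the index are printed straight into the page, so a title containing < or & could break the markup. Below is a hedged sketch of a safer renderer; h() and renderHit() are hypothetical helper names, and it assumes the teaser is stored as plain text rather than HTML.

```php
<?php
/**
 * Escapes a value for safe output inside HTML text or attributes.
 * Hypothetical helper, not part of the Zend Framework.
 */
function h($value)
{
    return htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8');
}

/**
 * Renders one search hit as HTML, escaping every value taken from the index.
 * $hit is any object with title, author, teaser, url and score properties,
 * matching the fields stored in the index built earlier.
 */
function renderHit($hit)
{
    return '<h3>' . h($hit->title) . ' (score: ' . h(round($hit->score, 2)) . ")</h3>\n"
         . '<p>By ' . h($hit->author) . "</p>\n"
         . '<p>' . h($hit->teaser) . '<br />'
         . '<a href="' . h($hit->url) . '">Read more...</a></p>' . "\n";
}
```

Inside the results loop, each hit would then be printed with echo renderHit($hit); instead of the inline markup.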
Creating a simple search engine
Using our code above, we can easily transform this into a simple site search engine. All we need to do is add a form and plug in the submitted query. Let’s assume this script is called search.php:
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$query = isset($_GET['query']) ? $_GET['query'] : '';
$query = trim($query);
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = new Zend_Search_Lucene($indexPath);
if (strlen($query) > 0) {
    $hits = $index->find($query);
    $numHits = count($hits);
}
?>
<form method="get" action="search.php">
<input type="text" name="query" value="<?=htmlspecialchars($query)?>" />
<input type="submit" value="Search" />
</form>
<?php if (strlen($query) > 0) { ?>
<p>
Found <?=$numHits?> result(s) for query <?=htmlspecialchars($query)?>.
</p>
<?php foreach ($hits as $hit) { ?>
<h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
<p>
By <?=$hit->author?>
</p>
<p>
<?=$hit->teaser?><br />
<a href="<?=$hit->url?>">Read more...</a>
</p>
<?php } ?>
<?php } ?>
Error handling
The one thing we haven’t dealt with yet is errors in the search. For instance, if we were to type in ‘title:’ with no term after it, an error would occur. We handle this by catching the Zend_Search_Lucene_Exception exception.
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
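$query = isset($_GET['query']) ? $_GET['query'] : '';
$query = trim($query);
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = new Zend_Search_Lucene($indexPath);
$hits = array();
if (strlen($query) > 0) {
    try {
        $hits = $index->find($query);
    }
    catch (Zend_Search_Lucene_Exception $ex) {
        // treat a failed search as zero hits
        $hits = array();
    }
}
$numHits = count($hits);
?>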
This means that if an error occurs in the search, we simply assume zero hits were returned, thereby handling the error without indicating to the user that anything went wrong.
Of course, you could also choose to get the error message from the exception and output that instead ($ex->getMessage()).
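If you do want to show the message, one possible shape is a small wrapper that returns both the hits and any error text. safeFind() is a made-up name for this sketch, and $index is only assumed to provide the find() method mentioned earlier. It catches the base Exception class, which also covers Zend_Search_Lucene_Exception, since every exception thrown in PHP 5 derives from Exception.

```php
<?php
/**
 * Runs a query and returns array('hits' => ..., 'error' => ...).
 * 'error' is null on success, or the exception message on failure.
 * Hypothetical helper: $index only needs a find() method.
 */
function safeFind($index, $query)
{
    try {
        return array('hits' => $index->find($query), 'error' => null);
    } catch (Exception $ex) {
        // a failed search yields no hits plus the message to display
        return array('hits' => array(), 'error' => $ex->getMessage());
    }
}
```

The calling script can then check whether the error entry is null and, if not, print it (escaped) above the search form.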
Keeping the index up-to-date
The other thing we haven’t yet dealt with is what to do when any of our documents is updated. There are several ways to handle this:
- Update just the entry for the updated document straight away
- Rebuild the entire index straight away when a document is updated
- Rebuild the entire index at a certain time each day (or several times per day)
The ideal method really depends on the kind of data you have, how often it is updated, and how important it is for the search index to be completely up-to-date.
To be honest, I haven’t figured out a way to update a single document in the index yet, but I may just be missing something simple. If you open the index (without the second parameter), and then index a document that is already in the index, then it will be duplicated (and hence returned twice in any matching search).
There isn’t too much documentation on this library yet, at least nothing about this specifically. If anybody knows how, please drop me an email or submit a comment with this article.
So at this point, the way to keep the index updated is to rebuild it from scratch when a document is updated.
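If rebuilding on every single update is too heavy, a middle ground is to mark the index as out-of-date when a document changes and let a scheduled script do the rebuild only when needed. The sketch below uses a flag file for this. The function names and the flag file are my own invention for illustration; the $rebuild callback would be whatever code recreates your index.

```php
<?php
// Hypothetical helper: call this whenever a document is saved.
function markIndexDirty($flagFile)
{
    return touch($flagFile);
}

// Hypothetical helper: call this from a scheduled script (e.g. cron).
// $rebuild is any callable that recreates the index from scratch.
// Returns true if a rebuild was run, false if the index was already current.
function rebuildIfDirty($flagFile, $rebuild)
{
    if (!file_exists($flagFile)) {
        return false;
    }
    // remove the flag first, so an update that arrives during the
    // rebuild marks the index dirty again for the next run
    unlink($flagFile);
    call_user_func($rebuild);
    return true;
}
```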
Extending Zend_Search_Lucene
There are several aspects of Zend_Search_Lucene that can be extended, allowing a fully customized search solution. These include:
- A custom tokenizer for determining keywords in a document
- Custom scoring algorithms to determine how well a document matches a search query
- A custom storage method, so your index is stored however and wherever you please
A custom tokenizer
There are many reasons why a custom tokenizer can be useful. Here are some ideas:
- PDF tokenizer – a tokenizer that can parse a PDF file and find all the keywords
- Image tokenizer – a tokenizer that can perform Optical Character Recognition (OCR), thereby allowing you to index words in an image (and you could store the image also, using the Binary field type)
- HTML tokenizer – a tokenizer that can read HTML data, thereby knowing not to index HTML keywords but only the actual content. You could make further improvements on this also, such as finding all headings and treating them with higher preference to the rest of the content.
More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.analysis.
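To give a rough idea of what the HTML case involves, here is a plain PHP sketch that turns an HTML fragment into a list of lowercase keywords. It does not plug into Zend_Search_Lucene’s analyzer classes (see the manual link above for that), tokenizeHtml() is a name I made up, and it only handles ASCII letters and digits.

```php
<?php
// Hypothetical helper, independent of Zend_Search_Lucene's analyzer classes.
function tokenizeHtml($html)
{
    // drop script and style bodies, which strip_tags() would keep as text
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);
    // put a space in front of each tag so neighbouring blocks don't merge into one word
    $text = strip_tags(str_replace('<', ' <', $html));
    // turn entities such as &amp; back into plain characters
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // split on anything that is not an ASCII letter or digit
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}
```

You could pass the result of implode(' ', tokenizeHtml($body)) as the contents field instead of the raw HTML.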
Custom scoring algorithms
Using a custom scoring algorithm, you can determine how favourably different fields in a document are looked upon. For example, you might want to treat matches in the ‘title’ field (if you have one) much more favourably than matches in the ‘contents’ field.
More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.scoring.
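The idea of favouring one field over another can be shown without the library at all. The following is a toy calculation, not the Zend_Search_Lucene scoring API: weightedScore() and the weights are made up for illustration, and a real scoring algorithm takes more into account than raw match counts.

```php
<?php
// Hypothetical scoring: each match is multiplied by the weight of its field.
// $termCounts: field name => number of query term matches in that field
// $weights:    field name => weight (fields without a weight count as 1.0)
function weightedScore($termCounts, $weights)
{
    $score = 0.0;
    foreach ($termCounts as $field => $count) {
        $weight = isset($weights[$field]) ? $weights[$field] : 1.0;
        $score += $weight * $count;
    }
    return $score;
}
```

With array('title' => 5.0) as the weights, a document with one match in the title and two in the contents scores higher than a document with four matches in the contents only.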
Custom storage method
You can change how indexes are written to disk by extending the Zend_Search_Lucene_Storage_Directory and Zend_Search_Lucene_Storage_File classes.
It may or may not be possible to change this to store all indexed data in a database, but I haven’t actually tried this so I’m not sure.
More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.storage.
Conclusion
In this article we looked closely at Zend_Search_Lucene, part of the new Zend Framework. While it is still in development, it works sufficiently well to implement fulltext indexing and searching on any PHP 5 website.
We also looked briefly at some ways of extending the search capabilities.
Hopefully this has given you some ideas for your own website. Feel free to add comments and ideas to this article.
Querying the data
Once we have created the index, we can then allow people to search it. When somebody types in a search term, we look for those terms in our index, which then points back to 0 or more different documents. Zend_Search_Lucene will also score the results for us, meaning the most relevant matches are returned first.
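To make the “terms point back to documents” idea concrete, here is a toy index in plain PHP. It is nothing like as capable as Zend_Search_Lucene (no scoring, no fields), and both function names are made up, but the lookup works on the same principle.

```php
<?php
// Build a toy index: keyword => list of ids of the documents containing it.
// $documents: document id => document text
function buildToyIndex($documents)
{
    $index = array();
    foreach ($documents as $id => $text) {
        $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        // array_unique() so a document is listed once per keyword
        foreach (array_unique($words) as $word) {
            $index[$word][] = $id;
        }
    }
    return $index;
}

// Look up one keyword; returns zero or more document ids.
function lookupToyIndex($index, $word)
{
    $word = strtolower($word);
    return isset($index[$word]) ? $index[$word] : array();
}
```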
Keeping the index up-to-date
If a document is ever updated in the database, this likely means the index will be out-of-date. This means we have to make an extra consideration in our general content management. That is, we need to re-index that document.
There are several ways of approaching this. For instance, you could update it in real-time when the document is updated, or you could run an index update at a certain time every day. Each method has its own advantages. For something like phpRiot, updating the index in real time is probably the best way, as the data is not updated very frequently. On the other hand, if you were indexing user submitted comments, it could create a huge load to recreate the index for every comment, as there could be hundreds of comments per day.
Getting started
The first thing we must do is install the Zend Framework if you have not already done so. It is structured in a similar way to how the Pear file structure is organised. At this stage, the Zend Framework is only in a “preview” phase. At time of writing the current version was Preview 0.1.3.
You can download this from http://framework.zend.com/download. If you use Subversion, you can also check out the trunk version, which may have newer code in it.
I’m not exactly sure where the developers intended the framework to be stored, but like Pear is stored in /usr/local/lib/php, I chose to store it in /usr/local/lib/zend.
Highlight: Plain
$ cd /usr/local/src
$ wget http://framework.zend.com/download/tgz
$ tar -zxf ZendFramework-0.1.3.tar.gz
$ mv ZendFramework-0.1.3/library /usr/local/lib/zend
So now all that is required is that we add /usr/local/lib/zend to our include path. For instance, my include path directive in httpd.conf for phpRiot looks something like:
Highlight: Plain
php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php
Which now becomes:
Highlight: Plain
php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php:/usr/local/lib/zend
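If you can’t change httpd.conf (on shared hosting, say), the include path can also be extended at runtime with PHP’s set_include_path(). A small sketch, assuming the same /usr/local/lib/zend location as above:

```php
<?php
// Assumed install location from above; adjust to wherever the library lives.
$zendPath = '/usr/local/lib/zend';

// only append the directory if it is not already on the include path
$paths = explode(PATH_SEPARATOR, get_include_path());
if (!in_array($zendPath, $paths)) {
    set_include_path(get_include_path() . PATH_SEPARATOR . $zendPath);
}
```

This needs to run before the first require_once() of a Zend Framework file.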
Creating our first index
The basic process for creating an index is:
- Open the index
- Add each document
- Commit (save) the index
The index is stored on the filesystem, so when you open an index, you specify the path where it is to be stored. This is achieved by instantiating the Zend_Search_Lucene class.
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = new Zend_Search_Lucene($indexPath, true);
?>
You’ll also notice the second parameter in the constructor call. This means the index is created from scratch. If you set this to false (or omit the argument), an existing index is opened. This is done when updating or querying the index. Since we’re creating the index at this stage, we include that parameter.
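If the same script is used both for the first run and for later runs, you can work out the second argument instead of hard-coding it. The helper below is a sketch with a made-up name; it assumes that a non-empty directory at the index path is a usable index, which it does not actually verify.

```php
<?php
// Hypothetical helper: decide the value of the constructor's second argument.
// Returns true when the directory is missing or empty, i.e. there is no index to open.
function indexNeedsCreating($indexPath)
{
    if (!is_dir($indexPath)) {
        return true;
    }
    // scandir() always lists "." and ".."; anything beyond that is treated as index data
    $entries = array_diff(scandir($indexPath), array('.', '..'));
    return count($entries) == 0;
}

// Usage (requires the Zend Framework on the include path):
// $index = new Zend_Search_Lucene($indexPath, indexNeedsCreating($indexPath));
```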
Adding a document to our index
Now that we have opened our index, we must add our documents to it. We create a document to be added to the index using the following:
Highlight: PHP
<?php
$doc = new Zend_Search_Lucene_Document();
?>
The next thing we must do is determine which fields we need to add to our index.
There are several different field types that can be stored for a single document. As we mentioned earlier, we want to store the keywords for our document, and also store some extra data that we can use when fetching the data. If we wanted to, we could duplicate the entire document content in our search index, however, this would be overkill and is unnecessary. It’s not as bad to duplicate smaller items such as the author name or document title as it’s only a small amount of data.
As you can tell from the previous paragraph, we’re starting to determine which fields we want to store in our index:
- Document content – but we don’t want to duplicate the entire article, as we’re only going to show users a preview of the article in the search results
- Title – we’re definitely going to include the title in our results
- Article teaser – even though we’re not storing the content, we do want to store a short summary of the document to display in our search results
- Author – we will list the author of each document in our search results, plus we want to give users the ability to search by author
- Created – We’ll also store a timestamp of when the article was created.
This brings us to the different types of fields we can use for indexing in Zend_Search_Lucene:
- UnIndexed – Data that isn’t available for searching, but is stored with our document (article teaser, article URL and timestamp of creation)
- UnStored – Data that is available for search, but isn’t stored in the index in full (the document content)
- Text – Data that is available for search and is stored in full (title and author)
There are also the Keyword and Binary fields available, but we won’t be using them in this example. Binary data isn’t searchable, but could be used to store an image with an indexed document.
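The choice between the three types we are using comes down to two yes/no questions, which can be written out as a small function. chooseFieldType() is a made-up name for illustration and only covers UnIndexed, UnStored and Text as described in the list above.

```php
<?php
// Hypothetical helper: pick one of the three field types used in this article.
// $searchable: should queries be able to match this data?
// $stored:     should the full value be returned with each search result?
function chooseFieldType($searchable, $stored)
{
    if ($searchable && $stored) {
        return 'Text';      // e.g. title, author
    }
    if ($searchable) {
        return 'UnStored';  // e.g. the document content
    }
    if ($stored) {
        return 'UnIndexed'; // e.g. teaser, URL, creation timestamp
    }
    return null;            // neither: no reason to add the field at all
}
```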
To add a field to our indexed document, we use the addField() method. When we call this method, we must pass data that has been prepared according to the necessary field type. This is achieved using the Zend_Search_Lucene_Field class with the field type as the static method name.
In other words, to create the title field data, we use:
Highlight: PHP
<?php
$data = Zend_Search_Lucene_Field::Text('title', $docTitle);
?>
Note that we specify the name of the field here. We can later reference this data by this field name, but we will cover this in the section about querying the index.
So to add all the data with the field types we just worked out, we would use this:
Highlight: PHP
<?php
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $docUrl));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('created', $docCreated));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $docTeaser));
$doc->addField(Zend_Search_Lucene_Field::Text('title', $docTitle));
$doc->addField(Zend_Search_Lucene_Field::Text('author', $docAuthor));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docBody));
?>
Important note: We added the main search content with a field name of contents. This is a special name to Zend_Search_Lucene, which means by default, all queries will search this field. If users instead want to search the title field for ‘foo’, their search query would look like title:foo. This will be covered further in the section about querying the index.
Finally, we add the document to the index using addDocument():
Highlight: PHP
<?php
$index->addDocument($doc);
?>
We will cover this code in full in the next chapter (including looping over several documents and adding them all at once).
Committing / saving the index
Once all documents have been added, the index must be saved.
Highlight: PHP
<?php
$index->commit();
?>
You can call commit() after adding each document, but this operation is expensive and each call results in a new index segment (in layman’s terms, this means it’s an extra file to search when querying the index). Note that better management of the segments in the index is a planned future improvement, but for the time being it’s best to be careful and call it once after you’ve done what you need to do with the index.
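The “add everything, then commit once” pattern can be wrapped in a small function. This is a sketch with a made-up function name; $index is assumed to be an opened Zend_Search_Lucene instance and $docs a list of Zend_Search_Lucene_Document objects.

```php
<?php
// Hypothetical helper: add a batch of documents with a single commit.
// Returns the number of documents added.
function addAllAndCommit($index, $docs)
{
    $added = 0;
    foreach ($docs as $doc) {
        $index->addDocument($doc);
        $added++;
    }
    if ($added > 0) {
        $index->commit(); // one commit for the whole batch, not one per document
    }
    return $added;
}
```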
If you were to look at your filesystem now that you have committed the index, you would notice a directory at the location you specified when opening the index. This directory contains a number of files which are then used when querying the index.
Next up we will piece all the above code into a single script which indexes all the articles on phpRiot.
Indexing all the articles on phpRiot
Now that you’ve seen how to create a basic index, we will extend this script slightly so it can index all the documents in phpRiot.
Additionally, we will be extending the base Zend_Search_Lucene_Document class to simplify our code slightly. This will also demonstrate ways you can take advantage of the OOP style of programming that Zend_Search_Lucene uses.
Since we are demonstrating the indexing of phpRiot articles, we will also use the DatabaseObject class here to fetch article data, just as phpRiot does. You don’t really need to know how this class works to understand this example (as it is fairly self-explanatory in the function calls), but if you are interested, you can read our article Managing your data with DatabaseObject.
Extending Zend_Search_Lucene_Document
On the previous page, after we opened the index, we created a new instance of Zend_Search_Lucene_Document to hold the index data for a single document. Instead of calling this class directly, we’re going to extend this class to encapsulate all of the adding of data we also did.
In other words, we’re going to move the calls to addField() into our class, rather than calling it for each field after we create our Zend_Search_Lucene_Document item.
Highlight: PHP
<?php
class PhpRiotIndexedDocument extends Zend_Search_Lucene_Document
{
    /**
     * Constructor. Creates our indexable document and adds all
     * necessary fields to it using the passed in DatabaseObject
     * holding the article data.
     */
    public function __construct(&$document)
    {
        $this->addField(Zend_Search_Lucene_Field::UnIndexed('url', $document->generateUrl()));
        $this->addField(Zend_Search_Lucene_Field::UnIndexed('created', $document->getProperty('created')));
        $this->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $document->getProperty('teaser')));
        $this->addField(Zend_Search_Lucene_Field::Text('title', $document->getProperty('title')));
        $this->addField(Zend_Search_Lucene_Field::Text('author', $document->getProperty('author')));
        $this->addField(Zend_Search_Lucene_Field::UnStored('contents', $document->getProperty('body')));
    }
}
?>
As you can see, we’ve made a simple wrapper class which fetches data from the passed in document (which uses DatabaseObject). The generateUrl() function is just a special internal method which determines a document’s full URL. We are storing this when we build the index so we don’t have to generate it each time a search is run (especially since this will never change, and if it does we can just rebuild the index).
Building the full index
Now that we have our class, we can create our index, loop over the documents, and then save our index:
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
require_once('DatabaseObject/PhpriotDocument.class.php');
// where to save our index
$indexPath = '/var/www/phpriot.com/data/docindex';
// get a list of all database ids for documents
$doc_ids = PhpriotDocument::GetDocIds($db);
// create our index
$index = new Zend_Search_Lucene($indexPath, true);
foreach ($doc_ids as $doc_id) {
    // load our databaseobject
    $document = new PhpriotDocument($db);
    $document->loadRecord($doc_id);
    // create our indexed document and add it to the index
    $index->addDocument(new PhpRiotIndexedDocument($document));
}
// write the index to disk
$index->commit();
?>
The index has now been created! This can take some time if you have many documents or if each document has a large amount of content. We generate this by using PHP on the command line, which allows us to see its progress in real-time if we need to (we also output the title and a status message as each document is indexed).
How querying works in Zend_Search_Lucene
Now comes the most important part—actually finding stuff! There are quite a lot of options when it comes to querying your index, allowing you and your users to have a lot of control in the returned results.
When we created the index, we added six fields, but only three of those were actually searchable: the document title, the document content and the author.
Each of these fields is stored separately for each indexed document, meaning you can search on them separately. The syntax used to search each field is somewhat like Google’s, in that you specify the field, followed by a colon, followed by the term (with no spaces).
So to search the author field for ‘Quentin’, the search query would be author:Quentin. (Note that the search is case-insensitive. To make it case-sensitive we would need to change some options when creating our index. For full details on this, please read the Zend_Search_Lucene manual section on Extensibility.)
Likewise, to search in the title field for ‘php’, we would use title:php.
As we briefly mentioned earlier in this article, the default field that is searched is the one called contents. So if you wanted to search the document body for the word ‘google’, you could use contents:google or just google.
Including and excluding terms
By default, all specified terms are searched on using a boolean ‘or’. This means that any of the terms can exist for a document to be returned. To force results to have a particular term, the plus symbol is used. To force results to not have a particular term, the minus symbol is used. If you’re searching in a different field, you can put the plus or minus either before the field name or the term name. In other words, +author:Quentin and author:+Quentin are identical.
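The rules above can be modelled in a few lines of plain PHP. This is a simplified sketch of the matching logic only (one field, no scoring), and matchesTerms() is a made-up name rather than anything in Zend_Search_Lucene. It assumes that once a required term is present, the optional terms only influence ranking.

```php
<?php
// Hypothetical model of plus/minus matching for a single document.
// $docTerms:   lowercase terms found in the document
// $required:   terms prefixed with a plus sign
// $prohibited: terms prefixed with a minus sign
// $optional:   terms with no prefix
function matchesTerms($docTerms, $required, $prohibited, $optional)
{
    foreach ($required as $term) {
        if (!in_array($term, $docTerms)) {
            return false; // a required term is missing
        }
    }
    foreach ($prohibited as $term) {
        if (in_array($term, $docTerms)) {
            return false; // a prohibited term is present
        }
    }
    if (count($required) > 0) {
        return true; // optional terms would only affect the ranking
    }
    // no required terms: at least one optional term must be present
    foreach ($optional as $term) {
        if (in_array($term, $docTerms)) {
            return true;
        }
    }
    return false;
}
```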
Searching for phrases
It is possible to search for exact phrases with Zend_Search_Lucene, so if you wanted to search for the exact phrase “PHP Articles” you could. Because this is somewhat complicated to achieve, we will not be including this in our examples or implementation. However, there is a lot of information on this in the Zend_Search_Lucene manual section on query types.
Sample queries
Here are some queries you can pass to Zend_Search_Lucene and their meanings.
Highlight: Plain
php
// search the index for any article with the word php
php -author:quentin
// find any article with the word php not written by me
author:quentin
// find all the articles by me
php -ajax
// find all articles with the word php that don't have the word ajax
title:mysql
// find all articles with MySQL in the title
title:mysql -author:quentin
// find all articles with MySQL in the title not by me
And so on. Hopefully you get the idea.
Scoring of results
All results returned from a search are assigned a score. This is a measure of how well the document matched the search term.
The results are ordered by their score, from highest to lowest.
I’m not exactly sure how the score is calculated or what it represents exactly, but it looks pretty on the search results.
You can customize the scoring algorithm (and hence the ordering of results). Please see the section later in the article on extending Zend_Search_Lucene.
Querying our index
On the previous page we looked at how to write queries to search the index. We learned how to include and exclude terms, and also how to search different fields in our indexed data.
Now we will look at actually pulling documents from our index using that term.
There are essentially two ways to query the index: passing the raw query in and letting Zend_Search_Lucene parse the query (ideal when you’re writing a search engine where you’re not sure what the user will enter), or by manually building up the query with API function calls.
In either case, you use the find() method on the index. The find() method returns a list of matches from your index.
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = new Zend_Search_Lucene($indexPath);
$hits = $index->find('php +author:Quentin');
?>
This sample code searches our index for articles containing ‘php’, written by me. Note that when we opened our index, we did not pass the second parameter as we did when we created the index. This is because we are not writing the index, we are querying it.
We could also manually build this same query with function calls like so:
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = new Zend_Search_Lucene($indexPath);
$query = new Zend_Search_Lucene_Search_Query_MultiTerm();
$query->addTerm(new Zend_Search_Lucene_Index_Term('php'), null);
$query->addTerm(new Zend_Search_Lucene_Index_Term('Quentin', 'author'), true);
$hits = $index->find($query);
?>
The second parameter of addTerm() determines whether or not a term is required. True means it is required (like putting a plus sign before the term), false means it is prohibited (like putting a minus sign before the term), and null means it is neither required nor prohibited.
The second parameter of the Zend_Search_Lucene_Index_Term constructor specifies the field to search. By default this is contents.
On the whole, it is easier to simply allow Zend_Search_Lucene to parse the query.
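If your search form has separate inputs (an author box, an “exclude this word” box and so on), one option is to assemble the query string yourself and still let Zend_Search_Lucene parse it. The sketch below uses the same true / false / null convention as addTerm(). The function name and the array layout are made up for illustration, and it does no escaping of special characters in the terms.

```php
<?php
// Hypothetical helper: build a query string from a list of clauses.
// Each clause is array('term' => ..., 'field' => optional, 'sign' => true|false|null)
// where sign mirrors addTerm(): true = required (+), false = prohibited (-), null = neither.
function buildQueryString($clauses)
{
    $parts = array();
    foreach ($clauses as $clause) {
        $part = '';
        // isset() is false for null, so only true and false add a prefix
        if (isset($clause['sign'])) {
            $part .= $clause['sign'] ? '+' : '-';
        }
        if (!empty($clause['field'])) {
            $part .= $clause['field'] . ':';
        }
        $parts[] = $part . $clause['term'];
    }
    return implode(' ', $parts);
}
```

The resulting string can be passed to the index just like a query typed in by a user.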
Dealing with returned results
The results found from your query are returned in an array, meaning you can simply use count() on the array to determine the number of hits.
Each of the stored fields is available as a property of each hit.
So to loop over the results as we indexed them previously (with a title, author and teaser), we would do the following:
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$query = 'php +author:Quentin';
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = new Zend_Search_Lucene($indexPath);
$hits = $index->find($query);
$numHits = count($hits);
?>
<p>
Found <?=$numHits?> result(s) for query <?=htmlSpecialChars($query)?>.
</p>
<?php foreach ($hits as $hit) { ?>
<h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
<p>
By <?=$hit->author?>
</p>
<p>
<?=$hit->teaser?><br />
<a href="<?=$hit->url?>">Read more...</a>
</p>
<?php } ?>
Here we also used an extra field called ‘score’. As mentioned previously, this is used as an indicator as to how well a document matched the query. Results with the highest score are listed first.
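Because the hits come back as a plain array that is already ordered by score, showing them a page at a time is just a matter of slicing the array. A minimal sketch, with a made-up function name and pages numbered from 1:

```php
<?php
// Hypothetical helper: return the hits belonging to one page of results.
function pageOfHits($hits, $page, $perPage)
{
    $page = max(1, (int) $page); // treat anything below 1 as the first page
    return array_slice($hits, ($page - 1) * $perPage, $perPage);
}
```

A page number past the end simply gives an empty array.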
Creating a simple search engine
Using our code above, we can easily transform this into a simple site search engine. All we need to do is add a form and plug in the submitted query. Let’s assume this script is called search.php:
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
___FCKpd___125
___FCKpd___126
___FCKpd___127
___FCKpd___128
___FCKpd___129
___FCKpd___130
___FCKpd___131
___FCKpd___132
___FCKpd___133
___FCKpd___134
___FCKpd___135
___FCKpd___136
___FCKpd___137
___FCKpd___138
___FCKpd___139
___FCKpd___140
___FCKpd___141
___FCKpd___142
___FCKpd___143
___FCKpd___144
___FCKpd___145
___FCKpd___146
___FCKpd___147
___FCKpd___148
___FCKpd___149
___FCKpd___150
___FCKpd___151
___FCKpd___152
Error handling
The one thing we haven’t dealt with yet are errors in the search. For instance, if we were to type in ‘title:’ with no query behind it then an error would occur. We handle this by catching the
Zend_Search_Lucene_Exception exception.
Highlight: PHP
___FCKpd___153
___FCKpd___154
___FCKpd___155
___FCKpd___156
___FCKpd___157
___FCKpd___158
___FCKpd___159
___FCKpd___160
___FCKpd___161
___FCKpd___162
___FCKpd___163
___FCKpd___164
___FCKpd___165
___FCKpd___166
This means now that if an error occurs in the search, we simply assume zero hits were returned, thereby handling the error without indicating to the user that anything went wrong.
Of course, you could also choose to get the error message from the exception and output that instead ($ex->getMessage()).
Keeping the index up-to-date
The other thing we haven’t yet dealt with is if any of our documents are updated. There are several ways to handle this:
- Update just the entry for the updated document straight away
- Rebuild the entire index when a document is updated straight away
- Rebuild the entire index at a certain time each day (or several times per day)
The ideal method really depends on the kind of data you have, how often it is updated, and how important it is for it the search index to be completely up-to-date.
To be honest, I haven’t figured out a way to update a single document in the index yet, but I may just be missing something simple. If you open the index (without the second parameter), and then index a document that is already in the index, then it will be duplicated (and hence returned twice in any matching search).
There isn’t too much documentation on this library yet, at least nothing about this specifically. If anybody knows how, please drop me an email or submit a comment with this article.
So at this point, the way to keep the index updated is to rebuild it from scratch when a document is updated.
Extending Zend_Search_Lucene
There are several aspects of Zend_Search_Lucene that can be extended, allowing a fully customized search solution. These include:
- A custom tokenizer for determining keywords in a document
- Custom scoring algorithms to determine how well a document matches a search query
- A custom storage method, to your index is stored however and wherever you please
A custom tokenizer
There are many reasons why a custom tokenizer can be useful. Here are some ideas:
- PDF tokenizer – a tokenizer that can parse a PDF file and find all the keywords
- Image tokenizer – a tokenizer that can perform Optical Character Recognition (OCR), thereby allowing you to index words in an image (and you could store the image also, using the Binary field type)
- HTML tokenizer – a tokenizer than can read HTML data, thereby knowing not to index HTML keywords but only the actual content. You could make further improvements on this also, such as finding all headings and treating them with higher preference to the rest of the content.
More information on this can be found at
http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.analysis.
Custom scoring algorithms
Using a custom scoring algorithm, you can determine how favourably different fields in a document are looked upon. For example, you might want to treat matches in the ‘title’ field (if you have one) much more favourably than matches in the ‘contents’ field.
More information can be found on this at
http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.scoring.
Custom storage method
You can change how indexes are written to disk by extending the Zend_Search_Lucene_Storage_Directory and Zend_Search_Lucene_Storage_File classes.
It may or may not be possible to change this store all indexed data in a database, but I haven’t actually tried this so I’m not sure.
More information on this can be found at
http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.storage.
Conclusion
In this article we looked closely at Zend_Search_Lucene, part of the new Zend Framework. While it is still in development, it works sufficiently well to implement fulltext indexing and searching on any PHP 5 website.
We also looked briefly at some ways of extending the search capabilities.
Hopefully this has given you some ideas for your own website. Feel free to add comments and ideas to this article.
GET['query'] : '';
___FCKpd___126
___FCKpd___127
___FCKpd___128
___FCKpd___129
___FCKpd___130
___FCKpd___131
___FCKpd___132
___FCKpd___133
___FCKpd___134
___FCKpd___135
___FCKpd___136
___FCKpd___137
___FCKpd___138
___FCKpd___139
___FCKpd___140
___FCKpd___141
___FCKpd___142
___FCKpd___143
___FCKpd___144
___FCKpd___145
___FCKpd___146
___FCKpd___147
___FCKpd___148
___FCKpd___149
___FCKpd___150
___FCKpd___151
___FCKpd___152
Error handling
The one thing we haven’t dealt with yet are errors in the search. For instance, if we were to type in ‘title:’ with no query behind it then an error would occur. We handle this by catching the
Zend_Search_Lucene_Exception exception.
Highlight: PHP
___FCKpd___153
___FCKpd___154
___FCKpd___155
___FCKpd___156
___FCKpd___157
___FCKpd___158
___FCKpd___159
___FCKpd___160
___FCKpd___161
___FCKpd___162
___FCKpd___163
___FCKpd___164
___FCKpd___165
___FCKpd___166
This means now that if an error occurs in the search, we simply assume zero hits were returned, thereby handling the error without indicating to the user that anything went wrong.
Of course, you could also choose to get the error message from the exception and output that instead ($ex->getMessage()).
Keeping the index up-to-date
The other thing we haven’t yet dealt with is if any of our documents are updated. There are several ways to handle this:
- Update just the entry for the updated document straight away
- Rebuild the entire index when a document is updated straight away
- Rebuild the entire index at a certain time each day (or several times per day)
The ideal method really depends on the kind of data you have, how often it is updated, and how important it is for it the search index to be completely up-to-date.
To be honest, I haven’t figured out a way to update a single document in the index yet, but I may just be missing something simple. If you open the index (without the second parameter), and then index a document that is already in the index, then it will be duplicated (and hence returned twice in any matching search).
There isn’t too much documentation on this library yet, at least nothing about this specifically. If anybody knows how, please drop me an email or submit a comment with this article.
So at this point, the way to keep the index updated is to rebuild it from scratch when a document is updated.
Extending Zend_Search_Lucene
There are several aspects of Zend_Search_Lucene that can be extended, allowing a fully customized search solution. These include:
- A custom tokenizer for determining keywords in a document
- Custom scoring algorithms to determine how well a document matches a search query
- A custom storage method, to your index is stored however and wherever you please
A custom tokenizer
There are many reasons why a custom tokenizer can be useful. Here are some ideas:
- PDF tokenizer – a tokenizer that can parse a PDF file and find all the keywords
- Image tokenizer – a tokenizer that can perform Optical Character Recognition (OCR), thereby allowing you to index words in an image (and you could store the image also, using the Binary field type)
- HTML tokenizer – a tokenizer than can read HTML data, thereby knowing not to index HTML keywords but only the actual content. You could make further improvements on this also, such as finding all headings and treating them with higher preference to the rest of the content.
More information on this can be found at
http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.analysis.
Custom scoring algorithms
Using a custom scoring algorithm, you can determine how favourably different fields in a document are looked upon. For example, you might want to treat matches in the ‘title’ field (if you have one) much more favourably than matches in the ‘contents’ field.
More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.scoring.
Custom storage method
You can change how indexes are written to disk by extending the Zend_Search_Lucene_Storage_Directory and Zend_Search_Lucene_Storage_File classes.
It may or may not be possible to change this to store all indexed data in a database, but I haven’t actually tried this so I’m not sure.
More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.storage.
Conclusion
In this article we looked closely at Zend_Search_Lucene, part of the new Zend Framework. While it is still in development, it works sufficiently well to implement fulltext indexing and searching on any PHP 5 website.
We also looked briefly at some ways of extending the search capabilities.
Hopefully this has given you some ideas for your own website. Feel free to add comments and ideas to this article.
Creating a fulltext search engine in PHP 5 with the Zend Framework's Zend_Search_Lucene --http://www.phpriot.com/d/articles/php/search/zend-search-lucene/index.html
By Quentin Zervaas, 27 April 2006
This article covers the implementation of a fulltext search engine using PHP 5 and the Zend Framework. We will be using the Zend_Search_Lucene component to create and search our fulltext index.
There are several other libraries we could use instead of this one, but Zend_Search_Lucene is completely native to PHP, whereas others such as Xapian or Tsearch2 rely on third party modules (for instance, the Tsearch2 module must be compiled into your PostgreSQL installation).
It must be noted at this point though that we require at least PHP 5 for Zend_Search_Lucene – PHP 4 will not work.
In this article we will be covering the following:
- How to index a document or series of documents
- The different types of fields that can be indexed
- Searching the index
To demonstrate this functionality, we will cover the implementation of a search engine into phpRiot. We previously used the Tsearch2 module but had some problems that we were unable to overcome.
How fulltext indexing and querying works
Just in case you’ve never dealt with the subject before, I will briefly cover how the whole thing works.
The index essentially stores a special copy of all your data, but instead of storing the actual content of your documents, it stores a list of keywords found in each document.
So if I were to create an index of all the documents on phpRiot, this would be done by looping over every document, finding the keywords and then adding those words to the index.
Additionally, since we can’t just list a bunch of keywords to a user when they perform a search, we also store some other data in the index along with the keywords. In the case of phpRiot, we will store the document title, the author, the document URL and a brief summary of the article.
Alternatively, we could just store the document ID and then look up those other details from the database using the ID. However, it is quicker to store them in the index (fewer queries), and since the title and author are only a few words for each document, there is little overhead in duplicating this data by storing it in our index.
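To make the idea concrete, here is a minimal sketch of a keyword index in plain PHP. It is a toy illustration only and not how Zend_Search_Lucene stores its data; the function names and the sample documents are made up for this example.

```php
<?php
// Toy illustration only: this is NOT how Zend_Search_Lucene stores its data.

/**
 * Split text into unique lowercase keywords (letters and digits only).
 */
function extractKeywords($text)
{
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique($words));
}

/**
 * Build a keyword => list of document ids map, plus the small
 * display fields (title, author) stored for each document.
 */
function buildToyIndex(array $documents)
{
    $index = array('keywords' => array(), 'stored' => array());
    foreach ($documents as $id => $doc) {
        foreach (extractKeywords($doc['body']) as $word) {
            $index['keywords'][$word][] = $id;
        }
        // keep only the small fields we want to show in search results
        $index['stored'][$id] = array(
            'title'  => $doc['title'],
            'author' => $doc['author'],
        );
    }
    return $index;
}

// hypothetical sample documents
$index = buildToyIndex(array(
    1 => array('title' => 'Intro to PHP', 'author' => 'Quentin', 'body' => 'PHP is a scripting language'),
    2 => array('title' => 'MySQL tips',   'author' => 'Quentin', 'body' => 'Using MySQL from PHP'),
));

// the ids of all documents containing the keyword "php"
print_r($index['keywords']['php']);
```

Note that the body text itself is not kept, only the keywords pointing back to document IDs, while the title and author are stored in full for display.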
Querying the data
Once we have created the index, we can then allow people to search it. When somebody types in a search term, we look for those terms in our index, which point back to zero or more documents. Zend_Search_Lucene will also score the results for us, meaning the most relevant matches are returned first.
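As a rough sketch of that lookup, the following plain PHP function searches a hand-written keyword map and orders the matches by a very naive score (the number of query terms each document contains). The map and the function name are hypothetical, and the real scoring in Zend_Search_Lucene is more involved than this.

```php
<?php
// Toy illustration only; Zend_Search_Lucene's real scoring is more involved.

/**
 * Return document id => score for documents matching any of the search
 * terms, ordered so that documents matching more terms come first.
 *
 * $keywordMap is a hypothetical keyword => array of document ids map.
 */
function toySearch(array $keywordMap, $queryString)
{
    $terms = preg_split('/\s+/', strtolower(trim($queryString)), -1, PREG_SPLIT_NO_EMPTY);
    $scores = array();
    foreach ($terms as $term) {
        if (!isset($keywordMap[$term])) {
            continue; // this term points back to zero documents
        }
        foreach ($keywordMap[$term] as $docId) {
            $scores[$docId] = isset($scores[$docId]) ? $scores[$docId] + 1 : 1;
        }
    }
    arsort($scores); // highest score first, keys (document ids) preserved
    return $scores;
}

$keywordMap = array(
    'php'   => array(1, 2),
    'mysql' => array(2),
    'ajax'  => array(3),
);

print_r(toySearch($keywordMap, 'php mysql'));
```

Document 2 contains both terms, so it is listed ahead of document 1, which only contains one of them.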
Keeping the index up-to-date
If a document is ever updated in the database, this likely means the index will be out-of-date. This means we have to make an extra consideration in our general content management. That is, we need to re-index that document.
There are several ways of approaching this. For instance, you could update it in real-time when the document is updated, or you could run an index update at a certain time every day. Each method has its own advantages. For something like phpRiot, updating the index in real time is probably the best way, as the data is not updated very frequently. On the other hand, if you were indexing user submitted comments, it could create a huge load to recreate the index for every comment, as there could be hundreds of comments per day.
Getting started
The first thing we must do is install the Zend Framework if you have not already done so. It is structured in a similar way to how the Pear file structure is organised. At this stage, the Zend Framework is only in a “preview” phase. At time of writing the current version was Preview 0.1.3.
You can download this from http://framework.zend.com/download. If you use Subversion, you can also check out the trunk version, which may have newer code in it.
I’m not exactly sure where the developers intended the framework to be stored, but like Pear is stored in /usr/local/lib/php, I chose to store it in /usr/local/lib/zend.
Highlight: Plain
$ cd /usr/local/src
$ wget -O ZendFramework-0.1.3.tar.gz http://framework.zend.com/download/tgz
$ tar -zxf ZendFramework-0.1.3.tar.gz
$ mv ZendFramework-0.1.3/library /usr/local/lib/zend
So now all that is required is that we add /usr/local/lib/zend to our include path. For instance, my include path directive in httpd.conf for phpRiot looks something like:
Highlight: Plain
php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php
Which now becomes:
Highlight: Plain
php_value include_path .:/var/www/phpriot.com/include:/usr/local/lib/php:/usr/local/lib/zend
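If you cannot edit httpd.conf, the include path can also be extended from within a script. The following is a small sketch using PHP's own set_include_path() function; the directory is simply the install location assumed above, so adjust it to suit your server.

```php
<?php
// Assumed install location from the steps above; adjust to suit your server.
$zendPath = '/usr/local/lib/zend';

// Append the framework directory to the existing include path,
// unless it is already listed.
$paths = explode(PATH_SEPARATOR, get_include_path());
if (!in_array($zendPath, $paths)) {
    set_include_path(get_include_path() . PATH_SEPARATOR . $zendPath);
}
```

This would need to run before the first require_once of a Zend Framework file.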
Creating our first index
The basic process for creating an index is:
- Open the index
- Add each document
- Commit (save) the index
The index is stored on the filesystem, so when you open an index, you specify the path where it is to be stored. This is achieved by instantiating the Zend_Search_Lucene class.
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = new Zend_Search_Lucene($indexPath, true);
?>
You’ll also notice the second parameter in the constructor call. This means the index is created from scratch. If you set this to false (or omit the argument), an existing index is opened. This is done when updating or querying the index. Since we’re creating the index at this stage, we include that parameter.
Adding a document to our index
Now that we have opened our index, we must add our documents to it. We create a document to be added to the index using the following:
Highlight: PHP
<?php
$doc = new Zend_Search_Lucene_Document();
?>
The next thing we must do is determine which fields we need to add to our index.
There are several different field types that can be stored for a single document. As we mentioned earlier, we want to store the keywords for our document, and also store some extra data that we can use when fetching the data. If we wanted to, we could duplicate the entire document content in our search index, however, this would be overkill and is unnecessary. It’s not as bad to duplicate smaller items such as the author name or document title as it’s only a small amount of data.
As you can tell from the previous paragraph, we’re starting to determine which fields we want to store in our index:
- Document content – but we don’t want to duplicate the entire article, as we’re only going to show users a preview of the article in the search results
- Title – we’re definitely going to include the title in our results
- Article teaser – even though we’re not storing the content, we do want to store a short summary of the document to display in our search results
- Author – we will list the author of each document in our search results, plus we want to give users the ability to search by author
- Created – We’ll also store a timestamp of when the article was created.
This brings us to the different types of fields we can use for indexing in Zend_Search_Lucene:
- UnIndexed – Data that isn’t available for searching, but is stored with our document (article teaser, article URL and timestamp of creation)
- UnStored – Data that is available for search, but isn’t stored in the index in full (the document content)
- Text – Data that is available for search and is stored in full (title and author)
There are also Keyword and Binary field types available, but we won’t be using them in this example. Binary data isn’t searchable, but could be used to store an image with an indexed document.
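One way to think about the three field types we are using is as the answer to two yes/no questions: should the data be searchable, and should it be stored in full? The helper below is only a sketch of that decision (the function name is made up and it is not part of Zend_Search_Lucene); it returns the name of the matching field type.

```php
<?php
/**
 * Pick a Zend_Search_Lucene_Field type name from two yes/no questions.
 * Keyword and Binary are left out, as we are not using them here.
 */
function chooseFieldType($searchable, $stored)
{
    if ($searchable && $stored) {
        return 'Text';       // e.g. title, author
    }
    if ($searchable) {
        return 'UnStored';   // e.g. the document content
    }
    if ($stored) {
        return 'UnIndexed';  // e.g. teaser, URL, creation timestamp
    }
    return null;             // neither searchable nor stored: nothing to add
}

echo chooseFieldType(true, false), "\n";  // type for the document content
echo chooseFieldType(false, true), "\n";  // type for the article teaser
```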
To add a field to our indexed document, we use the addField() method. When we call this method, we must pass data that has been prepared according to the necessary field type. This is achieved using the Zend_Search_Lucene_Field class with the field type as the static method name.
In other words, to create the title field data, we use:
Highlight: PHP
<?php
$data = Zend_Search_Lucene_Field::Text('title', $docTitle);
?>
Note that we specify the name of the field here. We can later reference this data by this field name, but we will cover this in the section about querying the index.
So to add all the data with the field types we just worked out, we would use this:
Highlight: PHP
<?php
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $docUrl));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('created', $docCreated));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $docTeaser));
$doc->addField(Zend_Search_Lucene_Field::Text('title', $docTitle));
$doc->addField(Zend_Search_Lucene_Field::Text('author', $docAuthor));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docBody));
?>
Important note: We added the main search content with a field name of contents. This is a special name to Zend_Search_Lucene, which means by default, all queries will search this field. If users instead want to search the title field for ‘foo’, their search query would look like title:foo. This will be covered further in the section about querying the index.
Finally, we add the document to the index using addDocument():
Highlight: PHP
<?php
$index->addDocument($doc);
?>
We will cover this code in full in the next chapter (including looping over several documents and adding them all at once).
Committing / saving the index
Once all documents have been added, the index must be saved.
Highlight: PHP
<?php
$index->commit();
?>
You can call commit after adding each document, but this operation is expensive and each call results in a new index segment (in layman’s terms, this means it’s an extra file to search when querying the index). Note that better management of the segments in the index is a planned future improvement, but for the time being it’s best to be careful and call it once after you’ve done what you need to do with the index.
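The "add everything, then commit once" pattern can be sketched as follows. So that the example runs without the Zend Framework installed, it uses a hypothetical FakeIndex class that merely counts calls; with the real library you would pass in your Zend_Search_Lucene object instead.

```php
<?php
/**
 * Hypothetical stand-in for an index object, used only so this sketch can
 * run without the Zend Framework. It counts calls instead of writing files.
 */
class FakeIndex
{
    public $added = 0;
    public $commits = 0;

    public function addDocument($doc)
    {
        $this->added++;
    }

    public function commit()
    {
        $this->commits++;
    }
}

/**
 * Add every document first, then commit a single time at the end.
 */
function indexAll($index, array $docs)
{
    foreach ($docs as $doc) {
        $index->addDocument($doc);
    }
    $index->commit();
}

$index = new FakeIndex();
indexAll($index, array('doc one', 'doc two', 'doc three'));
echo $index->added . ' documents added, ' . $index->commits . " commit\n";
```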
If you were to look at your filesystem now that you have committed the index, you would notice a directory at the location you specified when opening the index. This directory contains a number of files which are then used when querying the index.
Next up we will piece all the above code into a single script which indexes all the articles on phpRiot.
Indexing all the articles on phpRiot
Now that you’ve seen how to create a basic index, we will extend this script slightly so it can index all the documents in phpRiot.
Additionally, we will be extending the base Zend_Search_Lucene_Document class to simplify our code slightly. This will also demonstrate ways you can take advantage of the OOP style of programming that Zend_Search_Lucene uses.
Since we are demonstrating the indexing of phpRiot articles, we will also use the DatabaseObject class here to fetch article data, just as phpRiot does. You don’t really need to know how this class works to understand this example (as it is fairly self-explanatory in the function calls), but if you are interested, you can read our article Managing your data with DatabaseObject.
Extending Zend_Search_Lucene_Document
On the previous page, after we opened the index, we created a new instance of Zend_Search_Lucene_Document to hold the index data for a single document. Instead of calling this class directly, we’re going to extend this class to encapsulate all of the adding of data we also did.
In other words, we’re going to move the calls to addField into our class, rather than calling it for each field after we create our Zend_Search_Lucene_Document item.
Highlight: PHP
<?php
class PhpRiotIndexedDocument extends Zend_Search_Lucene_Document
{
    /**
     * Constructor. Creates our indexable document and adds all
     * necessary fields to it using the passed in DatabaseObject
     * holding the article data.
     */
    public function __construct(&$document)
    {
        $this->addField(Zend_Search_Lucene_Field::UnIndexed('url', $document->generateUrl()));
        $this->addField(Zend_Search_Lucene_Field::UnIndexed('created', $document->getProperty('created')));
        $this->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $document->getProperty('teaser')));
        $this->addField(Zend_Search_Lucene_Field::Text('title', $document->getProperty('title')));
        $this->addField(Zend_Search_Lucene_Field::Text('author', $document->getProperty('author')));
        $this->addField(Zend_Search_Lucene_Field::UnStored('contents', $document->getProperty('body')));
    }
}
?>
As you can see, we’ve made a simple wrapper class which fetches data from the passed-in document (which uses DatabaseObject). The generateUrl() function is just a special internal method which determines a document’s full URL. We are storing this when we build the index so we don’t have to generate it each time a search is run (especially since this will never change, and if it does we can just rebuild the index).
Building the full index
Now that we have our class, we can create our index, loop over the documents, and then save our index:
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
require_once('DatabaseObject/PhpriotDocument.class.php');
// where to save our index
$indexPath = '/var/www/phpriot.com/data/docindex';
// get a list of all database ids for documents
$doc_ids = PhpriotDocument::GetDocIds($db);
// create our index
$index = new Zend_Search_Lucene($indexPath, true);
foreach ($doc_ids as $doc_id) {
    // load our databaseobject
    $document = new PhpriotDocument($db);
    $document->loadRecord($doc_id);
    // create our indexed document and add it to the index
    $index->addDocument(new PhpRiotIndexedDocument($document));
}
// write the index to disk
$index->commit();
?>
The index has now been created! This can take some time if you have many documents or if each document has a large amount of content. We generate this by using PHP on the command line, which allows us to see its progress in real-time if we need to (we also output the title and a status message as each document is indexed).
How querying works in Zend_Search_Lucene
Now comes the most important part—actually finding stuff! There are quite a lot of options when it comes to querying your index, allowing you and your users to have a lot of control in the returned results.
When we created the index, we added six fields, but only three of those were actually searchable: document title, document content, and the author.
Each of these items is stored separately for each indexed document, meaning you can search on them separately. The syntax used to search each section is somewhat like Google’s, in that you specify the field, followed by a colon, followed by the term (with no spaces).
So to search the author field for ‘Quentin’, the search query would be author:Quentin. (Note that the search is case-insensitive. To make it case-sensitive we would need to change some options when creating our index. For full details on this, please read the Zend_Search_Lucene manual section on Extensibility.)
Likewise, to search in the title field for ‘php’, we would use title:php.
As we briefly mentioned earlier in this article, the default field that is searched is the one called contents. So if you wanted to search the document body for the word ‘google’, you could use contents:google or just google.
Including and excluding terms
By default, all specified terms are searched on using a boolean ‘or’. This means that any of the terms can exist for a document to be returned. To force results to have a particular term, the plus symbol is used. To force results to not have a particular term, the minus symbol is used. If you’re searching in a different field, you can put the plus or minus either before the field name or before the term. In other words, +author:Quentin and author:+Quentin are identical.
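To illustrate these rules, here is a toy parser in plain PHP that splits a simple query into terms, each with a field name and a sign. It is only a sketch with a made-up function name: it ignores phrases and the other syntax that the real Zend_Search_Lucene query parser understands, and it silently skips empty terms where the real parser would report an error.

```php
<?php
/**
 * Split a simple query into terms with a field name and a sign.
 * sign: true = required (+), false = prohibited (-), null = optional.
 */
function parseSimpleQuery($queryString, $defaultField = 'contents')
{
    $terms = array();
    $parts = preg_split('/\s+/', trim($queryString), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($parts as $part) {
        $sign  = null;
        $field = $defaultField;

        // the sign may come before the field name...
        if ($part[0] == '+' || $part[0] == '-') {
            $sign = ($part[0] == '+');
            $part = (string) substr($part, 1);
        }

        // split "field:term" into its two halves
        $pos = strpos($part, ':');
        if ($pos !== false) {
            $field = substr($part, 0, $pos);
            $part  = (string) substr($part, $pos + 1);
        }

        // ...or directly before the term
        if ($part !== '' && ($part[0] == '+' || $part[0] == '-')) {
            $sign = ($part[0] == '+');
            $part = (string) substr($part, 1);
        }

        if ($part === '') {
            continue; // nothing left to search for
        }
        $terms[] = array('field' => $field, 'term' => $part, 'sign' => $sign);
    }
    return $terms;
}

print_r(parseSimpleQuery('php -author:quentin'));
```

Both +author:Quentin and author:+Quentin come out of this function as the same required term on the author field.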
Searching for phrases
It is possible to search for exact phrases with Zend_Search_Lucene, so if you wanted to search for the exact phrase “PHP Articles” you could. Because this is somewhat complicated to achieve, we will not be including this in our examples or implementation. However, there is a lot of information on this in the Zend_Search_Lucene manual section on query types.
Sample queries
Here are some queries you can pass to Zend_Search_Lucene and their meanings.
Highlight: Plain
php
// search the index for any article with the word php
php -author:quentin
// find any article with the word php not written by me
author:quentin
// find all the articles by me
php -ajax
// find all articles with the word php that don't have the word ajax
title:mysql
// find all articles with MySQL in the title
title:mysql -author:quentin
// find all articles with MySQL in the title not by me
And so on. Hopefully you get the idea.
Scoring of results
All results returned from a search are assigned a score. This is a measure of how well the document matched the search term.
The results are ordered by their score, from highest to lowest.
I’m not exactly sure how the score is calculated or what it represents exactly, but it looks pretty on the search results.
You can customize the scoring algorithm (and hence the ordering of results). Please see the section later in the article on extending Zend_Search_Lucene.
Querying our index
On the previous page we looked at how to write queries to search the index. We learned how to include and exclude terms, and also how to search different fields in our indexed data.
Now we will look at actually pulling documents from our index using that term.
There are essentially two ways to query the index: passing the raw query in and letting Zend_Search_Lucene parse the query (ideal when you’re writing a search engine where you’re not sure what the user will enter), or by manually building up the query with API function calls.
In either case, you use the find() method on the index. The find() method returns a list of matches from your index.
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = new Zend_Search_Lucene($indexPath);
$hits = $index->find('php +author:Quentin');
?>
This sample code searches our index for articles containing ‘php’ that were written by me. Note that when we opened our index, we did not pass the second parameter as we did when we created the index. This is because we are not writing the index, we are querying it.
We could also manually build this same query with function calls like so:
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = new Zend_Search_Lucene($indexPath);
$query = new Zend_Search_Lucene_Search_Query_MultiTerm();
$query->addTerm(new Zend_Search_Lucene_Index_Term('php'), null);
$query->addTerm(new Zend_Search_Lucene_Index_Term('Quentin', 'author'), true);
$hits = $index->find($query);
?>
The second parameter for addTerm determines whether or not a term is required. True means it is required (like putting a plus sign before the term), false means it is prohibited (like putting a minus sign before the term), and null means it is neither required nor prohibited.
The second parameter for Zend_Search_Lucene_Index_Term specifies the field to search. By default this is contents.
On the whole, it is easier to simply allow Zend_Search_Lucene to parse the query.
Dealing with returned results
The results found from your query are returned in an array, meaning you can simply use count() on the array to determine the number of hits.
Each of the stored fields is available as a property of each hit.
So to loop over the results as we indexed them previously (with a title, author and teaser), we would do the following:
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
$query = 'php +author:Quentin';
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = new Zend_Search_Lucene($indexPath);
$hits = $index->find($query);
$numHits = count($hits);
?>
<p>
Found <?=$numHits?> result(s) for query <?=$query?>.
</p>
<?php foreach ($hits as $hit) { ?>
<h3><?=$hit->title?> (score: <?=$hit->score?>)</h3>
<p>
By <?=$hit->author?>
</p>
<p>
<?=$hit->teaser?><br />
<a href="<?=$hit->url?>">Read more...</a>
</p>
<?php } ?>
Here we also used an extra field called ‘score’. As mentioned previously, this is used as an indicator as to how well a document matched the query. Results with the highest score are listed first.
Creating a simple search engine
Using our code above, we can easily transform this into a simple site search engine. All we need to do is add a form and plug in the submitted query. Let’s assume this script is called search.php:
Highlight: PHP
<?php
require_once('Zend/Search/Lucene.php');
?>
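As a rough sketch of how search.php could be put together, the two helper functions below read the submitted term and build the form and result list as HTML. The function names, the form markup and the escaping with htmlspecialchars() are my own assumptions for this illustration. The part that needs the Zend Framework is shown only in the trailing comment, so the helpers themselves run without it.

```php
<?php
/**
 * Read the search term from the request parameters (normally $_GET).
 */
function readQuery(array $params)
{
    return isset($params['query']) ? trim($params['query']) : '';
}

/**
 * Build the search form and result list as an HTML string.
 * Each hit is expected to expose title, author, teaser, url and score,
 * as in the result loop shown earlier.
 */
function renderSearchPage($query, array $hits)
{
    $html  = '<form method="get" action="search.php">';
    $html .= '<input type="text" name="query" value="' . htmlspecialchars($query) . '" />';
    $html .= '<input type="submit" value="Search" /></form>';

    if ($query === '') {
        return $html; // nothing submitted yet, so show only the form
    }

    $html .= '<p>Found ' . count($hits) . ' result(s) for query '
           . htmlspecialchars($query) . '.</p>';
    foreach ($hits as $hit) {
        $html .= '<h3>' . htmlspecialchars($hit->title)
               . ' (score: ' . htmlspecialchars((string) $hit->score) . ')</h3>';
        $html .= '<p>By ' . htmlspecialchars($hit->author) . '</p>';
        $html .= '<p>' . htmlspecialchars($hit->teaser) . '<br />';
        $html .= '<a href="' . htmlspecialchars($hit->url) . '">Read more...</a></p>';
    }
    return $html;
}

// In search.php this might be wired up as follows (assumes the Zend
// Framework is on the include path and the index already exists):
//
//   require_once('Zend/Search/Lucene.php');
//   $query = readQuery($_GET);
//   $hits  = array();
//   if ($query !== '') {
//       $index = new Zend_Search_Lucene('/var/www/phpriot.com/data/docindex');
//       $hits  = $index->find($query);
//   }
//   echo renderSearchPage($query, $hits);
```

Escaping the query and the stored fields before output is a design choice of this sketch; it keeps a search for something like `<b>` from being interpreted as markup.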
Error handling
The one thing we haven’t dealt with yet is errors in the search. For instance, if we were to type in ‘title:’ with no query behind it then an error would occur. We handle this by catching the Zend_Search_Lucene_Exception exception.
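The approach can be sketched like this. To keep the example runnable without the framework, it catches PHP's base Exception class (which Zend_Search_Lucene_Exception descends from) and uses a hypothetical FakeSearchIndex class in place of the real index; in search.php you would catch Zend_Search_Lucene_Exception and pass in your Zend_Search_Lucene object.

```php
<?php
/**
 * Run a search and treat any failure as "no results".
 */
function safeFind($index, $query)
{
    try {
        return $index->find($query);
    } catch (Exception $ex) {
        // $ex->getMessage() holds the reason, should you want to show or log it
        return array();
    }
}

/**
 * Hypothetical stand-in for the real index, used only for this sketch:
 * it rejects a field name that has no term after it.
 */
class FakeSearchIndex
{
    public function find($query)
    {
        if (substr($query, -1) == ':') {
            throw new Exception('Missing term after field name');
        }
        return array('some hit');
    }
}

$index = new FakeSearchIndex();
echo count(safeFind($index, 'title:')), "\n";     // the error becomes zero hits
echo count(safeFind($index, 'title:php')), "\n";
```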
This means now that if an error occurs in the search, we simply assume zero hits were returned, thereby handling the error without indicating to the user that anything went wrong.
Of course, you could also choose to get the error message from the exception and output that instead ($ex->getMessage()).
Keeping the index up-to-date
The other thing we haven’t yet dealt with is what to do when any of our documents are updated. There are several ways to handle this:
- Update just the entry for the updated document straight away
- Rebuild the entire index straight away when a document is updated
- Rebuild the entire index at a certain time each day (or several times per day)
The ideal method really depends on the kind of data you have, how often it is updated, and how important it is for the search index to be completely up-to-date.
To be honest, I haven’t figured out a way to update a single document in the index yet, but I may just be missing something simple. If you open the index (without the second parameter), and then index a document that is already in the index, then it will be duplicated (and hence returned twice in any matching search).
There isn’t too much documentation on this library yet, at least nothing about this specifically. If anybody knows how, please drop me an email or submit a comment with this article.
So at this point, the way to keep the index updated is to rebuild it from scratch when a document is updated.
Extending Zend_Search_Lucene
There are several aspects of Zend_Search_Lucene that can be extended, allowing a fully customized search solution. These include:
- A custom tokenizer for determining keywords in a document
- Custom scoring algorithms to determine how well a document matches a search query
- A custom storage method, so your index is stored however and wherever you please
A custom tokenizer
There are many reasons why a custom tokenizer can be useful. Here are some ideas:
- PDF tokenizer – a tokenizer that can parse a PDF file and find all the keywords
- Image tokenizer – a tokenizer that can perform Optical Character Recognition (OCR), thereby allowing you to index words in an image (and you could store the image also, using the Binary field type)
- HTML tokenizer – a tokenizer that can read HTML data, thereby knowing not to index the HTML markup but only the actual content. You could make further improvements on this also, such as finding all headings and giving them higher preference than the rest of the content.
More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.analysis.
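To give a feel for the HTML idea, here is a minimal sketch in plain PHP that turns a chunk of HTML into a list of lowercase words while ignoring the markup. It does not use the Zend_Search_Lucene analysis classes at all (a real custom tokenizer would need to be plugged into those, as described in the manual), and the function name is made up for this example.

```php
<?php
/**
 * Turn HTML into a list of lowercase words, ignoring the markup itself.
 */
function tokenizeHtml($html)
{
    // add a space before each tag so "<p>one</p><p>two</p>" does not become "onetwo"
    $text = strip_tags(str_replace('<', ' <', $html));
    // turn entities such as &amp; back into plain characters
    $text = html_entity_decode($text);
    // split on anything that is not a letter or digit
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

print_r(tokenizeHtml('<h1>Zend Search</h1><p>Indexing <b>PHP</b> articles</p>'));
```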
Custom scoring algorithms
Using a custom scoring algorithm, you can determine how favourably different fields in a document are looked upon. For example, you might want to treat matches in the ‘title’ field (if you have one) much more favourably than matches in the ‘contents’ field.
More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.scoring.
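The idea of favouring one field over another can be sketched in plain PHP as below. This is not the Zend_Search_Lucene scoring API; the function name and the weights are arbitrary values chosen for the example.

```php
<?php
/**
 * Score one document for one term, counting matches in some fields more
 * heavily than others. $weights maps a field name to its multiplier.
 */
function weightedScore(array $doc, $term, array $weights)
{
    $score = 0;
    $key = strtolower($term);
    foreach ($weights as $field => $weight) {
        if (!isset($doc[$field])) {
            continue;
        }
        $words  = preg_split('/[^a-z0-9]+/', strtolower($doc[$field]), -1, PREG_SPLIT_NO_EMPTY);
        $counts = array_count_values($words);
        if (isset($counts[$key])) {
            $score += $weight * $counts[$key];
        }
    }
    return $score;
}

// a match in the title counts three times as much as one in the contents
$weights = array('title' => 3, 'contents' => 1);

$a = array('title' => 'MySQL tips', 'contents' => 'Some notes on indexes');
$b = array('title' => 'PHP tips',   'contents' => 'Using MySQL and more MySQL');

echo weightedScore($a, 'mysql', $weights), "\n";
echo weightedScore($b, 'mysql', $weights), "\n";
```

With these weights, the single title match in the first document outscores the two matches in the contents of the second.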
Custom storage method
You can change how indexes are written to disk by extending the Zend_Search_Lucene_Storage_Directory and Zend_Search_Lucene_Storage_File classes.
It may or may not be possible to change this to store all indexed data in a database, but I haven’t actually tried this so I’m not sure.
More information on this can be found at http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.storage.
Conclusion
In this article we looked closely at Zend_Search_Lucene, part of the new Zend Framework. While it is still in development, it works sufficiently well to implement fulltext indexing and searching on any PHP 5 website.
We also looked briefly at some ways of extending the search capabilities.
Hopefully this has given you some ideas for your own website. Feel free to add comments and ideas to this article.
GET['query'] : '';
___FCKpd___126
___FCKpd___127
___FCKpd___128
___FCKpd___129
___FCKpd___130
___FCKpd___131
___FCKpd___132
___FCKpd___133
___FCKpd___134
___FCKpd___135
___FCKpd___136
___FCKpd___137
___FCKpd___138
___FCKpd___139
___FCKpd___140
___FCKpd___141
___FCKpd___142
___FCKpd___143
___FCKpd___144
___FCKpd___145
___FCKpd___146
___FCKpd___147
___FCKpd___148
___FCKpd___149
___FCKpd___150
___FCKpd___151
___FCKpd___152
Error handling
The one thing we haven’t dealt with yet are errors in the search. For instance, if we were to type in ‘title:’ with no query behind it then an error would occur. We handle this by catching the
Zend_Search_Lucene_Exception exception.
Highlight: PHP
___FCKpd___153
___FCKpd___154
___FCKpd___155
___FCKpd___156
___FCKpd___157
___FCKpd___158
___FCKpd___159
___FCKpd___160
___FCKpd___161
___FCKpd___162
___FCKpd___163
___FCKpd___164
___FCKpd___165
___FCKpd___166
This means now that if an error occurs in the search, we simply assume zero hits were returned, thereby handling the error without indicating to the user that anything went wrong.
Of course, you could also choose to get the error message from the exception and output that instead ($ex->getMessage()).
Keeping the index up-to-date
The other thing we haven’t yet dealt with is if any of our documents are updated. There are several ways to handle this:
- Update just the entry for the updated document straight away
- Rebuild the entire index when a document is updated straight away
- Rebuild the entire index at a certain time each day (or several times per day)
The ideal method really depends on the kind of data you have, how often it is updated, and how important it is for the search index to be completely up-to-date.
To be honest, I haven’t figured out a way to update a single document in the index yet, but I may just be missing something simple. If you open the index (without the second parameter), and then index a document that is already in the index, then it will be duplicated (and hence returned twice in any matching search).
There isn’t much documentation on this library yet, and nothing that covers this specifically. If anybody knows how, please drop me an email or submit a comment on this article.
So at this point, the way to keep the index updated is to rebuild it from scratch when a document is updated.
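Later releases of Zend_Search_Lucene document a delete-and-re-add approach for updating a single document. The sketch below assumes such a release, so it may not apply to the version discussed in this article. replaceDocument() is a hypothetical helper name. $index is assumed to offer find(), delete(), addDocument() and commit(), and the identifying field is assumed to hold a single untokenized value that the query parser accepts as one term.

```php
<?php
/**
 * Hypothetical helper: replace every indexed copy of a document.
 *
 * Assumes a Zend_Search_Lucene release that provides delete(), and that
 * $field was indexed as a single untokenized value (such as a Keyword field).
 * Returns the number of stale entries that were removed.
 */
function replaceDocument($index, $field, $value, $newDoc)
{
    // Look up the stale entries by their identifying field
    $hits = $index->find($field . ':' . $value);

    $removed = 0;
    foreach ($hits as $hit) {
        // $hit->id is the internal index id, not an application-level id
        $index->delete($hit->id);
        $removed++;
    }

    // Add the fresh copy and flush the changes to the index
    $index->addDocument($newDoc);
    $index->commit();

    return $removed;
}
```

Because the old entries are deleted before the new one is added, the document should no longer be returned twice in a matching search. If your version has no delete() method, rebuilding the index from scratch remains the fallback.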
Extending Zend_Search_Lucene
There are several aspects of Zend_Search_Lucene that can be extended, allowing a fully customized search solution. These include:
- A custom tokenizer for determining keywords in a document
- Custom scoring algorithms to determine how well a document matches a search query
- A custom storage method, so your index is stored however and wherever you please
A custom tokenizer
There are many reasons why a custom tokenizer can be useful. Here are some ideas:
- PDF tokenizer – a tokenizer that can parse a PDF file and find all the keywords
- Image tokenizer – a tokenizer that can perform Optical Character Recognition (OCR), thereby allowing you to index words in an image (and you could store the image also, using the Binary field type)
- HTML tokenizer – a tokenizer that can read HTML data, so that it indexes only the actual content and not the HTML tags. You could improve on this further, for example by finding all headings and giving them higher preference than the rest of the content.
More information on this can be found at
http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.analysis.
Custom scoring algorithms
Using a custom scoring algorithm, you can determine how favourably different fields in a document are looked upon. For example, you might want to treat matches in the ‘title’ field (if you have one) much more favourably than matches in the ‘contents’ field.
More information can be found on this at
http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.scoring.
Custom storage method
You can change how indexes are written to disk by extending the Zend_Search_Lucene_Storage_Directory and Zend_Search_Lucene_Storage_File classes.
It may be possible to change this to store all indexed data in a database, but I haven’t actually tried this so I’m not sure.
More information on this can be found at
http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.storage.