Technical Details
Technology
regain is based on Jakarta Lucene, a library for creating and searching search indices.
regain itself is 100% pure Java. The only non-Java parts are the plugins that read the Excel, PowerPoint and Word formats. For Excel and Word, however, pure-Java alternatives are available.
Searching with regain
The work of regain is split into two parts: creating the search index and searching that index.
The following image gives an overview of how regain searches.
The creation of the search index
The crawler searches a website or a directory tree for documents. In the configuration you may specify what exactly should be crawled. From each document the actual text is extracted using so-called preparators. The text is added to the search index.
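The steps just described (crawl, extract text, hand the text to the index) can be sketched roughly as follows. This is only an illustration under stated assumptions: the Preparator interface, the PlainTextPreparator class and the crawl method are hypothetical names made up for this example and are not regain's actual API. The sketch also covers only a directory tree, not a website.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class CrawlSketch {

    /** Hypothetical stand-in for a preparator: turns one document into plain text. */
    interface Preparator {
        boolean accepts(Path file);
        String extractText(Path file) throws IOException;
    }

    /** Handles plain-text files only; a real preparator would parse Word, Excel and so on. */
    static class PlainTextPreparator implements Preparator {
        @Override
        public boolean accepts(Path file) {
            return file.getFileName().toString().endsWith(".txt");
        }

        @Override
        public String extractText(Path file) throws IOException {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        }
    }

    /** Walks the directory tree and returns "relative path -> extracted text" for every supported file. */
    static Map<String, String> crawl(Path root, List<Preparator> preparators) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        Map<String, String> extracted = new TreeMap<>();
        for (Path file : files) {
            for (Preparator preparator : preparators) {
                if (preparator.accepts(file)) {
                    // In a real system this text would now be added to the search index.
                    extracted.put(root.relativize(file).toString(), preparator.extractText(file));
                    break; // the first matching preparator wins
                }
            }
        }
        return extracted;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        List<Preparator> preparators = Arrays.<Preparator>asList(new PlainTextPreparator());
        crawl(root, preparators).forEach((name, text) ->
                System.out.println(name + ": " + text.length() + " characters extracted"));
    }
}
```

Files for which no preparator feels responsible are simply skipped, which matches the idea that each supported format needs its own preparator.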
The search on the search index
After you've created a search index, you are able to perform searches. The search index is built in such a way that searching is very fast.
This is the whole trick behind search engines: by using a cleverly built search index, the time needed for a full-text search is moved from the actual search (where a user waits for the results) to the index creation (which runs automatically in the background).
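To make the trade-off concrete, here is a minimal sketch of an inverted index, the general kind of structure that libraries such as Lucene build. It is not regain's or Lucene's code; the class TinyInvertedIndex and its methods are invented for this illustration. All the expensive work (splitting every document into words) happens in add, so search only has to do a single map lookup.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TinyInvertedIndex {

    // word -> names of all documents that contain the word
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Index creation: the slow part, done once per document in the background. */
    void add(String docName, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docName);
            }
        }
    }

    /** Search: the fast part, a single lookup that never touches the document text. */
    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.<String>emptySet());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("a.txt", "Lucene builds a search index");
        index.add("b.txt", "The crawler feeds the index");
        System.out.println(index.search("index"));
        System.out.println(index.search("lucene"));
    }
}
```

A real search index stores far more than this (word positions and frequencies, for example), but the division of labour is the same.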
Rating the search results
The search results are rated by the relative frequency of the search terms in the document. The more often a search term appears in a document, the higher that document ranks. The length of the document is taken into account as well: a document with 100 words that contains a search term 5 times is rated as a better hit than a document with 1000 words that contains the search term 10 times.
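The arithmetic behind this example is 5/100 = 0.05 for the short document versus 10/1000 = 0.01 for the long one. The following sketch computes exactly that ratio. It only mirrors the description above: Lucene's real scoring takes more factors into account, and the class RelativeFrequency with its methods score and buildDocument is made up for this illustration.

```java
import java.util.Arrays;
import java.util.Locale;

class RelativeFrequency {

    /** Fraction of the words in the text that equal the search term (case-insensitive). */
    static double score(String text, String term) {
        String[] words = text.toLowerCase(Locale.ROOT).split("\\W+");
        long total = Arrays.stream(words).filter(w -> !w.isEmpty()).count();
        if (total == 0) {
            return 0.0; // an empty document never matches
        }
        String needle = term.toLowerCase(Locale.ROOT);
        long hits = Arrays.stream(words).filter(needle::equals).count();
        return (double) hits / total;
    }

    /** Builds an artificial document with the given number of words, termCount of which are the term. */
    static String buildDocument(int totalWords, int termCount, String term) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < totalWords; i++) {
            sb.append(i < termCount ? term : "filler").append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String shortDoc = buildDocument(100, 5, "lucene");
        String longDoc = buildDocument(1000, 10, "lucene");
        System.out.println("short document: " + score(shortDoc, "lucene"));
        System.out.println("long document:  " + score(longDoc, "lucene"));
    }
}
```

Although the long document contains the term twice as often in absolute numbers, its relative frequency is five times lower, so the short document is the better hit.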