The basic steps of regain

keensword

Article link: https://blog.csdn.net/keensword/article/details/403217

Technical Details

Technology

regain is based on Jakarta Lucene, a library for creating and searching search indices.

regain itself is 100% pure Java. The only non-Java parts are the plugins that read the Excel, PowerPoint and Word formats. For Excel and Word, however, there are alternatives written in 100% pure Java.

Searching with regain

The work of regain is split into two parts: the creation of the search index and the search on that index.

The following image shows an overview of how regain searches.

The creation of the search index

The crawler searches a website or a directory tree for documents. In the configuration you may specify what exactly should be crawled. From each document the actual text is extracted using so-called preparators. The text is added to the search index.
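The crawl, extract and index steps can be sketched in plain Java. This is only an illustration of the idea, not regain's or Lucene's actual code: `CrawlSketch`, `TextPreparator`, `PlainTextPreparator` and `ToyIndex` are hypothetical names, and the "index" is a simple in-memory map rather than a Lucene index.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class CrawlSketch {

    /** Hypothetical stand-in for a preparator: turns one file into plain text. */
    interface TextPreparator {
        boolean accepts(Path file);
        String extractText(Path file) throws IOException;
    }

    /** Handles *.txt files only; other formats would need their own preparators. */
    static final class PlainTextPreparator implements TextPreparator {
        public boolean accepts(Path file) {
            return file.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(".txt");
        }

        public String extractText(Path file) throws IOException {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        }
    }

    /** Toy in-memory index: word -> names of the documents that contain it. */
    static final class ToyIndex {
        private final Map<String, Set<String>> postings = new HashMap<String, Set<String>>();

        void addDocument(String name, String text) {
            // Split on anything that is neither a letter nor a digit.
            for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (!word.isEmpty()) {
                    postings.computeIfAbsent(word, k -> new TreeSet<String>()).add(name);
                }
            }
        }

        Set<String> documentsContaining(String word) {
            Set<String> docs = postings.get(word.toLowerCase(Locale.ROOT));
            return docs == null ? new TreeSet<String>() : docs;
        }
    }

    /** Walks the directory tree and indexes every file the preparator accepts. */
    static ToyIndex crawl(Path root, TextPreparator preparator) {
        ToyIndex index = new ToyIndex();
        try (Stream<Path> paths = Files.walk(root)) {
            List<Path> files = paths.filter(Files::isRegularFile)
                                    .filter(preparator::accepts)
                                    .collect(Collectors.toList());
            for (Path file : files) {
                index.addDocument(file.toString(), preparator.extractText(file));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return index;
    }
}
```

Calling `CrawlSketch.crawl(Paths.get("docs"), new CrawlSketch.PlainTextPreparator())` would index every `.txt` file below a (hypothetical) `docs` directory. Keeping text extraction behind an interface mirrors the split described above: the crawler decides which documents to visit, while each preparator only knows how to read one format.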

The search on the search index

After you've created a search index, you are able to perform searches. The search index is built in such a way that searching is very fast.

This is the whole trick behind search engines: the time needed for a full-text search is moved from the actual search (where a user waits for the results) to the index creation (which runs automatically in the background), by means of a cleverly built search index.
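To make the trade-off concrete, here is a minimal sketch that contrasts a full-text scan at query time with a lookup in a prebuilt index. It assumes a simple word-to-documents map as the index; `LookupSketch` and its methods are hypothetical and far simpler than what Lucene does internally.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class LookupSketch {

    private static String[] words(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    }

    /** Naive search: every query re-reads the full text of every document. */
    static Set<String> scan(Map<String, String> documents, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        Set<String> hits = new TreeSet<String>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            if (Arrays.asList(words(doc.getValue())).contains(wanted)) {
                hits.add(doc.getKey());
            }
        }
        return hits;
    }

    /** Done once, ahead of time: map each word to the documents containing it. */
    static Map<String, Set<String>> buildIndex(Map<String, String> documents) {
        Map<String, Set<String>> index = new HashMap<String, Set<String>>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            for (String word : words(doc.getValue())) {
                if (!word.isEmpty()) {
                    index.computeIfAbsent(word, k -> new TreeSet<String>()).add(doc.getKey());
                }
            }
        }
        return index;
    }

    /** Indexed search: one map lookup per term, then intersect (AND semantics). */
    static Set<String> search(Map<String, Set<String>> index, String... terms) {
        Set<String> hits = null;
        for (String term : terms) {
            Set<String> docs = index.get(term.toLowerCase(Locale.ROOT));
            if (docs == null) {
                return new TreeSet<String>(); // one term matches nothing
            }
            if (hits == null) {
                hits = new TreeSet<String>(docs);
            } else {
                hits.retainAll(docs);
            }
        }
        return hits == null ? new TreeSet<String>() : hits;
    }
}
```

`scan` and `search` return the same documents for a single term, but `scan` touches every word of every document on each query, while `search` only touches the entries for the queried terms. The cost of reading the documents is paid once, in `buildIndex`.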

Rating the search results

The search results are rated by the relative frequency of the search terms in the document. If a search term appears very often in a document, that document will appear closer to the top. The length of the document is taken into account as well: a document with 100 words that contains a search term 5 times will be rated as a better hit than a document with 1000 words that contains the search term 10 times.
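The example above can be reproduced with a very small calculation. The sketch below assumes the rating is simply the number of occurrences divided by the document length in words; the real scoring done by Lucene is more involved, and `RatingSketch` is a hypothetical name used only for illustration.

```java
import java.util.Locale;

final class RatingSketch {

    /** Occurrences of the term divided by the document length in words. */
    static double relativeFrequency(int occurrences, int totalWords) {
        return totalWords == 0 ? 0.0 : (double) occurrences / totalWords;
    }

    /** Counts the term in the text and normalizes by the number of words. */
    static double score(String text, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        int total = 0;
        int occurrences = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (word.isEmpty()) {
                continue; // split() can yield an empty leading element
            }
            total++;
            if (word.equals(wanted)) {
                occurrences++;
            }
        }
        return relativeFrequency(occurrences, total);
    }
}
```

With these definitions, `relativeFrequency(5, 100)` is 0.05 and `relativeFrequency(10, 1000)` is 0.01, so the shorter document ranks higher even though it contains the term fewer times in absolute numbers.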