Basic Steps of regain

Technical Details

Technology

regain is based on Jakarta Lucene, a library for creating and searching search indices.

regain itself is 100% pure Java. The non-Java parts are plugins that read the Excel, PowerPoint and Word formats. For Excel and Word, however, there are alternatives written in 100% pure Java.

Searching with regain

The work of regain is split into two parts: the creation of the search index and the search on the search index.

The following image gives an overview of how regain searches.

The creation of the search index

The crawler searches a website or a directory tree for documents. In the configuration you may specify what exactly should be crawled. From each document the actual text is extracted using so-called preparators. The text is added to the search index.
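The index creation described above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions, not regain's actual code: the `Preparator` interface, the `preparatorFor` helper, and the in-memory map that stands in for the crawled directory tree are all hypothetical names, and the index is a simple map instead of a Lucene index.

```java
import java.util.*;

public class IndexCreationSketch {
    /** Hypothetical stand-in for a preparator: extracts plain text from raw content. */
    interface Preparator {
        String extractText(String rawContent);
    }

    /** Picks a preparator by file extension; returns null for unsupported formats. */
    static Preparator preparatorFor(String fileName) {
        if (fileName.endsWith(".html")) {
            // Naive tag stripping, for illustration only.
            return raw -> raw.replaceAll("<[^>]*>", " ");
        }
        if (fileName.endsWith(".txt")) {
            return raw -> raw;
        }
        return null;
    }

    /** Builds an inverted index: term -> names of the documents containing it. */
    static Map<String, Set<String>> buildIndex(Map<String, String> crawledDocuments) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : crawledDocuments.entrySet()) {
            Preparator preparator = preparatorFor(doc.getKey());
            if (preparator == null) {
                continue; // no preparator for this format, so skip the document
            }
            String text = preparator.extractText(doc.getValue());
            for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, k -> new TreeSet<>()).add(doc.getKey());
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Assumed sample input standing in for the result of a crawl.
        Map<String, String> crawled = new LinkedHashMap<>();
        crawled.put("intro.html", "<p>The crawler finds documents</p>");
        crawled.put("notes.txt", "Preparators extract text from documents");
        crawled.put("logo.png", "binary data");

        Map<String, Set<String>> index = buildIndex(crawled);
        System.out.println(index.get("documents"));      // [intro.html, notes.txt]
        System.out.println(index.containsKey("binary")); // false
    }
}
```

Choosing the preparator by file extension is a simplification made for this sketch; the point is only that each format gets its own text extractor before anything reaches the index.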

The search on the search index

After you've created a search index, you are able to perform searches. The search index is built in such a way that searching is very fast.

And this is already the whole trick of search engines: by using a clever search index, the time needed for a full-text search is moved from the actual search (where a user waits for the results) to the index creation (which runs automatically in the background).
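To make that trade-off concrete, here is a minimal sketch that contrasts a full-text scan at query time with a lookup in a prebuilt index. It assumes a plain in-memory map as the index and already-extracted document text; the class and method names are made up for this example and are not part of regain or Lucene.

```java
import java.util.*;

public class SearchCostSketch {
    /** Splits text into lower-case terms. */
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    /** Search without an index: every document is read again for every query. */
    static Set<String> scanSearch(Map<String, String> documents, String term) {
        Set<String> hits = new TreeSet<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            if (tokenize(doc.getValue()).contains(term)) {
                hits.add(doc.getKey());
            }
        }
        return hits;
    }

    /** Done once, ahead of time: maps each term to the documents containing it. */
    static Map<String, Set<String>> buildIndex(Map<String, String> documents) {
        Map<String, Set<String>> index = new HashMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            for (String term : tokenize(doc.getValue())) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    /** Search with an index: one hash lookup, no document is read at query time. */
    static Set<String> indexSearch(Map<String, Set<String>> index, String term) {
        return index.getOrDefault(term, Collections.emptySet());
    }

    public static void main(String[] args) {
        Map<String, String> documents = new LinkedHashMap<>();
        documents.put("a.txt", "Lucene creates a search index");
        documents.put("b.txt", "The crawler collects documents");
        documents.put("c.txt", "An index makes searching fast");

        Map<String, Set<String>> index = buildIndex(documents); // the expensive step

        System.out.println(scanSearch(documents, "index")); // [a.txt, c.txt]
        System.out.println(indexSearch(index, "index"));    // [a.txt, c.txt]
    }
}
```

Both methods return the same hits; the difference is when the documents are read. `scanSearch` pays that cost on every query, while `indexSearch` relies on work already done by `buildIndex`.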

Rating the search results

The search results are rated by the relative frequency of the search terms in the document. If a search term appears very often in a document, that document will rank closer to the top. The length of a document is taken into account as well: a document with 100 words that contains a search term 5 times will be rated as a better hit than a document with 1000 words that contains the search term 10 times.
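The example with the two documents can be checked with a few lines of Java. This sketch computes only the plain ratio of occurrences to document length described above; the scoring formula Lucene actually applies is more involved, and the `score` method here is a hypothetical helper for illustration.

```java
public class RelativeFrequencySketch {
    /** Relative frequency: occurrences of the term divided by the document length in words. */
    static double score(int termOccurrences, int documentLength) {
        if (documentLength <= 0) {
            return 0.0; // an empty document cannot match anything
        }
        return (double) termOccurrences / documentLength;
    }

    public static void main(String[] args) {
        double shortDoc = score(5, 100);   // 5 hits in 100 words
        double longDoc = score(10, 1000);  // 10 hits in 1000 words

        // The shorter document wins although it has fewer absolute hits.
        System.out.println(shortDoc > longDoc); // true
    }
}
```

With these numbers the short document scores 5/100 = 0.05 and the long one 10/1000 = 0.01, so the short document is the better hit.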
