Lucene: Indexing numbers, dates, and times, and field truncation

Although most content is textual, handling numeric or date/time values is often crucial. In a commerce setting, a product’s price, and perhaps other numeric attributes such as weight and height, are clearly important. A video search engine may index the duration of each video. Press releases and articles carry a timestamp.

 

Indexing numbers

There are two common scenarios in which indexing numbers is important.

  • In one scenario, numbers are embedded in the text to be indexed, and you want to make sure those numbers are preserved and indexed as their own tokens so that you can use them later as ordinary tokens in searches. To enable this, simply pick an analyzer that doesn’t discard numbers.
  • In the other scenario, you have a field that contains a single number and you want to index it as a numeric value and then use it for precise (equals) matching, range searching, and/or sorting. To do this, create a NumericField, set its value, and add it to the document like any other field:

doc.add(new NumericField("price").setDoubleValue(19.99));
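
The first scenario above depends on how the chosen analyzer splits text. The following is a minimal plain-Java sketch, not Lucene code: the class and method names are hypothetical, and the two methods only loosely mimic a letter-based tokenizer (which drops digits) and a whitespace-based tokenizer (which keeps them).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not a Lucene analyzer.
final class NumberTokenSketch {

    // Letter-only tokenization: every non-letter character acts as a
    // separator, so numbers embedded in the text are discarded.
    static List<String> letterTokens(String text) {
        return split(text, "[^\\p{L}]+");
    }

    // Whitespace tokenization: only whitespace separates tokens,
    // so numbers survive as tokens of their own.
    static List<String> whitespaceTokens(String text) {
        return split(text, "\\s+");
    }

    // Splits on the given regex and drops empty fragments.
    private static List<String> split(String text, String separatorRegex) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split(separatorRegex)) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Model 42 weighs 7 kg";
        System.out.println(letterTokens(text));     // [Model, weighs, kg]
        System.out.println(whitespaceTokens(text)); // [Model, 42, weighs, 7, kg]
    }
}
```

If a search for "42" must match this text, the analyzer used at indexing time needs to behave like the second method rather than the first.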

 

Indexing dates and times

 

Dates and times are easily handled by first converting them to an equivalent int or long value, and then indexing that value as a number. The simplest approach is to use Date.getTime() to get the equivalent value, in millisecond precision, for a Java Date object:

doc.add(new NumericField("timestamp")
    .setLongValue(new Date().getTime()));

 

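If you don’t need millisecond resolution, quantize the value by straight division before indexing it. For example, to index whole days since the epoch, divide by the number of milliseconds in a day (24 * 3600 * 1000):

doc.add(new NumericField("day")
    .setIntValue((int) (new Date().getTime() / (24L * 3600 * 1000))));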

 

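To index a calendar-based component such as the day of the month, read the field from a Calendar instance (here, date is the java.util.Date being indexed):

Calendar cal = Calendar.getInstance();
cal.setTime(date);
doc.add(new NumericField("dayOfMonth")
    .setIntValue(cal.get(Calendar.DAY_OF_MONTH)));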
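
The conversions above can be checked without Lucene. The sketch below is a plain-Java illustration with hypothetical class and method names; it assumes UTC, so the results do not depend on the machine's default time zone, whereas the snippets above use the default zone.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical helper; it computes only the numeric values that would be indexed.
final class DateQuantizationSketch {

    static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    // Whole days since 1970-01-01 (UTC). floorDiv keeps pre-epoch
    // dates on the correct day instead of rounding toward zero.
    static int toEpochDay(Date date) {
        return (int) Math.floorDiv(date.getTime(), MILLIS_PER_DAY);
    }

    // Day of the month (1-31) as seen in UTC.
    static int dayOfMonthUtc(Date date) {
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        cal.setTime(date);
        return cal.get(Calendar.DAY_OF_MONTH);
    }

    public static void main(String[] args) {
        // 31 days and 1234 ms after the epoch: 1970-02-01T00:00:01.234Z
        Date date = new Date(31 * MILLIS_PER_DAY + 1234);
        System.out.println(toEpochDay(date));    // 31
        System.out.println(dayOfMonthUtc(date)); // 1
    }
}
```

The day count fits comfortably in an int, which is why the quantized value can be indexed with setIntValue rather than setLongValue.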

 

------------------------------------------------------------------------------------------------------------------------------------

Field truncation

Some applications index documents whose sizes aren’t known in advance. As a safety mechanism to control the amount of RAM and hard disk space used, you may want to limit the amount of input they are allowed to index per field. It’s also possible that a large binary document is accidentally misclassified as a text document, or contains embedded binary content that your document filter failed to process, which quickly adds many absurd binary terms to your index, much to your horror. Other applications deal with documents of known size, but you’d like to index only a portion of each. For example, you may want to index only the first 200 words of each document.

 

To support these diverse cases, IndexWriter allows you to truncate per-Field indexing so that only the first N terms are indexed for an analyzed field. When you instantiate IndexWriter, you must pass in a MaxFieldLength instance expressing this limit. MaxFieldLength provides two convenient default instances: MaxFieldLength.UNLIMITED, which means no truncation will take place, and MaxFieldLength.LIMITED, which means fields are truncated at 10,000 terms. You can also instantiate MaxFieldLength with your own limit.
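
To make the effect of such a limit concrete, here is a minimal plain-Java sketch of keeping only the first N terms of a field. It is an illustration under stated assumptions, not how IndexWriter implements truncation: the class and method names are hypothetical, and terms are assumed to be whitespace-separated.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of first-N-terms truncation; not Lucene code.
final class FieldTruncationSketch {

    // Returns at most maxTerms whitespace-separated terms from the start
    // of the text; everything after the limit is ignored.
    static List<String> firstTerms(String text, int maxTerms) {
        List<String> terms = new ArrayList<>();
        for (String term : text.trim().split("\\s+")) {
            if (terms.size() >= maxTerms) {
                break; // limit reached: the remaining input is never indexed
            }
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String body = "alpha beta gamma delta epsilon";
        System.out.println(firstTerms(body, 3)); // [alpha, beta, gamma]
    }
}
```

With a limit of 3, a search for "delta" would find nothing in this field, which is the trade-off to keep in mind when choosing a limit: terms past the cutoff are simply not searchable.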

 
