Managing resource consumption: Lucene In Action 2 chapter 10

10.3 Managing resource consumption

Like all software, Lucene requires certain precious resources to get its job done. A computer has a limited supply
of things like disk storage, file descriptors and memory. Often Lucene must share these resources with other
applications. Understanding how Lucene uses resources and what you can do to control this lets you keep your
search application healthy. You might assume Lucene's disk usage is simply proportional to the total size of all
documents you've added, but you'll be surprised to see that often, this is far from the truth. Similarly, Lucene's
usage of simultaneous open file descriptors is unexpected: changes to a few Lucene configuration options can
drastically change the number of open files. Finally, to manage Lucene's memory consumption, you'll see why it's
not always best to give Lucene access to all memory on the computer.
We start with everyone's favorite: how much disk space does Lucene require? Next we describe Lucene’s open
file descriptor usage, and finally, memory usage.

10.3.1 Disk space

Lucene's disk usage depends on many factors. An index with a single typical text field that is only indexed (not
stored, no term vectors) will be about 1/3 of the total size of the original text. But at the other extreme, an index
that has stored fields and term vectors with offsets and positions, with numerous deleted documents plus an open
reader on the index, with an optimize running, can easily consume 10X the total size of the original text! This wide
range and seeming unpredictability make it exciting to manage disk usage for a Lucene index.


Figure 10.3 shows the disk usage over time while indexing all documents from Wikipedia, finishing with an
optimize call. The final disk usage was 14.2 GB, but the peak disk usage was 32.4 GB, which was reached while
several large concurrent merges were running. You can immediately see how erratic it is. Rather than increasing
gradually with time, as you add documents to the index, disk usage will suddenly ramp up during a merge and
then quickly fall back again once the merge has finished, creating a saw tooth pattern. The size of this jump
corresponds to how large the merge was (the net size of all segments being merged). Furthermore, with
ConcurrentMergeScheduler, several large merges can be running at once and this will cause an even larger
increase of temporary disk usage.


How can you manage disk space when it has such wild swings? Fortunately, there is a method to this
madness. Once you understand what's happening under the hood, you can predict and understand Lucene's disk
usage. It's important to distinguish transient disk usage, while the index is being built (Figure 10.3), from final
disk usage, when the index is completely built and optimized down to one segment. It's easiest to start with the
final size. Here is a coarse formula to estimate the final size based on the size of all text from the documents:
    1/3 x indexed fields +
    1 x stored fields +
    1 x term vectors fields (2 x if offsets & positions are stored)
For example, if your documents have a single field that is indexed, stored and has term vectors turned on,
you should expect the index size to be around 2 1/3 x the total size of all text across all docs. Note that this
formula is very approximate. For example, documents with unusually diverse or unique terms, like a large
spreadsheet that contains many unique product SKUs, will use more disk space.
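The coarse formula above can be turned into a few lines of arithmetic. The following is only a sketch: the class name IndexSizeEstimator and its method are hypothetical helpers, not part of Lucene, and the result inherits all the imprecision of the formula itself.

```java
/** Hypothetical helper (not a Lucene class) applying the coarse final-size formula. */
final class IndexSizeEstimator {

    /**
     * @param indexedBytes        total text size of all indexed fields
     * @param storedBytes         total text size of all stored fields
     * @param termVectorBytes     total text size of all fields with term vectors
     * @param offsetsAndPositions true if term vectors also store offsets and positions
     * @return rough estimate, in bytes, of the fully optimized index
     */
    static long estimateFinalBytes(long indexedBytes, long storedBytes,
                                   long termVectorBytes, boolean offsetsAndPositions) {
        long indexed = indexedBytes / 3;                        // 1/3 x indexed fields
        long stored = storedBytes;                              // 1 x stored fields
        long vectors = offsetsAndPositions
                ? 2 * termVectorBytes                           // 2 x with offsets & positions
                : termVectorBytes;                              // 1 x term vectors fields
        return indexed + stored + vectors;
    }

    public static void main(String[] args) {
        // Illustrative input: 3 GB of text in one field that is indexed,
        // stored and has term vectors (without offsets and positions).
        long text = 3_000_000_000L;
        long estimate = estimateFinalBytes(text, text, text, false);
        System.out.println("Estimated final index size: " + estimate + " bytes");
    }
}
```

For the single-field case in the text, the estimate works out to 2 1/3 x the input size, matching the example above.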
You can reduce disk usage somewhat by turning off norms (section 2.XXX), turning off term frequency
information for fields that don’t need it (section 2.XXX) and indexing and storing fewer fields per document.

 

 

The transient disk usage depends on many factors. As the index gets larger, each sawtooth gets larger because
bigger merges are being done. Large merges also take longer to complete and will therefore tie up disk
space for more time. When you optimize the index down to one segment, the final merge is the largest merge
possible and will require 1X the final size of your index in temporary disk space.
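One practical consequence is that it can be worth checking free disk space before calling optimize. The sketch below uses only java.io.File; the class OptimizeHeadroomCheck is a hypothetical helper, not a Lucene API. It assumes the current on-disk size of the index directory is a reasonable stand-in for the final index size, which is not known until the optimize has finished, and it ignores any other merges or open readers that may hold extra space.

```java
import java.io.File;

/** Hypothetical pre-optimize check (not a Lucene class). */
final class OptimizeHeadroomCheck {

    /** Sums the sizes of the regular files directly inside the index directory. */
    static long indexSizeBytes(File indexDir) {
        File[] files = indexDir.listFiles();
        if (files == null) {
            return 0L; // path is not a directory, or it could not be read
        }
        long total = 0L;
        for (File f : files) {
            if (f.isFile()) {
                total += f.length();
            }
        }
        return total;
    }

    /**
     * Assumption: the final merge needs temporary space of about 1X the index,
     * approximated here by the current index size.
     */
    static boolean hasHeadroom(long currentIndexBytes, long usableBytes) {
        return usableBytes >= currentIndexBytes;
    }

    public static void main(String[] args) {
        File indexDir = new File(args.length > 0 ? args[0] : ".");
        long indexBytes = indexSizeBytes(indexDir);
        long usableBytes = indexDir.getUsableSpace(); // free space on that partition
        System.out.println("Index size:   " + indexBytes + " bytes");
        System.out.println("Usable space: " + usableBytes + " bytes");
        System.out.println("Enough room for optimize (rough check): "
                + hasHeadroom(indexBytes, usableBytes));
    }
}
```

Pass the index directory as the first argument. The check is deliberately conservative in one respect only: it does not try to predict how much space deleted documents will free up.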
Here are other things that will affect transient disk usage:
