Repost (for reference): A roundup of tips for improving Lucene indexing speed

ImproveIndexingSpeed

How to make indexing faster

Here are some things to try to speed up the indexing speed of your Lucene application. Please see ImproveSearchingSpeed for how to speed up searching.

·        Be sure you really need to speed things up.

Many of the ideas here are simple to try, but others will necessarily add some complexity to your application. So be sure your indexing speed is indeed too slow and the slowness is indeed within Lucene.

·        Make sure you are using the latest version of Lucene.

·        Use a local filesystem.

Remote filesystems are typically quite a bit slower for indexing. If your index needs to be on the remote filesystem, consider building it first on the local filesystem and then copying it up to the remote filesystem.

·        Get faster hardware, especially a faster IO system.

·        Open a single writer and re-use it for the duration of your indexing session.

·        Flush by RAM usage instead of document count.

Call writer.ramSizeInBytes() after every added doc, then call flush() when it's using too much RAM. This is especially good if you have small docs or highly variable doc sizes. You need to first set maxBufferedDocs large enough to prevent the writer from flushing based on document count. However, don't set it too large, otherwise you may hit LUCENE-845. Somewhere around 2-3X your "typical" flush count should be OK.
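A rough sketch of this pattern, using the 2.3-era IndexWriter methods named above. The 48 MB threshold is just the LUCENE-843 figure, and the 10000-doc cap is a hypothetical placeholder for "2-3X your typical flush count":

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// Sketch: flush by RAM usage rather than by document count.
public class FlushByRam {
    static final long MAX_RAM = 48 * 1024 * 1024; // 48 MB, per LUCENE-843

    static void indexAll(IndexWriter writer, Iterable<Document> docs)
            throws Exception {
        // Make the doc-count trigger high enough that RAM is the
        // effective limit, but not so high that you hit LUCENE-845.
        writer.setMaxBufferedDocs(10000);
        for (Document doc : docs) {
            writer.addDocument(doc);
            if (writer.ramSizeInBytes() > MAX_RAM) {
                writer.flush(); // writes buffered docs as a new segment
            }
        }
    }
}
```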

·        Use as much RAM as you can afford.

More RAM before flushing means Lucene writes larger segments to begin with, which means less merging later. Testing in LUCENE-843 found that around 48 MB is the sweet spot for that content set, but your application could have a different sweet spot.

·        Turn off compound file format.

Call setUseCompoundFile(false). Building the compound file format takes time during indexing (7-33% in testing for LUCENE-888). However, note that doing this will greatly increase the number of file descriptors used by indexing and by searching, so you could run out of file descriptors if mergeFactor is also large.

·        Re-use Document and Field instances

As of Lucene 2.3 (not yet released) there are new setValue(...) methods that allow you to change the value of a Field. This allows you to re-use a single Field instance across many added documents, which can save substantial GC cost.

It's best to create a single Document instance, then add multiple Field instances to it, but hold onto these Field instances and re-use them by changing their values for each added document. For example you might have an idField, bodyField, nameField, storedField1, etc. After the document is added, you then directly change the Field values (idField.setValue(...), etc.), and then re-add your Document instance.

Note that you cannot re-use a single Field instance within a Document, and you should not change a Field's value until the Document containing that Field has been added to the index. See Field for details.
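A minimal sketch of the re-use pattern described above. The field names and Field.Store/Field.Index flags are illustrative; setValue(...) is the 2.3 API mentioned in the text:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Re-use one Document and its Field instances across many added docs.
public class ReuseFields {
    static void indexRows(IndexWriter writer, String[][] rows)
            throws Exception {
        Field idField = new Field("id", "",
                Field.Store.YES, Field.Index.UN_TOKENIZED);
        Field bodyField = new Field("body", "",
                Field.Store.NO, Field.Index.TOKENIZED);
        Document doc = new Document();
        doc.add(idField);    // add each Field to the Document once...
        doc.add(bodyField);
        for (String[] row : rows) {
            idField.setValue(row[0]);   // ...then only change the values
            bodyField.setValue(row[1]);
            writer.addDocument(doc);    // and re-add the same Document
        }
    }
}
```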

·        Re-use a single Token instance in your analyzer

Analyzers often create a new Token for each term in sequence that needs to be indexed from a Field. You can save substantial GC cost by re-using a single Token instance instead.

·        Use the char[] API in Token instead of the String API to represent token text

As of Lucene 2.3 (not yet released), a Token can represent its text as a slice into a char array, which saves the GC cost of new'ing and then reclaiming String instances. By re-using a single Token instance and using the char[] API you can avoid new'ing any objects for each term. See Token for details.
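A sketch of what a Token-re-using stream might look like, assuming the 2.3-era reusable next(Token) and setTermBuffer(...) APIs. The whitespace splitting here is deliberately simplistic and the buffer size is arbitrary:

```java
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

// Sketch: fill the caller-supplied Token's char[] buffer directly,
// so no String (and no new Token) is allocated per term.
class CharBufferTokenizer extends TokenStream {
    private final Reader input;
    private final char[] buf = new char[256];

    CharBufferTokenizer(Reader input) { this.input = input; }

    public Token next(Token result) throws IOException {
        int c;
        // Skip leading whitespace.
        do { c = input.read(); } while (c != -1 && Character.isWhitespace((char) c));
        if (c == -1) return null; // end of stream
        int len = 0;
        do {
            if (len < buf.length) buf[len++] = (char) c;
            c = input.read();
        } while (c != -1 && !Character.isWhitespace((char) c));
        // Copy the term into the Token's own buffer: no String created.
        result.setTermBuffer(buf, 0, len);
        return result;
    }
}
```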

·        Use autoCommit=false when you open your IndexWriter

In Lucene 2.3 (not yet released), there are substantial optimizations for Documents that use stored fields and term vectors, to save merging of these very large index files. You should see the best gains by using autoCommit=false for a single long-running session of IndexWriter. Note however that searchers will not see any of the changes flushed by this IndexWriter until it is closed; if that is important you should stick with autoCommit=true instead, or periodically close and re-open the writer.
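For example, a single long-running session might look like this, assuming the 2.3-era IndexWriter constructor that takes an autoCommit flag:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

// One long-running indexing session with autoCommit=false.
public class BatchSession {
    static void run(Directory dir) throws Exception {
        IndexWriter writer = new IndexWriter(dir, false /* autoCommit */,
                new StandardAnalyzer(), true /* create */);
        try {
            // ... add all documents here ...
        } finally {
            writer.close(); // searchers only see the changes after close()
        }
    }
}
```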

·        Instead of indexing many small text fields, aggregate the text into a single "contents" field and index only that (you can still store the other fields).
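A sketch of the aggregation, with hypothetical field names; the originals are stored for retrieval but only the combined "contents" field is indexed:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Aggregate several small text fields into one indexed "contents" field.
public class AggregateContents {
    static Document makeDoc(String title, String author, String body) {
        Document doc = new Document();
        // Stored for retrieval, but not indexed individually.
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.NO));
        doc.add(new Field("author", author, Field.Store.YES, Field.Index.NO));
        // One combined field carries all the searchable text.
        String contents = title + " " + author + " " + body;
        doc.add(new Field("contents", contents,
                Field.Store.NO, Field.Index.TOKENIZED));
        return doc;
    }
}
```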

·        Increase mergeFactor, but not too much.

A larger mergeFactor defers merging of segments until later, thus speeding up indexing because merging is a large part of indexing. However, this will slow down searching, and you will run out of file descriptors if you make it too large. Values that are too large may even slow down indexing, since merging more segments at once means much more seeking for the hard drives.
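The knob itself is a one-liner; the value 30 below is a hypothetical moderate increase, not a recommendation:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

public class MergeFactorTuning {
    static IndexWriter openWriter(Directory dir) throws Exception {
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        // Default is 10; larger defers merging (faster indexing,
        // slower searching, more file descriptors in use).
        writer.setMergeFactor(30);
        return writer;
    }
}
```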

·        Turn off any features you are not in fact using.

If you are storing fields but not using them at query time, don't store them. Likewise for term vectors. If you are indexing many fields, turning off norms for those fields may help performance.
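For instance, a field that is searched but never retrieved or highlighted might be declared like this (field name hypothetical; Field.TermVector and setOmitNorms are the 2.x APIs):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class LeanField {
    static Document makeDoc(String text) {
        // Searched but never retrieved or highlighted:
        // don't store it, skip term vectors, and omit norms.
        Field body = new Field("body", text, Field.Store.NO,
                Field.Index.TOKENIZED, Field.TermVector.NO);
        body.setOmitNorms(true);
        Document doc = new Document();
        doc.add(body);
        return doc;
    }
}
```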

·        Use a faster analyzer.

Sometimes analysis of a document takes a lot of time. For example, StandardAnalyzer is quite time consuming, especially in Lucene version <= 2.2. If you can get by with a simpler analyzer, then try it.

·        Speed up document construction.

Often the process of retrieving a document from somewhere external (database, filesystem, crawled from a Web site, etc.) is very time consuming.

·        Don't optimize unless you really need to (for faster searching).

·        Use multiple threads with one IndexWriter.

Modern hardware is highly concurrent (multi-core CPUs, multi-channel memory architectures, native command queuing in hard drives, etc.) so using more than one thread to add documents can give good gains overall. Even on older machines there is often still concurrency to be gained between IO and CPU. Test the number of threads to find the best performance point.
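One way to feed a shared writer from a pool of threads — IndexWriter.addDocument is safe to call concurrently, and java.util.concurrent is used here purely for brevity:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// Several threads sharing one IndexWriter.
public class ThreadedIndexing {
    static void indexConcurrently(final IndexWriter writer,
            Iterable<Document> docs, int nThreads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        for (final Document doc : docs) {
            pool.execute(new Runnable() {
                public void run() {
                    try {
                        writer.addDocument(doc);
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```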

·        Index into separate indices then merge.

If you have a very large amount of content to index, then you can break your content into N "silos", index each silo on a separate machine, then use writer.addIndexesNoOptimize to merge them all into one final index.
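The final merge step might look like this, assuming each silo's index has been copied into a local Directory (addIndexesNoOptimize is the 2.3 API named above):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

// Merge N per-machine "silo" indexes into one final index.
public class MergeSilos {
    static void merge(Directory target, Directory[] silos) throws Exception {
        IndexWriter writer = new IndexWriter(target, new StandardAnalyzer(), true);
        writer.addIndexesNoOptimize(silos); // merges without optimizing
        writer.close();
    }
}
```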

·        Run a Java profiler.

If all else fails, profile your application to figure out where the time is going. I've had success with a very simple profiler called JMP. There are many others. Often you will be pleasantly surprised to find some silly, unexpected method is taking far too much time.

 
