Solr performance tips

weixin_33796177

Original link: http://blog.51cto.com/shadowisper/1629873

Page query result:

purpose: pre-load more documents than a single page needs, so that navigating to later pages is served from the query result cache instead of triggering additional queries
how: in solrconfig.xml, set queryResultWindowSize = documents per page * max number of pages browsed, and queryResultMaxDocsCached = queryResultWindowSize
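
As a rough sketch of the sizing rule above, assuming 10 documents per page and users who rarely go past page 5 (both numbers are illustrative and not from the original article), the settings in the query section of solrconfig.xml could look like this:

```xml
<query>
    <!-- 10 documents per page * 5 pages browsed = 50 (assumed values) -->
    <queryResultWindowSize>50</queryResultWindowSize>
    <!-- kept equal to queryResultWindowSize, following the rule above -->
    <queryResultMaxDocsCached>50</queryResultMaxDocsCached>
</query>
```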

Configure document cache:

purpose: size the document cache so that it is large enough to avoid fetching the same data from the index multiple times during a single query
how: solrconfig.xml, documentCache section, size = number of concurrent queries * max docs fetched per query, initialSize = size
note: consider using FastLRUCache if there are more reads than writes
autowarmCount does not need to be set for the document cache, because Lucene's internal document IDs change after re-indexing
set initialSize = size to avoid wasting time on cache resizing
monitor the cache after adjusting its size: frequent evictions mean the cache is too small; a persistently low hit rate suggests turning the cache off
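
A minimal sketch of such a documentCache entry, assuming about 20 concurrent queries that each fetch at most 50 documents (illustrative numbers only). The cache class names here match the older Solr releases this article targets; newer releases replace them with solr.CaffeineCache, so check the documentation for your version:

```xml
<!-- size = 20 concurrent queries * 50 docs per query = 1000 (assumed values) -->
<!-- initialSize = size avoids resizing; autowarmCount is 0 because the
     document cache cannot be usefully autowarmed -->
<documentCache class="solr.LRUCache"
               size="1000"
               initialSize="1000"
               autowarmCount="0"/>
```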

Configure query result cache:

purpose: load query results from the cache as much as possible
how: solrconfig.xml, queryResultCache section, size = number of unique queries * number of sort criteria * 2 (asc or desc), initialSize = size, autowarmCount = size * 1/4
note: consider using FastLRUCache if there are more reads than writes
autowarmCount specifies the number of entries carried over into the new cache when the old one is invalidated (e.g., by a commit operation)
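
To make the arithmetic concrete, here is a sketch that assumes 500 unique queries and 4 sort criteria (hypothetical numbers), giving size = 500 * 4 * 2 = 4000 and autowarmCount = 4000 / 4 = 1000:

```xml
<!-- size = 500 unique queries * 4 sort criteria * 2 directions = 4000 (assumed values) -->
<!-- autowarmCount = size * 1/4 = 1000 -->
<queryResultCache class="solr.FastLRUCache"
                  size="4000"
                  initialSize="4000"
                  autowarmCount="1000"/>
```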

Configure filter cache:

purpose: load filter query (fq) results from cache as much as possible
how: solrconfig.xml, filterCache section, size = number of unique filters
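
A possible filterCache entry, assuming an application with around 200 distinct fq values (an illustrative figure). The initialSize and autowarmCount values are also assumptions, chosen to mirror the other caches rather than taken from the article:

```xml
<!-- size = number of unique filters, assumed to be 200 here -->
<filterCache class="solr.FastLRUCache"
             size="200"
             initialSize="200"
             autowarmCount="50"/>
```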

Startup or after commit warmup tuning:

purpose: pre-load the results of heavily used or slow queries into the caches, so that the first requests after a startup or a commit are not slow
how: solrconfig.xml

after startup:

<listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
        <lst>
            <str name="q">cats</str>
            <str name="fq">category:1</str>
        </lst>
        <lst>...</lst>
    </arr>
</listener>

after commit:

<listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
        <lst>
            <str name="q">cats</str>
            <str name="fq">category:1</str>
        </lst>
        <lst>...</lst>
    </arr>
</listener>

Cache whole result pages (HTTP cache):

purpose: cache Solr HTTP responses on the client side by using HTTP caching
how:

    <requestDispatcher handleSelect="true">
        <httpCaching lastModifiedFrom="openTime" etagSeed="Solr">
            <cacheControl>max-age=3600, public</cacheControl>
        </httpCaching>
    </requestDispatcher>

note: handleSelect="true" enables handler resolution through the qt request parameter
set "max-age" to half of the index update interval
use "private" instead of "public" in cacheControl if only the browser should cache the Solr response
lastModifiedFrom="openTime" is the default: the Last-Modified value (and validation against If-Modified-Since requests) is relative to when the current searcher was opened. You can change it to lastModifiedFrom="dirLastMod" if you want the value to correspond exactly to when the physical index was last modified.
etagSeed="..." is an option you can change to force the ETag header (and validation against If-None-Match requests) to be different even if the index has not changed (i.e., when making significant changes to your config file).
lastModifiedFrom and etagSeed are both ignored if you use the never304="true" option (useful if you want a proxy server to handle the ETag and Last-Modified calculation).

Improve facet performance:

purpose: improve facet query performance by choosing the facet method, for cases where the query result contains many documents and the facet field's cardinality is low
how: add facet.method=enum to the query, or f.<fieldname>.facet.method=enum to set it for a single field
note: facet.method=fc (the default) iterates over the documents in the result and counts the terms found in each, while facet.method=enum iterates over the terms of the facet field and intersects each term's document set with the query result set
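
If the same facet method should apply to every request, the parameter can also be set as a handler default instead of being sent with each query. The sketch below assumes a /select handler and a hypothetical low-cardinality field named category:

```xml
<requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
        <!-- apply enum to all facet fields by default -->
        <str name="facet.method">enum</str>
        <!-- or limit it to one field ("category" is a hypothetical field name) -->
        <str name="f.category.facet.method">enum</str>
    </lst>
</requestHandler>
```

In practice only one of the two defaults would normally be kept, depending on whether all facet fields or just one should use enum.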

Improve indexing time on large doc set:

purpose: improve response time when indexing a large number of documents by committing more frequently
how: use Solr's auto commit feature

commit within specified time:

<updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
        <maxTime>60000</maxTime>
        <openSearcher>true</openSearcher>
    </autoCommit>
</updateHandler>

commit after indexing specified number of documents:

<updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
        <maxDocs>50000</maxDocs>
        <openSearcher>true</openSearcher>
    </autoCommit>
</updateHandler>

Commit sooner than the auto-commit setting for specific documents by setting commitWithin (in milliseconds) on the update message (XML example):

<add commitWithin="100">
    <doc>
        <field name="id">1</field>
        <field name="title">Book 1</field>
    </doc>
</add>

Analyzing performance:

purpose: give a detailed view of Solr query execution so that slow queries can be tuned
how: add the request parameter debugQuery=true

Eg. http://localhost:8983/solr/select?q=metal&facet=true&facet.field=date&facet.query=from:[10+TO+2000]&debugQuery=true

this adds a debug section to the Solr XML response, which includes a breakdown of the time spent in each component
note: Solr processing is divided into two phases, prepare and process, and timings are reported for both

Avoid filter caching:

purpose: in some cases we want to avoid caching filters for queries that are effectively unique, such as a time range search, so that memory and CPU are not wasted on entries that will never be reused
how: add the hint {!cache=false} to the filter query
Eg. q=solr+cookbook&fq=category:books&fq={!cache=false}date:2012-06-12T13:22:12Z
note: filters that are not cached will be executed in parallel with the query

Control filter query execution order:

purpose: a request may contain multiple filter queries; we want to control the order of execution so that cheap filters are applied first to narrow down the result set as much as possible, and expensive filters (e.g., function queries) are applied later
how: specify a cost for each fq clause; filters with a lower cost are executed first

Eg. q=solr+cookbook&fq=category:books&fq={!frange l=10 u=100 cache=false cost=50}log(sum(sqrt(popularity),100))&fq={!frange l=0 u=10 cache=false cost=150}if(exists(price_promotion),sum(0,price_promotion),sum(0,price))
note: order of execution can only be controlled for non-cached filter queries

Improve numeric query performance:

purpose: improve numeric range search performance
how: decrease the precisionStep of the numeric field (a float field in this example)

<fieldType name="float" class="solr.TrieFloatField" precisionStep="4" positionIncrementGap="0"/>
note: text range search is usually faster than numeric range search
decreasing precisionStep results in more tokens being generated for a single value and slightly increases index size; for a 32-bit integer, precisionStep = 4 results in 32 bits / 4 = 8 tokens
precisionStep=0 turns off indexing of multiple tokens per value

Use near real time search feature:

purpose: Solr supports near real time search by performing soft commits. A hard commit syncs index changes to disk, which is time consuming. A soft commit is much faster and makes index changes visible to searchers almost immediately.
how: solrconfig.xml

<autoSoftCommit>
    <maxTime>${solr.autoSoftCommit.maxTime:5000}</maxTime>
</autoSoftCommit>

note: a nice article that explains soft commits, hard commits and the transaction log in Solr:
http://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
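
Soft commits are commonly paired with a less frequent hard commit that flushes to disk without opening a new searcher. The following is a sketch of that combination under assumed intervals (15 seconds for hard commits, 5 seconds for soft commits); suitable values depend on the indexing load:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
    <!-- hard commit: persists changes to disk, does not open a new searcher -->
    <autoCommit>
        <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
        <openSearcher>false</openSearcher>
    </autoCommit>
    <!-- soft commit: makes changes visible to searches -->
    <autoSoftCommit>
        <maxTime>${solr.autoSoftCommit.maxTime:5000}</maxTime>
    </autoSoftCommit>
</updateHandler>
```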

Another way to improve search responsiveness is to use a “get” query when the document id is known. This retrieves the document directly from the update log, even if the data has not been committed yet. Eg. http://localhost:8983/solr/get?ids=mydoc See https://wiki.apache.org/solr/RealTimeGet
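
Real-time get relies on the update log being enabled and on a handler being registered for the /get path. A sketch of the relevant solrconfig.xml entries, modeled on the example configuration shipped with Solr (verify it against your version's defaults):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
    <!-- the update (transaction) log that /get reads uncommitted documents from -->
    <updateLog>
        <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
</updateHandler>

<!-- handler that serves http://localhost:8983/solr/get?ids=... -->
<requestHandler name="/get" class="solr.RealTimeGetHandler">
    <lst name="defaults">
        <str name="omitHeader">true</str>
    </lst>
</requestHandler>
```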