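Distributed Searching Basics
On a single machine, search struggles to keep up as the index grows larger.
Solr allows us to split an index into several appropriately sized indexes, called shards, and distribute them across multiple "servers".
Solr accepts a search request on one server (a single shard), dispatches it to each of the shards, and finally merges the results.
For details, see: http://wiki.apache.org/solr/DistributedSearch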
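The dispatch-and-merge flow described above can be sketched roughly as follows. This is only an illustration of the merge step, not Solr's actual implementation: the hit dictionaries, the `merge_shard_results` helper, and the scores are all made up, and each shard is assumed to return its hits already sorted by descending score.

```python
import heapq
from itertools import islice


def merge_shard_results(shard_results, rows=10):
    """Merge per-shard hit lists (each sorted by descending score) into one top-`rows` list."""
    # heapq.merge lazily interleaves the already-sorted inputs; reverse=True
    # tells it that the inputs are ordered from highest to lowest score.
    merged = heapq.merge(*shard_results, key=lambda hit: hit["score"], reverse=True)
    return list(islice(merged, rows))


# Hypothetical hits returned by two shards, each already sorted by score.
shard0 = [{"id": "a", "score": 2.4}, {"id": "c", "score": 0.9}]
shard1 = [{"id": "b", "score": 1.7}, {"id": "d", "score": 0.3}]

top = merge_shard_results([shard0, shard1], rows=3)
print([hit["id"] for hit in top])
```

The coordinating server only needs the top `rows` hits overall, so a lazy merge of sorted lists is enough; no shard has to ship its complete result set.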
1. Running Distributed Searching with the shards parameter
We can run a distributed search by adding the shards parameter to a search request. Its format is:
host:port/base_url[,host:port/base_url]*
For example:
http://localhost:8983/solr3.5/core1/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on&shards=localhost:8983/solr3.5/core0,localhost:8983/solr3.5/core1
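A request like the one above can also be assembled in code. The sketch below uses only the Python standard library; `build_distributed_query` is a hypothetical helper name, and the host, port, and core paths are simply the ones from the example.

```python
from urllib.parse import urlencode


def build_distributed_query(base_url, shards, query="*:*", start=0, rows=10):
    """Return a Solr select URL that fans the query out to the given shards."""
    params = {
        "q": query,
        "start": start,
        "rows": rows,
        "indent": "on",
        # shards is a comma-separated list of host:port/base_url entries,
        # written without the http:// scheme.
        "shards": ",".join(shards),
    }
    # Keep "*", ":", "/" and "," unescaped so the URL stays readable.
    return base_url.rstrip("/") + "/select/?" + urlencode(params, safe="*:/,")


url = build_distributed_query(
    "http://localhost:8983/solr3.5/core1/",
    ["localhost:8983/solr3.5/core0", "localhost:8983/solr3.5/core1"],
)
print(url)
```

Because every shard address ends up inside the query string, `len(url)` grows with the number of shards, which is worth keeping in mind for GET requests.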
2. Components that support Distributed Searching
Only the following components support Distributed Searching:
- The Query component that returns documents matching a query
- The Facet component, for facet.query and facet.field requests where facets are sorted by count (the default). Solr 1.4 and later also support sorting by name.
- The Highlighting component
- The Stats component
- The Spell Check Component
- The Terms Component
- The Term Vector Component
- The Debug component
3. Limitations of Distributed Searching
Distributed Searching also comes with a number of restrictions:
- Each document indexed must have a unique key.
(Every doc needs a unique identifier, because Solr has to merge the results.)
- If Solr discovers duplicate document IDs, Solr selects the first document and discards subsequent ones.
(If Solr finds a duplicate ID, it keeps the first one!)
- Inverse-document frequency (IDF) calculations cannot be distributed.
(IDF calculation breaks down: IDF depends on the total number of documents, which is hard to obtain when each shard is searched separately.)
- Distributed searching does not support the QueryElevationComponent, which configures the top results for a given query regardless of Lucene's scoring. For more information, see http://wiki.apache.org/solr/QueryElevationComponent.
(QueryElevationComponent ignores scoring and lets a user edit the results, so a simple merge of results is no longer possible.)
- The index for distributed searching may become out of date; for example, a document that once matched a query and was subsequently changed may no longer match the query but will still be retrieved.
(The index can go out of date during distributed searching; I am not sure exactly how this comes about.)
- Distributed searching supports only sorted-field faceting, not date faceting.
(Distributed searching only supports sorted-field faceting.)
- The number of shards is limited by the number of characters allowed for the GET method's URI; most Web servers generally support at least 4000 characters, but many servers limit URI length to reduce their vulnerability to Denial of Service (DoS) attacks.
(The number of shards is limited by the length of the GET URL.)
- TF/IDF computations are per shard. This may not matter if content is well (randomly) distributed.
(Similar to the third point: TF/IDF is computed on each shard separately, so the score ordering after merging is not entirely "fair".)
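Two of the limitations above, duplicate IDs and per-shard IDF, are easier to see with numbers. The following is a minimal sketch under stated assumptions: the shard contents and document counts are invented, "first" is taken to mean the first hit encountered in shard order, and `classic_idf` uses the classic Lucene-style formula 1 + ln(numDocs / (docFreq + 1)), which may differ from the similarity configured in a given Solr installation.

```python
import math


def merge_first_wins(shard_results):
    """Concatenate shard hit lists, keeping only the first hit seen for each unique key."""
    seen, merged = set(), []
    for hits in shard_results:
        for hit in hits:
            if hit["id"] not in seen:
                seen.add(hit["id"])
                merged.append(hit)
    return merged


def classic_idf(doc_freq, num_docs):
    """Classic Lucene-style IDF (an assumption): 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical shards: the same id "42" was indexed on both of them.
shard0 = [{"id": "42", "shard": 0}, {"id": "7", "shard": 0}]
shard1 = [{"id": "42", "shard": 1}, {"id": "9", "shard": 1}]
merged = merge_first_wins([shard0, shard1])
print([(hit["id"], hit["shard"]) for hit in merged])

# The same term is rare on shard 0 but common on shard 1 (made-up counts).
idf_shard0 = classic_idf(doc_freq=9, num_docs=1000)
idf_shard1 = classic_idf(doc_freq=499, num_docs=1000)
idf_global = classic_idf(doc_freq=508, num_docs=2000)
print(idf_shard0, idf_shard1, idf_global)
```

With these counts the term is weighted much more heavily on shard 0 than on shard 1, and neither local value equals the one computed over the combined index. When documents are spread randomly across shards, the local statistics converge and the difference largely disappears.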