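Solr implements result clustering through Carrot2, which automatically groups the retrieved documents into categories. The following is a Carrot2 clustering example.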
To enable clustering in Solr, first copy dist/solr-clustering-4.2.0.jar from the Solr distribution into \solr\contrib\analysis-extras\lib, then open solrconfig.xml and add the following configuration:
<searchComponent name = "clustering" enable = "${solr.clustering.enabled:true}" class = "solr.clustering.ClusteringComponent">
<lst name = "engine">
<str name = "name">default</str>
<str name = "carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
<str name = "LingoClusteringAlgorithm.desiredClusterCountBase">30</str> <!-- 2~100 -->
<str name = "LingoClusteringAlgorithm.clusterMergingThreshold">0.70</str> <!-- 0~1 -->
<str name = "LingoClusteringAlgorithm.scoreWeight">0</str> <!-- 0~1 -->
<str name = "LingoClusteringAlgorithm.labelAssigner">org.carrot2.clustering.lingo.SimpleLabelAssigner</str> <!-- org.carrot2.clustering.lingo.UniqueLabelAssigner -->
<str name = "LingoClusteringAlgorithm.phraseLabelBoost">1.5</str> <!-- 0~10 -->
<str name = "LingoClusteringAlgorithm.phraseLengthPenaltyStart">8</str> <!-- 2~8 -->
<str name = "LingoClusteringAlgorithm.phraseLengthPenaltyStop">8</str> <!-- 2~8 -->
<str name = "TermDocumentMatrixReducer.factorizationQuality">HIGH</str> <!-- LOW,MEDIUM,HIGH -->
.......
</lst>
</searchComponent>
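As an alternative to copying the JAR by hand, solrconfig.xml can load the clustering libraries through <lib> directives placed directly under the <config> element. The fragment below is only a sketch: the relative paths are assumptions based on the stock Solr 4.x example layout and will most likely need adjusting for your installation.

```xml
<!-- Assumed paths, relative to the core's instance directory; adjust as needed -->
<!-- Carrot2 and its dependencies shipped under contrib/clustering/lib -->
<lib dir="../../../contrib/clustering/lib/" regex=".*\.jar" />
<!-- The solr-clustering JAR from the dist directory -->
<lib dir="../../../dist/" regex="solr-clustering-\d.*\.jar" />
```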
With the clustering component configured, set up the requestHandler next:
<requestHandler name = "/clustering" startup = "lazy" enable = "${solr.clustering.enabled:true}" class = "solr.SearchHandler">
<lst name = "defaults">
<str name = "echoParams">explicit</str>
<bool name = "clustering">true</bool>
<str name = "clustering.engine">default</str>
<bool name = "clustering.results">true</bool>
<str name = "carrot.title">category_s</str>
<str name = "carrot.snippet">content</str>
<str name = "carrot.url">path</str>
<str name = "carrot.produceSummary">true</str>
</lst>
<arr name = "last-components">
<str>clustering</str>
</arr>
</requestHandler>
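Two parameters deserve attention: carrot.title and carrot.snippet name the fields used in the clustering computation. Both fields must be declared with stored="true", and carrot.title carries a higher weight than carrot.snippet. If only one field is used for the computation, carrot.snippet can be removed (remove the entry entirely rather than leaving its value empty). Once this is set up, you can query with the following URL.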
http://localhost:8080/skyCore/clustering?q=*%3A*&wt=xml&indent=true
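For the stored="true" requirement described above, the matching field declarations in schema.xml might look like the sketch below. The field names come from the handler configuration; the type names (string, text_general) are assumptions and should be replaced with the types actually defined in your schema.

```xml
<!-- schema.xml: fields referenced by carrot.title and carrot.snippet -->
<!-- Type names are assumed for illustration; stored="true" is the part that matters -->
<field name="category_s" type="string" indexed="true" stored="true" />
<field name="content" type="text_general" indexed="true" stored="true" />
```

If either field was previously declared with stored="false", the existing documents need to be reindexed after the change so that the stored values are available to the clustering component.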