Next-generation search and analytics with Apache Lucene and Solr 4

cuxiong8996

Original article: https://www.ibm.com/developerworks/opensource/library/j-solr-lucene/index.html

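Six years ago, I began writing about Solr and Lucene for developerWorks (see Related topics). Over the years, Lucene and Solr have established themselves as rock-solid technologies (Lucene as a foundation for Java™ APIs, and Solr as a search service). For instance, they power search-based applications for Apple iTunes, Netflix, Wikipedia, and a host of others, and they help to enable the IBM Watson question-answering system.

Over the years, most people's use of Lucene and Solr has focused primarily on text-based search. Meanwhile, the new and interesting trend of big data has emerged, along with a renewed focus on distributed computation and large-scale analytics. Big data often also demands real-time, large-scale information access. In light of this shift, the Lucene and Solr communities found themselves at a crossroads: the core underpinnings of Lucene began to show their age under the stress of big data applications such as indexing all of the Twittersphere (see Related topics). Furthermore, Solr's lack of native distributed indexing support made it increasingly hard for IT organizations to scale their search infrastructure cost-effectively.

The community set to work overhauling the Lucene and Solr underpinnings (and in some cases the public APIs). Our focus shifted to enabling easy scalability, near-real-time indexing and search, and many NoSQL features, all while leveraging the core engine capabilities. This overhaul culminated in the Apache Lucene and Solr 4.x releases. These versions aim directly at solving next-generation, large-scale, data-driven access and analytics problems.

This article walks you through the 4.x highlights and shows you some code examples. First, though, you'll go hands-on with a working application that demonstrates the concept of leveraging a search engine to go beyond search. To get the most from the article, you should be familiar with the basics of Solr and Lucene, especially Solr requests. If you're not, see Related topics for links that will get you started with Solr and Lucene.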

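Quick start: Search and analytics in action

Search engines are only for searching text, right? Wrong! At their heart, search engines are all about quickly and efficiently filtering and then ranking data according to some notion of similarity (a notion that's flexibly defined in Lucene and Solr). Search engines also deal effectively with both sparse data and ambiguous data, which are hallmarks of modern data applications. Lucene and Solr are capable of crunching numbers, answering complex geospatial questions (as you'll see shortly), and much more. These capabilities blur the line between search applications and traditional database applications (and even NoSQL applications).

For example, Lucene and Solr now:

Support several types of joins and grouping options
Have optional column-oriented storage
Provide several ways to deal with text and with enumerated and numerical data types
Enable you to define your own complex data types and storage, ranking, and analytics functions

A search engine isn't a silver bullet for all data problems. But the fact that text search was the primary use of Lucene and Solr in the past shouldn't prevent you from using them to solve your data needs now or in the future. I encourage you to think about using search engines in ways that go well outside the proverbial (search) box.

To demonstrate how a search engine can go beyond search, the rest of this section shows you an application that ingests aviation-related data into Solr. The application queries the data (most of which isn't textual) and processes it with the D3 JavaScript library (see Related topics) before displaying it. The data sets are from the Research and Innovative Technology Administration (RITA) of the U.S. Department of Transportation's Bureau of Transportation Statistics and from OpenFlights. The data includes details such as originating airport, destination airport, time delays, causes of delays, and airline information for all flights in a particular time period. By using the application to query this data, you can analyze delays between particular airports, traffic growth at specific airports, and much more.

Start by getting the application up and running, and then look at some of its interfaces. Keep in mind as you go along that the application interacts with the data by interrogating Solr in various ways.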

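Setup

To get started, you need the following prerequisites:

Lucene and Solr.
Java 6 or higher.
A modern web browser. (I tested on Google Chrome and Firefox.)
4GB of disk space (less if you don't want to use all of the flight data).
Terminal access with a bash (or similar) shell on *nix. For Windows, you need Cygwin. I tested only on OS X with the bash shell.
wget, if you choose to download the data by using the download script that's in the sample code package. You can also download the flight data manually.
Apache Ant 1.8+ for compilation and packaging purposes, if you want to run any of the Java code examples.

See Related topics for links to the Lucene, Solr, wget, and Ant download sites.

With the prerequisites in place, follow these steps to get the application up and running:

Download this article's sample code ZIP file and unzip it to a directory of your choice. I'll refer to this directory as $SOLR_AIR.
At the command line, change to the $SOLR_AIR directory: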
```
cd $SOLR_AIR
```
Start Solr:
```
./bin/start-solr.sh
```
Run the script that creates the necessary fields to model the data:
```
./bin/setup.sh
```
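Point your browser at http://localhost:8983/solr/#/ to display the new Solr Admin UI. Figure 1 shows an example:
Figure 1. Solr UI
At the terminal, view the contents of the bin/download-data.sh script for details on what to download from RITA and OpenFlights. Download the data sets either manually or by running the script: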
```
./bin/download-data.sh
```
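The download might take significant time, depending on your bandwidth.
After the download is complete, index some or all of the data.

To index all of the data: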
```
bin/index.sh
```
To index the data from a single year, pass any year from 1987 through 2008. For example:
```
bin/index.sh 1987
```
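After indexing is complete (which might take significant time, depending on your machine), point your browser at http://localhost:8983/solr/collection1/travel. You'll see a UI similar to the one in Figure 2:
Figure 2. The Solr Air UI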

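Exploring the data

With the Solr Air application up and running, you can look through the data and the UI to get a sense of the kinds of questions you can ask. In the browser, you should see two main interface points: the map and the search box. For the map, I started with D3's excellent Airport example (see Related topics). I modified and extended the code to load all of the airport information directly from Solr instead of from the example CSV file that comes with the D3 example. I also did some initial statistical calculations about each airport, which you can see by mousing over a particular airport.

I'll use the search box to showcase a few key pieces of functionality that help you build sophisticated search and analytics applications. To follow along in the code, see the solr/collection1/conf/velocity/map.vm file.

The key focus areas are:

Pivot facets
Statistical functionality
Grouping
Lucene and Solr's expanded geospatial support

Each of these areas helps you get answers such as the average delay of arriving airplanes at a specific airport, or the most common delay times for an aircraft that's flying between two airports (per airline, or between a certain starting airport and all of the nearby airports). The application uses Solr's statistical functionality, combined with Solr's longstanding faceting capabilities, to draw the initial map of airport "dots" and to generate basic information such as total flights and average, minimum, and maximum delay times. (This capability alone is a fantastic way to find bad data or at least extreme outliers.) To demonstrate these areas (and to show how easy it is to integrate Solr with D3), I've implemented a bit of lightweight JavaScript code that:

Parses the query. (A production-quality application would likely do most of the query parsing on the server side or even as a Solr query-parser plugin.)
Creates various Solr requests.
Displays the results.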

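The request types are:

Lookup per three-letter airport code, such as RDU or SFO.
Lookup per route, such as SFO TO ATL or RDU TO ATL. (Multiple hops are not supported.)
Clicking the search button when the search box is empty to show various statistics for all flights.
Finding nearby airports by using the near operator, as in near:SFO or near:SFO TO ATL.
Finding likely delays at various distances of travel (less than 500 miles, 500 to 1000, 1000 to 2000, 2000 and beyond), as in likely:SFO.
Any arbitrary Solr query to feed to Solr's /travel request handler, such as &q=AirportCity:Francisco.

The first three request types in the preceding list are all variations of the same type. These variants highlight Solr's pivot faceting capabilities to show, for instance, the most common arrival delay times per route (such as SFO TO ATL) per airline per flight number. The near option leverages the new Lucene and Solr spatial capabilities to perform significantly enhanced spatial calculations such as complex polygon intersections. The likely option showcases Solr's grouping capabilities to show airports at a range of distances from an originating airport that had arrival delays of more than 30 minutes. All of these request types augment the map with display information through a small amount of D3 JavaScript. For the last request type in the list, I simply return the associated JSON. This request type enables you to explore the data on your own. If you use this request type in your own applications, you naturally would want to use the response in an application-specific way.

Now try out some queries on your own. For instance, if you search for SFO TO ATL, you should see results similar to those in Figure 3:

Figure 3. Example SFO TO ATL screen

In Figure 3, the two airports are highlighted in the map on the left. The Route Stats list on the right shows the most common arrival delay times per flight per airline. (I loaded the data for 1987 only.) For instance, it tells you that five times Delta flight 156 was five minutes late arriving in Atlanta, and four times it arrived six minutes early.

You can see the underlying Solr request in your browser's console (for example, in Chrome on the Mac, choose View -> Developer -> Javascript Console) and in the Solr logs. The SFO-TO-ATL request that I used (broken into three lines here for formatting purposes only) is: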

/solr/collection1/travel?&wt=json&facet=true&facet.limit=5&fq=Origin:SFO 
AND Dest:ATL&q=*:*&facet.pivot=UniqueCarrier,FlightNum,ArrDelay&
f.UniqueCarrier.facet.limit=10&f.FlightNum.facet.limit=10

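The facet.pivot parameter provides the key functionality in this request. facet.pivot pivots from the airline (called UniqueCarrier) to FlightNum and then to ArrDelay, thereby providing the nested structure that's displayed in Figure 3's Route Stats.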
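
To make the shape of that request concrete, here is a minimal Java sketch that assembles the same parameters into a URL-encoded query string. It is not part of the article's sample code: the class name PivotRequestSketch and the method pivotRequest are hypothetical, and only the handler path and parameter names are taken from the request shown above. The per-field facet limits are left out for brevity.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class PivotRequestSketch {
  // URL-encodes a single parameter value as UTF-8.
  static String encode(String value) {
    try {
      return URLEncoder.encode(value, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException("UTF-8 is always available", e);
    }
  }

  // Builds a pivot-facet request for one route against the /travel handler.
  // The order of pivotFields determines the nesting order of the response.
  static String pivotRequest(String origin, String dest, String... pivotFields) {
    Map<String, String> params = new LinkedHashMap<>();
    params.put("wt", "json");
    params.put("facet", "true");
    params.put("facet.limit", "5");
    params.put("fq", "Origin:" + origin + " AND Dest:" + dest);
    params.put("q", "*:*");
    params.put("facet.pivot", String.join(",", pivotFields));

    StringJoiner query = new StringJoiner("&", "/solr/collection1/travel?", "");
    for (Map.Entry<String, String> p : params.entrySet()) {
      query.add(p.getKey() + "=" + encode(p.getValue()));
    }
    return query.toString();
  }

  public static void main(String[] args) {
    System.out.println(pivotRequest("SFO", "ATL", "UniqueCarrier", "FlightNum", "ArrDelay"));
  }
}
```

Because the pivot fields are passed in order, swapping two of them (for example, FlightNum before UniqueCarrier) would change which field forms the outer level of the nested counts.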

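If you try a near query, as in near:JFK, your result should look similar to Figure 4:

Figure 4. Example screen showing airports near JFK

The Solr request that underlies near queries takes advantage of Solr's new spatial functionality, which I'll detail later in this article. For now, you can likely discern some of the power of this new functionality by looking at the request itself (shortened here for formatting purposes):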

...
&fq=source:Airports&q=AirportLocationJTS:"IsWithin(Circle(40.639751,-73.778925 d=3))"
...

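As you might guess, the request looks for all airports that fall within a circle whose center is at 40.639751 degrees latitude and -73.778925 degrees longitude and that has a radius of 3 degrees. One degree of latitude is roughly 111 kilometers, so the radius is roughly 333 kilometers.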
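
The conversion from degrees to kilometers is plain arc-length arithmetic. The sketch below assumes a spherical Earth with a mean radius of 6,371 km (an approximation that the article does not state), under which one degree of arc is about 111 km and three degrees is about 333 km. The class and method names are hypothetical.

```java
class DegreesToKm {
  // Mean Earth radius in kilometers; a common spherical approximation (assumption).
  static final double EARTH_RADIUS_KM = 6371.0;

  // Length in kilometers of a great-circle arc that spans the given angle.
  static double degreesToKm(double degrees) {
    return Math.toRadians(degrees) * EARTH_RADIUS_KM;
  }

  public static void main(String[] args) {
    System.out.println("1 degree  is about " + Math.round(degreesToKm(1)) + " km");
    System.out.println("3 degrees is about " + Math.round(degreesToKm(3)) + " km");
  }
}
```

This holds for degrees of latitude; a degree of longitude covers less ground as you move away from the equator, so a radius expressed in degrees is only a rough stand-in for a distance.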

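By now, you should have a strong sense that Lucene and Solr applications can slice and dice data (numerical, textual, or other) in interesting ways. And because Lucene and Solr are both open source, with a commercial-friendly license, you are free to add your own customizations. Better yet, the 4.x line of Lucene and Solr increases the number of places where you can insert your own ideas and functionality without needing to overhaul all of the code. Keep this capability in mind as you look next at some of the highlights of Lucene 4 (version 4.4 as of this writing) and then at the Solr 4 highlights.

Lucene 4: Foundations for next-generation search and analytics

A sea change

Lucene 4 is a nearly complete rewrite of Lucene's underpinnings for better performance and flexibility. At the same time, this release represents a sea change in the way the community develops software, thanks to Lucene's new randomized unit-testing framework and rigorous community standards that relate to performance. For instance, the randomized test framework (which is available as a packaged artifact for anyone to use) makes it easy for the project to test the interactions of variables such as JVMs, locales, input content and queries, storage formats, scoring formulas, and many more. (Even if you never use Lucene, you might find the test framework useful in your own projects.)

Some of the key additions and changes to Lucene fall into the categories of speed and memory, flexibility, data structures, and faceting. (To see all of the details on the changes in Lucene, read the CHANGES.txt file that's included in every Lucene distribution.)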

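Speed and memory

Although prior Lucene versions are generally considered to be fast enough, especially relative to comparable general-purpose search libraries, enhancements in Lucene 4 make many operations significantly faster than in previous versions.

The graph in Figure 5 captures the performance of Lucene indexing as measured in gigabytes per hour. (Credit Lucene committer Mike McCandless for the nightly Lucene benchmarking graphs; see Related topics.) Figure 5 shows that a huge performance improvement occurred in the first half of May:

Figure 5. Lucene indexing performance

Graph of Lucene indexing performance that shows an increase from 100GB per hour to approximately 270GB per hour in the first half of May.

Not your father's Lucene

Lucene 4 contains significant API changes and enhancements that benefit the engine and that ultimately enable you to do many new and interesting things. But upgrading from a previous version of Lucene might require significant effort, especially if you used any of the lower-level or "expert" APIs. (Classes such as IndexWriter and IndexReader are still broadly recognizable from prior versions, but the way you access term vectors, for instance, has changed significantly.) Plan accordingly.

The improvement that Figure 5 shows comes from a series of changes to how Lucene builds its index structures and how it handles concurrency when building them (along with a few other changes, including JVM changes and the use of solid-state drives). The changes focused on removing synchronization while Lucene writes the index to disk. For details (which are beyond this article's scope), see Related topics for links to Mike McCandless's blog posts.

In addition to improving overall indexing performance, Lucene 4 can perform near real time (NRT) indexing operations. NRT operations can significantly reduce the time that it takes for the search engine to reflect changes to the index. To use NRT operations, you must do some coordination in your application between Lucene's IndexWriter and IndexReader. Listing 1 (a snippet from the download package's src/main/java/IndexingExamples.java file) illustrates this interplay:

Listing 1. Example of NRT search in Lucene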

...
doc = new HashSet<IndexableField>();
index(writer, doc);
//Get a searcher
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(directory));
printResults(searcher);
//Now, index one more doc
doc.add(new StringField("id", "id_" + 100, Field.Store.YES));
doc.add(new TextField("body", "This is document 100.", Field.Store.YES));
writer.addDocument(doc);
//The results are still 100
printResults(searcher);
//Don't commit; just open a new searcher directly from the writer
searcher = new IndexSearcher(DirectoryReader.open(writer, false));
//The results now reflect the new document that was added
printResults(searcher);
...

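In Listing 1, I first index and commit a set of documents to the Directory and then search the Directory, which is the traditional approach in Lucene. NRT comes in when I proceed to index one more document: instead of doing a full commit, the code opens a new reader directly from the IndexWriter, wraps it in a new IndexSearcher, and then does the search. You can run this example by changing to the $SOLR_AIR directory and executing this sequence of commands: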

ant compile
cd build/classes
java -cp ../../lib/*:. IndexingExamples

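Note: I grouped several of this article's code examples into IndexingExamples.java, so you can use the same command sequence to run the later examples in Listing 2 and Listing 4.

The output that prints to the screen is: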

...
Num docs: 100
Num docs: 100
Num docs: 101
...

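Lucene 4 also contains memory improvements that leverage some more-advanced data structures (which I cover in more detail in "Finite state automata and other goodies"). These improvements not only reduce Lucene's memory footprint but also significantly speed up queries that are based on wildcards and regular expressions. Additionally, the code base moved away from working with Java String objects in favor of managing large allocations of byte arrays. (The BytesRef class is seemingly ubiquitous under the covers in Lucene now.) As a result, String overhead is reduced and the number of objects on the Java heap is under better control, which reduces the likelihood of stop-the-world garbage collections.

Some of the flexibility enhancements also yield performance and storage improvements, because you can choose better data structures for the types of data that your application is using. For instance, as you'll see next, you can choose to index/store unique keys (which are dense and don't compress well) one way in Lucene and index/store text in a completely different way that better suits text's sparseness.

Flexibility

The flexibility improvements in Lucene 4.x unlock a treasure trove of opportunity for developers (and researchers) who want to squeeze every last bit of quality and performance out of Lucene. To enhance flexibility, Lucene offers two new well-defined plugin points. Both plugin points have already had a significant impact on the way Lucene is both developed and used.

What's a segment?

A Lucene segment is a subset of the overall index. In many ways, a segment is a self-contained mini-index. Lucene builds its index by using segments to balance the availability of the index for searching with the speed of writing. Segments are write-once files during indexing, and a new one is created every time you commit during writing. In the background, by default, Lucene periodically merges smaller segments into larger segments to improve read performance and reduce system overhead. You can exercise complete control over this process.

The first new plugin point is designed to give you deep control over the encoding and decoding of a Lucene segment. The Codec class defines this capability. Codec gives you control over the format of the postings list (that is, the inverted index), Lucene storage, boost factors (also called norms), and much more.

In some applications, you might want to implement your own Codec. But it's much more likely that you'll want to change the Codec that's used for a subset of the document fields in the index. To understand this point, it helps to think about the kinds of data you are putting in your application. For instance, identifying fields (such as your primary key) are usually unique. Because a primary key occurs in only one document, you might want to encode it differently from how you encode the body of an article's text. You don't actually change the Codec in these cases. Instead, you change one of the lower-level classes that the Codec delegates to.

To demonstrate, I'll show you a code example that uses my favorite Codec, the SimpleTextCodec. The SimpleTextCodec is what it sounds like: a Codec for encoding the index in simple text. (The fact that SimpleTextCodec was written and passes Lucene's extensive test framework is a testament to Lucene's enhanced flexibility.) SimpleTextCodec is too large and slow to use in production, but it's a great way to see what a Lucene index looks like under the covers, which is why it is my favorite. The code in Listing 2 changes a Codec instance to SimpleTextCodec:

Listing 2. Example of changing `Codec` instances in Lucene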

...
conf.setCodec(new SimpleTextCodec());
File simpleText = new File("simpletext");
directory = new SimpleFSDirectory(simpleText);
//Let's write to disk so that we can see what it looks like
writer = new IndexWriter(directory, conf);
index(writer, doc);//index the same docs as before
...

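By running the Listing 2 code, you create a local build/classes/simpletext directory. To see the Codec in action, change to build/classes/simpletext and open the .cfs file in a text editor. You can see that the .cfs file truly is plain old text, like the snippet in Listing 3:

Listing 3. Portion of the _0.cfs plain-text index file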

...
  term id_97
    doc 97
  term id_98
    doc 98
  term id_99
    doc 99
END
doc 0
  numfields 4
  field 0
    name id
    type string
    value id_100
  field 1
    name body
    type string
    value This is document 100.
...

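For the most part, changing the Codec isn't useful unless you are working with extremely large indexes and query volumes, or you are a researcher or search-engine maven who loves to play with bare metal. Before changing Codecs in those cases, do extensive testing of the various available Codecs by using your actual data. Solr users can set and change these capabilities by modifying simple configuration items. Refer to the Solr Reference Guide for more details (see Related topics).

The second significant new plugin point makes Lucene's scoring model completely pluggable. You are no longer limited to using Lucene's default scoring model, which some detractors claim is too simple. If you prefer, you can use alternative scoring models such as BM25 and Divergence from Randomness (see Related topics), or you can write your own. Why write your own? Perhaps your "documents" represent molecules or genes; you want a fast way of ranking them, but term frequency and document frequency aren't applicable. Or perhaps you want to try out a new scoring model that you read about in a research paper to see how it works on your content. Whatever your reason, changing the scoring model requires you to change the model at indexing time through the IndexWriterConfig.setSimilarity(Similarity) method, and at search time through the IndexSearcher.setSimilarity(Similarity) method. Listing 4 demonstrates changing the Similarity by first running a query that uses the default Similarity and then re-indexing and rerunning the query using Lucene's BM25Similarity:

Listing 4. Changing `Similarity` in Lucene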

conf = new IndexWriterConfig(Version.LUCENE_44, analyzer);
directory = new RAMDirectory();
writer = new IndexWriter(directory, conf);
index(writer, DOC_BODIES);
writer.close();
searcher = new IndexSearcher(DirectoryReader.open(directory));
System.out.println("Lucene default scoring:");
TermQuery query = new TermQuery(new Term("body", "snow"));
printResults(searcher, query, 10);

BM25Similarity bm25Similarity = new BM25Similarity();
conf.setSimilarity(bm25Similarity);
Directory bm25Directory = new RAMDirectory();
writer = new IndexWriter(bm25Directory, conf);
index(writer, DOC_BODIES);
writer.close();
searcher = new IndexSearcher(DirectoryReader.open(bm25Directory));
searcher.setSimilarity(bm25Similarity);
System.out.println("Lucene BM25 scoring:");
printResults(searcher, query, 10);

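Run the code in Listing 4 and examine the output. Notice that the scores are indeed different. Whether the results of the BM25 approach more accurately reflect a user's desired set of results is ultimately up to you and your users to decide. I recommend that you set up your application in a way that makes it easy for you to run experiments. (A/B testing should help.) Then compare not only the Similarity results, but also the results of varying the query construction, the Analyzer, and many other items.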
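
To give a feel for why the BM25 scores differ, the following sketch computes the textbook BM25 score for a single query term in plain Java. It is an illustration only, not Lucene's BM25Similarity code: it uses the commonly cited defaults k1 = 1.2 and b = 0.75 and the IDF form log(1 + (N - n + 0.5) / (n + 0.5)), and it works with exact document lengths, whereas Lucene stores length norms in a compressed, lossy form. The class and method names are hypothetical.

```java
class Bm25Sketch {
  static final double K1 = 1.2;  // controls how quickly term frequency saturates
  static final double B = 0.75;  // controls how strongly document length is normalized

  // Inverse document frequency: rarer terms (small docFreq) get larger weights.
  static double idf(long docCount, long docFreq) {
    return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
  }

  // BM25 score of one term in one document.
  static double score(double termFreq, long docFreq, long docCount,
                      double docLength, double avgDocLength) {
    double lengthNorm = K1 * (1.0 - B + B * docLength / avgDocLength);
    return idf(docCount, docFreq) * termFreq * (K1 + 1.0) / (termFreq + lengthNorm);
  }

  public static void main(String[] args) {
    // A term that occurs in 10 of 100 documents; average document length is 50 terms.
    for (int tf = 1; tf <= 5; tf++) {
      System.out.println("tf=" + tf + " score=" + score(tf, 10, 100, 50, 50));
    }
    System.out.println("long doc, tf=1: " + score(1, 10, 100, 200, 50));
  }
}
```

Two properties are worth noticing when you run it: each additional occurrence of the term adds less to the score than the previous one (the score can never exceed idf * (k1 + 1)), and the same term frequency scores lower in a document that is longer than average.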

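Finite state automata and other goodies

A complete overhaul of Lucene's data structures and algorithms spawned two especially interesting advancements in Lucene 4:

DocValues (also known as column stride fields).
Finite State Automata (FSA) and Finite State Transducers (FST). I'll refer to both as FSAs for the remainder of this article. (Technically, an FST outputs values as its nodes are visited, but that distinction isn't important for the purposes of this article.)

Both DocValues and FSA provide significant new performance benefits for certain types of operations that can affect your application.

On the DocValues side, in many cases applications need to access all of the values of a single field very quickly, in sequence. Or applications need to do quick lookups of values for sorting or faceting, without incurring the cost of building an in-memory version from an index (a process that's known as un-inverting). DocValues are designed to answer these kinds of needs.

An application that does a lot of wildcard or fuzzy queries should see a significant performance improvement due to the use of FSAs. Lucene and Solr now support query auto-suggest and spell-checking capabilities that leverage FSAs. And Lucene's default Codec significantly reduces disk and memory footprint by using FSAs under the hood to store the term dictionary (the structure that Lucene uses to look up query terms during a search). FSAs have many uses in language processing, so you might also find Lucene's FSA capabilities to be instructive for other applications.

Figure 6 shows an FSA that's built from http://examples.mikemccandless.com/fst.py using the words mop, pop, moth, star, stop, and top, along with associated weights. From the example, you can imagine starting with an input such as moth, breaking it down into its characters (m-o-t-h), and then following the arcs in the FSA.

Figure 6. Example of an FSA

Illustration of an FSA from http://examples.mikemccandless.com/fst.py

Listing 5 (excerpted from the FSAExamples.java file in this article's sample code download) shows a simple example of building your own FSA by using Lucene's API:

Listing 5. Example of a simple Lucene automaton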

String[] words = {"hockey", "hawk", "puck", "text", "textual", "anachronism", "anarchy"};
Collection<BytesRef> strings = new ArrayList<BytesRef>();
for (String word : words) {
  strings.add(new BytesRef(word));

}
//build up a simple automaton out of several words
Automaton automaton = BasicAutomata.makeStringUnion(strings);
CharacterRunAutomaton run = new CharacterRunAutomaton(automaton);
System.out.println("Match: " + run.run("hockey"));
System.out.println("Match: " + run.run("ha"));

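In Listing 5, I build an Automaton out of various words and feed it into a RunAutomaton. As the name implies, a RunAutomaton runs input through the automaton, in this case to match the input strings that are captured in the print statements at the end of Listing 5. Although this example is trivial, it lays the groundwork for understanding much more advanced capabilities that I'll leave to readers to explore (along with DocValues) in the Lucene APIs. (See Related topics for relevant links.)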
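
If you want to see the "follow the arcs" idea without any Lucene dependency, the toy below builds a character trie over the same words as Listing 5 and runs inputs through it. It is a deliberately simplified stand-in, not how Lucene implements automata: a trie shares only common prefixes, whereas a minimal automaton also shares common suffixes. The class name TrieAutomaton is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class TrieAutomaton {
  // One state of the automaton: outgoing arcs keyed by character, plus an accept flag.
  private static final class State {
    final Map<Character, State> arcs = new HashMap<>();
    boolean accepting;
  }

  private final State start = new State();

  // Adds one path of states per word, reusing states for shared prefixes.
  TrieAutomaton(String... words) {
    for (String word : words) {
      State state = start;
      for (char c : word.toCharArray()) {
        state = state.arcs.computeIfAbsent(c, k -> new State());
      }
      state.accepting = true;
    }
  }

  // Follows one arc per input character; accepts only if it ends on an accepting state.
  boolean run(String input) {
    State state = start;
    for (char c : input.toCharArray()) {
      state = state.arcs.get(c);
      if (state == null) {
        return false;  // no arc for this character
      }
    }
    return state.accepting;
  }

  public static void main(String[] args) {
    TrieAutomaton automaton = new TrieAutomaton(
        "hockey", "hawk", "puck", "text", "textual", "anachronism", "anarchy");
    System.out.println("Match: " + automaton.run("hockey"));
    System.out.println("Match: " + automaton.run("ha"));
  }
}
```

As with Listing 5, hockey is accepted and ha is rejected: ha is a prefix of hawk, so the arcs exist, but the state it ends on is not an accepting state.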

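Faceting

At its core, faceting generates counts of document attributes to give users an easy way to narrow down their search results without making them guess which keywords to add to the query. For example, if someone searches a shopping site for televisions, facets tell them how many TV models are made by which manufacturers. Increasingly, faceting is also used to power search-based business analytics and reporting tools. By using more-advanced faceting capabilities, you give users the ability to slice and dice facets in interesting ways.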
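
Stripped of all engine machinery, a facet is a count per attribute value over the documents that matched a query. The sketch below shows that idea for the television example, using made-up manufacturer names and an in-memory list in place of real search results. It says nothing about how Lucene or Solr compute facets internally, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FacetCountSketch {
  // Counts how many matching documents carry each value of a single attribute.
  static Map<String, Integer> facetCounts(List<String> attributeValueOfEachMatch) {
    Map<String, Integer> counts = new TreeMap<>();  // sorted by value for stable output
    for (String value : attributeValueOfEachMatch) {
      counts.merge(value, 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    // Manufacturer of each TV model that matched a hypothetical "television" query.
    List<String> manufacturers = Arrays.asList("Acme", "Globex", "Acme", "Initech", "Acme");
    System.out.println(facetCounts(manufacturers));  // prints {Acme=3, Globex=1, Initech=1}
  }
}
```

A user who clicks the Acme facet would then rerun the search with a filter on that manufacturer, which is what narrowing by facet amounts to.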

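Facets have long been a hallmark of Solr (since version 1.1). Now Lucene has its own faceting module that stand-alone Lucene applications can leverage. Lucene's faceting module isn't as rich in functionality as Solr's, but it does offer some interesting tradeoffs. Lucene's faceting module isn't dynamic, in that you must make some faceting decisions at indexing time. But it is hierarchical, and it doesn't have the cost of dynamically un-inverting fields into memory.

Listing 6 (part of the sample code's FacetExamples.java file) showcases some of Lucene's new faceting capabilities:

Listing 6. Lucene faceting examples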

...
DirectoryTaxonomyWriter taxoWriter = 
     new DirectoryTaxonomyWriter(facetDir, IndexWriterConfig.OpenMode.CREATE);
FacetFields facetFields = new FacetFields(taxoWriter);
for (int i = 0; i < DOC_BODIES.length; i++) {
  String docBody = DOC_BODIES[i];
  String category = CATEGORIES[i];
  Document doc = new Document();
  CategoryPath path = new CategoryPath(category, '/');
  //Setup the fields
  facetFields.addFields(doc, Collections.singleton(path));//just do a single category path
  doc.add(new StringField("id", "id_" + i, Field.Store.YES));
  doc.add(new TextField("body", docBody, Field.Store.YES));
  writer.addDocument(doc);
}
writer.commit();
taxoWriter.commit();
DirectoryReader reader = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
DirectoryTaxonomyReader taxor = new DirectoryTaxonomyReader(taxoWriter);
ArrayList<FacetRequest> facetRequests = new ArrayList<FacetRequest>();
CountFacetRequest home = new CountFacetRequest(new CategoryPath("Home", '/'), 100);
home.setDepth(5);
facetRequests.add(home);
facetRequests.add(new CountFacetRequest(new CategoryPath("Home/Sports", '/'), 10));
facetRequests.add(new CountFacetRequest(new CategoryPath("Home/Weather", '/'), 10));
FacetSearchParams fsp = new FacetSearchParams(facetRequests);

FacetsCollector facetsCollector = FacetsCollector.create(fsp, reader, taxor);
searcher.search(new MatchAllDocsQuery(), facetsCollector);

for (FacetResult fres : facetsCollector.getFacetResults()) {
  FacetResultNode root = fres.getFacetResultNode();
  printFacet(root, 0);
}

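The key pieces in Listing 6 that go beyond normal Lucene indexing and search are in the use of the FacetFields, FacetsCollector, TaxonomyReader, and TaxonomyWriter classes. FacetFields creates the appropriate field entries in the document and works in concert with TaxonomyWriter at indexing time. At search time, TaxonomyReader works with FacetsCollector to get the appropriate counts for each category. Note, also, that Lucene's faceting module creates a secondary index that, to be effective, must be kept in sync with the main index. Run the Listing 6 code by using the same command sequence that you used for the earlier examples, except substitute FacetExamples for IndexingExamples in the java command. You should get: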

Home (0.0)
 Home/Children (3.0)
  Home/Children/Nursery Rhymes (3.0)
 Home/Weather (2.0)
 Home/Sports (2.0)
  Home/Sports/Rock Climbing (1.0)
  Home/Sports/Hockey (1.0)
 Home/Writing (1.0)
 Home/Quotes (1.0)
  Home/Quotes/Yoda (1.0)
 Home/Music (1.0)
  Home/Music/Lyrics (1.0)
...

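Notice that in this particular implementation, I'm not including the counts for the Home facet, because including them can be expensive. That option is supported by setting up the appropriate FacetIndexingParams, which I'm not covering here. Lucene's faceting module has additional capabilities that I'm not covering. I encourage you to explore them, and other new Lucene features that this article doesn't touch on, by checking out the Related topics. And now, on to Solr 4.x.

Solr 4: Search and analytics at scale

From an API perspective, much of Solr 4.x looks and feels the same as previous versions. But 4.x contains numerous enhancements that make it easier to use, and more scalable, than ever. Solr also enables you to answer new types of questions, all while leveraging many of the Lucene enhancements that I just outlined. Other changes are geared toward the developer's getting-started experience. For example, the all-new Solr Reference Guide (see Related topics) provides book-quality documentation of every Solr release (starting with 4.4). And Solr's new schemaless capabilities make it easy to add new data to the index quickly without first needing to define a schema. You'll learn about Solr's schemaless feature in a moment. First, you'll look at some of the new search, faceting, and relevance enhancements in Solr, some of which you saw in action in the Solr Air application.

Search, faceting, and relevance

Several new Solr 4 capabilities are designed to make it easier, on both the indexing side and the search-and-faceting side, to build next-generation data-driven applications. Table 1 summarizes the highlights and includes command and code examples where applicable:

Table 1. Indexing, searching, and faceting highlights in Solr 4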

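Name	Description	Example
Pivot faceting	Gather counts for all of a facet's subfacets, as filtered through the parent facet. See the Solr Air example for more details.	Pivot on a variety of fields: `http://localhost:8983/solr/collection1/travel?&wt=json&facet=true&facet.limit=5&fq=&q=*:*&facet.pivot=Origin,Dest,UniqueCarrier,FlightNum,ArrDelay&indent=true`
New relevance function queries	Access various index-level statistics, such as document frequency and term frequency, as part of a function query.	Add the document frequency for the term `Origin:SFO` to all returned documents: `http://localhost:8983/solr/collection1/travel?&wt=json&q=*:*&fl=*, {!func}docfreq('Origin',%20'SFO')&indent=true` Note that this command also uses the new `DocTransformers` capability.
Joins	Represent more-complex document relationships and then join them at search time. More-complex joins are slated for future releases of Solr.	Return only flights whose originating airport codes appear in the Airport data set (and compare with the results of a request without the join): `http://localhost:8983/solr/collection1/travel?&wt=json&indent=true&q={!join%20from=IATA%20to=Origin}*:*`
`Codec` support	Change the `Codec` for the index and the postings format for individual fields.	Use the `SimpleTextCodec` for a field: `<fieldType name="string_simpletext" class="solr.StrField" postingsFormat="SimpleText" />`
New update processors	Use Solr's Update Processor framework to plug in code that changes documents after they are sent to Solr but before they are indexed.	Field mutating (for example, concatenating fields, parsing numerics, trimming). Scripting: use JavaScript or other code that's supported by the JavaScript engine to process documents (see the update-script.js file in the Solr Air example). Language detection (available in 3.5, but worth mentioning here) for identifying the language (such as English or Japanese) that's used in a document.
Atomic updates	Send in just the parts of a document that have changed, and let Solr take care of the rest.	From the command line, using `cURL`, change the origin of document 243551 to `FOO`: `curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d ' [{"id": "243551","Origin": {"set":"FOO"}}]'`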

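You can run the first three example commands in Table 1 in your browser's address field (not in the Solr Air UI) against the Solr Air demo data.

For more details on relevance functions, joins, and Codec, and on other new Solr 4 features, see Related topics for relevant links to the Solr Wiki and elsewhere.

Scaling, NoSQL, and NRT

Probably the single most significant change to Solr in recent years is that building a multinode scalable search solution has become much simpler. With Solr 4.x, it's easier than ever to scale Solr to be the authoritative storage and access mechanism for billions of records, all while enjoying the search and faceting capabilities that Solr has always been known for. Furthermore, you can rebalance your cluster as your capacity needs change, as well as take advantage of optimistic locking, atomic updates of content, and real-time retrieval of data even if it hasn't been indexed yet. The new distributed capabilities in Solr are referred to collectively as SolrCloud.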

How does SolrCloud work? Documents that are sent to Solr 4 when it's running in (optional) distributed mode are routed according to a hashing mechanism to a node in the cluster (called the leader). The leader is responsible for indexing the document into a shard. A shard is a single index that is served by a leader and zero or more replicas. As an illustration, assume that you have four machines and two shards. When Solr starts, each of the four machines communicates with the other three. Two of the machines are elected leaders, one for each shard. The other two nodes automatically become replicas of one of the shards. If one of the leaders fails for some reason, a replica (in this case the only replica) becomes the leader, thereby guaranteeing that the system still functions properly. You can infer from this example that in a production system enough nodes must participate to ensure that you can handle system outages.
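
The routing step can be pictured with a few lines of Java. This is a toy model under loud assumptions: it hashes the document ID with String.hashCode and takes the remainder modulo the number of shards, which is not SolrCloud's actual routing scheme (Solr uses its own hash function and assigns each shard a range of hash values). The class and method names are hypothetical. What the sketch does share with the real mechanism is the key property: the shard is a pure function of the document ID, so every node can work out where a document belongs without asking a central coordinator.

```java
class ShardRouterSketch {
  // Maps a document ID to a shard index in the range [0, numShards).
  // floorMod keeps the result non-negative even when hashCode() is negative.
  static int shardFor(String docId, int numShards) {
    return Math.floorMod(docId.hashCode(), numShards);
  }

  public static void main(String[] args) {
    int numShards = 2;
    for (String id : new String[] {"243551", "243552", "id1"}) {
      System.out.println(id + " -> shard" + (shardFor(id, numShards) + 1));
    }
  }
}
```

One consequence of plain modulo hashing is that changing the number of shards moves most documents to a different shard, which is one reason real systems prefer hash ranges that can be split.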

To see SolrCloud in action, you can launch a two-node, two-shard system by running the start-solr.sh script that you used in the Solr Air example with a -z flag. From the *NIX command line, first shut down your old instance:

kill -9 PROCESS_ID

Then restart the system:

bin/start-solr.sh -c -z

Apache Zookeeper

Zookeeper is a distributed coordination system that's designed to elect leaders, establish a quorum, and perform other tasks to coordinate the nodes in a cluster. Thanks to Zookeeper, a Solr cluster never suffers from "split-brain" syndrome, whereby part of the cluster behaves independently of the rest of the cluster as the result of a partitioning event. See Related topics to learn more about Zookeeper.

The -c flag erases the old index. The -z flag tells Solr to start up with an embedded version of Apache Zookeeper.

Point your browser at the SolrCloud admin page, http://localhost:8983/solr/#/~cloud, to verify that two nodes are participating in the cluster. You can now re-index your content, and it will be spread across both nodes. All queries to the system are also automatically distributed. You should get the same number of hits for a match-all-documents search against two nodes that you got for one node.

The start-solr.sh script launches Solr with the following command for the first node:

java -Dbootstrap_confdir=$SOLR_HOME/solr/collection1/conf 
-Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar

The script tells the second node where Zookeeper is:

java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar

Embedded Zookeeper is great for getting started, but to ensure high availability and fault tolerance for production systems, set up a stand-alone set of Zookeeper instances in your cluster.

Stacked on top of the SolrCloud capabilities is support for NRT and many NoSQL-like functions, such as:

Optimistic locking
Atomic updates
Real-time gets (retrieving a specific document before it is committed)
Transaction-log-backed durability

Many of the distributed and NoSQL functions in Solr — such as automatic versioning of documents and transaction logs — work out of the box. For a few other features, the descriptions and examples in Table 2 will be helpful:

Table 2. Summary of distributed and NoSQL features in Solr 4

Name	Description	Example
Realtime get	Retrieve a document, by ID, regardless of its state of indexing or distribution.	Get the document whose ID is 243551: `http://localhost:8983/solr/collection1/get?id=243551`
Shard splitting	Split your index into smaller shards so they can be migrated to new nodes in the cluster.	Split `shard1` into two shards: `http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1`
NRT	Use NRT to search for new content much more quickly than in previous versions.	Turn on `<autoSoftCommit>` in your solrconfig.xml file. For example: `<autoSoftCommit> <maxTime>5000</maxTime> </autoSoftCommit>`
Document routing	Specify which documents live on which nodes.	Ensure that all of a user's data is on certain machines. Read Joel Bernstein's blog post (see Related topics).
Collections	Create, delete, or update collections as needed, programmatically, using Solr's new collections API.	Create a new collection named `hockey`: `http://localhost:8983/solr/admin/collections?action=CREATE&name=hockey&numShards=2`

Going schemaless

Schemaless: Marketing hype?

Data collections rarely lack a schema. Schemaless is a marketing term that's derived from a data-ingestion engine's ability to react appropriately to the data "telling" the engine what the schema is, instead of the engine specifying the form that the data must take. For instance, Solr can accept JSON input and can index content appropriately on the basis of the schema that's implicitly defined in the JSON. As someone pointed out to me on Twitter, less schema is a better term than schemaless, because you define the schema in one place (such as a JSON document) instead of two (such as a JSON document and Solr).

Based on my experience, in the vast majority of cases you should not use schemaless in a production system unless you enjoy debugging errors at 2 am when your system thinks it has one type of data and in reality has another.

Solr's schemaless functionality enables clients to add content rapidly without the overhead of first defining a schema.xml file. Solr examines the incoming data and passes it through a cascading set of value parsers. The value parsers guess the data's type and then automatically add the fields to the internal schema and add the content to the index.

A typical production system (with some exceptions) shouldn't use schemaless, because the value guessing isn't always perfect. For instance, the first time Solr sees a new field, it might identify the field as an integer and thus define an integer FieldType in the underlying schema. But you may discover three weeks later that the field is useless for searching because the rest of the content that Solr sees for that field consists of floating-point values.
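
The guessing problem is easy to reproduce with a toy cascade of value parsers. The sketch below tries a boolean, then a whole number, then a floating-point number, and falls back to a string. The parser order, the type names, and the class itself are assumptions made for illustration; they are not Solr's actual parser chain or field types.

```java
class TypeGuessSketch {
  // Returns the first type whose parser accepts the raw value.
  static String guessType(String raw) {
    if (raw.equals("true") || raw.equals("false")) {
      return "boolean";
    }
    try {
      Long.parseLong(raw);
      return "long";
    } catch (NumberFormatException ignored) {
      // not a whole number; try the next parser
    }
    try {
      Double.parseDouble(raw);
      return "double";
    } catch (NumberFormatException ignored) {
      // not a number at all; fall through to string
    }
    return "string";
  }

  public static void main(String[] args) {
    // The first value seen for a field decides the field's type...
    System.out.println("first value 1    -> " + guessType("1"));
    // ...but a later value for the same field may call for a different type.
    System.out.println("later value 2.5  -> " + guessType("2.5"));
    System.out.println("team name        -> " + guessType("Carolina Hurricanes"));
  }
}
```

If the field type is fixed from the first value (a whole number), the later value 2.5 no longer fits it, which is the three-weeks-later surprise described above.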

However, schemaless is especially helpful for early-stage development or for indexing content whose format you have little to no control over. For instance, Table 2 includes an example of using the collections API in Solr to create a new collection:

http://localhost:8983/solr/admin/collections?action=CREATE&name=hockey&numShards=2

After you create the collection, you can use schemaless to add content to it. First, though, take a look at the current schema. As part of implementing schemaless support, Solr also added Representational State Transfer (REST) APIs for accessing the schema. You can see all of the fields defined for the hockey collection by pointing your browser (or cURL on the command line) at http://localhost:8983/solr/hockey/schema/fields. You see all of the fields from the Solr Air example. The schema uses those fields because the create option used my default configuration as the basis for the new collection. You can override that configuration if you want. (A side note: The setup.sh script that's included in the sample code download uses the new schema APIs to create all of the field definitions automatically.)

To add to the collection by using schemaless, run:

bin/schemaless-example.sh

The following JSON is added to the hockey collection that you created earlier:

[
    {
        "id": "id1",
        "team": "Carolina Hurricanes",
        "description": "The NHL franchise located in Raleigh, NC",
        "cupWins": 1
    }
]

As you know from examining the schema before you added this JSON to the collection, the team, description, and cupWins fields are new. When the script ran, Solr guessed their types automatically and created the fields in the schema. To verify, refresh the results at http://localhost:8983/solr/hockey/schema/fields. You should now see team, description, and cupWins all defined in the list of fields.

Spatial (not just geospatial) improvements

Solr's longstanding support for point-based spatial searching enables you to find all documents that are within some distance of a point. Although Solr supports this approach in an n-dimensional space, most people use it for geospatial search (for example, find all restaurants near my location). But until now, Solr didn't support more-involved spatial capabilities such as indexing polygons or performing searches within indexed polygons. Some of the highlights of the new spatial package are:

Support through the Spatial4J library (see Related topics) for many new spatial types (such as rectangles, circles, lines, and arbitrary polygons) and support for the Well Known Text (WKT) format
Multivalued indexed fields, which you can use to encode multiple points into the same field
Configurable precision that gives the developer more control over accuracy versus computation speed
Fast filtering of content
Query support for IsWithin, Contains, and IsDisjointTo
Optional support for the Java Topology Suite (JTS) (see Related topics)
Lucene APIs and artifacts

The schema for the Solr Air application has several field types that are set up to take advantage of this new spatial functionality. I defined two field types for working with the latitude and longitude of the airport data:

<fieldType name="location_jts" class="solr.SpatialRecursivePrefixTreeFieldType" 
distErrPct="0.025" spatialContextFactory=
"com.spatial4j.core.context.jts.JtsSpatialContextFactory" 
maxDistErr="0.000009" units="degrees"/>

<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType" 
distErrPct="0.025" geo="true" maxDistErr="0.000009" units="degrees"/>

The location_jts field type explicitly uses the optional JTS integration to define a point, and the location_rpt field type doesn't. If you want to index anything more complex than simple rectangles, you need to use the JTS version. The fields' attributes help to define the system's accuracy. These attributes are required at indexing time because Solr, via Lucene and Spatial4j, encodes the data in multiple ways to ensure that the data can be used efficiently at search time. For your applications, you'll likely want to run some tests with your data to determine the tradeoffs to make in terms of index size, precision, and query-time performance.

In addition, the near query that's used in the Solr Air application uses the new spatial-query syntax (IsWithin on a Circle) for finding airports near the specified origin and destination airports.

New administration UI

In wrapping up this section on Solr, I would be remiss if I didn't showcase the much more user-friendly and modern Solr admin UI. The new UI not only cleans up the look and feel but also adds new functionality for SolrCloud, document additions, and much more.

For starters, when you first point your browser at http://localhost:8983/solr/#/, you should see a dashboard that succinctly captures much of the current state of Solr: memory usage, working directories, and more, as in Figure 7:

Figure 7. Example Solr dashboard

Screen capture of an example Solr dashboard

If you select Cloud in the left side of the dashboard, the UI displays details about SolrCloud. For example, you get in-depth information about the state of configuration, live nodes, and leaders, as well as visualizations of the cluster topology. Figure 8 shows an example. Take a moment to work your way through all of the cloud UI options. (You must be running in SolrCloud mode to see them.)

Figure 8. Example SolrCloud UI

Screen capture of a SolrCloud UI example

The last area of the UI to cover that's not tied to a specific core/collection/index is the Core Admin set of screens. These screens provide point-and-click control over the administration of cores, including adding, deleting, reloading, and swapping cores. Figure 9 shows the Core Admin UI:

Figure 9. Example of Core Admin UI

Screen capture of the core Solr admin UI

By selecting a core from the Core list, you access an overview of information and statistics that are specific to that core. Figure 10 shows an example:

Figure 10. Example core overview

Screen capture of a core overview example in the Solr UI

Most of the per-core functionality is similar to the pre-4.x UI's functionality (albeit in a much more pleasant way), with the exception of the Documents option. You can use the Documents option to add documents in various formats (JSON, CSV, XML, and others) to the collection directly from the UI, as Figure 11 shows:

Figure 11. Example of adding a document from the UI

Screen capture from the Solr UI that shows a JSON document being added to a collection

You can even upload rich document types such as PDF and Word. Take a moment to add some documents into your index or browse the other per-collection capabilities such as the Query interface or the revamped Analysis screen.

The road ahead

Next-generation search-engine technology gives users the power to decide what to do with their data. This article gave you a good taste of what Lucene and Solr 4 are capable of, and, I hope, a broader sense of how search engines solve non-text-based search problems that involve analytics and recommendations.

Lucene and Solr are in constant motion, thanks to a large sustaining community that's backed by more than 30 committers and hundreds of contributors. The community is actively developing two main branches: the current officially released 4.x branch and the trunk branch, which represents the next major (5.x) release. On the official release branch, the community is committed to backward compatibility and an incremental approach to development that focuses on easy upgrades of current applications. On the trunk branch, the community is a bit less restricted in terms of ensuring compatibility with previous releases. If you want to try out the cutting edge in Lucene or Solr, check the trunk branch of the code out from Subversion or Git (see Related topics). Whichever path you choose, you can take advantage of Lucene and Solr for powerful search-based analytics that go well beyond plain text search.

Acknowledgments

Thanks to David Smiley, Erik Hatcher, Yonik Seeley, and Mike McCandless for their help.

Translated from: https://www.ibm.com/developerworks/opensource/library/j-solr-lucene/index.html