Using Solarium with SOLR for Search - Setup

culi4814

Original article: https://www.sitepoint.com/using-solarium-solr-search-setup/

Apache’s SOLR is an enterprise-level search platform based on Apache Lucene. It provides a powerful full-text search along with advanced features such as faceted search, result highlighting and geospatial search. It’s extremely scalable and fault tolerant.

Well-known websites said to use SOLR to power their search functions include Digg, Netflix, Instagram and Whitehouse.gov (source).

While SOLR is written in Java, it’s accessible via HTTP, making it possible to integrate with whatever programming language you prefer. If you’re using PHP then the Solarium Project makes integration even easier, providing a level of abstraction over the underlying requests which enables you to use SOLR as if it were a native implementation running within your application.

In this series, I’m going to introduce both SOLR and Solarium side-by-side. We’ll begin by installing and configuring SOLR and creating a search index. Then, we’ll look at how to index documents. Next, we’ll implement a basic search and then expand it with some more advanced features such as faceted search, result highlighting and suggestions.

As we go along, we’re going to build a simple application for searching a collection of movies. You can grab the source code here, or see an online demo here.

Basic Concepts and Operation

Before we delve into the implementation, it’s worth looking at a few basic concepts and an overall view of what will happen.

SOLR is a Java application which runs as a web service, typically in a servlet container such as Tomcat, Glassfish or JBoss. You can manipulate and query it over HTTP using XML, JSON, CSV or a binary format, so you can use any programming language for your application. However, the Solarium library provides a level of abstraction, allowing you to call methods as if SOLR were a native implementation. For the purposes of this tutorial we’ll be running SOLR on the same machine as our application, but in practice it could be located on a separate server.
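To make the HTTP interface concrete, here is a minimal Python sketch that builds the URL for a search request (Python is used purely for illustration, since any language will do). It assumes SOLR’s standard select handler with its q and wt parameters, a core named collection1 (the directory name used in the example configuration) and a title field like the one defined later in this article; adjust all of these to your own setup.

```python
from urllib.parse import urlencode

# Assumed defaults: a local SOLR instance and the example core name.
SOLR_BASE_URL = "http://localhost:8983/solr"
CORE_NAME = "collection1"


def build_select_url(query, response_format="json"):
    """Return the URL of a search request against the core's select handler."""
    # urlencode takes care of escaping characters such as ":" in the query.
    params = urlencode({"q": query, "wt": response_format})
    return f"{SOLR_BASE_URL}/{CORE_NAME}/select?{params}"
```

Once a core is running, a URL such as the one returned by build_select_url("title:introduction") could be opened in a browser or fetched with any HTTP client.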

SOLR creates a search index of documents. Often that mirrors what we might consider a document in real life: an article, a blog post or even a full book. However, a document can also represent any object applicable to your application – a product, a place, an event – or in our example application, a movie.

At its most basic, SOLR allows you to perform full-text searches on documents. Think search engines; you’ll typically search for a keyword, a phrase or a full title. You can only get so far with SQL’s LIKE clause; that’s where full-text search comes in.

You can also attach additional information to an indexed search document that doesn’t necessarily get picked up by a text-based search; for example, you can incorporate the price of a product, the number of rooms in a property or the date an item was added to the database.

Facets are one of the most useful features of SOLR. You’ll probably have seen faceted search if you’ve ever shopped online; facets allow you to “drill down” search results by applying “filters”. For example, having searched an online bookstore you might use filters to limit the results to those books by a particular author, in a particular genre or in a particular format.
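The idea behind facets can be sketched in a few lines of Python. This is only an in-memory illustration of counting and filtering, not how SOLR implements faceting; the function names and the sample books are made up for the example.

```python
from collections import Counter


def facet_counts(docs, field):
    """Count how many documents carry each value of the given field."""
    counts = Counter()
    for doc in docs:
        value = doc.get(field)
        if value is None:
            continue
        # Treat single values and lists of values uniformly.
        counts.update(value if isinstance(value, list) else [value])
    return counts


def apply_filter(docs, field, wanted):
    """Keep only the documents whose field contains the wanted value."""
    kept = []
    for doc in docs:
        value = doc.get(field)
        values = value if isinstance(value, list) else [value]
        if wanted in values:
            kept.append(doc)
    return kept


# Made-up search results, for illustration only.
books = [
    {"title": "Book One", "author": "Author A", "format": "paperback"},
    {"title": "Book Two", "author": "Author A", "format": "ebook"},
    {"title": "Book Three", "author": "Author B", "format": "ebook"},
]
```

Here facet_counts(books, "format") gives the numbers you would show next to each filter, and apply_filter(books, "format", "ebook") performs the drill-down when a filter is chosen.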

An instance of SOLR runs with one or more cores. A core is a collection of configuration and indexes, each with its own schema. Typically, a single instance is specific to a particular application. Since different types of content can have very different structures and information – for example, consider the difference between a product, an article and a user – an application often has multiple cores within a SOLR instance.

Installing SOLR

I’m going to provide instructions for how to set up SOLR on a Mac; for other operating systems, consult the documentation – or alternatively, you can download Blaze, an appliance with SOLR pre-installed.

The easiest way to install SOLR on a Mac is to use Homebrew:

brew update
brew install solr

This will install the software in a directory such as /usr/local/Cellar/solr/4.5.0, depending on what version of the software you’re using.

To start the server using the provided Java archive (JAR):

cd /usr/local/Cellar/solr/4.5.0/libexec/example
java -jar start.jar

To verify that the installation is successful, try accessing the admin interface in your web browser:

http://localhost:8983/solr/

If you see an admin dashboard with the Apache SOLR logo top-left, the server is up and running.
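If you would rather check from a script than from a browser, something like the following Python sketch could do the job. It assumes only that the admin URL above answers with HTTP 200 when the server is up; the function name and the open_url parameter (there so the check can be exercised without a running server) are inventions for this example.

```python
import urllib.error
import urllib.request


def solr_is_up(base_url="http://localhost:8983/solr/", timeout=3.0,
               open_url=urllib.request.urlopen):
    """Return True if the admin URL answers with HTTP 200, False otherwise."""
    try:
        with open_url(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts and HTTP errors all count as "not up".
        return False
```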

TIP: to stop SOLR – which you’ll need to do whenever you change the configuration, as we’re about to do shortly – simply press CTRL + C.

(Linux instructions: http://www.lullabot.com/blog/article/installing-solr-use-drupal)

Setting Up the Schema

Probably the easiest way to get started with SOLR is to copy the default directory, then customize it.

Copy the solr directory from libexec/example; here, we’re creating a new SOLR core called “movies”:

cd /usr/local/Cellar/solr/4.5.0/libexec/example
cp -R solr movies

We’ll look at the configuration files, movies/solr.xml and movies/collection1/conf/solrconfig.xml, later on. For now, what we’re really interested in is the schema, which defines the fields on the documents we’re indexing, along with how they’re handled.

The file that defines this is movies/collection1/conf/schema.xml.

If you open up the one you’ve just copied over you’ll see that it not only contains some useful defaults, but it’s also extensively commented to help you understand how to customize it.

The schema configuration file is responsible for two primary aspects: fields and types. Types are simply data types, and under the hood they map type names – such as integers, dates and strings – to the underlying Java classes used in the implementation, for example solr.TrieIntField, solr.TrieDateField and solr.TextField. The types configuration also defines the behavior of tokenizers, analyzers and filters.

Here are some examples of basic types:

<fieldType name="string"    class="solr.StrField"  sortMissingLast="true" omitNorms="true" />
<fieldType name="long"      class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="int"       class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>

The string type warrants a closer look, because there’s a gotcha. When a field is of type string, data is stored exactly as you enter it, and a query must be identical to the stored value in order to match. For example, suppose you defined an article title as a string and inserted a document entitled “An Introduction to SOLR”. In any proper search implementation you’d expect to find the article with a query such as “SOLR introduction”, not to mention “an introduction to Solr”, yet neither would match. Exact matching is actually useful in some cases, such as faceted search; where you don’t want it, you can use a text type that combines tokenizers and filters.

Tokenizers split text into chunks – usually separate words. Filters transform text in some way. To illustrate, let’s look at a sensible default for text:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

First, you’ll notice that we’re defining behavior at indexing time – in other words, how data is transformed when you add a document – and at query time. In this example, the LowerCaseFilterFactory converts data to lower case both as it’s indexed and when it’s queried, so capitalization becomes irrelevant and we can do a like-for-like comparison. In our example, “introduction” will match “Introduction”, and “SOLR” will match “Solr”.

StopFilterFactory is used to strip out stop words: common words such as “a”, “the” and “and”, which are excluded either because they’re not relevant to search or for efficiency. There’s a good, exhaustive list of stop words here. In the code above the stop words are configured in a separate text file.
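To see why this chain makes “SOLR introduction” match “An Introduction to SOLR”, here is a rough Python imitation of it. It is a simplification under stated assumptions: a regular expression stands in for the standard tokenizer, a small hard-coded set stands in for stopwords.txt, the synonym filter is left out, and a query is taken to match only when every one of its terms is found.

```python
import re

# Stand-in for stopwords.txt; the real file is configurable.
STOP_WORDS = {"a", "an", "and", "the", "to"}


def analyze(text):
    """Tokenize, drop stop words (ignoring case), then lowercase."""
    tokens = re.findall(r"\w+", text)
    tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return [t.lower() for t in tokens]


def matches(query, indexed_text):
    """True if every analyzed query term appears in the analyzed text."""
    indexed_terms = set(analyze(indexed_text))
    return all(term in indexed_terms for term in analyze(query))
```

With this in place, analyze("An Introduction to SOLR") leaves just the two terms that matter, which is why word order, capitalization and stop words in the query no longer get in the way, whereas a plain string comparison of the same query and title fails.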

The fields section is used to define the available fields, their types and additional information such as whether they have multiple values, if they should be indexed and more.

We’re not going to try modifying or extending the type definitions – that’s outside the scope of this tutorial – but instead what we’re going to look at are the field definitions.

Broadly speaking, there are two approaches to defining the structure of your documents. The first is to explicitly define all the possible fields. The second is to use dynamic fields, which enable you to add fields on the fly, provided you adhere to certain naming conventions. For example, given the following dynamic field definition:

<dynamicField name="*_s"  type="string"  indexed="true"  stored="true" />

…you could add, say, a single-valued field named author_s to the document and it will be stored as a string, without having pre-configured it. (Note: the “author” and “s” parts are entirely separate, so don’t read it as the plural “authors”.) The following defines a multi-valued string field:

<dynamicField name="*_ss" type="string"  indexed="true"  stored="true" multiValued="true"/>

…so, for example, you could add categories by using categories_ss.

If you attempt to add a document to the search index which contains fields that haven’t been explicitly defined – or, if you’re using dynamic fields, that don’t adhere to these naming conventions – then SOLR will produce an error. To alter this behavior, locate and uncomment (or add) the following line:

<dynamicField name="*" type="ignored" multiValued="true" />

This line indicates that fields which haven’t been previously defined should be silently ignored instead of generating an error. Because they’re ignored, however, note that they will be neither indexed nor stored, so they won’t have any impact on any searches you might run.
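The naming conventions can be illustrated with a short Python sketch that resolves a field name against the three dynamic field rules shown above. It is a simplification rather than SOLR’s actual lookup: explicitly defined fields are checked first, then the patterns in the order listed, with the catch-all last. The function and variable names are hypothetical.

```python
from fnmatch import fnmatchcase

# The dynamic field rules from the snippets above, catch-all last.
DYNAMIC_FIELDS = [
    ("*_s", {"type": "string", "multiValued": False}),
    ("*_ss", {"type": "string", "multiValued": True}),
    ("*", {"type": "ignored", "multiValued": True}),
]


def resolve_field(name, explicit_fields, dynamic_fields=DYNAMIC_FIELDS):
    """Return the definition that applies to a field name, or None."""
    if name in explicit_fields:
        return explicit_fields[name]
    for pattern, definition in dynamic_fields:
        if fnmatchcase(name, pattern):
            return definition
    # No rule applies: this is the case in which SOLR would report an error.
    return None
```

Dropping the last entry from the list reproduces the default behavior, where a name such as author (without a recognized suffix) resolves to nothing at all.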

For the purposes of this tutorial, we’re going to explicitly define the fields we want.

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="synopsis" type="text_general" indexed="true" stored="true" omitNorms="true"/>
<field name="rating" type="string" indexed="true" stored="true" />
<field name="cast" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="year" type="int" indexed="true" stored="true" />
<field name="runtime" type="int" indexed="true" stored="true" />

The following line is required by SOLR:

<field name="_version_" type="long" indexed="true" stored="true"/>

The following field isn’t used; however, because there are a number of references to it in solrconfig.xml, it’s a good idea to leave it in for now:

<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

We also need to specify which field is the unique identifier – think primary key in SQL terminology – as follows:

<uniqueKey>id</uniqueKey>
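To picture what a document that fits this schema looks like, here is a small Python sketch that mirrors the explicit field definitions in a table and checks a document against it. The checks are deliberately simpler and stricter than SOLR’s own handling, and the movie is made up; treat it as a way of reading the schema, not as a client for it.

```python
# Field table mirroring the explicit <field> definitions above.
SCHEMA = {
    "id": {"type": str, "required": True, "multi_valued": False},
    "title": {"type": str, "required": False, "multi_valued": False},
    "synopsis": {"type": str, "required": False, "multi_valued": False},
    "rating": {"type": str, "required": False, "multi_valued": False},
    "cast": {"type": str, "required": False, "multi_valued": True},
    "year": {"type": int, "required": False, "multi_valued": False},
    "runtime": {"type": int, "required": False, "multi_valued": False},
}


def validate(doc):
    """Return a list of problems found in the document (empty if none)."""
    errors = []
    for name, spec in SCHEMA.items():
        if spec["required"] and name not in doc:
            errors.append(f"missing required field: {name}")
    for name, value in doc.items():
        spec = SCHEMA.get(name)
        if spec is None:
            errors.append(f"unknown field: {name}")
            continue
        if isinstance(value, list) and not spec["multi_valued"]:
            errors.append(f"field is not multi-valued: {name}")
            continue
        values = value if isinstance(value, list) else [value]
        if not all(isinstance(v, spec["type"]) for v in values):
            errors.append(f"wrong type for field: {name}")
    return errors


# A made-up movie, for illustration only.
example_movie = {
    "id": "1",
    "title": "An Example Movie",
    "cast": ["First Actor", "Second Actor"],
    "year": 2000,
    "runtime": 90,
}
```

The id field plays the role of the unique key here: it is the only required field, and it must hold a single value.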

Now we need to tell SOLR which configuration to use. Stop the server if it’s currently running (CTRL + C), and this time run it with the -Dsolr.solr.home flag:

cd /usr/local/Cellar/solr/4.5.0/libexec/example
java -jar start.jar -Dsolr.solr.home=movies

Summary

That’s it for the first part, where we’ve started to look at SOLR and Solarium. We’ve got SOLR installed, and a schema set up. In the next part we’ll set up our application along with Solarium and index some data.
