Comparison of full-text search engines - Lucene, Sphinx, PostgreSQL, MySQL?

I'm building a Django site and I am looking for a search engine.

A few candidates:

  • Lucene / Lucene with Compass / Solr

  • Sphinx

  • PostgreSQL built-in full-text search

  • MySQL built-in full-text search

Selection criteria:

  • result relevance and ranking
  • searching and indexing speed
  • ease of use and ease of integration with Django
  • resource requirements - the site will be hosted on a VPS, so ideally the search engine wouldn't require a lot of RAM and CPU
  • scalability
  • extra features such as "did you mean?", related searches, etc.

Anyone who has had experience with the search engines above, or other engines not in the list -- I would love to hear your opinions.

EDIT: As for indexing needs, as users keep entering data into the site, that data would need to be indexed continuously. It doesn't have to be real time, but ideally new data would show up in the index with no more than a 15 - 30 minute delay.


Reply #1

Reference: https://stackoom.com/question/35nX/全文搜索引擎的比较-Lucene-Sphinx-Postgresql-MySQL


Reply #2

I'm looking at PostgreSQL full-text search right now, and it has all the right features of a modern search engine: really good extended character and multilingual support, and nice tight integration with text fields in the database.

But it doesn't have user-friendly search operators like + or AND (it uses & | !), and I'm not thrilled with how it works on their documentation site. While it has bolding of match terms in the results snippets, the default algorithm for choosing which match terms to show is not great. Also, if you want to index RTF, PDF, or MS Office files, you have to find and integrate a file format converter.
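The operator gap can be bridged in application code before the query reaches the database. The following is a minimal sketch, assuming a deliberately tiny grammar (bare words, AND, OR, NOT, and a leading + or -) with no parentheses or phrases; the function name `to_tsquery_text` is hypothetical. Newer PostgreSQL releases also ship `websearch_to_tsquery`, which may make this unnecessary.

```python
import re

def to_tsquery_text(user_query: str) -> str:
    """Translate a small user-friendly syntax into to_tsquery operators (& | !)."""
    parts = []         # output tokens: terms and operators, in order
    pending_op = None  # operator to emit before the next term
    negate = False     # whether the next term should be prefixed with !
    for tok in user_query.split():
        upper = tok.upper()
        if upper == "AND":
            pending_op = "&"
        elif upper == "OR":
            pending_op = "|"
        elif upper == "NOT":
            negate = True
        else:
            if tok.startswith("-"):
                negate = True
            # Drop +, -, and any punctuation that to_tsquery would treat as syntax.
            word = re.sub(r"\W+", "", tok)
            if not word:
                continue
            if parts:
                # Terms with no explicit operator are ANDed together.
                parts.append(pending_op or "&")
            parts.append(("!" if negate else "") + word)
            pending_op, negate = None, False
    return " ".join(parts)

print(to_tsquery_text("django AND search"))
print(to_tsquery_text("+solr -sphinx"))
```

The returned string is meant to be passed as a bound parameter, for example to `to_tsquery('english', %s)`, rather than pasted into SQL.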

OTOH, it's way better than the MySQL text search, which doesn't even index words of three letters or fewer. It's the default for the MediaWiki search, and I really think it's no good for end-users: http://www.searchtools.com/analysis/mediawiki-search/

In all cases I've seen, Lucene/Solr and Sphinx are really great. They're solid code and have evolved with significant improvements in usability, so the tools are all there to make search that satisfies almost everyone.

For SHAILI: Solr includes the Lucene search code library and has the components to be a nice stand-alone search engine.


Reply #3

SearchTools-Avi said "MySQL text search, which doesn't even index words of three letters or fewer."

FYI, the MySQL fulltext minimum word length has been adjustable since at least MySQL 5.0. Google 'mysql fulltext min length' for simple instructions.

That said, MySQL fulltext has limitations: for one, it gets slow to update once you reach a million records or so, ...


Reply #4

I would add mnoGoSearch to the list. It is an extremely performant and flexible solution that works like Google: an indexer fetches data from multiple sites, and you can use the basic criteria or write your own hooks to get the best search quality. It can also fetch the data directly from the database.

The solution is not widely known today, but it covers most needs. You can compile and install it either on a standalone server or on your main server; it doesn't need as many resources as Solr, as it's written in C and runs perfectly even on small servers.

At the start you need to compile it yourself, so it requires some knowledge. I made a tiny script for Debian, which could help. Any adjustments are welcome.

As you are using the Django framework, you could either put a PHP client in the middle or find a solution in Python; I saw some articles.

And, of course, mnoGoSearch is open source, under the GNU GPL.


Reply #5

Just my two cents on this very old question. I would highly recommend taking a look at ElasticSearch.

Elasticsearch is a search server based on Lucene. It provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents. Elasticsearch is developed in Java and is released as open source under the terms of the Apache License.

The advantages over other FTS (full text search) engines are:

  • RESTful interface
  • Better scalability
  • Large community
  • Built by Lucene developers
  • Extensive documentation
  • Many open source libraries are available (including for Django)

We are using this search engine in our project and are very happy with it.
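To give a feel for the RESTful interface, here is a small sketch that builds (but does not send) a search request using only the Python standard library. It assumes a node on the default port 9200; the index name `notes` and the field `body` are hypothetical, and the helper name `build_search_request` is made up for illustration.

```python
import json
import urllib.request

def build_search_request(base_url, index, field, text):
    """Prepare a POST to the _search endpoint with a simple match query."""
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical index and field names; nothing is sent over the network here.
req = build_search_request("http://localhost:9200", "notes", "body", "full text search")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would execute it against a running node; in a Django project one of the client libraries mentioned above would normally handle this instead.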


Reply #6

Apache Solr


Apart from answering the OP's queries, let me share some insights on Apache Solr, from a simple introduction to detailed installation and implementation.

Simple Introduction


Anyone who has had experience with the search engines above, or other engines not in the list -- I would love to hear your opinions.

Solr shouldn't be used to solve real-time problems. For search, though, Solr is well up to the job and works flawlessly.

Solr works fine on high-traffic web applications (I read somewhere that it is not suited for this, but I stand by my statement). It leans on RAM, not CPU.

  • result relevance and ranking

Boosting lets you push selected results toward the top of the ranking. Say you're searching for the name john in the fields firstname and lastname, and you want matches in firstname to carry more weight; then you boost the firstname field as shown.

http://localhost:8983/solr/collection1/select?q=firstname:john^2+OR+lastname:john

As you can see, the firstname field is boosted by a factor of 2.
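A boosted query like this can also be assembled in code. The sketch below is one way to do it with the standard library; the function name `boosted_name_query` is hypothetical. Both clauses are joined with OR inside a single q parameter and URL-encoded, so that characters such as ^, spaces, and & cannot be misread as URL syntax.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SOLR_SELECT = "http://localhost:8983/solr/collection1/select"

def boosted_name_query(name, boost=2):
    """Build a select URL that weights firstname matches above lastname matches."""
    # Note: `name` is not escaped for Solr query syntax here; real user input
    # would need its special characters escaped first.
    q = f"firstname:{name}^{boost} OR lastname:{name}"
    return f"{SOLR_SELECT}?{urlencode({'q': q, 'wt': 'json'})}"

url = boosted_name_query("john")
print(url)
# Decoding the URL again shows the q parameter arrives at Solr intact.
print(parse_qs(urlsplit(url).query)["q"][0])
```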

More on SolrRelevancy

  • searching and indexing speed

Search is unbelievably fast, with no compromise on that. It is the reason I moved to Solr.

Regarding indexing speed, Solr can also handle JOINs from your database tables. Larger and more complex JOINs do slow indexing down. However, a generous RAM configuration can easily handle this.

The more RAM, the faster Solr indexes.

  • ease of use and ease of integration with Django

I have never attempted to integrate Solr and Django; however, you can do that with Haystack. I found an interesting article on the subject, and here's the GitHub repository for it.
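For orientation, pointing Haystack at a Solr core is mostly a matter of a settings entry. The fragment below is a sketch and has not been tested against the setup in this answer; it assumes django-haystack and its Solr client dependency are installed, and the URL (a local Solr with a core named collection1) is an assumption.

```python
# settings.py (fragment)
INSTALLED_APPS = [
    # ... the project's own apps go here ...
    "haystack",
]

HAYSTACK_CONNECTIONS = {
    "default": {
        # Solr backend shipped with django-haystack.
        "ENGINE": "haystack.backends.solr_backend.SolrEngine",
        # Assumed location of the Solr core; adjust to the actual deployment.
        "URL": "http://127.0.0.1:8983/solr/collection1",
    },
}
```

Haystack then expects each app to declare what it indexes in a search_indexes.py module, and the index is typically populated with its rebuild_index management command.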

  • resource requirements - site will be hosted on a VPS, so ideally the search engine wouldn't require a lot of RAM and CPU

Solr thrives on RAM, so if you have plenty of RAM, you don't have to worry about Solr.

Solr's RAM usage shoots up during a full index if you have records in the billions; you can use delta imports to handle this. As explained, Solr is only a near real-time solution.

  • scalability

Solr is highly scalable. Have a look at SolrCloud. Some of its key features:

  • Shards (sharding is the concept of distributing the index among multiple machines, say if your index has grown too large)
  • Load balancing (if SolrJ is used with SolrCloud, it automatically takes care of load balancing using its round-robin mechanism)
  • Distributed search
  • High availability

  • extra features such as "did you mean?", related searches, etc

For this, you could use the SpellCheckComponent that ships with Solr. There are many other features too: the SnowballPorterFilterFactory, for example, applies stemming, so if you typed books instead of book, you will still be presented with results related to book.
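To illustrate the idea behind a "did you mean?" suggestion (this is not how Solr's SpellCheckComponent is implemented), here is a small sketch using `difflib` from the Python standard library. It suggests the known terms that most closely resemble a misspelled one; the vocabulary list is hypothetical and would in practice come from the indexed terms.

```python
import difflib

# Hypothetical set of terms known to the index.
VOCABULARY = ["book", "boot", "search", "index", "engine"]

def did_you_mean(term, vocabulary=VOCABULARY, limit=3):
    """Return up to `limit` vocabulary terms that closely resemble `term`."""
    return difflib.get_close_matches(term.lower(), vocabulary, n=limit, cutoff=0.6)

print(did_you_mean("serch"))
print(did_you_mean("zzzz"))
```

A real search engine derives suggestions from its own term dictionary and usually weighs them by how often each term occurs, which this sketch does not attempt.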


This answer broadly focuses on Apache Solr & MySQL. Django is out of scope.

Assuming that you are in a Linux environment, you can proceed with the rest of this article (mine was Ubuntu 14.04).

Detailed Installation

Getting Started

Download Apache Solr from here. The version used here is 4.8.1. You could download a newer version; I found this one stable.

After downloading the archive, extract it to a folder of your choice, say Downloads. It will then look like Downloads/solr-4.8.1/

At your prompt, navigate into the directory:

shankar@shankar-lenovo: cd Downloads/solr-4.8.1

So now you are here:

shankar@shankar-lenovo: ~/Downloads/solr-4.8.1$

Start the Jetty Application Server

Jetty is available inside the example folder of the solr-4.8.1 directory, so navigate into it and start the Jetty Application Server.

shankar@shankar-lenovo:~/Downloads/solr-4.8.1/example$ java -jar start.jar

Now, do not close the terminal; minimize it and leave it running.

(TIP: Append & after start.jar to run the Jetty server in the background.)

To check that Apache Solr is running, visit this URL in the browser: http://localhost:8983/solr

Running Jetty on a custom port

It runs on port 8983 by default. You can change the port either on the command line, as here, or directly inside the jetty.xml file.

java -Djetty.port=9091 -jar start.jar

Download the JConnector

This JAR file (MySQL Connector/J) acts as a bridge between MySQL and JDBC. Download the Platform Independent version here.

After downloading it, extract the archive and copy mysql-connector-java-5.1.31-bin.jar into the lib directory:

shankar@shankar-lenovo:~/Downloads/solr-4.8.1/contrib/dataimporthandler/lib

Creating the MySQL table to be linked to Apache Solr

To put Solr to use, you need some tables and data to search. For that, we will use MySQL to create a table and insert some random names; then we can use Solr to connect to MySQL and index that table and its entries.

1. Table Structure

CREATE TABLE test_solr_mysql
 (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  name VARCHAR(45) NULL,
  created TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (id)
 );

2. Populate the above table

INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jean');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jack');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jason');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Vego');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Grunt');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jasper');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Fred');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jenna');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Rebecca');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Roland');

Getting inside the core and adding the lib directives

1. Navigate to

shankar@shankar-lenovo: ~/Downloads/solr-4.8.1/example/solr/collection1/conf

2. Modify solrconfig.xml

Add these two directives to this file:

  <lib dir="../../../contrib/dataimporthandler/lib/" regex=".*\.jar" />
  <lib dir="../../../dist/" regex="solr-dataimporthandler-\d.*\.jar" />

Now add the DIH (Data Import Handler):

<requestHandler name="/dataimport" 
  class="org.apache.solr.handler.dataimport.DataImportHandler" >
    <lst name="defaults">
      <str name="config">db-data-config.xml</str>
    </lst>
</requestHandler>

3. Create the db-data-config.xml file

If the file already exists, skip creating it and just add these lines to it. As the dataSource line shows, you need to provide the credentials of your MySQL database: the database name, username and password.

<dataConfig>
    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/yourdbname" user="dbuser" password="dbpass"/>
    <document>
        <entity name="test_solr" query="select CONCAT('test_solr-',id) as rid,name from test_solr_mysql WHERE '${dataimporter.request.clean}' != 'false'
            OR `created` > '${dataimporter.last_index_time}'">
            <field name="id" column="rid" />
            <field name="solr_name" column="name" />
        </entity>
    </document>
</dataConfig>

(TIP: You can have any number of entities, but watch out for the id field: if two ids are the same, indexing will be skipped.)
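The WHERE clause in the entity query is what lets a single query serve both full and incremental imports: DIH fills in the request's clean parameter and the last index time before running the SQL. The sketch below imitates that logic in Python against an in-memory SQLite table, used here only as a stand-in for MySQL (so CONCAT becomes ||); the three rows and their timestamps are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test_solr_mysql (id INTEGER PRIMARY KEY, name TEXT, created TEXT)"
)
# Made-up sample rows; ids are assigned 1, 2, 3 in insertion order.
conn.executemany(
    "INSERT INTO test_solr_mysql (name, created) VALUES (?, ?)",
    [
        ("Jean", "2014-06-01 10:00:00"),
        ("Jack", "2014-06-02 10:00:00"),
        ("Jason", "2014-06-03 10:00:00"),
    ],
)

def rows_to_index(clean, last_index_time):
    """Mimic the entity query: every row if clean != 'false', else only newer rows."""
    # DIH substitutes the values into the SQL text; bound parameters are used
    # here instead, which gives the same result for this illustration.
    sql = (
        "SELECT 'test_solr-' || id AS rid, name FROM test_solr_mysql "
        "WHERE ? != 'false' OR created > ? ORDER BY id"
    )
    return conn.execute(sql, (clean, last_index_time)).fetchall()

full = rows_to_index("true", "2014-06-02 12:00:00")    # like clean=true
delta = rows_to_index("false", "2014-06-02 12:00:00")  # like clean=false
print(full)
print(delta)
```

With clean set to false, only rows created after the last index time are selected, which is the behaviour an incremental import relies on.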

4. Modify the schema.xml file

Add this to your schema.xml as shown:

<uniqueKey>id</uniqueKey>
<field name="solr_name" type="string" indexed="true" stored="true" />

Implementation

Indexing

This is where the real work happens. You need to index the data from MySQL into Solr in order to make use of Solr queries.

Step 1: Go to the Solr Admin Panel

Open the URL http://localhost:8983/solr in your browser. The screen opens like this.

[Image: the main Apache Solr admin panel]

As the marker indicates, go to Logging in order to check whether any of the above configuration has led to errors.

Step 2: Check your Logs

OK, so now you are here. As you can see, there are a lot of yellow messages (WARNINGS). Make sure you don't have error messages marked in red. Earlier, in our configuration, we added a select query to db-data-config.xml; if there were any errors in that query, they would show up here.

[Image: the Logging section of the Apache Solr admin panel]

Fine, no errors. We are good to go. Let's choose collection1 from the list as depicted and select Dataimport.

Step 3: DIH (Data Import Handler)

Using the DIH, Solr connects to MySQL through the configuration file db-data-config.xml and retrieves the 10 records from the database, which are then indexed into Solr. You trigger this from the Solr interface.

To do that, choose full-import and check the options Clean and Commit. Now click Execute as shown.

Alternatively, you could use a direct full-import query like this too:

http://localhost:8983/solr/collection1/dataimport?command=full-import&commit=true
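When imports need to run on a schedule (for instance every few minutes, to meet the question's 15 - 30 minute freshness target), the import URL can be generated in code. A minimal sketch, assuming the same local Solr and the collection1 core; the function name `dataimport_url` is hypothetical. Passing clean=false keeps the existing documents, which pairs with the WHERE clause in db-data-config.xml so that only newer rows are fetched.

```python
from urllib.parse import urlencode

def dataimport_url(core="collection1", clean=True, commit=True,
                   base="http://localhost:8983/solr"):
    """Build a DIH full-import URL; clean=False turns it into an incremental run."""
    params = {
        "command": "full-import",
        "clean": str(clean).lower(),    # "true" or "false"
        "commit": str(commit).lower(),
    }
    return f"{base}/{core}/dataimport?{urlencode(params)}"

full_url = dataimport_url()
incremental_url = dataimport_url(clean=False)
print(full_url)
print(incremental_url)
```

A scheduled job could then request `incremental_url` periodically, for example with `urllib.request.urlopen`.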

[Image: the Data Import Handler screen]

After you click Execute, Solr begins to index the records. If there are any errors, it will say Indexing Failed, and you have to go back to the Logging section to see what went wrong.

Assuming there are no errors with this configuration and the indexing completes successfully, you will get this notification.

[Image: indexing success notification]

Step 4: Running Solr Queries

It seems everything went well, so now you can use Solr queries to query the data that was indexed. Click Query on the left and then press the Execute button at the bottom.

You will see the indexed records as shown.

The corresponding Solr query for listing all the records is

http://localhost:8983/solr/collection1/select?q=*:*&wt=json&indent=true

[Image: the indexed data]

Well, there go all 10 indexed records. Say we need only names starting with Ja; in this case, you need to target the field solr_name, hence your query goes like this:

http://localhost:8983/solr/collection1/select?q=solr_name:Ja*&wt=json&indent=true
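From application code, the same prefix query can be built and its JSON reply unpacked with the standard library. This is a sketch: the helper names are hypothetical, and the sample reply is hand-written to match the usual shape of a Solr JSON response (documents under response.docs); it is not captured output.

```python
import json
from urllib.parse import urlencode

def prefix_query_url(field, prefix,
                     base="http://localhost:8983/solr/collection1/select"):
    """Build a select URL matching values of `field` that start with `prefix`."""
    params = {"q": f"{field}:{prefix}*", "wt": "json", "indent": "true"}
    return f"{base}?{urlencode(params)}"

def names_from_response(raw_json, field="solr_name"):
    """Pull one field out of every document in a Solr JSON response."""
    docs = json.loads(raw_json)["response"]["docs"]
    return [doc[field] for doc in docs if field in doc]

# Hand-written stand-in for a Solr reply, for illustration only.
sample = json.dumps({"response": {"numFound": 3, "start": 0, "docs": [
    {"id": "test_solr-2", "solr_name": "Jack"},
    {"id": "test_solr-3", "solr_name": "Jason"},
    {"id": "test_solr-6", "solr_name": "Jasper"},
]}})

print(prefix_query_url("solr_name", "Ja"))
print(names_from_response(sample))
```

Because solr_name is declared as a string field, the prefix match is case-sensitive, so Ja* and ja* are different queries.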

[Image: JSON data for names starting with Ja*]

That's how you write Solr queries. To read more about it, check this beautiful article.
