Tricks to Make Your Search 10x Faster Using Solr and Cassandra

Original article: https://medium.com/walmartglobaltech/tricks-to-make-your-search-10x-faster-using-solr-and-cassandra-f6af53d5a25c

written by Jiazhen Zhu

Designed by the Global Governance team

Motivation

The concept of big data has been popular around the world for a while, and can even be traced back to a paper by Charles Tilly in the 1980s. I first encountered it during my undergraduate studies in 2008. At first, I thought that the more data we had, the more value and power we could get from it. After a few years of experience as a data engineer, however, I have become convinced that speed is the most critical property in the big data area: data changes so rapidly that a result delivered late has already lost much of its value.

At Walmart, many very large datasets are available for different use cases. As noted above, what matters is how quickly we can put them to use for decision-making.

Typically, with data volumes in the terabytes (TB) or petabytes (PB), there are two types of cases that most companies need to deal with:

1. Data provided at the operational level. This data contains a lot of detailed information, which makes quick searches hard.
2. Data provided at the management level. Here we aggregate the operational-level data into various metrics, which reduces the size of the original data and results in faster searches.

In the next sections, I will introduce a traditional design and then a new paradigm designed by the Global Governance team at Walmart Global Tech. We will see that, for case 1, the new paradigm fetches data roughly 10x faster than the traditional design.

Traditional Design

1. Tools

RDBMS: PostgreSQL, MySQL, SQL Server, Teradata
Key/value database: Redis
Columnar database: HBase, BigQuery, Redshift

2. Data Modeling

Most of the time, a data architect chooses a star or snowflake schema for the data model. For example, we can choose Teradata or BigQuery as the data warehouse and design a fact table and dimension tables. The data can then be joined across multiple tables through a service API.

Figure: A simple star schema. Both the fact table and the dimension tables are stored in an all-in-one database.

The star schema is a good idea because the data is denormalized into two levels (fact and dimensions). However, fetching data that must be joined is still time-consuming, even when the join spans only one level. This results in a very poor user experience.
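
To make the join cost concrete, here is a minimal sketch of a star-schema lookup. It uses Python's built-in sqlite3 module purely as a stand-in for the all-in-one database; the table names, column names, and sample rows are hypothetical and not taken from the original system.

```python
import sqlite3

# In-memory database standing in for the all-in-one warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_item   (item_id INTEGER PRIMARY KEY, item_name TEXT);
    CREATE TABLE dim_store  (store_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE fact_sales (item_id INTEGER, store_id INTEGER, quantity INTEGER);
    INSERT INTO dim_item   VALUES (1, 'garden hose'), (2, 'garden chair');
    INSERT INTO dim_store  VALUES (10, 'Bentonville'), (20, 'Dallas');
    INSERT INTO fact_sales VALUES (1, 10, 3), (2, 10, 1), (1, 20, 5);
""")

def fetch_sales(item_name):
    """Return (item_name, city, quantity) rows for one item."""
    # Every lookup has to join the fact table to each dimension it needs.
    query = """
        SELECT i.item_name, s.city, f.quantity
        FROM fact_sales AS f
        JOIN dim_item  AS i ON i.item_id  = f.item_id
        JOIN dim_store AS s ON s.store_id = f.store_id
        WHERE i.item_name = ?
        ORDER BY s.city
    """
    return conn.execute(query, (item_name,)).fetchall()

rows = fetch_sales("garden hose")
```

On a few rows the join is instant; the point of the sketch is only that the join work is repeated for every search request, which is what becomes expensive at TB or PB scale.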

In addition, many databases do not support an upsert method out of the box, and all-in-one databases like Teradata are relatively costly.

3. Workflow

User Side

1. The user searches by the provided searchable key information in the UI.
2. The backend sends an API call to the all-in-one database.
3. The database processes the required join statement.
4. The all-in-one database returns all the detailed information to the UI.

Engineer Side

Manually upsert the delta data into the all-in-one database.

A New Paradigm

To avoid the high cost, we started using open-source software. To avoid time-consuming joins when fetching data, we use two different types of databases instead of a single one.

1. Tools

To balance search speed, linear scalability, high availability, and a flexible data model, we considered columnar databases, NoSQL databases, and search engines, and shortlisted the following candidates:

NoSQL columnar database: Cassandra
Search engine: Solr, Elasticsearch

Because speed is our first-class concern, and given the size of the data, we decided to combine a NoSQL database with a search engine.

NoSQL: We chose Cassandra because it is not only a NoSQL database but also a wide-column store (often loosely called columnar), which we considered a good fit for OLAP (data warehouse) workloads.
Search engine: We chose Solr because it is a mature project with a strong user community behind it.

2. Data Modeling

We still use the star schema design concept, but we redesign the fact table and rename it the info table. Instead of storing quantitative information for analysis, the info table stores the searchable key information, and it lives in the Solr search engine. All the dimension tables are stored in Cassandra. Because Cassandra tables are query-driven (each table is designed around the queries it must serve), we designed a Global Key that links the dimension tables. This replaces the many separate dimension keys and avoids creating many different Cassandra tables just to support different queries.
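
One way to picture this split is the sketch below, which divides a single source record into a small searchable document (destined for the info table in Solr) and a detail row (destined for a Cassandra dimension table), both carrying the same Global Key. The field names, the choice of searchable fields, and the precomputed `global_key` value are all hypothetical; the article does not describe the actual schema.

```python
from typing import Any, Dict, Tuple

# Hypothetical choice of which fields users may search on.
SEARCHABLE_FIELDS = ("item_id", "item_name")

def split_record(record: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split one record into (info document, dimension row)."""
    global_key = record["global_key"]
    # Info document: only the searchable keys plus the Global Key.
    info_doc = {"global_key": global_key}
    info_doc.update({field: record[field] for field in SEARCHABLE_FIELDS})
    # Dimension row: everything else, still addressed by the Global Key.
    dimension_row = {
        name: value for name, value in record.items()
        if name not in SEARCHABLE_FIELDS
    }
    return info_doc, dimension_row

record = {
    "global_key": "gk-1",
    "item_id": "1001",
    "item_name": "garden hose",
    "brand": "Acme",
    "weight_kg": 1.2,
}
info_doc, dimension_row = split_record(record)
```

The info document stays small, which keeps the search index light, while the bulky detail columns sit behind a single-key lookup.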

The Global Key is a universal key across the data model. It is generated by some basic logic rather than by simply concatenating several natural keys. For merchandise, the key can be generated from the item ID, shipping ID, and so on. For a person, it can be generated from the last name, first name, phone number, and so on.
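
The article does not say what the key-generation logic is. As one possible illustration only, the sketch below normalizes the natural keys and hashes them with SHA-256, so the same entity always yields the same fixed-length key. The function name `make_global_key`, the normalization rules, and the sample values are assumptions, not the team's actual scheme.

```python
import hashlib

def make_global_key(entity_type: str, *natural_keys: str) -> str:
    """Derive a deterministic key from natural keys (hypothetical scheme)."""
    # Normalize so cosmetic differences (case, surrounding spaces) do not
    # produce different keys for the same entity.
    normalized = [str(key).strip().lower() for key in natural_keys]
    # Join with the ASCII unit separator so that ("ab", "c") and ("a", "bc")
    # cannot collapse into the same payload.
    payload = "\x1f".join([entity_type] + normalized)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

item_key = make_global_key("merchandise", "ITEM-1001", "SHIP-77")
person_key = make_global_key("person", "Zhu", "Jiazhen", "555-0100")
```

A hash is only one option; any deterministic rule that maps the same natural keys to the same Global Key would serve the linking purpose described above.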

3. Workflow

User Side

1. The user searches by the provided searchable key information in the UI.
2. The backend sends an API call to Solr based on the key information.
3. Solr returns the Global Key to the backend.
4. The backend sends an API call to Cassandra with the Global Key.
5. Cassandra returns all the detailed information, which is passed back to the UI.
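
The two-step lookup in this workflow can be sketched as follows. To keep the example runnable without any servers, plain Python structures stand in for both stores: a list of documents plays the role of the Solr info table, and dictionaries keyed by Global Key play the role of the Cassandra dimension tables. All names and sample data are hypothetical, and a real backend would call the Solr and Cassandra client APIs instead.

```python
from typing import Any, Dict, List

# Stand-in for the Solr info table: searchable keys plus the Global Key.
INFO_INDEX: List[Dict[str, str]] = [
    {"item_id": "1001", "item_name": "garden hose", "global_key": "gk-1"},
    {"item_id": "1002", "item_name": "garden chair", "global_key": "gk-2"},
]

# Stand-in for the Cassandra dimension tables, each keyed by Global Key.
DIMENSION_TABLES: Dict[str, Dict[str, Dict[str, Any]]] = {
    "item_detail": {
        "gk-1": {"brand": "Acme", "weight_kg": 1.2},
        "gk-2": {"brand": "Acme", "weight_kg": 4.0},
    },
    "shipping_detail": {
        "gk-1": {"carrier": "ground"},
    },
}

def search_global_keys(field: str, value: str) -> List[str]:
    """Step 1 (search engine role): map searchable key info to Global Keys."""
    return [doc["global_key"] for doc in INFO_INDEX if doc.get(field) == value]

def fetch_details(global_key: str) -> Dict[str, Dict[str, Any]]:
    """Step 2 (detail store role): one key lookup per table, no join."""
    return {
        table: rows[global_key]
        for table, rows in DIMENSION_TABLES.items()
        if global_key in rows
    }

def lookup(field: str, value: str) -> Dict[str, Dict[str, Dict[str, Any]]]:
    """Full user-side flow: search first, then fetch details by Global Key."""
    return {key: fetch_details(key) for key in search_global_keys(field, value)}

result = lookup("item_name", "garden hose")
```

The design choice this illustrates is that the search step returns only keys, and every detail fetch is a direct lookup by Global Key rather than a join.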

Engineer Side

Easily upsert the delta data into Solr.
Easily upsert the delta data into Cassandra.
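
Upserts come naturally in both stores: a Cassandra write to an existing primary key overwrites the columns it supplies, and Solr by default replaces a document that is re-added with the same unique key. The sketch below imitates the Cassandra-style behavior with an in-memory dictionary keyed by Global Key, to show why loading a delta needs no separate "insert or update" branching. The `upsert` helper and the sample rows are hypothetical.

```python
from typing import Any, Dict, Iterable, Tuple

def upsert(table: Dict[str, Dict[str, Any]],
           delta: Iterable[Dict[str, Any]]) -> Tuple[int, int]:
    """Apply delta rows keyed by 'global_key'; return (inserted, updated)."""
    inserted = updated = 0
    for row in delta:
        key = row["global_key"]
        if key in table:
            updated += 1
        else:
            inserted += 1
        # Supplied columns overwrite; columns absent from the delta are kept.
        table[key] = {**table.get(key, {}), **row}
    return inserted, updated

table = {"gk-1": {"global_key": "gk-1", "brand": "Acme", "weight_kg": 1.2}}
delta = [
    {"global_key": "gk-1", "weight_kg": 1.3},    # existing key: update
    {"global_key": "gk-3", "brand": "Bolt"},     # new key: insert
]
counts = upsert(table, delta)
```

Because the Solr default replaces the whole document rather than merging fields, a delta sent to Solr would normally carry the full info document.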

Benefit

I benchmarked the traditional model against the new paradigm for filter and filter-plus-join queries. The new paradigm cuts the fetch time from seconds to a few hundred milliseconds.

+---------------+------------------+-------------------+
|    Method     | Traditional (ms) | New Paradigm (ms) |
+---------------+------------------+-------------------+
|    FILTER     |       2536       |        307        |
| FILTER & JOIN |       6900       |        601        |
+---------------+------------------+-------------------+
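
As a quick check of the "10x" figure, the speedup for each row of the table is simply the traditional time divided by the new time. The short calculation below uses only the four measurements from the table.

```python
# (traditional ms, new paradigm ms), copied from the benchmark table.
timings_ms = {
    "FILTER": (2536, 307),
    "FILTER & JOIN": (6900, 601),
}

# Speedup = old time / new time, rounded to one decimal place.
speedups = {
    method: round(old / new, 1)
    for method, (old, new) in timings_ms.items()
}

for method, factor in speedups.items():
    print(f"{method}: {factor}x faster")
```

This gives about 8.3x for the filter-only query and about 11.5x for filter plus join, which is consistent with the rough 10x claim.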

Besides improving fetch time, we also gain the following benefits:

Saving development and loading time with the upsert method.
Avoiding Cassandra's query-driven limitations through the Global Key design.
Saving significant cost by using open-source software.

Conclusion

Combining Solr and Cassandra is a good choice for serving a UI when the data runs to millions or billions of records. Using a Global Key design across them also brings a large benefit.