A Data Engineer’s Guide to Non-Traditional Data Storages

cumei1658

Tags: database, search engine, big data, programming languages, java

Original link: https://www.pybloggers.com/2016/12/a-data-engineers-guide-to-non-traditional-data-storages/

Data Engineering

With the rise of big data and data science, many engineering roles are being challenged and expanded. One new-age role is data engineering.

Originally, the purpose of data engineering was the loading of external data sources and the designing of databases (designing and developing pipelines to collect, manipulate, store, and analyze data).

It has since grown to support the volume and complexity of big data. Data engineering now encompasses a wide range of skills, from web crawling and data cleansing to distributed computing and data storage and retrieval.

For data engineering and data engineers, data storage and retrieval is the critical component of the pipeline together with how the data can be used and analyzed.

In recent times, many new and different data storage technologies have emerged. However, which one is best suited and has the most appropriate features for data engineering?

Most engineers are familiar with SQL databases, such as PostgreSQL, MSSQL, and MySQL, which are structured in relational data tables with row-oriented storage.

Given how ubiquitous these databases are, we won’t discuss them today. Instead, we explore three types of alternative data storages that are growing in popularity and that have introduced different approaches to dealing with data.

Within the context of data engineering, these technologies are search engines, document stores, and columnar stores.

Search engines excel at text queries. Compared with text matching in SQL databases, such as LIKE, search engines offer richer query capabilities and better performance out of the box.
Document stores provide better data schema adaptability than traditional databases. By storing the data as individual document objects, often represented as JSON, they do not require a predefined schema.
Columnar stores specialize in single-column queries and value aggregations. SQL operations such as SUM and AVG are considerably faster in columnar stores, as data of the same column are stored closer together on disk.

In this article, we explore all three technologies: Elasticsearch as a search engine, MongoDB as a document store, and Amazon Redshift as a columnar store.

By understanding alternative data storage, we can choose the most suitable one for each situation.

Storage for Data Engineering: Which is the Best?

For data engineers, the most important aspects of data storages are how they index, shard, and aggregate data.

To compare these technologies, we’ll examine how they index, shard, and aggregate data.

Each data indexing strategy improves certain queries while hindering others.

Knowing which queries are used most often can influence which data store to adopt.

Sharding, a methodology by which a database divides its data into chunks, determines how the infrastructure will grow as more data is ingested.

Choosing one that matches our growth plan and budget is critical.

Finally, each of these technologies aggregates its data very differently.

When we are dealing with gigabytes and terabytes of data, the wrong aggregation strategy can limit the types of reports we can generate and how well they perform.

As data engineers, we must consider all three aspects when evaluating different data storages.

Contenders

Search Engine: Elasticsearch

Elasticsearch quickly gained popularity among its peers for its scalability and ease of integration. Built on top of Apache Lucene, it offers powerful, out-of-the-box text search and indexing functionality. Aside from the traditional search engine tasks of text search and exact-value queries, Elasticsearch also offers layered aggregation capabilities.

Document Store: MongoDB

At this point, MongoDB can be considered the go-to NoSQL database. Its ease of use and flexibility quickly earned its popularity. MongoDB supports rich and adaptable querying for digging into complex documents. Often-queried fields can be sped up through indexing, and when aggregating a large chunk of data, MongoDB offers a multi-stage pipeline.

Columnar Store: Amazon Redshift

Alongside the growth of NoSQL’s popularity, columnar databases have also gathered attention, especially for data analytics. By storing data in columns instead of the usual rows, aggregation operations can be executed directly from the disk, greatly increasing performance. A few years ago, Amazon rolled out its hosted service for a columnar store called Redshift.

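To make the row-versus-column distinction concrete, here is a minimal Python sketch. It is a toy in-memory model, not how Redshift lays out data on disk; the sample rows and the `to_columns` helper are assumptions made up for illustration.

```python
# Toy comparison of row-oriented and column-oriented layouts (illustrative only).
rows = [
    {"user": "ann", "country": "US", "visits": 12},
    {"user": "bob", "country": "DE", "visits": 7},
    {"user": "cy", "country": "US", "visits": 30},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every whole row is visited even though only "visits" is needed.
row_total = sum(row["visits"] for row in rows)

# Column layout: only the contiguous "visits" column is read.
col_total = sum(columns["visits"])

print(row_total, col_total)
```

Both totals are the same; the difference is how much unrelated data has to be touched to compute them, which is what matters once the table no longer fits in memory.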

Indexing

Elasticsearch’s Indexing Capability

In many ways, search engines are data stores that specialize in indexing texts.

While other data stores create indices based on the exact values of the field, search engines allow retrieval with only a fragment of the (usually text) field.

By default, this retrieval is done automatically for every field through analyzers.

An analyzer is a module that creates multiple index keys by evaluating the field values and breaking them down into smaller values.

For example, a basic analyzer might break “the quick brown fox jumped over the lazy dog” into words, such as “the,” “quick,” “brown,” “fox” and so on.

This method enables users to find data by searching for fragments of it, with results ranked by how many fragments match the same document.

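The idea of breaking text into index keys and ranking by matched fragments can be sketched in a few lines of plain Python. This is a simplified stand-in, not Elasticsearch’s or Lucene’s actual implementation; the `analyze`, `build_index`, and `search` functions and the sample documents are hypothetical.

```python
from collections import Counter, defaultdict


def analyze(text):
    """Hypothetical basic analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Rank document ids by how many distinct query tokens they match."""
    hits = Counter()
    for token in set(analyze(query)):
        for doc_id in index.get(token, ()):
            hits[doc_id] += 1
    return [doc_id for doc_id, _ in hits.most_common()]


docs = {
    1: "the quick brown fox jumped over the lazy dog",
    2: "a quick brown cat",
    3: "lazy afternoon",
}
index = build_index(docs)
results = search(index, "quick brown fox")
print(results)
```

Document 1 matches all three query tokens and document 2 matches two, so document 1 ranks first. Real search engines use more refined relevance scoring than a raw match count.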

A more sophisticated analyzer could use edit distances, n-grams, and stopword filtering to build a comprehensive retrieval index.

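As a rough illustration of two of these techniques, the sketch below drops stopwords and then emits character n-grams as index keys. The stopword list, the n-gram size, and the function names are all assumptions chosen for the example, and edit distance is left out for brevity.

```python
STOPWORDS = {"the", "a", "an", "over"}  # hypothetical, tiny stopword list


def char_ngrams(word, n=3):
    """Return the character n-grams of a word (the word itself if it has n or fewer characters)."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def analyze(text, n=3):
    """Lowercase, drop stopwords, then emit n-grams as index keys."""
    keys = []
    for word in text.lower().split():
        if word in STOPWORDS:
            continue
        keys.extend(char_ngrams(word, n))
    return keys


keys = analyze("The quick brown fox")
print(keys)
```

Indexing n-grams lets a partial query such as "row" match "brown", at the cost of a larger index.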

MongoDB’s Indexing Capability

As a generic data store, MongoDB has a lot of flexibility for indexing data.

Unlike Elasticsearch, it only indexes the _id field by default, and we need to create indices for the commonly queried fields manually.

Compared to Elasticsearch, MongoDB’s text analyzer isn’t as powerful. But it does provide a lot of flexibility in indexing methods, from compound and geospatial indices for optimal querying to TTL and sparse indices for storage reduction.

Redshift’s Indexing Capability

Unlike Elasticsearch, MongoDB, or even traditional databases, including PostgreSQL, Amazon Redshift does not support an indexing method.

Instead, it reduces its query time by maintaining a consistent sorting on the disk.

As users, we can configure an ordered set of column values as the table sort key. With the data sorted on the disk, Redshift can skip an entire block during retrieval if its value falls outside the queried range, heavily boosting performance.

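The block-skipping idea can be sketched as follows, assuming each block records the minimum and maximum of its sort-key values. This is a simplified model for illustration, not Redshift’s internals; the block size, the `build_blocks` and `range_query` helpers, and the sample values are hypothetical.

```python
def build_blocks(sorted_values, block_size):
    """Split sorted values into blocks and record each block's min and max."""
    blocks = []
    for start in range(0, len(sorted_values), block_size):
        chunk = sorted_values[start:start + block_size]
        blocks.append({"min": chunk[0], "max": chunk[-1], "values": chunk})
    return blocks


def range_query(blocks, low, high):
    """Return matching values and the number of blocks actually scanned."""
    matches, scanned = [], 0
    for block in blocks:
        if block["max"] < low or block["min"] > high:
            continue  # whole block lies outside the range, so skip it
        scanned += 1
        matches.extend(v for v in block["values"] if low <= v <= high)
    return matches, scanned


timestamps = sorted(range(100, 200))  # pretend sort-key column
blocks = build_blocks(timestamps, block_size=10)
matches, scanned = range_query(blocks, 125, 134)
print(len(matches), scanned, len(blocks))
```

Only two of the ten blocks are scanned for this range. The skipping works only because the values are sorted, which is why keeping the table sorted matters for read performance.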

Sharding

Elasticsearch’s Sharding Capability

Elasticsearch was built on top of Lucene to scale horizontally and be production ready.

Scaling is done by creating multiple Lucene instances (shards) and distributing them across multiple nodes (servers) within a cluster.

By default, each document is routed to its respective shard through its _id field.

During retrieval, the coordinating node (the node that receives the request) sends each shard a copy of the query before finally aggregating and ranking the shard results for output.

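A minimal sketch of hash-based routing followed by a scatter-gather query might look like this. It uses CRC32 as a stand-in hash and a term-count score, both of which are assumptions for illustration rather than what Elasticsearch actually does; `NUM_SHARDS`, `shard_for`, `index_document`, and `search` are hypothetical names.

```python
import zlib

NUM_SHARDS = 3  # hypothetical cluster size


def shard_for(doc_id):
    """Route a document to a shard by hashing its _id (stable across runs)."""
    return zlib.crc32(str(doc_id).encode("utf-8")) % NUM_SHARDS


shards = [[] for _ in range(NUM_SHARDS)]


def index_document(doc):
    """Store the document on the shard chosen by its _id."""
    shards[shard_for(doc["_id"])].append(doc)


def search(term, size=2):
    """Scatter the query to every shard, then merge and rank the partial hits."""
    partial_hits = []
    for shard in shards:
        for doc in shard:
            score = doc["text"].lower().split().count(term)
            if score:
                partial_hits.append((score, doc["_id"]))
    # Highest score first; ties broken by _id so the order is deterministic.
    partial_hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return [doc_id for _, doc_id in partial_hits[:size]]


for i, text in enumerate(["fox fox dog", "dog", "fox", "cat fox fox fox"]):
    index_document({"_id": i, "text": text})

top = search("fox")
print(top)
```

Because the shard is a pure function of the `_id`, a lookup by `_id` can go straight to one shard, while a text query has to visit all of them and merge the results.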

MongoDB’s Sharding Capability

Within a MongoDB cluster, there are three types of servers: router, config, and shard.

By scaling the router servers, the cluster can accept more requests, but the heavy lifting happens at the shard servers.

As with Elasticsearch, MongoDB documents are routed (by default) via _id to their respective shards. At query time, the config server tells the router which shards to query, and the router server then distributes the query and aggregates the results.

Redshift’s Sharding Capability

An Amazon Redshift cluster consists of one leader node and several compute nodes.

The leader node handles the compilation and distribution of queries as well as the aggregation of intermediate results.

Unlike MongoDB’s router servers, the leader node is fixed at one per cluster and can’t be scaled horizontally.

While this creates a bottleneck, it also allows efficient caching of compiled execution plans for popular queries.

Aggregating

Elasticsearch’s Aggregating Capability

Documents within Elasticsearch can be bucketed by exact, ranged, or even temporal and geolocation values.

These buckets can be further grouped into finer granularity through nested aggregation.

Metrics, including means and standard deviations, can be calculated for each layer, which provides the ability to calculate a hierarchy of analyses within a single query.

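The shape of such a layered result can be sketched in plain Python: bucket by one field, sub-bucket by another, and compute a metric at each layer. This only mimics the structure of a nested aggregation, not the Elasticsearch query DSL; the field names, sample events, and `nested_aggregate` helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean

events = [
    {"country": "US", "device": "ios", "latency_ms": 120},
    {"country": "US", "device": "ios", "latency_ms": 80},
    {"country": "US", "device": "android", "latency_ms": 200},
    {"country": "DE", "device": "ios", "latency_ms": 100},
]


def nested_aggregate(docs, outer, inner, metric):
    """Bucket by `outer`, sub-bucket by `inner`, and average `metric` per layer."""
    buckets = defaultdict(lambda: defaultdict(list))
    for doc in docs:
        buckets[doc[outer]][doc[inner]].append(doc[metric])
    report = {}
    for outer_key, sub_buckets in buckets.items():
        all_values = [v for values in sub_buckets.values() for v in values]
        report[outer_key] = {
            "avg": mean(all_values),
            "by_" + inner: {k: mean(v) for k, v in sub_buckets.items()},
        }
    return report


report = nested_aggregate(events, "country", "device", "latency_ms")
print(report)
```

A single pass yields both the per-country average and the per-device averages within each country, which is the kind of hierarchy a single nested query returns.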

Being document-based storage, it is limited when it comes to intra-document field comparisons.

For example, while it is good at filtering if a field followers is greater than 10, we cannot check if followers is greater than another field following.

As an alternative, we can inject scripts as custom predicates. This feature is great for one-off analysis, but performance suffers in production.

MongoDB’s Aggregating Capability

The Aggregation Pipeline is powerful and fast.

As its name suggests, it operates on returned data in a stage-wise fashion.

Each step can filter, aggregate and transform the documents, introduce new metrics, or unwind previously aggregated groups.

Because these operations are done in a stage-wise manner, and because early stages can reduce the documents and fields to only those that pass the filters, the memory cost can be minimized. Compared to Elasticsearch, and even Redshift, the Aggregation Pipeline is an extremely flexible way to view the data.

Despite its adaptability, MongoDB suffers the same lack of intra-document field comparison as Elasticsearch.

Furthermore, some operations, including $group, require the results to be passed to the master node.

Thus, they do not leverage distributed computing.

Those unfamiliar with the stage-wise pipeline calculation will find certain tasks unintuitive. For example, summing up the number of elements in an array field would require two steps: first, the $unwind, and then the $group operation.

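The two-step pattern can be simulated in plain Python to show what each stage does to the documents. This is not MongoDB code: `unwind` and `group_count` are hypothetical helpers that only imitate the behavior of the $unwind and $group stages on in-memory dicts, and the sample posts are made up.

```python
from collections import Counter


def unwind(docs, field):
    """Emit one copy of each document per element of its array field."""
    for doc in docs:
        for element in doc.get(field, []):
            yield {**doc, field: element}


def group_count(docs, key):
    """Group documents by `key` and count the members of each group."""
    counts = Counter()
    for doc in docs:
        counts[doc[key]] += 1
    return dict(counts)


posts = [
    {"_id": 1, "tags": ["db", "search", "nosql"]},
    {"_id": 2, "tags": ["db"]},
    {"_id": 3, "tags": []},
]

# Stage 1 flattens the arrays; stage 2 counts the flattened rows per document.
tag_counts = group_count(unwind(posts, "tags"), "_id")
print(tag_counts)
```

In this sketch the document with an empty array produces no rows after the first stage, so it does not appear in the final counts.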

Redshift’s Aggregating Capability

The benefits of Amazon Redshift cannot be overstated.

Frustratingly slow aggregations on MongoDB while analyzing mobile traffic are quickly solved by Amazon Redshift.

Because Redshift supports SQL, traditional database engineers will have an easy time migrating their queries to it.

Onboarding time aside, SQL is a proven, scalable, and powerful query language, supporting intra-document/row field comparisons with ease. Amazon Redshift further improves its performance by compiling and caching popular queries executed on the compute nodes.

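To show how naturally SQL expresses a comparison between two fields of the same row, here is a small sketch using Python’s built-in sqlite3 module. SQLite stands in for Redshift purely so the example runs anywhere; the `users` table and its rows are hypothetical, with the followers and following fields borrowed from the earlier Elasticsearch example.

```python
import sqlite3

# SQLite stands in for Redshift here; only the SQL itself is the point.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, followers INTEGER, following INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [("ann", 120, 80), ("bob", 5, 300), ("cy", 50, 50)],
)

# Compare two columns of the same row directly in the WHERE clause.
popular = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM users WHERE followers > following ORDER BY name"
    )
]
print(popular)
conn.close()
```

The comparison that needed a script in the document stores is an ordinary predicate here.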

As a relational database, Amazon Redshift does not have the schema flexibility that MongoDB and Elasticsearch have. Optimized for read operations, it suffers performance hits during updates and deletes.

To maintain the best read time, the rows must be kept sorted, which adds operational effort.

Tailored to those with petabyte-sized problems, it is not cheap and likely not worth the investment unless there are scaling problems with other databases.

Picking the Winner

In this article, we examined three different technologies – Elasticsearch, MongoDB, and Amazon Redshift – within the context of data engineering. However, there is no clear winner as each of these technologies is a front-runner in its storage type category.

For data engineering, depending on the use case, some options are better than others.

MongoDB is a fantastic starter database. It provides the flexibility we want when the data schema is still to be determined. That said, MongoDB does not outperform other databases in the specific use cases they specialize in.
While Elasticsearch offers a fluid schema similar to MongoDB’s, it is optimized for multiple indices and text queries at the expense of write performance and storage size. Thus, we should consider migrating to Elasticsearch when we find ourselves maintaining numerous indices in MongoDB.
Redshift requires a predefined data schema and lacks the adaptability that MongoDB provides. In return, it outclasses other databases for queries involving only a single column (or a few). When the budget permits, Amazon Redshift is a great secret weapon when others cannot handle the data quantity.