System Design, DDIA Chapter 6 (Partitioning): Global and Local Secondary Indexes

Latest recommended article published on 2024-10-08 15:41:48

暴躁老哥在线刷题

Category: SystemDesign. Tags: databases, DDIA, distributed system design

Article link: https://blog.csdn.net/qq_32424059/article/details/141993884

First, why secondary indexes at all? In a database, the primary index makes lookups by primary key fast, but a search on any other attribute would otherwise require scanning the entire table. That is clearly inefficient, so we build secondary indexes to make searches on non-primary-key attributes fast as well.

For example, in a used-car database we may often need to search for cars by color, make, or model rather than only by the unique car ID. With secondary indexes on those attributes, all matching records can be found quickly.
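
As a rough illustration of the difference between a full scan and a secondary-index lookup, here is a minimal single-node sketch. The car records and the helper names (`find_by_scan`, `build_secondary_index`, `find_by_index`) are made up for this example and do not come from any particular database.

```python
from collections import defaultdict

# Hypothetical used-car records, keyed by primary key (car ID).
cars = {
    191: {"color": "red", "make": "Honda"},
    214: {"color": "black", "make": "Dodge"},
    306: {"color": "red", "make": "Ford"},
}

def find_by_scan(table, field, value):
    """Without a secondary index: inspect every row in the table."""
    return sorted(pk for pk, row in table.items() if row[field] == value)

def build_secondary_index(table, field):
    """Map each value of `field` to the set of primary keys that have it."""
    index = defaultdict(set)
    for pk, row in table.items():
        index[row[field]].add(pk)
    return index

def find_by_index(index, value):
    """With a secondary index: one dictionary lookup instead of a scan."""
    return sorted(index.get(value, ()))

color_index = build_secondary_index(cars, "color")
print(find_by_scan(cars, "color", "red"))   # [191, 306]
print(find_by_index(color_index, "red"))    # [191, 306]
```

Both calls return the same IDs; the point is that the indexed lookup does not touch rows that do not match.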

There are two main approaches to handling secondary indexes in a partitioned database:

Document-Partitioned Indexes (Local Indexes):
In this approach, each partition maintains its own local secondary indexes, covering only the documents within that partition. This simplifies writes, because an update only affects the single partition that holds the document. However, reading from these indexes can be costly, because a query may need to be sent to all partitions and the results combined (an approach known as scatter/gather). This can increase read latency, especially if some partitions respond more slowly than others.
Term-Partitioned Indexes (Global Indexes):
Alternatively, secondary indexes can be partitioned globally based on the indexed term. This means that all index entries for a specific term (e.g., "color: red") are stored together in one index partition, regardless of which data partition the matching documents live in. This approach makes reads more efficient, since a query can go directly to the partition that holds the relevant term. However, it complicates writes, because a single document update may affect several index partitions, potentially requiring distributed transactions and adding latency.
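
The two layouts above can be sketched in a few lines of Python. This is a toy in-memory model, not how any particular database implements its indexes: the class names `LocalIndexStore` and `GlobalIndexStore`, the modulo placement of documents, and the first-letter range partitioning of terms are all assumptions made for illustration, and only inserts are modeled (no updates or deletes).

```python
from collections import defaultdict

NUM_PARTITIONS = 2

def data_partition(car_id):
    # Assumption: documents are placed by car ID modulo the partition count.
    return car_id % NUM_PARTITIONS

def index_partition(term):
    # Assumption: the global index is range-partitioned on the first letter
    # of the color: a-m go to index partition 0, n-z to index partition 1.
    color = term.split(":", 1)[1]
    return 0 if color[0] <= "m" else 1

class LocalIndexStore:
    """Document-partitioned: each partition indexes only its own documents."""
    def __init__(self):
        self.docs = [dict() for _ in range(NUM_PARTITIONS)]
        self.index = [defaultdict(set) for _ in range(NUM_PARTITIONS)]

    def write(self, car_id, color):
        p = data_partition(car_id)
        self.docs[p][car_id] = color
        self.index[p][f"color:{color}"].add(car_id)
        return p  # the single partition touched by this write

    def query(self, color):
        term, hits, contacted = f"color:{color}", set(), 0
        for p in range(NUM_PARTITIONS):  # scatter to every partition
            contacted += 1
            hits |= self.index[p].get(term, set())
        return sorted(hits), contacted   # gather the combined result

class GlobalIndexStore:
    """Term-partitioned: the index is partitioned by term, not by document."""
    def __init__(self):
        self.docs = [dict() for _ in range(NUM_PARTITIONS)]
        self.index = [defaultdict(set) for _ in range(NUM_PARTITIONS)]

    def write(self, car_id, color):
        term = f"color:{color}"
        dp, ip = data_partition(car_id), index_partition(term)
        self.docs[dp][car_id] = color
        self.index[ip][term].add(car_id)
        return dp, ip  # data partition and index partition may differ

    def query(self, color):
        term = f"color:{color}"
        ip = index_partition(term)       # go straight to one index partition
        return sorted(self.index[ip].get(term, set())), 1

local, global_store = LocalIndexStore(), GlobalIndexStore()
for car_id, color in [(191, "red"), (306, "red"), (214, "black")]:
    local.write(car_id, color)
    global_store.write(car_id, color)

print(local.query("red"))         # ([191, 306], 2)
print(global_store.query("red"))  # ([191, 306], 1)
```

In this model the local query contacts both partitions while the global query contacts one. Conversely, a write such as `global_store.write(500, "silver")` touches data partition 0 and index partition 1, which in a real cluster may sit on different nodes.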

While global indexes improve read efficiency, they often involve asynchronous updates to avoid performance bottlenecks, meaning the index might not immediately reflect recent writes. Both strategies have their trade-offs: document-partitioned indexes are simpler and faster for writes but slower for reads, while term-partitioned indexes offer faster reads but can slow down writes and increase complexity.
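
The effect of asynchronous index updates can be illustrated with a small sketch. It assumes a single in-memory store whose index changes are queued and applied later; the class `AsyncIndexedStore` and its `apply_pending` method are hypothetical stand-ins for whatever background process a real database uses.

```python
from collections import defaultdict, deque

class AsyncIndexedStore:
    def __init__(self):
        self.docs = {}
        self.index = defaultdict(set)
        self.pending = deque()  # index updates that have not been applied yet

    def write(self, car_id, color):
        # The document write is acknowledged right away...
        self.docs[car_id] = color
        # ...while the index update is only queued for later.
        self.pending.append((f"color:{color}", car_id))

    def apply_pending(self):
        """Stand-in for the background process that updates the index."""
        while self.pending:
            term, car_id = self.pending.popleft()
            self.index[term].add(car_id)

    def query(self, color):
        return sorted(self.index.get(f"color:{color}", set()))

store = AsyncIndexedStore()
store.write(191, "red")
print(store.query("red"))  # [] because the index has not caught up yet
store.apply_pending()
print(store.query("red"))  # [191]
```

The first query misses a document that has already been written, which is the kind of temporary staleness described above.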

Choosing the right secondary indexing strategy therefore comes down to weighing these trade-offs against the application's specific needs and data access patterns.

Question List

What is the main challenge of using secondary indexes in a partitioned database?
What is a document-partitioned (or local) secondary index, and how does it work?
What are the advantages and disadvantages of using a document-partitioned secondary index?
What does the term "scatter/gather" mean in the context of querying secondary indexes, and why can it be costly?
What is a term-partitioned (or global) secondary index, and how does it differ from a document-partitioned index?
What are the benefits of using a term-partitioned (global) secondary index compared to a document-partitioned index?
What are the trade-offs involved in using a term-partitioned index for secondary indexing?
How does asynchronous updating of global secondary indexes work, and what are its potential downsides?
Why might a global secondary index require distributed transactions, and why is this challenging?
Can you provide examples of databases or systems that use document-partitioned or term-partitioned secondary indexes?

Sample Answers

What is the main challenge of using secondary indexes in a partitioned database?
- Answer: The main challenge is that secondary indexes don’t map neatly to partitions. Unlike primary keys, which are typically unique and directly determine partition placement, secondary indexes are used to search for occurrences of specific values, which can span multiple partitions.
What is a document-partitioned (or local) secondary index, and how does it work?
- Answer: A document-partitioned index is a type of secondary index where each partition maintains its own index for the documents it contains. When a document is added, removed, or updated, only the partition containing that document is affected. However, queries using this type of index may need to be sent to all partitions and results combined, which is known as the scatter/gather approach.
What are the advantages and disadvantages of using a document-partitioned secondary index?
- Answer:
  - Advantages: Simpler to implement, and writes are fast since only the partition containing the document is involved in the update.
  - Disadvantages: Read queries on secondary indexes can be expensive, as they may need to query all partitions and combine results, leading to increased latency and resource usage.
What does the term "scatter/gather" mean in the context of querying secondary indexes, and why can it be costly?
- Answer: "Scatter/gather" is a querying method where a query is sent to all partitions (scatter), and the results are then collected and combined (gather). This can be costly due to high latency, especially if some partitions are slower to respond, leading to tail latency amplification.
What is a term-partitioned (or global) secondary index, and how does it differ from a document-partitioned index?
- Answer: A term-partitioned index is a global index that covers data in all partitions and is itself partitioned by the indexed term (e.g., "color: red"). Unlike a document-partitioned index, where each partition has its own local index, a global index allows a query to target a specific partition based on the term, making reads more efficient.
What are the benefits of using a term-partitioned (global) secondary index compared to a document-partitioned index?
- Answer: The primary benefit is more efficient read queries, as a query only needs to be sent to the partition containing the relevant term rather than all partitions. This reduces the overhead and latency associated with scatter/gather operations.
What are the trade-offs involved in using a term-partitioned index for secondary indexing?
- Answer: The trade-offs include slower and more complex writes, as a single document update might affect multiple partitions in the global index. Additionally, maintaining consistency across these partitions can require distributed transactions, which are complex and may not be supported by all databases.
How does asynchronous updating of global secondary indexes work, and what are its potential downsides?
- Answer: Asynchronous updating means that changes to the index do not occur immediately after a write operation but are instead propagated with some delay. The downside is that the index may not reflect the most recent state of the data, leading to temporary inconsistencies where a recent update is not yet visible in the index.
Why might a global secondary index require distributed transactions, and why is this challenging?
- Answer: A global secondary index may require distributed transactions because a single write operation might affect multiple partitions of the index, potentially across different nodes. This is challenging because distributed transactions involve coordinating changes across multiple nodes, which can be slow, error-prone, and not supported by all databases.
Can you provide examples of databases or systems that use document-partitioned or term-partitioned secondary indexes?
- Answer: According to the chapter, databases using document-partitioned secondary indexes include MongoDB, Riak, Cassandra, Elasticsearch, SolrCloud, and VoltDB. Systems that use global term-partitioned indexes include Amazon DynamoDB (global secondary indexes), Riak's search feature, and the Oracle data warehouse.
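
To make the tail latency point from the scatter/gather answer concrete, here is a back-of-the-envelope calculation. It assumes each partition independently has the same probability of responding slowly; the 1% figure and the function name `prob_slow_query` are illustrative choices, not measurements.

```python
def prob_slow_query(num_partitions, p_slow_partition):
    """Probability that at least one of the contacted partitions is slow.

    A scatter/gather query is only as fast as its slowest partition, so the
    whole query is slow unless every partition happens to be fast.
    """
    return 1 - (1 - p_slow_partition) ** num_partitions

for n in (1, 10, 100):
    print(n, round(prob_slow_query(n, 0.01), 3))
```

Under these assumptions, a query that touches 100 partitions hits at least one slow partition roughly 63% of the time, even though each individual partition is slow only 1% of the time.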