Cassandra Data Model: Five Steps to an Awesome Data Model in Apache Cassandra™

Latest recommended article published 2024-04-28 12:59:47

culiu9261

Original link: https://scotch.io/tutorials/five-steps-to-an-awesome-data-model-in-apache-cassandra

Congratulations, you're starting out on the Cassandra distributed database, a favorite choice among architects and developers for its performance, scalability, continuous uptime, and global data distribution. Whether you plan on using it for ecommerce, IoT, fraud detection, or anything else, it's important to understand not only the database itself, but also how to create the right data model to fit your application's performance and scalability requirements.

Fixing a poorly designed data model after an application is in production is an experience that nobody wants to go through. Therefore, it's better to take some time upfront and use a proven methodology. And, that's exactly what you'll learn here. We've broken it down into five steps:

Step 1: Understand your application workflow
Step 2: Model the queries required by the application
Step 3: Design the tables
Step 4: Determine primary keys
Step 5: Use the right data types effectively

Data Modeling in Cassandra vs. Relational Databases

You're likely already familiar with relational databases (RDBMS) such as Oracle, MySQL, and PostgreSQL, so let's start with how Cassandra differs from relational databases when it comes to data modeling:

Denormalization is expected. With relational databases, designers are usually encouraged to store data in a normalized form. In Cassandra, storing the same data redundantly in multiple tables is a feature of a good data model.
Writes are (almost) free. Due to Cassandra's architecture, writes are shockingly fast compared to relational databases.
No joins. Relational database users usually reference fields from multiple tables in a single query by joining tables. With Cassandra, this functionality doesn't exist, so developers must structure their data model to provide equivalent functionality.
Consistency is tunable. Relational databases typically enforce strong consistency through ACID transactions. With Cassandra, the consistency level can be chosen per read or write operation, so developers can trade consistency against latency and availability to suit each query.
Indexing is different. With relational databases, queries are usually optimized by simply creating an index on a field. In Cassandra, tables are usually designed to support specific queries, and secondary indexes are useful only in specific circumstances, rather than being a "silver bullet."

How Cassandra Stores Data

Cassandra clusters have multiple nodes, and data is typically stored redundantly across those nodes so that the database continues to operate even when nodes are down. Rows in a table are spread across the cluster at a location determined by the partition key, which is hashed to a token that identifies the Cassandra nodes where the data and its replicas are stored. A Cassandra cluster can be conceptually represented as a ring, where each node is responsible for a range of tokens.

Queries that look up records based on the partition key are extremely fast because Cassandra can immediately determine the host holding the required data. Since clusters can potentially have hundreds or even thousands of nodes, Cassandra can handle many simultaneous queries because queries and data are distributed across cluster nodes.
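
The ring lookup described above can be sketched in a few lines of Python. This is a simplified illustration, not Cassandra's actual implementation: it assumes one token per node (no virtual nodes), uses MD5 as a stand-in for the default Murmur3 partitioner, and places replicas on the next nodes around the ring. The names `token_for` and `TokenRing` are hypothetical.

```python
import bisect
import hashlib


def token_for(partition_key: str) -> int:
    """Hash a partition key to a 64-bit token (MD5 is only a stand-in here)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class TokenRing:
    """Toy ring: each node owns the token range that ends at its own token."""

    def __init__(self, nodes, replication_factor=3):
        self.ring = sorted((token_for(name), name) for name in nodes)
        self.tokens = [token for token, _ in self.ring]
        self.replication_factor = min(replication_factor, len(self.ring))

    def replicas(self, partition_key: str):
        """Return the nodes that would hold the partition, primary first."""
        token = token_for(partition_key)
        # First node whose token is >= the key's token, wrapping around the ring.
        start = bisect.bisect_left(self.tokens, token) % len(self.ring)
        return [
            self.ring[(start + offset) % len(self.ring)][1]
            for offset in range(self.replication_factor)
        ]
```

Because the owner is computed directly from the key, a lookup by partition key never has to search the cluster, which is the property the paragraph above relies on.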

Three Data Modeling Best Practices

Spread data evenly around the cluster. For Cassandra to work optimally, data should be spread as evenly as possible across cluster nodes, which depends on selecting a good partition key.
Minimize the number of partitions to read. When Cassandra reads data, it's best to read from as few partitions as possible to avoid impacting performance.
Anticipate how data and requirements will grow. For example, would you design the data model differently if you had 100 transactions per user versus millions?
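
One rough way to compare candidate partition keys before committing to a schema is to count how many rows each choice would put in each partition. The sketch below is an illustration only: the sample rows, the `userid` and `country` fields, and the `skew` measure are all assumptions made for the example.

```python
from collections import Counter


def partition_sizes(rows, partition_key):
    """Count how many rows land in each partition for a candidate key."""
    return Counter(partition_key(row) for row in rows)


def skew(sizes):
    """Largest partition divided by the mean size; 1.0 means perfectly even."""
    mean = sum(sizes.values()) / len(sizes)
    return max(sizes.values()) / mean


# Hypothetical sample: 1000 events from 100 users, 90% of them in one country.
rows = [
    {"userid": f"user-{i % 100}", "country": "US" if i % 10 else "NZ"}
    for i in range(1000)
]

by_user = partition_sizes(rows, lambda row: row["userid"])
by_country = partition_sizes(rows, lambda row: row["country"])
```

With this sample, partitioning by `userid` gives 100 equally sized partitions, while partitioning by `country` gives two partitions with most of the data in one of them, which is the kind of hot spot the first best practice warns about.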

To learn more about Cassandra’s distributed architecture, and how data is stored, check out the free DataStax Academy courses. You will master Cassandra's internal architecture by studying the read path, write path, and compaction. Topics such as consistency, replication, anti-entropy operations, and gossip ensure you have a strong handle on the technology and the data modeling implications.

Five Steps to Building an Awesome Data Model

It’s always helpful to focus on a concrete example. In the sections that follow, data modeling will be discussed in the context of DataStax’s reference application, KillrVideo, an online video service.

Step 1: Understand your application workflow

With Cassandra, rather than start with the data model, the best practice is to start with the application workflow. This approach is referred to as "query-first design"—building your data model based on what types of queries the database will need to support. For example, in the KillrVideo example below, the sequence of workflow steps matters because it helps us determine that a userid and videoid are required to support subsequent queries, impacting table design.

Step 2: Model the queries required by the application

Taking a query-first approach not only means thinking through the sequence of tasks required, but also helps determine what data will be required and when. For example, the entity relationship diagram below shows the entities (users, videos, and comments), the data items, and the relationships required by the KillrVideo application. Once the application workflow and key data objects are identified, it's possible to determine the queries the application needs to support.
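
As a rough illustration of query-first design, the workflow can be written down as a list of access patterns before any table exists. Everything in this sketch is an assumption made for the example: the workflow steps, the lookup columns, and the table names other than `videos` and `latest_videos` are hypothetical and are not taken from the KillrVideo schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPattern:
    workflow_step: str   # where in the application flow the query runs
    query: str           # what the application needs to find
    lookup_by: tuple     # columns known at query time (partition key candidates)
    table: str           # table designed to answer exactly this query


ACCESS_PATTERNS = [
    AccessPattern("Log in", "Find a user by email", ("email",), "users_by_email"),
    AccessPattern("Show homepage", "Find the latest videos", ("yyyymmdd",), "latest_videos"),
    AccessPattern("Watch a video", "Find a video by id", ("videoid",), "videos"),
    AccessPattern("Read comments", "Find comments for a video", ("videoid",), "comments_by_video"),
    AccessPattern("View profile", "Find comments by a user", ("userid",), "comments_by_user"),
]


def tables_needed(patterns):
    """Query-first design: derive the table list from the queries, not the entities."""
    return sorted({pattern.table for pattern in patterns})
```

Note that comments appear in two tables, one per query, which is the denormalization the earlier comparison with relational databases describes.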

Step 3: Design the tables

In Cassandra, tables can be grouped into two distinct categories:

Tables with single-row partitions. These tables have primary keys that consist only of the partition key. They are used to store entities and are usually normalized. It's recommended that the tables be named after the entity for easy reference (e.g., videos).
Tables with multi-row partitions. These tables have primary keys that are composed of partition and clustering keys. They are used to store relationships and related entities. Remember that Cassandra doesn't support joins, so structure tables to support queries that relate to multiple data items.

The latest_videos table illustrates what is meant by "query-first design." The application will need to query the most recently uploaded videos every time a user visits the KillrVideo homepage, so this query needs to be very efficient.

Additional clustering columns (added_date and videoid) specify how records are grouped and ordered within each partition. By designing the table in this fashion, queries will touch only one partition for the current day and possibly another partition for the day before. This level of optimization and efficiency helps explain how Cassandra can support applications with enormous numbers of queries over very large data sets.
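
The read pattern for latest_videos can be imitated with a small in-memory model. This is a sketch under stated assumptions, not how Cassandra stores data: the class `LatestVideos` and its methods are hypothetical, the partition is a day string in yyyymmdd form, and rows are kept newest first with ties broken by videoid.

```python
from collections import defaultdict
from datetime import datetime, timedelta
import uuid


class LatestVideos:
    """In-memory stand-in for a table keyed by (yyyymmdd, added_date, videoid)."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def insert(self, added_date: datetime, videoid: uuid.UUID, name: str):
        rows = self.partitions[added_date.strftime("%Y%m%d")]
        rows.append((added_date, videoid, name))
        # Two stable sorts: videoid ascending, then added_date descending.
        rows.sort(key=lambda row: row[1].int)
        rows.sort(key=lambda row: row[0], reverse=True)

    def latest(self, limit: int, today: datetime):
        """Return up to `limit` rows and the number of partitions read."""
        rows, partitions_read = [], 0
        for days_back in (0, 1):  # today's partition, then yesterday's
            day = (today - timedelta(days=days_back)).strftime("%Y%m%d")
            partitions_read += 1
            rows.extend(self.partitions.get(day, [])[: limit - len(rows)])
            if len(rows) >= limit:
                break
        return rows, partitions_read
```

Because rows are already ordered inside each partition, the homepage query is a prefix read of one partition, and it falls back to the previous day's partition only when today's does not hold enough rows.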

附加的群集列（ 添加的 _ date和videoid ）指定了如何在每个分区内对记录进行分组和排序。通过以这种方式设计表，查询将在当天仅触摸一个分区，并且可能在前一天触摸另一个分区。这种优化和效率水平有助于说明Cassandra如何支持对非常大的数据集进行大量查询的应用程序。

Use a Chebotko Diagram to Represent Your Schema

A good tool for mapping the data model that supports an application is known as a Chebotko diagram. Designed to develop the logical and physical data models required to support the application, the Chebotko diagram captures the database schema, showing table names, partition key columns (K), clustering key columns (C) and their ordering, static columns (S), and regular columns with data types. The tables are organized based on the application workflow to support specific workflow steps and application queries.

Step 4: Determine primary keys

In Cassandra, tables have a primary key that is made up of a partition key, optionally followed by one or more clustering columns that control how rows are laid out within a partition. Getting the primary key right for each table is one of the most crucial aspects of designing a good data model.

In the latest_videos table, yyyymmdd is the partition key, and it is followed by two clustering columns, added_date and videoid, ordered in a fashion that supports retrieving the latest videos.
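
To make the key structure concrete, the helper below renders a CQL `CREATE TABLE` statement from a partition key and a list of clustering columns. It is a sketch only: the helper name `create_table_cql` is hypothetical, and the column types and the DESC/ASC ordering in the usage example are assumptions, since the text names the key columns but not their types.

```python
def create_table_cql(table, columns, partition_key, clustering=()):
    """Render CREATE TABLE with PRIMARY KEY ((partition key), clustering columns)."""
    lines = [f"    {name} {cql_type}," for name, cql_type in columns.items()]
    key_parts = ["(" + ", ".join(partition_key) + ")"]
    key_parts += [name for name, _ in clustering]
    lines.append(f"    PRIMARY KEY ({', '.join(key_parts)})")
    statement = f"CREATE TABLE {table} (\n" + "\n".join(lines) + "\n)"
    if clustering:
        order = ", ".join(f"{name} {direction}" for name, direction in clustering)
        statement += f" WITH CLUSTERING ORDER BY ({order})"
    return statement + ";"


# Assumed column types; only the key column names come from the text.
LATEST_VIDEOS_CQL = create_table_cql(
    "latest_videos",
    {"yyyymmdd": "text", "added_date": "timestamp", "videoid": "uuid", "name": "text"},
    partition_key=["yyyymmdd"],
    clustering=[("added_date", "DESC"), ("videoid", "ASC")],
)
```

Ordering `added_date` descending is one way to keep the newest videos at the start of each partition, so a query for the latest videos reads rows in their stored order.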

Good examples of unique keys are customer IDs, order IDs, and transaction IDs. Relational databases often use simple auto-incrementing integers to assign unique keys to records, but this approach isn't practical in a distributed system like Cassandra. To ensure that each key is unique, Cassandra supports universally unique identifiers (UUIDs) as a native data type. UUIDs are 128-bit numbers that any node can generate independently with a negligible chance of collision, so they are unique for practical purposes within the scope of an application.

Some developers might prefer to devise their own naming schemes to make keys easier to understand, but it's important to think about the impact if the business changes, rendering the scheme obsolete. UUIDs can sometimes be more maintainable in the long run.
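
On the client side, keys of this kind can be generated without any coordination. The sketch below uses Python's standard `uuid` module; the function names are hypothetical. A random (version 4) UUID suits a `uuid` column, and a time-based (version 1) UUID has the layout Cassandra uses for `timeuuid`.

```python
import uuid

# Number of 100-nanosecond intervals between 1582-10-15 (UUID epoch) and 1970-01-01.
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def new_video_id() -> uuid.UUID:
    """Random version 4 UUID, generated without talking to any other node."""
    return uuid.uuid4()


def new_comment_id() -> uuid.UUID:
    """Time-based version 1 UUID, the layout used by the timeuuid type."""
    return uuid.uuid1()


def timeuuid_to_unix_seconds(value: uuid.UUID) -> float:
    """Recover the creation time embedded in a version 1 UUID."""
    return (value.time - UUID_EPOCH_OFFSET) / 1e7
```

A time-based key carries its own timestamp, which is why it can double as a clustering column that sorts rows by creation time.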

Step 5: Use the right data types effectively

Cassandra supports a wide variety of data types that will be familiar to most developers including BigInt, Blob, Boolean, Decimal, Double, Float, Inet (IP addresses), Int, Text, VarChar, UUID, TIMEUUID, and more.

It might be tempting to store tags associated with videos in a separate table. When the list of anticipated tags is small, however, using a collection data type that stores tags inside the database record can be more efficient. This simplifies the database design and reduces the number of tables required. Collection data types include sets, lists, maps, tuples, and nested collections.
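
The effect of keeping tags in a set column can be sketched with a plain Python record. This is an analogy rather than driver code: the `Video` class and helper names are hypothetical, and the comments point at the CQL a real table might use.

```python
from dataclasses import dataclass, field
import uuid


@dataclass
class Video:
    videoid: uuid.UUID
    name: str
    # Analogous to a CQL column declared as: tags set<text>
    tags: set = field(default_factory=set)


def add_tags(video: Video, *tags: str) -> None:
    """Comparable to: UPDATE videos SET tags = tags + {...} WHERE videoid = ?"""
    video.tags.update(tags)


def read_tags(video: Video) -> list:
    """A set holds each tag once; sorting here mimics ordered set results."""
    return sorted(video.tags)
```

The tags travel with the video row, so reading a video and its tags touches one partition instead of requiring a second table and a second query.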

Another data type in Cassandra that provides flexibility is a user-defined type (UDT). UDTs can attach multiple data fields—each named and typed—to a single column. In the KillrVideo example, rather than add multiple address-related fields, an address type can be created.

The user-defined address type can now be included in the users table.
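
A possible shape for such a type is sketched below. The CQL strings and the field names (`street`, `city`, `postal_code`, `home_address`) are assumptions made for illustration and are not taken from the KillrVideo schema; the Python dataclasses show the same nesting on the application side.

```python
from dataclasses import dataclass
import uuid

# Hypothetical CQL: a user-defined type groups the address fields under one name.
CREATE_ADDRESS_TYPE = """
CREATE TYPE IF NOT EXISTS address (
    street text,
    city text,
    postal_code text
);
"""

# Hypothetical CQL: the type is then used like any other column type.
CREATE_USERS_TABLE = """
CREATE TABLE IF NOT EXISTS users (
    userid uuid PRIMARY KEY,
    email text,
    home_address frozen<address>
);
"""


@dataclass(frozen=True)
class Address:
    street: str
    city: str
    postal_code: str


@dataclass
class User:
    userid: uuid.UUID
    email: str
    home_address: Address
```

Declaring the column as `frozen<address>` means the address is written and read as a single value, which matches the immutable `Address` dataclass in the sketch.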

Learn More

Getting the data model right is a critical first step in building a successful, scalable Cassandra database that is easy to manage and maintain. A five-step approach can help: mapping the application workflow, taking a "query-first" approach, designing tables that will support the queries (and documenting them with Chebotko diagrams), carefully thinking through keying approaches, and using all the data types at your disposal. To learn more about these five steps, download the DataStax whitepaper "Data Modeling in Apache Cassandra".

In addition, you can take our free courses on DataStax Academy. You'll learn conceptual data modeling techniques, principles and methodology, design techniques and optimization, and review select data modeling use cases. This course will up your data modeling game!

Get Started

Get started at warp speed by downloading DataStax Enterprise and running it on premises or in any cloud. It’s never been easier to get started with data modeling and deploying with Cassandra!

Translated from: https://scotch.io/tutorials/five-steps-to-an-awesome-data-model-in-apache-cassandra
