Designing Cassandra Data Models 101

Original article: https://levelup.gitconnected.com/designing-cassandra-data-models-101-27bce86a014d

Designing data models in Cassandra can be tricky if you are coming from a relational database background. Even though Cassandra tries its best to draw parallels to relational databases in its terminology, I feel that this makes things more misleading. Of course, CQL (Cassandra Query Language) being so similar to SQL doesn't help either.

Fundamentally, Cassandra (a NoSQL database) and relational databases (MySQL, PostgreSQL) are very different from each other. Cassandra is not a drop-in replacement for a relational database. You have to design your schema with an entirely different way of thinking.

To get you started with designing Cassandra data models, I will go through some of the fundamentals I picked up when moving to Cassandra from a relational background.

Partition Keys are Fundamental

Understanding partitions is fundamental in Cassandra. A partition is the atomic unit in which Cassandra stores and distributes data. You define the partition key (simple or composite) when you design your schema. Cassandra uses the partition key of each table to decide on which node a row resides.
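
As a rough illustration of how a partition key maps to a node, the sketch below hashes the key and picks a node by modulo. This is a deliberate simplification: Cassandra itself uses a partitioner and a token ring with replication, and the node names and the `node_for` helper here are hypothetical.

```python
import hashlib

# Hypothetical three-node cluster; the names are made up for illustration.
NODES = ["node-a", "node-b", "node-c"]


def node_for(partition_key: str, nodes=NODES) -> str:
    """Map a partition key to a node with a stable hash.

    Toy stand-in for Cassandra's partitioner and token ring:
    MD5 plus modulo is an assumption, not what Cassandra does internally.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return nodes[token % len(nodes)]


# Rows that share a partition key always land on the same node.
print(node_for("engineering") == node_for("engineering"))  # True
print(node_for("engineering") in NODES)  # True
```

The point to take away is only that placement is a pure function of the partition key, so choosing the key is choosing where the data lives.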

Records with the same partition key end up in the same partition. A partition always lives whole on a single node (and on each of its replicas); it is never split across nodes. There are four things to consider when it comes to partitions:

Keep each partition below 100 MB, ideally below 10 MB
Ideally, one query reads from a single partition
All partitions should be roughly equal in size, to avoid skew and hotspots
A partition key should never produce an unbounded partition that keeps growing as time passes
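
One common way to satisfy the last point is to add a time bucket to the partition key, so that each partition covers a bounded time window. The sketch below assumes a hypothetical sensor-readings workload and a monthly bucket; the `bucketed_partition_key` helper and the bucket size are illustrative choices, not something prescribed by Cassandra.

```python
from datetime import datetime, timezone


def bucketed_partition_key(sensor_id: str, ts: datetime) -> tuple:
    """Build a composite partition key: (sensor_id, month bucket).

    Each sensor gets a new partition every month, so no single
    partition grows forever. The monthly granularity is an assumption.
    """
    return (sensor_id, ts.strftime("%Y-%m"))


jan = datetime(2023, 1, 15, tzinfo=timezone.utc)
feb = datetime(2023, 2, 1, tzinfo=timezone.utc)

print(bucketed_partition_key("sensor-42", jan))  # ('sensor-42', '2023-01')
print(bucketed_partition_key("sensor-42", feb))  # ('sensor-42', '2023-02')
```

The bucket should be sized from the expected write rate so that a full bucket still stays within the partition size targets listed above.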

It's essential to design Cassandra partitions carefully to get the best I/O performance, both for reads and writes. Keeping partitions small also helps with Cassandra's memory usage, repairs and tombstone evictions.

Clustering Columns are Important

Clustering columns are the fields in your primary key that are not part of the partition key. They determine the order in which rows are laid out in a given partition.

If you know the sort order in which you will typically query data, you can define the order of the clustering columns at design time. Let's say you are designing a data model for user information and your table is partitioned by department. The day someone joined (recently_joined) is a clustering column. Now imagine that a common query is to find the people who joined a given department most recently. In that case you can order the recently_joined clustering column in descending order. This way, Cassandra can answer the query with a sliced sequential read from the start of the partition, which is the best-performing access pattern.

Sequential reads are generally faster than random reads. Ordering your clustering columns to match your queries makes sequential reads commonplace in your keyspace, which is what you want.
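
To make the department example concrete, here is a minimal in-memory sketch of a partition whose rows are kept in descending recently_joined order. It only imitates the effect of a descending clustering order; the `insert_user` and `most_recent` helpers and the sample data are hypothetical.

```python
import bisect
from collections import defaultdict
from datetime import date

# department (partition key) -> rows kept sorted newest-first,
# imitating a descending clustering order on recently_joined.
table = defaultdict(list)


def insert_user(department: str, joined: date, name: str) -> None:
    # Negating the date ordinal makes ascending sort order equal newest-first.
    bisect.insort(table[department], (-joined.toordinal(), name))


def most_recent(department: str, limit: int) -> list:
    # A slice from the front of one partition: no sorting at read time.
    return [name for _, name in table[department][:limit]]


insert_user("engineering", date(2021, 3, 1), "alice")
insert_user("engineering", date(2023, 6, 18), "bob")
insert_user("engineering", date(2022, 9, 5), "carol")
insert_user("sales", date(2023, 1, 10), "dave")

print(most_recent("engineering", 2))  # ['bob', 'carol']
```

The ordering cost is paid once at write time, so the "most recent N" query becomes a simple slice of a single partition.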

Denormalization & Duplication of Data

Everywhere in Computer Science we have learned about DRY (Don’t Repeat Yourself). When it comes to code we always try not to repeat functions and business logic. When it comes to relational databases, we always try to normalize data and connect the tables using foreign keys.

In Cassandra, duplicating and denormalizing data is encouraged. Otherwise, you can almost never get good performance for your queries. Because of the way partition keys and clustering columns work, the performance of your queries greatly depends on how your data is laid out in Cassandra. You might have designed a table with a particular set of queries in mind, but requirements change all the time. Now you might have new queries that are sub-optimal for your table. In that case, feel free to duplicate the data in a new table that is laid out differently to satisfy your new queries.

Cassandra doesn't have the concept of JOINs that relational databases have. It's impossible to do cross-node JOINs reliably in distributed databases like Cassandra. So instead of using JOINs, you are encouraged to duplicate your data and partition it differently for different business requirements.
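
The idea of one table per query can be sketched with plain dictionaries standing in for tables. The table names, the `save_user` helper and the sample records below are all hypothetical; the sketch only shows the same record being written twice under different partition keys.

```python
from collections import defaultdict

# Two hypothetical "tables" holding the same user records,
# each partitioned for a different query.
users_by_department = defaultdict(list)  # query: users in a department
users_by_city = defaultdict(list)        # query: users in a city


def save_user(name: str, department: str, city: str) -> None:
    """Write the same record to every table that serves a query."""
    record = {"name": name, "department": department, "city": city}
    users_by_department[department].append(record)
    users_by_city[city].append(record)


save_user("alice", "engineering", "Berlin")
save_user("bob", "engineering", "Lisbon")
save_user("carol", "sales", "Berlin")

# Each query reads exactly one partition; no JOIN is needed.
print([u["name"] for u in users_by_department["engineering"]])  # ['alice', 'bob']
print([u["name"] for u in users_by_city["Berlin"]])             # ['alice', 'carol']
```

The trade-off is an extra write per table and more disk usage in exchange for reads that touch a single partition.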

Disk space is generally the cheapest resource when compared to Memory and CPU. Cassandra is designed around this assumption.

Model Around Your Queries

I have touched on this point multiple times already, but just to drive it home: your model is determined by your query requirements.

This might be counterintuitive if you are coming from a relational background. There we are used to designing models based on Classes, Objects and Entities. Afterwards, we can get any data we want through JOINs and other complicated clauses.

In Cassandra you want your queries to be very simple. That’s why you want your model to be designed around your queries.

What happens if your queries change? You duplicate the data in another table and base your newer queries on that. It might seem like a lot of overhead, but if you have the infrastructure set up with Cassandra connectors and a message broker (like Kafka), spinning up a new table with data from another table is relatively straightforward.

Writes Cheaper Than Reads

In Cassandra writes are much cheaper than reads.

At a high level, when you write to Cassandra, three things happen:

The node appends the record to an append-only commit log
The node writes the record to an in-memory data structure called a MemTable
The node acknowledges your write
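
The three steps above can be modeled with a toy class. This is only a sketch of the sequence: `ToyNode` is hypothetical, a Python list stands in for the commit log, and later stages of the real storage engine (such as flushing the MemTable to disk) are left out.

```python
class ToyNode:
    """Very simplified model of the write path described above."""

    def __init__(self):
        self.commit_log = []  # append-only; stands in for the on-disk log
        self.memtable = {}    # in-memory structure, keyed by partition key

    def write(self, partition_key: str, row: dict) -> bool:
        # 1. Append the mutation to the commit log for durability.
        self.commit_log.append((partition_key, row))
        # 2. Apply it to the memtable.
        self.memtable.setdefault(partition_key, []).append(row)
        # 3. Acknowledge the write.
        return True


node = ToyNode()
ack = node.write("engineering", {"name": "alice"})
print(ack)                   # True
print(len(node.commit_log))  # 1
print(node.memtable)         # {'engineering': [{'name': 'alice'}]}
```

Nothing in this path reads existing data or reorganizes anything, which is the intuition behind writes being cheap.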

No random disk I/O happens on the write path: the commit log is only appended to sequentially, and the MemTable lives in memory and is flushed to disk later. That is why writing is so cheap in Cassandra, and why Cassandra is a great choice for write-intensive applications. All kinds of writes are similarly efficient.

However, as you saw in this article, when it comes to reading, your data has to be laid out in the correct fashion. Otherwise, Cassandra needs to scan multiple partitions for a read which means scanning data on multiple hosts. It can get expensive very quickly.

Designing data models in Cassandra is distinctly different from designing them in other databases, especially relational databases. You really need to put some thought into the data models, keeping your queries in mind. However, you can get crazy I/O performance if you design the schema correctly.