redshift_轻松修复Redshift性能

最新推荐文章于 2023-07-05 04:08:12 发布

weixin_26631359

最新推荐文章于 2023-07-05 04:08:12 发布

阅读量600

点赞数

文章标签： java python

原文链接：https://towardsdatascience.com/easy-fixes-for-redshift-performance-3fb66743b8c

版权

redshift

Redshift is awesome, until it stops being that. Usually, it isn’t so much Redshift’s fault when that happens. One of the most common problems that people using Redshift face is of bad query performance and high query execution times. It might be hard to digest but most of the Redshift problems are seen because people are just used to querying relational databases. Although the look & feel of a traditional ACID compliant relational database & a MPP distributed relational database are the same but the underlying principles of storage & distribution completely change the game.

Redshift很棒，直到不再是那样。通常，发生这种情况并不是Redshift的错。使用Redshift的人最常见的问题之一是查询性能差和查询执行时间长。可能很难消化，但是可以看到大多数Redshift问题，因为人们只是习惯于查询关系数据库。尽管传统的符合ACID的关系数据库和MPP分布式关系数据库的外观相同，但是存储和分发的基本原理完全改变了游戏。

Two major things about Redshift change everything when comparing it to ACID compliant relational databases like MySQL, PostgreSQL and so on.

与Redshift和与ACID兼容的关系数据库(例如MySQL，PostgreSQL等)进行比较时，有关Redshift的两件主要事情改变了一切。

Underlying storage — Columnar storage instead of row-based storage
基础存储 -列存储而不是基于行的存储
Massively Parallel Processing architecture — Although possible in MySQL, PostgreSQL etc. too, this is default setup in Redshift
大规模并行处理体系结构 —尽管在MySQL，PostgreSQL等中也可能，但这是Redshift中的默认设置

We’ll now list down a couple of quick fixes and things to remember when getting started with Redshift workloads. If you’re not across Redshift, going through the following document would be very helpful.

现在，我们将列出一些快速修复以及在开始使用Redshift工作负载时要记住的事情。如果您不熟悉Redshift，请仔细阅读以下文档。

切记— Redshift中没有索引 (Remember — There are No Indexes in Redshift)

Unlike relational databases, Redshift doesn’t support indexes. Don’t write the same kind of queries that you write in relational databases like MySQL or PostgreSQL. Redshift is designed to perform the best when you select the columns that you absolutely, most certainly need to query — the same way you’re supposed to SELECT records in a row-based database, you’re required to select columns in a column-based database.

与关系数据库不同，Redshift不支持索引。不要编写与在关系数据库(如MySQL或PostgreSQL)中编写的查询类型相同的查询。 Redshift旨在在您选择绝对(最肯定需要查询)的列时发挥最佳性能-与您应该SELECT基于行的数据库中的记录的方式相同，您需要在列中选择列-基于数据库。

没有索引？那么如何有效地联接表？ (No Indexes? So How Do You Efficiently Join Tables?)

Joins in relational databases use indexing. As we just established that indexes are absent from Redshift, then what will you use? You will use a similar sounding construct to join tables. That construct is called Distribution Key — a column based on which data is distributed across different nodes of the Redshift cluster. How the data is distributed is defined by the distribution style selected for a given table.

关系数据库中的联接使用索引。当我们刚刚确定Redshift缺少索引时，您将使用什么？您将使用类似的探测构造来联接表。这种结构称为“ 分发密钥” -基于该列的数据在Redshift群集的不同节点之间分布。数据的分配方式由为给定表选择的分配方式定义。

统计的重要性 (The Importance of Statistics)

Similar to any other database like MySQL, PostgreSQL etc., Redshift’s query planner also uses statistics about tables. Based on those statistics, the query plan decides to go one way or the other when choosing one of many plans to execute the query. That’s why it’s a good practice to ANALYZE the tables every now and then. There’s no good frequency to run this that suits all. Beware that ANALYZE can be a time consuming activity. Do a quick cost-benefit analysis before deciding on the frequency.

与MySQL，PostgreSQL等其他任何数据库一样，Redshift的查询计划程序也使用有关表的统计信息。基于这些统计信息，查询计划在选择执行查询的多个计划中的一个时决定采用一种或另一种方式。这就是为什么ANALYZE表是一个好习惯的原因。没有一个很好的频率来运行适合所有人的软件。注意， ANALYZE可能是一项耗时的活动。在确定频率之前，请进行快速的成本效益分析。

There’s a great article which mentions some of the things I talked about here. Please refer this for more detail.

有一篇很棒的文章提到了我在这里谈论的一些事情。请参阅此以获取更多详细信息。

为您的色谱柱选择合适的压缩率 (Choose The Right Compression for Your Columns)

One of the great things about columnar databases is that you can achieve high levels of compression because a lot of the same kind of data is stored contiguously — which means that many blocks on disk contain data of the same data type. Doesn’t that make compression easier!

列式数据库的一大优点是，由于大量相同类型的数据是连续存储的，因此可以实现高级别的压缩-这意味着磁盘上的许多块都包含相同数据类型的数据。这不是使压缩更容易！

With row-based databases, all the data for a single row is contiguously stored. That is why it is better to reduce the number of rows scanned.

对于基于行的数据库，单个行的所有数据都连续存储。 因此，最好减少扫描的行数。

Because of this, compression is hard in row-based databases as a row may contain data of different data types which means that adjacent blocks on disk might have different data types.

因此，在基于行的数据库中很难进行压缩，因为一行可能包含不同数据类型的数据，这意味着磁盘上的相邻块可能具有不同的数据类型。

Image for post — Wikimedia Commons (Labelled for Reuse) 维基共享资源 (重用)

照顾磁盘空间 (Take Care of Disk Space)

Managing disk space is usually a problem with all databases, especially when you’re dealing with analytical loads because analysts, data engineers create a lot of tables for further analysis. Even if that’s not the case, reduced free space affects query performance badly because it makes it difficult for the query execution engine to create temporary stuff on disk by inducing a lot of swapping.

管理磁盘空间通常是所有数据库的问题，尤其是当您要处理分析负载时，因为分析师，数据工程师会创建很多表以进行进一步分析。即使不是这种情况，减少的可用空间也会严重影响查询性能，因为这会使查询执行引擎难以通过诱导大量交换在磁盘上创建临时内容。

If you don’t have any control over disk space, it is better for a more elastic option in Spectrum — here you essentially move all your data to S3 which can scale as much as you want without any downtime for increasing disk space.

如果您对磁盘空间没有任何控制，最好在Spectrum中使用一个更具弹性的选项-在这里，您实际上将所有数据移至S3，它可以按需扩展，而无需任何停机时间即可增加磁盘空间。

These were some of the basic things that you can keep in mind. There are many more ideas which you can find on AWS’s documentation for Redshift. The documentation isn’t always helpful. For that time, you can refer to some of the links that have been shared through this post and below in the references section.

这些是您需要牢记的一些基本内容。您可以在AWS的Redshift文档中找到更多想法。该文档并不总是有用的。这段时间，您可以在本节以及下面的“参考”部分中引用一些共享的链接。