If you’re using AWS for data transformation, you’re going to run into Athena sooner or later. Athena allows you to query data across multiple data stores, with a well-known SQL syntax (Presto 6.15).
如果您使用AWS进行数据转换,则迟早会遇到Athena。 Athena允许您使用众所周知SQL语法(Presto 6.15)在多个数据存储中查询数据。
With S3 as a storage solution, Athena promises to handle the complexity of a huge database for you. Serverless compute and storage means an entirely serverless database experience. All you need to do is know where all of the red flags are. In this article, I’ve listed some of the situations I’ve found myself in over the past few months.
使用S3作为存储解决方案,Athena承诺为您处理庞大数据库的复杂性。 无服务器计算和存储意味着完全无服务器的数据库体验。 您需要做的就是知道所有红色标记在哪里。 在本文中,我列出了过去几个月来遇到的一些情况。
分区与雅典娜不可商议 (Partitioning Is Non-Negotiable With Athena)
When you’re writing out your data into AWS Glue tables, there should be one word at the forefront of your conversation: partitioning. Partitioning instructs AWS Glue on how to group your files together in S3 so that your queries can run over the smallest possible set of data.
在将数据写到AWS Glue表中时,在对话的最前面应该有一个词:partitioning 。 分区指导AWS Glue如何在S3中将文件分组在一起,以便您的查询可以在最小的数据集上运行。
Poor partitioning strategies have been the bane of databases for decades. It is very difficult to get this right since an optimisation inevitably means becoming worse at something, as you specialise in something else. Here are the questions to ask yourself when you’re designing your partition:
几十年来,不良的分区策略一直是数据库的祸根。 做到这一点非常困难,因为优化不可避免地意味着在某些事情上会变得更糟,因为您专注于其他事情。 这是设计分区时要问自己的问题:
- How is this data going to be queried? 如何查询这些数据?
- How much data per partition does that mean? 这意味着每个分区有多少数据?
- How often are we going to be querying this data? 我们将多久查询一次此数据?
- Is this datastore going to morph into something completely different? How would we handle that? 这个资料储存库会变成完全不同的东西吗? 我们将如何处理?
A very common partitioning strategy is to partition on a date key. The AWS Glue libraries come fitted with a mechanism for specifying your partition columns out of the box. Using these libraries, your code may look something like this:
一种非常常见的分区策略是对日期键进行分区。 AWS Glue库配备了一种机制,用于开箱即用地指定分区列。 使用这些库,您的代码可能看起来像这样:
glue_context.write_dynamic_frame.from_options(
frame = projectedEvents,
connection_options = {
"path": "$outpath",
"partitionKeys": ["date"]
},
format = "parquet"
)
In this example, we’re telling Glue to write the output in a parquet format and to partition on the date
field. If we were to open up S3, we would see hive-style partitions of the form:
在此示例中,我们告诉Glue以拼花形式写入输出并在date
字段上进行分区。 如果要打开S3,我们将看到以下形式的蜂巢式分区 :
/date=2020-05-01/…/type=2020-05-02/…/type=2020-05-03/…
The statement we’ve made is this: “We want to optimise on queries within a day.” If we were planning on running lots of queries that spanned over many days, this partitioning strategy would not help us to optimise our costs. Remember, Athena charges by the amount of data scanned — nothing else.
我们发表的声明是:“我们希望在一天内优化查询。” 如果我们计划运行许多跨越许多天的查询,那么这种分区策略将无助于我们优化成本。 请记住,雅典娜按扫描的数据量收费—别无其他。
不要害怕在数据上存储多个视图 (Don’t be afraid to store multiple views on the data)
Your AWS storage costs are nothing compared to the read/write costs. If you can greatly optimise your S3 I/O by storing a duplicated set of data with different partitions, it’ll usually work out as savings.
与读/写成本相比,您的AWS存储成本不算什么。 如果可以通过使用不同分区存储重复的数据集来极大地优化S3 I / O,则通常可以节省下来。
实木复合地板可以为您节省很多钱 (Parquet can save you a lot of money)
The output format you choose to write in can seem like personal preference to the uninitiated (read: me a few weeks ago). However, this choice can profoundly impact the operational cost of your system. Parquet is a columnar storage format, meaning it doesn’t group whole rows together. One file may contain a subset of the columns for a given row.
您选择写入的输出格式可能看起来像是对未开始使用的人的个人喜好(几周前阅读:我)。 但是,这种选择会严重影响系统的运营成本。 Parquet是一种列式存储格式,这意味着它不会将整个行分组在一起。 一个文件可能包含给定行的列的子集。
If you run a query like this against a stack of JSON files, what do you think Athena will have to do?
如果您对一堆JSON文件运行这样的查询,您认为Athena必须做什么?
SELECT name, age, dob from my_huge_json_table where dob = '2020-05-01';
It will be forced to pull the whole JSON document for everything that matches that dob
and scan through it. There could be 100 different columns in your JSON file, but you’re only interested in three of them.
它将强制为匹配dob
所有内容提取整个JSON文档并对其进行扫描。 JSON文件中可能有100个不同的列,但是您只对其中三个感兴趣。
The same query run against parquet is far easier to optimise. Instead of pulling the whole file, Athena can sniff out the exact files it needs. It won’t be perfect. Depending on the size of your files, Athena may be forced to sift through some extra data, but this additional dimension means that specific queries can operate over specific datasets. This results in potentially significant cost savings.
针对镶木地板运行的同一查询要容易得多。 雅典娜可以提取出所需的确切文件,而不必提取整个文件。 这不是完美的。 根据文件的大小,可能会迫使Athena筛选一些额外的数据,但是这个额外的维度意味着特定的查询可以对特定的数据集进行操作。 这样可以节省大量成本。
雅典娜不仅可以提取数据,还可以带来更多好处 (Athena Is Good for More Than Just Extracting Data)
Athena is often discussed in the documentation as a way of extracting the data from your tables once you’re happy with it. It’s almost a presentational layer that APIs can hook into. This is correct but limited.
文档中经常讨论Athena,将其作为对表满意后从表中提取数据的一种方式。 这几乎是API可以挂接的表示层。 这是正确的,但有局限性。
SQL is a powerful data transformation language that, when used properly, can result in very fast-running jobs. Some operations, such as window functions and aggregate functions, work nicely in a SQL syntax and result in much more straightforward, elegant code. With the introduction of CTAS, you can write metadata directly to the Glue datastore without the need for a crawler.
SQL是一种强大的数据转换语言,如果使用得当,它可以导致运行非常快的作业。 某些操作(例如,窗口函数和聚合函数)可以很好地以SQL语法运行,并产生更直接,更简洁的代码。 随着CTAS的引入,您可以将元数据直接写入Glue数据存储区,而无需搜寻器。
One of the lessons we learned was that Athena can be used to clean the data itself. Athena makes use of Presto 6.15 — have a read of the documentation. The Presto DBMS has a plethora of great functions to tap into.
我们吸取的教训之一是,雅典娜可用于清理数据本身。 雅典娜使用Presto 6.15-阅读文档 。 Presto DBMS具有许多强大的功能可供利用。
Athena is powerful, but it has some quirks that took us a while to work out.
雅典娜功能强大,但有些怪癖使我们花了一段时间才能解决。
雅典娜不喜欢连字符 (Athena Doesn’t Like Hyphens)
This is a small one, but it can result in some bizarre behaviour. Don’t put hyphens in your table names. They can break your queries. And not in the “Oh, everything is suddenly very broken” kind of way. In the “Oh, this query is doing something completely random now” kind of way.
这虽然很小,但是可能会导致一些奇怪的行为。 不要在表名中加上连字符。 他们可能破坏您的查询。 而不是以“哦,一切突然变得很崩溃”的方式。 在“哦,此查询现在正在做的事情完全是随机的”这种方式。
Avoid the dumpster fire and go for underscores. It doesn’t change readability too much and is one less thing to worry about. If you’re deadset on using hyphens, you can wrap your column names in "
. This has fixed the issues when I have seen it crop up, but I don’t know if it’s a genuine fix or if it has quirks.
避免垃圾箱起火并强调下划线。 它不会太大地改变可读性,并且不用担心。 如果您对使用连字符不满意,可以将列名包装在"
。这已经解决了我看到的连字符时出现的问题,但是我不知道它是否是真正的修复程序或是否有古怪之处。
雅典娜真的不喜欢Global ORDER BY
(Athena Really Doesn’t Like Global ORDER BY
)
This crops up in other DBMS systems too, but it was the cause of some of our Athena queries hitting the “resource exhausted” error:
这也出现在其他DBMS系统中,但这是导致我们的某些Athena查询遇到“资源耗尽”错误的原因:
AWS Athena Error: Query exhausted resources at this scale factor
In every case where this has popped up, we’ve found that the best way to optimise our queries is to limit the number of ORDER BY
statements we have in our query to the bare minimum. This is because an ORDER BY
over your whole dataset means moving your data onto a single node so that it can be sorted.
在每种情况下,我们都发现优化查询的最佳方法是将查询中拥有的ORDER BY
语句的数量限制到最小。 这是因为对整个数据集进行ORDER BY
意味着将数据移动到单个节点上,以便可以对其进行排序。
The ORDER BY
statement is just one of the culprits for greedy Athena queries. Fortunately, AWS has put together a great list of options for you to make the most out of Athena without setting fire to a server somewhere in Dublin.
ORDER BY
语句只是贪婪的Athena查询的元凶之一。 幸运的是,AWS汇集了很多选项 ,可让您充分利用Athena,而不会在都柏林的某处服务器上惹火。
翻译自: https://medium.com/better-programming/aws-athena-at-scale-a90c58c3b110