Accelerate Amazon DynamoDB data access with the new Amazon Glue DynamoDB export connector

Amazon Cloud Developer (亚马逊云开发者)

Published 2022-09-13 19:36:29


Original link: https://mp.weixin.qq.com/s?__biz=Mzg4NjU5NDUxNg==&mid=2247528528&idx=1&sn=0ab0d31dde1b474634d355bc6e063e7c&chksm=cf957eebf8e2f7fda9120f81e470ae2ff0c7a76c8b86c85ace925b065a29b7664ebc907580f2&scene=126&&sessionid=0


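Background

A Lake House architecture encourages the integration of data lakes, data warehouses, and purpose-built data stores, enabling unified governance and easy data movement. With a Lake House architecture on Amazon Web Services, you can store data in a data lake and use a ring of purpose-built data services around the lake to make decisions quickly and flexibly. Amazon Glue is a key service for building a Lake House architecture: it integrates data across the data lake, the data warehouse, and purpose-built data stores. Amazon Glue simplifies data movement in every direction: inside-out, outside-in, or around the perimeter. One powerful purpose-built data store is Amazon DynamoDB, which is widely used by hundreds of thousands of companies, including Amazon.com. It is very common to move data from DynamoDB to a data lake built on top of Amazon Simple Storage Service (Amazon S3). Many customers use Amazon Glue extract, transform, and load (ETL) jobs to move data from DynamoDB to Amazon S3.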

Lake House architecture:

https://aws.amazon.com/cn/big-data/datalakes-and-analytics/modern-data-architecture/

Amazon Glue:

https://aws.amazon.com/cn/glue/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc

Amazon DynamoDB:

https://aws.amazon.com/cn/dynamodb/

Amazon Simple Storage Service:

https://aws.amazon.com/cn/s3/

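Today, we are pleased to announce the general availability of the new Amazon Glue DynamoDB export connector. It is built on top of the DynamoDB table export feature and provides a scalable, cost-efficient way to read large DynamoDB tables in Amazon Glue ETL jobs. This post describes the benefits of the new export connector and its use cases. Typical use cases for reading from DynamoDB tables with Amazon Glue ETL jobs include:

Moving data from DynamoDB tables to other data stores
Integrating the data with other services and applications
Retaining historical snapshots for auditing
Building an S3 data lake from DynamoDB data and analyzing the data with services such as Amazon Athena, Amazon Redshift, and Amazon SageMaker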

DynamoDB table export feature:

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Introduction.html

Amazon Athena:

https://aws.amazon.com/cn/athena/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc

Amazon Redshift:

https://aws.amazon.com/cn/redshift/

Amazon SageMaker:

https://aws.amazon.com/cn/sagemaker/

1. The new Amazon Glue DynamoDB export connector

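The legacy Amazon Glue DynamoDB connector reads DynamoDB tables through the DynamoDB Scan API. In contrast, the new Amazon Glue DynamoDB export connector reads DynamoDB data from a snapshot that is exported from the DynamoDB table. This approach has the following benefits:

► It does not consume read capacity units of the source DynamoDB table

► Read performance is consistent for large DynamoDB tables

In particular, for large DynamoDB tables of more than 100 GB, the new connector is much faster than the legacy connector.

To use the new export connector, you must enable point-in-time recovery (PITR) on the source DynamoDB table in advance.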

DynamoDB Scan API:

https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Scan.html

Point-in-time recovery:

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/PointInTimeRecovery.html
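The PITR prerequisite above can also be met programmatically. The following is a minimal sketch, assuming a boto3-style DynamoDB client is passed in; the helper names build_pitr_request and enable_pitr are hypothetical and introduced here only for illustration. It relies on DynamoDB's UpdateContinuousBackups operation.

```python
def build_pitr_request(table_name: str) -> dict:
    """Build the arguments for DynamoDB's UpdateContinuousBackups call."""
    if not table_name:
        raise ValueError("table_name must not be empty")
    return {
        "TableName": table_name,
        "PointInTimeRecoverySpecification": {"PointInTimeRecoveryEnabled": True},
    }


def enable_pitr(dynamodb_client, table_name: str) -> dict:
    """Enable PITR on a table through a boto3-style DynamoDB client.

    The client is injected so the helper can be exercised without AWS access.
    """
    request = build_pitr_request(table_name)
    return dynamodb_client.update_continuous_backups(**request)
```

In a real environment you would call it as enable_pitr(boto3.client("dynamodb"), "test_source"), where test_source is a placeholder table name.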

2. How to use the new connector in Amazon Glue Studio Visual Editor

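Amazon Glue Studio Visual Editor is a graphical interface that makes it easy to create, run, and monitor Amazon Glue ETL jobs. The new DynamoDB export connector is available in Amazon Glue Studio Visual Editor, where you can choose Amazon DynamoDB as the source.

After you choose Create, you see the visual Directed Acyclic Graph (DAG). Here you can choose a DynamoDB table that exists in this account or Region. This lets you select a DynamoDB table (with PITR enabled) directly as a source in Amazon Glue Studio, which gives you a one-click export from any DynamoDB table to Amazon S3. You can also easily add any data sources, targets, or transformations to the DAG. For example, you can join two different DynamoDB tables and export the result to Amazon S3, as shown in the following screenshot.

The following two connection options are added automatically. They specify the location used to store temporary data during the DynamoDB export phase. You can set an S3 bucket lifecycle policy (https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) to expire the temporary data.

dynamodb.s3.bucket – The S3 bucket used to store temporary data during the DynamoDB export
dynamodb.s3.prefix – The S3 prefix used to store temporary data during the DynamoDB export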
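To illustrate the lifecycle policy mentioned above, here is a minimal sketch that builds an expiration rule scoped to the temporary export prefix and applies it through a boto3-style S3 client. The helper names, the rule ID, and the retention period are assumptions chosen for this example; the request shape follows S3's PutBucketLifecycleConfiguration operation.

```python
def build_expiration_rule(prefix: str, days: int) -> dict:
    """Build a lifecycle configuration that expires objects under `prefix`."""
    if days < 1:
        raise ValueError("days must be at least 1")
    return {
        "Rules": [
            {
                "ID": "expire-glue-ddb-export-temp",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


def apply_expiration_rule(s3_client, bucket: str, prefix: str, days: int) -> dict:
    """Apply the rule with a boto3-style S3 client passed in by the caller."""
    return s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=build_expiration_rule(prefix, days),
    )
```

Note that PutBucketLifecycleConfiguration replaces the bucket's existing lifecycle configuration, so if the bucket already has rules, merge them into the request rather than sending this rule alone.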

3. How to use the new connector in job script code

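You can use the new export connector when you create an Amazon Glue DynamicFrame in job script code by configuring the following connection options:

dynamodb.export – (Required) Set this to ddb or s3
dynamodb.tableArn – (Required) The ARN of your source DynamoDB table
dynamodb.unnestDDBJson – (Optional) If set to true, un-nests the DynamoDB JSON structure present in exports. The default is false
dynamodb.s3.bucket – (Optional) The S3 bucket used to store temporary data during the DynamoDB export
dynamodb.s3.prefix – (Optional) The S3 prefix used to store temporary data during the DynamoDB export

The following is sample Python code that creates a DynamicFrame with the new export connector: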

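from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Create the GlueContext used to build DynamicFrames.
glue_context = GlueContext(SparkContext.getOrCreate())

# Read the DynamoDB table through the export connector.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.export": "ddb",  # "ddb" invokes a new export; "s3" reads an existing one
        "dynamodb.tableArn": "test_source",  # placeholder: use the ARN of your source table
        "dynamodb.unnestDDBJson": True,
        "dynamodb.s3.bucket": "bucket name",  # placeholder: bucket for temporary export data
        "dynamodb.s3.prefix": "bucket prefix",  # placeholder: prefix for temporary export data
    },
)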


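Unlike the old connector, the new export connector requires no configuration related to Amazon Glue job parallelism. You no longer need to change the configuration when you scale out the Amazon Glue job, and you do not need any configuration related to the DynamoDB table's read/write capacity or its capacity mode (on-demand or provisioned).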
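Because the connector is driven entirely by the connection options listed in this section, it can help to assemble and validate them in one place before passing them to create_dynamic_frame.from_options. The helper below is a hypothetical sketch, not part of the Glue API; it only enforces the rules stated in this section (dynamodb.export must be ddb or s3, and dynamodb.tableArn is required).

```python
from typing import Optional


def build_export_options(
    table_arn: str,
    export: str = "ddb",
    unnest: bool = False,
    bucket: Optional[str] = None,
    prefix: Optional[str] = None,
) -> dict:
    """Assemble connection options for the DynamoDB export connector."""
    if export not in ("ddb", "s3"):
        raise ValueError('dynamodb.export must be "ddb" or "s3"')
    if not table_arn:
        raise ValueError("dynamodb.tableArn is required")
    options = {
        "dynamodb.export": export,
        "dynamodb.tableArn": table_arn,
        "dynamodb.unnestDDBJson": unnest,
    }
    # The S3 location is optional, so include it only when provided.
    if bucket is not None:
        options["dynamodb.s3.bucket"] = bucket
    if prefix is not None:
        options["dynamodb.s3.prefix"] = prefix
    return options
```

The returned dictionary can be passed as connection_options together with connection_type="dynamodb".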

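4. DynamoDB table schema handling

By default, the new export connector reads data in the DynamoDB JSON structure present in exports. The following is an example schema of a frame that uses the Amazon Customer Review Dataset (https://s3.amazonaws.com/amazon-reviews-pds/readme.html):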

root
|-- Item: struct (nullable = true)
| |-- product_id: struct (nullable = true)
| | |-- S: string (nullable = true)
| |-- review_id: struct (nullable = true)
| | |-- S: string (nullable = true)
| |-- total_votes: struct (nullable = true)
| | |-- N: string (nullable = true)
| |-- product_title: struct (nullable = true)
| | |-- S: string (nullable = true)
| |-- star_rating: struct (nullable = true)
| | |-- N: string (nullable = true)
| |-- customer_id: struct (nullable = true)
| | |-- S: string (nullable = true)
| |-- marketplace: struct (nullable = true)
| | |-- S: string (nullable = true)
| |-- helpful_votes: struct (nullable = true)
| | |-- N: string (nullable = true)
| |-- review_headline: struct (nullable = true)
| | |-- S: string (nullable = true)
| | |-- NULL: boolean (nullable = true)
| |-- review_date: struct (nullable = true)
| | |-- S: string (nullable = true)
| |-- vine: struct (nullable = true)
| | |-- S: string (nullable = true)
| |-- review_body: struct (nullable = true)
| | |-- S: string (nullable = true)
| | |-- NULL: boolean (nullable = true)
| |-- verified_purchase: struct (nullable = true)
| | |-- S: string (nullable = true)
| |-- product_category: struct (nullable = true)
| | |-- S: string (nullable = true)
| |-- year: struct (nullable = true)
| | |-- N: string (nullable = true)
| |-- product_parent: struct (nullable = true)
| | |-- S: string (nullable = true)


To read DynamoDB item columns without handling nested data, set dynamodb.unnestDDBJson to True. The following is the schema of the same data with dynamodb.unnestDDBJson set to True:

root
|-- product_id: string (nullable = true)
|-- review_id: string (nullable = true)
|-- total_votes: string (nullable = true)
|-- product_title: string (nullable = true)
|-- star_rating: string (nullable = true)
|-- customer_id: string (nullable = true)
|-- marketplace: string (nullable = true)
|-- helpful_votes: string (nullable = true)
|-- review_headline: string (nullable = true)
|-- review_date: string (nullable = true)
|-- vine: string (nullable = true)
|-- review_body: string (nullable = true)
|-- verified_purchase: string (nullable = true)
|-- product_category: string (nullable = true)
|-- year: string (nullable = true)
|-- product_parent: string (nullable = true)
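To make the difference between the two schemas concrete, the following plain-Python sketch flattens a single record from the nested DynamoDB JSON form into the flat form. It is an illustration only, not the connector's implementation: it handles single-valued type wrappers such as S, N, and NULL, keeps N values as strings (matching the string columns in the un-nested schema above), and does not descend into map or list attributes. The sample values are made up.

```python
def unnest_ddb_json(record: dict) -> dict:
    """Flatten {"Item": {name: {type_tag: value}}} into {name: value}."""
    flat = {}
    for name, typed_value in record["Item"].items():
        # Each attribute is wrapped in a one-entry dict keyed by its type tag.
        type_tag, value = next(iter(typed_value.items()))
        # A NULL attribute carries no value, so represent it as None.
        flat[name] = None if type_tag == "NULL" else value
    return flat


sample = {
    "Item": {
        "product_id": {"S": "B000EXAMPLE"},  # made-up value
        "total_votes": {"N": "3"},
        "review_headline": {"NULL": True},
    }
}
flat_sample = unnest_ddb_json(sample)
```

Here flat_sample maps product_id and total_votes to plain strings and review_headline to None.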


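5. Data freshness

Data freshness is a measure of how stale the data is relative to the live table in the original source. With the new export connector, the dynamodb.export option affects data freshness.

When dynamodb.export is set to ddb, the Amazon Glue job invokes a new export and then reads the export placed in the S3 bucket into a DynamicFrame. Because it reads an export of the live table, the data can be fresh. When dynamodb.export is set to s3, the Amazon Glue job skips invoking a new export and directly reads an export that is already in the S3 bucket. Because it reads an export of a past state of the table, the data can be stale, but you avoid the overhead of triggering an export.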
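One way to act on this trade-off is to decide the value of dynamodb.export from how old the most recent export is. The sketch below is a hypothetical policy, not something the connector provides: choose_export_mode, last_export_time, and max_staleness are names introduced for this example, and how you track the time of the last export is left to your job.

```python
from datetime import datetime, timedelta
from typing import Optional


def choose_export_mode(
    last_export_time: Optional[datetime],
    max_staleness: timedelta,
    now: datetime,
) -> str:
    """Return "s3" to reuse a recent export, or "ddb" to invoke a new one."""
    if last_export_time is None:
        # No previous export exists, so a new one must be created.
        return "ddb"
    age = now - last_export_time
    return "s3" if age <= max_staleness else "ddb"
```

For example, with a tolerance of one day, an export taken two hours ago would be reused ("s3"), while one taken three days ago would trigger a new export ("ddb").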

The following table summarizes data freshness and the pros and cons of each option.

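6. Performance

The following benchmark shows the performance improvement of the new export connector over the legacy Amazon Glue DynamoDB connector. The comparison uses DynamoDB tables that store the TPC-DS benchmark dataset (https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v2.5.0.pdf), ranging from 10 MB to 2 TB. The sample Spark job reads from the DynamoDB table and counts the items. All Spark jobs ran on Amazon Glue 3.0 with 60 G.2X workers.

The following chart compares Amazon Glue job duration between the old connector and the new export connector. For small DynamoDB tables, the old connector is faster. For large tables of more than 80 GB, the new export connector is faster. In other words, the DynamoDB export connector is recommended for jobs that take more than 5-10 minutes to run with the old connector. The chart also shows that the duration of the new export connector grows slowly as data size increases, whereas the duration of the old connector grows rapidly. This means the new export connector is especially well suited to larger tables.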
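The benchmark result can be turned into a simple rule of thumb. The sketch below is an assumption-laden illustration: the 80 GB default comes from the benchmark above, which used one specific cluster size and dataset, so treat it as a starting point rather than a universal threshold. The function name is hypothetical, and GB is interpreted here as 10^9 bytes. The table size could come, for instance, from the approximate TableSizeBytes value returned by DynamoDB's DescribeTable.

```python
BYTES_PER_GB = 10 ** 9  # assumption: decimal gigabytes


def recommend_connector(table_size_bytes: int, threshold_gb: float = 80.0) -> str:
    """Suggest "export" for tables above the threshold, otherwise "scan"."""
    if table_size_bytes < 0:
        raise ValueError("table_size_bytes must not be negative")
    if table_size_bytes > threshold_gb * BYTES_PER_GB:
        return "export"
    return "scan"
```

A 10 MB table maps to "scan" and a 2 TB table maps to "export", which matches the two ends of the benchmark range.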

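7. Using Amazon Glue Auto Scaling

Amazon Glue Auto Scaling is a new feature that automatically resizes compute resources to improve performance and reduce cost. You can take advantage of Amazon Glue Auto Scaling with the new DynamoDB export connector.

As the following chart shows, with Amazon Glue Auto Scaling the new export connector takes less time than the old connector when the source DynamoDB table is 100 GB or larger. The trend is similar without Amazon Glue Auto Scaling.

You also gain cost efficiency, because only the Spark driver is active for most of the time during the DynamoDB export (which is nearly 30% of the total job duration with the old scan-based connector).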
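As a rough sketch of how a job might be defined with Auto Scaling turned on, the helper below builds arguments for a boto3 Glue create_job call. It rests on assumptions not stated in this post: that Auto Scaling is enabled through the --enable-auto-scaling job parameter, and that NumberOfWorkers then acts as the upper bound on workers. The job name, role, and script location are placeholders, and the helper name is hypothetical. The Glue version and worker type mirror the benchmark setup above.

```python
def build_autoscaling_job_args(
    name: str,
    role_arn: str,
    script_location: str,
    max_workers: int = 60,
) -> dict:
    """Build keyword arguments for a Glue create_job request."""
    if max_workers < 2:
        raise ValueError("max_workers must be at least 2")
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "3.0",
        "WorkerType": "G.2X",
        "NumberOfWorkers": max_workers,  # assumed to be the maximum under Auto Scaling
        "DefaultArguments": {"--enable-auto-scaling": "true"},  # assumed parameter name
    }
```

The result could be passed as boto3.client("glue").create_job(**args); check the current Glue documentation for the exact parameter before relying on it.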

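Conclusion

Amazon Glue is a key service for integrating with multiple data stores. At Amazon Web Services, we keep improving the performance and cost efficiency of our services. In this post, we announced the new Amazon Glue DynamoDB export connector. With this connector, you can easily integrate data from large DynamoDB tables with different data stores. It helps you read large tables from Amazon Glue jobs faster and at lower cost.

The new Amazon Glue DynamoDB export connector is now generally available in all supported Glue Regions. Start using it today! We look forward to your feedback and stories about how you use the connector to meet your needs.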

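About the authors

Noritaka Sekiyama

Principal Big Data Architect on the Amazon Glue team. He is responsible for building software artifacts that help customers build data lakes in the cloud.

Neil Gupta

Software Development Engineer on the Amazon Glue team. He enjoys tackling big data problems and learning more about distributed systems.

Andrew Kim

Software Development Engineer on the Amazon Glue team. He is passionate about building scalable, efficient solutions to challenging problems and about working with distributed systems.

Savio Dsouza

Software Development Manager on the Amazon Glue team. His team works on distributed systems for efficiently managing data lakes on Amazon and on optimizing Apache Spark for performance and reliability.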
