snowflake 数据库_将数据从Snowflake同步到DynamoDB-CSDN博客

snowflake 数据库

When implementing mobile and web applications, it’s a no-brainer to use a NoSQL service as a datastore. This is due to its ability to handle operations at low latency and it’s intuitive towards developers. Of the many NoSQL database out there, the fully managed AWS DynamoDB (Key-value) is one of the most commonly used highly scalable datastore.

实施移动和Web应用程序时，使用NoSQL服务作为数据存储区是不费吹灰之力的。这是因为它能够以低延迟处理操作，并且对开发人员很直观。在众多NoSQL数据库中，完全托管的AWS DynamoDB(键值)是最常用的高度可扩展的数据存储之一。

In this article, I will walk you through on how to integrate DynamoDB with the performance-enhanced capabilities of Snowflake to bridge the gap between the low latency datastore and heavy complex data computation.

在本文中，我将引导您逐步了解如何将DynamoDB与Snowflake的性能增强功能集成在一起，以弥合低延迟数据存储与繁重的复杂数据计算之间的鸿沟。

为什么要连接DynamoDB和Snowflake (Why connect DynamoDB and Snowflake)

If you have worked with a NoSQL datastore before, you will know its main purpose is for storing and receiving data, as opposed to running analytical computations. However, in a data-driven application, there are always elements of analytics.

如果您以前使用过NoSQL数据存储，您将知道它的主要目的是存储和接收数据，而不是运行分析计算。但是，在数据驱动的应用程序中，总是包含分析元素。

Given a typical scenario for a “Save Your Money” type application, you may want to alert the customer when their non-utilities spending for the current month is 20% more than the average from the past 12 months. To carry out this operation you may need to:

给定“省钱”类型应用程序的典型情况，当客户本月的非公用事业支出比过去12个月的平均支出高出20％时，您可能想提醒客户。要执行此操作，您可能需要：

Sum the spend this month for the non-utilities item
计算本月非公用事业支出
Compute the non-utilities spend in the last 12 months and average it
计算最近12个月的非公用事业支出，并将其平均
Find the percentage difference between the spend this month, and the average of the last 12 months
查找本月支出与最近12个月平均值之间的百分比差异
Compare if the customer spend this month % difference is greater than 20%
比较客户本月支出的百分比差异是否大于20％

From this, you will need the customers, their spending in the last 12 months, the spend classifier, and a bunch of calculations. This is all something DynamoDB is not good at and built for, but something Snowflake is exceptionally good at.

由此，您将需要客户，过去12个月中的客户支出，支出分类和大量计算。这是DynamoDB所不擅长和构建的，而Snowflake则特别擅长。

将Snowflake与DynamoDB集成 (Integrating Snowflake with DynamoDB)

There are some integration or EL tools which will allow you to load DynamoDB to Snowflake but not many the other way around which is what you want to power the frontend applications with analytics.

有一些集成或EL工具可让您将DynamoDB加载到Snowflake，但没有其他方法可以用来为前端应用程序提供分析功能。

I want to show you something I have put together which has been working very well for a project that I am working on. The best part, it’s a serverless and an event-driven approach.

我想向您展示我整理的一些东西，这些东西在我正在进行的项目中一直运行良好。最好的部分是无服务器和事件驱动的方法。

Snowflake Unloads the data into S3, whether someone does this manually, using an orchestration tool, or creating a Snowflake task.
无论有人使用编排工具手动执行此操作还是创建Snowflake任务，Snowflake都将数据卸载到S3中。
Once the data is dropped onto S3, a Lambda is triggered to read the file and PUT into DynamoDB.
将数据放到S3上后，将触发Lambda读取文件并将其放入DynamoDB中。

从Snowflake卸载到S3 (Unloading from Snowflake to S3)

For the purpose of the demo, I have loaded the Melbourne Carpark Sensor data for May 2020 onto Snowflake for the purpose of showing the ingestion volume.

出于演示目的，我已将2020年5月的墨尔本停车场传感器数据加载到Snowflake上，以显示摄入量。

I unloaded all of the parking sensor data into the snowflake-drop2dynamodb S3 bucket, which contains ~1.7 million rows of data. Please note that I created a hash for the “id” field of the DynamoDB table because I did not have a unique identifier. The COPY INTO statement will create chunks of files, with a max size of 10mb. For each chunk, Lambda will be called to ingest into DynamoDB (no guarantees of the order).

我将所有停车传感器数据卸载到了Snow-drop2dynamodb S3存储桶中，该存储桶包含约170万行数据。请注意，因为没有唯一标识符，所以我为DynamoDB表的“ id”字段创建了一个哈希。 COPY INTO语句将创建文件块，最大大小为10mb。对于每个块，将调用Lambda提取到DynamoDB中(不保证顺序)。

使用Lambda将雪花提取物插入DynamoDB (Inserting Snowflake Extract onto DynamoDB using Lambda)

Once the CSV file from Snowfload is unloaded onto the S3 bucket, Lambda with an execution timeout of 1 min, starts loading the file onto DynamoDB Table. The build of this was inspired by an AWS Demo. The python function does not load any of the data onto memory and so far this showed promising results.

将来自Snowfload的CSV文件卸载到S3存储桶后，执行超时为1分钟的Lambda将开始将文件加载到DynamoDB表中。此构建受AWS演示的启发。 python函数不会将任何数据加载到内存中，到目前为止，这已显示出令人鼓舞的结果。

结论 (Conclusion)

With the use of trigger-enabled S3 and Lambda, you can easily build an integration pipeline between Snowflake and DynamoDB. This allows you to sync DynamoDB with data from Snowflake either on-demand or on a schedule. With this, you can either add or update attributes, syncing an online ML feature store and performance batch computations, for DynamoDB.

使用启用了触发器的S3和Lambda，您可以轻松地在Snowflake和DynamoDB之间建立集成管道。这使您可以按需或按计划将DynamoDB与Snowflake的数据同步。这样，您就可以为DynamoDB添加或更新属性，同步在线ML功能存储和性能批处理计算。

There are some improvements that I want to implement due to some challenges that I face and this includes:

由于面临一些挑战，我想实现一些改进，其中包括：

Ability to put failures on the side to be reprocessed, and move completed files to another folder/bucket.
能够将故障放在待处理的一面，并将完成的文件移到另一个文件夹/存储桶。
Even though S3 guarantees at least once delivery, I want to build an audit table on top, powered by Snowflake function.
即使S3至少保证一次交付，但我还是要在由Snowflake函数提供支持的基础上构建审核表。
Have an event-driven orchestrates approach to increase reliability and durability.
采用事件驱动的协调方法来提高可靠性和持久性。

I hope you find this article useful!

希望本文对您有所帮助！

关于我 (About Me)

I have been helping businesses build scalable cloud and data solutions in the area of Production AI/ML, Data Engineering, Serverless and Containers, and have a growing interest in front-end developments. Feel free to connect with me on LinkedIn for a chat — just let me know you’re from Medium.

我一直在帮助企业在生产AI / ML，数据工程，无服务器和容器领域构建可扩展的云和数据解决方案，并且对前端开发越来越感兴趣。请随时在LinkedIn上与我联系以进行聊天-请让我知道您来自Medium。