Understanding Data Pipelines
“AWS Data Pipeline is a web service that provides a simple management system for data-driven workflows. Using AWS Data Pipeline, you define a pipeline composed of the “data sources” that contain your data, the “activities” or business logic such as EMR jobs or SQL queries, and the “schedule” on which your business logic executes. For example, you could define a job that, every hour, runs an Amazon Elastic MapReduce (Amazon EMR)–based analysis on that hour’s Amazon Simple Storage Service (Amazon S3) log data, loads the results into a relational database for future lookup, and then automatically sends you a daily summary email.”
----- From AWS
As this description shows, Data Pipeline is a job-level Resource Manager + Scheduler + Summary Reporter. Its basic idea is consistent with the traditional pipeline concept.
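The hourly log-analysis example in the quote above can be sketched as a pipeline definition. A minimal illustration, assuming a simplified version of the “objects” form that Data Pipeline definitions use; the object IDs, bucket path, and field values here are hypothetical, not a verified schema:

```python
# Hypothetical sketch of the hourly EMR-on-S3-logs pipeline from the quote.
# The object types (Schedule, S3DataNode, EmrActivity, SnsAlarm) follow the
# general shape of Data Pipeline definitions; exact field names are illustrative.
pipeline_definition = {
    "objects": [
        {"id": "HourlySchedule", "type": "Schedule",
         "period": "1 hour"},                                  # the "schedule"
        {"id": "HourlyLogs", "type": "S3DataNode",
         "schedule": {"ref": "HourlySchedule"},
         "directoryPath": "s3://example-bucket/logs/"},        # the "data source"
        {"id": "AnalyzeLogs", "type": "EmrActivity",
         "schedule": {"ref": "HourlySchedule"},
         "input": {"ref": "HourlyLogs"}},                      # the "activity"
        {"id": "FailureAlarm", "type": "SnsAlarm",
         "subject": "hourly analysis failed"},                 # the notification
    ]
}

# Every object in a definition carries an id and a type.
assert all({"id", "type"} <= o.keys() for o in pipeline_definition["objects"])
```

Note how the activity points at both its schedule and its input data node by reference, which is what lets the service track dependencies between objects.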
At Amazon, AWS Data Pipeline handles:
- Job scheduling, execution, and retry-on-failure logic
- Tracking dependencies between pieces of business logic, ensuring that all of a job's dependencies are satisfied before the job runs
- Generating and sending the necessary failure notifications
- Creating and managing the temporary resources that jobs require
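The retry-and-notify behaviour in the list above can be sketched as a small wrapper. This is a minimal illustration of the idea, not Data Pipeline's actual implementation; the function name and parameters are hypothetical:

```python
import time

def run_with_retries(activity, max_attempts=3, backoff_s=1.0, notify=print):
    """Run `activity` (a zero-argument callable), retrying on failure.

    If every attempt fails, send one failure notification and re-raise,
    mirroring the retry-and-notify behaviour described above.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return activity()
        except Exception as exc:
            if attempt == max_attempts:
                notify(f"activity failed after {attempt} attempts: {exc}")
                raise
            # simple linear backoff between attempts
            time.sleep(backoff_s * attempt)
```

A real scheduler would also persist attempt counts so that retries survive a restart of the scheduler itself.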
To ensure that an activity runs successfully, AWS Data Pipeline checks all of the resources the activity needs (the AWS site mentions only data; I think this could also cover CPU, network bandwidth, and other resources). This availability check is called a “precondition”, and the activity is blocked until the check passes.
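The blocking behaviour of a precondition can be sketched as a polling loop. A minimal sketch, assuming the check is any zero-argument callable; the function name, timeout, and poll interval are illustrative:

```python
import time

def wait_for_precondition(check, timeout_s=3600, poll_s=60):
    """Block an activity until its precondition passes.

    `check` is a zero-argument callable returning True once the required
    resource (data, CPU, bandwidth, ...) is available. Returns True if the
    precondition passed within the timeout, False otherwise.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True          # precondition satisfied; activity may run
        time.sleep(poll_s)       # re-check after the poll interval
    return False                 # timed out; activity stays blocked
```

In practice the check might test, for example, whether an expected S3 object has appeared before the downstream activity is released.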
On the user-interface side, AWS Data Pipeline provides:
- Management Console
- CLI
- Service APIs (for defining data sources, preconditions, activities, the schedule, and notification levels)
Data Pipeline Components
Data Collection: Moving data from data sources to a storage location
Data Acquisition: Obtaining data from various external data sources
Data Storage: The storage system
Data Processing: The ability to transform data in various useful ways, including annotation, filtering, and aggregation
Table Management/Metadata: Provide a consistent API for data consumers backed by a standard metadata system
Job Coordination/Scheduling: The ability to schedule, submit, manage, retry, reprocess, and catch up a DAG of jobs
Data Output: Enables push- or pull-based delivery of data, subject to policies
Data Policy Management: Anonymize, retain, clean up, and archive data
Monitoring/System Management: Provide the ability to operate, visualize, and install pipelines
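The Job Coordination/Scheduling component above boils down to running a DAG of jobs in dependency order. A minimal sketch of that core step using Kahn's topological-sort algorithm; the job names in the usage below are hypothetical:

```python
from collections import deque

def topological_run_order(deps):
    """Given deps mapping job -> set of jobs it depends on, return an
    order in which every job appears after all of its dependencies
    (Kahn's algorithm). Raises ValueError on a dependency cycle."""
    indegree = {job: len(d) for job, d in deps.items()}
    dependents = {job: [] for job in deps}
    for job, d in deps.items():
        for dep in d:
            dependents[dep].append(job)   # dep must finish before job
    ready = deque(job for job, n in indegree.items() if n == 0)
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for nxt in dependents[job]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:        # all dependencies satisfied
                ready.append(nxt)
    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order
```

A full coordinator layers retries, reprocessing, and catch-up runs on top of this ordering, but the dependency tracking itself is just this traversal.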