Understanding Data Pipelines
“AWS Data Pipeline is a web service that provides a simple management system for data-driven workflows. Using AWS Data Pipeline, you define a pipeline composed of the “data sources” that contain your data, the “activities” or business logic such as EMR jobs or SQL queries, and the “schedule” on which your business logic executes. For example, you could define a job that, every hour, runs an Amazon Elastic MapReduce (Amazon EMR)–based analysis on that hour’s Amazon Simple Storage Service (Amazon S3) log data, loads the results into a relational database for future lookup, and then automatically sends you a daily summary email.”
----- From AWS
As this description shows, Data Pipeline is a job-level Resource Manager + Scheduler + Summary Reporter. Its basic idea is consistent with the traditional pipeline concept.
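The hourly log-analysis example in the quote above can be sketched as a pipeline definition. A minimal illustration, assuming a simplified version of the “objects” form that Data Pipeline definitions use; the object IDs, bucket path, and field values here are hypothetical, not a verified schema:

```python
# Hypothetical sketch of the hourly EMR-on-S3-logs pipeline from the quote.
# The object types (Schedule, S3DataNode, EmrActivity, SnsAlarm) follow the
# general shape of Data Pipeline definitions; exact field names are illustrative.
pipeline_definition = {
    "objects": [
        {"id": "HourlySchedule", "type": "Schedule",
         "period": "1 hour"},                                  # the "schedule"
        {"id": "HourlyLogs", "type": "S3DataNode",
         "schedule": {"ref": "HourlySchedule"},
         "directoryPath": "s3://example-bucket/logs/"},        # the "data source"
        {"id": "AnalyzeLogs", "type": "EmrActivity",
         "schedule": {"ref": "HourlySchedule"},
         "input": {"ref": "HourlyLogs"}},                      # the "activity"
        {"id": "FailureAlarm", "type": "SnsAlarm",
         "subject": "hourly analysis failed"},                 # the notification
    ]
}

# Every object in a definition carries an id and a type.
assert all({"id", "type"} <= o.keys() for o in pipeline_definition["objects"])
```

Note how the activity points at both its schedule and its input data node by reference, which is what lets the service track dependencies between objects.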
At Amazon, AWS Data Pipeline handles:
- Job scheduling, execution, and retry-on-failure logic
- Tracking dependencies between pieces of business logic, ensuring that all of a job's dependencies are satisfied before the job runs
- Generating and sending the necessary failure notifications
- Creating and managing the temporary resources that jobs require
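The retry-and-notify behaviour in the list above can be sketched as a small wrapper. This is a minimal illustration of the idea, not Data Pipeline's actual implementation; the function name and parameters are hypothetical:

```python
import time

def run_with_retries(activity, max_attempts=3, backoff_s=1.0, notify=print):
    """Run `activity` (a zero-argument callable), retrying on failure.

    If every attempt fails, send one failure notification and re-raise,
    mirroring the retry-and-notify behaviour described above.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return activity()
        except Exception as exc:
            if attempt == max_attempts:
                notify(f"activity failed after {attempt} attempts: {exc}")
                raise
            # simple linear backoff between attempts
            time.sleep(backoff_s * attempt)
```

A real scheduler would also persist attempt counts so that retries survive a restart of the scheduler itself.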
To ensure that an activity runs successfully, AWS Data Pipeline checks all of the resources the activity needs (the AWS site mentions only data; I think this could also cover CPU, network bandwidth, and other resources). This availability check is called a “precondition”, and the activity is blocked until the check passes.
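The blocking behaviour of a precondition can be sketched as a polling loop. A minimal sketch, assuming the check is any zero-argument callable; the function name, timeout, and poll interval are illustrative:

```python
import time

def wait_for_precondition(check, timeout_s=3600, poll_s=60):
    """Block an activity until its precondition passes.

    `check` is a zero-argument callable returning True once the required
    resource (data, CPU, bandwidth, ...) is available. Returns True if the
    precondition passed within the timeout, False otherwise.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True          # precondition satisfied; activity may run
        time.sleep(poll_s)       # re-check after the poll interval
    return False                 # timed out; activity stays blocked
```

In practice the check might test, for example, whether an expected S3 object has appeared before the downstream activity is released.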
On the user-interface side, AWS Data Pipeline provides:
- Management Console
- CLI
- Service APIs (for defining data sources, preconditions, activities, the schedule, and notification levels)
Data Pipeline Components
Data Collection: Moving data from data sources to a storage location
Data Acquisition: Obtaining data from various external data sources
Data Storage: The storage system
Data Processing: The ability to transform data in various useful ways, including annotation, filtering, and aggregation
Table Management/Metadata: Provide a consistent API for data consumers backed by a standard metadata system
Job Coordination/Scheduling: The ability to schedule, submit, manage, retry, reprocess, and catch up a DAG of jobs
Data Output: Enables push- or pull-based delivery of data, subject to policies
Data Policy Management: Anonymize, retain, clean up, and archive data
Monitoring/System Management: Provide the ability to operate, visualize, and install pipelines
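The Job Coordination/Scheduling component above boils down to running a DAG of jobs in dependency order. A minimal sketch of that core step using Kahn's topological-sort algorithm; the job names in the usage below are hypothetical:

```python
from collections import deque

def topological_run_order(deps):
    """Given deps mapping job -> set of jobs it depends on, return an
    order in which every job appears after all of its dependencies
    (Kahn's algorithm). Raises ValueError on a dependency cycle."""
    indegree = {job: len(d) for job, d in deps.items()}
    dependents = {job: [] for job in deps}
    for job, d in deps.items():
        for dep in d:
            dependents[dep].append(job)   # dep must finish before job
    ready = deque(job for job, n in indegree.items() if n == 0)
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for nxt in dependents[job]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:        # all dependencies satisfied
                ready.append(nxt)
    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order
```

A full coordinator layers retries, reprocessing, and catch-up runs on top of this ordering, but the dependency tracking itself is just this traversal.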