Overview
- AWS Data Pipeline is a web service that you can use to automate the movement and transformation of data.
- With AWS Data Pipeline, you can define data-driven workflows, in which tasks depend on the successful completion of previous tasks.
- You define the parameters of your data transformations, and AWS Data Pipeline enforces the logic that you've set up.
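For example, creating and activating a pipeline can be scripted with boto3, the AWS SDK for Python. This is a minimal sketch; the pipeline name and uniqueId are placeholders, and a pipeline definition must be attached (see Pipeline Definition below) before activation succeeds.

```python
import boto3

client = boto3.client("datapipeline")  # assumes credentials and region are configured

# Register an empty pipeline; uniqueId makes the call idempotent.
created = client.create_pipeline(
    name="my-etl-pipeline",         # placeholder name
    uniqueId="my-etl-pipeline-v1",  # placeholder idempotency token
)
pipeline_id = created["pipelineId"]

# Once a pipeline definition has been attached, activation starts scheduling.
client.activate_pipeline(pipelineId=pipeline_id)
```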
Components
- A pipeline definition specifies the business logic of your data management.
- A pipeline schedules and runs tasks by creating Amazon EC2 instances to perform the defined work activities.
- Task Runner polls for tasks and then performs those tasks.
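AWS provides Task Runner as a ready-made agent, but the same polling contract is exposed through the API, so a custom worker can fetch work with PollForTask and report results with SetTaskStatus. A rough sketch, assuming a placeholder worker group and a hypothetical do_work helper:

```python
import time
import boto3

client = boto3.client("datapipeline")

def do_work(task):
    """Hypothetical helper that performs the activity described by the task."""
    ...

# The poll-work-report loop that Task Runner implements.
while True:
    response = client.poll_for_task(workerGroup="my-worker-group")  # placeholder group
    task = response.get("taskObject")
    if not task:
        time.sleep(30)  # no task was assigned; back off and poll again
        continue
    try:
        do_work(task)
        client.set_task_status(taskId=task["taskId"], taskStatus="FINISHED")
    except Exception as err:
        client.set_task_status(
            taskId=task["taskId"],
            taskStatus="FAILED",
            errorId="WorkerError",  # placeholder error code
            errorMessage=str(err),
        )
```

In practice, a worker running a long task also calls report_task_progress periodically so the service knows the task is still alive.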
Related Services
- AWS Data Pipeline works with the following services to store data:
- Amazon DynamoDB
- Amazon RDS
- Amazon Redshift
- Amazon S3
- AWS Data Pipeline works with the following compute services to transform data:
- Amazon EC2
- Amazon EMR
Concepts
Pipeline Definition
- A pipeline definition is how you communicate your business logic to AWS Data Pipeline. It contains the following information:
- Names, locations, and formats of your data sources
- Activities that transform the data
- The schedule for those activities
- Resources that run your activities and preconditions
- Preconditions that must be satisfied before the activities can be scheduled
- Ways to alert you with status updates as pipeline execution proceeds
- From your pipeline definition, AWS Data Pipeline determines the tasks, schedules them, and assigns them to task runners.
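Concretely, a pipeline definition is a set of typed objects wired together by references. A sketch of attaching one with boto3's put_pipeline_definition; the ids, names, schedule values, and IAM role names are illustrative placeholders, and a real definition typically needs more fields to pass validation:

```python
import boto3

client = boto3.client("datapipeline")

# Three components: a schedule, an EC2 resource to run on, and an activity.
# refValue fields wire components together by id.
objects = [
    {
        "id": "DailySchedule",
        "name": "DailySchedule",
        "fields": [
            {"key": "type", "stringValue": "Schedule"},
            {"key": "period", "stringValue": "1 day"},
            {"key": "startDateTime", "stringValue": "2024-01-01T00:00:00"},
        ],
    },
    {
        "id": "WorkerResource",
        "name": "WorkerResource",
        "fields": [
            {"key": "type", "stringValue": "Ec2Resource"},
            {"key": "schedule", "refValue": "DailySchedule"},
            {"key": "role", "stringValue": "DataPipelineDefaultRole"},
            {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
            {"key": "terminateAfter", "stringValue": "1 Hour"},
        ],
    },
    {
        "id": "TransformActivity",
        "name": "TransformActivity",
        "fields": [
            {"key": "type", "stringValue": "ShellCommandActivity"},
            {"key": "runsOn", "refValue": "WorkerResource"},
            {"key": "schedule", "refValue": "DailySchedule"},
            {"key": "command", "stringValue": "echo transform step"},
        ],
    },
]

client.put_pipeline_definition(
    pipelineId="df-EXAMPLE",  # placeholder; use the id returned by create_pipeline
    pipelineObjects=objects,
)
```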
Pipeline Components, Instances, and Attempts
- Pipeline Components
- Pipeline components represent the business logic of the pipeline; each component corresponds to a section of the pipeline definition.
- Pipeline components specify the data sources, activities, schedule, and preconditions of the workflow.
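One way to see these objects in the API: QueryObjects lists a pipeline's objects at the component level (the spheres INSTANCE and ATTEMPT cover the other two levels named in this section's title), and DescribeObjects returns their fields. A sketch with a placeholder pipeline id:

```python
import boto3

client = boto3.client("datapipeline")

# List component objects; INSTANCE and ATTEMPT are the other valid spheres.
response = client.query_objects(
    pipelineId="df-EXAMPLE",  # placeholder pipeline id
    sphere="COMPONENT",
)
ids = response.get("ids", [])

# Fetch each component's fields (data sources, schedule, preconditions, ...).
if ids:
    described = client.describe_objects(
        pipelineId="df-EXAMPLE",
        objectIds=ids,
    )
    for obj in described["pipelineObjects"]:
        print(obj["name"], obj["fields"])
```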