Definition of ETL (adapted from Snowflake)
ETL stands for “extract, transform, load,” the three processes that, in combination, move data from one database, multiple databases, or other sources to a unified repository, typically a data warehouse. By preparing data for analysis and business intelligence processes, ETL enables organizations to derive actionable business information.
Data engineering encompasses ETL: data engineers make data ready for consumption by working with multiple systems and tools. Data engineering involves ingesting, transforming, delivering, and sharing data for analysis. These fundamental tasks are completed via data pipelines that automate the process in a repeatable way. A data pipeline is a set of data-processing elements that move data from source to destination, often converting it from one format (raw) to another (analytics-ready).
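As a minimal illustration of that definition, here is a sketch in Python of a pipeline as a chain of processing elements. The stage functions and toy data are hypothetical placeholders, not any particular tool's API; real pipelines are usually built on orchestration tooling.

    def run_pipeline(extract, transforms, load):
        """Move data from source to destination through ordered stages."""
        data = extract()             # raw data pulled from the source
        for transform in transforms:
            data = transform(data)   # progressively more analytics-ready
        load(data)                   # deliver to the destination

    # Toy usage: two raw strings, one transform, print as the "destination".
    run_pipeline(
        extract=lambda: ["a", "b"],
        transforms=[lambda rows: [r.upper() for r in rows]],
        load=print,
    )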
PURPOSE
ETL allows businesses to consolidate data from multiple databases and other sources into a single repository, with the data properly formatted and qualified in preparation for analysis. This unified repository simplifies access for analysis and additional processing. It also provides a single source of truth, ensuring that all enterprise data is consistent and up to date.
PROCESSES
There are three distinct processes in extract, transform, load. These are:
Extraction, in which raw data is pulled from one or more sources. Data could come from transactional applications, such as customer relationship management (CRM) data from Salesforce or enterprise resource planning (ERP) data from SAP, or from Internet of Things (IoT) sensors that gather readings from a production line or factory floor, for example. To create a data warehouse, extraction typically involves combining data from these various sources into a single data set and then validating it, with invalid data flagged or removed. Extracted data may come in several formats, such as relational databases, XML, JSON, and others.
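As a sketch of the extraction step only, the snippet below combines a hypothetical JSON (CRM-style) export with rows from a relational (SQLite) table into a single data set, then removes records that fail a simple validity check. The file names, table, columns, and validation rule are illustrative assumptions.

    import json
    import sqlite3

    def extract_crm(path):
        # Hypothetical CRM export in JSON format.
        with open(path) as f:
            return json.load(f)

    def extract_erp(db_path):
        # Hypothetical ERP data in a relational (SQLite) table.
        conn = sqlite3.connect(db_path)
        rows = conn.execute("SELECT customer_id, order_total FROM orders")
        return [{"customer_id": cid, "order_total": total} for cid, total in rows]

    def is_valid(record):
        # Flag records missing the join key as invalid.
        return record.get("customer_id") is not None

    # Combine both sources into a single data set, dropping invalid rows.
    raw = extract_crm("crm_export.json") + extract_erp("erp.db")
    extracted = [r for r in raw if is_valid(r)]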
Transformation, in which data is updated to match the needs of an organization and the requirements of its data storage solution. Transformation can involve standardizing (converting all data types to the same format), cleansing (resolving inconsistencies and inaccuracies), mapping (combining data elements from two or more data models), augmenting (pulling in data from other sources), and more. During this process, rules and functions are applied and data is cleansed, so that bad or non-matching data is not loaded into the destination repository. Rules that could be applied include loading only specific columns, deduplicating, and merging, among others.
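The sketch below applies a few of the rules named above: it standardizes a numeric column to a single type, cleanses rows that cannot be converted, deduplicates on a key, and loads only specific columns into the output shape. The field names are assumptions carried over from the extraction sketch.

    def transform(records):
        """Apply illustrative transformation rules to extracted records."""
        seen = set()
        clean = []
        for r in records:
            # Standardize: convert amounts to one numeric type.
            try:
                amount = float(r.get("order_total", 0))
            except (TypeError, ValueError):
                continue  # cleanse: drop rows with non-numeric amounts
            key = r.get("customer_id")
            if key in seen:
                continue  # deduplicate on customer_id
            seen.add(key)
            # Keep only specific columns for the destination.
            clean.append({"customer_id": key, "order_total": amount})
        return clean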
Loading, in which data is delivered and secured for sharing, making business-ready data available to other users and departments, both within the organization and externally. This process may include overwriting the destination’s existing data.
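To round out the picture, here is a loading sketch that uses SQLite as a stand-in for a data warehouse. The overwrite flag mirrors the overwrite behavior mentioned above, and the table layout is an assumption matching the transformation sketch.

    import sqlite3

    def load(records, db_path="warehouse.db", overwrite=True):
        """Deliver transformed records to the destination table."""
        conn = sqlite3.connect(db_path)
        if overwrite:
            # Replace the destination's existing data with this batch.
            conn.execute("DROP TABLE IF EXISTS orders")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS orders (customer_id TEXT, order_total REAL)"
        )
        conn.executemany(
            "INSERT INTO orders VALUES (:customer_id, :order_total)", records
        )
        conn.commit()
        conn.close()

    # Chained with the earlier sketches, load(transform(extracted))
    # would complete the end-to-end ETL flow.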