PySpark: An Effective ETL Tool

Many of you may be curious about ETL tools and the use of the ETL process in the world of data hubs, where data plays a significant role. Today, we will examine this more closely.

What is ETL?

ETL (which stands for Extract, Transform, and Load) is the generic process of extracting data from one or more systems and loading it into a data warehouse or database after performing some intermediate transformations.

There are many ETL tools available in the market that can carry out this process.

A standard ETL tool like PySpark supports all the basic data transformation features, such as sorting, mapping, joins, and aggregation operations. PySpark’s ability to rapidly process massive amounts of data is a key advantage.

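For example, two of those transformations, a join followed by a sort, might look like the sketch below; the tables, column names, and values are invented purely for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transform-demo").getOrCreate()

# Two tiny, made-up tables
customers = spark.createDataFrame(
    [(1, "Alice"), (2, "Bob")], ["customer_id", "name"])
orders = spark.createDataFrame(
    [(1, 250.0), (1, 80.0), (2, 120.0)], ["customer_id", "amount"])

# Join the two tables, then sort the result by amount
result = (customers.join(orders, on="customer_id")
                   .orderBy("amount", ascending=False))
result.show()
```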

Some tools perform a complete ETL implementation, while others help us create a custom ETL process from scratch, and a few fall somewhere in between. Before going into the details of PySpark, let’s first understand some important features that an ETL tool should have.

Features of ETL Tools

ETL comprises three processes that follow a sequence, beginning with extraction and ending with loading. Let us look at these steps more closely (a short PySpark sketch covering all three follows the list):

  1. Extract: This is the first step of ETL, and it is the process of extracting or fetching data from various data sources. These can include most databases (RDBMS/NoSQL) and file formats like JSON, CSV, XML, and XLS.

  2. Transform: In this process, all the extracted data is kept in a staging area, where the raw data is transformed into a structured format and into a meaningful form for storing in a data warehouse.

    A standard ETL tool like PySpark, which we will look at later, supports all basic data transformation features like sorting, mapping, joins, operations, etc.

  3. Load: This is the last step of the ETL process, in which the transformed data is loaded into the target zone or target warehouse database. This stage is a little challenging, because a huge amount of data needs to be loaded in a short period.

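To make these three steps concrete, here is a minimal PySpark sketch of the whole flow. The file paths, column names, and the particular aggregation are assumptions made purely for illustration, not part of the original article:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("simple-etl").getOrCreate()

# Extract: fetch raw data from a source file (path is hypothetical)
raw = spark.read.csv("data/orders.csv", header=True, inferSchema=True)

# Transform: clean the staged data and reshape it into a meaningful form
transformed = (
    raw.dropna(subset=["order_id"])                      # drop incomplete rows
       .withColumn("amount", F.col("amount").cast("double"))
       .groupBy("customer_id")
       .agg(F.sum("amount").alias("total_spent"))
)

# Load: write the result into the target zone (a Parquet "warehouse" here)
transformed.write.mode("overwrite").parquet("warehouse/customer_totals")
```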

[Figure: A flow diagram of the three steps of ETL]

Enter PySpark

PySpark is a combination of Python and Apache Spark. It is the Python API for Spark, which integrates and works with RDDs through a library called ‘Py4J’. In effect, it is the version of Spark that you drive from Python.

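As a minimal illustration of this, a PySpark program starts from a SparkSession (the app name below is arbitrary), and the RDD API remains reachable through the underlying SparkContext:

```python
from pyspark.sql import SparkSession

# The entry point; under the hood, Py4J bridges this Python process
# to the JVM that actually runs Spark
spark = SparkSession.builder.appName("hello-pyspark").getOrCreate()

# RDDs are accessible through the session's SparkContext
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * x).collect())  # [1, 4, 9, 16]
```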

As per their official website, “Spark is a unified analytics engine for large-scale data processing”.

Beyond creating ETL pipelines, the Spark core provides many other robust features, including support for machine learning (MLlib), data streaming (Spark Streaming), SQL (Spark SQL), and graph processing (GraphX).

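As a small taste of the SQL side, here is a sketch of querying a DataFrame with Spark SQL; the sample rows and the view name are invented for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Register a DataFrame as a temporary view, then query it with plain SQL
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()
```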

Advantages of PySpark

  • Speed: It is claimed to be up to 100 times faster than traditional large-scale data processing frameworks.

  • Real-Time Computation: The key feature of the PySpark framework is its in-memory processing, which gives it low latency.

  • Caching and Disk Persistence: A simple programming layer provides powerful caching and disk persistence capabilities (see the sketch after this list).

  • Deployment: It can be deployed on Hadoop via YARN, on Mesos, or with Spark’s own standalone cluster manager.

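Here is a brief sketch of the caching and persistence point from the list above; the dataset is synthetic, and MEMORY_AND_DISK is just one of several available storage levels:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

# In a real deployment the master (e.g. YARN) is usually set via
# spark-submit rather than hard-coded here
spark = SparkSession.builder.appName("persist-demo").getOrCreate()

df = spark.range(1_000_000)

# Keep the dataset in memory, spilling to disk when memory is tight
df.persist(StorageLevel.MEMORY_AND_DISK)

print(df.count())  # the first action materializes the persisted data
print(df.count())  # later actions reuse it instead of recomputing

df.unpersist()     # release the storage when no longer needed
```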

Many organizations, such as Walmart, DLT Labs, Nokia, Alibaba.com, and Netflix, use PySpark.

There are several features that make PySpark such an amazing framework when it comes to dealing with huge datasets. Whether it is to analyze datasets or to perform computations on large datasets, data engineers are switching to this powerful tool.

Why?

PySpark’s ability to rapidly process massive amounts of data is a key advantage. If you are looking to create an ETL pipeline to quickly process a huge amount of data, or to process streams of data, PySpark offers a worthy solution.

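For the streaming case, a minimal Structured Streaming sketch might look like the following; the socket source, host, and port are assumptions for the demo, not a recommendation:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-etl").getOrCreate()

# Read a stream of text lines from a socket (start one with: nc -lk 9999)
lines = (spark.readStream.format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# The same DataFrame transformations apply to streaming data
words = lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print the running counts to the console
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```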

It is not an out-of-the-box ETL solution, but it is an excellent choice for building and deploying an ETL pipeline.

Thanks for reading!

Author — Pranjal Gupta, DLT Labs

About the Author: Pranjal is currently working with our DL Asset Track team as a Node.js developer.

Translated from: https://medium.com/@dltlabs/pyspark-an-effective-etl-tool-b41a9108d0eb
