AWS Data Wrangler: Simplifying Pandas Integration with AWS Data-Related Services

Enterprise organisations are utilising cloud services to build data lakes, warehouses and automated ETL pipelines. In the AWS Cloud, data lakes are built on top of Amazon S3 because of its durability, availability, scalability and low cost. Amazon Athena is one of the best tools to query data from S3. When it comes to programmatic interaction with AWS services, Boto3 is the first Python package that comes to everyone's mind. But programmatically querying S3 data with Athena into Pandas dataframes to do ETL hasn't been that easy when using the Boto3 package alone.

Recently, I came across an amazing Python package called "AWS Data Wrangler". It was developed by AWS as part of the AWS Professional Service Open Source initiative. It makes the lives of data engineers a lot simpler with the integration it provides with big data services and tools in AWS.

Let me illustrate the difference with an example:

Use case: Perform some ETL operations using Pandas on data present in the data lake, extracting the data using Amazon Athena queries.

Old way of reading Athena Query output into Pandas Dataframe using Boto3:

The process will involve the following steps:

  1. Submit the query to Athena using the StartQueryExecution method of the Athena client in AWS Boto3
  2. Retrieve the QueryExecutionId from the response
  3. Poll the query status by passing the QueryExecutionId to the GetQueryExecution method
  4. Once the query has succeeded, read the output file from the Athena output S3 location into a Pandas dataframe (you might also need to deal with the eventual-consistency behaviour of S3, because the output file might not be immediately available for reading into the dataframe)

The code would look as follows:
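
A minimal sketch of those four steps with Boto3 (the query, database name, and S3 output location below are placeholders):

import time

import boto3
import pandas as pd

athena_client = boto3.client("athena")

# 1. Submit the query to Athena (query, database and output location are placeholders)
response = athena_client.start_query_execution(
    QueryString="SELECT * FROM sample_table",
    QueryExecutionContext={"Database": "sample_db"},
    ResultConfiguration={"OutputLocation": "s3://sample-bucket/athena-results/"},
)

# 2. Retrieve the QueryExecutionId from the response
query_execution_id = response["QueryExecutionId"]

# 3. Poll the query status until it reaches a terminal state
while True:
    execution = athena_client.get_query_execution(QueryExecutionId=query_execution_id)
    state = execution["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state != "SUCCEEDED":
    raise RuntimeError(f"Athena query finished with state {state}")

# 4. Read the CSV result file from the Athena output location into a Pandas dataframe
# (reading s3:// paths with pandas requires s3fs; because of S3 eventual consistency
# you may also need to retry this read until the result object is available)
output_location = execution["QueryExecution"]["ResultConfiguration"]["OutputLocation"]
athena_results_df = pd.read_csv(output_location)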

New way of reading Athena Query output into Pandas Dataframe using AWS Data Wrangler:

AWS Data Wrangler takes care of all the complexity we handled manually in the old code snippet, such as query submission, polling, reading data into a Pandas dataframe, S3 eventual consistency, and so on.

We can achieve the equivalent output of the above code using the following simple code snippet 🎉:

import awswrangler as wr
import pandas as pd

# Retrieving the data from Amazon Athena
athena_results_df = wr.athena.read_sql_query(
    "SELECT * FROM table_name WHERE column_1 = 'value1' AND column_2 = 'value2'",
    database="sample_db",
)

AWS Data Wrangler is built on top of open-source projects like Pandas, Boto3, SQLAlchemy and Apache Arrow. It offers easier and simpler Pandas integration with many other AWS services through abstract functions. Some of the services and their basic features are as follows:

  • Amazon S3

# Reading CSV files
wr.s3.read_csv("s3://sample-bucket/sample.csv")

# Writing CSV files
csv_file_path = "s3://sample-bucket/sample_2.csv"
wr.s3.to_csv(df, csv_file_path, index=False)

# Reading JSON files
wr.s3.read_json("s3://sample-bucket/sample.json")

# Writing JSON files
json_file_path = "s3://sample-bucket/sample_2.json"
wr.s3.to_json(df, json_file_path)

# Reading Parquet files
wr.s3.read_parquet("s3://sample-bucket/sample.parquet")

# Writing Parquet files
parquet_file_path = "s3://sample-bucket/sample_2.parquet"
wr.s3.to_parquet(df, parquet_file_path)
  • AWS Glue Catalog

# Retrieve databases in the Glue catalog
wr.catalog.databases()

# Retrieve tables in a Glue database
wr.catalog.tables(database="sample_db")

# Search for tables in the Glue catalog
wr.catalog.tables(name_contains="sample")
wr.catalog.tables(name_prefix="sample_")
wr.catalog.tables(name_suffix="_table")
wr.catalog.tables(search_text="table description")

# Retrieve table details
wr.catalog.table(database="sample_db", table="sample_table")

# Delete a table
wr.catalog.delete_table_if_exists(database="sample_db", table="sample_table")

# Delete a database
wr.catalog.delete_database("sample_db")
  • Databases (MySQL, PostgreSQL, Redshift)

# Get a SQLAlchemy Engine from a Glue Catalog Connection
wr.catalog.get_engine(name="sample_connection")

# Get a SQLAlchemy Engine using the given db parameters
postgres_engine = wr.db.get_engine(
    db_type="postgresql",
    host="127.0.0.1",
    port=5432,
    database="sample_db",
    user="sample_user",
    password="sample_password",
)

# Reading data from a table in the database
wr.db.read_sql_query("SELECT * FROM public.sample_table", con=postgres_engine)

# Writing data into a table in the database
wr.db.to_sql(df, postgres_engine, schema="public", name="sample_table", if_exists="replace", index=False)

# Load data into Redshift using the COPY command
wr.db.copy_to_redshift(
    df=df,
    path=path,
    con=redshift_engine,
    schema="public",
    table="sample_table",
    mode="overwrite",
    iam_role=iam_role,
)

# Unload data from Redshift using the UNLOAD command
wr.db.unload_redshift(
    sql="SELECT * FROM public.sample_table",
    con=redshift_engine,
    iam_role=iam_role,
    path=path,
    keep_files=True,
)
  • Amazon Athena

# Read data from Athena using a SQL query
wr.athena.read_sql_query("SELECT * FROM sample_table", database="sample_db")

# Read data from Athena in chunks using a specific chunksize
df_chunks = wr.athena.read_sql_query(
    "SELECT * FROM sample_table",
    database="sample_db",
    chunksize=10000,
)
for df in df_chunks:
    print(len(df.index))
  • Amazon EMR

# Create an EMR cluster (params is a dict of the cluster configuration keyword arguments)
cluster_id = wr.emr.create_cluster(**params)

# Submit an EMR step
wr.emr.submit_step(cluster_id, command="spark-submit s3://sample-bucket/sample-test-script.py")

# Terminate an EMR cluster
wr.emr.terminate_cluster(cluster_id)
  • CloudWatch Logs and Insights (see the sketch after this list)

  • QuickSight (see the sketch after this list)
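
For CloudWatch Logs Insights and QuickSight, a minimal sketch might look like the following (the log group name is a placeholder, and the exact function signatures may vary between awswrangler versions):

# Run a CloudWatch Logs Insights query and read the results into a Pandas dataframe
logs_df = wr.cloudwatch.read_logs(
    query="fields @timestamp, @message | sort @timestamp desc | limit 100",
    log_group_names=["/aws/lambda/sample-function"],  # placeholder log group
)

# List the QuickSight dashboards available in the account
dashboards = wr.quicksight.list_dashboards()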

For the detailed list of all available APIs, go through the API reference in the AWS Data Wrangler documentation.

Go through the tutorials section of the documentation for additional detailed examples.

The AWS Data Wrangler package makes it easier to do ETL tasks involving Pandas dataframes and AWS data-related services. Go give it a try and experience its awesomeness.

Translated from: https://medium.com/@bv_subhash/aws-data-wrangler-simplifying-pandas-integration-with-aws-data-related-services-2b3325c12188
