Exploring Redshift Data Analysis in the Data Science on AWS Project

虞熠蝶

Published 2025-06-11 09:17:03

Article link: https://blog.csdn.net/gitblog_00910/article/details/148578382

data-science-on-aws: AI and Machine Learning with Kubeflow, Amazon EKS, and SageMaker. Project URL: https://gitcode.com/gh_mirrors/da/data-science-on-aws

Introduction

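Amazon Redshift is AWS's cloud data warehouse service and is widely used for large-scale analytics. This article walks through exploring and analyzing data with Redshift in the data-science-on-aws project, covering connection setup, query optimization, and visualization.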

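Environment setup

Installing dependencies

First install SQLAlchemy, a widely used Python SQL toolkit and object-relational mapper. The later examples also import boto3 and awswrangler, which this command does not install, so both must already be available in the environment: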

!pip install -q SQLAlchemy==1.3.13

Configuring the Redshift connection parameters

Before connecting to Redshift, define the following parameters:

redshift_schema = "redshift"
redshift_cluster_identifier = "dsoaws"
redshift_host = "dsoaws"
redshift_database = "dsoaws"
redshift_port = "5439"
redshift_table_2015 = "amazon_reviews_tsv_2015"
redshift_table_2014 = "amazon_reviews_tsv_2014"

These parameters identify the Redshift cluster, database, schema, and tables used in every later step.
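Because the schema and table names above are later interpolated into SQL text with str.format, it can help to check them before use. The following is a minimal sketch, not part of the original notebook: the helper name qualified_table and the rule that names are plain unquoted identifiers are assumptions made for illustration.

```python
import re

# Assumption: schema and table names are plain unquoted identifiers
# (letters, digits, underscores, not starting with a digit).
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def qualified_table(schema: str, table: str) -> str:
    """Return "schema.table" after checking that both parts are simple identifiers."""
    for name in (schema, table):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe SQL identifier: {name!r}")
    return f"{schema}.{table}"


# Hypothetical usage with the values defined in this article
print(qualified_table("redshift", "amazon_reviews_tsv_2015"))
```

The returned string could then be substituted into a query in place of the two separate format arguments.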

Credential management

Retrieving credentials from Secrets Manager

AWS recommends storing database credentials in Secrets Manager rather than hard-coding them in scripts:

import json
import boto3

secretsmanager = boto3.client("secretsmanager")
secret = secretsmanager.get_secret_value(SecretId="dsoaws_redshift_login")
cred = json.loads(secret["SecretString"])

redshift_username = cred[0]["username"]
redshift_pw = cred[1]["password"]

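This keeps the password out of the notebook and makes credential rotation easier, which is why it is the recommended approach for production. Note that the code indexes cred[0] and cred[1], so it expects the secret to be stored as a JSON list of two objects, one holding the username and one holding the password.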
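The indexing used above depends on how the secret was stored. As a hedged sketch, the helper below (parse_redshift_secret is a hypothetical name, and the example secret string is made up) accepts either a JSON list of small objects or a single JSON object, and fails early if a key is missing. It works on the SecretString text only, so it can be tried without AWS access.

```python
import json


def parse_redshift_secret(secret_string: str) -> dict:
    """Flatten a SecretString into one dict containing "username" and "password"."""
    payload = json.loads(secret_string)
    if isinstance(payload, list):
        # List layout, e.g. [{"username": ...}, {"password": ...}]
        merged = {}
        for fragment in payload:
            merged.update(fragment)
        payload = merged
    missing = {"username", "password"} - payload.keys()
    if missing:
        raise KeyError(f"secret is missing keys: {sorted(missing)}")
    return payload


# Made-up secret text in the list layout assumed by this article's code
example = '[{"username": "demo_user"}, {"password": "demo_pw"}]'
parsed = parse_redshift_secret(example)
print(parsed["username"])
```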

Getting the Redshift endpoint

The boto3 SDK can look up the cluster's endpoint address at run time:

redshift = boto3.client("redshift")
response = redshift.describe_clusters(ClusterIdentifier=redshift_cluster_identifier)
redshift_endpoint_address = response["Clusters"][0]["Endpoint"]["Address"]
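The lookup above assumes the cluster exists and already has an endpoint. A slightly more defensive version might look like the sketch below. The function name endpoint_from_response and the sample response are illustrative assumptions; the sample only mimics the Clusters, ClusterStatus, and Endpoint fields of a describe_clusters response.

```python
def endpoint_from_response(response: dict) -> tuple:
    """Return (address, port) from a describe_clusters-style response dict."""
    clusters = response.get("Clusters", [])
    if not clusters:
        raise LookupError("no cluster found in response")
    endpoint = clusters[0].get("Endpoint")
    if not endpoint:
        # A cluster that is still being created may not expose an endpoint yet
        status = clusters[0].get("ClusterStatus", "unknown")
        raise LookupError(f"cluster has no endpoint yet (status: {status})")
    return endpoint["Address"], endpoint["Port"]


# Made-up response for illustration; a real one comes from describe_clusters
fake_response = {
    "Clusters": [
        {
            "ClusterStatus": "available",
            "Endpoint": {"Address": "example-cluster.example.invalid", "Port": 5439},
        }
    ]
}
address, port = endpoint_from_response(fake_response)
print(address, port)
```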

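Connecting to Redshift

Use AWS Data Wrangler (the awswrangler package, since renamed AWS SDK for pandas) to open a connection through the Redshift Data API: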

import awswrangler as wr

con_redshift = wr.data_api.redshift.connect(
    cluster_id=redshift_cluster_identifier,
    database=redshift_database,
    db_user=redshift_username,
)

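AWS Data Wrangler simplifies working with AWS data services by providing a higher-level abstraction over them.

Query optimization techniques

Approximate count (APPROXIMATE COUNT)

Exact distinct counts over very large tables can be slow. Redshift offers an approximate distinct count based on the HyperLogLog algorithm: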

%%time
df = wr.data_api.redshift.read_sql_query(
    sql="""SELECT approximate count(distinct customer_id)
                    FROM {}.{}
                    GROUP BY product_category""".format(
        redshift_schema, redshift_table_2015
    ),
    con=con_redshift,
)

For comparison, the exact count:

%%time
df = wr.data_api.redshift.read_sql_query(
    sql="""SELECT count(distinct customer_id)
                    FROM {}.{}
                    GROUP BY product_category""".format(
        redshift_schema, redshift_table_2015
    ),
    con=con_redshift,
)

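The approximate count has a relative error of about 2% and can run significantly faster than the exact count, which makes it a good fit for tables with tens of millions of rows or more.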
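To check the error on your own data, the two results can be compared category by category. The sketch below assumes both queries were extended to also select product_category, so the counts can be keyed by category; the queries shown in this article return only the counts. The function name relative_errors and the numbers are made up for illustration.

```python
def relative_errors(exact: dict, approx: dict) -> dict:
    """Return |approx - exact| / exact for each category present in both inputs."""
    errors = {}
    for category, exact_count in exact.items():
        if category in approx and exact_count > 0:
            errors[category] = abs(approx[category] - exact_count) / exact_count
    return errors


# Made-up counts, not results from the real dataset
exact_counts = {"Books": 1000, "Toys": 200}
approx_counts = {"Books": 1010, "Toys": 198}

errors = relative_errors(exact_counts, approx_counts)
worst = max(errors.values())
print(f"worst relative error: {worst:.2%}")
```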

Data visualization

Setting up the plotting environment

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format='retina'

Counting ratings per product category

Run a SQL query to get the number of ratings in each product category:

statement = """
SELECT product_category,
COUNT(star_rating) AS count_star_rating
FROM {}.{}
GROUP BY product_category
ORDER BY count_star_rating DESC
""".format(
    redshift_schema, redshift_table_2015
)

df = wr.data_api.redshift.read_sql_query(
    sql=statement,
    con=con_redshift,
)

Dynamic visualization

Adjust the plot parameters to the shape of the data:

# Size the figure according to the number of categories
num_categories = df.shape[0]
max_ratings = df["count_star_rating"].max()

if num_categories > 10:
    plt.figure(figsize=(10, 10))
else:
    plt.figure(figsize=(10, 5))

# sns.set_style is used here because the "seaborn-whitegrid" style name
# is no longer available in recent Matplotlib releases
sns.set_style("whitegrid")

# Draw a horizontal bar chart
barplot = sns.barplot(y="product_category", x="count_star_rating", data=df, saturation=1)
plt.title("Number of Ratings per Product Category (Redshift)")

# Choose x-axis ticks that match the range of the data
if max_ratings <= 80000:
    plt.xticks(
        [10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000],
        ["10K", "20K", "30K", "40K", "50K", "60K", "70K", "80K"],
    )
    plt.xlim(0, 80000)
elif max_ratings <= 200000:
    plt.xticks([50000, 100000, 150000, 200000], ["50K", "100K", "150K", "200K"])
    plt.xlim(0, 200000)
else:
    plt.xticks([100000, 1000000, 5000000, 10000000, 15000000, 20000000],
               ["100K", "1m", "5m", "10m", "15m", "20m"])
    plt.xlim(0, 20000000)

plt.xlabel("Number of Ratings")
plt.ylabel("Product Category")
plt.tight_layout()
plt.show()

Adapting the figure size and tick marks to the data keeps the chart readable at different data scales.
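The fixed tick ranges can also be computed from the data instead of being listed by hand. The following is one possible sketch under stated assumptions, not the notebook's method: choose_ticks and format_count are hypothetical helpers that pick evenly spaced ticks on a 1/2/5 step and render them as short labels.

```python
import math


def format_count(value: float) -> str:
    """Format a count as a short label such as 50K or 1.5M."""
    if value >= 1_000_000:
        return f"{value / 1_000_000:g}M"
    if value >= 1_000:
        return f"{value / 1_000:g}K"
    return f"{value:g}"


def choose_ticks(max_value: float, steps: int = 4) -> list:
    """Pick evenly spaced ticks covering max_value, using a 1/2/5 step size."""
    if max_value <= 0:
        return [0]
    raw_step = max_value / steps
    magnitude = 10 ** math.floor(math.log10(raw_step))
    step = magnitude
    for factor in (1, 2, 5, 10):
        step = factor * magnitude
        if step >= raw_step:
            break
    count = math.ceil(max_value / step)
    return [step * i for i in range(count + 1)]


ticks = choose_ticks(180000)
labels = [format_count(t) for t in ticks]
print(ticks, labels)
```

With Matplotlib, the two lists could be passed to plt.xticks(ticks, labels), and the last tick could serve as the upper x-axis limit.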

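Querying Athena data through Redshift Spectrum

Redshift Spectrum queries data that stays in Amazon S3 and is registered as tables in the Athena data catalog, so nothing has to be loaded into Redshift first: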

athena_schema = "athena"
athena_table_name = "amazon_reviews_tsv"

statement = """
SELECT product_category, COUNT(star_rating) AS count_star_rating
FROM {}.{}
GROUP BY product_category
ORDER BY count_star_rating DESC
""".format(
    athena_schema, athena_table_name
)

df = wr.data_api.redshift.read_sql_query(
    sql=statement,
    con=con_redshift,
)

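The result can be visualized the same way as the direct Redshift query; only the data source differs, which shows how closely the AWS analytics services integrate.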
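The query above only works if the athena schema already exists in Redshift as an external schema that points at the data catalog. As a rough sketch, the statement that registers such a schema could be assembled as below. The catalog database name and the IAM role ARN are placeholders rather than values from this article, and external_schema_ddl is a hypothetical helper.

```python
def external_schema_ddl(schema: str, catalog_database: str, iam_role_arn: str) -> str:
    """Build a CREATE EXTERNAL SCHEMA statement for a data catalog database."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"FROM DATA CATALOG DATABASE '{catalog_database}'\n"
        f"IAM_ROLE '{iam_role_arn}'"
    )


# Placeholder values: replace with your own catalog database and role
ddl = external_schema_ddl(
    schema="athena",
    catalog_database="example_catalog_db",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleSpectrumRole",
)
print(ddl)
```

The role must allow Redshift to read both the catalog and the underlying S3 data; how it is created is outside the scope of this sketch.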

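Releasing resources

When the analysis is finished, save the notebook and shut down its kernel session to release resources: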

%%javascript
try {
    Jupyter.notebook.save_checkpoint();
    Jupyter.notebook.session.delete();
}
catch(err) {
    // NoOp
}

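Summary

This article walked through the full workflow for analyzing data with Redshift in the data-science-on-aws project, including:

Credential management and connection setup
Query optimization techniques such as approximate counts
Dynamic visualization
Cross-service queries with Redshift Spectrum
Releasing resources after the analysis

Used together, these techniques make for an efficient, secure, and maintainable data analysis workflow that suits data science projects of different sizes.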

Disclosure: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.