Pandas and SQL: A Seamless Database-Driven Data Analysis Workflow
In modern data analysis we frequently need to work with large volumes of data stored in databases. Pandas, the core Python data analysis library, integrates smoothly with a wide range of database systems, letting data scientists and analysts work with database data directly from Python. This article walks through how to combine Pandas and SQL into an efficient analysis workflow.
1. Overview of Pandas Database Integration
Pandas interacts with databases in several ways, chiefly through these functions:
- pd.read_sql_query: runs a SQL query and loads the result as a DataFrame
- pd.read_sql_table: loads a database table as a DataFrame
- pd.read_sql: a convenience wrapper that accepts either a SQL query or a table name
- DataFrame.to_sql: writes a DataFrame to a database table
These functions rely on SQLAlchemy, a powerful Python SQL toolkit and object-relational mapper that supports many database engines (a plain sqlite3 connection is the one exception that works without it).
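Before diving into the details, here is a minimal round-trip sketch using an in-memory SQLite database (the table name demo is purely illustrative):
import pandas as pd
from sqlalchemy import create_engine
# Write a small DataFrame out, then read it back
engine = create_engine('sqlite:///:memory:')
pd.DataFrame({'x': [1, 2, 3]}).to_sql('demo', engine, index=False)
print(pd.read_sql('SELECT * FROM demo', engine))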
2. Setting Up a Database Connection
Working with a database from Pandas starts with establishing a connection. Here are examples for several databases:
2.1 Connecting to SQLite (built-in support)
SQLite is a lightweight on-disk database, well suited to learning and to small projects:
import sqlite3
import pandas as pd
# Create a connection
conn = sqlite3.connect('example.db')
# An in-memory database also works
# conn = sqlite3.connect(':memory:')
2.2 Using a SQLAlchemy Engine (recommended)
SQLAlchemy provides a uniform way to connect to all kinds of databases (each non-SQLite dialect also needs its driver package installed, e.g. pymysql, psycopg2, pyodbc, cx_Oracle):
from sqlalchemy import create_engine
import pandas as pd
# SQLite
engine = create_engine('sqlite:///example.db')
# MySQL
engine = create_engine('mysql+pymysql://username:password@localhost/dbname')
# PostgreSQL
engine = create_engine('postgresql://username:password@localhost:5432/dbname')
# Microsoft SQL Server
engine = create_engine('mssql+pyodbc://username:password@dsn_name')
# Oracle
engine = create_engine('oracle+cx_oracle://username:password@localhost:1521/sidname')
2.3 Common Connection Parameters
# Specify the character set
engine = create_engine('mysql+pymysql://username:password@localhost/dbname?charset=utf8mb4')
# Configure the connection pool
engine = create_engine('postgresql://username:password@localhost:5432/dbname',
                       pool_size=5, max_overflow=10)
# Echo the generated SQL (useful when debugging)
engine = create_engine('sqlite:///example.db', echo=True)
3. Reading Data from the Database
3.1 Basic Reads
The read_sql family of functions makes it easy to pull data out of a database:
import pandas as pd
from sqlalchemy import create_engine
# Create the database connection
engine = create_engine('sqlite:///example.db')
# Option 1: read an entire table
df = pd.read_sql_table('customers', engine)
# Option 2: run a SQL query
df = pd.read_sql_query('SELECT * FROM customers WHERE age > 30', engine)
# Option 3: the generic read_sql dispatches on its first argument
df = pd.read_sql('customers', engine)  # table name -> reads the table
df = pd.read_sql('SELECT * FROM customers', engine)  # SQL -> runs the query
3.2 Parameterized Queries
Parameterized queries guard against SQL injection and keep the code cleaner. Note that the placeholder style depends on the underlying driver: %s works for the MySQL and PostgreSQL drivers, while the sqlite3 driver expects ?:
# Positional parameters
min_age = 30
query = 'SELECT * FROM customers WHERE age > %s'
df = pd.read_sql_query(query, engine, params=(min_age,))
# Named parameters (passed as a dict)
query = 'SELECT * FROM customers WHERE age > %(min_age)s'
df = pd.read_sql_query(query, engine, params={'min_age': 30})
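A driver-independent alternative is SQLAlchemy's text() construct with :name binds, which SQLAlchemy translates into whatever placeholder style the driver actually uses; a sketch:
from sqlalchemy import text
# SQLAlchemy rewrites :min_age to the driver's native placeholder
query = text('SELECT * FROM customers WHERE age > :min_age')
df = pd.read_sql_query(query, engine, params={'min_age': 30})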
3.3 Read Options and Optimizations
# Select only the columns you need
df = pd.read_sql('SELECT id, name, email FROM customers', engine)
# Use a column as the DataFrame index while reading
df = pd.read_sql('SELECT * FROM customers', engine, index_col='id')
# Parse date columns
df = pd.read_sql('SELECT * FROM orders', engine, parse_dates=['order_date'])
# Read a large table in chunks (very useful for large datasets)
chunks = pd.read_sql('SELECT * FROM large_table', engine, chunksize=10000)
for chunk in chunks:
    # Process each chunk (process_data stands in for your own logic)
    process_data(chunk)
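One caveat: with some drivers the server still materializes the entire result set before the first chunk arrives. On PostgreSQL, a server-side cursor enabled via execution_options(stream_results=True) keeps memory usage flat; a sketch, assuming a PostgreSQL engine:
with engine.connect().execution_options(stream_results=True) as conn:
    # Rows are now streamed from the server instead of buffered up front
    for chunk in pd.read_sql('SELECT * FROM large_table', conn, chunksize=10000):
        process_data(chunk)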
4. Writing Data to the Database
4.1 Basic Writes
Pandas provides the to_sql method, which makes it easy to write a DataFrame to a database table:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
# Build a sample DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'salary': [50000, 60000, 70000, 80000],
    'department': ['HR', 'IT', 'Finance', 'Marketing']
})
# Create the database connection
engine = create_engine('sqlite:///example.db')
# Write the DataFrame to the database
df.to_sql('employees', engine)
4.2 Controlling Write Behavior
# Replace the table if it already exists
df.to_sql('employees', engine, if_exists='replace')
# Append the new rows if the table already exists
df.to_sql('employees', engine, if_exists='append')
# Raise an error if the table already exists (this is the default)
df.to_sql('employees', engine, if_exists='fail')
# Control how the DataFrame index is handled
df.to_sql('employees', engine, index=False)  # do not write the index as a column
df.to_sql('employees', engine, index=True, index_label='employee_id')  # name the index column
# Specify column types (requires SQLAlchemy type objects)
from sqlalchemy.types import VARCHAR, Integer, Float
dtype = {
    'name': VARCHAR(100),
    'age': Integer,
    'salary': Float,
    'department': VARCHAR(50)
}
df.to_sql('employees', engine, dtype=dtype)
4.3 Writing Large Datasets in Batches
When working with large datasets, writing in batches avoids memory problems:
# Build a large DataFrame
large_df = pd.DataFrame(np.random.randn(100000, 4), columns=list('ABCD'))
# Write in batches
batch_size = 10000
for i in range(0, len(large_df), batch_size):
    batch = large_df.iloc[i:i+batch_size]
    if i == 0:
        # First batch: create the table
        batch.to_sql('large_table', engine, if_exists='replace')
    else:
        # Subsequent batches: append
        batch.to_sql('large_table', engine, if_exists='append')
    print(f"Wrote {min(i+batch_size, len(large_df))} of {len(large_df)} rows")
5. Equivalences Between SQL Queries and Pandas Operations
Sometimes you will want to know how a particular SQL operation is expressed in Pandas, or the other way around. Here are some common equivalences:
5.1 SELECT
# SQL: SELECT column1, column2 FROM table
df = pd.read_sql('SELECT column1, column2 FROM table', engine)
# Pandas equivalent (assuming the whole table is already in df)
result = df[['column1', 'column2']]
5.2 WHERE Filters
# SQL: SELECT * FROM table WHERE column1 > 10 AND column2 = 'value'
df = pd.read_sql("SELECT * FROM table WHERE column1 > 10 AND column2 = 'value'", engine)
# Pandas equivalent
result = df[(df['column1'] > 10) & (df['column2'] == 'value')]
5.3 GROUP BY and Aggregations
# SQL: SELECT department, AVG(salary) as avg_salary FROM employees GROUP BY department
df = pd.read_sql("SELECT department, AVG(salary) as avg_salary FROM employees GROUP BY department", engine)
# Pandas equivalent
result = df.groupby('department')['salary'].mean().reset_index(name='avg_salary')
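Modern pandas also supports named aggregation, which reads closer to the SQL and extends naturally to several aggregates at once; a sketch:
# Named aggregation (pandas >= 0.25)
result = df.groupby('department', as_index=False).agg(avg_salary=('salary', 'mean'))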
5.4 JOIN
# SQL:
# SELECT e.*, d.department_name
# FROM employees e
# JOIN departments d ON e.department_id = d.id
query = """
SELECT e.*, d.department_name
FROM employees e
JOIN departments d ON e.department_id = d.id
"""
df = pd.read_sql(query, engine)
# Pandas equivalent (assuming the two tables were read separately)
result = pd.merge(employees_df, departments_df,
                  left_on='department_id', right_on='id')
5.5 ORDER BY
# SQL: SELECT * FROM employees ORDER BY salary DESC, age ASC
df = pd.read_sql("SELECT * FROM employees ORDER BY salary DESC, age ASC", engine)
# Pandas equivalent
result = df.sort_values(by=['salary', 'age'], ascending=[False, True])
5.6 LIMIT and OFFSET
# SQL: SELECT * FROM table LIMIT 10 OFFSET 20
df = pd.read_sql("SELECT * FROM table LIMIT 10 OFFSET 20", engine)
# Pandas equivalent
result = df.iloc[20:30]
6. A Worked Example
The following shows a complete workflow combining Pandas and SQL for data analysis.
6.1 Preparing the Sample Database
import pandas as pd
import numpy as np
from sqlalchemy import create_engine, text
# Create an in-memory SQLite database
engine = create_engine('sqlite:///:memory:')
# Create sales data
sales_data = pd.DataFrame({
    'transaction_id': range(1000),
    'date': pd.date_range('2023-01-01', periods=1000),
    'customer_id': np.random.randint(1, 101, size=1000),
    'product_id': np.random.randint(1, 51, size=1000),
    'quantity': np.random.randint(1, 11, size=1000),
    'price': np.random.uniform(10, 1000, size=1000).round(2)
})
# Compute the transaction total
sales_data['total_amount'] = sales_data['quantity'] * sales_data['price']
# Create customer data
customer_data = pd.DataFrame({
    'customer_id': range(1, 101),
    'name': [f'Customer {i}' for i in range(1, 101)],
    'segment': np.random.choice(['Retail', 'Corporate', 'Home Office'], size=100),
    'region': np.random.choice(['North', 'South', 'East', 'West'], size=100)
})
# Create product data
product_data = pd.DataFrame({
    'product_id': range(1, 51),
    'product_name': [f'Product {i}' for i in range(1, 51)],
    'category': np.random.choice(['Electronics', 'Furniture', 'Office Supplies'], size=50),
    'sub_category': np.random.choice(['Phones', 'Chairs', 'Paper', 'Storage', 'Accessories'], size=50)
})
# Write the data to the database
sales_data.to_sql('sales', engine, index=False)
customer_data.to_sql('customers', engine, index=False)
product_data.to_sql('products', engine, index=False)
# Verify the writes (engine.execute was removed in SQLAlchemy 2.0, so go through a connection)
print("Row counts per table:")
with engine.connect() as conn:
    for table in ['sales', 'customers', 'products']:
        count = conn.execute(text(f"SELECT COUNT(*) FROM {table}")).scalar()
        print(f"{table}: {count} rows")
6.2 The Analysis Workflow
# 1. Query and analyze sales by region
query = """
SELECT c.region, SUM(s.total_amount) as total_sales
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id
GROUP BY c.region
ORDER BY total_sales DESC
"""
region_sales = pd.read_sql(query, engine)
print("Sales by region:")
print(region_sales)
# 2. Monthly sales trend per product category
query = """
SELECT
    strftime('%Y-%m', s.date) as month,
    p.category,
    SUM(s.total_amount) as monthly_sales
FROM sales s
JOIN products p ON s.product_id = p.product_id
GROUP BY month, p.category
ORDER BY month, p.category
"""
category_trends = pd.read_sql(query, engine)
# Pivot into a wide format that is easier to analyze
category_pivot = category_trends.pivot(index='month', columns='category', values='monthly_sales')
print("\nMonthly sales by product category:")
print(category_pivot.head())
# 3. Identify the top 10 customers
query = """
SELECT
    c.customer_id,
    c.name,
    c.segment,
    c.region,
    SUM(s.total_amount) as total_spent
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id
GROUP BY c.customer_id
ORDER BY total_spent DESC
LIMIT 10
"""
top_customers = pd.read_sql(query, engine)
print("\nTop 10 customers by spend:")
print(top_customers)
# 4. A more involved pandas-side analysis: average spend per customer by segment
# First pull everything we need
query = """
SELECT
    c.customer_id,
    c.name,
    c.segment,
    c.region,
    s.total_amount,
    s.date
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id
"""
sales_enriched = pd.read_sql(query, engine)
# Total spend per customer and segment
customer_spending = sales_enriched.groupby(['customer_id', 'segment'])['total_amount'].sum().reset_index()
# Spending statistics per segment
segment_avg = customer_spending.groupby('segment')['total_amount'].agg(['mean', 'median', 'std', 'min', 'max'])
print("\nCustomer spending statistics by segment:")
print(segment_avg)
# 5. Flag unusual transactions (e.g., more than two standard deviations above the mean)
avg_transaction = sales_data['total_amount'].mean()
std_transaction = sales_data['total_amount'].std()
threshold = avg_transaction + 2 * std_transaction
query = f"""
SELECT s.*, c.name, p.product_name
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id
JOIN products p ON s.product_id = p.product_id
WHERE s.total_amount > {threshold}
ORDER BY s.total_amount DESC
"""
large_transactions = pd.read_sql(query, engine)
print(f"\nUnusually large transactions (>{threshold:.2f}):")
print(large_transactions.head())
# 6. Write analysis results back to the database (here, a customer spending summary table)
customer_summary = sales_enriched.groupby(['customer_id', 'name', 'segment', 'region'])['total_amount'].agg([
    ('total_spent', 'sum'),
    ('avg_transaction', 'mean'),
    ('num_transactions', 'count')
]).reset_index()
# Write the result to the database
customer_summary.to_sql('customer_summary', engine, index=False, if_exists='replace')
print("\nCustomer summary table created; rows:", len(customer_summary))
# 7. Verify the written data
verification = pd.read_sql("SELECT * FROM customer_summary LIMIT 5", engine)
print("\nSample of the customer summary table:")
print(verification)
7. Advanced Techniques and Best Practices
7.1 Handling Connections Safely with Context Managers
from sqlalchemy import create_engine
from contextlib import contextmanager
@contextmanager
def get_connection():
    """Context manager that yields a database connection"""
    engine = create_engine('sqlite:///example.db')
    connection = engine.connect()
    try:
        yield connection
    finally:
        connection.close()
# Using the context manager
with get_connection() as conn:
    df = pd.read_sql("SELECT * FROM employees", conn)
7.2 Large Datasets and Memory Optimization
# Process large data in batches
def process_large_table(table_name, engine, batch_size=10000):
    """Process a large table batch by batch"""
    # Get the total row count
    count_query = f"SELECT COUNT(*) FROM {table_name}"
    total_rows = pd.read_sql(count_query, engine).iloc[0, 0]
    print(f"Table {table_name} has {total_rows} rows")
    # Work out how many batches are needed
    num_batches = (total_rows + batch_size - 1) // batch_size
    # Page through the table with SQL LIMIT and OFFSET
    results = []
    for i in range(num_batches):
        offset = i * batch_size
        query = f"SELECT * FROM {table_name} LIMIT {batch_size} OFFSET {offset}"
        batch_df = pd.read_sql(query, engine)
        # Process this batch
        processed = process_batch(batch_df)
        results.append(processed)
        print(f"Processed {min(offset + batch_size, total_rows)}/{total_rows} rows")
    # Combine the results
    return pd.concat(results, ignore_index=True)
def process_batch(batch_df):
    """Process one batch of data (example)"""
    # Add your own processing logic here
    return batch_df  # the example just returns the batch unchanged
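A caveat about LIMIT/OFFSET paging: the database still has to scan and discard every skipped row, so later pages get progressively slower. Keyset pagination sidesteps this by remembering the last key seen; a minimal sketch, assuming the table has an indexed, monotonically increasing id column:
def process_large_table_keyset(table_name, engine, batch_size=10000):
    """Page through a table by key instead of by offset."""
    last_id = 0
    results = []
    while True:
        query = (f"SELECT * FROM {table_name} WHERE id > {last_id} "
                 f"ORDER BY id LIMIT {batch_size}")
        batch_df = pd.read_sql(query, engine)
        if batch_df.empty:
            break
        last_id = batch_df['id'].max()
        results.append(process_batch(batch_df))
    return pd.concat(results, ignore_index=True) if results else pd.DataFrame()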
7.3 Database Transaction Management
from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker
# Create the engine and session factory
engine = create_engine('sqlite:///example.db')
Session = sessionmaker(bind=engine)
# Using a transaction
def update_employee_data(employee_id, new_salary):
    """Update an employee's record inside a transaction"""
    session = Session()
    try:
        # Look up the current row (bound parameters keep the SQL injection-safe)
        query = text("SELECT * FROM employees WHERE employee_id = :emp_id")
        employee = pd.read_sql(query, session.connection(), params={'emp_id': employee_id})
        if employee.empty:
            print(f"No employee found with id {employee_id}")
            session.rollback()
            return False
        # Apply the update
        update_query = text("""
            UPDATE employees
            SET salary = :new_salary
            WHERE employee_id = :emp_id
        """)
        session.execute(update_query, {'new_salary': new_salary, 'emp_id': employee_id})
        # Record the change history
        history_data = pd.DataFrame({
            'employee_id': [employee_id],
            'old_salary': [employee.iloc[0]['salary']],
            'new_salary': [new_salary],
            'update_time': [pd.Timestamp.now()]
        })
        history_data.to_sql('salary_history', session.connection(), if_exists='append', index=False)
        # Commit the transaction
        session.commit()
        print(f"Salary updated for employee {employee_id}")
        return True
    except Exception as e:
        # Something went wrong: roll the transaction back
        session.rollback()
        print(f"Update failed: {e}")
        return False
    finally:
        # Close the session
        session.close()
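When you don't need an ORM session, engine.begin() is a lighter option: it yields a connection wrapped in a transaction that commits on success and rolls back if an exception escapes the block; a sketch (the values are illustrative):
from sqlalchemy import create_engine, text
engine = create_engine('sqlite:///example.db')
# Commits automatically at the end of the block, rolls back on error
with engine.begin() as conn:
    conn.execute(
        text("UPDATE employees SET salary = :s WHERE employee_id = :e"),
        {'s': 75000, 'e': 42},
    )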
7.4 SQL Query Optimization
Query optimization matters all the more when Pandas is pulling its data through the database:
# Before: select every column
df = pd.read_sql("SELECT * FROM large_table", engine)
# After: select only the columns you need
df = pd.read_sql("SELECT id, name, value FROM large_table", engine)
# Before: filter in Python
df = pd.read_sql("SELECT * FROM large_table", engine)
filtered_df = df[df['value'] > 1000]
# After: filter in SQL
df = pd.read_sql("SELECT * FROM large_table WHERE value > 1000", engine)
# Before: join and aggregate in Python
customers = pd.read_sql("SELECT * FROM customers", engine)
orders = pd.read_sql("SELECT * FROM orders", engine)
merged = pd.merge(customers, orders, on='customer_id')
result = merged.groupby('region')['amount'].sum()
# After: join and aggregate in SQL
query = """
SELECT c.region, SUM(o.amount) as total
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.region
"""
result = pd.read_sql(query, engine)
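To check whether the database is actually using an index for a filter, the query plan can be fetched through read_sql as well; a sketch using SQLite's EXPLAIN QUERY PLAN (other databases expose EXPLAIN with their own output formats):
# Each row of the result describes one step of SQLite's query plan
plan = pd.read_sql("EXPLAIN QUERY PLAN SELECT * FROM large_table WHERE value > 1000", engine)
print(plan)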
7.5 Strategies for Mixing Pandas and SQL
In real projects, the choice between SQL and Pandas depends on several factors (a sketch of the mixed approach follows the list):
Use SQL when:
- the data is too large to fit in memory
- you can exploit database indexes and the query optimizer
- the operations are simple (filtering, sorting, basic aggregation)
- you need database-specific features (window functions, recursive queries, and so on)
Use Pandas when:
- the data is moderate in size and fits in memory
- you need complex transformation and cleaning
- you need Pandas-specific functionality (pivot, melt, rich time-series operations)
- you need to integrate with other Python data-science libraries (scikit-learn, matplotlib)
Mixed strategy:
- use SQL queries to reduce how much data Pandas has to load
- do the first pass of aggregation and filtering in SQL
- do the more intricate analysis and visualization in Pandas
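Here is the mixed strategy in miniature: do the heavy aggregation in SQL, then hand the small result to pandas for the reshaping SQL is bad at. The orders table and its columns are hypothetical, and strftime is the SQLite spelling of month truncation:
# Aggregate in SQL: the database returns one small row per month and region
query = """
SELECT strftime('%Y-%m', order_date) AS month, region, SUM(amount) AS sales
FROM orders
GROUP BY month, region
"""
monthly = pd.read_sql(query, engine)
# Reshape in pandas: one column per region, ready for plotting
wide = monthly.pivot(index='month', columns='region', values='sales')
print(wide.head())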
8. Database-Specific Considerations
8.1 SQLite
import sqlite3
import pandas as pd
# SQLite date/time handling
conn = sqlite3.connect('example.db')
# Register a Python function callable from SQL; note this shadows SQLite's built-in date()
conn.create_function("DATE", 1, lambda x: x.split()[0])  # crude date extraction from 'YYYY-MM-DD HH:MM:SS' strings
query = "SELECT *, DATE(timestamp) as date FROM events"
df = pd.read_sql(query, conn)
8.2 MySQL/MariaDB
from sqlalchemy import create_engine
import pandas as pd
# Handle MySQL character sets
engine = create_engine('mysql+pymysql://user:pass@localhost/db?charset=utf8mb4')
# MySQL date and time functions
query = """
SELECT
    DATE(timestamp) as date,
    TIME(timestamp) as time,
    MONTH(timestamp) as month
FROM events
"""
df = pd.read_sql(query, engine)
8.3 PostgreSQL
from sqlalchemy import create_engine
import pandas as pd
# Connect to PostgreSQL
engine = create_engine('postgresql://user:pass@localhost:5432/db')
# Use PostgreSQL's JSON operators
query = """
SELECT
    id,
    data->>'name' as name,
    data->>'email' as email
FROM users
WHERE data->>'status' = 'active'
"""
df = pd.read_sql(query, engine)
8.4 Microsoft SQL Server
from sqlalchemy import create_engine
import pandas as pd
import urllib.parse
# Build the connection string (the DRIVER value must match an installed ODBC driver,
# e.g. 'ODBC Driver 17 for SQL Server' on newer systems)
params = urllib.parse.quote_plus(
    'DRIVER={SQL Server};'
    'SERVER=server_name;'
    'DATABASE=database_name;'
    'UID=username;'
    'PWD=password;'
)
engine = create_engine(f"mssql+pyodbc:///?odbc_connect={params}")
# SQL Server date functions
query = """
SELECT
    CONVERT(DATE, created_at) as date,
    DATEPART(YEAR, created_at) as year,
    DATEPART(MONTH, created_at) as month
FROM orders
"""
df = pd.read_sql(query, engine)
9. Common Problems and Challenges
9.1 Data Type Conversion
SQL and Pandas data types do not always line up exactly, so some manual handling is occasionally needed:
# Parse dates while reading
df = pd.read_sql("SELECT * FROM orders", engine, parse_dates=['order_date'])
# Or convert after reading
df = pd.read_sql("SELECT * FROM orders", engine)
df['order_date'] = pd.to_datetime(df['order_date'])
# Handle SQL Server's bit type
df = pd.read_sql("SELECT * FROM users", engine)
df['is_active'] = df['is_active'].astype(bool)
# Handle money/decimal columns
df['price'] = df['price'].astype(float)
9.2 Handling NULL Values
# Handle NULL in SQL
query = """
SELECT
    id,
    name,
    COALESCE(email, 'no-email') as email
FROM customers
"""
df = pd.read_sql(query, engine)
# Handle NaN in Pandas (assignment avoids the pitfalls of inplace on a column)
df = pd.read_sql("SELECT * FROM customers", engine)
df['email'] = df['email'].fillna('no-email')
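A related wrinkle: NULLs in an integer column force pandas to read it as float64, because NaN has no integer representation. Pandas' nullable Int64 dtype (capital I) restores integer semantics; a sketch:
df = pd.read_sql("SELECT * FROM customers", engine)
# NULLs turned 'age' into float64; Int64 keeps integers alongside pd.NA
df['age'] = df['age'].astype('Int64')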
9.3 Encoding Problems
# Specify the character set up front
engine = create_engine('mysql+pymysql://user:pass@localhost/db?charset=utf8mb4')
# Or repair mis-decoded text after reading
df = pd.read_sql("SELECT * FROM messages", engine)
df['content'] = df['content'].str.encode('latin1').str.decode('utf8')
9.4 Memory Problems with Large Result Sets
# Use an iterator to work through a large result set
def process_large_query(query, engine, chunksize=10000):
    """Process a query that may return a huge number of rows"""
    total_processed = 0
    results = []
    # Read the query result in chunks
    for chunk in pd.read_sql(query, engine, chunksize=chunksize):
        # Process each chunk
        processed_chunk = process_chunk(chunk)
        results.append(processed_chunk)
        # Report progress
        total_processed += len(chunk)
        print(f"Processed {total_processed} rows")
    # Combine the results if appropriate
    if results:
        return pd.concat(results, ignore_index=True)
    return pd.DataFrame()
def process_chunk(chunk):
    """Process one chunk of data"""
    # Add your own processing logic here
    return chunk  # the example just returns the chunk unchanged
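If the final concat would itself exhaust memory, reduce each chunk to a small aggregate as you go and only combine the summaries at the end; a sketch that accumulates per-group sums (the region and amount columns are illustrative):
def aggregate_large_query(query, engine, chunksize=10000):
    """Sum 'amount' per 'region' without ever holding the full result set."""
    partials = []
    for chunk in pd.read_sql(query, engine, chunksize=chunksize):
        partials.append(chunk.groupby('region')['amount'].sum())
    # Merge the per-chunk sums into one final series
    return pd.concat(partials).groupby(level=0).sum()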
10. A Complete, End-to-End Analysis
To close, here is a full case study covering database connection, querying, analysis, and storage of the results:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sqlalchemy import create_engine, text
from datetime import datetime
import os
# Create the output directory
os.makedirs('reports', exist_ok=True)
# Connect to the database. Note: create_engine('sqlite:///:memory:') opens a NEW,
# empty in-memory database, so either reuse the engine from section 6.1 or
# recreate the sample data here
engine = create_engine('sqlite:///:memory:')
# Recreate the sample data if you are not continuing from the earlier code
# (data creation code omitted here...)
# 1. Load the data and analyze the sales trend
query = """
SELECT
    strftime('%Y-%m', s.date) as month,
    SUM(s.total_amount) as monthly_sales,
    COUNT(DISTINCT s.customer_id) as unique_customers,
    COUNT(*) as transaction_count,
    SUM(s.total_amount) / COUNT(*) as avg_transaction_value
FROM sales s
GROUP BY month
ORDER BY month
"""
monthly_trend = pd.read_sql(query, engine)
monthly_trend['month'] = pd.to_datetime(monthly_trend['month'] + '-01')
# Plot the monthly sales trend
plt.figure(figsize=(12, 6))
plt.plot(monthly_trend['month'], monthly_trend['monthly_sales'], marker='o')
plt.title('Monthly Sales Trend')
plt.xlabel('Month')
plt.ylabel('Sales Amount')
plt.grid(True)
plt.tight_layout()
plt.savefig('reports/monthly_sales_trend.png')
plt.close()
# 2. Product category analysis
query = """
SELECT
    p.category,
    SUM(s.total_amount) as total_sales,
    COUNT(*) as num_transactions,
    AVG(s.price) as avg_price,
    SUM(s.quantity) as total_units
FROM sales s
JOIN products p ON s.product_id = p.product_id
GROUP BY p.category
ORDER BY total_sales DESC
"""
category_analysis = pd.read_sql(query, engine)
# Plot the category sales distribution
plt.figure(figsize=(10, 6))
sns.barplot(x='category', y='total_sales', data=category_analysis)
plt.title('Sales by Product Category')
plt.xlabel('Category')
plt.ylabel('Total Sales')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('reports/category_sales.png')
plt.close()
# 3. Customer segment analysis
# (COUNT(*) * 1.0 below avoids SQLite's integer division)
query = """
SELECT
    c.segment,
    COUNT(DISTINCT c.customer_id) as num_customers,
    SUM(s.total_amount) as total_sales,
    SUM(s.total_amount) / COUNT(DISTINCT c.customer_id) as sales_per_customer,
    COUNT(*) as num_transactions,
    COUNT(*) * 1.0 / COUNT(DISTINCT c.customer_id) as transactions_per_customer
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id
GROUP BY c.segment
ORDER BY total_sales DESC
"""
segment_analysis = pd.read_sql(query, engine)
# Plot the segment analysis
fig, ax = plt.subplots(1, 2, figsize=(14, 6))
sns.barplot(x='segment', y='sales_per_customer', data=segment_analysis, ax=ax[0])
ax[0].set_title('Average Sales per Customer by Segment')
ax[0].set_xlabel('Segment')
ax[0].set_ylabel('Avg Sales per Customer')
sns.barplot(x='segment', y='transactions_per_customer', data=segment_analysis, ax=ax[1])
ax[1].set_title('Average Transactions per Customer by Segment')
ax[1].set_xlabel('Segment')
ax[1].set_ylabel('Avg Transactions per Customer')
plt.tight_layout()
plt.savefig('reports/segment_analysis.png')
plt.close()
# 4. Regional performance analysis
query = """
SELECT
    c.region,
    COUNT(DISTINCT c.customer_id) as num_customers,
    SUM(s.total_amount) as total_sales,
    AVG(s.total_amount) as avg_transaction_value,
    COUNT(*) as num_transactions
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id
GROUP BY c.region
ORDER BY total_sales DESC
"""
region_analysis = pd.read_sql(query, engine)
# Plot regional sales
plt.figure(figsize=(10, 6))
sns.barplot(x='region', y='total_sales', data=region_analysis)
plt.title('Sales by Region')
plt.xlabel('Region')
plt.ylabel('Total Sales')
plt.tight_layout()
plt.savefig('reports/region_sales.png')
plt.close()
# 5. Customer value (RFM) analysis
current_date = datetime.strptime('2023-04-01', '%Y-%m-%d')  # assumed "today"
# Query the raw RFM inputs
query = """
SELECT
    s.customer_id,
    MAX(s.date) as last_purchase_date,
    COUNT(*) as frequency,
    SUM(s.total_amount) as monetary
FROM sales s
GROUP BY s.customer_id
"""
rfm_data = pd.read_sql(query, engine)
# Parse the date and compute recency
rfm_data['last_purchase_date'] = pd.to_datetime(rfm_data['last_purchase_date'])
rfm_data['recency'] = (current_date - rfm_data['last_purchase_date']).dt.days
# Score R, F and M by quintile; all three scores run 0 (worst) to 4 (best),
# so every branch of the segmentation function below is reachable
rfm_data['R_score'] = 4 - pd.qcut(rfm_data['recency'], 5, labels=False, duplicates='drop')  # low recency -> high score
rfm_data['F_score'] = pd.qcut(rfm_data['frequency'], 5, labels=False, duplicates='drop')
rfm_data['M_score'] = pd.qcut(rfm_data['monetary'], 5, labels=False, duplicates='drop')
# Combine into a single RFM score
rfm_data['RFM_score'] = rfm_data['R_score'] * 100 + rfm_data['F_score'] * 10 + rfm_data['M_score']
# Segment the customers
def segment_customer(row):
    R, F, M = row['R_score'], row['F_score'], row['M_score']
    if R >= 4 and F >= 4 and M >= 4:
        return 'Champions'
    elif R >= 3 and F >= 3 and M >= 3:
        return 'Loyal Customers'
    elif R >= 3 and F >= 1 and M >= 2:
        return 'Potential Loyalists'
    elif R >= 4 and F <= 1 and M <= 1:
        return 'New Customers'
    elif R < 2 and F >= 3 and M >= 3:
        return 'At Risk'
    elif R < 1 and F >= 2 and M >= 2:
        return 'Cannot Lose Them'
    elif R < 1 and F < 2 and M >= 2:
        return 'Hibernating'
    else:
        return 'Others'
rfm_data['segment'] = rfm_data.apply(segment_customer, axis=1)
# Count customers per segment
segment_counts = rfm_data['segment'].value_counts().reset_index()
segment_counts.columns = ['segment', 'count']
# Visualize the customer segments
plt.figure(figsize=(12, 6))
sns.barplot(x='segment', y='count', data=segment_counts)
plt.title('Customer Segments Based on RFM Analysis')
plt.xlabel('Segment')
plt.ylabel('Number of Customers')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('reports/customer_segments.png')
plt.close()
# Write the RFM results back to the database
rfm_data.to_sql('customer_rfm', engine, if_exists='replace', index=False)
# 6. Generate the summary report
# (engine.execute was removed in SQLAlchemy 2.0, so fetch the totals through a connection)
with engine.connect() as conn:
    total_sales = conn.execute(text("SELECT SUM(total_amount) FROM sales")).scalar()
    total_transactions = conn.execute(text("SELECT COUNT(*) FROM sales")).scalar()
    total_customers = conn.execute(text("SELECT COUNT(DISTINCT customer_id) FROM sales")).scalar()
with open('reports/analysis_summary.txt', 'w') as f:
    f.write("Sales Data Analysis Report\n")
    f.write("=" * 50 + "\n\n")
    f.write("1. Overall sales metrics\n")
    f.write(f"  Total sales: ${total_sales:.2f}\n")
    f.write(f"  Total transactions: {total_transactions}\n")
    f.write(f"  Active customers: {total_customers}\n")
    f.write(f"  Average transaction value: ${total_sales/total_transactions:.2f}\n")
    f.write(f"  Average spend per customer: ${total_sales/total_customers:.2f}\n\n")
    f.write("2. Sales trend\n")
    f.write("  Monthly sales chart saved to 'reports/monthly_sales_trend.png'\n\n")
    f.write("3. Product category analysis\n")
    f.write(category_analysis.to_string(index=False) + "\n\n")
    f.write("4. Customer segment analysis\n")
    f.write(segment_analysis.to_string(index=False) + "\n\n")
    f.write("5. Regional sales analysis\n")
    f.write(region_analysis.to_string(index=False) + "\n\n")
    f.write("6. Customer RFM analysis\n")
    f.write(f"  Identified {len(segment_counts)} customer segments\n")
    f.write(segment_counts.to_string(index=False) + "\n\n")
    f.write("7. Best-selling products\n")
    top_products_query = """
    SELECT
        p.product_id,
        p.product_name,
        p.category,
        SUM(s.quantity) as total_units,
        SUM(s.total_amount) as total_sales
    FROM sales s
    JOIN products p ON s.product_id = p.product_id
    GROUP BY p.product_id
    ORDER BY total_sales DESC
    LIMIT 10
    """
    top_products = pd.read_sql(top_products_query, engine)
    f.write(top_products.to_string(index=False) + "\n\n")
    f.write("8. Conclusions\n")
    # Add business insights based on your actual results here
    f.write("  - Segments needing attention: 'At Risk' and 'Cannot Lose Them'\n")
    f.write("  - Best-performing product category: " + category_analysis.iloc[0]['category'] + "\n")
    f.write("  - Best-performing region: " + region_analysis.iloc[0]['region'] + "\n")
    f.write("\nReport generated: " + datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
print("Analysis complete! Reports and charts saved to the 'reports' directory.")
Conclusion
Combining Pandas with SQL gives you a powerful data analysis toolkit, letting data scientists and analysts work efficiently with the large volumes of data that live in databases. With the techniques and best practices covered here, you can build a seamlessly connected analysis pipeline that plays to the strengths of each side: SQL's query optimization and database features, and Pandas' flexible data wrangling and analysis.
The key takeaways:
- Use SQLAlchemy for a uniform way of connecting to databases
- Split the work sensibly between SQL and Pandas, exploiting each one's strengths
- Use batching and memory optimizations when datasets are large
- Rely on parameterized queries and transaction management for safety
- Build the complete pipeline, from data retrieval through analysis to storing the results
Mastering these techniques will substantially improve the efficiency and reliability of your analysis workflow, whether you are handling a small dataset or enterprise-scale data.