Order Data Analysis - CSDN Blog

Article link: https://blog.csdn.net/n_yyhetaerzimen/article/details/142911286

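### Background

E-commerce and online transactions are booming, and companies face an increasingly complex market and increasingly varied consumer behavior. As transaction volumes surge, order management has become one of the core parts of business operations, and order chargebacks (refunded or cancelled orders) directly affect inventory management, cash flow, and customer satisfaction. To address this, the large volume of historical data kept in the system needs to be analyzed, labeled, and mined through modeling. The goal is to extract key risk indicators and consumer behavior patterns, so that the risk factors behind chargebacks can be identified and predicted in advance. This helps the company take preventive measures, optimize its operating strategy, reduce financial losses, and improve service quality and market competitiveness.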

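### Data Description

**orders.xlsx** mainly contains the dimensions **order ID, user ID, goods ID, order amount, payment amount, channel ID, platform type, order time, payment time, and chargeback flag**. It holds **over 100,000** records generated in **2022~2023**. **Contents.xlsx** contains product review text, with **over 10,000** records.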

**Field reference**

| Field name | Description |
| ------------ | -------------------------------- |
| id | Row ID |
| orderID | Order ID |
| userID | User ID |
| goodsID | Goods ID |
| orderAmount | Order amount |
| payment | Payment amount |
| chanelID | Channel ID |
| platfromType | Platform type |
| orderTime | Order time |
| payTime | Payment time |
| chargeback | Whether the order was charged back |

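### Analysis Approach

#### 1. Data exploration

Run summary statistics, visualization, and correlation analysis on the order data to understand its basic characteristics and structure, and detect missing values, outliers, and duplicates.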
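As a rough illustration of these checks, the sketch below runs them on a tiny made-up DataFrame. The sample rows and the `summarize_quality` helper are hypothetical; only the column names follow the field reference above.

```python
import pandas as pd

# Hypothetical sample rows; only the column names mirror orders.xlsx.
sample = pd.DataFrame({
    "orderID": ["A1", "A2", "A2", "A3"],
    "orderAmount": [100.0, 50.0, 50.0, -5.0],
    "payment": [90.0, 60.0, 60.0, -5.0],
    "chanelID": ["c1", None, None, "c2"],
})

def summarize_quality(df: pd.DataFrame) -> dict:
    """Count the data-quality problems that the exploration step looks for."""
    return {
        "duplicate_orders": int(df.duplicated(subset="orderID").sum()),
        "missing_per_column": df.isnull().sum().to_dict(),
        "non_positive_amount": int((df["orderAmount"] <= 0).sum()),
        "payment_above_amount": int((df["payment"] > df["orderAmount"]).sum()),
    }

report = summarize_quality(sample)
print(report)
```

Collecting the counts in one dictionary makes it easy to compare the same checks before and after cleaning.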

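#### 2. Data preprocessing

Handle missing values, outliers, duplicates, and inconsistencies in each column of the order data so that the data is well formed.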
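The four kinds of cleaning can be sketched in one small function. This is a minimal illustration on made-up rows, not the exact pipeline used later; `clean_orders` and the sample values are assumptions.

```python
import pandas as pd

# Hypothetical sample rows; only the column names mirror orders.xlsx.
raw = pd.DataFrame({
    "orderID": ["A1", "A2", "A2", "A3", "A4"],
    "orderAmount": [100.0, 50.0, 50.0, 80.0, 30.0],
    "payment": [90.0, 50.0, 50.0, 95.0, 30.0],
    "chanelID": ["c1", "c1", "c1", "c2", None],
    "platfromType": ["App", "Web ", "Web ", "App", " Web"],
})

def clean_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Apply duplicate, missing-value, outlier, and consistency handling."""
    # Duplicates: keep the first row per order ID.
    out = df.drop_duplicates(subset="orderID").copy()
    # Missing values: fill the channel with its most frequent value.
    out["chanelID"] = out["chanelID"].fillna(out["chanelID"].mode()[0])
    # Outliers: keep positive amounts where payment does not exceed the order amount.
    valid = (
        (out["orderAmount"] > 0)
        & (out["payment"] > 0)
        & (out["payment"] <= out["orderAmount"])
    )
    out = out[valid].copy()
    # Consistency: strip stray spaces from the platform label.
    out["platfromType"] = out["platfromType"].str.replace(" ", "", regex=False)
    return out.reset_index(drop=True)

cleaned = clean_orders(raw)
print(cleaned)
```

Taking an explicit `.copy()` after each filter avoids assigning into a view of the original DataFrame.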

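#### 3. Dimension analysis

Analyze the data along each dimension, for example GMV and the share or distribution of order volume, to understand and evaluate the data from several angles, reveal the information and relationships behind it, and support decision-making.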
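For instance, monthly GMV and the order share per platform both reduce to a group-and-aggregate step. The rows below are made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical sample rows; only the column names mirror orders.xlsx.
orders = pd.DataFrame({
    "orderTime": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-03", "2023-02-14"]),
    "orderAmount": [100.0, 50.0, 200.0, 150.0],
    "platfromType": ["App", "Web", "App", "App"],
})

# GMV per month: sum of order amounts grouped by the calendar month of the order.
monthly_gmv = orders.groupby(orders["orderTime"].dt.to_period("M"))["orderAmount"].sum()

# Order share per platform: relative frequency of each platform label.
platform_share = orders["platfromType"].value_counts(normalize=True)

print(monthly_gmv)
print(platform_share)
```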

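#### 4. Data labeling

Analyze user reviews of products and classify their sentiment (positive, negative, or neutral). Define a label scheme and enrich the user features, in order to improve customer experience and refine products and services.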
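One simple way to turn a sentiment score into a three-way label is a pair of thresholds. The sketch below uses 0.6 and 0.4, the cut-offs that appear in this article's labeling code; the scores themselves are hypothetical stand-ins for a sentiment model's output, and `label_sentiment` is an illustrative name.

```python
def label_sentiment(score: float, positive_at: float = 0.6, neutral_above: float = 0.4) -> str:
    """Map a sentiment score in [0, 1] to a three-way label."""
    if score >= positive_at:
        return "positive"
    if score > neutral_above:
        return "neutral"
    return "negative"

# Hypothetical scores standing in for the output of a sentiment model.
scores = [0.95, 0.60, 0.50, 0.40, 0.05]
labels = [label_sentiment(s) for s in scores]
print(labels)
```

Keeping the thresholds as parameters makes it easy to check how sensitive the label distribution is to the cut-offs.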

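#### 5. Feature processing

Select the important features, convert the data types of features and labels, split the dataset, and standardize the data to prepare the model inputs.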
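A minimal sketch of these steps with scikit-learn follows, assuming randomly generated data. The values are synthetic, and the stratified split and `StandardScaler` are illustrative choices rather than the article's exact pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data; only the column names mirror orders.xlsx.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "orderAmount": rng.uniform(10, 500, n),
    "platfromType": rng.choice(["App", "Web", "WeChat"], n),
    "chargeback": rng.choice(["否", "是"], n, p=[0.8, 0.2]),
})

# Convert types: one-hot encode the categorical feature, map the label to 0/1.
X = pd.get_dummies(data[["orderAmount", "platfromType"]], dtype=float)
y = data["chargeback"].map({"否": 0, "是": 1})

# Split 7:3, stratified so both parts keep a similar chargeback rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Standardize: fit the scaler on the training part only, then reuse it on the test part.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training part alone keeps test-set statistics out of the model inputs. Tree-based models such as random forests are insensitive to feature scale, so this step matters mainly if other model families are tried.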

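#### 6. Modeling

Build and train a random forest classifier, evaluate it, and use grid search to tune its hyperparameters and improve the model's fit and efficiency.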
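The train, tune, and evaluate loop can be sketched on synthetic data as follows. The dataset from `make_classification`, the small parameter grid, and the F1 scoring are assumptions chosen to keep the example fast; they are not the settings used on the order data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced binary data standing in for the order features.
X, y = make_classification(n_samples=300, n_features=6, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# A deliberately small grid so the search finishes quickly.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=3,
    scoring="f1",
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("mean CV F1 of best model:", round(search.best_score_, 3))
print("test accuracy:", round(search.best_estimator_.score(X_test, y_test), 3))
```

`best_score_` is the mean cross-validated score on the training folds, so it is reported separately from the score on the held-out test set.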

# pandas for data processing
import pandas as pd
# NumPy for numerical computation
import numpy as np
# seaborn for data visualization
import seaborn as sns
# matplotlib.pyplot for plotting
import matplotlib.pyplot as plt
# SnowNLP for Chinese text processing (sentiment scoring)
from snownlp import SnowNLP
# Dictionary vectorizer
from sklearn.feature_extraction import DictVectorizer
# Random resampling utility
from sklearn.utils import resample
# Dataset splitting, grid search, and cross-validation
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
# Random forest classifier
from sklearn.ensemble import RandomForestClassifier
# Evaluation metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc
# Label encoder
from sklearn.preprocessing import LabelEncoder
# warnings module, used to suppress warning messages
import warnings

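# Render figures inline in Jupyter
%matplotlib inline
# Use the SimHei font so that Chinese text renders in figures
plt.rcParams['font.sans-serif'] = 'SimHei'
# Keep the minus sign '-' from rendering as a box
plt.rcParams['axes.unicode_minus'] = False
# Use the ggplot style for figures
plt.style.use('ggplot')
# Suppress warnings so they do not clutter the output
warnings.filterwarnings('ignore')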

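# Read the order data
df = pd.read_excel("orders.xlsx")
# Check the shape of the data
df.shape
# Show the first 10 rows
df.head(10)

# Show column types and non-null counts
df.info()

# Summary statistics per numeric column, including mean, max, and min
df.describe()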

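# Detect duplicate orders; orderID is the key used for de-duplication later on
duplicate_rows = df[df.duplicated(subset='orderID')]
# Print the number of duplicate orders (a row count, not a sum of the IDs)
print('Number of duplicate orders:', len(duplicate_rows))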

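# Count missing values per column
missing_values = df.isnull().sum()
# Keep only the columns that have missing values
missing_columns = missing_values[missing_values > 0]
# Print those columns with their missing-value counts
print('Columns with missing values and their counts:', missing_columns)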

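# Detect records whose order amount or payment amount is less than or equal to 0
count_order_amount = (df['orderAmount'] <= 0).sum()
count_payment = (df['payment'] <= 0).sum()
# Print the number of such abnormal records
print(f'Records with order amount <= 0: {count_order_amount}')
print(f'Records with payment amount <= 0: {count_payment}')

# Detect records where the payment amount exceeds the order amount
abnormal_payment_count = (df['payment'] > df['orderAmount']).sum()
# Print the number of such abnormal records
print(f'Records with payment amount > order amount: {abnormal_payment_count}')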

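# Count records where a day or more passed between ordering and paying (note: "within one day" runs up to 23:59:59).
# Compare against a Timedelta; `.dt.days > 1` would only flag gaps of two days or more.
count_greater_one_day = ((df['payTime'] - df['orderTime']) >= pd.Timedelta(days=1)).sum()
# Print the number of such abnormal records
print(f'Records paid a day or more after ordering: {count_greater_one_day}')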
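The way "one day" is expressed changes which rows get flagged. The small sketch below, on hypothetical time gaps, contrasts a test on the whole-day component with a direct comparison against a `Timedelta`.

```python
import pandas as pd

# Hypothetical gaps between ordering and paying.
gaps = pd.Series(pd.to_timedelta([
    "0 days 02:00:00",
    "1 days 00:00:00",
    "1 days 12:00:00",
    "2 days 01:00:00",
]))

# `.dt.days` is the whole-day component, so `> 1` only flags gaps of two days or more.
by_day_component = int((gaps.dt.days > 1).sum())

# Comparing against a Timedelta flags every gap of 24 hours or longer.
by_timedelta = int((gaps >= pd.Timedelta(days=1)).sum())

print(by_day_component, by_timedelta)
```

Here the first test flags one row and the second flags three, because the 24-hour and 36-hour gaps both have a day component of 1.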

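# Detect records where the payment time is earlier than or equal to the order time
count_abnormal_time = (df['payTime'] <= df['orderTime']).sum()
# Print the number of such abnormal records
print(f'Records with payment time <= order time: {count_abnormal_time}')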

# Count orders with and without a chargeback ("是" = yes, "否" = no)
num_chargeback = df[df['chargeback'] == "是"].shape[0]  # orders charged back
num_nochargeback = df[df['chargeback'] == "否"].shape[0]  # orders not charged back

# Define labels and values
labels = ['Chargeback', 'No chargeback']
values = [num_chargeback, num_nochargeback]

# Draw the pie chart
plt.figure(figsize=(8, 8))
plt.pie(values, labels=labels, colors=['#00BFFF','#98FB98'], autopct='%1.1f%%', textprops={'fontsize': 14}, startangle=140)
plt.title('Share of orders with and without chargeback')
plt.show()


# Select the columns to analyze
cols_to_analyze = ['orderAmount', 'payment', 'chanelID', 'platfromType', 'orderTime', 'payTime']
# Work on a copy of the subset
df_subset = df[cols_to_analyze].copy()

# Convert orderTime and payTime to integers of the form YYYYMMDDHHMMSS
df_subset['orderTime_timestamp'] = df_subset['orderTime'].apply(lambda x: int(pd.Timestamp(x).strftime('%Y%m%d%H%M%S')))
df_subset['payTime_timestamp'] = df_subset['payTime'].apply(lambda x: int(pd.Timestamp(x).strftime('%Y%m%d%H%M%S')))
# Drop the original datetime columns
df_subset.drop(['orderTime', 'payTime'], axis=1, inplace=True)

# Create a label encoder
label_encoder = LabelEncoder()
# Encode channel ID and platform type as integers.
# Casting to str first turns missing values into their own category, since chanelID has not been filled yet.
df_subset['chanelID'] = label_encoder.fit_transform(df_subset['chanelID'].astype(str))
df_subset['platfromType'] = label_encoder.fit_transform(df_subset['platfromType'].astype(str))

# Compute the correlation matrix
corr_matrix = df_subset.corr()

# Draw the heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='YlGn')
plt.title('Feature correlation heatmap', fontsize=18)
plt.show()


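# Drop duplicate orders by order ID; .copy() makes the result an independent DataFrame that is safe to modify
clean_order_df = df.drop_duplicates(subset='orderID').copy()
# Print the row counts before and after de-duplication
print(f'Rows before removing duplicates: {df.shape[0]}', f'\nRows after removing duplicates: {clean_order_df.shape[0]}')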


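# Compute the mode of the channel ID (chanelID)
chanel_mode = clean_order_df['chanelID'].mode()[0]
# Fill missing values by assigning the column back; a chained fillna(inplace=True) is not reliable in recent pandas versions
clean_order_df['chanelID'] = clean_order_df['chanelID'].fillna(chanel_mode)
# Count missing values per column to confirm none remain
clean_order_df.isnull().sum()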


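# Payment normally follows an order within one day, so orders with abnormal gaps between order time and payment time are removed.
# Compute the time difference
clean_order_df['time_diff'] = clean_order_df['payTime'] - clean_order_df['orderTime']
# Define the threshold of one day
one_day = pd.Timedelta(days=1)
# Keep only orders paid within one day
filterd_df = clean_order_df[clean_order_df['time_diff'] < one_day]

# Print the number of orders removed and kept; `>=` is the exact complement of the `<` filter above
print(f"Orders paid a day or more after ordering: {(clean_order_df['time_diff'] >= one_day).sum()}", f"\nOrders paid within one day: {filterd_df.shape[0]}")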


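# Remove records where the payment time is earlier than or equal to the order time
clean_time_df = filterd_df[filterd_df['payTime'] > filterd_df['orderTime']]
# Print the number of abnormal records and the number of rows kept
print(f"Orders with payment time <= order time: {(filterd_df['payTime'] <= filterd_df['orderTime']).sum()}", f"\nOrders with payment time > order time: {clean_time_df.shape[0]}")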


# Remove records whose order amount or payment amount is less than or equal to 0
clean_amount_df = clean_time_df[clean_time_df['orderAmount'] > 0]
clean_payment_df = clean_amount_df[clean_amount_df['payment'] > 0]
# Print the number of abnormal records and the rows left after each filter
print(f"Records with order amount <= 0 and payment <= 0: {(clean_time_df['orderAmount'] <= 0).sum(), (clean_amount_df['payment'] <= 0).sum()}",
      f"\nRows left after each filter: {clean_amount_df.shape[0], clean_payment_df.shape[0]}")


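# Remove records where the payment amount exceeds the order amount; .copy() lets later steps add columns safely
clean_price_df = clean_payment_df[clean_payment_df['payment'] <= clean_payment_df['orderAmount']].copy()
# Print the number of abnormal records and the number of rows kept
print(f"Records with payment amount > order amount: {(clean_payment_df['payment'] > clean_payment_df['orderAmount']).sum()}", f"\nRecords with payment amount <= order amount: {clean_price_df.shape[0]}")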


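# Special handling: some values in the platform type (platfromType) column contain stray spaces; remove them to keep the data consistent.
clean_price_df['platfromType'] = clean_price_df['platfromType'].str.replace(' ', '')
# Show the first 10 rows after removing the spaces
clean_price_df.head(10)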


# Get the month of each order
clean_price_df['orderTimeMonth'] = clean_price_df['orderTime'].dt.to_period('M').dt.to_timestamp()
# Group by month and compute the monthly GMV
month_gmv = clean_price_df.groupby(by='orderTimeMonth')['orderAmount'].sum()

# Extract the months and GMV values
months = month_gmv.index.tolist()
gmv_values = month_gmv.values.tolist()

# Set the figure size
plt.figure(figsize=(10, 6))
# Draw the line chart
plt.plot(months, gmv_values, marker='o', linestyle='-', color='b', label='GMV')
plt.xlabel('Month')  # x-axis label
plt.ylabel('GMV')  # y-axis label
plt.title("GMV trend by month", fontsize=20)  # title
plt.xticks(months, rotation=45)  # show a tick for every month
plt.grid(True)  # enable grid lines
plt.legend()  # show the legend
plt.tight_layout()  # adjust the layout so labels are not cut off
plt.show()  # display the figure


# Count the orders on each platform
platform_count = clean_price_df.groupby(by='platfromType')['id'].count().sort_values(ascending=False)

# Extract the platforms and their order counts
platform = platform_count.index.tolist()
counts = platform_count.values.tolist()

# Set the figure size
plt.figure(figsize=(16, 10))
# Draw the pie chart
plt.pie(counts, labels=platform, autopct='%.2f%%', pctdistance=0.75, labeldistance=1.1, textprops={'fontsize': 16}, startangle=140)
# Add the title
plt.title("Order share by platform", fontsize=20)
# Display the figure
plt.show()


# Derive the day of the week (0 = Monday) from the order time
weekday = clean_price_df['orderTime'].dt.dayofweek
# Add a weekday column with Chinese day names (星期一 = Monday, ..., 星期日 = Sunday)
clean_price_df['weekday_chinese'] = weekday.apply(lambda x: ['星期一', '星期二', '星期三', '星期四', '星期五', '星期六', '星期日'][x])

# Count the orders placed on each day of the week
week_count = clean_price_df.groupby('weekday_chinese')['id'].count().sort_values(ascending=False)
# Extract the weekdays and order counts
week = week_count.index.tolist()
counts = week_count.values.tolist()

# Set the figure size
plt.figure(figsize=(10, 6))
# Draw the bar chart
plt.bar(week, counts, color='skyblue')
# Add the title and axis label
plt.title("Orders by day of the week", fontsize=20)
plt.ylabel('Orders')
# Adjust the layout so labels do not overlap
plt.tight_layout()
# Display the figure
plt.show()


# Map an hour of the day to a named time segment (the labels stay in Chinese because they are data values)
def time_segment(hour):
    if 0 <= hour < 5:
        return '凌晨'  # small hours
    elif 5 <= hour < 9:
        return '早晨'  # early morning
    elif 9 <= hour < 11:
        return '上午'  # late morning
    elif 11 <= hour < 13:
        return '中午'  # noon
    elif 13 <= hour < 16:
        return '下午'  # afternoon
    elif 16 <= hour < 19:
        return '傍晚'  # early evening
    else:
        return '晚上'  # night

# Extract the order hour and apply the mapping function
clean_price_df['orderTimeSegmentation'] = clean_price_df['orderTime'].dt.hour.apply(time_segment)

# Count the orders in each time segment
time_seg_count = clean_price_df.groupby('orderTimeSegmentation')['id'].count().sort_values(ascending=False)
# Extract the segments and order counts
time_seg = time_seg_count.index.tolist()
count = time_seg_count.values.tolist()

# Set the figure size
plt.figure(figsize=(10, 6))
# Draw the bar chart
plt.bar(time_seg, count, color='skyblue')
# Add the title and axis labels
plt.title("Orders by time of day", fontsize=20)
plt.xlabel('Time segment')
plt.ylabel('Orders')
# Adjust the layout so labels do not overlap
plt.tight_layout()
# Display the figure
plt.show()


# Read the product review data
content_df = pd.read_excel("Contents.xlsx")
# Show the first 10 rows
content_df.head(10)


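# Custom function that maps a sentiment score to a label
def get_sentiment_label(sentiment):
    if sentiment >= 0.6:  # score >= 0.6: positive review (好评)
        return '好评'
    elif sentiment > 0.4:  # 0.4 < score < 0.6: neutral review (中评)
        return '中评'
    else:  # score <= 0.4: negative review (差评)
        return '差评'

# Label each review and store the result in a new DataFrame
mark_df = pd.DataFrame(columns=['Content', 'category'])
for index, row in content_df.iterrows():
    comment = str(row['Content'])  # make sure the review is a string
    sentiment_score = SnowNLP(comment).sentiments  # score the review with SnowNLP's built-in sentiment model (no training happens here)
    label = get_sentiment_label(sentiment_score)  # map the score to a label
    mark_df.loc[index] = [comment, label]  # store the review text with its sentiment label

# Save the labeled result
mark_df.to_csv("Contents_mark.csv", index=False)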


# Handle the date columns: order time, payment time, and time difference cannot be used as features directly, so convert them to numeric values
clean_price_df['newOrderTime'] = clean_price_df['orderTime'].apply(lambda x: int(pd.Timestamp(x).strftime('%Y%m%d%H%M%S')))  # integer of the form YYYYMMDDHHMMSS
clean_price_df['newPayTime'] = clean_price_df['payTime'].apply(lambda x: int(pd.Timestamp(x).strftime('%Y%m%d%H%M%S')))  # integer of the form YYYYMMDDHHMMSS
clean_price_df['newTime_diff'] = clean_price_df['time_diff'].apply(lambda x: x.total_seconds())  # duration in seconds


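# Select the feature columns: order amount, payment amount, platform type, order time, payment time, and time difference
features = clean_price_df[['orderAmount', 'payment', 'platfromType', 'newOrderTime', 'newPayTime', 'newTime_diff']]
# Convert the feature set to a list of dicts
features_dict = features.to_dict(orient='records')


# Encode with DictVectorizer, which turns the list of dicts into a numeric feature matrix (string values are one-hot encoded)
vec = DictVectorizer(sparse=False)
# Fit and transform the features
encoded_features = vec.fit_transform(features_dict)
# Get the feature names
feature_names = vec.get_feature_names_out()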


# Map the chargeback label column: "否" (no) -> 0, "是" (yes) -> 1
# Define the mapping dictionary
mapping_dict = {'否': 0, '是': 1}
# Convert the label column
labels = clean_price_df['chargeback'].map(mapping_dict).astype(int)
features = clean_price_df[['orderAmount', 'payment', 'platfromType', 'newOrderTime', 'newPayTime', 'newTime_diff']]
# Encode the features: one-hot encode the categorical column
features_encoded = pd.get_dummies(features, drop_first=True)
# Split into training and test sets at a 7:3 ratio
X_train, X_test, y_train, y_test = train_test_split(features_encoded, labels, test_size=0.3, random_state=42)


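# Split the training set by class. Here class 0 (no chargeback) is called the "positive" class and class 1 (chargeback) the "negative" class.
X_train_pos = X_train[y_train == 0]
y_train_pos = y_train[y_train == 0]
X_train_neg = X_train[y_train == 1]
y_train_neg = y_train[y_train == 1]

# Target number of class-0 samples: a 5.2 : 4.8 ratio to class 1, capped at the number available
n_neg = len(X_train_neg)
n_pos = min(len(X_train_pos), int(n_neg * 5.2 / 4.8))

# Downsample class 0 without replacement
X_train_pos_resampled, y_train_pos_resampled = resample(X_train_pos, y_train_pos,
                                                        replace=False,
                                                        n_samples=n_pos,
                                                        random_state=42)

# Merge the resampled data; pd.concat keeps the column names, which X_test also carries
X_train_resampled = pd.concat([X_train_pos_resampled, X_train_neg])
y_train_resampled = pd.concat([y_train_pos_resampled, y_train_neg])

# Print the class counts after downsampling
print("Class-0 (no chargeback) samples after downsampling:", len(y_train_resampled[y_train_resampled==0]))
print("Class-1 (chargeback) samples after downsampling:", len(y_train_resampled[y_train_resampled==1]))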
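The same downsampling idea can be sketched with pandas alone. In the example below, the `downsample_majority` helper and the 90/10 toy data are hypothetical; the 5.2 : 4.8 target ratio is the one used above.

```python
import pandas as pd

# Hypothetical training frame: 90 majority rows (label 0) and 10 minority rows (label 1).
train = pd.DataFrame({"feature": range(100), "label": [0] * 90 + [1] * 10})

def downsample_majority(df, label_col="label", majority=0, ratio=5.2 / 4.8, seed=42):
    """Keep all minority rows and about `ratio` times as many majority rows."""
    minority_rows = df[df[label_col] != majority]
    majority_rows = df[df[label_col] == majority]
    # Never ask for more rows than exist, since sampling is without replacement.
    n_keep = min(len(majority_rows), int(len(minority_rows) * ratio))
    sampled = majority_rows.sample(n=n_keep, random_state=seed)
    # Shuffle so the two classes are interleaved.
    return pd.concat([sampled, minority_rows]).sample(frac=1, random_state=seed)

balanced = downsample_majority(train)
print(balanced["label"].value_counts().to_dict())
```

Only the training split should be resampled; the test split keeps its natural class balance so that evaluation reflects real conditions.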


# Build the random forest classifier
model = RandomForestClassifier(
    n_estimators=500,         # build the forest from 500 decision trees
    max_depth=5,              # maximum depth of each tree
    min_samples_split=5,      # an internal node needs at least 5 samples to be split
    min_samples_leaf=4,       # each leaf must contain at least 4 samples
    random_state=42           # random seed
)

# Train the model
model.fit(X_train_resampled, y_train_resampled)


# Predict on the test set
y_pred = model.predict(X_test)

# Create an empty DataFrame for predicted and actual values
pred_df = pd.DataFrame()
# Store the predicted and actual values
pred_df['predicted'] = list(y_pred)
pred_df['actual'] = list(y_test)
# Show the first 10 rows
pred_df.head(10)


# Compute accuracy, precision, recall, and F1 score (the last three are averaged over classes, weighted by class size)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

# Print each metric
print(f"Accuracy: {accuracy * 100:.2f}%")
print(f"Precision: {precision * 100:.2f}%")
print(f"Recall: {recall * 100:.2f}%")
print(f"F1 score: {f1 * 100:.2f}%")
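When classes are imbalanced, weighted averages are dominated by the larger class, so it can help to look at the chargeback class on its own as well. The labels below are made up purely to show the effect.

```python
from sklearn.metrics import classification_report, recall_score

# Hypothetical labels: 90 non-chargebacks (0) and 10 chargebacks (1).
y_true = [0] * 90 + [1] * 10
# A classifier that finds only 2 of the 10 chargebacks and raises no false alarms.
y_hat = [0] * 90 + [1] * 2 + [0] * 8

# Recall averaged over both classes, weighted by class size.
weighted_recall = recall_score(y_true, y_hat, average="weighted")
# Recall for the chargeback class alone.
chargeback_recall = recall_score(y_true, y_hat, pos_label=1)

print(f"weighted recall:   {weighted_recall:.2f}")
print(f"chargeback recall: {chargeback_recall:.2f}")
print(classification_report(y_true, y_hat, target_names=["no chargeback", "chargeback"]))
```

In this toy case the weighted recall is 0.92 while the chargeback recall is only 0.20, which is why a per-class report is a useful complement.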


# Compute the ROC curve: false positive rate (fpr), true positive rate (tpr), and thresholds
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:, 1])

# Compute the area under the ROC curve (AUC)
roc_auc = auc(fpr, tpr)

# Draw the ROC curve
plt.figure(figsize=(8, 6))  # figure size of 8x6 inches
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)  # ROC curve labeled with its AUC
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')  # diagonal (ROC curve of a random classifier)
plt.xlim([0.0, 1.0])  # x-axis range
plt.ylim([0.0, 1.05])  # y-axis range
plt.xlabel('False positive rate')  # x-axis label
plt.ylabel('True positive rate')  # y-axis label
plt.title('ROC curve')  # figure title
plt.legend(loc="lower right")  # legend in the lower right corner
plt.show()  # display the figure


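# Define the parameter grid
paramGrid = dict(
    max_depth = [3, 4],                 # maximum depth of each tree
    criterion = ['gini', 'entropy'],    # split quality criterion
    max_leaf_nodes = [3, 4],            # maximum number of leaf nodes per tree
    n_estimators = [500, 800],          # number of trees (the library default is 100)
    min_samples_split = [2, 3],         # minimum samples required to split an internal node
    min_samples_leaf = [2, 4],          # minimum samples per leaf
    max_features = ['sqrt', 'log2']     # number of features considered at each split
)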

# Use grid search to find the best hyperparameters
grid_search = GridSearchCV(estimator=RandomForestClassifier(random_state=42), param_grid=paramGrid, cv=3, verbose=2, n_jobs=-1)

# Fit the grid search on the training set
grid_search.fit(X_train, y_train)

# Print the best parameter combination
print("Best parameters:", grid_search.best_params_)
# Print the best mean cross-validation accuracy
print("Best cross-validation score: {:.2f}%".format(grid_search.best_score_ * 100))

# Evaluate the best model on the test set
best_rf = grid_search.best_estimator_
test_score = best_rf.score(X_test, y_test)
print("Accuracy of the best model on the test set: {:.2f}%".format(test_score * 100))