【BI学习作业17-智能供应链与评分卡模型】

最新推荐文章于 2024-08-08 08:55:16 发布

水花

最新推荐文章于 2024-08-08 08:55:16 发布

阅读量359

点赞数

分类专栏： BI_推荐系统文章标签：机器学习 python

本文链接：https://blog.csdn.net/weixin_43849871/article/details/116110328

版权

BI_推荐系统专栏收录该内容

49 篇文章 16 订阅

订阅专栏

本文探讨了二手车价格预测的神经网络方法，使用XGBoost和LGBM融合模型，以及如何通过五折交叉验证优化预测效果。同时，深入剖析了供应链数据，包括用户分层、欺诈与迟交货预测，以及销售额和订货数量的分析。

摘要由CSDN通过智能技术生成

1.编程题

1.1二手车价格预测

数据地址：数据集地址：https://pan.baidu.com/s/1TMyk9A2ggDOraP2mMwuUcw
提取码：71yr
目标：给你一辆车的各个属性（除了price字段），预测它的价格
要求：使用神经网络、xgb、lgb等机器学习算法来完成预测

Field	Description
SaleID	交易ID，唯一编码
name	汽车交易名称，已脱敏
regDate	汽车注册日期，例如20160101，2016年01月01日
model	车型编码，已脱敏
brand	汽车品牌，已脱敏
body Type	车身类型：豪华轿车：0，微型车：1，厢型车：2，大巴车：3，敞篷车：4，双门汽车：5，商务车：6，搅拌车：7
fuelType	燃油类型：汽油：0，柴油：1，液化石油气：2，天然气：3，混合动力：4，其他：5，电动：6
geatbox	变速箱：手动：0，自动：1
power	发动机功率：范围 [ 0, 600 ]
kilometer	汽车已行驶公里，单位万km
notRepairedDamage	汽车有尚未修复的损坏：是：0，否：1
regionCode	地区编码，已脱敏
seller	销售方：个体：0，非个体：1
offerType	报价类型：提供：0，请求：1
creatDate	汽车上线时间，即开始售卖时间
price	二手车交易价格（预测目标）
v系列特征	匿名特征，包含v0-14在内15个匿名特征

对于欺诈订单进行预测，即Order Status=‘SUSPECTED_FRAUD’
对于迟交货订单进行预测，即Delivery Status= ‘Late delivery’
对于销售额进行预测，即Sales字段
对于订货数量进行预测，即Order Item Quantity

#!/usr/bin/env python
# coding: utf-8

"""
    使用XGBoost，采用五折交叉验证
    MAE = 497
    
    模型融合
    1）相同模型下的 多个子模型，比如 都使用XGBoost, 采用5个子模型进行融合
    2）不同模型的融合，比如 XGBoost + LightGBM
    
    
    模型融合 不是在模型内进行融合
    而是对于模型的预测结果，进行加权平均（即融合）
    有几个模型，就有几个模型预测的结果
    然后对 预测结果，进行加权平均 ( 相加/5 )
    
    什么情况下模型融合不好？
    1） 与一个坏的模型进行融合， X MAE = 1000, XGBoost= 509, 一起融合
   
    
    MAE = Mean Absolute Error
    518， 521， 530， 513, 580 => 497
"""

import pandas as pd

# 数据加载
train_data = pd.read_csv('./used_car_train_20200313.csv', sep=' ')
test = pd.read_csv("./used_car_testB_20200421.csv", sep = ' ')


# # 缺失值补全
train_data['notRepairedDamage'].value_counts()


train_data['notRepairedDamage'].replace('-', '0.0', inplace=True)
test['notRepairedDamage'].replace('-', '0.0', inplace=True)

# 查看数值类型
#train_data.info()
numerical_cols = train_data.select_dtypes(exclude='object').columns
# 查看分类类型
categorical_cols = train_data.select_dtypes(include='object').columns


# 特征选择
#drop_cols = ['SaleID', 'regDate', 'creatDate', 'offerType', 'price']
drop_cols = ['SaleID', 'name', 'price']
feature_cols = [col for col in train_data.columns if col not in drop_cols]
print(feature_cols)


# ## 对日期格式进行处理

"""
train_data['regDate'] = pd.to_datetime(train_data['regDate'], format='%Y%m%d', errors='coerce')
test['regDate'] = pd.to_datetime(train_data['regDate'], format='%Y%m%d', errors='coerce')
train_data[['regDate']]
train_data['creatDate'] = pd.to_datetime(train_data['creatDate'], format='%Y%m%d', errors='coerce')
test['creatDate'] = pd.to_datetime(train_data['creatDate'], format='%Y%m%d', errors='coerce')
train_data[['creatDate']]
train_data['creatDate'] = pd.to_datetime(train_data['creatDate'], format='%Y%m%d', errors='coerce')
# 时间多尺度
train_data['regDate_year'] = train_data['regDate'].apply(lambda x: str(x)[0:4])
#train_data['regDate_month'] = train_data['regDate'].apply(lambda x: str(x)[5:7])
#train_data['regDate_day'] = train_data['regDate'].apply(lambda x: str(x)[8:10])
#train_data[['regDate', 'regDate_year', 'regDate_month', 'regDate_day']]

test['regDate_year'] = test['regDate'].apply(lambda x: str(x)[0:4])
#test['regDate_month'] = test['regDate'].apply(lambda x: str(x)[5:7])
#test['regDate_day'] = test['regDate'].apply(lambda x: str(x)[8:10])
#test[['regDate', 'regDate_year', 'regDate_month', 'regDate_day']]

train_data['creatDate_year'] = train_data['creatDate'].apply(lambda x: str(x)[0:4])
#train_data['creatDate_month'] = train_data['creatDate'].apply(lambda x: str(x)[5:7])
#train_data['creatDate_day'] = train_data['creatDate'].apply(lambda x: str(x)[8:10])
#train_data[['creatDate', 'creatDate_year', 'creatDate_month', 'creatDate_day']]

test['creatDate_year'] = test['creatDate'].apply(lambda x: str(x)[0:4])
#test['creatDate_month'] = test['creatDate'].apply(lambda x: str(x)[5:7])
#test['creatDate_day'] = test['creatDate'].apply(lambda x: str(x)[8:10])
#test[['creatDate', 'creatDate_year', 'creatDate_month', 'creatDate_day']]
# 时间diff
train_data['regDate_diff'] = (train_data['regDate'] - train_data['regDate'].min()).dt.days
test['regDate_diff'] = (test['regDate'] - train_data['regDate'].min()).dt.days
train_data[['regDate', 'regDate_diff']]
"""


"""
feature_cols.append('regDate_year')
feature_cols.append('creatDate_year')
print(feature_cols)
"""
train_data['regDate_year'] = train_data['regDate'].apply(lambda x: int(str(x)[0:4]))
test['regDate_year'] = test['regDate'].apply(lambda x: int(str(x)[0:4]))
train_data['creatDate_year'] = train_data['creatDate'].apply(lambda x: int(str(x)[0:4]))
test['creatDate_year'] = test['creatDate'].apply(lambda x: int(str(x)[0:4]))

import numpy as np
# 定一个统计函数，用于统计某字段的特征
def show_stats(data):
    print('min: ', np.min(data))
    print('max: ', np.max(data))
    # ptp = max - min
    print('ptp: ', np.ptp(data))
    print('mean: ', np.mean(data))
    print('std: ', np.std(data))
    print('var: ', np.var(data))
# 查看price
#show_stats(Y_data)


import warnings
warnings.filterwarnings('ignore')
#X_data.info()
train_data['notRepairedDamage'] = train_data['notRepairedDamage'].astype('float64')
test['notRepairedDamage'] = test['notRepairedDamage'].astype('float64')

#train_data['brand']
brand_data = train_data.groupby('brand')
all_info = {}
for brand_index, brand_temp in brand_data:
    info = {}
    brand_temp = brand_temp[brand_temp['price'] > 0]
    info['brand_amount'] = len(brand_temp)
    info['brand_price_max'] = brand_temp.price.max()
    info['brand_price_min'] = brand_temp.price.min()
    info['brand_price_median'] = brand_temp.price.median()
    info['brand_price_mean'] = brand_temp.price.mean()
    info['brand_price_std'] = brand_temp.price.std()
    info['brand_price_ptp'] = info['brand_price_max'] - info['brand_price_min']
    all_info[brand_index] = info
#all_info


df_brand = pd.DataFrame(all_info).T
df_brand = df_brand.reset_index()
df_brand.rename(columns={'index': 'brand'}, inplace=True)


#brand_stats = pd.DataFrame(all_info).T.reset_index().rename(columns={'index':'brand'})
train_data = train_data.merge(df_brand, how='left', on='brand')
test = test.merge(df_brand, how='left', on='brand')


# 去掉原始的日期字段，添加上 新的日期字段
cols = train_data.columns.tolist()
cols.remove('SaleID')
cols.remove('name')
cols.remove('regDate')
cols.remove('creatDate')
cols.remove('price')
print(cols)


# 提取特征列
X_data = train_data[cols]
Y_data = train_data['price']


# ## 采用五折交叉验证，创建5个子模型



from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_absolute_error

def ensemble_model(clf, train_x, train_y, test):
    # 采用五折交叉验证, 通过n_splits参数 设定份数
    sk = StratifiedKFold(n_splits=5, shuffle=True, random_state=2021)
    result = []
    mean_mae = 0
    for k, (train_index, val_index) in enumerate(sk.split(train_x, train_y)):
        # 使用sk 得到训练集，验证集
        train_x_real = train_x.iloc[train_index]
        train_y_real = train_y.iloc[train_index]
        val_x = train_x.iloc[val_index]
        val_y = train_y.iloc[val_index]
        # 子模型训练
        clf = clf.fit(train_x_real, train_y_real)
        val_y_pred = clf.predict(val_x)
        # 子模型评估
        mae_val = mean_absolute_error(val_y, val_y_pred)
        print('第{}个子模型 MAE {}'.format(k+1, mae_val))
        mean_mae += mae_val / 5
        # 使用子模型 对测试集进行预测
        test_y_pred = clf.predict(test)
        result.append(test_y_pred)
    print(mean_mae)
    # 最终结果 = 5个子模型的平均值
    mean_result = sum(result) / 5
    return mean_result




from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import xgboost as xgb

# 数据集切分, 80%用于训练，20%用于验证 => MAE: 669
# 数据集切分, 80%用于训练，20%用于验证 => MAE: 606
x_train, x_val, y_train, y_val = train_test_split(X_data, Y_data, test_size=0.2)
# 创建模型
model = xgb.XGBRegressor(n_estimators=150, learning_rate=0.1, gamma=0,                          subsample=0.8, colsample_bytree=0.8, max_depth=7,                          random_state=2021)
model.fit(x_train, y_train)
y_pred = model.predict(x_val)
mae = mean_absolute_error(y_val, y_pred)
print('XGBoost MAE: ', mae)



# 标签泄露： 就是特征与label进行了关联，如果某个构造的特征 与label进行关联
# model进行预测的时候，训练集结果会很准确
# 适当的标签泄露是有价值，但是取决于 你的value_counts个数
# 如果个数不多，比如40个，是OK的 => 有一些特征的价值 => 相当于聚类的价值
# 如果个数很多，比如15万个，是不行的 => ID的作用
len(train_data['brand'].value_counts())


get_ipython().run_cell_magic('time', '', 'import xgboost as xgb\n# 创建模型\n"""\n# 随机设置\nmodel = xgb.XGBRegressor(n_estimators=150, learning_rate=0.1, gamma=0, \\\n                         subsample=0.8, colsample_bytree=0.8, max_depth=7, \\\n                         random_state=2021)\n\nimport lightgbm as lgb\nmodel = lgb.LGBMRegressor(\n            num_leaves=2**5-1, reg_alpha=0.25, reg_lambda=0.25,\n            max_depth=-1, learning_rate=0.005, min_child_samples=3, random_state=2021,\n            n_estimators=2000, subsample=1, colsample_bytree=1,\n        )\nmodel.fit(X_data, Y_data)\n"""\n# 祖传参数，在有限的时间内，得到还不错的结果\nmodel = xgb.XGBRegressor(\n            max_depth=6, learning_rate=0.05, n_estimators=2000, \n            objective=\'reg:linear\', tree_method=\'gpu_hist\', \n            subsample=0.8, colsample_bytree=0.8, \n            min_child_samples=3, eval_metric=\'auc\', reg_lambda=0.5\n        )\n#model.fit(X_data, Y_data)\ny_pred = ensemble_model(model, X_data, Y_data, test[cols])')


#test.drop(['regDate', 'creatDate'], axis=1, inplace=True)


"""
y_pred = model.predict(test[cols])
y_pred
"""

# 训练 欠拟合 => n_estimators太小，或者 learning_rate太小
show_stats(y_pred)


y_pred[y_pred<0] = 11

# 因为XGBoost是集成学习，多棵树组成
# 有些树的叶子节点 有可能为负
result = pd.DataFrame()
result['SaleID'] = test['SaleID']
result['price'] = y_pred
result.to_csv('./ans_xgb_kf5.csv', index=False)


# result_xgb, result_lgb
result = result_xgb * 0.4 + result_lgb * 0.6

1.2供应链分析

数据集地址：https://pan.baidu.com/s/1TMyk9A2ggDOraP2mMwuUcw
提取码：71yr

在这里插入图片描述

任务：

供应链数据探索
对用户进行分层运营
对于欺诈订单进行预测，即Order Status=‘SUSPECTED_FRAUD’
对于迟交货订单进行预测，即Delivery Status= ‘Late delivery’
对于销售额进行预测，即Sales字段
对于订货数量进行预测，即Order Item Quantity

#!/usr/bin/env python
# coding: utf-8

# In[1]:


import pandas as pd

# 数据加载
dataset = pd.read_csv('./SupplyChain.csv', encoding='unicode_escape')


print(dataset.shape)
print(dataset.isnull().sum())

print(dataset[['Customer Fname', 'Customer Lname']])
#  fistname与lastname进行合并
dataset['Customer Full Name'] = dataset['Customer Fname'] +dataset['Customer Lname']
#dataset.head()
dataset['Customer Zipcode'].value_counts()
# 查看缺失值，发现有3个缺失值
print(dataset['Customer Zipcode'].isnull().sum())


dataset['Customer Zipcode'] = dataset['Customer Zipcode'].fillna(0)

import matplotlib.pyplot as plt
import seaborn as sns
# 特征字段之间相关性 热力图
data = dataset
plt.figure(figsize=(20,10))
# annot=True 显示具体数字
sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
# 结论：可以观察到Product Price和Sales，Order Item Total有很高的相关性


# 基于Market进行聚合
market = data.groupby('Market')
# 基于Region进行聚合
region = data.groupby('Order Region')
plt.figure(1)
market['Sales per customer'].sum().sort_values(ascending=False).plot.bar(figsize=(12,6), title='Sales in different markets')
plt.figure(2)
region['Sales per customer'].sum().sort_values(ascending=False).plot.bar(figsize=(12,6), title='Sales in different regions')
plt.show()



# 基于Category Name进行聚类
cat = data.groupby('Category Name')
plt.figure(1)
# 不同类别的 总销售额
cat['Sales per customer'].sum().sort_values(ascending=False).plot.bar(figsize=(12,6), title='Total sales')
plt.figure(2)
# 不同类别的 平均销售额
cat['Sales per customer'].mean().sort_values(ascending=False).plot.bar(figsize=(12,6), title='Total sales')
plt.show()


#data['order date (DateOrders)']
# 创建时间戳索引
temp = pd.DatetimeIndex(data['order date (DateOrders)'])

# 取order date (DateOrders)字段中的year, month, weekday, hour, month_year
data['order_year'] = temp.year
data['order_month'] = temp.month
data['order_week_day'] = temp.weekday
data['order_hour'] = temp.hour
data['order_month_year'] = temp.to_period('M')
data.head()


# 对销售额进行探索，按照不同时间维度 年，星期，小时，月
plt.figure(figsize=(10, 12))
plt.subplot(4, 2, 1)
df_year = data.groupby('order_year')
df_year['Sales'].mean().plot(figsize=(12, 12), title='Average sales in years')
plt.subplot(4, 2, 2)
df_day = data.groupby('order_week_day')
df_day['Sales'].mean().plot(figsize=(12, 12), title='Average sales in days per week')
plt.subplot(4, 2, 3)
df_hour = data.groupby('order_hour')
df_hour['Sales'].mean().plot(figsize=(12, 12), title='Average sales in hours per day')
plt.subplot(4, 2, 4)
df_month = data.groupby('order_month')
df_month['Sales'].mean().plot(figsize=(12, 12), title='Average sales in month per year')
plt.tight_layout()
plt.show()


# 探索商品价格与 销售额之间的关系
data.plot(x='Product Price', y='Sales per customer') 
plt.title('Relationship between Product Price and Sales per customer')
plt.xlabel('Product Price')
plt.ylabel('Sales per customer')
plt.show()


# # 用户分层 RFM

data['TotalPrice'] = data['Order Item Quantity'] * data['Order Item Total']
data[['TotalPrice', 'Order Item Quantity', 'Order Item Total']]



# 时间类型转换
data['order date (DateOrders)'] = pd.to_datetime(data['order date (DateOrders)'])
# 统计最后一笔订单的时间
data['order date (DateOrders)'].max()



# 假设我们现在是2018-2-1
import datetime
present = datetime.datetime(2018,2,1)
# 计算每个用户的RFM指标
# 按照Order Customer Id进行聚合，
customer_seg = data.groupby('Order Customer Id').agg({'order date (DateOrders)': lambda x: (present-x.max()).days,                                                       'Order Id': lambda x:len(x), 'TotalPrice': lambda x: x.sum()})


# 将字段名称改成 R，F，M
customer_seg.rename(columns={'order date (DateOrders)': 'R_Value', 'Order Id': 'F_Value', 'TotalPrice': 'M_Value'}, inplace=True)


# 将RFM数据划分为4个尺度
quantiles = customer_seg.quantile(q=[0.25, 0.5, 0.75])
quantiles = quantiles.to_dict()
quantiles



# R_Value越小越好 => R_Score就越大
def R_Score(a, b, c):
    if a <= c[b][0.25]:
        return 4
    elif a <= c[b][0.50]:
        return 3
    elif a <= c[b][0.75]:
        return 2
    else:
        return 1

# F_Value, M_Value越大越好
def FM_Score(a, b, c):
    if a <= c[b][0.25]:
        return 1
    elif a <= c[b][0.50]:
        return 2
    elif a <= c[b][0.75]:
        return 3
    else:
        return 4



# 新建R_Score字段，用于将R_Value => [1,4]
customer_seg['R_Score']  = customer_seg['R_Value'].apply(R_Score, args=("R_Value", quantiles))
# 新建F_Score字段，用于将F_Value => [1,4]
customer_seg['F_Score']  = customer_seg['F_Value'].apply(FM_Score, args=("F_Value", quantiles))
# 新建M_Score字段，用于将R_Value => [1,4]
customer_seg['M_Score']  = customer_seg['M_Value'].apply(FM_Score, args=("M_Value", quantiles))
customer_seg.head()



# 计算RFM用户分层
def RFM_User(df):
    if df['M_Score'] > 2 and df['F_Score'] > 2 and df['R_Score'] > 2:
        return '重要价值用户'
    if df['M_Score'] > 2 and df['F_Score'] <= 2 and df['R_Score'] > 2:
        return '重要发展用户'
    if df['M_Score'] > 2 and df['F_Score'] > 2 and df['R_Score'] <= 2:
        return '重要保持用户'
    if df['M_Score'] > 2 and df['F_Score'] <= 2 and df['R_Score'] <= 2:
        return '重要挽留用户'

    if df['M_Score'] <= 2 and df['F_Score'] > 2 and df['R_Score'] > 2:
        return '一般价值用户'
    if df['M_Score'] <= 2 and df['F_Score'] <= 2 and df['R_Score'] > 2:
        return '一般发展用户'
    if df['M_Score'] <= 2 and df['F_Score'] > 2 and df['R_Score'] <= 2:
        return '一般保持用户'
    if df['M_Score'] <= 2 and df['F_Score'] <= 2 and df['R_Score'] <= 2:
        return '一般挽留用户'


customer_seg['Customer_Segmentation'] = customer_seg.apply(RFM_User, axis=1)

# ## EDA 对于欺诈订单进行探索
# 查看交易的方式
data['Type'].unique()


# 显示不同地区的支付类型情况
import numpy as np
#pay_type = data.groupby('Type')
# 获取4种支付方式的数据
pay_type1 = data[(data['Type'] == 'TRANSFER')]
pay_type2= data[(data['Type'] == 'CASH')]
pay_type3= data[(data['Type'] == 'PAYMENT')]
pay_type4= data[(data['Type'] == 'DEBIT')]
# 获取4种支付方式中，不同地区的特点
count1=pay_type1['Order Region'].value_counts()
count2=pay_type2['Order Region'].value_counts()
count3=pay_type3['Order Region'].value_counts()
count4=pay_type4['Order Region'].value_counts()
# 获取地区region的个数
region_num = len(count1)
fig,ax = plt.subplots(figsize=(20,8))
index=np.arange(region_num)
# 每种类型的显示间隔
bar_width=0.2
# 不透明度设置
opacity=0.6
type1=plt.bar(index,count1,bar_width,alpha=opacity,color='b',label='Transfer')
type2=plt.bar(index+bar_width,count2,bar_width,alpha=opacity,color='r',label='Cash')
type3=plt.bar(index+bar_width+bar_width,count3,bar_width,alpha=opacity,color='g',label='Payment')
type4=plt.bar(index+bar_width+bar_width+bar_width,count4,bar_width,alpha=opacity,color='y',label='Debit')
plt.xlabel('Order Regions')
plt.ylabel('Number of payments')
plt.title('Different Type of payments in all regions')
plt.legend()
# 显示刻度
names=data['Order Region'].value_counts().keys()
#plt.xticks(index+bar_width,names,rotation=90)
plt.xticks(index+bar_width,names,rotation='vertical')
# tight_layout会自动调整子图参数，使之填充整个图像区域
#plt.tight_layout()
plt.show()


# 分析结论：
# Debit（借记）是所有地区中使用最多的支付方式
# Cash（现金）是所有地区中使用最少的支付方式


# 对负收益的产品进行探索
loss = data[(data['Benefit per order']<0)]
# 显示最大的负收益产品 Top10
plt.figure(1)
loss['Category Name'].value_counts().nlargest(10).plot.bar(figsize=(20,8), title="Products with most loss")
# 显示最大的负收益地区 Top10
plt.figure(2)
loss['Order Region'].value_counts().nlargest(10).plot.bar(figsize=(20,8), title="Regions with most loss")
# 所有负收益产品带来的损失
print('Total revenue lost with orders',loss['Benefit per order'].sum())



# 负收益，有可能来自欺诈交易，那么会采用哪种支付方式
xyz = data[(data['Order Status'] == 'SUSPECTED_FRAUD') & (data['Type'] != 'TRANSFER')]
#xyz = data[(data['Order Status'] == 'SUSPECTED_FRAUD') & (data['Type'] == 'TRANSFER')]
xyz['Order Region'].value_counts()
# 发现TRANSFER容易导致欺诈交易



# 显示不同地区的欺诈交易情况
high_fraud = data[(data['Order Status'] == 'SUSPECTED_FRAUD') & (data['Type'] == 'TRANSFER')]
high_fraud['Order Region'].value_counts().plot.bar(figsize=(24,12))

# 显示标题
plt.title("Regions with Highest Fraud",size=15,color='r')
plt.ylabel("Fraud Number")
plt.show()



# Western Europe地区欺诈交易最多，将一个bar分成2个部分（堆积条形图）：分别为总欺诈数量 与 Western Europe的欺诈数量 
high_fraud_total = data[(data['Order Status'] == 'SUSPECTED_FRAUD')]
high_fraud_we = data[(data['Order Status'] == 'SUSPECTED_FRAUD') &(data['Order Region'] == 'Western Europe')]
# 找出风险最高的10个Category
fraud1=high_fraud_total['Category Name'].value_counts().nlargest(10).plot.bar(figsize=(20,8), title="Fraud Category",color='orange')
# 找出这些风险在Western Europe中的情况
fraud2=high_fraud_we['Category Name'].value_counts().nlargest(10).plot.bar(figsize=(20,8), title="Fraud product in Western Europe",color='green')
plt.legend(["All regions", "Western Europe"])
plt.title("Top 10 products with highest fraud detections", size=15)
plt.xlabel("Products", size=13)
plt.ylim(0,600)
plt.show()



# 筛选出Top10 风险Customer
cus = data[(data['Order Status'] == 'SUSPECTED_FRAUD')] 
#Top 10 customers with most fraud
cus['Customer Full Name'].value_counts().nlargest(10).plot.bar(figsize=(20,8), title="Top 10 Highest Fraud Customers")



# 找到MarySmith的交易金额
amount = data[(data['Customer Full Name'] == 'MarySmith')&(data['Order Status'] == 'SUSPECTED_FRAUD')]
amount['Sales'].sum()



import pickle
with open('data.pkl', 'wb') as file:
    pickle.dump(data, file)