1.编程题
1.1二手车价格预测
-
数据地址:数据集地址:https://pan.baidu.com/s/1TMyk9A2ggDOraP2mMwuUcw
提取码:71yr -
目标:给你一辆车的各个属性(除了price字段),预测它的价格
-
要求:使用神经网络、xgb、lgb等机器学习算法来完成预测
Field | Description |
---|---|
SaleID | 交易ID,唯一编码 |
name | 汽车交易名称,已脱敏 |
regDate | 汽车注册日期,例如20160101,2016年01月01日 |
model | 车型编码,已脱敏 |
brand | 汽车品牌,已脱敏 |
body Type | 车身类型:豪华轿车:0,微型车:1,厢型车:2,大巴车:3,敞篷车:4,双门汽车:5,商务车:6,搅拌车:7 |
fuelType | 燃油类型:汽油:0,柴油:1,液化石油气:2,天然气:3,混合动力:4,其他:5,电动:6 |
geatbox | 变速箱:手动:0,自动:1 |
power | 发动机功率:范围 [ 0, 600 ] |
kilometer | 汽车已行驶公里,单位万km |
notRepairedDamage | 汽车有尚未修复的损坏:是:0,否:1 |
regionCode | 地区编码,已脱敏 |
seller | 销售方:个体:0,非个体:1 |
offerType | 报价类型:提供:0,请求:1 |
creatDate | 汽车上线时间,即开始售卖时间 |
price | 二手车交易价格(预测目标) |
v系列特征 | 匿名特征,包含v0-14在内15个匿名特征 |
对于欺诈订单进行预测,即Order Status=‘SUSPECTED_FRAUD’
对于迟交货订单进行预测,即Delivery Status= ‘Late delivery’
对于销售额进行预测,即Sales字段
对于订货数量进行预测,即Order Item Quantity
#!/usr/bin/env python
# coding: utf-8
"""
使用XGBoost,采用五折交叉验证
MAE = 497
模型融合
1)相同模型下的 多个子模型,比如 都使用XGBoost, 采用5个子模型进行融合
2)不同模型的融合,比如 XGBoost + LightGBM
模型融合 不是在模型内进行融合
而是对于模型的预测结果,进行加权平均(即融合)
有几个模型,就有几个模型预测的结果
然后对 预测结果,进行加权平均 ( 相加/5 )
什么情况下模型融合不好?
1) 与一个坏的模型进行融合, X MAE = 1000, XGBoost= 509, 一起融合
MAE = Mean Absolute Error
518, 521, 530, 513, 580 => 497
"""
import pandas as pd
# 数据加载
train_data = pd.read_csv('./used_car_train_20200313.csv', sep=' ')
test = pd.read_csv("./used_car_testB_20200421.csv", sep = ' ')
# # 缺失值补全
train_data['notRepairedDamage'].value_counts()
train_data['notRepairedDamage'].replace('-', '0.0', inplace=True)
test['notRepairedDamage'].replace('-', '0.0', inplace=True)
# 查看数值类型
#train_data.info()
numerical_cols = train_data.select_dtypes(exclude='object').columns
# 查看分类类型
categorical_cols = train_data.select_dtypes(include='object').columns
# 特征选择
#drop_cols = ['SaleID', 'regDate', 'creatDate', 'offerType', 'price']
drop_cols = ['SaleID', 'name', 'price']
feature_cols = [col for col in train_data.columns if col not in drop_cols]
print(feature_cols)
# ## 对日期格式进行处理
"""
train_data['regDate'] = pd.to_datetime(train_data['regDate'], format='%Y%m%d', errors='coerce')
test['regDate'] = pd.to_datetime(train_data['regDate'], format='%Y%m%d', errors='coerce')
train_data[['regDate']]
train_data['creatDate'] = pd.to_datetime(train_data['creatDate'], format='%Y%m%d', errors='coerce')
test['creatDate'] = pd.to_datetime(train_data['creatDate'], format='%Y%m%d', errors='coerce')
train_data[['creatDate']]
train_data['creatDate'] = pd.to_datetime(train_data['creatDate'], format='%Y%m%d', errors='coerce')
# 时间多尺度
train_data['regDate_year'] = train_data['regDate'].apply(lambda x: str(x)[0:4])
#train_data['regDate_month'] = train_data['regDate'].apply(lambda x: str(x)[5:7])
#train_data['regDate_day'] = train_data['regDate'].apply(lambda x: str(x)[8:10])
#train_data[['regDate', 'regDate_year', 'regDate_month', 'regDate_day']]
test['regDate_year'] = test['regDate'].apply(lambda x: str(x)[0:4])
#test['regDate_month'] = test['regDate'].apply(lambda x: str(x)[5:7])
#test['regDate_day'] = test['regDate'].apply(lambda x: str(x)[8:10])
#test[['regDate', 'regDate_year', 'regDate_month', 'regDate_day']]
train_data['creatDate_year'] = train_data['creatDate'].apply(lambda x: str(x)[0:4])
#train_data['creatDate_month'] = train_data['creatDate'].apply(lambda x: str(x)[5:7])
#train_data['creatDate_day'] = train_data['creatDate'].apply(lambda x: str(x)[8:10])
#train_data[['creatDate', 'creatDate_year', 'creatDate_month', 'creatDate_day']]
test['creatDate_year'] = test['creatDate'].apply(lambda x: str(x)[0:4])
#test['creatDate_month'] = test['creatDate'].apply(lambda x: str(x)[5:7])
#test['creatDate_day'] = test['creatDate'].apply(lambda x: str(x)[8:10])
#test[['creatDate', 'creatDate_year', 'creatDate_month', 'creatDate_day']]
# 时间diff
train_data['regDate_diff'] = (train_data['regDate'] - train_data['regDate'].min()).dt.days
test['regDate_diff'] = (test['regDate'] - train_data['regDate'].min()).dt.days
train_data[['regDate', 'regDate_diff']]
"""
"""
feature_cols.append('regDate_year')
feature_cols.append('creatDate_year')
print(feature_cols)
"""
train_data['regDate_year'] = train_data['regDate'].apply(lambda x: int(str(x)[0:4]))
test['regDate_year'] = test['regDate'].apply(lambda x: int(str(x)[0:4]))
train_data['creatDate_year'] = train_data['creatDate'].apply(lambda x: int(str(x)[0:4]))
test['creatDate_year'] = test['creatDate'].apply(lambda x: int(str(x)[0:4]))
import numpy as np
# 定一个统计函数,用于统计某字段的特征
def show_stats(data):
print('min: ', np.min(data))
print('max: ', np.max(data))
# ptp = max - min
print('ptp: ', np.ptp(data))
print('mean: ', np.mean(data))
print('std: ', np.std(data))
print('var: ', np.var(data))
# 查看price
#show_stats(Y_data)
import warnings
warnings.filterwarnings('ignore')
#X_data.info()
train_data['notRepairedDamage'] = train_data['notRepairedDamage'].astype('float64')
test['notRepairedDamage'] = test['notRepairedDamage'].astype('float64')
#train_data['brand']
brand_data = train_data.groupby('brand')
all_info = {}
for brand_index, brand_temp in brand_data:
info = {}
brand_temp = brand_temp[brand_temp['price'] > 0]
info['brand_amount'] = len(brand_temp)
info['brand_price_max'] = brand_temp.price.max()
info['brand_price_min'] = brand_temp.price.min()
info['brand_price_median'] = brand_temp.price.median()
info['brand_price_mean'] = brand_temp.price.mean()
info['brand_price_std'] = brand_temp.price.std()
info['brand_price_ptp'] = info['brand_price_max'] - info['brand_price_min']
all_info[brand_index] = info
#all_info
df_brand = pd.DataFrame(all_info).T
df_brand = df_brand.reset_index()
df_brand.rename(columns={'index': 'brand'}, inplace=True)
#brand_stats = pd.DataFrame(all_info).T.reset_index().rename(columns={'index':'brand'})
train_data = train_data.merge(df_brand, how='left', on='brand')
test = test.merge(df_brand, how='left', on='brand')
# 去掉原始的日期字段,添加上 新的日期字段
cols = train_data.columns.tolist()
cols.remove('SaleID')
cols.remove('name')
cols.remove('regDate')
cols.remove('creatDate')
cols.remove('price')
print(cols)
# 提取特征列
X_data = train_data[cols]
Y_data = train_data['price']
# ## 采用五折交叉验证,创建5个子模型
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_absolute_error
def ensemble_model(clf, train_x, train_y, test):
# 采用五折交叉验证, 通过n_splits参数 设定份数
sk = StratifiedKFold(n_splits=5, shuffle=True, random_state=2021)
result = []
mean_mae = 0
for k, (train_index, val_index) in enumerate(sk.split(train_x, train_y)):
# 使用sk 得到训练集,验证集
train_x_real = train_x.iloc[train_index]
train_y_real = train_y.iloc[train_index]
val_x = train_x.iloc[val_index]
val_y = train_y.iloc[val_index]
# 子模型训练
clf = clf.fit(train_x_real, train_y_real)
val_y_pred = clf.predict(val_x)
# 子模型评估
mae_val = mean_absolute_error(val_y, val_y_pred)
print('第{}个子模型 MAE {}'.format(k+1, mae_val))
mean_mae += mae_val / 5
# 使用子模型 对测试集进行预测
test_y_pred = clf.predict(test)
result.append(test_y_pred)
print(mean_mae)
# 最终结果 = 5个子模型的平均值
mean_result = sum(result) / 5
return mean_result
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import xgboost as xgb
# 数据集切分, 80%用于训练,20%用于验证 => MAE: 669
# 数据集切分, 80%用于训练,20%用于验证 => MAE: 606
x_train, x_val, y_train, y_val = train_test_split(X_data, Y_data, test_size=0.2)
# 创建模型
model = xgb.XGBRegressor(n_estimators=150, learning_rate=0.1, gamma=0, subsample=0.8, colsample_bytree=0.8, max_depth=7, random_state=2021)
model.fit(x_train, y_train)
y_pred = model.predict(x_val)
mae = mean_absolute_error(y_val, y_pred)
print('XGBoost MAE: ', mae)
# 标签泄露: 就是特征与label进行了关联,如果某个构造的特征 与label进行关联
# model进行预测的时候,训练集结果会很准确
# 适当的标签泄露是有价值,但是取决于 你的value_counts个数
# 如果个数不多,比如40个,是OK的 => 有一些特征的价值 => 相当于聚类的价值
# 如果个数很多,比如15万个,是不行的 => ID的作用
len(train_data['brand'].value_counts())
get_ipython().run_cell_magic('time', '', 'import xgboost as xgb\n# 创建模型\n"""\n# 随机设置\nmodel = xgb.XGBRegressor(n_estimators=150, learning_rate=0.1, gamma=0, \\\n subsample=0.8, colsample_bytree=0.8, max_depth=7, \\\n random_state=2021)\n\nimport lightgbm as lgb\nmodel = lgb.LGBMRegressor(\n num_leaves=2**5-1, reg_alpha=0.25, reg_lambda=0.25,\n max_depth=-1, learning_rate=0.005, min_child_samples=3, random_state=2021,\n n_estimators=2000, subsample=1, colsample_bytree=1,\n )\nmodel.fit(X_data, Y_data)\n"""\n# 祖传参数,在有限的时间内,得到还不错的结果\nmodel = xgb.XGBRegressor(\n max_depth=6, learning_rate=0.05, n_estimators=2000, \n objective=\'reg:linear\', tree_method=\'gpu_hist\', \n subsample=0.8, colsample_bytree=0.8, \n min_child_samples=3, eval_metric=\'auc\', reg_lambda=0.5\n )\n#model.fit(X_data, Y_data)\ny_pred = ensemble_model(model, X_data, Y_data, test[cols])')
#test.drop(['regDate', 'creatDate'], axis=1, inplace=True)
"""
y_pred = model.predict(test[cols])
y_pred
"""
# 训练 欠拟合 => n_estimators太小,或者 learning_rate太小
show_stats(y_pred)
y_pred[y_pred<0] = 11
# 因为XGBoost是集成学习,多棵树组成
# 有些树的叶子节点 有可能为负
result = pd.DataFrame()
result['SaleID'] = test['SaleID']
result['price'] = y_pred
result.to_csv('./ans_xgb_kf5.csv', index=False)
# result_xgb, result_lgb
result = result_xgb * 0.4 + result_lgb * 0.6
1.2供应链分析
数据集地址:https://pan.baidu.com/s/1TMyk9A2ggDOraP2mMwuUcw
提取码:71yr
任务:
- 供应链数据探索
- 对用户进行分层运营
- 对于欺诈订单进行预测,即Order Status=‘SUSPECTED_FRAUD’
- 对于迟交货订单进行预测,即Delivery Status= ‘Late delivery’
- 对于销售额进行预测,即Sales字段
- 对于订货数量进行预测,即Order Item Quantity
#!/usr/bin/env python
# coding: utf-8
# In[1]:
import pandas as pd
# 数据加载
dataset = pd.read_csv('./SupplyChain.csv', encoding='unicode_escape')
print(dataset.shape)
print(dataset.isnull().sum())
print(dataset[['Customer Fname', 'Customer Lname']])
# fistname与lastname进行合并
dataset['Customer Full Name'] = dataset['Customer Fname'] +dataset['Customer Lname']
#dataset.head()
dataset['Customer Zipcode'].value_counts()
# 查看缺失值,发现有3个缺失值
print(dataset['Customer Zipcode'].isnull().sum())
dataset['Customer Zipcode'] = dataset['Customer Zipcode'].fillna(0)
import matplotlib.pyplot as plt
import seaborn as sns
# 特征字段之间相关性 热力图
data = dataset
plt.figure(figsize=(20,10))
# annot=True 显示具体数字
sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
# 结论:可以观察到Product Price和Sales,Order Item Total有很高的相关性
# 基于Market进行聚合
market = data.groupby('Market')
# 基于Region进行聚合
region = data.groupby('Order Region')
plt.figure(1)
market['Sales per customer'].sum().sort_values(ascending=False).plot.bar(figsize=(12,6), title='Sales in different markets')
plt.figure(2)
region['Sales per customer'].sum().sort_values(ascending=False).plot.bar(figsize=(12,6), title='Sales in different regions')
plt.show()
# 基于Category Name进行聚类
cat = data.groupby('Category Name')
plt.figure(1)
# 不同类别的 总销售额
cat['Sales per customer'].sum().sort_values(ascending=False).plot.bar(figsize=(12,6), title='Total sales')
plt.figure(2)
# 不同类别的 平均销售额
cat['Sales per customer'].mean().sort_values(ascending=False).plot.bar(figsize=(12,6), title='Total sales')
plt.show()
#data['order date (DateOrders)']
# 创建时间戳索引
temp = pd.DatetimeIndex(data['order date (DateOrders)'])
# 取order date (DateOrders)字段中的year, month, weekday, hour, month_year
data['order_year'] = temp.year
data['order_month'] = temp.month
data['order_week_day'] = temp.weekday
data['order_hour'] = temp.hour
data['order_month_year'] = temp.to_period('M')
data.head()
# 对销售额进行探索,按照不同时间维度 年,星期,小时,月
plt.figure(figsize=(10, 12))
plt.subplot(4, 2, 1)
df_year = data.groupby('order_year')
df_year['Sales'].mean().plot(figsize=(12, 12), title='Average sales in years')
plt.subplot(4, 2, 2)
df_day = data.groupby('order_week_day')
df_day['Sales'].mean().plot(figsize=(12, 12), title='Average sales in days per week')
plt.subplot(4, 2, 3)
df_hour = data.groupby('order_hour')
df_hour['Sales'].mean().plot(figsize=(12, 12), title='Average sales in hours per day')
plt.subplot(4, 2, 4)
df_month = data.groupby('order_month')
df_month['Sales'].mean().plot(figsize=(12, 12), title='Average sales in month per year')
plt.tight_layout()
plt.show()
# 探索商品价格与 销售额之间的关系
data.plot(x='Product Price', y='Sales per customer')
plt.title('Relationship between Product Price and Sales per customer')
plt.xlabel('Product Price')
plt.ylabel('Sales per customer')
plt.show()
# # 用户分层 RFM
data['TotalPrice'] = data['Order Item Quantity'] * data['Order Item Total']
data[['TotalPrice', 'Order Item Quantity', 'Order Item Total']]
# 时间类型转换
data['order date (DateOrders)'] = pd.to_datetime(data['order date (DateOrders)'])
# 统计最后一笔订单的时间
data['order date (DateOrders)'].max()
# 假设我们现在是2018-2-1
import datetime
present = datetime.datetime(2018,2,1)
# 计算每个用户的RFM指标
# 按照Order Customer Id进行聚合,
customer_seg = data.groupby('Order Customer Id').agg({'order date (DateOrders)': lambda x: (present-x.max()).days, 'Order Id': lambda x:len(x), 'TotalPrice': lambda x: x.sum()})
# 将字段名称改成 R,F,M
customer_seg.rename(columns={'order date (DateOrders)': 'R_Value', 'Order Id': 'F_Value', 'TotalPrice': 'M_Value'}, inplace=True)
# 将RFM数据划分为4个尺度
quantiles = customer_seg.quantile(q=[0.25, 0.5, 0.75])
quantiles = quantiles.to_dict()
quantiles
# R_Value越小越好 => R_Score就越大
def R_Score(a, b, c):
if a <= c[b][0.25]:
return 4
elif a <= c[b][0.50]:
return 3
elif a <= c[b][0.75]:
return 2
else:
return 1
# F_Value, M_Value越大越好
def FM_Score(a, b, c):
if a <= c[b][0.25]:
return 1
elif a <= c[b][0.50]:
return 2
elif a <= c[b][0.75]:
return 3
else:
return 4
# 新建R_Score字段,用于将R_Value => [1,4]
customer_seg['R_Score'] = customer_seg['R_Value'].apply(R_Score, args=("R_Value", quantiles))
# 新建F_Score字段,用于将F_Value => [1,4]
customer_seg['F_Score'] = customer_seg['F_Value'].apply(FM_Score, args=("F_Value", quantiles))
# 新建M_Score字段,用于将R_Value => [1,4]
customer_seg['M_Score'] = customer_seg['M_Value'].apply(FM_Score, args=("M_Value", quantiles))
customer_seg.head()
# 计算RFM用户分层
def RFM_User(df):
if df['M_Score'] > 2 and df['F_Score'] > 2 and df['R_Score'] > 2:
return '重要价值用户'
if df['M_Score'] > 2 and df['F_Score'] <= 2 and df['R_Score'] > 2:
return '重要发展用户'
if df['M_Score'] > 2 and df['F_Score'] > 2 and df['R_Score'] <= 2:
return '重要保持用户'
if df['M_Score'] > 2 and df['F_Score'] <= 2 and df['R_Score'] <= 2:
return '重要挽留用户'
if df['M_Score'] <= 2 and df['F_Score'] > 2 and df['R_Score'] > 2:
return '一般价值用户'
if df['M_Score'] <= 2 and df['F_Score'] <= 2 and df['R_Score'] > 2:
return '一般发展用户'
if df['M_Score'] <= 2 and df['F_Score'] > 2 and df['R_Score'] <= 2:
return '一般保持用户'
if df['M_Score'] <= 2 and df['F_Score'] <= 2 and df['R_Score'] <= 2:
return '一般挽留用户'
customer_seg['Customer_Segmentation'] = customer_seg.apply(RFM_User, axis=1)
# ## EDA 对于欺诈订单进行探索
# 查看交易的方式
data['Type'].unique()
# 显示不同地区的支付类型情况
import numpy as np
#pay_type = data.groupby('Type')
# 获取4种支付方式的数据
pay_type1 = data[(data['Type'] == 'TRANSFER')]
pay_type2= data[(data['Type'] == 'CASH')]
pay_type3= data[(data['Type'] == 'PAYMENT')]
pay_type4= data[(data['Type'] == 'DEBIT')]
# 获取4种支付方式中,不同地区的特点
count1=pay_type1['Order Region'].value_counts()
count2=pay_type2['Order Region'].value_counts()
count3=pay_type3['Order Region'].value_counts()
count4=pay_type4['Order Region'].value_counts()
# 获取地区region的个数
region_num = len(count1)
fig,ax = plt.subplots(figsize=(20,8))
index=np.arange(region_num)
# 每种类型的显示间隔
bar_width=0.2
# 不透明度设置
opacity=0.6
type1=plt.bar(index,count1,bar_width,alpha=opacity,color='b',label='Transfer')
type2=plt.bar(index+bar_width,count2,bar_width,alpha=opacity,color='r',label='Cash')
type3=plt.bar(index+bar_width+bar_width,count3,bar_width,alpha=opacity,color='g',label='Payment')
type4=plt.bar(index+bar_width+bar_width+bar_width,count4,bar_width,alpha=opacity,color='y',label='Debit')
plt.xlabel('Order Regions')
plt.ylabel('Number of payments')
plt.title('Different Type of payments in all regions')
plt.legend()
# 显示刻度
names=data['Order Region'].value_counts().keys()
#plt.xticks(index+bar_width,names,rotation=90)
plt.xticks(index+bar_width,names,rotation='vertical')
# tight_layout会自动调整子图参数,使之填充整个图像区域
#plt.tight_layout()
plt.show()
# 分析结论:
# Debit(借记)是所有地区中使用最多的支付方式
# Cash(现金)是所有地区中使用最少的支付方式
# 对负收益的产品进行探索
loss = data[(data['Benefit per order']<0)]
# 显示最大的负收益产品 Top10
plt.figure(1)
loss['Category Name'].value_counts().nlargest(10).plot.bar(figsize=(20,8), title="Products with most loss")
# 显示最大的负收益地区 Top10
plt.figure(2)
loss['Order Region'].value_counts().nlargest(10).plot.bar(figsize=(20,8), title="Regions with most loss")
# 所有负收益产品带来的损失
print('Total revenue lost with orders',loss['Benefit per order'].sum())
# 负收益,有可能来自欺诈交易,那么会采用哪种支付方式
xyz = data[(data['Order Status'] == 'SUSPECTED_FRAUD') & (data['Type'] != 'TRANSFER')]
#xyz = data[(data['Order Status'] == 'SUSPECTED_FRAUD') & (data['Type'] == 'TRANSFER')]
xyz['Order Region'].value_counts()
# 发现TRANSFER容易导致欺诈交易
# 显示不同地区的欺诈交易情况
high_fraud = data[(data['Order Status'] == 'SUSPECTED_FRAUD') & (data['Type'] == 'TRANSFER')]
high_fraud['Order Region'].value_counts().plot.bar(figsize=(24,12))
# 显示标题
plt.title("Regions with Highest Fraud",size=15,color='r')
plt.ylabel("Fraud Number")
plt.show()
# Western Europe地区欺诈交易最多,将一个bar分成2个部分(堆积条形图):分别为总欺诈数量 与 Western Europe的欺诈数量
high_fraud_total = data[(data['Order Status'] == 'SUSPECTED_FRAUD')]
high_fraud_we = data[(data['Order Status'] == 'SUSPECTED_FRAUD') &(data['Order Region'] == 'Western Europe')]
# 找出风险最高的10个Category
fraud1=high_fraud_total['Category Name'].value_counts().nlargest(10).plot.bar(figsize=(20,8), title="Fraud Category",color='orange')
# 找出这些风险在Western Europe中的情况
fraud2=high_fraud_we['Category Name'].value_counts().nlargest(10).plot.bar(figsize=(20,8), title="Fraud product in Western Europe",color='green')
plt.legend(["All regions", "Western Europe"])
plt.title("Top 10 products with highest fraud detections", size=15)
plt.xlabel("Products", size=13)
plt.ylim(0,600)
plt.show()
# 筛选出Top10 风险Customer
cus = data[(data['Order Status'] == 'SUSPECTED_FRAUD')]
#Top 10 customers with most fraud
cus['Customer Full Name'].value_counts().nlargest(10).plot.bar(figsize=(20,8), title="Top 10 Highest Fraud Customers")
# 找到MarySmith的交易金额
amount = data[(data['Customer Full Name'] == 'MarySmith')&(data['Order Status'] == 'SUSPECTED_FRAUD')]
amount['Sales'].sum()
import pickle
with open('data.pkl', 'wb') as file:
pickle.dump(data, file)