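Preface
I am in Ji Zhongping's class. The final project asks us to pick any dataset from the web, then clean it, visualize it, and model it (comparing two models).
Goal
Using the very large dataset officially released for PUBG, analyze how various factors affect winning and train a model to predict the final placement of players in the test set. The target label is a percentile between 0 and 1; a higher value means a better placement in that match.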
import numpy as np
import pandas as pd
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from pdpbox import pdp
# Sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from IPython.display import display
from sklearn import metrics
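Data fields
DBNOs : number of enemy players the player knocked down
assists : number of enemy players the player damaged that were then killed by teammates
boosts : number of boost items used
damageDealt : total damage dealt by the player; self-inflicted damage is subtracted
headshotKills : number of enemy players killed with headshots
heals : number of healing items used
Id : the player's ID
killPlace : the player's rank in the match by number of enemy players killed
killPoints : kills-based external ranking of the player
killStreaks : maximum number of enemy players killed in a short amount of time
kills : number of enemy players killed
longestKill : longest distance between the player and a player they killed, measured at the time of death
matchDuration : duration of the match
matchId : ID of the match
matchType : game mode (solo/duo/squad). The standard modes are "solo", "duo", "squad", "solo-fpp", "duo-fpp", and "squad-fpp"; other modes come from events or custom matches.
rankPoints : Elo-like ranking of the player
revives : number of times the player revived teammates
rideDistance : total distance traveled in vehicles, in meters
roadKills : number of kills made while in a vehicle
swimDistance : total distance traveled by swimming
teamKills : number of times the player killed a teammate
vehicleDestroys : number of vehicles destroyed
walkDistance : total distance traveled on foot
weaponsAcquired : number of weapons picked up
winPoints : win-based external ranking of the player
groupId : ID of the group (team)
numGroups : number of groups in the match for which we have data
maxPlace : worst placement in the match for which we have data
winPlacePerc : the prediction target, a percentile placement between 0 and 1, where 1 is first place and 0 is last place. It is calculated from maxPlace rather than numGroups, so some groups may be missing from a match.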
Load the data
train = pd.read_csv('D:/dataAnalysis/train_V2.csv')
test = pd.read_csv('D:/dataAnalysis/test_V2.csv')
View the first few rows
train.head()
Output:
Id groupId matchId assists boosts damageDealt DBNOs headshotKills heals killPlace ... revives rideDistance roadKills swimDistance teamKills vehicleDestroys walkDistance weaponsAcquired winPoints winPlacePerc
0 7f96b2f878858a 4d4b580de459be a10357fd1a4a91 0 0 0.00 0 0 0 60 ... 0 0.0000 0 0.00 0 0 244.80 1 1466 0.4444
1 eef90569b9d03c 684d5656442f9e aeb375fc57110c 0 0 91.47 0 0 0 57 ... 0 0.0045 0 11.04 0 0 1434.00 5 0 0.6400
2 1eaf90ac73de72 6a4a42c3245a74 110163d8bb94ae 1 0 68.00 0 0 0 47 ... 0 0.0000 0 0.00 0 0 161.80 2 0 0.7755
3 4616d365dd2853 a930a9c79cd721 f1f1f4ef412d7e 0 0 32.90 0 0 0 75 ... 0 0.0000 0 0.00 0 0 202.70 3 0 0.1667
4 315c96c26c9aac de04010b3458dd 6dc8ff871e21e6 0 0 100.00 0 0 0 45 ... 0 0.0000 0 0.00 0 0 49.75 2 0 0.1875
5 rows × 29 columns
Inspect the variables in the data and their types
train.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4446966 entries, 0 to 4446965
Data columns (total 29 columns):
Id object
groupId object
matchId object
assists int64
boosts int64
damageDealt float64
DBNOs int64
headshotKills int64
heals int64
killPlace int64
killPoints int64
kills int64
killStreaks int64
longestKill float64
matchDuration int64
matchType object
maxPlace int64
numGroups int64
rankPoints int64
revives int64
rideDistance float64
roadKills int64
swimDistance float64
teamKills int64
vehicleDestroys int64
walkDistance float64
weaponsAcquired int64
winPoints int64
winPlacePerc float64
dtypes: float64(6), int64(19), object(4)
memory usage: 983.9+ MB
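1. Data cleaning
Inspecting the dataset shows that the maximum of kills is a little over 50, which is below 127 (the int8 upper bound, 2^7 - 1), so int8 is enough and int64 is unnecessary. The same reasoning applies to the other fields, which shrinks the dataset's memory footprint.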
def reduce_mem_usage(df):
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
reduce_mem_usage(train)
Output:
Memory usage of dataframe is 983.90 MB
Memory usage after optimization is: 288.39 MB
Decreased by 70.7%
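The integer bounds that drive the branches above can be checked directly, and pandas also offers a built-in downcast that may serve as a shorter alternative. A minimal sketch, assuming a made-up toy column standing in for `kills` (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# int8 stores values from -128 to 127, so a column whose maximum is a bit over 50 fits
int8_bounds = (np.iinfo(np.int8).min, np.iinfo(np.int8).max)

# Hypothetical toy column standing in for `kills`
toy_kills = pd.Series([0, 3, 12, 57], dtype=np.int64)

# pd.to_numeric with downcast='integer' picks the smallest signed integer dtype that holds every value
toy_small = pd.to_numeric(toy_kills, downcast='integer')

print(int8_bounds)
print(toy_small.dtype)
```

The hand-written function additionally shrinks floats to float16, which `pd.to_numeric(..., downcast='float')` does not do (it stops at float32), so the two approaches are not interchangeable for float columns.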
Remove cheaters' data:
Drop players who have at least one kill (kills > 0) but a total movement distance of 0
train['totalDistance'] = train['rideDistance'] + train['walkDistance'] + train['swimDistance']
train['killWithoutMoving'] = ((train['kills']>0)&(train['totalDistance']==0))
train['headshot_rate'] = train['headshotKills']/train['kills']
train['headshot_rate'] = train['headshot_rate'].fillna(0)
train[train['killWithoutMoving']==True].shape
Output:
(1535, 36) # a little over 1,500 rows will be dropped
train.drop(train[train['killWithoutMoving']==True].index,inplace=True)
train.drop(train[train['roadKills']>10].index,inplace=True) # more than 10 road kills
Check for rows with a missing winPlacePerc
train[train['winPlacePerc'].isnull()]
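The line above only displays rows whose target is missing. If any show up, one option is to drop them before modeling. A minimal sketch on a made-up frame (the toy values are not from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing target value
toy = pd.DataFrame({
    'kills': [0, 2, 1],
    'winPlacePerc': [0.25, np.nan, 1.0],
})

# Count the rows without a label, then drop them
n_missing = int(toy['winPlacePerc'].isnull().sum())
toy_clean = toy.dropna(subset=['winPlacePerc']).reset_index(drop=True)

print(n_missing, len(toy_clean))
```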
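2. Basic data exploration
From personal experience with the game, I expect kills, damage dealt, distance moved, weapons picked up, and the use of boost and healing items to have a large effect on the chance of winning (the "chicken dinner"), so these are visualized below.
① Number of players per match versus number of matches
This plot has little to do with the modeling; it is just for a quick look.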
train['playersJoined'] = train.groupby('matchId')['matchId'].transform('count')
plt.figure(figsize=(15,10))
sns.countplot(x=train[train['playersJoined']>=75]['playersJoined'])
plt.show()
Output:
② Kill counts
plt.figure(figsize=(15,10))
sns.countplot(data=train,x='kills').set_title('Kills')
plt.show()
# most players have 0 kills
Output:
train[train['kills']>10].shape
Output:
(8340, 29)
print("{} players won their match, ({:.4f}%) of all players".format(len(train[train["winPlacePerc"]==1]),100*len(train[train["winPlacePerc"]==1])/len(train)))
Output:
127573 players won their match, (2.8688%) of all players
train.drop(train[train['kills']>30].index,inplace=True) # drop players with more than 30 kills
③ Damage dealt
data = train.copy() # data is a copy of train, so train itself is not modified
plt.figure(figsize=(15,10))
plt.title("Damage Dealt ",fontsize=15)
sns.distplot(data['damageDealt'])
plt.show()
data = train.copy()
data = data[data['kills']==0]
plt.figure(figsize=(15,10))
plt.title("Damage Dealt by 0 killers",fontsize=15)
sns.distplot(data['damageDealt'])
plt.show() # look at how often players with zero kills (and hence little damage) still win, to infer by contrast how closely damage and kills are tied to winning
# no_kill_success_nums: number of players who won without a single kill
no_kill_success_nums=len(data[data['winPlacePerc']==1])
# no_kill_success_rate: share of all players who won without a single kill
no_kill_success_rate=no_kill_success_nums/len(train)
# print the result
print("{} players ({:.4f}%) have won without a single kill!".format(no_kill_success_nums,100*no_kill_success_rate))
Output: Because high kill counts and damage values are thinly spread, we instead look at how rarely low-damage players win and infer by contrast that kills and damage are positively related to winning.
16666 players (0.3748%) have won without a single kill!
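Note that the rate printed above divides by all players, not only by those with zero kills. To compare win rates conditional on kills, something like the following could be used. This is a sketch on made-up data, and `win_rate_by_kill_flag` is a hypothetical helper name:

```python
import pandas as pd

def win_rate_by_kill_flag(df):
    """Share of winners (winPlacePerc == 1) among players without and with kills."""
    has_kill = df['kills'] > 0
    is_winner = df['winPlacePerc'] == 1
    # The mean of a boolean column within each group is that group's win rate
    rates = is_winner.groupby(has_kill).mean()
    return rates.rename({False: 'no_kills', True: 'with_kills'})

# Toy data: four players without kills, four with kills
toy = pd.DataFrame({
    'kills':        [0, 0, 0, 0, 1, 3, 5, 2],
    'winPlacePerc': [1.0, 0.2, 0.1, 0.5, 1.0, 1.0, 0.9, 0.3],
})
rates = win_rate_by_kill_flag(toy)
print(rates)
```

On the real data, calling `win_rate_by_kill_flag(train)` would give the two rates side by side, which makes the contrast between the groups explicit.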
④ Movement distance
data = train[train['walkDistance'] < train['walkDistance'].quantile(0.99)]
plt.figure(figsize=(15,10))
plt.title("The Running Distances")
sns.distplot(data['walkDistance']) # distplot: histogram with a density curve
plt.show()
winner_data=train[train['winPlacePerc'] == 1].copy()
winner_data = winner_data[winner_data['walkDistance'] < winner_data['walkDistance'].quantile(0.99)]
plt.figure(figsize=(15,10))
plt.title("The winner's running distances")
sns.distplot(winner_data['walkDistance']) # distplot: histogram with a density curve
plt.show()
# Winners usually walk around 3000 m, i.e. a comparatively long distance.
# There is also a spike at the low end here, possibly because the many AFK players were not removed.
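⑤ Weapons picked up
data = train.copy() # data is a copy of train, so train itself is not modified
# Group weapon counts above the 99th percentile into a single 'larger' bucket
cap = data['weaponsAcquired'].quantile(0.99)
weapons = data['weaponsAcquired'].astype(str)
weapons[data['weaponsAcquired'] > cap] = 'larger'
# Plot with weaponsAcquired on the x-axis and the number of rows on the y-axis
plt.figure(figsize=(15,10))
sns.countplot(x=weapons.sort_values())
plt.title("weaponsAcquired Count",fontsize=15)
plt.show() # most players pick up 1 to 4 weapons per match.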
⑥ Boost and healing items
average_heals=train['heals'].mean()
quantile_099_heals=train['heals'].quantile(0.99)
print("Average number of healing items used per player: {:.1f}; 99% of players use at most {}".format(average_heals, quantile_099_heals))
average_boosts=train['boosts'].mean()
quantile_099_boosts=train['boosts'].quantile(0.99)
print("Average number of boost items used per player: {:.1f}; 99% of players use at most {}".format(average_boosts, quantile_099_boosts))
Output:
Average number of healing items used per player: 1.4; 99% of players use at most 12.0
Average number of boost items used per player: 1.1; 99% of players use at most 7.0
3. Exploring relationships between variables
① Kills
train['kills_rank'] = pd.cut(train['kills'], [-1, 0, 2, 5, 10, 20, 60] ,labels = ['0_kills', '1-2_kills', '3-5_kills', '6-10_kills', '11-20_kills', '20+kills'])
plt.figure(figsize = (10, 6))
sns.boxplot(x = 'kills_rank', y = 'winPlacePerc', data = train)
plt.show() # higher-placed players usually have more kills, and players who win generally have more than 5 kills.
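For reference, `pd.cut` treats each bin as right-inclusive by default, so the edges used above put 0 in the first bucket, 1 and 2 in the second, and so on. A small check on made-up kill counts:

```python
import pandas as pd

bins = [-1, 0, 2, 5, 10, 20, 60]
labels = ['0_kills', '1-2_kills', '3-5_kills', '6-10_kills', '11-20_kills', '20+kills']

# Toy kill counts chosen to sit on and around the bin edges
toy_kills = pd.Series([0, 1, 2, 3, 10, 11, 25])
toy_rank = pd.cut(toy_kills, bins, labels=labels)

print(toy_rank.astype(str).tolist())
```

Values above the last edge (60 here) would come out as NaN, so the final edge needs to be at least the maximum of the column.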
② Walking distance
sns.jointplot(x="winPlacePerc", y="walkDistance", data=train, height=10, ratio=3, color="red")
plt.show() # walking distance is strongly correlated with placement.
③ Boost and healing items
data = train.copy()
data = data[data['heals'] < data['heals'].quantile(0.99)]
data = data[data['boosts'] < data['boosts'].quantile(0.99)]
f,ax1 = plt.subplots(figsize =(20,10))
sns.pointplot(x='heals',y='winPlacePerc',data=data,color='lime',alpha=0.8)
sns.pointplot(x='boosts',y='winPlacePerc',data=data,color='blue',alpha=0.8)
plt.text(4,0.6,'Heals',color='lime',fontsize = 17,style = 'italic')
plt.text(4,0.55,'Boosts',color='blue',fontsize = 17,style = 'italic')
plt.xlabel('Number of heal/boost items',fontsize = 15,color='blue')
plt.ylabel('Win Percentage',fontsize = 15,color='blue')
plt.title('Heals vs Boosts',fontsize = 20,color='blue')
plt.grid()
plt.show()
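# Boost item use is positively correlated with the chance of winning
# Healing item use is positively correlated at first, then levels off beyond a certain count
④ Team mode
Decide whether a match is solo, duo, or squad from the number of groups in it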
solos = train[train['numGroups']>50]
duos = train[(train['numGroups']>25) & (train['numGroups']<=50)]
squads = train[train['numGroups']<=25]
f,ax1 = plt.subplots(figsize =(15,10))
sns.pointplot(x='kills',y='winPlacePerc',data=solos,color='red',alpha=0.8)
sns.pointplot(x='kills',y='winPlacePerc',data=duos,color='blue',alpha=0.8)
sns.pointplot(x='kills',y='winPlacePerc',data=squads,color='green',alpha=0.8)
plt.text(37,0.6,'Solos',color='red',fontsize = 17,style = 'italic')
plt.text(37,0.55,'Duos',color='blue',fontsize = 17,style = 'italic')
plt.text(37,0.5,'Squads',color='green',fontsize = 17,style = 'italic')
plt.xlabel('kill nums',fontsize = 15,color='black')
plt.ylabel('win per',fontsize = 15,color='black')
plt.title('Solo vs Duo vs Squad Kills',fontsize = 20,color='blue')
plt.grid()
plt.show()
# In solo and duo matches, the win placement rises as kills increase;
# in squad matches the rise is clear at low kill counts, but beyond about 6 kills it becomes very flat.
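The thresholds on `numGroups` used above are a heuristic. Since the dataset also has a `matchType` column, the mode could instead be read from its name. A sketch, assuming every mode string of interest contains 'solo', 'duo', or 'squad' (event modes such as 'crashfpp' fall into 'other'); `team_mode` is a hypothetical helper:

```python
import pandas as pd

def team_mode(match_type):
    """Map a matchType string to 'solo', 'duo', 'squad', or 'other' by substring."""
    for mode in ('solo', 'duo', 'squad'):
        if mode in match_type:
            return mode
    return 'other'

# Toy values taken from the mode names that appear in this dataset
toy_types = pd.Series(['solo-fpp', 'duo', 'normal-squad-fpp', 'crashfpp', 'squad'])
toy_modes = toy_types.map(team_mode)

print(toy_modes.tolist())
```

With this, `train[train['matchType'].map(team_mode) == 'solo']` would select solo matches without relying on group counts.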
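⑤ Multivariate analysis
f,ax = plt.subplots(figsize=(15, 15))
# numeric_only skips string and categorical columns (pandas 1.5+)
sns.heatmap(train.corr(numeric_only=True), annot=True, linewidths=.5, fmt= '.1f',ax=ax,cmap='rainbow')
plt.show()
# Eight variables have a positive correlation of 0.4 or more with winPlacePerc: boosts, damageDealt, heals, kills, killStreaks, longestKill, walkDistance, weaponsAcquired.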
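The list of strongly correlated variables can also be read off programmatically rather than from the heatmap. A sketch on synthetic data; `strong_correlates` and its 0.4 default are illustrative, and the toy columns are randomly generated, not taken from the dataset:

```python
import numpy as np
import pandas as pd

def strong_correlates(df, target, threshold=0.4):
    """Return features whose correlation with `target` is at least `threshold`, strongest first."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    return corr[corr >= threshold].sort_values(ascending=False)

# Synthetic frame: one column tied to the target, one pure-noise column
rng = np.random.default_rng(0)
walk = rng.uniform(0, 3000, size=500)
toy = pd.DataFrame({
    'walkDistance': walk,
    'noise': rng.normal(size=500),
    'winPlacePerc': walk / 3000 + rng.normal(scale=0.05, size=500),
})
strong = strong_correlates(toy, 'winPlacePerc')
print(strong)
```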
4. Modeling
Groundwork
# Read the data, group it, extract some key information, and define the basic helper function.
from functools import reduce
def BuildFeature(is_train=True):
    y = None
    test_idx = None
    if is_train:
        print("Reading train.csv")
        df = pd.read_csv('D:/dataAnalysis/train_V2.csv')
        df = df[df['maxPlace'] > 1]
    else:
        print("Reading test.csv")
        df = pd.read_csv('D:/dataAnalysis/test_V2.csv')
        test_idx = df.Id
    # Reduce the memory usage
    df = reduce_mem_usage(df) # reuse the memory-reduction function from the data-cleaning step
    print("Delete Unuseful Columns")
    target = 'winPlacePerc'
    features = list(df.columns)
    features.remove("Id")
    features.remove("matchId")
    features.remove("groupId")
    features.remove("matchType")
    if is_train:
        print("Read Labels")
        y = np.array(df.groupby(['matchId','groupId'])[target].agg('mean'), dtype=np.float64)
        features.remove(target)
    print("Read Group mean features")
    agg = df.groupby(['matchId','groupId'])[features].agg('mean')
    agg_rank = agg.groupby('matchId')[features].rank(pct=True).reset_index()
    if is_train:
        df_out = agg.reset_index()[['matchId','groupId']]
    else:
        df_out = df[['matchId','groupId']]
    df_out = df_out.merge(agg.reset_index(), suffixes=["", ""], how='left', on=['matchId', 'groupId'])
    df_out = df_out.merge(agg_rank, suffixes=["_mean", "_mean_rank"], how='left', on=['matchId', 'groupId'])
    print("Read Group max features")
    agg = df.groupby(['matchId','groupId'])[features].agg('max')
    agg_rank = agg.groupby('matchId')[features].rank(pct=True).reset_index()
    df_out = df_out.merge(agg.reset_index(), suffixes=["", ""], how='left', on=['matchId', 'groupId'])
    df_out = df_out.merge(agg_rank, suffixes=["_max", "_max_rank"], how='left', on=['matchId', 'groupId'])
    print("Read Group min features")
    agg = df.groupby(['matchId','groupId'])[features].agg('min')
    agg_rank = agg.groupby('matchId')[features].rank(pct=True).reset_index()
    df_out = df_out.merge(agg.reset_index(), suffixes=["", ""], how='left', on=['matchId', 'groupId'])
    df_out = df_out.merge(agg_rank, suffixes=["_min", "_min_rank"], how='left', on=['matchId', 'groupId'])
    print("Read Group size features")
    agg = df.groupby(['matchId','groupId']).size().reset_index(name='group_size')
    df_out = df_out.merge(agg, how='left', on=['matchId', 'groupId'])
    print("Read Match mean features")
    agg = df.groupby(['matchId'])[features].agg('mean').reset_index()
    df_out = df_out.merge(agg, suffixes=["", "_match_mean"], how='left', on=['matchId'])
    print("Read Match size features")
    agg = df.groupby(['matchId']).size().reset_index(name='match_size')
    df_out = df_out.merge(agg, how='left', on=['matchId'])
    df_out.drop(["matchId", "groupId"], axis=1, inplace=True)
    X = df_out
    feature_names = list(df_out.columns)
    del df, df_out, agg, agg_rank
    #gc.collect()
    return X, y, feature_names, test_idx
X_train, y_train, train_columns, _ = BuildFeature(is_train=True)
X_test, _, _ , test_idx = BuildFeature(is_train=False)
Output:
Reading train.csv
Delete Unuseful Columns
Read Labels
Read Group mean features
Read Group max features
Read Group min features
Read Group size features
Read Match mean features
Read Match size features
Reading test.csv
Delete Unuseful Columns
Read Group mean features
Read Group max features
Read Group min features
Read Group size features
Read Match mean features
Read Match size features
Model 1
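The feature-importance table for this model lists columns such as `killsNorm`, `damageDealtNorm`, and `matchType_squad` that the code shown in this post never creates. How they were built is not stated. One plausible construction, offered purely as an assumption, scales per-player stats by match size and one-hot encodes `matchType`; `add_model_features` and the scaling formula are hypothetical:

```python
import pandas as pd

def add_model_features(df):
    """Hypothetical reconstruction; the scaling formula is an assumption, not the author's confirmed code."""
    out = df.copy()
    out['playersJoined'] = out.groupby('matchId')['matchId'].transform('count')
    # Assumed normalization: weight stats up in matches with fewer than 100 players
    scale = (100 - out['playersJoined']) / 100 + 1
    out['killsNorm'] = out['kills'] * scale
    out['damageDealtNorm'] = out['damageDealt'] * scale
    # Turn the string matchType column into one indicator column per mode
    return pd.get_dummies(out, columns=['matchType'])

# Toy frame: one two-player duo match and one single-player solo match
toy = pd.DataFrame({
    'Id': ['a', 'b', 'c'],
    'groupId': ['g1', 'g1', 'g2'],
    'matchId': ['m1', 'm1', 'm2'],
    'matchType': ['duo', 'duo', 'solo'],
    'kills': [1, 0, 4],
    'damageDealt': [100.0, 0.0, 350.0],
})
toy_feats = add_model_features(toy)
print(list(toy_feats.columns))
```

Any remaining string columns such as `Id` would still need to be dropped before fitting, since a random forest in scikit-learn expects numeric inputs.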
sample = 500000
df_sample = train.sample(sample)
df_sample.drop(columns = ['groupId','matchId'],inplace=True)
df = df_sample.drop(columns=['winPlacePerc']) # train on all remaining features
y = df_sample['winPlacePerc']
X_train,X_valid,y_train,y_valid = train_test_split(df,y,random_state=1)
from sklearn.metrics import mean_absolute_error
def print_score(m): # helper that prints the MAE on the training and validation sets
    res= ['mae train',mean_absolute_error(y_train,m.predict(X_train)),
          'mae val',mean_absolute_error(y_valid,m.predict(X_valid))]
    print (res)
m1 = RandomForestRegressor(n_estimators=50,n_jobs=-1) # random forest regression
m1.fit(X_train,y_train)
print_score(m1)
Output:
['mae train', 0.021753100670779884, 'mae val', 0.05808169224443009]
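The metric printed above is the mean absolute error, i.e. the average of |prediction - truth|. A tiny worked example with made-up numbers:

```python
import numpy as np

# Made-up true placements and predictions
y_true = np.array([0.10, 0.50, 0.90, 1.00])
y_pred = np.array([0.20, 0.45, 0.80, 0.95])

# Absolute errors are 0.10, 0.05, 0.10, 0.05, so their mean is 0.075
mae = np.mean(np.abs(y_pred - y_true))
print(round(mae, 4))
```

Read this way, a validation MAE of about 0.058 means the predicted winPlacePerc is off by roughly 5.8 percentile points on average.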
Print the feature importances
m1.feature_importances_
Output:
array([1.36334423e-03, 5.00491287e-03, 3.27381859e-03, 2.73774423e-03,
4.01880235e-04, 2.65406415e-03, 1.83529247e-01, 2.30345314e-03,
3.05262682e-03, 3.04566378e-03, 5.31248376e-03, 8.20709758e-03,
4.93005021e-03, 1.16927895e-02, 4.30886007e-03, 7.62836835e-04,
1.90152978e-03, 5.07407459e-05, 6.79025405e-04, 2.38236519e-04,
7.90305348e-05, 6.75351041e-01, 3.80576497e-03, 2.53653092e-03,
1.81526392e-02, 8.46489284e-03, 3.67818064e-03, 1.17175902e-02,
2.72743734e-02, 0.00000000e+00, 6.63075119e-04, 1.82935049e-05,
2.22132397e-07, 2.39740250e-04, 3.89417917e-04, 5.27390712e-06,
1.52535453e-05, 1.83897216e-07, 3.16213699e-05, 2.66032425e-06,
1.64355877e-05, 1.26763356e-05, 1.60575055e-04, 1.15460128e-04,
2.46646558e-04, 1.06304530e-03, 5.08969456e-04])
Show them as a table, which is easier to read
def rf_feat_importance(m,df):
    return pd.DataFrame({'cols':df.columns,'imp':m.feature_importances_}).sort_values('imp',ascending=False)
rf_feat_importance(m1,df)
Output:
cols imp
21 walkDistance 6.753510e-01
6 killPlace 1.835292e-01
28 totalDistance 2.727437e-02
24 playersJoined 1.815264e-02
27 matchDurationNorm 1.171759e-02
13 numGroups 1.169279e-02
25 killsNorm 8.464893e-03
11 matchDuration 8.207098e-03
10 longestKill 5.312484e-03
1 boosts 5.004913e-03
12 maxPlace 4.930050e-03
14 rankPoints 4.308860e-03
22 weaponsAcquired 3.805765e-03
26 damageDealtNorm 3.678181e-03
2 damageDealt 3.273819e-03
8 kills 3.052627e-03
9 killStreaks 3.045664e-03
3 DBNOs 2.737744e-03
5 heals 2.654064e-03
23 winPoints 2.536531e-03
7 killPoints 2.303453e-03
16 rideDistance 1.901530e-03
0 assists 1.363344e-03
45 matchType_squad 1.063045e-03
15 revives 7.628368e-04
18 swimDistance 6.790254e-04
30 headshot_rate 6.630751e-04
46 matchType_squad-fpp 5.089695e-04
4 headshotKills 4.018802e-04
34 matchType_duo-fpp 3.894179e-04
44 matchType_solo-fpp 2.466466e-04
33 matchType_duo 2.397402e-04
19 teamKills 2.382365e-04
42 matchType_normal-squad-fpp 1.605751e-04
43 matchType_solo 1.154601e-04
20 vehicleDestroys 7.903053e-05
17 roadKills 5.074075e-05
38 matchType_normal-duo-fpp 3.162137e-05
31 matchType_crashfpp 1.829350e-05
40 matchType_normal-solo-fpp 1.643559e-05
36 matchType_flaretpp 1.525355e-05
41 matchType_normal-squad 1.267634e-05
35 matchType_flarefpp 5.273907e-06
39 matchType_normal-solo 2.660324e-06
32 matchType_crashtpp 2.221324e-07
37 matchType_normal-duo 1.838972e-07
29 killWithoutMoving 0.000000e+00
As a chart:
rf_feat_importance(m1,df)[:10].plot('cols','imp',figsize=(14,6),kind='barh')
plt.show()
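Model 2
Keep the features whose imp is above 0.02
fi=rf_feat_importance(m1,df)
to_keep = fi[fi.imp>0.02].cols
# This cutoff was not a good choice: a threshold lower than 0.02 would keep more features. I did not rerun it; readers who need it can try.
to_keep
Output: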
21 walkDistance
6 killPlace
28 totalDistance
Name: cols, dtype: object
Train the model
X_train,X_valid = X_train[to_keep],X_valid[to_keep]
m2 = RandomForestRegressor(n_estimators=50,n_jobs=-1)
m2.fit(X_train,y_train)
print_score(m2)
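Output: because the features were poorly chosen, the MAE is larger than for model 1, i.e. the error is higher.
(That said, my teacher noted that a model trained only on high-importance features will not necessarily beat one trained on all features, i.e. model 1.)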
['mae train', 0.0384157839295845, 'mae val', 0.09033972682087821]
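To see how many features different cutoffs would keep before retraining, the importance table can be filtered at several thresholds. A sketch using a few rounded values from the importance table of model 1; `select_features` and the candidate thresholds are hypothetical:

```python
import pandas as pd

def select_features(fi, threshold):
    """Return the names of features whose importance exceeds `threshold`."""
    return fi.loc[fi['imp'] > threshold, 'cols'].tolist()

# A handful of rows from the importance table, rounded to four decimals
fi_toy = pd.DataFrame({
    'cols': ['walkDistance', 'killPlace', 'totalDistance', 'playersJoined', 'numGroups', 'boosts'],
    'imp':  [0.6754, 0.1835, 0.0273, 0.0182, 0.0117, 0.0050],
})

# Number of features kept at each candidate cutoff
kept = {t: select_features(fi_toy, t) for t in (0.02, 0.01, 0.001)}
for t, cols in kept.items():
    print(t, len(cols), cols)
```

Each candidate list could then be passed to a fresh `RandomForestRegressor` and scored with `print_score` to compare against model 1; which cutoff works best is an empirical question that this sketch does not answer.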