PUBG: Predicting the Win Placement Percentile

import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from pdpbox import pdp
# Sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from IPython.display import display
from sklearn import metrics

1. Data description

  • DBNOs - Number of enemy players knocked.
  • assists - Number of enemy players this player damaged that were killed by teammates.
  • boosts - Number of boost items used.
  • damageDealt - Total damage dealt. Note: Self inflicted damage is subtracted.
  • headshotKills - Number of enemy players killed with headshots.
  • heals - Number of healing items used.
  • Id - Player’s Id
  • killPlace - Ranking in match of number of enemy players killed.
  • killPoints - Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.) If there is a value other than -1 in rankPoints, then any 0 in killPoints should be treated as a “None”.
  • killStreaks - Max number of enemy players killed in a short amount of time.
  • kills - Number of enemy players killed.
  • longestKill - Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.
  • matchDuration - Duration of match in seconds.
  • matchId - ID to identify match. There are no matches that are in both the training and testing set.
  • matchType - String identifying the game mode that the data comes from. The standard modes are “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, and “squad-fpp”; other modes are from events or custom matches.
  • rankPoints - Elo-like ranking of player. This ranking is inconsistent and is being deprecated in the API’s next version, so use with caution. Value of -1 takes place of “None”.
  • revives - Number of times this player revived teammates.
  • rideDistance - Total distance traveled in vehicles measured in meters.
  • roadKills - Number of kills while in a vehicle.
  • swimDistance - Total distance traveled by swimming measured in meters.
  • teamKills - Number of times this player killed a teammate.
  • vehicleDestroys - Number of vehicles destroyed.
  • walkDistance - Total distance traveled on foot measured in meters.
  • weaponsAcquired - Number of weapons picked up.
  • winPoints - Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a “None”.
  • groupId - ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.
  • numGroups - Number of groups we have data for in the match.
  • maxPlace - Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.
  • winPlacePerc - The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.
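The placeholder conventions described for rankPoints, killPoints, and winPoints can be made explicit before modeling. The sketch below is one possible way to do that on a made-up three-row frame. The helper name `mask_sentinels` is hypothetical, and the rest of this walkthrough keeps the raw values.

```python
import pandas as pd

def mask_sentinels(df):
    """Return a copy in which the documented placeholder values become NaN."""
    out = df.copy()
    has_rank = out['rankPoints'] != -1
    # With a real rankPoints value, a 0 in killPoints / winPoints stands for "None".
    for col in ['killPoints', 'winPoints']:
        out[col] = out[col].mask(has_rank & (out[col] == 0))
    # rankPoints itself uses -1 in place of "None".
    out['rankPoints'] = out['rankPoints'].where(has_rank)
    return out

# Invented rows, for illustration only.
toy = pd.DataFrame({
    'rankPoints': [-1, 1500, 1480],
    'killPoints': [1200, 0, 0],
    'winPoints':  [1466, 0, 0],
})
masked = mask_sentinels(toy)
print(masked)
```

Whether masking helps depends on the model: some estimators reject NaN inputs, so this is a choice to evaluate rather than a required step.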
train = pd.read_csv('train_V2.csv')
test = pd.read_csv('test_V2.csv')
train.head()
   Id              groupId         matchId         assists  boosts  damageDealt  DBNOs  headshotKills  heals  killPlace  ...  revives  rideDistance  roadKills  swimDistance  teamKills  vehicleDestroys  walkDistance  weaponsAcquired  winPoints  winPlacePerc
0  7f96b2f878858a  4d4b580de459be  a10357fd1a4a91  0        0       0.00         0      0              0      60         ...  0        0.0000        0          0.00          0          0                244.80        1                1466       0.4444
1  eef90569b9d03c  684d5656442f9e  aeb375fc57110c  0        0       91.47        0      0              0      57         ...  0        0.0045        0          11.04         0          0                1434.00       5                0          0.6400
2  1eaf90ac73de72  6a4a42c3245a74  110163d8bb94ae  1        0       68.00        0      0              0      47         ...  0        0.0000        0          0.00          0          0                161.80        2                0          0.7755
3  4616d365dd2853  a930a9c79cd721  f1f1f4ef412d7e  0        0       32.90        0      0              0      75         ...  0        0.0000        0          0.00          0          0                202.70        3                0          0.1667
4  315c96c26c9aac  de04010b3458dd  6dc8ff871e21e6  0        0       100.00       0      0              0      45         ...  0        0.0000        0          0.00          0          0                49.75         2                0          0.1875

5 rows × 29 columns

2. Basic data cleaning

2.1 Removing rows with missing values
train[train['winPlacePerc'].isnull()]
         Id              groupId         matchId         assists  boosts  damageDealt  DBNOs  headshotKills  heals  killPlace  ...  revives  rideDistance  roadKills  swimDistance  teamKills  vehicleDestroys  walkDistance  weaponsAcquired  winPoints  winPlacePerc
2744604  f70c74418bb064  12dfbede33f92b  224a123c53e008  0        0       0.0          0      0              0      1          ...  0        0.0           0          0.0           0          0                0.0           0                0          NaN

1 rows × 29 columns

train.drop(2744604,inplace=True)
train[train['winPlacePerc'].isnull()]
(empty result)

0 rows × 29 columns
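Dropping by a hard-coded index label works only for this exact file. A minimal alternative sketch, on made-up data, removes every row whose target is missing without naming the index:

```python
import numpy as np
import pandas as pd

# Invented rows; the middle one has no target value.
toy = pd.DataFrame({
    'kills': [0, 2, 1],
    'winPlacePerc': [0.44, np.nan, 0.78],
})
# Keep only the rows that have a target value.
cleaned = toy.dropna(subset=['winPlacePerc'])
print(cleaned)
```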

2.2 Normalizing for the number of players in each match
train['playersJoined'] = train.groupby('matchId')['matchId'].transform('count')
plt.figure(figsize=(15,10))
# sns.countplot draws a bar chart of how often each value of a categorical variable occurs.
# The usual call is sns.countplot(x='variable', data=data), where x names the categorical
# variable and data is the DataFrame (or similar structure) that holds it.
sns.countplot(x=train.loc[train['playersJoined']>=75,'playersJoined'])
plt.show()

[Figure: count plot of playersJoined for matches with at least 75 players]

train.head()
   Id              groupId         matchId         assists  boosts  damageDealt  DBNOs  headshotKills  heals  killPlace  ...  rideDistance  roadKills  swimDistance  teamKills  vehicleDestroys  walkDistance  weaponsAcquired  winPoints  winPlacePerc  playersJoined
0  7f96b2f878858a  4d4b580de459be  a10357fd1a4a91  0        0       0.00         0      0              0      60         ...  0.0000        0          0.00          0          0                244.80        1                1466       0.4444        96
1  eef90569b9d03c  684d5656442f9e  aeb375fc57110c  0        0       91.47        0      0              0      57         ...  0.0045        0          11.04         0          0                1434.00       5                0          0.6400        91
2  1eaf90ac73de72  6a4a42c3245a74  110163d8bb94ae  1        0       68.00        0      0              0      47         ...  0.0000        0          0.00          0          0                161.80        2                0          0.7755        98
3  4616d365dd2853  a930a9c79cd721  f1f1f4ef412d7e  0        0       32.90        0      0              0      75         ...  0.0000        0          0.00          0          0                202.70        3                0          0.1667        91
4  315c96c26c9aac  de04010b3458dd  6dc8ff871e21e6  0        0       100.00       0      0              0      45         ...  0.0000        0          0.00          0          0                49.75         2                0          0.1875        97

5 rows × 30 columns

train['killsNorm'] = train['kills']*((100-train['playersJoined'])/100+1)
train['damageDealtNorm'] = train['damageDealt']*((100-train['playersJoined'])/100+1)
train['matchDurationNorm'] = train['matchDuration']*((100-train['playersJoined'])/100+1)
to_show = ['Id', 'kills','killsNorm','damageDealt', 'damageDealtNorm', 'matchDuration', 'matchDurationNorm']
train[to_show][:11]
    Id              kills  killsNorm  damageDealt  damageDealtNorm  matchDuration  matchDurationNorm
0   7f96b2f878858a  0      0.00       0.000        0.00000          1306           1358.24
1   eef90569b9d03c  0      0.00       91.470       99.70230         1777           1936.93
2   1eaf90ac73de72  0      0.00       68.000       69.36000         1318           1344.36
3   4616d365dd2853  0      0.00       32.900       35.86100         1436           1565.24
4   315c96c26c9aac  1      1.03       100.000      103.00000        1424           1466.72
5   ff79c12f326506  1      1.05       100.000      105.00000        1395           1464.75
6   95959be0e21ca3  0      0.00       0.000        0.00000          1316           1355.48
7   311b84c6ff4390  0      0.00       8.538        8.87952          1967           2045.68
8   1a68204ccf9891  0      0.00       51.600       53.14800         1375           1416.25
9   e5bb5a43587253  0      0.00       37.270       38.38810         1930           1987.90
10  2b574d43972813  0      0.00       28.380       28.66380         1811           1829.11
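The normalization multiplies each statistic by (100 - playersJoined)/100 + 1, so a match with fewer players gets a larger factor and a full 100-player match is left unchanged. The helper `norm_factor` below is a hypothetical name used only to spell out that arithmetic:

```python
def norm_factor(players_joined):
    """Scaling factor applied to kills, damageDealt and matchDuration."""
    return (100 - players_joined) / 100 + 1

# Smaller matches are scaled up; 100 players gives a factor of 1.
for n in (100, 96, 91):
    print(n, round(norm_factor(n), 2))

# Player with index 1: 91 players joined and damageDealt of 91.47.
damage_norm = 91.47 * norm_factor(91)
print(round(damage_norm, 4))
```

Note that the factor is linear in the number of missing players, so a 50-player match scales every statistic by 1.5. That is a modeling choice of this walkthrough rather than a property of the game.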

3. Removing cheaters

3.1 Cheat type 1: kills without moving
train['totalDistance'] = train['rideDistance'] + train['walkDistance'] + train['swimDistance']
train['killWithoutMoving'] = ((train['kills']>0)&(train['totalDistance']==0))
#train[train['killWithoutMoving']==True].shape
train.drop(train[train['killWithoutMoving']==True].index,inplace=True)
3.2 Cheat type 2: too many kills from a vehicle
train[train['roadKills']>10].shape
(4, 35)
train.drop(train[train['roadKills']>10].index,inplace=True)
3.3 Cheat type 3: implausibly high kill counts
plt.figure(figsize=(15,8))
sns.countplot(data=train,x='kills').set_title('Kills')
plt.show()

[Figure: count plot of kills per player]

train[train['kills']>30].shape
train.drop(train[train['kills']>30].index,inplace=True)
3.4 Cheat type 4: a perfect headshot rate combined with many kills
# headshotKills - Number of enemy players killed with headshots.
# kills - Number of enemy players killed.
train['headshot_rate'] = train['headshotKills']/train['kills']
# Players with 0 kills give 0/0 = NaN; treat their headshot rate as 0.
train['headshot_rate'] = train['headshot_rate'].fillna(0)
plt.figure(figsize=(15,10))
# sns.distplot is deprecated in recent seaborn releases; histplot is its replacement.
sns.histplot(train['headshot_rate'],bins=10)
plt.show()

[Figure: distribution of headshot_rate]

train[train['headshot_rate']==1].shape
(253959, 36)
train[  (train['headshot_rate']==1)  & (train['kills']==1)  ].shape
(218433, 36)
train.drop(train[  (train['headshot_rate']==1)  & (train['kills']>=5)  ].index,inplace=True)
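The four rules in this section can also be collected into a single boolean mask, which avoids repeated in-place drops. This is only a sketch on invented rows: `cheater_mask` is a hypothetical helper, and the thresholds are the ones used in this section (more than 10 road kills, more than 30 kills, a headshot rate of 1 with at least 5 kills).

```python
import pandas as pd

def cheater_mask(df):
    """True for rows that match any of the four rules from this section."""
    total_distance = df['rideDistance'] + df['walkDistance'] + df['swimDistance']
    headshot_rate = (df['headshotKills'] / df['kills']).fillna(0)
    return (
        ((df['kills'] > 0) & (total_distance == 0))      # 3.1 kills without moving
        | (df['roadKills'] > 10)                         # 3.2 too many road kills
        | (df['kills'] > 30)                             # 3.3 implausible kill count
        | ((headshot_rate == 1) & (df['kills'] >= 5))    # 3.4 all headshots, many kills
    )

# Invented rows, one per case.
toy = pd.DataFrame({
    'kills':         [0, 3, 12, 35, 6, 1],
    'headshotKills': [0, 1, 0, 10, 6, 1],
    'roadKills':     [0, 0, 11, 0, 0, 0],
    'rideDistance':  [0.0, 0.0, 900.0, 0.0, 0.0, 0.0],
    'walkDistance':  [50.0, 0.0, 10.0, 2000.0, 1500.0, 300.0],
    'swimDistance':  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
})
flagged = cheater_mask(toy)
clean = toy[~flagged]
print(flagged.tolist())
```

The last toy row has a headshot rate of 1 but only one kill, so it is kept, which mirrors the observation that most players with a perfect rate have a single kill.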

4. Using categorical dtypes to reduce memory usage

train['matchId'] = train['matchId'].astype('category')
train['groupId'] = train['groupId'].astype('category')
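The saving comes from storing each distinct string once and replacing the column with small integer codes. A rough sketch with made-up IDs (the real matchId and groupId values are hash-like strings) shows one way to measure the effect:

```python
import pandas as pd

# 1,000 hypothetical match IDs, each repeated 100 times (one row per player).
ids = [f"match_{i:04d}" for i in range(1000)] * 100
as_object = pd.Series(ids, dtype='object')
as_category = as_object.astype('category')

# deep=True also counts the bytes of the Python strings themselves.
bytes_object = as_object.memory_usage(deep=True)
bytes_category = as_category.memory_usage(deep=True)
print(bytes_object, bytes_category)
```

The benefit shrinks as the number of distinct values approaches the number of rows, which is worth checking for groupId in particular.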

5. Kills versus win placement across game modes

train.drop(columns=['Id'],inplace=True)
5.1 Solos, duos, and squads
solos=train[train['numGroups']>50]
duos=train[(train['numGroups']>25)&(train['numGroups']<=50)]
squads=train[train['numGroups']<=25]
len(solos)/len(train)
0.15947449945499398
len(duos)/len(train)
0.7412968331156684
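The same numGroups thresholds can be expressed as one labeled column with pd.cut, which makes all three shares visible at once. This is a sketch on toy values; the column name `teamMode` is an assumption. As in the code above, the mode is inferred from numGroups rather than read from matchType, so matches with an unusual number of groups may land in the wrong bucket.

```python
import numpy as np
import pandas as pd

# Invented group counts covering each boundary.
toy = pd.DataFrame({'numGroups': [96, 51, 50, 48, 26, 25, 24, 2]})
# Bins are right-closed: (0, 25] -> squads, (25, 50] -> duos, (50, inf) -> solos.
toy['teamMode'] = pd.cut(
    toy['numGroups'],
    bins=[0, 25, 50, np.inf],
    labels=['squads', 'duos', 'solos'],
)
shares = toy['teamMode'].value_counts(normalize=True)
print(shares)
```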
f,ax =plt.subplots(figsize=(20,10))
sns.pointplot(x='kills',y='winPlacePerc',data=solos,color='black',alpha=0.8)
sns.pointplot(x='kills',y='winPlacePerc',data=duos,color='red',alpha=0.8)
sns.pointplot(x='kills',y='winPlacePerc',data=squads,color='blue',alpha=0.8)
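# Label each curve in the color it is drawn with.
plt.text(25,0.5,'Solos',color='black')
plt.text(25,0.45,'Duos',color='red')
plt.text(25,0.4,'Squads',color='blue')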
plt.grid()
plt.show()

[Figure: winPlacePerc against kills for solos (black), duos (red), and squads (blue)]

6. Correlation heatmap

k = 5
f,ax =plt.subplots(figsize=(12,12))
temp_train=train.drop(columns=['groupId','matchId','matchType']) 

cols = temp_train.corr().nlargest(k,'winPlacePerc')['winPlacePerc'].index
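# temp_train.corr().nlargest(k,'winPlacePerc')['winPlacePerc'] returns the k (here 5) variables
# with the largest correlation to 'winPlacePerc' (including 'winPlacePerc' itself),
# together with their correlation coefficients, in a form like:
# a 0.09
# b 0.01
# c 0.98
# The returned Series is named 'winPlacePerc'; its index holds the variable names.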
cm = np.corrcoef(temp_train[cols].values.T)
sns.heatmap(cm,annot=True,linewidths=0.5,fmt='.1f',ax=ax,yticklabels=cols.values,xticklabels=cols.values)
plt.show()

[Figure: correlation heatmap of the 5 variables most correlated with winPlacePerc]
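To see what the nlargest call selects, here is a small sketch on synthetic data. The column names are invented for illustration; only `target` plays the role of winPlacePerc.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
target = rng.random(n)
toy = pd.DataFrame({
    'target': target,
    'strong': target + rng.normal(0, 0.05, n),   # closely follows the target
    'weak':   target + rng.normal(0, 1.0, n),    # mostly noise
    'noise':  rng.random(n),                     # unrelated to the target
})
# Rows of the correlation matrix with the largest values in the 'target' column.
top = toy.corr().nlargest(2, 'target')['target']
print(top)
```

Because nlargest ranks by the signed value, a strongly negative correlation would not be picked up this way; ranking by the absolute value would be needed for that.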

7. Modeling

sample = 500000
df_sample = train.sample(sample)
df_sample.drop(columns = ['groupId','matchId','matchType'],inplace=True)
df = df_sample.drop(columns=['winPlacePerc'])
y = df_sample['winPlacePerc']
7.1 Splitting the training data into training and validation parts
X_train,X_valid,y_train,y_valid = train_test_split(df,y,random_state=1)
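Because no test_size is passed, train_test_split falls back to its default and holds out 25% of the rows for validation. A tiny sketch with placeholder data makes the resulting sizes visible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(200).reshape(100, 2)   # 100 placeholder rows, 2 features
y_demo = np.arange(100)

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=1)
print(len(X_tr), len(X_va))
```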
7.2 Training and evaluating the model
from sklearn.metrics import mean_absolute_error
def print_score(m):
    # Mean absolute error on the training split and on the validation split
    res= ['mae train',mean_absolute_error(y_train,m.predict(X_train)),
         'mae val',mean_absolute_error(y_valid,m.predict(X_valid))]
    print (res)
m1 = RandomForestRegressor(n_estimators=50,n_jobs=-1)
m1.fit(X_train,y_train)
print_score(m1)
['mae train', 0.02192784867733333, 'mae val', 0.05833728120000001]
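The score printed above is the mean absolute error, the average of |prediction - truth|. On a 0-to-1 target, a validation MAE of about 0.058 means predictions are off by roughly 5.8 percentile points on average, and the gap to the training MAE of about 0.022 suggests the forest fits its training rows more closely than unseen ones. A minimal sketch with made-up numbers checks the hand computation against scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Invented targets and predictions, for illustration only.
y_true = np.array([0.44, 0.64, 0.78, 0.17])
y_pred = np.array([0.40, 0.70, 0.75, 0.25])

mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)
print(mae_manual, mae_sklearn)
```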
m1.feature_importances_
array([1.39260048e-03, 5.35238158e-03, 3.30978196e-03, 2.28083086e-03,
       4.01083836e-04, 2.68734056e-03, 1.81916824e-01, 2.36602622e-03,
       3.11364071e-03, 2.75269723e-03, 6.13427821e-03, 9.33855372e-03,
       5.53881534e-03, 1.20290463e-02, 4.37883164e-03, 8.08006977e-04,
       1.91166905e-03, 4.32997862e-05, 7.26137113e-04, 2.31430733e-04,
       8.58912251e-05, 6.76056200e-01, 3.81913691e-03, 2.60518373e-03,
       1.84332720e-02, 8.68569187e-03, 3.87405597e-03, 1.14654252e-02,
       2.75947316e-02, 0.00000000e+00, 6.67135806e-04])
def rf_feat_importance(m,df):
    return pd.DataFrame({'cols':df.columns,'imp':m.feature_importances_}).sort_values('imp',ascending=False)
rf_feat_importance(m1,df)
    cols               imp
21  walkDistance       0.676056
6   killPlace          0.181917
28  totalDistance      0.027595
24  playersJoined      0.018433
13  numGroups          0.012029
27  matchDurationNorm  0.011465
11  matchDuration      0.009339
25  killsNorm          0.008686
10  longestKill        0.006134
12  maxPlace           0.005539
1   boosts             0.005352
14  rankPoints         0.004379
26  damageDealtNorm    0.003874
22  weaponsAcquired    0.003819
2   damageDealt        0.003310
8   kills              0.003114
9   killStreaks        0.002753
5   heals              0.002687
23  winPoints          0.002605
7   killPoints         0.002366
3   DBNOs              0.002281
16  rideDistance       0.001912
0   assists            0.001393
15  revives            0.000808
18  swimDistance       0.000726
30  headshot_rate      0.000667
4   headshotKills      0.000401
19  teamKills          0.000231
20  vehicleDestroys    0.000086
17  roadKills          0.000043
29  killWithoutMoving  0.000000
rf_feat_importance(m1,df)[:10].plot('cols','imp',figsize=(14,6),kind='barh')
plt.show()

[Figure: horizontal bar chart of the 10 most important features]

fi=rf_feat_importance(m1,df)
to_keep = fi[fi.imp>0.01].cols
to_keep
21         walkDistance
6             killPlace
28        totalDistance
24        playersJoined
13            numGroups
27    matchDurationNorm
Name: cols, dtype: object
7.3 Retraining on the selected important features
X_train,X_valid = X_train[to_keep],X_valid[to_keep]
m2 = RandomForestRegressor(n_estimators=50,n_jobs=-1)
m2.fit(X_train,y_train)
print_score(m2)
['mae train', 0.02192277282133332, 'mae val', 0.05834604831999999]

8. Predicting on the test set

temp_test=test.copy()
temp_test.head()
   Id              groupId         matchId         assists  boosts  damageDealt  DBNOs  headshotKills  heals  killPlace  ...  rankPoints  revives  rideDistance  roadKills  swimDistance  teamKills  vehicleDestroys  walkDistance  weaponsAcquired  winPoints
0  9329eb41e215eb  676b23c24e70d6  45b576ab7daa7f  0        0       51.46        0      0              0      73         ...  1500        0        0.0           0          0.0           0          0                588.0         1                0
1  639bd0dcd7bda8  430933124148dd  42a9a0b906c928  0        4       179.10       0      0              2      11         ...  1503        2        4669.0        0          0.0           0          0                2017.0        6                0
2  63d5c8ef8dfe91  0b45f5db20ba99  87e7e4477a048e  1        0       23.40        0      0              4      49         ...  1565        0        0.0           0          0.0           0          0                787.8         4                0
3  cf5b81422591d1  b7497dbdc77f4a  1b9a94f1af67f1  0        0       65.52        0      0              0      54         ...  1465        0        0.0           0          0.0           0          0                1812.0        3                0
4  ee6a295187ba21  6604ce20a1d230  40754a93016066  0        4       330.20       1      2              1      7          ...  1480        1        0.0           0          0.0           0          0                2963.0        4                0

5 rows × 28 columns

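Only three of the six selected features (walkDistance, killPlace, numGroups) already exist in temp_test; the other three must be computed first. In each line below, the first number is the feature's index in the training features and the last is its column position in temp_test, left blank where the column does not exist yet.
21 walkDistance 25
6 killPlace 9
28 totalDistance
24 playersJoined
13 numGroups 17
27 matchDurationNorm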

temp_test['totalDistance']=temp_test['rideDistance'] + temp_test['walkDistance'] + temp_test['swimDistance']
temp_test['playersJoined'] = temp_test.groupby('matchId')['matchId'].transform('count')
temp_test['matchDurationNorm'] = temp_test['matchDuration']*((100-temp_test['playersJoined'])/100+1)

Once the computed columns are added, every selected feature has a column position in temp_test:
21 walkDistance 25
6 killPlace 9
28 totalDistance 28
24 playersJoined 29
13 numGroups 17
27 matchDurationNorm 30
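Since the same derived columns must exist in both train and test, one option is to compute them in a single function and call it on each frame, so the two cannot drift apart. The sketch below assumes a hypothetical helper named `add_selected_features` and runs it on an invented two-match frame.

```python
import pandas as pd

def add_selected_features(df):
    """Add the derived columns needed by the selected-feature model."""
    out = df.copy()
    out['totalDistance'] = out['rideDistance'] + out['walkDistance'] + out['swimDistance']
    # Number of rows (players) that share each matchId.
    out['playersJoined'] = out.groupby('matchId')['matchId'].transform('count')
    out['matchDurationNorm'] = out['matchDuration'] * ((100 - out['playersJoined']) / 100 + 1)
    return out

# Invented rows: three players in match m1, one in match m2.
toy = pd.DataFrame({
    'matchId':       ['m1', 'm1', 'm1', 'm2'],
    'rideDistance':  [0.0, 4669.0, 0.0, 0.0],
    'walkDistance':  [588.0, 2017.0, 787.8, 1812.0],
    'swimDistance':  [0.0, 0.0, 0.0, 0.0],
    'matchDuration': [1884, 1884, 1884, 1834],
})
featured = add_selected_features(toy)
print(featured[['totalDistance', 'playersJoined', 'matchDurationNorm']])
```

One caveat of this approach: playersJoined counts only the rows present in the frame, so it reflects the players available in that file rather than the true lobby size.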

# Train a model that uses only the selected features
X=X_train[['walkDistance','killPlace','totalDistance','playersJoined','numGroups','matchDurationNorm']]
features_model = RandomForestRegressor(n_estimators=50,n_jobs=-1)
features_model.fit(X,y_train)

temp_test['winPlacePerc']=features_model.predict(temp_test[['walkDistance','killPlace','totalDistance','playersJoined','numGroups','matchDurationNorm']])
temp_test.head()
   Id              groupId         matchId         assists  boosts  damageDealt  DBNOs  headshotKills  heals  killPlace  ...  swimDistance  teamKills  vehicleDestroys  walkDistance  weaponsAcquired  winPoints  totalDistance  playersJoined  matchDurationNorm  winPlacePerc
0  9329eb41e215eb  676b23c24e70d6  45b576ab7daa7f  0        0       51.46        0      0              0      73         ...  0.0           0          0                588.0         1                0          588.0          92             2034.72            0.183950
1  639bd0dcd7bda8  430933124148dd  42a9a0b906c928  0        4       179.10       0      0              2      11         ...  0.0           0          0                2017.0        6                0          6686.0         96             1883.44            0.858322
2  63d5c8ef8dfe91  0b45f5db20ba99  87e7e4477a048e  1        0       23.40        0      0              4      49         ...  0.0           0          0                787.8         4                0          787.8          94             1900.58            0.718864
3  cf5b81422591d1  b7497dbdc77f4a  1b9a94f1af67f1  0        0       65.52        0      0              0      54         ...  0.0           0          0                1812.0        3                0          1812.0         89             2035.74            0.520244
4  ee6a295187ba21  6604ce20a1d230  40754a93016066  0        4       330.20       1      2              1      7          ...  0.0           0          0                2963.0        4                0          2963.0         95             1392.30            0.881686

5 rows × 32 columns
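A natural last step is to export the predictions. The sketch below assumes the output needs just the Id and winPlacePerc columns. The clipping step and the frame `preds` are assumptions for illustration and are not part of the original notebook; the rows are invented stand-ins for temp_test.

```python
import pandas as pd

# Stand-in for temp_test after prediction (values are illustrative only).
preds = pd.DataFrame({
    'Id': ['a1', 'b2', 'c3'],
    'winPlacePerc': [0.18, 0.86, 1.02],
})
# winPlacePerc is defined on [0, 1], so keep predictions inside that range.
preds['winPlacePerc'] = preds['winPlacePerc'].clip(0, 1)

submission = preds[['Id', 'winPlacePerc']]
# With no path, to_csv returns the CSV text; pass a file path to write to disk instead.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

For a random forest the clip is harmless, since each prediction is an average of training targets and already lies within the range of the training data; it matters only if the model is later swapped for one that can extrapolate.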
