Competition Introduction
Real-time competitive games are a hot topic in AI research. Their complexity, partial observability, and dynamically shifting battle state make them difficult to study. We can predict the win probability at the champion-select stage, or model the outcome from live in-match data. So, while a League of Legends game is in progress, can we estimate our own chance of winning?
Competition Task
The competition data consists of real-time match records from League of Legends players (e.g. kill count, physical damage dealt). Contestants are expected to mine patterns from the dataset and predict whether each player wins the current match.
The competition data is organised as follows:
- Training set: 180,000 records
- Test set: 20,000 records
# load the training data; pandas reads the zipped CSV directly
import pandas as pd
import numpy as np
train = pd.read_csv('train.csv.zip')
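A quick shape check (a sketch; per the field list below, the raw file should have 32 columns: id, win, and 30 in-game statistics):
print(train.shape)  # expect (180000, 32)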
Each row of the dataset is one player's record for a single game, with the following fields:
- id: player record id
- win: whether the player won (the label variable)
- kills: number of kills
- deaths: number of deaths
- assists: number of assists
- largestkillingspree: largest killing spree (a game term: killing three or more enemy champions in a row without dying in between)
- largestmultikill: largest multikill (a game term: multiple kills within a short time)
- longesttimespentliving: longest time spent alive
- doublekills: number of double kills
- triplekills: number of triple kills
- quadrakills: number of quadra kills
- pentakills: number of penta kills
- totdmgdealt: total damage dealt
- magicdmgdealt: magic damage dealt
- physicaldmgdealt: physical damage dealt
- truedmgdealt: true damage dealt
- largestcrit: largest critical strike
- totdmgtochamp: total damage dealt to enemy champions
- magicdmgtochamp: magic damage dealt to enemy champions
- physdmgtochamp: physical damage dealt to enemy champions
- truedmgtochamp: true damage dealt to enemy champions
- totheal: total healing done
- totunitshealed: total number of units healed
- dmgtoturrets: damage dealt to turrets
- timecc: time spent crowd-controlling enemies
- totdmgtaken: total damage taken
- magicdmgtaken: magic damage taken
- physdmgtaken: physical damage taken
- truedmgtaken: true damage taken
- wardsplaced: number of wards placed
- wardskilled: number of wards destroyed
- firstblood: whether the player drew first blood
In the test set, the label column win is empty; contestants must predict it.
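To get a first feel for the data, a few of the listed fields can be summarised directly (a sketch using the train frame loaded above):
# per-column count/mean/min/max for a handful of combat statistics
train[['win', 'kills', 'deaths', 'assists', 'totdmgtochamp']].describe()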
Evaluation Rules
- Data description
Contestants must submit win/loss predictions for the test set; the submission format is:
win
0
1
1
0
- Evaluation metric
Submissions are scored by accuracy; the higher the value, the better. Reference evaluation code:
from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
accuracy_score(y_true, y_pred)
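The snippet above returns 0.5: only the predictions at positions 0 and 3 match, and accuracy is just the fraction of matches. The same number by hand:
import numpy as np
# 2 of 4 positions agree, so accuracy = 2/4 = 0.5
print(np.mean(np.array([0, 2, 1, 3]) == np.array([0, 1, 2, 3])))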
1) Loading the data
#!pip install numpy==1.19
#!pip install -U scikit-learn numpy
import sklearn
import pandas as pd
import paddle
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
train_df_raw = pd.read_csv('data/data137276/train.csv.zip')
test_df_raw = pd.read_csv('data/data137276/test.csv.zip')
# drop the record id and the timecc column (neither is used as a feature here)
train_df = train_df_raw.drop(['id', 'timecc'], axis=1)
test_df = test_df_raw.drop(['id', 'timecc'], axis=1)
train_df_raw
train_df
  | win | kills | deaths | assists | largestkillingspree | largestmultikill | longesttimespentliving | doublekills | triplekills | quadrakills | ... | totheal | totunitshealed | dmgtoturrets | totdmgtaken | magicdmgtaken | physdmgtaken | truedmgtaken | wardsplaced | wardskilled | firstblood
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0 | 1 | 5 | 2 | 0 | 1 | 569 | 0 | 0 | 0 | ... | 849 | 2 | 0 | 7819 | 2178 | 5239 | 401 | 4 | 1 | 0 |
1 | 0 | 5 | 8 | 7 | 3 | 1 | 880 | 0 | 0 | 0 | ... | 642 | 4 | 303 | 24637 | 5607 | 17635 | 1394 | 10 | 0 | 0 |
2 | 1 | 1 | 6 | 16 | 0 | 1 | 593 | 0 | 0 | 0 | ... | 2326 | 3 | 329 | 18749 | 3651 | 14834 | 263 | 7 | 1 | 0 |
3 | 0 | 1 | 2 | 0 | 0 | 1 | 381 | 0 | 0 | 0 | ... | 1555 | 1 | 0 | 12134 | 1739 | 10318 | 76 | 8 | 1 | 0 |
4 | 0 | 4 | 11 | 25 | 0 | 1 | 455 | 0 | 0 | 0 | ... | 6630 | 8 | 0 | 27891 | 14068 | 12749 | 1073 | 34 | 2 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
179995 | 1 | 1 | 6 | 12 | 0 | 1 | 362 | 0 | 0 | 0 | ... | 3559 | 3 | 5751 | 14786 | 2374 | 12309 | 102 | 12 | 1 | 0 |
179996 | 1 | 7 | 3 | 4 | 5 | 1 | 574 | 0 | 0 | 0 | ... | 2529 | 2 | 8907 | 11019 | 3933 | 6533 | 552 | 7 | 2 | 0 |
179997 | 1 | 9 | 0 | 9 | 9 | 1 | 0 | 0 | 0 | 0 | ... | 11494 | 4 | 6627 | 14279 | 3661 | 10617 | 0 | 7 | 2 | 1 |
179998 | 1 | 14 | 1 | 5 | 10 | 2 | 980 | 3 | 0 | 0 | ... | 6555 | 1 | 1943 | 19165 | 4818 | 14110 | 236 | 6 | 0 | 0 |
179999 | 1 | 4 | 4 | 2 | 2 | 1 | 559 | 0 | 0 | 0 | ... | 608 | 1 | 1590 | 10992 | 7681 | 3065 | 246 | 7 | 1 | 0 |
180000 rows × 30 columns
# inspect the label
train_df['win']
# inspect the columns and dtypes
train_df.columns
train_df.info()
2) Exploratory data analysis (EDA)
2.1 Missing values and outliers
# isnull() returns a boolean DataFrame of the same shape as train_df
print(type(train_df.isnull()))
train_df.isnull()
# number of missing values per column
train_df.isnull().sum()
# fraction of missing values per column
train_df.isnull().mean(axis=0)
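If any column did report missing values, a median fill would be one simple remedy (a hypothetical step, not part of the original flow):
# hypothetical: replace missing entries with each column's median
train_df = train_df.fillna(train_df.median())
test_df = test_df.fillna(test_df.median())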
# label distribution as a bar chart
train_df['win'].value_counts().plot(kind='bar')
# distributions of kills and deaths (distplot is deprecated in newer
# seaborn; histplot(..., kde=True) is the modern equivalent)
sns.distplot(train_df['kills'])
sns.distplot(train_df['deaths'])
# kills split by the win label
sns.boxplot(y='kills', x='win', data=train_df)
# kills against deaths
plt.scatter(train_df['kills'], train_df['deaths'])
plt.xlabel('kills')
plt.ylabel('deaths')
# scale every feature to [0, 1]; use the training-set maximum for both
# splits so train and test end up on the same scale
for col in train_df.columns[1:]:
    col_max = train_df[col].max()
    train_df[col] /= col_max
    test_df[col] /= col_max
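As an alternative to the loop above, sklearn's MinMaxScaler fitted on the training features only achieves the same leak-free scaling and also handles nonzero minima (a sketch):
from sklearn.preprocessing import MinMaxScaler

feature_cols = train_df.columns[1:]          # every column except win
scaler = MinMaxScaler().fit(train_df[feature_cols])
train_df[feature_cols] = scaler.transform(train_df[feature_cols])
test_df[feature_cols] = scaler.transform(test_df[feature_cols])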
3) Dataset split
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold,cross_validate
# separate the features from the label
x = train_df.drop(['win'], axis=1)
y = train_df.win
x
  | kills | deaths | assists | largestkillingspree | largestmultikill | longesttimespentliving | doublekills | triplekills | quadrakills | pentakills | ... | totheal | totunitshealed | dmgtoturrets | totdmgtaken | magicdmgtaken | physdmgtaken | truedmgtaken | wardsplaced | wardskilled | firstblood
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 5 | 2 | 0 | 1 | 569 | 0 | 0 | 0 | 0 | ... | 849 | 2 | 0 | 7819 | 2178 | 5239 | 401 | 4 | 1 | 0 |
1 | 5 | 8 | 7 | 3 | 1 | 880 | 0 | 0 | 0 | 0 | ... | 642 | 4 | 303 | 24637 | 5607 | 17635 | 1394 | 10 | 0 | 0 |
2 | 1 | 6 | 16 | 0 | 1 | 593 | 0 | 0 | 0 | 0 | ... | 2326 | 3 | 329 | 18749 | 3651 | 14834 | 263 | 7 | 1 | 0 |
3 | 1 | 2 | 0 | 0 | 1 | 381 | 0 | 0 | 0 | 0 | ... | 1555 | 1 | 0 | 12134 | 1739 | 10318 | 76 | 8 | 1 | 0 |
4 | 4 | 11 | 25 | 0 | 1 | 455 | 0 | 0 | 0 | 0 | ... | 6630 | 8 | 0 | 27891 | 14068 | 12749 | 1073 | 34 | 2 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
179995 | 1 | 6 | 12 | 0 | 1 | 362 | 0 | 0 | 0 | 0 | ... | 3559 | 3 | 5751 | 14786 | 2374 | 12309 | 102 | 12 | 1 | 0 |
179996 | 7 | 3 | 4 | 5 | 1 | 574 | 0 | 0 | 0 | 0 | ... | 2529 | 2 | 8907 | 11019 | 3933 | 6533 | 552 | 7 | 2 | 0 |
179997 | 9 | 0 | 9 | 9 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 11494 | 4 | 6627 | 14279 | 3661 | 10617 | 0 | 7 | 2 | 1 |
179998 | 14 | 1 | 5 | 10 | 2 | 980 | 3 | 0 | 0 | 0 | ... | 6555 | 1 | 1943 | 19165 | 4818 | 14110 | 236 | 6 | 0 | 0 |
179999 | 4 | 4 | 2 | 2 | 1 | 559 | 0 | 0 | 0 | 0 | ... | 608 | 1 | 1590 | 10992 | 7681 | 3065 | 246 | 7 | 1 | 0 |
180000 rows × 29 columns
y
0 0
1 0
2 1
3 0
4 0
..
179995 1
179996 1
179997 1
179998 1
179999 1
Name: win, Length: 180000, dtype: int64
print('Feature matrix shape: {}'.format(x.shape))
print('Label shape: {}'.format(y.shape))
print('Label classes: {}'.format(np.unique(y)))
print('Test-set feature shape: {}'.format(test_df.shape))
Feature matrix shape: (180000, 29)
Label shape: (180000,)
Label classes: [0 1]
Test-set feature shape: (20000, 29)
# train/validation split; the held-out part gives a second check on top of CV
Xtrain, Xtest, Ytrain, Ytest = train_test_split(x, y, test_size=0.2, random_state=1412)
# 'validation' below refers to this hold-out split, not the competition test set
print('Training feature shape: {}'.format(Xtrain.shape))
print('Training label shape: {}'.format(Ytrain.shape))
print('Validation feature shape: {}'.format(Xtest.shape))
print('Validation label shape: {}'.format(Ytest.shape))
Training feature shape: (144000, 29)
Training label shape: (144000,)
Validation feature shape: (36000, 29)
Validation label shape: (36000,)
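One optional refinement, not used in the original run: passing stratify=y keeps the 0/1 ratio identical in both splits.
Xtrain, Xtest, Ytrain, Ytest = train_test_split(x, y, test_size=0.2, random_state=1412, stratify=y)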
def individual_estimators(estimators):
    """Cross-validate each (name, model) pair and score it on the hold-out split."""
    train_score = []
    cv_mean = []
    test_score = []
    for estimator in estimators:
        cv = KFold(n_splits=5, shuffle=True, random_state=1412)
        results = cross_validate(estimator[1], Xtrain, Ytrain,
                                 cv=cv,
                                 scoring="accuracy",
                                 n_jobs=8,
                                 return_train_score=True,
                                 verbose=False)
        test = estimator[1].fit(Xtrain, Ytrain).score(Xtest, Ytest)
        train_score.append(results["train_score"].mean())
        cv_mean.append(results["test_score"].mean())
        test_score.append(test)
    for i in range(len(estimators)):
        print("-------------------------------------------")
        print(estimators[i],
              "\n train_score_mean:{}".format(train_score[i]),
              "\n cv_mean:{}".format(cv_mean[i]),
              "\n test_score:{}".format(test_score[i]),
              "\n")
def fusion_estimators(clf):
    """Cross-validate the fused model and score it on the hold-out split."""
    cv = KFold(n_splits=5, shuffle=True, random_state=1412)
    results = cross_validate(clf, Xtrain, Ytrain,
                             cv=cv,
                             scoring="accuracy",
                             n_jobs=-1,
                             return_train_score=True,
                             verbose=False)
    test = clf.fit(Xtrain, Ytrain).score(Xtest, Ytest)
    print("++++++++++++++++++++++++++++++++++++++++++++++")
    print("\n train_score_mean:{}".format(results["train_score"].mean()),
          "\n cv_mean:{}".format(results["test_score"].mean()),
          "\n test_score:{}".format(test))
4) Models
from sklearn.neighbors import KNeighborsClassifier as KNNC
from sklearn.tree import DecisionTreeClassifier as DTR
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.ensemble import GradientBoostingClassifier as GBC
from sklearn.linear_model import LogisticRegression as LogiR
from sklearn.ensemble import VotingClassifier
4.a Why can model fusion beat a single ensemble algorithm?
Although each individual learner is weak, each represents its own hypothesis space. Real-world data distributions are complex, multivariate stochastic systems, and a single model family often cannot approximate them well on its own. Model fusion is a blunt but effective remedy: it combines models so that several candidate distributions are considered at once. Fusion is not guaranteed to help, of course; it merely helps most of the time.
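For intuition: sklearn's soft voting averages the predict_proba matrices of the base models (optionally weighted) and predicts the argmax. A toy numpy illustration with made-up probabilities:
import numpy as np

# fictitious predict_proba outputs of three base models for two samples
p_logi = np.array([[0.9, 0.1], [0.4, 0.6]])
p_rf   = np.array([[0.7, 0.3], [0.2, 0.8]])
p_gbdt = np.array([[0.8, 0.2], [0.6, 0.4]])

p_soft = (p_logi + p_rf + p_gbdt) / 3  # average the class probabilities
print(p_soft.argmax(axis=1))           # fused prediction: [0 1]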
4.1 Base classifiers and the voting ensemble
clf1 = LogiR(max_iter=3000, random_state=1412, n_jobs=8)
clf2 = RFC(n_estimators=100, random_state=1412, n_jobs=8)
clf3 = GBC(n_estimators=100, random_state=1412)
estimators = [("Logistic Regression", clf1), ("RandomForest", clf2), ("GBDT", clf3)]
# soft voting averages predict_proba, so every base model must expose it
clf = VotingClassifier(estimators, voting="soft")
4.1.1 Evaluating each base classifier
individual_estimators(estimators)
4.1.2 Evaluating the fused model
fusion_estimators(clf)
# class probabilities of the voting ensemble on the competition test set
test_predict_sklearn = clf.predict_proba(test_df)
print(test_predict_sklearn.shape)
print(test_predict_sklearn)
(20000, 2)
[[0.87535621 0.12464379]
[0.77675525 0.22324475]
[0.16242339 0.83757661]
...
[0.94152587 0.05847413]
[0.90214731 0.09785269]
[0.10380786 0.89619214]]
4.2 Neural network model
class MyModel(paddle.nn.Layer):
    # a small fully connected network: 29 features in, 2 class probabilities out
    def __init__(self):
        # initialise the parent Layer
        super(MyModel, self).__init__()
        self.fc1 = paddle.nn.Linear(in_features=29, out_features=30)
        self.hidden1 = paddle.nn.BatchNorm1D(30)
        self.relu1 = paddle.nn.ReLU()
        self.fc2 = paddle.nn.Linear(in_features=30, out_features=8)
        self.relu2 = paddle.nn.LeakyReLU()
        self.fc3 = paddle.nn.Linear(in_features=8, out_features=6)
        self.relu3 = paddle.nn.Sigmoid()
        self.fc4 = paddle.nn.Linear(in_features=6, out_features=4)
        self.fc5 = paddle.nn.Linear(in_features=4, out_features=2)
        self.softmax = paddle.nn.Softmax()

    # forward pass
    def forward(self, inputs):
        x = self.fc1(inputs)
        # x = self.relu1(x)
        x = self.hidden1(x)
        x = self.fc2(x)
        x = self.relu2(x)
        x = self.fc3(x)
        x = self.relu3(x)
        x = self.fc4(x)
        x = self.fc5(x)
        # x = self.fc6(x)
        x = self.softmax(x)
        return x
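A quick shape check before training (a sanity sketch, not in the original): five random 29-feature rows should come out as five rows of 2-class probabilities that each sum to 1.
_probe = paddle.to_tensor(np.random.rand(5, 29).astype(np.float32))
_out = MyModel()(_probe)
print(_out.shape)                # [5, 2]
print(_out.numpy().sum(axis=1))  # each row sums to ~1.0 thanks to the softmax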
model = MyModel()
model.train()
opt = paddle.optimizer.SGD(learning_rate=0.01, parameters=model.parameters())
EPOCH_NUM = 10    # number of epochs (outer loop)
BATCH_SIZE = 100  # mini-batch size
# hold out the last 1,000 rows as a small validation set
training_data = train_df.iloc[:-1000, ].values.astype(np.float32)
val_data = train_df.iloc[-1000:, ].values.astype(np.float32)
# outer loop over epochs
for epoch_id in range(EPOCH_NUM):
    # reshuffle the training data at the start of every epoch
    np.random.shuffle(training_data)
    # split into mini-batches of BATCH_SIZE rows each
    mini_batches = [training_data[k:k+BATCH_SIZE] for k in range(0, len(training_data), BATCH_SIZE)]
    # inner loop over mini-batches
    for iter_id, mini_batch in enumerate(mini_batches):
        x_data = mini_batch[:, 1:]   # features of the current batch
        y_data = mini_batch[:, :1]   # labels of the current batch (win is column 0)
        # convert the numpy features to a Paddle tensor
        features = paddle.to_tensor(x_data)
        # one-hot encode the binary label for the soft-label loss
        label = np.zeros([len(y_data), 2], dtype=np.float32)
        label[np.arange(len(y_data)), y_data[:, 0].astype(int)] = 1
        label = paddle.to_tensor(label)
        # forward pass
        predicts = model(features)
        # the model already ends in a softmax, so disable the extra softmax
        # the loss would otherwise apply
        loss = paddle.nn.functional.cross_entropy(predicts, label, soft_label=True, use_softmax=False, reduction='none')
        avg_loss = paddle.mean(loss)
        # backward pass: compute gradients for every layer
        avg_loss.backward()
        # take one optimiser step at the configured learning rate
        opt.step()
        # clear gradients for the next iteration
        opt.clear_grad()
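val_data is carved off above but never used afterwards in the original flow; a minimal sketch of checking accuracy on it (assuming the model trained above):
# accuracy on the 1,000 held-out rows (win is column 0, features follow)
model.eval()
val_x = paddle.to_tensor(val_data[:, 1:])
val_y = val_data[:, 0].astype(int)
val_pred = model(val_x).numpy().argmax(axis=1)
print('validation accuracy:', (val_pred == val_y).mean())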
# switch to inference mode and score the competition test set
model.eval()
test_data = paddle.to_tensor(test_df.values.astype(np.float32))
test_predict_dl = model(test_data)
test_predict_dl
Tensor(shape=[20000, 2], dtype=float32, place=CPUPlace, stop_gradient=False,
[[0.31092143, 0.68907863],
[0.89762008, 0.10237990],
[0.00382155, 0.99617851],
...,
[0.97896796, 0.02103199],
[0.98377025, 0.01622973],
[0.00828540, 0.99171454]])
test_predict_sklearn
array([[0.87535621, 0.12464379],
[0.77675525, 0.22324475],
[0.16242339, 0.83757661],
...,
[0.94152587, 0.05847413],
[0.90214731, 0.09785269],
[0.10380786, 0.89619214]])
# blend the two probability matrices: 1/4 neural network, 3/4 sklearn ensemble
test_predict_ = (1/4) * test_predict_dl.numpy() + (3/4) * test_predict_sklearn
# pick the class with the larger blended probability
test_predict = test_predict_.argmax(axis=1)
test_predict
array([0, 0, 1, ..., 0, 0, 1])
# write the submission file: a single win column, no index
pd.DataFrame({'win': test_predict}).to_csv('submission.csv', index=False)
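A quick look at the file before zipping (an optional sanity check, not in the original):
sub = pd.read_csv('submission.csv')
print(sub.shape)                  # expect (20000, 1)
print(sub['win'].value_counts())  # only the labels 0 and 1 should appear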
!zip submission.zip submission.csv
adding: submission.csv (deflated 94%)