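Project Overview
The 2022 FIFA World Cup in Qatar (FIFA World Cup Qatar 2022) was the 22nd FIFA World Cup. It ran from November 20, 2022 local time (November 21 Beijing time) to December 18 at 8 stadiums in 5 cities across Qatar, with the schedule shortened from the original 32 days to 29. Qatar is the third Asian country to host the World Cup, after Japan and South Korea, the first Islamic country to host it, and the first host since World War II that had never previously qualified for the World Cup finals. The tournament reportedly cost as much as US$229 billion, earning it the label "the most expensive World Cup in history".
The project uses two historical datasets, the FIFA World Rankings (1992-2022) and international football results from 1872 to 2022, to predict World Cup matches.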
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import joblib
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score, confusion_matrix
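Data Processing
results.csv contains the following columns:
- date - date of the match
- home_team - name of the home team
- away_team - name of the away team
- home_score - full-time home team score, including extra time but excluding penalty shootouts
- away_score - full-time away team score, including extra time but excluding penalty shootouts
- tournament - name of the tournament
- city - name of the city, town, or administrative unit where the match was played
- country - name of the country where the match was played
- neutral - TRUE/FALSE column indicating whether the match was played at a neutral venue
# Load the data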
results = pd.read_csv('/home/aistudio/work/results.csv', parse_dates=['date'])
results.head()
| | date | home_team | away_team | home_score | away_score | tournament | city | country | neutral |
---|---|---|---|---|---|---|---|---|---|
0 | 1872-11-30 | Scotland | England | 0 | 0 | Friendly | Glasgow | Scotland | False |
1 | 1873-03-08 | England | Scotland | 4 | 2 | Friendly | London | England | False |
2 | 1874-03-07 | Scotland | England | 2 | 1 | Friendly | Glasgow | Scotland | False |
3 | 1875-03-06 | England | Scotland | 2 | 2 | Friendly | London | England | False |
4 | 1876-03-04 | Scotland | England | 3 | 0 | Friendly | Glasgow | Scotland | False |
# Inspect the data
results.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44289 entries, 0 to 44288
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 44289 non-null datetime64[ns]
1 home_team 44289 non-null object
2 away_team 44289 non-null object
3 home_score 44289 non-null int64
4 away_score 44289 non-null int64
5 tournament 44289 non-null object
6 city 44289 non-null object
7 country 44289 non-null object
8 neutral 44289 non-null bool
dtypes: bool(1), datetime64[ns](1), int64(2), object(5)
memory usage: 2.7+ MB
# Check for missing values
results.isna().sum()
date 0
home_team 0
away_team 0
home_score 0
away_score 0
tournament 0
city 0
country 0
neutral 0
dtype: int64
# Keep World Cup finals and World Cup qualification matches played on or after 1992-12-31
fifa_data = results[(results['date'] >= '1992-12-31') & ((results['tournament'] == 'FIFA World Cup') | (results['tournament'] == 'FIFA World Cup qualification'))]
fifa_data = fifa_data.drop(['tournament'], axis=1)
fifa_data = fifa_data.reset_index(drop=True)
fifa_data.head()
| | date | home_team | away_team | home_score | away_score | city | country | neutral |
---|---|---|---|---|---|---|---|---|
0 | 1993-01-10 | Angola | Zimbabwe | 1 | 1 | Luanda | Angola | False |
1 | 1993-01-10 | DR Congo | Cameroon | 1 | 2 | Kinshasa | Zaïre | False |
2 | 1993-01-16 | South Africa | Nigeria | 0 | 0 | Johannesburg | South Africa | False |
3 | 1993-01-16 | Tanzania | Zambia | 1 | 3 | Mwanza | Tanzania | False |
4 | 1993-01-17 | Benin | Tunisia | 0 | 5 | Cotonou | Benin | False |
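The filter above combines a date cutoff with two tournament names. As a minimal sketch on made-up rows (the `sample` frame and the `wanted` list are illustrative assumptions, not part of the project data), the same selection can be written with `Series.isin`, which stays readable if more tournament names are added:

```python
import pandas as pd

# Hypothetical sample rows, only to illustrate the filter
sample = pd.DataFrame({
    "date": pd.to_datetime(["1990-06-08", "1994-06-17", "1996-06-08", "2001-03-24"]),
    "tournament": ["FIFA World Cup", "FIFA World Cup", "UEFA Euro",
                   "FIFA World Cup qualification"],
})

wanted = ["FIFA World Cup", "FIFA World Cup qualification"]

# Date cutoff AND membership in the list of wanted tournaments
mask = (sample["date"] >= "1992-12-31") & sample["tournament"].isin(wanted)
filtered = sample[mask].reset_index(drop=True)
print(filtered)
```

Only the 1994 finals match and the 2001 qualifier survive: the 1990 match is too early and the 1996 match belongs to a different tournament.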
# Inspect the data
fifa_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6359 entries, 0 to 6358
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 6359 non-null datetime64[ns]
1 home_team 6359 non-null object
2 away_team 6359 non-null object
3 home_score 6359 non-null int64
4 away_score 6359 non-null int64
5 city 6359 non-null object
6 country 6359 non-null object
7 neutral 6359 non-null bool
dtypes: bool(1), datetime64[ns](1), int64(2), object(4)
memory usage: 354.1+ KB
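fifa_ranking.csv contains the following columns:
- rank - current rank of the country/region
- country_full - full country name
- country_abrv - country abbreviation
- total_points - current total points
- previous_points - total points at the previous rating
- rank_change - change in rank since the previous release
- confederation - FIFA confederation the team belongs to
- rank_date - date the ranking was calculated
# Load the data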
fifa_ranking = pd.read_csv('/home/aistudio/work/fifa_ranking.csv', parse_dates=['rank_date'])
fifa_ranking.head()
| | rank | country_full | country_abrv | total_points | previous_points | rank_change | confederation | rank_date |
---|---|---|---|---|---|---|---|---|
0 | 74 | Madagascar | MAD | 18.0 | 0.0 | 0 | CAF | 1992-12-31 |
1 | 52 | Qatar | QAT | 27.0 | 0.0 | 0 | AFC | 1992-12-31 |
2 | 51 | Senegal | SEN | 27.0 | 0.0 | 0 | CAF | 1992-12-31 |
3 | 50 | El Salvador | SLV | 28.0 | 0.0 | 0 | CONCACAF | 1992-12-31 |
4 | 49 | Korea Republic | KOR | 28.0 | 0.0 | 0 | AFC | 1992-12-31 |
# Harmonize country names: some full names differ between fifa_ranking and results
name_map = {
    'Brunei Darussalam': 'Brunei',
    'Cape Verde Islands': 'Cape Verde',
    'Chinese Taipei': 'Taiwan',  # the original lowercase pattern matched nothing
    'Congo DR': 'DR Congo',
    "Côte d'Ivoire": 'Ivory Coast',
    'Curacao': 'Curaçao',
    'IR Iran': 'Iran',
    'Kyrgyz Republic': 'Kyrgyzstan',
    'Korea DPR': 'North Korea',
    'Korea Republic': 'South Korea',
    'St Kitts and Nevis': 'Saint Kitts and Nevis',
    'St Lucia': 'Saint Lucia',
    'St Vincent and the Grenadines': 'Saint Vincent and the Grenadines',
    'Sao Tome e Principe': 'São Tomé and Príncipe',
    'US Virgin Islands': 'United States Virgin Islands',
    'USA': 'United States',
}
# Apply the replacements in order as literal (non-regex) substring replacements
for old_name, new_name in name_map.items():
    fifa_ranking['country_full'] = fifa_ranking['country_full'].str.replace(old_name, new_name, regex=False)
# Index fifa_ranking by date, group by country, resample to daily frequency with forward fill, then reset the index
fifa_ranking = fifa_ranking.set_index(['rank_date']).groupby(['country_full'], group_keys=False).resample('D').fillna(method='ffill').reset_index()
fifa_ranking.head()
| | rank_date | rank | country_full | country_abrv | total_points | previous_points | rank_change | confederation |
---|---|---|---|---|---|---|---|---|
0 | 2003-01-15 | 204 | Afghanistan | AFG | 7.0 | 0.0 | 0 | AFC |
1 | 2003-01-16 | 204 | Afghanistan | AFG | 7.0 | 0.0 | 0 | AFC |
2 | 2003-01-17 | 204 | Afghanistan | AFG | 7.0 | 0.0 | 0 | AFC |
3 | 2003-01-18 | 204 | Afghanistan | AFG | 7.0 | 0.0 | 0 | AFC |
4 | 2003-01-19 | 204 | Afghanistan | AFG | 7.0 | 0.0 | 0 | AFC |
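Rankings are published only on certain dates, so the step above expands each country's series to one row per day and carries the latest published values forward. This lets any match date be joined to the ranking in force that day. A minimal sketch of the idea on made-up data follows. The country names and ranks are invented, and it calls `.ffill()` on a single selected column instead of the notebook's `fillna(method='ffill')` on the whole frame, so it illustrates the technique without reproducing the cell:

```python
import pandas as pd

# Hypothetical rankings published on irregular dates
rankings = pd.DataFrame({
    "rank_date": pd.to_datetime(["2020-01-01", "2020-01-04", "2020-01-01", "2020-01-03"]),
    "country_full": ["Aland", "Aland", "Bland", "Bland"],
    "rank": [10, 8, 20, 25],
})

# Per country: resample to daily frequency and carry the last known rank forward
daily = (
    rankings.set_index("rank_date")
    .groupby("country_full")["rank"]
    .resample("D")
    .ffill()
    .reset_index()
)
print(daily)
```

In this toy data, "Aland" keeps rank 10 on January 2 and 3 until the new value published on January 4 takes over.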
# Merge the data: join the filtered results with fifa_ranking, first for the home team and then for the away team
fifa_data = fifa_data.merge(fifa_ranking[['country_full', 'total_points', 'previous_points', 'rank', 'rank_change', 'rank_date']], left_on=['date', 'home_team'], right_on=['rank_date', 'country_full']).drop(['rank_date', 'country_full'], axis=1)
fifa_data = fifa_data.merge(fifa_ranking[['country_full', 'total_points', 'previous_points', 'rank', 'rank_change', 'rank_date']], left_on=['date', 'away_team'], right_on=['rank_date', 'country_full'], suffixes=('_home', '_away')).drop(['rank_date', 'country_full'], axis=1)
fifa_data.head()
| | date | home_team | away_team | home_score | away_score | city | country | neutral | total_points_home | previous_points_home | rank_home | rank_change_home | total_points_away | previous_points_away | rank_away | rank_change_away |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1993-01-10 | Angola | Zimbabwe | 1 | 1 | Luanda | Angola | False | 10.0 | 0.0 | 102 | 0 | 27.0 | 0.0 | 54 | 0 |
1 | 1993-01-16 | South Africa | Nigeria | 0 | 0 | Johannesburg | South Africa | False | 5.0 | 0.0 | 124 | 0 | 50.0 | 0.0 | 13 | 0 |
2 | 1993-01-16 | Tanzania | Zambia | 1 | 3 | Mwanza | Tanzania | False | 15.0 | 0.0 | 80 | 0 | 38.0 | 0.0 | 32 | 0 |
3 | 1993-01-17 | Benin | Tunisia | 0 | 5 | Cotonou | Benin | False | 4.0 | 0.0 | 127 | 0 | 35.0 | 0.0 | 38 | 0 |
4 | 1993-01-17 | Botswana | Ivory Coast | 0 | 0 | Gaborone | Botswana | False | 2.0 | 0.0 | 139 | 0 | 41.0 | 0.0 | 27 | 0 |
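Both merges above are inner joins, so a match silently disappears whenever a team name and date have no ranking row. The `info()` outputs show 6359 matches before the merges and 6052 after. One hedged way to see which names fail to match is a left merge with `indicator=True`. The frames below are tiny invented stand-ins, not the project data:

```python
import pandas as pd

# Hypothetical stand-ins for the match table and the daily ranking table
matches = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-01"]),
    "home_team": ["Iran", "Taiwan"],
})
ranks = pd.DataFrame({
    "rank_date": pd.to_datetime(["2020-01-01", "2020-01-01"]),
    "country_full": ["Iran", "Chinese Taipei"],
    "rank": [30, 140],
})

# A left merge keeps every match; the _merge column flags rows with no ranking
checked = matches.merge(
    ranks,
    how="left",
    left_on=["date", "home_team"],
    right_on=["rank_date", "country_full"],
    indicator=True,
)
unmatched = checked.loc[checked["_merge"] == "left_only", "home_team"].unique().tolist()
print(unmatched)
```

Names that show up in `unmatched` are candidates for the name-replacement step, unless the team simply had no ranking on that date.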
# Inspect the data
fifa_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6052 entries, 0 to 6051
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 6052 non-null datetime64[ns]
1 home_team 6052 non-null object
2 away_team 6052 non-null object
3 home_score 6052 non-null int64
4 away_score 6052 non-null int64
5 city 6052 non-null object
6 country 6052 non-null object
7 neutral 6052 non-null bool
8 total_points_home 6052 non-null float64
9 previous_points_home 6052 non-null float64
10 rank_home 6052 non-null int64
11 rank_change_home 6052 non-null int64
12 total_points_away 6052 non-null float64
13 previous_points_away 6052 non-null float64
14 rank_away 6052 non-null int64
15 rank_change_away 6052 non-null int64
dtypes: bool(1), datetime64[ns](1), float64(4), int64(6), object(4)
memory usage: 762.4+ KB
# Check for missing values
fifa_data.isna().sum()
date 0
home_team 0
away_team 0
home_score 0
away_score 0
city 0
country 0
neutral 0
total_points_home 0
previous_points_home 0
rank_home 0
rank_change_home 0
total_points_away 0
previous_points_away 0
rank_away 0
rank_change_away 0
dtype: int64
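Feature Engineering
- result - match result (0: home win, 1: away win, 2: draw)
- home_points - points awarded to the home team (3: home win, 0: away win, 1: draw)
- away_points - points awarded to the away team (3: away win, 0: home win, 1: draw)
- target - prediction target (0: home win, 1: away win or draw)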
# Feature engineering: derive result, home_points, away_points and target from the score
def get_result(home_score, away_score):
    if home_score > away_score:
        return pd.Series([0, 3, 0, 0])
    elif home_score < away_score:
        return pd.Series([1, 0, 3, 1])
    else:
        return pd.Series([2, 1, 1, 1])
results = fifa_data.apply(lambda x: get_result(x['home_score'], x['away_score']), axis=1)
fifa_data[['result', 'home_points', 'away_points', 'target']] = results
fifa_data.head()
| | date | home_team | away_team | home_score | away_score | city | country | neutral | total_points_home | previous_points_home | rank_home | rank_change_home | total_points_away | previous_points_away | rank_away | rank_change_away | result | home_points | away_points | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1993-01-10 | Angola | Zimbabwe | 1 | 1 | Luanda | Angola | False | 10.0 | 0.0 | 102 | 0 | 27.0 | 0.0 | 54 | 0 | 2 | 1 | 1 | 1 |
1 | 1993-01-16 | South Africa | Nigeria | 0 | 0 | Johannesburg | South Africa | False | 5.0 | 0.0 | 124 | 0 | 50.0 | 0.0 | 13 | 0 | 2 | 1 | 1 | 1 |
2 | 1993-01-16 | Tanzania | Zambia | 1 | 3 | Mwanza | Tanzania | False | 15.0 | 0.0 | 80 | 0 | 38.0 | 0.0 | 32 | 0 | 1 | 0 | 3 | 1 |
3 | 1993-01-17 | Benin | Tunisia | 0 | 5 | Cotonou | Benin | False | 4.0 | 0.0 | 127 | 0 | 35.0 | 0.0 | 38 | 0 | 1 | 0 | 3 | 1 |
4 | 1993-01-17 | Botswana | Ivory Coast | 0 | 0 | Gaborone | Botswana | False | 2.0 | 0.0 | 139 | 0 | 41.0 | 0.0 | 27 | 0 | 2 | 1 | 1 | 1 |
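The row-wise `apply` above calls a Python function once per match. As a sketch of an alternative, assuming the same encoding as the feature list (0 for a home win, 1 for an away win, 2 for a draw), the four columns can be built with vectorized comparisons and `numpy.select`. The `matches` frame is a made-up example:

```python
import numpy as np
import pandas as pd

# Hypothetical scores: a home win, an away win and a draw
matches = pd.DataFrame({"home_score": [2, 0, 1], "away_score": [1, 3, 1]})

home_win = matches["home_score"] > matches["away_score"]
away_win = matches["home_score"] < matches["away_score"]

# np.select picks the first matching condition and falls back to `default` (draw)
matches["result"] = np.select([home_win, away_win], [0, 1], default=2)
matches["home_points"] = np.select([home_win, away_win], [3, 0], default=1)
matches["away_points"] = np.select([home_win, away_win], [0, 3], default=1)
# target is 0 for a home win and 1 otherwise (away win or draw)
matches["target"] = (~home_win).astype(int)
print(matches)
```

This should produce the same values as `get_result`, without the per-row overhead of `apply`.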
# Label-encode the non-numeric columns
label_encoder = LabelEncoder()
labels = ['date', 'home_team', 'away_team', 'city', 'country']
for label in labels:
    fifa_data[f'{label}_encoding'] = label_encoder.fit_transform(fifa_data[label])
fifa_data.head()
| | date | home_team | away_team | home_score | away_score | city | country | neutral | total_points_home | previous_points_home | ... | rank_change_away | result | home_points | away_points | target | date_encoding | home_team_encoding | away_team_encoding | city_encoding | country_encoding |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1993-01-10 | Angola | Zimbabwe | 1 | 1 | Luanda | Angola | False | 10.0 | 0.0 | ... | 0 | 2 | 1 | 1 | 1 | 0 | 5 | 206 | 345 | 4 |
1 | 1993-01-16 | South Africa | Nigeria | 0 | 0 | Johannesburg | South Africa | False | 5.0 | 0.0 | ... | 0 | 2 | 1 | 1 | 1 | 1 | 170 | 136 | 274 | 171 |
2 | 1993-01-16 | Tanzania | Zambia | 1 | 3 | Mwanza | Tanzania | False | 15.0 | 0.0 | ... | 0 | 1 | 0 | 3 | 1 | 1 | 183 | 205 | 410 | 184 |
3 | 1993-01-17 | Benin | Tunisia | 0 | 5 | Cotonou | Benin | False | 4.0 | 0.0 | ... | 0 | 1 | 0 | 3 | 1 | 2 | 21 | 189 | 157 | 20 |
4 | 1993-01-17 | Botswana | Ivory Coast | 0 | 0 | Gaborone | Botswana | False | 2.0 | 0.0 | ... | 0 | 2 | 1 | 1 | 1 | 2 | 26 | 94 | 216 | 25 |
5 rows × 25 columns
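# Plot the correlation heatmap (numeric and boolean columns only, so the call also works on pandas versions that no longer skip text columns)
plt.figure(figsize=(16, 16))
sns.heatmap(fifa_data.select_dtypes(include=['number', 'bool']).corr(), annot=True, linewidths=0.2, square=True)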
plt.show()
[Figure: correlation heatmap of fifa_data including the encoded columns (main_files/main_20_0.png)]
# Drop the encoded columns: their correlation is too low
fifa_data = fifa_data.drop(['date_encoding', 'home_team_encoding', 'away_team_encoding', 'city_encoding', 'country_encoding'], axis=1)
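One plausible reason the encoded columns correlate so weakly is that `LabelEncoder` assigns integer codes in sorted order of the values, so for team or city names the numbers carry no football meaning. A minimal sketch with an invented list of names:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical team names, only to show how the codes are assigned
teams = ["Brazil", "Angola", "Chile", "Angola"]

encoder = LabelEncoder()
codes = encoder.fit_transform(teams)

# classes_ holds the sorted unique values; each code is an index into it
print(list(encoder.classes_))
print(codes.tolist())
```

Here "Angola" gets code 0 purely because it sorts first, so a linear correlation between such codes and the match result is not expected to be informative.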
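Feature Engineering
- rank_diff - difference in rank
- rank_change_diff - difference in rank change
- total_points_diff - difference in total points
- previous_points_diff - difference in total points at the previous rating
- home_points2rank - home team's match points / away team's rank
- away_points2rank - away team's match points / home team's rank
- points2rank_diff - difference between the two points2rank values
# Feature engineering: home-minus-away difference features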
fifa_data['rank_diff'] = fifa_data['rank_home'] - fifa_data['rank_away']
fifa_data['rank_change_diff'] = fifa_data['rank_change_home'] - fifa_data['rank_change_away']
fifa_data['total_points_diff'] = fifa_data['total_points_home'] - fifa_data['total_points_away']
fifa_data['previous_points_diff'] = fifa_data['previous_points_home'] - fifa_data['previous_points_away']
fifa_data['home_points2rank'] = fifa_data['home_points'] / fifa_data['rank_away']
fifa_data['away_points2rank'] = fifa_data['away_points'] / fifa_data['rank_home']
fifa_data['points2rank_diff'] = fifa_data['home_points2rank'] - fifa_data['away_points2rank']
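As a worked example of these formulas, the sketch below plugs in the first merged row shown earlier (Angola 1-1 Zimbabwe, with the home team ranked 102 and the away team ranked 54). The variable names mirror the columns but are plain Python scalars here:

```python
# Values copied from the first merged row: Angola 1-1 Zimbabwe (a draw)
home_points, away_points = 1, 1   # a draw gives each side 1 point
rank_home, rank_away = 102, 54

rank_diff = rank_home - rank_away
# Points earned are weighted by the opponent's rank: points against a
# better-ranked (numerically smaller) opponent yield a larger ratio
home_points2rank = home_points / rank_away
away_points2rank = away_points / rank_home
points2rank_diff = home_points2rank - away_points2rank

print(rank_diff, round(home_points2rank, 6), round(away_points2rank, 6))
```

The two ratios round to 0.018519 and 0.009804, matching the `points2rank` and `rival_points2rank` values shown for Angola in the team-level table.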
# Plot the correlation heatmap (numeric and boolean columns only)
plt.figure(figsize=(16, 16))
sns.heatmap(fifa_data.select_dtypes(include=['number', 'bool']).corr(), annot=True, linewidths=0.2, square=True)
plt.show()
[Figure: correlation heatmap of fifa_data with the difference features (main_files/main_24_0.png)]
# Split the data into a home-team view and an away-team view, so each match yields one row per team
home_team = fifa_data[['date', 'home_team', 'home_score', 'away_score', 'total_points_home', 'total_points_away', 'previous_points_home', 'previous_points_away', 'rank_home', 'rank_away', 'home_points', 'away_points', 'home_points2rank', 'away_points2rank', 'result']]
away_team = fifa_data[['date', 'away_team', 'away_score', 'home_score', 'total_points_away', 'total_points_home', 'previous_points_away', 'previous_points_home', 'rank_away', 'rank_home', 'away_points', 'home_points', 'away_points2rank', 'home_points2rank', 'result']]
home_team.columns = [h.replace('home_', '').replace('_home', '').replace('away_', 'rival_').replace('_away', '_rival') for h in home_team.columns]
away_team.columns = [a.replace('away_', '').replace('_away', '').replace('home_', 'rival_').replace('_home', '_rival') for a in away_team.columns]
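# Stack the two views: home rows first, then away rows (DataFrame.append was removed in pandas 2.0, so use pd.concat)
team_data = pd.concat([home_team, away_team])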
data_copy = team_data.copy()
team_data.head()
| | date | team | score | rival_score | total_points | total_points_rival | previous_points | previous_points_rival | rank | rank_rival | points | rival_points | points2rank | rival_points2rank | result |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1993-01-10 | Angola | 1 | 1 | 10.0 | 27.0 | 0.0 | 0.0 | 102 | 54 | 1 | 1 | 0.018519 | 0.009804 | 2 |
1 | 1993-01-16 | South Africa | 0 | 0 | 5.0 | 50.0 | 0.0 | 0.0 | 124 | 13 | 1 | 1 | 0.076923 | 0.008065 | 2 |
2 | 1993-01-16 | Tanzania | 1 | 3 | 15.0 | 38.0 | 0.0 | 0.0 | 80 | 32 | 0 | 3 | 0.000000 | 0.037500 | 1 |
3 | 1993-01-17 | Benin | 0 | 5 | 4.0 | 35.0 | 0.0 | 0.0 | 127 | 38 | 0 | 3 | 0.000000 | 0.023622 | 1 |
4 | 1993-01-17 | Botswana | 0 | 0 | 2.0 | 41.0 | 0.0 | 0.0 | 139 | 27 | 1 | 1 | 0.037037 | 0.007194 | 2 |
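The column renaming that produces this team-level table relies on chained string replacements: the team's own prefix or suffix is dropped and the opponent's becomes `rival`. The sketch below wraps that rule in a hypothetical helper, `to_team_view` (not part of the notebook), and applies it to a few of the column names:

```python
def to_team_view(columns, side, other):
    """Rename match-level columns to the perspective of `side` ('home' or 'away')."""
    renamed = []
    for col in columns:
        # Drop the team's own marker, whether it is a prefix or a suffix
        col = col.replace(f"{side}_", "").replace(f"_{side}", "")
        # Turn the opponent's marker into 'rival'
        col = col.replace(f"{other}_", "rival_").replace(f"_{other}", "_rival")
        renamed.append(col)
    return renamed


cols = ["home_team", "home_score", "away_score", "total_points_home",
        "rank_away", "away_points2rank"]
home_view = to_team_view(cols, "home", "away")
away_view = to_team_view(cols, "away", "home")
print(home_view)
print(away_view)
```

Because both views end up with identical column names, the home and away frames can be stacked into one table with one row per team per match.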
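Feature Engineering
- mean_goals - average goals scored
- mean_goals_last5 - average goals scored over the last five matches
- rival_mean_goals - average goals scored by opponents
- rival_mean_goals_last5 - average goals scored by opponents over the last five matches
- mean_rank - average rank
- mean_rank_last5 - average rank over the last five matches
- rival_mean_rank - average rank of opponents
- rival_mean_rank_last5 - average rank of opponents over the last five matches
- mean_points - average match points
- mean_points_last5 - average match points over the last five matches
- rival_mean_points - average match points of opponents
- rival_mean_points_last5 - average match points of opponents over the last five matches
- mean_points2rank - average points2rank
- mean_points2rank_last5 - average points2rank over the last five matches
- rival_mean_points2rank - average points2rank of opponents
- rival_mean_points2rank_last5 - average points2rank of opponents over the last five matches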
# Feature engineering: for every row, average each statistic over the team's earlier matches
team_values = []
for idx, row in team_data.iterrows():
    team = row['team']
    date = row['date']
    # All earlier matches of this team, most recent first
    pasts = team_data.loc[(team_data['team'] == team) & (team_data['date'] < date)].sort_values(by=['date'], ascending=False)
    last5 = pasts.head(5)
    mean_goals = pasts['score'].mean()
    mean_goals_last5 = last5['score'].mean()
    rival_mean_goals = pasts['rival_score'].mean()
    rival_mean_goals_last5 = last5['rival_score'].mean()
    mean_rank = pasts['rank'].mean()
    mean_rank_last5 = last5['rank'].mean()
    rival_mean_rank = pasts['rank_rival'].mean()
    rival_mean_rank_last5 = last5['rank_rival'].mean()
    mean_points = pasts['points'].mean()
    mean_points_last5 = last5['points'].mean()
    rival_mean_points = pasts['rival_points'].mean()
    rival_mean_points_last5 = last5['rival_points'].mean()
    mean_points2rank = pasts['points2rank'].mean()
    mean_points2rank_last5 = last5['points2rank'].mean()
    rival_mean_points2rank = pasts['rival_points2rank'].mean()
    rival_mean_points2rank_last5 = last5['rival_points2rank'].mean()
    team_values.append([mean_goals, mean_goals_last5, rival_mean_goals, rival_mean_goals_last5, mean_rank, mean_rank_last5, rival_mean_rank, rival_mean_rank_last5, mean_points, mean_points_last5, rival_mean_points, rival_mean_points_last5, mean_points2rank, mean_points2rank_last5, rival_mean_points2rank, rival_mean_points2rank_last5])
# Merge the data: attach the aggregates to team_data column-wise
team_columns = ['mean_goals', 'mean_goals_last5', 'rival_mean_goals', 'rival_mean_goals_last5', 'mean_rank', 'mean_rank_last5', 'rival_mean_rank', 'rival_mean_rank_last5', 'mean_points', 'mean_points_last5', 'rival_mean_points', 'rival_mean_points_last5', 'mean_points2rank', 'mean_points2rank_last5', 'rival_mean_points2rank', 'rival_mean_points2rank_last5']
team_value = pd.DataFrame(team_values, columns=team_columns)
team_data = pd.concat([team_data.reset_index(drop=True), team_value], axis=1, ignore_index=False)
team_data.head()
| | date | team | score | rival_score | total_points | total_points_rival | previous_points | previous_points_rival | rank | rank_rival | ... | rival_mean_rank | rival_mean_rank_last5 | mean_points | mean_points_last5 | rival_mean_points | rival_mean_points_last5 | mean_points2rank | mean_points2rank_last5 | rival_mean_points2rank | rival_mean_points2rank_last5 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1993-01-10 | Angola | 1 | 1 | 10.0 | 27.0 | 0.0 | 0.0 | 102 | 54 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 1993-01-16 | South Africa | 0 | 0 | 5.0 | 50.0 | 0.0 | 0.0 | 124 | 13 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 1993-01-16 | Tanzania | 1 | 3 | 15.0 | 38.0 | 0.0 | 0.0 | 80 | 32 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 1993-01-17 | Benin | 0 | 5 | 4.0 | 35.0 | 0.0 | 0.0 | 127 | 38 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 1993-01-17 | Botswana | 0 | 0 | 2.0 | 41.0 | 0.0 | 0.0 | 139 | 27 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 31 columns
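The loop above re-filters the whole table for every row, so its cost grows roughly quadratically with the number of rows. A possible vectorized alternative, sketched below on invented data for a single statistic, sorts by date, shifts each team's series by one match so the current match is excluded, and then takes expanding and rolling means. One assumed difference: the loop excludes all matches on the same date, while `shift(1)` excludes only the current row, so the two would differ if a team had two rows on the same day.

```python
import pandas as pd

# Hypothetical match history for two teams
games = pd.DataFrame({
    "team": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01",
                            "2020-01-15", "2020-02-15"]),
    "score": [1, 3, 2, 0, 4],
}).sort_values(["team", "date"])

grouped = games.groupby("team")["score"]

# shift(1) hides the current match; expanding() averages everything before it
games["mean_goals"] = grouped.transform(lambda s: s.shift(1).expanding().mean())
# rolling(5, min_periods=1) averages at most the five previous matches
games["mean_goals_last5"] = grouped.transform(
    lambda s: s.shift(1).rolling(5, min_periods=1).mean()
)
print(games)
```

As in the loop, each team's first match has no history, so its aggregates are NaN.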
# Inspect the data
team_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12104 entries, 0 to 12103
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 12104 non-null datetime64[ns]
1 team 12104 non-null object
2 score 12104 non-null int64
3 rival_score 12104 non-null int64
4 total_points 12104 non-null float64
5 total_points_rival 12104 non-null float64
6 previous_points 12104 non-null float64
7 previous_points_rival 12104 non-null float64
8 rank 12104 non-null int64
9 rank_rival 12104 non-null int64
10 points 12104 non-null int64
11 rival_points 12104 non-null int64
12 points2rank 12104 non-null float64
13 rival_points2rank 12104 non-null float64
14 result 12104 non-null int64
15 mean_goals 11897 non-null float64
16 mean_goals_last5 11897 non-null float64
17 rival_mean_goals 11897 non-null float64
18 rival_mean_goals_last5 11897 non-null float64
19 mean_rank 11897 non-null float64
20 mean_rank_last5 11897 non-null float64
21 rival_mean_rank 11897 non-null float64
22 rival_mean_rank_last5 11897 non-null float64
23 mean_points 11897 non-null float64
24 mean_points_last5 11897 non-null float64
25 rival_mean_points 11897 non-null float64
26 rival_mean_points_last5 11897 non-null float64
27 mean_points2rank 11897 non-null float64
28 mean_points2rank_last5 11897 non-null float64
29 rival_mean_points2rank 11897 non-null float64
30 rival_mean_points2rank_last5 11897 non-null float64
dtypes: datetime64[ns](1), float64(22), int64(7), object(1)
memory usage: 2.9+ MB
# Split the data: the first half holds the home-team rows, the second half the away-team rows
home_team_data = team_data.iloc[:int(team_data.shape[0] / 2), :]
away_team_data = team_data.iloc[int(team_data.shape[0] / 2):, :]
away_team_data.tail()
| | date | team | score | rival_score | total_points | total_points_rival | previous_points | previous_points_rival | rank | rank_rival | ... | rival_mean_rank | rival_mean_rank_last5 | mean_points | mean_points_last5 | rival_mean_points | rival_mean_points_last5 | mean_points2rank | mean_points2rank_last5 | rival_mean_points2rank | rival_mean_points2rank_last5 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
12099 | 2022-06-01 | Ukraine | 3 | 1 | 1535.08 | 1472.66 | 1535.08 | 1471.82 | 27 | 39 | ... | 62.433735 | 59.0 | 1.759036 | 1.8 | 0.891566 | 0.6 | 0.068496 | 0.093412 | 0.028319 | 0.023407 |
12100 | 2022-06-05 | Ukraine | 0 | 1 | 1535.08 | 1588.08 | 1535.08 | 1578.01 | 27 | 18 | ... | 62.154762 | 42.0 | 1.773810 | 2.2 | 0.880952 | 0.4 | 0.068596 | 0.107183 | 0.027982 | 0.015407 |
12101 | 2022-06-07 | Australia | 2 | 1 | 1462.29 | 1356.99 | 1486.86 | 1353.10 | 42 | 68 | ... | 81.504762 | 65.6 | 1.980952 | 1.0 | 0.809524 | 1.6 | 0.029632 | 0.011321 | 0.021702 | 0.044029 |
12102 | 2022-06-13 | Peru | 0 | 0 | 1562.32 | 1462.29 | 1563.45 | 1486.86 | 22 | 42 | ... | 34.661654 | 35.6 | 1.090226 | 2.0 | 1.676692 | 0.8 | 0.067419 | 0.065848 | 0.044702 | 0.036364 |
12103 | 2022-06-14 | New Zealand | 0 | 1 | 1206.07 | 1503.09 | 1161.66 | 1464.06 | 101 | 31 | ... | 109.125000 | 156.2 | 1.900000 | 3.0 | 0.925000 | 0.0 | 0.021997 | 0.019261 | 0.010179 | 0.000000 |
5 rows × 31 columns
# Keep only the 16 aggregate columns and prefix them with home_ / away_
home_team_data = home_team_data[home_team_data.columns[-16:]]
away_team_data = away_team_data[away_team_data.columns[-16:]]
home_team_data.columns = ['home_' + str(col) for col in home_team_data.columns]
away_team_data.columns = ['away_' + str(col) for col in away_team_data.columns]
away_team_data.tail()
| | away_mean_goals | away_mean_goals_last5 | away_rival_mean_goals | away_rival_mean_goals_last5 | away_mean_rank | away_mean_rank_last5 | away_rival_mean_rank | away_rival_mean_rank_last5 | away_mean_points | away_mean_points_last5 | away_rival_mean_points | away_rival_mean_points_last5 | away_mean_points2rank | away_mean_points2rank_last5 | away_rival_mean_points2rank | away_rival_mean_points2rank_last5 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
12099 | 1.493976 | 1.6 | 0.807229 | 1.0 | 39.265060 | 26.0 | 62.433735 | 59.0 | 1.759036 | 1.8 | 0.891566 | 0.6 | 0.068496 | 0.093412 | 0.028319 | 0.023407 |
12100 | 1.511905 | 1.8 | 0.809524 | 0.8 | 39.119048 | 26.4 | 62.154762 | 42.0 | 1.773810 | 2.2 | 0.880952 | 0.4 | 0.068596 | 0.107183 | 0.027982 | 0.015407 |
12101 | 2.514286 | 1.4 | 0.809524 | 1.2 | 42.885714 | 35.6 | 81.504762 | 65.6 | 1.980952 | 1.0 | 0.809524 | 1.6 | 0.029632 | 0.011321 | 0.021702 | 0.044029 |
12102 | 1.015038 | 1.2 | 1.466165 | 0.6 | 46.751880 | 22.4 | 34.661654 | 35.6 | 1.090226 | 2.0 | 1.676692 | 0.8 | 0.067419 | 0.065848 | 0.044702 | 0.036364 |
12103 | 2.200000 | 3.6 | 0.900000 | 0.2 | 101.725000 | 111.0 | 109.125000 | 156.2 | 1.900000 | 3.0 | 0.925000 | 0.0 | 0.021997 | 0.019261 | 0.010179 | 0.000000 |
# Merge the data: put the home and away aggregates side by side, then attach them to fifa_data
team_data = pd.concat([home_team_data, away_team_data.reset_index(drop=True)], axis=1, ignore_index=False)
fifa_data = pd.concat([fifa_data, team_data.reset_index(drop=True)], axis=1, ignore_index=False)
fifa_data.columns
Index(['date', 'home_team', 'away_team', 'home_score', 'away_score', 'city',
'country', 'neutral', 'total_points_home', 'previous_points_home',
'rank_home', 'rank_change_home', 'total_points_away',
'previous_points_away', 'rank_away', 'rank_change_away', 'result',
'home_points', 'away_points', 'target', 'rank_diff', 'rank_change_diff',
'total_points_diff', 'previous_points_diff', 'home_points2rank',
'away_points2rank', 'points2rank_diff', 'home_mean_goals',
'home_mean_goals_last5', 'home_rival_mean_goals',
'home_rival_mean_goals_last5', 'home_mean_rank', 'home_mean_rank_last5',
'home_rival_mean_rank', 'home_rival_mean_rank_last5',
'home_mean_points', 'home_mean_points_last5', 'home_rival_mean_points',
'home_rival_mean_points_last5', 'home_mean_points2rank',
'home_mean_points2rank_last5', 'home_rival_mean_points2rank',
'home_rival_mean_points2rank_last5', 'away_mean_goals',
'away_mean_goals_last5', 'away_rival_mean_goals',
'away_rival_mean_goals_last5', 'away_mean_rank', 'away_mean_rank_last5',
'away_rival_mean_rank', 'away_rival_mean_rank_last5',
'away_mean_points', 'away_mean_points_last5', 'away_rival_mean_points',
'away_rival_mean_points_last5', 'away_mean_points2rank',
'away_mean_points2rank_last5', 'away_rival_mean_points2rank',
'away_rival_mean_points2rank_last5'],
dtype='object')
# Select the columns to keep for modeling
fifa_data = fifa_data[['date', 'home_team', 'away_team', 'rank_home', 'rank_away', 'home_score', 'away_score', 'result', 'rank_diff', 'rank_change_diff', 'total_points_diff', 'previous_points_diff', 'points2rank_diff', 'home_mean_goals', 'home_mean_goals_last5', 'home_rival_mean_goals', 'home_rival_mean_goals_last5', 'home_mean_rank', 'home_mean_rank_last5', 'home_rival_mean_rank', 'home_rival_mean_rank_last5', 'home_mean_points', 'home_mean_points_last5', 'home_rival_mean_points', 'home_rival_mean_points_last5', 'home_mean_points2rank', 'home_mean_points2rank_last5', 'home_rival_mean_points2rank', 'home_rival_mean_points2rank_last5', 'away_mean_goals', 'away_mean_goals_last5', 'away_rival_mean_goals', 'away_rival_mean_goals_last5', 'away_mean_rank', 'away_mean_rank_last5', 'away_rival_mean_rank', 'away_rival_mean_rank_last5', 'away_mean_points', 'away_mean_points_last5', 'away_rival_mean_points', 'away_rival_mean_points_last5', 'away_mean_points2rank', 'away_mean_points2rank_last5', 'away_rival_mean_points2rank', 'away_rival_mean_points2rank_last5', 'target']]
fifa_data.head()
| | date | home_team | away_team | rank_home | rank_away | home_score | away_score | result | rank_diff | rank_change_diff | ... | away_rival_mean_rank_last5 | away_mean_points | away_mean_points_last5 | away_rival_mean_points | away_rival_mean_points_last5 | away_mean_points2rank | away_mean_points2rank_last5 | away_rival_mean_points2rank | away_rival_mean_points2rank_last5 | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1993-01-10 | Angola | Zimbabwe | 102 | 54 | 1 | 1 | 2 | 48 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 |
1 | 1993-01-16 | South Africa | Nigeria | 124 | 13 | 0 | 0 | 2 | 111 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 |
2 | 1993-01-16 | Tanzania | Zambia | 80 | 32 | 1 | 3 | 1 | 48 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 |
3 | 1993-01-17 | Benin | Tunisia | 127 | 38 | 0 | 5 | 1 | 89 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 |
4 | 1993-01-17 | Botswana | Ivory Coast | 139 | 27 | 0 | 0 | 2 | 112 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 |
5 rows × 46 columns
# Check for missing values
fifa_data.isna().sum()
date 0
home_team 0
away_team 0
rank_home 0
rank_away 0
home_score 0
away_score 0
result 0
rank_diff 0
rank_change_diff 0
total_points_diff 0
previous_points_diff 0
points2rank_diff 0
home_mean_goals 101
home_mean_goals_last5 101
home_rival_mean_goals 101
home_rival_mean_goals_last5 101
home_mean_rank 101
home_mean_rank_last5 101
home_rival_mean_rank 101
home_rival_mean_rank_last5 101
home_mean_points 101
home_mean_points_last5 101
home_rival_mean_points 101
home_rival_mean_points_last5 101
home_mean_points2rank 101
home_mean_points2rank_last5 101
home_rival_mean_points2rank 101
home_rival_mean_points2rank_last5 101
away_mean_goals 106
away_mean_goals_last5 106
away_rival_mean_goals 106
away_rival_mean_goals_last5 106
away_mean_rank 106
away_mean_rank_last5 106
away_rival_mean_rank 106
away_rival_mean_rank_last5 106
away_mean_points 106
away_mean_points_last5 106
away_rival_mean_points 106
away_rival_mean_points_last5 106
away_mean_points2rank 106
away_mean_points2rank_last5 106
away_rival_mean_points2rank 106
away_rival_mean_points2rank_last5 106
target 0
dtype: int64
# Handle missing values: drop matches where either team has no earlier match history
fifa_data = fifa_data.dropna().reset_index(drop=True)
fifa_data.isna().sum()
date 0
home_team 0
away_team 0
rank_home 0
rank_away 0
home_score 0
away_score 0
result 0
rank_diff 0
rank_change_diff 0
total_points_diff 0
previous_points_diff 0
points2rank_diff 0
home_mean_goals 0
home_mean_goals_last5 0
home_rival_mean_goals 0
home_rival_mean_goals_last5 0
home_mean_rank 0
home_mean_rank_last5 0
home_rival_mean_rank 0
home_rival_mean_rank_last5 0
home_mean_points 0
home_mean_points_last5 0
home_rival_mean_points 0
home_rival_mean_points_last5 0
home_mean_points2rank 0
home_mean_points2rank_last5 0
home_rival_mean_points2rank 0
home_rival_mean_points2rank_last5 0
away_mean_goals 0
away_mean_goals_last5 0
away_rival_mean_goals 0
away_rival_mean_goals_last5 0
away_mean_rank 0
away_mean_rank_last5 0
away_rival_mean_rank 0
away_rival_mean_rank_last5 0
away_mean_points 0
away_mean_points_last5 0
away_rival_mean_points 0
away_rival_mean_points_last5 0
away_mean_points2rank 0
away_mean_points2rank_last5 0
away_rival_mean_points2rank 0
away_rival_mean_points2rank_last5 0
target 0
dtype: int64
# Inspect the data
fifa_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5894 entries, 0 to 5893
Data columns (total 46 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 5894 non-null datetime64[ns]
1 home_team 5894 non-null object
2 away_team 5894 non-null object
3 rank_home 5894 non-null int64
4 rank_away 5894 non-null int64
5 home_score 5894 non-null int64
6 away_score 5894 non-null int64
7 result 5894 non-null int64
8 rank_diff 5894 non-null int64
9 rank_change_diff 5894 non-null int64
10 total_points_diff 5894 non-null float64
11 previous_points_diff 5894 non-null float64
12 points2rank_diff 5894 non-null float64
13 home_mean_goals 5894 non-null float64
14 home_mean_goals_last5 5894 non-null float64
15 home_rival_mean_goals 5894 non-null float64
16 home_rival_mean_goals_last5 5894 non-null float64
17 home_mean_rank 5894 non-null float64
18 home_mean_rank_last5 5894 non-null float64
19 home_rival_mean_rank 5894 non-null float64
20 home_rival_mean_rank_last5 5894 non-null float64
21 home_mean_points 5894 non-null float64
22 home_mean_points_last5 5894 non-null float64
23 home_rival_mean_points 5894 non-null float64
24 home_rival_mean_points_last5 5894 non-null float64
25 home_mean_points2rank 5894 non-null float64
26 home_mean_points2rank_last5 5894 non-null float64
27 home_rival_mean_points2rank 5894 non-null float64
28 home_rival_mean_points2rank_last5 5894 non-null float64
29 away_mean_goals 5894 non-null float64
30 away_mean_goals_last5 5894 non-null float64
31 away_rival_mean_goals 5894 non-null float64
32 away_rival_mean_goals_last5 5894 non-null float64
33 away_mean_rank 5894 non-null float64
34 away_mean_rank_last5 5894 non-null float64
35 away_rival_mean_rank 5894 non-null float64
36 away_rival_mean_rank_last5 5894 non-null float64
37 away_mean_points 5894 non-null float64
38 away_mean_points_last5 5894 non-null float64
39 away_rival_mean_points 5894 non-null float64
40 away_rival_mean_points_last5 5894 non-null float64
41 away_mean_points2rank 5894 non-null float64
42 away_mean_points2rank_last5 5894 non-null float64
43 away_rival_mean_points2rank 5894 non-null float64
44 away_rival_mean_points2rank_last5 5894 non-null float64
45 target 5894 non-null int64
dtypes: datetime64[ns](1), float64(35), int64(8), object(2)
memory usage: 2.1+ MB
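# Split the features into three groups (difference features, home aggregates, away aggregates), each with the target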
data1 = fifa_data[list(fifa_data.columns[8:13].values) + ['target']]
data2 = fifa_data[list(fifa_data.columns[13:29].values) + ['target']]
data3 = fifa_data[fifa_data.columns[29:]]
# Preview the data
data1.tail()
| | rank_diff | rank_change_diff | total_points_diff | previous_points_diff | points2rank_diff | target |
---|---|---|---|---|---|---|
5889 | 12 | -1 | -62.42 | -63.26 | -0.076923 | 1 |
5890 | -9 | -2 | 53.00 | 42.93 | 0.111111 | 0 |
5891 | 26 | -6 | -105.30 | -133.76 | -0.044118 | 1 |
5892 | 20 | 5 | -100.03 | -76.59 | 0.021645 | 1 |
5893 | -70 | -1 | 297.02 | 302.40 | 0.029703 | 0 |
# Preview the data
data2.tail()
| | home_mean_goals | home_mean_goals_last5 | home_rival_mean_goals | home_rival_mean_goals_last5 | home_mean_rank | home_mean_rank_last5 | home_rival_mean_rank | home_rival_mean_rank_last5 | home_mean_points | home_mean_points_last5 | home_rival_mean_points | home_rival_mean_points_last5 | home_mean_points2rank | home_mean_points2rank_last5 | home_rival_mean_points2rank | home_rival_mean_points2rank_last5 | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5889 | 1.310811 | 1.8 | 0.986486 | 0.4 | 45.229730 | 44.6 | 61.270270 | 81.6 | 1.675676 | 3.0 | 1.067568 | 0.0 | 0.053158 | 0.102165 | 0.026165 | 0.000000 | 1 |
5890 | 1.338028 | 2.2 | 1.380282 | 1.0 | 56.478873 | 19.2 | 60.746479 | 53.6 | 1.267606 | 2.2 | 1.478873 | 0.4 | 0.036644 | 0.238173 | 0.033925 | 0.021053 | 0 |
5891 | 1.650000 | 0.8 | 1.140000 | 0.4 | 77.240000 | 69.4 | 94.390000 | 60.4 | 1.470000 | 1.8 | 1.350000 | 1.2 | 0.015780 | 0.034188 | 0.017614 | 0.017391 | 1 |
5892 | 2.509434 | 1.6 | 0.811321 | 1.2 | 42.877358 | 37.2 | 81.377358 | 64.2 | 1.990566 | 1.4 | 0.801887 | 1.4 | 0.029768 | 0.017478 | 0.021497 | 0.038147 | 1 |
5893 | 1.557252 | 1.2 | 1.007634 | 0.2 | 43.229008 | 44.8 | 49.969466 | 37.4 | 1.725191 | 2.6 | 1.038168 | 0.2 | 0.052968 | 0.097719 | 0.030313 | 0.004082 | 0 |
# Inspect the last rows of the third group (away-team statistics)
data3.tail()
away_mean_goals | away_mean_goals_last5 | away_rival_mean_goals | away_rival_mean_goals_last5 | away_mean_rank | away_mean_rank_last5 | away_rival_mean_rank | away_rival_mean_rank_last5 | away_mean_points | away_mean_points_last5 | away_rival_mean_points | away_rival_mean_points_last5 | away_mean_points2rank | away_mean_points2rank_last5 | away_rival_mean_points2rank | away_rival_mean_points2rank_last5 | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5889 | 1.493976 | 1.6 | 0.807229 | 1.0 | 39.265060 | 26.0 | 62.433735 | 59.0 | 1.759036 | 1.8 | 0.891566 | 0.6 | 0.068496 | 0.093412 | 0.028319 | 0.023407 | 1 |
5890 | 1.511905 | 1.8 | 0.809524 | 0.8 | 39.119048 | 26.4 | 62.154762 | 42.0 | 1.773810 | 2.2 | 0.880952 | 0.4 | 0.068596 | 0.107183 | 0.027982 | 0.015407 | 0 |
5891 | 2.514286 | 1.4 | 0.809524 | 1.2 | 42.885714 | 35.6 | 81.504762 | 65.6 | 1.980952 | 1.0 | 0.809524 | 1.6 | 0.029632 | 0.011321 | 0.021702 | 0.044029 | 1 |
5892 | 1.015038 | 1.2 | 1.466165 | 0.6 | 46.751880 | 22.4 | 34.661654 | 35.6 | 1.090226 | 2.0 | 1.676692 | 0.8 | 0.067419 | 0.065848 | 0.044702 | 0.036364 | 1 |
5893 | 2.200000 | 3.6 | 0.900000 | 0.2 | 101.725000 | 111.0 | 109.125000 | 156.2 | 1.900000 | 3.0 | 0.925000 | 0.0 | 0.021997 | 0.019261 | 0.010179 | 0.000000 | 0 |
# Violin-plot data: z-score each feature column, then melt to long format.
# iloc[:, :-1] selects every column except target (data1[:-1] would instead drop the last row).
features1 = data1.iloc[:, :-1]
standard1 = (features1 - features1.mean()) / features1.std()
standard1['target'] = data1['target']
violin1 = pd.melt(standard1, id_vars='target', var_name='feature', value_name='value')
features2 = data2.iloc[:, :-1]
standard2 = (features2 - features2.mean()) / features2.std()
standard2['target'] = data2['target']
violin2 = pd.melt(standard2, id_vars='target', var_name='feature', value_name='value')
features3 = data3.iloc[:, :-1]
standard3 = (features3 - features3.mean()) / features3.std()
standard3['target'] = data3['target']
violin3 = pd.melt(standard3, id_vars='target', var_name='feature', value_name='value')
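To make the reshaping step concrete, here is a minimal sketch on a made-up three-row frame (the numbers are illustrative only and are not taken from the dataset). It z-scores the feature columns, re-attaches the label, and melts the result into the long format that `sns.violinplot` consumes.

```python
import pandas as pd

# Made-up toy data: two features and a binary target (illustrative values only)
toy = pd.DataFrame({
    'rank_diff': [12.0, -9.0, 26.0],
    'total_points_diff': [-62.0, 53.0, -105.0],
    'target': [1, 0, 1],
})

features = toy.iloc[:, :-1]                      # every column except target
zscores = (features - features.mean()) / features.std()
zscores['target'] = toy['target']                # re-attach the untouched label

# Long format: one row per (match, feature) pair
long_form = pd.melt(zscores, id_vars='target', var_name='feature', value_name='value')
print(long_form)
```

Note that in pandas a plain slice such as `toy[:-1]` selects rows, while `toy.iloc[:, :-1]` selects columns.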
# Violin plot, group 1
plt.figure(figsize=(15, 10))
sns.violinplot(x='feature', y='value', hue='target', data=violin1, split=True, inner='quart')
plt.show()
# Correlation heatmap, group 1
plt.figure(figsize=(16, 16))
sns.heatmap(standard1.corr(), annot=True, linewidths=0.2, square=True)
plt.show()
# Violin plot, group 2
plt.figure(figsize=(15, 10))
sns.violinplot(x='feature', y='value', hue='target', data=violin2, split=True, inner='quart')
plt.xticks(rotation=90)
plt.show()
# Correlation heatmap, group 2
plt.figure(figsize=(16, 16))
sns.heatmap(standard2.corr(), annot=True, linewidths=0.2, square=True)
plt.show()
# Violin plot, group 3
plt.figure(figsize=(15, 10))
sns.violinplot(x='feature', y='value', hue='target', data=violin3, split=True, inner='quart')
plt.xticks(rotation=90)
plt.show()
# Correlation heatmap, group 3
plt.figure(figsize=(16, 16))
sns.heatmap(standard3.corr(), annot=True, linewidths=0.2, square=True)
plt.show()
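Feature engineering
Each feature below is the home team's value minus the away team's value.
- mean_goals_diff - difference in average goals scored
- mean_goals_last5_diff - difference in average goals scored over the last five matches
- rival_mean_goals_diff - difference in average goals scored by opponents
- rival_mean_goals_last5_diff - difference in average goals scored by opponents over the last five matches
- mean_rank_diff - difference in average rank
- mean_rank_last5_diff - difference in average rank over the last five matches
- rival_mean_rank_diff - difference in opponents' average rank
- rival_mean_rank_last5_diff - difference in opponents' average rank over the last five matches
- mean_points_diff - difference in average points
- mean_points_last5_diff - difference in average points over the last five matches
- rival_mean_points_diff - difference in opponents' average points
- rival_mean_points_last5_diff - difference in opponents' average points over the last five matches
- mean_points2rank_diff - difference in average points2rank
- mean_points2rank_last5_diff - difference in average points2rank over the last five matches
- rival_mean_points2rank_diff - difference in opponents' average points2rank
- rival_mean_points2rank_last5_diff - difference in opponents' average points2rank over the last five matches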
# Feature engineering: home-minus-away differences
data = fifa_data.copy()
data.loc[:, 'mean_goals_diff'] = data['home_mean_goals'] - data['away_mean_goals']
data.loc[:, 'mean_goals_last5_diff'] = data['home_mean_goals_last5'] - data['away_mean_goals_last5']
data.loc[:, 'rival_mean_goals_diff'] = data['home_rival_mean_goals'] - data['away_rival_mean_goals']
data.loc[:, 'rival_mean_goals_last5_diff'] = data['home_rival_mean_goals_last5'] - data['away_rival_mean_goals_last5']
data.loc[:, 'mean_rank_diff'] = data['home_mean_rank'] - data['away_mean_rank']
data.loc[:, 'mean_rank_last5_diff'] = data['home_mean_rank_last5'] - data['away_mean_rank_last5']
data.loc[:, 'rival_mean_rank_diff'] = data['home_rival_mean_rank'] - data['away_rival_mean_rank']
data.loc[:, 'rival_mean_rank_last5_diff'] = data['home_rival_mean_rank_last5'] - data['away_rival_mean_rank_last5']
data.loc[:, 'mean_points_diff'] = data['home_mean_points'] - data['away_mean_points']
data.loc[:, 'mean_points_last5_diff'] = data['home_mean_points_last5'] - data['away_mean_points_last5']
data.loc[:, 'rival_mean_points_diff'] = data['home_rival_mean_points'] - data['away_rival_mean_points']
data.loc[:, 'rival_mean_points_last5_diff'] = data['home_rival_mean_points_last5'] - data['away_rival_mean_points_last5']
data.loc[:, 'mean_points2rank_diff'] = data['home_mean_points2rank'] - data['away_mean_points2rank']
data.loc[:, 'mean_points2rank_last5_diff'] = data['home_mean_points2rank_last5'] - data['away_mean_points2rank_last5']
data.loc[:, 'rival_mean_points2rank_diff'] = data['home_rival_mean_points2rank'] - data['away_rival_mean_points2rank']
data.loc[:, 'rival_mean_points2rank_last5_diff'] = data['home_rival_mean_points2rank_last5'] - data['away_rival_mean_points2rank_last5']
data_diff1 = data.iloc[:, -16:]
standard_diff1 = (data_diff1 - data_diff1.mean()) / data_diff1.std()
standard_diff1['target'] = data['target']
violin_diff1 = pd.melt(standard_diff1, id_vars='target', var_name='feature', value_name='value')
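The sixteen assignments above all follow one pattern, `home_<stat> - away_<stat>`. As a sketch of that pattern, assuming a hypothetical two-match frame that carries only two of the statistics, the same columns can be generated in a loop:

```python
import pandas as pd

# Hypothetical two-match frame with only two of the sixteen statistics
toy = pd.DataFrame({
    'home_mean_goals': [1.5, 2.0], 'away_mean_goals': [1.0, 2.5],
    'home_mean_rank': [10.0, 40.0], 'away_mean_rank': [30.0, 20.0],
})

stats = ['mean_goals', 'mean_rank']  # the notebook's full list has sixteen entries
for stat in stats:
    # Same naming scheme as the notebook: <stat>_diff = home_<stat> - away_<stat>
    toy[f'{stat}_diff'] = toy[f'home_{stat}'] - toy[f'away_{stat}']

print(toy[['mean_goals_diff', 'mean_rank_diff']])
```

The explicit one-line-per-feature form used in the notebook is easier to scan; the loop is less error-prone when the list of statistics grows.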
# Violin plot of the difference features
plt.figure(figsize=(15, 10))
sns.violinplot(x='feature', y='value', hue='target', data=violin_diff1, split=True, inner='quart')
plt.xticks(rotation=90)
plt.show()
# Correlation heatmap of the difference features
plt.figure(figsize=(16, 16))
sns.heatmap(standard_diff1.corr(), annot=True, linewidths=0.2, square=True)
plt.show()
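Feature engineering
Each feature below is a home-team ratio minus the corresponding away-team ratio. The denominator is always the team's overall average rank.
- mean_goals2mean_rank_diff - home average goals / home average rank - away average goals / away average rank
- rival_mean_goals2mean_rank_diff - home opponents' average goals / home average rank - away opponents' average goals / away average rank
- mean_goals2mean_rank_last5_diff - home average goals over the last five matches / home average rank - away average goals over the last five matches / away average rank
- rival_mean_goals2mean_rank_last5_diff - home opponents' average goals over the last five matches / home average rank - away opponents' average goals over the last five matches / away average rank
- mean_points2mean_rank_diff - home average points / home average rank - away average points / away average rank
- rival_mean_points2mean_rank_diff - home opponents' average points / home average rank - away opponents' average points / away average rank
- mean_points2mean_rank_last5_diff - home average points over the last five matches / home average rank - away average points over the last five matches / away average rank
- rival_mean_points2mean_rank_last5_diff - home opponents' average points over the last five matches / home average rank - away opponents' average points over the last five matches / away average rank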
# Feature engineering: differences of statistic-to-rank ratios
data.loc[:, 'mean_goals2mean_rank_diff'] = (data['home_mean_goals'] / data['home_mean_rank']) - (data['away_mean_goals'] / data['away_mean_rank'])
data.loc[:, 'rival_mean_goals2mean_rank_diff'] = (data['home_rival_mean_goals'] / data['home_mean_rank']) - (data['away_rival_mean_goals'] / data['away_mean_rank'])
data.loc[:, 'mean_goals2mean_rank_last5_diff'] = (data['home_mean_goals_last5'] / data['home_mean_rank']) - (data['away_mean_goals_last5'] / data['away_mean_rank'])
data.loc[:, 'rival_mean_goals2mean_rank_last5_diff'] = (data['home_rival_mean_goals_last5'] / data['home_mean_rank']) - (data['away_rival_mean_goals_last5'] / data['away_mean_rank'])
data.loc[:, 'mean_points2mean_rank_diff'] = (data['home_mean_points'] / data['home_mean_rank']) - (data['away_mean_points'] / data['away_mean_rank'])
data.loc[:, 'rival_mean_points2mean_rank_diff'] = (data['home_rival_mean_points'] / data['home_mean_rank']) - (data['away_rival_mean_points'] / data['away_mean_rank'])
data.loc[:, 'mean_points2mean_rank_last5_diff'] = (data['home_mean_points_last5'] / data['home_mean_rank']) - (data['away_mean_points_last5'] / data['away_mean_rank'])
data.loc[:, 'rival_mean_points2mean_rank_last5_diff'] = (data['home_rival_mean_points_last5'] / data['home_mean_rank']) - (data['away_rival_mean_points_last5'] / data['away_mean_rank'])
data_diff2 = data.iloc[:, -8:]
standard_diff2 = (data_diff2 - data_diff2.mean()) / data_diff2.std()
standard_diff2['target'] = data['target']
violin_diff2 = pd.melt(standard_diff2, id_vars='target', var_name='feature', value_name='value')
# Violin plot of the ratio features
plt.figure(figsize=(15, 10))
sns.violinplot(x='feature', y='value', hue='target', data=violin_diff2, split=True, inner='quart')
plt.xticks(rotation=90)
plt.show()
# Box plot of the ratio features
plt.figure(figsize=(15, 10))
sns.boxplot(x='feature', y='value', hue='target', data=violin_diff2)
plt.xticks(rotation=90)
plt.show()
# Correlation heatmap of the ratio features
plt.figure(figsize=(16, 16))
sns.heatmap(standard_diff2.corr(), annot=True, linewidths=0.2, square=True)
plt.show()
Features with a correlation above 0.3
- rank_diff
- total_points_diff
- previous_points_diff
- away_mean_rank
- away_mean_rank_last5
- away_mean_points
- away_rival_mean_points
- mean_goals_diff
- mean_goals_last5_diff
- rival_mean_goals_diff
- rival_mean_goals_last5_diff
- mean_rank_diff
- mean_rank_last5_diff
- mean_points_diff
- mean_points_last5_diff
- rival_mean_points_diff
- rival_mean_points_last5_diff
- mean_points2rank_diff
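The notebook reads this shortlist off the heatmaps. A programmatic version might look like the sketch below, under two assumptions that the source does not spell out: that the correlation meant is the one with `target`, and that its absolute value is compared with 0.3. The data is synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
target = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    'informative': target + rng.normal(0, 0.5, size=n),  # built to track the label
    'noise': rng.normal(0, 1, size=n),                   # unrelated to the label
    'target': target,
})

# Pearson correlation of every feature with the label (assumed criterion)
corr_with_target = toy.corr()['target'].drop('target')
selected = corr_with_target[corr_with_target.abs() > 0.3].index.tolist()
print(selected)
```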
# Scatter plots with a regression fit for pairs of related features.
# jointplot creates its own figure, so the size is set with height rather than plt.figure.
sns.jointplot(x='total_points_diff', y='previous_points_diff', data=data, kind='reg', height=16)
plt.show()
sns.jointplot(x='away_mean_rank', y='away_mean_rank_last5', data=data, kind='reg', height=16)
plt.show()
sns.jointplot(x='mean_goals_diff', y='mean_goals_last5_diff', data=data, kind='reg', height=16)
plt.show()
sns.jointplot(x='rival_mean_goals_diff', y='rival_mean_goals_last5_diff', data=data, kind='reg', height=16)
plt.show()
sns.jointplot(x='mean_rank_diff', y='mean_rank_last5_diff', data=data, kind='reg', height=16)
plt.show()
sns.jointplot(x='mean_points_diff', y='mean_points_last5_diff', data=data, kind='reg', height=16)
plt.show()
sns.jointplot(x='rival_mean_points_diff', y='rival_mean_points_last5_diff', data=data, kind='reg', height=16)
plt.show()
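Drop features whose distributions closely match another feature (previous_points_diff and rival_mean_goals_last5_diff), leaving: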
- rank_diff
- total_points_diff
- away_mean_rank
- away_mean_rank_last5
- away_mean_points
- away_rival_mean_points
- mean_goals_diff
- mean_goals_last5_diff
- rival_mean_goals_diff
- mean_rank_diff
- mean_rank_last5_diff
- mean_points_diff
- mean_points_last5_diff
- rival_mean_points_diff
- rival_mean_points_last5_diff
- mean_points2rank_diff
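The notebook makes this choice by eye from the joint plots. As a swapped-in alternative, and not the method used here, near-duplicate features can also be flagged by pairwise correlation. The sketch below uses synthetic columns that merely borrow three feature names, and the 0.95 cut-off is a hypothetical value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.normal(size=300)
toy = pd.DataFrame({
    'total_points_diff': base,
    'previous_points_diff': base + rng.normal(0, 0.05, size=300),  # near copy
    'mean_goals_diff': rng.normal(size=300),                       # independent
})

threshold = 0.95  # hypothetical cut-off, not taken from the notebook
corr = toy.corr().abs()
cols = list(toy.columns)
to_drop = []
for i, first in enumerate(cols):
    if first in to_drop:
        continue
    for second in cols[i + 1:]:
        # Keep the first column of a highly correlated pair, drop the second
        if second not in to_drop and corr.loc[first, second] > threshold:
            to_drop.append(second)

reduced = toy.drop(columns=to_drop)
print(to_drop)
```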
# Build the training data from the selected features
fifa_data = data[['home_team', 'away_team', 'target', 'rank_diff', 'total_points_diff', 'away_mean_rank', 'away_mean_rank_last5', 'away_mean_points', 'away_rival_mean_points', 'mean_goals_diff', 'mean_goals_last5_diff', 'rival_mean_goals_diff', 'mean_rank_diff', 'mean_rank_last5_diff', 'mean_points_diff', 'mean_points_last5_diff', 'rival_mean_points_diff', 'rival_mean_points_last5_diff', 'mean_points2rank_diff']]
fifa_data.head()
home_team | away_team | target | rank_diff | total_points_diff | away_mean_rank | away_mean_rank_last5 | away_mean_points | away_rival_mean_points | mean_goals_diff | mean_goals_last5_diff | rival_mean_goals_diff | mean_rank_diff | mean_rank_last5_diff | mean_points_diff | mean_points_last5_diff | rival_mean_points_diff | rival_mean_points_last5_diff | mean_points2rank_diff | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Egypt | Togo | 0 | -80 | 35.0 | 101.0 | 101.0 | 0.0 | 3.0 | -1.0 | -1.0 | -2.0 | -80.0 | -80.0 | 1.0 | 1.0 | -2.0 | -2.0 | 0.009804 |
1 | Morocco | Benin | 0 | -86 | 28.0 | 127.0 | 127.0 | 0.0 | 3.0 | 1.0 | 1.0 | -5.0 | -86.0 | -86.0 | 3.0 | 3.0 | -3.0 | -3.0 | 0.035294 |
2 | Tunisia | Ethiopia | 0 | -47 | 21.0 | 85.0 | 85.0 | 0.0 | 3.0 | 5.0 | 5.0 | -1.0 | -47.0 | -47.0 | 3.0 | 3.0 | -3.0 | -3.0 | 0.023622 |
3 | Zimbabwe | Angola | 0 | -48 | 17.0 | 102.0 | 102.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.5 | -48.0 | -48.0 | 1.0 | 1.0 | -0.5 | -0.5 | -0.013315 |
4 | Algeria | Ghana | 0 | -9 | 5.0 | 39.0 | 39.0 | 3.0 | 0.0 | -1.0 | -1.0 | 0.0 | -9.0 | -9.0 | -2.0 | -2.0 | 1.0 | 1.0 | -0.020000 |
# Inspect the column types and non-null counts
fifa_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5894 entries, 0 to 5893
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 home_team 5894 non-null object
1 away_team 5894 non-null object
2 target 5894 non-null int64
3 rank_diff 5894 non-null int64
4 total_points_diff 5894 non-null float64
5 away_mean_rank 5894 non-null float64
6 away_mean_rank_last5 5894 non-null float64
7 away_mean_points 5894 non-null float64
8 away_rival_mean_points 5894 non-null float64
9 mean_goals_diff 5894 non-null float64
10 mean_goals_last5_diff 5894 non-null float64
11 rival_mean_goals_diff 5894 non-null float64
12 mean_rank_diff 5894 non-null float64
13 mean_rank_last5_diff 5894 non-null float64
14 mean_points_diff 5894 non-null float64
15 mean_points_last5_diff 5894 non-null float64
16 rival_mean_points_diff 5894 non-null float64
17 rival_mean_points_last5_diff 5894 non-null float64
18 mean_points2rank_diff 5894 non-null float64
dtypes: float64(15), int64(2), object(2)
memory usage: 875.0+ KB
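Model training
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(fifa_data.iloc[:, 3:], fifa_data['target'], test_size=0.2, shuffle=True, random_state=2022)
Grid search is an exhaustive search over hyperparameters. Each hyperparameter is given a set of candidate values, the Cartesian product of those sets forms the grid, and the model is trained and evaluated once for every combination in the grid to find the best-performing one. Because every combination is tried, grid search is guaranteed to find the best-scoring setting within the specified grid. For the same reason it becomes very slow on large datasets or with many hyperparameters. Only a single combination is shown here; set your own candidate values as needed, for example 'max_depth': [3, 5, 7].
# Grid search for the random forest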
rf_params = {
'max_depth': [10],
'max_features': ['sqrt'],
'min_samples_leaf': [10],
'min_samples_split': [10],
'n_estimators': [100]
}
rf_search = GridSearchCV(RandomForestClassifier(), rf_params, cv=3, n_jobs=-1)
rf_search.fit(X_train, y_train)
rf_search.best_params_
{'max_depth': 10,
'max_features': 'sqrt',
'min_samples_leaf': 10,
'min_samples_split': 10,
'n_estimators': 100}
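To illustrate how quickly the grid grows, the sketch below enumerates the Cartesian product of a small grid with the standard library only. The candidate values are hypothetical and no model is trained.

```python
from itertools import product

# Hypothetical candidate values; the notebook evaluates a single combination
param_grid = {
    'max_depth': [3, 5, 7],
    'min_samples_leaf': [5, 10],
    'n_estimators': [100, 300],
}

names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(combinations))   # 3 * 2 * 2 = 12 candidate settings
print(combinations[0])
```

With `cv=3`, `GridSearchCV` would fit each of these 12 settings three times, plus one final refit of the best setting on the full training set.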
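Random forest is an ensemble method of the Bagging type (the base learners have no strong dependence on one another and can be built in parallel). It combines many decision-tree classifiers and aggregates their outputs by voting or averaging, which gives the overall model good accuracy and generalization. Its strength comes from the two halves of its name: the "random" part makes it resistant to overfitting, and the "forest" part makes it more accurate.
# Train the random forest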
rf = RandomForestClassifier(max_depth=10, max_features='sqrt', min_samples_leaf=10, min_samples_split=10, n_estimators=100, random_state=2022)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
rf_acc = accuracy_score(y_test, rf_pred.astype('int'))
joblib.dump(rf, 'rf.pkl')
print('RandomForest Acc is: ', rf_acc)
RandomForest Acc is: 0.732824427480916
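The averaging step can be checked directly. The sketch below uses synthetic data from `make_classification` rather than the match features, and compares a hand-computed mean of the per-tree probabilities with what scikit-learn's forest reports.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; not the match features used in this project
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

forest = RandomForestClassifier(n_estimators=25, max_depth=4, random_state=0)
forest.fit(X, y)

# Average the class probabilities of the individual trees by hand
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
manual = per_tree.mean(axis=0)

print(np.allclose(manual, forest.predict_proba(X)))
```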
# Grid search for gradient boosting
gbdt_params = {
'learning_rate': [0.01],
'max_depth': [5],
'max_features': ['sqrt'],
'min_samples_leaf': [10],
'min_samples_split': [10],
'n_estimators': [500]
}
gbdt_search = GridSearchCV(GradientBoostingClassifier(), gbdt_params, cv=3, n_jobs=-1)
gbdt_search.fit(X_train, y_train)
gbdt_search.best_params_
{'learning_rate': 0.01,
'max_depth': 5,
'max_features': 'sqrt',
'min_samples_leaf': 10,
'min_samples_split': 10,
'n_estimators': 500}
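Gradient-boosted decision trees (GBDT) are an ensemble method of the Boosting type (the base learners depend strongly on one another and must be built sequentially). Training uses a forward stagewise algorithm that learns greedily: each iteration fits one CART regression tree to the residual between the true training values and the prediction of the previous t-1 trees. For the log-loss used in classification, the negative gradient of the loss plays the role of this residual.
# Train the gradient boosting model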
gbdt = GradientBoostingClassifier(learning_rate=0.01, max_depth=5, max_features='sqrt', min_samples_leaf=10, min_samples_split=10, n_estimators=500, random_state=2022)
gbdt.fit(X_train, y_train)
gbdt_pred = gbdt.predict(X_test)
gbdt_acc = accuracy_score(y_test, gbdt_pred.astype('int'))
joblib.dump(gbdt, 'gbdt.pkl')
print('GradientBoosting Acc is: ', gbdt_acc)
GradientBoosting Acc is: 0.7430025445292621
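The residual-fitting idea can be shown in a few lines. The sketch below is a deliberately simplified stand-in, not scikit-learn's implementation: it uses squared-error regression on a toy one-dimensional problem, and one-split "stumps" in place of full CART trees.

```python
import numpy as np

def fit_stump(x, residual):
    """Return a one-split regressor on 1-D x that best fits residual (squared error)."""
    best = None
    for threshold in np.unique(x)[:-1]:          # keeps both sides non-empty
        left, right = residual[x <= threshold], residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return lambda z: np.where(z <= threshold, left_value, right_value)

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + rng.normal(0, 0.1, size=200)     # toy regression target

learning_rate = 0.1
prediction = np.full_like(y, y.mean())           # stage 0: a constant model
losses = [np.mean((y - prediction) ** 2)]
for _ in range(50):
    residual = y - prediction                    # what the earlier stages still miss
    stump = fit_stump(x, residual)
    prediction = prediction + learning_rate * stump(x)
    losses.append(np.mean((y - prediction) ** 2))

print(losses[0], losses[-1])
```

Each stage only corrects a fraction (`learning_rate`) of the remaining error, which is why a small learning rate such as 0.01 is usually paired with a large number of trees.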
# ROC curves (train and test) and the test-set confusion matrix
def analyze(model):
    plt.figure(figsize=(15, 10))
    plt.plot([0, 1], [0, 1], 'k--')  # chance-level diagonal
    fpr_train, tpr_train, _ = roc_curve(y_train, model.predict_proba(X_train)[:, 1])
    plt.plot(fpr_train, tpr_train, label='train')
    fpr_test, tpr_test, _ = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
    plt.plot(fpr_test, tpr_test, label='test')
    auc_train = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    auc_test = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    plt.legend()
    plt.title('AUC score is %.2f on test and %.2f on train' % (auc_test, auc_train))
    plt.show()
    plt.figure(figsize=(15, 10))
    matrix = confusion_matrix(y_test, model.predict(X_test))
    sns.heatmap(matrix, annot=True, linewidths=0.2, fmt='d')
    plt.title('confusion_matrix on test')
    plt.show()
# Plot the ROC curves and confusion matrix for the random forest
analyze(rf)
# Plot the ROC curves and confusion matrix for gradient boosting
analyze(gbdt)
2022 World Cup
# Build per-team statistics from historical data
def get_data(team):
    # All past matches of the team, most recent first
    pasts = data_copy[(data_copy['team'] == team)].sort_values(by=['date'], ascending=False)
    last5 = pasts.head(5)
    rank = pasts['rank'].values[0]                  # latest rank
    total_points = pasts['total_points'].values[0]  # latest ranking points
    mean_rank = pasts['rank'].mean()
    mean_rank_last5 = last5['rank'].mean()
    mean_goals = pasts['score'].mean()
    mean_goals_last5 = last5['score'].mean()
    mean_points = pasts['points'].mean()
    mean_points_last5 = last5['points'].mean()
    mean_points2rank = pasts['points2rank'].mean()
    rival_mean_goals = pasts['rival_score'].mean()
    rival_mean_points = pasts['rival_points'].mean()
    rival_mean_points_last5 = last5['rival_points'].mean()
    return [rank, total_points, mean_rank, mean_rank_last5, mean_goals, mean_goals_last5, mean_points, mean_points_last5, mean_points2rank, rival_mean_goals, rival_mean_points, rival_mean_points_last5]
# Combine two teams' statistics into the feature vector, in the training column order
def get_feature(team1, team2):
    rank_diff = team1[0] - team2[0]
    total_points_diff = team1[1] - team2[1]
    away_mean_rank = team2[2]
    away_mean_rank_last5 = team2[3]
    away_mean_points = team2[6]
    away_rival_mean_points = team2[10]
    mean_goals_diff = team1[4] - team2[4]
    mean_goals_last5_diff = team1[5] - team2[5]
    rival_mean_goals_diff = team1[9] - team2[9]
    mean_rank_diff = team1[2] - team2[2]
    mean_rank_last5_diff = team1[3] - team2[3]
    mean_points_diff = team1[6] - team2[6]
    mean_points_last5_diff = team1[7] - team2[7]
    rival_mean_points_diff = team1[10] - team2[10]
    rival_mean_points_last5_diff = team1[11] - team2[11]
    mean_points2rank_diff = team1[8] - team2[8]
    return [rank_diff, total_points_diff, away_mean_rank, away_mean_rank_last5, away_mean_points, away_rival_mean_points, mean_goals_diff, mean_goals_last5_diff, rival_mean_goals_diff, mean_rank_diff, mean_rank_last5_diff, mean_points_diff, mean_points_last5_diff, rival_mean_points_diff, rival_mean_points_last5_diff, mean_points2rank_diff]
# Load the 2022 World Cup fixtures
fifa_2022 = pd.read_csv('/home/aistudio/work/fifa_2022.csv', parse_dates=['date'])
fifa_2022.head()
date | home_team | away_team | |
---|---|---|---|
0 | 2022-11-20 | Qatar | Ecuador |
1 | 2022-11-21 | Senegal | Netherlands |
2 | 2022-11-21 | England | Iran |
3 | 2022-11-21 | United States | Wales |
4 | 2022-11-22 | Argentina | Saudi Arabia |
# Win/loss prediction: score the pairing in both orders and average the results
def predict(teams, model):
    home = teams[0]
    away = teams[1]
    team1 = get_data(home)
    team2 = get_data(away)
    feature1 = get_feature(team1, team2)
    feature2 = get_feature(team2, team1)
    proba1 = model.predict_proba([feature1])
    proba2 = model.predict_proba([feature2])
    # pred1: class 0 with home listed first, averaged with class 1 with home listed second
    pred1 = (proba1[0][0] + proba2[0][1]) / 2
    # pred2: the same for the away team
    pred2 = (proba2[0][0] + proba1[0][1]) / 2
    if pred1 < pred2:
        print('%s VS %s: %s wins, probability: %.2f' % (home, away, away, pred2))
    else:
        print('%s VS %s: %s wins, probability: %.2f' % (home, away, home, pred1))
# 2022 World Cup: the last 16 fixtures are the knockout matches
def to_pairs(games):
    # Return [[home_team, away_team], ...] for the given fixtures
    return games[['home_team', 'away_team']].values.tolist()
team8 = to_pairs(fifa_2022.iloc[-16:-8])  # round of 16
team4 = to_pairs(fifa_2022.iloc[-8:-4])   # quarter-finals
team2 = to_pairs(fifa_2022.iloc[-4:-2])   # semi-finals
team1 = to_pairs(fifa_2022.iloc[-2:])     # third-place match and final
# Round of 16
for teams in team8:
    predict(teams, gbdt)
Netherlands VS United States: Netherlands wins, probability: 0.66
Argentina VS Australia: Argentina wins, probability: 0.83
France VS Poland: France wins, probability: 0.71
England VS Senegal: England wins, probability: 0.65
Japan VS Croatia: Croatia wins, probability: 0.65
Brazil VS South Korea: Brazil wins, probability: 0.82
Morocco VS Spain: Spain wins, probability: 0.81
Portugal VS Switzerland: Portugal wins, probability: 0.52
# Quarter-finals
for teams in team4:
    predict(teams, gbdt)
Croatia VS Brazil: Brazil wins, probability: 0.70
Netherlands VS Argentina: Argentina wins, probability: 0.60
Morocco VS Portugal: Portugal wins, probability: 0.76
England VS France: France wins, probability: 0.55
# Semi-finals
for teams in team2:
    predict(teams, gbdt)
Argentina VS Croatia: Argentina wins, probability: 0.64
France VS Morocco: France wins, probability: 0.73
# Third-place match and final
for teams in team1:
    predict(teams, gbdt)
Croatia VS Morocco: Croatia wins, probability: 0.67
Argentina VS France: Argentina wins, probability: 0.55
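The `predict` helper used for the rounds above queries the model twice, once with each team listed first, and averages the two answers. This is presumably because the home/away label is arbitrary for matches played at a tournament venue. The sketch below isolates that averaging with a hypothetical `StubModel` and a single made-up feature; it is not the trained GBDT.

```python
class StubModel:
    """Hypothetical stand-in: class 0 means the first-listed team wins."""

    def predict_proba(self, rows):
        out = []
        for (rank_diff,) in rows:
            # Favour the first-listed team when its rank number is lower
            p_first = 0.7 if rank_diff < 0 else 0.4
            out.append([p_first, 1.0 - p_first])
        return out

def symmetric_win_probability(model, feature_ab, feature_ba):
    """Average each team's win probability over both orderings of the pair (A, B)."""
    proba_ab = model.predict_proba([feature_ab])[0]   # A listed first
    proba_ba = model.predict_proba([feature_ba])[0]   # B listed first
    p_a = (proba_ab[0] + proba_ba[1]) / 2
    p_b = (proba_ba[0] + proba_ab[1]) / 2
    return p_a, p_b

p_a, p_b = symmetric_win_probability(StubModel(), [-10], [10])
print(p_a, p_b)
```

Because each row of `predict_proba` sums to one, the two averaged values also sum to one, so the larger of them can be reported as the winner's probability.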
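Summary
On December 18, 2022 local time (December 19 Beijing time), Argentina beat France on penalties in the final of the 2022 Qatar World Cup and won the title.
This project is for learning purposes and is meant as an exercise in feature engineering. Possible improvements: more data, finer data granularity, further feature engineering, and model construction.
Acknowledgments
Predicting FIFA 2022 World Cup with ML
This article is a repost.
Original project link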