2022世界杯冠军是谁，使用Python来预测吧（二）

Python学研大本营

已于 2023-05-15 15:51:57 修改

阅读量2.8k

点赞数

文章标签： python 机器学习 sklearn

于 2023-05-15 15:45:11 首次发布

本文链接：https://blog.csdn.net/weixin_39915649/article/details/130686065

版权

使用Python预测——2022年世界杯冠军是谁？

微信搜索关注《Python学研大本营》，加入读者群，分享更多精彩

模型

我这里的想法是建立两个模型，一个是Random Forest，另一个是Gradient Boosting，比较一下哪个更好，以便在仿真中使用。我决定使用基于决策树的模型，因为当我研究文献时，它们在足球问题上做得更好。此外，由于数据集的大小，我认为不需要使用更复杂的模型。

我将使用 SkLearn 的GridSearchCV进行参数变化，并将在模拟中使用最佳模型。

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

#separating the target from the features
X = model_db.iloc[:, 3:]
y = model_db[["target"]]

#dividing the database
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state=1)
gb = GradientBoostingClassifier(random_state=5)
params = {"learning_rate": [0.01, 0.1, 0.5],
            "min_samples_split": [5, 10],
            "min_samples_leaf": [3, 5],
            "max_depth":[3,5,10],
            "max_features":["sqrt"],
            "n_estimators":[100, 200]
         } 
gb_cv = GridSearchCV(gb, params, cv = 3, n_jobs = -1, verbose = False)
gb_cv.fit(X_train.values, np.ravel(y_train))

#getting the best model
gb = gb_cv.best_estimator_

由于执行延迟，我避免测试很多参数，优先测试具有减少过度拟合的值，例如 learning_rate 不太低和 n_estimators 不太高。

对随机森林做了同样的事情：

params_rf = {"max_depth": [20],
                "min_samples_split": [5, 10],
                "max_leaf_nodes": [175, 200],
                "min_samples_leaf": [5, 10],
                "n_estimators": [250],
                 "max_features": ["sqrt"],
                }

rf = RandomForestClassifier(random_state=1)
rf_cv = GridSearchCV(rf, params_rf, cv = 3, n_jobs = -1, verbose = False)
rf_cv.fit(X_train.values, np.ravel(y_train))

我用混淆矩阵和ROC曲线分析了模型，结果是：

analyze(gb)

梯度提升结果

analyze(rf)

随机森林结果

随机森林模型的性能稍好，但似乎过拟合。分析 Gradient Boosting 的 AUC-ROC，我们看到一个模型具有几乎相同的性能，但过度拟合的风险较低，这就是选择它的原因。

世界杯模拟

现在，我们到了最有趣的部分：看看模型将预测哪支球队赢得世界杯！

首先要做的是获取参加世界杯的球队名单，我使用了Pandas 的read_html方法。该方法从网页中获取数据框，我将其放入维基百科世界杯页面。这样，我将重新创建世界杯表。

该表包含比赛、小组中每支球队的得分以及存储球队赢得每场比赛的概率的列表。如果两支球队在小组中积分相同，这将用作决胜局。

创表前四组及世界杯前十场比赛正如我已经解释过的，该模型对主队获胜和客队获胜/平局进行了分类。那么，我们如何预测平局呢？我为此创建了一个客观规则：知道世界杯的所有比赛都是在中立场地进行的，预测将以两种形式进行：

A 队 x B 队（模拟 1）
B 队 x A 队（模拟 2）

如果两个预测都是 A 队或 B 队获胜，则将获胜分配给该队。如果一队在第一次预测中获胜，而另一队在第二次预测中获胜，则将被分配平局。季后赛阶段，将计算两次预测的概率，平均概率最高的球队晋级。由于模型将“胜利”分配给客队，即使平局也是如此，因此平局概率在客队的概率范围内。因此，通过这种类型的模拟，两支球队都获得了以平局为优势的模拟。

模拟使用的数据直到球队的最后一场比赛。换句话说，对于巴西来说，特征将计算到对阵突尼斯的比赛，这是巴西最后一场比赛。

现在，我们可以运行一个代码来逐场模拟比赛，计算分数并查看第一阶段发生的情况。

advanced_group = []
last_group = ""

for k in table.keys():
    for t in table[k]:
        t[1] = 0
        t[2] = []
        
for teams in matches:
    draw = False
    team_1 = find_stats(teams[1])
    team_2 = find_stats(teams[2])

    

    features_g1 = find_features(team_1, team_2)
    features_g2 = find_features(team_2, team_1)

    probs_g1 = gb.predict_proba([features_g1])
    probs_g2 = gb.predict_proba([features_g2])
    
    team_1_prob_g1 = probs_g1[0][0]
    team_1_prob_g2 = probs_g2[0][1]
    team_2_prob_g1 = probs_g1[0][1]
    team_2_prob_g2 = probs_g2[0][0]

    team_1_prob = (probs_g1[0][0] + probs_g2[0][1])/2
    team_2_prob = (probs_g2[0][0] + probs_g1[0][1])/2
    
    if ((team_1_prob_g1 > team_2_prob_g1) & (team_2_prob_g2 > team_1_prob_g2)) | ((team_1_prob_g1 < team_2_prob_g1) & (team_2_prob_g2 < team_1_prob_g2)):
        draw=True
        for i in table[teams[0]]:
            if i[0] == teams[1] or i[0] == teams[2]:
                i[1] += 1
                
    elif team_1_prob > team_2_prob:
        winner = teams[1]
        winner_proba = team_1_prob
        for i in table[teams[0]]:
            if i[0] == teams[1]:
                i[1] += 3
                
    elif team_2_prob > team_1_prob:  
        winner = teams[2]
        winner_proba = team_2_prob
        for i in table[teams[0]]:
            if i[0] == teams[2]:
                i[1] += 3
    
    for i in table[teams[0]]: #adding criterio de desempate (probs por jogo)
            if i[0] == teams[1]:
                i[2].append(team_1_prob)
            if i[0] == teams[2]:
                i[2].append(team_2_prob)

    if last_group != teams[0]:
        if last_group != "":
            print("\n")
            print("Group %s advanced: "%(last_group))
            
            for i in table[last_group]: #adding crieterio de desempate
                i[2] = np.mean(i[2])
            
            final_points = table[last_group]
            final_table = sorted(final_points, key=itemgetter(1, 2), reverse = True)
            advanced_group.append([final_table[0][0], final_table[1][0]])
            for i in final_table:
                print("%s -------- %d"%(i[0], i[1]))
        print("\n")
        print("-"*10+" Starting Analysis for Group %s "%(teams[0])+"-"*10)
        
        
    if draw == False:
        print("Group %s - %s vs. %s: Winner %s with %.2f probability"%(teams[0], teams[1], teams[2], winner, winner_proba))
    else:
        print("Group %s - %s vs. %s: Draw"%(teams[0], teams[1], teams[2]))
    last_group =  teams[0]

print("\n")
print("Group %s advanced: "%(last_group))

for i in table[last_group]: #adding crieterio de desempate
    i[2] = np.mean(i[2])
            
final_points = table[last_group]
final_table = sorted(final_points, key=itemgetter(1, 2), reverse = True)
advanced_group.append([final_table[0][0], final_table[1][0]])
for i in final_table:
    print("%s -------- %d"%(i[0], i[1]))

结果是：

---------- Starting Analysis for Group A ----------
Group A - Qatar vs. Ecuador: Winner Ecuador with 0.62 probability
Group A - Senegal vs. Netherlands: Winner Netherlands with 0.62 probability
Group A - Qatar vs. Senegal: Winner Senegal with 0.60 probability
Group A - Netherlands vs. Ecuador: Winner Netherlands with 0.73 probability
Group A - Ecuador vs. Senegal: Draw
Group A - Netherlands vs. Qatar: Winner Netherlands with 0.78 probability


Group A advanced: 
Netherlands -------- 9
Senegal -------- 4
Ecuador -------- 4
Qatar -------- 0


---------- Starting Analysis for Group B ----------
Group B - England vs. Iran: Winner England with 0.62 probability
Group B - United States vs. Wales: Draw
Group B - Wales vs. Iran: Draw
Group B - England vs. United States: Winner England with 0.61 probability
Group B - Wales vs. England: Winner England with 0.64 probability
Group B - Iran vs. United States: Winner United States with 0.58 probability


Group B advanced: 
England -------- 9
United States -------- 4
Wales -------- 2
Iran -------- 1


---------- Starting Analysis for Group C ----------
Group C - Argentina vs. Saudi Arabia: Winner Argentina with 0.79 probability
Group C - Mexico vs. Poland: Draw
Group C - Poland vs. Saudi Arabia: Winner Poland with 0.70 probability
Group C - Argentina vs. Mexico: Winner Argentina with 0.67 probability
Group C - Poland vs. Argentina: Winner Argentina with 0.71 probability
Group C - Saudi Arabia vs. Mexico: Winner Mexico with 0.71 probability


Group C advanced: 
Argentina -------- 9
Poland -------- 4
Mexico -------- 4
Saudi Arabia -------- 0


---------- Starting Analysis for Group D ----------
Group D - Denmark vs. Tunisia: Winner Denmark with 0.68 probability
Group D - France vs. Australia: Winner France with 0.71 probability
Group D - Tunisia vs. Australia: Draw
Group D - France vs. Denmark: Draw
Group D - Australia vs. Denmark: Winner Denmark with 0.71 probability
Group D - Tunisia vs. France: Winner France with 0.69 probability


Group D advanced: 
France -------- 7
Denmark -------- 7
Tunisia -------- 1
Australia -------- 1


---------- Starting Analysis for Group E ----------
Group E - Germany vs. Japan: Winner Germany with 0.62 probability
Group E - Spain vs. Costa Rica: Winner Spain with 0.76 probability
Group E - Japan vs. Costa Rica: Winner Japan with 0.63 probability
Group E - Spain vs. Germany: Draw
Group E - Japan vs. Spain: Winner Spain with 0.67 probability
Group E - Costa Rica vs. Germany: Winner Germany with 0.65 probability


Group E advanced: 
Spain -------- 7
Germany -------- 7
Japan -------- 3
Costa Rica -------- 0


---------- Starting Analysis for Group F ----------
Group F - Morocco vs. Croatia: Winner Croatia with 0.58 probability
Group F - Belgium vs. Canada: Winner Belgium with 0.75 probability
Group F - Belgium vs. Morocco: Winner Belgium with 0.67 probability
Group F - Croatia vs. Canada: Winner Croatia with 0.64 probability
Group F - Croatia vs. Belgium: Winner Belgium with 0.64 probability
Group F - Canada vs. Morocco: Draw


Group F advanced: 
Belgium -------- 9
Croatia -------- 6
Morocco -------- 1
Canada -------- 1


---------- Starting Analysis for Group G ----------
Group G - Switzerland vs. Cameroon: Winner Switzerland with 0.69 probability
Group G - Brazil vs. Serbia: Winner Brazil with 0.72 probability
Group G - Cameroon vs. Serbia: Winner Serbia with 0.66 probability
Group G - Brazil vs. Switzerland: Draw
Group G - Serbia vs. Switzerland: Winner Switzerland with 0.57 probability
Group G - Cameroon vs. Brazil: Winner Brazil with 0.81 probability


Group G advanced: 
Brazil -------- 7
Switzerland -------- 7
Serbia -------- 3
Cameroon -------- 0


---------- Starting Analysis for Group H ----------
Group H - Uruguay vs. South Korea: Winner Uruguay with 0.62 probability
Group H - Portugal vs. Ghana: Winner Portugal with 0.81 probability
Group H - South Korea vs. Ghana: Winner South Korea with 0.76 probability
Group H - Portugal vs. Uruguay: Winner Portugal with 0.60 probability
Group H - Ghana vs. Uruguay: Winner Uruguay with 0.77 probability
Group H - South Korea vs. Portugal: Winner Portugal with 0.67 probability


Group H advanced: 
Portugal -------- 9
Uruguay -------- 6
South Korea -------- 3
Ghana -------- 0

看到一些结果很有趣，比如巴西和瑞士以及丹麦和法国之间的平局。总的来说，夺冠热门在小组赛阶段就过关了。

在季后赛中，思路是一样的：

advanced = advanced_group

playoffs = {"Round of 16": [], "Quarter-Final": [], "Semi-Final": [], "Final": []}

for p in playoffs.keys():
    playoffs[p] = []

actual_round = ""
next_rounds = []

for p in playoffs.keys():
    if p == "Round of 16":
        control = []
        for a in range(0, len(advanced*2), 1):
            if a < len(advanced):
                if a % 2 == 0:
                    control.append((advanced*2)[a][0])
                else:
                    control.append((advanced*2)[a][1])
            else:
                if a % 2 == 0:
                    control.append((advanced*2)[a][1])
                else:
                    control.append((advanced*2)[a][0])

        playoffs[p] = [[control[c], control[c+1]] for c in range(0, len(control)-1, 1) if c%2 == 0]
        
        for i in range(0, len(playoffs[p]), 1):
            game = playoffs[p][i]
            
            home = game[0]
            away = game[1]
            team_1 = find_stats(home)
            team_2 = find_stats(away)

            features_g1 = find_features(team_1, team_2)
            features_g2 = find_features(team_2, team_1)
            
            probs_g1 = gb.predict_proba([features_g1])
            probs_g2 = gb.predict_proba([features_g2])
            
            team_1_prob = (probs_g1[0][0] + probs_g2[0][1])/2
            team_2_prob = (probs_g2[0][0] + probs_g1[0][1])/2
            
            if actual_round != p:
                print("-"*10)
                print("Starting simulation of %s"%(p))
                print("-"*10)
                print("\n")
            
            if team_1_prob < team_2_prob:
                print("%s vs. %s: %s advances with prob %.2f"%(home, away, away, team_2_prob))
                next_rounds.append(away)
            else:
                print("%s vs. %s: %s advances with prob %.2f"%(home, away, home, team_1_prob))
                next_rounds.append(home)
            
            game.append([team_1_prob, team_2_prob])
            playoffs[p][i] = game
            actual_round = p
        
    else:
        playoffs[p] = [[next_rounds[c], next_rounds[c+1]] for c in range(0, len(next_rounds)-1, 1) if c%2 == 0]
        next_rounds = []
        for i in range(0, len(playoffs[p])):
            game = playoffs[p][i]
            home = game[0]
            away = game[1]
            team_1 = find_stats(home)
            team_2 = find_stats(away)
            
            features_g1 = find_features(team_1, team_2)
            features_g2 = find_features(team_2, team_1)
            
            probs_g1 = gb.predict_proba([features_g1])
            probs_g2 = gb.predict_proba([features_g2])
            
            team_1_prob = (probs_g1[0][0] + probs_g2[0][1])/2
            team_2_prob = (probs_g2[0][0] + probs_g1[0][1])/2
            
            if actual_round != p:
                print("-"*10)
                print("Starting simulation of %s"%(p))
                print("-"*10)
                print("\n")
            
            if team_1_prob < team_2_prob:
                print("%s vs. %s: %s advances with prob %.2f"%(home, away, away, team_2_prob))
                next_rounds.append(away)
            else:
                print("%s vs. %s: %s advances with prob %.2f"%(home, away, home, team_1_prob))
                next_rounds.append(home)
            game.append([team_1_prob, team_2_prob])
            playoffs[p][i] = game
            actual_round = p

为了在此处查看结果，除了文本输出之外，我决定像在这个 Kaggle 笔记本中那样用季后赛图片绘制图表。这是查看这些问题结果的一种非常有趣的方式。

----------
Starting simulation of Round of 16
----------


Netherlands vs. United States: Netherlands advances with prob 0.54
Argentina vs. Denmark: Argentina advances with prob 0.59
Spain vs. Croatia: Spain advances with prob 0.61
Brazil vs. Uruguay: Brazil advances with prob 0.64
Senegal vs. England: England advances with prob 0.64
Poland vs. France: France advances with prob 0.67
Germany vs. Belgium: Belgium advances with prob 0.53
Switzerland vs. Portugal: Portugal advances with prob 0.57
----------
Starting simulation of Quarter-Final
----------


Netherlands vs. Argentina: Netherlands advances with prob 0.51
Spain vs. Brazil: Brazil advances with prob 0.54
England vs. France: England advances with prob 0.51
Belgium vs. Portugal: Portugal advances with prob 0.52
----------
Starting simulation of Semi-Final
----------


Netherlands vs. Brazil: Brazil advances with prob 0.55
England vs. Portugal: England advances with prob 0.51
----------
Starting simulation of Final
----------


Brazil vs. England: Brazil advances with prob 0.56

模拟世界杯！我的模型预测巴西队获胜，决赛中对阵英格兰队的概率为 56%！我认为最大的冷门是比利时击败德国和英格兰进入决赛，在四分之一决赛中淘汰法国。看到一些概率非常小的比赛很有趣，比如荷兰对阿根廷。从四分之一决赛到决赛，没有一支球队晋级的概率超过60%，这说明晋级季后赛的球队大多水平相近。

季后赛图片

结论

这个项目的想法是用我喜欢的东西，足球来练习我在机器学习方面的知识。我觉得模拟世界杯很有意思，因为是时下的热门话题，吸引了所有喜欢这项运动的人的目光。我相信目标达到了，因为所有特征的构建和数据分析给我带来了搜索和接触许多新技术的机会。

关于结果，我非常希望模型能够正确预测冠军！

如果您想查看详细代码，可以查看附录中作者的GitHub和Kaggle。

附录：

GridSearchCV：https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
read_html https://pandas.pydata.org/docs/reference/api/pandas.read_html.html
Wikipedia: https://en.wikipedia.org/wiki/2022_FIFA_World_Cup#Teams
Kaggle notebook https://www.kaggle.com/code/agostontorok/soccer-world-cup-2018-winner
GitHub https://github.com/sslp23/world_cup_sim
Kaggle https://www.kaggle.com/code/sslp23/predicting-fifa-2022-world-cup-with-ml