2025 MCM Problem C: Olympic Medal Table Models - 27 Models and 2,000+ Lines of Python, a Detailed Code Walkthrough

Problem and Background

See the companion blog post:

2025 MCM Problem C: Olympic Medal Table Model - Analysis and Python Implementation

The full code is available on GitHub: https://github.com/cityu-lm/2025_MCM_Problem_C

1. Introduction

Starting from a public dataset recording 120 years of Olympic Games, we transform it into a dataset with the following features: country, year, sport, event, population, GDP, age, BMI, height, weight, sex, and season. The target variable is the type of medal won, which can be none, bronze, silver, or gold, and we use supervised classification machine learning models to predict which athletes will win which medals. We convert the categorical data with label encoding and one-hot encoding, and scale the numerical data with log transformations, min-max scaling, and standard scaling. To balance the imbalanced target variable, we resample with simple random resampling and SMOTE. We then apply feature selection and hyperparameter tuning to k-Nearest Neighbors, Naive Bayes, and Decision Tree classifiers and evaluate them on our test data. Our results show that the most successful model was kNN on SMOTE-resampled data, without feature selection or hyperparameter tuning.

Through the following analysis, we will be able to see whether such differences are significant, or whether other factors, such as an athlete's height and weight, have a greater influence on performance. In a sense, we hope our analysis can offer insight into how countries select athletes and, ideally, identify the countries that have historically performed worst at the Olympics, and why.


2. Methods

The dataset covers 120 years of Olympic athletes and their performance across a variety of events. After dropping rows with NaN values, our dataset contains 206,165 rows; this was done so that we keep only samples with complete data. The features taken from this dataset are sex, age, height, weight, year, country, season, sport, event, and medal. We removed features we considered irrelevant to the model or redundant, such as Team (already represented by the country name), Games (represented by year and season), and Name, which does not affect an athlete's performance. We also added a BMI feature, computed from each athlete's height and weight, so that we can later study its correlation with the other features. In addition, to determine whether the GDP of an athlete's country affects their performance, we added each country's GDP and population (from the GeoPandas world dataset) to our data. Our target variable is the medal won in an event (gold, silver, bronze, or none).

The dataset is imbalanced in its target variable: as expected, far more samples are labeled 'None' (meaning the athlete won no medal) than 'Bronze', 'Silver', or 'Gold'. 85.4% of the samples won no medal (175,984 samples), 4.9% won gold (10,167 samples), 4.8% won silver (9,866 samples), and 4.9% won bronze (10,148 samples). We compare the results of two different resampling methods: undersampling, which reduces every target count to 9,800, and SMOTE, which oversamples our minority classes (bronze, silver, and gold) until every target count matches the majority count of 'None'.
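As a quick sanity check, these counts and shares can be recomputed directly (a minimal sketch, assuming pandas is imported as pd and the cleaned dataframe df_data from Section 3.1, with the medal column renamed to target, is available):

# Per-class counts and percentage shares of the target variable
counts = df_data['target'].value_counts()
shares = df_data['target'].value_counts(normalize=True).mul(100).round(1)
print(pd.concat([counts.rename('count'), shares.rename('%')], axis=1))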

2.1.1 Data Import


In this section, we first import the data from CSV files. Then we clean the data by dropping null values, removing columns we do not plan to use (e.g., country code, athlete ID), and renaming some columns (the original 'Medal' column becomes our target column).

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
# pip install geopandas
%matplotlib inline
# Import olympic athletes dataset
athletes_url = "data/athlete_events.csv"

df_athletes = pd.read_csv(athletes_url)
df_athletes
| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | NaN |
| 1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | NaN |
| 2 | 3 | Gunnar Nielsen Aaby | M | 24.0 | NaN | NaN | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football | NaN |
| 3 | 4 | Edgar Lindenau Aabye | M | 34.0 | NaN | NaN | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold |
| 4 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 271111 | 135569 | Andrzej ya | M | 29.0 | 179.0 | 89.0 | Poland-1 | POL | 1976 Winter | 1976 | Winter | Innsbruck | Luge | Luge Mixed (Men)'s Doubles | NaN |
| 271112 | 135570 | Piotr ya | M | 27.0 | 176.0 | 59.0 | Poland | POL | 2014 Winter | 2014 | Winter | Sochi | Ski Jumping | Ski Jumping Men's Large Hill, Individual | NaN |
| 271113 | 135570 | Piotr ya | M | 27.0 | 176.0 | 59.0 | Poland | POL | 2014 Winter | 2014 | Winter | Sochi | Ski Jumping | Ski Jumping Men's Large Hill, Team | NaN |
| 271114 | 135571 | Tomasz Ireneusz ya | M | 30.0 | 185.0 | 96.0 | Poland | POL | 1998 Winter | 1998 | Winter | Nagano | Bobsleigh | Bobsleigh Men's Four | NaN |
| 271115 | 135571 | Tomasz Ireneusz ya | M | 34.0 | 185.0 | 96.0 | Poland | POL | 2002 Winter | 2002 | Winter | Salt Lake City | Bobsleigh | Bobsleigh Men's Four | NaN |

271116 rows × 15 columns

# Read country codes data file
noc_url = "data/noc_regions.csv"
noc = pd.read_csv(noc_url)
noc
| | NOC | region | notes |
| --- | --- | --- | --- |
| 0 | AFG | Afghanistan | NaN |
| 1 | AHO | Curacao | Netherlands Antilles |
| 2 | ALB | Albania | NaN |
| 3 | ALG | Algeria | NaN |
| 4 | AND | Andorra | NaN |
| ... | ... | ... | ... |
| 225 | YEM | Yemen | NaN |
| 226 | YMD | Yemen | South Yemen |
| 227 | YUG | Serbia | Yugoslavia |
| 228 | ZAM | Zambia | NaN |
| 229 | ZIM | Zimbabwe | NaN |

230 rows × 3 columns

# Read world shape file from github link for gdp and population features
world_url = "https://github.com/emily-ywang/Olympics-ML-Project/blob/master/world.csv?raw=true"
world = pd.read_csv(world_url)

world
(Output preview, truncated: one row per country or territory, with columns including OBJECTID, featurecla, LEVEL, TYPE, FORMAL_EN, POP_EST, POP_RANK, GDP_MD_EST, ..., NAME_SV, NAME_TR, NAME_VI, NAME_ZH, WB_NAME, WB_RULES, WB_REGION, Shape_Leng, Shape_Area, and geometry. For example, the Indonesia row has POP_EST 260580739 and GDP_MD_EST 3028000.0, and the New Zealand row has POP_EST 4510327 and GDP_MD_EST 174800.0.)

251 rows × 54 columns

world.to_csv('data/world.csv',index=False)

2.2. Data Analysis

We predict the medal an Olympic athlete wins (none, bronze, silver, or gold) from features including country, year, sport, event, population, GDP, age, BMI, height, weight, sex, and season. Country, GDP, and population are important features because many countries perform better than others at the Olympics by virtue of their size: a larger population generally offers more opportunities to develop outstanding athletes, while GDP affects access to sports funding and resources. All three suggest that certain countries, primarily first-world countries, perform better at the Olympics than countries with fewer resources. In addition, certain countries are better at certain events than others; for example, a country in Africa may not excel at skiing, while a country above the equator where it regularly snows in winter might. Sport and event are therefore important features, since certain countries do better in certain events, and certain people may do better in a given sport or event depending on their height, weight, age, sex, and BMI. BMI, height, weight, sex, and age describe an athlete's individual characteristics; they matter because these differences play a major role in determining whether an athlete can win a medal. Year is an important feature because the Olympics and athletics in general have changed a great deal, and our dataset reaches far back in time: some events no longer exist, and others have been added over the years. Finally, season is an important feature because there are more medals to be won at the Summer Olympics than at the Winter Olympics, and our model needs to take this into account when predicting athletes' medals.

This is a supervised machine learning problem because our dataset is labeled, and from those labels our model predicts which medal an athlete won (none, bronze, silver, or gold). Our samples carry information about the athletes, including their physical attributes, their country, and the events they competed in. Using this information together with the known outcomes of the target variable, our model can predict the target. Our project addresses a classification problem: we train our model on features such as age, height, weight, and event, and, given a corresponding set of features, classify the type of medal an athlete is most likely to win (bronze, silver, gold, or no medal).

We applied k-Nearest Neighbors, Naive Bayes, and Decision Trees to the data. We chose these because they handle supervised classification machine learning problems. Specifically, we chose k-Nearest Neighbors because it performs well on classification problems, requires very little tuning, and works well since we do not have too many features. We chose Naive Bayes because it is a probabilistic classification algorithm with efficient parameter estimation, though we expected other algorithms to generalize better. Finally, we chose Decision Trees because our dataset contains different feature types, which decision trees are particularly good at classifying, and because decision trees are easy to interpret.


3. Experimental Results

3.1. Data Analysis

3.1.1 Data Cleaning: Athletes Dataset

In the cell below, we drop samples with NA values in any column except 'Medal', since those missing values could cause complications when processing and manipulating the dataset. In our target column, 'Medal', an NA value represents no medal rather than a missing value; we therefore replace every NA in 'Medal' with the string 'None', indicating that the athlete won no medal.

# Drop null values (except for Medal) and sort by year
df_athletes.dropna(subset=['ID', 'Name', 'Sex', 'Age', 'Height', 'Weight', 'Team', 'NOC', 
                         'Games', 'Year', 'Season', 'City', 'Sport', 'Event'], inplace=True)
df_athletes.sort_values('Year', inplace=True)

# Change remaining NaN values in the Medal column to 'None'
df_athletes.replace(np.nan, 'None', regex=True, inplace=True)

# Rename "Medal" column as target
df_athletes.rename(columns={"Medal":"target"}, inplace=True)

3.1.2 Cleaning and Merging Dataframes: NOC Dataset

In the cell below, we merge the initial dataframe (containing name, NOC country code, age, height, weight, etc.) with the dataframe that maps NOC country codes to each country's full name, so that our final dataframe contains full country names rather than country codes. We then drop columns that do not affect athlete performance, such as Name, City, and ID, as well as redundant columns such as NOC, Team, and Games.

# Combine NOC region names with NOC codes in dataset
merged = pd.merge(df_athletes, noc, how='inner')
df_data = merged.rename(columns={'region': 'Country'})
# Drop redundant columns (Name, ID, NOC, Games, Sport, notes, City)
df_data.drop(['Name', 'ID', 'NOC', 'Team', 'Games', 'notes', 'City'], axis=1, inplace=True)

3.1.3 Cleaning and Merging Dataframes: World Dataset

import warnings
warnings.filterwarnings('ignore')
# Limit world dataset to three columns: country name, population, and GDP
# (take a .copy() so the rename/replace below modify an independent dataframe)
df_world = world[['NAME_EN', 'POP_EST', 'GDP_MD_EST']].copy()
df_world.rename(columns={'NAME_EN':'Country', 'POP_EST':'Population', 'GDP_MD_EST':'GDP'}, inplace=True)

# Replace country names that don't match with noc names
df_world.replace({'Country':{"People's Republic of China":"China", "United Kingdom":"UK", 
                          "United States of America":"USA", "The Bahamas":"Bahamas",
                         "Trinidad and Tobago":"Trinidad"}}, inplace=True)

In the cell below, we merge our dataset with the dataset containing each country's GDP and population data.

# Rename "Taiwan" country rows as "China" (world dataset only has China, not Taiwan)
df_data.replace({"Country":"Taiwan"}, "China", inplace=True)

# Reorder columns
df_data = df_data[['Sex', 'Age', 'Height', 'Weight', 'Year', 'Country', 'Season', 'Sport', 'Event', 'target']]

# Reset index
df_data.reset_index(drop=True, inplace=True)

# Merge world dataset with athletes dataset
df_data = df_world.merge(df_data)

3.1.4 The Cleaned and Merged Dataset

df_data
| | Country | Population | GDP | Sex | Age | Height | Weight | Year | Season | Sport | Event | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Indonesia | 260580739 | 3028000.0 | M | 20.0 | 162.0 | 57.0 | 1956 | Summer | Athletics | Athletics Men's 100 metres | None |
| 1 | Indonesia | 260580739 | 3028000.0 | M | 22.0 | 160.0 | 56.0 | 1960 | Summer | Weightlifting | Weightlifting Men's Bantamweight | None |
| 2 | Indonesia | 260580739 | 3028000.0 | M | 37.0 | 172.0 | 70.0 | 1960 | Summer | Sailing | Sailing Mixed Three Person Keelboat | None |
| 3 | Indonesia | 260580739 | 3028000.0 | M | 18.0 | 172.0 | 66.0 | 1960 | Summer | Cycling | Cycling Men's 100 kilometres Team Time Trial | None |
| 4 | Indonesia | 260580739 | 3028000.0 | M | 18.0 | 172.0 | 66.0 | 1960 | Summer | Cycling | Cycling Men's Road Race, Individual | None |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 266670 | New Zealand | 4510327 | 174800.0 | M | 25.0 | 183.0 | 78.0 | 2016 | Summer | Hockey | Hockey Men's Hockey | None |
| 266671 | New Zealand | 4510327 | 174800.0 | F | 18.0 | 159.0 | 58.0 | 2016 | Summer | Diving | Diving Women's Springboard | None |
| 266672 | New Zealand | 4510327 | 174800.0 | M | 36.0 | 175.0 | 73.0 | 2016 | Summer | Shooting | Shooting Men's Small-Bore Rifle, Prone, 50 metres | None |
| 266673 | New Zealand | 4510327 | 174800.0 | M | 32.0 | 189.0 | 70.0 | 2016 | Summer | Rowing | Rowing Men's Lightweight Coxless Fours | None |
| 266674 | New Zealand | 4510327 | 174800.0 | F | 19.0 | 165.0 | 67.0 | 2016 | Summer | Rugby Sevens | Rugby Sevens Women's Rugby Sevens | Silver |
266675 rows × 12 columns

3.1.5 Feature Engineering: Creating a New BMI Feature

BMI = weight (kg) / height (m)^2

# Calculate and insert BMI into merged dataframe
height = df_data['Height']
weight = df_data['Weight']
bmi = weight / ((height/100)**2)
df_data.insert(8, 'BMI', bmi)
df_data
| | Country | Population | GDP | Sex | Age | Height | Weight | Year | BMI | Season | Sport | Event | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Indonesia | 260580739 | 3028000.0 | M | 20.0 | 162.0 | 57.0 | 1956 | 21.719250 | Summer | Athletics | Athletics Men's 100 metres | None |
| 1 | Indonesia | 260580739 | 3028000.0 | M | 22.0 | 160.0 | 56.0 | 1960 | 21.875000 | Summer | Weightlifting | Weightlifting Men's Bantamweight | None |
| 2 | Indonesia | 260580739 | 3028000.0 | M | 37.0 | 172.0 | 70.0 | 1960 | 23.661439 | Summer | Sailing | Sailing Mixed Three Person Keelboat | None |
| 3 | Indonesia | 260580739 | 3028000.0 | M | 18.0 | 172.0 | 66.0 | 1960 | 22.309356 | Summer | Cycling | Cycling Men's 100 kilometres Team Time Trial | None |
| 4 | Indonesia | 260580739 | 3028000.0 | M | 18.0 | 172.0 | 66.0 | 1960 | 22.309356 | Summer | Cycling | Cycling Men's Road Race, Individual | None |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 266670 | New Zealand | 4510327 | 174800.0 | M | 25.0 | 183.0 | 78.0 | 2016 | 23.291230 | Summer | Hockey | Hockey Men's Hockey | None |
| 266671 | New Zealand | 4510327 | 174800.0 | F | 18.0 | 159.0 | 58.0 | 2016 | 22.942130 | Summer | Diving | Diving Women's Springboard | None |
| 266672 | New Zealand | 4510327 | 174800.0 | M | 36.0 | 175.0 | 73.0 | 2016 | 23.836735 | Summer | Shooting | Shooting Men's Small-Bore Rifle, Prone, 50 metres | None |
| 266673 | New Zealand | 4510327 | 174800.0 | M | 32.0 | 189.0 | 70.0 | 2016 | 19.596316 | Summer | Rowing | Rowing Men's Lightweight Coxless Fours | None |
| 266674 | New Zealand | 4510327 | 174800.0 | F | 19.0 | 165.0 | 67.0 | 2016 | 24.609734 | Summer | Rugby Sevens | Rugby Sevens Women's Rugby Sevens | Silver |
266675 rows × 13 columns

3.1.6 Data Preprocessing: Feature Encoding for Categorical and Binary Variables

Label encoding of categorical features

We label-encode the Country, Year, Sport, and Event features, converting them from categorical to numerical values so they can be used to train and test machine learning models. In label encoding, each unique categorical value is assigned a number as its 'label' within the corresponding column; for example, if two samples both come from Germany, both will show the same numeric label for Germany in the Country column. Because all of these features have a large number of unique values, we decided to use label encoding (rather than one-hot encoding) to keep the data simple and prevent the dataframe from having too many columns, since label encoding adds no extra columns.

# Create dataframe with only categorical features: country, year, sport, event columns
encoded_features = df_data[['Country', 'Year', 'Sport', 'Event']]
column_names = encoded_features.columns

# Create empty dataframe to store encoded values
cat_feature_sub_df = pd.DataFrame(columns = column_names)

# Apply label encoding for each categorical feature column
from sklearn.preprocessing import LabelEncoder

for i in range(4):
    encoder = LabelEncoder()
    # LabelEncoder expects a 1-d array, so pass the column Series directly
    # (reshaping to (-1, 1) makes it 2-d, which LabelEncoder rejects)
    cat_feature_subset = encoder.fit_transform(encoded_features.iloc[:, i])
    cat_feature_sub_df[column_names[i]] = cat_feature_subset

cat_feature_sub_df
| | Country | Year | Sport | Event |
| --- | --- | --- | --- | --- |
| 0 | 77 | 13 | 3 | 30 |
| 1 | 77 | 14 | 54 | 540 |
| 2 | 77 | 14 | 36 | 381 |
| 3 | 77 | 14 | 14 | 201 |
| 4 | 77 | 14 | 14 | 214 |
| ... | ... | ... | ... | ... |
| 266670 | 123 | 34 | 24 | 310 |
| 266671 | 123 | 34 | 15 | 239 |
| 266672 | 123 | 34 | 37 | 404 |
| 266673 | 123 | 34 | 33 | 352 |
| 266674 | 123 | 34 | 35 | 367 |

266675 rows × 4 columns

One-hot encoding

Similarly, we one-hot-encode the Sex and Season features, converting them from categorical to numerical values so they can be used to train and test machine learning models; each of our two categorical features is replaced by 2 features whose values can be 0 or 1, indicating the absence or presence of that category in a given column. We specifically chose the binary features (those with only two categories each) for one-hot encoding because each adds only 2 columns (4 new columns in total, one for each unique value: F, M, Summer, Winter), so our dataframe does not grow too large from this one-hot encoding step.

from sklearn.preprocessing import OneHotEncoder

# Create dataframe with only binary features: sex, season
binary_features = df_data[["Sex", "Season"]]

# Apply one-hot encoding
encoder = OneHotEncoder()
bin_feature_subset = encoder.fit_transform(binary_features)

# Turn encoded values into a dataframe
bin_feature_sub_df = pd.DataFrame(bin_feature_subset.toarray(), columns = encoder.get_feature_names_out())
bin_feature_sub_df
| | Sex_F | Sex_M | Season_Summer | Season_Winter |
| --- | --- | --- | --- | --- |
| 0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 1 | 0.0 | 1.0 | 1.0 | 0.0 |
| 2 | 0.0 | 1.0 | 1.0 | 0.0 |
| 3 | 0.0 | 1.0 | 1.0 | 0.0 |
| 4 | 0.0 | 1.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... |
| 266670 | 0.0 | 1.0 | 1.0 | 0.0 |
| 266671 | 1.0 | 0.0 | 1.0 | 0.0 |
| 266672 | 0.0 | 1.0 | 1.0 | 0.0 |
| 266673 | 0.0 | 1.0 | 1.0 | 0.0 |
| 266674 | 1.0 | 0.0 | 1.0 | 0.0 |

266675 rows × 4 columns

3.1.7 Scaling and Transforming Numerical Features

Identifying skew in the continuous and discrete numerical features

Before training any models, we need to understand the distributions of our features and, if they are skewed, transform them toward a normal distribution to limit bias (i.e., so the model does not give more importance to variables with larger values). To do this, we first test the features for skewness.

# Test for skewness in numerical variables: Population, GDP, Age BMI, Height, and Weight

def skewness():
    # Find skewness
    feature_names = ['Population', 'GDP', 'Age', 'BMI', 'Height', 'Weight']
    for i in range(len(feature_names)):
        print(f"{feature_names[i]} skew: {df_data[feature_names[i]].skew()}")
        
    # Create dist plots of each variable
    fig, axes = plt.subplots(ncols=3, nrows=2, figsize=(20, 10))
    for i, ax in zip(range(len(feature_names)), axes.flat):
        sns.histplot(df_data[feature_names[i]], ax=ax)
    plt.show()

skewness()
Population skew: 4.847316928926255
GDP skew: 2.8226162284359653
Age skew: 1.1319874071481901
BMI skew: 1.3722296331968096
Height skew: 0.007536973027835242
Weight skew: 0.7298782642302241

[Figure: distribution plots of Population, GDP, Age, BMI, Height, and Weight]

Correcting skew with log transformations

Since the feature distributions are generally skewed, we apply a log function to each feature to scale and normalize it, as shown below.

# Log-transform the skewed features: population, GDP, BMI, weight, and age (height is left untransformed)
pop_log = np.log(df_data['Population'])
gdp_log = np.log(df_data['GDP'])
bmi_log = np.log(df_data['BMI'])
weight_log = np.log(df_data['Weight'])
age_log = np.log(df_data['Age'])

log_features = [pop_log, gdp_log, age_log, bmi_log, df_data['Height'], weight_log]
feature_names = ['Population', 'GDP', 'Age', 'BMI', 'Height', 'Weight']

# Normalize log transformed data
def normalize(column):
    upper = column.max()
    lower = column.min()
    y = (column-lower)/(upper-lower)
    return y

normal_features = []
for i in range(len(log_features)):
    norm = normalize(log_features[i])
    normal_features.append(norm)

# Empty dataframe with transformed and scaled values
cont_feature_sub_df = pd.DataFrame(columns = feature_names)
Comparing skew and distributions before and after the log transformation
# Find and print new skewness
import scipy.stats

for i in range(len(normal_features)):
    print(f"{feature_names[i]} skew: {float(scipy.stats.skew(df_data[feature_names[i]]))}")
    print(f"{feature_names[i]} log-transformed skew: {float(scipy.stats.skew(normal_features[i]))} \n")
    # Add transformed values to dataframe
    cont_feature_sub_df[feature_names[i]] = normal_features[i]
    
# Create new dist plots with scaled and transformed variables
fig, axes = plt.subplots(ncols=3, nrows=2, figsize=(20, 10))
for i, ax in zip(range(len(normal_features)), axes.flat):
    sns.histplot(normal_features[i], ax=ax, label=feature_names[i])
plt.show()
Population skew: 4.847289663577969
Population log-transformed skew: -0.04977191246066552 

GDP skew: 2.822600351691021
GDP log-transformed skew: -0.42489600377264997 

Age skew: 1.1319810399080523
Age log-transformed skew: 0.23616538067882106 

BMI skew: 1.3722219146342678
BMI log-transformed skew: 0.5696356255853973 

Height skew: 0.007536930633620538
Height log-transformed skew: 0.007536930633621244 

Weight skew: 0.7298741587868682
Weight log-transformed skew: -0.03659542012833335 


[Figure: distribution plots of the log-transformed, normalized features]

3.1.8 Label Encoding the Target Variable

Because our target variable contains categorical values (bronze, silver, gold, none), we use label encoding to convert the target values to numbers for the ML models, with 0 representing gold, 1 silver, 2 bronze, and 3 no medal. We decided to use label encoding rather than one-hot encoding to account for the uneven weighting among the target values, since label encoding gives higher values more weight. For example, because we know that in reality an athlete competing at the Olympics is more likely to win no medal, we assign the largest value, 3, to no medal so that this value carries more weight.

# Dictionary of numerical values corresponding to target categories
target_labels = {'Gold':0, 'Silver':1, 'Bronze':2, 'None':3}

# Use dictionary to transform each target value
def transform_medals(column):
    return target_labels[column]

# Apply transformation and create new dataframe with encoded target values
# (note: this is an alias of df_data, not a copy, so df_data's target is encoded in place as well)
target_encoded_df = df_data
target_encoded_df["target"] = df_data["target"].apply(transform_medals)

3.1.9 The Transformed Datasets Used for Model Testing

Original dataset with encoded target variable

This dataset is used mainly for data exploration rather than training.

target_encoded_df
| | Country | Population | GDP | Sex | Age | Height | Weight | Year | BMI | Season | Sport | Event | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Indonesia | 260580739 | 3028000.0 | M | 20.0 | 162.0 | 57.0 | 1956 | 21.719250 | Summer | Athletics | Athletics Men's 100 metres | 3 |
| 1 | Indonesia | 260580739 | 3028000.0 | M | 22.0 | 160.0 | 56.0 | 1960 | 21.875000 | Summer | Weightlifting | Weightlifting Men's Bantamweight | 3 |
| 2 | Indonesia | 260580739 | 3028000.0 | M | 37.0 | 172.0 | 70.0 | 1960 | 23.661439 | Summer | Sailing | Sailing Mixed Three Person Keelboat | 3 |
| 3 | Indonesia | 260580739 | 3028000.0 | M | 18.0 | 172.0 | 66.0 | 1960 | 22.309356 | Summer | Cycling | Cycling Men's 100 kilometres Team Time Trial | 3 |
| 4 | Indonesia | 260580739 | 3028000.0 | M | 18.0 | 172.0 | 66.0 | 1960 | 22.309356 | Summer | Cycling | Cycling Men's Road Race, Individual | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 266670 | New Zealand | 4510327 | 174800.0 | M | 25.0 | 183.0 | 78.0 | 2016 | 23.291230 | Summer | Hockey | Hockey Men's Hockey | 3 |
| 266671 | New Zealand | 4510327 | 174800.0 | F | 18.0 | 159.0 | 58.0 | 2016 | 22.942130 | Summer | Diving | Diving Women's Springboard | 3 |
| 266672 | New Zealand | 4510327 | 174800.0 | M | 36.0 | 175.0 | 73.0 | 2016 | 23.836735 | Summer | Shooting | Shooting Men's Small-Bore Rifle, Prone, 50 metres | 3 |
| 266673 | New Zealand | 4510327 | 174800.0 | M | 32.0 | 189.0 | 70.0 | 2016 | 19.596316 | Summer | Rowing | Rowing Men's Lightweight Coxless Fours | 3 |
| 266674 | New Zealand | 4510327 | 174800.0 | F | 19.0 | 165.0 | 67.0 | 2016 | 24.609734 | Summer | Rugby Sevens | Rugby Sevens Women's Rugby Sevens | 1 |
266675 rows × 13 columns

Dataset without log transformation or normalization
# Merge dataset without log transformation
non_transf_features = df_data[['Population', 'GDP', 'Age', 'BMI', 'Height', 'Weight']]
combined_df = pd.concat([cat_feature_sub_df, non_transf_features, bin_feature_sub_df, target_encoded_df['target']], axis=1)
combined_df
| | Country | Year | Sport | Event | Population | GDP | Age | BMI | Height | Weight | Sex_F | Sex_M | Season_Summer | Season_Winter | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 77 | 13 | 3 | 30 | 260580739 | 3028000.0 | 20.0 | 21.719250 | 162.0 | 57.0 | 0.0 | 1.0 | 1.0 | 0.0 | 3 |
| 1 | 77 | 14 | 54 | 540 | 260580739 | 3028000.0 | 22.0 | 21.875000 | 160.0 | 56.0 | 0.0 | 1.0 | 1.0 | 0.0 | 3 |
| 2 | 77 | 14 | 36 | 381 | 260580739 | 3028000.0 | 37.0 | 23.661439 | 172.0 | 70.0 | 0.0 | 1.0 | 1.0 | 0.0 | 3 |
| 3 | 77 | 14 | 14 | 201 | 260580739 | 3028000.0 | 18.0 | 22.309356 | 172.0 | 66.0 | 0.0 | 1.0 | 1.0 | 0.0 | 3 |
| 4 | 77 | 14 | 14 | 214 | 260580739 | 3028000.0 | 18.0 | 22.309356 | 172.0 | 66.0 | 0.0 | 1.0 | 1.0 | 0.0 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 266670 | 123 | 34 | 24 | 310 | 4510327 | 174800.0 | 25.0 | 23.291230 | 183.0 | 78.0 | 0.0 | 1.0 | 1.0 | 0.0 | 3 |
| 266671 | 123 | 34 | 15 | 239 | 4510327 | 174800.0 | 18.0 | 22.942130 | 159.0 | 58.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3 |
| 266672 | 123 | 34 | 37 | 404 | 4510327 | 174800.0 | 36.0 | 23.836735 | 175.0 | 73.0 | 0.0 | 1.0 | 1.0 | 0.0 | 3 |
| 266673 | 123 | 34 | 33 | 352 | 4510327 | 174800.0 | 32.0 | 19.596316 | 189.0 | 70.0 | 0.0 | 1.0 | 1.0 | 0.0 | 3 |
| 266674 | 123 | 34 | 35 | 367 | 4510327 | 174800.0 | 19.0 | 24.609734 | 165.0 | 67.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1 |
266675 rows × 15 columns

Dataset with log transformation and normalization
# Merge dataset with log transformation
df_log_transf = pd.concat([cat_feature_sub_df, cont_feature_sub_df, bin_feature_sub_df, target_encoded_df['target']], axis=1)
df_log_transf
| | Country | Year | Sport | Event | Population | GDP | Age | BMI | Height | Weight | Sex_F | Sex_M | Season_Summer | Season_Winter | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 77 | 13 | 3 | 30 | 0.860060 | 0.836021 | 0.320593 | 0.469387 | 0.353535 | 0.383855 | 0.0 | 1.0 | 1.0 | 0.0 | 3 |
| 1 | 77 | 14 | 54 | 540 | 0.860060 | 0.836021 | 0.371704 | 0.472901 | 0.333333 | 0.375612 | 0.0 | 1.0 | 1.0 | 0.0 | 3 |
| 2 | 77 | 14 | 36 | 381 | 0.860060 | 0.836021 | 0.650489 | 0.511500 | 0.454545 | 0.479540 | 0.0 | 1.0 | 1.0 | 0.0 | 3 |
| 3 | 77 | 14 | 14 | 201 | 0.860060 | 0.836021 | 0.264093 | 0.482568 | 0.454545 | 0.452135 | 0.0 | 1.0 | 1.0 | 0.0 | 3 |
| 4 | 77 | 14 | 14 | 214 | 0.860060 | 0.836021 | 0.264093 | 0.482568 | 0.454545 | 0.452135 | 0.0 | 1.0 | 1.0 | 0.0 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 266670 | 123 | 34 | 24 | 310 | 0.519408 | 0.595360 | 0.440255 | 0.503746 | 0.565657 | 0.529939 | 0.0 | 1.0 | 1.0 | 0.0 | 3 |
| 266671 | 123 | 34 | 15 | 239 | 0.519408 | 0.595360 | 0.264093 | 0.496320 | 0.323232 | 0.391955 | 1.0 | 0.0 | 1.0 | 0.0 | 3 |
| 266672 | 123 | 34 | 37 | 404 | 0.519408 | 0.595360 | 0.635797 | 0.515129 | 0.484848 | 0.499084 | 0.0 | 1.0 | 1.0 | 0.0 | 3 |
| 266673 | 123 | 34 | 33 | 352 | 0.519408 | 0.595360 | 0.572635 | 0.418813 | 0.626263 | 0.479540 | 0.0 | 1.0 | 1.0 | 0.0 | 3 |
| 266674 | 123 | 34 | 35 | 367 | 0.519408 | 0.595360 | 0.293087 | 0.530821 | 0.383838 | 0.459139 | 1.0 | 0.0 | 1.0 | 0.0 | 1 |
266675 rows × 15 columns

3.2. Data Exploration

3.2.1 Initial Exploration of the Dataset

The visualizations below show our initial exploration of the dataset. The first plot shows the distribution of our samples by sex. It shows that there are significantly more male than female athletes in our data: the number of male athletes (140,000+) is roughly twice the number of female athletes (70,000+).

sns.displot(target_encoded_df, x="Sex")

#specify the title
title = "Sex Distribution of Olympic Athletes"

#set the title of the plot
plt.title(title)
Text(0.5, 1.0, 'Sex Distribution of Olympic Athletes')

[Figure: Sex Distribution of Olympic Athletes]

The plot below shows the distribution of our samples by Olympic season. It shows that there are significantly more Summer athletes than Winter athletes in our data: the number of Summer athletes (175,000+) is roughly four times the number of Winter athletes (40,000+). This makes sense, since the Summer Games have more sports and events than the Winter Games, so overall more athletes compete in the Summer Olympics.

sns.displot(target_encoded_df, x="Season")

#specify the title
title = "Season Distribution of Olympic Dataset"

#set the title of the plot
plt.title(title)
Text(0.5, 1.0, 'Season Distribution of Olympic Dataset')


[Figure: Season Distribution of Olympic Dataset]

The histogram below shows the distribution of our samples by age. It shows a positively skewed age distribution: most athletes in the sample cluster around ages 20-30, and the number of athletes drops off sharply to the right at older ages. This makes sense, since Olympians must stay strong and in peak health to endure grueling training and competition, and the younger 20-30 range is typically when people are in their best physical condition.

sns.histplot(target_encoded_df, x="Age", bins=20)

#specify the title
title = "Age Distribution of Olympic Athletes"

#set the title of the plot
plt.title(title)
Text(0.5, 1.0, 'Age Distribution of Olympic Athletes')


[Figure: Age Distribution of Olympic Athletes]

The kernel density plot below shows the athletes' distribution over age and BMI. The 'densest' regions are marked by the smallest inner contours: most female athletes (shown in orange) cluster in their early twenties with a BMI around 21. Likewise, a large number of male athletes have a BMI around 25 and are in their early twenties. The female contours are denser (tightly clustered), indicating a relatively narrow spread of age and BMI for women. By contrast, the male plot is less dense (more contours, spaced further apart), indicating a more spread-out age/BMI distribution.

# Create and display the kernel density plot
graph = sns.kdeplot(x="Age", y="BMI", hue = "Sex", 
                        data = df_data)

# Specify the title
title = "BMI vs. Age"

# Set the title of the plot
graph.set_title(title, size = 16)

# Add labels to the axes  
graph.set_xlabel("Age", size = 16)
graph.set_ylabel("BMI", size = 16)

# Move the legend to lower right
plt.legend(loc="lower right")

<matplotlib.legend.Legend at 0x1a017f8af10>

[Figure: BMI vs. Age kernel density plot]

def heatmap(dataframe, start_col, end_col, color):
    df_features = dataframe.iloc[:, start_col:end_col]
    
    # Find the correlation coefficient between each trait as a new dataframe (absolute value applied so all coefficients between 0 and 1)
    df_corr = df_features.corr().abs()
    
    # Set sizes of figure
    fig_dims = (10, 7)
    fig, ax = plt.subplots(figsize=fig_dims)
    
    # Create a heatmap of the strength of correlation
    viz = sns.heatmap(df_corr, vmin=0, vmax=1, linewidths=0.5, cmap=color, ax=ax)
    viz.set(title="Correlation between Variables")
Correlation heatmap (features compared against medal type)

The heatmap below depicts how strongly each pair of variables is correlated (as shown in the key alongside, the darker the blue of a cell, the stronger the correlation between the variables). The most strongly correlated variables include height, weight, GDP, and population. According to this heatmap, features such as GDP, height, and weight appear to correlate more strongly with the target variable (the type of medal won) than the others. Overall, however, the correlation between each individual variable and the target does not appear very strong (shown by the light colors in the target column).

heatmap(combined_df, 0, 16, 'Blues')


[Figure: Correlation between Variables heatmap (Blues)]

3.2.2 Data Exploration Grouped by Country and Total Medals

Creating a new summary dataframe grouped by country and total medals

To understand and visualize our data from a more condensed perspective, we decided to create another dataframe that groups the original one by country and looks at metrics such as the total medals each country has won, the total number of athletes it has sent, the median age, weight, height, and BMI of its athletes, and its most common sport, event, and season. We can compare these computed metrics against other columns, such as each country's GDP and population, to better understand how these variables affect one another.

# Function used to change medal type to number of medals (to sum by country after)
def change_target(medal):
    if medal == 0 or medal == 1 or medal == 2:
        medal = 1
    else:
        medal = 0
    return medal
# Count number of entries for each country to determine number of athletes and add to df_data
df_data['Number of Athletes'] = df_data.groupby('Country')['Country'].transform('count') 
df_data
| | Country | Population | GDP | Sex | Age | Height | Weight | Year | BMI | Season | Sport | Event | target | Number of Athletes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Indonesia | 260580739 | 3028000.0 | M | 20.0 | 162.0 | 57.0 | 1956 | 21.719250 | Summer | Athletics | Athletics Men's 100 metres | 3 | 334 |
| 1 | Indonesia | 260580739 | 3028000.0 | M | 22.0 | 160.0 | 56.0 | 1960 | 21.875000 | Summer | Weightlifting | Weightlifting Men's Bantamweight | 3 | 334 |
| 2 | Indonesia | 260580739 | 3028000.0 | M | 37.0 | 172.0 | 70.0 | 1960 | 23.661439 | Summer | Sailing | Sailing Mixed Three Person Keelboat | 3 | 334 |
| 3 | Indonesia | 260580739 | 3028000.0 | M | 18.0 | 172.0 | 66.0 | 1960 | 22.309356 | Summer | Cycling | Cycling Men's 100 kilometres Team Time Trial | 3 | 334 |
| 4 | Indonesia | 260580739 | 3028000.0 | M | 18.0 | 172.0 | 66.0 | 1960 | 22.309356 | Summer | Cycling | Cycling Men's Road Race, Individual | 3 | 334 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 266670 | New Zealand | 4510327 | 174800.0 | M | 25.0 | 183.0 | 78.0 | 2016 | 23.291230 | Summer | Hockey | Hockey Men's Hockey | 3 | 7760 |
| 266671 | New Zealand | 4510327 | 174800.0 | F | 18.0 | 159.0 | 58.0 | 2016 | 22.942130 | Summer | Diving | Diving Women's Springboard | 3 | 7760 |
| 266672 | New Zealand | 4510327 | 174800.0 | M | 36.0 | 175.0 | 73.0 | 2016 | 23.836735 | Summer | Shooting | Shooting Men's Small-Bore Rifle, Prone, 50 metres | 3 | 7760 |
| 266673 | New Zealand | 4510327 | 174800.0 | M | 32.0 | 189.0 | 70.0 | 2016 | 19.596316 | Summer | Rowing | Rowing Men's Lightweight Coxless Fours | 3 | 7760 |
| 266674 | New Zealand | 4510327 | 174800.0 | F | 19.0 | 165.0 | 67.0 | 2016 | 24.609734 | Summer | Rugby Sevens | Rugby Sevens Women's Rugby Sevens | 1 | 7760 |
266675 rows × 14 columns

# Create df_summary grouped by country
df_summary = df_data.iloc[:, 0:14]
df_summary['GDP per Capita'] = df_summary['GDP']/df_summary['Population']

# Find total medals, median BMI, median Age, median Weight, most common event, most common sport, most common season, year with most athletes, number of athletes, GDP, GDP per capita, and Population for each country
df_summary['Total Medals'] = df_summary['target'].apply(change_target)
df_summary = df_summary.groupby('Country', as_index=False).agg({'Total Medals': 'sum', 'Population':lambda x: pd.Series.mode(x)[0],
                                                                'GDP':lambda x: pd.Series.mode(x)[0], 'Age': 'median', 'Height':'median','Weight':'median', 'Year':lambda x: pd.Series.mode(x)[0],
                                                                'BMI':'median', 'Season':lambda x: pd.Series.mode(x)[0], 'Sport':lambda x: pd.Series.mode(x)[0], 
                                                                'GDP per Capita':lambda x: pd.Series.mode(x)[0], 'Event':lambda x: pd.Series.mode(x)[0], 'Number of Athletes':lambda x: pd.Series.mode(x)[0]})
df_summary['Medals per Athlete'] = (df_summary['Total Medals']/df_summary['Number of Athletes'])
df_summary
| | Country | Total Medals | Population | GDP | Age | Height | Weight | Year | BMI | Season | Sport | GDP per Capita | Event | Number of Athletes | Medals per Athlete |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Afghanistan | 2 | 34124811 | 64080.0 | 23.0 | 170.0 | 64.0 | 1960 | 22.128136 | Summer | Wrestling | 0.001878 | Athletics Men's 100 metres | 54 | 0.037037 |
| 1 | Albania | 0 | 3047987 | 33900.0 | 23.0 | 170.0 | 69.0 | 2008 | 23.087868 | Summer | Weightlifting | 0.011122 | Shooting Women's Sporting Pistol, 25 metres | 57 | 0.000000 |
| 2 | Algeria | 15 | 40969443 | 609400.0 | 24.0 | 175.0 | 66.0 | 2016 | 21.971336 | Summer | Athletics | 0.014875 | Handball Men's Handball | 481 | 0.031185 |
| 3 | American Samoa | 0 | 51504 | 711.0 | 26.0 | 175.0 | 82.0 | 1988 | 26.595745 | Summer | Athletics | 0.013805 | Athletics Men's 100 metres | 21 | 0.000000 |
| 4 | Andorra | 0 | 85702 | 3327.0 | 22.0 | 173.0 | 71.0 | 2010 | 23.255019 | Winter | Alpine Skiing | 0.038821 | Alpine Skiing Men's Giant Slalom | 135 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 184 | Venezuela | 15 | 31304016 | 468600.0 | 24.0 | 173.0 | 69.0 | 2008 | 22.720438 | Summer | Swimming | 0.014969 | Cycling Men's Road Race, Individual | 785 | 0.019108 |
| 185 | Vietnam | 4 | 96160163 | 594900.0 | 23.0 | 165.0 | 58.0 | 1980 | 20.831262 | Summer | Swimming | 0.006187 | Cycling Men's Road Race, Individual | 182 | 0.021978 |
| 186 | Yemen | 0 | 28036829 | 73450.0 | 21.0 | 168.0 | 61.0 | 1988 | 21.773842 | Summer | Athletics | 0.002620 | Athletics Men's 1,500 metres | 37 | 0.000000 |
| 187 | Zambia | 1 | 15972000 | 65170.0 | 23.0 | 175.0 | 67.0 | 1988 | 21.604938 | Summer | Athletics | 0.004080 | Football Men's Football | 128 | 0.007812 |
| 188 | Zimbabwe | 22 | 13805084 | 28330.0 | 24.0 | 175.0 | 66.0 | 1980 | 22.545959 | Summer | Athletics | 0.002052 | Football Women's Football | 299 | 0.073579 |
189 rows × 15 columns

Creating a new dataframe of the top 20 medal-winning countries from df_summary
# Create df of top 20 countries with most medals
top_df = df_summary.nlargest(n=20, columns='Total Medals', keep='first')
top_df
| | Country | Total Medals | Population | GDP | Age | Height | Weight | Year | BMI | Season | Sport | GDP per Capita | Event | Number of Athletes | Medals per Athlete |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 122 | Netherlands | 1132 | 17084719 | 870800.0 | 25.0 | 180.0 | 73.0 | 2016 | 22.460034 | Summer | Swimming | 0.050970 | Hockey Men's Hockey | 5966 | 0.189863 |
| 177 | USA | 4383 | 326625791 | 18560000.0 | 24.0 | 178.0 | 72.0 | 1992 | 22.720438 | Summer | Athletics | 0.056823 | Ice Hockey Men's Ice Hockey | 14214 | 0.308358 |
| 143 | Russia | 3610 | 142257519 | 3745000.0 | 25.0 | 176.0 | 70.0 | 1988 | 22.720438 | Summer | Athletics | 0.026325 | Ice Hockey Men's Ice Hockey | 10398 | 0.347182 |
| 63 | Germany | 3189 | 80594017 | 3979000.0 | 24.0 | 178.0 | 70.0 | 1972 | 22.530864 | Summer | Athletics | 0.049371 | Ice Hockey Men's Ice Hockey | 13183 | 0.241902 |
| 9 | Australia | 1210 | 23232413 | 1189000.0 | 24.0 | 177.0 | 71.0 | 2000 | 22.662709 | Summer | Swimming | 0.051178 | Hockey Men's Hockey | 6630 | 0.182504 |
| 31 | Canada | 1060 | 35623680 | 1674000.0 | 24.0 | 175.0 | 70.0 | 1988 | 22.656250 | Summer | Athletics | 0.046991 | Ice Hockey Men's Ice Hockey | 7966 | 0.133066 |
| 82 | Italy | 1060 | 62137802 | 2221000.0 | 25.0 | 175.0 | 70.0 | 1992 | 22.675737 | Summer | Athletics | 0.035743 | Water Polo Men's Water Polo | 7697 | 0.137716 |
| 37 | China | 1038 | 1379302771 | 21140000.0 | 23.0 | 172.0 | 65.0 | 2008 | 21.913806 | Summer | Swimming | 0.015327 | Basketball Men's Basketball | 6555 | 0.158352 |
| 176 | UK | 1031 | 64769452 | 2788000.0 | 25.0 | 175.0 | 70.0 | 2012 | 22.530864 | Summer | Athletics | 0.043045 | Hockey Men's Hockey | 7766 | 0.132758 |
| 60 | France | 987 | 67106161 | 2699000.0 | 25.0 | 175.0 | 69.0 | 1992 | 22.275310 | Summer | Athletics | 0.040220 | Ice Hockey Men's Ice Hockey | 7977 | 0.123731 |
| 123 | New Zealand | 844 | 4510327 | 174800.0 | 25.0 | 178.0 | 73.0 | 2016 | 22.985398 | Summer | Hockey | 0.038756 | Hockey Men's Hockey | 7760 | 0.108763 |
| 85 | Japan | 843 | 126451398 | 4932000.0 | 24.0 | 168.0 | 62.0 | 1964 | 22.038567 | Summer | Gymnastics | 0.039003 | Volleyball Women's Volleyball | 7487 | 0.112595 |
| 74 | Hungary | 791 | 9850845 | 267600.0 | 24.0 | 176.0 | 70.0 | 1992 | 22.598140 | Summer | Gymnastics | 0.027165 | Water Polo Men's Water Polo | 4681 | 0.168981 |
| 164 | Sweden | 765 | 9960487 | 498100.0 | 25.0 | 179.0 | 72.0 | 1988 | 22.720438 | Summer | Athletics | 0.050008 | Ice Hockey Men's Ice Hockey | 5314 | 0.143959 |
| 59 | Finland | 724 | 5491218 | 224137.0 | 25.0 | 176.0 | 70.0 | 1952 | 22.516126 | Summer | Athletics | 0.040817 | Ice Hockey Men's Ice Hockey | 4389 | 0.164958 |
| 142 | Romania | 597 | 21529967 | 441000.0 | 24.0 | 173.0 | 67.0 | 1980 | 22.491349 | Summer | Gymnastics | 0.020483 | Rowing Women's Coxed Eights | 3477 | 0.171700 |
| 158 | South Korea | 561 | 51181299 | 1929000.0 | 23.0 | 172.0 | 65.0 | 1988 | 22.145329 | Summer | Gymnastics | 0.037690 | Football Men's Football | 3854 | 0.145563 |
| 138 | Poland | 548 | 38476269 | 1052000.0 | 25.0 | 175.0 | 70.0 | 1972 | 22.790358 | Summer | Athletics | 0.027342 | Ice Hockey Men's Ice Hockey | 5728 | 0.095670 |
| 45 | Czech Republic | 519 | 10674723 | 350900.0 | 25.0 | 176.0 | 71.0 | 1980 | 22.720438 | Summer | Gymnastics | 0.032872 | Ice Hockey Men's Ice Hockey | 4583 | 0.113245 |
| 128 | Norway | 514 | 5320045 | 364700.0 | 25.0 | 179.0 | 73.0 | 1972 | 22.545959 | Winter | Cross Country Skiing | 0.068552 | Ice Hockey Men's Ice Hockey | 3061 | 0.167919 |
Creating a new dataframe of the top 20 countries by medals per athlete from df_summary
# Create df with top 20 countries with largest medals per athlete ratio
medal_per_athlete_df = df_summary.nlargest(n=20, columns='Medals per Athlete', keep='first')
medal_per_athlete_df.reset_index(inplace=True)
medal_per_athlete_df
| | index | Country | Total Medals | Population | GDP | Age | Height | Weight | Year | BMI | Season | Sport | GDP per Capita | Event | Number of Athletes | Medals per Athlete |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 143 | Russia | 3610 | 142257519 | 3745000.0 | 25.0 | 176.0 | 70.0 | 1988 | 22.720438 | Summer | Athletics | 0.026325 | Ice Hockey Men's Ice Hockey | 10398 | 0.347182 |
| 1 | 177 | USA | 4383 | 326625791 | 18560000.0 | 24.0 | 178.0 | 72.0 | 1992 | 22.720438 | Summer | Athletics | 0.056823 | Ice Hockey Men's Ice Hockey | 14214 | 0.308358 |
| 2 | 130 | Pakistan | 107 | 204924861 | 988200.0 | 25.0 | 174.0 | 70.0 | 1960 | 22.592987 | Summer | Hockey | 0.004822 | Hockey Men's Hockey | 366 | 0.292350 |
| 3 | 63 | Germany | 3189 | 80594017 | 3979000.0 | 24.0 | 178.0 | 70.0 | 1972 | 22.530864 | Summer | Athletics | 0.049371 | Ice Hockey Men's Ice Hockey | 13183 | 0.241902 |
| 4 | 150 | Serbia | 467 | 7111024 | 101800.0 | 24.0 | 180.0 | 75.0 | 1984 | 23.405654 | Summer | Athletics | 0.014316 | Water Polo Men's Water Polo | 2342 | 0.199402 |
| 5 | 84 | Jamaica | 154 | 2990561 | 25390.0 | 24.0 | 176.0 | 68.0 | 2016 | 21.877551 | Summer | Athletics | 0.008490 | Athletics Men's 4 x 400 metres Relay | 809 | 0.190358 |
| 6 | 122 | Netherlands | 1132 | 17084719 | 870800.0 | 25.0 | 180.0 | 73.0 | 2016 | 22.460034 | Summer | Swimming | 0.050970 | Hockey Men's Hockey | 5966 | 0.189863 |
| 7 | 9 | Australia | 1210 | 23232413 | 1189000.0 | 24.0 | 177.0 | 71.0 | 2000 | 22.662709 | Summer | Swimming | 0.051178 | Hockey Men's Hockey | 6630 | 0.182504 |
| 8 | 43 | Cuba | 394 | 11147407 | 132900.0 | 24.0 | 175.0 | 72.0 | 1980 | 23.306680 | Summer | Athletics | 0.011922 | Baseball Men's Baseball | 2210 | 0.178281 |
| 9 | 142 | Romania | 597 | 21529967 | 441000.0 | 24.0 | 173.0 | 67.0 | 1980 | 22.491349 | Summer | Gymnastics | 0.020483 | Rowing Women's Coxed Eights | 3477 | 0.171700 |
| 10 | 74 | Hungary | 791 | 9850845 | 267600.0 | 24.0 | 176.0 | 70.0 | 1992 | 22.598140 | Summer | Gymnastics | 0.027165 | Water Polo Men's Water Polo | 4681 | 0.168981 |
| 11 | 135 | Paraguay | 17 | 6943739 | 64670.0 | 22.0 | 177.0 | 74.0 | 2004 | 23.355637 | Summer | Football | 0.009313 | Football Men's Football | 101 | 0.168317 |
| 12 | 128 | Norway | 514 | 5320045 | 364700.0 | 25.0 | 179.0 | 73.0 | 1972 | 22.545959 | Winter | Cross Country Skiing | 0.068552 | Ice Hockey Men's Ice Hockey | 3061 | 0.167919 |
| 13 | 11 | Azerbaijan | 44 | 9961396 | 167900.0 | 24.0 | 171.0 | 67.0 | 2016 | 22.832879 | Summer | Wrestling | 0.016855 | Rhythmic Gymnastics Women's Group | 263 | 0.167300 |
| 14 | 42 | Croatia | 139 | 4292095 | 94240.0 | 25.0 | 183.0 | 82.0 | 2008 | 24.092971 | Summer | Swimming | 0.021957 | Water Polo Men's Water Polo | 835 | 0.166467 |
| 15 | 59 | Finland | 724 | 5491218 | 224137.0 | 25.0 | 176.0 | 70.0 | 1952 | 22.516126 | Summer | Athletics | 0.040817 | Ice Hockey Men's Ice Hockey | 4389 | 0.164958 |
| 16 | 37 | China | 1038 | 1379302771 | 21140000.0 | 23.0 | 172.0 | 65.0 | 2008 | 21.913806 | Summer | Swimming | 0.015327 | Basketball Men's Basketball | 6555 | 0.158352 |
| 17 | 57 | Ethiopia | 53 | 105350020 | 174700.0 | 24.0 | 170.0 | 57.0 | 1980 | 19.265306 | Summer | Athletics | 0.001658 | Athletics Men's Marathon | 347 | 0.152738 |
| 18 | 47 | Denmark | 255 | 5605948 | 264800.0 | 25.0 | 180.0 | 73.0 | 1972 | 22.448015 | Summer | Swimming | 0.047236 | Handball Men's Handball | 1704 | 0.149648 |
| 19 | 115 | Montenegro | 14 | 642550 | 10610.0 | 27.0 | 185.0 | 86.0 | 2016 | 24.569314 | Summer | Water Polo | 0.016512 | Water Polo Men's Water Polo | 94 | 0.148936 |
Visualizing the dataframe grouped by country

The scatterplot below shows a fairly strong positive correlation between the number of athletes and the total medals won; in other words, the more athletes a country sends to the Olympics, the more medals that country is likely to win. This makes sense, since fielding more competitors naturally gives a country more chances to win medals. Although the size of each point represents GDP per capita, there does not appear to be a strong correlation between GDP per capita and the number of athletes or total medals; most of the top medal-winning countries have a GDP per capita that is neither very large nor very small. To account for this disparity and create a metric that is comparable across countries, we created the 'Medals per Athlete' column, computed by dividing a country's total medals by its total number of athletes.

# Create and display the scatterplot
graph = sns.scatterplot(x="Number of Athletes", y="Total Medals", size = "GDP per Capita", sizes= (10, 1000), data = df_summary, alpha=.5, legend=False)

# Specify the title
title = "Total Medals vs. Number of Athletes"

# Set the title of the plot
graph.set_title(title, size = 16)

# Add labels to the axes  
graph.set_xlabel("Number of Athletes", size = 16)
graph.set_ylabel("Total Medals", size = 16)

Text(0, 0.5, 'Total Medals')


[Figure: Total Medals vs. Number of Athletes scatterplot]

Heatmap of correlations between country-level features (compared against medals per athlete)

The heatmap below shows how strongly each pair of variables in the country-grouped dataframe is correlated. It reveals moderate-to-strong correlations between variables such as GDP and total medals, number of athletes and total medals, and GDP and number of athletes sent to the Olympics (pairs that are not derived from one another, unlike BMI, which is computed from weight and height). In addition, GDP is moderately correlated with medals per athlete, suggesting that countries with higher GDP (wealthier countries) are more likely to win more medals.

heatmap(df_summary, 0, 16, 'Reds')


[Figure: Correlation between Variables heatmap (Reds), country-level features]

The scatterplot below compares a country's GDP with its medals per athlete. The overall trend shows a slight positive correlation, meaning that the higher a country's GDP, the more medals per athlete it tends to win. However, the plot also contains many data points representing countries with low GDP but high medals per athlete, which is why the positive correlation is not stronger.

# Set sizes of figure
fig_dims = (10, 7)
fig, ax = plt.subplots(figsize=fig_dims)

# Create and display the scatterplot
graph = sns.scatterplot(x="GDP", y="Medals per Athlete", size = "Population", sizes= (10, 1000), data = df_summary, alpha=.5, ax=ax, legend=False)

# Specify the title
title = "Medals per Athlete vs. GDP"

# Set the title of the plot
graph.set_title(title, size = 16)

# Add labels to the axes  
graph.set_xlabel("GDP", size = 16)
graph.set_ylabel("Medals per Athlete", size = 16)
Text(0, 0.5, 'Medals per Athlete')


[Figure: Medals per Athlete vs. GDP scatterplot]

The bar plot below shows the top 20 medal-winning countries at the Olympics and the total medals each has won. As the plot shows, the USA leads overall with 4,000+ medals, followed by Russia (3,500+) and then Germany (3,000+).

# Set sizes of figure
fig_dims = (20, 16)
fig, ax = plt.subplots(figsize=fig_dims)

# Create and display the bar plot
graph = sns.barplot(x="Country", y="Total Medals", data = top_df, alpha=.5, ax=ax)

# Specify the title
title = "Total Medals vs. Country (Top 20 Medal-Winning Countries)"

# Set the title of the plot
graph.set_title(title, size = 16)

# Add labels to the axes  
graph.set_xlabel("Country", size = 16)
graph.set_ylabel("Total Medals", size = 16)
Text(0, 0.5, 'Total Medals')


[Figure: Total Medals vs. Country bar plot (top 20 medal-winning countries)]

To put the total medal count and the number of athletes in perspective and make each country's medal haul more comparable (since we have shown that the number of athletes a country sends is directly related to the total medals it wins), the bar plot below shows the top 20 countries by medals per athlete. As the plot makes clear, Russia has the highest medals per athlete overall at about 0.34, followed closely by the USA at 0.31 and Pakistan at 0.29.

# Set sizes of figure
fig_dims = (20, 16)
fig, ax = plt.subplots(figsize=fig_dims)

# Create and display the bar plot
graph = sns.barplot(x="Country", y="Medals per Athlete", data = medal_per_athlete_df, ax=ax)

# Specify the title
title = "Medals per Athlete vs. Country (Top 20 Medal per Athlete Value Countries)"

# Set the title of the plot
graph.set_title(title, size = 16)

# Add labels to the axes  
graph.set_xlabel("Country", size = 16)
graph.set_ylabel("Medals per Athlete", size = 16)


Text(0, 0.5, 'Medals per Athlete')


[Figure: Medals per Athlete vs. Country bar plot (top 20 medals-per-athlete countries)]

3.3. Model Training

3.3.1 Overview of Model Training Methods

Algorithms:
  1. kNN Classifier
  2. Decision Tree
  3. Naive Bayes
Sampling methods:
  1. Simple Random Sampling
  2. SMOTE Sampling
Data transformation methods:
  1. Log-transformations (see data wrangling)
  2. Standard Scaler
  3. Minmax Scaler
Evaluation metrics:
  1. Accuracy
  2. Precision
  3. Recall
  4. F1-score

3.3.2 Defining the Models as a Dictionary

# Import models: KNN, Naive Bayes, Decision Tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Store estimators as a dictionary for easy reference when training and testing
estimators = {'k-Nearest Neighbor': KNeighborsClassifier(), 'Gaussian Naive Bayes': GaussianNB(), 
              'Decision Tree': DecisionTreeClassifier()}

3.3.3 Initializing the Evaluation-Metrics Dataframe

We use this dataframe to store the results of the different modeling techniques, covering the various scaling methods, sampling methods, and hyperparameter tuning. For performance metrics we use accuracy, F1-score, recall, and precision, computed for each model and optimization technique.

For model training we use an accuracy measure called score, available in the sklearn library as .score(). It is commonly used to gauge model accuracy because it compares predicted outputs against actual outputs on the training or test set. Essentially, the features are fed into the model, which is trained to predict the output, i.e., the target variable; the predictions are then compared against the actual target values and a percentage accuracy is computed.

Besides accuracy, we also use several other common classification performance metrics: precision, recall, and F1-score, all obtained from sklearn's classification report.

Precision measures the ratio tp / (tp + fp), that is, the fraction of predicted positives that are truly positive. This metric is commonly used for classification models because it captures a model's ability to avoid false positives; in our application, that would be labeling an athlete a medal winner when they did not actually win. It is an important measure of our model's ability to answer the question: can we predict who will win? With low precision, we would mistake losers for winners.

Recall is often understood as the counterpart of precision. It measures the true positive rate, tp / (tp + fn). In the context of our project, we chose this metric because it captures how often we correctly identify the actual Olympic medal winners.

F1-score is the harmonic mean of precision and recall, computed as F1 = 2 × (precision × recall) / (precision + recall). We consider this score important for evaluating our models because it is, in a sense, the best of both worlds: precision can overstate how well we identify medalists without accidentally classifying them as 'None', while recall depends too heavily on not over-assigning medal winners, so the F1-score gives us a good middle ground.
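As a toy illustration of these definitions (not part of the original pipeline), the same quantities can be computed directly with scikit-learn; the labels below follow our target encoding, with 0 = gold and 3 = no medal:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy labels: two true golds (0), four true non-medalists (3)
y_true = [3, 3, 3, 0, 0, 3]
y_pred = [3, 3, 0, 0, 3, 3]

# Treating gold (0) as the positive class: tp = 1, fp = 1, fn = 1,
# so precision = recall = F1 = 0.5, while accuracy = 4/6
print(accuracy_score(y_true, y_pred))                # 0.666...
print(precision_score(y_true, y_pred, pos_label=0))  # 0.5
print(recall_score(y_true, y_pred, pos_label=0))     # 0.5
print(f1_score(y_true, y_pred, pos_label=0))         # 0.5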

# Different forms of the three models to test
parameters = ['Base', 'Random Sampling', 'SMOTE Sampling', 'Min Max Scaler', 'Standard Scaler', 'Min Max Scaler (Log)', 'Standard Scaler (Log)', 'Feature Selection', 'Grid Search CV']

# Empty lists to add estimator models and model selection types as column labels
methods = []
models = []

# Add to column names using estimators dictionary
for parameter in parameters:
    for key, value in estimators.items():
        methods.append(parameter)
        models.append(key)
# Different performance metrics from classification report 

# Performance metrics for each target value (0: Gold, 1: Silver, 2: Bronze, 3: None)
report_keys = ['0', '1', '2', '3']

# Different metrics and grouping functions
report_agg = ['macro avg', 'weighted avg']
report_values = ['precision', 'recall', 'f1-score', 'support']

# Initialize empty lists to add metric names
column_idx = []
column_metric = []

# Add to list of row names
# For performance metrics based off of target value (0, 1, 2, 3)
for key in report_keys:
    for value in report_values:
        column_idx.append(key)
        column_metric.append(value)

# For accuracy performance metric
column_idx.append('all')
column_metric.append('accuracy')

# For aggregate performance metrics
for agg in report_agg:
    for value in report_values:
        column_idx.append(agg)
        column_metric.append(value)
Initializing the empty evaluation-metrics dataframe with predefined columns and rows
# Define columns and rows (indices) for empty dataframe
columns = [methods, models]
indices = [column_idx, column_metric]

# Fill dataframe with 0 values (to be replaced with actual performance metric values)
data = [ [0] * len(methods) for _ in range(len(column_idx))]

# Create dataframe to store evaluation metrics
performance = pd.DataFrame(data, columns = columns, index = indices)
performance
(Output: a 25-row dataframe of zeros, to be filled in as models are evaluated. The columns form a two-level MultiIndex of method × model: Base, Random Sampling, SMOTE Sampling, Min Max Scaler, Standard Scaler, Min Max Scaler (Log), Standard Scaler (Log), Feature Selection, and Grid Search CV, each crossed with k-Nearest Neighbor, Gaussian Naive Bayes, and Decision Tree. The rows form a two-level MultiIndex of label × metric: precision, recall, f1-score, and support for each target value 0-3, one overall accuracy row, and the same four metrics for the macro avg and weighted avg aggregates.)

25 rows × 27 columns

Defining a function to add evaluation scores to the performance dataframe
from sklearn.metrics import classification_report

def metrics(method, estimator, model, predicted, y_test):
    """ method: scaling, sampling, hyperparameter tuning, etc
        estimator: different models (knn, decision tree, naive bayes)
        model: trained model from given method and estimator
        predicted: predictions from running the model on the test set
        y_test: actual values corresponding to predictions"""
    
    # Find predicted and expected outcomes of model
    expected = y_test
    
    # Calculate classification report corresponding to model
    report = classification_report(y_true=expected, y_pred=predicted, output_dict=True)
    
    # Initialize empty list to append and store evaluation matrix values
    data = []
    
    # Add in order of performance dataframe indices (0-3 -> accuracy -> aggregated metrics)
    # Append performance scores for target values (0, 1, 2, 3)
    for i in range(4):
        dct = report[str(i)]
        for metric, value in dct.items():
            data.append(value)
    
    # Append accuracy score
    data.append(report['accuracy'])
    
    # Append aggregated performance scores
    report_labels = ['macro avg', 'weighted avg']
    for label in report_labels:
        for metric, value in report[label].items():
            data.append(value)
    
    # From the data list, write each value into its spot in the predefined
    # performance dataframe (iloc with get_loc avoids chained-assignment warnings)
    col = performance.columns.get_loc((method, estimator))
    for i in range(len(data)):
        performance.iloc[i, col] = data[i]

3.3.4 Initial Model Training: Baseline Models

# Split data into train and test sets, run algorithms, and return accuracy score

# Define features and target based on non-transformed, original (encoded) dataset
from sklearn.model_selection import train_test_split
features = combined_df.drop(["target"], axis=1)
target = combined_df["target"]

def base_model():
    # Create train and test sets
    X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=3000)
    y_train = np.ravel(y_train)
    y_test = np.ravel(y_test)
    
    # Using train and test sets, run through each of the three estimators
    for name, estimator in estimators.items():
        model = estimator.fit(X=X_train, y=y_train)
        accuracy = model.score(X_test, y_test)
        predicted = model.predict(X=X_test)
        print(f'{name}: \n\t Classification accuracy on the test data: {accuracy:.2%}\n')
        metrics('Base', name, model, predicted, y_test)
base_model()
k-Nearest Neighbor: 
	 Classification accuracy on the test data: 88.51%

Gaussian Naive Bayes: 
	 Classification accuracy on the test data: 79.68%

Decision Tree: 
	 Classification accuracy on the test data: 88.74%

3.3.5 Testing Sampling Methods

Of course, our dataset does not have an even split between medal-winning and non-medal-winning athletes. This could significantly affect our results, so we tested two different types of sampling to account for the uneven distribution of target values and to determine which method leads to the best model performance.

Target value distribution
# Create pie chart of target value distribution
def piedist(df, title):
    """ df: dataframe (resampling method)
        title: name of resampling method, for pie chart title """
    target_labels = {0:'Gold', 1:'Silver', 2:'Bronze', 3:'None'}
    colors = {0:'goldenrod', 1:'silver', 2:'saddlebrown', 3:'rosybrown'}
    
    # Find counts for each target value
    df_target_unique = df['target'].value_counts()
    
    # Create labels for pie chart
    labels = []
    palette = []
    for index, row in zip(df_target_unique.index, df_target_unique):
        labels.append(f'{target_labels[index]}: {row}')
        palette.append(colors[index])

    # Create pie plot of target value distribution
    fig, ax = plt.subplots(figsize=(5,5))

    plt.pie(df_target_unique, labels=labels, autopct='%1.1f%%' , colors=palette)
    plt.title(f'Distribution of Target Value {title}', fontsize=15)
piedist(combined_df, '')


[Figure: Distribution of Target Value pie chart]

In the cells below, we trained six different models using three algorithms (k-Nearest Neighbors, Gaussian Naive Bayes, and Decision Tree) and two resampling techniques (random resampling and SMOTE resampling), then compared the accuracies of the six models. We use resampling because our target variable is imbalanced. Random resampling undersamples so that every target value has 9,800 samples; SMOTE oversamples the minority classes so that every target value has 185,547 samples.

3.3.6 Simple Random Sampling

# Sets all target values to have a count of 9,800, since silver has the least samples
random_resampled_df = combined_df.groupby('target').apply(lambda x: x.sample(n=9800)).reset_index(drop = True)

# Create pie chart of distribution
piedist(random_resampled_df, title = '(Simple Random)')


[Figure: Distribution of Target Value (Simple Random) pie chart]

3.3.7 SMOTE Sampling

from collections import Counter
from imblearn.over_sampling import SMOTE

# Oversampling minority target values
X = combined_df.drop(["target"], axis =1)
y = combined_df["target"]

sm = SMOTE(sampling_strategy = 'all', random_state=3000)
X_resampled, y_resampled = sm.fit_resample(X, y)

# Create new dataframe of resampled values
df_target_smote = pd.DataFrame(y_resampled, columns =['target'])
smote_resampled_df = pd.concat([X_resampled, df_target_smote], axis=1)

# Create pie plot of target value distribution after smote resampling
piedist(smote_resampled_df, title = '(SMOTE)')


[Figure: Distribution of Target Value (SMOTE) pie chart]

3.3.8 Comparison of Sampling Methods

As the output below shows, for kNN and Decision Tree the SMOTE-resampled data achieves higher accuracy than the randomly resampled data. kNN accuracy improves from 58.71% with random resampling to 90.85% with SMOTE resampling, and Decision Tree accuracy improves from 65.14% to 90.94%. Gaussian Naive Bayes, however, differs little between the two resampling techniques, and overall this model is far less accurate than kNN and Decision Tree, staying below 30%.

# Define dictionary of sampling methods for easy use later
sampling_techniques = {"Random Sampling" : random_resampled_df, 
                       "SMOTE Sampling" : smote_resampled_df}


def best_sampling(): 
    # Loop through sampling methods
    for sampling_name, df in sampling_techniques.items():
        
        # Redefine feature and target for each resampled dataset
        features = df.drop(["target"], axis=1)
        target = df["target"]
        
        # Split into train and test sets
        X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=3000)
        y_train = np.ravel(y_train)
        y_test = np.ravel(y_test)
    
        # Train and return accuracy score, add to performance metrics dataframe
        for name, estimator in estimators.items():
            model = estimator.fit(X=X_train, y=y_train)
            accuracy = model.score(X_test, y_test)
            print(f'{name}: \n\t Classification accuracy on the test data with {sampling_name}: {accuracy:.2%}\n')
            predicted = model.predict(X=X_test)
            metrics(sampling_name, name, model, predicted, y_test)
best_sampling()
k-Nearest Neighbor: 
	 Classification accuracy on the test data with Random Sampling: 58.71%

Gaussian Naive Bayes: 
	 Classification accuracy on the test data with Random Sampling: 27.47%

Decision Tree: 
	 Classification accuracy on the test data with Random Sampling: 65.14%

k-Nearest Neighbor: 
	 Classification accuracy on the test data with SMOTE Sampling: 90.85%

Gaussian Naive Bayes: 
	 Classification accuracy on the test data with SMOTE Sampling: 27.64%

Decision Tree: 
	 Classification accuracy on the test data with SMOTE Sampling: 90.94%

# SMOTE resampled dataset without log transformations
features = smote_resampled_df.drop(["target"], axis=1)
target = smote_resampled_df["target"]

3.3.9 Data Transformations

In addition to the log transformations, we also wanted to use the standard scaler and the min-max scaler to see whether they correct the value distributions of the original dataset more effectively. For this part, we compare the scalers' performance on the log-transformed data and on the raw features (no transformation).

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Define tests to perform
# Scaling before log and normalized transformations
scalers = {"Min Max Scaler" : MinMaxScaler(), "Standard Scaler" : StandardScaler()}
# Scaling after log and normalized transformations
scalers_log = {"Min Max Scaler (Log)": MinMaxScaler(), "Standard Scaler (Log)": StandardScaler()}
def best_preprocessing(scalers_dict, features, target): 
    """ scalers_dict: either scalers or scalers log for scaling methods to test
        features: predefined features dataframe (from SMOTE resampling)
        target: predefined target series (from SMOTE resampling)
    """
    # Split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=3000)
    
    # Loop through and test min max and standard scaler
    for scaling_name, scaling_method in scalers_dict.items():
        
        # Define, fit, and test scaling method
        scaler = scaling_method
        scaler.fit(X_train) 
        X_train_scaled = scaler.transform(X_train) 
        X_test_scaled = scaler.transform(X_test)
        
        # Use trained scaler to test the three models
        for name, estimator in estimators.items():
            model = estimator.fit(X=X_train_scaled, y=y_train)
            accuracy_train = model.score(X_train_scaled, y_train)
            accuracy_test = model.score(X_test_scaled, y_test)
            
            # Add performance metrics to performance dataframe
            # (predict on the scaled test set, matching the data the model was fit on)
            predicted = model.predict(X=X_test_scaled)
            metrics(scaling_name, name, model, predicted, y_test)
            
            # Return accuracy scores of different scaling methods
            print(f'{name}:')
            print(f'\t Classification accuracy on the training data with {scaling_name}: {accuracy_train:.2%}')
            print(f'\n\t Classification accuracy on the test data with {scaling_name}: {accuracy_test:.2%}\n')

3.3.10 Testing Scalers without Log Transformation

In the next cell we trained six different models using the three algorithms (k-Nearest Neighbors, Gaussian Naive Bayes, and Decision Tree) and two scaling techniques (Min Max Scaler and Standard Scaler), on data already resampled with SMOTE, since SMOTE gave higher accuracy than random resampling. The Min Max Scaler yields slightly higher kNN accuracy than the Standard Scaler (87.88% vs. 87.00%). Naive Bayes accuracy is identical under both scalers, at 32.24%. Decision Tree accuracy is essentially the same under both, with the Min Max Scaler marginally higher (90.90% vs. 90.89%). kNN and Naive Bayes show no signs of overfitting under either scaler; the Decision Tree, however, appears to overfit with both scaling methods: its accuracy is 99.97% on the training data but only about 91% on the test data, which suggests the tree we built is too complex for the amount of information we have and fits the particularities of our training set too closely.

best_preprocessing(scalers, features, target)
k-Nearest Neighbor:
	 Classification accuracy on the training data with Min Max Scaler: 91.87%

	 Classification accuracy on the test data with Min Max Scaler: 87.88%

Gaussian Naive Bayes:
	 Classification accuracy on the training data with Min Max Scaler: 32.44%

	 Classification accuracy on the test data with Min Max Scaler: 32.24%

Decision Tree:
	 Classification accuracy on the training data with Min Max Scaler: 99.97%

	 Classification accuracy on the test data with Min Max Scaler: 90.90%

k-Nearest Neighbor:
	 Classification accuracy on the training data with Standard Scaler: 91.19%

	 Classification accuracy on the test data with Standard Scaler: 87.00%

Gaussian Naive Bayes:
	 Classification accuracy on the training data with Standard Scaler: 32.44%

	 Classification accuracy on the test data with Standard Scaler: 32.24%

Decision Tree:
	 Classification accuracy on the training data with Standard Scaler: 99.97%

	 Classification accuracy on the test data with Standard Scaler: 90.89%

3.3.11 Testing Scalers with Log Transformation

The data used here has already been log-transformed to correct skewness: skew is present in our continuous numerical variables, so applying a log transformation brings these features closer to a normal distribution. In the next cell we again trained six models using the three algorithms (k-Nearest Neighbors, Gaussian Naive Bayes, and Decision Tree) and the two scaling techniques (Min Max Scaler and Standard Scaler) on SMOTE-resampled data. The Min Max Scaler yields slightly higher kNN accuracy than the Standard Scaler (88.88% vs. 87.93%). Naive Bayes accuracy is identical under both scalers, at 84.60%, a large improvement over the untransformed features. Decision Tree accuracy is identical under both scalers, at 88.72%. Neither kNN nor Naive Bayes shows signs of overfitting under either scaler, but the Decision Tree again appears to overfit: its accuracy is 100% on the training data and only about 89% on the test data.

# Define features and target dataframes from previous, log-transformed dataset
features = df_log_transf.drop(["target"], axis=1)
target = df_log_transf["target"]

# Find best preprocessing (scaler vs minmax of log-transformed data)
best_preprocessing(scalers_log, features, target)
k-Nearest Neighbor:
	 Classification accuracy on the training data with Min Max Scaler (Log): 91.79%

	 Classification accuracy on the test data with Min Max Scaler (Log): 88.88%

Gaussian Naive Bayes:
	 Classification accuracy on the training data with Min Max Scaler (Log): 84.37%

	 Classification accuracy on the test data with Min Max Scaler (Log): 84.60%

Decision Tree:
	 Classification accuracy on the training data with Min Max Scaler (Log): 100.00%

	 Classification accuracy on the test data with Min Max Scaler (Log): 88.72%

k-Nearest Neighbor:
	 Classification accuracy on the training data with Standard Scaler (Log): 90.90%

	 Classification accuracy on the test data with Standard Scaler (Log): 87.93%

Gaussian Naive Bayes:
	 Classification accuracy on the training data with Standard Scaler (Log): 84.37%

	 Classification accuracy on the test data with Standard Scaler (Log): 84.60%

Decision Tree:
	 Classification accuracy on the training data with Standard Scaler (Log): 100.00%

	 Classification accuracy on the test data with Standard Scaler (Log): 88.72%

3.3.12 Feature Selection: Recursive Feature Elimination

# Define features and target for feature selection and hyperparameter tuning
features = smote_resampled_df.drop(["target"], axis=1)
target = smote_resampled_df["target"]
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier  # RFE below uses the classifier, not a regressor

def RFE_feature_selection(estimator, n):
    """ estimator: define which modeling technique (kNN, tree, naive bayes),
        n: define number of features to select """
    method = estimators[estimator]
    X_train, X_test, y_train, y_test = train_test_split(features, target, random_state = 3000)
    
    # Get the most important features using RFE
    important_features = RFE(DecisionTreeClassifier(random_state = 3000), n_features_to_select = n)
    
    # Fit the RFE selector to the training data 
    important_features.fit(X_train, y_train)
    
    # Transform training and testing sets so that only the important selected features remain 
    X_train_selected = important_features.transform(X_train)
    X_test_selected = important_features.transform(X_test)
    
    # Apply method to the training data 
    model = method.fit(X=X_train_selected, y=y_train)
    
    print("Selected features after RFE: ")
    
    # Iterates through the number of columns in the df 
    for i in range(len(features.columns)): 
        
        # If the value corresponding with the column of the feature is true, the feature was used in the model
        if (important_features.support_[i] == True): 
            print("\t", features.columns[i])
            
    # Prints accuracy for the model on the training and testing sets 
    print("\n" + "Performance with selected features: ")
    print("\t" + "Accuracy for the training set: ", model.score(X_train_selected, y_train))
    print("\t" + "Accuracy for the testing set: ", model.score(X_test_selected, y_test))
    
    # Evaluate performance metrics and add to performance dataframe
    predicted = model.predict(X=X_test_selected)
    metrics('Feature Selection', estimator, model, predicted, y_test)

    
Feature selection with the kNN classifier

n-values we tried: 4, 5, 6, 7, 8, 9, 10

Using recursive feature elimination (RFE), we identified the most important features for each of the three classification models. For the kNN classifier, the highest accuracy came from 8 features: Country, Year, Event, GDP, Age, BMI, Height, and Weight. Accuracy on the test set is about 90.69%, and the model shows no overfitting. This suggests the 8 most important features in our model are those listed above, yielding accuracy comparable to the kNN model without feature selection.

RFE_feature_selection('k-Nearest Neighbor', 8)
Selected features after RFE: 
	 Country
	 Year
	 Event
	 GDP
	 Age
	 BMI
	 Height
	 Weight

Performance with selected features: 
	Accuracy for the training set:  0.9345636317628309
	Accuracy for the testing set:  0.9068930155333685
Feature selection with the Decision Tree classifier

n-values we tried: 4, 5, 6, 7, 8, 9, 10

For the Decision Tree classifier, the highest accuracy came from 9 features: Country, Year, Sport, Event, GDP, Age, BMI, Height, and Weight. Accuracy on the test set is about 86.10%, while accuracy on the training set is close to 100%, indicating that the model overfits. RFE suggests the 9 most important features are those listed above, but the model's accuracy does not clearly improve over the data without feature selection.

RFE_feature_selection('Decision Tree', 9)
Selected features after RFE: 
	 Country
	 Year
	 Sport
	 Event
	 GDP
	 Age
	 BMI
	 Height
	 Weight

Performance with selected features: 
	Accuracy for the training set:  0.9994951863049973
	Accuracy for the testing set:  0.8609840094423515


朴素贝叶斯分类器的特征选择

n-values tried: 3, 4, 5, 6, 7, 8, 9

For the naive Bayes classifier, the highest accuracy was obtained with 5 features: Year, Event, Age, BMI, and Height. The test-set accuracy was about 30.66%, and the model does not overfit. This suggests these 5 features are the most important for the naive Bayes model, giving slightly higher accuracy (by less than 1%) than the model without feature selection.

RFE_feature_selection('Gaussian Naive Bayes', 5)
Selected features after RFE: 
	 Year
	 Event
	 Age
	 BMI
	 Height

Performance with selected features: 
	Accuracy for the training set:  0.3053997100465111
	Accuracy for the testing set:  0.3065638355780476



3.4. Model Optimization

…

3.4.1 Hyperparameter tuning with GridSearch

from sklearn.model_selection import GridSearchCV

def grid_search(estimator, param_grid):
    """ estimator: model to test with (knn, tree, naive bayes)
        param_grid: dictionary of different parameters and values to testa nd compare"""
    # Use grid search to find best parameters
    method = estimators[estimator]
    grid_search = GridSearchCV(method, param_grid, cv=5)
    
    # Split into train and test sets, fit with grid search
    X_train, X_test, y_train, y_test = train_test_split(features, target, random_state = 3000)
    grid_search.fit(X=X_train, y=y_train)
    
    # Print resulting best parameters and evaluation metrics
    print("Best parameters: ", grid_search.best_params_)
    
    #print("Best cross-validation score: ", grid_search.best_score_)
    print("Test set score: ", grid_search.score(X_test, y_test))
    
    # Evaluate performance metrics and add to performance dataframe
    predicted = grid_search.predict(X=X_test)
    metrics('Grid Search CV', estimator, grid_search, predicted, y_test)

GridSearch for the kNN classifier

For kNN, we tested the parameters n_neighbors, weights, and metric to find the combination that produced the most accurate test-set score. The best result used the manhattan metric, 1 neighbor, and uniform weights, with an accuracy of 93.45%.

# Hyperparameters to test
param_grid = {"n_neighbors":[1, 5, 15, 25, 55], "weights":["uniform", "distance"], "metric":["euclidean", "manhattan", "minkowski"]}

grid_search('k-Nearest Neighbor', param_grid)
Best parameters:  {'metric': 'manhattan', 'n_neighbors': 1, 'weights': 'uniform'}
Test set score:  0.9344802125606989
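
Beyond the single best combination, GridSearchCV also records the cross-validated score of every parameter combination in its cv_results_ attribute. A sketch of inspecting it, assuming grid_search is modified to return the fitted GridSearchCV object gs rather than only printing:

# Hypothetical: gs is the fitted GridSearchCV object returned by grid_search()
results = pd.DataFrame(gs.cv_results_)
# Rank the parameter combinations by mean cross-validated accuracy
print(results[['params', 'mean_test_score', 'std_test_score']]
      .sort_values('mean_test_score', ascending=False)
      .head())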


GridSearch for the decision tree classifier

For the decision tree model, we tested the parameters criterion, max_depth, and min_samples_leaf to determine the combination that produced the highest test-set accuracy. The result was a criterion of entropy, a max_depth of 55, and a min_samples_leaf of 1, giving a test-set accuracy of 86.75%. (A tree this deep with single-sample leaves is consistent with the overfitting seen in the feature-selection results above.)

# Hyperparameters to test
param_grid = {"criterion":["gini", "entropy"], "max_depth":[5, 11, 15, 25, 35, 55], "min_samples_leaf":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}

grid_search('Decision Tree', param_grid)
Best parameters:  {'criterion': 'entropy', 'max_depth': 55, 'min_samples_leaf': 1}
Test set score:  0.8674675419166034


GridSearch for the naive Bayes classifier

For the naive Bayes classifier, we tested the var_smoothing parameter, which adds a portion of the largest feature variance to all variances for numerical stability, to see which value produced the most accurate model. With var_smoothing ≈ 0.019, the test-set score was 29.60%.

# Hyperparameters to test
param_grid = {'var_smoothing': np.logspace(0,-9, num=100)}

grid_search('Gaussian Naive Bayes', param_grid)
Best parameters:  {'var_smoothing': 0.01873817422860384}
Test set score:  0.29601664268352496
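
For reference, np.logspace(0, -9, num=100) generates 100 logarithmically spaced candidates from 1 down to 1e-9, so the grid scans var_smoothing across nine orders of magnitude:

grid = np.logspace(0, -9, num=100)
print(grid[0], grid[-1])  # 1.0 1e-09
print(grid[19])           # ~0.018738, the best var_smoothing found above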



3.5. Model Testing

3.5.1 Comparing models to find the best one

performance
[Output: the performance DataFrame, 25 rows × 27 columns. Rows hold precision, recall, f1-score, and support for each of the four target classes (0–3), plus overall accuracy and macro/weighted averages. Columns are the 27 model variants: Base, Random Sampling, SMOTE Sampling, Min Max Scaler, Standard Scaler, Min Max Scaler (Log), Standard Scaler (Log), Feature Selection, and Grid Search CV, each paired with k-Nearest Neighbor, Gaussian Naive Bayes, and Decision Tree. The full table is saved to model_evaluation.csv below.]

# Save evaluation metrics dataframe as csv
performance.to_csv('model_evaluation.csv')
Best-performing models:

To find the best-performing model, we averaged each model's metric scores in the performance DataFrame. The three highest means belong to the grid-searched kNN (mean score: 93.45%), the SMOTE-sampled kNN (88.54%), and the feature-selected kNN (88.36%). Overall, the kNN estimator performed best.

# Define subset of original dataframe without support metric value
no_support = []
for row in performance.index:
    if row[1] != 'support':
        no_support.append(row)
final_vals = performance.loc[no_support]

# Find mean of each model using the metric scores and sort in descending order
means = final_vals.mean(axis=0)
means.sort_values(ascending=False)
Grid Search CV         k-Nearest Neighbor      0.934482
SMOTE Sampling         k-Nearest Neighbor      0.885428
Feature Selection      k-Nearest Neighbor      0.883600
Grid Search CV         Decision Tree           0.867675
SMOTE Sampling         Decision Tree           0.865906
Feature Selection      Decision Tree           0.861213
Base                   Decision Tree           0.663266
Random Sampling        Decision Tree           0.579380
Base                   k-Nearest Neighbor      0.552074
Random Sampling        k-Nearest Neighbor      0.496953
Standard Scaler (Log)  k-Nearest Neighbor      0.356353
Base                   Gaussian Naive Bayes    0.353824
Min Max Scaler (Log)   k-Nearest Neighbor      0.351436
Min Max Scaler (Log)   Decision Tree           0.344663
Standard Scaler (Log)  Gaussian Naive Bayes    0.337402
Min Max Scaler (Log)   Gaussian Naive Bayes    0.335719
Standard Scaler (Log)  Decision Tree           0.309593
Feature Selection      Gaussian Naive Bayes    0.302016
Grid Search CV         Gaussian Naive Bayes    0.254847
Min Max Scaler         Decision Tree           0.218359
Random Sampling        Gaussian Naive Bayes    0.212778
SMOTE Sampling         Gaussian Naive Bayes    0.211891
Min Max Scaler         k-Nearest Neighbor      0.201365
Standard Scaler        Decision Tree           0.182212
Min Max Scaler         Gaussian Naive Bayes    0.143427
Standard Scaler        Gaussian Naive Bayes    0.143427
Standard Scaler        k-Nearest Neighbor      0.142986
dtype: float64
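
As an aside, the same support-row filter can be written directly against the row index, without the explicit loop. A sketch, equivalent to the code above, assuming performance has a two-level row MultiIndex as the tuple indexing in the loop implies:

# Drop all 'support' rows via the second index level, then average each model column
final_vals = performance[performance.index.get_level_values(1) != 'support']
means = final_vals.mean(axis=0).sort_values(ascending=False)
print(means.head(3))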

Summary

We compared the k-nearest neighbors, naive Bayes, and decision tree algorithms. We used two resampling techniques (random resampling and SMOTE) and three data-transformation techniques (log transformation, min-max scaling, and standard scaling), and then applied feature selection and hyperparameter tuning on top of all models. Across the various combinations of these methods, there were 27 models in total; the enumeration below shows how that count arises.
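
The count of 27 comes from crossing the nine pipeline variants recorded in the performance table with the three estimators; a quick enumeration (variant names taken from the table's column headers):

from itertools import product

variants = ['Base', 'Random Sampling', 'SMOTE Sampling',
            'Min Max Scaler', 'Standard Scaler',
            'Min Max Scaler (Log)', 'Standard Scaler (Log)',
            'Feature Selection', 'Grid Search CV']
models = ['k-Nearest Neighbor', 'Gaussian Naive Bayes', 'Decision Tree']

combos = list(product(variants, models))
print(len(combos))  # 27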
(remainder omitted)
