挑战
在下面查看之前,请尝试回答有关数据集的一些高级问题。
我们如何操作裁判给黑皮球员更多红牌的问题?
- 反事实:如果球员体重较轻,裁判在同样的条件下可能会给同一个犯规者一张黄牌或无牌。
- 回归:考虑到混杂因素,深色玩家对比例红牌/总牌的回归系数为正。
潜在问题
- 如何将Rater1和Rater2结合起来?平均?如果他们不同意怎么办?把它扔掉?
- 数据是否失衡,即红牌非常罕见?
- 数据是否有偏差,即玩家有不同的游戏时间?这是他们整个职业生涯的总结吗?
- 我怎么知道我已经解释了所有形式的混淆?### 首先,是否对所有参考文献都有系统的区分?
探索/假设:
- 游戏分布
- 红牌vs游戏
- 每局游戏的红牌与每局游戏的总牌数(按肤色)
- 按平均肤色对所有玩家的每场游戏分配红、黄、总牌数和分数红
- 玩家遇到了多少裁判?
- 有没有一些俱乐部更具侵略性,得到更多的卡?还是更保守,得到更少?
- 梳毛会因国家而异吗?
- 高分者在同一位置得到更多的松弛(更少的牌)吗?
- 有没有裁判员的红牌/黄牌比其他裁判员多?
- 评分者的一致性如何?看看科恩的卡帕。
- 红牌的位置有什么不同?比如防守队员得到更多?
- 有更多游戏的玩家会得到更多的牌吗?皮肤颜色有区别吗?
- 是否显示偏差取决于参考国家?
了解数据的组织方式
- 数据集是一个单独的csv,它将裁判和球员之间的每一个交互聚合成一行。换句话说:裁判员A裁判员A裁判员B裁判员参加了10场比赛,在这10场比赛中给了2张红牌。然后在数据集中会有一个唯一的行,它说:
裁判A,球员B,2张红牌……- 这有几个含义,使得理解和处理这些数据的第一步有点棘手。首先,关于球员B的信息每次都会重复,这意味着如果我们简单地对某个指标进行平均,我们可能会得到误导性的结果。
例如,问“运动员的平均体重是多少?”
1.数据读取与预处理
1.1 查看数据
#from __future__ import absolute_import, division, print_function
import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib.pyplot import GridSpec
import seaborn as sns
import numpy as np
import pandas as pd
import os, sys
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')
sns.set_context("poster", font_scale=1.3)
#import missingno as msno
#import pandas_profiling
from sklearn.datasets import make_blobs
import time
df = pd.read_csv("redcard.csv.gz", compression='gzip')
df.shape
df.head()
df.describe().T
1.2 查看数据类型
df.dtypes
1.3 查看列
all_columns = df.columns.tolist()
all_columns
2.数据切分模块
2.1 球员的特征信息汇成透视表
player_index = 'playerShort' #选手
player_cols = [#'player', # drop player name, we have unique identifier
'birthday',
'height',
'weight',
'position',
'photoID',
'rater1',
'rater2',
]
all_cols_unique_players = df.groupby('playerShort').agg({col:'nunique' for col in player_cols})
all_cols_unique_players.head()
2.2 查看球员信息有无重复
all_cols_unique_players[all_cols_unique_players > 1].dropna().head()
没有重复值!
2.3 查看球员重复信息函数
def get_subgroup(dataframe, g_index, g_columns):
"""Helper function that creates a sub-table from the columns and runs a quick uniqueness test."""
g = dataframe.groupby(g_index).agg({col:'nunique' for col in g_columns})
if g[g > 1].dropna().shape[0] != 0:
print("Warning: you probably assumed this had all unique values but it doesn't.")
return dataframe.groupby(g_index).agg({col:'max' for col in g_columns})
- 调用函数
players = get_subgroup(df, player_index, player_cols)
players.head()
2.4 保存数据函数
def save_subgroup(dataframe, g_index, subgroup_name, prefix='raw_'):
save_subgroup_filename = "".join([prefix, subgroup_name, ".csv.gz"])
dataframe.to_csv(save_subgroup_filename, compression='gzip', encoding='UTF-8')
test_df = pd.read_csv(save_subgroup_filename, compression='gzip', index_col=g_index, encoding='UTF-8')
# Test that we recover what we send in
if dataframe.equals(test_df):
print("Test-passed: we recover the equivalent subgroup dataframe.")
else:
print("Warning -- equivalence test!!! Double-check.")
- 调用函数
save_subgroup(players, player_index, "players")
2.5 俱乐部与国家的关系
club_index = 'club'
club_cols = ['leagueCountry']
clubs = get_subgroup(df, club_index, club_cols)
clubs.head()
- 查看特征数量
clubs['leagueCountry'].value_counts()
- 保存文件
save_subgroup(clubs, club_index, "clubs", )
2.6 裁判和裁判国家的关系
referee_index = 'refNum'
referee_cols = ['refCountry']
referees = get_subgroup(df, referee_index, referee_cols)
referees.head()
- 查看裁判的国家数
referees.refCountry.nunique()
裁判来自161个不同的国家
- 查看数据形状
referees.shape
3147名裁判
- 保存模型
save_subgroup(referees, referee_index, "referees")
2.7 裁判和球员的关系
dyad_index = ['refNum', 'playerShort']
dyad_cols = ['games',
'victories',
'ties',
'defeats',
'goals',
'yellowCards',
'yellowReds',
'redCards',
]
dyads = get_subgroup(df, g_index=dyad_index, g_columns=dyad_cols)
dyads.head()
- 查看形状
dyads.shape
- 查看球员红牌超1个的信息
dyads[dyads.redCards > 1].head(10)
- 保存模型
save_subgroup(dyads, dyad_index, "dyads")
3. 缺失值可视化分析
3.1 安装missingno缺失值库
3.2 missingno.matrix缺失值分布图
players = load_subgroup("raw_players.csv.gz")
players.shape
players.head()
msno.matrix(players.sample(500),
figsize=(16, 7),
width_ratios=(15, 1))
3.3 missingno.matrix缺失值直方图
msno.bar(players.sample(500))
3.4 missingno.heatmap查看缺失值之间的相关性
msno.heatmap(players.sample(500),
figsize=(16, 7),)
如photoID缺失,rate2 100%缺失
3.5 查看缺失值数量
print("All players:", len(players))
print("rater1 nulls:", len(players[(players.rater1.isnull())]))
print("rater2 nulls:", len(players[players.rater2.isnull()]))
print("Both nulls:", len(players[(players.rater1.isnull()) & (players.rater2.isnull())]))
- 去除rate1
缺失值
players = players[players.rater1.notnull()]
players.shape[0]
- 查看缺失值直方图
msno.matrix(players.sample(500),
figsize=(16, 7),
width_ratios=(15, 1))
可以看到,与rater1相关的rater2和photoID都没有缺失值
- 按类别分组,统计各个分组中值的频数
pd.crosstab(players.rater1, players.rater2)
- 画出各个分组中值的频数热力图
fig, ax = plt.subplots(figsize=(12, 8))
sns.heatmap(pd.crosstab(players.rater1, players.rater2), cmap='Blues', annot=True, fmt='d', ax=ax)
ax.set_title("Correlation between Rater 1 and Rater 2\n")
fig.tight_layout()
3.6 取评分平均值
players['skintone'] = players[['rater1', 'rater2']].mean(axis=1)
players.head()
- 画出评分平均值的分布
sns.distplot(players.skintone, kde=False);
3.7 位置信息的统计与划分
MIDSIZE = (12, 8)
fig, ax = plt.subplots(figsize=MIDSIZE)
players.position.value_counts(dropna=False, ascending=True).plot(kind='barh', ax=ax)
ax.set_ylabel("Position")
ax.set_xlabel("Counts")
fig.tight_layout()
位置信息有点多,可以将其集中划分为几类:
- 位置种类
position_types = players.position.unique()
position_types
- 将位置统一划分
- isin()接受一个列表,判断该列中元素是否在列表中。
defense = ['Center Back','Defensive Midfielder', 'Left Fullback', 'Right Fullback', ]
midfield = ['Right Midfielder', 'Center Midfielder', 'Left Midfielder',]
forward = ['Attacking Midfielder', 'Left Winger', 'Right Winger', 'Center Forward']
keeper = 'Goalkeeper'
players.loc[players['position'].isin(defense), 'position_agg'] = "Defense"
players.loc[players['position'].isin(midfield), 'position_agg'] = "Midfield"
players.loc[players['position'].isin(forward), 'position_agg'] = "Forward"
players.loc[players['position'].eq(keeper), 'position_agg'] = "Keeper"
- 画出分好位置的数量
MIDSIZE = (12, 8)
fig, ax = plt.subplots(figsize=MIDSIZE)
players['position_agg'].value_counts(dropna=False, ascending=True).plot(kind='barh', ax=ax)
ax.set_ylabel("position_agg")
ax.set_xlabel("Counts")
fig.tight_layout()
3.8 使用散点图矩阵图,可以两两发现特征之间的联系
- 身高,体重,肤色的关系
from pandas.tools.plotting import scatter_matrix
fig, ax = plt.subplots(figsize=(10, 10))
scatter_matrix(players[['height', 'weight', 'skintone']], alpha=0.2, diagonal='hist', ax=ax);
可以看出,身高和体重与肤色并无明显关系
- 体重和身高的关系散点图矩阵图
fig, ax = plt.subplots(figsize=MIDSIZE)
sns.regplot('weight', 'height', data=players, ax=ax)
ax.set_ylabel("Height [cm]")
ax.set_xlabel("Weight [kg]")
fig.tight_layout()
可以看出,体重和身高成线性相关
3.9 身高与体重划分
- pd.qcut(df, len(), columns)
#身高
height_categories = ["vlow_height",
"low_height",
"mid_height",
"high_height",
"vhigh_height",
]
players['heightclass'] = pd.qcut(players['height'],
len(height_categories),
height_categories)
#体重
weight_categories = ["vlow_weight",
"low_weight",
"mid_weight",
"high_weight",
"vhigh_weight",
]
players['weightclass'] = pd.qcut(players['weight'],
len(weight_categories),
weight_categories)
players.head()
3.10 肤色划分
print (players['skintone'])
pd.qcut(players['skintone'], 3)
players['skintoneclass'] = pd.qcut(players['skintone'], 3)
players
4.报表可视化分析(Pandas profiling)
4.1 总体数据
- 首先是数据集信息:
- 变量数(列)、观察数(行)、数据缺失率、内存;
- 数据类型的分布情况
4.2 警告信息
详细信息:https://blog.csdn.net/weixin_43746433/article/details/97119966
5. 红牌和肤色的关系
问题——如何处理生日专栏?
需要考虑的几点:
- 这是截至2012-2013赛季的一个职业生涯数据集
- 2012年玩家的年龄(不一定)与收到红牌的日期相对应。
5.1 生日转年龄
players.birthday.head()
players['birth_date'] = pd.to_datetime(players.birthday, format='%d.%m.%Y')
players['age_years'] = ((pd.to_datetime("2013-01-01") - players['birth_date']).dt.days)/365.25 #润年+1/4
players['age_years']
5.2 划分的列的报表可视化分析(Pandas profiling)
players.head()
- 所有列名
players_cleaned_variables = players.columns.tolist()
players_cleaned_variables
- 可视化报表
players_cleaned_variables = [#'birthday',
'height',
'weight',
# 'position',
# 'photoID',
# 'rater1',
# 'rater2',
'skintone',
'position_agg',
'weightclass',
'heightclass',
'skintoneclass',
# 'birth_date',
'age_years']
pandas_profiling.ProfileReport(players[players_cleaned_variables])
5.3 划分的列保存数据
players[players_cleaned_variables].to_csv("cleaned_players.csv.gz", compression='gzip')
6. 数据更多融合
6.1 裁判和球员信息
clean_players = load_subgroup("cleaned_players.csv.gz")
players = load_subgroup("raw_players.csv.gz", )
countries = load_subgroup("raw_countries.csv.gz")
referees = load_subgroup("raw_referees.csv.gz")
agg_dyads = pd.read_csv("raw_dyads.csv.gz", compression='gzip', index_col=[0, 1])
agg_dyads.head(10)
6.2 逻辑判断
- 局数=胜场+平局+失败?
all(agg_dyads['games'] == agg_dyads.victories + agg_dyads.ties + agg_dyads.defeats)
6.3 数据融合
agg_dyads['totalRedCards'] = agg_dyads['yellowReds'] + agg_dyads['redCards']
agg_dyads.rename(columns={'redCards': 'strictRedCards'}, inplace=True)
agg_dyads.head()
clean_players.head()
agg_dyads.head()
agg_dyads.reset_index().head()
agg_dyads.reset_index().set_index('playerShort').head()
- 合并agg_dyads与clean_players
clean_dyads = (agg_dyads.reset_index()[agg_dyads.reset_index()
.playerShort
.isin(set(clean_players.index))
]).set_index(['refNum', 'playerShort'])
6.4 查看裁判与球员与牌的数量
- 裁判给牌数量
tidy_dyads.head()
(tidy_dyads.groupby(level=0)
.sum()
.sort_values('redcard', ascending=False)
.rename(columns={'redcard':'total redcards given'})).head()
- 球员被给牌数量
(tidy_dyads.groupby(level=1)
.sum()
.sort_values('redcard', ascending=False)
.rename(columns={'redcard':'total redcards received'})).head()
6.5 肤色与各类的关系
tidy_dyads.groupby(level=1).sum().sort_values(ascending=False, by='redcard').head()
- 合并
player_ref_game = (tidy_dyads.reset_index()
.set_index('playerShort')
.merge(clean_players,
left_index=True,
right_index=True)
)
player_ref_game.head()
bootstrap = pd.concat([player_ref_game.sample(replace=True,
n=10000).groupby('skintone').mean()
for _ in range(100)])
player_ref_game.sample(replace=True,n=10000).groupby('skintone').mean()
- 绘图
ax = sns.regplot(bootstrap.index.values,
y='redcard',
data=bootstrap,
lowess=True,
scatter_kws={'alpha':0.4,},
x_jitter=(0.125 / 4.0))
ax.set_xlabel("Skintone");