These notes record an exploratory data analysis of a collected dataset of soccer players, done before any specific modeling task. I consider them a useful reference: several of the functions can serve as templates for other analyses. The code consists of representative snippets rather than a start-to-finish pipeline.
Goal:
Exploratory data analysis (EDA). Challenge question: what are referees thinking when they hand out red cards, and could their decisions be related to a player's skin color?
Dataset overview:
The data contain player and referee information from matches in the 2012-2013 season, covering 2,053 players and 3,147 referees in total. The features are listed below:
Variable Name | Variable Description |
---|---|
playerShort | short player ID |
player | player name |
club | player club |
leagueCountry | country of player club (England, Germany, France, and Spain) |
height | player height (in cm) |
weight | player weight (in kg) |
position | player position |
games | number of games in the player-referee dyad |
goals | number of goals in the player-referee dyad |
yellowCards | number of yellow cards player received from the referee |
yellowReds | number of yellow-red cards player received from the referee |
redCards | number of red cards player received from the referee |
photoID | ID of player photo (if available) |
rater1 | skin rating of photo by rater 1 |
rater2 | skin rating of photo by rater 2 |
refNum | unique referee ID number (referee name removed for anonymizing purposes) |
refCountry | unique referee country ID number |
meanIAT | mean implicit bias score (using the race IAT) for referee country |
nIAT | sample size for race IAT in that particular country |
seIAT | standard error for mean estimate of race IAT |
meanExp | mean explicit bias score (using a racial thermometer task) for referee country |
nExp | sample size for explicit bias in that particular country |
seExp | standard error for mean estimate of explicit bias measure |
Importing libraries
from __future__ import absolute_import, division, print_function
%matplotlib inline
%config InlineBackend.figure_format='retina'
import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib.pyplot import GridSpec
import seaborn as sns
import numpy as np
import pandas as pd
import os, sys
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')
sns.set_context("poster", font_scale=1.3)
import missingno as msno #library for visualizing missing values
import pandas_profiling #library for a quick up-front overview of a dataset
from sklearn.datasets import make_blobs
import time
Function to save a generated csv file into a gzip archive and verify the round trip.
def save_subgroup(dataframe, g_index, subgroup_name, prefix='raw_'):
    save_subgroup_filename = "".join([prefix, subgroup_name, ".csv.gz"])
    dataframe.to_csv(save_subgroup_filename, compression='gzip', encoding='UTF-8')
    test_df = pd.read_csv(save_subgroup_filename, compression='gzip', index_col=g_index, encoding='UTF-8')
    # Test that we recover what we send in
    if dataframe.equals(test_df):
        print("Test-passed: we recover the equivalent subgroup dataframe.")
    else:
        print("Warning -- equivalence test failed!!! Double-check.")
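As a sanity check on the round trip that save_subgroup relies on, here is a minimal, self-contained sketch using a tiny made-up frame (the names demo and demo_subgroup.csv.gz are illustrative, not from the dataset):

```python
import pandas as pd

# Tiny made-up subgroup (illustrative only, not the real players frame)
demo = pd.DataFrame({"height": [180, 175], "weight": [75, 70]},
                    index=pd.Index(["player-a", "player-b"], name="playerShort"))

# The same round trip save_subgroup performs: write a gzip csv, read it back
demo.to_csv("demo_subgroup.csv.gz", compression="gzip", encoding="UTF-8")
restored = pd.read_csv("demo_subgroup.csv.gz", compression="gzip",
                       index_col="playerShort", encoding="UTF-8")

print(demo.equals(restored))
```

If the index column is passed correctly, the restored frame compares equal to the original, which is exactly what the function's equivalence test checks.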
Function to load a csv file from inside a gzip archive
def load_subgroup(filename, index_col=[0]):
    return pd.read_csv(filename, compression='gzip', index_col=index_col)
Loading the data. One trick worth noting: the csv file sits inside a gzip archive, so:
#open the gzip archive and read the csv inside it, using gzip compression
df = pd.read_csv("redcard.csv.gz", compression='gzip')
df.shape #dimensions of the dataset (rows, columns)
df.head() #first 5 rows of the dataset
df.dtypes #dtype of every column (int, float, object, ...)
df.describe().T
#describe() gives summary statistics: count, min, max, mean, etc.
all_columns = df.columns.tolist() #all column names as a list
Challenge
Each row of the dataset is a player-referee dyad, so the same player appears in many rows; a naive mean over a column counts such a player multiple times, which is not what we want. Take player height as an example:
df['height'].mean() #naive mean over the raw column
#outputs 181.93593798236887
#a player who appears several times should be counted only once
np.mean(df.groupby('playerShort').height.mean())
#groupby aggregates per player before averaging
#outputs 181.74372848007872
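The effect of the naive mean is easy to reproduce on synthetic data; the numbers below are made up, only the mechanism matches the dataset:

```python
import pandas as pd

# Made-up dyad rows: player "a" appears in three dyads, "b" in one
dyads = pd.DataFrame({"playerShort": ["a", "a", "a", "b"],
                      "height": [190, 190, 190, 170]})

raw_mean = dyads["height"].mean()  # counts player "a" three times
per_player = dyads.groupby("playerShort")["height"].mean().mean()  # once each

print(raw_mean, per_player)  # 185.0 vs 180.0
```

The raw mean is pulled toward players with many dyads; grouping by playerShort first restores one vote per player.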
An aside on building a DataFrame by hand, with a groupby example:
df2 = pd.DataFrame({'key1':['a', 'a', 'b', 'b', 'a'],
'key2':['one', 'two', 'one', 'two', 'one'],
'data1':np.random.randn(5),
'data2':np.random.randn(5)})
grouped = df2['data1'].groupby(df2['key1'])
#take the data1 column and group it by the values of key1; rows sharing a key1 value form one group
grouped.mean()
key1
a 0.316412
b 0.688893
grouped.mean() returns one mean per group: here about 0.32 for 'a' and 0.69 for 'b' (the exact numbers vary with the random data).
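Extending the sketch, grouping by both keys gives one mean per (key1, key2) combination; the data below is seeded random noise, so only the shape of the result is meaningful:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the run is reproducible
df2 = pd.DataFrame({"key1": ["a", "a", "b", "b", "a"],
                    "key2": ["one", "two", "one", "two", "one"],
                    "data1": rng.standard_normal(5)})

# One mean per (key1, key2) combination -> a Series with a 2-level index
means = df2["data1"].groupby([df2["key1"], df2["key2"]]).mean()
print(means)
```

With the five rows above there are four distinct (key1, key2) pairs, so the result has four entries.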
clubs['leagueCountry'].value_counts()
#value_counts() counts how many rows take each distinct value (clubs here is a subgroup frame with one row per club)
England 48
Spain 27
France 22
Germany 21
Visualizing missing values
The next part uses missingno, a niche library built specifically for inspecting how much of a dataset is missing.
players = load_subgroup("raw_players.csv.gz")
import missingno as msno #library for visualizing missing values
#template for a missing-value matrix plot
msno.matrix(players.sample(500),
figsize=(16, 7),
width_ratios=(15, 1))
The white gaps mark the missing values in each column.
#use msno's heatmap to see how missingness in one column relates to missingness in another
msno.heatmap(players.sample(500),
figsize=(16, 7),)
print("All players:", len(players))
#count the nulls in each column
print("rater1 nulls:", len(players[players.rater1.isnull()]))
print("rater2 nulls:", len(players[players.rater2.isnull()]))
print("Both nulls:", len(players[(players.rater1.isnull()) & (players.rater2.isnull())]))
All players: 2053
rater1 nulls: 468
rater2 nulls: 468
Both nulls: 468
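The same counts can be read off in one call with isnull().sum(); the small frame below is synthetic and only mimics the two rater columns:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the two rater columns (values are made up)
players_demo = pd.DataFrame({"rater1": [0.25, np.nan, 0.75, np.nan],
                             "rater2": [0.25, np.nan, 0.50, np.nan]})

null_counts = players_demo.isnull().sum()  # per-column null counts in one call
both_null = players_demo[players_demo.rater1.isnull()
                         & players_demo.rater2.isnull()]

print(null_counts)
print("Both nulls:", len(both_null))
```

In the real data the three counts coincide (468 each), meaning rater1 and rater2 are always missing together, exactly the pattern the msno heatmap highlights.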
# modifying dataframe
players = players[players.rater1.notnull()]
#keep only the rows where rater1 is not null
That removes the samples with missing ratings; check again with msno:
msno.matrix(players.sample(500), #look again
figsize=(16, 7),
width_ratios=(15, 1))
pd.crosstab(players.rater1, players.rater2)
#cross-tabulate the two columns: a table of counts for every pair of values
Drawing a heatmap with seaborn
#heatmap of the crosstab
#shows how strongly the two raters agree
fig, ax = plt.subplots(figsize=(12, 8))
sns.heatmap(pd.crosstab(players.rater1, players.rater2), cmap='Blues', annot=True, fmt='d', ax=ax)
ax.set_title("Correlation between Rater 1 and Rater 2\n")
fig.tight_layout()
Adding a derived column
# modifying dataframe
#add a column holding the mean of the two ratings
players['skintone'] = players[['rater1', 'rater2']].mean(axis=1)
Visualizing the distribution of a single feature
With seaborn:
#seaborn histogram template: just pass in the column
sns.distplot(players.skintone, kde=False);
Player positions
#plot how many players hold each position
MIDSIZE = (12, 8)
fig, ax = plt.subplots(figsize=MIDSIZE)
players.position.value_counts(dropna=False, ascending=True).plot(kind='barh', ax=ax)
ax.set_ylabel("Position")
ax.set_xlabel("Counts")
fig.tight_layout()
Merging the fine-grained positions into broader roles
position_types = players.position.unique() #distinct values in the column
array(['Center Back', 'Attacking Midfielder', 'Right Midfielder',
       'Center Midfielder', 'Goalkeeper', 'Defensive Midfielder',
       'Left Fullback', nan, 'Left Midfielder', 'Right Fullback',
       'Center Forward', 'Left Winger', 'Right Winger'], dtype=object)
#merge the positions into four roles
defense = ['Center Back','Defensive Midfielder', 'Left Fullback', 'Right Fullback', ]
midfield = ['Right Midfielder', 'Center Midfielder', 'Left Midfielder',]
forward = ['Attacking Midfielder', 'Left Winger', 'Right Winger', 'Center Forward']
keeper = 'Goalkeeper'
# modifying dataframe -- adding the aggregated position categorical position_agg
players.loc[players['position'].isin(defense), 'position_agg'] = "Defense"
players.loc[players['position'].isin(midfield), 'position_agg'] = "Midfield"
players.loc[players['position'].isin(forward), 'position_agg'] = "Forward"
players.loc[players['position'].eq(keeper), 'position_agg'] = "Keeper"
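An equivalent single-pass alternative to the four .loc assignments is a dict plus .map(); the short Series below is synthetic, covering only part of the real position list:

```python
import numpy as np
import pandas as pd

# Synthetic position column (a subset of the values listed above)
positions = pd.Series(["Center Back", "Goalkeeper", "Left Winger",
                       "Center Midfielder", np.nan])

agg_map = {"Center Back": "Defense", "Defensive Midfielder": "Defense",
           "Left Fullback": "Defense", "Right Fullback": "Defense",
           "Right Midfielder": "Midfield", "Center Midfielder": "Midfield",
           "Left Midfielder": "Midfield",
           "Attacking Midfielder": "Forward", "Left Winger": "Forward",
           "Right Winger": "Forward", "Center Forward": "Forward",
           "Goalkeeper": "Keeper"}

# .map() applies the dict in one pass; NaN (and any unmapped value) stays NaN
position_agg = positions.map(agg_map)
print(position_agg.tolist())
```

The .loc version makes each rule explicit; the dict version keeps the whole mapping in one place, which is easier to audit when the category list grows.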
MIDSIZE = (12, 8)
fig, ax = plt.subplots(figsize=MIDSIZE)
players['position_agg'].value_counts(dropna=False, ascending=True).plot(kind='barh', ax=ax)
ax.set_ylabel("position_agg")
ax.set_xlabel("Counts")
fig.tight_layout()
Relationships between several features
#from pandas.tools.plotting import scatter_matrix  # old path, no longer works in modern pandas
from pandas.plotting import scatter_matrix #pairwise relationships between variables
#scatter-matrix template
fig, ax = plt.subplots(figsize=(10, 10))
#alpha: point transparency
scatter_matrix(players[['height', 'weight', 'skintone']], alpha=0.2, diagonal='hist', ax=ax);
A point cloud that falls roughly along a line means the two variables on the axes are strongly related; a cloud spread purely horizontally or vertically means they are barely related.
To look at just one pair of variables:
#seaborn template for a single pair: the regplot function fits and draws a regression
fig, ax = plt.subplots(figsize=MIDSIZE)
sns.regplot(x='weight', y='height', data=players, ax=ax) #keyword args required in newer seaborn
ax.set_ylabel("Height [cm]")
ax.set_xlabel("Weight [kg]")
fig.tight_layout()
Binning continuous variables into levels
Sometimes we want to turn a numeric range into ordered levels. Skin tone, for instance, runs from 0 to 1: values near 0 mean lighter skin and values near 1 darker skin, which is what the red-card question cares about. Start with weight:
weight_categories = ["vlow_weight",
"low_weight",
"mid_weight",
"high_weight",
"vhigh_weight",
                     ] #category labels
#pd.qcut splits into (near-)equal-sized quantile bins
players['weightclass'] = pd.qcut(players['weight'],        #the values to bin
                                 len(weight_categories),   #number of bins
                                 labels=weight_categories) #labels for the bins
The dataset now has a weightclass column grading each player from very low to very high weight.
#likewise split height into 5 bins
height_categories = ["vlow_height",
"low_height",
"mid_height",
"high_height",
"vhigh_height",
]
players['heightclass'] = pd.qcut(players['height'],
len(height_categories),
height_categories)
#and split skin tone into 3 bins (here without custom labels)
print (players['skintone'])
pd.qcut(players['skintone'], 3)
players['skintoneclass'] = pd.qcut(players['skintone'], 3)
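A quick synthetic check of what qcut produces: with 100 distinct made-up values and 5 quantile bins, each label covers an equal fifth of the players:

```python
import numpy as np
import pandas as pd

labels = ["vlow_weight", "low_weight", "mid_weight",
          "high_weight", "vhigh_weight"]

# 100 made-up weights; qcut cuts at quantiles, so the bins are equal-sized
weights = pd.Series(np.random.default_rng(2).normal(75, 8, 100))
weightclass = pd.qcut(weights, len(labels), labels=labels)

print(weightclass.value_counts())  # 20 players per class
```

This equal-count behavior is the difference from pd.cut, which cuts at equal-width value intervals instead.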
The dataset now carries three extra categorical columns.
Automated profiling reports
There is also a library called pandas_profiling that gives you a picture of the whole dataset before the real analysis starts. When should you run it? Right after loading the data works, and so does after preprocessing, or after you have results.
Usage is trivial: load the data and add a single function call.
#use this to get an overview of a dataset
#very useful before starting an analysis
import pandas_profiling #library that profiles a dataset up front
pandas_profiling.ProfileReport(players)
This generates a report covering, for every feature, missing-value counts, min/max, dtype, correlations between variables, automatic suggestions for columns to drop, and more. It is too long to show here; try it on any dataset.