版权声明:本套技术专栏是作者(秦凯新)平时工作的总结和升华,通过从真实商业环境抽取案例进行总结和分享,并给出商业应用的调优建议和集群环境容量规划等内容,请持续关注本套博客。QQ邮箱地址:1120746959@qq.com,如有任何学术交流,可随时联系。
1 数据简介
-
数据包含球员和裁判的信息,2012-2013年的比赛数据,总共设计球员2053名,裁判3147名,特征列表如下:
Variable Name: Variable Description: playerShort short player ID player player name club player club leagueCountry country of player club (England, Germany, France, and Spain) height player height (in cm) weight player weight (in kg) position player position games number of games in the player-referee dyad goals number of goals in the player-referee dyad yellowCards number of yellow cards player received from the referee yellowReds number of yellow-red cards player received from the referee redCards number of red cards player received from the referee photoID ID of player photo (if available) rater1 skin rating of photo by rater 1 rater2 skin rating of photo by rater 2 refNum unique referee ID number (referee name removed for anonymizing purposes) refCountry unique referee country ID number meanIAT mean implicit bias score (using the race IAT) for referee country nIAT sample size for race IAT in that particular country seIAT standard error for mean estimate of race IAT meanExp mean explicit bias score (using a racial thermometer task) for referee country nExp sample size for explicit bias in that particular country seExp standard error for mean estimate of explicit bias measure
2 数据预处理
-
数据基本特征挖掘
# Uncomment one of the following lines and run the cell: df = pd.read_csv("redcard.csv.gz", compression='gzip') df.shape (146028, 28) df.head() 复制代码
df.describe().T
复制代码
-
df.dtypes
playerShort object player object club object leagueCountry object birthday object height float64 weight float64 position object games int64 victories int64 ties int64 defeats int64 goals int64 yellowCards int64 yellowReds int64 redCards int64 photoID object rater1 float64 rater2 float64 refNum int64 refCountry int64 Alpha_3 object meanIAT float64 nIAT float64 seIAT float64 meanExp float64 nExp float64 seExp float64 dtype: object 复制代码
-
all_columns = df.columns.tolist()
all_columns ['playerShort', 'player', 'club', 'leagueCountry', 'birthday', 'height', 'weight', 'position', 'games', 'victories', 'ties', 'defeats', 'goals', 'yellowCards', 'yellowReds', 'redCards', 'photoID', 'rater1', 'rater2', 'refNum', 'refCountry', 'Alpha_3', 'meanIAT', 'nIAT', 'seIAT', 'meanExp', 'nExp', 'seExp'] 复制代码
-
df['height'].mean()
181.93593798236887 复制代码
-
df['height'].mean()
181.93593798236887 复制代码
-
np.mean(df.groupby('playerShort').height.mean())
181.74372848007872 复制代码
-
Tidy Data
df2 = pd.DataFrame({'key1':['a', 'a', 'b', 'b', 'a'], 'key2':['one', 'two', 'one', 'two', 'one'], 'data1':np.random.randn(5), 'data2':np.random.randn(5)}) 复制代码
-
分组聚合
grouped = df2['data1'].groupby(df['key1']) grouped.mean() key1 a -0.093686 b -0.322711 Name: data1, dtype: float64 player_index = 'playerShort' player_cols = [#'player', # drop player name, we have unique identifier 'birthday', 'height', 'weight', 'position', 'photoID', 'rater1', 'rater2', ] all_cols_unique_players = df.groupby(' ').agg({col:'nunique' for col in player_cols}) all_cols_unique_players.head() 复制代码
all_cols_unique_players[all_cols_unique_players > 1].dropna().shape[0] == 0
True
复制代码
-
去重
def get_subgroup(dataframe, g_index, g_columns): g = dataframe.groupby(g_index).agg({col:'nunique' for col in g_columns}) if g[g > 1].dropna().shape[0] != 0: print("Warning: you probably assumed this had all unique values but it doesn't.") return dataframe.groupby(g_index).agg({col:'max' for col in g_columns}) players = get_subgroup(df, player_index, player_cols) players.head() 复制代码
3 数据缺失值指标可视化
未完待续