基于球员和裁判数据进行探索性数据分析实践-大数据ML样本集案例实战

最新推荐文章于 2023-05-30 08:00:00 发布

weixin_33910759

最新推荐文章于 2023-05-30 08:00:00 发布

阅读量325

点赞数

文章标签：大数据

原文链接：https://juejin.im/post/5c0e9b306fb9a049c643aecc

版权

1 数据简介

数据包含球员和裁判的信息，2012-2013年的比赛数据，总共设计球员2053名，裁判3147名，特征列表如下：

Variable Name:	Variable Description:
playerShort	short player ID
player	player name
club	player club
leagueCountry	country of player club (England, Germany, France, and Spain)
height	player height (in cm)
weight	player weight (in kg)
position	player position
games	number of games in the player-referee dyad
goals	number of goals in the player-referee dyad
yellowCards	number of yellow cards player received from the referee
yellowReds	number of yellow-red cards player received from the referee
redCards	number of red cards player received from the referee
photoID	ID of player photo (if available)
rater1	skin rating of photo by rater 1
rater2	skin rating of photo by rater 2
refNum	unique referee ID number (referee name removed for anonymizing purposes)
refCountry	unique referee country ID number
meanIAT	mean implicit bias score (using the race IAT) for referee country
nIAT	sample size for race IAT in that particular country
seIAT	standard error for mean estimate of race IAT
meanExp	mean explicit bias score (using a racial thermometer task) for referee country
nExp	sample size for explicit bias in that particular country
seExp	standard error for mean estimate of explicit bias measure

2 数据预处理

数据基本特征挖掘

  # Uncomment one of the following lines and run the cell:
  df = pd.read_csv("redcard.csv.gz", compression='gzip')
  df.shape
  (146028, 28)
  
  df.head()
复制代码

     df.describe().T 
复制代码

df.dtypes

  playerShort       object
  player            object
  club              object
  leagueCountry     object
  birthday          object
  height           float64
  weight           float64
  position          object
  games              int64
  victories          int64
  ties               int64
  defeats            int64
  goals              int64
  yellowCards        int64
  yellowReds         int64
  redCards           int64
  photoID           object
  rater1           float64
  rater2           float64
  refNum             int64
  refCountry         int64
  Alpha_3           object
  meanIAT          float64
  nIAT             float64
  seIAT            float64
  meanExp          float64
  nExp             float64
  seExp            float64
  dtype: object
复制代码

all_columns = df.columns.tolist()

 all_columns

 ['playerShort',
  'player',
  'club',
  'leagueCountry',
  'birthday',
  'height',
  'weight',
  'position',
  'games',
  'victories',
  'ties',
  'defeats',
  'goals',
  'yellowCards',
  'yellowReds',
  'redCards',
  'photoID',
  'rater1',
  'rater2',
  'refNum',
  'refCountry',
  'Alpha_3',
  'meanIAT',
  'nIAT',
  'seIAT',
  'meanExp',
  'nExp',
  'seExp']
复制代码

df['height'].mean()
```
 181.93593798236887
复制代码
```
df['height'].mean()
```
 181.93593798236887
复制代码
```
np.mean(df.groupby('playerShort').height.mean())
```
  181.74372848007872
复制代码
```

Tidy Data

  df2 = pd.DataFrame({'key1':['a', 'a', 'b', 'b', 'a'],
       'key2':['one', 'two', 'one', 'two', 'one'],
       'data1':np.random.randn(5),
       'data2':np.random.randn(5)})
复制代码

分组聚合

  grouped = df2['data1'].groupby(df['key1'])
  grouped.mean()

  key1
  a   -0.093686
  b   -0.322711
  Name: data1, dtype: float64  
      
   player_index = 'playerShort'
   player_cols = [#'player', # drop player name, we have unique identifier
                 'birthday',
                 'height',
                 'weight',
                 'position',
                 'photoID',
                 'rater1',
                 'rater2',
                ]   

  all_cols_unique_players = df.groupby(' ').agg({col:'nunique' for col in player_cols})
  all_cols_unique_players.head()
复制代码

    all_cols_unique_players[all_cols_unique_players > 1].dropna().shape[0] == 0
    True
复制代码

去重

  def get_subgroup(dataframe, g_index, g_columns):
      g = dataframe.groupby(g_index).agg({col:'nunique' for col in g_columns})
      if g[g > 1].dropna().shape[0] != 0:
          print("Warning: you probably assumed this had all unique values but it doesn't.")
      return dataframe.groupby(g_index).agg({col:'max' for col in g_columns})

  players = get_subgroup(df, player_index, player_cols)
  players.head()
复制代码