基于球员和裁判数据进行探索性数据分析实践-大数据ML样本集案例实战

版权声明:本套技术专栏是作者(秦凯新)平时工作的总结和升华,通过从真实商业环境抽取案例进行总结和分享,并给出商业应用的调优建议和集群环境容量规划等内容,请持续关注本套博客。QQ邮箱地址:1120746959@qq.com,如有任何学术交流,可随时联系。

1 数据简介

  • 数据包含球员和裁判的信息,2012-2013年的比赛数据,总共设计球员2053名,裁判3147名,特征列表如下:

    Variable Name:Variable Description:
    playerShortshort player ID
    playerplayer name
    clubplayer club
    leagueCountrycountry of player club (England, Germany, France, and Spain)
    heightplayer height (in cm)
    weightplayer weight (in kg)
    positionplayer position
    gamesnumber of games in the player-referee dyad
    goalsnumber of goals in the player-referee dyad
    yellowCardsnumber of yellow cards player received from the referee
    yellowRedsnumber of yellow-red cards player received from the referee
    redCardsnumber of red cards player received from the referee
    photoIDID of player photo (if available)
    rater1skin rating of photo by rater 1
    rater2skin rating of photo by rater 2
    refNumunique referee ID number (referee name removed for anonymizing purposes)
    refCountryunique referee country ID number
    meanIATmean implicit bias score (using the race IAT) for referee country
    nIATsample size for race IAT in that particular country
    seIATstandard error for mean estimate of race IAT
    meanExpmean explicit bias score (using a racial thermometer task) for referee country
    nExpsample size for explicit bias in that particular country
    seExpstandard error for mean estimate of explicit bias measure

2 数据预处理

  • 数据基本特征挖掘

      # Uncomment one of the following lines and run the cell:
      df = pd.read_csv("redcard.csv.gz", compression='gzip')
      df.shape
      (146028, 28)
      
      df.head()
    复制代码

     df.describe().T 
复制代码

  • df.dtypes

      playerShort       object
      player            object
      club              object
      leagueCountry     object
      birthday          object
      height           float64
      weight           float64
      position          object
      games              int64
      victories          int64
      ties               int64
      defeats            int64
      goals              int64
      yellowCards        int64
      yellowReds         int64
      redCards           int64
      photoID           object
      rater1           float64
      rater2           float64
      refNum             int64
      refCountry         int64
      Alpha_3           object
      meanIAT          float64
      nIAT             float64
      seIAT            float64
      meanExp          float64
      nExp             float64
      seExp            float64
      dtype: object
    复制代码
  • all_columns = df.columns.tolist()

     all_columns
    
     ['playerShort',
      'player',
      'club',
      'leagueCountry',
      'birthday',
      'height',
      'weight',
      'position',
      'games',
      'victories',
      'ties',
      'defeats',
      'goals',
      'yellowCards',
      'yellowReds',
      'redCards',
      'photoID',
      'rater1',
      'rater2',
      'refNum',
      'refCountry',
      'Alpha_3',
      'meanIAT',
      'nIAT',
      'seIAT',
      'meanExp',
      'nExp',
      'seExp']
    复制代码
  • df['height'].mean()

     181.93593798236887
    复制代码
  • df['height'].mean()

     181.93593798236887
    复制代码
  • np.mean(df.groupby('playerShort').height.mean())

      181.74372848007872
    复制代码
  • Tidy Data

      df2 = pd.DataFrame({'key1':['a', 'a', 'b', 'b', 'a'],
           'key2':['one', 'two', 'one', 'two', 'one'],
           'data1':np.random.randn(5),
           'data2':np.random.randn(5)})
    复制代码

  • 分组聚合

      grouped = df2['data1'].groupby(df['key1'])
      grouped.mean()
    
      key1
      a   -0.093686
      b   -0.322711
      Name: data1, dtype: float64  
          
       player_index = 'playerShort'
       player_cols = [#'player', # drop player name, we have unique identifier
                     'birthday',
                     'height',
                     'weight',
                     'position',
                     'photoID',
                     'rater1',
                     'rater2',
                    ]   
    
      all_cols_unique_players = df.groupby(' ').agg({col:'nunique' for col in player_cols})
      all_cols_unique_players.head()
    复制代码

    all_cols_unique_players[all_cols_unique_players > 1].dropna().shape[0] == 0
    True
复制代码
  • 去重

      def get_subgroup(dataframe, g_index, g_columns):
          g = dataframe.groupby(g_index).agg({col:'nunique' for col in g_columns})
          if g[g > 1].dropna().shape[0] != 0:
              print("Warning: you probably assumed this had all unique values but it doesn't.")
          return dataframe.groupby(g_index).agg({col:'max' for col in g_columns})
    
      players = get_subgroup(df, player_index, player_cols)
      players.head()
    复制代码

3 数据缺失值指标可视化

未完待续

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值