Pandas学习笔记(四)—— Pandas变形

前导


更多文章代码详情可查看博主个人网站:https://www.iwtmbtly.com/


导入需要使用的库和文件:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.read_csv('data/table.csv')
>>> df.head()
  School Class    ID Gender   Address  Height  Weight  Math Physics
0    S_1   C_1  1101      M  street_1     173      63  34.0      A+
1    S_1   C_1  1102      F  street_2     192      73  32.5      B+
2    S_1   C_1  1103      M  street_2     186      82  87.2      B+
3    S_1   C_1  1104      F  street_2     167      81  80.4      B-
4    S_1   C_1  1105      F  street_4     159      64  84.8      B+

一、透视表

(一)pivot

一般状态下,数据在DataFrame会以压缩(stacked)状态存放,例如上面的Gender,两个类别被叠在一列中,pivot函数可将某一列作为新的cols:

>>> df.pivot(index='ID', columns='Gender', values='Height').head()
Gender      F      M
ID
1101      NaN  173.0
1102    192.0    NaN
1103      NaN  186.0
1104    167.0    NaN
1105    159.0    NaN

然而pivot函数具有很强的局限性,除了功能上较少之外,还不允许values中出现重复的行列索引对(pair),例如下面的语句就会报错:

>>> df.pivot(index='School',columns='Gender',values='Height').head()

因此,更多的时候会选择使用强大的pivot_table函数

(二)pivot_table

首先,再先上面的操作:

>>> pd.pivot_table(df,index='ID',columns='Gender',values='Height').head()
Gender      F      M
ID
1101      NaN  173.0
1102    192.0    NaN
1103      NaN  186.0
1104    167.0    NaN
1105    159.0    NaN

由于功能更多,速度上自然是比不上原来的pivot函数:

>>> %timeit df.pivot(index='ID',columns='Gender',values='Height')
>>> %timeit pd.pivot_table(df,index='ID',columns='Gender',values='Height')
572 µs ± 5.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.43 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Pandas中还提供了各种选项,下面介绍常用参数:

1. aggfunc:对组内进行聚合统计,可传入各类函数,默认为’mean’

>>> pd.pivot_table(df, index='School', columns='Gender', values='Height', aggfunc=['mean', 'sum']).head()
              mean               sum
Gender           F           M     F     M
School
S_1     173.125000  178.714286  1385  1251
S_2     173.727273  172.000000  1911  1548

2. margins:汇总边际状态

# margins_name可以设置名字,默认为'All'
>>> pd.pivot_table(df,index='School',columns='Gender',values='Height',aggfunc=['mean','sum'],margins=True).head()
              mean                           sum            
Gender           F           M         All     F     M   All
School                                                      
S_1     173.125000  178.714286  175.733333  1385  1251  2636
S_2     173.727273  172.000000  172.950000  1911  1548  3459
All     173.473684  174.937500  174.142857  3296  2799  6095

3. 行、列、值都可以为多级

>>> pd.pivot_table(df,index=['School','Class'],
...                columns=['Gender','Address'],
                   value=['Height', 'Weight'])
               Height                                                                 ...   Weight
Gender              F                                                     M           ...        F                 M
Address      street_1 street_2 street_4 street_5 street_6 street_7 street_1 street_2  ... street_6 street_7 street_1 street_2 street_4 street_5 street_6 street_7        
School Class                                                                          ...
S_1    C_1        NaN    179.5    159.0      NaN      NaN      NaN    173.0    186.0  ...      NaN      NaN     63.0     82.0      NaN      NaN      NaN      NaN        
       C_2        NaN      NaN    176.0    162.0    167.0      NaN      NaN      NaN  ...     63.0      NaN      NaN      NaN      NaN     68.0     53.0      NaN        
       C_3      175.0      NaN      NaN    187.0      NaN      NaN      NaN    195.0  ...      NaN      NaN      NaN     70.0     68.0      NaN      NaN     82.0        
S_2    C_1        NaN      NaN      NaN    159.0    161.0      NaN      NaN      NaN  ...     61.0      NaN      NaN      NaN     71.0      NaN      NaN     84.0        
       C_2        NaN      NaN      NaN      NaN      NaN    188.5    175.0      NaN  ...      NaN     76.5     74.0      NaN     91.0    100.0      NaN      NaN        
       C_3        NaN      NaN    157.0      NaN    164.0    190.0      NaN      NaN  ...     81.0     99.0      NaN      NaN     73.0     88.0      NaN      NaN        
       C_4        NaN    176.0      NaN      NaN    175.5      NaN      NaN      NaN  ...     57.0      NaN      NaN      NaN      NaN      NaN      NaN     82.0        

[7 rows x 24 columns]

(三)交叉表

交叉表是一种特殊的透视表,典型的用途如分组统计,如现在想要统计关于街道和性别分组的频数:

>>> pd.crosstab(index=df['Address'],columns=df['Gender'])
Gender    F  M
Address
street_1  1  2
street_2  4  2
street_4  3  5
street_5  3  3
street_6  5  1
street_7  3  3

交叉表的功能也很强大(但目前还不支持多级分组),下面说明一些重要参数:

1. values和aggfunc:分组对某些数据进行聚合操作,这两个参数必须成对出现

# 默认参数等于如下方法:
# pd.crosstab(index=df['Address'],columns=df['Gender'],values=1,aggfunc='count')
>>> pd.crosstab(index=df['Address'], columns=df['Gender'], values=np.random.randint(1, 20, df.shape[0]), aggfunc='min')
Gender     F   M
Address
street_1   2   6
street_2   4   2
street_4  11   1
street_5   3   6
street_6   2  18
street_7   4   3

2. 除了边际参数margins外,还引入了normalize参数,可选’all’,‘index’,'columns’参数值

>>> pd.crosstab(index=df['Address'],columns=df['Gender'],normalize='all',margins=True)
Gender           F         M       All
Address
street_1  0.028571  0.057143  0.085714
street_2  0.114286  0.057143  0.171429
street_4  0.085714  0.142857  0.228571
street_5  0.085714  0.085714  0.171429
street_6  0.142857  0.028571  0.171429
street_7  0.085714  0.085714  0.171429
All       0.542857  0.457143  1.000000

二、其他变形方法

(一)melt

melt函数可以认为是pivot函数的逆操作,将unstacked状态的数据,压缩成stacked,使“宽”的DataFrame变“窄”

>>> df_m = df[['ID','Gender','Math']]
>>> df_m.head()
     ID Gender  Math
0  1101      M  34.0
1  1102      F  32.5
2  1103      M  87.2
3  1104      F  80.4
4  1105      F  84.8
>>> df.pivot(index='ID',columns='Gender',values='Math').head()
Gender     F     M
ID
1101     NaN  34.0
1102    32.5   NaN
1103     NaN  87.2
1104    80.4   NaN
1105    84.8   NaN

melt函数中的id_vars表示需要保留的列,value_vars表示需要stack的一组列

>>> pivoted = df.pivot(index='ID',columns='Gender',values='Math')
>>> result = pivoted.reset_index().melt(id_vars=['ID'],value_vars=['F','M'],value_name='Math')\
...                      .dropna().set_index('ID').sort_index()
>>> # 检验是否与展开前的df相同,可以分别将这些链式方法的中间步骤展开,看看是什么结果
>>> result.equals(df_m.set_index('ID'))
True

(二)压缩与展开

1. stack:这是最基础的变形函数,总共只有两个参数:level和dropna

>>> df_s = pd.pivot_table(df,index=['Class','ID'],columns='Gender',values=['Height','Weight'])
>>> df_s.groupby('Class').head(2)
           Height        Weight
Gender          F      M      F     M
Class ID
C_1   1101    NaN  173.0    NaN  63.0
      1102  192.0    NaN   73.0   NaN
C_2   1201    NaN  188.0    NaN  68.0
      1202  176.0    NaN   94.0   NaN
C_3   1301    NaN  161.0    NaN  68.0
      1302  175.0    NaN   57.0   NaN
C_4   2401  192.0    NaN   62.0   NaN
      2402    NaN  166.0    NaN  82.0
>>> df_stacked = df_s.stack()
>>> df_stacked.groupby('Class').head(2)
                   Height  Weight
Class ID   Gender
C_1   1101 M        173.0    63.0
      1102 F        192.0    73.0
C_2   1201 M        188.0    68.0
      1202 F        176.0    94.0
C_3   1301 M        161.0    68.0
      1302 F        175.0    57.0
C_4   2401 F        192.0    62.0
      2402 M        166.0    82.0

stack函数可以看做将横向的索引放到纵向,因此功能类似与melt,参数level可指定变化的列索引是哪一层(或哪几层,需要列表)

>>> df_stacked = df_s.stack(0)
>>> df_stacked.groupby('Class').head(2)
Gender                 F      M
Class ID
C_1   1101 Height    NaN  173.0
           Weight    NaN   63.0
C_2   1201 Height    NaN  188.0
           Weight    NaN   68.0
C_3   1301 Height    NaN  161.0
           Weight    NaN   68.0
C_4   2401 Height  192.0    NaN
           Weight   62.0    NaN

2. unstack:stack的逆函数,功能上类似于pivot_table

>>> df_stacked.head()
Gender                 F      M
Class ID
C_1   1101 Height    NaN  173.0
           Weight    NaN   63.0
      1102 Height  192.0    NaN
           Weight   73.0    NaN
      1103 Height    NaN  186.0
>>> result = df_stacked.unstack().swaplevel(1,0,axis=1).sort_index(axis=1)
>>> result.equals(df_s)
True
# 同样在unstack中可以指定level参数

三、哑变量与因子化

(一)Dummy Variable(哑变量)

这里主要介绍get_dummies函数,其功能主要是进行one-hot编码:

>>> df_d = df[['Class','Gender','Weight']]
>>> df_d.head()
  Class Gender  Weight
0   C_1      M      63
1   C_1      F      73
2   C_1      M      82
3   C_1      F      81
4   C_1      F      64

现在希望将上面的表格前两列转化为哑变量,并加入第三列Weight数值:

# 可选prefix参数添加前缀,prefix_sep添加分隔符
>>> pd.get_dummies(df_d[['Class','Gender']]).join(df_d['Weight']).head()
   Class_C_1  Class_C_2  Class_C_3  Class_C_4  Gender_F  Gender_M  Weight
0          1          0          0          0         0         1      63
1          1          0          0          0         1         0      73
2          1          0          0          0         0         1      82
3          1          0          0          0         1         0      81
4          1          0          0          0         1         0      64

(二)factorize方法

该方法主要用于自然数编码,并且缺失值会被记做-1,其中sort参数表示是否排序后赋值

>>> codes, uniques = pd.factorize(['b', None, 'a', 'c', 'b'], sort=True)
>>> print(codes)
[ 1 -1  0  2  1]
>>> print(uniques)
['a' 'b' 'c']
  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

HiSpring流云

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值