jupyter notebook导入wordpress例子

pandas教程_03分组与聚合 rel="stylesheet" href="custom.css">

引用包

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

数据导入

In [31]:
data_dict = {'color' : ['black', 'white', 'black', 'white', 'black',
                      'white', 'black', 'white', 'black', 'white'],
            'size' : ['S','M','L','M','L','S','S','XL','XL','M'],
        'date':pd.date_range('1/1/2019',periods=10, freq='W' ),
            'feature_1': np.random.randn(10),
            'feature_2': np.random.normal(0.5, 2, 10)}
array=[['A','B','B','B','C','A','B','A','C','C'],['JP','CN','US','US','US','CN','CN','CA','JP','CA']]
                                                 
index = pd.MultiIndex.from_arrays(array, names=['class', 'country'])
data_df = pd.DataFrame(data_dict,index=index)
data_df
Out[31]:
colordatefeature_1feature_2size
classcountry
AJPblack2019-01-06-1.234449-0.133232S
BCNwhite2019-01-131.308935-0.493569M
USblack2019-01-200.0416721.014697L
USwhite2019-01-27-0.2037781.742654M
CUSblack2019-02-030.419852-2.964561L
ACNwhite2019-02-102.350862-1.895651S
BCNblack2019-02-17-0.649887-0.187894S
ACAwhite2019-02-240.9122000.782471XL
CJPblack2019-03-03-1.2954360.416840XL
CAwhite2019-03-100.5006332.827345M

分组

将dataframe根据size进行分组,得到group_1。在这里我们将GroupBy对象转换list后输出。

In [5]:
group_1 = data_df.groupby('size')
for i in list(group_1):
    print(i)
('L',                color       date  feature_1  feature_2 size
class country                                             
B     US       black 2019-01-20  -1.204530   2.331003    L
C     US       black 2019-02-03  -0.475149   2.455877    L)
('M',                color       date  feature_1  feature_2 size
class country                                             
B     CN       white 2019-01-13   0.354512  -0.106245    M
      US       white 2019-01-27   0.640886   3.105454    M
C     CA       white 2019-03-10   0.471399   1.102412    M)
('S',                color       date  feature_1  feature_2 size
class country                                             
A     JP       black 2019-01-06   0.599631   1.029602    S
      CN       white 2019-02-10   0.024186   2.412876    S
B     CN       black 2019-02-17   3.110097   0.678240    S)
('XL',                color       date  feature_1  feature_2 size
class country                                             
A     CA       white 2019-02-24   0.890249   1.522595   XL
C     JP       black 2019-03-03  -1.216877   2.321393   XL)

对GroupBy对象进行分组运算,如sum(),非数值数据则不进行分组运算。将计算得到的数据添加表头前缀后输出

In [6]:
group_1.sum().add_prefix('sum_')
Out[6]:
sum_feature_1sum_feature_2
size
L-1.6796794.786880
M1.4667974.101621
S3.7339144.120718
XL-0.3266283.843988

根据size进行分组后获得所有size值为M的行向量

In [5]:
group_1.get_group('M')
Out[5]:
colordatefeature_1feature_2
classcountry
BCNwhite2019-01-131.7359910.383047
USwhite2019-01-27-0.847715-2.327769
CCAwhite2019-03-10-0.8183031.317979

将dataframe根据size和color两个列标签进行多重分组,得到group_2

In [34]:
group_2 = data_df.groupby(['size', 'color'])
for i in list(group_2):
    print(i)
(('L', 'black'),                color       date  feature_1  feature_2 size
class country                                             
B     US       black 2019-01-20   0.041672   1.014697    L
C     US       black 2019-02-03   0.419852  -2.964561    L)
(('M', 'white'),                color       date  feature_1  feature_2 size
class country                                             
B     CN       white 2019-01-13   1.308935  -0.493569    M
      US       white 2019-01-27  -0.203778   1.742654    M
C     CA       white 2019-03-10   0.500633   2.827345    M)
(('S', 'black'),                color       date  feature_1  feature_2 size
class country                                             
A     JP       black 2019-01-06  -1.234449  -0.133232    S
B     CN       black 2019-02-17  -0.649887  -0.187894    S)
(('S', 'white'),                color       date  feature_1  feature_2 size
class country                                             
A     CN       white 2019-02-10   2.350862  -1.895651    S)
(('XL', 'black'),                color       date  feature_1  feature_2 size
class country                                             
C     JP       black 2019-03-03  -1.295436    0.41684   XL)
(('XL', 'white'),                color       date  feature_1  feature_2 size
class country                                             
A     CA       white 2019-02-24     0.9122   0.782471   XL)

对分组后的数据size函数获得组别个数

In [9]:
print(group_1.size())
print(group_2.size())
size
L     2
M     3
S     3
XL    2
dtype: int64
size  color
L     black    2
M     white    3
S     black    2
      white    1
XL    black    1
      white    1
dtype: int64

此外还可以利用函数进行分组。注意到groupby函数中axis=1对列进行分组,axis=0对行进行分组

In [10]:
def get_letter_type(letter):
    if 'feature' in letter:
        return 'feature'
    else:
        return 'other'
    
for i in list(data_df.groupby(get_letter_type, axis=1)):
    print(i)
('feature',                feature_1  feature_2
class country                      
A     JP        0.599631   1.029602
B     CN        0.354512  -0.106245
      US       -1.204530   2.331003
      US        0.640886   3.105454
C     US       -0.475149   2.455877
A     CN        0.024186   2.412876
B     CN        3.110097   0.678240
A     CA        0.890249   1.522595
C     JP       -1.216877   2.321393
      CA        0.471399   1.102412)
('other',                color       date size
class country                       
A     JP       black 2019-01-06    S
B     CN       white 2019-01-13    M
      US       black 2019-01-20    L
      US       white 2019-01-27    M
C     US       black 2019-02-03    L
A     CN       white 2019-02-10    S
B     CN       black 2019-02-17    S
A     CA       white 2019-02-24   XL
C     JP       black 2019-03-03   XL
      CA       white 2019-03-10    M)

分组对象除了列标签之外,还可以是索引,其中用不同的level值来区分多重索引。

In [16]:
for i in list(data_df.groupby(level=[0,1])):
    print(i)
(('A', 'CA'),                color       date  feature_1  feature_2 size
class country                                             
A     CA       white 2019-02-24   0.890249   1.522595   XL)
(('A', 'CN'),                color       date  feature_1  feature_2 size
class country                                             
A     CN       white 2019-02-10   0.024186   2.412876    S)
(('A', 'JP'),                color       date  feature_1  feature_2 size
class country                                             
A     JP       black 2019-01-06   0.599631   1.029602    S)
(('B', 'CN'),                color       date  feature_1  feature_2 size
class country                                             
B     CN       white 2019-01-13   0.354512  -0.106245    M
      CN       black 2019-02-17   3.110097   0.678240    S)
(('B', 'US'),                color       date  feature_1  feature_2 size
class country                                             
B     US       black 2019-01-20  -1.204530   2.331003    L
      US       white 2019-01-27   0.640886   3.105454    M)
(('C', 'CA'),                color       date  feature_1  feature_2 size
class country                                             
C     CA       white 2019-03-10   0.471399   1.102412    M)
(('C', 'JP'),                color       date  feature_1  feature_2 size
class country                                             
C     JP       black 2019-03-03  -1.216877   2.321393   XL)
(('C', 'US'),                color       date  feature_1  feature_2 size
class country                                             
C     US       black 2019-02-03  -0.475149   2.455877    L)

DataFrame可以根据列标签和索引两者同时分组,分组后可以在不同组间进行迭代

In [10]:
group_3=df.groupby(['country','color'])
for name, group in group_3:
    print(name)
    print(group)
('CA', 'white')
               color size       date  feature_1  feature_2
class country                                             
A     CA       white   XL 2019-02-24   0.412967   1.196859
C     CA       white    M 2019-03-10  -0.818303   1.317979
('CN', 'black')
               color size       date  feature_1  feature_2
class country                                             
B     CN       black    S 2019-02-17  -0.058021  -2.420962
('CN', 'white')
               color size       date  feature_1  feature_2
class country                                             
B     CN       white    M 2019-01-13   1.735991   0.383047
A     CN       white    S 2019-02-10   0.282515   3.156525
('JP', 'black')
               color size       date  feature_1  feature_2
class country                                             
A     JP       black    S 2019-01-06   0.997065  -1.018255
C     JP       black   XL 2019-03-03   0.513201  -3.266357
('US', 'black')
               color size       date  feature_1  feature_2
class country                                             
B     US       black    L 2019-01-20  -0.547211   0.693104
C     US       black    L 2019-02-03  -0.245918   4.444044
('US', 'white')
               color size       date  feature_1  feature_2
class country                                             
B     US       white    M 2019-01-27  -0.847715  -2.327769

当GroupBy对象被建立后,我们可以用agg函数对分组后的数据进行计算。下例中计算了group_3中feature_1的最大值和feature_2的均值。

In [17]:
group_2.agg({'feature_1' : np.min,'feature_2' : np.mean})
Out[17]:
feature_1feature_2
sizecolor
Lblack-1.2045302.393440
Mwhite0.3545121.367207
Sblack0.5996310.853921
white0.0241862.412876
XLblack-1.2168772.321393
white0.8902491.522595

接下来我们使用transform函数对groupby对象进行变换,transform的计算结果和原始数据的形状保持一致。下例中我们自定义了函数data_range来获得根据size分组后各个值的范围。

In [18]:
data_range = lambda x: x.max() - x.min()
data_df.groupby('size').transform(data_range)
Out[18]:
datefeature_1feature_2
classcountry
AJP42 days3.0859121.734636
BCN56 days0.2863753.211699
US14 days0.7293820.124874
US56 days0.2863753.211699
CUS14 days0.7293820.124874
ACN42 days3.0859121.734636
BCN42 days3.0859121.734636
ACA7 days2.1071250.798798
CJP7 days2.1071250.798798
CA56 days0.2863753.211699

另外我们还常常通过transform函数将缺失值替换为组间平均值。

In [29]:
data_df.iloc[1, 2:4] = np.NaN
group_4 = data_df.groupby('size')
f = lambda x: x.fillna(x.mean())
df_trans = group_4.transform(f)
df_trans
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel\kernelbase.py:399: PerformanceWarning: indexing past lexsort depth may impact performance.
  user_expressions, allow_stdin)

Out[29]:
feature_1feature_2
classcountry
AJP-0.023671-0.409491
BCN-0.091596-1.399647
US1.0853962.245660
US-0.127399-1.747656
CUS-2.0462023.475487
ACN-1.0760022.705517
BCN0.1841172.913971
ACA0.601222-2.098025
CJP-0.009375-3.623235
CA-0.055794-1.051638
In [30]:
data_df
Out[30]:
colordatefeature_1feature_2size
classcountry
AJPblack2019-01-06-0.023671-0.409491S
BCNwhite2019-01-13NaNNaNM
USblack2019-01-201.0853962.245660L
USwhite2019-01-27-0.127399-1.747656M
CUSblack2019-02-03-2.0462023.475487L
ACNwhite2019-02-10-1.0760022.705517S
BCNblack2019-02-170.1841172.913971S
ACAwhite2019-02-240.601222-2.098025XL
CJPblack2019-03-03-0.009375-3.623235XL
CAwhite2019-03-10-0.055794-1.051638M

根据列标签color进行分组后对列标签feature_1使用rolling方法,滚动计算最新的三个值的平均。

In [32]:
data_df.groupby('color').rolling(3).feature_1.mean()
Out[32]:
color  class  country
black  A      JP              NaN
       B      US              NaN
       C      US        -0.257642
       B      CN        -0.062787
       C      JP        -0.508490
white  B      CN              NaN
              US              NaN
       A      CN         1.152006
              CA         1.019761
       C      CA         1.254565
Name: feature_1, dtype: float64

expanding函数会对给定的操作(如下例中的sum)进行叠加

In [47]:
data_df.groupby('color').expanding(3).feature_1.sum()
Out[47]:
color  class  country
black  A      JP              NaN
       B      US              NaN
       C      US        -0.772925
       B      CN        -1.422812
       C      JP        -2.718247
white  B      CN              NaN
              US              NaN
       A      CN         3.456018
              CA         4.368218
       C      CA         4.868851
Name: feature_1, dtype: float64

filter函数的参数是作用于整个组且返回值为True或False的函数,我们可以利用filter函数得到分组后的某些特定组别,如下例中元素个数大于三的分组。

In [36]:
data_df.groupby('class').filter(lambda x: len(x) > 3)
Out[36]:
colordatefeature_1feature_2size
classcountry
BCNwhite2019-01-131.308935-0.493569M
USblack2019-01-200.0416721.014697L
USwhite2019-01-27-0.2037781.742654M
CNblack2019-02-17-0.649887-0.187894S

有些对于分组数据的处理用transform和aggregate都很难完成,这时候我们需要使用apply函数,apply相较两者更加灵活。在apply中可使用用自定义函数。

In [37]:
data_df.groupby('class')['feature_1'].apply(lambda x: x.describe())
Out[37]:
class       
A      count    3.000000
       mean     0.676204
       std      1.804268
       min     -1.234449
       25%     -0.161125
       50%      0.912200
       75%      1.631531
       max      2.350862
B      count    4.000000
       mean     0.124235
       std      0.840077
       min     -0.649887
       25%     -0.315306
       50%     -0.081053
       75%      0.358488
       max      1.308935
C      count    3.000000
       mean    -0.124984
       std      1.014446
       min     -1.295436
       25%     -0.437792
       50%      0.419852
       75%      0.460243
       max      0.500633
Name: feature_1, dtype: float64
In [38]:
def f(group):
    return pd.DataFrame({'original' : group,'demeaned' : group - group.mean()})
data_df.groupby('class')['feature_1'].apply(f)
Out[38]:
demeanedoriginal
classcountry
AJP-1.910653-1.234449
BCN1.1847001.308935
CN-0.774122-0.649887
US-0.0825630.041672
US-0.328014-0.203778
CUS0.5448360.419852
ACN1.6746582.350862
CA0.2359960.912200
CJP-1.170452-1.295436
CA0.6256160.500633

参考文献

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值