引用包¶

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

数据导入¶

data_dict = {'color' : ['black', 'white', 'black', 'white', 'black',
                      'white', 'black', 'white', 'black', 'white'],
            'size' : ['S','M','L','M','L','S','S','XL','XL','M'],
        'date':pd.date_range('1/1/2019',periods=10, freq='W' ),
            'feature_1': np.random.randn(10),
            'feature_2': np.random.normal(0.5, 2, 10)}
array=[['A','B','B','B','C','A','B','A','C','C'],['JP','CN','US','US','US','CN','CN','CA','JP','CA']]
                                                 
index = pd.MultiIndex.from_arrays(array, names=['class', 'country'])
data_df = pd.DataFrame(data_dict,index=index)
data_df

分组¶

将dataframe根据size进行分组，得到group_1。在这里我们将GroupBy对象转换list后输出。

group_1 = data_df.groupby('size')
for i in list(group_1):
    print(i)

('L',                color       date  feature_1  feature_2 size
class country                                             
B     US       black 2019-01-20  -1.204530   2.331003    L
C     US       black 2019-02-03  -0.475149   2.455877    L)
('M',                color       date  feature_1  feature_2 size
class country                                             
B     CN       white 2019-01-13   0.354512  -0.106245    M
      US       white 2019-01-27   0.640886   3.105454    M
C     CA       white 2019-03-10   0.471399   1.102412    M)
('S',                color       date  feature_1  feature_2 size
class country                                             
A     JP       black 2019-01-06   0.599631   1.029602    S
      CN       white 2019-02-10   0.024186   2.412876    S
B     CN       black 2019-02-17   3.110097   0.678240    S)
('XL',                color       date  feature_1  feature_2 size
class country                                             
A     CA       white 2019-02-24   0.890249   1.522595   XL
C     JP       black 2019-03-03  -1.216877   2.321393   XL)

对GroupBy对象进行分组运算，如sum(),非数值数据则不进行分组运算。将计算得到的数据添加表头前缀后输出

group_1.sum().add_prefix('sum_')

根据size进行分组后获得所有size值为M的行向量

group_1.get_group('M')

将dataframe根据size和color两个列标签进行多重分组，得到group_2

group_2 = data_df.groupby(['size', 'color'])
for i in list(group_2):
    print(i)

(('L', 'black'),                color       date  feature_1  feature_2 size
class country                                             
B     US       black 2019-01-20   0.041672   1.014697    L
C     US       black 2019-02-03   0.419852  -2.964561    L)
(('M', 'white'),                color       date  feature_1  feature_2 size
class country                                             
B     CN       white 2019-01-13   1.308935  -0.493569    M
      US       white 2019-01-27  -0.203778   1.742654    M
C     CA       white 2019-03-10   0.500633   2.827345    M)
(('S', 'black'),                color       date  feature_1  feature_2 size
class country                                             
A     JP       black 2019-01-06  -1.234449  -0.133232    S
B     CN       black 2019-02-17  -0.649887  -0.187894    S)
(('S', 'white'),                color       date  feature_1  feature_2 size
class country                                             
A     CN       white 2019-02-10   2.350862  -1.895651    S)
(('XL', 'black'),                color       date  feature_1  feature_2 size
class country                                             
C     JP       black 2019-03-03  -1.295436    0.41684   XL)
(('XL', 'white'),                color       date  feature_1  feature_2 size
class country                                             
A     CA       white 2019-02-24     0.9122   0.782471   XL)

对分组后的数据size函数获得组别个数

print(group_1.size())
print(group_2.size())

size
L     2
M     3
S     3
XL    2
dtype: int64
size  color
L     black    2
M     white    3
S     black    2
      white    1
XL    black    1
      white    1
dtype: int64

此外还可以利用函数进行分组。注意到groupby函数中axis=1对列进行分组，axis=0对行进行分组

def get_letter_type(letter):
    if 'feature' in letter:
        return 'feature'
    else:
        return 'other'
    
for i in list(data_df.groupby(get_letter_type, axis=1)):
    print(i)

('feature',                feature_1  feature_2
class country                      
A     JP        0.599631   1.029602
B     CN        0.354512  -0.106245
      US       -1.204530   2.331003
      US        0.640886   3.105454
C     US       -0.475149   2.455877
A     CN        0.024186   2.412876
B     CN        3.110097   0.678240
A     CA        0.890249   1.522595
C     JP       -1.216877   2.321393
      CA        0.471399   1.102412)
('other',                color       date size
class country                       
A     JP       black 2019-01-06    S
B     CN       white 2019-01-13    M
      US       black 2019-01-20    L
      US       white 2019-01-27    M
C     US       black 2019-02-03    L
A     CN       white 2019-02-10    S
B     CN       black 2019-02-17    S
A     CA       white 2019-02-24   XL
C     JP       black 2019-03-03   XL
      CA       white 2019-03-10    M)

分组对象除了列标签之外，还可以是索引，其中用不同的level值来区分多重索引。

for i in list(data_df.groupby(level=[0,1])):
    print(i)

(('A', 'CA'),                color       date  feature_1  feature_2 size
class country                                             
A     CA       white 2019-02-24   0.890249   1.522595   XL)
(('A', 'CN'),                color       date  feature_1  feature_2 size
class country                                             
A     CN       white 2019-02-10   0.024186   2.412876    S)
(('A', 'JP'),                color       date  feature_1  feature_2 size
class country                                             
A     JP       black 2019-01-06   0.599631   1.029602    S)
(('B', 'CN'),                color       date  feature_1  feature_2 size
class country                                             
B     CN       white 2019-01-13   0.354512  -0.106245    M
      CN       black 2019-02-17   3.110097   0.678240    S)
(('B', 'US'),                color       date  feature_1  feature_2 size
class country                                             
B     US       black 2019-01-20  -1.204530   2.331003    L
      US       white 2019-01-27   0.640886   3.105454    M)
(('C', 'CA'),                color       date  feature_1  feature_2 size
class country                                             
C     CA       white 2019-03-10   0.471399   1.102412    M)
(('C', 'JP'),                color       date  feature_1  feature_2 size
class country                                             
C     JP       black 2019-03-03  -1.216877   2.321393   XL)
(('C', 'US'),                color       date  feature_1  feature_2 size
class country                                             
C     US       black 2019-02-03  -0.475149   2.455877    L)

DataFrame可以根据列标签和索引两者同时分组,分组后可以在不同组间进行迭代

group_3=df.groupby(['country','color'])
for name, group in group_3:
    print(name)
    print(group)

('CA', 'white')
               color size       date  feature_1  feature_2
class country                                             
A     CA       white   XL 2019-02-24   0.412967   1.196859
C     CA       white    M 2019-03-10  -0.818303   1.317979
('CN', 'black')
               color size       date  feature_1  feature_2
class country                                             
B     CN       black    S 2019-02-17  -0.058021  -2.420962
('CN', 'white')
               color size       date  feature_1  feature_2
class country                                             
B     CN       white    M 2019-01-13   1.735991   0.383047
A     CN       white    S 2019-02-10   0.282515   3.156525
('JP', 'black')
               color size       date  feature_1  feature_2
class country                                             
A     JP       black    S 2019-01-06   0.997065  -1.018255
C     JP       black   XL 2019-03-03   0.513201  -3.266357
('US', 'black')
               color size       date  feature_1  feature_2
class country                                             
B     US       black    L 2019-01-20  -0.547211   0.693104
C     US       black    L 2019-02-03  -0.245918   4.444044
('US', 'white')
               color size       date  feature_1  feature_2
class country                                             
B     US       white    M 2019-01-27  -0.847715  -2.327769

当GroupBy对象被建立后，我们可以用agg函数对分组后的数据进行计算。下例中计算了group_3中feature_1的最大值和feature_2的均值。

group_2.agg({'feature_1' : np.min,'feature_2' : np.mean})

接下来我们使用transform函数对groupby对象进行变换，transform的计算结果和原始数据的形状保持一致。下例中我们自定义了函数data_range来获得根据size分组后各个值的范围。

data_range = lambda x: x.max() - x.min()
data_df.groupby('size').transform(data_range)

另外我们还常常通过transform函数将缺失值替换为组间平均值。

data_df.iloc[1, 2:4] = np.NaN
group_4 = data_df.groupby('size')
f = lambda x: x.fillna(x.mean())
df_trans = group_4.transform(f)
df_trans

C:\ProgramData\Anaconda3\lib\site-packages\ipykernel\kernelbase.py:399: PerformanceWarning: indexing past lexsort depth may impact performance.
  user_expressions, allow_stdin)

data_df

根据列标签color进行分组后对列标签feature_1使用rolling方法,滚动计算最新的三个值的平均。

data_df.groupby('color').rolling(3).feature_1.mean()

color  class  country
black  A      JP              NaN
       B      US              NaN
       C      US        -0.257642
       B      CN        -0.062787
       C      JP        -0.508490
white  B      CN              NaN
              US              NaN
       A      CN         1.152006
              CA         1.019761
       C      CA         1.254565
Name: feature_1, dtype: float64

expanding函数会对给定的操作(如下例中的sum)进行叠加

data_df.groupby('color').expanding(3).feature_1.sum()

color  class  country
black  A      JP              NaN
       B      US              NaN
       C      US        -0.772925
       B      CN        -1.422812
       C      JP        -2.718247
white  B      CN              NaN
              US              NaN
       A      CN         3.456018
              CA         4.368218
       C      CA         4.868851
Name: feature_1, dtype: float64

filter函数的参数是作用于整个组且返回值为True或False的函数，我们可以利用filter函数得到分组后的某些特定组别,如下例中元素个数大于三的分组。

data_df.groupby('class').filter(lambda x: len(x) > 3)

有些对于分组数据的处理用transform和aggregate都很难完成，这时候我们需要使用apply函数，apply相较两者更加灵活。在apply中可使用用自定义函数。

data_df.groupby('class')['feature_1'].apply(lambda x: x.describe())

class       
A      count    3.000000
       mean     0.676204
       std      1.804268
       min     -1.234449
       25%     -0.161125
       50%      0.912200
       75%      1.631531
       max      2.350862
B      count    4.000000
       mean     0.124235
       std      0.840077
       min     -0.649887
       25%     -0.315306
       50%     -0.081053
       75%      0.358488
       max      1.308935
C      count    3.000000
       mean    -0.124984
       std      1.014446
       min     -1.295436
       25%     -0.437792
       50%      0.419852
       75%      0.460243
       max      0.500633
Name: feature_1, dtype: float64

def f(group):
    return pd.DataFrame({'original' : group,'demeaned' : group - group.mean()})
data_df.groupby('class')['feature_1'].apply(f)

参考文献¶

pandas toolkit
https://cloud.tencent.com/developer/article/1193823

	sum_feature_1	sum_feature_2
size
L	-1.679679	4.786880
M	1.466797	4.101621
S	3.733914	4.120718
XL	-0.326628	3.843988

		date	feature_1	feature_2
class	country
A	JP	42 days	3.085912	1.734636
B	CN	56 days	0.286375	3.211699
	US	14 days	0.729382	0.124874
	US	56 days	0.286375	3.211699
C	US	14 days	0.729382	0.124874
A	CN	42 days	3.085912	1.734636
B	CN	42 days	3.085912	1.734636
A	CA	7 days	2.107125	0.798798
C	JP	7 days	2.107125	0.798798
C	CA	56 days	0.286375	3.211699

		feature_1	feature_2
class	country
A	JP	-0.023671	-0.409491
B	CN	-0.091596	-1.399647
	US	1.085396	2.245660
	US	-0.127399	-1.747656
C	US	-2.046202	3.475487
A	CN	-1.076002	2.705517
B	CN	0.184117	2.913971
A	CA	0.601222	-2.098025
C	JP	-0.009375	-3.623235
C	CA	-0.055794	-1.051638

		color	date	feature_1	feature_2	size
class	country
A	JP	black	2019-01-06	-1.234449	-0.133232	S
B	CN	white	2019-01-13	1.308935	-0.493569	M
	US	black	2019-01-20	0.041672	1.014697	L
	US	white	2019-01-27	-0.203778	1.742654	M
C	US	black	2019-02-03	0.419852	-2.964561	L
A	CN	white	2019-02-10	2.350862	-1.895651	S
B	CN	black	2019-02-17	-0.649887	-0.187894	S
A	CA	white	2019-02-24	0.912200	0.782471	XL
C	JP	black	2019-03-03	-1.295436	0.416840	XL
C	CA	white	2019-03-10	0.500633	2.827345	M

		feature_1	feature_2
size	color
L	black	-1.204530	2.393440
M	white	0.354512	1.367207
S	black	0.599631	0.853921
S	white	0.024186	2.412876
XL	black	-1.216877	2.321393
XL	white	0.890249	1.522595

		demeaned	original
class	country
A	JP	-1.910653	-1.234449
B	CN	1.184700	1.308935
	CN	-0.774122	-0.649887
	US	-0.082563	0.041672
	US	-0.328014	-0.203778
C	US	0.544836	0.419852
A	CN	1.674658	2.350862
A	CA	0.235996	0.912200
C	JP	-1.170452	-1.295436
C	CA	0.625616	0.500633

jupyter notebook导入wordpress例子

引用包¶

数据导入¶

分组¶

参考文献¶