Python for Data Analysis: Data Aggregation and Group Operations -- Beginner Notes

The GroupBy Mechanism

split-apply-combine

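The figure that belonged here illustrated split-apply-combine for a group aggregation. As a rough stand-in, the three steps can be spelled out by hand on a tiny frame (the data below is made up for illustration) before reaching for groupby:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                   'value': [1.0, 2.0, 3.0, 4.0]})

# split: partition the rows by key
pieces = {k: df[df['key'] == k] for k in df['key'].unique()}

# apply: run the aggregation on each piece independently
applied = {k: piece['value'].sum() for k, piece in pieces.items()}

# combine: glue the per-group results back into one object
combined = pd.Series(applied)
print(combined)  # a -> 4.0, b -> 6.0
```

groupby performs the same three steps in one call: df.groupby('key')['value'].sum().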

import pandas as pd 
import numpy as np 
df=pd.DataFrame({'key1':['a','a','b','b','a'],
                'key2':['one','two','one','two','one'],
                'data1':np.random.randn(5),
                'data2':np.random.randn(5)})
df
  key1 key2     data1     data2
0    a  one  1.067118 -0.237576
1    a  two  0.613814  1.059002
2    b  one  2.682089  0.865306
3    b  two -0.331019 -1.627436
4    a  one -0.599142 -0.615921

Group by key1 and compute the mean of the data1 column: access data1 and call groupby with the key1 column

grouped=df['data1'].groupby(df['key1'])
grouped
<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001FBAA33FDF0>

The variable grouped is now a GroupBy object. It has not actually computed anything yet; it only holds some intermediate data about the group key df['key1']. The object already has all of the information needed to apply some operation to each of the groups

grouped.mean()
key1
a    0.360597
b    1.175535
Name: data1, dtype: float64
means=df['data1'].groupby([df['key1'],df['key2']]).mean()
means
key1  key2
a     one     0.233988
      two     0.613814
b     one     2.682089
      two    -0.331019
Name: data1, dtype: float64
means.unstack()
key2       one       two
key1
a     0.233988  0.613814
b     2.682089 -0.331019

The group keys can be any arrays of the right length

states=np.array(['Ohio','California','California','Ohio','Ohio'])
years=np.array([2005,2005,2006,2005,2006])
df['data1'].groupby([states,years]).mean()
California  2005    0.613814
            2006    2.682089
Ohio        2005    0.368050
            2006   -0.599142
Name: data1, dtype: float64

Frequently, the grouping information is found in the same DataFrame as the data you want to work on. In that case, you can pass column names (whether strings, numbers, or other Python objects) as the group keys:

df.groupby('key1')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001FBB014A850>
df.groupby(['key1','key2']).mean()
              data1     data2
key1 key2
a    one   0.233988 -0.426748
     two   0.613814  1.059002
b    one   2.682089  0.865306
     two  -0.331019 -1.627436
df.groupby(['key1','key2']).size()
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

Iterating over groups

The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data

for name,group in df.groupby('key1'):
    print(name)
    print(group)
a
  key1 key2     data1     data2
0    a  one  1.067118 -0.237576
1    a  two  0.613814  1.059002
4    a  one -0.599142 -0.615921
b
  key1 key2     data1     data2
2    b  one  2.682089  0.865306
3    b  two -0.331019 -1.627436

In the case of multiple keys, the first element in the tuple will be a tuple of key values

for (k1,k2),group in df.groupby(['key1','key2']):
    print((k1,k2))
    print(group)
('a', 'one')
  key1 key2     data1     data2
0    a  one  1.067118 -0.237576
4    a  one -0.599142 -0.615921
('a', 'two')
  key1 key2     data1     data2
1    a  two  0.613814  1.059002
('b', 'one')
  key1 key2     data1     data2
2    b  one  2.682089  0.865306
('b', 'two')
  key1 key2     data1     data2
3    b  two -0.331019 -1.627436

You can do whatever you want with these pieces of data. For example, computing a dict of the data pieces:

pieces=dict(list(df.groupby('key1')))
pieces
{'a':   key1 key2     data1     data2
 0    a  one  1.067118 -0.237576
 1    a  two  0.613814  1.059002
 4    a  one -0.599142 -0.615921,
 'b':   key1 key2     data1     data2
 2    b  one  2.682089  0.865306
 3    b  two -0.331019 -1.627436}

By default, groupby groups on axis=0, but you can group on any of the other axes. For example, we could group the columns of df here by dtype:

df.dtypes
key1      object
key2      object
data1    float64
data2    float64
dtype: object
grouped=df.groupby(df.dtypes,axis=1)
grouped
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001FBAFD4DB20>
for dtype,group in grouped:
    print(dtype)
    print(group)
float64
      data1     data2
0  1.067118 -0.237576
1  0.613814  1.059002
2  2.682089  0.865306
3 -0.331019 -1.627436
4 -0.599142 -0.615921
object
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one
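Note that in recent pandas releases the axis=1 keyword to groupby is deprecated. Selecting the dtype blocks directly with select_dtypes is one replacement; this sketch is my own workaround, not from the book:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b'],
                   'key2': ['one', 'two', 'one'],
                   'data1': np.random.randn(3),
                   'data2': np.random.randn(3)})

# pull out each dtype block without calling groupby(axis=1)
float_block = df.select_dtypes(include='float64')
object_block = df.select_dtypes(include='object')

print(float_block.columns.tolist())   # ['data1', 'data2']
print(object_block.columns.tolist())  # ['key1', 'key2']
```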

Selecting a column or subset of columns

Indexing a GroupBy object created from a DataFrame with a column name (a single string) or array of column names selects those columns for aggregation

df.groupby('key1')['data1']
df.groupby('key1')[['data2']]

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001FBB05C0910>
df['data1'].groupby(df['key1'])
df[['data2']].groupby(df['key1'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001FBB05C0E20>

Especially for large datasets, it may be desirable to aggregate only a few columns

df.groupby(['key1','key2'])[['data2']].mean()
              data2
key1 key2
a    one  -0.426748
     two   1.059002
b    one   0.865306
     two  -1.627436

The object returned by this indexing operation is a grouped DataFrame (if a list or array is passed) or a grouped Series (if only a single column name is passed as a scalar)

s_grouped=df.groupby(['key1','key2'])['data2']
s_grouped
<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001FBB02BB640>
s_grouped.mean()
key1  key2
a     one    -0.426748
      two     1.059002
b     one     0.865306
      two    -1.627436
Name: data2, dtype: float64

Grouping with dicts and Series

people=pd.DataFrame(np.random.randn(5,5),
                   columns=['a','b','c','d','e'],
                   index=['Joe','Steve','Wes','Jim','Travis'])
people.iloc[2:3,[1,2]]=np.nan 
people
               a         b         c         d         e
Joe     1.064395  0.458101 -0.647722  0.913158  0.266544
Steve   0.464769 -0.152126  0.692518  0.800796  0.020842
Wes     0.534295       NaN       NaN  1.031494  1.019269
Jim     0.485397  0.930118 -0.124346  0.447506  0.761696
Travis -0.108946  2.500092  1.173428  0.467239  1.579619

Suppose we have a group correspondence for the columns and want to sum the columns by group

mapping={'a': 'red',
         'b': 'red',
         'c': 'blue',
         'd': 'blue',
         'e': 'red',
         'f' : 'orange'}


You could construct an array from this dict to pass to groupby, but instead we can just pass the dict directly:

by_column=people.groupby(mapping,axis=1)
by_column.sum()
            blue       red
Joe     0.265436  1.789040
Steve   1.493315  0.333485
Wes     1.031494  1.553564
Jim     0.323159  2.177211
Travis  1.640667  3.970764
map_series=pd.Series(mapping)
map_series
a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object
people.groupby(map_series,axis=1).count()
        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3

Grouping with functions

Any function passed as a group key will be called once per index value, with the return values being used as the group names. Here, the people DataFrame is indexed by name, so passing len groups by the length of the names:

people.groupby(len).sum()
          a         b         c         d         e
3  2.084087  1.388219 -0.772069  2.392158  2.047509
5  0.464769 -0.152126  0.692518  0.800796  0.020842
6 -0.108946  2.500092  1.173428  0.467239  1.579619

Mixing functions with arrays, lists, dicts, or Series is not a problem, as everything gets converted to arrays internally

key_list=['one','one','one','two','two']
people.groupby([len,key_list]).min()
              a         b         c         d         e
3 one  0.534295  0.458101 -0.647722  0.913158  0.266544
  two  0.485397  0.930118 -0.124346  0.447506  0.761696
5 one  0.464769 -0.152126  0.692518  0.800796  0.020842
6 two -0.108946  2.500092  1.173428  0.467239  1.579619

Grouping by index levels

A convenient feature of hierarchically indexed datasets is the ability to aggregate using one of the levels of an axis index

columns=pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],
                                  [1,3,5,1,3]],
                                 names=['city','tenor'])
hier_df=pd.DataFrame(np.random.randn(4,5),columns=columns)
hier_df
city         US                            JP
tenor         1         3         5         1         3
0     -2.712569  0.036956 -0.643195  0.770231 -1.886666
1     -0.841508  1.772366  0.550501  0.237302  0.968424
2     -1.221179 -0.434443 -0.475438  0.666095 -0.083710
3     -0.519055  2.062299  1.077549 -1.361656 -1.274750

To group by level, pass the level number or name using the level keyword:

hier_df.groupby(level='city',axis=1).count()
city  JP  US
0      2   3
1      2   3
2      2   3
3      2   3

Data Aggregation

Aggregations refer to any data transformation that produces scalar values from arrays. Optimized groupby methods include:

Function     Description
count        Number of non-NA values in the group
sum          Sum of non-NA values
mean         Mean of non-NA values
median       Arithmetic median of non-NA values
std, var     Unbiased (n - 1 denominator) standard deviation and variance
min, max     Minimum and maximum of non-NA values
prod         Product of non-NA values
first, last  First and last non-NA values
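Each method in the table can be called directly on a GroupBy object, and each skips NA values. A minimal check on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                   'val': [1.0, 3.0, 2.0, None]})
g = df.groupby('key')['val']

# the optimized aggregations all ignore the NA in group 'b'
print(g.count().to_dict())  # {'a': 2, 'b': 1}
print(g.sum().to_dict())    # {'a': 4.0, 'b': 2.0}
print(g.mean().to_dict())   # {'a': 2.0, 'b': 2.0}
print(g.max().to_dict())    # {'a': 3.0, 'b': 2.0}
```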

quantile computes sample quantiles of a Series or a DataFrame's columns:

df
  key1 key2     data1     data2
0    a  one  1.067118 -0.237576
1    a  two  0.613814  1.059002
2    b  one  2.682089  0.865306
3    b  two -0.331019 -1.627436
4    a  one -0.599142 -0.615921
grouped=df.groupby(['key1','key2'])
grouped['data1'].quantile(0.9)
key1  key2
a     one     0.900492
      two     0.613814
b     one     2.682089
      two    -0.331019
Name: data1, dtype: float64

To use your own aggregation functions, pass any function that aggregates an array to the aggregate or agg method:

def peak_to_peak(arr):
    return arr.max()-arr.min()
grouped.agg(peak_to_peak)
             data1     data2
key1 key2
a    one   1.66626  0.378344
     two   0.00000  0.000000
b    one   0.00000  0.000000
     two   0.00000  0.000000
grouped.describe()
           data1
           count      mean       std       min       25%       50%       75%       max
key1 key2
a    one     2.0  0.233988  1.178224 -0.599142 -0.182577  0.233988  0.650553  1.067118
     two     1.0  0.613814       NaN  0.613814  0.613814  0.613814  0.613814  0.613814
b    one     1.0  2.682089       NaN  2.682089  2.682089  2.682089  2.682089  2.682089
     two     1.0 -0.331019       NaN -0.331019 -0.331019 -0.331019 -0.331019 -0.331019

           data2
           count      mean      std       min       25%       50%       75%       max
key1 key2
a    one     2.0 -0.426748  0.26753 -0.615921 -0.521335 -0.426748 -0.332162 -0.237576
     two     1.0  1.059002      NaN  1.059002  1.059002  1.059002  1.059002  1.059002
b    one     1.0  0.865306      NaN  0.865306  0.865306  0.865306  0.865306  0.865306
     two     1.0 -1.627436      NaN -1.627436 -1.627436 -1.627436 -1.627436 -1.627436

Column-wise and multiple function application

tips=pd.read_csv("F:/项目学习/利用Pyhon进行数据分析(第二版)/利用Pyhon进行数据分析/pydata-book-2nd-edition/examples/tips.csv")
tips.head()
   total_bill   tip smoker  day    time  size
0       16.99  1.01     No  Sun  Dinner     2
1       10.34  1.66     No  Sun  Dinner     3
2       21.01  3.50     No  Sun  Dinner     3
3       23.68  3.31     No  Sun  Dinner     2
4       24.59  3.61     No  Sun  Dinner     4
# Add tip percentage of the total bill
tips['tip_pct']=tips['tip']/tips['total_bill']
tips.head()
   total_bill   tip smoker  day    time  size   tip_pct
0       16.99  1.01     No  Sun  Dinner     2  0.059447
1       10.34  1.66     No  Sun  Dinner     3  0.160542
2       21.01  3.50     No  Sun  Dinner     3  0.166587
3       23.68  3.31     No  Sun  Dinner     2  0.139780
4       24.59  3.61     No  Sun  Dinner     4  0.146808
grouped=tips.groupby(['day','smoker'])
grouped_pct=grouped['tip_pct']
grouped_pct.agg('mean')
day   smoker
Fri   No        0.151650
      Yes       0.174783
Sat   No        0.158048
      Yes       0.147906
Sun   No        0.160113
      Yes       0.187250
Thur  No        0.160298
      Yes       0.163863
Name: tip_pct, dtype: float64

If you pass a list of functions or function names instead, you get back a DataFrame with column names taken from the functions

grouped_pct.agg(['mean','std',peak_to_peak])
                 mean       std  peak_to_peak
day  smoker
Fri  No      0.151650  0.028123      0.067349
     Yes     0.174783  0.051293      0.159925
Sat  No      0.158048  0.039767      0.235193
     Yes     0.147906  0.061375      0.290095
Sun  No      0.160113  0.042347      0.193226
     Yes     0.187250  0.154134      0.644685
Thur No      0.160298  0.038774      0.193350
     Yes     0.163863  0.039389      0.151240

If you pass a list of (name, function) tuples, the first element of each tuple will be used as the DataFrame column names

grouped_pct.agg([('foo','mean'),('bar',np.std)])
                  foo       bar
day  smoker
Fri  No      0.151650  0.028123
     Yes     0.174783  0.051293
Sat  No      0.158048  0.039767
     Yes     0.147906  0.061375
Sun  No      0.160113  0.042347
     Yes     0.187250  0.154134
Thur No      0.160298  0.038774
     Yes     0.163863  0.039389

With a DataFrame, you can specify a list of functions to apply to all of the columns, or different functions per column.

functions=['count','mean','max']
result=grouped[['tip_pct','total_bill']].agg(functions)
result
            tip_pct                     total_bill
              count      mean       max      count       mean    max
day  smoker
Fri  No           4  0.151650  0.187735          4  18.420000  22.75
     Yes         15  0.174783  0.263480         15  16.813333  40.17
Sat  No          45  0.158048  0.291990         45  19.661778  48.33
     Yes         42  0.147906  0.325733         42  21.276667  50.81
Sun  No          57  0.160113  0.252672         57  20.506667  48.17
     Yes         19  0.187250  0.710345         19  24.120000  45.35
Thur No          45  0.160298  0.266312         45  17.113111  41.19
     Yes         17  0.163863  0.241255         17  19.190588  43.11
result['tip_pct']
             count      mean       max
day  smoker
Fri  No          4  0.151650  0.187735
     Yes        15  0.174783  0.263480
Sat  No         45  0.158048  0.291990
     Yes        42  0.147906  0.325733
Sun  No         57  0.160113  0.252672
     Yes        19  0.187250  0.710345
Thur No         45  0.160298  0.266312
     Yes        17  0.163863  0.241255
ftuples = [('Durchschnitt', 'mean'),('Abweichung', np.var)]
grouped[['tip_pct', 'total_bill']].agg(ftuples)
                  tip_pct              total_bill
             Durchschnitt Abweichung Durchschnitt  Abweichung
day  smoker
Fri  No          0.151650   0.000791    18.420000   25.596333
     Yes         0.174783   0.002631    16.813333   82.562438
Sat  No          0.158048   0.001581    19.661778   79.908965
     Yes         0.147906   0.003767    21.276667  101.387535
Sun  No          0.160113   0.001793    20.506667   66.099980
     Yes         0.187250   0.023757    24.120000  109.046044
Thur No          0.160298   0.001503    17.113111   59.625081
     Yes         0.163863   0.001551    19.190588   69.808518

To apply different functions to one or more of the columns, pass a dict to agg that contains a mapping of column names to functions

grouped.agg({'tip':np.max,'size':'sum'})
              tip  size
day  smoker
Fri  No      3.50     9
     Yes     4.73    31
Sat  No      9.00   115
     Yes    10.00   104
Sun  No      6.00   167
     Yes     6.50    49
Thur No      6.70   112
     Yes     5.00    40

A DataFrame will have hierarchical columns only if multiple functions are applied to at least one column

grouped.agg({'tip_pct':['min','max','mean','std'],'size':'sum'})
              tip_pct                                 size
                  min       max      mean       std    sum
day  smoker
Fri  No      0.120385  0.187735  0.151650  0.028123      9
     Yes     0.103555  0.263480  0.174783  0.051293     31
Sat  No      0.056797  0.291990  0.158048  0.039767    115
     Yes     0.035638  0.325733  0.147906  0.061375    104
Sun  No      0.059447  0.252672  0.160113  0.042347    167
     Yes     0.065660  0.710345  0.187250  0.154134     49
Thur No      0.072961  0.266312  0.160298  0.038774    112
     Yes     0.090014  0.241255  0.163863  0.039389     40
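As an aside, newer pandas releases (0.25+) also offer "named aggregation", which yields flat, explicitly named result columns instead of a column hierarchy. A sketch on toy data (the output names min_pct, max_pct, total_size are my own):

```python
import pandas as pd

df = pd.DataFrame({'day': ['Fri', 'Fri', 'Sat', 'Sat'],
                   'tip_pct': [0.12, 0.18, 0.10, 0.30],
                   'size': [2, 3, 2, 4]})

# each keyword becomes an output column: name=(input_column, aggfunc)
out = df.groupby('day').agg(min_pct=('tip_pct', 'min'),
                            max_pct=('tip_pct', 'max'),
                            total_size=('size', 'sum'))
print(out)
```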

Returning aggregated data without row indexes

In all of the examples so far, the aggregated data comes back with an index, potentially hierarchical, composed of the unique group key combinations. You can disable this behavior by passing as_index=False to groupby

tips.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   smoker      244 non-null    object 
 3   day         244 non-null    object 
 4   time        244 non-null    object 
 5   size        244 non-null    int64  
 6   tip_pct     244 non-null    float64
dtypes: float64(3), int64(1), object(3)
memory usage: 13.5+ KB
num_col=tips.select_dtypes(include=['int64','float64'])
num_col
     total_bill   tip  size   tip_pct
0         16.99  1.01     2  0.059447
1         10.34  1.66     3  0.160542
2         21.01  3.50     3  0.166587
3         23.68  3.31     2  0.139780
4         24.59  3.61     4  0.146808
..          ...   ...   ...       ...
239       29.03  5.92     3  0.203927
240       27.18  2.00     2  0.073584
241       22.67  2.00     2  0.088222
242       17.82  1.75     2  0.098204
243       18.78  3.00     2  0.159744

[244 rows x 4 columns]

tips
     total_bill   tip smoker   day    time  size   tip_pct
0         16.99  1.01     No   Sun  Dinner     2  0.059447
1         10.34  1.66     No   Sun  Dinner     3  0.160542
2         21.01  3.50     No   Sun  Dinner     3  0.166587
3         23.68  3.31     No   Sun  Dinner     2  0.139780
4         24.59  3.61     No   Sun  Dinner     4  0.146808
..          ...   ...    ...   ...     ...   ...       ...
239       29.03  5.92     No   Sat  Dinner     3  0.203927
240       27.18  2.00    Yes   Sat  Dinner     2  0.073584
241       22.67  2.00    Yes   Sat  Dinner     2  0.088222
242       17.82  1.75     No   Sat  Dinner     2  0.098204
243       18.78  3.00     No  Thur  Dinner     2  0.159744

[244 rows x 7 columns]

tips_1=tips.drop('time',axis=1)
tips_1.groupby(['day','smoker']).mean()
             total_bill       tip      size   tip_pct
day  smoker
Fri  No       18.420000  2.812500  2.250000  0.151650
     Yes      16.813333  2.714000  2.066667  0.174783
Sat  No       19.661778  3.102889  2.555556  0.158048
     Yes      21.276667  2.875476  2.476190  0.147906
Sun  No       20.506667  3.167895  2.929825  0.160113
     Yes      24.120000  3.516842  2.578947  0.187250
Thur No       17.113111  2.673778  2.488889  0.160298
     Yes      19.190588  3.030000  2.352941  0.163863
tips_1.groupby(['day','smoker'],as_index=False).mean()
    day smoker  total_bill       tip      size   tip_pct
0   Fri     No   18.420000  2.812500  2.250000  0.151650
1   Fri    Yes   16.813333  2.714000  2.066667  0.174783
2   Sat     No   19.661778  3.102889  2.555556  0.158048
3   Sat    Yes   21.276667  2.875476  2.476190  0.147906
4   Sun     No   20.506667  3.167895  2.929825  0.160113
5   Sun    Yes   24.120000  3.516842  2.578947  0.187250
6  Thur     No   17.113111  2.673778  2.488889  0.160298
7  Thur    Yes   19.190588  3.030000  2.352941  0.163863

apply: General split-apply-combine

def top(df,n=5,column='tip_pct'):
    return df.sort_values(by=column)[-n:]
top(tips,n=6)
     total_bill   tip smoker  day    time  size   tip_pct
109       14.31  4.00    Yes  Sat  Dinner     2  0.279525
183       23.17  6.50    Yes  Sun  Dinner     4  0.280535
232       11.61  3.39     No  Sat  Dinner     2  0.291990
67         3.07  1.00    Yes  Sat  Dinner     1  0.325733
178        9.60  4.00    Yes  Sun  Dinner     2  0.416667
172        7.25  5.15    Yes  Sun  Dinner     2  0.710345
# Group by smoker and call apply with this function

tips.groupby('smoker').apply(top)
            total_bill   tip smoker   day    time  size   tip_pct
smoker
No     88        24.71  5.85     No  Thur   Lunch     2  0.236746
       185       20.69  5.00     No   Sun  Dinner     5  0.241663
       51        10.29  2.60     No   Sun  Dinner     2  0.252672
       149        7.51  2.00     No  Thur   Lunch     2  0.266312
       232       11.61  3.39     No   Sat  Dinner     2  0.291990
Yes    109       14.31  4.00    Yes   Sat  Dinner     2  0.279525
       183       23.17  6.50    Yes   Sun  Dinner     4  0.280535
       67         3.07  1.00    Yes   Sat  Dinner     1  0.325733
       178        9.60  4.00    Yes   Sun  Dinner     2  0.416667
       172        7.25  5.15    Yes   Sun  Dinner     2  0.710345
tips.groupby(['smoker','day']).apply(top,n=1,column='total_bill')
                 total_bill    tip smoker   day    time  size   tip_pct
smoker day
No     Fri  94        22.75   3.25     No   Fri  Dinner     2  0.142857
       Sat  212       48.33   9.00     No   Sat  Dinner     4  0.186220
       Sun  156       48.17   5.00     No   Sun  Dinner     6  0.103799
       Thur 142       41.19   5.00     No  Thur   Lunch     5  0.121389
Yes    Fri  95        40.17   4.73    Yes   Fri  Dinner     4  0.117750
       Sat  170       50.81  10.00    Yes   Sat  Dinner     3  0.196812
       Sun  182       45.35   3.50    Yes   Sun  Dinner     3  0.077178
       Thur 197       43.11   5.00    Yes  Thur   Lunch     4  0.115982
result=tips.groupby('smoker')['tip_pct'].describe()
result
        count      mean       std       min       25%       50%       75%       max
smoker
No      151.0  0.159328  0.039910  0.056797  0.136906  0.155625  0.185014  0.291990
Yes      93.0  0.163196  0.085119  0.035638  0.106771  0.153846  0.195059  0.710345
result.unstack('smoker')
       smoker
count  No        151.000000
       Yes        93.000000
mean   No          0.159328
       Yes         0.163196
std    No          0.039910
       Yes         0.085119
min    No          0.056797
       Yes         0.035638
25%    No          0.136906
       Yes         0.106771
50%    No          0.155625
       Yes         0.153846
75%    No          0.185014
       Yes         0.195059
max    No          0.291990
       Yes         0.710345
dtype: float64

Inside GroupBy, when you invoke a method like describe, it is actually just a shortcut for the following two lines:
f=lambda x:x.describe()
grouped.apply(f)

Suppressing the group keys

The group keys, combined with the indexes of the original object, form a hierarchical index in the result. You can disable this by passing group_keys=False to groupby

tips.groupby('smoker',group_keys=False).apply(top)
     total_bill   tip smoker   day    time  size   tip_pct
88        24.71  5.85     No  Thur   Lunch     2  0.236746
185       20.69  5.00     No   Sun  Dinner     5  0.241663
51        10.29  2.60     No   Sun  Dinner     2  0.252672
149        7.51  2.00     No  Thur   Lunch     2  0.266312
232       11.61  3.39     No   Sat  Dinner     2  0.291990
109       14.31  4.00    Yes   Sat  Dinner     2  0.279525
183       23.17  6.50    Yes   Sun  Dinner     4  0.280535
67         3.07  1.00    Yes   Sat  Dinner     1  0.325733
178        9.60  4.00    Yes   Sun  Dinner     2  0.416667
172        7.25  5.15    Yes   Sun  Dinner     2  0.710345

Quantile and bucket analysis

frame = pd.DataFrame({'data1': np.random.randn(1000),
                      'data2': np.random.randn(1000)})
quartiles = pd.cut(frame.data1, 4)
quartiles[:10]
0      (-0.0596, 1.69]
1     (-3.565, -1.809]
2    (-1.809, -0.0596]
3      (-0.0596, 1.69]
4      (-0.0596, 1.69]
5      (-0.0596, 1.69]
6    (-1.809, -0.0596]
7    (-1.809, -0.0596]
8    (-1.809, -0.0596]
9    (-1.809, -0.0596]
Name: data1, dtype: category
Categories (4, interval[float64, right]): [(-3.565, -1.809] < (-1.809, -0.0596] < (-0.0596, 1.69] < (1.69, 3.439]]
def get_stats(group):
    return {'min':group.min(),'max':group.max(),'count':group.count(),'mean':group.mean()}
grouped=frame.data2.groupby(quartiles)
grouped.apply(get_stats).unstack()
                         min       max  count      mean
data1
(-3.565, -1.809]   -2.240036  2.383580   40.0 -0.147557
(-1.809, -0.0596]  -2.455718  3.207027  452.0 -0.016250
(-0.0596, 1.69]    -2.989626  3.232200  459.0  0.048823
(1.69, 3.439]      -1.774519  2.201069   49.0  0.054332

These were equal-length buckets. To compute equal-size buckets based on sample quantiles, use qcut instead. Passing labels=False returns just the quantile numbers

grouping = pd.qcut(frame.data1, 10, labels=False)
grouped = frame.data2.groupby(grouping)
grouped.apply(get_stats).unstack()

            min       max  count      mean
data1
0     -2.240036  2.617873  100.0 -0.036513
1     -2.162690  3.207027  100.0  0.038019
2     -2.312058  1.901076  100.0 -0.052351
3     -2.297201  2.963489  100.0 -0.033958
4     -2.455718  3.178088  100.0 -0.083398
5     -2.989626  2.336623  100.0  0.110641
6     -2.269297  3.232200  100.0  0.074222
7     -2.115640  2.599014  100.0  0.070273
8     -2.387508  2.364138  100.0 -0.062957
9     -1.774519  2.333942  100.0  0.094271

Example: Filling missing values with group-specific values

When cleaning up missing data, in some cases you will filter out data observations using dropna, but in others you may want to fill in the NA values using a fixed value or some value derived from the data itself. fillna is the right tool for this

s = pd.Series(np.random.randn(6))
s
0    1.264682
1    0.495235
2   -0.016187
3    1.781491
4   -0.231563
5   -0.393924
dtype: float64
s[::2] = np.nan
s
0         NaN
1    0.495235
2         NaN
3    1.781491
4         NaN
5   -0.393924
dtype: float64
s.fillna(s.mean())
0    0.627601
1    0.495235
2    0.627601
3    1.781491
4    0.627601
5   -0.393924
dtype: float64

Suppose you need the fill value to vary by group. One way to do this is to group the data and use apply with a function that calls fillna on each data chunk.

states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4
data = pd.Series(np.random.randn(8), index=states)
data
Ohio          1.045855
New York      1.075995
Vermont       0.425475
Florida       0.086684
Oregon       -1.262191
Nevada       -0.209671
California    0.120289
Idaho        -0.564744
dtype: float64
data[['Vermont', 'Nevada', 'Idaho']] = np.nan
data
Ohio          1.045855
New York      1.075995
Vermont            NaN
Florida       0.086684
Oregon       -1.262191
Nevada             NaN
California    0.120289
Idaho              NaN
dtype: float64
data.groupby(group_key).mean()
East    0.736178
West   -0.570951
dtype: float64

We can fill the NA values using the group means

fill_mean=lambda g:g.fillna(g.mean())
data.groupby(group_key).apply(fill_mean)
East  Ohio          1.045855
      New York      1.075995
      Vermont       0.736178
      Florida       0.086684
West  Oregon       -1.262191
      Nevada       -0.570951
      California    0.120289
      Idaho        -0.570951
dtype: float64
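The same group-wise fill can also be written without apply by using transform, which broadcasts each group's mean back to the original shape (shortened toy data here, mirroring the example above):

```python
import numpy as np
import pandas as pd

states = ['Ohio', 'New York', 'Vermont', 'Florida']
group_key = ['East', 'East', 'West', 'West']
data = pd.Series([1.0, 3.0, np.nan, 5.0], index=states)

# transform('mean') returns a Series aligned with data, so fillna works elementwise
filled = data.fillna(data.groupby(group_key).transform('mean'))
print(filled)  # Vermont is filled with the West group mean, 5.0
```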

In another case, you might have predefined fill values in your code that vary by group

fill_values={'East':0.5,'West':-1}
fill_func=lambda g:g.fillna(fill_values[g.name])
data.groupby(group_key).apply(fill_func)
East  Ohio          1.045855
      New York      1.075995
      Vermont       0.500000
      Florida       0.086684
West  Oregon       -1.262191
      Nevada       -1.000000
      California    0.120289
      Idaho        -1.000000
dtype: float64

Example: Random sampling and permutation

Suppose you wanted to draw a random sample (with or without replacement) from a large dataset for Monte Carlo simulation purposes or some other application. There are a number of ways to perform the "draws"; here we use the sample method for Series

suits=['H','S','C','D']
card_val=(list(range(1,11))+[10]*3)*4
base_names=['A']+list(range(2,11))+['J','K','Q']
cards=[]
for suit in['H','S','C','D']:
    cards.extend(str(num)+suit for num in base_names)
deck=pd.Series(card_val,index=cards)
deck
AH      1
2H      2
3H      3
4H      4
5H      5
6H      6
7H      7
8H      8
9H      9
10H    10
JH     10
KH     10
QH     10
AS      1
2S      2
3S      3
4S      4
5S      5
6S      6
7S      7
8S      8
9S      9
10S    10
JS     10
KS     10
QS     10
AC      1
2C      2
3C      3
4C      4
5C      5
6C      6
7C      7
8C      8
9C      9
10C    10
JC     10
KC     10
QC     10
AD      1
2D      2
3D      3
4D      4
5D      5
6D      6
7D      7
8D      8
9D      9
10D    10
JD     10
KD     10
QD     10
dtype: int64
# Draw a hand of 5 cards from the deck
def draw(deck,n=5):
    return deck.sample(n)
draw(deck)
10H    10
9H      9
6H      6
JC     10
QH     10
dtype: int64

Suppose you wanted two random cards from each suit. Because the suit is the last character of each card name, we can group based on this and use apply:

get_suit=lambda card:card[-1]
deck.groupby(get_suit).apply(draw,n=2)
C  7C     7
   KC    10
D  4D     4
   AD     1
H  3H     3
   8H     8
S  KS    10
   JS    10
dtype: int64
deck.groupby(get_suit,group_keys=False).apply(draw,n=2)
5C    5
4C    4
6D    6
4D    4
2H    2
5H    5
8S    8
3S    3
dtype: int64

Example: Group weighted average and correlation

df = pd.DataFrame({'category': ['a', 'a', 'a', 'a','b', 'b', 'b', 'b'],
                   'data': np.random.randn(8),
                   'weights': np.random.rand(8)})

df
  category      data   weights
0        a -1.022846  0.702148
1        a  0.405966  0.095783
2        a  0.282171  0.439928
3        a  0.541287  0.866943
4        b  0.695519  0.363663
5        b -0.419917  0.270279
6        b -0.077227  0.565714
7        b  1.600511  0.636178
grouped=df.groupby('category')
get_wavg=lambda g:np.average(g['data'],weights=g['weights'])
grouped.apply(get_wavg)
category
a   -0.040814
b    0.606788
dtype: float64

Consider a dataset from Yahoo! Finance containing end-of-day prices for a few stocks and the S&P 500 index (the SPX symbol)

close_px=pd.read_csv('F:/项目学习/利用Pyhon进行数据分析(第二版)/利用Pyhon进行数据分析/pydata-book-2nd-edition/examples/stock_px_2.csv',parse_dates=True,index_col=0)
close_px.head()

            AAPL   MSFT    XOM     SPX
2003-01-02  7.40  21.11  29.22  909.03
2003-01-03  7.45  21.14  29.24  908.59
2003-01-06  7.45  21.52  29.96  929.01
2003-01-07  7.43  21.93  28.95  922.93
2003-01-08  7.28  21.31  28.83  909.93
close_px.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2214 entries, 2003-01-02 to 2011-10-14
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   AAPL    2214 non-null   float64
 1   MSFT    2214 non-null   float64
 2   XOM     2214 non-null   float64
 3   SPX     2214 non-null   float64
dtypes: float64(4)
memory usage: 86.5 KB
close_px[-4:]
              AAPL   MSFT    XOM      SPX
2011-10-11  400.29  27.00  76.27  1195.54
2011-10-12  402.19  26.96  77.16  1207.25
2011-10-13  408.43  27.18  76.37  1203.66
2011-10-14  422.00  27.27  78.11  1224.58
# spx_corr computes the pairwise correlation of each column with the 'SPX' column
spx_corr=lambda x:x.corrwith(x['SPX'])

# compute percent change on close_px to get daily returns
rets=close_px.pct_change().dropna()
rets
                AAPL      MSFT       XOM       SPX
2003-01-03  0.006757  0.001421  0.000684 -0.000484
2003-01-06  0.000000  0.017975  0.024624  0.022474
2003-01-07 -0.002685  0.019052 -0.033712 -0.006545
2003-01-08 -0.020188 -0.028272 -0.004145 -0.014086
2003-01-09  0.008242  0.029094  0.021159  0.019386
...              ...       ...       ...       ...
2011-10-10  0.051406  0.026286  0.036977  0.034125
2011-10-11  0.029526  0.002227 -0.000131  0.000544
2011-10-12  0.004747 -0.001481  0.011669  0.009795
2011-10-13  0.015515  0.008160 -0.010238 -0.002974
2011-10-14  0.033225  0.003311  0.022784  0.017380

[2213 rows x 4 columns]

get_year=lambda x:x.year
by_year=rets.groupby(get_year)
by_year.apply(spx_corr)
          AAPL      MSFT       XOM  SPX
2003  0.541124  0.745174  0.661265  1.0
2004  0.374283  0.588531  0.557742  1.0
2005  0.467540  0.562374  0.631010  1.0
2006  0.428267  0.406126  0.518514  1.0
2007  0.508118  0.658770  0.786264  1.0
2008  0.681434  0.804626  0.828303  1.0
2009  0.707103  0.654902  0.797921  1.0
2010  0.710105  0.730118  0.839057  1.0
2011  0.691931  0.800996  0.859975  1.0

You can also compute inter-column correlations. Here we compute the annual correlation between Apple and Microsoft

by_year.apply(lambda g:g['AAPL'].corr(g['MSFT']))
2003    0.480868
2004    0.259024
2005    0.300093
2006    0.161735
2007    0.417738
2008    0.611901
2009    0.432738
2010    0.571946
2011    0.581987
dtype: float64

Example: Group-wise linear regression

import statsmodels.api as sm
def regress(data,yvar,xvars):
    Y=data[yvar]
    X=data[xvars]
    X['intercept']=1
    result=sm.OLS(Y,X).fit()
    return result.params
by_year.apply(regress,'AAPL',['SPX'])
           SPX  intercept
2003  1.195406   0.000710
2004  1.363463   0.004201
2005  1.766415   0.003246
2006  1.645496   0.000080
2007  1.198761   0.003438
2008  0.968016  -0.001110
2009  0.879103   0.002954
2010  1.052608   0.001261
2011  0.806605   0.001514

Pivot Tables and Cross-Tabulation

Pivot tables

Suppose you wanted to compute a table of group means (the default pivot_table aggregation type) arranged by day and smoker on the rows:

tips_1.pivot_table(index=['day','smoker'])
                 size       tip   tip_pct  total_bill
day  smoker
Fri  No      2.250000  2.812500  0.151650   18.420000
     Yes     2.066667  2.714000  0.174783   16.813333
Sat  No      2.555556  3.102889  0.158048   19.661778
     Yes     2.476190  2.875476  0.147906   21.276667
Sun  No      2.929825  3.167895  0.160113   20.506667
     Yes     2.578947  3.516842  0.187250   24.120000
Thur No      2.488889  2.673778  0.160298   17.113111
     Yes     2.352941  3.030000  0.163863   19.190588
tips.pivot_table(['tip_pct','size'],index=['time','day'],columns='smoker')
                 size             tip_pct
smoker             No       Yes        No       Yes
time   day
Dinner Fri   2.000000  2.222222  0.139622  0.165347
       Sat   2.555556  2.476190  0.158048  0.147906
       Sun   2.929825  2.578947  0.160113  0.187250
       Thur  2.000000       NaN  0.159744       NaN
Lunch  Fri   3.000000  1.833333  0.187735  0.188937
       Thur  2.500000  2.352941  0.160311  0.163863

We could augment this table to include partial totals by passing margins=True. This adds All row and column labels, with corresponding values being the group statistics for all the data within a single tier

tips.pivot_table(['tip_pct','size'],index=['time','day'],columns='smoker',margins=True)
                 size                       tip_pct
smoker             No       Yes       All        No       Yes       All
time   day
Dinner Fri   2.000000  2.222222  2.166667  0.139622  0.165347  0.158916
       Sat   2.555556  2.476190  2.517241  0.158048  0.147906  0.153152
       Sun   2.929825  2.578947  2.842105  0.160113  0.187250  0.166897
       Thur  2.000000       NaN  2.000000  0.159744       NaN  0.159744
Lunch  Fri   3.000000  1.833333  2.000000  0.187735  0.188937  0.188765
       Thur  2.500000  2.352941  2.459016  0.160311  0.163863  0.161301
All          2.668874  2.408602  2.569672  0.159328  0.163196  0.160803

Here, the All values are means without taking into account smoker versus non-smoker (the All columns) or any of the two levels of grouping on the rows (the All row)

To use an aggregation function other than mean, pass it to the aggfunc keyword. For example, count or len will give you a cross-tabulation (count or frequency) of group sizes

tips.pivot_table('tip_pct', index=['time', 'smoker'],columns='day',aggfunc=len, margins=True)
day             Fri   Sat   Sun  Thur  All
time   smoker
Dinner No       3.0  45.0  57.0   1.0  106
       Yes      9.0  42.0  19.0   NaN   70
Lunch  No       1.0   NaN   NaN  44.0   45
       Yes      6.0   NaN   NaN  17.0   23
All            19.0  87.0  76.0  62.0  244
tips.pivot_table('tip_pct', index=['time', 'size', 'smoker'],columns='day', aggfunc='mean', fill_value=0)

day                      Fri       Sat       Sun      Thur
time   size smoker
Dinner 1    No      0.000000  0.137931  0.000000  0.000000
            Yes     0.000000  0.325733  0.000000  0.000000
       2    No      0.139622  0.162705  0.168859  0.159744
            Yes     0.171297  0.148668  0.207893  0.000000
       3    No      0.000000  0.154661  0.152663  0.000000
            Yes     0.000000  0.144995  0.152660  0.000000
       4    No      0.000000  0.150096  0.148143  0.000000
            Yes     0.117750  0.124515  0.193370  0.000000
       5    No      0.000000  0.000000  0.206928  0.000000
            Yes     0.000000  0.106572  0.065660  0.000000
       6    No      0.000000  0.000000  0.103799  0.000000
Lunch  1    No      0.000000  0.000000  0.000000  0.181728
            Yes     0.223776  0.000000  0.000000  0.000000
       2    No      0.000000  0.000000  0.000000  0.166005
            Yes     0.181969  0.000000  0.000000  0.158843
       3    No      0.187735  0.000000  0.000000  0.084246
            Yes     0.000000  0.000000  0.000000  0.204952
       4    No      0.000000  0.000000  0.000000  0.138919
            Yes     0.000000  0.000000  0.000000  0.155410
       5    No      0.000000  0.000000  0.000000  0.121389
       6    No      0.000000  0.000000  0.000000  0.173706

pivot_table options

Argument     Description
values       Column name or names to aggregate; by default aggregates all numeric columns
index        Column names or other group keys to group on the rows of the resulting pivot table
columns      Column names or other group keys to group on the columns of the resulting pivot table
aggfunc      Aggregation function or list of functions (mean by default); can be any function valid in a groupby context
fill_value   Replace missing values in the result table
dropna       If True, do not include columns whose entries are all NA
margins      Add row/column subtotals and grand total (False by default)

Cross-tabulations: crosstab

A cross-tabulation (or crosstab for short) is a special case of a pivot table that computes group frequencies.

data=pd.DataFrame({'Sample':np.arange(1,11),
                  'Nationality':['USA','Japan','USA','Japan','Japan','Japan','USA','USA','Japan','USA'],
                  'Handedness':['Right-handed','Left-handed','Right-handed','Right-handed','Left-handed','Right-handed','Right-handed','Left-handed','Right-handed','Right-handed']
                  })
data
   Sample Nationality    Handedness
0       1         USA  Right-handed
1       2       Japan   Left-handed
2       3         USA  Right-handed
3       4       Japan  Right-handed
4       5       Japan   Left-handed
5       6       Japan  Right-handed
6       7         USA  Right-handed
7       8         USA   Left-handed
8       9       Japan  Right-handed
9      10         USA  Right-handed
pd.crosstab(data.Nationality, data.Handedness, margins=True)

Handedness   Left-handed  Right-handed  All
Nationality
Japan                  2             3    5
USA                    1             4    5
All                    3             7   10
pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)

smoker        No  Yes  All
time   day
Dinner Fri     3    9   12
       Sat    45   42   87
       Sun    57   19   76
       Thur    1    0    1
Lunch  Fri     1    6    7
       Thur   44   17   61
All          151   93  244
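Under the hood a crosstab is just a counting pivot table; as a quick cross-check on made-up data, pivot_table with a counting aggfunc produces the same cell counts:

```python
import pandas as pd

df = pd.DataFrame({'nat': ['USA', 'Japan', 'USA', 'Japan', 'USA'],
                   'hand': ['R', 'L', 'R', 'R', 'L']})

ct = pd.crosstab(df['nat'], df['hand'])

# the same counts, spelled as a pivot table over a dummy column
pt = df.assign(n=1).pivot_table('n', index='nat', columns='hand',
                                aggfunc='count', fill_value=0)
print(ct)
```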