dataframe groupby_利用Python进行数据分析GroupBy

数据的分组与聚合操作

在数据分析工作流中,一个重要的工作是对数据进行分类,并在每一组上应用一个聚合函数或转换函数。在经历载入、合并、准备数据集之后,可能需要计算分组统计或者制作数据透视表用于报告或者可视化的目的。pandas提供了一个非常灵活的groupby接口,来对数据集进行切片、切块和总结。本章主要内容如下:

  • 使用一个或多个键将pandas对象拆分成多块
  • 计算组汇总统计信息
  • 应用组内变换或其他操作
  • 计算数据透视表和交叉表
  • 执行分位数分析和其他统计组分析
import pandas as pdimport numpy as np

GroupBy

groupby包含三个步骤:split-apply-combine。首先是将数据按照特定轴向分离,进行分组;然后将函数应用到各个组中;最后将结果联合起来。

61cb0985f58f14bd4829f45c614faca7.png
# 我们先创建一个DataFrame。对其进行groupby操作df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],                   'key2': ['one', 'two', 'one', 'two', 'one'],                   'data1': np.random.randn(5),                   'data2': np.random.randn(5)})df
key1key2data1data2
0aone0.209170-0.331921
1atwo0.1625720.322581
2bone0.1450621.242235
3btwo-1.705477-0.563494
4aone0.309842-1.541020
# 根据key1标签计算data1列的均值grouped = df['data1'].groupby(df['key1'])grouped
# grouped变量是一个GroupBy对象,可以对其进行操作# 比如计算均值grouped.mean()   # 得到的就是data1根据key1分组的均值
key1
a 0.227195
b -0.780207
Name: data1, dtype: float64
# 当然我们可以使用列表传入Series分组键进行分组means = df['data1'].groupby([df['key1'], df['key2']]).mean() # means含有多层索引means
key1  key2
a one 0.259506
two 0.162572
b one 0.145062
two -1.705477
Name: data1, dtype: float64
# 当然,分组的键也可以是正确长度的任何数组states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])years = np.array([2005, 2005, 2006, 2005, 2006])df['data1'].groupby([states, years]).mean()
California  2005    0.162572
2006 0.145062
Ohio 2005 -0.748154
2006 0.309842
Name: data1, dtype: float64
# 分组信息如果在DataFrame中,则可以直接使用列名作为分组键df.groupby('key1').mean()  # 因为df['key2']不是数值数据,所以并不会计算均值
data1data2
key1
a0.227195-0.516787
b-0.7802070.339371
# 使用size查看组大小信息,注意缺失值将排除在外df.groupby(['key1', 'key2']).size()
key1  key2
a one 2
two 1
b one 1
two 1
dtype: int64

遍历各分组

GroupBy对象支持迭代,会先生成一个包含组名和数据块的2维元组序列

for name, group in df.groupby('key1'):    print(name)    print(group)
a
key1 key2 data1 data2
0 a one 0.209170 -0.331921
1 a two 0.162572 0.322581
4 a one 0.309842 -1.541020
b
key1 key2 data1 data2
2 b one 0.145062 1.242235
3 b two -1.705477 -0.563494
# 在有多个分组键的情况下,元组中的第一个元素是键值的元组for k, group in df.groupby(['key1', 'key2']):    print(k)    print(group)
('a', 'one')
key1 key2 data1 data2
0 a one 0.209170 -0.331921
4 a one 0.309842 -1.541020
('a', 'two')
key1 key2 data1 data2
1 a two 0.162572 0.322581
('b', 'one')
key1 key2 data1 data2
2 b one 0.145062 1.242235
('b', 'two')
key1 key2 data1 data2
3 b two -1.705477 -0.563494
# 选择任意一块数据进行操作pieces = dict(list(df.groupby('key1')))pieces['b']
key1key2data1data2
2bone0.1450621.242235
3btwo-1.705477-0.563494

默认情况下,groupby在axis=0的轴向上进行分组,但是也可以选定其他轴向。

# 我们可以根据数据类型进行分组df.dtypes
key1      object
key2 object
data1 float64
data2 float64
dtype: object
grouped = df.groupby(df.dtypes, axis=1)for dtype, group in grouped:    print(dtype)    print(group)
float64
data1 data2
0 0.209170 -0.331921
1 0.162572 0.322581
2 0.145062 1.242235
3 -1.705477 -0.563494
4 0.309842 -1.541020
object
key1 key2
0 a one
1 a two
2 b one
3 b two
4 a one

选择一列或者所有列的子集

将从DataFrame中创建的GroupBy对象用列名称或者列名称数组进行索引时,相当于获取用于聚合的列子集。也就是说:

  • df.groupby('key1')['data1'] 相当于df['data1'].groupby(df['key1'])
  • df.groupby('key1')[['data1']]相当于df[['data1']].groupby(df['key1'])

对于大型数据集,可能只需要聚合少部分列

# 计算data2列的均值并获得DataFramedf.groupby('key1')[['data1']].mean()
data1
key1
a0.227195
b-0.780207
# 如果只有单个列名进行传递,则返回的是Seriess_grouped = df.groupby(['key1', 'key2'])['data2']s_grouped # 可以看到获得的是分组的Series
s_grouped.mean()
key1  key2
a one -0.936470
two 0.322581
b one 1.242235
two -0.563494
Name: data2, dtype: float64

使用字典和Series分组

分组信息可能会以非数组形式存在

people = pd.DataFrame(np.random.randn(5, 5),                       columns=['a', 'b', 'c', 'd', 'e'],                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])people.iloc[2:3, [1, 2]] = np.nanpeople
abcde
Joe1.2217240.1102671.285643-2.2904010.409033
Steve0.847775-0.4558330.725666-0.694706-1.162236
Wes-0.504510NaNNaN1.0692621.364316
Jim-2.089744-0.940379-0.7371870.189542-1.010504
Travis-1.310530-0.341813-1.7308680.2894350.053035
# 'f'未用于分组mapping = {'a': 'red', 'b':'blue', 'c':'blue', 'd':'blue', 'e': 'red', 'f': 'orange'}by_column = people.groupby(mapping, axis=1)by_column.sum()
bluered
Joe-0.8944911.630757
Steve-0.424873-0.314461
Wes1.0692620.859806
Jim-1.488024-3.100248
Travis-1.783247-1.257495
# 使用Series分组map_series = pd.Series(mapping)people.groupby(map_series, axis=1).count()
bluered
Joe32
Steve32
Wes12
Jim32
Travis32

使用函数分组

作为分组键传递的函数将会按照每个索引值调用一次,同时返回值用作分组名称。

# 根据名字长度分组people.groupby(len).sum()
abcde
3-1.372529-0.8301110.548455-1.0315980.762845
50.847775-0.4558330.725666-0.694706-1.162236
6-1.310530-0.341813-1.7308680.2894350.053035

可以将函数与数组、字典或Series进行混合,所有的对象在内部都转换为数组

key_list = ['one', 'one', 'one', 'two', 'two']for key, group in people.groupby([len, key_list]):    print(key)    print(group)
(3, 'one')
a b c d e
Joe 1.221724 0.110267 1.285643 -2.290401 0.409033
Wes -0.504510 NaN NaN 1.069262 1.364316
(3, 'two')
a b c d e
Jim -2.089744 -0.940379 -0.737187 0.189542 -1.010504
(5, 'one')
a b c d e
Steve 0.847775 -0.455833 0.725666 -0.694706 -1.162236
(6, 'two')
a b c d e
Travis -1.31053 -0.341813 -1.730868 0.289435 0.053035

根据索引层级分组

在分层索引的某个层级上可以进行聚合

columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],                                     [1, 3, 5, 1, 3]],                                     names=['cty', 'tenor'])hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)hier_df
ctyUSJP
tenor13513
02.3038571.048559-0.188793-0.178907-0.334412
1-0.2211240.218428-0.4350820.538684-0.740732
20.1874500.325777-0.892818-0.8975970.855642
3-0.0063150.5376070.855795-0.135583-0.635938
# 根据层级索引时,将层级数值或层级名称传递给level关键字hier_df.groupby(level='cty', axis=1).count()
ctyJPUS
023
123
223
323

数据聚合

数据聚合就是对Series或DataFrame所有列使用aggregate和所需函数,或者调用mean、std等方法 groupby优化方法:

函数名描述
count分组中的非NA数量
sum非NA值的累和
mean非NA值的均值
median非NA值的算数中位数

逐列及多函数应用

# 使用上篇文章使用的tips数据集tips = pd.read_csv(r'tip.csv')tips['tip_pct'] = tips['tip'] / (tips['total_bill'])tips.head()
total_billtipsmokerdaytimesizetip_pct
016.991.01NoSunDinner20.059447
110.341.66NoSunDinner30.160542
221.013.50NoSunDinner30.166587
323.683.31NoSunDinner20.139780
424.593.61NoSunDinner40.146808
# 根据day和smoker对tips进行分组grouped = tips.groupby(['day', 'smoker'])# 获得描述性统计(不使用describe方法)# 将函数名以字符串形式传递grouped_pct = grouped['tip_pct']grouped_pct.agg('mean') # 得到的是Series
day   smoker
Fri No 0.151650
Yes 0.174783
Sat No 0.158048
Yes 0.147906
Sun No 0.160113
Yes 0.187250
Thur No 0.160298
Yes 0.163863
Name: tip_pct, dtype: float64
# 若传递的是函数或者函数名的列表,则会获得一个列名为函数名的DataFrame# 首先定义一个函数,获得极差def peak_to_peak(arr):    return arr.max() - arr.min()grouped_pct.agg(['mean', 'std', peak_to_peak]) # 只有groupby内置的方法才可以直接传字符串,否则报错
meanstdpeak_to_peak
daysmoker
FriNo0.1516500.0281230.067349
Yes0.1747830.0512930.159925
SatNo0.1580480.0397670.235193
Yes0.1479060.0613750.290095
SunNo0.1601130.0423470.193226
Yes0.1872500.1541340.644685
ThurNo0.1602980.0387740.193350
Yes0.1638630.0393890.151240
# 如果传递的是(name, function)的元组,那么DataFrame的列名则是元组的第一个元素grouped_pct.agg([('foo', 'mean'), ('bar', np.std)])
foobar
daysmoker
FriNo0.1516500.028123
Yes0.1747830.051293
SatNo0.1580480.039767
Yes0.1479060.061375
SunNo0.1601130.042347
Yes0.1872500.154134
ThurNo0.1602980.038774
Yes0.1638630.039389

在DataFrame中,可以指定应用到所有列上的函数列表或者每一列上应用不同函数

functions = ['count', 'mean', 'max']result = grouped['tip_pct', 'total_bill'].agg(functions)result

tip_pcttotal_bill
countmeanmaxcountmeanmax
daysmoker
FriNo40.1516500.187735418.42000022.75
Yes150.1747830.2634801516.81333340.17
SatNo450.1580480.2919904519.66177848.33
Yes420.1479060.3257334221.27666750.81
SunNo570.1601130.2526725720.50666748.17
Yes190.1872500.7103451924.12000045.35
ThurNo450.1602980.2663124517.11311141.19
Yes170.1638630.2412551719.19058843.11

产生的DataFrame拥有分层列

result['tip_pct']
countmeanmax
daysmoker
FriNo40.1516500.187735
Yes150.1747830.263480
SatNo450.1580480.291990
Yes420.1479060.325733
SunNo570.1601130.252672
Yes190.1872500.710345
ThurNo450.1602980.266312
Yes170.1638630.241255

和以前一样,可以传递具有自定义名称的元组列表

ftuples = [('Durchschnitt', 'mean'), ('Abweichung', np.var)]grouped['tip_pct', 'total_bill'].agg(ftuples)

tip_pcttotal_bill
DurchschnittAbweichungDurchschnittAbweichung
daysmoker
FriNo0.1516500.00079118.42000025.596333
Yes0.1747830.00263116.81333382.562438
SatNo0.1580480.00158119.66177879.908965
Yes0.1479060.00376721.276667101.387535
SunNo0.1601130.00179320.50666766.099980
Yes0.1872500.02375724.120000109.046044
ThurNo0.1602980.00150317.11311159.625081
Yes0.1638630.00155119.19058869.808518

使用含有列名与函数对应关系的字典传递给agg,使列执行不同的函数操作;当多个函数应用于至少一个列时,会产生多层列

grouped.agg({'tip_pct': ['min', 'max', 'mean', 'std'],             'size': 'sum'})
tip_pctsize
minmaxmeanstdsum
daysmoker
FriNo0.1203850.1877350.1516500.0281239
Yes0.1035550.2634800.1747830.05129331
SatNo0.0567970.2919900.1580480.039767115
Yes0.0356380.3257330.1479060.061375104
SunNo0.0594470.2526720.1601130.042347167
Yes0.0656600.7103450.1872500.15413449
ThurNo0.0729610.2663120.1602980.038774112
Yes0.0900140.2412550.1638630.03938940

返回不含行索引的聚合数据

# 使用as_index=False来禁用分组键作为索引tips.groupby(['day','smoker'], as_index=False).mean()# 当然,使用reset_index也是一样的tips.groupby(['day','smoker']).mean().reset_index()
daysmokertotal_billtipsizetip_pct
0FriNo18.4200002.8125002.2500000.151650
1FriYes16.8133332.7140002.0666670.174783
2SatNo19.6617783.1028892.5555560.158048
3SatYes21.2766672.8754762.4761900.147906
4SunNo20.5066673.1678952.9298250.160113
5SunYes24.1200003.5168422.5789470.187250
6ThurNo17.1131112.6737782.4888890.160298
7ThurYes19.1905883.0300002.3529410.163863

通用split-apply-combine

  • apply方法

使用之前的tip数据集,选出tip_pct最高的五个组。首先,可以写一个在特定列中选出最大值所在行的函数

def  top(df, n=5, column='tip_pct'):    return tips.sort_values(by=column)[-n:]
top(tips, n=6)
total_billtipsmokerdaytimesizetip_pct
10914.314.00YesSatDinner20.279525
18323.176.50YesSunDinner40.280535
23211.613.39NoSatDinner20.291990
673.071.00YesSatDinner10.325733
1789.604.00YesSunDinner20.416667
1727.255.15YesSunDinner20.710345
# 按照smoker, day分组# top函数在每一行分组上被调用,之后使用pandas.concat将结果粘贴,分组名为标签# top函数的参数可以放在函数后进行传递tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')
total_billtipsmokerdaytimesizetip_pct
smokerday
NoFri17050.8110.0YesSatDinner30.196812
Sat17050.8110.0YesSatDinner30.196812
Sun17050.8110.0YesSatDinner30.196812
Thur17050.8110.0YesSatDinner30.196812
YesFri17050.8110.0YesSatDinner30.196812
Sat17050.8110.0YesSatDinner30.196812
Sun17050.8110.0YesSatDinner30.196812
Thur17050.8110.0YesSatDinner30.196812

压缩分组键

groupby传递groupby_key=False ,不展示分组索引

tips.groupby('smoker').apply(top)
total_billtipsmokerdaytimesizetip_pct
smoker
No18323.176.50YesSunDinner40.280535
23211.613.39NoSatDinner20.291990
673.071.00YesSatDinner10.325733
1789.604.00YesSunDinner20.416667
1727.255.15YesSunDinner20.710345
Yes18323.176.50YesSunDinner40.280535
23211.613.39NoSatDinner20.291990
673.071.00YesSatDinner10.325733
1789.604.00YesSunDinner20.416667
1727.255.15YesSunDinner20.710345
tips.groupby('smoker', as_index=False).apply(top)
total_billtipsmokerdaytimesizetip_pct
018323.176.50YesSunDinner40.280535
23211.613.39NoSatDinner20.291990
673.071.00YesSatDinner10.325733
1789.604.00YesSunDinner20.416667
1727.255.15YesSunDinner20.710345
118323.176.50YesSunDinner40.280535
23211.613.39NoSatDinner20.291990
673.071.00YesSatDinner10.325733
1789.604.00YesSunDinner20.416667
1727.255.15YesSunDinner20.710345
tips.groupby('smoker', group_keys=False).apply(top)
total_billtipsmokerdaytimesizetip_pct
18323.176.50YesSunDinner40.280535
23211.613.39NoSatDinner20.291990
673.071.00YesSatDinner10.325733
1789.604.00YesSunDinner20.416667
1727.255.15YesSunDinner20.710345
18323.176.50YesSunDinner40.280535
23211.613.39NoSatDinner20.291990
673.071.00YesSatDinner10.325733
1789.604.00YesSunDinner20.416667
1727.255.15YesSunDinner20.710345
tips.groupby(['day','smoker']).mean()
total_billtipsizetip_pct
daysmoker
FriNo18.4200002.8125002.2500000.151650
Yes16.8133332.7140002.0666670.174783
SatNo19.6617783.1028892.5555560.158048
Yes21.2766672.8754762.4761900.147906
SunNo20.5066673.1678952.9298250.160113
Yes24.1200003.5168422.5789470.187250
ThurNo17.1131112.6737782.4888890.160298
Yes19.1905883.0300002.3529410.163863
tips.groupby(['day','smoker'], group_keys=False).mean()
total_billtipsizetip_pct
daysmoker
FriNo18.4200002.8125002.2500000.151650
Yes16.8133332.7140002.0666670.174783
SatNo19.6617783.1028892.5555560.158048
Yes21.2766672.8754762.4761900.147906
SunNo20.5066673.1678952.9298250.160113
Yes24.1200003.5168422.5789470.187250
ThurNo17.1131112.6737782.4888890.160298
Yes19.1905883.0300002.3529410.163863

注意:

as_index:对于聚合输出,返回以组标签作为索引的对象。仅与DataFrame输入相关。as_index = False实际上是“SQL风格”的分组输出。

group_keys:调用apply时,选择是否将组键添加到索引

分位数与桶分析

使用groupby方法与cut, qcut配合使用,对数据更方便地进行分桶或分位分析

frame = pd.DataFrame({'data1': np.random.randn(1000),                      'data2': np.random.randn(1000)})quantiles = pd.cut(frame.data1, 4)quantiles[:10]
0     (-1.138, 0.242]
1 (0.242, 1.622]
2 (0.242, 1.622]
3 (-1.138, 0.242]
4 (-1.138, 0.242]
5 (-2.524, -1.138]
6 (0.242, 1.622]
7 (-1.138, 0.242]
8 (-1.138, 0.242]
9 (-1.138, 0.242]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-2.524, -1.138] < (-1.138, 0.242] < (0.242, 1.622] < (1.622, 3.002]]
def get_stats(group):    return {'min': group.min(), 'max': group.max(), 'count': group.count(), 'mean': group.mean()}grouped = frame.data2.groupby(quantiles)grouped.apply(get_stats)
data1                  
(-2.524, -1.138] min -2.243129
max 2.431713
count 116.000000
mean 0.111999
(-1.138, 0.242] min -2.734511
max 2.722345
count 467.000000
mean 0.028268
(0.242, 1.622] min -2.654807
max 2.826075
count 356.000000
mean 0.075527
(1.622, 3.002] min -2.873482
max 2.594543
count 61.000000
mean -0.007187
Name: data2, dtype: float64
# 上面获得的是等长桶,我们可以使用qcut方法获得等大小的桶grouping = pd.qcut(frame.data1, 10, labels=False)grouped = frame.data2.groupby(grouping)grouped.apply(get_stats).unstack()
minmaxcountmean
data1
0-2.2431292.431713100.00.102224
1-2.2960632.376008100.0-0.015589
2-1.9471942.722345100.00.070937
3-1.9801022.534790100.00.107431
4-2.7345112.335623100.0-0.025969
5-2.4246192.345450100.00.098001
6-2.6548072.398495100.00.026486
7-2.0767282.826075100.00.038720
8-2.4832752.118598100.00.056305
9-2.8734822.594543100.00.067875

使用指定分组值填充缺失值

s = pd.Series(np.random.randn(6))s[::2] = np.nans.fillna(s.mean())
0    0.057708
1 -0.045778
2 0.057708
3 0.047644
4 0.057708
5 0.171259
dtype: float64
states = ['Ohio', 'New Year', 'Vermont', 'Florida', 'Oregon', 'Nevada', 'California', 'Idaho']group_key = ['East'] * 4 + ['West'] * 4data = pd.Series(np.random.randn(8), index=states)data
Ohio          0.350673
New Year -0.126434
Vermont -0.299054
Florida -0.175114
Oregon 0.667130
Nevada 1.023922
California 0.840702
Idaho -0.662078
dtype: float64
data[['Vermont', 'Nevada', 'Idaho']] = np.nandata
Ohio          0.350673
New Year -0.126434
Vermont NaN
Florida -0.175114
Oregon 0.667130
Nevada NaN
California 0.840702
Idaho NaN
dtype: float64
data.groupby(group_key).mean()
East    0.016375
West 0.753916
dtype: float64
fill_mean = lambda g: g.fillna(g.mean())data.groupby(group_key).apply(fill_mean)
Ohio          0.350673
New Year -0.126434
Vermont 0.016375
Florida -0.175114
Oregon 0.667130
Nevada 0.753916
California 0.840702
Idaho 0.753916
dtype: float64
# 为每个分组预定义填充值fill_values = {'East': 0.5, 'West': -1}fill_func = lambda g: g.fillna(fill_values[g.name])data.groupby(group_key).apply(fill_func)
Ohio          0.350673
New Year -0.126434
Vermont 0.500000
Florida -0.175114
Oregon 0.667130
Nevada -1.000000
California 0.840702
Idaho -1.000000
dtype: float64

示例:随机采样与排列

使用Seriessample方法

# 构造一副英式扑克牌# 红桃。黑桃,梅花,方块suits = ['H', 'S', 'C', 'D']card_val = (list(range(1, 11)) + [10] * 3) * 4base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']cards = []for suit in ['H', 'S', 'C', 'D']:    cards.extend(str(num) + suit for num in base_names)deck = pd.Series(card_val, index=cards)
deck[:13]
AH      1
2H 2
3H 3
4H 4
5H 5
6H 6
7H 7
8H 8
9H 9
10H 10
JH 10
KH 10
QH 10
dtype: int64
# 从其中随机拿出五张牌deck.sample(5)
10D    10
7H 7
AH 1
6S 6
4D 4
dtype: int64
# 为了后续方便使用,我们定义一个随机抽牌的函数def draw(deck, n=5):    return deck.sample(n)# 从每个花色中随机抽取两张牌,花色是牌名的最后一个字符get_suit = lambda card: card[-1]deck.groupby(get_suit).apply(draw, 2)
C  QC     10
KC 10
D 10D 10
JD 10
H 2H 2
QH 10
S 7S 7
AS 1
dtype: int64

示例:分组加权平均

df = pd.DataFrame({'category': ['a', 'a', 'a', 'b', 'b', 'b'],                   'data': np.random.randn(6),                   'weights': np.random.rand(6)})df
categorydataweights
0a0.2622840.552361
1a1.1416050.295554
2a0.2198250.030885
3b1.5231840.917011
4b-0.3040850.136345
5b-1.8903890.219738
grouped = df.groupby('category')get_wavg = lambda g: np.average(g['data'], weights=g['weights'])grouped.apply(get_wavg)
category
a 0.556522
b 0.738300
dtype: float64

数据透视表和交叉表

数据透视表:根据一个或多个键聚合一张表的数据,将数据在矩形格式中排列。

pandas中出了groupby工具以及分层索引的重塑操作外,还有pivot_table方法,可以实现数据透视表的功能,pivot_table还可以添加部分总计。

# 使用index设置分组tips.pivot_table(index=['day', 'smoker'])
sizetiptip_pcttotal_bill
daysmoker
FriNo2.2500002.8125000.15165018.420000
Yes2.0666672.7140000.17478316.813333
SatNo2.5555563.1028890.15804819.661778
Yes2.4761902.8754760.14790621.276667
SunNo2.9298253.1678950.16011320.506667
Yes2.5789473.5168420.18725024.120000
ThurNo2.4888892.6737780.16029817.113111
Yes2.3529413.0300000.16386319.190588
# 使用groupby呈现grouped = tips.groupby(['day', 'smoker'])for key, group in grouped:    print(key)    print(group)
('Fri', 'No')
total_bill tip smoker day time size tip_pct
91 22.49 3.50 No Fri Dinner 2 0.155625
94 22.75 3.25 No Fri Dinner 2 0.142857
99 12.46 1.50 No Fri Dinner 2 0.120385
223 15.98 3.00 No Fri Lunch 3 0.187735
('Fri', 'Yes')
total_bill tip smoker day time size tip_pct
90 28.97 3.00 Yes Fri Dinner 2 0.103555
92 5.75 1.00 Yes Fri Dinner 2 0.173913
93 16.32 4.30 Yes Fri Dinner 2 0.263480
95 40.17 4.73 Yes Fri Dinner 4 0.117750
96 27.28 4.00 Yes Fri Dinner 2 0.146628
97 12.03 1.50 Yes Fri Dinner 2 0.124688
98 21.01 3.00 Yes Fri Dinner 2 0.142789
100 11.35 2.50 Yes Fri Dinner 2 0.220264
101 15.38 3.00 Yes Fri Dinner 2 0.195059
220 12.16 2.20 Yes Fri Lunch 2 0.180921
221 13.42 3.48 Yes Fri Lunch 2 0.259314
222 8.58 1.92 Yes Fri Lunch 1 0.223776
224 13.42 1.58 Yes Fri Lunch 2 0.117735
225 16.27 2.50 Yes Fri Lunch 2 0.153657
226 10.09 2.00 Yes Fri Lunch 2 0.198216
('Sat', 'No')
total_bill tip smoker day time size tip_pct
19 20.65 3.35 No Sat Dinner 3 0.162228
20 17.92 4.08 No Sat Dinner 2 0.227679
21 20.29 2.75 No Sat Dinner 2 0.135535
22 15.77 2.23 No Sat Dinner 2 0.141408
23 39.42 7.58 No Sat Dinner 4 0.192288
24 19.82 3.18 No Sat Dinner 2 0.160444
25 17.81 2.34 No Sat Dinner 4 0.131387
26 13.37 2.00 No Sat Dinner 2 0.149589
27 12.69 2.00 No Sat Dinner 2 0.157604
28 21.70 4.30 No Sat Dinner 2 0.198157
29 19.65 3.00 No Sat Dinner 2 0.152672
30 9.55 1.45 No Sat Dinner 2 0.151832
31 18.35 2.50 No Sat Dinner 4 0.136240
32 15.06 3.00 No Sat Dinner 2 0.199203
33 20.69 2.45 No Sat Dinner 4 0.118415
34 17.78 3.27 No Sat Dinner 2 0.183915
35 24.06 3.60 No Sat Dinner 3 0.149626
36 16.31 2.00 No Sat Dinner 3 0.122624
37 16.93 3.07 No Sat Dinner 3 0.181335
38 18.69 2.31 No Sat Dinner 3 0.123596
39 31.27 5.00 No Sat Dinner 3 0.159898
40 16.04 2.24 No Sat Dinner 3 0.139651
57 26.41 1.50 No Sat Dinner 2 0.056797
59 48.27 6.73 No Sat Dinner 4 0.139424
64 17.59 2.64 No Sat Dinner 3 0.150085
65 20.08 3.15 No Sat Dinner 3 0.156873
66 16.45 2.47 No Sat Dinner 2 0.150152
68 20.23 2.01 No Sat Dinner 2 0.099357
70 12.02 1.97 No Sat Dinner 2 0.163894
71 17.07 3.00 No Sat Dinner 3 0.175747
74 14.73 2.20 No Sat Dinner 2 0.149355
75 10.51 1.25 No Sat Dinner 2 0.118934
104 20.92 4.08 No Sat Dinner 2 0.195029
108 18.24 3.76 No Sat Dinner 2 0.206140
110 14.00 3.00 No Sat Dinner 2 0.214286
111 7.25 1.00 No Sat Dinner 1 0.137931
212 48.33 9.00 No Sat Dinner 4 0.186220
227 20.45 3.00 No Sat Dinner 4 0.146699
228 13.28 2.72 No Sat Dinner 2 0.204819
232 11.61 3.39 No Sat Dinner 2 0.291990
233 10.77 1.47 No Sat Dinner 2 0.136490
235 10.07 1.25 No Sat Dinner 2 0.124131
238 35.83 4.67 No Sat Dinner 3 0.130338
239 29.03 5.92 No Sat Dinner 3 0.203927
242 17.82 1.75 No Sat Dinner 2 0.098204
('Sat', 'Yes')
total_bill tip smoker day time size tip_pct
56 38.01 3.00 Yes Sat Dinner 4 0.078927
58 11.24 1.76 Yes Sat Dinner 2 0.156584
60 20.29 3.21 Yes Sat Dinner 2 0.158206
61 13.81 2.00 Yes Sat Dinner 2 0.144823
62 11.02 1.98 Yes Sat Dinner 2 0.179673
63 18.29 3.76 Yes Sat Dinner 4 0.205577
67 3.07 1.00 Yes Sat Dinner 1 0.325733
69 15.01 2.09 Yes Sat Dinner 2 0.139241
72 26.86 3.14 Yes Sat Dinner 2 0.116902
73 25.28 5.00 Yes Sat Dinner 2 0.197785
76 17.92 3.08 Yes Sat Dinner 2 0.171875
102 44.30 2.50 Yes Sat Dinner 3 0.056433
103 22.42 3.48 Yes Sat Dinner 2 0.155219
105 15.36 1.64 Yes Sat Dinner 2 0.106771
106 20.49 4.06 Yes Sat Dinner 2 0.198145
107 25.21 4.29 Yes Sat Dinner 2 0.170171
109 14.31 4.00 Yes Sat Dinner 2 0.279525
168 10.59 1.61 Yes Sat Dinner 2 0.152030
169 10.63 2.00 Yes Sat Dinner 2 0.188147
170 50.81 10.00 Yes Sat Dinner 3 0.196812
171 15.81 3.16 Yes Sat Dinner 2 0.199873
206 26.59 3.41 Yes Sat Dinner 3 0.128244
207 38.73 3.00 Yes Sat Dinner 4 0.077459
208 24.27 2.03 Yes Sat Dinner 2 0.083642
209 12.76 2.23 Yes Sat Dinner 2 0.174765
210 30.06 2.00 Yes Sat Dinner 3 0.066534
211 25.89 5.16 Yes Sat Dinner 4 0.199305
213 13.27 2.50 Yes Sat Dinner 2 0.188395
214 28.17 6.50 Yes Sat Dinner 3 0.230742
215 12.90 1.10 Yes Sat Dinner 2 0.085271
216 28.15 3.00 Yes Sat Dinner 5 0.106572
217 11.59 1.50 Yes Sat Dinner 2 0.129422
218 7.74 1.44 Yes Sat Dinner 2 0.186047
219 30.14 3.09 Yes Sat Dinner 4 0.102522
229 22.12 2.88 Yes Sat Dinner 2 0.130199
230 24.01 2.00 Yes Sat Dinner 4 0.083299
231 15.69 3.00 Yes Sat Dinner 3 0.191205
234 15.53 3.00 Yes Sat Dinner 2 0.193175
236 12.60 1.00 Yes Sat Dinner 2 0.079365
237 32.83 1.17 Yes Sat Dinner 2 0.035638
240 27.18 2.00 Yes Sat Dinner 2 0.073584
241 22.67 2.00 Yes Sat Dinner 2 0.088222
('Sun', 'No')
total_bill tip smoker day time size tip_pct
0 16.99 1.01 No Sun Dinner 2 0.059447
1 10.34 1.66 No Sun Dinner 3 0.160542
2 21.01 3.50 No Sun Dinner 3 0.166587
3 23.68 3.31 No Sun Dinner 2 0.139780
4 24.59 3.61 No Sun Dinner 4 0.146808
5 25.29 4.71 No Sun Dinner 4 0.186240
6 8.77 2.00 No Sun Dinner 2 0.228050
7 26.88 3.12 No Sun Dinner 4 0.116071
8 15.04 1.96 No Sun Dinner 2 0.130319
9 14.78 3.23 No Sun Dinner 2 0.218539
10 10.27 1.71 No Sun Dinner 2 0.166504
11 35.26 5.00 No Sun Dinner 4 0.141804
12 15.42 1.57 No Sun Dinner 2 0.101816
13 18.43 3.00 No Sun Dinner 4 0.162778
14 14.83 3.02 No Sun Dinner 2 0.203641
15 21.58 3.92 No Sun Dinner 2 0.181650
16 10.33 1.67 No Sun Dinner 3 0.161665
17 16.29 3.71 No Sun Dinner 3 0.227747
18 16.97 3.50 No Sun Dinner 3 0.206246
41 17.46 2.54 No Sun Dinner 2 0.145475
42 13.94 3.06 No Sun Dinner 2 0.219512
43 9.68 1.32 No Sun Dinner 2 0.136364
44 30.40 5.60 No Sun Dinner 4 0.184211
45 18.29 3.00 No Sun Dinner 2 0.164024
46 22.23 5.00 No Sun Dinner 2 0.224921
47 32.40 6.00 No Sun Dinner 4 0.185185
48 28.55 2.05 No Sun Dinner 3 0.071804
49 18.04 3.00 No Sun Dinner 2 0.166297
50 12.54 2.50 No Sun Dinner 2 0.199362
51 10.29 2.60 No Sun Dinner 2 0.252672
52 34.81 5.20 No Sun Dinner 4 0.149382
53 9.94 1.56 No Sun Dinner 2 0.156942
54 25.56 4.34 No Sun Dinner 4 0.169797
55 19.49 3.51 No Sun Dinner 2 0.180092
112 38.07 4.00 No Sun Dinner 3 0.105070
113 23.95 2.55 No Sun Dinner 2 0.106472
114 25.71 4.00 No Sun Dinner 3 0.155581
115 17.31 3.50 No Sun Dinner 2 0.202195
116 29.93 5.07 No Sun Dinner 4 0.169395
150 14.07 2.50 No Sun Dinner 2 0.177683
151 13.13 2.00 No Sun Dinner 2 0.152323
152 17.26 2.74 No Sun Dinner 3 0.158749
153 24.55 2.00 No Sun Dinner 4 0.081466
154 19.77 2.00 No Sun Dinner 4 0.101163
155 29.85 5.14 No Sun Dinner 5 0.172194
156 48.17 5.00 No Sun Dinner 6 0.103799
157 25.00 3.75 No Sun Dinner 4 0.150000
158 13.39 2.61 No Sun Dinner 2 0.194922
159 16.49 2.00 No Sun Dinner 4 0.121286
160 21.50 3.50 No Sun Dinner 4 0.162791
161 12.66 2.50 No Sun Dinner 2 0.197472
162 16.21 2.00 No Sun Dinner 3 0.123381
163 13.81 2.00 No Sun Dinner 2 0.144823
165 24.52 3.48 No Sun Dinner 3 0.141925
166 20.76 2.24 No Sun Dinner 2 0.107900
167 31.71 4.50 No Sun Dinner 4 0.141911
185 20.69 5.00 No Sun Dinner 5 0.241663
('Sun', 'Yes')
total_bill tip smoker day time size tip_pct
164 17.51 3.00 Yes Sun Dinner 2 0.171331
172 7.25 5.15 Yes Sun Dinner 2 0.710345
173 31.85 3.18 Yes Sun Dinner 2 0.099843
174 16.82 4.00 Yes Sun Dinner 2 0.237812
175 32.90 3.11 Yes Sun Dinner 2 0.094529
176 17.89 2.00 Yes Sun Dinner 2 0.111794
177 14.48 2.00 Yes Sun Dinner 2 0.138122
178 9.60 4.00 Yes Sun Dinner 2 0.416667
179 34.63 3.55 Yes Sun Dinner 2 0.102512
180 34.65 3.68 Yes Sun Dinner 4 0.106205
181 23.33 5.65 Yes Sun Dinner 2 0.242177
182 45.35 3.50 Yes Sun Dinner 3 0.077178
183 23.17 6.50 Yes Sun Dinner 4 0.280535
184 40.55 3.00 Yes Sun Dinner 2 0.073983
186 20.90 3.50 Yes Sun Dinner 3 0.167464
187 30.46 2.00 Yes Sun Dinner 5 0.065660
188 18.15 3.50 Yes Sun Dinner 3 0.192837
189 23.10 4.00 Yes Sun Dinner 3 0.173160
190 15.69 1.50 Yes Sun Dinner 2 0.095602
('Thur', 'No')
total_bill tip smoker day time size tip_pct
77 27.20 4.00 No Thur Lunch 4 0.147059
78 22.76 3.00 No Thur Lunch 2 0.131810
79 17.29 2.71 No Thur Lunch 2 0.156738
81 16.66 3.40 No Thur Lunch 2 0.204082
82 10.07 1.83 No Thur Lunch 1 0.181728
84 15.98 2.03 No Thur Lunch 2 0.127034
85 34.83 5.17 No Thur Lunch 4 0.148435
86 13.03 2.00 No Thur Lunch 2 0.153492
87 18.28 4.00 No Thur Lunch 2 0.218818
88 24.71 5.85 No Thur Lunch 2 0.236746
89 21.16 3.00 No Thur Lunch 2 0.141777
117 10.65 1.50 No Thur Lunch 2 0.140845
118 12.43 1.80 No Thur Lunch 2 0.144811
119 24.08 2.92 No Thur Lunch 4 0.121262
120 11.69 2.31 No Thur Lunch 2 0.197605
121 13.42 1.68 No Thur Lunch 2 0.125186
122 14.26 2.50 No Thur Lunch 2 0.175316
123 15.95 2.00 No Thur Lunch 2 0.125392
124 12.48 2.52 No Thur Lunch 2 0.201923
125 29.80 4.20 No Thur Lunch 6 0.140940
126 8.52 1.48 No Thur Lunch 2 0.173709
127 14.52 2.00 No Thur Lunch 2 0.137741
128 11.38 2.00 No Thur Lunch 2 0.175747
129 22.82 2.18 No Thur Lunch 3 0.095530
130 19.08 1.50 No Thur Lunch 2 0.078616
131 20.27 2.83 No Thur Lunch 2 0.139615
132 11.17 1.50 No Thur Lunch 2 0.134288
133 12.26 2.00 No Thur Lunch 2 0.163132
134 18.26 3.25 No Thur Lunch 2 0.177985
135 8.51 1.25 No Thur Lunch 2 0.146886
136 10.33 2.00 No Thur Lunch 2 0.193611
137 14.15 2.00 No Thur Lunch 2 0.141343
139 13.16 2.75 No Thur Lunch 2 0.208967
140 17.47 3.50 No Thur Lunch 2 0.200343
141 34.30 6.70 No Thur Lunch 6 0.195335
142 41.19 5.00 No Thur Lunch 5 0.121389
143 27.05 5.00 No Thur Lunch 6 0.184843
144 16.43 2.30 No Thur Lunch 2 0.139988
145 8.35 1.50 No Thur Lunch 2 0.179641
146 18.64 1.36 No Thur Lunch 3 0.072961
147 11.87 1.63 No Thur Lunch 2 0.137321
148 9.78 1.73 No Thur Lunch 2 0.176892
149 7.51 2.00 No Thur Lunch 2 0.266312
195 7.56 1.44 No Thur Lunch 2 0.190476
243 18.78 3.00 No Thur Dinner 2 0.159744
('Thur', 'Yes')
total_bill tip smoker day time size tip_pct
80 19.44 3.00 Yes Thur Lunch 2 0.154321
83 32.68 5.00 Yes Thur Lunch 2 0.152999
138 16.00 2.00 Yes Thur Lunch 2 0.125000
191 19.81 4.19 Yes Thur Lunch 2 0.211509
192 28.44 2.56 Yes Thur Lunch 2 0.090014
193 15.48 2.02 Yes Thur Lunch 2 0.130491
194 16.58 4.00 Yes Thur Lunch 2 0.241255
196 10.34 2.00 Yes Thur Lunch 2 0.193424
197 43.11 5.00 Yes Thur Lunch 4 0.115982
198 13.00 2.00 Yes Thur Lunch 2 0.153846
199 13.51 2.00 Yes Thur Lunch 2 0.148038
200 18.71 4.00 Yes Thur Lunch 3 0.213789
201 12.74 2.01 Yes Thur Lunch 2 0.157771
202 13.00 2.00 Yes Thur Lunch 2 0.153846
203 16.40 2.50 Yes Thur Lunch 2 0.152439
204 20.53 4.00 Yes Thur Lunch 4 0.194837
205 16.47 3.23 Yes Thur Lunch 3 0.196114
# 在tip_pct和size上进行聚合,并根据time分组,把smoker放入表的列,而将day放入表的行tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'], columns='smoker')
sizetip_pct
smokerNoYesNoYes
timeday
DinnerFri2.0000002.2222220.1396220.165347
Sat2.5555562.4761900.1580480.147906
Sun2.9298252.5789470.1601130.187250
Thur2.000000NaN0.159744NaN
LunchFri3.0000001.8333330.1877350.188937
Thur2.5000002.3529410.1603110.163863
# 通过传递margins=True来包含部分统计。tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'], columns='smoker', margins=True)
sizetip_pct
smokerNoYesAllNoYesAll
timeday
DinnerFri2.0000002.2222222.1666670.1396220.1653470.158916
Sat2.5555562.4761902.5172410.1580480.1479060.153152
Sun2.9298252.5789472.8421050.1601130.1872500.166897
Thur2.000000NaN2.0000000.159744NaN0.159744
LunchFri3.0000001.8333332.0000000.1877350.1889370.188765
Thur2.5000002.3529412.4590160.1603110.1638630.161301
All2.6688742.4086022.5696720.1593280.1631960.160803
# 我们也可以使用不同的聚合函数,将函数传递给aggfunc,比如'count'和lentips.pivot_table('tip_pct', index=['time', 'smoker'], columns='day', aggfunc=len, margins=True)
dayFriSatSunThurAll
timesmoker
DinnerNo3.045.057.01.0106.0
Yes9.042.019.0NaN70.0
LunchNo1.0NaNNaN44.045.0
Yes6.0NaNNaN17.023.0
All19.087.076.062.0244.0
# 当然,也可以使用fill_value填充空值tips.pivot_table('tip_pct', index=['time', 'size', 'smoker'], columns='day', aggfunc='mean', fill_value=0)
dayFriSatSunThur
timesizesmoker
Dinner1No0.0000000.1379310.0000000.000000
Yes0.0000000.3257330.0000000.000000
2No0.1396220.1627050.1688590.159744
Yes0.1712970.1486680.2078930.000000
3No0.0000000.1546610.1526630.000000
Yes0.0000000.1449950.1526600.000000
4No0.0000000.1500960.1481430.000000
Yes0.1177500.1245150.1933700.000000
5No0.0000000.0000000.2069280.000000
Yes0.0000000.1065720.0656600.000000
6No0.0000000.0000000.1037990.000000
Lunch1No0.0000000.0000000.0000000.181728
Yes0.2237760.0000000.0000000.000000
2No0.0000000.0000000.0000000.166005
Yes0.1819690.0000000.0000000.158843
3No0.1877350.0000000.0000000.084246
Yes0.0000000.0000000.0000000.204952
4No0.0000000.0000000.0000000.138919
Yes0.0000000.0000000.0000000.155410
5No0.0000000.0000000.0000000.121389
6No0.0000000.0000000.0000000.173706

pivot_table选项

选项名描述
values需要聚合的列名;默认情况下聚合所有数值型的列
index在结果透视表的行上进行分组的列名或其他分组键
columns在结果透视表的列上进行分组的列名或其他分组键
aggfunc聚合函数或函数列表,默认为'mean'
fill_value替换缺失值
dropna删除含有NA的列
margins添加行/列小计和总计

crosstab-交叉表

crosstab前两个参数是数组、Series或者数组列表,计算分组频率

pd.crosstab([tips['time'], tips['day']], tips['smoker'], margins=True)
smokerNoYesAll
timeday
DinnerFri3912
Sat454287
Sun571976
Thur101
LunchFri167
Thur441761
All15193244

总结

本文主要讲解了pandas中分组聚合常用的GroupBy方法,这在数据分析中应用非常广泛。下一次我们将继续讲解时间序列

  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值