"Split-Apply-Combine" Group Operations: Python groupby in Detail, Part 3

issnail: "Split-Apply-Combine" Group Operations: Python groupby in Detail, Part 2 (zhuanlan.zhihu.com)
issnail: The "Split-Apply-Combine" Group Operation Process: Python groupby in Detail (zhuanlan.zhihu.com)

This post continues the series from the articles listed above.

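Selecting a Column or a Subset of Columns

Indexing a GroupBy object created from a DataFrame with a column name (a single string) or a list of column names (an array of strings) selects just those columns for aggregation.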

In [28]: import numpy as np
    ...: import pandas as pd
    ...: from pandas import Series,DataFrame
    ...: df=DataFrame({'key1':['a','a','b','b','a'],'key2':['one','two','one','two','one'],
    ...:               'data1':np.random.randn(5),'data2':np.random.randn(5)})

In [29]: df
Out[29]:
  key1 key2     data1     data2
0    a  one  1.523269 -0.063037
1    a  two -0.368826 -0.406932
2    b  one  0.753990 -1.866066
3    b  two  0.602979 -1.923139
4    a  one -1.712829  0.600494

In [30]: df.groupby('key1')['data1']
Out[30]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x119874f28>

In [31]: df.groupby('key1')[['data2']]
Out[31]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x1198a52e8>

The two expressions above are syntactic sugar for the following:

In [32]: df['data1'].groupby(df['key1'])
Out[32]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x1198a5dd8>

In [33]: df[['data2']].groupby(df['key1'])
Out[33]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x119874e48>
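As a quick check of this equivalence, the sketch below compares the two spellings on a small frame. The values are fixed numbers made up for illustration, not the random data from the session above.

```python
import pandas as pd

# Fixed illustrative values (not the random data used in the session above).
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1.0, 2.0, 3.0, 4.0, 6.0],
    'data2': [10.0, 20.0, 30.0, 40.0, 60.0],
})

# Short form: index the GroupBy object with a column name.
short_form = df.groupby('key1')['data1'].mean()

# Long form: select the column first, then group it by the key Series.
long_form = df['data1'].groupby(df['key1']).mean()

# Both spellings should give the same Series indexed by key1.
same_series = short_form.equals(long_form)

# The same comparison with a list of column names (DataFrame results).
same_frame = df.groupby('key1')[['data2']].mean().equals(
    df[['data2']].groupby(df['key1']).mean()
)

print(same_series, same_frame)
```

The short form is usually easier to read, and it avoids materializing the column subset before grouping.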

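With large datasets in particular, you will often want to aggregate only a few of the columns. For example, the following computes the mean of data2 only and returns the result as a DataFrame: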

In [34]: df.groupby(['key1','key2'])[['data2']].mean()
Out[34]:
              data2
key1 key2
a    one   0.268729
     two  -0.406932
b    one  -1.866066
     two  -1.923139

The object returned by this indexing operation is a grouped DataFrame if a list or array is passed, or a grouped Series if a single column name is passed as a scalar:

In [35]: s_grouped=df.groupby(['key1','key2'])['data2']

In [36]: s_grouped
Out[36]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x119840198>

In [37]: s_grouped.mean()
Out[37]:
key1  key2
a     one     0.268729
      two    -0.406932
b     one    -1.866066
      two    -1.923139
Name: data2, dtype: float64
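To make the scalar-versus-list distinction concrete, here is a minimal sketch with made-up fixed values. It aggregates the same column both ways and checks that only the container type differs.

```python
import pandas as pd

# Fixed illustrative values (not the random data used in the session above).
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data2': [1.0, 2.0, 3.0, 4.0, 5.0],
})

grouped = df.groupby(['key1', 'key2'])

as_series = grouped['data2'].mean()    # scalar column name -> Series result
as_frame = grouped[['data2']].mean()   # list of column names -> DataFrame result

print(type(as_series).__name__)
print(type(as_frame).__name__)

# The aggregated values match; wrapping the Series in a frame makes them comparable.
values_match = as_series.to_frame().equals(as_frame)
print(values_match)
```

Passing a one-element list is a convenient way to keep a DataFrame result when later steps expect columns rather than a single Series.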

Grouping with Dicts and Series

In [38]: people=DataFrame(np.random.randn(5,5),columns=['a','b','c','d','e'],
    ...:                  index=['Joe','Steve','Wes','Jim','Travis'])

In [39]: people
Out[39]:
               a         b         c         d         e
Joe     0.798251 -2.181076 -1.002941 -1.216847 -1.306730
Steve  -0.384873 -0.712956  2.000298 -0.201616  0.618182
Wes     2.630382  0.643784 -0.609386  1.151120 -0.780954
Jim     0.226097 -0.177566  0.615332 -0.199972 -0.995759
Travis  0.283909  0.601818 -0.264502 -0.997113  0.205104

# Add a few NA values

In [40]: people.iloc[2:3,[1,2]]=np.nan  # row 2 ('Wes'), columns 'b' and 'c'; .ix was removed in pandas 1.0

In [41]: people
Out[41]:
               a         b         c         d         e
Joe     0.798251 -2.181076 -1.002941 -1.216847 -1.306730
Steve  -0.384873 -0.712956  2.000298 -0.201616  0.618182
Wes     2.630382       NaN       NaN  1.151120 -0.780954
Jim     0.226097 -0.177566  0.615332 -0.199972 -0.995759
Travis  0.283909  0.601818 -0.264502 -0.997113  0.205104

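Suppose the columns have a known group mapping, and we want to sum the columns by group:

In [42]: mapping={'a':'red','b':'red','c':'blue','d':'blue','e':'red','f':'orange'}

Simply pass this dict to groupby (the unused key 'f' is harmless):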

In [43]: by_column=people.groupby(mapping,axis=1)

In [44]: by_column.sum()
Out[44]:
            blue       red
Joe    -2.219788 -2.689556
Steve   1.798682 -0.479647
Wes     1.151120  1.849428
Jim     0.415360 -0.947227
Travis -1.261615  1.090831
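In recent pandas releases the axis=1 argument of groupby is deprecated, so the call above may emit a warning or fail depending on the installed version. One way to get the same column-wise grouping is to transpose, group the rows, and transpose back. The sketch below does this with made-up fixed values and a two-row frame.

```python
import pandas as pd

# Fixed illustrative values (not the random data used in the session above).
people = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0, 5.0],
     [10.0, 20.0, 30.0, 40.0, 50.0]],
    columns=['a', 'b', 'c', 'd', 'e'],
    index=['Joe', 'Steve'],
)

# Same mapping as in the text; 'f' does not match any column and is ignored.
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}

# Transpose so the columns become rows, group those rows by the mapping,
# sum within each group, then transpose back to the original orientation.
by_color = people.T.groupby(mapping).sum().T

print(by_color)
```

The result has one column per group label that actually occurs in the data, so no 'orange' column appears.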

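A Series offers the same functionality, since it can be viewed as a fixed-size mapping.

When a Series is used as the grouping key, pandas inspects it to ensure that its index is aligned with the axis being grouped: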

In [45]: map_series=Series(mapping)

In [46]: map_series
Out[46]:
a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

In [47]: people.groupby(map_series,axis=1).count()
Out[47]:
        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3
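The alignment and the NA handling of count can both be seen in a small sketch. It assumes made-up fixed values, a deliberately shuffled Series index, and the transpose-based spelling (rather than axis=1) so that it also runs on newer pandas versions.

```python
import numpy as np
import pandas as pd

# Fixed illustrative values; 'Wes' is missing 'b' and 'c', as in the text.
people = pd.DataFrame(
    [[1.0, np.nan, np.nan, 4.0, 5.0],
     [1.0, 2.0, 3.0, 4.0, 5.0]],
    columns=['a', 'b', 'c', 'd', 'e'],
    index=['Wes', 'Jim'],
)

# The Series index is shuffled on purpose and has an extra label 'f':
# grouping aligns on labels, not on position.
map_series = pd.Series(
    ['orange', 'red', 'blue', 'red', 'blue', 'red'],
    index=['f', 'e', 'd', 'b', 'c', 'a'],
)

# Group the columns via a transpose; count() tallies non-NA values only.
counts = people.T.groupby(map_series).count().T

print(counts)
```

Because count skips missing values, the row with NaN entries reports fewer observations per group than the complete row.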
