<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001C3A885C978>
grouped.mean()  # call the mean method to compute the mean of each group
key1
a 0.039233
b 1.482094
Name: data1, dtype: float64
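The definitions of `df` and `grouped` are not shown in this part of the notes. As a minimal sketch, the frame can be rebuilt from the values printed in the outputs of this section (rounded to six decimals), assuming `grouped` was created by grouping `data1` on `key1`:

```python
import pandas as pd

# Values copied from the outputs printed in this section (rounded to 6 decimals).
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [-0.556877, 0.873811, 1.997530, 0.966659, -0.199236],
    'data2': [0.300878, 0.742571, -0.632550, -1.091297, -0.990511],
})

# Assumption: this is how `grouped` was built. Nothing is computed yet;
# the object only records which rows belong to which key1 value.
grouped = df['data1'].groupby(df['key1'])

# Apply mean to each group and combine the results into one Series.
group_means = grouped.mean()
print(group_means)
```

Because the inputs are rounded, the recomputed means can differ from the printed ones in the last decimal place.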
means = df['data1'].groupby([df['key1'], df['key2']]).mean()  # pass several arrays as a list
means  # the result is a Series with a hierarchical (multi-level) index
key1  key2
a     one    -0.378057
      two     0.873811
b     one     1.997530
      two     0.966659
Name: data1, dtype: float64
means.unstack()  # pivot the inner index level into columns, giving a DataFrame
key2       one       two
key1
a    -0.378057  0.873811
b     1.997530  0.966659
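To make the reshaping step concrete, the sketch below rebuilds `df` from the values printed in this section (an assumption, since its definition is not shown here) and compares `unstack()` with an equivalent `pivot_table` call. The names `wide` and `pivoted` are illustrative only.

```python
import numpy as np
import pandas as pd

# Values copied from the outputs printed in this section (rounded to 6 decimals).
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [-0.556877, 0.873811, 1.997530, 0.966659, -0.199236],
    'data2': [0.300878, 0.742571, -0.632550, -1.091297, -0.990511],
})

# Group by two keys: the result is a Series indexed by (key1, key2).
means = df['data1'].groupby([df['key1'], df['key2']]).mean()

# unstack() moves the inner index level (key2) into the columns.
wide = means.unstack()

# The same table can be produced in one step with pivot_table.
pivoted = df.pivot_table(index='key1', columns='key2',
                         values='data1', aggfunc='mean')

print(wide)
print(np.allclose(wide.to_numpy(), pivoted.to_numpy()))
```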
The group keys can be any arrays of the correct length:
# use any arrays of the correct length as group keys
states = np.array(['Ohio','California','California','Ohio','Ohio'])
years = np.array([2005,2005,2006,2005,2006])
df['data1'].groupby([states, years]).mean()
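The result of this call is not shown in the notes. The sketch below, which assumes `df` holds the values printed elsewhere in this section, runs the same grouping so the output can be inspected. The keys come from the standalone arrays, not from columns of `df`; only their length has to match.

```python
import numpy as np
import pandas as pd

# Values copied from the outputs printed in this section (rounded to 6 decimals).
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [-0.556877, 0.873811, 1.997530, 0.966659, -0.199236],
    'data2': [0.300878, 0.742571, -0.632550, -1.091297, -0.990511],
})

# External key arrays: one entry per row of df, aligned by position.
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])

by_state_year = df['data1'].groupby([states, years]).mean()
print(by_state_year)
```

With these inputs, rows 0 and 3 share the key `('Ohio', 2005)`, so that group is the only one averaged over more than one value.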
for name, group in df.groupby('key1'):
    print(name)
    print(group)
a
key1 key2 data1 data2
0 a one -0.556877 0.300878
1 a two 0.873811 0.742571
4 a one -0.199236 -0.990511
b
key1 key2 data1 data2
2 b one 1.997530 -0.632550
3 b two 0.966659 -1.091297
With multiple group keys, the first element of each tuple is itself a tuple of key values:
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group)
('a', 'one')
key1 key2 data1 data2
0 a one -0.556877 0.300878
4 a one -0.199236 -0.990511
('a', 'two')
key1 key2 data1 data2
1 a two 0.873811 0.742571
('b', 'one')
key1 key2 data1 data2
2 b one 1.99753 -0.63255
('b', 'two')
key1 key2 data1 data2
3 b two 0.966659 -1.091297
{'a': key1 key2 data1 data2
0 a one -0.556877 0.300878
1 a two 0.873811 0.742571
4 a one -0.199236 -0.990511, 'b': key1 key2 data1 data2
2 b one 1.997530 -0.632550
3 b two 0.966659 -1.091297}
pieces['b']  # each value in the dictionary is a DataFrame
  key1 key2     data1     data2
2    b  one  1.997530 -0.632550
3    b  two  0.966659 -1.091297
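The line that creates `pieces` does not appear in these notes. One plausible way to obtain such a dictionary, shown here as an assumption and not as the author's exact code, is to collect the `(name, group)` pairs produced by iterating over the groupby object:

```python
import pandas as pd

# Values copied from the outputs printed in this section (rounded to 6 decimals).
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [-0.556877, 0.873811, 1.997530, 0.966659, -0.199236],
    'data2': [0.300878, 0.742571, -0.632550, -1.091297, -0.990511],
})

# Iterating a groupby yields (group name, sub-DataFrame) pairs,
# so a dict comprehension turns it into a name -> DataFrame mapping.
pieces = {name: group for name, group in df.groupby('key1')}

print(pieces['b'])
```

Each sub-DataFrame keeps the original row labels, which is why `pieces['b']` is indexed by 2 and 3.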
By default, groupby groups along axis=0 (the rows); pass axis=1 to group the columns instead:
grouped = df.groupby(df.dtypes, axis=1)
for dtype, group in grouped:
    print(dtype)
    print(group)
float64
data1 data2
0 -0.556877 0.300878
1 0.873811 0.742571
2 1.997530 -0.632550
3 0.966659 -1.091297
4 -0.199236 -0.990511
object
key1 key2
0 a one
1 a two
2 b one
3 b two
4 a one
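As far as I know, the `axis=1` argument of `DataFrame.groupby` is deprecated in recent pandas releases (2.1 and later), so the call above may warn or fail depending on the installed version. The sketch below reaches the same split by dtype without `axis=1`, using a plain dictionary. `columns_by_dtype` is an illustrative name, and `df` is rebuilt from the values printed in this section.

```python
import pandas as pd

# Values copied from the outputs printed in this section (rounded to 6 decimals).
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [-0.556877, 0.873811, 1.997530, 0.966659, -0.199236],
    'data2': [0.300878, 0.742571, -0.632550, -1.091297, -0.990511],
})

# Collect column names under the string name of their dtype.
columns_by_dtype = {}
for column, dtype in df.dtypes.items():
    columns_by_dtype.setdefault(str(dtype), []).append(column)

# Select each block of columns, mirroring the loop over grouped above.
for dtype_name, columns in columns_by_dtype.items():
    print(dtype_name)
    print(df[columns])
```

The name reported for the string columns depends on the pandas version, so only the `float64` key should be relied on.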
key1  key2
a     one    -0.344816
      two     0.742571
b     one    -0.632550
      two    -1.091297
Name: data2, dtype: float64
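The statement that produced the Series above is missing from the notes. Judging by the index levels and the name `data2`, it looks like the result of selecting one column from a grouped DataFrame. A sketch under that assumption, with `df` rebuilt from the printed values:

```python
import numpy as np
import pandas as pd

# Values copied from the outputs printed in this section (rounded to 6 decimals).
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [-0.556877, 0.873811, 1.997530, 0.966659, -0.199236],
    'data2': [0.300878, 0.742571, -0.632550, -1.091297, -0.990511],
})

# Indexing the groupby object with a column name restricts the
# aggregation to that column and returns a Series.
data2_means = df.groupby(['key1', 'key2'])['data2'].mean()

# Equivalent long form: group the column by the key columns directly.
long_form = df['data2'].groupby([df['key1'], df['key2']]).mean()

print(data2_means)
print(np.allclose(data2_means.to_numpy(), long_form.to_numpy()))
```

Selecting the column before aggregating avoids computing means for columns that are not needed, which matters on larger frames.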
1.3 Grouping with dictionaries and Series
Grouping information may exist in a form other than an array. Consider the following example:
people = pd.DataFrame(np.random.randn(5,5),
index=['Joe','Steve','Wes','Jim','Travis'],
columns=['a','b','c','d','e'])
people.iloc[2:3,[1,2]]= np.nan
people
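As a rough illustration of non-array grouping information, the sketch below groups the columns of a frame like `people` with a dictionary. The `mapping` dictionary and its color labels are hypothetical, the random values are seeded only so the sketch is repeatable, and the frame is transposed so that the grouping runs along the rows without needing `axis=1`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
people = pd.DataFrame(rng.standard_normal((5, 5)),
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'],
                      columns=['a', 'b', 'c', 'd', 'e'])
people.iloc[2:3, [1, 2]] = np.nan  # insert a couple of missing values

# Hypothetical mapping from column label to group label.
mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red'}

# Transpose so the column labels become the index, group by the dict,
# sum within each group (missing values are skipped), then transpose back.
by_color = people.T.groupby(mapping).sum().T
print(by_color)
```

A Series whose index holds the column labels could be passed in place of the dictionary in the same way.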