pandas库的groupby问题
一、对象分组
1 一个简单例子
In [1]: df = pd.DataFrame([('bird', 'Falconiformes', 389.0),
...: ('bird', 'Psittaciformes', 24.0),
...: ('mammal', 'Carnivora', 80.2),
...: ('mammal', 'Primates', np.nan),
...: ('mammal', 'Carnivora', 58)],
...: index=['falcon', 'parrot', 'lion', 'monkey', 'leopard'],
...: columns=('class', 'order', 'max_speed'))
...:
In [2]: df
Out[2]:
class order max_speed
falcon bird Falconiformes 389.0
parrot bird Psittaciformes 24.0
lion mammal Carnivora 80.2
monkey mammal Primates NaN
leopard mammal Carnivora 58.0
In [3]: grouped = df.groupby('class') # 生成两组,默认axis=0/'columns'
In [4]: grouped = df.groupby('order', axis='columns') # 无输出
In [5]: grouped = df.groupby(['class', 'order']) # 生成四组
groupby默认按照列名排列,亦可使axis=1,这样就能以列为单位切片(结果是一列一列的组合):
In [50]: grouped = df.groupby(df.dtypes, axis=1) # 按列的类型来分组,可见被分成了两块
In [51]: for i,j in grouped:
...: print(i)
...: print(j)
float64
max_speed
falcon 389.0
parrot 24.0
lion 80.2
monkey NaN
leopard 58.0
object
class order
falcon bird Falconiformes
parrot bird Psittaciformes
lion mammal Carnivora
monkey mammal Primates
leopard mammal Carnivora
其他常用的几种功能:
# 按照组名数量使用以下代码查看包含了哪些内容,也称为迭代
for i,j in grouped:
print(i)
print(j)
for (i1,i2),j in grouped:
print(i1, i2)
print(j)
# 或者如下简单方式查看有哪些组:
grouped.groups
# 使用以下代码求和,前提是该列可以求和:
In []: df.groupby('class').sum()
Out[]:
max_speed
class
bird 413.0
mammal 138.2
# 分别求平均数、组的大小、计数:
In []: df.groupby('class').mean()
Out[]:
max_speed
class
bird 206.5
mammal 69.1
In []: df.groupby('class').size()
Out[]:
class
bird 2
mammal 3
dtype: int64
In []: df.groupby('class').count() # 注意count与size的区别
Out[]:
order max_speed
class
bird 2 2
mammal 3 2 # 由于有一个NaN值的存在
2 多重索引与分组
In [6]: df = pd.DataFrame({
'A': ['foo', 'bar', 'foo', 'bar',
...: 'foo', 'bar', 'foo', 'foo'],
...: 'B': ['one', 'one', 'two', 'three',
...: 'two', 'two', 'one', 'three'],
...: 'C': np.random.randn(8),
...: 'D': np.random.randn(8)})
In [7]: df
Out[7]:
A B C D
0 foo one 0.469112 -0.861849
1 bar one -0.282863 -2.104569
2 foo two -1.509059 -0.494929
3 bar three -1.135632 1.071804
4 foo two 1.212112 0.721555
5 bar two -0.173215 -0.706771
6 foo one 0.119209 -1.039575
7 foo three -1.044236 0.271860
In [8]: grouped = df.groupby('A') # 生成2组
In [9]: grouped = df.groupby(['A', 'B']) # 生成6组
In [10]: df2 = df.set_index(['A', 'B'])
Out[10]: # print(df2)
C D
A B
foo one -1.209388 -0.309949
bar one -0.380334 -1.352238
foo two 0.309979 -0.695926
bar three 0.650321 0.965206
foo two 0.809020 1.003307
bar two 0.668484 1.013688
foo one 0.513104 0.079576
three 1.579055 -0.083461 # 注意多重索引此处不同
In [11]: grouped = df2.groupby(level=df2.index.names.difference(['B']))
# 等价于 grouped = df2.groupby(level=0),如要按多重索引内的索引'B',则level=1
# 由于'B'列不能求和,亦等价于 df.groupby(['A']).sum()
In [12]: grouped.sum()
Out[12]:
C D
A
bar -1.591710 -1.739537
foo -0.752861 -1.402938
3 应用函数与按列分组
In [13]: def get_letter_type(letter):
....: if letter.lower() in 'aeiou':
....: return 'vowel'
....: else:
....: return 'consonant'
In [14]: grouped = df.groupby(get_letter_type, axis=1)
# axis=0将整个df以哪几行的形式分组,而axis=1则按照哪几列来分组
# 该处运用类似于apply函数
In []:for i,j in grouped:
print(i)
print(j)
Out[]: # 由于A的小写属于元音字母,导致原df被切分成两部分
consonant
B C D
0 one -1.209388 -0.309949
1 one -0.380334 -1.352238
2 two 0.309979 -0.695926
3 three 0.650321 0.965206
4 two 0.809020 1.003307
5 two 0.668484 1.013688
6 one 0.513104 0.079576
7 three 1.579055 -0.083461
vowel
A
0 foo
1 bar
2 foo
3 bar
4 foo
5 bar
6 foo
7 foo
4 level之用法
下面实例介绍了level的基本用法:
In [15]: lst = [1, 2, 3, 1, 2, 3]
In [16]: s = pd.Series([1, 2, 3, 10, 20, 30], lst)
Out[16]: # print(s)
1 1
2 2
3 3
1 10
2 20
3 30
dtype: int64
In [17]: grouped = s.groupby(level=0) # 按照索引仍分成3组,level=1用于多索引中的第二列索引
In [18]: grouped.first() # 每一组的第一行
Out[18]:
1 1
2 2
3 3
dtype: int64
In [19]: grouped.last() # 每一组的最后一行
Out[19]:
1 10
2 20
3 30
dtype: int64
In [20]: grouped.sum()
Out[20]:
1 11
2 22
3 33
dtype: int64
5 排序
groupby默认按照组名升序排列,可以取消此操作。
In [21]: df2 = pd.DataFrame({
'X': ['B', 'B', 'A', 'A'], 'Y': [1, 2, 3, 4]})
In [22]: df2.groupby(['X']).sum()
Out[22]:
Y
X
A 7
B 3
In [23]: df2.groupby(['X'], sort=False).sum() # 并不是改为降序,而是按照组名的原有顺序
Out[23]:
Y
X
B 3
A 7
groupby分组的数据不会自动升序展示,而是默认按照原有顺序保存。
In [24]: df3 = pd.DataFrame({
'X': ['A', 'B', 'A', 'B'], 'Y': [1, 4, 3, 2]})
In [25]: df3.groupby(['X']).get_group('A')
Out[25]:
X Y
0 A 1
2 A 3
In [26]: df3.groupby(['X']).get_group('B')
Out[26]:
X Y
1 B 4
3 B 2
6 groupby对象的函数
按照一列分组和列切片:
In [27]: df.groupby('A').groups
Out[27]:
{
'bar': Int64Index([1, 3, 5], dtype='int64'),
'foo': Int64Index([0, 2, 4, 6, 7], dtype='int64')}
In [28]: df.groupby(get_letter_type, axis=1).groups
Out[28]:
{
'consonant': Index(['B', 'C', 'D'], dtype='object'),
'vowel': Index(['A'], dtype='object')}
按照两列分组:
In [29]: grouped = df.groupby(['A', 'B']