df = DataFrame({'key1':['a','a','b','b','a'],
'key2':['one','two','one','two','one'],
'data1':np.arange(0,5),
'data2':np.arange(-5,0)})
print(df)
key1 key2 data1 data2
0 a one 0 -5
1 a two 1 -4
2 b one 2 -3
3 b two 3 -2
4 a one 4 -1
1.将‘data2’这一列按照‘key1’这一列进行分组
grouped = df['data2'].groupby(df['key1'])
print(grouped.min())
key1
a -5
b -3
Name: data2, dtype: int32
2.将分组的列做成列表传入:将‘data1’这一列按照‘key1’和‘key2’列进行分组
grouped1 = df['data1'].groupby([df['key1'],df['key2']])
print(grouped1.min())
key1 key2
a one 0
two 1
b one 2
two 3
Name: data1, dtype: int32
3.分组键传的值也可以是正确长度的数组
list1 = [2001,2002,2003,2004,2005]
grouped2 = df['data1'].groupby(list1)
print(grouped2.mean())
2001 0
2002 1
2003 2
2004 3
2005 4
Name: data1, dtype: int32
4.将整个DataFrame按照其中的某一列进行划分,这时候可以传入该列的列名
grouped3 = df.groupby('key1')
print(grouped3.mean())
data1 data2
key1
a 1.666667 -3.333333
b 2.500000 -2.500000 #这时划分的内容并没有key2列,因为key2列并不是数值
#传入列名列表
grouped4 = df.groupby(['key1','key2'])
print(grouped4.mean())
data1 data2
key1 key2
a one 2 -3
two 1 -4
b one 2 -3
two 3 -2
5.size:返回一个包含组大小信息的Series
grouped5 = df.groupby(['key1','key2'])
print(grouped5.size())
key1 key2
a one 2
two 1
b one 1
two 1
dtype: int64
6.GroupBy对象支持迭代,会生成一个包含组名和数据块的2维元组序列
for name ,value in df.groupby('key1'):
print(name,end="---分组名")
print(value)
a---分组名
key1 key2 data1 data2
0 a one 0 -5
1 a two 1 -4
4 a one 4 -1
b---分组名
key1 key2 data1 data2
2 b one 2 -3
3 b two 3 -2
7.当迭代时有多个分组键的情况下,元组的第一个元素是键值的元组
('a', 'one')
key1 key2 data1 data2
0 a one 0 -5
4 a one 4 -1
('a', 'two')
key1 key2 data1 data2
1 a two 1 -4
('b', 'one')
key1 key2 data1 data2
2 b one 2 -3
('b', 'two')
key1 key2 data1 data2
3 b two 3 -2
8.将分组的数据保存在字典中,通过分组的值取出相应的值
dicta = dict(list(df.groupby('key1')))
print(dicta)
{'a': key1 key2 data1 data2
0 a one 0 -5
1 a two 1 -4
4 a one 4 -1,
'b': key1 key2 data1 data2
2 b one 2 -3
3 b two 3 -2}
print(dicta['a'])
key1 key2 data1 data2
0 a one 0 -5
1 a two 1 -4
4 a one 4 -1
9.将从DataFrame创建的GroupBy对象用列名称或列名称数组进行索引时,会产生用于聚合列子集的效果
print(df.groupby('key1')['data1'].min())
key1
a 0
b 2
Name: data1, dtype: int32
print(df.groupby('key1')[['data1']].min())
data1
key1
a 0
b 2
10.在字典或Series中写入各个列的对应关系,然后将字典传入grouped数组,实现各列相加
people = DataFrame(np.arange(25).reshape((5,5)),
columns=['a','b','c','d','e'],
index=['aa','bb','cc','dd','ee'])
dictb = {'a':'now','b':'now','c':'now','d':'last','e':'last'}
print(people.groupby(dictb,axis=1).sum())
last now
aa 7 3
bb 17 18
cc 27 33
dd 37 48
ee 47 63
mapSeries = Series(dictb)
print(people.groupby(mapSeries,axis=1).sum())
last now
aa 7 3
bb 17 18
cc 27 33
dd 37 48
ee 47 63
11.使用Python函数来定义分组关系,函数会将每个索引值调用一次,会将返回值作为分组名称
people2 = DataFrame(np.arange(25).reshape((5,5)),
columns=['a','b','c','d','e'],
index=['aaa','bbbb','cc','dd','ee'])
print(people2.groupby(len).size())
2 3
3 1
4 1
dtype: int64
12.将函数与其他混合使用
print(people2.groupby([len,'a']).size())
a
2 10 1
15 1
20 1
3 0 1
4 5 1
dtype: int64
13. 按索引层级分组
column = [['A','A','A','B','B'],[1,3,5,1,3]]
name = ['city','tenor']
people3 = DataFrame(np.arange(25).reshape((5,5)),
columns=column,
index=['aaa','bbbb','cc','dd','ee'])
people3.columns.names=name
print(people3.groupby(level='city',axis=1).count())
city A B
aaa 3 2
bbbb 3 2
cc 3 2
dd 3 2
ee 3 2
14.使用自己的聚合函数:将自己的函数传入agg或aggregate
hah = lambda x:x.max()-x.min()
print(df.groupby('key1').aggregate(hah))
data1 data2
key1
a 4 4
b 1 1
15.describe()
print(df.groupby('key1').describe())
data1 ... data2
count mean std min 25% ... min 25% 50% 75% max
key1 ...
a 3.0 1.666667 2.081666 0.0 0.50 ... -5.0 -4.50 -4.0 -2.50 -1.0
b 2.0 2.500000 0.707107 2.0 2.25 ... -3.0 -2.75 -2.5 -2.25 -2.0
[2 rows x 16 columns]
16.各列同时使用多个函数进行聚合,向agg传入函数列表
hah = lambda x:x.min()
print(df.groupby('key1').agg([hah,'mean','max']))
data1 data2
<lambda_0> mean max <lambda_0> mean max
key1
a 0 1.666667 4 -5 -3.333333 -1
b 2 2.500000 3 -3 -2.500000 -2
17.从上面我们可以看出lambda函数的显示名称为<lambda_0>,我们可以通过传入元组来自定义函数名称
print(df.groupby('key1').agg([('hah',hah),'mean','max']))
data1 data2
hah mean max hah mean max
key1
a 0 1.666667 4 -5 -3.333333 -1
b 2 2.500000 3 -3 -2.500000 -2
18.将不同的函数应用到不同的列上,需要把列名和函数对应的字典传入agg
print(df.groupby('key1').agg({'data1':[('hah',hah),'mean','max'],'data2':['mean']}))
data1 data2
hah mean max mean
key1
a 0 1.666667 4 -3.333333
b 2 2.500000 3 -2.500000
19.分位数
#cut
qq = pd.cut(np.random.randn(100),4)
dff = Series(np.random.randn(100))
print(dff.groupby(qq).agg(['count','mean','max']))
count mean max
(-2.296, -1.064] 16 0.358343 2.006945
(-1.064, 0.164] 41 0.190866 2.265793
(0.164, 1.391] 34 0.021378 2.545082
(1.391, 2.619] 9 0.410388 1.696888
#qcut
qq = pd.qcut(np.random.randn(100),4)
dff = Series(np.random.randn(100))
print(dff.groupby(qq).agg(['count','mean','max']))
count mean max
(-3.247, -0.861] 25 -0.102378 1.743070
(-0.861, -0.0778] 25 0.470662 2.484437
(-0.0778, 0.511] 25 0.214561 1.822943
(0.511, 1.827] 25 -0.432044 2.611959
数据聚合与分组操作
最新推荐文章于 2022-05-17 14:09:30 发布