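The group operation process: split -> apply -> combine
Split: partition the rows into groups according to one or more keys
Apply: run a computation on each group independently
Combine: merge the per-group results into a single result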
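To make the three steps concrete, here is a minimal sketch that performs them by hand and then compares the result with a single groupby call. The toy data and the names pieces, sums, manual and grouped are illustrative assumptions, not part of the original example.

```python
import pandas as pd

# Hypothetical toy data, chosen so the results are easy to check by hand
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [3, 7, 9, 12, 15]})

# Split: one sub-DataFrame per distinct value of key1
pieces = {key: df[df['key1'] == key] for key in df['key1'].unique()}

# Apply: compute the sum of data1 inside each piece
sums = {key: piece['data1'].sum() for key, piece in pieces.items()}

# Combine: collect the per-group results into one Series
manual = pd.Series(sums, name='data1').sort_index()
manual.index.name = 'key1'

# The same three steps expressed as one groupby call
grouped = df.groupby('key1')['data1'].sum()

print(manual)
print(grouped)
```

Both Series should hold the same per-group sums; the groupby version simply hides the bookkeeping.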
1. The grouping function: groupby
groupby(by=None). groupby implements the split step.
import pandas as pd
import numpy as np
df=pd.DataFrame({'key1':['a','a','b','b','a'],
'key2':['one','two','one','two','one'],
'data1':np.random.randn(5),
'data2':np.random.randn(5)})
print(df)
out:
  key1 key2     data1     data2
0    a  one  0.445011 -0.346976
1    a  two  0.129147 -1.260998
2    b  one  0.806248 -0.125555
3    b  two  0.981721 -1.633108
4    a  one -1.533791 -0.176332
# Iterating over a GroupBy object yields (group name, sub-DataFrame) pairs
for name, group in df.groupby('key1'):
    print(name)
    print(group)
out:
a
  key1 key2     data1     data2
0    a  one  0.445011 -0.346976
1    a  two  0.129147 -1.260998
4    a  one -1.533791 -0.176332
b
  key1 key2     data1     data2
2    b  one  0.806248 -0.125555
3    b  two  0.981721 -1.633108
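Looping is not the only way to inspect the groups. The sketch below uses size() and get_group() on a GroupBy object; the integer data and the variable names are assumptions made so the output is reproducible.

```python
import pandas as pd

# Hypothetical fixed data so the result does not depend on random numbers
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [3, 7, 9, 12, 15]})

grouped = df.groupby('key1')

# Number of rows in each group
sizes = grouped.size()
print(sizes)

# Pull out a single group by its key, without looping
group_a = grouped.get_group('a')
print(group_a)

# Materialize every group into a dict of {name: sub-DataFrame}
pieces = {name: group for name, group in grouped}
print(list(pieces))
```

Nothing is computed when groupby is called; the groups are only built when they are iterated over, selected, or aggregated.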
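# by given as a single column; numeric_only=True skips the string column key2,
# which would otherwise raise a TypeError in pandas 2.0 and later
print(df.groupby('key1').mean(numeric_only=True))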
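out (this and the next output come from a separate run; data1 and data2 are random, so the values do not match the frame printed above):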
         data1     data2
key1
a     1.533847  0.463332
b    -0.224867 -1.825610
# by given as a list of columns
print(df.groupby(['key1','key2']).mean())
out:
              data1     data2
key1 key2
a    one   2.362819  0.559232
     two  -0.124097  0.271531
b    one  -0.548848 -1.423937
     two   0.099114 -2.227282
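Grouping by several columns produces a result with a hierarchical (MultiIndex) row index. The following sketch shows a few common ways to work with it; the fixed integer data and the names means, flat and wide are assumptions for illustration.

```python
import pandas as pd

# Hypothetical fixed data so the means can be verified by hand
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [3, 7, 9, 12, 15],
                   'data2': [1, 6, 14, 23, 7]})

means = df.groupby(['key1', 'key2']).mean()

# Look up one group with a (key1, key2) tuple
print(means.loc[('a', 'one')])

# Turn the index levels back into ordinary columns
flat = means.reset_index()
print(flat)

# Pivot the inner level (key2) into columns, one row per key1
wide = means['data1'].unstack()
print(wide)
```

reset_index is convenient when the result feeds a merge or an export, while unstack gives a compact cross-tabulated view.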
2. Processing groupby results with agg
agg(func)
agg implements the apply and combine steps
- func as a built-in aggregation (such as max or min)
import pandas as pd
import numpy as np
df=pd.DataFrame({'key1':['a','a','b','b','a'],
'key2':['one','two','one','two','one'],
'data1':np.random.randn(5),
'data2':np.random.randn(5)})
print(df)
out:
  key1 key2     data1     data2
0    a  one -0.407514 -0.010082
1    a  two -0.774303  0.251207
2    b  one -1.189536 -2.061739
3    b  two -0.411025 -0.289213
4    a  one  1.688148 -0.434298
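# Pass the aggregation by name; handing agg the Python builtin max is deprecated in recent pandas
print(df.groupby('key1').agg('max'))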
out:
     key2     data1     data2
key1
a     two  1.688148  0.251207
b     two -0.411025 -0.289213
# agg is optional here: the max() aggregation method can be called directly
print(df.groupby('key1').max())
out:
     key2     data1     data2
key1
a     two  1.688148  0.251207
b     two -0.411025 -0.289213
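As a quick check that the two spellings agree, the sketch below compares agg('max') with max() and also shows how to restrict the aggregation to the numeric columns. The fixed data and variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical fixed data so the comparison is reproducible
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [3, 7, 9, 12, 15],
                   'data2': [1, 6, 14, 23, 7]})

grouped = df.groupby('key1')

# The string name 'max' dispatches to the same built-in aggregation as .max()
via_agg = grouped.agg('max')
direct = grouped.max()
print(via_agg.equals(direct))

# Select columns on the GroupBy object to aggregate only the numeric ones
numeric_max = grouped[['data1', 'data2']].max()
print(numeric_max)
```

For strings, max means the lexicographically largest value, which is why key2 shows 'two' for both groups.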
- func as a user-defined function
import pandas as pd
import numpy as np
df=pd.DataFrame({'key1':['a','a','b','b','a'],
'key2':['one','two','one','two','one'],
'data1':[3,7,9,12,15],
'data2':[1,6,14,23,7]})
print(df)
out:
  key1 key2  data1  data2
0    a  one      3      1
1    a  two      7      6
2    b  one      9     14
3    b  two     12     23
4    a  one     15      7
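# Select the numeric columns first: max() - min() is not defined for the strings in key2
print(df.groupby('key1')[['data1', 'data2']].agg(lambda x: x.max() - x.min()))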
out:
      data1  data2
key1
a        12      6
b         3      9
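A lambda works, but a named function is easier to reuse and gives the result a readable label when it is passed inside a list. The helper name peak_to_peak below is a hypothetical choice, not something defined by pandas.

```python
import pandas as pd


def peak_to_peak(values):
    """Return the spread (max minus min) of one column within one group."""
    return values.max() - values.min()


df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [3, 7, 9, 12, 15],
                   'data2': [1, 6, 14, 23, 7]})

# agg calls the function once per column per group, passing a Series each time
result = df.groupby('key1')[['data1', 'data2']].agg(peak_to_peak)
print(result)

# Inside a list, the function's own name becomes the column label
labelled = df.groupby('key1')['data1'].agg([peak_to_peak])
print(labelled)
```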
- func as a list of functions
import pandas as pd
import numpy as np
df=pd.DataFrame({'key1':['a','a','b','b','a'],
'key2':['one','two','one','two','one'],
'data1':[3,7,9,12,15],
'data2':[1,6,14,23,7]})
print(df)
out:
  key1 key2  data1  data2
0    a  one      3      1
1    a  two      7      6
2    b  one      9     14
3    b  two     12     23
4    a  one     15      7
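# Apply several aggregation functions at once
# A (name, function) tuple supplies the label of the resulting column
# Only the numeric columns are selected, since mean and std fail on the strings in key2
print(df.groupby('key1')[['data1', 'data2']].agg(['mean', 'std', 'sum', ('range', lambda s: s.max() - s.min())]))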
out:
          data1                      data2
           mean       std sum range       mean       std sum range
key1
a      8.333333  6.110101  25    12   4.666667  3.214550  14     6
b     10.500000  2.121320  21     3  18.500000  6.363961  37     9
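Passing a list of functions yields two-level column labels of the form (original column, function). A short sketch of how such a result can be indexed follows; the variable names are assumptions, and only two functions are used to keep the output small.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [3, 7, 9, 12, 15],
                   'data2': [1, 6, 14, 23, 7]})

stats = df.groupby('key1')[['data1', 'data2']].agg(['mean', 'sum'])

# Each column label is a (column, function) tuple
print(stats.columns.tolist())

# Selecting the outer level returns all statistics for that column
data1_stats = stats['data1']
print(data1_stats)

# Selecting a full tuple returns a single Series
mean_data2 = stats[('data2', 'mean')]
print(mean_data2)
```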
- func as a dict whose keys are column names and whose values are functions
import pandas as pd
import numpy as np
df=pd.DataFrame({'key1':['a','a','b','b','a'],
'key2':['one','two','one','two','one'],
'data1':[3,7,9,12,15],
'data2':[1,6,14,23,7]})
print(df)
out:
  key1 key2  data1  data2
0    a  one      3      1
1    a  two      7      6
2    b  one      9     14
3    b  two     12     23
4    a  one     15      7
# Apply different aggregation functions to different columns
dict_map = {'data1': ['mean', ('range', lambda s: s.max() - s.min())], 'data2': 'sum'}
print(df.groupby('key1').agg(dict_map))
out:
          data1       data2
           mean range   sum
key1
a      8.333333    12    14
b     10.500000     3    37
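If flat column names are preferred over the two-level header, pandas (0.25 and later) also accepts keyword arguments of the form new_name=(column, function), known as named aggregation. The sketch below is one possible equivalent of the dict example; the output names data1_mean, data1_range and data2_sum are arbitrary choices.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [3, 7, 9, 12, 15],
                   'data2': [1, 6, 14, 23, 7]})

# Each keyword names an output column; the tuple gives (input column, function)
summary = df.groupby('key1').agg(
    data1_mean=('data1', 'mean'),
    data1_range=('data1', lambda s: s.max() - s.min()),
    data2_sum=('data2', 'sum'),
)
print(summary)
```

Because only data1 and data2 are referenced, the string column key2 is never touched and no column selection is needed.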