Pandas实现groupby分组统计
类似SQL: select city.max(temperature) from city_weather groupby by city
groupby: 先对数据分组,然后在每个分组上应用聚合函数、转换函数
import pandas as pd
import numpy as np
% matplotlib inline
df = pd. DataFrame( { 'A' : [ 'foo' , 'bar' , 'foo' , 'bar' , 'foo' , 'bar' , 'foo' , 'foo' ] ,
'B' : [ 'one' , 'one' , 'two' , 'three' , 'two' , 'two' , 'one' , 'three' ] ,
'C' : np. random. randn( 8 ) ,
'D' : np. random. randn( 8 ) } )
df
A B C D 0 foo one -0.190741 1.612948 1 bar one -0.781209 1.002117 2 foo two 0.858491 -0.178429 3 bar three -0.280559 0.083816 4 foo two 0.214302 -0.107717 5 bar two -0.134638 -1.604365 6 foo one -2.108731 -0.653819 7 foo three 0.414685 -1.717435
1、分组使用聚合函数做数据统计
1、单个列groupby,查询所有数据列的统计
df. groupby( 'A' ) . sum ( )
C D A bar -1.196406 -0.518431 foo -0.811994 -1.044452
1、groupby中的A变成了数据的索引列 2、B列不是数字,所以被自动忽略
2、多个列groupby,查询所有数据列的统计
df. groupby( [ 'A' , 'B' ] ) . mean( )
C D A B bar one -0.781209 1.002117 three -0.280559 0.083816 two -0.134638 -1.604365 foo one -1.149736 0.479564 three 0.414685 -1.717435 two 0.536397 -0.143073
(A,B)成对变成了二级索引,不想改变原来索引。加一个as_index=False
df. groupby( [ 'A' , 'B' ] , as_index= False ) . mean( )
A B C D 0 bar one -0.781209 1.002117 1 bar three -0.280559 0.083816 2 bar two -0.134638 -1.604365 3 foo one -1.149736 0.479564 4 foo three 0.414685 -1.717435 5 foo two 0.536397 -0.143073
3、同时查看多种数据统计
df. groupby( 'A' ) . agg( [ np. sum , np. mean, np. std] )
C D sum mean std sum mean std A bar -1.196406 -0.398802 0.339116 -0.518431 -0.17281 1.322055 foo -0.811994 -0.162399 1.151755 -1.044452 -0.20889 1.204567
列变成了多级索引
4、查看单列的结果数据统计
df. groupby( 'A' ) [ 'C' ] . agg( [ np. sum , np. mean, np. std] )
sum mean std A bar -1.196406 -0.398802 0.339116 foo -0.811994 -0.162399 1.151755
df. groupby( 'A' ) . agg( [ np. sum , np. mean, np. std] ) [ 'C' ]
sum mean std A bar -1.196406 -0.398802 0.339116 foo -0.811994 -0.162399 1.151755
5、不同列使用不同的聚合函数
df. groupby( 'A' ) . agg( { 'C' : np. sum , 'D' : np. std} )
C D A bar -1.196406 1.322055 foo -0.811994 1.204567