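What groupby does
DataFrame.groupby (written loosely here as pd.groupby) splits the rows of a table into groups according to the values of one or more features, so that each group can be summarized separately.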
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df_train = pd.read_csv("train.csv")  # Titanic training data
Count passengers in each group
df_train.groupby(['Sex','Survived'])['Survived'].count()
Sex     Survived
female  0            81
        1           233
male    0           468
        1           109
Name: Survived, dtype: int64
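The two-key count above can be tried without train.csv. The following is a minimal sketch on a small made-up frame (the rows in `toy` are hypothetical, not real Titanic data); it also uses `unstack` to turn the inner index level into columns, which is often easier to read.

```python
import pandas as pd

# Hypothetical toy rows standing in for train.csv (not real Titanic data)
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

# Count rows in each (Sex, Survived) group; the result is a Series
# indexed by a two-level MultiIndex
counts = toy.groupby(["Sex", "Survived"])["Survived"].count()
print(counts)

# Move the inner level (Survived) into columns for a compact table
table = counts.unstack("Survived")
print(table)
```

Each row of `table` is one sex and each column is one value of Survived.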
Plot the survival rate by sex
df_train[['Sex','Survived']].groupby(['Sex']).mean().plot.bar()
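The bar chart works because Survived is coded as 0/1, so the mean of each group is its survival rate. A small sketch of that reasoning, again on made-up rows (the `toy` frame is hypothetical) and without the plotting step:

```python
import pandas as pd

# Hypothetical toy rows (not real Titanic data)
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

grouped = toy.groupby("Sex")["Survived"]

# Mean of a 0/1 column = fraction of ones = survival rate per group
rate = grouped.mean()

# The same quantity computed by hand: survivors divided by group size
manual = grouped.sum() / grouped.count()

print(rate)
print(manual)
```

Calling `.plot.bar()` on `rate` would draw one bar per sex, as in the line above.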
df = pd.DataFrame(data={'books':['bk1','bk1','bk1','bk2','bk2','bk3'], 'price': [12,12,12,15,15,17]})
df
| | books | price |
|---|---|---|
| 0 | bk1 | 12 |
| 1 | bk1 | 12 |
| 2 | bk1 | 12 |
| 3 | bk2 | 15 |
| 4 | bk2 | 15 |
| 5 | bk3 | 17 |
df0 = df.groupby('books',as_index=True).sum()
print(df0.loc['bk1'])
price 36
Name: bk1, dtype: int64
df0.loc[0]  # raises KeyError: the index labels are 'bk1', 'bk2', 'bk3', not integers
df1 = df.groupby('books',as_index=False).sum()
print(df1.loc[df1.books == 'bk1'])
books price
0 bk1 36
df1.loc['bk1']  # raises KeyError: the index labels are 0, 1, 2, and 'books' is an ordinary column
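With as_index=True the group keys become the index of the result, so df.loc[] must be given a group label such as 'bk1'.
With as_index=False the group keys stay as an ordinary column and the result gets the default integer index, so df.loc[] must be given 0, 1, 2, and so on.
In both cases df.iloc[] works by position, and position 0 refers to the same group (bk1); the only difference is that with as_index=False the row also contains the books column.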
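The three points above can be checked side by side. This is a minimal sketch that reuses the books/price frame from this section; the variable names `by_label`, `flat` and `flat_again` are my own.

```python
import pandas as pd

df = pd.DataFrame(data={'books': ['bk1', 'bk1', 'bk1', 'bk2', 'bk2', 'bk3'],
                        'price': [12, 12, 12, 15, 15, 17]})

by_label = df.groupby('books', as_index=True).sum()   # index: bk1, bk2, bk3
flat = df.groupby('books', as_index=False).sum()      # index: 0, 1, 2

# .loc is label-based in both cases; only the labels differ
price_by_label = by_label.loc['bk1', 'price']
price_by_int = flat.loc[0, 'price']

# .iloc is position-based, so position 0 is the bk1 group either way
price_pos_a = by_label.iloc[0]['price']
price_pos_b = flat.iloc[0]['price']
print(price_by_label, price_by_int, price_pos_a, price_pos_b)

# Using the wrong kind of label raises KeyError
try:
    by_label.loc[0]
except KeyError:
    print("by_label has no label 0")

# reset_index() turns the as_index=True result into the flat layout
flat_again = by_label.reset_index()
print(flat_again)
```

If the group key is needed later as a regular column (for merging or plotting), as_index=False or a later `reset_index()` tends to be the more convenient form.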
agg vs filter vs transform
The linked tutorial covers these in detail.
Basic usage
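# df is assumed here to have 'day' and 'total_bill' columns (e.g. seaborn's tips dataset)
df.groupby('day')['total_bill'].mean()  # aggregate: one mean per day
df.groupby('day').filter(lambda x: x['total_bill'].mean() > 20)  # filter: keep the rows of days whose mean bill exceeds 20
df.groupby('day')['total_bill'].transform(lambda x: x / x.mean())  # transform: each bill divided by its day's mean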
When to use which
- if we want to get a single value for each group -> use aggregate()
- if we want to get a subset of the input rows -> use filter()
- if we want to get a new value for each input row -> use transform()
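The difference between the three is easiest to see in the shape of what comes back. Below is a minimal sketch on a made-up frame (`bills` and its values are hypothetical; only the column names 'day' and 'total_bill' follow the snippets above).

```python
import pandas as pd

# Hypothetical data with the same column names as the snippets above
bills = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat"],
    "total_bill": [10.0, 20.0, 30.0, 20.0, 15.0, 5.0],
})
grouped = bills.groupby("day")

# aggregate: one value per group (3 days -> 3 values)
per_group = grouped["total_bill"].agg("mean")

# filter: a subset of the original rows; only days whose mean exceeds 20 remain
kept = grouped.filter(lambda g: g["total_bill"].mean() > 20)

# transform: one new value per input row, aligned with the original index
scaled = grouped["total_bill"].transform(lambda x: x / x.mean())

print(per_group.shape, kept.shape, scaled.shape)
```

Because `scaled` keeps the original index, it can be assigned straight back as a new column, for example `bills["rel_bill"] = scaled` (the column name is only an illustration).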