1. Categoricals and groupby
- boolean filter and count
- groupby and count
- groupby and sum
- groupby and sum multiple columns
- groupby and mean: multi-level index
- groupby and sum by Series
- converting a column to the categorical dtype saves memory and speeds up processing
- once two tables share the same index, a column of one table can be used to group the columns of the other
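A minimal sketch of the last two notes, with made-up data (the DataFrames, labels, and column names are invented for illustration):

```python
import pandas as pd

# Two small DataFrames that share the same index
sales = pd.DataFrame({'Units': [10, 20, 5, 8]},
                     index=['A', 'B', 'C', 'D'])
info = pd.DataFrame({'Region': ['East', 'West', 'East', 'West']},
                    index=['A', 'B', 'C', 'D'])

# The categorical dtype stores each distinct label once plus small integer
# codes, which saves memory and speeds up groupby on large data
info['Region'] = info['Region'].astype('category')

# Because the indexes align, a column of `info` can group `sales`
totals = sales.groupby(info['Region'], observed=True)['Units'].sum()
print(totals)
```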
2. Groupby and aggregation
- multiple aggregation
- aggregation functions
- aggregation dictionaries
- selecting results out of an aggregation
- aggregating on index levels/fields
- grouping on a function of index
df.index.strftime('%a'): converts a timestamp such as 2015-02-02 08:30:00 into its abbreviated weekday name, Mon
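A small sketch of the aggregation ideas above (synthetic data; the column names are invented): a dictionary maps each column to its own aggregation function, and the grouping key is a function of the DatetimeIndex, bucketing rows by weekday.

```python
import pandas as pd

# Eight consecutive days starting Mon 2015-02-02
idx = pd.date_range('2015-02-02', periods=8, freq='D')
df = pd.DataFrame({'Units': [1, 2, 3, 4, 5, 6, 7, 8],
                   'Price': [10., 20., 30., 40., 50., 60., 70., 80.]},
                  index=idx)

# Aggregation dictionary: a different function per column;
# the grouping key is computed from the index (weekday abbreviation)
agg_result = df.groupby(df.index.strftime('%a')).agg({'Units': 'sum',
                                                      'Price': 'mean'})
print(agg_result)
```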
3. Groupby and transformation
- z-score: use z-scores to flag outliers (observations more than 3 standard deviations from their group mean)
# Import zscore
from scipy.stats import zscore
# Group gapminder_2010: standardized
standardized = gapminder_2010.groupby('region')[['life','fertility']].transform(zscore)
# Construct a Boolean Series to identify outliers: outliers
outliers = (standardized['life'] < -3) | (standardized['fertility'] > 3)
# Filter gapminder_2010 by the outliers: gm_outliers
gm_outliers = gapminder_2010.loc[outliers]
# Print gm_outliers
print(gm_outliers)
- handling NaN values: fill with each group's median by grouping with groupby, then using transform to fill the missing values in the 'age' column
# Create a groupby object: by_sex_class
by_sex_class = titanic.groupby(['sex','pclass'])
# Write a function that imputes median
def impute_median(series):
    return series.fillna(series.median())
# Impute age and assign to titanic['age']
titanic['age'] = by_sex_class['age'].transform(impute_median)
# Print the output of titanic.tail(10)
print(titanic.tail(10))
- apply: when to use apply and when to use transform? transform works one column at a time and must return output the same length as its input; apply receives each group as a whole DataFrame and may return a scalar, a Series, or a DataFrame
# Group gapminder_2010 by 'region': regional
regional = gapminder_2010.groupby('region')
# Apply the disparity function on regional: reg_disp
reg_disp = regional.apply(disparity)
# Print the disparity of 'United States', 'United Kingdom', and 'China'
print(reg_disp.loc[['United States','United Kingdom','China']])
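`disparity` is never defined in these notes; the sketch below is a guess at its shape (the names and formula are hypothetical). It illustrates why .apply is needed here: the function receives each region's whole sub-DataFrame and returns a new DataFrame, which .transform cannot do.

```python
import pandas as pd

# Made-up stand-in for the course's gapminder_2010 table
gapminder_2010 = pd.DataFrame(
    {'region': ['America', 'America', 'East Asia'],
     'gdp': [50000., 40000., 9000.]},
    index=['United States', 'Canada', 'China'])

def disparity(group):
    # Hypothetical: per-country spread of gdp within the region
    s = group['gdp']
    return pd.DataFrame({'z_gdp': (s - s.mean()) / s.std(),
                         'regional_gap': s - s.max()})

# group_keys=False keeps the original country index on the result
regional = gapminder_2010.groupby('region', group_keys=False)
reg_disp = regional.apply(disparity)
print(reg_disp.loc[['United States', 'China']])
```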
4. Groupby and filtering
- using a mask
fillna(False) is put to good use here: a lookup can leave NaN gaps in a Boolean Series, and filling them with False yields a valid mask
df.loc accepts a Boolean mask that selects rows
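A sketch of the mask idea (the data and lookup are invented): .map with a lookup that is missing some keys leaves NaN in the resulting Series, and .fillna(False) turns those gaps into False so the Series is a valid row mask for .loc.

```python
import pandas as pd

sales = pd.DataFrame({
    'Company': ['Acme', 'Acme', 'Hooli', 'Hooli', 'Initech'],
    'Units': [18, 20, 5, 10, 7]})

# Hypothetical lookup that is missing 'Initech', so .map yields NaN there
is_big = {'Acme': True, 'Hooli': False}
mask = sales['Company'].map(is_big).fillna(False)

# df.loc accepts the Boolean mask and keeps only the True rows
print(sales.loc[mask])
```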
- the filter function:
# Read the CSV file into a DataFrame: sales
sales = pd.read_csv('sales.csv', index_col='Date', parse_dates=True)
# Group sales by 'Company': by_company
by_company = sales.groupby('Company')
# Compute the sum of the 'Units' of by_company: by_com_sum
by_com_sum = by_company['Units'].sum()
print(by_com_sum)
# Filter 'Units' where the sum is > 35: by_com_filt
by_com_filt = by_company.filter(lambda g: g['Units'].sum() > 35)
print(by_com_filt)
- the map function: map a condition's True/False values to labels to build a labelling column
note the use of parentheses vs. square brackets
in this example we compute survival rates for groups of the titanic dataset: the 'survived' column is 1 for survivors and 0 for non-survivors, so the mean of that column within a group is that group's survival rate
# Create the Boolean Series: under10
under10 = (titanic['age'] < 10).map({True:'under 10', False:'over 10'})
# Group by under10 and compute the survival rate
survived_mean_1 = titanic.groupby(under10)['survived'].mean()
print(survived_mean_1)
# Group by under10 and pclass and compute the survival rate
survived_mean_2 = titanic.groupby([under10,'pclass'])['survived'].mean()
print(survived_mean_2)