190422-Grouping data

1. Categoricals and groupby

  • boolean filter and count

  • groupby and count

  • groupby and sum

  • groupby and sum multiple columns

  • groupby and mean: multi-level index

  • groupby and sum by Series

  • Converting data to the categorical dtype saves memory and speeds up processing.

  • Once two DataFrames share the same index, a column of one can be used to group by a column of the other (see the sketch after this list).
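
A minimal sketch of both points, using small hypothetical DataFrames (the names sales and managers and all values are illustrative, not course data):
# Convert a low-cardinality column to the categorical dtype: it stores
# integer codes plus a small lookup table, saving memory and speeding up groupby
import pandas as pd

sales = pd.DataFrame({'region': ['East', 'West', 'East', 'West'],
                      'units': [10, 4, 7, 12]},
                     index=['a', 'b', 'c', 'd'])
sales['region'] = sales['region'].astype('category')
print(sales['region'].dtype)    # category

# A second DataFrame sharing the same index; its column (a Series) can be
# passed to groupby, which aligns it on the index of sales
managers = pd.DataFrame({'manager': ['Ann', 'Bob', 'Ann', 'Bob']},
                        index=['a', 'b', 'c', 'd'])
print(sales.groupby(managers['manager'])['units'].sum())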

2. Groupby and aggregation

  • multiple aggregation

  • aggregation functions

  • aggregation dictionaries

  • selecting results from the aggregation output

  • Aggregating on index levels/fields

  • grouping on a function of index

df.index.strftime('%a'): converts a timestamp such as 2015-02-02 08:30:00 into its weekday abbreviation, e.g. Mon (see the sketch below).
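
A minimal sketch of multiple aggregation, aggregation dictionaries, selecting from the aggregated result, and grouping on a function of the index (the sales table below, its companies, and all numbers are made up for illustration):
import pandas as pd

sales = pd.DataFrame({'Company': ['Acme', 'Hooli', 'Acme', 'Hooli'],
                      'Units': [18, 3, 5, 9],
                      'Price': [12.0, 15.5, 11.0, 14.0]},
                     index=pd.to_datetime(['2015-02-02 08:30:00',
                                           '2015-02-03 10:00:00',
                                           '2015-02-04 12:00:00',
                                           '2015-02-07 09:00:00']))

# Multiple aggregation: a list of functions produces one column per function
agg_result = sales.groupby('Company')['Units'].agg(['max', 'sum'])
print(agg_result)

# Selecting one piece of the aggregated result
print(agg_result['sum'])

# Aggregation dictionary: a different function for each column
print(sales.groupby('Company').agg({'Units': 'sum', 'Price': 'mean'}))

# Grouping on a function of the index: weekday abbreviations via strftime('%a')
print(sales.groupby(sales.index.strftime('%a'))['Units'].sum())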

3. Groupby and transformation

  • z-score: use the z-score to spot outliers within each group
# Import zscore
from scipy.stats import zscore

# Group gapminder_2010: standardized
standardized = gapminder_2010.groupby('region')[['life','fertility']].transform(zscore)

# Construct a Boolean Series to identify outliers: outliers
outliers = (standardized['life'] < -3) | (standardized['fertility'] > 3)

# Filter gapminder_2010 by the outliers: gm_outliers
gm_outliers = gapminder_2010.loc[outliers]

# Print gm_outliers
print(gm_outliers)
  • Handling NaN values: fill them with a group-specific statistic (here the median). Group with groupby, then use transform to fill the missing values in the age column.
# Create a groupby object: by_sex_class
by_sex_class = titanic.groupby(['sex','pclass'])

# Write a function that imputes median
def impute_median(series):
    return series.fillna(series.median())

# Impute age and assign to titanic['age']
titanic['age'] = by_sex_class['age'].transform(impute_median)

# Print the output of titanic.tail(10)
print(titanic.tail(10))
  • apply: when should you use apply and when transform? transform must return output with the same shape as its input (one value per row, computed column by column), which suits within-group standardization or imputation; apply receives each group as a whole DataFrame and can return a scalar, Series, or DataFrame, which suits computations that need several columns at once (see the sketch after this example).
# Group gapminder_2010 by 'region': regional
regional = gapminder_2010.groupby('region')

# Apply the disparity function (a helper defined earlier in the exercise) on regional: reg_disp
reg_disp = regional.apply(disparity)

# Print the disparity of 'United States', 'United Kingdom', and 'China'
print(reg_disp.loc[['United States','United Kingdom','China']])
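
A minimal sketch of the transform/apply difference on a toy table (df, spread, and the numbers are illustrative; spread only stands in for a disparity-style helper):
import pandas as pd

df = pd.DataFrame({'region': ['Asia', 'Asia', 'Europe', 'Europe'],
                   'gdp': [1.0, 3.0, 2.0, 6.0]})
grouped = df.groupby('region')

# transform: the result has the same shape as the input, one value per row,
# computed column by column; good for within-group standardization or imputation
centered = grouped['gdp'].transform(lambda s: s - s.mean())
print(centered)

# apply: the function receives each group as a whole DataFrame and may return
# a scalar, Series, or DataFrame; good when several columns or a custom
# result shape are needed
def spread(group):
    return group['gdp'].max() - group['gdp'].min()

print(grouped.apply(spread))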

4. Groupby and filtering

  • Using a Boolean mask

fillna(False) is used to good effect here;

df.loc accepts a Boolean mask that selects rows (see the sketch below).
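
A minimal sketch of the idea on toy data in the spirit of the titanic exercise (the column names cabin and survived follow the titanic dataset; the rows are made up):
import pandas as pd
import numpy as np

df = pd.DataFrame({'cabin': ['C85', np.nan, 'C123', 'E46'],
                   'survived': [1, 0, 1, 0]})

# str.startswith returns NaN where 'cabin' is missing; fillna(False) turns
# the result into a clean Boolean mask
c_deck = df['cabin'].str.startswith('C').fillna(False)

# df.loc accepts the Boolean mask to select rows; the mean of 'survived'
# over those rows is the C-deck survival rate
print(df.loc[c_deck, 'survived'].mean())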

  • The filter function:
# Read the CSV file into a DataFrame: sales
sales = pd.read_csv('sales.csv', index_col='Date', parse_dates=True)

# Group sales by 'Company': by_company
by_company = sales.groupby('Company')

# Compute the sum of the 'Units' of by_company: by_com_sum
by_com_sum = by_company['Units'].sum()
print(by_com_sum)

# Filter 'Units' where the sum is > 35: by_com_filt
by_com_filt = by_company.filter(lambda g: g['Units'].sum() > 35)
print(by_com_filt)

  • The map function: evaluate an existing condition to True or False, then map the result to a label column.

Pay attention to the use of parentheses versus square brackets.

This example computes survival rates for different groups of the titanic dataset. In the survived column, survivors are coded 1 and non-survivors 0, so the mean of that column within a group is that group's survival rate.

# Create the Boolean Series: under10
under10 = (titanic['age'] < 10).map({True: 'under 10', False: 'over 10'})

# Group by under10 and compute the survival rate
survived_mean_1 = titanic.groupby(under10)['survived'].mean()
print(survived_mean_1)

# Group by under10 and pclass and compute the survival rate
survived_mean_2 = titanic.groupby([under10,'pclass'])['survived'].mean()
print(survived_mean_2)
