理解Pandas的Transform

Pandas具有丰富的功能,transform是与groupby(pandas中最有用的操作之一)组合使用的。一般情况下,我们在groupby之后使用aggregate , filter 或 apply来汇总数据,transform可能稍难理解。

aggregation会返回数据的缩减版本,而transformation能返回完整数据的某一变换版本供我们重组。这样的transformation,输出的形状和输入一致。一个常见的例子是通过减去分组平均值来居中数据。

实践

加载数据

import pandas as pd

data = {'account':[383080,383080,383080,412290,412290,412290,412290,412290,218895,218895,218895,218895],
       "name":['Will LLC','Will LLC','Will LLC','Jerde-Hilpert','Jerde-Hilpert','Jerde-Hilpert','Jerde-Hilpert','Jerde-Hilpert','Kulans Inc','Kulans Inc','Kulans Inc','Kulans Inc'],
       "order":[1001,1001,1001,1005,1005,1005,1005,1005,1006,1006,1006,1006],
       'sku':['B1-20000','B1-27722','B1-86481','S1-06532','S1-82801','S1-06532','S1-47412','S1-27722','S1-27722','B1-33087','B1-33364','B1-20000'],
       "quantity":[7,11,3,48,21,9,44,36,32,23,3,-1],
       'unit price':[33.69,21.12,35.99,55.82,13.62,92.55,78.91,25.42,95.66,22.55,72.3,72.18],
       "ext price":[235.83,232.32,107.97,2679.36,286.02,832.95,3472.04,915.12,3061.12,518.65,216.90,-72.18]}

df = pd.DataFrame(data)
df
accountnameorderskuquantityunit priceext price
0383080Will LLC1001B1-20000733.69235.83
1383080Will LLC1001B1-277221121.12232.32
2383080Will LLC1001B1-86481335.99107.97
3412290Jerde-Hilpert1005S1-065324855.822679.36
4412290Jerde-Hilpert1005S1-828012113.62286.02
5412290Jerde-Hilpert1005S1-06532992.55832.95
6412290Jerde-Hilpert1005S1-474124478.913472.04
7412290Jerde-Hilpert1005S1-277223625.42915.12
8218895Kulans Inc1006S1-277223295.663061.12
9218895Kulans Inc1006B1-330872322.55518.65
10218895Kulans Inc1006B1-33364372.30216.90
11218895Kulans Inc1006B1-20000-172.18-72.18

可以看到数据包含了不同的订单(order),以及订单里的不同商品的数量(quantity)、单价(unit price)和总价(ext price)

我们的任务是为数据表添加一列,表示不同商品在所在订单的价钱占比。

第一种方法实现步骤:

df.groupby(by='order')['ext price'].agg(sum)
order
1001     576.12
1005    8185.49
1006    3724.49
Name: ext price, dtype: float64
df.groupby(by='order')['ext price'].agg(sum).rename('order_total')
order
1001     576.12
1005    8185.49
1006    3724.49
Name: order_total, dtype: float64
order_total = df.groupby(by='order')['ext price'].agg(sum).rename('order_total').reset_index()
order_total
orderorder_total
01001576.12
110058185.49
210063724.49
df_1 = df.merge(order_total)
df_1
accountnameorderskuquantityunit priceext priceorder_total
0383080Will LLC1001B1-20000733.69235.83576.12
1383080Will LLC1001B1-277221121.12232.32576.12
2383080Will LLC1001B1-86481335.99107.97576.12
3412290Jerde-Hilpert1005S1-065324855.822679.368185.49
4412290Jerde-Hilpert1005S1-828012113.62286.028185.49
5412290Jerde-Hilpert1005S1-06532992.55832.958185.49
6412290Jerde-Hilpert1005S1-474124478.913472.048185.49
7412290Jerde-Hilpert1005S1-277223625.42915.128185.49
8218895Kulans Inc1006S1-277223295.663061.123724.49
9218895Kulans Inc1006B1-330872322.55518.653724.49
10218895Kulans Inc1006B1-33364372.30216.903724.49
11218895Kulans Inc1006B1-20000-172.18-72.183724.49
df_1['percent_of _order'] = df_1['ext price']/df_1['order_total']
df_1
accountnameorderskuquantityunit priceext priceorder_totalpercent_of _order
0383080Will LLC1001B1-20000733.69235.83576.120.409342
1383080Will LLC1001B1-277221121.12232.32576.120.403249
2383080Will LLC1001B1-86481335.99107.97576.120.187409
3412290Jerde-Hilpert1005S1-065324855.822679.368185.490.327330
4412290Jerde-Hilpert1005S1-828012113.62286.028185.490.034942
5412290Jerde-Hilpert1005S1-06532992.55832.958185.490.101759
6412290Jerde-Hilpert1005S1-474124478.913472.048185.490.424170
7412290Jerde-Hilpert1005S1-277223625.42915.128185.490.111798
8218895Kulans Inc1006S1-277223295.663061.123724.490.821890
9218895Kulans Inc1006B1-330872322.55518.653724.490.139254
10218895Kulans Inc1006B1-33364372.30216.903724.490.058236
11218895Kulans Inc1006B1-20000-172.18-72.183724.49-0.019380

第二种方法实现步骤(transform)

df.groupby(by='order')['ext price'].transform(sum)
0      576.12
1      576.12
2      576.12
3     8185.49
4     8185.49
5     8185.49
6     8185.49
7     8185.49
8     3724.49
9     3724.49
10    3724.49
11    3724.49
Name: ext price, dtype: float64

不再是只显示3个订单的对应项,而是保持了与原始数据集相同数量的项目,这样就很好继续了。这就是transform的独特之处。

df['order_total'] = df.groupby(by='order')['ext price'].transform(sum)
df['percent_of_order'] = df['ext price']/df['order_total']
df
accountnameorderskuquantityunit priceext priceorder_totalpercent_of_order
0383080Will LLC1001B1-20000733.69235.83576.120.409342
1383080Will LLC1001B1-277221121.12232.32576.120.403249
2383080Will LLC1001B1-86481335.99107.97576.120.187409
3412290Jerde-Hilpert1005S1-065324855.822679.368185.490.327330
4412290Jerde-Hilpert1005S1-828012113.62286.028185.490.034942
5412290Jerde-Hilpert1005S1-06532992.55832.958185.490.101759
6412290Jerde-Hilpert1005S1-474124478.913472.048185.490.424170
7412290Jerde-Hilpert1005S1-277223625.42915.128185.490.111798
8218895Kulans Inc1006S1-277223295.663061.123724.490.821890
9218895Kulans Inc1006B1-330872322.55518.653724.490.139254
10218895Kulans Inc1006B1-33364372.30216.903724.490.058236
11218895Kulans Inc1006B1-20000-172.18-72.183724.49-0.019380

甚至可以一步解决:

df['percent_of_order'] = df['ext price']/df.groupby(by='order')['ext price'].transform(sum)
df
accountnameorderskuquantityunit priceext priceorder_totalpercent_of_order
0383080Will LLC1001B1-20000733.69235.83576.120.409342
1383080Will LLC1001B1-277221121.12232.32576.120.403249
2383080Will LLC1001B1-86481335.99107.97576.120.187409
3412290Jerde-Hilpert1005S1-065324855.822679.368185.490.327330
4412290Jerde-Hilpert1005S1-828012113.62286.028185.490.034942
5412290Jerde-Hilpert1005S1-06532992.55832.958185.490.101759
6412290Jerde-Hilpert1005S1-474124478.913472.048185.490.424170
7412290Jerde-Hilpert1005S1-277223625.42915.128185.490.111798
8218895Kulans Inc1006S1-277223295.663061.123724.490.821890
9218895Kulans Inc1006B1-330872322.55518.653724.490.139254
10218895Kulans Inc1006B1-33364372.30216.903724.490.058236
11218895Kulans Inc1006B1-20000-172.18-72.183724.49-0.019380
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值