Pandas具有丰富的功能,transform是与groupby(pandas中最有用的操作之一)组合使用的。一般情况下,我们在groupby之后使用aggregate , filter 或 apply来汇总数据,transform可能稍难理解。
aggregation会返回数据的缩减版本,而transformation能返回完整数据的某一变换版本供我们重组。这样的transformation,输出的形状和输入一致。一个常见的例子是通过减去分组平均值来居中数据。
实践
加载数据
import pandas as pd
data = {'account':[383080,383080,383080,412290,412290,412290,412290,412290,218895,218895,218895,218895],
"name":['Will LLC','Will LLC','Will LLC','Jerde-Hilpert','Jerde-Hilpert','Jerde-Hilpert','Jerde-Hilpert','Jerde-Hilpert','Kulans Inc','Kulans Inc','Kulans Inc','Kulans Inc'],
"order":[1001,1001,1001,1005,1005,1005,1005,1005,1006,1006,1006,1006],
'sku':['B1-20000','B1-27722','B1-86481','S1-06532','S1-82801','S1-06532','S1-47412','S1-27722','S1-27722','B1-33087','B1-33364','B1-20000'],
"quantity":[7,11,3,48,21,9,44,36,32,23,3,-1],
'unit price':[33.69,21.12,35.99,55.82,13.62,92.55,78.91,25.42,95.66,22.55,72.3,72.18],
"ext price":[235.83,232.32,107.97,2679.36,286.02,832.95,3472.04,915.12,3061.12,518.65,216.90,-72.18]}
df = pd.DataFrame(data)
df
account | name | order | sku | quantity | unit price | ext price | |
---|---|---|---|---|---|---|---|
0 | 383080 | Will LLC | 1001 | B1-20000 | 7 | 33.69 | 235.83 |
1 | 383080 | Will LLC | 1001 | B1-27722 | 11 | 21.12 | 232.32 |
2 | 383080 | Will LLC | 1001 | B1-86481 | 3 | 35.99 | 107.97 |
3 | 412290 | Jerde-Hilpert | 1005 | S1-06532 | 48 | 55.82 | 2679.36 |
4 | 412290 | Jerde-Hilpert | 1005 | S1-82801 | 21 | 13.62 | 286.02 |
5 | 412290 | Jerde-Hilpert | 1005 | S1-06532 | 9 | 92.55 | 832.95 |
6 | 412290 | Jerde-Hilpert | 1005 | S1-47412 | 44 | 78.91 | 3472.04 |
7 | 412290 | Jerde-Hilpert | 1005 | S1-27722 | 36 | 25.42 | 915.12 |
8 | 218895 | Kulans Inc | 1006 | S1-27722 | 32 | 95.66 | 3061.12 |
9 | 218895 | Kulans Inc | 1006 | B1-33087 | 23 | 22.55 | 518.65 |
10 | 218895 | Kulans Inc | 1006 | B1-33364 | 3 | 72.30 | 216.90 |
11 | 218895 | Kulans Inc | 1006 | B1-20000 | -1 | 72.18 | -72.18 |
可以看到数据包含了不同的订单(order),以及订单里的不同商品的数量(quantity)、单价(unit price)和总价(ext price)
我们的任务是为数据表添加一列,表示不同商品在所在订单的价钱占比。
第一种方法实现步骤:
df.groupby(by='order')['ext price'].agg(sum)
order
1001 576.12
1005 8185.49
1006 3724.49
Name: ext price, dtype: float64
df.groupby(by='order')['ext price'].agg(sum).rename('order_total')
order
1001 576.12
1005 8185.49
1006 3724.49
Name: order_total, dtype: float64
order_total = df.groupby(by='order')['ext price'].agg(sum).rename('order_total').reset_index()
order_total
order | order_total | |
---|---|---|
0 | 1001 | 576.12 |
1 | 1005 | 8185.49 |
2 | 1006 | 3724.49 |
df_1 = df.merge(order_total)
df_1
account | name | order | sku | quantity | unit price | ext price | order_total | |
---|---|---|---|---|---|---|---|---|
0 | 383080 | Will LLC | 1001 | B1-20000 | 7 | 33.69 | 235.83 | 576.12 |
1 | 383080 | Will LLC | 1001 | B1-27722 | 11 | 21.12 | 232.32 | 576.12 |
2 | 383080 | Will LLC | 1001 | B1-86481 | 3 | 35.99 | 107.97 | 576.12 |
3 | 412290 | Jerde-Hilpert | 1005 | S1-06532 | 48 | 55.82 | 2679.36 | 8185.49 |
4 | 412290 | Jerde-Hilpert | 1005 | S1-82801 | 21 | 13.62 | 286.02 | 8185.49 |
5 | 412290 | Jerde-Hilpert | 1005 | S1-06532 | 9 | 92.55 | 832.95 | 8185.49 |
6 | 412290 | Jerde-Hilpert | 1005 | S1-47412 | 44 | 78.91 | 3472.04 | 8185.49 |
7 | 412290 | Jerde-Hilpert | 1005 | S1-27722 | 36 | 25.42 | 915.12 | 8185.49 |
8 | 218895 | Kulans Inc | 1006 | S1-27722 | 32 | 95.66 | 3061.12 | 3724.49 |
9 | 218895 | Kulans Inc | 1006 | B1-33087 | 23 | 22.55 | 518.65 | 3724.49 |
10 | 218895 | Kulans Inc | 1006 | B1-33364 | 3 | 72.30 | 216.90 | 3724.49 |
11 | 218895 | Kulans Inc | 1006 | B1-20000 | -1 | 72.18 | -72.18 | 3724.49 |
df_1['percent_of _order'] = df_1['ext price']/df_1['order_total']
df_1
account | name | order | sku | quantity | unit price | ext price | order_total | percent_of _order | |
---|---|---|---|---|---|---|---|---|---|
0 | 383080 | Will LLC | 1001 | B1-20000 | 7 | 33.69 | 235.83 | 576.12 | 0.409342 |
1 | 383080 | Will LLC | 1001 | B1-27722 | 11 | 21.12 | 232.32 | 576.12 | 0.403249 |
2 | 383080 | Will LLC | 1001 | B1-86481 | 3 | 35.99 | 107.97 | 576.12 | 0.187409 |
3 | 412290 | Jerde-Hilpert | 1005 | S1-06532 | 48 | 55.82 | 2679.36 | 8185.49 | 0.327330 |
4 | 412290 | Jerde-Hilpert | 1005 | S1-82801 | 21 | 13.62 | 286.02 | 8185.49 | 0.034942 |
5 | 412290 | Jerde-Hilpert | 1005 | S1-06532 | 9 | 92.55 | 832.95 | 8185.49 | 0.101759 |
6 | 412290 | Jerde-Hilpert | 1005 | S1-47412 | 44 | 78.91 | 3472.04 | 8185.49 | 0.424170 |
7 | 412290 | Jerde-Hilpert | 1005 | S1-27722 | 36 | 25.42 | 915.12 | 8185.49 | 0.111798 |
8 | 218895 | Kulans Inc | 1006 | S1-27722 | 32 | 95.66 | 3061.12 | 3724.49 | 0.821890 |
9 | 218895 | Kulans Inc | 1006 | B1-33087 | 23 | 22.55 | 518.65 | 3724.49 | 0.139254 |
10 | 218895 | Kulans Inc | 1006 | B1-33364 | 3 | 72.30 | 216.90 | 3724.49 | 0.058236 |
11 | 218895 | Kulans Inc | 1006 | B1-20000 | -1 | 72.18 | -72.18 | 3724.49 | -0.019380 |
第二种方法实现步骤(transform)
df.groupby(by='order')['ext price'].transform(sum)
0 576.12
1 576.12
2 576.12
3 8185.49
4 8185.49
5 8185.49
6 8185.49
7 8185.49
8 3724.49
9 3724.49
10 3724.49
11 3724.49
Name: ext price, dtype: float64
不再是只显示3个订单的对应项,而是保持了与原始数据集相同数量的项目,这样就很好继续了。这就是transform的独特之处。
df['order_total'] = df.groupby(by='order')['ext price'].transform(sum)
df['percent_of_order'] = df['ext price']/df['order_total']
df
account | name | order | sku | quantity | unit price | ext price | order_total | percent_of_order | |
---|---|---|---|---|---|---|---|---|---|
0 | 383080 | Will LLC | 1001 | B1-20000 | 7 | 33.69 | 235.83 | 576.12 | 0.409342 |
1 | 383080 | Will LLC | 1001 | B1-27722 | 11 | 21.12 | 232.32 | 576.12 | 0.403249 |
2 | 383080 | Will LLC | 1001 | B1-86481 | 3 | 35.99 | 107.97 | 576.12 | 0.187409 |
3 | 412290 | Jerde-Hilpert | 1005 | S1-06532 | 48 | 55.82 | 2679.36 | 8185.49 | 0.327330 |
4 | 412290 | Jerde-Hilpert | 1005 | S1-82801 | 21 | 13.62 | 286.02 | 8185.49 | 0.034942 |
5 | 412290 | Jerde-Hilpert | 1005 | S1-06532 | 9 | 92.55 | 832.95 | 8185.49 | 0.101759 |
6 | 412290 | Jerde-Hilpert | 1005 | S1-47412 | 44 | 78.91 | 3472.04 | 8185.49 | 0.424170 |
7 | 412290 | Jerde-Hilpert | 1005 | S1-27722 | 36 | 25.42 | 915.12 | 8185.49 | 0.111798 |
8 | 218895 | Kulans Inc | 1006 | S1-27722 | 32 | 95.66 | 3061.12 | 3724.49 | 0.821890 |
9 | 218895 | Kulans Inc | 1006 | B1-33087 | 23 | 22.55 | 518.65 | 3724.49 | 0.139254 |
10 | 218895 | Kulans Inc | 1006 | B1-33364 | 3 | 72.30 | 216.90 | 3724.49 | 0.058236 |
11 | 218895 | Kulans Inc | 1006 | B1-20000 | -1 | 72.18 | -72.18 | 3724.49 | -0.019380 |
甚至可以一步解决:
df['percent_of_order'] = df['ext price']/df.groupby(by='order')['ext price'].transform(sum)
df
account | name | order | sku | quantity | unit price | ext price | order_total | percent_of_order | |
---|---|---|---|---|---|---|---|---|---|
0 | 383080 | Will LLC | 1001 | B1-20000 | 7 | 33.69 | 235.83 | 576.12 | 0.409342 |
1 | 383080 | Will LLC | 1001 | B1-27722 | 11 | 21.12 | 232.32 | 576.12 | 0.403249 |
2 | 383080 | Will LLC | 1001 | B1-86481 | 3 | 35.99 | 107.97 | 576.12 | 0.187409 |
3 | 412290 | Jerde-Hilpert | 1005 | S1-06532 | 48 | 55.82 | 2679.36 | 8185.49 | 0.327330 |
4 | 412290 | Jerde-Hilpert | 1005 | S1-82801 | 21 | 13.62 | 286.02 | 8185.49 | 0.034942 |
5 | 412290 | Jerde-Hilpert | 1005 | S1-06532 | 9 | 92.55 | 832.95 | 8185.49 | 0.101759 |
6 | 412290 | Jerde-Hilpert | 1005 | S1-47412 | 44 | 78.91 | 3472.04 | 8185.49 | 0.424170 |
7 | 412290 | Jerde-Hilpert | 1005 | S1-27722 | 36 | 25.42 | 915.12 | 8185.49 | 0.111798 |
8 | 218895 | Kulans Inc | 1006 | S1-27722 | 32 | 95.66 | 3061.12 | 3724.49 | 0.821890 |
9 | 218895 | Kulans Inc | 1006 | B1-33087 | 23 | 22.55 | 518.65 | 3724.49 | 0.139254 |
10 | 218895 | Kulans Inc | 1006 | B1-33364 | 3 | 72.30 | 216.90 | 3724.49 | 0.058236 |
11 | 218895 | Kulans Inc | 1006 | B1-20000 | -1 | 72.18 | -72.18 | 3724.49 | -0.019380 |