Exploring groupby in pandas

虚拟搬运工

Last modified 2022-03-27 22:59:33

Tags: data mining, data analysis, big data, pandas

First published 2021-12-19 22:16:13

Article link: https://blog.csdn.net/u010048197/article/details/122026191

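For the "split-apply-combine" model behind groupby, see the official documentation and other references. This post only records the issues I ran into while solving a concrete problem.

Problem

After grouping the data, compute a value from two columns and store the result in a new column.

Suppose the data is: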

>>> df
    a   b   c   d group
0  94  32  37  13    G1
1  87  95  24  77    G2
2   6  74  68  64    G3
3  36  17  83   4    G2
4  35  61  20   0    G1
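
To follow along, the frame above can be rebuilt by hand. This is only a minimal sketch: the post does not show how df was created, so the constructor call is an assumption, with the values copied from the printed output.

```python
import pandas as pd

# Values copied from the printed frame; how the author built it is not shown.
df = pd.DataFrame(
    {
        "a": [94, 87, 6, 36, 35],
        "b": [32, 95, 74, 17, 61],
        "c": [37, 24, 68, 83, 20],
        "d": [13, 77, 64, 4, 0],
        "group": ["G1", "G2", "G3", "G2", "G1"],
    }
)
print(df)
```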

Calling groupby returns a DataFrameGroupBy object:

>>> df.groupby("group")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000026D9B10F448>

First, see how the output differs between operations.
With a cumulative sum, each row within a group gets its own value:

>>> df.groupby("group")[["a","c"]].cumsum()
     a    c
0   94   37
1   87   24
2    6   68
3  123  107
4  129   57

With sum, each group collapses to a single value; this is what is called an aggregation:

>>> df.groupby("group")[["a","c"]].sum()
         a    c
group
G1     129   57
G2     123  107
G3       6   68
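
The difference between the two results shows up most clearly in their indexes. The sketch below rebuilds the same frame (the constructor call is an assumption, as the post does not show it) and compares the index of the cumulative result with the index of the aggregated one.

```python
import pandas as pd

# Same values as the frame printed in the post.
df = pd.DataFrame(
    {
        "a": [94, 87, 6, 36, 35],
        "c": [37, 24, 68, 83, 20],
        "group": ["G1", "G2", "G3", "G2", "G1"],
    }
)

grouped = df.groupby("group")[["a", "c"]]

cum = grouped.cumsum()  # one output row per input row
agg = grouped.sum()     # one output row per group

# cumsum keeps the original row labels; sum is labelled by group.
print(cum.index.equals(df.index))
print(list(agg.index))
```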

Now take a single column, aggregate it, and write the result to a new column:

>>> df.groupby("group")["a"].sum()
group
G1    129
G2    123
G3      6
Name: a, dtype: int32

>>> df["sum_a"] = df.groupby("group")["a"].sum()
>>> df
    a   b   c   d group  sum_a
0  94  32  37  13    G1    NaN
1  87  95  24  77    G2    NaN
2   6  74  68  64    G3    NaN
3  36  17  83   4    G2    NaN
4  35  61  20   0    G1    NaN

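Not what was expected! The aggregated result is indexed by the group labels (G1, G2, G3), while df is indexed by 0 to 4, so the assignment aligns on index, finds no matching labels, and fills the column with NaN.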
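
Since column assignment matches on index labels, one possible workaround is to look up each row's group in the aggregated Series with Series.map. This is a sketch of that idea, not the approach the post settles on; the name totals is made up for illustration and the frame is rebuilt from the printed values.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [94, 87, 6, 36, 35],
        "group": ["G1", "G2", "G3", "G2", "G1"],
    }
)

# Aggregated result: indexed by group label, not by row number.
totals = df.groupby("group")["a"].sum()

# Map each row's group label to its group total, so the result
# carries df's own index and can be assigned directly.
df["sum_a"] = df["group"].map(totals)
print(df)
```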

Try another way:

>>> df["sum_a"] = df.groupby("group")["a"].apply(lambda x: sum(x))
>>> df
    a   b   c   d group  sum_a
0  94  32  37  13    G1    NaN
1  87  95  24  77    G2    NaN
2   6  74  68  64    G3    NaN
3  36  17  83   4    G2    NaN
4  35  61  20   0    G1    NaN

Still not as expected!

Try yet another way:

>>> df["sum_a"] = df.groupby("group")["a"].transform(lambda x: sum(x))
>>> df
    a   b   c   d group  sum_a
0  94  32  37  13    G1    129
1  87  95  24  77    G2    123
2   6  74  68  64    G3      6
3  36  17  83   4    G2    123
4  35  61  20   0    G1    129

transform produces the expected result.
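
transform also accepts the name of a built-in aggregation as a string, which avoids the Python-level lambda. A minimal sketch on the same data (the frame is rebuilt from the printed values):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [94, 87, 6, 36, 35],
        "group": ["G1", "G2", "G3", "G2", "G1"],
    }
)

# "sum" is broadcast back to every row of its group,
# so the result has the same index as df.
df["sum_a"] = df.groupby("group")["a"].transform("sum")
print(df)
```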

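What if two columns need to be combined and written to a new column? transform works on one column at a time, so it cannot do this. apply, on the other hand, can operate across rows and across columns within each group.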

>>> df["sum_ab"] = df.groupby("group").apply(lambda x:(x["a"] +x["b"])).shift()
...
TypeError: incompatible index of inserted column with frame index

This raises an error saying the index is incompatible. Look at what the right-hand side returns:

>>> df.groupby("group").apply(lambda x:(x["a"] +x["b"])).shift()
group
G1     0      NaN
       4    126.0
G2     1     96.0
       3    182.0
G3     2     53.0
dtype: float64

>>> type(df.groupby("group").apply(lambda x:(x["a"] +x["b"])).shift())
<class 'pandas.core.series.Series'>

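The right-hand side is a Series with a two-level index: the first level is the group name and the second is the original df index. So the group level has to be dropped to make the index match df.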

>>> df["sum_ab"] = df.groupby("group").apply(lambda x:(x["a"] +x["b"])).shift().reset_index(level=0, drop=True)
>>> df
    a   b   c   d group  sum_a  sum_ab
0  94  32  37  13    G1    129     NaN
1  87  95  24  77    G2    123    96.0
2   6  74  68  64    G3      6    53.0
3  36  17  83   4    G2    123   182.0
4  35  61  20   0    G1    129   126.0

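Alternatively, define a custom function. Here shift runs inside each group, so the values differ from sum_ab above, where shift ran once over the concatenated result and crossed group boundaries: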

>>> def sum_ab(d):
...     d["sum_abb"] = (d["a"] + d["b"]).shift()
...     return d

>>> df = df.groupby("group").apply(sum_ab)
>>> df
    a   b   c   d group  sum_a  sum_ab  sum_abb
0  94  32  37  13    G1    129     NaN      NaN
1  87  95  24  77    G2    123    96.0      NaN
2   6  74  68  64    G3      6    53.0      NaN
3  36  17  83   4    G2    123   182.0    182.0
4  35  61  20   0    G1    129   126.0    126.0
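
The same within-group shift can also be written without apply: add the two columns first, then group the resulting Series by the group column and shift. This is a sketch of an alternative, not the method used in the post; the intermediate name ab is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [94, 87, 6, 36, 35],
        "b": [32, 95, 74, 17, 61],
        "group": ["G1", "G2", "G3", "G2", "G1"],
    }
)

# Row-wise sum needs no grouping at all.
ab = df["a"] + df["b"]

# Group the Series by the labels in df["group"] and shift inside
# each group; the result keeps df's index, so no index reset is needed.
df["sum_abb"] = ab.groupby(df["group"]).shift()
print(df)
```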

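What the squeeze parameter does. First, the official documentation (squeeze was deprecated in pandas 1.1 and removed in 2.0, so the squeeze=True calls below only run on older versions):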

squeeze ： bool, default False
Reduce the dimensionality of the return type if possible, otherwise return a consistent type.

Now see its effect in code:

>>> df["group_1"] = "g1"
>>> df
    a   b   c   d group  sum_a  sum_ab  sum_abb group_1
0  94  32  37  13    G1    129     NaN      NaN      g1
1  87  95  24  77    G2    123    96.0      NaN      g1
2   6  74  68  64    G3      6    53.0      NaN      g1
3  36  17  83   4    G2    123   182.0    182.0      g1
4  35  61  20   0    G1    129   126.0    126.0      g1

>>> df.groupby("group_1").apply(lambda x: x["a"].shift())
a         0     1     2    3     4
group_1
g1      NaN  94.0  87.0  6.0  36.0

>>> df.groupby("group").apply(lambda x: x["a"].shift())
group
G1     0     NaN
       4    94.0
G2     1     NaN
       3    87.0
G3     2     NaN
Name: a, dtype: float64

>>> df.groupby("group_1", squeeze=True).apply(lambda x: x["a"].shift())
0     NaN
1    94.0
2    87.0
3     6.0
4    36.0
Name: g1, dtype: float64

>>> df.groupby("group", squeeze=True).apply(lambda x: x["a"].shift())
group
G1     0     NaN
       4    94.0
G2     1     NaN
       3    87.0
G3     2     NaN
Name: a, dtype: float64

>>> type(df.groupby("group").apply(lambda x: x["a"].shift()))
<class 'pandas.core.series.Series'>

>>> type(df.groupby("group_1").apply(lambda x: x["a"].shift()))
<class 'pandas.core.frame.DataFrame'>

>>> type(df.groupby("group", squeeze=True).apply(lambda x: x["a"].shift()))
<class 'pandas.core.series.Series'>

>>> type(df.groupby("group_1", squeeze=True).apply(lambda x: x["a"].shift()))
<class 'pandas.core.series.Series'>
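
On pandas versions where squeeze is no longer available, one way to avoid depending on the return shape is to call shift on the grouped column directly, which returns a Series aligned to the original index however many groups there are. A minimal sketch under that assumption, with the frame rebuilt from the printed values and the names one and many made up for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [94, 87, 6, 36, 35],
        "group": ["G1", "G2", "G3", "G2", "G1"],
    }
)
df["group_1"] = "g1"  # a single group covering every row

# Both calls return a Series indexed like df.
one = df.groupby("group_1")["a"].shift()
many = df.groupby("group")["a"].shift()

print(type(one).__name__, type(many).__name__)
print(one)
print(many)
```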