import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
GroupBy
groupby works in three steps under the hood: (1) split: partition the data by some criterion; (2) apply: apply a function to each group, e.g. agg or transform; (3) combine: merge the results.
In the apply step we can perform: (1) aggregation: map each group's data to a single number, e.g. the group mean or the number of elements in the group; (2) transformation: e.g. normalizing the data within each group; (3) filtration: dropping groups with few elements or dropping small elements; this requires passing a func that returns a boolean.
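A minimal sketch of the three apply-step flavors on a toy frame (the column and group names here are illustrative, not from the examples below):

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b", "b", "b"], "val": [1, 2, 3, 4, 5]})
gb = df.groupby("key")

# aggregation: one number per group
agg_out = gb["val"].mean()
# transformation: same length as the input, mapped within each group
trans_out = gb["val"].transform(lambda s: s - s.mean())
# filtration: keep only groups with at least 3 rows
filt_out = gb.filter(lambda g: len(g) >= 3)
```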
1 - Splitting into groups
groupby has two important parameters, by and axis:
(1) by specifies the grouping criterion and accepts a column name, list, dict, or func:
column name: rows are grouped by the values of that column's Series; rows with equal values fall into the same group;
list: grouping follows the contents of the list;
dict: keys are axis labels (index or column labels) and values are target labels; labels mapped to the same target form one group;
func: maps each axis label to a target label; labels with the same result form one group.
(2) axis determines the direction of the split: axis=0/"index" splits row by row, while axis=1/"columns" splits column by column. Don't let habit mislead you: in this respect a DataFrame is completely symmetric.
df = pd.DataFrame({
    "class": ["bird", "bird", "mammal", "mammal", "mammal"],
    "order": ["Falconiformes", "Psittaciformes", "Carnivora", "Primates", "Carnivora"],
    "max_speed": [389.0, 24.0, 80.2, np.nan, 58]
}, index=["falcon", "parrot", "lion", "monkey", "leopard"])
df
          class           order  max_speed
falcon     bird   Falconiformes      389.0
parrot     bird  Psittaciformes       24.0
lion     mammal       Carnivora       80.2
monkey   mammal        Primates        NaN
leopard  mammal       Carnivora       58.0
Splitting by rows
df.groupby("class", axis="index")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fea93816ca0>
df.groupby("class", axis=0).groups
{'bird': ['falcon', 'parrot'], 'mammal': ['lion', 'monkey', 'leopard']}
df.groupby(["class", "order"])
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fea93861580>
df.groupby(["class", "order"]).groups
{('bird', 'Falconiformes'): ['falcon'], ('bird', 'Psittaciformes'): ['parrot'], ('mammal', 'Carnivora'): ['lion', 'leopard'], ('mammal', 'Primates'): ['monkey']}
df = pd.DataFrame(
    {
        "A": np.random.randn(5),
        "B": np.random.randn(5),
        "C": np.random.randn(5),
        "D": np.random.randn(5),
    }
)
df
          A         B         C         D
0  1.047022 -0.266527  1.659578 -1.731920
1 -0.726851 -0.931228 -2.774025  0.443112
2  0.265430 -0.172112  0.278415  0.333350
3  1.128021 -0.499495 -0.694711 -0.884655
4  0.180662  0.802036 -1.210244 -1.251326
Splitting by columns
def get_letter_type(letter):
    if letter.lower() in "aeiou":
        return "vowel"
    else:
        return "consonant"

grouped = df.groupby(get_letter_type, axis=1)
grouped.groups
{'consonant': ['B', 'C', 'D'], 'vowel': ['A']}
grouped.get_group("consonant").head(3)
          B         C         D
0 -0.266527  1.659578 -1.731920
1 -0.931228 -2.774025  0.443112
2 -0.172112  0.278415  0.333350
Grouping by a dict
df.groupby({"A": "A", "B": "A", "C": "C", "D": "C"}, axis=1).get_group("A").head(3)
          A         B
0  1.047022 -0.266527
1 -0.726851 -0.931228
2  0.265430 -0.172112
df.groupby({"A": "A", "B": "A", "C": "C", "D": "C"}, axis=1).get_group("C")
          C         D
0  1.659578 -1.731920
1 -2.774025  0.443112
2  0.278415  0.333350
3 -0.694711 -0.884655
4 -1.210244 -1.251326
df.groupby({"A": "A", "B": "A", "C": "C", "D": "C"}, axis=1).agg(max)
          A         C
0  1.047022  1.659578
1 -0.726851  0.443112
2  0.265430  0.333350
3  1.128021 -0.694711
4  0.802036 -1.210244
Summary:
Key attribute: groups returns a dict whose keys are the group keys and whose values are the matched axis labels.
Key method: get_group() returns all the data for a given group key.
Note: a pandas index may contain duplicate labels; if an index with duplicates is used as the groupby key, values under the same label are assigned to the same group, so the grouped result has no duplicates.
Question: when splitting by rows we can aggregate based on column names; when splitting by columns, do we aggregate based on the index? The last example shows that we do.
Also, groupby on a DataFrame returns a DataFrameGroupBy object (dfgb), and on a Series a SeriesGroupBy (sgb); although a dfgb wraps DataFrame data, it must be understood group by group.
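The point about duplicate index labels can be checked directly: grouping by a level of the index collapses the repeats (toy data, assumed for illustration):

```python
import pandas as pd

# an index with the label "x" repeated
s = pd.Series([1, 2, 3], index=["x", "x", "y"])
# group by the index itself: both "x" rows land in one group
out = s.groupby(level=0).sum()
```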
GroupBy Sort
groupby produces a dfgb object; an aggregation then produces a DataFrame, which can be printed. groupby sorts the keys by default; if you don't need this, set sort=False to speed things up.
df2 = pd.DataFrame({"X": ["B", "B", "A", "A"], "Y": [1, 2, 3, 4]})
b = df2.groupby("X").sum()
type(b)
pandas.core.frame.DataFrame
b
df2.groupby("X", sort=False).sum()
Dropping NaN in groupby
When grouping, the dropna parameter (True or False) controls whether NaN values may serve as a group key.
df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])
df_dropna
   a    b  c
0  1  2.0  3
1  1  NaN  4
2  2  1.0  3
3  1  2.0  2
The group-key column contains NaN
df_dropna.groupby("b").sum()
df_dropna.groupby("b", dropna=False).sum()
Within each group, aggregation functions simply skip the NaN values themselves without disturbing the other values in the same row.
df_dropna.groupby("a").sum()
Column selection on a DataFrameGroupBy object
After grouping a DataFrame you may want to apply different operations to different columns; column selection makes this possible, and later sections build on it. For GroupBy with a MultiIndex, see: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
df_dropna
   a    b  c
0  1  2.0  3
1  1  NaN  4
2  2  1.0  3
3  1  2.0  2
grouped = df_dropna.groupby("a")
grouped["c"]
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fbc71f67d60>
grouped["c"].get_group(1)
0 3
1 4
3 2
Name: c, dtype: int64
Iterating over a groupby object
As noted above, a groupby object can be viewed as key-value pairs, where the key is the group key and the value is the DataFrame formed by the remaining data.
for key, group in grouped:
    print(key)
    print(group)
    print(type(group))
    print("")
1
a b c
0 1 2.0 3
1 1 NaN 4
3 1 2.0 2
<class 'pandas.core.frame.DataFrame'>
2
a b c
2 2 1.0 3
<class 'pandas.core.frame.DataFrame'>
Aggregation
On a DataFrameGroupBy object, an aggregation function collapses each group into a summary for display. You can call its member functions directly, e.g. sum/mean/max/size/count, but remember to add the parentheses.
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)
df
     A      B         C         D
0  foo    one  0.351405  0.425799
1  bar    one -2.175271 -0.278632
2  foo    two -0.175300 -0.910150
3  bar  three -1.723109  0.614626
4  foo    two -1.765967 -0.465801
5  bar    two  0.653539  1.727326
6  foo    one  0.037191 -0.381093
7  foo  three  2.056519 -1.862873
df.groupby("A").sum()
            C         D
A
bar -3.244841  2.063321
foo  0.503848 -3.194118
Aside: aggregation yields a DataFrame, which the to_dict method can then turn into a dict.
df.groupby("A").sum().to_dict("index")
{'bar': {'C': -3.2448414872221907, 'D': 2.063321126560749},
'foo': {'C': 0.5038481220391884, 'D': -3.194117627544297}}
df.groupby("A").sum().to_dict()
{'C': {'bar': -3.2448414872221907, 'foo': 0.5038481220391884},
'D': {'bar': 2.063321126560749, 'foo': -3.194117627544297}}
As shown, after aggregation the group names appear as the new index. If you don't want them as the index, pass as_index=False or call the reset_index method.
df.groupby("A", as_index=False).agg(np.sum)
     A         C         D
0  bar  0.512978  1.766982
1  foo  2.510977 -3.129734
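The reset_index route mentioned above produces the same shape as as_index=False; a sketch on a small frame of the same kind (values assumed):

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo"], "C": [1.0, 2.0, 3.0]})

# keep the group key as a column rather than the index, two equivalent ways
via_flag = df.groupby("A", as_index=False).sum()
via_reset = df.groupby("A").sum().reset_index()
```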
df.groupby("A").size()
A
bar 3
foo 5
dtype: int64
df.groupby("A").count()
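size() and count() differ in how they treat NaN: size counts all rows in each group, while count counts only the non-null values per column (toy frame, assumed for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": ["x", "x", "y"], "B": [1.0, np.nan, 2.0]})

sizes = df.groupby("A").size()    # the NaN row is still counted
counts = df.groupby("A").count()  # NaN is excluded, per column
```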
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)
df
     A      B         C         D
0  foo    one  0.596527 -1.168976
1  bar    one  1.387488  0.439306
2  foo    two -0.105645 -0.085638
3  bar  three  0.019256  0.758142
4  foo    two -0.038975 -0.505880
5  bar    two -0.212490  1.759912
6  foo    one  0.259668 -0.550867
7  foo  three -1.184676 -1.158491
grouped = df.groupby("A")
Applying multiple functions at once
grouped.agg([np.sum, np.mean])
            C                   D
          sum      mean       sum      mean
A
bar  0.952934  0.317645  1.565201  0.521734
foo -3.357455 -0.671491 -1.722682 -0.344536
grouped.agg([np.sum, np.mean]).to_excel("groupby_test.xlsx")
(Screenshot of the exported groupby_test.xlsx omitted.)
grouped["C"].agg([np.sum, np.max, np.mean])
          sum      amax      mean
A
bar  0.952934  0.456841  0.317645
foo -3.357455  0.979221 -0.671491
The names shown after agg are determined by the functions; to rename them, use the rename method. Series and DataFrame work similarly:
grouped["C"].agg([np.sum, np.mean]).rename(columns={"sum": "和", "mean": "平均值"})
            和    平均值
A
bar  0.952934  0.317645
foo -3.357455 -0.671491
Named Aggregation
To support column-oriented aggregation, pandas provides Named Aggregation: each keyword is an output column name, and its value has two parts, the column to aggregate and the aggfunc to use, as in the examples below. The logic: after groupby, each key maps the height and weight columns to Series, and Named Aggregation lets you attach a different aggregation function to each of them.
animals = pd.DataFrame(
    {
        "kind": ["cat", "dog", "cat", "dog"],
        "height": [9.1, 6.0, 9.5, 34.0],
        "weight": [7.9, 7.5, 9.9, 198.0],
    }
)
animals
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0
animals.groupby("kind").agg(
    min_height=pd.NamedAgg(column="height", aggfunc="min"),
    max_height=pd.NamedAgg(column="height", aggfunc="max"),
    average_weight=pd.NamedAgg(column="weight", aggfunc="mean"),
)
      min_height  max_height  average_weight
kind
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75
NamedAgg above is a namedtuple; the equivalent plain-tuple form is:
animals.groupby("kind").agg(
    min_height=("height", "min"),
    max_weight=("weight", "max"),
)
      min_height  max_weight
kind
cat          9.1         9.9
dog          6.0       198.0
When the desired keyword is not a valid Python identifier, use a dict plus unpacking:
animals.groupby("kind").agg(
    **{"min height": ("height", "min")},
    max_weight=("weight", "max"),
)
      min height  max_weight
kind
cat          9.1         9.9
dog          6.0       198.0
If you only care about one column, use named aggregation on the SeriesGroupBy object; since the column is already fixed, the value part is just the function.
animals.groupby("kind").height.agg(
    min_height="min",
    max_height="max",
)
      min_height  max_height
kind
cat          9.1         9.5
dog          6.0        34.0
animals.groupby("kind")["height"].agg(
    min_height="min",
    max_height="max",
)
      min_height  max_height
kind
cat          9.1         9.5
dog          6.0        34.0
The same can be achieved with a dict:
animals.groupby("kind").agg({"height": [min], "weight": max}).rename(columns={"height": "高度", "min": "最小值"})
       高度  weight
      最小值     max
kind
cat    9.1     9.9
dog    6.0   198.0
Note: the func passed in can be a NumPy function or a member function of the GroupBy class; member functions can be referenced either by name as a string or passed directly.
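That note can be verified directly: a string naming a GroupBy method, a NumPy function, and a plain callable all produce the same aggregate (sketch on toy data):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, 2.0, 3.0, 4.0])
g = pd.Series(list("aabb"))

by_string = s.groupby(g).agg("sum")            # string name of a GroupBy method
by_numpy = s.groupby(g).agg(np.sum)            # NumPy function
by_callable = s.groupby(g).agg(lambda x: x.sum())  # plain callable
```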
Aggregating with a user-defined function
First, be clear about which data structure is returned in each case. groupby on a DataFrame returns a DataFrameGroupBy; after that: 1. selecting one column with ["column"] returns a SeriesGroupBy; 2. selecting with [["column"]] returns a DataFrameGroupBy. This mirrors column selection on a plain DataFrame.
animals.groupby("kind")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fea94016490>
animals.groupby("kind")[["weight"]]
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fea94016a00>
animals.groupby("kind")["weight"]
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fea94025460>
animals.groupby("kind")[["weight"]].agg(lambda x: set(x))
            weight
kind
cat     {9.9, 7.9}
dog   {198.0, 7.5}
Summary: 1. one function over multiple columns; 2. a different function per column; 3. several functions over one column; 4. Named Aggregation; 5. user-defined functions.
Transformation
Maps the data within each group. Unlike aggregation, transform does not reduce dimensionality; it maps every individual value.
index = pd.date_range("10/1/1999", periods=1100, freq="D")
ts = pd.Series(np.random.normal(0, 2, 1100), index=index)
ts.head()
1999-10-01 -0.215051
1999-10-02 -1.241018
1999-10-03 -3.258130
1999-10-04 -0.441858
1999-10-05 -2.059294
Freq: D, dtype: float64
grouped = ts.groupby(lambda x: x.year)
grouped.mean()
1999 -0.080469
2000 -0.009197
2001 0.030337
2002 0.165433
dtype: float64
grouped.var()
1999 4.574915
2000 3.531059
2001 4.272156
2002 4.283034
dtype: float64
transformer = grouped.transform(lambda x: (x - np.mean(x)) / np.std(x))
type(transformer)
pandas.core.series.Series
transformer
1999-10-01 -0.063266
1999-10-02 -0.545563
1999-10-03 -1.493789
1999-10-04 -0.169886
1999-10-05 -0.930227
...
2002-09-30 0.305898
2002-10-01 -2.120204
2002-10-02 0.328765
2002-10-03 -1.115765
2002-10-04 -0.054039
Freq: D, Length: 1100, dtype: float64
transformer.mean()
-3.087176978701998e-17
transformer.var()
1.0009099181073695
compare = pd.DataFrame({"original": ts, "Transformed": transformer})
compare.plot()
When necessary, transform will broadcast:
ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())
1999-10-01 11.571518
1999-10-02 11.571518
1999-10-03 11.571518
1999-10-04 11.571518
1999-10-05 11.571518
...
2002-09-30 12.554169
2002-10-01 12.554169
2002-10-02 12.554169
2002-10-03 12.554169
2002-10-04 12.554169
Freq: D, Length: 1100, dtype: float64
ts.groupby(lambda x: x.year).transform(lambda x: max(x))
1999-10-01 6.142405
1999-10-02 6.142405
1999-10-03 6.142405
1999-10-04 6.142405
1999-10-05 6.142405
...
2002-09-30 6.754179
2002-10-01 6.754179
2002-10-02 6.754179
2002-10-03 6.754179
2002-10-04 6.754179
Freq: D, Length: 1100, dtype: float64
Filling missing values with transform
df = pd.DataFrame([[np.nan, 2, 0, 0],
                   [3, 4, 0, 1],
                   [np.nan, np.nan, 1, 5],
                   [5, 3, 2, 4]],
                  columns=list("ABCD"))
df
     A    B  C  D
0  NaN  2.0  0  0
1  3.0  4.0  0  1
2  NaN  NaN  1  5
3  5.0  3.0  2  4
grouped = df.groupby([0, 0, 1, 1])
grouped.transform(lambda x: x.fillna(x.mean()))
     A    B  C  D
0  3.0  2.0  0  0
1  3.0  4.0  0  1
2  5.0  3.0  1  5
3  5.0  3.0  2  4
Rolling-window operations
df = pd.DataFrame({"A": [1] * 10 + [5] * 10, "B": np.arange(20)})
df
    A   B
0   1   0
1   1   1
2   1   2
3   1   3
4   1   4
5   1   5
6   1   6
7   1   7
8   1   8
9   1   9
10  5  10
11  5  11
12  5  12
13  5  13
14  5  14
15  5  15
16  5  16
17  5  17
18  5  18
19  5  19
df.groupby("A").rolling(4).B.mean()
A
1 0 NaN
1 NaN
2 NaN
3 1.5
4 2.5
5 3.5
6 4.5
7 5.5
8 6.5
9 7.5
5 10 NaN
11 NaN
12 NaN
13 11.5
14 12.5
15 13.5
16 14.5
17 15.5
18 16.5
19 17.5
Name: B, dtype: float64
df.groupby("A").expanding().sum()
           B
A
1 0      0.0
  1      1.0
  2      3.0
  3      6.0
  4     10.0
  5     15.0
  6     21.0
  7     28.0
  8     36.0
  9     45.0
5 10    10.0
  11    21.0
  12    33.0
  13    46.0
  14    60.0
  15    75.0
  16    91.0
  17   108.0
  18   126.0
  19   145.0
The resample function: resampling the data (fewer or more points)
df_re = pd.DataFrame(
    {
        "date": pd.date_range(start="2016-01-01", periods=4, freq="W"),
        "group": [1, 1, 2, 2],
        "val": [5, 6, 7, 8],
    }
).set_index("date")
df_re
            group  val
date
2016-01-03      1    5
2016-01-10      1    6
2016-01-17      2    7
2016-01-24      2    8
df_re.resample("2D").ffill()
            group  val
date
2016-01-03      1    5
2016-01-05      1    5
2016-01-07      1    5
2016-01-09      1    5
2016-01-11      1    6
2016-01-13      1    6
2016-01-15      1    6
2016-01-17      2    7
2016-01-19      2    7
2016-01-21      2    7
2016-01-23      2    7
Filtration
filter must be passed a func as its argument; the func receives each group of the DataFrameGroupBy in turn and must return a boolean.
dff = pd.DataFrame({"A": np.arange(8), "B": list("aaabbbbc")})
dff.groupby("B").filter(lambda x: len(x) >= 3)
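The predicate passed to filter can test anything that maps a group to a boolean, not just its length; for instance, keeping only groups whose mean of column A exceeds a threshold (the threshold here is illustrative):

```python
import pandas as pd

dff = pd.DataFrame({"A": range(8), "B": list("aaabbbbc")})

# keep whole groups where the group mean of A is above 2;
# group "a" (mean 1.0) is dropped, "b" (mean 4.5) and "c" (mean 7.0) survive
kept = dff.groupby("B").filter(lambda g: g["A"].mean() > 2)
```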
Dispatching
Here are some convenient instance-method dispatches on a SeriesGroupBy: std, nlargest, nsmallest.
df
    A   B
0   1   0
1   1   1
2   1   2
3   1   3
4   1   4
5   1   5
6   1   6
7   1   7
8   1   8
9   1   9
10  5  10
11  5  11
12  5  12
13  5  13
14  5  14
15  5  15
16  5  16
17  5  17
18  5  18
19  5  19
df.groupby("A").std()
df.groupby("A").agg(np.std)
df.groupby("A").agg(lambda x: x.std())
s = pd.Series([9, 8, 7, 5, 19, 1, 4.2, 3.3])
g = pd.Series(list("abababab"))
gb = s.groupby(g)
gb.nlargest(3)
a 4 19.0
0 9.0
2 7.0
b 1 8.0
3 5.0
7 3.3
dtype: float64
Flexible Apply
If the agg, filter, and transform operations introduced above don't meet your needs, consider the apply function.
s = pd.Series(np.random.rand(5))
s
0 0.309940
1 0.801719
2 0.225639
3 0.622519
4 0.545119
dtype: float64
def f(x):
    return pd.Series([x, x**2], index=["x", "x^2"])

s.apply(f)
          x       x^2
0  0.309940  0.096063
1  0.801719  0.642753
2  0.225639  0.050913
3  0.622519  0.387530
4  0.545119  0.297155
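The same idea applies to a GroupBy object: the function passed to apply receives each group's DataFrame and may return a scalar, Series, or DataFrame, and pandas combines the pieces. A sketch with an assumed toy frame (newer pandas versions may warn about the grouping column being included in each group):

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1.0, 2.0, 3.0]})

def span(g):
    # scalar per group: max minus min of the "val" column
    return g["val"].max() - g["val"].min()

out = df.groupby("key").apply(span)  # Series indexed by group key
```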