lele's Python groupby function summary


gulie8

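groupby in detail:

- (Splitting) split the data into groups according to some rule;

- (Applying) apply a function to each group separately;

- (Combining) combine the results into a single data structure.

groupby is one of the most important functions in pandas. It is mainly used for aggregating data and for computing per-category statistics, and it follows the "split-apply-combine" idea.
pandas groupby is very flexible, but it is easy to follow as long as you keep that core idea in mind.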

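Grouping keys can take several forms, and they do not all have to be of the same type:

1. A list or array whose length matches the axis being grouped.

2. A value that names a column of the DataFrame.

3. A dict or Series that maps the values on the axis being grouped to group names.

4. A function that is applied to the axis index or to the individual labels in the index.

Note that the last three are just shortcuts: in the end they all produce an array of values used to split the object.

It is also worth noting that groupby returns a GroupBy object; it only becomes a Series or DataFrame once a function is applied to it.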
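The four kinds of keys listed above can be tried on a small frame. This is a minimal sketch on made-up data: the `toy` frame and its `team` and `score` columns are assumptions for illustration and are not part of the u.user dataset used in the rest of this post.

```python
import pandas as pd

# Hypothetical toy data, chosen only to keep the arithmetic easy to follow
toy = pd.DataFrame({
    "team": ["x", "x", "y", "y", "y"],
    "score": [1, 2, 3, 4, 5],
})

# 1. A list with the same length as the axis being grouped
by_list = toy["score"].groupby(["a", "b", "a", "b", "a"]).sum()

# 2. A column name
by_column = toy.groupby("team")["score"].sum()

# 3. A dict that maps index labels to group names
by_dict = toy["score"].groupby({0: "lo", 1: "lo", 2: "hi", 3: "hi", 4: "hi"}).sum()

# 4. A function that is called once per index label
by_func = toy["score"].groupby(lambda label: label % 2).sum()

# groupby alone only builds a GroupBy object; applying mean() produces the Series
grouped = toy.groupby("team")
result = grouped["score"].mean()
print(result)
```

All four calls end up splitting the same five rows; only the way the group labels are supplied differs.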

import pandas as pd

url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user"
df = pd.read_csv(url, sep="|")

# Mean age of each occupation, sorted in descending order
df.groupby(['occupation']).agg({'age':'mean'}).\
sort_values(by='age',ascending=False)

# Count the men and women in each occupation
df.groupby(['gender','occupation']).size()
gender  occupation
F       administrator    36
        artist           13
        salesman          3
        scientist         3
        student          60
        technician        1
        writer           19
M       administrator    43
        artist           15
        doctor            7
        educator         69
        engineer         65

# Mean age of men and women in each occupation
df.groupby(['gender','occupation']).age.mean()
out:
gender  occupation
F       administrator    40.638889
        artist           30.307692
        educator         39.115385
        scientist        28.333333
        student          20.750000
        technician       38.000000
        writer           37.631579
M       administrator    37.162791
        artist           32.333333


The as_index parameter controls whether the groupby columns become the index of the result; the default is True:
df.groupby(['gender','occupation'],as_index=False).age.mean()
out:
   gender     occupation        age
0       F  administrator  40.638889
1       F         artist  30.307692
17      F        student  20.750000
18      F     technician  38.000000
19      F         writer  37.631579
20      M  administrator  37.162791
21      M         artist  32.333333
24      M       engineer  36.600000
25      M  entertainment  29.000000
26      M      executive  38.172414
27      M     healthcare  45.400000
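The effect of as_index is easier to see side by side. The sketch below uses a tiny hypothetical frame (the `toy` name and its four rows are assumptions, not the u.user data) and compares the default result with the as_index=False result.

```python
import pandas as pd

# Hypothetical four-row frame with the same column names as the example above
toy = pd.DataFrame({
    "gender": ["F", "F", "M", "M"],
    "occupation": ["artist", "artist", "artist", "doctor"],
    "age": [20, 30, 40, 50],
})

# Default (as_index=True): the group keys become a MultiIndex on a Series
indexed = toy.groupby(["gender", "occupation"]).age.mean()

# as_index=False: the group keys stay as ordinary columns of a DataFrame
flat = toy.groupby(["gender", "occupation"], as_index=False).age.mean()

print(indexed)
print(flat)

# Resetting the index of the default result gives the same flat layout
flattened = indexed.reset_index()
```

The flat layout is usually more convenient when the result feeds into a merge or is written to a file.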

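Applying a custom function to a groupby object:
So far we have applied pandas' built-in functions to the group object; a user-defined function works as well.
# Range (max minus min) of age for each gender
def data_range(x):
    return x.max()-x.min()

df.groupby('gender').age.agg(data_range)

# Check:
df.groupby('gender').age.max()-df.groupby('gender').age.min()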

The result of a groupby can be converted to a dict for further work:
# Convert to a dict
a_dict=dict(list(df.groupby('occupation')))

>>>a_dict

out:
{'administrator': user_id age gender occupation zip_code
6 7 57 M administrator 91344
7 8 36 M administrator 05201
33 34 38 F administrator 42141
41 42 30 M administrator 17870
47 48 45 M administrator 12550,
'technician': user_id age gender occupation zip_code
0 1 24 M technician 85711
3 4 24 M technician 43537
43 44 26 M technician 46260,
'writer': user_id age gender occupation zip_code
2 3 23 M writer 32067
20 21 26 M writer 30068
21 22 25 M writer 40206
27 28 32 M writer 55369
49 50 21 M writer 52245}

>>>a_dict['student']

out:
user_id age gender occupation zip_code
8 9 29 M student 01002
29 30 7 M student 55436
31 32 28 F student 78741
32 33 23 M student 27510
35 36 19 F student 93117
36 37 23 M student 55105
48 49 23 F student 76111

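With large datasets, you often only need to aggregate some of the columns:
# Group df (here a frame with columns key1, key2, data1, data2) by 'key1' and 'key2', select data2, and take the mean of each group
value = df.groupby(['key1','key2'])[['data2']].mean()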
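The key1/key2 frame is not defined in this post, so here is a minimal sketch with an assumed stand-in called `frame`; its values are made up. It also shows the difference between selecting with double brackets and with single brackets.

```python
import pandas as pd

# Hypothetical stand-in for the key1/key2 frame; the values are invented
frame = pd.DataFrame({
    "key1": ["a", "b", "a", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Double brackets select a list of columns, so the result is a DataFrame
as_frame = frame.groupby(["key1", "key2"])[["data2"]].mean()

# Single brackets select one column, so the result is a Series
as_series = frame.groupby(["key1", "key2"])["data2"].mean()

print(as_frame)
print(as_series)
```

Selecting the column before aggregating means data1 is never aggregated at all, which is the point of the technique on wide frames.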

Inspecting the members of each group with the groups attribute:
df.groupby('gender',as_index=False).groups
Out[53]:
{'F': Int64Index([ 1, 4, 10, 11, 14, 17, 19, 22, 23, 26,
...
913, 916, 919, 920, 921, 924, 929, 937, 938, 941],
dtype='int64', length=273),
'M': Int64Index([ 0, 2, 3, 5, 6, 7, 8, 9, 12, 13,
...
930, 931, 932, 933, 934, 935, 936, 939, 940, 942],
dtype='int64', length=670)}

Iterating over groups:

# name is the value of key1 for the group, and group is that group's rows

for name, group in df.groupby('key1'):
    print(name, group)

a       data1     data2 key1 key2
0 -1.313101 -0.453361    a  one
2  0.462611  1.150597    a  one
4  0.077367 -0.282876    a  one
b       data1     data2 key1 key2
1  0.791463  1.096693    b  two
3 -0.216121  1.381333    b  two
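A runnable version of the iteration, as a sketch on an assumed frame (`frame` and its values are hypothetical). It also shows that grouping by two keys yields a tuple as the group name.

```python
import pandas as pd

# Hypothetical frame with the same key columns as the example above
frame = pd.DataFrame({
    "key1": ["a", "b", "a", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1, 2, 3, 4, 5],
})

# One key passed as a string: name is the scalar key value
sizes = {}
for name, group in frame.groupby("key1"):
    sizes[name] = len(group)

# Two keys: name is a tuple that can be unpacked in the for statement
pair_sizes = {}
for (k1, k2), group in frame.groupby(["key1", "key2"]):
    pair_sizes[(k1, k2)] = len(group)

print(sizes)
print(pair_sizes)
```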

Selecting a single group:
The get_group method of DataFrameGroupBy:
df.groupby('gender',as_index=False).get_group('F')
Out[55]:
user_id age gender occupation zip_code
1 2 53 F other 94043
4 5 33 F other 15213
10 11 39 F other 30329
11 12 28 F other 06405
14 15 49 F educator 97301
17 18 35 F other 37212
19 20 42 F homemaker 95660
22 23 30 F artist 48197
23 24 21 F artist 94533
26 27 40 F librarian 30030
31 32 28 F student 78741

Note: get_group is a method of the GroupBy object, not of DataFrame.
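Both groups and get_group can be checked on a frame small enough to verify by eye. This is a sketch on assumed data; the `toy` frame below is hypothetical.

```python
import pandas as pd

# Hypothetical four-row frame
toy = pd.DataFrame({
    "gender": ["M", "F", "M", "F"],
    "age": [24, 53, 23, 33],
})

grouped = toy.groupby("gender")

# groups maps each key to the index labels of the rows in that group
labels = {key: list(index) for key, index in grouped.groups.items()}

# get_group returns the rows of one group, with their original index labels
females = grouped.get_group("F")

print(labels)
print(females)
```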

Grouping with a dict or Series:
Besides arrays, grouping information can come in other forms. Consider this example DataFrame:
>>> import numpy as np
>>> people = pd.DataFrame(np.random.randn(5, 5),
... columns=['a', 'b', 'c', 'd', 'e'],
... index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis']
... )
>>> people
a b c d e
Joe 0.306336 -0.139431 0.210028 -1.489001 -0.172998
Steve 0.998335 0.494229 0.337624 -1.222726 -0.402655
Wes 1.415329 0.450839 -1.052199 0.731721 0.317225
Jim 0.550551 3.201369 0.669713 0.725751 0.577687
Travis -2.013278 -2.010304 0.117713 -0.545000 -1.228323

Suppose we know how the columns map to groups and want to sum the columns within each group:
>>> mapping = {'a':'red', 'b':'red', 'c':'blue',
... 'd':'blue', 'e':'red', 'f':'orange'}
>>> mapping
{'a': 'red', 'c': 'blue', 'b': 'red', 'e': 'red', 'd': 'blue', 'f': 'orange'}
>>> type(mapping)
<type 'dict'>
Just pass this dict to groupby:
>>> by_column = people.groupby(mapping, axis=1)
>>> by_column
<pandas.core.groupby.DataFrameGroupBy object at 0x066150F0>
>>> by_column.sum()
blue red
Joe -1.278973 -0.006092
Steve -0.885102 1.089908
Wes 0.731721 1.732554
Jim 1.395465 4.329606
Travis -0.427287 -5.251905
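The axis=1 argument to groupby is deprecated in pandas 2.1 and later. One way to get the same column grouping without it is to transpose, group the rows, and transpose back. The sketch below assumes a small `people` frame with fixed integers instead of random numbers, so the sums can be checked by hand.

```python
import pandas as pd

# Hypothetical fixed values in place of np.random.randn
people = pd.DataFrame(
    [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50]],
    columns=["a", "b", "c", "d", "e"],
    index=["Joe", "Steve"],
)

# Same mapping as above; the unused key 'f' is simply ignored
mapping = {"a": "red", "b": "red", "c": "blue",
           "d": "blue", "e": "red", "f": "orange"}

# Transpose so the columns become rows, group them by the mapping, transpose back
by_color = people.T.groupby(mapping).sum().T
print(by_color)

# A Series works the same way as the dict
by_color_series = people.T.groupby(pd.Series(mapping)).sum().T
```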

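Grouping with functions:
Compared with a dict or Series, a Python function allows a more creative and more abstract way to define the group mapping. Any function passed as a group key is called once for each index value, and its return values are used as the group names.

More concretely, take the DataFrame above, whose index values are people's names. Suppose you want to group by the length of the names. You could compute an array of string lengths, but it is enough to pass the len function:
>>> people.groupby(len).sum()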
a b c d e
3 2.272216 3.061938 0.879741 -0.031529 0.721914
5 0.998335 0.494229 0.337624 -1.222726 -0.402655
6 -2.013278 -2.010304 0.117713 -0.545000 -1.228323

Mixing functions with arrays, lists, dicts, or Series is not a problem either, because everything is converted to arrays internally:
>>> key_list = ['one', 'one', 'one', 'two', 'two']
>>> people.groupby([len, key_list]).min()
a b c d e
3 one 0.306336 -0.139431 0.210028 -1.489001 -0.172998
two 0.550551 3.201369 0.669713 0.725751 0.577687
5 one 0.998335 0.494229 0.337624 -1.222726 -0.402655
6 two -2.013278 -2.010304 0.117713 -0.545000 -1.228323
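Because the listings above use random numbers, the results cannot be reproduced exactly. Here is a sketch with one assumed column of fixed values, so each group can be checked by hand.

```python
import pandas as pd

# Hypothetical single-column frame with the same index as the example above
people = pd.DataFrame(
    {"a": [1, 2, 3, 4, 5]},
    index=["Joe", "Steve", "Wes", "Jim", "Travis"],
)

# len is called on each index label: Joe, Wes and Jim fall into group 3
by_len = people.groupby(len).sum()

# A function and a list can be combined as two grouping levels
key_list = ["one", "one", "one", "two", "two"]
mixed = people.groupby([len, key_list]).min()

print(by_len)
print(mixed)
```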

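Grouping by index level:
The most convenient thing about hierarchically indexed data is that it can be aggregated by index level. To do so, pass the level number or name through the level keyword: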
>>> columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
... [1, 3, 5, 1, 3]], names=['cty', 'tenor'])
>>> columns
MultiIndex
[US 1, 3, 5, JP 1, 3]
>>> hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
>>> hier_df
cty US JP
tenor 1 3 5 1 3
0 -0.166600 0.248159 -0.082408 -0.710841 -0.097131
1 -1.762270 0.687458 1.235950 -1.407513 1.304055
2 1.089944 0.258175 -0.749688 -0.851948 1.687768
3 -0.378311 -0.078268 0.247147 -0.018829 0.744540
>>> hier_df.groupby(level='cty', axis=1).count()
cty JP US
0 2 3
1 2 3
2 2 3
3 2 3
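The same level-based grouping can be written without axis=1, which is deprecated in pandas 2.1 and later, by transposing first. This is a sketch with assumed fixed values in place of the random numbers.

```python
import pandas as pd

columns = pd.MultiIndex.from_arrays(
    [["US", "US", "US", "JP", "JP"], [1, 3, 5, 1, 3]],
    names=["cty", "tenor"],
)

# Hypothetical fixed values instead of np.random.randn
hier_df = pd.DataFrame([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]], columns=columns)

# Transpose so the column MultiIndex becomes the row index, then group by level
counts = hier_df.T.groupby(level="cty").count().T
sums = hier_df.T.groupby(level="cty").sum().T

print(counts)
print(sums)
```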

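Two ways to modify the contents of a DataFrame groupby:

Method 1

Iterate over the groups, take the second element of each group item (a tuple) as a DataFrame, and work on that. Note that modifying group inside the loop does not change the groupby object.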

df = pd.DataFrame({'A': 'a a b b b'.split(), 'B': [1, 2, 1, 2, 3], 'C': [4, 6, 5, 6, 7]})
print(df)
df = df.groupby(['A'])
f = lambda x: pd.Series([x.B + x.C, x.C - x.B], index=['D', 'F'])
for group in df:
    print(group)
    df1 = group[1].copy()   # take the second element (copied, so the assignment below is safe)
    print(df1)
    df1[['D', 'F']] = df1.apply(f, axis=1)
    print(df1)

Output:

   A  B  C
0  a  1  4
1  a  2  6
2  b  1  5
3  b  2  6
4  b  3  7

('a',    A  B  C
0  a  1  4
1  a  2  6)

   A  B  C
0  a  1  4
1  a  2  6

   A  B  C  D  F
0  a  1  4  5  3
1  a  2  6  8  4

('b',    A  B  C
2  b  1  5
3  b  2  6
4  b  3  7)

   A  B  C
2  b  1  5
3  b  2  6
4  b  3  7

   A  B  C   D  F
2  b  1  5   6  4
3  b  2  6   8  4
4  b  3  7  10  4

Method 2

Convert the grouped DataFrame to a dict, look up a group by key, and work on that DataFrame. With this approach the values stored in the dict can be modified.

df = pd.DataFrame({'A': 'a a b b b'.split(), 'B': [1, 2, 1, 2, 3], 'C': [4, 6, 5, 6, 7]})
print(df)
dict_df = dict(list(df.groupby('A')))
print(dict_df)
a = dict_df['a']
print("print a")
print(a)
a_B = dict_df['a']['B']
print("print a_B")
print(a_B)
f = lambda x: pd.Series([x.B + x.C, x.C - x.B], index=['D', 'F'])
a[['D', 'F']] = a.apply(f, axis=1)
print("print a")
print(a)
# Add a column to the value stored under key 'a' in the original dict
dict_df['a'].loc[:, 'D'] = 0
print('print dict_df[''a'']')
print(dict_df['a'])

Output:

   A  B  C
0  a  1  4
1  a  2  6
2  b  1  5
3  b  2  6
4  b  3  7
{'a':    A  B  C
0  a  1  4
1  a  2  6, 'b':    A  B  C
2  b  1  5
3  b  2  6
4  b  3  7}
print a
   A  B  C
0  a  1  4
1  a  2  6
print a_B
0    1
1    2
Name: B, dtype: int64
print a
   A  B  C  D  F
0  a  1  4  5  3
1  a  2  6  8  4
print dict_df[a]
   A  B  C  D
0  a  1  4  0
1  a  2  6  0

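Comparing the two: Method 2 requires you to know the group keys in order to look them up, so when there are many keys and every group needs the same operation, Method 1 is quicker. If you want to modify the grouped contents directly, however, Method 2 is the better choice.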
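If the goal is to keep the modified groups rather than only print them, one option is to collect the pieces in a list and concatenate them at the end. This is a sketch of that idea, not the approach used in either method above; the `pieces` and `combined` names are assumptions.

```python
import pandas as pd

df = pd.DataFrame({'A': 'a a b b b'.split(), 'B': [1, 2, 1, 2, 3], 'C': [4, 6, 5, 6, 7]})

pieces = []
for name, group in df.groupby('A'):
    piece = group.copy()                    # work on a copy of the group
    piece['D'] = piece['B'] + piece['C']    # same D as in the examples above
    piece['F'] = piece['C'] - piece['B']    # same F as in the examples above
    pieces.append(piece)

# Stitch the modified groups back into a single DataFrame
combined = pd.concat(pieces)
print(combined)

# The original frame keeps its three columns
print(list(df.columns))
```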
---------------------

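Pandas: turning a column into rows with groupby:

# xfdps_all is the author's source DataFrame and is assumed to be loaded already
rows = []  # one transposed frame per group
for name, group in xfdps_all.groupby('System_ID'):  # group the raw data first
    newdf = pd.DataFrame({name: list(group['Service Call Close Date'])})  # build a new DataFrame
    newdf[name] = pd.to_datetime(newdf[name])  # convert the values to datetime
    # sort by time; reset the index so the sorted order survives the concat below
    newdf2 = newdf.sort_values(by=name, ascending=True).reset_index(drop=True)
    print(newdf2.shape)
    print(newdf2.T)  # transpose: the column becomes a row
    rows.append(newdf2.T)
tempdf = pd.concat(rows)  # DataFrame.append was removed in pandas 2.0
print(tempdf.shape)
tempdf.to_excel("D:\\xfd-ps\\xfdps_1031.xlsx")  # write the result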

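Computing the mode after a DataFrame groupby:

1. Problem

Given a DataFrame df with a category column a and a value column b, we want the mode of b for each category of a. dir(df.groupby('a')) shows that there is no mode function, so df.groupby('a').mode().reset_index() cannot be used directly.

Solutions:

1. Use scipy.stats.mode(): category B in df has two modes, and the result returns the smaller one. (On recent SciPy releases, where stats.mode returns a scalar for 1-D input, stats.mode(x)[0] is the equivalent call.)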

>>> from scipy import stats
>>> df.groupby('a').agg(lambda x: stats.mode(x)[0][0]).reset_index()
   a  b
0  A  1
1  B  2

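2. Use value_counts()
(1) First look at what value_counts() does: in the result, the index holds the distinct values and the data holds their counts, sorted by count in descending order. The order among values with equal counts is not guaranteed and can differ between pandas versions; in the run below the larger value comes first, so index[0] returns the larger of the tied modes.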

>>> ss = pd.Series([1,2,2,3,3])
>>> ss
0    1
1    2
2    2
3    3
4    3
dtype: int64
>>> ss.value_counts()
3    2
2    2
1    1
dtype: int64
>>> ss.value_counts().index[0]
3

(2) Use it in the aggregation after a DataFrame groupby:

>>> df.groupby('a').agg(lambda x: x.value_counts().index[0]).reset_index()
   a  b
0  A  1
1  B  3

3. Use pd.Series.mode(): it returns the mode(s) of a Series; when there are several, the result cell holds all of them

>>> df.groupby('a').agg(pd.Series.mode).reset_index()
   a       b
0  A       1
1  B  [2, 3]

4. Use pd.Series.mode() with np.mean() to average multiple modes into a single value

>>> import numpy as np
>>> df.groupby('a').agg(lambda x: np.mean(pd.Series.mode(x))).reset_index()
  a    b
0  A  1.0
1  B  2.5
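Since Series.mode() returns the modes in sorted order, the smallest, largest, or average mode can be picked explicitly without relying on tie order. The sketch below assumes a hypothetical df built to match the outputs above (one mode for A, two modes for B); it is not the original data.

```python
import pandas as pd

# Hypothetical data: A has the single mode 1, B has the two modes 2 and 3
df = pd.DataFrame({
    "a": ["A", "A", "A", "B", "B", "B", "B"],
    "b": [1, 1, 2, 2, 2, 3, 3],
})

# Series.mode() is sorted ascending, so iloc[0] is the smallest mode
smallest = df.groupby("a")["b"].agg(lambda x: x.mode().iloc[0])

# ... and iloc[-1] is the largest mode
largest = df.groupby("a")["b"].agg(lambda x: x.mode().iloc[-1])

# The mean of all modes, as in solution 4
average = df.groupby("a")["b"].agg(lambda x: x.mode().mean())

print(smallest)
print(largest)
print(average)
```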
