Chapter 3: pandas Grouping

bingo！？

Categories: Python, panda. Tags: python, data analysis

Article link: https://blog.csdn.net/qq_43824915/article/details/105777390

Chapter 3: Grouping

import numpy as np
import pandas as pd
df = pd.read_csv('data/table.csv',index_col='ID')
df.head()

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1102	S_1	C_1	F	street_2	192	73	32.5	B+
1103	S_1	C_1	M	street_2	186	82	87.2	B+
1104	S_1	C_1	F	street_2	167	81	80.4	B-
1105	S_1	C_1	F	street_4	159	64	84.8	B+

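I. The SAC Process

1. Meaning

SAC refers to the split-apply-combine process in grouped operations.

Split means dividing the data into groups according to some rule; apply means running a function on each group independently; combine means assembling the per-group results into a single data structure.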
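
The three steps can be spelled out by hand and compared with a single groupby call. This is only a minimal sketch on hypothetical toy data (the frame `toy` and its values are assumptions that mimic the table used in this chapter), not a description of how pandas implements grouping internally.

```python
import pandas as pd

# Hypothetical toy data; the column names mimic the table used in this chapter.
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_2', 'S_2', 'S_2'],
    'Math': [34.0, 32.5, 83.3, 50.6, 52.5],
})

# split: one sub-frame per distinct School value
pieces = {key: toy[toy['School'] == key] for key in toy['School'].unique()}

# apply: compute a statistic on every piece independently
means = {key: piece['Math'].mean() for key, piece in pieces.items()}

# combine: assemble the per-group results into one Series
manual = pd.Series(means, name='Math').rename_axis('School')

# groupby performs the same three steps in a single call
auto = toy.groupby('School')['Math'].mean()
print(manual)
print(auto)
```

Both Series should hold the same per-school means, which is the point of the comparison.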

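2. The apply step

In this step we typically run into four kinds of problems:

Aggregation: computing a statistic per group (for example the mean, or the number of elements in each group)

Transformation: operating on every element within its group (for example standardizing values)

Filtration: selecting whole groups according to some rule (for example keeping the groups in which a given metric is below 50)

Combined problems: a mixture of the three kinds above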
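
The first three kinds differ mainly in the shape of what comes back. The sketch below uses hypothetical toy data (the frame `toy` is an assumption, not the chapter's table) to show that aggregation yields one value per group, transformation keeps the original length, and filtration keeps only the rows of qualifying groups.

```python
import pandas as pd

# Hypothetical toy data with two groups of sizes 2 and 3.
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_2', 'S_2', 'S_2'],
    'Math': [34.0, 32.5, 83.3, 50.6, 52.5],
})
g = toy.groupby('School')['Math']

agg_result = g.mean()                                # aggregation: one value per group
trans_result = g.transform(lambda x: x - x.mean())   # transformation: same length as the input
filt_result = g.filter(lambda x: x.mean() > 50)      # filtration: rows of qualifying groups only

print(len(agg_result), len(trans_result), len(filt_result))   # → 2 5 3
```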

II. The groupby Function

1. Basics of grouping:

(a) Grouping by a single column

grouped_single = df.groupby('School')

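Calling groupby produces a GroupBy object. The object itself does not compute or return any result; work is only done when one of its methods is called.

For example, to retrieve a single group: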

grouped_single.get_group('S_1').head()

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1102	S_1	C_1	F	street_2	192	73	32.5	B+
1103	S_1	C_1	M	street_2	186	82	87.2	B+
1104	S_1	C_1	F	street_2	167	81	80.4	B-
1105	S_1	C_1	F	street_4	159	64	84.8	B+

(b) Grouping by several columns

grouped_mul = df.groupby(['School','Class'])
grouped_mul.get_group(('S_2','C_4'))

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
2401	S_2	C_4	F	street_2	192	62	45.3	A
2402	S_2	C_4	M	street_7	166	82	48.7	B
2403	S_2	C_4	F	street_6	158	60	59.7	B+
2404	S_2	C_4	F	street_2	160	84	67.7	B
2405	S_2	C_4	F	street_6	193	54	47.6	B

(c) Group sizes and number of groups

grouped_single.size()   # number of rows in each group

School
S_1    15
S_2    20
dtype: int64

grouped_mul.size()

School  Class
S_1     C_1      5
        C_2      5
        C_3      5
S_2     C_1      5
        C_2      5
        C_3      5
        C_4      5
dtype: int64

grouped_single.ngroups   # number of groups

grouped_mul.ngroups

(d) Iterating over groups

for i,group  in grouped_single:   # i is the group name, group is the sub-frame
    print(i)
    display(group.head())

S_1

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1102	S_1	C_1	F	street_2	192	73	32.5	B+
1103	S_1	C_1	M	street_2	186	82	87.2	B+
1104	S_1	C_1	F	street_2	167	81	80.4	B-
1105	S_1	C_1	F	street_4	159	64	84.8	B+

S_2

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
2101	S_2	C_1	M	street_7	174	84	83.3	C
2102	S_2	C_1	F	street_6	161	61	50.6	B+
2103	S_2	C_1	M	street_4	157	61	52.5	B-
2104	S_2	C_1	F	street_5	159	97	72.2	B+
2105	S_2	C_1	M	street_4	170	81	34.2	A

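(e) The level parameter (for a MultiIndex) and the axis parameter

df.set_index(['Gender','School']).groupby(level=0,axis=0).get_group('F').head()  # level picks the MultiIndex level to group on; axis=0 (rows) is the default, and the axis argument is deprecated in recent pandas releases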

		Class	Address	Height	Weight	Math	Physics
Gender	School
F	S_1	C_1	street_2	192	73	32.5	B+
	S_1	C_1	street_2	167	81	80.4	B-
	S_1	C_1	street_4	159	64	84.8	B+
	S_1	C_2	street_4	176	94	63.5	B-
	S_1	C_2	street_5	162	63	33.8	B

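2. Characteristics of the groupby object

(a) Listing all callable methods

The listing below shows that a groupby object exposes a large number of methods, which makes it very flexible.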

print([attr for attr in dir(grouped_single) if not attr.startswith('_')])

['Address', 'Class', 'Gender', 'Height', 'Math', 'Physics', 'School', 'Weight', 'agg', 'aggregate', 'all', 'any', 'apply', 'backfill', 'bfill', 'boxplot', 'corr', 'corrwith', 'count', 'cov', 'cumcount', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'dtypes', 'expanding', 'ffill', 'fillna', 'filter', 'first', 'get_group', 'groups', 'head', 'hist', 'idxmax', 'idxmin', 'indices', 'last', 'mad', 'max', 'mean', 'median', 'min', 'ndim', 'ngroup', 'ngroups', 'nth', 'nunique', 'ohlc', 'pad', 'pct_change', 'pipe', 'plot', 'prod', 'quantile', 'rank', 'resample', 'rolling', 'sem', 'shift', 'size', 'skew', 'std', 'sum', 'tail', 'take', 'transform', 'tshift', 'var']

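(b) head and first on a grouped object

Calling head on a grouped object returns the first few rows of every group, not the first few rows of the whole dataset.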

grouped_single.head(2)

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1102	S_1	C_1	F	street_2	192	73	32.5	B+
2101	S_2	C_1	M	street_7	174	84	83.3	C
2102	S_2	C_1	F	street_6	161	61	50.6	B+

first returns one row per group, indexed by the group key, holding the first non-null value of each column in that group.

grouped_single.first()

	Class	Gender	Address	Height	Weight	Math	Physics
School
S_1	C_1	M	street_1	173	63	34.0	A+
S_2	C_1	M	street_7	174	84	83.3	C

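(c) Grouping keys

groupby is very liberal about what it accepts as a key: any list-like with the same length as the DataFrame works, and grouping by a function is supported as well.

df.groupby(np.random.choice(['a','b','c'],df.shape[0])).get_group('b').head()   # draw one of a, b, c at random for each of the 35 rows and group by the draws
# equivalent to treating np.random.choice(['a','b','c'],df.shape[0]) as a new column and grouping by it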

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1103	S_1	C_1	M	street_2	186	82	87.2	B+
1105	S_1	C_1	F	street_4	159	64	84.8	B+
1301	S_1	C_3	M	street_4	161	68	31.5	B+
1302	S_1	C_3	F	street_1	175	57	87.7	A-
1303	S_1	C_3	M	street_7	188	82	49.7	B
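
A deterministic variant may make the two kinds of keys easier to follow. The sketch below groups a hypothetical four-row frame (the frame `toy`, the list `labels` and the digit-based rule are all assumptions for illustration) first by a plain list and then by a function of the index.

```python
import pandas as pd

# Hypothetical toy data indexed by student ID, as in the chapter's table.
toy = pd.DataFrame({'Math': [34.0, 32.5, 87.2, 80.4]},
                   index=pd.Index([1101, 1102, 1201, 1202], name='ID'))

# A plain list with one label per row acts like a temporary column of group keys.
labels = ['a', 'b', 'a', 'b']
by_list = toy.groupby(labels)['Math'].sum()

# A function receives each index value; here the hundreds digit stands for the class.
by_func = toy.groupby(lambda idx: idx // 100 % 10)['Math'].sum()

print(by_list)
print(by_func)
```

The function is called once per index label, which is why it can only use information that is recoverable from the index.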

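When a function is used as the key, the object it receives is each index value, as the next call shows; this makes fairly complex groupings possible.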

df[:5].groupby(lambda x:print(x)).head(2)

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1102	S_1	C_1	F	street_2	192	73	32.5	B+

Grouping by odd and even rows (counting rows from 1)

df.groupby(lambda x:'odd rows' if not df.index.get_loc(x)%2==1 else 'even rows').groups
# df.index.get_loc(1101) returns the 0-based row position of an index label

{'even rows': Int64Index([1102, 1104, 1201, 1203, 1205, 1302, 1304, 2101, 2103, 2105, 2202,
             2204, 2301, 2303, 2305, 2402, 2404],
            dtype='int64', name='ID'),
 'odd rows': Int64Index([1101, 1103, 1105, 1202, 1204, 1301, 1303, 1305, 2102, 2104, 2201,
             2203, 2205, 2302, 2304, 2401, 2403, 2405],
            dtype='int64', name='ID')}

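With a MultiIndex, the argument of the lambda is a tuple. The code below checks, for each school, whether the mean Math score of female and male students reaches the passing mark of 60.

Note: this only demonstrates how groupby works; you would not write it this way in practice.

math_score = df.set_index(['Gender','School'])['Math'].sort_index()
grouped_score = df.set_index(['Gender','School']).groupby(lambda x:(x,'mean passes' if math_score[x].mean()>=60 else 'mean fails'))  # x is an index tuple (Gender, School)
for name,_ in grouped_score:print(name)
print(math_score)

(('F', 'S_1'), 'mean passes')
(('F', 'S_2'), 'mean passes')
(('M', 'S_1'), 'mean passes')
(('M', 'S_2'), 'mean fails')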
Gender  School
F       S_1       32.5
        S_1       80.4
        S_1       84.8
        S_1       63.5
        S_1       33.8
        S_1       68.4
        S_1       87.7
        S_1       61.7
        S_2       50.6
        S_2       72.2
        S_2       68.5
        S_2       85.4
        S_2       72.3
        S_2       65.9
        S_2       95.5
        S_2       45.3
        S_2       59.7
        S_2       67.7
        S_2       47.6
M       S_1       34.0
        S_1       87.2
        S_1       97.0
        S_1       58.8
        S_1       31.5
        S_1       49.7
        S_1       85.2
        S_2       83.3
        S_2       52.5
        S_2       34.2
        S_2       39.1
        S_2       73.8
        S_2       47.2
        S_2       32.7
        S_2       48.9
        S_2       48.7
Name: Math, dtype: float64

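(d) The [] operation on a groupby object

[] selects one or several columns of a groupby object, so the mean-score comparison above can be written concisely as: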

df.groupby(['Gender','School'])['Math'].mean()>=60

Gender  School
F       S_1        True
        S_2        True
M       S_1        True
        S_2       False
Name: Math, dtype: bool

Pass a list to select several columns:

df.groupby(['Gender','School'])[['Math','Height']].mean()

		Math	Height
Gender	School
F	S_1	64.100000	173.125000
F	S_2	66.427273	173.727273
M	S_1	63.342857	178.714286
M	S_2	51.155556	172.000000

(e) Grouping by a continuous variable

For example, use cut to bin the Math scores:

bins = [0,40,60,80,90,100]
cuts = pd.cut(df['Math'],bins=bins) # pass labels=... to attach custom labels
df.groupby(cuts)['Math'].count()

Math
(0, 40]       7
(40, 60]     10
(60, 80]      9
(80, 90]      7
(90, 100]     2
Name: Math, dtype: int64
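
The `labels` option mentioned in the comment can be sketched as follows. The scores and the label names are hypothetical; only `bins` is taken from the text. The sketch also shows that intervals are right-closed by default and that `observed=False` keeps empty bins in the result.

```python
import pandas as pd

# Hypothetical scores; bins are the ones used in the text.
scores = pd.Series([34.0, 32.5, 87.2, 80.4, 84.8, 97.0, 60.0], name='Math')
bins = [0, 40, 60, 80, 90, 100]
names = ['fail', 'low', 'medium', 'good', 'excellent']  # hypothetical labels

# Intervals are right-closed by default, so 60.0 falls into (40, 60] -> 'low'.
levels = pd.cut(scores, bins=bins, labels=names)

# observed=False keeps empty categories in the result with a count of 0.
counts = scores.groupby(levels, observed=False).count()
print(counts)
```

Passing `right=False` to `pd.cut` would make the intervals left-closed instead, which moves boundary values such as 60.0 into the next bin.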

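III. Aggregation, Filtration and Transformation

1. Aggregation

(a) Common aggregation functions

Aggregation reduces a collection of values to a single scalar per group, so mean/sum/size/count/std/var/sem/describe/first/last/nth/min/max can all be used as aggregation functions (describe is a loose fit, since it returns several statistics per group).

As an exercise, verify the standard error function sem, which is computed as $\frac{\text{within-group standard deviation}}{\sqrt{\text{group size}}}$: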

group_m = grouped_single['Math']
print(group_m.std().values/np.sqrt(group_m.count().values)== group_m.sem().values)
print(group_m.std().values)  # within-group standard deviation
print(np.sqrt(group_m.count().values)) # square root of the group size
print(group_m.std().values/np.sqrt(group_m.count().values))

[ True  True]
[23.07747407 17.58930521]
[3.87298335 4.47213595]
[5.95857818 3.93308821]
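
Exact equality happened to hold above, but floating-point results are usually safer to compare with a tolerance. The following sketch repeats the check on hypothetical toy data (the frame `toy` is an assumption) using `np.allclose`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with group sizes 3 and 4.
toy = pd.DataFrame({
    'School': ['S_1'] * 3 + ['S_2'] * 4,
    'Math': [34.0, 32.5, 87.2, 83.3, 50.6, 52.5, 72.2],
})
g = toy.groupby('School')['Math']

# Sample standard deviation (ddof=1) divided by the square root of the group size.
manual_sem = g.std() / np.sqrt(g.count())

# Compare with a tolerance rather than with ==.
print(np.allclose(manual_sem, g.sem()))
```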

(b) Using several aggregation functions at once

group_m.agg(['sum','mean','std'])

	sum	mean	std
School
S_1	956.2	63.746667	23.077474
S_2	1191.1	59.555000	17.589305

Rename the result columns with (name, function) tuples

group_m.agg([('Sum','sum'),('Mean','mean')])

	Sum	Mean
School
S_1	956.2	63.746667
S_2	1191.1	59.555000

Specify which functions apply to which columns

grouped_mul.agg({'Math':['mean','max'],'Height':'var'})

		Math		Height
		mean	max	var
School	Class
S_1	C_1	63.78	87.2	183.3
	C_2	64.30	97.0	132.8
	C_3	63.16	87.7	179.2
S_2	C_1	58.56	83.3	54.7
	C_2	62.80	85.4	256.0
	C_3	63.06	95.5	205.7
	C_4	53.80	67.7	300.2

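(c) Using custom functions

grouped_single['Math'].agg(lambda x:print(x,'间隔'))   # '间隔' ("separator") marks the end of each printed group
# agg passes the data in group by group and column by column; this property enables many custom computations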

1101    34.0
1102    32.5
1103    87.2
1104    80.4
1105    84.8
1201    97.0
1202    63.5
1203    58.8
1204    33.8
1205    68.4
1301    31.5
1302    87.7
1303    49.7
1304    85.2
1305    61.7
Name: Math, dtype: float64 间隔
2101    83.3
2102    50.6
2103    52.5
2104    72.2
2105    34.2
2201    39.1
2202    68.5
2203    73.8
2204    47.2
2205    85.4
2301    72.3
2302    32.7
2303    65.9
2304    95.5
2305    48.9
2401    45.3
2402    48.7
2403    59.7
2404    67.7
2405    47.6
Name: Math, dtype: float64 间隔





School
S_1   NaN
S_2   NaN
Name: Math, dtype: float64

pandas has no built-in function for the range (max minus min), but agg makes the within-group range easy to compute

grouped_single['Math'].agg(lambda x:x.max()-x.min())

School
S_1    65.5
S_2    62.8
Name: Math, dtype: float64

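(d) Multiple aggregations with NamedAgg

Note: lambda functions were not supported here when this was written, but functions defined with def can be used. On a single selected column such as grouped_single['Math'], the column argument is not used to pick data, and recent pandas releases may reject NamedAgg there; passing keyword='min' directly is the documented form for a SeriesGroupBy.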

def R1(x):
    return x.max()-x.min()
def R2(x):
    return x.max()-x.median()
grouped_single['Math'].agg(min_score1=pd.NamedAgg(column='col1',aggfunc='min'),
                           max_score1=pd.NamedAgg(column='col2', aggfunc='max'),
                           range_score2=pd.NamedAgg(column='col3', aggfunc=R2))

	min_score1	max_score1	range_score2
School
S_1	31.5	97.0	33.5
S_2	32.7	95.5	39.4
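
`pd.NamedAgg` is aimed at a DataFrameGroupBy, where `column` names a real column. The sketch below shows that usage on hypothetical toy data; the frame `toy`, the helper `value_range` and the output column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data with two numeric columns.
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_1', 'S_2', 'S_2'],
    'Math': [34.0, 32.5, 87.2, 83.3, 50.6],
    'Height': [173, 192, 186, 174, 161],
})

def value_range(x):
    """Range of a group: max minus min."""
    return x.max() - x.min()

# Each keyword becomes an output column; column= picks the input column.
summary = toy.groupby('School').agg(
    min_math=pd.NamedAgg(column='Math', aggfunc='min'),
    math_range=pd.NamedAgg(column='Math', aggfunc=value_range),
    mean_height=pd.NamedAgg(column='Height', aggfunc='mean'),
)
print(summary)
```

Because every output column is named explicitly, the result has flat column labels rather than the two-level header produced by a dict of lists.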

(e) Aggregation functions with arguments

Check whether each group has at least one Math score between 50 and 52:

def f(s,low,high):
    return s.between(low,high).any()
grouped_single['Math'].agg(f,50,52)

School
S_1    False
S_2     True
Name: Math, dtype: bool

If several functions are needed and at least one of them takes arguments, wrap it in a closure:

def f_test(s,low,high):
    return s.between(low,high).max()
def agg_f(f_mul,name,*args,**kwargs):
    def wrapper(x):
        return f_mul(x,*args,**kwargs)
    wrapper.__name__ = name
    return wrapper
new_f = agg_f(f_test,'at_least_one_in_50_52',50,52)
grouped_single['Math'].agg([new_f,'mean']).head()

	at_least_one_in_50_52	mean
School
S_1	False	63.746667
S_2	True	59.555000

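2. Filtration

filter selects whole groups (keep in mind that the result contains every row of the groups that pass), so the function passed in must return a single Boolean value per group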

grouped_single[['Math','Physics']].filter(lambda x:(x['Math']>32).all()).head()

	Math	Physics
ID
2101	83.3	C
2102	50.6	B+
2103	52.5	B-
2104	72.2	B+
2105	34.2	A
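
A smaller case makes the all-or-nothing behavior visible. In this sketch on hypothetical toy data (the frame `toy` is an assumption), one school has a single score at or below 32, so all of its rows are dropped, including the row with the high score.

```python
import pandas as pd

# Hypothetical toy data: S_1 contains one score below the threshold.
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_2', 'S_2'],
    'Math': [31.5, 87.7, 32.7, 95.5],
})

# The function sees each group's sub-frame and must return a single True/False.
kept = toy.groupby('School').filter(lambda x: (x['Math'] > 32).all())
print(kept)
```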

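3. Transformation

(a) What is passed in

transform passes each column of each group to the function, and the returned value must have exactly the same length as that column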

grouped_single[['Math','Height']].transform(lambda x:x-x.min()).head()

	Math	Height
ID
1101	2.5	14
1102	1.0	33
1103	55.7	27
1104	48.9	8
1105	53.3	0

If a scalar is returned, it is broadcast to every element of the group

grouped_single[['Math','Height']].transform(lambda x:x.mean())

	Math	Height
ID
1101	63.746667	175.733333
1102	63.746667	175.733333
1103	63.746667	175.733333
1104	63.746667	175.733333
1105	63.746667	175.733333
1201	63.746667	175.733333
1202	63.746667	175.733333
1203	63.746667	175.733333
1204	63.746667	175.733333
1205	63.746667	175.733333
1301	63.746667	175.733333
1302	63.746667	175.733333
1303	63.746667	175.733333
1304	63.746667	175.733333
1305	63.746667	175.733333
2101	59.555000	172.950000
2102	59.555000	172.950000
2103	59.555000	172.950000
2104	59.555000	172.950000
2105	59.555000	172.950000
2201	59.555000	172.950000
2202	59.555000	172.950000
2203	59.555000	172.950000
2204	59.555000	172.950000
2205	59.555000	172.950000
2301	59.555000	172.950000
2302	59.555000	172.950000
2303	59.555000	172.950000
2304	59.555000	172.950000
2305	59.555000	172.950000
2401	59.555000	172.950000
2402	59.555000	172.950000
2403	59.555000	172.950000
2404	59.555000	172.950000
2405	59.555000	172.950000

(b) Within-group standardization with transform

grouped_single[['Math','Height']].transform(lambda x:(x-x.mean())/x.std()).head()

	Math	Height
ID
1101	-1.288991	-0.214991
1102	-1.353990	1.279460
1103	1.016287	0.807528
1104	0.721627	-0.686923
1105	0.912289	-1.316166
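
One way to sanity-check such a standardization is to regroup the z-scores and confirm that each group has mean close to 0 and sample standard deviation close to 1. The sketch below does this on hypothetical toy data (the frame `toy` is an assumption).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with three rows per school.
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_1', 'S_2', 'S_2', 'S_2'],
    'Math': [34.0, 32.5, 87.2, 83.3, 50.6, 52.5],
})

# Standardize within each school; std() uses ddof=1 by default.
z = toy.groupby('School')['Math'].transform(lambda x: (x - x.mean()) / x.std())

# Regrouping the z-scores should give mean ~0 and sample std ~1 in each group.
check = z.groupby(toy['School']).agg(['mean', 'std'])
print(check.round(6))
```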

(c) Filling missing values with the group mean using transform

df_nan = df[['Math','School']].copy().reset_index()
df_nan.loc[np.random.randint(0,df.shape[0],5),['Math']]=np.nan
df_nan

	ID	Math	School
0	1101	34.0	S_1
1	1102	32.5	S_1
2	1103	NaN	S_1
3	1104	80.4	S_1
4	1105	84.8	S_1
5	1201	NaN	S_1
6	1202	63.5	S_1
7	1203	58.8	S_1
8	1204	33.8	S_1
9	1205	68.4	S_1
10	1301	31.5	S_1
11	1302	NaN	S_1
12	1303	49.7	S_1
13	1304	85.2	S_1
14	1305	61.7	S_1
15	2101	NaN	S_2
16	2102	50.6	S_2
17	2103	52.5	S_2
18	2104	72.2	S_2
19	2105	34.2	S_2
20	2201	39.1	S_2
21	2202	68.5	S_2
22	2203	73.8	S_2
23	2204	47.2	S_2
24	2205	85.4	S_2
25	2301	72.3	S_2
26	2302	NaN	S_2
27	2303	65.9	S_2
28	2304	95.5	S_2
29	2305	48.9	S_2
30	2401	45.3	S_2
31	2402	48.7	S_2
32	2403	59.7	S_2
33	2404	67.7	S_2
34	2405	47.6	S_2

df_nan.groupby('School').transform(lambda x: x.fillna(x.mean())).join(df.reset_index()['School']).head()   # fill missing values with the group mean

	ID	Math	School
0	1101	34.000	S_1
1	1102	32.500	S_1
2	1103	57.025	S_1
3	1104	80.400	S_1
4	1105	84.800	S_1
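
An alternative that avoids the join is to transform only the column that needs filling. This is a sketch on hypothetical toy data (the frame `toy` and the column name `Math_filled` are assumptions), not the approach used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing score in each school.
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_1', 'S_2', 'S_2'],
    'Math': [30.0, np.nan, 50.0, 80.0, np.nan],
})

# transform('mean') broadcasts each group's mean (NaN ignored) back to every row.
group_mean = toy.groupby('School')['Math'].transform('mean')

# fillna aligns on the index, so each gap receives the mean of its own group.
toy['Math_filled'] = toy['Math'].fillna(group_mean)
print(toy)
```

Selecting the column before calling transform leaves the other columns untouched, so the grouping column does not have to be joined back afterwards.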

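IV. The apply Function

1. Flexibility of apply

apply is probably the most widely used of all grouping functions, thanks to its flexibility:

As for the input, the printed output below shows that each group is passed to apply as a sub-DataFrame: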

df.groupby('School').apply(lambda x:print(x.head(1)))

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1101    S_1   C_1      M  street_1     173      63  34.0      A+
     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
2101    S_2   C_1      M  street_7     174      84  83.3       C

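Much of apply's flexibility comes from the variety of values it can return:

① Scalar return values (here one scalar per column of each group)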

df[['School','Math','Height']].groupby('School').apply(lambda x:x.max())

	School	Math	Height
School
S_1	S_1	97.0	195
S_2	S_2	95.5	194

② List-like return values (same length as the group)

df[['School','Math','Height']].groupby('School').apply(lambda x:x-x.min()).head()

	Math	Height
ID
1101	2.5	14.0
1102	1.0	33.0
1103	55.7	27.0
1104	48.9	8.0
1105	53.3	0.0

③ DataFrame return values

df[['School','Math','Height']].groupby('School')\
    .apply(lambda x:pd.DataFrame({'col1':x['Math']-x['Math'].max(),
                                  'col2':x['Math']-x['Math'].min(),
                                  'col3':x['Height']-x['Height'].max(),
                                  'col4':x['Height']-x['Height'].min()})).head()

	col1	col2	col3	col4
ID
1101	-63.0	2.5	-22	14
1102	-64.5	1.0	-3	33
1103	-9.8	55.7	-9	27
1104	-16.6	48.9	-28	8
1105	-12.2	53.3	-36	0

2. Computing several statistics at once with apply

An OrderedDict offers a quick way to collect the statistics:

from collections import OrderedDict
def f(df):
    data = OrderedDict()
    data['M_sum'] = df['Math'].sum()
    data['W_var'] = df['Weight'].var()
    data['H_mean'] = df['Height'].mean()
    return pd.Series(data)
grouped_single.apply(f)         # returns a DataFrame

	M_sum	W_var	H_mean
School
S_1	956.2	117.428571	175.733333
S_2	1191.1	181.081579	172.950000

V. Questions and Exercises

1. Questions

[Question 1] In principle, rolling and expanding are both transform-like methods. How do they differ?
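
A small comparison may help with Question 1. In this sketch on a hypothetical five-element Series, rolling uses a window of fixed size that slides along the data, while expanding uses a window that starts at the first element and keeps growing.

```python
import pandas as pd

# Hypothetical data for illustration.
s = pd.Series([1, 2, 3, 4, 5])

rolled = s.rolling(window=2).sum()    # fixed-size window sliding along the data
expanded = s.expanding().sum()        # window grows from the first element onward

print(pd.DataFrame({'rolling': rolled, 'expanding': expanded}))
```

The first rolling value is NaN because the window of size 2 is not yet full, whereas expanding produces a value from the first row on.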

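[Question 2] What are forward and backward filling in fillna, and how are they done?

Forward and backward filling use ffill and bfill (in recent pandas releases the ffill() and bfill() methods are preferred over the method argument):
fillna(method='ffill')
fillna(method='bfill')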
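
The effect of the two directions, and of filling within groups, can be sketched as follows. The Series `s` and the grouping list `keys` are hypothetical values chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in the middle and at the end.
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()    # copy the last valid value forward
backward = s.bfill()   # copy the next valid value backward

# Within groups, GroupBy.ffill keeps values from leaking across group borders.
keys = ['a', 'a', 'b', 'b', 'b']
grouped_forward = s.groupby(keys).ffill()

print(pd.DataFrame({'forward': forward, 'backward': backward,
                    'grouped_forward': grouped_forward}))
```

The trailing gap stays NaN under backward filling because no later value exists, and the first element of group b stays NaN under grouped forward filling for the same reason in the other direction.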

[Question 3] What does the following code do? Design a groupby version of it.

s = pd.Series ([0, 1, 1, 0, 1, 1, 1, 0])
s1 = s.cumsum()   #([0,1,2,2,3,4,5,5])
result = s.mul(s1).diff().where(lambda x: x < 0).ffill().add(s1,fill_value =0)  # multiply, diff, keep only the negative jumps, forward-fill them, then add s1 back
s1
print(s.mul(s1))
print(s.mul(s1).diff())
result

0    0
1    1
2    2
3    0
4    3
5    4
6    5
7    0
dtype: int64
0    NaN
1    1.0
2    1.0
3   -2.0
4    3.0
5    1.0
6    1.0
7   -5.0
dtype: float64





0    0.0
1    1.0
2    2.0
3    0.0
4    1.0
5    2.0
6    3.0
7    0.0
dtype: float64
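
Judging from the output, the code counts, at each position, the length of the current run of consecutive 1s, resetting to 0 whenever a 0 appears. One possible groupby version is sketched below; the helper name `block_id` is an assumption, and other designs would work as well.

```python
import pandas as pd

s = pd.Series([0, 1, 1, 0, 1, 1, 1, 0])

# Every 0 starts a new block; the cumulative count of zeros labels the blocks.
block_id = (s == 0).cumsum()

# A cumulative sum inside each block counts the consecutive 1s seen so far.
run_length = s.groupby(block_id).cumsum()
print(run_length.tolist())   # → [0, 1, 2, 0, 1, 2, 3, 0]
```

Unlike the original chain, this version stays in integer dtype because it never introduces NaN.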

[Question 4] How do you compute the within-group 0.25 and 0.75 quantiles? Show them in the same table.

def R1(x):
    return np.percentile(x,25)
def R2(x):
    return np.percentile(x,75)
print(grouped_single['Math'].agg(percentile_25=pd.NamedAgg(column='col1',aggfunc=R1),
         percentile_75=pd.NamedAgg(column='col2', aggfunc=R2)))

        percentile_25  percentile_75
School                              
S_1             41.85         85.000
S_2             47.50         72.225
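
A shorter route to the same kind of table is the built-in `quantile` method followed by `unstack`. This is a sketch on hypothetical toy data (the frame `toy` is an assumption), shown as an alternative to the helper functions above.

```python
import pandas as pd

# Hypothetical toy data with four evenly spaced scores per school.
toy = pd.DataFrame({
    'School': ['S_1'] * 4 + ['S_2'] * 4,
    'Math': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
})

# quantile with a list returns a (group, quantile) MultiIndex; unstack turns it into a table.
quartiles = toy.groupby('School')['Math'].quantile([0.25, 0.75]).unstack()
print(quartiles)
```

The default interpolation is linear, which matches the default of `np.percentile`.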

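[Question 5] What do idxmax and nunique do, and what are they useful for?

idxmax returns the index label of the maximum value; nunique returns the number of distinct values.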
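
Both methods also work per group. The sketch below uses hypothetical toy data (the frame `toy` is an assumption) to show a typical use of each: idxmax to pull out the full row of each group's top scorer, and nunique to count distinct scores.

```python
import pandas as pd

# Hypothetical toy data indexed by student ID; S_1 contains a tied maximum.
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_1', 'S_2', 'S_2'],
    'Math': [34.0, 87.2, 87.2, 50.6, 83.3],
}, index=pd.Index([1101, 1102, 1103, 2101, 2102], name='ID'))

g = toy.groupby('School')['Math']
best_id = g.idxmax()      # index label of the first maximum in each group
n_distinct = g.nunique()  # number of distinct non-null values in each group

# The labels returned by idxmax can be fed to loc to get the complete rows.
top_rows = toy.loc[best_id.tolist()]
print(top_rows)
print(n_distinct)
```

When the maximum is tied, idxmax reports the first occurrence only.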

2. Exercises

[Exercise 1] Given a diamonds dataset whose columns record carat weight, color, depth and price, solve the following problems:

df=pd.read_csv('data/Diamonds.csv')
pd.read_csv('data/Diamonds.csv').head()

	carat	color	depth	price
0	0.23	E	61.5	326
1	0.21	E	59.8	326
2	0.23	E	56.9	327
3	0.29	I	62.4	334
4	0.31	J	63.3	335

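(a) Among all diamonds heavier than 1 carat, what is the range of the price?

(b) Using the 0.2/0.4/0.6/0.8 quantiles of depth as group boundaries, which color is most frequent in each group? Is that color also, on average within the group, the most expensive per unit weight?

(c) Grouping by weight into (0, 0.5], (0.5, 1], (1, 1.5], (1.5, 2] and (2, +∞), and sorting each group by increasing depth, what is the maximum length of a run of consecutive strictly increasing prices in each group?

(d) Group by color and compute the regression coefficients of price on carat for each group. (Simple linear regression with one variable, using only pandas and NumPy.)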

dd=df.loc[df['carat']>1]
dd['price'].max()-dd['price'].min()

display(df.head())

bins = df['depth'].quantile(np.linspace(0,1,6)).tolist()
cuts = pd.cut(df['depth'],bins=bins) # pass labels=... to attach custom labels

df['cuts'] = cuts
color_result=df.groupby('cuts')['color'].describe()
display(color_result)

df['均重价格']=df['price']/df['carat']   # 均重价格: price per carat
color_result['top'] == [i[1] for i in df.groupby(['cuts' ,'color'])['均重价格'].mean().groupby(['cuts']).idxmax().values]

	carat	color	depth	price	cuts	均重价格
0	0.23	E	61.5	326	(60.8, 61.6]	1417.391304
1	0.21	E	59.8	326	(43.0, 60.8]	1552.380952
2	0.23	E	56.9	327	(43.0, 60.8]	1421.739130
3	0.29	I	62.4	334	(62.1, 62.7]	1151.724138
4	0.31	J	63.3	335	(62.7, 79.0]	1080.645161

	count	unique	top	freq
cuts
(43.0, 60.8]	11294	7	E	2259
(60.8, 61.6]	11831	7	G	2593
(61.6, 62.1]	10403	7	G	2247
(62.1, 62.7]	10137	7	G	2193
(62.7, 79.0]	10273	7	G	2000

cuts
(43.0, 60.8]    False
(60.8, 61.6]    False
(61.6, 62.1]    False
(62.1, 62.7]     True
(62.7, 79.0]     True
Name: top, dtype: bool

df = df.drop(columns='均重价格')
cuts = pd.cut(df['carat'],bins=[0,0.5,1,1.5,2,np.inf]) # pass labels=... to attach custom labels
df['cuts'] = cuts

def f(nums):
    if not nums:        
        return 0
    res = 1                            
    cur_len = 1                        
    for i in range(1, len(nums)):      
        if nums[i-1] < nums[i]:        
            cur_len += 1                
            res = max(cur_len, res)     
        else:                       
            cur_len = 1                 
    return res          # length of the longest run of consecutive strictly increasing values

for name,group in df.groupby('cuts'):
    group = group.sort_values(by='depth')
    s = group['price']
    print(name,f(s.tolist()))

(0.0, 0.5] 8
(0.5, 1.0] 8
(1.0, 1.5] 7
(1.5, 2.0] 11
(2.0, inf] 7
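
No solution for part (d) is given. One possible approach, sketched here on hypothetical data rather than on Diamonds.csv, uses the closed-form least-squares estimates: the slope is the sum of cross-deviations divided by the sum of squared deviations of x, and the intercept follows from the means. The frame `toy` and the helper `ols_coefficients` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the diamonds data: price is exactly linear in carat per color.
toy = pd.DataFrame({
    'color': ['E'] * 4 + ['G'] * 4,
    'carat': [0.2, 0.4, 0.6, 0.8, 0.3, 0.5, 0.7, 0.9],
})
toy['price'] = np.where(toy['color'] == 'E',
                        100 + 2000 * toy['carat'],
                        50 + 3000 * toy['carat'])

def ols_coefficients(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    slope = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

# Fit one line per color group and collect the coefficients.
rows = {}
for color, group in toy.groupby('color'):
    rows[color] = ols_coefficients(group['carat'], group['price'])

coef = pd.DataFrame.from_dict(rows, orient='index', columns=['intercept', 'slope'])
print(coef.round(2))
```

Since the toy prices are exactly linear, the recovered coefficients should match the ones used to build the data, which serves as a check of the formula.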

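[Exercise 2] Given a dataset on illegal drugs in the United States from 2010 to 2017, whose columns record the year, state (5 states), county, substance name and number of reports, solve the following problems: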

df=pd.read_csv('data/Drugs.csv')
pd.read_csv('data/Drugs.csv').head()

	YYYY	State	COUNTY	SubstanceName	DrugReports
0	2010	VA	ACCOMACK	Propoxyphene	1
1	2010	OH	ADAMS	Morphine	9
2	2010	PA	ADAMS	Methadone	2
3	2010	VA	ALEXANDRIA CITY	Heroin	5
4	2010	PA	ALLEGHENY	Hydromorphone	5

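(a) For each year, which county has the most reports? Is the state that county belongs to also the state with the most reports that year?

(b) From 2014 to 2015, which state saw the largest increase in Heroin reports? Within that state, was Heroin the substance with the largest increase? If not, find the substance that was.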

idx=pd.IndexSlice
for i in range(2010,2018):
    county = (df.groupby(['COUNTY','YYYY']).sum().loc[idx[:,i],].idxmax()[0][0])
    state = df.query('COUNTY == "%s"'%county)['State'].iloc[0]
    state_true = df.groupby(['State','YYYY']).sum().loc[idx[:,i],].idxmax()[0][0]
    if state==state_true:
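        print('In %d, %s county had the most reports, and its state %s also had the most reports'%(i,county,state))
    else:
        print('In %d, %s county had the most reports, but its state %s did not; %s had the most reports'%(i,county,state,state_true))

In 2010, PHILADELPHIA county had the most reports, and its state PA also had the most reports
In 2011, PHILADELPHIA county had the most reports, but its state PA did not; OH had the most reports
In 2012, PHILADELPHIA county had the most reports, but its state PA did not; OH had the most reports
In 2013, PHILADELPHIA county had the most reports, but its state PA did not; OH had the most reports
In 2014, PHILADELPHIA county had the most reports, but its state PA did not; OH had the most reports
In 2015, PHILADELPHIA county had the most reports, but its state PA did not; OH had the most reports
In 2016, HAMILTON county had the most reports, and its state OH also had the most reports
In 2017, HAMILTON county had the most reports, and its state OH also had the most reports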

df_b = df[(df['YYYY'].isin([2014,2015]))&(df['SubstanceName']=='Heroin')]
df_add = df_b.groupby(['YYYY','State']).sum()
display((df_add.loc[2015]-df_add.loc[2014]).idxmax() ) 

df_b = df[(df['YYYY'].isin([2014,2015]))&(df['State']=='OH')]
df_add = df_b.groupby(['YYYY','SubstanceName']).sum()
display((df_add.loc[2015]-df_add.loc[2014]).idxmax()) # relies on index alignment between the two years
display((df_add.loc[2015]/df_add.loc[2014]).idxmax())
df_b.head()

DrugReports    OH
dtype: object



DrugReports    Heroin
dtype: object



DrugReports    Acetyl fentanyl
dtype: object

	YYYY	State	COUNTY	SubstanceName	DrugReports
10843	2014	OH	ADAMS	Buprenorphine	17
10844	2014	OH	ADAMS	Heroin	93
10851	2014	OH	ALLEN	Fentanyl	4
10852	2014	OH	ALLEN	Hydrocodone	40
10930	2014	OH	ADAMS	Morphine	2
