第3章 分组

第3章 分组

import numpy as np
import pandas as pd
df = pd.read_csv('data/table.csv',index_col='ID')
df.head()
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1102S_1C_1Fstreet_21927332.5B+
1103S_1C_1Mstreet_21868287.2B+
1104S_1C_1Fstreet_21678180.4B-
1105S_1C_1Fstreet_41596484.8B+
# import numpy as np
# import pandas as pd 
# df=pd.read_csv('data/table.csv',index_col='ID')
# df.head()

一、SAC过程

1. 内涵

SAC指的是分组操作中的split-apply-combine过程
其中split指基于某一些规则,将数据拆成若干组,apply是指对每一组独立地使用函数,combine指将每一组的结果组合成某一类数据结构

2. apply过程

在该过程中,我们实际往往会遇到四类问题:
整合(Aggregation)——即分组计算统计量(如求均值、求每组元素个数)
变换(Transformation)——即分组对每个单元的数据进行操作(如元素标准化)
过滤(Filtration)——即按照某些规则筛选出一些组(如选出组内某一指标小于50的组)
综合问题——即前面提及的三种问题的混合

二、groupby函数

1. 分组函数的基本内容:

(a)根据某一列分组
grouped_single = df.groupby('School')
grouped_single =df.groupby('School')
display(grouped_single)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000026FFBC23DA0>
经过groupby后会生成一个groupby对象,该对象本身不会返回任何东西,只有当相应的方法被调用才会起作用
例如取出某一个组:
grouped_single.get_group('S_1').head()
grouped_single.get_group('S_1')
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1102S_1C_1Fstreet_21927332.5B+
1103S_1C_1Mstreet_21868287.2B+
1104S_1C_1Fstreet_21678180.4B-
1105S_1C_1Fstreet_41596484.8B+
1201S_1C_2Mstreet_51886897.0A-
1202S_1C_2Fstreet_41769463.5B-
1203S_1C_2Mstreet_61605358.8A+
1204S_1C_2Fstreet_51626333.8B
1205S_1C_2Fstreet_61676368.4B-
1301S_1C_3Mstreet_41616831.5B+
1302S_1C_3Fstreet_11755787.7A-
1303S_1C_3Mstreet_71888249.7B
1304S_1C_3Mstreet_21957085.2A
1305S_1C_3Fstreet_51876961.7B-
(b)根据某几列分组
grouped_mul = df.groupby(['School','Class'])
grouped_mul.get_group(('S_2','C_4'))
grouped_mul=df.groupby(['School','Class'])
grouped_mul.get_group(('S_2','C_1'))
SchoolClassGenderAddressHeightWeightMathPhysics
ID
2101S_2C_1Mstreet_71748483.3C
2102S_2C_1Fstreet_61616150.6B+
2103S_2C_1Mstreet_41576152.5B-
2104S_2C_1Fstreet_51599772.2B+
2105S_2C_1Mstreet_41708134.2A
(c)组容量与组数
grouped_single.size()
# grouped_single.size()
School
S_1    15
S_2    20
dtype: int64
grouped_mul.size()
# grouped_mul.size()
School  Class
S_1     C_1      5
        C_2      5
        C_3      5
S_2     C_1      5
        C_2      5
        C_3      5
        C_4      5
dtype: int64
grouped_single.ngroups
grouped_single.ngroups
2
grouped_mul.ngroups
grouped_mul.ngroups
7
(d)组的遍历
for name,group in grouped_single:
    print(name)
    display(group.head())
# for name ,group in grouped_single:
#     print(name)
#     display(group.head())
S_1
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1102S_1C_1Fstreet_21927332.5B+
1103S_1C_1Mstreet_21868287.2B+
1104S_1C_1Fstreet_21678180.4B-
1105S_1C_1Fstreet_41596484.8B+
S_2
SchoolClassGenderAddressHeightWeightMathPhysics
ID
2101S_2C_1Mstreet_71748483.3C
2102S_2C_1Fstreet_61616150.6B+
2103S_2C_1Mstreet_41576152.5B-
2104S_2C_1Fstreet_51599772.2B+
2105S_2C_1Mstreet_41708134.2A
(e)level参数(用于多级索引)和axis参数
df.set_index(['Gender','School']).groupby(level=1,axis=0).get_group('S_1').head()
df.set_index(['Gender','School']).groupby(level=0).get_group('M')#.head()

ClassAddressHeightWeightMathPhysics
GenderSchool
MS_1C_1street_11736334.0A+
S_1C_1street_21868287.2B+
S_1C_2street_51886897.0A-
S_1C_2street_61605358.8A+
S_1C_3street_41616831.5B+
S_1C_3street_71888249.7B
S_1C_3street_21957085.2A
S_2C_1street_71748483.3C
S_2C_1street_41576152.5B-
S_2C_1street_41708134.2A
S_2C_2street_519310039.1B
S_2C_2street_41559173.8A+
S_2C_2street_11757447.2B-
S_2C_3street_51718832.7A
S_2C_3street_41877348.9B
S_2C_4street_71668248.7B

2. groupby对象的特点

(a)查看所有可调用的方法
由此可见,groupby对象可以使用相当多的函数,灵活程度很高
print([attr for attr in dir(grouped_single) if not attr.startswith('_')])
# print([attr for attr in dir(grouped_single) if not attr.startswith('_')])
['Address', 'Class', 'Gender', 'Height', 'Math', 'Physics', 'School', 'Weight', 'agg', 'aggregate', 'all', 'any', 'apply', 'backfill', 'bfill', 'boxplot', 'corr', 'corrwith', 'count', 'cov', 'cumcount', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'dtypes', 'expanding', 'ffill', 'fillna', 'filter', 'first', 'get_group', 'groups', 'head', 'hist', 'idxmax', 'idxmin', 'indices', 'last', 'mad', 'max', 'mean', 'median', 'min', 'ndim', 'ngroup', 'ngroups', 'nth', 'nunique', 'ohlc', 'pad', 'pct_change', 'pipe', 'plot', 'prod', 'quantile', 'rank', 'resample', 'rolling', 'sem', 'shift', 'size', 'skew', 'std', 'sum', 'tail', 'take', 'transform', 'tshift', 'var']
['Address', 'Class', 'Gender', 'Height', 'Math', 'Physics', 'School', 'Weight', 'agg', 'aggregate', 'all', 'any', 'apply', 'backfill', 'bfill', 'boxplot', 'corr', 'corrwith', 'count', 'cov', 'cumcount', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'dtypes', 'expanding', 'ffill', 'fillna', 'filter', 'first', 'get_group', 'groups', 'head', 'hist', 'idxmax', 'idxmin', 'indices', 'last', 'mad', 'max', 'mean', 'median', 'min', 'ndim', 'ngroup', 'ngroups', 'nth', 'nunique', 'ohlc', 'pad', 'pct_change', 'pipe', 'plot', 'prod', 'quantile', 'rank', 'resample', 'rolling', 'sem', 'shift', 'size', 'skew', 'std', 'sum', 'tail', 'take', 'transform', 'tshift', 'var']
(b)分组对象的head和first
对分组对象使用head函数,返回的是每个组的前几行,而不是数据集前几行
grouped_single.head(2)
grouped_single.head(1)
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
2101S_2C_1Mstreet_71748483.3C
first显示的是以分组为索引的每组的第一个分组信息
grouped_single.first()
grouped_single.first()
ClassGenderAddressHeightWeightMathPhysics
School
S_1C_1Mstreet_11736334.0A+
S_2C_1Mstreet_71748483.3C
(c)分组依据
对于groupby函数而言,分组的依据是非常自由的,只要是与数据框长度相同的列表即可,同时支持函数型分组
df.groupby(np.random.choice(['a','b','c'],df.shape[0])).get_group('a')#.head()
#相当于将np.random.choice(['a','b','c'],df.shape[0])当做新的一列进行分组
print(np.random.choice(['a','b','c'],df.shape[0]))
a=df.groupby(np.random.choice(['a','b','c'],df.shape[0]))

for name ,group in a:
    print(name )
    display(group)

a.size()
['a' 'b' 'b' 'a' 'c' 'b' 'c' 'b' 'b' 'b' 'b' 'c' 'c' 'a' 'a' 'b' 'b' 'a'
 'c' 'b' 'b' 'c' 'c' 'a' 'b' 'a' 'a' 'a' 'a' 'a' 'c' 'a' 'a' 'a' 'a']
a
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1103S_1C_1Mstreet_21868287.2B+
1104S_1C_1Fstreet_21678180.4B-
1105S_1C_1Fstreet_41596484.8B+
1203S_1C_2Mstreet_61605358.8A+
2101S_2C_1Mstreet_71748483.3C
2105S_2C_1Mstreet_41708134.2A
2301S_2C_3Fstreet_41577872.3B+
2304S_2C_3Fstreet_61648195.5A-
2402S_2C_4Mstreet_71668248.7B
2403S_2C_4Fstreet_61586059.7B+
2404S_2C_4Fstreet_21608467.7B
2405S_2C_4Fstreet_61935447.6B
b
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1102S_1C_1Fstreet_21927332.5B+
1204S_1C_2Fstreet_51626333.8B
1303S_1C_3Mstreet_71888249.7B
1304S_1C_3Mstreet_21957085.2A
1305S_1C_3Fstreet_51876961.7B-
2201S_2C_2Mstreet_519310039.1B
2204S_2C_2Mstreet_11757447.2B-
2303S_2C_3Fstreet_71909965.9C
c
SchoolClassGenderAddressHeightWeightMathPhysics
ID
1101S_1C_1Mstreet_11736334.0A+
1201S_1C_2Mstreet_51886897.0A-
1202S_1C_2Fstreet_41769463.5B-
1205S_1C_2Fstreet_61676368.4B-
1301S_1C_3Mstreet_41616831.5B+
1302S_1C_3Fstreet_11755787.7A-
2102S_2C_1Fstreet_61616150.6B+
2103S_2C_1Mstreet_41576152.5B-
2104S_2C_1Fstreet_51599772.2B+
2202S_2C_2Fstreet_71947768.5B+
2203S_2C_2Mstreet_41559173.8A+
2205S_2C_2Fstreet_71837685.4B
2302S_2C_3Mstreet_51718832.7A
2305S_2C_3Mstreet_41877348.9B
2401S_2C_4Fstreet_21926245.3A
a    12
b     8
c    15
dtype: int64
从原理上说,我们可以看到利用函数时,传入的对象就是索引,因此根据这一特性可以做一些复杂的操作
df[:5].groupby(lambda x:print(x)).head(5)
a=df[:5].groupby(pd.Series([2,1,1,4,5],index=[1105,1104,1103,1102,1101]))
# a.size()
for name ,group in a:
    print(name )
    display(group)
# df[:5].groupby(lambda x:x*2).head(5)
1101
1102
1103
1104
1105
display(df.iloc[0:5,0:5])
b=df.iloc[0:5,0:5].groupby([1,1,1,2,1],axis=1)
for name ,group in b:
    print(name )
    display(group)
SchoolClassGenderAddressHeight
ID
1101S_1C_1Mstreet_1173
1102S_1C_1Fstreet_2192
1103S_1C_1Mstreet_2186
1104S_1C_1Fstreet_2167
1105S_1C_1Fstreet_4159
1
SchoolClassGenderHeight
ID
1101S_1C_1M173
1102S_1C_1F192
1103S_1C_1M186
1104S_1C_1F167
1105S_1C_1F159
2
Address
ID
1101street_1
1102street_2
1103street_2
1104street_2
1105street_4
根据奇偶行分组
# df.groupby(lambda x :print(x))
df.index.get_loc(1102)
1
display(df.groupby(lambda x:'奇数行'  if not df.index.get_loc(x)%2==1 else '偶数行').groups)
df.groupby(lambda x:'奇数ID行' if  x%2==1 else '偶数ID行').groups
{'偶数行': Int64Index([1102, 1104, 1201, 1203, 1205, 1302, 1304, 2101, 2103, 2105, 2202,
             2204, 2301, 2303, 2305, 2402, 2404],
            dtype='int64', name='ID'),
 '奇数行': Int64Index([1101, 1103, 1105, 1202, 1204, 1301, 1303, 1305, 2102, 2104, 2201,
             2203, 2205, 2302, 2304, 2401, 2403, 2405],
            dtype='int64', name='ID')}





{'偶数ID行': Int64Index([1102, 1104, 1202, 1204, 1302, 1304, 2102, 2104, 2202, 2204, 2302,
             2304, 2402, 2404],
            dtype='int64', name='ID'),
 '奇数ID行': Int64Index([1101, 1103, 1105, 1201, 1203, 1205, 1301, 1303, 1305, 2101, 2103,
             2105, 2201, 2203, 2205, 2301, 2303, 2305, 2401, 2403, 2405],
            dtype='int64', name='ID')}
如果是多层索引,那么lambda表达式中的输入就是元组,下面实现的功能为查看两所学校中男女生分别均分是否及格
注意:此处只是演示groupby的用法,实际操作不会这样写
df.set_index(['Gender','School']).head()

ClassAddressHeightWeightMathPhysics
GenderSchool
MS_1C_1street_11736334.0A+
FS_1C_1street_21927332.5B+
MS_1C_1street_21868287.2B+
FS_1C_1street_21678180.4B-
S_1C_1street_41596484.8B+
df.set_index(['Gender','School']).sort_index().groupby(lambda x:print(x))
('F', 'S_1')
('F', 'S_1')
('F', 'S_1')
('F', 'S_1')
('F', 'S_1')
('F', 'S_1')
('F', 'S_1')
('F', 'S_1')
('F', 'S_2')
('F', 'S_2')
('F', 'S_2')
('F', 'S_2')
('F', 'S_2')
('F', 'S_2')
('F', 'S_2')
('F', 'S_2')
('F', 'S_2')
('F', 'S_2')
('F', 'S_2')
('M', 'S_1')
('M', 'S_1')
('M', 'S_1')
('M', 'S_1')
('M', 'S_1')
('M', 'S_1')
('M', 'S_1')
('M', 'S_2')
('M', 'S_2')
('M', 'S_2')
('M', 'S_2')
('M', 'S_2')
('M', 'S_2')
('M', 'S_2')
('M', 'S_2')
('M', 'S_2')





<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000026FFBE152E8>
math_score = df.set_index(['Gender','School'])['Math'].sort_index()
grouped_score = df.set_index(['Gender','School']).sort_index().\
            groupby(lambda x:(x,'均分及格' if math_score[x].mean()>=60 else '均分不及格'))
for name,_ in grouped_score:print(name)
for name ,group in grouped_score:
    print(name)
    display(group)
# math_score =df.set_index(['Gender','School'])['Math'].sort_index().groupby(lambda x:(x,'均分及格' if math_socre[x].mean()>60 else '均分不及格'))

(('F', 'S_1'), '均分及格')
(('F', 'S_2'), '均分及格')
(('M', 'S_1'), '均分及格')
(('M', 'S_2'), '均分不及格')
(('F', 'S_1'), '均分及格')
ClassAddressHeightWeightMathPhysics
GenderSchool
FS_1C_1street_21927332.5B+
S_1C_1street_21678180.4B-
S_1C_1street_41596484.8B+
S_1C_2street_41769463.5B-
S_1C_2street_51626333.8B
S_1C_2street_61676368.4B-
S_1C_3street_11755787.7A-
S_1C_3street_51876961.7B-
(('F', 'S_2'), '均分及格')
ClassAddressHeightWeightMathPhysics
GenderSchool
FS_2C_1street_61616150.6B+
S_2C_1street_51599772.2B+
S_2C_2street_71947768.5B+
S_2C_2street_71837685.4B
S_2C_3street_41577872.3B+
S_2C_3street_71909965.9C
S_2C_3street_61648195.5A-
S_2C_4street_21926245.3A
S_2C_4street_61586059.7B+
S_2C_4street_21608467.7B
S_2C_4street_61935447.6B
(('M', 'S_1'), '均分及格')
ClassAddressHeightWeightMathPhysics
GenderSchool
MS_1C_1street_11736334.0A+
S_1C_1street_21868287.2B+
S_1C_2street_51886897.0A-
S_1C_2street_61605358.8A+
S_1C_3street_41616831.5B+
S_1C_3street_71888249.7B
S_1C_3street_21957085.2A
(('M', 'S_2'), '均分不及格')
ClassAddressHeightWeightMathPhysics
GenderSchool
MS_2C_1street_71748483.3C
S_2C_1street_41576152.5B-
S_2C_1street_41708134.2A
S_2C_2street_519310039.1B
S_2C_2street_41559173.8A+
S_2C_2street_11757447.2B-
S_2C_3street_51718832.7A
S_2C_3street_41877348.9B
S_2C_4street_71668248.7B
math_score.tail()
print(math_score[('M','S_2')])
math_score[('M', 'S_2')].mean()
(M, S_2)    83.3
(M, S_2)    52.5
(M, S_2)    34.2
(M, S_2)    39.1
(M, S_2)    73.8
(M, S_2)    47.2
(M, S_2)    32.7
(M, S_2)    48.9
(M, S_2)    48.7
Name: Math, dtype: float64





51.155555555555544
(d)groupby的[]操作
可以用[]选出groupby对象的某个或者某几个列,上面的均分比较可以如下简洁地写出:
df.groupby(['Gender','School'])['Math'].mean()>=60
# df.groupby(['Gender','School'])['Math'].mean()>=60
Gender  School
F       S_1        True
        S_2        True
M       S_1        True
        S_2       False
Name: Math, dtype: bool
用列表可选出多个属性列:
df.groupby(['Gender','School'])[['Math','Height']].mean()
# df.groupby(['Gender','School'])[['Math']].mean()
# a=df.set_index(['Gender','School']).sort_index()[['Math']]
# a.query('(School=="S_1")and (Gender=="F")').mean()
MathHeight
GenderSchool
FS_164.100000173.125000
S_266.427273173.727273
MS_163.342857178.714286
S_251.155556172.000000
(e)连续型变量分组
例如利用cut函数对数学成绩分组:
bins = [0,40,60,80,90,100]
cuts = pd.cut(df['Math'],bins=bins) #可选label添加自定义标签
df.groupby(cuts)['Math'].count()
# bins=[0,40,60,80,90,100]
# cuts=pd.cut(df['Math'],bins=bins)
# df.groupby(cuts).size()
Math
(0, 40]       7
(40, 60]     10
(60, 80]      9
(80, 90]      7
(90, 100]     2
dtype: int64

三、聚合、过滤和变换

1. 聚合(Aggregation)

(a)常用聚合函数
所谓聚合就是把一堆数,变成一个标量,因此mean/sum/size/count/std/var/sem/describe/first/last/nth/min/max都是聚合函数
为了熟悉操作,不妨验证标准误sem函数,它的计算公式是: 组 内 标 准 差 组 容 量 \frac{组内标准差}{\sqrt{组容量}} ,下面进行验证:
group_m = grouped_single['Math']
display(group_m.std().values/np.sqrt(group_m.count().values)== group_m.sem().values)

group_m=grouped_single['Math']
display(group_m.std().values)
# np.sqrt()
display(np.sqrt(group_m.count().values))
# group_m.head()
group_m.std().values/np.sqrt(group_m.count().values)==group_m.sem().values
array([ True,  True])



array([23.07747407, 17.58930521])



array([3.87298335, 4.47213595])





array([ True,  True])
(b)同时使用多个聚合函数
group_m.agg(['sum','mean','std'])
group_m.agg(['sum','mean','std','sem','count'])
summeanstdsemcount
School
S_1956.263.74666723.0774745.95857815
S_21191.159.55500017.5893053.93308820
利用元组进行重命名
group_m.agg([('rename_sum','sum'),('rename_mean','mean')])

group_m.agg([('rename_sum','sum'),('rename_mean','mean')])
rename_sumrename_mean
School
S_1956.263.746667
S_21191.159.555000
指定哪些函数作用哪些列
grouped_mul.agg({'Math':['mean','max'],'Height':'var'})

grouped_mul.agg({'Math':['mean','max'],'Height':'var'})
MathHeight
meanmaxvar
SchoolClass
S_1C_163.7887.2183.3
C_264.3097.0132.8
C_363.1687.7179.2
S_2C_158.5683.354.7
C_262.8085.4256.0
C_363.0695.5205.7
C_453.8067.7300.2
(c)使用自定义函数
grouped_single['Math'].agg(lambda x:print(x.head(),x.count(),'间隔'))
#可以发现,agg函数的传入是分组逐列进行的,有了这个特性就可以做许多事情
# grouped_single['Math'].agg(lambda x:print(x.head(),x.count(),'间隔'))
1101    34.0
1102    32.5
1103    87.2
1104    80.4
1105    84.8
Name: Math, dtype: float64 15 间隔
2101    83.3
2102    50.6
2103    52.5
2104    72.2
2105    34.2
Name: Math, dtype: float64 20 间隔





School
S_1    None
S_2    None
Name: Math, dtype: object
官方没有提供极差计算的函数,但通过agg可以容易地实现组内极差计算
grouped_single['Math'].agg(lambda x:x.max()-x.min())

grouped_single['Math'].agg(lambda x:x.max()-x.min())
School
S_1    65.5
S_2    62.8
Name: Math, dtype: float64
(d)利用NamedAgg函数进行多个聚合
注意:不支持lambda函数,但是可以使用外置的def函数
def R1(x):
    return x.max()-x.min()
def R2(x):
    return x.max()-x.median()
grouped_single['Math'].agg(min_score1=pd.NamedAgg(column='col1', aggfunc=R1),
                           max_score1=pd.NamedAgg(column='col2', aggfunc='max'),
                           range_score2=pd.NamedAgg(column='col3', aggfunc=R2)).head()



display(grouped_single['Math'].head())

def R1(x):
    return x.max()-x.min()
def R2(x):
    return x.max()-x.median()
grouped_single['Math','Height'].agg(min_score1=pd.NamedAgg(column='Math',aggfunc=R1),
                          max_score1=pd.NamedAgg(column='Math',aggfunc='max'),
                          range_score2=pd.NamedAgg(column='Height',aggfunc=R2)).head()#
ID
1101    34.0
1102    32.5
1103    87.2
1104    80.4
1105    84.8
2101    83.3
2102    50.6
2103    52.5
2104    72.2
2105    34.2
Name: Math, dtype: float64


F:\dev\anaconda\envs\python35\lib\site-packages\ipykernel_launcher.py:17: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
min_score1max_score1range_score2
School
S_165.597.020.0
S_262.895.523.5
(e)带参数的聚合函数
判断是否组内数学分数至少有一个值在50-52之间:
def f(s,low,high):
    return s.between(low,high).max()
grouped_single['Math'].agg(f,50,52)

grouped_single['Math'].agg(lambda x:print(x.between(50,52)))
grouped_single['Math'].agg(lambda x:print(x.between(50,52).max()))
def f(s,low,high):
    return s.between(low,high).any()
grouped_single['Math'].agg(f,50,52)

1101    False
1102    False
1103    False
1104    False
1105    False
1201    False
1202    False
1203    False
1204    False
1205    False
1301    False
1302    False
1303    False
1304    False
1305    False
Name: Math, dtype: bool
2101    False
2102     True
2103    False
2104    False
2105    False
2201    False
2202    False
2203    False
2204    False
2205    False
2301    False
2302    False
2303    False
2304    False
2305    False
2401    False
2402    False
2403    False
2404    False
2405    False
Name: Math, dtype: bool
False
True





School
S_1    False
S_2     True
Name: Math, dtype: bool
如果需要使用多个函数,并且其中至少有一个带参数,则使用wrap技巧:
def f_test(s,low,high):
    return s.between(low,high).max()
def agg_f(f_mul,name,*args):#,**kwargs
    def wrapper(x):
        return f_mul(x,*args)#,**kwargs
    wrapper.__name__ = name
    return wrapper
new_f = agg_f(f_test,'at_least_one_in_50_52',50,52)
grouped_single['Math'].agg([new_f,'mean']).head()



# def f_test(s,low,high):
#     return s.between(low,high).max()
# def agg_f(f_mul,*args):
#     def wrapper(x):
#         return f_mul(x,*args)
#     return wrapper
# grouped_single['Math'].agg([agg_f(f_test,50,52),'mean'])

at_least_one_in_50_52mean
School
S_1False63.746667
S_2True59.555000

现在这段的目的就是我agg里面能够加带参数的函数,那么我们知道agg的传入x会传到agg_f(f_test,50,52)里面,那agg_f()的返回结果是个什么呢?是wrapper,那么wrapper返回的又是什么?是f_mul(x,50,52),这样就把外层的参数通过包裹传到了内层,并且最终agg传入的x会最终传入f_mul中的x,巧妙地利用agg_f中的args将数值传到f_mul中的args数值。

2. 过滤(Filteration)

filter函数是用来筛选某些组的(务必记住结果是组的全体),因此传入的值应当是布尔标量
grouped_single[['Math','Physics']].filter(lambda x:print((x['Math']>32).all())).head()
# grouped_single[['Math','Physics']].filter(lambda x:(x['Math']>32)).head()
# grouped_single[['Math','Physics']].agg(lambda x:print(x['Math']>32))
# grouped_single[['Math','Physics']].agg(lambda x:print(x.head(),x.count(),'间隔'))
grouped_single[['Math','Physics']].filter(lambda x:(x['Math']>32).all()).head()
False
True
MathPhysics
ID
210183.3C
210250.6B+
210352.5B-
210472.2B+
210534.2A
grouped_single[['Math','Physics']].filter(lambda x:(x['Math']>34).all()).head()
MathPhysics
ID

filter选的是组,所以组的所有人都成绩超过32(34)则返回True,否则返回False,注意True和False选的是组,所以32的时候因为两个组只有一个符合条件所以选出来一个,34都不符合,所以没有选出来的。

3. 变换(Transformation)

(a)传入对象
transform函数中传入的对象是组内的列,并且返回值需要与列长完全一致
grouped_single[['Math','Height']].agg(lambda x:print(x-x.min())).head()
1101     2.5
1102     1.0
1103    55.7
1104    48.9
1105    53.3
1201    65.5
1202    32.0
1203    27.3
1204     2.3
1205    36.9
1301     0.0
1302    56.2
1303    18.2
1304    53.7
1305    30.2
Name: Math, dtype: float64
2101    50.6
2102    17.9
2103    19.8
2104    39.5
2105     1.5
2201     6.4
2202    35.8
2203    41.1
2204    14.5
2205    52.7
2301    39.6
2302     0.0
2303    33.2
2304    62.8
2305    16.2
2401    12.6
2402    16.0
2403    27.0
2404    35.0
2405    14.9
Name: Math, dtype: float64
1101    14
1102    33
1103    27
1104     8
1105     0
1201    29
1202    17
1203     1
1204     3
1205     8
1301     2
1302    16
1303    29
1304    36
1305    28
Name: Height, dtype: int64
2101    19
2102     6
2103     2
2104     4
2105    15
2201    38
2202    39
2203     0
2204    20
2205    28
2301     2
2302    16
2303    35
2304     9
2305    32
2401    37
2402    11
2403     3
2404     5
2405    38
Name: Height, dtype: int64
MathHeight
School
S_1NoneNone
S_2NoneNone
grouped_single[['Math','Height']].transform(lambda x:x-x.min()).head()
grouped_single[['Math','Height']].transform(lambda x:x-x.min()).head()
MathHeight
ID
11012.514
11021.033
110355.727
110448.98
110553.30

如果返回了标量值,那么组内的所有元素会被广播为这个值
grouped_single[['Math','Height']].transform(lambda x:x.mean()).head()
# grouped_single[['Math','Height']].transform(lambda x:x.mean()).head()
MathHeight
ID
110163.746667175.733333
110263.746667175.733333
110363.746667175.733333
110463.746667175.733333
110563.746667175.733333
(b)利用变换方法进行组内标准化
grouped_single[['Math','Height']].transform(lambda x:(x-x.mean())/x.std()).head()
grouped_single[['Math','Height']].transform(lambda x:(x-x.mean())/x.std()).head()
MathHeight
ID
1101-1.288991-0.214991
1102-1.3539901.279460
11031.0162870.807528
11040.721627-0.686923
11050.912289-1.316166
(c)利用变换方法进行组内缺失值的均值填充
df_nan = df[['Math','School']].copy().reset_index()
df_nan.loc[np.random.randint(0,df.shape[0],25),['Math']]=np.nan
df_nan.head()

df_nan=df[['Math','School']].copy().reset_index()
df_nan.loc[np.random.randint(0,df.shape[0],25),['Math']]=np.nan
df_nan.head()
IDMathSchool
0110134.0S_1
11102NaNS_1
21103NaNS_1
3110480.4S_1
4110584.8S_1
df_nan.groupby('School').transform(lambda x: x.fillna(x.mean())).join(df.reset_index()['School']).head()
df_nan.groupby('School').transform(lambda x: x.fillna(x.mean())).join(df.reset_index()['School']).head()
IDMathSchool
0110168.214286S_1
1110268.214286S_1
2110387.200000S_1
3110480.400000S_1
4110568.214286S_1

四、apply函数

1. apply函数的灵活性

可能在所有的分组函数中,apply是应用最为广泛的,这得益于它的灵活性:
对于传入值而言,从下面的打印内容可以看到是以分组的表传入apply中:
df.groupby('School').apply(lambda x:print(x.head(5)))
     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1102    S_1   C_1      F  street_2     192      73  32.5      B+
1103    S_1   C_1      M  street_2     186      82  87.2      B+
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1105    S_1   C_1      F  street_4     159      64  84.8      B+
     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
2101    S_2   C_1      M  street_7     174      84  83.3       C
2102    S_2   C_1      F  street_6     161      61  50.6      B+
2103    S_2   C_1      M  street_4     157      61  52.5      B-
2104    S_2   C_1      F  street_5     159      97  72.2      B+
2105    S_2   C_1      M  street_4     170      81  34.2       A
apply函数的灵活性很大程度来源于其返回值的多样性:
① 标量返回值
df[['School','Math','Height']].groupby('School').apply(lambda x:x.max())
df[['School','Math','Height']].groupby('School').apply(lambda x:print(x,x.max()))
display(df[['School','Math','Height']].groupby('School').agg(lambda x:x.max()))
df[['School','Math','Height']].groupby('School').apply(lambda x:x.max())
     School  Math  Height
ID                       
1101    S_1  34.0     173
1102    S_1  32.5     192
1103    S_1  87.2     186
1104    S_1  80.4     167
1105    S_1  84.8     159
1201    S_1  97.0     188
1202    S_1  63.5     176
1203    S_1  58.8     160
1204    S_1  33.8     162
1205    S_1  68.4     167
1301    S_1  31.5     161
1302    S_1  87.7     175
1303    S_1  49.7     188
1304    S_1  85.2     195
1305    S_1  61.7     187 School    S_1
Math       97
Height    195
dtype: object
     School  Math  Height
ID                       
2101    S_2  83.3     174
2102    S_2  50.6     161
2103    S_2  52.5     157
2104    S_2  72.2     159
2105    S_2  34.2     170
2201    S_2  39.1     193
2202    S_2  68.5     194
2203    S_2  73.8     155
2204    S_2  47.2     175
2205    S_2  85.4     183
2301    S_2  72.3     157
2302    S_2  32.7     171
2303    S_2  65.9     190
2304    S_2  95.5     164
2305    S_2  48.9     187
2401    S_2  45.3     192
2402    S_2  48.7     166
2403    S_2  59.7     158
2404    S_2  67.7     160
2405    S_2  47.6     193 School     S_2
Math      95.5
Height     194
dtype: object
MathHeight
School
S_197.0195
S_295.5194
SchoolMathHeight
School
S_1S_197.0195
S_2S_295.5194
② 列表返回值
display(df[['School','Math','Height']].groupby('School').apply(lambda x:x-x.min()).head())
df[['School','Math','Height']].groupby('School').transform(lambda x:x-x.min()).head()
MathHeight
ID
11012.514.0
11021.033.0
110355.727.0
110448.98.0
110553.30.0
MathHeight
ID
11012.514
11021.033
110355.727
110448.98
110553.30
③ 数据框返回值
df[['School','Math','Height']].groupby('School')\
    .apply(lambda x:pd.DataFrame({'col1':x['Math']-x['Math'].max(),
                                  'col2':x['Math']-x['Math'].min(),
                                  'col3':x['Height']-x['Height'].max(),
                                  'col4':x['Height']-x['Height'].min()})).head()




df[['School','Math','Height']].groupby('School').apply(lambda x:pd.DataFrame({
    'col1':x['Math']-x['Math'].max(),
    'col2':x['Math']-x['Math'].min(),
    'col3':x['Height']-x['Height'].max(),
    'col4':x['Height']-x['Height'].min()
})).head()
col1col2col3col4
ID
1101-63.02.5-2214
1102-64.51.0-333
1103-9.855.7-927
1104-16.648.9-288
1105-12.253.3-360

2. 用apply同时统计多个指标

此处可以借助OrderedDict工具进行快捷的统计:
from collections import OrderedDict
def f(df):
    data = OrderedDict()
    data['M_sum'] = df['Math'].sum()
    data['W_var'] = df['Weight'].var()
    data['H_mean'] = df['Height'].mean()
    print('data',data)
    print('series')
    print(pd.Series(data))
    return pd.Series(data)
grouped_single.apply(f)



from collections import OrderedDict
def f(df):
    data=OrderedDict()
    data['M_sum']=df['Math'].sum()
    data['W_var']=df['Weight'].var()
    data['H_mean']=df['Height'].mean()
    return pd.Series(data)
grouped_single.apply(f)
data OrderedDict([('M_sum', 956.2000000000002), ('W_var', 117.42857142857143), ('H_mean', 175.73333333333332)])
series
M_sum     956.200000
W_var     117.428571
H_mean    175.733333
dtype: float64
data OrderedDict([('M_sum', 1191.1), ('W_var', 181.08157894736837), ('H_mean', 172.95)])
series
M_sum     1191.100000
W_var      181.081579
H_mean     172.950000
dtype: float64
M_sumW_varH_mean
School
S_1956.2117.428571175.733333
S_21191.1181.081579172.950000

五、问题与练习

1. 问题

【问题一】 什么是fillna的前向/后向填充,如何实现?
df = pd.read_csv('data/table.csv',index_col='ID')
df.head(3)
df_nan = df[['Math','School']].copy().reset_index()
df_nan.loc[np.random.randint(0,df.shape[0],25),['Math']]=np.nan
df_nan.head()
df_nan.Math=df_nan.Math.fillna(method='bfill')
df_nan.head()
IDMathSchool
0110134.0S_1
1110287.2S_1
2110387.2S_1
3110497.0S_1
4110597.0S_1

fillna 的method方法可以控制参数的填充方式,是向上填充:将缺失值填充为该列中它上一个未缺失值;向下填充相反

method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None

pad / ffill: 向下自动填充

backfill / bfill: 向上自动填充

【问题二】 下面的代码实现了什么功能?请仿照设计一个它的groupby版本。
s = pd.Series ([0, 1, 1, 0, 1, 1, 1, 0])
s1 = s.cumsum()
result = s.mul(s1).diff().where(lambda x: x < 0).ffill().add(s1,fill_value =0)

s1:将s序列求累加和 [0, 1, 2, 2, 3, 4, 5, 5]

s.mul(s1)😒 与s1累乘 [0, 1, 2, 0, 3, 4, 5, 0]

.diff() 求一阶差分 [nan, 1.0, 1.0, -2.0, 3.0, 1.0, 1.0, -5.0]

.where(lambda x: x < 0) 值是否小于0:[nan, nan, nan, -2.0, nan, nan, nan, -5.0]

.ffill():向下填充 [nan, nan, nan, -2.0, -2.0, -2.0, -2.0, -5.0]

.add(s1,fill_value =0) 缺失值补0后与s1求和 [0.0, 1.0, 2.0, 0.0, 1.0, 2.0, 3.0, 0.0]

list(s.mul(s1).diff().where(lambda x: x < 0).ffill().add(s1,fill_value =0))
gp =df.groupby('School')

gp.apply(lambda x:x['Math'].mul(x['Math'].cumsum()).diff().where(lambda m: m < 0).ffill().add(x['Math'].cumsum(),fill_value =0))
School  ID  
S_1     1101       34.00
        1102       66.50
        1103      153.70
        1104      234.10
        1105      318.90
        1201      415.90
        1202    -9421.00
        1203    -9362.20
        1204   -11740.56
        1205   -11672.16
        1301   -21966.61
        1302   -21878.91
        1303   -25585.41
        1304   -25500.21
        1305   -16257.66
S_2     2101       83.30
        2102      -29.65
        2103       22.85
        2104       95.05
        2105    -8364.36
        2201    -8325.26
        2202    -8256.76
        2203    -8182.96
        2204    -9864.48
        2205    -9779.08
        2301    -2042.69
        2302   -25111.27
        2303   -25045.37
        2304   -24949.87
        2305   -37377.81
        2401     -300.07
        2402     -251.37
        2403     -191.67
        2404     -123.97
        2405   -19527.49
Name: Math, dtype: float64
【问题三】 如何计算组内0.25分位数与0.75分位数?要求显示在同一张表上。
【问题四】 既然索引已经能够选出某些符合条件的子集,那么filter函数的设计有什么意义?
【问题五】 整合、变换、过滤三者在输入输出和功能上有何异同?
【问题六】 在带参数的多函数聚合时,有办法能够绕过wrap技巧实现同样功能吗?

问题三

gp.apply(lambda x:pd.DataFrame({'q25':x.quantile(0.25),
                                  'q75':x.quantile(0.75)
                                       }))
q25q75
School
S_1Height164.50187.500
Weight63.0077.000
Math41.8585.000
S_2Height159.75187.750
Weight70.2585.000
Math47.5072.225

问题四

filter函数是用来筛选组的,结果是组的全体

问题五

整合(Aggregation)分组计算统计量:输入的是每组数据,输出是每组的统计量,在列维度上是标量。

变换(Transformation)—即分组对每个单元的数据进行操作(如元素标准化):输入的是每组数据,输出是每组数据经过某种规则变换后的数据,不改变数据的维度。

过滤(Filtration)—即按照某些规则筛选出一些组:输入的是每组数据,输出的是满足要求的组的所有数据。

问题六

2. 练习

【练习一】: 现有一份关于diamonds的数据集,列分别记录了克拉数、颜色、开采深度、价格,请解决下列问题:
pd.read_csv('data/Diamonds.csv').head()
caratcolordepthprice
00.23E61.5326
10.21E59.8326
20.23E56.9327
30.29I62.4334
40.31J63.3335
(a) 在所有重量超过1克拉的钻石中,价格的极差是多少?
(b) 若以开采深度的0.2\0.4\0.6\0.8分位数为分组依据,每一组中钻石颜色最多的是哪一种?该种颜色是组内平均而言单位重量最贵的吗?
© 以重量分组(0-0.5,0.5-1,1-1.5,1.5-2,2+),按递增的深度为索引排序,求每组中连续的严格递增价格序列长度的最大值。
(d) 请按颜色分组,分别计算价格关于克拉数的回归系数。(单变量的简单线性回归,并只使用Pandas和Numpy完成)

a在所有重量超过1克拉的钻石中,价格的极差是多少?

df=pd.read_csv('data/Diamonds.csv')
df.head()
caratcolordepthprice
00.23E61.5326
10.21E59.8326
20.23E56.9327
30.29I62.4334
40.31J63.3335
a=df[df['carat']>1]
a['price'].max()-a['price'].min()
17561
df_r=df.query('carat>1')['price']
df_r.max()-df_r.min()
17561

b若以开采深度的0.2\0.4\0.6\0.8分位数为分组依据,每一组中钻石颜色最多的是哪一种?该种颜色是组内平均而言单位重量最贵的吗?

np.linspace(0,1,6)
array([0. , 0.2, 0.4, 0.6, 0.8, 1. ])
bins=df['depth'].quantile(np.linspace(0,1,6)).tolist()
df['cuts']=pd.cut(df['depth'],bins=bins)
color_result=df.groupby('cuts')['color'].describe()
color_result
countuniquetopfreq
cuts
(43.0, 60.8]112947E2259
(60.8, 61.6]118317G2593
(61.6, 62.1]104037G2247
(62.1, 62.7]101377G2193
(62.7, 79.0]102737G2000
color_result = df.groupby('cuts')['color'].describe()
color_result
countuniquetopfreq
cuts
(43.0, 60.8]112947E2259
(60.8, 61.6]118317G2593
(61.6, 62.1]104037G2247
(62.1, 62.7]101377G2193
(62.7, 79.0]102737G2000
df['均重价格']=df['price']/df['carat']
# color_result['top']=[]
color_result['top'].count()
for i in range(color_result['top'].count()):
    temp=color_result['top'].iloc[i]==df.groupby(['cuts','color'])['均重价格'].mean().groupby(['cuts']).idxmax().values[i][1]
    print(color_result.reset_index()['cuts'][i],temp)
# df.groupby(['cuts','color'])['均重价格'].mean().groupby(['cuts']).head(8)
(43.0, 60.8] False
(60.8, 61.6] False
(61.6, 62.1] False
(62.1, 62.7] True
(62.7, 79.0] True
df.groupby(['cuts','color'])['均重价格'].mean().head()
cuts          color
(43.0, 60.8]  D        4096.138305
              E        3929.625897
              F        4136.841550
              G        4295.283909
              H        4275.933161
Name: 均重价格, dtype: float64
df['均重价格']=df['price']/df['carat']
color_result['top']==[i[1] for i in df.groupby(['cuts','color'])['均重价格'].mean().groupby(['cuts']).idxmax().values]
cuts
(43.0, 60.8]    False
(60.8, 61.6]    False
(61.6, 62.1]    False
(62.1, 62.7]     True
(62.7, 79.0]     True
Name: top, dtype: bool

c以重量分组(0-0.5,0.5-1,1-1.5,1.5-2,2+),按递增的深度为索引排序,求每组中连续的严格递增价格序列长度的最大值。

# df = df.drop(columns='均重价格')
# cuts = pd.cut(df['carat'],bins=[0,0.5,1,1.5,2,np.inf]) #可选label添加自定义标签
# df['cuts'] = cuts
# df.head()


# df=df.drop(columns='均重价格')
cuts=pd.cut(df['carat'],bins=[0,0.5,1,1.5,2,np.inf])
type(cuts)
df['cuts']=cuts
df.head()
caratcolordepthpricecuts
00.23E61.5326(0.0, 0.5]
10.21E59.8326(0.0, 0.5]
20.23E56.9327(0.0, 0.5]
30.29I62.4334(0.0, 0.5]
40.31J63.3335(0.0, 0.5]
def f(nums):
    if not nums:        
        return 0
    res = 1                            
    cur_len = 1                        
    for i in range(1, len(nums)):      
        if nums[i-1] < nums[i]:        
            cur_len += 1                
            res = max(cur_len, res)     
        else:                       
            cur_len = 1                 
    return res
def f2(nums):
    if not nums:
        return 0
    res=1
    temp=1
    for i in range(len(nums)):
        if nums[i-1]>=nums[i]:
            res=max(res,temp)
            temp=1
        else:
            temp+=1
    return max(res,temp)
df.groupby(['cuts']).head()
caratcolordepthpricecuts
00.23E61.5326(0.0, 0.5]
10.21E59.8326(0.0, 0.5]
20.23E56.9327(0.0, 0.5]
30.29I62.4334(0.0, 0.5]
40.31J63.3335(0.0, 0.5]
900.70E62.52757(0.5, 1.0]
910.86E55.12757(0.5, 1.0]
920.70G61.62757(0.5, 1.0]
930.71E62.42759(0.5, 1.0]
940.78G63.82759(0.5, 1.0]
1721.17J60.22774(1.0, 1.5]
2151.01F61.82781(1.0, 1.5]
2411.01E64.52788(1.0, 1.5]
2421.01H62.72788(1.0, 1.5]
2471.05J63.22789(1.0, 1.5]
20241.52E57.33105(1.5, 2.0]
20251.52E57.33105(1.5, 2.0]
39261.51G64.03497(1.5, 2.0]
39551.52H64.93504(1.5, 2.0]
41281.52I61.23541(1.5, 2.0]
122462.06J61.25203(2.0, inf]
130022.14J69.45405(2.0, inf]
131182.15J65.55430(2.0, inf]
137572.22J66.75607(2.0, inf]
139912.01I67.45696(2.0, inf]
for name,group in df.groupby('cuts'):
    group = group.sort_values(by='depth')
    s = group['price']
    print(name,f(s.tolist()))
    
    
# for name,group in df.groupby(['cuts']):
#     group=group.sort_values(by='depth')
#     s=group['price']
#     print(name,f(s.tolist()))
(0.0, 0.5] 8
(0.5, 1.0] 8
(1.0, 1.5] 7
(1.5, 2.0] 11
(2.0, inf] 7
for name,group in df.groupby(['cuts']):
    group=group.sort_values(by='depth')
    s=group['price']
    print(name,f2(s.tolist()))
(0.0, 0.5] 8
(0.5, 1.0] 8
(1.0, 1.5] 7
(1.5, 2.0] 11
(2.0, inf] 7

d请按颜色分组,分别计算价格关于克拉数的回归系数。(单变量的简单线性回归,并只使用Pandas和Numpy完成)

for name,group in df[['carat','price','color']].groupby('color'):
    L1 = np.array([np.ones(group.shape[0]),group['carat']])#.reshape(2,group.shape[0])
#     print(np.shape(L1))
#     print(L1)
    L2 = group['price']
#     print('L2',np.shape(L2))
#     print(L2)
    result = (np.linalg.inv(L1.dot(L1.T)).dot(L1)).dot(L2).reshape(2,1)
    print('当颜色为%s时,截距项为:%f,回归系数为:%f'%(name,result[0],result[1]))
    
    
    
# for name,group in df[['carat','price','color']].groupby('color'):
#     L1=np.array([np.ones(group.shape[0]),group['carat']])
#     L2=group['price']
#     result=np.linalg.inv(L1.dot(L1.T)).dot(L1).dot(L2)
#     print('当颜色为%s时,截距项为:%f,回归系数为:%f'%(name ,result[0],result[1]))
    
# for name,group in df[['carat','price','color']].groupby('color'):
#     L1 = np.array([np.ones(group.shape[0]),group['carat']])#.reshape(2,group.shape[0])
# #     print(np.shape(L1))
# #     print(L1)
#     L2 = group['price']
# #     print('L2',np.shape(L2))
# #     print(L2)
#     result = np.linalg.inv(L1.dot(L1.T)).dot(L1).dot(L2)#.reshape(2,1)
#     print('当颜色为%s时,截距项为:%f,回归系数为:%f'%(name,result[0],result[1]))
当颜色为D时,截距项为:-2361.017152,回归系数为:8408.353126
当颜色为E时,截距项为:-2381.049600,回归系数为:8296.212783
当颜色为F时,截距项为:-2665.806191,回归系数为:8676.658344
当颜色为G时,截距项为:-2575.527643,回归系数为:8525.345779
当颜色为H时,截距项为:-2460.418046,回归系数为:7619.098320
当颜色为I时,截距项为:-2878.150356,回归系数为:7761.041169
当颜色为J时,截距项为:-2920.603337,回归系数为:7094.192092
当颜色为D时,截距项为:-2361.017152,回归系数为:8408.353126
当颜色为E时,截距项为:-2381.049600,回归系数为:8296.212783
当颜色为F时,截距项为:-2665.806191,回归系数为:8676.658344
当颜色为G时,截距项为:-2575.527643,回归系数为:8525.345779
当颜色为H时,截距项为:-2460.418046,回归系数为:7619.098320
当颜色为I时,截距项为:-2878.150356,回归系数为:7761.041169
当颜色为J时,截距项为:-2920.603337,回归系数为:7094.192092
for name,group in df[['carat','price','color']].groupby('color'):
    L1 = np.array([np.ones(group.shape[0]),group['carat']])#.reshape(2,group.shape[0])
#     print(np.shape(L1))
#     print(L1)
    L2 = group['price']
#     print('L2',np.shape(L2))
#     print(L2)
    result = np.linalg.inv(L1.dot(L1.T)).dot(L1).dot(L2)#.reshape(2,1)
    print('当颜色为%s时,截距项为:%f,回归系数为:%f'%(name,result[0],result[1]))
当颜色为D时,截距项为:-2361.017152,回归系数为:8408.353126
当颜色为E时,截距项为:-2381.049600,回归系数为:8296.212783
当颜色为F时,截距项为:-2665.806191,回归系数为:8676.658344
当颜色为G时,截距项为:-2575.527643,回归系数为:8525.345779
当颜色为H时,截距项为:-2460.418046,回归系数为:7619.098320
当颜色为I时,截距项为:-2878.150356,回归系数为:7761.041169
当颜色为J时,截距项为:-2920.603337,回归系数为:7094.192092
【练习二】:有一份关于美国10年至17年的非法药物数据集,列分别记录了年份、州(5个)、县、药物类型、报告数量,请解决下列问题:
pd.read_csv('data/Drugs.csv').head()
YYYYStateCOUNTYSubstanceNameDrugReports
02010VAACCOMACKPropoxyphene1
12010OHADAMSMorphine9
22010PAADAMSMethadone2
32010VAALEXANDRIA CITYHeroin5
42010PAALLEGHENYHydromorphone5
(a) 按照年份统计,哪个县的报告数量最多?这个县所属的州在当年也是报告数最多的吗?
(b) 从14年到15年,Heroin的数量增加最多的是哪一个州?它在这个州是所有药物中增幅最大的吗?若不是,请找出符合该条件的药物。

a按照年份统计,哪个县的报告数量最多?这个县所属的州在当年也是报告数最多的吗?

df = pd.read_csv('data/Drugs.csv')
df.head()
df=pd.read_csv('data/Drugs.csv')
df.head()
YYYYStateCOUNTYSubstanceNameDrugReports
02010VAACCOMACKPropoxyphene1
12010OHADAMSMorphine9
22010PAADAMSMethadone2
32010VAALEXANDRIA CITYHeroin5
42010PAALLEGHENYHydromorphone5
re=set()
for name ,group in df.groupby('YYYY'):
#     print(name)
    temp=group.groupby('COUNTY')['DrugReports'].sum()
    county=temp.idxmax()
    state=group[group['COUNTY']==county].iloc[0,1]
    temp2=group.groupby('State')['DrugReports'].sum()
    state_max=temp2.idxmax()
    print(name,county,state==state_max)

2010 PHILADELPHIA True
2011 PHILADELPHIA False
2012 PHILADELPHIA False
2013 PHILADELPHIA False
2014 PHILADELPHIA False
2015 PHILADELPHIA False
2016 HAMILTON True
2017 HAMILTON True
idx=pd.IndexSlice
for i in range(2010,2018):
    county = (df.groupby(['COUNTY','YYYY']).sum().loc[idx[:,i],:].idxmax()[0][0])
    state = df.query('COUNTY == "%s"'%county)['State'].iloc[0]
    state_true = df.groupby(['State','YYYY']).sum().loc[idx[:,i],:].idxmax()[0][0]
    if state==state_true:
        print('在%d年,%s县的报告数最多,它所属的州%s也是报告数最多的'%(i,county,state))
    else:
        print('在%d年,%s县的报告数最多,但它所属的州%s不是报告数最多的,%s州报告数最多'%(i,county,state,state_true))
        
        
idx=pd.IndexSlice
for i in range(2010,2018):
    county=df.groupby(['COUNTY','YYYY']).sum().loc[idx[:,i],:].idxmax()[0][0]
    state=df.query('COUNTY=="%s"'%county)['State'].iloc[0]
    state_true=df.groupby(['State','YYYY']).sum().loc[idx[:,i,:],:].idxmax()[0][0]
    print(i,county,state==state_true)
在2010年,PHILADELPHIA县的报告数最多,它所属的州PA也是报告数最多的
在2011年,PHILADELPHIA县的报告数最多,但它所属的州PA不是报告数最多的,OH州报告数最多
在2012年,PHILADELPHIA县的报告数最多,但它所属的州PA不是报告数最多的,OH州报告数最多
在2013年,PHILADELPHIA县的报告数最多,但它所属的州PA不是报告数最多的,OH州报告数最多
在2014年,PHILADELPHIA县的报告数最多,但它所属的州PA不是报告数最多的,OH州报告数最多
在2015年,PHILADELPHIA县的报告数最多,但它所属的州PA不是报告数最多的,OH州报告数最多
在2016年,HAMILTON县的报告数最多,它所属的州OH也是报告数最多的
在2017年,HAMILTON县的报告数最多,它所属的州OH也是报告数最多的
2010 PHILADELPHIA True
2011 PHILADELPHIA False
2012 PHILADELPHIA False
2013 PHILADELPHIA False
2014 PHILADELPHIA False
2015 PHILADELPHIA False
2016 HAMILTON True
2017 HAMILTON True

b从14年到15年,Heroin的数量增加最多的是哪一个州?它在这个州是所有药物中增幅最大的吗?若不是,请找出符合该条件的药物。

df_b = df[(df['YYYY'].isin([2014,2015]))&(df['SubstanceName']=='Heroin')]
df_add = df_b.groupby(['YYYY','State']).sum()
(df_add.loc[2015]-df_add.loc[2014]).idxmax()


# df_b=df[(df['YYYY'].isin([2014,2015]))&(df['SubstanceName']=='Heroin')]
# df_add=df_b.groupby(['YYYY','State']).sum()
# (df_add.loc[2015]-df_add.loc[2014]).idxmax()
DrugReports    OH
dtype: object
for name ,group in df.groupby(['SubstanceName']):
    print(name)
    display(group)
3,4-Methylenedioxy U-47700
YYYYStateCOUNTYSubstanceNameDrugReports
207662017PAALLEGHENY3,4-Methylenedioxy U-477003
3-Fluorofentanyl
YYYYStateCOUNTYSubstanceNameDrugReports
179462016VAFRANKLIN3-Fluorofentanyl1
3-Methylfentanyl
YYYYStateCOUNTYSubstanceNameDrugReports
167612016OHALLEN3-Methylfentanyl5
168532016PABEAVER3-Methylfentanyl7
168582016OHBELMONT3-Methylfentanyl1
171232016WVBERKELEY3-Methylfentanyl1
171512016PACAMBRIA3-Methylfentanyl1
..................
236612017OHOTTAWA3-Methylfentanyl8
237172017OHMAHONING3-Methylfentanyl4
238012017OHVAN WERT3-Methylfentanyl7
238682017OHROSS3-Methylfentanyl1
240052017OHSTARK3-Methylfentanyl1

102 rows × 5 columns

4-Fluoroisobutyryl fentanyl
YYYYStateCOUNTYSubstanceNameDrugReports
167842016PABERKS4-Fluoroisobutyryl fentanyl2
167872016PABLAIR4-Fluoroisobutyryl fentanyl2
168372016PAALLEGHENY4-Fluoroisobutyryl fentanyl4
168702016PABRADFORD4-Fluoroisobutyryl fentanyl1
174142016VAALEXANDRIA CITY4-Fluoroisobutyryl fentanyl1
..................
237482017KYMONTGOMERY4-Fluoroisobutyryl fentanyl1
238172017VAWARREN4-Fluoroisobutyryl fentanyl1
239742017VAROANOKE4-Fluoroisobutyryl fentanyl1
240102017OHSUMMIT4-Fluoroisobutyryl fentanyl7
240182017OHTRUMBULL4-Fluoroisobutyryl fentanyl1

132 rows × 5 columns

4-Methylfentanyl
YYYYStateCOUNTYSubstanceNameDrugReports
222322017OHLUCAS4-Methylfentanyl5
ANPP
YYYYStateCOUNTYSubstanceNameDrugReports
35462011PADAUPHINANPP1
140602014PAPHILADELPHIAANPP1
155842015OHMONTGOMERYANPP1
167662016OHASHTABULAANPP1
177532016OHCUYAHOGAANPP20
194152016KYMONTGOMERYANPP1
199202016PAPHILADELPHIAANPP1
201302016OHSUMMITANPP1
209052017KYCLARKANPP2
209552017KYFAYETTEANPP5
210072017KYBOYDANPP3
210402017PACHESTERANPP1
211182017PADELAWAREANPP2
213082017OHCUYAHOGAANPP169
214272017OHGREENEANPP2
214842017PABUTLERANPP2
215722017VAFAIRFAXANPP1
216802017OHGUERNSEYANPP1
217242017PAJUNIATAANPP2
218632017OHLAWRENCEANPP2
218672017PALEBANONANPP1
219092017KYJEFFERSONANPP4
219642017PALUZERNEANPP2
221362017OHLAKEANPP57
221412017PALANCASTERANPP2
224872017PANORTHUMBERLANDANPP1
227532017PASCHUYLKILLANPP1
228592017WVRALEIGHANPP1
229752017VAPRINCE WILLIAMANPP1
234382017VASTAFFORDANPP2
235292017OHSUMMITANPP22
236162017OHMIAMIANPP2
236272017KYMONTGOMERYANPP4
236732017PAPHILADELPHIAANPP3
240602017PAYORKANPP1
Acetyl fentanyl
YYYYStateCOUNTYSubstanceNameDrugReports
115672014VACHESTERFIELDAcetyl fentanyl1
120672014PAELKAcetyl fentanyl1
123242014OHMARIONAcetyl fentanyl3
123462014OHMONTGOMERYAcetyl fentanyl1
128252014PAGREENEAcetyl fentanyl1
..................
239042017OHTRUMBULLAcetyl fentanyl13
239892017PASCHUYLKILLAcetyl fentanyl4
239942017OHSENECAAcetyl fentanyl4
240332017PAWASHINGTONAcetyl fentanyl2
240502017WVWOODAcetyl fentanyl1

265 rows × 5 columns

Acetylcodeine
YYYYStateCOUNTYSubstanceNameDrugReports
17612010WVOHIOAcetylcodeine5
19482010PAPHILADELPHIAAcetylcodeine1
31652011VAFAIRFAXAcetylcodeine1
104822013PAPHILADELPHIAAcetylcodeine1
130472014PAPHILADELPHIAAcetylcodeine2
193182016PAPHILADELPHIAAcetylcodeine3
Acetyldihydrocodeine
YYYYStateCOUNTYSubstanceNameDrugReports
84802012PAPHILADELPHIAAcetyldihydrocodeine1
162202015PAPHILADELPHIAAcetyldihydrocodeine1
194522016PAPHILADELPHIAAcetyldihydrocodeine3
Acryl fentanyl
YYYYStateCOUNTYSubstanceNameDrugReports
166702016OHASHLANDAcryl fentanyl3
171722016OHCLARKAcryl fentanyl1
176072016OHERIEAcryl fentanyl1
180432016OHCUYAHOGAAcryl fentanyl5
180802016OHFRANKLINAcryl fentanyl1
..................
239832017KYRUSSELLAcryl fentanyl1
240172017KYTRIMBLEAcryl fentanyl1
240262017OHVAN WERTAcryl fentanyl5
240362017OHWAYNEAcryl fentanyl5
240412017PAWESTMORELANDAcryl fentanyl2

112 rows × 5 columns

Alphaprodine
YYYYStateCOUNTYSubstanceNameDrugReports
16912010PAPHILADELPHIAAlphaprodine1
Benzylfentanyl
YYYYStateCOUNTYSubstanceNameDrugReports
200422017OHATHENSBenzylfentanyl1
205622017OHBUTLERBenzylfentanyl10
207672017PAALLEGHENYBenzylfentanyl36
209062017OHCLARKBenzylfentanyl2
209502017VAFAIRFAXBenzylfentanyl1
210682017OHCUYAHOGABenzylfentanyl72
212692017OHFRANKLINBenzylfentanyl2
218772017OHHAMILTONBenzylfentanyl1
219272017OHLAKEBenzylfentanyl18
237502017OHMONTGOMERYBenzylfentanyl5
238102017OHWARRENBenzylfentanyl4
Buprenorphine
YYYYStateCOUNTYSubstanceNameDrugReports
122010OHASHTABULABuprenorphine7
212010KYBATHBuprenorphine1
432010OHBUTLERBuprenorphine15
602010VACHARLOTTESVILLE CITYBuprenorphine1
682010PACLEARFIELDBuprenorphine11
..................
240242017PAUNIONBuprenorphine3
240272017VAVIRGINIA BEACH CITYBuprenorphine2
240422017VAWESTMORELANDBuprenorphine2
240442017KYWHITLEYBuprenorphine15
240512017WVWOODBuprenorphine4

2524 rows × 5 columns

Butorphanol
YYYYStateCOUNTYSubstanceNameDrugReports
24252010VAVIRGINIA BEACH CITYButorphanol3
29792010VAPETERSBURG CITYButorphanol1
80822012VAWYTHEButorphanol1
83722012VASCOTTButorphanol1
98242013PAMERCERButorphanol1
163652015OHSCIOTOButorphanol1
Butyryl fentanyl
YYYYStateCOUNTYSubstanceNameDrugReports
129282014PAMONROEButyryl fentanyl1
135742015VAALBEMARLEButyryl fentanyl1
138402015PAALLEGHENYButyryl fentanyl1
150412015OHHAMILTONButyryl fentanyl6
154622015PALANCASTERButyryl fentanyl1
..................
234462017OHSUMMITButyryl fentanyl3
234532017OHTUSCARAWASButyryl fentanyl2
234752017WVWOODButyryl fentanyl3
234962017OHROSSButyryl fentanyl1
239202017PAWASHINGTONButyryl fentanyl17

70 rows × 5 columns

Carfentanil
YYYYStateCOUNTYSubstanceNameDrugReports
166592016OHALLENCarfentanil3
166812016PABEAVERCarfentanil4
168442016OHASHLANDCarfentanil11
168732016OHBROWNCarfentanil2
171022016PAALLEGHENYCarfentanil4
..................
238462017PAYORKCarfentanil4
238692017OHROSSCarfentanil3
239912017OHSCIOTOCarfentanil23
240212017OHTUSCARAWASCarfentanil7
240322017OHWASHINGTONCarfentanil5

216 rows × 5 columns

Codeine
YYYYStateCOUNTYSubstanceNameDrugReports
852010PADAUPHINCodeine3
862010KYDAVIESSCodeine1
1132010OHASHLANDCodeine1
1192010PABEAVERCodeine2
1242010OHBELMONTCodeine1
..................
239242017OHWAYNECodeine1
239352017KYWHITLEYCodeine1
239492017PAYORKCodeine3
239752017VAROANOKECodeine2
240572017VAWYTHECodeine1

1034 rows × 5 columns

Crotonyl fentanyl
YYYYStateCOUNTYSubstanceNameDrugReports
204922017OHBUTLERCrotonyl fentanyl4
206332017OHCLARKCrotonyl fentanyl1
206572017OHCUYAHOGACrotonyl fentanyl18
224292017OHMONTGOMERYCrotonyl fentanyl14
232122017OHWARRENCrotonyl fentanyl1
237422017OHMIAMICrotonyl fentanyl1
Cyclopentyl fentanyl
YYYYStateCOUNTYSubstanceNameDrugReports
219902017KYFAYETTECyclopentyl fentanyl1
Cyclopropyl fentanyl
YYYYStateCOUNTYSubstanceNameDrugReports
197452017OHADAMSCyclopropyl fentanyl6
197652017OHATHENSCyclopropyl fentanyl1
197712017KYBATHCyclopropyl fentanyl2
197812017KYBOYDCyclopropyl fentanyl3
202672016KYWARRENCyclopropyl fentanyl1
..................
238922017OHSTARKCyclopropyl fentanyl1
239422017KYWOODFORDCyclopropyl fentanyl1
239592017OHPREBLECyclopropyl fentanyl1
239922017KYSCOTTCyclopropyl fentanyl1
240222017OHUNIONCyclopropyl fentanyl1

89 rows × 5 columns

Cyclopropyl/Crotonyl Fentanyl
YYYYStateCOUNTYSubstanceNameDrugReports
213102017OHCUYAHOGACyclopropyl/Crotonyl Fentanyl21
230232017OHSUMMITCyclopropyl/Crotonyl Fentanyl5
Desmethylprodine 
YYYYStateCOUNTYSubstanceNameDrugReports
45422011KYJEFFERSONDesmethylprodine5
49482011OHWARRENDesmethylprodine3
50422011OHSHELBYDesmethylprodine1
52242011OHMONTGOMERYDesmethylprodine1
54672011VASCOTTDesmethylprodine2
75552012OHMIAMIDesmethylprodine1
75622012OHMONTGOMERYDesmethylprodine1
98342013OHMONTGOMERYDesmethylprodine1
Dextropropoxyphene
YYYYStateCOUNTYSubstanceNameDrugReports
92010PAARMSTRONGDextropropoxyphene1
3292010WVBERKELEYDextropropoxyphene1
6632010OHCLERMONTDextropropoxyphene1
6662010OHCLINTONDextropropoxyphene1
10382010WVHARDYDextropropoxyphene2
11492010WVJACKSONDextropropoxyphene1
11542010WVKANAWHADextropropoxyphene3
12402010OHGREENEDextropropoxyphene2
12482010OHHAMILTONDextropropoxyphene7
14422010WVHANCOCKDextropropoxyphene2
17332010WVMERCERDextropropoxyphene1
17742010WVPOCAHONTASDextropropoxyphene1
17842010WVRALEIGHDextropropoxyphene2
19392010WVOHIODextropropoxyphene2
20252010OHMIAMIDextropropoxyphene3
20342010OHMONTGOMERYDextropropoxyphene6
21012010OHSHELBYDextropropoxyphene2
21652010OHROSSDextropropoxyphene2
23482010OHUNIONDextropropoxyphene1
26362010OHWARRENDextropropoxyphene2
31942011OHHAMILTONDextropropoxyphene2
34152010PASCHUYLKILLDextropropoxyphene1
34652010WVWAYNEDextropropoxyphene1
40962011WVMERCERDextropropoxyphene1
41982011OHMONTGOMERYDextropropoxyphene1
44282011WVKANAWHADextropropoxyphene5
44392011PALAWRENCEDextropropoxyphene1
46992011WVGILMERDextropropoxyphene2
47962011WVUPSHURDextropropoxyphene1
48992011WVWOODDextropropoxyphene1
51982011WVMARIONDextropropoxyphene1
54692011OHSHELBYDextropropoxyphene1
56032011PAWESTMORELANDDextropropoxyphene1
62522012WVHARRISONDextropropoxyphene1
65872012OHHAMILTONDextropropoxyphene1
67282012WVMARIONDextropropoxyphene1
70492012WVKANAWHADextropropoxyphene1
75672012WVMORGANDextropropoxyphene1
90482013WVKANAWHADextropropoxyphene1
94322013WVMARIONDextropropoxyphene1
112822014PABEAVERDextropropoxyphene1
125012014OHHAMILTONDextropropoxyphene1
164892015PAYORKDextropropoxyphene1
176722016OHHAMILTONDextropropoxyphene3
187372016WVKANAWHADextropropoxyphene3
Dihydrocodeine
YYYYStateCOUNTYSubstanceNameDrugReports
5092010PAALLEGHENYDihydrocodeine1
16782010WVOHIODihydrocodeine1
26532010PAYORKDihydrocodeine1
33442011KYFRANKLINDihydrocodeine1
62002012VACHESTERFIELDDihydrocodeine1
82582012VAPRINCE WILLIAMDihydrocodeine1
112752014OHASHLANDDihydrocodeine1
123722014VAHENRICODihydrocodeine1
177722016VAFAIRFAXDihydrocodeine1
196222016OHLAKEDihydrocodeine1
211342017VAFAIRFAXDihydrocodeine1
223712017OHLICKINGDihydrocodeine2
Dihydromorphone
YYYYStateCOUNTYSubstanceNameDrugReports
772010PACRAWFORDDihydromorphone1
Fentanyl
YYYYStateCOUNTYSubstanceNameDrugReports
312010WVBOONEFentanyl1
362010OHBROWNFentanyl1
722010PACOLUMBIAFentanyl1
782010PACRAWFORDFentanyl2
982010VAFAIRFAXFentanyl2
..................
239872017OHSANDUSKYFentanyl33
240252017PAUNIONFentanyl9
240392017VAWAYNESBORO CITYFentanyl1
240452017VAWINCHESTER CITYFentanyl4
240542017OHWYANDOTFentanyl3

1484 rows × 5 columns

Fluorobutyryl fentanyl 
YYYYStateCOUNTYSubstanceNameDrugReports
187112016WVHARRISONFluorobutyryl fentanyl1
230182017OHSTARKFluorobutyryl fentanyl22
Fluorofentanyl
YYYYStateCOUNTYSubstanceNameDrugReports
181362016KYGRAYSONFluorofentanyl1
189732016OHMONTGOMERYFluorofentanyl2
Fluoroisobutyryl fentanyl
YYYYStateCOUNTYSubstanceNameDrugReports
182102016OHHAMILTONFluoroisobutyryl fentanyl2
200642017PABLAIRFluoroisobutyryl fentanyl4
206582017OHCUYAHOGAFluoroisobutyryl fentanyl55
209472017PAERIEFluoroisobutyryl fentanyl2
210112017PABRADFORDFluoroisobutyryl fentanyl1
210222017OHBUTLERFluoroisobutyryl fentanyl5
210742017PADAUPHINFluoroisobutyryl fentanyl7
210832017PADELAWAREFluoroisobutyryl fentanyl1
212092017VACHESTERFIELDFluoroisobutyryl fentanyl1
212492017PAELKFluoroisobutyryl fentanyl1
212972017OHCLERMONTFluoroisobutyryl fentanyl2
213462017PAFAYETTEFluoroisobutyryl fentanyl1
213832017VAFAIRFAXFluoroisobutyryl fentanyl5
214892017PACAMBRIAFluoroisobutyryl fentanyl1
215892017VAFREDERICKFluoroisobutyryl fentanyl3
216852017OHHAMILTONFluoroisobutyryl fentanyl34
217662017VALOUDOUNFluoroisobutyryl fentanyl1
218552017PALACKAWANNAFluoroisobutyryl fentanyl2
219422017PALEBANONFluoroisobutyryl fentanyl1
219672017PALUZERNEFluoroisobutyryl fentanyl5
220942017VAHENRICOFluoroisobutyryl fentanyl2
221562017PALEHIGHFluoroisobutyryl fentanyl1
226462017PAMERCERFluoroisobutyryl fentanyl3
228742017PAJUNIATAFluoroisobutyryl fentanyl1
228852017PALANCASTERFluoroisobutyryl fentanyl2
229972017OHSCIOTOFluoroisobutyryl fentanyl1
232502017PAWYOMINGFluoroisobutyryl fentanyl1
232752017OHSUMMITFluoroisobutyryl fentanyl3
233322017KYMONTGOMERYFluoroisobutyryl fentanyl1
235632017OHWAYNEFluoroisobutyryl fentanyl1
236262017PAMONROEFluoroisobutyryl fentanyl1
236492017PANORTHAMPTONFluoroisobutyryl fentanyl1
236562017KYOLDHAMFluoroisobutyryl fentanyl3
Furanyl fentanyl
YYYYStateCOUNTYSubstanceNameDrugReports
153052015OHFRANKLINFuranyl fentanyl1
161282015PANORTHAMPTONFuranyl fentanyl1
164992016VAALEXANDRIA CITYFuranyl fentanyl1
166772016OHAUGLAIZEFuranyl fentanyl1
167972016PABRADFORDFuranyl fentanyl4
..................
239502017PAYORKFuranyl fentanyl29
239572017VAPORTSMOUTH CITYFuranyl fentanyl1
240092017VASUFFOLK CITYFuranyl fentanyl2
240232017OHUNIONFuranyl fentanyl1
240342017PAWASHINGTONFuranyl fentanyl12

341 rows × 5 columns

Furanyl/3-Furanyl fentanyl
YYYYStateCOUNTYSubstanceNameDrugReports
190542016WVRALEIGHFuranyl/3-Furanyl fentanyl1
226652017WVMORGANFuranyl/3-Furanyl fentanyl2
230772017WVNICHOLASFuranyl/3-Furanyl fentanyl3
232462017WVWOODFuranyl/3-Furanyl fentanyl1
Heroin
YYYYStateCOUNTYSubstanceNameDrugReports
32010VAALEXANDRIA CITYHeroin5
72010VAAMELIAHeroin1
82010VAARLINGTONHeroin41
142010OHATHENSHeroin72
162010OHAUGLAIZEHeroin35
..................
239812017OHROSSHeroin67
240082017VASTAUNTON CITYHeroin1
240302017VAWARRENHeroin73
240562017PAWYOMINGHeroin20
240612017VAYORKHeroin48

2727 rows × 5 columns

Hydrocodeinone
YYYYStateCOUNTYSubstanceNameDrugReports
99532013PANORTHUMBERLANDHydrocodeinone1
99712013PALEHIGHHydrocodeinone1
149952015PAERIEHydrocodeinone1
Hydrocodone
YYYYStateCOUNTYSubstanceNameDrugReports
132010OHASHTABULAHydrocodone21
222010KYBATHHydrocodone7
232010PABEDFORDHydrocodone3
272010WVBERKELEYHydrocodone6
332010WVBRAXTONHydrocodone35
..................
239792017KYROCKCASTLEHydrocodone8
239932017KYSCOTTHydrocodone1
240282017KYWARRENHydrocodone11
240552017OHWYANDOTHydrocodone6
240582017VAWYTHEHydrocodone19

2979 rows × 5 columns

Hydromorphone
YYYYStateCOUNTYSubstanceNameDrugReports
42010PAALLEGHENYHydromorphone5
402010PABUCKSHydromorphone5
422010KYBULLITTHydromorphone2
632010VACHESTERFIELDHydromorphone5
652010KYCHRISTIANHydromorphone7
..................
237972017OHTUSCARAWASHydromorphone2
239382017VAWISEHydromorphone1
239722017VARICHMONDHydromorphone2
240002017VASMYTHHydromorphone1
240462017VAWINCHESTER CITYHydromorphone1

1191 rows × 5 columns

Isobutyryl fentanyl
YYYYStateCOUNTYSubstanceNameDrugReports
218792017OHHAMILTONIsobutyryl fentanyl1
233232017WVWOODIsobutyryl fentanyl3
Levorphanol
YYYYStateCOUNTYSubstanceNameDrugReports
108492014PAALLEGHENYLevorphanol1
173622016PAALLEGHENYLevorphanol1
MT-45
YYYYStateCOUNTYSubstanceNameDrugReports
99862013PALYCOMINGMT-451
190382016PAPHILADELPHIAMT-451
Meperidine
YYYYStateCOUNTYSubstanceNameDrugReports
242010VABEDFORDMeperidine1
3242010KYBARRENMeperidine1
6022010OHADAMSMeperidine1
6172010OHATHENSMeperidine2
7462010PABUCKSMeperidine4
..................
208802017PADELAWAREMeperidine1
218102017VAHENRYMeperidine1
221752017KYJEFFERSONMeperidine1
236362017VAMONTGOMERYMeperidine1
239552017VAPITTSYLVANIAMeperidine1

79 rows × 5 columns

Metazocine
YYYYStateCOUNTYSubstanceNameDrugReports
237802017PAPHILADELPHIAMetazocine1
Methadone
YYYYStateCOUNTYSubstanceNameDrugReports
22010PAADAMSMethadone2
172010OHAUGLAIZEMethadone1
342010KYBREATHITTMethadone13
392010VABUCKINGHAMMethadone2
412010PABUCKSMethadone32
..................
239442017KYWOODFORDMethadone2
239512017PAYORKMethadone6
239822017OHROSSMethadone2
240312017VAWARRENMethadone4
240382017PAWAYNEMethadone1

1795 rows × 5 columns

Methorphan
YYYYStateCOUNTYSubstanceNameDrugReports
812010OHCUYAHOGAMethorphan2
1962010VAFAIRFAXMethorphan1
1982010VAFAIRFAX CITYMethorphan2
2902010PADELAWAREMethorphan6
5102010PAALLEGHENYMethorphan2
..................
224902017KYOLDHAMMethorphan1
226762017KYMCCRACKENMethorphan1
227222017VAPOWHATANMethorphan1
228402017PAPHILADELPHIAMethorphan2
232222017PAWASHINGTONMethorphan1

134 rows × 5 columns

Methoxyacetyl fentanyl
YYYYStateCOUNTYSubstanceNameDrugReports
197482017PAADAMSMethoxyacetyl fentanyl2
198502017PAALLEGHENYMethoxyacetyl fentanyl58
200842017OHBUTLERMethoxyacetyl fentanyl5
203052017VAALBEMARLEMethoxyacetyl fentanyl1
203452017KYBOONEMethoxyacetyl fentanyl2
205142017PAARMSTRONGMethoxyacetyl fentanyl1
205192017OHASHLANDMethoxyacetyl fentanyl1
206052017KYCAMPBELLMethoxyacetyl fentanyl1
206352017OHCLARKMethoxyacetyl fentanyl5
208342017OHCARROLLMethoxyacetyl fentanyl1
208652017OHCUYAHOGAMethoxyacetyl fentanyl70
209402017PADELAWAREMethoxyacetyl fentanyl1
210472017KYCLARKMethoxyacetyl fentanyl1
212362017PADAUPHINMethoxyacetyl fentanyl1
212992017OHCLERMONTMethoxyacetyl fentanyl1
213322017PAERIEMethoxyacetyl fentanyl1
214212017VAGOOCHLANDMethoxyacetyl fentanyl1
214282017OHGREENEMethoxyacetyl fentanyl8
215672017OHERIEMethoxyacetyl fentanyl2
216582017KYJESSAMINEMethoxyacetyl fentanyl2
218242017PAHUNTINGDONMethoxyacetyl fentanyl1
218362017KYJEFFERSONMethoxyacetyl fentanyl2
219312017OHLAKEMethoxyacetyl fentanyl12
219512017OHLICKINGMethoxyacetyl fentanyl3
219692017PALUZERNEMethoxyacetyl fentanyl1
219912017KYFAYETTEMethoxyacetyl fentanyl5
220092017OHFRANKLINMethoxyacetyl fentanyl2
220452017OHHANCOCKMethoxyacetyl fentanyl3
220462017VAHANOVERMethoxyacetyl fentanyl3
220792017OHHAMILTONMethoxyacetyl fentanyl73
222722017OHFAIRFIELDMethoxyacetyl fentanyl1
223472017VAHENRYMethoxyacetyl fentanyl2
224972017KYOWENMethoxyacetyl fentanyl1
226012017OHLORAINMethoxyacetyl fentanyl2
227862017OHMEDINAMethoxyacetyl fentanyl2
229412017KYMASONMethoxyacetyl fentanyl2
230322017OHTRUMBULLMethoxyacetyl fentanyl2
230472017PAWASHINGTONMethoxyacetyl fentanyl10
232402017KYWOLFEMethoxyacetyl fentanyl1
233042017OHWASHINGTONMethoxyacetyl fentanyl1
233372017OHMONTGOMERYMethoxyacetyl fentanyl8
234022017OHROSSMethoxyacetyl fentanyl1
234352017PASNYDERMethoxyacetyl fentanyl2
234832017PAYORKMethoxyacetyl fentanyl1
235222017VASPOTSYLVANIAMethoxyacetyl fentanyl2
235532017OHWARRENMethoxyacetyl fentanyl1
239562017OHPORTAGEMethoxyacetyl fentanyl3
239672017WVRALEIGHMethoxyacetyl fentanyl2
239682017OHRICHLANDMethoxyacetyl fentanyl2
240112017OHSUMMITMethoxyacetyl fentanyl9
Mitragynine
YYYYStateCOUNTYSubstanceNameDrugReports
51592011KYKENTONMitragynine1
59762012OHCUYAHOGAMitragynine1
64052012KYDAVIESSMitragynine1
64562012PABERKSMitragynine3
67032012PALEHIGHMitragynine1
73202012PASCHUYLKILLMitragynine1
74512012OHMONTGOMERYMitragynine9
76472012KYMCCRACKENMitragynine1
80182013KYBULLITTMitragynine1
82372013OHBUTLERMitragynine1
93542013OHGREENEMitragynine3
96112013KYKENTONMitragynine5
96802013KYDAVIESSMitragynine2
100222013OHMONTGOMERYMitragynine35
106322013OHWARRENMitragynine2
119632014OHCUYAHOGAMitragynine2
125792014KYDAVIESSMitragynine2
126632014KYKENTONMitragynine2
127372014WVMONONGALIAMitragynine2
127572014VANORFOLK CITYMitragynine3
129322014OHMONTGOMERYMitragynine1
129872014KYMADISONMitragynine1
140272014OHMIAMIMitragynine2
146162015PACOLUMBIAMitragynine1
148372015OHCUYAHOGAMitragynine1
153452015OHHOCKINGMitragynine1
154492015KYJEFFERSONMitragynine1
155852015OHMONTGOMERYMitragynine1
157292015KYMCCRACKENMitragynine1
159472015OHLAKEMitragynine1
159732015KYLYONMitragynine1
170032015KYWARRENMitragynine1
171632016KYCARROLLMitragynine1
175682016KYFAYETTEMitragynine2
179892016KYHOPKINSMitragynine1
180462016OHCUYAHOGAMitragynine8
183022016OHGREENEMitragynine1
185362016KYKENTONMitragynine1
185972016OHMIAMIMitragynine1
187342016KYJEFFERSONMitragynine29
192092016KYNELSONMitragynine1
193292016VAPORTSMOUTH CITYMitragynine1
194202016OHMONTGOMERYMitragynine5
195712016KYWARRENMitragynine1
195732016OHWARRENMitragynine3
196232016OHLAKEMitragynine9
211392017KYFAYETTEMitragynine1
215382017OHCUYAHOGAMitragynine29
218372017KYJEFFERSONMitragynine1
220242017OHGEAUGAMitragynine1
221382017OHLAKEMitragynine2
223342017WVHANCOCKMitragynine1
238132017OHWARRENMitragynine3
238942017OHSUMMITMitragynine3
Morphine
YYYYStateCOUNTYSubstanceNameDrugReports
12010OHADAMSMorphine9
202010KYBARRENMorphine3
282010PABERKSMorphine5
352010KYBRECKINRIDGEMorphine1
592010KYCASEYMorphine1
..................
240032017VASTAFFORDMorphine6
240072017OHSTARKMorphine16
240162017PASUSQUEHANNAMorphine1
240432017VAWESTMORELANDMorphine3
240532017KYWOODFORDMorphine1

2102 rows × 5 columns

Nalbuphine
YYYYStateCOUNTYSubstanceNameDrugReports
27962011VACHESTERFIELDNalbuphine1
Opiates
YYYYStateCOUNTYSubstanceNameDrugReports
5892010OHDELAWAREOpiates2
5962010OHFAIRFIELDOpiates1
13002010OHLICKINGOpiates14
21362010OHPERRYOpiates1
45652011OHLICKINGOpiates10
74762012OHPERRYOpiates1
77592012OHLICKINGOpiates11
89662013OHCOSHOCTONOpiates1
95572013OHFAIRFIELDOpiates1
101832013OHLICKINGOpiates5
106642013OHPERRYOpiates1
156682015PAMONTGOMERYOpiates1
178472016OHCUYAHOGAOpiates9
201062016PASCHUYLKILLOpiates1
201472016PAWASHINGTONOpiates1
Opium
YYYYStateCOUNTYSubstanceNameDrugReports
6932010VAFAIRFAXOpium1
40132011PAPHILADELPHIAOpium1
43342011VALOUDOUNOpium1
72962012VAPRINCE WILLIAMOpium1
86822013VAFAIRFAXOpium14
103032013VALOUDOUNOpium1
103872013OHPORTAGEOpium1
116022014VAFAIRFAXOpium2
139922014VALOUDOUNOpium1
166292015VAWASHINGTONOpium1
182782016VALOUDOUNOpium2
212562017VAFAIRFAXOpium2
216622017KYKENTONOpium1
222432017VAMANASSAS CITYOpium1
230162017VASTAFFORDOpium1
Oxycodone
YYYYStateCOUNTYSubstanceNameDrugReports
52010KYALLENOxycodone15
102010OHASHLANDOxycodone45
192010WVBARBOUROxycodone1
252010KYBELLOxycodone148
302010KYBOONEOxycodone56
..................
239972017VASHENANDOAHOxycodone14
240142017VASURRYOxycodone2
240192017OHTRUMBULLOxycodone83
240402017WVWEBSTEROxycodone3
240472017VAWISEOxycodone23

3124 rows × 5 columns

Oxymorphone
YYYYStateCOUNTYSubstanceNameDrugReports
62010KYALLENOxymorphone1
112010OHASHLANDOxymorphone2
262010KYBELLOxymorphone3
462010PABUTLEROxymorphone2
832010OHCUYAHOGAOxymorphone4
..................
239962017OHSHELBYOxymorphone1
239982017VASHENANDOAHOxymorphone1
240042017VASTAFFORDOxymorphone22
240122017OHSUMMITOxymorphone4
240482017VAWISEOxymorphone2

1182 rows × 5 columns

Pentazocine
YYYYStateCOUNTYSubstanceNameDrugReports
6992010VAFAUQUIERPentazocine2
15642010KYJEFFERSONPentazocine3
18542010WVNICHOLASPentazocine1
20302010WVMINGOPentazocine1
27712011WVBRAXTONPentazocine1
28622011KYBULLITTPentazocine1
29542010KYMUHLENBERGPentazocine1
30732010VAWYTHEPentazocine1
33472011OHFRANKLINPentazocine1
35922011VAGREENSVILLEPentazocine1
37232011PADELAWAREPentazocine1
37892011KYJEFFERSONPentazocine2
42922011WVRALEIGHPentazocine1
50532011OHSUMMITPentazocine1
54682011VASCOTTPentazocine1
54982011KYWARRENPentazocine1
68862012KYJEFFERSONPentazocine1
73892012OHOTTAWAPentazocine1
75382012VATAZEWELLPentazocine2
75802012KYOHIOPentazocine2
76102012VAROANOKEPentazocine2
78052012OHMORROWPentazocine1
82082013OHASHLANDPentazocine1
83442013VABEDFORDPentazocine1
87592012PAWARRENPentazocine1
88982013PACRAWFORDPentazocine1
100012013OHMEDINAPentazocine1
103002013WVLOGANPentazocine1
104702013KYOHIOPentazocine1
106112013VASUFFOLK CITYPentazocine1
107362013WVWEBSTERPentazocine1
117752014OHCRAWFORDPentazocine1
118712014KYBARRENPentazocine1
125272014KYJEFFERSONPentazocine2
126622014WVKANAWHAPentazocine1
126862014OHLAWRENCEPentazocine1
133922014OHPICKAWAYPentazocine1
135242014VATAZEWELLPentazocine1
136212014OHWARRENPentazocine1
136352014OHWOODPentazocine1
138632015PABEAVERPentazocine1
145102015OHBUTLERPentazocine1
180112016VACARROLLPentazocine1
189822016OHMUSKINGUMPentazocine1
216682017KYKNOTTPentazocine1
222422017OHMAHONINGPentazocine1
232142017OHWARRENPentazocine1
Pethidine
YYYYStateCOUNTYSubstanceNameDrugReports
4002010PADELAWAREPethidine1
10582010KYKENTONPethidine1
14162010PAFAYETTEPethidine2
14662010KYJEFFERSONPethidine2
24182010KYTRIGGPethidine1
43372011VALOUDOUNPethidine1
45062011OHNOBLEPethidine1
53882011KYSIMPSONPethidine1
65402012PADAUPHINPethidine1
112832014PABEAVERPethidine1
152552015KYLIVINGSTONPethidine1
222052017PALANCASTERPethidine1
Phenyl fentanyl
YYYYStateCOUNTYSubstanceNameDrugReports
200302017PAALLEGHENYPhenyl fentanyl9
208682017OHCUYAHOGAPhenyl fentanyl4
210932017OHCOLUMBIANAPhenyl fentanyl1
211172017OHDELAWAREPhenyl fentanyl1
215692017OHERIEPhenyl fentanyl5
222012017OHLAKEPhenyl fentanyl26
222892017OHFRANKLINPhenyl fentanyl1
229112017OHLOGANPhenyl fentanyl1
230892017OHOTTAWAPhenyl fentanyl1
231982017OHTRUMBULLPhenyl fentanyl3
232302017OHWAYNEPhenyl fentanyl1
236122017OHMEDINAPhenyl fentanyl3
236552017WVOHIOPhenyl fentanyl1
239152017OHWARRENPhenyl fentanyl1
240132017OHSUMMITPhenyl fentanyl6
Propoxyphene
YYYYStateCOUNTYSubstanceNameDrugReports
02010VAACCOMACKPropoxyphene1
152010OHATHENSPropoxyphene1
182010OHAUGLAIZEPropoxyphene2
292010VABLANDPropoxyphene1
532010PACARBONPropoxyphene1
..................
222122017KYLAURELPropoxyphene1
225522017VARUSSELLPropoxyphene1
227002017OHMONTGOMERYPropoxyphene1
228872017PALANCASTERPropoxyphene1
235962017OHMAHONINGPropoxyphene1

335 rows × 5 columns

Remifentanil
YYYYStateCOUNTYSubstanceNameDrugReports
50992012PAALLEGHENYRemifentanil1
Tetrahydrofuran fentanyl
YYYYStateCOUNTYSubstanceNameDrugReports
175112016OHCLARKTetrahydrofuran fentanyl2
196602016OHMADISONTetrahydrofuran fentanyl1
218392017OHJEFFERSONTetrahydrofuran fentanyl1
229062017OHLICKINGTetrahydrofuran fentanyl1
232552017PAYORKTetrahydrofuran fentanyl1
Thebaine
YYYYStateCOUNTYSubstanceNameDrugReports
20062010WVMARIONThebaine1
23552010KYWARRENThebaine1
53592011PAPIKEThebaine2
143042015OHBUTLERThebaine2
232712017OHSTARKThebaine5
Tramadol
YYYYStateCOUNTYSubstanceNameDrugReports
1062010VAALLEGHANYTramadol1
1502010KYCARTERTramadol2
1582010VACLARKETramadol1
1732010PACRAWFORDTramadol1
1852010PADELAWARETramadol1
..................
240292017KYWARRENTramadol1
240352017VAWASHINGTONTramadol1
240492017OHWOODTramadol1
240522017WVWOODTramadol3
240592017VAWYTHETramadol5

1384 rows × 5 columns

U-47700
YYYYStateCOUNTYSubstanceNameDrugReports
167602016PAALLEGHENYU-4770019
168472016OHASHTABULAU-477005
168832016OHBUTLERU-477002
170422016VAARLINGTONU-477002
171842016PACLINTONU-477001
..................
238952017OHSUMMITU-4770041
239162017PAWARRENU-477004
239172017VAWARRENU-477001
240022017PASNYDERU-477001
240372017OHWAYNEU-477007

195 rows × 5 columns

U-48800
YYYYStateCOUNTYSubstanceNameDrugReports
201752017OHALLENU-488009
205052017PAALLEGHENYU-4880010
207852017OHAUGLAIZEU-488002
209352017OHDEFIANCEU-488001
212932017OHCLARKU-488001
216272017OHHARDINU-488001
219332017OHLAKEU-488005
226632017OHMORGANU-488001
228332017OHOTTAWAU-488001
235842017PALUZERNEU-488002
236332017OHMONTGOMERYU-488002
U-49900
YYYYStateCOUNTYSubstanceNameDrugReports
203252017OHASHTABULAU-499001
211512017OHFRANKLINU-499005
214472017OHHANCOCKU-499001
216452017OHHURONU-499002
217732017OHLUCASU-499004
232432017OHWOODU-499003
238962017OHSUMMITU-499001
U-51754
YYYYStateCOUNTYSubstanceNameDrugReports
211072017OHCUYAHOGAU-517547
Valeryl fentanyl
YYYYStateCOUNTYSubstanceNameDrugReports
166562016PAALLEGHENYValeryl fentanyl4
166842016PABERKSValeryl fentanyl1
171262016PABLAIRValeryl fentanyl1
181742016OHHURONValeryl fentanyl2
184482016PAFAYETTEValeryl fentanyl1
192342016PAPHILADELPHIAValeryl fentanyl1
197342016PAWESTMORELANDValeryl fentanyl2
199912016PAVENANGOValeryl fentanyl1
201072016PASCHUYLKILLValeryl fentanyl3
201812017PAARMSTRONGValeryl fentanyl1
204292016PAYORKValeryl fentanyl3
208032017PABLAIRValeryl fentanyl1
cis-3-methylfentanyl
YYYYStateCOUNTYSubstanceNameDrugReports
166552016PAALLEGHENYcis-3-methylfentanyl14
205032017PAALLEGHENYcis-3-methylfentanyl9
206142017KYCARTERcis-3-methylfentanyl2
o-Fluorofentanyl
YYYYStateCOUNTYSubstanceNameDrugReports
178932016OHGREENEo-Fluorofentanyl1
182272016OHHIGHLANDo-Fluorofentanyl1
188802016OHMARIONo-Fluorofentanyl1
190032016OHMONTGOMERYo-Fluorofentanyl1
192972016VANELSONo-Fluorofentanyl1
199752016VASTAFFORDo-Fluorofentanyl2
p-Fluorobutyryl fentanyl
YYYYStateCOUNTYSubstanceNameDrugReports
164812015OHWOODp-Fluorobutyryl fentanyl2
171572016PACARBONp-Fluorobutyryl fentanyl1
179442016OHFRANKLINp-Fluorobutyryl fentanyl1
182872016PALUZERNEp-Fluorobutyryl fentanyl1
183912016KYMADISONp-Fluorobutyryl fentanyl1
187412016PALACKAWANNAp-Fluorobutyryl fentanyl2
212122017VACHESTERFIELDp-Fluorobutyryl fentanyl2
217742017PALUZERNEp-Fluorobutyryl fentanyl4
237072017PALYCOMINGp-Fluorobutyryl fentanyl2
p-Fluorofentanyl
YYYYStateCOUNTYSubstanceNameDrugReports
177542016OHCUYAHOGAp-Fluorofentanyl6
183282016WVHARRISONp-Fluorofentanyl1
p-methoxybutyryl fentanyl
YYYYStateCOUNTYSubstanceNameDrugReports
174192016PAALLEGHENYp-methoxybutyryl fentanyl1
trans-3-Methylfentanyl
YYYYStateCOUNTYSubstanceNameDrugReports
168402016PAALLEGHENYtrans-3-Methylfentanyl14
204412017PAALLEGHENYtrans-3-Methylfentanyl7
211982017KYCARTERtrans-3-Methylfentanyl2
# for name ,group in df[df["YYYY"].isin([2014,2015])].groupby(['SubstanceName']):
#     df_b = group
#     df_add = df_b.groupby(['YYYY','State']).sum()
#     display(df_add)
#     a=(df_add.loc[2015,'DrugReports']-df_add.loc[2014,'DrugReports']).idxmax()
#     display(a)

df_b = df[(df['YYYY'].isin([2014,2015]))&(df['State']=='OH')]
df_add = df_b.groupby(['YYYY','SubstanceName']).sum()
display((df_add.loc[2015]-df_add.loc[2014]).idxmax()) #这里利用了索引对齐的特点
display((df_add.loc[2015]/df_add.loc[2014]).idxmax())
DrugReports    Heroin
dtype: object



DrugReports    Acetyl fentanyl
dtype: object

  • 2
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值