pandas Study Notes 3

坝坝头伯爵

Article link: https://blog.csdn.net/weixin_52703681/article/details/123189208

We first load the following DataFrame (only the first five rows are shown):

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1102    S_1   C_1      F  street_2     192      73  32.5      B+
1103    S_1   C_1      M  street_2     186      82  87.2      B+
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1105    S_1   C_1      F  street_4     159      64  84.8      B+
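
The notes do not show how the data is loaded. As a minimal sketch, the five rows printed above can be rebuilt by hand. The full dataset has 35 rows, so numeric results later in these notes will not match this subset.

```python
import pandas as pd

# Rebuild only the five rows printed above (an illustrative subset;
# the original data source is not given in these notes).
df = pd.DataFrame(
    {
        "School": ["S_1"] * 5,
        "Class": ["C_1"] * 5,
        "Gender": ["M", "F", "M", "F", "F"],
        "Address": ["street_1", "street_2", "street_2", "street_2", "street_4"],
        "Height": [173, 192, 186, 167, 159],
        "Weight": [63, 73, 82, 81, 64],
        "Math": [34.0, 32.5, 87.2, 80.4, 84.8],
        "Physics": ["A+", "B+", "B+", "B-", "B+"],
    },
    index=pd.Index([1101, 1102, 1103, 1104, 1105], name="ID"),
)
print(df)
```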

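I. The SAC process

SAC stands for split-apply-combine, the three steps of a grouped operation:
split: divide the data into groups
apply: run a function on each group independently
combine: merge the per-group results into a single data structure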
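
To make the three steps concrete, here is a rough sketch that performs them by hand and then with groupby. The `toy` DataFrame and all variable names are hypothetical and are not the dataset used in these notes.

```python
import pandas as pd

# Hypothetical toy data, not the dataset used in these notes.
toy = pd.DataFrame({"School": ["S_1", "S_1", "S_2", "S_2"],
                    "Math": [60.0, 80.0, 50.0, 70.0]})

# split: one sub-DataFrame per distinct School value
pieces = {key: toy[toy["School"] == key] for key in toy["School"].unique()}

# apply: compute the mean Math score of each piece independently
means = {key: piece["Math"].mean() for key, piece in pieces.items()}

# combine: assemble the per-group results into a single Series
manual = pd.Series(means, name="Math").rename_axis("School")

# groupby performs the same three steps in one call
auto = toy.groupby("School")["Math"].mean()
print(manual)
print(auto)
```

Both printed Series should hold the same per-school means; groupby simply hides the loop.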

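1. The apply step

The apply step typically involves one of four kinds of problem:
Aggregation: compute a summary statistic for each group
Transformation: operate on every element within each group
Filtration: keep or drop entire groups according to some rule
Combined problems: a mix of the three above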

II. The groupby function

1. Basics of grouping:

(a) Grouping by a single column

grouped_single = df.groupby('School')

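<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000014857E045B0>  # repr of the returned object

Calling groupby produces a GroupBy object. The object does not compute or display any result by itself; work happens only when one of its methods is called, for example: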

grouped_single.get_group('S_1').head()

    School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1102    S_1   C_1      F  street_2     192      73  32.5      B+
1103    S_1   C_1      M  street_2     186      82  87.2      B+
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1105    S_1   C_1      F  street_4     159      64  84.8      B+

(b) Grouping by several columns

grouped_mul = df.groupby(['School','Class'])
grouped_mul.get_group(('S_2','C_4'))

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
2401    S_2   C_4      F  street_2     192      62  45.3       A
2402    S_2   C_4      M  street_7     166      82  48.7       B
2403    S_2   C_4      F  street_6     158      60  59.7      B+
2404    S_2   C_4      F  street_2     160      84  67.7       B
2405    S_2   C_4      F  street_6     193      54  47.6       B

(c) Group sizes and number of groups

grouped_single.size()  # size of each group

School
S_1    15
S_2    20
dtype: int64

grouped_mul.size()  # size of each group

School  Class
S_1     C_1      5
        C_2      5
        C_3      5
S_2     C_1      5
        C_2      5
        C_3      5
        C_4      5
dtype: int64

grouped_single.ngroups  # number of groups

2

grouped_mul.ngroups  # number of groups

7

(d) Iterating over groups

for name, group in grouped_single:
    print(name)
    print(group.head())

S_1
     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1102    S_1   C_1      F  street_2     192      73  32.5      B+
1103    S_1   C_1      M  street_2     186      82  87.2      B+
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1105    S_1   C_1      F  street_4     159      64  84.8      B+
S_2
     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
2101    S_2   C_1      M  street_7     174      84  83.3       C
2102    S_2   C_1      F  street_6     161      61  50.6      B+
2103    S_2   C_1      M  street_4     157      61  52.5      B-
2104    S_2   C_1      F  street_5     159      97  72.2      B+
2105    S_2   C_1      M  street_4     170      81  34.2       A
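
When grouping by several columns, as grouped_mul does, the name yielded during iteration is a tuple of the key values. A minimal sketch on hypothetical toy data (the `toy` frame and `names` list are illustrative only):

```python
import pandas as pd

# Hypothetical toy data, not the dataset used in these notes.
toy = pd.DataFrame({"School": ["S_1", "S_1", "S_2"],
                    "Class": ["C_1", "C_2", "C_1"],
                    "Math": [60.0, 80.0, 50.0]})

names = []
for name, group in toy.groupby(["School", "Class"]):
    # name is a tuple such as ('S_1', 'C_1'); group holds that group's rows
    names.append(name)
    print(name, len(group))
```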

2. Characteristics of GroupBy objects

(a) Listing all callable methods

A GroupBy object exposes a large number of methods, which makes it very flexible.
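
One way to see what is available is to inspect the object with Python's built-in dir. This is only a sketch on hypothetical toy data; note that the list also contains column names and non-callable attributes, not just methods.

```python
import pandas as pd

# Hypothetical toy data; any DataFrame works for this inspection.
toy = pd.DataFrame({"School": ["S_1", "S_2"], "Math": [60.0, 50.0]})
grouped = toy.groupby("School")

# Keep only public attribute names (those not starting with an underscore).
public_names = [attr for attr in dir(grouped) if not attr.startswith("_")]
print(public_names)
```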

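(b) head and first on a GroupBy object

Calling head on a GroupBy object returns the first few rows of each group, not the first few rows of the whole dataset: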

grouped_single.head(2)

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1102    S_1   C_1      F  street_2     192      73  32.5      B+
2101    S_2   C_1      M  street_7     174      84  83.3       C
2102    S_2   C_1      F  street_6     161      61  50.6      B+

first returns, for each group, the first non-null value in every column, with the group key as the index:

grouped_single.first()

       Class Gender   Address  Height  Weight  Math Physics
School                                                     
S_1      C_1      M  street_1     173      63  34.0      A+
S_2      C_1      M  street_7     174      84  83.3       C
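
The difference between the two shows up when the first row of a group contains a missing value. The sketch below uses hypothetical toy data and assumes the default settings of first, under which missing values are skipped.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a missing value in the first row of group S_1.
toy = pd.DataFrame({"School": ["S_1", "S_1", "S_2"],
                    "Math": [np.nan, 80.0, 50.0]})
grouped = toy.groupby("School")

first_rows = grouped.head(1)    # first row of each group, original index kept
first_values = grouped.first()  # first non-null value per column, indexed by School
print(first_rows)
print(first_values)
```

Here head(1) keeps the row with the missing score for S_1, while first reports 80.0 for that school.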

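(d) The [] operator on a GroupBy object

Use [] to select one or several columns from a GroupBy object. For example, to check whether the mean Math score of each Gender and School group is at least 60: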

df.groupby(['Gender','School'])['Math'].mean()>=60

Gender  School
F       S_1        True
        S_2        True
M       S_1        True
        S_2       False
Name: Math, dtype: bool

Pass a list to select several columns:

df.groupby(['Gender','School'])[['Math','Height']].mean()

                    Math      Height
Gender School                       
F      S_1     64.100000  173.125000
       S_2     66.427273  173.727273
M      S_1     63.342857  178.714286
       S_2     51.155556  172.000000

(e) Grouping by a continuous variable

For example, use the cut function to bin the Math scores and then group by bin:

bins=[0,40,60,80,90,100]
cuts=pd.cut(df['Math'],bins=bins)
df.groupby(cuts, observed=False)['Math'].count()  # observed=False keeps empty bins in the result

Math
(0, 40]       7
(40, 60]     10
(60, 80]      9
(80, 90]      7
(90, 100]     2
Name: Math, dtype: int64
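
The interval labels above are right-closed, so a score that sits exactly on an edge belongs to the lower bin. A small sketch with hypothetical scores illustrates this; the `scores` values are invented for the example.

```python
import pandas as pd

# Hypothetical scores chosen to sit on and around the bin edges.
scores = pd.Series([40.0, 40.5, 60.0, 100.0], name="Math")
bins = [0, 40, 60, 80, 90, 100]

# By default pd.cut builds right-closed intervals: (0, 40], (40, 60], ...
binned = pd.cut(scores, bins=bins)

# observed=False keeps bins that received no scores, with a count of 0.
counts = scores.groupby(binned, observed=False).count()
print(counts)
```

With these bins a score of exactly 0 would fall outside every interval, because the left edge of the first bin is open.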

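III. Aggregation, filtration, and transformation

1. Aggregation

(a) Common aggregation functions

Aggregation reduces a set of values to a single scalar, so mean, sum, size, count, std, var, sem, first, last, nth, min, and max are all aggregation functions. describe is usually listed with them, although it returns several such statistics per group rather than one scalar.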
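
Any function that maps a group's values to one scalar can serve as an aggregation. As a sketch on hypothetical toy data, the score range (max minus min) of each group can be computed with agg and a lambda:

```python
import pandas as pd

# Hypothetical toy data, not the dataset used in these notes.
toy = pd.DataFrame({"School": ["S_1", "S_1", "S_2", "S_2"],
                    "Math": [60.0, 80.0, 50.0, 75.0]})

# The function receives one group's Math values as a Series
# and must return a single scalar for that group.
score_range = toy.groupby("School")["Math"].agg(lambda x: x.max() - x.min())
print(score_range)
```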

(b) Applying several aggregation functions at once

group_m = grouped_single['Math']
group_m.agg(['sum','mean','std'])

           sum       mean        std
School                              
S_1      956.2  63.746667  23.077474
S_2     1191.1  59.555000  17.589305

Use (new_name, function) tuples to rename the result columns:

group_m.agg([('rename_sum','sum'),('rename_mean','mean')])

        rename_sum  rename_mean
School                         
S_1          956.2    63.746667
S_2         1191.1    59.555000

Use a dict to specify which functions apply to which columns:

grouped_mul.agg({'Math':['mean','max'],'Height':'var'})

               Math       Height
               mean   max    var
School Class                    
S_1    C_1    63.78  87.2  183.3
       C_2    64.30  97.0  132.8
       C_3    63.16  87.7  179.2
S_2    C_1    58.56  83.3   54.7
       C_2    62.80  85.4  256.0
       C_3    63.06  95.5  205.7
       C_4    53.80  67.7  300.2
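
The dict form produces two-level column labels. If flat, custom column names are preferred, pandas also offers keyword-based "named aggregation". The sketch below assumes pandas 0.25 or later and uses hypothetical toy data and output names.

```python
import pandas as pd

# Hypothetical toy data, not the dataset used in these notes.
toy = pd.DataFrame({"School": ["S_1", "S_1", "S_2", "S_2"],
                    "Math": [60.0, 80.0, 50.0, 75.0],
                    "Height": [170, 180, 160, 175]})

# Each keyword is an output column name; each value is (input column, function).
summary = toy.groupby("School").agg(
    math_mean=("Math", "mean"),
    math_max=("Math", "max"),
    height_var=("Height", "var"),
)
print(summary)
```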

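2. Filtration

filter selects whole groups (remember that the result contains every row of each group that passes), so the function passed in must return a single boolean for each group.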

grouped_single[['Math','Physics']].filter(lambda x:(x['Math']>32).all()).head()

      Math Physics
ID                
2101  83.3       C
2102  50.6      B+
2103  52.5      B-
2104  72.2      B+
2105  34.2       A
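
The same rule can be tried on a small, self-contained frame. This is only a sketch with hypothetical toy data in which one school contains a score that fails the test:

```python
import pandas as pd

# Hypothetical toy data, not the dataset used in these notes.
toy = pd.DataFrame({"School": ["S_1", "S_1", "S_2", "S_2"],
                    "Math": [30.0, 80.0, 50.0, 75.0]})

# The function sees one group at a time and returns a single True/False.
# S_1 contains a score of 30, so the whole S_1 group is dropped.
kept = toy.groupby("School").filter(lambda g: (g["Math"] > 32).all())
print(kept)
```

Note that the S_1 row with a score of 80 is removed as well, because filter decides per group, not per row.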

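3. Transformation

(a) What is passed in

The function given to transform receives each column of each group, and its return value must have exactly the same length as that column.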

grouped_single[['Math','Height']].transform(lambda x:x-x.min()).head()

      Math  Height
ID                
1101   2.5      14
1102   1.0      33
1103  55.7      27
1104  48.9       8
1105  53.3       0

If the function returns a scalar, that value is broadcast to every element of the group:

grouped_single[['Math','Height']].transform(lambda x:x.mean()).head()

           Math      Height
ID                         
1101  63.746667  175.733333
1102  63.746667  175.733333
1103  63.746667  175.733333
1104  63.746667  175.733333
1105  63.746667  175.733333
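
A common use of transform is standardizing values within each group. The following is a minimal sketch on hypothetical toy data; the variable `z` and the data values are illustrative only.

```python
import pandas as pd

# Hypothetical toy data, not the dataset used in these notes.
toy = pd.DataFrame({"School": ["S_1", "S_1", "S_1", "S_2", "S_2", "S_2"],
                    "Math": [60.0, 70.0, 80.0, 40.0, 50.0, 90.0]})

# The lambda returns a Series as long as the group, so the result
# lines up row by row with the original DataFrame.
z = toy.groupby("School")["Math"].transform(lambda x: (x - x.mean()) / x.std())
print(z)
```

Because the result keeps the original index, it can be assigned straight back as a new column, for example toy["Math_z"] = z.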
