Pandas基础3.2｜Python学习笔记

最新推荐文章于 2023-12-21 16:12:00 发布

PenguinAsHeathen

最新推荐文章于 2023-12-21 16:12:00 发布

阅读量491

点赞数

分类专栏： Python学习笔记文章标签： python

本文链接：https://blog.csdn.net/m0_46384386/article/details/105810198

版权

Python学习笔记专栏收录该内容

32 篇文章 0 订阅

订阅专栏

import numpy as np
import pandas as pd
df = pd.read_csv('./data/table.csv',index_col = 'ID')
df.head()

	Unnamed: 0	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	0	S_1	C_1	M	street_1	173	63	34.0	A+
1102	1	S_1	C_1	F	street_2	192	73	32.5	B+
1103	2	S_1	C_1	M	street_2	186	82	87.2	B+
1104	3	S_1	C_1	F	street_2	167	81	80.4	B-
1105	4	S_1	C_1	F	street_4	159	64	84.8	B+

聚合、过滤和变换

聚合（Aggregation）

常用聚合函数

聚合：将一系列数变成一个标量。e.g. mean/sum/size/count/std/var/sem/describe/first/last/nth/min/max

#验证标准误sem函数
grouped_single = df.groupby('School')
group_m = grouped_single['Math']
group_m.std().values/np.sqrt(group_m.count().values)==group_m.sem().values

array([ True,  True])

同时使用多个聚合函数

group_m.agg(['sum','mean','std'])

	sum	mean	std
School
S_1	956.2	63.746667	23.077474
S_2	1191.1	59.555000	17.589305

利用元组进行重命名

group_m.agg([('rename_sum','sum'),('rename_mean','mean'),('rename_std','std')])

	rename_sum	rename_mean	rename_std
School
S_1	956.2	63.746667	23.077474
S_2	1191.1	59.555000	17.589305

指定哪些函数作用哪些列

grouped_mul = df.groupby(['School','Class'])
grouped_mul.agg({'Math':['mean','max'],'Height':'var'})

		Math		Height
		mean	max	var
School	Class
S_1	C_1	63.78	87.2	183.3
	C_2	64.30	97.0	132.8
	C_3	63.16	87.7	179.2
S_2	C_1	58.56	83.3	54.7
	C_2	62.80	85.4	256.0
	C_3	63.06	95.5	205.7
	C_4	53.80	67.7	300.2

自定义函数

grouped_single['Math'].agg(lambda x:print(x.head(),'间隔'))
#agg函数一个有用的特性：分组逐列进行agg函数的传入

1101    34.0
1102    32.5
1103    87.2
1104    80.4
1105    84.8
Name: Math, dtype: float64 间隔
2101    83.3
2102    50.6
2103    52.5
2104    72.2
2105    34.2
Name: Math, dtype: float64 间隔





School
S_1    None
S_2    None
Name: Math, dtype: object

RMK：官方没有提供计算极差的函数，可以用agg实现组内极差计算

grouped_single['Math'].agg(lambda x: x.max()-x.min())

School
S_1    65.5
S_2    62.8
Name: Math, dtype: float64

利用NamedAgg函数进行多个聚合

RMK：不支持lambda函数，但是可以使用外置的def函数

def R1(x):
    return x.max()-x.min()
def R2(x):
    return x.max()-x.median()
grouped_single['Math'].agg(min_score1=pd.NamedAgg(column='col1',aggfunc=R1),max_score1=pd.NamedAgg(column='col2',aggfunc='max'),range_score2=pd.NamedAgg(column='col3',aggfunc=R2)).head()

	min_score1	max_score1	range_score2
School
S_1	65.5	97.0	33.5
S_2	62.8	95.5	39.4

带参数的聚合函数

#判断是否组内数学分数至少有一个值在50-52之间
def f(s,low,high):
    return s.between(low,high).max()
grouped_single['Math'].agg(f,50,52)

School
S_1    False
S_2     True
Name: Math, dtype: bool

如果需要使用多个函数，并且其中至少有一个带参数，则使用wrap技巧

def f_test (s,low,high):
    return s.between(low,high).max()
def agg_f(f_mul,name,*args,**kwargs):
    def wrapper(x):
        return f_mul(x,*args,**kwargs)
    wrapper.__name__=name
    return wrapper
new_f = agg_f(f_test,'at_least_one_in_50_52',50,52)
grouped_single['Math'].agg([new_f,'mean']).head()

	at_least_one_in_50_52	mean
School
S_1	False	63.746667
S_2	True	59.555000

过滤（Filteration）

filter函数：用来筛选某些组（结果为组的全体），故传入的值是布尔标量

grouped_single[['Math','Physics']].filter(lambda x: (x['Math']>30).all()).head()

	Math	Physics
ID
1101	34.0	A+
1102	32.5	B+
1103	87.2	B+
1104	80.4	B-
1105	84.8	B+

变换（Transformation）

传入对象

Transform函数中出传入的对象是组内的列，并且返回值需要与列长完全一致

grouped_single[['Math','Height']].transform(lambda x : x-x.min()).head()

	Math	Height
ID
1101	2.5	14
1102	1.0	33
1103	55.7	27
1104	48.9	8
1105	53.3	0

若返回了标量值，则组内的所有元素都会被广播为这个值

grouped_single[['Math','Height']].transform(lambda x:x.mean()).head()

	Math	Height
ID
1101	63.746667	175.733333
1102	63.746667	175.733333
1103	63.746667	175.733333
1104	63.746667	175.733333
1105	63.746667	175.733333

利用变换方法进行组内标准化

grouped_single[['Math','Height']].transform(lambda x:(x-x.mean())/x.std()).head()

	Math	Height
ID
1101	-1.288991	-0.214991
1102	-1.353990	1.279460
1103	1.016287	0.807528
1104	0.721627	-0.686923
1105	0.912289	-1.316166

利用变换方法进行组内缺失值的均值填充

df_nan_1 = df[['Math','School']].copy()
df_nan_1.head()

	Math	School
ID
1101	34.0	S_1
1102	32.5	S_1
1103	87.2	S_1
1104	80.4	S_1
1105	84.8	S_1

df_nan = df_nan_1.reset_index()
df_nan.loc[np.random.randint(0,df.shape[0],25),['Math']] = np.nan
df_nan.head()

	ID	Math	School
0	1101	NaN	S_1
1	1102	NaN	S_1
2	1103	87.2	S_1
3	1104	NaN	S_1
4	1105	NaN	S_1

df_nan.groupby('School').transform(lambda x:x.fillna(x.mean())).join(df.reset_index()['School']).head()

	ID	Math	School
0	1101	83.225	S_1
1	1102	83.225	S_1
2	1103	87.200	S_1
3	1104	83.225	S_1
4	1105	83.225	S_1

apply函数

apply函数的灵活性

所有分组函数中，apply应用最为广泛

#对于传入值而言，从下面的打印内容可看出是以分组的表传入apply中
df.groupby('School').apply(lambda x:print(x.head(1)))

      Unnamed: 0 School Class Gender   Address  Height  Weight  Math Physics
ID                                                                          
1101           0    S_1   C_1      M  street_1     173      63  34.0      A+
      Unnamed: 0 School Class Gender   Address  Height  Weight  Math Physics
ID                                                                          
2101          15    S_2   C_1      M  street_7     174      84  83.3       C

apply函数的灵活性很大程度上来源于其返回值的多样性

标量返回值

df[['School','Math','Height']].groupby('School').apply(lambda x:x.max())

	School	Math	Height
School
S_1	S_1	97.0	195
S_2	S_2	95.5	194

列表返回值

df[['School','Math','Height']].groupby('School').apply(lambda x: x- x.min()).head()

	Math	Height
ID
1101	2.5	14.0
1102	1.0	33.0
1103	55.7	27.0
1104	48.9	8.0
1105	53.3	0.0

数据框返回值

df[['School','Math','Height']].groupby('School').apply(lambda x:pd.DataFrame({'col1':x['Math']-x['Math'].max()})).head()

	col1
ID
1101	-63.0
1102	-64.5
1103	-9.8
1104	-16.6
1105	-12.2

用apply同时统计多个指标

可借助OrderedDict工具进行快捷统计

from collections import OrderedDict
def f(df):
    data = OrderedDict()
    data['M_sum'] = df['Math'].sum()
    data['W_var'] = df['Weight'].var()
    data['H_mean'] = df['Height'].mean()
    return pd.Series(data)
grouped_single.apply(f)

	M_sum	W_var	H_mean
School
S_1	956.2	117.428571	175.733333
S_2	1191.1	181.081579	172.950000

PenguinAsHeathen

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Pandas基础3.2｜Python学习笔记

import numpy as npimport pandas as pddf = pd.read_csv('./data/table.csv',index_col = 'ID')df.head() Unnamed: 0 School Class Gender Address Hei...
复制链接

扫一扫