数据分析04-pandas(apply函数、排序、数据合、分组聚合、透视表、交叉表及项目分析)

最新推荐文章于 2023-11-09 16:00:45 发布

YEGE学AI算法

最新推荐文章于 2023-11-09 16:00:45 发布

阅读量1.1k

点赞数

分类专栏：数据分析文章标签：数据分析

本文链接：https://blog.csdn.net/yegeli/article/details/106993021

版权

数据分析专栏收录该内容

8 篇文章 4 订阅

订阅专栏

apply函数

pandas提供了apply函数方便的处理Series与DataFrame；apply函数支持逐一处理数据集中的每个元素都会执行一次目标函数，把返回值存入结果集中。：

# series.apply()
ary = np.array(['80公斤','83公斤','78公斤','74公斤','84公斤'])
s = pd.Series(ary)
def func(x):
    return x[:2]
s.apply(func)

# dataframe.apply()
def func(x):
    x[pd.isna(x)] = x.mean()
    return x
ratings.apply(func, axis=1)

排序

Pandas有两种排序方式，它们分别是按标签与按实际值排序。

import numpy as np

d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack','Lee','David','Gasper','Betina','Andres']),
 'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
 'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
unsorted_df = pd.DataFrame(d)

按标签（行）排序

使用sort_index()方法，通过传递axis参数和排序顺序，可以对DataFrame进行排序。默认情况下，按照升序对行标签进行排序。

# 按照行标进行排序
sorted_df=unsorted_df.sort_index()
print (sorted_df)
# 控制排序顺序
sorted_df = unsorted_df.sort_index(ascending=False)
print (sorted_df)

按标签（列）排序

# 按照列标签进行排序
sorted_df=unsorted_df.sort_index(axis=1)
print (sorted_df)

按某列值排序

像索引排序一样，sort_values()是按值排序的方法。它接受一个by参数，它将使用要与其排序值的DataFrame的列名称。

sorted_df = unsorted_df.sort_values(by='Age')
print (sorted_df)
# 先按Age进行升序排序，然后按Rating降序排序
sorted_df = unsorted_df.sort_values(by=['Age', 'Rating'], ascending=[True, False])
print (sorted_df)

数据合并

concat

concat函数是在pandas的方法，可以根据不同的轴合并数据集。

r = pd.concat(datas, axis=0, join='outer', ignore_index=False, 
              keys=['x', 'y', 'z'])

纵向合并：

在这里插入图片描述

横向合并：

在这里插入图片描述

merge & join

panda具有全功能、高性能的内存连接操作，与SQL之类的关系数据库非常相似。与其他开源实现相比，这些方法的性能要好得多(在某些情况下要好一个数量级以上)

pandas提供了merge函数实现高效的内存链接操作：

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,left_index=False, right_index=False)

参数名称	说明
left	接收DataFrame或Series。表示要添加的新数据。无默认。
right	接收DataFrame或Series。表示要添加的新数据。无默认。。
how	接收inner，outer，left，right。表示数据的连接方式。默认为inner。
on	接收string或sequence。表示外键字段名。默认为None。
left_on	接收string或sequence。关联操作时左表中的关联字段名。
right_on	接收string或sequence。关联操作时右表中的关联字段名。
left_index	接收boolean。表示是否将left参数接收数据的index作为连接主键。默认为False。
right_index	接收boolean。表示是否将right参数接收数据的index作为连接主键。默认为False。
sort	接收boolean。表示是否根据连接键对合并后的数据进行排序。默认为False。
suffixes	接收接收tuple。表示用于追加到left和right参数接收数据重叠列名的尾缀默认为(’_x’, ‘_y’)。

合并两个DataFrame：

import pandas as pd
left = pd.DataFrame({
         'student_id':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
         'student_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung', 'Billy', 'Brian', 'Bran', 'Bryce', 'Betty', 'Emma', 'Marry', 'Allen', 'Jean', 'Rose', 'David', 'Tom', 'Jack', 'Daniel', 'Andrew'],
         'class_id':[1,1,1,2,2,2,3,3,3,4,1,1,1,2,2,2,3,3,3,2], 
         'gender':['M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F'], 
         'age':[20,21,22,20,21,22,23,20,21,22,20,21,22,23,20,21,22,20,21,22], 
         'score':[98,74,67,38,65,29,32,34,85,64,52,38,26,89,68,46,32,78,79,87]})
right = pd.DataFrame(
         {'class_id':[1,2,3,5],
         'class_name': ['ClassA', 'ClassB', 'ClassC', 'ClassE']})
# 合并两个DataFrame
data = pd.merge(left,right)
print(data)

其他合并方法同数据库相同：

合并方法	SQL等效	描述
`left`	`LEFT OUTER JOIN`	使用左侧对象的键
`right`	`RIGHT OUTER JOIN`	使用右侧对象的键
`outer`	`FULL OUTER JOIN`	使用键的联合
`inner`	`INNER JOIN`	使用键的交集

实验：

# 合并两个DataFrame (左连接)
rs = pd.merge(left,right,on='subject_id', how='right')
print(rs)
# 合并两个DataFrame (左连接)
rs = pd.merge(left,right,on='subject_id', how='outer')
print(rs)
# 合并两个DataFrame (左连接)
rs = pd.merge(left,right,on='subject_id', how='inner')
print(rs)

分组聚合

pandas提供了功能类似于数据库中group by语句的用于拆分数据组的方法pd.groupby()；该方法提供的是分组聚合步骤中的拆分功能，能根据索引或字段对数据进行分组（Split）进而针对得到的多组数据执行聚合操作（Apply），最终合并为最终结果（Combine）。
在这里插入图片描述

分组

groupby方法的参数及其说明：

DataFrame.groupby(by=None, axis=0, as_index=True, sort=True)

参数名称	说明
by	接收list，string，mapping或generator。用于确定进行分组的依据。无默认。
axis	接收int。表示操作的轴向，默认对行进行操作。默认为0。
as_index	接收boolearn。表示聚合后的聚合标签是否以DataFrame索引形式输出。默认为True。
sort	接收boolearn。表示是否对分组依据分组标签进行排序。默认为True。

用groupby方法分组后的结果并不能直接查看，而是被存在内存中，输出的是内存地址。实际上分组后的数据对象（Groupby对象）类似Series与DataFrame，是pandas提供的一种对象。

Groupby对象的常用方法：

方法	说明
groupObject.get_group(‘A’)	返回A组的详细数据
groupObject.size()	返回每一组的频数

grouped = data.groupby(by=['class_id', 'gender'])
grouped.get_group((1, 'M'))
grouped = data.groupby(by=['class_id', 'gender'])
grouped.get_group((1, 'M'))

聚合

聚合函数为每个组返回聚合值。当创建了分组(groupby)对象，就可以对每个分组的其他字段数据执行求和、求标准差等操作。

使用聚合函数agg进行组内计算：

grouped = data.groupby(by='class_id')
grouped.agg({'score':np.mean})

对于某个字段希望只做求均值操作，而对另一个字段则希望只做求和操作，可以使用字典的方式，将两个字段名分别作为key：

grouped.agg({'age':np.max, 'score':np.mean})

还可以这样：

result = grouped.agg(
    {'age':np.max, 'score':[np.mean, np.max]})

pandas支持的聚合函数有：

方法名称	说明
count	计算分组的数目，包括缺失值。
head	返回每组的前n个值。
max	返回每组最大值。
mean	返回每组的均值。
median	返回每组的中位数。
cumcount	对每个分组中组员的进行标记，0至n-1。
size	返回每组的大小。
min	返回每组最小值。
std	返回每组的标准差。
sum	返回每组的和。

透视表与交叉表

透视表

透视表(pivot table)是各种电子表格程序和其他数据分析软件中一种常见的数据汇总工具。它根据一个或多个键对数据进行分组聚合，并根据每个分组进行数据汇总。

# 以class_id与gender做分组汇总数据，默认聚合统计所有列
print(data.pivot_table(index=['class_id', 'gender']))

# 以class_id与gender做分组汇总数据，聚合统计score列
print(data.pivot_table(index=['class_id', 'gender'], values=['score']))

# 以class_id与gender做分组汇总数据，聚合统计score列，针对age的每个值列级分组统计
print(data.pivot_table(index=['class_id', 'gender'], values=['score'], columns=['age']))

# 以class_id与gender做分组汇总数据，聚合统计score列，针对age的每个值列级分组统计，添加行、列小计
print(data.pivot_table(index=['class_id', 'gender'], values=['score'], columns=['age'], margins=True))

# 以class_id与gender做分组汇总数据，聚合统计score列，针对age的每个值列级分组统计，添加行、列小计
print(data.pivot_table(index=['class_id', 'gender'], values=['score'],  columns=['age'], margins=True, aggfunc='max'))

交叉表

交叉表(cross-tabulation, 简称crosstab)是一种用于计算分组频率的特殊透视表：

# 按照class_id分组，针对不同的gender，统计数量
print(pd.crosstab(data.class_id, data.gender, margins=True))

项目：分析影响学生成绩的因素

资源文件下载

https://download.csdn.net/download/yegeli/12562286
在我的资源中下载：StudentsPerformance.csv

字段	说明
gender	性别
race/ethnicity	种族
parental level of education	父母教育水平
lunch	午餐
test preparation course	是否通过预科考试
math score	数学得分
reading score	阅读得分
writing score	写作得分

学生成绩影响因素分析

import numpy as np
import pandas as pd

data = pd.read_csv('StudentsPerformance.csv')
data['total score'] = data.sum(axis=1)
# 参数number，object意为：统计数字列与字符串列
data.describe(include=['number', 'object'])

	gender	race/ethnicity	parental level of education	lunch	test preparation course	math score	reading score	writing score	total score
count	1000	1000	1000	1000	1000	1000.00000	1000.000000	1000.000000	1000.000000
unique	2	5	6	2	2	NaN	NaN	NaN	NaN
top	female	group C	some college	standard	none	NaN	NaN	NaN	NaN
freq	518	319	226	645	642	NaN	NaN	NaN	NaN
mean	NaN	NaN	NaN	NaN	NaN	66.08900	69.169000	68.054000	203.312000
std	NaN	NaN	NaN	NaN	NaN	15.16308	14.600192	15.195657	42.771978
min	NaN	NaN	NaN	NaN	NaN	0.00000	17.000000	10.000000	27.000000
25%	NaN	NaN	NaN	NaN	NaN	57.00000	59.000000	57.750000	175.000000
50%	NaN	NaN	NaN	NaN	NaN	66.00000	70.000000	69.000000	205.000000
75%	NaN	NaN	NaN	NaN	NaN	77.00000	79.000000	79.000000	233.000000
max	NaN	NaN	NaN	NaN	NaN	100.00000	100.000000	100.000000	300.000000

# 分析性别对学习成绩的影响（按性别分组）
data.pivot_table(index='gender')

	math score	reading score	total score	writing score
gender
female	63.633205	72.608108	208.708494	72.467181
male	68.728216	65.473029	197.512448	63.311203

总体来说，女生的成绩普遍比较好，但是男生更善于数学。

# 分析种族对学习成绩的影响
data.pivot_table(index='race/ethnicity')

	math score	reading score	total score	writing score
race/ethnicity
group A	61.629213	64.674157	188.977528	62.674157
group B	63.452632	67.352632	196.405263	65.600000
group C	64.463950	69.103448	201.394984	67.827586
group D	67.362595	70.030534	207.538168	70.145038
group E	73.821429	73.028571	218.257143	71.407143

种族划分（优秀-及格）： E - D - C - B - A

# 分析父母教育水平对学习成绩的影响
r = data.pivot_table(index='parental level of education')
r.sort_values(by='total score', ascending=False)

	math score	reading score	total score	writing score
parental level of education
master's degree	69.745763	75.372881	220.796610	75.677966
bachelor's degree	69.389831	73.000000	215.771186	73.381356
associate's degree	67.882883	70.927928	208.707207	69.896396
some college	67.128319	69.460177	205.429204	68.840708
some high school	63.497207	66.938547	195.324022	64.888268
high school	62.137755	64.704082	189.290816	62.448980

父母受教育水平越高，学习成绩越好。

# 分析中午饭学习成绩的影响
r = data.pivot_table(index='lunch')
r.sort_values(by='total score', ascending=False)

	math score	reading score	total score	writing score
lunch
standard	70.034109	71.654264	212.511628	70.823256
free/reduced	58.921127	64.653521	186.597183	63.022535

# 分析中午饭学习成绩的影响
r = data.pivot_table(index='test preparation course')
r.sort_values(by='total score', ascending=False)

	math score	reading score	total score	writing score
test preparation course
completed	69.695531	73.893855	218.008380	74.418994
none	64.077882	66.534268	195.116822	64.504673

r = data.pivot_table(index=['gender','test preparation course'])
r

		math score	reading score	total score	writing score
gender	test preparation course
female	completed	67.195652	77.375000	223.364130	78.793478
female	none	61.670659	69.982036	200.634731	68.982036
male	completed	72.339080	70.212644	212.344828	69.793103
male	none	66.688312	62.795455	189.133117	59.649351

分析前100名与后100名同学的不同情况

r = data.sort_values(by='total score', ascending=False)
top100 = r.head(100)
tail100 = r.tail(100)

r1 = pd.DataFrame({'top100':top100['gender'].value_counts(),
              'tail100':tail100['gender'].value_counts()})
r1

	tail100	top100
female	38	66
male	62	34

data['parental level of education'].value_counts()

some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
master's degree        59
Name: parental level of education, dtype: int64

r2 = pd.DataFrame({'top100':top100['parental level of education'].value_counts(),
              'tail100':tail100['parental level of education'].value_counts()})
r2

	tail100	top100
associate's degree	17	29
bachelor's degree	8	20
high school	32	6
master's degree	1	15
some college	14	21
some high school	28	9

代码总结

import numpy as np
import pandas as pd

# apply() 应用于Series
ary = np.array(['80公斤','83公斤','78公斤','74公斤','84公斤'])
s = pd.Series(ary)
s
def func(item):
    return float(item[:-2])

s = s.apply(func)
s

0    80.0
1    83.0
2    78.0
3    74.0
4    84.0
dtype: float64

# apply() 应用于 DataFrame
ratings = pd.read_json('../../data/ratings.json')

def func(item):
    item[item.isna()] = item.mean()
    return item

ratings.apply(func, axis=0)

<ipython-input-7-ebdfbe0e051f>:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  item[item.isna()] = item.mean()

	John Carson	Michelle Peterson	William Reynolds	Jillian Hobart	Melissa Jones	Alex Roberts	Michael Henry
Inception	2.5	3.0	2.50	3.5	3	3.0	3.166667
Pulp Fiction	3.5	3.5	3.00	3.5	4	4.0	4.500000
Anger Management	3.0	1.5	3.25	3.0	2	3.7	3.166667
Fracture	3.5	5.0	3.50	4.0	3	5.0	4.000000
Serendipity	2.5	3.5	3.25	2.5	2	3.5	1.000000
Jerry Maguire	3.0	3.0	4.00	4.5	3	3.0	3.166667

排序

d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack','Lee','David','Gasper','Betina','Andres']),
 'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
 'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
unsorted_df = pd.DataFrame(d)
unsorted_df

	Name	Age	Rating
0	Tom	25	4.23
1	James	26	3.24
2	Ricky	25	3.98
3	Vin	23	2.56
4	Steve	30	3.20
5	Minsu	29	4.60
6	Jack	23	3.80
7	Lee	34	3.78
8	David	40	2.98
9	Gasper	30	4.80
10	Betina	51	4.10
11	Andres	46	3.65

unsorted_df.sort_index(axis=1, ascending=True)

	Age	Name	Rating
0	25	Tom	4.23
1	26	James	3.24
2	25	Ricky	3.98
3	23	Vin	2.56
4	30	Steve	3.20
5	29	Minsu	4.60
6	23	Jack	3.80
7	34	Lee	3.78
8	40	David	2.98
9	30	Gasper	4.80
10	51	Betina	4.10
11	46	Andres	3.65

unsorted_df.sort_index(ascending=False)

	Name	Age	Rating
11	Andres	46	3.65
10	Betina	51	4.10
9	Gasper	30	4.80
8	David	40	2.98
7	Lee	34	3.78
6	Jack	23	3.80
5	Minsu	29	4.60
4	Steve	30	3.20
3	Vin	23	2.56
2	Ricky	25	3.98
1	James	26	3.24
0	Tom	25	4.23

按照某列字段进行排序

unsorted_df.sort_values(
    by=['Age', 'Rating'], ascending=[True, False])

	Name	Age	Rating
6	Jack	23	3.80
3	Vin	23	2.56
0	Tom	25	4.23
2	Ricky	25	3.98
1	James	26	3.24
5	Minsu	29	4.60
9	Gasper	30	4.80
4	Steve	30	3.20
7	Lee	34	3.78
8	David	40	2.98
11	Andres	46	3.65
10	Betina	51	4.10

Merge & join

left = pd.DataFrame({
         'student_id':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
         'student_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung', 'Billy', 'Brian', 'Bran', 'Bryce', 'Betty', 'Emma', 'Marry', 'Allen', 'Jean', 'Rose', 'David', 'Tom', 'Jack', 'Daniel', 'Andrew'],
         'class_id':[1,1,1,2,2,2,3,3,3,4,1,1,1,2,2,2,3,3,3,2], 
         'gender':['M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F'], 
         'age':[20,21,22,20,21,22,23,20,21,22,20,21,22,23,20,21,22,20,21,22], 
         'score':[98,74,67,38,65,29,32,34,85,64,52,38,26,89,68,46,32,78,79,87]})
right = pd.DataFrame(
         {'class_id':[1,2,3,5],
         'class_name': ['ClassA', 'ClassB', 'ClassC', 'ClassE']})

left

	student_id	student_name	class_id	gender	age	score
0	1	Alex	1	M	20	98
1	2	Amy	1	M	21	74
2	3	Allen	1	F	22	67
3	4	Alice	2	F	20	38
4	5	Ayoung	2	M	21	65
5	6	Billy	2	M	22	29
6	7	Brian	3	F	23	32
7	8	Bran	3	F	20	34
8	9	Bryce	3	M	21	85
9	10	Betty	4	M	22	64
10	11	Emma	1	F	20	52
11	12	Marry	1	F	21	38
12	13	Allen	1	M	22	26
13	14	Jean	2	M	23	89
14	15	Rose	2	F	20	68
15	16	David	2	F	21	46
16	17	Tom	3	M	22	32
17	18	Jack	3	M	20	78
18	19	Daniel	3	F	21	79
19	20	Andrew	2	F	22	87

right

	class_id	class_name
0	1	ClassA
1	2	ClassB
2	3	ClassC
3	5	ClassE

r = pd.merge(left, right, how='inner')
r

	student_id	student_name	class_id	gender	age	score	class_name
0	1	Alex	1	M	20	98	ClassA
1	2	Amy	1	M	21	74	ClassA
2	3	Allen	1	F	22	67	ClassA
3	11	Emma	1	F	20	52	ClassA
4	12	Marry	1	F	21	38	ClassA
5	13	Allen	1	M	22	26	ClassA
6	4	Alice	2	F	20	38	ClassB
7	5	Ayoung	2	M	21	65	ClassB
8	6	Billy	2	M	22	29	ClassB
9	14	Jean	2	M	23	89	ClassB
10	15	Rose	2	F	20	68	ClassB
11	16	David	2	F	21	46	ClassB
12	20	Andrew	2	F	22	87	ClassB
13	7	Brian	3	F	23	32	ClassC
14	8	Bran	3	F	20	34	ClassC
15	9	Bryce	3	M	21	85	ClassC
16	17	Tom	3	M	22	32	ClassC
17	18	Jack	3	M	20	78	ClassC
18	19	Daniel	3	F	21	79	ClassC

分组聚合

grouped = r.groupby(by=['class_id', 'gender'])
# 返回每个分组的频数
grouped.size()

class_id  gender
1         F         3
          M         3
2         F         4
          M         3
3         F         3
          M         3
dtype: int64

# 获取某一个分组细节
grouped.get_group((2, 'F'))

	student_id	student_name	class_id	gender	age	score	class_name
6	4	Alice	2	F	20	38	ClassB
10	15	Rose	2	F	20	68	ClassB
11	16	David	2	F	21	46	ClassB
12	20	Andrew	2	F	22	87	ClassB

# 针对每小组执行聚合操作  agg()
grouped.agg({'score':np.mean})
r = grouped.agg({'score':np.mean, 'age':[np.max, np.min]})
r

		score	age
		mean	amax	amin
class_id	gender
1	F	52.333333	22	20
1	M	66.000000	22	20
2	F	59.750000	22	20
2	M	61.000000	23	21
3	F	48.333333	23	20
3	M	65.000000	22	20

透视表

data = pd.merge(left, right)
# 透视表
data.pivot_table(index='class_id')

	age	score	student_id
class_id
1	21.000000	59.166667	7.000000
2	21.285714	60.285714	11.428571
3	21.166667	56.666667	13.000000

data

	student_id	student_name	class_id	gender	age	score	class_name
0	1	Alex	1	M	20	98	ClassA
1	2	Amy	1	M	21	74	ClassA
2	3	Allen	1	F	22	67	ClassA
3	11	Emma	1	F	20	52	ClassA
4	12	Marry	1	F	21	38	ClassA
5	13	Allen	1	M	22	26	ClassA
6	4	Alice	2	F	20	38	ClassB
7	5	Ayoung	2	M	21	65	ClassB
8	6	Billy	2	M	22	29	ClassB
9	14	Jean	2	M	23	89	ClassB
10	15	Rose	2	F	20	68	ClassB
11	16	David	2	F	21	46	ClassB
12	20	Andrew	2	F	22	87	ClassB
13	7	Brian	3	F	23	32	ClassC
14	8	Bran	3	F	20	34	ClassC
15	9	Bryce	3	M	21	85	ClassC
16	17	Tom	3	M	22	32	ClassC
17	18	Jack	3	M	20	78	ClassC
18	19	Daniel	3	F	21	79	ClassC

# 结果集中只希望看到score列的均值
data.pivot_table(index='class_id', values='score')

	score
class_id
1	59.166667
2	60.285714
3	56.666667

# 依据class_id 与 gender 同时做分组
data.pivot_table(index=['class_id','gender'], values='score')

		score
class_id	gender
1	F	52.333333
1	M	66.000000
2	F	59.750000
2	M	61.000000
3	F	48.333333
3	M	65.000000

data.pivot_table(index=['class_id','gender'], 
    columns='age', values='score')

	age	20	21	22	23
class_id	gender
1	F	52.0	38.0	67.0	NaN
1	M	98.0	74.0	26.0	NaN
2	F	53.0	46.0	87.0	NaN
2	M	NaN	65.0	29.0	89.0
3	F	34.0	79.0	NaN	32.0
3	M	78.0	85.0	32.0	NaN

data.pivot_table(index=['class_id','gender'], 
    columns='age', values='score', aggfunc=np.max)

	age	20	21	22	23
class_id	gender
1	F	52.0	38.0	67.0	NaN
1	M	98.0	74.0	26.0	NaN
2	F	68.0	46.0	87.0	NaN
2	M	NaN	65.0	29.0	89.0
3	F	34.0	79.0	NaN	32.0
3	M	78.0	85.0	32.0	NaN

r = data.pivot_table(index=['class_id','gender'], 
    columns='age', values='score', aggfunc=np.max,
    margins=True)
r

	age	20	21	22	23	All
class_id	gender
1	F	52.0	38.0	67.0	NaN	67
1	M	98.0	74.0	26.0	NaN	98
2	F	68.0	46.0	87.0	NaN	87
2	M	NaN	65.0	29.0	89.0	89
3	F	34.0	79.0	NaN	32.0	79
3	M	78.0	85.0	32.0	NaN	85
All		98.0	85.0	87.0	89.0	98

YEGE学AI算法

关注

0
点赞
踩
6

收藏

觉得还不错? 一键收藏
打赏
0
评论
数据分析04-pandas(apply函数、排序、数据合、分组聚合、透视表、交叉表及项目分析)

数据分析-04排序数据合并concatmerge & join分组聚合透视表与交叉表项目：分析影响学生成绩的因素apply函数pandas提供了apply函数方便的处理Series与DataFrame；apply函数支持逐一处理数据集中的每个元素都会执行一次目标函数，把返回值存入结果集中。：# series.apply()ary = np.array(['80公斤','83公斤','78公斤','74公斤','84公斤'])s = pd.Series(ary)def func(x):
复制链接

扫一扫