Pandas Data Processing: Pivot Tables


Author: 初一


Tags: Python, Pandas

Article link: https://blog.csdn.net/weixin_43060843/article/details/94384009


3.10 Pivot Tables

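A pivot table is an operation similar to GroupBy that is commonly seen in Excel. It takes simple column-wise data as input and groups the entries into a two-dimensional table that summarizes the data along multiple dimensions.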

3.10.1 Demonstrating Pivot Tables

The examples use the Titanic passenger dataset, which is available through the Seaborn library:

In [1]: import numpy as np
        import pandas as pd
        import seaborn as sns
        titanic = sns.load_dataset('titanic')
In [2]: titanic.head()
Out[2]:
   survived  pclass     sex   age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone
0         0       3    male  22.0      1      0   7.2500        S  Third    man        True  NaN  Southampton    no  False
1         1       1  female  38.0      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False
2         1       3  female  26.0      0      0   7.9250        S  Third  woman       False  NaN  Southampton   yes   True
3         1       1  female  35.0      1      0  53.1000        S  First  woman       False    C  Southampton   yes  False
4         0       3    male  35.0      0      0   8.0500        S  Third    man        True  NaN  Southampton    no   True

In [3]: titanic.shape
Out[3]: (891, 15)

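This dataset contains a wealth of information on each passenger of that ill-fated voyage, including sex (sex), age (age), ticket class (class), and fare paid (fare).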

3.10.2 Building a Pivot Table with groupby

Group by sex to examine the relationship between sex and survival:

In [4]: titanic.groupby("sex")["survived"].mean()
Out[4]:
sex
female    0.742038
male      0.188908
Name: survived, dtype: float64

The data shows that about 74% of the women survived, whereas only about 19% of the men did.

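Going one step further, we can look at survival by both sex and ticket class. Following the usual GroupBy workflow, we group by class and sex, select the survived column, apply the mean aggregate, combine the resulting groups, and finally unstack the innermost row index into a column index to obtain a two-dimensional table.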

In [5]: titanic.groupby(["sex","class"])["survived"].mean()
Out[5]:
sex     class
female  First     0.968085
        Second    0.921053
        Third     0.500000
male    First     0.368852
        Second    0.157407
        Third     0.135447
Name: survived, dtype: float64

In [6]: titanic.groupby(["sex","class"])["survived"].mean().unstack()
Out[6]:
class      First    Second     Third
sex
female  0.968085  0.921053  0.500000
male    0.368852  0.157407  0.135447

Compared with the pivot_table method built into pandas, however, this statement is rather verbose, so from here on we use pivot_table to build pivot tables.
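
The equivalence of the two routes can be checked on a small frame. The sketch below is only an illustration: the six-row DataFrame is invented for this purpose and is not part of the Titanic data.

```python
import pandas as pd

# Hypothetical six-passenger sample, invented purely for illustration
df = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male"],
    "class": ["First", "Third", "Third", "First", "Third", "Third"],
    "survived": [1, 1, 0, 0, 0, 1],
})

# Route 1: group by both keys, average, then move "class" into the columns
by_groupby = df.groupby(["sex", "class"])["survived"].mean().unstack()

# Route 2: the same table from a single pivot_table call (default aggfunc is the mean)
by_pivot = df.pivot_table("survived", index="sex", columns="class")

print(by_pivot)
```

Both routes should give the same 2x2 table of survival rates; the pivot_table call simply states the row key, column key, and value column directly.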

3.10.3 Pivot Table Syntax

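The main parameters of the DataFrame.pivot_table method are shown in the signature below. The top-level function pd.pivot_table takes the DataFrame as an additional first argument, data, and recent pandas versions add further parameters such as observed and sort:

DataFrame.pivot_table(values=None, index=None, columns=None,
                      aggfunc='mean', fill_value=None, margins=False,
                      dropna=True, margins_name='All')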

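index : keys for the row index of the pivot table; pass a list [ ] to build a multi-level index. The default is None, but in practice at least one of index and columns is supplied
values : the column or columns to aggregate; by default all remaining columns are used, and values narrows the result to the columns we want to display
columns : keys for the column index of the pivot table; optional, used the same way as index
aggfunc : the function applied when aggregating; defaults to the mean, and can also be 'sum', 'count', etc.
margins : when True, adds an extra row and column at the edges holding the aggregate (computed with aggfunc) over each full row and column; defaults to False
fill_value : the value used to replace missing values in the resulting table
dropna : defaults to True, which excludes columns whose entries are all NaN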
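
As a rough illustration of fill_value and of the method versus function forms, here is a minimal sketch. The sales frame and its column names are hypothetical and chosen only so that one row/column combination has no data.

```python
import pandas as pd

# Hypothetical sales records, invented for illustration
sales = pd.DataFrame({
    "region": ["north", "north", "south"],
    "product": ["tea", "coffee", "tea"],
    "amount": [10.0, 20.0, 30.0],
})

# There is no south/coffee record, so that cell is NaN by default
plain = sales.pivot_table(values="amount", index="region",
                          columns="product", aggfunc="sum")

# fill_value replaces the missing cell in the resulting table
filled = sales.pivot_table(values="amount", index="region",
                           columns="product", aggfunc="sum", fill_value=0)

# The top-level function takes the frame as its first argument (data)
same = pd.pivot_table(sales, values="amount", index="region",
                      columns="product", aggfunc="sum", fill_value=0)

print(filled)
```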

Let's try out these parameters in turn:

In [7]: titanic.pivot_table(index='sex', columns='class')
Out[7]:
       adult_male                            age                        ...
class       First    Second     Third      First     Second      Third  ...
sex
female    0.00000  0.000000  0.000000  34.611765  28.722973  21.750000  ...
male      0.97541  0.916667  0.919308  41.281386  30.740707  26.507589  ...

By default every column is aggregated. Passing the values argument restricts the computation to the columns we actually want:

In [8]: age = pd.cut(titanic["age"], [0, 18, 80])  # bin the age column into two ranges for easier reading
In [9]: titanic.pivot_table(index=['sex', age], columns='class', values=['survived', 'fare'])
Out[9]:
                       fare                        survived
class                 First     Second      Third     First    Second     Third
sex    age
female (0, 18]   127.474245  25.064286  17.370835  0.909091  1.000000  0.511628
       (18, 80]  105.043469  21.224653  14.785453  0.972973  0.900000  0.423729
male   (0, 18]   114.638320  26.116947  20.639055  0.800000  0.600000  0.215686
       (18, 80]   68.877389  20.219593  10.022624  0.375000  0.071429  0.133663
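
To make the binning step concrete, the following sketch applies pd.cut to a tiny hypothetical frame (the five rows are invented) and uses the resulting bins as a second index level. Passing observed=False explicitly is a choice made here so the behavior does not depend on the pandas version's default.

```python
import pandas as pd

# Hypothetical passengers, invented for illustration
df = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male"],
    "age": [10, 40, 18, 19, 60],
    "survived": [1, 1, 0, 1, 0],
})

# Bins are right-closed by default: 18 falls in (0, 18], 19 in (18, 80]
age_group = pd.cut(df["age"], [0, 18, 80])

# Use a column name and the binned Series together as a two-level row index
table = df.pivot_table("survived", index=["sex", age_group],
                       aggfunc="mean", observed=False)

print(table)
```

Because the bins are right-closed, an 18-year-old lands in the first group; pd.cut's right parameter can be set to False if left-closed bins are preferred.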

In practice we do not always want the mean; in that case the aggfunc argument specifies the aggregation function:

In [10]: titanic.pivot_table(index='sex', columns='class', aggfunc={'survived': 'sum', 'fare': 'mean'})
Out[10]:
              fare                        survived
class        First     Second      Third    First Second Third
sex
female  106.125798  21.970121  16.118810       91     70    72
male     67.226127  19.741782  12.661633       45     17    47

Note that the values argument is omitted here: when aggfunc is given as a mapping, its keys already determine which columns are aggregated.
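
A small sketch of this behavior, using a hypothetical four-row frame with an extra age column that is deliberately left out of the mapping:

```python
import pandas as pd

# Hypothetical records, invented for illustration
df = pd.DataFrame({
    "sex": ["female", "female", "male", "male"],
    "survived": [1, 0, 1, 1],
    "fare": [80.0, 20.0, 10.0, 30.0],
    "age": [30, 40, 25, 35],
})

# No values argument: the dict keys select the columns, so "age" is not aggregated
table = df.pivot_table(index="sex",
                       aggfunc={"survived": "sum", "fare": "mean"})

print(table)
```

Only fare and survived appear in the result, each aggregated with its own function.
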
To compute the aggregate over each whole row and column as well, set the margins argument:

In [11]: titanic.pivot_table('survived', index='sex', columns='class', margins=True)
Out[11]:
class      First    Second     Third       All
sex
female  0.968085  0.921053  0.500000  0.742038
male    0.368852  0.157407  0.135447  0.188908
All     0.629630  0.472826  0.242363  0.383838

The label of the margin row and column can be customized with the margins_name argument; its default value is "All".
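
The sketch below shows margins together with a custom margins_name on a hypothetical six-row frame (invented for illustration; the label "Overall" is an arbitrary choice).

```python
import pandas as pd

# Hypothetical six-passenger sample, invented for illustration
df = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male"],
    "class": ["First", "Third", "Third", "First", "Third", "Third"],
    "survived": [1, 1, 0, 0, 0, 1],
})

# margins=True adds an extra row and column; margins_name renames them
table = df.pivot_table("survived", index="sex", columns="class",
                       margins=True, margins_name="Overall")

print(table)

# With the default aggfunc (mean), the corner cell is the overall survival rate
overall = table.loc["Overall", "Overall"]
```

Note that the margins are computed with the same aggregation function as the cells, so with the mean they are overall rates rather than sums.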
