第4章 变形
import numpy as np
import pandas as pd
df = pd. read_csv( 'data/table.csv' )
df. head( )
School Class ID Gender Address Height Weight Math Physics 0 S_1 C_1 1101 M street_1 173 63 34.0 A+ 1 S_1 C_1 1102 F street_2 192 73 32.5 B+ 2 S_1 C_1 1103 M street_2 186 82 87.2 B+ 3 S_1 C_1 1104 F street_2 167 81 80.4 B- 4 S_1 C_1 1105 F street_4 159 64 84.8 B+
一、透视表
1. pivot
一般状态下,数据在DataFrame会以压缩(stacked)状态存放,例如上面的Gender,两个类别被叠在一列中,pivot函数可将某一列作为新的cols:
df. pivot( index= 'ID' , columns= 'Gender' , values= 'Height' ) . head( )
Gender F M ID 1101 NaN 173.0 1102 192.0 NaN 1103 NaN 186.0 1104 167.0 NaN 1105 159.0 NaN
然而pivot函数具有很强的局限性,除了功能上较少之外,还不允许values中出现重复的行列索引对(pair),例如下面的语句就会报错:
因此,更多的时候会选择使用强大的pivot_table函数
2. pivot_table
首先,再现上面的操作:
pd. pivot_table( df, index= 'ID' , columns= 'Gender' , values= 'Height' ) . head( )
Gender F M ID 1101 NaN 173.0 1102 192.0 NaN 1103 NaN 186.0 1104 167.0 NaN 1105 159.0 NaN
由于功能更多,速度上自然是比不上原来的pivot函数:
% timeit df. pivot( index= 'ID' , columns= 'Gender' , values= 'Height' )
% timeit pd. pivot_table( df, index= 'ID' , columns= 'Gender' , values= 'Height' )
2.32 ms ± 44.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
11.2 ms ± 828 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Pandas中提供了各种选项,下面介绍常用参数:
① aggfunc:对组内进行聚合统计,可传入各类函数,默认为’mean’
pd. pivot_table( df, index= 'School' , columns= 'Gender' , values= 'Height' , aggfunc= [ 'mean' , 'std' ] ) . head( )
mean std Gender F M F M School S_1 173.125000 178.714286 11.703937 14.067863 S_2 173.727273 172.000000 16.297797 12.419742
② margins:汇总边际状态
pd. pivot_table( df, index= 'School' , columns= 'Gender' , values= 'Height' , aggfunc= [ 'mean' , 'sum' ] , margins= True , margins_name= 'all' ) . head( )
mean sum Gender F M all F M all School S_1 173.125000 178.714286 175.733333 1385 1251 2636 S_2 173.727273 172.000000 172.950000 1911 1548 3459 all 173.473684 174.937500 174.142857 3296 2799 6095
③ 行、列、值都可以为多级
pd. pivot_table( df, index= [ 'School' , 'Class' ] ,
columns= [ 'Gender' , 'Address' ] ,
values= [ 'Height' , 'Weight' ] )
Height ... Weight Gender F M ... F M Address street_1 street_2 street_4 street_5 street_6 street_7 street_1 street_2 street_4 street_5 ... street_4 street_5 street_6 street_7 street_1 street_2 street_4 street_5 street_6 street_7 School Class S_1 C_1 NaN 179.5 159.0 NaN NaN NaN 173.0 186.0 NaN NaN ... 64.0 NaN NaN NaN 63.0 82.0 NaN NaN NaN NaN C_2 NaN NaN 176.0 162.0 167.0 NaN NaN NaN NaN 188.0 ... 94.0 63.0 63.0 NaN NaN NaN NaN 68.0 53.0 NaN C_3 175.0 NaN NaN 187.0 NaN NaN NaN 195.0 161.0 NaN ... NaN 69.0 NaN NaN NaN 70.0 68.0 NaN NaN 82.0 S_2 C_1 NaN NaN NaN 159.0 161.0 NaN NaN NaN 163.5 NaN ... NaN 97.0 61.0 NaN NaN NaN 71.0 NaN NaN 84.0 C_2 NaN NaN NaN NaN NaN 188.5 175.0 NaN 155.0 193.0 ... NaN NaN NaN 76.5 74.0 NaN 91.0 100.0 NaN NaN C_3 NaN NaN 157.0 NaN 164.0 190.0 NaN NaN 187.0 171.0 ... 78.0 NaN 81.0 99.0 NaN NaN 73.0 88.0 NaN NaN C_4 NaN 176.0 NaN NaN 175.5 NaN NaN NaN NaN NaN ... NaN NaN 57.0 NaN NaN NaN NaN NaN NaN 82.0
7 rows × 24 columns
3. crosstab(交叉表)
交叉表是一种特殊的透视表,典型的用途如分组统计,如现在想要统计关于街道和性别分组的频数:
pd. crosstab( index= df[ 'Address' ] , columns= df[ 'Gender' ] )
Gender F M Address street_1 1 2 street_2 4 2 street_4 3 5 street_5 3 3 street_6 5 1 street_7 3 3
交叉表的功能也很强大(但目前还不支持多级分组),下面说明一些重要参数:
① values和aggfunc:分组对某些数据进行聚合操作,这两个参数必须成对出现
pd. crosstab( index= df[ 'Address' ] , columns= df[ 'Gender' ] ,
values= np. random. randint( 1 , 20 , df. shape[ 0 ] ) , aggfunc= 'mean' )
Gender F M Address street_1 19.00 11.000000 street_2 10.75 9.500000 street_4 10.00 11.400000 street_5 6.00 10.333333 street_6 13.80 18.000000 street_7 14.00 9.333333
② 除了边际参数margins外,还引入了normalize参数,可选’all’,‘index’,'columns’参数值
pd. crosstab( index= df[ 'Address' ] , columns= df[ 'Gender' ] , normalize= 'all' , margins= True )
Gender F M All Address street_1 0.028571 0.057143 0.085714 street_2 0.114286 0.057143 0.171429 street_4 0.085714 0.142857 0.228571 street_5 0.085714 0.085714 0.171429 street_6 0.142857 0.028571 0.171429 street_7 0.085714 0.085714 0.171429 All 0.542857 0.457143 1.000000
二、其他变形方法
1. melt
melt函数可以认为是pivot函数的逆操作,将unstacked状态的数据,压缩成stacked,使“宽”的DataFrame变“窄”
df_m = df[ [ 'ID' , 'Gender' , 'Math' ] ]
df_m. head( )
ID Gender Math 0 1101 M 34.0 1 1102 F 32.5 2 1103 M 87.2 3 1104 F 80.4 4 1105 F 84.8
df. pivot( index= 'ID' , columns= 'Gender' , values= 'Math' ) . head( )
Gender F M ID 1101 NaN 34.0 1102 32.5 NaN 1103 NaN 87.2 1104 80.4 NaN 1105 84.8 NaN
melt函数中的id_vars表示需要保留的列,value_vars表示需要stack的一组列
pivoted = df. pivot( index= 'ID' , columns= 'Gender' , values= 'Math' )
result = pivoted. reset_index( ) . melt( id_vars= [ 'ID' ] , value_vars= [ 'F' , 'M' ] , value_name= 'Math' ) \
. dropna( ) . set_index( 'ID' ) . sort_index( )
result. equals( df_m. set_index( 'ID' ) )
pivoted. reset_index( ) . melt( id_vars= [ 'ID' ] , value_vars= [ 'F' , 'M' ] , value_name= 'Math' )
ID Gender Math 0 1101 F NaN 1 1102 F 32.5 2 1103 F NaN 3 1104 F 80.4 4 1105 F 84.8 ... ... ... ... 65 2401 M NaN 66 2402 M 48.7 67 2403 M NaN 68 2404 M NaN 69 2405 M NaN
70 rows × 3 columns
2. 压缩与展开
(1)stack:这是最基础的变形函数,总共只有两个参数:level和dropna
df_s = pd. pivot_table( df, index= [ 'Class' , 'ID' ] , columns= 'Gender' , values= [ 'Height' , 'Weight' ] )
df_s. groupby( 'Class' ) . head( 2 )
Height Weight Gender F M F M Class ID C_1 1101 NaN 173.0 NaN 63.0 1102 192.0 NaN 73.0 NaN C_2 1201 NaN 188.0 NaN 68.0 1202 176.0 NaN 94.0 NaN C_3 1301 NaN 161.0 NaN 68.0 1302 175.0 NaN 57.0 NaN C_4 2401 192.0 NaN 62.0 NaN 2402 NaN 166.0 NaN 82.0
df_stacked = df_s. stack( 1 )
df_stacked. groupby( 'Class' ) . head( 2 )
Height Weight Class ID Gender C_1 1101 M 173.0 63.0 1102 F 192.0 73.0 C_2 1201 M 188.0 68.0 1202 F 176.0 94.0 C_3 1301 M 161.0 68.0 1302 F 175.0 57.0 C_4 2401 F 192.0 62.0 2402 M 166.0 82.0
stack函数可以看做将横向的索引放到纵向,因此功能类似与melt,参数level可指定变化的列索引是哪一层(或哪几层,需要列表)
df_stacked = df_s. stack( 0 )
df_stacked. groupby( 'Class' ) . head( 2 )
Gender F M Class ID C_1 1101 Height NaN 173.0 Weight NaN 63.0 C_2 1201 Height NaN 188.0 Weight NaN 68.0 C_3 1301 Height NaN 161.0 Weight NaN 68.0 C_4 2401 Height 192.0 NaN Weight 62.0 NaN
(2) unstack:stack的逆函数,功能上类似于pivot_table
df_stacked. head( )
Gender F M Class ID C_1 1101 Height NaN 173.0 Weight NaN 63.0 1102 Height 192.0 NaN Weight 73.0 NaN 1103 Height NaN 186.0
result = df_stacked. unstack( ) . swaplevel( 1 , 0 , axis= 1 ) . sort_index( axis= 1 )
三、哑变量与因子化
1. Dummy Variable(哑变量)
这里主要介绍get_dummies函数,其功能主要是进行one-hot编码:
df_d = df[ [ 'Class' , 'Gender' , 'Weight' ] ]
df_d. head( )
Class Gender Weight 0 C_1 M 63 1 C_1 F 73 2 C_1 M 82 3 C_1 F 81 4 C_1 F 64
现在希望将上面的表格前两列转化为哑变量,并加入第三列Weight数值:
pd. get_dummies( df_d[ [ 'Class' , 'Gender' ] ] ) . join( df_d[ 'Weight' ] ) . head( )
Class_C_1 Class_C_2 Class_C_3 Class_C_4 Gender_F Gender_M Weight 0 1 0 0 0 0 1 63 1 1 0 0 0 1 0 73 2 1 0 0 0 0 1 82 3 1 0 0 0 1 0 81 4 1 0 0 0 1 0 64
2. factorize方法
该方法主要用于自然数编码,并且缺失值会被记做-1,其中sort参数表示是否排序后赋值
codes, uniques = pd. factorize( [ 'b' , None , 'a' , 'c' , 'b' ] , sort= True )
display( codes)
display( uniques)
array([ 1, -1, 0, 2, 1])
array(['a', 'b', 'c'], dtype=object)
四、问题与练习
1. 问题
【问题一】 上面提到了许多变形函数,如melt/crosstab/pivot/pivot_table/stack/unstack函数,请总结它们各自的使用特点。
pivot/pivot_table/unstack 对数据框展开 crosstab 分组统计 melt/stack 压缩
【问题二】 变形函数和多级索引是什么关系?哪些变形函数会使得索引维数变化?具体如何变化?
变形函数将不同数据分类作为索引,或者使索引作为数据
melt/ pivot/ pivot_table/ stack/ unstack 会引起索引维数变化
2. 练习
【练习一】 继续使用上一章的药物数据集:
pd. read_csv( 'data/Drugs.csv' ) . head( )
YYYY State COUNTY SubstanceName DrugReports 0 2010 VA ACCOMACK Propoxyphene 1 1 2010 OH ADAMS Morphine 9 2 2010 PA ADAMS Methadone 2 3 2010 VA ALEXANDRIA CITY Heroin 5 4 2010 PA ALLEGHENY Hydromorphone 5
(a) 现在请你将数据表转化成如下形态,每行需要显示每种药物在每个地区的10年至17年的变化情况,且前三列需要排序:
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-5zfxkLId-1588084046406)(picture/drug_pic.png)]
(b) 现在请将(a)中的结果恢复到原数据表,并通过equal函数检验初始表与新的结果是否一致(返回True)
df= pd. read_csv( 'data/Drugs.csv' )
result = pd. pivot_table( df, index= [ 'State' , 'COUNTY' , 'SubstanceName' ]
, columns= 'YYYY'
, values= 'DrugReports' , fill_value= '-' ) . reset_index( )
result. head( )
YYYY State COUNTY SubstanceName 2010 2011 2012 2013 2014 2015 2016 2017 0 KY ADAIR Buprenorphine - 3 5 4 27 5 7 10 1 KY ADAIR Codeine - - 1 - - - - 1 2 KY ADAIR Fentanyl - - 1 - - - - - 3 KY ADAIR Heroin - - 1 2 - 1 - 2 4 KY ADAIR Hydrocodone 6 9 10 10 9 7 11 3
result1 = result. melt( id_vars= result. columns[ : 3 ] , value_vars= result. columns[ - 8 : ]
, var_name= 'YYYY' , value_name= 'DrugReports' ) . query( 'DrugReports != "-"' )
result1. head( )
State COUNTY SubstanceName YYYY DrugReports 4 KY ADAIR Hydrocodone 2010 6 6 KY ADAIR Methadone 2010 1 13 KY ALLEN Hydrocodone 2010 10 15 KY ALLEN Methadone 2010 4 17 KY ALLEN Oxycodone 2010 15
【练习二】 现有一份关于某地区地震情况的数据集,请解决如下问题:
df= pd. read_csv( 'data/Earthquake.csv' )
(a) 现在请你将数据表转化成如下形态,将方向列展开,并将距离、深度和烈度三个属性压缩:
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-31Wp4E8r-1588084046409)(picture/earthquake_pic.png)]
(b) 现在请将(a)中的结果恢复到原数据表,并通过equal函数检验初始表与新的结果是否一致(返回True)
result = pd. pivot_table( df, index= [ '日期' , '时间' , '维度' , '经度' ]
, columns= '方向'
, values= [ '烈度' , '深度' , '距离' ] , fill_value= '-' ) . stack( level= 0 ) . rename_axis( index= { None : '地震参数' } )
result. head( 6 )
方向 east north north_east north_west south south_east south_west west 日期 时间 维度 经度 地震参数 1912.08.09 12:29:00 AM 40.6 27.2 深度 - - - - - 16 - - 烈度 - - - - - 6.7 - - 距离 - - - - - 4.3 - - 1912.08.10 12:23:00 AM 40.6 27.1 深度 - - - - - - 15 - 烈度 - - - - - - 6 - 距离 - - - - - - 2 -
df_result = result. unstack( ) . stack( 0 ) [ ( ~ ( result. unstack( ) . stack( 0 ) == '-' ) ) . any ( 1 ) ] . reset_index( )
df_result. columns. name= None
df_result = df_result. sort_index( axis= 0 ) . astype( { '深度' : 'float64' , '烈度' : 'float64' , '距离' : 'float64' } )
df_result. head( )
日期 时间 维度 经度 方向 深度 烈度 距离 0 1912.08.09 12:29:00 AM 40.6 27.2 south_east 16.0 6.7 4.3 1 1912.08.10 12:23:00 AM 40.6 27.1 south_west 15.0 6.0 2.0 2 1912.08.10 12:30:00 AM 40.6 27.1 south_west 15.0 5.2 2.0 3 1912.08.11 12:19:04 AM 40.6 27.2 south_east 30.0 4.9 4.3 4 1912.08.11 12:20:00 AM 40.6 27.1 south_west 15.0 4.5 2.0