数据透视表
import pandas as pd
example = pd. DataFrame( { 'Amount' : [ 74 . , 235 . , 175 . , 100 . , 115 . , 245 . , 180 . , 90 . , 88 . , 129 . , 273 . , 300 . ] ,
'Category' : [ 'Transportation' , 'Grocery' , 'Household' , 'Entertainment' ,
'Transportation' , 'Grocery' , 'Household' , 'Entertainment' ,
'Transportation' , 'Grocery' , 'Household' , 'Entertainment' ] ,
'Month' : [ 'January' , 'January' , 'January' , 'January' ,
'February' , 'February' , 'February' , 'February' ,
'Marth' , 'Marth' , 'Marth' , 'Marth' ] } )
example
Amount Category Month 0 74.0 Transportation January 1 235.0 Grocery January 2 175.0 Household January 3 100.0 Entertainment January 4 115.0 Transportation February 5 245.0 Grocery February 6 180.0 Household February 7 90.0 Entertainment February 8 88.0 Transportation Marth 9 129.0 Grocery Marth 10 273.0 Household Marth 11 300.0 Entertainment Marth
上面这个数据表感觉非常的杂乱,数据透视的意思就是按照我们的意思重新组织这张数据表
new_example = example. pivot( index= 'Category' , columns= 'Month' , values= 'Amount' )
index 表示以原先数据的哪个属性的值为行坐标进行分类 columns 表示以原先数据的哪个属性的值为新的属性
values 表示以原先数据的哪个属性的值填充前面两个 index 和 columns 定义出的新表格
new_example
Month February January Marth Category Entertainment 90.0 100.0 300.0 Grocery 245.0 235.0 129.0 Household 180.0 175.0 273.0 Transportation 115.0 74.0 88.0
new_example. sum ( axis= 1 )
Category
Entertainment 490.0
Grocery 609.0
Household 628.0
Transportation 277.0
dtype: float64
new_example. sum ( axis= 0 )
Month
February 630.0
January 584.0
Marth 790.0
dtype: float64
经过重新组织后,数据表的呈现方式更有意义
df = pd. read_csv( '../../datasets/titanic/test.csv' )
df. head( 5 )
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q 1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S 2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q 3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S 4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
需求: 统计男女在不同船舱等级的票价
df. pivot_table( index= 'Sex' , columns= 'Pclass' , values= 'Fare' )
Pclass 1 2 3 Sex female 115.591168 26.438750 13.735129 male 75.586551 20.184654 11.826350
df. pivot_table( index= 'Sex' , columns= 'Pclass' , values= 'Fare' , aggfunc= 'min' )
Pclass 1 2 3 Sex female 25.7 10.5000 6.9500 male 0.0 9.6875 3.1708
需求: 统计不同船舱等级里男女的平均年龄
df. pivot_table( index= 'Pclass' , columns= 'Sex' , values= 'Age' )
Sex female male Pclass 1 41.333333 40.520000 2 24.376552 30.940678 3 23.073400 24.525104