2 第二章:数据重构
第一部分:数据聚合与运算
2.6 数据运用
2.6.1 任务一:通过教材《Python for Data Analysis》P303、Google or anything来学习了解GroupBy机制
2.4.2:任务二:计算泰坦尼克号男性与女性的平均票价
text. groupby( 'Sex' ) [ 'Fare' ] . mean( )
Sex
female 44.479818
male 25.523893
Name: Fare, dtype: float64
在了解GroupBy机制之后,运用这个机制完成一系列的操作,来达到我们的目的。
下面通过几个任务来熟悉GroupBy机制。
2.4.3:任务三:统计泰坦尼克号中男女的存活人数
text. groupby( 'Sex' ) [ 'Survived' ] . sum ( )
Sex
female 233
male 109
Name: Survived, dtype: int64
text. groupby( 'Sex' ) [ 'Survived' ] . count( )
Sex
female 314
male 577
Name: Survived, dtype: int64
text. groupby( 'Sex' ) [ 'Survived' ] . sum ( ) / text. groupby( 'Sex' ) [ 'Survived' ] . count( )
Sex
female 0.742038
male 0.188908
Name: Survived, dtype: float64
2.4.4:任务四:计算客舱不同等级的存活人数
text. groupby( 'Pclass' ) [ 'Survived' ] . sum ( )
Pclass
1 136
2 87
3 119
Name: Survived, dtype: int64
text. groupby( 'Pclass' ) [ 'Survived' ] . count( )
Pclass
1 216
2 184
3 491
Name: Survived, dtype: int64
text. groupby( 'Pclass' ) [ 'Survived' ] . sum ( ) / text. groupby( 'Pclass' ) [ 'Survived' ] . count( )
Pclass
1 0.629630
2 0.472826
3 0.242363
Name: Survived, dtype: float64
统计不同等级客舱中男女比例
text. groupby( [ 'Pclass' , 'Sex' ] ) . head( 1 )
Unnamed: 0 PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 0 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1.0 0.0 A/5 21171 7.2500 NaN S 1 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1.0 0.0 PC 17599 71.2833 C85 C 2 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0.0 0.0 STON/O2. 3101282 7.9250 NaN S 6 6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0.0 0.0 17463 51.8625 E46 S 9 9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1.0 0.0 237736 30.0708 NaN C 17 17 18 1 2 Williams, Mr. Charles Eugene male NaN 0.0 0.0 244373 13.0000 NaN S
text. groupby( [ 'Pclass' , 'Sex' ] ) [ 'PassengerId' ] . count( ) / [ 94 + 76 + 144 , 122 + 108 + 347 , 94 + 76 + 144 , 122 + 108 + 347 , 94 + 76 + 144 , 122 + 108 + 347 ]
Pclass Sex
1 female 0.299363
male 0.211438
2 female 0.242038
male 0.187175
3 female 0.458599
male 0.601386
Name: PassengerId, dtype: float64
【提示: 】表中的存活那一栏,可以发现如果还活着记为1,死亡记为0
【思考 】从数据分析的角度,上面的统计结果可以得出那些结论
#思考心得 女性更倾向与买贵一些的票,同时女性的存活率显著的比男性高,对比不同仓位中男女所占的比例也可以看出,相比男性在女性的群体中 更倾向于住好一些的客舱,这可能也就是存活率高的原因。
【思考】从任务二到任务四中,这些运算可以通过agg()函数来同时计算。并且可以使用rename函数修改列名。你可以按照提示写出这个过程吗?
text. groupby( 'Sex' ) . agg( { 'Fare' : [ ( 'rename_mean' , 'mean' ) ] , 'Survived' : [ ( 'rename_sum' , 'sum' ) ] } )
Fare Survived rename_mean rename_sum Sex female 44.479818 233 male 25.523893 109
text. groupby( 'Pclass' ) . agg( { 'Survived' : [ 'sum' ] } ) . rename( columns= { 'sum' : 'rename_sum' } )
Survived rename_sum Pclass 1 136 2 87 3 119
2.4.5:任务五:统计在不同等级的票中的不同年龄的船票花费的平均值
text. groupby( [ 'Pclass' , 'Age' ] ) [ 'Fare' ] . mean( )
Pclass Age
1 0.92 151.5500
2.00 151.5500
4.00 81.8583
11.00 120.0000
14.00 120.0000
...
3 61.00 6.2375
63.00 9.5875
65.00 7.7500
70.50 7.7500
74.00 7.7750
Name: Fare, Length: 182, dtype: float64
text[ 'P_A_Fare_mean' ] = text[ 'Fare' ]
for name , group in text. groupby( [ 'Pclass' , 'Age' ] ) :
text. loc[ group. index, 'P_A_Fare_mean' ] = pd. Series( group[ 'Fare' ] . mean( ) , index= group. index, name= 'P_A_Fare_mean' )
text. head( )
Unnamed: 0 PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked P_A_Fare_mean 0 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1.0 0.0 A/5 21171 7.2500 NaN S 7.988330 1 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1.0 0.0 PC 17599 71.2833 C85 C 103.711800 2 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0.0 0.0 STON/O2. 3101282 7.9250 NaN S 14.158036 3 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1.0 0.0 113803 53.1000 C123 S 165.744911 4 4 5 0 3 Allen, Mr. William Henry male 35.0 0.0 0.0 373450 8.0500 NaN S 9.736800
2.4.6:任务六:将任务二和任务三的数据合并,并保存到sex_fare_survived.csv
text. groupby( 'Sex' ) . agg( { 'Fare' : [ ( 'rename_mean' , 'mean' ) ] , 'Survived' : [ ( 'rename_sum' , 'sum' ) ] } )
Fare Survived rename_mean rename_sum Sex female 44.479818 233 male 25.523893 109
mean= text. groupby( 'Sex' ) [ 'Fare' ] . mean( )
sur= text. groupby( 'Sex' ) [ 'Survived' ] . sum ( )
display( mean)
display( sur)
display( pd. merge( mean, sur, on= 'Sex' ) )
display( mean. to_frame( ) . join( sur) )
display( pd. concat( [ mean. to_frame( ) , sur. to_frame( ) ] , axis= 1 ) )
Sex
female 44.479818
male 25.523893
Name: Fare, dtype: float64
Sex
female 233
male 109
Name: Survived, dtype: int64
Fare Survived Sex female 44.479818 233 male 25.523893 109
Fare Survived Sex female 44.479818 233 male 25.523893 109
Fare Survived Sex female 44.479818 233 male 25.523893 109
result = pd. merge( mean, sur, on= 'Sex' )
result. to_csv( 'sex_fare_survived.csv' )
2.4.7:任务七:得出不同年龄的总的存活人数,然后找出存活人数的最高的年龄,最后计算存活人数最高的存活率(存活人数/总人数)
text. groupby( 'Age' ) [ 'Survived' ] . sum ( )
Age
0.42 1
0.67 1
0.75 2
0.83 2
0.92 1
..
70.00 0
70.50 0
71.00 0
74.00 0
80.00 1
Name: Survived, Length: 88, dtype: int64
text. groupby( 'Age' ) [ 'Survived' ] . sum ( ) . idxmax( )
24.0
text. groupby( 'Age' ) [ 'Survived' ] . sum ( ) . loc[ 24.0 ]
15
text. groupby( 'Age' ) [ 'Survived' ] . sum ( ) . loc[ 24.0 ] / text. groupby( 'Age' ) [ 'Survived' ] . count( ) . loc[ 24.0 ]
0.5