第二章:第二节数据重构

复习:在前面我们已经学习了Pandas基础,第二章我们开始进入数据分析的业务部分,在第二章第一节的内容中,我们学习了数据的清洗,这一部分十分重要,只有数据变得相对干净,我们之后对数据的分析才可以更有力。而这一节,我们要做的是数据重构,数据重构依旧属于数据理解(准备)的范围。

开始之前,导入numpy、pandas包和数据
# 导入基本库
import numpy as np
import pandas as pd
# 载入data文件中的:train-left-up.csv
text = pd.read_csv('./data/train-left-up.csv')
text.head()
PassengerIdSurvivedPclassName
0103Braund, Mr. Owen Harris
1211Cumings, Mrs. John Bradley (Florence Briggs Th...
2313Heikkinen, Miss. Laina
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)
4503Allen, Mr. William Henry

2 第二章:数据重构

2.4 数据的合并

2.4.1 任务一:将data文件夹里面的所有数据都载入,观察数据的之间的关系
#写入代码
text_left_up = pd.read_csv("data/train-left-up.csv")
text_left_down = pd.read_csv("data/train-left-down.csv")
text_right_up = pd.read_csv("data/train-right-up.csv")
text_right_down = pd.read_csv("data/train-right-down.csv")
#写入代码
display(text_left_up.tail())
display(text_left_down.head())

PassengerIdSurvivedPclassName
43443501Silvey, Mr. William Baird
43543611Carter, Miss. Lucile Polk
43643703Ford, Miss. Doolina Margaret "Daisy"
43743812Richards, Mrs. Sidney (Emily Hocking)
43843901Fortune, Mr. Mark
PassengerIdSurvivedPclassName
044002Kvillner, Mr. Johan Henrik Johannesson
144112Hart, Mrs. Benjamin (Esther Ada Bloomfield)
244203Hampe, Mr. Leon
344303Petterson, Mr. Johan Emil
444412Reynaldo, Ms. Encarnacion
display(text_right_up.tail())
display(text_right_down.head())
SexAgeSibSpParchTicketFareCabinEmbarked
886male27.00.00.021153613.00NaNS
887female19.00.00.011205330.00B42S
888femaleNaN1.02.0W./C. 660723.45NaNS
889male26.00.00.011136930.00C148C
890male32.00.00.03703767.75NaNQ
SexAgeSibSpParchTicketFareCabinEmbarked
0male31.000C.A. 1872310.500NaNS
1female45.011F.C.C. 1352926.250NaNS
2male20.0003457699.500NaNS
3male25.0103470767.775NaNS
4female28.00023043413.000NaNS

【提示】结合之前我们加载的train.csv数据,大致预测一下上面的数据是什么

2.4.2:任务二:使用concat方法:将数据train-left-up.csv和train-right-up.csv横向合并为一张表,并保存这张表为result_up
#写入代码
result_up=pd.concat([text_left_up,text_right_up.loc[:438]],axis=1)
result_up
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.01.00.0A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.01.00.0PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.00.00.0STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01.00.011380353.1000C123S
4503Allen, Mr. William Henrymale35.00.00.03734508.0500NaNS
.......................................
43443501Silvey, Mr. William Bairdmale50.01.00.01350755.9000E44S
43543611Carter, Miss. Lucile Polkfemale14.01.02.0113760120.0000B96 B98S
43643703Ford, Miss. Doolina Margaret "Daisy"female21.02.02.0W./C. 660834.3750NaNS
43743812Richards, Mrs. Sidney (Emily Hocking)female24.02.03.02910618.7500NaNS
43843901Fortune, Mr. Markmale64.01.04.019950263.0000C23 C25 C27S

439 rows × 12 columns

2.4.3 任务三:使用concat方法:将train-left-down和train-right-down横向合并为一张表,并保存这张表为result_down。然后将上边的result_up和result_down纵向合并为result。
#写入代码
result_down=pd.concat([text_left_down,text_right_down],axis=1)
result_down
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
044002Kvillner, Mr. Johan Henrik Johannessonmale31.000C.A. 1872310.500NaNS
144112Hart, Mrs. Benjamin (Esther Ada Bloomfield)female45.011F.C.C. 1352926.250NaNS
244203Hampe, Mr. Leonmale20.0003457699.500NaNS
344303Petterson, Mr. Johan Emilmale25.0103470767.775NaNS
444412Reynaldo, Ms. Encarnacionfemale28.00023043413.000NaNS
.......................................
44788702Montvila, Rev. Juozasmale27.00021153613.000NaNS
44888811Graham, Miss. Margaret Edithfemale19.00011205330.000B42S
44988903Johnston, Miss. Catherine Helen "Carrie"femaleNaN12W./C. 660723.450NaNS
45089011Behr, Mr. Karl Howellmale26.00011136930.000C148C
45189103Dooley, Mr. Patrickmale32.0003703767.750NaNQ

452 rows × 12 columns

2.4.4 任务四:使用DataFrame自带的方法join方法和append:完成任务二和任务三的任务
#写入代码
result_up=text_left_up.join(text_right_up)
result_down=text_left_down.join(text_right_down)
result=result_up.append(result_down)
result
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.01.00.0A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.01.00.0PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.00.00.0STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01.00.011380353.1000C123S
4503Allen, Mr. William Henrymale35.00.00.03734508.0500NaNS
.......................................
44788702Montvila, Rev. Juozasmale27.00.00.021153613.0000NaNS
44888811Graham, Miss. Margaret Edithfemale19.00.00.011205330.0000B42S
44988903Johnston, Miss. Catherine Helen "Carrie"femaleNaN1.02.0W./C. 660723.4500NaNS
45089011Behr, Mr. Karl Howellmale26.00.00.011136930.0000C148C
45189103Dooley, Mr. Patrickmale32.00.00.03703767.7500NaNQ

891 rows × 12 columns

2.4.5 任务五:使用Panads的merge方法和DataFrame的append方法:完成任务二和任务三的任务
#写入代码
result_up=pd.merge(text_left_up,text_right_up,left_index=True,right_index=True)
result_down=pd.merge(text_left_down,text_right_down,left_index=True,right_index=True)
result=result_up.append(result_down)
result
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.01.00.0A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.01.00.0PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.00.00.0STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01.00.011380353.1000C123S
4503Allen, Mr. William Henrymale35.00.00.03734508.0500NaNS
.......................................
44788702Montvila, Rev. Juozasmale27.00.00.021153613.0000NaNS
44888811Graham, Miss. Margaret Edithfemale19.00.00.011205330.0000B42S
44988903Johnston, Miss. Catherine Helen "Carrie"femaleNaN1.02.0W./C. 660723.4500NaNS
45089011Behr, Mr. Karl Howellmale26.00.00.011136930.0000C148C
45189103Dooley, Mr. Patrickmale32.00.00.03703767.7500NaNQ

891 rows × 12 columns

利用merge来做纵列的合并
result_left=pd.merge(text_left_up,text_left_down,how='outer')
text_right_up['key']=0
text_right_down['key']=1
text_right_up
text_right_down
result_right=pd.merge(text_right_up.loc[:438],text_right_down,how='outer')
result=pd.merge(result_left,result_right.iloc[:,:-1],left_index=True,right_index=True)
result
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.01.00.0A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.01.00.0PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.00.00.0STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01.00.011380353.1000C123S
4503Allen, Mr. William Henrymale35.00.00.03734508.0500NaNS
.......................................
88688702Montvila, Rev. Juozasmale27.00.00.021153613.0000NaNS
88788811Graham, Miss. Margaret Edithfemale19.00.00.011205330.0000B42S
88888903Johnston, Miss. Catherine Helen "Carrie"femaleNaN1.02.0W./C. 660723.4500NaNS
88989011Behr, Mr. Karl Howellmale26.00.00.011136930.0000C148C
89089103Dooley, Mr. Patrickmale32.00.00.03703767.7500NaNQ

891 rows × 12 columns

【思考】对比merge、join以及concat的方法的不同以及相同。思考一下在任务四和任务五的情况下,为什么都要求使用DataFrame的append方法,如何只要求使用merge或者join可不可以完成任务四和任务五呢?

  • append像序列中添加行
  • assign 更像是左连接,一列一列的像序列中添加列
  • combine 求两个表的并集
  • update 把一个表里面的元素参照另一个表进行更新
  • concat 可以将两个表在两个维度进行拼接
  • merge 通过指定一个key来将两个表进行连接
  • join 也可以指定key,默认的话就是按照index来进行连接
2.4.6 任务六:完成的数据保存为result.csv
#写入代码
result.to_csv('result.csv')

2.5 换一种角度看数据

2.5.1 任务一:将我们的数据变为Series类型的数据

stack函数可以看做将横向的索引放到纵向,因此功能类似与melt,参数level可指定变化的列索引是哪一层(或哪几层,需要列表)

#写入代码
text = pd.read_csv('result.csv')
text.head()

Unnamed: 0PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
00103Braund, Mr. Owen Harrismale22.01.00.0A/5 211717.2500NaNS
11211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.01.00.0PC 1759971.2833C85C
22313Heikkinen, Miss. Lainafemale26.00.00.0STON/O2. 31012827.9250NaNS
33411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01.00.011380353.1000C123S
44503Allen, Mr. William Henrymale35.00.00.03734508.0500NaNS
#写入代码
unit_result=text.stack().head(20)
unit_result

0  Unnamed: 0                                                     0
   PassengerId                                                    1
   Survived                                                       0
   Pclass                                                         3
   Name                                     Braund, Mr. Owen Harris
   Sex                                                         male
   Age                                                           22
   SibSp                                                          1
   Parch                                                          0
   Ticket                                                 A/5 21171
   Fare                                                        7.25
   Embarked                                                       S
1  Unnamed: 0                                                     1
   PassengerId                                                    2
   Survived                                                       1
   Pclass                                                         1
   Name           Cumings, Mrs. John Bradley (Florence Briggs Th...
   Sex                                                       female
   Age                                                           38
   SibSp                                                          1
dtype: object
unit_result.unstack()
Unnamed: 0PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareEmbarked
00103Braund, Mr. Owen Harrismale2210A/5 211717.25S
11211Cumings, Mrs. John Bradley (Florence Briggs Th...female381NaNNaNNaNNaN
#将代码保存为unit_result,csv
unit_result.to_csv('unit_result.csv')
test = pd.read_csv('unit_result.csv')
test.head()
Unnamed: 0Unnamed: 10
00Unnamed: 00
10PassengerId1
20Survived0
30Pclass3
40NameBraund, Mr. Owen Harris

test.iloc[:,1:]
Unnamed: 10
0Unnamed: 00
1PassengerId1
2Survived0
3Pclass3
4NameBraund, Mr. Owen Harris
5Sexmale
6Age22.0
7SibSp1.0
8Parch0.0
9TicketA/5 21171
10Fare7.25
11EmbarkedS
12Unnamed: 01
13PassengerId2
14Survived1
15Pclass1
16NameCumings, Mrs. John Bradley (Florence Briggs Th...
17Sexfemale
18Age38.0
19SibSp1.0
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  20 non-null     int64 
 1   Unnamed: 1  20 non-null     object
 2   0           20 non-null     object
dtypes: int64(1), object(2)
memory usage: 608.0+ bytes
# test.pivot_table(index=test.index,columns='Unnamed: 1',values='0')

复习:在前面我们已经学习了Pandas基础,第二章我们开始进入数据分析的业务部分,在第二章第一节的内容中,我们学习了数据的清洗,这一部分十分重要,只有数据变得相对干净,我们之后对数据的分析才可以更有力。而这一节,我们要做的是数据重构,数据重构依旧属于数据理解(准备)的范围。

开始之前,导入numpy、pandas包和数据
# 导入基本库
import pandas as pd
import numpy as np
# 载入上一个任务人保存的文件中:result.csv,并查看这个文件
text = pd.read_csv('result.csv')
text.head()
Unnamed: 0PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
00103Braund, Mr. Owen Harrismale22.01.00.0A/5 211717.2500NaNS
11211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.01.00.0PC 1759971.2833C85C
22313Heikkinen, Miss. Lainafemale26.00.00.0STON/O2. 31012827.9250NaNS
33411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01.00.011380353.1000C123S
44503Allen, Mr. William Henrymale35.00.00.03734508.0500NaNS

2 第二章:数据重构

第一部分:数据聚合与运算

2.6 数据运用

2.6.1 任务一:通过教材《Python for Data Analysis》P303、Google or anything来学习了解GroupBy机制
#写入心得

2.4.2:任务二:计算泰坦尼克号男性与女性的平均票价
# 写入代码
text.groupby('Sex')['Fare'].mean()

Sex
female    44.479818
male      25.523893
Name: Fare, dtype: float64

在了解GroupBy机制之后,运用这个机制完成一系列的操作,来达到我们的目的。

下面通过几个任务来熟悉GroupBy机制。

2.4.3:任务三:统计泰坦尼克号中男女的存活人数
# 写入代码
text.groupby('Sex')['Survived'].sum()
Sex
female    233
male      109
Name: Survived, dtype: int64
text.groupby('Sex')['Survived'].count()
Sex
female    314
male      577
Name: Survived, dtype: int64
text.groupby('Sex')['Survived'].sum()/text.groupby('Sex')['Survived'].count()
Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64
2.4.4:任务四:计算客舱不同等级的存活人数
# 写入代码
text.groupby('Pclass')['Survived'].sum()
Pclass
1    136
2     87
3    119
Name: Survived, dtype: int64
text.groupby('Pclass')['Survived'].count()
Pclass
1    216
2    184
3    491
Name: Survived, dtype: int64
text.groupby('Pclass')['Survived'].sum()/text.groupby('Pclass')['Survived'].count()
Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64
统计不同等级客舱中男女比例
text.groupby(['Pclass','Sex']).head(1)
Unnamed: 0PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
00103Braund, Mr. Owen Harrismale22.01.00.0A/5 211717.2500NaNS
11211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.01.00.0PC 1759971.2833C85C
22313Heikkinen, Miss. Lainafemale26.00.00.0STON/O2. 31012827.9250NaNS
66701McCarthy, Mr. Timothy Jmale54.00.00.01746351.8625E46S
991012Nasser, Mrs. Nicholas (Adele Achem)female14.01.00.023773630.0708NaNC
17171812Williams, Mr. Charles EugenemaleNaN0.00.024437313.0000NaNS
text.groupby(['Pclass','Sex'])['PassengerId'].count()/[94+76+144,122+108+347,94+76+144,122+108+347,94+76+144,122+108+347]#['Sex']=='male'
Pclass  Sex   
1       female    0.299363
        male      0.211438
2       female    0.242038
        male      0.187175
3       female    0.458599
        male      0.601386
Name: PassengerId, dtype: float64

提示:】表中的存活那一栏,可以发现如果还活着记为1,死亡记为0

思考】从数据分析的角度,上面的统计结果可以得出那些结论

#思考心得
女性更倾向与买贵一些的票,同时女性的存活率显著的比男性高,对比不同仓位中男女所占的比例也可以看出,相比男性在女性的群体中
更倾向于住好一些的客舱,这可能也就是存活率高的原因。

【思考】从任务二到任务四中,这些运算可以通过agg()函数来同时计算。并且可以使用rename函数修改列名。你可以按照提示写出这个过程吗?

#思考心得
text.groupby('Sex').agg({'Fare':[('rename_mean','mean')],'Survived':[('rename_sum','sum')]})


FareSurvived
rename_meanrename_sum
Sex
female44.479818233
male25.523893109
text.groupby('Pclass').agg({'Survived':['sum']}).rename(columns={'sum':'rename_sum'})
Survived
rename_sum
Pclass
1136
287
3119
2.4.5:任务五:统计在不同等级的票中的不同年龄的船票花费的平均值
# 写入代码
text.groupby(['Pclass','Age'])['Fare'].mean()
# text['P_A_Fare_mean']=text.groupby(['Pclass','Age'])['Fare'].mean()
Pclass  Age  
1       0.92     151.5500
        2.00     151.5500
        4.00      81.8583
        11.00    120.0000
        14.00    120.0000
                   ...   
3       61.00      6.2375
        63.00      9.5875
        65.00      7.7500
        70.50      7.7500
        74.00      7.7750
Name: Fare, Length: 182, dtype: float64
text['P_A_Fare_mean']=text['Fare']
for name ,group in text.groupby(['Pclass','Age']):
    text.loc[group.index,'P_A_Fare_mean']= pd.Series(group['Fare'].mean(),index=group.index,name='P_A_Fare_mean')
#     display(text.loc[group.index]['P_A_Fare_mean'])
text.head()
Unnamed: 0PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedP_A_Fare_mean
00103Braund, Mr. Owen Harrismale22.01.00.0A/5 211717.2500NaNS7.988330
11211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.01.00.0PC 1759971.2833C85C103.711800
22313Heikkinen, Miss. Lainafemale26.00.00.0STON/O2. 31012827.9250NaNS14.158036
33411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01.00.011380353.1000C123S165.744911
44503Allen, Mr. William Henrymale35.00.00.03734508.0500NaNS9.736800
2.4.6:任务六:将任务二和任务三的数据合并,并保存到sex_fare_survived.csv
# 写入代码
text.groupby('Sex').agg({'Fare':[('rename_mean','mean')],'Survived':[('rename_sum','sum')]})
FareSurvived
rename_meanrename_sum
Sex
female44.479818233
male25.523893109
mean=text.groupby('Sex')['Fare'].mean()
sur=text.groupby('Sex')['Survived'].sum()
display(mean)
display(sur)
display(pd.merge(mean,sur,on='Sex'))
display(mean.to_frame().join(sur))
display(pd.concat([mean.to_frame(),sur.to_frame()],axis=1))
Sex
female    44.479818
male      25.523893
Name: Fare, dtype: float64



Sex
female    233
male      109
Name: Survived, dtype: int64
FareSurvived
Sex
female44.479818233
male25.523893109
FareSurvived
Sex
female44.479818233
male25.523893109
FareSurvived
Sex
female44.479818233
male25.523893109
result =pd.merge(mean,sur,on='Sex')
result.to_csv('sex_fare_survived.csv')
2.4.7:任务七:得出不同年龄的总的存活人数,然后找出存活人数的最高的年龄,最后计算存活人数最高的存活率(存活人数/总人数)
# 写入代码
text.groupby('Age')['Survived'].sum()#.max()
Age
0.42     1
0.67     1
0.75     2
0.83     2
0.92     1
        ..
70.00    0
70.50    0
71.00    0
74.00    0
80.00    1
Name: Survived, Length: 88, dtype: int64
# 写入代码
#存活人数的最高的年龄
text.groupby('Age')['Survived'].sum().idxmax()
24.0
# 写入代码
# 存活人数的最高的年龄的存活人数
text.groupby('Age')['Survived'].sum().loc[24.0]
15
# 写入代码
#最后计算存活人数最高的存活率(存活人数/总人数)
text.groupby('Age')['Survived'].sum().loc[24.0]/text.groupby('Age')['Survived'].count().loc[24.0]
0.5
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值