第二章:第二三节数据重构1-课程

复习:在前面我们已经学习了Pandas基础,第二章我们开始进入数据分析的业务部分,在第二章第一节的内容中,我们学习了数据的清洗,这一部分十分重要,只有数据变得相对干净,我们之后对数据的分析才可以更有力。而这一节,我们要做的是数据重构,数据重构依旧属于数据理解(准备)的范围。

开始之前,导入numpy、pandas包和数据
# 导入基本库
import numpy as np
import pandas as pd
# 载入data文件中的:train-left-up.csv
train_left_up_csv=pd.read_csv('./data/train-left-up.csv')
train_left_up_csv
PassengerIdSurvivedPclassName
0103Braund, Mr. Owen Harris
1211Cumings, Mrs. John Bradley (Florence Briggs Th...
2313Heikkinen, Miss. Laina
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)
4503Allen, Mr. William Henry
...............
43443501Silvey, Mr. William Baird
43543611Carter, Miss. Lucile Polk
43643703Ford, Miss. Doolina Margaret "Daisy"
43743812Richards, Mrs. Sidney (Emily Hocking)
43843901Fortune, Mr. Mark

439 rows × 4 columns

2 第二章:数据重构

2.4 数据的合并

2.4.1 任务一:将data文件夹里面的所有数据都载入,观察数据的之间的关系
#写入代码
train_left_down_csv=pd.read_csv('./data/train-left-down.csv')
train_left_down_csv.head(20)
PassengerIdSurvivedPclassName
044002Kvillner, Mr. Johan Henrik Johannesson
144112Hart, Mrs. Benjamin (Esther Ada Bloomfield)
244203Hampe, Mr. Leon
344303Petterson, Mr. Johan Emil
444412Reynaldo, Ms. Encarnacion
544513Johannesen-Bratthammer, Mr. Bernt
644611Dodge, Master. Washington
744712Mellinger, Miss. Madeleine Violet
844811Seward, Mr. Frederic Kimber
944913Baclini, Miss. Marie Catherine
1045011Peuchen, Major. Arthur Godfrey
1145102West, Mr. Edwy Arthur
1245203Hagland, Mr. Ingvald Olai Olsen
1345301Foreman, Mr. Benjamin Laventall
1445411Goldenberg, Mr. Samuel L
1545503Peduzzi, Mr. Joseph
1645613Jalsevac, Mr. Ivan
1745701Millet, Mr. Francis Davis
1845811Kenyon, Mrs. Frederick R (Marion)
1945912Toomey, Miss. Ellen
#写入代码
train_right_down_csv=pd.read_csv('./data/train-right-down.csv')
train_right_down_csv.head(20)

SexAgeSibSpParchTicketFareCabinEmbarked
0male31.000C.A. 1872310.5000NaNS
1female45.011F.C.C. 1352926.2500NaNS
2male20.0003457699.5000NaNS
3male25.0103470767.7750NaNS
4female28.00023043413.0000NaNS
5maleNaN00653068.1125NaNS
6male4.0023363881.8583A34S
7female13.00125064419.5000NaNS
8male34.00011379426.5500NaNS
9female5.021266619.2583NaNC
10male52.00011378630.5000C104S
11male36.012C.A. 3465127.7500NaNS
12maleNaN106530319.9667NaNS
13male30.00011305127.7500C111C
14male49.0101745389.1042C92C
15maleNaN00A/5 28178.0500NaNS
16male29.0003492407.8958NaNC
17male65.0001350926.5500E38S
18femaleNaN101746451.8625D21S
19female50.000F.C.C. 1353110.5000NaNS
train_left_up_csv=pd.read_csv('./data/train-left-up.csv')
train_left_up_csv.head(20)
PassengerIdSurvivedPclassName
0103Braund, Mr. Owen Harris
1211Cumings, Mrs. John Bradley (Florence Briggs Th...
2313Heikkinen, Miss. Laina
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)
4503Allen, Mr. William Henry
5603Moran, Mr. James
6701McCarthy, Mr. Timothy J
7803Palsson, Master. Gosta Leonard
8913Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
91012Nasser, Mrs. Nicholas (Adele Achem)
101113Sandstrom, Miss. Marguerite Rut
111211Bonnell, Miss. Elizabeth
121303Saundercock, Mr. William Henry
131403Andersson, Mr. Anders Johan
141503Vestrom, Miss. Hulda Amanda Adolfina
151612Hewlett, Mrs. (Mary D Kingcome)
161703Rice, Master. Eugene
171812Williams, Mr. Charles Eugene
181903Vander Planke, Mrs. Julius (Emelia Maria Vande...
192013Masselmani, Mrs. Fatima
train_right_up_csv=pd.read_csv('./data/train-right-up.csv')
train_right_up_csv.head(20)
SexAgeSibSpParchTicketFareCabinEmbarked
0male22.010A/5 211717.2500NaNS
1female38.010PC 1759971.2833C85C
2female26.000STON/O2. 31012827.9250NaNS
3female35.01011380353.1000C123S
4male35.0003734508.0500NaNS
5maleNaN003308778.4583NaNQ
6male54.0001746351.8625E46S
7male2.03134990921.0750NaNS
8female27.00234774211.1333NaNS
9female14.01023773630.0708NaNC
10female4.011PP 954916.7000G6S
11female58.00011378326.5500C103S
12male20.000A/5. 21518.0500NaNS
13male39.01534708231.2750NaNS
14female14.0003504067.8542NaNS
15female55.00024870616.0000NaNS
16male2.04138265229.1250NaNQ
17maleNaN0024437313.0000NaNS
18female31.01034576318.0000NaNS
19femaleNaN0026497.2250NaNC

【提示】结合之前我们加载的train.csv数据,大致预测一下上面的数据是什么

2.4.2:任务二:使用concat方法:将数据train-left-up.csv和train-right-up.csv横向合并为一张表,并保存这张表为result_up
#写入代码  当axis = 1的时候,concat就是行对齐,然后将不同列名称的两张表合并
# 加上join参数的属性,如果为’inner’得到的是两表的交集,如果是outer,得到的是两表的并集。
result_up_csv=pd.concat([train_left_up_csv,train_right_up_csv],axis=1,join='inner')
result_up_csv.to_csv('./data/result_up.csv')
2.4.3 任务三:使用concat方法:将train-left-down和train-right-down横向合并为一张表,并保存这张表为result_down。然后将上边的result_up和result_down纵向合并为result。
#写入代码
result_down_csv=pd.concat([train_left_down_csv,train_right_down_csv],axis=1,join='inner')
result_down_csv.to_csv('./data/result_down.csv')
result=pd.concat([result_up_csv,result_down_csv],axis=0)
result
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
.......................................
44788702Montvila, Rev. Juozasmale27.00021153613.0000NaNS
44888811Graham, Miss. Margaret Edithfemale19.00011205330.0000B42S
44988903Johnston, Miss. Catherine Helen "Carrie"femaleNaN12W./C. 660723.4500NaNS
45089011Behr, Mr. Karl Howellmale26.00011136930.0000C148C
45189103Dooley, Mr. Patrickmale32.0003703767.7500NaNQ

891 rows × 12 columns

2.4.4 任务四:使用DataFrame自带的方法join方法和append:完成任务二和任务三的任务
#写入代码  join中没有axis参数,所以该方法只能实现按行拼接。
# append 向DataFrame中加入新的行,如果列名不在对象中,则加入新的列。
result_up = train_left_up_csv.join(train_right_up_csv)
result_down = train_left_down_csv.join(train_right_down_csv)
result = result_up.append(result_down)
result
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
.......................................
44788702Montvila, Rev. Juozasmale27.00021153613.0000NaNS
44888811Graham, Miss. Margaret Edithfemale19.00011205330.0000B42S
44988903Johnston, Miss. Catherine Helen "Carrie"femaleNaN12W./C. 660723.4500NaNS
45089011Behr, Mr. Karl Howellmale26.00011136930.0000C148C
45189103Dooley, Mr. Patrickmale32.0003703767.7500NaNQ

891 rows × 12 columns

2.4.5 任务五:使用Panads的merge方法和DataFrame的append方法:完成任务二和任务三的任务
#写入代码
# merge()(通常以left.merge(right)方式使用)方法和join()功能上有些类似,但merge()方法中caller可以和right中的任意index或column进行匹配
result_up=train_left_up_csv.merge(train_right_up_csv,left_index=True,right_index=True)
result_down=train_left_down_csv.merge(train_right_down_csv,left_index=True,right_index=True)
result=resul_up.append(result_down)
result
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
.......................................
44788702Montvila, Rev. Juozasmale27.00021153613.0000NaNS
44888811Graham, Miss. Margaret Edithfemale19.00011205330.0000B42S
44988903Johnston, Miss. Catherine Helen "Carrie"femaleNaN12W./C. 660723.4500NaNS
45089011Behr, Mr. Karl Howellmale26.00011136930.0000C148C
45189103Dooley, Mr. Patrickmale32.0003703767.7500NaNQ

891 rows × 12 columns

【思考】对比merge、join以及concat的方法的不同以及相同。思考一下在任务四和任务五的情况下,为什么都要求使用DataFrame的append方法,如何只要求使用merge或者join可不可以完成任务四和任务五呢?

2.4.6 任务六:完成的数据保存为result.csv
#写入代码
result.to_csv('./data/result.csv')

2.5 换一种角度看数据

2.5.1 任务一:将我们的数据变为Series类型的数据
#写入代码
train_result=pd.read_csv('./data/result.csv')
train_result
Unnamed: 0PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
00103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
11211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
22313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
33411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
44503Allen, Mr. William Henrymale35.0003734508.0500NaNS
..........................................
88644788702Montvila, Rev. Juozasmale27.00021153613.0000NaNS
88744888811Graham, Miss. Margaret Edithfemale19.00011205330.0000B42S
88844988903Johnston, Miss. Catherine Helen "Carrie"femaleNaN12W./C. 660723.4500NaNS
88945089011Behr, Mr. Karl Howellmale26.00011136930.0000C148C
89045189103Dooley, Mr. Patrickmale32.0003703767.7500NaNQ

891 rows × 13 columns

#写入代码
unit_result=train_result.stack()
type(unit_result.head(20))

pandas.core.series.Series
unit_result.to_csv('./data/unit_result.csv')
unit_result_csv=pd.read_csv('./data/unit_result.csv')
unit_result_csv
Unnamed: 0Unnamed: 10
00Unnamed: 00
10PassengerId1
20Survived0
30Pclass3
40NameBraund, Mr. Owen Harris
............
10712890SibSp0
10713890Parch0
10714890Ticket370376
10715890Fare7.75
10716890EmbarkedQ

10717 rows × 3 columns

# 笔记 stack和unstack两个函数。stack的意思是堆叠,堆积,unstack即“不要堆叠”,
unit_result_csv_csv=unit_result.unstack()
unit_result_csv_csv.to_csv('./data/unit_result_csv_csv.csv')
unit_result_csv_csv=pd.read_csv('./data/unit_result_csv_csv.csv')
unit_result_csv_csv.head(20)
Unnamed: 0Unnamed: 0.1PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
000103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
111211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
222313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
333411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
444503Allen, Mr. William Henrymale35.0003734508.0500NaNS
555603Moran, Mr. JamesmaleNaN003308778.4583NaNQ
666701McCarthy, Mr. Timothy Jmale54.0001746351.8625E46S
777803Palsson, Master. Gosta Leonardmale2.03134990921.0750NaNS
888913Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)female27.00234774211.1333NaNS
9991012Nasser, Mrs. Nicholas (Adele Achem)female14.01023773630.0708NaNC
1010101113Sandstrom, Miss. Marguerite Rutfemale4.011PP 954916.7000G6S
1111111211Bonnell, Miss. Elizabethfemale58.00011378326.5500C103S
1212121303Saundercock, Mr. William Henrymale20.000A/5. 21518.0500NaNS
1313131403Andersson, Mr. Anders Johanmale39.01534708231.2750NaNS
1414141503Vestrom, Miss. Hulda Amanda Adolfinafemale14.0003504067.8542NaNS
1515151612Hewlett, Mrs. (Mary D Kingcome)female55.00024870616.0000NaNS
1616161703Rice, Master. Eugenemale2.04138265229.1250NaNQ
1717171812Williams, Mr. Charles EugenemaleNaN0024437313.0000NaNS
1818181903Vander Planke, Mrs. Julius (Emelia Maria Vande...female31.01034576318.0000NaNS
1919192013Masselmani, Mrs. FatimafemaleNaN0026497.2250NaNC

复习:在前面我们已经学习了Pandas基础,第二章我们开始进入数据分析的业务部分,在第二章第一节的内容中,我们学习了数据的清洗,这一部分十分重要,只有数据变得相对干净,我们之后对数据的分析才可以更有力。而这一节,我们要做的是数据重构,数据重构依旧属于数据理解(准备)的范围。

开始之前,导入numpy、pandas包和数据
# 导入基本库
import numpy as np
import pandas as pd
# 载入上一个任务人保存的文件中:result.csv,并查看这个文件
result_csv=pd.read_csv('./data/result.csv')
result_csv
Unnamed: 0PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
00103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
11211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
22313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
33411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
44503Allen, Mr. William Henrymale35.0003734508.0500NaNS
..........................................
88644788702Montvila, Rev. Juozasmale27.00021153613.0000NaNS
88744888811Graham, Miss. Margaret Edithfemale19.00011205330.0000B42S
88844988903Johnston, Miss. Catherine Helen "Carrie"femaleNaN12W./C. 660723.4500NaNS
88945089011Behr, Mr. Karl Howellmale26.00011136930.0000C148C
89045189103Dooley, Mr. Patrickmale32.0003703767.7500NaNQ

891 rows × 13 columns

2 第二章:数据重构

第一部分:数据聚合与运算

2.6 数据运用

2.6.1 任务一:通过教材《Python for Data Analysis》P303、Google or anything来学习了解GroupBy机制
#写入心得

2.4.2:任务二:计算泰坦尼克号男性与女性的平均票价
# 写入代码
result_csv_Sex=result_csv['Fare'].groupby(result_csv['Sex'])
result_csv_Sex=result_csv_Sex.mean()
result_csv_Sex
Sex
female    44.479818
male      25.523893
Name: Fare, dtype: float64

在了解GroupBy机制之后,运用这个机制完成一系列的操作,来达到我们的目的。

下面通过几个任务来熟悉GroupBy机制。

2.4.3:任务三:统计泰坦尼克号中男女的存活人数
# 写入代码
result_Group=result_csv['Survived'].groupby(result_csv['Sex'])
result_Group=result_Group.sum()
result_Group
Sex
female    233
male      109
Name: Survived, dtype: int64
2.4.4:任务四:计算客舱不同等级的存活人数
# 写入代码
result_Group_Ticket=result_csv['Survived'].groupby(result_csv['Pclass'])
result_Group_Ticket.sum()
Pclass
1    136
2     87
3    119
Name: Survived, dtype: int64

提示:】表中的存活那一栏,可以发现如果还活着记为1,死亡记为0

思考】从数据分析的角度,上面的统计结果可以得出那些结论

#思考心得 


【思考】从任务二到任务三中,这些运算可以通过agg()函数来同时计算。并且可以使用rename函数修改列名。你可以按照提示写出这个过程吗?

#思考心得  agg()函数是聚合函数 如果要不同列用求不同统计量,则用字典{‘行名/列名’,‘函数名’}指定。
result_csv.groupby('Sex').agg({'Fare': 'mean', 'Pclass': 'sum'}).rename(columns=
                            {'Fare': 'mean_fare', 'Pclass': 'count_pclass'})


mean_farecount_pclass
Sex
female44.479818678
male25.5238931379
2.4.5:任务五:统计在不同等级的票中的不同年龄的船票花费的平均值
# 写入代码
result_Group_Ticket_age=result_csv['Fare'].groupby([result_csv['Pclass'],result_csv['Age']])
result_Group_Ticket_age.mean()
Pclass  Age  
1       0.92     151.5500
        2.00     151.5500
        4.00      81.8583
        11.00    120.0000
        14.00    120.0000
                   ...   
3       61.00      6.2375
        63.00      9.5875
        65.00      7.7500
        70.50      7.7500
        74.00      7.7750
Name: Fare, Length: 182, dtype: float64
2.4.6:任务六:将任务二和任务三的数据合并,并保存到sex_fare_survived.csv
# 写入代码
result = pd.merge(result_csv_Sex,result_Group,on='Sex')
result
FareSurvived
Sex
female44.479818233
male25.523893109
2.4.7:任务七:得出不同年龄的总的存活人数,然后找出存活人数最多的年龄段,最后计算存活人数最高的存活率(存活人数/总人数)
# 写入代码
survived_age = result_csv['Survived'].groupby(result_csv['Age']).sum()
survived_age.head()
Age
0.42    1
0.67    1
0.75    2
0.83    2
0.92    1
Name: Survived, dtype: int64
# 写入代码
survived_age[survived_age.values==survived_age.max()]
Age
24.0    15
Name: Survived, dtype: int64
# 写入代码
_sum = result_csv['Survived'].sum()
print(_sum)
342
# 写入代码
precetn =survived_age.max()/_sum
precetn
0.043859649122807015


评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值