Data Analysis from Scratch with a Kaggle Project: Titanic (6)

This post works through the Titanic dataset, using data joins (concat/merge/join/append), reshaping (stack/unstack), and groupby operations to analyze how survival relates to fare, sex, and cabin class. The results show that passengers with higher fares, women, and first-class passengers had better odds of survival, revealing how unequal the chances of survival were.

This is Section 2.3 of Chapter 2, covering data joins with concat/merge/join/append, the reshaping functions stack/unstack, groupby grouping, and survival-rate calculations.

# title: "Kaggle Titanic Project 1__1.3"
# author: "小鱼"
# date: "2021-12-18"
import pandas as pd
import numpy as np
df1 = pd.read_csv("train-left-up.csv")
df2 = pd.read_csv("train-left-down.csv")
df3 = pd.read_csv("train-right-up.csv")
df4 = pd.read_csv("train-right-down.csv")

Data Reshaping

Task 2: use concat to join train-left-up.csv and train-right-up.csv horizontally into one table, and save it as result_up.

result_up = pd.concat([df1, df3], axis=1)  # concat, horizontal join
result_up
     PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket      Fare        Cabin Embarked
0              1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171    7.2500          NaN        S
1              2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599   71.2833          C85        C
2              3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282    7.9250          NaN        S
3              4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803   53.1000         C123        S
4              5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450    8.0500          NaN        S
..           ...       ...     ...                                                ...     ...   ...    ...    ...               ...       ...          ...      ...
434          435         0       1                          Silvey, Mr. William Baird    male  50.0      1      0             13507   55.9000          E44        S
435          436         1       1                          Carter, Miss. Lucile Polk  female  14.0      1      2            113760  120.0000      B96 B98        S
436          437         0       3               Ford, Miss. Doolina Margaret "Daisy"  female  21.0      2      2        W./C. 6608   34.3750          NaN        S
437          438         1       2              Richards, Mrs. Sidney (Emily Hocking)  female  24.0      2      3             29106   18.7500          NaN        S
438          439         0       1                                  Fortune, Mr. Mark    male  64.0      1      4             19950  263.0000  C23 C25 C27        S

439 rows × 12 columns
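The effect of concat's axis argument can be sketched on two toy frames (the names `left`/`right` and their values are illustrative, not part of the Titanic data):

```python
import pandas as pd

# Toy frames to illustrate concat's axis argument.
left = pd.DataFrame({"a": [1, 2]})
right = pd.DataFrame({"b": [3, 4]})

wide = pd.concat([left, right], axis=1)  # side by side: 2 rows x 2 columns
tall = pd.concat([left, left], axis=0)   # stacked: 4 rows x 1 column

print(wide.shape)  # (2, 2)
print(tall.shape)  # (4, 1)
```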

# Task 3: use concat to join train-left-down and train-right-down horizontally as result_down, then stack result_up and result_down vertically as result
result_down = pd.concat([df2, df4], axis=1)  # concat, horizontal join
result_down

result = pd.concat([result_up, result_down], axis=0)  # concat, vertical join
result
     PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0              1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1              2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2              3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3              4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4              5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
..           ...       ...     ...                                                ...     ...   ...    ...    ...               ...      ...   ...      ...
447          887         0       2                              Montvila, Rev. Juozas    male  27.0      0      0            211536  13.0000   NaN        S
448          888         1       1                       Graham, Miss. Margaret Edith  female  19.0      0      0            112053  30.0000   B42        S
449          889         0       3           Johnston, Miss. Catherine Helen "Carrie"  female   NaN      1      2        W./C. 6607  23.4500   NaN        S
450          890         1       1                              Behr, Mr. Karl Howell    male  26.0      0      0            111369  30.0000  C148        C
451          891         0       3                                Dooley, Mr. Patrick    male  32.0      0      0            370376   7.7500   NaN        Q

891 rows × 12 columns

(Note the tail labels read 447 to 451 rather than 886 to 890: concat keeps each piece's original row index; pass ignore_index=True to renumber.)

# Use the DataFrame methods join and append to redo Tasks 2 and 3
result_up = df1.join(df3)   # join on the row index
result_down = df2.join(df4)

result = result_up.append(result_down)  # append (removed in pandas 2.0; use pd.concat instead)
result
(output identical to the 891 rows × 12 columns table above)

# Use pandas' merge and DataFrame.append to redo Tasks 2 and 3
result_up = pd.merge(df1, df3, left_index=True, right_index=True)
result_down = pd.merge(df2, df4, left_index=True, right_index=True)
result = result_up.append(result_down)
result

# Compare merge, join, and concat: differences and similarities

# DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
# 1. merge characteristics:
# 1.1 by default, overlapping column names are used as the join keys
# 1.2 the default join type is an INNER JOIN
# 1.3 multi-key joins work by passing a list of columns to 'on'
# 1.4 if the key columns are named differently in the two frames, use left_on and right_on
# 1.5 row indexes can also serve as join keys via left_index=True, right_index=True, but join is usually the cleaner choice there

# DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)         # lsuffix/rsuffix must be given when column names overlap
# df1.join(df2.set_index('key2'), on='key1')                                             # 1.1 different key names, overlapping key values
# df1.join(df2.set_index('key'), on='key', lsuffix='_col1', rsuffix='_col2')             # 1.2 same key name, overlapping key values
# df1.join(df2, lsuffix="_l")                                                            # 1.3 different names, different values

# pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)
# concat joins along an axis: axis=0 (the default) stacks frames vertically, axis=1 places them side by side.
# ignore_index defaults to False; set it to True to renumber the rows of the combined result.
(output identical to the 891 rows × 12 columns table above)
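When the join key is the row index, the three routes compared above all land on the same table; a minimal sketch on made-up frames (`df_a`, `df_b` are illustrative, not the Titanic pieces):

```python
import pandas as pd

# Small stand-in frames sharing the default 0..2 row index.
df_a = pd.DataFrame({"x": [1, 2, 3]})
df_b = pd.DataFrame({"y": [4, 5, 6]})

# Three routes to the same side-by-side result when joining on the row index:
via_concat = pd.concat([df_a, df_b], axis=1)
via_join = df_a.join(df_b)
via_merge = pd.merge(df_a, df_b, left_index=True, right_index=True)

print(via_concat.equals(via_join) and via_join.equals(via_merge))  # True
```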

result.to_csv("result.csv")  # save the data
# Convert the data to a Series
# load the full table back in
text = pd.read_csv('result.csv')
text.head()
# code goes here
unit_result = text.stack().head(20)   # stack "rotates" columns into rows; unstack rotates rows back into columns; the two are inverses
unit_result

# save the result as unit_result.csv
unit_result.to_csv('unit_result.csv')
test = pd.read_csv('unit_result.csv')
test
    Unnamed: 0   Unnamed: 1                                                  0
0            0   Unnamed: 0                                                  0
1            0  PassengerId                                                  1
2            0     Survived                                                  0
3            0       Pclass                                                  3
4            0         Name                            Braund, Mr. Owen Harris
5            0          Sex                                               male
6            0          Age                                               22.0
7            0        SibSp                                                  1
8            0        Parch                                                  0
9            0       Ticket                                          A/5 21171
10           0         Fare                                               7.25
11           0     Embarked                                                  S
12           1   Unnamed: 0                                                  1
13           1  PassengerId                                                  2
14           1     Survived                                                  1
15           1       Pclass                                                  1
16           1         Name  Cumings, Mrs. John Bradley (Florence Briggs Th...
17           1          Sex                                             female
18           1          Age                                               38.0
19           1        SibSp                                                  1
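That stack and unstack are inverses can be checked on a toy frame (the values here are made up):

```python
import pandas as pd

# A tiny frame to show that stack and unstack undo each other.
df = pd.DataFrame({"Fare": [7.25, 71.28], "Age": [22.0, 38.0]})

stacked = df.stack()          # Series with a (row, column) MultiIndex
restored = stacked.unstack()  # rotate the column level back out

print(restored.equals(df))  # True
```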

Data Aggregation and Computation

df = pd.read_csv("result.csv")  # load the data
df.head(5)
   Unnamed: 0  PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0           0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1           1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2           2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3           3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4           4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
# Task 1: learn the GroupBy mechanism from Python for Data Analysis p. 303, Google, or anywhere else
# DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=NoDefault.no_default, observed=False, dropna=True)
# groupby splits the data into groups by one or more keys, then lets you analyze each group: counts, max/min, mean, and so on
# Task 2: compute the mean fare for male and female passengers
df1 = df['Fare'].groupby(df['Sex'])
means = df1.mean()


# df.groupby(by=["Sex"]).mean()
# Task 3: count the survivors among men and women
df2 = df['Survived'].groupby(df['Sex'])
Survived_Sex = df2.sum()
# Task 4: count the survivors in each cabin class
df3 = df['Survived'].groupby(df['Pclass'])
df3.sum()
Pclass
1    136
2     87
3    119
Name: Survived, dtype: int64
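The split-apply-combine pattern behind these calls can be sketched on a made-up miniature of the Titanic columns (the values in `toy` are invented for illustration):

```python
import pandas as pd

# Made-up miniature of the Titanic columns to show split-apply-combine.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Fare": [7.0, 71.0, 8.0, 8.0],
    "Survived": [0, 1, 1, 0],
})

# Split rows by Sex, apply an aggregate to each group, combine the results.
mean_fare = toy.groupby("Sex")["Fare"].mean()
survivors = toy.groupby("Sex")["Survived"].sum()

print(mean_fare["female"])  # 39.5
print(survivors["male"])    # 0
```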

From a data-analysis standpoint, these statistics support the following conclusions:
passengers who paid higher fares were more likely to survive, women survived at a higher rate than men, and first class had the highest survival rate.

# The computations in Tasks 2 and 3 can be done in one call with agg(), and rename() can relabel the output columns. Can you write that out from these hints?
# DataFrame.agg(func=None, axis=0, *args, **kwargs)
df.groupby('Sex').agg({'Fare': 'mean', 'Survived': 'sum'}).rename(columns=
                            {'Fare': 'mean_fare', 'Survived': 'Survived_pclass'})


        mean_fare  Survived_pclass
Sex
female  44.479818              233
male    25.523893              109
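As an alternative to agg plus rename, pandas' named aggregation sets the output column names in one step; a sketch on a made-up miniature of the data (values invented for illustration):

```python
import pandas as pd

# Made-up miniature; named aggregation names the output columns directly,
# so the separate rename() step is not needed.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Fare": [7.0, 71.0, 8.0, 8.0],
    "Survived": [0, 1, 1, 0],
})

out = toy.groupby("Sex").agg(
    mean_fare=("Fare", "mean"),
    survived_sum=("Survived", "sum"),
)
print(out.loc["female", "mean_fare"])  # 39.5
```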
# Task 5: compute the mean fare for each age within each ticket class
text.groupby(['Pclass','Age'])['Fare'].mean().head()
Pclass  Age  
1       0.92     151.5500
        2.00     151.5500
        4.00      81.8583
        11.00    120.0000
        14.00    120.0000
Name: Fare, dtype: float64
# Task 6: merge the Task 2 and Task 3 results and save them to sex_fare_survived.csv
result = pd.merge(means, Survived_Sex, on="Sex")
result.to_csv("sex_fare_survived.csv")
# Get the total survivors at each age, find the age with the most survivors, then with _sum = text['Survived'].sum() compute that peak count as a share of all passengers (survivors / total)
survived_age = df['Survived'].groupby(df['Age']).sum()        # total survivors per age
survived_age.head()
Age
0.42    1
0.67    1
0.75    2
0.83    2
0.92    1
Name: Survived, dtype: int64
survived_age[survived_age.values == survived_age.max()]  # age with the most survivors
Age
24.0    15
Name: Survived, dtype: int64
_sum = text['Survived'].sum()   
_sum
342
_sum = text['Survived'].count()                  # total passengers, for the share below (peak survivors / total)
print("total passengers: " + str(_sum))
total passengers: 891
percent = survived_age.max() / _sum
print("max survival rate: " + str(percent))
max survival rate: 0.016835016835016835
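Note that this figure divides the peak age's survivors by all 891 passengers. A per-age survival rate would instead divide by the passengers of that age, which is just the group mean of Survived; a sketch on made-up data (ages and outcomes invented):

```python
import pandas as pd

# Made-up ages and outcomes: the per-age survival rate is survivors at an
# age divided by passengers of that age, i.e. the group mean of Survived.
toy = pd.DataFrame({
    "Age": [24, 24, 24, 30, 30],
    "Survived": [1, 1, 0, 0, 1],
})

rate_by_age = toy.groupby("Age")["Survived"].mean()
print(rate_by_age[24])  # 2 of the 3 age-24 passengers survived
```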

This wrapped up Section 2.3 of Chapter 2: data joins with concat/merge/join/append, reshaping with stack/unstack, groupby grouping, and survival-rate calculations. The results show that women survived at a higher rate than men, perhaps a trace of "ladies first" chivalry, and across cabin classes the pricier tickets came with better odds. You get what you pay for: lives were not equal, and neither were the chances of survival.
