本第二章2.3节,主要内容包括数据的连接方式concat/merage/join/append,转置函数un/stack,groupby分组以及存活率计算。
# title: "Kaggle项目泰坦尼克号 1__1.3"
# author: "小鱼"
# date: "2021-12-18"
import pandas as pd
import numpy as np
df1= pd.read_csv("train-left-up.csv")
df2 = pd.read_csv("train-left-down.csv")
df3 = pd.read_csv("train-right-up.csv")
df4 = pd.read_csv("train-right-down.csv")
数据重塑
任务二:使用concat方法:将数据train-left-up.csv和train-right-up.csv横向合并为一张表,并保存这张表为result_up
result_up = pd.concat([df1, df3], axis=1) # concat横向连接
result_up
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
434 | 435 | 0 | 1 | Silvey, Mr. William Baird | male | 50.0 | 1 | 0 | 13507 | 55.9000 | E44 | S |
435 | 436 | 1 | 1 | Carter, Miss. Lucile Polk | female | 14.0 | 1 | 2 | 113760 | 120.0000 | B96 B98 | S |
436 | 437 | 0 | 3 | Ford, Miss. Doolina Margaret "Daisy" | female | 21.0 | 2 | 2 | W./C. 6608 | 34.3750 | NaN | S |
437 | 438 | 1 | 2 | Richards, Mrs. Sidney (Emily Hocking) | female | 24.0 | 2 | 3 | 29106 | 18.7500 | NaN | S |
438 | 439 | 0 | 1 | Fortune, Mr. Mark | male | 64.0 | 1 | 4 | 19950 | 263.0000 | C23 C25 C27 | S |
439 rows × 12 columns
# 任务三:使用concat方法:将train-left-down和train-right-down横向合并保存result_down,然后将上边的result_up和result_down纵向合并为result
result_down = pd.concat([df2, df4], axis=1) # concat横向连接
result_down
result = pd.concat([result_up,result_down], axis=0) # concat纵向连接
result
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
447 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
448 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
449 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
450 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
451 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
# 使用DataFrame自带的方法join方法和append:完成任务二和任务三的任务
resul_up = df1.join(df3) # join
result_down = df2.join(df4) # join
result = result_up.append(result_down) #append
result
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
447 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
448 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
449 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
450 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
451 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
# 使用Panads的merge方法和DataFrame的append方法:完成任务二和任务三的任务
result_up = pd.merge(df1,df3,left_index=True,right_index=True)
result_down = pd.merge(df2,df4,left_index=True,right_index=True)
result = resul_up.append(result_down)
result
# 对比merge、join以及concat的方法的不同以及相同
# DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
# 1.1 merge的特征
# 1.2 默认以重叠列名当做链接键
# 1.3 默认是INNER JOIN。
# 1.4 可以多键连接,'on'参数后传入多键列表即可
# 1.5 如果两个对象的列表不同,可以用left_on, right_on指定。
# 1.6 也可以用行索引当连接键,使用参数left_index=True, right_index=True. 但是这种情况下最好用JOIN
# DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False) # 列名重复的时候需要指定lsuffix, rsuffix参数
# df1.join(df2.set_index('key2'), on='key1') #1.1列名不同,列内容有相同
# df1.join(df2.set_index('key'), on='key',lsuffix='_col1', rsuffix='_col2') #1.2列名相同,内容有相同
# df1.join(df2, lsuffix="_l") #1.3列名不同,内容也不同
# pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)
# concat轴向连接,axis=1 在行中操作,axis=0是在列中操作,默认是axis=0,即垂直堆叠。ignore_index=T,默认情况下是False。如果设成了True,就是把结果的合并表重新编排行索引。
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
447 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
448 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
449 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
450 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
451 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
result.to_csv("result.csv") #保存数据
# 将我们的数据变为Series类型的数据
# 将完整的数据加载出来
text = pd.read_csv('result.csv')
text.head()
# 代码写在这里
unit_result=text.stack().head(20) # stack:将数据的列“旋转”为行,unstack:将数据的行“旋转”为列,互为一组逆运算
unit_result
#将代码保存为unit_result,csv
unit_result.to_csv('unit_result.csv')
test = pd.read_csv('unit_result.csv')
test
Unnamed: 0 | Unnamed: 1 | 0 | |
---|---|---|---|
0 | 0 | Unnamed: 0 | 0 |
1 | 0 | PassengerId | 1 |
2 | 0 | Survived | 0 |
3 | 0 | Pclass | 3 |
4 | 0 | Name | Braund, Mr. Owen Harris |
5 | 0 | Sex | male |
6 | 0 | Age | 22.0 |
7 | 0 | SibSp | 1 |
8 | 0 | Parch | 0 |
9 | 0 | Ticket | A/5 21171 |
10 | 0 | Fare | 7.25 |
11 | 0 | Embarked | S |
12 | 1 | Unnamed: 0 | 1 |
13 | 1 | PassengerId | 2 |
14 | 1 | Survived | 1 |
15 | 1 | Pclass | 1 |
16 | 1 | Name | Cumings, Mrs. John Bradley (Florence Briggs Th... |
17 | 1 | Sex | female |
18 | 1 | Age | 38.0 |
19 | 1 | SibSp | 1 |
数据聚合与运算
df = pd.read_csv("result.csv") #加载数据
df.head(5)
Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
# 任务一:通过教材《Python for Data Analysis》P303、Google or anything来学习了解GroupBy机制
# DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=NoDefault.no_default, observed=False, dropna=True)
# 根据一个或者多个字段,将数据划分为不同的组,然后进行进一步分析,比如求分组的数量,分组内的最大值最小值平均值等
# 任务二:计算泰坦尼克号男性与女性的平均票价
df1 = df['Fare'].groupby(df['Sex'])
means = df1.mean()
# df.groupby(by=["Sex"]).mean()
# 任务三:统计泰坦尼克号中男女的存活人数
df2 = df['Survived'].groupby(df['Sex'])
Survived_Sex = df2.sum()
# 任务四:计算客舱不同等级的存活人数
df3 = df['Survived'].groupby(df['Pclass'])
df3.sum()
Pclass
1 136
2 87
3 119
Name: Survived, dtype: int64
从数据分析的角度,上面的统计结果可以得出那些结论:
票价高的存活概率更大,女性存活率高于男性,头等舱存活概率最高
# 从任务二到任务三中,这些运算可以通过agg()函数来同时计算。并且可以使用rename函数修改列名。你可以按照提示写出这个过程吗?
# DataFrame.agg(func=None, axis=0, *args, **kwargs)
df.groupby('Sex').agg({'Fare': 'mean', 'Survived': 'sum'}).rename(columns=
{'Fare': 'mean_fare', 'Survived': 'Survived_pclass'})
mean_fare | Survived_pclass | |
---|---|---|
Sex | ||
female | 44.479818 | 233 |
male | 25.523893 | 109 |
# 任务五:统计在不同等级的票中的不同年龄的船票花费的平均值
text.groupby(['Pclass','Age'])['Fare'].mean().head()
Pclass Age
1 0.92 151.5500
2.00 151.5500
4.00 81.8583
11.00 120.0000
14.00 120.0000
Name: Fare, dtype: float64
# 任务六:将任务二和任务三的数据合并,并保存到sex_fare_survived.csv
result = pd.merge(means,Survived_Sex,on = "Sex")
result .to_csv("sex_fare_survived.csv")
# 得出不同年龄的总的存活人数,然后找出存活人数最多的年龄段,最后_sum = text['Survived'].sum(),计算存活人数最高的存活率(存活人数/总人数)
survived_age = df['Survived'].groupby(df['Age']).sum() #同年龄的总的存活人数
survived_age.head()
Age
0.42 1
0.67 1
0.75 2
0.83 2
0.92 1
Name: Survived, dtype: int64
survived_age[survived_age.values==survived_age.max()] #存活人数最多的年龄段
Age
24.0 15
Name: Survived, dtype: int64
_sum = text['Survived'].sum()
_sum
342
_sum = text['Survived'].count() # 计算存活人数最高的存活率(存活人数/总人数)
print("sum of person:" + str(_sum))
sum of person:891
percent =survived_age.max()/ _sum
print("最大存活率" + str(percent))
最大存活率0.016835016835016835
本第二章2.3节,主要内容包括数据的连接方式concat/merage/join/append,转置函数un/stack,groupby分组以及存活率计算。从数据的结果来看,女性存活率要高于男性,可能是“女士优先”绅士风度的体现吧,根据不同的舱级来看,票价更贵的存活率更高,一分钱一分货,生命不是平等的,生存几率也不是平等的。