[Python] Kaggle_Titanic_prediction 1 -- Logistic regression prediction

Kaggle's Titanic survival prediction is by now a classic international entry-level case in data mining.
So let's give it a first try.
#Titanic: Machine Learning from Disaster#

# Import the usual data modules
import pandas as pd
import numpy as np
# Load the training set
train1=pd.read_csv("D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/titanic/train.csv")
train1.head(5)
    PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0   1            0         3       Braund, Mr. Owen Harris                             male    22.0  1      0      A/5 21171         7.2500   NaN    S
1   2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   female  38.0  1      0      PC 17599          71.2833  C85    C
2   3            1         3       Heikkinen, Miss. Laina                              female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3   4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)        female  35.0  1      0      113803            53.1000  C123   S
4   5            0         3       Allen, Mr. William Henry                            male    35.0  0      0      373450            8.0500   NaN    S
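# Look at the basic summary statistics

# Start with the numeric columns
train1.describe()

# The mean of Survived is 0.384: fewer than four in ten passengers survived.
# The mean Age is 29.7, so the passengers skew young.
# SibSp (siblings and spouses) plus Parch (parents and children) averages 0.9 people, which we can call "family members". On average each passenger had about one relative aboard, although some travelled with large families and others alone.
# The mean Fare is 32.2, but the median is only 14.45 and the 75th percentile is 31.0. A small number of very expensive tickets pull the mean up, so most passengers paid well under 32.
# The mean Pclass is 2.3 and the median is 3, so third class held more passengers than first and second class.

# Age is missing 177 values (714 of 891 are present). Fill them later, for example with the mean or the median.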
       PassengerId     Survived       Pclass          Age        SibSp        Parch         Fare
count   891.000000   891.000000   891.000000   714.000000   891.000000   891.000000   891.000000
mean    446.000000     0.383838     2.308642    29.699118     0.523008     0.381594    32.204208
std     257.353842     0.486592     0.836071    14.526497     1.102743     0.806057    49.693429
min       1.000000     0.000000     1.000000     0.420000     0.000000     0.000000     0.000000
25%     223.500000     0.000000     2.000000    20.125000     0.000000     0.000000     7.910400
50%     446.000000     0.000000     3.000000    28.000000     0.000000     0.000000    14.454200
75%     668.500000     1.000000     3.000000    38.000000     1.000000     0.000000    31.000000
max     891.000000     1.000000     3.000000    80.000000     8.000000     6.000000   512.329200
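# Next, the string (non-numeric) columns
train1.describe(include="O")

# Ticket has 681 distinct values, so there may be 681 separate travelling parties.
# Embarked has 3 distinct ports.
# Cabin has 147 distinct values, but only 204 of the 891 rows have a cabin recorded, so that count is not reliable.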
        Name                    Sex   Ticket  Cabin    Embarked
count   891                     891   891     204      889
unique  891                     2     681     147      3
top     Jensen, Mr. Hans Peder  male  347082  B96 B98  S
freq    1                       577   7       4        644
# Look at each column's info
train1.info()

# 891 rows in total; Age, Cabin and Embarked have missing values.
# Next, let's make the missing data easier to see.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
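# Use isnull to check whether each column has missing values
train1.isnull().any()

# Three columns have missing values: Age, Cabin and Embarked. That is still not very informative, because it does not say whether most values are missing or only a few.
# Next, append sum() to isnull() to count the missing values per column.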
PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool
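# After counting with sum(), sort the counts to make them easier to read
train1.isnull().sum().sort_values(ascending=False)

# Now it is clear which columns have missing data and how many values each one lacks, and the descending order makes it obvious at a glance.
# Cabin          687
# Age            177
# Embarked         2
# Of these three columns:
# Age can be filled with the mean or the median.
# Embarked is missing only two values; related fields (the ticket number, for example) may reveal the port of embarkation.
# Cabin has no obvious fill yet. Either keep examining the data to find a pattern, or drop the column.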
Cabin          687
Age            177
Embarked         2
Fare             0
Ticket           0
Parch            0
SibSp            0
Sex              0
Name             0
Pclass           0
Survived         0
PassengerId      0
dtype: int64
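One way to make these counts easier to judge is to show each one as a share of all rows. The sketch below does this on a small made-up frame; the `toy` data and the `summary` name are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Made-up stand-in for the Titanic frame (values are invented).
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": pd.Series([None, "C85", None, None], dtype=object),
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

missing = toy.isnull().sum()          # missing values per column
summary = pd.DataFrame({
    "missing": missing,
    "percent": (missing / len(toy) * 100).round(1),
}).sort_values("missing", ascending=False)

print(summary)
```

On the real data the same lines would run on train1 in place of `toy`.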
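# Inspect the rows where Embarked is missing
train1[train1.Embarked.isnull()]

# These two passengers share a ticket number and a cabin, so they were travelling together. SibSp and Parch are both 0, so they are not recorded as relatives; presumably they were friends or acquaintances.
# So which port did they board at?

# Next, we printed all of train1 and browsed it (the full output was cleared afterwards because it takes too much space) and found:
# passengers whose Ticket is a six-digit number starting with "113" nearly all have Embarked = S, which is also by far the most common port (644 of the 889 known values). So we assume these two also boarded at S and fill the missing values with S.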
     PassengerId  Survived  Pclass  Name                                       Sex     Age   SibSp  Parch  Ticket  Fare  Cabin  Embarked
61   62           1         1       Icard, Miss. Amelie                        female  38.0  0      0      113572  80.0  B28    NaN
829  830          1         1       Stone, Mrs. George Nelson (Martha Evelyn)  female  62.0  0      0      113572  80.0  B28    NaN
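Rather than scanning the whole table by eye, the ticket-prefix observation can be checked with a filter and a count. This is a minimal sketch on a hand-typed frame (`toy`, an assumption for illustration) built from a few Ticket/Embarked pairs that appear in the outputs of this post; on the real data the same two lines would run on train1.

```python
import pandas as pd

# A few Ticket/Embarked pairs copied from outputs shown in this post.
toy = pd.DataFrame({
    "Ticket": ["113572", "113803", "113783", "113509", "A/5 21171"],
    "Embarked": [None, "S", "S", "C", "S"],
})

# Keep tickets that start with "113", then count the known ports.
same_prefix = toy[toy["Ticket"].str.startswith("113")]
counts = same_prefix["Embarked"].value_counts()   # NaN is ignored by default

print(counts)
print("most common port:", counts.idxmax())
```

A count like this also shows whether the rule has exceptions, which is easy to miss when browsing by hand.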
# Fill with fillna

train2=train1.fillna({"Embarked":"S"})
train2.head(20)

# To preview the filled column on its own, use train1["Embarked"].fillna("S")
    PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0   1            0         3       Braund, Mr. Owen Harris                             male    22.0  1      0      A/5 21171         7.2500   NaN    S
1   2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   female  38.0  1      0      PC 17599          71.2833  C85    C
2   3            1         3       Heikkinen, Miss. Laina                              female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3   4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)        female  35.0  1      0      113803            53.1000  C123   S
4   5            0         3       Allen, Mr. William Henry                            male    35.0  0      0      373450            8.0500   NaN    S
5   6            0         3       Moran, Mr. James                                    male    NaN   0      0      330877            8.4583   NaN    Q
6   7            0         1       McCarthy, Mr. Timothy J                             male    54.0  0      0      17463             51.8625  E46    S
7   8            0         3       Palsson, Master. Gosta Leonard                      male    2.0   3      1      349909            21.0750  NaN    S
8   9            1         3       Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)   female  27.0  0      2      347742            11.1333  NaN    S
9   10           1         2       Nasser, Mrs. Nicholas (Adele Achem)                 female  14.0  1      0      237736            30.0708  NaN    C
10  11           1         3       Sandstrom, Miss. Marguerite Rut                     female  4.0   1      1      PP 9549           16.7000  G6     S
11  12           1         1       Bonnell, Miss. Elizabeth                            female  58.0  0      0      113783            26.5500  C103   S
12  13           0         3       Saundercock, Mr. William Henry                      male    20.0  0      0      A/5. 2151         8.0500   NaN    S
13  14           0         3       Andersson, Mr. Anders Johan                         male    39.0  1      5      347082            31.2750  NaN    S
14  15           0         3       Vestrom, Miss. Hulda Amanda Adolfina                female  14.0  0      0      350406            7.8542   NaN    S
15  16           1         2       Hewlett, Mrs. (Mary D Kingcome)                     female  55.0  0      0      248706            16.0000  NaN    S
16  17           0         3       Rice, Master. Eugene                                male    2.0   4      1      382652            29.1250  NaN    Q
17  18           1         2       Williams, Mr. Charles Eugene                        male    NaN   0      0      244373            13.0000  NaN    S
18  19           0         3       Vander Planke, Mrs. Julius (Emelia Maria Vande...   female  31.0  1      0      345763            18.0000  NaN    S
19  20           1         3       Masselmani, Mrs. Fatima                             female  NaN   0      0      2649              7.2250   NaN    C
# Confirm that the missing Embarked values were filled
train2[train2.Embarked.isnull()]

# The result has no rows, so the fill succeeded.
PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
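# Check whether the other columns still have missing values
train2.isnull().sum().sort_values(ascending=False)

# The other two columns are unchanged from before the Embarked fill. All good; on to the next step.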
Cabin          687
Age            177
Embarked         0
Fare             0
Ticket           0
Parch            0
SibSp            0
Sex              0
Name             0
Pclass           0
Survived         0
PassengerId      0
dtype: int64
# Next, fill the missing Age values with the median
train3=train2.fillna(train2['Age'].median())
train3.head(20)
    PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0   1            0         3       Braund, Mr. Owen Harris                             male    22.0  1      0      A/5 21171         7.2500   28     S
1   2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   female  38.0  1      0      PC 17599          71.2833  C85    C
2   3            1         3       Heikkinen, Miss. Laina                              female  26.0  0      0      STON/O2. 3101282  7.9250   28     S
3   4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)        female  35.0  1      0      113803            53.1000  C123   S
4   5            0         3       Allen, Mr. William Henry                            male    35.0  0      0      373450            8.0500   28     S
5   6            0         3       Moran, Mr. James                                    male    28.0  0      0      330877            8.4583   28     Q
6   7            0         1       McCarthy, Mr. Timothy J                             male    54.0  0      0      17463             51.8625  E46    S
7   8            0         3       Palsson, Master. Gosta Leonard                      male    2.0   3      1      349909            21.0750  28     S
8   9            1         3       Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)   female  27.0  0      2      347742            11.1333  28     S
9   10           1         2       Nasser, Mrs. Nicholas (Adele Achem)                 female  14.0  1      0      237736            30.0708  28     C
10  11           1         3       Sandstrom, Miss. Marguerite Rut                     female  4.0   1      1      PP 9549           16.7000  G6     S
11  12           1         1       Bonnell, Miss. Elizabeth                            female  58.0  0      0      113783            26.5500  C103   S
12  13           0         3       Saundercock, Mr. William Henry                      male    20.0  0      0      A/5. 2151         8.0500   28     S
13  14           0         3       Andersson, Mr. Anders Johan                         male    39.0  1      5      347082            31.2750  28     S
14  15           0         3       Vestrom, Miss. Hulda Amanda Adolfina                female  14.0  0      0      350406            7.8542   28     S
15  16           1         2       Hewlett, Mrs. (Mary D Kingcome)                     female  55.0  0      0      248706            16.0000  28     S
16  17           0         3       Rice, Master. Eugene                                male    2.0   4      1      382652            29.1250  28     Q
17  18           1         2       Williams, Mr. Charles Eugene                        male    28.0  0      0      244373            13.0000  28     S
18  19           0         3       Vander Planke, Mrs. Julius (Emelia Maria Vande...   female  31.0  1      0      345763            18.0000  28     S
19  20           1         3       Masselmani, Mrs. Fatima                             female  28.0  0      0      2649              7.2250   28     C
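# Fill done; look at the missing-value counts again
train3.isnull().sum()

# Cabin has no missing values any more. Why? The value "28" also shows up many times in the Cabin column, and 28 is the median of Age.
# Calling fillna with a single value on the whole DataFrame fills every NaN in every column, so the Age median was written into Cabin as well. The next cell restricts the fill to the Age column.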
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64
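The difference between filling a whole frame and filling one column can be reproduced on a tiny example. The sketch below uses a made-up two-column frame (`toy` and both result names are assumptions for illustration): a scalar passed to `DataFrame.fillna` fills every column, while a dict keyed by column name fills only the named column.

```python
import pandas as pd

# Made-up data with gaps in both a numeric and a text column.
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0],
    "Cabin": pd.Series([None, "C85", None], dtype=object),
})

median_age = toy["Age"].median()    # median of the known ages

# Scalar: every NaN in every column is replaced, including Cabin.
filled_everywhere = toy.fillna(median_age)

# Dict: only the Age column is touched; Cabin keeps its NaN values.
filled_age_only = toy.fillna({"Age": median_age})

print(filled_everywhere)
print(filled_age_only)
```

The dict form returns a full DataFrame, so it can be assigned back to the same variable without losing the other columns.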
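# Try another way, still filling Age with the median
train2["Age"]=train2["Age"].fillna(train2["Age"].median())
train2.isnull().sum().sort_values(ascending=False)

# The difference from the previous attempt is the ["Age"] after train2, which restricts the fill to one column.
# Note: in train2["Age"]=train2["Age"].fillna(train2["Age"].median()), the ["Age"] on the left of the equals sign must stay. Without it, train2 itself is replaced by the filled Age Series, the other columns are lost, and the whole notebook has to be rerun from the top.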
Cabin          687
Embarked         0
Fare             0
Ticket           0
Parch            0
SibSp            0
Age              0
Sex              0
Name             0
Pclass           0
Survived         0
PassengerId      0
dtype: int64
train2.head(20)
    PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0   1            0         3       Braund, Mr. Owen Harris                             male    22.0  1      0      A/5 21171         7.2500   NaN    S
1   2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   female  38.0  1      0      PC 17599          71.2833  C85    C
2   3            1         3       Heikkinen, Miss. Laina                              female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3   4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)        female  35.0  1      0      113803            53.1000  C123   S
4   5            0         3       Allen, Mr. William Henry                            male    35.0  0      0      373450            8.0500   NaN    S
5   6            0         3       Moran, Mr. James                                    male    28.0  0      0      330877            8.4583   NaN    Q
6   7            0         1       McCarthy, Mr. Timothy J                             male    54.0  0      0      17463             51.8625  E46    S
7   8            0         3       Palsson, Master. Gosta Leonard                      male    2.0   3      1      349909            21.0750  NaN    S
8   9            1         3       Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)   female  27.0  0      2      347742            11.1333  NaN    S
9   10           1         2       Nasser, Mrs. Nicholas (Adele Achem)                 female  14.0  1      0      237736            30.0708  NaN    C
10  11           1         3       Sandstrom, Miss. Marguerite Rut                     female  4.0   1      1      PP 9549           16.7000  G6     S
11  12           1         1       Bonnell, Miss. Elizabeth                            female  58.0  0      0      113783            26.5500  C103   S
12  13           0         3       Saundercock, Mr. William Henry                      male    20.0  0      0      A/5. 2151         8.0500   NaN    S
13  14           0         3       Andersson, Mr. Anders Johan                         male    39.0  1      5      347082            31.2750  NaN    S
14  15           0         3       Vestrom, Miss. Hulda Amanda Adolfina                female  14.0  0      0      350406            7.8542   NaN    S
15  16           1         2       Hewlett, Mrs. (Mary D Kingcome)                     female  55.0  0      0      248706            16.0000  NaN    S
16  17           0         3       Rice, Master. Eugene                                male    2.0   4      1      382652            29.1250  NaN    Q
17  18           1         2       Williams, Mr. Charles Eugene                        male    28.0  0      0      244373            13.0000  NaN    S
18  19           0         3       Vander Planke, Mrs. Julius (Emelia Maria Vande...   female  31.0  1      0      345763            18.0000  NaN    S
19  20           1         3       Masselmani, Mrs. Fatima                             female  28.0  0      0      2649              7.2250   NaN    C
# Now look at the test data

# Load the test set
test1=pd.read_csv("D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/titanic/test.csv")
test1.head(5)
   PassengerId  Pclass  Name                                          Sex     Age   SibSp  Parch  Ticket   Fare     Cabin  Embarked
0  892          3       Kelly, Mr. James                              male    34.5  0      0      330911   7.8292   NaN    Q
1  893          3       Wilkes, Mrs. James (Ellen Needs)              female  47.0  1      0      363272   7.0000   NaN    S
2  894          2       Myles, Mr. Thomas Francis                     male    62.0  0      0      240276   9.6875   NaN    Q
3  895          3       Wirz, Mr. Albert                              male    27.0  0      0      315154   8.6625   NaN    S
4  896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0  1      1      3101298  12.2875  NaN    S
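# Missing columns and their counts
test1.isnull().sum().sort_values(ascending=False)

# Cabin tops the list again; set it aside for now.
# Age is also missing values; fill it with the mean or the median.
# Fare is missing one value; look through the data for a pattern and fill it by hand.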
Cabin          327
Age             86
Fare             1
Embarked         0
Ticket           0
Parch            0
SibSp            0
Sex              0
Name             0
Pclass           0
PassengerId      0
dtype: int64
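# Inspect the row where Fare is missing
test1[test1.Fare.isnull()]

# The passenger with the missing Fare is male, in third class, 60.5 years old and travelling alone. A reasonable fill is the mean or median fare of Pclass 3, which should fit the overall picture well.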
     PassengerId  Pclass  Name                Sex   Age   SibSp  Parch  Ticket  Fare  Cabin  Embarked
152  1044         3       Storey, Mr. Thomas  male  60.5  0      0      3701    NaN   NaN    S
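# Fill a missing value with the mean of its group
# For example, group by Pclass and fill Fare with each group's mean Fare
test1["Fare"] = test1.groupby("Pclass").transform(lambda x: x.fillna(x.mean()))
test1.isnull().sum().sort_values(ascending=False)

# Fill done. Judging by the missing-value counts, it looks OK.

# Next, verify that the filled value really is the Pclass=3 mean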
Cabin          327
Age             86
Embarked         0
Fare             0
Ticket           0
Parch            0
SibSp            0
Sex              0
Name             0
Pclass           0
PassengerId      0
dtype: int64
# The passenger with the missing Fare has PassengerId 1044 and row index 152, so we can look the row up by its index.
test1.loc[152,["Pclass","Name","Age","Fare"]]
# test1.loc[152] or test1.loc[152,:] shows all columns of row 152
Pclass                     3
Name      Storey, Mr. Thomas
Age                     60.5
Fare                    1044
Name: 152, dtype: object
# The quickest way is to use loc to show every column of row 152
test1.loc[152]

# The filled Fare is 1044. Let's check whether the mean Fare for Pclass=3 (third class) is 1044.
PassengerId                  1044
Pclass                          3
Name           Storey, Mr. Thomas
Sex                          male
Age                          60.5
SibSp                           0
Parch                           0
Ticket                       3701
Fare                         1044
Cabin                         NaN
Embarked                        S
Name: 152, dtype: object
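Fare_Pclass_mean = test1.groupby("Pclass")["Fare"].mean()
Fare_Pclass_mean
# Or skip the assignment and just evaluate test1.groupby("Pclass")["Fare"].mean(); the point is only to verify.

# The Pclass=3 mean comes out as 1094, not the 1044 that was filled in. Why?
# Because the fill statement was wrong: test1["Fare"] = test1.groupby("Pclass").transform(lambda x: x.fillna(x.mean()))
# transform was applied to every column instead of to Fare alone, and the output shows that the Fare column was overwritten with PassengerId values. That is why the "fare" in row 152 equals that passenger's PassengerId (1044), and why the group means printed below (around 1100) are means of PassengerId values, which run from 892 to 1309 in the test set, not means of fares.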
Pclass
1    1098.224299
2    1117.935484
3    1094.178899
Name: Fare, dtype: float64
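A group-wise fill that touches only one column selects that column before calling transform. The sketch below shows the pattern on a small made-up frame (`toy` and `class_mean_fare` are illustrative names, and the values are invented); it is one common way to write this, not the only one.

```python
import pandas as pd

# Made-up data: one missing fare in third class.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Pclass": [1, 1, 3, 3, 3],
    "Fare": [80.0, 60.0, 8.0, None, 10.0],
})

# Mean fare of each row's own class, aligned to the original index.
class_mean_fare = toy.groupby("Pclass")["Fare"].transform("mean")

# Only the NaN entries of Fare are replaced; other columns are untouched.
toy["Fare"] = toy["Fare"].fillna(class_mean_fare)

print(toy)
```

Because `transform` returns a Series with the same index as the input, `fillna` can line the group means up row by row.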
# Look for another approach, and first restore the data to its state before the fill.
# Reload the data
test2=pd.read_csv("D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/titanic/test.csv")
test2.head(35)
    PassengerId  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket            Fare      Cabin            Embarked
0   892          3       Kelly, Mr. James                                    male    34.5  0      0      330911            7.8292    NaN              Q
1   893          3       Wilkes, Mrs. James (Ellen Needs)                    female  47.0  1      0      363272            7.0000    NaN              S
2   894          2       Myles, Mr. Thomas Francis                           male    62.0  0      0      240276            9.6875    NaN              Q
3   895          3       Wirz, Mr. Albert                                    male    27.0  0      0      315154            8.6625    NaN              S
4   896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)        female  22.0  1      1      3101298           12.2875   NaN              S
5   897          3       Svensson, Mr. Johan Cervin                          male    14.0  0      0      7538              9.2250    NaN              S
6   898          3       Connolly, Miss. Kate                                female  30.0  0      0      330972            7.6292    NaN              Q
7   899          2       Caldwell, Mr. Albert Francis                        male    26.0  1      1      248738            29.0000   NaN              S
8   900          3       Abrahim, Mrs. Joseph (Sophie Halaut Easu)           female  18.0  0      0      2657              7.2292    NaN              C
9   901          3       Davies, Mr. John Samuel                             male    21.0  2      0      A/4 48871         24.1500   NaN              S
10  902          3       Ilieff, Mr. Ylio                                    male    NaN   0      0      349220            7.8958    NaN              S
11  903          1       Jones, Mr. Charles Cresson                          male    46.0  0      0      694               26.0000   NaN              S
12  904          1       Snyder, Mrs. John Pillsbury (Nelle Stevenson)       female  23.0  1      0      21228             82.2667   B45              S
13  905          2       Howard, Mr. Benjamin                                male    63.0  1      0      24065             26.0000   NaN              S
14  906          1       Chaffee, Mrs. Herbert Fuller (Carrie Constance...   female  47.0  1      0      W.E.P. 5734       61.1750   E31              S
15  907          2       del Carlo, Mrs. Sebastiano (Argenia Genovesi)       female  24.0  1      0      SC/PARIS 2167     27.7208   NaN              C
16  908          2       Keane, Mr. Daniel                                   male    35.0  0      0      233734            12.3500   NaN              Q
17  909          3       Assaf, Mr. Gerios                                   male    21.0  0      0      2692              7.2250    NaN              C
18  910          3       Ilmakangas, Miss. Ida Livija                        female  27.0  1      0      STON/O2. 3101270  7.9250    NaN              S
19  911          3       Assaf Khalil, Mrs. Mariana (Miriam")"               female  45.0  0      0      2696              7.2250    NaN              C
20  912          1       Rothschild, Mr. Martin                              male    55.0  1      0      PC 17603          59.4000   NaN              C
21  913          3       Olsen, Master. Artur Karl                           male    9.0   0      1      C 17368           3.1708    NaN              S
22  914          1       Flegenheim, Mrs. Alfred (Antoinette)                female  NaN   0      0      PC 17598          31.6833   NaN              S
23  915          1       Williams, Mr. Richard Norris II                     male    21.0  0      1      PC 17597          61.3792   NaN              C
24  916          1       Ryerson, Mrs. Arthur Larned (Emily Maria Borie)     female  48.0  1      3      PC 17608          262.3750  B57 B59 B63 B66  C
25  917          3       Robins, Mr. Alexander A                             male    50.0  1      0      A/5. 3337         14.5000   NaN              S
26  918          1       Ostby, Miss. Helene Ragnhild                        female  22.0  0      1      113509            61.9792   B36              C
27  919          3       Daher, Mr. Shedid                                   male    22.5  0      0      2698              7.2250    NaN              C
28  920          1       Brady, Mr. John Bertram                             male    41.0  0      0      113054            30.5000   A21              S
29  921          3       Samaan, Mr. Elias                                   male    NaN   2      0      2662              21.6792   NaN              C
30  922          2       Louch, Mr. Charles Alexander                        male    50.0  1      0      SC/AH 3085        26.0000   NaN              S
31  923          2       Jefferys, Mr. Clifford Thomas                       male    24.0  2      0      C.A. 31029        31.5000   NaN              S
32  924          3       Dean, Mrs. Bertram (Eva Georgetta Light)            female  33.0  1      2      C.A. 2315         20.5750   NaN              S
33  925          3       Johnston, Mrs. Andrew G (Elizabeth Lily" Watson)"   female  NaN   1      2      W./C. 6607        23.4500   NaN              S
34  926          1       Mock, Mr. Philipp Edmund                            male    30.0  1      0      13236             57.7500   C78              C
test2.isnull().sum().sort_values(ascending=False)

# Fare still has its one missing value; this is the original data.
Cabin          327
Age             86
Fare             1
Embarked         0
Ticket           0
Parch            0
SibSp            0
Sex              0
Name             0
Pclass           0
PassengerId      0
dtype: int64
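# The 1094 computed earlier is not a fare: it came from test1 after its Fare column had been overwritten with PassengerId values, while the fares shown above range from about 3 to about 262. Filling with the string "1094" would also turn the numeric Fare column into strings.
# Instead, fill the single NaN with the mean Fare of the passenger's own class, computed from the untouched test2.
test2["Fare"]=test2["Fare"].fillna(test2.groupby("Pclass")["Fare"].transform("mean"))
test2.isnull().sum().sort_values(ascending=False)

# Fare filled successfully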
Cabin          327
Age             86
Embarked         0
Fare             0
Ticket           0
Parch            0
SibSp            0
Sex              0
Name             0
Pclass           0
PassengerId      0
dtype: int64
# Fill the missing Age values with the median as well, reusing the statement from the training set.
test2["Age"]=test2["Age"].fillna(test2["Age"].median())
test2.isnull().sum().sort_values(ascending=False)

# Age filled successfully
Cabin          327
Embarked         0
Fare             0
Ticket           0
Parch            0
SibSp            0
Age              0
Sex              0
Name             0
Pclass           0
PassengerId      0
dtype: int64
# Check the mean of the Age column as a sanity check.
Age_mean = test2["Age"].mean()
Age_mean
29.599282296650717
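# Look at the test set test2 after filling the missing values

test2.head(35)

# The rows that were missing Age (indexes 10/22/29/33) were all filled with 27, while Age_mean above came out as 29.6.
# 2019.4.8: the filled 27 and the computed mean of about 29.6 are close, so to keep moving this was set aside for later.
# 2019.4.9: the doubt is resolved. The fill used median(), so 27 is the median, not the mean. Either statistic is acceptable here.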
    PassengerId  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket            Fare      Cabin            Embarked
0   892          3       Kelly, Mr. James                                    male    34.5  0      0      330911            7.8292    NaN              Q
1   893          3       Wilkes, Mrs. James (Ellen Needs)                    female  47.0  1      0      363272            7         NaN              S
2   894          2       Myles, Mr. Thomas Francis                           male    62.0  0      0      240276            9.6875    NaN              Q
3   895          3       Wirz, Mr. Albert                                    male    27.0  0      0      315154            8.6625    NaN              S
4   896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)        female  22.0  1      1      3101298           12.2875   NaN              S
5   897          3       Svensson, Mr. Johan Cervin                          male    14.0  0      0      7538              9.225     NaN              S
6   898          3       Connolly, Miss. Kate                                female  30.0  0      0      330972            7.6292    NaN              Q
7   899          2       Caldwell, Mr. Albert Francis                        male    26.0  1      1      248738            29        NaN              S
8   900          3       Abrahim, Mrs. Joseph (Sophie Halaut Easu)           female  18.0  0      0      2657              7.2292    NaN              C
9   901          3       Davies, Mr. John Samuel                             male    21.0  2      0      A/4 48871         24.15     NaN              S
10  902          3       Ilieff, Mr. Ylio                                    male    27.0  0      0      349220            7.8958    NaN              S
11  903          1       Jones, Mr. Charles Cresson                          male    46.0  0      0      694               26        NaN              S
12  904          1       Snyder, Mrs. John Pillsbury (Nelle Stevenson)       female  23.0  1      0      21228             82.2667   B45              S
13  905          2       Howard, Mr. Benjamin                                male    63.0  1      0      24065             26        NaN              S
14  906          1       Chaffee, Mrs. Herbert Fuller (Carrie Constance...   female  47.0  1      0      W.E.P. 5734       61.175    E31              S
15  907          2       del Carlo, Mrs. Sebastiano (Argenia Genovesi)       female  24.0  1      0      SC/PARIS 2167     27.7208   NaN              C
16  908          2       Keane, Mr. Daniel                                   male    35.0  0      0      233734            12.35     NaN              Q
17  909          3       Assaf, Mr. Gerios                                   male    21.0  0      0      2692              7.225     NaN              C
18  910          3       Ilmakangas, Miss. Ida Livija                        female  27.0  1      0      STON/O2. 3101270  7.925     NaN              S
19  911          3       Assaf Khalil, Mrs. Mariana (Miriam")"               female  45.0  0      0      2696              7.225     NaN              C
20  912          1       Rothschild, Mr. Martin                              male    55.0  1      0      PC 17603          59.4      NaN              C
21  913          3       Olsen, Master. Artur Karl                           male    9.0   0      1      C 17368           3.1708    NaN              S
22  914          1       Flegenheim, Mrs. Alfred (Antoinette)                female  27.0  0      0      PC 17598          31.6833   NaN              S
23  915          1       Williams, Mr. Richard Norris II                     male    21.0  0      1      PC 17597          61.3792   NaN              C
24  916          1       Ryerson, Mrs. Arthur Larned (Emily Maria Borie)     female  48.0  1      3      PC 17608          262.375   B57 B59 B63 B66  C
25  917          3       Robins, Mr. Alexander A                             male    50.0  1      0      A/5. 3337         14.5      NaN              S
26  918          1       Ostby, Miss. Helene Ragnhild                        female  22.0  0      1      113509            61.9792   B36              C
27  919          3       Daher, Mr. Shedid                                   male    22.5  0      0      2698              7.225     NaN              C
28  920          1       Brady, Mr. John Bertram                             male    41.0  0      0      113054            30.5      A21              S
29  921          3       Samaan, Mr. Elias                                   male    27.0  2      0      2662              21.6792   NaN              C
30  922          2       Louch, Mr. Charles Alexander                        male    50.0  1      0      SC/AH 3085        26        NaN              S
31  923          2       Jefferys, Mr. Clifford Thomas                       male    24.0  2      0      C.A. 31029        31.5      NaN              S
32  924          3       Dean, Mrs. Bertram (Eva Georgetta Light)            female  33.0  1      2      C.A. 2315         20.575    NaN              S
33  925          3       Johnston, Mrs. Andrew G (Elizabeth Lily" Watson)"   female  27.0  1      2      W./C. 6607        23.45     NaN              S
34  926          1       Mock, Mr. Philipp Edmund                            male    30.0  1      0      13236             57.75     C78              C
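The gap between the filled value (27, the median) and the printed mean (29.6) is what one would expect when the two statistics differ. The sketch below reproduces the effect with the standard library on a handful of made-up ages; the list and variable names are assumptions for illustration only.

```python
import statistics

# Made-up known ages; a few older passengers pull the mean above the median.
known_ages = [9.0, 21.0, 22.0, 27.0, 30.0, 47.0, 62.0]

median_age = statistics.median(known_ages)   # middle value
mean_age = statistics.mean(known_ages)       # arithmetic average

# Three missing ages filled with the median, as fillna(median) would do.
filled = known_ages + [median_age] * 3

print("median:", median_age)
print("mean before fill:", round(mean_age, 2))
print("mean after fill:", round(statistics.mean(filled), 2))
```

Filling with the median leaves the median unchanged and moves the mean toward it, so the mean of the filled column is not the value that was filled in.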
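# After analysing the candidate factors in the dataset,
# we keep "Survived","Pclass","Sex","Age","SibSp","Parch","Fare","Embarked" as potential factors
# and drop "PassengerId"/"Ticket"/"Cabin", which are either mostly missing or irrelevant.

train2[["Survived","Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]].corr(method="pearson")
# Check the correlations between the factors
# (pandas 2.0 and later raise an error on the string columns Sex and Embarked unless numeric_only=True is passed; the older version used here silently skipped them)

# Strongly correlated with Survived: none
# Weakly correlated with Survived: Pclass (-0.34) and Fare (0.26), that is, cabin class and ticket price. Socio-economic status does affect the chance of survival.
# Very weakly or not correlated with Survived: "Age","SibSp","Parch". In the film it is clearly women and children first, so these three deserve a closer look.
# "Sex" and "Embarked" are not numeric, so the Pearson table cannot show them; we handle these two another way.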
           Survived     Pclass        Age      SibSp      Parch       Fare
Survived   1.000000  -0.338481  -0.064910  -0.035322   0.081629   0.257307
Pclass    -0.338481   1.000000  -0.339898   0.083081   0.018443  -0.549500
Age       -0.064910  -0.339898   1.000000  -0.233296  -0.172482   0.096688
SibSp     -0.035322   0.083081  -0.233296   1.000000   0.414838   0.159651
Parch      0.081629   0.018443  -0.172482   0.414838   1.000000   0.216225
Fare       0.257307  -0.549500   0.096688   0.159651   0.216225   1.000000
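# A side note (from a MOOC course)
# on the Pearson correlation coefficient

# r ranges over [-1,1]; the strength is judged on the absolute value |r|
# 0.8-1.0 very strong correlation
# 0.6-0.8 strong correlation
# 0.4-0.6 moderate correlation
# 0.2-0.4 weak correlation
# 0-0.2 very weak or no correlation

# For more detail, see a probability and statistics textbook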
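For readers who want to see what the coefficient measures, the sketch below computes Pearson's r from its definition, the covariance of the two variables divided by the product of their standard deviations. The function name `pearson_r` and the six-row sample are assumptions for illustration; pandas' `corr(method="pearson")` computes the same quantity.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: sum of co-deviations over the root of the
    product of the summed squared deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up sample: cabin class and a 0/1 survival flag.
pclass = [1, 1, 2, 3, 3, 3]
survived = [1, 1, 1, 0, 0, 1]

r = pearson_r(pclass, survived)
print(round(r, 4))
```

A negative r here means that higher class numbers (third class) go with lower survival in this sample, which is the same sign as the Pclass entry in the table above.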
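# About "SibSp" and "Parch": one counts siblings and spouses, the other parents and children. We treat them all as travelling companions and add the two together into a new feature.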

train2["Family"]=train2["SibSp"]+train2["Parch"]
train2.head(5)
    PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  Family
0   1            0         3       Braund, Mr. Owen Harris                             male    22.0  1      0      A/5 21171         7.2500   NaN    S         1
1   2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   female  38.0  1      0      PC 17599          71.2833  C85    C         1
2   3            1         3       Heikkinen, Miss. Laina                              female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         0
3   4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)        female  35.0  1      0      113803            53.1000  C123   S         1
4   5            0         3       Allen, Mr. William Henry                            male    35.0  0      0      373450            8.0500   NaN    S         0
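# For "Age", we create age bands.

# 1 infant/child: 0-11
# 2 teenager: 12-18
# 3 young adult: 19-35
# 4 middle-aged: 36-52
# 5 older adult: 53-65
# 6 elderly: 66 and over

# Numeric binning in Python, found through a web search and applied as follows:
# Bin edges. pandas bins are half-open, (a, b] by default or [a, b) with right=False, so the outer edges must extend past the smallest and largest ages (here -1 and 150).
# See any reference on open and closed intervals for background.
bins=[-1,12,19,36,53,66,150]
 
# Labels for the bins: [-1, 12) is 婴幼儿童 (infant/child), [12, 19) is 少年 (teenager), followed by 青年 (young adult), 中年 (middle-aged), 中老年 (older adult) and 老年 (elderly)
labels=['婴幼儿童','少年','青年','中年','中老年','老年']

# Bin with pandas cut; right=False means left-closed, right-open; omitting right means left-open, right-closed
train2['age_group']=pd.cut(
        train2['Age'],
        bins,
        right=False,
        labels=labels)

train2.head(5)

# Reference for numeric binning in Python: https://blog.csdn.net/qq_35990702/article/details/82313055 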
    PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  Family  age_group
0   1            0         3       Braund, Mr. Owen Harris                             male    22.0  1      0      A/5 21171         7.2500   NaN    S         1       青年
1   2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   female  38.0  1      0      PC 17599          71.2833  C85    C         1       中年
2   3            1         3       Heikkinen, Miss. Laina                              female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         0       青年
3   4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)        female  35.0  1      0      113803            53.1000  C123   S         1       青年
4   5            0         3       Allen, Mr. William Henry                            male    35.0  0      0      373450            8.0500   NaN    S         0       青年
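The effect of `right=False` shows up only at the bin edges, so a few boundary ages make it visible. The sketch below reuses the same bin edges on a short made-up Series; the English labels are stand-ins for the Chinese ones used in this post, and all variable names are assumptions.

```python
import pandas as pd

# Ages chosen to sit on or near the bin edges.
ages = pd.Series([0.42, 11.5, 12, 18, 19, 35, 36, 66, 80])
bins = [-1, 12, 19, 36, 53, 66, 150]
labels = ["child", "teen", "young adult", "middle-aged", "older adult", "elderly"]

# right=False: each bin is [a, b), so age 12 falls in [12, 19).
left_closed = pd.cut(ages, bins, right=False, labels=labels)

# Default right=True: each bin is (a, b], so age 12 falls in (-1, 12].
right_closed = pd.cut(ages, bins, labels=labels)

print(pd.DataFrame({"age": ages, "[a, b)": left_closed, "(a, b]": right_closed}))
```

With the bands defined as 0-11, 12-18 and so on, the left-closed form is the one that matches the intended ranges.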
# To make the correlation analysis easier, convert ['婴幼儿童','少年','青年','中年','中老年','老年'] to the numbers [1,2,3,4,5,6]

train2['age_group0'] = train2['age_group'].map({'婴幼儿童': 1, '少年': 2,'青年':3,'中年':4,'中老年':5,'老年':6}).astype(int)
train2.head(5)
    PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  Family  age_group  age_group0
0   1            0         3       Braund, Mr. Owen Harris                             male    22.0  1      0      A/5 21171         7.2500   NaN    S         1       青年       3
1   2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   female  38.0  1      0      PC 17599          71.2833  C85    C         1       中年       4
2   3            1         3       Heikkinen, Miss. Laina                              female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         0       青年       3
3   4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)        female  35.0  1      0      113803            53.1000  C123   S         1       青年       3
4   5            0         3       Allen, Mr. William Henry                            male    35.0  0      0      373450            8.0500   NaN    S         0       青年       3
# "Sex" is categorical too, so we convert it to the numbers 1 and 2

train2['Sex0'] = train2['Sex'].map({'female': 1, 'male': 2}).astype(int)
train2.head(5)
    PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  Family  age_group  age_group0  Sex0
0   1            0         3       Braund, Mr. Owen Harris                             male    22.0  1      0      A/5 21171         7.2500   NaN    S         1       青年       3           2
1   2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   female  38.0  1      0      PC 17599          71.2833  C85    C         1       中年       4           1
2   3            1         3       Heikkinen, Miss. Laina                              female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         0       青年       3           1
3   4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)        female  35.0  1      0      113803            53.1000  C123   S         1       青年       3           1
4   5            0         3       Allen, Mr. William Henry                            male    35.0  0      0      373450            8.0500   NaN    S         0       青年       3           2
# "Embarked" is handled the same way as Sex: convert the category to a number.
# First see how many distinct values "Embarked" has
train2['Embarked'].value_counts()
S    646
C    168
Q     77
Name: Embarked, dtype: int64
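# Convert the Embarked categories (S/C/Q) to numbers. Note that this mapping sends both C and Q to 2, so Embarked0 only separates S (1) from the other two ports (2).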
train2["Embarked0"] = train2["Embarked"].map({'S': 1, 'C': 2, 'Q': 2}).astype(int)
train2.head(5)
    PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  Family  age_group  age_group0  Sex0  Embarked0
0   1            0         3       Braund, Mr. Owen Harris                             male    22.0  1      0      A/5 21171         7.2500   NaN    S         1       青年       3           2     1
1   2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   female  38.0  1      0      PC 17599          71.2833  C85    C         1       中年       4           1     2
2   3            1         3       Heikkinen, Miss. Laina                              female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         0       青年       3           1     1
3   4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)        female  35.0  1      0      113803            53.1000  C123   S         1       青年       3           1     1
4   5            0         3       Allen, Mr. William Henry                            male    35.0  0      0      373450            8.0500   NaN    S         0       青年       3           2     1
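Mapping ports to integers gives them an order and a spacing that the ports themselves do not have. A common alternative, not used in the original notebook, is one-hot encoding, where each port gets its own 0/1 column. A minimal sketch on a made-up four-row frame (the `toy` and `dummies` names are assumptions):

```python
import pandas as pd

# Made-up column with the three port codes.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One 0/1 indicator column per port; astype(int) gives 0/1 instead of booleans.
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked").astype(int)
toy = pd.concat([toy, dummies], axis=1)

print(toy)
```

With this encoding a linear model can learn a separate weight for each port. For a model with an intercept, one of the indicator columns is usually dropped to avoid redundancy.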
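# Now run the correlation analysis again
# Note: "Age" is replaced by age_group0; "SibSp" and "Parch" by Family; "Embarked" by Embarked0
train2[["Survived","Pclass","Sex0","age_group0","Family","Fare","Embarked0"]].corr(method="pearson")

# This time Sex0 has the largest correlation with Survived (r = -0.54, moderate on the scale above), stronger than socio-economic status (Pclass and Fare).
# By contrast, Family size and age_group0 show almost no linear correlation with survival.
# (Some expert posts report that family size and age matter a lot for an individual's survival. These two open questions are left for now, to revisit with more experience.)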
              Survived      Pclass        Sex0  age_group0      Family        Fare   Embarked0
Survived      1.000000   -0.338481   -0.543351   -0.086879    0.016639    0.257307    0.149683
Pclass       -0.338481    1.000000    0.131900   -0.308349    0.065997   -0.549500   -0.074053
Sex0         -0.543351    0.131900    1.000000    0.095705   -0.200988   -0.182333   -0.119224
age_group0   -0.086879   -0.308349    0.095705    1.000000   -0.293598    0.077438    0.002818
Family        0.016639    0.065997   -0.200988   -0.293598    1.000000    0.217138   -0.077359
Fare          0.257307   -0.549500   -0.182333    0.077438    0.217138    1.000000    0.162184
Embarked0     0.149683   -0.074053   -0.119224    0.002818   -0.077359    0.162184    1.000000
# Can't resist: a look at Family.
train2[["Family","Survived"]].groupby("Family",as_index=False).mean().sort_values(by="Survived",ascending=False)

# Passengers with 1 to 3 family members aboard all have survival rates above 50%, far higher than every other group.
   Family  Survived
3  3       0.724138
2  2       0.578431
1  1       0.552795
6  6       0.333333
0  0       0.303538
4  4       0.200000
5  5       0.136364
7  7       0.000000
8  10      0.000000
# Do the same for age_group
train2[["age_group","Survived"]].groupby("age_group",as_index=False).mean().sort_values(by="Survived",ascending=False)

# The infant/child and teenager bands (ages 0-18) have the highest survival rates, both above 40%, while the elderly band is at 12.5%.
   age_group  Survived
0  婴幼儿童   0.573529
1  少年       0.436620
3  中年       0.397590
4  中老年     0.372093
2  青年       0.353271
5  老年       0.125000
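A group's survival rate is easier to trust when the group size is shown next to it, because a rate based on a handful of passengers can swing widely. The sketch below adds a count column with named aggregation on a made-up frame; the data and the `rate` and `n` column names are assumptions for illustration.

```python
import pandas as pd

# Made-up data: group label and a 0/1 survival flag.
toy = pd.DataFrame({
    "age_group": ["child", "child", "adult", "adult", "adult", "elderly"],
    "Survived": [1, 0, 0, 1, 0, 0],
})

# Mean of the 0/1 flag is the survival rate; count is the group size.
rates = (
    toy.groupby("age_group")["Survived"]
       .agg(rate="mean", n="count")
       .sort_values("rate", ascending=False)
)

print(rates)
```

On the real data the same call on train2 would show how many passengers stand behind each of the rates above.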
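# Feature engineering
# Feature engineering is said to be the single biggest factor in final prediction accuracy, more so than the choice of algorithm.

# The correlation analysis above shows a coefficient of -0.55 between Fare and Pclass, a moderate negative relationship (dearer tickets go with lower class numbers). The two carry overlapping information, so they should be handled together.
# As a beginner, for speed and simplicity, we just keep one of them, Pclass, which is more strongly correlated with Survived, and drop the other.

# So, from the corr analysis above, the factors most related to Survived are: Sex0 > Pclass > Embarked0 (Fare already dropped)
# And from the two breakdowns above, passengers with 1 to 3 family members, or in the infant/child and teenager bands (ages 0-18), have higher survival rates.
# The features derived on the training set must also be derived on the test set.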
test2["Family"]=test2["SibSp"]+test2["Parch"]
test2.head(5)
   PassengerId  Pclass  Name                                          Sex     Age   SibSp  Parch  Ticket   Fare     Cabin  Embarked  Family
0  892          3       Kelly, Mr. James                              male    34.5  0      0      330911   7.8292   NaN    Q         0
1  893          3       Wilkes, Mrs. James (Ellen Needs)              female  47.0  1      0      363272   7        NaN    S         1
2  894          2       Myles, Mr. Thomas Francis                     male    62.0  0      0      240276   9.6875   NaN    Q         0
3  895          3       Wirz, Mr. Albert                              male    27.0  0      0      315154   8.6625   NaN    S         0
4  896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0  1      1      3101298  12.2875  NaN    S         2
bins=[-1,12,19,36,53,66,150]
 
# Labels for the bins: [-1, 12) is 婴幼儿童 (infant/child), [12, 19) is 少年 (teenager), and so on
labels=['婴幼儿童','少年','青年','中年','中老年','老年']

test2['age_group']=pd.cut(
        test2['Age'],
        bins,
        right=False,
        labels=labels)

test2.head(5)
   PassengerId  Pclass  Name                                          Sex     Age   SibSp  Parch  Ticket   Fare     Cabin  Embarked  Family  age_group
0  892          3       Kelly, Mr. James                              male    34.5  0      0      330911   7.8292   NaN    Q         0       青年
1  893          3       Wilkes, Mrs. James (Ellen Needs)              female  47.0  1      0      363272   7        NaN    S         1       中年
2  894          2       Myles, Mr. Thomas Francis                     male    62.0  0      0      240276   9.6875   NaN    Q         0       中老年
3  895          3       Wirz, Mr. Albert                              male    27.0  0      0      315154   8.6625   NaN    S         0       青年
4  896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0  1      1      3101298  12.2875  NaN    S         2       青年
test2['age_group0'] = test2['age_group'].map({'婴幼儿童': 1, '少年': 2,'青年':3,'中年':4,'中老年':5,'老年':6}).astype(int)
test2.head(5)
   PassengerId  Pclass  Name                                          Sex     Age   SibSp  Parch  Ticket   Fare     Cabin  Embarked  Family  age_group  age_group0
0  892          3       Kelly, Mr. James                              male    34.5  0      0      330911   7.8292   NaN    Q         0       青年       3
1  893          3       Wilkes, Mrs. James (Ellen Needs)              female  47.0  1      0      363272   7        NaN    S         1       中年       4
2  894          2       Myles, Mr. Thomas Francis                     male    62.0  0      0      240276   9.6875   NaN    Q         0       中老年     5
3  895          3       Wirz, Mr. Albert                              male    27.0  0      0      315154   8.6625   NaN    S         0       青年       3
4  896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0  1      1      3101298  12.2875  NaN    S         2       青年       3
test2['Sex0'] = test2['Sex'].map({'female': 1, 'male': 2}).astype(int)
test2.head(5)
   PassengerId  Pclass  Name                                          Sex     Age   SibSp  Parch  Ticket   Fare     Cabin  Embarked  Family  age_group  age_group0  Sex0
0  892          3       Kelly, Mr. James                              male    34.5  0      0      330911   7.8292   NaN    Q         0       青年       3           2
1  893          3       Wilkes, Mrs. James (Ellen Needs)              female  47.0  1      0      363272   7        NaN    S         1       中年       4           1
2  894          2       Myles, Mr. Thomas Francis                     male    62.0  0      0      240276   9.6875   NaN    Q         0       中老年     5           2
3  895          3       Wirz, Mr. Albert                              male    27.0  0      0      315154   8.6625   NaN    S         0       青年       3           2
4  896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0  1      1      3101298  12.2875  NaN    S         2       青年       3           1
test2["Embarked0"] = test2["Embarked"].map({'S': 1, 'C': 2, 'Q': 2}).astype(int)
test2.head(5)
   PassengerId  Pclass  Name                                          Sex     Age   SibSp  Parch  Ticket   Fare     Cabin  Embarked  Family  age_group  age_group0  Sex0  Embarked0
0  892          3       Kelly, Mr. James                              male    34.5  0      0      330911   7.8292   NaN    Q         0       青年       3           2     2
1  893          3       Wilkes, Mrs. James (Ellen Needs)              female  47.0  1      0      363272   7        NaN    S         1       中年       4           1     1
2  894          2       Myles, Mr. Thomas Francis                     male    62.0  0      0      240276   9.6875   NaN    Q         0       中老年     5           2     2
3  895          3       Wirz, Mr. Albert                              male    27.0  0      0      315154   8.6625   NaN    S         0       青年       3           2     1
4  896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0  1      1      3101298  12.2875  NaN    S         2       青年       3           1     1
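# Model building and evaluation
# Select the training and test data
x_train = train2[["Pclass","Fare","Family","age_group0","Sex0","Embarked0"]]
y_train =train2["Survived"]
x_test = test2[["Pclass","Fare","Family","age_group0","Sex0","Embarked0"]]
# Logistic regression
from sklearn.linear_model import LogisticRegression
Classifier1 = LogisticRegression()
# Train the model
Classifier1.fit(x_train,y_train)
# Predict
Y1_prediction = Classifier1.predict(x_test)
# Evaluate the model
score_Logit = Classifier1.score(x_train,y_train)
score_Logit

# Accuracy is 0.805. This is measured on the same training data the model was fitted to, so it is an optimistic figure and not directly comparable with the official sample submission's leaderboard score of 0.766.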
0.8047138047138047
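Accuracy measured on the training rows tends to overstate how a model does on unseen passengers. One standard way to get a fairer estimate before submitting is k-fold cross-validation. The sketch below is an illustration on synthetic data, not the notebook's method: `X` and `y` are randomly generated stand-ins, and the 5-fold setting is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: 200 rows, 3 features, labels driven mostly by feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Accuracy on the rows the model was fitted to (what score(x_train, y_train) reports).
train_acc = LogisticRegression().fit(X, y).score(X, y)

# Accuracy on held-out folds: each fold is scored by a model that never saw it.
cv_scores = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="accuracy")

print("training accuracy:", round(train_acc, 3))
print("cross-validated accuracy:", round(cv_scores.mean(), 3))
```

On the Titanic data the same call would take x_train and y_train in place of `X` and `y`, and the cross-validated mean is the number to compare with a leaderboard score.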
Classifier1.coef_
array([[-0.74124862,  0.00496064, -0.17449562, -0.33260786, -2.28141631,
         0.58615422]])
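The coefficients are easier to read as odds ratios: in logistic regression, raising a feature by one unit multiplies the odds of survival by exp(coefficient), with the other features held fixed. The sketch below applies that to the values printed above, paired with the column order of x_train; the `features`, `coefs` and `odds_ratios` names are assumptions for illustration.

```python
import math

# Column order of x_train and the coefficients printed by Classifier1.coef_.
features = ["Pclass", "Fare", "Family", "age_group0", "Sex0", "Embarked0"]
coefs = [-0.74124862, 0.00496064, -0.17449562, -0.33260786, -2.28141631, 0.58615422]

# exp(coefficient) is the multiplicative change in the odds per unit increase.
odds_ratios = {name: math.exp(c) for name, c in zip(features, coefs)}

for name, ratio in odds_ratios.items():
    print(f"{name:<11} {ratio:.3f}")
```

With Sex0 coded as female = 1 and male = 2, the Sex0 ratio of about 0.10 means that, in this fitted model, being male cuts the odds of survival to roughly a tenth of the odds for an otherwise identical female passenger.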
Final = pd.DataFrame({"PassengerId":test2["PassengerId"],
                       "Survived":Y1_prediction
                       })
Final.head(10)
   PassengerId  Survived
0  892          0
1  893          0
2  894          0
3  895          0
4  896          0
5  897          0
6  898          1
7  899          0
8  900          1
9  901          0
Final.to_csv(r"D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/titanic/Final1.csv",index=False)
# Submission score 0.75598, below the sample's 0.766. More work is needed: tune the parameters, or try other algorithms.
# Drop the "Fare" feature and try logistic regression again.
x_train1 = train2[["Pclass","Family","age_group0","Sex0","Embarked0"]]
y_train1 =train2["Survived"]
x_test1 = test2[["Pclass","Family","age_group0","Sex0","Embarked0"]]
from sklearn.linear_model import LogisticRegression
Classifier1 = LogisticRegression()
Classifier1.fit(x_train1,y_train1)   # train the model
Y1_prediction = Classifier1.predict(x_test1)   # predict
score_Logit = Classifier1.score(x_train1,y_train1)   # evaluate on the training data
score_Logit

# Training accuracy is 0.799, which is 0.006 lower than the 0.805 from the previous run.
# No matter; submit it to the competition and see.
0.7991021324354658
Final = pd.DataFrame({"PassengerId":test2["PassengerId"],
                       "Survived":Y1_prediction
                       })
Final.to_csv(r"D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/titanic/Final2.csv",index=False)
# Submission score 0.75598, the same as the previous run.
x_train2 = train2[["Pclass","Family","age_group0","Sex0"]]
y_train2 =train2["Survived"]
x_test2 = test2[["Pclass","Family","age_group0","Sex0"]]
from sklearn.linear_model import LogisticRegression
Classifier1 = LogisticRegression()
Classifier1.fit(x_train2,y_train2)   # train the model
Y1_prediction = Classifier1.predict(x_test2)   # predict
score_Logit = Classifier1.score(x_train2,y_train2)   # evaluate on the training data
score_Logit

# Training accuracy is 0.806, which is 0.001 above the previous best of 0.805: a small improvement.
0.8058361391694725
Final = pd.DataFrame({"PassengerId":test2["PassengerId"],
                       "Survived":Y1_prediction
                       })
Final.to_csv(r"D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/titanic/Final3.csv",index=False)
# Submission score 0.77511, finally a bit above the official sample's 0.766.
# Reduce to three variables.
x_train3 = train2[["Pclass","Family","Sex0"]]
y_train3 =train2["Survived"]
x_test3 = test2[["Pclass","Family","Sex0"]]

from sklearn.linear_model import LogisticRegression
Classifier1 = LogisticRegression()
Classifier1.fit(x_train3,y_train3)   # train the model
Y1_prediction = Classifier1.predict(x_test3)   # predict
score_Logit = Classifier1.score(x_train3,y_train3)   # evaluate on the training data
score_Logit
0.8002244668911336
Final = pd.DataFrame({"PassengerId":test2["PassengerId"],
                       "Survived":Y1_prediction
                       })
Final.to_csv(r"D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/titanic/Final4.csv",index=False)

# Submission score 0.77033.
x_train4 = train2[["Pclass","age_group0","Sex0"]]
y_train4 =train2["Survived"]
x_test4 = test2[["Pclass","age_group0","Sex0"]]

from sklearn.linear_model import LogisticRegression
Classifier1 = LogisticRegression()
Classifier1.fit(x_train4,y_train4)   # train the model
Y1_prediction = Classifier1.predict(x_test4)   # predict
score_Logit = Classifier1.score(x_train4,y_train4)   # evaluate on the training data
score_Logit
0.8002244668911336
Final = pd.DataFrame({"PassengerId":test2["PassengerId"],
                       "Survived":Y1_prediction
                       })
Final.to_csv(r"D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/titanic/Final5.csv",index=False)

# Submission score 0.76076, lower again.
# Finally, try two features.
x_train5 = train2[["Pclass","Sex0"]]
y_train5 =train2["Survived"]
x_test5 = test2[["Pclass","Sex0"]]

from sklearn.linear_model import LogisticRegression
Classifier1 = LogisticRegression()
Classifier1.fit(x_train5,y_train5)   # train the model
Y1_prediction = Classifier1.predict(x_test5)   # predict
score_Logit = Classifier1.score(x_train5,y_train5)   # evaluate on the training data
score_Logit
0.7867564534231201
Final = pd.DataFrame({"PassengerId":test2["PassengerId"],
                       "Survived":Y1_prediction
                       })
Final.to_csv(r"D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/titanic/Final6.csv",index=False)

# Submission score 0.76555, level with the sample baseline.
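The experiments above repeat the same fit, predict and score steps with a different column list each time, so they can be written once and looped over. The sketch below is one possible refactoring on synthetic data: the `fit_and_score` helper, the `toy_train` and `toy_test` frames and their random values are all assumptions for illustration, standing in for train2 and test2.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_and_score(train, test, cols, target="Survived"):
    """Fit on train[cols]; return (training accuracy, predictions for test)."""
    model = LogisticRegression()
    model.fit(train[cols], train[target])
    return model.score(train[cols], train[target]), model.predict(test[cols])

# Synthetic stand-ins for train2 and test2 (random values, not Titanic data).
rng = np.random.default_rng(0)

def make_frame(n):
    return pd.DataFrame({
        "Pclass": rng.integers(1, 4, size=n),
        "Family": rng.integers(0, 5, size=n),
        "Sex0": rng.integers(1, 3, size=n),
    })

toy_train = make_frame(200)
base = (toy_train["Sex0"] == 1).astype(int)      # label follows Sex0 ...
flip = rng.random(200) < 0.1                     # ... with 10% random noise
toy_train["Survived"] = np.where(flip, 1 - base, base)
toy_test = make_frame(50)

feature_sets = {
    "three features": ["Pclass", "Family", "Sex0"],
    "two features": ["Pclass", "Sex0"],
}

results = {}
for name, cols in feature_sets.items():
    acc, preds = fit_and_score(toy_train, toy_test, cols)
    results[name] = acc
    print(name, round(acc, 3), len(preds))
```

Each entry of `feature_sets` corresponds to one of the hand-written runs, and the predictions returned for each set can be written to its own submission file.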
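# That concludes the logistic regression experiments. Next, try other algorithms.
# First save the cleaned train2 and test2 as CSV so the next analysis can load them directly.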

train2.to_csv(r"D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/titanic/train2.csv",index=False)
test2.to_csv(r"D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/titanic/test2.csv",index=False)