Python: Processing the Kaggle Titanic Dataset

This post looks at how factors such as sex, passenger class, and port of embarkation affect survival in Kaggle's Titanic dataset. The analysis shows that women survived at a much higher rate than men, that survival is negatively correlated with passenger class (3rd class fares worst), and that age matters: passengers under 15 survived at a noticeably higher rate, while above 40 there is little difference. Even basic data cleaning and feature selection measurably improve the prediction model.


# List the files in Titanic_all

Idea 1: could some algorithm be used to compute a weight matrix for the features, apply it to the feature data, and then cluster with k-means? (A rough sketch of this follows below.)

Idea 2: just start with logistic regression and random forests.
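
One possible reading of idea 1, sketched purely as an illustration: treat a random forest's feature importances as the "weights", scale the standardized features by them, and then cluster with k-means. This assumes the numeric feature matrix prepared later in this post (Age filled, Sex and Embarked encoded); the estimators and parameters below are my own choices, not something run in the original analysis.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Assumes train_data has already been cleaned as in the rest of this post:
# Age filled, Sex and Embarked encoded as integers.
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
X = train_data[features].astype(float).values
y = train_data['Survived'].values

# Use feature importances from a random forest as a (diagonal) weighting of the features.
rf = RandomForestClassifier(n_estimators=100, random_state=1)
rf.fit(X, y)
weights = rf.feature_importances_

# Standardize, apply the weights, then cluster into two groups with k-means.
X_weighted = StandardScaler().fit_transform(X) * weights
clusters = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X_weighted)

# Compare the unsupervised clusters against the actual Survived labels.
print(pd.crosstab(clusters, y))

The final crosstab would show whether the importance-weighted clustering separates survivors from non-survivors at all.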

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

# %matplotlib inline
train_data = pd.read_csv('train.csv')   # Kaggle training set (891 passengers)
test_data = pd.read_csv('test.csv')     # Kaggle test set (418 passengers)
train_data.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
train_data.info()
print("-" * 40)
test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
----------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
train_data.describe()
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
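
The info() output above already hints at the missing values (Age, Cabin, Embarked in the training set; Age, Fare, Cabin in the test set). A quick way to tally them, as a small sketch:

# Count missing values per column in both datasets.
print(train_data.isnull().sum())
print(test_data.isnull().sum())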

Effect of sex on survival

train_data['Survived'].value_counts().plot.pie(autopct = '%1.2f%%')
[Figure: pie chart of the overall Survived distribution]

There are roughly twice as many men as women on board, but women's survival rate is far higher.

train_data.groupby(['Sex','Survived']).size().plot.bar()
[Figure: passenger counts grouped by Sex and Survived]

train_data[['Sex','Survived']].groupby(['Sex']).sum().plot.bar()
[Figure: number of survivors by Sex]

train_data[['Sex','Survived']].groupby(['Sex']).mean().plot.bar()
[Figure: survival rate by Sex]

Passenger class seems to matter even more: 3rd class has the most passengers but the fewest survivors.

train_data.groupby(['Pclass','Sex']).size().plot.bar()
[Figure: passenger counts by Pclass and Sex]

train_data[['Pclass','Survived']].groupby(['Pclass']).sum().plot.bar()
[Figure: number of survivors by Pclass]

train_data[['Pclass','Survived']].groupby(['Pclass']).mean().plot.bar()
[Figure: survival rate by Pclass]

Female survival rates also differ across passenger classes.

train_data[['Sex','Pclass','Survived']].groupby(['Pclass','Sex']).mean().plot.bar()
[Figure: survival rate by Pclass and Sex]
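
To put exact numbers on the bar charts above, the same grouped means can simply be printed; a small sketch using groupby and pivot_table:

# Survival rate (mean of Survived) broken down by class and sex.
print(train_data.groupby(['Pclass', 'Sex'])['Survived'].mean().unstack())

# The same view as a pivot table.
print(pd.pivot_table(train_data, values='Survived', index='Pclass', columns='Sex', aggfunc='mean'))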

Relationship between age and survival

fig, ax = plt.subplots(1, 2, figsize=(18, 8))
# Split violins: age distribution of survivors vs non-survivors within each class and each sex
sns.violinplot(x='Pclass', y='Age', hue='Survived', data=train_data, split=True, ax=ax[0])
ax[0].set_title('Pclass and Age vs Survived')
ax[0].set_yticks(range(0, 100, 10))

sns.violinplot(x='Sex', y='Age', hue='Survived', data=train_data, split=True, ax=ax[1])
ax[1].set_title('Sex and Age vs Survived')
ax[1].set_yticks(range(0, 100, 10))

plt.show()

[Figure: split violin plots of Age vs Survived, by Pclass and by Sex]

Age distribution analysis (it's annoying how often I misspell things).

Survival rates are noticeably higher below age 15; above 40 there is little difference.

fig, ax = plt.subplots(figsize=(10, 5))
sns.kdeplot(train_data.loc[train_data['Survived'] == 0, 'Age'], shade=True, color='gray', label='Not Survived')
sns.kdeplot(train_data.loc[train_data['Survived'] == 1, 'Age'], shade=True, color='g', label='Survived')
plt.title('Age -- Survived or Not')
plt.xlabel('Age')
[Figure: kernel density of Age for survivors vs non-survivors]
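
To check the "under 15 / over 40" observation numerically, survival rate can be computed per age band; a small sketch using pd.cut (the bin edges below are an arbitrary choice of mine):

# Bin ages and compute survival rate per band (rows with missing Age are ignored by the grouping).
age_bands = pd.cut(train_data['Age'], bins=[0, 15, 30, 40, 55, 80])
print(train_data.groupby(age_bands)['Survived'].agg(['mean', 'count']))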

Relationship between port of embarkation and survival: the most passengers boarded at S, which also has the lowest survival rate. Did mostly women board at C and Q?

grid = sns.FacetGrid(data = train_data,col='Pclass',hue='Sex')
grid.map(sns.countplot,'Embarked')
grid.add_legend()
[Figure: Embarked counts, faceted by Pclass and colored by Sex]

sns.countplot(x='Embarked', hue='Survived', data=train_data)
plt.title('Embarked and Survived')
[Figure: counts by Embarked, split by Survived]

# factorplot was renamed catplot in newer seaborn versions
sns.catplot(x='Embarked', y='Survived', data=train_data, kind='bar')
plt.title('Embarked and Survived rate')
[Figure: survival rate by Embarked]
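
To answer the question above about who boarded where, a crosstab of Embarked against Sex plus the per-port survival rate gives the numbers directly; a small sketch:

# Sex breakdown per embarkation port, and survival rate per port.
print(pd.crosstab(train_data['Embarked'], train_data['Sex']))
print(train_data.groupby('Embarked')['Survived'].mean())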

First pass at simple data processing: only fill missing Age and pick a few basic features.

# Fill missing Age values with the median
train_data["Age"] = train_data["Age"].fillna(train_data["Age"].median())
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
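A single global median is the simplest choice. As a possible refinement (not used in this post), Age is often filled with the median of each (Pclass, Sex) group instead; a sketch on a fresh copy of the raw data:

# Alternative: fill Age with the median within each (Pclass, Sex) group rather than one global median.
raw = pd.read_csv('train.csv')
raw['Age'] = raw.groupby(['Pclass', 'Sex'])['Age'].transform(lambda s: s.fillna(s.median()))
print(raw['Age'].isnull().sum())  # should be 0 as long as every group has at least one known age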
# Linear regression (used here as a crude classifier by thresholding its output)
from sklearn.linear_model import LinearRegression
# What is this? (note to self) -- KFold splits the data into k folds for cross-validation
from sklearn.model_selection import KFold

# Select a few representative features
predictors = ["Pclass","Age","SibSp","Parch","Fare"]

# Initialize the linear regression model
alg = LinearRegression()
# Split the samples into three folds for 3-fold cross-validation
# (random_state only takes effect when shuffle=True, so it is omitted here)
kf = KFold(n_splits=3, shuffle=False)

predictions = []
for train, test in kf.split(train_data):
    train_predictors = train_data[predictors].iloc[train, :]  # iloc: position-based DataFrame indexing
    train_target = train_data["Survived"].iloc[train]
    alg.fit(train_predictors, train_target)  # fit linear regression on the training folds
    test_predictions = alg.predict(train_data[predictors].iloc[test, :])  # predict on the held-out fold
    predictions.append(test_predictions)

print(predictions[0][1:10],predictions[1][1:10],predictions[2][1:10],'\n'*2,len(predictions),len(predictions[0]))  #len = 3
[0.64716068 0.22381187 0.65781892 0.15821019 0.20954606 0.54764008
 0.35828968 0.29636233 0.54633321] [0.62891234 0.77642663 0.22950758 0.13992775 0.29488562 0.42050708
 0.22989295 1.06210609 0.73216932] [0.22171068 0.46757728 0.11475416 0.26813918 0.47166107 0.45771509
 0.26832766 0.66241433 0.15305294] 

 3 297
# Threshold the regression output at 0.5 to turn it into 0/1 predictions
predictions = np.concatenate(predictions, axis=0)
predictions[predictions > 0.5] = 1
predictions[predictions <= 0.5] = 0

accuracy = sum(predictions == train_data["Survived"]) / len(predictions)

print("Accuracy:", accuracy)
Accuracy: 0.7037037037037037
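
The manual KFold loop above can also be written with cross_val_predict, which returns the same kind of out-of-fold predictions in a couple of lines; a sketch (with an integer cv, a regressor gets the same unshuffled 3-fold split, so the accuracy should essentially match):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# Out-of-fold predictions for every row of the training set.
oof = cross_val_predict(LinearRegression(), train_data[predictors], train_data["Survived"], cv=3)
oof_labels = (oof > 0.5).astype(int)
print("Accuracy:", (oof_labels == train_data["Survived"]).mean())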

Adding Sex and Embarked as features (Cabin is filled too, but not used yet) raises accuracy by about 0.08.

# Encode Sex: male -> 0, female -> 1
train_data.loc[train_data['Sex'] == 'male','Sex'] = 0
train_data.loc[train_data['Sex'] == 'female','Sex'] = 1
# Handle Embarked: from the plots above, boarding at C skews female (most clearly for Pclass 2),
# while men mostly boarded at S; fill the two missing values with 'C'
train_data['Embarked'] = train_data['Embarked'].fillna('C')
# Handle Cabin: it is mostly missing, which is itself informative (deck unknown), so fill with a placeholder
train_data['Cabin'] = train_data.Cabin.fillna('U0')
# Encode Embarked as integers
train_data.loc[train_data['Embarked'] == 'S','Embarked'] = 1
train_data.loc[train_data['Embarked'] == 'C','Embarked'] = 2
train_data.loc[train_data['Embarked'] == 'Q','Embarked'] = 3
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null int64
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          891 non-null object
Embarked       891 non-null int64
dtypes: float64(2), int64(7), object(3)
memory usage: 83.6+ KB
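As an aside, the Sex/Embarked encoding above can be written more compactly with map, or replaced by one-hot columns via get_dummies; a sketch of both on a fresh copy of the raw data (not used in the rest of this post):

# The same integer encoding, written with map.
raw = pd.read_csv('train.csv')
raw['Sex'] = raw['Sex'].map({'male': 0, 'female': 1})
raw['Embarked'] = raw['Embarked'].fillna('C').map({'S': 1, 'C': 2, 'Q': 3})

# Or, instead of arbitrary integers, one-hot encode the categorical columns.
raw_onehot = pd.get_dummies(pd.read_csv('train.csv'), columns=['Sex', 'Embarked'])
print(raw_onehot.filter(like='Embarked').head())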
predictors_2 = ['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']
alg_2 = LinearRegression()
# Three folds again for 3-fold cross-validation (random_state omitted since shuffle=False)
kf_2 = KFold(n_splits=3, shuffle=False)

predictions_2 = []
for train, test in kf_2.split(train_data):
    train_predictors_2 = train_data[predictors_2].iloc[train, :]
    train_target_2 = train_data["Survived"].iloc[train]
    alg_2.fit(train_predictors_2, train_target_2)  # fit linear regression on the training folds
    test_predictions_2 = alg_2.predict(train_data[predictors_2].iloc[test, :])
    predictions_2.append(test_predictions_2)
predictions_2 = np.concatenate(predictions_2,axis=0)
predictions_2[predictions_2 > 0.5] = 1
predictions_2[predictions_2 <= 0.5] = 0

accuracy_2 = sum(predictions_2 == train_data["Survived"]) / len(predictions_2)

print("Accuracy:", accuracy_2)
Accuracy: 0.7833894500561167
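
Coming back to idea 2 from the top of the post: with the same features in place, real classifiers are the natural next step. A sketch using scikit-learn's LogisticRegression and RandomForestClassifier with cross_val_score; it was not run for this post, so no scores are claimed, and the hyperparameters are just common defaults:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = train_data[predictors_2].astype(float)
y = train_data["Survived"].astype(int)

logreg = LogisticRegression(max_iter=1000)
rf = RandomForestClassifier(n_estimators=100, random_state=1)

# 3-fold cross-validated accuracy for each classifier.
print("LogisticRegression:", cross_val_score(logreg, X, y, cv=3, scoring='accuracy').mean())
print("RandomForestClassifier:", cross_val_score(rf, X, y, cv=3, scoring='accuracy').mean())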

Processing the test set

test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
# Fill missing Age in the test set with the training-set median
test_data["Age"] = test_data["Age"].fillna(train_data["Age"].median())
# Encode Sex the same way as the training set
test_data.loc[test_data['Sex'] == 'male','Sex'] = 0
test_data.loc[test_data['Sex'] == 'female','Sex'] = 1
# Embarked has no missing values in the test set, so only the integer encoding is needed
# Fill Cabin with the same 'U0' placeholder as the training set
test_data['Cabin'] = test_data.Cabin.fillna('U0')
test_data.loc[test_data['Embarked'] == 'S','Embarked'] = 1
test_data.loc[test_data['Embarked'] == 'C','Embarked'] = 2
test_data.loc[test_data['Embarked'] == 'Q','Embarked'] = 3
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null int64
Age            418 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          418 non-null object
Embarked       418 non-null int64
dtypes: float64(2), int64(6), object(3)
memory usage: 36.0+ KB
# Fill the one missing Fare in the test set with the training-set median (the training set has none missing)
test_data["Fare"] = test_data["Fare"].fillna(train_data["Fare"].median())
test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null int64
Age            418 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           418 non-null float64
Cabin          418 non-null object
Embarked       418 non-null int64
dtypes: float64(2), int64(6), object(3)
memory usage: 36.0+ KB
test_features = ['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']
test_data['Survived'] = -1   # placeholder column, overwritten just below

test_predictors = test_data[test_features]
test_data['Survived'] = alg_2.predict(test_predictors)   # raw regression output, thresholded next
test_data.head()
| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | 0 | 34.5 | 0 | 0 | 330911 | 7.8292 | U0 | 3 | 0.158051 |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | 1 | 47.0 | 1 | 0 | 363272 | 7.0000 | U0 | 1 | 0.480204 |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | 0 | 62.0 | 0 | 0 | 240276 | 9.6875 | U0 | 3 | 0.177382 |
| 3 | 895 | 3 | Wirz, Mr. Albert | 0 | 27.0 | 0 | 0 | 315154 | 8.6625 | U0 | 1 | 0.106463 |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | 1 | 22.0 | 1 | 1 | 3101298 | 12.2875 | U0 | 1 | 0.617975 |
test_data.loc[test_data['Survived'] > 0.5,'Survived'] = 1
test_data.loc[test_data['Survived'] <= 0.5,'Survived'] = 0
test_data.head()
| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | 0 | 34.5 | 0 | 0 | 330911 | 7.8292 | U0 | 3 | 0.0 |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | 1 | 47.0 | 1 | 0 | 363272 | 7.0000 | U0 | 1 | 0.0 |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | 0 | 62.0 | 0 | 0 | 240276 | 9.6875 | U0 | 3 | 0.0 |
| 3 | 895 | 3 | Wirz, Mr. Albert | 0 | 27.0 | 0 | 0 | 315154 | 8.6625 | U0 | 1 | 0.0 |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | 1 | 22.0 | 1 | 1 | 3101298 | 12.2875 | U0 | 1 | 1.0 |
submission = pd.DataFrame({
    'PassengerId':test_data['PassengerId'],
    'Survived':test_data['Survived']
})
submission.head()
| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0.0 |
| 1 | 893 | 0.0 |
| 2 | 894 | 0.0 |
| 3 | 895 | 0.0 |
| 4 | 896 | 1.0 |
submission.describe()
| | PassengerId | Survived |
|---|---|---|
| count | 418.000000 | 418.000000 |
| mean | 1100.500000 | 0.358852 |
| std | 120.810458 | 0.480238 |
| min | 892.000000 | 0.000000 |
| 25% | 996.250000 | 0.000000 |
| 50% | 1100.500000 | 0.000000 |
| 75% | 1204.750000 | 1.000000 |
| max | 1309.000000 | 1.000000 |
submission.to_csv('titanic_submission.csv',index = False)
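
One last detail: Survived is written out as a float (0.0 / 1.0) here, while Kaggle's sample submission stores it as integer 0/1, so it is safer to cast before saving; a small sketch:

# Cast the prediction column to int so the CSV contains 0/1 rather than 0.0/1.0.
submission['Survived'] = submission['Survived'].astype(int)
submission.to_csv('titanic_submission.csv', index=False)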