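Data analysis: Titanic survival prediction

I did this as a course project at school, but only half understood the workflow. Now that I have finished reading the book 《机器学习实战》, I am redoing it with my own understanding.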

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

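Data import

Inspecting the data shows that Age and Cabin have missing values (Embarked is also missing 2 values in the training set), and that Name, Sex, Ticket, Cabin, and Embarked are of type object, so they need to be converted in later processing.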

data_train = pd.read_csv(r'C:/Users/train.csv')
data_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
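
A quicker way to see only the gaps that info() reports is to count nulls per column. The sketch below uses a tiny made-up frame (`toy` is hypothetical, not the real train.csv) so that it runs on its own; with the real data the same two calls would be made on data_train.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', None, 'S'],
    'Fare': [7.25, 71.2833, 7.925, 53.1],
})

# Number of missing values per column, largest first
missing_counts = toy.isnull().sum().sort_values(ascending=False)
# Share of rows that are missing, per column
missing_ratio = toy.isnull().mean()

print(missing_counts)
print(missing_ratio.round(2))
```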

Now look at the test set.

data_test= pd.read_csv(r'test.csv')
data_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         418 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

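Set the passenger ID as the index. This cell uses test_process, which is created in the data processing section below, so it has to run after those steps (the output already includes the derived name columns).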

test_process = test_process.set_index(['PassengerId'])
test_process

The test set now looks like this:

             Pclass  Name                                          Sex     Age  SibSp  Parch  Ticket              Fare      Embarked  Called  Name_length  First_name
PassengerId
892          3       Kelly, Mr. James                              male    34   0      0      330911              7.8292    Q         Mr      16           Kelly
893          3       Wilkes, Mrs. James (Ellen Needs)              female  47   1      0      363272              7.0000    S         Mr      32           Wilkes
894          2       Myles, Mr. Thomas Francis                     male    62   0      0      240276              9.6875    Q         Mr      25           Myles
895          3       Wirz, Mr. Albert                              male    27   0      0      315154              8.6625    S         Mr      16           Wirz
896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22   1      1      3101298             12.2875   S         Mr      44           Hirvonen
...
1305         3       Spector, Mr. Woolf                            male    25   0      0      A.5. 3236           8.0500    S         Mr      18           Spector
1306         1       Oliva y Ocana, Dona. Fermina                  female  39   0      0      PC 17758            108.9000  C         NaN     28           Oliva y Ocana
1307         3       Saether, Mr. Simon Sivertsen                  male    38   0      0      SOTON/O.Q. 3101262  7.2500    S         Mr      28           Saether
1308         3       Ware, Mr. Frederick                           male    25   0      0      359309              8.0500    S         Mr      19           Ware
1309         3       Peter, Master. Michael J                      male    22   1      1      2668                22.3583   C         NaN     24           Peter

418 rows × 12 columns

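Data processing

Handling missing values

The missing values in this dataset are assumed to be missing completely at random, that is, not dependent on the other fully observed variables, so both deletion and imputation are reasonable options. Cabin is missing far too often (only 204 of 891 training rows have a value), so the feature is dropped outright. If that feels too hasty, compute a few correlations or draw some plots first.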
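
One way to run the sanity check mentioned above is to compare the survival rate of passengers with and without a recorded cabin. This is only a sketch on made-up rows (`toy` and `has_cabin` are hypothetical names); a large gap between the two groups on the real data would suggest the missingness is not completely random and that a "has cabin" indicator might be worth keeping.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real train.csv
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 0, 1, 0, 0],
    'Cabin': [np.nan, 'C85', 'E46', np.nan, np.nan, 'B28', np.nan, 'G6'],
})

# Indicator: 1 if Cabin is recorded, 0 if it is missing
has_cabin = toy['Cabin'].notnull().astype(int)

# Survival rate within each group (index 0 = no cabin, 1 = cabin recorded)
rate_by_cabin = toy.groupby(has_cabin)['Survived'].mean()

# Pearson correlation between the indicator and survival
corr = has_cabin.corr(toy['Survived'])

print(rate_by_cabin)
print(round(corr, 3))
```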

# Drop Cabin
train_process = data_train.drop(['Cabin'],axis=1)
# Impute the missing ages
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
# .copy() so that filling Age below modifies a real frame, not a view of train_process
Age_df = train_process[['Age','Survived','Pclass','SibSp','Parch','Fare']].copy()
UnknowAge = Age_df[Age_df.Age.isnull()].values
KnowAge = Age_df[Age_df.Age.notnull()].values
# y is the target (age), x holds the known attributes
y_train = KnowAge[:,0]
x_train = KnowAge[:,1:]
rfr = RandomForestRegressor(n_estimators=500,random_state=42)
rfr.fit(x_train,y_train)
predictedAges = rfr.predict(UnknowAge[:,1:])
Age_df.loc[Age_df.Age.isnull(), 'Age'] = predictedAges
train_process.Age = Age_df.Age.astype(int)

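Missing ages are imputed with a random forest: a regression model is fitted on the rows where Age is known and then used to predict the missing values.

The test set also needs the Cabin variable dropped and its missing ages imputed.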
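
The two imputation passes can also be written as one helper that fits the age model once on the training rows and reuses it for the test rows. This is a sketch under assumptions, not the code used in this post: `impute_age` and the toy frames are hypothetical, and Survived is left out of the features because the test set has no such column.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def impute_age(train, test, features, n_estimators=100, random_state=42):
    """Fit an age regressor on training rows with known Age, then fill
    missing Age in copies of both frames (hypothetical helper)."""
    known = train[train['Age'].notnull()]
    model = RandomForestRegressor(n_estimators=n_estimators,
                                  random_state=random_state)
    model.fit(known[features], known['Age'])
    filled = []
    for df in (train, test):
        df = df.copy()  # leave the caller's frames untouched
        mask = df['Age'].isnull()
        if mask.any():
            df.loc[mask, 'Age'] = model.predict(df.loc[mask, features])
        filled.append(df)
    return filled[0], filled[1]

# Made-up data so the sketch runs on its own
rng = np.random.default_rng(0)

def make_toy(n):
    df = pd.DataFrame({
        'Pclass': rng.integers(1, 4, n),
        'SibSp': rng.integers(0, 3, n),
        'Parch': rng.integers(0, 3, n),
        'Fare': rng.uniform(5, 100, n),
    })
    df['Age'] = 50 - 8 * df['Pclass'] + rng.normal(0, 3, n)
    df.loc[df.index[::5], 'Age'] = np.nan  # blank out every fifth age
    return df

toy_train, toy_test = make_toy(60), make_toy(20)
features = ['Pclass', 'SibSp', 'Parch', 'Fare']
train_filled, test_filled = impute_age(toy_train, toy_test, features)
print(train_filled['Age'].isnull().sum(), test_filled['Age'].isnull().sum())
```

Fitting on the training rows only keeps the test set out of model fitting, at the cost of not using the test ages that are known.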

# Test set
test_process = data_test.drop(['Cabin'],axis=1)
test_process.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         418 non-null    float64
 9   Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(4)
memory usage: 32.8+ KB
# .copy() so that filling Age below modifies a real frame, not a view of test_process
Age_df = test_process[['Age','Pclass','SibSp','Parch','Fare']].copy()
UnknowAge = Age_df[Age_df.Age.isnull()].values
KnowAge = Age_df[Age_df.Age.notnull()].values
# y is the target (age), x holds the known attributes
y_train = KnowAge[:,0]
x_train = KnowAge[:,1:]
rfr = RandomForestRegressor(n_estimators=500,random_state=42)
rfr.fit(x_train,y_train)
predictedAges = rfr.predict(UnknowAge[:,1:])
Age_df.loc[Age_df.Age.isnull(), 'Age'] = predictedAges
test_process.Age = Age_df.Age.astype(int)

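Text data processing

Process the Name text field: extract the title, the name length, and the part before the comma (stored as First_name), then drop the Name variable.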

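def change(df):
    # Title: first match of Miss/Mr/Ms, so "Mrs" is captured as "Mr" and other titles become NaN
    df['Called'] = df['Name'].str.findall('Miss|Mr|Ms').str[0]
    df['Name_length'] = df['Name'].str.len()
    # Part of the name before the comma
    df['First_name'] = df['Name'].str.split(',').str[0]
    # This rebinds the local name only: the new columns above are added in place,
    # but the caller's frame still contains Name afterwards
    df = df.drop(['Name'],axis=1)

change(train_process)
change(test_process)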
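
The pattern 'Miss|Mr|Ms' turns "Mrs" into "Mr" and leaves titles such as "Master" or "Dona" as NaN, which is visible in the test set output earlier. A possible alternative, shown here only as a sketch, captures whatever sits between the comma and the first period and returns a new frame instead of relying on in-place changes. `add_name_features` is a hypothetical name, and the regular expression assumes every name follows the "Surname, Title. Given names" layout.

```python
import pandas as pd

def add_name_features(df):
    """Return a copy with Called, Name_length and First_name added and
    Name dropped (hypothetical helper, assumes 'Surname, Title. ...')."""
    out = df.copy()
    # Title = text between the comma and the first period, e.g. "Mrs", "Master"
    out['Called'] = (out['Name']
                     .str.extract(r',\s*([^.]+)\.', expand=False)
                     .str.strip())
    out['Name_length'] = out['Name'].str.len()
    # Part before the comma
    out['First_name'] = out['Name'].str.split(',').str[0]
    return out.drop(columns=['Name'])

names = pd.DataFrame({'Name': [
    'Kelly, Mr. James',
    'Wilkes, Mrs. James (Ellen Needs)',
    'Peter, Master. Michael J',
    'Oliva y Ocana, Dona. Fermina',
]})
features = add_name_features(names)
print(features)
```

Because the function returns a new frame, the result has to be assigned, for example `train_process = add_name_features(train_process)`.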

TargetEncoder

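Encode the remaining object-type variables. scikit-learn and companion libraries such as category_encoders offer many encoding schemes. As a rule of thumb, target encoding suits features with no inherent order and more than 4 categories, while one-hot encoding suits features with no inherent order and fewer than 4 categories.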
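
To make the difference between the two schemes concrete, here is a minimal pandas-only sketch on made-up rows. It uses the plain, unsmoothed category mean as the target encoding, which is a simplification: as far as I understand, the TargetEncoder in category_encoders additionally blends rare categories toward the global mean.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'Embarked': ['S', 'C', 'S', 'Q', 'C', 'S'],
    'Sex': ['male', 'female', 'female', 'male', 'female', 'male'],
    'Survived': [0, 1, 1, 0, 1, 0],
})

# Plain target encoding: replace each category by the mean of the target
embarked_means = toy.groupby('Embarked')['Survived'].mean()
toy['Embarked_te'] = toy['Embarked'].map(embarked_means)

# One-hot encoding: one indicator column per category
sex_onehot = pd.get_dummies(toy['Sex'], prefix='Sex')

print(toy[['Embarked', 'Embarked_te']])
print(sex_onehot)
```

Target encoding keeps a single column however many categories there are, which is why it is attractive for high-cardinality fields such as Ticket; the price is that it uses the label, so it should be fitted on training data only.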

import category_encoders
from category_encoders import TargetEncoder
X_train = train_process.iloc[:,2:]
y_train = train_process.iloc[:,1]
tar_encoder1 = TargetEncoder(cols=['Sex','Ticket','Embarked','Called','Name_length','First_name'],
                             handle_missing='value',
                             handle_unknown='value')
tar_encoder1.fit(X_train,y_train)
TargetEncoder(cols=['Sex', 'Ticket', 'Embarked', 'Called', 'Name_length',
                    'First_name'])
X_train_encoded = tar_encoder1.transform(X_train)
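# Assign the result: drop() returns a new frame and does not modify X_train_encoded in place
X_train_encoded = X_train_encoded.drop(['Name'],axis=1)
X_train_encoded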
     Pclass  Sex       Age   SibSp  Parch  Ticket    Fare     Embarked  Called    Name_length  First_name
0    3       0.188908  22.0  1      0      0.383838  7.2500   0.336957  0.283721  0.282051     0.103230
1    1       0.742038  38.0  1      0      0.383838  71.2833  0.553571  0.283721  0.998476     0.383838
2    3       0.742038  26.0  0      0      0.383838  7.9250   0.336957  0.697802  0.315789     0.383838
3    1       0.742038  35.0  1      0      0.468759  53.1000  0.336957  0.283721  0.999439     0.468759
4    3       0.188908  35.0  0      0      0.383838  8.0500   0.336957  0.283721  0.372093     0.468759
...
886  2       0.188908  27.0  0      0      0.383838  13.0000  0.336957  0.492063  0.325000     0.383838
887  1       0.742038  19.0  0      0      0.383838  30.0000  0.336957  0.697802  0.372093     0.632953
888  3       0.742038  NaN   1      2      0.103230  23.4500  0.336957  0.697802  0.428461     0.103230
889  1       0.188908  26.0  0      0      0.383838  30.0000  0.553571  0.283721  0.325000     0.383838
890  3       0.188908  32.0  0      0      0.383838  7.7500   0.389610  0.283721  0.234375     0.383838

891 rows × 11 columns

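X_test = test_process
# Display only: the result is not assigned, so X_test keeps Name and the same column layout the encoder was fitted on
X_test.drop(['Name'],axis=1)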
             Pclass  Sex     Age  SibSp  Parch  Ticket              Fare      Embarked  Called  Name_length  First_name
PassengerId
892          3       male    34   0      0      330911              7.8292    Q         Mr      16           Kelly
893          3       female  47   1      0      363272              7.0000    S         Mr      32           Wilkes
894          2       male    62   0      0      240276              9.6875    Q         Mr      25           Myles
895          3       male    27   0      0      315154              8.6625    S         Mr      16           Wirz
896          3       female  22   1      1      3101298             12.2875   S         Mr      44           Hirvonen
...
1305         3       male    25   0      0      A.5. 3236           8.0500    S         Mr      18           Spector
1306         1       female  39   0      0      PC 17758            108.9000  C         NaN     28           Oliva y Ocana
1307         3       male    38   0      0      SOTON/O.Q. 3101262  7.2500    S         Mr      28           Saether
1308         3       male    25   0      0      359309              8.0500    S         Mr      19           Ware
1309         3       male    22   1      1      2668                22.3583   C         NaN     24           Peter

418 rows × 11 columns

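# Encode with the encoder fitted on the training data, then drop the raw Name column
X_test_encoded = tar_encoder1.transform(X_test).drop(['Name'],axis=1)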

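Standardization

Several models will be compared later, so the numeric features Age and Fare are put on a common scale first.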

import sklearn.preprocessing as preprocessing
scaler = preprocessing.StandardScaler()
scaler.fit(X_train_encoded[['Age','Fare']])
# Note: this second fit replaces the statistics learned from the training set,
# so both transforms below use the test-set mean and standard deviation
scaler.fit(X_test_encoded[['Age','Fare']])
StandardScaler()
X_train_encoded[['Age','Fare']] = scaler.transform(X_train_encoded[['Age','Fare']])
X_test_encoded[['Age','Fare']] = scaler.transform(X_test_encoded[['Age','Fare']])
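
The scaling code above calls fit twice, so the scaler ends up with test-set statistics. The more common pattern is to fit on the training data only and reuse those statistics for the test data. The sketch below shows that pattern on made-up numbers (`toy_train` and `toy_test` are hypothetical); switching to it would change the scaled values and possibly the scores reported below.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frames standing in for the encoded train and test sets
toy_train = pd.DataFrame({'Age': [22.0, 38.0, 26.0, 35.0],
                          'Fare': [7.25, 71.28, 7.93, 53.10]})
toy_test = pd.DataFrame({'Age': [34.0, 47.0],
                         'Fare': [7.83, 7.00]})

cols = ['Age', 'Fare']
scaler = StandardScaler()
scaler.fit(toy_train[cols])  # learn mean and std from the training data only

train_scaled = toy_train.copy()
test_scaled = toy_test.copy()
train_scaled[cols] = scaler.transform(toy_train[cols])
test_scaled[cols] = scaler.transform(toy_test[cols])  # reuse training statistics

print(scaler.mean_)
print(test_scaled)
```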

Model prediction

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC

X_train_encoded
X_test_encoded
             Pclass  Sex       Age        SibSp  Parch  Ticket    Fare       Embarked  Called    Name_length  First_name
PassengerId
892          3       0.188908  0.325138   0      0      0.383838  -0.497063  0.389610  0.283721  0.230769     0.732634
893          3       0.742038  1.326156   1      0      0.383838  -0.511926  0.336957  0.283721  0.565217     0.383838
894          2       0.188908  2.481178   0      0      0.383838  -0.463754  0.389610  0.283721  0.327273     0.383838
895          3       0.188908  -0.213872  0      0      0.383838  -0.482127  0.336957  0.283721  0.230769     0.383838
896          3       0.742038  -0.598880  1      1      0.383838  -0.417151  0.336957  0.283721  0.999439     0.383838
...
1305         3       0.188908  -0.367875  0      0      0.383838  -0.493105  0.336957  0.283721  0.200000     0.383838
1306         1       0.742038  0.710145   0      0      0.468759  1.314557   0.553571  0.492063  0.372093     0.383838
1307         3       0.188908  0.633143   0      0      0.383838  -0.507445  0.336957  0.283721  0.372093     0.383838
1308         3       0.188908  -0.367875  0      0      0.383838  -0.493105  0.336957  0.283721  0.234375     0.383838
1309         3       0.188908  -0.598880  1      1      0.834289  -0.236640  0.553571  0.492063  0.372093     0.834289

418 rows × 11 columns

X_train_encoded.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Pclass       891 non-null    int64  
 1   Sex          891 non-null    float64
 2   Age          891 non-null    int32  
 3   SibSp        891 non-null    int64  
 4   Parch        891 non-null    int64  
 5   Ticket       891 non-null    float64
 6   Fare         891 non-null    float64
 7   Embarked     891 non-null    float64
 8   Called       891 non-null    float64
 9   Name_length  891 non-null    float64
 10  First_name   891 non-null    float64
dtypes: float64(7), int32(1), int64(3)
memory usage: 73.2 KB

Voting

Start with a voting ensemble.

lr_clf = LogisticRegression(penalty='l1',solver='saga',n_jobs=-1,max_iter=20000)
rnd_clf = RandomForestClassifier(n_estimators=300,max_depth=8,min_samples_leaf=1,min_samples_split=5,random_state=42)
svm_clf = SVC(C=2,kernel='poly',random_state=42,probability=True)
voting_clf = VotingClassifier(estimators=[('lr',lr_clf),('rf',rnd_clf),('scv',svm_clf)],voting='soft')
voting_clf.fit(X_train_encoded,y_train)
  VotingClassifier(estimators=[('lr',
                                  LogisticRegression(max_iter=20000, n_jobs=-1,
                                                     penalty='l1', solver='saga')),
                                 ('rf',
                                  RandomForestClassifier(max_depth=8,
                                                         min_samples_split=5,
                                                         n_estimators=300,
                                                         random_state=42)),
                                 ('scv',
                                  SVC(C=2, kernel='poly', probability=True,
                                      random_state=42))],
                     voting='soft')
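# Note: gender_submission.csv is Kaggle's sample submission (it predicts survival from sex alone),
# not the true test labels, so the scores below measure agreement with that baseline
y_test = pd.read_csv(r'C:/Users/gender_submission.csv')
y_test = y_test['Survived']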

from sklearn.metrics import accuracy_score

for clf in (lr_clf,rnd_clf,svm_clf,voting_clf):
    clf.fit(X_train_encoded,y_train)
    y_pred = clf.predict(X_test_encoded)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
    
LogisticRegression 0.6961722488038278
RandomForestClassifier 0.80622009569378
SVC 0.6363636363636364
VotingClassifier 0.8110047846889952
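
Since gender_submission.csv is Kaggle's sample submission rather than the true test labels, a cross-validated score on the training data is a common alternative way to compare models. The sketch below is an illustration under assumptions: it runs on synthetic data from make_classification, and the estimator settings are placeholders rather than the ones used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (X_train_encoded, y_train)
X_toy, y_toy = make_classification(n_samples=200, n_features=8,
                                   n_informative=4, random_state=42)

voting = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
    ],
    voting='soft',
)

# 5-fold cross-validated accuracy: each fold is scored on data the model did not see
scores = cross_val_score(voting, X_toy, y_toy, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())
```

With the real data, the same call would take X_train_encoded and y_train in place of the synthetic arrays.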

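Next I tried XGBoost, expecting it to do fairly well. Note that the code below fits a regressor (XGBRFRegressor) and reports mean squared error, so the result is not directly comparable with the accuracy scores above.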

XGBoost

import xgboost
from sklearn.metrics import mean_squared_error
xgb_reg = xgboost.XGBRFRegressor(random_state=42)
xgb_reg.fit(X_train_encoded,y_train)
y_pred = xgb_reg.predict(X_test_encoded)
val_error=mean_squared_error(y_test,y_pred)
print("Validation MSE:", val_error) 
Validation MSE: 0.5023153196818051
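
To compare a regressor's output with the accuracy scores of the classifiers, the continuous predictions can be turned into 0/1 labels first. The sketch below uses made-up numbers in place of xgb_reg.predict(...) and a cutoff of 0.5, which is an assumption; fitting a classifier such as xgboost.XGBClassifier would be another way to get labels directly.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical values: true labels and continuous regressor outputs
y_true = np.array([0, 1, 1, 0, 1])
y_score = np.array([0.2, 0.8, 0.4, 0.1, 0.9])

# Threshold at 0.5 (assumed cutoff) to obtain class labels
y_label = (y_score >= 0.5).astype(int)

acc = accuracy_score(y_true, y_label)
print(y_label, acc)
```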