Titanic Survivor Prediction with Logistic Regression

import numpy as np
import pandas as pd

Reading the Data

#the trailing column is a newline artifact, so drop it
train=pd.read_csv(r'./mytrain.csv').iloc[:,:-1]
test=pd.read_csv(r'./mytest.csv').iloc[:,:-1]
#inspect the data
train.head()
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
test.head()
   PassengerId  Pclass                                          Name     Sex   Age  SibSp  Parch   Ticket     Fare Cabin Embarked
0          892       3                              Kelly, Mr. James    male  34.5      0      0   330911   7.8292   NaN        Q
1          893       3              Wilkes, Mrs. James (Ellen Needs)  female  47.0      1      0   363272   7.0000   NaN        S
2          894       2                     Myles, Mr. Thomas Francis    male  62.0      0      0   240276   9.6875   NaN        Q
3          895       3                              Wirz, Mr. Albert    male  27.0      0      0   315154   8.6625   NaN        S
4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0      1      1  3101298  12.2875   NaN        S

Examining the Data

train.describe()
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

The training set has missing values in Age, Cabin, and Embarked. Rows with a missing Age are simply dropped from the training set; the test set also has missing ages, which we will fill with linear regression, treating Age as the label rather than a feature. Cabin is missing in most rows, so the whole column is dropped. Embarked has only a couple of gaps, which are filled with the mode. PassengerId, Name, and Ticket are not used as features.

train.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Selecting Features

Select features by dropping PassengerId, Name, Ticket, and Cabin

train.drop(['PassengerId','Name','Ticket','Cabin'],inplace=True,axis=1)
test.drop(['PassengerId','Name','Ticket','Cabin'],inplace=True,axis=1)

Check the value counts of Embarked to find the mode for filling

train['Embarked'].value_counts()
S    644
C    168
Q     77
Name: Embarked, dtype: int64
train['Embarked'].fillna(value='S',inplace=True)
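Rather than hard-coding 'S', the mode can be computed from the data itself, so the fill stays correct if the input changes. A minimal sketch on a made-up series (the values are illustrative, not the real column):

```python
import pandas as pd

# Illustrative embarkation series with one missing value; 'S' is the most frequent port.
embarked = pd.Series(['S', 'C', 'S', 'Q', None, 'S', 'C'])

# mode() returns a Series because ties are possible; take the first entry as the fill value.
fill_value = embarked.mode()[0]
filled = embarked.fillna(fill_value)
```

On the real frame the same idea is `train['Embarked'].fillna(train['Embarked'].mode()[0], inplace=True)`.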

Only Age still has NaN values; drop those rows directly

train.dropna(inplace=True)
#the test set has missing values in the Age and Fare columns
test.describe()
           Pclass         Age       SibSp       Parch        Fare
count  418.000000  332.000000  418.000000  418.000000  417.000000
mean     2.265550   30.272590    0.447368    0.392344   35.627188
std      0.841838   14.181209    0.896760    0.981429   55.907576
min      1.000000    0.170000    0.000000    0.000000    0.000000
25%      1.000000   21.000000    0.000000    0.000000    7.895800
50%      3.000000   27.000000    0.000000    0.000000   14.454200
75%      3.000000   39.000000    1.000000    0.000000   31.500000
max      3.000000   76.000000    8.000000    9.000000  512.329200

For the test set, fill the missing Fare with the column mean; Age will be filled with linear regression later

test['Fare'].fillna(test['Fare'].mean(),inplace=True)

Encoding the Data

Convert the non-numeric columns to numeric values

Sex encoding

train.loc[train['Sex']=='male','Sex']=1#map male to 1
train.loc[train['Sex']=='female','Sex']=0
test.loc[test['Sex']=='male','Sex']=1
test.loc[test['Sex']=='female','Sex']=0

Embarked encoding

train.loc[train.Embarked=='S', 'Embarked'] = 0#map 'S' to 0
train.loc[train.Embarked=='C', 'Embarked'] = 1
train.loc[train.Embarked=='Q', 'Embarked'] = 2
test.loc[test.Embarked=='S', 'Embarked'] = 0
test.loc[test.Embarked=='C', 'Embarked'] = 1
test.loc[test.Embarked=='Q', 'Embarked'] = 2
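The six `.loc` assignments above can be collapsed into one `map` call per column. A sketch on a toy frame (the column names match the notebook, the rows are made up):

```python
import pandas as pd

df = pd.DataFrame({'Sex': ['male', 'female', 'male'],
                   'Embarked': ['S', 'C', 'Q']})

# One dict mapping per column replaces the repeated .loc masks.
df['Sex'] = df['Sex'].map({'male': 1, 'female': 0})
df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
```

A side benefit: `map` returns NaN for any unmapped value, which makes unexpected categories easy to spot.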

Outlier Handling

First extract the Survived label and drop it from the features. Then scan every cell in train: a cell below a lower percentile or above an upper percentile of its column marks the row as containing an outlier. Rows without a mark are kept as the model's training data.

First extract the Survived label and drop it.

lables_Survived=train['Survived']
train.drop('Survived',axis=1,inplace=True)
train.shape
(714, 7)

Build a marker array with one element per row of train; set an element to 1 if that row contains an outlier

mark_erro=np.zeros(train.shape[0])
#mark_erro
train_erro=np.array(train)
Min=np.percentile(train_erro[:,1],50)
Min
1.0

Scan every cell of train and set mark_erro accordingly

train_erro=np.array(train)#convert to an ndarray for easy element access
for i in range(train.shape[1]):
    Min=np.percentile(train_erro[:,i],8)#percentile takes an array and a percentage; the 8th percentile is the lower outlier cutoff
    Max=np.percentile(train_erro[:,i],92)#the 92nd percentile is the upper cutoff
    for j in range(train.shape[0]):
        if  train_erro[j,i]<Min or train_erro[j, i]>Max:
            mark_erro[j]=1#flag the row as containing an outlier
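The double loop can also be vectorized: compute both percentile cutoffs per column in one call and OR the out-of-range masks across columns. A sketch on random stand-in data, checked against the loop formulation:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 4))  # stand-in for the training matrix

# Per-column 8th and 92nd percentile cutoffs, the same thresholds as the loop.
lo = np.percentile(X, 8, axis=0)
hi = np.percentile(X, 92, axis=0)

# A row is flagged when any of its cells falls outside [lo, hi] for its column.
mark = ((X < lo) | (X > hi)).any(axis=1).astype(int)
```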

Filter out the outliers by keeping the rows where mark_erro is 0

x_true = train.loc[mark_erro==0]
y_true = lables_Survived.loc[mark_erro==0]
x_true.head()
   Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
1       1    0  38.0      1      0  71.2833         1
2       3    0  26.0      0      0   7.9250         0
3       1    0  35.0      1      0  53.1000         0
4       3    1  35.0      0      0   8.0500         0
8       3    0  27.0      0      2  11.1333         0

Rough Age Prediction with Linear Regression

from sklearn.linear_model import LinearRegression,LogisticRegression
from sklearn.model_selection import GridSearchCV#grid search for hyperparameter tuning
from sklearn.metrics import accuracy_score#classification accuracy score for the logistic regression predictions

Extract the age label and drop the Age column

x_true=x_true.copy()#work on a copy to avoid SettingWithCopyWarning on the filtered slice
x_true_age=x_true['Age']
x_true.drop('Age',inplace=True,axis=1)

The linear regression model

model_age_pre=LinearRegression()
model_age_pre.fit(x_true,x_true_age)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Prediction score (R²)

model_age_pre.score(x_true,x_true_age)
0.13195844479498142
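score here is the coefficient of determination R² = 1 − SS_res/SS_tot, so a value of about 0.13 means the features explain only roughly 13% of the variance in age. A hand computation on made-up numbers:

```python
import numpy as np

y = np.array([22.0, 38.0, 26.0, 35.0, 27.0])       # true ages (toy values)
y_pred = np.array([25.0, 33.0, 28.0, 34.0, 30.0])  # hypothetical model predictions

ss_res = np.sum((y - y_pred) ** 2)    # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares around the mean
r2 = 1 - ss_res / ss_tot
```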

Predict the missing ages in the test set: first take out the rows where Age is missing, drop them from test so it holds only complete rows, then append the filled-in rows back afterwards

test_Ageisnull=test.loc[test.Age.isnull()]
test_Ageisnull.head()
    Pclass  Sex  Age  SibSp  Parch     Fare  Embarked
10       3    1  NaN      0      0   7.8958         0
22       1    0  NaN      0      0  31.6833         0
29       3    1  NaN      2      0  21.6792         1
33       3    0  NaN      1      2  23.4500         0
36       3    0  NaN      0      0   8.0500         0
test.isnull().sum()#only Age has missing values now; drop those rows
Pclass       0
Sex          0
Age         86
SibSp        0
Parch        0
Fare         0
Embarked     0
dtype: int64
test.dropna(inplace=True)

Drop the Age column from the missing rows and feed the rest to the trained linear model to predict their ages

x_age_pre=test_Ageisnull.drop('Age',axis=1)
age_pre=model_age_pre.predict(x_age_pre)
test_Ageisnull=test_Ageisnull.copy()#work on a copy to avoid SettingWithCopyWarning
test_Ageisnull['Age']=age_pre#insert the predicted ages into the data
test=test.append(test_Ageisnull)#append returns a new frame, so assign it back to test
test.head()
   Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
0       3    1  34.5      0      0   7.8292         2
1       3    0  47.0      1      0   7.0000         0
2       2    1  62.0      0      0   9.6875         2
3       3    1  27.0      0      0   8.6625         0
4       3    0  22.0      1      1  12.2875         0
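Given the low R² of the age regression, a simpler alternative worth comparing is filling each missing age with the median of its (Pclass, Sex) group. A sketch on a toy frame (values made up, columns already numerically encoded as in the notebook):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Pclass': [1, 1, 3, 3, 3],
                   'Sex':    [0, 0, 1, 1, 1],
                   'Age':    [38.0, np.nan, 22.0, 30.0, np.nan]})

# Fill each missing Age with the median age of its (Pclass, Sex) group.
df['Age'] = df.groupby(['Pclass', 'Sex'])['Age'].transform(lambda s: s.fillna(s.median()))
```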
Logistic Regression Survival Prediction

model_Survived_pre=LogisticRegression()
param_grid = np.linspace(0.001,1000,10)  #generate the candidate parameter values to search
param_grid = dict(C=param_grid)#GridSearchCV expects a dict of parameters; C is the LogisticRegression regularization parameter
param_grid
{'C': array([1.00000e-03, 1.11112e+02, 2.22223e+02, 3.33334e+02, 4.44445e+02,
        5.55556e+02, 6.66667e+02, 7.77778e+02, 8.88889e+02, 1.00000e+03])}
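Note that np.linspace puts nine of these ten candidates above 100; since C scales the regularization strength multiplicatively, a logarithmic grid usually covers the useful range better:

```python
import numpy as np

# Logarithmically spaced candidates from 1e-3 to 1e3, one per decade.
param_grid = dict(C=np.logspace(-3, 3, 7))
```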

Fit the grid-searched logistic regression model on the training set

gri=GridSearchCV(model_Survived_pre,param_grid)
gri.fit(x_true,y_true)
GridSearchCV(cv=None, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': array([1.00000e-03, 1.11112e+02, 2.22223e+02, 3.33334e+02, 4.44445e+02,
       5.55556e+02, 6.66667e+02, 7.77778e+02, 8.88889e+02, 1.00000e+03])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

Check the model's evaluation score

gri.best_score_#seems a bit low; better parameters or a different model may be needed
0.7828947368421053

Logistic regression without tuning, for comparison

model=LogisticRegression()
model.fit(x_true,y_true)
accuracy_score(y_true, model.predict(x_true))#accuracy_score takes (y_true, y_pred); this is accuracy on the training data itself
0.8004385964912281
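Keep in mind that 0.80 is accuracy on the same rows the model was fit on, so it is optimistic compared with gri.best_score_, which is cross-validated. A sketch of the difference on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)  # synthetic labels

model = LogisticRegression()
train_acc = model.fit(X, y).score(X, y)              # accuracy on the training rows themselves
cv_acc = cross_val_score(model, X, y, cv=5).mean()   # average held-out accuracy
```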