Learning Kaggle by Example: Titanic (2)

Continuing to fill in something I started a long time ago and never finished...

import csv
import numpy as np

# Python 2 syntax (reader.next(), print statement)
csv_file_object = csv.reader(open('train.csv', 'rb'))
header = csv_file_object.next()  # skip the header row
data = []
for row in csv_file_object:
    data.append(row)
# convert the list to an array
data = np.array(data)
print data

Take a look at the result:

[['1' '0' '3' ..., '7.25' '' 'S']
 ['2' '1' '1' ..., '71.2833' 'C85' 'C']
 ['3' '1' '3' ..., '7.925' '' 'S']
 ..., 
 ['889' '0' '3' ..., '23.45' '' 'S']
 ['890' '1' '1' ..., '30' 'C148' 'C']
 ['891' '0' '3' ..., '7.75' '' 'Q']]
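For readers on Python 3, the reading step might look roughly like the sketch below. It is not the code used in this post: `sample_csv` is a made-up two-row stand-in for train.csv so the snippet runs on its own, and `io.StringIO` takes the place of the real file. With the actual file, `open('train.csv', newline='')` would replace it, and `np.array(rows)` could then be applied as in the post.

```python
import csv
import io

# Two sample rows in the layout of train.csv (assumption: stand-in for the real file).
sample_csv = (
    'PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n'
    '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S\n'
    '891,0,3,"Dooley, Mr. Patrick",male,32,0,0,370376,7.75,,Q\n'
)

with io.StringIO(sample_csv) as f:
    reader = csv.reader(f)
    header = next(reader)            # Python 3: next(reader) instead of reader.next()
    rows = [row for row in reader]   # every field is still a string

print(header[:5])
print(rows[0])
```

Note that the quoted name field contains a comma, which `csv.reader` keeps as a single column.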

data is now an array. Note that all of its elements share the same type, namely string.
Let's view the first and last rows in full:

In [7]:

data[0]

Out[7]:

array(['1', '0', '3', 'Braund, Mr. Owen Harris', 'male', '22', '1', '0',
       'A/5 21171', '7.25', '', 'S'],
      dtype='|S82')

In [8]:

data[-1]

Out[8]:

array(['891', '0', '3', 'Dooley, Mr. Patrick', 'male', '32', '0', '0',
       '370376', '7.75', '', 'Q'],
      dtype='|S82')

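Before computing any statistics, the strings need to be converted to float. Here the Survived column (index 1) is converted and its size gives the number of passengers: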

number_passengers = np.size(data[0::,1].astype(np.float))

A quick count of the survivors (Survived is 0 or 1, so the sum is the number who survived):

In [11]:

number_survived = np.sum(data[0::,1].astype(np.float))

In [12]:

number_survived

Out[12]:

342.0

Compute the survival rate:

In [13]:

proportion_survivors = number_survived / number_passengers

In [14]:

proportion_survivors

Out[14]:

0.38383838383838381
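As a small self-contained illustration of the string-to-float step and the survival-rate arithmetic, here is a sketch on a made-up five-passenger column (`survived_str` is hypothetical, not taken from train.csv). It passes the built-in `float` to `astype`, because the `np.float` alias used in this post was removed in NumPy 1.24.

```python
import numpy as np

# Hypothetical Survived column, as strings, the way csv.reader delivers it.
survived_str = np.array(['0', '1', '1', '0', '1'])

# Built-in float as the dtype (np.float no longer exists in NumPy >= 1.24).
survived = survived_str.astype(float)

number_passengers = np.size(survived)   # how many rows
number_survived = np.sum(survived)      # sum of 0/1 flags = number of survivors
proportion_survivors = number_survived / number_passengers

print(proportion_survivors)
```

Because the column only holds 0 and 1, `survived.mean()` gives the same proportion in one step.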

Build boolean masks recording which passengers are female and which are male (column 4 is Sex):

women_only_stats = data[0::,4] == "female"
men_only_stats = data[0::,4] != "female"

Use the masks to pull out the Survived column for men and women separately:

women_onboard = data[women_only_stats,1].astype(np.float)     
men_onboard = data[men_only_stats,1].astype(np.float)
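To make the masking step concrete, here is a minimal sketch on a made-up array (`toy` and its three columns are assumptions for illustration, not the real train.csv layout). Comparing a string column with `'female'` yields one boolean per row, and indexing with that boolean array keeps only the rows where it is True.

```python
import numpy as np

# Hypothetical mini data set: columns are PassengerId, Survived, Sex.
toy = np.array([
    ['1', '0', 'male'],
    ['2', '1', 'female'],
    ['3', '1', 'female'],
    ['4', '0', 'female'],
    ['5', '1', 'male'],
])

is_female = toy[:, 2] == 'female'            # boolean array, one entry per row
women = toy[is_female, 1].astype(float)      # Survived values of the women
men = toy[~is_female, 1].astype(float)       # ~ inverts the mask

print(is_female)
print(women, men)
```

Using `~is_female` is equivalent to the `!= "female"` comparison in the post, as long as the column has no missing values.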

Compute the proportion of women and of men who survived:

In [23]:

proportion_women_survived = np.sum(women_onboard) / np.size(women_onboard)

In [24]:

proportion_men_survived = np.sum(men_onboard) / np.size(men_onboard)

In [26]:

proportion_women_survived

Out[26]:

0.7420382165605095

In [27]:

proportion_men_survived

Out[27]:

0.18890814558058924

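Wow! The survival proportions for women and men differ enormously (about 74% versus 19%), so "ladies first" really does seem to have been followed. That is roughly what we have learned from the training data. Now let's apply it to test.csv, predicting that every woman survived and every man did not: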

In [28]:

test_file = open('test.csv', 'rb')
test_file_object = csv.reader(test_file)
header = test_file_object.next()

In [29]:

prediction_file = open("genderbasedmodel.csv", "wb")
prediction_file_object = csv.writer(prediction_file)

In [30]:

prediction_file_object.writerow(["PassengerId", "Survived"])
for row in test_file_object:       # For each row in test.csv
    if row[3] == 'female':         # is it a female, if yes then
        prediction_file_object.writerow([row[0],'1'])    # predict 1
    else:                          # or else if male,
        prediction_file_object.writerow([row[0],'0'])    # predict 0
test_file.close()
prediction_file.close()

Take a look at the prediction results:

………
The resulting file is now ready for submission!
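For anyone running this on Python 3, the gender-based prediction step could be sketched as follows. This is an illustration under assumptions, not the code from this post: `test_csv` holds two invented passengers in the test.csv column layout (no Survived column, so Sex sits at index 3), and `io.StringIO` objects stand in for the input and output files. With real files, `open('test.csv', newline='')` and `open('genderbasedmodel.csv', 'w', newline='')` would take the place of the `'rb'` and `'wb'` modes.

```python
import csv
import io

# Invented stand-in for test.csv; names and values are made up.
test_csv = (
    'PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n'
    '892,3,"Doe, Mr. John",male,34,0,0,000000,7.8,,Q\n'
    '893,3,"Doe, Mrs. Jane",female,47,1,0,000001,7.0,,S\n'
)

out = io.StringIO(newline='')          # stands in for the prediction file
with io.StringIO(test_csv, newline='') as src:
    reader = csv.reader(src)
    next(reader)                       # skip the header row
    writer = csv.writer(out)
    writer.writerow(['PassengerId', 'Survived'])
    for row in reader:
        # Predict 1 (survived) for women, 0 for men.
        writer.writerow([row[0], '1' if row[3] == 'female' else '0'])

prediction_text = out.getvalue()
print(prediction_text)
```

Using `with` blocks on real files also removes the need for the explicit `close()` calls.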
