Picking up something I started a long time ago and never finished...
import csv as csv
import numpy as np
csv_file_object = csv.reader(open('train.csv', 'rb'))
header = csv_file_object.next()  # skip the header row
data=[]
for row in csv_file_object:
    data.append(row)
# convert the list of rows into an array
data = np.array(data)
print data
Here is the result:
[['1' '0' '3' ..., '7.25' '' 'S']
['2' '1' '1' ..., '71.2833' 'C85' 'C']
['3' '1' '3' ..., '7.925' '' 'S']
...,
['889' '0' '3' ..., '23.45' '' 'S']
['890' '1' '1' ..., '30' 'C148' 'C']
['891' '0' '3' ..., '7.75' '' 'Q']]
data is now an array. Note that all of its elements have the same type: they are all strings.
Look at the first and last rows in full:
In [7]:
data[0]
Out[7]:
array(['1', '0', '3', 'Braund, Mr. Owen Harris', 'male', '22', '1', '0',
'A/5 21171', '7.25', '', 'S'],
dtype='|S82')
In [8]:
data[-1]
Out[8]:
array(['891', '0', '3', 'Dooley, Mr. Patrick', 'male', '32', '0', '0',
'370376', '7.75', '', 'Q'],
dtype='|S82')
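The listings above are Python 2 (print data, reader.next(), and 'rb' mode for csv). As a minimal sketch, assuming Python 3 and a recent NumPy, the same loading step could look like the following. The helper name load_csv_as_array is hypothetical, and the inline sample holds only the two rows shown above so that the snippet runs without train.csv.

```python
import csv
import io

import numpy as np


def load_csv_as_array(file_obj):
    """Read a CSV file object; return (header, 2-D array of strings)."""
    reader = csv.reader(file_obj)
    header = next(reader)          # Python 3: next(reader), not reader.next()
    rows = [row for row in reader]
    return header, np.array(rows)  # every cell stays a string


# Two-row sample in the same column layout as train.csv.
sample = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S\n'
    '891,0,3,"Dooley, Mr. Patrick",male,32,0,0,370376,7.75,,Q\n'
)
header, data = load_csv_as_array(sample)
print(data.shape)
print(data[0])
```

With a real file, open it in text mode, for example open('train.csv', newline=''), which is what the csv module documentation recommends for Python 3.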
Before computing any statistics, the strings need to be converted to floats:
number_passengers = np.size(data[0::,1].astype(np.float))
Count the number of passengers who survived:
In [11]:
number_survived = np.sum(data[0::,1].astype(np.float))
In [12]:
number_survived
Out[12]:
342.0
Compute the survival rate:
In [13]:
proportion_survivors = number_survived / number_passengers
In [14]:
proportion_survivors
Out[14]:
0.38383838383838381
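Because Survived is coded as 0 or 1, the survival rate is simply the mean of that column, and the figure above can be checked by hand as 342 / 891 (891 being the last PassengerId shown). A small sketch, assuming Python 3 and NumPy, where the builtin float stands in for the np.float alias that newer NumPy releases no longer provide; the four-element array survived_col is made-up illustration data, not taken from train.csv.

```python
import numpy as np

# Made-up Survived column, stored as strings like the cells of `data`.
survived_col = np.array(["0", "1", "1", "0"])

survived = survived_col.astype(float)  # builtin float; np.float is gone in NumPy >= 1.24
number_passengers = survived.size
number_survived = survived.sum()
proportion = number_survived / number_passengers

# For a 0/1 column the ratio above equals the mean.
assert proportion == survived.mean()
print(proportion)

# Hand check of the reported figure: 342 survivors out of 891 passengers.
print(342 / 891)
```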
Build boolean masks marking which passengers are female and which are male:
women_only_stats = data[0::,4] == "female"
men_only_stats = data[0::,4] != "female"
Use the masks to pull out the Survived column for women and for men separately:
women_onboard = data[women_only_stats,1].astype(np.float)
men_onboard = data[men_only_stats,1].astype(np.float)
Compute the proportion of women and of men who survived:
In [23]:
proportion_women_survived = \
np.sum(women_onboard) / np.size(women_onboard)
In [24]:
proportion_men_survived = \
np.sum(men_onboard) / np.size(men_onboard)
In [26]:
proportion_women_survived
Out[26]:
0.7420382165605095
In [27]:
proportion_men_survived
Out[27]:
0.18890814558058924
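The comparison data[0::,4] == "female" yields a boolean array with one entry per passenger, and indexing with that array keeps only the rows where the entry is True. A minimal sketch of that mechanism, assuming Python 3 and NumPy; the five-row array toy is made-up illustration data with just a Survived and a Sex column.

```python
import numpy as np

# Made-up rows: column 0 = Survived, column 1 = Sex (all strings, as in `data`).
toy = np.array([
    ["1", "female"],
    ["0", "male"],
    ["1", "female"],
    ["0", "female"],
    ["1", "male"],
])

is_female = toy[:, 1] == "female"        # boolean mask, one entry per row
women = toy[is_female, 0].astype(float)  # Survived values of the women
men = toy[~is_female, 0].astype(float)   # ~ inverts the mask

print(is_female)
print(women.mean(), men.mean())
```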
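Wow! The survival rates of women and men differ enormously (about 74% versus 19%), so "ladies first" was evidently taken seriously. That is roughly what we have learned from the training data so far; now let's apply it to test.csv: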
In [28]:
test_file = open('test.csv', 'rb')
test_file_object = csv.reader(test_file)
header = test_file_object.next()
In [29]:
prediction_file = open("genderbasedmodel.csv", "wb")
prediction_file_object = csv.writer(prediction_file)
In [30]:
prediction_file_object.writerow(["PassengerId", "Survived"])
for row in test_file_object: # For each row in test.csv
    # Sex is column 3 here because test.csv has no Survived column
    if row[3] == 'female': # is it a female, if yes then
        prediction_file_object.writerow([row[0],'1']) # predict 1
    else: # or else if male,
        prediction_file_object.writerow([row[0],'0']) # predict 0
test_file.close()
prediction_file.close()
Take a look at the predictions:
………
The resulting file is already in a form that can be submitted!
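The prediction loop above is also Python 2. As a sketch of the same rule under Python 3, the function below reads from and writes to any text file objects. The name write_gender_predictions and the two in-memory passengers are hypothetical; they stand in for test.csv and genderbasedmodel.csv.

```python
import csv
import io


def write_gender_predictions(test_file, prediction_file):
    """Predict 1 (survived) for every female passenger, 0 otherwise."""
    reader = csv.reader(test_file)
    header = next(reader)
    sex_index = header.index("Sex")  # look the column up instead of hard-coding 3
    writer = csv.writer(prediction_file)
    writer.writerow(["PassengerId", "Survived"])
    for row in reader:
        prediction = "1" if row[sex_index] == "female" else "0"
        writer.writerow([row[0], prediction])


# Made-up passengers in the column layout of test.csv (no Survived column).
test_csv = io.StringIO(
    "PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1001,3,"Doe, Mr. John",male,30,0,0,000001,8.05,,S\n'
    '1002,1,"Doe, Mrs. Jane",female,28,0,0,000002,80.00,,C\n'
)
out = io.StringIO()
write_gender_predictions(test_csv, out)
print(out.getvalue())
```

With real files, open both in text mode with newline='' ('r' for test.csv, 'w' for the output), and prefer with blocks so the files are closed automatically.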