1. Problem Description
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
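2. Basic Approach
From the problem statement, this is a classification task: we are given personal information (features) for one batch of Titanic passengers together with whether each of them survived (label), and we must predict survival for a second batch. The training data set is shown below:
Start by inspecting the data. The age column of the training set has many missing values, and these need some handling. The name, ticket, cabin, and embark columns intuitively seem of little use, so a first attempt can simply drop them.
Most of the useful features are discrete, so a decision tree is a natural first choice, and scikit-learn provides one. Using the library's decision tree directly, I dropped the rows with a missing age, selected the most basic features (pclass, male, sib, par, and fare, with missing fare values set to 7), and trained a decision tree model. The offline test error rate was 19%. After submitting online, the final accuracy was 72%. The program is as follows: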
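The inspection step above can be sketched with the standard library alone. This is a minimal illustration rather than the author's code: the four sample rows are made up (only the header matches the Kaggle train.csv layout), and `count_missing` and `drop_columns` are hypothetical helper names.

```python
import csv
import io

# Made-up rows in the Kaggle train.csv column layout (not real passengers).
SAMPLE = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,A/5 0001,7.25,,S
2,1,1,"Roe, Mrs. Jane",female,,1,0,PC 0002,71.28,C85,C
3,1,3,"Poe, Miss. Ann",female,26,0,0,STON 0003,7.92,,S
4,0,3,"Moe, Mr. Max",male,,0,0,0004,8.05,,Q
"""

# Columns the text proposes to discard on a first attempt.
DROPPED = {"Name", "Ticket", "Cabin", "Embarked"}


def count_missing(rows):
    """Return {column: number of empty cells} for a list of dict rows."""
    counts = {}
    for row in rows:
        for col, val in row.items():
            counts[col] = counts.get(col, 0) + (val == "")
    return counts


def drop_columns(rows, dropped=DROPPED):
    """Return copies of the rows without the columns named in `dropped`."""
    return [{k: v for k, v in row.items() if k not in dropped} for row in rows]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
missing = count_missing(rows)
kept = drop_columns(rows)
print("missing Age values:", missing["Age"])
print("remaining columns:", sorted(kept[0]))
```

On the real file, the same count would show how many rows lack an age before deciding whether to drop or fill them.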
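Before the full listing, the offline evaluation idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the author's program: the data is synthetic, the label rule is invented purely so the tree has something to learn, and the hold-out split uses `sklearn.model_selection.train_test_split` from newer scikit-learn releases instead of the older `cross_validation` module.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-ins for the five basic features named in the text.
rng = np.random.default_rng(0)
n = 400
pclass = rng.integers(1, 4, size=n)   # 1, 2 or 3
male = rng.integers(0, 2, size=n)     # 1 = male, 0 = female
sibsp = rng.integers(0, 4, size=n)
parch = rng.integers(0, 3, size=n)
fare = rng.uniform(5, 100, size=n)
X = np.column_stack([pclass, male, sibsp, parch, fare])

# Invented label rule (an assumption for this sketch, not a finding):
# women and first-class passengers are marked as survivors.
y = ((male == 0) | (pclass == 1)).astype(int)

# Hold out a quarter of the rows as the offline test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Offline error rate = fraction of held-out rows predicted incorrectly.
error_rate = float(np.mean(model.predict(X_test) != y_test))
print("offline error rate: %.3f" % error_rate)
```

The original listing below follows the same pattern on the real train.csv, and trains separate models for rows with and without age information.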
import csv
from numpy import *
from sklearn import tree
from sklearn import cross_validation  # removed in scikit-learn 0.20; newer releases use sklearn.model_selection
def total_test_online():
    train_data, train_label = load_data_set()
    train_feature0, new_label0 = feature_extraction(train_data, train_label)  # only take some data with age info
    model0 = build_model(train_feature0, new_label0)
    train_feature1 = feature_extraction1(train_data)  # without using age info
    model1 = build_model(train_feature1, train_label)
    test_data = load_test_set()
    result = test(test_data, model0, model1, train_label)
    gen_res_file(result)

def total_test_offline():
    data, label = load_data_set()
    train_data, train_label, test_data, test_label = pre_data(data, label)
    train_feature0, new_label0 = feature_extraction(train_data, train_label)  # only take some data with age info
    model0 = build_model(train_feature0, new_label0)
    train_feature1 = feature_extraction1(train_data)  # without using age info
    model1 = build_model(train_feature1, train_label)
    result = test(test_data, model0, model1, train_label)
    judge_off_result(result, test_label)
    # val_res = val_model(model1,train_feature,train_label)

def load_data_set():
    # 'rb' is the Python 2 csv convention; on Python 3 use open('train.csv', newline='')
    with open('train.csv', 'rb') as f:
        rows = list(csv.reader(f))
    rows = rows[1:]  # skip the header row
    train_data = []
    train_label = []
    for row in rows:
        train_label.append(int(row[1]))  # column 1 is Survived
        train_data.append(row[2:])       # remaining columns are the raw features
    return train_data, train_label

def load_test_set():
file = open('tes