[Kaggle Practice Competition] Titanic: Machine Learning from Disaster

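This post discusses the Titanic data set on Kaggle and uses machine learning to predict which passengers survived. The author first introduces the background and the basic approach, notes that this is a classification problem, and tries a decision tree model as a first attempt. Then, by adding new features and switching to a decision forest model, the offline test accuracy rises to 82.8% and the online test reaches 77.033%. The post also touches on the importance of data preprocessing and feature selection.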

1. Problem Description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

2. Basic Approach

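Reading the problem statement shows that this is really a classification problem: given some personal information (features) about one group of Titanic passengers, together with whether each of them survived (labels), the task is to predict survival for another group of passengers. The training data set looks like this:
[Figure: screenshot of the training data set]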

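Start by looking at the data. The age column of the training set has many missing values, and these need some handling. The name, ticket, cabin, and embarked columns do not look very useful at first glance, so one option is to drop them outright.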
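As a rough illustration of this inspection step, the sketch below counts empty cells per column and drops the four columns mentioned above. The rows in `SAMPLE_CSV` are made up for the example (they only follow the train.csv column layout), and the helper names `count_missing` and `drop_columns` are hypothetical, not part of the original program.

```python
import csv
import io

# Made-up rows in the train.csv column layout; NOT real passenger records.
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. A",male,22,1,0,T1,7.25,,S
2,1,1,"Doe, Mrs. B",female,,0,2,T2,71.0,C1,C
3,1,3,"Roe, Miss. C",female,26,0,0,T3,7.9,,S
4,0,2,"Poe, Mr. D",male,,0,0,T4,13.0,,Q
"""

# Columns the post suggests discarding.
DROPPED = ('Name', 'Ticket', 'Cabin', 'Embarked')


def count_missing(rows):
    """Return {column: number of rows whose cell is an empty string}."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (1 if value == '' else 0)
    return counts


def drop_columns(rows, dropped=DROPPED):
    """Return copies of the rows without the columns listed in `dropped`."""
    return [{k: v for k, v in row.items() if k not in dropped} for row in rows]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
missing = count_missing(rows)
slim_rows = drop_columns(rows)

print(missing)               # empty-cell count for every column
print(sorted(slim_rows[0]))  # column names that remain after dropping
```

To run the same check on the real data, `io.StringIO(SAMPLE_CSV)` could be replaced with an opened train.csv file.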

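Most of the useful features in this problem are discrete, so a decision tree is the intuitive choice, and scikit-learn provides a decision tree model. Using the library's decision tree directly, I dropped the records with a missing age, selected the most basic features, pclass, male, sib, par, and fare (missing fares set to 7), and trained a decision tree model. The offline error rate was 19%, and the online submission scored 72% accuracy. The program is as follows: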

import csv
from numpy import *
from sklearn import tree
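# Note: sklearn.cross_validation was removed in scikit-learn 0.20;
# newer versions provide the same helpers in sklearn.model_selection.
from sklearn import cross_validation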

def total_test_online():
    # Train on all of train.csv, predict the Kaggle test set, and write the result file.
    train_data, train_label = load_data_set()

    # model0: trained only on the rows that have age info
    train_feature0, new_label0 = feature_extraction(train_data, train_label)
    model0 = build_model(train_feature0, new_label0)

    # model1: trained on every row, without using age info
    train_feature1 = feature_extraction1(train_data)
    model1 = build_model(train_feature1, train_label)

    test_data = load_test_set()
    result = test(test_data, model0, model1, train_label)
    gen_res_file(result)


def total_test_offline():
    # Offline check: pre_data returns a train/test split of train.csv,
    # so the predictions can be scored against known labels.
    data, label = load_data_set()
    train_data, train_label, test_data, test_label = pre_data(data, label)

    # model0: trained only on the rows that have age info
    train_feature0, new_label0 = feature_extraction(train_data, train_label)
    model0 = build_model(train_feature0, new_label0)

    # model1: trained on every row, without using age info
    train_feature1 = feature_extraction1(train_data)
    model1 = build_model(train_feature1, train_label)

    result = test(test_data, model0, model1, train_label)
    judge_off_result(result, test_label)

    # val_res = val_model(model1, train_feature, train_label)



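def load_data_set():
    # Read train.csv: column 1 is the Survived label, columns 2 onward are the raw features.
    train_data = []
    train_label = []
    # Text mode ('r') is what csv.reader expects on Python 3; 'rb' only works on Python 2.
    with open('train.csv', 'r') as f:
        lines = csv.reader(f)
        next(lines)  # skip the header row
        for line in lines:
            train_label.append(int(line[1]))
            train_data.append(line[2:])
    return train_data, train_label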

def load_test_set():
    file = open('tes
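
The helpers called above (pre_data, feature_extraction, feature_extraction1, build_model, test, judge_off_result) are not shown in full here. The following is only a minimal sketch of how such steps might look, not the author's code. It assumes the row layout produced by load_data_set (train.csv without its first two columns), and every function name in it is hypothetical. The routing in `predict_row` (use the age-aware model when an age is present, otherwise the age-free one) is an assumption about how two such models could be combined.

```python
import random

# Row layout after dropping PassengerId and Survived:
# [Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked]
PCLASS, SEX, AGE, SIBSP, PARCH, FARE = 0, 2, 3, 4, 5, 7
DEFAULT_FARE = 7.0  # the post replaces missing fares with 7


def hold_out_split(rows, labels, test_fraction=0.2, seed=0):
    """Shuffle row indices and hold out a fraction for offline testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return ([rows[i] for i in train_idx], [labels[i] for i in train_idx],
            [rows[i] for i in test_idx], [labels[i] for i in test_idx])


def encode_row(row, use_age):
    """Turn one raw row (list of strings) into a numeric feature list."""
    features = [
        int(row[PCLASS]),
        1 if row[SEX] == 'male' else 0,
        int(row[SIBSP]),
        int(row[PARCH]),
        float(row[FARE]) if row[FARE] else DEFAULT_FARE,
    ]
    if use_age:
        features.append(float(row[AGE]))
    return features


def encode_with_age(rows, labels):
    """Keep only rows that have an age; return their features and labels."""
    kept = [(encode_row(r, True), y) for r, y in zip(rows, labels) if r[AGE]]
    return [f for f, _ in kept], [y for _, y in kept]


def encode_without_age(rows):
    """Encode every row, ignoring the age column."""
    return [encode_row(r, False) for r in rows]


def fit_tree(features, labels):
    """Fit a scikit-learn decision tree with default settings."""
    # Imported here so the encoding helpers stay usable without scikit-learn.
    from sklearn import tree
    return tree.DecisionTreeClassifier().fit(features, labels)


def predict_row(row, model_with_age, model_without_age):
    """Assumed routing: pick the model that matches the available features."""
    if row[AGE]:
        return int(model_with_age.predict([encode_row(row, True)])[0])
    return int(model_without_age.predict([encode_row(row, False)])[0])


def error_rate(predicted, actual):
    """Fraction of predictions that differ from the true labels."""
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / float(len(actual))
```

With these pieces, an offline run would split the loaded rows with `hold_out_split`, fit one tree on `encode_with_age(...)` and another on `encode_without_age(...)`, call `predict_row` for each held-out row, and pass the predictions to `error_rate`.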