[Kaggle Practice Competition] Titanic: Machine Learning from Disaster

messiran10

Categories: Machine Learning Algorithm Practice, Python Data Mining. Tags: machine learning

Article link: https://blog.csdn.net/messiran10/article/details/50549812

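Summary (auto-generated by CSDN): This post discusses the Titanic data set on Kaggle and uses machine learning to predict whether passengers survived. The author first introduces the problem background and the basic approach, notes that this is a classification problem, and tries a decision tree model as a first attempt. Then, by adding new features and using a decision forest model, the offline test accuracy rises to 82.8% and the online test accuracy reaches 77.033%. The post also mentions the importance of data preprocessing and feature selection.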

1. Problem Description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

2. Basic Approach

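Reading the problem statement shows that this is essentially a classification problem: given some personal information (features) about one group of Titanic passengers together with whether each of them survived (label), the task is to predict the survival of another group of passengers. The training data set looks like this:
[Image: training data set]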

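Start by inspecting the data. The age column of the training set has many missing values, and these need some handling. The name, ticket, cabin, and embark columns intuitively seem to be of little use, so one can try simply dropping them.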
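
The inspection step described above can be sketched with the same `csv` module the post uses. This is only an illustrative sketch: the four rows in `SAMPLE_CSV` are made up (they are not real passenger records), the column names assume the standard Kaggle `train.csv` header, and `summarize_and_trim` and `DROPPED` are hypothetical names introduced here.

```python
import csv
import io

# Made-up rows in the Kaggle train.csv column layout (NOT real passenger data).
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,A/5 0001,7.25,,S
2,1,1,"Roe, Mrs. Jane",female,38,1,0,PC 0002,71.28,C85,C
3,1,3,"Poe, Miss. Ann",female,,0,0,STON 0003,7.92,,S
4,0,3,"Moe, Mr. Max",male,,0,0,0004,8.05,,Q
"""

# Columns the post suggests dropping (assumed header spellings).
DROPPED = ("Name", "Ticket", "Cabin", "Embarked")


def summarize_and_trim(csv_text):
    """Count rows with an empty Age field and drop the columns in DROPPED."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    missing_age = sum(1 for row in rows if row["Age"] == "")
    trimmed = [
        {key: value for key, value in row.items() if key not in DROPPED}
        for row in rows
    ]
    return missing_age, trimmed


missing_age, trimmed = summarize_and_trim(SAMPLE_CSV)
print("rows missing Age:", missing_age)
print("remaining columns:", list(trimmed[0].keys()))
```

On the real file, the same function could be applied to the text of `train.csv` to see how many rows would be lost if passengers without an age were discarded.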

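Most of the useful features in this problem are discrete, so a natural idea is to use a decision tree model, which scikit-learn provides. Calling the decision tree library functions directly, discarding the rows with missing age, and selecting only the most basic features, namely pclass, male, sib, par, and fare (the few missing fare values are set to 7), I trained a decision tree model. The offline test error rate was 19%, and the online submission reached an accuracy of 72%. The program is as follows: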

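import csv
from numpy import *
from sklearn import tree
# Note: sklearn.cross_validation was removed in scikit-learn 0.20;
# newer releases provide the same helpers in sklearn.model_selection.
from sklearn import cross_validation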

def total_test_online():
    # Train on the full training set, predict the Kaggle test set,
    # and generate the result file for submission.
    train_data, train_label = load_data_set()

    # model0: trained only on the rows that have age info
    train_feature0, new_label0 = feature_extraction(train_data, train_label)
    model0 = build_model(train_feature0, new_label0)

    # model1: trained on all rows, without using age info
    train_feature1 = feature_extraction1(train_data)
    model1 = build_model(train_feature1, train_label)

    test_data = load_test_set()
    result = test(test_data, model0, model1, train_label)
    gen_res_file(result)


def total_test_offline():
    # Evaluate locally: hold out part of the labelled data as a test set.
    data, label = load_data_set()
    train_data, train_label, test_data, test_label = pre_data(data, label)

    # model0: trained only on the rows that have age info
    train_feature0, new_label0 = feature_extraction(train_data, train_label)
    model0 = build_model(train_feature0, new_label0)

    # model1: trained on all rows, without using age info
    train_feature1 = feature_extraction1(train_data)
    model1 = build_model(train_feature1, train_label)

    result = test(test_data, model0, model1, train_label)
    judge_off_result(result, test_label)

    # val_res = val_model(model1, train_feature, train_label)



def load_data_set():
    # Read train.csv: column 1 is the label, columns 2 onward are the raw features.
    train_data = []
    train_label = []
    # On Python 3 the csv module expects text mode with newline=''.
    with open('train.csv', newline='') as f:
        lines = csv.reader(f)
        next(lines)  # skip the header row
        for line in lines:
            train_label.append(int(line[1]))
            train_data.append(line[2:])
    return train_data, train_label

def load_test_set():
    file = open('tes
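
The listing breaks off at this point in the source. As a separate, self-contained illustration of the decision tree idea described earlier, here is a minimal sketch. It is not the author's program. The data is synthetic and randomly generated, the labelling rule is invented purely so that the tree has something to learn, and it uses `sklearn.model_selection.train_test_split` from current scikit-learn rather than the older `cross_validation` module. Only the feature list (pclass, male, sib, par, fare) and the choice to fill missing fares with 7 come from the post.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
n = 200

# Synthetic stand-ins for the basic features named in the post.
pclass = rng.randint(1, 4, size=n)      # passenger class: 1, 2 or 3
male = rng.randint(0, 2, size=n)        # 1 = male, 0 = female
sib = rng.randint(0, 4, size=n)         # siblings/spouses aboard
par = rng.randint(0, 3, size=n)         # parents/children aboard
fare = rng.uniform(5.0, 100.0, size=n)

# Simulate a few missing fares, then fill them with 7 as the post does.
fare[rng.choice(n, size=5, replace=False)] = np.nan
fare = np.where(np.isnan(fare), 7.0, fare)

X = np.column_stack([pclass, male, sib, par, fare])

# Invented toy rule (an assumption, not a finding): women and
# first-class passengers are labelled as survivors.
y = ((male == 0) | (pclass == 1)).astype(int)

# Hold out a quarter of the rows for offline testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print("hold-out accuracy:", accuracy)
```

Because the labels here follow a simple made-up rule, the accuracy this prints says nothing about the real competition data. The sketch only shows the shape of the workflow: build a feature matrix, split it, fit the tree, and score it on the held-out rows.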
