Titanic (the Titanic dataset)

Original text:

Overview

The data has been split into two groups:

  • training set (train.csv)

  • test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
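The difference between the two files can be sketched in a few lines of Python. This is a minimal illustration, not the competition's own tooling: the two CSV snippets are made-up stand-ins for train.csv and test.csv, the helper names read_rows and split_features_labels are hypothetical, and the column spelling (Survived, Pclass, Sex, Age) is assumed to follow the headers of the downloadable CSV files rather than the lowercase names in the data dictionary.

```python
import csv
import io

# Made-up miniature stand-ins for train.csv and test.csv (illustrative rows only).
TRAIN_CSV = """PassengerId,Survived,Pclass,Sex,Age
101,0,3,male,30
102,1,1,female,41
103,1,2,female,19
"""
TEST_CSV = """PassengerId,Pclass,Sex,Age
901,3,male,34.5
902,1,female,47
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def split_features_labels(rows, label="Survived"):
    """Separate the ground-truth column from the feature columns."""
    features = [{k: v for k, v in row.items() if k != label} for row in rows]
    labels = [int(row[label]) for row in rows]
    return features, labels


train_rows = read_rows(TRAIN_CSV)
test_rows = read_rows(TEST_CSV)
X_train, y_train = split_features_labels(train_rows)

print(y_train)                      # ground truth exists only for the training rows
print("Survived" in test_rows[0])   # the test rows carry no outcome column
```

With the real files, the same split would be applied after reading train.csv from disk; the test rows are only ever passed to the trained model for prediction.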

We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
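The rule behind gender_submission.csv is simple enough to reproduce as a sketch. The example below assumes a submission file has exactly two columns, PassengerId and Survived, and that the Sex column holds the strings "male" and "female"; the sample rows and the function names gender_baseline and to_submission_csv are hypothetical.

```python
import csv
import io

# Made-up test rows; the real test.csv has more columns and passengers.
TEST_CSV = """PassengerId,Pclass,Sex
901,3,male
902,1,female
903,2,female
"""


def gender_baseline(rows):
    """Predict survival for all and only female passengers."""
    return [
        {"PassengerId": row["PassengerId"],
         "Survived": 1 if row["Sex"] == "female" else 0}
        for row in rows
    ]


def to_submission_csv(predictions):
    """Serialise predictions in the assumed two-column submission layout."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["PassengerId", "Survived"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(predictions)
    return out.getvalue()


rows = list(csv.DictReader(io.StringIO(TEST_CSV)))
submission = to_submission_csv(gender_baseline(rows))
print(submission)
```

Any trained model can replace gender_baseline, as long as it returns one 0/1 prediction per test passenger.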

Data Dictionary

 

 

Variable    Definition                                    Key
survival    Survival                                      0 = No, 1 = Yes
pclass      Ticket class                                  1 = 1st, 2 = 2nd, 3 = 3rd
sex         Sex
Age         Age in years
sibsp       # of siblings / spouses aboard the Titanic
parch       # of parents / children aboard the Titanic
ticket      Ticket number
fare        Passenger fare
cabin       Cabin number
embarked    Port of Embarkation                           C = Cherbourg, Q = Queenstown, S = Southampton
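The coded columns in the data dictionary can be turned back into readable labels with plain lookup tables. This is only a sketch: the describe helper and the record layout are hypothetical, and returning "unknown" for a code that is not in the key is an assumption of this example, not something the dictionary specifies.

```python
# Lookup tables copied from the "Key" column of the data dictionary.
SURVIVAL_KEY = {0: "No", 1: "Yes"}
PCLASS_KEY = {1: "1st", 2: "2nd", 3: "3rd"}
EMBARKED_KEY = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}


def describe(record):
    """Replace coded values with their labels (hypothetical helper)."""
    return {
        "survival": SURVIVAL_KEY[record["survival"]],
        "pclass": PCLASS_KEY[record["pclass"]],
        # Assumption: a code outside the key is reported as "unknown".
        "embarked": EMBARKED_KEY.get(record["embarked"], "unknown"),
    }


print(describe({"survival": 1, "pclass": 3, "embarked": "Q"}))
```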

 

Variable Notes

pclass: A proxy for socio-economic status (SES)

1st = Upper

2nd = Middle

3rd = Lower

 

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
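The age conventions in this note can be checked mechanically. The function below is a sketch under stated assumptions: the name classify_age and its four labels are invented for illustration, a missing age is represented as None, and any age below 1 is treated as an infant even when it happens to be exactly 0.5, since the note does not say how to tell those two cases apart.

```python
def classify_age(age):
    """Label an age value as 'missing', 'infant', 'estimated' or 'recorded'."""
    if age is None:
        return "missing"      # assumption: missing ages are passed as None
    if age < 1:
        return "infant"       # fractional ages below 1
    if age % 1 == 0.5:
        return "estimated"    # the xx.5 form marks an estimated age
    return "recorded"


for value in (None, 0.42, 34.5, 22.0):
    print(value, classify_age(value))
```

A flag derived this way could serve as an extra feature, or as a filter when imputing missing ages.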

 

sibsp: The dataset defines family relations in this way...

Sibling = brother, sister, stepbrother, stepsister

Spouse = husband, wife (mistresses and fiancés were ignored)

 

parch: The dataset defines family relations in this way...

Parent = mother, father

Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them.
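As an example of the feature engineering mentioned in the overview, sibsp and parch are often combined into a single family-size value. This is a common convention rather than part of the dataset description, and the function names below are hypothetical.

```python
def family_size(sibsp, parch):
    """Count the passenger plus all siblings, spouses, parents and children aboard."""
    return sibsp + parch + 1


def is_alone(sibsp, parch):
    """True when no family member, as defined above, is aboard."""
    return sibsp == 0 and parch == 0


print(family_size(1, 2))
print(is_alone(0, 0))
```

Note that is_alone inherits the caveat from the notes above: a child who travelled only with a nanny has parch=0 and may be flagged as alone even though they were accompanied.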

 

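Translation:

Overview

The data is divided into two groups:

  • training set (train.csv)

  • test set (test.csv)

The training set is used to build your machine learning models. For the training set, the outcome (also called the “ground truth”) is provided for each passenger. Your model will be based on “features” such as a passenger's gender and class. You can also use feature engineering to create new features.

The test set is used to check how well your model performs on unseen data. For the test set, the ground truth for each passenger is not provided. Your job is to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

Also included is gender_submission.csv, a set of predictions that assumes all and only female passengers survived, as an example of what a submission file should look like.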


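The dataset can be downloaded from the official website; I have also shared a copy on Baidu Netdisk (百度网盘). To get the download link, follow my official account (公众号) and reply with “2020103001”.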

 
