大数据第一课(满分作业)——泰坦尼克号生存者预测(Titanic - Machine Learning from Disaster)

1 项目背景

1.1 The Challenge

The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).

1.2 What Data Will I Use in This Competition?

In this competition, you’ll gain access to two similar datasets that include passenger information like name, age, gender, socio-economic class, etc. One dataset is titled ‘train.csv’ and the other is titled ‘test.csv’.
Train.csv will contain the details of a subset of the passengers on board (891 to be exact) and importantly, will reveal whether they survived or not, also known as the “ground truth”.
The ‘test.csv’ dataset contains similar information but does not disclose the “ground truth” for each passenger. It’s your job to predict these outcomes.
Using the patterns you find in the train.csv data, predict whether the other 418 passengers on board (found in test.csv) survived.
Check out the “Data” tab to explore the datasets even further. Once you feel you’ve created a competitive model, submit it to Kaggle to see where your model stands on our leaderboard against other Kagglers.

2 数据获取

从网上https://www.kaggle.com/competitions/titanic/data下载好数据"train.csv"和"test.csv"还有"gender_submission.csv"之后,如图 1所示,接下来利用Excel和Python语句进行数据总览。
在这里插入图片描述

3 数据分析

3.1 数据字段分析

"train.csv"的数据字段分析如下表所示。
在这里插入图片描述
另外,"test.csv"文件中不包含人员存活的情况。

3.2 导入数据

Python版本3.10(Anaconda3环境)、Python IDE PyCharm(Professional 2022.1)
将Python文件建立在"*.csv"文件的同一目录下,代码和运行结果如下

# 导入数据
train = pd.read_csv("./train.csv")
test = pd.read_csv("./test.csv")

# 查看前5条数据,看是否导入成功
print(train.head())
print()
print(test.head())
print()

在这里插入图片描述

3.3 数据清洗(预处理)

数据清洗之前需要明确要处理数据的字段类型,浏览上面输出的结果,共有11个字段(PassengerId和Ticket作为编号可以去除,不需考虑),剩余10个字段可以分为四类,如下:

  1. 数字类特征: 年龄,票价,兄弟姐妹配偶数量,父母小孩数量
  2. 类别特征: 性别,港口,船舱等级,是否存活
  3. 包含数字和字符的特征: 船票和船舱
  4. 文字类特征: 姓名

3.3.1 缺失值处理

# 获取数据的维度和查询各字段缺失值情况
print(train.info())
print()
print(train.isnull().sum())
print()

在这里插入图片描述
通过缺失值的查询,发现Age字段缺失大约到了20%,可以采用填充的方式进行处理,而Cabin字段的缺失值高达77%,可以直接删除,最后就是登船港口只有2条数据缺失可以直接删除也可以进行填充。接下来逐步进行缺失值处理。
首先是对于年龄字段,填充的具体方式可以查看数据的分布,一般对于年龄的分布都是属于右偏分布(就是均值大于中位数),可以通过describe()方法查询,输出结果如下:(可以发现均值比50%处的中

  • 5
    点赞
  • 10
    收藏
    觉得还不错? 一键收藏
  • 1
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值