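Titanic is an introductory competition on Kaggle and a good fit for beginners who want to practice data analysis.
The dataset describes the passengers aboard the Titanic, and the task is to predict whether each passenger survived. This makes it a binary classification problem in machine learning. Data link: https://www.kaggle.com/c/titanic/data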
1. Data Cleaning
2. Exploratory Visualization
3. Feature Engineering
4. Basic Modeling & Evaluation
I. Data Cleaning
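import pandas as pd
import numpy as np
# Load the training set (adjust the path to wherever train.csv is stored)
train = pd.read_csv('F:\\kaggleData\\titanic\\train.csv')
# Preview the first five rows
train.head()
# Column dtypes and non-null counts; this is where missing values show up
train.info()
# Summary statistics for the numeric columns
train.describe()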
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
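The info() output shows that three columns are incomplete: Age has 714 of 891 values (177 missing), Cabin has 204 (687 missing), and Embarked has 889 (2 missing). A minimal sketch of how this could be tabulated directly is below. The helper name missing_summary and the toy frame are assumptions made for illustration (the toy values are made up, not taken from train.csv); on the real data you would pass train instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train.csv (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})


def missing_summary(df):
    """Return the count and share of missing values per column, largest first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "ratio": counts / len(df),
    })
    # Keep only the columns that actually contain gaps.
    summary = summary[summary["missing"] > 0]
    return summary.sort_values("missing", ascending=False)


summary = missing_summary(toy)
print(summary)
```

Calling missing_summary(train) on the real training set should list Cabin, Age, and Embarked, which is a convenient starting point for deciding how to fill or drop each column.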
Column descriptions:
- PassengerId => passenger ID
- Pclass => passenger class (1st, 2nd, or 3rd)
- Name => passenger name
- Sex => sex
- Age => age
- SibSp => number of siblings/spouses
- Parch => number of parents/children