1. Load the data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
train_df = pd.read_csv("tatannic/train.csv")
test_df = pd.read_csv("tatannic/test.csv")
2. Feature type analysis
print(train_df.shape)
train_df.head()
print(test_df.shape)
test_df.head()
- The test set has one column fewer than the training set: 'Survived'. This is the target variable we need to predict.
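The column difference can also be checked programmatically rather than by reading the two `head()` outputs. The following is a minimal sketch on toy frames; `toy_train` and `toy_test` are hypothetical stand-ins for `train_df` and `test_df`, and their values are made up for illustration.

```python
import pandas as pd

# Toy frames standing in for train_df / test_df (hypothetical values, not the real data).
toy_train = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Pclass": [3, 1]})
toy_test = pd.DataFrame({"PassengerId": [3, 4], "Pclass": [2, 3]})

# Columns present in the training frame but absent from the test frame.
missing_in_test = toy_train.columns.difference(toy_test.columns).tolist()
print(missing_in_test)
```

Applied to the real frames, the same call would be `train_df.columns.difference(test_df.columns)`.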
train_df.describe()
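# Pass the column by keyword; recent seaborn versions no longer accept it as the first positional argument
sns.countplot(x='Survived', data=train_df)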
train_df['Survived'].value_counts()
train_df.info()
- There are 12 columns in total: 7 numeric and 5 categorical. 'Age' and 'Cabin' have many missing values, while 'Embarked' has only a few.
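The missing-value counts and the numeric/categorical split read off `info()` can be computed directly. This is a small sketch on a toy frame; `toy` and its values are hypothetical and only mirror the column names mentioned above.

```python
import numpy as np
import pandas as pd

# Toy frame with a few of the columns discussed above (hypothetical values).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Number of missing values per column, largest first.
missing_counts = toy.isnull().sum().sort_values(ascending=False)
print(missing_counts)

# Split columns into numeric and non-numeric (treated here as categorical).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, categorical_cols)
```

On the real data, `train_df.isnull().sum()` gives the per-column counts that the bullet above summarizes.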
# np.object was removed in NumPy 1.24; use the 'object' dtype string instead
train_df.describe(include=['object'])
3. Removing irrelevant features
Categorical features
- Name
- The target of the analysis is the survival rate, so we drop the 'Name' feature here.
train_df.drop('Name', axis=1, inplace=True)
test_df.drop('Name', axis=1, inplace=True)
- Ticket
train_df['Ticket'].value_counts()
ticket_count = train_df['Ticket'].value_counts()
# Keep only tickets shared by at least 4 passengers
ticket_count = ticket_count[ticket_count >= 4]
ticket_count_df = train_df[train_df['Ticket'].isin(ticket_count.index)]
ticket_count_df['Ticket'].value_counts()
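The filtering pattern used above (count values, keep those above a threshold, then select matching rows with `isin`) can be seen end to end on a toy frame. This is only an illustrative sketch: `toy`, its values, and the threshold `min_count = 3` are hypothetical; the notebook itself uses a threshold of 4 on `train_df`.

```python
import pandas as pd

# Toy frame standing in for train_df (hypothetical values).
toy = pd.DataFrame({
    "Ticket": ["A1", "A1", "A1", "B2", "C3", "C3"],
    "Survived": [1, 0, 1, 0, 1, 1],
})
min_count = 3  # hypothetical threshold; the notebook uses 4

# Count passengers per ticket, then keep tickets that reach the threshold.
counts = toy["Ticket"].value_counts()
frequent = counts[counts >= min_count]

# Select the rows whose ticket is one of the frequent tickets.
shared = toy[toy["Ticket"].isin(frequent.index)]
print(shared["Ticket"].value_counts())
```

Only the rows with ticket `A1` remain, since it is the only ticket that appears at least `min_count` times in the toy data.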