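Introduction
Hexo blog: Yanbin’s blog
The dataset preprocessing in my earlier post, Titanic Survival Prediction, did not feel thorough enough, so after reading some Kernels on Kaggle I redid the preprocessing (for deep learning)…
Feature Processing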
%matplotlib inline
import pandas as pd
import numpy as np
import re
train = pd.read_csv(r'E:\Mirror\GitHub\Predict-survival-on-the-Titanic\data\train.csv')
test = pd.read_csv(r'E:\Mirror\GitHub\Predict-survival-on-the-Titanic\data\test.csv')
full_data = [train, test]
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
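From the `info()` output above, `Age` (714 of 891), `Cabin` (204 of 891) and `Embarked` (889 of 891) have missing values, that is 177, 687 and 2 missing entries respectively. A minimal sketch of how those counts could be pulled out directly is shown below. The `toy` frame and its values are hypothetical stand-ins for `train`, made up only so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train.csv; values are illustrative only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count missing entries per column and keep only columns that have any.
missing = toy.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Running the same two lines on `train` and `test` should show which columns need imputation before they are fed to a model.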
1. Pclass
Ticket class: a proxy for socio-economic status
No. | Ticket class |
---|---|
1 | First class |
2 | Second class |
3 | Third class |
# One-hot encoding
# train
train['P1'] = np.array(train['Pclass'] == 1).astype(np.int32)
train['P2'] = np.array(train['Pclass'] == 2).astype(np.int32)
train['P3'] = np.array(train['Pclass'] == 3).astype(np.int32)
# test
test['P1'] = np.array(test['Pclass'] == 1).astype(np.int32)
test['P2'] = np.array(test['Pclass'] == 2).astype(np.int32)
test['P3'] = np.array(test['Pclass'] == 3).astype(np.int32)
print(train.head(1))
PassengerId Survived Pclass Name Sex Age SibSp \
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1
Parch Ticket Fare Cabin Embarked P1 P2 P3
0 0 A/5 21171 7.25 NaN S 0 0 1
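As an alternative to writing one line per class, the same `P1`/`P2`/`P3` columns can probably be produced with `pd.get_dummies`. This is only a sketch, not the approach used in the post. The `toy` frame is a hypothetical stand-in for `train`.

```python
import pandas as pd

# Hypothetical toy frame; only the Pclass column matters here.
toy = pd.DataFrame({"Pclass": [3, 1, 2, 3]})

# prefix="P" with an empty separator yields the column names P1, P2, P3.
# astype("int32") normalizes the dtype, which differs between pandas versions.
dummies = pd.get_dummies(toy["Pclass"], prefix="P", prefix_sep="").astype("int32")
toy = pd.concat([toy, dummies], axis=1)
print(toy)
```

One caveat: `get_dummies` only creates columns for values that actually occur, so if a class were absent from one of the two files, train and test would end up with different columns. The explicit comparisons used above do not have that problem.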
2. Sex
Sex: male or female
Sex | label |
---|---|
male | 1 |
female | 0 |
# Convert male/female to 1/0
train['Sex'] = [1 if i == 'male' else 0 for i in train.Sex]
test['Sex'] = [1 if i == 'male' else 0 for i in test.Sex]
print(train.head(1))
PassengerId Survived Pclass Name Sex Age SibSp \
0 1 0 3 Braund, Mr. Owen Harris 1 22.0 1
Parch Ticket Fare Cabin Embarked P1 P2 P3
0 0 A/5 21171 7.25 NaN S 0 0 1
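Note that the list comprehension above maps anything that is not `'male'` to 0, so running the cell a second time (when the column already holds 1/0) silently turns every row into 0. A slightly more defensive sketch uses `Series.map` with an explicit dictionary. The `toy` frame and the `sex_codes` name are hypothetical, used only for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for train/test.
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Explicit lookup table matching the label table above.
sex_codes = {"male": 1, "female": 0}

# Values missing from the dictionary become NaN instead of silently becoming 0.
toy["Sex"] = toy["Sex"].map(sex_codes)
print(toy["Sex"].tolist())  # [1, 0, 0, 1]
```

With this version an unexpected value, or an accidental second run, shows up as NaN and can be caught with `isnull()`.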
3. SibSp and Parch
- SibSp
the number of siblings/spouses
- Parch
the number of children/parents
# 'FamilySize': number of family members
for dataset in full_data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset[