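Survival Prediction
The Titanic survival prediction competition on Kaggle is a good practice problem for people new to machine learning. I wanted to learn about feature engineering, so I read through a notebook published by an expert on Kaggle. Here is what I took away from it, shared so that we can learn from each other.
Dataset description
Before starting, let's look at the dataset we will be working with, then see which operations the author applies to it and how, so we can learn from the approach.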
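A quick first look at the data could go something like the sketch below. It uses a tiny hand-made table instead of the real train.csv, so the rows are only illustrative; the column names are the ones used in the Kaggle Titanic data.

```python
import pandas as pd

# Hypothetical miniature of train.csv; the values are made up for illustration.
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 3],
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley",
        "Heikkinen, Miss. Laina",
        "Allen, Mr. William Henry",
    ],
    "Age": [22.0, 38.0, None, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 8.05],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", None],
})

print(sample.head())    # first few rows
print(sample.dtypes)    # column types

# Number of missing values per column; these are the gaps the later steps fill.
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

On the real data, the same three calls on train and test show which columns (Age, Cabin, Embarked, Fare) need attention.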
Feature engineering
1. Reading the data
import pandas as pd
import numpy as np
import re
import sklearn
import xgboost as xgb
import seaborn as sns
import matplotlib.pyplot as plt
# %matplotlib inline
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
# plotly.offline renders locally, so no Plotly account credentials are required
import warnings
warnings.filterwarnings('ignore')
# Going to use these 5 base models for the stacking
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.svm import SVC
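from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20
train = pd.read_csv(r'D:\PythonDDD\shuju files\tantanic\train.csv')  # read the data
test = pd.read_csv(r'D:\PythonDDD\shuju files\tantanic\test.csv')
full_data = [train, test]  # both frames, so the loops below can process them together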
2. Converting discrete values
train['Name_length'] = train['Name'].apply(len)  # use the length of the name as a numeric feature
test['Name_length'] = test['Name'].apply(len)
# This feature indicates whether a passenger had a cabin on the ship (a missing Cabin is NaN, which is a float)
train['Has_Cabin'] = train["Cabin"].apply(lambda x: 0 if type(x) == float else 1)
test['Has_Cabin'] = test["Cabin"].apply(lambda x: 0 if type(x) == float else 1)
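To see what these two features look like, here is a small sketch on a two-row table. The table `demo` is made up, and `notna()` is my own substitute for the `type(x) == float` check; it should give the same 0/1 result as long as Cabin holds only strings or NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical two-passenger table; only the columns needed here.
demo = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Cabin": [np.nan, "C85"],
})

# Length of the full name string, including spaces and punctuation.
demo["Name_length"] = demo["Name"].apply(len)

# 1 if a cabin is recorded, 0 if the value is missing.
demo["Has_Cabin"] = demo["Cabin"].notna().astype(int)

print(demo)
```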
3. Adding new features
for dataset in full_data:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')  # fill missing values with 'S'
# Fill all missing values in the Fare column and create a new feature, CategoricalFare
for dataset in full_data:
    dataset['Fare'] = dataset['Fare'].fillna(train['Fare'].median())  # fill missing values with the training median
train['CategoricalFare'] = pd.qcut(train['Fare'], 4)  # split into four quantile bins
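The fill-then-bin steps above can be tried on a small made-up table. This is only a sketch: `fares` and its values are hypothetical, and the median is taken from the same table instead of from a separate training set.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing port and one missing fare.
fares = pd.DataFrame({
    "Embarked": ["S", None, "C", "Q", "S", "S", "C", "S"],
    "Fare": [7.25, 71.28, np.nan, 8.05, 53.10, 8.46, 51.86, 21.08],
})

# Missing port of embarkation becomes 'S'.
fares["Embarked"] = fares["Embarked"].fillna("S")

# Missing fare becomes the median of the known fares.
fares["Fare"] = fares["Fare"].fillna(fares["Fare"].median())

# qcut places the bin edges at the quartiles, so each bin holds roughly a quarter of the rows.
fares["CategoricalFare"] = pd.qcut(fares["Fare"], 4)

print(fares["CategoricalFare"].value_counts().sort_index())
```

Unlike `pd.cut`, which uses equal-width intervals, `pd.qcut` chooses edges from the data's quantiles, which suits a skewed column such as Fare.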
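for dataset in full_data:
    age_avg = dataset['Age'].mean()  # mean age
    age_std = dataset['Age'].std()  # standard deviation of age
    age_null_count = dataset['Age'].isnull().sum()  # number of missing ages (86 by the author's count)
    # random integer ages within one standard deviation of the mean, one per missing value
    age_null_random_list = np.random.randint(int(age_avg - age_std), int(age_avg + age_std), size=age_null_count)
    dataset.loc[dataset['Age'].isnull(), 'Age'] = age_null_random_list  # fill the missing ages
    dataset['Age'] = dataset['Age'].astype(int)  # convert to integer
train['CategoricalAge'] = pd.cut(train['Age'], 5)  # split age into five equal-width bins
def get_title(name):
    title_search = re.search(r' ([A-Za-z]+)\.', name)  # regex that matches the title inside a name
    # If the title exists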
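Here is a small reproducible sketch of the age imputation. The table `ages` is made up, and I use a seeded `np.random.default_rng` in place of `np.random.randint` so that reruns give the same fill values; that is my own choice, not part of the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded generator (assumption, for reproducibility)

# Hypothetical ages with two missing entries.
ages = pd.DataFrame({"Age": [22.0, np.nan, 38.0, 26.0, np.nan, 35.0, 54.0, 2.0]})

age_avg = ages["Age"].mean()
age_std = ages["Age"].std()
missing = ages["Age"].isnull()  # boolean mask of the rows to fill

# Draw integer ages from [mean - std, mean + std), one per missing row.
low, high = int(age_avg - age_std), int(age_avg + age_std)
ages.loc[missing, "Age"] = rng.integers(low, high, size=missing.sum())

ages["Age"] = ages["Age"].astype(int)

# Five equal-width age bins.
ages["CategoricalAge"] = pd.cut(ages["Age"], 5)

print(ages)
```

Assigning through `.loc` with the mask avoids the chained-assignment pattern `df['Age'][mask] = ...`, which pandas warns about and which may not modify the original frame.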
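The function is cut off at this point in my notes. One plausible way to finish it, assuming the intent is to return the matched title and fall back to an empty string when nothing matches, is sketched below; the return values are my assumption.

```python
import re

def get_title(name):
    # Look for a word that follows a space and ends with a period, e.g. " Mr."
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # Assumed continuation: return the captured title if there is one
    if title_search:
        return title_search.group(1)
    # Assumed fallback when no title is found
    return ""

print(get_title("Braund, Mr. Owen Harris"))
print(get_title("Heikkinen, Miss. Laina"))
```

A function like this could then be applied to the Name column with `dataset['Name'].apply(get_title)` to build a title feature.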