Titanic best working Classifier

Document: Titanic best working Classifier.md
Link: http://note.youdao.com/noteshare?id=6e2847e19cd533c02d3658eca63e4f06&sub=6D7F43835A564D59B27AF30BF2FB09F9

  • PassengerId => passenger ID
  • Pclass => passenger class (1st/2nd/3rd class cabin)
  • Name => passenger name
  • Sex => sex
  • Age => age
  • SibSp => number of siblings/spouses aboard
  • Parch => number of parents/children aboard
  • Ticket => ticket number
  • Fare => fare
  • Cabin => cabin
  • Embarked => port of embarkation


kaggle link

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None

Feature Engineering

Examine how each attribute affects the survival rate:

  • Pclass
  • Sex
  • SibSp, Parch ==> SibSp + Parch + 1 = FamilySize
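The FamilySize derivation above can be sketched as follows (toy frame standing in for the real train/test DataFrames; the column names are from this dataset):

```python
import pandas as pd

# Minimal sketch: FamilySize = siblings/spouses + parents/children + the passenger
dataset = pd.DataFrame({'SibSp': [1, 0, 3], 'Parch': [0, 0, 2]})
dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
print(dataset['FamilySize'].tolist())  # [2, 1, 6]
```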

IsAlone

dataset['IsAlone'] = 0
dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1

   IsAlone  Survived
0        0  0.505650
1        1  0.303538

Filling missing data (median and other rules)

  • Embarked ==> 'S'
  • Fare ==> fillna(Fare.median())
  • Age ==> np.random.randint(mean - std, mean + std, size=null_size)
  • Name ==> Title
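The first three rules can be sketched together like this (toy data, not the real Titanic frame; the bounds for randint are cast to int so the call is well-defined):

```python
import numpy as np
import pandas as pd

# Sketch of the imputation rules above on a tiny example frame
dataset = pd.DataFrame({
    'Embarked': ['S', 'C', np.nan],
    'Fare': [7.25, np.nan, 8.05],
    'Age': [22.0, np.nan, 35.0],
})
dataset['Embarked'] = dataset['Embarked'].fillna('S')              # most common port
dataset['Fare'] = dataset['Fare'].fillna(dataset['Fare'].median())
age_avg, age_std = dataset['Age'].mean(), dataset['Age'].std()
null_size = dataset['Age'].isnull().sum()
dataset.loc[dataset['Age'].isnull(), 'Age'] = np.random.randint(
    int(age_avg - age_std), int(age_avg + age_std), size=null_size)
print(dataset.isnull().sum().sum())  # 0 -- no missing values remain
```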

Data Cleaning

Convert features to numbers:

1. Direct value mapping, e.g. 'female': 0, 'male': 1
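A sketch of the direct mapping (assuming dataset is one of the combined train/test frames):

```python
import pandas as pd

# Encode the Sex column with a direct value-to-number mapping
dataset = pd.DataFrame({'Sex': ['female', 'male', 'male']})
dataset['Sex'] = dataset['Sex'].map({'female': 0, 'male': 1}).astype(int)
print(dataset['Sex'].tolist())  # [0, 1, 1]
```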

2. Binned value mapping, e.g.:

dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
dataset.loc[ dataset['Age'] > 64, 'Age'] = 4
Drop unneeded features

train = train.drop(drop_elements, axis = 1)

   Survived  Pclass  Sex  Age  Fare  Embarked  IsAlone  Title
0         0       3    1    1     0         0        0      1
1         1       1    0    2     3         1        0      3
2         1       3    0    1     1         0        1      2
3         1       1    0    2     3         0        0      3
4         0       3    1    2     1         0        1      1
5         0       3    1    0     1         2        1      1
6         0       1    1    3     3         0        1      1
7         0       3    1    0     2         0        0      4
8         1       3    0    1     1         0        0      3
9         1       2    0    0     2         1        0      3

Classifier Selection

Choose the classifier with the highest test score.

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
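A minimal sketch of the selection loop these imports imply: score each classifier with StratifiedShuffleSplit and keep the best mean accuracy. The data here is a synthetic stand-in for the cleaned Titanic features; only three of the listed classifiers are included to keep it short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the cleaned feature matrix X and labels y
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

classifiers = [KNeighborsClassifier(),
               DecisionTreeClassifier(random_state=0),
               LogisticRegression(max_iter=1000)]
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

scores = {}
for clf in classifiers:
    accs = []
    for train_idx, test_idx in sss.split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        accs.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))
    scores[clf.__class__.__name__] = np.mean(accs)  # mean accuracy over splits

best = max(scores, key=scores.get)  # classifier with the highest test score
print(best, scores[best])
```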

Code Notes

groupby

train[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean()
# ['Sex']: group the data by the Sex column
# as_index=False: do not use Sex as the result's index
    Sex  Survived
0  female  0.742038
1    male  0.188908

fillna

dataset['Embarked'] = dataset['Embarked'].fillna('S')
# fill NaN values in Embarked with 'S'

isnull, isnan, randint

age_null_count = dataset['Age'].isnull().sum()
age_null_random_list = np.random.randint(age_avg - age_std, age_avg + age_std, size=age_null_count)
# .loc avoids the chained assignment dataset['Age'][...] = ..., which may
# silently fail with a SettingWithCopyWarning
dataset.loc[np.isnan(dataset['Age']), 'Age'] = age_null_random_list

cut, qcut

cut splits the value range into equal-width bins (the bin edges are evenly spaced).

qcut splits the data into equal-count quantile bins (each bin holds roughly the same number of samples).
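The difference is easy to see on skewed data, where one outlier stretches the value range:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 100])  # skewed: one large outlier

cut_bins = pd.cut(s, 2)    # equal-width: edges at ~0.9, 50.5, 100
qcut_bins = pd.qcut(s, 2)  # equal-count: edge at the median (3)

print(cut_bins.value_counts(sort=False).tolist())   # [4, 1]
print(qcut_bins.value_counts(sort=False).tolist())  # [3, 2]
```

cut piles almost everything into the first bin because the outlier dominates the range, while qcut keeps the bins balanced.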

apply

dataset['Title'] = dataset['Name'].apply(get_title)
# apply the get_title function to every Name
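get_title itself is not shown in these notes; in kernels of this kind it is typically a small regex helper like the following (hypothetical reconstruction, not the author's exact code):

```python
import re
import pandas as pd

def get_title(name):
    # Extract the word ending in '.' (Mr., Mrs., Miss., ...) from a name
    match = re.search(r' ([A-Za-z]+)\.', name)
    return match.group(1) if match else ""

dataset = pd.DataFrame({'Name': ['Braund, Mr. Owen Harris',
                                 'Heikkinen, Miss. Laina']})
dataset['Title'] = dataset['Name'].apply(get_title)
print(dataset['Title'].tolist())  # ['Mr', 'Miss']
```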

replace, drop

dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col',
    'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
train = train.drop(drop_elements, axis = 1)
# without axis=1 rows are dropped by default; with axis=1 columns are dropped

Tricky Points

1. Filling the missing Fare value

data['Fare'] = data['Fare'].fillna(data['Fare'].median())

2. Parentheses around logical operators

data.loc[(data['Age'] >32) & (data['Age'] <= 48), 'Age'] = 2
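The parentheses matter because & binds more tightly than the comparison operators: without them the expression is parsed as `data['Age'] > (32 & (data['Age'] <= 48))`, which does not produce the intended mask. A quick check of the correct form:

```python
import pandas as pd

data = pd.DataFrame({'Age': [20, 40, 60]})
# Each comparison wrapped in parentheses before combining with &
mask = (data['Age'] > 32) & (data['Age'] <= 48)
print(mask.tolist())  # [False, True, False] -- only 40 is in (32, 48]
```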

Reposted from: https://www.cnblogs.com/hichens/p/11524312.html
