Analyzing the Titanic data with sklearn (logistic regression, random forest) plus simple feature analysis

The dataset is the Titanic training set:
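A quick way to get a feel for the file before modeling is to print the first rows and the per-column missing counts; a minimal sketch, using the same path as the scripts below:

import pandas
titanic = pandas.read_csv('F:\\test\\titanic_train.csv')
print(titanic.head())          # first 5 rows: PassengerId, Survived, Pclass, Name, Sex, ...
print(titanic.isnull().sum())  # in the standard Kaggle file, Age, Cabin and Embarked have gaps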


First, logistic regression:

import pandas
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Data preprocessing
titanic=pandas.read_csv('F:\\test\\titanic_train.csv')
titanic['Age']=titanic['Age'].fillna(titanic['Age'].median())  # fill missing ages with the median
titanic['Embarked']=titanic['Embarked'].fillna("S")  # "S" is the most common embarkation port
# Encode the categorical columns as integers
titanic.loc[titanic['Sex']=='male','Sex']=0
titanic.loc[titanic['Sex']=='female','Sex']=1
titanic.loc[titanic['Embarked']=='S','Embarked']=0
titanic.loc[titanic['Embarked']=='C','Embarked']=1
titanic.loc[titanic['Embarked']=='Q','Embarked']=2

# Logistic regression
predictors=['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']
alg = LogisticRegression(random_state=1)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)  # 3-fold cross-validation
print(scores.mean())

The final accuracy is only 0.785634118967
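Cross-validation only reports an overall score; to see which way each feature pushes the prediction, one can also fit on the full training frame and read the learned coefficients. A minimal sketch, reusing titanic and predictors from the script above:

alg = LogisticRegression(random_state=1)
alg.fit(titanic[predictors], titanic["Survived"])
# coef_ holds one weight per feature; positive weights push toward Survived = 1
for name, coef in zip(predictors, alg.coef_[0]):
    print(name, round(coef, 3))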


Next, random forest:

The idea in brief: a random forest is an ensemble of decision trees, and the "random" shows up in two ways: ① each tree is trained on a random (bootstrap) sample of the rows; ② each split considers only a random subset of the features.

For classification, the trees decide the final class by majority vote; for regression, the forest averages the outputs of its individual trees. The sketch below makes the voting visible.
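A minimal sketch of that vote on synthetic data (make_classification and the 5-tree forest are illustrative choices, not part of this post). A fitted RandomForestClassifier exposes its trees through estimators_; strictly speaking sklearn averages the trees' class probabilities rather than counting hard votes, but with fully grown trees this behaves like a majority vote.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=7, random_state=1)
forest = RandomForestClassifier(n_estimators=5, random_state=1).fit(X, y)

# Each fitted tree can be queried on its own via estimators_
per_tree = np.array([tree.predict(X[:3]) for tree in forest.estimators_])
print(per_tree)               # 5 trees x 3 samples of 0/1 votes
print(forest.predict(X[:3]))  # the ensemble's (probability-averaged) decision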

import pandas
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
# Data preprocessing
titanic=pandas.read_csv('F:\\test\\titanic_train.csv')
titanic['Age']=titanic['Age'].fillna(titanic['Age'].median())
titanic['Embarked']=titanic['Embarked'].fillna("S")
titanic.loc[titanic['Sex']=='male','Sex']=0
titanic.loc[titanic['Sex']=='female','Sex']=1
titanic.loc[titanic['Embarked']=='S','Embarked']=0
titanic.loc[titanic['Embarked']=='C','Embarked']=1
titanic.loc[titanic['Embarked']=='Q','Embarked']=2

# Random forest
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
# Initialize the algorithm with the following parameters
# n_estimators is the number of trees we want to make
# min_samples_split is the minimum number of rows we need to make a split
# min_samples_leaf is the minimum number of samples we can have at the place where a tree branch ends (the bottom points of the tree)
alg = RandomForestClassifier(random_state=1, n_estimators=100, min_samples_split=4, min_samples_leaf=2)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)

# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

The accuracy improves:

0.822671156004


Feature analysis:

The idea in brief: suppose a sample has three features A, B, and C. To estimate the "importance" of A, overwrite A with garbage values while keeping the other features unchanged; if the error rate rises noticeably, A carries real information. (Note that the code below takes a different route: SelectKBest with f_classif scores each feature with a univariate ANOVA F-test, independently of any model.)
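What the paragraph describes is essentially permutation importance: shuffling a column is the usual stand-in for "garbage values", since it breaks the feature's link to the label while keeping its distribution. Here is a minimal sketch on synthetic data (the dataset and model are illustrative); newer sklearn releases also ship this ready-made as sklearn.inspection.permutation_importance.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=4, n_informative=2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
model = RandomForestClassifier(random_state=1).fit(X_train, y_train)
base = model.score(X_test, y_test)

rng = np.random.RandomState(1)
for col in range(X_test.shape[1]):
    X_shuffled = X_test.copy()
    rng.shuffle(X_shuffled[:, col])  # destroy this feature's information
    print("feature", col, "accuracy drop:", base - model.score(X_shuffled, y_test))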

import pandas
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
import matplotlib.pyplot as plt
# Data preprocessing
titanic=pandas.read_csv('F:\\test\\titanic_train.csv')
titanic['Age']=titanic['Age'].fillna(titanic['Age'].median())
titanic['Embarked']=titanic['Embarked'].fillna("S")
titanic.loc[titanic['Sex']=='male','Sex']=0
titanic.loc[titanic['Sex']=='female','Sex']=1
titanic.loc[titanic['Embarked']=='S','Embarked']=0
titanic.loc[titanic['Embarked']=='C','Embarked']=1
titanic.loc[titanic['Embarked']=='Q','Embarked']=2

# Add 2 engineered features
# Generating a familysize column
titanic["FamilySize"] = titanic["SibSp"] + titanic["Parch"]

# The .apply method generates a new series
titanic["NameLength"] = titanic["Name"].apply(lambda x: len(x))

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "NameLength"]

# Perform feature selection
selector = SelectKBest(f_classif, k=5)
selector.fit(titanic[predictors], titanic["Survived"])

# Get the raw p-values for each feature, and transform from p-values into scores
scores = -np.log10(selector.pvalues_)

# Plot the scores.  See how "Pclass", "Sex", "Fare", and "NameLength" score highest?
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()

# Pick only the four best features.
predictors = ["Pclass", "Sex", "Fare","NameLength"]

alg = RandomForestClassifier(random_state=1, n_estimators=100, min_samples_split=8, min_samples_leaf=4)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)

# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

Based on the feature analysis, the code above keeps only the four most important features:

"Pclass", "Sex", "Fare", "NameLength"

Even though only four features are used, the final score barely changes:

0.801346801347
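Besides the univariate F-test, the fitted forest carries its own importance measure: feature_importances_, the (impurity-based) share of split improvement each feature contributes across all trees. A minimal sketch, reusing the titanic frame and the four-feature predictors from the script above:

alg = RandomForestClassifier(random_state=1, n_estimators=100, min_samples_split=8, min_samples_leaf=4)
alg.fit(titanic[predictors], titanic["Survived"])
# feature_importances_ sums each feature's impurity reduction over all trees
for name, imp in sorted(zip(predictors, alg.feature_importances_), key=lambda t: -t[1]):
    print(name, round(imp, 3))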


Reference: Andrew Ng's videos

