Hands-On Projects

Contents

Titanic Passenger Survival Prediction

IMDB Movie Review Score Estimation

MNIST Handwritten Digit Image Recognition


Titanic Passenger Survival Prediction

import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

# pandas is used for convenient data loading and handling
# Read the training and test data from local files
train = pd.read_csv("../input/titanic/train_data.csv")
test = pd.read_csv("../input/titanic/test_data.csv")
# First print the basic information of the training and test data. This is a good habit: it gives an overview of the data size, the type of each feature, and whether any values are missing
print(train.info())
print(test.info())
# The data turns out to be complete, so no missing-value imputation is needed
print("----"*20)

# Based on prior knowledge of the Titanic disaster, manually select features that should help the prediction
selected_features = ['Pclass_1','Pclass_2','Pclass_3','Sex','Age','Emb_1','Emb_2','Emb_3','Family_size','Fare']

X_train = train[selected_features]
X_test = test[selected_features]

y_train = train['Survived']

# Next, vectorize the features with DictVectorizer
dict_vec = DictVectorizer(sparse = False)
X_train = dict_vec.fit_transform(X_train.to_dict(orient="records"))
print(dict_vec.feature_names_)

X_test = dict_vec.transform(X_test.to_dict(orient="records"))

# Initialize RandomForestClassifier (from sklearn.ensemble) with its default configuration
rfc = RandomForestClassifier()

# XGBClassifier, from the popular xgboost package, is used for the same classification problem
# It is likewise initialized with its default configuration
xgbc = XGBClassifier()

# Evaluate the default-configured RandomForestClassifier and XGBClassifier on the training set with 5-fold cross-validation and report the mean classification accuracy
print(cross_val_score(rfc,X_train,y_train,cv=5).mean())
print(cross_val_score(xgbc,X_train,y_train,cv=5).mean())

# Make predictions on the test data with the default-configured RandomForestClassifier
rfc.fit(X_train,y_train)
rfc_y_predict = rfc.predict(X_test)
rfc_submission = pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':rfc_y_predict})
# Save the default RandomForestClassifier's predictions on the test data to rfc_submission.csv
rfc_submission.to_csv("./rfc_submission.csv",index=False)

# Make predictions on the test data with the default-configured XGBClassifier
xgbc.fit(X_train,y_train)
xgbc_y_predict = xgbc.predict(X_test)
xgbc_submission = pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':xgbc_y_predict})
# Save the default XGBClassifier's predictions on the test data to xgbc_submission.csv
xgbc_submission.to_csv("./xgbc_submission.csv",index=False)

# Use a parallel grid search to look for a better combination of hyperparameters, hoping to further improve the XGBClassifier's predictive performance
params = {'max_depth':range(2,4),'n_estimators':range(100,1100,200),'learning_rate':[0.05,0.1,0.25,0.5,1.0]}

xgbc_best = XGBClassifier()
gs = GridSearchCV(xgbc_best,params,n_jobs=-1,cv=5,verbose=1)
gs.fit(X_train,y_train)

# Check the hyperparameter configuration of the optimized XGBClassifier and its cross-validated accuracy
print("gs.best_score",gs.best_score_)
print("gs.best_params",gs.best_params_)

# Use the XGBClassifier with the optimized hyperparameter configuration to predict on the test data and save the results to xgbc_best_submission.csv
xgbc_best_y_predict = gs.predict(X_test)
xgbc_best_submission = pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':xgbc_best_y_predict})
xgbc_best_submission.to_csv("./xgbc_best_submission.csv",index=False)

Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 792 entries, 0 to 791
Data columns (total 17 columns):
Unnamed: 0     792 non-null int64
PassengerId    792 non-null int64
Survived       792 non-null int64
Sex            792 non-null int64
Age            792 non-null float64
Fare           792 non-null float64
Pclass_1       792 non-null int64
Pclass_2       792 non-null int64
Pclass_3       792 non-null int64
Family_size    792 non-null float64
Title_1        792 non-null int64
Title_2        792 non-null int64
Title_3        792 non-null int64
Title_4        792 non-null int64
Emb_1          792 non-null int64
Emb_2          792 non-null int64
Emb_3          792 non-null int64
dtypes: float64(3), int64(14)
memory usage: 105.3 KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 17 columns):
Unnamed: 0     100 non-null int64
PassengerId    100 non-null int64
Survived       100 non-null int64
Sex            100 non-null int64
Age            100 non-null float64
Fare           100 non-null float64
Pclass_1       100 non-null int64
Pclass_2       100 non-null int64
Pclass_3       100 non-null int64
Family_size    100 non-null float64
Title_1        100 non-null int64
Title_2        100 non-null int64
Title_3        100 non-null int64
Title_4        100 non-null int64
Emb_1          100 non-null int64
Emb_2          100 non-null int64
Emb_3          100 non-null int64
dtypes: float64(3), int64(14)
memory usage: 13.4 KB
None
--------------------------------------------------------------------------------
['Age', 'Emb_1', 'Emb_2', 'Emb_3', 'Family_size', 'Fare', 'Pclass_1', 'Pclass_2', 'Pclass_3', 'Sex']
D:\anaconda3\envs\tree Point five\lib\site-packages\sklearn\ensemble\forest.py:248: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
0.798006329113924
0.8358860759493671
Fitting 5 folds for each of 50 candidates, totalling 250 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    2.9s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:    5.7s
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:    7.2s finished
gs.best_score 0.8421717171717171
gs.best_params {'learning_rate': 0.25, 'max_depth': 2, 'n_estimators': 100}
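
GridSearchCV refits the best configuration on the full training set by default (refit=True), and that refit model is what gs.predict uses above. If an explicitly constructed final model is preferred, here is a minimal sketch using the parameters reported above (the name xgbc_final is only for illustration):

# Rebuild the classifier with the best parameters found by the grid search;
# the values below simply spell out gs.best_params_ from the run above.
xgbc_final = XGBClassifier(learning_rate=0.25, max_depth=2, n_estimators=100)
xgbc_final.fit(X_train,y_train)
xgbc_final_y_predict = xgbc_final.predict(X_test)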

IMDB Movie Review Score Estimation
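
A minimal sketch of one common approach to this task is shown below: a TF-IDF bag-of-words representation of the review text plus a logistic regression classifier, evaluated with the same 5-fold cross-validation used in the Titanic example. The file paths and the id / review / sentiment column names are assumptions modeled on the Kaggle "Bag of Words Meets Bags of Popcorn" data, not details taken from this post.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Assumed file locations and tab-separated format
train = pd.read_csv("../input/imdb/labeledTrainData.tsv", sep="\t")
test = pd.read_csv("../input/imdb/testData.tsv", sep="\t")

# Turn the raw review text into TF-IDF features (unigrams and bigrams, English stop words removed)
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2), max_features=50000)
X_train = vectorizer.fit_transform(train["review"])
X_test = vectorizer.transform(test["review"])
y_train = train["sentiment"]

# Evaluate with 5-fold cross-validation, mirroring the Titanic example above
lr = LogisticRegression(max_iter=1000)
print(cross_val_score(lr, X_train, y_train, cv=5).mean())

# Fit on the full training set and write a submission file
lr.fit(X_train, y_train)
lr_submission = pd.DataFrame({"id": test["id"], "sentiment": lr.predict(X_test)})
lr_submission.to_csv("./lr_submission.csv", index=False)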

MNIST Handwritten Digit Image Recognition
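
Similarly, a minimal sketch for this task: fetch MNIST through scikit-learn and train a linear support vector classifier. Using fetch_openml("mnist_784") as the data source, the 10,000-sample cut for speed, and a reasonably recent scikit-learn version are all assumptions for illustration, not details taken from this post.

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# Download MNIST: 70,000 handwritten digits as flat 784-dimensional pixel vectors
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)

# Subsample to keep the sketch quick to run; drop this line to use the full dataset
X, y = X[:10000], y[:10000]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

# Standardize pixel values before fitting the linear classifier
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

svc = LinearSVC(max_iter=5000)
svc.fit(X_train, y_train)
print(classification_report(y_test, svc.predict(X_test)))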
