Datawhale Check-in Activity: Kaggle Spaceship Titanic, Day 3

麻辣香郭诶

Last modified: 2022-10-01 14:25:44

Category: Kaggle Spaceship Titanic check-in activity. Tags: sklearn, machine learning, artificial intelligence

First published: 2022-09-30 15:51:16

Article link: https://blog.csdn.net/qq_52171945/article/details/127124866

Datawhale Check-in Activity: Kaggle Spaceship Titanic

I tried a check-in activity run by Coggle (Coggle 30 Days of ML, October 2022), and this post records what I learned along the way!

Day 3: Validation Set Splitting and Tree Models

Step 1: Learn the data splitting methods in sklearn

Reference: "Dataset splitting in sklearn" (sklearn中的数据集的划分), tantao258, 博客园 (cnblogs.com)

sklearn provides the following data splitting classes: KFold, GroupKFold, StratifiedKFold, LeaveOneGroupOut, LeavePGroupsOut, LeaveOneOut, LeavePOut, ShuffleSplit, GroupShuffleSplit, StratifiedShuffleSplit, PredefinedSplit, and TimeSeriesSplit. In practice, the common ones (or at least the ones I have used most so far) are KFold and StratifiedKFold.

K-fold cross-validation (KFold, StratifiedKFold, GroupKFold)

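K-fold cross-validation splits the original dataset into k parts (folds). In each round, k-1 folds are used as the training set and the remaining fold as the validation set; the model is trained and then evaluated on that validation fold. This is repeated k times so that every fold serves as the validation set exactly once, and the average of the k accuracy scores is taken as the estimate of the model's true accuracy.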
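
As a minimal sketch of the procedure above, the loop below runs 5-fold cross-validation by hand and averages the per-fold accuracy. The synthetic data from `make_classification` and the `DecisionTreeClassifier` are stand-ins chosen for illustration; they are not the competition data or the model used in this post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Spaceship Titanic dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=2022)

kf = KFold(n_splits=5, shuffle=True, random_state=2022)
fold_scores = []
for train_idx, valid_idx in kf.split(X):
    # Train on k-1 folds, validate on the held-out fold
    model = DecisionTreeClassifier(random_state=2022)
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[valid_idx])
    fold_scores.append(accuracy_score(y[valid_idx], preds))

# The mean over the k folds is the cross-validated accuracy estimate
mean_accuracy = float(np.mean(fold_scores))
print(fold_scores, mean_accuracy)
```

For a shorter version of the same loop, `sklearn.model_selection.cross_val_score` accepts the splitter through its `cv` argument.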

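For the differences among the three, see this blog post by 强哥: "[sklearn] The differences between KFold, StratifiedKFold, and GroupKFold" (【sklearn】KFold、StratifiedKFold、GroupKFold的区别), Tencent Cloud Developer Community (tencent.com)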
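
To make the difference concrete, here is a small sketch on toy data. The 12-sample array, the 2:1 label imbalance, and the four groups are assumptions made up for this illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold, StratifiedKFold

# Toy data (assumption): 12 samples, imbalanced labels, 4 groups of 3 samples
X = np.arange(12).reshape(-1, 1)
y = np.array([0] * 8 + [1] * 4)
groups = np.repeat([0, 1, 2, 3], 3)

# KFold ignores labels and groups: without shuffling, folds are cut by position
kfold_valid_labels = [y[valid].tolist() for _, valid in KFold(n_splits=4).split(X)]

# StratifiedKFold keeps the class ratio (here 2:1) in every validation fold
skf_valid_labels = [
    y[valid].tolist() for _, valid in StratifiedKFold(n_splits=4).split(X, y)
]

# GroupKFold never places one group on both sides of a split
gkf_overlap = [
    set(groups[train]) & set(groups[valid])
    for train, valid in GroupKFold(n_splits=4).split(X, y, groups)
]

print(kfold_valid_labels)
print(skf_valid_labels)
print(gkf_overlap)
```

With plain KFold the last validation fold contains only the minority class, which is why StratifiedKFold is usually the safer default for classification with imbalanced labels.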

Leave-one-out methods (LeaveOneGroupOut, LeavePGroupsOut, LeaveOneOut, LeavePOut)

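Leave-P-out validation (leave-one-out is the special case P = 1): given N samples, each split holds out P samples as the test set and trains on the other N-P samples. When P > 1, the test sets of different splits overlap. With P = 1 this yields N classifiers and N test results (in general there is one split per size-P subset, i.e. C(N, P) of them), and the average of these results is used to measure model performance.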
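
The split counts can be checked directly. This is a small sketch on a made-up array of N = 5 samples (an assumption for illustration), comparing `LeaveOneOut` with `LeavePOut(p=2)`.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeaveOneOut, LeavePOut

X = np.arange(5).reshape(-1, 1)  # toy data (assumption): N = 5 samples

# Leave-one-out: one split per sample, so N splits
n_loo = LeaveOneOut().get_n_splits(X)

# Leave-P-out with P = 2: one split per 2-sample subset, so C(5, 2) splits
lpo = LeavePOut(p=2)
n_lpo = lpo.get_n_splits(X)
test_sets = [tuple(test.tolist()) for _, test in lpo.split(X)]

print(n_loo, n_lpo, comb(5, 2))
print(test_sets)  # each sample shows up in several test sets, so they overlap
```

Because the number of splits grows combinatorially with N and P, these methods are mostly practical on small datasets.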

Random splitting methods (ShuffleSplit, GroupShuffleSplit, StratifiedShuffleSplit)

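The ShuffleSplit iterator generates a specified number of independent train/test splits. It first shuffles all samples randomly and then divides them into a train/test pair. The random seed random_state controls the random number generator so that the results are reproducible. Compared with K-fold, it gives finer control over the number of iterations and the train/test proportion.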
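
A short sketch of these two controls and of the reproducibility claim, using a made-up array of 20 samples (an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20).reshape(-1, 1)  # toy data (assumption): 20 samples

# n_splits sets the number of iterations, test_size the train/test proportion
ss = ShuffleSplit(n_splits=3, test_size=0.25, random_state=2022)
splits_a = [(train.tolist(), test.tolist()) for train, test in ss.split(X)]

# A second splitter with the same seed reproduces exactly the same splits
ss_again = ShuffleSplit(n_splits=3, test_size=0.25, random_state=2022)
splits_b = [(train.tolist(), test.tolist()) for train, test in ss_again.split(X)]

for train, test in splits_a:
    print(len(train), len(test))
print(splits_a == splits_b)
```

Unlike K-fold, the test sets of different iterations are drawn independently, so a sample may appear in several of them or in none.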

Data splitting

sklearn also provides a function for a single train/test split: train_test_split.

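from sklearn.model_selection import train_test_split

# train: feature table, label: target column
# stratify=label keeps the class ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    train, label, stratify=label, random_state=2022)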
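
Since `train` and `label` are not defined in this excerpt, here is a self-contained sketch of the same call. The arrays `features` and `target` are hypothetical stand-ins with a 60/40 class balance, not the competition data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins (assumption) for the post's `train` features and `label` target
features = np.arange(200).reshape(100, 2)
target = np.array([0] * 60 + [1] * 40)

# test_size is not set, so the default 25% of the rows goes to the test split
X_train, X_test, y_train, y_test = train_test_split(
    features, target, stratify=target, random_state=2022)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())  # positive-class share in each split
```

Thanks to `stratify`, both splits keep roughly the same share of positive labels as the full data.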

Step 2: Import the tree models from sklearn and related libraries

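sklearn itself provides GradientBoostingClassifier and HistGradientBoostingClassifier. Other tree models, such as XGBClassifier, LGBMClassifier, and CatBoostClassifier, come from the separate xgboost, lightgbm, and catboost packages. Each one is brought in with an import statement: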

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import HistGradientBoostingClassifier

Step 3: Fill missing values in the training and test sets (column mean for numeric columns, mode for categorical columns)

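Missing values in the training and test sets are filled with the fillna function. For convenience, the two sets are first concatenated into one DataFrame.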

import pandas as pd

# train and test: the training and test DataFrames
df = pd.concat([train, test])
for column in ['Age','RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']:
    mean_val = df[column].mean()
    df[column] = df[column].fillna(mean_val)  # fill with the column mean
# The spending columns were already filled above, so only categorical columns remain
for column in ['VIP','Cabin','HomePlanet','Name','Destination','CryoSleep']:
    df[column] = df[column].fillna(df[column].mode()[0])  # fill with the mode
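
The same two fills can be tried on a tiny DataFrame to see the effect. The three-row `toy` frame below is a made-up example (an assumption), using two column names from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption): a tiny subset of Spaceship Titanic-style columns
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "HomePlanet": ["Earth", None, "Earth"],
})

# Numeric column: fill with the column mean (NaN is skipped when averaging)
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())
# Categorical column: mode() returns a Series, so take its first entry
toy["HomePlanet"] = toy["HomePlanet"].fillna(toy["HomePlanet"].mode()[0])

print(toy)
print(toy.isna().sum().sum())  # number of missing values left
```

Assigning the result back to the column, rather than calling fillna with inplace=True on a column selection, avoids the chained-assignment pitfalls that newer pandas versions warn about.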