Kaggle Study Notes (or a Scratchpad, If You Prefer)

正饼干海胆

Tags: experience sharing, data mining, machine learning

Article link: https://blog.csdn.net/qq_47460531/article/details/118309299

Contents

Preface
Main content

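Preface

I have recently been learning very basic data mining on Kaggle and taking notes in Kaggle's built-in notebook, which is reportedly built on Jupyter. I am not very good with it. After reading the official documentation I more or less understood how to use it, but for some reason every save reports an error. The notes have never actually been missing when I reopen the notebook, but it still makes me uneasy.
So I decided to start a separate post on CSDN as a backup of my Kaggle notes. In fact, if you only want to edit and read rather than run code, CSDN is clearly more convenient (though perhaps I just don't know how to use Jupyter), so this scratchpad may well end up becoming the main copy of the notes.
If you were unlucky enough to click into this post and cannot make sense of a word the author is saying, that is perfectly normal; after all, I wrote it for myself. If you understood some parts, half-understood others, and would like to ask questions or discuss, please leave a comment and I will do my best to answer when I have time. (What am I saying, as if anyone would read this.jpg)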

Main content

Importing and exporting data

import pandas as pd
# save filepath to variable for easier access
file_path = '../input/digit-recognizer/train.csv'
# read the data and store it in a DataFrame named data
data = pd.read_csv(file_path)

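This is how data is imported in a Kaggle notebook. First click the Add data button under Data in the top-right corner and choose the dataset you need. Once it has been added, click on it and you will see a copy-path option; copy that path and assign it to file_path. The path above loads Digit Recognizer, one of Kaggle's classic introductory datasets.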

# Run the code to save predictions in the format used for competition scoring

output = pd.DataFrame({'Id': test_data.Id,
                       'SalePrice': test_preds})
output.to_csv('submission.csv', index=False)

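This step exports the results. Before exporting, the predictions must be arranged in the format the competition expects for submissions. (Here test_data is the test set and test_preds holds the model's predictions for it. The Id and SalePrice columns belong to a house-price competition, so use whatever column names your own competition requires.) After exporting, submit as follows: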

Begin by clicking on the blue Save Version button in the top right corner of the window. This will generate a pop-up window.
Ensure that the Save and Run All option is selected, and then click on the blue Save button.
This generates a window in the bottom left corner of the notebook. After it has finished running, click on the number to the right of the Save Version button. This pulls up a list of versions on the right of the screen. Click on the ellipsis (…) to the right of the most recent version, and select Open in Viewer. This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
Click on the Output tab on the right of the screen. Then, click on the file you would like to submit, and click on the blue Submit button to submit your results to the leaderboard.
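
To make the export format concrete, here is a minimal sketch of building and checking a submission table. The ids and predictions are made-up placeholder values, and the file is written to an in-memory buffer rather than to submission.csv so that the sketch has no side effects.

```python
import io
import pandas as pd

# Hypothetical test ids and predictions; in a real notebook these come from
# the competition's test file and from model.predict(...)
test_ids = pd.Series([1461, 1462, 1463])
test_preds = [169000.0, 187500.0, 181000.0]

output = pd.DataFrame({"Id": test_ids, "SalePrice": test_preds})

# Write to an in-memory buffer instead of disk
buffer = io.StringIO()
output.to_csv(buffer, index=False)   # index=False keeps the row index out of the file
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back shows exactly the columns a scorer would see
round_trip = pd.read_csv(io.StringIO(csv_text))
print(list(round_trip.columns))
```

Without index=False the file would gain an extra unnamed index column, which most competitions' submission formats do not expect.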

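Inspecting basic properties of the data

Take a quick look at the data before doing any data mining.
Only a little is written here, all of it very basic; I will certainly add more later.
(Here data is the variable from data = pd.read_csv(file_path) above.)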

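data.describe() gives a convenient summary of each numeric column: count, mean, standard deviation, minimum, maximum, and the 25%, 50% (median) and 75% quartiles.
data.columns shows the column headers, from which you can pick out the feature names and the label name.
data.head() shows the first few rows (five by default).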
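
A small sketch of these three calls on a made-up three-row table (the column names and values are placeholders, not a real Kaggle file), read from a string so it runs anywhere:

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for a real Kaggle file
csv_text = """Id,Rooms,Price
1,2,300000
2,3,450000
3,4,600000
"""
demo = pd.read_csv(io.StringIO(csv_text))

summary = demo.describe()          # count, mean, std, min, quartiles, max per numeric column
column_names = list(demo.columns)  # header names as a plain list
first_rows = demo.head(2)          # head is a method; it returns the first n rows

print(summary.loc[["mean", "50%"], "Price"])
print(column_names)
print(first_rows)
```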

Simple data preprocessing

Handling numerical feature columns

# Select the features whose values are numeric
# To keep things simple, we'll use only numerical predictors
melb_predictors = data.drop(['Price'], axis=1)
X = melb_predictors.select_dtypes(exclude=['object'])


# Drop the features (columns) that contain missing values
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]

# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)


# Fill in missing values with the column mean (SimpleImputer's default strategy)
from sklearn.impute import SimpleImputer
# Imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
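
To see the difference between dropping and imputing, here is a minimal sketch on toy data. The frames X_train_demo and X_valid_demo are invented for illustration, and the column names are passed when the DataFrame is rebuilt instead of being restored afterwards.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: column "b" has one missing value in the training set
X_train_demo = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, np.nan, 30.0]})
X_valid_demo = pd.DataFrame({"a": [4.0], "b": [np.nan]})

# Approach 1: drop every column that has a missing value in the training data
cols_with_missing = [c for c in X_train_demo.columns if X_train_demo[c].isnull().any()]
reduced_train = X_train_demo.drop(cols_with_missing, axis=1)

# Approach 2: fill gaps with the column mean learned from the training data only
imputer = SimpleImputer(strategy="mean")
imputed_train = pd.DataFrame(imputer.fit_transform(X_train_demo),
                             columns=X_train_demo.columns, index=X_train_demo.index)
imputed_valid = pd.DataFrame(imputer.transform(X_valid_demo),
                             columns=X_valid_demo.columns, index=X_valid_demo.index)

print(reduced_train)
print(imputed_train)
print(imputed_valid)
```

The validation gap is filled with the training mean of column b, because the imputer is fitted on the training data and only applied to the validation data.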

Handling categorical feature columns

# Handling discrete, categorical data
# Three approaches: 1. ignore (drop) the columns  2. label encoding  3. one-hot encoding


# Get list of categorical variables
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)


# 1. Drop the categorical columns
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])


# 2. Label encoding
from sklearn.preprocessing import LabelEncoder

# Make copy to avoid changing original data 
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Apply label encoder to each column with categorical data
label_encoder = LabelEncoder()
for col in object_cols:
    label_X_train[col] = label_encoder.fit_transform(X_train[col])
    label_X_valid[col] = label_encoder.transform(X_valid[col])
    
    
# 3. One-hot encoding
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
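# sparse_output=False returns a dense array; scikit-learn versions before 1.2 spell this parameter sparse=False
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)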
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

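# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

# The one-hot columns are named 0, 1, 2, ...; recent scikit-learn versions
# expect all column names to be strings, so convert them
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)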
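
A minimal sketch of what the two encodings produce, on a made-up color column. Note the swap: it uses OrdinalEncoder rather than the LabelEncoder from the notes above, since scikit-learn describes LabelEncoder as meant for target values and OrdinalEncoder as its counterpart for feature columns. The dense array is obtained with .toarray(), which sidesteps the sparse / sparse_output naming difference between versions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical single categorical feature
colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Integer codes: categories are sorted, so blue=0, green=1, red=2
ordinal = OrdinalEncoder()
ordinal_codes = ordinal.fit_transform(colors)

# One column per category; the default output is sparse, so convert to dense
onehot = OneHotEncoder(handle_unknown="ignore")
onehot_matrix = onehot.fit_transform(colors).toarray()

# A category never seen during fit becomes an all-zero row with handle_unknown="ignore"
unseen = onehot.transform(pd.DataFrame({"color": ["purple"]})).toarray()

print(ordinal_codes.ravel())
print(onehot_matrix)
print(unseen)
```

The integer codes impose an order (blue < green < red) that the colors do not really have, which is the usual argument for one-hot encoding when the categories are unordered.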

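Pipelines

## Pipeline-based processing
## Assumes the data is already loaded, features and target are separated, training and validation sets are split, and the numerical and categorical columns are chosen (categorical columns with fewer than 10 distinct values)


# ColumnTransformer bundles several transformers so that different columns can be processed differently
# Each transformer is given as a triple ('name', transformer, columns_to_transform)
# Pipeline chains steps together and runs them in order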


#Step 1: Define Preprocessing Steps

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])


# Step 2: Define the Model

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)


# Step 3: Create and Evaluate the Pipeline

from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)
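
The listing above assumes X_train, y_train, numerical_cols and so on already exist. As a rough end-to-end sketch, the same steps can be run on synthetic data; everything about the toy table (column names, the price formula, the missing-value pattern) is invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical synthetic housing-style table
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "area": rng.uniform(50, 150, n),
    "rooms": rng.integers(1, 6, n).astype(float),
    "district": pd.Series(rng.choice(["north", "south", "east"], n), dtype=object),
})
toy["price"] = 1000 * toy["area"] + 5000 * toy["rooms"] + rng.normal(0, 1000, n)

# Knock out some values so the imputers have something to do
toy.loc[::10, "rooms"] = np.nan
toy.loc[::15, "district"] = np.nan

numerical_cols = ["area", "rooms"]
categorical_cols = ["district"]
X = toy.drop("price", axis=1)
y = toy["price"]
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, random_state=0)

preprocessor = ColumnTransformer(transformers=[
    # strategy='constant' fills numeric gaps with 0 unless fill_value is given
    ("num", SimpleImputer(strategy="constant"), numerical_cols),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

my_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])

# fit() imputes and encodes the training data, then fits the forest;
# predict() applies the already-fitted preprocessing to the validation data
my_pipeline.fit(X_train, y_train)
preds = my_pipeline.predict(X_valid)
mae = mean_absolute_error(y_valid, preds)
print("MAE:", mae)
```

Because the preprocessing lives inside the pipeline, the validation data is never used to fit the imputers or the encoder.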

XGBoost

An implementation of extreme gradient boosting; at its core it is still built from decision trees.

Usage

from xgboost import XGBRegressor

my_model = XGBRegressor()
my_model.fit(X_train, y_train)

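XGBoost is not part of scikit-learn; it is a separate package. Getting it to work took me quite some effort back then: I had to download the build matching my Python version (very important!) from a Python extension packages page and then install it with pip.
XGBoost is a leading software library for working with standard tabular data (the kind of data you store in Pandas DataFrames, as opposed to more exotic types such as images and videos). With careful parameter tuning, you can train highly accurate models.
Some of its parameters are introduced below.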

Parameters

n_estimators

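n_estimators specifies how many times to go through the modeling cycle, in which a new model is fitted to the errors of the current ensemble and then added to it. It therefore equals the number of models in the ensemble. Too low a value causes underfitting and too high a value causes overfitting. Typical values range from 100 to 1000, though the right value depends heavily on the learning_rate parameter.
How to set it: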

my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train)

early_stopping_rounds

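early_stopping_rounds offers a way to find a good value for n_estimators automatically. The model stops iterating once the validation score stops improving. Because random chance can produce a single round with no improvement, you specify how many rounds without improvement to allow before stopping; that number is early_stopping_rounds.
Setting early_stopping_rounds=5 is a reasonable choice: the model then stops after 5 consecutive rounds in which the validation score fails to improve.
A good practice is to set a high value for n_estimators and let early_stopping_rounds find the best time to stop.
When using early_stopping_rounds, you also need to set aside some data for computing the validation score, which is done through the eval_set parameter.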

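# Note: this passes early_stopping_rounds to fit(), as older XGBoost releases expect;
# newer releases take it in the constructor instead, e.g. XGBRegressor(n_estimators=500, early_stopping_rounds=5)
my_model = XGBRegressor(n_estimators=500)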
my_model.fit(X_train, y_train, 
             early_stopping_rounds=5, 
             eval_set=[(X_valid, y_valid)],
             verbose=False)
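
The stopping rule itself can be sketched in a few lines of plain Python. This is only an illustration of the patience idea, not XGBoost's actual implementation; the function name best_round_with_patience and the list of validation errors are made up.

```python
def best_round_with_patience(valid_scores, patience):
    """Return (best_round, stopped_round) for a list of validation errors.

    Lower is better. Training stops once `patience` consecutive rounds
    have failed to beat the best score seen so far.
    """
    best_score = float("inf")
    best_round = -1
    for round_idx, score in enumerate(valid_scores):
        if score < best_score:
            # New best: remember it and reset the patience window
            best_score = score
            best_round = round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` rounds in a row: stop here
            return best_round, round_idx
    # Ran out of rounds before patience was exhausted
    return best_round, len(valid_scores) - 1


# Hypothetical validation errors, one per boosting round
scores = [10.0, 8.0, 7.5, 7.6, 7.4, 7.5, 7.6, 7.7, 7.8, 7.9, 7.0]
best, stopped = best_round_with_patience(scores, patience=5)
print(best, stopped)
```

The single bad round at index 3 does not stop training, but the five rounds after index 4 do. The last score (7.0) is never reached, which is the price of stopping early: a smaller patience saves time but risks missing a late improvement.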

learning_rate

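Instead of simply adding up the predictions of each component model, we can multiply each model's predictions by a small number (the learning rate, learning_rate) before adding them in.
This means each tree we add to the ensemble helps us less, so we can set a higher value for n_estimators without overfitting. If we use early_stopping_rounds, the appropriate number of trees is determined automatically.
In general, a small learning rate combined with a large number of estimators yields more accurate XGBoost models, though training takes longer because more iterations are run. The course material these notes follow gives the default as learning_rate=0.1, while the current XGBoost documentation lists 0.3, so it is safest to set the value explicitly.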

my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05)
my_model.fit(X_train, y_train, 
             early_stopping_rounds=5, 
             eval_set=[(X_valid, y_valid)], 
             verbose=False)
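
To show what "multiply each model's predictions by a small number" means, here is a tiny hand-rolled boosting loop for squared error. It is a sketch built on scikit-learn's DecisionTreeRegressor, not XGBoost itself (which adds regularization and much more); the function boost and the sine-wave data are invented for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def boost(X, y, n_estimators, learning_rate, max_depth=2):
    """Tiny boosting loop for squared error; returns the training MSE after each round."""
    prediction = np.full(len(y), y.mean())        # start from a constant model
    history = [np.mean((y - prediction) ** 2)]
    for _ in range(n_estimators):
        residual = y - prediction                 # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)
        # Shrink the new tree's contribution by the learning rate before adding it
        prediction = prediction + learning_rate * tree.predict(X)
        history.append(np.mean((y - prediction) ** 2))
    return history


# Hypothetical noisy sine wave
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, 200)

fast = boost(X_demo, y_demo, n_estimators=20, learning_rate=1.0)
slow = boost(X_demo, y_demo, n_estimators=20, learning_rate=0.1)
print("MSE after 1 round :", fast[1], slow[1])
print("MSE after 20 rounds:", fast[-1], slow[-1])
```

With the smaller learning rate each round removes less of the training error, so more rounds are needed to reach the same fit. This sketch tracks training error only; the benefit against overfitting would show up on held-out validation data.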

n_jobs

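On larger datasets where runtime matters, you can use parallelism to build models faster. It is common to set the n_jobs parameter to the number of cores on your machine. On smaller datasets this does not help.
The resulting model is no better, so micro-optimizing fitting time is usually just a distraction. It is useful on large datasets, though, where you would otherwise spend a long time waiting during the fit call.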

my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4)
my_model.fit(X_train, y_train, 
             early_stopping_rounds=5, 
             eval_set=[(X_valid, y_valid)], 
             verbose=False)
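
Rather than hard-coding 4, the core count can be looked up at run time. A minimal sketch using only the standard library (the variable name n_cores is arbitrary):

```python
import os

# os.cpu_count() can return None when the count cannot be determined,
# so fall back to a single core in that case
n_cores = os.cpu_count() or 1
print("cores reported by the OS:", n_cores)
```

The result could then be passed as n_jobs=n_cores in the constructor call above.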
