Notes on Building and Tuning Machine Learning Models

Article link: https://blog.csdn.net/weixin_43132892/article/details/105256112

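I am starting from zero and feeling my way forward, following the tutorial "零基础入门数据挖掘 - 二手车交易价格预测" (Introduction to Data Mining from Scratch: Used Car Price Prediction).
As a beginner, at this early stage I mostly want to grasp the whole workflow and get its outline straight. The knowledge-structure diagrams also mark the directions in which I plan to extend these topics later.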

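Modeling and tuning steps

(Figure: learning steps for modeling and tuning)

Classification of common statistical learning methods

(Figure: statistical learning)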

Reducing memory usage by adjusting data types

import numpy as np

# reduce_mem_usage shrinks a DataFrame's memory footprint by downcasting each column's dtype
def reduce_mem_usage(df):
    """Iterate through all the columns of a dataframe and modify the data type
    to reduce memory usage.
    """
    # Memory usage before downcasting; memory_usage() returns bytes, so convert to MB
    start_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    # Inspect the dtype of every column
    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            # Minimum and maximum values of the column
            c_min = df[col].min()
            c_max = df[col].max()
            # Integer columns
            if str(col_type)[:3] == 'int':
                # If min > -128 and max < 127, the values fit in int8
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                # If min > -32768 and max < 32767, the values fit in int16
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Floating-point columns
                # Note: float16 has limited precision, so stored values may be rounded
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # Object (string) columns are converted to the category dtype
            df[col] = df[col].astype('category')

    # Memory usage after downcasting, again in MB
    end_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

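Output of my run (memory_usage() reports bytes, so the two figures below are byte counts despite the "MB" label: roughly 57.7 MB before and 15.0 MB after):
Memory usage of dataframe is 60507328.00 MB

Memory usage after optimization is: 15724107.00 MB

Decreased by 74.0%

The memory footprint drops markedly, by about three quarters.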
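To see where the savings come from, here is a minimal sketch on a toy column. The column name "age" and the data are made up for illustration and are not from the competition dataset. It compares the bytes used by the same values stored as int64 and as int8, prints the int8 range that the range checks rely on, and shows that float16 can round values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: 1,000 small integers, stored as int64.
df = pd.DataFrame({"age": np.arange(1000, dtype=np.int64) % 100})

# Bytes used by the values alone (index excluded).
bytes_int64 = df["age"].memory_usage(index=False)                 # 8 bytes per value
bytes_int8 = df["age"].astype(np.int8).memory_usage(index=False)  # 1 byte per value
print(bytes_int64, bytes_int8)

# Representable range of int8, the bounds used when deciding whether to downcast.
int8_min, int8_max = np.iinfo(np.int8).min, np.iinfo(np.int8).max
print(int8_min, int8_max)

# float16 keeps only about three significant decimal digits,
# so downcasting floats that far can change the stored value.
rounded = float(np.float16(2049.0))
print(rounded)
```

If exact float values matter for later feature engineering, it may be safer to stop at float32 rather than float16.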

5-fold cross-validation

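When training a model, the available data (for example, the MNIST handwritten digit dataset) is usually divided into three parts: a training set (train_set), a validation set (valid_set), and a test set (test_set). This split exists to make the training results trustworthy. The test set is the easiest to understand: it takes no part in training at all and is used only to measure final performance. The roles of the training and validation sets are explained below.

In practice, a trained model usually fits its training set quite well (although the result can be sensitive to initial conditions), but it often fits data outside the training set much less satisfactorily. So we usually do not train on all of the data. Instead, we hold out a portion that takes no part in training and use it to test the parameters learned from the training set, which gives a relatively objective measure of how well those parameters fit unseen data. This idea is called cross-validation.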
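The three-way split can be sketched as follows. This is only an illustration under my own assumptions: the data is synthetic, the 60/20/20 proportions are arbitrary, and LinearRegression stands in for whatever model is actually being tuned.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (assumption: 100 samples, 3 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# First set aside 20% as the test set, which never takes part in training.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
# Then split the remainder into training (60% overall) and validation (20% overall).
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)
print(len(X_train), len(X_valid), len(X_test))

# Fit on the training set only, then compare the error on seen and unseen data.
model = LinearRegression().fit(X_train, y_train)
train_mae = mean_absolute_error(y_train, model.predict(X_train))
valid_mae = mean_absolute_error(y_valid, model.predict(X_valid))
print(train_mae, valid_mae)
```

The validation error is the number to watch while tuning; the test set is touched only once, at the very end.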

Cross-validation with sklearn's cross_val_score

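Purpose: check how stable a model is on a given training set; the function returns k scores, one per fold.
K-fold cross-validation splits the original training samples into k parts. In each round, (k-1) parts serve as the training set and the remaining part as the validation set, so the model is trained k times and yields k evaluation results.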
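To make the k rounds concrete, here is a rough sketch that runs the folds by hand with sklearn's KFold. The synthetic data and the choice of LinearRegression are my assumptions for the example; it shows the mechanics rather than the exact internals of cross_val_score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic regression data (assumption: 100 samples, 3 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes, fold_maes = [], []
for train_idx, valid_idx in kf.split(X):
    # Train on k-1 parts, evaluate on the one part left out.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_sizes.append((len(train_idx), len(valid_idx)))
    fold_maes.append(mean_absolute_error(y[valid_idx], model.predict(X[valid_idx])))

print(fold_sizes)  # each round: 80 training samples, 20 validation samples
print(fold_maes)   # one validation error per fold
```

Every sample ends up in the validation part exactly once, which is why the spread of the k scores says something about stability.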

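import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error, make_scorer

# Wrap a metric so that it is computed on log-transformed targets and predictions
def log_transfer(func):
    def wrapper(y, yhat):
        # nan_to_num replaces the NaN/inf produced by the log of non-positive predictions
        result = func(np.log(y), np.nan_to_num(np.log(yhat)))
        return result
    return wrapper

# model, train_X and train_y are assumed to be defined earlier (not shown in this post)
scores = cross_val_score(model, X=train_X, y=train_y, verbose=1, cv=5,
                         scoring=make_scorer(log_transfer(mean_absolute_error)))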
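Since the model and training data are not shown in this post, here is a self-contained sketch of the same idea that can be run as-is. Everything in it is an assumption for illustration: the data is synthetic with strictly positive targets, LinearRegression is a stand-in model, and log_mae is a hypothetical helper name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import cross_val_score

def log_mae(y_true, y_pred):
    # MAE on the log scale; assumes targets and predictions are strictly positive.
    return mean_absolute_error(np.log(y_true), np.log(y_pred))

# Synthetic data with positive targets (roughly price-like values).
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 10.0, size=(200, 3))
y = X @ np.array([2.0, 3.0, 1.0]) + 5.0 + rng.normal(scale=0.1, size=200)

# cv=5 gives one score per fold.
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring=make_scorer(log_mae))
print(scores)
print(scores.mean(), scores.std())
```

Because make_scorer is used with its default greater_is_better=True, the scores are returned as plain errors, so lower is better here. When the scorer is handed to a search tool that maximizes scores, such as GridSearchCV, passing greater_is_better=False is probably what you want.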